This project would investigate how well constructability actually works in practice today.
Here are the projects we would investigate, in order of importance, over the course of one month (if minimally funded) to two months (if fully funded):
Alexplainable: Reconstructing AlexNet to be fully explainable. We would want to recognize classes from ImageNet in a way that is interpretable by design and, if possible, make this process as automatic as possible on ~5 randomly selected classes to demonstrate the approach.
Scaling our first prototype: In our post, we showed the potential of using many small neural networks that compose together to recognize an image class. We would want to automate and scale this process to recognize 5 random classes of ImageNet and see what works at scale.
Making it more transparent-by-design: Except for feature nodes, our current prototype uses layers of 20 to 50 parameters. We would want to decompose all of these layers into plain code using automatic loops, and keep only the feature nodes as deep-learning networks.
Better interpretability: We discovered that neither MaxPool nor AvgPool works for learning a feature that composes well. We want to analyze what is learned in the 4x1 inference layer, and whether we can use that knowledge to better understand the output of the convolutional networks. We also envision trying dynamic training (such as switching to MaxPool midway) to see whether the features learned this way become more understandable.
Constructible GPT: We want to test the ontology approach of Alexplainable, which seems to work, on building a code language model that is decomposed into easily understood parts.
AI Pull-Request loops: Expand on Voyager to use our AI-pull-request approach and see how safe it is and what specifications can be integrated.
Simulations: One of the key approaches in our post consists of using simulations that get refined as the agent interacts with them. We want to explore this methodology, for instance by using Stable Diffusion or Llama models that get fine-tuned through automatic reviews by overseeing LLMs.
Making a post to summarize all of our progress so far: No matter our progress, we are committed to writing a post by July 14th (or two months after the grant if fully funded, whichever happens later) to present our results, what we have learned about constructability, and what we have been able to build with current systems, as well as their limitations.
We expect to iterate a lot and find other promising prototypes we could make as we come across them during this project.
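To make the "transparent-by-design" item above more concrete, here is a minimal sketch of the general shape such a system could take: small feature nodes are the only opaque parts, and everything that combines them is plain, readable code. This is an illustration under stated assumptions, not our prototype. The names `FeatureNode`, `classify`, `mean_intensity` and `top_minus_bottom`, the 0.5 threshold, and the toy image format are all hypothetical, and the hand-written scoring functions merely stand in for small trained networks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Toy grayscale image: rows of pixel intensities in [0, 1] (assumption).
Image = List[List[float]]


@dataclass
class FeatureNode:
    """Hypothetical leaf: the only place a learned, opaque model would sit."""
    name: str
    score: Callable[[Image], float]  # confidence in [0, 1]


def mean_intensity(image: Image) -> float:
    """Stand-in for a small trained detector: average brightness."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def top_minus_bottom(image: Image) -> float:
    """Stand-in detector: how much brighter the top half is (needs >= 2 rows)."""
    half = len(image) // 2
    diff = mean_intensity(image[:half]) - mean_intensity(image[half:])
    return max(0.0, min(1.0, diff))  # clip to [0, 1]


def classify(image: Image, nodes: List[FeatureNode], required: List[str],
             threshold: float = 0.5) -> Dict[str, object]:
    """Plain-code composition: every intermediate value can be inspected."""
    scores = {}
    for node in nodes:
        scores[node.name] = node.score(image)
    # A feature counts as present when its score reaches the threshold.
    present = {name: s >= threshold for name, s in scores.items()}
    # The class is recognized only if all required features are present.
    decision = all(present[name] for name in required)
    return {"scores": scores, "present": present, "decision": decision}


nodes = [
    FeatureNode("bright", mean_intensity),
    FeatureNode("top_heavy", top_minus_bottom),
]
image = [[1.0, 1.0], [0.0, 0.0]]
result = classify(image, nodes, required=["bright", "top_heavy"])
print(result["decision"], result["present"])
```

Because the composition is an ordinary loop and a conjunction, a reviewer can read off exactly which features led to a decision; only the feature nodes themselves would still need separate interpretability work.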
We believe constructability could be a very important field in itself for AI safety and instrumental in pushing toward a stronger safety culture.
This project's goal is to demonstrate that this approach is possible. For instance, having an image recognizer that is fully understandable would legitimize such constructed-from-scratch approaches.
It also includes a thorough report, both on the protocols and ideas we have tried and considered, and on our observations and considerations regarding safety.
$5000 will go into prompt engineering and running models for Alexplainable and other investigations. We have found that open-source models do not work very reliably, and so far only claude-opus seems to let us push the limit of what is possible and build performant systems semi-automatically.
20% of the rest will go to Charbel-Raphaël Segerie for overseeing research and mentoring.
80% will go to Épiphanie Gédéon for working full-time for two months (if fully funded) on the different aforementioned prototypes.
Charbel-Raphaël Segerie: Charbel-Raphaël is head of the Artificial Intelligence Safety Unit at EffiSciences, where he leads research and education in AI safety. He directs the Turing Seminar, a renowned course on AI safety within the Écoles Normales Supérieures. His work focuses on comprehensively characterising emerging risks in artificial intelligence, on the theory of interpretability, on addressing challenges related to current safety methods, and on safe-by-design AI approaches. He has written several LessWrong posts on AI safety.
Épiphanie Gédéon: "I joined EffiSciences this year, and am excited to contribute to research in a way that can be helpful for the world. The main areas where I believe I can have a drastic impact now include AI safety (hopefully constructability), and mental health/helping people be more agentic (especially in terms of meta impact). I have experience freelancing in programming."
The most likely cause of failure is that none of these tracks actually yields results when investigated further. That said, given the results so far, we believe it should be possible to have an at least somewhat automated process that has some level of interpretability-by-design.
None for now