This project would investigate how well constructability actually works right now.
Here are the projects, in order of importance, that we would investigate over the course of one month (if minimally funded) to two months (if fully funded):
Alexplainable: Reconstructing AlexNet to be fully explainable. We want to recognize ImageNet classes in a way that is interpretable by design and, if possible, scale this process to be as automatic as possible on ~5 randomly selected classes to demonstrate the approach.
Scaling our first prototype: In our post, we showed the potential of using many small neural networks that compose together to recognize an image class. We want to automate and scale this process to recognize 5 random ImageNet classes and see what works at scale.
Making it more transparent by design: Except for feature nodes, our current prototype uses layers of 20 to 50 parameters. We want to decompose all of these into plain code using automatic loops, keeping only the feature nodes as deep-learning networks (see the sketch after this list).
Better interpretability: We found that using MaxPool or AvgPool does not yield features that compose well. We want to analyze what is learned in the 4x1 inference layer, and whether that knowledge gives us a better understanding of the outputs of the convolution networks. We also envision dynamic training (such as switching to MaxPool midway through training) to see whether the features learned this way become more understandable.
Constructible GPT: We want to test the ontology approach of Alexplainable, which seems to work, by building a code language model that is decomposed into easily understood parts.
AI Pull-Request loops: Expanding on Voyager with our AI-pull-request approach to see how safe it is and what specifications can be integrated.
Simulations: One of the key approaches in our post consists of using simulations that get refined as the agent interacts with them. We want to explore this methodology, for instance by using stable-diffusion or llama models that get fine-tuned through automatic reviews by overseeing LLMs.
Making a post to summarize all of our progress so far: No matter our progress, we are committed to writing a post by July 14th (or two months after the grant if fully funded, whichever happens later) presenting our results, what we have learned about constructability, and what we have been able to build with current systems, as well as their limitations.
We expect to iterate a lot and find other promising prototypes we could make as we come across them during this project.
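To make the compositional setup above concrete, here is a minimal sketch in PyTorch of what "many small feature nodes combined by a plain-code layer" could look like. The FeatureNode architecture, the detector names, and the daisy rule are illustrative assumptions, not the project's actual code:

```python
# Minimal sketch only: hypothetical feature nodes and class rule, assuming PyTorch.
import torch
import torch.nn as nn

class FeatureNode(nn.Module):
    """A small CNN whose only job is to detect one human-named feature (e.g. 'white petal')."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # one score in [0, 1] per image

# Opaque leaves: each one only has to learn a single nameable feature.
petal_detector = FeatureNode()
disk_detector = FeatureNode()

def daisy_score(image: torch.Tensor) -> torch.Tensor:
    """Composition layer written as plain, reviewable code (no learned weights here)."""
    petals = petal_detector(image)
    disk = disk_detector(image)
    # 'A disk surrounded by lots of petals', expressed as an explicit rule.
    return disk * petals
```

The intent is that everything above the feature nodes can be read and audited like ordinary code, while each feature node only has to learn one small, nameable task.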
We believe constructability could be an important field for AI safety in its own right, and instrumental in pushing toward a stronger safety culture.
This project's goal is to demonstrate the feasibility of this approach. For instance, having an image recognizer that is fully understandable would legitimize such constructed-from-scratch approaches.
It also includes a thorough report on the protocols and ideas we have tried and considered, and on our observations and considerations for safety.
$5000 will go toward prompt engineering and running models for Alexplainable and the other investigations. We have found that open-source models do not work reliably; so far, only claude-opus seems to let us push the limits of what is possible and build performant systems semi-automatically.
20% of the rest will go to Charbel-Raphaël Segerie for overseeing the research and mentoring.
80% will go to Épiphanie Gédéon for working full time for two months (if fully funded) on the aforementioned prototypes.
Charbel-Raphaël Segerie: Charbel-Raphaël is head of the Artificial Intelligence Safety Unit at EffiSciences, where he leads research and education in AI safety. He directs the Turing Seminar, a renowned course on AI safety within the École Normale Supérieure. His work focuses on comprehensively characterising emerging risks in artificial intelligence, the theory of interpretability, challenges related to current safety methods, and safe-by-design AI approaches. He has written several LessWrong posts on AI safety.
Épiphanie Gédéon: "I joined EffiSciences this year, and am excited to contribute to research in a way that can be helpful for the world. The main areas where I believe I can have a drastic impact right now include AI safety (hopefully constructability) and mental health/helping people be more agentic (especially in terms of meta impact). I have experience with freelance programming."
Most likely, none of the tracks would actually yield results when investigated further. Given the results so far, we believe it should be possible to build an at least somewhat automated process with some level of interpretability-by-design.
None for now
Épiphanie Gédéon
7 months ago
Updated the github to put scripts and data: https://github.com/joy-void-joy/alexplainable
Loppukilpailija
7 months ago
I find the overall direction of safer-by-design constructions of AI systems exciting: the ideas of constructability are quite orthogonal to other approaches, and marginal progress there could turn out to be broadly useful.
That said, I do think this direction is littered with skulls, and consider the modal outcome to be failure. I think that especially the fully-plain-coded approaches are not practical for the types of AI we are most worried about, and working on this would very likely be a dead end. I'm more excited about top-down approaches: trying to make models more modular-by-design while essentially retaining performance, in the sense of "we have replaced one big model with N not-so-big models".
The project authors seem to be aware of the skulls, and indeed the proposal has some novel components that may get around some issues. While I think it's still easy to run into dead ends, this is good enough of a reason for me to fund the project.
Overall, I think simply having a better understanding of the constraints involved when trying to make systems safer-by-design is great. I'd quite like there to be people thinking about this, and would be happy about progress on mapping out dead ends and not-so-dead ends.
Épiphanie Gédéon
7 months ago
Thank you very much for the kind words, feedback, and generous contribution! We'd be very excited to work on this, and this greatly helps us move toward our minimal funding goal!
As we want to have concrete results, one of our main focuses will indeed be this modular-by-design approach: trying to break each module down into parts until they become small enough that we can train them and control their outputs to ensure they are working on the correct tasks.
If I understand your concerns correctly, you see more value and practicality in the first steps of this decomposition (taking one big N-level model and decomposing it into n (N-1)-level models) rather than in the last steps (e.g. taking one model trained on "white petal" and recoding it into a white-long-oval-shape detector without a drop in performance)? Based on our small prototype and how strong composability seems to be, I would expect the largest performance hit to occur primarily in the initial decomposition steps, and for decomposition to hold up until the end. However, this is something we will closely monitor and adapt to if it does not seem to be the case.
One failure mode we are cognizant of with one-step decomposition is plasticity: even a one-layer convolution network seems able to adapt to a submodel that has "learned the wrong task" (this happened in our prototype before we corrected for it), so we cannot just use overall performance as a measure of how well each submodule has learned what we intend it to. Which metrics or training methods we could use to track this better and ensure things compose as we want would be one central question of our project.
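As a rough illustration of the kind of per-submodule check we have in mind, here is a sketch that scores each feature node against its own labelled sub-task instead of relying on end-to-end accuracy. The function, the sub-task loaders, and the feature_nodes dictionary are hypothetical placeholders, not an existing metric of ours:

```python
# Sketch only: assumes PyTorch; 'subtask_loader' yields (images, binary labels)
# for the single feature this node is supposed to detect.
import torch

@torch.no_grad()
def submodule_accuracy(node, subtask_loader, threshold=0.5):
    """Score one feature node on its *intended* sub-task, independently of the full pipeline."""
    correct, total = 0, 0
    for images, labels in subtask_loader:
        preds = (node(images).squeeze(1) > threshold).float()
        correct += (preds == labels.float()).sum().item()
        total += labels.numel()
    return correct / total

# Checking every node against its own labelled examples makes it harder for a
# downstream layer to silently compensate for a node that learned the wrong task.
# for name, (node, loader) in feature_nodes.items():
#     print(name, submodule_accuracy(node, loader))
```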
With respect to the types of AI, we are still considering this: we originally wanted to investigate GPT-like systems directly, but we are currently worried about both expanding capabilities and safety concerns.
Once again, thank you for your support and valuable insights. We look forward to sharing our progress and findings!
Loppukilpailija
7 months ago
"you see more value and practicality in the first steps of this decomposition (taking one big N-level model and decomposing it into n (N-1)-level models) rather than the last steps [...]?"
Yes, I'd see a lot of value in being able to do the first steps of decomposition. I'm particularly thinking about concerns stemming from the AI itself being dangerous, as opposed to systemic risks. Here I think that "a system built of n (N-1)-level models" would likely be much safer than "one N-level model" for reasonable values of n. (E.g. I think this would plausibly be much better in terms of hidden cognition, AI control, deceptive alignment, and staying within assigned boundaries.)
"I would expect the largest performance hit to occur primarily in the initial decomposition steps, and for decomposition to hold on until the end."
I would expect this, too. This is a big factor for why I think one should look here: it doesn't really help if one can solve the (relatively easy) problem of constructing plain-coded white-petal-detectors, if one can't decompose the big dangerous systems into smaller systems. But if on the other hand one could get comparable performance from a bunch of small models, or even just one (N-0.1)-level model and a lot of specialized models, then that would be really valuable.
"but are currently worried about both expanding capabilities and safety concerns"
Makes sense. "We are able to get comparable performance by using small models" has the pro of "we can use small models", but the con of "we can get better performance by such-and-such assemblies". I do think this is something one has to seriously think about.
Épiphanie Gédéon
7 months ago
Here I think that "a system built of n (N-1)-level models" would likely be much safer than "one N-level model" for reasonable values of n
That makes sense; in my understanding, this is also the approach of CoEm. What I would worry about is the composition being unreliable and failing spectacularly without a precise review system.
I think this is less likely to happen if the composition layer is very reviewable (for instance, written in plain code, while all other (N-1)-level models stay opaque). To keep using the prototype as an example, if we can thoroughly review that a "disk surrounded by lots of petals" detector is what is actually being used, then overall performance would indeed force each model to be closely aligned to its task. I would be interested to see how true this is, for instance what happens when we train models together this way.
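As a rough sketch of what "training models together through a reviewable composition layer" could look like (again assuming PyTorch; the detectors, the fixed rule, and the training step are illustrative assumptions rather than our prototype's code):

```python
# Sketch only: two hypothetical leaf detectors trained jointly through a fixed,
# plain-code composition rule, supervised only by the class label.
import torch
import torch.nn as nn

def make_leaf():
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(8, 1), nn.Sigmoid(),
    )

petal_detector = make_leaf()
disk_detector = make_leaf()

def compose(images: torch.Tensor) -> torch.Tensor:
    # Fixed, reviewable rule: 'a disk surrounded by lots of petals'.
    return disk_detector(images) * petal_detector(images)

optimizer = torch.optim.Adam(
    list(petal_detector.parameters()) + list(disk_detector.parameters()), lr=1e-3)
loss_fn = nn.BCELoss()

def training_step(images: torch.Tensor, class_labels: torch.Tensor) -> float:
    # Only the class label supervises both leaves, through the fixed rule above;
    # per-leaf probes (as sketched earlier) are then needed to check that each
    # leaf really learned its intended feature rather than compensating for the other.
    optimizer.zero_grad()
    loss = loss_fn(compose(images).squeeze(1), class_labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```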
I think your point really makes sense, and it is one indication that we should focus thoroughly on one step rather than on the whole process at once.
If you want to talk more about this, please feel free to write to me at epiphanie.gedeon@gmail.com or book a call.