Matthew A. Clarke, Hardik Bhatnagar, Joseph Bloom
We’re seeking funding (mainly living expenses) for Matthew Clarke and Hardik Bhatnagar to finish a project Joseph and Matthew have been working on studying the co-occurrence of SAE latents.
Scientific details:
“Not All Language Model Features Are Linear” claims that not all fundamental units of language model computation are uni-dimensional. We are trying to answer related questions including:
1. What fraction of SAE latents might best be understood in groups rather than individually?
2. Do low-dimensional subspaces mapped by co-occurring groups of features ever provide a better unit of analysis (or possibly intervention) than individual SAE latents?
Our initial results suggest more extensive co-occurrence structure in smaller SAEs and more subspaces of interest than previously found by Engels et al., including subspaces that may track:
Uncertainty, such as between several plausible hypotheses about how a word is being used.
Continuous quantities, such as how far through a 10-token URL a token occurs.
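To make the co-occurrence methodology concrete, here is a minimal sketch of how latent co-occurrence can be measured and turned into a network by thresholding a similarity matrix. It is an illustration under stated assumptions, not our actual pipeline: the tensor name `latent_acts`, the firing threshold, the Jaccard-style normalisation, and the edge threshold are all placeholders.

```python
# A minimal sketch of latent co-occurrence measurement, not our exact pipeline.
# `latent_acts` is a hypothetical [n_tokens, n_latents] tensor of SAE latent
# activations (e.g. obtained by encoding residual stream activations with an SAE).
import torch
import networkx as nx

def cooccurrence_graph(latent_acts: torch.Tensor,
                       fire_threshold: float = 0.0,
                       edge_threshold: float = 0.3) -> nx.Graph:
    # A latent "fires" on a token if its activation exceeds the threshold.
    fires = (latent_acts > fire_threshold).float()   # [n_tokens, n_latents]
    counts = fires.T @ fires                          # pairwise co-firing counts
    per_latent = fires.sum(dim=0)                     # how often each latent fires
    # Jaccard-style normalisation: co-fires / (fires_i + fires_j - co-fires).
    union = per_latent[:, None] + per_latent[None, :] - counts
    similarity = counts / union.clamp(min=1)
    # Keep strongly co-occurring pairs as edges; connected components of the
    # resulting graph are candidate groups of latents to study together.
    g = nx.Graph()
    g.add_nodes_from(range(latent_acts.shape[1]))
    upper = torch.triu((similarity > edge_threshold).float(), diagonal=1)
    g.add_edges_from((int(i), int(j)) for i, j in torch.nonzero(upper))
    return g
```

Connected components (or denser clusters) of a graph like this give candidate groups of latents that tend to activate together, which is the unit of analysis the questions above are about.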
Goal 1: Produce a LessWrong post / Academic paper comprehensively studying SAE Latent co-occurrence. We’ll achieve this by:
Finishing our current draft. We’ve got most of a draft written and mainly want to get a few more experimental results / do a good job communicating our results to the community.
Reproducing our existing results on larger models / more SAEs. Our methods include measuring latent co-occurrence and generating co-occurrence networks on Gemma Scope SAEs (so far we’ve studied Joseph’s GPT2 small feature splitting SAEs).
(Stretch Goal): Train probes of various kinds and study causal interventions on feature subspaces to provide more conclusive evidence of the need to reason about some features as groups rather than individually.
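As a rough illustration of the kind of probing comparison the stretch goal describes (names, data, and setup below are hypothetical placeholders, not our implementation), one can compare a probe trained on a whole co-occurring group of latents against probes trained on its individual members:

```python
# Illustrative only: compare a probe on a co-occurring group of latents against
# probes on its individual members. `latent_acts`, `labels`, and `group` are
# hypothetical placeholders, not names from our codebase.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_accuracy(latent_acts: np.ndarray, labels: np.ndarray, latent_idx) -> float:
    # latent_acts: [n_tokens, n_latents] SAE activations; labels: a binary token property.
    X = latent_acts[:, np.atleast_1d(latent_idx)]
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, X, labels, cv=5).mean()

# e.g. does the group beat its best single member?
# probe_accuracy(latent_acts, labels, group) vs.
# max(probe_accuracy(latent_acts, labels, i) for i in group)
```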
Goal 2: Provide further mentorship to Matthew and Hardik. We’ll achieve this by:
Meeting for 1-2 hours a week to discuss the results / the state of the project.
Joseph will review work done by Matthew / Hardik and assist with the write-up.
MVP: $6400 USD
Research Assistant Salaries: $3000 per person for one month, total $6000
Compute Budget: Compute costs for 1 month: code requires A100s, cost of $1.19 per hour on RunPod, estimate of $50 per person per week: total $400
Lean: $9600 USD
Research Assistant Salaries: $3000 per person for 1.5 months, total $9000
Compute Budget: Compute costs for 1.5 months: code requires A100s, cost of $1.19 per hour on RunPod, estimate of $50 per person per week: total $600
Ideal: $18800 USD
Research Assistant Salaries: $3000 per person for 2 months, total $18000
Compute Budget: Compute costs for 2 months: code requires A100s, cost of $1.19 per hour on RunPod, estimate of $50 per person per week: total $800
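For transparency, the compute line items follow from a simple calculation, assuming two research assistants and roughly 4, 6, and 8 weeks for the three tiers:

```python
# Compute budget sanity check: 2 research assistants at ~$50 of A100 time
# per person per week ($1.19/hr on RunPod), over 4 / 6 / 8 weeks.
people, per_person_per_week = 2, 50
for tier, weeks in [("MVP", 4), ("Lean", 6), ("Ideal", 8)]:
    print(tier, people * per_person_per_week * weeks)  # MVP 400, Lean 600, Ideal 800
```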
Joseph Bloom - Mentor / Supervisor
Recent Mechanistic Interpretability Research Mentoring: LASR Scholars recently published “A is for Absorption”, demonstrating that the sparsity objective also encourages undesirable “gerrymandering” of information (Neel’s tweet here; accepted to the Interpretable AI workshop at NeurIPS).
Mechanistic Interpretability Research: MATS alumnus (Neel Nanda’s stream). Work includes publishing exceptionally popular open source SAEs, Understanding SAE Features with the Logit Lens, Linear Representations underpinning spelling in GPT-J, and various publications on Decision Transformer Interpretability. Work mentioned in the Circuits Thread Interpretability Update.
Mechanistic Interpretability Infrastructure: Author of SAELens. Previous Maintainer of TransformerLens. Cofounder of Decode Research (building Neuronpedia). Author of DecisionTransformerInterpretability Library.
Matthew Clarke - Research Assistant / Mentee:
Mechanistic Interpretability Research: 3 months of solo work on this project as part of PIBBSS (interrupted by surgery with a long recovery); see the end-of-fellowship talk on this project here: Examining Co-occurrence of SAE Features - Matthew A. Clarke - PIBBSS Symposium.
10 years of experience in academic research, focusing on modelling the regulatory networks of cancer cells to better understand and treat this disease.
Four first-author publications in leading scientific journals, and experience leading successful collaborations as part of the Jasmin Fisher Lab at UCL.
Now transitioning into mechanistic interpretability research, applying the skills learned from understanding biological regulatory networks to the problems of AI safety, and vice versa.
Website: https://mclarke1991.github.io/
Hardik Bhatnagar - Research Assistant / Mentee:
Research scholar in the LASR Labs program, mentored by Joseph Bloom on the Feature Absorption project. The project paper is on arXiv (submitted to ICLR) and on LessWrong.
Previously worked at Microsoft Research on understanding model jailbreaks and how harmful concepts are represented in large language models as a function of training (pretraining, finetuning, RLHF).
Previously worked in computational neuroscience for a year, studying the mechanisms of visual saliency with human fMRI experiments.
Technical Challenges: We could post what we have today (though the results would be less convincing / claims can’t be as solid), but the marginal work to be done involves working with larger models and possibly some newer methods which may fail to provide additional value if Matthew / Hardik aren’t able to get them working. Joseph assigns < 20% chance of substantial complications in execution / implementation.
Team members are offered other opportunities: Both Hardik and Matthew are applying to MATS and other AI safety related jobs, and may leave the project sooner than optimal. This is probably a good outcome and we will know if this is the case before taking receipt of funding.
Via Decode Research / Neuronpedia, projects associated with Joseph have received substantial amounts of funding (> $500K USD). PIBBSS and LASR programs which Joseph has mentored in have received funding.