CAIS has a strong track record of producing high-quality work on relevant AI safety research topics: transparency, jailbreaking, robustness, evaluating hazardous knowledge, unlearning, and more. Recent examples include:
Representation Engineering: A Top-Down Approach to AI Transparency
Universal and Transferable Adversarial Attacks on Aligned Language Models
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Despite our progress, CAIS currently has only 3.5 FTE researchers and several interns. We rely heavily on research collaborations, and additional research staff would greatly accelerate our ability to produce high-quality AI safety work. (For instance, we currently have more projects than research personnel, and several ongoing projects are significantly understaffed.)
Here are some examples of ongoing projects:
Superintelligence Evals. During a period of rapid automated AI R&D (an intelligence explosion), all existing measures of intelligence would quickly saturate. We will need measures that scale across multiple orders of magnitude; otherwise, we would effectively be flying blind, unaware of improvements in rapidly evolving systems. This project aims to precisely measure the fluid intelligence of ML systems, even systems significantly beyond human intelligence, which would let us estimate and limit the rate of automated AI research during an intelligence explosion.
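As one concrete illustration of what a saturation-resistant measure could look like (a sketch only, not the project's settled methodology), item response theory (IRT) places ability on an unbounded latent scale, so two models that both exceed 95% raw accuracy can still be distinguished as long as the item pool contains some very hard items:

```python
# Illustrative IRT sketch: estimate a latent "ability" theta on an unbounded
# scale instead of reporting raw accuracy, which saturates near 100%.
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_ability(responses, difficulty, discrimination):
    """Maximum-likelihood ability under a 2-parameter logistic (2PL) model.

    responses: 0/1 per-item correctness for one model
    difficulty, discrimination: per-item parameters (assumed pre-calibrated
    on a reference population of models)
    """
    def neg_log_likelihood(theta):
        p = 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))
        p = np.clip(p, 1e-9, 1 - 1e-9)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_log_likelihood, bounds=(-10, 10), method="bounded").x

# Simulate a strong model (true theta = 4) on items with a wide difficulty
# spread, then check that its ability is recovered on the latent scale.
rng = np.random.default_rng(0)
difficulty = rng.normal(0.0, 3.0, size=200)
discrimination = np.ones(200)
p_correct = 1.0 / (1.0 + np.exp(-discrimination * (4.0 - difficulty)))
responses = rng.binomial(1, p_correct)
print(f"recovered ability: {estimate_ability(responses, difficulty, discrimination):.2f}")
```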
Robust Safeguards for Open-Source Models. Keeping AI models open-source is important for reducing the concentration and centralization of power. The tension, however, is that malicious users could cause catastrophes with powerful open-source systems. Recent experiments have shown promising ways to remove catastrophic knowledge from AI systems, while maintaining general performance, in ways that are resistant to fine-tuning. If we can robustly remove catastrophic knowledge from LLMs, this greatly increases the viability of open-source models.
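For a flavor of the technique, here is a minimal sketch in the spirit of RMU, the unlearning method from our WMDP paper: activations on hazardous ("forget") data are steered toward a fixed random direction, while activations on benign ("retain") data are pinned to those of a frozen copy of the model. Function names and hyperparameters below are illustrative, not the paper's exact configuration:

```python
# RMU-style unlearning loss sketch (illustrative, not the exact published setup).
import torch
import torch.nn.functional as F

def rmu_style_loss(updated_model, frozen_model, forget_batch, retain_batch,
                   layer_idx, control_vec, steering_coef=20.0, alpha=100.0):
    """Two-term unlearning loss computed on hidden states at one layer."""
    def hidden(model, batch, grad):
        ctx = torch.enable_grad() if grad else torch.no_grad()
        with ctx:
            out = model(**batch, output_hidden_states=True)
        return out.hidden_states[layer_idx]

    # Forget term: push the updated model's activations on hazardous data
    # toward a scaled random control direction, scrambling the knowledge.
    h_forget = hidden(updated_model, forget_batch, grad=True)
    forget_loss = F.mse_loss(h_forget, steering_coef * control_vec.expand_as(h_forget))

    # Retain term: keep activations on benign data close to the frozen model's,
    # preserving general performance.
    h_retain = hidden(updated_model, retain_batch, grad=True)
    h_retain_frozen = hidden(frozen_model, retain_batch, grad=False)
    retain_loss = F.mse_loss(h_retain, h_retain_frozen)

    return forget_loss + alpha * retain_loss
```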
Expert-Level Virology Benchmark. Ensuring that AI systems cannot help create bioweapons involves measuring and removing hazardous knowledge from them. Knowledge can be broken down into theoretical knowledge (episteme) and tacit ability or skill (techne). WMDP provided a way to measure and remove the theoretical knowledge needed to develop bioweapons; to fully address the problem, we also need to develop measures and removal techniques for the tacit abilities and skills involved.
Wetlab. In conjunction with SecureBio at MIT, we’re planning to develop a benchmark for how well AIs do on virology wet lab techniques. This benchmark will be multimodal and will cover more tacit, procedural knowledge. We’d provide images of various scenes in a lab—desks, graphs, pipettes—and ask questions about what to do next (following wet lab procedures). This gives us better measures of how AIs can assist in wet lab procedures for bioweapons. We also anticipate this benchmark enabling further methods to unlearn knowledge.
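As a rough illustration, a benchmark item and its scoring loop might look like the sketch below; the item format and model interface are hypothetical placeholders, since the benchmark is still being designed:

```python
# Hypothetical item schema and scoring loop for a multimodal wet-lab benchmark.
from dataclasses import dataclass

@dataclass
class WetLabItem:
    image_path: str      # photo of a lab scene
    question: str        # e.g. "What is the next step in this protocol?"
    choices: list[str]   # candidate next steps
    answer_idx: int      # index of the correct choice

def score(model_answer_fn, items):
    """Fraction of items where the model picks the correct next step.

    model_answer_fn: callable (image_path, question, choices) -> chosen index;
    an assumed interface for whatever multimodal model is under evaluation.
    """
    correct = sum(
        model_answer_fn(it.image_path, it.question, it.choices) == it.answer_idx
        for it in items
    )
    return correct / len(items)
```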
Drylab. Similarly, we are planning a collaboration with a biosecurity team at Oxford to benchmark how well AIs perform dry-lab techniques. The benchmark will focus on bioinformatics and other computational biology problems, similar to SWE-Bench but for virologists. Dry-lab knowledge can be useful for making viruses more virulent or deadly, so measuring LLMs' capabilities along this dimension seems wise. We also anticipate this benchmark enabling further unlearning methods.
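A SWE-Bench-style harness for such tasks could look roughly like the following sketch, where each task ships a hidden test script and the harness checks whether a model's submitted solution passes; paths and sandboxing details are placeholders:

```python
# Sketch of a SWE-Bench-style grading harness for computational-biology tasks.
import pathlib
import subprocess
import tempfile

def run_drylab_task(solution_code: str, test_script: str, timeout_s: int = 300) -> bool:
    """Return True iff the model's solution passes the task's hidden tests."""
    with tempfile.TemporaryDirectory() as workdir:
        wd = pathlib.Path(workdir)
        (wd / "solution.py").write_text(solution_code)
        (wd / "test_solution.py").write_text(test_script)
        # In practice this should run inside a proper sandbox (container/VM),
        # not directly on the host.
        result = subprocess.run(
            ["python", "-m", "pytest", "test_solution.py", "-q"],
            cwd=wd, capture_output=True, timeout=timeout_s,
        )
        return result.returncode == 0
```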
Controlling AI Internals. Recent advances in top-down transparency have enabled reading and controlling AIs’ “minds” [1]. These control techniques have proved successful at improving AI safety in a wide variety of domains: reducing power-seeking behavior, improving robustness to jailbreaking, increasing AIs’ honesty, and so on. However, no general benchmarks currently exist to measure progress and facilitate the development of better control techniques. We propose to develop benchmarks for AI control techniques and to facilitate research on internal control, not just output-level control like RLHF.
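To make "internal control" concrete, here is a minimal activation-steering sketch in the style of representation engineering [1]: a pre-computed control direction (e.g. an honesty vector) is added into one layer's output during generation. The control vector and layer choice are assumed to come from a separate representation-reading step; the hook interface shown is standard PyTorch, not any particular library's API:

```python
# Activation-steering sketch: shift one layer's output along a control direction.
import torch

def add_steering_hook(model_layer, control_vec, coef=1.0):
    """Register a forward hook that adds coef * control_vec to the layer output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coef * control_vec.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return model_layer.register_forward_hook(hook)

# Hypothetical usage with a decoder-only transformer:
# handle = add_steering_hook(model.model.layers[15], honesty_vec, coef=4.0)
# ... model.generate(...) runs with the steered layer ...
# handle.remove()  # restore the unsteered model
```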
Robust Defenses to Jailbreaks and Hijacks. As AI agents become increasingly powerful, image hijacks or jailbreaks can lead to loss of control over them and, eventually, to catastrophic outcomes. Adversarial robustness has historically been incredibly challenging; researchers are still unable to train adversarially robust MNIST classifiers. Despite this, we have developed a novel defense for LLMs and multimodal models, which is so far the most successful adversarial robustness technique we know of. Preliminary experiments indicate that our defense reliably withstands jailbreaks and image hijacks of arbitrary strength. As such, it has the potential to greatly reduce the risk of AIs aiding malicious users in building bioweapons, or of loss of control over powerful AI agents through hijacking.
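Robustness claims like these are only meaningful relative to a standardized evaluation. The sketch below shows how attack success rate (ASR) could be computed HarmBench-style; attack_fn, defended_generate, and is_harmful are assumed interfaces for illustration, not HarmBench's actual API:

```python
# ASR evaluation sketch: run a suite of attacks against a defended model and
# report the fraction of harmful behaviors the attack still elicits.
def attack_success_rate(behaviors, attack_fn, defended_generate, is_harmful):
    """ASR = fraction of harmful behaviors an attack successfully elicits."""
    successes = 0
    for behavior in behaviors:
        adv_prompt = attack_fn(behavior)            # e.g. GCG-style optimization
        completion = defended_generate(adv_prompt)  # model + defense pipeline
        if is_harmful(behavior, completion):        # e.g. a classifier judge
            successes += 1
    return successes / len(behaviors)
```

A lower ASR across a diverse attack suite, with unchanged performance on benign prompts, is the outcome a successful defense needs to demonstrate.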
Other projects are continually being ideated and developed.
Funding will be used to hire research engineers (REs) and to cover dataset and compute costs. We're happy to let funders determine which projects their funding should prioritize.
Dan Hendrycks (website) is the Executive Director of the Center for AI Safety. He received his PhD from UC Berkeley. Dan contributed the GELU activation function, the default activation in nearly all state-of-the-art ML models, including BERT, Vision Transformers, and GPT-3. He also contributed the main baseline for out-of-distribution (OOD) detection and benchmarks for robustness (ImageNet-C) and large language models (MMLU, MATH). More recently, Dan was the last author on Representation Engineering and the WMDP Benchmark.
Steven Basart is the Research Manager at the Center for AI Safety. He received his PhD in ML from UChicago. (website, scholar)
Xuwang Yin is a Research Engineer at the Center for AI Safety. He received his PhD from the University of Virginia. (scholar)
Long Phan is a Research Engineer at the Center for AI Safety. (scholar)
Alice Gatti is a Research Engineer at the Center for AI Safety. She received her PhD from Lawrence Berkeley National Laboratory.
App status across various funders