Funding Details
Updated 05/18/26
- Annual Budget: $9,050,000
- Current Runway: -
- Funding Goal: -
- Funding Raised to Date: -
Org Details
The Alignment Research Center (ARC) is a nonprofit research organization whose mission is to align future machine learning systems with human interests. Founded in April 2021 by Paul Christiano, a former OpenAI researcher and one of the principal architects of Reinforcement Learning from Human Feedback (RLHF), ARC is headquartered in Berkeley, California, where it operates out of the Constellation co-working space alongside other AI safety organizations.
ARC focuses on theoretical research on scalable alignment: how to ensure AI systems remain intent-aligned even as they become more capable. The organization uses a builder-breaker methodology, in which each proposed alignment strategy is tested against worst-case assumptions rather than extrapolated from the behavior of current systems. This allows alignment strategies to be evaluated quickly in theory, without requiring full implementation.
ARC's core research directions include heuristic arguments (machine-checkable reasoning about neural network behavior that does not require the certainty of formal proofs), Eliciting Latent Knowledge (ELK: training systems to report their genuine internal beliefs rather than predictions of what a human would believe), mechanistic anomaly detection, and low probability estimation for rare catastrophic failures. Its current work can be understood as an effort to combine mechanistic interpretability with formal verification, producing mathematical frameworks that explain neural network behavior automatically.
In 2022, Beth Barnes joined ARC from OpenAI to start ARC Evals, a team focused on evaluating the capabilities and alignment of advanced AI models. ARC Evals notably evaluated GPT-4 for OpenAI, assessing the model's capacity for power-seeking behavior. In late 2023, ARC Evals spun out as METR, an independent 501(c)(3) nonprofit.
Paul Christiano led ARC from its founding until 2024, when he was appointed Head of AI Safety at the U.S. AI Safety Institute within NIST. Leadership then passed to Jacob Hilton, who serves as President. The board of directors includes Jacob Hilton, Buck Shlegeris, and Ben Hoskin. The team consists of approximately nine full-time staff (six researchers and three operations personnel), supplemented by external research collaborators and visiting researchers.
In 2025, ARC reported making conceptual and theoretical progress at the fastest pace seen since 2022, publishing work on topics including heuristic explanations, surprise accounting, the presumption of independence, and its first empirical paper, on estimating probabilities of rare language model outputs. ARC holds a 4-star rating from Charity Navigator with a 90% overall score.
Theory of Change
ARC believes that as ML systems become more capable, current alignment approaches may fail to scale, potentially leading to systems that pursue goals misaligned with human interests. Its theory of change centers on developing rigorous theoretical foundations for alignment before superintelligent systems arrive. ARC combines ideas from mechanistic interpretability and formal verification into heuristic arguments that give formal mechanistic explanations of neural network behavior. With these explanations, ARC aims to enable reliable detection of when AI systems might behave in dangerous or deceptive ways. This theoretical groundwork is intended to inform practical alignment techniques that AI labs building frontier models can apply, so that powerful AI systems remain genuinely helpful and honest rather than merely appearing aligned.
Grants Received
Discussion
Key risk: ARC's theory-heavy program may not translate into actionable, lab-deployable techniques fast enough for frontier timelines, particularly after the ARC Evals spinout and the leadership transition from Paul Christiano to Jacob Hilton. This makes the counterfactual impact of additional funding uncertain.
Case for funding: ARC is one of the few teams pursuing machine-checkable heuristic arguments that fuse mechanistic interpretability with formal verification, a plausible path to scalable alignment guarantees and deception detection. It has a strong track record (ELK, builder-breaker stress tests, GPT-4 power-seeking evaluations via ARC Evals, rare-failure probability estimation) that has influenced labs and the field.