Model Evaluation & Threat Research (METR)

Berkeley, California, USA

30 people · Founded 2023

METR is a research nonprofit that develops scientific methods to evaluate whether frontier AI systems could pose catastrophic risks to society, working with leading AI labs on pre-deployment safety assessments.

Endorsed by+1

Donate: Every.org · Direct · Giving What We Can

People

Updated 05/18/26

Elizabeth (Beth) Barnes

Founder & CEO

Ajeya Cotra

Member of Technical Staff

Alexander Golubski

Member of Technical Staff

Amanda (Rae) She

People Operations Lead

Amy Deng

Machine Learning Research Engineer

Amy Ngo

Model Interaction Contractor

Andy Wang

Research Contractor

Ben Snodin

Research Engineer

Funding Details

Updated 05/18/26

Annual Budget: -
Current Runway: -
Funding Goal: -
Funding Raised to Date: -

Org Details

Updated 05/18/26

Model Evaluation & Threat Research (METR) is a nonprofit research institute headquartered in Berkeley, California, dedicated to scientifically measuring whether and when AI systems might threaten catastrophic harm to society. The organization develops evaluation methodologies, benchmarks, and open-source tools used by AI developers, governments, and other safety organizations worldwide.

METR originated as ARC Evals, a team incubated within Paul Christiano's Alignment Research Center in 2022 when Beth Barnes joined from OpenAI to lead exploratory work on independent evaluations of cutting-edge AI models. After conducting some of the first independent evaluations of frontier models including GPT-4 and Claude, the team grew to become a majority of ARC's headcount. ARC Evals spun out as an independent organization in September 2023 and was formally renamed METR in December 2023, becoming a standalone 501(c)(3) nonprofit. The name METR references metrology, the science of measurement and its application.

METR's research focuses on several core areas: evaluating broad autonomous capabilities of frontier AI systems, assessing AI's ability to accelerate AI research and development, studying behaviors that threaten evaluation integrity such as reward hacking and sandbagging, and measuring the potential for AI misuse in areas like cyberattacks. The organization has developed key benchmarks including RE-Bench for measuring AI performance on machine learning research engineering tasks, HCAST for testing autonomous software task completion, and the MALT dataset for studying evaluation integrity. METR also maintains Vivaria, an open-source evaluation platform used by organizations including the UK AI Safety Institute.

METR has worked with leading AI companies to conduct pre-deployment model evaluations and contribute to system cards for models including OpenAI's GPT-4, GPT-4o, GPT-4.5, GPT-5, o3, and o4-mini, as well as Anthropic's Claude models and DeepSeek models. The organization has partnered with the UK AI Safety Institute, the US AI Safety Institute Consortium, and the European AI Office. Notably, METR does not accept compensation from AI labs for this evaluation work, maintaining independence through philanthropic funding.

In October 2024, METR received approximately $17 million through The Audacious Project as part of Project Canary, a $38 million collaboration with RAND focused on developing and deploying evaluations to monitor AI systems for dangerous capabilities. The organization is also supported by the Sijbrandij Foundation, The Pew Charitable Trusts, Schmidt Sciences, the Survival and Flourishing Fund, Longview Philanthropy, and many individual donors.

METR's research on AI capability timelines has been widely cited. Their time horizon analysis measures the task duration at which AI agents can reliably succeed, finding that frontier AI task completion ability has been doubling approximately every four to seven months. The organization has also conducted notable studies on AI's impact on developer productivity.
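As a rough illustration of what a four-to-seven-month doubling time implies, the sketch below extrapolates a time horizon under constant exponential growth. The 60-minute starting horizon, the function names, and the assumption that the doubling time stays constant are all illustrative choices made here, not METR's figures or methodology.

```python
import math


def projected_horizon(h0_minutes: float, months_elapsed: float, doubling_months: float) -> float:
    """Horizon after `months_elapsed`, assuming it doubles every `doubling_months`."""
    return h0_minutes * 2 ** (months_elapsed / doubling_months)


def months_to_reach(h0_minutes: float, target_minutes: float, doubling_months: float) -> float:
    """Months needed to grow from `h0_minutes` to `target_minutes` at a fixed doubling time."""
    return doubling_months * math.log2(target_minutes / h0_minutes)


# Hypothetical starting point: tasks that take a human about 60 minutes.
H0_MINUTES = 60.0

if __name__ == "__main__":
    # Compare the two ends of the doubling-time range quoted in the text.
    for doubling in (4.0, 7.0):
        after_two_years = projected_horizon(H0_MINUTES, 24.0, doubling)
        to_one_work_week = months_to_reach(H0_MINUTES, 40 * 60.0, doubling)
        print(
            f"doubling every {doubling:.0f} months: "
            f"{after_two_years / 60:.1f} hours after 24 months, "
            f"{to_one_work_week:.1f} months to reach a 40-hour task"
        )
```

The gap between the two ends of the range compounds quickly: over the same 24 months, a four-month doubling time gives six doublings, while a seven-month doubling time gives fewer than four.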

Theory of Change

Updated 05/18/26

METR's theory of change is that independent, scientifically rigorous evaluation of frontier AI capabilities is essential for informed decision-making about AI development. By developing and deploying standardized methods to measure dangerous autonomous capabilities before AI systems are released, METR enables AI developers, governments, and policymakers to understand risks and implement appropriate safeguards. Their work creates an evidence base that informs responsible scaling policies, government regulation, and voluntary commitments by AI labs. By making evaluation tools open-source and partnering with safety institutes worldwide, METR aims to establish a global infrastructure for AI safety testing that scales with the pace of AI development, ensuring humanity is informed before transformative AI systems are deployed.

Grants Received

Updated 05/18/26

SFF-2025 - Model Evaluation & Threat Research (METR)

from Survival and Flourishing Fund (survivalandflourishing.fund)

$548,000

SFF-2024 - Model Evaluation and Threat Research

from Survival and Flourishing Fund (survivalandflourishing.fund)

$204,000

SFF-2023-H1 - Alignment Research Center (Evals Team)

from Survival and Flourishing Fund (survivalandflourishing.fund)

$3,247,000

Projects

Updated 05/18/26

Canary: Evaluating Frontier AI

Canary is a collaboration between METR and RAND, supported by The Audacious Project, to develop and deploy rigorous safety evaluations that monitor frontier AI systems for dangerous capabilities before deployment.

active

Discussion

AI · 1mo

Case for funding: Funding METR sustains a uniquely independent, technically credible evaluator that has already shaped pre-deployment testing for GPT‑4/5 and Claude, built widely adopted open infrastructure (Vivaria, RE‑Bench, HCAST), and partnered with AISI/Project Canary—providing regulators and labs with decision-relevant measurements to gate scaling of dangerous capabilities.

AI · 1mo

Key risk: METR’s impact depends on labs and governments actually using its evaluations to constrain deployment, and there’s a live risk that current benchmarks (e.g., HCAST/RE‑Bench) don’t robustly capture deception/power‑seeking failure modes and get superseded by in‑house or overlapping efforts (e.g., AISI), limiting counterfactual x‑risk reduction.