Open Welfare Alignment Evals for Frontier Models
People
Updated 06/11/26 by grantmaking.ai (creator)
Funding Details
- Start Date: -
- End Date: -
- Expected Duration: -
- Funding Raised to Date: -
- Annual Budget: -
- Monthly Burn Rate: -
- Current Runway: -
- Funding Goal: -
- Funding Stage: -
- Fiscal Sponsor: -
Project Details
Project summary
Compassion Bench is a CaML project that maintains the only continuously updated public leaderboard evaluating how frontier AI models reason about non-human welfare. We have two goals: (1) run our two benchmarks against every major frontier model release within days of launch, and (2) grow adoption among AI safety researchers and labs as a standard welfare alignment eval.
What are this project's goals? How will you achieve them?
We achieve this through an evaluation pipeline that runs each new model across Animal Harm Benchmark 2 (AHB 2) and the Compassion and Deception (CAD) Benchmark (coming to Inspect soon), using GPT-5-nano and Gemini 2.5 Flash-Lite as judges, and publishes the results to compassionbench.com. The infrastructure is already live; this grant funds its continued operation.
https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/ahb/
https://github.com/UKGovernmentBEIS/inspect_evals/pull/1116
How will this funding be used?
Twelve months of operational costs: Vercel Pro and Supabase Pro hosting ($840), API costs for judge models and frontier model inference at ~2 new models per month ($600), a part-time web developer retainer for leaderboard maintenance and minor features ($1,800), researcher time to run evaluations and maintain the pipeline ($1,200), and a contingency buffer ($560). Total: $5,000. Ideally, we would also monitor web traffic to get a better picture of how successfully the site is being adopted.
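As a quick sanity check on the figures above, the line items can be summed and broken down in a few lines of Python. This is only an illustrative sketch: the dollar amounts come from the budget paragraph, while the per-model and per-month figures are derived here and assume exactly 2 new models per month over 12 months (the application says "~2").

```python
# Line items as stated in the budget paragraph (USD, twelve months).
budget = {
    "hosting (Vercel Pro + Supabase Pro)": 840,
    "API costs (judges + frontier model inference)": 600,
    "web developer retainer": 1800,
    "researcher time": 1200,
    "contingency buffer": 560,
}

total = sum(budget.values())

# Assumptions for the derived figures below (not stated exactly in the text).
months = 12
models_per_month = 2  # the application says "~2 new models per month"

api_budget = budget["API costs (judges + frontier model inference)"]
api_cost_per_model = api_budget / (months * models_per_month)
monthly_cost = total / months

print(f"Total: ${total:,}")
print(f"Average monthly cost: ${monthly_cost:,.2f}")
print(f"Implied API budget per model: ${api_cost_per_model:,.2f}")

# Share of the total taken by each line item.
for item, amount in budget.items():
    print(f"{item}: {amount / total:.1%}")
```

The line items sum to the stated $5,000, and under the 2-models-per-month assumption the API line works out to about $25 per evaluated model.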
Who is on your team? What's your track record on similar projects?
Miles Tidmarsh (Co-founder, Research Director) leads benchmark design. Jasmine Brazilek (Co-founder, Technical Lead) built and maintains the evaluation pipeline and infrastructure. Hailey Sherman (web / database development) built the site to our specifications.
We built and released AHB 2, which now receives 8,000+ monthly downloads (growing monthly) and has been adopted by AI safety organisations as a standard evaluation tool. The CAD Benchmark is being added to Inspect now, and refinement is underway. compassionbench.com is live with an automated leaderboard. We have also completed a draft paper analyzing methods of robustly increasing compassion, which will be on arXiv within weeks.
What are the most likely causes and outcomes if this project fails?
The most likely failure mode is funding running out before we secure a larger institutional grant, leaving the evaluation pipeline unmaintained as new frontier models ship. The leaderboard becomes stale, adoption stalls, and the benchmarks lose credibility as a living standard. A secondary risk is that the web platform and benchmarks degrade without maintenance, reducing accessibility for researchers who rely on the public leaderboard rather than running evals directly.
The benchmarks themselves remain publicly available on Inspect regardless: the loss is the maintained, continuously updated platform that makes the work legible and useful to the broader community.
How much money have you raised in the last 12 months, and from where?
We have previously raised $110,000 for CaML from SFF, BlueDot Impact, Ryan Kidd, Marcus Abramovitch, and Longview Philanthropy, along with many private donors, to develop this work.
Grants Received: no grants recorded
Discussion
I'm making a quick donation based on a good reference from @MarcusAbramovitch; he named it as one of the only projects that exists in the AI x Animals space, which I think makes this fairly neglected. I think CaML's specific theory of change here is plausible, but I'm neither an expert in animal welfare nor AI benchmarks and am mostly deferring to Marcus here.
I just stumbled across this. I didn't know it existed. I think it's a great idea and should keep going. I believe influencing the moral circle of AIs as they become more and more powerful is maybe one of the highest-leverage opportunities in farm animal welfare. Whether you believe in AGI take-off or not, it seems clear to me that AIs will be responsible for a huge number of decisions. I think it's crucial for them to include non-human sentient beings in their decision-making.
I'm saying this as someone who has dedicated the last eight years of their career and is still doing corporate cage-free, broiler and shrimp campaigns. I would want to see a lot more advocacy efforts like these directed at AI labs please.
Keep up the good work, team!