Avery Griffin
Bio
Avery Griffin is an AI safety researcher who has worked at Anthropic and previously served as a Research Manager at MATS (ML Alignment & Theory Scholars) from 2024 to 2025. Griffin became interested in AI safety through AlphaGo and AlphaZero and, after reading Nick Bostrom's Superintelligence, applied to AI Safety Camp in late 2021, receiving a grant that enabled a full-time focus on research. As a MATS alumnus turned research manager, Griffin supported cohorts of AI safety scholars through the program's training pipeline. Griffin has co-authored several papers in AI safety and interpretability, including work on the Local Interaction Basis for mechanistic interpretability (with Apollo Research), modifying LLM beliefs via synthetic document fine-tuning (Anthropic/MATS, April 2025), and eliciting harmful capabilities by fine-tuning on safeguarded outputs (Anthropic, January 2026). Earlier contributions include a project on Selection Theorems for Modularity from AI Safety Camp 2022, which explored what factors select for modularity in learned systems.
Links
- Personal Website
- Twitter / X
- LessWrong
Grants
No grants recorded.
Discussion
No comments yet.
Details
- Last Updated
- Mar 22, 2026, 2:29 PM UTC
- Created
- Mar 20, 2026, 3:00 AM UTC