Thomas Kwa
Bio
Thomas Kwa is a researcher on the technical staff at METR (Model Evaluation & Threat Research), where he focuses on measuring AI capabilities and autonomous task completion. He holds a Computer Science degree from Caltech, previously worked at MIRI, and conducted interpretability research through the MATS program (formerly SERI MATS) with Adrià Garriga-Alonso and Jason Gross. He is the lead author of the influential METR paper introducing the '50% time horizon' metric (the length of tasks, measured by how long they take humans, that AI models can complete autonomously with 50% probability), which found that this metric has been doubling roughly every seven months. He also co-authored 'Catastrophic Goodhart,' a 2024 paper demonstrating that KL divergence regularization in RLHF fails to prevent reward hacking under heavy-tailed reward misspecification, presented at ICML and NeurIPS 2024. Thomas is an active contributor to LessWrong and the AI Alignment Forum, where he publishes research on interpretability, Goodhart's Law, and AI safety methodology.
Links
- Personal Website
- Twitter / X
- LessWrong
- thomas-kwa
Grants
from Long-Term Future Fund
from Long-Term Future Fund
from Long-Term Future Fund
Details
- Last Updated: Mar 23, 2026, 1:41 AM UTC
- Created: Mar 20, 2026, 2:59 AM UTC