Oscar Balcells Obeso
Bio
Oscar Balcells Obeso is a mathematics student at ETH Zurich and an AI safety researcher specializing in mechanistic interpretability. Originally from Spain, he previously studied Aerospace Engineering and Mathematics at the Universitat Politècnica de Catalunya (BarcelonaTech) in Barcelona.

He participated in the Winter 2023-24 cohort of the ML Alignment & Theory Scholars (MATS) program under the mentorship of Neel Nanda, supported by a Long-Term Future Fund stipend to research the mechanisms of refusal in chat LLMs. This work led to the NeurIPS 2024 paper "Refusal in Language Models Is Mediated by a Single Direction" (co-authored with Andy Arditi, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda), which demonstrated that refusal behavior in open-source chat models is controlled by a single linear direction in the residual stream. He also co-authored the ICLR 2025 oral paper "Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models" with Javier Ferrando, Senthooran Rajamanoharan, and Neel Nanda.

He has additional experience at Anthropic and represented Spain at the International Olympiad in Informatics in 2020, earning a bronze medal.
Links
- Personal Website: https://oscarbalcells.com/
- Twitter / X
- LessWrong
Grants
- Long-Term Future Fund
Details
- Last Updated: Mar 23, 2026, 12:11 AM UTC
- Created: Mar 20, 2026, 2:56 AM UTC