Oscar Balcells Obeso
Bio
Oscar Balcells Obeso is a mathematics student at ETH Zurich and an AI safety researcher specializing in mechanistic interpretability. Originally from Spain, he previously studied Aerospace Engineering and Mathematics at the Universitat Politècnica de Catalunya (BarcelonaTech) in Barcelona.

He participated in the Winter 2023-24 cohort of the ML Alignment & Theory Scholars (MATS) program under the mentorship of Neel Nanda, supported by a Long-Term Future Fund stipend to research the mechanisms of refusal in chat LLMs. This work led to the NeurIPS 2024 paper "Refusal in Language Models Is Mediated by a Single Direction" (co-authored with Andy Arditi, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda), which demonstrated that refusal behavior in open-source chat models is controlled by a single linear direction in the residual stream. He also co-authored the ICLR 2025 oral paper "Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models" with Javier Ferrando, Senthooran Rajamanoharan, and Neel Nanda.

He has additional experience at Anthropic and represented Spain at the International Olympiad in Informatics in 2020, earning a bronze medal.
Links
- Personal Website: https://oscarbalcells.com/
- Twitter / X
- LessWrong
Grants
- Long-Term Future Fund
Details
- Last Updated: Mar 23, 2026, 12:11 AM UTC
- Created: Mar 20, 2026, 2:56 AM UTC