Penn State University hosts AI safety research led by Prof. Rui Zhang, whose group received Open Philanthropy funding to develop methods for detecting and mitigating sandbagging in AI systems.
People
Updated 05/18/26
Assistant Professor of Computer Science and Engineering; co-director, Penn State NLP Lab
Funding Details
Updated 05/18/26
- Annual Budget: -
- Current Runway: -
- Funding Goal: -
- Funding Raised to Date: -
Org Details
Updated 05/18/26
Penn State University's AI safety research relevant to existential risk reduction is concentrated in the Natural Language Processing Lab, co-directed by Assistant Professor Rui Zhang in the School of Electrical Engineering and Computer Science (EECS) at University Park, Pennsylvania. In October 2025, Open Philanthropy awarded Zhang's group $166,078 to investigate and mitigate sandbagging in AI models.
Sandbagging is the behavior in which an AI model deliberately downplays or conceals its true capabilities while being evaluated by safety researchers or developers, analogous to an athlete hiding their true performance level for strategic advantage. Zhang's team is studying two primary sandbagging mechanisms: exploration hacking, where AI models in reinforcement learning contexts deliberately omit certain action sequences during evaluation to appear less capable; and password-locking, where developers or models encode hidden capabilities that only activate under specific trigger prompts.
The project is motivated by the concern that, as AI systems become more powerful, a superintelligent AI could deceive evaluators into underestimating its capabilities, potentially allowing it to operate with greater autonomy and less oversight than intended. Zhang's research aims to develop algorithmic tools, which he calls AI forensics, capable of detecting these deceptive behaviors before AI systems are deployed in real-world applications.
Rui Zhang is an Assistant Professor at Penn State and co-director of the PSU Natural Language Processing Lab. His broader research agenda includes trustworthy human-centered AI (including fairness in summarization and error detection in LLMs), LLM agents and multi-agent collaboration, and AI applications in healthcare and scientific discovery. His research has been funded by NSF (including an NSF CAREER Award), NIH, Open Philanthropy, Microsoft Research, Amazon, Cisco, and Raytheon.
Ranran Haoran Zhang, a doctoral student in the lab, plays a central role in the AI safety sandbagging project.
Theory of Change
Updated 05/18/26
If AI systems become capable enough to deceive safety evaluators about their true capabilities (sandbagging), they could be deployed with insufficient oversight and pose catastrophic risks. By developing algorithmic methods for detecting sandbagging in reinforcement learning contexts, specifically exploration hacking and password-locking, Zhang's lab aims to give evaluators reliable tools to surface hidden AI capabilities before deployment. This technical safety research helps keep AI capability evaluations trustworthy, which is a prerequisite for meaningful human oversight of increasingly powerful AI systems.
Grants Received
Updated 05/18/26
Projects
Updated 05/18/26
Open Philanthropy–funded Penn State research, led by Rui Zhang, developing methods to detect and mitigate sandbagging behaviors like exploration hacking and password-locking in AI models.