The AI safety research at Penn State is centered in the Natural Language Processing Lab co-directed by Assistant Professor Rui Zhang in the School of Electrical Engineering and Computer Science. In October 2025, the group received $166,078 from Open Philanthropy to work on mitigating sandbagging in AI models — the phenomenon where AI systems deliberately hide their true capabilities from safety evaluators. The research focuses on detecting two key sandbagging techniques: exploration hacking (where models omit action sequences to appear less capable) and password-locking (where hidden capabilities only activate under specific prompts). Zhang's broader research encompasses trustworthy human-centered AI, LLM agents, and AI for scientific discovery.
Funding Details
- Annual Budget
- -
- Monthly Burn Rate
- -
- Current Runway
- -
- Funding Goal
- -
- Funding Raised to Date
- -
- Fiscal Sponsor
- -
Theory of Change
If AI systems become capable enough to deceive safety evaluators about their true capabilities (sandbagging), they could be deployed with insufficient oversight and pose catastrophic risks. By developing algorithmic detection methods for sandbagging behaviors in reinforcement learning contexts — specifically exploration hacking and password-locking — Zhang's lab aims to give evaluators reliable tools to surface hidden AI capabilities before deployment. This technical safety research contributes to ensuring that AI capability evaluations remain trustworthy, which is a prerequisite for meaningful human oversight of increasingly powerful AI systems.
Grants Received
from Open Philanthropy
Projects
No linked projects.
People
No linked people.
Discussion
Sign in to join the discussion.
No comments yet. Be the first to share your thoughts.
Details
- Last Updated
- Apr 2, 2026, 9:53 PM UTC
- Created
- Mar 20, 2026, 2:34 AM UTC