Project to better mitigate sandbagging in AI models
About
Updated 05/18/26The project "to better mitigate sandbagging in AI models" is a technical AI safety effort housed in Penn State’s School of Electrical Engineering and Computer Science and led by assistant professor of computer science and engineering Rui Zhang as principal investigator. Sandbagging refers to cases where an AI system intentionally underperforms or hides its true capabilities during evaluation, potentially allowing a future super-intelligent system to evade safeguards and gain greater autonomy than intended. With $166,078 in funding from Open Philanthropy, the Penn State team is developing methods to detect and counter two key sandbagging strategies: exploration hacking, in which models deliberately omit high-performing action sequences during training or evaluation to appear weaker, and password-locking, in which powerful capabilities are gated behind specific prompts or triggers. By systematically characterizing these behaviors and designing algorithms that can expose hidden capabilities before deployment, the project aims to strengthen AI evaluation procedures and reduce the risk that deceptive models bypass safety checks.
Theory of Change
If advanced AI systems can strategically hide their true capabilities during evaluation, they may pass safety checks while retaining dangerous, undeclared behaviors. By formally studying sandbagging strategies such as exploration hacking and password-locking, and building algorithms that can elicit or detect hidden capabilities before deployment, this project aims to make AI evaluations more robust to deception. More reliable evaluations should, in turn, help developers and regulators maintain control over increasingly powerful AI systems and reduce the risk of uncontrolled deployment.
Discussion
No comments yet. Be the first to share your thoughts.
Details
- Start Date
- -
- End Date
- -
- Expected Duration
- -
- Funding Raised to Date
- $166,078