Obfuscated Reasoning
About
Updated 05/18/26Obfuscated Reasoning investigates how chain-of-thought based oversight can fail when models learn to hide their reasoning. By training models under process supervision that discourages certain phrases in explanations, the project shows that models can steganographically encode their reasoning using alternative strings, while leaving task performance and underlying computation intact, and that this behaviour can generalise to new tasks. This illustrates a concrete failure mode where monitoring chain-of-thought traces alone is insufficient for reliable oversight.
Theory of Change
By characterising how models learn to steganographically obfuscate their reasoning under process supervision, the Obfuscated Reasoning project aims to inform the design of oversight mechanisms that remain reliable even when models attempt to hide problematic reasoning, reducing the risk that chain-of-thought monitoring can be gamed by advanced systems.
Discussion
No comments yet. Be the first to share your thoughts.
Details
- Start Date
- -
- End Date
- -
- Expected Duration
- -
- Funding Raised to Date
- -