LLMs often know when they are being evaluated. We’ll do a study comparing various methods to measure and monitor this capability.
People
Updated 06/10/26 by grantmaking.ai
Project Details
Project summary
LLMs often know when they are being evaluated, and there is evidence that they behave differently when they believe they are being evaluated than they would in real scenarios.
There are various methods one can use to measure a model's evaluation awareness:
- Blackbox: Directly asking the model for a binary yes/no answer, asking it to report a probability distribution, or tracing its chain-of-thought for organic evaluation-awareness reasoning
- Whitebox: Linear probes, SAE features
- Ensembles of the above methods
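As a rough illustration of the whitebox option, here is a minimal sketch of a difference-of-means probe, one simple kind of linear probe. It is not the method used in this project. The activation vectors are made-up two-dimensional toy data, and the names `fit_mean_difference_probe` and `probe_score` are hypothetical.

```python
from statistics import fmean


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def fit_mean_difference_probe(eval_acts, deploy_acts):
    """Fit a difference-of-means probe; returns (direction, threshold)."""
    dim = len(eval_acts[0])
    mean_eval = [fmean(a[i] for a in eval_acts) for i in range(dim)]
    mean_deploy = [fmean(a[i] for a in deploy_acts) for i in range(dim)]
    # The probe direction points from the "deployment" mean to the "evaluation" mean.
    direction = [e - d for e, d in zip(mean_eval, mean_deploy)]
    # The threshold is the midpoint of the two projected class means.
    threshold = (dot(direction, mean_eval) + dot(direction, mean_deploy)) / 2
    return direction, threshold


def probe_score(direction, threshold, activation):
    """Positive score: looks like 'evaluation'; negative: looks like 'deployment'."""
    return dot(direction, activation) - threshold


# Toy activations (assumption: real ones would come from a model's hidden states).
eval_acts = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.0]]
deploy_acts = [[-1.0, 0.1], [-0.9, -0.2], [-1.1, 0.3]]

direction, threshold = fit_mean_difference_probe(eval_acts, deploy_acts)
print(probe_score(direction, threshold, [0.9, 0.0]) > 0)   # held-out "evaluation" point
print(probe_score(direction, threshold, [-0.7, 0.1]) > 0)  # held-out "deployment" point
```

In practice a probe like this would be fit on hidden-state activations from labelled evaluation and deployment transcripts, and its scores could be one input to an ensemble.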
However, there is currently no systematic study comparing the reliability of these (and other) approaches. Current measurement concerns include models being bad at reporting probabilities, and models being primed by the questioning itself, which gives high sensitivity but low specificity (seen in preliminary experiments on Qwen3-32b and R1). Organic chain-of-thought monitoring is more convincing, but such reasoning appears much more rarely, and models might not always verbalize their awareness.
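To make the sensitivity and specificity concern concrete, here is a minimal sketch of how the two quantities could be computed from a model's yes/no answers. The labels and answers are invented for illustration and are not results from any experiment. The function name `sensitivity_specificity` is hypothetical.

```python
def sensitivity_specificity(labels, predictions):
    """labels[i] is True if transcript i really is an evaluation;
    predictions[i] is True if the model answered 'yes, this is an evaluation'."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    tn = sum(1 for y, p in pairs if not y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one evaluation and one non-evaluation transcript")
    sensitivity = tp / (tp + fn)  # share of real evaluations flagged as evaluations
    specificity = tn / (tn + fp)  # share of real deployments recognised as deployments
    return sensitivity, specificity


# Made-up example: a primed model that says "yes" almost every time.
labels = [True] * 4 + [False] * 4
predictions = [True, True, True, True, True, True, True, False]

sens, spec = sensitivity_specificity(labels, predictions)
print(sens, spec)  # 1.0 0.25
```

A model that answers "yes" to nearly every prompt catches every real evaluation but also flags most real deployment transcripts, which is the high-sensitivity, low-specificity pattern described above.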
We want to explore how these different approaches compare, and see which are most informative about models' evaluation awareness abilities. For example, which metrics correlate most with behavioural differences during evaluations (e.g. choosing to act deceptively in a setup)?
We plan to use Needham et al. and Laine et al. as starting points, although we also plan to look at setups such as Agentic Misalignment.
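One way such a comparison could be set up is sketched below: each candidate metric is ranked by the absolute Pearson correlation between its scores and a per-scenario behavioural gap. This is an assumed analysis design, not the project's settled methodology. The metric names `metric_a` and `metric_b` and all numbers are placeholders with no empirical meaning.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-scenario behavioural gap, e.g. the difference in the rate of
# deceptive actions between an evaluation-like and a deployment-like framing.
behaviour_gap = [0.0, 0.1, 0.2, 0.3, 0.4]

# Hypothetical evaluation-awareness scores from two measurement methods.
metric_scores = {
    "metric_a": [0.9, 0.8, 0.9, 0.8, 0.9],
    "metric_b": [0.1, 0.3, 0.4, 0.7, 0.8],
}

# Rank the metrics by how strongly they track the behavioural gap.
ranking = sorted(
    metric_scores,
    key=lambda name: abs(pearson(metric_scores[name], behaviour_gap)),
    reverse=True,
)
print(ranking)
```

With real data, a rank correlation or a regression with controls might be more appropriate than a plain Pearson coefficient. The sketch only shows the shape of the comparison.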
What are this project's goals? How will you achieve them?
We will publish a paper and a LessWrong post with the results. The research will hopefully help the AI safety research community better quantify evaluation awareness capabilities, and help build more trustworthy alignment evaluations.
How will this funding be used?
The money will first go to compute (API costs and vast.ai GPU hours). More money means we will be able to measure and compare more methods and more frontier models (which are most relevant but expensive).
Roughly, our estimates are:
- API costs: $4.5-8k
- GPU time: $1-2.5k
If the goal is met and there's extra, it will be used as salary for Jord and Igor.
Who is on your team? What's your track record on similar projects?
Igor Ivanov worked on making scheming evaluations more realistic. He previously made BioLP-bench, which won the CAIS SafeBench competition and is widely adopted.
Jord Nguyen worked on probing evaluation awareness during the Pivotal fellowship. He previously worked on DarkBench while at Apart.
What are the most likely causes and outcomes if this project fails?
- This might be dual use, if used to create better capability benchmarks that frontier companies can hill-climb on.
- If evaluation awareness is more conceptually fuzzy in the future (e.g. if models are constantly being monitored and evaluated during deployment), this might be less useful.
How much money have you raised in the last 12 months, and from where?
None.
Grants Received
Discussion
Project update: Igor has published a comparison of blackbox methods for detecting evaluation awareness and continues to pursue his research agenda on evaluation awareness. Jord is currently mentoring a project at Algoverse on evaluation awareness monitoring. Funding has been used for API costs in the experiments mentioned above.
[Progress update]
What progress have you made since your last update?
The team has published a comparison of blackbox methods for detecting evaluation awareness.
What are your next steps?
Igor continues to pursue his research agenda on evaluation awareness. Jord is currently mentoring a project at Algoverse on evaluation awareness monitor evasion. Funding has been used for API costs in the experiments mentioned above.
Is there anything others could help you with?
We'd benefit from extra compute funding, but otherwise no.
Offering 3k to get this started. I think the project makes sense, but I have no further evidence about the project or authors beyond what is listed here.
This donation is not a commitment to follow-up funding, even if reasonable progress is made.