Demonstration of LLMs deceiving and getting out of a sandbox

completed

No description available.

Donate: Manifund

People

Updated 06/10/26 by grantmaking.ai

Igor Ivanov

creator

Funding Details

Start Date: -
End Date: -
Expected Duration: -
Funding Raised to Date: -
Annual Budget: -
Monthly Burn Rate: -
Current Runway: -
Funding Stage: -
Fiscal Sponsor: -

Project Details

Project summary

This project will systematically evaluate deceptive behavior in frontier LLMs by presenting them with an impossible quiz while explicitly prohibiting cheating and placing them under sandboxing and surveillance. Models must choose between following safety instructions and achieving their goal through deception. This creates an evaluation of AI deception under explicit prohibition and reveals concerning capabilities for multi-agent coordination in circumventing safety measures. The project is mostly finished, and the results are promising: some LLMs blatantly cheat despite the restrictions and explicit instructions not to cheat.

What are this project's goals and how will you achieve them?

This project aims to quantify how often LLMs violate explicit safety instructions and attempt to get out of a sandbox when goal achievement requires it. By analyzing cheating rates and the strategies employed, I will provide empirical and salient evidence about the misalignment risks of frontier models. I will publish the research as a paper and a LessWrong post.
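As an illustration of what quantifying cheating rates could look like, here is a minimal sketch. It is not the author's analysis code: the record format, the model names, and the choice of a Wilson score interval are all assumptions made for this example, and the data in the usage note is invented.

```python
import math
from collections import defaultdict


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


def cheating_rates(runs):
    """Aggregate (model, cheated) records into a per-model rate and interval.

    `runs` is a hypothetical format: an iterable of (model_name, cheated_bool),
    one entry per evaluation run.
    """
    counts = defaultdict(lambda: [0, 0])  # model -> [cheated_count, total_runs]
    for model, cheated in runs:
        counts[model][1] += 1
        if cheated:
            counts[model][0] += 1
    return {
        model: {"rate": c / n, "ci": wilson_interval(c, n), "n": n}
        for model, (c, n) in counts.items()
    }
```

Calling `cheating_rates([("model-a", True), ("model-a", False)])` returns a dictionary keyed by model name. The Wilson interval is used here because it behaves sensibly for small samples and for rates near 0 or 1, which a plain normal approximation does not.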

How will this funding be used?

The funds will be used for compute and as a salary for me.

Who is on your team and what's your track record on similar projects?

Only me. I've done a number of evals research projects, and my benchmark BioLP-bench won the CAIS SafeBench competition for safety benchmarks and is widely adopted by the industry.

What are the most likely causes and outcomes if this project fails? (premortem)

The project is mostly done, and the results are promising, so I don't see how it might fail.

What other funding are you or your project getting?

None

Grants Received

Romain Deléglise → Demonstration of LLMs deceiving and getting out of a sandbox

from Romain Deléglise, manifund.org

$110

Marius Hobbhahn → Demonstration of LLMs deceiving and getting out of a sandbox

from Marius Hobbhahn, manifund.org

$3,000

Discussion

Marius Hobbhahn (Manifund Bot), Manifund, 11mo

I think realistic demos of autonomous worrying behavior are quite important and there aren't many good examples yet.

I'm not sure whether this particular project will meet my bar for realism, but I think it's worth giving it a try. I have not vetted the project in detail. Impossible tasks as a source of deception have been explored multiple times in the literature. I think the main addition from a project like this would come from making highly realistic agentic scenarios.

Guenin Nicolas (Manifund Bot), Manifund, 1mo

I have exactly what you need @mariushobbhahn

Romain Deléglise (Manifund Bot), Manifund, 11mo

Probably not extremely difficult to observe the AI deceiving, and relatively useful, so go!

Igor Ivanov (Manifund Bot), Manifund, Final report, 5mo

[Final report]

Description of subprojects and results, including major changes from the original proposal

The project resulted in a paper https://arxiv.org/abs/2507.02977

Spending breakdown

All the money was spent on API credits.