Measuring whether AI can autonomously execute multi-stage cyberattacks to inform deployment decisions at frontier labs
Measuring whether AI can autonomously execute multi-stage cyberattacks to inform deployment decisions at frontier labs
People
Updated 06/10/26By grantmaking.aicreator
Funding Details
- Start Date
- -
- End Date
- -
- Expected Duration
- -
- Funding Raised to Date
- -
- Annual Budget
- -
- Monthly Burn Rate
- -
- Current Runway
- -
- Funding Stage
- -
- Fiscal Sponsor
- -
Project Details
Updated 06/10/26By grantmaking.aiProject summary
We're building the first benchmark to evaluate whether frontier AI systems can autonomously execute full offensive cyber kill chains – from reconnaissance through data exfiltration. This addresses a critical gap: labs currently lack empirical data on offensive AI capabilities, making informed deployment decisions impossible.
What are this project's goals? How will you achieve them?
1) Create 25-40 offensive cyber scenarios across kill chain stages (mobile exploitation, multi-host coordination, stealth operations)
2) Develop metrics for stealth (IDS evasion), efficiency (steps to completion), and autonomy (scaffolding dependency)
3) Test 8-10 frontier models (GPT-4, Claude, Llama, DeepSeek)
4) Release open-source benchmark platform with Dockerized deployment
5) Publish findings and brief policymakers (UK AISI, CAISI, frontier labs)
Scenarios designed by expert veterans cyberwarfare operations who have executed these exact missions. We extend proven infrastructure from our Coefficient Giving-funded defensive benchmark.
How will this funding be used?
~85% – Technical partners: scenario design, infrastructure, red-team validation
~5% – Personnel: project leadership, mobile security consultants
~10% – Infrastructure, compute, API access, legal/admin
Note: Manifund funding could be combined with other sources (we're also applying to SFF).
Who is on your team? What's your track record on similar projects?
Alex Leader (PI): Leading the $2.1M Coefficient Giving defensive cybersecurity benchmark. Background in AI policy and research operations.
Former U.S. military cyberwarfare operators with direct kill chain execution experience. Built the tooling and software and designed the scenarios for our cyber defense benchmark.
NYU Center for Cyber Security faculty providing academic validation and scenario ideation.
Track record:
- Team members have conducted network exploitation, persistence operations, and adversary emulation in real-world environments
- Our proprietary middleware and on-device 'agents' have been validated by major U.S. defense research institutions
- Technical partners bring years of experience designing training scenarios for government cyber ranges and red team exercises
- Proven ability to translate operational tradecraft into structured, repeatable evaluation frameworks
- Successfully delivered on current Coefficient Giving grant milestones on schedule and within budget
What are the most likely causes and outcomes if this project fails?
- Insufficient funding to engage technical partners at scale needed for operationally realistic scenarios
- Frontier models underperform, producing negative results with limited governance value
- Timeline slippage due to scenario complexity or coordination challenges
How much money have you raised in the last 12 months, and from where?
We haven't raised any money for this specific project in the last 12 months; we are starting from scratch.\
Grants Received– no grants recorded
Updated 06/10/26By grantmaking.aiDiscussion
Please note that the defensive benchmark's website currently has three scenarios published, but there will be a fourth scenario – focused on LLMs' ability to mitigate 'active scanning' attacks – published in January '26.
Our current defensive-focused benchmark can be viewed here: http://www.benchmark-spotlightsecurity.com/
If you are asked to submit log-in credentials, they are: Username: admin-spot Password: spotlight4lyf