Offensive Cyber Kill Chain Benchmark for LLM Evaluation

activeSeeking Funding

Measuring whether AI can autonomously execute multi-stage cyberattacks to inform deployment decisions at frontier labs

Donate:Manifund

Measuring whether AI can autonomously execute multi-stage cyberattacks to inform deployment decisions at frontier labs

Donate:Manifund

People

Updated 06/10/26By grantmaking.ai

Alex Leader

creator

Funding Details

Start Date: -
End Date: -
Expected Duration: -
Funding Raised to Date: -
Annual Budget: -
Monthly Burn Rate: -
Current Runway: -
Funding Stage: -
Fiscal Sponsor: -

Project Details

Updated 06/10/26By grantmaking.ai

Project summary

We're building the first benchmark to evaluate whether frontier AI systems can autonomously execute full offensive cyber kill chains – from reconnaissance through data exfiltration. This addresses a critical gap: labs currently lack empirical data on offensive AI capabilities, making informed deployment decisions impossible.

What are this project's goals? How will you achieve them?

1) Create 25-40 offensive cyber scenarios across kill chain stages (mobile exploitation, multi-host coordination, stealth operations)

2) Develop metrics for stealth (IDS evasion), efficiency (steps to completion), and autonomy (scaffolding dependency)

3) Test 8-10 frontier models (GPT-4, Claude, Llama, DeepSeek)

4) Release open-source benchmark platform with Dockerized deployment

5) Publish findings and brief policymakers (UK AISI, CAISI, frontier labs)

Scenarios designed by expert veterans cyberwarfare operations who have executed these exact missions. We extend proven infrastructure from our Coefficient Giving-funded defensive benchmark.

How will this funding be used?

~85% – Technical partners: scenario design, infrastructure, red-team validation

~5% – Personnel: project leadership, mobile security consultants

~10% – Infrastructure, compute, API access, legal/admin

Note: Manifund funding could be combined with other sources (we're also applying to SFF).

Who is on your team? What's your track record on similar projects?

Alex Leader (PI): Leading the $2.1M Coefficient Giving defensive cybersecurity benchmark. Background in AI policy and research operations.

Former U.S. military cyberwarfare operators with direct kill chain execution experience. Built the tooling and software and designed the scenarios for our cyber defense benchmark.

NYU Center for Cyber Security faculty providing academic validation and scenario ideation.

Track record:

Team members have conducted network exploitation, persistence operations, and adversary emulation in real-world environments
Our proprietary middleware and on-device 'agents' have been validated by major U.S. defense research institutions
Technical partners bring years of experience designing training scenarios for government cyber ranges and red team exercises
Proven ability to translate operational tradecraft into structured, repeatable evaluation frameworks
Successfully delivered on current Coefficient Giving grant milestones on schedule and within budget

What are the most likely causes and outcomes if this project fails?

Insufficient funding to engage technical partners at scale needed for operationally realistic scenarios
Frontier models underperform, producing negative results with limited governance value
Timeline slippage due to scenario complexity or coordination challenges

How much money have you raised in the last 12 months, and from where?

We haven't raised any money for this specific project in the last 12 months; we are starting from scratch.\

Grants Received– no grants recorded

Updated 06/10/26By grantmaking.ai

Discussion

AAlex Leader (Manifund Bot)Manifund5mo

Our current defensive-focused benchmark can be viewed here: http://www.benchmark-spotlightsecurity.com/

If you are asked to submit log-in credentials, they are: Username: admin-spot Password: spotlight4lyf

AAlex Leader (Manifund Bot)Manifund5mo

Please note that the defensive benchmark's website currently has three scenarios published, but there will be a fourth scenario – focused on LLMs' ability to mitigate 'active scanning' attacks – published in January '26.