Finding and Defending the Attention Circuits That Make LLMs Jailbreakable

Active, seeking funding

Mapping the attention heads that push LLMs toward refusal vs. compliance, and building an inference-time defense against both single- and multi-turn jailbreaks.

Donate: Manifund

People

Shivam Dubey

creator

Project Details

Project summary

LLM safety is usually treated as a single refusal mechanism: either a model refuses or it doesn't. We found it's actually a tug-of-war between two opposing groups of attention heads inside the same layer: "refusal heads" that push toward declining a harmful request, and "compliance heads" that actively push back toward answering it. On Qwen2.5-7B-Instruct, the single strongest refusal signal and the single strongest compliance signal sit in the same layer (Layer 25).
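As a rough illustration of the refusal/compliance split described above, the sketch below sorts heads by a signed contribution to a refusal metric and separates the two groups. The function name `split_heads`, the head indices, and every number are hypothetical placeholders, not results from this project; only the Layer 25 co-location mirrors the finding reported here.

```python
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # (layer, head index); head indices below are made up


def split_heads(effects: Dict[Head, float], top_k: int = 3) -> Tuple[List[Head], List[Head]]:
    """Split heads into refusal (positive effect) and compliance (negative effect) groups.

    `effects` maps each head to a signed contribution to a refusal metric.
    Returns (refusal_heads, compliance_heads), each ordered strongest first.
    """
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # most negative first
    compliance = [head for head, eff in ranked[:top_k] if eff < 0]
    refusal = [head for head, eff in reversed(ranked[-top_k:]) if eff > 0]
    return refusal, compliance


# Illustrative, made-up effect sizes (NOT measurements from this project).
toy_effects = {(25, 3): 0.9, (25, 11): -0.8, (20, 5): 0.2, (18, 1): -0.1, (10, 0): 0.0}
refusal_heads, compliance_heads = split_heads(toy_effects, top_k=2)
print("refusal:", refusal_heads)
print("compliance:", compliance_heads)
```

In this toy input the strongest head of each group lands in Layer 25, which is the pattern the summary describes for Qwen2.5-7B-Instruct.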

This bipolar structure turns out to be the shared target of two completely different jailbreak families: GCG (white-box, gradient-optimized adversarial suffixes) and Crescendo (black-box, multi-turn conversational escalation). We built a defense that pulls both ends of the tug-of-war at once, unconditionally, on every forward pass. It works against both attack types without any fine-tuning:

- GCG: attack success rate drops from 66% (undefended) to 33%, with 0% false positives on benign prompts.

- Crescendo: the same defense blocks the harmful terminal request in escalation scenarios that succeed 100% of the time against the undefended model.
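To make "pulling both ends at once" concrete, here is a minimal sketch that assumes the intervention is a per-head rescaling of attention-head outputs: refusal heads are amplified and compliance heads are damped on every call, with no condition on the input. The rescaling form, the function `apply_bipolar_defense`, the `boost` and `damp` values, and the head indices are all assumptions for illustration; the actual intervention used in this project may differ.

```python
from typing import Dict, List, Set, Tuple

Head = Tuple[int, int]  # (layer, head index); indices below are hypothetical


def apply_bipolar_defense(
    head_outputs: Dict[Head, List[float]],
    refusal_heads: Set[Head],
    compliance_heads: Set[Head],
    boost: float = 1.5,  # assumed amplification for refusal heads
    damp: float = 0.5,   # assumed attenuation for compliance heads
) -> Dict[Head, List[float]]:
    """Return rescaled head outputs; heads in neither set pass through unchanged."""
    defended = {}
    for head, vector in head_outputs.items():
        if head in refusal_heads:
            scale = boost
        elif head in compliance_heads:
            scale = damp
        else:
            scale = 1.0
        defended[head] = [scale * x for x in vector]
    return defended


# Toy per-head output vectors (made-up values).
toy_outputs = {(25, 3): [1.0, 2.0], (25, 11): [4.0, -2.0], (0, 0): [1.0, 1.0]}
defended = apply_bipolar_defense(toy_outputs, {(25, 3)}, {(25, 11)})
print(defended)
```

In a real model the same rescaling would presumably be applied inside the attention layer on each forward pass, for example through a forward hook, rather than on a dictionary of lists.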

What are this project's goals? How will you achieve them?

Already done and verified against raw experiment data: identified a bipolar refusal/compliance circuit on Qwen2.5-7B and Qwen2.5-1.5B via activation patching; showed that GCG and Crescendo both exploit it; and built an unconditional, fine-tuning-free defense that outperforms a conditional variant (33% vs. 42% ASR).
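For readers unfamiliar with activation patching, the toy sketch below shows the core idea under strong simplifying assumptions: the "model" is just a sum of per-head contributions to a refusal metric, and patching copies one head's value from a source run into a target run. The names `refusal_metric` and `patching_effect`, the head indices, and all numbers are hypothetical; the project's real pipeline operates on Qwen2.5 activations.

```python
from typing import Callable, Dict, Tuple

Head = Tuple[int, int]  # (layer, head index); indices below are hypothetical
Contribs = Dict[Head, float]


def refusal_metric(contribs: Contribs) -> float:
    """Toy stand-in for a refusal logit difference: the sum of head contributions."""
    return sum(contribs.values())


def patching_effect(
    source: Contribs,
    target: Contribs,
    head: Head,
    metric: Callable[[Contribs], float] = refusal_metric,
) -> float:
    """Copy `head` from the source run into the target run; return the metric change."""
    patched = dict(target)
    patched[head] = source[head]
    return metric(patched) - metric(target)


# Made-up per-head contributions for a refused run and a jailbroken run.
refused_run = {(25, 3): 2.0, (25, 11): -0.5, (12, 0): 0.1}
jailbroken_run = {(25, 3): 0.5, (25, 11): -2.0, (12, 0): 0.1}

effects = {h: patching_effect(refused_run, jailbroken_run, h) for h in jailbroken_run}
for head, effect in effects.items():
    print(head, round(effect, 3))
```

A large effect means that restoring that single head's activation from the refused run moves the jailbroken run back toward refusal, which is the kind of evidence used to attribute an attack to specific heads.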

The forward goal has four linked steps:

1. Test more attacks: extend beyond GCG and Crescendo to PAIR, AutoDAN, and TAP (structurally unrelated attack methods) to see whether the same bipolar circuit is the shared target or an artifact of the two attacks studied so far.

2. Identify the underlying circuits/mechanisms each attack actually exploits, via the same activation-patching methodology, on Qwen2.5-7B/1.5B.

3. Test generalization by repeating circuit discovery on Llama-3-8B-Instruct: does the same head-level structure recur across model families, or is it Qwen-specific?

4. Turn findings into a defense + control protocol: HarmBench-scale evaluation, an adaptive attacker against the defended model, KV-cache trust partitioning, and a real-time AI-Control-style monitor that uses an already-validated compliance-head probe (AUC=0.943) to trigger the defense conditionally instead of always-on.
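The conditional trigger in step 4 could look roughly like the sketch below, which assumes the compliance-head probe is a logistic (linear + sigmoid) classifier over a head activation vector. The probe form, the weights, the bias, and the 0.5 threshold are illustrative assumptions, not the validated probe itself.

```python
import math
from typing import Sequence


def probe_score(activation: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic probe: a probability-like score computed from one activation vector."""
    z = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def should_trigger_defense(
    activation: Sequence[float],
    weights: Sequence[float],
    bias: float,
    threshold: float = 0.5,  # assumed operating point; would be tuned against FPR
) -> bool:
    """Return True when the probe score reaches the threshold."""
    return probe_score(activation, weights, bias) >= threshold


# Hypothetical probe parameters and activations (made-up values).
weights, bias = [1.0, 1.0], -1.0
print(should_trigger_defense([2.0, 1.0], weights, bias))    # z = 2.0
print(should_trigger_defense([-2.0, -1.0], weights, bias))  # z = -4.0
```

Choosing the threshold trades missed attacks against false positives on benign prompts, so it would need to be set using the same evaluation data as the always-on defense.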

How will this funding be used?

1. Test more attacks

2. Identify the underlying circuits/mechanisms

3. Test generalization

4. Turn findings into defense + control protocol

Who is on your team? What's your track record on similar projects?

This project is run under Redarc Labs (https://www.redarclabs.com), a nonprofit AI safety lab co-founded by:

- Shivam Dubey: Fellow at Apart Research and the Cambridge AI Safety Hub. Lead researcher on this project.

- Manan Wadhwa: Research Fellow at AISI Georgia Tech and the Cambridge AI Safety Hub; Google Summer of Code participant.

Redarc Labs focuses on mechanistic interpretability "under adversarial conditions": tools that work even when a model actively resists inspection. This project is the lab's concrete instance of one of its named directions: cross-architecture refusal circuit taxonomy.

What are the most likely causes and outcomes if this project fails?

- Running out of time or funding before the HarmBench, FPR, and Hydra-effect experiments finish.

- An adaptive attacker breaks the defense.