Expanding proven isolation techniques to high-risk capability domains in Mixture of Expert models
People
Updated 06/10/26 by grantmaking.ai (creator)
Project Details
Project summary
Localizing dangerous knowledge in designated experts of Mixture of Experts (MoE) models, so that those experts can be selectively deactivated while preserving model performance on other domains.
What are this project's goals? How will you achieve them?
We aim to address the dual-use threat posed by advanced AI capabilities in Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Our approach uses gradient routing during pre-training of Mixture of Experts (MoE) models to physically localize dangerous capabilities in designated experts, which can then be selectively activated based on deployment context.
Our goal is to implement this with minimal modifications to current MoE training regimes, publish our work, and open source our code to facilitate easy industry adoption.
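As a rough illustration of what "selective activation based on deployment context" could look like at inference time, the sketch below masks designated experts out of a top-k MoE router. It is a minimal NumPy sketch, not the project's implementation: the function name `route_top_k`, the `disabled_experts` parameter, and the softmax-over-selected-experts weighting are all assumptions made for illustration, and it does not cover the gradient routing used during pre-training.

```python
import numpy as np

def route_top_k(router_logits, k, disabled_experts=()):
    """Select top-k experts per token, never picking a disabled expert.

    router_logits: array of shape (num_tokens, num_experts).
    disabled_experts: hypothetical list of expert indices to ablate,
        e.g. the experts designated for a high-risk domain.
    Returns (expert_indices, expert_weights), each of shape (num_tokens, k).
    """
    logits = np.array(router_logits, dtype=float)  # copy so the input is untouched
    num_experts = logits.shape[-1]
    disabled = sorted(set(disabled_experts))
    if k > num_experts - len(disabled):
        raise ValueError("k exceeds the number of enabled experts")

    # Ablation: a disabled expert gets -inf so it can never enter the top-k.
    if disabled:
        logits[:, disabled] = -np.inf

    # Indices of the k largest logits for each token.
    top_idx = np.argsort(-logits, axis=-1)[:, :k]
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)

    # Softmax over the selected experts only (an assumed weighting scheme).
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return top_idx, weights

# Usage: one token, four experts; expert 1 is the (hypothetical) designated expert.
logits = np.array([[1.0, 3.0, 2.0, 0.5]])
full_idx, _ = route_top_k(logits, k=2)
ablated_idx, ablated_w = route_top_k(logits, k=2, disabled_experts=[1])
```

In this sketch the deployment context reduces to the choice of `disabled_experts`: a restricted deployment passes the designated experts' indices, while a trusted one passes none.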
We have already demonstrated:
Strong Isolation at Scale (on Medical Knowledge):
- ~250,000x compute slowdown at 1.2B params after ablating medical experts (medical performance degrades to baseline trained with 250,000x fewer FLOPs)
- ~62,500x compute slowdown maintained even after re-calibrating output logits to medical domain
- Isolation effectiveness increases with model scale
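The "compute slowdown" figures above compare the ablated model against a baseline trained with far fewer FLOPs. The proposal does not say how that comparison is fitted, so the following is only a sketch under an assumed power-law scaling fit, loss = a * C^(-b) + irreducible; the function names and the coefficients `a`, `b`, and `irreducible` are hypothetical and would have to come from a fit to real baseline runs.

```python
def effective_compute(loss, a, b, irreducible=0.0):
    """Invert the assumed fit loss = a * C**(-b) + irreducible to get C.

    Returns the training FLOPs at which a baseline model would reach `loss`.
    """
    reducible = loss - irreducible
    if reducible <= 0:
        raise ValueError("loss must exceed the irreducible term")
    return (a / reducible) ** (1.0 / b)

def compute_slowdown(train_flops, ablated_loss, a, b, irreducible=0.0):
    """Ratio of actual training FLOPs to the FLOPs a baseline would need
    to match the ablated model's loss on the target domain."""
    return train_flops / effective_compute(ablated_loss, a, b, irreducible)

# Usage with made-up numbers (not results from the project):
# a model trained with 1e6 FLOPs whose post-ablation loss matches a
# baseline trained with 1e4 FLOPs has a 100x compute slowdown.
slowdown = compute_slowdown(train_flops=1e6, ablated_loss=0.02, a=2.0, b=0.5)
```

Under this reading, a larger slowdown means the ablated model behaves on the target domain like a much smaller training run, which is the sense in which the reported factors indicate isolation strength.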
Robustness Against Recovery:
- Full-model fine-tuning on medical data requires ~500M tokens to restore USMLE performance
Minimal Alignment Tax:
- <0.02 nats increase in loss on non-medical domains post-ablation demonstrates performance preservation on legitimate capabilities
Label Efficiency:
- Semi-supervised approach outperforms supervised, likely indicating resilience to label errors and reduced labeling requirements
Our next goals include:
- Multiple Realistic Domains: Expand from medical to virology, nuclear, and other high-risk areas
- Scale to Larger Models: Increase confidence that our results extrapolate to current base models.
- Thorough Evaluation: Test against advanced adversarial attacks and in-context learning recovery.
- Understand Labeling Requirements: Determine minimum quality and quantity requirements for domain classification.
How will this funding be used?
Compute resources to scale experiments to larger models, expand to CBRN domains, conduct extensive robustness testing, and perform comprehensive evaluations across different architectures.
- Goal 1: up to $12,000. Support ICML rebuttal process, understanding labeling requirements (8 x H200 GPUs for 2 months).
- Goal 2: up to $75,000. Support thorough evaluation, expansion to multiple realistic domains (8 x H200 GPUs for 6 months).
- Ideal Goal: $150,000. Support all goals and scaling to larger models (16 x H200 GPUs for 6 months).
Who is on your team? What's your track record on similar projects?
I'll be working full-time on this project under close mentorship from Alec, meeting with him for at least 1 hour per week, with the option of ad-hoc meetings when necessary.
- Mentor: Alec Radford - Former OpenAI research scientist, extensive experience with large-scale language model training and research. Original author of GPT-2 paper, CLIP, DALL-E, Whisper.
- Google Scholar: https://scholar.google.com/citations?user=dOad5HoAAAAJ&hl=en
- Primary author: Krishna Patel - Current Anthropic Fellow, 6 months working on this project with Alec, previously worked on storage optimization and pre-training evaluation at Apple, BS/MS Computer Science (concentrating in AI) from Stanford.
- LinkedIn: https://www.linkedin.com/in/krishnakpatel/
- Google Scholar: https://scholar.google.com/citations?user=VMaMb3AAAAAJ
What are the most likely causes and outcomes if this project fails?
We have already demonstrated considerable promise with strong isolation results and minimal performance degradation.
The most likely failure mode is incomplete capability isolation in more complex CBRN domains. However, our medical knowledge results show the approach is fundamentally sound when isolating a subfield.
If the project fails, we would document the fundamental limitations we came up against and provide evidence for investing in alternative safety measures like comprehensive data filtration.
How much money have you raised in the last 12 months, and from where?
$0, previously supported by MATS/Anthropic Fellows Program.
Grants Received
Discussion
Emptying the rest of my Manifold balance for now into this and planning to give more. I hope this gets the full $150k required.
After a ~45 min call with Krishna, I was very impressed with how clearly she communicated what she was doing, why she was doing it, and the metrics she had for her results. She also had good reasons for wanting this work to be done outside of a major lab.
Furthermore, I think this is a very promising and plausible safety feature to combat CBRN risk from models: isolate the part of the model that knows about these domains while keeping the remainder of its capabilities intact.
Overall, I found Krishna to be very smart. From my brief conversation with her mentor Alec a while ago, I also think he has great insights.
I don't really have major reservations here. I think this grant will be similar to my grant to Joseph Bloom from a couple of years ago.
This looks extremely promising to me. Want to have a 30 min call about it?
Thanks, we're really excited about it too! What's the easiest way to coordinate a chat? @MarcusAbramovitch
@Krishna-Patel I sent you a LinkedIn message but now I see your email is on your profile. Coming up