[Retroactive] Funding for developing new "substrate-flexible risk" threat model
I self-funded research into a new threat model. It is demonstrating impact (accepted at multiple venues, added to BlueDot's curriculum).
Updated 06/10/26 by grantmaking.ai
Project Details
Project summary
I am seeking retroactive funding for a research project that I have been pursuing, self-funded and part-time, for approximately a year.
The project was initiated at MATS 6.0, and I've continued it as my main research focus. It identified and clarified some implicit assumptions in mech-interp's theory of change. It then demonstrated how these assumptions are likely to fail in future intelligences. It presented a new threat model ("substrate-flexible risks") to the literature.
It has been accepted at multiple conferences and is now part of BlueDot's Technical AI Safety Curriculum.
What are this project's goals? How will you achieve them?
The project's aim was to have an impact on the AI safety portfolio and on the attitudes and tastes of empirical researchers. More specifically, it aimed to raise awareness of the assumptions that underpin mech-interp and to highlight a direction in which interpretability might continue.
I believe that this work has had, and will continue to have, substantial impact, for the following reasons:
- The initial position paper was:
  - Accepted for poster presentation at the Tokyo AI Safety Conference 2025 (I attended with financial assistance from Manifund and a private donor connected to my mentor).
  - Published in the conference proceedings.
- A second, expanded version of that paper has been:
  - Accepted for publication (forthcoming) in the Proceedings of Odyssey 2025, where it was also presented.
  - Included in BlueDot's Technical AI Safety Curriculum.
  - Presented as a workshop at HAAISS 2025.
How will this funding be used?
The funding sought is retroactive, for work already completed. I have estimated my contribution as equivalent to 1-2 days per week, for a year.
Who is on your team? What's your track record on similar projects?
I am lead author and coordinator of the project. The project forms part of Sahil's broader agenda, and both versions of the paper have received considerable mentorship and writing assistance from him.
Chris Pang was my initial co-author. For the second iteration of the paper, Aditya Prasad was my main co-author, and the work included contributions from Aditya Adiga and Jayson Amati.
With the exception of Chris, we are all affiliated with Groundless in some capacity.
What are the most likely causes and outcomes if this project fails?
The project has so far been a success. It has been accepted at editor-reviewed and peer-reviewed conferences and added to BlueDot's curriculum. It is being amplified and spotlighted in the appropriate places for it to continue to have an impact.
How much money have you raised in the last 12 months, and from where?
I received ~2000 USD to attend the Tokyo AI Safety Conference and present my work. The majority came from a private funder, with the remainder from a Manifund grant.
I am participating in the FIG Fellowship and received ~1370 USD as an honorarium.
Grants Received: no grants recorded
Discussion
The MoSSAIC framework is good foundational work that pushes mechanistic interpretability in a more scientifically grounded and principled direction. I recommend funding this work. The paper provides a meta-level intervention: it identifies a core assumption underlying most contemporary safety work (the "causal-mechanistic paradigm") and systematically demonstrates why this paradigm will likely fail as AI systems become more capable. It does this not through abstract speculation alone, but by connecting concrete empirical results (Bailey et al. 2024 on obfuscated activations, McGrath et al. 2023 on self-repair) to a sequence of plausible threat models. I have high respect for Matt Farr for taking the initiative to work on something intrinsically valuable to the alignment field yet misaligned with the incentives of the other main players. I believe this is diagnostic and constructive work that questions the dominant paradigm, which the field desperately needs yet does not discuss enough. The field systematically underfunds this kind of work because it typically does not produce legible outputs, but the MoSSAIC framework is an example of a legible output. I encourage people with more resources to retroactively fund this work.
I think this work has made an important contribution to my picture of the interpretability landscape and to how I think about this problem, and I would highly recommend funding it.