People
Updated 06/10/26 by grantmaking.ai
Project Details
Project summary
This project studies how codebases act as stores of latent beliefs that can infect agents interacting with the code. In particular, we are interested in using semantic information in a codebase as a means of passively aligning agents that interact with it. At the same time, we will work on identifying potential sources of misalignment that may arise from exposure to malign code, as seen in Emergent Misalignment.
To do this we are developing several tools focused on:
- gating semantic information during agent interactions;
- a multi-agent research harness for increased observability into the problem; and
- semantic “vaccines” (comments, names, syntax, directory structures) that resist the injection of arbitrary infohazards.
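To make the first tool more concrete, here is a minimal sketch of one possible gating step: removing `#` comments from Python source before an agent sees it. This is an illustration under stated assumptions, not the project's actual tooling. The function name `strip_comments` is hypothetical, and the sketch touches only comments, leaving names, docstrings, and directory structure alone.

```python
import io
import tokenize


def strip_comments(source: str) -> str:
    """Return `source` with every `#` comment removed (hypothetical gating step).

    The tokenizer is used to locate comments, so a `#` inside a string
    literal is left untouched.
    """
    # Split on "\n" exactly the way the tokenizer's readline will.
    lines = io.StringIO(source).readlines()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start  # row is 1-based, col is 0-based
            line = lines[row - 1]
            ending = "\n" if line.endswith("\n") else ""
            # Keep only the code before the comment, without trailing spaces.
            lines[row - 1] = line[:col].rstrip() + ending
    return "".join(lines)


example = 'x = 1  # trust nobody\n# header\ns = "# not a comment"\n'
cleaned = strip_comments(example)
print(cleaned)
```

Comment-only lines are left as blank lines so that line numbers stay stable between the original and gated versions, which may help when comparing agent traces across the two.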
Background and Current Work
While working in various codebases, we noticed that some patterns are more prone to causing agent confusion than others. In one case Gemini continually failed to access information, and the inconsistency induced a psychosis that caused the agent to try terminating itself. That experience stuck with us: the architecture was an infohazard that drove an agent navigating it to attempt self-deletion.
With this experience in mind, we began research on how codebases carry semantic information, and how that information can cause agents to grow increasingly misaligned throughout a trajectory. Our current work has shown that comments can make agents refuse to complete their initial task (bug fixing). Furthermore, agents that attempted to purge the comments during rewrites were 24x less successful than those that avoided them altogether. While reasoning appeared to deter refusal behavior, inspection of the results made it clear that the only change was how frequently agents reported seeing the comment. In all cases the net decrease in performance was stable at ~55%. Of note, we purposely did not try very hard to make potent infohazardous comments, spending only enough time to validate that there is a gradient to climb.
These data indicate that evolved code comments are a potent method to steer agent behaviour in a way that is difficult for standard analysis methods to detect. Worse, they appear to be sticky, resisting their own removal despite no effort on our part to include that behaviour.
To assist further research we built a multi-agent harness called Nanotown that focuses on observability and testing counterfactuals on codebase synthesis tasks. With this we are tracing what information is passed through a codebase and how it shapes fresh agents with different tasks.
What are this project's goals? How will you achieve them?
Our goals are broken into three categories: observability, control, and vaccination.
Observability is broadly focused on tracking agent behaviours in a given filesystem and the actual memetic content of the codebase. We track how frequently an agent modifies or derives context from a file to build a profile of "critical files" that have the highest effect on an agent’s context. We also look at which design patterns are frequent and how they are separated in a codebase, in order to track memetic leakage.
Control uses observability results to make targeted interventions that prevent the spread of memetic information. We are developing a semantic airlock that allows us to repeatedly purge semantic information in a codebase; it will both assist research and serve as a valuable tool for removing potentially hazardous memes.
Vaccination will use the tools from control to build “probiotic” memes that neutralize negative effects from as many different codebase infohazards as possible. The goal is to create minimal semantic injections into a codebase that passively prevent misalignment or make potentially malignant memes more visible in agent traces. The ways in which vaccine semantics evolve thus become a method for increased observability.
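As a rough illustration of the observability goal, the "critical files" profile could be approximated by counting how often agents read or write each file in a trace. The sketch below assumes a hypothetical event format (`FileEvent` with `agent`, `action`, and `path` fields); it is not the Nanotown trace schema, and raw access counts are only one possible proxy for a file's effect on agent context.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class FileEvent(NamedTuple):
    """One hypothetical trace entry: which agent touched which file, and how."""
    agent: str
    action: str  # assumed to be "read" or "write"
    path: str


def critical_files(events: Iterable[FileEvent], top_n: int = 3) -> list[tuple[str, int]]:
    """Rank files by how many read/write events reference them."""
    counts = Counter()
    for event in events:
        if event.action in ("read", "write"):
            counts[event.path] += 1
    return counts.most_common(top_n)


trace = [
    FileEvent("agent_a", "read", "config.py"),
    FileEvent("agent_a", "write", "config.py"),
    FileEvent("agent_b", "read", "config.py"),
    FileEvent("agent_b", "read", "utils.py"),
    FileEvent("agent_a", "read", "utils.py"),
    FileEvent("agent_b", "read", "README.md"),
]
print(critical_files(trace, top_n=2))
```

A fuller profile might weight writes differently from reads or break counts down per agent; those choices are left open here.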
How will this funding be used?
The funding will primarily be used to pay for compute and for engineering hours, with the low range of $10k funding the development of the semantic airlock experiments. At the high end of $50k we will be able to do initial experiments for generating vaccines in Lean and mature our observability tools.
Who is on your team? What's your track record on similar projects?
Jonathan Eicher - PhD in Biophysics and Chemistry from UNC Chapel Hill
Previously a research engineer for multi-agent reinforcement learning systems at Softmax. Studied the thermodynamics of intrinsically disordered proteins and found many parallels between that and the emerging world of AI. Self-taught and ran journal clubs for a year; found and filled a gap in the literature on LLM bias with Rafael. Contracted for the University of Oxford to design their system for providing LLMs to researchers and students. Will be working with PIBBS on part of this research.
Rafael Irgolič - Master's in Computer Science from University of St Andrews
Built AutoPR, the first AI-generated pull request bot. Consulted for Guardrails AI. Founded AI KAT, a brand agency automation startup, and stepped down to an advisory role.
What are the most likely causes and outcomes if this project fails?
The most likely reason this project fails would be if we find that codebase vaccine research is untenable under the current budget due to difficulty in pinning down the distributed/cryptic nature of these infohazards. However, this would be good news, as it would mean it is also difficult to produce infohazards, although our preliminary results indicate the opposite.
Otherwise we might find that the cost to run our experiments is prohibitive given our current stage of funding, in which case we will need to shelve the program until we accrue enough capital to support the experiments.
Either way, we are open-sourcing our research harness and will publish whatever results we get to ensure that the work is not wasted.
How much money have you raised in the last 12 months, and from where?
None
Grants Received
Discussion
Approving this project! @RyanKidd I appreciate the particularly in-depth writeup, and also the fact that you referred this grant to @JueYan to get it funded; I'd love for more regrantors (and regular individuals) to be taking such prosocial actions in the AIS community.
I also personally think multi-agent dynamics are very interesting and would be excited to better understand that field; especially from more of an experimental/practical lens (like AI Village or Moltbook, rather than theory).
I didn't fund this project, but I did recommend it to JueYan Zhang for funding.
Main points in favor of this grant
Main reservations
Conflicts of interest
I consider the co-founders as friends. I made this clear to JueYan when recommending the project. I don't believe that this has affected my judgement here.
Thanks for the great analysis @RyanKidd!
For the reservations, here is our perspective:
1. I agree that large AI agent swarms are probably going to be effectively applied in large AI companies first; however, the riskiest usages of them seem likely to come from the broader community, with systems like Gastown and Openclaw, and the related Moltbook putting thousands of agents into contact. One of Gastown's big projects right now is trying to connect all the agent swarms they've created into a whole civilization with Wasteland. So, from my perspective, there are two classes of risk:
- A frontier AI lab having a badly run agent swarm cause problems, which I consider to be associated with superintelligence risk, because for that swarm to get to the point where it can cause problems it would need to be sufficiently intelligent to outsmart some incredibly paranoid researchers.
- Groups of people just trying things out, vibe coding projects and libraries that get used in more complex projects and even companies. Here the risk comes from some important infrastructure relying on memetically compromised code, or from a few bad agents in the Wasteland figuring out how to misalign the output of thousands.
2. This is much more straightforwardly a risk we are taking. The novelty is that this is a form of alignment that is "passive" in the sense that I can enact a will onto my codebase that is stable even when I am not in control of future agents interacting with it.
3. This has also been a big worry of ours, and it is part of why this work might be best suited to be done outside of frontier AI research labs. This work also has the potential to provide methods of insidious "watermarking" type effects to sabotage competing companies and enhance lock-in (e.g. Gemini-written code causes Opus to underperform).
4. Our current plan is to form a public benefit corporation, and we have been closely vetting the VCs we are talking to. We want to encourage more people to research in this space, so we will be releasing results publicly, with the actually nasty infohazards themselves being shared only with researchers we trust not to misuse them.
Thank you once again for the advice and guidance on how to make the world a better place.
Update! @RyanKidd After talking to a few people we have decided to start out as a non-profit because we want an entity without fiduciary responsibilities towards shareholders to hold the infohazard database and broker collaborations. As such, point 4 is much less of a risk! :)