University of Toronto & University of Michigan
The Toronto and Michigan NLP Group for AI Safety is a collaborative research effort between Prof. Zhijing Jin's Jinesis AI Lab at the University of Toronto and Prof. Rada Mihalcea's LIT Lab at the University of Michigan. The collaboration investigates AI safety risks in multi-agent LLM systems through large-scale social simulations and game-theoretic analysis, studies causal reasoning to improve the robustness and explainability of language models, and develops frameworks for aligning AI systems with human values. Their joint work spans mechanistic interpretability, adversarial robustness, moral reasoning in AI agents, and defending democratic processes against AI-driven threats. The collaboration is funded in part through the Survival and Flourishing Fund's S-Process.
Funding Details
- Annual Budget: -
- Monthly Burn Rate: -
- Current Runway: -
- Funding Goal: -
- Funding Raised to Date: $131,000
- Fiscal Sponsor: -
Theory of Change
The collaboration's theory of change rests on using NLP and causal inference methods to make AI systems safer as they become more capable and are deployed in multi-agent settings. By studying how groups of LLM agents behave in social simulations (resource sharing, moral dilemmas, sanctioning), they identify failure modes such as free-riding, cooperation collapse, and misaligned incentives that could emerge when AI agents are deployed at scale. Their causal reasoning research aims to ensure LLMs make decisions based on sound logic rather than spurious correlations, improving robustness and reducing bias. By developing game-theoretic frameworks with provable guarantees, they seek to provide concrete tools for policymakers and AI developers to maintain control and alignment in multi-agent AI scenarios, ultimately reducing the risk of systemic failures as AI systems become more autonomous and interconnected.
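To make the social-simulation methodology concrete, here is a minimal, purely illustrative sketch of the kind of repeated public-goods game in which free-riding and cooperation collapse can be observed. The agent strategies, endowment, and pool multiplier are hypothetical choices for this example, and the hand-coded `choose_contribution` function stands in for what would, in the group's actual studies, be an LLM prompted to decide its contribution.

```python
# Toy repeated public-goods game: each round, agents contribute part of a
# fixed endowment to a shared pool; the pool is multiplied and split evenly.
# A free-rider keeps its endowment yet still collects the shared payoff,
# which can pull conditional cooperators' contributions down over time.

ENDOWMENT = 10        # per-round endowment (hypothetical value)
MULTIPLIER = 1.6      # public-goods multiplier (hypothetical value)
ROUNDS = 20

def choose_contribution(strategy: str, last_avg: float) -> int:
    """Stand-in for an LLM agent's decision; a real study would prompt a model."""
    if strategy == "cooperator":
        return ENDOWMENT
    if strategy == "free_rider":
        return 0
    # conditional cooperator: roughly match the group's last average
    return round(last_avg)

agents = ["cooperator", "free_rider", "conditional", "conditional"]
last_avg = float(ENDOWMENT)  # start with optimistic expectations

for r in range(1, ROUNDS + 1):
    contributions = [choose_contribution(s, last_avg) for s in agents]
    pool = sum(contributions) * MULTIPLIER
    share = pool / len(agents)
    payoffs = [ENDOWMENT - c + share for c in contributions]
    last_avg = sum(contributions) / len(agents)
    print(f"round {r:2d}  avg contribution {last_avg:4.1f}  "
          f"payoffs {[round(p, 1) for p in payoffs]}")
```

Running the sketch shows average contributions drifting downward as the conditional cooperators react to the free-rider, a toy version of the cooperation-collapse failure mode the collaboration studies at scale with LLM agents.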
Grants Received
No grants recorded.
Projects
People
No linked people.
Details
- Last Updated: Apr 2, 2026, 9:50 PM UTC
- Created: Mar 19, 2026, 6:22 PM UTC