A research group at MIT CSAIL developing algorithmic frameworks, techniques, and policies to make AI systems safe and socially beneficial. Led by Associate Professor Dylan Hadfield-Menell.
People
Updated 05/18/26
Principal Investigator
PhD Student
Postdoctoral Fellow
PhD Student
Research Scientist
PhD Student
PhD Student
MEng Student
Funding Details
Updated 05/18/26
- Annual Budget: -
- Current Runway: -
- Funding Goal: -
- Funding Raised to Date: -
Org Details
Updated 05/18/26
The Algorithmic Alignment Group is a research program at MIT CSAIL (Computer Science and Artificial Intelligence Laboratory), established when Dylan Hadfield-Menell joined MIT as an Assistant Professor in 2021. Hadfield-Menell, who holds the Bonnie and Marty Tenenbaum Career Development Professorship and was promoted to Associate Professor in 2024, directs the group within CSAIL's Embodied Intelligence cluster. He was named an AI2050 Early Career Fellow by Schmidt Sciences in 2022, supporting his work on AI systems that manage uncertainty about rewards.
The group's mission is to develop better conceptual understanding, algorithmic techniques, and policies to make AI safer and more socially beneficial. Its work is interdisciplinary and centers on how humans and AI systems interact in the contexts of value learning, incentives, recommendation systems, debugging, and policy.
Core research areas include: AI safety and robustness (adversarial attacks, unlearning methods, latent adversarial training); AI alignment and preference learning (reinforcement learning from human feedback, assistance games, goal inference); interpretability and transparency (neural network diagnostics, feature analysis, internal representations of language models); responsible AI development (auditing, open-weight model risk, deepfake harms); and AI governance and policy (evidence-based policy design, recommendation system regulation, multi-agent coordination).
Notable outputs include the CommonClaim dataset of 20,000 human-labeled statements for red-teaming language models, a contracts library for multi-agent reinforcement learning, and influential publications on cooperative inverse reinforcement learning, latent adversarial training, and the limitations of reinforcement learning from human feedback.
The group currently includes a postdoctoral researcher, four PhD students, and over a dozen Masters/M.Eng. students.
Theory of Change
Updated 05/18/26
The group believes that rigorous technical research on AI alignment (covering value learning, robustness, interpretability, and human-AI interaction) will produce concrete algorithmic techniques that AI developers can adopt to build safer systems. By identifying failure modes in current approaches (such as reward misspecification and feedback manipulation), developing better evaluation and auditing methods, and contributing to evidence-based AI governance, the group aims to shift how AI systems are designed and deployed. The causal chain runs from foundational research, to improved alignment methods, to safer AI systems in production, with policy work providing a parallel path to reducing systemic risk from advanced AI.
Grants Received
Updated 05/18/26
No grants recorded.
Projects
Updated 05/18/26
CommonClaim: a dataset of 20,000 boolean statements, each labeled by two human annotators as common-knowledge-true, common-knowledge-false, or neither, created by the MIT Algorithmic Alignment Group for red-teaming language models.
Open-source library implementing formal contracts to mitigate social dilemmas in multi-agent reinforcement learning, built on RLlib and developed by members of the Algorithmic Alignment Group.