MentaLeap

Israel

7 people

Founded 2022

MentaLeap is an Israel-based AI safety research group focused on mechanistic interpretability, applying neuroscience and cybersecurity expertise to reverse-engineer neural networks and reduce risks from advanced AI systems.

People

Updated 05/18/26

Itay Yona

Founder & Principal Investigator

Michael Karasik

Researcher

Funding Details

Updated 05/18/26

Annual Budget: -
Current Runway: -
Funding Goal: -
Funding Raised to Date: -

Org Details

Updated 05/18/26

MentaLeap is an AI safety research and community group based in Israel, founded in 2022 by Itay Yona as part of EA Israel's broader AI safety efforts. The group's tagline is "Reverse Engineering Intelligence" and its stated mission is to create a safer path toward artificial general intelligence. The group brings together world-class information security specialists, AI researchers, and neuroscientists with a shared interest in mechanistic interpretability, the scientific program of understanding the internal computations of neural networks from first principles.

MentaLeap meets every two to three weeks to discuss the latest interpretability and AI safety research papers, and participates in international online hackathons. At the January 2023 Apart Research Mechanistic Interpretability Hackathon, MentaLeap hosted a location in Israel and submitted a project titled "Soft Prompts are a Convex Set," which received Judge's Choice recognition (ranked #4 overall). The group has also competed in other interpretability hackathons, including a project probing conceptual knowledge of RL agents in solved games.

Members of the group have published research on AI security topics, including the Doublespeak attack, a novel in-context representation hijacking technique against large language models that demonstrated safety bypass in production models including GPT-4o, Claude, and Gemini. The group's research blog at mentaleap.ai covers topics from mechanistic interpretability theory to practical AI security concerns.

MentaLeap operates with 7 permanent members and has engaged more than 40 participants across its activities. It functions as an EA Israel working group rather than an independent legal entity.

Theory of Change

Updated 05/18/26

MentaLeap believes that understanding the internal mechanisms of neural networks is a prerequisite for reliably safe AI. Drawing a parallel to Ken Thompson's 1984 paper "Reflections on Trusting Trust" in computer security, they argue that as long as neural network weights cannot be fully reverse-engineered and steered, there is an inherent risk: adversaries with write access to weights could implant undetectable backdoors, and even well-intentioned alignment efforts could be undermined. By advancing mechanistic interpretability research and AI security knowledge, and by building a community of researchers with both neuroscience and infosec backgrounds, MentaLeap aims to contribute the scientific foundations necessary to detect and prevent such threats, ultimately enabling a safer trajectory toward AGI.

Grants Received

Updated 05/18/26

LTFF 2023 Q3 - MentaLeap

from Long-Term Future Fund (funds.effectivealtruism.org)

$40,000

Projects

Updated 05/18/26

Doublespeak: In-Context Representation Hijacking

Research project that introduces Doublespeak, an in-context representation hijacking attack against large language models where harmful keywords are systematically replaced with benign tokens across multiple in-context examples so that the benign token’s internal representation acquires the harmful semantics and bypasses standard safety alignment checks.

active

Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

Mechanistic interpretability project that localizes entity-selective neurons (“entity cells”) in language models and uses causal interventions on PopQA-style factual question answering to study how these neurons mediate entity-centric factual recall.

active

Probing Conceptual Knowledge on Solved Games

Research project that uses TCAV to probe how a deep reinforcement learning agent for Connect Four relies on human-interpretable concepts, inspired by the “Acquisition of Chess Knowledge in AlphaZero” paper by DeepMind and Google Brain.
