SAELens
About
Updated 05/18/26
SAELens is an open-source mechanistic interpretability library developed under Decode Research for training and analyzing sparse autoencoders on language models. The codebase provides configurable training loops, loaders for pre-trained SAEs, utilities for integrating with popular transformer libraries, and visualization tools for generating feature dashboards and inspecting how individual SAE features relate to model logits and behavior. The project is actively maintained on the decoderesearch GitHub organization and distributed via PyPI, and is cited in work such as SAEBench as core infrastructure for evaluating SAE architectures and studying the internal representations of large language models.
Theory of Change
SAELens assumes that sparse autoencoders are a promising way to decompose large language model activations into interpretable features, but that progress is bottlenecked by engineering and tooling. By providing a robust, well-documented library for training, loading, and benchmarking SAEs across many models, SAELens lowers the barrier for researchers to run large-scale experiments, compare architectures, and study safety-relevant features and circuits. This should accelerate progress in mechanistic interpretability and support the development of alignment techniques that rely on understanding and steering internal representations.
Details
- Start Date: -
- End Date: -
- Expected Duration: -
- Funding Raised to Date: -