Alignment Pretraining
About
Updated 05/18/26
Alignment Pretraining pretrains 6.9B-parameter language models on different mixes of synthetic documents about aligned versus misaligned AI behaviour, showing that upsampling misalignment discourse increases misaligned behaviours while upsampling alignment discourse can reduce misalignment scores from roughly 45% to about 9%. The project further finds that these alignment effects are dampened but persist through post-training, and provides open-source models, datasets, and evaluations for further study.
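As a rough illustration of what "upsampling" one kind of discourse in a data mix can mean, the sketch below draws documents from several sources in proportion to per-source weights. It is a minimal toy under stated assumptions, not the project's actual pipeline: the function name `build_mix`, the source names, the example documents, and the weights are all hypothetical.

```python
import random
from collections import Counter


def build_mix(sources, weights, n_docs, seed=0):
    """Sample n_docs documents, choosing each document's source
    with probability proportional to that source's weight."""
    rng = random.Random(seed)  # local RNG so the mix is reproducible
    names = list(sources)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=n_docs)
    # Draw one document (with replacement) from each chosen source.
    return [(name, rng.choice(sources[name])) for name in picks]


# Hypothetical toy corpus; real sources would hold many documents each.
sources = {
    "general": ["web page A", "web page B"],
    "aligned_ai": ["story in which an AI defers to human oversight"],
    "misaligned_ai": ["story in which an AI deceives its operators"],
}

# Illustrative weights only: aligned-AI discourse is upsampled
# relative to misaligned-AI discourse.
weights = {"general": 0.90, "aligned_ai": 0.09, "misaligned_ai": 0.01}

mix = build_mix(sources, weights, n_docs=10_000)
counts = Counter(name for name, _ in mix)
```

Raising or lowering a source's weight changes how often its documents appear in the sampled mix, which is the only lever this sketch models; it says nothing about the ratios or document types the project itself used.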
Theory of Change
By deliberately shaping pretraining corpora to include more discourse depicting aligned AI behaviour and less discourse emphasising misaligned personas, the Alignment Pretraining project aims to build beneficial alignment priors directly into models before any post-training. If pretraining data can reliably shift alignment behaviour, frontier labs can treat alignment-focused data curation as an additional lever alongside post-training methods to reduce the risk of misaligned models.
Details
- Start Date: -
- End Date: -
- Expected Duration: -
- Funding Raised to Date: -