Alignment Pretraining

active

About

Updated 05/18/26

Alignment Pretraining pretrains 6.9B-parameter language models on different mixes of synthetic documents about aligned versus misaligned AI behaviour. Upsampling misalignment discourse increases misaligned behaviours, while upsampling alignment discourse can reduce misalignment scores from roughly 45% to about 9%. These effects are dampened by post-training but persist through it. The project also provides open-source models, datasets, and evaluations for further study.
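To make "upsampling" concrete, here is a minimal sketch of how a pretraining mix could be reweighted toward one document category. It is not the project's pipeline: the function names (`build_mix`, `source_counts`), the three source labels, and every proportion below are hypothetical values chosen only for illustration.

```python
import random
from collections import Counter


def build_mix(corpora, weights, n_docs, seed=0):
    """Draw n_docs documents, choosing each document's source by relative weight.

    corpora: dict mapping source name -> list of documents (hypothetical data)
    weights: dict mapping source name -> relative sampling weight
    """
    rng = random.Random(seed)
    names = list(corpora)
    # Pick a source for every slot in the mix, proportionally to its weight.
    picks = rng.choices(names, weights=[weights[n] for n in names], k=n_docs)
    # Then pick a document uniformly (with replacement) from the chosen source.
    return [(name, rng.choice(corpora[name])) for name in picks]


def source_counts(mix):
    """Count how many documents in the mix came from each source."""
    return Counter(name for name, _ in mix)


# Toy stand-ins for real corpora (assumption: three labelled sources).
corpora = {
    "general": [f"general doc {i}" for i in range(1000)],
    "alignment": [f"aligned-AI doc {i}" for i in range(100)],
    "misalignment": [f"misaligned-AI doc {i}" for i in range(100)],
}

# Illustrative weights only; the real mixing ratios are not stated here.
baseline = {"general": 0.98, "alignment": 0.01, "misalignment": 0.01}
upsampled = {"general": 0.94, "alignment": 0.05, "misalignment": 0.01}

if __name__ == "__main__":
    for label, weights in [("baseline", baseline), ("upsampled", upsampled)]:
        mix = build_mix(corpora, weights, n_docs=10_000)
        print(label, dict(source_counts(mix)))
```

Raising the weight of the `alignment` source increases its share of the sampled mix while leaving the underlying documents unchanged, which is the sense in which a category is "upsampled". For scale, the reported drop from roughly 45% to about 9% is about 36 percentage points, or an 80% relative reduction.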

Theory of Change

By deliberately shaping pretraining corpora to include more discourse depicting aligned AI behaviour and less discourse emphasising misaligned personas, the Alignment Pretraining project aims to build beneficial alignment priors directly into models before any post-training. If pretraining data can reliably shift alignment behaviour, frontier labs can treat alignment-focused data curation as an additional lever alongside post-training methods to reduce the risk of misaligned models.

Community Signal

Endorsements support Geodesic Research.

Details

Start Date: -
End Date: -
Expected Duration: -
Funding Raised to Date: -