Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

active

About

Updated 05/18/26

Research project culminating in a NeurIPS 2024 paper that constructs cryptographic backdoors in transformer language models which remain unelicitable and very hard to detect or evaluate, even with full white-box access, challenging standard AI safety evaluation methods.

Community Signal

Updated 05/18/26

0Upvotes

0Downvotes

0Endorsements

0Comments

Endorsements support Contramont Research.

No endorsements yet.

Discussion

No comments yet. Be the first to share your thoughts.

Details

Start Date: -
End Date: -
Expected Duration: -
Funding Raised to Date: -