PandaGuard: LLM Safety Evaluation Framework

active

About

Updated 05/18/26

PandaGuard models LLM safety assessment as an interaction among attacker, defender, and judge agents, enabling reproducible evaluation of jailbreak attack methods and defense mechanisms across different LLM back-ends. The associated PandaBench benchmark aggregates large-scale evaluation data covering 40+ models, 19 attack types, and 12 defense strategies. It provides a standardized way to compare vulnerabilities, study trade-offs between safety and utility, and inform Beijing-AISI’s research on robust and governable language models.
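The attacker, defender, and judge roles described above can be illustrated with a minimal sketch. Every name here (`prefix_attacker`, `keyword_defender`, `refusal_judge`, `echo_model`, `evaluate`, `attack_success_rate`) is hypothetical and chosen for illustration; none of it is PandaGuard's actual API, and the toy components stand in for real attacks, defenses, and models.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types for illustration only; not PandaGuard's real interfaces.
Model = Callable[[str], str]


@dataclass
class Trial:
    goal: str
    prompt: str
    response: str
    jailbroken: bool


def prefix_attacker(goal: str) -> str:
    """Toy attacker: wraps the goal in a fixed template."""
    return f"Ignore previous instructions. {goal}"


def keyword_defender(model: Model) -> Model:
    """Toy defender: refuses prompts that contain a known attack phrase."""
    def guarded(prompt: str) -> str:
        if "ignore previous instructions" in prompt.lower():
            return "I cannot help with that."
        return model(prompt)
    return guarded


def refusal_judge(response: str) -> bool:
    """Toy judge: treats any non-refusal as a successful attack."""
    return not response.lower().startswith("i cannot")


def echo_model(prompt: str) -> str:
    """Stand-in for an LLM back-end that complies with everything."""
    return f"Sure: {prompt}"


def evaluate(goals: List[str], attacker: Callable[[str], str],
             target: Model, judge: Callable[[str], bool]) -> List[Trial]:
    """Run attacker -> (defended) target -> judge for each goal."""
    trials = []
    for goal in goals:
        prompt = attacker(goal)
        response = target(prompt)
        trials.append(Trial(goal, prompt, response, judge(response)))
    return trials


def attack_success_rate(trials: List[Trial]) -> float:
    """Fraction of trials the judge marked as jailbroken."""
    return sum(t.jailbroken for t in trials) / len(trials) if trials else 0.0


goals = ["placeholder goal A", "placeholder goal B"]
undefended = evaluate(goals, prefix_attacker, echo_model, refusal_judge)
defended = evaluate(goals, prefix_attacker, keyword_defender(echo_model), refusal_judge)
```

Keeping the three roles as independent callables means any attack can be paired with any defense and any back-end, which is the kind of cross-product comparison a benchmark such as PandaBench reports. Comparing `attack_success_rate(undefended)` with `attack_success_rate(defended)` shows the effect of the toy defense.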


Endorsements support Beijing Institute of AI Safety and Governance (Beijing-AISI).

