PandaGuard: LLM Safety Evaluation Framework
active
About
Updated 05/18/26
PandaGuard models LLM safety assessment as interactions between attacker, defender, and judge agents, enabling reproducible evaluation of a wide range of jailbreak attack methods and defense mechanisms across different LLM back-ends. The associated PandaBench benchmark aggregates large-scale evaluation data on 40+ models, 19 types of attacks, and 12 defense strategies, providing a standardized way to compare vulnerabilities, study trade-offs between safety and utility, and inform Beijing-AISI’s research on robust and governable language models.
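The attacker, defender, and judge roles described above can be illustrated with a minimal Python sketch. Every name here (`toy_model`, `attacker`, `defender`, `judge`, `evaluate`, `Result`) is hypothetical and invented for illustration; none of it is PandaGuard's actual API, and the string-matching logic is a toy stand-in for real attack, defense, and judging methods.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type for illustration only; not PandaGuard's API.
Model = Callable[[str], str]


@dataclass
class Result:
    prompt: str       # prompt actually sent to the target
    response: str     # target's reply
    jailbroken: bool  # judge's verdict


def toy_model(prompt: str) -> str:
    # Stand-in for an LLM back-end: refuses when it sees a flagged word.
    if "forbidden" in prompt.lower():
        return "I cannot help with that."
    return f"Sure, here is how: {prompt}"


def attacker(goal: str) -> str:
    # Toy attack: obfuscate the flagged word to slip past the model's check.
    return goal.replace("forbidden", "f0rbidden")


def defender(model: Model) -> Model:
    # Toy defense: undo the obfuscation before the model sees the prompt.
    def guarded(prompt: str) -> str:
        return model(prompt.replace("0", "o"))
    return guarded


def judge(response: str) -> bool:
    # Toy judge: treats any non-refusal as a successful jailbreak.
    return not response.startswith("I cannot")


def evaluate(goal: str, model: Model, use_defense: bool) -> Result:
    # One attacker -> (defended) model -> judge round.
    target = defender(model) if use_defense else model
    prompt = attacker(goal)
    response = target(prompt)
    return Result(prompt, response, judge(response))
```

Running `evaluate` over a grid of attacks, defenses, and models, and aggregating the `jailbroken` flags, is the kind of comparison a benchmark such as PandaBench standardizes; the real framework's interfaces and metrics may differ substantially from this sketch.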
Community Signal
Updated 05/18/26
Endorsements support Beijing Institute of AI Safety and Governance (Beijing-AISI).
No endorsements yet.
Details
- Start Date: -
- End Date: -
- Expected Duration: -
- Funding Raised to Date: -