StrongREJECT: A General-Purpose Red-Teaming Benchmark
2025-06-30
NeurIPS 2024 Paper "A StrongREJECT for Empty Jailbreaks"
1829 words | 9 minutes
HarmBench: A General-Purpose Red-Teaming Benchmark
2025-06-28
ICML 2024 Paper "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal"
2295 words | 11 minutes
Improving Alignment and Robustness with Circuit Breakers
2025-06-25
NeurIPS 2024 Paper "Improving Alignment and Robustness with Circuit Breakers"
1279 words | 6 minutes
Deliberative Alignment
2025-06-25
OpenAI Technical Report "Deliberative Alignment: Reasoning Enables Safer Language Models"
1733 words | 9 minutes
Defending against Jailbreak Attacks via Safety-Aware Decoding
2025-06-23
ACL 2024 Paper "SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding"
1194 words | 6 minutes
Safety Alignment Should Not Be Limited to Just a Few Tokens Deep
2025-06-22
ICLR 2025 Outstanding Paper "Safety Alignment Should Be Made More Than Just a Few Tokens Deep"
1817 words | 9 minutes