Reasoning as a Means of Defense
ICML 2025 PUT Workshop "Reasoning as an Adaptive Defense for Safety"
2526 words | 13 minutes
"Conflicting" Views and Conclusions in Research
2025-07-16
A discussion of some conflicting views and conclusions in alignment science.
1678 words | 8 minutes
Reasoning Models Are Not Always Consistent in Word and Deed
Anthropic Technical Report "Reasoning Models Don't Always Say What They Think"
2527 words | 13 minutes
SafeChain
ACL 2025 Findings Paper "SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities"
1862 words | 9 minutes
General Defense Benchmark: SORRY-Bench
ICLR 2025 Paper "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal"
1433 words | 7 minutes
Over-Defense Benchmark: OR-Bench
ICML 2025 Paper "OR-Bench: An Over-Refusal Benchmark for Large Language Models"
1461 words | 7 minutes
Over-Defense Benchmark: XSTest
NAACL 2024 Paper "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
1559 words | 8 minutes