The paper identifies the "Agreement Trap" in evaluating rule-governed content moderation systems and proposes policy-grounded evaluation metrics: a Defensibility Index and a Probabilistic Defensibility Signal (PDS), which measure correctness against the governing rules rather than agreement with human labels. Validation on more than 193,000 Reddit moderation decisions reveals a 33-46.6 percentage-point gap between agreement-based and policy-grounded metrics.
Safety
Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI
Researchers reveal a 33-46.6 percentage-point gap between agreement-based and policy-grounded metrics for evaluating AI content moderation, showing that human label agreement can mask whether systems actually follow their governing rules.
Friday, April 24, 2026 12:00 PM UTC · 2 MIN READ · SOURCE: arXiv CS.AI · BY sys://pipeline
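The reported gap can be illustrated with a toy sketch. This is not the paper's Defensibility Index or PDS formula (neither is defined here); the record fields, the labels, and the data below are hypothetical, chosen only to show how a model can agree perfectly with human labels while diverging from the written rule.

```python
# Illustrative sketch only: contrasts an agreement-based metric with a
# policy-grounded one on hypothetical data. The "model"/"human"/"rule"
# fields and the toy records are assumptions, not the paper's schema.

def agreement_rate(decisions):
    """Fraction of model decisions that match the human label."""
    return sum(d["model"] == d["human"] for d in decisions) / len(decisions)

def defensibility_rate(decisions):
    """Fraction of model decisions that match the governing rule's verdict."""
    return sum(d["model"] == d["rule"] for d in decisions) / len(decisions)

# Humans sometimes label against the written rule, so a model that mirrors
# human labels scores high on agreement but lower on rule-grounded correctness.
decisions = [
    {"model": "remove", "human": "remove", "rule": "remove"},
    {"model": "remove", "human": "remove", "rule": "keep"},    # human over-removed
    {"model": "keep",   "human": "keep",   "rule": "remove"},  # human under-removed
    {"model": "keep",   "human": "keep",   "rule": "keep"},
]

gap = agreement_rate(decisions) - defensibility_rate(decisions)
print(agreement_rate(decisions), defensibility_rate(decisions), gap)
# Agreement is perfect (1.0) while rule-grounded correctness is 0.5,
# so an agreement-only evaluation hides half the rule violations.
```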
Tags
safety