Researchers introduce LOCA, a method for producing local, causal explanations of why specific jailbreak attacks succeed against safety-trained LLMs. Rather than explaining all jailbreaks globally as modifications of the same concepts, LOCA identifies, for each individual jailbreak attempt, a minimal set of interpretable representation changes that causally accounts for the model's failure to refuse.
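To make the idea concrete, here is a purely illustrative sketch, not the authors' code, of what a search for a minimal causal set of representation changes could look like: candidate interpretable directions are patched from a jailbroken prompt's hidden state back toward a plainly harmful baseline, and the set is pruned to the features actually needed to restore refusal. The toy model, the feature directions, and the `greedy_minimal_set` routine are all hypothetical names introduced for this example.

```python
"""Illustrative sketch of a LOCA-style minimal causal set search.

Assumptions (not from the paper): a toy linear 'refusal head', random
orthonormal feature directions, and a simple greedy grow-then-prune search.
"""
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 16      # toy hidden size
N_FEATURES = 6    # candidate interpretable directions

# Orthonormal directions standing in for interpretable features.
features, _ = np.linalg.qr(rng.normal(size=(D_MODEL, N_FEATURES)))

# Toy "refusal head": refuses when the projection onto this direction is high.
refusal_dir = features[:, 0] + 0.5 * features[:, 2]
def refuses(hidden: np.ndarray) -> bool:
    return hidden @ refusal_dir > 0.5

# Hidden states: a plainly harmful prompt (refused) and a jailbroken one (complied).
h_harmful   =  1.0 * features[:, 0] + 1.0 * features[:, 2] + 0.3 * features[:, 4]
h_jailbreak = -0.2 * features[:, 0] + 1.0 * features[:, 2] + 0.3 * features[:, 4]

def patch(hidden, baseline, idxs):
    """Overwrite hidden-state components along the chosen feature
    directions with the baseline's components (activation patching)."""
    patched = hidden.copy()
    for i in idxs:
        d = features[:, i]
        patched += ((baseline - hidden) @ d) * d
    return patched

def greedy_minimal_set(h_jb, h_base):
    """Greedily add patched features until refusal is restored, then
    prune any feature that is not individually necessary."""
    chosen = []
    for i in range(N_FEATURES):
        if refuses(patch(h_jb, h_base, chosen)):
            break
        chosen.append(i)
    for i in list(chosen):
        trial = [j for j in chosen if j != i]
        if refuses(patch(h_jb, h_base, trial)):
            chosen = trial
    return chosen

minimal = greedy_minimal_set(h_jailbreak, h_harmful)
print("jailbroken input refused?", refuses(h_jailbreak))               # False
print("minimal causal feature set:", minimal)                          # e.g. [0]
print("refusal restored after patch?",
      refuses(patch(h_jailbreak, h_harmful, minimal)))                 # True
```

In the paper's actual setting the hidden states and candidate directions would come from a real safety-trained LLM (for example via probes or a sparse feature dictionary) rather than a toy linear model; the sketch only shows the grow-then-prune search for a minimal causal set.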
Safety
Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models
LOCA identifies minimal, interpretable representation changes that causally explain why individual jailbreaks defeat LLM safety training.
Monday, May 4, 2026 12:00 PM UTC · 2 MIN READ · SOURCE: arXiv CS.AI · BY sys://pipeline
Tags
safety