A new research paper shows that refusal behavior in large language models is mediated by a single direction in activation space. Across 13 open-source chat models of up to 72B parameters, the researchers find that erasing this one direction disables refusal while leaving other capabilities largely intact, yielding a white-box jailbreak with minimal side effects. The findings suggest that current safety fine-tuning produces behavior that is brittle and vulnerable to targeted attacks.
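The core recipe is compact enough to sketch. The snippet below is a minimal illustration assuming PyTorch, with random tensors standing in for real residual-stream activations; every name and shape is an illustrative assumption, not the authors' released code. It estimates a "refusal direction" as the difference of mean activations on harmful versus harmless prompts, then projects that direction out of hidden states at inference time, or bakes the same edit into the weights once.

```python
# Minimal sketch of single-direction refusal ablation (assumed PyTorch).
# Random tensors stand in for real model activations throughout.
import torch

d_model = 512  # real chat models use a much larger hidden size

# Stand-ins for activations collected at one layer/token position
# on contrastive prompt sets.
harmful_acts = torch.randn(256, d_model)
harmless_acts = torch.randn(256, d_model)

# "Refusal direction": difference of per-set mean activations,
# normalized to unit length.
r_hat = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
r_hat = r_hat / r_hat.norm()

def ablate(x: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Directional ablation: remove x's component along `direction`."""
    return x - (x @ direction).unsqueeze(-1) * direction

# Inference-time variant: project the direction out of hidden states.
hidden = torch.randn(8, d_model)       # stand-in hidden states for 8 tokens
hidden_jb = ablate(hidden, r_hat)

# Weight-space variant ("orthogonalization"): edit a matrix that writes
# into the residual stream so it can never write along r_hat again.
W_out = torch.randn(d_model, d_model)  # stand-in output projection
W_out_jb = W_out - torch.outer(r_hat, r_hat) @ W_out
```

Because the edit removes only a single rank-one component, the rest of the model's behavior is largely untouched, which is what the article means by "minimal side effects"; the weight-space variant also needs no inference-time hooks once applied.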
Safety
Refusal in Language Models Is Mediated by a Single Direction
Researchers found that refusal behavior in 13 open-source LLMs (up to 72B parameters) is mediated by a single activation direction, enabling a surgical jailbreak that exposes how brittle current safety fine-tuning really is.
Saturday, May 2, 2026, 12:00 PM UTC · 2 min read · Source: Hacker News
Tags
safety