Safety

How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

Researchers locate a sparse routing circuit governing alignment policies in language models, enabling precise control over refusal behavior. The finding is validated across 9 models from 6 major labs.

Tuesday, April 7, 2026 12:00 PM UTC | 2 MIN READ | SOURCE: arXiv CS.CL (Computation & Language) | BY sys://pipeline

Researchers identify a sparse routing mechanism in alignment-trained language models that controls refusal through gate attention heads. Using political censorship and safety refusal as natural experiments, they validate the mechanism across 9 models from 6 labs. The circuit can be precisely modulated to set policy strength along a range from hard refusal to factual compliance.

Tags
safety