Function vectors reliably steer LLM outputs even when logit-lens interpretability methods fail to decode the correct answers from them, suggesting the vectors encode computational instructions rather than the answers themselves. This 4,032-pair study, spanning 12 tasks and 6 models from the Llama, Gemma, and Mistral families, finds that steering works best at early layers while decodable answers surface only at late layers. Key finding: a fundamental mismatch between how models execute tasks and how we can currently explain them.
Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
A 4,032-pair study across Llama, Gemma, and Mistral reveals function vectors steer LLM outputs via early-layer computational instructions—even when logit lens interpretability can't decode them, exposing a fundamental gap between how models execute tasks and how we can currently explain them.
Monday, April 6, 2026, 12:00 PM UTC · 2 min read · Source: arXiv cs.LG (Machine Learning) · By sys://pipeline
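To make the contrast concrete, here is a minimal, hypothetical sketch of the two operations the study compares: injecting a function vector into the residual stream at an early layer (steering), and projecting an intermediate hidden state through the unembedding matrix (the logit lens). Everything here is an assumption for illustration; a toy layer stack stands in for a real LLM, and names like `fv`, `forward`, and `logit_lens` are not from the paper.

```python
# Minimal sketch (assumption: a toy residual-stream model stands in for a real LLM).
# Contrasts the two operations described in the article:
#   1. steering: add a function vector to the residual stream at an early layer
#   2. logit lens: decode an intermediate hidden state through the unembedding matrix
import torch
import torch.nn as nn

D_MODEL, VOCAB, N_LAYERS = 64, 100, 8
torch.manual_seed(0)

# Toy residual stream: each "layer" contributes a small MLP update.
layers = nn.ModuleList(
    nn.Sequential(nn.Linear(D_MODEL, D_MODEL), nn.GELU(), nn.Linear(D_MODEL, D_MODEL))
    for _ in range(N_LAYERS)
)
final_norm = nn.LayerNorm(D_MODEL)
unembed = nn.Linear(D_MODEL, VOCAB, bias=False)  # maps hidden states to logits

# A stand-in function vector; in practice these are extracted from
# task-conditioned activations (here random, purely for illustration).
fv = torch.randn(D_MODEL)

def forward(h, steer_layer=None):
    """Run the residual stream; optionally inject fv at one layer."""
    hidden_states = []
    for i, layer in enumerate(layers):
        if i == steer_layer:
            h = h + fv          # steering: a single additive edit to the residual stream
        h = h + layer(h)        # residual update (toy)
        hidden_states.append(h)
    return h, hidden_states

def logit_lens(h):
    """Decode an intermediate hidden state as if it were the final one."""
    return unembed(final_norm(h))

x = torch.randn(D_MODEL)
_, states = forward(x, steer_layer=2)  # early-layer injection, where the study finds steering works best

# The article's core mismatch: the early-layer state that carries the steering
# signal need not yield the task answer under the logit lens, while decodable
# answers emerge only at late layers.
for i in (2, N_LAYERS - 1):
    top_tok = logit_lens(states[i]).argmax().item()
    print(f"layer {i}: logit-lens top token id = {top_tok}")
```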
Tags: models