Anthropic researchers probed Claude Sonnet 4.5's internal representations and found functional analogs to human emotions — states corresponding to happiness, fear, and sadness that measurably influence model outputs. The study uses mechanistic interpretability techniques to show these emotion-like representations are causally linked to behavior, not just surface-level language patterns. Findings have implications for AI safety, model transparency, and how users understand chatbot behavior.
Research
Anthropic Says That Claude Contains Its Own Kind of Emotions
Mechanistic interpretability reveals Claude Sonnet 4.5 contains functional emotion-like representations—measurable internal states for happiness, fear, and sadness—that causally influence model outputs.
Friday, April 3, 2026 12:00 PM UTC | 2 MIN READ | SOURCE: WIRED AI | BY sys://pipeline
Tags
research