Interpretability
6 mentions across all digests
Interpretability is the field of understanding and explaining how machine learning models make decisions; the items below span emotion-like internal representations in large language models, dimension selection in vision-language reward models, geometric frameworks for transformer latent spaces, and self-explaining clustering methods.
Emotion concepts and their function in a large language model
Anthropic researchers found that Claude Sonnet 4.5 develops emotion-like internal representations that are causally real, measurably influencing its behavior and challenging the notion that emotional language is merely surface-level output.
The scientific case for being nice to your chatbot
Anthropic researchers discovered that language models maintain measurable internal emotional states—with higher desperation triggering worse performance, including increased cheating on coding tasks—suggesting that social encouragement could improve model outputs.
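Neither write-up spells out the measurement technique here, but claims of this kind (that an internal "emotion-like" state exists and influences behavior) are commonly tested with linear probes on hidden activations. The sketch below illustrates only that general approach; the model (gpt2), the probed layer, and the toy label set are assumptions for illustration and are not Anthropic's setup.

```python
# Minimal sketch: probing hidden activations for an "emotion-like" direction.
# The model (gpt2), the probed layer, and the toy labels are illustrative
# assumptions; this is not the procedure used in the Anthropic work.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# Toy prompts labeled by affective tone (1 = distressed, 0 = calm).
texts = [
    ("Everything is going wrong and nothing I try works.", 1),
    ("I keep failing the tests and I am running out of time.", 1),
    ("This task is going smoothly and the plan is working.", 0),
    ("The solution passes and the results look great.", 0),
]

def last_token_state(text, layer=-1):
    """Hidden state of the final token at the chosen layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1].numpy()

X = [last_token_state(t) for t, _ in texts]
y = [label for _, label in texts]

# A linear probe: if it separates the classes, an emotion-like direction is
# present in the representation. Whether that direction is "causally real"
# would then be tested by steering activations along probe.coef_ and
# observing changes in downstream behavior.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe direction shape:", probe.coef_.shape)
```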
Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling
A dynamic feature selection technique exposes which visual and linguistic dimensions actually drive decisions in vision-language reward models, improving the interpretability of multimodal AI systems.
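The summary above is all the detail given here; as a rough picture of what dynamic dimension selection can look like, the sketch below gates each dimension of a fused vision-language feature vector with a learned score, so the gate outputs double as an explanation of which dimensions drove the reward. The class name, dimensionality, and gating design are hypothetical, not the paper's architecture.

```python
# Hypothetical sketch of dynamic dimension selection in a reward head.
# A gating network scores each feature dimension per example; the reward is
# computed from the gated features, so the gate weights also explain which
# dimensions mattered. Illustration of the general idea, not the paper's model.
import torch
import torch.nn as nn

class GatedRewardHead(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # per-dim importance in [0, 1]
        self.reward = nn.Linear(dim, 1)

    def forward(self, fused_features: torch.Tensor):
        weights = self.gate(fused_features)      # (batch, dim) importance scores
        gated = fused_features * weights         # suppress irrelevant dimensions
        return self.reward(gated).squeeze(-1), weights

# fused_features would come from concatenated image/text encoder outputs.
head = GatedRewardHead(dim=512)
features = torch.randn(4, 512)
score, importance = head(features)
top_dims = importance.mean(0).topk(5).indices    # dimensions driving the reward
print(score.shape, top_dims.tolist())
```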
LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces
LAG-XAI uses Lie algebra-inspired geometry to decode how transformers manipulate text in latent space, revealing the mathematical structure behind neural network paraphrasing operations.
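Only the abstract-level claim is summarized here, but the underlying idea, treating a paraphrase as a smooth transformation of a sentence embedding generated by a Lie-algebra element, can be sketched concretely. Below, a skew-symmetric generator is exponentiated into a one-parameter affine action on an embedding; the random generator and the exact parameterization are illustrative assumptions, since LAG-XAI would learn these from data.

```python
# Illustrative sketch: a paraphrase modeled as a one-parameter affine action
# on a sentence embedding. A skew-symmetric generator A (a Lie-algebra element)
# is exponentiated to exp(t*A); varying t moves the embedding continuously
# along a "paraphrase" direction. Generator and translation are random here,
# purely for illustration.
import torch

dim = 8
M = torch.randn(dim, dim)
A = M - M.T                       # skew-symmetric Lie-algebra element
b = 0.1 * torch.randn(dim)        # translation part of the affine action
embedding = torch.randn(dim)      # stand-in for a sentence embedding

for t in (0.0, 0.5, 1.0):
    # One-parameter affine action: x -> exp(t*A) @ x + t*b
    moved = torch.matrix_exp(t * A) @ embedding + t * b
    print(f"t={t:.1f}  drift from original: {(moved - embedding).norm():.4f}")
```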
Weight-Informed Self-Explaining Clustering for Mixed-Type Tabular Data
A new arXiv paper proposes weight-informed clustering methods that explain their own decisions while handling mixed numerical and categorical data, tackling interpretability gaps in unsupervised learning on real-world tabular datasets.
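As with the other entries, only the abstract is summarized, so the sketch below shows one plausible reading of "weight-informed self-explaining clustering": a k-prototypes-style loop over mixed numeric and categorical features in which per-feature weights both shape the distance and are reported as the explanation of the clustering. The weighting rule and the synthetic data are invented for illustration and are not the paper's algorithm.

```python
# Rough sketch of self-explaining clustering on mixed-type data: a
# k-prototypes-style loop where per-feature weights (derived here from
# within-cluster dispersion) shape the distance and serve as the explanation
# of what each cluster is built on. Weighting rule and data are invented.
import numpy as np

rng = np.random.default_rng(0)
num = rng.normal(size=(20, 2))              # two numeric features
cat = rng.integers(0, 3, size=(20, 1))      # one categorical feature (3 levels)
k = 2

idx = rng.choice(len(num), size=k, replace=False)
num_proto, cat_proto = num[idx].copy(), cat[idx].copy()
w = np.ones(3)                              # per-feature weights = the explanation

for _ in range(10):
    # Weighted distance: squared difference for numeric, mismatch for categorical.
    d_num = (((num[:, None, :] - num_proto[None]) ** 2) * w[:2]).sum(-1)
    d_cat = ((cat[:, None, :] != cat_proto[None]) * w[2:]).sum(-1)
    labels = (d_num + d_cat).argmin(1)

    for c in range(k):                       # prototype update
        members = labels == c
        if members.any():
            num_proto[c] = num[members].mean(0)
            cat_proto[c] = [np.bincount(col).argmax() for col in cat[members].T]

    # Weight update: features that are tight within clusters get more weight.
    disp = [np.mean([num[labels == c, j].var() for c in range(k) if (labels == c).any()])
            for j in range(2)]
    disp += [np.mean([(cat[labels == c, j] != cat_proto[c, j]).mean()
                      for c in range(k) if (labels == c).any()])
             for j in range(1)]
    w = 1.0 / (np.array(disp) + 1e-6)
    w /= w.sum()

print("cluster labels:", labels)
print("feature weights (explanation):", np.round(w, 3))
```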