sparse autoencoders
3 mentions across all digests
Sparse autoencoders are neural network models used in mechanistic interpretability research to extract atomic, interpretable features from large language model internals, with applications ranging from understanding GPT-4's learned concepts to steering computational fluid dynamics surrogates.
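The recipe behind all three entries is the same: train an autoencoder on a model's internal activations with a sparsity penalty, so that each latent ideally captures a single human-interpretable feature. Below is a minimal sketch of that standard setup, assuming PyTorch; the dimensions and L1 coefficient are illustrative placeholders, not values from any paper listed here.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over model activations."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)  # activations -> latents
        self.decoder = nn.Linear(d_latent, d_model)  # latents -> reconstruction

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))  # non-negative latents, pushed toward sparsity
        x_hat = self.decoder(z)
        return x_hat, z

# Placeholder sizes: a 768-dim residual stream expanded into 16k latents.
sae = SparseAutoencoder(d_model=768, d_latent=16384)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # trades reconstruction fidelity for sparsity

def training_step(acts: torch.Tensor) -> torch.Tensor:
    """One step on a batch of captured activations, shape [batch, d_model]."""
    x_hat, z = sae(acts)
    recon = ((x_hat - acts) ** 2).mean()       # reconstruction error
    sparsity = z.abs().mean()                  # L1 penalty on latent activity
    loss = recon + l1_coeff * sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss
```

Large-scale variants often swap the L1 penalty for a hard top-k constraint on the latents, which is the route taken in work at GPT-4 scale.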
Sparse Autoencoders as a Steering Basis for Phase Synchronization in Graph-Based CFD Surrogates
Sparse autoencoders enable interpretable, fine-grained steering of graph-based CFD surrogates—offering a mechanistic interpretability approach to controlling neural physics simulations (a generic sketch of the steering pattern follows the list below).
MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents
A decomposability penalty during sparse autoencoder training produces more isolated, interpretable features—advancing mechanistic interpretability by reducing representation entanglement.
Extracting Concepts from GPT-4
Scaling sparse autoencoders to millions of latents surfaces interpretable features in GPT-4, showing that feature extraction can reach frontier-scale models.
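The steering idea in the first entry follows a pattern that recurs across SAE work: encode an activation into the sparse latent basis, nudge a chosen latent, decode, and hand the edited activation back to the model. Here is a toy sketch of that generic pattern, reusing the `sae` defined above; it is not the paper's actual mechanism for graph-based CFD surrogates, and `feature_idx` and `strength` are hypothetical knobs.

```python
@torch.no_grad()
def steer(acts: torch.Tensor, feature_idx: int, strength: float) -> torch.Tensor:
    """Amplify one SAE latent and return the edited activation.

    Generic SAE-steering sketch; feature_idx and strength are
    hypothetical, not values from any paper on this page.
    """
    x_hat, z = sae(acts)
    error = acts - x_hat            # part of the activation the SAE misses
    z[:, feature_idx] += strength   # nudge the chosen latent
    return sae.decoder(z) + error   # decode, then restore the residual
```

Adding the reconstruction error back means the edit moves the activation only along the chosen feature's decoder direction, rather than also imposing the SAE's reconstruction loss on everything else.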