MetaSAEs proposes jointly training sparse autoencoders with a decomposability penalty, producing more atomic and interpretable latents. The method advances mechanistic interpretability research by addressing entanglement in learned representations.
Research
MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents
A decomposability penalty during sparse autoencoder training produces more isolated, interpretable features, advancing mechanistic interpretability by reducing representation entanglement.
Tuesday, April 7, 2026, 12:00 PM UTC · 2 MIN READ
SOURCE: arXiv CS.LG (Machine Learning) · BY sys://pipeline
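To make the idea concrete, here is a minimal sketch of what a joint sparse-autoencoder objective with a decomposability penalty could look like. This is an illustrative assumption, not the paper's actual formulation: the dimensions, coefficient names (`l1`, `lam`), and the specific penalty (squared off-diagonal cosine similarity between decoder directions, which discourages one latent's direction from being a near-linear combination of others) are all hypothetical stand-ins for whatever penalty the paper defines.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions; the paper's settings are not given here).
d_model, d_sae, batch = 16, 64, 32

# Standard SAE parameters: encoder, decoder, and biases.
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def relu(z):
    return np.maximum(z, 0.0)

def decomposability_penalty(W_dec):
    """Illustrative penalty: mean squared off-diagonal cosine similarity
    between decoder directions. Entangled latents whose directions overlap
    heavily with other latents contribute large off-diagonal terms, so
    shrinking this pushes training toward more atomic features."""
    U = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    G = U @ U.T                          # pairwise cosine similarities
    off = G - np.diag(np.diag(G))        # zero out self-similarities
    return np.sum(off ** 2) / (len(G) * (len(G) - 1))

def sae_loss(x, l1=1e-3, lam=1e-2):
    """Joint objective: reconstruction + L1 sparsity + decomposability."""
    f = relu(x @ W_enc + b_enc)          # latent activations
    x_hat = f @ W_dec + b_dec            # reconstruction
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1 * sparsity + lam * decomposability_penalty(W_dec)

x = rng.normal(size=(batch, d_model))
loss = sae_loss(x)
```

With perfectly orthogonal decoder directions the penalty is zero, so the term only bites when latent directions become linearly entangled.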
Tags
research