MetaSAEs proposes jointly training sparse autoencoders with a decomposability penalty, producing more atomic and interpretable latents. The method advances mechanistic interpretability research by addressing entanglement in learned representations.
Research
MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents
A decomposability penalty during sparse autoencoder training produces more isolated, interpretable features, advancing mechanistic interpretability by reducing representation entanglement.
Tuesday, April 7, 2026, 12:00 PM UTC · 2 MIN READ
SOURCE: arXiv CS.LG (Machine Learning) · BY sys://pipeline
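To make the idea concrete, here is a minimal sketch of what a joint sparse-autoencoder objective with a decomposability penalty could look like. This is an illustrative assumption, not the paper's actual formulation: the dimensions, coefficient names (`l1`, `lam`), and the specific penalty (squared off-diagonal cosine similarity between decoder directions, which discourages one latent's direction from being a near-linear combination of others) are all hypothetical stand-ins for whatever penalty the paper defines.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions; the paper's settings are not given here).
d_model, d_sae, batch = 16, 64, 32

# Standard SAE parameters: encoder, decoder, and biases.
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def relu(z):
    return np.maximum(z, 0.0)

def decomposability_penalty(W_dec):
    """Illustrative penalty: mean squared off-diagonal cosine similarity
    between decoder directions. Entangled latents whose directions overlap
    heavily with other latents contribute large off-diagonal terms, so
    shrinking this pushes training toward more atomic features."""
    U = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    G = U @ U.T                          # pairwise cosine similarities
    off = G - np.diag(np.diag(G))        # zero out self-similarities
    return np.sum(off ** 2) / (len(G) * (len(G) - 1))

def sae_loss(x, l1=1e-3, lam=1e-2):
    """Joint objective: reconstruction + L1 sparsity + decomposability."""
    f = relu(x @ W_enc + b_enc)          # latent activations
    x_hat = f @ W_dec + b_dec            # reconstruction
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1 * sparsity + lam * decomposability_penalty(W_dec)

x = rng.normal(size=(batch, d_model))
loss = sae_loss(x)
```

With perfectly orthogonal decoder directions the penalty is zero, so the term only bites when latent directions become linearly entangled.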
Tags
research