A systematic empirical study of the singular value spectra of transformer weight matrices during pretraining, across model scales from 30M to 285M parameters. The study finds transient compression waves that propagate across layers, persistent power-law spectral gradients that form depth-dependent patterns, and functional asymmetries between attention projections (illustrated in the sketch below).
Research
The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K–V Asymmetry
Transient compression waves propagate systematically through transformer layers during pretraining, while persistent power-law spectral gradients settle into depth-dependent patterns, revealing fundamental asymmetries between attention projection types that hold consistently from 30M to 285M parameters.
Tuesday, April 28, 2026 12:00 PM UTC · 2 MIN READ · SOURCE: arXiv cs.LG (Machine Learning)
Tags
research
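To make the measured quantities concrete, here is a minimal, hedged sketch of how per-layer singular value spectra and power-law exponents of the kind described above could be computed in PyTorch. The paper's actual models, checkpoints, and fitting procedure are not given here; the `q_proj`/`k_proj`/`v_proj` names, the toy dimensions, and the rank range of the fit are illustrative assumptions, not the authors' method.

```python
# Hedged sketch: measuring singular value spectra of attention
# projection weights and fitting a power-law exponent per matrix.
# Layer names and dimensions are hypothetical, not from the paper.
import numpy as np
import torch
import torch.nn as nn


def spectrum(weight: torch.Tensor) -> np.ndarray:
    """Singular values of a weight matrix, sorted descending."""
    return torch.linalg.svdvals(weight.detach().float()).cpu().numpy()


def power_law_exponent(sv: np.ndarray, tail: float = 0.9) -> float:
    """Fit log(sigma_i) ~ -alpha * log(i) over the leading `tail`
    fraction of the spectrum; returns the fitted exponent alpha.
    (The fit range is an assumption, not the paper's protocol.)"""
    k = max(2, int(len(sv) * tail))
    ranks = np.arange(1, k + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(sv[:k] + 1e-12), 1)
    return -slope


# Toy stand-in for one transformer block's attention projections.
d_model = 64
layer = nn.ModuleDict({
    "q_proj": nn.Linear(d_model, d_model, bias=False),
    "k_proj": nn.Linear(d_model, d_model, bias=False),
    "v_proj": nn.Linear(d_model, d_model, bias=False),
})

for name, proj in layer.items():
    sv = spectrum(proj.weight)
    print(f"{name}: alpha ~ {power_law_exponent(sv):.3f}, "
          f"top sigma = {sv[0]:.3f}")
```

Run over a series of saved checkpoints and plotted as alpha per layer per training step, measurements like these would be one way to surface depth-dependent spectral gradients, transient waves, and any Q/K-versus-V asymmetry of the kind the headline describes.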