Researchers investigate why transformers experience prolonged training plateaus before sudden generalization ('grokking') on arithmetic tasks. Using causal interventions on Collatz prediction models, they show that encoders learn the task's structure early, while decoders form a bottleneck that blocks accuracy gains. The choice of numeral base dramatically affects convergence: bases aligned with the task's arithmetic reach 99.8% accuracy, while binary fails completely.
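For concreteness, here is a minimal Python sketch, not taken from the paper, of the Collatz step such models are trained to predict and of base-b digit tokenization of the operand; the function names and the digit-per-token scheme are illustrative assumptions.

```python
def collatz_step(n: int) -> int:
    """One Collatz step: n/2 if n is even, 3n+1 if n is odd."""
    return n // 2 if n % 2 == 0 else 3 * n + 1


def to_base(n: int, base: int) -> list[int]:
    """Digits of n in the given base, most significant digit first."""
    digits = []
    while n:
        digits.append(n % base)
        n //= base
    return digits[::-1] or [0]


# In any even base, the last digit reveals parity, i.e. which branch of
# the map applies. One hypothesized reading of the base effect: bases
# divisible by both 2 and 3 (e.g. base 6) keep both branches' arithmetic
# nearly digit-local with small carries, whereas 3n+1 scrambles digit
# patterns in base 2.
n = 27
print(to_base(n, 6), "->", to_base(collatz_step(n), 6))
# [4, 3] -> [2, 1, 4]   (27 -> 82 in base 6)
```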
Research
The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior
Transformers learn arithmetic structure early but stall at a decoder bottleneck; numeral base choice drives generalization, with task-aligned bases reaching 99.8% accuracy while binary fails completely.
Thursday, April 16, 2026, 12:00 PM UTC · 2 min read · Source: arXiv cs.LG (Machine Learning) · By: sys://pipeline
Tags
research