Different Language Models Learn Similar Number Representations


TLDR

  • Transformers, Linear RNNs, LSTMs, and word embeddings all independently converge on periodic Fourier number representations with dominant periods T=2, 5, 10.

Key Takeaways

  • Two-tiered hierarchy: all tested architectures produce Fourier-domain spikes at T=2,5,10, but only some develop geometrically separable mod-T features usable for linear classification.
  • Formal result: Fourier-domain sparsity is necessary but not sufficient for mod-T geometric separability; the gap explains why representations that look similar in the Fourier domain vary in downstream utility.
  • Data, architecture, optimizer, and tokenizer each independently shape whether geometric separability emerges from training.
  • Two acquisition routes for separable features: complementary co-occurrence signals in natural language data, or multi-token (not single-token) arithmetic training.
  • Framed as convergent evolution: diverse model families reach similar internal structure from different training signals and objectives.
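The two-tiered hierarchy above can be sketched with synthetic embeddings. This is a toy illustration under assumed conditions, not the paper's models or data: integers 0–99 get hypothetical embeddings whose cosine-only periodic components reproduce Fourier-domain spikes at T = 2, 5, 10, while mod-5 linear separability appears only once the matching sine components are also present, mirroring the necessary-but-not-sufficient gap.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 8          # integers 0..99, hypothetical embedding width
n = np.arange(N)

# Toy "number embeddings" with periodic mod-T components at the
# periods the paper reports (T = 2, 5, 10), plus Gaussian noise.
noise = 0.05 * rng.standard_normal((N, D))
E_spiky = noise.copy()                      # cosines only
for dim, T in zip((0, 1, 2), (2, 5, 10)):
    E_spiky[:, dim] += np.cos(2 * np.pi * n / T)
E_sep = E_spiky.copy()                      # cosines + sine partners
for dim, T in zip((3, 4), (5, 10)):
    E_sep[:, dim] += np.sin(2 * np.pi * n / T)

# Tier 1: Fourier spikes. Power along the integer axis peaks at
# frequency index k = N / T for each injected period.
spectrum = (np.abs(np.fft.rfft(E_spiky, axis=0)) ** 2).sum(axis=1)
spectrum[0] = 0                             # drop the DC component
top_k = np.argsort(spectrum)[-3:]
print(sorted(float(N / k) for k in top_k))  # -> [2.0, 5.0, 10.0]

# Tier 2: geometric separability. A least-squares linear probe for
# n mod 5 fails on cosine-only embeddings but succeeds with sines.
def probe_acc(E, y, classes=5):
    X = np.hstack([E, np.ones((len(E), 1))])    # add bias column
    W, *_ = np.linalg.lstsq(X, np.eye(classes)[y], rcond=None)
    return float(((X @ W).argmax(axis=1) == y).mean())

y = n % 5
print(probe_acc(E_spiky, y), probe_acc(E_sep, y))
```

The cosine-only embeddings already show the three Fourier spikes, yet the probe stays well below perfect accuracy because cos(2πr/5) takes the same value for residues r and 5−r, so classes 1/4 and 2/3 collapse onto identical features; adding the sine partner completes the mod-5 circle and makes all five residue classes linearly separable.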

Hacker News Comment Review

  • Core debate centers on whether convergence is data-driven (corpus statistics, Benford’s Law-like distributions) or architecture-driven; commenters felt the paper addressed this but left the split under-quantified.
  • The “platonic representation hypothesis” framing resonated; one commenter flagged a practical upside: shared representations could provide an entry point for injecting innate mathematical operation primitives into models, a currently unsolved alignment problem.
  • The HN submission title was flagged as editorialized, claiming more than the paper actually does; such title inflation is a common trust signal to watch for on ML paper threads.

Notable Comments

  • @zjp: convergence across architectures and languages makes a strong empirical case for universal grammar – models aren’t divining structure, they’re recovering it from human text.
  • @dboreham: predicts convergent emergent states across learning systems trained on similar data will turn out to be pervasive and may explain biological instinct.
