Transformers, Linear RNNs, LSTMs, and word embeddings all independently converge on periodic Fourier number representations with dominant periods T=2, 5, 10.
Key Takeaways
Two-tiered hierarchy: all tested architectures produce Fourier-domain spikes at T = 2, 5, 10, but only some develop geometrically separable mod-T features usable for linear classification (see the probing sketch after this list).
Formal result: Fourier-domain sparsity is necessary but not sufficient for mod-T geometric separability; this gap explains why similar-looking representations vary in downstream utility.
Data, architecture, optimizer, and tokenizer each independently shape whether geometric separability emerges from training.
Two acquisition routes for separable features: complementary co-occurrence signals in natural language data, or multi-token (not single-token) arithmetic training.
Framed as convergent evolution: diverse model families reach similar internal structure from different training signals and objectives.
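To make the two-tier distinction in the takeaways concrete, below is a minimal probing sketch in Python (NumPy + scikit-learn). The embedding matrix E is synthetic here; its construction, the variable names, and the logistic-regression mod-10 probe are illustrative assumptions rather than the paper's actual procedure — in practice E would be read from a trained model's embedding table.

```python
# Sketch: checking number embeddings for (a) Fourier-domain spikes at
# periods 2, 5, 10 and (b) linear separability of mod-T classes.
# E is an (N, d) array whose i-th row embeds the integer i; synthetic here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

N, d = 1000, 64
rng = np.random.default_rng(0)

# Stand-in embeddings: cosine components with periods 2, 5, 10 plus noise.
idx = np.arange(N)[:, None]
phases = rng.uniform(0, 2 * np.pi, size=(3, d))
E = sum(np.cos(2 * np.pi * idx / T + ph) for T, ph in zip((2, 5, 10), phases))
E = E + 0.5 * rng.standard_normal((N, d))

# (a) Fourier-domain check: average the power spectrum over embedding
# dimensions and look for spikes at frequencies 1/2, 1/5, 1/10.
power = np.abs(np.fft.rfft(E - E.mean(axis=0), axis=0)) ** 2
freqs = np.fft.rfftfreq(N)
avg_power = power.mean(axis=1)
avg_power[0] = 0.0                                 # ignore the DC component
top = freqs[np.argsort(avg_power)[-3:]]
print("dominant periods ~", np.sort(1.0 / top))    # expect ~[2, 5, 10]

# (b) Geometric-separability check: a linear probe predicting i mod 10.
# A spiky spectrum alone does not guarantee this probe succeeds.
y = np.arange(N) % 10
X_tr, X_te, y_tr, y_te = train_test_split(E, y, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print("mod-10 linear probe accuracy:", probe.score(X_te, y_te))
```

The two checks are deliberately separate: a spectrum with spikes at 1/2, 1/5, and 1/10 can coexist with a failing probe, which is exactly the necessary-but-not-sufficient gap noted above.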
Hacker News Comment Review
Core debate centers on whether convergence is data-driven (corpus statistics, Benford’s Law-like distributions) or architecture-driven; commenters felt the paper addressed this but left the split under-quantified.
The “platonic representation hypothesis” framing resonated; one commenter flagged a practical upside: shared representations could provide an entry point for injecting innate mathematical-operation primitives into models, which remains an unsolved alignment problem.
The HN submission title was flagged as editorialized and stronger than what the paper actually claims, a common peer-trust signal worth watching for on ML paper threads.
Notable Comments
@zjp: convergence across architectures and languages makes a strong empirical case for universal grammar – models aren’t divining structure, they’re recovering it from human text.
@dboreham: predicts convergent emergent states across learning systems trained on similar data will turn out to be pervasive and may explain biological instinct.