Talkie: a 13B vintage language model from 1930


TLDR

  • talkie-1930-13b is a 13B LM trained on 260B tokens of pre-1931 English text, used as a contamination-free testbed for generalization, forecasting, and data-diversity research.

Key Takeaways

  • Vintage LMs are contamination-free by construction, enabling clean generalization evals impossible with modern web-trained models.
  • talkie reaches only ~30% of the learning efficiency of human-transcribed baselines when trained on conventional OCR output; regex cleaning recovers this to ~70% (a cleanup sketch follows this list).
  • Modern VLM-based OCR hallucinated post-1930 facts into the corpus, making it unusable for vintage training despite higher raw accuracy.
  • Temporal leakage is a hard, unsolved problem: talkie-13b still knows details of the Roosevelt presidency and some WWII/UN facts despite n-gram anachronism filtering (a minimal filter sketch also follows this list).
  • The scaling roadmap targets a GPT-3-level vintage model this summer and a GPT-3.5-level model using a projected 1T+ token historical corpus.
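
The post credits regex-based cleaning with recovering most of the OCR efficiency loss but does not publish the rules themselves. A minimal sketch of what such a pass might look like, with hypothetical patterns and an invented clean_ocr_page helper:

```python
import re

# Hypothetical cleanup rules; the post only says "regex cleaning" and does not
# publish the actual patterns used for the talkie corpus.
OCR_FIXES = [
    (re.compile(r"-\s*\n\s*"), ""),   # rejoin words hyphenated across line breaks
    (re.compile(r"[|¦]+"), " "),      # replace column-rule artifacts with a space
    (re.compile(r"[ \t]{2,}"), " "),  # collapse runs of spaces and tabs
]

def clean_ocr_page(text: str) -> str:
    """Apply each fix in order, then drop non-printable leftovers."""
    for pattern, repl in OCR_FIXES:
        text = pattern.sub(repl, text)
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")

print(clean_ocr_page("the tele-\nphone |  exchange"))  # -> the telephone exchange
```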
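
The n-gram anachronism filter is likewise described only at a high level. A minimal sketch, assuming a hand-built blocklist of post-1930 phrases (the list and the contains_anachronism function below are illustrative, not the project's actual filter):

```python
# Toy blocklist of post-1930 phrases; the project's real n-gram list is not published.
ANACHRONISTIC_NGRAMS = {
    ("united", "nations"),
    ("world", "war", "ii"),
    ("second", "world", "war"),
    ("atomic", "bomb"),
}

def contains_anachronism(text: str, max_n: int = 3) -> bool:
    """Return True if any blocked n-gram (2 <= n <= max_n) appears in the text."""
    tokens = text.lower().split()
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) in ANACHRONISTIC_NGRAMS:
                return True
    return False

# Documents that trip the filter would be dropped before training.
assert contains_anachronism("delegates gathered at the united nations assembly")
assert not contains_anachronism("the league of nations convened in geneva")
```

A surface-form filter like this can only catch exact phrases, so facts expressed as paraphrases slip through, which is consistent with the Roosevelt and WWII/UN leakage noted above.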

Hacker News Comment Review

  • Commenters engaged widely with live prompting, surfacing consistent anachronism gaps: the model behaves more like a pre-1900 model than a 1930 one, missing the Great Depression and garbling 1930s technology such as dial telephones.
  • The model’s confident but wrong predictions about the future (no second great war, British India thriving in 2026) generated strong engagement, illustrating both the charm and the epistemic limits of the approach.
  • Several commenters noted that the model’s framing of the Civil War sidesteps slavery as a cause, raising questions about how pre-1931 text distributions encode and propagate historical bias.

Notable Comments

  • @stbullard: talkie describes future “computers” as human office workers doing calculations on a 10-to-6 workday, unaware of digital machines.
  • @Animats: the model credits Edison with a 125 MPH car and gets the London Underground’s traction voltage right but then hallucinates the surrounding context, consistent with uneven weighting toward pre-1900 data.
