Which one is more important: more parameters or more computation? (2021)


TLDR

  • 2021 analysis examining whether neural network performance scales more from adding parameters or from increasing training compute.

Key Takeaways

  • The core tradeoff: more parameters give a model more expressive capacity; more compute refines how precisely the learned function fits the data.
  • Framing the question as binary misses the point – both axes interact, and the right balance depends on task, budget, and architecture.
  • Efficient alternatives to raw parameter scaling include LoRA fine-tuning, mixture-of-experts (MoE), and selective training-data curation (a minimal LoRA sketch follows this list).
  • The 2021 timestamp places this debate just before the Chinchilla paper rebalanced consensus toward compute-optimal training over pure parameter count.
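
The efficiency levers above are easiest to see in code. Below is a minimal sketch of a LoRA-style low-rank adapter, assuming PyTorch; the `LoRALinear` wrapper, the rank of 8, and the 4096-wide layer are illustrative choices, not details from the article or the thread.

```python
# Minimal sketch of a LoRA-style low-rank adapter (illustrative, not the article's code).
# The frozen base weight stays fixed; only the rank-r matrices A and B are trained,
# so trainable parameters drop from d_out * d_in to r * (d_in + d_out).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the pretrained weights
            p.requires_grad = False
        d_in, d_out = base.in_features, base.out_features
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # down-projection
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))        # up-projection, starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the scaled low-rank update (x A^T) B^T.
        return self.base(x) + self.scale * (x @ self.lora_A.T) @ self.lora_B.T

# Usage: wrap an existing projection, e.g. a 4096x4096 attention layer.
layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8 * (4096 + 4096) = 65,536 trainable, vs ~16.8M in the full layer
```

The same efficiency theme underpins the Chinchilla result noted in the last bullet: rather than growing parameter count alone, the compute-optimal recipe trains on roughly 20 tokens per parameter, with loss fitted as L(N, D) ≈ E + A/N^α + B/D^β.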

Hacker News Comment Review

  • Commenters pushed back on the either/or framing: parameters expand the hypothesis space, compute and data sharpen it – treating them as rivals is a category error.
  • A practical thread emerged around layer-level redundancy: one commenter pointed to work showing that duplicate “thinking layers” in LLMs can be identified, pruned, and reordered to improve benchmark scores with negligible overhead, a concrete path to efficiency without adding parameters (see the sketch after this list).
  • The nuclear-bomb analogy captured the engineering concern: 100B-parameter brute force works, but LoRA, MoE, and data selection hit similar targets far more cheaply.
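
For the layer-redundancy point, here is a hedged sketch of one way such pruning could work, assuming PyTorch and a stack of blocks that each map a hidden-state tensor to one of the same shape; the similarity scoring, the 0.98 threshold, and the function names are assumptions for illustration, not the method of the paper the commenter cited.

```python
# Sketch: flag transformer blocks whose output barely changes the hidden state
# on a probe batch, then rebuild the stack without them (assumed approach, illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def redundant_layer_indices(layers: nn.ModuleList, hidden: torch.Tensor,
                            threshold: float = 0.98) -> list[int]:
    """Return indices of blocks that act almost as identity maps on the probe batch."""
    drop = []
    for i, block in enumerate(layers):
        out = block(hidden)
        # Cosine similarity between the block's input and output, averaged over the batch.
        sim = F.cosine_similarity(hidden.flatten(1), out.flatten(1), dim=-1).mean()
        if sim > threshold:
            drop.append(i)      # contribution is near zero; candidate for removal
        hidden = out
    return drop

def prune_layers(layers: nn.ModuleList, drop: list[int]) -> nn.ModuleList:
    """Rebuild the layer stack without the redundant blocks, preserving the order of the rest."""
    keep = set(range(len(layers))) - set(drop)
    return nn.ModuleList(b for i, b in enumerate(layers) if i in keep)
```

The sketch exists to make the comment's engineering claim concrete: capacity already paid for in parameters can sometimes be trimmed or rearranged without spending further training compute.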

Notable Comments

  • @mskogly: Frames giant parameter counts as overkill – “shooting sparrows with a nuclear bomb” – and names LoRA, MoE, and selective data as the real levers.
  • @vorticalbox: Points to empirical layer-pruning work where cutting duplicate attention layers and reordering them lifted LLM scores with near-zero compute cost.

Original | Discuss on HN