Andrej Karpathy on Software 3.0, Verifiability, and Agentic Engineering

· ai-agents

Published 2026-04-29 - Runtime about 30 min - Watch on YouTube

TLDR

  • Karpathy says December marked a shift: agentic tools stopped needing frequent corrections, making vibe coding feel fundamentally different.
  • He frames Software 3.0 as prompting LLMs through context, where agents can replace brittle install scripts and even whole app layers.

Key Takeaways

  • Karpathy contrasts vibe coding, which lowers the floor, with agentic engineering, which preserves professional quality while speeding execution.
  • He argues verifiability drives capability: math, code, and other RL-friendly domains improve faster than messy, hard-to-check work.
  • He says human judgment still matters for specs, taste, and oversight, while agents handle repetitive API and tensor details.
  • He thinks agent-native infrastructure will require systems rewritten for agents first, not for humans reading docs.
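The verifiability argument above can be made concrete with a sketch: code is RL-friendly because a reward can be computed cheaply and unambiguously by running tests. Everything below (the `check_candidate` function, the toy tests) is illustrative, not from the talk.

```python
"""Sketch: a binary reward function for generated code.
Cheap, automatic, unambiguous checking is the property that
makes math and code RL-friendly in Karpathy's framing."""

def check_candidate(candidate_fn, tests):
    """Return 1.0 if candidate_fn passes every (args, expected) test,
    else 0.0. Crashes count as failures."""
    for args, expected in tests:
        try:
            if candidate_fn(*args) != expected:
                return 0.0
        except Exception:
            return 0.0
    return 1.0

# A toy "model output": a candidate implementation of absolute value.
candidate = lambda x: x if x >= 0 else -x
TESTS = [((3,), 3), ((-4,), 4), ((0,), 0)]

reward = check_candidate(candidate, TESTS)  # → 1.0
```

Messy domains lack this: there is no equivalent of `TESTS` for "write a tasteful spec," which is why, per the talk, those tasks improve more slowly.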

Notes

  • Karpathy says he felt behind as a programmer after December, when the latest models began repeatedly producing correct chunks of code with little correction needed.
  • He treats Software 1.0 as writing code, Software 2.0 as training neural networks, and Software 3.0 as programming with prompts and context.
  • He describes LLMs as interpreters over context windows, with the prompt as the main lever on computation.
  • He gives an install example: instead of a complex shell script, a copy-paste instruction to an agent can install software like OpenClaw.
  • He says MenuGen showed the old mindset: a Vercel app that OCRs menu items and generates images for them.
  • He contrasts that with a simpler Software 3.0 approach: give a menu photo to Gemini and ask it to use Nanobanana to overlay items directly in pixels.
  • He extends the idea beyond code, saying LLM knowledge bases can recompile documents into wikis for organizations or individuals.
  • For the 2026 equivalent of earlier app eras, he imagines neural computers where raw video or audio feeds a neural net that renders a UI.
  • His verifiability view: model capability advances fastest in domains with clear reward signals, like math and code, but stays jagged in tasks that are hard to check.
  • He cites GPT-3.5 to GPT-4 chess gains as an example of capability jumping when chess data was added to pretraining.
  • He says founders should look for valuable RL environments and fine-tuning leverage in domains labs have not focused on yet.
  • He thinks almost everything can be made somewhat verifiable, but some tasks are easier to automate than others.
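The "LLM as interpreter over a context window" framing in the notes above can be sketched in code: in Software 3.0 terms, assembling the context is the programming step, and the model call merely executes it. The `llm()` function below is a stand-in stub, and all names here are hypothetical, not from the talk.

```python
"""Sketch of Software 3.0: the prompt is the program, the context
window is the interpreter's memory. llm() is a stub, not a real API."""

def build_context(instructions, documents, request):
    # Choosing what goes into the window *is* the programming step:
    # the assembled text is the main lever on the computation.
    parts = ["# Instructions", instructions, "# Reference material"]
    parts.extend(documents)
    parts.append("# Task")
    parts.append(request)
    return "\n\n".join(parts)

def llm(prompt):
    # Stub: a real system would send `prompt` to a model here.
    return f"(model response to {len(prompt)} chars of context)"

context = build_context(
    instructions="You are an install agent. Run each step, then verify it.",
    documents=["README excerpt: requires Python 3.11+"],
    request="Install the tool and confirm it runs.",
)
print(llm(context))
```

This is also why the install example works: a copy-pasteable instruction replaces a brittle shell script because the agent, not a fixed script, interprets the steps against the machine it finds itself on.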