Andrej Karpathy on Software 3.0, Verifiability, and Agentic Engineering
Published 2026-04-29 - Runtime about 30 min - Watch on YouTube
TLDR
- Karpathy says December marked a shift: agentic tools stopped needing frequent corrections, making vibe coding feel fundamentally different.
- He frames Software 3.0 as prompting LLMs through context, where agents can replace brittle install scripts and even whole app layers.
Key Takeaways
- Karpathy contrasts vibe coding, which lowers the floor, with agentic engineering, which preserves professional quality while speeding execution.
- He argues verifiability drives capability: math, code, and other RL-friendly domains improve faster than messy, hard-to-check work.
- He says human judgment still matters for specs, taste, and oversight, while agents handle repetitive API and tensor details.
- He thinks agent-native infrastructure will require systems rewritten for agents first, not for humans reading docs.
Notes
- Karpathy says he felt behind as a programmer after December, when the latest models started repeatedly producing correct chunks of work with little correction needed.
- He treats Software 1.0 as writing code by hand, Software 2.0 as training neural networks, and Software 3.0 as programming with prompts and context; the first sketch after these notes contrasts the three.
- He describes LLMs as interpreters over context windows, with the prompt as the main lever on computation.
- He gives an install example: instead of a complex shell script, a copy-paste instruction to an agent can install software like OpenClaw (second sketch below).
- He says MenuGen showed the old mindset: a Vercel app that OCRs menu items and generates images for them.
- He contrasts that with a simpler Software 3.0 approach: give a menu photo to Gemini and ask it to use Nanobanana to overlay items directly in pixels (third sketch below).
- He extends the idea beyond code, saying LLM knowledge bases can recompile documents into wikis for organizations or individuals.
- As the 2026 analogue of earlier app eras, he imagines neural computers: raw video or audio feeds a neural net that renders the UI directly.
- His verifiability view: models peak in domains with clear rewards, like math and code, but stay jagged in less-verifiable tasks.
- He cites GPT-3.5 to GPT-4 chess gains as an example of capability jumping when chess data was added to pretraining.
- He says founders should look for valuable RL environments and fine-tuning leverage in domains the labs have not yet focused on.
- He thinks almost everything can be made somewhat verifiable, though some tasks are easier to automate than others; the final sketch below shows a minimal verifiable reward.
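Sketches
A minimal sketch of the 1.0/2.0/3.0 taxonomy on a toy sentiment task. The openai SDK usage and the model name are assumptions for illustration; the talk names no specific stack.

```python
from openai import OpenAI

# Software 1.0: the program is hand-written source code.
def sentiment_v1(text: str) -> str:
    positive = {"great", "love", "excellent"}
    return "pos" if any(w in text.lower() for w in positive) else "neg"

# Software 2.0: the program is learned weights; the code only fixes the
# architecture, e.g. clf = LogisticRegression().fit(X, y) in sklearn.

# Software 3.0: the program is the prompt, and the LLM acts as the
# interpreter that executes it over its context window.
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def sentiment_v3(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{
            "role": "user",
            "content": f"Reply with exactly 'pos' or 'neg'. Sentiment of: {text}",
        }],
    )
    return resp.choices[0].message.content.strip()
```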
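The install example, sketched under the assumption of an agent CLI such as Claude Code's print mode (`claude -p`); the exact tool and flag are assumptions, and the instruction text is invented for illustration.

```python
# Instead of maintaining a brittle install.sh that scripts every branch
# (apt vs brew vs ...), hand the goal to a coding agent and let it work
# out the steps on this particular machine.
import subprocess

INSTRUCTION = (
    "Install OpenClaw on this machine. Detect the OS and package "
    "manager, install any missing dependencies, and verify it runs."
)

# "claude -p <prompt>" runs the agent headlessly on one instruction
# (assumed agent CLI; substitute whichever agent you use).
subprocess.run(["claude", "-p", INSTRUCTION], check=True)
```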
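The MenuGen contrast, sketched as a single multimodal call with the google-genai SDK. The model id for Nanobanana ("gemini-2.5-flash-image") and the prompt wording are assumptions.

```python
from google import genai
from PIL import Image

client = genai.Client()  # reads GEMINI_API_KEY from the environment
menu = Image.open("menu.jpg")  # hypothetical input photo

# Old pipeline: OCR menu items, call an image model per item, stitch a UI.
# Software 3.0 version: one prompt, and the model edits the pixels itself.
response = client.models.generate_content(
    model="gemini-2.5-flash-image",  # assumed Nanobanana model id
    contents=[
        menu,
        "For each dish on this menu, overlay a small generated photo "
        "of the dish next to its name, editing the image directly.",
    ],
)

# Save the first image part the model returns.
for part in response.candidates[0].content.parts:
    if part.inline_data:  # binary image payload
        with open("menu_illustrated.png", "wb") as f:
            f.write(part.inline_data.data)
        break
```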
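A final sketch of why verifiable domains are RL-friendly: when outputs can be checked programmatically, the reward is one function. The task, tests, and required function name are invented for illustration.

```python
# Reward for model-generated code: 1.0 if it passes the test suite, else 0.0.
TESTS = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]  # (args, expected result)

def reward(candidate_source: str) -> float:
    scope: dict = {}
    try:
        exec(candidate_source, scope)  # run the model's code (sandbox in practice)
        fn = scope["add"]              # assumed required function name
        return 1.0 if all(fn(*a) == want for a, want in TESTS) else 0.0
    except Exception:
        return 0.0                     # crashes and wrong shapes score zero

print(reward("def add(a, b):\n    return a + b"))  # -> 1.0
# "Write a tasteful spec" admits no such check, hence jagged capability.
```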