Andrej Karpathy on Software 3.0, Verifiability, and Agentic Engineering
Published 2026-04-29 - Runtime about 30 min - Watch on YouTube
TLDR
- Karpathy says December marked a shift: agentic tools stopped needing frequent corrections, making vibe coding feel fundamentally different.
- He frames Software 3.0 as prompting LLMs through context, where agents can replace brittle install scripts and even whole app layers.
Key Takeaways
- Karpathy contrasts vibe coding, which lowers the floor, with agentic engineering, which preserves professional quality while speeding execution.
- He argues verifiability drives capability: math, code, and other RL-friendly domains improve faster than messy, hard-to-check work.
- He says human judgment still matters for specs, taste, and oversight, while agents handle repetitive API and tensor details.
- He thinks agent-native infrastructure will require systems rewritten for agents first, not for humans reading docs.
Notes
- Karpathy says he felt behind as a programmer after December, when the latest models started repeatedly producing correct chunks of work with little correction needed.
- He treats Software 1.0 as writing code by hand, Software 2.0 as training neural networks, and Software 3.0 as programming with prompts and context; the first sketch after these notes contrasts the three.
- He describes LLMs as interpreters over context windows, with the prompt as the main lever on computation.
- He gives an install example: instead of a complex shell script, a copy-paste instruction to an agent can install software like OpenClaw (second sketch below).
- He says MenuGen showed the old mindset: a Vercel app that OCRs menu items and generates images for them.
- He contrasts that with a simpler Software 3.0 approach: give a menu photo to Gemini and ask it to use Nanobanana to overlay items directly in pixels (third sketch below).
- He extends the idea beyond code, saying LLM knowledge bases can recompile documents into wikis for organizations or individuals.
- As the 2026 analogue of earlier app eras, he imagines neural computers: raw video or audio feeds a neural net that renders the UI directly.
- His verifiability view: models peak in domains with clear rewards, like math and code, but stay jagged in less-verifiable tasks.
- He cites GPT-3.5 to GPT-4 chess gains as an example of capability jumping when chess data was added to pretraining.
- He says founders should look for valuable RL environments and fine-tuning leverage in domains the labs have not yet focused on.
- He thinks almost everything can be made somewhat verifiable, though some tasks are easier to automate than others; the final sketch below shows a minimal verifiable reward.
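Sketches
A minimal sketch of the 1.0/2.0/3.0 taxonomy on a toy sentiment task. The openai SDK usage and the model name are assumptions for illustration; the talk names no specific stack.

```python
from openai import OpenAI

# Software 1.0: the program is hand-written source code.
def sentiment_v1(text: str) -> str:
    positive = {"great", "love", "excellent"}
    return "pos" if any(w in text.lower() for w in positive) else "neg"

# Software 2.0: the program is learned weights; the code only fixes the
# architecture, e.g. clf = LogisticRegression().fit(X, y) in sklearn.

# Software 3.0: the program is the prompt, and the LLM acts as the
# interpreter that executes it over its context window.
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def sentiment_v3(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{
            "role": "user",
            "content": f"Reply with exactly 'pos' or 'neg'. Sentiment of: {text}",
        }],
    )
    return resp.choices[0].message.content.strip()
```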
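The install example, sketched under the assumption of an agent CLI such as Claude Code's print mode (`claude -p`); the exact tool and flag are assumptions, and the instruction text is invented for illustration.

```python
# Instead of maintaining a brittle install.sh that scripts every branch
# (apt vs brew vs ...), hand the goal to a coding agent and let it work
# out the steps on this particular machine.
import subprocess

INSTRUCTION = (
    "Install OpenClaw on this machine. Detect the OS and package "
    "manager, install any missing dependencies, and verify it runs."
)

# "claude -p <prompt>" runs the agent headlessly on one instruction
# (assumed agent CLI; substitute whichever agent you use).
subprocess.run(["claude", "-p", INSTRUCTION], check=True)
```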
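The MenuGen contrast, sketched as a single multimodal call with the google-genai SDK. The model id for Nanobanana ("gemini-2.5-flash-image") and the prompt wording are assumptions.

```python
from google import genai
from PIL import Image

client = genai.Client()  # reads GEMINI_API_KEY from the environment
menu = Image.open("menu.jpg")  # hypothetical input photo

# Old pipeline: OCR menu items, call an image model per item, stitch a UI.
# Software 3.0 version: one prompt, and the model edits the pixels itself.
response = client.models.generate_content(
    model="gemini-2.5-flash-image",  # assumed Nanobanana model id
    contents=[
        menu,
        "For each dish on this menu, overlay a small generated photo "
        "of the dish next to its name, editing the image directly.",
    ],
)

# Save the first image part the model returns.
for part in response.candidates[0].content.parts:
    if part.inline_data:  # binary image payload
        with open("menu_illustrated.png", "wb") as f:
            f.write(part.inline_data.data)
        break
```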
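A final sketch of why verifiable domains are RL-friendly: when outputs can be checked programmatically, the reward is one function. The task, tests, and required function name are invented for illustration.

```python
# Reward for model-generated code: 1.0 if it passes the test suite, else 0.0.
TESTS = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]  # (args, expected result)

def reward(candidate_source: str) -> float:
    scope: dict = {}
    try:
        exec(candidate_source, scope)  # run the model's code (sandbox in practice)
        fn = scope["add"]              # assumed required function name
        return 1.0 if all(fn(*a) == want for a, want in TESTS) else 0.0
    except Exception:
        return 0.0                     # crashes and wrong shapes score zero

print(reward("def add(a, b):\n    return a + b"))  # -> 1.0
# "Write a tasteful spec" admits no such check, hence jagged capability.
```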