SWE-bench Verified no longer measures frontier coding capabilities


TLDR

  • OpenAI audited SWE-bench Verified and found 59.4% of sampled problems have flawed tests, plus confirmed training contamination across all frontier models tested.

Key Takeaways

  • An audit of 138 problems that o3 consistently failed found that 35.5% have overly narrow tests enforcing specific implementation details, and 18.8% test functionality never mentioned in the problem description.
  • All frontier models probed (GPT-5.2, Claude Opus 4.5, Gemini 3 Flash) could reproduce gold patches or verbatim problem specifics, confirming benchmark contamination in training data.
  • Contamination inflates scores mechanically: models exposed to solutions during training pass underspecified tests by recalling the exact fix rather than reasoning from the problem statement.
  • OpenAI has stopped reporting SWE-bench Verified scores and recommends SWE-bench Pro, which showed significantly fewer verbatim gold-patch recalls in the same contamination pipeline.
  • Longer-term fix involves privately authored benchmarks like GDPVal, where tasks are written by domain experts and graded holistically rather than by automated test suites.
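
The contamination probe described above boils down to checking whether a model's output reproduces the gold patch near-verbatim rather than deriving a fix independently. Below is a minimal, hypothetical sketch of such a check; the function names, the longest-match metric, and the 0.9 threshold are illustrative assumptions, not OpenAI's actual pipeline.

```python
# Hypothetical verbatim-recall check: flags a model patch that reproduces
# a large contiguous chunk of the gold patch, suggesting memorization.
from difflib import SequenceMatcher


def verbatim_recall_score(model_patch: str, gold_patch: str) -> float:
    """Fraction of the gold patch covered by the longest block that
    also appears verbatim in the model's patch."""
    matcher = SequenceMatcher(None, model_patch, gold_patch, autojunk=False)
    match = matcher.find_longest_match(0, len(model_patch), 0, len(gold_patch))
    return match.size / max(len(gold_patch), 1)


def looks_contaminated(model_patch: str, gold_patch: str,
                       threshold: float = 0.9) -> bool:
    # A near-verbatim reproduction is evidence the fix was recalled from
    # training data rather than reasoned out from the problem statement.
    return verbatim_recall_score(model_patch, gold_patch) >= threshold
```

In practice a real pipeline would normalize whitespace, compare at the diff-hunk level, and calibrate the threshold against patches from models known to be contamination-free, but the core signal is the same: exact-substring recall of the published fix.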

Hacker News Comment Review

  • Commenters broadly agree the contamination problem is structural and self-reinforcing: any public benchmark eventually enters training data, so the improvement timeline from release to saturation is compressing toward months, not years.
  • There is skepticism that SWE-bench Pro solves the root issue; the consensus leans toward private, never-published test sets as the only durable solution, with ARC-AGI-3 cited as a benchmark whose task design makes memorization harder.
  • A recurring thread points out that even pre-contamination SWE-bench scores were misleading because many “passing” PRs would not survive human code review, meaning the metric was measuring test-passing, not engineering quality, from the start.

Notable Comments

  • @ofirpress: SWE-bench co-creator notes Verified is now at 93.9% saturation; SWE-bench Multilingual and Multimodal remain unsaturated and will be open-sourced within a month.
  • @kqr: “virtually no improvement in the rate at which models produced quality code” through 2025; gains were in test-passing, not real coding ability.
  • @jddj: Links prior reporting showing many SWE-bench-passing PRs would not be merged, and that top scores may already be skewed by git history leaks.
