OpenAI audited SWE-bench Verified and found that 59.4% of sampled problems have flawed tests; the audit also confirmed training contamination across all frontier models tested.
Key Takeaways
An audit of 138 problems that o3 consistently failed found that 35.5% have overly narrow tests enforcing specific implementation details, and 18.8% test functionality never mentioned in the problem description (both flaw types are illustrated in the first sketch after this list).
All frontier models probed (GPT-5.2, Claude Opus 4.5, Gemini 3 Flash) could reproduce gold patches or verbatim problem specifics, confirming benchmark contamination in training data.
Contamination inflates scores mechanically: models exposed to solutions during training pass underspecified tests by recalling the exact fix rather than reasoning from the problem statement (the second sketch after this list shows a simple overlap probe for such recall).
OpenAI has stopped reporting SWE-bench Verified scores and recommends SWE-bench Pro, which showed significantly fewer verbatim gold-patch recalls when run through the same contamination pipeline.
The longer-term fix involves privately authored benchmarks like GDPVal, where tasks are written by domain experts and graded holistically rather than by automated test suites.
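To make the flaw taxonomy concrete, here is a small hypothetical illustration (invented for this summary, not taken from the benchmark): a correct fix for a simple issue, followed by a test that pins an implementation detail and a test that checks behavior the issue never mentions.

```python
import re

# Hypothetical issue text: "normalize_path() should collapse repeated
# slashes." The implementation below is a correct fix for that issue.
def normalize_path(path: str) -> str:
    return re.sub(r"/+", "/", path)

# Flaw 1 -- overly narrow test (the 35.5% bucket): it pins an exact
# error message the issue never specifies, so this correct fix fails
# even though its behavior matches the problem statement.
def test_narrow():
    try:
        normalize_path(None)
    except TypeError as e:
        assert str(e) == "path must be a string, got NoneType"

# Flaw 2 -- unmentioned functionality (the 18.8% bucket): the issue
# only asks about repeated slashes, but the test also demands that a
# trailing slash be stripped, so the correct fix fails again.
def test_unmentioned():
    assert normalize_path("a//b/") == "a/b"

# A well-specified test checks only the stated behavior, and passes.
def test_behavioral():
    assert normalize_path("a//b///c") == "a/b/c"
```

Under tests like the first two, a model that reasons correctly from the issue fails, while a model that has memorized the original repository's exact patch passes.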
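And a minimal sketch of how verbatim recall can be probed, assuming a long-n-gram overlap heuristic of the kind used in GPT-3-era decontamination work; this is an illustration, not OpenAI's published pipeline, and `query_model` below is a hypothetical placeholder for a completion API.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(generated: str, gold: str, n: int = 13) -> float:
    """Fraction of the generated patch's n-grams that appear verbatim
    in the gold patch. Whitespace tokenization keeps this a rough
    heuristic: near 1.0 suggests recall, near 0.0 suggests the patch
    was produced independently."""
    gen_grams = ngrams(generated.split(), n)
    if not gen_grams:
        return 0.0
    gold_grams = ngrams(gold.split(), n)
    return len(gen_grams & gold_grams) / len(gen_grams)

# Hypothetical usage -- query_model() stands in for a real API call:
# candidate = query_model(issue_text_only)      # no repo context given
# if verbatim_overlap(candidate, gold_patch) > 0.5:
#     print("likely contaminated: model reproduces the gold patch")
```

Long n-grams are the point of the heuristic: short overlaps occur by chance in any diff, but a 13-token run shared with the gold patch is very unlikely unless the model saw that patch during training.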
Hacker News Comment Review
Commenters broadly agree the contamination problem is structural and self-reinforcing: any public benchmark eventually enters training data, so the improvement timeline from release to saturation is compressing toward months, not years.
There is skepticism that SWE-bench Pro solves the root issue; the consensus leans toward private, never-published test sets as the only durable solution, with ARC-AGI-3 cited as a benchmark whose task design makes memorization harder.
A recurring thread points out that even pre-contamination SWE-bench scores were misleading: many “passing” PRs would not survive human code review, so from the start the metric measured test-passing rather than engineering quality.
Notable Comments
@ofirpress: SWE-bench co-creator notes Verified is now at 93.9% saturation; SWE-bench Multilingual and Multimodal remain unsaturated and will be open-sourced within a month.
@kqr: “virtually no improvement in the rate at which models produced quality code” through 2025; gains were in test-passing, not real coding ability.
@jddj: Links to prior reporting showing that many SWE-bench-passing PRs would not be merged, and that top scores may already be skewed by git history leaks.