Harvard study in Science finds OpenAI o1 diagnosed ER patients correctly 67% of the time vs. 50-55% for human triage doctors given identical EHR data.
Key Takeaways
The study tested 76 real Boston ER patients; the AI and the doctors it was compared against received identical electronic health records with vitals, demographics, and nurse notes.
o1's accuracy rose to 82% when given more detailed patient information, vs. 70-79% for expert physicians, though that gap was not statistically significant.
On 5 clinical case studies, o1 scored 89% on treatment planning vs. 34% for 46 doctors using conventional resources like search engines.
The AI was not tested on visual or behavioral patient signals, making it closer to a second-opinion tool than a full clinical replacement.
Authors propose a “triadic care model”: doctor, patient, and AI working together rather than AI replacing physicians.
Hacker News Comment Review
Commenters flagged a benchmark-validity risk: a recent arXiv paper showed an AI model beating radiologists on chest X-ray QA without access to the actual X-rays, suggesting that leakage or task-design flaws can inflate scores.
Commenters also see the test conditions as heavily favoring the AI: clinical case studies are physician learning tools, not real-time performance benchmarks, and the setup strips doctors of physical observation and patient interaction.
Commenters noted the study does not report which patient subgroups AI underperformed on, such as elderly patients or non-English speakers, limiting safety conclusions for routine deployment.
Notable Comments
@gpm: cites a concrete arXiv case where AI beat radiologists without seeing the X-rays, urging caution on benchmark design.
@lokar: argues that accuracy is the wrong metric; the clinical goal is minimizing total patient harm, not hitting the most likely diagnosis.