OpenAI's o1 correctly diagnosed 67% of ER patients vs. 50-55% by triage doctors

TLDR

  • A Harvard study in Science finds that OpenAI's o1 diagnosed ER patients correctly 67% of the time, vs. 50-55% for human triage doctors given identical EHR data.

Key Takeaways

  • The study tested 76 real Boston ER patients; both the AI and the physician pairs received identical electronic health records containing vitals, demographics, and nurse notes.
  • o1's accuracy rose to 82% when given more case detail, vs. 70-79% for expert physicians, though that gap was not statistically significant (see the rough significance check after this list).
  • On 5 clinical case studies, o1 scored 89% on treatment planning vs. 34% for 46 doctors using conventional resources like search engines.
  • AI was not tested on visual or behavioral patient signals, making it closer to a second-opinion tool than a full clinical replacement.
  • Authors propose a “triadic care model”: doctor, patient, and AI working together rather than AI replacing physicians.
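
A quick back-of-the-envelope check illustrates why a gap of a few points can fail to reach significance at this sample size. The sketch below runs a two-proportion z-test on hypothetical counts (62/76 vs. 60/76, chosen only to roughly match the reported 82% and 79%); the study's actual counts and statistical method may differ.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided two-proportion z-test using a pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts approximating 82% vs. 79% correct on 76 patients;
# not the study's reported raw data.
z, p = two_proportion_ztest(62, 76, 60, 76)
print(f"z = {z:.2f}, p = {p:.2f}")  # roughly z = 0.41, p = 0.68 -> not significant
```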

Hacker News Comment Review

  • Commenters flagged a benchmark-validity risk: a recent arXiv paper showed an AI beating radiologists on chest X-ray QA without access to the actual X-rays, suggesting that leakage or task-design flaws can inflate scores.
  • The test conditions are seen as heavily favoring the AI: clinical case studies are physician learning tools, not real-time performance benchmarks, and the setup strips doctors of physical observation and patient interaction.
  • Commenters noted the study does not report which patient subgroups AI underperformed on, such as elderly patients or non-English speakers, limiting safety conclusions for routine deployment.

Notable Comments

  • @gpm: cites a concrete arXiv case where AI beat radiologists without seeing the X-rays, urging caution on benchmark design.
  • @lokar: argues that raw accuracy is the wrong metric; the clinical goal is minimizing total patient harm, not hitting the most likely diagnosis (a toy illustration of the difference follows below).
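
To make that accuracy-vs-harm distinction concrete, here is a toy example with entirely hypothetical probabilities and harm weights: the diagnosis most likely to be correct is not the one that minimizes expected harm when missing a rarer condition is far more costly than over-treating a benign one.

```python
# Toy illustration (hypothetical numbers): picking the most probable diagnosis
# is not the same as picking the option that minimizes expected patient harm.

# Probability that each condition is the true cause of the symptoms.
probabilities = {"benign_condition": 0.80, "dangerous_condition": 0.20}

# Hypothetical harm (arbitrary units) incurred if we commit to the keyed
# diagnosis but the true condition turns out to be the other one.
harm_if_wrong = {
    "benign_condition": 50.0,    # missed dangerous condition -> severe harm
    "dangerous_condition": 2.0,  # over-treated a benign condition -> mild harm
}

def expected_harm(treat_for):
    """Expected harm of committing to one diagnosis, over the alternatives."""
    p_wrong = sum(p for cond, p in probabilities.items() if cond != treat_for)
    return p_wrong * harm_if_wrong[treat_for]

for choice in probabilities:
    print(choice, "expected harm:", expected_harm(choice))

# Most likely diagnosis: benign_condition (p = 0.80), expected harm = 0.20 * 50 = 10.0
# Minimizing expected harm favors pursuing dangerous_condition: 0.80 * 2 = 1.6
```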
