He asked AI to count carbs 27,000 times. It couldn't give the same answer twice


TLDR

  • Preprint study ran 26,904 queries across GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Pro, and Gemini 3.1 Pro on 13 food photos; all models produced clinically dangerous carb variance.

Key Takeaways

  • Claude Sonnet 4.6 had the lowest median coefficient of variation (2.4%) and zero queries exceeding a 2-unit insulin error on strong-reference foods; Gemini 2.5 Pro’s worst query would cause an 11.3-unit overdose.
  • The paella photo generated estimates ranging from 55g to 484g under Gemini 2.5 Pro across 500+ queries – a 429g spread equivalent to 42.9 units of insulin at a 1:10 ICR.
  • High consistency does not mean accuracy: all four models converged on ~28g for a cheese sandwich with a packet-verified 40g reference, a 12g systematic underdose.
  • Confidence scores are not a safety signal – Claude’s confidence has zero correlation (r = -0.01) with actual accuracy, and high-confidence queries from Claude were less accurate on average.
  • The study uses the real iAPS open-source AID production prompt, not a toy example, making results directly applicable to shipped diabetes apps.
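The dosing arithmetic behind these takeaways is simple enough to sketch. The snippet below is illustrative, not from the study: it reproduces the reported conversions (an ICR of 1:10 means 1 unit of insulin covers 10 g of carbohydrate) and the coefficient-of-variation metric the paper uses to measure consistency. Function names and the example figures are taken from the bullets above.

```python
from statistics import mean, pstdev

def insulin_units(carbs_g: float, icr: float = 10.0) -> float:
    """Convert a carb estimate in grams to an insulin bolus in units
    at a given insulin-to-carb ratio (grams covered per unit)."""
    return carbs_g / icr

def coefficient_of_variation(estimates: list[float]) -> float:
    """CV = population standard deviation / mean, as a percentage.
    This is the per-photo consistency metric reported in the study."""
    return pstdev(estimates) / mean(estimates) * 100

# Paella example: estimates spanned 55 g to 484 g under one model.
spread = round(insulin_units(484) - insulin_units(55), 1)
print(spread)  # 42.9 units of dosing spread at a 1:10 ICR

# Cheese sandwich example: a consistent ~28 g estimate against a
# packet-verified 40 g reference is a systematic underdose of:
underdose = round(insulin_units(40) - insulin_units(28), 1)
print(underdose)  # 1.2 units
```

The two examples make the study's distinction concrete: the paella spread is a *variance* failure (the same photo yields wildly different doses), while the sandwich is an *accuracy* failure (a perfectly consistent estimate that is consistently wrong).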

Hacker News Comment Review

  • Strong consensus that visual carb estimation is fundamentally information-limited: a photo cannot reveal oil content, hollow cavities, or portion density, so any model will be stochastic regardless of size.
  • The study’s value is as a quantified warning against production use, not as a benchmark for a reasonable task – several commenters note real apps are already shipping this to diabetics, and that justifies the data.
  • Crema catalana vs. crème brûlée misidentification drew pushback: the two dishes are nearly identical in appearance and composition, so calling it a hallucination overstates the error.

Notable Comments

  • @ozbonus: pushes back on dismissiveness – “people are using LLMs for this kind of thing. Lots of people. All the time” – and argues the study exists precisely to provide evidence against it.
  • @Aurornis: notes real carb-counting apps likely use cheaper fine-tuned models with forced structured output, stripping safety warnings, making variance worse than what this study measured.
