Show HN: A new benchmark for testing LLMs for deterministic outputs

· ai coding ·

TLDR

  • SOB exposes a 15-30 point gap between JSON parse rate and actual leaf-value accuracy across 21 models, tested on text, image, and audio inputs.

Key Takeaways

  • Every frontier model clears 95%+ on JSON Pass Rate, but Value Accuracy (exact field-level correctness) drops 15-30 points below it on the same records; both metrics are sketched in code after this list.
  • Audio is the hardest modality by far: best Value Accuracy is 23.7% (Gemini-2.5-Flash) vs 83.0% on text and 67.2% on images, even with text-normalized context.
  • Model size does not predict ranking: Qwen3.5-35B leads Value Accuracy (80.1%), beating GPT-5.4, Claude-Sonnet-4.6, and GPT-5 on that metric.
  • Perfect Response Rate (every leaf value exactly correct) falls below 50% even for the top-ranked GPT-5.4, which scores 46.9%.
  • The leaderboard is weighted by schema complexity (easy 1.0×, medium 2.0×, hard 3.0×), so harder nested schemas drive rank more than easy ones.
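
The three metrics are simple enough to reproduce. Below is a minimal Python sketch of how they could be computed from predicted and gold records; the function names and the weighting helper are hypothetical, inferred from the bullets above rather than taken from the benchmark's actual code.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested JSON into {leaf_path: value} pairs."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    out = {}
    for key, val in items:
        out.update(flatten(val, f"{prefix}.{key}" if prefix else str(key)))
    return out

_MISSING = object()  # sentinel so a missing leaf never matches a gold null

def score_record(raw_output, gold):
    """Return (parsed_ok, value_accuracy, perfect) for one record."""
    try:
        pred = json.loads(raw_output)  # JSON Pass Rate counts only this step
    except json.JSONDecodeError:
        return False, 0.0, False
    gold_leaves = flatten(gold)
    pred_leaves = flatten(pred)
    # Value Accuracy: fraction of gold leaves reproduced exactly.
    correct = sum(pred_leaves.get(p, _MISSING) == v for p, v in gold_leaves.items())
    value_acc = correct / max(len(gold_leaves), 1)
    # Perfect Response Rate: every single leaf exactly correct.
    return True, value_acc, correct == len(gold_leaves)

# Leaderboard weighting by schema complexity (easy 1.0x, medium 2.0x, hard 3.0x).
WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}

def weighted_score(records):
    """records: iterable of (difficulty, value_accuracy) pairs."""
    records = list(records)
    total = sum(WEIGHTS[d] * acc for d, acc in records)
    return total / sum(WEIGHTS[d] for d, _ in records)
```

The reported 15-30 point gap is then just the mean parse rate minus the mean Value Accuracy over the same records.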

Hacker News Comment Review

  • Benchmark coverage gaps drew immediate pushback: missing models like Opus 4.7 and Gemini 3.1 Pro make the leaderboard feel selective, and commenters want explicit inclusion criteria stated upfront.
  • A two-pass architecture came up as a practical alternative: solve the task in one LLM call, then serialize the answer to JSON in a second, rather than asking one model to do both in a single pass (sketched after this list).
  • Structured decoding was raised as an unaddressed variable: none of the 21 models appears to have been tested with constrained decoding enabled, which could substantially change the Value Accuracy results (see the second sketch below).
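
The two-pass split is easy to express in code. Here is a minimal sketch of the commenters' proposal, assuming a hypothetical `call_llm` stand-in for any chat-completion client; this is not something the benchmark itself implements.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a single chat-completion API call."""
    raise NotImplementedError

def two_pass_extract(task_prompt: str, schema: dict) -> dict:
    # Pass 1: solve the task in free-form text, with no format constraints.
    answer = call_llm(task_prompt)
    # Pass 2: serialize the already-solved answer into the target schema.
    serialize_prompt = (
        "Convert the following answer into JSON matching this schema.\n"
        f"Schema: {json.dumps(schema)}\n"
        f"Answer: {answer}\n"
        "Return only the JSON object."
    )
    return json.loads(call_llm(serialize_prompt))
```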

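Constrained decoding, the other open variable, guarantees parseable output by construction. One common implementation is OpenAI-style structured outputs, sketched below with a made-up schema and a placeholder model name; whether enabling it helps or hurts Value Accuracy is exactly the question the commenters say the benchmark leaves open.

```python
from openai import OpenAI

client = OpenAI()

# Made-up example schema; the benchmark's real schemas are nested and harder.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["name", "year"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; substitute the model under test
    messages=[{"role": "user", "content": "Extract the record from: ..."}],
    # Constrained decoding: the server restricts sampling so the output
    # must conform to the schema, making it parseable by construction.
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "record", "schema": schema, "strict": True},
    },
)
print(response.choices[0].message.content)
```

This closes the JSON Pass Rate side of the gap to effectively 100%; what it does to Value Accuracy is the untested part.
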
Notable Comments

  • @jadbox: flags Qwen3.5-35B as potentially the best price-performance model for pure JSON extraction workloads given its leaderboard position at 35B params.
  • @broyojo: asks why structured decoding was excluded, a meaningful methodological gap the benchmark does not address.

Original | Discuss on HN