Show HN: A new benchmark for testing LLMs for deterministic outputs

· ai coding ·

TLDR

  • SOB exposes a 15-30 point gap between JSON parse rate and actual leaf-value accuracy across 21 models, tested on text, image, and audio inputs.

Key Takeaways

  • Every frontier model clears 95%+ on JSON Pass Rate, but Value Accuracy (exact field-level correctness) drops 15-30 points below it on the same records; both metrics are sketched in code after this list.
  • Audio is the hardest modality by far: best Value Accuracy is 23.7% (Gemini-2.5-Flash) vs 83.0% on text and 67.2% on images, even with text-normalized context.
  • Model size does not predict ranking: Qwen3.5-35B leads Value Accuracy (80.1%), beating GPT-5.4, Claude-Sonnet-4.6, and GPT-5 on that metric.
  • Perfect Response Rate (every leaf value exactly correct) falls below 50% even for the top-ranked GPT-5.4, which scores 46.9%.
  • The leaderboard is weighted by schema complexity (easy 1.0×, medium 2.0×, hard 3.0×), so harder nested schemas drive rank more than easy ones.
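
The three metrics are simple enough to reproduce. Below is a minimal Python sketch of how they could be computed from predicted and gold records; the function names and the weighting helper are hypothetical, inferred from the bullets above rather than taken from the benchmark's actual code.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested JSON into {leaf_path: value} pairs."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    out = {}
    for key, val in items:
        out.update(flatten(val, f"{prefix}.{key}" if prefix else str(key)))
    return out

_MISSING = object()  # sentinel so a missing leaf never matches a gold null

def score_record(raw_output, gold):
    """Return (parsed_ok, value_accuracy, perfect) for one record."""
    try:
        pred = json.loads(raw_output)  # JSON Pass Rate counts only this step
    except json.JSONDecodeError:
        return False, 0.0, False
    gold_leaves = flatten(gold)
    pred_leaves = flatten(pred)
    # Value Accuracy: fraction of gold leaves reproduced exactly.
    correct = sum(pred_leaves.get(p, _MISSING) == v for p, v in gold_leaves.items())
    value_acc = correct / max(len(gold_leaves), 1)
    # Perfect Response Rate: every single leaf exactly correct.
    return True, value_acc, correct == len(gold_leaves)

# Leaderboard weighting by schema complexity (easy 1.0x, medium 2.0x, hard 3.0x).
WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}

def weighted_score(records):
    """records: iterable of (difficulty, value_accuracy) pairs."""
    records = list(records)
    total = sum(WEIGHTS[d] * acc for d, acc in records)
    return total / sum(WEIGHTS[d] for d, _ in records)
```

The reported 15-30 point gap is then just the mean parse rate minus the mean Value Accuracy over the same records.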

Hacker News Comment Review

  • Benchmark coverage gaps drew immediate pushback: missing models like Opus 4.7 and Gemini 3.1 Pro make the leaderboard feel selective, and commenters want explicit inclusion criteria stated upfront.
  • A two-pass architecture came up as a practical alternative: solve the task in one LLM call, then serialize the answer to JSON in a second, rather than asking one model to do both in a single pass (sketched after this list).
  • Structured decoding was raised as an unaddressed variable: none of the 21 models appears to have been tested with constrained decoding enabled, which could substantially change the Value Accuracy results (see the second sketch below).
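
The two-pass split is easy to express in code. Here is a minimal sketch of the commenters' proposal, assuming a hypothetical `call_llm` stand-in for any chat-completion client; this is not something the benchmark itself implements.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a single chat-completion API call."""
    raise NotImplementedError

def two_pass_extract(task_prompt: str, schema: dict) -> dict:
    # Pass 1: solve the task in free-form text, with no format constraints.
    answer = call_llm(task_prompt)
    # Pass 2: serialize the already-solved answer into the target schema.
    serialize_prompt = (
        "Convert the following answer into JSON matching this schema.\n"
        f"Schema: {json.dumps(schema)}\n"
        f"Answer: {answer}\n"
        "Return only the JSON object."
    )
    return json.loads(call_llm(serialize_prompt))
```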

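Constrained decoding, the other open variable, guarantees parseable output by construction. One common implementation is OpenAI-style structured outputs, sketched below with a made-up schema and a placeholder model name; whether enabling it helps or hurts Value Accuracy is exactly the question the commenters say the benchmark leaves open.

```python
from openai import OpenAI

client = OpenAI()

# Made-up example schema; the benchmark's real schemas are nested and harder.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["name", "year"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; substitute the model under test
    messages=[{"role": "user", "content": "Extract the record from: ..."}],
    # Constrained decoding: the server restricts sampling so the output
    # must conform to the schema, making it parseable by construction.
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "record", "schema": schema, "strict": True},
    },
)
print(response.choices[0].message.content)
```

This closes the JSON Pass Rate side of the gap to effectively 100%; what it does to Value Accuracy is the untested part.
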
Notable Comments

  • @jadbox: flags Qwen3.5-35B as potentially the best price-performance model for pure JSON extraction workloads given its leaderboard position at 35B params.
  • @broyojo: asks why structured decoding was excluded, a meaningful methodological gap the benchmark does not address.

Original | Discuss on HN