We decreased our LLM costs with Opus

TLDR

  • Total spend running Opus 4.6 came in below the old Sonnet 4.0 setup because a Haiku triager screens every CI failure and stops the ~80% that are duplicates before Opus ever sees them.

Key Takeaways

  • The Haiku triager uses exact matching plus pgvector semantic search to classify failures as duplicates; 4 out of 5 never reach Opus (sketched in the first code block after this list).
  • Opus orchestrates through tightly scoped sub-agent prompts and never reads raw logs directly; it queries ClickHouse via a SQL interface (second block below).
  • Haiku processes 65% of all input tokens yet accounts for only 36% of LLM spend; removing the model hierarchy more than doubles the daily bill.
  • Sub-agents are capped at one level deep to prevent runaway fan-out; Haiku's input-to-output token ratio is 86:1 and the Opus orchestrator's is ~50:1 (third block below).
  • Context hygiene: Opus receives only structured sub-agent summaries; each sub-agent starts with a clean context, which is discarded once the sub-agent completes.
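
To make the triage tier concrete, here is a minimal sketch of the two-tier duplicate gate, assuming Postgres with the pgvector extension. The table and column names, the log-scrubbing rules, the 0.90 threshold, and the `embed()` callable are illustrative assumptions, not the article's actual code; Haiku's classification call sits on top of this and is omitted here.

```python
# Sketch of the two-tier duplicate gate, assuming Postgres + pgvector.
# Table/column names, scrubbing rules, and the 0.90 threshold are
# illustrative assumptions, not the article's code.
import hashlib
import re

import psycopg  # pip install "psycopg[binary]"


def fingerprint(log_tail: str) -> str:
    """Scrub run-specific noise so identical failures hash identically."""
    scrubbed = re.sub(r"\d{4}-\d{2}-\d{2}T[\d:.Z]+", "<TS>", log_tail)
    scrubbed = re.sub(r"0x[0-9a-fA-F]+", "<ADDR>", scrubbed)
    return hashlib.sha256(scrubbed.encode()).hexdigest()


def is_known_failure(conn: psycopg.Connection, log_tail: str, embed) -> bool:
    """Tier 1: exact fingerprint match. Tier 2: pgvector cosine similarity."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT 1 FROM ci_failures WHERE fingerprint = %s LIMIT 1",
            (fingerprint(log_tail),),
        )
        if cur.fetchone():
            return True
        # pgvector's <=> operator is cosine distance; the embedding is passed
        # as a '[x,y,...]' literal. embed() stands in for whatever embedding
        # model produced the stored vectors (the article doesn't name one).
        vec = "[" + ",".join(str(x) for x in embed(log_tail)) + "]"
        cur.execute(
            "SELECT 1 - (embedding <=> %s::vector) AS sim "
            "FROM ci_failures ORDER BY embedding <=> %s::vector LIMIT 1",
            (vec, vec),
        )
        row = cur.fetchone()
        return bool(row and row[0] >= 0.90)  # threshold is an assumption
```

Only failures that survive both tiers escalate to Opus; per the article, that is roughly one in five.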
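
For the "never reads raw logs" rule, one plausible shape is a single read-only SQL tool over ClickHouse's HTTP interface. The endpoint, row cap, and tool/table names below are assumptions; the tool definition follows Anthropic's standard tool-use schema.

```python
# Sketch of a read-only SQL tool the Opus orchestrator could call instead of
# reading raw logs. Endpoint, row cap, and tool/table names are assumptions.
import requests

CLICKHOUSE_URL = "http://clickhouse:8123"  # assumed endpoint
MAX_ROWS = 200  # hard cap so results stay small enough for the Opus context

# Tool definition in Anthropic's tool-use schema; the orchestrator emits
# tool_use blocks carrying a `sql` argument and gets the rows back as text.
SQL_TOOL = {
    "name": "query_ci_events",  # hypothetical name
    "description": "Run read-only SQL over the CI failure events in ClickHouse.",
    "input_schema": {
        "type": "object",
        "properties": {"sql": {"type": "string"}},
        "required": ["sql"],
    },
}


def run_query(sql: str) -> str:
    """Execute the query via ClickHouse's HTTP API, enforcing read-only mode."""
    resp = requests.post(
        CLICKHOUSE_URL,
        params={
            "default_format": "TabSeparatedWithNames",
            "readonly": "1",  # ClickHouse setting: reject any write statement
            "max_result_rows": str(MAX_ROWS),
            "result_overflow_mode": "break",  # truncate instead of erroring
        },
        data=sql,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text
```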
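
The fan-out cap and the context-hygiene rule reduce to a few lines of glue. A sketch, with invented field names and callables:

```python
# Sketch of the one-level fan-out cap plus context hygiene. The dataclass
# fields and the run_agent/summarize callables are invented for illustration.
from dataclasses import dataclass
from typing import Callable

MAX_DEPTH = 1  # sub-agents may not spawn sub-agents of their own


@dataclass
class SubAgentSummary:
    """The only artifact the orchestrator ever sees from a sub-agent."""
    verdict: str      # e.g. "flaky-test", "infra", "regression"
    evidence: str     # a few supporting lines, never the full log
    confidence: float


def spawn_subagent(
    task: str,
    depth: int,
    run_agent: Callable[[str], str],              # fresh context on each call
    summarize: Callable[[str], SubAgentSummary],  # structured fields only
) -> SubAgentSummary:
    if depth >= MAX_DEPTH:
        raise RuntimeError("fan-out capped: sub-agents cannot spawn sub-agents")
    transcript = run_agent(task)  # starts clean: no orchestrator history passed in
    summary = summarize(transcript)
    del transcript                # the sub-agent's context is discarded here
    return summary
```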

Hacker News Comment Review

  • The headline drew immediate criticism as misleading clickbait; multiple commenters independently reduced the article’s core insight to one sentence about cheap models gating expensive ones.
  • @vanviegen pushed back on the “don’t pre-bias the agent” framing, arguing that sharing relevant known context before a debug session is usually an advantage, not a liability.
  • @albert_e sees the pattern generalizing to Claude Code tooling: buffer prompts, use Haiku with repo context to reframe the ask, then route to Opus only when necessary.

Notable Comments

  • @syntaxing: Challenges the Haiku triager layer, suggesting a local embedding model like llama-embed-nemotron-8b with 32K context could one-shot the whole ticket more cheaply.
  • @iammrpayments: “I’m afraid claude code will start doing this in the background without telling you” – flags silent model-routing in AI coding tools as a transparency concern.

Original | Discuss on HN