We decreased our LLM costs with Opus

TLDR

  • Total spend running Opus 4.6 came in below the old Sonnet 4.0 setup because a Haiku triager screens every CI failure and stops the ~80% that are duplicates before Opus ever sees them.

Key Takeaways

  • The Haiku triager uses exact matching plus pgvector semantic search to classify failures as duplicates; 4 out of 5 never reach Opus (sketched in the first code block after this list).
  • Opus orchestrates through tightly scoped sub-agent prompts and never reads raw logs directly; it queries ClickHouse via a SQL interface (second block below).
  • Haiku processes 65% of all input tokens yet accounts for only 36% of LLM spend; removing the model hierarchy more than doubles the daily bill.
  • Sub-agents are capped at one level deep to prevent runaway fan-out; Haiku's input-to-output token ratio is 86:1 and the Opus orchestrator's is ~50:1 (third block below).
  • Context hygiene: Opus receives only structured sub-agent summaries; each sub-agent starts with a clean context, which is discarded once the sub-agent completes.
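
To make the triage tier concrete, here is a minimal sketch of the two-tier duplicate gate, assuming Postgres with the pgvector extension. The table and column names, the log-scrubbing rules, the 0.90 threshold, and the `embed()` callable are illustrative assumptions, not the article's actual code; Haiku's classification call sits on top of this and is omitted here.

```python
# Sketch of the two-tier duplicate gate, assuming Postgres + pgvector.
# Table/column names, scrubbing rules, and the 0.90 threshold are
# illustrative assumptions, not the article's code.
import hashlib
import re

import psycopg  # pip install "psycopg[binary]"


def fingerprint(log_tail: str) -> str:
    """Scrub run-specific noise so identical failures hash identically."""
    scrubbed = re.sub(r"\d{4}-\d{2}-\d{2}T[\d:.Z]+", "<TS>", log_tail)
    scrubbed = re.sub(r"0x[0-9a-fA-F]+", "<ADDR>", scrubbed)
    return hashlib.sha256(scrubbed.encode()).hexdigest()


def is_known_failure(conn: psycopg.Connection, log_tail: str, embed) -> bool:
    """Tier 1: exact fingerprint match. Tier 2: pgvector cosine similarity."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT 1 FROM ci_failures WHERE fingerprint = %s LIMIT 1",
            (fingerprint(log_tail),),
        )
        if cur.fetchone():
            return True
        # pgvector's <=> operator is cosine distance; the embedding is passed
        # as a '[x,y,...]' literal. embed() stands in for whatever embedding
        # model produced the stored vectors (the article doesn't name one).
        vec = "[" + ",".join(str(x) for x in embed(log_tail)) + "]"
        cur.execute(
            "SELECT 1 - (embedding <=> %s::vector) AS sim "
            "FROM ci_failures ORDER BY embedding <=> %s::vector LIMIT 1",
            (vec, vec),
        )
        row = cur.fetchone()
        return bool(row and row[0] >= 0.90)  # threshold is an assumption
```

Only failures that survive both tiers escalate to Opus; per the article, that is roughly one in five.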
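
For the "never reads raw logs" rule, one plausible shape is a single read-only SQL tool over ClickHouse's HTTP interface. The endpoint, row cap, and tool/table names below are assumptions; the tool definition follows Anthropic's standard tool-use schema.

```python
# Sketch of a read-only SQL tool the Opus orchestrator could call instead of
# reading raw logs. Endpoint, row cap, and tool/table names are assumptions.
import requests

CLICKHOUSE_URL = "http://clickhouse:8123"  # assumed endpoint
MAX_ROWS = 200  # hard cap so results stay small enough for the Opus context

# Tool definition in Anthropic's tool-use schema; the orchestrator emits
# tool_use blocks carrying a `sql` argument and gets the rows back as text.
SQL_TOOL = {
    "name": "query_ci_events",  # hypothetical name
    "description": "Run read-only SQL over the CI failure events in ClickHouse.",
    "input_schema": {
        "type": "object",
        "properties": {"sql": {"type": "string"}},
        "required": ["sql"],
    },
}


def run_query(sql: str) -> str:
    """Execute the query via ClickHouse's HTTP API, enforcing read-only mode."""
    resp = requests.post(
        CLICKHOUSE_URL,
        params={
            "default_format": "TabSeparatedWithNames",
            "readonly": "1",  # ClickHouse setting: reject any write statement
            "max_result_rows": str(MAX_ROWS),
            "result_overflow_mode": "break",  # truncate instead of erroring
        },
        data=sql,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text
```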
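
The fan-out cap and the context-hygiene rule reduce to a few lines of glue. A sketch, with invented field names and callables:

```python
# Sketch of the one-level fan-out cap plus context hygiene. The dataclass
# fields and the run_agent/summarize callables are invented for illustration.
from dataclasses import dataclass
from typing import Callable

MAX_DEPTH = 1  # sub-agents may not spawn sub-agents of their own


@dataclass
class SubAgentSummary:
    """The only artifact the orchestrator ever sees from a sub-agent."""
    verdict: str      # e.g. "flaky-test", "infra", "regression"
    evidence: str     # a few supporting lines, never the full log
    confidence: float


def spawn_subagent(
    task: str,
    depth: int,
    run_agent: Callable[[str], str],              # fresh context on each call
    summarize: Callable[[str], SubAgentSummary],  # structured fields only
) -> SubAgentSummary:
    if depth >= MAX_DEPTH:
        raise RuntimeError("fan-out capped: sub-agents cannot spawn sub-agents")
    transcript = run_agent(task)  # starts clean: no orchestrator history passed in
    summary = summarize(transcript)
    del transcript                # the sub-agent's context is discarded here
    return summary
```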

Hacker News Comment Review

  • The headline drew immediate criticism as misleading clickbait; multiple commenters independently reduced the article’s core insight to one sentence about cheap models gating expensive ones.
  • @vanviegen pushed back on the “don’t pre-bias the agent” framing, arguing that sharing relevant known context before a debug session is usually an advantage, not a liability.
  • @albert_e sees the pattern generalizing to Claude Code tooling: buffer prompts, use Haiku with repo context to reframe the ask, then route to Opus only when necessary.

Notable Comments

  • @syntaxing: Challenges the Haiku triager layer, suggesting a local embedding model like llama-embed-nemotron-8b with 32K context could one-shot the whole ticket more cheaply.
  • @iammrpayments: “I’m afraid claude code will start doing this in the background without telling you” – flags silent model-routing in AI coding tools as a transparency concern.

Original | Discuss on HN