I benchmarked Claude Code's caveman plugin against "be brief."


TLDR

  • A solo dev ran 24 prompts across 5 arms on claude-opus-4-7; “be brief.” matched Caveman on both token count and quality, cutting tokens by ~34% vs baseline (a minimal harness sketch follows).
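
For orientation, a sketch of that design under stated assumptions: five arm system prompts, the same 24 prompts per arm, mean completion tokens compared per arm. The arm texts, prompt list, and `run_prompt()` stub are placeholders, not the author's harness.

```python
"""Hypothetical 5-arm benchmark skeleton; not the author's code."""
from statistics import mean

ARMS = {
    "baseline": "",                             # no style instruction
    "brief": "be brief.",                       # the two-word contender
    "caveman-lite": "<caveman lite prompt>",    # placeholder text
    "caveman-full": "<caveman full prompt>",    # placeholder text
    "caveman-ultra": "<caveman ultra prompt>",  # placeholder text
}

PROMPTS: list[str] = ["..."]  # the post's 24 coding prompts (not published here)

def run_prompt(system: str, prompt: str) -> int:
    """Run one prompt under one arm and return its completion-token count.
    Stub: wire this to your model client of choice."""
    raise NotImplementedError

def benchmark() -> dict[str, float]:
    # Mean completion tokens per arm; e.g. 1 - 419/636 ≈ 0.34 is the
    # reported "be brief." saving vs baseline.
    return {arm: mean(run_prompt(sys_prompt, p) for p in PROMPTS)
            for arm, sys_prompt in ARMS.items()}
```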

Key Takeaways

  • Quality was flat across all arms: baseline 0.985, brief 0.985, lite 0.976, full 0.975, ultra 0.970, all within 1.5% of one another, with zero must_avoid triggers across 120 responses.
  • “Be brief.” averaged 419 tokens vs the baseline’s 636; Caveman lite and full landed near 419; ultra averaged 449 because Auto-Clarity safety escapes inflated the setup and security categories.
  • Caveman’s real differentiator is structural consistency and session persistence via SessionStart/UserPromptSubmit hooks, not compression (a minimal hook sketch follows this list).
  • Auto-Clarity intentionally drops compression for destructive ops and multi-step sequences; the two-word prompt carries no equivalent safety-escape logic (also sketched after this list).
  • Ultra triggered unexpected tool-use behavior on a Dockerfile prompt, adding ~1,300 tokens to its setup-category mean, a side effect of terse compression priming.
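
To make the hooks point concrete, here is a minimal sketch of how a Caveman-style plugin could persist its style rules across a session. It assumes the documented Claude Code behavior that a UserPromptSubmit hook's stdout (on exit code 0) is injected as context; the STYLE_RULES text is a stand-in, not the plugin's actual wording.

```python
#!/usr/bin/env python3
"""Hypothetical UserPromptSubmit hook: re-injects style rules on every
user turn (stand-in wording, not the Caveman plugin's actual prompt)."""
import json
import sys

STYLE_RULES = (
    "Respond tersely. Short sentences. No filler. "
    "Skip preamble and wrap-up summaries unless asked."
)

def main() -> None:
    # Claude Code passes hook input as JSON on stdin (session_id,
    # transcript_path, prompt, ...); this sketch doesn't use it, but
    # reading it keeps the contract explicit.
    _payload = json.load(sys.stdin)
    # For UserPromptSubmit, anything printed to stdout on exit code 0
    # is added to the model's context for the current turn -- which is
    # what gives the plugin its session persistence.
    print(STYLE_RULES)

if __name__ == "__main__":
    main()
```

Registering the script under the UserPromptSubmit (and SessionStart) keys of the hooks section in .claude/settings.json would wire it up; Caveman's actual wiring may differ.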
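
The safety-escape idea from the bullet above, sketched with illustrative trigger patterns; the post does not publish Auto-Clarity's real rules, so these regexes are guesses at the shape of the logic.

```python
import re

# Illustrative guesses at what counts as "destructive" or "multi-step";
# Auto-Clarity's actual triggers are not published in the post.
DESTRUCTIVE = re.compile(
    r"rm\s+-rf|drop\s+table|force[- ]push|git\s+reset\s+--hard",
    re.IGNORECASE,
)
MULTI_STEP = re.compile(
    r"\bthen\b|\bafter that\b|\bstep \d\b",
    re.IGNORECASE,
)

def compression_allowed(prompt: str) -> bool:
    """Return False when terse output should be suspended.

    Mirrors the claim above: compression is dropped for destructive
    operations and multi-step sequences, where omitted detail is a
    safety risk rather than a saving.
    """
    return not (DESTRUCTIVE.search(prompt) or MULTI_STEP.search(prompt))
```

For example, compression_allowed("rm -rf ./build, then redeploy") is False, so the response stays fully verbose; a bare “be brief.” carries no such bail-out.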

Hacker News Comment Review

  • Broad consensus that Caveman was always more novelty than utility; commenters note that the token savings are negligible in long coding sessions, where hundreds of thousands of tokens are common.
  • Skepticism that prompt-engineering hacks beat model defaults is strong; several commenters frame plugins like Caveman as “snake oil” that spread before anyone measures against the boring baseline.
  • Practical spin-off: commenters are now considering adding “be brief” directly to CLAUDE.md or AGENTS.md as a zero-cost default (a one-liner for this follows).
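
A tiny, dependency-free way to apply that tip, assuming a project-root CLAUDE.md (AGENTS.md works the same way for tools that read it):

```python
from pathlib import Path

# Append the two-word default to the project's CLAUDE.md, creating the
# file if needed and skipping the write if the rule is already present.
memo = Path("CLAUDE.md")
text = memo.read_text() if memo.exists() else ""
if "be brief" not in text.lower():
    if text and not text.endswith("\n"):
        text += "\n"
    memo.write_text(text + "Be brief.\n")
```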

Notable Comments

  • @avaer: frames Caveman-style hacks as easy-to-install tools “you never check”; popularity outpaces verification as the tech moves fast.
  • @encody: flags the post’s prose itself as exhibiting AI-output patterns; “I know this smell.”
