CC-Canary: Detect early signs of regressions in Claude Code

· ai coding web · Source ↗

TLDR

  • Two installable Claude Code Agent Skills that scan ~/.claude/projects/ JSONL logs locally to detect model drift and produce a forensic regression report.

Key Takeaways

  • Outputs a verdict (HOLDING / SUSPECTED REGRESSION / CONFIRMED REGRESSION / INCONCLUSIVE) with headline metrics table, weekly trend bars, and cross-version comparisons controlling for task mix.
  • Tracks read:edit ratio, write share of mutations, reasoning loops per 1K tool calls, thinking redaction rate, mean thinking length, and tokens per user turn as health signals.
  • Auto-detects an inflection date using a composite health score and argmax of delta over candidate dates with a 0.75σ floor; falls back to median-timestamp split.
  • Fully local: stdlib-only Python 3.8+, zero network calls, no pip, no daemon, no telemetry; user prompts truncated to 180 chars and path-redacted before entering the report skeleton.
  • Total runtime is roughly 2.5s for the Python script plus 10-20s for Claude to fill the ~20 narrative slots left in the pre-rendered markdown/HTML skeleton.

Hacker News Comment Review

  • The core methodological objection: using Claude Code itself to analyze and narrate its own potential regression is a circular measurement problem — the instrument being evaluated is also writing the verdict.
  • Commenters challenged whether behavioral metrics like read:edit ratio and reasoning loops are attributable to model changes at all, given task variation, prompt changes, and CLAUDE.md modifications across sessions.
  • A counter-signal worth noting: marginlab.ai tracks Claude Code performance via conventional benchmark datasets, which commenters cited as a complementary but also limited approach since benchmarks can mask real-world drift on novel tasks.

Notable Comments

  • @evantahler: “asking the thing that you are measuring, and don’t trust, to measure itself might not produce the best measurements”
  • @ctoth: uses an HK-47 persona in CLAUDE.md as a canary — when the model stops using the persona’s voice, other instructions are likely also being dropped.
  • @redanddead: “the actual canary is the need for the canary itself”

Original | Discuss on HN