An open-source test runner that benchmarks Anthropic Agent Skills (SKILL.md) by running each eval with and without the skill, then grading both outputs with a judge model.
Key Takeaways
Runs every eval twice: with_skill (SKILL.md in context) vs without_skill baseline, using any OpenAI-compatible model as target and judge.
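The with/without pairing above can be sketched as a small harness loop. This is an illustrative sketch only, not the agent-skills-eval API: the `Provider`, `EvalCase`, and `runPaired` names are assumptions invented for this example.

```typescript
// Hypothetical sketch of a paired eval run: same prompt, two contexts
// (SKILL.md injected vs. a bare baseline), both graded by a judge.
// None of these names come from the real agent-skills-eval SDK.
type Provider = (system: string, prompt: string) => Promise<string>;

interface EvalCase { id: string; prompt: string; }

interface PairedResult {
  id: string;
  withSkill: string;
  withoutSkill: string;
  grade: { withSkill: number; withoutSkill: number };
}

async function runPaired(
  evals: EvalCase[],
  target: Provider,
  judge: (output: string) => Promise<number>,
  skillMd: string,
): Promise<PairedResult[]> {
  const results: PairedResult[] = [];
  for (const e of evals) {
    // Same eval prompt twice: with SKILL.md as system context, then without.
    const withSkill = await target(skillMd, e.prompt);
    const withoutSkill = await target("", e.prompt);
    results.push({
      id: e.id,
      withSkill,
      withoutSkill,
      grade: {
        withSkill: await judge(withSkill),
        withoutSkill: await judge(withoutSkill),
      },
    });
  }
  return results;
}
```

Because `target` is just an async function of (system, prompt), any OpenAI-compatible endpoint can be wrapped to fit it, and the same shape serves for the judge.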
Produces portable JSON/JSONL artifacts plus a static HTML report in an iteration-N/ workspace layout; no infrastructure needed to publish results.
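An iteration-N/ workspace with JSONL artifacts might look like the following sketch. The directory-naming and file names here (`iteration-1/`, `results.jsonl`) are assumptions for illustration, not the tool's documented artifact spec.

```typescript
import { mkdirSync, readdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical sketch: write one run's records into the next iteration-N/
// directory as JSONL (one JSON object per line, easy to stream and diff).
function writeIterationArtifacts(workspace: string, records: object[]): string {
  // Find the highest existing iteration-N directory and go one past it.
  const existing = readdirSync(workspace, { withFileTypes: true })
    .filter((d) => d.isDirectory() && /^iteration-\d+$/.test(d.name))
    .map((d) => Number(d.name.slice("iteration-".length)));
  const next = existing.length ? Math.max(...existing) + 1 : 1;
  const dir = join(workspace, `iteration-${next}`);
  mkdirSync(dir, { recursive: true });
  const jsonl = records.map((r) => JSON.stringify(r)).join("\n") + "\n";
  writeFileSync(join(dir, "results.jsonl"), jsonl);
  return dir;
}
```

Plain files like these are what make results publishable with no infrastructure: the whole workspace can be committed, zipped, or served statically.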
The CLI is a one-liner (npx agent-skills-eval ./skills --target gpt-4o-mini --judge gpt-4o-mini --baseline); the TypeScript SDK supports custom providers, CI pipelines, and JSONL streaming.
Supports deterministic tool-call assertions alongside judge-graded text output, covering agentic workflows beyond plain text generation.
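A deterministic tool-call assertion can be sketched as an exact comparison of expected versus observed calls, with no judge involved. This is an illustrative implementation, not the tool's actual matcher; the `ToolCall` shape and `assertToolCalls` name are assumptions.

```typescript
// Hypothetical sketch: match an expected tool-call sequence exactly by
// name and arguments. Unlike judge grading, the result is deterministic.
interface ToolCall { name: string; args: Record<string, unknown>; }

function assertToolCalls(expected: ToolCall[], actual: ToolCall[]): string[] {
  const failures: string[] = [];
  if (expected.length !== actual.length) {
    failures.push(`expected ${expected.length} calls, got ${actual.length}`);
  }
  expected.forEach((exp, i) => {
    const act = actual[i];
    if (!act) return; // already covered by the length mismatch above
    if (act.name !== exp.name) {
      failures.push(`call ${i}: expected ${exp.name}, got ${act.name}`);
    } else if (JSON.stringify(act.args) !== JSON.stringify(exp.args)) {
      // Naive comparison: assumes stable key order; a real matcher
      // would canonicalize keys before comparing.
      failures.push(`call ${i} (${exp.name}): argument mismatch`);
    }
  });
  return failures;
}
```

Returning a list of failure messages rather than a boolean keeps the check report-friendly: an empty list means the agentic trace matched.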
Fully implements the agentskills.io spec: SKILL.md frontmatter validation, evals/evals.json schema, and official artifact layout.
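Frontmatter validation of the kind described can be sketched with a simplified line-based parser. This is not the tool's validator: a real implementation would use a YAML library and check the full agentskills.io schema, while this sketch only checks that a frontmatter block exists and carries the name and description fields.

```typescript
// Hypothetical sketch of SKILL.md frontmatter validation: extract the
// leading `--- ... ---` block and require `name` and `description`.
// Line-based parsing is a simplification; real YAML allows nesting,
// quoting, and multi-line values this will not handle.
function validateSkillMd(source: string): { ok: boolean; errors: string[] } {
  const errors: string[] = [];
  const match = source.match(/^---\n([\s\S]*?)\n---/);
  if (!match) return { ok: false, errors: ["missing YAML frontmatter block"] };
  const fields = new Map<string, string>();
  for (const line of match[1].split("\n")) {
    const kv = line.match(/^(\w[\w-]*):\s*(.*)$/);
    if (kv) fields.set(kv[1], kv[2]);
  }
  for (const required of ["name", "description"]) {
    if (!fields.get(required)) {
      errors.push(`frontmatter missing required field: ${required}`);
    }
  }
  return { ok: errors.length === 0, errors };
}
```

Collecting all errors instead of failing on the first lets a runner report every schema violation in a skill directory at once.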