
mgechev/skillgrade


Skillgrade

The easiest way to evaluate your Agent Skills. Skillgrade tests that AI agents correctly discover and use your skills.

See examples/superlint (simple) and angular-modern (TypeScript grader).

Browser Preview

Quick Start

Prerequisites: Node.js 20+, Docker

npm i -g skillgrade

1. Initialize — go to your skill directory (must have SKILL.md) and scaffold:

cd my-skill/
GEMINI_API_KEY=your-key skillgrade init    # or ANTHROPIC_API_KEY / OPENAI_API_KEY
# Use --force to overwrite an existing eval.yaml

This generates an eval.yaml with AI-generated tasks and graders. Without an API key, it creates a well-commented template instead.
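A minimal eval.yaml in the spirit of what init scaffolds might look like the sketch below (the task name, file name, and grader command are illustrative, not output of the tool):

```yaml
version: "1"

defaults:
  trials: 5
  timeout: 300

tasks:
  - name: create-greeting
    instruction: |
      Create hello.txt containing the word "hello".
    graders:
      - type: deterministic
        # Deterministic graders must print JSON with "score" and "details" to stdout
        run: |
          if grep -q hello hello.txt 2>/dev/null; then
            echo '{"score":1,"details":"hello.txt contains hello"}'
          else
            echo '{"score":0,"details":"hello.txt missing or wrong"}'
          fi
        weight: 1
```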

2. Edit — customize eval.yaml for your skill (see eval.yaml Reference).

3. Run:

GEMINI_API_KEY=your-key skillgrade --smoke

The agent is auto-detected from your API key: GEMINI_API_KEY → Gemini, ANTHROPIC_API_KEY → Claude, OPENAI_API_KEY → Codex. Override with --agent=claude.

4. Review:

skillgrade preview          # CLI report
skillgrade preview browser  # web UI → http://localhost:3847

Reports are saved to $TMPDIR/skillgrade/<skill-name>/results/. Override with --output=DIR.

Presets

Flag          Trials  Use Case
--smoke       5       Quick capability check
--reliable    15      Reliable pass rate estimate
--regression  30      High-confidence regression detection

Options

Flag                         Description
--eval=NAME[,NAME]           Run specific evals by name (comma-separated)
--grader=TYPE                Run only graders of a type (deterministic or llm_rubric)
--trials=N                   Override trial count
--parallel=N                 Run trials concurrently
--agent=gemini|claude|codex  Override agent (default: auto-detect from API key)
--provider=docker|local      Override provider
--output=DIR                 Output directory (default: $TMPDIR/skillgrade)
--validate                   Verify graders using reference solutions
--ci                         CI mode: exit non-zero if below threshold
--threshold=0.8              Pass rate threshold for CI mode
--preview                    Show CLI results after running
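Flags compose; for example, a parallel run of a single eval with a report at the end might look like this (the eval name is illustrative):

```
skillgrade --reliable --eval=fix-linting-errors --parallel=4 --preview
```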

eval.yaml Reference

version: "1"

# Optional: explicit path to skill directory (defaults to auto-detecting SKILL.md)
# skill: path/to/my-skill

defaults:
  agent: gemini          # gemini | claude | codex
  provider: docker       # docker | local
  trials: 5
  timeout: 300           # seconds
  threshold: 0.8         # for --ci mode
  grader_model: gemini-3-flash-preview  # default LLM grader model
  docker:
    base: node:20-slim
    setup: |             # extra commands run during image build
      apt-get update && apt-get install -y jq
  environment:           # container resource limits
    cpus: 2
    memory_mb: 2048

tasks:
  - name: fix-linting-errors
    instruction: |
      Use the superlint tool to fix coding standard violations in app.js.

    workspace:                           # files copied into the container
      - src: fixtures/broken-app.js
        dest: app.js
      - src: bin/superlint
        dest: /usr/local/bin/superlint
        chmod: "+x"

    graders:
      - type: deterministic
        setup: npm install typescript    # grader-specific deps (optional)
        run: npx ts-node graders/check.ts
        weight: 0.7
      - type: llm_rubric
        rubric: |
          Did the agent follow the check → fix → verify workflow?
        model: gemini-2.0-flash          # optional model override
        weight: 0.3

    # Per-task overrides (optional)
    agent: claude
    trials: 10
    timeout: 600

String values (instruction, rubric, run) support file references — if the value is a valid file path, its contents are read automatically:

instruction: instructions/fix-linting.md
rubric: rubrics/workflow-quality.md

Graders

Deterministic

Runs a command and parses JSON from stdout:

- type: deterministic
  run: bash graders/check.sh
  weight: 0.7

Output format:

{
  "score": 0.67,
  "details": "2/3 checks passed",
  "checks": [
    {"name": "file-created", "passed": true, "message": "Output file exists"},
    {"name": "content-correct", "passed": false, "message": "Missing expected output"}
  ]
}

score (0.0–1.0) and details are required. checks is optional.

Bash example:

#!/bin/bash
passed=0; total=2
c1_pass=false c1_msg="File missing"
c2_pass=false c2_msg="Content wrong"

if test -f output.txt; then
  passed=$((passed + 1)); c1_pass=true; c1_msg="File exists"
fi
if grep -q "expected" output.txt 2>/dev/null; then
  passed=$((passed + 1)); c2_pass=true; c2_msg="Content correct"
fi

score=$(awk "BEGIN {printf \"%.2f\", $passed/$total}")
echo "{\"score\":$score,\"details\":\"$passed/$total passed\",\"checks\":[{\"name\":\"file\",\"passed\":$c1_pass,\"message\":\"$c1_msg\"},{\"name\":\"content\",\"passed\":$c2_pass,\"message\":\"$c2_msg\"}]}"

Use awk for arithmetic — bc is not available in node:20-slim.

LLM Rubric

Evaluates the agent's session transcript against qualitative criteria:

- type: llm_rubric
  rubric: |
    Workflow Compliance (0-0.5):
    - Did the agent follow the mandatory 3-step workflow?

    Efficiency (0-0.5):
    - Completed in ≤5 commands?
  weight: 0.3
  model: gemini-2.0-flash    # optional, auto-detected from API key

Uses Gemini or Anthropic based on available API key. Override with the model field.

Combining Graders

graders:
  - type: deterministic
    run: bash graders/check.sh
    weight: 0.7      # 70% — did it work?
  - type: llm_rubric
    rubric: rubrics/quality.md
    weight: 0.3      # 30% — was the approach good?

Final reward = Σ (grader_score × weight) / Σ weight
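The weighted-sum formula can be sanity-checked with a quick awk sketch (the scores and weights below are illustrative, matching the 0.7/0.3 split above):

```shell
# Two graders: deterministic (score 1.0, weight 0.7) and llm_rubric (score 0.5, weight 0.3)
reward=$(awk 'BEGIN {
  s1 = 1.0; w1 = 0.7
  s2 = 0.5; w2 = 0.3
  printf "%.2f", (s1*w1 + s2*w2) / (w1 + w2)
}')
echo "$reward"   # prints 0.85
```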

CI Integration

Use --provider=local in CI — the runner is already an ephemeral sandbox, so Docker adds overhead without benefit.

# .github/workflows/skillgrade.yml
- run: |
    npm i -g skillgrade
    cd skills/superlint
    GEMINI_API_KEY=${{ secrets.GEMINI_API_KEY }} skillgrade --regression --ci --provider=local

Exits with code 1 if pass rate falls below --threshold (default: 0.8).

Tip: Use docker (the default) for local development to protect your machine. In CI, local is faster and simpler.
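Expanded into a complete workflow file, the step above might look like this sketch (the workflow name, trigger, and job name are assumptions, not prescribed by skillgrade):

```yaml
# .github/workflows/skillgrade.yml
name: skillgrade
on: [push, pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: |
          npm i -g skillgrade
          cd skills/superlint
          GEMINI_API_KEY=${{ secrets.GEMINI_API_KEY }} skillgrade --regression --ci --provider=local
```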

Environment Variables

Variable           Used by
GEMINI_API_KEY     Agent execution, LLM grading, skillgrade init
ANTHROPIC_API_KEY  Agent execution, LLM grading, skillgrade init
OPENAI_API_KEY     Agent execution (Codex), skillgrade init

Variables are also loaded from .env in the skill directory. Shell values override .env. All values are redacted from persisted session logs.
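The "shell values override .env" precedence can be sketched like this (an illustrative loader for one variable, not skillgrade's actual implementation):

```shell
# A .env file provides a value, but the shell already has one set
echo 'MY_KEY=from-dotenv' > .env
export MY_KEY=from-shell

# Only fall back to the .env value when the variable is not already set
if [ -z "${MY_KEY:-}" ]; then
  export MY_KEY="$(grep '^MY_KEY=' .env | cut -d= -f2-)"
fi

echo "$MY_KEY"   # prints from-shell
rm -f .env
```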

Best Practices

  • Grade outcomes, not steps. Check that the file was fixed, not that the agent ran a specific command.
  • Instructions must name output files. If the grader checks for output.html, the instruction must tell the agent to save as output.html.
  • Validate graders first. Use --validate with a reference solution before running real evals.
  • Start small. 3–5 well-designed tasks beat 50 noisy ones.

For a comprehensive guide on writing high-quality skills, check out skills-best-practices. You can also install it as a skill to help author your own:

npx skills add mgechev/skills-best-practices

License

MIT


Inspired by SkillsBench and Demystifying Evals for AI Agents.
