UPskill
Generate and evaluate agent skills for code agents like Claude Code, Open Code, OpenAI Codex
Generate agent skills from task descriptions or existing agent traces, then evaluate them. Create skills with teacher models (expensive, slow) that student models (cheap, fast) can use to perform harder tasks reliably.
Quick Start
Install upskill:
pip install upskill
# or just use uv
uvx upskill
Create a new skill
upskill generate "write good git commit messages"
# or based on previous agent traces
upskill generate "document the pattern" --from ./trace.md
# Skills are saved to ./skills/{skill-name}/ by default
Generate a skill with a teacher model and evaluate it on a student model.
upskill generate "write good git commit messages" --model sonnet --eval-model haiku
Benchmark a set of models against a skill.
upskill eval ./skills/git-commit-messages/ -m haiku -m sonnet
# logs pretty printed to the terminal
View the results later.
upskill runs --skill git-commit-messages
Commands
upskill generate
Generate a skill from a task description with automatic evaluation and refinement.
upskill generate TASK [OPTIONS]
Arguments:
TASK - Description of what the skill should teach
Options:
  -e, --example - Input -> output example (can be repeated)
  --tool - Generate from MCP tool schema (path#tool_name)
  -f, --from PATH - Improve from existing skill dir or agent trace file (auto-detected)
  -m, --model MODEL - Model for generation (e.g., 'sonnet', 'haiku', 'anthropic.claude-sonnet-4-20250514')
  -o, --output PATH - Output directory for skill
  --no-eval - Skip evaluation and refinement
  --eval-model MODEL - Different model to evaluate skill on
  --runs-dir PATH - Directory for run logs (default: ./runs)
  --log-runs / --no-log-runs - Log run data (default: enabled)
Examples:
# Basic usage
upskill generate "parse JSON Schema files"
# Make and evaluate skills for less powerful models
upskill generate "write git commits" --model sonnet --eval-model haiku
# Improve an existing skill (auto-detected as directory)
upskill generate "add more error handling examples" --from ./skills/api-errors/
# Generate from an agent trace file (auto-detected as file)
upskill generate "document the pattern" --from ./trace.json
# Skip evaluation during generation (evaluate separately with upskill eval)
upskill generate "parse YAML" --no-eval
Output:
Generating skill with sonnet...
Generating test cases...
Evaluating on sonnet... (attempt 1)
60% -> 100% (+40%) OK
git-commit-messages
Write clear, conventional commit messages that follow best practices.
SKILL.md ~450 tokens
baseline   ████████████░░░░░░░░ 60%
with skill ████████████████████ 100% (+40%)
tokens: 1200 → 800 (-33%)
Saved to ./skills/git-commit-messages
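The figures in this summary are plain differences and ratios. A minimal sketch of the arithmetic, assuming the lift is reported in percentage points and the token change is relative to the baseline; the function names are hypothetical and not part of upskill:

```python
def skill_lift(baseline_rate: float, skill_rate: float) -> float:
    """Absolute change in pass rate (0.40 means +40 points)."""
    return skill_rate - baseline_rate


def token_change(baseline_tokens: int, skill_tokens: int) -> float:
    """Relative change in token usage versus the baseline run."""
    return (skill_tokens - baseline_tokens) / baseline_tokens


# Numbers taken from the sample output above.
lift = skill_lift(0.60, 1.00)
change = token_change(1200, 800)
print(f"lift: {lift:+.0%}, tokens: {change:+.0%}")
```

A negative token change means the model needed fewer tokens once the skill was available.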
upskill eval
Evaluate an existing skill against test cases. Supports single-model evaluation with baseline comparison, or multi-model benchmarking.
upskill eval SKILL_PATH [OPTIONS]
Arguments:
SKILL_PATH - Path to skill directory containing SKILL.md
Options:
  -t, --tests PATH - Test cases JSON file
  -m, --model MODEL - Model(s) to evaluate against (repeatable for multi-model benchmarking)
  --runs N - Number of runs per model (default: 1)
  --provider [anthropic|openai|generic] - API provider (auto-detected as 'generic' when --base-url is provided)
  --base-url URL - Custom API endpoint for local models
  --no-baseline - Skip baseline comparison
  -v, --verbose - Show per-test results
  --log-runs / --no-log-runs - Log run data (default: enabled)
  --runs-dir PATH - Directory for run logs
Examples:
# Basic evaluation with baseline comparison
upskill eval ./skills/my-skill/
# With verbose output
upskill eval ./skills/my-skill/ -v
# Custom test cases
upskill eval ./skills/my-skill/ --tests ./tests.json
# Evaluate on specific model
upskill eval ./skills/my-skill/ -m haiku
# Multi-model benchmarking (compare models)
upskill eval ./skills/my-skill/ -m haiku -m sonnet
# Multiple runs per model for statistical significance
upskill eval ./skills/my-skill/ -m haiku -m sonnet --runs 5
# Evaluate on local model (llama.cpp server)
upskill eval ./skills/my-skill/ \
-m "unsloth/GLM-4.7-Flash-GGUF:Q4_0" \
--base-url http://localhost:8080/v1
# Skip baseline (just test with skill)
upskill eval ./skills/my-skill/ --no-baseline
# Disable run logging
upskill eval ./skills/my-skill/ --no-log-runs
Benchmark output:
Evaluating my-skill across 2 model(s)
3 test case(s), 5 run(s) per model
haiku
Pass rate: 4/5 (80%) Avg assertions: 2.8/3
sonnet
Pass rate: 5/5 (100%) Avg assertions: 3.0/3
┏━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Model  ┃ Pass Rate ┃ Avg Assertions ┃ Avg Tokens ┃
┡━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ haiku  │ 4/5       │ 2.8/3          │ 1250       │
│ sonnet │ 5/5       │ 3.0/3          │ 1890       │
└────────┴───────────┴────────────────┴────────────┘
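The per-model summary lines can be reproduced from individual run outcomes. The sketch below is illustrative only: `RunOutcome` is a hypothetical record, and it assumes a run counts as passed only when every assertion passes, which the README does not state.

```python
from dataclasses import dataclass


@dataclass
class RunOutcome:
    # Hypothetical record for one run; not upskill's actual schema.
    assertions_passed: int
    assertions_total: int

    @property
    def passed(self) -> bool:
        # Assumption: a run passes only if all of its assertions pass.
        return self.assertions_passed == self.assertions_total


def summarize(runs: list[RunOutcome]) -> str:
    """Format pass rate and mean assertions like the benchmark output."""
    passed = sum(r.passed for r in runs)
    avg = sum(r.assertions_passed for r in runs) / len(runs)
    total = runs[0].assertions_total
    return (
        f"Pass rate: {passed}/{len(runs)} ({passed / len(runs):.0%}) "
        f"Avg assertions: {avg:.1f}/{total}"
    )


# Four clean runs and one run that missed a single assertion.
haiku_runs = [RunOutcome(3, 3)] * 4 + [RunOutcome(2, 3)]
print(summarize(haiku_runs))
```

With these inputs the average is 14 assertions over 5 runs, which is where a value like 2.8/3 comes from.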
Test cases JSON format:
[
{"input": "Write a commit for adding login", "expected": {"contains": ["feat", "login"]}},
{"input": "Fix the null pointer bug", "expected": {"contains": ["fix", "bug"]}}
]
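To make the `contains` expectation concrete, here is a rough sketch of how such a check could work. It assumes every listed substring must appear in the model output and that matching ignores case; upskill's actual matching rules may differ, and `check_contains` is a hypothetical helper.

```python
import json

# Same shape as the test cases file shown above.
TESTS_JSON = """
[
  {"input": "Write a commit for adding login", "expected": {"contains": ["feat", "login"]}},
  {"input": "Fix the null pointer bug", "expected": {"contains": ["fix", "bug"]}}
]
"""


def check_contains(output: str, expected: dict) -> bool:
    # Assumption: all substrings are required, compared case-insensitively.
    text = output.lower()
    return all(s.lower() in text for s in expected.get("contains", []))


tests = json.loads(TESTS_JSON)
sample_output = "feat(auth): add login endpoint"
first = check_contains(sample_output, tests[0]["expected"])
second = check_contains(sample_output, tests[1]["expected"])
print(first, second)
```

The sample output satisfies the first test case but not the second, since it contains neither "fix" nor "bug".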
upskill list
List all generated skills in a tree view.
upskill list [OPTIONS]
Options:
  -d, --dir PATH - Skills directory to list
  -v, --verbose - Show skill contents preview
Examples:
# List skills in default directory
upskill list
# List from custom directory
upskill list -d ./my-skills/
# Show preview of skill contents
upskill list -v
Output:
./skills
├── git-commit-messages
│   ├── Write clear, conventional commit messages...
│   └── files
│       └── SKILL.md
├── api-error-handling
│   ├── Handle API errors gracefully with proper logging...
│   └── files
│       ├── SKILL.md
│       └── references/error-codes.md
└── yaml-parsing
    ├── Parse YAML files safely with schema validation...
    └── files
        ├── SKILL.md
        └── scripts/validate.py
upskill runs
View run results as a plot, or export to CSV. By default, shows a visual comparison of baseline vs with-skill performance.
upskill runs [OPTIONS]
Options:
  -d, --dir PATH - Runs directory
  -s, --skill TEXT - Filter by skill name(s) (repeatable)
  -m, --model TEXT - Filter by model(s) (repeatable)
  --metric [success|tokens] - Metric to display (default: success)
  --csv PATH - Export to CSV instead of plot
Examples:
# View results plot (default)
upskill runs
# Filter by skill and models
upskill runs -s my-skill -m haiku -m sonnet
# Show token usage instead of success rate
upskill runs --metric tokens
# Export to CSV
upskill runs --csv ./results.csv
# Custom runs directory
upskill runs -d ./my-runs/
Plot output:
skill: git-commit-messages
haiku
baseline   ████████████░░░░░░░░ 60%
with skill ████████████████░░░░ 80% (+20%)
sonnet
baseline   ████████████░░░░░░░░ 60%
with skill ████████████████████ 100% (+40%)
Matrix view (multiple skills and models):
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ skill               ┃ haiku        ┃ sonnet       ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ git-commit-messages │ 60%→80%      │ 60%→100%     │
│ api-error-handling  │ 40%→70%      │ 50%→90%      │
│ yaml-parsing        │ 70%→90%      │ 80%→100%     │
└─────────────────────┴──────────────┴──────────────┘
Skill Output Format
Skills are saved in a standard directory format:
./skills/{skill-name}/
├── SKILL.md      # Main skill instructions
├── references/   # Supporting documents (optional)
└── scripts/      # Executable scripts (optional)
Example SKILL.md:
# git-commit-messages
Write clear, conventional commit messages that follow best practices.
## Instructions
This skill teaches how to write effective git commit messages
following the Conventional Commits specification.
## Format
Commit messages should follow this structure:
<type>(<scope>): <subject>
<body>
<footer>
## Types
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation changes
...
## Examples
### Simple feature commit
feat(auth): add password reset functionality
### Bug fix with explanation
fix(api): handle null response from user service
The user service can return null when not found.
Added proper null checking to prevent crashes.
Closes #123
Run Logging
By default, upskill logs all runs to ./runs/. Each run creates:
./runs/
├── 2025_01_21_15_30/          # Batch folder (timestamp)
│   ├── run_1/
│   │   ├── run_metadata.json  # Model, task, timing
│   │   └── run_result.json    # Pass/fail, assertions, tokens
│   ├── run_2/
│   │   └── ...
│   └── batch_summary.json     # Aggregate results
└── results.csv                # Summary CSV (after `upskill runs`)
Disable with --no-log-runs.
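Because the logs are plain JSON files in a predictable layout, they can be collected with a few lines of standard-library code. The sketch below relies only on the directory structure shown above; the `passed` field in the demo data is a hypothetical name, since the README does not document the keys inside `run_result.json`.

```python
import json
import tempfile
from pathlib import Path


def load_run_results(runs_dir: Path) -> list[dict]:
    """Collect every run_result.json under <runs_dir>/<batch>/<run>/."""
    results = []
    for path in sorted(runs_dir.glob("*/run_*/run_result.json")):
        data = json.loads(path.read_text())
        data["batch"] = path.parent.parent.name  # timestamped batch folder
        data["run"] = path.parent.name           # e.g. run_1
        results.append(data)
    return results


# Demo against a throwaway directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for i, passed in enumerate([True, False], start=1):
        run_dir = root / "2025_01_21_15_30" / f"run_{i}"
        run_dir.mkdir(parents=True)
        # "passed" is a made-up field used only for this demo.
        (run_dir / "run_result.json").write_text(json.dumps({"passed": passed}))
    loaded = load_run_results(root)

print([(r["batch"], r["run"], r["passed"]) for r in loaded])
```

Pointing `load_run_results` at `Path("./runs")` should work the same way, provided the real files follow the layout shown.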
Configuration
upskill config (~/.config/upskill/config.yaml)
model: sonnet # Default generation model
eval_model: haiku # Default evaluation model (optional)
skills_dir: ./skills # Where to save skills
runs_dir: ./runs # Where to save run logs
max_refine_attempts: 3 # Refinement iterations
FastAgent config (fastagent.config.yaml)
Place in your project directory to customize FastAgent settings:
default_model: sonnet
logger:
  progress_display: true
  show_chat: false
  streaming: markdown
# MCP servers (optional)
mcp:
  servers:
    fetch:
      command: "uvx"
      args: ["mcp-server-fetch"]
Environment Variables
# Required for Anthropic models
ANTHROPIC_API_KEY=sk-ant-...
# Required for OpenAI models
OPENAI_API_KEY=sk-...
# Optional: custom endpoints
ANTHROPIC_BASE_URL=http://localhost:8080
OPENAI_API_BASE=http://localhost:11434/v1
# For local models (generic provider)
GENERIC_BASE_URL=http://localhost:8080/v1
GENERIC_API_KEY=local # Optional, defaults to "local"
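A small pre-flight check can catch a missing key before a long run starts. This is only a sketch: the mapping mirrors the variables listed above, while `missing_key` is a hypothetical helper and not something upskill provides.

```python
import os
from typing import Mapping, Optional

# Providers whose API key is described as required above.
REQUIRED_KEY = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def missing_key(provider: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the unset API key for a provider, or None."""
    var = REQUIRED_KEY.get(provider)  # generic: GENERIC_API_KEY is optional
    if var and not env.get(var):
        return var
    return None


# Passing an explicit mapping keeps the example independent of the real environment.
print(missing_key("anthropic", {}))
print(missing_key("generic", {}))
```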
Python API
import asyncio

from upskill import (
    generate_skill,
    generate_tests,
    evaluate_skill,
    refine_skill,
    Config,
)
from upskill.evaluate import get_failure_descriptions


async def main() -> None:
    # Load configuration
    config = Config.load()

    # Generate a skill
    skill = await generate_skill(
        "parse JSON Schema files",
        model="sonnet",
        config=config,
    )

    # Generate test cases
    tests = await generate_tests("parse JSON Schema files")

    # Evaluate the skill
    results = await evaluate_skill(
        skill,
        tests,
        model="haiku",
        config=config,
    )
    print(f"Skill lift: {results.skill_lift:.0%}")
    print(f"Token savings: {results.token_savings:.0%}")
    print(f"Is beneficial: {results.is_beneficial}")

    # Refine based on failures
    if not results.is_beneficial:
        failures = get_failure_descriptions(results)
        improved_skill = await refine_skill(skill, failures)


# The API is async, so the calls must run inside an event loop
asyncio.run(main())
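The single refinement step in this example can be repeated until the skill helps or a retry budget runs out, which is presumably what `max_refine_attempts` bounds in the CLI. The loop below is a sketch under that assumption, not upskill's implementation; it takes stand-in callables so that it runs offline, and `refine_until_beneficial` is a hypothetical name.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def refine_until_beneficial(
    skill: str,
    evaluate: Callable[[str], Awaitable[Tuple[bool, List[str]]]],
    refine: Callable[[str, List[str]], Awaitable[str]],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Evaluate, refine on failure, and stop after max_attempts refinements."""
    attempt = 0
    while True:
        ok, failures = await evaluate(skill)
        if ok or attempt == max_attempts:
            return skill, attempt
        skill = await refine(skill, failures)
        attempt += 1


# Stand-ins for real evaluate/refine calls (no API keys needed).
async def fake_evaluate(skill: str) -> Tuple[bool, List[str]]:
    return ("v3" in skill, ["missing scope"])


async def fake_refine(skill: str, failures: List[str]) -> str:
    version = int(skill[-1]) + 1
    return f"skill-v{version}"


final_skill, refinements = asyncio.run(
    refine_until_beneficial("skill-v1", fake_evaluate, fake_refine)
)
print(final_skill, refinements)
```

In real use, `evaluate` and `refine` would wrap `evaluate_skill` and `refine_skill`, with the failure list produced by `get_failure_descriptions`.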
Model Format
upskill uses FastAgent model format:
<provider>.<model>.<reasoning_effort?>
Examples:
  sonnet - Anthropic Claude Sonnet (alias)
  haiku - Anthropic Claude Haiku (alias)
  opus - Anthropic Claude Opus (alias)
  anthropic.claude-sonnet-4-20250514 - Full model name
  openai.gpt-4.1 - OpenAI GPT-4.1
  openai.o3-mini.low - OpenAI o3-mini with low reasoning effort
  generic.llama3.2:latest - Local model via Ollama
  generic.my-model - Local model via llama.cpp or other OpenAI-compatible server
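Model names may themselves contain dots (`gpt-4.1`, `llama3.2:latest`), so the format cannot be split naively on every dot. The sketch below shows one way to read such strings. It is not FastAgent's parser: the provider set comes from the `--provider` choices in this README, and the set of effort levels is an assumption made for illustration.

```python
from typing import NamedTuple, Optional

# Assumed vocabularies (see the note above).
PROVIDERS = {"anthropic", "openai", "generic"}
EFFORTS = {"low", "medium", "high"}


class ModelSpec(NamedTuple):
    provider: Optional[str]
    model: str
    reasoning_effort: Optional[str]


def parse_model(spec: str) -> ModelSpec:
    provider = effort = None
    # A leading segment counts as a provider only if it is a known one.
    head, sep, rest = spec.partition(".")
    if sep and head in PROVIDERS:
        provider, spec = head, rest
    # A trailing segment counts as an effort only if it is a known level.
    body, sep, tail = spec.rpartition(".")
    if sep and tail in EFFORTS:
        spec, effort = body, tail
    return ModelSpec(provider, spec, effort)


for s in ["sonnet", "openai.gpt-4.1", "openai.o3-mini.low", "generic.llama3.2:latest"]:
    print(parse_model(s))
```

Aliases such as `sonnet` come back with no provider, leaving alias resolution to the tool.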
Local Models
upskill supports local models through any OpenAI-compatible endpoint (Ollama, llama.cpp, vLLM, etc.).
Quick start with Ollama:
# Start Ollama (default port 11434)
ollama serve
# Evaluate with a local model
upskill eval ./skills/my-skill/ \
--model llama3.2:latest \
--base-url http://localhost:11434/v1
With llama.cpp server:
# Start llama.cpp server
./llama-server -m model.gguf --port 8080
# Evaluate with the local model
upskill eval ./skills/my-skill/ \
--model my-model \
--base-url http://localhost:8080/v1
When --base-url is provided, the provider is automatically set to generic unless you specify --provider explicitly.
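The precedence described here fits in a few lines. This is a sketch of the stated rule only; `resolve_provider` is a hypothetical name, and what happens when neither option is given is left open because the README does not specify it.

```python
from typing import Optional


def resolve_provider(provider: Optional[str], base_url: Optional[str]) -> Optional[str]:
    """An explicit --provider wins; otherwise --base-url implies 'generic'."""
    if provider:
        return provider
    if base_url:
        return "generic"
    return None  # neither given: left to whatever default the tool applies


print(resolve_provider(None, "http://localhost:11434/v1"))
print(resolve_provider("openai", "http://localhost:8080/v1"))
```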
