UPskill

huggingface / upskill

Generate and evaluate agent skills for code agents like Claude Code, OpenCode, and OpenAI Codex.

Language: Python. License: Apache-2.0.

Generate and evaluate agent skills from task descriptions or agent traces. Create skills with teacher models (expensive, slow) that student models (cheap, fast) can use to perform harder tasks reliably.

Quick Start

Install upskill:

pip install upskill
# or just use uv
uvx upskill

Create a new skill

upskill generate "write good git commit messages"
# or based on previous agent traces
upskill generate "document the pattern" --from ./trace.md
# Skills are saved to ./skills/{skill-name}/ by default

Generate a skill with a teacher model and evaluate it on a student model.

upskill generate "write good git commit messages" --model sonnet --eval-model haiku

Benchmark a set of models against a skill.

upskill eval ./skills/git-commit-messages/ -m haiku -m sonnet
# logs pretty printed to the terminal

View the results later.

upskill runs --skill git-commit-messages

Commands

upskill generate

Generate a skill from a task description with automatic evaluation and refinement.

upskill generate TASK [OPTIONS]

Arguments:

  • TASK - Description of what the skill should teach

Options:

  • -e, --example - Input -> output example (can be repeated)
  • --tool - Generate from MCP tool schema (path#tool_name)
  • -f, --from PATH - Improve from existing skill dir or agent trace file (auto-detected)
  • -m, --model MODEL - Model for generation (e.g., 'sonnet', 'haiku', 'anthropic.claude-sonnet-4-20250514')
  • -o, --output PATH - Output directory for skill
  • --no-eval - Skip evaluation and refinement
  • --eval-model MODEL - Different model to evaluate skill on
  • --runs-dir PATH - Directory for run logs (default: ./runs)
  • --log-runs / --no-log-runs - Log run data (default: enabled)

Examples:

# Basic usage
upskill generate "parse JSON Schema files"

# Make and evaluate skills for less powerful models
upskill generate "write git commits" --model sonnet --eval-model haiku

# Improve an existing skill (auto-detected as directory)
upskill generate "add more error handling examples" --from ./skills/api-errors/

# Generate from an agent trace file (auto-detected as file)
upskill generate "document the pattern" --from ./trace.json

# Skip evaluation during generation (evaluate separately with upskill eval)
upskill generate "parse YAML" --no-eval

Output:

Generating skill with sonnet...
Generating test cases...
Evaluating on sonnet... (attempt 1)
  60% -> 100% (+40%) OK

  git-commit-messages
  Write clear, conventional commit messages that follow best practices.

  SKILL.md              ~450 tokens

  baseline   ████████████░░░░░░░░   60%
  with skill ████████████████████  100%  (+40%)

  tokens: 1200 → 800  (-33%)

Saved to ./skills/git-commit-messages
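
The figures in the sample output above are consistent with simple arithmetic: the success change reads as a percentage-point difference, and the token change as a relative difference. A minimal sketch of that reading follows. The function names are hypothetical and are not part of upskill, and the tool may compute its numbers differently.

```python
def success_lift(baseline: float, with_skill: float) -> float:
    """Percentage-point change in pass rate (both arguments in [0, 1])."""
    return with_skill - baseline


def token_change(baseline_tokens: int, skill_tokens: int) -> float:
    """Relative change in token usage; negative means fewer tokens."""
    return (skill_tokens - baseline_tokens) / baseline_tokens


lift = f"{success_lift(0.60, 1.00):+.0%}"
tokens = f"{token_change(1200, 800):+.0%}"
print(lift)    # +40%
print(tokens)  # -33%
```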

upskill eval

Evaluate an existing skill against test cases. Supports single-model evaluation with baseline comparison, or multi-model benchmarking.

upskill eval SKILL_PATH [OPTIONS]

Arguments:

  • SKILL_PATH - Path to skill directory containing SKILL.md

Options:

  • -t, --tests PATH - Test cases JSON file
  • -m, --model MODEL - Model(s) to evaluate against (repeatable for multi-model benchmarking)
  • --runs N - Number of runs per model (default: 1)
  • --provider [anthropic|openai|generic] - API provider (auto-detected as 'generic' when --base-url is provided)
  • --base-url URL - Custom API endpoint for local models
  • --no-baseline - Skip baseline comparison
  • -v, --verbose - Show per-test results
  • --log-runs / --no-log-runs - Log run data (default: enabled)
  • --runs-dir PATH - Directory for run logs

Examples:

# Basic evaluation with baseline comparison
upskill eval ./skills/my-skill/

# With verbose output
upskill eval ./skills/my-skill/ -v

# Custom test cases
upskill eval ./skills/my-skill/ --tests ./tests.json

# Evaluate on specific model
upskill eval ./skills/my-skill/ -m haiku

# Multi-model benchmarking (compare models)
upskill eval ./skills/my-skill/ -m haiku -m sonnet

# Multiple runs per model to smooth out run-to-run variance
upskill eval ./skills/my-skill/ -m haiku -m sonnet --runs 5

# Evaluate on local model (llama.cpp server)
upskill eval ./skills/my-skill/ \
    -m "unsloth/GLM-4.7-Flash-GGUF:Q4_0" \
    --base-url http://localhost:8080/v1

# Skip baseline (just test with skill)
upskill eval ./skills/my-skill/ --no-baseline

# Disable run logging
upskill eval ./skills/my-skill/ --no-log-runs

Benchmark output:

Evaluating my-skill across 2 model(s)
  3 test case(s), 5 run(s) per model

haiku
  Pass rate: 4/5 (80%)  Avg assertions: 2.8/3

sonnet
  Pass rate: 5/5 (100%)  Avg assertions: 3.0/3

┏━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Model  ┃ Pass Rate ┃ Avg Assertions ┃ Avg Tokens ┃
┡━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ haiku  │ 4/5       │ 2.8/3          │ 1250       │
│ sonnet │ 5/5       │ 3.0/3          │ 1890       │
└────────┴───────────┴────────────────┴────────────┘

Test cases JSON format:

[
  {"input": "Write a commit for adding login", "expected": {"contains": ["feat", "login"]}},
  {"input": "Fix the null pointer bug", "expected": {"contains": ["fix", "bug"]}}
]
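
A test file in this shape can be built and sanity-checked from Python before passing it to --tests. The sketch below only checks the structure shown above (an "input" string and an "expected" object with a "contains" list). The validator name is hypothetical, and upskill may accept other expectation keys that this check would reject.

```python
import json


def validate_test_cases(cases: list) -> list:
    """Return a list of problems found; an empty list means the shape matches."""
    problems = []
    for i, case in enumerate(cases):
        if not isinstance(case, dict):
            problems.append(f"case {i}: not an object")
            continue
        if not isinstance(case.get("input"), str):
            problems.append(f"case {i}: 'input' must be a string")
        expected = case.get("expected")
        contains = expected.get("contains") if isinstance(expected, dict) else None
        if not (isinstance(contains, list) and all(isinstance(s, str) for s in contains)):
            problems.append(f"case {i}: 'expected.contains' must be a list of strings")
    return problems


cases = [
    {"input": "Write a commit for adding login", "expected": {"contains": ["feat", "login"]}},
    {"input": "Fix the null pointer bug", "expected": {"contains": ["fix", "bug"]}},
]

problems = validate_test_cases(cases)
if not problems:
    # Save this output as tests.json to use it with `upskill eval ... --tests ./tests.json`
    print(json.dumps(cases, indent=2))
```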

upskill list

List all generated skills in a tree view.

upskill list [OPTIONS]

Options:

  • -d, --dir PATH - Skills directory to list
  • -v, --verbose - Show skill contents preview

Examples:

# List skills in default directory
upskill list

# List from custom directory
upskill list -d ./my-skills/

# Show preview of skill contents
upskill list -v

Output:

./skills
├── git-commit-messages
│   ├── Write clear, conventional commit messages...
│   └── files
│       └── SKILL.md
├── api-error-handling
│   ├── Handle API errors gracefully with proper logging...
│   └── files
│       ├── SKILL.md
│       └── references/error-codes.md
└── yaml-parsing
    ├── Parse YAML files safely with schema validation...
    └── files
        ├── SKILL.md
        └── scripts/validate.py

upskill runs

View run results as a plot, or export to CSV. By default, shows a visual comparison of baseline vs with-skill performance.

upskill runs [OPTIONS]

Options:

  • -d, --dir PATH - Runs directory
  • -s, --skill TEXT - Filter by skill name(s) (repeatable)
  • -m, --model TEXT - Filter by model(s) (repeatable)
  • --metric [success|tokens] - Metric to display (default: success)
  • --csv PATH - Export to CSV instead of plot

Examples:

# View results plot (default)
upskill runs

# Filter by skill and models
upskill runs -s my-skill -m haiku -m sonnet

# Show token usage instead of success rate
upskill runs --metric tokens

# Export to CSV
upskill runs --csv ./results.csv

# Custom runs directory
upskill runs -d ./my-runs/

Plot output:

skill: git-commit-messages

haiku
  baseline   ████████████░░░░░░░░   60%
  with skill ████████████████░░░░   80%  (+20%)

sonnet
  baseline   ████████████░░░░░░░░   60%
  with skill ████████████████████  100%  (+40%)

Matrix view (multiple skills and models):

┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ skill               ┃ haiku        ┃ sonnet       ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ git-commit-messages │ 60%→80%      │ 60%→100%     │
│ api-error-handling  │ 40%→70%      │ 50%→90%      │
│ yaml-parsing        │ 70%→90%      │ 80%→100%     │
└─────────────────────┴──────────────┴──────────────┘

Skill Output Format

Skills are saved in a standard directory format:

./skills/{skill-name}/
├── SKILL.md          # Main skill instructions
├── references/       # Supporting documents (optional)
└── scripts/          # Executable scripts (optional)
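
Because the layout is plain files and folders, a skill directory can be inspected with the standard library alone. The sketch below summarises one directory against the layout above. The helper names are hypothetical, and treating SKILL.md as the only required entry is an assumption based on the optional markers in the tree.

```python
import tempfile
from pathlib import Path


def _names(directory: Path) -> list:
    """Sorted entry names of a directory, or an empty list if it is absent."""
    return sorted(p.name for p in directory.iterdir()) if directory.is_dir() else []


def check_skill_dir(skill_dir: Path) -> dict:
    """Summarise a skill directory that follows the layout above."""
    return {
        "has_skill_md": (skill_dir / "SKILL.md").is_file(),  # assumed required
        "references": _names(skill_dir / "references"),      # optional
        "scripts": _names(skill_dir / "scripts"),            # optional
    }


# Demo against a throwaway directory rather than a real ./skills tree
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "api-error-handling"
    (skill / "references").mkdir(parents=True)
    (skill / "SKILL.md").write_text("# api-error-handling\n", encoding="utf-8")
    (skill / "references" / "error-codes.md").write_text("", encoding="utf-8")
    summary = check_skill_dir(skill)

print(summary)
```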

Example SKILL.md:

# git-commit-messages

Write clear, conventional commit messages that follow best practices.

## Instructions

This skill teaches how to write effective git commit messages
following the Conventional Commits specification.

## Format

Commit messages should follow this structure:

<type>(<scope>): <subject>

<body>

<footer>

## Types

- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation changes
...

## Examples

### Simple feature commit
feat(auth): add password reset functionality

### Bug fix with explanation
fix(api): handle null response from user service

The user service can return null when not found.
Added proper null checking to prevent crashes.

Closes #123

Run Logging

By default, upskill logs all runs to ./runs/. Each batch of runs creates:

./runs/
├── 2025_01_21_15_30/           # Batch folder (timestamp)
│   ├── run_1/
│   │   ├── run_metadata.json   # Model, task, timing
│   │   └── run_result.json     # Pass/fail, assertions, tokens
│   ├── run_2/
│   │   └── ...
│   └── batch_summary.json      # Aggregate results
└── results.csv                 # Summary CSV (after `upskill runs`)

Disable with --no-log-runs.
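
The run log tree can be walked with the standard library when you want to post-process results yourself. The sketch below lists batch folders and their run folders. The function name is hypothetical, and the timestamp pattern `%Y_%m_%d_%H_%M` is inferred from the example folder name above, so treat it as an assumption. It does not read the JSON files, because their fields are not documented here.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def list_batches(runs_dir: Path) -> list:
    """Return (timestamp, run folder names) for each batch folder, oldest first."""
    batches = []
    for batch in runs_dir.iterdir():
        if not batch.is_dir():
            continue  # e.g. results.csv
        try:
            stamp = datetime.strptime(batch.name, "%Y_%m_%d_%H_%M")
        except ValueError:
            continue  # not a timestamped batch folder
        # Names are sorted lexically, so run_10 would sort before run_2
        runs = sorted(p.name for p in batch.iterdir()
                      if p.is_dir() and p.name.startswith("run_"))
        batches.append((stamp, runs))
    return sorted(batches)


# Demo against a throwaway tree shaped like the one above
with tempfile.TemporaryDirectory() as tmp:
    runs_root = Path(tmp) / "runs"
    for run in ("run_1", "run_2"):
        (runs_root / "2025_01_21_15_30" / run).mkdir(parents=True)
    (runs_root / "results.csv").write_text("", encoding="utf-8")
    batches = list_batches(runs_root)

for stamp, run_names in batches:
    print(stamp.isoformat(), run_names)
```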

Configuration

upskill config (~/.config/upskill/config.yaml)

model: sonnet                    # Default generation model
eval_model: haiku               # Default evaluation model (optional)
skills_dir: ./skills            # Where to save skills
runs_dir: ./runs                # Where to save run logs
max_refine_attempts: 3          # Refinement iterations

FastAgent config (fastagent.config.yaml)

Place in your project directory to customize FastAgent settings:

default_model: sonnet

logger:
  progress_display: true
  show_chat: false
  streaming: markdown

# MCP servers (optional)
mcp:
  servers:
    fetch:
      command: "uvx"
      args: ["mcp-server-fetch"]

Environment Variables

# Required for Anthropic models
ANTHROPIC_API_KEY=sk-ant-...

# Required for OpenAI models
OPENAI_API_KEY=sk-...

# Optional: custom endpoints
ANTHROPIC_BASE_URL=http://localhost:8080
OPENAI_API_BASE=http://localhost:11434/v1

# For local models (generic provider)
GENERIC_BASE_URL=http://localhost:8080/v1
GENERIC_API_KEY=local  # Optional, defaults to "local"
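
A script that wraps upskill can check for the required key before starting a long run. The sketch below uses only the variable names listed above. The function and mapping names are hypothetical, and upskill performs its own checks independently of this.

```python
import os
from typing import Mapping, Optional

# Variable names taken from the list above; the generic provider's key is optional
REQUIRED_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def missing_api_key(provider: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the unset API key variable for a provider, or None."""
    var = REQUIRED_KEYS.get(provider)
    if var is None:
        return None  # e.g. "generic": GENERIC_API_KEY defaults to "local"
    return None if env.get(var) else var


# Pass an explicit mapping to check a hypothetical environment
print(missing_api_key("anthropic", {}))                           # ANTHROPIC_API_KEY
print(missing_api_key("openai", {"OPENAI_API_KEY": "sk-..."}))    # None
```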

Python API

import asyncio

from upskill import (
    generate_skill,
    generate_tests,
    evaluate_skill,
    refine_skill,
    Config,
)
from upskill.evaluate import get_failure_descriptions


async def main():
    # Load configuration
    config = Config.load()

    # Generate a skill
    skill = await generate_skill(
        "parse JSON Schema files",
        model="sonnet",
        config=config,
    )

    # Generate test cases
    tests = await generate_tests("parse JSON Schema files")

    # Evaluate the skill
    results = await evaluate_skill(
        skill,
        tests,
        model="haiku",
        config=config,
    )

    print(f"Skill lift: {results.skill_lift:.0%}")
    print(f"Token savings: {results.token_savings:.0%}")
    print(f"Is beneficial: {results.is_beneficial}")

    # Refine based on failures
    if not results.is_beneficial:
        failures = get_failure_descriptions(results)
        improved_skill = await refine_skill(skill, failures)


# The calls above are awaited, so they must run inside an event loop
asyncio.run(main())

Model Format

upskill uses the FastAgent model format:

<provider>.<model>.<reasoning_effort?>

Examples:

  • sonnet - Anthropic Claude Sonnet (alias)
  • haiku - Anthropic Claude Haiku (alias)
  • opus - Anthropic Claude Opus (alias)
  • anthropic.claude-sonnet-4-20250514 - Full model name
  • openai.gpt-4.1 - OpenAI GPT-4.1
  • openai.o3-mini.low - OpenAI o3-mini with low reasoning effort
  • generic.llama3.2:latest - Local model via Ollama
  • generic.my-model - Local model via llama.cpp or other OpenAI-compatible server
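
Since model names can themselves contain dots (gpt-4.1, llama3.2:latest), splitting the string naively on "." gives wrong results. The sketch below shows one way to read the examples above: peel off a leading provider only if it is a known one, and a trailing segment only if it looks like a reasoning effort. This is not FastAgent's parser. The names are hypothetical, the provider set comes from the --provider option, and the effort set is an assumption (only `low` appears in this document).

```python
from typing import NamedTuple, Optional

KNOWN_PROVIDERS = {"anthropic", "openai", "generic"}
KNOWN_EFFORTS = {"low", "medium", "high"}  # assumed; only "low" is shown above


class ModelSpec(NamedTuple):
    provider: Optional[str]
    model: str
    reasoning_effort: Optional[str]


def parse_model(spec: str) -> ModelSpec:
    """Split '<provider>.<model>.<reasoning_effort?>' without breaking dotted model names."""
    provider = None
    effort = None
    head, sep, rest = spec.partition(".")
    if sep and head in KNOWN_PROVIDERS:
        provider, spec = head, rest
    body, sep, tail = spec.rpartition(".")
    if sep and tail in KNOWN_EFFORTS:
        spec, effort = body, tail
    return ModelSpec(provider, spec, effort)


for example in ("sonnet", "openai.gpt-4.1", "openai.o3-mini.low", "generic.llama3.2:latest"):
    print(example, "->", tuple(parse_model(example)))
```

Aliases such as sonnet come back with no provider, since resolving an alias to a provider is left to the tool.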

Local Models

upskill supports local models through any OpenAI-compatible endpoint (Ollama, llama.cpp, vLLM, etc.).

Quick start with Ollama:

# Start Ollama (default port 11434)
ollama serve

# Evaluate with a local model
upskill eval ./skills/my-skill/ \
    --model llama3.2:latest \
    --base-url http://localhost:11434/v1

With llama.cpp server:

# Start llama.cpp server
./llama-server -m model.gguf --port 8080

# Evaluate with the local model
upskill eval ./skills/my-skill/ \
    --model my-model \
    --base-url http://localhost:8080/v1

When --base-url is provided, the provider is automatically set to generic unless you specify --provider explicitly.
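
The precedence described above can be written as a small function. This is a sketch of the stated rule only, with a hypothetical function name, and it is not upskill's actual implementation.

```python
from typing import Optional


def resolve_provider(provider: Optional[str], base_url: Optional[str]) -> Optional[str]:
    """Apply the rule above: explicit --provider wins, else --base-url implies generic."""
    if provider is not None:
        return provider    # explicit --provider always wins
    if base_url is not None:
        return "generic"   # --base-url alone implies the generic provider
    return None            # neither given: left to the tool's own defaults


print(resolve_provider(None, "http://localhost:11434/v1"))      # generic
print(resolve_provider("openai", "http://localhost:8080/v1"))   # openai
```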
