Skillquality 0.45

Agent Evaluation Framework Builder

Designs an eval suite for an LLM agent or pipeline including success metrics, trajectory scoring, LLM-as-judge setup, and regression test cases.

Price
free
Protocol
skill
Verified
no

What it does

Agent Evaluation Framework Builder

What this skill does

This skill designs an evaluation framework for an LLM agent or pipeline. Most teams skip evals until something breaks in production — this skill helps you build evals before launch so you have a baseline, catch regressions, and measure quality improvements objectively. It covers dataset construction, metric selection, LLM-as-judge setup, and CI integration.

How to use

Claude Code / Cline

Copy this file to .agents/skills/agent-eval-framework-builder/SKILL.md in your project root.

Then ask:

  • "Use the Agent Eval Framework Builder to design evals for our support chatbot."
  • "Build an evaluation suite for our RAG pipeline."

Provide:

  • What the agent does
  • What "good output" looks like
  • Sample inputs (5–10 examples if available)
  • Whether you have ground-truth answers or need to generate them

Cursor / Codex

Describe the agent and its task alongside these instructions.

The Prompt / Instructions for the Agent

When asked to build an evaluation framework, produce the following:

Step 1 — Choose the right eval type

Agent TaskEval TypeReason
Factual Q&A with known answersExact match / F1Ground truth available
Summarization, draftingLLM-as-judgeNo single right answer
Code generationUnit test executionCorrectness is verifiable
Multi-step agent taskTrajectory scoringNeed to evaluate the path, not just the endpoint
Classification / routingAccuracy, F1Categorical output
RAG retrievalRecall@K, MRRMeasure retrieval quality separately

Use multiple eval types for complex agents: trajectory scoring + LLM-as-judge output quality.

Step 2 — Build the evaluation dataset

Minimum viable eval dataset: 50 examples covering:

  • 40% typical cases (what users actually ask)
  • 30% edge cases (ambiguous, multi-part, or unusual queries)
  • 20% adversarial cases (jailbreak attempts, out-of-scope requests)
  • 10% regression cases (bugs you've fixed in the past)

Generating eval data when you don't have ground truth:

# Use a stronger model to generate expected outputs
def generate_ground_truth(inputs: list[str], system_prompt: str) -> list[dict]:
    results = []
    for inp in inputs:
        response = strong_model.invoke([
            SystemMessage(content=system_prompt),
            HumanMessage(content=inp)
        ])
        results.append({"input": inp, "expected": response.content})
    return results

Have a human review at least 20% of generated ground truth before using it.

Step 3 — Set up LLM-as-judge

For open-ended outputs (summaries, drafts, agent responses):

JUDGE_PROMPT = """You are evaluating an AI assistant's response.

Task: {task_description}
Input: {input}
Expected behavior: {criteria}
Actual response: {actual_response}

Score the response on each dimension (1-5):
- Correctness: Does it answer the question accurately?
- Completeness: Does it cover all required aspects?
- Conciseness: Is it appropriately brief without omitting key information?
- Safety: Does it avoid harmful, biased, or inappropriate content?

Respond in JSON: {{"correctness": N, "completeness": N, "conciseness": N, "safety": N, "overall": N, "reasoning": "..."}}"""

def llm_judge(input: str, actual: str, criteria: str) -> dict:
    response = judge_model.invoke(JUDGE_PROMPT.format(
        task_description=TASK_DESCRIPTION,
        input=input,
        criteria=criteria,
        actual_response=actual
    ))
    return json.loads(response.content)

LLM-as-judge best practices:

  • Use a different (ideally stronger) model than the one being evaluated
  • Always ask for reasoning alongside the score — it catches judge errors
  • Run each eval 3 times and average scores — LLM judges have variance
  • Calibrate: manually score 20 examples and check if the judge agrees ≥80%

Step 4 — Trajectory evaluation for agents

For multi-step agents, evaluate the path taken, not just the final answer:

def evaluate_trajectory(expected_steps: list[str], actual_steps: list[str]) -> dict:
    """Compare the agent's action sequence to the expected sequence."""
    # Check if required steps are present (order-agnostic)
    required_present = all(step in actual_steps for step in expected_steps)

    # Check for unnecessary detours
    extra_steps = [s for s in actual_steps if s not in expected_steps]
    efficiency = len(expected_steps) / max(len(actual_steps), 1)

    return {
        "required_steps_completed": required_present,
        "efficiency_score": efficiency,
        "unnecessary_steps": extra_steps
    }

Key trajectory metrics:

  • Step completion rate: % of required steps taken
  • Efficiency: expected steps / actual steps (1.0 = optimal)
  • Tool misuse rate: % of tool calls that were incorrect or unnecessary
  • Recovery rate: % of error states the agent correctly recovered from

Step 5 — Write the eval harness

import json
from dataclasses import dataclass

@dataclass
class EvalResult:
    input: str
    expected: str
    actual: str
    scores: dict
    passed: bool

def run_eval_suite(agent, dataset: list[dict], threshold: float = 3.5) -> dict:
    results = []
    for case in dataset:
        actual = agent.invoke(case["input"])
        scores = llm_judge(case["input"], actual, case.get("criteria", ""))
        passed = scores["overall"] >= threshold
        results.append(EvalResult(
            input=case["input"],
            expected=case.get("expected", ""),
            actual=actual,
            scores=scores,
            passed=passed
        ))

    pass_rate = sum(r.passed for r in results) / len(results)
    avg_score = sum(r.scores["overall"] for r in results) / len(results)

    return {
        "pass_rate": pass_rate,
        "average_score": avg_score,
        "total": len(results),
        "passed": sum(r.passed for r in results),
        "results": results
    }

Step 6 — CI integration

Add eval runs to your CI pipeline to catch regressions:

# .github/workflows/eval.yml
name: Agent Evals
on:
  pull_request:
    paths: ['prompts/**', 'agents/**']

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run eval suite
        run: python run_evals.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check pass rate
        run: |
          PASS_RATE=$(cat eval_results.json | jq '.pass_rate')
          if (( $(echo "$PASS_RATE < 0.85" | bc -l) )); then
            echo "Eval pass rate $PASS_RATE below threshold 0.85"
            exit 1
          fi

Gate merges on: pass rate ≥ 85% and no regression on existing test cases.

Metrics dashboard to track over time

MetricWhat it measuresTarget
Pass rate% cases meeting quality threshold≥ 85%
Average judge scoreMean quality across all cases≥ 3.8/5
Regression rate% previously-passing cases now failing0%
Tool accuracy% correct tool selections by agent≥ 90%
Latency p9595th percentile response time< 8s

Capabilities

skillsource-notysotyskill-agent-eval-framework-buildertopic-agent-skillstopic-claudetopic-claude-codetopic-claude-skillstopic-clinetopic-cursortopic-llmtopic-llm-skillstopic-skills

Install

Quality

0.45/ 1.00

deterministic score 0.45 from registry signals: · indexed on github topic:agent-skills · 8 github stars · SKILL.md body (7,172 chars)

Provenance

Indexed fromgithub
Enriched2026-05-18 19:13:19Z · deterministic:skill-github:v1 · v1
First seen2026-05-18
Last seen2026-05-18

Agent access