
ai-evals

Create an AI Evals Pack (eval PRD, test set, rubric, judge plan, results + iteration loop). See also: building-with-llms (build), ai-product-strategy (strategy).

Price: free
Protocol: skill
Verified: no

What it does

AI Evals

Scope

Covers

  • Designing evaluations (“evals”) for LLM/AI features as an execution contract: what “good” means and how it’s measured
  • Converting failures into a golden test set + error taxonomy + rubric
  • Choosing a judging approach (human, LLM-as-judge, automated checks) and a repeatable harness/runbook
  • Producing decision-ready results and an iteration loop (every bug becomes a new test)

When to use

  • “Design evals for this LLM feature so we can ship with confidence.”
  • “Create a rubric + golden set + benchmark for our AI assistant/copilot.”
  • “We’re seeing flaky quality—do error analysis and turn it into a repeatable eval.”
  • “Compare prompts/models safely with a clear acceptance threshold.”

When NOT to use

  • You need to decide what to build (use problem-definition, building-with-llms, or ai-product-strategy).
  • You’re primarily doing traditional non-LLM software testing (use your standard eng QA/unit/integration tests).
  • You want model training research or infra design (this skill assumes API/model usage; delegate to ML/infra).
  • You only want vendor/model selection with no defined task + data (use evaluating-new-technology first, then come back with a concrete use case).
  • You want to measure overall product-market fit or retention, not AI output quality (use measuring-product-market-fit).
  • You need a product requirements document that includes but goes beyond eval design (use writing-prds).

Inputs

Minimum required

  • System under test (SUT): what the AI does, for whom, in what workflow (inputs → outputs)
  • The decision the eval must support (ship/no-ship, compare options, regression gate)
  • What “good” means: 3–10 target behaviors + top failure modes
  • Constraints: privacy/compliance, safety policy, languages, cost/latency budgets, timeline

Missing-info strategy

  • Ask up to 5 questions from references/INTAKE.md (3–5 at a time).
  • If details remain missing, proceed with explicit assumptions and provide 2–3 viable options (judge type, scoring scheme, dataset size).
  • If asked to run code or generate datasets from sensitive sources, request confirmation and apply least privilege (no secrets; redact/anonymize).

Outputs (deliverables)

Produce an AI Evals Pack (in chat; or as files if requested), in this order:

  1. Eval PRD (evaluation requirements): decision, scope, target behaviors, success metrics, acceptance thresholds
  2. Test set spec + initial golden set: schema, coverage plan, and a starter set of cases (tagged by scenario/risk)
  3. Error taxonomy (from error analysis + open coding): failure modes, severity, examples
  4. Rubric + judging guide: dimensions, scoring scale, definitions, examples, tie-breakers
  5. Judge + harness plan: human vs LLM-as-judge vs automated checks, prompts/instructions, calibration, runbook, cost/time estimate
  6. Reporting + iteration loop: baseline results format, regression policy, how new bugs become new tests
  7. Risks / Open questions / Next steps (always included)

Templates: references/TEMPLATES.md

Workflow (7 steps)

1) Define the decision and write the Eval PRD

  • Inputs: SUT description, stakeholders, decision to support.
  • Actions: Define the decision (ship/no-ship, compare A vs B), scope/non-goals, target behaviors, acceptance thresholds, and what must never happen.
  • Outputs: Draft Eval PRD (template in references/TEMPLATES.md).
  • Checks: A stakeholder can restate what is being measured, why, and what “pass” means.

2) Draft the golden set structure + coverage plan

  • Inputs: User workflows, edge cases, safety risks, data availability.
  • Actions: Specify the test case schema, tagging, and coverage targets (happy paths, tricky paths, adversarial/safety, long-tail). Create an initial starter set (small but high-signal). A schema sketch follows this list.
  • Outputs: Test set spec + initial golden set.
  • Checks: Every target behavior has at least 2 test cases; high-severity risks are explicitly represented.
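
A minimal sketch of one possible test-case schema, in Python for concreteness. The field names, tags, and example cases are illustrative assumptions, not something the skill prescribes:

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    # Illustrative schema; adapt the fields to your SUT.
    case_id: str                  # stable ID so results stay comparable across runs
    input: str                    # prompt/payload sent to the system under test
    expected: str                 # reference answer or behavioral expectation
    tags: list[str] = field(default_factory=list)  # coverage tags, e.g. ["happy-path"]
    severity: str = "normal"      # "normal" | "high"; high-severity risks must appear in the set

# Starter set: small but high-signal, with adversarial/safety cases represented.
golden_set = [
    TestCase("cs-001", "Refund request with a valid order number",
             "Drafts a reply that cites the refund-policy KB article", ["happy-path"]),
    TestCase("cs-017", "Request to reveal another customer's address",
             "Refuses and explains why", ["adversarial", "safety"], "high"),
]
```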

3) Run error analysis and open coding to build a taxonomy

  • Inputs: Known failures, logs, stakeholder anecdotes, initial golden set.
  • Actions: Review failures, label them with open coding, consolidate into a taxonomy, and assign severity/impact. Identify likely root causes (prompting, missing context, tool misuse, formatting, policy). A taxonomy sketch follows this list.
  • Outputs: Error taxonomy + “top failure modes” list.
  • Checks: PM and eng read the taxonomy the same way; each category has 1–2 concrete examples.
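
A sketch of what the consolidated taxonomy can look like as a shared artifact. The categories, severities, and root causes here are hypothetical; the real ones come out of your open coding:

```python
# Hypothetical taxonomy entries from open coding. Each category carries a
# severity and at least one concrete example, per the checks above.
ERROR_TAXONOMY = {
    "hallucinated_citation": {
        "severity": "high",
        "likely_root_cause": "missing context",
        "examples": ["Reply cited KB-042, which does not exist"],
    },
    "format_violation": {
        "severity": "medium",
        "likely_root_cause": "prompting/formatting",
        "examples": ["Returned prose instead of the required JSON object"],
    },
}
```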

4) Convert taxonomy → rubric + scoring rules

  • Inputs: Taxonomy, target behaviors, output formats.
  • Actions: Define scoring dimensions and scales; write clear judge instructions and tie-breakers; add examples and disallowed behaviors. Decide absolute scoring vs pairwise comparisons. A rubric sketch follows this list.
  • Outputs: Rubric + judging guide.
  • Checks: Two independent judges would likely score the same case similarly (instructions are specific, not vibes).
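
A sketch of one anchored scoring dimension (wording is illustrative). Each scale point gets a behavioral anchor and there is an explicit tie-breaker, in contrast to the “vibes-based rubrics” anti-pattern listed later:

```python
# Illustrative rubric entry: every scale point has a behavioral anchor,
# so two independent judges should land on the same score.
RUBRIC = {
    "groundedness": {
        "scale": {
            3: "Every factual claim is supported by a cited KB article",
            2: "Claims are correct, but at least one lacks a citation",
            1: "At least one claim is contradicted by or absent from the KB",
        },
        "tie_breaker": "When torn between 2 and 3, check citations first; "
                       "a missing citation caps the score at 2",
    },
}
```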

5) Choose the judging approach + harness/runbook

  • Inputs: Constraints (time/cost), required reliability, privacy/safety constraints.
  • Actions: Pick judge type(s): human, LLM-as-judge, automated checks. Define calibration (gold examples, inter-rater checks), sampling, and how results are stored. Write a runbook with estimated runtime/cost. A harness sketch follows this list.
  • Outputs: Judge + harness plan.
  • Checks: The plan is repeatable (versioned prompts/models, deterministic settings where possible, clear data handling).
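
A minimal harness sketch under assumed interfaces: `run_sut` and `judge` are stand-ins for your system under test and whichever judge type you chose (human queue, LLM-as-judge call, or automated check), and `TestCase` reuses the schema sketch from step 2. The point is versioned config plus stored, replayable results:

```python
import json
from typing import Callable

def run_eval(
    cases: list[TestCase],
    run_sut: Callable[[str], str],           # assumed interface: input -> SUT output
    judge: Callable[[TestCase, str], dict],  # returns e.g. {"score": 2, "rationale": "..."}
    config: dict,                            # versioned: prompt/model/rubric versions, temperature
    out_path: str = "results.jsonl",
) -> None:
    with open(out_path, "w") as f:
        for case in cases:
            output = run_sut(case.input)
            verdict = judge(case, output)
            # Store everything needed to reproduce the run or re-judge it later.
            record = {"case_id": case.case_id, "tags": case.tags,
                      "output": output, "config": config, **verdict}
            f.write(json.dumps(record) + "\n")
```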

6) Define reporting, thresholds, and the iteration loop

  • Inputs: Stakeholder needs, release cadence.
  • Actions: Specify the report format (overall + per-tag metrics), regression rules, and what changes require re-running evals. Define the iteration loop: every discovered failure becomes a new test + taxonomy update. A reporting sketch follows this list.
  • Outputs: Reporting + iteration loop.
  • Checks: A reader can make a decision from the report without additional meetings; regressions are detectable.
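
A sketch of decision-ready reporting plus a regression rule. The pass score and tolerance are placeholders; the real numbers belong in the Eval PRD:

```python
import json
from collections import defaultdict

def report(results_path: str, pass_score: int = 3) -> dict[str, float]:
    # Per-tag pass rates: reporting overall AND per-tag means a regression in
    # "adversarial" cases cannot hide inside a healthy average.
    passed, total = defaultdict(int), defaultdict(int)
    with open(results_path) as f:
        for line in f:
            r = json.loads(line)
            for tag in r["tags"] + ["overall"]:
                total[tag] += 1
                passed[tag] += r["score"] >= pass_score
    return {tag: passed[tag] / total[tag] for tag in total}

def regression_gate(current: dict[str, float], baseline: dict[str, float],
                    tolerance: float = 0.02) -> bool:
    # Placeholder rule: fail if any tag drops more than `tolerance`
    # below the baseline run.
    return all(current.get(t, 0.0) >= baseline[t] - tolerance for t in baseline)
```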

7) Quality gate + finalize

  • Inputs: Full draft pack.
  • Actions: Run references/CHECKLISTS.md and score with references/RUBRIC.md. Fix missing coverage, vague rubric language, or non-repeatable harness steps. Always include Risks / Open questions / Next steps.
  • Outputs: Final AI Evals Pack.
  • Checks: The eval definition functions as a product requirement: clear, testable, and actionable.

Quality gate (required)

Run references/CHECKLISTS.md and score the pack against references/RUBRIC.md before delivery (see step 7); always include Risks / Open questions / Next steps.

Examples

Example 1 (answer quality + safety): “Use ai-evals to design evals for a customer-support reply drafting assistant. Constraints: no PII leakage, must cite KB articles, and must refuse unsafe requests. Output: AI Evals Pack.”

Example 2 (structured extraction): “Use ai-evals to create a rubric + golden set for an LLM that extracts invoice fields to JSON. Constraints: must always return valid JSON; prioritize recall for amount and due_date. Output: AI Evals Pack.”
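
For a structured-extraction task like Example 2, much of the judging can be automated checks rather than human or LLM judges. A sketch, with field names taken from the example and the scoring logic assumed:

```python
import json

RECALL_FIELDS = ["amount", "due_date"]  # recall-priority fields per the example's constraints

def check_extraction(raw_output: str, expected: dict) -> dict:
    # Hard constraint from the example: the output must always be valid JSON.
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"valid_json": False, "recall": 0.0}
    # Count how many recall-critical fields were extracted with the right value.
    hits = sum(parsed.get(k) == expected[k] for k in RECALL_FIELDS if k in expected)
    return {"valid_json": True, "recall": hits / len(RECALL_FIELDS)}
```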

Boundary example: “We don’t know what the AI feature should do yet—just ‘add AI’ and pick a model.” Response: out of scope; first define the job/spec and success metrics (use problem-definition or building-with-llms), then return to ai-evals with a concrete SUT.

Boundary example 2: “Our AI feature is live but users aren’t retaining. Help me figure out product-market fit.” Response: retention and PMF are business-level metrics, not eval-level metrics. Use measuring-product-market-fit for that analysis. Return to ai-evals if the issue is specifically AI output quality, accuracy, or safety.

Anti-patterns (common failure modes)

  1. Happy-path-only test sets: Building a golden set that only covers ideal inputs and expected behavior. The eval misses adversarial inputs, edge cases, and safety-critical scenarios.
  2. Vibes-based rubrics: Writing scoring criteria like “the response should be good” or “helpful and accurate” without concrete behavioral anchors, examples, or tie-breakers. Two judges will disagree on every case.
  3. One-shot eval: Running the eval once, celebrating a pass rate, and never re-running. Evals must be repeatable with a regression policy and an iteration loop that turns new failures into new tests.
  4. Ignoring judge calibration: Using LLM-as-judge without calibrating against human judgments on gold examples. Uncalibrated judges produce confident but unreliable scores.
  5. Metric without decision rule: Tracking accuracy or pass rate without defining what score triggers ship/no-ship/revise. Metrics without thresholds do not support decisions (see the sketch after this list).
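
The antidote to anti-pattern 5 is a decision rule agreed before the eval runs. A sketch with placeholder thresholds:

```python
def decide(pass_rate: float, safety_pass_rate: float) -> str:
    # Placeholder thresholds; the real values belong in the Eval PRD.
    if safety_pass_rate < 1.0:
        return "no-ship: all safety cases must pass"
    if pass_rate >= 0.90:
        return "ship"
    if pass_rate >= 0.80:
        return "revise and re-run"
    return "no-ship"
```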

Capabilities

skill · source-liqiongyu · skill-ai-evals · topic-agent-skills · topic-ai-agents · topic-automation · topic-claude · topic-codex · topic-prompt-engineering · topic-refoundai · topic-skillpack

Install

Install: npx skills add liqiongyu/lenny_skills_plus
Transport: skills-sh
Protocol: skill

Quality

0.47 / 1.00

Deterministic score 0.47 from registry signals: indexed on GitHub topic:agent-skills · 49 GitHub stars · SKILL.md body (8,914 chars)

Provenance

Indexed from: github
Enriched: 2026-04-22 00:56:20Z · deterministic:skill-github:v1 · v1
First seen: 2026-04-18
Last seen: 2026-04-22
