
ai-evals

Create an AI Evals Pack (eval PRD, test set, rubric, judge plan, results + iteration loop). See also: building-with-llms (build), ai-product-strategy (strategy).

Price: free
Protocol: skill
Verified: no

What it does

AI Evals

Scope

Covers

  • Designing evaluations (“evals”) for LLM/AI features as an execution contract: what “good” means and how it’s measured
  • Converting failures into a golden test set + error taxonomy + rubric
  • Choosing a judging approach (human, LLM-as-judge, automated checks) and a repeatable harness/runbook
  • Producing decision-ready results and an iteration loop (every bug becomes a new test)

When to use

  • “Design evals for this LLM feature so we can ship with confidence.”
  • “Create a rubric + golden set + benchmark for our AI assistant/copilot.”
  • “We’re seeing flaky quality—do error analysis and turn it into a repeatable eval.”
  • “Compare prompts/models safely with a clear acceptance threshold.”

When NOT to use

  • You need to decide what to build (use problem-definition, building-with-llms, or ai-product-strategy).
  • You’re primarily doing traditional non-LLM software testing (use your standard eng QA/unit/integration tests).
  • You want model training research or infra design (this skill assumes API/model usage; delegate to ML/infra).
  • You only want vendor/model selection with no defined task + data (use evaluating-new-technology first, then come back with a concrete use case).
  • You want to measure overall product-market fit or retention, not AI output quality (use measuring-product-market-fit).
  • You need a product requirements document that includes but goes beyond eval design (use writing-prds).

Inputs

Minimum required

  • System under test (SUT): what the AI does, for whom, in what workflow (inputs → outputs)
  • The decision the eval must support (ship/no-ship, compare options, regression gate)
  • What “good” means: 3–10 target behaviors + top failure modes
  • Constraints: privacy/compliance, safety policy, languages, cost/latency budgets, timeline

Missing-info strategy

  • Ask up to 5 questions from references/INTAKE.md (3–5 at a time).
  • If details remain missing, proceed with explicit assumptions and provide 2–3 viable options (judge type, scoring scheme, dataset size).
  • If asked to run code or generate datasets from sensitive sources, request confirmation and apply least privilege (no secrets; redact/anonymize).

Outputs (deliverables)

Produce an AI Evals Pack (in chat; or as files if requested), in this order:

  1. Eval PRD (evaluation requirements): decision, scope, target behaviors, success metrics, acceptance thresholds
  2. Test set spec + initial golden set: schema, coverage plan, and a starter set of cases (tagged by scenario/risk)
  3. Error taxonomy (from error analysis + open coding): failure modes, severity, examples
  4. Rubric + judging guide: dimensions, scoring scale, definitions, examples, tie-breakers
  5. Judge + harness plan: human vs LLM-as-judge vs automated checks, prompts/instructions, calibration, runbook, cost/time estimate
  6. Reporting + iteration loop: baseline results format, regression policy, how new bugs become new tests
  7. Risks / Open questions / Next steps (always included)

Templates: references/TEMPLATES.md

Workflow (7 steps)

1) Define the decision and write the Eval PRD

  • Inputs: SUT description, stakeholders, decision to support.
  • Actions: Define the decision (ship/no-ship, compare A vs B), scope/non-goals, target behaviors, acceptance thresholds, and what must never happen.
  • Outputs: Draft Eval PRD (template in references/TEMPLATES.md).
  • Checks: A stakeholder can restate what is being measured, why, and what “pass” means.

2) Draft the golden set structure + coverage plan

  • Inputs: User workflows, edge cases, safety risks, data availability.
  • Actions: Specify the test case schema, tagging, and coverage targets (happy paths, tricky paths, adversarial/safety, long-tail). Create an initial starter set (small but high-signal). A schema sketch follows this list.
  • Outputs: Test set spec + initial golden set.
  • Checks: Every target behavior has at least 2 test cases; high-severity risks are explicitly represented.
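
A minimal sketch of one possible test-case schema, in Python for concreteness. The field names, tags, and example cases are illustrative assumptions, not something the skill prescribes:

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    # Illustrative schema; adapt the fields to your SUT.
    case_id: str                  # stable ID so results stay comparable across runs
    input: str                    # prompt/payload sent to the system under test
    expected: str                 # reference answer or behavioral expectation
    tags: list[str] = field(default_factory=list)  # coverage tags, e.g. ["happy-path"]
    severity: str = "normal"      # "normal" | "high"; high-severity risks must appear in the set

# Starter set: small but high-signal, with adversarial/safety cases represented.
golden_set = [
    TestCase("cs-001", "Refund request with a valid order number",
             "Drafts a reply that cites the refund-policy KB article", ["happy-path"]),
    TestCase("cs-017", "Request to reveal another customer's address",
             "Refuses and explains why", ["adversarial", "safety"], "high"),
]
```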

3) Run error analysis and open coding to build a taxonomy

  • Inputs: Known failures, logs, stakeholder anecdotes, initial golden set.
  • Actions: Review failures, label them with open coding, consolidate into a taxonomy, and assign severity/impact. Identify likely root causes (prompting, missing context, tool misuse, formatting, policy). A taxonomy sketch follows this list.
  • Outputs: Error taxonomy + “top failure modes” list.
  • Checks: PM and eng read the taxonomy the same way; each category has 1–2 concrete examples.
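
A sketch of what the consolidated taxonomy can look like as a shared artifact. The categories, severities, and root causes here are hypothetical; the real ones come out of your open coding:

```python
# Hypothetical taxonomy entries from open coding. Each category carries a
# severity and at least one concrete example, per the checks above.
ERROR_TAXONOMY = {
    "hallucinated_citation": {
        "severity": "high",
        "likely_root_cause": "missing context",
        "examples": ["Reply cited KB-042, which does not exist"],
    },
    "format_violation": {
        "severity": "medium",
        "likely_root_cause": "prompting/formatting",
        "examples": ["Returned prose instead of the required JSON object"],
    },
}
```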

4) Convert taxonomy → rubric + scoring rules

  • Inputs: Taxonomy, target behaviors, output formats.
  • Actions: Define scoring dimensions and scales; write clear judge instructions and tie-breakers; add examples and disallowed behaviors. Decide absolute scoring vs pairwise comparisons. A rubric sketch follows this list.
  • Outputs: Rubric + judging guide.
  • Checks: Two independent judges would likely score the same case similarly (instructions are specific, not vibes).
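
A sketch of one anchored scoring dimension (wording is illustrative). Each scale point gets a behavioral anchor and there is an explicit tie-breaker, in contrast to the “vibes-based rubrics” anti-pattern listed later:

```python
# Illustrative rubric entry: every scale point has a behavioral anchor,
# so two independent judges should land on the same score.
RUBRIC = {
    "groundedness": {
        "scale": {
            3: "Every factual claim is supported by a cited KB article",
            2: "Claims are correct, but at least one lacks a citation",
            1: "At least one claim is contradicted by or absent from the KB",
        },
        "tie_breaker": "When torn between 2 and 3, check citations first; "
                       "a missing citation caps the score at 2",
    },
}
```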

5) Choose the judging approach + harness/runbook

  • Inputs: Constraints (time/cost), required reliability, privacy/safety constraints.
  • Actions: Pick judge type(s): human, LLM-as-judge, automated checks. Define calibration (gold examples, inter-rater checks), sampling, and how results are stored. Write a runbook with estimated runtime/cost. A harness sketch follows this list.
  • Outputs: Judge + harness plan.
  • Checks: The plan is repeatable (versioned prompts/models, deterministic settings where possible, clear data handling).
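
A minimal harness sketch under assumed interfaces: `run_sut` and `judge` are stand-ins for your system under test and whichever judge type you chose (human queue, LLM-as-judge call, or automated check), and `TestCase` reuses the schema sketch from step 2. The point is versioned config plus stored, replayable results:

```python
import json
from typing import Callable

def run_eval(
    cases: list[TestCase],
    run_sut: Callable[[str], str],           # assumed interface: input -> SUT output
    judge: Callable[[TestCase, str], dict],  # returns e.g. {"score": 2, "rationale": "..."}
    config: dict,                            # versioned: prompt/model/rubric versions, temperature
    out_path: str = "results.jsonl",
) -> None:
    with open(out_path, "w") as f:
        for case in cases:
            output = run_sut(case.input)
            verdict = judge(case, output)
            # Store everything needed to reproduce the run or re-judge it later.
            record = {"case_id": case.case_id, "tags": case.tags,
                      "output": output, "config": config, **verdict}
            f.write(json.dumps(record) + "\n")
```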

6) Define reporting, thresholds, and the iteration loop

  • Inputs: Stakeholder needs, release cadence.
  • Actions: Specify the report format (overall + per-tag metrics), regression rules, and what changes require re-running evals. Define the iteration loop: every discovered failure becomes a new test + taxonomy update. A reporting sketch follows this list.
  • Outputs: Reporting + iteration loop.
  • Checks: A reader can make a decision from the report without additional meetings; regressions are detectable.
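
A sketch of decision-ready reporting plus a regression rule. The pass score and tolerance are placeholders; the real numbers belong in the Eval PRD:

```python
import json
from collections import defaultdict

def report(results_path: str, pass_score: int = 3) -> dict[str, float]:
    # Per-tag pass rates: reporting overall AND per-tag means a regression in
    # "adversarial" cases cannot hide inside a healthy average.
    passed, total = defaultdict(int), defaultdict(int)
    with open(results_path) as f:
        for line in f:
            r = json.loads(line)
            for tag in r["tags"] + ["overall"]:
                total[tag] += 1
                passed[tag] += r["score"] >= pass_score
    return {tag: passed[tag] / total[tag] for tag in total}

def regression_gate(current: dict[str, float], baseline: dict[str, float],
                    tolerance: float = 0.02) -> bool:
    # Placeholder rule: fail if any tag drops more than `tolerance`
    # below the baseline run.
    return all(current.get(t, 0.0) >= baseline[t] - tolerance for t in baseline)
```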

7) Quality gate + finalize

  • Inputs: Full draft pack.
  • Actions: Run references/CHECKLISTS.md and score with references/RUBRIC.md. Fix missing coverage, vague rubric language, or non-repeatable harness steps. Always include Risks / Open questions / Next steps.
  • Outputs: Final AI Evals Pack.
  • Checks: The eval definition functions as a product requirement: clear, testable, and actionable.

Quality gate (required)

Run references/CHECKLISTS.md and score the pack against references/RUBRIC.md before delivery (see step 7); always include Risks / Open questions / Next steps.

Examples

Example 1 (answer quality + safety): “Use ai-evals to design evals for a customer-support reply drafting assistant. Constraints: no PII leakage, must cite KB articles, and must refuse unsafe requests. Output: AI Evals Pack.”

Example 2 (structured extraction): “Use ai-evals to create a rubric + golden set for an LLM that extracts invoice fields to JSON. Constraints: must always return valid JSON; prioritize recall for amount and due_date. Output: AI Evals Pack.”
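
For a structured-extraction task like Example 2, much of the judging can be automated checks rather than human or LLM judges. A sketch, with field names taken from the example and the scoring logic assumed:

```python
import json

RECALL_FIELDS = ["amount", "due_date"]  # recall-priority fields per the example's constraints

def check_extraction(raw_output: str, expected: dict) -> dict:
    # Hard constraint from the example: the output must always be valid JSON.
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"valid_json": False, "recall": 0.0}
    # Count how many recall-critical fields were extracted with the right value.
    hits = sum(parsed.get(k) == expected[k] for k in RECALL_FIELDS if k in expected)
    return {"valid_json": True, "recall": hits / len(RECALL_FIELDS)}
```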

Boundary example: “We don’t know what the AI feature should do yet—just ‘add AI’ and pick a model.” Response: out of scope; first define the job/spec and success metrics (use problem-definition or building-with-llms), then return to ai-evals with a concrete SUT.

Boundary example 2: “Our AI feature is live but users aren’t retaining. Help me figure out product-market fit.” Response: retention and PMF are business-level metrics, not eval-level metrics. Use measuring-product-market-fit for that analysis. Return to ai-evals if the issue is specifically AI output quality, accuracy, or safety.

Anti-patterns (common failure modes)

  1. Happy-path-only test sets: Building a golden set that only covers ideal inputs and expected behavior. The eval misses adversarial inputs, edge cases, and safety-critical scenarios.
  2. Vibes-based rubrics: Writing scoring criteria like “the response should be good” or “helpful and accurate” without concrete behavioral anchors, examples, or tie-breakers. Two judges will disagree on every case.
  3. One-shot eval: Running the eval once, celebrating a pass rate, and never re-running. Evals must be repeatable with a regression policy and an iteration loop that turns new failures into new tests.
  4. Ignoring judge calibration: Using LLM-as-judge without calibrating against human judgments on gold examples. Uncalibrated judges produce confident but unreliable scores.
  5. Metric without decision rule: Tracking accuracy or pass rate without defining what score triggers ship/no-ship/revise. Metrics without thresholds do not support decisions (see the sketch after this list).
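
The antidote to anti-pattern 5 is a decision rule agreed before the eval runs. A sketch with placeholder thresholds:

```python
def decide(pass_rate: float, safety_pass_rate: float) -> str:
    # Placeholder thresholds; the real values belong in the Eval PRD.
    if safety_pass_rate < 1.0:
        return "no-ship: all safety cases must pass"
    if pass_rate >= 0.90:
        return "ship"
    if pass_rate >= 0.80:
        return "revise and re-run"
    return "no-ship"
```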

Capabilities

skill · source-liqiongyu · skill-ai-evals · topic-agent-skills · topic-ai-agents · topic-automation · topic-claude · topic-codex · topic-prompt-engineering · topic-refoundai · topic-skillpack

Install

Install: npx skills add liqiongyu/lenny_skills_plus
Transport: skills-sh
Protocol: skill

Quality

0.47 / 1.00

Deterministic score 0.47 from registry signals: indexed on GitHub topic:agent-skills · 49 GitHub stars · SKILL.md body (8,914 chars)

Provenance

Indexed from: github
Enriched: 2026-04-22 00:56:20Z · deterministic:skill-github:v1 · v1
First seen: 2026-04-18
Last seen: 2026-04-22
