{"id":"c1bb9b21-10cd-4d7b-ab63-a1374b76e159","shortId":"k74Uma","kind":"skill","title":"ai-evals","tagline":"Create an AI Evals Pack (eval PRD, test set, rubric, judge plan, results + iteration loop). See also: building-with-llms (build), ai-product-strategy (strategy).","description":"# AI Evals\n\n## Scope\n\n**Covers**\n- Designing evaluation (“evals”) for LLM/AI features as an execution contract: what “good” means and how it’s measured\n- Converting failures into a **golden test set** + **error taxonomy** + **rubric**\n- Choosing a judging approach (human, LLM-as-judge, automated checks) and a repeatable harness/runbook\n- Producing decision-ready results and an iteration loop (every bug becomes a new test)\n\n**When to use**\n- “Design evals for this LLM feature so we can ship with confidence.”\n- “Create a rubric + golden set + benchmark for our AI assistant/copilot.”\n- “We’re seeing flaky quality—do error analysis and turn it into a repeatable eval.”\n- “Compare prompts/models safely with a clear acceptance threshold.”\n\n**When NOT to use**\n- You need to decide *what to build* (use `problem-definition`, `building-with-llms`, or `ai-product-strategy`).\n- You’re primarily doing traditional non-LLM software testing (use your standard eng QA/unit/integration tests).\n- You want model training research or infra design (this skill assumes API/model usage; delegate to ML/infra).\n- You only want vendor/model selection with no defined task + data (use `evaluating-new-technology` first, then come back with a concrete use case).\n- You want to measure overall product-market fit or retention, not AI output quality (use `measuring-product-market-fit`).\n- You need a product requirements document that includes but goes beyond eval design (use `writing-prds`).\n\n## Inputs\n\n**Minimum required**\n- System under test (SUT): what the AI does, for whom, in what workflow (inputs → outputs)\n- The decision the eval must support (ship/no-ship, compare options, regression 
gate)\n- What “good” means: 3–10 target behaviors + top failure modes\n- Constraints: privacy/compliance, safety policy, languages, cost/latency budgets, timeline\n\n**Missing-info strategy**\n- Ask up to 5 questions from [references/INTAKE.md](references/INTAKE.md) (3–5 at a time).\n- If details remain missing, proceed with explicit assumptions and provide 2–3 viable options (judge type, scoring scheme, dataset size).\n- If asked to run code or generate datasets from sensitive sources, request confirmation and apply least privilege (no secrets; redact/anonymize).\n\n## Outputs (deliverables)\n\nProduce an **AI Evals Pack** (in chat; or as files if requested), in this order:\n\n1) **Eval PRD** (evaluation requirements): decision, scope, target behaviors, success metrics, acceptance thresholds  \n2) **Test set spec + initial golden set**: schema, coverage plan, and a starter set of cases (tagged by scenario/risk)  \n3) **Error taxonomy** (from error analysis + open coding): failure modes, severity, examples  \n4) **Rubric + judging guide**: dimensions, scoring scale, definitions, examples, tie-breakers  \n5) **Judge + harness plan**: human vs LLM-as-judge vs automated checks, prompts/instructions, calibration, runbook, cost/time estimate  \n6) **Reporting + iteration loop**: baseline results format, regression policy, how new bugs become new tests  \n7) **Risks / Open questions / Next steps** (always included)\n\nTemplates: [references/TEMPLATES.md](references/TEMPLATES.md)\n\n## Workflow (7 steps)\n\n### 1) Define the decision and write the Eval PRD\n- **Inputs:** SUT description, stakeholders, decision to support.\n- **Actions:** Define the decision (ship/no-ship, compare A vs B), scope/non-goals, target behaviors, acceptance thresholds, and what must never happen.\n- **Outputs:** Draft **Eval PRD** (template in [references/TEMPLATES.md](references/TEMPLATES.md)).\n- **Checks:** A stakeholder can restate what is being measured, why, and what “pass” means.\n\n### 
2) Draft the golden set structure + coverage plan\n- **Inputs:** User workflows, edge cases, safety risks, data availability.\n- **Actions:** Specify the test case schema, tagging, and coverage targets (happy paths, tricky paths, adversarial/safety, long-tail). Create an initial starter set (small but high-signal).\n- **Outputs:** **Test set spec + initial golden set**.\n- **Checks:** Every target behavior has at least 2 test cases; high-severity risks are explicitly represented.\n\n### 3) Run error analysis and open coding to build a taxonomy\n- **Inputs:** Known failures, logs, stakeholder anecdotes, initial golden set.\n- **Actions:** Review failures, label them with open coding, consolidate into a taxonomy, and assign severity/impact. Identify likely root causes (prompting, missing context, tool misuse, formatting, policy).\n- **Outputs:** **Error taxonomy** + “top failure modes” list.\n- **Checks:** PM and eng read the taxonomy the same way; each category has 1–2 concrete examples.\n\n### 4) Convert taxonomy → rubric + scoring rules\n- **Inputs:** Taxonomy, target behaviors, output formats.\n- **Actions:** Define scoring dimensions and scales; write clear judge instructions and tie-breakers; add examples and disallowed behaviors. Decide between absolute scoring and pairwise comparison.\n- **Outputs:** **Rubric + judging guide**.\n- **Checks:** Two independent judges would likely score the same case similarly (instructions are specific, not vibes).\n\n### 5) Choose the judging approach + harness/runbook\n- **Inputs:** Constraints (time/cost), required reliability, privacy/safety constraints.\n- **Actions:** Pick judge type(s): human, LLM-as-judge, automated checks. Define calibration (gold examples, inter-rater checks), sampling, and how results are stored. 
Write a runbook with estimated runtime/cost.\n- **Outputs:** **Judge + harness plan**.\n- **Checks:** The plan is repeatable (versioned prompts/models, deterministic settings where possible, clear data handling).\n\n### 6) Define reporting, thresholds, and the iteration loop\n- **Inputs:** Stakeholder needs, release cadence.\n- **Actions:** Specify report format (overall + per-tag metrics), regression rules, and what changes require re-running evals. Define the iteration loop: every discovered failure becomes a new test + taxonomy update.\n- **Outputs:** **Reporting + iteration loop**.\n- **Checks:** A reader can make a decision from the report without additional meetings; regressions are detectable.\n\n### 7) Quality gate + finalize\n- **Inputs:** Full draft pack.\n- **Actions:** Run [references/CHECKLISTS.md](references/CHECKLISTS.md) and score with [references/RUBRIC.md](references/RUBRIC.md). Fix missing coverage, vague rubric language, or non-repeatable harness steps. Always include **Risks / Open questions / Next steps**.\n- **Outputs:** Final **AI Evals Pack**.\n- **Checks:** The eval definition functions as a product requirement: clear, testable, and actionable.\n\n## Quality gate (required)\n- Use [references/CHECKLISTS.md](references/CHECKLISTS.md) and [references/RUBRIC.md](references/RUBRIC.md).\n- Always include: **Risks**, **Open questions**, **Next steps**.\n\n## Examples\n\n**Example 1 (answer quality + safety):** “Use `ai-evals` to design evals for a customer-support reply drafting assistant. Constraints: no PII leakage, must cite KB articles, and must refuse unsafe requests. Output: AI Evals Pack.”\n\n**Example 2 (structured extraction):** “Use `ai-evals` to create a rubric + golden set for an LLM that extracts invoice fields to JSON. Constraints: must always return valid JSON; prioritize recall for `amount` and `due_date`. 
Output: AI Evals Pack.”\n\n**Boundary example:** “We don’t know what the AI feature should do yet—just ‘add AI’ and pick a model.”\nResponse: out of scope; first define the job/spec and success metrics (use `problem-definition` or `building-with-llms`), then return to `ai-evals` with a concrete SUT.\n\n**Boundary example 2:** “Our AI feature is live but users aren’t retaining. Help me figure out product-market fit.”\nResponse: retention and PMF are business-level metrics, not eval-level metrics. Use `measuring-product-market-fit` for that analysis. Return to `ai-evals` if the issue is specifically AI output quality, accuracy, or safety.\n\n## Anti-patterns (common failure modes)\n\n1. **Happy-path-only test sets**: Building a golden set that only covers ideal inputs and expected behavior. The eval misses adversarial inputs, edge cases, and safety-critical scenarios.\n2. **Vibes-based rubrics**: Writing scoring criteria like “the response should be good” or “helpful and accurate” without concrete behavioral anchors, examples, or tie-breakers. Two judges will disagree on every case.\n3. **One-shot eval**: Running the eval once, celebrating a pass rate, and never re-running. Evals must be repeatable with a regression policy and an iteration loop that turns new failures into new tests.\n4. **Ignoring judge calibration**: Using LLM-as-judge without calibrating against human judgments on gold examples. Uncalibrated judges produce confident but unreliable scores.\n5. **Metric without decision rule**: Tracking accuracy or pass rate without defining what score triggers ship/no-ship/revise. 
Metrics without thresholds do not support decisions.","tags":["evals","lenny","skills","plus","liqiongyu","agent-skills","ai-agents","automation","claude","codex","prompt-engineering","refoundai"],"capabilities":["skill","source-liqiongyu","skill-ai-evals","topic-agent-skills","topic-ai-agents","topic-automation","topic-claude","topic-codex","topic-prompt-engineering","topic-refoundai","topic-skillpack"],"categories":["lenny_skills_plus"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/liqiongyu/lenny_skills_plus/ai-evals","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add liqiongyu/lenny_skills_plus","source_repo":"https://github.com/liqiongyu/lenny_skills_plus","install_from":"skills.sh"}},"qualityScore":"0.474","qualityRationale":"deterministic score 0.47 from registry signals: · indexed on github topic:agent-skills · 49 github stars · SKILL.md body (8,914 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-04-22T06:56:18.593Z","embedding":null,"createdAt":"2026-04-18T22:16:12.245Z","updatedAt":"2026-04-22T06:56:18.593Z","lastSeenAt":"2026-04-22T06:56:18.593Z","tsv":null,"prices":[{"id":"3d26521f-6ba1-4493-b776-78e8e0f6ed50","listingId":"c1bb9b21-10cd-4d7b-ab63-a1374b76e159","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"liqiongyu","category":"lenny_skills_plus","install_from":"skills.sh"},"createdAt":"2026-04-18T22:16:12.245Z"}],"sources":[{"listingId":"c1bb9b21-10cd-4d7b-ab63-a1374b76e159","source":"github","sourceId":"liqiongyu/lenny_skills_plus/ai-evals","sourceUrl":"https://github.com/liqiongyu/lenny_skills_plus/tree/main/skills/ai-evals","isPrimary":false,"firstSeenAt":"2026-04-18T22:16:12.245Z","lastSeenAt":"2026-04-22T06:56:18.593Z"}],"details":{"listingId":"c1bb9b21-10cd-4d7b-ab63-a1374b76e159","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"liqiongyu","slug":"ai-evals","github":{"repo":"liqiongyu/lenny_skills_plus","stars":49,"topics":["agent-skills","ai-agents","automation","claude","codex","prompt-engineering","refoundai","skillpack"],"license":"apache-2.0","html_url":"https://github.com/liqiongyu/lenny_skills_plus","pushed_at":"2026-04-04T06:30:11Z","description":"86 agent-executable skill packs converted from RefoundAI’s Lenny skills (unofficial). Works with Codex + Claude Code.","skill_md_sha":"aff68c937e314d770fb030b4ffbe663161ec764e","skill_md_path":"skills/ai-evals/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/liqiongyu/lenny_skills_plus/tree/main/skills/ai-evals"},"layout":"multi","source":"github","category":"lenny_skills_plus","frontmatter":{"name":"ai-evals","description":"Create an AI Evals Pack (eval PRD, test set, rubric, judge plan, results + iteration loop). 
See also: building-with-llms (build), ai-product-strategy (strategy)."},"skills_sh_url":"https://skills.sh/liqiongyu/lenny_skills_plus/ai-evals"},"updatedAt":"2026-04-22T06:56:18.593Z"}}