{"id":"d706a9c9-4889-4e23-a08e-669998324bd5","shortId":"zXGu9X","kind":"skill","title":"recipe-eval-skill","tagline":"Creates or updates Claude Code skills through interactive dialog, then evaluates effectiveness by parallel execution comparison. Use when creating new skills, updating existing skills, or evaluating skill quality.","description":"**Context**: Skill authoring (Phase A) followed by blind A/B evaluation (Phase B)\n\nMode: $ARGUMENTS\n\n## Orchestrator Definition\n\n**Core Identity**: \"I am not a worker. I am an orchestrator.\"\n\n**Execution Method**:\n- Skill generation/modification → performed by rashomon:skill-creator\n- Skill quality grading → performed by rashomon:skill-reviewer\n- Test task execution → performed by eval-executor.py script (via `claude -p`)\n- Blind result comparison → performed by rashomon:skill-eval-reporter\n\nOrchestrator invokes sub-agents via Agent tool and scripts via Bash, passes structured data between them.\n\n**First Action**: Register all steps using TaskCreate before any execution. Phase A steps are defined in the mode-specific reference (create.md or update.md). Phase B steps are defined in eval.md. Update status using TaskUpdate upon each step completion.\n\n## Mode Detection\n\nDetermine mode from $ARGUMENTS:\n\n| Mode | Criteria |\n|------|----------|\n| Creation | \"create\", new skill request, no existing skill referenced |\n| Update | \"improve\", \"update\", existing skill name or path mentioned |\n| Unspecified | $ARGUMENTS is empty or ambiguous; ask user via AskUserQuestion: \"Create a new skill or update an existing one?\" |\n\n## Scope Boundaries\n\n**Phase A (Skill Authoring)**: Create or modify skill content through dialog. Ends with user-approved skill file.\n**Phase B (Evaluation)**: Measure skill effectiveness through blind execution comparison. 
Does not modify skill content.\n\n**Responsibility Boundary**: This skill completes with the combined evaluation report and ship/revise/reject recommendation.\n\n## Workflow\n\n### Phase A: Skill Authoring\n\nRead the mode-specific reference and execute:\n\n- **Creation mode**: Read [references/create.md](references/create.md) and follow its steps\n- **Update mode**: Read [references/update.md](references/update.md) and follow its steps\n\nPhase A ends with: user-approved skill content (new or modified).\n\n### Phase A → Phase B Handoff\n\nBefore starting Phase B, confirm these data are available in context. Phase B cannot proceed without them:\n\n| Data | Source | Required |\n|------|--------|----------|\n| Skill name | Phase A dialog | Always |\n| Source skill directory | Phase A file write | Always |\n| User phrases | Phase A Round 3 (create) / Round 2 (update) | Always |\n| Trigger scenarios | Phase A Round 3 (create) / Round 1-2 (update) | Always |\n| Original SKILL.md content | Phase A Step 6 (update mode only) | Update mode |\n\nIf user phrases are missing, ask the user before proceeding: \"What phrases does your team use when requesting work that this skill covers?\"\n\n### Phase B: Evaluation\n\nRead [references/eval.md](references/eval.md) and execute the evaluation protocol. Pass the handoff data above as context.\n\nPhase B consists of:\n1. **Trigger check**: Does the skill fire for its intended use case? (Step 1)\n2. **Trigger fail handling**: Diagnose and revise if trigger fails (Step 2, conditional)\n3. **Execution effectiveness**: Blind A/B comparison of output quality (Steps 3-7)\n\n### Final Output\n\nPresent combined results to user:\n1. **Phase A result**: Skill quality grade (A/B/C from rashomon:skill-reviewer)\n2. **Phase B trigger**: Discovered (yes/no), Invoked (yes/no)\n3. **Phase B execution**: Blind comparison result (from rashomon:skill-eval-reporter)\n4. 
**Recommendation**: ship / revise / reject\n\n## Error Handling\n\n| Scenario | Behavior |\n|----------|----------|\n| User cancels during Phase A | Stop. No eval needed. |\n| Grade C after 2 iterations | Present content with issues. User decides: accept/revise/abort. |\n| One executor fails in Phase B | Continue with partial comparison. |\n| Both executors fail in Phase B | Report failure. Phase A result still valid. |\n| Worktree creation fails | Report git error. Phase A result still valid. |\n\n## Prerequisites\n\n- Git repository (git 2.5+ for worktree support)\n- `claude` CLI available in PATH\n- Sufficient disk space for worktree copies\n\n## Completion Criteria\n\n### Phase A\n- [ ] Skill knowledge collected through dialog\n- [ ] rashomon:skill-creator returned valid output\n- [ ] rashomon:skill-reviewer returned grade A or B\n- [ ] User approved final content\n- [ ] File written to target location\n\n### Phase B\n- [ ] Trigger check executed and result presented\n- [ ] Parallel execution completed in worktrees\n- [ ] Blind comparison completed by rashomon:skill-eval-reporter\n- [ ] Worktrees cleaned up\n- [ ] Combined report presented with recommendation","tags":["recipe","eval","skill","rashomon","shinpr","agent-skills","ai-tools","claude-code","claude-code-plugin","developer-tools","evaluation","llm"],"capabilities":["skill","source-shinpr","skill-recipe-eval-skill","topic-agent-skills","topic-ai-tools","topic-claude-code","topic-claude-code-plugin","topic-developer-tools","topic-evaluation","topic-llm","topic-prompt-engineering","topic-prompt-evaluation","topic-prompt-optimization","topic-skills"],"categories":["rashomon"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/shinpr/rashomon/recipe-eval-skill","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add 
shinpr/rashomon","source_repo":"https://github.com/shinpr/rashomon","install_from":"skills.sh"}},"qualityScore":"0.454","qualityRationale":"deterministic score 0.45 from registry signals: · indexed on github topic:agent-skills · 9 github stars · SKILL.md body (4,473 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-04-24T07:03:39.788Z","embedding":null,"createdAt":"2026-04-23T13:04:21.415Z","updatedAt":"2026-04-24T07:03:39.788Z","lastSeenAt":"2026-04-24T07:03:39.788Z","tsv":"'-2':350 '-7':448 '1':349,410,423,456 '2':338,424,435,469,511 '2.5':558 '3':335,346,437,447,477 '4':490 '6':359 'a/b':41,441 'a/b/c':463 'accept/revise/abort':519 'action':117 'agent':103,105 'alway':321,329,340,352 'ambigu':186 'approv':217,285,599 'argument':46,160,182 'ask':187,370 'askuserquest':190 'author':35,205,252 'avail':304,564 'b':44,141,221,294,299,308,389,407,471,479,525,535,597,608 'bash':110 'behavior':498 'blind':40,89,227,440,481,620 'boundari':201,236 'c':509 'cancel':500 'cannot':309 'case':421 'check':412,610 'claud':8,87,562 'clean':630 'cli':563 'code':9 'collect':579 'combin':242,452,632 'comparison':20,91,229,442,482,529,621 'complet':154,239,573,617,622 'condit':436 'confirm':300 'consist':408 'content':210,234,287,355,514,601 'context':33,306,405 'continu':526 'copi':572 'core':49 'cover':387 'creat':5,23,164,191,206,336,347 'create.md':137 'creation':163,261,544 'creator':69,585 'criteria':162,574 'data':113,302,313,402 'decid':518 'defin':130,144 'definit':48 'detect':156 'determin':157 'diagnos':428 'dialog':13,212,320,581 'directori':324 'discov':473 'disk':568 'effect':16,225,439 'empti':184 'end':213,281 'error':495,548 'eval':3,97,488,506,627 'eval-executor.py':84 'eval.md':146 
'evalu':15,30,42,222,243,390,397 'execut':19,60,81,125,228,260,395,438,480,611,616 'executor':521,531 'exist':27,169,175,198 'fail':426,433,522,532,545 'failur':537 'file':219,327,602 'final':449,600 'fire':416 'first':116 'follow':38,267,276 'generation/modification':63 'git':547,555,557 'grade':72,462,508,594 'handl':427,496 'handoff':295,401 'ident':50 'improv':173 'intend':419 'interact':12 'invok':100,475 'issu':516 'iter':512 'knowledg':578 'locat':606 'measur':223 'mention':180 'method':61 'miss':369 'mode':45,134,155,158,161,256,262,271,361,364 'mode-specif':133,255 'modifi':208,232,290 'name':177,317 'need':507 'new':24,165,193,288 'one':199,520 'orchestr':47,59,99 'origin':353 'output':444,450,588 'p':88 'parallel':18,615 'partial':528 'pass':111,399 'path':179,566 'perform':64,73,82,92 'phase':36,43,126,140,202,220,249,279,291,293,298,307,318,325,332,343,356,388,406,457,470,478,502,524,534,538,549,575,607 'phrase':331,367,376 'prerequisit':554 'present':451,513,614,634 'proceed':310,374 'protocol':398 'qualiti':32,71,445,461 'rashomon':66,75,94,465,485,582,589,624 'read':253,263,272,391 'recip':2 'recipe-eval-skil':1 'recommend':247,491,636 'refer':136,258 'referenc':171 'references/create.md':264,265 'references/eval.md':392,393 'references/update.md':273,274 'regist':118 'reject':494 'report':98,244,489,536,546,628,633 'repositori':556 'request':167,382 'requir':315 'respons':235 'result':90,453,459,483,540,551,613 'return':586,593 'review':78,468,592 'revis':430,493 'round':334,337,345,348 'scenario':342,497 'scope':200 'script':85,108 'ship':492 'ship/revise/reject':246 'skill':4,10,25,28,31,34,62,68,70,77,96,166,170,176,194,204,209,218,224,233,238,251,286,316,323,386,415,460,467,487,577,584,591,626 'skill-creat':67,583 'skill-eval-report':95,486,625 'skill-recipe-eval-skill' 'skill-review':76,466,590 'skill.md':354 'sourc':314,322 'source-shinpr' 'space':569 'specif':135,257 'start':297 'status':148 'step':120,128,142,153,269,278,358,422,434,446 
'still':541,552 'stop':504 'structur':112 'sub':102 'sub-ag':101 'suffici':567 'support':561 'target':605 'task':80 'taskcreat':122 'taskupd':150 'team':379 'test':79 'tool':106 'topic-agent-skills' 'topic-ai-tools' 'topic-claude-code' 'topic-claude-code-plugin' 'topic-developer-tools' 'topic-evaluation' 'topic-llm' 'topic-prompt-engineering' 'topic-prompt-evaluation' 'topic-prompt-optimization' 'topic-skills' 'trigger':341,411,425,432,472,609 'unspecifi':181 'updat':7,26,147,172,174,196,270,339,351,360,363 'update.md':139 'upon':151 'use':21,121,149,380,420 'user':188,216,284,330,366,372,455,499,517,598 'user-approv':215,283 'valid':542,553,587 'via':86,104,109,189 'without':311 'work':383 'worker':55 'workflow':248 'worktre':543,560,571,619,629 'write':328 'written':603 'yes/no':474,476","prices":[{"id":"0e264d25-d72b-4418-a2a8-9940b9ea9d97","listingId":"d706a9c9-4889-4e23-a08e-669998324bd5","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"shinpr","category":"rashomon","install_from":"skills.sh"},"createdAt":"2026-04-23T13:04:21.415Z"}],"sources":[{"listingId":"d706a9c9-4889-4e23-a08e-669998324bd5","source":"github","sourceId":"shinpr/rashomon/recipe-eval-skill","sourceUrl":"https://github.com/shinpr/rashomon/tree/main/skills/recipe-eval-skill","isPrimary":false,"firstSeenAt":"2026-04-23T13:04:21.415Z","lastSeenAt":"2026-04-24T07:03:39.788Z"}],"details":{"listingId":"d706a9c9-4889-4e23-a08e-669998324bd5","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"shinpr","slug":"recipe-eval-skill","github":{"repo":"shinpr/rashomon","stars":9,"topics":["agent-skills","ai-tools","claude-code","claude-code-plugin","developer-tools","evaluation","llm","prompt-engineering","prompt-evaluation","prompt-optimization","skills"
],"license":"mit","html_url":"https://github.com/shinpr/rashomon","pushed_at":"2026-04-04T07:32:14Z","description":"Measure prompt and skill improvements with blind A/B comparison.","skill_md_sha":"c9e9b371efd0463e207a69a18e993e159113a40b","skill_md_path":"skills/recipe-eval-skill/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/shinpr/rashomon/tree/main/skills/recipe-eval-skill"},"layout":"multi","source":"github","category":"rashomon","frontmatter":{"name":"recipe-eval-skill","description":"Creates or updates Claude Code skills through interactive dialog, then evaluates effectiveness by parallel execution comparison. Use when creating new skills, updating existing skills, or evaluating skill quality."},"skills_sh_url":"https://skills.sh/shinpr/rashomon/recipe-eval-skill"},"updatedAt":"2026-04-24T07:03:39.788Z"}}