{"id":"b23be400-14cf-48bf-9833-e11e2f492000","shortId":"drjBxN","kind":"skill","title":"agentic-eval","tagline":"Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when:\n- Implementing self-critique and reflection loops\n- Building evaluator-optimizer pipelines for quality-critical generation\n- Creating test-driven code refinement workflows\n- Designing rubr","description":"# Agentic Evaluation Patterns\n\nPatterns for self-improvement through iterative evaluation and refinement.\n\n## Overview\n\nEvaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.\n\n```\nGenerate → Evaluate → Critique → Refine → Output\n    ↑                              │\n    └──────────────────────────────┘\n```\n\n## When to Use\n\n- **Quality-critical generation**: Code, reports, analysis requiring high accuracy\n- **Tasks with clear evaluation criteria**: Defined success metrics exist\n- **Content requiring specific standards**: Style guides, compliance, formatting\n\n---\n\n## Pattern 1: Basic Reflection\n\nAgent evaluates and improves its own output through self-critique.\n\n```python\ndef reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:\n    \"\"\"Generate with reflection loop.\"\"\"\n    output = llm(f\"Complete this task:\\n{task}\")\n    \n    for i in range(max_iterations):\n        # Self-critique\n        critique = llm(f\"\"\"\n        Evaluate this output against criteria: {criteria}\n        Output: {output}\n        Rate each: PASS/FAIL with feedback as JSON.\n        \"\"\")\n        \n        critique_data = json.loads(critique)\n        all_pass = all(c[\"status\"] == \"PASS\" for c in critique_data.values())\n        if all_pass:\n            return output\n        \n        # Refine based on critique\n        failed = {k: v[\"feedback\"] for k, v in critique_data.items() if v[\"status\"] == \"FAIL\"}\n        
output = llm(f\"Improve to address: {failed}\\nOriginal: {output}\")\n    \n    return output\n```\n\n**Key insight**: Use structured JSON output for reliable parsing of critique results.\n\n---\n\n## Pattern 2: Evaluator-Optimizer\n\nSeparate generation and evaluation into distinct components for clearer responsibilities.\n\n```python\nclass EvaluatorOptimizer:\n    def __init__(self, score_threshold: float = 0.8):\n        self.score_threshold = score_threshold\n    \n    def generate(self, task: str) -> str:\n        return llm(f\"Complete: {task}\")\n    \n    def evaluate(self, output: str, task: str) -> dict:\n        return json.loads(llm(f\"\"\"\n        Evaluate output for task: {task}\n        Output: {output}\n        Return JSON: {{\"overall_score\": 0-1, \"dimensions\": {{\"accuracy\": ..., \"clarity\": ...}}}}\n        \"\"\"))\n    \n    def optimize(self, output: str, feedback: dict) -> str:\n        return llm(f\"Improve based on feedback: {feedback}\\nOutput: {output}\")\n    \n    def run(self, task: str, max_iterations: int = 3) -> str:\n        output = self.generate(task)\n        for _ in range(max_iterations):\n            evaluation = self.evaluate(output, task)\n            if evaluation[\"overall_score\"] >= self.score_threshold:\n                break\n            output = self.optimize(output, evaluation)\n        return output\n```\n\n---\n\n## Pattern 3: Code-Specific Reflection\n\nTest-driven refinement loop for code generation.\n\n```python\nclass CodeReflector:\n    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:\n        code = llm(f\"Write Python code for: {spec}\")\n        tests = llm(f\"Generate pytest tests for: {spec}\\nCode: {code}\")\n        \n        for _ in range(max_iterations):\n            result = run_tests(code, tests)\n            if result[\"success\"]:\n                return code\n            code = llm(f\"Fix error: {result['error']}\\nCode: {code}\")\n        return 
code\n```\n\n---\n\n## Evaluation Strategies\n\n### Outcome-Based\nEvaluate whether output achieves the expected result.\n\n```python\ndef evaluate_outcome(task: str, output: str, expected: str) -> str:\n    return llm(f\"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}\")\n```\n\n### LLM-as-Judge\nUse LLM to compare and rank outputs.\n\n```python\ndef llm_judge(output_a: str, output_b: str, criteria: str) -> str:\n    # Include both candidates in the prompt so the judge can actually see them\n    return llm(f\"Compare outputs A and B for {criteria}. Which is better and why?\\nA: {output_a}\\nB: {output_b}\")\n```\n\n### Rubric-Based\nScore outputs against weighted dimensions.\n\n```python\nRUBRIC = {\n    \"accuracy\": {\"weight\": 0.4},\n    \"clarity\": {\"weight\": 0.3},\n    \"completeness\": {\"weight\": 0.3}\n}\n\ndef evaluate_with_rubric(output: str, rubric: dict) -> float:\n    # Ask for JSON explicitly so json.loads can parse the reply\n    scores = json.loads(llm(f\"Rate 1-5 for each dimension: {list(rubric.keys())}. Return JSON mapping each dimension to its score.\\nOutput: {output}\"))\n    return sum(scores[d] * rubric[d][\"weight\"] for d in rubric) / 5\n```\n\n---\n\n## Best Practices\n\n| Practice | Rationale |\n|----------|-----------|\n| **Clear criteria** | Define specific, measurable evaluation criteria upfront |\n| **Iteration limits** | Set max iterations (3-5) to prevent infinite loops |\n| **Convergence check** | Stop if output score isn't improving between iterations |\n| **Log history** | Keep full trajectory for debugging and analysis |\n| **Structured output** | Use JSON for reliable parsing of evaluation results |\n\n---\n\n## Quick Start Checklist\n\n```markdown\n## Evaluation Implementation Checklist\n\n### Setup\n- [ ] Define evaluation criteria/rubric\n- [ ] Set score threshold for \"good enough\"\n- [ ] Configure max iterations (default: 3)\n\n### Implementation\n- [ ] Implement generate() function\n- [ ] Implement evaluate() function with structured output\n- [ ] Implement optimize() function\n- [ ] Wire up the refinement loop\n\n### Safety\n- [ ] Add convergence detection\n- [ ] Log all iterations for 
debugging\n- [ ] Handle evaluation parse failures gracefully\n```","tags":["agentic","eval","awesome","copilot","github","agent-skills","agents","custom-agents","github-copilot","hacktoberfest","prompt-engineering"],"capabilities":["skill","source-github","skill-agentic-eval","topic-agent-skills","topic-agents","topic-awesome","topic-custom-agents","topic-github-copilot","topic-hacktoberfest","topic-prompt-engineering"],"categories":["awesome-copilot"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/github/awesome-copilot/agentic-eval","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add github/awesome-copilot","source_repo":"https://github.com/github/awesome-copilot","install_from":"skills.sh"}},"qualityScore":"0.700","qualityRationale":"deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 33270 github stars · SKILL.md body (5,359 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T18:52:04.223Z","embedding":null,"createdAt":"2026-04-18T20:25:33.581Z","updatedAt":"2026-05-18T18:52:04.223Z","lastSeenAt":"2026-05-18T18:52:04.223Z","tsv":"'-1':306 '-5':546,584 '0':305 '0.3':527,530 '0.4':524 '0.8':266 '1':115,545 '2':243 '3':142,336,364,390,583,640 '5':565 'accuraci':96,308,522 'achiev':444,464 'add':660 'address':224 'agent':2,12,44,61,118 'agentic-ev':1 'ai':11 'analysi':93,608 'assess':63 'b':492,504 'base':203,322,440,514 'basic':116 'best':566 'better':509 'beyond':70 'break':356 'build':25 'c':190,194 'check':590 'checklist':621,625 'clariti':309,525 'class':258,378 'clear':99,570 'clearer':255 'code':39,91,366,375,392,397,409,418,424,425,433,435 'code-specif':365 'codereflector':379 'compar':480,500 
'complet':151,280,528 'complianc':112 'compon':253 'configur':636 'content':106 'converg':589,661 'creat':35 'criteria':101,136,172,173,494,506,571,576 'criteria/rubric':629 'critic':33,89 'critiqu':21,81,128,164,165,183,186,205,240 'critique_data.items':214 'critique_data.values':196 'd':557,559,562 'data':184 'debug':606,667 'def':130,260,271,282,310,328,380,449,485,531 'default':639 'defin':102,572,627 'design':42 'detect':662 'dict':289,316,538 'dimens':307,519,549 'distinct':252 'driven':38,371 'enabl':60 'enough':635 'error':429,431 'eval':3 'evalu':8,27,45,54,58,80,100,119,168,245,250,283,294,346,351,360,436,441,450,532,575,617,623,628,646,669 'evaluator-optim':26,244 'evaluatoroptim':259 'exist':105 'expect':446,456,465,469,470 'f':150,167,221,279,293,320,394,402,427,461,499,543 'fail':206,218,225 'failur':671 'feedback':180,209,315,324,325 'fix':383,428 'float':265,539 'format':113 'full':603 'function':644,647,653 'generat':34,74,79,90,144,248,272,376,403,643 'good':634 'grace':672 'guid':111 'handl':668 'high':95 'histori':601 'implement':18,624,641,642,645,651 'improv':10,51,65,121,222,321,597 'infinit':587 'init':261 'insight':231 'int':141,335,389 'isn':595 'iter':53,76,140,161,334,345,388,414,578,582,599,638,665 'json':182,234,302,612 'json.loads':185,291,541 'judg':476,487 'k':207,211 'keep':602 'key':230 'limit':579 'list':137,550 'llm':149,166,220,278,292,319,393,401,426,460,474,478,486,498,542 'llm-as-judg':473 'log':600,663 'loop':24,78,147,373,588,658 'markdown':622 'max':139,160,333,344,387,413,581,637 'measur':574 'metric':104 'move':69 'n':154 'ncode':408,432 'norigin':226 'noutput':326,552 'optim':28,246,311,652 'outcom':439,451,466 'outcome-bas':438 'output':13,68,83,124,148,170,174,175,201,219,227,229,235,285,295,299,300,313,327,338,348,357,359,362,443,454,463,471,472,483,488,491,501,516,535,553,593,610,650 'overal':303,352 'overview':57 'pars':238,615,670 'pass':188,192,199 'pass/fail':178 'pattern':4,46,47,59,114,242,363 'pipelin':29 
'practic':567,568 'prevent':586 'pytest':404 'python':129,257,377,396,448,484,520 'qualiti':32,88 'quality-crit':31,87 'quick':619 'rang':159,343,412 'rank':482 'rate':176,544 'rational':569 'refin':40,56,77,82,133,202,372,657 'reflect':23,117,131,146,368,381 'reliabl':237,614 'report':92 'requir':94,107 'respons':256 'result':241,415,421,430,447,618 'return':200,228,277,290,301,318,361,423,434,459,497,554 'rubr':43 'rubric':513,521,534,537,558,564 'rubric-bas':512 'rubric.keys':551 'run':329,416 'safeti':659 'score':263,269,304,353,515,540,556,594,631 'self':20,50,127,163,262,273,284,312,330,384 'self-critiqu':19,126,162 'self-improv':49 'self.evaluate':347 'self.generate':339 'self.optimize':358 'self.score':267,354 'separ':247 'set':580,630 'setup':626 'shot':73 'singl':72 'single-shot':71 'skill':16 'skill-agentic-eval' 'source-github' 'spec':385,399,407 'specif':108,367,573 'standard':109 'start':620 'status':191,217 'stop':591 'str':135,138,143,275,276,286,288,314,317,332,337,386,391,453,455,457,458,490,493,495,496,536 'strategi':437 'structur':233,609,649 'style':110 'success':103,422 'sum':555 'task':97,134,153,155,274,281,287,297,298,331,340,349,452,467,468 'techniqu':6 'test':37,370,400,405,417,419 'test-driven':36,369 'threshold':264,268,270,355,632 'topic-agent-skills' 'topic-agents' 'topic-awesome' 'topic-custom-agents' 'topic-github-copilot' 'topic-hacktoberfest' 'topic-prompt-engineering' 'trajectori':604 'upfront':577 'use':14,86,232,477,611 'v':208,212,216 'weight':518,523,526,529,560 'whether':442 'wire':654 'workflow':41 
'write':395","prices":[{"id":"0cf94e7e-5ff8-4950-a331-805bd126c9d1","listingId":"b23be400-14cf-48bf-9833-e11e2f492000","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"github","category":"awesome-copilot","install_from":"skills.sh"},"createdAt":"2026-04-18T20:25:33.581Z"}],"sources":[{"listingId":"b23be400-14cf-48bf-9833-e11e2f492000","source":"github","sourceId":"github/awesome-copilot/agentic-eval","sourceUrl":"https://github.com/github/awesome-copilot/tree/main/skills/agentic-eval","isPrimary":false,"firstSeenAt":"2026-04-18T21:48:08.290Z","lastSeenAt":"2026-05-18T18:52:04.223Z"},{"listingId":"b23be400-14cf-48bf-9833-e11e2f492000","source":"skills_sh","sourceId":"github/awesome-copilot/agentic-eval","sourceUrl":"https://skills.sh/github/awesome-copilot/agentic-eval","isPrimary":true,"firstSeenAt":"2026-04-18T20:25:33.581Z","lastSeenAt":"2026-05-07T22:40:17.480Z"}],"details":{"listingId":"b23be400-14cf-48bf-9833-e11e2f492000","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"github","slug":"agentic-eval","github":{"repo":"github/awesome-copilot","stars":33270,"topics":["agent-skills","agents","ai","awesome","custom-agents","github-copilot","hacktoberfest","prompt-engineering"],"license":"mit","html_url":"https://github.com/github/awesome-copilot","pushed_at":"2026-05-18T01:26:59Z","description":"Community-contributed instructions, agents, skills, and configurations to help you make the most of GitHub 
Copilot.","skill_md_sha":"3cb14203efaf62af98065bd599600ba808f0ef31","skill_md_path":"skills/agentic-eval/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/github/awesome-copilot/tree/main/skills/agentic-eval"},"layout":"multi","source":"github","category":"awesome-copilot","frontmatter":{"name":"agentic-eval","description":"Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when:\n- Implementing self-critique and reflection loops\n- Building evaluator-optimizer pipelines for quality-critical generation\n- Creating test-driven code refinement workflows\n- Designing rubric-based or LLM-as-judge evaluation systems\n- Adding iterative improvement to agent outputs (code, reports, analysis)\n- Measuring and improving agent response quality"},"skills_sh_url":"https://skills.sh/github/awesome-copilot/agentic-eval"},"updatedAt":"2026-05-18T18:52:04.223Z"}}