{"id":"b23be400-14cf-48bf-9833-e11e2f492000","shortId":"drjBxN","kind":"skill","title":"Agentic Eval","tagline":"Awesome Copilot skill by Github","description":"# Agentic Evaluation Patterns\n\nPatterns for self-improvement through iterative evaluation and refinement.\n\n## Overview\n\nEvaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.\n\n```\nGenerate → Evaluate → Critique → Refine → Output\n    ↑                              │\n    └──────────────────────────────┘\n```\n\n## When to Use\n\n- **Quality-critical generation**: Code, reports, analysis requiring high accuracy\n- **Tasks with clear evaluation criteria**: Defined success metrics exist\n- **Content requiring specific standards**: Style guides, compliance, formatting\n\n---\n\n## Pattern 1: Basic Reflection\n\nAgent evaluates and improves its own output through self-critique.\n\n```python\ndef reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:\n    \"\"\"Generate with reflection loop.\"\"\"\n    output = llm(f\"Complete this task:\\n{task}\")\n    \n    for i in range(max_iterations):\n        # Self-critique\n        critique = llm(f\"\"\"\n        Evaluate this output against criteria: {criteria}\n        Output: {output}\n        Rate each: PASS/FAIL with feedback as JSON.\n        \"\"\")\n        \n        critique_data = json.loads(critique)\n        all_pass = all(c[\"status\"] == \"PASS\" for c in critique_data.values())\n        if all_pass:\n            return output\n        \n        # Refine based on critique\n        failed = {k: v[\"feedback\"] for k, v in critique_data.items() if v[\"status\"] == \"FAIL\"}\n        output = llm(f\"Improve to address: {failed}\\nOriginal: {output}\")\n    \n    return output\n```\n\n**Key insight**: Use structured JSON output for reliable parsing of critique results.\n\n---\n\n## Pattern 2: Evaluator-Optimizer\n\nSeparate generation and evaluation into distinct components for clearer responsibilities.\n\n```python\nclass EvaluatorOptimizer:\n    def __init__(self, score_threshold: float = 0.8):\n        self.score_threshold = score_threshold\n    \n    def generate(self, task: str) -> str:\n        return llm(f\"Complete: {task}\")\n    \n    def evaluate(self, output: str, task: str) -> dict:\n        return json.loads(llm(f\"\"\"\n        Evaluate output for task: {task}\n        Output: {output}\n        Return JSON: {{\"overall_score\": 0-1, \"dimensions\": {{\"accuracy\": ..., \"clarity\": ...}}}}\n        \"\"\"))\n    \n    def optimize(self, output: str, feedback: dict) -> str:\n        return llm(f\"Improve based on feedback: {feedback}\\nOutput: {output}\")\n    \n    def run(self, task: str, max_iterations: int = 3) -> str:\n        output = self.generate(task)\n        for _ in range(max_iterations):\n            evaluation = self.evaluate(output, task)\n            if evaluation[\"overall_score\"] >= self.score_threshold:\n                break\n            output = self.optimize(output, evaluation)\n        return output\n```\n\n---\n\n## Pattern 3: Code-Specific Reflection\n\nTest-driven refinement loop for code generation.\n\n```python\nclass CodeReflector:\n    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:\n        code = llm(f\"Write Python code for: {spec}\")\n        tests = llm(f\"Generate pytest tests for: {spec}\\nCode: {code}\")\n        \n        for _ in range(max_iterations):\n            result = 
\n\n---\n\n## Pattern 3: Code-Specific Reflection\n\nTest-driven refinement loop for code generation.\n\n```python\nclass CodeReflector:\n    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:\n        code = llm(f\"Write Python code for: {spec}\")\n        tests = llm(f\"Generate pytest tests for: {spec}\\nCode: {code}\")\n\n        for _ in range(max_iterations):\n            # run_tests is an assumed harness that executes the generated tests\n            # against the code and returns {\"success\": bool, \"error\": str}\n            result = run_tests(code, tests)\n            if result[\"success\"]:\n                return code\n            code = llm(f\"Fix error: {result['error']}\\nCode: {code}\")\n        return code\n```\n\n---\n\n## Evaluation Strategies\n\n### Outcome-Based\nEvaluate whether the output achieves the expected result.\n\n```python\ndef evaluate_outcome(task: str, output: str, expected: str) -> str:\n    return llm(f\"Does the output achieve the expected outcome? Task: {task}, Expected: {expected}, Output: {output}\")\n```\n\n### LLM-as-Judge\nUse an LLM to compare and rank outputs.\n\n```python\ndef llm_judge(output_a: str, output_b: str, criteria: str) -> str:\n    # Both candidates must appear in the prompt for the judge to compare them\n    return llm(f\"Compare outputs A and B for {criteria}.\\nA: {output_a}\\nB: {output_b}\\nWhich is better and why?\")\n```\n\n### Rubric-Based\nScore outputs against weighted dimensions.\n\n```python\nimport json\n\nRUBRIC = {\n    \"accuracy\": {\"weight\": 0.4},\n    \"clarity\": {\"weight\": 0.3},\n    \"completeness\": {\"weight\": 0.3}\n}\n\ndef evaluate_with_rubric(output: str, rubric: dict) -> float:\n    scores = json.loads(llm(f\"Rate 1-5 for each dimension: {list(rubric.keys())}. Return JSON mapping dimension to score.\\nOutput: {output}\"))\n    # Weighted 1-5 score scaled into the (0, 1] range\n    return sum(scores[d] * rubric[d][\"weight\"] for d in rubric) / 5\n```\n\n---\n\n## Best Practices\n\n| Practice | Rationale |\n|----------|-----------|\n| **Clear criteria** | Define specific, measurable evaluation criteria upfront |\n| **Iteration limits** | Set max iterations (3-5) to prevent infinite loops |\n| **Convergence check** | Stop if the output score isn't improving between iterations (sketch below) |\n| **Log history** | Keep the full trajectory for debugging and analysis |\n| **Structured output** | Use JSON for reliable parsing of evaluation results |
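\n\nA minimal sketch of the convergence check and parse-failure handling combined, assuming the same `llm` helper as above (`min_delta` and the prompts are illustrative):\n\n```python\nimport json\n\ndef run_with_convergence(task: str, max_iterations: int = 3,\n                         score_threshold: float = 0.8, min_delta: float = 0.01) -> str:\n    output = llm(f\"Complete: {task}\")\n    history = []  # full trajectory, kept for debugging and analysis\n    prev_score = 0.0\n    for _ in range(max_iterations):\n        try:\n            evaluation = json.loads(llm(\n                f\"Evaluate output for task: {task}\\nOutput: {output}\\n\"\n                \"Return JSON with an overall_score between 0 and 1.\"))\n            score = evaluation[\"overall_score\"]\n        except (json.JSONDecodeError, KeyError, TypeError):\n            break  # malformed evaluation: keep the current output rather than crash\n        history.append({\"output\": output, \"score\": score})\n        if score >= score_threshold or score - prev_score < min_delta:\n            break  # good enough, or no longer improving\n        prev_score = score\n        output = llm(f\"Improve based on feedback: {evaluation}\\nOutput: {output}\")\n    return output\n```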
\n\n---\n\n## Quick Start Checklist\n\n```markdown\n## Evaluation Implementation Checklist\n\n### Setup\n- [ ] Define evaluation criteria/rubric\n- [ ] Set score threshold for \"good enough\"\n- [ ] Configure max iterations (default: 3)\n\n### Implementation\n- [ ] Implement generate() function\n- [ ] Implement evaluate() function with structured output\n- [ ] Implement optimize() function\n- [ ] Wire up the refinement loop\n\n### Safety\n- [ ] Add convergence detection\n- [ ] Log all iterations for debugging\n- [ ] Handle evaluation parse failures gracefully\n```","tags":["agentic","eval","awesome","copilot","github"],"capabilities":["skill","source-github","category-awesome-copilot"],"categories":["awesome-copilot"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/github/awesome-copilot/agentic-eval","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"install_from":"skills.sh"}},"qualityScore":"0.300","qualityRationale":"deterministic score 0.30 from registry signals: · indexed on skills.sh · published under github/awesome-copilot","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill:v1","enrichmentVersion":1,"enrichedAt":"2026-04-22T17:40:17.606Z","embedding":null,"createdAt":"2026-04-18T20:25:33.581Z","updatedAt":"2026-04-22T17:40:17.606Z","lastSeenAt":"2026-04-22T17:40:17.606Z",
'write':359","prices":[{"id":"0cf94e7e-5ff8-4950-a331-805bd126c9d1","listingId":"b23be400-14cf-48bf-9833-e11e2f492000","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"github","category":"awesome-copilot","install_from":"skills.sh"},"createdAt":"2026-04-18T20:25:33.581Z"}],"sources":[{"listingId":"b23be400-14cf-48bf-9833-e11e2f492000","source":"github","sourceId":"github/awesome-copilot/agentic-eval","sourceUrl":"https://github.com/github/awesome-copilot/tree/main/skills/agentic-eval","isPrimary":false,"firstSeenAt":"2026-04-18T21:48:08.290Z","lastSeenAt":"2026-04-22T12:52:04.737Z"},{"listingId":"b23be400-14cf-48bf-9833-e11e2f492000","source":"skills_sh","sourceId":"github/awesome-copilot/agentic-eval","sourceUrl":"https://skills.sh/github/awesome-copilot/agentic-eval","isPrimary":true,"firstSeenAt":"2026-04-18T20:25:33.581Z","lastSeenAt":"2026-04-22T17:40:17.606Z"}],"details":{"listingId":"b23be400-14cf-48bf-9833-e11e2f492000","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"github","slug":"agentic-eval","source":"skills_sh","category":"awesome-copilot","skills_sh_url":"https://skills.sh/github/awesome-copilot/agentic-eval"},"updatedAt":"2026-04-22T17:40:17.606Z"}}