{"id":"ef86bc39-e471-476f-8084-4dc1d08ddc58","shortId":"UDNW7H","kind":"skill","title":"advanced-evaluation","tagline":"This skill should be used when the user asks to \"implement LLM-as-judge\", \"compare model outputs\", \"create evaluation rubrics\", \"mitigate evaluation bias\", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.","description":"# Advanced Evaluation\n\nThis skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.\n\n**Key insight**: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.\n\n## When to Use\nActivate this skill when:\n\n- Building automated evaluation pipelines for LLM outputs\n- Comparing multiple model responses to select the best one\n- Establishing consistent quality standards across evaluation teams\n- Debugging evaluation systems that show inconsistent results\n- Designing A/B tests for prompt or model changes\n- Creating rubrics for human or automated evaluation\n- Analyzing correlation between automated and human judgments\n\n## Core Concepts\n\n### The Evaluation Taxonomy\n\nEvaluation approaches fall into two primary categories with distinct reliability profiles:\n\n**Direct Scoring**: A single LLM rates one response on a defined scale.\n- Best for: Objective criteria (factual accuracy, instruction following, toxicity)\n- Reliability: Moderate to high for well-defined criteria\n- Failure mode: Score calibration drift, inconsistent scale interpretation\n\n**Pairwise Comparison**: An LLM compares two responses and selects the better one.\n- Best for: Subjective preferences (tone, style, persuasiveness)\n- Reliability: Higher than direct scoring for preferences\n- Failure mode: Position bias, length bias\n\nResearch from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth.\n\n### The Bias Landscape\n\nLLM judges exhibit systematic biases that must be actively mitigated:\n\n**Position Bias**: First-position responses receive preferential treatment in pairwise comparison. Mitigation: Evaluate twice with swapped positions, use majority vote or consistency check.\n\n**Length Bias**: Longer responses are rated higher regardless of quality. Mitigation: Explicit prompting to ignore length, length-normalized scoring.\n\n**Self-Enhancement Bias**: Models rate their own outputs higher. Mitigation: Use different models for generation and evaluation, or acknowledge limitation.\n\n**Verbosity Bias**: Detailed explanations receive higher scores even when unnecessary. Mitigation: Criteria-specific rubrics that penalize irrelevant detail.\n\n**Authority Bias**: Confident, authoritative tone rated higher regardless of accuracy. 
\n\n### Metric Selection Framework\n\nChoose metrics based on the evaluation task structure:\n\n| Task Type | Primary Metrics | Secondary Metrics |\n|-----------|-----------------|-------------------|\n| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's κ |\n| Ordinal scale (1-5 rating) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) |\n| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |\n| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |\n\nThe critical insight: High absolute agreement matters less than systematic disagreement patterns. A judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.\n\n## Evaluation Approaches\n\n### Direct Scoring Implementation\n\nDirect scoring requires three components: clear criteria, a calibrated scale, and structured output format.\n\n**Criteria Definition Pattern**:\n```\nCriterion: [Name]\nDescription: [What this criterion measures]\nWeight: [Relative importance, 0-1]\n```\n\n**Scale Calibration**:\n- 1-3 scales: Binary with neutral option, lowest cognitive load\n- 1-5 scales: Standard Likert, good balance of granularity and reliability\n- 1-10 scales: High granularity but harder to calibrate, use only with detailed rubrics\n\n**Prompt Structure for Direct Scoring**:\n```\nYou are an expert evaluator assessing response quality.\n\n## Task\nEvaluate the following response against each criterion.\n\n## Original Prompt\n{prompt}\n\n## Response to Evaluate\n{response}\n\n## Criteria\n{for each criterion: name, description, weight}\n\n## Instructions\nFor each criterion:\n1. Find specific evidence in the response\n2. Score according to the rubric (1-{max} scale)\n3. Justify your score with evidence\n4. Suggest one specific improvement\n\n## Output Format\nRespond with structured JSON containing scores, justifications, and summary.\n```\n\n**Chain-of-Thought Requirement**: All scoring prompts must require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.\n\n### Pairwise Comparison Implementation\n\nPairwise comparison is inherently more reliable for preference-based evaluation but requires bias mitigation.\n\n**Position Bias Mitigation Protocol**:\n1. First pass: Response A in first position, Response B in second\n2. Second pass: Response B in first position, Response A in second\n3. Consistency check: If passes disagree, return TIE with reduced confidence\n4. Final verdict: Consistent winner with averaged confidence\n\n**Prompt Structure for Pairwise Comparison**:\n```\nYou are an expert evaluator comparing two AI responses.\n\n## Critical Instructions\n- Do NOT prefer responses because they are longer\n- Do NOT prefer responses based on position (first vs second)\n- Focus ONLY on quality according to the specified criteria\n- Ties are acceptable when responses are genuinely equivalent\n\n## Original Prompt\n{prompt}\n\n## Response A\n{response_a}\n\n## Response B\n{response_b}\n\n## Comparison Criteria\n{criteria list}\n\n## Instructions\n1. Analyze each response independently first\n2. Compare them on each criterion\n3. 
Determine overall winner with confidence level\n\n## Output Format\nJSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.\n```\n\n**Confidence Calibration**: Confidence scores should reflect position consistency:\n- Both passes agree: confidence = average of individual confidences\n- Passes disagree: confidence = 0.5, verdict = TIE\n\n### Rubric Generation\n\nWell-defined rubrics reduce evaluation variance by 40-60% compared to open-ended scoring.\n\n**Rubric Components**:\n1. **Level descriptions**: Clear boundaries for each score level\n2. **Characteristics**: Observable features that define each level\n3. **Examples**: Representative text for each level (optional but valuable)\n4. **Edge cases**: Guidance for ambiguous situations\n5. **Scoring guidelines**: General principles for consistent application\n\n**Strictness Calibration**:\n- **Lenient**: Lower bar for passing scores, appropriate for encouraging iteration\n- **Balanced**: Fair, typical expectations for production use\n- **Strict**: High standards, appropriate for safety-critical or high-stakes evaluation\n\n**Domain Adaptation**: Rubrics should use domain-specific terminology. A \"code readability\" rubric mentions variables, functions, and comments. A \"medical accuracy\" rubric references clinical terminology and evidence standards.\n\n## Practical Guidance\n\n### Evaluation Pipeline Design\n\nProduction evaluation systems require multiple layers:\n\n```\n┌─────────────────────────────────────────────────┐\n│                 Evaluation Pipeline              │\n├─────────────────────────────────────────────────┤\n│                                                   │\n│  Input: Response + Prompt + Context               │\n│           │                                       │\n│           ▼                                       │\n│  ┌─────────────────────┐                         │\n│  │   Criteria Loader   │ ◄── Rubrics, weights    │\n│  └──────────┬──────────┘                         │\n│             │                                     │\n│             ▼                                     │\n│  ┌─────────────────────┐                         │\n│  │   Primary Scorer    │ ◄── Direct or Pairwise  │\n│  └──────────┬──────────┘                         │\n│             │                                     │\n│             ▼                                     │\n│  ┌─────────────────────┐                         │\n│  │   Bias Mitigation   │ ◄── Position swap, etc. 
│\n│  └──────────┬──────────┘                         │\n│             │                                     │\n│             ▼                                     │\n│  ┌─────────────────────┐                         │\n│  │ Confidence Scoring  │ ◄── Calibration         │\n│  └──────────┬──────────┘                         │\n│             │                                     │\n│             ▼                                     │\n│  Output: Scores + Justifications + Confidence     │\n│                                                   │\n└─────────────────────────────────────────────────┘\n```\n\n### Common Anti-Patterns\n\n**Anti-pattern: Scoring without justification**\n- Problem: Scores lack grounding, difficult to debug or improve\n- Solution: Always require evidence-based justification before score\n\n**Anti-pattern: Single-pass pairwise comparison**\n- Problem: Position bias corrupts results\n- Solution: Always swap positions and check consistency\n\n**Anti-pattern: Overloaded criteria**\n- Problem: Criteria measuring multiple things are unreliable\n- Solution: One criterion = one measurable aspect\n\n**Anti-pattern: Missing edge case guidance**\n- Problem: Evaluators handle ambiguous cases inconsistently\n- Solution: Include edge cases in rubrics with explicit guidance\n\n**Anti-pattern: Ignoring confidence calibration**\n- Problem: High-confidence wrong judgments are worse than low-confidence\n- Solution: Calibrate confidence to position consistency and evidence strength\n\n### Decision Framework: Direct vs. Pairwise\n\nUse this decision tree:\n\n```\nIs there an objective ground truth?\n├── Yes → Direct Scoring\n│   └── Examples: factual accuracy, instruction following, format compliance\n│\n└── No → Is it a preference or quality judgment?\n    ├── Yes → Pairwise Comparison\n    │   └── Examples: tone, style, persuasiveness, creativity\n    │\n    └── No → Consider reference-based evaluation\n        └── Examples: summarization (compare to source), translation (compare to reference)\n```\n\n### Scaling Evaluation\n\nFor high-volume evaluation:\n\n1. **Panel of LLMs (PoLL)**: Use multiple models as judges, aggregate votes\n   - Reduces individual model bias\n   - More expensive but more reliable for high-stakes decisions\n\n2. **Hierarchical evaluation**: Fast cheap model for screening, expensive model for edge cases\n   - Cost-effective for large volumes\n   - Requires calibration of screening threshold\n\n3. **Human-in-the-loop**: Automated evaluation for clear cases, human review for low-confidence\n   - Best reliability for critical applications\n   - Design feedback loop to improve automated evaluation\n\n## Examples\n\n### Example 1: Direct Scoring for Accuracy\n\n**Input**:\n```\nPrompt: \"What causes seasons on Earth?\"\nResponse: \"Seasons are caused by Earth's tilted axis. As Earth orbits the Sun, \ndifferent hemispheres receive more direct sunlight at different times of year.\"\nCriterion: Factual Accuracy (weight: 1.0)\nScale: 1-5\n```\n\n**Output**:\n```json\n{\n  \"criterion\": \"Factual Accuracy\",\n  \"score\": 5,\n  \"evidence\": [\n    \"Correctly identifies axial tilt as primary cause\",\n    \"Correctly explains differential sunlight by hemisphere\",\n    \"No factual errors present\"\n  ],\n  \"justification\": \"Response accurately explains the cause of seasons with correct \nscientific reasoning. 
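Both the axial tilt and its effect on sunlight distribution \nare correctly described.\",\n  \"improvement\": \"Could add the specific tilt angle (23.5°) for completeness.\"\n}\n```\n\nA judged response like this still has to be validated and folded into an overall result. The Python sketch below shows one way to do that for a batch of per-criterion judgments; it assumes the output schema above (criterion, score, justification) and the 0-1 weights from the Criteria Definition Pattern, and the function name and error handling are illustrative rather than a prescribed API.\n\n```python\nimport json\n\n# Aggregate per-criterion judgments (JSON strings shaped like the output above)\n# into a single weighted overall score on a 0-1 scale.\ndef weighted_overall_score(raw_judgments: list[str], weights: dict[str, float],\n                           max_score: int = 5) -> float:\n    total, weight_sum = 0.0, 0.0\n    for raw in raw_judgments:\n        judgment = json.loads(raw)               # one JSON object per criterion\n        name = judgment[\"criterion\"]\n        score = judgment[\"score\"]\n        if not 1 <= score <= max_score:          # reject out-of-range scores\n            raise ValueError(f\"score {score} outside 1-{max_score} for {name}\")\n        if not judgment.get(\"justification\"):    # enforce justification-before-score\n            raise ValueError(f\"missing justification for {name}\")\n        w = weights.get(name, 0.0)\n        total += w * (score / max_score)         # normalize before weighting\n        weight_sum += w\n    return total / weight_sum if weight_sum else 0.0\n```\n\nRejecting out-of-range scores and missing justifications at this stage keeps the justification-before-score rule enforceable rather than advisory.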
\n\n### Example 2: Pairwise Comparison with Position Swap\n\n**Input**:\n```\nPrompt: \"Explain machine learning to a beginner\"\nResponse A: [Technical explanation with jargon]\nResponse B: [Simple analogy-based explanation]\nCriteria: [\"clarity\", \"accessibility\"]\n```\n\n**First Pass (A first)**:\n```json\n{ \"winner\": \"B\", \"confidence\": 0.8 }\n```\n\n**Second Pass (B first)**:\n```json\n{ \"winner\": \"A\", \"confidence\": 0.6 }\n```\n(Note: the raw label \"A\" refers to the first-position slot, which in this pass holds the original Response B)\n\n**Mapped Second Pass**:\n```json\n{ \"winner\": \"B\", \"confidence\": 0.6 }\n```\n\n**Final Result**:\n```json\n{\n  \"winner\": \"B\",\n  \"confidence\": 0.7,\n  \"positionConsistency\": {\n    \"consistent\": true,\n    \"firstPassWinner\": \"B\",\n    \"secondPassWinner\": \"B\"\n  }\n}\n```\n\n### Example 3: Rubric Generation\n\n**Input**:\n```\ncriterionName: \"Code Readability\"\ncriterionDescription: \"How easy the code is to understand and maintain\"\ndomain: \"software engineering\"\nscale: \"1-5\"\nstrictness: \"balanced\"\n```\n\n**Output** (abbreviated):\n```json\n{\n  \"levels\": [\n    {\n      \"score\": 1,\n      \"label\": \"Poor\",\n      \"description\": \"Code is difficult to understand without significant effort\",\n      \"characteristics\": [\n        \"No meaningful variable or function names\",\n        \"No comments or documentation\",\n        \"Deeply nested or convoluted logic\"\n      ]\n    },\n    {\n      \"score\": 3,\n      \"label\": \"Adequate\",\n      \"description\": \"Code is understandable with some effort\",\n      \"characteristics\": [\n        \"Most variables have meaningful names\",\n        \"Basic comments present for complex sections\",\n        \"Logic is followable but could be cleaner\"\n      ]\n    },\n    {\n      \"score\": 5,\n      \"label\": \"Excellent\",\n      \"description\": \"Code is immediately clear and maintainable\",\n      \"characteristics\": [\n        \"All names are descriptive and consistent\",\n        \"Comprehensive documentation\",\n        \"Clean, modular structure\"\n      ]\n    }\n  ],\n  \"edgeCases\": [\n    {\n      \"situation\": \"Code is well-structured but uses domain-specific abbreviations\",\n      \"guidance\": \"Score based on readability for domain experts, not general audience\"\n    }\n  ]\n}\n```\n\n## Guidelines\n\n1. **Always require justification before scores** - Chain-of-thought prompting improves reliability by 15-25%\n\n2. **Always swap positions in pairwise comparison** - Single-pass comparison is corrupted by position bias\n\n3. **Match scale granularity to rubric specificity** - Don't use 1-10 without detailed level descriptions\n\n4. **Separate objective and subjective criteria** - Use direct scoring for objective, pairwise for subjective\n\n5. **Include confidence scores** - Calibrate to position consistency and evidence strength\n\n6. **Define edge cases explicitly** - Ambiguous situations cause the most evaluation variance\n\n7. **Use domain-specific rubrics** - Generic rubrics produce generic (less useful) evaluations\n\n8. **Validate against human judgments** - Automated evaluation is only valuable if it correlates with human assessment\n\n9. **Monitor for systematic bias** - Track disagreement patterns by criterion, response type, model\n\n10. 
**Design for iteration** - Evaluation systems improve with feedback loops\n\n## Integration\n\nThis skill integrates with:\n\n- **context-fundamentals** - Evaluation prompts require effective context structure\n- **tool-design** - Evaluation tools need proper schemas and error handling\n- **context-optimization** - Evaluation prompts can be optimized for token efficiency\n- **evaluation** (foundational) - This skill extends the foundational evaluation concepts\n\n## References\n\nInternal reference:\n- LLM-as-Judge Implementation Patterns\n- Bias Mitigation Techniques\n- Metric Selection Guide\n\nExternal research:\n- [Eugene Yan: Evaluating the Effectiveness of LLM-Evaluators](https://eugeneyan.com/writing/llm-evaluators/)\n- [Judging LLM-as-a-Judge (Zheng et al., 2023)](https://arxiv.org/abs/2306.05685)\n- [G-Eval: NLG Evaluation using GPT-4 (Liu et al., 2023)](https://arxiv.org/abs/2303.16634)\n- [Large Language Models are not Fair Evaluators (Wang et al., 2023)](https://arxiv.org/abs/2305.17926)\n\nRelated skills in this collection:\n- evaluation - Foundational evaluation concepts\n- context-fundamentals - Context structure for evaluation prompts\n- tool-design - Building evaluation tools\n\n---\n\n## Skill Metadata\n\n**Created**: 2024-12-24\n**Last Updated**: 2024-12-24\n**Author**: Muratcan Koylan\n**Version**: 1.0.0\n\n## Limitations\n- Use this skill only when the task clearly matches the scope described above.\n- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.\n- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.","tags":["advanced","evaluation","antigravity","awesome","skills","sickn33","agent-skills","agentic-skills","ai-agent-skills","ai-agents","ai-coding","ai-workflows"],"capabilities":["skill","source-sickn33","skill-advanced-evaluation","topic-agent-skills","topic-agentic-skills","topic-ai-agent-skills","topic-ai-agents","topic-ai-coding","topic-ai-workflows","topic-antigravity","topic-antigravity-skills","topic-claude-code","topic-claude-code-skills","topic-codex-cli","topic-codex-skills"],"categories":["antigravity-awesome-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/sickn33/antigravity-awesome-skills/advanced-evaluation","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add sickn33/antigravity-awesome-skills","source_repo":"https://github.com/sickn33/antigravity-awesome-skills","install_from":"skills.sh"}},"qualityScore":"0.700","qualityRationale":"deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 34997 github stars · SKILL.md body (16,558 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-04-25T06:50:22.354Z","embedding":null,"createdAt":"2026-04-18T21:30:22.618Z","updatedAt":"2026-04-25T06:50:22.354Z","lastSeenAt":"2026-04-25T06:50:22.354Z","tsv":"'-1':543,849 '-10':568,1659 '-12':1896,1901 '-24':1897,1902 '-25':679,1631 '-3':547 '-4':1847 '-5':447,557,1337,1502 '-60':885 '/abs/2303.16634)':1854 '/abs/2305.17926)':1868 '/abs/2306.05685)':1839 '/writing/llm-evaluators/)':1826 '0':542,848 '0.5':871 '0.6':1446,1464 '0.7':1471 '0.8':1437 
'subject':244,1668,1677 'substitut':1929 'success':1951 'suggest':643 'suit':97 'summar':1197 'summari':657 'sun':1318 'sunlight':1324,1356,1383 'swap':332,1025,1077,1404,1634 'synthes':60 'system':78,149,1003,1748 'systemat':309,489,1733 'task':427,429,594,1915 'taxonomi':180 'team':146 'technic':1415 'techniqu':50,90,1809 'terminolog':976,992 'test':156,1935 'text':914 'thing':1091 'thought':661,1625 'three':518 'threshold':1261 'tie':739,794,873 'tilt':1312,1349,1378,1393 'time':1327 'token':1787 'tone':246,404,1186 'tool':1768,1771,1887,1891 'tool-design':1767,1886 'topic-agent-skills' 'topic-agentic-skills' 'topic-ai-agent-skills' 'topic-ai-agents' 'topic-ai-coding' 'topic-ai-workflows' 'topic-antigravity' 'topic-antigravity-skills' 'topic-claude-code' 'topic-claude-code-skills' 'topic-codex-cli' 'topic-codex-skills' 'toxic':212 'track':1735 'translat':1201 'treat':1924 'treatment':324 'tree':1157 'true':1474 'truth':302,1163 'twice':330 'two':185,235,762 'type':430,1741 'typic':950 'understand':1494,1518,1545 'unnecessari':390 'unreli':1093 'updat':1899 'use':8,55,119,334,371,576,954,972,1154,1217,1599,1657,1670,1702,1712,1845,1909 'user':11 'valid':1715,1934 'valuabl':920,1723 'variabl':982,1525,1551 'varianc':882,1700 'verbos':381 'verdict':745,872 'version':1906 'volum':1210,1256 'vote':336,1223 'vs':783,1152 'wang':1862 'weight':458,539,615,1016,1333 'well':219,877,1596 'well-defin':218,876 'well-structur':1595 'winner':747,833,846,1434,1443,1448,1461,1468 'without':1042,1519,1660 'wors':1135 'wrong':1132 'yan':1816 'year':1329 'yes':1164,1182 'zheng':269,1833 'κ':443,457 'ρ':451 'τ':454","prices":[{"id":"c2f1dbc8-7774-49e6-bd19-91e9ffa4615d","listingId":"ef86bc39-e471-476f-8084-4dc1d08ddc58","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"sickn33","category":"antigravity-awesome-skills","install_from":"skills.sh"},"createdAt":"2026-04-18T21:30:22.618Z"}],"sources":[{"listingId":"ef86bc39-e471-476f-8084-4dc1d08ddc58","source":"github","sourceId":"sickn33/antigravity-awesome-skills/advanced-evaluation","sourceUrl":"https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/advanced-evaluation","isPrimary":false,"firstSeenAt":"2026-04-18T21:30:22.618Z","lastSeenAt":"2026-04-25T06:50:22.354Z"}],"details":{"listingId":"ef86bc39-e471-476f-8084-4dc1d08ddc58","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"sickn33","slug":"advanced-evaluation","github":{"repo":"sickn33/antigravity-awesome-skills","stars":34997,"topics":["agent-skills","agentic-skills","ai-agent-skills","ai-agents","ai-coding","ai-workflows","antigravity","antigravity-skills","claude-code","claude-code-skills","codex-cli","codex-skills","cursor","cursor-skills","developer-tools","gemini-cli","gemini-skills","kiro","mcp","skill-library"],"license":"mit","html_url":"https://github.com/sickn33/antigravity-awesome-skills","pushed_at":"2026-04-25T06:33:17Z","description":"Installable GitHub library of 1,400+ agentic skills for Claude Code, Cursor, Codex CLI, Gemini CLI, Antigravity, and more. 
Includes installer CLI, bundles, workflows, and official/community skill collections.","skill_md_sha":"be8b44db6edeb13553ebb916b1dde64299034e92","skill_md_path":"skills/advanced-evaluation/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/advanced-evaluation"},"layout":"multi","source":"github","category":"antigravity-awesome-skills","frontmatter":{"name":"advanced-evaluation","description":"This skill should be used when the user asks to \"implement LLM-as-judge\", \"compare model outputs\", \"create evaluation rubrics\", \"mitigate evaluation bias\", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment."},"skills_sh_url":"https://skills.sh/sickn33/antigravity-awesome-skills/advanced-evaluation"},"updatedAt":"2026-04-25T06:50:22.354Z"}}