{"id":"bff3d176-7c07-4c72-a54a-f0f446c06ed0","shortId":"9qhvwd","kind":"skill","title":"llm-evaluation","tagline":"Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.","description":"# LLM Evaluation\n\nMaster comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.\n\n## Do not use this skill when\n\n- The task is unrelated to LLM evaluation\n- You need a different domain or tool outside this scope\n\n## Instructions\n\n- Clarify goals, constraints, and required inputs.\n- Apply relevant best practices and validate outcomes.\n- Provide actionable steps and verification.\n- If detailed examples are required, open `resources/implementation-playbook.md`.\n\n## Use this skill when\n\n- Measuring LLM application performance systematically\n- Comparing different models or prompts\n- Detecting performance regressions before deployment\n- Validating improvements from prompt changes\n- Building confidence in production systems\n- Establishing baselines and tracking progress over time\n- Debugging unexpected model behavior\n\n## Core Evaluation Types\n\n### 1. Automated Metrics\nFast, repeatable, scalable evaluation using computed scores.\n\n**Text Generation:**\n- **BLEU**: N-gram precision overlap (translation)\n- **ROUGE**: Recall-oriented n-gram overlap (summarization)\n- **METEOR**: Unigram matching with stemming and synonyms\n- **BERTScore**: Embedding-based similarity\n- **Perplexity**: How well the model predicts the text (lower is better)\n\n**Classification:**\n- **Accuracy**: Percentage correct\n- **Precision/Recall/F1**: Class-specific performance\n- **Confusion Matrix**: Error patterns\n- **AUC-ROC**: Ranking quality\n\n**Retrieval (RAG):**\n- **MRR**: Mean Reciprocal Rank\n- **NDCG**: Normalized Discounted Cumulative Gain\n- **Precision@K**: Relevant in top K\n- **Recall@K**: Coverage in top K\n\n### 2. 
Human Evaluation\nManual assessment for quality aspects difficult to automate.\n\n**Dimensions:**\n- **Accuracy**: Factual correctness\n- **Coherence**: Logical flow\n- **Relevance**: Answers the question\n- **Fluency**: Natural language quality\n- **Safety**: No harmful content\n- **Helpfulness**: Useful to the user\n\n### 3. LLM-as-Judge\nUse stronger LLMs to evaluate weaker model outputs.\n\n**Approaches:**\n- **Pointwise**: Score individual responses\n- **Pairwise**: Compare two responses\n- **Reference-based**: Compare to gold standard\n- **Reference-free**: Judge without ground truth\n\n## Quick Start\n\n```python\n# Illustrative API; see assets/evaluation-framework.py for the full harness\nfrom llm_eval import EvaluationSuite, Metric\n\n# Define evaluation suite\nsuite = EvaluationSuite([\n    Metric.accuracy(),\n    Metric.bleu(),\n    Metric.bertscore(),\n    Metric.custom(name=\"groundedness\", fn=calculate_groundedness)  # defined under Custom Metrics below\n])\n\n# Prepare test cases\ntest_cases = [\n    {\n        \"input\": \"What is the capital of France?\",\n        \"expected\": \"Paris\",\n        \"context\": \"France is a country in Europe. Paris is its capital.\"\n    },\n    # ... 
more test cases\n]\n\n# Run evaluation\nresults = suite.evaluate(\n    model=your_model,\n    test_cases=test_cases\n)\n\nprint(f\"Overall Accuracy: {results.metrics['accuracy']}\")\nprint(f\"BLEU Score: {results.metrics['bleu']}\")\n```\n\n## Automated Metrics Implementation\n\n### BLEU Score\n```python\nfrom nltk.translate.bleu_score import sentence_bleu, SmoothingFunction\n\ndef calculate_bleu(reference, hypothesis):\n    \"\"\"Calculate BLEU score between reference and hypothesis.\"\"\"\n    smoothie = SmoothingFunction().method4\n\n    return sentence_bleu(\n        [reference.split()],\n        hypothesis.split(),\n        smoothing_function=smoothie\n    )\n\n# Usage\nbleu = calculate_bleu(\n    reference=\"The cat sat on the mat\",\n    hypothesis=\"A cat is sitting on the mat\"\n)\n```\n\n### ROUGE Score\n```python\nfrom rouge_score import rouge_scorer\n\ndef calculate_rouge(reference, hypothesis):\n    \"\"\"Calculate ROUGE scores.\"\"\"\n    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)\n    scores = scorer.score(reference, hypothesis)\n\n    return {\n        'rouge1': scores['rouge1'].fmeasure,\n        'rouge2': scores['rouge2'].fmeasure,\n        'rougeL': scores['rougeL'].fmeasure\n    }\n```\n\n### BERTScore\n```python\nfrom bert_score import score\n\ndef calculate_bertscore(references, hypotheses):\n    \"\"\"Calculate BERTScore using pre-trained BERT.\"\"\"\n    P, R, F1 = score(\n        hypotheses,\n        references,\n        lang='en',\n        model_type='microsoft/deberta-xlarge-mnli'\n    )\n\n    return {\n        'precision': P.mean().item(),\n        'recall': R.mean().item(),\n        'f1': F1.mean().item()\n    }\n```\n\n### Custom Metrics\n```python\ndef calculate_groundedness(response, context):\n    \"\"\"Check if response is grounded in provided context.\"\"\"\n    # Use NLI model to check entailment\n    from transformers import pipeline\n\n    nli = pipeline(\"text-classification\", 
model=\"microsoft/deberta-large-mnli\")\n\n    # Score the (premise, hypothesis) pair directly instead of a manual [SEP] join\n    result = nli({\"text\": context, \"text_pair\": response})[0]\n\n    # Return confidence that response is entailed by context\n    return result['score'] if result['label'] == 'ENTAILMENT' else 0.0\n\ndef calculate_toxicity(text):\n    \"\"\"Measure toxicity in generated text.\"\"\"\n    from detoxify import Detoxify\n\n    results = Detoxify('original').predict(text)\n    return max(results.values())  # Return highest toxicity score\n\ndef calculate_factuality(claim, knowledge_base):\n    \"\"\"Verify factual claims against knowledge base.\"\"\"\n    # Implementation depends on your knowledge base\n    # Could use retrieval + NLI, or fact-checking API\n    pass\n```\n\n## LLM-as-Judge Patterns\n\n### Single Output Evaluation\n```python\nimport json\n\nfrom openai import OpenAI\n\nclient = OpenAI()\n\ndef llm_judge_quality(response, question):\n    \"\"\"Use GPT-5 to judge response quality.\"\"\"\n    prompt = f\"\"\"Rate the following response on a scale of 1-10 for:\n1. Accuracy (factually correct)\n2. Helpfulness (answers the question)\n3. Clarity (well-written and understandable)\n\nQuestion: {question}\nResponse: {response}\n\nProvide ratings in JSON format:\n{{\n  \"accuracy\": <1-10>,\n  \"helpfulness\": <1-10>,\n  \"clarity\": <1-10>,\n  \"reasoning\": \"<brief explanation>\"\n}}\n\"\"\"\n\n    result = client.chat.completions.create(\n        model=\"gpt-5\",\n        messages=[{\"role\": \"user\", \"content\": prompt}],\n        temperature=0\n    )\n\n    return json.loads(result.choices[0].message.content)\n```\n\n### Pairwise Comparison\n```python\ndef compare_responses(question, response_a, response_b):\n    \"\"\"Compare two responses using LLM judge.\"\"\"\n    prompt = f\"\"\"Compare these two responses to the question and determine which is better.\n\nQuestion: {question}\n\nResponse A: {response_a}\n\nResponse B: {response_b}\n\nWhich response is better and why? 
Consider accuracy, helpfulness, and clarity.\n\nAnswer with JSON:\n{{\n  \"winner\": \"A\" or \"B\" or \"tie\",\n  \"reasoning\": \"<explanation>\",\n  \"confidence\": <1-10>\n}}\n\"\"\"\n\n    from openai import OpenAI\n    import json\n\n    client = OpenAI()\n    result = client.chat.completions.create(\n        model=\"gpt-5\",\n        messages=[{\"role\": \"user\", \"content\": prompt}],\n        temperature=0\n    )\n\n    return json.loads(result.choices[0].message.content)\n```\n\nLLM judges tend to favor the first-listed response, so run each comparison twice with the order swapped and treat disagreements as ties.\n\n## Human Evaluation Frameworks\n\n### Annotation Guidelines\n```python\nclass AnnotationTask:\n    \"\"\"Structure for human annotation task.\"\"\"\n\n    def __init__(self, response, question, context=None):\n        self.response = response\n        self.question = question\n        self.context = context\n\n    def get_annotation_form(self):\n        return {\n            \"question\": self.question,\n            \"context\": self.context,\n            \"response\": self.response,\n            \"ratings\": {\n                \"accuracy\": {\n                    \"scale\": \"1-5\",\n                    \"description\": \"Is the response factually correct?\"\n                },\n                \"relevance\": {\n                    \"scale\": \"1-5\",\n                    \"description\": \"Does it answer the question?\"\n                },\n                \"coherence\": {\n                    \"scale\": \"1-5\",\n                    \"description\": \"Is it logically consistent?\"\n                }\n            },\n            \"issues\": {\n                \"factual_error\": False,\n                \"hallucination\": False,\n                \"off_topic\": False,\n                \"unsafe_content\": False\n            },\n            \"feedback\": \"\"\n        }\n```\n\n### Inter-Rater Agreement\n```python\nfrom sklearn.metrics import cohen_kappa_score\n\ndef calculate_agreement(rater1_scores, rater2_scores):\n    \"\"\"Calculate inter-rater agreement.\"\"\"\n    kappa = cohen_kappa_score(rater1_scores, rater2_scores)\n\n    interpretation = {\n    
    0.0: \"Poor\",\n        0.2: \"Slight\",\n        0.4: \"Fair\",\n        0.6: \"Moderate\",\n        0.8: \"Substantial\",\n        1.0: \"Almost Perfect\"\n    }\n\n    # Pick the label for the first upper bound that kappa falls below\n    label = next(\n        name for bound, name in interpretation.items()\n        if kappa < bound or bound == 1.0\n    )\n\n    return {\n        \"kappa\": kappa,\n        \"interpretation\": label\n    }\n```\n\n## A/B Testing\n\n### Statistical Testing Framework\n```python\nfrom scipy import stats\nimport numpy as np\n\nclass ABTest:\n    def __init__(self, variant_a_name=\"A\", variant_b_name=\"B\"):\n        self.variant_a = {\"name\": variant_a_name, \"scores\": []}\n        self.variant_b = {\"name\": variant_b_name, \"scores\": []}\n\n    def add_result(self, variant, score):\n        \"\"\"Add evaluation result for a variant.\"\"\"\n        if variant == \"A\":\n            self.variant_a[\"scores\"].append(score)\n        else:\n            self.variant_b[\"scores\"].append(score)\n\n    def analyze(self, alpha=0.05):\n        \"\"\"Perform statistical analysis.\"\"\"\n        a_scores = self.variant_a[\"scores\"]\n        b_scores = self.variant_b[\"scores\"]\n\n        # T-test\n        t_stat, p_value = stats.ttest_ind(a_scores, b_scores)\n\n        # Effect size (Cohen's d, using sample standard deviations)\n        pooled_std = np.sqrt((np.std(a_scores, ddof=1)**2 + np.std(b_scores, ddof=1)**2) / 2)\n        cohens_d = (np.mean(b_scores) - np.mean(a_scores)) / pooled_std\n\n        return {\n            \"variant_a_mean\": np.mean(a_scores),\n            \"variant_b_mean\": np.mean(b_scores),\n            \"difference\": np.mean(b_scores) - np.mean(a_scores),\n            \"relative_improvement\": (np.mean(b_scores) - np.mean(a_scores)) / np.mean(a_scores),\n            \"p_value\": p_value,\n            \"statistically_significant\": p_value < alpha,\n            \"cohens_d\": cohens_d,\n            \"effect_size\": self.interpret_cohens_d(cohens_d),\n            \"winner\": \"B\" if np.mean(b_scores) > np.mean(a_scores) else \"A\"\n        }\n\n    @staticmethod\n    def 
interpret_cohens_d(d):\n        \"\"\"Interpret Cohen's d effect size.\"\"\"\n        abs_d = abs(d)\n        if abs_d < 0.2:\n            return \"negligible\"\n        elif abs_d < 0.5:\n            return \"small\"\n        elif abs_d < 0.8:\n            return \"medium\"\n        else:\n            return \"large\"\n```\n\n## Regression Testing\n\n### Regression Detection\n```python\nclass RegressionDetector:\n    def __init__(self, baseline_results, threshold=0.05):\n        self.baseline = baseline_results\n        self.threshold = threshold\n\n    def check_for_regression(self, new_results):\n        \"\"\"Detect if new results show regression.\"\"\"\n        regressions = []\n\n        for metric in self.baseline.keys():\n            baseline_score = self.baseline[metric]\n            new_score = new_results.get(metric)\n\n            if new_score is None:\n                continue\n\n            # Calculate relative change\n            relative_change = (new_score - baseline_score) / baseline_score\n\n            # Flag if significant decrease\n            if relative_change < -self.threshold:\n                regressions.append({\n                    \"metric\": metric,\n                    \"baseline\": baseline_score,\n                    \"current\": new_score,\n                    \"change\": relative_change\n                })\n\n        return {\n            \"has_regression\": len(regressions) > 0,\n            \"regressions\": regressions\n        }\n```\n\n## Benchmarking\n\n### Running Benchmarks\n```python\nimport numpy as np\n\nclass BenchmarkRunner:\n    def __init__(self, benchmark_dataset):\n        self.dataset = benchmark_dataset\n\n    def run_benchmark(self, model, metrics):\n        \"\"\"Run model on benchmark and calculate metrics.\"\"\"\n        results = {metric.name: [] for metric in metrics}\n\n        for example in self.dataset:\n            # Generate prediction\n            prediction = model.predict(example[\"input\"])\n\n            # Calculate 
each metric\n            for metric in metrics:\n                score = metric.calculate(\n                    prediction=prediction,\n                    reference=example[\"reference\"],\n                    context=example.get(\"context\")\n                )\n                results[metric.name].append(score)\n\n        # Aggregate results\n        return {\n            metric: {\n                \"mean\": np.mean(scores),\n                \"std\": np.std(scores),\n                \"min\": min(scores),\n                \"max\": max(scores)\n            }\n            for metric, scores in results.items()\n        }\n```\n\n## Resources\n\n- **references/metrics.md**: Comprehensive metric guide\n- **references/human-evaluation.md**: Annotation best practices\n- **references/benchmarking.md**: Standard benchmarks\n- **references/a-b-testing.md**: Statistical testing guide\n- **references/regression-testing.md**: CI/CD integration\n- **assets/evaluation-framework.py**: Complete evaluation harness\n- **assets/benchmark-dataset.jsonl**: Example datasets\n- **scripts/evaluate-model.py**: Automated evaluation runner\n\n## Best Practices\n\n1. **Multiple Metrics**: Use diverse metrics for comprehensive view\n2. **Representative Data**: Test on real-world, diverse examples\n3. **Baselines**: Always compare against baseline performance\n4. **Statistical Rigor**: Use proper statistical tests for comparisons\n5. **Continuous Evaluation**: Integrate into CI/CD pipeline\n6. **Human Validation**: Combine automated metrics with human judgment\n7. **Error Analysis**: Investigate failures to understand weaknesses\n8. 
**Version Control**: Track evaluation results over time\n\n## Common Pitfalls\n\n- **Single Metric Obsession**: Optimizing for one metric at the expense of others\n- **Small Sample Size**: Drawing conclusions from too few examples\n- **Data Contamination**: Testing on training data\n- **Ignoring Variance**: Not accounting for statistical uncertainty\n- **Metric Mismatch**: Using metrics not aligned with business goals\n\n## Limitations\n- Use this skill only when the task clearly matches the scope described above.\n- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.\n- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.","tags":["llm","evaluation","antigravity","awesome","skills","sickn33","agent-skills","agentic-skills","ai-agent-skills","ai-agents","ai-coding","ai-workflows"],"capabilities":["skill","source-sickn33","skill-llm-evaluation","topic-agent-skills","topic-agentic-skills","topic-ai-agent-skills","topic-ai-agents","topic-ai-coding","topic-ai-workflows","topic-antigravity","topic-antigravity-skills","topic-claude-code","topic-claude-code-skills","topic-codex-cli","topic-codex-skills"],"categories":["antigravity-awesome-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/sickn33/antigravity-awesome-skills/llm-evaluation","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add sickn33/antigravity-awesome-skills","source_repo":"https://github.com/sickn33/antigravity-awesome-skills","install_from":"skills.sh"}},"qualityScore":"0.700","qualityRationale":"deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 34726 github stars · SKILL.md body (14,173 
chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-04-23T12:51:10.927Z","embedding":null,"createdAt":"2026-04-18T21:40:06.637Z","updatedAt":"2026-04-23T12:51:10.927Z","lastSeenAt":"2026-04-23T12:51:10.927Z","tsv":"'-10':631,660,663,666,749 '-5':615,672,754,809,819,829 '0':527,679,683,761,765,881,1216 '0.0':544 '0.05':976,1142 '0.2':884,1111 '0.4':887 '0.5':1117 '0.6':890 '0.8':893,1123 '1':130,630,633,659,662,665,748,808,818,828,1336 '1.0':896 '2':206,637,1014,1018,1019,1345 '3':241,642,1355 '4':1362 '5':1371 '6':1378 '7':1387 '8':1395 'a/b':18,36,905 'ab':1104,1106,1109,1115,1121 'abtest':920 'account':1435 'accuraci':166,218,342,344,634,658,733,806 'action':76 'add':947,952 'aggreg':1283 'agreement':851,861,870 'align':1444 'almost':897 'alpha':975,1069 'alway':1357 'analysi':979,1389 'analyz':973 'annot':770,778,795,1310 'annotationtask':774 'answer':225,639,737,823 'api':596 'append':964,970,1281 'appli':68 'applic':10,28,93 'approach':254 'ask':1481 'aspect':213 'assess':210 'assets/benchmark-dataset.jsonl':1327 'assets/evaluation-framework.py':1323 'auc':179 'auc-roc':178 'autom':12,30,131,216,351,1331,1382 'b':695,723,725,743,929,931,940,943,968,985,988,1001,1016,1023,1038,1041,1045,1053,1082,1085 'base':159,265,575,581,587 'baselin':117,1139,1144,1166,1187,1189,1202,1203,1356,1360 'behavior':126 'benchmark':1219,1221,1228,1231,1235,1242,1315 'benchmarkrunn':1224 'bert':451,466 'bertscor':156,448,457,461 'best':70,1311,1334 'better':715,729 'bleu':142,347,350,354,362,366,370,381,388,390 'boundari':1489 'build':111 'busi':1446 'calcul':365,369,389,416,420,456,460,492,546,571,860,866,1180,1244,1262 'capit':309,324 'case':302,304,327,336,338 'cat':393,400 'chang':110,1182,1184,1197,1208,1210 
'check':298,496,508,595,1149 'ci/cd':1321,1376 'claim':573,578 'clarif':1483 'clarifi':62 'clariti':643,664,736 'class':171,773,919,1134,1223 'class-specif':170 'classif':165,518 'clear':1456 'cohen':856,872,1005,1020,1070,1072,1077,1079,1095,1099 'coher':221,826 'combin':1381 'common':1403 'compar':96,260,266,689,696,704,1358 'comparison':686,1370 'complet':1324 'comprehens':5,23,1306,1343 'comput':138 'conclus':1421 'confid':112,164,529,747 'confus':174 'consid':732 'consist':834 'constraint':64 'contamin':1427 'content':235,676,758,845 'context':314,495,503,524,535,785,792,801,1276,1278 'continu':1179,1372 'control':1397 'core':127 'correct':168,220,636,815 'could':588 'countri':318 'coverag':202 'criteria':1492 'cumul':192 'current':1205 'custom':488 'd':1007,1021,1071,1073,1078,1080,1096,1097,1101,1105,1107,1110,1116,1122 'data':1347,1426,1431 'dataset':1229,1232,1329 'debug':123 'decreas':1194 'def':364,415,455,491,545,570,607,688,780,793,859,921,946,972,1093,1136,1148,1225,1233 'defin':286 'depend':583 'deploy':105 'describ':1460 'descript':810,820,830 'detail':81 'detect':101,1132,1155 'determin':712 'detoxifi':555,557,559 'differ':54,97,1043 'difficult':214 'dimens':217 'discount':191 'divers':1340,1353 'domain':55 'draw':1420 'effect':1003,1074,1102 'elif':1114,1120 'els':543,966,1090,1126 'embed':158 'embedding-bas':157 'en':474 'entail':509,533,542 'environ':1472 'environment-specif':1471 'error':176,837,1388 'establish':116 'europ':320 'eval':282 'evalu':3,6,16,21,24,34,50,128,136,208,250,287,329,605,768,953,1325,1332,1373,1399 'evaluationsuit':284,290 'exampl':82,1253,1260,1274,1328,1354,1425 'example.get':1277 'expect':312 'expens':1414 'expert':1477 'f':340,346,523,621,703 'f1':469,485 'f1.mean':486 'fact':594 'fact-check':593 'factual':219,572,577,635,814,836 'failur':1391 'fair':888 'fals':838,840,843,846 'fast':133 'feedback':847 'flag':1191 'flow':223 'fluenci':228 'fmeasur':439,443,447 'fn':297 'follow':624 'form':796 'format':657 
'framework':769,909 'franc':311,315 'free':272 'function':385 'gain':193 'generat':141,552,1256 'get':794 'goal':63,1447 'gold':268 'gpt':614,671,753 'gram':145 'ground':275,500 'grounded':296,299,493 'guid':1308,1319 'guidelin':771 'hallucin':839 'har':1326 'harm':234 'help':236,638,661,734 'highest':567 'human':15,33,207,767,777,1379,1385 'hypothes':459,471 'hypothesi':368,375,398,419,434 'hypothesis.split':383 'ignor':1432 'implement':353,582 'import':283,360,412,453,512,556,855,913,915 'improv':107,1051 'ind':998 'individu':257 'init':781,922,1137,1226 'input':67,305,1261,1486 'instruct':61 'integr':1322,1374 'inter':849,868 'inter-rat':848,867 'interpret':879,902,903,1094,1098 'investig':1390 'issu':835 'item':481,484,487 'json':656,739 'json.loads':681,763 'judg':245,273,601,609,617,701 'judgment':1386 'k':195,199,201,205 'kappa':857,871,873,880,883,886,889,892,895,900,901 'knowledg':574,580,586 'label':541 'lang':473 'languag':162,230 'larg':1128 'len':1214 'limit':1448 'llm':2,9,20,27,49,92,243,281,599,608,700 'llm-as-judg':242,598 'llm-evalu':1 'llms':248 'logic':222,833 'manual':209 'master':4,22 'mat':397,405 'match':1457 'matrix':175 'max':564,1296,1297 'mean':186,1033,1039,1287 'measur':91,549 'medium':1125 'messag':673,755 'message.content':684,766 'meteor':153 'method4':378 'metric':13,31,132,285,352,489,1163,1169,1173,1200,1201,1238,1245,1249,1251,1264,1266,1268,1286,1300,1307,1338,1341,1383,1406,1411,1439,1442 'metric.accuracy':291 'metric.bertscore':293 'metric.bleu':292 'metric.calculate':1270 'metric.custom':294 'metric.name':1247,1280 'microsoft/deberta-large-mnli':520 'microsoft/deberta-xlarge-mnli':477 'min':1293,1294 'mismatch':1440 'miss':1494 'model':98,125,163,252,332,334,475,506,519,670,752,1237,1240 'model.predict':1259 'moder':891 'mrr':185 'multipl':1337 'n':144 'n-gram':143 'name':295,926,930,934,937,941,944 'natur':229 'ndcg':189 'need':52 'neglig':1113 'new':1153,1157,1170,1175,1185,1206 'new_results.get':1172 'nli':505,514,522,591 
'nltk.translate.bleu':358 'none':786,1178 'normal':190 'np':918 'np.mean':1022,1025,1034,1040,1044,1047,1052,1055,1058,1084,1087,1288 'np.sqrt':1010 'np.std':1011,1015,1291 'numpi':916 'obsess':1407 'one':1410 'open':85 'openai.chatcompletion.create':669,751 'optim':1408 'orient':151 'origin':560 'other':1416 'outcom':74 'output':253,604,1466 'outsid':58 'overal':341 'overlap':146 'p':467,995,1061,1063,1067 'p.mean':480 'pairwis':259,685 'pari':313,321 'pass':597 'pattern':177,602 'percentag':167 'perfect':898 'perform':94,102,173,977,1361 'permiss':1487 'perplex':161 'pipelin':513,515,1377 'pitfal':1404 'pointwis':255 'pool':1008,1028 'poor':882 'practic':71,1312,1335 'pre':464 'pre-train':463 'precis':194,479 'precision/recall/f1':169 'predict':561,1257,1258,1271,1272 'prepar':300 'print':339,345 'product':114 'progress':120 'prompt':100,109,620,677,702,759 'proper':1366 'provid':75,502,653 'python':279,356,408,449,490,606,687,772,852,910,1133,1222 'qualiti':182,212,231,610,619 'question':227,612,641,649,650,691,710,716,717,784,790,799,825 'quick':277 'r':468 'r.mean':483 'rag':184 'rank':181,188 'rate':622,654,805 'rater':850,869 'rater1':862,875 'rater2':864,877 'real':1351 'real-world':1350 'reason':667,746 'recal':150,200,482 'recall-ori':149 'reciproc':187 'refer':264,271,367,373,391,418,433,458,472,1273,1275 'reference-bas':263 'reference-fre':270 'reference.split':382 'references/a-b-testing.md':1316 'references/benchmarking.md':1313 'references/human-evaluation.md':1309 'references/metrics.md':1305 'references/regression-testing.md':1320 'regress':103,1129,1131,1151,1160,1161,1213,1215,1217,1218 'regressiondetector':1135 'regressions.append':1199 'relat':1050,1181,1183,1196,1209 'relev':69,196,224,816 'repeat':134 'repres':1346 'requir':66,84,1485 'resourc':1304 'resources/implementation-playbook.md':86 'respons':258,262,494,498,526,531,611,618,625,651,652,690,692,694,698,707,718,720,722,724,727,783,788,803,813 
'result':330,521,537,540,558,668,750,948,954,1140,1145,1154,1158,1246,1279,1284,1400 'result.choices':682,764 'results.items':1303 'results.metrics':343,349 'results.values':565 'retriev':183,590 'return':379,435,478,528,536,563,566,680,762,798,899,1030,1112,1118,1124,1127,1211,1285 'review':1478 'rigor':1364 'roc':180 'role':674,756 'roug':148,406,410,413,417,421 'rouge1':425,436,438 'rouge2':426,440,442 'rouge_scorer.rougescorer':424 'rougel':427,444,446 'run':328,1220,1234,1239 'runner':1333 'safeti':232,1488 'sampl':1418 'sat':394 'scalabl':135 'scale':628,807,817,827 'scipi':912 'scope':60,1459 'score':139,256,348,355,359,371,407,411,422,431,437,441,445,452,454,470,538,569,858,863,865,874,876,878,938,945,951,963,965,969,971,981,984,986,989,1000,1002,1013,1017,1024,1027,1036,1042,1046,1049,1054,1057,1060,1086,1089,1167,1171,1176,1186,1188,1190,1204,1207,1269,1282,1289,1292,1295,1298,1301 'scorer':414,423 'scorer.score':432 'scripts/evaluate-model.py':1330 'self':782,797,923,949,974,1138,1152,1227,1236 'self.baseline':1143,1168 'self.baseline.keys':1165 'self.context':791,802 'self.dataset':1230,1255 'self.interpret':1076 'self.question':789,800 'self.response':787,804 'self.threshold':1146,1198 'self.variant':932,939,961,967,982,987 'semant':154 'sentenc':361,380 'sep':525 'show':1159 'signific':1066,1193 'similar':155,160 'singl':603,1405 'sit':402 'size':1004,1075,1103,1419 'skill':42,89,1451 'skill-llm-evaluation' 'sklearn.metrics':854 'slight':885 'small':1119,1417 'smooth':384 'smoothi':376,386 'smoothingfunct':363,377 'source-sickn33' 'specif':172,1473 'standard':269,1314 'start':278 'stat':914,994 'staticmethod':1092 'statist':907,978,1065,1317,1363,1367,1437 'stats.ttest':997 'std':1009,1029,1290 'stemmer':429 'step':77 'stop':1479 'strategi':7,25 'stronger':247 'structur':775 'substanti':894 'substitut':1469 'success':1491 'suit':288,289 'suite.evaluate':331 'summar':152 'system':115 'systemat':95 't-test':990 'task':45,779,1455 'temperatur':678,760 
'test':19,37,301,303,326,335,337,906,908,992,1130,1318,1348,1368,1428,1475 'text':140,517,548,553,562 'text-classif':516 'threshold':1141,1147 'tie':745 'time':122,1402 'tool':57 'top':198,204 'topic':842 'topic-agent-skills' 'topic-agentic-skills' 'topic-ai-agent-skills' 'topic-ai-agents' 'topic-ai-coding' 'topic-ai-workflows' 'topic-antigravity' 'topic-antigravity-skills' 'topic-claude-code' 'topic-claude-code-skills' 'topic-codex-cli' 'topic-codex-skills' 'toxic':547,550,568 'track':119,1398 'train':465,1430 'transform':511 'translat':147 'treat':1464 'true':430,904 'truth':276 'two':261,697,706 'type':129,476 'uncertainti':1438 'understand':648,1393 'unexpect':124 'unrel':47 'unsaf':844 'usag':387 'use':40,87,137,237,246,428,462,504,589,613,699,1339,1365,1441,1449 'user':240,675,757 'valid':73,106,1380,1474 'valu':996,1062,1064,1068 'varianc':1433 'variant':924,928,935,942,950,957,959,1031,1037 'verif':79 'verifi':576 'version':1396 'view':1344 'weak':1394 'weaker':251 'well':645 'well-written':644 'winner':740,1081 'without':274 'world':1352 
'written':646","prices":[{"id":"59660d5c-2c5b-454c-a7e0-5dff9b59d2c8","listingId":"bff3d176-7c07-4c72-a54a-f0f446c06ed0","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"sickn33","category":"antigravity-awesome-skills","install_from":"skills.sh"},"createdAt":"2026-04-18T21:40:06.637Z"}],"sources":[{"listingId":"bff3d176-7c07-4c72-a54a-f0f446c06ed0","source":"github","sourceId":"sickn33/antigravity-awesome-skills/llm-evaluation","sourceUrl":"https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/llm-evaluation","isPrimary":false,"firstSeenAt":"2026-04-18T21:40:06.637Z","lastSeenAt":"2026-04-23T12:51:10.927Z"}],"details":{"listingId":"bff3d176-7c07-4c72-a54a-f0f446c06ed0","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"sickn33","slug":"llm-evaluation","github":{"repo":"sickn33/antigravity-awesome-skills","stars":34726,"topics":["agent-skills","agentic-skills","ai-agent-skills","ai-agents","ai-coding","ai-workflows","antigravity","antigravity-skills","claude-code","claude-code-skills","codex-cli","codex-skills","cursor","cursor-skills","developer-tools","gemini-cli","gemini-skills","kiro","mcp","skill-library"],"license":"mit","html_url":"https://github.com/sickn33/antigravity-awesome-skills","pushed_at":"2026-04-23T06:41:03Z","description":"Installable GitHub library of 1,400+ agentic skills for Claude Code, Cursor, Codex CLI, Gemini CLI, Antigravity, and more. 
Includes installer CLI, bundles, workflows, and official/community skill collections.","skill_md_sha":"d2b8c7ae0cbfeb854f5e1937f955ef6ec0d2c750","skill_md_path":"skills/llm-evaluation/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/llm-evaluation"},"layout":"multi","source":"github","category":"antigravity-awesome-skills","frontmatter":{"name":"llm-evaluation","description":"Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing."},"skills_sh_url":"https://skills.sh/sickn33/antigravity-awesome-skills/llm-evaluation"},"updatedAt":"2026-04-23T12:51:10.927Z"}}