{"id":"79be5e11-daf3-4a54-aabd-3a752d132fe1","shortId":"QSdb4K","kind":"skill","title":"dspy-evaluation-suite","tagline":"This skill should be used when the user asks to \"evaluate a DSPy program\", \"test my DSPy module\", \"measure performance\", \"create evaluation metrics\", \"use answer_exact_match or SemanticF1\", mentions \"Evaluate class\", \"comparing programs\", \"establishing baselines\", or needs to sys","description":"# DSPy Evaluation Suite\n\n## Goal\n\nSystematically evaluate DSPy programs using built-in and custom metrics with parallel execution.\n\n## When to Use\n\n- Measuring program performance before/after optimization\n- Comparing different program variants\n- Establishing baselines\n- Validating production readiness\n\n## Related Skills\n\n- Use with any optimizer: [dspy-bootstrap-fewshot](../dspy-bootstrap-fewshot/SKILL.md), [dspy-miprov2-optimizer](../dspy-miprov2-optimizer/SKILL.md), [dspy-gepa-reflective](../dspy-gepa-reflective/SKILL.md)\n- Evaluate RAG pipelines: [dspy-rag-pipeline](../dspy-rag-pipeline/SKILL.md)\n\n## Inputs\n\n| Input | Type | Description |\n|-------|------|-------------|\n| `program` | `dspy.Module` | Program to evaluate |\n| `devset` | `list[dspy.Example]` | Evaluation examples |\n| `metric` | `callable` | Scoring function |\n| `num_threads` | `int` | Parallel threads |\n\n## Outputs\n\n| Output | Type | Description |\n|--------|------|-------------|\n| `score` | `float` | Average metric score |\n| `results` | `list` | Per-example results |\n\n## Workflow\n\n### Phase 1: Setup Evaluator\n\n```python\nfrom dspy.evaluate import Evaluate\n\nevaluator = Evaluate(\n    devset=devset,\n    metric=my_metric,\n    num_threads=8,\n    display_progress=True\n)\n```\n\n### Phase 2: Run Evaluation\n\n```python\nresult = evaluator(my_program)\nprint(f\"Score: {result.score:.2f}%\")\n# Access individual results: (example, prediction, score) tuples\nfor example, pred, score in result.results[:3]:\n    print(f\"Example: {example.question[:50]}... 
Score: {score}\")\n```\n\n## Built-in Metrics\n\n### answer_exact_match\n\n```python\nimport dspy\n\n# Normalized, case-insensitive comparison\nmetric = dspy.evaluate.answer_exact_match\n```\n\n### SemanticF1\n\nLLM-based semantic evaluation:\n\n```python\nfrom dspy.evaluate import SemanticF1\n\nsemantic = SemanticF1()\nscore = semantic(example, prediction)\n```\n\n## Custom Metrics\n\n### Basic Metric\n\n```python\ndef exact_match(example, pred, trace=None):\n    \"\"\"Returns bool, int, or float.\"\"\"\n    return example.answer.lower().strip() == pred.answer.lower().strip()\n```\n\n### Multi-Factor Metric\n\n```python\ndef quality_metric(example, pred, trace=None):\n    \"\"\"Score based on multiple factors.\"\"\"\n    score = 0.0\n    \n    # Correctness (50%)\n    if example.answer.lower() in pred.answer.lower():\n        score += 0.5\n    \n    # Conciseness (25%)\n    if len(pred.answer.split()) <= 20:\n        score += 0.25\n    \n    # Has reasoning (25%)\n    if hasattr(pred, 'reasoning') and pred.reasoning:\n        score += 0.25\n    \n    return score\n```\n\n### GEPA-Compatible Metric\n\n```python\ndef feedback_metric(example, pred, trace=None):\n    \"\"\"Returns (score, feedback) for GEPA optimizer.\"\"\"\n    correct = example.answer.lower() in pred.answer.lower()\n    \n    if correct:\n        return 1.0, \"Correct answer provided.\"\n    else:\n        return 0.0, f\"Expected '{example.answer}', got '{pred.answer}'\"\n```\n\n## Production Example\n\n```python\nimport dspy\nfrom dspy.evaluate import Evaluate, SemanticF1\nimport json\nimport logging\nfrom typing import Optional\nfrom dataclasses import dataclass\n\nlogger = logging.getLogger(__name__)\n\n@dataclass\nclass EvaluationResult:\n    score: float\n    num_examples: int\n    correct: int\n    incorrect: int\n    errors: int\n\ndef comprehensive_metric(example, pred, trace=None) -> float:\n    \"\"\"Multi-dimensional evaluation metric.\"\"\"\n    scores = []\n    \n    # 1. Correctness\n    if hasattr(example, 'answer') and hasattr(pred, 'answer'):\n        correct = example.answer.lower().strip() in pred.answer.lower().strip()\n        scores.append(1.0 if correct else 0.0)\n    \n    # 2. Completeness (answer not empty or error)\n    if hasattr(pred, 'answer'):\n        complete = len(pred.answer.strip()) > 0 and \"error\" not in pred.answer.lower()\n        scores.append(1.0 if complete else 0.0)\n    \n    # 3. 
## Production Example\n\n```python\nimport json\nimport logging\nfrom dataclasses import dataclass\n\nfrom dspy.evaluate import Evaluate\n\nlogger = logging.getLogger(__name__)\n\n@dataclass\nclass EvaluationResult:\n    score: float\n    num_examples: int\n    correct: int\n    incorrect: int\n    errors: int\n\ndef comprehensive_metric(example, pred, trace=None) -> float:\n    \"\"\"Multi-dimensional evaluation metric.\"\"\"\n    scores = []\n    \n    # 1. Correctness\n    if hasattr(example, 'answer') and hasattr(pred, 'answer'):\n        correct = example.answer.lower().strip() in pred.answer.lower().strip()\n        scores.append(1.0 if correct else 0.0)\n    \n    # 2. Completeness (answer not empty or error)\n    if hasattr(pred, 'answer'):\n        complete = len(pred.answer.strip()) > 0 and \"error\" not in pred.answer.lower()\n        scores.append(1.0 if complete else 0.0)\n    \n    # 3. Reasoning quality (if available)\n    if hasattr(pred, 'reasoning'):\n        has_reasoning = len(str(pred.reasoning)) > 20\n        scores.append(1.0 if has_reasoning else 0.5)\n    \n    return sum(scores) / len(scores) if scores else 0.0\n\nclass EvaluationSuite:\n    def __init__(self, devset, num_threads=8):\n        self.devset = devset\n        self.num_threads = num_threads\n    \n    def evaluate(self, program, metric=None) -> EvaluationResult:\n        \"\"\"Run full evaluation with detailed results.\"\"\"\n        metric = metric or comprehensive_metric\n\n        evaluator = Evaluate(\n            devset=self.devset,\n            metric=metric,\n            num_threads=self.num_threads,\n            display_progress=True\n        )\n\n        eval_result = evaluator(program)\n\n        # Extract per-example scores; a zero score is counted as an error\n        scores = [score for _example, _pred, score in eval_result.results]\n        correct = sum(1 for s in scores if s >= 0.5)\n        errors = sum(1 for s in scores if s == 0)\n\n        return EvaluationResult(\n            score=eval_result.score,\n            num_examples=len(self.devset),\n            correct=correct,\n            incorrect=len(self.devset) - correct - errors,\n            errors=errors\n        )\n    \n    def compare(self, programs: dict, metric=None) -> dict:\n        \"\"\"Compare multiple programs.\"\"\"\n        results = {}\n        \n        for name, program in programs.items():\n            logger.info(f\"Evaluating: {name}\")\n            results[name] = self.evaluate(program, metric)\n        \n        # Rank by score (Evaluate reports scores as percentages)\n        ranked = sorted(results.items(), key=lambda x: x[1].score, reverse=True)\n        \n        print(\"\\n=== Comparison Results ===\")\n        for rank, (name, result) in enumerate(ranked, 1):\n            print(f\"{rank}. {name}: {result.score:.2f}%\")\n        \n        return results\n    \n    def export_report(self, program, output_path: str, metric=None):\n        \"\"\"Export detailed evaluation report.\"\"\"\n        result = self.evaluate(program, metric)\n        \n        report = {\n            \"summary\": {\n                \"score\": result.score,\n                \"total\": result.num_examples,\n                \"correct\": result.correct,\n                \"accuracy\": result.correct / result.num_examples\n            },\n            \"config\": {\n                \"num_threads\": self.num_threads,\n                \"num_examples\": len(self.devset)\n            }\n        }\n        \n        with open(output_path, 'w') as f:\n            json.dump(report, f, indent=2)\n        \n        logger.info(f\"Report saved to {output_path}\")\n        return report\n\n# Usage\nsuite = EvaluationSuite(devset, num_threads=8)\n\n# Single evaluation\nresult = suite.evaluate(my_program)\nprint(f\"Score: {result.score:.2f}%\")\n\n# Compare variants\nresults = suite.compare({\n    \"baseline\": baseline_program,\n    \"optimized\": optimized_program,\n    \"finetuned\": finetuned_program\n})\n```\n\n## Best Practices\n\n1. **Hold out test data** - Never optimize on the evaluation set (see the split sketch after this list)\n2. **Multiple metrics** - Combine correctness, quality, and efficiency\n3. **Statistical significance** - Use enough examples (100+)\n4. **Track over time** - Version control evaluation results\n\n
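A minimal holdout sketch for practice 1, assuming `dataset` is a list of `dspy.Example` objects (the split sizes are illustrative):\n\n```python\nimport random\n\n# Fix the seed so the split is reproducible across runs.\nrandom.Random(0).shuffle(dataset)\n\ntrainset = dataset[:100]    # used by the optimizer\ndevset = dataset[100:200]   # used for Evaluate during development\ntestset = dataset[200:]     # touched once, for the final report\n```\n\n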
## Limitations\n\n- Metrics are task-specific; no universal measure\n- SemanticF1 requires LLM calls, which adds cost and latency\n- Parallel evaluation can hit rate limits\n- Edge cases may not be captured\n\n## Official Documentation\n\n- **DSPy Documentation**: https://dspy.ai/\n- **DSPy GitHub**: https://github.com/stanfordnlp/dspy\n- **Evaluation API**: https://dspy.ai/api/evaluation/\n- **Metrics Guide**: https://dspy.ai/learn/evaluation/metrics/","tags":["dspy","evaluation","suite","skills","omidzamani","agent-skills","claude-code","claude-skills","llm","prompt-optimization","rag"],"capabilities":["skill","source-omidzamani","skill-dspy-evaluation-suite","topic-agent-skills","topic-claude-code","topic-claude-skills","topic-dspy","topic-llm","topic-prompt-optimization","topic-rag"],"categories":["dspy-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/OmidZamani/dspy-skills/dspy-evaluation-suite","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add OmidZamani/dspy-skills","source_repo":"https://github.com/OmidZamani/dspy-skills","install_from":"skills.sh"}},"qualityScore":"0.487","qualityRationale":"deterministic score 0.49 from registry signals: · indexed on github topic:agent-skills · 74 github stars · SKILL.md body (7,416 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-02T06:55:44.209Z","embedding":null,"createdAt":"2026-04-18T22:14:11.049Z","updatedAt":"2026-05-02T06:55:44.209Z","lastSeenAt":"2026-05-02T06:55:44.209Z",
"prices":[{"id":"45056114-887c-4a48-af0f-904c96e443b9","listingId":"79be5e11-daf3-4a54-aabd-3a752d132fe1","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"OmidZamani","category":"dspy-skills","install_from":"skills.sh"},"createdAt":"2026-04-18T22:14:11.049Z"}],"sources":[{"listingId":"79be5e11-daf3-4a54-aabd-3a752d132fe1","source":"github","sourceId":"OmidZamani/dspy-skills/dspy-evaluation-suite","sourceUrl":"https://github.com/OmidZamani/dspy-skills/tree/master/skills/dspy-evaluation-suite","isPrimary":false,"firstSeenAt":"2026-04-18T22:14:11.049Z","lastSeenAt":"2026-05-02T06:55:44.209Z"}],"details":{"listingId":"79be5e11-daf3-4a54-aabd-3a752d132fe1","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"OmidZamani","slug":"dspy-evaluation-suite","github":{"repo":"OmidZamani/dspy-skills","stars":74,"topics":["agent-skills","claude-code","claude-skills","dspy","llm","prompt-optimization","rag"],"license":"mit","html_url":"https://github.com/OmidZamani/dspy-skills","pushed_at":"2026-02-21T12:49:43Z","description":"Collection of Claude Skills for DSPy framework - program language models, optimize prompts, and build RAG pipelines systematically","skill_md_sha":"6a38ff90dd1e64aa16f255aba12a53e9fc376e14","skill_md_path":"skills/dspy-evaluation-suite/SKILL.md","default_branch":"master","skill_tree_url":"https://github.com/OmidZamani/dspy-skills/tree/master/skills/dspy-evaluation-suite"},"layout":"multi","source":"github","category":"dspy-skills","frontmatter":{"name":"dspy-evaluation-suite","description":"This skill should be used when the user asks to \"evaluate a DSPy program\", \"test my DSPy module\", \"measure performance\", \"create evaluation metrics\", \"use answer_exact_match or SemanticF1\", mentions \"Evaluate class\", \"comparing programs\", \"establishing baselines\", or needs to systematically test and measure DSPy program quality with custom or built-in metrics."},"skills_sh_url":"https://skills.sh/OmidZamani/dspy-skills/dspy-evaluation-suite"},"updatedAt":"2026-05-02T06:55:44.209Z"}}