dspy-evaluation-suite
This skill should be used when the user asks to "evaluate a DSPy program", "test my DSPy module", "measure performance", "create evaluation metrics", "use answer_exact_match or SemanticF1", mentions "Evaluate class", "comparing programs", "establishing baselines", or needs to sys
Price: free
Protocol: skill
Verified: no
What it does
DSPy Evaluation Suite
Goal
Systematically evaluate DSPy programs using built-in and custom metrics with parallel execution.
When to Use
- Measuring program performance before/after optimization
- Comparing different program variants
- Establishing baselines
- Validating production readiness
Related Skills
- Use with any optimizer: dspy-bootstrap-fewshot, dspy-miprov2-optimizer, dspy-gepa-reflective
- Evaluate RAG pipelines: dspy-rag-pipeline
Inputs
| Input | Type | Description |
|---|---|---|
| program | dspy.Module | Program to evaluate |
| devset | list[dspy.Example] | Evaluation examples |
| metric | callable | Scoring function |
| num_threads | int | Number of parallel worker threads |
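The devset input is a list of dspy.Example objects with their input fields marked. A minimal sketch of building one (the field names, the model string, and the dspy.configure call are illustrative assumptions, not requirements of this skill):

```python
import dspy

# Assumed model name; any dspy.LM-compatible model string works here.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# .with_inputs() marks which fields the program receives as inputs;
# the remaining fields (e.g. `answer`) are treated as labels for the metric.
devset = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="Who wrote Hamlet?", answer="William Shakespeare").with_inputs("question"),
]
```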
Outputs
| Output | Type | Description |
|---|---|---|
| score | float | Average metric score |
| results | list | Per-example results |
Workflow
Phase 1: Setup Evaluator
```python
from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset=devset,
    metric=my_metric,
    num_threads=8,
    display_progress=True
)
```
Phase 2: Run Evaluation
```python
result = evaluator(my_program)
print(f"Score: {result.score:.2f}%")

# Access individual results: (example, prediction, score) tuples
for example, pred, score in result.results[:3]:
    print(f"Example: {example.question[:50]}... Score: {score}")
```
Built-in Metrics
answer_exact_match
```python
import dspy

# Normalized, case-insensitive comparison
metric = dspy.evaluate.answer_exact_match
```
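A sketch of plugging the built-in metric into the evaluator from Phase 1; it assumes the devset carries an `answer` label, which is what this metric compares against `pred.answer`:

```python
import dspy
from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset=devset,                             # examples with an `answer` label
    metric=dspy.evaluate.answer_exact_match,   # compared against pred.answer
    num_threads=8,
    display_progress=True,
)
result = evaluator(my_program)
print(f"Exact match: {result.score:.2f}%")
```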
SemanticF1
LLM-based semantic evaluation:
```python
from dspy.evaluate import SemanticF1

semantic = SemanticF1()
score = semantic(example, prediction)
```
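Because SemanticF1 uses an LM as a judge, a language model must be configured before it can score anything; the exact field names it reads from examples and predictions depend on the installed DSPy version, so check the current API. A minimal sketch, with an assumed model name:

```python
import dspy
from dspy.evaluate import SemanticF1

# SemanticF1 calls the configured LM to judge the prediction against the
# reference, so set one up first (model name here is only an example).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

semantic = SemanticF1()
# An instance is also a valid metric for Evaluate, e.g.
# Evaluate(devset=devset, metric=semantic, num_threads=8)
```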
Custom Metrics
Basic Metric
```python
def exact_match(example, pred, trace=None):
    """Returns bool, int, or float."""
    return example.answer.lower().strip() == pred.answer.lower().strip()
```
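During optimizer compilation, DSPy passes a non-None `trace` to the metric, and metrics are conventionally stricter in that case, returning a hard pass/fail instead of a partial score. A sketch of that pattern:

```python
def exact_match_with_trace(example, pred, trace=None):
    """Partial credit at evaluation time, strict pass/fail during compilation."""
    correct = example.answer.lower().strip() == pred.answer.lower().strip()
    if trace is not None:
        # Called inside an optimizer (e.g. BootstrapFewShot): demonstrations
        # are only kept when the metric returns True.
        return correct
    # Called from Evaluate: a float leaves room for partial-credit variants.
    return 1.0 if correct else 0.0
```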
Multi-Factor Metric
```python
def quality_metric(example, pred, trace=None):
    """Score based on multiple factors."""
    score = 0.0

    # Correctness (50%)
    if example.answer.lower() in pred.answer.lower():
        score += 0.5

    # Conciseness (25%)
    if len(pred.answer.split()) <= 20:
        score += 0.25

    # Has reasoning (25%)
    if hasattr(pred, 'reasoning') and pred.reasoning:
        score += 0.25

    return score
```
GEPA-Compatible Metric
```python
def feedback_metric(example, pred, trace=None):
    """Returns (score, feedback) for the GEPA optimizer."""
    correct = example.answer.lower() in pred.answer.lower()
    if correct:
        return 1.0, "Correct answer provided."
    else:
        return 0.0, f"Expected '{example.answer}', got '{pred.answer}'"
```
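Depending on the DSPy version, GEPA may instead expect the feedback wrapped in a `dspy.Prediction` and may pass extra keyword arguments to the metric; a sketch of that variant, to be checked against the installed version's GEPA documentation:

```python
import dspy

def feedback_metric_prediction(example, pred, trace=None, pred_name=None, pred_trace=None):
    """GEPA-style metric returning dspy.Prediction(score=..., feedback=...)."""
    correct = example.answer.lower() in pred.answer.lower()
    if correct:
        return dspy.Prediction(score=1.0, feedback="Correct answer provided.")
    return dspy.Prediction(
        score=0.0,
        feedback=f"Expected '{example.answer}', got '{pred.answer}'",
    )
```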
Production Example
```python
import dspy
from dspy.evaluate import Evaluate, SemanticF1
import json
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class EvaluationResult:
    score: float
    num_examples: int
    correct: int
    incorrect: int
    errors: int


def comprehensive_metric(example, pred, trace=None) -> float:
    """Multi-dimensional evaluation metric."""
    scores = []

    # 1. Correctness
    if hasattr(example, 'answer') and hasattr(pred, 'answer'):
        correct = example.answer.lower().strip() in pred.answer.lower().strip()
        scores.append(1.0 if correct else 0.0)

    # 2. Completeness (answer not empty or an error message)
    if hasattr(pred, 'answer'):
        complete = len(pred.answer.strip()) > 0 and "error" not in pred.answer.lower()
        scores.append(1.0 if complete else 0.0)

    # 3. Reasoning quality (if available)
    if hasattr(pred, 'reasoning'):
        has_reasoning = len(str(pred.reasoning)) > 20
        scores.append(1.0 if has_reasoning else 0.5)

    return sum(scores) / len(scores) if scores else 0.0


class EvaluationSuite:
    def __init__(self, devset, num_threads=8):
        self.devset = devset
        self.num_threads = num_threads

    def evaluate(self, program, metric=None) -> EvaluationResult:
        """Run full evaluation with detailed results."""
        metric = metric or comprehensive_metric
        evaluator = Evaluate(
            devset=self.devset,
            metric=metric,
            num_threads=self.num_threads,
            display_progress=True
        )
        eval_result = evaluator(program)

        # Extract individual per-example scores from (example, prediction, score) tuples
        scores = [score for example, pred, score in eval_result.results]
        correct = sum(1 for s in scores if s >= 0.5)
        errors = sum(1 for s in scores if s == 0)

        return EvaluationResult(
            score=eval_result.score,
            num_examples=len(self.devset),
            correct=correct,
            incorrect=len(self.devset) - correct - errors,
            errors=errors
        )

    def compare(self, programs: dict, metric=None) -> dict:
        """Compare multiple programs."""
        results = {}
        for name, program in programs.items():
            logger.info(f"Evaluating: {name}")
            results[name] = self.evaluate(program, metric)

        # Rank by score (Evaluate reports scores as percentages, e.g. 85.0)
        ranked = sorted(results.items(), key=lambda x: x[1].score, reverse=True)
        print("\n=== Comparison Results ===")
        for rank, (name, result) in enumerate(ranked, 1):
            print(f"{rank}. {name}: {result.score:.2f}%")

        return results

    def export_report(self, program, output_path: str, metric=None):
        """Export detailed evaluation report."""
        result = self.evaluate(program, metric)
        report = {
            "summary": {
                "score": result.score,
                "total": result.num_examples,
                "correct": result.correct,
                "accuracy": result.correct / result.num_examples
            },
            "config": {
                "num_threads": self.num_threads,
                "num_examples": len(self.devset)
            }
        }
        with open(output_path, 'w') as f:
            json.dump(report, f, indent=2)
        logger.info(f"Report saved to {output_path}")
        return report


# Usage
suite = EvaluationSuite(devset, num_threads=8)

# Single evaluation
result = suite.evaluate(my_program)
print(f"Score: {result.score:.2f}%")

# Compare variants
results = suite.compare({
    "baseline": baseline_program,
    "optimized": optimized_program,
    "finetuned": finetuned_program
})
```
Best Practices
- Hold out test data - Never optimize on the set you report numbers on (a minimal split sketch follows this list)
- Multiple metrics - Combine correctness, quality, efficiency
- Statistical significance - Use enough examples (100+)
- Track over time - Version control evaluation results
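A minimal sketch of the hold-out split mentioned above; the 60/20/20 ratio and the use of `random` are assumptions, not a DSPy requirement:

```python
import random

# examples: list[dspy.Example]; fixed seed keeps the split reproducible
random.Random(0).shuffle(examples)

n = len(examples)
trainset = examples[: int(0.6 * n)]               # used by the optimizer
devset = examples[int(0.6 * n): int(0.8 * n)]     # used to guide and compare variants
testset = examples[int(0.8 * n):]                 # held out; reported once at the end
```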
Limitations
- Metrics are task-specific; no universal measure
- SemanticF1 requires LLM calls (cost)
- Parallel evaluation can hit rate limits
- Edge cases may not be captured
Official Documentation
- DSPy Documentation: https://dspy.ai/
- DSPy GitHub: https://github.com/stanfordnlp/dspy
- Evaluation API: https://dspy.ai/api/evaluation/
- Metrics Guide: https://dspy.ai/learn/evaluation/metrics/
Capabilities
skill · source-omidzamani · skill-dspy-evaluation-suite · topic-agent-skills · topic-claude-code · topic-claude-skills · topic-dspy · topic-llm · topic-prompt-optimization · topic-rag
Install
npx skills add OmidZamani/dspy-skills
Transport: skills-sh
Protocol: skill
Quality
0.49 / 1.00
Deterministic score 0.49 from registry signals: indexed on GitHub topic:agent-skills · 74 GitHub stars · SKILL.md body (7,416 chars)
Provenance
Indexed from: github
Enriched: 2026-05-02 06:55:44Z · deterministic:skill-github:v1 · v1
First seen: 2026-04-18
Last seen: 2026-05-02