phoenix-evals
Build and run evaluators for AI/LLM applications using Phoenix.
What it does
Phoenix Evals
Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
Quick Reference
Workflows
Starting Fresh: observe-tracing-setup → error-analysis → axial-coding → evaluators-overview
Building Evaluator: fundamentals → common-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}
RAG Systems: evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)
Production: production-overview → production-guardrails → production-continuous
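The "Building Evaluator" workflow moves from deterministic code checks to LLM judges. A minimal sketch of the code-first step, in plain Python: a binary (pass/fail) evaluator that needs no LLM. The function name, required keys, and return shape are illustrative assumptions, not the phoenix-evals API.

```python
# Hypothetical code-first evaluator sketch (NOT the phoenix-evals API):
# run cheap, deterministic checks before reaching for an LLM judge.
import json


def evaluate_json_output(output: str) -> dict:
    """Return a binary pass/fail score with an explanation."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return {"score": 0, "label": "fail", "explanation": f"invalid JSON: {exc}"}
    # "answer" and "sources" are example required keys for this sketch.
    missing = [k for k in ("answer", "sources") if k not in parsed]
    if missing:
        return {"score": 0, "label": "fail", "explanation": f"missing keys: {missing}"}
    return {"score": 1, "label": "pass", "explanation": "valid JSON with required keys"}
```

Only outputs that pass deterministic checks like this need to reach an LLM judge for nuance (faithfulness, tone, relevance).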
Reference Categories
| Prefix | Description |
|---|---|
| fundamentals-* | Types, scores, anti-patterns |
| observe-* | Tracing, sampling |
| error-analysis-* | Finding failures |
| axial-coding-* | Categorizing failures |
| evaluators-* | Code, LLM, RAG evaluators |
| experiments-* | Datasets, running experiments |
| validation-* | Validating evaluator accuracy against human labels |
| production-* | CI/CD, monitoring |
Key Principles
| Principle | Action |
|---|---|
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
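"Validate judges: >80% TPR/TNR" means checking an LLM judge's binary verdicts against human labels: TPR here is the fraction of human-labeled failures the judge also flags, TNR the fraction of human-labeled passes it also passes. A minimal sketch in plain Python (the function and the `"pass"`/`"fail"` label scheme are assumptions for illustration):

```python
# Hypothetical judge-validation sketch: "fail" is treated as the
# positive class, since the judge's job is to detect failures.
def tpr_tnr(judge_labels: list[str], human_labels: list[str]) -> tuple[float, float]:
    """Compare judge verdicts to human labels ("pass"/"fail")."""
    tp = sum(1 for j, h in zip(judge_labels, human_labels) if h == "fail" and j == "fail")
    tn = sum(1 for j, h in zip(judge_labels, human_labels) if h == "pass" and j == "pass")
    fails = sum(1 for h in human_labels if h == "fail")
    passes = sum(1 for h in human_labels if h == "pass")
    return tp / fails, tn / passes
```

A judge is trustworthy only when both rates clear the 0.80 bar; a high TPR with a low TNR means it cries wolf on good outputs.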
Capabilities
skill · source-github · skill-phoenix-evals · topic-agent-skills · topic-agents · topic-awesome · topic-custom-agents · topic-github-copilot · topic-hacktoberfest · topic-prompt-engineering
Install
Install: `npx skills add github/awesome-copilot`
Transport: skills-sh
Protocol: skill
Quality
0.70 / 1.00
Deterministic score 0.70 from registry signals: indexed on GitHub topic:agent-skills · 30,743 GitHub stars · SKILL.md body (4,153 chars)
Provenance
Indexed from: skills_sh
Also seen in: github
Enriched: 2026-04-22 00:52:13Z · deterministic:skill-github:v1 · v1
First seen: 2026-04-18
Last seen: 2026-04-22