{"id":"3327e786-f009-423b-aa32-7e988244893f","shortId":"3XKqan","kind":"skill","title":"Phoenix Evals","tagline":"Awesome Copilot skill by GitHub","description":"# Phoenix Evals\n\nBuild evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.\n\n## Quick Reference\n\n| Task | Files |\n| ---- | ----- |\n| Setup | [setup-python](references/setup-python.md), [setup-typescript](references/setup-typescript.md) |\n| Decide what to evaluate | [evaluators-overview](references/evaluators-overview.md) |\n| Choose a judge model | [fundamentals-model-selection](references/fundamentals-model-selection.md) |\n| Use pre-built evaluators | [evaluators-pre-built](references/evaluators-pre-built.md) |\n| Build code evaluator | [evaluators-code-python](references/evaluators-code-python.md), [evaluators-code-typescript](references/evaluators-code-typescript.md) |\n| Build LLM evaluator | [evaluators-llm-python](references/evaluators-llm-python.md), [evaluators-llm-typescript](references/evaluators-llm-typescript.md), [evaluators-custom-templates](references/evaluators-custom-templates.md) |\n| Batch evaluate DataFrame | [evaluate-dataframe-python](references/evaluate-dataframe-python.md) |\n| Run experiment | [experiments-running-python](references/experiments-running-python.md), [experiments-running-typescript](references/experiments-running-typescript.md) |\n| Create dataset | [experiments-datasets-python](references/experiments-datasets-python.md), [experiments-datasets-typescript](references/experiments-datasets-typescript.md) |\n| Generate synthetic data | [experiments-synthetic-python](references/experiments-synthetic-python.md), [experiments-synthetic-typescript](references/experiments-synthetic-typescript.md) |\n| Validate evaluator accuracy | [validation](references/validation.md), [validation-evaluators-python](references/validation-evaluators-python.md), [validation-evaluators-typescript](references/validation-evaluators-typescript.md) |\n| 
Sample traces for review | [observe-sampling-python](references/observe-sampling-python.md), [observe-sampling-typescript](references/observe-sampling-typescript.md) |\n| Analyze errors | [error-analysis](references/error-analysis.md), [error-analysis-multi-turn](references/error-analysis-multi-turn.md), [axial-coding](references/axial-coding.md) |\n| RAG evals | [evaluators-rag](references/evaluators-rag.md) |\n| Avoid common mistakes | [common-mistakes-python](references/common-mistakes-python.md), [fundamentals-anti-patterns](references/fundamentals-anti-patterns.md) |\n| Production | [production-overview](references/production-overview.md), [production-guardrails](references/production-guardrails.md), [production-continuous](references/production-continuous.md) |\n\n## Workflows\n\n**Starting Fresh:**\n[observe-tracing-setup](references/observe-tracing-setup.md) → [error-analysis](references/error-analysis.md) → [axial-coding](references/axial-coding.md) → [evaluators-overview](references/evaluators-overview.md)\n\n**Building Evaluator:**\n[fundamentals](references/fundamentals.md) → [common-mistakes-python](references/common-mistakes-python.md) → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}\n\n**RAG Systems:**\n[evaluators-rag](references/evaluators-rag.md) → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)\n\n**Production:**\n[production-overview](references/production-overview.md) → [production-guardrails](references/production-guardrails.md) → [production-continuous](references/production-continuous.md)\n\n## Reference Categories\n\n| Prefix | Description |\n| ------ | ----------- |\n| `fundamentals-*` | Types, scores, anti-patterns |\n| `observe-*` | Tracing, sampling |\n| `error-analysis-*` | Finding failures |\n| `axial-coding-*` | Categorizing failures |\n| `evaluators-*` | Code, LLM, RAG evaluators |\n| `experiments-*` | Datasets, running experiments |\n| `validation-*` | Validating evaluator 
accuracy against human labels |\n| `production-*` | CI/CD, monitoring |\n\n## Key Principles\n\n| Principle | Action |\n| --------- | ------ |\n| Error analysis first | Can't automate what you haven't observed |\n| Custom > generic | Build from your failures |\n| Code first | Deterministic before LLM |\n| Validate judges | >80% TPR/TNR |\n| Binary > Likert | Pass/fail, not 1-5 |","tags":["phoenix","evals","awesome","copilot","github"],"capabilities":["skill","source-github","category-awesome-copilot"],"categories":["awesome-copilot"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/github/awesome-copilot/phoenix-evals","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"install_from":"skills.sh"}},"qualityScore":"0.300","qualityRationale":"deterministic score 0.30 from registry signals: · indexed on skills.sh · published under github/awesome-copilot","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill:v1","enrichmentVersion":1,"enrichedAt":"2026-04-22T02:40:28.851Z","embedding":null,"createdAt":"2026-04-18T20:36:20.721Z","updatedAt":"2026-04-22T02:40:28.851Z","lastSeenAt":"2026-04-22T02:40:28.851Z","tsv":"'-5':359 '1':358 '80':352 'accuraci':141,317 'action':327 'ai/llm':13 'analysi':172,176,226,297,329 'analyz':168 'anti':200,290 'anti-pattern':289 'applic':14 'autom':333 'avoid':190 'awesom':3 'axial':181,229,301 'axial-cod':180,228,300 'batch':94 'binari':354 'build':10,63,76,236,341 'built':56,61 'categor':303 'categori':283 'category-awesome-copilot' 'choos':44 'ci/cd':322 'code':15,64,68,73,182,230,246,263,302,306,345 'common':191,194,241 'common-mistakes-python':193,240 'continu':214,280 'copilot':4 'creat':114 'custom':91,339 'data':128 'datafram':96,99 'dataset':115,118,123,311 'decid':36 'descript':285 
'determinist':347 'error':169,171,175,225,296,328 'error-analysi':170,224,295 'error-analysis-multi-turn':174 'eval':2,9,185 'evalu':11,39,41,57,59,65,67,72,78,80,85,90,95,98,140,146,151,187,233,237,245,252,258,262,266,305,309,316 'evaluate-dataframe-python':97 'evaluators-cod':261 'evaluators-code-python':66 'evaluators-code-typescript':71 'evaluators-custom-templ':89 'evaluators-llm':265 'evaluators-llm-python':79 'evaluators-llm-typescript':84 'evaluators-overview':40,232 'evaluators-pre-built':58 'evaluators-rag':186,257 'experi':103,105,110,117,122,130,135,310,313 'experiments-datasets-python':116 'experiments-datasets-typescript':121 'experiments-running-python':104 'experiments-running-typescript':109 'experiments-synthetic-python':129 'experiments-synthetic-typescript':134 'failur':299,304,344 'faith':268 'file':26 'find':298 'first':16,330,346 'fresh':218 'fundament':49,199,238,286 'fundamentals-anti-pattern':198 'fundamentals-model-select':48 'generat':126 'generic':340 'github':7 'guardrail':210,276 'haven':336 'human':22,319 'judg':46,351 'key':324 'label':320 'likert':355 'llm':17,77,81,86,247,267,307,349 'mistak':192,195,242 'model':47,50 'monitor':323 'multi':177 'nuanc':19 'observ':159,164,220,292,338 'observe-sampling-python':158 'observe-sampling-typescript':163 'observe-tracing-setup':219 'overview':42,206,234,272 'pass/fail':356 'pattern':201,291 'phoenix':1,8 'pre':55,60 'pre-built':54 'prefix':284 'principl':325,326 'product':203,205,209,213,269,271,275,279,321 'production-continu':212,278 'production-guardrail':208,274 'production-overview':204,270 'python':30,69,82,100,107,119,132,147,161,196,243,248,253 'quick':23 'rag':184,188,255,259,308 'refer':24,282 'references/axial-coding.md':183,231 'references/common-mistakes-python.md':197,244 'references/error-analysis-multi-turn.md':179 'references/error-analysis.md':173,227 'references/evaluate-dataframe-python.md':101 'references/evaluators-code-python.md':70 
'references/evaluators-code-typescript.md':75 'references/evaluators-custom-templates.md':93 'references/evaluators-llm-python.md':83 'references/evaluators-llm-typescript.md':88 'references/evaluators-overview.md':43,235 'references/evaluators-pre-built.md':62 'references/evaluators-rag.md':189,260 'references/experiments-datasets-python.md':120 'references/experiments-datasets-typescript.md':125 'references/experiments-running-python.md':108 'references/experiments-running-typescript.md':113 'references/experiments-synthetic-python.md':133 'references/experiments-synthetic-typescript.md':138 'references/fundamentals-anti-patterns.md':202 'references/fundamentals-model-selection.md':52 'references/fundamentals.md':239 'references/observe-sampling-python.md':162 'references/observe-sampling-typescript.md':167 'references/observe-tracing-setup.md':223 'references/production-continuous.md':215,281 'references/production-guardrails.md':211,277 'references/production-overview.md':207,273 'references/setup-python.md':31 'references/setup-typescript.md':35 'references/validation-evaluators-python.md':148 'references/validation-evaluators-typescript.md':153 'references/validation.md':143 'retriev':264 'review':157 'run':102,106,111,312 'sampl':154,160,165,294 'score':288 'select':51 'setup':27,29,33,222 'setup-python':28 'setup-typescript':32 'skill':5 'source-github' 'start':217 'synthet':127,131,136 'system':256 'task':25 'templat':92 'tpr/tnr':353 'trace':155,221,293 'turn':178 'type':287 'typescript':34,74,87,112,124,137,152,166,249,254 'use':53 'valid':20,139,142,145,150,251,314,315,350 'validation-evalu':250 'validation-evaluators-python':144 'validation-evaluators-typescript':149 
'workflow':216","prices":[{"id":"b081a7f9-75a6-4ebd-8a37-f00e26fd377c","listingId":"3327e786-f009-423b-aa32-7e988244893f","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"github","category":"awesome-copilot","install_from":"skills.sh"},"createdAt":"2026-04-18T20:36:20.721Z"}],"sources":[{"listingId":"3327e786-f009-423b-aa32-7e988244893f","source":"github","sourceId":"github/awesome-copilot/phoenix-evals","sourceUrl":"https://github.com/github/awesome-copilot/tree/main/skills/phoenix-evals","isPrimary":false,"firstSeenAt":"2026-04-18T21:50:27.141Z","lastSeenAt":"2026-04-22T00:52:13.888Z"},{"listingId":"3327e786-f009-423b-aa32-7e988244893f","source":"skills_sh","sourceId":"github/awesome-copilot/phoenix-evals","sourceUrl":"https://skills.sh/github/awesome-copilot/phoenix-evals","isPrimary":true,"firstSeenAt":"2026-04-18T20:36:20.721Z","lastSeenAt":"2026-04-22T02:40:28.851Z"}],"details":{"listingId":"3327e786-f009-423b-aa32-7e988244893f","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"github","slug":"phoenix-evals","source":"skills_sh","category":"awesome-copilot","skills_sh_url":"https://skills.sh/github/awesome-copilot/phoenix-evals"},"updatedAt":"2026-04-22T02:40:28.851Z"}}