{"id":"3327e786-f009-423b-aa32-7e988244893f","shortId":"3XKqan","kind":"skill","title":"phoenix-evals","tagline":"Build and run evaluators for AI/LLM applications using Phoenix.","description":"# Phoenix Evals\n\nBuild evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.\n\n## Quick Reference\n\n| Task | Files |\n| ---- | ----- |\n| Setup | [setup-python](references/setup-python.md), [setup-typescript](references/setup-typescript.md) |\n| Decide what to evaluate | [evaluators-overview](references/evaluators-overview.md) |\n| Choose a judge model | [fundamentals-model-selection](references/fundamentals-model-selection.md) |\n| Use pre-built evaluators | [evaluators-pre-built](references/evaluators-pre-built.md) |\n| Build code evaluator | [evaluators-code-python](references/evaluators-code-python.md), [evaluators-code-typescript](references/evaluators-code-typescript.md) |\n| Build LLM evaluator | [evaluators-llm-python](references/evaluators-llm-python.md), [evaluators-llm-typescript](references/evaluators-llm-typescript.md), [evaluators-custom-templates](references/evaluators-custom-templates.md) |\n| Batch evaluate DataFrame | [evaluate-dataframe-python](references/evaluate-dataframe-python.md) |\n| Run experiment | [experiments-running-python](references/experiments-running-python.md), [experiments-running-typescript](references/experiments-running-typescript.md) |\n| Create dataset | [experiments-datasets-python](references/experiments-datasets-python.md), [experiments-datasets-typescript](references/experiments-datasets-typescript.md) |\n| Generate synthetic data | [experiments-synthetic-python](references/experiments-synthetic-python.md), [experiments-synthetic-typescript](references/experiments-synthetic-typescript.md) |\n| Validate evaluator accuracy | [validation](references/validation.md), [validation-evaluators-python](references/validation-evaluators-python.md), [validation-evaluators-typescript](references/validation-evaluators-typescript.md) |\n| Sample traces for review | [observe-sampling-python](references/observe-sampling-python.md), [observe-sampling-typescript](references/observe-sampling-typescript.md) |\n| Analyze errors | [error-analysis](references/error-analysis.md), [error-analysis-multi-turn](references/error-analysis-multi-turn.md), [axial-coding](references/axial-coding.md) |\n| RAG evals | [evaluators-rag](references/evaluators-rag.md) |\n| Avoid common mistakes | [common-mistakes-python](references/common-mistakes-python.md), [fundamentals-anti-patterns](references/fundamentals-anti-patterns.md) |\n| Production | [production-overview](references/production-overview.md), [production-guardrails](references/production-guardrails.md), [production-continuous](references/production-continuous.md) |\n\n## Workflows\n\n**Starting Fresh:**\n[observe-tracing-setup](references/observe-tracing-setup.md) → [error-analysis](references/error-analysis.md) → [axial-coding](references/axial-coding.md) → [evaluators-overview](references/evaluators-overview.md)\n\n**Building Evaluator:**\n[fundamentals](references/fundamentals.md) → [common-mistakes-python](references/common-mistakes-python.md) → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}\n\n**RAG Systems:**\n[evaluators-rag](references/evaluators-rag.md) → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)\n\n**Production:**\n[production-overview](references/production-overview.md) → [production-guardrails](references/production-guardrails.md) → [production-continuous](references/production-continuous.md)\n\n## Reference Categories\n\n| Prefix | Description |\n| ------ | ----------- |\n| `fundamentals-*` | Types, scores, anti-patterns |\n| `observe-*` | Tracing, sampling |\n| `error-analysis-*` | Finding failures |\n| `axial-coding-*` | Categorizing failures |\n| `evaluators-*` | Code, LLM, RAG evaluators |\n| `experiments-*` | Datasets, running experiments |\n| `validation-*` | Validating evaluator accuracy against human labels |\n| `production-*` | CI/CD, monitoring |\n\n## Key Principles\n\n| Principle | Action |\n| --------- | ------ |\n| Error analysis first | Can't automate what you haven't observed |\n| Custom > generic | Build from your failures |\n| Code first | Deterministic before LLM |\n| Validate judges | >80% TPR/TNR |\n| Binary > Likert | Pass/fail, not 1-5 |","tags":["phoenix","evals","awesome","copilot","github","agent-skills","agents","custom-agents","github-copilot","hacktoberfest","prompt-engineering"],"capabilities":["skill","source-github","skill-phoenix-evals","topic-agent-skills","topic-agents","topic-awesome","topic-custom-agents","topic-github-copilot","topic-hacktoberfest","topic-prompt-engineering"],"categories":["awesome-copilot"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/github/awesome-copilot/phoenix-evals","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add github/awesome-copilot","source_repo":"https://github.com/github/awesome-copilot","install_from":"skills.sh"}},"qualityScore":"0.700","qualityRationale":"deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 33270 github stars · SKILL.md body (4,153 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T18:52:19.437Z","embedding":null,"createdAt":"2026-04-18T20:36:20.721Z","updatedAt":"2026-05-18T18:52:19.437Z","lastSeenAt":"2026-05-18T18:52:19.437Z","tsv":"'-5':364 '1':363 '80':357 'accuraci':146,322 'action':332 'ai/llm':9,18 'analysi':177,181,231,302,334 'analyz':173 'anti':205,295 'anti-pattern':294 'applic':10,19 'autom':338 'avoid':195 'axial':186,234,306 'axial-cod':185,233,305 'batch':99 'binari':359 'build':4,15,68,81,241,346 'built':61,66 'categor':308 'categori':288 'choos':49 'ci/cd':327 'code':20,69,73,78,187,235,251,268,307,311,350 'common':196,199,246 'common-mistakes-python':198,245 'continu':219,285 'creat':119 'custom':96,344 'data':133 'datafram':101,104 'dataset':120,123,128,316 'decid':41 'descript':290 'determinist':352 'error':174,176,180,230,301,333 'error-analysi':175,229,300 'error-analysis-multi-turn':179 'eval':3,14,190 'evalu':7,16,44,46,62,64,70,72,77,83,85,90,95,100,103,145,151,156,192,238,242,250,257,263,267,271,310,314,321 'evaluate-dataframe-python':102 'evaluators-cod':266 'evaluators-code-python':71 'evaluators-code-typescript':76 'evaluators-custom-templ':94 'evaluators-llm':270 'evaluators-llm-python':84 'evaluators-llm-typescript':89 'evaluators-overview':45,237 'evaluators-pre-built':63 'evaluators-rag':191,262 'experi':108,110,115,122,127,135,140,315,318 'experiments-datasets-python':121 'experiments-datasets-typescript':126 'experiments-running-python':109 'experiments-running-typescript':114 'experiments-synthetic-python':134 'experiments-synthetic-typescript':139 'failur':304,309,349 'faith':273 'file':31 'find':303 'first':21,335,351 'fresh':223 'fundament':54,204,243,291 'fundamentals-anti-pattern':203 'fundamentals-model-select':53 'generat':131 'generic':345 'guardrail':215,281 'haven':341 'human':27,324 'judg':51,356 'key':329 'label':325 'likert':360 'llm':22,82,86,91,252,272,312,354 'mistak':197,200,247 'model':52,55 'monitor':328 'multi':182 'nuanc':24 'observ':164,169,225,297,343 'observe-sampling-python':163 'observe-sampling-typescript':168 'observe-tracing-setup':224 'overview':47,211,239,277 'pass/fail':361 'pattern':206,296 'phoenix':2,12,13 'phoenix-ev':1 'pre':60,65 'pre-built':59 'prefix':289 'principl':330,331 'product':208,210,214,218,274,276,280,284,326 'production-continu':217,283 'production-guardrail':213,279 'production-overview':209,275 'python':35,74,87,105,112,124,137,152,166,201,248,253,258 'quick':28 'rag':189,193,260,264,313 'refer':29,287 'references/axial-coding.md':188,236 'references/common-mistakes-python.md':202,249 'references/error-analysis-multi-turn.md':184 'references/error-analysis.md':178,232 'references/evaluate-dataframe-python.md':106 'references/evaluators-code-python.md':75 'references/evaluators-code-typescript.md':80 'references/evaluators-custom-templates.md':98 'references/evaluators-llm-python.md':88 'references/evaluators-llm-typescript.md':93 'references/evaluators-overview.md':48,240 'references/evaluators-pre-built.md':67 'references/evaluators-rag.md':194,265 'references/experiments-datasets-python.md':125 'references/experiments-datasets-typescript.md':130 'references/experiments-running-python.md':113 'references/experiments-running-typescript.md':118 'references/experiments-synthetic-python.md':138 'references/experiments-synthetic-typescript.md':143 'references/fundamentals-anti-patterns.md':207 'references/fundamentals-model-selection.md':57 'references/fundamentals.md':244 'references/observe-sampling-python.md':167 'references/observe-sampling-typescript.md':172 'references/observe-tracing-setup.md':228 'references/production-continuous.md':220,286 'references/production-guardrails.md':216,282 'references/production-overview.md':212,278 'references/setup-python.md':36 'references/setup-typescript.md':40 'references/validation-evaluators-python.md':153 'references/validation-evaluators-typescript.md':158 'references/validation.md':148 'retriev':269 'review':162 'run':6,107,111,116,317 'sampl':159,165,170,299 'score':293 'select':56 'setup':32,34,38,227 'setup-python':33 'setup-typescript':37 'skill' 'skill-phoenix-evals' 'source-github' 'start':222 'synthet':132,136,141 'system':261 'task':30 'templat':97 'topic-agent-skills' 'topic-agents' 'topic-awesome' 'topic-custom-agents' 'topic-github-copilot' 'topic-hacktoberfest' 'topic-prompt-engineering' 'tpr/tnr':358 'trace':160,226,298 'turn':183 'type':292 'typescript':39,79,92,117,129,142,157,171,254,259 'use':11,58 'valid':25,144,147,150,155,256,319,320,355 'validation-evalu':255 'validation-evaluators-python':149 'validation-evaluators-typescript':154 'workflow':221","prices":[{"id":"b081a7f9-75a6-4ebd-8a37-f00e26fd377c","listingId":"3327e786-f009-423b-aa32-7e988244893f","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"github","category":"awesome-copilot","install_from":"skills.sh"},"createdAt":"2026-04-18T20:36:20.721Z"}],"sources":[{"listingId":"3327e786-f009-423b-aa32-7e988244893f","source":"github","sourceId":"github/awesome-copilot/phoenix-evals","sourceUrl":"https://github.com/github/awesome-copilot/tree/main/skills/phoenix-evals","isPrimary":false,"firstSeenAt":"2026-04-18T21:50:27.141Z","lastSeenAt":"2026-05-18T18:52:19.437Z"},{"listingId":"3327e786-f009-423b-aa32-7e988244893f","source":"skills_sh","sourceId":"github/awesome-copilot/phoenix-evals","sourceUrl":"https://skills.sh/github/awesome-copilot/phoenix-evals","isPrimary":true,"firstSeenAt":"2026-04-18T20:36:20.721Z","lastSeenAt":"2026-05-07T22:40:43.834Z"}],"details":{"listingId":"3327e786-f009-423b-aa32-7e988244893f","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"github","slug":"phoenix-evals","github":{"repo":"github/awesome-copilot","stars":33270,"topics":["agent-skills","agents","ai","awesome","custom-agents","github-copilot","hacktoberfest","prompt-engineering"],"license":"mit","html_url":"https://github.com/github/awesome-copilot","pushed_at":"2026-05-18T01:26:59Z","description":"Community-contributed instructions, agents, skills, and configurations to help you make the most of GitHub Copilot.","skill_md_sha":"bd82da6cac9907c55d382b9c7610007b26d08dd9","skill_md_path":"skills/phoenix-evals/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/github/awesome-copilot/tree/main/skills/phoenix-evals"},"layout":"multi","source":"github","category":"awesome-copilot","frontmatter":{"name":"phoenix-evals","license":"Apache-2.0","description":"Build and run evaluators for AI/LLM applications using Phoenix.","compatibility":"Requires Phoenix server. Python skills need phoenix and openai packages; TypeScript skills need @arizeai/phoenix-client."},"skills_sh_url":"https://skills.sh/github/awesome-copilot/phoenix-evals"},"updatedAt":"2026-05-18T18:52:19.437Z"}}