Skill · quality 0.70

phoenix-evals

Build and run evaluators for AI/LLM applications using Phoenix.

Price: free
Protocol: skill
Verified: no

What it does

Phoenix Evals

Build evaluators for AI/LLM applications: use deterministic code checks first, add LLM judges where nuance is needed, and validate evaluators against human labels.
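A minimal sketch of the "code first, LLM for nuance" idea, written in plain Python rather than against the Phoenix API. A deterministic check decides the clear-cut cases, and only the cases it cannot decide go to an LLM judge. The names `EvalResult`, `code_check_json_answer`, `evaluate`, and the `llm_judge` callable are hypothetical and exist only for illustration; the JSON shape being checked is also an assumption.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalResult:
    label: str        # "pass" or "fail" (binary, not a 1-5 scale)
    explanation: str  # short reason, useful during error analysis
    source: str       # "code" or "llm": which evaluator decided


def code_check_json_answer(output: str) -> Optional[EvalResult]:
    """Deterministic check: output must be a JSON object with a non-empty 'answer' string.

    Returns a result when code alone can decide, or None when the case
    needs a more nuanced judgment.
    """
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return EvalResult("fail", "output is not valid JSON", "code")
    if not isinstance(parsed, dict) or not isinstance(parsed.get("answer"), str):
        return EvalResult("fail", "missing string field 'answer'", "code")
    if not parsed["answer"].strip():
        return EvalResult("fail", "'answer' is empty", "code")
    return None  # structurally fine; answer quality is still undecided


def evaluate(output: str, llm_judge: Callable[[str], bool]) -> EvalResult:
    """Run the cheap deterministic check first; call the LLM judge only if needed."""
    result = code_check_json_answer(output)
    if result is not None:
        return result
    passed = llm_judge(output)  # hypothetical judge: returns True for pass
    return EvalResult("pass" if passed else "fail", "decided by LLM judge", "llm")
```

In this sketch the judge is an injected callable, so it can be replaced by a stub in tests and by a real model call later without changing the evaluator logic.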

Quick Reference

Task: Files
Setup: setup-python, setup-typescript
Decide what to evaluate: evaluators-overview
Choose a judge model: fundamentals-model-selection
Use pre-built evaluators: evaluators-pre-built
Build code evaluator: evaluators-code-python, evaluators-code-typescript
Build LLM evaluator: evaluators-llm-python, evaluators-llm-typescript, evaluators-custom-templates
Batch evaluate DataFrame: evaluate-dataframe-python
Run experiment: experiments-running-python, experiments-running-typescript
Create dataset: experiments-datasets-python, experiments-datasets-typescript
Generate synthetic data: experiments-synthetic-python, experiments-synthetic-typescript
Validate evaluator accuracy: validation, validation-evaluators-python, validation-evaluators-typescript
Sample traces for review: observe-sampling-python, observe-sampling-typescript
Analyze errors: error-analysis, error-analysis-multi-turn, axial-coding
RAG evals: evaluators-rag
Avoid common mistakes: common-mistakes-python, fundamentals-anti-patterns
Production: production-overview, production-guardrails, production-continuous

Workflows

Starting Fresh: observe-tracing-setup → error-analysis → axial-coding → evaluators-overview

Building Evaluator: fundamentals → common-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}

RAG Systems: evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)

Production: production-overview → production-guardrails → production-continuous
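The RAG Systems workflow above puts a code evaluator on the retrieval step before an LLM evaluator checks faithfulness. One possible deterministic retrieval check is recall at k, sketched below in plain Python. This is an illustrative metric choice, not something the skill prescribes; `recall_at_k`, `retrieval_passes`, the document IDs, and the default thresholds are all hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def retrieval_passes(
    retrieved_ids: Sequence[str],
    relevant_ids: Set[str],
    k: int = 5,
    min_recall: float = 1.0,  # assumed threshold; tune it from your own error analysis
) -> bool:
    """Binary pass/fail wrapper around recall@k."""
    return recall_at_k(retrieved_ids, relevant_ids, k) >= min_recall


# Example: one of the two relevant documents was retrieved in the top 3.
score = recall_at_k(["d3", "d1", "d9"], {"d1", "d2"}, k=3)
```

A check like this needs ground-truth relevant IDs per query, so it fits datasets where those labels exist; faithfulness of the generated answer is a separate question that the workflow leaves to an LLM evaluator.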

Reference Categories

Prefix: Description
fundamentals-*: Types, scores, anti-patterns
observe-*: Tracing, sampling
error-analysis-*: Finding failures
axial-coding-*: Categorizing failures
evaluators-*: Code, LLM, RAG evaluators
experiments-*: Datasets, running experiments
validation-*: Validating evaluator accuracy against human labels
production-*: CI/CD, monitoring

Key Principles

Principle: Action
Error analysis first: Can't automate what you haven't observed
Custom > generic: Build from your failures
Code first: Deterministic before LLM
Validate judges: >80% TPR/TNR
Binary > Likert: Pass/fail, not 1-5
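The "validate judges" principle can be made concrete by comparing a judge's binary labels with human labels and computing the true positive rate (TPR) and true negative rate (TNR). The sketch below is plain Python and does not use the Phoenix API; `tpr_tnr` and `judge_is_validated` are hypothetical names, and treating `True` as "pass" is an assumption about how the labels are encoded.

```python
from typing import Sequence, Tuple


def tpr_tnr(human: Sequence[bool], judge: Sequence[bool]) -> Tuple[float, float]:
    """Return (TPR, TNR) of judge labels measured against human labels.

    True means "pass" and False means "fail" in both sequences.
    """
    if len(human) != len(judge):
        raise ValueError("human and judge must have the same length")
    positives = sum(human)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one human pass and one human fail")
    true_pos = sum(h and j for h, j in zip(human, judge))
    true_neg = sum((not h) and (not j) for h, j in zip(human, judge))
    return true_pos / positives, true_neg / negatives


def judge_is_validated(
    human: Sequence[bool], judge: Sequence[bool], threshold: float = 0.80
) -> bool:
    """Apply the >80% rule to both rates, not to overall accuracy."""
    tpr, tnr = tpr_tnr(human, judge)
    return tpr > threshold and tnr > threshold


# Example: the judge misses one of five human-labeled passes.
human_labels = [True] * 5 + [False] * 5
judge_labels = [True, True, True, True, False] + [False] * 5
rates = tpr_tnr(human_labels, judge_labels)
```

Checking both rates separately matters when passes and fails are imbalanced: a judge that always answers "pass" can score high accuracy while its TNR is zero.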

Capabilities

skill, source-github, skill-phoenix-evals, topic-agent-skills, topic-agents, topic-awesome, topic-custom-agents, topic-github-copilot, topic-hacktoberfest, topic-prompt-engineering

Install

Install: npx skills add github/awesome-copilot
Transport: skills-sh
Protocol: skill

Quality

0.70 / 1.00

Deterministic score 0.70 from registry signals: indexed on GitHub topic:agent-skills · 30743 GitHub stars · SKILL.md body (4,153 chars)

Provenance

Indexed from: skills_sh
Also seen in: github
Enriched: 2026-04-22 00:52:13Z · deterministic:skill-github:v1 · v1
First seen: 2026-04-18
Last seen: 2026-04-22

Agent access