{"id":"089019a4-491a-408e-997e-09d69e5b895f","shortId":"NwbRv4","kind":"skill","title":"Arize Evaluator","tagline":"Awesome Copilot skill by Github","description":"# Arize Evaluator Skill\n\nThis skill covers designing, creating, and running **LLM-as-judge evaluators** on Arize. An evaluator defines the judge; a **task** is how you run it against real data.\n\n---\n\n## Prerequisites\n\nProceed directly with the task — run the `ax` command you need. Do NOT check versions, env vars, or profiles upfront.\n\nIf an `ax` command fails, troubleshoot based on the error:\n- `command not found` or version error → see references/ax-setup.md\n- `401 Unauthorized` / missing API key → run `ax profiles show` to inspect the current profile. If the profile is missing or the API key is wrong: check `.env` for `ARIZE_API_KEY` and use it to create/update the profile via references/ax-profiles.md. If `.env` has no key either, ask the user for their Arize API key (https://app.arize.com/admin > API Keys)\n- Space ID unknown → check `.env` for `ARIZE_SPACE_ID`, or run `ax spaces list -o json`, or ask the user\n- LLM provider call fails (missing OPENAI_API_KEY / ANTHROPIC_API_KEY) → check `.env`, load if present, otherwise ask the user\n\n---\n\n## Concepts\n\n### What is an Evaluator?\n\nAn **evaluator** is an LLM-as-judge definition. It contains:\n\n| Field | Description |\n|-------|-------------|\n| **Template** | The judge prompt. Uses `{variable}` placeholders (e.g. `{input}`, `{output}`, `{context}`) that get filled in at run time via a task's column mappings. |\n| **Classification choices** | The set of allowed output labels (e.g. `factual` / `hallucinated`). Binary is the default and most common. Each choice can optionally carry a numeric score. |\n| **AI Integration** | Stored LLM provider credentials (OpenAI, Anthropic, Bedrock, etc.) the evaluator uses to call the judge model. |\n| **Model** | The specific judge model (e.g. `gpt-4o`, `claude-sonnet-4-5`). 
|\n| **Invocation params** | Optional JSON of model settings like `{\"temperature\": 0}`. Low temperature is recommended for reproducibility. |\n| **Optimization direction** | Whether higher scores are better (`maximize`) or worse (`minimize`). Sets how the UI renders trends. |\n| **Data granularity** | Whether the evaluator runs at the **span**, **trace**, or **session** level. Most evaluators run at the span level. |\n\nEvaluators are **versioned** — every prompt or model change creates a new immutable version. The most recent version is active.\n\n### What is a Task?\n\nA **task** is how you run one or more evaluators against real data. Tasks are attached to a **project** (live traces/spans) or a **dataset** (experiment runs). A task contains:\n\n| Field | Description |\n|-------|-------------|\n| **Evaluators** | List of evaluators to run. You can run multiple in one task. |\n| **Column mappings** | Maps each evaluator's template variables to actual field paths on spans or experiment runs (e.g. `\"input\" → \"attributes.input.value\"`). This is what makes evaluators portable across projects and experiments. |\n| **Query filter** | SQL-style expression to select which spans/runs to evaluate (e.g. `\"span_kind = 'LLM'\"`). Optional but important for precision. |\n| **Continuous** | For project tasks: whether to automatically score new spans as they arrive. |\n| **Sampling rate** | For continuous project tasks: fraction of new spans to evaluate (0–1). |\n\n---\n\n## Data Granularity\n\nThe `--data-granularity` flag controls what unit of data the evaluator scores. 
It defaults to `span` and only applies to **project tasks** (not dataset/experiment tasks — those evaluate experiment runs directly).\n\n| Level | What it evaluates | Use for | Result column prefix |\n|-------|-------------------|---------|---------------------|\n| `span` (default) | Individual spans | Q&A correctness, hallucination, relevance | `eval.{name}.label` / `.score` / `.explanation` |\n| `trace` | All spans in a trace, grouped by `context.trace_id` | Agent trajectory, task correctness — anything that needs the full call chain | `trace_eval.{name}.label` / `.score` / `.explanation` |\n| `session` | All traces in a session, grouped by `attributes.session.id` and ordered by start time | Multi-turn coherence, overall tone, conversation quality | `session_eval.{name}.label` / `.score` / `.explanation` |\n\n### How trace and session aggregation works\n\nFor **trace** granularity, spans sharing the same `context.trace_id` are grouped together. Column values used by the evaluator template are comma-joined into a single string (each value truncated to 100K characters) before being passed to the judge model.\n\nFor **session** granularity, the same trace-level grouping happens first, then traces are ordered by `start_time` and grouped by `attributes.session.id`. Session-level values are capped at 100K characters total.\n\n### The `{conversation}` template variable\n\nAt session granularity, `{conversation}` is a special template variable that renders as a JSON array of `{input, output}` turns across all traces in the session, built from `attributes.input.value` / `attributes.llm.input_messages` (input side) and `attributes.output.value` / `attributes.llm.output_messages` (output side).\n\nAt span or trace granularity, `{conversation}` is treated as a regular template variable and resolved via column mappings like any other.\n\n### Multi-evaluator tasks\n\nA task can contain evaluators at different granularities. 
At runtime the system uses the **highest** granularity (session > trace > span) for data fetching and automatically **splits into one child run per evaluator**. Per-evaluator `query_filter` in the task's evaluators JSON further narrows which spans are included (e.g., only tool-call spans within a session).\n\n---\n\n## Basic CRUD\n\n### AI Integrations\n\nAI integrations store the LLM provider credentials the evaluator uses. For full CRUD — listing, creating for all providers (OpenAI, Anthropic, Azure, Bedrock, Vertex, Gemini, NVIDIA NIM, custom), updating, and deleting — use the **arize-ai-provider-integration** skill.\n\nQuick reference for the common case (OpenAI):\n\n```bash\n# Check for an existing integration first\nax ai-integrations list --space-id SPACE_ID\n\n# Create if none exists\nax ai-integrations create \\\n  --name \"My OpenAI Integration\" \\\n  --provider openAI \\\n  --api-key $OPENAI_API_KEY\n```\n\nCopy the returned integration ID — it is required for `ax evaluators create --ai-integration-id`.\n\n### Evaluators\n\n```bash\n# List / Get\nax evaluators list --space-id SPACE_ID\nax evaluators get EVALUATOR_ID\nax evaluators list-versions EVALUATOR_ID\nax evaluators get-version VERSION_ID\n\n# Create (creates the evaluator and its first version)\nax evaluators create \\\n  --name \"Answer Correctness\" \\\n  --space-id SPACE_ID \\\n  --description \"Judges if the model answer is correct\" \\\n  --template-name \"correctness\" \\\n  --commit-message \"Initial version\" \\\n  --ai-integration-id INT_ID \\\n  --model-name \"gpt-4o\" \\\n  --include-explanations \\\n  --use-function-calling \\\n  --classification-choices '{\"correct\": 1, \"incorrect\": 0}' \\\n  --template 'You are an evaluator. 
Given the user question and the model response, decide if the response correctly answers the question.\n\nUser question: {input}\n\nModel response: {output}\n\nRespond with exactly one of these labels: correct, incorrect'\n\n# Create a new version (for prompt or model changes — versions are immutable)\nax evaluators create-version EVALUATOR_ID \\\n  --commit-message \"Added context grounding\" \\\n  --template-name \"correctness\" \\\n  --ai-integration-id INT_ID \\\n  --model-name \"gpt-4o\" \\\n  --include-explanations \\\n  --classification-choices '{\"correct\": 1, \"incorrect\": 0}' \\\n  --template 'Updated prompt...\n\n{input} / {output} / {context}'\n\n# Update metadata only (name, description — not prompt)\nax evaluators update EVALUATOR_ID \\\n  --name \"New Name\" \\\n  --description \"Updated description\"\n\n# Delete (permanent — removes all versions)\nax evaluators delete EVALUATOR_ID\n```\n\n**Key flags for `create`:**\n\n| Flag | Required | Description |\n|------|----------|-------------|\n| `--name` | yes | Evaluator name (unique within space) |\n| `--space-id` | yes | Space to create in |\n| `--template-name` | yes | Eval column name — alphanumeric, spaces, hyphens, underscores |\n| `--commit-message` | yes | Description of this version |\n| `--ai-integration-id` | yes | AI integration ID (from above) |\n| `--model-name` | yes | Judge model (e.g. `gpt-4o`) |\n| `--template` | yes | Prompt with `{variable}` placeholders (single-quoted in bash) |\n| `--classification-choices` | yes | JSON object mapping choice labels to numeric scores e.g. `'{\"correct\": 1, \"incorrect\": 0}'` |\n| `--description` | no | Human-readable description |\n| `--include-explanations` | no | Include reasoning alongside the label |\n| `--use-function-calling` | no | Prefer structured function-call output |\n| `--invocation-params` | no | JSON of model params e.g. `'{\"temperature\": 0}'` |\n| `--data-granularity` | no | `span` (default), `trace`, or `session`. 
Only relevant for project tasks, not dataset/experiment tasks. See Data Granularity section. |\n| `--provider-params` | no | JSON object of provider-specific parameters |\n\n### Tasks\n\n```bash\n# List / Get\nax tasks list --space-id SPACE_ID\nax tasks list --project-id PROJ_ID\nax tasks list --dataset-id DATASET_ID\nax tasks get TASK_ID\n\n# Create (project — continuous)\nax tasks create \\\n  --name \"Correctness Monitor\" \\\n  --task-type template_evaluation \\\n  --project-id PROJ_ID \\\n  --evaluators '[{\"evaluator_id\": \"EVAL_ID\", \"column_mappings\": {\"input\": \"attributes.input.value\", \"output\": \"attributes.output.value\"}}]' \\\n  --is-continuous \\\n  --sampling-rate 0.1\n\n# Create (project — one-time / backfill)\nax tasks create \\\n  --name \"Correctness Backfill\" \\\n  --task-type template_evaluation \\\n  --project-id PROJ_ID \\\n  --evaluators '[{\"evaluator_id\": \"EVAL_ID\", \"column_mappings\": {\"input\": \"attributes.input.value\", \"output\": \"attributes.output.value\"}}]' \\\n  --no-continuous\n\n# Create (experiment / dataset)\nax tasks create \\\n  --name \"Experiment Scoring\" \\\n  --task-type template_evaluation \\\n  --dataset-id DATASET_ID \\\n  --experiment-ids \"EXP_ID_1,EXP_ID_2\" \\\n  --evaluators '[{\"evaluator_id\": \"EVAL_ID\", \"column_mappings\": {\"output\": \"output\"}}]' \\\n  --no-continuous\n\n# Trigger a run (project task — use data window)\nax tasks trigger-run TASK_ID \\\n  --data-start-time \"2026-03-20T00:00:00\" \\\n  --data-end-time \"2026-03-21T23:59:59\" \\\n  --wait\n\n# Trigger a run (experiment task — use experiment IDs)\nax tasks trigger-run TASK_ID \\\n  --experiment-ids \"EXP_ID_1\" \\\n  --wait\n\n# Monitor\nax tasks list-runs TASK_ID\nax tasks get-run RUN_ID\nax tasks wait-for-run RUN_ID --timeout 300\nax tasks cancel-run RUN_ID --force\n```\n\n**Time format for trigger-run:** `2026-03-21T09:00:00` — no trailing `Z`.\n\n**Additional trigger-run flags:**\n\n| Flag | Description 
|\n|------|-------------|\n| `--max-spans` | Cap processed spans (default 10,000) |\n| `--override-evaluations` | Re-score spans that already have labels |\n| `--wait` / `-w` | Block until the run finishes |\n| `--timeout` | Seconds to wait with `--wait` (default 600) |\n| `--poll-interval` | Poll interval in seconds when waiting (default 5) |\n\n**Run status guide:**\n\n| Status | Meaning |\n|--------|---------|\n| `completed`, 0 spans | No spans in eval index for that window — widen time range |\n| `cancelled` ~1s | Integration credentials invalid |\n| `cancelled` ~3min | Found spans but LLM call failed — check model name or key |\n| `completed`, N > 0 | Success — check scores in UI |\n\n---\n\n## Workflow A: Create an evaluator for a project\n\nUse this when the user says something like *\"create an evaluator for my Playground Traces project\"*.\n\n### Step 1: Resolve the project name to an ID\n\n`ax spans export` requires a project **ID**, not a name — passing a name causes a validation error. Always look up the ID first:\n\n```bash\nax projects list --space-id SPACE_ID -o json\n```\n\nFind the entry whose `\"name\"` matches (case-insensitive). Copy its `\"id\"` (a base64 string).\n\n### Step 2: Understand what to evaluate\n\nIf the user specified the evaluator type (hallucination, correctness, relevance, etc.) → skip to Step 3.\n\nIf not, sample recent spans to base the evaluator on actual data:\n\n```bash\nax spans export PROJECT_ID --space-id SPACE_ID -l 10 --days 30 --stdout\n```\n\nInspect `attributes.input`, `attributes.output`, span kinds, and any existing annotations. Identify failure modes (e.g. hallucinated facts, off-topic answers, missing context) and propose **1–3 concrete evaluator ideas**. Let the user pick.\n\nEach suggestion must include: the evaluator name (bold), a one-sentence description of what it judges, and the binary label pair in parentheses. Format each like:\n\n1. **Name** — Description of what is being judged. 
(`label_a` / `label_b`)\n\nExample:\n1. **Response Correctness** — Does the agent's response correctly address the user's financial query? (`correct` / `incorrect`)\n2. **Hallucination** — Does the response fabricate facts not grounded in retrieved context? (`factual` / `hallucinated`)\n\n### Step 3: Confirm or create an AI integration\n\n```bash\nax ai-integrations list --space-id SPACE_ID -o json\n```\n\nIf a suitable integration exists, note its ID. If not, create one using the **arize-ai-provider-integration** skill. Ask the user which provider/model they want for the judge.\n\n### Step 4: Create the evaluator\n\nUse the template design best practices below. Keep the evaluator name and variables **generic** — the task (Step 6) handles project-specific wiring via `column_mappings`.\n\n```bash\nax evaluators create \\\n  --name \"Hallucination\" \\\n  --space-id SPACE_ID \\\n  --template-name \"hallucination\" \\\n  --commit-message \"Initial version\" \\\n  --ai-integration-id INT_ID \\\n  --model-name \"gpt-4o\" \\\n  --include-explanations \\\n  --use-function-calling \\\n  --classification-choices '{\"factual\": 1, \"hallucinated\": 0}' \\\n  --template 'You are an evaluator. Given the user question and the model response, decide if the response is factual or contains unsupported claims.\n\nUser question: {input}\n\nModel response: {output}\n\nRespond with exactly one of these labels: hallucinated, factual'\n```\n\n### Step 5: Ask — backfill, continuous, or both?\n\nBefore creating the task, ask:\n\n> \"Would you like to:\n> (a) Run a **backfill** on historical spans (one-time)?\n> (b) Set up **continuous** evaluation on new spans going forward?\n> (c) **Both** — backfill now and keep scoring new spans automatically?\"\n\n### Step 6: Determine column mappings from real span data\n\nDo not guess paths. 
Pull a sample and inspect what fields are actually present:\n\n```bash\nax spans export PROJECT_ID --space-id SPACE_ID -l 5 --days 7 --stdout\n```\n\nFor each template variable (`{input}`, `{output}`, `{context}`), find the matching JSON path. Common starting points — **always verify on your actual data before using**:\n\n| Template var | LLM span | CHAIN span |\n|---|---|---|\n| `input` | `attributes.input.value` | `attributes.input.value` |\n| `output` | `attributes.llm.output_messages.0.message.content` | `attributes.output.value` |\n| `context` | `attributes.retrieval.documents.contents` | — |\n| `tool_output` | `attributes.input.value` (fallback) | `attributes.output.value` |\n\n**Validate span kind alignment:** If the evaluator prompt assumes LLM final text but the task targets CHAIN spans (or vice versa), runs can cancel or score the wrong text. Make sure the `query_filter` on the task matches the span kind you mapped.\n\n**Full example `--evaluators` JSON:**\n\n```json\n[\n  {\n    \"evaluator_id\": \"EVAL_ID\",\n    \"query_filter\": \"span_kind = 'LLM'\",\n    \"column_mappings\": {\n      \"input\": \"attributes.input.value\",\n      \"output\": \"attributes.llm.output_messages.0.message.content\",\n      \"context\": \"attributes.retrieval.documents.contents\"\n    }\n  }\n]\n```\n\nInclude a mapping for **every** variable the template references. 
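For intuition, the mapping step behaves like filling a format string from dotted paths on the exported span JSON. A minimal sketch in Python — the span shape and `resolve` helper are hypothetical illustrations, not Arize internals:

```python
# Illustrative only: how a task's column_mappings could resolve an
# evaluator's {variable} placeholders against an exported span.
# The span layout and resolver below are hypothetical, not Arize's code.

def resolve(path, record):
    """Walk a dotted path like 'attributes.input.value' through nested dicts/lists."""
    cur = record
    for part in path.split("."):
        if isinstance(cur, list):   # numeric segments index into lists
            cur = cur[int(part)]
        elif isinstance(cur, dict):
            cur = cur.get(part)
        else:
            return None
    return cur

span = {
    "attributes": {
        "input": {"value": "What is the capital of France?"},
        "llm": {"output_messages": [{"message": {"content": "Paris."}}]},
    }
}
column_mappings = {
    "input": "attributes.input.value",
    "output": "attributes.llm.output_messages.0.message.content",
}
template = "User question: {input}\n\nModel response: {output}"

values = {var: resolve(path, span) for var, path in column_mappings.items()}
filled = template.format(**values)  # KeyError if a template var has no mapping
print(filled)
```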
Omitting one causes runs to produce no valid scores.\n\n### Step 7: Create the task\n\n**Backfill only (a):**\n```bash\nax tasks create \\\n  --name \"Hallucination Backfill\" \\\n  --task-type template_evaluation \\\n  --project-id PROJECT_ID \\\n  --evaluators '[{\"evaluator_id\": \"EVAL_ID\", \"column_mappings\": {\"input\": \"attributes.input.value\", \"output\": \"attributes.output.value\"}}]' \\\n  --no-continuous\n```\n\n**Continuous only (b):**\n```bash\nax tasks create \\\n  --name \"Hallucination Monitor\" \\\n  --task-type template_evaluation \\\n  --project-id PROJECT_ID \\\n  --evaluators '[{\"evaluator_id\": \"EVAL_ID\", \"column_mappings\": {\"input\": \"attributes.input.value\", \"output\": \"attributes.output.value\"}}]' \\\n  --is-continuous \\\n  --sampling-rate 0.1\n```\n\n**Both (c):** Use `--is-continuous` on create, then also trigger a backfill run in Step 8.\n\n### Step 8: Trigger a backfill run (if requested)\n\nFirst find what time range has data:\n```bash\nax spans export PROJECT_ID --space-id SPACE_ID -l 100 --days 1 --stdout   # try last 24h first\nax spans export PROJECT_ID --space-id SPACE_ID -l 100 --days 7 --stdout   # widen if empty\n```\n\nUse the `start_time` / `end_time` fields from real spans to set the window. Use the most recent data for your first test run.\n\n```bash\nax tasks trigger-run TASK_ID \\\n  --data-start-time \"2026-03-20T00:00:00\" \\\n  --data-end-time \"2026-03-21T23:59:59\" \\\n  --wait\n```\n\n---\n\n## Workflow B: Create an evaluator for an experiment\n\nUse this when the user says something like *\"create an evaluator for my experiment\"* or *\"evaluate my dataset runs\"*.\n\n**If the user says \"dataset\" but doesn't have an experiment:** A task must target an experiment (not a bare dataset). Ask:\n> \"Evaluation tasks run against experiment runs, not datasets directly. 
Would you like help creating an experiment on that dataset first?\"\n\nIf yes, use the **arize-experiment** skill to create one, then return here.\n\n### Step 1: Resolve dataset and experiment\n\n```bash\nax datasets list --space-id SPACE_ID -o json\nax experiments list --dataset-id DATASET_ID -o json\n```\n\nNote the dataset ID and the experiment ID(s) to score.\n\n### Step 2: Understand what to evaluate\n\nIf the user specified the evaluator type → skip to Step 3.\n\nIf not, inspect a recent experiment run to base the evaluator on actual data:\n\n```bash\nax experiments export EXPERIMENT_ID --stdout | python3 -c \"import sys,json; runs=json.load(sys.stdin); print(json.dumps(runs[0], indent=2))\"\n```\n\nLook at the `output`, `input`, `evaluations`, and `metadata` fields. Identify gaps (metrics the user cares about but doesn't have yet) and propose **1–3 evaluator ideas**. Each suggestion must include: the evaluator name (bold), a one-sentence description, and the binary label pair in parentheses — same format as Workflow A, Step 2.\n\n### Step 3: Confirm or create an AI integration\n\nSame as Workflow A, Step 3.\n\n### Step 4: Create the evaluator\n\nSame as Workflow A, Step 4. Keep variables generic.\n\n### Step 5: Determine column mappings from real run data\n\nRun data shape differs from span data. 
Inspect:\n\n```bash\nax experiments export EXPERIMENT_ID --stdout | python3 -c \"import sys,json; runs=json.load(sys.stdin); print(json.dumps(runs[0], indent=2))\"\n```\n\nCommon mapping for experiment runs:\n- `output` → `\"output\"` (top-level field on each run)\n- `input` → check if it's on the run or embedded in the linked dataset examples\n\nIf `input` is not on the run JSON, export dataset examples to find the path:\n```bash\nax datasets export DATASET_ID --stdout | python3 -c \"import sys,json; ex=json.load(sys.stdin); print(json.dumps(ex[0], indent=2))\"\n```\n\n### Step 6: Create the task\n\n```bash\nax tasks create \\\n  --name \"Experiment Correctness\" \\\n  --task-type template_evaluation \\\n  --dataset-id DATASET_ID \\\n  --experiment-ids \"EXP_ID\" \\\n  --evaluators '[{\"evaluator_id\": \"EVAL_ID\", \"column_mappings\": {\"output\": \"output\"}}]' \\\n  --no-continuous\n```\n\n### Step 7: Trigger and monitor\n\n```bash\nax tasks trigger-run TASK_ID \\\n  --experiment-ids \"EXP_ID\" \\\n  --wait\n\nax tasks list-runs TASK_ID\nax tasks get-run RUN_ID\n```\n\n---\n\n## Best Practices for Template Design\n\n### 1. Use generic, portable variable names\n\nUse `{input}`, `{output}`, and `{context}` — not names tied to a specific project or span attribute (e.g. do not use `{attributes_input_value}`). The evaluator itself stays abstract; the **task's `column_mappings`** is where you wire it to the actual fields in a specific project or experiment. This lets the same evaluator run across multiple projects and experiments without modification.\n\n### 2. Default to binary labels\n\nUse exactly two clear string labels (e.g. `hallucinated` / `factual`, `correct` / `incorrect`, `pass` / `fail`). 
Binary labels are:\n- Easiest for the judge model to produce consistently\n- Most common in the industry\n- Simplest to interpret in dashboards\n\nIf the user insists on more than two choices, that's fine — but recommend binary first and explain the tradeoff (more labels → more ambiguity → lower inter-rater reliability).\n\n### 3. Be explicit about what the model must return\n\nThe template must tell the judge model to respond with **only** the label string — nothing else. The label strings in the prompt must **exactly match** the labels in `--classification-choices` (same spelling, same casing).\n\nGood:\n```\nRespond with exactly one of these labels: hallucinated, factual\n```\n\nBad (too open-ended):\n```\nIs this hallucinated? Answer yes or no.\n```\n\n### 4. Keep temperature low\n\nPass `--invocation-params '{\"temperature\": 0}'` for reproducible scoring. Higher temperatures introduce noise into evaluation results.\n\n### 5. Use `--include-explanations` for debugging\n\nDuring initial setup, always include explanations so you can verify the judge is reasoning correctly before trusting the labels at scale.\n\n### 6. Pass the template in single quotes in bash\n\nSingle quotes pass the template to the shell verbatim. Double quotes do not expand `{variable}` itself, but bash may still expand `$`, backticks, and history `!` inside the prompt:\n\n```bash\n# Correct\n--template 'Judge this: {input} → {output}'\n\n# Wrong: bash may still expand $, backticks, or ! in the prompt\n--template \"Judge this: {input} → {output}\"\n```\n\n### 7. Always set `--classification-choices` to match your template labels\n\nThe labels in `--classification-choices` must exactly match the labels referenced in `--template` (same spelling, same casing). Omitting `--classification-choices` causes task runs to fail with \"missing rails and classification choices.\"\n\n---\n\n## Troubleshooting\n\n| Problem | Solution |\n|---------|----------|\n| `ax: command not found` | See references/ax-setup.md |\n| `401 Unauthorized` | API key may not have access to this space. 
Verify at https://app.arize.com/admin > API Keys |\n| `Evaluator not found` | `ax evaluators list --space-id SPACE_ID` |\n| `Integration not found` | `ax ai-integrations list --space-id SPACE_ID` |\n| `Task not found` | `ax tasks list --space-id SPACE_ID` |\n| `project-id and dataset-id are mutually exclusive` | Use only one when creating a task |\n| `experiment-ids required for dataset tasks` | Add `--experiment-ids` to `create` and `trigger-run` |\n| `sampling-rate only valid for project tasks` | Remove `--sampling-rate` from dataset tasks |\n| Validation error on `ax spans export` | Pass project ID (base64), not project name — look up via `ax projects list` |\n| Template validation errors | Use single-quoted `--template '...'` in bash; single braces `{var}`, not double `{{var}}` |\n| Run stuck in `pending` | `ax tasks get-run RUN_ID`; then `ax tasks cancel-run RUN_ID` |\n| Run `cancelled` ~1s | Integration credentials invalid — check AI integration |\n| Run `cancelled` ~3min | Found spans but LLM call failed — wrong model name or bad key |\n| Run `completed`, 0 spans | Widen time window; eval index may not cover older data |\n| No scores in UI | Fix `column_mappings` to match real paths on your spans/runs |\n| Scores look wrong | Add `--include-explanations` and inspect judge reasoning on a few samples |\n| Evaluator cancels on wrong span kind | Match `query_filter` and `column_mappings` to LLM vs CHAIN spans |\n| Time format error on `trigger-run` | Use `2026-03-21T09:00:00` — no trailing `Z` |\n| Run failed: \"missing rails and classification choices\" | Add `--classification-choices '{\"label_a\": 1, \"label_b\": 0}'` to `ax evaluators create` — labels must match the template |\n| Run `completed`, all spans skipped | Query filter matched spans but column mappings are wrong or template variables don't resolve — export a sample span and verify paths |\n\n---\n\n## Related Skills\n\n- **arize-ai-provider-integration**: Full CRUD for LLM provider integrations 
(create, update, delete credentials)\n- **arize-trace**: Export spans to discover column paths and time ranges\n- **arize-experiment**: Create experiments and export runs for experiment column mappings\n- **arize-dataset**: Export dataset examples to find input fields when runs omit them\n- **arize-link**: Deep links to evaluators and tasks in the Arize UI\n\n---\n\n## Save Credentials for Future Use\n\nSee references/ax-profiles.md § Save Credentials for Future Use.","tags":["arize","evaluator","awesome","copilot","github"],"capabilities":["skill","source-github","category-awesome-copilot"],"categories":["awesome-copilot"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/github/awesome-copilot/arize-evaluator","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"install_from":"skills.sh"}},"qualityScore":"0.300","qualityRationale":"deterministic score 0.30 from registry signals: · indexed on skills.sh · published under github/awesome-copilot","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill:v1","enrichmentVersion":1,"enrichedAt":"2026-04-22T03:40:38.090Z","embedding":null,"createdAt":"2026-04-18T20:36:09.816Z","updatedAt":"2026-04-22T03:40:38.090Z","lastSeenAt":"2026-04-22T03:40:38.090Z","tsv":"'-03':1430,1440,1508,2408,2418,3450 '-20':1431,2409 '-21':1441,1509,2419,3451 '-5':277 '/admin':135,3216 '0':287,474,984,1071,1194,1231,1575,1608,1975,2594,2714,2779,3068,3383,3474 '0.1':1333,2300 '00':1433,1434,1511,1512,2411,2412,3453,3454 '000':1531 '1':475,982,1069,1192,1394,1466,1639,1768,1804,1817,1973,2347,2508,2620,2859,3471 '10':1530,1741 '100':2345,2364 '100k':624,662 '1s':1589,3359 '2':1397,1697,1834,2546,2596,2650,2716,2781,2925 '2026':1429,1439,1507,2407,2417,3449 '24h':2351 '3':1716,1769,1849,2561,2621,2652,2664,2993 
'30':1743 '300':1492 '3min':1594,3368 '4':276,1900,2666,2675,3059 '401':79,3201 '4o':272,970,1061,1166,1961 '5':1568,2015,2095,2680,3079 '59':1443,1444,2421,2422 '6':1921,2061,2783,3107 '600':1557 '7':2097,2225,2366,2822,3148 '8':2317,2319 'abstract':2891 'access':3208 'across':424,688,2918 'activ':349 'actual':407,1727,2081,2118,2574,2904 'ad':1043 'add':3278,3412,3465 'addit':1516 'address':1826 'agent':542,1822 'aggreg':591 'ai':246,791,793,827,847,861,889,960,1051,1148,1152,1854,1859,1885,1951,2657,3235,3364,3515 'ai-integr':846,860,1858,3234 'ai-integration-id':888,959,1050,1147,1950 'align':2144 'allow':225 'alongsid':1207 'alphanumer':1135 'alreadi':1540 'also':2310 'alway':1664,2114,3089,3149 'ambigu':2987 'annot':1753 'answer':935,947,1003,1763,3055 'anthrop':166,253,812 'anyth':546 'api':82,100,108,131,136,164,167,871,874,3203,3217 'api-key':870 'app.arize.com':134,3215 'app.arize.com/admin':133,3214 'appli':497 'ariz':1,8,24,107,130,144,826,1884,2498,3514,3529,3541,3553,3567,3577 'arize-ai-provider-integr':825,1883,3513 'arize-dataset':3552 'arize-experi':2497,3540 'arize-link':3566 'arize-trac':3528 'array':683 'arriv':461 'ask':125,155,175,1889,2016,2025,2472 'assum':2149 'attach':369 'attribut':2879,2884 'attributes.input':1746 'attributes.input.value':417,696,1324,1364,2129,2130,2138,2201,2257,2291 'attributes.llm.input':697 'attributes.llm.output':703 'attributes.llm.output_messages.0.message.content':2132,2203 'attributes.output':1747 'attributes.output.value':702,1326,1366,2133,2140,2259,2293 'attributes.retrieval.documents.contents':2135,2205 'attributes.session.id':567,654 'automat':455,755,2059 'awesom':3 'ax':48,63,85,149,845,859,885,896,904,909,916,931,1033,1085,1101,1268,1276,1284,1292,1300,1340,1373,1418,1454,1469,1476,1483,1493,1647,1671,1730,1857,1931,2084,2233,2267,2334,2353,2396,2514,2524,2577,2697,2762,2788,2827,2840,2847,3195,3222,3233,3246,3306,3319,3342,3350,3476 'azur':813 'b':1815,2040,2265,2425,3473 
*Source: [`github/awesome-copilot/arize-evaluator`](https://github.com/github/awesome-copilot/tree/main/skills/arize-evaluator), also listed on [skills.sh](https://skills.sh/github/awesome-copilot/arize-evaluator) (free).*