{"id":"60ded37c-d9e1-4bb4-aa38-680d65771130","shortId":"9T95jq","kind":"skill","title":"arize-evaluator","tagline":"INVOKE THIS SKILL for LLM-as-judge evaluation workflows on Arize: creating/updating evaluators, running evaluations on spans or experiments, tasks, trigger-run, column mapping, and continuous monitoring. Use when the user says: create an evaluator, LLM judge, hallucination/faithf","description":"# Arize Evaluator Skill\n\n> **`SPACE`** — All `--space` flags and the `ARIZE_SPACE` env var accept a space **name** (e.g., `my-workspace`) or a base64 space **ID** (e.g., `U3BhY2U6...`). Find yours with `ax spaces list`.\n\nThis skill covers designing, creating, and running **LLM-as-judge evaluators** on Arize. An evaluator defines the judge; a **task** is how you run it against real data.\n\n---\n\n## Prerequisites\n\nProceed directly with the task — run the `ax` command you need. Do NOT check versions, env vars, or profiles upfront.\n\nIf an `ax` command fails, troubleshoot based on the error:\n- `command not found` or version error → see references/ax-setup.md\n- `401 Unauthorized` / missing API key → run `ax profiles show` to inspect the current profile. If the profile is missing or the API key is wrong, follow references/ax-profiles.md to create/update it. If the user doesn't have their key, direct them to https://app.arize.com/admin > API Keys\n- Space unknown → run `ax spaces list` to pick by name, or ask the user\n- LLM provider call fails (missing OPENAI_API_KEY / ANTHROPIC_API_KEY) → run `ax ai-integrations list --space SPACE` to check for platform-managed credentials. If none exist, ask the user to provide the key or create an integration via the **arize-ai-provider-integration** skill\n- **Security:** Never read `.env` files or search the filesystem for credentials. Use `ax profiles` for Arize credentials and `ax ai-integrations` for LLM provider keys. 
If credentials are not available through these channels, ask the user.\n- **CRITICAL — Never fabricate evaluation results:** If an evaluation task fails, is cancelled, or produces no scores, report the failure clearly and explain what went wrong. Do NOT perform a \"manual evaluation,\" invent quality scores, estimate percentages, or present any agent-generated analysis as if it came from the Arize evaluation system. Instead suggest: (1) fix the identified issue and retry, (2) try running from the Arize UI, (3) verify integration credentials with `ax ai-integrations list`, (4) contact support at https://arize.com/support\n\n---\n\n## Concepts\n\n### What is an Evaluator?\n\nAn **evaluator** is an LLM-as-judge definition. It contains:\n\n| Field | Description |\n|-------|-------------|\n| **Template** | The judge prompt. Uses `{variable}` placeholders (e.g. `{input}`, `{output}`, `{context}`) that get filled in at run time via a task's column mappings. |\n| **Classification choices** | The set of allowed output labels (e.g. `factual` / `hallucinated`). Binary is the default and most common. Each choice can optionally carry a numeric score. |\n| **AI Integration** | Stored LLM provider credentials (OpenAI, Anthropic, Bedrock, etc.) the evaluator uses to call the judge model. |\n| **Model** | The specific judge model (e.g. `gpt-4o`, `claude-sonnet-4-5`). |\n| **Invocation params** | Optional JSON of model settings like `{\"temperature\": 0}`. Low temperature is recommended for reproducibility. |\n| **Optimization direction** | Whether higher scores are better (`maximize`) or worse (`minimize`). Sets how the UI renders trends. |\n| **Data granularity** | Whether the evaluator runs at the **span**, **trace**, or **session** level. Most evaluators run at the span level. |\n\nEvaluators are **versioned** — every prompt or model change creates a new immutable version. 
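To make the template and column-mapping relationship concrete: at run time the judge prompt is simply the template with each `{variable}` substituted from the fields the task's column mappings resolve. A minimal sketch (illustrative only, not the actual Arize implementation):

```python
# Illustrative sketch only -- not the Arize implementation.
# An evaluator template uses {variable} placeholders; a task's
# column_mappings decide which span fields fill them at run time.
template = (
    "You are an evaluator.\n"
    "User question: {input}\n"
    "Model response: {output}\n"
    "Respond with exactly one of these labels: correct, incorrect"
)

# Values as if resolved from a span via column_mappings, e.g.
# {"input": "attributes.input.value", "output": "attributes.output.value"}
resolved = {"input": "What is the capital of France?", "output": "Paris."}

judge_prompt = template.format(**resolved)
print(judge_prompt)
```

Keeping the placeholders generic (`{input}`, `{output}`) is what lets one evaluator run against many projects: only the mapping changes, never the template.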
The most recent version is active.\n\n### What is a Task?\n\nA **task** is how you run one or more evaluators against real data. Tasks are attached to a **project** (live traces/spans) or a **dataset** (experiment runs). A task contains:\n\n| Field | Description |\n|-------|-------------|\n| **Evaluators** | List of evaluators to run. You can run multiple in one task. |\n| **Column mappings** | Maps each evaluator's template variables to actual field paths on spans or experiment runs (e.g. `\"input\" → \"attributes.input.value\"`). This is what makes evaluators portable across projects and experiments. |\n| **Query filter** | SQL-style expression to select which spans/runs to evaluate (e.g. `\"span_kind = 'LLM'\"`). Optional but important for precision. |\n| **Continuous** | For project tasks: whether to automatically score new spans as they arrive. |\n| **Sampling rate** | For continuous project tasks: fraction of new spans to evaluate (0–1). |\n\n---\n\n## Data Granularity\n\nThe `--data-granularity` flag controls what unit of data the evaluator scores. 
It defaults to `span` and only applies to **project tasks** (not dataset/experiment tasks — those evaluate experiment runs directly).\n\n| Level | What it evaluates | Use for | Result column prefix |\n|-------|-------------------|---------|---------------------|\n| `span` (default) | Individual spans | Q&A correctness, hallucination, relevance | `eval.{name}.label` / `.score` / `.explanation` |\n| `trace` | All spans in a trace, grouped by `context.trace_id` | Agent trajectory, task correctness — anything that needs the full call chain | `trace_eval.{name}.label` / `.score` / `.explanation` |\n| `session` | All traces in a session, grouped by `attributes.session.id` and ordered by start time | Multi-turn coherence, overall tone, conversation quality | `session_eval.{name}.label` / `.score` / `.explanation` |\n\n### How trace and session aggregation works\n\nFor **trace** granularity, spans sharing the same `context.trace_id` are grouped together. Column values used by the evaluator template are comma-joined into a single string (each value truncated to 100K characters) before being passed to the judge model.\n\nFor **session** granularity, the same trace-level grouping happens first, then traces are ordered by `start_time` and grouped by `attributes.session.id`. Session-level values are capped at 100K characters total.\n\n### The `{conversation}` template variable\n\nAt session granularity, `{conversation}` is a special template variable that renders as a JSON array of `{input, output}` turns across all traces in the session, built from `attributes.input.value` / `attributes.llm.input_messages` (input side) and `attributes.output.value` / `attributes.llm.output_messages` (output side).\n\nAt span or trace granularity, `{conversation}` is treated as a regular template variable and resolved via column mappings like any other.\n\n### Multi-evaluator tasks\n\nA task can contain evaluators at different granularities. 
At runtime the system uses the **highest** granularity (session > trace > span) for data fetching and automatically **splits into one child run per evaluator**. Per-evaluator `query_filter` in the task's evaluators JSON further narrows which spans are included (e.g., only tool-call spans within a session).\n\n---\n\n## Basic CRUD\n\n### AI Integrations\n\nAI integrations store the LLM provider credentials the evaluator uses. For full CRUD — listing, creating for all providers (OpenAI, Anthropic, Azure, Bedrock, Vertex, Gemini, NVIDIA NIM, custom), updating, and deleting — use the **arize-ai-provider-integration** skill.\n\nQuick reference for the common case (OpenAI):\n\n```bash\n# Check for an existing integration first\nax ai-integrations list --space SPACE\n\n# Create if none exists\nax ai-integrations create \\\n  --name \"My OpenAI Integration\" \\\n  --provider openAI \\\n  --api-key $OPENAI_API_KEY\n```\n\nCopy the returned integration ID — it is required for `ax evaluators create --ai-integration-id`.\n\n### Evaluators\n\n```bash\n# List / Get\nax evaluators list --space SPACE\nax evaluators get ID                    # accepts name or ID\nax evaluators get NAME --space SPACE   # required when using name instead of ID\nax evaluators list-versions NAME_OR_ID\nax evaluators get-version VERSION_ID\n\n# Create (creates the evaluator and its first version)\nax evaluators create \\\n  --name \"Answer Correctness\" \\\n  --space SPACE \\\n  --description \"Judges if the model answer is correct\" \\\n  --template-name \"correctness\" \\\n  --commit-message \"Initial version\" \\\n  --ai-integration-id INT_ID \\\n  --model-name \"gpt-4o\" \\\n  --include-explanations \\\n  --use-function-calling \\\n  --classification-choices '{\"correct\": 1, \"incorrect\": 0}' \\\n  --template 'You are an evaluator. 
Given the user question and the model response, decide if the response correctly answers the question.\n\nUser question: {input}\n\nModel response: {output}\n\nRespond with exactly one of these labels: correct, incorrect'\n\n# Create a new version (for prompt or model changes — versions are immutable)\nax evaluators create-version NAME_OR_ID \\\n  --commit-message \"Added context grounding\" \\\n  --template-name \"correctness\" \\\n  --ai-integration-id INT_ID \\\n  --model-name \"gpt-4o\" \\\n  --include-explanations \\\n  --classification-choices '{\"correct\": 1, \"incorrect\": 0}' \\\n  --template 'Updated prompt...\n\n{input} / {output} / {context}'\n\n# Update metadata only (name, description — not prompt)\nax evaluators update NAME_OR_ID \\\n  --name \"New Name\" \\\n  --description \"Updated description\"\n\n# Delete (permanent — removes all versions)\nax evaluators delete NAME_OR_ID\n```\n\n**Key flags for `create`:**\n\n| Flag | Required | Description |\n|------|----------|-------------|\n| `--name` | yes | Evaluator name (unique within space) |\n| `--space` | yes | Space name or ID to create in |\n| `--template-name` | yes | Eval column name — alphanumeric, spaces, hyphens, underscores |\n| `--commit-message` | yes | Description of this version |\n| `--ai-integration-id` | yes | AI integration ID (from above) |\n| `--model-name` | yes | Judge model (e.g. `gpt-4o`) |\n| `--template` | yes | Prompt with `{variable}` placeholders (single-quoted in bash) |\n| `--classification-choices` | yes | JSON object mapping choice labels to numeric scores e.g. `'{\"correct\": 1, \"incorrect\": 0}'` |\n| `--description` | no | Human-readable description |\n| `--include-explanations` | no | Include reasoning alongside the label |\n| `--use-function-calling` | no | Prefer structured function-call output |\n| `--invocation-params` | no | JSON of model params e.g. `'{\"temperature\": 0}'` |\n| `--data-granularity` | no | `span` (default), `trace`, or `session`. 
Only relevant for project tasks, not dataset/experiment tasks. See Data Granularity section. |\n| `--direction` | no | Optimization direction: `maximize` or `minimize`. Sets how the UI renders trends. |\n| `--provider-params` | no | JSON object of provider-specific parameters |\n\n### Tasks\n\n> `PROJECT_NAME`, `DATASET_NAME`, and `evaluator_id` all accept a name or base64 ID.\n\n```bash\n# List / Get\nax tasks list --space SPACE\nax tasks list --project PROJECT_NAME\nax tasks list --dataset DATASET_NAME --space SPACE\nax tasks get TASK_ID\n\n# Create (project — continuous)\nax tasks create \\\n  --name \"Correctness Monitor\" \\\n  --task-type template_evaluation \\\n  --project PROJECT_NAME \\\n  --evaluators '[{\"evaluator_id\": \"EVAL_ID\", \"column_mappings\": {\"input\": \"attributes.input.value\", \"output\": \"attributes.output.value\"}}]' \\\n  --is-continuous \\\n  --sampling-rate 0.1\n\n# Create (project — one-time / backfill)\nax tasks create \\\n  --name \"Correctness Backfill\" \\\n  --task-type template_evaluation \\\n  --project PROJECT_NAME \\\n  --evaluators '[{\"evaluator_id\": \"EVAL_ID\", \"column_mappings\": {\"input\": \"attributes.input.value\", \"output\": \"attributes.output.value\"}}]' \\\n  --no-continuous\n\n# Create (experiment / dataset)\n# EXP_ID_1, EXP_ID_2 are base64 IDs from `ax experiments list --space SPACE -o json`\nax tasks create \\\n  --name \"Experiment Scoring\" \\\n  --task-type template_evaluation \\\n  --dataset DATASET_NAME --space SPACE \\\n  --experiment-ids \"EXP_ID_1,EXP_ID_2\" \\\n  --evaluators '[{\"evaluator_id\": \"EVAL_ID\", \"column_mappings\": {\"output\": \"output\"}}]' \\\n  --no-continuous\n\n# Trigger a run (project task — use data window)\nax tasks trigger-run TASK_ID \\\n  --data-start-time \"2026-03-20T00:00:00\" \\\n  --data-end-time \"2026-03-21T23:59:59\" \\\n  --wait\n\n# Trigger a run (experiment task — use experiment IDs)\n# EXP_ID_1 is a base64 ID from `ax experiments list --space SPACE -o json`\nax tasks trigger-run TASK_ID \\\n  --experiment-ids \"EXP_ID_1\" \\\n  --wait\n\n# Monitor\nax tasks list-runs TASK_ID\nax tasks get-run RUN_ID\nax tasks wait-for-run RUN_ID --timeout 300\nax tasks cancel-run RUN_ID --force\n```\n\n**Time format for trigger-run:** `2026-03-21T09:00:00` — no trailing `Z`.\n\n**Additional trigger-run flags:**\n\n| Flag | Description |\n|------|-------------|\n| `--max-spans` | Cap processed spans (default 10,000) |\n| `--override-evaluations` | Re-score spans that already have labels |\n| `--wait` / `-w` | Block until the run finishes |\n| `--timeout` | Seconds to wait with `--wait` (default 600) |\n| `--poll-interval` | Poll interval in seconds when waiting (default 5) |\n\n**Run status guide:**\n\n| Status | Meaning |\n|--------|---------|\n| `completed`, 0 spans | The eval index lags 1–2 hours — spans ingested recently may not be indexed yet. Shift the window to data at least 2 hours old, or widen the time range to cover more historical data. |\n| `cancelled` ~1s | Integration credentials invalid |\n| `cancelled` ~3min | Found spans but LLM call failed — check model name or key |\n| `completed`, N > 0 | Success — check scores in UI |\n\n---\n\n## Workflow A: Create an evaluator for a project\n\nUse this when the user says something like *\"create an evaluator for my Playground Traces project\"*.\n\n### Step 1: Confirm the project name\n\n`ax spans export` accepts a project name directly — no ID lookup needed. If you don't know the project name, list available projects:\n\n```bash\nax projects list --space SPACE -o json\n```\n\nFind the entry whose `\"name\"` matches (case-insensitive) and use that name as `PROJECT` in subsequent commands. If you later hit a validation error with a name, fall back to using the project's `\"id\"` (a base64 string) instead.\n\n### Step 2: Understand what to evaluate\n\nIf the user specified the evaluator type (hallucination, correctness, relevance, etc.) 

→ skip to Step 3.\n\nIf not, sample recent spans to base the evaluator on actual data:\n\n```bash\nax spans export PROJECT --space SPACE -l 10 --days 30 --stdout\n```\n\nInspect `attributes.input`, `attributes.output`, span kinds, and any existing annotations. Identify failure modes (e.g. hallucinated facts, off-topic answers, missing context) and propose **1–3 concrete evaluator ideas**. Let the user pick.\n\nEach suggestion must include: the evaluator name (bold), a one-sentence description of what it judges, and the binary label pair in parentheses. Format each like:\n\n1. **Name** — Description of what is being judged. (`label_a` / `label_b`)\n\nExample:\n1. **Response Correctness** — Does the agent's response correctly address the user's financial query? (`correct` / `incorrect`)\n2. **Hallucination** — Does the response fabricate facts not grounded in retrieved context? (`factual` / `hallucinated`)\n\n### Step 3: Confirm or create an AI integration\n\n```bash\nax ai-integrations list --space SPACE -o json\n```\n\nIf a suitable integration exists, note its ID. If not, create one using the **arize-ai-provider-integration** skill. Ask the user which provider/model they want for the judge.\n\n### Step 4: Create the evaluator\n\nUse the template design best practices below. Keep the evaluator name and variables **generic** — the task (Step 6) handles project-specific wiring via `column_mappings`.\n\n```bash\nax evaluators create \\\n  --name \"Hallucination\" \\\n  --space SPACE \\\n  --template-name \"hallucination\" \\\n  --commit-message \"Initial version\" \\\n  --ai-integration-id INT_ID \\\n  --model-name \"gpt-4o\" \\\n  --include-explanations \\\n  --use-function-calling \\\n  --classification-choices '{\"factual\": 1, \"hallucinated\": 0}' \\\n  --template 'You are an evaluator. 
Given the user question and the model response, decide if the response is factual or contains unsupported claims.\n\nUser question: {input}\n\nModel response: {output}\n\nRespond with exactly one of these labels: hallucinated, factual'\n```\n\n### Step 5: Ask — backfill, continuous, or both?\n\n**Recommended approach:** Always start with a small backfill (~100 historical spans) to validate the evaluator before turning on continuous monitoring. This lets you catch column mapping errors, wrong span kinds, and template issues on known data before scoring all future production spans. Only enable continuous after a backfill confirms correct scoring.\n\nBefore creating the task, ask:\n\n> \"Would you like to:\n> (a) Run a **backfill** on historical spans (one-time)?\n> (b) Set up **continuous** evaluation on new spans going forward?\n> (c) **Both** — backfill first to validate, then keep scoring new spans automatically? (recommended)\"\n\n### Step 6: Determine column mappings from real span data\n\nDo not guess paths. Pull a sample and inspect what fields are actually present:\n\n```bash\nax spans export PROJECT --space SPACE -l 5 --days 7 --stdout\n```\n\nFor each template variable (`{input}`, `{output}`, `{context}`), find the matching JSON path. Common starting points — **always verify on your actual data before using**:\n\n| Template var | LLM span | CHAIN span |\n|---|---|---|\n| `input` | `attributes.input.value` | `attributes.input.value` |\n| `output` | `attributes.llm.output_messages.0.message.content` | `attributes.output.value` |\n| `context` | `attributes.retrieval.documents.contents` | — |\n| `tool_output` | `attributes.input.value` (fallback) | `attributes.output.value` |\n\n**Validate span kind alignment:** If the evaluator prompt assumes LLM final text but the task targets CHAIN spans (or vice versa), runs can cancel or score the wrong text. 
Make sure the `query_filter` on the task matches the span kind you mapped.\n\n**`query_filter` only works on indexed attributes:** The `query_filter` in the evaluators JSON is evaluated against the eval index, not the raw span store. Attributes under `attributes.metadata.*` or custom keys may not be indexed and will silently match nothing. Use well-known indexed attributes like `span_kind` or `attributes.llm.model_name` for filtering. If a filter returns 0 spans despite data existing, try removing the filter as a diagnostic step.\n\n**Full example `--evaluators` JSON:**\n\n```json\n[\n  {\n    \"evaluator_id\": \"EVAL_ID\",\n    \"query_filter\": \"span_kind = 'LLM'\",\n    \"column_mappings\": {\n      \"input\": \"attributes.input.value\",\n      \"output\": \"attributes.llm.output_messages.0.message.content\",\n      \"context\": \"attributes.retrieval.documents.contents\"\n    }\n  }\n]\n```\n\nInclude a mapping for **every** variable the template references. Omitting one causes runs to produce no valid scores.\n\n### Step 7: Create the task\n\n**Backfill only (a):**\n```bash\nax tasks create \\\n  --name \"Hallucination Backfill\" \\\n  --task-type template_evaluation \\\n  --project PROJECT \\\n  --evaluators '[{\"evaluator_id\": \"EVAL_ID\", \"column_mappings\": {\"input\": \"attributes.input.value\", \"output\": \"attributes.output.value\"}}]' \\\n  --no-continuous\n```\n\n**Continuous only (b):**\n```bash\nax tasks create \\\n  --name \"Hallucination Monitor\" \\\n  --task-type template_evaluation \\\n  --project PROJECT \\\n  --evaluators '[{\"evaluator_id\": \"EVAL_ID\", \"column_mappings\": {\"input\": \"attributes.input.value\", \"output\": \"attributes.output.value\"}}]' \\\n  --is-continuous \\\n  --sampling-rate 0.1\n```\n\n**Both (c):** Use `--is-continuous` on create, then also trigger a backfill run in Step 8.\n\n### Step 8: Trigger a backfill run (if requested)\n\n> **Eval index lag:** The eval index is built asynchronously from the primary 
trace store and can lag **1–2 hours**. For your first test run, use a time window ending at least 2 hours in the past. If you set `--data-end-time` to \"now\" on spans ingested in the last hour, the run will complete successfully but score 0 spans.\n\nFirst find what time range has data:\n```bash\nax spans export PROJECT --space SPACE -l 100 --days 1 --stdout   # try last 24h first\nax spans export PROJECT --space SPACE -l 100 --days 7 --stdout   # widen if empty\n```\n\nUse the `start_time` / `end_time` fields from real spans to set the window. For the first validation run, cap `--max-spans` at ~100 to get quick feedback:\n\n```bash\nax tasks trigger-run TASK_ID \\\n  --data-start-time \"2026-03-20T00:00:00\" \\\n  --data-end-time \"2026-03-21T23:59:59\" \\\n  --max-spans 100 \\\n  --wait\n```\n\nReview scores and explanations before widening to the full backfill or enabling continuous.\n\n---\n\n## Workflow B: Create an evaluator for an experiment\n\nUse this when the user says something like *\"create an evaluator for my experiment\"* or *\"evaluate my dataset runs\"*.\n\n**If the user says \"dataset\" but doesn't have an experiment:** A task must target an experiment (not a bare dataset). Ask:\n> \"Evaluation tasks run against experiment runs, not datasets directly. Would you like help creating an experiment on that dataset first?\"\n\nIf yes, use the **arize-experiment** skill to create one, then return here.\n\n### Step 1: Find the dataset and experiment names\n\n```bash\nax datasets list --space SPACE\nax experiments list --dataset DATASET_NAME --space SPACE -o json\n```\n\nNote the dataset name and the experiment name(s) to score. 
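If you do need the base64 IDs (e.g. for `--experiment-ids`), a small filter over the `-o json` output works. The `id` and `name` field names below are assumptions about the list output's shape, and the sample values are hypothetical — verify both against your actual output:

```python
import json

# Hypothetical sample of `ax experiments list ... -o json` output;
# the `id`/`name` field names are assumptions -- verify against real output.
sample = (
    '[{"name": "exp-a", "id": "RXhwZXJpbWVudDox"},'
    ' {"name": "exp-b", "id": "RXhwZXJpbWVudDoy"}]'
)

# name -> base64 ID, comma-joined for --experiment-ids "EXP_ID_1,EXP_ID_2"
ids = {e["name"]: e["id"] for e in json.loads(sample)}
print(",".join(ids.values()))  # RXhwZXJpbWVudDox,RXhwZXJpbWVudDoy
```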
These accept names or IDs in subsequent commands — names are preferred.\n\n### Step 2: Understand what to evaluate\n\nIf the user specified the evaluator type → skip to Step 3.\n\nIf not, inspect a recent experiment run to base the evaluator on actual data:\n\n```bash\nax experiments export EXPERIMENT_NAME --dataset DATASET_NAME --space SPACE --stdout | python3 -c \"import sys,json; runs=json.load(sys.stdin); print(json.dumps(runs[0], indent=2))\"\n```\n\nLook at the `output`, `input`, `evaluations`, and `metadata` fields. Identify gaps (metrics the user cares about but doesn't have yet) and propose **1–3 evaluator ideas**. Each suggestion must include: the evaluator name (bold), a one-sentence description, and the binary label pair in parentheses — same format as Workflow A, Step 2.\n\n### Step 3: Confirm or create an AI integration\n\nSame as Workflow A, Step 3.\n\n### Step 4: Create the evaluator\n\nSame as Workflow A, Step 4. Keep variables generic.\n\n### Step 5: Determine column mappings from real run data\n\nRun data shape differs from span data. 
Inspect:\n\n```bash\nax experiments export EXPERIMENT_NAME --dataset DATASET_NAME --space SPACE --stdout | python3 -c \"import sys,json; runs=json.load(sys.stdin); print(json.dumps(runs[0], indent=2))\"\n```\n\nCommon mapping for experiment runs:\n- `output` → `\"output\"` (top-level field on each run)\n- `input` → check if it's on the run or embedded in the linked dataset examples\n\nIf `input` is not on the run JSON, export dataset examples to find the path:\n```bash\nax datasets export DATASET_NAME --space SPACE --stdout | python3 -c \"import sys,json; ex=json.load(sys.stdin); print(json.dumps(ex[0], indent=2))\"\n```\n\n### Step 6: Create the task\n\n```bash\n# EXP_ID is a base64 ID from `ax experiments list --space SPACE -o json`\nax tasks create \\\n  --name \"Experiment Correctness\" \\\n  --task-type template_evaluation \\\n  --dataset DATASET_NAME --space SPACE \\\n  --experiment-ids \"EXP_ID\" \\\n  --evaluators '[{\"evaluator_id\": \"EVAL_ID\", \"column_mappings\": {\"output\": \"output\"}}]' \\\n  --no-continuous\n```\n\n### Step 7: Trigger and monitor\n\n```bash\n# EXP_ID is a base64 ID from `ax experiments list --space SPACE -o json`\nax tasks trigger-run TASK_ID \\\n  --experiment-ids \"EXP_ID\" \\\n  --wait\n\nax tasks list-runs TASK_ID\nax tasks get-run RUN_ID\n```\n\n---\n\n## Best Practices for Template Design\n\n### 1. Use generic, portable variable names\n\nUse `{input}`, `{output}`, and `{context}` — not names tied to a specific project or span attribute (e.g. do not use `{attributes_input_value}`). The evaluator itself stays abstract; the **task's `column_mappings`** is where you wire it to the actual fields in a specific project or experiment. This lets the same evaluator run across multiple projects and experiments without modification.\n\n### 2. Default to binary labels\n\nUse exactly two clear string labels (e.g. `hallucinated` / `factual`, `correct` / `incorrect`, `pass` / `fail`). 
Binary labels are:\n- Easiest for the judge model to produce consistently\n- Most common in the industry\n- Simplest to interpret in dashboards\n\nIf the user insists on more than two choices, that's fine — but recommend binary first and explain the tradeoff (more labels → more ambiguity → lower inter-rater reliability).\n\n### 3. Be explicit about what the model must return\n\nThe template must tell the judge model to respond with **only** the label string — nothing else. The label strings in the prompt must **exactly match** the labels in `--classification-choices` (same spelling, same casing).\n\nGood:\n```\nRespond with exactly one of these labels: hallucinated, factual\n```\n\nBad (too open-ended):\n```\nIs this hallucinated? Answer yes or no.\n```\n\n### 4. Keep temperature low\n\nPass `--invocation-params '{\"temperature\": 0}'` for reproducible scoring. Higher temperatures introduce noise into evaluation results.\n\n### 5. Use `--include-explanations` for debugging\n\nDuring initial setup, always include explanations so you can verify the judge is reasoning correctly before trusting the labels at scale.\n\n### 6. Pass the template in single quotes in bash\n\nSingle quotes prevent the shell from interpolating `{variable}` placeholders. Double quotes will cause issues:\n\n```bash\n# Correct\n--template 'Judge this: {input} → {output}'\n\n# Wrong — shell may interpret { } or fail\n--template \"Judge this: {input} → {output}\"\n```\n\n### 7. Always set `--classification-choices` to match your template labels\n\nThe labels in `--classification-choices` must exactly match the labels referenced in `--template` (same spelling, same casing). Omitting `--classification-choices` causes task runs to fail with \"missing rails and classification choices.\"\n\n---\n\n## Troubleshooting\n\n| Problem | Solution |\n|---------|----------|\n| `ax: command not found` | See references/ax-setup.md |\n| `401 Unauthorized` | API key may not have access to this space. 
Verify at https://app.arize.com/admin > API Keys |\n| `Evaluator not found` | `ax evaluators list --space SPACE` |\n| `Integration not found` | `ax ai-integrations list --space SPACE` |\n| `Task not found` | `ax tasks list --space SPACE` |\n| `project and dataset-id are mutually exclusive` | Use only one when creating a task |\n| `experiment-ids required for dataset tasks` | Add `--experiment-ids` to `create` and `trigger-run` |\n| `sampling-rate only valid for project tasks` | Remove `--sampling-rate` from dataset tasks |\n| Validation error on `ax spans export` | Project name usually works; if you still get a validation error, look up the base64 project ID via `ax projects list --space SPACE -o json` and use the `id` field instead |\n| Template validation errors | Use single-quoted `--template '...'` in bash; single braces `{var}`, not double `{{var}}` |\n| Run stuck in `pending` | `ax tasks get-run RUN_ID`; then `ax tasks cancel-run RUN_ID` |\n| Run `cancelled` ~1s | Integration credentials invalid — check AI integration |\n| Run `cancelled` ~3min | Found spans but LLM call failed — wrong model name or bad key |\n| Run `completed`, 0 spans | Widen time window; eval index may not cover older data |\n| No scores in UI | Fix `column_mappings` to match real paths on your spans/runs |\n| Scores look wrong | Add `--include-explanations` and inspect judge reasoning on a few samples |\n| Evaluator cancels on wrong span kind | Match `query_filter` and `column_mappings` to LLM vs CHAIN spans |\n| Time format error on `trigger-run` | Use `2026-03-21T09:00:00` — no trailing `Z` |\n| Run failed: \"missing rails and classification choices\" | Add `--classification-choices '{\"label_a\": 1, \"label_b\": 0}'` to `ax evaluators create` — labels must match the template |\n| Run `completed`, all spans skipped | Query filter matched spans but column mappings are wrong or template variables don't resolve — export a sample span and verify paths |\n| `query_filter` set but 0 spans 
scored | The filter attribute may not be indexed in the eval index. `attributes.metadata.*` and custom attributes are often not indexed. Use `span_kind` or `attributes.llm.model_name` instead, or remove the filter to confirm spans exist in the window. |\n\n### Diagnosing cancelled runs\n\nWhen a task run is cancelled (status `cancelled`), follow this checklist in order:\n\n**1. Check integration credentials**\n```bash\nax ai-integrations list --space SPACE -o json\n```\nVerify the integration ID used by the evaluator exists and has valid credentials. If the integration was deleted or the API key expired, the run cancels within ~1 second.\n\n**2. Verify the model name**\n```bash\nax evaluators get EVALUATOR_NAME --space SPACE -o json\n```\nCheck the `model_name` field. A typo or deprecated model causes the LLM call to fail and the run to cancel after ~3 minutes.\n\n**3. Export a sample span/run and compare paths to column_mappings**\n\nFor project tasks:\n```bash\nax spans export PROJECT --space SPACE -l 1 --days 7 --stdout | python3 -m json.tool\n```\n\nFor experiment tasks:\n```bash\nax experiments export EXPERIMENT_NAME --dataset DATASET_NAME --space SPACE --stdout | python3 -c \"import sys,json; runs=json.load(sys.stdin); print(json.dumps(runs[0], indent=2)) if runs else print('No runs')\"\n```\n\nCompare the exported JSON paths against the task's `column_mappings`. For each template variable, confirm the mapped path actually exists. Common mismatches:\n- Mapping `output` to `attributes.output.value` on an experiment run (should be just `output`)\n- Mapping `input` to `attributes.input.value` on a CHAIN span when the actual path is `attributes.llm.input_messages`\n- Mapping `context` to a path that doesn't exist on the span kind being filtered\n\n**4. Check that `data_start_time` is not epoch**\n\nIf `trigger-run` used a start time of `0`, `1970-01-01`, or an empty string, the time window is invalid. 
Always derive from real span timestamps:\n```bash\nax spans export PROJECT --space SPACE -l 5 --days 30 --stdout | python3 -c \"\nimport sys, json\nspans = json.load(sys.stdin)\nfor s in spans:\n    print(s.get('start_time', 'N/A'), s.get('end_time', 'N/A'))\n\"\n```\n\n**5. Verify span kind matches evaluator scope**\n\nIf the evaluator was created with `--data-granularity trace` but the task's `query_filter` is `span_kind = 'LLM'`, the run may find no qualifying data and cancel. Ensure the granularity and filter are consistent.\n\n**6. Check that all template variables resolve**\n\nEvery `{variable}` in the evaluator template must have a corresponding `column_mappings` entry that resolves to a non-null value. Test resolution against a real span:\n```bash\nax spans export PROJECT --space SPACE -l 3 --days 7 --stdout | python3 -c \"\nimport sys, json\nspans = json.load(sys.stdin)\n# Replace these paths with your actual column_mappings values\nmappings = {'input': 'attributes.input.value', 'output': 'attributes.output.value'}\nfor i, span in enumerate(spans):\n    print(f'--- Span {i} ---')\n    for var, path in mappings.items():\n        val = span\n        for p in path.split('.'):\n            if isinstance(val, dict):\n                val = val.get(p)\n            elif isinstance(val, list) and p.isdigit() and int(p) < len(val):\n                val = val[int(p)]\n            else:\n                val = None\n        status = 'FOUND' if val is not None else 'MISSING'\n        print(f'  {var} ({path}): {status} — {str(val)[:80] if val is not None else \\\"null\\\"}')\n\"\n```\nThe walker treats numeric path segments (e.g. the `0` in `attributes.llm.output_messages.0.message.content`) as list indices. If any variable shows MISSING on all spans, fix the column mapping or adjust `query_filter` to target a different span kind.\n\n---\n\n## Related Skills\n\n- **arize-ai-provider-integration**: Full CRUD for LLM provider integrations (create, update, delete credentials)\n- **arize-trace**: Export spans to discover column paths and time ranges\n- **arize-experiment**: Create experiments and export runs for experiment column mappings\n- **arize-dataset**: Export dataset examples to find input fields when runs omit them\n- **arize-link**: Deep links to 
evaluators and tasks in the Arize UI\n\n---\n\n## Save Credentials for Future Use\n\nSee references/ax-profiles.md § Save Credentials for Future Use.","tags":["arize","evaluator","skills","arize-ai","agent-skills","ai-agents","ai-observability","claude-code","codex","cursor","datasets","experiments"],"capabilities":["skill","source-arize-ai","skill-arize-evaluator","topic-agent-skills","topic-ai-agents","topic-ai-observability","topic-arize","topic-claude-code","topic-codex","topic-cursor","topic-datasets","topic-experiments","topic-llmops","topic-tracing"],"categories":["arize-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/Arize-ai/arize-skills/arize-evaluator","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add Arize-ai/arize-skills","source_repo":"https://github.com/Arize-ai/arize-skills","install_from":"skills.sh"}},"qualityScore":"0.456","qualityRationale":"deterministic score 0.46 from registry signals: · indexed on github topic:agent-skills · 13 github stars · SKILL.md body (31,581 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-04-24T01:02:56.312Z","embedding":null,"createdAt":"2026-04-23T13:03:47.194Z","updatedAt":"2026-04-24T01:02:56.312Z","lastSeenAt":"2026-04-24T01:02:56.312Z",
"prices":[{"id":"35e1e929-8482-47d0-98b6-73498e00cb47","listingId":"60ded37c-d9e1-4bb4-aa38-680d65771130","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"Arize-ai","category":"arize-skills","install_from":"skills.sh"},"createdAt":"2026-04-23T13:03:47.194Z"}],"sources":[{"listingId":"60ded37c-d9e1-4bb4-aa38-680d65771130","source":"github","sourceId":"Arize-ai/arize-skills/arize-evaluator","sourceUrl":"https://github.com/Arize-ai/arize-skills/tree/main/skills/arize-evaluator","isPrimary":false,"firstSeenAt":"2026-04-23T13:03:47.194Z","lastSeenAt":"2026-04-24T01:02:56.312Z"}],"details":{"listingId":"60ded37c-d9e1-4bb4-aa38-680d65771130","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"Arize-ai","slug":"arize-evaluator","github":{"repo":"Arize-ai/arize-skills","stars":13,"topics":["agent-skills","ai-agents","ai-observability","arize","claude-code","codex","cursor","datasets","experiments","llmops","tracing"],"license":"mit","html_url":"https://github.com/Arize-ai/arize-skills","pushed_at":"2026-04-24T00:52:08Z","description":"Agent skills for Arize — datasets, experiments, and traces via the ax 
CLI","skill_md_sha":"660e9bd62cec6667be5ccf121e2b5cd7b5f98134","skill_md_path":"skills/arize-evaluator/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/Arize-ai/arize-skills/tree/main/skills/arize-evaluator"},"layout":"multi","source":"github","category":"arize-skills","frontmatter":{"name":"arize-evaluator","description":"INVOKE THIS SKILL for LLM-as-judge evaluation workflows on Arize: creating/updating evaluators, running evaluations on spans or experiments, tasks, trigger-run, column mapping, and continuous monitoring. Use when the user says: create an evaluator, LLM judge, hallucination/faithfulness/correctness/relevance, run eval, score my spans or experiment, ax tasks, trigger-run, trigger eval, column mapping, continuous monitoring, query filter for evals, evaluator version, or improve an evaluator prompt."},"skills_sh_url":"https://skills.sh/Arize-ai/arize-skills/arize-evaluator"},"updatedAt":"2026-04-24T01:02:56.312Z"}}