{"id":"8107ba7b-4268-4ac0-8b80-05fc0fa551e8","shortId":"KWJdZ5","kind":"skill","title":"evaluate-trace","tagline":"Evaluate one or more traces against an existing Truesight live evaluation. Use when a deployed live evaluation already exists and the user wants run outputs with optional handoff to review and promotion.","description":"# Evaluate Trace\n\nUse this skill when the user wants to evaluate traces with an existing live evaluation endpoint.\n\n## Interactive Q&A protocol (mandatory)\n\n<HARD-GATE>\nBEFORE the first scoping question, search for a structured question tool (e.g., `AskUserQuestion` or similar interactive widget) and load it. Use that tool for EVERY scoping question. Fall back to plain-text lettered options ONLY if no such tool exists in the environment.\n</HARD-GATE>\n\nIf context does not make scope clear, ask one question at a time using the structured question tool (loaded per the HARD-GATE above).\n\nExample question structure:\n\n```\nDo you want to evaluate one trace or a batch?\nA) One trace now\nB) Small batch (up to 25)\nC) Full batch loop\n```\n\nRules:\n- Ask exactly one clarifying question per message.\n- Use the structured question tool for every question. Structure each with a short header, 2-4 options with labels and descriptions, and place the recommended option first. Do not add \"(Recommended)\" or similar annotations to option labels.\n- Ask a single follow-up if needed, then proceed.\n\n## Workflow\n\n1. Identify target live evaluation:\n   - If live evaluation id is unknown, call `list_live_evaluations`.\n   - Select `public_id` and verify required `input_columns`.\n2. Prepare inputs:\n   - Ensure `inputs` keys exactly match `input_columns`.\n   - Include `media_url` for multimodal evaluations when needed.\n3. Execute evaluation:\n   - Use the `run_eval` tool with `live_evaluation_id` and `inputs` for each trace.\n4. Return useful outputs:\n   - `run_id`\n   - per-judgment scores/outcomes\n   - brief interpretation for next action\n5. Optional handoff:\n   - If human judgment is needed, route to `review-and-promote-traces`.\n\n## Batch mode guidance\n\n- Use deterministic trace ordering and log `run_id` for each input.\n- Apply retries with stable idempotency context in caller logic if needed.\n- Summarize failures by category or threshold, then propose review handoff.\n\n## Scopes reference\n\n- `list_live_evaluations` requires `live-evaluations:read`\n- `run_eval` requires `live-evaluations:execute`\n\nIf a scope error occurs, ask the user to create an API key with the missing scope in Truesight Settings.","tags":["evaluate","trace","truesight","mcp","skills","goodeye-labs","agent-skills","ai-evaluation","chatgpt","claude","cursor","llm"],"capabilities":["skill","source-goodeye-labs","skill-evaluate-trace","topic-agent-skills","topic-ai-evaluation","topic-chatgpt","topic-claude","topic-cursor","topic-llm","topic-mcp","topic-truesight","topic-vscode","topic-windsurf"],"categories":["truesight-mcp-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/Goodeye-Labs/truesight-mcp-skills/evaluate-trace","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add Goodeye-Labs/truesight-mcp-skills","source_repo":"https://github.com/Goodeye-Labs/truesight-mcp-skills","install_from":"skills.sh"}},"qualityScore":"0.453","qualityRationale":"deterministic score 0.45 from registry signals: · indexed on github topic:agent-skills · 6 github stars · SKILL.md body (2,202 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T13:22:57.299Z","embedding":null,"createdAt":"2026-05-18T13:22:57.299Z","updatedAt":"2026-05-18T13:22:57.299Z","lastSeenAt":"2026-05-18T13:22:57.299Z","tsv":"'-4':178 '1':211 '2':177,234 '25':150 '3':252 '4':269 '5':284 'action':283 'add':192 'alreadi':21 'annot':196 'api':362 'appli':313 'ask':110,156,200,356 'askuserquest':71 'b':145 'back':87 'batch':140,147,153,299 'brief':279 'c':151 'call':222 'caller':320 'categori':327 'clarifi':159 'clear':109 'column':233,243 'context':104,318 'creat':360 'deploy':18 'descript':183 'determinist':303 'e.g':70 'endpoint':53 'ensur':237 'environ':102 'error':354 'eval':258,345 'evalu':2,4,14,20,36,46,52,135,215,218,225,249,254,262,338,342,349 'evaluate-trac':1 'everi':83,169 'exact':157,240 'exampl':128 'execut':253,350 'exist':11,22,50,99 'failur':325 'fall':86 'first':61,189 'follow':204 'follow-up':203 'full':152 'gate':126 'guidanc':301 'handoff':31,286,333 'hard':125 'hard-gat':124 'header':176 'human':288 'id':219,228,263,274,309 'idempot':317 'identifi':212 'includ':244 'input':232,236,238,242,265,312 'interact':54,74 'interpret':280 'judgment':277,289 'key':239,363 'label':181,199 'letter':92 'list':223,336 'live':13,19,51,214,217,224,261,337,341,348 'live-evalu':340,347 'load':77,121 'log':307 'logic':321 'loop':154 'make':107 'mandatori':58 'match':241 'media':245 'messag':162 'miss':366 'mode':300 'multimod':248 'need':207,251,291,323 'next':282 'occur':355 'one':5,111,136,142,158 'option':30,93,179,188,198,285 'order':305 'output':28,272 'per':122,161,276 'per-judg':275 'place':185 'plain':90 'plain-text':89 'prepar':235 'proceed':209 'promot':35,297 'propos':331 'protocol':57 'public':227 'q':55 'question':63,68,85,112,119,129,160,166,170 'read':343 'recommend':187,193 'refer':335 'requir':231,339,346 'retri':314 'return':270 'review':33,295,332 'review-and-promote-trac':294 'rout':292 'rule':155 'run':27,257,273,308,344 'scope':62,84,108,334,353,367 'scores/outcomes':278 'search':64 'select':226 'set':370 'short':175 'similar':73,195 'singl':202 'skill':40 'skill-evaluate-trace' 'small':146 'source-goodeye-labs' 'stabl':316 'structur':67,118,130,165,171 'summar':324 'target':213 'text':91 'threshold':329 'time':115 'tool':69,81,98,120,167,259 'topic-agent-skills' 'topic-ai-evaluation' 'topic-chatgpt' 'topic-claude' 'topic-cursor' 'topic-llm' 'topic-mcp' 'topic-truesight' 'topic-vscode' 'topic-windsurf' 'trace':3,8,37,47,137,143,268,298,304 'truesight':12,369 'unknown':221 'url':246 'use':15,38,79,116,163,255,271,302 'user':25,43,358 'verifi':230 'want':26,44,133 'widget':75 'workflow':210","prices":[{"id":"cb3b8e8c-9471-4509-8bee-3d122d92b2cb","listingId":"8107ba7b-4268-4ac0-8b80-05fc0fa551e8","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"Goodeye-Labs","category":"truesight-mcp-skills","install_from":"skills.sh"},"createdAt":"2026-05-18T13:22:57.299Z"}],"sources":[{"listingId":"8107ba7b-4268-4ac0-8b80-05fc0fa551e8","source":"github","sourceId":"Goodeye-Labs/truesight-mcp-skills/evaluate-trace","sourceUrl":"https://github.com/Goodeye-Labs/truesight-mcp-skills/tree/main/skills/evaluate-trace","isPrimary":false,"firstSeenAt":"2026-05-18T13:22:57.299Z","lastSeenAt":"2026-05-18T13:22:57.299Z"}],"details":{"listingId":"8107ba7b-4268-4ac0-8b80-05fc0fa551e8","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"Goodeye-Labs","slug":"evaluate-trace","github":{"repo":"Goodeye-Labs/truesight-mcp-skills","stars":6,"topics":["agent-skills","ai-evaluation","chatgpt","claude","cursor","llm","mcp","truesight","vscode","windsurf"],"license":"mit","html_url":"https://github.com/Goodeye-Labs/truesight-mcp-skills","pushed_at":"2026-03-26T06:15:56Z","description":"Agent skills for the Truesight MCP. Step-by-step workflow playbooks for scoring inputs, building live evaluations, error analysis, and the review loop. Works with Claude Code, Cursor, ChatGPT, VS Code, Windsurf, and any client that supports the agent skills standard.","skill_md_sha":"baeefdcad77dd1e1773f6c48255ca618794ab561","skill_md_path":"skills/evaluate-trace/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/Goodeye-Labs/truesight-mcp-skills/tree/main/skills/evaluate-trace"},"layout":"multi","source":"github","category":"truesight-mcp-skills","frontmatter":{"name":"evaluate-trace","description":"Evaluate one or more traces against an existing Truesight live evaluation. Use when a deployed live evaluation already exists and the user wants run outputs with optional handoff to review and promotion."},"skills_sh_url":"https://skills.sh/Goodeye-Labs/truesight-mcp-skills/evaluate-trace"},"updatedAt":"2026-05-18T13:22:57.299Z"}}