{"id":"202f0aaa-4222-45fc-97db-3c474e8f3035","shortId":"tvpb2c","kind":"skill","title":"eval-audit","tagline":"Audit an existing evaluation workflow and produce severity-ranked findings with concrete next actions. Use when inheriting an eval setup, diagnosing quality regressions, or checking LLM evaluation process maturity.","description":"# Eval Audit\n\nAudit LLM evaluation practice and route gaps to the right skills.\n\n## Interactive Q&A protocol (mandatory)\n\n<HARD-GATE>\nBEFORE the first scoping question, search for a structured question tool (e.g., `AskUserQuestion` or similar interactive widget) and load it. Use that tool for EVERY scoping question. Fall back to plain-text lettered options ONLY if no such tool exists in the environment.\n</HARD-GATE>\n\nAsk one question at a time using the structured question tool (loaded per the HARD-GATE above).\n\nExample question structure:\n\n```\nWhat should this audit prioritize first?\nA) Live evaluation quality and coverage\nB) Error analysis maturity\nC) Review and promotion loop health\nD) End-to-end process health\n```\n\nRules:\n- One question per message.\n- Use the structured question tool for every question. Structure each with a short header, 2-4 options with labels and descriptions, and place the recommended option first. Do not add \"(Recommended)\" or similar annotations to option labels.\n- Ask one follow-up only if ambiguity remains.\n\n## Inputs and evidence\n\nCollect available evidence from Truesight first:\n- datasets and dataset rows\n- live evaluations\n- evaluation runs/results\n- review queue items\n- existing evaluation criteria and deployment patterns\n\nIf evidence is missing, record that as a finding.\n\n## Diagnostic areas\n\n1. Evaluation coverage and quality dimensions\n2. Error analysis practice and category quality\n3. Review and promotion workflow discipline\n4. Template usage versus custom needs\n5. Operational hygiene (verification, reruns, iteration cadence)\n\n## Report format (mandatory)\n\nFor each finding, include:\n\n```\n### <Finding title>\nStatus: Problem exists | OK | Cannot determine\nEvidence: <specific evidence from Truesight context>\nSeverity: critical | high | medium | low\nRecommended skill: <one of current skill set>\nNext command: <concrete instruction to run next>\n```\n\nOrder findings by severity and impact.\n\n## Severity rubric\n\n- critical: likely causes incorrect go/no-go decisions or severe user harm\n- high: frequent quality failures or missing control loops\n- medium: meaningful process weakness with moderate impact\n- low: optimization opportunity, documentation, or ergonomics issue\n\n## Handoff map\n\n- Missing or weak failure taxonomy -> `error-analysis`\n- Missing live evaluation coverage -> `create-evaluation` or `bootstrap-template-evaluation`\n- Review backlog or low judgment throughput -> `review-and-promote-traces`\n- Unclear starting path -> `truesight-workflows`\n\n## Guardrails\n\n- Keep scope within current Truesight MCP capabilities.","tags":["eval","audit","truesight","mcp","skills","goodeye-labs","agent-skills","ai-evaluation","chatgpt","claude","cursor","llm"],"capabilities":["skill","source-goodeye-labs","skill-eval-audit","topic-agent-skills","topic-ai-evaluation","topic-chatgpt","topic-claude","topic-cursor","topic-llm","topic-mcp","topic-truesight","topic-vscode","topic-windsurf"],"categories":["truesight-mcp-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/Goodeye-Labs/truesight-mcp-skills/eval-audit","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add Goodeye-Labs/truesight-mcp-skills","source_repo":"https://github.com/Goodeye-Labs/truesight-mcp-skills","install_from":"skills.sh"}},"qualityScore":"0.453","qualityRationale":"deterministic score 0.45 from registry signals: · indexed on github topic:agent-skills · 6 github stars · SKILL.md body (2,594 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T13:22:57.210Z","embedding":null,"createdAt":"2026-05-18T13:22:57.210Z","updatedAt":"2026-05-18T13:22:57.210Z","lastSeenAt":"2026-05-18T13:22:57.210Z","tsv":"'-4':166 '1':234 '2':165,240 '3':247 '4':253 '5':259 'action':18 'add':180 'ambigu':195 'analysi':131,242,338 'annot':184 'area':233 'ask':96,188 'askuserquest':64 'audit':3,4,35,36,120 'avail':201 'b':129 'back':80 'backlog':352 'bootstrap':348 'bootstrap-template-evalu':347 'c':133 'cadenc':265 'cannot':277 'capabl':375 'categori':245 'caus':299 'check':29 'collect':200 'command':288 'concret':16 'control':313 'coverag':128,236,342 'creat':344 'create-evalu':343 'criteria':219 'critic':281,297 'current':372 'custom':257 'd':139 'dataset':206,208 'decis':302 'deploy':221 'descript':171 'determin':278 'diagnos':25 'diagnost':232 'dimens':239 'disciplin':252 'document':325 'e.g':63 'end':141,143 'end-to-end':140 'environ':95 'ergonom':327 'error':130,241,337 'error-analysi':336 'eval':2,23,34 'eval-audit':1 'evalu':7,31,38,125,211,212,218,235,341,345,350 'everi':76,157 'evid':199,202,224,279 'exampl':114 'exist':6,92,217,275 'failur':310,334 'fall':79 'find':14,231,271,290 'first':54,122,177,205 'follow':191 'follow-up':190 'format':267 'frequent':308 'gap':42 'gate':112 'go/no-go':301 'guardrail':368 'handoff':329 'hard':111 'hard-gat':110 'harm':306 'header':164 'health':138,145 'high':282,307 'hygien':261 'impact':294,321 'includ':272 'incorrect':300 'inherit':21 'input':197 'interact':47,67 'issu':328 'item':216 'iter':264 'judgment':355 'keep':369 'label':169,187 'letter':85 'like':298 'live':124,210,340 'llm':30,37 'load':70,107 'loop':137,314 'low':284,322,354 'mandatori':51,268 'map':330 'matur':33,132 'mcp':374 'meaning':316 'medium':283,315 'messag':150 'miss':226,312,331,339 'moder':320 'need':258 'next':17,287 'ok':276 'one':97,147,189 'oper':260 'opportun':324 'optim':323 'option':86,167,176,186 'order':289 'path':364 'pattern':222 'per':108,149 'place':173 'plain':83 'plain-text':82 'practic':39,243 'priorit':121 'problem':274 'process':32,144,317 'produc':10 'promot':136,250,360 'protocol':50 'q':48 'qualiti':26,126,238,246,309 'question':56,61,78,98,105,115,148,154,158 'queue':215 'rank':13 'recommend':175,181,285 'record':227 'regress':27 'remain':196 'report':266 'rerun':263 'review':134,214,248,351,358 'review-and-promote-trac':357 'right':45 'rout':41 'row':209 'rubric':296 'rule':146 'runs/results':213 'scope':55,77,370 'search':57 'setup':24 'sever':12,280,292,295,304 'severity-rank':11 'short':163 'similar':66,183 'skill':46,286 'skill-eval-audit' 'source-goodeye-labs' 'start':363 'status':273 'structur':60,104,116,153,159 'taxonomi':335 'templat':254,349 'text':84 'throughput':356 'time':101 'tool':62,74,91,106,155 'topic-agent-skills' 'topic-ai-evaluation' 'topic-chatgpt' 'topic-claude' 'topic-cursor' 'topic-llm' 'topic-mcp' 'topic-truesight' 'topic-vscode' 'topic-windsurf' 'trace':361 'truesight':204,366,373 'truesight-workflow':365 'unclear':362 'usag':255 'use':19,72,102,151 'user':305 'verif':262 'versus':256 'weak':318,333 'widget':68 'within':371 'workflow':8,251,367","prices":[{"id":"83fb5297-8e50-4d39-8150-57ded100ebec","listingId":"202f0aaa-4222-45fc-97db-3c474e8f3035","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"Goodeye-Labs","category":"truesight-mcp-skills","install_from":"skills.sh"},"createdAt":"2026-05-18T13:22:57.210Z"}],"sources":[{"listingId":"202f0aaa-4222-45fc-97db-3c474e8f3035","source":"github","sourceId":"Goodeye-Labs/truesight-mcp-skills/eval-audit","sourceUrl":"https://github.com/Goodeye-Labs/truesight-mcp-skills/tree/main/skills/eval-audit","isPrimary":false,"firstSeenAt":"2026-05-18T13:22:57.210Z","lastSeenAt":"2026-05-18T13:22:57.210Z"}],"details":{"listingId":"202f0aaa-4222-45fc-97db-3c474e8f3035","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"Goodeye-Labs","slug":"eval-audit","github":{"repo":"Goodeye-Labs/truesight-mcp-skills","stars":6,"topics":["agent-skills","ai-evaluation","chatgpt","claude","cursor","llm","mcp","truesight","vscode","windsurf"],"license":"mit","html_url":"https://github.com/Goodeye-Labs/truesight-mcp-skills","pushed_at":"2026-03-26T06:15:56Z","description":"Agent skills for the Truesight MCP. Step-by-step workflow playbooks for scoring inputs, building live evaluations, error analysis, and the review loop. Works with Claude Code, Cursor, ChatGPT, VS Code, Windsurf, and any client that supports the agent skills standard.","skill_md_sha":"7f664668dade7986f0bf96d792280e6f21960d61","skill_md_path":"skills/eval-audit/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/Goodeye-Labs/truesight-mcp-skills/tree/main/skills/eval-audit"},"layout":"multi","source":"github","category":"truesight-mcp-skills","frontmatter":{"name":"eval-audit","description":"Audit an existing evaluation workflow and produce severity-ranked findings with concrete next actions. Use when inheriting an eval setup, diagnosing quality regressions, or checking LLM evaluation process maturity."},"skills_sh_url":"https://skills.sh/Goodeye-Labs/truesight-mcp-skills/eval-audit"},"updatedAt":"2026-05-18T13:22:57.210Z"}}