{"id":"706f10a4-90d3-4e99-8359-3436069c6bf9","shortId":"tcRS7R","kind":"skill","title":"create-evaluation","tagline":"Scope what quality should be measured, convert it into one or more actionable binary evaluations, deploy those evaluations through Truesight MCP, and generate a companion skill that applies them correctly. Use when a user wants to create new evals, quality checks, guardrails, or ","description":"# Create Evaluation\n\nRun this skill when a user asks to create evals for a task, workflow, or output type.\n\n## Outcome\n\nProduce all of the following in one flow:\n1. Scoped evaluation dimensions with clear pass/fail boundaries\n2. Deployed live eval endpoints\n3. Full runnable cURL per endpoint (must include exact live eval ID and exact API key)\n4. A generated companion skill that explains how to use the evals in the user's workflow\n\n## Default behavior\n\n- Prioritize non-technical scoping first.\n- Use binary evaluations by default.\n- Create separate evals per dimension by default.\n- Avoid asking implementation-detail questions unless they change product intent.\n- Infer technical defaults and execute.\n\n## Interactive Q&A protocol (mandatory)\n\n<HARD-GATE>\nDo NOT call template provisioning tools, create datasets, deploy evaluations, generate cURLs, or produce a companion skill until scoping is complete and the user explicitly approves the scoped evaluation design.\n</HARD-GATE>\n\n<HARD-GATE>\nBEFORE the first scoping question, search for a structured question tool (e.g., `AskUserQuestion` or similar interactive widget) and load it. Use that tool for EVERY scoping question. Fall back to plain-text lettered options ONLY if no such tool exists in the environment.\n</HARD-GATE>\n\n## Anti-pattern: \"This is obvious, skip questions\"\n\nDo not skip the interactive scoping loop, even when the use case seems simple. Fast assumption-heavy execution creates weak criteria and poor downstream behavior. Keep the dialogue short when possible, but do not skip it.\n\n## Checklist (complete in order)\n\nYou MUST complete each item in order:\n\n1. **Initial framing.** Restate the use case and intended operator outcome.\n2. **Clarifying dialogue.** Ask one question at a time; prefer multiple-choice when possible.\n3. **Approach options.** Propose 2-3 decomposition options with trade-offs and recommendation.\n4. **Design approval loop.** Present these sections and get approval after each section:\n   - Quality dimensions\n   - Pass/fail boundaries and strictness\n   - Operational usage pattern (gate, rank, revise loop, monitor)\n5. **Seed labeling.** Have the user label a small sample of traces to calibrate the LLM judge (see Seed labeling section below).\n6. **Build authorization checkpoint.** Ask for explicit go-ahead before any MCP build or deploy action.\n7. **Implementation and verification.** Execute from-scratch flow, verify, then deliver artifacts.\n\n<HARD-GATE>\nEvery scoping question in the checklist above MUST be asked during the clarifying dialogue. The only exception: skip a question if the user has already explicitly answered it earlier in the conversation. Do not infer answers. Do not skip because the answer seems obvious.\n</HARD-GATE>\n\n## Dialogue rules\n\n- Ask exactly one clarifying question per message during scoping.\n- Use the structured question tool (loaded per the HARD-GATE above) for every scoping question. Structure each with a short header, 2-4 options with labels and descriptions, and place the recommended option first. Do not add \"(Recommended)\" or similar annotations to option labels.\n- If the user response is ambiguous, ask one follow-up question before moving forward.\n- Keep questions focused on quality intent, failure cost, and decision thresholds.\n\n## Quick trial redirect\n\nIf the user wants a quick trial or does not yet have a strong evaluation concept, route to `bootstrap-template-evaluation` instead of running this skill.\n\nUse `create-evaluation` for from-scratch evaluation design and deployment.\n\n---\n\n## Scoping workflow (high-information questions only)\n\nAsk questions that define quality, not plumbing. Cover:\n- What is being evaluated\n- What \"good\" and \"bad\" look like\n- Highest-cost failure modes\n- Whether existing sample data or traces are available (if yes, read them early because they inform dimension selection, criterion wording, and borderline calibration)\n- Strictness preference (precision vs recall)\n- How results should be used (gating, ranking, revision loop, monitoring, etc.)\n\nDo not ask about dataset schema, API structure, key storage, or endpoint wiring unless the user explicitly wants custom handling.\n\n## Criterion quality standard\n\nFor each proposed quality dimension:\n\n- Make it atomic: one dimension per criterion.\n- Use strict binary pass/fail boundaries by default.\n- Define explicit fail conditions, not just pass intent.\n- Include at least one borderline example in scoping discussion when ambiguity risk is high.\n- Prefer code-based checks for objective constraints and reserve LLM judgment for interpretive criteria.\n\nAvoid holistic criteria like \"is this good?\" or \"is this helpful?\" without concrete boundaries.\n\n## Real traces first, synthetic fallback via generate-synthetic-data\n\nDefault to real traces from user workflows whenever available.\n\n<HARD-GATE>\nIf fewer than 20 real traces are available, invoke the `generate-synthetic-data` skill to augment the dataset before building. Pass all scoping context already gathered (system type, trace structure, failure modes) so the user is not re-asked. Do NOT proceed to dataset creation or deployment with fewer than 20 traces.\n</HARD-GATE>\n\nSynthetic traces are a bootstrap aid, not a replacement for production traces.\n\n## Seed labeling\n\n<HARD-GATE>\nBefore building the dataset, the user must label a small sample of traces. These labels improve evaluation accuracy and set the standard for how all remaining traces are labeled. The agent then uses those examples to label all remaining traces. No trace may be uploaded without a pre-filled label and reasoning for every judgment column. Skip this step ONLY if the traces already have both labels AND reasoning in every judgment column.\n</HARD-GATE>\n\n**Step 1: User labels seed traces**\n\nSelect the minimum number of traces needed to capture the labeling pattern. Start with 2-3. Only request more if the first batch does not cover enough variation to label the rest confidently. Absolute maximum: 10 traces.\n\nPrioritize the highest-information traces:\n- Borderline cases where pass/fail is genuinely ambiguous\n- Traces that span different failure modes\n- Cases where the criterion wording could be interpreted multiple ways\n\nAvoid obvious pass or obvious fail examples. They add no labeling signal.\n\nFor each selected trace, present it to the user using the structured question tool (loaded per the AskUserQuestion HARD-GATE above). For each judgment dimension, ask:\n- The label (Pass/Fail for binary, the category for categorical, the score for continuous)\n- A 1-2 sentence reason explaining why that label applies\n\nPresent one trace per message. Do not batch them.\n\n**Step 2: Agent labels remaining traces**\n\nUsing the user's seed labels as examples, label all remaining traces with both the judgment value and reasoning for every judgment column. Match the user's labeling style, strictness, and reasoning depth.\n\nAfter labeling, present a summary to the user for approval:\n- Total traces labeled per judgment value (e.g., \"62 Pass, 25 Fail\")\n- 2-3 example auto-labeled traces so the user can spot-check quality\n\nIf the user flags issues, adjust the labeling approach and re-label. Do not upload until the user approves the distribution and spot-check.\n\n**Step 3: Record labels**\n\nWrite all labels (user seed labels and agent-generated labels) into the `judgment_column` and `notes_column` fields for their respective rows.\n\n## Synthesis step\n\nAfter scoping, return:\n- Proposed eval dimensions\n- Recommended number of evals and why\n- Criterion text for each eval with explicit pass/fail boundary\n- Intended usage pattern for eval outputs in downstream workflow\n\nGet explicit user approval on the scoped design before build.\n\n## Build step (Truesight MCP)\n\nUse Truesight MCP to implement approved evals.\n\nFor each eval:\n1. Create/upload dataset with `upload_dataset` or `create_dataset`\n   - Pass `input_columns` and `judgment_configs` inline to avoid separate configure calls\n   - The `columns` array MUST include all `judgment_column` and `notes_column` names from `judgment_configs`, in addition to your input columns. The API will reject the request if judgment/notes columns are missing from `columns`.\n   - Use `idempotency_key` for safe retries in agentic loops\n\n   Example (text input):\n   ```python\n   create_dataset(\n       name=\"My Eval\",\n       columns=[\"conversation\", \"quality\", \"quality_reasoning\"],  # includes judgment + notes columns\n       input_columns=[\"conversation\"],\n       judgment_configs=[{\n           \"judgment_column\": \"quality\",\n           \"notes_column\": \"quality_reasoning\",\n           \"judgment_type\": \"binary\",\n           \"criterion\": \"...\"\n       }]\n   )\n   ```\n\n   Example (image-only input). Use `media_url_column` with `input_columns=[]`. The image column cannot also be an input column:\n   ```python\n   create_dataset(\n       name=\"My Image Eval\",\n       columns=[\"image_url\", \"quality\", \"quality_reasoning\"],\n       input_columns=[],\n       media_url_column=\"image_url\",\n       judgment_configs=[{\n           \"judgment_column\": \"quality\",\n           \"notes_column\": \"quality_reasoning\",\n           \"judgment_type\": \"binary\",\n           \"criterion\": \"...\"\n       }]\n   )\n   ```\n   At `run_eval` time, pass `inputs={}` and provide the image via the `media_url` parameter.\n2. Deploy using `create_and_deploy_evaluation(dataset_id)`\n   - **CRITICAL: the full `api_key` is ONLY returned at creation.** Capture and store it immediately.\n   - The live evaluation `public_id` is also needed for `run_eval` calls\n3. Verify endpoint works with a real call\n\n### judgment_configs reference\n\nEach `judgment_configs` entry defines one scoring dimension. Pass as a list to `upload_dataset` or `create_dataset`.\n\n**Binary (pass/fail), the most common type:**\n```json\n[{\n  \"judgment_column\": \"quality\",\n  \"judgment_type\": \"binary\",\n  \"criterion\": \"The response fully addresses the user's question without factual errors. Pass if it does, Fail if it does not.\"\n}]\n```\n\n**Categorical (multiple labels):**\n```json\n[{\n  \"judgment_column\": \"tone\",\n  \"judgment_type\": \"categorical\",\n  \"options\": [\"professional\", \"neutral\", \"unprofessional\"],\n  \"criterion\": \"Classify the tone of the response.\"\n}]\n```\n\n**Continuous (numeric score):**\n```json\n[{\n  \"judgment_column\": \"relevance\",\n  \"judgment_type\": \"continuous\",\n  \"min_value\": 0,\n  \"max_value\": 10,\n  \"criterion\": \"Score how relevant the response is to the question, from 0 (irrelevant) to 10 (perfectly relevant).\"\n}]\n```\n\n**Multiple dimensions in one dataset:**\n```json\n[\n  {\"judgment_column\": \"accuracy\", \"judgment_type\": \"binary\", \"criterion\": \"...\"},\n  {\"judgment_column\": \"tone\", \"judgment_type\": \"categorical\", \"options\": [\"formal\", \"casual\"], \"criterion\": \"...\"}\n]\n```\n\nOptional fields per config:\n- `notes_column` (str): column for judge reasoning text. Highly recommended so the judge has the reasoning for why the judgment was made.\n\n## cURL requirement (mandatory)\n\nFor every deployed eval, construct and store the full runnable cURL using:\n- Live eval endpoint ID (`public_id`)\n- Its corresponding API key (`api_key`)\n\nTemplate:\n\n```bash\ncurl -sS -X POST \"https://api.truesight.goodeyelabs.com/api/eval/<public_id>\" \\\n  -H \"Authorization: Bearer <api_key>\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"inputs\": { ... }}'\n```\n\nYou must preserve exact endpoint IDs and keys returned from deployment. No placeholders in final delivered skill unless user asked for placeholders.\n\n## Verification requirement (mandatory)\n\n- Execute the exact cURL written into the companion skill for each eval.\n- Confirm successful response and extractable judgment fields.\n- Report verification evidence before claiming completion.\n\n## Companion skill generation\n\nGenerate a new usage skill tailored to the scoped workflow.\n\n### File conventions\n\n- **Directory:** `.claude/skills/<skill-name>/SKILL.md` (the file MUST be named `SKILL.md` in all caps)\n- **Frontmatter:** Every companion skill MUST start with YAML frontmatter:\n\n```yaml\n---\nname: <kebab-case-name>\ndescription: <1-2 sentence description>. Use when <trigger phrases>.\n---\n```\n\nThe `description` field drives skill discovery. Include explicit \"Use when...\" trigger phrases that match how users will ask for this skill.\n\n### Required content\n\nThe companion skill must include:\n- Clear trigger description: what the eval suite does and when to use it\n- Input contract: what inputs must be provided\n- Eval execution instructions aligned to scoped usage (not hardcoded to one pattern)\n- Output parsing guidance: how to read pass/fail and reasoning\n- Full cURL blocks for every eval endpoint\n- Operator loop logic for the approved usage pattern (for example: revise-until-pass, gate-on-fail, or monitor-only)\n\n**IMPORTANT:** Document MCP tool calls in natural language with exact parameter names and values. Never use function-call syntax with parentheses. Example: \"Invoke the `run_eval` tool with `live_evaluation_id` set to `\\\"live_xxx\\\"` and `inputs` set to `{...}`.\" Parenthesized call syntax triggers security hooks.\n\n## Final delivery format\n\nReturn:\n1. Scoping summary\n2. Eval catalog (dimension + criterion + pass/fail boundary)\n3. Deployment manifest (dataset IDs, eval IDs, live eval IDs, API keys)\n4. Companion skill path\n5. Verification results for every cURL\n\nIf any verification fails, stop and return a concrete fix plan instead of marking done.","tags":["create","evaluation","truesight","mcp","skills","goodeye-labs","agent-skills","ai-evaluation","chatgpt","claude","cursor","llm"],"capabilities":["skill","source-goodeye-labs","skill-create-evaluation","topic-agent-skills","topic-ai-evaluation","topic-chatgpt","topic-claude","topic-cursor","topic-llm","topic-mcp","topic-truesight","topic-vscode","topic-windsurf"],"categories":["truesight-mcp-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/Goodeye-Labs/truesight-mcp-skills/create-evaluation","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add Goodeye-Labs/truesight-mcp-skills","source_repo":"https://github.com/Goodeye-Labs/truesight-mcp-skills","install_from":"skills.sh"}},"qualityScore":"0.453","qualityRationale":"deterministic score 0.45 from registry signals: · indexed on github topic:agent-skills · 6 github stars · SKILL.md body (13,375 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T13:22:57.015Z","embedding":null,"createdAt":"2026-05-18T13:22:57.015Z","updatedAt":"2026-05-18T13:22:57.015Z","lastSeenAt":"2026-05-18T13:22:57.015Z","tsv":"'-2':1028,1734 '-3':323,923,1106 '-4':489 '/api/eval/':1633 '/skill.md':1711 '0':1528,1543 '1':75,292,903,1027,1229,1733,1887 '10':943,1531,1546 '2':83,303,322,488,922,1046,1105,1396,1890 '20':763,812 '25':1103 '3':88,318,1147,1432,1897 '4':104,332,1909 '5':359,1913 '6':381 '62':1101 '7':398 'absolut':941 'accuraci':845,1557 'action':16,397 'add':503,982 'addit':1266 'address':1478 'adjust':1125 'agent':858,1047,1158,1291 'agent-gener':1157 'ahead':390 'aid':819 'align':1790 'alreadi':435,785,892 'also':1343,1426 'ambigu':516,708,957 'annot':507 'answer':437,446,452 'anti':237 'anti-pattern':236 'api':102,654,1272,1408,1621,1623,1907 'api.truesight.goodeyelabs.com':1632 'api.truesight.goodeyelabs.com/api/eval/':1631 'appli':31,1035 'application/json':1641 'approach':319,1128 'approv':187,334,341,1093,1139,1208,1224,1820 'array':1252 'artifact':410 'ask':55,142,306,385,420,457,517,586,650,800,1012,1663,1756 'askuserquest':204,1003 'assumpt':260 'assumption-heavi':259 'atom':678 'augment':776 'author':383,1635 'auto':1109 'auto-label':1108 'avail':616,759,767 'avoid':141,727,974,1246 'back':220 'bad':601 'base':715 'bash':1626 'batch':930,1043 'bearer':1636 'behavior':122,269 'binari':17,130,685,1017,1325,1379,1461,1473,1560 'block':1810 'bootstrap':559,818 'bootstrap-template-evalu':558 'borderlin':630,702,951 'boundari':82,348,687,740,1195,1896 'build':382,394,780,829,1214,1215 'calibr':372,631 'call':164,1249,1431,1439,1841,1855,1878 'cannot':1342 'cap':1720 'captur':916,1415 'case':255,298,952,964 'casual':1570 'catalog':1892 'categor':1021,1495,1504,1567 'categori':1019 'chang':149 'check':44,716,1118,1145 'checklist':281,416 'checkpoint':384 'choic':315 'claim':1692 'clarifi':304,423,460 'classifi':1510 'claude/skills':1710 'clear':80,1767 'code':714 'code-bas':713 'column':884,901,1073,1164,1167,1240,1251,1257,1260,1270,1279,1283,1302,1310,1312,1317,1320,1335,1338,1341,1347,1355,1362,1365,1371,1374,1469,1500,1521,1556,1563,1577,1579 'common':1465 'companion':28,107,177,1676,1694,1723,1763,1910 'complet':182,282,287,1693 'concept':555 'concret':739,1927 'condit':693 'confid':940 'config':1243,1264,1315,1369,1441,1445,1575 'configur':1248 'confirm':1681 'constraint':719 'construct':1605 'content':1639,1761 'content-typ':1638 'context':784 'continu':1025,1516,1525 'contract':1781 'convent':1708 'convers':442,1303,1313 'convert':10 'correct':33 'correspond':1620 'cost':533,606 'could':969 'cover':593,933 'creat':2,40,47,57,134,168,263,569,1236,1297,1349,1399,1459 'create-evalu':1,568 'create/upload':1230 'creation':806,1414 'criteria':265,726,729 'criterion':627,668,682,967,1187,1326,1380,1474,1509,1532,1561,1571,1894 'critic':1405 'curl':91,173,1598,1611,1627,1672,1809,1918 'custom':666 'd':1642 'data':612,750,773 'dataset':169,652,778,805,831,1231,1234,1237,1298,1350,1403,1457,1460,1553,1900 'decis':535 'decomposit':324 'default':121,133,140,154,689,751 'defin':589,690,1447 'deliv':409,1659 'deliveri':1884 'deploy':19,84,170,396,578,808,1397,1401,1603,1654,1898 'depth':1083 'descript':494,1732,1736,1740,1769 'design':191,333,576,1212 'detail':145 'dialogu':272,305,424,455 'differ':961 'dimens':78,138,346,625,675,680,1011,1180,1450,1550,1893 'directori':1709 'discoveri':1744 'discuss':706 'distribut':1141 'document':1838 'done':1933 'downstream':268,1203 'drive':1742 'e.g':203,1100 'earli':621 'earlier':439 'endpoint':87,93,659,1434,1615,1648,1814 'enough':934 'entri':1446 'environ':235 'error':1485 'etc':647 'eval':42,58,86,98,115,136,1179,1184,1191,1200,1225,1228,1301,1354,1383,1430,1604,1614,1680,1772,1787,1813,1863,1891,1902,1905 'evalu':3,18,21,48,77,131,171,190,554,561,570,575,597,844,1402,1422,1867 'even':251 'everi':216,411,479,882,899,1071,1602,1722,1812,1917 'evid':1690 'exact':96,101,458,1647,1671,1846 'exampl':703,862,980,1058,1107,1293,1327,1824,1859 'except':427 'execut':156,262,402,1669,1788 'exist':232,610 'explain':110,1031 'explicit':186,387,436,664,691,1193,1206,1746 'extract':1685 'factual':1484 'fail':692,979,1104,1490,1832,1922 'failur':532,607,791,962 'fall':219 'fallback':745 'fast':258 'fewer':761,810 'field':1168,1573,1687,1741 'file':1707,1713 'fill':877 'final':1658,1883 'first':128,194,500,743,929 'fix':1928 'flag':1123 'flow':74,406 'focus':528 'follow':71,520 'follow-up':519 'formal':1569 'format':1885 'forward':525 'frame':294 'from-scratch':403,572 'frontmatt':1721,1729 'full':89,1407,1609,1808 'fulli':1477 'function':1854 'function-cal':1853 'gate':354,476,642,1006,1830 'gate-on-fail':1829 'gather':786 'generat':26,106,172,748,771,1159,1696,1697 'generate-synthetic-data':747,770 'genuin':956 'get':340,1205 'go':389 'go-ahead':388 'good':599,733 'guardrail':45 'guidanc':1801 'h':1634,1637 'handl':667 'hard':475,1005 'hard-gat':474,1004 'hardcod':1795 'header':487 'heavi':261 'help':737 'high':582,711,1584 'high-inform':581 'highest':605,948 'highest-cost':604 'highest-inform':947 'holist':728 'hook':1882 'id':99,1404,1424,1616,1618,1649,1868,1901,1903,1906 'idempot':1285 'imag':1329,1340,1353,1356,1366,1390 'image-on':1328 'immedi':1419 'implement':144,399,1223 'implementation-detail':143 'import':1837 'improv':843 'includ':95,698,1254,1307,1745,1766 'infer':152,445 'inform':583,624,949 'initi':293 'inlin':1244 'input':1239,1269,1295,1311,1331,1337,1346,1361,1386,1643,1780,1783,1874 'instead':562,1930 'instruct':1789 'intend':300,1196 'intent':151,531,697 'interact':157,207,248 'interpret':725,971 'invok':768,1860 'irrelev':1544 'issu':1124 'item':289 'json':1467,1498,1519,1554 'judg':375,1581,1588 'judgment':723,883,900,1010,1066,1072,1098,1163,1242,1256,1263,1308,1314,1316,1323,1368,1370,1377,1440,1444,1468,1471,1499,1502,1520,1523,1555,1558,1562,1565,1595,1686 'judgment/notes':1278 'keep':270,526 'key':103,656,1286,1409,1622,1624,1651,1908 'label':361,365,378,492,510,827,835,842,856,864,878,895,905,918,937,984,1014,1034,1048,1056,1059,1078,1085,1096,1110,1127,1132,1149,1152,1155,1160,1497 'languag':1844 'least':700 'letter':225 'like':603,730 'list':1454 'live':85,97,1421,1613,1866,1871,1904 'llm':374,722 'load':210,471,1000 'logic':1817 'look':602 'loop':250,335,357,645,1292,1816 'made':1597 'make':676 'mandatori':161,1600,1668 'manifest':1899 'mark':1932 'match':1074,1752 'max':1529 'maximum':942 'may':870 'mcp':24,393,1218,1221,1839 'measur':9 'media':1333,1363,1393 'messag':463,1040 'min':1526 'minimum':910 'miss':1281 'mode':608,792,963 'monitor':358,646,1835 'monitor-on':1834 'move':524 'multipl':314,972,1496,1549 'multiple-choic':313 'must':94,286,418,834,1253,1645,1714,1725,1765,1784 'name':1261,1299,1351,1716,1731,1848 'natur':1843 'need':914,1427 'neutral':1507 'never':1851 'new':41,1699 'non':125 'non-techn':124 'note':1166,1259,1309,1319,1373,1576 'number':911,1182 'numer':1517 'object':718 'obvious':241,454,975,978 'off':329 'one':13,73,307,459,518,679,701,1037,1448,1552,1797 'oper':301,351,1815 'option':226,320,325,490,499,509,1505,1568,1572 'order':284,291 'outcom':66,302 'output':64,1201,1799 'paramet':1395,1847 'parenthes':1858,1877 'pars':1800 'pass':696,781,976,1102,1238,1385,1451,1486,1828 'pass/fail':81,347,686,954,1015,1194,1462,1805,1895 'path':1912 'pattern':238,353,919,1198,1798,1822 'per':92,137,462,472,681,1001,1039,1097,1574 'perfect':1547 'phrase':1750 'place':496 'placehold':1656,1665 'plain':223 'plain-text':222 'plan':1929 'plumb':592 'poor':267 'possibl':275,317 'post':1630 'pre':876 'pre-fil':875 'precis':634 'prefer':312,633,712 'present':336,990,1036,1086 'preserv':1646 'priorit':123,945 'proceed':803 'produc':67,175 'product':150,824 'profession':1506 'propos':321,673,1178 'protocol':160 'provid':1388,1786 'provis':166 'public':1423,1617 'python':1296,1348 'q':158 'qualiti':6,43,345,530,590,669,674,1119,1304,1305,1318,1321,1358,1359,1372,1375,1470 'question':146,196,201,218,243,308,413,430,461,469,481,522,527,584,587,998,1482,1541 'quick':537,545 'rank':355,643 're':799,1131 're-ask':798 're-label':1130 'read':619,1804 'real':741,753,764,1438 'reason':880,897,1030,1069,1082,1306,1322,1360,1376,1582,1591,1807 'recal':636 'recommend':331,498,504,1181,1585 'record':1148 'redirect':539 'refer':1442 'reject':1274 'relev':1522,1535,1548 'remain':853,866,1049,1061 'replac':822 'report':1688 'request':925,1276 'requir':1599,1667,1760 'reserv':721 'respect':1171 'respons':514,1476,1515,1537,1683 'rest':939 'restat':295 'result':638,1915 'retri':1289 'return':1177,1412,1652,1886,1925 'revis':356,644,1826 'revise-until-pass':1825 'risk':709 'rout':556 'row':1172 'rule':456 'run':49,564,1382,1429,1862 'runnabl':90,1610 'safe':1288 'sampl':368,611,838 'schema':653 'scope':4,76,127,180,189,195,217,249,412,465,480,579,705,783,1176,1211,1705,1792,1888 'score':1023,1449,1518,1533 'scratch':405,574 'search':197 'section':338,344,379 'secur':1881 'see':376 'seed':360,377,826,906,1055,1154 'seem':256,453 'select':626,908,988 'sentenc':1029,1735 'separ':135,1247 'set':847,1869,1875 'short':273,486 'signal':985 'similar':206,506 'simpl':257 'skill':29,51,108,178,566,774,1660,1677,1695,1701,1724,1743,1759,1764,1911 'skill-create-evaluation' 'skill.md':1717 'skip':242,246,279,428,449,885 'small':367,837 'source-goodeye-labs' 'span':960 'spot':1117,1144 'spot-check':1116,1143 'ss':1628 'standard':670,849 'start':920,1726 'step':887,902,1045,1146,1174,1216 'stop':1923 'storag':657 'store':1417,1607 'str':1578 'strict':350,632,684,1080 'strong':553 'structur':200,468,482,655,790,997 'style':1079 'success':1682 'suit':1773 'summari':1088,1889 'syntax':1856,1879 'synthesi':1173 'synthet':744,749,772,814 'system':787 'tailor':1702 'task':61 'technic':126,153 'templat':165,560,1625 'text':224,1188,1294,1583 'threshold':536 'time':311,1384 'tone':1501,1512,1564 'tool':167,202,214,231,470,999,1840,1864 'topic-agent-skills' 'topic-ai-evaluation' 'topic-chatgpt' 'topic-claude' 'topic-cursor' 'topic-llm' 'topic-mcp' 'topic-truesight' 'topic-vscode' 'topic-windsurf' 'total':1094 'trace':370,614,742,754,765,789,813,815,825,840,854,867,869,891,907,913,944,950,958,989,1038,1050,1062,1095,1111 'trade':328 'trade-off':327 'trial':538,546 'trigger':1749,1768,1880 'truesight':23,1217,1220 'type':65,788,1324,1378,1466,1472,1503,1524,1559,1566,1640 'unless':147,661,1661 'unprofession':1508 'upload':872,1135,1233,1456 'url':1334,1357,1364,1367,1394 'usag':352,1197,1700,1793,1821 'use':34,113,129,212,254,297,466,567,641,683,860,995,1051,1219,1284,1332,1398,1612,1737,1747,1778,1852 'user':37,54,118,185,364,433,513,542,663,756,795,833,904,994,1053,1076,1091,1114,1122,1138,1153,1207,1480,1662,1754 'valu':1067,1099,1527,1530,1850 'variat':935 'verif':401,1666,1689,1914,1921 'verifi':407,1433 'via':746,1391 'vs':635 'want':38,543,665 'way':973 'weak':264 'whenev':758 'whether':609 'widget':208 'wire':660 'without':738,873,1483 'word':628,968 'work':1435 'workflow':62,120,580,757,1204,1706 'write':1150 'written':1673 'x':1629 'xxx':1872 'yaml':1728,1730 'yes':618 'yet':550","prices":[{"id":"404cdf8a-54a9-44bb-a930-a246f343063e","listingId":"706f10a4-90d3-4e99-8359-3436069c6bf9","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"Goodeye-Labs","category":"truesight-mcp-skills","install_from":"skills.sh"},"createdAt":"2026-05-18T13:22:57.015Z"}],"sources":[{"listingId":"706f10a4-90d3-4e99-8359-3436069c6bf9","source":"github","sourceId":"Goodeye-Labs/truesight-mcp-skills/create-evaluation","sourceUrl":"https://github.com/Goodeye-Labs/truesight-mcp-skills/tree/main/skills/create-evaluation","isPrimary":false,"firstSeenAt":"2026-05-18T13:22:57.015Z","lastSeenAt":"2026-05-18T13:22:57.015Z"}],"details":{"listingId":"706f10a4-90d3-4e99-8359-3436069c6bf9","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"Goodeye-Labs","slug":"create-evaluation","github":{"repo":"Goodeye-Labs/truesight-mcp-skills","stars":6,"topics":["agent-skills","ai-evaluation","chatgpt","claude","cursor","llm","mcp","truesight","vscode","windsurf"],"license":"mit","html_url":"https://github.com/Goodeye-Labs/truesight-mcp-skills","pushed_at":"2026-03-26T06:15:56Z","description":"Agent skills for the Truesight MCP. Step-by-step workflow playbooks for scoring inputs, building live evaluations, error analysis, and the review loop. Works with Claude Code, Cursor, ChatGPT, VS Code, Windsurf, and any client that supports the agent skills standard.","skill_md_sha":"d519f52932968f97fb313056eff1053fc652820f","skill_md_path":"skills/create-evaluation/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/Goodeye-Labs/truesight-mcp-skills/tree/main/skills/create-evaluation"},"layout":"multi","source":"github","category":"truesight-mcp-skills","frontmatter":{"name":"create-evaluation","description":"Scope what quality should be measured, convert it into one or more actionable binary evaluations, deploy those evaluations through Truesight MCP, and generate a companion skill that applies them correctly. Use when a user wants to create new evals, quality checks, guardrails, or pass/fail criteria for AI outputs."},"skills_sh_url":"https://skills.sh/Goodeye-Labs/truesight-mcp-skills/create-evaluation"},"updatedAt":"2026-05-18T13:22:57.015Z"}}