{"id":"2cf86b70-06fd-4fc8-8398-4b10721e0b03","shortId":"vCYY34","kind":"skill","title":"Prove whether a prompt or model variant really won before shipping with promptstats","tagline":"Run statistically sound comparisons on eval results so prompt and model changes are judged by confidence bounds, not bar-chart vibes.","description":"# Prove whether a prompt or model variant really won before shipping with promptstats\n\nRun statistically sound comparisons on eval results so prompt and model changes are judged by confidence bounds, not bar-chart vibes.\n\n## Prerequisites\n\nPython environment, promptstats package, eval result tables or per-input score arrays, prompt or model experiment outputs to compare\n\n## Installation\n\nUse the upstream install or setup path that matches your environment:\n- pip install evalstats\n- pip install \"evalstats[xlsx]\"\n- pip install \"evalstats[all]\"\n- pip install \"evalstats[lmm]\"\n\nRequirements and caveats from upstream:\n- If you set method=\"lmm\", analyze() switches to a mixed-effects path (score ~ template + (1|input)) with Wald CIs and parametric rank distributions. By default this uses statsmodels (pure Python, no additional setup re...\n- ## Python API\n- evalstats main use case is as a Python API, which provides a similar entry point, the analyze() function. Simply pass your benchmark data in the correct format, and pass it to analyze to get a battery of results:\n\nBasic usage or getting-started notes:\n- What statistics test should I run in X situation?\n- as well as example code (which will, obviously, tend to use evalstats, but\n- Running estats.analyze() and then estats.print_analysis_summary(analysis) prints a full statistical report to the terminal, including confidence interval line plots, pairwise comparisons between prompt templates, and...\n\n- Source: https://github.com/ianarawjo/promptstats\n- Extracted from upstream docs: https://raw.githubusercontent.com/ianarawjo/promptstats/HEAD/README.md\n\n## Documentation\n\n- https://statsforevals.com\n\n## Source\n\n- [Agent Skill Exchange](https://agentskillexchange.com/skills/prove-whether-a-prompt-or-model-variant-really-won-before-shipping-with-promptstats/)","tags":["prove","whether","prompt","model","variant","really","won","before","shipping","with","promptstats","skills"],"capabilities":["skill","source-agentskillexchange","skill-prove-whether-a-prompt-or-model-variant-really-won-before-shipping-with-promptstats","topic-agent-skills","topic-ai-agents","topic-ai-tools","topic-awesome-list","topic-claude-code","topic-codex","topic-cursor","topic-llm","topic-mcp","topic-npx-skills","topic-openclaw","topic-skills-catalog"],"categories":["skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/agentskillexchange/skills/prove-whether-a-prompt-or-model-variant-really-won-before-shipping-with-promptstats","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add agentskillexchange/skills","source_repo":"https://github.com/agentskillexchange/skills","install_from":"skills.sh"}},"qualityScore":"0.454","qualityRationale":"deterministic score 0.45 from registry signals: · indexed on github topic:agent-skills · 8 github stars · SKILL.md body (1,826 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T19:11:56.017Z","embedding":null,"createdAt":"2026-05-18T13:18:37.492Z","updatedAt":"2026-05-18T19:11:56.017Z","lastSeenAt":"2026-05-18T19:11:56.017Z","tsv":"'/ianarawjo/promptstats':258 '/ianarawjo/promptstats/head/readme.md':265 '/skills/prove-whether-a-prompt-or-model-variant-really-won-before-shipping-with-promptstats/)':274 '1':139 'addit':156 'agent':269 'agentskillexchange.com':273 'agentskillexchange.com/skills/prove-whether-a-prompt-or-model-variant-really-won-before-shipping-with-promptstats/)':272 'analysi':233,235 'analyz':129,177,192 'api':160,169 'array':84 'bar':33,68 'bar-chart':32,67 'basic':199 'batteri':196 'benchmark':182 'bound':30,65 'case':164 'caveat':121 'chang':25,60 'chart':34,69 'cis':143 'code':219 'compar':91 'comparison':17,52,250 'confid':29,64,245 'correct':186 'data':183 'default':149 'distribut':147 'doc':262 'document':266 'effect':135 'entri':174 'environ':73,103 'estats.analyze':229 'estats.print':232 'eval':19,54,76 'evalstat':106,109,113,117,161,226 'exampl':218 'exchang':271 'experi':88 'extract':259 'format':187 'full':238 'function':178 'get':194,203 'getting-start':202 'github.com':257 'github.com/ianarawjo/promptstats':256 'includ':244 'input':82,140 'instal':92,96,105,108,112,116 'interv':246 'judg':27,62 'line':247 'lmm':118,128 'main':162 'match':101 'method':127 'mix':134 'mixed-effect':133 'model':6,24,41,59,87 'note':205 'obvious':222 'output':89 'packag':75 'pairwis':249 'parametr':145 'pass':180,189 'path':99,136 'per':81 'per-input':80 'pip':104,107,111,115 'plot':248 'point':175 'prerequisit':71 'print':236 'prompt':4,22,39,57,85,252 'promptstat':13,48,74 'prove':1,36 'provid':171 'pure':153 'python':72,154,159,168 'rank':146 'raw.githubusercontent.com':264 'raw.githubusercontent.com/ianarawjo/promptstats/head/readme.md':263 're':158 'realli':8,43 'report':240 'requir':119 'result':20,55,77,198 'run':14,49,211,228 'score':83,137 'set':126 'setup':98,157 'ship':11,46 'similar':173 'simpli':179 'situat':214 'skill':270 'skill-prove-whether-a-prompt-or-model-variant-really-won-before-shipping-with-promptstats' 'sound':16,51 'sourc':255,268 'source-agentskillexchange' 'start':204 'statist':15,50,207,239 'statsforevals.com':267 'statsmodel':152 'summari':234 'switch':130 'tabl':78 'templat':138,253 'tend':223 'termin':243 'test':208 'topic-agent-skills' 'topic-ai-agents' 'topic-ai-tools' 'topic-awesome-list' 'topic-claude-code' 'topic-codex' 'topic-cursor' 'topic-llm' 'topic-mcp' 'topic-npx-skills' 'topic-openclaw' 'topic-skills-catalog' 'upstream':95,123,261 'usag':200 'use':93,151,163,225 'variant':7,42 'vibe':35,70 'wald':142 'well':216 'whether':2,37 'won':9,44 'x':213 'xlsx':110","prices":[{"id":"8337f751-c9c7-45ef-88d3-8d98cd6fdef9","listingId":"2cf86b70-06fd-4fc8-8398-4b10721e0b03","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"agentskillexchange","category":"skills","install_from":"skills.sh"},"createdAt":"2026-05-18T13:18:37.492Z"}],"sources":[{"listingId":"2cf86b70-06fd-4fc8-8398-4b10721e0b03","source":"github","sourceId":"agentskillexchange/skills/prove-whether-a-prompt-or-model-variant-really-won-before-shipping-with-promptstats","sourceUrl":"https://github.com/agentskillexchange/skills/tree/main/skills/prove-whether-a-prompt-or-model-variant-really-won-before-shipping-with-promptstats","isPrimary":false,"firstSeenAt":"2026-05-18T13:18:37.492Z","lastSeenAt":"2026-05-18T19:11:56.017Z"}],"details":{"listingId":"2cf86b70-06fd-4fc8-8398-4b10721e0b03","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"agentskillexchange","slug":"prove-whether-a-prompt-or-model-variant-really-won-before-shipping-with-promptstats","github":{"repo":"agentskillexchange/skills","stars":8,"topics":["agent-skills","ai-agents","ai-tools","awesome-list","claude-code","codex","cursor","llm","mcp","npx-skills","openclaw","skills-catalog"],"license":"mit","html_url":"https://github.com/agentskillexchange/skills","pushed_at":"2026-05-18T19:02:17Z","description":"The open catalog of AI agent skills — 2,000+ security-scanned skills for Claude Code, Cursor, Codex, and more.","skill_md_sha":"aaeb21308404a9de32196218df16c46a8e8e136d","skill_md_path":"skills/prove-whether-a-prompt-or-model-variant-really-won-before-shipping-with-promptstats/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/agentskillexchange/skills/tree/main/skills/prove-whether-a-prompt-or-model-variant-really-won-before-shipping-with-promptstats"},"layout":"multi","source":"github","category":"skills","frontmatter":{"name":"Prove whether a prompt or model variant really won before shipping with promptstats","description":"Run statistically sound comparisons on eval results so prompt and model changes are judged by confidence bounds, not bar-chart vibes."},"skills_sh_url":"https://skills.sh/agentskillexchange/skills/prove-whether-a-prompt-or-model-variant-really-won-before-shipping-with-promptstats"},"updatedAt":"2026-05-18T19:11:56.017Z"}}