{"id":"1e8b7840-90b2-401f-9c28-5c4daebca9d4","shortId":"QNRDcD","kind":"skill","title":"benchmark-models","tagline":"Cross-model benchmark for vibestack skills. Runs the same prompt through Claude,\nGPT (via Codex CLI), and Gemini side-by-side — compares latency, tokens, cost,\nand optionally quality via LLM judge. Answers \"which model is actually best\nfor this skill?\" with data instead of vibes.","description":"## Preamble\n\n```bash\neval \"$(~/.vibestack/bin/vibe-slug 2>/dev/null)\" 2>/dev/null || SLUG=\"unknown\"\n_LEARN_FILE=\"${VIBESTACK_HOME:-$HOME/.vibestack}/projects/${SLUG:-unknown}/learnings.jsonl\"\nif [ -f \"$_LEARN_FILE\" ]; then\n  _LEARN_COUNT=$(wc -l < \"$_LEARN_FILE\" 2>/dev/null | tr -d ' ')\n  echo \"LEARNINGS: $_LEARN_COUNT entries loaded\"\n  if [ \"$_LEARN_COUNT\" -gt 5 ] 2>/dev/null; then\n    ~/.vibestack/bin/vibe-learnings-search --limit 5 2>/dev/null || true\n  fi\nelse\n  echo \"LEARNINGS: none yet\"\nfi\n```\n\n# /benchmark-models — Cross-Model Skill Benchmark\n\nDifferent from `/benchmark` — that skill measures web page performance (Core Web Vitals, load times). This skill measures AI model performance on skills or arbitrary prompts.\n\n---\n\n## Step 0: Locate the binary\n\n```bash\nBIN=\"$HOME/.vibestack/bin/vibe-model-benchmark\"\n[ -x \"$BIN\" ] || { echo \"ERROR: model benchmark binary not found.\" >&2; exit 1; }\necho \"BIN: $BIN\"\n```\n\nIf not found, stop and tell the user:\n\"`vibe-model-benchmark` is required for this skill but is not installed at `~/.vibestack/bin/vibe-model-benchmark`. **vibestack does not bundle this binary** — it's a separate dependency. 
See [`docs/external-tools.md`](../../docs/external-tools.md#vibe-model-benchmark) for current options.\"\n\n---\n\n## Step 1: Choose a prompt\n\nUse AskUserQuestion with the preamble format:\n- **Re-ground:** current project + branch.\n- **Simplify:** \"A cross-model benchmark runs the same prompt through 2-3 AI models and shows you how they compare on speed, cost, and output quality. What prompt should we use?\"\n- **RECOMMENDATION:** A because benchmarking against a real skill exposes tool-use differences, not just raw generation.\n- **Options:**\n  - A) Benchmark one of my skills (we'll pick which skill next). Completeness: 10/10.\n  - B) Use an inline prompt — type it on the next turn. Completeness: 8/10.\n  - C) Point at a prompt file on disk — specify path on the next turn. Completeness: 8/10.\n\nIf A: list skills that have SKILL.md files (from `find ~/.claude/skills -name SKILL.md -not -path '*/vibestack/*'`), ask the user to pick one via a second AskUserQuestion. Use the picked SKILL.md path as the prompt file.\n\nIf B: ask the user for the inline prompt. Use it verbatim via `--prompt \"<text>\"`.\n\nIf C: ask for the path. Verify it exists. Use as positional argument.\n\n---\n\n## Step 2: Choose providers\n\n```bash\n\"$BIN\" --prompt \"unused, dry-run\" --models claude,gpt,gemini --dry-run\n```\n\nShow the dry-run output. The \"Adapter availability\" section tells the user which providers will actually run (OK) vs skip (NOT READY — remediation hint included).\n\nIf ALL three show NOT READY: stop with a clear message — benchmark can't run without at least one authed provider. Suggest `claude login`, `codex login`, or `gemini login` / `export GOOGLE_API_KEY`.\n\nIf at least one is OK: AskUserQuestion:\n- **Simplify:** \"Which models should we include? The dry-run above showed which are authed. 
Unauthed ones will be skipped cleanly — they won't abort the batch.\"\n- **RECOMMENDATION:** A (all authed providers) because running as many as possible gives the richest comparison.\n- **Options:**\n  - A) All authed providers. Completeness: 10/10.\n  - B) Only Claude. Completeness: 6/10 (no cross-model signal — use /ship's review for solo Claude benchmarks instead).\n  - C) Pick two — specify on next turn. Completeness: 8/10.\n\n---\n\n## Step 3: Decide on judge\n\n```bash\n[ -n \"$ANTHROPIC_API_KEY\" ] || grep -q 'ANTHROPIC' \"$HOME/.claude/.credentials.json\" 2>/dev/null && echo \"JUDGE_AVAILABLE\" || echo \"JUDGE_UNAVAILABLE\"\n```\n\nIf judge is available, AskUserQuestion:\n- **Simplify:** \"The quality judge scores each model's output on a 0-10 scale using Anthropic's Claude as a tiebreaker. Adds ~$0.05/run. Recommended if you care about output quality, not just latency and cost.\"\n- **RECOMMENDATION:** A — the whole point is comparing quality, not just speed.\n- **Options:**\n  - A) Enable judge (adds ~$0.05). Completeness: 10/10.\n  - B) Skip judge — speed/cost/tokens only. Completeness: 7/10.\n\nIf judge is NOT available, skip this question and omit the `--judge` flag.\n\n---\n\n## Step 4: Run the benchmark\n\nConstruct the command from Step 1, 2, 3 decisions:\n\n```bash\n\"$BIN\" <prompt-spec> --models <picked-models> [--judge] --output table\n```\n\nWhere `<prompt-spec>` is either `--prompt \"<text>\"` (Step 1B) or a file path (Step 1A or 1C), and `<picked-models>` is the comma-separated list from Step 2.\n\nStream the output as it arrives. This is slow — each provider runs the prompt fully. Expect 30s-5min depending on prompt complexity and whether `--judge` is on.\n\n---\n\n## Step 5: Interpret results\n\nAfter the table prints, summarize for the user:\n- **Fastest** — provider with lowest latency.\n- **Cheapest** — provider with lowest cost.\n- **Highest quality** (if `--judge` ran) — provider with highest score.\n- **Best overall** — use judgment. 
If judge ran: quality-weighted. Otherwise: note the tradeoff the user needs to make.\n\nIf any provider hit an error (auth/timeout/rate_limit), call it out with the remediation path.\n\n---\n\n## Step 6: Offer to save results\n\nAskUserQuestion:\n- **Simplify:** \"Save this benchmark as JSON so you can compare future runs against it?\"\n- **RECOMMENDATION:** A — skill performance drifts as providers update their models; a saved baseline catches quality regressions.\n- **Options:**\n  - A) Save to `~/.vibestack/benchmarks/<date>-<skill-or-prompt-slug>.json`. Completeness: 10/10.\n  - B) Just print, don't save. Completeness: 5/10 (loses trend data).\n\nIf A: re-run with `--output json` and tee to the dated file. Print the path so the user can diff future runs against it.\n\n---\n\n## Important Rules\n\n- **Never run a real benchmark without Step 2's dry-run first.** Users need to see auth status before spending API calls.\n- **Never hardcode model names.** Always pass providers from user's Step 2 choice — the binary handles the rest.\n- **Never auto-include `--judge`.** It adds real cost; user must opt in.\n- **If zero providers are authed, STOP.** Don't attempt the benchmark — it produces no useful output.\n- **Cost is visible.** Every run shows per-provider cost in the table. 
Users should see it before the next run.\n\n---\n\n## Capture Learnings\n\nIf you discovered a non-obvious pattern, pitfall, or insight during this session, log it:\n\n```bash\n~/.vibestack/bin/vibe-learnings-log '{\"skill\":\"benchmark-models\",\"type\":\"TYPE\",\"key\":\"SHORT_KEY\",\"insight\":\"DESCRIPTION\",\"confidence\":N,\"source\":\"SOURCE\",\"files\":[\"path/to/relevant/file\"]}'\n```\n\n**Types:** `pattern`, `pitfall`, `preference`, `architecture`, `operational`.\n\n**Only log genuine discoveries.**","tags":["benchmark","models","vibestack","timurgaleev","agent-skills","ai-agents","claude-code","cursor-ide","developer-tools","kiro","mcp","prompt-engineering"],"capabilities":["skill","source-timurgaleev","skill-benchmark-models","topic-agent-skills","topic-ai-agents","topic-claude-code","topic-cursor-ide","topic-developer-tools","topic-kiro","topic-mcp","topic-prompt-engineering","topic-slash-commands"],"categories":["vibestack"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/timurgaleev/vibestack/benchmark-models","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add timurgaleev/vibestack","source_repo":"https://github.com/timurgaleev/vibestack","install_from":"skills.sh"}},"qualityScore":"0.457","qualityRationale":"deterministic score 0.46 from registry signals: · indexed on github topic:agent-skills · 15 github stars · SKILL.md body (6,623 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T19:06:19.396Z","embedding":null,"createdAt":"2026-05-18T19:06:19.396Z","updatedAt":"2026-05-18T19:06:19.396Z","lastSeenAt":"2026-05-18T19:06:19.396Z","tsv":"'-10':582 '-3':239 '/../docs/external-tools.md':202 '/.claude/skills':330 
'/.vibestack/benchmarks':821 '/.vibestack/bin/vibe-learnings-log':974 '/.vibestack/bin/vibe-learnings-search':99 '/.vibestack/bin/vibe-model-benchmark':188 '/.vibestack/bin/vibe-slug':54 '/benchmark':120 '/benchmark-models':112 '/dev/null':56,58,82,97,103,558 '/learnings.jsonl':69 '/projects':66 '/run':593 '/ship':526 '/vibestack':335 '0':144,581 '0.05':592,622 '1':162,211,655 '10/10':290,514,624,824 '1a':675 '1b':670 '1c':677 '2':55,57,81,96,102,160,238,383,557,656,687,871,898 '3':544,657 '30s':705 '30s-5min':704 '4':646 '5':95,101,717 '5/10':832 '5min':706 '6':781 '6/10':519 '7/10':631 '8/10':303,319,542 'abort':490 'actual':41,416 'adapt':407 'add':591,621,911 'ai':135,240 'alway':891 'answer':37 'anthrop':550,555,585 'api':457,551,885 'arbitrari':141 'architectur':996 'argument':381 'arriv':693 'ask':336,357,371 'askuserquest':216,345,465,569,786 'attempt':926 'auth':445,480,496,511,881,922 'auth/timeout/rate_limit':772 'auto':907 'auto-includ':906 'avail':408,561,568,636 'b':291,356,515,625,825 'baselin':813 'bash':52,148,386,548,659,973 'batch':492 'benchmark':2,7,117,156,177,206,232,262,278,437,532,649,790,868,928,977 'benchmark-model':1,976 'best':42,747 'bin':149,152,164,165,387,660 'binari':147,157,194,901 'branch':226 'bundl':192 'c':304,370,534 'call':773,886 'captur':955 'care':597 'catch':814 'cheapest':733 'choic':899 'choos':212,384 'claud':16,394,448,517,531,587 'clean':486 'clear':435 'cli':20 'codex':19,450 'comma':682 'comma-separ':681 'command':652 'compar':27,247,612,796 'comparison':507 'complet':289,302,318,513,518,541,623,630,823,831 'complex':710 'confid':986 'construct':650 'core':127 'cost':30,250,605,737,913,934,943 'count':76,88,93 'cross':5,114,230,522 'cross-model':4,113,229,521 'current':208,224 'd':84 'data':47,835 'date':848 'decid':545 'decis':658 'depend':199,707 'descript':985 'diff':857 'differ':118,271 'discov':959 'discoveri':1001 'disk':311 'docs/external-tools.md':201 'dri':391,398,403,474,874 'drift':805 
'dry-run':390,397,402,473,873 'echo':85,107,153,163,559,562 'either':667 'els':106 'enabl':619 'entri':89 'error':154,771 'eval':53 'everi':937 'exist':377 'exit':161 'expect':703 'export':455 'expos':267 'f':71 'fastest':728 'fi':105,111 'file':62,73,80,309,327,354,672,849,990 'find':329 'first':876 'flag':644 'format':220 'found':159,168 'fulli':702 'futur':797,858 'gemini':22,396,453 'generat':275 'genuin':1000 'give':504 'googl':456 'gpt':17,395 'grep':553 'ground':223 'gt':94 'handl':902 'hardcod':888 'highest':738,745 'hint':424 'hit':769 'home':64 'home/.claude/.credentials.json':556 'home/.vibestack':65 'home/.vibestack/bin/vibe-model-benchmark':150 'import':862 'includ':425,471,908 'inlin':294,362 'insight':967,984 'instal':186 'instead':48,533 'interpret':718 'json':792,822,843 'judg':36,547,560,563,566,573,620,627,633,643,662,713,741,752,909 'judgment':750 'key':458,552,981,983 'l':78 'latenc':28,603,732 'learn':61,72,75,79,86,87,92,108,956 'least':443,461 'limit':100 'list':322,684 'll':284 'llm':35 'load':90,130 'locat':145 'log':971,999 'login':449,451,454 'lose':833 'lowest':731,736 'make':765 'mani':501 'measur':123,134 'messag':436 'model':3,6,39,115,136,155,176,205,231,241,393,468,523,576,661,810,889,978 'must':915 'n':549,987 'name':331,890 'need':763,878 'never':864,887,905 'next':288,300,316,539,953 'non':962 'non-obvi':961 'none':109 'note':758 'obvious':963 'offer':782 'ok':418,464 'omit':641 'one':279,341,444,462,482 'oper':997 'opt':916 'option':32,209,276,508,617,817 'otherwis':757 'output':252,405,578,599,663,690,842,933 'overal':748 'page':125 'pass':892 'path':313,334,350,374,673,779,852 'path/to/relevant/file':991 'pattern':964,993 'per':941 'per-provid':940 'perform':126,137,804 'pick':285,340,348,535 'pitfal':965,994 'point':305,610 'posit':380 'possibl':503 'preambl':51,219 'prefer':995 'print':723,827,850 'produc':930 'project':225 'prompt':14,142,214,236,255,295,308,353,363,368,388,668,701,709 
'provid':385,414,446,497,512,698,729,734,743,768,807,893,920,942 'q':554 'qualiti':33,253,572,600,613,739,755,815 'quality-weight':754 'question':639 'ran':742,753 'raw':274 're':222,839 're-ground':221 're-run':838 'readi':422,431 'real':265,867,912 'recommend':259,493,594,606,801 'regress':816 'remedi':423,778 'requir':179 'rest':904 'result':719,785 'review':528 'richest':506 'rule':863 'run':11,233,392,399,404,417,440,475,499,647,699,798,840,859,865,875,938,954 'save':784,788,812,819,830 'scale':583 'score':574,746 'second':344 'section':409 'see':200,880,949 'separ':198,683 'session':970 'short':982 'show':243,400,429,477,939 'side':24,26 'side-by-sid':23 'signal':524 'simplifi':227,466,570,787 'skill':10,45,116,122,133,139,182,266,282,287,323,803,975 'skill-benchmark-models' 'skill.md':326,332,349 'skip':420,485,626,637 'slow':696 'slug':59,67 'solo':530 'sourc':988,989 'source-timurgaleev' 'specifi':312,537 'speed':249,616 'speed/cost/tokens':628 'spend':884 'status':882 'step':143,210,382,543,645,654,669,674,686,716,780,870,897 'stop':169,432,923 'stream':688 'suggest':447 'summar':724 'tabl':664,722,946 'tee':845 'tell':171,410 'three':428 'tiebreak':590 'time':131 'token':29 'tool':269 'tool-us':268 'topic-agent-skills' 'topic-ai-agents' 'topic-claude-code' 'topic-cursor-ide' 'topic-developer-tools' 'topic-kiro' 'topic-mcp' 'topic-prompt-engineering' 'topic-slash-commands' 'tr':83 'tradeoff':760 'trend':834 'true':104 'turn':301,317,540 'two':536 'type':296,979,980,992 'unauth':481 'unavail':564 'unknown':60,68 'unus':389 'updat':808 'use':215,258,270,292,346,364,378,525,584,749,932 'user':173,338,359,412,727,762,855,877,895,914,947 'verbatim':366 'verifi':375 'via':18,34,342,367 'vibe':50,175,204 'vibe-model-benchmark':174,203 'vibestack':9,63,189 'visibl':936 'vital':129 'vs':419 'wc':77 'web':124,128 'weight':756 'whether':712 'whole':609 'without':441,869 'won':488 'x':151 'yet':110 
'zero':919","prices":[{"id":"41b4df8d-9c6a-4eca-bd07-bd5ed80dbb39","listingId":"1e8b7840-90b2-401f-9c28-5c4daebca9d4","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"timurgaleev","category":"vibestack","install_from":"skills.sh"},"createdAt":"2026-05-18T19:06:19.396Z"}],"sources":[{"listingId":"1e8b7840-90b2-401f-9c28-5c4daebca9d4","source":"github","sourceId":"timurgaleev/vibestack/benchmark-models","sourceUrl":"https://github.com/timurgaleev/vibestack/tree/main/skills/benchmark-models","isPrimary":false,"firstSeenAt":"2026-05-18T19:06:19.396Z","lastSeenAt":"2026-05-18T19:06:19.396Z"}],"details":{"listingId":"1e8b7840-90b2-401f-9c28-5c4daebca9d4","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"timurgaleev","slug":"benchmark-models","github":{"repo":"timurgaleev/vibestack","stars":15,"topics":["agent-skills","ai-agents","claude-code","cursor-ide","developer-tools","kiro","mcp","prompt-engineering","slash-commands"],"license":"mit","html_url":"https://github.com/timurgaleev/vibestack","pushed_at":"2026-05-18T18:19:05Z","description":"vibestack is a portable skill pack for AI coding agents. Slash commands like /office-hours, /ship, /investigate, /tdd, /review install once and work across every agent that supports the Agent Skills open standard — Claude Code, Cursor, Kiro, and a growing list of others. ","skill_md_sha":"09b9ca022a9a43b232443e55cf57d78a348db465","skill_md_path":"skills/benchmark-models/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/timurgaleev/vibestack/tree/main/skills/benchmark-models"},"layout":"multi","source":"github","category":"vibestack","frontmatter":{"name":"benchmark-models","description":"Cross-model benchmark for vibestack skills. 
Runs the same prompt through Claude,\nGPT (via Codex CLI), and Gemini side-by-side — compares latency, tokens, cost,\nand optionally quality via LLM judge. Answers \"which model is actually best\nfor this skill?\" with data instead of vibes. Use when: \"benchmark models\",\n\"compare models\", \"which model is best for X\", \"cross-model comparison\", \"model shootout\"."},"skills_sh_url":"https://skills.sh/timurgaleev/vibestack/benchmark-models"},"updatedAt":"2026-05-18T19:06:19.396Z"}}