benchmark-models
Cross-model benchmark for vibestack skills. Runs the same prompt through Claude, GPT (via Codex CLI), and Gemini side-by-side — compares latency, tokens, cost, and optionally quality via LLM judge. Answers "which model is actually best for this skill?" with data instead of vibes.
What it does
Preamble
eval "$(~/.vibestack/bin/vibe-slug 2>/dev/null)" 2>/dev/null || SLUG="unknown"
_LEARN_FILE="${VIBESTACK_HOME:-$HOME/.vibestack}/projects/${SLUG:-unknown}/learnings.jsonl"
if [ -f "$_LEARN_FILE" ]; then
_LEARN_COUNT=$(wc -l < "$_LEARN_FILE" 2>/dev/null | tr -d ' ')
echo "LEARNINGS: $_LEARN_COUNT entries loaded"
if [ "$_LEARN_COUNT" -gt 5 ] 2>/dev/null; then
~/.vibestack/bin/vibe-learnings-search --limit 5 2>/dev/null || true
fi
else
echo "LEARNINGS: none yet"
fi
/benchmark-models — Cross-Model Skill Benchmark
Different from /benchmark — that skill measures web page performance (Core Web Vitals, load times). This skill measures AI model performance on skills or arbitrary prompts.
Step 0: Locate the binary
BIN="$HOME/.vibestack/bin/vibe-model-benchmark"
[ -x "$BIN" ] || { echo "ERROR: model benchmark binary not found." >&2; exit 1; }
echo "BIN: $BIN"
If not found, stop and tell the user:
"vibe-model-benchmark is required for this skill but is not installed at ~/.vibestack/bin/vibe-model-benchmark. vibestack does not bundle this binary — it's a separate dependency. See docs/external-tools.md for current options."
Step 1: Choose a prompt
Use AskUserQuestion with the preamble format:
- Re-ground: current project + branch.
- Simplify: "A cross-model benchmark runs the same prompt through 2-3 AI models and shows you how they compare on speed, cost, and output quality. What prompt should we use?"
- RECOMMENDATION: A because benchmarking against a real skill exposes tool-use differences, not just raw generation.
- Options:
- A) Benchmark one of my skills (we'll pick which skill next). Completeness: 10/10.
- B) Use an inline prompt — type it on the next turn. Completeness: 8/10.
- C) Point at a prompt file on disk — specify path on the next turn. Completeness: 8/10.
If A: list skills that have SKILL.md files (from find ~/.claude/skills -name SKILL.md -not -path '*/vibestack/*'), ask the user to pick one via a second AskUserQuestion. Use the picked SKILL.md path as the prompt file.
If B: ask the user for the inline prompt. Use it verbatim via --prompt "<text>".
If C: ask for the path. Verify it exists. Use as positional argument.
Step 2: Choose providers
"$BIN" --prompt "unused, dry-run" --models claude,gpt,gemini --dry-run
Show the dry-run output. The "Adapter availability" section tells the user which providers will actually run (OK) vs skip (NOT READY — remediation hint included).
If ALL three show NOT READY: stop with a clear message — benchmark can't run without at least one authed provider. Suggest claude login, codex login, or gemini login / export GOOGLE_API_KEY.
If at least one is OK: AskUserQuestion:
- Simplify: "Which models should we include? The dry-run above showed which are authed. Unauthed ones will be skipped cleanly — they won't abort the batch."
- RECOMMENDATION: A (all authed providers) because running as many as possible gives the richest comparison.
- Options:
- A) All authed providers. Completeness: 10/10.
- B) Only Claude. Completeness: 6/10 (no cross-model signal — use /ship's review for solo claude benchmarks instead).
- C) Pick two — specify on next turn. Completeness: 8/10.
Step 3: Decide on judge
[ -n "$ANTHROPIC_API_KEY" ] || grep -q 'ANTHROPIC' "$HOME/.claude/.credentials.json" 2>/dev/null && echo "JUDGE_AVAILABLE" || echo "JUDGE_UNAVAILABLE"
If judge is available, AskUserQuestion:
- Simplify: "The quality judge scores each model's output on a 0-10 scale using Anthropic's Claude as a tiebreaker. Adds ~$0.05/run. Recommended if you care about output quality, not just latency and cost."
- RECOMMENDATION: A — the whole point is comparing quality, not just speed.
- Options:
- A) Enable judge (adds ~$0.05). Completeness: 10/10.
- B) Skip judge — speed/cost/tokens only. Completeness: 7/10.
If judge is NOT available, skip this question and omit the --judge flag.
Step 4: Run the benchmark
Construct the command from Step 1, 2, 3 decisions:
"$BIN" <prompt-spec> --models <picked-models> [--judge] --output table
Where <prompt-spec> is either --prompt "<text>" (Step 1B), a file path (Step 1A or 1C), and <picked-models> is the comma-separated list from Step 2.
Stream the output as it arrives. This is slow — each provider runs the prompt fully. Expect 30s-5min depending on prompt complexity and whether --judge is on.
Step 5: Interpret results
After the table prints, summarize for the user:
- Fastest — provider with lowest latency.
- Cheapest — provider with lowest cost.
- Highest quality (if
--judgeran) — provider with highest score. - Best overall — use judgment. If judge ran: quality-weighted. Otherwise: note the tradeoff the user needs to make.
If any provider hit an error (auth/timeout/rate_limit), call it out with the remediation path.
Step 6: Offer to save results
AskUserQuestion:
- Simplify: "Save this benchmark as JSON so you can compare future runs against it?"
- RECOMMENDATION: A — skill performance drifts as providers update their models; a saved baseline catches quality regressions.
- Options:
- A) Save to
~/.vibestack/benchmarks/<date>-<skill-or-prompt-slug>.json. Completeness: 10/10. - B) Just print, don't save. Completeness: 5/10 (loses trend data).
- A) Save to
If A: re-run with --output json and tee to the dated file. Print the path so the user can diff future runs against it.
Important Rules
- Never run a real benchmark without Step 2's dry-run first. Users need to see auth status before spending API calls.
- Never hardcode model names. Always pass providers from user's Step 2 choice — the binary handles the rest.
- Never auto-include
--judge. It adds real cost; user must opt in. - If zero providers are authed, STOP. Don't attempt the benchmark — it produces no useful output.
- Cost is visible. Every run shows per-provider cost in the table. Users should see it before the next run.
Capture Learnings
If you discovered a non-obvious pattern, pitfall, or insight during this session, log it:
~/.vibestack/bin/vibe-learnings-log '{"skill":"benchmark-models","type":"TYPE","key":"SHORT_KEY","insight":"DESCRIPTION","confidence":N,"source":"SOURCE","files":["path/to/relevant/file"]}'
Types: pattern, pitfall, preference, architecture, operational.
Only log genuine discoveries.
Capabilities
Install
Quality
deterministic score 0.46 from registry signals: · indexed on github topic:agent-skills · 15 github stars · SKILL.md body (6,623 chars)