Skillquality 0.46

benchmark-models

Cross-model benchmark for vibestack skills. Runs the same prompt through Claude, GPT (via Codex CLI), and Gemini side-by-side — compares latency, tokens, cost, and optionally quality via LLM judge. Answers "which model is actually best for this skill?" with data instead of vibes.

Price
free
Protocol
skill
Verified
no

What it does

Preamble

eval "$(~/.vibestack/bin/vibe-slug 2>/dev/null)" 2>/dev/null || SLUG="unknown"
_LEARN_FILE="${VIBESTACK_HOME:-$HOME/.vibestack}/projects/${SLUG:-unknown}/learnings.jsonl"
if [ -f "$_LEARN_FILE" ]; then
  _LEARN_COUNT=$(wc -l < "$_LEARN_FILE" 2>/dev/null | tr -d ' ')
  echo "LEARNINGS: $_LEARN_COUNT entries loaded"
  if [ "$_LEARN_COUNT" -gt 5 ] 2>/dev/null; then
    ~/.vibestack/bin/vibe-learnings-search --limit 5 2>/dev/null || true
  fi
else
  echo "LEARNINGS: none yet"
fi

/benchmark-models — Cross-Model Skill Benchmark

Different from /benchmark — that skill measures web page performance (Core Web Vitals, load times). This skill measures AI model performance on skills or arbitrary prompts.


Step 0: Locate the binary

BIN="$HOME/.vibestack/bin/vibe-model-benchmark"
[ -x "$BIN" ] || { echo "ERROR: model benchmark binary not found." >&2; exit 1; }
echo "BIN: $BIN"

If not found, stop and tell the user: "vibe-model-benchmark is required for this skill but is not installed at ~/.vibestack/bin/vibe-model-benchmark. vibestack does not bundle this binary — it's a separate dependency. See docs/external-tools.md for current options."


Step 1: Choose a prompt

Use AskUserQuestion with the preamble format:

  • Re-ground: current project + branch.
  • Simplify: "A cross-model benchmark runs the same prompt through 2-3 AI models and shows you how they compare on speed, cost, and output quality. What prompt should we use?"
  • RECOMMENDATION: A because benchmarking against a real skill exposes tool-use differences, not just raw generation.
  • Options:
    • A) Benchmark one of my skills (we'll pick which skill next). Completeness: 10/10.
    • B) Use an inline prompt — type it on the next turn. Completeness: 8/10.
    • C) Point at a prompt file on disk — specify path on the next turn. Completeness: 8/10.

If A: list skills that have SKILL.md files (from find ~/.claude/skills -name SKILL.md -not -path '*/vibestack/*'), ask the user to pick one via a second AskUserQuestion. Use the picked SKILL.md path as the prompt file.

If B: ask the user for the inline prompt. Use it verbatim via --prompt "<text>".

If C: ask for the path. Verify it exists. Use as positional argument.


Step 2: Choose providers

"$BIN" --prompt "unused, dry-run" --models claude,gpt,gemini --dry-run

Show the dry-run output. The "Adapter availability" section tells the user which providers will actually run (OK) vs skip (NOT READY — remediation hint included).

If ALL three show NOT READY: stop with a clear message — benchmark can't run without at least one authed provider. Suggest claude login, codex login, or gemini login / export GOOGLE_API_KEY.

If at least one is OK: AskUserQuestion:

  • Simplify: "Which models should we include? The dry-run above showed which are authed. Unauthed ones will be skipped cleanly — they won't abort the batch."
  • RECOMMENDATION: A (all authed providers) because running as many as possible gives the richest comparison.
  • Options:
    • A) All authed providers. Completeness: 10/10.
    • B) Only Claude. Completeness: 6/10 (no cross-model signal — use /ship's review for solo claude benchmarks instead).
    • C) Pick two — specify on next turn. Completeness: 8/10.

Step 3: Decide on judge

[ -n "$ANTHROPIC_API_KEY" ] || grep -q 'ANTHROPIC' "$HOME/.claude/.credentials.json" 2>/dev/null && echo "JUDGE_AVAILABLE" || echo "JUDGE_UNAVAILABLE"

If judge is available, AskUserQuestion:

  • Simplify: "The quality judge scores each model's output on a 0-10 scale using Anthropic's Claude as a tiebreaker. Adds ~$0.05/run. Recommended if you care about output quality, not just latency and cost."
  • RECOMMENDATION: A — the whole point is comparing quality, not just speed.
  • Options:
    • A) Enable judge (adds ~$0.05). Completeness: 10/10.
    • B) Skip judge — speed/cost/tokens only. Completeness: 7/10.

If judge is NOT available, skip this question and omit the --judge flag.


Step 4: Run the benchmark

Construct the command from Step 1, 2, 3 decisions:

"$BIN" <prompt-spec> --models <picked-models> [--judge] --output table

Where <prompt-spec> is either --prompt "<text>" (Step 1B), a file path (Step 1A or 1C), and <picked-models> is the comma-separated list from Step 2.

Stream the output as it arrives. This is slow — each provider runs the prompt fully. Expect 30s-5min depending on prompt complexity and whether --judge is on.


Step 5: Interpret results

After the table prints, summarize for the user:

  • Fastest — provider with lowest latency.
  • Cheapest — provider with lowest cost.
  • Highest quality (if --judge ran) — provider with highest score.
  • Best overall — use judgment. If judge ran: quality-weighted. Otherwise: note the tradeoff the user needs to make.

If any provider hit an error (auth/timeout/rate_limit), call it out with the remediation path.


Step 6: Offer to save results

AskUserQuestion:

  • Simplify: "Save this benchmark as JSON so you can compare future runs against it?"
  • RECOMMENDATION: A — skill performance drifts as providers update their models; a saved baseline catches quality regressions.
  • Options:
    • A) Save to ~/.vibestack/benchmarks/<date>-<skill-or-prompt-slug>.json. Completeness: 10/10.
    • B) Just print, don't save. Completeness: 5/10 (loses trend data).

If A: re-run with --output json and tee to the dated file. Print the path so the user can diff future runs against it.


Important Rules

  • Never run a real benchmark without Step 2's dry-run first. Users need to see auth status before spending API calls.
  • Never hardcode model names. Always pass providers from user's Step 2 choice — the binary handles the rest.
  • Never auto-include --judge. It adds real cost; user must opt in.
  • If zero providers are authed, STOP. Don't attempt the benchmark — it produces no useful output.
  • Cost is visible. Every run shows per-provider cost in the table. Users should see it before the next run.

Capture Learnings

If you discovered a non-obvious pattern, pitfall, or insight during this session, log it:

~/.vibestack/bin/vibe-learnings-log '{"skill":"benchmark-models","type":"TYPE","key":"SHORT_KEY","insight":"DESCRIPTION","confidence":N,"source":"SOURCE","files":["path/to/relevant/file"]}'

Types: pattern, pitfall, preference, architecture, operational.

Only log genuine discoveries.

Capabilities

skillsource-timurgaleevskill-benchmark-modelstopic-agent-skillstopic-ai-agentstopic-claude-codetopic-cursor-idetopic-developer-toolstopic-kirotopic-mcptopic-prompt-engineeringtopic-slash-commands

Install

Installnpx skills add timurgaleev/vibestack
Transportskills-sh
Protocolskill

Quality

0.46/ 1.00

deterministic score 0.46 from registry signals: · indexed on github topic:agent-skills · 15 github stars · SKILL.md body (6,623 chars)

Provenance

Indexed fromgithub
Enriched2026-05-18 19:06:19Z · deterministic:skill-github:v1 · v1
First seen2026-05-18
Last seen2026-05-18

Agent access