Skill quality: 0.45

LLM Cost Optimizer

Audits an AI application for unnecessary token spend and recommends prompt caching, model routing, and token reduction techniques to cut costs.

Price: free
Protocol: skill
Verified: no

What this skill does

This skill audits an LLM application's prompts, call patterns, and model selection to identify cost reduction opportunities. It covers prompt caching, model routing (right-sizing), token reduction, batching, and output length control. Combined, these techniques typically cut LLM costs by 40–80% with little or no loss in quality; the actual figure depends on the workload.

How to use

Claude Code / Cline

Copy this file to .agents/skills/llm-cost-optimizer/SKILL.md in your project root.

Then ask:

  • "Use the LLM Cost Optimizer to audit our AI application."
  • "How can I reduce our OpenAI API costs? Here are our prompts..."

Provide:

  • Your system prompt(s)
  • Approximate daily call volume
  • Which model(s) you're using
  • Typical input/output token counts if known
  • Whether calls are real-time (low latency required) or batch (latency tolerant)

Cursor / Codex

Paste your prompts, call patterns, and current monthly spend alongside these instructions.

The Prompt / Instructions for the Agent

When asked to optimize LLM costs, audit the following areas in order of typical savings impact:

Audit 1 — Prompt Caching (savings: 50–90% on repeated prefixes)

Check: Does the system prompt stay the same across calls?

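If yes, enable prompt caching. The system prompt is still sent with every request, but the provider caches the processed prefix, so calls that arrive within the cache lifetime pay a reduced rate for the cached tokens and the full rate only for the new tokens. On Anthropic, the call that writes the cache is billed at a small premium, so caching pays off only when the prefix is actually reused.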

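# Anthropic Claude: cache_control on the system prompt
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,  # required by the Messages API
    system=[{
        "type": "text",
        "text": your_system_prompt,
        "cache_control": {"type": "ephemeral"}  # default lifetime is 5 minutes, refreshed on each hit
    }],
    messages=[{"role": "user", "content": user_message}]
)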

# OpenAI: automatic prompt caching for prompts of 1024 tokens or more
# No code change needed; check usage.prompt_tokens_details.cached_tokens to confirm cache hits

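When it applies: Any app where the system prompt is at least about 1024 tokens (the minimum cacheable length varies by provider and model) and is reused across calls within the cache lifetime. Typical cases: support bots, coding assistants, document analyzers.

Savings estimate: If system prompt = 2000 tokens and volume = 10,000 calls/day, about 20M input tokens/day move from the full input rate to the discounted cached rate.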
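The arithmetic behind that estimate can be sketched as follows. The price per million tokens and the cached-read multiplier are placeholder assumptions, not quoted rates; substitute the current figures from your provider's pricing page. The sketch also ignores cache-write premiums and cache misses, so treat the result as an upper bound on savings.

```python
def daily_prefix_cost(prefix_tokens: int, calls_per_day: int,
                      price_per_mtok: float, rate_multiplier: float = 1.0) -> float:
    """Daily cost in dollars of the repeated prompt prefix alone."""
    total_tokens = prefix_tokens * calls_per_day
    return total_tokens / 1_000_000 * price_per_mtok * rate_multiplier


# Hypothetical numbers for illustration only.
PRICE_PER_MTOK = 3.00      # assumed input price, dollars per million tokens
CACHED_MULTIPLIER = 0.10   # assumed: cached reads billed at 10% of the input price

uncached = daily_prefix_cost(2000, 10_000, PRICE_PER_MTOK)
cached = daily_prefix_cost(2000, 10_000, PRICE_PER_MTOK, CACHED_MULTIPLIER)
savings = uncached - cached

print(f"uncached: ${uncached:.2f}/day, cached: ${cached:.2f}/day, saved: ${savings:.2f}/day")
```

Under these assumed rates the 20M daily prefix tokens cost a tenth of what they would uncached; the absolute dollar figure scales linearly with whatever real price you plug in.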

Audit 2 — Model Right-Sizing (savings: 60–90% on over-specified models)

Check: Are you using a frontier model (GPT-4o, Claude Opus) for tasks that a smaller model handles just as well?

| Task | Recommended Model |
| --- | --- |
| Classification, routing, yes/no decisions | GPT-4o-mini, Claude Haiku |
| Summarization, extraction, translation | GPT-4o-mini, Claude Sonnet |
| Complex reasoning, code generation | GPT-4o, Claude Sonnet |
| Novel research, multi-step agent planning | Claude Opus, o1 |

Implement a model router:

def route_model(task_type: str, complexity: str) -> str:
    if task_type in ("classify", "extract", "translate"):
        return "claude-haiku-4-5-20251001"
    if complexity == "high" or task_type == "code_generation":
        return "claude-sonnet-4-6"
    return "claude-haiku-4-5-20251001"  # default to cheap
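A table-driven variant can make the routing policy easier to audit and extend. The sketch below is one possible arrangement, not the skill's prescribed router: the task names, the tier labels, and the rule that high complexity bumps a small-tier task up one tier are all assumptions for illustration. The model IDs are the ones already used in this document.

```python
# Assumed mapping from tier to model ID; adjust to the models you actually run.
MODEL_BY_TIER = {
    "small": "claude-haiku-4-5-20251001",
    "medium": "claude-sonnet-4-6",
    "large": "claude-opus-4-6",
}

# Hypothetical task names; unknown tasks fall back to the cheapest tier.
TIER_BY_TASK = {
    "classify": "small",
    "extract": "small",
    "translate": "small",
    "code_generation": "medium",
    "complex_reasoning": "medium",
    "research": "large",
    "agent_planning": "large",
}


def route_tier(task_type: str, complexity: str = "low") -> str:
    """Pick a tier from the task type, bumping small tasks up when complexity is high."""
    tier = TIER_BY_TASK.get(task_type, "small")
    if complexity == "high" and tier == "small":
        tier = "medium"
    return tier


def route_model_by_table(task_type: str, complexity: str = "low") -> str:
    """Resolve the tier to a concrete model ID."""
    return MODEL_BY_TIER[route_tier(task_type, complexity)]
```

Keeping the policy in two dictionaries means a pricing or model change is a one-line edit, and the mapping can be logged alongside each call to measure how much traffic each tier receives.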

Audit 3 — Token Reduction (savings: 20–40% on bloated prompts)

Check: Is the system prompt longer than it needs to be?

Common bloat patterns:

  • Repeating the same instruction multiple ways ("Be concise. Keep answers short. Don't ramble.")
  • Long examples when one would do
  • Full document context when only a section is needed
  • Verbose role descriptions

Token reduction techniques:

  1. Compress examples — use 1 example instead of 3 if the task is clear
  2. Use structured format — bullet points use fewer tokens than prose instructions
  3. Trim RAG context — retrieve top-3 chunks, not top-10; rerank before sending
  4. Limit output length — set max_tokens to the minimum needed:
# If you only need a one-sentence answer, cap it
response = client.messages.create(
    model=model, max_tokens=100, messages=messages  # model and messages: your existing values
)
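Technique 3 (trimming RAG context) can be sketched as a small helper that keeps only the highest-scoring chunks that fit a token budget. The function names, the `(score, text)` chunk shape, and the budget are hypothetical, and the four-characters-per-token estimate is only a rough rule of thumb for English text; use your provider's tokenizer when you need exact counts.

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate (about 4 characters per token); not billing-accurate."""
    return max(1, len(text) // 4)


def trim_context(chunks, top_k: int = 3, token_budget: int = 1500):
    """Keep the top_k highest-scoring chunks that fit within token_budget.

    chunks: list of (relevance_score, text) pairs, higher score = more relevant.
    Returns (kept_texts, tokens_used).
    """
    ranked = sorted(chunks, key=lambda c: c[0], reverse=True)[:top_k]
    kept, used = [], 0
    for _score, text in ranked:
        cost = approx_tokens(text)
        if used + cost > token_budget:
            continue  # skip a chunk that would exceed the budget
        kept.append(text)
        used += cost
    return kept, used


# Example with synthetic chunks standing in for retrieved passages.
sample = [(0.9, "a" * 400), (0.2, "b" * 400), (0.8, "c" * 400), (0.7, "d" * 4000)]
kept, used = trim_context(sample, top_k=3, token_budget=500)
```

With a 500-token budget the two short high-scoring chunks are kept and the long third-ranked chunk is dropped, which is the behavior you want when a single oversized passage would otherwise dominate the prompt.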

Audit 4 — Response Caching (savings: 30–70% for repetitive queries)

Check: Do users ask similar questions repeatedly?

Cache model responses by a hash of the (system_prompt + user_input) pair:

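import hashlib
import json

import redis

redis_client = redis.Redis()  # assumes a Redis instance on localhost:6379

def get_cached_or_call(system: str, user: str) -> str:
    # Hash a JSON-encoded pair so ("a:b", "c") and ("a", "b:c") cannot collide
    key = hashlib.sha256(json.dumps([system, user]).encode()).hexdigest()
    cached = redis_client.get(key)
    if cached is not None:
        return json.loads(cached)

    response = call_llm(system, user)  # call_llm: your existing wrapper, returns a string
    redis_client.setex(key, 3600, json.dumps(response))  # cache for 1 hour
    return response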

Use semantic similarity for fuzzy cache hits if the exact-match hit rate is low.
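A fuzzy cache of that kind can be sketched as below. Everything here is illustrative: `toy_embed` is a bag-of-words stand-in for a real embedding model, `SemanticCache` is a hypothetical in-memory class, and the 0.8 threshold is an arbitrary starting point that needs tuning against real traffic. In practice you would also keep a separate cache per system prompt so answers never leak across differently configured assistants.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """In-memory cache that returns a stored response for a similar-enough query."""

    def __init__(self, threshold: float = 0.8, embed=toy_embed):
        self.threshold = threshold
        self.embed = embed
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query: str):
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))


cache = SemanticCache()
cache.put("how do I reset my password", "Use the reset link on the login page.")
hit = cache.get("how do I reset my password please")
miss = cache.get("what is your refund policy")
```

The linear scan is fine for a few thousand entries; beyond that, a vector index is the usual replacement. A threshold set too low returns wrong answers for merely related questions, so it is worth sampling cache hits and checking them by hand before relying on the savings.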

Audit 5 — Batching (savings: 50% on cost for latency-tolerant async workloads)

Check: Are you running background jobs (document processing, bulk analysis) one-at-a-time?

Both OpenAI and Anthropic offer Batch APIs at a 50% discount for async workloads. The trade-off is latency: results arrive asynchronously rather than in real time.

# Anthropic Batch API
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc_{i}",
            "params": {
                "model": "...",
                "max_tokens": 1024,  # required; size it to the expected output
                "messages": [{"role": "user", "content": doc}],
            },
        }
        for i, doc in enumerate(documents)
    ]
)
# Results are available within 24 hours at 50% of standard price

Use when: processing 100+ documents, nightly summarization jobs, bulk classification.
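For larger jobs it can help to build the request list separately and submit it in chunks. The helper names, the default instruction text, and the chunk size below are assumptions for illustration; providers cap the number of requests and total payload per batch, so check the current limits before choosing a size. The request shape mirrors the one used in the Anthropic example above.

```python
def build_batch_requests(documents, model: str, max_tokens: int = 256,
                         instruction: str = "Summarize the following document."):
    """Turn a list of document strings into batch request dicts with stable IDs."""
    requests = []
    for i, doc in enumerate(documents):
        requests.append({
            "custom_id": f"doc_{i}",  # used later to match results back to inputs
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": f"{instruction}\n\n{doc}"}],
            },
        })
    return requests


def chunked(items, size: int):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# "example-model" is a placeholder, not a real model ID.
reqs = build_batch_requests(["alpha", "beta", "gamma"], model="example-model")
chunks = list(chunked(reqs, 2))
```

Each chunk can then be passed as the `requests` argument to `client.messages.batches.create`. Because `custom_id` is derived from the document's position, results can be joined back to the source documents regardless of the order in which they complete.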

Audit 6 — Streaming Efficiency

Check: Are you streaming responses but storing the full output anyway?

If you don't need to stream to the user, disable streaming. Streaming does not change token pricing, but it adds slightly more overhead for short responses and more client-side handling code. Only stream when showing real-time output to users.

Cost Estimate Template

After auditing, produce a cost breakdown:

| Optimization | Monthly Savings Estimate | Effort |
| --- | --- | --- |
| Prompt caching | $X | Low |
| Switch summarization to Haiku | $X | Low |
| Cap max_tokens on short-answer routes | $X | Low |
| Response caching (top 20% queries) | $X | Medium |
| Batch API for nightly jobs | $X | Medium |
| Total | $X | |
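The breakdown can be generated from a list of findings, as in this minimal sketch. The function name and the dollar figures in the example are placeholders, not measurements. Note that simply summing the rows can overstate the total when optimizations overlap: for example, prompt caching saves less in absolute terms once the same calls have moved to a cheaper model.

```python
def render_estimate(rows):
    """Render a cost breakdown table.

    rows: list of (optimization, monthly_savings_usd, effort) tuples.
    """
    lines = [
        "| Optimization | Monthly Savings Estimate | Effort |",
        "| --- | --- | --- |",
    ]
    total = 0.0
    for name, savings, effort in rows:
        total += savings
        lines.append(f"| {name} | ${savings:,.0f} | {effort} |")
    lines.append(f"| Total | ${total:,.0f} | |")
    return "\n".join(lines)


# Placeholder figures for illustration only.
table = render_estimate([
    ("Prompt caching", 1200, "Low"),
    ("Batch API for nightly jobs", 300, "Medium"),
])
print(table)
```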

Example

Input:

"We use Claude Opus for everything. System prompt is 3000 tokens. We do 5000 calls/day for customer support — mostly classifying intent and drafting short replies."

Output:

Critical finding: Wrong model for workload. Intent classification and short reply drafting = Haiku-level tasks. Switching to claude-haiku-4-5-20251001 saves ~85% per token.

Prompt caching: 3000-token system prompt × 5000 calls = 15M cached tokens/day. Enable cache_control on your system prompt.

Combined monthly savings estimate: ~$2,800/month based on Anthropic pricing, down from ~$3,400 to ~$600.

Quality

0.45 / 1.00

Deterministic score 0.45 from registry signals: indexed on github topic:agent-skills · 8 github stars · SKILL.md body (6,402 chars)

Provenance

Indexed from: github
Enriched: 2026-05-18 19:13:22Z · deterministic:skill-github:v1 · v1
First seen: 2026-05-18
Last seen: 2026-05-18
