openai-whisper
Speech-to-text transcription via OpenAI Whisper. Supports two modes — Local CLI (no API key, runs on-device) and Cloud API (fast, scalable, requires OPENAI_API_KEY). Use when the user needs to transcribe audio files, translate speech, or convert audio to text.
What it does
OpenAI Whisper — Speech-to-Text
Transcribe audio files using OpenAI's Whisper model. Two modes are available, depending on your needs:
| Mode | Latency | Cost | Privacy | Setup |
|---|---|---|---|---|
| Local CLI | Slower (on-device GPU/CPU) | Free | Audio never leaves machine | Install whisper binary |
| Cloud API | Fast | Per-minute pricing | Audio sent to OpenAI | OPENAI_API_KEY required |
Mode 1: Local CLI
Run Whisper locally with no API key required. Models download to ~/.cache/whisper on first run.
Quick Start
whisper /path/audio.mp3 --model medium --output_format txt --output_dir .
Common Commands
# Transcribe to text file
whisper /path/audio.mp3 --model medium --output_format txt --output_dir .
# Translate speech to English (turbo is not trained for translation, so pick a multilingual model)
whisper /path/audio.m4a --model medium --task translate --output_format srt
# Transcribe with specific language
whisper /path/audio.wav --model large --language en --output_format json
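The commands above handle one file at a time. For a folder of recordings, a small wrapper can pass every audio file to a single whisper invocation, so the model is loaded only once. This is a minimal sketch: transcribe_dir is a hypothetical helper name, and the extension list and the turbo default are assumptions to adjust.

```shell
# transcribe_dir DIR [MODEL]
# Collect the .mp3/.m4a/.wav files in DIR and hand them to one whisper call,
# writing .txt transcripts into DIR. Hypothetical helper, not part of whisper.
transcribe_dir() {
  dir="$1"
  model="${2:-turbo}"   # assumed default; override with a second argument
  set --                # reuse the positional parameters as the file list
  for f in "$dir"/*.mp3 "$dir"/*.m4a "$dir"/*.wav; do
    # An unmatched glob stays literal, so keep only paths that exist.
    if [ -e "$f" ]; then
      set -- "$@" "$f"
    fi
  done
  if [ "$#" -eq 0 ]; then
    echo "no audio files found in $dir" >&2
    return 0
  fi
  whisper "$@" --model "$model" --output_format txt --output_dir "$dir"
}
```

Usage would look like `transcribe_dir ~/recordings small`.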
Model Selection
| Model | Speed | Accuracy | VRAM |
|---|---|---|---|
| tiny | Fastest | Lowest | ~1 GB |
| base | Fast | Low | ~1 GB |
| small | Medium | Good | ~2 GB |
| medium | Slow | Better | ~5 GB |
| large | Slowest | Best | ~10 GB |
| turbo | Fast | Good (default) | ~6 GB |
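One way to apply the table is to pick the most accurate model whose approximate VRAM figure fits the GPU. The sketch below hard-codes the thresholds from the table; pick_whisper_model is a hypothetical helper, and it deliberately leaves out turbo, which trades some accuracy for speed and is better chosen by hand.

```shell
# pick_whisper_model VRAM_GB
# Print the most accurate model from the table that fits in VRAM_GB
# (whole gigabytes). Thresholds are the approximate figures listed above.
pick_whisper_model() {
  if [ "$1" -ge 10 ]; then
    echo large
  elif [ "$1" -ge 5 ]; then
    echo medium
  elif [ "$1" -ge 2 ]; then
    echo small
  else
    echo base   # tiny and base both need about 1 GB; base is the more accurate
  fi
}
```

For example, `whisper /path/audio.mp3 --model "$(pick_whisper_model 8)"` would run the medium model. On NVIDIA hardware the total memory can usually be read with `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which reports MiB rather than GB.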
Output Formats
- txt: Plain text transcript
- srt: SubRip subtitle format with timestamps
- vtt: WebVTT subtitle format
- json: Detailed JSON with segment-level timestamps (word-level when run with --word_timestamps True)
- tsv: Tab-separated values
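The tsv output is the easiest of these to post-process with standard tools. The sketch below turns it into a readable timeline. It assumes the layout written by current openai-whisper releases: a header row, then start, end and text columns with times in milliseconds. Check a real file before relying on that. tsv_to_timeline is a hypothetical helper name.

```shell
# tsv_to_timeline FILE.tsv
# Print one "[mm:ss] text" line per segment, using the start time.
# Assumes columns start<TAB>end<TAB>text with times in milliseconds.
tsv_to_timeline() {
  awk -F '\t' 'NR > 1 {
    s = int($1 / 1000)                       # start time in whole seconds
    printf "[%02d:%02d] %s\n", int(s / 60), s % 60, $3
  }' "$1"
}
```

Running `tsv_to_timeline audio.tsv` after `whisper audio.mp3 --output_format tsv` gives a quick way to skim a long recording.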
Notes
- --model defaults to turbo on most installs
- Use smaller models for speed, larger for accuracy
- GPU acceleration is used automatically when a CUDA device is available
Mode 2: Cloud API
Transcribe via OpenAI's /v1/audio/transcriptions endpoint. Faster for large batches, no local GPU needed.
Quick Start
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a
Defaults:
- Model: whisper-1
- Output: <input>.txt
Common Commands
# Basic transcription
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a
# Specify model and output
{baseDir}/scripts/transcribe.sh /path/to/audio.ogg --model whisper-1 --out /tmp/transcript.txt
# With language hint
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a --language en
# With speaker name hints (improves accuracy)
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a --prompt "Speaker names: Peter, Daniel"
# JSON output with timestamps
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a --json --out /tmp/transcript.json
Raw curl Example
curl https://api.openai.com/v1/audio/transcriptions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F file="@/path/to/audio.m4a" \
-F model="whisper-1" \
-F response_format="text"
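The raw request can be wrapped with a size check, since OpenAI's documentation lists a 25 MB cap on uploads to this endpoint at the time of writing. The following is a sketch under that assumption: fits_upload_limit and transcribe_api are hypothetical helper names, and the limit may change. The Content-Type header is left out because curl sets it, boundary included, whenever -F is used.

```shell
# Assumed upload cap for /v1/audio/transcriptions (25 MB); verify against current docs.
MAX_UPLOAD_BYTES=$((25 * 1024 * 1024))

# fits_upload_limit FILE: succeed if FILE exists and is within the cap.
fits_upload_limit() {
  [ -f "$1" ] || return 2
  size=$(($(wc -c < "$1")))   # arithmetic expansion strips wc padding
  [ "$size" -le "$MAX_UPLOAD_BYTES" ]
}

# transcribe_api FILE: print the plain-text transcript of FILE to stdout.
transcribe_api() {
  if ! fits_upload_limit "$1"; then
    echo "skipping $1: missing or larger than the upload limit" >&2
    return 1
  fi
  curl -sS https://api.openai.com/v1/audio/transcriptions \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -F file="@$1" \
    -F model="whisper-1" \
    -F response_format="text"
}
```

Files over the limit need to be split or re-encoded at a lower bitrate first, or transcribed with the local CLI instead.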
API Key Setup
Set the OPENAI_API_KEY environment variable, or configure the key in ~/.clawdbot/clawdbot.json:
{
  skills: {
    "openai-whisper-api": {
      apiKey: "OPENAI_KEY_HERE"
    }
  }
}
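For the environment-variable route, a guard at the top of any script that calls the API gives a clear failure instead of a 401 from the server. This is a minimal sketch: require_api_key is a hypothetical helper, and it does not read the JSON config file.

```shell
# require_api_key: fail early with a readable message when the key is missing.
require_api_key() {
  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY is not set; export it before using Cloud API mode" >&2
    return 1
  fi
}
```

A script would call `require_api_key || exit 1` before its first request.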
Choosing Between Modes
| Consideration | Local CLI | Cloud API |
|---|---|---|
| Privacy-sensitive audio | Best | Audio sent to OpenAI |
| Large batch processing | Slow without GPU | Fast and parallel |
| Offline usage | Works offline | Requires internet |
| Cost | Free (hardware cost) | Per-minute pricing |
| Setup complexity | Install binary + models | API key only |
| Audio format support | Anything ffmpeg can decode | Common formats such as mp3, m4a, wav, webm; 25 MB upload limit |
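The table can be reduced to a simple default rule: keep privacy-sensitive audio on the machine, fall back to local when no API key is configured, and use the cloud otherwise. The sketch below encodes only that rule. choose_whisper_mode is a hypothetical helper, and real decisions may also weigh batch size, cost and connectivity.

```shell
# choose_whisper_mode SENSITIVE
# SENSITIVE is "yes" or "no". Prints "local" or "cloud".
choose_whisper_mode() {
  if [ "$1" = "yes" ]; then
    echo local          # audio must not leave the machine
  elif [ -z "${OPENAI_API_KEY:-}" ]; then
    echo local          # no key available, so the cloud mode cannot run
  else
    echo cloud
  fi
}
```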