{"id":"8e5d7fd7-e923-4446-80fe-581d66f5f964","shortId":"2uZDYJ","kind":"skill","title":"voice-agents","tagline":"Voice agents represent the frontier of AI interaction - humans","description":"# Voice Agents\n\nVoice agents represent the frontier of AI interaction - humans speaking\nnaturally with AI systems. The challenge isn't just speech recognition\nand synthesis, it's achieving natural conversation flow with sub-800ms\nlatency while handling interruptions, background noise, and emotional\nnuance.\n\nThis skill covers two architectures: speech-to-speech (OpenAI Realtime API,\nlowest latency, most natural) and pipeline (STT→LLM→TTS, more control,\neasier to debug). Key insight: latency is the constraint. Humans expect\nresponses in 500ms. Every millisecond matters.\n\n84% of organizations are increasing voice AI budgets in 2025. This is the\nyear voice agents go mainstream.\n\n## Principles\n\n- Latency is the constraint - target <800ms end-to-end\n- Jitter (variance) matters as much as absolute latency\n- VAD quality determines conversation flow\n- Interruption handling makes or breaks the experience\n- Start with focused MVP, iterate based on real conversations\n- Combine best-in-class components (Deepgram STT + ElevenLabs TTS)\n\n## Capabilities\n\n- voice-agents\n- speech-to-speech\n- speech-to-text\n- text-to-speech\n- conversational-ai\n- voice-activity-detection\n- turn-taking\n- barge-in-detection\n- voice-interfaces\n\n## Scope\n\n- phone-system-integration → backend\n- audio-processing-dsp → audio-specialist\n- music-generation → audio-specialist\n- accessibility-compliance → accessibility-specialist\n\n## Tooling\n\n### Speech_to_speech\n\n- OpenAI Realtime API - When: Lowest latency, most natural conversation Note: gpt-4o-realtime-preview, native voice, sub-500ms\n- Pipecat - When: Open-source voice orchestration Note: Daily-backed, enterprise-grade, modular\n\n### Speech_to_text\n\n- OpenAI Whisper - When: Highest accuracy, multilingual Note: 
gpt-4o-transcribe for best results\n- Deepgram Nova-3 - When: Production workloads, 54% lower WER Note: 150-184ms TTFT, 90%+ accuracy on noisy audio\n- AssemblyAI - When: Real-time streaming, speaker diarization Note: Good accuracy-latency balance\n\n### Text_to_speech\n\n- ElevenLabs - When: Most natural voice, emotional control Note: Flash model 75ms latency, V3 for expression\n- OpenAI TTS - When: Integrated with OpenAI stack Note: gpt-4o-mini-tts, 13 voices, streaming\n- Deepgram Aura-2 - When: Cost-effective production TTS Note: 40% cheaper than ElevenLabs, 184ms TTFB\n\n### Frameworks\n\n- Pipecat - When: Open-source voice agent orchestration Note: Silero VAD, SmartTurn, interruption handling\n- Vapi - When: Managed voice agent platform Note: No infrastructure management\n- Retell AI - When: Low-latency voice agents Note: Best context preservation on interruption\n\n## Patterns\n\n### Speech-to-Speech Architecture\n\nDirect audio-to-audio processing for lowest latency\n\n**When to use**: Maximum naturalness, emotional preservation, real-time conversation\n\n# SPEECH-TO-SPEECH ARCHITECTURE:\n\n\"\"\"\n[User Audio] → [S2S Model] → [Agent Audio]\n\nAdvantages:\n- Lowest latency (sub-500ms)\n- Preserves emotion, emphasis, accents\n- Most natural conversation flow\n\nDisadvantages:\n- Less control over responses\n- Harder to debug/audit\n- Can't easily modify what's said\n\"\"\"\n\n## OpenAI Realtime API\n\"\"\"\nimport { RealtimeClient } from '@openai/realtime-api-beta';\n\nconst client = new RealtimeClient({\n  apiKey: process.env.OPENAI_API_KEY,\n});\n\n// Configure for voice conversation\nclient.updateSession({\n  modalities: ['text', 'audio'],\n  voice: 'alloy',\n  input_audio_format: 'pcm16',\n  output_audio_format: 'pcm16',\n  instructions: `You are a helpful customer service agent.\n    Be concise and friendly. 
If you don't know something,\n    say so rather than making things up.`,\n  turn_detection: {\n    type: 'server_vad',  // or 'semantic_vad'\n    threshold: 0.5,\n    prefix_padding_ms: 300,\n    silence_duration_ms: 500,\n  },\n});\n\n// Handle audio streams\nclient.on('conversation.item.input_audio_transcription', (event) => {\n  console.log('User said:', event.transcript);\n});\n\nclient.on('response.audio.delta', (event) => {\n  // Stream audio to speaker\n  audioPlayer.write(Buffer.from(event.delta, 'base64'));\n});\n\n// Send user audio\nclient.appendInputAudio(audioBuffer);\n\"\"\"\n\n## Use Cases:\n- Real-time customer support\n- Voice assistants\n- Interactive voice response (IVR)\n- Live language translation\n\n### Pipeline Architecture\n\nSeparate STT → LLM → TTS for maximum control\n\n**When to use**: Need to know/control exactly what's said, debugging, compliance\n\n# PIPELINE ARCHITECTURE:\n\n\"\"\"\n[Audio] → [STT] → [Text] → [LLM] → [Text] → [TTS] → [Audio]\n\nAdvantages:\n- Full control at each step\n- Can log/audit all text\n- Easier to debug\n- Mix best-in-class components\n\nDisadvantages:\n- Higher latency (700-1200ms typical)\n- Loses some emotion/nuance\n- More components to manage\n\"\"\"\n\n## Production Pipeline Example\n\"\"\"\nimport { Deepgram } from '@deepgram/sdk';\nimport { ElevenLabsClient } from 'elevenlabs';\nimport OpenAI from 'openai';\n\n// Initialize clients\nconst deepgram = new Deepgram(process.env.DEEPGRAM_API_KEY);\nconst elevenlabs = new ElevenLabsClient();\nconst openai = new OpenAI();\n\nasync function processVoiceInput(audioStream) {\n  // 1. 
Speech-to-Text (Deepgram Nova-3)\n  const transcription = await deepgram.transcription.live({\n    model: 'nova-3',\n    punctuate: true,\n    endpointing: 300,  // ms of silence before end\n  });\n\n  transcription.on('transcript', async (data) => {\n    if (data.is_final && data.speech_final) {\n      const userText = data.channel.alternatives[0].transcript;\n      console.log('User:', userText);\n\n      // 2. LLM Processing\n      const completion = await openai.chat.completions.create({\n        model: 'gpt-4o-mini',\n        messages: [\n          { role: 'system', content: 'You are a concise voice assistant.' },\n          { role: 'user', content: userText }\n        ],\n        max_tokens: 150,  // Keep responses short for voice\n      });\n\n      const agentText = completion.choices[0].message.content;\n      console.log('Agent:', agentText);\n\n      // 3. Text-to-Speech (ElevenLabs)\n      const audioStream = await elevenlabs.textToSpeech.stream({\n        voice_id: 'voice_id_here',\n        text: agentText,\n        model_id: 'eleven_flash_v2_5',  // Lowest latency\n      });\n\n      // Stream to user\n      playAudioStream(audioStream);\n    }\n  });\n\n  // Pipe audio to transcription\n  audioStream.pipe(transcription);\n}\n\"\"\"\n\n## Optimization Tips:\n- Start TTS while LLM still generating (streaming)\n- Pre-compute first response segment during user speech\n- Use Flash/turbo models for latency\n\n### Voice Activity Detection Pattern\n\nDetect when user starts/stops speaking\n\n**When to use**: All voice agents need VAD for turn-taking\n\n# VOICE ACTIVITY DETECTION (VAD):\n\n\"\"\"\nVAD Types:\n1. Energy-based: Simple, fast, noise-sensitive\n2. Model-based: Silero VAD, more accurate\n3. 
Semantic VAD: Understands meaning, best for conversation\n\"\"\"\n\n## Silero VAD (Popular Open Source)\n\"\"\"\nimport { SileroVAD } from '@pipecat-ai/silero-vad';\n\nconst vad = new SileroVAD({\n  threshold: 0.5,           // Speech probability threshold\n  min_speech_duration: 250, // ms before speech confirmed\n  min_silence_duration: 500, // ms of silence = end of turn\n});\n\nvad.on('speech_start', () => {\n  console.log('User started speaking');\n  // Stop any playing TTS (barge-in)\n  audioPlayer.stop();\n});\n\nvad.on('speech_end', () => {\n  console.log('User finished speaking');\n  // Trigger response generation\n  processTranscript();\n});\n\n// Feed audio to VAD\naudioStream.on('data', (chunk) => {\n  vad.process(chunk);\n});\n\"\"\"\n\n## OpenAI Semantic VAD\n\"\"\"\n// In Realtime API session config\nclient.updateSession({\n  turn_detection: {\n    type: 'semantic_vad',  // Uses meaning, not just silence\n    // Model waits longer after \"ummm...\"\n    // Responds faster after \"Yes, that's correct.\"\n  },\n});\n\"\"\"\n\n## Barge-In Handling\n\"\"\"\n// When user interrupts:\nfunction handleBargeIn() {\n  // 1. Stop TTS immediately\n  audioPlayer.stop();\n\n  // 2. Cancel pending LLM generation\n  llmController.abort();\n\n  // 3. Reset state\n  conversationState.checkpoint();\n\n  // 4. 
Listen to new input\n  startListening();\n}\n\n// VAD triggers barge-in\nvad.on('speech_start', () => {\n  if (audioPlayer.isPlaying) {\n    handleBargeIn();\n  }\n});\n\"\"\"\n\n### Latency Optimization Pattern\n\nAchieving <800ms end-to-end response time\n\n**When to use**: Production voice agents\n\n# LATENCY OPTIMIZATION:\n\n\"\"\"\nTarget Metrics:\n- End-to-end: <800ms (ideal: <500ms)\n- Time-to-First-Token (TTFT): <300ms\n- Barge-in response: <200ms\n- Jitter variance: <100ms std dev\n\"\"\"\n\n## Pipeline Latency Breakdown\n\"\"\"\nTypical breakdown:\n- VAD processing: 50-100ms\n- STT first result: 150-200ms\n- LLM TTFT: 100-300ms\n- TTS TTFA: 75-200ms\n- Audio buffering: 50-100ms\n\nTotal: 425-900ms\n\"\"\"\n\n## Optimization Strategies\n\n### 1. Streaming Everything\n\"\"\"\n// Stream STT results as they come\nstt.on('partial_transcript', (text) => {\n  // Start processing before final transcript\n  llmPreprocessor.prepare(text);\n});\n\n// Stream LLM output to TTS\nconst llmStream = await openai.chat.completions.create({\n  stream: true,\n  // ...\n});\n\nfor await (const chunk of llmStream) {\n  tts.appendText(chunk.choices[0].delta.content);\n}\n\"\"\"\n\n### 2. Pre-computation\n\"\"\"\n// While user is speaking, predict and prepare\nstt.on('partial_transcript', async (text) => {\n  // Pre-fetch relevant context\n  const context = await retrieveContext(text);\n\n  // Pre-compute likely first sentence\n  const firstSentence = await generateOpener(context);\n});\n\"\"\"\n\n### 3. Use Low-Latency Models\n\"\"\"\n// STT: Deepgram Nova-3 (150ms TTFT)\n// LLM: gpt-4o-mini (fastest GPT-4 class)\n// TTS: ElevenLabs Flash (75ms) or Deepgram Aura-2 (184ms)\n\"\"\"\n\n### 4. 
Edge Deployment\n\"\"\"\n// Run inference closer to user\n// - Cloud regions near user\n// - Edge computing for VAD/STT\n// - WebSocket over HTTP for lower overhead\n\"\"\"\n\n### Conversation Design Pattern\n\nDesigning natural voice conversations\n\n**When to use**: Building voice UX\n\n# CONVERSATION DESIGN:\n\n## Voice-First Principles\n\"\"\"\nVoice is different from text:\n- No undo button - say it right the first time\n- Linear - user can't scroll back\n- Ephemeral - easy to miss information\n- Emotional - tone matters as much as words\n\"\"\"\n\n## Response Design\n\"\"\"\n# Keep responses short (10-20 seconds max)\n# Front-load the answer\n# Use signposting for lists\n\nBad: \"I found several options. The first is... second is...\"\nGood: \"I found 3 options. Want me to go through them?\"\n\n# Confirm understanding\nBad: \"I'll transfer $500 to John.\"\nGood: \"So that's $500 to John Smith. Should I proceed?\"\n\"\"\"\n\n## Prompting for Voice\n\"\"\"\nsystem_prompt = '''\nYou are a voice assistant. Follow these rules:\n\n1. Be concise - keep responses under 30 words\n2. Use natural speech - contractions, casual language\n3. Never use formatting (bullets, numbers in lists)\n4. Spell out numbers and abbreviations\n5. End with a question to keep conversation flowing\n6. If unclear, ask for clarification\n7. Never say \"I'm an AI\" unless asked\n\nGood: \"Got it. I'll set that reminder for three pm. Anything else?\"\nBad: \"I have set a reminder for 3:00 PM. Is there anything else I can assist you with today?\"\n'''\n\"\"\"\n\n## Error Recovery\n\"\"\"\n// Handle recognition errors gracefully\nconst errorResponses = {\n  no_speech: \"I didn't catch that. Could you say it again?\",\n  unclear: \"Sorry, I'm not sure I understood. You said [repeat]. Is that right?\",\n  timeout: \"Still there? I'm here when you're ready.\",\n};\n\n// Always offer human fallback for complex issues\nif (confidenceScore < 0.6) {\n  response = \"I want to make sure I get this right. 
Would you like to speak with a human agent?\";\n}\n\"\"\"\n\n## Sharp Edges\n\n### Response Latency Exceeds 800ms\n\nSeverity: CRITICAL\n\nSituation: Building a voice agent pipeline\n\nSymptoms:\nConversations feel awkward. Users repeat themselves. \"Are you\nthere?\" questions. Users hang up or give up. Low satisfaction\nscores despite correct answers.\n\nWhy this breaks:\nIn human conversation, responses typically arrive within 500ms.\nAnything over 800ms feels like the agent is slow or confused.\nUsers lose confidence and patience. Every component adds latency:\nVAD (100ms) + STT (200ms) + LLM (300ms) + TTS (200ms) = 800ms.\n\nRecommended fix:\n\n# Measure and budget latency for each component:\n\n## Target latencies:\n- VAD processing: <100ms\n- STT time-to-first-token: <200ms\n- LLM time-to-first-token: <300ms\n- TTS time-to-first-audio: <150ms\n- Total end-to-end: <800ms\n\n## Optimization strategies:\n\n1. Use low-latency models:\n   - STT: Deepgram Nova-3 (150ms) vs Whisper (500ms+)\n   - TTS: ElevenLabs Flash (75ms) vs standard (200ms+)\n   - LLM: gpt-4o-mini streaming\n\n2. Stream everything:\n   - Don't wait for full STT transcript\n   - Stream LLM output to TTS\n   - Start audio playback before TTS finishes\n\n3. Pre-compute:\n   - While user speaks, prepare context\n   - Generate opening phrase in parallel\n\n4. Edge deployment:\n   - Run VAD/STT at edge\n   - Use nearest cloud region\n\n## Measure continuously:\nLog timestamps at each stage, track P50/P95 latency\n\n### Response Time Variance Disrupts Rhythm\n\nSeverity: HIGH\n\nSituation: Voice agent with inconsistent response times\n\nSymptoms:\nConversations feel unpredictable. User doesn't know when to speak.\nSometimes agent responds immediately, sometimes after long pause.\nUsers talk over agent. Agent talks over users.\n\nWhy this breaks:\nJitter (variance in response time) disrupts conversational rhythm\nmore than absolute latency. 
Consistent 800ms feels better than\nalternating 400ms and 1200ms. Users can't adapt to unpredictable\ntiming.\n\nRecommended fix:\n\n# Target jitter metrics:\n- Standard deviation: <100ms\n- P95-P50 gap: <200ms\n\n## Reduce jitter sources:\n\n1. Consistent model loading:\n   - Keep models warm\n   - Pre-load on connection start\n\n2. Buffer audio output:\n   - Small buffer (50-100ms) smooths playback\n   - Don't start playing until buffer filled\n\n3. Handle LLM variance:\n   - gpt-4o-mini more consistent than larger models\n   - Set max_tokens to limit long responses\n\n4. Monitor and alert:\n   - Track response time distribution\n   - Alert on jitter spikes\n\n## Implementation:\nconst MIN_RESPONSE_TIME = 400;  // ms\nconst delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));\n\nasync function respondWithConsistentTiming(text) {\n  const startTime = Date.now();\n  const audio = await generateSpeech(text);\n\n  // Pad fast responses up to the floor so timing stays consistent\n  const elapsed = Date.now() - startTime;\n  if (elapsed < MIN_RESPONSE_TIME) {\n    await delay(MIN_RESPONSE_TIME - elapsed);\n  }\n\n  playAudio(audio);\n}\n\n### Using Silence Duration for Turn Detection\n\nSeverity: HIGH\n\nSituation: Detecting when user finishes speaking\n\nSymptoms:\nAgent interrupts user mid-thought, or waits too long after user\nfinishes. \"Let me think...\" triggers premature response. Short\nanswers have awkward pause before response.\n\nWhy this breaks:\nSimple silence detection (e.g., \"end turn after 500ms silence\")\ndoesn't understand conversation. Humans pause mid-sentence.\n\"Yes.\" needs fast response, \"Well, let me think about that...\"\nneeds patience. 
Fixed timeout fits neither.\n\nRecommended fix:\n\n# Use semantic VAD:\n\n## OpenAI Semantic VAD:\nclient.updateSession({\n  turn_detection: {\n    type: 'semantic_vad',\n    // Waits longer after \"umm...\"\n    // Responds faster after \"Yes, that's correct.\"\n  },\n});\n\n## Pipecat SmartTurn:\nconst pipeline = new Pipeline({\n  vad: new SileroVAD(),\n  turnDetection: new SmartTurn(),\n});\n\n// SmartTurn considers:\n// - Speech content (complete sentence?)\n// - Prosody (falling intonation?)\n// - Context (question asked?)\n\n## Fallback: Adaptive silence threshold:\nfunction calculateSilenceThreshold(transcript) {\n  const endsWithComplete = transcript.match(/[.!?]$/);\n  // Whole-word match so \"welcome\" or \"likely\" don't count as fillers;\n  // um+/uh+ also catches \"umm\" and \"uhh\" variants\n  const hasFillers = transcript.match(/\b(um+|uh+|like|well)\b/i);\n\n  if (endsWithComplete && !hasFillers) {\n    return 300;  // Fast response\n  } else if (hasFillers) {\n    return 1500;  // Wait for continuation\n  }\n  return 700;  // Default\n}\n\n### Agent Doesn't Stop When User Interrupts\n\nSeverity: HIGH\n\nSituation: User tries to interrupt agent mid-sentence\n\nSymptoms:\nAgent talks over user. User has to wait for agent to finish.\nFrustrating experience. Users give up and abandon call.\n\"STOP! STOP!\" doesn't work.\n\nWhy this breaks:\nWithout barge-in handling, the TTS plays to completion regardless\nof user input. This violates basic conversational norms - in human\nconversation, we stop when interrupted.\n\nRecommended fix:\n\n# Implement barge-in detection:\n\n## Basic barge-in:\nvad.on('speech_start', () => {\n  if (ttsPlayer.isPlaying) {\n    // 1. Stop audio immediately\n    ttsPlayer.stop();\n\n    // 2. Cancel pending TTS generation\n    ttsController.abort();\n\n    // 3. Checkpoint conversation state\n    conversationState.save();\n\n    // 4. 
Listen to new input\n    startTranscription();\n  }\n});\n\n## Advanced: Distinguish interruption types:\nvad.on('speech_start', async () => {\n  if (!ttsPlayer.isPlaying) return;\n\n  // Wait 200ms to get first words\n  await delay(200);\n  const firstWords = getTranscriptSoFar();\n\n  if (isBackchannel(firstWords)) {\n    // \"uh-huh\", \"yeah\" - don't interrupt\n    return;\n  }\n\n  if (isClarification(firstWords)) {\n    // \"What?\", \"Sorry?\" - repeat last sentence\n    repeatLastSentence();\n  } else {\n    // Real interruption - stop and listen\n    handleFullInterruption();\n  }\n});\n\n## Response time target:\n- Barge-in response: <200ms\n- User should feel heard immediately\n\n### Generating Text-Length Responses for Voice\n\nSeverity: MEDIUM\n\nSituation: Prompting LLM for voice agent responses\n\nSymptoms:\nAgent rambles. Users lose track of information. \"Can you repeat\nthat?\" requests. Users interrupt to ask for shorter version.\nLow comprehension of conveyed information.\n\nWhy this breaks:\nText can be scanned and re-read. Voice is linear and ephemeral.\nA 3-paragraph response that works in chat is overwhelming in voice.\nUsers can only hold ~7 items in working memory.\n\nRecommended fix:\n\n# Constrain response length in prompts:\n\nsystem_prompt = '''\nYou are a voice assistant. Keep responses UNDER 30 WORDS.\nFor complex information, break into chunks and confirm\nunderstanding between each.\n\nInstead of: \"Here are the three options. First, you could...\nSecond... Third...\"\n\nSay: \"I found 3 options. Want me to go through them?\"\n\nNever list more than 3 items without pausing for confirmation.\n'''\n\n## Enforce at generation:\nconst response = await openai.chat.completions.create({\n  max_tokens: 100,  // Hard limit\n  // ...\n});\n\n## Chunking pattern:\nif (information.length > 3) {\n  response = `I have ${information.length} items. Let's go through them one at a time. First: ${information[0]}. 
Ready for the next?`;\n}\n\n## Progressive disclosure:\n\"I found your account. Want the balance, recent transactions, or something else?\"\n// Don't dump all info at once\n\n### Using Bullets/Numbers/Markdown in Voice\n\nSeverity: MEDIUM\n\nSituation: Formatting LLM output for voice\n\nSymptoms:\n\"First bullet point: item one\" read aloud. Numbers read as \"one\ntwo three\" instead of \"one, two, three.\" Markdown artifacts in\nspeech. Robotic, unnatural delivery.\n\nWhy this breaks:\nTTS models read what they're given. Text formatting intended for\nvisual display sounds robotic when read aloud. Users can't \"see\"\nstructure in audio.\n\nRecommended fix:\n\n# Prompt for spoken format:\n\nsystem_prompt = '''\nFormat responses for SPOKEN delivery:\n- No bullet points, numbered lists, or markdown\n- Spell out numbers: \"twenty-three\" not \"23\"\n- Spell out abbreviations: \"United States\" not \"US\"\n- Use verbal signposting: \"There are three things. First...\"\n- Never use asterisks, dashes, or special characters\n'''\n\n## Post-processing:\nfunction prepareForSpeech(text) {\n  return text\n    // Remove markdown\n    .replace(/[*_#`]/g, '')\n    // Convert numbers\n    .replace(/\\d+/g, numToWords)\n    // Expand abbreviations\n    .replace(/\\betc\\b/gi, 'et cetera')\n    .replace(/\\be\\.g\\./gi, 'for example')\n    // Add pauses\n    .replace(/\\. /g, '... ')\n    .replace(/, /g, '... ');\n}\n\n## SSML for precise control:\n<speak>\n  The total is <say-as interpret-as=\"currency\">$49.99</say-as>.\n  <break time=\"500ms\"/>\n  Want to proceed?\n</speak>\n\n### VAD/STT Fails in Noisy Environments\n\nSeverity: MEDIUM\n\nSituation: Users in cars, cafes, outdoors\n\nSymptoms:\n\"I didn't catch that\" frequently. Background noise triggers\nfalse starts. Fan/AC causes continuous listening. Car engine\nnoise confuses STT.\n\nWhy this breaks:\nDefault VAD thresholds work for quiet environments. 
Real-world\nusage includes background noise that triggers false positives\nor masks speech, causing false negatives.\n\nRecommended fix:\n\n# Implement noise handling:\n\n## 1. Noise reduction in STT:\nconst transcription = await deepgram.transcription.live({\n  model: 'nova-3',\n  noise_reduction: true,\n  // or\n  smart_format: true,\n});\n\n## 2. Adaptive VAD threshold:\n// Measure ambient noise level\nconst ambientLevel = measureAmbientNoise(5000);  // 5 sec sample\n\nvad.setThreshold(ambientLevel * 1.5);  // Above ambient\n\n## 3. Confidence filtering:\nstt.on('transcript', (data) => {\n  if (data.confidence < 0.7) {\n    // Low confidence - probably noise\n    askForRepeat();\n    return;\n  }\n  processTranscript(data.transcript);\n});\n\n## 4. Echo cancellation:\n// Prevent agent's voice from being transcribed\nconst echoCanceller = new EchoCanceller();\nechoCanceller.reference(ttsOutput);\nconst cleanedAudio = echoCanceller.process(userAudio);\n\n### STT Produces Incorrect or Hallucinated Text\n\nSeverity: MEDIUM\n\nSituation: Processing unclear or accented speech\n\nSymptoms:\nAgent responds to something user didn't say. Names consistently\nwrong. Technical terms misheard. \"I said X, not Y\" frustration.\n\nWhy this breaks:\nSTT models can hallucinate, especially on proper nouns, technical\nterms, or accented speech. These errors propagate through the\npipeline and produce nonsensical responses.\n\nRecommended fix:\n\n# Mitigate STT errors:\n\n## 1. Use keywords/biasing:\nconst transcription = await deepgram.transcription.live({\n  keywords: ['Acme Corp', 'ProductName', 'John Smith'],\n  keyword_boost: 'high',\n});\n\n## 2. Confirmation for critical info:\nif (containsNameOrNumber(transcript)) {\n  response = `I heard \"${name}\". Is that correct?`;\n}\n\n## 3. Confidence-based fallback:\nif (confidence < 0.8) {\n  response = `I think you said \"${transcript}\". Did I get that right?`;\n}\n\n## 4. 
Multiple hypothesis handling:\n// Some STT APIs return n-best list\nconst alternatives = transcription.alternatives;\nif (alternatives[0].confidence - alternatives[1].confidence < 0.1) {\n  // Ambiguous - ask for clarification\n}\n\n## 5. Error correction patterns:\npromptPattern = `\n  User may correct previous mistakes. If they say \"no, I said X\"\n  or \"not Y, Z\", update your understanding accordingly.\n`;\n\n## Validation Checks\n\n### Missing Latency Measurement\n\nSeverity: ERROR\n\nVoice agents must track latency at each stage\n\nMessage: Voice pipeline without latency tracking. Add timestamps at each stage to measure performance.\n\n### Using Batch STT Instead of Streaming\n\nSeverity: WARNING\n\nStreaming STT reduces latency significantly\n\nMessage: Using batch transcription. Consider streaming for lower latency in voice agents.\n\n### TTS Without Streaming Output\n\nSeverity: WARNING\n\nStreaming TTS reduces time to first audio\n\nMessage: TTS without streaming. Stream audio to reduce time to first audio.\n\n### Hardcoded VAD Silence Threshold\n\nSeverity: WARNING\n\nFixed silence thresholds don't adapt to conversation\n\nMessage: Fixed silence threshold. Consider semantic VAD or adaptive thresholds for better turn-taking.\n\n### Missing Barge-In Handling\n\nSeverity: WARNING\n\nVoice agents should stop when user interrupts\n\nMessage: VAD without barge-in handling. Stop TTS when user starts speaking.\n\n### Voice Prompt Without Length Constraints\n\nSeverity: WARNING\n\nVoice prompts should constrain response length\n\nMessage: Voice prompt without length constraints. Add 'Keep responses under 30 words' to system prompt.\n\n### Markdown Formatting Sent to TTS\n\nSeverity: WARNING\n\nMarkdown will be read literally by TTS\n\nMessage: Check for markdown in TTS input. Strip formatting before sending to TTS.\n\n### STT Without Error Handling\n\nSeverity: WARNING\n\nSTT can fail or return low confidence\n\nMessage: STT without error handling. 
Check confidence scores and handle failures.\n\n### WebSocket Without Reconnection\n\nSeverity: WARNING\n\nRealtime APIs need reconnection handling\n\nMessage: Realtime connection without reconnection logic. Handle disconnects gracefully.\n\n### Missing Noise Handling\n\nSeverity: INFO\n\nReal-world audio includes background noise\n\nMessage: Consider adding noise handling for real-world audio quality.\n\n## Collaboration\n\n### Delegation Triggers\n\n- user needs phone/telephony integration -> backend (Twilio, Vonage, SIP integration)\n- user needs LLM optimization -> llm-architect (Model selection, prompting, fine-tuning)\n- user needs tools for voice agent -> agent-tool-builder (Tool design for voice context)\n- user needs multi-agent voice system -> multi-agent-orchestration (Voice agents working together)\n- user needs accessibility compliance -> accessibility-specialist (Voice interface accessibility)\n\n## Related Skills\n\nWorks well with: `agent-tool-builder`, `multi-agent-orchestration`, `llm-architect`, `backend`\n\n## When to Use\n- User mentions or implies: voice agent\n- User mentions or implies: speech to text\n- User mentions or implies: text to speech\n- User mentions or implies: whisper\n- User mentions or implies: elevenlabs\n- User mentions or implies: deepgram\n- User mentions or implies: realtime api\n- User mentions or implies: voice assistant\n- User mentions or implies: voice ai\n- User mentions or implies: conversational ai\n- User mentions or implies: tts\n- User mentions or implies: stt\n- User mentions or implies: asr\n\n## Limitations\n- Use this skill only when the task clearly matches the scope described above.\n- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.\n- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are 
missing.","tags":["voice","agents","antigravity","awesome","skills","sickn33","agent-skills","agentic-skills","ai-agent-skills","ai-agents","ai-coding","ai-workflows"],"capabilities":["skill","source-sickn33","skill-voice-agents","topic-agent-skills","topic-agentic-skills","topic-ai-agent-skills","topic-ai-agents","topic-ai-coding","topic-ai-workflows","topic-antigravity","topic-antigravity-skills","topic-claude-code","topic-claude-code-skills","topic-codex-cli","topic-codex-skills"],"categories":["antigravity-awesome-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/sickn33/antigravity-awesome-skills/voice-agents","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add sickn33/antigravity-awesome-skills","source_repo":"https://github.com/sickn33/antigravity-awesome-skills","install_from":"skills.sh"}},"qualityScore":"0.700","qualityRationale":"deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 34404 github stars · SKILL.md body (25,834 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-04-22T00:51:56.867Z","embedding":null,"createdAt":"2026-04-18T21:47:17.686Z","updatedAt":"2026-04-22T00:51:56.867Z","lastSeenAt":"2026-04-22T00:51:56.867Z","tsv":"'-100':1079,1100,1831 '-1200':640 '-184':290 '-2':348,1214 '-20':1295 '-200':1085,1095 '-3':281,693,700,1195,1649,2687 '-300':1090 '-4':1205 '-900':1104 '/g':2573,2578,2596,2598 '/gi':2590 '/silero-vad':891 '/um':2050 '0':722,764,1147,2420,2885 '0.1':2890 '0.5':534,897 '0.6':1500 '0.7':2723 '0.8':2856 '00':1435 '1':686,855,994,1108,1361,1640,1811,2161,2676,2818,2888 '1.5':2712 '10':1294 '100':1089,2396 '100ms':1068,1589,1610,1802 '1200ms':1787 '13':343 
## Source

- Repository: [sickn33/antigravity-awesome-skills](https://github.com/sickn33/antigravity-awesome-skills) (MIT license)
- Skill path: `skills/voice-agents/SKILL.md`
- Skill page: https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/voice-agents
- Listing: https://skills.sh/sickn33/antigravity-awesome-skills/voice-agents
- Repo description: "Installable GitHub library of 1,400+ agentic skills for Claude Code, Cursor, Codex CLI, Gemini CLI, Antigravity, and more. Includes installer CLI, bundles, workflows, and official/community skill collections."