{"id":"0ff1cd81-d2e4-497d-ab3c-ce5ce32cc764","shortId":"PBZMTX","kind":"skill","title":"voice-ai","tagline":"Voice AI architecture and implementation guide. Covers two architectures: speech-to-speech (OpenAI Realtime API, lowest latency) and pipeline (STT->LLM->TTS, more control). Includes provider-specific patterns for OpenAI Realtime, Vapi, Deepgram, ElevenLabs, and LiveKit. Use when ","description":"# Voice AI — Architecture & Implementation\n\nYou are a voice AI architect who has shipped production voice agents handling millions of calls. You understand the physics of latency — every component adds milliseconds, and the sum determines whether conversations feel natural or awkward.\n\n## Core Insight: Two Architectures\n\n| Architecture | Latency | Control | Best For |\n|-------------|---------|---------|----------|\n| Speech-to-Speech (S2S) | Lowest (~200-400ms) | Less controllable | Natural conversation, emotion preservation |\n| Pipeline (STT->LLM->TTS) | Higher (~600-1200ms) | Full control at each step | Custom logic, debugging, provider mixing |\n\n---\n\n## Part 1: Architecture Patterns\n\n### Speech-to-Speech Architecture\n\nDirect audio-to-audio processing for lowest latency. Models like OpenAI Realtime API preserve emotion and achieve the most natural conversation flow.\n\n**Strengths:**\n- Preserves vocal emotion and nuance\n- Lowest end-to-end latency\n- Single provider simplicity\n\n**Weaknesses:**\n- Less controllable intermediate steps\n- Harder to debug\n- Provider lock-in\n\n### Pipeline Architecture\n\nSeparate STT -> LLM -> TTS for maximum control at each step.\n\n**Strengths:**\n- Mix best-in-class providers (Deepgram STT + GPT-4o + ElevenLabs TTS)\n- Debug each component independently\n- Custom logic between steps (filters, guardrails, logging)\n\n**Weaknesses:**\n- Higher cumulative latency\n- More integration complexity\n- More failure points\n\n### Voice Activity Detection (VAD)\n\nDetect when user starts/stops speaking. 
Critical for natural turn-taking.\n\n**Key metrics:**\n- Silence threshold: 500-1000ms typical\n- Prefix padding: 200-300ms to avoid clipping speech start\n- Use semantic VAD (context-aware) over silence-only detection\n\n---\n\n## Part 2: Provider Implementation\n\n### OpenAI Realtime API\n\nNative voice-to-voice with GPT-4o. Best for integrated voice AI without separate STT/TTS.\n\n```python\nimport asyncio\nimport websockets\nimport json\nimport base64\n\nOPENAI_API_KEY = \"sk-...\"\n\nasync def voice_session():\n    url = \"wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview\"\n    headers = {\n        \"Authorization\": f\"Bearer {OPENAI_API_KEY}\",\n        \"OpenAI-Beta\": \"realtime=v1\"\n    }\n\n    # Note: websockets >= 14 renames extra_headers to additional_headers\n    async with websockets.connect(url, extra_headers=headers) as ws:\n        # Configure session\n        await ws.send(json.dumps({\n            \"type\": \"session.update\",\n            \"session\": {\n                \"modalities\": [\"text\", \"audio\"],\n                \"voice\": \"alloy\",  # alloy, echo, fable, onyx, nova, shimmer\n                \"input_audio_format\": \"pcm16\",\n                \"output_audio_format\": \"pcm16\",\n                \"input_audio_transcription\": {\n                    \"model\": \"whisper-1\"\n                },\n                \"turn_detection\": {\n                    \"type\": \"server_vad\",\n                    \"threshold\": 0.5,\n                    \"prefix_padding_ms\": 300,\n                    \"silence_duration_ms\": 500\n                },\n                \"tools\": [\n                    {\n                        \"type\": \"function\",\n                        \"name\": \"get_weather\",\n                        \"description\": \"Get weather for a location\",\n                        \"parameters\": {\n                            \"type\": \"object\",\n                            \"properties\": {\n                                \"location\": {\"type\": \"string\"}\n                        
    }\n                        }\n                    }\n                ]\n            }\n        }))\n\n        # Send audio (PCM16, 24kHz, mono)\n        async def send_audio(audio_bytes):\n            await ws.send(json.dumps({\n                \"type\": \"input_audio_buffer.append\",\n                \"audio\": base64.b64encode(audio_bytes).decode()\n            }))\n\n        # Receive events\n        async for message in ws:\n            event = json.loads(message)\n            if event[\"type\"] == \"response.audio.delta\":\n                # Play audio chunk\n                audio_bytes = base64.b64decode(event[\"delta\"])\n                # send to speaker...\n```\n\n### Vapi Voice Agent\n\nBuild voice agents with Vapi platform. Best for phone-based agents and quick deployment.\n\n```python\nfrom flask import Flask, request, jsonify\nimport vapi\n\napp = Flask(__name__)\nclient = vapi.Vapi(api_key=\"...\")\n\n# Create an assistant\nassistant = client.assistants.create(\n    name=\"Support Agent\",\n    model={\n        \"provider\": \"openai\",\n        \"model\": \"gpt-4o\",\n        \"messages\": [\n            {\n                \"role\": \"system\",\n                \"content\": \"You are a helpful support agent...\"\n            }\n        ]\n    },\n    voice={\n        \"provider\": \"11labs\",\n        \"voiceId\": \"21m00Tcm4TlvDq8ikWAM\"  # Rachel\n    },\n    firstMessage=\"Hi! 
How can I help you today?\",\n    transcriber={\n        \"provider\": \"deepgram\",\n        \"model\": \"nova-2\"\n    }\n)\n\n# Webhook for conversation events\n@app.route(\"/vapi/webhook\", methods=[\"POST\"])\ndef vapi_webhook():\n    event = request.json\n\n    if event[\"type\"] == \"function-call\":\n        name = event[\"functionCall\"][\"name\"]\n        args = event[\"functionCall\"][\"parameters\"]\n        if name == \"check_order\":\n            result = check_order(args[\"order_id\"])\n            return jsonify({\"result\": result})\n\n    elif event[\"type\"] == \"end-of-call-report\":\n        transcript = event[\"transcript\"]\n        save_transcript(event[\"call\"][\"id\"], transcript)\n\n    return jsonify({\"ok\": True})\n\n# Start outbound call\ncall = client.calls.create(\n    assistant_id=assistant.id,\n    customer={\"number\": \"+1234567890\"},\n    phoneNumber={\"twilioPhoneNumber\": \"+0987654321\"}\n)\n\n# Or create web call\nweb_call = client.calls.create(\n    assistant_id=assistant.id,\n    type=\"web\"\n)\n# Returns URL for WebRTC connection\n```\n\n### Deepgram STT + ElevenLabs TTS\n\nBest-in-class transcription and synthesis. 
Use when you want the highest quality custom pipeline.\n\n```python\nimport asyncio\nfrom deepgram import DeepgramClient, LiveTranscriptionEvents\nfrom elevenlabs import ElevenLabs\n\n# Deepgram real-time transcription\ndeepgram = DeepgramClient(api_key=\"...\")\n\nasync def transcribe_stream(audio_stream):\n    connection = deepgram.listen.live.v(\"1\")\n\n    async def on_transcript(result):\n        transcript = result.channel.alternatives[0].transcript\n        if transcript:\n            print(f\"Heard: {transcript}\")\n            if result.is_final:\n                await handle_user_input(transcript)\n\n    connection.on(LiveTranscriptionEvents.Transcript, on_transcript)\n\n    await connection.start({\n        \"model\": \"nova-2\",       # Best quality\n        \"language\": \"en\",\n        \"smart_format\": True,\n        \"interim_results\": True,  # Get partial results\n        \"utterance_end_ms\": 1000,\n        \"vad_events\": True,       # Voice activity detection\n        \"encoding\": \"linear16\",\n        \"sample_rate\": 16000\n    })\n\n    async for chunk in audio_stream:\n        await connection.send(chunk)\n\n    await connection.finish()\n\n# ElevenLabs streaming synthesis\neleven = ElevenLabs(api_key=\"...\")\n\ndef text_to_speech_stream(text: str):\n    \"\"\"Stream TTS audio chunks.\"\"\"\n    audio_stream = eleven.text_to_speech.convert_as_stream(\n        voice_id=\"21m00Tcm4TlvDq8ikWAM\",  # Rachel\n        model_id=\"eleven_turbo_v2_5\",       # Fastest\n        text=text,\n        output_format=\"pcm_24000\"           # Raw PCM for low latency\n    )\n    for chunk in audio_stream:\n        yield chunk\n\n# WebSocket for lowest latency TTS\nasync def tts_websocket(text_stream):\n    async with eleven.text_to_speech.stream_async(\n        voice_id=\"21m00Tcm4TlvDq8ikWAM\",\n        model_id=\"eleven_turbo_v2_5\"\n    ) as tts:\n        async for text_chunk in text_stream:\n            audio = await tts.send(text_chunk)\n          
  yield audio\n        final_audio = await tts.flush()\n        yield final_audio\n```\n\n---\n\n## Part 3: Latency Optimization\n\n### Latency Budget\n\nTarget: **< 800ms** total round-trip for natural conversation feel.\n\n| Component | Target | Notes |\n|-----------|--------|-------|\n| STT | 100-200ms | Use interim results |\n| LLM | 200-400ms | Stream tokens |\n| TTS | 100-200ms | Stream audio chunks |\n| Network | 50-100ms | Choose nearest region |\n\n### Streaming Everything\n\nThe single most important optimization: **stream every component**.\n\n- **STT**: Enable interim results for early processing\n- **LLM**: Token streaming to start TTS before LLM finishes\n- **TTS**: Chunk streaming to start playback before full synthesis\n\n### Barge-In Detection\n\nAllow users to interrupt the AI mid-response:\n\n1. Use VAD to detect user speech during AI playback\n2. Immediately stop TTS playback\n3. Clear audio output queue\n4. Process the interruption as new input\n\n---\n\n## Anti-Patterns\n\n### Non-Streaming Pipeline\n**Why bad**: Adds seconds of latency. User perceives as slow. Loses conversation flow.\n**Instead**: Stream everything — STT interim results, LLM token streaming, TTS chunk streaming. Start TTS before LLM finishes.\n\n### Ignoring Interruptions\n**Why bad**: Frustrating user experience. Feels like talking to a machine.\n**Instead**: Implement barge-in detection. Use VAD to detect user speech. Stop TTS immediately. Clear audio queue.\n\n### Silence-Only Turn Detection\n**Why bad**: Misses conversational cues. Cuts off users who pause to think.\n**Instead**: Use semantic VAD that considers context, not just silence duration.\n\n### Long Responses\n**Why bad**: Voice responses over 2-3 sentences feel like lectures.\n**Instead**: Constrain response length in system prompts. Prompt for spoken format (concise, conversational).\n\n### Single Provider Lock-in\n**Why bad**: May not be best quality for each component. 
Single point of failure.\n**Instead**: Mix best providers — Deepgram for STT (speed + accuracy), ElevenLabs for TTS (voice quality), OpenAI/Anthropic for LLM.\n\n---\n\n## Sharp Edges\n\n| Issue | Severity | Solution |\n|-------|----------|----------|\n| Latency exceeds budget | Critical | Measure and budget latency for each component |\n| Jitter in response time | High | Target jitter metrics, use buffering |\n| Poor turn detection | High | Use semantic VAD with context awareness |\n| No barge-in support | High | Implement barge-in detection with VAD |\n| Overly long responses | Medium | Constrain response length in prompts |\n| Unnatural phrasing | Medium | Prompt for spoken format |\n| Background noise issues | Medium | Implement noise handling / filtering |\n| STT transcription errors | Medium | Mitigate with prompt hints and context |\n\n---\n\n## Requirements\n\n- Python or Node.js\n- API keys for chosen providers\n- Audio handling knowledge (PCM, sample rates, encoding)\n- WebSocket support for real-time streaming\n\n## Capabilities\n\n- Voice agent architecture design\n- Speech-to-speech implementation\n- Pipeline (STT->LLM->TTS) implementation\n- Voice activity detection\n- Turn-taking and barge-in detection\n- Latency optimization\n- Provider selection and integration\n\n## Related Skills\n\nWorks well with: `openai-api`, `openai-agents`, `openai-whisper`, 
`ai-product`","tags":["voice","coco","rkz91","agent-skills","agents-md","ai-agents","claude-code","codex","cursor","developer-tools","llm-tools","mcp"],"capabilities":["skill","source-rkz91","skill-voice-ai","topic-agent-skills","topic-agents-md","topic-ai-agents","topic-claude-code","topic-codex","topic-cursor","topic-developer-tools","topic-llm-tools","topic-mcp","topic-pm-tools","topic-product-management","topic-productivity"],"categories":["coco"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/rkz91/coco/voice-ai","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add rkz91/coco","source_repo":"https://github.com/rkz91/coco","install_from":"skills.sh"}},"qualityScore":"0.453","qualityRationale":"deterministic score 0.45 from registry signals: · indexed on github topic:agent-skills · 7 github stars · SKILL.md body (10,964 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T19:14:10.313Z","embedding":null,"createdAt":"2026-05-18T13:21:43.966Z","updatedAt":"2026-05-18T19:14:10.313Z","lastSeenAt":"2026-05-18T19:14:10.313Z","tsv":"'+0987654321':609 '+1234567890':606 '-1':374 '-100':889 '-1000':252 '-1200':114 '-2':533,709 '-200':869,882 '-3':1073 '-300':258 '-400':100,876 '/v1/realtime?model=gpt-4o-realtime-preview':320 '/vapi/webhook':539 '0':685 '0.5':381 '1':127,677,942 '100':868,881 '1000':726 '11labs':516 '16000':737 '2':277,952,1072 '200':99,257,875 '21m00tcm4tlvdq8ikwam':518,774,818 '24000':788 '24khz':412 '3':849,957 '300':385 '4':962 '4o':208,291,503 '5':781,824 '50':888 '500':251,389 '600':113 '800ms':855 'accuraci':1118 'achiev':152 'activ':233,731,1249 'add':72,978 'agent':59,457,460,469,496,513,1235,1275 
'ai':3,5,45,52,296,938,950,1280 'ai-product':1279 'allow':933 'alloy':354,355 'anti':970 'anti-pattern':969 'api':19,148,282,310,326,487,666,754,1214,1272 'api.openai.com':319 'api.openai.com/v1/realtime?model=gpt-4o-realtime-preview':318 'app':482 'app.route':538 'architect':53 'architectur':6,12,46,87,88,128,134,186,1236 'arg':557,568 'assist':491,492,601,617 'assistant.id':603,619 'async':313,333,414,432,668,678,738,806,812,815,827 'asyncio':302,649 'audio':137,139,352,362,366,370,410,417,418,425,427,445,447,672,742,765,767,797,834,840,842,847,885,959,1035,1219 'audio-to-audio':136 'author':322 'avoid':261 'await':344,420,696,705,744,747,835,843 'awar':270,1162 'awkward':83 'background':1192 'bad':977,1009,1043,1068,1097 'barg':930,1022,1165,1171,1256 'barge-in':929,1021,1164,1170,1255 'base':468 'base64':308 'base64.b64decode':449 'base64.b64encode':426 'bearer':324 'best':91,200,292,464,632,710,1101,1112 'best-in-class':199,631 'beta':330 'budget':853,1134,1138 'buffer':1152 'build':458 'byte':419,428,448 'call':63,552,581,589,598,599,613,615 'capabl':1233 'check':563,566 'choos':891 'chosen':1217 'chunk':446,740,746,766,795,800,830,838,886,921,999 'class':202,634 'clear':958,1034 'client':485 'client.assistants.create':493 'client.calls.create':600,616 'clip':262 'complex':228 'compon':71,213,864,903,1105,1142 'concis':1089 'configur':342 'connect':626,674 'connection.finish':748 'connection.on':701 'connection.send':745 'connection.start':706 'consid':1059 'constrain':1079,1180 'content':507 'context':269,1060,1161,1209 'context-awar':268 'control':28,90,103,117,175,193 'convers':79,105,156,536,862,987,1045,1090 'core':84 'cover':10 'creat':489,611 'critic':241,1135 'cue':1046 'cumul':224 'custom':121,215,604,645 'cut':1047 'debug':123,180,211 'decod':429 'deepgram':38,204,530,627,651,659,664,1114 'deepgram.listen.live':675 'deepgramcli':653,665 'def':314,415,542,669,679,756,807 'delta':451 'deploy':472 'descript':396 'design':1237 
'detect':234,236,275,376,732,932,946,1024,1028,1041,1155,1173,1250,1258 'determin':77 'direct':135 'durat':387,1064 'earli':909 'echo':356 'edg':1128 'eleven':752,778,821 'eleven.text_to_speech.convert':769 'eleven.text_to_speech.stream':814 'elevenlab':39,209,629,656,658,749,753,1119 'elif':575 'emot':106,150,161 'en':713 'enabl':905 'encod':733,1225 'end':166,168,579,724 'end-of-call-report':578 'end-to-end':165 'error':1202 'event':431,437,441,450,537,545,548,554,558,576,584,588,728 'everi':70,902 'everyth':895,991 'exceed':1133 'experi':1012 'extra':337 'f':323,690 'fabl':357 'failur':230,1109 'fastest':782 'feel':80,863,1013,1075 'filter':219,1199 'final':695,841,846 'finish':919,1005 'firstmessag':520 'flask':475,477,483 'flow':157,988 'format':363,367,715,786,1088,1191 'frustrat':1010 'full':116,927 'function':392,551 'function-cal':550 'functioncal':555,559 'get':394,397,720 'gpt':207,290,502 'gpt-4o':206,289,501 'guardrail':220 'guid':9 'handl':60,697,1198,1220 'harder':178 'header':321,338,339 'heard':691 'help':511,525 'hi':521 'high':1147,1156,1168 'higher':112,223 'highest':643 'hint':1207 'id':570,590,602,618,773,777,817,820 'ignor':1006 'immedi':953,1033 'implement':8,47,279,1020,1169,1196,1242,1247 'import':301,303,305,307,476,480,648,652,657,899 'includ':29 'independ':214 'input':361,369,699,968 'input_audio_buffer.append':424 'insight':85 'instead':989,1019,1054,1078,1110 'integr':227,294,1264 'interim':717,872,906,993 'intermedi':176 'interrupt':936,965,1007 'issu':1129,1194 'jitter':1143,1149 'json':306 'json.dumps':346,422 'json.loads':438 'jsonifi':479,572,593 'key':247,311,327,488,667,755,1215 'knowledg':1221 'languag':712 'latenc':21,69,89,143,169,225,793,804,850,852,981,1132,1139,1259 'lectur':1077 'length':1081,1182 'less':102,174 'like':145,1014,1076 'linear16':734 'livekit':41 'livetranscriptionev':654 'livetranscriptionevents.transcript':702 'llm':25,110,189,874,911,918,995,1004,1126,1245 'locat':401,406 'lock':183,1094 
'lock-in':182,1093 'log':221 'logic':122,216 'long':1065,1177 'lose':986 'low':792 'lowest':20,98,142,164,803 'machin':1018 'maximum':192 'may':1098 'measur':1136 'medium':1179,1187,1195,1203 'messag':434,439,504 'method':540 'metric':248,1150 'mid':940 'mid-respons':939 'million':61 'millisecond':73 'miss':1044 'mitig':1204 'mix':125,198,1111 'modal':350 'model':144,372,497,500,531,707,776,819 'mono':413 'ms':101,115,253,259,384,388,725,870,877,883,890 'name':393,484,494,553,556,562 'nativ':283 'natur':81,104,155,243,861 'nearest':892 'network':887 'new':967 'node.js':1213 'nois':1193,1197 'non':973 'non-stream':972 'note':866 'nova':359,532,708 'nuanc':163 'number':605 'object':404 'ok':594 'onyx':358 'openai':17,35,146,280,309,325,329,499,1271,1274,1277 'openai-ag':1273 'openai-api':1270 'openai-beta':328 'openai-whisp':1276 'openai/anthropic':1124 'optim':851,900,1260 'order':564,567,569 'outbound':597 'output':365,785,960 'over':1176 'pad':256,383 'paramet':402,560 'part':126,276,848 'partial':721 'pattern':33,129,971 'paus':1051 'pcm':787,790,1222 'pcm16':364,368,411 'perceiv':983 'phone':467 'phone-bas':466 'phonenumb':607 'phrase':1186 'physic':67 'pipelin':23,108,185,646,975,1243 'platform':463 'play':444 'playback':925,951,956 'point':231,1107 'poor':1153 'post':541 'prefix':255,382 'preserv':107,149,159 'print':689 'process':140,910,963 'product':57,1281 'prompt':1084,1085,1184,1188,1206 'properti':405 'provid':31,124,171,181,203,278,498,515,529,1092,1113,1218,1261 'provider-specif':30 'python':300,473,647,1211 'qualiti':644,711,1102,1123 'queue':961,1036 'quick':471 'rachel':519,775 'rate':736,1224 'raw':789 'real':661,1230 'real-tim':660,1229 'realtim':18,36,147,281,331 'receiv':430 'region':893 'relat':1265 'report':582 'request':478 'request.json':546 'requir':1210 'respons':941,1066,1070,1080,1145,1178,1181 'response.audio.delta':443 'result':565,573,574,682,718,722,873,907,994 'result.channel.alternatives':684 'result.is':694 'return':571,592,622 
'role':505 'round':858 'round-trip':857 's2s':97 'sampl':735,1223 'save':586 'second':979 'select':1262 'semant':266,1056,1158 'send':409,416,452 'sentenc':1074 'separ':187,298 'server':378 'session':316,343,349 'session.update':348 'sever':1130 'sharp':1127 'shimmer':360 'ship':56 'silenc':249,273,386,1038,1063 'silence-on':272,1037 'simplic':172 'singl':170,897,1091,1106 'sk':312 'skill':1266 'skill-voice-ai' 'slow':985 'smart':714 'solut':1131 'source-rkz91' 'speak':240 'speaker':454 'specif':32 'speech':14,16,94,96,131,133,263,759,948,1030,1239,1241 'speech-to-speech':13,93,130,1238 'speed':1117 'spoken':1087,1190 'start':264,596,915,924,1001 'starts/stops':239 'step':120,177,196,218 'stop':954,1031 'str':762 'stream':671,673,743,750,760,763,768,771,798,811,833,878,884,894,901,913,922,974,990,997,1000,1232 'strength':158,197 'string':408 'stt':24,109,188,205,628,867,904,992,1116,1200,1244 'stt/tts':299 'sum':76 'support':495,512,1167,1227 'synthesi':637,751,928 'system':506,1083 'take':246,1253 'talk':1015 'target':854,865,1148 'text':351,757,761,783,784,810,829,832,837 'think':1053 'threshold':250,380 'time':662,1146,1231 'today':527 'token':879,912,996 'tool':390 'topic-agent-skills' 'topic-agents-md' 'topic-ai-agents' 'topic-claude-code' 'topic-codex' 'topic-cursor' 'topic-developer-tools' 'topic-llm-tools' 'topic-mcp' 'topic-pm-tools' 'topic-product-management' 'topic-productivity' 'total':856 'transcrib':528,670 'transcript':371,583,585,587,591,635,663,681,683,686,688,692,700,704,1201 'trip':859 'true':595,716,719,729 'tts':26,111,190,210,630,764,805,808,826,880,916,920,955,998,1002,1032,1121,1246 'tts.flush':844 'tts.send':836 'turbo':779,822 'turn':245,375,1040,1154,1252 'turn-tak':244,1251 'twiliophonenumb':608 'two':11,86 'type':347,377,391,403,407,423,442,549,577,620 'typic':254 'understand':65 'unnatur':1185 'url':317,336,623 'use':42,265,638,871,943,1025,1055,1151,1157 'user':238,698,934,947,982,1011,1029,1049 'utter':723 'v':676 'v1':332 
'v2':780,823 'vad':235,267,379,727,944,1026,1057,1159,1175 'vapi':37,455,462,481,543 'vapi.vapi':486 'vocal':160 'voic':2,4,44,51,58,232,285,287,295,315,353,456,459,514,730,772,816,1069,1122,1234,1248 'voice-ai':1 'voice-to-voic':284 'voiceid':517 'want':641 'weak':173,222 'weather':395,398 'web':612,614,621 'webhook':534,544 'webrtc':625 'websocket':304,801,809,1226 'websockets.connect':335 'well':1268 'whether':78 'whisper':373,1278 'without':297 'work':1267 'ws':341,436 'ws.send':345,421 'yield':799,839,845","prices":[{"id":"f4ced705-9c1b-42ff-8c3a-851b43cb838c","listingId":"0ff1cd81-d2e4-497d-ab3c-ce5ce32cc764","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"rkz91","category":"coco","install_from":"skills.sh"},"createdAt":"2026-05-18T13:21:43.966Z"}],"sources":[{"listingId":"0ff1cd81-d2e4-497d-ab3c-ce5ce32cc764","source":"github","sourceId":"rkz91/coco/voice-ai","sourceUrl":"https://github.com/rkz91/coco/tree/main/skills/voice-ai","isPrimary":false,"firstSeenAt":"2026-05-18T13:21:43.966Z","lastSeenAt":"2026-05-18T19:14:10.313Z"}],"details":{"listingId":"0ff1cd81-d2e4-497d-ab3c-ce5ce32cc764","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"rkz91","slug":"voice-ai","github":{"repo":"rkz91/coco","stars":7,"topics":["agent-skills","agents-md","ai","ai-agents","claude-code","codex","cursor","developer-tools","llm-tools","mcp","pm-tools","product-management","productivity","prompt-engineering","workflow-automation"],"license":"mit","html_url":"https://github.com/rkz91/coco","pushed_at":"2026-04-26T01:51:27Z","description":"Open-source library of AI superpowers — 59 skills, 34 commands, 10 agents + 24 GSD subagents, 3 system bundles. An entire team, wherever your AI lives. 
Vendor-neutral across Claude Code, Cursor, Codex, and any AGENTS.md tool.","skill_md_sha":"fd12ef7f8a678808977bd4d5d606e4392bff125f","skill_md_path":"skills/voice-ai/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/rkz91/coco/tree/main/skills/voice-ai"},"layout":"multi","source":"github","category":"coco","frontmatter":{"name":"voice-ai","description":"Voice AI architecture and implementation guide. Covers two architectures: speech-to-speech (OpenAI Realtime API, lowest latency) and pipeline (STT->LLM->TTS, more control). Includes provider-specific patterns for OpenAI Realtime, Vapi, Deepgram, ElevenLabs, and LiveKit. Use when building voice agents, voice-enabled apps, or real-time conversational AI."},"skills_sh_url":"https://skills.sh/rkz91/coco/voice-ai"},"updatedAt":"2026-05-18T19:14:10.313Z"}}