2:I[7012,["4765","static/chunks/4765-f5afdf8061f456f3.js","9856","static/chunks/9856-3b185291364d9bef.js","6687","static/chunks/app/docs/%5B...slug%5D/page-e07536548216bee4.js"],"MarkdownRenderer"] 4:I[9856,["4765","static/chunks/4765-f5afdf8061f456f3.js","9856","static/chunks/9856-3b185291364d9bef.js","6687","static/chunks/app/docs/%5B...slug%5D/page-e07536548216bee4.js"],""] 5:I[4126,[],""] 7:I[9630,[],""] 8:I[4278,["9856","static/chunks/9856-3b185291364d9bef.js","8172","static/chunks/8172-b3a2d6fe4ae10d40.js","3185","static/chunks/app/layout-2814fa5d15b84fe4.js"],"HeadingProvider"] 9:I[1476,["9856","static/chunks/9856-3b185291364d9bef.js","8172","static/chunks/8172-b3a2d6fe4ae10d40.js","3185","static/chunks/app/layout-2814fa5d15b84fe4.js"],"Header"] a:I[3167,["9856","static/chunks/9856-3b185291364d9bef.js","8172","static/chunks/8172-b3a2d6fe4ae10d40.js","3185","static/chunks/app/layout-2814fa5d15b84fe4.js"],"Sidebar"] b:I[7409,["9856","static/chunks/9856-3b185291364d9bef.js","8172","static/chunks/8172-b3a2d6fe4ae10d40.js","3185","static/chunks/app/layout-2814fa5d15b84fe4.js"],"PageFrame"] 3:T2b61, # Voice Pipeline WebSocket API > **Endpoint:** `wss://{host}/api/voice/pipeline-ws` > **Protocol:** JSON over WebSocket > **Status:** Production Ready > **Last Updated:** 2025-12-02 ## Overview The Voice Pipeline WebSocket provides bidirectional communication for the Thinker-Talker voice mode. It handles audio streaming, transcription, LLM responses, and TTS playback. ## Connection ### Authentication Include JWT token in connection URL or headers: ```javascript const ws = new WebSocket(`wss://assist.asimo.io/api/voice/pipeline-ws?token=${accessToken}`); ``` ### Connection Lifecycle ``` 1. Client connects with auth token │ 2. Server accepts, creates pipeline session │ 3. Server sends: session.ready │ 4. Client sends: session.init (optional config) │ 5. Server acknowledges: session.init.ack │ 6. Voice mode active - bidirectional streaming │ 7. Client or server closes connection ``` ## Message Format All messages are JSON objects with a `type` field: ```json { "type": "message_type", "field1": "value1", "field2": "value2" } ``` ## Client → Server Messages ### session.init Initialize or reconfigure the session. ```json { "type": "session.init", "conversation_id": "conv-123", "voice_settings": { "voice_id": "TxGEqnHWrfWFTfGW9XjX", "language": "en", "barge_in_enabled": true } } ``` | Field | Type | Required | Description | | --------------------------------- | ------- | -------- | --------------------------------------- | | `conversation_id` | string | No | Link to existing chat conversation | | `voice_settings.voice_id` | string | No | ElevenLabs voice ID | | `voice_settings.language` | string | No | STT language code (default: "en") | | `voice_settings.barge_in_enabled` | boolean | No | Allow user interruption (default: true) | ### audio.input Stream audio from microphone. ```json { "type": "audio.input", "audio": "base64_encoded_pcm16_audio" } ``` | Field | Type | Required | Description | | ------- | ------ | -------- | ---------------------------------------- | | `audio` | string | Yes | Base64-encoded PCM16 audio (16kHz, mono) | **Audio Format Requirements:** - Sample rate: 16000 Hz - Channels: 1 (mono) - Bit depth: 16-bit signed PCM - Encoding: Little-endian - Chunk size: ~100ms recommended (1600 samples) ### audio.input.complete Signal end of user speech (manual commit). ```json { "type": "audio.input.complete" } ``` Normally, VAD auto-detects speech end. 
### barge_in

Interrupt the AI response.

```json
{
  "type": "barge_in"
}
```

When received:

- Cancels TTS synthesis
- Clears audio queue
- Resets pipeline to listening state

### message

Send text input (fallback when the mic is unavailable).

```json
{
  "type": "message",
  "content": "What's the weather like?"
}
```

### ping

Keep-alive heartbeat.

```json
{
  "type": "ping"
}
```

Server responds with `pong`.

## Server → Client Messages

### session.ready

Session initialized successfully.

```json
{
  "type": "session.ready",
  "session_id": "sess-abc123",
  "pipeline_mode": "thinker_talker"
}
```

### session.init.ack

Acknowledges the session.init message.

```json
{
  "type": "session.init.ack"
}
```

### transcript.delta

Partial STT transcript (streaming).

```json
{
  "type": "transcript.delta",
  "text": "What is the",
  "is_final": false
}
```

| Field      | Type    | Description             |
| ---------- | ------- | ----------------------- |
| `text`     | string  | Partial transcript text |
| `is_final` | boolean | Always false for delta  |

### transcript.complete

Final STT transcript.

```json
{
  "type": "transcript.complete",
  "text": "What is the weather today?",
  "message_id": "msg-xyz789"
}
```

| Field        | Type   | Description               |
| ------------ | ------ | ------------------------- |
| `text`       | string | Complete transcript       |
| `message_id` | string | Unique message identifier |

### response.delta

Streaming LLM response token.

```json
{
  "type": "response.delta",
  "delta": "The",
  "message_id": "resp-123"
}
```

| Field        | Type   | Description          |
| ------------ | ------ | -------------------- |
| `delta`      | string | Response token/chunk |
| `message_id` | string | Response message ID  |

### response.complete

Complete LLM response.

```json
{
  "type": "response.complete",
  "text": "The weather today is sunny with a high of 72 degrees.",
  "message_id": "resp-123"
}
```

### audio.output

TTS audio chunk.

```json
{
  "type": "audio.output",
  "audio": "base64_encoded_pcm_audio",
  "is_final": false,
  "sentence_index": 0
}
```

| Field            | Type    | Description                            |
| ---------------- | ------- | -------------------------------------- |
| `audio`          | string  | Base64-encoded PCM audio (24kHz, mono) |
| `is_final`       | boolean | True for last chunk                    |
| `sentence_index` | number  | Which sentence this is from            |

**Output Audio Format:**

- Sample rate: 24000 Hz
- Channels: 1 (mono)
- Bit depth: 16-bit signed PCM
- Encoding: Little-endian
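Because `audio.output` carries raw PCM16 rather than a container format, it cannot be fed to `decodeAudioData` directly. One way a client can play the chunks gaplessly (a sketch, not the reference player) is to wrap each one in an `AudioBuffer` and schedule it against the `AudioContext` clock:

```javascript
// Decode one audio.output payload (base64 PCM16, 24 kHz mono) and schedule it
// so that consecutive chunks play back-to-back.
const ctx = new AudioContext();
let playhead = 0; // next start time on the AudioContext clock

function playAudioOutput(msg) {
  const bytes = Uint8Array.from(atob(msg.audio), (c) => c.charCodeAt(0));
  const pcm = new Int16Array(bytes.buffer);

  // AudioBuffer expects Float32 in [-1, 1]; the browser resamples 24 kHz to the context rate.
  const buffer = ctx.createBuffer(1, pcm.length, 24000);
  const channel = buffer.getChannelData(0);
  for (let i = 0; i < pcm.length; i++) channel[i] = pcm[i] / 32768;

  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);

  playhead = Math.max(playhead, ctx.currentTime);
  source.start(playhead);
  playhead += buffer.duration;
}
```

A barge-in handler would additionally stop local playback (for example by suspending the context) alongside sending the `barge_in` message described above.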
### tool.call

Tool invocation started.

```json
{
  "type": "tool.call",
  "id": "call-abc",
  "name": "calendar_list_events",
  "arguments": {
    "start_date": "2025-12-01",
    "end_date": "2025-12-07"
  }
}
```

| Field       | Type   | Description        |
| ----------- | ------ | ------------------ |
| `id`        | string | Tool call ID       |
| `name`      | string | Tool function name |
| `arguments` | object | Tool arguments     |

### tool.result

Tool execution completed.

```json
{
  "type": "tool.result",
  "id": "call-abc",
  "name": "calendar_list_events",
  "result": {
    "events": [{ "title": "Team Meeting", "start": "2025-12-02T10:00:00" }]
  }
}
```

| Field    | Type   | Description           |
| -------- | ------ | --------------------- |
| `id`     | string | Tool call ID          |
| `name`   | string | Tool function name    |
| `result` | any    | Tool execution result |

### voice.state

Pipeline state change.

```json
{
  "type": "voice.state",
  "state": "speaking"
}
```

| State        | Description                 |
| ------------ | --------------------------- |
| `idle`       | Waiting for user input      |
| `listening`  | Receiving audio, STT active |
| `processing` | LLM thinking                |
| `speaking`   | TTS playing                 |
| `cancelled`  | Barge-in occurred           |

### heartbeat

Server heartbeat (every 30s).

```json
{
  "type": "heartbeat"
}
```

### pong

Response to client ping.

```json
{
  "type": "pong"
}
```

### error

Error occurred.

```json
{
  "type": "error",
  "code": "stt_failed",
  "message": "Speech-to-text service unavailable",
  "recoverable": true
}
```

| Field         | Type    | Description              |
| ------------- | ------- | ------------------------ |
| `code`        | string  | Error code               |
| `message`     | string  | Human-readable message   |
| `recoverable` | boolean | True if client can retry |

**Error Codes:**

| Code                | Description            | Recoverable |
| ------------------- | ---------------------- | ----------- |
| `invalid_json`      | Malformed JSON message | Yes         |
| `connection_failed` | Pipeline init failed   | No          |
| `stt_failed`        | STT service error      | Yes         |
| `llm_failed`        | LLM service error      | Yes         |
| `tts_failed`        | TTS service error      | Yes         |
| `auth_failed`       | Authentication error   | No          |
| `rate_limited`      | Too many requests      | Yes         |
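One reasonable client policy for `error` frames (a sketch, not mandated by the API) is to surface recoverable codes and keep the session alive, and to tear down and retry with backoff only on fatal ones. `reconnect` below stands in for whatever function re-opens the WebSocket with a fresh token:

```javascript
// Recoverable errors (stt_failed, tts_failed, rate_limited, ...) leave the
// session usable; auth_failed / connection_failed require a new connection.
let retries = 0;

function handleErrorMessage(ws, msg, reconnect) {
  if (msg.recoverable) {
    console.warn(`Recoverable voice error [${msg.code}]: ${msg.message}`);
    return;
  }
  ws.close();
  const delayMs = Math.min(30_000, 1_000 * 2 ** retries++); // exponential backoff, capped at 30 s
  setTimeout(reconnect, delayMs);
}
```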
## Example: Complete Session

```javascript
// 1. Connect
const ws = new WebSocket(`wss://assist.asimo.io/api/voice/pipeline-ws?token=${token}`);

ws.onopen = () => {
  console.log("Connected");
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);

  switch (msg.type) {
    case "session.ready":
      // 2. Initialize with settings
      ws.send(
        JSON.stringify({
          type: "session.init",
          conversation_id: currentConversationId,
          voice_settings: {
            voice_id: "TxGEqnHWrfWFTfGW9XjX",
            language: "en",
          },
        }),
      );
      break;

    case "session.init.ack":
      // 3. Start sending audio
      startMicrophoneCapture();
      break;

    case "transcript.delta":
      // Show partial transcript
      updatePartialTranscript(msg.text);
      break;

    case "transcript.complete":
      // Show final transcript
      setTranscript(msg.text);
      break;

    case "response.delta":
      // Append LLM response
      appendResponse(msg.delta);
      break;

    case "audio.output":
      // Play TTS audio
      if (msg.audio) {
        const pcm = base64ToArrayBuffer(msg.audio);
        audioPlayer.queueChunk(pcm);
      }
      if (msg.is_final) {
        audioPlayer.finish();
      }
      break;

    case "tool.call":
      // Show tool being called
      showToolCall(msg.name, msg.arguments);
      break;

    case "tool.result":
      // Show tool result
      showToolResult(msg.name, msg.result);
      break;

    case "error":
      console.error(`Error [${msg.code}]: ${msg.message}`);
      if (!msg.recoverable) {
        ws.close();
      }
      break;
  }
};

// Send audio chunks from microphone
function sendAudioChunk(pcmData) {
  ws.send(
    JSON.stringify({
      type: "audio.input",
      audio: arrayBufferToBase64(pcmData),
    }),
  );
}

// Handle barge-in (user speaks while AI is talking)
function handleBargeIn() {
  ws.send(JSON.stringify({ type: "barge_in" }));
  audioPlayer.stop();
}
```

## Configuration Reference

### TTSessionConfig (Backend)

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TTSessionConfig:
    user_id: str
    session_id: str
    conversation_id: Optional[str] = None

    # Voice settings
    voice_id: str = "TxGEqnHWrfWFTfGW9XjX"
    tts_model: str = "eleven_flash_v2_5"
    language: str = "en"

    # STT settings
    stt_sample_rate: int = 16000
    stt_endpointing_ms: int = 800
    stt_utterance_end_ms: int = 1500

    # Barge-in
    barge_in_enabled: bool = True

    # Timeouts
    connection_timeout_sec: float = 10.0
    idle_timeout_sec: float = 300.0
```

## Rate Limiting

| Limit                            | Value                  |
| -------------------------------- | ---------------------- |
| Max concurrent sessions per user | 2                      |
| Max concurrent sessions total    | 100                    |
| Audio chunk rate                 | ~10/second recommended |
| Idle timeout                     | 300 seconds            |
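A session that sits quiet (for example while the user reads a long response) will hit the 300-second idle timeout unless something keeps the connection warm. A minimal keep-alive sketch using the `ping` message defined earlier; the 60-second interval is an arbitrary choice, not a documented requirement, and `ws` is the socket from the example above:

```javascript
// Send a ping well inside the 300 s idle window; stop when the socket closes.
const KEEP_ALIVE_MS = 60_000;

const keepAlive = setInterval(() => {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({ type: "ping" }));
  }
}, KEEP_ALIVE_MS);

ws.addEventListener("close", () => clearInterval(keepAlive));
```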
## Related Documentation

- [Thinker-Talker Pipeline Overview](../THINKER_TALKER_PIPELINE.md)
- [Frontend Voice Hooks](../frontend/thinker-talker-hooks.md)
- [Voice Mode Settings Guide](../VOICE_MODE_SETTINGS_GUIDE.md)