# VoiceAssist Real-time Architecture

**Last Updated**: 2025-11-27
**Status**: Production Ready

**Related Documentation:**

- [WebSocket Protocol](WEBSOCKET_PROTOCOL.md) - Wire protocol specification
- [Voice Mode Pipeline](VOICE_MODE_PIPELINE.md) - Voice-specific implementation
- [Implementation Status](overview/IMPLEMENTATION_STATUS.md) - Component status

---

## Overview

VoiceAssist uses WebSocket connections for real-time bidirectional communication, enabling:

- **Streaming chat responses** - Token-by-token LLM output
- **Voice interactions** - Speech-to-text and text-to-speech
- **Live updates** - Typing indicators, connection status

---

## Architecture Diagram

```
┌───────────────────────────────────────────────────────────────────────────┐
│                                  Client                                   │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────────┐    │
│  │ Chat UI         │  │ Voice Input     │  │ Connection Manager      │    │
│  │                 │  │ (Web Audio)     │  │ - Reconnection          │    │
│  │ - Messages      │  │ - Mic capture   │  │ - Heartbeat             │    │
│  │ - Streaming     │  │ - STT stream    │  │ - Token refresh         │    │
│  └────────┬────────┘  └────────┬────────┘  └────────────┬────────────┘    │
│           │                    │                        │                 │
│           └────────────────────┼────────────────────────┘                 │
│                                │                                          │
│                         ┌──────▼──────┐                                   │
│                         │  WebSocket  │                                   │
│                         │   Client    │                                   │
│                         └──────┬──────┘                                   │
└────────────────────────────────┼──────────────────────────────────────────┘
                                 │ WSS/WS
                                 │
┌────────────────────────────────┼──────────────────────────────────────────┐
│                         ┌──────▼──────┐                                   │
│                         │  WebSocket  │                                   │
│                         │   Handler   │                                   │
│                         │  (FastAPI)  │                                   │
│                         └──────┬──────┘                                   │
│                                │                                          │
│           ┌────────────────────┼────────────────────┐                     │
│           │                    │                    │                     │
│    ┌──────▼──────┐      ┌──────▼──────┐      ┌──────▼──────┐              │
│    │    Chat     │      │    Voice    │      │ Connection  │              │
│    │   Service   │      │   Service   │      │   Manager   │              │
│    │             │      │             │      │             │              │
│    │ - RAG Query │      │ - STT       │      │ - Sessions  │              │
│    │ - LLM Call  │      │ - TTS       │      │ - Heartbeat │              │
│    │ - Streaming │      │ - VAD       │      │ - Auth      │              │
│    └──────┬──────┘      └──────┬──────┘      └─────────────┘              │
│           │                    │                                          │
│           └────────────────────┤                                          │
│                                │                                          │
│                         ┌──────▼──────┐                                   │
│                         │   OpenAI    │                                   │
│                         │     API     │                                   │
│                         │             │                                   │
│                         │ - GPT-4     │                                   │
│                         │ - Whisper   │                                   │
│                         │ - TTS       │                                   │
│                         └─────────────┘                                   │
│                                                                  Backend  │
└───────────────────────────────────────────────────────────────────────────┘
```

---

## Connection Lifecycle

### 1. Connection Establishment

```
Client                                   Server
  │                                         │
  ├──── WebSocket Connect ─────────────────►│
  │     (with token & conversationId)       │
  │                                         │
  │◄──── connection_established ────────────┤
  │      { connectionId, serverTime }       │
  │                                         │
```

### 2. Message Exchange

```
Client                                   Server
  │                                         │
  ├──── message ───────────────────────────►│
  │     { content: "Hello" }                │
  │                                         │
  │◄──── thinking ──────────────────────────┤
  │                                         │
  │◄──── assistant_chunk ───────────────────┤
  │      { content: "Hi" }                  │
  │◄──── assistant_chunk ───────────────────┤
  │      { content: " there" }              │
  │◄──── assistant_chunk ───────────────────┤
  │      { content: "!" }                   │
  │                                         │
  │◄──── message_complete ──────────────────┤
  │      { messageId, totalTokens }         │
  │                                         │
```

### 3. Heartbeat

```
Client                                   Server
  │                                         │
  ├──── ping ──────────────────────────────►│
  │                                         │
  │◄──── pong ──────────────────────────────┤
  │                                         │
```
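A minimal client-side sketch of this lifecycle is shown below. It assumes a browser `WebSocket` plus the endpoint, query parameters, and message types documented in the following sections; the function name, logging, and sample `"Hello"` message are illustrative, not part of the actual VoiceAssist client.

```typescript
// Illustrative lifecycle sketch: connect with conversationId and token,
// send a heartbeat ping every 30 seconds, and handle the core server events.
function openRealtimeSocket(conversationId: string, token: string): WebSocket {
  const ws = new WebSocket(
    `wss://assist.asimo.io/api/realtime/ws?conversationId=${conversationId}&token=${token}`,
  );

  let heartbeat: ReturnType<typeof setInterval> | undefined;

  ws.onopen = () => {
    // The server answers each ping with a pong (see Heartbeat above)
    heartbeat = setInterval(() => ws.send(JSON.stringify({ type: "ping" })), 30_000);
  };

  ws.onmessage = (event) => {
    const msg = JSON.parse(event.data);
    switch (msg.type) {
      case "connection_established":
        console.log("connected", msg.connectionId, msg.serverTime);
        ws.send(JSON.stringify({ type: "message", content: "Hello" }));
        break;
      case "assistant_chunk":
        // Append msg.content to the in-progress assistant message
        break;
      case "message_complete":
        console.log("done", msg.messageId, msg.totalTokens);
        break;
    }
  };

  ws.onclose = () => {
    if (heartbeat) clearInterval(heartbeat);
  };

  return ws;
}
```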
---

## WebSocket Endpoints

| Endpoint           | Purpose                            |
| ------------------ | ---------------------------------- |
| `/api/realtime/ws` | Main chat WebSocket                |
| `/api/voice/ws`    | Voice-specific WebSocket (future)  |

### Query Parameters

| Parameter        | Required | Description                      |
| ---------------- | -------- | -------------------------------- |
| `conversationId` | Yes      | UUID of the conversation session |
| `token`          | Yes      | JWT access token                 |

### Connection URL Example

```typescript
// Development
ws://localhost:8000/api/realtime/ws?conversationId=uuid&token=jwt

// Production
wss://assist.asimo.io/api/realtime/ws?conversationId=uuid&token=jwt
```

---

## Message Types

### Client → Server

| Type          | Description                |
| ------------- | -------------------------- |
| `message`     | Send user message          |
| `ping`        | Heartbeat ping             |
| `stop`        | Cancel current response    |
| `voice_start` | Begin voice input (future) |
| `voice_chunk` | Audio data chunk (future)  |
| `voice_end`   | End voice input (future)   |

### Server → Client

| Type                     | Description                    |
| ------------------------ | ------------------------------ |
| `connection_established` | Connection successful          |
| `thinking`               | AI is processing               |
| `assistant_chunk`        | Streaming response chunk       |
| `message_complete`       | Response finished              |
| `error`                  | Error occurred                 |
| `pong`                   | Heartbeat response             |
| `voice_transcript`       | Speech-to-text result (future) |
| `voice_audio`            | TTS audio chunk (future)       |

---

## Streaming Response Flow

### RAG + LLM Pipeline

```
User Message → WebSocket Handler
                      │
                      ▼
              ┌───────────────┐
              │  RAG Service  │ ← Retrieves relevant context
              │               │   from Qdrant vector store
              └───────┬───────┘
                      │
                      ▼
              ┌───────────────┐
              │  LLM Client   │ ← Calls OpenAI with streaming
              │               │
              └───────┬───────┘
                      │
            ┌─────────┼─────────┐
            │         │         │
            ▼         ▼         ▼
         chunk_1   chunk_2   chunk_n
            │         │         │
            └─────────┼─────────┘
                      │
                      ▼
          WebSocket Send (per chunk)
```

### Streaming Implementation

```python
# Backend (FastAPI WebSocket handler)
# rag_service and llm_client are module-level dependencies wired up elsewhere.
import uuid

async def handle_message(websocket, message):
    # Send thinking indicator
    await websocket.send_json({"type": "thinking"})

    # Retrieve RAG context for the user message
    context = await rag_service.retrieve(message.content)

    # Stream LLM response chunk by chunk
    total_tokens = 0
    async for chunk in llm_client.stream_chat(message.content, context):
        # Token usage is typically only populated on the final chunk
        usage = getattr(chunk, "usage", None)
        if usage:
            total_tokens = usage.total_tokens
        await websocket.send_json({
            "type": "assistant_chunk",
            "content": chunk.content
        })

    # Send completion with the final token count
    await websocket.send_json({
        "type": "message_complete",
        "messageId": str(uuid.uuid4()),
        "totalTokens": total_tokens
    })
```
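On the receiving side, a client can accumulate `assistant_chunk` payloads in a buffer and flush them once per animation frame instead of touching the DOM on every chunk (see Performance Considerations below). The sketch is illustrative; the `updateUI` callback and variable names are assumptions, not the actual `useWebSocket` hook.

```typescript
// Buffer streamed chunks and flush them once per animation frame instead of
// updating the DOM on every chunk. Names are illustrative.
let pending = "";           // chunks received since the last flush
let rendered = "";          // text already handed to the UI
let frameScheduled = false;

function onRealtimeMessage(raw: string, updateUI: (text: string) => void): void {
  const msg = JSON.parse(raw);

  if (msg.type === "assistant_chunk") {
    pending += msg.content;
    if (!frameScheduled) {
      frameScheduled = true;
      requestAnimationFrame(() => {
        rendered += pending;
        pending = "";
        frameScheduled = false;
        updateUI(rendered); // single DOM update per frame
      });
    }
  } else if (msg.type === "message_complete") {
    // Flush whatever is still buffered when the stream ends
    rendered += pending;
    pending = "";
    updateUI(rendered);
  }
}
```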
---

## Voice Architecture (Future Enhancement)

### Voice Input Flow

```
Microphone → Web Audio API → VAD (Voice Activity Detection)
                              │
                              ▼
                     Audio Chunks (PCM)
                              │
                              ▼
                       WebSocket Send
                              │
                              ▼
                      Server VAD + STT
                              │
                              ▼
                      Transcript Event
```

### Voice Output Flow

```
LLM Response Text → TTS Service (OpenAI/ElevenLabs)
                         │
                         ▼
              Audio Stream (MP3/PCM)
                         │
                         ▼
              WebSocket Send (chunks)
                         │
                         ▼
              Web Audio API Playback
```

---

## Error Handling

### Reconnection Strategy

```typescript
// Helper: resolve after the given number of milliseconds
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

class WebSocketClient {
  private reconnectAttempts = 0;
  private maxReconnectAttempts = 5;
  private baseDelay = 1000; // 1 second

  async reconnect() {
    // Exponential backoff, capped at 30 seconds
    const delay = Math.min(
      this.baseDelay * Math.pow(2, this.reconnectAttempts),
      30000,
    );

    await sleep(delay);
    this.reconnectAttempts++;

    if (this.reconnectAttempts < this.maxReconnectAttempts) {
      await this.connect(); // connect() re-opens the socket (defined elsewhere in the class)
    } else {
      this.emit("connection_failed"); // emit() notifies listeners (defined elsewhere)
    }
  }
}
```

### Error Types

| Error Code          | Description             | Client Action               |
| ------------------- | ----------------------- | --------------------------- |
| `auth_failed`       | Invalid/expired token   | Refresh token and reconnect |
| `session_not_found` | Invalid conversation ID | Create new session          |
| `rate_limited`      | Too many requests       | Backoff and retry           |
| `server_error`      | Internal server error   | Retry with backoff          |
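Below is a sketch of how a client might map these codes to the actions above. The `error` payload shape (a `code` field) and the handler names are assumptions for illustration; the documented protocol only defines the `error` message type.

```typescript
// Illustrative only: maps the error codes in the table above to client actions.
interface ErrorHandlers {
  refreshToken(): Promise<void>;         // obtain a fresh JWT
  createSession(): Promise<void>;        // start a new conversation session
  reconnectWithBackoff(): Promise<void>; // exponential backoff, as sketched above
}

async function handleServerError(
  msg: { type: "error"; code: string },
  handlers: ErrorHandlers,
): Promise<void> {
  switch (msg.code) {
    case "auth_failed":
      await handlers.refreshToken();
      await handlers.reconnectWithBackoff();
      break;
    case "session_not_found":
      await handlers.createSession();
      await handlers.reconnectWithBackoff();
      break;
    case "rate_limited":
    case "server_error":
      await handlers.reconnectWithBackoff();
      break;
  }
}
```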
---

## Performance Considerations

### Client-side

1. **Buffer chunks** - Don't update the DOM on every chunk
2. **Throttle renders** - Use `requestAnimationFrame`
3. **Heartbeat interval** - 30 seconds recommended

### Server-side

1. **Connection pooling** - Reuse OpenAI connections
2. **Chunk size** - Balance network overhead against latency
3. **Memory management** - Clean up closed connections

---

## Security

1. **Authentication** - JWT token required in query params
2. **Rate limiting** - Per-user connection limits
3. **Message validation** - Schema validation on all messages (see the sketch below)
4. **TLS** - WSS required in production
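Message validation happens on the backend (FastAPI), but the idea is the same in any language: reject anything that does not match a known message shape before processing it. Below is an illustrative TypeScript guard for the client → server types from the Message Types section; it is a sketch, not the production validator, and the `voice_*` types are omitted.

```typescript
// Illustrative schema check for incoming client -> server messages.
type ClientMessage =
  | { type: "message"; content: string }
  | { type: "ping" }
  | { type: "stop" };

function parseClientMessage(raw: string): ClientMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof data !== "object" || data === null) return null;

  const msg = data as { type?: unknown; content?: unknown };
  if (msg.type === "message") {
    // A chat message must carry a string payload
    return typeof msg.content === "string"
      ? { type: "message", content: msg.content }
      : null;
  }
  if (msg.type === "ping") return { type: "ping" };
  if (msg.type === "stop") return { type: "stop" };
  return null; // unknown type (voice_* messages omitted from this sketch)
}
```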
---

## Related Documentation

- **Protocol Specification:** [WEBSOCKET_PROTOCOL.md](WEBSOCKET_PROTOCOL.md)
- **Voice Pipeline:** [VOICE_MODE_PIPELINE.md](VOICE_MODE_PIPELINE.md)
- **Backend Handler:** `services/api-gateway/app/api/realtime.py`
- **Client Hook:** `apps/web-app/src/hooks/useWebSocket.ts`

---

## Version History

| Version | Date       | Changes                       |
| ------- | ---------- | ----------------------------- |
| 1.0.0   | 2025-11-27 | Initial architecture document |