VoiceAssist Real-time Architecture
Last Updated: 2025-11-27
Status: Production Ready
Overview
VoiceAssist uses WebSocket connections for real-time bidirectional communication, enabling:
- Streaming chat responses - Token-by-token LLM output
- Voice interactions - Speech-to-text and text-to-speech
- Live updates - Typing indicators, connection status
Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────┐
│ Client │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────────┐ │
│ │ Chat UI │ │ Voice Input │ │ Connection Manager │ │
│ │ │ │ (Web Audio) │ │ - Reconnection │ │
│ │ - Messages │ │ - Mic capture │ │ - Heartbeat │ │
│ │ - Streaming │ │ - STT stream │ │ - Token refresh │ │
│ └────────┬────────┘ └────────┬────────┘ └────────────┬────────────┘ │
│ │ │ │ │
│ └────────────────────┼────────────────────────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ WebSocket │ │
│ │ Client │ │
│ └──────┬──────┘ │
└────────────────────────────────┼────────────────────────────────────────┘
│
WSS/WS │
│
┌────────────────────────────────┼────────────────────────────────────────┐
│ │ │
│ ┌──────▼──────┐ │
│ │ WebSocket │ │
│ │ Handler │ │
│ │ (FastAPI) │ │
│ └──────┬──────┘ │
│ │ │
│ ┌────────────────────┼────────────────────┐ │
│ │ │ │ │
│ ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐ │
│ │ Chat │ │ Voice │ │ Connection │ │
│ │ Service │ │ Service │ │ Manager │ │
│ │ │ │ │ │ │ │
│ │ - RAG Query │ │ - STT │ │ - Sessions │ │
│ │ - LLM Call │ │ - TTS │ │ - Heartbeat │ │
│ │ - Streaming │ │ - VAD │ │ - Auth │ │
│ └──────┬──────┘ └──────┬──────┘ └─────────────┘ │
│ │ │ │
│ └────────────────────┼────────────────────────────────────────┤
│ │ │
│ ┌──────▼──────┐ │
│ │ OpenAI │ │
│ │ API │ │
│ │ │ │
│ │ - GPT-4 │ │
│ │ - Whisper │ │
│ │ - TTS │ │
│ └─────────────┘ │
│ │
│ Backend │
└─────────────────────────────────────────────────────────────────────────┘
Connection Lifecycle
1. Connection Establishment
Client Server
│ │
├──── WebSocket Connect ─────────────────►│
│ (with token & conversationId) │
│ │
│◄──── connection_established ────────────┤
│ { connectionId, serverTime } │
│ │
2. Message Exchange
Client Server
│ │
├──── message ───────────────────────────►│
│ { content: "Hello" } │
│ │
│◄──── thinking ──────────────────────────┤
│ │
│◄──── assistant_chunk ───────────────────┤
│ { content: "Hi" } │
│◄──── assistant_chunk ───────────────────┤
│ { content: " there" } │
│◄──── assistant_chunk ───────────────────┤
│ { content: "!" } │
│ │
│◄──── message_complete ──────────────────┤
│ { messageId, totalTokens } │
│ │
3. Heartbeat
Client Server
│ │
├──── ping ──────────────────────────────►│
│ │
│◄──── pong ──────────────────────────────┤
│ │
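The heartbeat can be driven by a timer that sends `ping` frames and tracks the last `pong`; the staleness check itself is pure, so it is shown here with injected timestamps. This is an illustrative sketch, not the shipped client; `isConnectionStale` is a hypothetical helper.

```typescript
// A connection is considered stale when no pong has arrived within
// two heartbeat intervals. Timestamps are injected so the logic is
// testable without real timers.
const HEARTBEAT_INTERVAL_MS = 30_000; // 30s, per the recommendation below

function isConnectionStale(
  lastPongAt: number,
  now: number,
  intervalMs: number = HEARTBEAT_INTERVAL_MS,
): boolean {
  return now - lastPongAt > 2 * intervalMs;
}
```

In the real client, a `setInterval` would send `{"type": "ping"}` every `HEARTBEAT_INTERVAL_MS`, record `Date.now()` on each `pong`, and trigger a reconnect when `isConnectionStale` returns true.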
WebSocket Endpoints
| Endpoint | Purpose |
|---|---|
| /api/realtime/ws | Main chat WebSocket |
| /api/voice/ws | Voice-specific WebSocket (future) |
Query Parameters
| Parameter | Required | Description |
|---|---|---|
| conversationId | Yes | UUID of the conversation session |
| token | Yes | JWT access token |
Connection URL Example
```
// Development
ws://localhost:8000/api/realtime/ws?conversationId=uuid&token=jwt

// Production
wss://assist.asimo.io/api/realtime/ws?conversationId=uuid&token=jwt
```
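Because the JWT goes in the query string, it should be URL-encoded rather than concatenated by hand. A minimal client-side sketch using `URLSearchParams` (`buildRealtimeUrl` is an illustrative helper, not part of the codebase):

```typescript
// Assemble the realtime WebSocket URL with properly encoded
// query parameters (URLSearchParams percent-encodes the token).
function buildRealtimeUrl(
  base: string,
  conversationId: string,
  token: string,
): string {
  const params = new URLSearchParams({ conversationId, token });
  return `${base}/api/realtime/ws?${params.toString()}`;
}

// Usage:
// const url = buildRealtimeUrl("wss://assist.asimo.io", convId, jwt);
// const socket = new WebSocket(url);
```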
Message Types
Client → Server
| Type | Description |
|---|---|
| message | Send user message |
| ping | Heartbeat ping |
| stop | Cancel current response |
| voice_start | Begin voice input (future) |
| voice_chunk | Audio data chunk (future) |
| voice_end | End voice input (future) |
Server → Client
| Type | Description |
|---|---|
| connection_established | Connection successful |
| thinking | AI is processing |
| assistant_chunk | Streaming response chunk |
| message_complete | Response finished |
| error | Error occurred |
| pong | Heartbeat response |
| voice_transcript | Speech-to-text result (future) |
| voice_audio | TTS audio chunk (future) |
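The tables above translate naturally into discriminated unions on the client. The payload fields below come from the examples in this document; any field not shown there (such as the shape of `error`) is an assumption, and `parseServerMessage` is an illustrative helper:

```typescript
// Core (non-voice) protocol messages, as discriminated unions.
type ClientMessage =
  | { type: "message"; content: string }
  | { type: "ping" }
  | { type: "stop" };

type ServerMessage =
  | { type: "connection_established"; connectionId: string; serverTime: string }
  | { type: "thinking" }
  | { type: "assistant_chunk"; content: string }
  | { type: "message_complete"; messageId: string; totalTokens: number }
  | { type: "error"; code: string; message?: string } // error shape assumed
  | { type: "pong" };

// Narrow an incoming frame to a ServerMessage, rejecting unknown types.
function parseServerMessage(raw: string): ServerMessage | null {
  const data = JSON.parse(raw);
  const known = [
    "connection_established", "thinking", "assistant_chunk",
    "message_complete", "error", "pong",
  ];
  return known.includes(data?.type) ? (data as ServerMessage) : null;
}
```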
Streaming Response Flow
RAG + LLM Pipeline
User Message → WebSocket Handler
│
▼
┌───────────────┐
│ RAG Service │ ← Retrieves relevant context
│ │ from Qdrant vector store
└───────┬───────┘
│
▼
┌───────────────┐
│ LLM Client │ ← Calls OpenAI with streaming
│ │
└───────┬───────┘
│
┌─────────┼─────────┐
│ │ │
▼ ▼ ▼
chunk_1 chunk_2 chunk_n
│ │ │
└─────────┼─────────┘
│
▼
WebSocket Send
(per chunk)
Streaming Implementation
```python
# Backend (FastAPI WebSocket handler)
# rag_service and llm_client are assumed to be injected/module-level services.
import uuid


async def handle_message(websocket, message):
    # Send thinking indicator
    await websocket.send_json({"type": "thinking"})

    # Get RAG context
    context = await rag_service.retrieve(message.content)

    # Stream LLM response, keeping the last chunk for usage stats
    last_chunk = None
    async for chunk in llm_client.stream_chat(message.content, context):
        await websocket.send_json({
            "type": "assistant_chunk",
            "content": chunk.content,
        })
        last_chunk = chunk

    # Send completion (token usage is reported on the final chunk)
    await websocket.send_json({
        "type": "message_complete",
        "messageId": str(uuid.uuid4()),
        "totalTokens": last_chunk.usage.total_tokens if last_chunk else 0,
    })
```
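On the client, the mirror of this handler folds `assistant_chunk` events into one growing message. A minimal reducer sketch (`StreamState` and `reduceChunk` are illustrative names, not part of the codebase):

```typescript
// Accumulate streaming chunks into the displayed message text.
interface StreamState {
  text: string;
  done: boolean;
}

function reduceChunk(
  state: StreamState,
  msg: { type: string; content?: string },
): StreamState {
  switch (msg.type) {
    case "assistant_chunk":
      // Append the chunk's content to the message so far
      return { text: state.text + (msg.content ?? ""), done: false };
    case "message_complete":
      return { ...state, done: true };
    default:
      // thinking, pong, etc. do not change the message text
      return state;
  }
}
```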
Voice Architecture (Future Enhancement)
Microphone → Web Audio API → VAD (Voice Activity Detection)
│
▼
Audio Chunks (PCM)
│
▼
WebSocket Send
│
▼
Server VAD + STT
│
▼
Transcript Event
Voice Output Flow
LLM Response Text → TTS Service (OpenAI/ElevenLabs)
│
▼
Audio Stream (MP3/PCM)
│
▼
WebSocket Send (chunks)
│
▼
Web Audio API Playback
Error Handling
Reconnection Strategy
```typescript
// Minimal sleep helper used by reconnect()
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

class WebSocketClient {
  private reconnectAttempts = 0;
  private maxReconnectAttempts = 5;
  private baseDelay = 1000; // 1 second

  async reconnect() {
    // Exponential backoff: 1s, 2s, 4s, ... capped at 30s
    const delay = Math.min(
      this.baseDelay * Math.pow(2, this.reconnectAttempts),
      30000, // max 30 seconds
    );
    await sleep(delay);

    this.reconnectAttempts++;
    if (this.reconnectAttempts < this.maxReconnectAttempts) {
      await this.connect();
    } else {
      this.emit("connection_failed");
    }
  }

  // connect() and emit() omitted for brevity
  private async connect() { /* ... */ }
  private emit(event: string) { /* ... */ }
}
```
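The delay computation can be factored into a pure function, which makes the backoff schedule easy to unit test in isolation (`backoffDelay` is a sketch with the same constants as the class above, not an existing helper):

```typescript
// Exponential backoff with a cap: delay = min(base * 2^attempt, cap).
function backoffDelay(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 30_000,
): number {
  return Math.min(baseMs * Math.pow(2, attempt), capMs);
}
```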
Error Types
| Error Code | Description | Client Action |
|---|---|---|
| auth_failed | Invalid/expired token | Refresh token and reconnect |
| session_not_found | Invalid conversation ID | Create new session |
| rate_limited | Too many requests | Back off and retry |
| server_error | Internal server error | Retry with backoff |
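The table maps directly onto a dispatch function in the error handler. A sketch under the assumption that unknown codes should fail safe by disconnecting (`actionFor` and the action names are illustrative):

```typescript
// Map server error codes to the client actions from the table above.
type ClientAction =
  | "refresh_and_reconnect"
  | "create_session"
  | "retry_with_backoff"
  | "disconnect";

function actionFor(code: string): ClientAction {
  switch (code) {
    case "auth_failed":
      return "refresh_and_reconnect";
    case "session_not_found":
      return "create_session";
    case "rate_limited":
    case "server_error":
      return "retry_with_backoff";
    default:
      return "disconnect"; // unknown codes: fail safe (assumption)
  }
}
```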
Performance
Client-side
- Buffer chunks - Don't update DOM on every chunk
- Throttle renders - Use requestAnimationFrame
- Heartbeat interval - 30 seconds recommended
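The first two points above can be combined: buffer chunks as they arrive and render the batch once per frame. The flush callback is injected here so the buffering logic is testable; in the browser, `drain()` would be called from `requestAnimationFrame`. This is a sketch, and `ChunkBuffer` is a hypothetical class:

```typescript
// Buffer incoming chunks and flush them in one render pass,
// instead of touching the DOM on every assistant_chunk.
class ChunkBuffer {
  private pending: string[] = [];

  constructor(private flush: (text: string) => void) {}

  // Called for each incoming assistant_chunk
  push(chunk: string): void {
    this.pending.push(chunk);
  }

  // Called once per animation frame in the real client
  drain(): void {
    if (this.pending.length === 0) return;
    this.flush(this.pending.join(""));
    this.pending = [];
  }
}
```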
Server-side
- Connection pooling - Reuse OpenAI connections
- Chunk size - Optimize for network vs. latency
- Memory management - Clean up closed connections
Security
- Authentication - JWT token required in query params
- Rate limiting - Per-user connection limits
- Message validation - Schema validation on all messages
- TLS - WSS required in production
Version History
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2025-11-27 | Initial architecture document |