Voice Mode Pipeline

Status: Production-ready | Last Updated: 2025-12-03

This document describes the unified Voice Mode pipeline architecture, data flow, metrics, and testing strategy. It serves as the canonical reference for developers working on real-time voice features.

Voice Pipeline Modes

VoiceAssist supports two voice pipeline modes:

Mode                         | Description                    | Best For
Thinker-Talker (Recommended) | Local STT → LLM → TTS pipeline | Full tool support, unified context, custom TTS
OpenAI Realtime (Legacy)     | Direct OpenAI Realtime API     | Quick setup, minimal backend changes

Thinker-Talker Pipeline (Primary)

The Thinker-Talker pipeline is the recommended approach, providing:

  • Unified conversation context between voice and chat modes
  • Full tool/RAG support in voice interactions
  • Custom TTS via ElevenLabs with premium voices
  • Lower cost per interaction

Documentation: THINKER_TALKER_PIPELINE.md

[Audio] → [Deepgram STT] → [GPT-4o Thinker] → [ElevenLabs TTS] → [Audio Out]
              │                    │                    │
         Transcripts          Tool Calls           Audio Chunks
              │                    │                    │
              └───────── WebSocket Handler ──────────────┘

OpenAI Realtime API (Legacy)

The original implementation uses OpenAI's Realtime API directly. It remains supported for backward compatibility.


Implementation Status

Thinker-Talker Components

Component             | Status | Location
ThinkerService        | Live   | app/services/thinker_service.py
TalkerService         | Live   | app/services/talker_service.py
VoicePipelineService  | Live   | app/services/voice_pipeline_service.py
T/T WebSocket Handler | Live   | app/services/thinker_talker_websocket_handler.py
SentenceChunker       | Live   | app/services/sentence_chunker.py
Frontend T/T hook     | Live   | apps/web-app/src/hooks/useThinkerTalkerSession.ts
T/T Audio Playback    | Live   | apps/web-app/src/hooks/useTTAudioPlayback.ts
T/T Voice Panel       | Live   | apps/web-app/src/components/voice/ThinkerTalkerVoicePanel.tsx

OpenAI Realtime Components (Legacy)

Component                  | Status  | Location / Notes
Backend session endpoint   | Live    | services/api-gateway/app/api/voice.py
Ephemeral token generation | Live    | app/services/realtime_voice_service.py
Voice metrics endpoint     | Live    | POST /api/voice/metrics
Frontend voice hook        | Live    | apps/web-app/src/hooks/useRealtimeVoiceSession.ts
Voice settings store       | Live    | apps/web-app/src/stores/voiceSettingsStore.ts
Voice UI panel             | Live    | apps/web-app/src/components/voice/VoiceModePanel.tsx
Chat timeline integration  | Live    | Voice messages appear in chat
Barge-in support           | Live    | response.cancel + onSpeechStarted callback
Audio overlap prevention   | Live    | Response ID tracking + isProcessingResponseRef
E2E test suite             | Passing | 95 tests across unit/integration/E2E

Full status: See Implementation Status for all components.

Overview

Voice Mode enables real-time voice conversations with the AI assistant using OpenAI's Realtime API. The pipeline handles:

  • Ephemeral session authentication (no raw API keys in browser)
  • WebSocket-based bidirectional voice streaming
  • Voice activity detection (VAD) with user-configurable sensitivity
  • User settings propagation (voice, language, VAD threshold)
  • Chat timeline integration (voice messages appear in chat)
  • Connection state management with automatic reconnection
  • Barge-in support (interrupt AI while speaking)
  • Audio playback management (prevent overlapping responses)
  • Metrics tracking for observability

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                              FRONTEND                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────────┐     ┌─────────────────────┐     ┌───────────────┐  │
│  │  VoiceModePanel     │────▶│useRealtimeVoice     │────▶│ voiceSettings │  │
│  │  (UI Component)     │     │Session (Hook)       │     │ Store         │  │
│  │  - Start/Stop       │     │- connect()          │     │ - voice       │  │
│  │  - Status display   │     │- disconnect()       │     │ - language    │  │
│  │  - Metrics logging  │     │- sendMessage()      │     │ - vadSens     │  │
│  └─────────┬───────────┘     └──────────┬──────────┘     └───────────────┘  │
│            │                            │                                    │
│            │                            │ onUserMessage()/onAssistantMessage()
│            │                            ▼                                    │
│  ┌─────────▼───────────┐     ┌─────────────────────┐                        │
│  │  MessageInput       │     │  ChatPage           │                        │
│  │  - Voice toggle     │────▶│  - useChatSession   │                        │
│  │  - Panel container  │     │  - addMessage()     │                        │
│  └─────────────────────┘     └─────────────────────┘                        │
│                                                                              │
└──────────────────────────────────────┬──────────────────────────────────────┘
                                       │
                                       │ POST /api/voice/realtime-session
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              BACKEND                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────────┐     ┌─────────────────────┐                        │
│  │  voice.py           │────▶│  realtime_voice_    │                        │
│  │  (FastAPI Router)   │     │  service.py         │                        │
│  │  - /realtime-session│     │  - generate_session │                        │
│  │  - Timing logs      │     │  - ephemeral token  │                        │
│  └─────────────────────┘     └──────────┬──────────┘                        │
│                                         │                                    │
│                                         │ POST /v1/realtime/sessions         │
│                                         ▼                                    │
│                              ┌─────────────────────┐                        │
│                              │  OpenAI API         │                        │
│                              │  - Ephemeral token  │                        │
│                              │  - Voice config     │                        │
│                              └─────────────────────┘                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       │ WebSocket wss://api.openai.com/v1/realtime
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          OPENAI REALTIME API                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│  - Server-side VAD (voice activity detection)                                │
│  - Bidirectional audio streaming (PCM16)                                     │
│  - Real-time transcription (Whisper)                                         │
│  - GPT-4o responses with audio synthesis                                     │
└─────────────────────────────────────────────────────────────────────────────┘

Backend: /api/voice/realtime-session

Location: services/api-gateway/app/api/voice.py

Request

interface RealtimeSessionRequest {
  conversation_id?: string; // Optional conversation context
  voice?: string;           // "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer"
  language?: string;        // "en" | "es" | "fr" | "de" | "it" | "pt"
  vad_sensitivity?: number; // 0-100 (maps to threshold: 0→0.9, 100→0.1)
}

Response

interface RealtimeSessionResponse {
  url: string;        // WebSocket URL: "wss://api.openai.com/v1/realtime"
  model: string;      // "gpt-4o-realtime-preview"
  session_id: string; // Unique session identifier
  expires_at: number; // Unix timestamp (epoch seconds)
  conversation_id: string | null;
  auth: {
    type: "ephemeral_token";
    token: string;      // Ephemeral token (ek_...), NOT raw API key
    expires_at: number; // Token expiry (5 minutes)
  };
  voice_config: {
    voice: string;      // Selected voice
    modalities: ["text", "audio"];
    input_audio_format: "pcm16";
    output_audio_format: "pcm16";
    input_audio_transcription: { model: "whisper-1" };
    turn_detection: {
      type: "server_vad";
      threshold: number; // 0.1 (sensitive) to 0.9 (insensitive)
      prefix_padding_ms: number;
      silence_duration_ms: number;
    };
  };
}

VAD Sensitivity Mapping

The frontend uses a 0-100 scale for user-friendly VAD sensitivity:

User Setting | VAD Threshold | Behavior
0 (Low)      | 0.9           | Requires loud/clear speech
50 (Medium)  | 0.5           | Balanced detection
100 (High)   | 0.1           | Very sensitive, picks up soft speech

Formula: threshold = 0.9 - (vad_sensitivity / 100 * 0.8)
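
For quick reference, the sketch below implements the same mapping in TypeScript. The helper name is hypothetical; the actual conversion happens in the backend session endpoint.

// Hypothetical helper mirroring the documented formula.
function vadSensitivityToThreshold(vadSensitivity: number): number {
  // Clamp to the documented 0-100 range.
  const s = Math.min(100, Math.max(0, vadSensitivity));
  // 0 → 0.9 (least sensitive), 50 → 0.5, 100 → 0.1 (most sensitive).
  return 0.9 - (s / 100) * 0.8;
}

console.log(vadSensitivityToThreshold(0));   // 0.9
console.log(vadSensitivityToThreshold(50));  // 0.5
console.log(vadSensitivityToThreshold(100)); // ~0.1 (floating-point rounding)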

Observability

Backend logs timing and context for each session request:

# Request logging
logger.info(
    f"Creating Realtime session for user {current_user.id}",
    extra={
        "user_id": current_user.id,
        "conversation_id": request.conversation_id,
        "voice": request.voice,
        "language": request.language,
        "vad_sensitivity": request.vad_sensitivity,
    },
)

# Success logging with duration
duration_ms = int((time.monotonic() - start_time) * 1000)
logger.info(
    f"Realtime session created for user {current_user.id}",
    extra={
        "user_id": current_user.id,
        "session_id": config["session_id"],
        "voice": config.get("voice_config", {}).get("voice"),
        "duration_ms": duration_ms,
    },
)

Frontend Hook: useRealtimeVoiceSession

Location: apps/web-app/src/hooks/useRealtimeVoiceSession.ts

Usage

const {
  status,       // 'disconnected' | 'connecting' | 'connected' | 'reconnecting' | 'failed' | 'expired' | 'error'
  transcript,   // Current transcript text
  isSpeaking,   // Is the AI currently speaking?
  isConnected,  // Derived: status === 'connected'
  isConnecting, // Derived: status === 'connecting' || 'reconnecting'
  canSend,      // Can send messages?
  error,        // Error message if any
  metrics,      // VoiceMetrics object
  connect,      // () => Promise<void> - start session
  disconnect,   // () => void - end session
  sendMessage,  // (text: string) => void - send text message
} = useRealtimeVoiceSession({
  conversationId,
  voice,              // From voiceSettingsStore
  language,           // From voiceSettingsStore
  vadSensitivity,     // From voiceSettingsStore (0-100)
  onConnected,        // Callback when connected
  onDisconnected,     // Callback when disconnected
  onError,            // Callback on error
  onUserMessage,      // Callback with user transcript
  onAssistantMessage, // Callback with AI response
  onMetricsUpdate,    // Callback when metrics change
});

Connection States

disconnected ──▶ connecting ──▶ connected
                      │              │
                      ▼              ▼
                   failed ◀──── reconnecting
                      │              │
                      ▼              ▼
                  expired ◀────── error

State        | Description
disconnected | Initial/idle state
connecting   | Fetching session config, establishing WebSocket
connected    | Active voice session
reconnecting | Auto-reconnect after temporary disconnect
failed       | Connection failed (backend error, network issue)
expired      | Session token expired (needs manual restart)
error        | General error state

WebSocket Connection

The hook opens the WebSocket with three subprotocols; the last one carries the ephemeral token for authentication:

const ws = new WebSocket(url, [
  "realtime",
  "openai-beta.realtime-v1",
  `openai-insecure-api-key.${ephemeralToken}`,
]);

Voice Settings Store

Location: apps/web-app/src/stores/voiceSettingsStore.ts

Schema

interface VoiceSettings {
  voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer";
  language: "en" | "es" | "fr" | "de" | "it" | "pt";
  vadSensitivity: number;   // 0-100
  autoStartOnOpen: boolean; // Auto-start voice when panel opens
  showStatusHints: boolean; // Show helper text in UI
}

Persistence

Settings are persisted to localStorage under key voiceassist-voice-settings using Zustand's persist middleware.
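
A minimal sketch of how such a persisted store can be wired with Zustand's persist middleware, using the localStorage key and defaults documented here; the real store defines additional actions and settings.

import { create } from "zustand";
import { persist } from "zustand/middleware";

interface VoiceSettingsState {
  voice: string;
  language: string;
  vadSensitivity: number;
  autoStartOnOpen: boolean;
  showStatusHints: boolean;
  setVoice: (voice: string) => void;
  setVadSensitivity: (value: number) => void;
}

// Sketch only; the actual store lives in voiceSettingsStore.ts.
export const useVoiceSettingsStore = create<VoiceSettingsState>()(
  persist(
    (set) => ({
      voice: "alloy",
      language: "en",
      vadSensitivity: 50,
      autoStartOnOpen: false,
      showStatusHints: true,
      setVoice: (voice) => set({ voice }),
      setVadSensitivity: (vadSensitivity) => set({ vadSensitivity }),
    }),
    { name: "voiceassist-voice-settings" } // localStorage key
  )
);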

Defaults

Setting         | Default
voice           | "alloy"
language        | "en"
vadSensitivity  | 50
autoStartOnOpen | false
showStatusHints | true

Chat Integration

Location: apps/web-app/src/pages/ChatPage.tsx

Message Flow

  1. User speaks → VoiceModePanel receives final transcript
  2. VoiceModePanel calls onUserMessage(transcript)
  3. ChatPage receives callback, calls useChatSession.addMessage()
  4. Message added to timeline with metadata: { source: "voice" }

// ChatPage.tsx
const handleVoiceUserMessage = (content: string) => {
  addMessage({
    role: "user",
    content,
    metadata: { source: "voice" },
  });
};

const handleVoiceAssistantMessage = (content: string) => {
  addMessage({
    role: "assistant",
    content,
    metadata: { source: "voice" },
  });
};

Message Structure

interface VoiceMessage {
  id: string;        // "voice-{timestamp}-{random}"
  role: "user" | "assistant";
  content: string;
  timestamp: number;
  metadata: {
    source: "voice"; // Distinguishes from text messages
  };
}

Barge-in & Audio Playback

Location: apps/web-app/src/components/voice/VoiceModePanel.tsx, apps/web-app/src/hooks/useRealtimeVoiceSession.ts

Barge-in Flow

When the user starts speaking while the AI is responding, the system immediately:

  1. Detects speech start via OpenAI's input_audio_buffer.speech_started event
  2. Cancels active response by sending response.cancel to OpenAI
  3. Stops audio playback via onSpeechStarted callback
  4. Clears pending responses to prevent stale audio from playing
User speaks → speech_started event → response.cancel → stopCurrentAudio()
                                                            ↓
                                                    Audio stops
                                                    Queue cleared
                                                    Response ID incremented

Response Cancellation

Location: useRealtimeVoiceSession.ts - handleRealtimeMessage

case "input_audio_buffer.speech_started": setIsSpeaking(true); setPartialTranscript(""); // Barge-in: Cancel any active response when user starts speaking if (activeResponseIdRef.current && wsRef.current?.readyState === WebSocket.OPEN) { wsRef.current.send(JSON.stringify({ type: "response.cancel" })); activeResponseIdRef.current = null; } // Notify parent to stop audio playback options.onSpeechStarted?.(); break;

Audio Playback Management

Location: VoiceModePanel.tsx

The panel tracks audio playback state to prevent overlapping responses:

// Track currently playing Audio element
const currentAudioRef = useRef<HTMLAudioElement | null>(null);

// Prevent overlapping response processing
const isProcessingResponseRef = useRef(false);

// Response ID to invalidate stale responses after barge-in
const currentResponseIdRef = useRef<number>(0);

Stop current audio function:

const stopCurrentAudio = useCallback(() => {
  if (currentAudioRef.current) {
    currentAudioRef.current.pause();
    currentAudioRef.current.currentTime = 0;
    if (currentAudioRef.current.src.startsWith("blob:")) {
      URL.revokeObjectURL(currentAudioRef.current.src);
    }
    currentAudioRef.current = null;
  }
  audioQueueRef.current = [];
  isPlayingRef.current = false;
  currentResponseIdRef.current++; // Invalidate pending responses
  isProcessingResponseRef.current = false;
}, []);

Overlap Prevention

When a relay result arrives, the handler checks:

  1. Already processing? Skip if isProcessingResponseRef.current === true
  2. Response ID valid? Skip playback if ID changed (barge-in occurred)

onRelayResult: async ({ answer }) => {
  if (answer) {
    // Prevent overlapping responses
    if (isProcessingResponseRef.current) {
      console.log("[VoiceModePanel] Skipping response - already processing another");
      return;
    }
    const responseId = ++currentResponseIdRef.current;
    isProcessingResponseRef.current = true;

    // ... synthesis and playback ...

    // Check if response is still valid before playback
    if (responseId !== currentResponseIdRef.current) {
      console.log("[VoiceModePanel] Response cancelled - skipping playback");
      return;
    }
  }
};

Error Handling

Benign cancellation errors (e.g., "Cancellation failed: no active response found") are handled gracefully:

case "error": { const errorMessage = message.error?.message || "Realtime API error"; // Ignore benign cancellation errors if ( errorMessage.includes("Cancellation failed") || errorMessage.includes("no active response") ) { voiceLog.debug(`Ignoring benign error: ${errorMessage}`); break; } handleError(new Error(errorMessage)); break; }

Metrics

Location: apps/web-app/src/hooks/useRealtimeVoiceSession.ts

VoiceMetrics Interface

interface VoiceMetrics {
  connectionTimeMs: number | null;        // Time to establish connection
  timeToFirstTranscriptMs: number | null; // Time to first user transcript
  lastSttLatencyMs: number | null;        // Speech-to-text latency
  lastResponseLatencyMs: number | null;   // AI response latency
  sessionDurationMs: number | null;       // Total session duration
  userTranscriptCount: number;            // Number of user turns
  aiResponseCount: number;                // Number of AI turns
  reconnectCount: number;                 // Number of reconnections
  sessionStartedAt: number | null;        // Session start timestamp
}

Frontend Logging

VoiceModePanel logs key metrics to console:

// Connection time
console.log(`[VoiceModePanel] voice_session_connect_ms=${metrics.connectionTimeMs}`);

// STT latency
console.log(`[VoiceModePanel] voice_stt_latency_ms=${metrics.lastSttLatencyMs}`);

// Response latency
console.log(`[VoiceModePanel] voice_first_reply_ms=${metrics.lastResponseLatencyMs}`);

// Session duration
console.log(`[VoiceModePanel] voice_session_duration_ms=${metrics.sessionDurationMs}`);

Consuming Metrics

Developers can plug into metrics via the onMetricsUpdate callback:

useRealtimeVoiceSession({
  onMetricsUpdate: (metrics) => {
    // Send to telemetry service
    analytics.track("voice_session_metrics", {
      connection_ms: metrics.connectionTimeMs,
      stt_latency_ms: metrics.lastSttLatencyMs,
      response_latency_ms: metrics.lastResponseLatencyMs,
      duration_ms: metrics.sessionDurationMs,
    });
  },
});

Metrics Export to Backend

Metrics can be automatically exported to the backend for aggregation and alerting.

Backend Endpoint: POST /api/voice/metrics

Location: services/api-gateway/app/api/voice.py

Request Schema

interface VoiceMetricsPayload {
  conversation_id?: string;
  connection_time_ms?: number;
  time_to_first_transcript_ms?: number;
  last_stt_latency_ms?: number;
  last_response_latency_ms?: number;
  session_duration_ms?: number;
  user_transcript_count: number;
  ai_response_count: number;
  reconnect_count: number;
  session_started_at?: number;
}

Response

interface VoiceMetricsResponse { status: "ok"; }

Privacy

No PHI or transcript content is sent; only timing metrics and counts are exported.

Frontend Configuration

Metrics export is controlled by environment variables:

  • Production (import.meta.env.PROD): Metrics sent automatically
  • Development: Set VITE_ENABLE_VOICE_METRICS=true to enable

The export uses navigator.sendBeacon() for reliability (survives page navigation).
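
A minimal sketch of the export path, assuming the payload shape documented above; the helper name is hypothetical and the real hook performs this internally.

// Hypothetical helper: queue metrics with the Beacon API, falling back to fetch.
function exportVoiceMetrics(payload: Record<string, unknown>): void {
  const url = "/api/voice/metrics";
  const body = JSON.stringify(payload);

  // sendBeacon survives page navigation/unload.
  if (navigator.sendBeacon) {
    navigator.sendBeacon(url, new Blob([body], { type: "application/json" }));
  } else {
    // Fallback for environments without the Beacon API.
    void fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body,
      keepalive: true,
    });
  }
}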

Backend Logging

Metrics are logged with user context:

logger.info( "VoiceMetrics received", extra={ "user_id": current_user.id, "conversation_id": payload.conversation_id, "connection_time_ms": payload.connection_time_ms, "session_duration_ms": payload.session_duration_ms, ... }, )

Testing

# Backend
cd /home/asimo/VoiceAssist/services/api-gateway
source venv/bin/activate && export PYTHONPATH=.
python -m pytest tests/integration/test_voice_metrics.py -v

Security

Ephemeral Token Architecture

CRITICAL: The browser NEVER receives the raw OpenAI API key.

  1. Backend holds OPENAI_API_KEY securely
  2. Frontend requests session via /api/voice/realtime-session
  3. Backend creates ephemeral token via OpenAI /v1/realtime/sessions
  4. Ephemeral token returned to frontend (valid ~5 minutes)
  5. Frontend connects WebSocket using ephemeral token (end-to-end flow sketched below)
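
The sketch below shows this flow end to end under the assumptions in this document (session endpoint, response fields, and WebSocket subprotocols); the function name is hypothetical and the real logic lives in useRealtimeVoiceSession.

// Illustrative connect flow; authentication headers for the backend call omitted.
async function connectVoiceSession(conversationId?: string): Promise<WebSocket> {
  // Steps 1-3: ask the backend for an ephemeral session (raw API key never leaves the server)
  const res = await fetch("/api/voice/realtime-session", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ conversation_id: conversationId }),
  });
  if (!res.ok) throw new Error(`Session request failed: ${res.status}`);
  const session = await res.json();

  // Steps 4-5: connect to OpenAI with the short-lived ephemeral token as a subprotocol
  return new WebSocket(session.url, [
    "realtime",
    "openai-beta.realtime-v1",
    `openai-insecure-api-key.${session.auth.token}`,
  ]);
}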

Token Refresh

The hook monitors session.expires_at and can trigger refresh before expiry. If the token expires mid-session, status transitions to expired.
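
As an illustration only, a refresh can be scheduled slightly before expiry; the helper below is hypothetical, and the real behavior is implemented inside useRealtimeVoiceSession.

// Hypothetical sketch: refresh shortly before the ephemeral token lapses.
function scheduleTokenRefresh(
  expiresAtEpochSeconds: number,
  refresh: () => Promise<void>,
  onExpired: () => void,
  safetyWindowMs = 30_000
): void {
  const msUntilExpiry = expiresAtEpochSeconds * 1000 - Date.now();
  if (msUntilExpiry <= safetyWindowMs) {
    // Too close to expiry to refresh safely; surface the 'expired' state.
    onExpired();
    return;
  }
  // Refresh before the token lapses so the session stays authenticated.
  setTimeout(() => {
    refresh().catch(onExpired);
  }, msUntilExpiry - safetyWindowMs);
}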

Testing

Voice Pipeline Smoke Suite

Run these commands to validate the voice pipeline:

# 1. Backend tests (CI-safe, mocked)
cd /home/asimo/VoiceAssist/services/api-gateway
source venv/bin/activate
export PYTHONPATH=.
python -m pytest tests/integration/test_openai_config.py -v

# 2. Frontend unit tests (run individually to avoid OOM)
cd /home/asimo/VoiceAssist/apps/web-app
export NODE_OPTIONS="--max-old-space-size=768"
npx vitest run src/hooks/__tests__/useRealtimeVoiceSession.test.ts --reporter=dot
npx vitest run src/hooks/__tests__/useChatSession-voice-integration.test.ts --reporter=dot
npx vitest run src/stores/__tests__/voiceSettingsStore.test.ts --reporter=dot
npx vitest run src/components/voice/__tests__/VoiceModeSettings.test.tsx --reporter=dot
npx vitest run src/components/chat/__tests__/MessageInput-voice-settings.test.tsx --reporter=dot

# 3. E2E tests (Chromium, mocked backend)
cd /home/asimo/VoiceAssist
npx playwright test \
  e2e/voice-mode-navigation.spec.ts \
  e2e/voice-mode-session-smoke.spec.ts \
  e2e/voice-mode-voice-chat-integration.spec.ts \
  --project=chromium --reporter=list

Test Coverage Summary

Test File                                 | Tests | Coverage
useRealtimeVoiceSession.test.ts           | 22    | Hook lifecycle, states, metrics
useChatSession-voice-integration.test.ts  | 8     | Message structure validation
voiceSettingsStore.test.ts                | 17    | Store actions, persistence
VoiceModeSettings.test.tsx                | 25    | Component rendering, interactions
MessageInput-voice-settings.test.tsx      | 12    | Integration with chat input
voice-mode-navigation.spec.ts             | 4     | E2E navigation flow
voice-mode-session-smoke.spec.ts          | 3     | E2E session smoke (1 live gated)
voice-mode-voice-chat-integration.spec.ts | 4     | E2E panel integration

Total: 95 tests

Live Testing

To test with real OpenAI backend:

# Backend (requires OPENAI_API_KEY in .env)
LIVE_REALTIME_TESTS=1 python -m pytest tests/integration/test_openai_config.py -v

# E2E (requires running backend + valid API key)
LIVE_REALTIME_E2E=1 npx playwright test e2e/voice-mode-session-smoke.spec.ts

File Reference

Backend

File                                                         | Purpose
services/api-gateway/app/api/voice.py                        | API routes, metrics, timing logs
services/api-gateway/app/services/realtime_voice_service.py  | Session creation, token generation
services/api-gateway/tests/integration/test_openai_config.py | Integration tests
services/api-gateway/tests/integration/test_voice_metrics.py | Metrics endpoint tests

Frontend

File                                                     | Purpose
apps/web-app/src/hooks/useRealtimeVoiceSession.ts        | Core hook
apps/web-app/src/components/voice/VoiceModePanel.tsx     | UI panel
apps/web-app/src/components/voice/VoiceModeSettings.tsx  | Settings modal
apps/web-app/src/stores/voiceSettingsStore.ts            | Settings store
apps/web-app/src/components/chat/MessageInput.tsx        | Voice button integration
apps/web-app/src/pages/ChatPage.tsx                      | Chat timeline integration
apps/web-app/src/hooks/useChatSession.ts                 | addMessage() helper

Tests

File                                                                            | Purpose
apps/web-app/src/hooks/__tests__/useRealtimeVoiceSession.test.ts                | Hook tests
apps/web-app/src/hooks/__tests__/useChatSession-voice-integration.test.ts       | Chat integration
apps/web-app/src/stores/__tests__/voiceSettingsStore.test.ts                    | Store tests
apps/web-app/src/components/voice/__tests__/VoiceModeSettings.test.tsx          | Component tests
apps/web-app/src/components/chat/__tests__/MessageInput-voice-settings.test.tsx | Integration tests
e2e/voice-mode-navigation.spec.ts                                               | E2E navigation
e2e/voice-mode-session-smoke.spec.ts                                            | E2E smoke test
e2e/voice-mode-voice-chat-integration.spec.ts                                   | E2E panel integration

Observability & Monitoring (Phase 3)

Implemented: 2025-12-02

The voice pipeline includes comprehensive observability features for production monitoring.

Error Taxonomy (voice_errors.py)

Location: services/api-gateway/app/core/voice_errors.py

Structured error classification with 8 categories and 40+ error codes:

Category   | Codes          | Description
CONNECTION | CONN_001-7     | WebSocket, network failures
STT        | STT_001-7      | Speech-to-text errors
TTS        | TTS_001-7      | Text-to-speech errors
LLM        | LLM_001-6      | LLM processing errors
AUDIO      | AUDIO_001-6    | Audio encoding/decoding errors
TIMEOUT    | TIMEOUT_001-7  | Various timeout conditions
PROVIDER   | PROVIDER_001-6 | External provider errors
INTERNAL   | INTERNAL_001-5 | Internal server errors

Each error code includes (shape sketched after this list):

  • Recoverability flag (can auto-retry)
  • Retry configuration (delay, max attempts)
  • User-friendly description
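
For illustration only, each entry carries roughly the following shape, sketched here in TypeScript; the actual definitions live in the Python module voice_errors.py.

// Hypothetical shape of a taxonomy entry.
interface VoiceErrorDefinition {
  code: string;          // e.g. "CONN_001"
  category: "CONNECTION" | "STT" | "TTS" | "LLM" | "AUDIO" | "TIMEOUT" | "PROVIDER" | "INTERNAL";
  recoverable: boolean;  // Can the pipeline auto-retry?
  retryDelayMs?: number; // Retry configuration (recoverable errors only)
  maxRetries?: number;
  description: string;   // User-friendly description
}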

Voice Metrics (metrics.py)

Location: services/api-gateway/app/core/metrics.py

Prometheus metrics for voice pipeline monitoring:

Metric                               | Type      | Labels                                 | Description
voice_errors_total                   | Counter   | category, code, provider, recoverable  | Total voice errors
voice_pipeline_stage_latency_seconds | Histogram | stage                                  | Per-stage latency
voice_ttfa_seconds                   | Histogram | -                                      | Time to first audio
voice_active_sessions                | Gauge     | -                                      | Active voice sessions
voice_barge_in_total                 | Counter   | -                                      | Barge-in events
voice_audio_chunks_total             | Counter   | status                                 | Audio chunks processed

Per-Stage Latency Tracking (voice_timing.py)

Location: services/api-gateway/app/core/voice_timing.py

Pipeline stages tracked:

  • audio_receive - Time to receive audio from client
  • vad_process - Voice activity detection time
  • stt_transcribe - Speech-to-text latency
  • llm_process - LLM inference time
  • tts_synthesize - Text-to-speech synthesis
  • audio_send - Time to send audio to client
  • ttfa - Time to first audio (end-to-end)

Usage:

from app.core.voice_timing import create_pipeline_timings, PipelineStage

timings = create_pipeline_timings(session_id="abc123")

with timings.time_stage(PipelineStage.STT_TRANSCRIBE):
    transcript = await stt_client.transcribe(audio)

timings.record_ttfa()  # When first audio byte ready
timings.finalize()     # When response complete

SLO Alerts (voice_slo_alerts.yml)

Location: infrastructure/observability/prometheus/rules/voice_slo_alerts.yml

SLO targets with Prometheus alerting rules:

SLO                  | Target  | Alert
TTFA P95             | < 200ms | VoiceTTFASLOViolation
STT Latency P95      | < 300ms | VoiceSTTLatencySLOViolation
TTS First Chunk P95  | < 200ms | VoiceTTSFirstChunkSLOViolation
Connection Time P95  | < 500ms | VoiceConnectionTimeSLOViolation
Error Rate           | < 1%    | VoiceErrorRateHigh
Session Success Rate | > 95%   | VoiceSessionSuccessRateLow

Client Telemetry (voiceTelemetry.ts)

Location: apps/web-app/src/lib/voiceTelemetry.ts

Frontend telemetry with:

  • Network quality assessment via Network Information API
  • Browser performance metrics via Performance.memory API
  • Jitter estimation for network quality
  • Batched reporting (10s intervals)
  • Beacon API for reliable delivery on page unload

import { getVoiceTelemetry } from "@/lib/voiceTelemetry";

const telemetry = getVoiceTelemetry();
telemetry.startSession(sessionId);
telemetry.recordLatency("stt", 150);
telemetry.recordLatency("ttfa", 180);
telemetry.endSession();

Voice Health Endpoint (/health/voice)

Location: services/api-gateway/app/api/health.py

Comprehensive voice subsystem health check:

curl https://assist.asimo.io/health/voice

Response:

{ "status": "healthy", "providers": { "openai": { "status": "up", "latency_ms": 120.5 }, "elevenlabs": { "status": "up", "latency_ms": 85.2 }, "deepgram": { "status": "up", "latency_ms": 95.8 } }, "session_store": { "status": "up", "active_sessions": 5 }, "metrics": { "active_sessions": 5 }, "slo": { "ttfa_target_ms": 200, "error_rate_target": 0.01 } }

Debug Logging Configuration

Location: services/api-gateway/app/core/logging.py

Configurable voice log verbosity via VOICE_LOG_LEVEL environment variable:

Level    | Content
MINIMAL  | Errors only
STANDARD | + Session lifecycle (start/end/state changes)
VERBOSE  | + All latency measurements
DEBUG    | + Audio frame details, chunk timing

Usage:

from app.core.logging import get_voice_logger

voice_log = get_voice_logger(__name__)
voice_log.session_start(session_id="abc123", provider="thinker_talker")
voice_log.latency("stt_transcribe", 150.5, session_id="abc123")
voice_log.error("voice_connection_failed", error_code="CONN_001")

Phase 9: Offline & Network Fallback

Implemented: 2025-12-03

The voice pipeline now includes comprehensive offline support and network-aware fallback mechanisms.

Network Monitoring (networkMonitor.ts)

Location: apps/web-app/src/lib/offline/networkMonitor.ts

Continuously monitors network health using multiple signals:

  • Navigator.onLine: Basic online/offline detection
  • Network Information API: Connection type, downlink speed, RTT
  • Health Check Pinging: Periodic /api/health pings for latency measurement

import { getNetworkMonitor } from "@/lib/offline/networkMonitor";

const monitor = getNetworkMonitor();
monitor.subscribe((status) => {
  console.log(`Network quality: ${status.quality}`);
  console.log(`Health check latency: ${status.healthCheckLatencyMs}ms`);
});

Network Quality Levels

Quality   | Latency     | isHealthy | Action
Excellent | < 100ms     | true      | Full cloud processing
Good      | < 200ms     | true      | Full cloud processing
Moderate  | < 500ms     | true      | Cloud with quality warning
Poor      | ≥ 500ms     | variable  | Consider offline fallback
Offline   | Unreachable | false     | Automatic offline fallback
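
To make the mapping concrete, here is a hypothetical classifier that mirrors the table above; the real logic and thresholds live in networkMonitor.ts and are configurable (see Configuration below).

// Sketch only: classify quality from online state, health, and measured latency.
type NetworkQuality = "excellent" | "good" | "moderate" | "poor" | "offline";

function classifyQuality(
  isOnline: boolean,
  isHealthy: boolean,
  latencyMs: number | null
): NetworkQuality {
  if (!isOnline || latencyMs === null) return "offline";
  if (isHealthy && latencyMs < 100) return "excellent";
  if (isHealthy && latencyMs < 200) return "good";
  if (isHealthy && latencyMs < 500) return "moderate";
  return "poor";
}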

Configuration

const monitor = createNetworkMonitor({
  healthCheckUrl: "/api/health",
  healthCheckIntervalMs: 30000, // 30 seconds
  healthCheckTimeoutMs: 5000,   // 5 seconds
  goodLatencyThresholdMs: 100,
  moderateLatencyThresholdMs: 200,
  poorLatencyThresholdMs: 500,
  failuresBeforeUnhealthy: 3,
});

useNetworkStatus Hook

Location: apps/web-app/src/hooks/useNetworkStatus.ts

React hook providing network status with computed properties:

const {
  isOnline,
  isHealthy,
  quality,
  healthCheckLatencyMs,
  effectiveType,      // "4g", "3g", "2g", "slow-2g"
  downlink,           // Mbps
  rtt,                // Round-trip time ms
  isSuitableForVoice, // quality >= "good" && isHealthy
  shouldUseOffline,   // !isOnline || !isHealthy || quality < "moderate"
  qualityScore,       // 0-4 (offline=0, poor=1, moderate=2, good=3, excellent=4)
  checkNow,           // Force immediate health check
} = useNetworkStatus();

Offline VAD with Network Fallback

Location: apps/web-app/src/hooks/useOfflineVAD.ts

The useOfflineVADWithFallback hook automatically switches between network and offline VAD:

const {
  isListening,
  isSpeaking,
  currentEnergy,
  isUsingOfflineVAD, // Currently using offline mode?
  networkAvailable,
  networkQuality,
  modeReason,        // "network_vad" | "network_unavailable" | "poor_quality" | "forced_offline"
  forceOffline,      // Manually switch to offline
  forceNetwork,      // Manually switch to network (if available)
  startListening,
  stopListening,
} = useOfflineVADWithFallback({
  useNetworkMonitor: true,
  minNetworkQuality: "moderate",
  networkRecoveryDelayMs: 2000, // Prevent flapping
  onFallbackToOffline: () => console.log("Switched to offline VAD"),
  onReturnToNetwork: () => console.log("Returned to network VAD"),
});

Fallback Decision Flow

┌────────────────────┐
│  Network Monitor   │
│  Health Check      │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐     NO     ┌────────────────────┐
│  Is Online?        │──────────▶│  Use Offline VAD   │
└─────────┬──────────┘            └────────────────────┘
          │ YES
          ▼
┌────────────────────┐     NO     ┌────────────────────┐
│  Is Healthy?       │──────────▶│  Use Offline VAD   │
│  (3+ checks pass)  │            │  reason: unhealthy │
└─────────┬──────────┘            └────────────────────┘
          │ YES
          ▼
┌────────────────────┐     NO     ┌────────────────────┐
│  Quality ≥ Min?    │──────────▶│  Use Offline VAD   │
│  (e.g., moderate)  │            │  reason: poor_qual │
└─────────┬──────────┘            └────────────────────┘
          │ YES
          ▼
┌────────────────────┐
│  Use Network VAD   │
│  (cloud processing)│
└────────────────────┘

TTS Caching (useTTSCache)

Location: apps/web-app/src/hooks/useOfflineVAD.ts

Caches synthesized TTS audio for offline playback:

const {
  getTTS,   // Get audio (from cache or fresh)
  preload,  // Preload common phrases
  isCached, // Check if text is cached
  stats,    // { entryCount, sizeMB, hitRate }
  clear,    // Clear cache
} = useTTSCache({
  voice: "alloy",
  maxSizeMB: 50,
  ttsFunction: async (text) => synthesizeAudio(text),
});

// Preload common phrases on app start
await preload(); // Caches "I'm listening", "Go ahead", etc.

// Get TTS (cache hit = instant, cache miss = synthesize + cache)
const audio = await getTTS("Hello world");

User Settings Integration

Phase 9 settings are stored in voiceSettingsStore:

Setting               | Default | Description
enableOfflineFallback | true    | Auto-switch to offline when network poor
preferOfflineVAD      | false   | Force offline VAD (privacy mode)
ttsCacheEnabled       | true    | Enable TTS response caching

File Reference (Phase 9)

File                                                          | Purpose
apps/web-app/src/lib/offline/networkMonitor.ts                | Network health monitoring
apps/web-app/src/lib/offline/webrtcVAD.ts                     | WebRTC-based offline VAD
apps/web-app/src/lib/offline/types.ts                         | Offline module type definitions
apps/web-app/src/hooks/useNetworkStatus.ts                    | React hook for network status
apps/web-app/src/hooks/useOfflineVAD.ts                       | Offline VAD + TTS cache hooks
apps/web-app/src/lib/offline/__tests__/networkMonitor.test.ts | Network monitor tests

Future Work

  • Metrics export to backend: Send metrics to backend for aggregation/alerting ✓ Implemented
  • Barge-in support: Allow user to interrupt AI responses ✓ Implemented (2025-11-28)
  • Audio overlap prevention: Prevent multiple responses playing simultaneously ✓ Implemented (2025-11-28)
  • Per-user voice preferences: Backend persistence for TTS settings ✓ Implemented (2025-11-29)
  • Context-aware voice styles: Auto-detect tone from content ✓ Implemented (2025-11-29)
  • Aggressive latency optimization: 200ms VAD, 256-sample chunks, 300ms reconnect ✓ Implemented (2025-11-29)
  • Observability & Monitoring (Phase 3): Error taxonomy, metrics, SLO alerts, telemetry ✓ Implemented (2025-12-02)
  • Phase 7: Multilingual Support: Auto language detection, accent profiles, language switch confidence ✓ Implemented (2025-12-03)
  • Phase 8: Voice Calibration: Personalized VAD thresholds, calibration wizard, adaptive learning ✓ Implemented (2025-12-03)
  • Phase 9: Offline Fallback: Network monitoring, offline VAD, TTS caching, quality-based switching ✓ Implemented (2025-12-03)
  • Phase 10: Conversation Intelligence: Sentiment tracking, discourse analysis, response recommendations ✓ Implemented (2025-12-03)

Voice Mode Enhancement - 10 Phase Plan ✅ COMPLETE (2025-12-03)

A comprehensive enhancement that transforms voice mode into a human-like conversational partner and adds medical dictation:

  • Phase 1: Emotional Intelligence (Hume AI) ✓ Complete
  • Phase 2: Backchanneling System ✓ Complete
  • Phase 3: Prosody Analysis ✓ Complete
  • Phase 4: Memory & Context System ✓ Complete
  • Phase 5: Advanced Turn-Taking ✓ Complete
  • Phase 6: Variable Response Timing ✓ Complete
  • Phase 7: Conversational Repair ✓ Complete
  • Phase 8: Medical Dictation Core ✓ Complete
  • Phase 9: Patient Context Integration ✓ Complete
  • Phase 10: Frontend Integration & Analytics ✓ Complete

Full documentation: VOICE_MODE_ENHANCEMENT_10_PHASE.md

Remaining Tasks

  • Voice→chat transcript content E2E: Test actual transcript content in chat timeline
  • Error tracking integration: Send errors to Sentry/similar
  • Audio level visualization: Show real-time audio level meter during recording