# Voice Mode Pipeline

> **Status**: Production-ready
> **Last Updated**: 2025-12-03

This document describes the unified Voice Mode pipeline architecture, data flow, metrics, and testing strategy. It serves as the canonical reference for developers working on real-time voice features.

## Voice Pipeline Modes

VoiceAssist supports **two voice pipeline modes**:

| Mode                             | Description                    | Best For                                       |
| -------------------------------- | ------------------------------ | ---------------------------------------------- |
| **Thinker-Talker** (Recommended) | Local STT → LLM → TTS pipeline | Full tool support, unified context, custom TTS |
| **OpenAI Realtime** (Legacy)     | Direct OpenAI Realtime API     | Quick setup, minimal backend changes           |

### Thinker-Talker Pipeline (Primary)

The Thinker-Talker pipeline is the recommended approach, providing:

- **Unified conversation context** between voice and chat modes
- **Full tool/RAG support** in voice interactions
- **Custom TTS** via ElevenLabs with premium voices
- **Lower cost** per interaction

**Documentation:** [THINKER_TALKER_PIPELINE.md](THINKER_TALKER_PIPELINE.md)

```
[Audio] → [Deepgram STT] → [GPT-4o Thinker] → [ElevenLabs TTS] → [Audio Out]
                │                 │                  │
           Transcripts        Tool Calls        Audio Chunks
                │                 │                  │
                └───────── WebSocket Handler ────────┘
```
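
As a rough sketch of how these stages compose (not the actual `ThinkerService`/`TalkerService` code, which lives in the backend Python services; all interfaces and names below are hypothetical), a sentence-chunked turn could look like this:

```typescript
// Hypothetical stage interfaces; the real implementations are the backend
// ThinkerService / TalkerService / SentenceChunker listed below.
interface SttStage {
  transcribe(audio: ArrayBuffer): Promise<string>;
}
interface ThinkerStage {
  respond(transcript: string): AsyncIterable<string>; // streamed tokens
}
interface TalkerStage {
  synthesize(sentence: string): Promise<ArrayBuffer>; // audio chunk
}

// Minimal orchestration: STT → Thinker → sentence chunking → TTS → audio out.
async function runThinkerTalkerTurn(
  audioIn: ArrayBuffer,
  stt: SttStage,
  thinker: ThinkerStage,
  talker: TalkerStage,
  emitAudio: (chunk: ArrayBuffer) => void,
): Promise<void> {
  const transcript = await stt.transcribe(audioIn);

  // Accumulate streamed tokens and flush complete sentences to TTS,
  // so audio playback can start before the full reply is generated.
  let buffer = "";
  for await (const token of thinker.respond(transcript)) {
    buffer += token;
    const match = buffer.match(/^(.+?[.!?])\s+(.*)$/s);
    if (match) {
      emitAudio(await talker.synthesize(match[1]));
      buffer = match[2];
    }
  }
  if (buffer.trim()) {
    emitAudio(await talker.synthesize(buffer));
  }
}
```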

### OpenAI Realtime API (Legacy)

The original implementation using OpenAI's Realtime API directly. Still supported for backward compatibility.

---

## Implementation Status

### Thinker-Talker Components

| Component             | Status   | Location                                                        |
| --------------------- | -------- | --------------------------------------------------------------- |
| ThinkerService        | **Live** | `app/services/thinker_service.py`                               |
| TalkerService         | **Live** | `app/services/talker_service.py`                                |
| VoicePipelineService  | **Live** | `app/services/voice_pipeline_service.py`                        |
| T/T WebSocket Handler | **Live** | `app/services/thinker_talker_websocket_handler.py`              |
| SentenceChunker       | **Live** | `app/services/sentence_chunker.py`                              |
| Frontend T/T hook     | **Live** | `apps/web-app/src/hooks/useThinkerTalkerSession.ts`             |
| T/T Audio Playback    | **Live** | `apps/web-app/src/hooks/useTTAudioPlayback.ts`                  |
| T/T Voice Panel       | **Live** | `apps/web-app/src/components/voice/ThinkerTalkerVoicePanel.tsx` |

### OpenAI Realtime Components (Legacy)

| Component                  | Status      | Location                                               |
| -------------------------- | ----------- | ------------------------------------------------------ |
| Backend session endpoint   | **Live**    | `services/api-gateway/app/api/voice.py`                |
| Ephemeral token generation | **Live**    | `app/services/realtime_voice_service.py`               |
| Voice metrics endpoint     | **Live**    | `POST /api/voice/metrics`                              |
| Frontend voice hook        | **Live**    | `apps/web-app/src/hooks/useRealtimeVoiceSession.ts`    |
| Voice settings store       | **Live**    | `apps/web-app/src/stores/voiceSettingsStore.ts`        |
| Voice UI panel             | **Live**    | `apps/web-app/src/components/voice/VoiceModePanel.tsx` |
| Chat timeline integration  | **Live**    | Voice messages appear in chat                          |
| Barge-in support           | **Live**    | `response.cancel` + `onSpeechStarted` callback         |
| Audio overlap prevention   | **Live**    | Response ID tracking + `isProcessingResponseRef`       |
| E2E test suite             | **Passing** | 95 tests across unit/integration/E2E                   |

> **Full status:** See [Implementation Status](overview/IMPLEMENTATION_STATUS.md) for all components.

## Overview

Voice Mode enables real-time voice conversations with the AI assistant using OpenAI's Realtime API.

The pipeline handles:

- **Ephemeral session authentication** (no raw API keys in browser)
- **WebSocket-based bidirectional voice streaming**
- **Voice activity detection (VAD)** with user-configurable sensitivity
- **User settings propagation** (voice, language, VAD threshold)
- **Chat timeline integration** (voice messages appear in chat)
- **Connection state management** with automatic reconnection
- **Barge-in support** (interrupt AI while speaking)
- **Audio playback management** (prevent overlapping responses)
- **Metrics tracking** for observability

## Architecture Diagram

```
FRONTEND
┌─────────────────────┐     ┌─────────────────────┐     ┌───────────────┐
│ VoiceModePanel      │────▶│ useRealtimeVoice    │────▶│ voiceSettings │
│ (UI Component)      │     │ Session (Hook)      │     │ Store         │
│ - Start/Stop        │     │ - connect()         │     │ - voice       │
│ - Status display    │     │ - disconnect()      │     │ - language    │
│ - Metrics logging   │     │ - sendMessage()     │     │ - vadSens     │
└─────────┬───────────┘     └──────────┬──────────┘     └───────────────┘
          │                            │
          │                            │ onUserMessage()/onAssistantMessage()
          ▼                            ▼
┌─────────────────────┐     ┌─────────────────────┐
│ MessageInput        │     │ ChatPage            │
│ - Voice toggle      │────▶│ - useChatSession    │
│ - Panel container   │     │ - addMessage()      │
└─────────────────────┘     └─────────────────────┘
          │
          │ POST /api/voice/realtime-session
          ▼
BACKEND
┌─────────────────────┐     ┌─────────────────────┐
│ voice.py            │────▶│ realtime_voice_     │
│ (FastAPI Router)    │     │ service.py          │
│ - /realtime-session │     │ - generate_session  │
│ - Timing logs       │     │ - ephemeral token   │
└─────────────────────┘     └──────────┬──────────┘
                                       │ POST /v1/realtime/sessions
                                       ▼
                            ┌─────────────────────┐
                            │ OpenAI API          │
                            │ - Ephemeral token   │
                            │ - Voice config      │
                            └─────────────────────┘
          │
          │ WebSocket wss://api.openai.com/v1/realtime
          ▼
OPENAI REALTIME API
- Server-side VAD (voice activity detection)
- Bidirectional audio streaming (PCM16)
- Real-time transcription (Whisper)
- GPT-4o responses with audio synthesis
```

## Backend: `/api/voice/realtime-session`

**Location**: `services/api-gateway/app/api/voice.py`

### Request

```typescript
interface RealtimeSessionRequest {
  conversation_id?: string; // Optional conversation context
  voice?: string; // "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer"
  language?: string; // "en" | "es" | "fr" | "de" | "it" | "pt"
  vad_sensitivity?: number; // 0-100 (maps to threshold: 0→0.9, 100→0.1)
}
```

### Response

```typescript
interface RealtimeSessionResponse {
  url: string; // WebSocket URL: "wss://api.openai.com/v1/realtime"
  model: string; // "gpt-4o-realtime-preview"
  session_id: string; // Unique session identifier
  expires_at: number; // Unix timestamp (epoch seconds)
  conversation_id: string | null;
  auth: {
    type: "ephemeral_token";
    token: string; // Ephemeral token (ek_...), NOT raw API key
    expires_at: number; // Token expiry (5 minutes)
  };
  voice_config: {
    voice: string; // Selected voice
    modalities: ["text", "audio"];
    input_audio_format: "pcm16";
    output_audio_format: "pcm16";
    input_audio_transcription: { model: "whisper-1" };
    turn_detection: {
      type: "server_vad";
      threshold: number; // 0.1 (sensitive) to 0.9 (insensitive)
      prefix_padding_ms: number;
      silence_duration_ms: number;
    };
  };
}
```
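
For orientation, a minimal sketch of how the frontend might request a session and hand the ephemeral token to a WebSocket, using the interfaces above (illustrative only; the production logic lives in `useRealtimeVoiceSession.ts`, and auth headers and error handling are omitted):

```typescript
// Sketch only: request a session from the backend, then connect with the
// ephemeral token (never the raw OpenAI API key).
async function createRealtimeSession(
  body: RealtimeSessionRequest,
): Promise<RealtimeSessionResponse> {
  const res = await fetch("/api/voice/realtime-session", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) {
    throw new Error(`Session request failed: ${res.status}`);
  }
  return res.json();
}

const session = await createRealtimeSession({ voice: "alloy", vad_sensitivity: 50 });
const ws = new WebSocket(session.url, [
  "realtime",
  "openai-beta.realtime-v1",
  `openai-insecure-api-key.${session.auth.token}`,
]);
```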

### VAD Sensitivity Mapping

The frontend uses a 0-100 scale for user-friendly VAD sensitivity:

| User Setting | VAD Threshold | Behavior                             |
| ------------ | ------------- | ------------------------------------ |
| 0 (Low)      | 0.9           | Requires loud/clear speech           |
| 50 (Medium)  | 0.5           | Balanced detection                   |
| 100 (High)   | 0.1           | Very sensitive, picks up soft speech |

**Formula**: `threshold = 0.9 - (vad_sensitivity / 100 * 0.8)`
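
Expressed as code (illustrative only; the actual conversion happens in the backend session service), the mapping is a straight linear interpolation:

```typescript
// Map the user-facing 0-100 sensitivity to OpenAI's server_vad threshold.
// 0 → 0.9 (least sensitive), 50 → 0.5, 100 → 0.1 (most sensitive).
function vadSensitivityToThreshold(vadSensitivity: number): number {
  const clamped = Math.min(100, Math.max(0, vadSensitivity));
  return 0.9 - (clamped / 100) * 0.8;
}

vadSensitivityToThreshold(75); // 0.3
```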

### Observability

Backend logs timing and context for each session request:

```python
# Request logging
logger.info(
    f"Creating Realtime session for user {current_user.id}",
    extra={
        "user_id": current_user.id,
        "conversation_id": request.conversation_id,
        "voice": request.voice,
        "language": request.language,
        "vad_sensitivity": request.vad_sensitivity,
    },
)

# Success logging with duration
duration_ms = int((time.monotonic() - start_time) * 1000)
logger.info(
    f"Realtime session created for user {current_user.id}",
    extra={
        "user_id": current_user.id,
        "session_id": config["session_id"],
        "voice": config.get("voice_config", {}).get("voice"),
        "duration_ms": duration_ms,
    },
)
```

## Frontend Hook: `useRealtimeVoiceSession`

**Location**: `apps/web-app/src/hooks/useRealtimeVoiceSession.ts`

### Usage

```typescript
const {
  status, // 'disconnected' | 'connecting' | 'connected' | 'reconnecting' | 'failed' | 'expired' | 'error'
  transcript, // Current transcript text
  isSpeaking, // Is the AI currently speaking?
  isConnected, // Derived: status === 'connected'
  isConnecting, // Derived: status === 'connecting' || 'reconnecting'
  canSend, // Can send messages?
  error, // Error message if any
  metrics, // VoiceMetrics object
  connect, // () => Promise - start session
  disconnect, // () => void - end session
  sendMessage, // (text: string) => void - send text message
} = useRealtimeVoiceSession({
  conversationId,
  voice, // From voiceSettingsStore
  language, // From voiceSettingsStore
  vadSensitivity, // From voiceSettingsStore (0-100)
  onConnected, // Callback when connected
  onDisconnected, // Callback when disconnected
  onError, // Callback on error
  onUserMessage, // Callback with user transcript
  onAssistantMessage, // Callback with AI response
  onMetricsUpdate, // Callback when metrics change
});
```

### Connection States

```
disconnected ──▶ connecting ──▶ connected
                     │              │
                     ▼              ▼
                  failed ◀──── reconnecting
                     │              │
                     ▼              ▼
                 expired ◀────── error
```

| State          | Description                                       |
| -------------- | ------------------------------------------------- |
| `disconnected` | Initial/idle state                                |
| `connecting`   | Fetching session config, establishing WebSocket   |
| `connected`    | Active voice session                              |
| `reconnecting` | Auto-reconnect after temporary disconnect         |
| `failed`       | Connection failed (backend error, network issue)  |
| `expired`      | Session token expired (needs manual restart)      |
| `error`        | General error state                               |

### WebSocket Connection

The hook connects using three WebSocket subprotocols for authentication:

```typescript
const ws = new WebSocket(url, ["realtime", "openai-beta.realtime-v1", `openai-insecure-api-key.${ephemeralToken}`]);
```

## Voice Settings Store

**Location**: `apps/web-app/src/stores/voiceSettingsStore.ts`

### Schema

```typescript
interface VoiceSettings {
  voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer";
  language: "en" | "es" | "fr" | "de" | "it" | "pt";
  vadSensitivity: number; // 0-100
  autoStartOnOpen: boolean; // Auto-start voice when panel opens
  showStatusHints: boolean; // Show helper text in UI
}
```

### Persistence

Settings are persisted to `localStorage` under key `voiceassist-voice-settings` using Zustand's persist middleware.

### Defaults

| Setting         | Default |
| --------------- | ------- |
| voice           | "alloy" |
| language        | "en"    |
| vadSensitivity  | 50      |
| autoStartOnOpen | false   |
| showStatusHints | true    |
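
A minimal sketch of what the store setup looks like with Zustand's persist middleware, reusing the `VoiceSettings` interface above (the actual actions and defaults live in `voiceSettingsStore.ts`; the setter names here are hypothetical):

```typescript
import { create } from "zustand";
import { persist } from "zustand/middleware";

interface VoiceSettingsState extends VoiceSettings {
  setVoice: (voice: VoiceSettings["voice"]) => void;
  setVadSensitivity: (value: number) => void;
}

// Persisted under the "voiceassist-voice-settings" localStorage key.
export const useVoiceSettingsStore = create<VoiceSettingsState>()(
  persist(
    (set) => ({
      voice: "alloy",
      language: "en",
      vadSensitivity: 50,
      autoStartOnOpen: false,
      showStatusHints: true,
      setVoice: (voice) => set({ voice }),
      setVadSensitivity: (vadSensitivity) => set({ vadSensitivity }),
    }),
    { name: "voiceassist-voice-settings" },
  ),
);
```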

## Chat Integration

**Location**: `apps/web-app/src/pages/ChatPage.tsx`

### Message Flow

1. **User speaks** → VoiceModePanel receives final transcript
2. VoiceModePanel calls `onUserMessage(transcript)`
3. ChatPage receives callback, calls `useChatSession.addMessage()`
4. Message added to timeline with `metadata: { source: "voice" }`

```typescript
// ChatPage.tsx
const handleVoiceUserMessage = (content: string) => {
  addMessage({
    role: "user",
    content,
    metadata: { source: "voice" },
  });
};

const handleVoiceAssistantMessage = (content: string) => {
  addMessage({
    role: "assistant",
    content,
    metadata: { source: "voice" },
  });
};
```

### Message Structure

```typescript
interface VoiceMessage {
  id: string; // "voice-{timestamp}-{random}"
  role: "user" | "assistant";
  content: string;
  timestamp: number;
  metadata: {
    source: "voice"; // Distinguishes from text messages
  };
}
```

## Barge-in & Audio Playback

**Location**: `apps/web-app/src/components/voice/VoiceModePanel.tsx`, `apps/web-app/src/hooks/useRealtimeVoiceSession.ts`

### Barge-in Flow

When the user starts speaking while the AI is responding, the system immediately:

1. **Detects speech start** via OpenAI's `input_audio_buffer.speech_started` event
2. **Cancels active response** by sending `response.cancel` to OpenAI
3. **Stops audio playback** via `onSpeechStarted` callback
4. **Clears pending responses** to prevent stale audio from playing

```
User speaks → speech_started event → response.cancel → stopCurrentAudio()
                                                               ↓
                                           Audio stops
                                           Queue cleared
                                           Response ID incremented
```

### Response Cancellation

**Location**: `useRealtimeVoiceSession.ts` - `handleRealtimeMessage`

```typescript
case "input_audio_buffer.speech_started":
  setIsSpeaking(true);
  setPartialTranscript("");

  // Barge-in: Cancel any active response when user starts speaking
  if (activeResponseIdRef.current && wsRef.current?.readyState === WebSocket.OPEN) {
    wsRef.current.send(JSON.stringify({ type: "response.cancel" }));
    activeResponseIdRef.current = null;
  }

  // Notify parent to stop audio playback
  options.onSpeechStarted?.();
  break;
```

### Audio Playback Management

**Location**: `VoiceModePanel.tsx`

The panel tracks audio playback state to prevent overlapping responses:

```typescript
// Track currently playing Audio element
const currentAudioRef = useRef<HTMLAudioElement | null>(null);

// Prevent overlapping response processing
const isProcessingResponseRef = useRef(false);

// Response ID to invalidate stale responses after barge-in
const currentResponseIdRef = useRef(0);
```

**Stop current audio function:**

```typescript
const stopCurrentAudio = useCallback(() => {
  if (currentAudioRef.current) {
    currentAudioRef.current.pause();
    currentAudioRef.current.currentTime = 0;
    if (currentAudioRef.current.src.startsWith("blob:")) {
      URL.revokeObjectURL(currentAudioRef.current.src);
    }
    currentAudioRef.current = null;
  }
  audioQueueRef.current = [];
  isPlayingRef.current = false;
  currentResponseIdRef.current++; // Invalidate pending responses
  isProcessingResponseRef.current = false;
}, []);
```

### Overlap Prevention

When a relay result arrives, the handler checks:

1. **Already processing?** Skip if `isProcessingResponseRef.current === true`
2. **Response ID valid?** Skip playback if ID changed (barge-in occurred)

```typescript
onRelayResult: async ({ answer }) => {
  if (answer) {
    // Prevent overlapping responses
    if (isProcessingResponseRef.current) {
      console.log("[VoiceModePanel] Skipping response - already processing another");
      return;
    }

    const responseId = ++currentResponseIdRef.current;
    isProcessingResponseRef.current = true;

    // ... synthesis and playback ...

    // Check if response is still valid before playback
    if (responseId !== currentResponseIdRef.current) {
      console.log("[VoiceModePanel] Response cancelled - skipping playback");
      return;
    }
  }
};
```

### Error Handling

Benign cancellation errors (e.g., "Cancellation failed: no active response found") are handled gracefully:

```typescript
case "error": {
  const errorMessage = message.error?.message || "Realtime API error";

  // Ignore benign cancellation errors
  if (
    errorMessage.includes("Cancellation failed") ||
    errorMessage.includes("no active response")
  ) {
    voiceLog.debug(`Ignoring benign error: ${errorMessage}`);
    break;
  }

  handleError(new Error(errorMessage));
  break;
}
```

## Metrics

**Location**: `apps/web-app/src/hooks/useRealtimeVoiceSession.ts`

### VoiceMetrics Interface

```typescript
interface VoiceMetrics {
  connectionTimeMs: number | null; // Time to establish connection
  timeToFirstTranscriptMs: number | null; // Time to first user transcript
  lastSttLatencyMs: number | null; // Speech-to-text latency
  lastResponseLatencyMs: number | null; // AI response latency
  sessionDurationMs: number | null; // Total session duration
  userTranscriptCount: number; // Number of user turns
  aiResponseCount: number; // Number of AI turns
  reconnectCount: number; // Number of reconnections
  sessionStartedAt: number | null; // Session start timestamp
}
```

### Frontend Logging

VoiceModePanel logs key metrics to console:

```typescript
// Connection time
console.log(`[VoiceModePanel] voice_session_connect_ms=${metrics.connectionTimeMs}`);

// STT latency
console.log(`[VoiceModePanel] voice_stt_latency_ms=${metrics.lastSttLatencyMs}`);

// Response latency
console.log(`[VoiceModePanel] voice_first_reply_ms=${metrics.lastResponseLatencyMs}`);

// Session duration
console.log(`[VoiceModePanel] voice_session_duration_ms=${metrics.sessionDurationMs}`);
```

### Consuming Metrics

Developers can hook into metrics via the `onMetricsUpdate` callback:

```typescript
useRealtimeVoiceSession({
  onMetricsUpdate: (metrics) => {
    // Send to telemetry service
    analytics.track("voice_session_metrics", {
      connection_ms: metrics.connectionTimeMs,
      stt_latency_ms: metrics.lastSttLatencyMs,
      response_latency_ms: metrics.lastResponseLatencyMs,
      duration_ms: metrics.sessionDurationMs,
    });
  },
});
```

### Metrics Export to Backend

Metrics can be automatically exported to the backend for aggregation and alerting.

**Backend Endpoint**: `POST /api/voice/metrics`

**Location**: `services/api-gateway/app/api/voice.py`

#### Request Schema

```typescript
interface VoiceMetricsPayload {
  conversation_id?: string;
  connection_time_ms?: number;
  time_to_first_transcript_ms?: number;
  last_stt_latency_ms?: number;
  last_response_latency_ms?: number;
  session_duration_ms?: number;
  user_transcript_count: number;
  ai_response_count: number;
  reconnect_count: number;
  session_started_at?: number;
}
```

#### Response

```typescript
interface VoiceMetricsResponse {
  status: "ok";
}
```

#### Privacy

**No PHI or transcript content is sent.** Only timing metrics and counts.

#### Frontend Configuration

Metrics export is controlled by environment variables:

- **Production** (`import.meta.env.PROD`): Metrics sent automatically
- **Development**: Set `VITE_ENABLE_VOICE_METRICS=true` to enable

The export uses `navigator.sendBeacon()` for reliability (survives page navigation).
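
A sketch of what that export might look like (illustrative; the actual export lives in the hook/panel code, and the payload mirrors `VoiceMetricsPayload` above):

```typescript
// Fire-and-forget export of timing metrics; sendBeacon queues the request
// even if the page is being unloaded.
function exportVoiceMetrics(payload: VoiceMetricsPayload): void {
  const enabled =
    import.meta.env.PROD || import.meta.env.VITE_ENABLE_VOICE_METRICS === "true";
  if (!enabled) return;

  const blob = new Blob([JSON.stringify(payload)], { type: "application/json" });
  navigator.sendBeacon("/api/voice/metrics", blob);
}
```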

#### Backend Logging

Metrics are logged with user context:

```python
logger.info(
    "VoiceMetrics received",
    extra={
        "user_id": current_user.id,
        "conversation_id": payload.conversation_id,
        "connection_time_ms": payload.connection_time_ms,
        "session_duration_ms": payload.session_duration_ms,
        ...
    },
)
```

#### Testing

```bash
# Backend
cd /home/asimo/VoiceAssist/services/api-gateway
source venv/bin/activate && export PYTHONPATH=.
python -m pytest tests/integration/test_voice_metrics.py -v
```

## Security

### Ephemeral Token Architecture

**CRITICAL**: The browser NEVER receives the raw OpenAI API key.

1. Backend holds `OPENAI_API_KEY` securely
2. Frontend requests session via `/api/voice/realtime-session`
3. Backend creates ephemeral token via OpenAI `/v1/realtime/sessions`
4. Ephemeral token returned to frontend (valid ~5 minutes)
5. Frontend connects WebSocket using ephemeral token

### Token Refresh

The hook monitors `session.expires_at` and can trigger refresh before expiry. If the token expires mid-session, status transitions to `expired`.
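
As a rough sketch of that expiry handling (hypothetical helper names; the real logic is inside `useRealtimeVoiceSession.ts`):

```typescript
// Schedule a refresh shortly before the ephemeral token expires.
function scheduleTokenRefresh(
  expiresAt: number, // epoch seconds, as returned by /api/voice/realtime-session
  refresh: () => Promise<void>,
  onExpired: () => void,
  marginMs = 30_000,
): ReturnType<typeof setTimeout> | null {
  const msUntilRefresh = expiresAt * 1000 - Date.now() - marginMs;
  if (msUntilRefresh <= 0) {
    onExpired(); // Already (or nearly) expired: surface the "expired" state
    return null;
  }
  // Refresh a bit before expiry; if refreshing fails, fall back to "expired".
  return setTimeout(() => {
    refresh().catch(onExpired);
  }, msUntilRefresh);
}
```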

## Testing

### Voice Pipeline Smoke Suite

Run these commands to validate the voice pipeline:

```bash
# 1. Backend tests (CI-safe, mocked)
cd /home/asimo/VoiceAssist/services/api-gateway
source venv/bin/activate
export PYTHONPATH=.
python -m pytest tests/integration/test_openai_config.py -v

# 2. Frontend unit tests (run individually to avoid OOM)
cd /home/asimo/VoiceAssist/apps/web-app
export NODE_OPTIONS="--max-old-space-size=768"
npx vitest run src/hooks/__tests__/useRealtimeVoiceSession.test.ts --reporter=dot
npx vitest run src/hooks/__tests__/useChatSession-voice-integration.test.ts --reporter=dot
npx vitest run src/stores/__tests__/voiceSettingsStore.test.ts --reporter=dot
npx vitest run src/components/voice/__tests__/VoiceModeSettings.test.tsx --reporter=dot
npx vitest run src/components/chat/__tests__/MessageInput-voice-settings.test.tsx --reporter=dot

# 3. E2E tests (Chromium, mocked backend)
cd /home/asimo/VoiceAssist
npx playwright test \
  e2e/voice-mode-navigation.spec.ts \
  e2e/voice-mode-session-smoke.spec.ts \
  e2e/voice-mode-voice-chat-integration.spec.ts \
  --project=chromium --reporter=list
```

### Test Coverage Summary

| Test File                                 | Tests | Coverage                          |
| ----------------------------------------- | ----- | --------------------------------- |
| useRealtimeVoiceSession.test.ts           | 22    | Hook lifecycle, states, metrics   |
| useChatSession-voice-integration.test.ts  | 8     | Message structure validation      |
| voiceSettingsStore.test.ts                | 17    | Store actions, persistence        |
| VoiceModeSettings.test.tsx                | 25    | Component rendering, interactions |
| MessageInput-voice-settings.test.tsx      | 12    | Integration with chat input       |
| voice-mode-navigation.spec.ts             | 4     | E2E navigation flow               |
| voice-mode-session-smoke.spec.ts          | 3     | E2E session smoke (1 live gated)  |
| voice-mode-voice-chat-integration.spec.ts | 4     | E2E panel integration             |

**Total: 95 tests**

### Live Testing

To test with real OpenAI backend:

```bash
# Backend (requires OPENAI_API_KEY in .env)
LIVE_REALTIME_TESTS=1 python -m pytest tests/integration/test_openai_config.py -v

# E2E (requires running backend + valid API key)
LIVE_REALTIME_E2E=1 npx playwright test e2e/voice-mode-session-smoke.spec.ts
```

## File Reference

### Backend

| File                                                            | Purpose                            |
| --------------------------------------------------------------- | ---------------------------------- |
| `services/api-gateway/app/api/voice.py`                         | API routes, metrics, timing logs   |
| `services/api-gateway/app/services/realtime_voice_service.py`   | Session creation, token generation |
| `services/api-gateway/tests/integration/test_openai_config.py`  | Integration tests                  |
| `services/api-gateway/tests/integration/test_voice_metrics.py`  | Metrics endpoint tests             |

### Frontend

| File                                                      | Purpose                   |
| --------------------------------------------------------- | ------------------------- |
| `apps/web-app/src/hooks/useRealtimeVoiceSession.ts`       | Core hook                 |
| `apps/web-app/src/components/voice/VoiceModePanel.tsx`    | UI panel                  |
| `apps/web-app/src/components/voice/VoiceModeSettings.tsx` | Settings modal            |
| `apps/web-app/src/stores/voiceSettingsStore.ts`           | Settings store            |
| `apps/web-app/src/components/chat/MessageInput.tsx`       | Voice button integration  |
| `apps/web-app/src/pages/ChatPage.tsx`                     | Chat timeline integration |
| `apps/web-app/src/hooks/useChatSession.ts`                | addMessage() helper       |

### Tests

| File                                                                               | Purpose               |
| ---------------------------------------------------------------------------------- | --------------------- |
| `apps/web-app/src/hooks/__tests__/useRealtimeVoiceSession.test.ts`                 | Hook tests            |
| `apps/web-app/src/hooks/__tests__/useChatSession-voice-integration.test.ts`        | Chat integration      |
| `apps/web-app/src/stores/__tests__/voiceSettingsStore.test.ts`                     | Store tests           |
| `apps/web-app/src/components/voice/__tests__/VoiceModeSettings.test.tsx`           | Component tests       |
| `apps/web-app/src/components/chat/__tests__/MessageInput-voice-settings.test.tsx`  | Integration tests     |
| `e2e/voice-mode-navigation.spec.ts`                                                | E2E navigation        |
| `e2e/voice-mode-session-smoke.spec.ts`                                             | E2E smoke test        |
| `e2e/voice-mode-voice-chat-integration.spec.ts`                                    | E2E panel integration |

## Related Documentation

- [VOICE_MODE_ENHANCEMENT_10_PHASE.md](./VOICE_MODE_ENHANCEMENT_10_PHASE.md) - **10-phase enhancement plan (emotion, dictation, analytics)**
- [VOICE_MODE_SETTINGS_GUIDE.md](./VOICE_MODE_SETTINGS_GUIDE.md) - User settings configuration
- [TESTING_GUIDE.md](./TESTING_GUIDE.md) - E2E testing strategy and validation checklist

## Observability & Monitoring (Phase 3)

**Implemented:** 2025-12-02

The voice pipeline includes comprehensive observability features for production monitoring.

### Error Taxonomy (`voice_errors.py`)

Location: `services/api-gateway/app/core/voice_errors.py`

Structured error classification with 8 categories and 40+ error codes:

| Category   | Codes          | Description                    |
| ---------- | -------------- | ------------------------------ |
| CONNECTION | CONN_001-7     | WebSocket, network failures    |
| STT        | STT_001-7      | Speech-to-text errors          |
| TTS        | TTS_001-7      | Text-to-speech errors          |
| LLM        | LLM_001-6      | LLM processing errors          |
| AUDIO      | AUDIO_001-6    | Audio encoding/decoding errors |
| TIMEOUT    | TIMEOUT_001-7  | Various timeout conditions     |
| PROVIDER   | PROVIDER_001-6 | External provider errors       |
| INTERNAL   | INTERNAL_001-5 | Internal server errors         |

Each error code includes:

- Recoverability flag (can auto-retry)
- Retry configuration (delay, max attempts)
- User-friendly description

### Voice Metrics (`metrics.py`)

Location: `services/api-gateway/app/core/metrics.py`

Prometheus metrics for voice pipeline monitoring:

| Metric                                  | Type      | Labels                                | Description            |
| --------------------------------------- | --------- | ------------------------------------- | ---------------------- |
| `voice_errors_total`                    | Counter   | category, code, provider, recoverable | Total voice errors     |
| `voice_pipeline_stage_latency_seconds`  | Histogram | stage                                 | Per-stage latency      |
| `voice_ttfa_seconds`                    | Histogram | -                                     | Time to first audio    |
| `voice_active_sessions`                 | Gauge     | -                                     | Active voice sessions  |
| `voice_barge_in_total`                  | Counter   | -                                     | Barge-in events        |
| `voice_audio_chunks_total`              | Counter   | status                                | Audio chunks processed |

### Per-Stage Latency Tracking (`voice_timing.py`)

Location: `services/api-gateway/app/core/voice_timing.py`

Pipeline stages tracked:

- `audio_receive` - Time to receive audio from client
- `vad_process` - Voice activity detection time
- `stt_transcribe` - Speech-to-text latency
- `llm_process` - LLM inference time
- `tts_synthesize` - Text-to-speech synthesis
- `audio_send` - Time to send audio to client
- `ttfa` - Time to first audio (end-to-end)

Usage:

```python
from app.core.voice_timing import create_pipeline_timings, PipelineStage

timings = create_pipeline_timings(session_id="abc123")

with timings.time_stage(PipelineStage.STT_TRANSCRIBE):
    transcript = await stt_client.transcribe(audio)

timings.record_ttfa()  # When first audio byte ready
timings.finalize()  # When response complete
```

### SLO Alerts (`voice_slo_alerts.yml`)

Location: `infrastructure/observability/prometheus/rules/voice_slo_alerts.yml`

SLO targets with Prometheus alerting rules:

| SLO                  | Target  | Alert                           |
| -------------------- | ------- | ------------------------------- |
| TTFA P95             | < 200ms | VoiceTTFASLOViolation           |
| STT Latency P95      | < 300ms | VoiceSTTLatencySLOViolation     |
| TTS First Chunk P95  | < 200ms | VoiceTTSFirstChunkSLOViolation  |
| Connection Time P95  | < 500ms | VoiceConnectionTimeSLOViolation |
| Error Rate           | < 1%    | VoiceErrorRateHigh              |
| Session Success Rate | > 95%   | VoiceSessionSuccessRateLow      |

### Client Telemetry (`voiceTelemetry.ts`)

Location: `apps/web-app/src/lib/voiceTelemetry.ts`

Frontend telemetry with:

- **Network quality assessment** via Network Information API
- **Browser performance metrics** via Performance.memory API
- **Jitter estimation** for network quality
- **Batched reporting** (10s intervals)
- **Beacon API** for reliable delivery on page unload

```typescript
import { getVoiceTelemetry } from "@/lib/voiceTelemetry";

const telemetry = getVoiceTelemetry();
telemetry.startSession(sessionId);
telemetry.recordLatency("stt", 150);
telemetry.recordLatency("ttfa", 180);
telemetry.endSession();
```

### Voice Health Endpoint (`/health/voice`)

Location: `services/api-gateway/app/api/health.py`

Comprehensive voice subsystem health check:

```bash
curl https://assist.asimo.io/health/voice
```

Response:

```json
{
  "status": "healthy",
  "providers": {
    "openai": { "status": "up", "latency_ms": 120.5 },
    "elevenlabs": { "status": "up", "latency_ms": 85.2 },
    "deepgram": { "status": "up", "latency_ms": 95.8 }
  },
  "session_store": { "status": "up", "active_sessions": 5 },
  "metrics": { "active_sessions": 5 },
  "slo": { "ttfa_target_ms": 200, "error_rate_target": 0.01 }
}
```

### Debug Logging Configuration

Location: `services/api-gateway/app/core/logging.py`

Configurable voice log verbosity via `VOICE_LOG_LEVEL` environment variable:

| Level    | Content                                       |
| -------- | --------------------------------------------- |
| MINIMAL  | Errors only                                   |
| STANDARD | + Session lifecycle (start/end/state changes) |
| VERBOSE  | + All latency measurements                    |
| DEBUG    | + Audio frame details, chunk timing           |

Usage:

```python
from app.core.logging import get_voice_logger

voice_log = get_voice_logger(__name__)
voice_log.session_start(session_id="abc123", provider="thinker_talker")
voice_log.latency("stt_transcribe", 150.5, session_id="abc123")
voice_log.error("voice_connection_failed", error_code="CONN_001")
```

---

## Phase 9: Offline & Network Fallback

**Implemented:** 2025-12-03

The voice pipeline now includes comprehensive offline support and network-aware fallback mechanisms.

### Network Monitoring (`networkMonitor.ts`)

Location: `apps/web-app/src/lib/offline/networkMonitor.ts`

Continuously monitors network health using multiple signals:

- **Navigator.onLine**: Basic online/offline detection
- **Network Information API**: Connection type, downlink speed, RTT
- **Health Check Pinging**: Periodic `/api/health` pings for latency measurement

```typescript
import { getNetworkMonitor } from "@/lib/offline/networkMonitor";

const monitor = getNetworkMonitor();
monitor.subscribe((status) => {
  console.log(`Network quality: ${status.quality}`);
  console.log(`Health check latency: ${status.healthCheckLatencyMs}ms`);
});
```

#### Network Quality Levels

| Quality   | Latency     | isHealthy | Action                     |
| --------- | ----------- | --------- | -------------------------- |
| Excellent | < 100ms     | true      | Full cloud processing      |
| Good      | < 200ms     | true      | Full cloud processing      |
| Moderate  | < 500ms     | true      | Cloud with quality warning |
| Poor      | ≥ 500ms     | variable  | Consider offline fallback  |
| Offline   | Unreachable | false     | Automatic offline fallback |

#### Configuration

```typescript
const monitor = createNetworkMonitor({
  healthCheckUrl: "/api/health",
  healthCheckIntervalMs: 30000, // 30 seconds
  healthCheckTimeoutMs: 5000, // 5 seconds
  goodLatencyThresholdMs: 100,
  moderateLatencyThresholdMs: 200,
  poorLatencyThresholdMs: 500,
  failuresBeforeUnhealthy: 3,
});
```
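
Reading the quality table and thresholds together, the classification boils down to a simple mapping. The sketch below is illustrative only (the real logic lives in `networkMonitor.ts`, and the `NetworkQuality` type name is assumed), using the default thresholds shown above:

```typescript
type NetworkQuality = "offline" | "poor" | "moderate" | "good" | "excellent";

// Classify measured health-check latency using the configured thresholds.
// A null latency means no successful health check yet; treat as offline here.
function classifyQuality(
  isOnline: boolean,
  healthCheckLatencyMs: number | null,
): NetworkQuality {
  if (!isOnline || healthCheckLatencyMs === null) return "offline";
  if (healthCheckLatencyMs < 100) return "excellent";
  if (healthCheckLatencyMs < 200) return "good";
  if (healthCheckLatencyMs < 500) return "moderate";
  return "poor";
}
```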

### useNetworkStatus Hook

Location: `apps/web-app/src/hooks/useNetworkStatus.ts`

React hook providing network status with computed properties:

```typescript
const {
  isOnline,
  isHealthy,
  quality,
  healthCheckLatencyMs,
  effectiveType, // "4g", "3g", "2g", "slow-2g"
  downlink, // Mbps
  rtt, // Round-trip time ms
  isSuitableForVoice, // quality >= "good" && isHealthy
  shouldUseOffline, // !isOnline || !isHealthy || quality < "moderate"
  qualityScore, // 0-4 (offline=0, poor=1, moderate=2, good=3, excellent=4)
  checkNow, // Force immediate health check
} = useNetworkStatus();
```

### Offline VAD with Network Fallback

Location: `apps/web-app/src/hooks/useOfflineVAD.ts`

The `useOfflineVADWithFallback` hook automatically switches between network and offline VAD:

```typescript
const {
  isListening,
  isSpeaking,
  currentEnergy,
  isUsingOfflineVAD, // Currently using offline mode?
  networkAvailable,
  networkQuality,
  modeReason, // "network_vad" | "network_unavailable" | "poor_quality" | "forced_offline"
  forceOffline, // Manually switch to offline
  forceNetwork, // Manually switch to network (if available)
  startListening,
  stopListening,
} = useOfflineVADWithFallback({
  useNetworkMonitor: true,
  minNetworkQuality: "moderate",
  networkRecoveryDelayMs: 2000, // Prevent flapping
  onFallbackToOffline: () => console.log("Switched to offline VAD"),
  onReturnToNetwork: () => console.log("Returned to network VAD"),
});
```

### Fallback Decision Flow

```
┌────────────────────┐
│  Network Monitor   │
│   Health Check     │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐    NO     ┌────────────────────┐
│     Is Online?     │──────────▶│  Use Offline VAD   │
└─────────┬──────────┘           └────────────────────┘
          │ YES
          ▼
┌────────────────────┐    NO     ┌────────────────────┐
│    Is Healthy?     │──────────▶│  Use Offline VAD   │
│  (3+ checks pass)  │           │ reason: unhealthy  │
└─────────┬──────────┘           └────────────────────┘
          │ YES
          ▼
┌────────────────────┐    NO     ┌────────────────────┐
│   Quality ≥ Min?   │──────────▶│  Use Offline VAD   │
│  (e.g., moderate)  │           │ reason: poor_qual  │
└─────────┬──────────┘           └────────────────────┘
          │ YES
          ▼
┌────────────────────┐
│  Use Network VAD   │
│ (cloud processing) │
└────────────────────┘
```
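
The same decision expressed as code (a sketch under the assumptions above; the type and reason names are illustrative, not the hook's exact API):

```typescript
type FallbackMode =
  | { mode: "offline"; reason: "network_unavailable" | "unhealthy" | "poor_quality" }
  | { mode: "network" };

const qualityRank = { offline: 0, poor: 1, moderate: 2, good: 3, excellent: 4 } as const;

// Decide between network VAD and offline VAD from the monitor's status,
// mirroring the decision flow above.
function decideVadMode(
  isOnline: boolean,
  isHealthy: boolean,
  quality: keyof typeof qualityRank,
  minQuality: keyof typeof qualityRank = "moderate",
): FallbackMode {
  if (!isOnline) return { mode: "offline", reason: "network_unavailable" };
  if (!isHealthy) return { mode: "offline", reason: "unhealthy" };
  if (qualityRank[quality] < qualityRank[minQuality]) {
    return { mode: "offline", reason: "poor_quality" };
  }
  return { mode: "network" };
}
```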

### TTS Caching (`useTTSCache`)

Location: `apps/web-app/src/hooks/useOfflineVAD.ts`

Caches synthesized TTS audio for offline playback:

```typescript
const {
  getTTS, // Get audio (from cache or fresh)
  preload, // Preload common phrases
  isCached, // Check if text is cached
  stats, // { entryCount, sizeMB, hitRate }
  clear, // Clear cache
} = useTTSCache({
  voice: "alloy",
  maxSizeMB: 50,
  ttsFunction: async (text) => synthesizeAudio(text),
});

// Preload common phrases on app start
await preload(); // Caches "I'm listening", "Go ahead", etc.

// Get TTS (cache hit = instant, cache miss = synthesize + cache)
const audio = await getTTS("Hello world");
```

### User Settings Integration

Phase 9 settings are stored in `voiceSettingsStore`:

| Setting                 | Default | Description                              |
| ----------------------- | ------- | ---------------------------------------- |
| `enableOfflineFallback` | `true`  | Auto-switch to offline when network poor |
| `preferOfflineVAD`      | `false` | Force offline VAD (privacy mode)         |
| `ttsCacheEnabled`       | `true`  | Enable TTS response caching              |

### File Reference (Phase 9)

| File                                                             | Purpose                         |
| ---------------------------------------------------------------- | ------------------------------- |
| `apps/web-app/src/lib/offline/networkMonitor.ts`                 | Network health monitoring       |
| `apps/web-app/src/lib/offline/webrtcVAD.ts`                      | WebRTC-based offline VAD        |
| `apps/web-app/src/lib/offline/types.ts`                          | Offline module type definitions |
| `apps/web-app/src/hooks/useNetworkStatus.ts`                     | React hook for network status   |
| `apps/web-app/src/hooks/useOfflineVAD.ts`                        | Offline VAD + TTS cache hooks   |
| `apps/web-app/src/lib/offline/__tests__/networkMonitor.test.ts`  | Network monitor tests           |

---

## Future Work

- ~~**Metrics export to backend**: Send metrics to backend for aggregation/alerting~~ ✓ Implemented
- ~~**Barge-in support**: Allow user to interrupt AI responses~~ ✓ Implemented (2025-11-28)
- ~~**Audio overlap prevention**: Prevent multiple responses playing simultaneously~~ ✓ Implemented (2025-11-28)
- ~~**Per-user voice preferences**: Backend persistence for TTS settings~~ ✓ Implemented (2025-11-29)
- ~~**Context-aware voice styles**: Auto-detect tone from content~~ ✓ Implemented (2025-11-29)
- ~~**Aggressive latency optimization**: 200ms VAD, 256-sample chunks, 300ms reconnect~~ ✓ Implemented (2025-11-29)
- ~~**Observability & Monitoring (Phase 3)**: Error taxonomy, metrics, SLO alerts, telemetry~~ ✓ Implemented (2025-12-02)
- ~~**Phase 7: Multilingual Support**: Auto language detection, accent profiles, language switch confidence~~ ✓ Implemented (2025-12-03)
- ~~**Phase 8: Voice Calibration**: Personalized VAD thresholds, calibration wizard, adaptive learning~~ ✓ Implemented (2025-12-03)
- ~~**Phase 9: Offline Fallback**: Network monitoring, offline VAD, TTS caching, quality-based switching~~ ✓ Implemented (2025-12-03)
- ~~**Phase 10: Conversation Intelligence**: Sentiment tracking, discourse analysis, response recommendations~~ ✓ Implemented (2025-12-03)

### Voice Mode Enhancement - 10 Phase Plan ✅ COMPLETE (2025-12-03)

A comprehensive enhancement transforming voice mode into a human-like conversational partner with medical dictation:

- ~~**Phase 1**: Emotional Intelligence (Hume AI)~~ ✓ Complete
- ~~**Phase 2**: Backchanneling System~~ ✓ Complete
- ~~**Phase 3**: Prosody Analysis~~ ✓ Complete
- ~~**Phase 4**: Memory & Context System~~ ✓ Complete
- ~~**Phase 5**: Advanced Turn-Taking~~ ✓ Complete
- ~~**Phase 6**: Variable Response Timing~~ ✓ Complete
- ~~**Phase 7**: Conversational Repair~~ ✓ Complete
- ~~**Phase 8**: Medical Dictation Core~~ ✓ Complete
- ~~**Phase 9**: Patient Context Integration~~ ✓ Complete
- ~~**Phase 10**: Frontend Integration & Analytics~~ ✓ Complete

**Full documentation:** [VOICE_MODE_ENHANCEMENT_10_PHASE.md](./VOICE_MODE_ENHANCEMENT_10_PHASE.md)

### Remaining Tasks

- **Voice→chat transcript content E2E**: Test actual transcript content in chat timeline
- **Error tracking integration**: Send errors to Sentry/similar
- **Audio level visualization**: Show real-time audio level meter during recording