Voice Mode

Real-time voice interaction powered by OpenAI's Realtime API with bidirectional audio streaming and speech-to-text capabilities.

🎙️ Speech Input: Real-time speech recognition with partial transcript preview

🔊 Voice Output: Natural voice synthesis with multiple voice options

⚡ Low Latency: Optimized for minimal latency in voice interactions

✅ Implementation Status

  • Voice metrics dashboard with latency indicators
  • Microphone permission handling UX
  • Keyboard shortcuts (Ctrl+Shift+V, Space for push-to-talk)
  • Responsive voice panel layout
  • Real-time transcript preview during speech

Voice Pipeline Architecture

Voice Mode Pipeline

Status: Production-ready | Last Updated: 2025-12-03

This document describes the unified Voice Mode pipeline architecture, data flow, metrics, and testing strategy. It serves as the canonical reference for developers working on real-time voice features.

Voice Pipeline Modes

VoiceAssist supports two voice pipeline modes:

Mode                         | Description                    | Best For
Thinker-Talker (Recommended) | Local STT → LLM → TTS pipeline | Full tool support, unified context, custom TTS
OpenAI Realtime (Legacy)     | Direct OpenAI Realtime API     | Quick setup, minimal backend changes

Thinker-Talker Pipeline (Primary)

The Thinker-Talker pipeline is the recommended approach, providing:

  • Unified conversation context between voice and chat modes
  • Full tool/RAG support in voice interactions
  • Custom TTS via ElevenLabs with premium voices
  • Lower cost per interaction

Documentation: THINKER_TALKER_PIPELINE.md

[Audio] → [Deepgram STT] → [GPT-4o Thinker] → [ElevenLabs TTS] → [Audio Out]
                │                 │                  │
           Transcripts       Tool Calls        Audio Chunks
                │                 │                  │
                └───────── WebSocket Handler ────────┘

OpenAI Realtime API (Legacy)

The original implementation using OpenAI's Realtime API directly. Still supported for backward compatibility.


Implementation Status

Thinker-Talker Components

Component             | Status | Location
ThinkerService        | Live   | app/services/thinker_service.py
TalkerService         | Live   | app/services/talker_service.py
VoicePipelineService  | Live   | app/services/voice_pipeline_service.py
T/T WebSocket Handler | Live   | app/services/thinker_talker_websocket_handler.py
SentenceChunker       | Live   | app/services/sentence_chunker.py
Frontend T/T hook     | Live   | apps/web-app/src/hooks/useThinkerTalkerSession.ts
T/T Audio Playback    | Live   | apps/web-app/src/hooks/useTTAudioPlayback.ts
T/T Voice Panel       | Live   | apps/web-app/src/components/voice/ThinkerTalkerVoicePanel.tsx

OpenAI Realtime Components (Legacy)

Component                  | Status  | Location
Backend session endpoint   | Live    | services/api-gateway/app/api/voice.py
Ephemeral token generation | Live    | app/services/realtime_voice_service.py
Voice metrics endpoint     | Live    | POST /api/voice/metrics
Frontend voice hook        | Live    | apps/web-app/src/hooks/useRealtimeVoiceSession.ts
Voice settings store       | Live    | apps/web-app/src/stores/voiceSettingsStore.ts
Voice UI panel             | Live    | apps/web-app/src/components/voice/VoiceModePanel.tsx
Chat timeline integration  | Live    | Voice messages appear in chat
Barge-in support           | Live    | response.cancel + onSpeechStarted callback
Audio overlap prevention   | Live    | Response ID tracking + isProcessingResponseRef
E2E test suite             | Passing | 95 tests across unit/integration/E2E

Full status: See Implementation Status for all components.

Overview

Voice Mode enables real-time voice conversations with the AI assistant using OpenAI's Realtime API. The pipeline handles:

  • Ephemeral session authentication (no raw API keys in browser)
  • WebSocket-based bidirectional voice streaming
  • Voice activity detection (VAD) with user-configurable sensitivity
  • User settings propagation (voice, language, VAD threshold)
  • Chat timeline integration (voice messages appear in chat)
  • Connection state management with automatic reconnection
  • Barge-in support (interrupt AI while speaking)
  • Audio playback management (prevent overlapping responses)
  • Metrics tracking for observability

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                              FRONTEND                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────┐     ┌─────────────────────┐     ┌───────────────┐  │
│  │  VoiceModePanel     │────▶│useRealtimeVoice     │────▶│ voiceSettings │  │
│  │  (UI Component)     │     │Session (Hook)       │     │ Store         │  │
│  │  - Start/Stop       │     │- connect()          │     │ - voice       │  │
│  │  - Status display   │     │- disconnect()       │     │ - language    │  │
│  │  - Metrics logging  │     │- sendMessage()      │     │ - vadSens     │  │
│  └──────────┬──────────┘     └──────────┬──────────┘     └───────────────┘  │
│             │                           │                                   │
│             │                           │ onUserMessage()/onAssistantMessage()
│             │                           ▼                                   │
│  ┌──────────▼──────────┐     ┌─────────────────────┐                        │
│  │  MessageInput       │     │  ChatPage           │                        │
│  │  - Voice toggle     │────▶│  - useChatSession   │                        │
│  │  - Panel container  │     │  - addMessage()     │                        │
│  └─────────────────────┘     └─────────────────────┘                        │
│                                                                             │
└──────────────────────────────────────┬──────────────────────────────────────┘
                                       │
                                       │ POST /api/voice/realtime-session
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              BACKEND                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────┐     ┌─────────────────────┐                        │
│  │  voice.py           │────▶│  realtime_voice_    │                        │
│  │  (FastAPI Router)   │     │  service.py         │                        │
│  │  - /realtime-session│     │  - generate_session │                        │
│  │  - Timing logs      │     │  - ephemeral token  │                        │
│  └─────────────────────┘     └──────────┬──────────┘                        │
│                                         │                                   │
│                                         │ POST /v1/realtime/sessions        │
│                                         ▼                                   │
│                              ┌─────────────────────┐                        │
│                              │  OpenAI API         │                        │
│                              │  - Ephemeral token  │                        │
│                              │  - Voice config     │                        │
│                              └─────────────────────┘                        │
│                                                                             │
└──────────────────────────────────────┬──────────────────────────────────────┘
                                       │
                                       │ WebSocket wss://api.openai.com/v1/realtime
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          OPENAI REALTIME API                                │
├─────────────────────────────────────────────────────────────────────────────┤
│  - Server-side VAD (voice activity detection)                               │
│  - Bidirectional audio streaming (PCM16)                                    │
│  - Real-time transcription (Whisper)                                        │
│  - GPT-4o responses with audio synthesis                                    │
└─────────────────────────────────────────────────────────────────────────────┘

Backend: /api/voice/realtime-session

Location: services/api-gateway/app/api/voice.py

Request

interface RealtimeSessionRequest {
  conversation_id?: string; // Optional conversation context
  voice?: string;           // "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer"
  language?: string;        // "en" | "es" | "fr" | "de" | "it" | "pt"
  vad_sensitivity?: number; // 0-100 (maps to threshold: 0→0.9, 100→0.1)
}

Response

interface RealtimeSessionResponse {
  url: string;                    // WebSocket URL: "wss://api.openai.com/v1/realtime"
  model: string;                  // "gpt-4o-realtime-preview"
  session_id: string;             // Unique session identifier
  expires_at: number;             // Unix timestamp (epoch seconds)
  conversation_id: string | null;
  auth: {
    type: "ephemeral_token";
    token: string;                // Ephemeral token (ek_...), NOT raw API key
    expires_at: number;           // Token expiry (5 minutes)
  };
  voice_config: {
    voice: string;                // Selected voice
    modalities: ["text", "audio"];
    input_audio_format: "pcm16";
    output_audio_format: "pcm16";
    input_audio_transcription: { model: "whisper-1" };
    turn_detection: {
      type: "server_vad";
      threshold: number;          // 0.1 (sensitive) to 0.9 (insensitive)
      prefix_padding_ms: number;
      silence_duration_ms: number;
    };
  };
}

VAD Sensitivity Mapping

The frontend uses a 0-100 scale for user-friendly VAD sensitivity:

User Setting | VAD Threshold | Behavior
0 (Low)      | 0.9           | Requires loud/clear speech
50 (Medium)  | 0.5           | Balanced detection
100 (High)   | 0.1           | Very sensitive, picks up soft speech

Formula: threshold = 0.9 - (vad_sensitivity / 100 * 0.8)
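
For illustration, the same mapping expressed in TypeScript (the real conversion happens in the backend session endpoint; this helper is only a sketch):

// Sketch of the sensitivity-to-threshold mapping described above.
function vadSensitivityToThreshold(sensitivity: number): number {
  const clamped = Math.min(100, Math.max(0, sensitivity));
  // 0 -> 0.9 (requires loud speech), 100 -> 0.1 (very sensitive)
  return 0.9 - (clamped / 100) * 0.8;
}

vadSensitivityToThreshold(0);   // 0.9
vadSensitivityToThreshold(50);  // 0.5
vadSensitivityToThreshold(100); // 0.1 (approximately, due to floating point)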

Observability

Backend logs timing and context for each session request:

# Request logging
logger.info(
    f"Creating Realtime session for user {current_user.id}",
    extra={
        "user_id": current_user.id,
        "conversation_id": request.conversation_id,
        "voice": request.voice,
        "language": request.language,
        "vad_sensitivity": request.vad_sensitivity,
    },
)

# Success logging with duration
duration_ms = int((time.monotonic() - start_time) * 1000)
logger.info(
    f"Realtime session created for user {current_user.id}",
    extra={
        "user_id": current_user.id,
        "session_id": config["session_id"],
        "voice": config.get("voice_config", {}).get("voice"),
        "duration_ms": duration_ms,
    },
)

Frontend Hook: useRealtimeVoiceSession

Location: apps/web-app/src/hooks/useRealtimeVoiceSession.ts

Usage

const {
  status,             // 'disconnected' | 'connecting' | 'connected' | 'reconnecting' | 'failed' | 'expired' | 'error'
  transcript,         // Current transcript text
  isSpeaking,         // Is the AI currently speaking?
  isConnected,        // Derived: status === 'connected'
  isConnecting,       // Derived: status === 'connecting' || 'reconnecting'
  canSend,            // Can send messages?
  error,              // Error message if any
  metrics,            // VoiceMetrics object
  connect,            // () => Promise<void> - start session
  disconnect,         // () => void - end session
  sendMessage,        // (text: string) => void - send text message
} = useRealtimeVoiceSession({
  conversationId,
  voice,              // From voiceSettingsStore
  language,           // From voiceSettingsStore
  vadSensitivity,     // From voiceSettingsStore (0-100)
  onConnected,        // Callback when connected
  onDisconnected,     // Callback when disconnected
  onError,            // Callback on error
  onUserMessage,      // Callback with user transcript
  onAssistantMessage, // Callback with AI response
  onMetricsUpdate,    // Callback when metrics change
});

Connection States

disconnected ──▶ connecting ──▶ connected
                      │              │
                      ▼              ▼
                   failed ◀──── reconnecting
                      │              │
                      ▼              ▼
                  expired ◀────── error

State        | Description
disconnected | Initial/idle state
connecting   | Fetching session config, establishing WebSocket
connected    | Active voice session
reconnecting | Auto-reconnect after temporary disconnect
failed       | Connection failed (backend error, network issue)
expired      | Session token expired (needs manual restart)
error        | General error state

WebSocket Connection

The hook connects using three protocols for authentication:

const ws = new WebSocket(url, [
  "realtime",
  "openai-beta.realtime-v1",
  `openai-insecure-api-key.${ephemeralToken}`,
]);

Voice Settings Store

Location: apps/web-app/src/stores/voiceSettingsStore.ts

Schema

interface VoiceSettings {
  voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer";
  language: "en" | "es" | "fr" | "de" | "it" | "pt";
  vadSensitivity: number;   // 0-100
  autoStartOnOpen: boolean; // Auto-start voice when panel opens
  showStatusHints: boolean; // Show helper text in UI
}

Persistence

Settings are persisted to localStorage under key voiceassist-voice-settings using Zustand's persist middleware.
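
A minimal sketch of how a store persisted under that key is typically wired up with Zustand's persist middleware (the project's actual store has more fields and actions; the setters below are illustrative):

import { create } from "zustand";
import { persist } from "zustand/middleware";

interface VoiceSettingsState {
  voice: string;
  language: string;
  vadSensitivity: number;
  setVoice: (voice: string) => void;
  setVadSensitivity: (value: number) => void;
}

export const useVoiceSettingsStore = create<VoiceSettingsState>()(
  persist(
    (set) => ({
      voice: "alloy",
      language: "en",
      vadSensitivity: 50,
      setVoice: (voice) => set({ voice }),
      // Clamp to the documented 0-100 range before storing
      setVadSensitivity: (value) =>
        set({ vadSensitivity: Math.min(100, Math.max(0, value)) }),
    }),
    { name: "voiceassist-voice-settings" } // localStorage key from this doc
  )
);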

Defaults

Setting         | Default
voice           | "alloy"
language        | "en"
vadSensitivity  | 50
autoStartOnOpen | false
showStatusHints | true

Chat Integration

Location: apps/web-app/src/pages/ChatPage.tsx

Message Flow

  1. User speaks → VoiceModePanel receives final transcript
  2. VoiceModePanel calls onUserMessage(transcript)
  3. ChatPage receives callback, calls useChatSession.addMessage()
  4. Message added to timeline with metadata: { source: "voice" }

// ChatPage.tsx
const handleVoiceUserMessage = (content: string) => {
  addMessage({
    role: "user",
    content,
    metadata: { source: "voice" },
  });
};

const handleVoiceAssistantMessage = (content: string) => {
  addMessage({
    role: "assistant",
    content,
    metadata: { source: "voice" },
  });
};

Message Structure

interface VoiceMessage {
  id: string;        // "voice-{timestamp}-{random}"
  role: "user" | "assistant";
  content: string;
  timestamp: number;
  metadata: {
    source: "voice"; // Distinguishes from text messages
  };
}

Barge-in & Audio Playback

Location: apps/web-app/src/components/voice/VoiceModePanel.tsx, apps/web-app/src/hooks/useRealtimeVoiceSession.ts

Barge-in Flow

When the user starts speaking while the AI is responding, the system immediately:

  1. Detects speech start via OpenAI's input_audio_buffer.speech_started event
  2. Cancels active response by sending response.cancel to OpenAI
  3. Stops audio playback via onSpeechStarted callback
  4. Clears pending responses to prevent stale audio from playing

User speaks → speech_started event → response.cancel → stopCurrentAudio()
                                                            ↓
                                                    Audio stops
                                                    Queue cleared
                                                    Response ID incremented

Response Cancellation

Location: useRealtimeVoiceSession.ts - handleRealtimeMessage

case "input_audio_buffer.speech_started": setIsSpeaking(true); setPartialTranscript(""); // Barge-in: Cancel any active response when user starts speaking if (activeResponseIdRef.current && wsRef.current?.readyState === WebSocket.OPEN) { wsRef.current.send(JSON.stringify({ type: "response.cancel" })); activeResponseIdRef.current = null; } // Notify parent to stop audio playback options.onSpeechStarted?.(); break;

Audio Playback Management

Location: VoiceModePanel.tsx

The panel tracks audio playback state to prevent overlapping responses:

// Track currently playing Audio element
const currentAudioRef = useRef<HTMLAudioElement | null>(null);

// Prevent overlapping response processing
const isProcessingResponseRef = useRef(false);

// Response ID to invalidate stale responses after barge-in
const currentResponseIdRef = useRef<number>(0);

Stop current audio function:

const stopCurrentAudio = useCallback(() => {
  if (currentAudioRef.current) {
    currentAudioRef.current.pause();
    currentAudioRef.current.currentTime = 0;
    if (currentAudioRef.current.src.startsWith("blob:")) {
      URL.revokeObjectURL(currentAudioRef.current.src);
    }
    currentAudioRef.current = null;
  }
  audioQueueRef.current = [];
  isPlayingRef.current = false;
  currentResponseIdRef.current++; // Invalidate pending responses
  isProcessingResponseRef.current = false;
}, []);

Overlap Prevention

When a relay result arrives, the handler checks:

  1. Already processing? Skip if isProcessingResponseRef.current === true
  2. Response ID valid? Skip playback if ID changed (barge-in occurred)

onRelayResult: async ({ answer }) => {
  if (answer) {
    // Prevent overlapping responses
    if (isProcessingResponseRef.current) {
      console.log("[VoiceModePanel] Skipping response - already processing another");
      return;
    }

    const responseId = ++currentResponseIdRef.current;
    isProcessingResponseRef.current = true;

    // ... synthesis and playback ...

    // Check if response is still valid before playback
    if (responseId !== currentResponseIdRef.current) {
      console.log("[VoiceModePanel] Response cancelled - skipping playback");
      return;
    }
  }
};

Error Handling

Benign cancellation errors (e.g., "Cancellation failed: no active response found") are handled gracefully:

case "error": { const errorMessage = message.error?.message || "Realtime API error"; // Ignore benign cancellation errors if ( errorMessage.includes("Cancellation failed") || errorMessage.includes("no active response") ) { voiceLog.debug(`Ignoring benign error: ${errorMessage}`); break; } handleError(new Error(errorMessage)); break; }

Metrics

Location: apps/web-app/src/hooks/useRealtimeVoiceSession.ts

VoiceMetrics Interface

interface VoiceMetrics {
  connectionTimeMs: number | null;        // Time to establish connection
  timeToFirstTranscriptMs: number | null; // Time to first user transcript
  lastSttLatencyMs: number | null;        // Speech-to-text latency
  lastResponseLatencyMs: number | null;   // AI response latency
  sessionDurationMs: number | null;       // Total session duration
  userTranscriptCount: number;            // Number of user turns
  aiResponseCount: number;                // Number of AI turns
  reconnectCount: number;                 // Number of reconnections
  sessionStartedAt: number | null;        // Session start timestamp
}

Frontend Logging

VoiceModePanel logs key metrics to console:

// Connection time
console.log(`[VoiceModePanel] voice_session_connect_ms=${metrics.connectionTimeMs}`);

// STT latency
console.log(`[VoiceModePanel] voice_stt_latency_ms=${metrics.lastSttLatencyMs}`);

// Response latency
console.log(`[VoiceModePanel] voice_first_reply_ms=${metrics.lastResponseLatencyMs}`);

// Session duration
console.log(`[VoiceModePanel] voice_session_duration_ms=${metrics.sessionDurationMs}`);

Consuming Metrics

Developers can plug into metrics via the onMetricsUpdate callback:

useRealtimeVoiceSession({
  onMetricsUpdate: (metrics) => {
    // Send to telemetry service
    analytics.track("voice_session_metrics", {
      connection_ms: metrics.connectionTimeMs,
      stt_latency_ms: metrics.lastSttLatencyMs,
      response_latency_ms: metrics.lastResponseLatencyMs,
      duration_ms: metrics.sessionDurationMs,
    });
  },
});

Metrics Export to Backend

Metrics can be automatically exported to the backend for aggregation and alerting.

Backend Endpoint: POST /api/voice/metrics

Location: services/api-gateway/app/api/voice.py

Request Schema

interface VoiceMetricsPayload {
  conversation_id?: string;
  connection_time_ms?: number;
  time_to_first_transcript_ms?: number;
  last_stt_latency_ms?: number;
  last_response_latency_ms?: number;
  session_duration_ms?: number;
  user_transcript_count: number;
  ai_response_count: number;
  reconnect_count: number;
  session_started_at?: number;
}

Response

interface VoiceMetricsResponse { status: "ok"; }

Privacy

No PHI or transcript content is sent. Only timing metrics and counts.

Frontend Configuration

Metrics export is controlled by environment variables:

  • Production (import.meta.env.PROD): Metrics sent automatically
  • Development: Set VITE_ENABLE_VOICE_METRICS=true to enable

The export uses navigator.sendBeacon() for reliability (survives page navigation).
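
A sketch of what a beacon-based export could look like, using the payload field names from the Request Schema above (illustrative only, not the repo's exact code):

// Send timing metrics without blocking page unload.
function exportVoiceMetrics(metrics: {
  connectionTimeMs: number | null;
  lastSttLatencyMs: number | null;
  sessionDurationMs: number | null;
  userTranscriptCount: number;
  aiResponseCount: number;
  reconnectCount: number;
}): void {
  const payload = {
    connection_time_ms: metrics.connectionTimeMs ?? undefined,
    last_stt_latency_ms: metrics.lastSttLatencyMs ?? undefined,
    session_duration_ms: metrics.sessionDurationMs ?? undefined,
    user_transcript_count: metrics.userTranscriptCount,
    ai_response_count: metrics.aiResponseCount,
    reconnect_count: metrics.reconnectCount,
  };
  const blob = new Blob([JSON.stringify(payload)], { type: "application/json" });
  // sendBeacon queues the request even if the page is being unloaded
  navigator.sendBeacon("/api/voice/metrics", blob);
}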

Backend Logging

Metrics are logged with user context:

logger.info(
    "VoiceMetrics received",
    extra={
        "user_id": current_user.id,
        "conversation_id": payload.conversation_id,
        "connection_time_ms": payload.connection_time_ms,
        "session_duration_ms": payload.session_duration_ms,
        ...
    },
)

Testing

# Backend
cd /home/asimo/VoiceAssist/services/api-gateway
source venv/bin/activate && export PYTHONPATH=.
python -m pytest tests/integration/test_voice_metrics.py -v

Security

Ephemeral Token Architecture

CRITICAL: The browser NEVER receives the raw OpenAI API key.

  1. Backend holds OPENAI_API_KEY securely
  2. Frontend requests session via /api/voice/realtime-session
  3. Backend creates ephemeral token via OpenAI /v1/realtime/sessions
  4. Ephemeral token returned to frontend (valid ~5 minutes)
  5. Frontend connects WebSocket using ephemeral token

Token Refresh

The hook monitors session.expires_at and can trigger refresh before expiry. If the token expires mid-session, status transitions to expired.
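
As an illustration only, refresh scheduling can be as simple as a timer keyed off expires_at; the hook's real logic differs, so treat the helper below (and its 30-second margin) as an assumption:

// expiresAt is the Unix timestamp (epoch seconds) from the session response.
function scheduleTokenRefresh(expiresAt: number, refresh: () => Promise<void>): () => void {
  const msUntilExpiry = expiresAt * 1000 - Date.now();
  // Refresh shortly before expiry (illustrative 30-second margin).
  const delay = Math.max(0, msUntilExpiry - 30_000);
  const timer = setTimeout(() => void refresh(), delay);
  return () => clearTimeout(timer); // Cancel on disconnect
}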

Testing

Voice Pipeline Smoke Suite

Run these commands to validate the voice pipeline:

# 1. Backend tests (CI-safe, mocked)
cd /home/asimo/VoiceAssist/services/api-gateway
source venv/bin/activate
export PYTHONPATH=.
python -m pytest tests/integration/test_openai_config.py -v

# 2. Frontend unit tests (run individually to avoid OOM)
cd /home/asimo/VoiceAssist/apps/web-app
export NODE_OPTIONS="--max-old-space-size=768"
npx vitest run src/hooks/__tests__/useRealtimeVoiceSession.test.ts --reporter=dot
npx vitest run src/hooks/__tests__/useChatSession-voice-integration.test.ts --reporter=dot
npx vitest run src/stores/__tests__/voiceSettingsStore.test.ts --reporter=dot
npx vitest run src/components/voice/__tests__/VoiceModeSettings.test.tsx --reporter=dot
npx vitest run src/components/chat/__tests__/MessageInput-voice-settings.test.tsx --reporter=dot

# 3. E2E tests (Chromium, mocked backend)
cd /home/asimo/VoiceAssist
npx playwright test \
  e2e/voice-mode-navigation.spec.ts \
  e2e/voice-mode-session-smoke.spec.ts \
  e2e/voice-mode-voice-chat-integration.spec.ts \
  --project=chromium --reporter=list

Test Coverage Summary

Test File                                 | Tests | Coverage
useRealtimeVoiceSession.test.ts           | 22    | Hook lifecycle, states, metrics
useChatSession-voice-integration.test.ts  | 8     | Message structure validation
voiceSettingsStore.test.ts                | 17    | Store actions, persistence
VoiceModeSettings.test.tsx                | 25    | Component rendering, interactions
MessageInput-voice-settings.test.tsx      | 12    | Integration with chat input
voice-mode-navigation.spec.ts             | 4     | E2E navigation flow
voice-mode-session-smoke.spec.ts          | 3     | E2E session smoke (1 live gated)
voice-mode-voice-chat-integration.spec.ts | 4     | E2E panel integration

Total: 95 tests

Live Testing

To test with real OpenAI backend:

# Backend (requires OPENAI_API_KEY in .env)
LIVE_REALTIME_TESTS=1 python -m pytest tests/integration/test_openai_config.py -v

# E2E (requires running backend + valid API key)
LIVE_REALTIME_E2E=1 npx playwright test e2e/voice-mode-session-smoke.spec.ts

File Reference

Backend

File                                                         | Purpose
services/api-gateway/app/api/voice.py                        | API routes, metrics, timing logs
services/api-gateway/app/services/realtime_voice_service.py  | Session creation, token generation
services/api-gateway/tests/integration/test_openai_config.py | Integration tests
services/api-gateway/tests/integration/test_voice_metrics.py | Metrics endpoint tests

Frontend

File                                                    | Purpose
apps/web-app/src/hooks/useRealtimeVoiceSession.ts       | Core hook
apps/web-app/src/components/voice/VoiceModePanel.tsx    | UI panel
apps/web-app/src/components/voice/VoiceModeSettings.tsx | Settings modal
apps/web-app/src/stores/voiceSettingsStore.ts           | Settings store
apps/web-app/src/components/chat/MessageInput.tsx       | Voice button integration
apps/web-app/src/pages/ChatPage.tsx                     | Chat timeline integration
apps/web-app/src/hooks/useChatSession.ts                | addMessage() helper

Tests

File                                                                            | Purpose
apps/web-app/src/hooks/__tests__/useRealtimeVoiceSession.test.ts                | Hook tests
apps/web-app/src/hooks/__tests__/useChatSession-voice-integration.test.ts       | Chat integration
apps/web-app/src/stores/__tests__/voiceSettingsStore.test.ts                    | Store tests
apps/web-app/src/components/voice/__tests__/VoiceModeSettings.test.tsx          | Component tests
apps/web-app/src/components/chat/__tests__/MessageInput-voice-settings.test.tsx | Integration tests
e2e/voice-mode-navigation.spec.ts                                               | E2E navigation
e2e/voice-mode-session-smoke.spec.ts                                            | E2E smoke test
e2e/voice-mode-voice-chat-integration.spec.ts                                   | E2E panel integration

Observability & Monitoring (Phase 3)

Implemented: 2025-12-02

The voice pipeline includes comprehensive observability features for production monitoring.

Error Taxonomy (voice_errors.py)

Location: services/api-gateway/app/core/voice_errors.py

Structured error classification with 8 categories and 40+ error codes:

Category   | Codes          | Description
CONNECTION | CONN_001-7     | WebSocket, network failures
STT        | STT_001-7      | Speech-to-text errors
TTS        | TTS_001-7      | Text-to-speech errors
LLM        | LLM_001-6      | LLM processing errors
AUDIO      | AUDIO_001-6    | Audio encoding/decoding errors
TIMEOUT    | TIMEOUT_001-7  | Various timeout conditions
PROVIDER   | PROVIDER_001-6 | External provider errors
INTERNAL   | INTERNAL_001-5 | Internal server errors

Each error code includes:

  • Recoverability flag (can auto-retry)
  • Retry configuration (delay, max attempts)
  • User-friendly description
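
For illustration, such an error-code record could be modeled as follows in TypeScript; the real definitions live in the Python module above, and the field names here are assumptions, not its actual API:

type VoiceErrorCategory =
  | "CONNECTION" | "STT" | "TTS" | "LLM"
  | "AUDIO" | "TIMEOUT" | "PROVIDER" | "INTERNAL";

interface VoiceErrorCode {
  code: string;              // e.g. "CONN_001"
  category: VoiceErrorCategory;
  recoverable: boolean;      // Can the client auto-retry?
  retryDelayMs?: number;     // Suggested delay before retry
  maxRetries?: number;       // Upper bound on automatic retries
  description: string;       // User-friendly message
}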

Voice Metrics (metrics.py)

Location: services/api-gateway/app/core/metrics.py

Prometheus metrics for voice pipeline monitoring:

Metric                               | Type      | Labels                                | Description
voice_errors_total                   | Counter   | category, code, provider, recoverable | Total voice errors
voice_pipeline_stage_latency_seconds | Histogram | stage                                 | Per-stage latency
voice_ttfa_seconds                   | Histogram | -                                     | Time to first audio
voice_active_sessions                | Gauge     | -                                     | Active voice sessions
voice_barge_in_total                 | Counter   | -                                     | Barge-in events
voice_audio_chunks_total             | Counter   | status                                | Audio chunks processed

Per-Stage Latency Tracking (voice_timing.py)

Location: services/api-gateway/app/core/voice_timing.py

Pipeline stages tracked:

  • audio_receive - Time to receive audio from client
  • vad_process - Voice activity detection time
  • stt_transcribe - Speech-to-text latency
  • llm_process - LLM inference time
  • tts_synthesize - Text-to-speech synthesis
  • audio_send - Time to send audio to client
  • ttfa - Time to first audio (end-to-end)

Usage:

from app.core.voice_timing import create_pipeline_timings, PipelineStage

timings = create_pipeline_timings(session_id="abc123")

with timings.time_stage(PipelineStage.STT_TRANSCRIBE):
    transcript = await stt_client.transcribe(audio)

timings.record_ttfa()  # When first audio byte ready
timings.finalize()     # When response complete

SLO Alerts (voice_slo_alerts.yml)

Location: infrastructure/observability/prometheus/rules/voice_slo_alerts.yml

SLO targets with Prometheus alerting rules:

SLO                  | Target  | Alert
TTFA P95             | < 200ms | VoiceTTFASLOViolation
STT Latency P95      | < 300ms | VoiceSTTLatencySLOViolation
TTS First Chunk P95  | < 200ms | VoiceTTSFirstChunkSLOViolation
Connection Time P95  | < 500ms | VoiceConnectionTimeSLOViolation
Error Rate           | < 1%    | VoiceErrorRateHigh
Session Success Rate | > 95%   | VoiceSessionSuccessRateLow

Client Telemetry (voiceTelemetry.ts)

Location: apps/web-app/src/lib/voiceTelemetry.ts

Frontend telemetry with:

  • Network quality assessment via Network Information API
  • Browser performance metrics via Performance.memory API
  • Jitter estimation for network quality
  • Batched reporting (10s intervals)
  • Beacon API for reliable delivery on page unload

import { getVoiceTelemetry } from "@/lib/voiceTelemetry";

const telemetry = getVoiceTelemetry();
telemetry.startSession(sessionId);
telemetry.recordLatency("stt", 150);
telemetry.recordLatency("ttfa", 180);
telemetry.endSession();
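
The jitter estimate mentioned above can be approximated as the mean absolute difference between consecutive latency samples; the helper below is a sketch, not the actual voiceTelemetry.ts implementation:

// Rough jitter estimate from recent round-trip latency samples (ms).
function estimateJitter(latencySamples: number[]): number {
  if (latencySamples.length < 2) return 0;
  let totalDelta = 0;
  for (let i = 1; i < latencySamples.length; i++) {
    totalDelta += Math.abs(latencySamples[i] - latencySamples[i - 1]);
  }
  return totalDelta / (latencySamples.length - 1);
}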

Voice Health Endpoint (/health/voice)

Location: services/api-gateway/app/api/health.py

Comprehensive voice subsystem health check:

curl https://assist.asimo.io/health/voice

Response:

{ "status": "healthy", "providers": { "openai": { "status": "up", "latency_ms": 120.5 }, "elevenlabs": { "status": "up", "latency_ms": 85.2 }, "deepgram": { "status": "up", "latency_ms": 95.8 } }, "session_store": { "status": "up", "active_sessions": 5 }, "metrics": { "active_sessions": 5 }, "slo": { "ttfa_target_ms": 200, "error_rate_target": 0.01 } }

Debug Logging Configuration

Location: services/api-gateway/app/core/logging.py

Configurable voice log verbosity via VOICE_LOG_LEVEL environment variable:

Level    | Content
MINIMAL  | Errors only
STANDARD | + Session lifecycle (start/end/state changes)
VERBOSE  | + All latency measurements
DEBUG    | + Audio frame details, chunk timing

Usage:

from app.core.logging import get_voice_logger

voice_log = get_voice_logger(__name__)

voice_log.session_start(session_id="abc123", provider="thinker_talker")
voice_log.latency("stt_transcribe", 150.5, session_id="abc123")
voice_log.error("voice_connection_failed", error_code="CONN_001")

Phase 9: Offline & Network Fallback

Implemented: 2025-12-03

The voice pipeline now includes comprehensive offline support and network-aware fallback mechanisms.

Network Monitoring (networkMonitor.ts)

Location: apps/web-app/src/lib/offline/networkMonitor.ts

Continuously monitors network health using multiple signals:

  • Navigator.onLine: Basic online/offline detection
  • Network Information API: Connection type, downlink speed, RTT
  • Health Check Pinging: Periodic /api/health pings for latency measurement

import { getNetworkMonitor } from "@/lib/offline/networkMonitor";

const monitor = getNetworkMonitor();
monitor.subscribe((status) => {
  console.log(`Network quality: ${status.quality}`);
  console.log(`Health check latency: ${status.healthCheckLatencyMs}ms`);
});

Network Quality Levels

Quality   | Latency     | isHealthy | Action
Excellent | < 100ms     | true      | Full cloud processing
Good      | < 200ms     | true      | Full cloud processing
Moderate  | < 500ms     | true      | Cloud with quality warning
Poor      | ≥ 500ms     | variable  | Consider offline fallback
Offline   | Unreachable | false     | Automatic offline fallback
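
A sketch of how latency could map onto these levels using the default thresholds from the configuration below (illustrative only; the real computation lives in networkMonitor.ts):

type NetworkQuality = "excellent" | "good" | "moderate" | "poor" | "offline";

function qualityFromLatency(isOnline: boolean, latencyMs: number | null): NetworkQuality {
  if (!isOnline || latencyMs === null) return "offline";
  if (latencyMs < 100) return "excellent";
  if (latencyMs < 200) return "good";
  if (latencyMs < 500) return "moderate";
  return "poor";
}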

Configuration

const monitor = createNetworkMonitor({
  healthCheckUrl: "/api/health",
  healthCheckIntervalMs: 30000, // 30 seconds
  healthCheckTimeoutMs: 5000,   // 5 seconds
  goodLatencyThresholdMs: 100,
  moderateLatencyThresholdMs: 200,
  poorLatencyThresholdMs: 500,
  failuresBeforeUnhealthy: 3,
});

useNetworkStatus Hook

Location: apps/web-app/src/hooks/useNetworkStatus.ts

React hook providing network status with computed properties:

const {
  isOnline,
  isHealthy,
  quality,
  healthCheckLatencyMs,
  effectiveType,      // "4g", "3g", "2g", "slow-2g"
  downlink,           // Mbps
  rtt,                // Round-trip time ms
  isSuitableForVoice, // quality >= "good" && isHealthy
  shouldUseOffline,   // !isOnline || !isHealthy || quality < "moderate"
  qualityScore,       // 0-4 (offline=0, poor=1, moderate=2, good=3, excellent=4)
  checkNow,           // Force immediate health check
} = useNetworkStatus();

Offline VAD with Network Fallback

Location: apps/web-app/src/hooks/useOfflineVAD.ts

The useOfflineVADWithFallback hook automatically switches between network and offline VAD:

const {
  isListening,
  isSpeaking,
  currentEnergy,
  isUsingOfflineVAD, // Currently using offline mode?
  networkAvailable,
  networkQuality,
  modeReason,        // "network_vad" | "network_unavailable" | "poor_quality" | "forced_offline"
  forceOffline,      // Manually switch to offline
  forceNetwork,      // Manually switch to network (if available)
  startListening,
  stopListening,
} = useOfflineVADWithFallback({
  useNetworkMonitor: true,
  minNetworkQuality: "moderate",
  networkRecoveryDelayMs: 2000, // Prevent flapping
  onFallbackToOffline: () => console.log("Switched to offline VAD"),
  onReturnToNetwork: () => console.log("Returned to network VAD"),
});

Fallback Decision Flow

┌────────────────────┐
│  Network Monitor   │
│  Health Check      │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐     NO     ┌────────────────────┐
│  Is Online?        │───────────▶│  Use Offline VAD   │
└─────────┬──────────┘            └────────────────────┘
          │ YES
          ▼
┌────────────────────┐     NO     ┌────────────────────┐
│  Is Healthy?       │───────────▶│  Use Offline VAD   │
│  (3+ checks pass)  │            │  reason: unhealthy │
└─────────┬──────────┘            └────────────────────┘
          │ YES
          ▼
┌────────────────────┐     NO     ┌────────────────────┐
│  Quality ≥ Min?    │───────────▶│  Use Offline VAD   │
│  (e.g., moderate)  │            │  reason: poor_qual │
└─────────┬──────────┘            └────────────────────┘
          │ YES
          ▼
┌────────────────────┐
│  Use Network VAD   │
│  (cloud processing)│
└────────────────────┘

TTS Caching (useTTSCache)

Location: apps/web-app/src/hooks/useOfflineVAD.ts

Caches synthesized TTS audio for offline playback:

const {
  getTTS,   // Get audio (from cache or fresh)
  preload,  // Preload common phrases
  isCached, // Check if text is cached
  stats,    // { entryCount, sizeMB, hitRate }
  clear,    // Clear cache
} = useTTSCache({
  voice: "alloy",
  maxSizeMB: 50,
  ttsFunction: async (text) => synthesizeAudio(text),
});

// Preload common phrases on app start
await preload(); // Caches "I'm listening", "Go ahead", etc.

// Get TTS (cache hit = instant, cache miss = synthesize + cache)
const audio = await getTTS("Hello world");

User Settings Integration

Phase 9 settings are stored in voiceSettingsStore:

Setting               | Default | Description
enableOfflineFallback | true    | Auto-switch to offline when network poor
preferOfflineVAD      | false   | Force offline VAD (privacy mode)
ttsCacheEnabled       | true    | Enable TTS response caching

File Reference (Phase 9)

File                                                          | Purpose
apps/web-app/src/lib/offline/networkMonitor.ts                | Network health monitoring
apps/web-app/src/lib/offline/webrtcVAD.ts                     | WebRTC-based offline VAD
apps/web-app/src/lib/offline/types.ts                         | Offline module type definitions
apps/web-app/src/hooks/useNetworkStatus.ts                    | React hook for network status
apps/web-app/src/hooks/useOfflineVAD.ts                       | Offline VAD + TTS cache hooks
apps/web-app/src/lib/offline/__tests__/networkMonitor.test.ts | Network monitor tests

Completed Roadmap Items

  • Metrics export to backend: Send metrics to backend for aggregation/alerting ✓ Implemented
  • Barge-in support: Allow user to interrupt AI responses ✓ Implemented (2025-11-28)
  • Audio overlap prevention: Prevent multiple responses playing simultaneously ✓ Implemented (2025-11-28)
  • Per-user voice preferences: Backend persistence for TTS settings ✓ Implemented (2025-11-29)
  • Context-aware voice styles: Auto-detect tone from content ✓ Implemented (2025-11-29)
  • Aggressive latency optimization: 200ms VAD, 256-sample chunks, 300ms reconnect ✓ Implemented (2025-11-29)
  • Observability & Monitoring (Phase 3): Error taxonomy, metrics, SLO alerts, telemetry ✓ Implemented (2025-12-02)
  • Phase 7: Multilingual Support: Auto language detection, accent profiles, language switch confidence ✓ Implemented (2025-12-03)
  • Phase 8: Voice Calibration: Personalized VAD thresholds, calibration wizard, adaptive learning ✓ Implemented (2025-12-03)
  • Phase 9: Offline Fallback: Network monitoring, offline VAD, TTS caching, quality-based switching ✓ Implemented (2025-12-03)
  • Phase 10: Conversation Intelligence: Sentiment tracking, discourse analysis, response recommendations ✓ Implemented (2025-12-03)

Voice Mode Enhancement - 10 Phase Plan ✅ COMPLETE (2025-12-03)

A comprehensive enhancement transforming voice mode into a human-like conversational partner with medical dictation:

  • Phase 1: Emotional Intelligence (Hume AI) ✓ Complete
  • Phase 2: Backchanneling System ✓ Complete
  • Phase 3: Prosody Analysis ✓ Complete
  • Phase 4: Memory & Context System ✓ Complete
  • Phase 5: Advanced Turn-Taking ✓ Complete
  • Phase 6: Variable Response Timing ✓ Complete
  • Phase 7: Conversational Repair ✓ Complete
  • Phase 8: Medical Dictation Core ✓ Complete
  • Phase 9: Patient Context Integration ✓ Complete
  • Phase 10: Frontend Integration & Analytics ✓ Complete

Full documentation: VOICE_MODE_ENHANCEMENT_10_PHASE.md

Remaining Tasks

  • Voice→chat transcript content E2E: Test actual transcript content in chat timeline
  • Error tracking integration: Send errors to Sentry/similar
  • Audio level visualization: Show real-time audio level meter during recording

Voice Settings Guide

Voice Mode Settings Guide

This guide explains how to use and configure Voice Mode settings in VoiceAssist.

Overview

Voice Mode provides real-time voice conversations with the AI assistant. Users can customize their voice experience through the settings panel, including voice selection, language preferences, TTS quality parameters, and behavior options.

Voice Mode Overhaul (2025-11-29): Added backend persistence for voice preferences, context-aware voice style detection, and advanced TTS quality controls.

Phase 7-10 Enhancements (2025-12-03): Added multilingual support with auto-detection, voice calibration, offline fallback with network monitoring, and conversation intelligence features.

Accessing Settings

  1. Open Voice Mode by clicking the voice button in the chat interface
  2. Click the gear icon in the Voice Mode panel header
  3. The settings modal will appear

Available Settings

Voice Selection

Choose from 6 different AI voices:

  • Alloy - Neutral, balanced voice (default)
  • Echo - Warm, friendly voice
  • Fable - Expressive, narrative voice
  • Onyx - Deep, authoritative voice
  • Nova - Energetic, bright voice
  • Shimmer - Soft, calming voice

Language

Select your preferred conversation language:

  • English (default)
  • Spanish
  • French
  • German
  • Italian
  • Portuguese

Voice Detection Sensitivity (0-100%)

Controls how sensitive the voice activity detection is:

  • Lower values (0-30%): Less sensitive, requires louder/clearer speech
  • Medium values (40-60%): Balanced detection (recommended)
  • Higher values (70-100%): More sensitive, may pick up background noise

Auto-start Voice Mode

When enabled, Voice Mode will automatically open when you start a new chat or navigate to the chat page. This is useful for voice-first interactions.

Show Status Hints

When enabled, displays helpful tips and instructions in the Voice Mode panel. Disable if you're familiar with the interface and want a cleaner view.

Context-Aware Voice Style (New)

When enabled, the AI automatically adjusts its voice tone based on the content being spoken:

  • Calm: Default for medical explanations (stable, measured pace)
  • Urgent: For medical warnings/emergencies (dynamic, faster)
  • Empathetic: For sensitive health topics (warm, slower)
  • Instructional: For step-by-step guidance (clear, deliberate)
  • Conversational: For general chat (natural, varied)

The system detects keywords and patterns to select the appropriate style, then blends it with your base preferences (60% your settings, 40% style preset).
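
To make the 60/40 blend concrete, a weighted mix over the TTS parameters from the Advanced Voice Quality section below might look like this (an illustrative sketch; the parameter names are assumptions based on that section, not the actual implementation):

interface TTSParams {
  stability: number;      // 0-100
  clarity: number;        // 0-100
  expressiveness: number; // 0-100
}

// Blend user preferences (60%) with the detected style preset (40%).
function blendStyle(user: TTSParams, preset: TTSParams): TTSParams {
  const mix = (a: number, b: number) => Math.round(a * 0.6 + b * 0.4);
  return {
    stability: mix(user.stability, preset.stability),
    clarity: mix(user.clarity, preset.clarity),
    expressiveness: mix(user.expressiveness, preset.expressiveness),
  };
}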

Advanced Voice Quality (New)

Expand this section to fine-tune TTS output parameters:

  • Voice Stability (0-100%): Lower = more expressive/varied, Higher = more consistent
  • Voice Clarity (0-100%): Higher values produce clearer, more consistent voice
  • Expressiveness (0-100%): Higher values add more emotion and style variation

These settings primarily affect ElevenLabs TTS but also influence context-aware style blending for OpenAI TTS.


Phase 7: Language & Detection Settings

Auto-Detect Language

When enabled, the system automatically detects the language being spoken and adjusts processing accordingly. This is useful for multilingual users who switch between languages naturally.

  • Default: Enabled
  • Store Key: autoLanguageDetection

Language Switch Confidence (0-100%)

Controls how confident the system must be before switching to a detected language. Higher values prevent false-positive language switches.

  • Lower values (50-70%): More responsive language switching, but may switch accidentally on similar-sounding phrases

  • Medium values (70-85%): Balanced detection (recommended)

  • Higher values (85-100%): Very confident switching, stays in current language unless clearly different

  • Default: 75%

  • Store Key: languageSwitchConfidence

Accent Profile

Select a regional accent profile to improve speech recognition accuracy for your specific accent or dialect.

  • Default: None (auto-detect)
  • Available Profiles: en-us-midwest, en-gb-london, en-au-sydney, ar-eg-cairo, ar-sa-riyadh, etc.
  • Store Key: accentProfileId

Phase 8: Voice Calibration Settings

Voice calibration optimizes the VAD (Voice Activity Detection) thresholds specifically for your voice and environment.

Calibration Status

Shows whether voice calibration has been completed:

  • Not Calibrated: Default state, using generic thresholds
  • Calibrated: Personal thresholds active (shows last calibration date)

Recalibrate Button

Launches the calibration wizard to:

  1. Record ambient noise samples
  2. Record your speaking voice at different volumes
  3. Compute personalized VAD thresholds

Calibration takes approximately 30-60 seconds.
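
One plausible way to derive a personalized threshold from those recordings is to place it between the ambient-noise level and the measured speech level; the sketch below is illustrative only and is not the wizard's actual algorithm:

// energies are normalized 0-1 values captured during calibration.
function computePersonalizedThreshold(ambientEnergies: number[], speechEnergies: number[]): number {
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const noiseFloor = avg(ambientEnergies);
  const speechLevel = avg(speechEnergies);
  // Place the threshold partway between noise and speech, clamped to 0-1.
  const threshold = noiseFloor + 0.3 * (speechLevel - noiseFloor);
  return Math.min(1, Math.max(0, threshold));
}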

Personalized VAD Threshold

After calibration, the system uses a custom threshold tuned to your voice:

  • Store Key: personalizedVadThreshold
  • Range: 0.0-1.0 (null if not calibrated)

Adaptive Learning

When enabled, the system continuously learns from your voice patterns and subtly adjusts thresholds over time.

  • Default: Enabled
  • Store Key: enableBehaviorLearning

Phase 9: Offline Mode Settings

Configure how the voice assistant behaves when network connectivity is poor or unavailable.

Enable Offline Fallback

When enabled, the system automatically switches to offline VAD processing when:

  • Network is offline

  • Health check fails consecutively

  • Network quality drops below threshold

  • Default: Enabled

  • Store Key: enableOfflineFallback

Prefer Local VAD

Force the use of local (on-device) VAD processing even when network is available. Useful for:

  • Privacy-conscious users who don't want audio sent to servers

  • Environments with unreliable connectivity

  • Lower latency at the cost of accuracy

  • Default: Disabled

  • Store Key: preferOfflineVAD

TTS Audio Caching

When enabled, previously synthesized audio responses are cached locally for:

  • Faster playback of repeated phrases

  • Offline playback of cached responses

  • Reduced bandwidth and API costs

  • Default: Enabled

  • Store Key: ttsCacheEnabled

Network Quality Monitoring

The system continuously monitors network quality and categorizes it into five levels:

Quality   | Latency    | Behavior
Excellent | < 100ms    | Full cloud processing
Good      | < 200ms    | Full cloud processing
Moderate  | < 500ms    | Cloud processing, may show warning
Poor      | ≥ 500ms    | Auto-fallback to offline VAD
Offline   | No network | Full offline mode

Network status is displayed in the voice panel header when quality is degraded.


Phase 10: Conversation Intelligence Settings

These settings control advanced AI features that enhance conversation quality.

Enable Sentiment Tracking

When enabled, the AI tracks emotional tone throughout the conversation and adapts its responses accordingly.

  • Default: Enabled
  • Store Key: enableSentimentTracking

Enable Discourse Analysis

Tracks conversation structure (topic changes, question chains, clarifications) to provide more contextually aware responses.

  • Default: Enabled
  • Store Key: enableDiscourseAnalysis

Enable Response Recommendations

The AI suggests relevant follow-up questions or actions based on conversation context.

  • Default: Enabled
  • Store Key: enableResponseRecommendations

Show Suggested Follow-Ups

Display AI-suggested follow-up questions after responses. These appear as clickable chips below the assistant's message.

  • Default: Enabled
  • Store Key: showSuggestedFollowUps

Privacy Settings

Store Transcript History

When enabled, voice transcripts are stored in the conversation history. Disable for ephemeral voice sessions.

  • Default: Enabled
  • Store Key: storeTranscriptHistory

Share Anonymous Analytics

Opt-in to share anonymized voice interaction metrics to help improve the service. No transcript content or personal data is shared - only timing metrics (latency, error rates).

  • Default: Disabled
  • Store Key: shareAnonymousAnalytics

Persistence

Voice preferences are now stored in two locations for maximum reliability:

  1. Backend API (Primary): Settings are synced to /api/voice/preferences and stored in the database. This enables cross-device settings sync when logged in.

  2. Local Storage (Fallback): Settings are also cached locally under voiceassist-voice-settings for offline access and faster loading.

Changes are debounced (1 second) before being sent to the backend to reduce API calls while editing.
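
A minimal sketch of that debounce pattern (the real logic lives in useVoicePreferencesSync.ts and also handles auth and error cases):

// Debounce backend writes so rapid slider changes produce one PUT request.
function createPreferencesSync(delayMs = 1000) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (preferences: Record<string, unknown>) => {
    clearTimeout(timer);
    timer = setTimeout(() => {
      void fetch("/api/voice/preferences", {
        method: "PUT",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(preferences), // Partial update, per the API table later in this guide
      });
    }, delayMs);
  };
}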

Resetting to Defaults

Click "Reset to defaults" in the settings modal to restore all settings to their original values:

Core Settings

  • Voice: Alloy
  • Language: English
  • VAD Sensitivity: 50%
  • Auto-start: Disabled
  • Show hints: Enabled
  • Context-aware style: Enabled
  • Stability: 50%
  • Clarity: 75%
  • Expressiveness: 0%

Phase 7 Defaults

  • Auto Language Detection: Enabled
  • Language Switch Confidence: 75%
  • Accent Profile ID: null

Phase 8 Defaults

  • VAD Calibrated: false
  • Last Calibration Date: null
  • Personalized VAD Threshold: null
  • Adaptive Learning: Enabled

Phase 9 Defaults

  • Offline Fallback: Enabled
  • Prefer Local VAD: Disabled
  • TTS Cache: Enabled

Phase 10 Defaults

  • Sentiment Tracking: Enabled
  • Discourse Analysis: Enabled
  • Response Recommendations: Enabled
  • Show Suggested Follow-Ups: Enabled

Privacy Defaults

  • Store Transcript History: Enabled
  • Share Anonymous Analytics: Disabled

Reset also syncs to the backend via POST /api/voice/preferences/reset.

Voice Preferences API (New)

The following API endpoints manage voice preferences:

Endpoint                     | Method | Description
/api/voice/preferences       | GET    | Get user's voice preferences
/api/voice/preferences       | PUT    | Update preferences (partial update)
/api/voice/preferences/reset | POST   | Reset to defaults
/api/voice/style-presets     | GET    | Get available style presets

Response Headers

TTS synthesis requests now include additional headers:

  • X-TTS-Provider: Which provider was used (openai or elevenlabs)
  • X-TTS-Fallback: Whether fallback was used (true/false)
  • X-TTS-Style: Detected style if context-aware is enabled

Technical Details

Store Location

Settings are managed by a Zustand store with persistence:

apps/web-app/src/stores/voiceSettingsStore.ts

Component Locations

  • Settings UI: apps/web-app/src/components/voice/VoiceModeSettings.tsx
  • Enhanced Settings: apps/web-app/src/components/voice/VoiceSettingsEnhanced.tsx
  • Calibration Dialog: apps/web-app/src/components/voice/CalibrationDialog.tsx

Phase 9 Offline/Network Files

  • Network Monitor: apps/web-app/src/lib/offline/networkMonitor.ts
  • WebRTC VAD: apps/web-app/src/lib/offline/webrtcVAD.ts
  • Offline Types: apps/web-app/src/lib/offline/types.ts
  • Network Status Hook: apps/web-app/src/hooks/useNetworkStatus.ts
  • Offline VAD Hook: apps/web-app/src/hooks/useOfflineVAD.ts

Backend Files (New)

  • Model: services/api-gateway/app/models/user_voice_preferences.py
  • Style Detector: services/api-gateway/app/services/voice_style_detector.py
  • API Endpoints: services/api-gateway/app/api/voice.py (preferences section)
  • Schemas: services/api-gateway/app/api/voice_schemas/schemas.py

Frontend Sync Hook (New)

apps/web-app/src/hooks/useVoicePreferencesSync.ts

Handles loading/saving preferences to backend with debouncing.

Integration Points

  • VoiceModePanel.tsx - Displays settings button and uses store values
  • MessageInput.tsx - Reads autoStartOnOpen for auto-open behavior
  • useVoicePreferencesSync.ts - Backend sync on auth and setting changes

Advanced: Voice Mode Pipeline

Settings are not just UI preferences - they propagate into real-time voice sessions:

  • Voice/Language: Sent to /api/voice/realtime-session and used by OpenAI Realtime API
  • VAD Sensitivity: Mapped to server-side VAD threshold (0→insensitive, 100→sensitive)

For comprehensive pipeline documentation including backend integration, WebSocket connections, and metrics, see VOICE_MODE_PIPELINE.md.
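
To make the propagation concrete, here is a sketch of how store values could be mapped onto the realtime-session request documented in the pipeline section above (the helper function itself is illustrative, not the app's actual call site):

async function createRealtimeSession() {
  const { voice, language, vadSensitivity } = useVoiceSettingsStore.getState();
  const response = await fetch("/api/voice/realtime-session", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      voice,                           // e.g. "alloy"
      language,                        // e.g. "en"
      vad_sensitivity: vadSensitivity, // 0-100, mapped to a VAD threshold server-side
    }),
  });
  return response.json(); // RealtimeSessionResponse (see the pipeline section)
}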


Development: Running Tests

Run the voice settings test suites individually to avoid memory issues:

cd apps/web-app

# Unit tests for voice settings store (core)
npx vitest run src/stores/__tests__/voiceSettingsStore.test.ts --reporter=dot

# Unit tests for voice settings store (Phase 7-10)
npx vitest run src/stores/__tests__/voiceSettingsStore-phase7-10.test.ts --reporter=dot

# Unit tests for network monitor
npx vitest run src/lib/offline/__tests__/networkMonitor.test.ts --reporter=dot

# Component tests for VoiceModeSettings
npx vitest run src/components/voice/__tests__/VoiceModeSettings.test.tsx --reporter=dot

# Integration tests for MessageInput voice settings
npx vitest run src/components/chat/__tests__/MessageInput-voice-settings.test.tsx --reporter=dot

Test Coverage

The test suites cover:

voiceSettingsStore.test.ts (17 tests)

  • Default values verification
  • All setter functions (voice, language, sensitivity, toggles)
  • VAD sensitivity clamping (0-100 range)
  • Reset functionality
  • LocalStorage persistence

voiceSettingsStore-phase7-10.test.ts (41 tests)

  • Phase 7: Multilingual settings (accent profile, auto-detection, confidence)
  • Phase 8: Calibration settings (VAD calibrated, dates, thresholds)
  • Phase 9: Offline mode settings (fallback, prefer offline VAD, TTS cache)
  • Phase 10: Conversation intelligence (sentiment, discourse, recommendations)
  • Privacy settings (transcript history, anonymous analytics)
  • Persistence tests for all Phase 7-10 settings
  • Reset tests verifying all defaults

networkMonitor.test.ts (13 tests)

  • Initial state detection (online/offline)
  • Health check latency measurement
  • Quality computation from latency thresholds
  • Consecutive failure handling before marking unhealthy
  • Subscription/unsubscription for status changes
  • Custom configuration (latency thresholds, health check URL)
  • Offline detection via navigator.onLine

VoiceModeSettings.test.tsx (25 tests)

  • Modal visibility (isOpen prop)
  • Current settings display
  • Settings updates via UI interactions
  • Reset with confirmation
  • Close behavior (Done, X, backdrop)
  • Accessibility (labels, ARIA attributes)

MessageInput-voice-settings.test.tsx (12 tests)

  • Auto-open via store setting (autoStartOnOpen)
  • Auto-open via prop (autoOpenRealtimeVoice)
  • Combined settings behavior
  • Voice/language display in panel header
  • Status hints visibility toggle

Total: 108+ tests for voice settings and related functionality.

Notes

  • Tests mock useRealtimeVoiceSession and WaveformVisualizer to avoid browser API dependencies
  • Run tests individually rather than the full suite to prevent memory issues
  • All tests use Vitest + React Testing Library
  • Phase 7-10 tests also mock fetch and performance.now for network monitoring