Voice Mode

Real-time voice interaction powered by OpenAI's Realtime API with bidirectional audio streaming and speech-to-text capabilities.

🎙️ Speech Input: Real-time speech recognition with partial transcript preview

🔊 Voice Output: Natural voice synthesis with multiple voice options

⚡ Low Latency: Optimized for minimal latency in voice interactions

✅ Implementation Status

  • Voice metrics dashboard with latency indicators
  • Microphone permission handling UX
  • Keyboard shortcuts (Ctrl+Shift+V, Space for push-to-talk)
  • Responsive voice panel layout
  • Real-time transcript preview during speech

Voice Pipeline Architecture

Voice Mode Pipeline

Status: Production-ready | Last Updated: 2025-12-03

This document describes the unified Voice Mode pipeline architecture, data flow, metrics, and testing strategy. It serves as the canonical reference for developers working on real-time voice features.

Voice Pipeline Modes

VoiceAssist supports two voice pipeline modes:

Mode                         | Description                    | Best For
Thinker-Talker (Recommended) | Local STT → LLM → TTS pipeline | Full tool support, unified context, custom TTS
OpenAI Realtime (Legacy)     | Direct OpenAI Realtime API     | Quick setup, minimal backend changes

Thinker-Talker Pipeline (Primary)

The Thinker-Talker pipeline is the recommended approach, providing:

  • Unified conversation context between voice and chat modes
  • Full tool/RAG support in voice interactions
  • Custom TTS via ElevenLabs with premium voices
  • Lower cost per interaction

Documentation: THINKER_TALKER_PIPELINE.md

[Audio] → [Deepgram STT] → [GPT-4o Thinker] → [ElevenLabs TTS] → [Audio Out]
                │                 │                  │
           Transcripts       Tool Calls        Audio Chunks
                │                 │                  │
                └───────── WebSocket Handler ────────┘

OpenAI Realtime API (Legacy)

The original implementation using OpenAI's Realtime API directly. Still supported for backward compatibility.


Implementation Status

Thinker-Talker Components

Component             | Status | Location
ThinkerService        | Live   | app/services/thinker_service.py
TalkerService         | Live   | app/services/talker_service.py
VoicePipelineService  | Live   | app/services/voice_pipeline_service.py
T/T WebSocket Handler | Live   | app/services/thinker_talker_websocket_handler.py
SentenceChunker       | Live   | app/services/sentence_chunker.py
Frontend T/T hook     | Live   | apps/web-app/src/hooks/useThinkerTalkerSession.ts
T/T Audio Playback    | Live   | apps/web-app/src/hooks/useTTAudioPlayback.ts
T/T Voice Panel       | Live   | apps/web-app/src/components/voice/ThinkerTalkerVoicePanel.tsx

OpenAI Realtime Components (Legacy)

Component                  | Status  | Location
Backend session endpoint   | Live    | services/api-gateway/app/api/voice.py
Ephemeral token generation | Live    | app/services/realtime_voice_service.py
Voice metrics endpoint     | Live    | POST /api/voice/metrics
Frontend voice hook        | Live    | apps/web-app/src/hooks/useRealtimeVoiceSession.ts
Voice settings store       | Live    | apps/web-app/src/stores/voiceSettingsStore.ts
Voice UI panel             | Live    | apps/web-app/src/components/voice/VoiceModePanel.tsx
Chat timeline integration  | Live    | Voice messages appear in chat
Barge-in support           | Live    | response.cancel + onSpeechStarted callback
Audio overlap prevention   | Live    | Response ID tracking + isProcessingResponseRef
E2E test suite             | Passing | 95 tests across unit/integration/E2E

Full status: See Implementation Status for all components.

Overview

Voice Mode enables real-time voice conversations with the AI assistant using OpenAI's Realtime API. The pipeline handles:

  • Ephemeral session authentication (no raw API keys in browser)
  • WebSocket-based bidirectional voice streaming
  • Voice activity detection (VAD) with user-configurable sensitivity
  • User settings propagation (voice, language, VAD threshold)
  • Chat timeline integration (voice messages appear in chat)
  • Connection state management with automatic reconnection
  • Barge-in support (interrupt AI while speaking)
  • Audio playback management (prevent overlapping responses)
  • Metrics tracking for observability

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                              FRONTEND                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────┐     ┌─────────────────────┐     ┌───────────────┐  │
│  │  VoiceModePanel     │────▶│useRealtimeVoice     │────▶│ voiceSettings │  │
│  │  (UI Component)     │     │Session (Hook)       │     │ Store         │  │
│  │  - Start/Stop       │     │- connect()          │     │ - voice       │  │
│  │  - Status display   │     │- disconnect()       │     │ - language    │  │
│  │  - Metrics logging  │     │- sendMessage()      │     │ - vadSens     │  │
│  └──────────┬──────────┘     └──────────┬──────────┘     └───────────────┘  │
│             │                           │                                   │
│             │                           │ onUserMessage()/onAssistantMessage()
│             │                           ▼                                   │
│  ┌──────────▼──────────┐     ┌─────────────────────┐                        │
│  │  MessageInput       │     │  ChatPage           │                        │
│  │  - Voice toggle     │────▶│  - useChatSession   │                        │
│  │  - Panel container  │     │  - addMessage()     │                        │
│  └─────────────────────┘     └─────────────────────┘                        │
│                                                                             │
└──────────────────────────────────────┬──────────────────────────────────────┘
                                       │
                                       │ POST /api/voice/realtime-session
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              BACKEND                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────┐     ┌─────────────────────┐                        │
│  │  voice.py           │────▶│  realtime_voice_    │                        │
│  │  (FastAPI Router)   │     │  service.py         │                        │
│  │  - /realtime-session│     │  - generate_session │                        │
│  │  - Timing logs      │     │  - ephemeral token  │                        │
│  └─────────────────────┘     └──────────┬──────────┘                        │
│                                         │                                   │
│                                         │ POST /v1/realtime/sessions        │
│                                         ▼                                   │
│                              ┌─────────────────────┐                        │
│                              │  OpenAI API         │                        │
│                              │  - Ephemeral token  │                        │
│                              │  - Voice config     │                        │
│                              └─────────────────────┘                        │
│                                                                             │
└──────────────────────────────────────┬──────────────────────────────────────┘
                                       │
                                       │ WebSocket wss://api.openai.com/v1/realtime
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          OPENAI REALTIME API                                │
├─────────────────────────────────────────────────────────────────────────────┤
│  - Server-side VAD (voice activity detection)                               │
│  - Bidirectional audio streaming (PCM16)                                    │
│  - Real-time transcription (Whisper)                                        │
│  - GPT-4o responses with audio synthesis                                    │
└─────────────────────────────────────────────────────────────────────────────┘

Backend: /api/voice/realtime-session

Location: services/api-gateway/app/api/voice.py

Request

interface RealtimeSessionRequest {
  conversation_id?: string; // Optional conversation context
  voice?: string;           // "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer"
  language?: string;        // "en" | "es" | "fr" | "de" | "it" | "pt"
  vad_sensitivity?: number; // 0-100 (maps to threshold: 0→0.9, 100→0.1)
}

Response

interface RealtimeSessionResponse {
  url: string;                    // WebSocket URL: "wss://api.openai.com/v1/realtime"
  model: string;                  // "gpt-4o-realtime-preview"
  session_id: string;             // Unique session identifier
  expires_at: number;             // Unix timestamp (epoch seconds)
  conversation_id: string | null;
  auth: {
    type: "ephemeral_token";
    token: string;                // Ephemeral token (ek_...), NOT raw API key
    expires_at: number;           // Token expiry (5 minutes)
  };
  voice_config: {
    voice: string;                // Selected voice
    modalities: ["text", "audio"];
    input_audio_format: "pcm16";
    output_audio_format: "pcm16";
    input_audio_transcription: { model: "whisper-1" };
    turn_detection: {
      type: "server_vad";
      threshold: number;          // 0.1 (sensitive) to 0.9 (insensitive)
      prefix_padding_ms: number;
      silence_duration_ms: number;
    };
  };
}

VAD Sensitivity Mapping

The frontend uses a 0-100 scale for user-friendly VAD sensitivity:

User Setting | VAD Threshold | Behavior
0 (Low)      | 0.9           | Requires loud/clear speech
50 (Medium)  | 0.5           | Balanced detection
100 (High)   | 0.1           | Very sensitive, picks up soft speech

Formula: threshold = 0.9 - (vad_sensitivity / 100 * 0.8)
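
For illustration, the same mapping expressed in TypeScript (the real conversion happens in the backend session endpoint; this helper is only a sketch):

// Sketch of the sensitivity-to-threshold mapping described above.
function vadSensitivityToThreshold(sensitivity: number): number {
  const clamped = Math.min(100, Math.max(0, sensitivity));
  // 0 -> 0.9 (requires loud speech), 100 -> 0.1 (very sensitive)
  return 0.9 - (clamped / 100) * 0.8;
}

vadSensitivityToThreshold(0);   // 0.9
vadSensitivityToThreshold(50);  // 0.5
vadSensitivityToThreshold(100); // 0.1 (approximately, due to floating point)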

Observability

Backend logs timing and context for each session request:

# Request logging
logger.info(
    f"Creating Realtime session for user {current_user.id}",
    extra={
        "user_id": current_user.id,
        "conversation_id": request.conversation_id,
        "voice": request.voice,
        "language": request.language,
        "vad_sensitivity": request.vad_sensitivity,
    },
)

# Success logging with duration
duration_ms = int((time.monotonic() - start_time) * 1000)
logger.info(
    f"Realtime session created for user {current_user.id}",
    extra={
        "user_id": current_user.id,
        "session_id": config["session_id"],
        "voice": config.get("voice_config", {}).get("voice"),
        "duration_ms": duration_ms,
    },
)

Frontend Hook: useRealtimeVoiceSession

Location: apps/web-app/src/hooks/useRealtimeVoiceSession.ts

Usage

const {
  status,             // 'disconnected' | 'connecting' | 'connected' | 'reconnecting' | 'failed' | 'expired' | 'error'
  transcript,         // Current transcript text
  isSpeaking,         // Is the AI currently speaking?
  isConnected,        // Derived: status === 'connected'
  isConnecting,       // Derived: status === 'connecting' || 'reconnecting'
  canSend,            // Can send messages?
  error,              // Error message if any
  metrics,            // VoiceMetrics object
  connect,            // () => Promise<void> - start session
  disconnect,         // () => void - end session
  sendMessage,        // (text: string) => void - send text message
} = useRealtimeVoiceSession({
  conversationId,
  voice,              // From voiceSettingsStore
  language,           // From voiceSettingsStore
  vadSensitivity,     // From voiceSettingsStore (0-100)
  onConnected,        // Callback when connected
  onDisconnected,     // Callback when disconnected
  onError,            // Callback on error
  onUserMessage,      // Callback with user transcript
  onAssistantMessage, // Callback with AI response
  onMetricsUpdate,    // Callback when metrics change
});

Connection States

disconnected ──▶ connecting ──▶ connected
                      │              │
                      ▼              ▼
                   failed ◀──── reconnecting
                      │              │
                      ▼              ▼
                  expired ◀────── error

State        | Description
disconnected | Initial/idle state
connecting   | Fetching session config, establishing WebSocket
connected    | Active voice session
reconnecting | Auto-reconnect after temporary disconnect
failed       | Connection failed (backend error, network issue)
expired      | Session token expired (needs manual restart)
error        | General error state

WebSocket Connection

The hook connects using three protocols for authentication:

const ws = new WebSocket(url, [
  "realtime",
  "openai-beta.realtime-v1",
  `openai-insecure-api-key.${ephemeralToken}`,
]);

Voice Settings Store

Location: apps/web-app/src/stores/voiceSettingsStore.ts

Schema

interface VoiceSettings {
  voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer";
  language: "en" | "es" | "fr" | "de" | "it" | "pt";
  vadSensitivity: number;   // 0-100
  autoStartOnOpen: boolean; // Auto-start voice when panel opens
  showStatusHints: boolean; // Show helper text in UI
}

Persistence

Settings are persisted to localStorage under key voiceassist-voice-settings using Zustand's persist middleware.
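
A minimal sketch of how a store persisted under that key is typically wired up with Zustand's persist middleware (the project's actual store has more fields and actions; the setters below are illustrative):

import { create } from "zustand";
import { persist } from "zustand/middleware";

interface VoiceSettingsState {
  voice: string;
  language: string;
  vadSensitivity: number;
  setVoice: (voice: string) => void;
  setVadSensitivity: (value: number) => void;
}

export const useVoiceSettingsStore = create<VoiceSettingsState>()(
  persist(
    (set) => ({
      voice: "alloy",
      language: "en",
      vadSensitivity: 50,
      setVoice: (voice) => set({ voice }),
      // Clamp to the documented 0-100 range before storing
      setVadSensitivity: (value) =>
        set({ vadSensitivity: Math.min(100, Math.max(0, value)) }),
    }),
    { name: "voiceassist-voice-settings" } // localStorage key from this doc
  )
);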

Defaults

Setting         | Default
voice           | "alloy"
language        | "en"
vadSensitivity  | 50
autoStartOnOpen | false
showStatusHints | true

Chat Integration

Location: apps/web-app/src/pages/ChatPage.tsx

Message Flow

  1. User speaks → VoiceModePanel receives final transcript
  2. VoiceModePanel calls onUserMessage(transcript)
  3. ChatPage receives callback, calls useChatSession.addMessage()
  4. Message added to timeline with metadata: { source: "voice" }

// ChatPage.tsx
const handleVoiceUserMessage = (content: string) => {
  addMessage({
    role: "user",
    content,
    metadata: { source: "voice" },
  });
};

const handleVoiceAssistantMessage = (content: string) => {
  addMessage({
    role: "assistant",
    content,
    metadata: { source: "voice" },
  });
};

Message Structure

interface VoiceMessage {
  id: string;        // "voice-{timestamp}-{random}"
  role: "user" | "assistant";
  content: string;
  timestamp: number;
  metadata: {
    source: "voice"; // Distinguishes from text messages
  };
}

Barge-in & Audio Playback

Location: apps/web-app/src/components/voice/VoiceModePanel.tsx, apps/web-app/src/hooks/useRealtimeVoiceSession.ts

Barge-in Flow

When the user starts speaking while the AI is responding, the system immediately:

  1. Detects speech start via OpenAI's input_audio_buffer.speech_started event
  2. Cancels active response by sending response.cancel to OpenAI
  3. Stops audio playback via onSpeechStarted callback
  4. Clears pending responses to prevent stale audio from playing

User speaks → speech_started event → response.cancel → stopCurrentAudio()
                                                            ↓
                                                    Audio stops
                                                    Queue cleared
                                                    Response ID incremented

Response Cancellation

Location: useRealtimeVoiceSession.ts - handleRealtimeMessage

case "input_audio_buffer.speech_started": setIsSpeaking(true); setPartialTranscript(""); // Barge-in: Cancel any active response when user starts speaking if (activeResponseIdRef.current && wsRef.current?.readyState === WebSocket.OPEN) { wsRef.current.send(JSON.stringify({ type: "response.cancel" })); activeResponseIdRef.current = null; } // Notify parent to stop audio playback options.onSpeechStarted?.(); break;

Audio Playback Management

Location: VoiceModePanel.tsx

The panel tracks audio playback state to prevent overlapping responses:

// Track currently playing Audio element
const currentAudioRef = useRef<HTMLAudioElement | null>(null);

// Prevent overlapping response processing
const isProcessingResponseRef = useRef(false);

// Response ID to invalidate stale responses after barge-in
const currentResponseIdRef = useRef<number>(0);

Stop current audio function:

const stopCurrentAudio = useCallback(() => {
  if (currentAudioRef.current) {
    currentAudioRef.current.pause();
    currentAudioRef.current.currentTime = 0;
    if (currentAudioRef.current.src.startsWith("blob:")) {
      URL.revokeObjectURL(currentAudioRef.current.src);
    }
    currentAudioRef.current = null;
  }
  audioQueueRef.current = [];
  isPlayingRef.current = false;
  currentResponseIdRef.current++; // Invalidate pending responses
  isProcessingResponseRef.current = false;
}, []);

Overlap Prevention

When a relay result arrives, the handler checks:

  1. Already processing? Skip if isProcessingResponseRef.current === true
  2. Response ID valid? Skip playback if ID changed (barge-in occurred)

onRelayResult: async ({ answer }) => {
  if (answer) {
    // Prevent overlapping responses
    if (isProcessingResponseRef.current) {
      console.log("[VoiceModePanel] Skipping response - already processing another");
      return;
    }

    const responseId = ++currentResponseIdRef.current;
    isProcessingResponseRef.current = true;

    // ... synthesis and playback ...

    // Check if response is still valid before playback
    if (responseId !== currentResponseIdRef.current) {
      console.log("[VoiceModePanel] Response cancelled - skipping playback");
      return;
    }
  }
};

Error Handling

Benign cancellation errors (e.g., "Cancellation failed: no active response found") are handled gracefully:

case "error": { const errorMessage = message.error?.message || "Realtime API error"; // Ignore benign cancellation errors if ( errorMessage.includes("Cancellation failed") || errorMessage.includes("no active response") ) { voiceLog.debug(`Ignoring benign error: ${errorMessage}`); break; } handleError(new Error(errorMessage)); break; }

Metrics

Location: apps/web-app/src/hooks/useRealtimeVoiceSession.ts

VoiceMetrics Interface

interface VoiceMetrics {
  connectionTimeMs: number | null;        // Time to establish connection
  timeToFirstTranscriptMs: number | null; // Time to first user transcript
  lastSttLatencyMs: number | null;        // Speech-to-text latency
  lastResponseLatencyMs: number | null;   // AI response latency
  sessionDurationMs: number | null;       // Total session duration
  userTranscriptCount: number;            // Number of user turns
  aiResponseCount: number;                // Number of AI turns
  reconnectCount: number;                 // Number of reconnections
  sessionStartedAt: number | null;        // Session start timestamp
}

Frontend Logging

VoiceModePanel logs key metrics to console:

// Connection time
console.log(`[VoiceModePanel] voice_session_connect_ms=${metrics.connectionTimeMs}`);

// STT latency
console.log(`[VoiceModePanel] voice_stt_latency_ms=${metrics.lastSttLatencyMs}`);

// Response latency
console.log(`[VoiceModePanel] voice_first_reply_ms=${metrics.lastResponseLatencyMs}`);

// Session duration
console.log(`[VoiceModePanel] voice_session_duration_ms=${metrics.sessionDurationMs}`);

Consuming Metrics

Developers can plug into metrics via the onMetricsUpdate callback:

useRealtimeVoiceSession({
  onMetricsUpdate: (metrics) => {
    // Send to telemetry service
    analytics.track("voice_session_metrics", {
      connection_ms: metrics.connectionTimeMs,
      stt_latency_ms: metrics.lastSttLatencyMs,
      response_latency_ms: metrics.lastResponseLatencyMs,
      duration_ms: metrics.sessionDurationMs,
    });
  },
});

Metrics Export to Backend

Metrics can be automatically exported to the backend for aggregation and alerting.

Backend Endpoint: POST /api/voice/metrics

Location: services/api-gateway/app/api/voice.py

Request Schema

interface VoiceMetricsPayload {
  conversation_id?: string;
  connection_time_ms?: number;
  time_to_first_transcript_ms?: number;
  last_stt_latency_ms?: number;
  last_response_latency_ms?: number;
  session_duration_ms?: number;
  user_transcript_count: number;
  ai_response_count: number;
  reconnect_count: number;
  session_started_at?: number;
}

Response

interface VoiceMetricsResponse { status: "ok"; }

Privacy

No PHI or transcript content is sent. Only timing metrics and counts.

Frontend Configuration

Metrics export is controlled by environment variables:

  • Production (import.meta.env.PROD): Metrics sent automatically
  • Development: Set VITE_ENABLE_VOICE_METRICS=true to enable

The export uses navigator.sendBeacon() for reliability (survives page navigation).
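
A sketch of what a beacon-based export could look like, using the payload field names from the Request Schema above (illustrative only, not the repo's exact code):

// Send timing metrics without blocking page unload.
function exportVoiceMetrics(metrics: {
  connectionTimeMs: number | null;
  lastSttLatencyMs: number | null;
  sessionDurationMs: number | null;
  userTranscriptCount: number;
  aiResponseCount: number;
  reconnectCount: number;
}): void {
  const payload = {
    connection_time_ms: metrics.connectionTimeMs ?? undefined,
    last_stt_latency_ms: metrics.lastSttLatencyMs ?? undefined,
    session_duration_ms: metrics.sessionDurationMs ?? undefined,
    user_transcript_count: metrics.userTranscriptCount,
    ai_response_count: metrics.aiResponseCount,
    reconnect_count: metrics.reconnectCount,
  };
  const blob = new Blob([JSON.stringify(payload)], { type: "application/json" });
  // sendBeacon queues the request even if the page is being unloaded
  navigator.sendBeacon("/api/voice/metrics", blob);
}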

Backend Logging

Metrics are logged with user context:

logger.info(
    "VoiceMetrics received",
    extra={
        "user_id": current_user.id,
        "conversation_id": payload.conversation_id,
        "connection_time_ms": payload.connection_time_ms,
        "session_duration_ms": payload.session_duration_ms,
        ...
    },
)

Testing

# Backend
cd /home/asimo/VoiceAssist/services/api-gateway
source venv/bin/activate && export PYTHONPATH=.
python -m pytest tests/integration/test_voice_metrics.py -v

Security

Ephemeral Token Architecture

CRITICAL: The browser NEVER receives the raw OpenAI API key.

  1. Backend holds OPENAI_API_KEY securely
  2. Frontend requests session via /api/voice/realtime-session
  3. Backend creates ephemeral token via OpenAI /v1/realtime/sessions
  4. Ephemeral token returned to frontend (valid ~5 minutes)
  5. Frontend connects WebSocket using ephemeral token

Token Refresh

The hook monitors session.expires_at and can trigger refresh before expiry. If the token expires mid-session, status transitions to expired.
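
As an illustration only, refresh scheduling can be as simple as a timer keyed off expires_at; the hook's real logic differs, so treat the helper below (and its 30-second margin) as an assumption:

// expiresAt is the Unix timestamp (epoch seconds) from the session response.
function scheduleTokenRefresh(expiresAt: number, refresh: () => Promise<void>): () => void {
  const msUntilExpiry = expiresAt * 1000 - Date.now();
  // Refresh shortly before expiry (illustrative 30-second margin).
  const delay = Math.max(0, msUntilExpiry - 30_000);
  const timer = setTimeout(() => void refresh(), delay);
  return () => clearTimeout(timer); // Cancel on disconnect
}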

Testing

Voice Pipeline Smoke Suite

Run these commands to validate the voice pipeline:

# 1. Backend tests (CI-safe, mocked)
cd /home/asimo/VoiceAssist/services/api-gateway
source venv/bin/activate
export PYTHONPATH=.
python -m pytest tests/integration/test_openai_config.py -v

# 2. Frontend unit tests (run individually to avoid OOM)
cd /home/asimo/VoiceAssist/apps/web-app
export NODE_OPTIONS="--max-old-space-size=768"
npx vitest run src/hooks/__tests__/useRealtimeVoiceSession.test.ts --reporter=dot
npx vitest run src/hooks/__tests__/useChatSession-voice-integration.test.ts --reporter=dot
npx vitest run src/stores/__tests__/voiceSettingsStore.test.ts --reporter=dot
npx vitest run src/components/voice/__tests__/VoiceModeSettings.test.tsx --reporter=dot
npx vitest run src/components/chat/__tests__/MessageInput-voice-settings.test.tsx --reporter=dot

# 3. E2E tests (Chromium, mocked backend)
cd /home/asimo/VoiceAssist
npx playwright test \
  e2e/voice-mode-navigation.spec.ts \
  e2e/voice-mode-session-smoke.spec.ts \
  e2e/voice-mode-voice-chat-integration.spec.ts \
  --project=chromium --reporter=list

Test Coverage Summary

Test File                                 | Tests | Coverage
useRealtimeVoiceSession.test.ts           | 22    | Hook lifecycle, states, metrics
useChatSession-voice-integration.test.ts  | 8     | Message structure validation
voiceSettingsStore.test.ts                | 17    | Store actions, persistence
VoiceModeSettings.test.tsx                | 25    | Component rendering, interactions
MessageInput-voice-settings.test.tsx      | 12    | Integration with chat input
voice-mode-navigation.spec.ts             | 4     | E2E navigation flow
voice-mode-session-smoke.spec.ts          | 3     | E2E session smoke (1 live gated)
voice-mode-voice-chat-integration.spec.ts | 4     | E2E panel integration

Total: 95 tests

Live Testing

To test with real OpenAI backend:

# Backend (requires OPENAI_API_KEY in .env)
LIVE_REALTIME_TESTS=1 python -m pytest tests/integration/test_openai_config.py -v

# E2E (requires running backend + valid API key)
LIVE_REALTIME_E2E=1 npx playwright test e2e/voice-mode-session-smoke.spec.ts

File Reference

Backend

File                                                         | Purpose
services/api-gateway/app/api/voice.py                        | API routes, metrics, timing logs
services/api-gateway/app/services/realtime_voice_service.py  | Session creation, token generation
services/api-gateway/tests/integration/test_openai_config.py | Integration tests
services/api-gateway/tests/integration/test_voice_metrics.py | Metrics endpoint tests

Frontend

File                                                    | Purpose
apps/web-app/src/hooks/useRealtimeVoiceSession.ts       | Core hook
apps/web-app/src/components/voice/VoiceModePanel.tsx    | UI panel
apps/web-app/src/components/voice/VoiceModeSettings.tsx | Settings modal
apps/web-app/src/stores/voiceSettingsStore.ts           | Settings store
apps/web-app/src/components/chat/MessageInput.tsx       | Voice button integration
apps/web-app/src/pages/ChatPage.tsx                     | Chat timeline integration
apps/web-app/src/hooks/useChatSession.ts                | addMessage() helper

Tests

File                                                                            | Purpose
apps/web-app/src/hooks/__tests__/useRealtimeVoiceSession.test.ts                | Hook tests
apps/web-app/src/hooks/__tests__/useChatSession-voice-integration.test.ts       | Chat integration
apps/web-app/src/stores/__tests__/voiceSettingsStore.test.ts                    | Store tests
apps/web-app/src/components/voice/__tests__/VoiceModeSettings.test.tsx          | Component tests
apps/web-app/src/components/chat/__tests__/MessageInput-voice-settings.test.tsx | Integration tests
e2e/voice-mode-navigation.spec.ts                                               | E2E navigation
e2e/voice-mode-session-smoke.spec.ts                                            | E2E smoke test
e2e/voice-mode-voice-chat-integration.spec.ts                                   | E2E panel integration

Observability & Monitoring (Phase 3)

Implemented: 2025-12-02

The voice pipeline includes comprehensive observability features for production monitoring.

Error Taxonomy (voice_errors.py)

Location: services/api-gateway/app/core/voice_errors.py

Structured error classification with 8 categories and 40+ error codes:

Category   | Codes          | Description
CONNECTION | CONN_001-7     | WebSocket, network failures
STT        | STT_001-7      | Speech-to-text errors
TTS        | TTS_001-7      | Text-to-speech errors
LLM        | LLM_001-6      | LLM processing errors
AUDIO      | AUDIO_001-6    | Audio encoding/decoding errors
TIMEOUT    | TIMEOUT_001-7  | Various timeout conditions
PROVIDER   | PROVIDER_001-6 | External provider errors
INTERNAL   | INTERNAL_001-5 | Internal server errors

Each error code includes:

  • Recoverability flag (can auto-retry)
  • Retry configuration (delay, max attempts)
  • User-friendly description
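
For illustration, such an error-code record could be modeled as follows in TypeScript; the real definitions live in the Python module above, and the field names here are assumptions, not its actual API:

type VoiceErrorCategory =
  | "CONNECTION" | "STT" | "TTS" | "LLM"
  | "AUDIO" | "TIMEOUT" | "PROVIDER" | "INTERNAL";

interface VoiceErrorCode {
  code: string;              // e.g. "CONN_001"
  category: VoiceErrorCategory;
  recoverable: boolean;      // Can the client auto-retry?
  retryDelayMs?: number;     // Suggested delay before retry
  maxRetries?: number;       // Upper bound on automatic retries
  description: string;       // User-friendly message
}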

Voice Metrics (metrics.py)

Location: services/api-gateway/app/core/metrics.py

Prometheus metrics for voice pipeline monitoring:

Metric                               | Type      | Labels                                | Description
voice_errors_total                   | Counter   | category, code, provider, recoverable | Total voice errors
voice_pipeline_stage_latency_seconds | Histogram | stage                                 | Per-stage latency
voice_ttfa_seconds                   | Histogram | -                                     | Time to first audio
voice_active_sessions                | Gauge     | -                                     | Active voice sessions
voice_barge_in_total                 | Counter   | -                                     | Barge-in events
voice_audio_chunks_total             | Counter   | status                                | Audio chunks processed

Per-Stage Latency Tracking (voice_timing.py)

Location: services/api-gateway/app/core/voice_timing.py

Pipeline stages tracked:

  • audio_receive - Time to receive audio from client
  • vad_process - Voice activity detection time
  • stt_transcribe - Speech-to-text latency
  • llm_process - LLM inference time
  • tts_synthesize - Text-to-speech synthesis
  • audio_send - Time to send audio to client
  • ttfa - Time to first audio (end-to-end)

Usage:

from app.core.voice_timing import create_pipeline_timings, PipelineStage

timings = create_pipeline_timings(session_id="abc123")

with timings.time_stage(PipelineStage.STT_TRANSCRIBE):
    transcript = await stt_client.transcribe(audio)

timings.record_ttfa()  # When first audio byte ready
timings.finalize()     # When response complete

SLO Alerts (voice_slo_alerts.yml)

Location: infrastructure/observability/prometheus/rules/voice_slo_alerts.yml

SLO targets with Prometheus alerting rules:

SLO                  | Target  | Alert
TTFA P95             | < 200ms | VoiceTTFASLOViolation
STT Latency P95      | < 300ms | VoiceSTTLatencySLOViolation
TTS First Chunk P95  | < 200ms | VoiceTTSFirstChunkSLOViolation
Connection Time P95  | < 500ms | VoiceConnectionTimeSLOViolation
Error Rate           | < 1%    | VoiceErrorRateHigh
Session Success Rate | > 95%   | VoiceSessionSuccessRateLow

Client Telemetry (voiceTelemetry.ts)

Location: apps/web-app/src/lib/voiceTelemetry.ts

Frontend telemetry with:

  • Network quality assessment via Network Information API
  • Browser performance metrics via Performance.memory API
  • Jitter estimation for network quality
  • Batched reporting (10s intervals)
  • Beacon API for reliable delivery on page unload

import { getVoiceTelemetry } from "@/lib/voiceTelemetry";

const telemetry = getVoiceTelemetry();
telemetry.startSession(sessionId);
telemetry.recordLatency("stt", 150);
telemetry.recordLatency("ttfa", 180);
telemetry.endSession();
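
The jitter estimate mentioned above can be approximated as the mean absolute difference between consecutive latency samples; the helper below is a sketch, not the actual voiceTelemetry.ts implementation:

// Rough jitter estimate from recent round-trip latency samples (ms).
function estimateJitter(latencySamples: number[]): number {
  if (latencySamples.length < 2) return 0;
  let totalDelta = 0;
  for (let i = 1; i < latencySamples.length; i++) {
    totalDelta += Math.abs(latencySamples[i] - latencySamples[i - 1]);
  }
  return totalDelta / (latencySamples.length - 1);
}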

Voice Health Endpoint (/health/voice)

Location: services/api-gateway/app/api/health.py

Comprehensive voice subsystem health check:

curl https://assist.asimo.io/health/voice

Response:

{ "status": "healthy", "providers": { "openai": { "status": "up", "latency_ms": 120.5 }, "elevenlabs": { "status": "up", "latency_ms": 85.2 }, "deepgram": { "status": "up", "latency_ms": 95.8 } }, "session_store": { "status": "up", "active_sessions": 5 }, "metrics": { "active_sessions": 5 }, "slo": { "ttfa_target_ms": 200, "error_rate_target": 0.01 } }

Debug Logging Configuration

Location: services/api-gateway/app/core/logging.py

Configurable voice log verbosity via VOICE_LOG_LEVEL environment variable:

Level    | Content
MINIMAL  | Errors only
STANDARD | + Session lifecycle (start/end/state changes)
VERBOSE  | + All latency measurements
DEBUG    | + Audio frame details, chunk timing

Usage:

from app.core.logging import get_voice_logger

voice_log = get_voice_logger(__name__)

voice_log.session_start(session_id="abc123", provider="thinker_talker")
voice_log.latency("stt_transcribe", 150.5, session_id="abc123")
voice_log.error("voice_connection_failed", error_code="CONN_001")

Phase 9: Offline & Network Fallback

Implemented: 2025-12-03

The voice pipeline now includes comprehensive offline support and network-aware fallback mechanisms.

Network Monitoring (networkMonitor.ts)

Location: apps/web-app/src/lib/offline/networkMonitor.ts

Continuously monitors network health using multiple signals:

  • Navigator.onLine: Basic online/offline detection
  • Network Information API: Connection type, downlink speed, RTT
  • Health Check Pinging: Periodic /api/health pings for latency measurement

import { getNetworkMonitor } from "@/lib/offline/networkMonitor";

const monitor = getNetworkMonitor();
monitor.subscribe((status) => {
  console.log(`Network quality: ${status.quality}`);
  console.log(`Health check latency: ${status.healthCheckLatencyMs}ms`);
});

Network Quality Levels

Quality   | Latency     | isHealthy | Action
Excellent | < 100ms     | true      | Full cloud processing
Good      | < 200ms     | true      | Full cloud processing
Moderate  | < 500ms     | true      | Cloud with quality warning
Poor      | ≥ 500ms     | variable  | Consider offline fallback
Offline   | Unreachable | false     | Automatic offline fallback
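
A sketch of how latency could map onto these levels using the default thresholds from the configuration below (illustrative only; the real computation lives in networkMonitor.ts):

type NetworkQuality = "excellent" | "good" | "moderate" | "poor" | "offline";

function qualityFromLatency(isOnline: boolean, latencyMs: number | null): NetworkQuality {
  if (!isOnline || latencyMs === null) return "offline";
  if (latencyMs < 100) return "excellent";
  if (latencyMs < 200) return "good";
  if (latencyMs < 500) return "moderate";
  return "poor";
}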

Configuration

const monitor = createNetworkMonitor({
  healthCheckUrl: "/api/health",
  healthCheckIntervalMs: 30000, // 30 seconds
  healthCheckTimeoutMs: 5000,   // 5 seconds
  goodLatencyThresholdMs: 100,
  moderateLatencyThresholdMs: 200,
  poorLatencyThresholdMs: 500,
  failuresBeforeUnhealthy: 3,
});

useNetworkStatus Hook

Location: apps/web-app/src/hooks/useNetworkStatus.ts

React hook providing network status with computed properties:

const {
  isOnline,
  isHealthy,
  quality,
  healthCheckLatencyMs,
  effectiveType,      // "4g", "3g", "2g", "slow-2g"
  downlink,           // Mbps
  rtt,                // Round-trip time ms
  isSuitableForVoice, // quality >= "good" && isHealthy
  shouldUseOffline,   // !isOnline || !isHealthy || quality < "moderate"
  qualityScore,       // 0-4 (offline=0, poor=1, moderate=2, good=3, excellent=4)
  checkNow,           // Force immediate health check
} = useNetworkStatus();

Offline VAD with Network Fallback

Location: apps/web-app/src/hooks/useOfflineVAD.ts

The useOfflineVADWithFallback hook automatically switches between network and offline VAD:

const {
  isListening,
  isSpeaking,
  currentEnergy,
  isUsingOfflineVAD, // Currently using offline mode?
  networkAvailable,
  networkQuality,
  modeReason,        // "network_vad" | "network_unavailable" | "poor_quality" | "forced_offline"
  forceOffline,      // Manually switch to offline
  forceNetwork,      // Manually switch to network (if available)
  startListening,
  stopListening,
} = useOfflineVADWithFallback({
  useNetworkMonitor: true,
  minNetworkQuality: "moderate",
  networkRecoveryDelayMs: 2000, // Prevent flapping
  onFallbackToOffline: () => console.log("Switched to offline VAD"),
  onReturnToNetwork: () => console.log("Returned to network VAD"),
});

Fallback Decision Flow

┌────────────────────┐
│  Network Monitor   │
│  Health Check      │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐     NO     ┌────────────────────┐
│  Is Online?        │───────────▶│  Use Offline VAD   │
└─────────┬──────────┘            └────────────────────┘
          │ YES
          ▼
┌────────────────────┐     NO     ┌────────────────────┐
│  Is Healthy?       │───────────▶│  Use Offline VAD   │
│  (3+ checks pass)  │            │  reason: unhealthy │
└─────────┬──────────┘            └────────────────────┘
          │ YES
          ▼
┌────────────────────┐     NO     ┌────────────────────┐
│  Quality ≥ Min?    │───────────▶│  Use Offline VAD   │
│  (e.g., moderate)  │            │  reason: poor_qual │
└─────────┬──────────┘            └────────────────────┘
          │ YES
          ▼
┌────────────────────┐
│  Use Network VAD   │
│  (cloud processing)│
└────────────────────┘

TTS Caching (useTTSCache)

Location: apps/web-app/src/hooks/useOfflineVAD.ts

Caches synthesized TTS audio for offline playback:

const {
  getTTS,   // Get audio (from cache or fresh)
  preload,  // Preload common phrases
  isCached, // Check if text is cached
  stats,    // { entryCount, sizeMB, hitRate }
  clear,    // Clear cache
} = useTTSCache({
  voice: "alloy",
  maxSizeMB: 50,
  ttsFunction: async (text) => synthesizeAudio(text),
});

// Preload common phrases on app start
await preload(); // Caches "I'm listening", "Go ahead", etc.

// Get TTS (cache hit = instant, cache miss = synthesize + cache)
const audio = await getTTS("Hello world");

User Settings Integration

Phase 9 settings are stored in voiceSettingsStore:

Setting               | Default | Description
enableOfflineFallback | true    | Auto-switch to offline when network poor
preferOfflineVAD      | false   | Force offline VAD (privacy mode)
ttsCacheEnabled       | true    | Enable TTS response caching

File Reference (Phase 9)

File                                                          | Purpose
apps/web-app/src/lib/offline/networkMonitor.ts                | Network health monitoring
apps/web-app/src/lib/offline/webrtcVAD.ts                     | WebRTC-based offline VAD
apps/web-app/src/lib/offline/types.ts                         | Offline module type definitions
apps/web-app/src/hooks/useNetworkStatus.ts                    | React hook for network status
apps/web-app/src/hooks/useOfflineVAD.ts                       | Offline VAD + TTS cache hooks
apps/web-app/src/lib/offline/__tests__/networkMonitor.test.ts | Network monitor tests

Completed Roadmap Items

  • Metrics export to backend: Send metrics to backend for aggregation/alerting ✓ Implemented
  • Barge-in support: Allow user to interrupt AI responses ✓ Implemented (2025-11-28)
  • Audio overlap prevention: Prevent multiple responses playing simultaneously ✓ Implemented (2025-11-28)
  • Per-user voice preferences: Backend persistence for TTS settings ✓ Implemented (2025-11-29)
  • Context-aware voice styles: Auto-detect tone from content ✓ Implemented (2025-11-29)
  • Aggressive latency optimization: 200ms VAD, 256-sample chunks, 300ms reconnect ✓ Implemented (2025-11-29)
  • Observability & Monitoring (Phase 3): Error taxonomy, metrics, SLO alerts, telemetry ✓ Implemented (2025-12-02)
  • Phase 7: Multilingual Support: Auto language detection, accent profiles, language switch confidence ✓ Implemented (2025-12-03)
  • Phase 8: Voice Calibration: Personalized VAD thresholds, calibration wizard, adaptive learning ✓ Implemented (2025-12-03)
  • Phase 9: Offline Fallback: Network monitoring, offline VAD, TTS caching, quality-based switching ✓ Implemented (2025-12-03)
  • Phase 10: Conversation Intelligence: Sentiment tracking, discourse analysis, response recommendations ✓ Implemented (2025-12-03)

Voice Mode Enhancement - 10 Phase Plan ✅ COMPLETE (2025-12-03)

A comprehensive enhancement transforming voice mode into a human-like conversational partner with medical dictation:

  • Phase 1: Emotional Intelligence (Hume AI) ✓ Complete
  • Phase 2: Backchanneling System ✓ Complete
  • Phase 3: Prosody Analysis ✓ Complete
  • Phase 4: Memory & Context System ✓ Complete
  • Phase 5: Advanced Turn-Taking ✓ Complete
  • Phase 6: Variable Response Timing ✓ Complete
  • Phase 7: Conversational Repair ✓ Complete
  • Phase 8: Medical Dictation Core ✓ Complete
  • Phase 9: Patient Context Integration ✓ Complete
  • Phase 10: Frontend Integration & Analytics ✓ Complete

Full documentation: VOICE_MODE_ENHANCEMENT_10_PHASE.md

Remaining Tasks

  • Voice→chat transcript content E2E: Test actual transcript content in chat timeline
  • Error tracking integration: Send errors to Sentry/similar
  • Audio level visualization: Show real-time audio level meter during recording

Voice Settings Guide

Voice Mode Settings Guide

This guide explains how to use and configure Voice Mode settings in VoiceAssist.

Overview

Voice Mode provides real-time voice conversations with the AI assistant. Users can customize their voice experience through the settings panel, including voice selection, language preferences, TTS quality parameters, and behavior options.

Voice Mode Overhaul (2025-11-29): Added backend persistence for voice preferences, context-aware voice style detection, and advanced TTS quality controls.

Phase 7-10 Enhancements (2025-12-03): Added multilingual support with auto-detection, voice calibration, offline fallback with network monitoring, and conversation intelligence features.

Accessing Settings

  1. Open Voice Mode by clicking the voice button in the chat interface
  2. Click the gear icon in the Voice Mode panel header
  3. The settings modal will appear

Available Settings

Voice Selection

Choose from 6 different AI voices:

  • Alloy - Neutral, balanced voice (default)
  • Echo - Warm, friendly voice
  • Fable - Expressive, narrative voice
  • Onyx - Deep, authoritative voice
  • Nova - Energetic, bright voice
  • Shimmer - Soft, calming voice

Language

Select your preferred conversation language:

  • English (default)
  • Spanish
  • French
  • German
  • Italian
  • Portuguese

Voice Detection Sensitivity (0-100%)

Controls how sensitive the voice activity detection is:

  • Lower values (0-30%): Less sensitive, requires louder/clearer speech
  • Medium values (40-60%): Balanced detection (recommended)
  • Higher values (70-100%): More sensitive, may pick up background noise

Auto-start Voice Mode

When enabled, Voice Mode will automatically open when you start a new chat or navigate to the chat page. This is useful for voice-first interactions.

Show Status Hints

When enabled, displays helpful tips and instructions in the Voice Mode panel. Disable if you're familiar with the interface and want a cleaner view.

Context-Aware Voice Style (New)

When enabled, the AI automatically adjusts its voice tone based on the content being spoken:

  • Calm: Default for medical explanations (stable, measured pace)
  • Urgent: For medical warnings/emergencies (dynamic, faster)
  • Empathetic: For sensitive health topics (warm, slower)
  • Instructional: For step-by-step guidance (clear, deliberate)
  • Conversational: For general chat (natural, varied)

The system detects keywords and patterns to select the appropriate style, then blends it with your base preferences (60% your settings, 40% style preset).
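
To make the 60/40 blend concrete, a weighted mix over the TTS parameters from the Advanced Voice Quality section below might look like this (an illustrative sketch; the parameter names are assumptions based on that section, not the actual implementation):

interface TTSParams {
  stability: number;      // 0-100
  clarity: number;        // 0-100
  expressiveness: number; // 0-100
}

// Blend user preferences (60%) with the detected style preset (40%).
function blendStyle(user: TTSParams, preset: TTSParams): TTSParams {
  const mix = (a: number, b: number) => Math.round(a * 0.6 + b * 0.4);
  return {
    stability: mix(user.stability, preset.stability),
    clarity: mix(user.clarity, preset.clarity),
    expressiveness: mix(user.expressiveness, preset.expressiveness),
  };
}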

Advanced Voice Quality (New)

Expand this section to fine-tune TTS output parameters:

  • Voice Stability (0-100%): Lower = more expressive/varied, Higher = more consistent
  • Voice Clarity (0-100%): Higher values produce clearer, more consistent voice
  • Expressiveness (0-100%): Higher values add more emotion and style variation

These settings primarily affect ElevenLabs TTS but also influence context-aware style blending for OpenAI TTS.


Phase 7: Language & Detection Settings

Auto-Detect Language

When enabled, the system automatically detects the language being spoken and adjusts processing accordingly. This is useful for multilingual users who switch between languages naturally.

  • Default: Enabled
  • Store Key: autoLanguageDetection

Language Switch Confidence (0-100%)

Controls how confident the system must be before switching to a detected language. Higher values prevent false-positive language switches.

  • Lower values (50-70%): More responsive language switching, but may switch accidentally on similar-sounding phrases

  • Medium values (70-85%): Balanced detection (recommended)

  • Higher values (85-100%): Very confident switching, stays in current language unless clearly different

  • Default: 75%

  • Store Key: languageSwitchConfidence

Accent Profile

Select a regional accent profile to improve speech recognition accuracy for your specific accent or dialect.

  • Default: None (auto-detect)
  • Available Profiles: en-us-midwest, en-gb-london, en-au-sydney, ar-eg-cairo, ar-sa-riyadh, etc.
  • Store Key: accentProfileId

Phase 8: Voice Calibration Settings

Voice calibration optimizes the VAD (Voice Activity Detection) thresholds specifically for your voice and environment.

Calibration Status

Shows whether voice calibration has been completed:

  • Not Calibrated: Default state, using generic thresholds
  • Calibrated: Personal thresholds active (shows last calibration date)

Recalibrate Button

Launches the calibration wizard to:

  1. Record ambient noise samples
  2. Record your speaking voice at different volumes
  3. Compute personalized VAD thresholds

Calibration takes approximately 30-60 seconds.
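
One plausible way to derive a personalized threshold from those recordings is to place it between the ambient-noise level and the measured speech level; the sketch below is illustrative only and is not the wizard's actual algorithm:

// energies are normalized 0-1 values captured during calibration.
function computePersonalizedThreshold(ambientEnergies: number[], speechEnergies: number[]): number {
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const noiseFloor = avg(ambientEnergies);
  const speechLevel = avg(speechEnergies);
  // Place the threshold partway between noise and speech, clamped to 0-1.
  const threshold = noiseFloor + 0.3 * (speechLevel - noiseFloor);
  return Math.min(1, Math.max(0, threshold));
}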

Personalized VAD Threshold

After calibration, the system uses a custom threshold tuned to your voice:

  • Store Key: personalizedVadThreshold
  • Range: 0.0-1.0 (null if not calibrated)

Adaptive Learning

When enabled, the system continuously learns from your voice patterns and subtly adjusts thresholds over time.

  • Default: Enabled
  • Store Key: enableBehaviorLearning

Phase 9: Offline Mode Settings

Configure how the voice assistant behaves when network connectivity is poor or unavailable.

Enable Offline Fallback

When enabled, the system automatically switches to offline VAD processing when:

  • Network is offline

  • Health check fails consecutively

  • Network quality drops below threshold

  • Default: Enabled

  • Store Key: enableOfflineFallback

Prefer Local VAD

Force the use of local (on-device) VAD processing even when network is available. Useful for:

  • Privacy-conscious users who don't want audio sent to servers

  • Environments with unreliable connectivity

  • Lower latency at the cost of accuracy

  • Default: Disabled

  • Store Key: preferOfflineVAD

TTS Audio Caching

When enabled, previously synthesized audio responses are cached locally for:

  • Faster playback of repeated phrases

  • Offline playback of cached responses

  • Reduced bandwidth and API costs

  • Default: Enabled

  • Store Key: ttsCacheEnabled

Network Quality Monitoring

The system continuously monitors network quality and categorizes it into five levels:

Quality   | Latency    | Behavior
Excellent | < 100ms    | Full cloud processing
Good      | < 200ms    | Full cloud processing
Moderate  | < 500ms    | Cloud processing, may show warning
Poor      | ≥ 500ms    | Auto-fallback to offline VAD
Offline   | No network | Full offline mode

Network status is displayed in the voice panel header when quality is degraded.


Phase 10: Conversation Intelligence Settings

These settings control advanced AI features that enhance conversation quality.

Enable Sentiment Tracking

When enabled, the AI tracks emotional tone throughout the conversation and adapts its responses accordingly.

  • Default: Enabled
  • Store Key: enableSentimentTracking

Enable Discourse Analysis

Tracks conversation structure (topic changes, question chains, clarifications) to provide more contextually aware responses.

  • Default: Enabled
  • Store Key: enableDiscourseAnalysis

Enable Response Recommendations

The AI suggests relevant follow-up questions or actions based on conversation context.

  • Default: Enabled
  • Store Key: enableResponseRecommendations

Show Suggested Follow-Ups

Display AI-suggested follow-up questions after responses. These appear as clickable chips below the assistant's message.

  • Default: Enabled
  • Store Key: showSuggestedFollowUps

Privacy Settings

Store Transcript History

When enabled, voice transcripts are stored in the conversation history. Disable for ephemeral voice sessions.

  • Default: Enabled
  • Store Key: storeTranscriptHistory

Share Anonymous Analytics

Opt-in to share anonymized voice interaction metrics to help improve the service. No transcript content or personal data is shared - only timing metrics (latency, error rates).

  • Default: Disabled
  • Store Key: shareAnonymousAnalytics

Persistence

Voice preferences are now stored in two locations for maximum reliability:

  1. Backend API (Primary): Settings are synced to /api/voice/preferences and stored in the database. This enables cross-device settings sync when logged in.

  2. Local Storage (Fallback): Settings are also cached locally under voiceassist-voice-settings for offline access and faster loading.

Changes are debounced (1 second) before being sent to the backend to reduce API calls while editing.
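
A minimal sketch of that debounce pattern (the real logic lives in useVoicePreferencesSync.ts and also handles auth and error cases):

// Debounce backend writes so rapid slider changes produce one PUT request.
function createPreferencesSync(delayMs = 1000) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (preferences: Record<string, unknown>) => {
    clearTimeout(timer);
    timer = setTimeout(() => {
      void fetch("/api/voice/preferences", {
        method: "PUT",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(preferences), // Partial update, per the API table later in this guide
      });
    }, delayMs);
  };
}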

Resetting to Defaults

Click "Reset to defaults" in the settings modal to restore all settings to their original values:

Core Settings

  • Voice: Alloy
  • Language: English
  • VAD Sensitivity: 50%
  • Auto-start: Disabled
  • Show hints: Enabled
  • Context-aware style: Enabled
  • Stability: 50%
  • Clarity: 75%
  • Expressiveness: 0%

Phase 7 Defaults

  • Auto Language Detection: Enabled
  • Language Switch Confidence: 75%
  • Accent Profile ID: null

Phase 8 Defaults

  • VAD Calibrated: false
  • Last Calibration Date: null
  • Personalized VAD Threshold: null
  • Adaptive Learning: Enabled

Phase 9 Defaults

  • Offline Fallback: Enabled
  • Prefer Local VAD: Disabled
  • TTS Cache: Enabled

Phase 10 Defaults

  • Sentiment Tracking: Enabled
  • Discourse Analysis: Enabled
  • Response Recommendations: Enabled
  • Show Suggested Follow-Ups: Enabled

Privacy Defaults

  • Store Transcript History: Enabled
  • Share Anonymous Analytics: Disabled

Reset also syncs to the backend via POST /api/voice/preferences/reset.

Voice Preferences API (New)

The following API endpoints manage voice preferences:

Endpoint                     | Method | Description
/api/voice/preferences       | GET    | Get user's voice preferences
/api/voice/preferences       | PUT    | Update preferences (partial update)
/api/voice/preferences/reset | POST   | Reset to defaults
/api/voice/style-presets     | GET    | Get available style presets

Response Headers

TTS synthesis requests now include additional headers:

  • X-TTS-Provider: Which provider was used (openai or elevenlabs)
  • X-TTS-Fallback: Whether fallback was used (true/false)
  • X-TTS-Style: Detected style if context-aware is enabled

Technical Details

Store Location

Settings are managed by a Zustand store with persistence:

apps/web-app/src/stores/voiceSettingsStore.ts

Component Locations

  • Settings UI: apps/web-app/src/components/voice/VoiceModeSettings.tsx
  • Enhanced Settings: apps/web-app/src/components/voice/VoiceSettingsEnhanced.tsx
  • Calibration Dialog: apps/web-app/src/components/voice/CalibrationDialog.tsx

Phase 9 Offline/Network Files

  • Network Monitor: apps/web-app/src/lib/offline/networkMonitor.ts
  • WebRTC VAD: apps/web-app/src/lib/offline/webrtcVAD.ts
  • Offline Types: apps/web-app/src/lib/offline/types.ts
  • Network Status Hook: apps/web-app/src/hooks/useNetworkStatus.ts
  • Offline VAD Hook: apps/web-app/src/hooks/useOfflineVAD.ts

Backend Files (New)

  • Model: services/api-gateway/app/models/user_voice_preferences.py
  • Style Detector: services/api-gateway/app/services/voice_style_detector.py
  • API Endpoints: services/api-gateway/app/api/voice.py (preferences section)
  • Schemas: services/api-gateway/app/api/voice_schemas/schemas.py

Frontend Sync Hook (New)

apps/web-app/src/hooks/useVoicePreferencesSync.ts

Handles loading/saving preferences to backend with debouncing.

Integration Points

  • VoiceModePanel.tsx - Displays settings button and uses store values
  • MessageInput.tsx - Reads autoStartOnOpen for auto-open behavior
  • useVoicePreferencesSync.ts - Backend sync on auth and setting changes

Advanced: Voice Mode Pipeline

Settings are not just UI preferences - they propagate into real-time voice sessions:

  • Voice/Language: Sent to /api/voice/realtime-session and used by OpenAI Realtime API
  • VAD Sensitivity: Mapped to server-side VAD threshold (0→insensitive, 100→sensitive)

For comprehensive pipeline documentation including backend integration, WebSocket connections, and metrics, see VOICE_MODE_PIPELINE.md.
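
To make the propagation concrete, here is a sketch of how store values could be mapped onto the realtime-session request documented in the pipeline section above (the helper function itself is illustrative, not the app's actual call site):

async function createRealtimeSession() {
  const { voice, language, vadSensitivity } = useVoiceSettingsStore.getState();
  const response = await fetch("/api/voice/realtime-session", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      voice,                           // e.g. "alloy"
      language,                        // e.g. "en"
      vad_sensitivity: vadSensitivity, // 0-100, mapped to a VAD threshold server-side
    }),
  });
  return response.json(); // RealtimeSessionResponse (see the pipeline section)
}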


Development: Running Tests

Run the voice settings test suites individually to avoid memory issues:

cd apps/web-app

# Unit tests for voice settings store (core)
npx vitest run src/stores/__tests__/voiceSettingsStore.test.ts --reporter=dot

# Unit tests for voice settings store (Phase 7-10)
npx vitest run src/stores/__tests__/voiceSettingsStore-phase7-10.test.ts --reporter=dot

# Unit tests for network monitor
npx vitest run src/lib/offline/__tests__/networkMonitor.test.ts --reporter=dot

# Component tests for VoiceModeSettings
npx vitest run src/components/voice/__tests__/VoiceModeSettings.test.tsx --reporter=dot

# Integration tests for MessageInput voice settings
npx vitest run src/components/chat/__tests__/MessageInput-voice-settings.test.tsx --reporter=dot

Test Coverage

The test suites cover:

voiceSettingsStore.test.ts (17 tests)

  • Default values verification
  • All setter functions (voice, language, sensitivity, toggles)
  • VAD sensitivity clamping (0-100 range)
  • Reset functionality
  • LocalStorage persistence

voiceSettingsStore-phase7-10.test.ts (41 tests)

  • Phase 7: Multilingual settings (accent profile, auto-detection, confidence)
  • Phase 8: Calibration settings (VAD calibrated, dates, thresholds)
  • Phase 9: Offline mode settings (fallback, prefer offline VAD, TTS cache)
  • Phase 10: Conversation intelligence (sentiment, discourse, recommendations)
  • Privacy settings (transcript history, anonymous analytics)
  • Persistence tests for all Phase 7-10 settings
  • Reset tests verifying all defaults

networkMonitor.test.ts (13 tests)

  • Initial state detection (online/offline)
  • Health check latency measurement
  • Quality computation from latency thresholds
  • Consecutive failure handling before marking unhealthy
  • Subscription/unsubscription for status changes
  • Custom configuration (latency thresholds, health check URL)
  • Offline detection via navigator.onLine

VoiceModeSettings.test.tsx (25 tests)

  • Modal visibility (isOpen prop)
  • Current settings display
  • Settings updates via UI interactions
  • Reset with confirmation
  • Close behavior (Done, X, backdrop)
  • Accessibility (labels, ARIA attributes)

MessageInput-voice-settings.test.tsx (12 tests)

  • Auto-open via store setting (autoStartOnOpen)
  • Auto-open via prop (autoOpenRealtimeVoice)
  • Combined settings behavior
  • Voice/language display in panel header
  • Status hints visibility toggle

Total: 108+ tests for voice settings and related functionality.

Notes

  • Tests mock useRealtimeVoiceSession and WaveformVisualizer to avoid browser API dependencies
  • Run tests individually rather than the full suite to prevent memory issues
  • All tests use Vitest + React Testing Library
  • Phase 7-10 tests also mock fetch and performance.now for network monitoring