# Voice Mode Pipeline

> **Status**: Production-ready
> **Last Updated**: 2025-12-03

This document describes the unified Voice Mode pipeline architecture, data flow, metrics, and testing strategy. It serves as the canonical reference for developers working on real-time voice features.

## Voice Pipeline Modes

VoiceAssist supports **two voice pipeline modes**:

| Mode                             | Description                    | Best For                                       |
| -------------------------------- | ------------------------------ | ---------------------------------------------- |
| **Thinker-Talker** (Recommended) | Local STT → LLM → TTS pipeline | Full tool support, unified context, custom TTS |
| **OpenAI Realtime** (Legacy)     | Direct OpenAI Realtime API     | Quick setup, minimal backend changes           |

### Thinker-Talker Pipeline (Primary)

The Thinker-Talker pipeline is the recommended approach, providing:

- **Unified conversation context** between voice and chat modes
- **Full tool/RAG support** in voice interactions
- **Custom TTS** via ElevenLabs with premium voices
- **Lower cost** per interaction

**Documentation:** [THINKER_TALKER_PIPELINE.md](THINKER_TALKER_PIPELINE.md)

```
[Audio] → [Deepgram STT] → [GPT-4o Thinker] → [ElevenLabs TTS] → [Audio Out]
                │                 │                  │
           Transcripts        Tool Calls        Audio Chunks
                │                 │                  │
                └───────── WebSocket Handler ────────┘
```
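
As a rough sketch of how these stages compose (not the actual `ThinkerService`/`TalkerService` code, which lives in the backend Python services; all interfaces and names below are hypothetical), a sentence-chunked turn could look like this:

```typescript
// Hypothetical stage interfaces; the real implementations are the backend
// ThinkerService / TalkerService / SentenceChunker listed below.
interface SttStage {
  transcribe(audio: ArrayBuffer): Promise<string>;
}
interface ThinkerStage {
  respond(transcript: string): AsyncIterable<string>; // streamed tokens
}
interface TalkerStage {
  synthesize(sentence: string): Promise<ArrayBuffer>; // audio chunk
}

// Minimal orchestration: STT → Thinker → sentence chunking → TTS → audio out.
async function runThinkerTalkerTurn(
  audioIn: ArrayBuffer,
  stt: SttStage,
  thinker: ThinkerStage,
  talker: TalkerStage,
  emitAudio: (chunk: ArrayBuffer) => void,
): Promise<void> {
  const transcript = await stt.transcribe(audioIn);

  // Accumulate streamed tokens and flush complete sentences to TTS,
  // so audio playback can start before the full reply is generated.
  let buffer = "";
  for await (const token of thinker.respond(transcript)) {
    buffer += token;
    const match = buffer.match(/^(.+?[.!?])\s+(.*)$/s);
    if (match) {
      emitAudio(await talker.synthesize(match[1]));
      buffer = match[2];
    }
  }
  if (buffer.trim()) {
    emitAudio(await talker.synthesize(buffer));
  }
}
```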

### OpenAI Realtime API (Legacy)

The original implementation using OpenAI's Realtime API directly. Still supported for backward compatibility.

---

## Implementation Status

### Thinker-Talker Components

| Component             | Status   | Location                                                        |
| --------------------- | -------- | --------------------------------------------------------------- |
| ThinkerService        | **Live** | `app/services/thinker_service.py`                               |
| TalkerService         | **Live** | `app/services/talker_service.py`                                |
| VoicePipelineService  | **Live** | `app/services/voice_pipeline_service.py`                        |
| T/T WebSocket Handler | **Live** | `app/services/thinker_talker_websocket_handler.py`              |
| SentenceChunker       | **Live** | `app/services/sentence_chunker.py`                              |
| Frontend T/T hook     | **Live** | `apps/web-app/src/hooks/useThinkerTalkerSession.ts`             |
| T/T Audio Playback    | **Live** | `apps/web-app/src/hooks/useTTAudioPlayback.ts`                  |
| T/T Voice Panel       | **Live** | `apps/web-app/src/components/voice/ThinkerTalkerVoicePanel.tsx` |

### OpenAI Realtime Components (Legacy)

| Component                  | Status      | Location                                               |
| -------------------------- | ----------- | ------------------------------------------------------ |
| Backend session endpoint   | **Live**    | `services/api-gateway/app/api/voice.py`                |
| Ephemeral token generation | **Live**    | `app/services/realtime_voice_service.py`               |
| Voice metrics endpoint     | **Live**    | `POST /api/voice/metrics`                              |
| Frontend voice hook        | **Live**    | `apps/web-app/src/hooks/useRealtimeVoiceSession.ts`    |
| Voice settings store       | **Live**    | `apps/web-app/src/stores/voiceSettingsStore.ts`        |
| Voice UI panel             | **Live**    | `apps/web-app/src/components/voice/VoiceModePanel.tsx` |
| Chat timeline integration  | **Live**    | Voice messages appear in chat                          |
| Barge-in support           | **Live**    | `response.cancel` + `onSpeechStarted` callback         |
| Audio overlap prevention   | **Live**    | Response ID tracking + `isProcessingResponseRef`       |
| E2E test suite             | **Passing** | 95 tests across unit/integration/E2E                   |

> **Full status:** See [Implementation Status](overview/IMPLEMENTATION_STATUS.md) for all components.

## Overview

Voice Mode enables real-time voice conversations with the AI assistant using OpenAI's Realtime API.

The pipeline handles:

- **Ephemeral session authentication** (no raw API keys in browser)
- **WebSocket-based bidirectional voice streaming**
- **Voice activity detection (VAD)** with user-configurable sensitivity
- **User settings propagation** (voice, language, VAD threshold)
- **Chat timeline integration** (voice messages appear in chat)
- **Connection state management** with automatic reconnection
- **Barge-in support** (interrupt AI while speaking)
- **Audio playback management** (prevent overlapping responses)
- **Metrics tracking** for observability

## Architecture Diagram

```
FRONTEND
┌─────────────────────┐     ┌─────────────────────┐     ┌───────────────┐
│ VoiceModePanel      │────▶│ useRealtimeVoice    │────▶│ voiceSettings │
│ (UI Component)      │     │ Session (Hook)      │     │ Store         │
│ - Start/Stop        │     │ - connect()         │     │ - voice       │
│ - Status display    │     │ - disconnect()      │     │ - language    │
│ - Metrics logging   │     │ - sendMessage()     │     │ - vadSens     │
└─────────┬───────────┘     └──────────┬──────────┘     └───────────────┘
          │                            │
          │                            │ onUserMessage()/onAssistantMessage()
          ▼                            ▼
┌─────────────────────┐     ┌─────────────────────┐
│ MessageInput        │     │ ChatPage            │
│ - Voice toggle      │────▶│ - useChatSession    │
│ - Panel container   │     │ - addMessage()      │
└─────────────────────┘     └─────────────────────┘
          │
          │ POST /api/voice/realtime-session
          ▼
BACKEND
┌─────────────────────┐     ┌─────────────────────┐
│ voice.py            │────▶│ realtime_voice_     │
│ (FastAPI Router)    │     │ service.py          │
│ - /realtime-session │     │ - generate_session  │
│ - Timing logs       │     │ - ephemeral token   │
└─────────────────────┘     └──────────┬──────────┘
                                       │ POST /v1/realtime/sessions
                                       ▼
                            ┌─────────────────────┐
                            │ OpenAI API          │
                            │ - Ephemeral token   │
                            │ - Voice config      │
                            └─────────────────────┘
          │
          │ WebSocket wss://api.openai.com/v1/realtime
          ▼
OPENAI REALTIME API
- Server-side VAD (voice activity detection)
- Bidirectional audio streaming (PCM16)
- Real-time transcription (Whisper)
- GPT-4o responses with audio synthesis
```

## Backend: `/api/voice/realtime-session`

**Location**: `services/api-gateway/app/api/voice.py`

### Request

```typescript
interface RealtimeSessionRequest {
  conversation_id?: string; // Optional conversation context
  voice?: string; // "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer"
  language?: string; // "en" | "es" | "fr" | "de" | "it" | "pt"
  vad_sensitivity?: number; // 0-100 (maps to threshold: 0→0.9, 100→0.1)
}
```

### Response

```typescript
interface RealtimeSessionResponse {
  url: string; // WebSocket URL: "wss://api.openai.com/v1/realtime"
  model: string; // "gpt-4o-realtime-preview"
  session_id: string; // Unique session identifier
  expires_at: number; // Unix timestamp (epoch seconds)
  conversation_id: string | null;
  auth: {
    type: "ephemeral_token";
    token: string; // Ephemeral token (ek_...), NOT raw API key
    expires_at: number; // Token expiry (5 minutes)
  };
  voice_config: {
    voice: string; // Selected voice
    modalities: ["text", "audio"];
    input_audio_format: "pcm16";
    output_audio_format: "pcm16";
    input_audio_transcription: { model: "whisper-1" };
    turn_detection: {
      type: "server_vad";
      threshold: number; // 0.1 (sensitive) to 0.9 (insensitive)
      prefix_padding_ms: number;
      silence_duration_ms: number;
    };
  };
}
```
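
For orientation, a minimal sketch of how the frontend might request a session and hand the ephemeral token to a WebSocket, using the interfaces above (illustrative only; the production logic lives in `useRealtimeVoiceSession.ts`, and auth headers and error handling are omitted):

```typescript
// Sketch only: request a session from the backend, then connect with the
// ephemeral token (never the raw OpenAI API key).
async function createRealtimeSession(
  body: RealtimeSessionRequest,
): Promise<RealtimeSessionResponse> {
  const res = await fetch("/api/voice/realtime-session", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) {
    throw new Error(`Session request failed: ${res.status}`);
  }
  return res.json();
}

const session = await createRealtimeSession({ voice: "alloy", vad_sensitivity: 50 });
const ws = new WebSocket(session.url, [
  "realtime",
  "openai-beta.realtime-v1",
  `openai-insecure-api-key.${session.auth.token}`,
]);
```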

### VAD Sensitivity Mapping

The frontend uses a 0-100 scale for user-friendly VAD sensitivity:

| User Setting | VAD Threshold | Behavior                             |
| ------------ | ------------- | ------------------------------------ |
| 0 (Low)      | 0.9           | Requires loud/clear speech           |
| 50 (Medium)  | 0.5           | Balanced detection                   |
| 100 (High)   | 0.1           | Very sensitive, picks up soft speech |

**Formula**: `threshold = 0.9 - (vad_sensitivity / 100 * 0.8)`
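
Expressed as code (illustrative only; the actual conversion happens in the backend session service), the mapping is a straight linear interpolation:

```typescript
// Map the user-facing 0-100 sensitivity to OpenAI's server_vad threshold.
// 0 → 0.9 (least sensitive), 50 → 0.5, 100 → 0.1 (most sensitive).
function vadSensitivityToThreshold(vadSensitivity: number): number {
  const clamped = Math.min(100, Math.max(0, vadSensitivity));
  return 0.9 - (clamped / 100) * 0.8;
}

vadSensitivityToThreshold(75); // 0.3
```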

### Observability

Backend logs timing and context for each session request:

```python
# Request logging
logger.info(
    f"Creating Realtime session for user {current_user.id}",
    extra={
        "user_id": current_user.id,
        "conversation_id": request.conversation_id,
        "voice": request.voice,
        "language": request.language,
        "vad_sensitivity": request.vad_sensitivity,
    },
)

# Success logging with duration
duration_ms = int((time.monotonic() - start_time) * 1000)
logger.info(
    f"Realtime session created for user {current_user.id}",
    extra={
        "user_id": current_user.id,
        "session_id": config["session_id"],
        "voice": config.get("voice_config", {}).get("voice"),
        "duration_ms": duration_ms,
    },
)
```

## Frontend Hook: `useRealtimeVoiceSession`

**Location**: `apps/web-app/src/hooks/useRealtimeVoiceSession.ts`

### Usage

```typescript
const {
  status, // 'disconnected' | 'connecting' | 'connected' | 'reconnecting' | 'failed' | 'expired' | 'error'
  transcript, // Current transcript text
  isSpeaking, // Is the AI currently speaking?
  isConnected, // Derived: status === 'connected'
  isConnecting, // Derived: status === 'connecting' || 'reconnecting'
  canSend, // Can send messages?
  error, // Error message if any
  metrics, // VoiceMetrics object
  connect, // () => Promise - start session
  disconnect, // () => void - end session
  sendMessage, // (text: string) => void - send text message
} = useRealtimeVoiceSession({
  conversationId,
  voice, // From voiceSettingsStore
  language, // From voiceSettingsStore
  vadSensitivity, // From voiceSettingsStore (0-100)
  onConnected, // Callback when connected
  onDisconnected, // Callback when disconnected
  onError, // Callback on error
  onUserMessage, // Callback with user transcript
  onAssistantMessage, // Callback with AI response
  onMetricsUpdate, // Callback when metrics change
});
```

### Connection States

```
disconnected ──▶ connecting ──▶ connected
                     │              │
                     ▼              ▼
                  failed ◀──── reconnecting
                     │              │
                     ▼              ▼
                 expired ◀────── error
```

| State          | Description                                       |
| -------------- | ------------------------------------------------- |
| `disconnected` | Initial/idle state                                |
| `connecting`   | Fetching session config, establishing WebSocket   |
| `connected`    | Active voice session                              |
| `reconnecting` | Auto-reconnect after temporary disconnect         |
| `failed`       | Connection failed (backend error, network issue)  |
| `expired`      | Session token expired (needs manual restart)      |
| `error`        | General error state                               |

### WebSocket Connection

The hook connects using three WebSocket subprotocols for authentication:

```typescript
const ws = new WebSocket(url, ["realtime", "openai-beta.realtime-v1", `openai-insecure-api-key.${ephemeralToken}`]);
```

## Voice Settings Store

**Location**: `apps/web-app/src/stores/voiceSettingsStore.ts`

### Schema

```typescript
interface VoiceSettings {
  voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer";
  language: "en" | "es" | "fr" | "de" | "it" | "pt";
  vadSensitivity: number; // 0-100
  autoStartOnOpen: boolean; // Auto-start voice when panel opens
  showStatusHints: boolean; // Show helper text in UI
}
```

### Persistence

Settings are persisted to `localStorage` under key `voiceassist-voice-settings` using Zustand's persist middleware.

### Defaults

| Setting         | Default |
| --------------- | ------- |
| voice           | "alloy" |
| language        | "en"    |
| vadSensitivity  | 50      |
| autoStartOnOpen | false   |
| showStatusHints | true    |
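
A minimal sketch of what the store setup looks like with Zustand's persist middleware, reusing the `VoiceSettings` interface above (the actual actions and defaults live in `voiceSettingsStore.ts`; the setter names here are hypothetical):

```typescript
import { create } from "zustand";
import { persist } from "zustand/middleware";

interface VoiceSettingsState extends VoiceSettings {
  setVoice: (voice: VoiceSettings["voice"]) => void;
  setVadSensitivity: (value: number) => void;
}

// Persisted under the "voiceassist-voice-settings" localStorage key.
export const useVoiceSettingsStore = create<VoiceSettingsState>()(
  persist(
    (set) => ({
      voice: "alloy",
      language: "en",
      vadSensitivity: 50,
      autoStartOnOpen: false,
      showStatusHints: true,
      setVoice: (voice) => set({ voice }),
      setVadSensitivity: (vadSensitivity) => set({ vadSensitivity }),
    }),
    { name: "voiceassist-voice-settings" },
  ),
);
```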

## Chat Integration

**Location**: `apps/web-app/src/pages/ChatPage.tsx`

### Message Flow

1. **User speaks** → VoiceModePanel receives final transcript
2. VoiceModePanel calls `onUserMessage(transcript)`
3. ChatPage receives callback, calls `useChatSession.addMessage()`
4. Message added to timeline with `metadata: { source: "voice" }`

```typescript
// ChatPage.tsx
const handleVoiceUserMessage = (content: string) => {
  addMessage({
    role: "user",
    content,
    metadata: { source: "voice" },
  });
};

const handleVoiceAssistantMessage = (content: string) => {
  addMessage({
    role: "assistant",
    content,
    metadata: { source: "voice" },
  });
};
```

### Message Structure

```typescript
interface VoiceMessage {
  id: string; // "voice-{timestamp}-{random}"
  role: "user" | "assistant";
  content: string;
  timestamp: number;
  metadata: {
    source: "voice"; // Distinguishes from text messages
  };
}
```

## Barge-in & Audio Playback

**Location**: `apps/web-app/src/components/voice/VoiceModePanel.tsx`, `apps/web-app/src/hooks/useRealtimeVoiceSession.ts`

### Barge-in Flow

When the user starts speaking while the AI is responding, the system immediately:

1. **Detects speech start** via OpenAI's `input_audio_buffer.speech_started` event
2. **Cancels active response** by sending `response.cancel` to OpenAI
3. **Stops audio playback** via `onSpeechStarted` callback
4. **Clears pending responses** to prevent stale audio from playing

```
User speaks → speech_started event → response.cancel → stopCurrentAudio()
                                                               ↓
                                           Audio stops
                                           Queue cleared
                                           Response ID incremented
```

### Response Cancellation

**Location**: `useRealtimeVoiceSession.ts` - `handleRealtimeMessage`

```typescript
case "input_audio_buffer.speech_started":
  setIsSpeaking(true);
  setPartialTranscript("");

  // Barge-in: Cancel any active response when user starts speaking
  if (activeResponseIdRef.current && wsRef.current?.readyState === WebSocket.OPEN) {
    wsRef.current.send(JSON.stringify({ type: "response.cancel" }));
    activeResponseIdRef.current = null;
  }

  // Notify parent to stop audio playback
  options.onSpeechStarted?.();
  break;
```

### Audio Playback Management

**Location**: `VoiceModePanel.tsx`

The panel tracks audio playback state to prevent overlapping responses:

```typescript
// Track currently playing Audio element
const currentAudioRef = useRef<HTMLAudioElement | null>(null);

// Prevent overlapping response processing
const isProcessingResponseRef = useRef(false);

// Response ID to invalidate stale responses after barge-in
const currentResponseIdRef = useRef(0);
```

**Stop current audio function:**

```typescript
const stopCurrentAudio = useCallback(() => {
  if (currentAudioRef.current) {
    currentAudioRef.current.pause();
    currentAudioRef.current.currentTime = 0;
    if (currentAudioRef.current.src.startsWith("blob:")) {
      URL.revokeObjectURL(currentAudioRef.current.src);
    }
    currentAudioRef.current = null;
  }
  audioQueueRef.current = [];
  isPlayingRef.current = false;
  currentResponseIdRef.current++; // Invalidate pending responses
  isProcessingResponseRef.current = false;
}, []);
```

### Overlap Prevention

When a relay result arrives, the handler checks:

1. **Already processing?** Skip if `isProcessingResponseRef.current === true`
2. **Response ID valid?** Skip playback if ID changed (barge-in occurred)

```typescript
onRelayResult: async ({ answer }) => {
  if (answer) {
    // Prevent overlapping responses
    if (isProcessingResponseRef.current) {
      console.log("[VoiceModePanel] Skipping response - already processing another");
      return;
    }

    const responseId = ++currentResponseIdRef.current;
    isProcessingResponseRef.current = true;

    // ... synthesis and playback ...

    // Check if response is still valid before playback
    if (responseId !== currentResponseIdRef.current) {
      console.log("[VoiceModePanel] Response cancelled - skipping playback");
      return;
    }
  }
};
```

### Error Handling

Benign cancellation errors (e.g., "Cancellation failed: no active response found") are handled gracefully:

```typescript
case "error": {
  const errorMessage = message.error?.message || "Realtime API error";

  // Ignore benign cancellation errors
  if (
    errorMessage.includes("Cancellation failed") ||
    errorMessage.includes("no active response")
  ) {
    voiceLog.debug(`Ignoring benign error: ${errorMessage}`);
    break;
  }

  handleError(new Error(errorMessage));
  break;
}
```

## Metrics

**Location**: `apps/web-app/src/hooks/useRealtimeVoiceSession.ts`

### VoiceMetrics Interface

```typescript
interface VoiceMetrics {
  connectionTimeMs: number | null; // Time to establish connection
  timeToFirstTranscriptMs: number | null; // Time to first user transcript
  lastSttLatencyMs: number | null; // Speech-to-text latency
  lastResponseLatencyMs: number | null; // AI response latency
  sessionDurationMs: number | null; // Total session duration
  userTranscriptCount: number; // Number of user turns
  aiResponseCount: number; // Number of AI turns
  reconnectCount: number; // Number of reconnections
  sessionStartedAt: number | null; // Session start timestamp
}
```

### Frontend Logging

VoiceModePanel logs key metrics to console:

```typescript
// Connection time
console.log(`[VoiceModePanel] voice_session_connect_ms=${metrics.connectionTimeMs}`);

// STT latency
console.log(`[VoiceModePanel] voice_stt_latency_ms=${metrics.lastSttLatencyMs}`);

// Response latency
console.log(`[VoiceModePanel] voice_first_reply_ms=${metrics.lastResponseLatencyMs}`);

// Session duration
console.log(`[VoiceModePanel] voice_session_duration_ms=${metrics.sessionDurationMs}`);
```

### Consuming Metrics

Developers can hook into metrics via the `onMetricsUpdate` callback:

```typescript
useRealtimeVoiceSession({
  onMetricsUpdate: (metrics) => {
    // Send to telemetry service
    analytics.track("voice_session_metrics", {
      connection_ms: metrics.connectionTimeMs,
      stt_latency_ms: metrics.lastSttLatencyMs,
      response_latency_ms: metrics.lastResponseLatencyMs,
      duration_ms: metrics.sessionDurationMs,
    });
  },
});
```

### Metrics Export to Backend

Metrics can be automatically exported to the backend for aggregation and alerting.

**Backend Endpoint**: `POST /api/voice/metrics`

**Location**: `services/api-gateway/app/api/voice.py`

#### Request Schema

```typescript
interface VoiceMetricsPayload {
  conversation_id?: string;
  connection_time_ms?: number;
  time_to_first_transcript_ms?: number;
  last_stt_latency_ms?: number;
  last_response_latency_ms?: number;
  session_duration_ms?: number;
  user_transcript_count: number;
  ai_response_count: number;
  reconnect_count: number;
  session_started_at?: number;
}
```

#### Response

```typescript
interface VoiceMetricsResponse {
  status: "ok";
}
```

#### Privacy

**No PHI or transcript content is sent.** Only timing metrics and counts.

#### Frontend Configuration

Metrics export is controlled by environment variables:

- **Production** (`import.meta.env.PROD`): Metrics sent automatically
- **Development**: Set `VITE_ENABLE_VOICE_METRICS=true` to enable

The export uses `navigator.sendBeacon()` for reliability (survives page navigation).
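
A sketch of what that export might look like (illustrative; the actual export lives in the hook/panel code, and the payload mirrors `VoiceMetricsPayload` above):

```typescript
// Fire-and-forget export of timing metrics; sendBeacon queues the request
// even if the page is being unloaded.
function exportVoiceMetrics(payload: VoiceMetricsPayload): void {
  const enabled =
    import.meta.env.PROD || import.meta.env.VITE_ENABLE_VOICE_METRICS === "true";
  if (!enabled) return;

  const blob = new Blob([JSON.stringify(payload)], { type: "application/json" });
  navigator.sendBeacon("/api/voice/metrics", blob);
}
```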

#### Backend Logging

Metrics are logged with user context:

```python
logger.info(
    "VoiceMetrics received",
    extra={
        "user_id": current_user.id,
        "conversation_id": payload.conversation_id,
        "connection_time_ms": payload.connection_time_ms,
        "session_duration_ms": payload.session_duration_ms,
        ...
    },
)
```

#### Testing

```bash
# Backend
cd /home/asimo/VoiceAssist/services/api-gateway
source venv/bin/activate && export PYTHONPATH=.
python -m pytest tests/integration/test_voice_metrics.py -v
```

## Security

### Ephemeral Token Architecture

**CRITICAL**: The browser NEVER receives the raw OpenAI API key.

1. Backend holds `OPENAI_API_KEY` securely
2. Frontend requests session via `/api/voice/realtime-session`
3. Backend creates ephemeral token via OpenAI `/v1/realtime/sessions`
4. Ephemeral token returned to frontend (valid ~5 minutes)
5. Frontend connects WebSocket using ephemeral token

### Token Refresh

The hook monitors `session.expires_at` and can trigger refresh before expiry. If the token expires mid-session, status transitions to `expired`.
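
As a rough sketch of that expiry handling (hypothetical helper names; the real logic is inside `useRealtimeVoiceSession.ts`):

```typescript
// Schedule a refresh shortly before the ephemeral token expires.
function scheduleTokenRefresh(
  expiresAt: number, // epoch seconds, as returned by /api/voice/realtime-session
  refresh: () => Promise<void>,
  onExpired: () => void,
  marginMs = 30_000,
): ReturnType<typeof setTimeout> | null {
  const msUntilRefresh = expiresAt * 1000 - Date.now() - marginMs;
  if (msUntilRefresh <= 0) {
    onExpired(); // Already (or nearly) expired: surface the "expired" state
    return null;
  }
  // Refresh a bit before expiry; if refreshing fails, fall back to "expired".
  return setTimeout(() => {
    refresh().catch(onExpired);
  }, msUntilRefresh);
}
```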

## Testing

### Voice Pipeline Smoke Suite

Run these commands to validate the voice pipeline:

```bash
# 1. Backend tests (CI-safe, mocked)
cd /home/asimo/VoiceAssist/services/api-gateway
source venv/bin/activate
export PYTHONPATH=.
python -m pytest tests/integration/test_openai_config.py -v

# 2. Frontend unit tests (run individually to avoid OOM)
cd /home/asimo/VoiceAssist/apps/web-app
export NODE_OPTIONS="--max-old-space-size=768"
npx vitest run src/hooks/__tests__/useRealtimeVoiceSession.test.ts --reporter=dot
npx vitest run src/hooks/__tests__/useChatSession-voice-integration.test.ts --reporter=dot
npx vitest run src/stores/__tests__/voiceSettingsStore.test.ts --reporter=dot
npx vitest run src/components/voice/__tests__/VoiceModeSettings.test.tsx --reporter=dot
npx vitest run src/components/chat/__tests__/MessageInput-voice-settings.test.tsx --reporter=dot

# 3. E2E tests (Chromium, mocked backend)
cd /home/asimo/VoiceAssist
npx playwright test \
  e2e/voice-mode-navigation.spec.ts \
  e2e/voice-mode-session-smoke.spec.ts \
  e2e/voice-mode-voice-chat-integration.spec.ts \
  --project=chromium --reporter=list
```

### Test Coverage Summary

| Test File                                 | Tests | Coverage                          |
| ----------------------------------------- | ----- | --------------------------------- |
| useRealtimeVoiceSession.test.ts           | 22    | Hook lifecycle, states, metrics   |
| useChatSession-voice-integration.test.ts  | 8     | Message structure validation      |
| voiceSettingsStore.test.ts                | 17    | Store actions, persistence        |
| VoiceModeSettings.test.tsx                | 25    | Component rendering, interactions |
| MessageInput-voice-settings.test.tsx      | 12    | Integration with chat input       |
| voice-mode-navigation.spec.ts             | 4     | E2E navigation flow               |
| voice-mode-session-smoke.spec.ts          | 3     | E2E session smoke (1 live gated)  |
| voice-mode-voice-chat-integration.spec.ts | 4     | E2E panel integration             |

**Total: 95 tests**

### Live Testing

To test with real OpenAI backend:

```bash
# Backend (requires OPENAI_API_KEY in .env)
LIVE_REALTIME_TESTS=1 python -m pytest tests/integration/test_openai_config.py -v

# E2E (requires running backend + valid API key)
LIVE_REALTIME_E2E=1 npx playwright test e2e/voice-mode-session-smoke.spec.ts
```

## File Reference

### Backend

| File                                                            | Purpose                            |
| --------------------------------------------------------------- | ---------------------------------- |
| `services/api-gateway/app/api/voice.py`                         | API routes, metrics, timing logs   |
| `services/api-gateway/app/services/realtime_voice_service.py`   | Session creation, token generation |
| `services/api-gateway/tests/integration/test_openai_config.py`  | Integration tests                  |
| `services/api-gateway/tests/integration/test_voice_metrics.py`  | Metrics endpoint tests             |

### Frontend

| File                                                      | Purpose                   |
| --------------------------------------------------------- | ------------------------- |
| `apps/web-app/src/hooks/useRealtimeVoiceSession.ts`       | Core hook                 |
| `apps/web-app/src/components/voice/VoiceModePanel.tsx`    | UI panel                  |
| `apps/web-app/src/components/voice/VoiceModeSettings.tsx` | Settings modal            |
| `apps/web-app/src/stores/voiceSettingsStore.ts`           | Settings store            |
| `apps/web-app/src/components/chat/MessageInput.tsx`       | Voice button integration  |
| `apps/web-app/src/pages/ChatPage.tsx`                     | Chat timeline integration |
| `apps/web-app/src/hooks/useChatSession.ts`                | addMessage() helper       |

### Tests

| File                                                                               | Purpose               |
| ---------------------------------------------------------------------------------- | --------------------- |
| `apps/web-app/src/hooks/__tests__/useRealtimeVoiceSession.test.ts`                 | Hook tests            |
| `apps/web-app/src/hooks/__tests__/useChatSession-voice-integration.test.ts`        | Chat integration      |
| `apps/web-app/src/stores/__tests__/voiceSettingsStore.test.ts`                     | Store tests           |
| `apps/web-app/src/components/voice/__tests__/VoiceModeSettings.test.tsx`           | Component tests       |
| `apps/web-app/src/components/chat/__tests__/MessageInput-voice-settings.test.tsx`  | Integration tests     |
| `e2e/voice-mode-navigation.spec.ts`                                                | E2E navigation        |
| `e2e/voice-mode-session-smoke.spec.ts`                                             | E2E smoke test        |
| `e2e/voice-mode-voice-chat-integration.spec.ts`                                    | E2E panel integration |

## Related Documentation

- [VOICE_MODE_ENHANCEMENT_10_PHASE.md](./VOICE_MODE_ENHANCEMENT_10_PHASE.md) - **10-phase enhancement plan (emotion, dictation, analytics)**
- [VOICE_MODE_SETTINGS_GUIDE.md](./VOICE_MODE_SETTINGS_GUIDE.md) - User settings configuration
- [TESTING_GUIDE.md](./TESTING_GUIDE.md) - E2E testing strategy and validation checklist

## Observability & Monitoring (Phase 3)

**Implemented:** 2025-12-02

The voice pipeline includes comprehensive observability features for production monitoring.

### Error Taxonomy (`voice_errors.py`)

Location: `services/api-gateway/app/core/voice_errors.py`

Structured error classification with 8 categories and 40+ error codes:

| Category   | Codes          | Description                    |
| ---------- | -------------- | ------------------------------ |
| CONNECTION | CONN_001-7     | WebSocket, network failures    |
| STT        | STT_001-7      | Speech-to-text errors          |
| TTS        | TTS_001-7      | Text-to-speech errors          |
| LLM        | LLM_001-6      | LLM processing errors          |
| AUDIO      | AUDIO_001-6    | Audio encoding/decoding errors |
| TIMEOUT    | TIMEOUT_001-7  | Various timeout conditions     |
| PROVIDER   | PROVIDER_001-6 | External provider errors       |
| INTERNAL   | INTERNAL_001-5 | Internal server errors         |

Each error code includes:

- Recoverability flag (can auto-retry)
- Retry configuration (delay, max attempts)
- User-friendly description

### Voice Metrics (`metrics.py`)

Location: `services/api-gateway/app/core/metrics.py`

Prometheus metrics for voice pipeline monitoring:

| Metric                                  | Type      | Labels                                | Description            |
| --------------------------------------- | --------- | ------------------------------------- | ---------------------- |
| `voice_errors_total`                    | Counter   | category, code, provider, recoverable | Total voice errors     |
| `voice_pipeline_stage_latency_seconds`  | Histogram | stage                                 | Per-stage latency      |
| `voice_ttfa_seconds`                    | Histogram | -                                     | Time to first audio    |
| `voice_active_sessions`                 | Gauge     | -                                     | Active voice sessions  |
| `voice_barge_in_total`                  | Counter   | -                                     | Barge-in events        |
| `voice_audio_chunks_total`              | Counter   | status                                | Audio chunks processed |

### Per-Stage Latency Tracking (`voice_timing.py`)

Location: `services/api-gateway/app/core/voice_timing.py`

Pipeline stages tracked:

- `audio_receive` - Time to receive audio from client
- `vad_process` - Voice activity detection time
- `stt_transcribe` - Speech-to-text latency
- `llm_process` - LLM inference time
- `tts_synthesize` - Text-to-speech synthesis
- `audio_send` - Time to send audio to client
- `ttfa` - Time to first audio (end-to-end)

Usage:

```python
from app.core.voice_timing import create_pipeline_timings, PipelineStage

timings = create_pipeline_timings(session_id="abc123")

with timings.time_stage(PipelineStage.STT_TRANSCRIBE):
    transcript = await stt_client.transcribe(audio)

timings.record_ttfa()  # When first audio byte ready
timings.finalize()  # When response complete
```

### SLO Alerts (`voice_slo_alerts.yml`)

Location: `infrastructure/observability/prometheus/rules/voice_slo_alerts.yml`

SLO targets with Prometheus alerting rules:

| SLO                  | Target  | Alert                           |
| -------------------- | ------- | ------------------------------- |
| TTFA P95             | < 200ms | VoiceTTFASLOViolation           |
| STT Latency P95      | < 300ms | VoiceSTTLatencySLOViolation     |
| TTS First Chunk P95  | < 200ms | VoiceTTSFirstChunkSLOViolation  |
| Connection Time P95  | < 500ms | VoiceConnectionTimeSLOViolation |
| Error Rate           | < 1%    | VoiceErrorRateHigh              |
| Session Success Rate | > 95%   | VoiceSessionSuccessRateLow      |

### Client Telemetry (`voiceTelemetry.ts`)

Location: `apps/web-app/src/lib/voiceTelemetry.ts`

Frontend telemetry with:

- **Network quality assessment** via Network Information API
- **Browser performance metrics** via Performance.memory API
- **Jitter estimation** for network quality
- **Batched reporting** (10s intervals)
- **Beacon API** for reliable delivery on page unload

```typescript
import { getVoiceTelemetry } from "@/lib/voiceTelemetry";

const telemetry = getVoiceTelemetry();
telemetry.startSession(sessionId);
telemetry.recordLatency("stt", 150);
telemetry.recordLatency("ttfa", 180);
telemetry.endSession();
```

### Voice Health Endpoint (`/health/voice`)

Location: `services/api-gateway/app/api/health.py`

Comprehensive voice subsystem health check:

```bash
curl https://assist.asimo.io/health/voice
```

Response:

```json
{
  "status": "healthy",
  "providers": {
    "openai": { "status": "up", "latency_ms": 120.5 },
    "elevenlabs": { "status": "up", "latency_ms": 85.2 },
    "deepgram": { "status": "up", "latency_ms": 95.8 }
  },
  "session_store": { "status": "up", "active_sessions": 5 },
  "metrics": { "active_sessions": 5 },
  "slo": { "ttfa_target_ms": 200, "error_rate_target": 0.01 }
}
```

### Debug Logging Configuration

Location: `services/api-gateway/app/core/logging.py`

Configurable voice log verbosity via `VOICE_LOG_LEVEL` environment variable:

| Level    | Content                                       |
| -------- | --------------------------------------------- |
| MINIMAL  | Errors only                                   |
| STANDARD | + Session lifecycle (start/end/state changes) |
| VERBOSE  | + All latency measurements                    |
| DEBUG    | + Audio frame details, chunk timing           |

Usage:

```python
from app.core.logging import get_voice_logger

voice_log = get_voice_logger(__name__)
voice_log.session_start(session_id="abc123", provider="thinker_talker")
voice_log.latency("stt_transcribe", 150.5, session_id="abc123")
voice_log.error("voice_connection_failed", error_code="CONN_001")
```

---

## Phase 9: Offline & Network Fallback

**Implemented:** 2025-12-03

The voice pipeline now includes comprehensive offline support and network-aware fallback mechanisms.

### Network Monitoring (`networkMonitor.ts`)

Location: `apps/web-app/src/lib/offline/networkMonitor.ts`

Continuously monitors network health using multiple signals:

- **Navigator.onLine**: Basic online/offline detection
- **Network Information API**: Connection type, downlink speed, RTT
- **Health Check Pinging**: Periodic `/api/health` pings for latency measurement

```typescript
import { getNetworkMonitor } from "@/lib/offline/networkMonitor";

const monitor = getNetworkMonitor();
monitor.subscribe((status) => {
  console.log(`Network quality: ${status.quality}`);
  console.log(`Health check latency: ${status.healthCheckLatencyMs}ms`);
});
```

#### Network Quality Levels

| Quality   | Latency     | isHealthy | Action                     |
| --------- | ----------- | --------- | -------------------------- |
| Excellent | < 100ms     | true      | Full cloud processing      |
| Good      | < 200ms     | true      | Full cloud processing      |
| Moderate  | < 500ms     | true      | Cloud with quality warning |
| Poor      | ≥ 500ms     | variable  | Consider offline fallback  |
| Offline   | Unreachable | false     | Automatic offline fallback |

#### Configuration

```typescript
const monitor = createNetworkMonitor({
  healthCheckUrl: "/api/health",
  healthCheckIntervalMs: 30000, // 30 seconds
  healthCheckTimeoutMs: 5000, // 5 seconds
  goodLatencyThresholdMs: 100,
  moderateLatencyThresholdMs: 200,
  poorLatencyThresholdMs: 500,
  failuresBeforeUnhealthy: 3,
});
```
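
Reading the quality table and thresholds together, the classification boils down to a simple mapping. The sketch below is illustrative only (the real logic lives in `networkMonitor.ts`, and the `NetworkQuality` type name is assumed), using the default thresholds shown above:

```typescript
type NetworkQuality = "offline" | "poor" | "moderate" | "good" | "excellent";

// Classify measured health-check latency using the configured thresholds.
// A null latency means no successful health check yet; treat as offline here.
function classifyQuality(
  isOnline: boolean,
  healthCheckLatencyMs: number | null,
): NetworkQuality {
  if (!isOnline || healthCheckLatencyMs === null) return "offline";
  if (healthCheckLatencyMs < 100) return "excellent";
  if (healthCheckLatencyMs < 200) return "good";
  if (healthCheckLatencyMs < 500) return "moderate";
  return "poor";
}
```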

### useNetworkStatus Hook

Location: `apps/web-app/src/hooks/useNetworkStatus.ts`

React hook providing network status with computed properties:

```typescript
const {
  isOnline,
  isHealthy,
  quality,
  healthCheckLatencyMs,
  effectiveType, // "4g", "3g", "2g", "slow-2g"
  downlink, // Mbps
  rtt, // Round-trip time ms
  isSuitableForVoice, // quality >= "good" && isHealthy
  shouldUseOffline, // !isOnline || !isHealthy || quality < "moderate"
  qualityScore, // 0-4 (offline=0, poor=1, moderate=2, good=3, excellent=4)
  checkNow, // Force immediate health check
} = useNetworkStatus();
```

### Offline VAD with Network Fallback

Location: `apps/web-app/src/hooks/useOfflineVAD.ts`

The `useOfflineVADWithFallback` hook automatically switches between network and offline VAD:

```typescript
const {
  isListening,
  isSpeaking,
  currentEnergy,
  isUsingOfflineVAD, // Currently using offline mode?
  networkAvailable,
  networkQuality,
  modeReason, // "network_vad" | "network_unavailable" | "poor_quality" | "forced_offline"
  forceOffline, // Manually switch to offline
  forceNetwork, // Manually switch to network (if available)
  startListening,
  stopListening,
} = useOfflineVADWithFallback({
  useNetworkMonitor: true,
  minNetworkQuality: "moderate",
  networkRecoveryDelayMs: 2000, // Prevent flapping
  onFallbackToOffline: () => console.log("Switched to offline VAD"),
  onReturnToNetwork: () => console.log("Returned to network VAD"),
});
```

### Fallback Decision Flow

```
┌────────────────────┐
│  Network Monitor   │
│   Health Check     │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐    NO     ┌────────────────────┐
│     Is Online?     │──────────▶│  Use Offline VAD   │
└─────────┬──────────┘           └────────────────────┘
          │ YES
          ▼
┌────────────────────┐    NO     ┌────────────────────┐
│    Is Healthy?     │──────────▶│  Use Offline VAD   │
│  (3+ checks pass)  │           │ reason: unhealthy  │
└─────────┬──────────┘           └────────────────────┘
          │ YES
          ▼
┌────────────────────┐    NO     ┌────────────────────┐
│   Quality ≥ Min?   │──────────▶│  Use Offline VAD   │
│  (e.g., moderate)  │           │ reason: poor_qual  │
└─────────┬──────────┘           └────────────────────┘
          │ YES
          ▼
┌────────────────────┐
│  Use Network VAD   │
│ (cloud processing) │
└────────────────────┘
```
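
The same decision expressed as code (a sketch under the assumptions above; the type and reason names are illustrative, not the hook's exact API):

```typescript
type FallbackMode =
  | { mode: "offline"; reason: "network_unavailable" | "unhealthy" | "poor_quality" }
  | { mode: "network" };

const qualityRank = { offline: 0, poor: 1, moderate: 2, good: 3, excellent: 4 } as const;

// Decide between network VAD and offline VAD from the monitor's status,
// mirroring the decision flow above.
function decideVadMode(
  isOnline: boolean,
  isHealthy: boolean,
  quality: keyof typeof qualityRank,
  minQuality: keyof typeof qualityRank = "moderate",
): FallbackMode {
  if (!isOnline) return { mode: "offline", reason: "network_unavailable" };
  if (!isHealthy) return { mode: "offline", reason: "unhealthy" };
  if (qualityRank[quality] < qualityRank[minQuality]) {
    return { mode: "offline", reason: "poor_quality" };
  }
  return { mode: "network" };
}
```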

### TTS Caching (`useTTSCache`)

Location: `apps/web-app/src/hooks/useOfflineVAD.ts`

Caches synthesized TTS audio for offline playback:

```typescript
const {
  getTTS, // Get audio (from cache or fresh)
  preload, // Preload common phrases
  isCached, // Check if text is cached
  stats, // { entryCount, sizeMB, hitRate }
  clear, // Clear cache
} = useTTSCache({
  voice: "alloy",
  maxSizeMB: 50,
  ttsFunction: async (text) => synthesizeAudio(text),
});

// Preload common phrases on app start
await preload(); // Caches "I'm listening", "Go ahead", etc.

// Get TTS (cache hit = instant, cache miss = synthesize + cache)
const audio = await getTTS("Hello world");
```

### User Settings Integration

Phase 9 settings are stored in `voiceSettingsStore`:

| Setting                 | Default | Description                              |
| ----------------------- | ------- | ---------------------------------------- |
| `enableOfflineFallback` | `true`  | Auto-switch to offline when network poor |
| `preferOfflineVAD`      | `false` | Force offline VAD (privacy mode)         |
| `ttsCacheEnabled`       | `true`  | Enable TTS response caching              |

### File Reference (Phase 9)

| File                                                             | Purpose                         |
| ---------------------------------------------------------------- | ------------------------------- |
| `apps/web-app/src/lib/offline/networkMonitor.ts`                 | Network health monitoring       |
| `apps/web-app/src/lib/offline/webrtcVAD.ts`                      | WebRTC-based offline VAD        |
| `apps/web-app/src/lib/offline/types.ts`                          | Offline module type definitions |
| `apps/web-app/src/hooks/useNetworkStatus.ts`                     | React hook for network status   |
| `apps/web-app/src/hooks/useOfflineVAD.ts`                        | Offline VAD + TTS cache hooks   |
| `apps/web-app/src/lib/offline/__tests__/networkMonitor.test.ts`  | Network monitor tests           |

---

## Future Work

- ~~**Metrics export to backend**: Send metrics to backend for aggregation/alerting~~ ✓ Implemented
- ~~**Barge-in support**: Allow user to interrupt AI responses~~ ✓ Implemented (2025-11-28)
- ~~**Audio overlap prevention**: Prevent multiple responses playing simultaneously~~ ✓ Implemented (2025-11-28)
- ~~**Per-user voice preferences**: Backend persistence for TTS settings~~ ✓ Implemented (2025-11-29)
- ~~**Context-aware voice styles**: Auto-detect tone from content~~ ✓ Implemented (2025-11-29)
- ~~**Aggressive latency optimization**: 200ms VAD, 256-sample chunks, 300ms reconnect~~ ✓ Implemented (2025-11-29)
- ~~**Observability & Monitoring (Phase 3)**: Error taxonomy, metrics, SLO alerts, telemetry~~ ✓ Implemented (2025-12-02)
- ~~**Phase 7: Multilingual Support**: Auto language detection, accent profiles, language switch confidence~~ ✓ Implemented (2025-12-03)
- ~~**Phase 8: Voice Calibration**: Personalized VAD thresholds, calibration wizard, adaptive learning~~ ✓ Implemented (2025-12-03)
- ~~**Phase 9: Offline Fallback**: Network monitoring, offline VAD, TTS caching, quality-based switching~~ ✓ Implemented (2025-12-03)
- ~~**Phase 10: Conversation Intelligence**: Sentiment tracking, discourse analysis, response recommendations~~ ✓ Implemented (2025-12-03)

### Voice Mode Enhancement - 10 Phase Plan ✅ COMPLETE (2025-12-03)

A comprehensive enhancement transforming voice mode into a human-like conversational partner with medical dictation:

- ~~**Phase 1**: Emotional Intelligence (Hume AI)~~ ✓ Complete
- ~~**Phase 2**: Backchanneling System~~ ✓ Complete
- ~~**Phase 3**: Prosody Analysis~~ ✓ Complete
- ~~**Phase 4**: Memory & Context System~~ ✓ Complete
- ~~**Phase 5**: Advanced Turn-Taking~~ ✓ Complete
- ~~**Phase 6**: Variable Response Timing~~ ✓ Complete
- ~~**Phase 7**: Conversational Repair~~ ✓ Complete
- ~~**Phase 8**: Medical Dictation Core~~ ✓ Complete
- ~~**Phase 9**: Patient Context Integration~~ ✓ Complete
- ~~**Phase 10**: Frontend Integration & Analytics~~ ✓ Complete

**Full documentation:** [VOICE_MODE_ENHANCEMENT_10_PHASE.md](./VOICE_MODE_ENHANCEMENT_10_PHASE.md)

### Remaining Tasks

- **Voice→chat transcript content E2E**: Test actual transcript content in chat timeline
- **Error tracking integration**: Send errors to Sentry/similar
- **Audio level visualization**: Show real-time audio level meter during recording