Thinker-Talker Voice Pipeline
Status: Production Ready Last Updated: 2025-12-01 Phase: Voice Pipeline Migration (Complete)
Overview
The Thinker-Talker (T/T) pipeline is VoiceAssist's voice processing architecture that replaces the OpenAI Realtime API with a local orchestration approach. It provides unified conversation context, full tool/RAG support, and custom TTS with ElevenLabs for superior voice quality.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Thinker-Talker Pipeline │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────┐ │
│ │ Audio │───>│ Deepgram STT │───>│ GPT-4o │───>│ElevenLabs│ │
│ │ Input │ │ (Streaming) │ │ Thinker │ │ TTS │ │
│ └──────────┘ └──────────────┘ └──────────────┘ └──────────┘ │
│ │ │ │ │ │
│ │ Transcripts Tool Calls Audio Out │
│ │ │ │ │ │
│ v v v v │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ WebSocket Handler │ │
│ │ (Bidirectional Client Communication) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Benefits Over OpenAI Realtime API
| Feature | OpenAI Realtime | Thinker-Talker |
|---|---|---|
| Conversation Context | Separate from chat | Unified with chat mode |
| Tool Support | Limited | Full tool calling + RAG |
| TTS Quality | OpenAI voices | ElevenLabs premium voices |
| Cost | Per-minute billing | Per-token + TTS chars |
| Voice Selection | 6 voices | 11+ ElevenLabs voices |
| Customization | Limited | Full control over each stage |
| Barge-in | Built-in | Fully supported |
Architecture Components
1. Voice Pipeline Service
Location: services/api-gateway/app/services/voice_pipeline_service.py
Orchestrates the complete STT → Thinker → Talker flow:
```python
class VoicePipelineService:
    """
    Orchestrates the complete voice pipeline:
    1. Receive audio from client
    2. Stream to Deepgram STT
    3. Send transcripts to Thinker (LLM)
    4. Stream response tokens to Talker (TTS)
    5. Send audio chunks back to client
    """
```
Configuration:
```python
@dataclass
class PipelineConfig:
    # STT Settings
    stt_language: str = "en"
    stt_sample_rate: int = 16000
    stt_endpointing_ms: int = 800       # Wait for natural pauses
    stt_utterance_end_ms: int = 1500    # Finalize after 1.5s silence

    # TTS Settings - defaults from voice_constants.py (single source of truth)
    # See docs/voice/voice-configuration.md for details
    voice_id: str = DEFAULT_VOICE_ID    # Brian (from voice_constants.py)
    tts_model: str = DEFAULT_TTS_MODEL  # eleven_flash_v2_5

    # Barge-in
    barge_in_enabled: bool = True
```
2. Thinker Service
Location: services/api-gateway/app/services/thinker_service.py
The reasoning engine that processes transcribed speech:
```python
class ThinkerService:
    """
    Unified reasoning service for the Thinker/Talker pipeline.

    Handles:
    - Conversation context management (persisted across turns)
    - Streaming LLM responses with token callbacks
    - Tool calling with result injection
    - Cancellation support
    """
```
Key Features:
- ConversationContext: Maintains history (max 20 messages) with smart trimming
- Tool Registry: Supports calendar, search, medical calculators, KB search
- Streaming: Token-by-token callbacks for low-latency TTS
- State Machine: IDLE → PROCESSING → TOOL_CALLING → GENERATING → COMPLETE
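The context-trimming behavior can be pictured with a small sketch. This is illustrative TypeScript only; the actual implementation lives in thinker_service.py, and the `trimContext` helper and `Message` shape here are hypothetical.

```typescript
interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

const MAX_MESSAGES = 20; // matches the documented history cap

// Hypothetical sketch: keep the system prompt, drop the oldest turns first.
function trimContext(history: Message[]): Message[] {
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  const keep = Math.max(0, MAX_MESSAGES - system.length);
  return [...system, ...rest.slice(-keep)];
}
```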
3. Talker Service
Location: services/api-gateway/app/services/talker_service.py
Text-to-Speech synthesis with streaming audio:
```python
class TalkerService:
    """
    Unified TTS service for the Thinker/Talker pipeline.

    Handles:
    - Streaming LLM tokens through sentence chunker
    - Audio queue management for gapless playback
    - Cancellation (barge-in support)
    """
```
Voice Configuration:
```python
@dataclass
class VoiceConfig:
    provider: TTSProvider = TTSProvider.ELEVENLABS
    voice_id: str = "TxGEqnHWrfWFTfGW9XjX"  # Josh
    model_id: str = "eleven_turbo_v2_5"
    stability: float = 0.78           # Voice consistency
    similarity_boost: float = 0.85    # Voice clarity
    style: float = 0.08               # Natural, less dramatic
    output_format: str = "pcm_24000"  # Low-latency streaming
```
4. Sentence Chunker
Location: services/api-gateway/app/services/sentence_chunker.py
Optimizes LLM output for TTS with low latency:
```python
class SentenceChunker:
    """
    Low-latency phrase chunker for TTS processing.

    Strategy:
    - Primary: Split on sentence boundaries (. ! ?)
    - Secondary: Split on clause boundaries (, ; :) after min chars
    - Emergency: Force split at max chars

    Config (optimized for speed):
    - min_chunk_chars: 40 (avoid tiny fragments)
    - optimal_chunk_chars: 120 (natural phrases)
    - max_chunk_chars: 200 (force split)
    """
```
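To make the strategy concrete, here is a minimal TypeScript sketch of the same idea: buffer incoming tokens, emit on sentence boundaries, fall back to clause boundaries past the minimum size, and force a split at the maximum. The function names and exact cut points are illustrative, not the production implementation.

```typescript
const MIN_CHUNK = 40;
const MAX_CHUNK = 200;

// Illustrative chunker: feed LLM tokens in, get TTS-ready phrases out via emit().
function createChunker(emit: (phrase: string) => void) {
  let buffer = "";
  return {
    push(token: string) {
      buffer += token;
      // Primary: sentence boundary
      let cut = buffer.search(/[.!?]\s/);
      // Secondary: clause boundary once past the minimum size
      if (cut === -1 && buffer.length >= MIN_CHUNK) {
        cut = buffer.search(/[,;:]\s/);
      }
      // Emergency: force split at the maximum size
      if (cut === -1 && buffer.length >= MAX_CHUNK) {
        cut = MAX_CHUNK - 1;
      }
      if (cut !== -1) {
        emit(buffer.slice(0, cut + 1).trim());
        buffer = buffer.slice(cut + 1);
      }
    },
    flush() {
      if (buffer.trim()) emit(buffer.trim());
      buffer = "";
    },
  };
}
```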
5. WebSocket Handler
Location: services/api-gateway/app/services/thinker_talker_websocket_handler.py
Manages bidirectional client communication:
```python
class ThinkerTalkerWebSocketHandler:
    """
    WebSocket handler for Thinker/Talker voice pipeline.

    Protocol Messages (Client → Server):
    - audio.input: Base64 PCM16 audio
    - audio.input.complete: Signal end of speech
    - barge_in: Interrupt AI response
    - voice.mode: Activate/deactivate voice mode

    Protocol Messages (Server → Client):
    - transcript.delta/complete: STT results
    - response.delta/complete: LLM response
    - audio.output: TTS audio chunk
    - tool.call/result: Tool execution
    - voice.state: Pipeline state update
    """
```
Data Flow
Complete Request/Response Cycle
1. User speaks into microphone
│
▼
2. Frontend captures PCM16 audio (16kHz)
│
▼
3. Audio streamed via WebSocket (audio.input messages)
│
▼
4. Deepgram STT processes audio stream
│
├──> transcript.delta (partial text)
│
└──> transcript.complete (final text)
│
▼
5. ThinkerService receives transcript
│
├──> Adds to ConversationContext
│
├──> Calls GPT-4o with tools
│
├──> If tool call needed:
│ │
│ ├──> tool.call sent to client
│ │
│ ├──> Tool executed
│ │
│ └──> tool.result sent to client
│
└──> response.delta (streaming tokens)
│
▼
6. TalkerService receives tokens
│
├──> SentenceChunker buffers tokens
│
├──> Complete sentences → ElevenLabs TTS
│
└──> audio.output (streaming PCM)
│
▼
7. Frontend plays audio via Web Audio API
Barge-in Flow
1. AI is speaking (audio.output streaming)
│
2. User starts speaking
│
▼
3. Frontend sends barge_in message
│
▼
4. Backend:
├──> Cancels TalkerSession
├──> Clears audio queue
└──> Resets pipeline to LISTENING
│
▼
5. New user speech processed normally
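From the client's point of view, barge-in is one protocol message plus local cleanup. A minimal sketch, assuming an open WebSocket speaking the protocol above and a hypothetical `stopPlayback()` helper:

```typescript
// Called as soon as local detection notices the user speaking over the AI.
function handleBargeIn(ws: WebSocket, stopPlayback: () => void): void {
  // Stop and discard any audio already queued locally.
  stopPlayback();
  // Tell the backend to cancel the TalkerSession and return to LISTENING.
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({ type: "barge_in" }));
  }
}
```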
State Machine
┌─────────────────┐
│ IDLE │
│ (waiting for │
│ user input) │
└────────┬────────┘
│
audio.input received
│
▼
┌─────────────────┐
│ LISTENING │
│ (STT active, │
│ collecting) │
└────────┬────────┘
│
transcript.complete
│
▼
┌─────────────────┐
│ PROCESSING │◄─────────┐
│ (LLM thinking) │ │
└────────┬────────┘ │
│ │
┌──────────────┼──────────────┐ │
│ │ │ │
tool_call no tools error │ │
│ │ │ │
▼ ▼ │ │
┌─────────────────┐ ┌──────────┐ │ │
│ TOOL_CALLING │ │GENERATING│ │ │
│ (executing │ │(streaming│ │ │
│ tool) │ │ response)│ │ │
└────────┬────────┘ └────┬─────┘ │ │
│ │ │ │
tool_result response.complete │ │
│ │ │ │
└───────┬───────┘ │ │
│ │ │
▼ │ │
┌─────────────────┐ │ │
│ SPEAKING │ │ │
│ (TTS playing) │────────────┘ │
└────────┬────────┘ (more to say) │
│ │
audio complete or barge_in │
│ │
▼ │
┌─────────────────┐ │
│ CANCELLED │─────────────────┘
│ (interrupted) │ (restart listening)
└─────────────────┘
WebSocket Protocol
Client → Server Messages
| Message Type | Description | Payload |
|---|---|---|
| session.init | Initialize session with settings | { voice_settings: {...}, conversation_id: "..." } |
| audio.input | Audio chunk from microphone | { audio: "<base64 PCM16>" } |
| audio.input.complete | Manual end-of-speech signal | {} |
| barge_in | Interrupt AI response | {} |
| message | Text input fallback | { content: "..." } |
| ping | Heartbeat | {} |
Server → Client Messages
| Message Type | Description | Payload |
|---|---|---|
| session.ready | Session initialized | { session_id, pipeline_mode } |
| transcript.delta | Partial STT transcript | { text: "...", is_final: false } |
| transcript.complete | Final transcript | { text: "...", message_id: "..." } |
| response.delta | Streaming LLM token | { delta: "...", message_id: "..." } |
| response.complete | Complete LLM response | { text: "...", message_id: "..." } |
| audio.output | TTS audio chunk | { audio: "<base64 PCM>", is_final: false } |
| tool.call | Tool being called | { id, name, arguments } |
| tool.result | Tool result | { id, name, result } |
| voice.state | Pipeline state change | { state: "listening" } |
| error | Error occurred | { code, message, recoverable } |
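A client typically dispatches on the `type` field of each incoming frame. The sketch below assumes messages arrive as JSON text frames shaped like the table above; the handler bodies are placeholders for real UI and audio-queue updates.

```typescript
type ServerMessage = { type: string; [key: string]: unknown };

function attachHandlers(ws: WebSocket): void {
  ws.onmessage = (event: MessageEvent<string>) => {
    const msg = JSON.parse(event.data) as ServerMessage;
    switch (msg.type) {
      case "transcript.complete":
        console.log("User said:", msg.text);       // final STT result
        break;
      case "response.delta":
        console.log("Token:", msg.delta);          // streaming LLM tokens
        break;
      case "audio.output":
        // msg.audio is base64 PCM; decode and enqueue for playback here.
        break;
      case "voice.state":
        console.log("Pipeline state:", msg.state); // e.g. "listening", "speaking"
        break;
      case "error":
        console.error(msg.code, msg.message);
        break;
    }
  };
}
```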
Frontend Integration
useThinkerTalkerSession Hook
Location: apps/web-app/src/hooks/useThinkerTalkerSession.ts
```typescript
const {
  status,        // 'disconnected' | 'connecting' | 'ready' | 'error'
  pipelineState, // 'idle' | 'listening' | 'processing' | 'speaking'
  transcript,    // Final user transcript
  metrics,       // Latency and usage metrics
  connect,       // Start session
  disconnect,    // End session
  sendAudio,     // Send audio chunk
  bargeIn,       // Interrupt AI
} = useThinkerTalkerSession({
  conversation_id: "...",
  voiceSettings: {
    voice_id: "TxGEqnHWrfWFTfGW9XjX",
    language: "en",
    barge_in_enabled: true,
  },
  onTranscript: (t) => console.log("Transcript:", t),
  onAudioChunk: (audio) => playAudio(audio),
  onToolCall: (tool) => console.log("Tool:", tool),
});
```
useTTAudioPlayback Hook
Location: apps/web-app/src/hooks/useTTAudioPlayback.ts
Handles streaming audio playback with barge-in support:
```typescript
const {
  isPlaying,
  queuedChunks,
  playAudioChunk, // Add chunk to queue
  stopPlayback,   // Cancel playback (barge-in)
  clearQueue,     // Clear pending audio
} = useTTAudioPlayback({
  sampleRate: 24000,
  onPlaybackEnd: () => console.log("Playback complete"),
});
```
Configuration Reference
Backend Environment Variables
```bash
# LLM Settings
MODEL_SELECTION_DEFAULT=gpt-4o
OPENAI_API_KEY=sk-...
OPENAI_TIMEOUT_SEC=30

# TTS Settings
ELEVENLABS_API_KEY=...
ELEVENLABS_VOICE_ID=TxGEqnHWrfWFTfGW9XjX
ELEVENLABS_MODEL_ID=eleven_turbo_v2_5

# STT Settings
DEEPGRAM_API_KEY=...
```
Voice Configuration Options
| Parameter | Default | Range | Description |
|---|---|---|---|
| voice_id | TxGEqnHWrfWFTfGW9XjX (Josh) | See available voices | ElevenLabs voice |
| model_id | eleven_turbo_v2_5 | turbo/flash/multilingual | TTS model |
| stability | 0.78 | 0.0-1.0 | Higher = more consistent voice |
| similarity_boost | 0.85 | 0.0-1.0 | Higher = clearer voice |
| style | 0.08 | 0.0-1.0 | Lower = more natural |
| output_format | pcm_24000 | pcm/mp3 | Audio format |
Available ElevenLabs Voices
| Voice ID | Name | Gender | Premium |
|---|---|---|---|
| TxGEqnHWrfWFTfGW9XjX | Josh | Male | Yes |
| pNInz6obpgDQGcFmaJgB | Adam | Male | Yes |
| EXAVITQu4vr4xnSDxMaL | Bella | Female | Yes |
| 21m00Tcm4TlvDq8ikWAM | Rachel | Female | Yes |
| AZnzlk1XvdvUeBnXmlld | Domi | Female | No |
| ErXwobaYiN019PkySvjV | Antoni | Male | No |
Metrics & Observability
TTVoiceMetrics
```typescript
interface TTVoiceMetrics {
  connectionTimeMs: number;  // Connect to ready
  sttLatencyMs: number;      // Speech end to transcript
  llmFirstTokenMs: number;   // Transcript to first token
  ttsFirstAudioMs: number;   // First token to first audio
  totalLatencyMs: number;    // Speech end to first audio
  userUtteranceCount: number;
  aiResponseCount: number;
  toolCallCount: number;
  bargeInCount: number;
}
```
Latency Targets
| Metric | Target | Description |
|---|---|---|
| Connection | < 2000ms | WebSocket + pipeline init |
| STT | < 500ms | Speech end to transcript |
| LLM First Token | < 800ms | Transcript to first token |
| TTS First Audio | < 400ms | First token to audio |
| Total | < 1500ms | Speech end to audio playback |
Troubleshooting
Common Issues
1. No audio output
- Check ElevenLabs API key is valid
- Verify voice_id exists in available voices
- Check browser audio permissions
2. High latency
- Check network connection
- Verify STT endpoint is responsive
- Consider reducing chunk sizes
3. Barge-in not working
- Ensure barge_in_enabled: true in config
- Check WebSocket connection is stable
- Verify frontend is sending barge_in message
4. Tool calls failing
- Check user authentication (user_id required)
- Verify tool is registered in ToolRegistry
- Check tool-specific API keys (calendar, etc.)
Debug Logging
Enable verbose logging:
```python
# Backend
import logging

logging.getLogger("app.services.thinker_service").setLevel(logging.DEBUG)
logging.getLogger("app.services.talker_service").setLevel(logging.DEBUG)
```
```typescript
// Frontend
import { voiceLog } from "../lib/logger";

voiceLog.setLevel("debug");
```
Related Documentation
- Thinker Service API
- Talker Service API
- Voice Pipeline WebSocket Protocol
- Frontend Voice Hooks
- Voice Mode Settings Guide
Changelog
2025-12-01 - Initial Release
- Complete Thinker-Talker pipeline implementation
- Deepgram STT integration with streaming
- ElevenLabs TTS with sentence chunking
- Full tool calling support
- Barge-in capability
- Unified conversation context with chat mode
Voice Mode Pipeline
Status: Production-ready Last Updated: 2025-12-03
This document describes the unified Voice Mode pipeline architecture, data flow, metrics, and testing strategy. It serves as the canonical reference for developers working on real-time voice features.
Voice Pipeline Modes
VoiceAssist supports two voice pipeline modes:
| Mode | Description | Best For |
|---|---|---|
| Thinker-Talker (Recommended) | Local STT → LLM → TTS pipeline | Full tool support, unified context, custom TTS |
| OpenAI Realtime (Legacy) | Direct OpenAI Realtime API | Quick setup, minimal backend changes |
Thinker-Talker Pipeline (Primary)
The Thinker-Talker pipeline is the recommended approach, providing:
- Unified conversation context between voice and chat modes
- Full tool/RAG support in voice interactions
- Custom TTS via ElevenLabs with premium voices
- Lower cost per interaction
Documentation: THINKER_TALKER_PIPELINE.md
[Audio] → [Deepgram STT] → [GPT-4o Thinker] → [ElevenLabs TTS] → [Audio Out]
│ │ │
Transcripts Tool Calls Audio Chunks
│ │ │
└───────── WebSocket Handler ──────────────┘
OpenAI Realtime API (Legacy)
The original implementation using OpenAI's Realtime API directly. Still supported for backward compatibility.
Implementation Status
Thinker-Talker Components
| Component | Status | Location |
|---|---|---|
| ThinkerService | Live | app/services/thinker_service.py |
| TalkerService | Live | app/services/talker_service.py |
| VoicePipelineService | Live | app/services/voice_pipeline_service.py |
| T/T WebSocket Handler | Live | app/services/thinker_talker_websocket_handler.py |
| SentenceChunker | Live | app/services/sentence_chunker.py |
| Frontend T/T hook | Live | apps/web-app/src/hooks/useThinkerTalkerSession.ts |
| T/T Audio Playback | Live | apps/web-app/src/hooks/useTTAudioPlayback.ts |
| T/T Voice Panel | Live | apps/web-app/src/components/voice/ThinkerTalkerVoicePanel.tsx |
OpenAI Realtime Components (Legacy)
| Component | Status | Location |
|---|---|---|
| Backend session endpoint | Live | services/api-gateway/app/api/voice.py |
| Ephemeral token generation | Live | app/services/realtime_voice_service.py |
| Voice metrics endpoint | Live | POST /api/voice/metrics |
| Frontend voice hook | Live | apps/web-app/src/hooks/useRealtimeVoiceSession.ts |
| Voice settings store | Live | apps/web-app/src/stores/voiceSettingsStore.ts |
| Voice UI panel | Live | apps/web-app/src/components/voice/VoiceModePanel.tsx |
| Chat timeline integration | Live | Voice messages appear in chat |
| Barge-in support | Live | response.cancel + onSpeechStarted callback |
| Audio overlap prevention | Live | Response ID tracking + isProcessingResponseRef |
| E2E test suite | Passing | 95 tests across unit/integration/E2E |
Full status: See Implementation Status for all components.
Overview
Voice Mode enables real-time voice conversations with the AI assistant using OpenAI's Realtime API. The pipeline handles:
- Ephemeral session authentication (no raw API keys in browser)
- WebSocket-based bidirectional voice streaming
- Voice activity detection (VAD) with user-configurable sensitivity
- User settings propagation (voice, language, VAD threshold)
- Chat timeline integration (voice messages appear in chat)
- Connection state management with automatic reconnection
- Barge-in support (interrupt AI while speaking)
- Audio playback management (prevent overlapping responses)
- Metrics tracking for observability
Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────────┐
│ FRONTEND │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ ┌───────────────┐ │
│ │ VoiceModePanel │────▶│useRealtimeVoice │────▶│ voiceSettings │ │
│ │ (UI Component) │ │Session (Hook) │ │ Store │ │
│ │ - Start/Stop │ │- connect() │ │ - voice │ │
│ │ - Status display │ │- disconnect() │ │ - language │ │
│ │ - Metrics logging │ │- sendMessage() │ │ - vadSens │ │
│ └─────────┬───────────┘ └──────────┬──────────┘ └───────────────┘ │
│ │ │ │
│ │ │ onUserMessage()/onAssistantMessage()
│ │ ▼ │
│ ┌─────────▼───────────┐ ┌─────────────────────┐ │
│ │ MessageInput │ │ ChatPage │ │
│ │ - Voice toggle │────▶│ - useChatSession │ │
│ │ - Panel container │ │ - addMessage() │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
└──────────────────────────────────────┬──────────────────────────────────────┘
│
│ POST /api/voice/realtime-session
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ BACKEND │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ voice.py │────▶│ realtime_voice_ │ │
│ │ (FastAPI Router) │ │ service.py │ │
│ │ - /realtime-session│ │ - generate_session │ │
│ │ - Timing logs │ │ - ephemeral token │ │
│ └─────────────────────┘ └──────────┬──────────┘ │
│ │ │
│ │ POST /v1/realtime/sessions │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ OpenAI API │ │
│ │ - Ephemeral token │ │
│ │ - Voice config │ │
│ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
│
│ WebSocket wss://api.openai.com/v1/realtime
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ OPENAI REALTIME API │
├─────────────────────────────────────────────────────────────────────────────┤
│ - Server-side VAD (voice activity detection) │
│ - Bidirectional audio streaming (PCM16) │
│ - Real-time transcription (Whisper) │
│ - GPT-4o responses with audio synthesis │
└─────────────────────────────────────────────────────────────────────────────┘
Backend: /api/voice/realtime-session
Location: services/api-gateway/app/api/voice.py
Request
```typescript
interface RealtimeSessionRequest {
  conversation_id?: string; // Optional conversation context
  voice?: string;           // "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer"
  language?: string;        // "en" | "es" | "fr" | "de" | "it" | "pt"
  vad_sensitivity?: number; // 0-100 (maps to threshold: 0→0.9, 100→0.1)
}
```
Response
```typescript
interface RealtimeSessionResponse {
  url: string;        // WebSocket URL: "wss://api.openai.com/v1/realtime"
  model: string;      // "gpt-4o-realtime-preview"
  session_id: string; // Unique session identifier
  expires_at: number; // Unix timestamp (epoch seconds)
  conversation_id: string | null;
  auth: {
    type: "ephemeral_token";
    token: string;      // Ephemeral token (ek_...), NOT raw API key
    expires_at: number; // Token expiry (5 minutes)
  };
  voice_config: {
    voice: string; // Selected voice
    modalities: ["text", "audio"];
    input_audio_format: "pcm16";
    output_audio_format: "pcm16";
    input_audio_transcription: { model: "whisper-1" };
    turn_detection: {
      type: "server_vad";
      threshold: number; // 0.1 (sensitive) to 0.9 (insensitive)
      prefix_padding_ms: number;
      silence_duration_ms: number;
    };
  };
}
```
VAD Sensitivity Mapping
The frontend uses a 0-100 scale for user-friendly VAD sensitivity:
| User Setting | VAD Threshold | Behavior |
|---|---|---|
| 0 (Low) | 0.9 | Requires loud/clear speech |
| 50 (Medium) | 0.5 | Balanced detection |
| 100 (High) | 0.1 | Very sensitive, picks up soft speech |
Formula: threshold = 0.9 - (vad_sensitivity / 100 * 0.8)
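In code form, the mapping from the 0-100 user setting to the server-side threshold is simply:

```typescript
// 0 (low sensitivity) -> threshold 0.9, 100 (high sensitivity) -> threshold 0.1
function vadSensitivityToThreshold(sensitivity: number): number {
  const clamped = Math.min(100, Math.max(0, sensitivity));
  return 0.9 - (clamped / 100) * 0.8;
}

// Example: the default user setting of 50 maps to a balanced threshold of 0.5.
console.log(vadSensitivityToThreshold(50)); // 0.5
```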
Observability
Backend logs timing and context for each session request:
```python
# Request logging
logger.info(
    f"Creating Realtime session for user {current_user.id}",
    extra={
        "user_id": current_user.id,
        "conversation_id": request.conversation_id,
        "voice": request.voice,
        "language": request.language,
        "vad_sensitivity": request.vad_sensitivity,
    },
)

# Success logging with duration
duration_ms = int((time.monotonic() - start_time) * 1000)
logger.info(
    f"Realtime session created for user {current_user.id}",
    extra={
        "user_id": current_user.id,
        "session_id": config["session_id"],
        "voice": config.get("voice_config", {}).get("voice"),
        "duration_ms": duration_ms,
    },
)
```
Frontend Hook: useRealtimeVoiceSession
Location: apps/web-app/src/hooks/useRealtimeVoiceSession.ts
Usage
```typescript
const {
  status,       // 'disconnected' | 'connecting' | 'connected' | 'reconnecting' | 'failed' | 'expired' | 'error'
  transcript,   // Current transcript text
  isSpeaking,   // Is the AI currently speaking?
  isConnected,  // Derived: status === 'connected'
  isConnecting, // Derived: status === 'connecting' || 'reconnecting'
  canSend,      // Can send messages?
  error,        // Error message if any
  metrics,      // VoiceMetrics object
  connect,      // () => Promise<void> - start session
  disconnect,   // () => void - end session
  sendMessage,  // (text: string) => void - send text message
} = useRealtimeVoiceSession({
  conversationId,
  voice,              // From voiceSettingsStore
  language,           // From voiceSettingsStore
  vadSensitivity,     // From voiceSettingsStore (0-100)
  onConnected,        // Callback when connected
  onDisconnected,     // Callback when disconnected
  onError,            // Callback on error
  onUserMessage,      // Callback with user transcript
  onAssistantMessage, // Callback with AI response
  onMetricsUpdate,    // Callback when metrics change
});
```
Connection States
disconnected ──▶ connecting ──▶ connected
│ │
▼ ▼
failed ◀──── reconnecting
│ │
▼ ▼
expired ◀────── error
| State | Description |
|---|---|
| disconnected | Initial/idle state |
| connecting | Fetching session config, establishing WebSocket |
| connected | Active voice session |
| reconnecting | Auto-reconnect after temporary disconnect |
| failed | Connection failed (backend error, network issue) |
| expired | Session token expired (needs manual restart) |
| error | General error state |
WebSocket Connection
The hook connects using three WebSocket subprotocols for authentication:
```typescript
const ws = new WebSocket(url, [
  "realtime",
  "openai-beta.realtime-v1",
  `openai-insecure-api-key.${ephemeralToken}`,
]);
```
Voice Settings Store
Location: apps/web-app/src/stores/voiceSettingsStore.ts
Schema
```typescript
interface VoiceSettings {
  voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer";
  language: "en" | "es" | "fr" | "de" | "it" | "pt";
  vadSensitivity: number;   // 0-100
  autoStartOnOpen: boolean; // Auto-start voice when panel opens
  showStatusHints: boolean; // Show helper text in UI
}
```
Persistence
Settings are persisted to localStorage under key voiceassist-voice-settings using Zustand's persist middleware.
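As a rough sketch of how that persistence is typically wired with Zustand (the real store has many more fields; the shape here is trimmed for illustration):

```typescript
import { create } from "zustand";
import { persist } from "zustand/middleware";

interface VoiceSettingsState {
  voice: string;
  vadSensitivity: number;
  setVoice: (voice: string) => void;
}

// Persisted under the documented localStorage key.
export const useVoiceSettingsStore = create<VoiceSettingsState>()(
  persist(
    (set) => ({
      voice: "alloy",
      vadSensitivity: 50,
      setVoice: (voice) => set({ voice }),
    }),
    { name: "voiceassist-voice-settings" }
  )
);
```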
Defaults
| Setting | Default |
|---|---|
| voice | "alloy" |
| language | "en" |
| vadSensitivity | 50 |
| autoStartOnOpen | false |
| showStatusHints | true |
Chat Integration
Location: apps/web-app/src/pages/ChatPage.tsx
Message Flow
- User speaks → VoiceModePanel receives final transcript
- VoiceModePanel calls onUserMessage(transcript)
- ChatPage receives callback, calls useChatSession.addMessage()
- Message added to timeline with metadata: { source: "voice" }
```typescript
// ChatPage.tsx
const handleVoiceUserMessage = (content: string) => {
  addMessage({
    role: "user",
    content,
    metadata: { source: "voice" },
  });
};

const handleVoiceAssistantMessage = (content: string) => {
  addMessage({
    role: "assistant",
    content,
    metadata: { source: "voice" },
  });
};
```
Message Structure
```typescript
interface VoiceMessage {
  id: string; // "voice-{timestamp}-{random}"
  role: "user" | "assistant";
  content: string;
  timestamp: number;
  metadata: {
    source: "voice"; // Distinguishes from text messages
  };
}
```
Barge-in & Audio Playback
Location: apps/web-app/src/components/voice/VoiceModePanel.tsx, apps/web-app/src/hooks/useRealtimeVoiceSession.ts
Barge-in Flow
When the user starts speaking while the AI is responding, the system immediately:
- Detects speech start via OpenAI's input_audio_buffer.speech_started event
- Cancels active response by sending response.cancel to OpenAI
- Stops audio playback via onSpeechStarted callback
- Clears pending responses to prevent stale audio from playing
User speaks → speech_started event → response.cancel → stopCurrentAudio()
↓
Audio stops
Queue cleared
Response ID incremented
Response Cancellation
Location: useRealtimeVoiceSession.ts - handleRealtimeMessage
case "input_audio_buffer.speech_started": setIsSpeaking(true); setPartialTranscript(""); // Barge-in: Cancel any active response when user starts speaking if (activeResponseIdRef.current && wsRef.current?.readyState === WebSocket.OPEN) { wsRef.current.send(JSON.stringify({ type: "response.cancel" })); activeResponseIdRef.current = null; } // Notify parent to stop audio playback options.onSpeechStarted?.(); break;
Audio Playback Management
Location: VoiceModePanel.tsx
The panel tracks audio playback state to prevent overlapping responses:
```typescript
// Track currently playing Audio element
const currentAudioRef = useRef<HTMLAudioElement | null>(null);

// Prevent overlapping response processing
const isProcessingResponseRef = useRef(false);

// Response ID to invalidate stale responses after barge-in
const currentResponseIdRef = useRef<number>(0);
```
Stop current audio function:
```typescript
const stopCurrentAudio = useCallback(() => {
  if (currentAudioRef.current) {
    currentAudioRef.current.pause();
    currentAudioRef.current.currentTime = 0;
    if (currentAudioRef.current.src.startsWith("blob:")) {
      URL.revokeObjectURL(currentAudioRef.current.src);
    }
    currentAudioRef.current = null;
  }
  audioQueueRef.current = [];
  isPlayingRef.current = false;
  currentResponseIdRef.current++; // Invalidate pending responses
  isProcessingResponseRef.current = false;
}, []);
```
Overlap Prevention
When a relay result arrives, the handler checks:
- Already processing? Skip if isProcessingResponseRef.current === true
- Response ID valid? Skip playback if ID changed (barge-in occurred)
```typescript
onRelayResult: async ({ answer }) => {
  if (answer) {
    // Prevent overlapping responses
    if (isProcessingResponseRef.current) {
      console.log("[VoiceModePanel] Skipping response - already processing another");
      return;
    }
    const responseId = ++currentResponseIdRef.current;
    isProcessingResponseRef.current = true;

    // ... synthesis and playback ...

    // Check if response is still valid before playback
    if (responseId !== currentResponseIdRef.current) {
      console.log("[VoiceModePanel] Response cancelled - skipping playback");
      return;
    }
  }
};
```
Error Handling
Benign cancellation errors (e.g., "Cancellation failed: no active response found") are handled gracefully:
case "error": { const errorMessage = message.error?.message || "Realtime API error"; // Ignore benign cancellation errors if ( errorMessage.includes("Cancellation failed") || errorMessage.includes("no active response") ) { voiceLog.debug(`Ignoring benign error: ${errorMessage}`); break; } handleError(new Error(errorMessage)); break; }
Metrics
Location: apps/web-app/src/hooks/useRealtimeVoiceSession.ts
VoiceMetrics Interface
```typescript
interface VoiceMetrics {
  connectionTimeMs: number | null;        // Time to establish connection
  timeToFirstTranscriptMs: number | null; // Time to first user transcript
  lastSttLatencyMs: number | null;        // Speech-to-text latency
  lastResponseLatencyMs: number | null;   // AI response latency
  sessionDurationMs: number | null;       // Total session duration
  userTranscriptCount: number;            // Number of user turns
  aiResponseCount: number;                // Number of AI turns
  reconnectCount: number;                 // Number of reconnections
  sessionStartedAt: number | null;        // Session start timestamp
}
```
Frontend Logging
VoiceModePanel logs key metrics to console:
```typescript
// Connection time
console.log(`[VoiceModePanel] voice_session_connect_ms=${metrics.connectionTimeMs}`);

// STT latency
console.log(`[VoiceModePanel] voice_stt_latency_ms=${metrics.lastSttLatencyMs}`);

// Response latency
console.log(`[VoiceModePanel] voice_first_reply_ms=${metrics.lastResponseLatencyMs}`);

// Session duration
console.log(`[VoiceModePanel] voice_session_duration_ms=${metrics.sessionDurationMs}`);
```
Consuming Metrics
Developers can plug into metrics via the onMetricsUpdate callback:
```typescript
useRealtimeVoiceSession({
  onMetricsUpdate: (metrics) => {
    // Send to telemetry service
    analytics.track("voice_session_metrics", {
      connection_ms: metrics.connectionTimeMs,
      stt_latency_ms: metrics.lastSttLatencyMs,
      response_latency_ms: metrics.lastResponseLatencyMs,
      duration_ms: metrics.sessionDurationMs,
    });
  },
});
```
Metrics Export to Backend
Metrics can be automatically exported to the backend for aggregation and alerting.
Backend Endpoint: POST /api/voice/metrics
Location: services/api-gateway/app/api/voice.py
Request Schema
```typescript
interface VoiceMetricsPayload {
  conversation_id?: string;
  connection_time_ms?: number;
  time_to_first_transcript_ms?: number;
  last_stt_latency_ms?: number;
  last_response_latency_ms?: number;
  session_duration_ms?: number;
  user_transcript_count: number;
  ai_response_count: number;
  reconnect_count: number;
  session_started_at?: number;
}
```
Response
```typescript
interface VoiceMetricsResponse {
  status: "ok";
}
```
Privacy
No PHI or transcript content is sent. Only timing metrics and counts.
Frontend Configuration
Metrics export is controlled by environment variables:
- Production (import.meta.env.PROD): Metrics sent automatically
- Development: Set VITE_ENABLE_VOICE_METRICS=true to enable
The export uses navigator.sendBeacon() for reliability (survives page navigation).
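A minimal sketch of that export path, assuming the payload shape documented above (the helper name is hypothetical):

```typescript
function exportVoiceMetrics(payload: VoiceMetricsPayload): void {
  const body = JSON.stringify(payload);
  // sendBeacon queues the request even if the page is unloading.
  if (navigator.sendBeacon("/api/voice/metrics", body)) return;
  // Fallback for browsers without Beacon support.
  void fetch("/api/voice/metrics", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body,
    keepalive: true,
  });
}
```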
Backend Logging
Metrics are logged with user context:
```python
logger.info(
    "VoiceMetrics received",
    extra={
        "user_id": current_user.id,
        "conversation_id": payload.conversation_id,
        "connection_time_ms": payload.connection_time_ms,
        "session_duration_ms": payload.session_duration_ms,
        ...
    },
)
```
Testing
```bash
# Backend
cd /home/asimo/VoiceAssist/services/api-gateway
source venv/bin/activate && export PYTHONPATH=.
python -m pytest tests/integration/test_voice_metrics.py -v
```
Security
Ephemeral Token Architecture
CRITICAL: The browser NEVER receives the raw OpenAI API key.
- Backend holds OPENAI_API_KEY securely
- Frontend requests session via /api/voice/realtime-session
- Backend creates ephemeral token via OpenAI /v1/realtime/sessions
- Ephemeral token returned to frontend (valid ~5 minutes)
- Frontend connects WebSocket using ephemeral token
Token Refresh
The hook monitors session.expires_at and can trigger refresh before expiry. If the token expires mid-session, status transitions to expired.
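One way to picture the expiry watch (a hedged sketch; the real hook's internals may differ, and `refreshSession` / `setStatus` are placeholders):

```typescript
// expiresAt is the epoch-seconds value from session.expires_at.
function scheduleTokenRefresh(
  expiresAt: number,
  refreshSession: () => Promise<void>,
  setStatus: (s: "expired") => void,
  marginMs = 30_000 // refresh shortly before expiry
): () => void {
  const msUntilRefresh = expiresAt * 1000 - Date.now() - marginMs;
  const timer = setTimeout(() => {
    refreshSession().catch(() => setStatus("expired"));
  }, Math.max(0, msUntilRefresh));
  return () => clearTimeout(timer); // cancel on disconnect
}
```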
Testing
Voice Pipeline Smoke Suite
Run these commands to validate the voice pipeline:
```bash
# 1. Backend tests (CI-safe, mocked)
cd /home/asimo/VoiceAssist/services/api-gateway
source venv/bin/activate
export PYTHONPATH=.
python -m pytest tests/integration/test_openai_config.py -v

# 2. Frontend unit tests (run individually to avoid OOM)
cd /home/asimo/VoiceAssist/apps/web-app
export NODE_OPTIONS="--max-old-space-size=768"
npx vitest run src/hooks/__tests__/useRealtimeVoiceSession.test.ts --reporter=dot
npx vitest run src/hooks/__tests__/useChatSession-voice-integration.test.ts --reporter=dot
npx vitest run src/stores/__tests__/voiceSettingsStore.test.ts --reporter=dot
npx vitest run src/components/voice/__tests__/VoiceModeSettings.test.tsx --reporter=dot
npx vitest run src/components/chat/__tests__/MessageInput-voice-settings.test.tsx --reporter=dot

# 3. E2E tests (Chromium, mocked backend)
cd /home/asimo/VoiceAssist
npx playwright test \
  e2e/voice-mode-navigation.spec.ts \
  e2e/voice-mode-session-smoke.spec.ts \
  e2e/voice-mode-voice-chat-integration.spec.ts \
  --project=chromium --reporter=list
```
Test Coverage Summary
| Test File | Tests | Coverage |
|---|---|---|
| useRealtimeVoiceSession.test.ts | 22 | Hook lifecycle, states, metrics |
| useChatSession-voice-integration.test.ts | 8 | Message structure validation |
| voiceSettingsStore.test.ts | 17 | Store actions, persistence |
| VoiceModeSettings.test.tsx | 25 | Component rendering, interactions |
| MessageInput-voice-settings.test.tsx | 12 | Integration with chat input |
| voice-mode-navigation.spec.ts | 4 | E2E navigation flow |
| voice-mode-session-smoke.spec.ts | 3 | E2E session smoke (1 live gated) |
| voice-mode-voice-chat-integration.spec.ts | 4 | E2E panel integration |
Total: 95 tests
Live Testing
To test with real OpenAI backend:
```bash
# Backend (requires OPENAI_API_KEY in .env)
LIVE_REALTIME_TESTS=1 python -m pytest tests/integration/test_openai_config.py -v

# E2E (requires running backend + valid API key)
LIVE_REALTIME_E2E=1 npx playwright test e2e/voice-mode-session-smoke.spec.ts
```
File Reference
Backend
| File | Purpose |
|---|---|
| services/api-gateway/app/api/voice.py | API routes, metrics, timing logs |
| services/api-gateway/app/services/realtime_voice_service.py | Session creation, token generation |
| services/api-gateway/tests/integration/test_openai_config.py | Integration tests |
| services/api-gateway/tests/integration/test_voice_metrics.py | Metrics endpoint tests |
Frontend
| File | Purpose |
|---|---|
| apps/web-app/src/hooks/useRealtimeVoiceSession.ts | Core hook |
| apps/web-app/src/components/voice/VoiceModePanel.tsx | UI panel |
| apps/web-app/src/components/voice/VoiceModeSettings.tsx | Settings modal |
| apps/web-app/src/stores/voiceSettingsStore.ts | Settings store |
| apps/web-app/src/components/chat/MessageInput.tsx | Voice button integration |
| apps/web-app/src/pages/ChatPage.tsx | Chat timeline integration |
| apps/web-app/src/hooks/useChatSession.ts | addMessage() helper |
Tests
| File | Purpose |
|---|---|
| apps/web-app/src/hooks/__tests__/useRealtimeVoiceSession.test.ts | Hook tests |
| apps/web-app/src/hooks/__tests__/useChatSession-voice-integration.test.ts | Chat integration |
| apps/web-app/src/stores/__tests__/voiceSettingsStore.test.ts | Store tests |
| apps/web-app/src/components/voice/__tests__/VoiceModeSettings.test.tsx | Component tests |
| apps/web-app/src/components/chat/__tests__/MessageInput-voice-settings.test.tsx | Integration tests |
| e2e/voice-mode-navigation.spec.ts | E2E navigation |
| e2e/voice-mode-session-smoke.spec.ts | E2E smoke test |
| e2e/voice-mode-voice-chat-integration.spec.ts | E2E panel integration |
Related Documentation
- VOICE_MODE_ENHANCEMENT_10_PHASE.md - 10-phase enhancement plan (emotion, dictation, analytics)
- VOICE_MODE_SETTINGS_GUIDE.md - User settings configuration
- TESTING_GUIDE.md - E2E testing strategy and validation checklist
Observability & Monitoring (Phase 3)
Implemented: 2025-12-02
The voice pipeline includes comprehensive observability features for production monitoring.
Error Taxonomy (voice_errors.py)
Location: services/api-gateway/app/core/voice_errors.py
Structured error classification with 8 categories and 40+ error codes:
| Category | Codes | Description |
|---|---|---|
| CONNECTION | CONN_001-7 | WebSocket, network failures |
| STT | STT_001-7 | Speech-to-text errors |
| TTS | TTS_001-7 | Text-to-speech errors |
| LLM | LLM_001-6 | LLM processing errors |
| AUDIO | AUDIO_001-6 | Audio encoding/decoding errors |
| TIMEOUT | TIMEOUT_001-7 | Various timeout conditions |
| PROVIDER | PROVIDER_001-6 | External provider errors |
| INTERNAL | INTERNAL_001-5 | Internal server errors |
Each error code includes:
- Recoverability flag (can auto-retry)
- Retry configuration (delay, max attempts)
- User-friendly description
Voice Metrics (metrics.py)
Location: services/api-gateway/app/core/metrics.py
Prometheus metrics for voice pipeline monitoring:
| Metric | Type | Labels | Description |
|---|---|---|---|
| voice_errors_total | Counter | category, code, provider, recoverable | Total voice errors |
| voice_pipeline_stage_latency_seconds | Histogram | stage | Per-stage latency |
| voice_ttfa_seconds | Histogram | - | Time to first audio |
| voice_active_sessions | Gauge | - | Active voice sessions |
| voice_barge_in_total | Counter | - | Barge-in events |
| voice_audio_chunks_total | Counter | status | Audio chunks processed |
Per-Stage Latency Tracking (voice_timing.py)
Location: services/api-gateway/app/core/voice_timing.py
Pipeline stages tracked:
- audio_receive - Time to receive audio from client
- vad_process - Voice activity detection time
- stt_transcribe - Speech-to-text latency
- llm_process - LLM inference time
- tts_synthesize - Text-to-speech synthesis
- audio_send - Time to send audio to client
- ttfa - Time to first audio (end-to-end)
Usage:
```python
from app.core.voice_timing import create_pipeline_timings, PipelineStage

timings = create_pipeline_timings(session_id="abc123")

with timings.time_stage(PipelineStage.STT_TRANSCRIBE):
    transcript = await stt_client.transcribe(audio)

timings.record_ttfa()  # When first audio byte ready
timings.finalize()     # When response complete
```
SLO Alerts (voice_slo_alerts.yml)
Location: infrastructure/observability/prometheus/rules/voice_slo_alerts.yml
SLO targets with Prometheus alerting rules:
| SLO | Target | Alert |
|---|---|---|
| TTFA P95 | < 200ms | VoiceTTFASLOViolation |
| STT Latency P95 | < 300ms | VoiceSTTLatencySLOViolation |
| TTS First Chunk P95 | < 200ms | VoiceTTSFirstChunkSLOViolation |
| Connection Time P95 | < 500ms | VoiceConnectionTimeSLOViolation |
| Error Rate | < 1% | VoiceErrorRateHigh |
| Session Success Rate | > 95% | VoiceSessionSuccessRateLow |
Client Telemetry (voiceTelemetry.ts)
Location: apps/web-app/src/lib/voiceTelemetry.ts
Frontend telemetry with:
- Network quality assessment via Network Information API
- Browser performance metrics via Performance.memory API
- Jitter estimation for network quality
- Batched reporting (10s intervals)
- Beacon API for reliable delivery on page unload
```typescript
import { getVoiceTelemetry } from "@/lib/voiceTelemetry";

const telemetry = getVoiceTelemetry();
telemetry.startSession(sessionId);
telemetry.recordLatency("stt", 150);
telemetry.recordLatency("ttfa", 180);
telemetry.endSession();
```
Voice Health Endpoint (/health/voice)
Location: services/api-gateway/app/api/health.py
Comprehensive voice subsystem health check:
```bash
curl https://assist.asimo.io/health/voice
```
Response:
{ "status": "healthy", "providers": { "openai": { "status": "up", "latency_ms": 120.5 }, "elevenlabs": { "status": "up", "latency_ms": 85.2 }, "deepgram": { "status": "up", "latency_ms": 95.8 } }, "session_store": { "status": "up", "active_sessions": 5 }, "metrics": { "active_sessions": 5 }, "slo": { "ttfa_target_ms": 200, "error_rate_target": 0.01 } }
Debug Logging Configuration
Location: services/api-gateway/app/core/logging.py
Configurable voice log verbosity via VOICE_LOG_LEVEL environment variable:
| Level | Content |
|---|---|
| MINIMAL | Errors only |
| STANDARD | + Session lifecycle (start/end/state changes) |
| VERBOSE | + All latency measurements |
| DEBUG | + Audio frame details, chunk timing |
Usage:
```python
from app.core.logging import get_voice_logger

voice_log = get_voice_logger(__name__)

voice_log.session_start(session_id="abc123", provider="thinker_talker")
voice_log.latency("stt_transcribe", 150.5, session_id="abc123")
voice_log.error("voice_connection_failed", error_code="CONN_001")
```
Phase 9: Offline & Network Fallback
Implemented: 2025-12-03
The voice pipeline now includes comprehensive offline support and network-aware fallback mechanisms.
Network Monitoring (networkMonitor.ts)
Location: apps/web-app/src/lib/offline/networkMonitor.ts
Continuously monitors network health using multiple signals:
- Navigator.onLine: Basic online/offline detection
- Network Information API: Connection type, downlink speed, RTT
- Health Check Pinging: Periodic /api/health pings for latency measurement
```typescript
import { getNetworkMonitor } from "@/lib/offline/networkMonitor";

const monitor = getNetworkMonitor();
monitor.subscribe((status) => {
  console.log(`Network quality: ${status.quality}`);
  console.log(`Health check latency: ${status.healthCheckLatencyMs}ms`);
});
```
Network Quality Levels
| Quality | Latency | isHealthy | Action |
|---|---|---|---|
| Excellent | < 100ms | true | Full cloud processing |
| Good | < 200ms | true | Full cloud processing |
| Moderate | < 500ms | true | Cloud with quality warning |
| Poor | ≥ 500ms | variable | Consider offline fallback |
| Offline | Unreachable | false | Automatic offline fallback |
Configuration
```typescript
const monitor = createNetworkMonitor({
  healthCheckUrl: "/api/health",
  healthCheckIntervalMs: 30000, // 30 seconds
  healthCheckTimeoutMs: 5000,   // 5 seconds
  goodLatencyThresholdMs: 100,
  moderateLatencyThresholdMs: 200,
  poorLatencyThresholdMs: 500,
  failuresBeforeUnhealthy: 3,
});
```
useNetworkStatus Hook
Location: apps/web-app/src/hooks/useNetworkStatus.ts
React hook providing network status with computed properties:
```typescript
const {
  isOnline,
  isHealthy,
  quality,
  healthCheckLatencyMs,
  effectiveType,      // "4g", "3g", "2g", "slow-2g"
  downlink,           // Mbps
  rtt,                // Round-trip time ms
  isSuitableForVoice, // quality >= "good" && isHealthy
  shouldUseOffline,   // !isOnline || !isHealthy || quality < "moderate"
  qualityScore,       // 0-4 (offline=0, poor=1, moderate=2, good=3, excellent=4)
  checkNow,           // Force immediate health check
} = useNetworkStatus();
```
Offline VAD with Network Fallback
Location: apps/web-app/src/hooks/useOfflineVAD.ts
The useOfflineVADWithFallback hook automatically switches between network and offline VAD:
```typescript
const {
  isListening,
  isSpeaking,
  currentEnergy,
  isUsingOfflineVAD, // Currently using offline mode?
  networkAvailable,
  networkQuality,
  modeReason,        // "network_vad" | "network_unavailable" | "poor_quality" | "forced_offline"
  forceOffline,      // Manually switch to offline
  forceNetwork,      // Manually switch to network (if available)
  startListening,
  stopListening,
} = useOfflineVADWithFallback({
  useNetworkMonitor: true,
  minNetworkQuality: "moderate",
  networkRecoveryDelayMs: 2000, // Prevent flapping
  onFallbackToOffline: () => console.log("Switched to offline VAD"),
  onReturnToNetwork: () => console.log("Returned to network VAD"),
});
```
Fallback Decision Flow
┌────────────────────┐
│ Network Monitor │
│ Health Check │
└─────────┬──────────┘
│
▼
┌────────────────────┐ NO ┌────────────────────┐
│ Is Online? │──────────▶│ Use Offline VAD │
└─────────┬──────────┘ └────────────────────┘
│ YES
▼
┌────────────────────┐ NO ┌────────────────────┐
│ Is Healthy? │──────────▶│ Use Offline VAD │
│ (3+ checks pass) │ │ reason: unhealthy │
└─────────┬──────────┘ └────────────────────┘
│ YES
▼
┌────────────────────┐ NO ┌────────────────────┐
│ Quality ≥ Min? │──────────▶│ Use Offline VAD │
│ (e.g., moderate) │ │ reason: poor_qual │
└─────────┬──────────┘ └────────────────────┘
│ YES
▼
┌────────────────────┐
│ Use Network VAD │
│ (cloud processing)│
└────────────────────┘
TTS Caching (useTTSCache)
Location: apps/web-app/src/hooks/useOfflineVAD.ts
Caches synthesized TTS audio for offline playback:
```typescript
const {
  getTTS,   // Get audio (from cache or fresh)
  preload,  // Preload common phrases
  isCached, // Check if text is cached
  stats,    // { entryCount, sizeMB, hitRate }
  clear,    // Clear cache
} = useTTSCache({
  voice: "alloy",
  maxSizeMB: 50,
  ttsFunction: async (text) => synthesizeAudio(text),
});

// Preload common phrases on app start
await preload(); // Caches "I'm listening", "Go ahead", etc.

// Get TTS (cache hit = instant, cache miss = synthesize + cache)
const audio = await getTTS("Hello world");
```
User Settings Integration
Phase 9 settings are stored in voiceSettingsStore:
| Setting | Default | Description |
|---|---|---|
| enableOfflineFallback | true | Auto-switch to offline when network poor |
| preferOfflineVAD | false | Force offline VAD (privacy mode) |
| ttsCacheEnabled | true | Enable TTS response caching |
File Reference (Phase 9)
| File | Purpose |
|---|---|
| apps/web-app/src/lib/offline/networkMonitor.ts | Network health monitoring |
| apps/web-app/src/lib/offline/webrtcVAD.ts | WebRTC-based offline VAD |
| apps/web-app/src/lib/offline/types.ts | Offline module type definitions |
| apps/web-app/src/hooks/useNetworkStatus.ts | React hook for network status |
| apps/web-app/src/hooks/useOfflineVAD.ts | Offline VAD + TTS cache hooks |
| apps/web-app/src/lib/offline/__tests__/networkMonitor.test.ts | Network monitor tests |
Future Work
- Metrics export to backend: Send metrics to backend for aggregation/alerting ✓ Implemented
- Barge-in support: Allow user to interrupt AI responses ✓ Implemented (2025-11-28)
- Audio overlap prevention: Prevent multiple responses playing simultaneously ✓ Implemented (2025-11-28)
- Per-user voice preferences: Backend persistence for TTS settings ✓ Implemented (2025-11-29)
- Context-aware voice styles: Auto-detect tone from content ✓ Implemented (2025-11-29)
- Aggressive latency optimization: 200ms VAD, 256-sample chunks, 300ms reconnect ✓ Implemented (2025-11-29)
- Observability & Monitoring (Phase 3): Error taxonomy, metrics, SLO alerts, telemetry ✓ Implemented (2025-12-02)
- Phase 7: Multilingual Support: Auto language detection, accent profiles, language switch confidence ✓ Implemented (2025-12-03)
- Phase 8: Voice Calibration: Personalized VAD thresholds, calibration wizard, adaptive learning ✓ Implemented (2025-12-03)
- Phase 9: Offline Fallback: Network monitoring, offline VAD, TTS caching, quality-based switching ✓ Implemented (2025-12-03)
- Phase 10: Conversation Intelligence: Sentiment tracking, discourse analysis, response recommendations ✓ Implemented (2025-12-03)
Voice Mode Enhancement - 10 Phase Plan ✅ COMPLETE (2025-12-03)
A comprehensive enhancement transforming voice mode into a human-like conversational partner with medical dictation:
- Phase 1: Emotional Intelligence (Hume AI) ✓ Complete
- Phase 2: Backchanneling System ✓ Complete
- Phase 3: Prosody Analysis ✓ Complete
- Phase 4: Memory & Context System ✓ Complete
- Phase 5: Advanced Turn-Taking ✓ Complete
- Phase 6: Variable Response Timing ✓ Complete
- Phase 7: Conversational Repair ✓ Complete
- Phase 8: Medical Dictation Core ✓ Complete
- Phase 9: Patient Context Integration ✓ Complete
- Phase 10: Frontend Integration & Analytics ✓ Complete
Full documentation: VOICE_MODE_ENHANCEMENT_10_PHASE.md
Remaining Tasks
- Voice→chat transcript content E2E: Test actual transcript content in chat timeline
- Error tracking integration: Send errors to Sentry/similar
- Audio level visualization: Show real-time audio level meter during recording
Voice Mode Settings Guide
This guide explains how to use and configure Voice Mode settings in VoiceAssist.
Overview
Voice Mode provides real-time voice conversations with the AI assistant. Users can customize their voice experience through the settings panel, including voice selection, language preferences, TTS quality parameters, and behavior options.
Voice Mode Overhaul (2025-11-29): Added backend persistence for voice preferences, context-aware voice style detection, and advanced TTS quality controls.
Phase 7-10 Enhancements (2025-12-03): Added multilingual support with auto-detection, voice calibration, offline fallback with network monitoring, and conversation intelligence features.
Accessing Settings
- Open Voice Mode by clicking the voice button in the chat interface
- Click the gear icon in the Voice Mode panel header
- The settings modal will appear
Available Settings
Voice Selection
Choose from 6 different AI voices:
- Alloy - Neutral, balanced voice (default)
- Echo - Warm, friendly voice
- Fable - Expressive, narrative voice
- Onyx - Deep, authoritative voice
- Nova - Energetic, bright voice
- Shimmer - Soft, calming voice
Language
Select your preferred conversation language:
- English (default)
- Spanish
- French
- German
- Italian
- Portuguese
Voice Detection Sensitivity (0-100%)
Controls how sensitive the voice activity detection is:
- Lower values (0-30%): Less sensitive, requires louder/clearer speech
- Medium values (40-60%): Balanced detection (recommended)
- Higher values (70-100%): More sensitive, may pick up background noise
Auto-start Voice Mode
When enabled, Voice Mode will automatically open when you start a new chat or navigate to the chat page. This is useful for voice-first interactions.
Show Status Hints
When enabled, displays helpful tips and instructions in the Voice Mode panel. Disable if you're familiar with the interface and want a cleaner view.
Context-Aware Voice Style (New)
When enabled, the AI automatically adjusts its voice tone based on the content being spoken:
- Calm: Default for medical explanations (stable, measured pace)
- Urgent: For medical warnings/emergencies (dynamic, faster)
- Empathetic: For sensitive health topics (warm, slower)
- Instructional: For step-by-step guidance (clear, deliberate)
- Conversational: For general chat (natural, varied)
The system detects keywords and patterns to select the appropriate style, then blends it with your base preferences (60% your settings, 40% style preset).
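Conceptually, the blend is a weighted average of your saved parameters and the detected style's preset, using the 60/40 split described above. A hedged sketch (the field names are illustrative):

```typescript
interface TTSParams {
  stability: number;
  similarityBoost: number;
  style: number;
}

// 60% user settings, 40% detected style preset, per the documented split.
function blendVoiceStyle(user: TTSParams, preset: TTSParams): TTSParams {
  const mix = (a: number, b: number) => 0.6 * a + 0.4 * b;
  return {
    stability: mix(user.stability, preset.stability),
    similarityBoost: mix(user.similarityBoost, preset.similarityBoost),
    style: mix(user.style, preset.style),
  };
}
```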
Advanced Voice Quality (New)
Expand this section to fine-tune TTS output parameters:
- Voice Stability (0-100%): Lower = more expressive/varied, Higher = more consistent
- Voice Clarity (0-100%): Higher values produce clearer, more consistent voice
- Expressiveness (0-100%): Higher values add more emotion and style variation
These settings primarily affect ElevenLabs TTS but also influence context-aware style blending for OpenAI TTS.
Phase 7: Language & Detection Settings
Auto-Detect Language
When enabled, the system automatically detects the language being spoken and adjusts processing accordingly. This is useful for multilingual users who switch between languages naturally.
- Default: Enabled
- Store Key: autoLanguageDetection
Language Switch Confidence (0-100%)
Controls how confident the system must be before switching to a detected language. Higher values prevent false-positive language switches.
- Lower values (50-70%): More responsive language switching, but may switch accidentally on similar-sounding phrases
- Medium values (70-85%): Balanced detection (recommended)
- Higher values (85-100%): Very confident switching, stays in current language unless clearly different
- Default: 75%
- Store Key: languageSwitchConfidence
Accent Profile
Select a regional accent profile to improve speech recognition accuracy for your specific accent or dialect.
- Default: None (auto-detect)
- Available Profiles: en-us-midwest, en-gb-london, en-au-sydney, ar-eg-cairo, ar-sa-riyadh, etc.
- Store Key: accentProfileId
Phase 8: Voice Calibration Settings
Voice calibration optimizes the VAD (Voice Activity Detection) thresholds specifically for your voice and environment.
Calibration Status
Shows whether voice calibration has been completed:
- Not Calibrated: Default state, using generic thresholds
- Calibrated: Personal thresholds active (shows last calibration date)
Recalibrate Button
Launches the calibration wizard to:
- Record ambient noise samples
- Record your speaking voice at different volumes
- Compute personalized VAD thresholds
Calibration takes approximately 30-60 seconds.
Personalized VAD Threshold
After calibration, the system uses a custom threshold tuned to your voice:
- Store Key: personalizedVadThreshold
- Range: 0.0-1.0 (null if not calibrated)
Adaptive Learning
When enabled, the system continuously learns from your voice patterns and subtly adjusts thresholds over time.
- Default: Enabled
- Store Key: enableBehaviorLearning
Phase 9: Offline Mode Settings
Configure how the voice assistant behaves when network connectivity is poor or unavailable.
Enable Offline Fallback
When enabled, the system automatically switches to offline VAD processing when:
- Network is offline
- Health check fails consecutively
- Network quality drops below threshold
- Default: Enabled
- Store Key: enableOfflineFallback
Prefer Local VAD
Force the use of local (on-device) VAD processing even when network is available. Useful for:
- Privacy-conscious users who don't want audio sent to servers
- Environments with unreliable connectivity
- Lower latency at the cost of accuracy
- Default: Disabled
- Store Key: preferOfflineVAD
TTS Audio Caching
When enabled, previously synthesized audio responses are cached locally for:
- Faster playback of repeated phrases
- Offline playback of cached responses
- Reduced bandwidth and API costs
- Default: Enabled
- Store Key: ttsCacheEnabled
Network Quality Monitoring
The system continuously monitors network quality and categorizes it into five levels:
| Quality | Latency | Behavior |
|---|---|---|
| Excellent | < 100ms | Full cloud processing |
| Good | < 200ms | Full cloud processing |
| Moderate | < 500ms | Cloud processing, may show warning |
| Poor | ≥ 500ms | Auto-fallback to offline VAD |
| Offline | No network | Full offline mode |
Network status is displayed in the voice panel header when quality is degraded.
Phase 10: Conversation Intelligence Settings
These settings control advanced AI features that enhance conversation quality.
Enable Sentiment Tracking
When enabled, the AI tracks emotional tone throughout the conversation and adapts its responses accordingly.
- Default: Enabled
- Store Key: enableSentimentTracking
Enable Discourse Analysis
Tracks conversation structure (topic changes, question chains, clarifications) to provide more contextually aware responses.
- Default: Enabled
- Store Key: enableDiscourseAnalysis
Enable Response Recommendations
The AI suggests relevant follow-up questions or actions based on conversation context.
- Default: Enabled
- Store Key: enableResponseRecommendations
Show Suggested Follow-Ups
Display AI-suggested follow-up questions after responses. These appear as clickable chips below the assistant's message.
- Default: Enabled
- Store Key: showSuggestedFollowUps
Privacy Settings
Store Transcript History
When enabled, voice transcripts are stored in the conversation history. Disable for ephemeral voice sessions.
- Default: Enabled
- Store Key: storeTranscriptHistory
Share Anonymous Analytics
Opt-in to share anonymized voice interaction metrics to help improve the service. No transcript content or personal data is shared - only timing metrics (latency, error rates).
- Default: Disabled
- Store Key: shareAnonymousAnalytics
Persistence
Voice preferences are now stored in two locations for maximum reliability:
- Backend API (Primary): Settings are synced to /api/voice/preferences and stored in the database. This enables cross-device settings sync when logged in.
- Local Storage (Fallback): Settings are also cached locally under voiceassist-voice-settings for offline access and faster loading.
Changes are debounced (1 second) before being sent to the backend to reduce API calls while editing.
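A minimal sketch of that debounce, assuming a hypothetical `savePreferences` function that PUTs to /api/voice/preferences:

```typescript
function createDebouncedSave(
  savePreferences: (settings: Record<string, unknown>) => Promise<void>,
  delayMs = 1000 // matches the documented 1-second debounce
) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (settings: Record<string, unknown>) => {
    if (timer) clearTimeout(timer);
    // Only the last change within the window is sent to the backend.
    timer = setTimeout(() => void savePreferences(settings), delayMs);
  };
}
```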
Resetting to Defaults
Click "Reset to defaults" in the settings modal to restore all settings to their original values:
Core Settings
- Voice: Alloy
- Language: English
- VAD Sensitivity: 50%
- Auto-start: Disabled
- Show hints: Enabled
- Context-aware style: Enabled
- Stability: 50%
- Clarity: 75%
- Expressiveness: 0%
Phase 7 Defaults
- Auto Language Detection: Enabled
- Language Switch Confidence: 75%
- Accent Profile ID: null
Phase 8 Defaults
- VAD Calibrated: false
- Last Calibration Date: null
- Personalized VAD Threshold: null
- Adaptive Learning: Enabled
Phase 9 Defaults
- Offline Fallback: Enabled
- Prefer Local VAD: Disabled
- TTS Cache: Enabled
Phase 10 Defaults
- Sentiment Tracking: Enabled
- Discourse Analysis: Enabled
- Response Recommendations: Enabled
- Show Suggested Follow-Ups: Enabled
Privacy Defaults
- Store Transcript History: Enabled
- Share Anonymous Analytics: Disabled
Reset also syncs to the backend via POST /api/voice/preferences/reset.
Voice Preferences API (New)
The following API endpoints manage voice preferences:
| Endpoint | Method | Description |
|---|---|---|
| /api/voice/preferences | GET | Get user's voice preferences |
| /api/voice/preferences | PUT | Update preferences (partial update) |
| /api/voice/preferences/reset | POST | Reset to defaults |
| /api/voice/style-presets | GET | Get available style presets |
Response Headers
TTS synthesis requests now include additional headers:
- X-TTS-Provider: Which provider was used (openai or elevenlabs)
- X-TTS-Fallback: Whether fallback was used (true/false)
- X-TTS-Style: Detected style if context-aware is enabled
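Clients can inspect these headers on the synthesis response; for example (the endpoint path below is a placeholder, only the header names come from this guide):

```typescript
// Hypothetical synthesis request; adjust the URL to the real TTS endpoint.
const response = await fetch("/api/voice/tts", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ text: "Hello" }),
});

console.log("Provider:", response.headers.get("X-TTS-Provider")); // "openai" | "elevenlabs"
console.log("Fallback:", response.headers.get("X-TTS-Fallback")); // "true" | "false"
console.log("Style:", response.headers.get("X-TTS-Style"));       // detected style, if any
```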
Technical Details
Store Location
Settings are managed by a Zustand store with persistence:
apps/web-app/src/stores/voiceSettingsStore.ts
Component Locations
- Settings UI: apps/web-app/src/components/voice/VoiceModeSettings.tsx
- Enhanced Settings: apps/web-app/src/components/voice/VoiceSettingsEnhanced.tsx
- Calibration Dialog: apps/web-app/src/components/voice/CalibrationDialog.tsx
Phase 9 Offline/Network Files
- Network Monitor: apps/web-app/src/lib/offline/networkMonitor.ts
- WebRTC VAD: apps/web-app/src/lib/offline/webrtcVAD.ts
- Offline Types: apps/web-app/src/lib/offline/types.ts
- Network Status Hook: apps/web-app/src/hooks/useNetworkStatus.ts
- Offline VAD Hook: apps/web-app/src/hooks/useOfflineVAD.ts
Backend Files (New)
- Model: services/api-gateway/app/models/user_voice_preferences.py
- Style Detector: services/api-gateway/app/services/voice_style_detector.py
- API Endpoints: services/api-gateway/app/api/voice.py (preferences section)
- Schemas: services/api-gateway/app/api/voice_schemas/schemas.py
Frontend Sync Hook (New)
apps/web-app/src/hooks/useVoicePreferencesSync.ts
Handles loading/saving preferences to backend with debouncing.
Integration Points
- `VoiceModePanel.tsx` - Displays the settings button and uses store values
- `MessageInput.tsx` - Reads `autoStartOnOpen` for auto-open behavior
- `useVoicePreferencesSync.ts` - Backend sync on auth and setting changes
Advanced: Voice Mode Pipeline
Settings are not just UI preferences - they propagate into real-time voice sessions:
- Voice/Language: Sent to `/api/voice/realtime-session` and used by the OpenAI Realtime API
- VAD Sensitivity: Mapped to the server-side VAD threshold (0 → insensitive, 100 → sensitive)
For comprehensive pipeline documentation including backend integration, WebSocket connections, and metrics, see VOICE_MODE_PIPELINE.md.
Development: Running Tests
Run the voice settings test suites individually to avoid memory issues:
```bash
cd apps/web-app

# Unit tests for voice settings store (core)
npx vitest run src/stores/__tests__/voiceSettingsStore.test.ts --reporter=dot

# Unit tests for voice settings store (Phase 7-10)
npx vitest run src/stores/__tests__/voiceSettingsStore-phase7-10.test.ts --reporter=dot

# Unit tests for network monitor
npx vitest run src/lib/offline/__tests__/networkMonitor.test.ts --reporter=dot

# Component tests for VoiceModeSettings
npx vitest run src/components/voice/__tests__/VoiceModeSettings.test.tsx --reporter=dot

# Integration tests for MessageInput voice settings
npx vitest run src/components/chat/__tests__/MessageInput-voice-settings.test.tsx --reporter=dot
```
Test Coverage
The test suites cover:
voiceSettingsStore.test.ts (17 tests)
- Default values verification
- All setter functions (voice, language, sensitivity, toggles)
- VAD sensitivity clamping (0-100 range)
- Reset functionality
- LocalStorage persistence
voiceSettingsStore-phase7-10.test.ts (41 tests)
- Phase 7: Multilingual settings (accent profile, auto-detection, confidence)
- Phase 8: Calibration settings (VAD calibrated, dates, thresholds)
- Phase 9: Offline mode settings (fallback, prefer offline VAD, TTS cache)
- Phase 10: Conversation intelligence (sentiment, discourse, recommendations)
- Privacy settings (transcript history, anonymous analytics)
- Persistence tests for all Phase 7-10 settings
- Reset tests verifying all defaults
networkMonitor.test.ts (13 tests)
- Initial state detection (online/offline)
- Health check latency measurement
- Quality computation from latency thresholds
- Consecutive failure handling before marking unhealthy
- Subscription/unsubscription for status changes
- Custom configuration (latency thresholds, health check URL)
- Offline detection via navigator.onLine
VoiceModeSettings.test.tsx (25 tests)
- Modal visibility (isOpen prop)
- Current settings display
- Settings updates via UI interactions
- Reset with confirmation
- Close behavior (Done, X, backdrop)
- Accessibility (labels, ARIA attributes)
MessageInput-voice-settings.test.tsx (12 tests)
- Auto-open via store setting (autoStartOnOpen)
- Auto-open via prop (autoOpenRealtimeVoice)
- Combined settings behavior
- Voice/language display in panel header
- Status hints visibility toggle
Total: 108+ tests for voice settings and related functionality.
Notes
- Tests mock `useRealtimeVoiceSession` and `WaveformVisualizer` to avoid browser API dependencies
- Run tests individually rather than the full suite to prevent memory issues
- All tests use Vitest + React Testing Library
- Phase 7-10 tests also mock `fetch` and `performance.now` for network monitoring
Thinker Service
Location: `services/api-gateway/app/services/thinker_service.py`
Status: Production Ready Last Updated: 2025-12-01
Overview
The ThinkerService is the reasoning engine of the Thinker-Talker voice pipeline. It manages conversation context, orchestrates LLM interactions, and handles tool calling with result injection.
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ ThinkerService │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ ConversationContext │◄──│ ThinkerSession │ │
│ │ (shared memory) │ │ (per-request) │ │
│ └──────────────────┘ └──────────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────────┐ │
│ │ │ LLMClient │ │
│ │ │ (GPT-4o) │ │
│ │ └──────────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────────┐ │
│ │ │ ToolRegistry │ │
│ │ │ (calendar, search,│ │
│ │ │ medical, KB) │ │
│ └──────────────┴──────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Classes
ThinkerService
Main service class (singleton pattern).
```python
from app.services.thinker_service import thinker_service

# Create a session for a conversation
session = thinker_service.create_session(
    conversation_id="conv-123",
    on_token=handle_token,          # Called for each LLM token
    on_tool_call=handle_tool_call,  # Called when tool is invoked
    on_tool_result=handle_result,   # Called when tool returns
    user_id="user-456",             # Required for authenticated tools
)

# Process user input
response = await session.think("What's on my calendar today?")
```
Methods
| Method | Description | Parameters | Returns |
|---|---|---|---|
create_session() | Create a thinking session | conversation_id, on_token, on_tool_call, on_tool_result, system_prompt, user_id | ThinkerSession |
register_tool() | Register a new tool | name, description, parameters, handler | None |
ThinkerSession
Session class for processing individual requests.
```python
class ThinkerSession:
    """
    A single thinking session with streaming support.

    Manages the flow:
    1. Receive user input
    2. Add to conversation context
    3. Call LLM with streaming
    4. Handle tool calls if needed
    5. Stream response tokens to callback
    """
```
Methods
| Method | Description | Parameters | Returns |
|---|---|---|---|
think() | Process user input | user_input: str, source_mode: str | ThinkerResponse |
cancel() | Cancel processing | None | None |
get_context() | Get conversation context | None | ConversationContext |
get_metrics() | Get session metrics | None | ThinkerMetrics |
Properties
| Property | Type | Description |
|---|---|---|
state | ThinkingState | Current processing state |
ConversationContext
Manages conversation history with smart trimming.
```python
class ConversationContext:
    MAX_HISTORY_MESSAGES = 20  # Maximum messages to retain
    MAX_CONTEXT_TOKENS = 8000  # Token budget for context

    def __init__(self, conversation_id: str, system_prompt: str = None):
        self.conversation_id = conversation_id
        self.messages: List[ConversationMessage] = []
        self.system_prompt = system_prompt or self._default_system_prompt()
```
Smart Trimming
When message count exceeds MAX_HISTORY_MESSAGES, the context performs smart trimming:
```python
def _smart_trim(self) -> None:
    """
    Trim messages while preserving tool call chains.

    OpenAI requires: assistant (with tool_calls) -> tool (with tool_call_id)
    We can't break this chain or the API will reject the request.
    """
```
Rules (illustrated in the sketch after this list):
- Never trim an assistant message if the next message is a tool result
- Never trim a tool message (it needs its preceding assistant message)
- Find the first safe trim point that doesn't break chains
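A minimal sketch of these rules, assuming the `ConversationMessage` shape shown later in this document (illustrative only, not the production `_smart_trim` implementation):

```python
def smart_trim(messages: list, max_messages: int = 20) -> list:
    """Illustrative trim that never orphans a tool result (sketch, not production code)."""
    if len(messages) <= max_messages:
        return messages

    # Start at the earliest cut point that keeps us under the limit, then move the
    # cut forward until the first retained message is NOT a tool result; a tool
    # message must stay with the assistant message that requested it.
    start = len(messages) - max_messages
    while start < len(messages) and messages[start].role == "tool":
        start += 1

    return messages[start:]
```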
Methods
| Method | Description |
|---|---|
add_message() | Add a message to history |
get_messages_for_llm() | Format messages for OpenAI API |
clear() | Clear all history |
ToolRegistry
Registry for available tools.
```python
class ToolRegistry:
    def register(
        self,
        name: str,
        description: str,
        parameters: Dict,
        handler: Callable[[Dict], Awaitable[Any]],
    ) -> None:
        """Register a tool with its schema and handler."""

    def get_tools_schema(self) -> List[Dict]:
        """Get all tool schemas for LLM API."""

    async def execute(self, tool_name: str, arguments: Dict, user_id: str) -> Any:
        """Execute a tool and return its result."""
```
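As a hypothetical usage sketch (the `get_weather` tool and its handler are invented for illustration and do not exist in the codebase):

```python
# Hypothetical example: registering a custom tool against this interface.
async def get_weather(args: dict) -> dict:
    """Illustrative handler; a real implementation would call a weather API."""
    return {"city": args["city"], "forecast": "sunny", "high_f": 72}

registry = ToolRegistry()
registry.register(
    name="get_weather",
    description="Get today's weather forecast for a city",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
    handler=get_weather,
)

# The schemas are what get passed to the LLM for function calling.
tools_for_llm = registry.get_tools_schema()
```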
Data Classes
ThinkingState
```python
class ThinkingState(str, Enum):
    IDLE = "idle"                  # Waiting for input
    PROCESSING = "processing"      # Building request
    TOOL_CALLING = "tool_calling"  # Executing tool
    GENERATING = "generating"      # Streaming response
    COMPLETE = "complete"          # Finished successfully
    CANCELLED = "cancelled"        # User interrupted
    ERROR = "error"                # Error occurred
```
ConversationMessage
```python
@dataclass
class ConversationMessage:
    role: str               # "user", "assistant", "system", "tool"
    content: str
    message_id: str         # Auto-generated UUID
    timestamp: float        # Unix timestamp
    source_mode: str        # "chat" or "voice"
    tool_call_id: str       # For tool results
    tool_calls: List[Dict]  # For assistant messages with tool calls
    citations: List[Dict]   # Source citations
```
ThinkerResponse
```python
@dataclass
class ThinkerResponse:
    text: str                   # Complete response text
    message_id: str             # Unique ID
    citations: List[Dict]       # Source citations
    tool_calls_made: List[str]  # Names of tools called
    latency_ms: int             # Total processing time
    tokens_used: int            # Token count
    state: ThinkingState        # Final state
```
ThinkerMetrics
```python
@dataclass
class ThinkerMetrics:
    total_tokens: int = 0
    tool_calls_count: int = 0
    first_token_latency_ms: int = 0
    total_latency_ms: int = 0
    cancelled: bool = False
```
Available Tools
The ThinkerService automatically registers tools from the unified ToolService:
| Tool | Description | Requires Auth |
|---|---|---|
calendar_create_event | Create calendar events | Yes |
calendar_list_events | List upcoming events | Yes |
calendar_update_event | Modify existing events | Yes |
calendar_delete_event | Remove events | Yes |
web_search | Search the web | No |
pubmed_search | Search medical literature | No |
medical_calculator | Calculate medical scores | No |
kb_search | Search knowledge base | No |
System Prompt
The default system prompt includes:
- Current Time Context: Dynamic date/time with relative calculations
- Conversation Memory: Instructions to use conversation history
- Tool Usage Guidelines: When and how to use each tool
- Response Style: Concise, natural, voice-optimized
```python
def _default_system_prompt(self) -> str:
    tz = pytz.timezone("America/New_York")
    now = datetime.now(tz)
    return f"""You are VoiceAssist, a helpful AI voice assistant.

CURRENT TIME CONTEXT:
- Current date: {now.strftime("%A, %B %d, %Y")}
- Current time: {now.strftime("%I:%M %p %Z")}

CONVERSATION MEMORY:
You have access to the full conversation history...

AVAILABLE TOOLS:
- calendar_create_event: Create events...
- web_search: Search the web...
...

KEY BEHAVIORS:
- Keep responses concise and natural for voice
- Use short sentences (max 15-20 words)
- Avoid abbreviations - say "blood pressure" not "BP"
"""
```
Usage Examples
Basic Query Processing
```python
from app.services.thinker_service import thinker_service

async def handle_voice_query(conversation_id: str, transcript: str, user_id: str):
    # Token streaming callback
    async def on_token(token: str):
        await send_to_tts(token)

    # Create session with callbacks
    session = thinker_service.create_session(
        conversation_id=conversation_id,
        on_token=on_token,
        user_id=user_id,
    )

    # Process the transcript
    response = await session.think(transcript, source_mode="voice")

    print(f"Response: {response.text}")
    print(f"Tools used: {response.tool_calls_made}")
    print(f"Latency: {response.latency_ms}ms")
```
With Tool Call Handling
```python
async def handle_tool_call(event: ToolCallEvent):
    """Called when LLM decides to call a tool."""
    await send_to_client({
        "type": "tool.call",
        "tool_name": event.tool_name,
        "arguments": event.arguments,
    })

async def handle_tool_result(event: ToolResultEvent):
    """Called when tool execution completes."""
    await send_to_client({
        "type": "tool.result",
        "tool_name": event.tool_name,
        "result": event.result,
    })

session = thinker_service.create_session(
    conversation_id="conv-123",
    on_token=on_token,
    on_tool_call=handle_tool_call,
    on_tool_result=handle_tool_result,
    user_id="user-456",
)
```
Cancellation (Barge-in)
```python
# Store session reference
active_session = thinker_service.create_session(...)

# When user barges in:
async def handle_barge_in():
    await active_session.cancel()
    print(f"Cancelled: {active_session.is_cancelled()}")
```
Context Persistence
Conversation contexts are persisted across turns:
```python
# Class-level storage
_conversation_contexts: Dict[str, ConversationContext] = {}
_context_last_access: Dict[str, float] = {}
CONTEXT_TTL_SECONDS = 3600  # 1 hour TTL
```
- Contexts are automatically cleaned up after 1 hour of inactivity
- Same conversation_id reuses existing context
- Context persists across voice and chat modes
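A simplified sketch of how a TTL sweep over these dictionaries could look (illustrative; the actual cleanup logic in the service may differ):

```python
import time
from typing import Dict

def cleanup_stale_contexts(
    contexts: Dict[str, "ConversationContext"],
    last_access: Dict[str, float],
    ttl_seconds: float = 3600,
) -> None:
    """Illustrative sweep: drop contexts idle longer than the TTL."""
    now = time.time()
    for conversation_id in list(contexts.keys()):
        if now - last_access.get(conversation_id, 0) > ttl_seconds:
            contexts.pop(conversation_id, None)
            last_access.pop(conversation_id, None)
```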
Error Handling
```python
try:
    response = await session.think(transcript)
except Exception as e:
    # Errors are caught and returned in response
    response = ThinkerResponse(
        text=f"I apologize, but I encountered an error: {str(e)}",
        message_id=message_id,
        state=ThinkingState.ERROR,
    )
```
Related Documentation
Talker Service
Location: `services/api-gateway/app/services/talker_service.py`
Status: Production Ready Last Updated: 2025-12-01
Overview
The TalkerService handles text-to-speech synthesis for the Thinker-Talker voice pipeline. It streams LLM tokens through a sentence chunker and synthesizes speech via ElevenLabs for gapless audio playback.
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ TalkerService │
│ │
│ LLM Tokens ──►┌──────────────────┐ │
│ │ Markdown Buffer │ (accumulates for pattern │
│ │ │ detection before strip) │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ SentenceChunker │ (splits at natural │
│ │ (40-120-200 chars)│ boundaries) │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ strip_markdown │ (removes **bold**, │
│ │ _for_tts() │ [links](url), LaTeX) │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ ElevenLabs TTS │ (streaming synthesis │
│ │ (sequential) │ with previous_text) │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ Audio Chunks ──► on_audio_chunk callback │
└─────────────────────────────────────────────────────────────────┘
Classes
TalkerService
Main service class (singleton pattern).
```python
from app.services.talker_service import talker_service

# Check if TTS is available
if talker_service.is_enabled():
    # Start a speaking session (uses DEFAULT_VOICE_ID from voice_constants.py)
    session = await talker_service.start_session(
        on_audio_chunk=handle_audio,
        voice_config=VoiceConfig(
            # voice_id defaults to DEFAULT_VOICE_ID (Brian)
            stability=0.65,
        ),
    )

    # Feed tokens from LLM
    for token in llm_stream:
        await session.add_token(token)

    # Finish and get metrics
    metrics = await session.finish()
```
Methods
| Method | Description | Parameters | Returns |
|---|---|---|---|
is_enabled() | Check if TTS is available | None | bool |
get_provider() | Get active TTS provider | None | TTSProvider |
start_session() | Start a TTS session | on_audio_chunk, voice_config | TalkerSession |
synthesize_text() | Simple text synthesis | text, voice_config | AsyncIterator[bytes] |
get_available_voices() | List available voices | None | List[Dict] |
TalkerSession
Session class for streaming TTS.
```python
class TalkerSession:
    """
    A single TTS speaking session with streaming support.

    Manages the flow:
    1. Receive LLM tokens
    2. Chunk into sentences
    3. Synthesize each sentence
    4. Stream audio chunks to callback
    """
```
Methods
| Method | Description | Parameters | Returns |
|---|---|---|---|
add_token() | Add token from LLM | token: str | None |
finish() | Complete synthesis | None | TalkerMetrics |
cancel() | Cancel for barge-in | None | None |
get_metrics() | Get session metrics | None | TalkerMetrics |
Properties
| Property | Type | Description |
|---|---|---|
state | TalkerState | Current state |
AudioQueue
Queue management for gapless playback.
```python
class AudioQueue:
    """
    Manages audio chunks for gapless playback with cancellation support.

    Features:
    - Async queue for audio chunks
    - Cancellation clears pending audio
    - Tracks queue state
    """

    async def put(self, chunk: AudioChunk) -> bool
    async def get(self) -> Optional[AudioChunk]
    async def cancel(self) -> None
    def finish(self) -> None
    def reset(self) -> None
```
Data Classes
TalkerState
```python
class TalkerState(str, Enum):
    IDLE = "idle"            # Ready for input
    SPEAKING = "speaking"    # Synthesizing/playing
    CANCELLED = "cancelled"  # Interrupted by barge-in
```
TTSProvider
```python
class TTSProvider(str, Enum):
    ELEVENLABS = "elevenlabs"
    OPENAI = "openai"  # Fallback
```
VoiceConfig
Note: The default voice is configured in `app/core/voice_constants.py`. See Voice Configuration for details.
```python
from app.core.voice_constants import DEFAULT_VOICE_ID, DEFAULT_TTS_MODEL

@dataclass
class VoiceConfig:
    provider: TTSProvider = TTSProvider.ELEVENLABS
    voice_id: str = DEFAULT_VOICE_ID   # Brian (from voice_constants.py)
    model_id: str = DEFAULT_TTS_MODEL  # eleven_flash_v2_5
    stability: float = 0.65            # 0.0-1.0, higher = consistent
    similarity_boost: float = 0.80     # 0.0-1.0, higher = clearer
    style: float = 0.15                # 0.0-1.0, lower = natural
    use_speaker_boost: bool = True
    output_format: str = "pcm_24000"
```
AudioChunk
```python
@dataclass
class AudioChunk:
    data: bytes          # Raw audio bytes
    format: str          # "pcm16" or "mp3"
    is_final: bool       # True for last chunk
    sentence_index: int  # Which sentence this is from
    latency_ms: int      # Time since synthesis started
```
TalkerMetrics
```python
@dataclass
class TalkerMetrics:
    sentences_processed: int = 0
    total_chars_synthesized: int = 0
    total_audio_bytes: int = 0
    total_latency_ms: int = 0
    first_audio_latency_ms: int = 0
    cancelled: bool = False
```
Sentence Chunking
The TalkerSession uses SentenceChunker with these settings:
```python
self._chunker = SentenceChunker(
    ChunkerConfig(
        min_chunk_chars=40,       # Avoid tiny fragments
        optimal_chunk_chars=120,  # Full sentences
        max_chunk_chars=200,      # Allow complete thoughts
    )
)
```
Why These Settings?
| Parameter | Value | Rationale |
|---|---|---|
min_chunk_chars | 40 | Prevents choppy TTS from short phrases |
optimal_chunk_chars | 120 | Full sentences sound more natural |
max_chunk_chars | 200 | Prevents excessive buffering |
Trade-off: Larger chunks = better prosody but higher latency to first audio.
Markdown Stripping
LLM responses often contain markdown that sounds unnatural when spoken:
````python
def strip_markdown_for_tts(text: str) -> str:
    """
    Converts:
    - [Link Text](URL) → "Link Text"
    - **bold** → "bold"
    - *italic* → "italic"
    - `code` → "code"
    - ```blocks``` → (removed)
    - # Headers → "Headers"
    - LaTeX formulas → (removed)
    """
````
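As a rough illustration of the kinds of transforms involved (the production function also handles LaTeX and other cases), a minimal regex-based subset could look like this:

````python
import re

def strip_markdown_for_tts_sketch(text: str) -> str:
    """Illustrative subset of the real stripping logic (not the production code)."""
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)  # drop fenced code blocks
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)    # [Link Text](URL) -> Link Text
    text = re.sub(r"\*\*([^*]+)\*\*", r"\1", text)          # **bold** -> bold
    text = re.sub(r"\*([^*]+)\*", r"\1", text)              # *italic* -> italic
    text = re.sub(r"`([^`]+)`", r"\1", text)                # `code` -> code
    text = re.sub(r"^#+\s*", "", text, flags=re.MULTILINE)  # # Headers -> Headers
    return text
````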
Markdown-Aware Token Buffering
The TalkerSession buffers tokens to detect incomplete patterns:
```python
def _process_markdown_token(self, token: str) -> str:
    """
    Accumulates tokens to detect patterns that should be stripped:
    - Markdown links: [text](url) - wait for closing )
    - LaTeX display: [ ... ] with backslashes
    - LaTeX inline: \\( ... \\)
    - Bold/italic: **text** - wait for closing **
    """
```
This prevents sending "[Link Te" to TTS before we know it's a markdown link.
Voice Continuity
For consistent voice across sentences:
```python
async for audio_data in self._elevenlabs.synthesize_stream(
    text=tts_text,
    previous_text=self._previous_text,  # Context for voice continuity
    ...
):
    ...

# Update for next synthesis
self._previous_text = tts_text
```
The previous_text parameter helps ElevenLabs maintain consistent prosody.
Sequential Synthesis
To prevent voice variations between chunks:
```python
# Semaphore ensures one synthesis at a time
self._synthesis_semaphore = asyncio.Semaphore(1)

async with self._synthesis_semaphore:
    async for audio_data in self._elevenlabs.synthesize_stream(...):
        ...
```
Parallel synthesis can cause noticeable voice quality differences between sentences.
Usage Examples
Basic Token Streaming
```python
async def handle_llm_response(llm_stream):
    async def on_audio_chunk(chunk: AudioChunk):
        # Send to client via WebSocket
        await websocket.send_json({
            "type": "audio.output",
            "audio": base64.b64encode(chunk.data).decode(),
            "is_final": chunk.is_final,
        })

    session = await talker_service.start_session(on_audio_chunk=on_audio_chunk)

    async for token in llm_stream:
        await session.add_token(token)

    metrics = await session.finish()
    print(f"Synthesized {metrics.sentences_processed} sentences")
    print(f"First audio in {metrics.first_audio_latency_ms}ms")
```
Custom Voice Configuration
```python
config = VoiceConfig(
    voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel (female)
    model_id="eleven_flash_v2_5",     # Lower latency
    stability=0.65,                   # More variation
    similarity_boost=0.90,            # Very clear
    style=0.15,                       # Slightly expressive
)

session = await talker_service.start_session(
    on_audio_chunk=handle_audio,
    voice_config=config,
)
```
Handling Barge-in
```python
active_session = None

async def start_speaking(llm_stream):
    global active_session
    active_session = await talker_service.start_session(on_audio_chunk=send_audio)

    for token in llm_stream:
        if active_session.is_cancelled():
            break
        await active_session.add_token(token)

    await active_session.finish()

async def handle_barge_in():
    global active_session
    if active_session:
        await active_session.cancel()
        # Cancels pending synthesis and clears audio queue
```
Simple Text Synthesis
```python
# For non-streaming use cases
async for audio_chunk in talker_service.synthesize_text(
    text="Hello, how can I help you today?",
    voice_config=VoiceConfig(voice_id="TxGEqnHWrfWFTfGW9XjX"),
):
    await send_audio(audio_chunk)
```
Available Voices
```python
voices = talker_service.get_available_voices()

# Returns:
[
    {"id": "TxGEqnHWrfWFTfGW9XjX", "name": "Josh", "gender": "male", "premium": True},
    {"id": "pNInz6obpgDQGcFmaJgB", "name": "Adam", "gender": "male", "premium": True},
    {"id": "EXAVITQu4vr4xnSDxMaL", "name": "Bella", "gender": "female", "premium": True},
    {"id": "21m00Tcm4TlvDq8ikWAM", "name": "Rachel", "gender": "female", "premium": True},
    # ... more voices
]
```
Performance Tuning
Latency Optimization
| Setting | Lower Latency | Higher Quality |
|---|---|---|
model_id | eleven_flash_v2_5 | eleven_turbo_v2_5 |
min_chunk_chars | 15 | 40 |
optimal_chunk_chars | 50 | 120 |
output_format | pcm_24000 | mp3_44100_192 |
Quality Optimization
| Setting | More Natural | More Consistent |
|---|---|---|
stability | 0.50 | 0.85 |
similarity_boost | 0.70 | 0.90 |
style | 0.20 | 0.05 |
Error Handling
Synthesis errors don't fail the entire session:
```python
async def _synthesize_sentence(self, sentence: str) -> None:
    try:
        async for audio_data in self._elevenlabs.synthesize_stream(...):
            if self._state == TalkerState.CANCELLED:
                return
            await self._on_audio_chunk(chunk)
    except Exception as e:
        logger.error(f"TTS synthesis error: {e}")
        # Session continues, just skips this sentence
```
Related Documentation
Voice Pipeline WebSocket API
Endpoint: `wss://{host}/api/voice/pipeline-ws`
Protocol: JSON over WebSocket Status: Production Ready Last Updated: 2025-12-02
Overview
The Voice Pipeline WebSocket provides bidirectional communication for the Thinker-Talker voice mode. It handles audio streaming, transcription, LLM responses, and TTS playback.
Connection
Authentication
Include JWT token in connection URL or headers:
const ws = new WebSocket(`wss://assist.asimo.io/api/voice/pipeline-ws?token=${accessToken}`);
Connection Lifecycle
1. Client connects with auth token
│
2. Server accepts, creates pipeline session
│
3. Server sends: session.ready
│
4. Client sends: session.init (optional config)
│
5. Server acknowledges: session.init.ack
│
6. Voice mode active - bidirectional streaming
│
7. Client or server closes connection
Message Format
All messages are JSON objects with a type field:
{ "type": "message_type", "field1": "value1", "field2": "value2" }
Client → Server Messages
session.init
Initialize or reconfigure the session.
{ "type": "session.init", "conversation_id": "conv-123", "voice_settings": { "voice_id": "TxGEqnHWrfWFTfGW9XjX", "language": "en", "barge_in_enabled": true } }
| Field | Type | Required | Description |
|---|---|---|---|
conversation_id | string | No | Link to existing chat conversation |
voice_settings.voice_id | string | No | ElevenLabs voice ID |
voice_settings.language | string | No | STT language code (default: "en") |
voice_settings.barge_in_enabled | boolean | No | Allow user interruption (default: true) |
audio.input
Stream audio from microphone.
{ "type": "audio.input", "audio": "base64_encoded_pcm16_audio" }
| Field | Type | Required | Description |
|---|---|---|---|
audio | string | Yes | Base64-encoded PCM16 audio (16kHz, mono) |
Audio Format Requirements (see the encoding sketch after this list):
- Sample rate: 16000 Hz
- Channels: 1 (mono)
- Bit depth: 16-bit signed PCM
- Encoding: Little-endian
- Chunk size: ~100ms recommended (1600 samples)
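A sketch of producing this payload from Float32 microphone samples (for example, from an AudioWorklet); the `encodeAudioInput` helper is illustrative, and it assumes the platform's native little-endian byte order:

```typescript
// Sketch: encode Float32 mic samples as base64 PCM16 (16 kHz mono, little-endian).
function encodeAudioInput(samples: Float32Array): string {
  const pcm16 = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to [-1, 1]
    pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;      // scale to signed 16-bit
  }

  // Base64-encode the raw bytes
  const bytes = new Uint8Array(pcm16.buffer);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

// Usage: ws.send(JSON.stringify({ type: "audio.input", audio: encodeAudioInput(chunk) }));
```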
audio.input.complete
Signal end of user speech (manual commit).
{ "type": "audio.input.complete" }
Normally, VAD auto-detects speech end. Use this for push-to-talk implementations.
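For a push-to-talk UI, the commit could be sent when the user releases the talk button (sketch; `ws` and `talkButton` are assumed to already exist in scope):

```typescript
// Sketch: push-to-talk release sends a manual end-of-speech commit.
declare const ws: WebSocket;
declare const talkButton: HTMLButtonElement;

talkButton.addEventListener("pointerup", () => {
  ws.send(JSON.stringify({ type: "audio.input.complete" }));
});
```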
barge_in
Interrupt AI response.
{ "type": "barge_in" }
When received:
- Cancels TTS synthesis
- Clears audio queue
- Resets pipeline to listening state
message
Send text input (fallback when mic unavailable).
{ "type": "message", "content": "What's the weather like?" }
ping
Keep-alive heartbeat.
{ "type": "ping" }
Server responds with pong.
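A client-side keep-alive could look like the sketch below; the 15-second interval is an arbitrary illustrative choice, not a documented requirement:

```typescript
// Sketch: periodic keep-alive ping; the server replies with { "type": "pong" }.
declare const ws: WebSocket;

const keepAlive = setInterval(() => {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({ type: "ping" }));
  }
}, 15_000);

ws.addEventListener("close", () => clearInterval(keepAlive));
```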
Server → Client Messages
session.ready
Session initialized successfully.
{ "type": "session.ready", "session_id": "sess-abc123", "pipeline_mode": "thinker_talker" }
session.init.ack
Acknowledges session.init message.
{ "type": "session.init.ack" }
transcript.delta
Partial STT transcript (streaming).
{ "type": "transcript.delta", "text": "What is the", "is_final": false }
| Field | Type | Description |
|---|---|---|
text | string | Partial transcript text |
is_final | boolean | Always false for delta |
transcript.complete
Final STT transcript.
{ "type": "transcript.complete", "text": "What is the weather today?", "message_id": "msg-xyz789" }
| Field | Type | Description |
|---|---|---|
text | string | Complete transcript |
message_id | string | Unique message identifier |
response.delta
Streaming LLM response token.
{ "type": "response.delta", "delta": "The", "message_id": "resp-123" }
| Field | Type | Description |
|---|---|---|
delta | string | Response token/chunk |
message_id | string | Response message ID |
response.complete
Complete LLM response.
{ "type": "response.complete", "text": "The weather today is sunny with a high of 72 degrees.", "message_id": "resp-123" }
audio.output
TTS audio chunk.
{ "type": "audio.output", "audio": "base64_encoded_pcm_audio", "is_final": false, "sentence_index": 0 }
| Field | Type | Description |
|---|---|---|
audio | string | Base64-encoded PCM audio (24kHz, mono) |
is_final | boolean | True for last chunk |
sentence_index | number | Which sentence this is from |
Output Audio Format:
- Sample rate: 24000 Hz
- Channels: 1 (mono)
- Bit depth: 16-bit signed PCM
- Encoding: Little-endian
tool.call
Tool invocation started.
{ "type": "tool.call", "id": "call-abc", "name": "calendar_list_events", "arguments": { "start_date": "2025-12-01", "end_date": "2025-12-07" } }
| Field | Type | Description |
|---|---|---|
id | string | Tool call ID |
name | string | Tool function name |
arguments | object | Tool arguments |
tool.result
Tool execution completed.
{ "type": "tool.result", "id": "call-abc", "name": "calendar_list_events", "result": { "events": [{ "title": "Team Meeting", "start": "2025-12-02T10:00:00" }] } }
| Field | Type | Description |
|---|---|---|
id | string | Tool call ID |
name | string | Tool function name |
result | any | Tool execution result |
voice.state
Pipeline state change.
{ "type": "voice.state", "state": "speaking" }
| State | Description |
|---|---|
idle | Waiting for user input |
listening | Receiving audio, STT active |
processing | LLM thinking |
speaking | TTS playing |
cancelled | Barge-in occurred |
heartbeat
Server heartbeat (every 30s).
{ "type": "heartbeat" }
pong
Response to client ping.
{ "type": "pong" }
error
Error occurred.
{ "type": "error", "code": "stt_failed", "message": "Speech-to-text service unavailable", "recoverable": true }
| Field | Type | Description |
|---|---|---|
code | string | Error code |
message | string | Human-readable message |
recoverable | boolean | True if client can retry |
Error Codes (see the handling sketch after the table):
| Code | Description | Recoverable |
|---|---|---|
invalid_json | Malformed JSON message | Yes |
connection_failed | Pipeline init failed | No |
stt_failed | STT service error | Yes |
llm_failed | LLM service error | Yes |
tts_failed | TTS service error | Yes |
auth_failed | Authentication error | No |
rate_limited | Too many requests | Yes |
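One way a client might route these errors using the `recoverable` flag (sketch; the retry/backoff policy shown is illustrative, not prescribed by the protocol):

```typescript
// Sketch: route errors based on the `recoverable` flag.
interface PipelineError {
  type: "error";
  code: string;
  message: string;
  recoverable: boolean;
}

function handlePipelineError(err: PipelineError, reconnect: () => void, teardown: () => void) {
  console.warn(`Pipeline error [${err.code}]: ${err.message}`);
  if (!err.recoverable) {
    teardown();                   // e.g. auth_failed, connection_failed
    return;
  }
  if (err.code === "rate_limited") {
    setTimeout(reconnect, 5_000); // back off before retrying (illustrative delay)
  } else {
    reconnect();                  // transient stt/llm/tts failures
  }
}
```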
Example: Complete Session
```javascript
// 1. Connect
const ws = new WebSocket(`wss://assist.asimo.io/api/voice/pipeline-ws?token=${token}`);

ws.onopen = () => {
  console.log("Connected");
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);

  switch (msg.type) {
    case "session.ready":
      // 2. Initialize with settings
      ws.send(
        JSON.stringify({
          type: "session.init",
          conversation_id: currentConversationId,
          voice_settings: {
            voice_id: "TxGEqnHWrfWFTfGW9XjX",
            language: "en",
          },
        }),
      );
      break;

    case "session.init.ack":
      // 3. Start sending audio
      startMicrophoneCapture();
      break;

    case "transcript.delta":
      // Show partial transcript
      updatePartialTranscript(msg.text);
      break;

    case "transcript.complete":
      // Show final transcript
      setTranscript(msg.text);
      break;

    case "response.delta":
      // Append LLM response
      appendResponse(msg.delta);
      break;

    case "audio.output":
      // Play TTS audio
      if (msg.audio) {
        const pcm = base64ToArrayBuffer(msg.audio);
        audioPlayer.queueChunk(pcm);
      }
      if (msg.is_final) {
        audioPlayer.finish();
      }
      break;

    case "tool.call":
      // Show tool being called
      showToolCall(msg.name, msg.arguments);
      break;

    case "tool.result":
      // Show tool result
      showToolResult(msg.name, msg.result);
      break;

    case "error":
      console.error(`Error [${msg.code}]: ${msg.message}`);
      if (!msg.recoverable) {
        ws.close();
      }
      break;
  }
};

// Send audio chunks from microphone
function sendAudioChunk(pcmData) {
  ws.send(
    JSON.stringify({
      type: "audio.input",
      audio: arrayBufferToBase64(pcmData),
    }),
  );
}

// Handle barge-in (user speaks while AI is talking)
function handleBargeIn() {
  ws.send(JSON.stringify({ type: "barge_in" }));
  audioPlayer.stop();
}
```
Configuration Reference
TTSessionConfig (Backend)
```python
@dataclass
class TTSessionConfig:
    user_id: str
    session_id: str
    conversation_id: Optional[str] = None

    # Voice settings
    voice_id: str = "TxGEqnHWrfWFTfGW9XjX"
    tts_model: str = "eleven_flash_v2_5"
    language: str = "en"

    # STT settings
    stt_sample_rate: int = 16000
    stt_endpointing_ms: int = 800
    stt_utterance_end_ms: int = 1500

    # Barge-in
    barge_in_enabled: bool = True

    # Timeouts
    connection_timeout_sec: float = 10.0
    idle_timeout_sec: float = 300.0
```
Rate Limiting
| Limit | Value |
|---|---|
| Max concurrent sessions per user | 2 |
| Max concurrent sessions total | 100 |
| Audio chunk rate | ~10/second recommended |
| Idle timeout | 300 seconds |
Related Documentation
Thinker-Talker Frontend Hooks
Location: `apps/web-app/src/hooks/`
Status: Production Ready Last Updated: 2025-12-01
Overview
The Thinker-Talker frontend integration consists of several React hooks that manage WebSocket connections, audio capture, and playback. These hooks provide a complete voice mode implementation.
Hook Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Voice Mode Components │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ useThinkerTalkerVoiceMode │ │
│ │ (High-level orchestration hook) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────┴───────────────┐ │
│ ▼ ▼ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ useThinkerTalkerSession │ │ useTTAudioPlayback │ │
│ │ (WebSocket + Protocol) │ │ (Audio Queue + Play) │ │
│ └─────────────────────────┘ └─────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ WebSocket API │ │ Web Audio API │ │
│ │ (Backend T/T) │ │ (AudioContext) │ │
│ └─────────────────────────┘ └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
useThinkerTalkerSession
Main hook for WebSocket communication with the T/T pipeline.
Import
import { useThinkerTalkerSession } from "../hooks/useThinkerTalkerSession";
Usage
```typescript
const {
  status,
  error,
  transcript,
  partialTranscript,
  pipelineState,
  currentToolCalls,
  metrics,
  connect,
  disconnect,
  sendAudioChunk,
  bargeIn,
} = useThinkerTalkerSession({
  conversation_id: "conv-123",
  voiceSettings: {
    voice_id: "TxGEqnHWrfWFTfGW9XjX",
    language: "en",
    barge_in_enabled: true,
  },
  onTranscript: (t) => console.log("Transcript:", t.text),
  onResponseDelta: (delta, id) => appendToChat(delta),
  onAudioChunk: (audio) => playAudio(audio),
  onToolCall: (tool) => showToolUI(tool),
});
```
Options
```typescript
interface UseThinkerTalkerSessionOptions {
  conversation_id?: string;
  voiceSettings?: TTVoiceSettings;
  onTranscript?: (transcript: TTTranscript) => void;
  onResponseDelta?: (delta: string, messageId: string) => void;
  onResponseComplete?: (content: string, messageId: string) => void;
  onAudioChunk?: (audioBase64: string) => void;
  onToolCall?: (toolCall: TTToolCall) => void;
  onToolResult?: (toolCall: TTToolCall) => void;
  onError?: (error: Error) => void;
  onConnectionChange?: (status: TTConnectionStatus) => void;
  onPipelineStateChange?: (state: PipelineState) => void;
  onMetricsUpdate?: (metrics: TTVoiceMetrics) => void;
  onSpeechStarted?: () => void;
  onStopPlayback?: () => void;
  autoConnect?: boolean;
}
```
Return Values
| Field | Type | Description |
|---|---|---|
status | TTConnectionStatus | Connection state |
error | Error | null | Last error |
transcript | string | Final user transcript |
partialTranscript | string | Streaming transcript |
pipelineState | PipelineState | Backend pipeline state |
currentToolCalls | TTToolCall[] | Active tool calls |
metrics | TTVoiceMetrics | Performance metrics |
connect | () => Promise<void> | Start session |
disconnect | () => void | End session |
sendAudioChunk | (data: ArrayBuffer) => void | Send audio |
bargeIn | () => void | Interrupt AI |
Types
```typescript
type TTConnectionStatus =
  | "disconnected"
  | "connecting"
  | "connected"
  | "ready"
  | "reconnecting"
  | "error"
  | "failed"
  | "mic_permission_denied";

type PipelineState = "idle" | "listening" | "processing" | "speaking" | "cancelled";

interface TTTranscript {
  text: string;
  is_final: boolean;
  timestamp: number;
  message_id?: string;
}

interface TTToolCall {
  id: string;
  name: string;
  arguments: Record<string, unknown>;
  status: "pending" | "running" | "completed" | "failed";
  result?: unknown;
}

interface TTVoiceMetrics {
  connectionTimeMs: number | null;
  sttLatencyMs: number | null;
  llmFirstTokenMs: number | null;
  ttsFirstAudioMs: number | null;
  totalLatencyMs: number | null;
  sessionDurationMs: number | null;
  userUtteranceCount: number;
  aiResponseCount: number;
  toolCallCount: number;
  bargeInCount: number;
  reconnectCount: number;
  sessionStartedAt: number | null;
}

interface TTVoiceSettings {
  voice_id?: string;
  language?: string;
  barge_in_enabled?: boolean;
  tts_model?: string;
}
```
Reconnection
The hook implements automatic reconnection with exponential backoff:
```typescript
const MAX_RECONNECT_ATTEMPTS = 5;
const BASE_RECONNECT_DELAY = 300; // 300ms
const MAX_RECONNECT_DELAY = 30000; // 30s

// Delay calculation (exponential backoff, capped at the maximum)
const delay = Math.min(BASE_RECONNECT_DELAY * 2 ** attempt, MAX_RECONNECT_DELAY);
```
Fatal errors (mic permission denied) do not trigger reconnection.
useTTAudioPlayback
Handles streaming PCM audio playback with queue management.
Import
import { useTTAudioPlayback } from "../hooks/useTTAudioPlayback";
Usage
```typescript
const { isPlaying, queuedChunks, currentLatency, playAudioChunk, stopPlayback, clearQueue, getAudioContext } =
  useTTAudioPlayback({
    sampleRate: 24000,
    onPlaybackStart: () => console.log("Started playing"),
    onPlaybackEnd: () => console.log("Finished playing"),
    onError: (err) => console.error("Playback error:", err),
  });

// Queue audio from WebSocket
function handleAudioChunk(base64Audio: string) {
  const pcmData = base64ToArrayBuffer(base64Audio);
  playAudioChunk(pcmData);
}

// Handle barge-in
function handleBargeIn() {
  stopPlayback();
  clearQueue();
}
```
Options
```typescript
interface UseTTAudioPlaybackOptions {
  sampleRate?: number; // Default: 24000
  bufferSize?: number; // Default: 4096
  onPlaybackStart?: () => void;
  onPlaybackEnd?: () => void;
  onError?: (error: Error) => void;
}
```
Return Values
| Field | Type | Description |
|---|---|---|
isPlaying | boolean | Audio currently playing |
queuedChunks | number | Chunks waiting to play |
currentLatency | number | Playback latency (ms) |
playAudioChunk | (data: ArrayBuffer) => void | Queue chunk |
stopPlayback | () => void | Stop immediately |
clearQueue | () => void | Clear pending chunks |
getAudioContext | () => AudioContext | Get context |
Audio Format
Expects 24kHz mono PCM16 (little-endian):
```typescript
// Convert base64 to playable audio
function base64ToFloat32(base64: string): Float32Array {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }

  // Convert PCM16 to Float32 for Web Audio
  const pcm16 = new Int16Array(bytes.buffer);
  const float32 = new Float32Array(pcm16.length);
  for (let i = 0; i < pcm16.length; i++) {
    float32[i] = pcm16[i] / 32768;
  }
  return float32;
}
```
useThinkerTalkerVoiceMode
High-level orchestration combining session and playback.
Import
import { useThinkerTalkerVoiceMode } from "../hooks/useThinkerTalkerVoiceMode";
Usage
```typescript
const {
  // Connection
  isConnected,
  isConnecting,
  connectionError,
  // State
  voiceState,
  isListening,
  isProcessing,
  isSpeaking,
  // Transcripts
  transcript,
  partialTranscript,
  // Audio
  isPlaying,
  audioLevel,
  // Tools
  activeToolCalls,
  // Metrics
  metrics,
  // Actions
  connect,
  disconnect,
  toggleVoice,
  bargeIn,
} = useThinkerTalkerVoiceMode({
  conversationId: "conv-123",
  voiceId: "TxGEqnHWrfWFTfGW9XjX",
  onTranscriptComplete: (text) => addMessage("user", text),
  onResponseComplete: (text) => addMessage("assistant", text),
});
```
Options
```typescript
interface UseThinkerTalkerVoiceModeOptions {
  conversationId?: string;
  voiceId?: string;
  language?: string;
  bargeInEnabled?: boolean;
  autoConnect?: boolean;
  onTranscriptComplete?: (text: string) => void;
  onResponseDelta?: (delta: string) => void;
  onResponseComplete?: (text: string) => void;
  onToolCall?: (tool: TTToolCall) => void;
  onError?: (error: Error) => void;
}
```
Return Values
| Field | Type | Description |
|---|---|---|
isConnected | boolean | WebSocket connected |
isConnecting | boolean | Connection in progress |
connectionError | Error | null | Connection error |
voiceState | PipelineState | Current state |
isListening | boolean | STT active |
isProcessing | boolean | LLM thinking |
isSpeaking | boolean | TTS playing |
transcript | string | Final transcript |
partialTranscript | string | Partial transcript |
isPlaying | boolean | Audio playing |
audioLevel | number | Mic level (0-1) |
activeToolCalls | TTToolCall[] | Current tools |
metrics | TTVoiceMetrics | Performance data |
connect | () => Promise<void> | Start voice |
disconnect | () => void | End voice |
toggleVoice | () => void | Toggle on/off |
bargeIn | () => void | Interrupt |
useVoicePreferencesSync
Syncs voice settings with backend.
Import
import { useVoicePreferencesSync } from "../hooks/useVoicePreferencesSync";
Usage
```typescript
const { preferences, isLoading, error, updatePreferences, resetToDefaults } = useVoicePreferencesSync();

// Update voice
await updatePreferences({
  voice_id: "21m00Tcm4TlvDq8ikWAM", // Rachel
  stability: 0.7,
  similarity_boost: 0.8,
});
```
Return Values
| Field | Type | Description |
|---|---|---|
preferences | VoicePreferences | Current settings |
isLoading | boolean | Loading state |
error | Error | null | Last error |
updatePreferences | (prefs) => Promise | Save settings |
resetToDefaults | () => Promise | Reset all |
Complete Example
```tsx
import React, { useCallback } from "react";
import { useThinkerTalkerVoiceMode } from "../hooks/useThinkerTalkerVoiceMode";
import { useVoicePreferencesSync } from "../hooks/useVoicePreferencesSync";

function VoicePanel({ conversationId }: { conversationId: string }) {
  const { preferences } = useVoicePreferencesSync();

  const {
    isConnected,
    isConnecting,
    voiceState,
    transcript,
    partialTranscript,
    activeToolCalls,
    metrics,
    connect,
    disconnect,
    bargeIn,
  } = useThinkerTalkerVoiceMode({
    conversationId,
    voiceId: preferences.voice_id,
    onTranscriptComplete: useCallback((text) => {
      console.log("User said:", text);
    }, []),
    onResponseComplete: useCallback((text) => {
      console.log("AI said:", text);
    }, []),
    onToolCall: useCallback((tool) => {
      console.log("Tool called:", tool.name);
    }, []),
  });

  return (
    <div className="voice-panel">
      {/* Connection status */}
      <div className="status">
        {isConnecting ? "Connecting..." : isConnected ? `Status: ${voiceState}` : "Disconnected"}
      </div>

      {/* Transcript display */}
      <div className="transcript">{transcript || partialTranscript || "Listening..."}</div>

      {/* Tool calls */}
      {activeToolCalls.map((tool) => (
        <div key={tool.id} className="tool-call">
          {tool.name}: {tool.status}
        </div>
      ))}

      {/* Metrics */}
      <div className="metrics">Latency: {metrics.totalLatencyMs}ms</div>

      {/* Controls */}
      <button onClick={isConnected ? disconnect : connect}>{isConnected ? "Stop" : "Start"} Voice</button>
      {voiceState === "speaking" && <button onClick={bargeIn}>Interrupt</button>}
    </div>
  );
}
```
Error Handling
Microphone Permission
```tsx
// The hook detects permission errors
if (status === "mic_permission_denied") {
  return (
    <div className="error">
      <p>Microphone access is required for voice mode.</p>
      <button onClick={requestMicPermission}>Allow Microphone</button>
    </div>
  );
}
```
Connection Errors
```tsx
const { error, status, reconnectAttempts } = useThinkerTalkerSession({
  onError: (err) => {
    if (isMicPermissionError(err)) {
      showPermissionDialog();
    } else {
      showErrorToast(err.message);
    }
  },
});

if (status === "reconnecting") {
  return <div>Reconnecting... (attempt {reconnectAttempts}/5)</div>;
}

if (status === "failed") {
  return <div>Connection failed. Please refresh.</div>;
}
```
Performance Tips
1. Memoize Callbacks
```typescript
const onTranscript = useCallback((t: TTTranscript) => {
  // Handle transcript
}, []);

const onAudioChunk = useCallback(
  (audio: string) => {
    playAudioChunk(base64ToArrayBuffer(audio));
  },
  [playAudioChunk],
);
```
2. Avoid Re-renders
```typescript
// Use refs for frequently updating values
const metricsRef = useRef(metrics);
useEffect(() => {
  metricsRef.current = metrics;
}, [metrics]);
```
3. Batch State Updates
```typescript
// In the hook implementation
const handleMessage = useCallback((msg) => {
  // React 18 automatically batches these
  setTranscript(msg.text);
  setPipelineState(msg.state);
  setMetrics((prev) => ({ ...prev, ...msg.metrics }));
}, []);
```