Thinker-Talker Voice Pipeline
Status: Production Ready Last Updated: 2025-12-01 Phase: Voice Pipeline Migration (Complete)
Overview
The Thinker-Talker (T/T) pipeline is VoiceAssist's voice processing architecture that replaces the OpenAI Realtime API with a local orchestration approach. It provides unified conversation context, full tool/RAG support, and custom TTS with ElevenLabs for superior voice quality.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Thinker-Talker Pipeline │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────┐ │
│ │ Audio │───>│ Deepgram STT │───>│ GPT-4o │───>│ElevenLabs│ │
│ │ Input │ │ (Streaming) │ │ Thinker │ │ TTS │ │
│ └──────────┘ └──────────────┘ └──────────────┘ └──────────┘ │
│ │ │ │ │ │
│ │ Transcripts Tool Calls Audio Out │
│ │ │ │ │ │
│ v v v v │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ WebSocket Handler │ │
│ │ (Bidirectional Client Communication) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Benefits Over OpenAI Realtime API
| Feature | OpenAI Realtime | Thinker-Talker |
|---|---|---|
| Conversation Context | Separate from chat | Unified with chat mode |
| Tool Support | Limited | Full tool calling + RAG |
| TTS Quality | OpenAI voices | ElevenLabs premium voices |
| Cost | Per-minute billing | Per-token + TTS chars |
| Voice Selection | 6 voices | 11+ ElevenLabs voices |
| Customization | Limited | Full control over each stage |
| Barge-in | Built-in | Fully supported |
Architecture Components
1. Voice Pipeline Service
Location: services/api-gateway/app/services/voice_pipeline_service.py
Orchestrates the complete STT → Thinker → Talker flow:
```python
class VoicePipelineService:
    """
    Orchestrates the complete voice pipeline:
    1. Receive audio from client
    2. Stream to Deepgram STT
    3. Send transcripts to Thinker (LLM)
    4. Stream response tokens to Talker (TTS)
    5. Send audio chunks back to client
    """
```
Configuration:
```python
@dataclass
class PipelineConfig:
    # STT Settings
    stt_language: str = "en"
    stt_sample_rate: int = 16000
    stt_endpointing_ms: int = 800       # Wait for natural pauses
    stt_utterance_end_ms: int = 1500    # Finalize after 1.5s silence

    # TTS Settings - defaults from voice_constants.py (single source of truth)
    # See docs/voice/voice-configuration.md for details
    voice_id: str = DEFAULT_VOICE_ID    # Brian (from voice_constants.py)
    tts_model: str = DEFAULT_TTS_MODEL  # eleven_flash_v2_5

    # Barge-in
    barge_in_enabled: bool = True
```
2. Thinker Service
Location: services/api-gateway/app/services/thinker_service.py
The reasoning engine that processes transcribed speech:
```python
class ThinkerService:
    """
    Unified reasoning service for the Thinker/Talker pipeline.

    Handles:
    - Conversation context management (persisted across turns)
    - Streaming LLM responses with token callbacks
    - Tool calling with result injection
    - Cancellation support
    """
```
Key Features:
- ConversationContext: Maintains history (max 20 messages) with smart trimming
- Tool Registry: Supports calendar, search, medical calculators, KB search
- Streaming: Token-by-token callbacks for low-latency TTS
- State Machine: IDLE → PROCESSING → TOOL_CALLING → GENERATING → COMPLETE
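The context-trimming behavior can be pictured with a small sketch. This is illustrative TypeScript only; the actual implementation lives in thinker_service.py, and the `trimContext` helper and `Message` shape here are hypothetical.

```typescript
interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

const MAX_MESSAGES = 20; // matches the documented history cap

// Hypothetical sketch: keep the system prompt, drop the oldest turns first.
function trimContext(history: Message[]): Message[] {
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  const keep = Math.max(0, MAX_MESSAGES - system.length);
  return [...system, ...rest.slice(-keep)];
}
```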
3. Talker Service
Location: services/api-gateway/app/services/talker_service.py
Text-to-Speech synthesis with streaming audio:
```python
class TalkerService:
    """
    Unified TTS service for the Thinker/Talker pipeline.

    Handles:
    - Streaming LLM tokens through sentence chunker
    - Audio queue management for gapless playback
    - Cancellation (barge-in support)
    """
```
Voice Configuration:
```python
@dataclass
class VoiceConfig:
    provider: TTSProvider = TTSProvider.ELEVENLABS
    voice_id: str = "TxGEqnHWrfWFTfGW9XjX"  # Josh
    model_id: str = "eleven_turbo_v2_5"
    stability: float = 0.78           # Voice consistency
    similarity_boost: float = 0.85    # Voice clarity
    style: float = 0.08               # Natural, less dramatic
    output_format: str = "pcm_24000"  # Low-latency streaming
```
4. Sentence Chunker
Location: services/api-gateway/app/services/sentence_chunker.py
Optimizes LLM output for TTS with low latency:
```python
class SentenceChunker:
    """
    Low-latency phrase chunker for TTS processing.

    Strategy:
    - Primary: Split on sentence boundaries (. ! ?)
    - Secondary: Split on clause boundaries (, ; :) after min chars
    - Emergency: Force split at max chars

    Config (optimized for speed):
    - min_chunk_chars: 40 (avoid tiny fragments)
    - optimal_chunk_chars: 120 (natural phrases)
    - max_chunk_chars: 200 (force split)
    """
```
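To make the strategy concrete, here is a minimal TypeScript sketch of the same idea: buffer incoming tokens, emit on sentence boundaries, fall back to clause boundaries past the minimum size, and force a split at the maximum. The function names and exact cut points are illustrative, not the production implementation.

```typescript
const MIN_CHUNK = 40;
const MAX_CHUNK = 200;

// Illustrative chunker: feed LLM tokens in, get TTS-ready phrases out via emit().
function createChunker(emit: (phrase: string) => void) {
  let buffer = "";
  return {
    push(token: string) {
      buffer += token;
      // Primary: sentence boundary
      let cut = buffer.search(/[.!?]\s/);
      // Secondary: clause boundary once past the minimum size
      if (cut === -1 && buffer.length >= MIN_CHUNK) {
        cut = buffer.search(/[,;:]\s/);
      }
      // Emergency: force split at the maximum size
      if (cut === -1 && buffer.length >= MAX_CHUNK) {
        cut = MAX_CHUNK - 1;
      }
      if (cut !== -1) {
        emit(buffer.slice(0, cut + 1).trim());
        buffer = buffer.slice(cut + 1);
      }
    },
    flush() {
      if (buffer.trim()) emit(buffer.trim());
      buffer = "";
    },
  };
}
```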
5. WebSocket Handler
Location: services/api-gateway/app/services/thinker_talker_websocket_handler.py
Manages bidirectional client communication:
```python
class ThinkerTalkerWebSocketHandler:
    """
    WebSocket handler for Thinker/Talker voice pipeline.

    Protocol Messages (Client → Server):
    - audio.input: Base64 PCM16 audio
    - audio.input.complete: Signal end of speech
    - barge_in: Interrupt AI response
    - voice.mode: Activate/deactivate voice mode

    Protocol Messages (Server → Client):
    - transcript.delta/complete: STT results
    - response.delta/complete: LLM response
    - audio.output: TTS audio chunk
    - tool.call/result: Tool execution
    - voice.state: Pipeline state update
    """
```
Data Flow
Complete Request/Response Cycle
1. User speaks into microphone
│
▼
2. Frontend captures PCM16 audio (16kHz)
│
▼
3. Audio streamed via WebSocket (audio.input messages)
│
▼
4. Deepgram STT processes audio stream
│
├──> transcript.delta (partial text)
│
└──> transcript.complete (final text)
│
▼
5. ThinkerService receives transcript
│
├──> Adds to ConversationContext
│
├──> Calls GPT-4o with tools
│
├──> If tool call needed:
│ │
│ ├──> tool.call sent to client
│ │
│ ├──> Tool executed
│ │
│ └──> tool.result sent to client
│
└──> response.delta (streaming tokens)
│
▼
6. TalkerService receives tokens
│
├──> SentenceChunker buffers tokens
│
├──> Complete sentences → ElevenLabs TTS
│
└──> audio.output (streaming PCM)
│
▼
7. Frontend plays audio via Web Audio API
Barge-in Flow
1. AI is speaking (audio.output streaming)
│
2. User starts speaking
│
▼
3. Frontend sends barge_in message
│
▼
4. Backend:
├──> Cancels TalkerSession
├──> Clears audio queue
└──> Resets pipeline to LISTENING
│
▼
5. New user speech processed normally
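From the client's point of view, barge-in is one protocol message plus local cleanup. A minimal sketch, assuming an open WebSocket speaking the protocol above and a hypothetical `stopPlayback()` helper:

```typescript
// Called as soon as local detection notices the user speaking over the AI.
function handleBargeIn(ws: WebSocket, stopPlayback: () => void): void {
  // Stop and discard any audio already queued locally.
  stopPlayback();
  // Tell the backend to cancel the TalkerSession and return to LISTENING.
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({ type: "barge_in" }));
  }
}
```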
State Machine
┌─────────────────┐
│ IDLE │
│ (waiting for │
│ user input) │
└────────┬────────┘
│
audio.input received
│
▼
┌─────────────────┐
│ LISTENING │
│ (STT active, │
│ collecting) │
└────────┬────────┘
│
transcript.complete
│
▼
┌─────────────────┐
│ PROCESSING │◄─────────┐
│ (LLM thinking) │ │
└────────┬────────┘ │
│ │
┌──────────────┼──────────────┐ │
│ │ │ │
tool_call no tools error │ │
│ │ │ │
▼ ▼ │ │
┌─────────────────┐ ┌──────────┐ │ │
│ TOOL_CALLING │ │GENERATING│ │ │
│ (executing │ │(streaming│ │ │
│ tool) │ │ response)│ │ │
└────────┬────────┘ └────┬─────┘ │ │
│ │ │ │
tool_result response.complete │ │
│ │ │ │
└───────┬───────┘ │ │
│ │ │
▼ │ │
┌─────────────────┐ │ │
│ SPEAKING │ │ │
│ (TTS playing) │────────────┘ │
└────────┬────────┘ (more to say) │
│ │
audio complete or barge_in │
│ │
▼ │
┌─────────────────┐ │
│ CANCELLED │─────────────────┘
│ (interrupted) │ (restart listening)
└─────────────────┘
WebSocket Protocol
Client → Server Messages
| Message Type | Description | Payload |
|---|---|---|
| session.init | Initialize session with settings | { voice_settings: {...}, conversation_id: "..." } |
| audio.input | Audio chunk from microphone | { audio: "<base64 PCM16>" } |
| audio.input.complete | Manual end-of-speech signal | {} |
| barge_in | Interrupt AI response | {} |
| message | Text input fallback | { content: "..." } |
| ping | Heartbeat | {} |
Server → Client Messages
| Message Type | Description | Payload |
|---|---|---|
| session.ready | Session initialized | { session_id, pipeline_mode } |
| transcript.delta | Partial STT transcript | { text: "...", is_final: false } |
| transcript.complete | Final transcript | { text: "...", message_id: "..." } |
| response.delta | Streaming LLM token | { delta: "...", message_id: "..." } |
| response.complete | Complete LLM response | { text: "...", message_id: "..." } |
| audio.output | TTS audio chunk | { audio: "<base64 PCM>", is_final: false } |
| tool.call | Tool being called | { id, name, arguments } |
| tool.result | Tool result | { id, name, result } |
| voice.state | Pipeline state change | { state: "listening" } |
| error | Error occurred | { code, message, recoverable } |
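A client typically dispatches on the `type` field of each incoming frame. The sketch below assumes messages arrive as JSON text frames shaped like the table above; the handler bodies are placeholders for real UI and audio-queue updates.

```typescript
type ServerMessage = { type: string; [key: string]: unknown };

function attachHandlers(ws: WebSocket): void {
  ws.onmessage = (event: MessageEvent<string>) => {
    const msg = JSON.parse(event.data) as ServerMessage;
    switch (msg.type) {
      case "transcript.complete":
        console.log("User said:", msg.text);       // final STT result
        break;
      case "response.delta":
        console.log("Token:", msg.delta);          // streaming LLM tokens
        break;
      case "audio.output":
        // msg.audio is base64 PCM; decode and enqueue for playback here.
        break;
      case "voice.state":
        console.log("Pipeline state:", msg.state); // e.g. "listening", "speaking"
        break;
      case "error":
        console.error(msg.code, msg.message);
        break;
    }
  };
}
```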
Frontend Integration
useThinkerTalkerSession Hook
Location: apps/web-app/src/hooks/useThinkerTalkerSession.ts
```typescript
const {
  status,        // 'disconnected' | 'connecting' | 'ready' | 'error'
  pipelineState, // 'idle' | 'listening' | 'processing' | 'speaking'
  transcript,    // Final user transcript
  metrics,       // Latency and usage metrics
  connect,       // Start session
  disconnect,    // End session
  sendAudio,     // Send audio chunk
  bargeIn,       // Interrupt AI
} = useThinkerTalkerSession({
  conversation_id: "...",
  voiceSettings: {
    voice_id: "TxGEqnHWrfWFTfGW9XjX",
    language: "en",
    barge_in_enabled: true,
  },
  onTranscript: (t) => console.log("Transcript:", t),
  onAudioChunk: (audio) => playAudio(audio),
  onToolCall: (tool) => console.log("Tool:", tool),
});
```
useTTAudioPlayback Hook
Location: apps/web-app/src/hooks/useTTAudioPlayback.ts
Handles streaming audio playback with barge-in support:
```typescript
const {
  isPlaying,
  queuedChunks,
  playAudioChunk, // Add chunk to queue
  stopPlayback,   // Cancel playback (barge-in)
  clearQueue,     // Clear pending audio
} = useTTAudioPlayback({
  sampleRate: 24000,
  onPlaybackEnd: () => console.log("Playback complete"),
});
```
Configuration Reference
Backend Environment Variables
```bash
# LLM Settings
MODEL_SELECTION_DEFAULT=gpt-4o
OPENAI_API_KEY=sk-...
OPENAI_TIMEOUT_SEC=30

# TTS Settings
ELEVENLABS_API_KEY=...
ELEVENLABS_VOICE_ID=TxGEqnHWrfWFTfGW9XjX
ELEVENLABS_MODEL_ID=eleven_turbo_v2_5

# STT Settings
DEEPGRAM_API_KEY=...
```
Voice Configuration Options
| Parameter | Default | Range | Description |
|---|---|---|---|
| voice_id | TxGEqnHWrfWFTfGW9XjX (Josh) | See available voices | ElevenLabs voice |
| model_id | eleven_turbo_v2_5 | turbo/flash/multilingual | TTS model |
| stability | 0.78 | 0.0-1.0 | Higher = more consistent voice |
| similarity_boost | 0.85 | 0.0-1.0 | Higher = clearer voice |
| style | 0.08 | 0.0-1.0 | Lower = more natural |
| output_format | pcm_24000 | pcm/mp3 | Audio format |
Available ElevenLabs Voices
| Voice ID | Name | Gender | Premium |
|---|---|---|---|
| TxGEqnHWrfWFTfGW9XjX | Josh | Male | Yes |
| pNInz6obpgDQGcFmaJgB | Adam | Male | Yes |
| EXAVITQu4vr4xnSDxMaL | Bella | Female | Yes |
| 21m00Tcm4TlvDq8ikWAM | Rachel | Female | Yes |
| AZnzlk1XvdvUeBnXmlld | Domi | Female | No |
| ErXwobaYiN019PkySvjV | Antoni | Male | No |
Metrics & Observability
TTVoiceMetrics
```typescript
interface TTVoiceMetrics {
  connectionTimeMs: number;  // Connect to ready
  sttLatencyMs: number;      // Speech end to transcript
  llmFirstTokenMs: number;   // Transcript to first token
  ttsFirstAudioMs: number;   // First token to first audio
  totalLatencyMs: number;    // Speech end to first audio
  userUtteranceCount: number;
  aiResponseCount: number;
  toolCallCount: number;
  bargeInCount: number;
}
```
Latency Targets
| Metric | Target | Description |
|---|---|---|
| Connection | < 2000ms | WebSocket + pipeline init |
| STT | < 500ms | Speech end to transcript |
| LLM First Token | < 800ms | Transcript to first token |
| TTS First Audio | < 400ms | First token to audio |
| Total | < 1500ms | Speech end to audio playback |
Troubleshooting
Common Issues
1. No audio output
- Check ElevenLabs API key is valid
- Verify voice_id exists in available voices
- Check browser audio permissions
2. High latency
- Check network connection
- Verify STT endpoint is responsive
- Consider reducing chunk sizes
3. Barge-in not working
- Ensure barge_in_enabled: true in config
- Check WebSocket connection is stable
- Verify frontend is sending barge_in message
4. Tool calls failing
- Check user authentication (user_id required)
- Verify tool is registered in ToolRegistry
- Check tool-specific API keys (calendar, etc.)
Debug Logging
Enable verbose logging:
```python
# Backend
import logging

logging.getLogger("app.services.thinker_service").setLevel(logging.DEBUG)
logging.getLogger("app.services.talker_service").setLevel(logging.DEBUG)
```
```typescript
// Frontend
import { voiceLog } from "../lib/logger";

voiceLog.setLevel("debug");
```
Related Documentation
- Thinker Service API
- Talker Service API
- Voice Pipeline WebSocket Protocol
- Frontend Voice Hooks
- Voice Mode Settings Guide
Changelog
2025-12-01 - Initial Release
- Complete Thinker-Talker pipeline implementation
- Deepgram STT integration with streaming
- ElevenLabs TTS with sentence chunking
- Full tool calling support
- Barge-in capability
- Unified conversation context with chat mode
Voice Mode Pipeline
Status: Production-ready Last Updated: 2025-12-03
This document describes the unified Voice Mode pipeline architecture, data flow, metrics, and testing strategy. It serves as the canonical reference for developers working on real-time voice features.
Voice Pipeline Modes
VoiceAssist supports two voice pipeline modes:
| Mode | Description | Best For |
|---|---|---|
| Thinker-Talker (Recommended) | Local STT → LLM → TTS pipeline | Full tool support, unified context, custom TTS |
| OpenAI Realtime (Legacy) | Direct OpenAI Realtime API | Quick setup, minimal backend changes |
Thinker-Talker Pipeline (Primary)
The Thinker-Talker pipeline is the recommended approach, providing:
- Unified conversation context between voice and chat modes
- Full tool/RAG support in voice interactions
- Custom TTS via ElevenLabs with premium voices
- Lower cost per interaction
Documentation: THINKER_TALKER_PIPELINE.md
[Audio] → [Deepgram STT] → [GPT-4o Thinker] → [ElevenLabs TTS] → [Audio Out]
│ │ │
Transcripts Tool Calls Audio Chunks
│ │ │
└───────── WebSocket Handler ──────────────┘
OpenAI Realtime API (Legacy)
The original implementation using OpenAI's Realtime API directly. Still supported for backward compatibility.
Implementation Status
Thinker-Talker Components
| Component | Status | Location |
|---|---|---|
| ThinkerService | Live | app/services/thinker_service.py |
| TalkerService | Live | app/services/talker_service.py |
| VoicePipelineService | Live | app/services/voice_pipeline_service.py |
| T/T WebSocket Handler | Live | app/services/thinker_talker_websocket_handler.py |
| SentenceChunker | Live | app/services/sentence_chunker.py |
| Frontend T/T hook | Live | apps/web-app/src/hooks/useThinkerTalkerSession.ts |
| T/T Audio Playback | Live | apps/web-app/src/hooks/useTTAudioPlayback.ts |
| T/T Voice Panel | Live | apps/web-app/src/components/voice/ThinkerTalkerVoicePanel.tsx |
OpenAI Realtime Components (Legacy)
| Component | Status | Location |
|---|---|---|
| Backend session endpoint | Live | services/api-gateway/app/api/voice.py |
| Ephemeral token generation | Live | app/services/realtime_voice_service.py |
| Voice metrics endpoint | Live | POST /api/voice/metrics |
| Frontend voice hook | Live | apps/web-app/src/hooks/useRealtimeVoiceSession.ts |
| Voice settings store | Live | apps/web-app/src/stores/voiceSettingsStore.ts |
| Voice UI panel | Live | apps/web-app/src/components/voice/VoiceModePanel.tsx |
| Chat timeline integration | Live | Voice messages appear in chat |
| Barge-in support | Live | response.cancel + onSpeechStarted callback |
| Audio overlap prevention | Live | Response ID tracking + isProcessingResponseRef |
| E2E test suite | Passing | 95 tests across unit/integration/E2E |
Full status: See Implementation Status for all components.
Overview
Voice Mode enables real-time voice conversations with the AI assistant using OpenAI's Realtime API. The pipeline handles:
- Ephemeral session authentication (no raw API keys in browser)
- WebSocket-based bidirectional voice streaming
- Voice activity detection (VAD) with user-configurable sensitivity
- User settings propagation (voice, language, VAD threshold)
- Chat timeline integration (voice messages appear in chat)
- Connection state management with automatic reconnection
- Barge-in support (interrupt AI while speaking)
- Audio playback management (prevent overlapping responses)
- Metrics tracking for observability
Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────────┐
│ FRONTEND │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ ┌───────────────┐ │
│ │ VoiceModePanel │────▶│useRealtimeVoice │────▶│ voiceSettings │ │
│ │ (UI Component) │ │Session (Hook) │ │ Store │ │
│ │ - Start/Stop │ │- connect() │ │ - voice │ │
│ │ - Status display │ │- disconnect() │ │ - language │ │
│ │ - Metrics logging │ │- sendMessage() │ │ - vadSens │ │
│ └─────────┬───────────┘ └──────────┬──────────┘ └───────────────┘ │
│ │ │ │
│ │ │ onUserMessage()/onAssistantMessage()
│ │ ▼ │
│ ┌─────────▼───────────┐ ┌─────────────────────┐ │
│ │ MessageInput │ │ ChatPage │ │
│ │ - Voice toggle │────▶│ - useChatSession │ │
│ │ - Panel container │ │ - addMessage() │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
└──────────────────────────────────────┬──────────────────────────────────────┘
│
│ POST /api/voice/realtime-session
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ BACKEND │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ voice.py │────▶│ realtime_voice_ │ │
│ │ (FastAPI Router) │ │ service.py │ │
│ │ - /realtime-session│ │ - generate_session │ │
│ │ - Timing logs │ │ - ephemeral token │ │
│ └─────────────────────┘ └──────────┬──────────┘ │
│ │ │
│ │ POST /v1/realtime/sessions │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ OpenAI API │ │
│ │ - Ephemeral token │ │
│ │ - Voice config │ │
│ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
│
│ WebSocket wss://api.openai.com/v1/realtime
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ OPENAI REALTIME API │
├─────────────────────────────────────────────────────────────────────────────┤
│ - Server-side VAD (voice activity detection) │
│ - Bidirectional audio streaming (PCM16) │
│ - Real-time transcription (Whisper) │
│ - GPT-4o responses with audio synthesis │
└─────────────────────────────────────────────────────────────────────────────┘
Backend: /api/voice/realtime-session
Location: services/api-gateway/app/api/voice.py
Request
```typescript
interface RealtimeSessionRequest {
  conversation_id?: string; // Optional conversation context
  voice?: string;           // "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer"
  language?: string;        // "en" | "es" | "fr" | "de" | "it" | "pt"
  vad_sensitivity?: number; // 0-100 (maps to threshold: 0→0.9, 100→0.1)
}
```
Response
```typescript
interface RealtimeSessionResponse {
  url: string;        // WebSocket URL: "wss://api.openai.com/v1/realtime"
  model: string;      // "gpt-4o-realtime-preview"
  session_id: string; // Unique session identifier
  expires_at: number; // Unix timestamp (epoch seconds)
  conversation_id: string | null;
  auth: {
    type: "ephemeral_token";
    token: string;      // Ephemeral token (ek_...), NOT raw API key
    expires_at: number; // Token expiry (5 minutes)
  };
  voice_config: {
    voice: string; // Selected voice
    modalities: ["text", "audio"];
    input_audio_format: "pcm16";
    output_audio_format: "pcm16";
    input_audio_transcription: { model: "whisper-1" };
    turn_detection: {
      type: "server_vad";
      threshold: number; // 0.1 (sensitive) to 0.9 (insensitive)
      prefix_padding_ms: number;
      silence_duration_ms: number;
    };
  };
}
```
VAD Sensitivity Mapping
The frontend uses a 0-100 scale for user-friendly VAD sensitivity:
| User Setting | VAD Threshold | Behavior |
|---|---|---|
| 0 (Low) | 0.9 | Requires loud/clear speech |
| 50 (Medium) | 0.5 | Balanced detection |
| 100 (High) | 0.1 | Very sensitive, picks up soft speech |
Formula: threshold = 0.9 - (vad_sensitivity / 100 * 0.8)
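In code form, the mapping from the 0-100 user setting to the server-side threshold is simply:

```typescript
// 0 (low sensitivity) -> threshold 0.9, 100 (high sensitivity) -> threshold 0.1
function vadSensitivityToThreshold(sensitivity: number): number {
  const clamped = Math.min(100, Math.max(0, sensitivity));
  return 0.9 - (clamped / 100) * 0.8;
}

// Example: the default user setting of 50 maps to a balanced threshold of 0.5.
console.log(vadSensitivityToThreshold(50)); // 0.5
```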
Observability
Backend logs timing and context for each session request:
```python
# Request logging
logger.info(
    f"Creating Realtime session for user {current_user.id}",
    extra={
        "user_id": current_user.id,
        "conversation_id": request.conversation_id,
        "voice": request.voice,
        "language": request.language,
        "vad_sensitivity": request.vad_sensitivity,
    },
)

# Success logging with duration
duration_ms = int((time.monotonic() - start_time) * 1000)
logger.info(
    f"Realtime session created for user {current_user.id}",
    extra={
        "user_id": current_user.id,
        "session_id": config["session_id"],
        "voice": config.get("voice_config", {}).get("voice"),
        "duration_ms": duration_ms,
    },
)
```
Frontend Hook: useRealtimeVoiceSession
Location: apps/web-app/src/hooks/useRealtimeVoiceSession.ts
Usage
```typescript
const {
  status,       // 'disconnected' | 'connecting' | 'connected' | 'reconnecting' | 'failed' | 'expired' | 'error'
  transcript,   // Current transcript text
  isSpeaking,   // Is the AI currently speaking?
  isConnected,  // Derived: status === 'connected'
  isConnecting, // Derived: status === 'connecting' || 'reconnecting'
  canSend,      // Can send messages?
  error,        // Error message if any
  metrics,      // VoiceMetrics object
  connect,      // () => Promise<void> - start session
  disconnect,   // () => void - end session
  sendMessage,  // (text: string) => void - send text message
} = useRealtimeVoiceSession({
  conversationId,
  voice,              // From voiceSettingsStore
  language,           // From voiceSettingsStore
  vadSensitivity,     // From voiceSettingsStore (0-100)
  onConnected,        // Callback when connected
  onDisconnected,     // Callback when disconnected
  onError,            // Callback on error
  onUserMessage,      // Callback with user transcript
  onAssistantMessage, // Callback with AI response
  onMetricsUpdate,    // Callback when metrics change
});
```
Connection States
disconnected ──▶ connecting ──▶ connected
│ │
▼ ▼
failed ◀──── reconnecting
│ │
▼ ▼
expired ◀────── error
| State | Description |
|---|---|
| disconnected | Initial/idle state |
| connecting | Fetching session config, establishing WebSocket |
| connected | Active voice session |
| reconnecting | Auto-reconnect after temporary disconnect |
| failed | Connection failed (backend error, network issue) |
| expired | Session token expired (needs manual restart) |
| error | General error state |
WebSocket Connection
The hook connects using three WebSocket subprotocols for authentication:
```typescript
const ws = new WebSocket(url, [
  "realtime",
  "openai-beta.realtime-v1",
  `openai-insecure-api-key.${ephemeralToken}`,
]);
```
Voice Settings Store
Location: apps/web-app/src/stores/voiceSettingsStore.ts
Schema
```typescript
interface VoiceSettings {
  voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer";
  language: "en" | "es" | "fr" | "de" | "it" | "pt";
  vadSensitivity: number;   // 0-100
  autoStartOnOpen: boolean; // Auto-start voice when panel opens
  showStatusHints: boolean; // Show helper text in UI
}
```
Persistence
Settings are persisted to localStorage under key voiceassist-voice-settings using Zustand's persist middleware.
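As a rough sketch of how that persistence is typically wired with Zustand (the real store has many more fields; the shape here is trimmed for illustration):

```typescript
import { create } from "zustand";
import { persist } from "zustand/middleware";

interface VoiceSettingsState {
  voice: string;
  vadSensitivity: number;
  setVoice: (voice: string) => void;
}

// Persisted under the documented localStorage key.
export const useVoiceSettingsStore = create<VoiceSettingsState>()(
  persist(
    (set) => ({
      voice: "alloy",
      vadSensitivity: 50,
      setVoice: (voice) => set({ voice }),
    }),
    { name: "voiceassist-voice-settings" }
  )
);
```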
Defaults
| Setting | Default |
|---|---|
| voice | "alloy" |
| language | "en" |
| vadSensitivity | 50 |
| autoStartOnOpen | false |
| showStatusHints | true |
Chat Integration
Location: apps/web-app/src/pages/ChatPage.tsx
Message Flow
- User speaks → VoiceModePanel receives final transcript
- VoiceModePanel calls onUserMessage(transcript)
- ChatPage receives callback, calls useChatSession.addMessage()
- Message added to timeline with metadata: { source: "voice" }
```typescript
// ChatPage.tsx
const handleVoiceUserMessage = (content: string) => {
  addMessage({
    role: "user",
    content,
    metadata: { source: "voice" },
  });
};

const handleVoiceAssistantMessage = (content: string) => {
  addMessage({
    role: "assistant",
    content,
    metadata: { source: "voice" },
  });
};
```
Message Structure
```typescript
interface VoiceMessage {
  id: string; // "voice-{timestamp}-{random}"
  role: "user" | "assistant";
  content: string;
  timestamp: number;
  metadata: {
    source: "voice"; // Distinguishes from text messages
  };
}
```
Barge-in & Audio Playback
Location: apps/web-app/src/components/voice/VoiceModePanel.tsx, apps/web-app/src/hooks/useRealtimeVoiceSession.ts
Barge-in Flow
When the user starts speaking while the AI is responding, the system immediately:
- Detects speech start via OpenAI's input_audio_buffer.speech_started event
- Cancels active response by sending response.cancel to OpenAI
- Stops audio playback via onSpeechStarted callback
- Clears pending responses to prevent stale audio from playing
User speaks → speech_started event → response.cancel → stopCurrentAudio()
↓
Audio stops
Queue cleared
Response ID incremented
Response Cancellation
Location: useRealtimeVoiceSession.ts - handleRealtimeMessage
case "input_audio_buffer.speech_started": setIsSpeaking(true); setPartialTranscript(""); // Barge-in: Cancel any active response when user starts speaking if (activeResponseIdRef.current && wsRef.current?.readyState === WebSocket.OPEN) { wsRef.current.send(JSON.stringify({ type: "response.cancel" })); activeResponseIdRef.current = null; } // Notify parent to stop audio playback options.onSpeechStarted?.(); break;
Audio Playback Management
Location: VoiceModePanel.tsx
The panel tracks audio playback state to prevent overlapping responses:
```typescript
// Track currently playing Audio element
const currentAudioRef = useRef<HTMLAudioElement | null>(null);

// Prevent overlapping response processing
const isProcessingResponseRef = useRef(false);

// Response ID to invalidate stale responses after barge-in
const currentResponseIdRef = useRef<number>(0);
```
Stop current audio function:
```typescript
const stopCurrentAudio = useCallback(() => {
  if (currentAudioRef.current) {
    currentAudioRef.current.pause();
    currentAudioRef.current.currentTime = 0;
    if (currentAudioRef.current.src.startsWith("blob:")) {
      URL.revokeObjectURL(currentAudioRef.current.src);
    }
    currentAudioRef.current = null;
  }
  audioQueueRef.current = [];
  isPlayingRef.current = false;
  currentResponseIdRef.current++; // Invalidate pending responses
  isProcessingResponseRef.current = false;
}, []);
```
Overlap Prevention
When a relay result arrives, the handler checks:
- Already processing? Skip if isProcessingResponseRef.current === true
- Response ID valid? Skip playback if ID changed (barge-in occurred)
```typescript
onRelayResult: async ({ answer }) => {
  if (answer) {
    // Prevent overlapping responses
    if (isProcessingResponseRef.current) {
      console.log("[VoiceModePanel] Skipping response - already processing another");
      return;
    }
    const responseId = ++currentResponseIdRef.current;
    isProcessingResponseRef.current = true;

    // ... synthesis and playback ...

    // Check if response is still valid before playback
    if (responseId !== currentResponseIdRef.current) {
      console.log("[VoiceModePanel] Response cancelled - skipping playback");
      return;
    }
  }
};
```
Error Handling
Benign cancellation errors (e.g., "Cancellation failed: no active response found") are handled gracefully:
case "error": { const errorMessage = message.error?.message || "Realtime API error"; // Ignore benign cancellation errors if ( errorMessage.includes("Cancellation failed") || errorMessage.includes("no active response") ) { voiceLog.debug(`Ignoring benign error: ${errorMessage}`); break; } handleError(new Error(errorMessage)); break; }
Metrics
Location: apps/web-app/src/hooks/useRealtimeVoiceSession.ts
VoiceMetrics Interface
```typescript
interface VoiceMetrics {
  connectionTimeMs: number | null;        // Time to establish connection
  timeToFirstTranscriptMs: number | null; // Time to first user transcript
  lastSttLatencyMs: number | null;        // Speech-to-text latency
  lastResponseLatencyMs: number | null;   // AI response latency
  sessionDurationMs: number | null;       // Total session duration
  userTranscriptCount: number;            // Number of user turns
  aiResponseCount: number;                // Number of AI turns
  reconnectCount: number;                 // Number of reconnections
  sessionStartedAt: number | null;        // Session start timestamp
}
```
Frontend Logging
VoiceModePanel logs key metrics to console:
```typescript
// Connection time
console.log(`[VoiceModePanel] voice_session_connect_ms=${metrics.connectionTimeMs}`);

// STT latency
console.log(`[VoiceModePanel] voice_stt_latency_ms=${metrics.lastSttLatencyMs}`);

// Response latency
console.log(`[VoiceModePanel] voice_first_reply_ms=${metrics.lastResponseLatencyMs}`);

// Session duration
console.log(`[VoiceModePanel] voice_session_duration_ms=${metrics.sessionDurationMs}`);
```
Consuming Metrics
Developers can plug into metrics via the onMetricsUpdate callback:
```typescript
useRealtimeVoiceSession({
  onMetricsUpdate: (metrics) => {
    // Send to telemetry service
    analytics.track("voice_session_metrics", {
      connection_ms: metrics.connectionTimeMs,
      stt_latency_ms: metrics.lastSttLatencyMs,
      response_latency_ms: metrics.lastResponseLatencyMs,
      duration_ms: metrics.sessionDurationMs,
    });
  },
});
```
Metrics Export to Backend
Metrics can be automatically exported to the backend for aggregation and alerting.
Backend Endpoint: POST /api/voice/metrics
Location: services/api-gateway/app/api/voice.py
Request Schema
```typescript
interface VoiceMetricsPayload {
  conversation_id?: string;
  connection_time_ms?: number;
  time_to_first_transcript_ms?: number;
  last_stt_latency_ms?: number;
  last_response_latency_ms?: number;
  session_duration_ms?: number;
  user_transcript_count: number;
  ai_response_count: number;
  reconnect_count: number;
  session_started_at?: number;
}
```
Response
```typescript
interface VoiceMetricsResponse {
  status: "ok";
}
```
Privacy
No PHI or transcript content is sent. Only timing metrics and counts.
Frontend Configuration
Metrics export is controlled by environment variables:
- Production (import.meta.env.PROD): Metrics sent automatically
- Development: Set VITE_ENABLE_VOICE_METRICS=true to enable
The export uses navigator.sendBeacon() for reliability (survives page navigation).
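A minimal sketch of that export path, assuming the payload shape documented above (the helper name is hypothetical):

```typescript
function exportVoiceMetrics(payload: VoiceMetricsPayload): void {
  const body = JSON.stringify(payload);
  // sendBeacon queues the request even if the page is unloading.
  if (navigator.sendBeacon("/api/voice/metrics", body)) return;
  // Fallback for browsers without Beacon support.
  void fetch("/api/voice/metrics", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body,
    keepalive: true,
  });
}
```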
Backend Logging
Metrics are logged with user context:
```python
logger.info(
    "VoiceMetrics received",
    extra={
        "user_id": current_user.id,
        "conversation_id": payload.conversation_id,
        "connection_time_ms": payload.connection_time_ms,
        "session_duration_ms": payload.session_duration_ms,
        ...
    },
)
```
Testing
```bash
# Backend
cd /home/asimo/VoiceAssist/services/api-gateway
source venv/bin/activate && export PYTHONPATH=.
python -m pytest tests/integration/test_voice_metrics.py -v
```
Security
Ephemeral Token Architecture
CRITICAL: The browser NEVER receives the raw OpenAI API key.
- Backend holds OPENAI_API_KEY securely
- Frontend requests session via /api/voice/realtime-session
- Backend creates ephemeral token via OpenAI /v1/realtime/sessions
- Ephemeral token returned to frontend (valid ~5 minutes)
- Frontend connects WebSocket using ephemeral token
Token Refresh
The hook monitors session.expires_at and can trigger refresh before expiry. If the token expires mid-session, status transitions to expired.
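One way to picture the expiry watch (a hedged sketch; the real hook's internals may differ, and `refreshSession` / `setStatus` are placeholders):

```typescript
// expiresAt is the epoch-seconds value from session.expires_at.
function scheduleTokenRefresh(
  expiresAt: number,
  refreshSession: () => Promise<void>,
  setStatus: (s: "expired") => void,
  marginMs = 30_000 // refresh shortly before expiry
): () => void {
  const msUntilRefresh = expiresAt * 1000 - Date.now() - marginMs;
  const timer = setTimeout(() => {
    refreshSession().catch(() => setStatus("expired"));
  }, Math.max(0, msUntilRefresh));
  return () => clearTimeout(timer); // cancel on disconnect
}
```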
Testing
Voice Pipeline Smoke Suite
Run these commands to validate the voice pipeline:
```bash
# 1. Backend tests (CI-safe, mocked)
cd /home/asimo/VoiceAssist/services/api-gateway
source venv/bin/activate
export PYTHONPATH=.
python -m pytest tests/integration/test_openai_config.py -v

# 2. Frontend unit tests (run individually to avoid OOM)
cd /home/asimo/VoiceAssist/apps/web-app
export NODE_OPTIONS="--max-old-space-size=768"
npx vitest run src/hooks/__tests__/useRealtimeVoiceSession.test.ts --reporter=dot
npx vitest run src/hooks/__tests__/useChatSession-voice-integration.test.ts --reporter=dot
npx vitest run src/stores/__tests__/voiceSettingsStore.test.ts --reporter=dot
npx vitest run src/components/voice/__tests__/VoiceModeSettings.test.tsx --reporter=dot
npx vitest run src/components/chat/__tests__/MessageInput-voice-settings.test.tsx --reporter=dot

# 3. E2E tests (Chromium, mocked backend)
cd /home/asimo/VoiceAssist
npx playwright test \
  e2e/voice-mode-navigation.spec.ts \
  e2e/voice-mode-session-smoke.spec.ts \
  e2e/voice-mode-voice-chat-integration.spec.ts \
  --project=chromium --reporter=list
```
Test Coverage Summary
| Test File | Tests | Coverage |
|---|---|---|
| useRealtimeVoiceSession.test.ts | 22 | Hook lifecycle, states, metrics |
| useChatSession-voice-integration.test.ts | 8 | Message structure validation |
| voiceSettingsStore.test.ts | 17 | Store actions, persistence |
| VoiceModeSettings.test.tsx | 25 | Component rendering, interactions |
| MessageInput-voice-settings.test.tsx | 12 | Integration with chat input |
| voice-mode-navigation.spec.ts | 4 | E2E navigation flow |
| voice-mode-session-smoke.spec.ts | 3 | E2E session smoke (1 live gated) |
| voice-mode-voice-chat-integration.spec.ts | 4 | E2E panel integration |
Total: 95 tests
Live Testing
To test with real OpenAI backend:
```bash
# Backend (requires OPENAI_API_KEY in .env)
LIVE_REALTIME_TESTS=1 python -m pytest tests/integration/test_openai_config.py -v

# E2E (requires running backend + valid API key)
LIVE_REALTIME_E2E=1 npx playwright test e2e/voice-mode-session-smoke.spec.ts
```
File Reference
Backend
| File | Purpose |
|---|---|
| services/api-gateway/app/api/voice.py | API routes, metrics, timing logs |
| services/api-gateway/app/services/realtime_voice_service.py | Session creation, token generation |
| services/api-gateway/tests/integration/test_openai_config.py | Integration tests |
| services/api-gateway/tests/integration/test_voice_metrics.py | Metrics endpoint tests |
Frontend
| File | Purpose |
|---|---|
| apps/web-app/src/hooks/useRealtimeVoiceSession.ts | Core hook |
| apps/web-app/src/components/voice/VoiceModePanel.tsx | UI panel |
| apps/web-app/src/components/voice/VoiceModeSettings.tsx | Settings modal |
| apps/web-app/src/stores/voiceSettingsStore.ts | Settings store |
| apps/web-app/src/components/chat/MessageInput.tsx | Voice button integration |
| apps/web-app/src/pages/ChatPage.tsx | Chat timeline integration |
| apps/web-app/src/hooks/useChatSession.ts | addMessage() helper |
Tests
| File | Purpose |
|---|---|
| apps/web-app/src/hooks/__tests__/useRealtimeVoiceSession.test.ts | Hook tests |
| apps/web-app/src/hooks/__tests__/useChatSession-voice-integration.test.ts | Chat integration |
| apps/web-app/src/stores/__tests__/voiceSettingsStore.test.ts | Store tests |
| apps/web-app/src/components/voice/__tests__/VoiceModeSettings.test.tsx | Component tests |
| apps/web-app/src/components/chat/__tests__/MessageInput-voice-settings.test.tsx | Integration tests |
| e2e/voice-mode-navigation.spec.ts | E2E navigation |
| e2e/voice-mode-session-smoke.spec.ts | E2E smoke test |
| e2e/voice-mode-voice-chat-integration.spec.ts | E2E panel integration |
Related Documentation
- VOICE_MODE_ENHANCEMENT_10_PHASE.md - 10-phase enhancement plan (emotion, dictation, analytics)
- VOICE_MODE_SETTINGS_GUIDE.md - User settings configuration
- TESTING_GUIDE.md - E2E testing strategy and validation checklist
Observability & Monitoring (Phase 3)
Implemented: 2025-12-02
The voice pipeline includes comprehensive observability features for production monitoring.
Error Taxonomy (voice_errors.py)
Location: services/api-gateway/app/core/voice_errors.py
Structured error classification with 8 categories and 40+ error codes:
| Category | Codes | Description |
|---|---|---|
| CONNECTION | CONN_001-7 | WebSocket, network failures |
| STT | STT_001-7 | Speech-to-text errors |
| TTS | TTS_001-7 | Text-to-speech errors |
| LLM | LLM_001-6 | LLM processing errors |
| AUDIO | AUDIO_001-6 | Audio encoding/decoding errors |
| TIMEOUT | TIMEOUT_001-7 | Various timeout conditions |
| PROVIDER | PROVIDER_001-6 | External provider errors |
| INTERNAL | INTERNAL_001-5 | Internal server errors |
Each error code includes:
- Recoverability flag (can auto-retry)
- Retry configuration (delay, max attempts)
- User-friendly description
Voice Metrics (metrics.py)
Location: services/api-gateway/app/core/metrics.py
Prometheus metrics for voice pipeline monitoring:
| Metric | Type | Labels | Description |
|---|---|---|---|
| voice_errors_total | Counter | category, code, provider, recoverable | Total voice errors |
| voice_pipeline_stage_latency_seconds | Histogram | stage | Per-stage latency |
| voice_ttfa_seconds | Histogram | - | Time to first audio |
| voice_active_sessions | Gauge | - | Active voice sessions |
| voice_barge_in_total | Counter | - | Barge-in events |
| voice_audio_chunks_total | Counter | status | Audio chunks processed |
Per-Stage Latency Tracking (voice_timing.py)
Location: services/api-gateway/app/core/voice_timing.py
Pipeline stages tracked:
- audio_receive - Time to receive audio from client
- vad_process - Voice activity detection time
- stt_transcribe - Speech-to-text latency
- llm_process - LLM inference time
- tts_synthesize - Text-to-speech synthesis
- audio_send - Time to send audio to client
- ttfa - Time to first audio (end-to-end)
Usage:
```python
from app.core.voice_timing import create_pipeline_timings, PipelineStage

timings = create_pipeline_timings(session_id="abc123")

with timings.time_stage(PipelineStage.STT_TRANSCRIBE):
    transcript = await stt_client.transcribe(audio)

timings.record_ttfa()  # When first audio byte ready
timings.finalize()     # When response complete
```
SLO Alerts (voice_slo_alerts.yml)
Location: infrastructure/observability/prometheus/rules/voice_slo_alerts.yml
SLO targets with Prometheus alerting rules:
| SLO | Target | Alert |
|---|---|---|
| TTFA P95 | < 200ms | VoiceTTFASLOViolation |
| STT Latency P95 | < 300ms | VoiceSTTLatencySLOViolation |
| TTS First Chunk P95 | < 200ms | VoiceTTSFirstChunkSLOViolation |
| Connection Time P95 | < 500ms | VoiceConnectionTimeSLOViolation |
| Error Rate | < 1% | VoiceErrorRateHigh |
| Session Success Rate | > 95% | VoiceSessionSuccessRateLow |
Client Telemetry (voiceTelemetry.ts)
Location: apps/web-app/src/lib/voiceTelemetry.ts
Frontend telemetry with:
- Network quality assessment via Network Information API
- Browser performance metrics via Performance.memory API
- Jitter estimation for network quality
- Batched reporting (10s intervals)
- Beacon API for reliable delivery on page unload
```typescript
import { getVoiceTelemetry } from "@/lib/voiceTelemetry";

const telemetry = getVoiceTelemetry();
telemetry.startSession(sessionId);
telemetry.recordLatency("stt", 150);
telemetry.recordLatency("ttfa", 180);
telemetry.endSession();
```
Voice Health Endpoint (/health/voice)
Location: services/api-gateway/app/api/health.py
Comprehensive voice subsystem health check:
```bash
curl https://assist.asimo.io/health/voice
```
Response:
{ "status": "healthy", "providers": { "openai": { "status": "up", "latency_ms": 120.5 }, "elevenlabs": { "status": "up", "latency_ms": 85.2 }, "deepgram": { "status": "up", "latency_ms": 95.8 } }, "session_store": { "status": "up", "active_sessions": 5 }, "metrics": { "active_sessions": 5 }, "slo": { "ttfa_target_ms": 200, "error_rate_target": 0.01 } }
Debug Logging Configuration
Location: services/api-gateway/app/core/logging.py
Configurable voice log verbosity via VOICE_LOG_LEVEL environment variable:
| Level | Content |
|---|---|
| MINIMAL | Errors only |
| STANDARD | + Session lifecycle (start/end/state changes) |
| VERBOSE | + All latency measurements |
| DEBUG | + Audio frame details, chunk timing |
Usage:
```python
from app.core.logging import get_voice_logger

voice_log = get_voice_logger(__name__)

voice_log.session_start(session_id="abc123", provider="thinker_talker")
voice_log.latency("stt_transcribe", 150.5, session_id="abc123")
voice_log.error("voice_connection_failed", error_code="CONN_001")
```
Phase 9: Offline & Network Fallback
Implemented: 2025-12-03
The voice pipeline now includes comprehensive offline support and network-aware fallback mechanisms.
Network Monitoring (networkMonitor.ts)
Location: apps/web-app/src/lib/offline/networkMonitor.ts
Continuously monitors network health using multiple signals:
- Navigator.onLine: Basic online/offline detection
- Network Information API: Connection type, downlink speed, RTT
- Health Check Pinging: Periodic /api/health pings for latency measurement
```typescript
import { getNetworkMonitor } from "@/lib/offline/networkMonitor";

const monitor = getNetworkMonitor();
monitor.subscribe((status) => {
  console.log(`Network quality: ${status.quality}`);
  console.log(`Health check latency: ${status.healthCheckLatencyMs}ms`);
});
```
Network Quality Levels
| Quality | Latency | isHealthy | Action |
|---|---|---|---|
| Excellent | < 100ms | true | Full cloud processing |
| Good | < 200ms | true | Full cloud processing |
| Moderate | < 500ms | true | Cloud with quality warning |
| Poor | ≥ 500ms | variable | Consider offline fallback |
| Offline | Unreachable | false | Automatic offline fallback |
Configuration
```typescript
const monitor = createNetworkMonitor({
  healthCheckUrl: "/api/health",
  healthCheckIntervalMs: 30000, // 30 seconds
  healthCheckTimeoutMs: 5000,   // 5 seconds
  goodLatencyThresholdMs: 100,
  moderateLatencyThresholdMs: 200,
  poorLatencyThresholdMs: 500,
  failuresBeforeUnhealthy: 3,
});
```
useNetworkStatus Hook
Location: apps/web-app/src/hooks/useNetworkStatus.ts
React hook providing network status with computed properties:
```typescript
const {
  isOnline,
  isHealthy,
  quality,
  healthCheckLatencyMs,
  effectiveType,      // "4g", "3g", "2g", "slow-2g"
  downlink,           // Mbps
  rtt,                // Round-trip time ms
  isSuitableForVoice, // quality >= "good" && isHealthy
  shouldUseOffline,   // !isOnline || !isHealthy || quality < "moderate"
  qualityScore,       // 0-4 (offline=0, poor=1, moderate=2, good=3, excellent=4)
  checkNow,           // Force immediate health check
} = useNetworkStatus();
```
Offline VAD with Network Fallback
Location: apps/web-app/src/hooks/useOfflineVAD.ts
The useOfflineVADWithFallback hook automatically switches between network and offline VAD:
```typescript
const {
  isListening,
  isSpeaking,
  currentEnergy,
  isUsingOfflineVAD, // Currently using offline mode?
  networkAvailable,
  networkQuality,
  modeReason,        // "network_vad" | "network_unavailable" | "poor_quality" | "forced_offline"
  forceOffline,      // Manually switch to offline
  forceNetwork,      // Manually switch to network (if available)
  startListening,
  stopListening,
} = useOfflineVADWithFallback({
  useNetworkMonitor: true,
  minNetworkQuality: "moderate",
  networkRecoveryDelayMs: 2000, // Prevent flapping
  onFallbackToOffline: () => console.log("Switched to offline VAD"),
  onReturnToNetwork: () => console.log("Returned to network VAD"),
});
```
Fallback Decision Flow
┌────────────────────┐
│ Network Monitor │
│ Health Check │
└─────────┬──────────┘
│
▼
┌────────────────────┐ NO ┌────────────────────┐
│ Is Online? │──────────▶│ Use Offline VAD │
└─────────┬──────────┘ └────────────────────┘
│ YES
▼
┌────────────────────┐ NO ┌────────────────────┐
│ Is Healthy? │──────────▶│ Use Offline VAD │
│ (3+ checks pass) │ │ reason: unhealthy │
└─────────┬──────────┘ └────────────────────┘
│ YES
▼
┌────────────────────┐ NO ┌────────────────────┐
│ Quality ≥ Min? │──────────▶│ Use Offline VAD │
│ (e.g., moderate) │ │ reason: poor_qual │
└─────────┬──────────┘ └────────────────────┘
│ YES
▼
┌────────────────────┐
│ Use Network VAD │
│ (cloud processing)│
└────────────────────┘
TTS Caching (useTTSCache)
Location: apps/web-app/src/hooks/useOfflineVAD.ts
Caches synthesized TTS audio for offline playback:
```typescript
const {
  getTTS,   // Get audio (from cache or fresh)
  preload,  // Preload common phrases
  isCached, // Check if text is cached
  stats,    // { entryCount, sizeMB, hitRate }
  clear,    // Clear cache
} = useTTSCache({
  voice: "alloy",
  maxSizeMB: 50,
  ttsFunction: async (text) => synthesizeAudio(text),
});

// Preload common phrases on app start
await preload(); // Caches "I'm listening", "Go ahead", etc.

// Get TTS (cache hit = instant, cache miss = synthesize + cache)
const audio = await getTTS("Hello world");
```
User Settings Integration
Phase 9 settings are stored in voiceSettingsStore:
| Setting | Default | Description |
|---|---|---|
| enableOfflineFallback | true | Auto-switch to offline when network poor |
| preferOfflineVAD | false | Force offline VAD (privacy mode) |
| ttsCacheEnabled | true | Enable TTS response caching |
File Reference (Phase 9)
| File | Purpose |
|---|---|
| apps/web-app/src/lib/offline/networkMonitor.ts | Network health monitoring |
| apps/web-app/src/lib/offline/webrtcVAD.ts | WebRTC-based offline VAD |
| apps/web-app/src/lib/offline/types.ts | Offline module type definitions |
| apps/web-app/src/hooks/useNetworkStatus.ts | React hook for network status |
| apps/web-app/src/hooks/useOfflineVAD.ts | Offline VAD + TTS cache hooks |
| apps/web-app/src/lib/offline/__tests__/networkMonitor.test.ts | Network monitor tests |
Future Work
- Metrics export to backend: Send metrics to backend for aggregation/alerting ✓ Implemented
- Barge-in support: Allow user to interrupt AI responses ✓ Implemented (2025-11-28)
- Audio overlap prevention: Prevent multiple responses playing simultaneously ✓ Implemented (2025-11-28)
- Per-user voice preferences: Backend persistence for TTS settings ✓ Implemented (2025-11-29)
- Context-aware voice styles: Auto-detect tone from content ✓ Implemented (2025-11-29)
- Aggressive latency optimization: 200ms VAD, 256-sample chunks, 300ms reconnect ✓ Implemented (2025-11-29)
- Observability & Monitoring (Phase 3): Error taxonomy, metrics, SLO alerts, telemetry ✓ Implemented (2025-12-02)
- Phase 7: Multilingual Support: Auto language detection, accent profiles, language switch confidence ✓ Implemented (2025-12-03)
- Phase 8: Voice Calibration: Personalized VAD thresholds, calibration wizard, adaptive learning ✓ Implemented (2025-12-03)
- Phase 9: Offline Fallback: Network monitoring, offline VAD, TTS caching, quality-based switching ✓ Implemented (2025-12-03)
- Phase 10: Conversation Intelligence: Sentiment tracking, discourse analysis, response recommendations ✓ Implemented (2025-12-03)
Voice Mode Enhancement - 10 Phase Plan ✅ COMPLETE (2025-12-03)
A comprehensive enhancement transforming voice mode into a human-like conversational partner with medical dictation:
- Phase 1: Emotional Intelligence (Hume AI) ✓ Complete
- Phase 2: Backchanneling System ✓ Complete
- Phase 3: Prosody Analysis ✓ Complete
- Phase 4: Memory & Context System ✓ Complete
- Phase 5: Advanced Turn-Taking ✓ Complete
- Phase 6: Variable Response Timing ✓ Complete
- Phase 7: Conversational Repair ✓ Complete
- Phase 8: Medical Dictation Core ✓ Complete
- Phase 9: Patient Context Integration ✓ Complete
- Phase 10: Frontend Integration & Analytics ✓ Complete
Full documentation: VOICE_MODE_ENHANCEMENT_10_PHASE.md
Remaining Tasks
- Voice→chat transcript content E2E: Test actual transcript content in chat timeline
- Error tracking integration: Send errors to Sentry/similar
- Audio level visualization: Show real-time audio level meter during recording
Voice Mode Settings Guide
This guide explains how to use and configure Voice Mode settings in VoiceAssist.
Overview
Voice Mode provides real-time voice conversations with the AI assistant. Users can customize their voice experience through the settings panel, including voice selection, language preferences, TTS quality parameters, and behavior options.
Voice Mode Overhaul (2025-11-29): Added backend persistence for voice preferences, context-aware voice style detection, and advanced TTS quality controls.
Phase 7-10 Enhancements (2025-12-03): Added multilingual support with auto-detection, voice calibration, offline fallback with network monitoring, and conversation intelligence features.
Accessing Settings
- Open Voice Mode by clicking the voice button in the chat interface
- Click the gear icon in the Voice Mode panel header
- The settings modal will appear
Available Settings
Voice Selection
Choose from 6 different AI voices:
- Alloy - Neutral, balanced voice (default)
- Echo - Warm, friendly voice
- Fable - Expressive, narrative voice
- Onyx - Deep, authoritative voice
- Nova - Energetic, bright voice
- Shimmer - Soft, calming voice
Language
Select your preferred conversation language:
- English (default)
- Spanish
- French
- German
- Italian
- Portuguese
Voice Detection Sensitivity (0-100%)
Controls how sensitive the voice activity detection is:
- Lower values (0-30%): Less sensitive, requires louder/clearer speech
- Medium values (40-60%): Balanced detection (recommended)
- Higher values (70-100%): More sensitive, may pick up background noise
Auto-start Voice Mode
When enabled, Voice Mode will automatically open when you start a new chat or navigate to the chat page. This is useful for voice-first interactions.
Show Status Hints
When enabled, displays helpful tips and instructions in the Voice Mode panel. Disable if you're familiar with the interface and want a cleaner view.
Context-Aware Voice Style (New)
When enabled, the AI automatically adjusts its voice tone based on the content being spoken:
- Calm: Default for medical explanations (stable, measured pace)
- Urgent: For medical warnings/emergencies (dynamic, faster)
- Empathetic: For sensitive health topics (warm, slower)
- Instructional: For step-by-step guidance (clear, deliberate)
- Conversational: For general chat (natural, varied)
The system detects keywords and patterns to select the appropriate style, then blends it with your base preferences (60% your settings, 40% style preset).
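Conceptually, the blend is a weighted average of your saved parameters and the detected style's preset, using the 60/40 split described above. A hedged sketch (the field names are illustrative):

```typescript
interface TTSParams {
  stability: number;
  similarityBoost: number;
  style: number;
}

// 60% user settings, 40% detected style preset, per the documented split.
function blendVoiceStyle(user: TTSParams, preset: TTSParams): TTSParams {
  const mix = (a: number, b: number) => 0.6 * a + 0.4 * b;
  return {
    stability: mix(user.stability, preset.stability),
    similarityBoost: mix(user.similarityBoost, preset.similarityBoost),
    style: mix(user.style, preset.style),
  };
}
```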
Advanced Voice Quality (New)
Expand this section to fine-tune TTS output parameters:
- Voice Stability (0-100%): Lower = more expressive/varied, Higher = more consistent
- Voice Clarity (0-100%): Higher values produce clearer, more consistent voice
- Expressiveness (0-100%): Higher values add more emotion and style variation
These settings primarily affect ElevenLabs TTS but also influence context-aware style blending for OpenAI TTS.
Phase 7: Language & Detection Settings
Auto-Detect Language
When enabled, the system automatically detects the language being spoken and adjusts processing accordingly. This is useful for multilingual users who switch between languages naturally.
- Default: Enabled
- Store Key: autoLanguageDetection
Language Switch Confidence (0-100%)
Controls how confident the system must be before switching to a detected language. Higher values prevent false-positive language switches.
- Lower values (50-70%): More responsive language switching, but may switch accidentally on similar-sounding phrases
- Medium values (70-85%): Balanced detection (recommended)
- Higher values (85-100%): Very confident switching, stays in current language unless clearly different
- Default: 75%
- Store Key: languageSwitchConfidence
Accent Profile
Select a regional accent profile to improve speech recognition accuracy for your specific accent or dialect.
- Default: None (auto-detect)
- Available Profiles: en-us-midwest, en-gb-london, en-au-sydney, ar-eg-cairo, ar-sa-riyadh, etc.
- Store Key: accentProfileId
Phase 8: Voice Calibration Settings
Voice calibration optimizes the VAD (Voice Activity Detection) thresholds specifically for your voice and environment.
Calibration Status
Shows whether voice calibration has been completed:
- Not Calibrated: Default state, using generic thresholds
- Calibrated: Personal thresholds active (shows last calibration date)
Recalibrate Button
Launches the calibration wizard to:
- Record ambient noise samples
- Record your speaking voice at different volumes
- Compute personalized VAD thresholds
Calibration takes approximately 30-60 seconds.
Personalized VAD Threshold
After calibration, the system uses a custom threshold tuned to your voice:
- Store Key: personalizedVadThreshold
- Range: 0.0-1.0 (null if not calibrated)
Adaptive Learning
When enabled, the system continuously learns from your voice patterns and subtly adjusts thresholds over time.
- Default: Enabled
- Store Key: enableBehaviorLearning
Phase 9: Offline Mode Settings
Configure how the voice assistant behaves when network connectivity is poor or unavailable.
Enable Offline Fallback
When enabled, the system automatically switches to offline VAD processing when:
- Network is offline
- Health check fails consecutively
- Network quality drops below threshold
- Default: Enabled
- Store Key: enableOfflineFallback
Prefer Local VAD
Force the use of local (on-device) VAD processing even when network is available. Useful for:
- Privacy-conscious users who don't want audio sent to servers
- Environments with unreliable connectivity
- Lower latency at the cost of accuracy
- Default: Disabled
- Store Key: preferOfflineVAD
TTS Audio Caching
When enabled, previously synthesized audio responses are cached locally for:
- Faster playback of repeated phrases
- Offline playback of cached responses
- Reduced bandwidth and API costs
- Default: Enabled
- Store Key: ttsCacheEnabled
Network Quality Monitoring
The system continuously monitors network quality and categorizes it into five levels:
| Quality | Latency | Behavior |
|---|---|---|
| Excellent | < 100ms | Full cloud processing |
| Good | < 200ms | Full cloud processing |
| Moderate | < 500ms | Cloud processing, may show warning |
| Poor | ≥ 500ms | Auto-fallback to offline VAD |
| Offline | No network | Full offline mode |
Network status is displayed in the voice panel header when quality is degraded.
Phase 10: Conversation Intelligence Settings
These settings control advanced AI features that enhance conversation quality.
Enable Sentiment Tracking
When enabled, the AI tracks emotional tone throughout the conversation and adapts its responses accordingly.
- Default: Enabled
- Store Key: enableSentimentTracking
Enable Discourse Analysis
Tracks conversation structure (topic changes, question chains, clarifications) to provide more contextually aware responses.
- Default: Enabled
- Store Key: enableDiscourseAnalysis
Enable Response Recommendations
The AI suggests relevant follow-up questions or actions based on conversation context.
- Default: Enabled
- Store Key: enableResponseRecommendations
Show Suggested Follow-Ups
Display AI-suggested follow-up questions after responses. These appear as clickable chips below the assistant's message.
- Default: Enabled
- Store Key: showSuggestedFollowUps
Privacy Settings
Store Transcript History
When enabled, voice transcripts are stored in the conversation history. Disable for ephemeral voice sessions.
- Default: Enabled
- Store Key: storeTranscriptHistory
Share Anonymous Analytics
Opt-in to share anonymized voice interaction metrics to help improve the service. No transcript content or personal data is shared - only timing metrics (latency, error rates).
- Default: Disabled
- Store Key: shareAnonymousAnalytics
Persistence
Voice preferences are now stored in two locations for maximum reliability:
- Backend API (Primary): Settings are synced to /api/voice/preferences and stored in the database. This enables cross-device settings sync when logged in.
- Local Storage (Fallback): Settings are also cached locally under voiceassist-voice-settings for offline access and faster loading.
Changes are debounced (1 second) before being sent to the backend to reduce API calls while editing.
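A minimal sketch of that debounce, assuming a hypothetical `savePreferences` function that PUTs to /api/voice/preferences:

```typescript
function createDebouncedSave(
  savePreferences: (settings: Record<string, unknown>) => Promise<void>,
  delayMs = 1000 // matches the documented 1-second debounce
) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (settings: Record<string, unknown>) => {
    if (timer) clearTimeout(timer);
    // Only the last change within the window is sent to the backend.
    timer = setTimeout(() => void savePreferences(settings), delayMs);
  };
}
```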
Resetting to Defaults
Click "Reset to defaults" in the settings modal to restore all settings to their original values:
Core Settings
- Voice: Alloy
- Language: English
- VAD Sensitivity: 50%
- Auto-start: Disabled
- Show hints: Enabled
- Context-aware style: Enabled
- Stability: 50%
- Clarity: 75%
- Expressiveness: 0%
Phase 7 Defaults
- Auto Language Detection: Enabled
- Language Switch Confidence: 75%
- Accent Profile ID: null
Phase 8 Defaults
- VAD Calibrated: false
- Last Calibration Date: null
- Personalized VAD Threshold: null
- Adaptive Learning: Enabled
Phase 9 Defaults
- Offline Fallback: Enabled
- Prefer Local VAD: Disabled
- TTS Cache: Enabled
Phase 10 Defaults
- Sentiment Tracking: Enabled
- Discourse Analysis: Enabled
- Response Recommendations: Enabled
- Show Suggested Follow-Ups: Enabled
Privacy Defaults
- Store Transcript History: Enabled
- Share Anonymous Analytics: Disabled
Reset also syncs to the backend via POST /api/voice/preferences/reset.
Voice Preferences API (New)
The following API endpoints manage voice preferences:
| Endpoint | Method | Description |
|---|---|---|
| /api/voice/preferences | GET | Get user's voice preferences |
| /api/voice/preferences | PUT | Update preferences (partial update) |
| /api/voice/preferences/reset | POST | Reset to defaults |
| /api/voice/style-presets | GET | Get available style presets |
Response Headers
TTS synthesis requests now include additional headers:
- X-TTS-Provider: Which provider was used (openai or elevenlabs)
- X-TTS-Fallback: Whether fallback was used (true/false)
- X-TTS-Style: Detected style if context-aware is enabled
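Clients can inspect these headers on the synthesis response; for example (the endpoint path below is a placeholder, only the header names come from this guide):

```typescript
// Hypothetical synthesis request; adjust the URL to the real TTS endpoint.
const response = await fetch("/api/voice/tts", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ text: "Hello" }),
});

console.log("Provider:", response.headers.get("X-TTS-Provider")); // "openai" | "elevenlabs"
console.log("Fallback:", response.headers.get("X-TTS-Fallback")); // "true" | "false"
console.log("Style:", response.headers.get("X-TTS-Style"));       // detected style, if any
```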
Technical Details
Store Location
Settings are managed by a Zustand store with persistence:
apps/web-app/src/stores/voiceSettingsStore.ts
Component Locations
- Settings UI: apps/web-app/src/components/voice/VoiceModeSettings.tsx
- Enhanced Settings: apps/web-app/src/components/voice/VoiceSettingsEnhanced.tsx
- Calibration Dialog: apps/web-app/src/components/voice/CalibrationDialog.tsx
Phase 9 Offline/Network Files
- Network Monitor: apps/web-app/src/lib/offline/networkMonitor.ts
- WebRTC VAD: apps/web-app/src/lib/offline/webrtcVAD.ts
- Offline Types: apps/web-app/src/lib/offline/types.ts
- Network Status Hook: apps/web-app/src/hooks/useNetworkStatus.ts
- Offline VAD Hook: apps/web-app/src/hooks/useOfflineVAD.ts
Backend Files (New)
- Model: services/api-gateway/app/models/user_voice_preferences.py
- Style Detector: services/api-gateway/app/services/voice_style_detector.py
- API Endpoints: services/api-gateway/app/api/voice.py (preferences section)
- Schemas: services/api-gateway/app/api/voice_schemas/schemas.py
Frontend Sync Hook (New)
apps/web-app/src/hooks/useVoicePreferencesSync.ts
Handles loading/saving preferences to backend with debouncing.
Integration Points
- `VoiceModePanel.tsx` - Displays the settings button and uses store values
- `MessageInput.tsx` - Reads `autoStartOnOpen` for auto-open behavior
- `useVoicePreferencesSync.ts` - Backend sync on auth and setting changes
Advanced: Voice Mode Pipeline
Settings are not just UI preferences - they propagate into real-time voice sessions:
- Voice/Language: Sent to `/api/voice/realtime-session` and used by the OpenAI Realtime API
- VAD Sensitivity: Mapped to the server-side VAD threshold (0 → insensitive, 100 → sensitive)
For comprehensive pipeline documentation including backend integration, WebSocket connections, and metrics, see VOICE_MODE_PIPELINE.md.
Development: Running Tests
Run the voice settings test suites individually to avoid memory issues:
```bash
cd apps/web-app

# Unit tests for voice settings store (core)
npx vitest run src/stores/__tests__/voiceSettingsStore.test.ts --reporter=dot

# Unit tests for voice settings store (Phase 7-10)
npx vitest run src/stores/__tests__/voiceSettingsStore-phase7-10.test.ts --reporter=dot

# Unit tests for network monitor
npx vitest run src/lib/offline/__tests__/networkMonitor.test.ts --reporter=dot

# Component tests for VoiceModeSettings
npx vitest run src/components/voice/__tests__/VoiceModeSettings.test.tsx --reporter=dot

# Integration tests for MessageInput voice settings
npx vitest run src/components/chat/__tests__/MessageInput-voice-settings.test.tsx --reporter=dot
```
Test Coverage
The test suites cover:
voiceSettingsStore.test.ts (17 tests)
- Default values verification
- All setter functions (voice, language, sensitivity, toggles)
- VAD sensitivity clamping (0-100 range)
- Reset functionality
- LocalStorage persistence
voiceSettingsStore-phase7-10.test.ts (41 tests)
- Phase 7: Multilingual settings (accent profile, auto-detection, confidence)
- Phase 8: Calibration settings (VAD calibrated, dates, thresholds)
- Phase 9: Offline mode settings (fallback, prefer offline VAD, TTS cache)
- Phase 10: Conversation intelligence (sentiment, discourse, recommendations)
- Privacy settings (transcript history, anonymous analytics)
- Persistence tests for all Phase 7-10 settings
- Reset tests verifying all defaults
networkMonitor.test.ts (13 tests)
- Initial state detection (online/offline)
- Health check latency measurement
- Quality computation from latency thresholds
- Consecutive failure handling before marking unhealthy
- Subscription/unsubscription for status changes
- Custom configuration (latency thresholds, health check URL)
- Offline detection via navigator.onLine
VoiceModeSettings.test.tsx (25 tests)
- Modal visibility (isOpen prop)
- Current settings display
- Settings updates via UI interactions
- Reset with confirmation
- Close behavior (Done, X, backdrop)
- Accessibility (labels, ARIA attributes)
MessageInput-voice-settings.test.tsx (12 tests)
- Auto-open via store setting (autoStartOnOpen)
- Auto-open via prop (autoOpenRealtimeVoice)
- Combined settings behavior
- Voice/language display in panel header
- Status hints visibility toggle
Total: 108+ tests for voice settings and related functionality.
Notes
- Tests mock `useRealtimeVoiceSession` and `WaveformVisualizer` to avoid browser API dependencies
- Run tests individually rather than the full suite to prevent memory issues
- All tests use Vitest + React Testing Library
- Phase 7-10 tests also mock `fetch` and `performance.now` for network monitoring
Thinker Service
Location: `services/api-gateway/app/services/thinker_service.py`
Status: Production Ready Last Updated: 2025-12-01
Overview
The ThinkerService is the reasoning engine of the Thinker-Talker voice pipeline. It manages conversation context, orchestrates LLM interactions, and handles tool calling with result injection.
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ ThinkerService │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ ConversationContext │◄──│ ThinkerSession │ │
│ │ (shared memory) │ │ (per-request) │ │
│ └──────────────────┘ └──────────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────────┐ │
│ │ │ LLMClient │ │
│ │ │ (GPT-4o) │ │
│ │ └──────────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────────┐ │
│ │ │ ToolRegistry │ │
│ │ │ (calendar, search,│ │
│ │ │ medical, KB) │ │
│ └──────────────┴──────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Classes
ThinkerService
Main service class (singleton pattern).
```python
from app.services.thinker_service import thinker_service

# Create a session for a conversation
session = thinker_service.create_session(
    conversation_id="conv-123",
    on_token=handle_token,          # Called for each LLM token
    on_tool_call=handle_tool_call,  # Called when tool is invoked
    on_tool_result=handle_result,   # Called when tool returns
    user_id="user-456",             # Required for authenticated tools
)

# Process user input
response = await session.think("What's on my calendar today?")
```
Methods
| Method | Description | Parameters | Returns |
|---|---|---|---|
create_session() | Create a thinking session | conversation_id, on_token, on_tool_call, on_tool_result, system_prompt, user_id | ThinkerSession |
register_tool() | Register a new tool | name, description, parameters, handler | None |
ThinkerSession
Session class for processing individual requests.
```python
class ThinkerSession:
    """
    A single thinking session with streaming support.

    Manages the flow:
    1. Receive user input
    2. Add to conversation context
    3. Call LLM with streaming
    4. Handle tool calls if needed
    5. Stream response tokens to callback
    """
```
Methods
| Method | Description | Parameters | Returns |
|---|---|---|---|
think() | Process user input | user_input: str, source_mode: str | ThinkerResponse |
cancel() | Cancel processing | None | None |
get_context() | Get conversation context | None | ConversationContext |
get_metrics() | Get session metrics | None | ThinkerMetrics |
Properties
| Property | Type | Description |
|---|---|---|
state | ThinkingState | Current processing state |
ConversationContext
Manages conversation history with smart trimming.
```python
class ConversationContext:
    MAX_HISTORY_MESSAGES = 20  # Maximum messages to retain
    MAX_CONTEXT_TOKENS = 8000  # Token budget for context

    def __init__(self, conversation_id: str, system_prompt: str = None):
        self.conversation_id = conversation_id
        self.messages: List[ConversationMessage] = []
        self.system_prompt = system_prompt or self._default_system_prompt()
```
Smart Trimming
When message count exceeds MAX_HISTORY_MESSAGES, the context performs smart trimming:
```python
def _smart_trim(self) -> None:
    """
    Trim messages while preserving tool call chains.

    OpenAI requires: assistant (with tool_calls) -> tool (with tool_call_id)
    We can't break this chain or the API will reject the request.
    """
```
Rules (illustrated in the sketch after this list):
- Never trim an assistant message if the next message is a tool result
- Never trim a tool message (it needs its preceding assistant message)
- Find the first safe trim point that doesn't break chains
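A minimal sketch of these rules, assuming the `ConversationMessage` shape shown later in this document (illustrative only, not the production `_smart_trim` implementation):

```python
def smart_trim(messages: list, max_messages: int = 20) -> list:
    """Illustrative trim that never orphans a tool result (sketch, not production code)."""
    if len(messages) <= max_messages:
        return messages

    # Start at the earliest cut point that keeps us under the limit, then move the
    # cut forward until the first retained message is NOT a tool result; a tool
    # message must stay with the assistant message that requested it.
    start = len(messages) - max_messages
    while start < len(messages) and messages[start].role == "tool":
        start += 1

    return messages[start:]
```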
Methods
| Method | Description |
|---|---|
add_message() | Add a message to history |
get_messages_for_llm() | Format messages for OpenAI API |
clear() | Clear all history |
ToolRegistry
Registry for available tools.
```python
class ToolRegistry:
    def register(
        self,
        name: str,
        description: str,
        parameters: Dict,
        handler: Callable[[Dict], Awaitable[Any]],
    ) -> None:
        """Register a tool with its schema and handler."""

    def get_tools_schema(self) -> List[Dict]:
        """Get all tool schemas for LLM API."""

    async def execute(self, tool_name: str, arguments: Dict, user_id: str) -> Any:
        """Execute a tool and return its result."""
```
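As a hypothetical usage sketch (the `get_weather` tool and its handler are invented for illustration and do not exist in the codebase):

```python
# Hypothetical example: registering a custom tool against this interface.
async def get_weather(args: dict) -> dict:
    """Illustrative handler; a real implementation would call a weather API."""
    return {"city": args["city"], "forecast": "sunny", "high_f": 72}

registry = ToolRegistry()
registry.register(
    name="get_weather",
    description="Get today's weather forecast for a city",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
    handler=get_weather,
)

# The schemas are what get passed to the LLM for function calling.
tools_for_llm = registry.get_tools_schema()
```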
Data Classes
ThinkingState
```python
class ThinkingState(str, Enum):
    IDLE = "idle"                  # Waiting for input
    PROCESSING = "processing"      # Building request
    TOOL_CALLING = "tool_calling"  # Executing tool
    GENERATING = "generating"      # Streaming response
    COMPLETE = "complete"          # Finished successfully
    CANCELLED = "cancelled"        # User interrupted
    ERROR = "error"                # Error occurred
```
ConversationMessage
```python
@dataclass
class ConversationMessage:
    role: str               # "user", "assistant", "system", "tool"
    content: str
    message_id: str         # Auto-generated UUID
    timestamp: float        # Unix timestamp
    source_mode: str        # "chat" or "voice"
    tool_call_id: str       # For tool results
    tool_calls: List[Dict]  # For assistant messages with tool calls
    citations: List[Dict]   # Source citations
```
ThinkerResponse
```python
@dataclass
class ThinkerResponse:
    text: str                   # Complete response text
    message_id: str             # Unique ID
    citations: List[Dict]       # Source citations
    tool_calls_made: List[str]  # Names of tools called
    latency_ms: int             # Total processing time
    tokens_used: int            # Token count
    state: ThinkingState        # Final state
```
ThinkerMetrics
```python
@dataclass
class ThinkerMetrics:
    total_tokens: int = 0
    tool_calls_count: int = 0
    first_token_latency_ms: int = 0
    total_latency_ms: int = 0
    cancelled: bool = False
```
Available Tools
The ThinkerService automatically registers tools from the unified ToolService:
| Tool | Description | Requires Auth |
|---|---|---|
calendar_create_event | Create calendar events | Yes |
calendar_list_events | List upcoming events | Yes |
calendar_update_event | Modify existing events | Yes |
calendar_delete_event | Remove events | Yes |
web_search | Search the web | No |
pubmed_search | Search medical literature | No |
medical_calculator | Calculate medical scores | No |
kb_search | Search knowledge base | No |
System Prompt
The default system prompt includes:
- Current Time Context: Dynamic date/time with relative calculations
- Conversation Memory: Instructions to use conversation history
- Tool Usage Guidelines: When and how to use each tool
- Response Style: Concise, natural, voice-optimized
```python
def _default_system_prompt(self) -> str:
    tz = pytz.timezone("America/New_York")
    now = datetime.now(tz)
    return f"""You are VoiceAssist, a helpful AI voice assistant.

CURRENT TIME CONTEXT:
- Current date: {now.strftime("%A, %B %d, %Y")}
- Current time: {now.strftime("%I:%M %p %Z")}

CONVERSATION MEMORY:
You have access to the full conversation history...

AVAILABLE TOOLS:
- calendar_create_event: Create events...
- web_search: Search the web...
...

KEY BEHAVIORS:
- Keep responses concise and natural for voice
- Use short sentences (max 15-20 words)
- Avoid abbreviations - say "blood pressure" not "BP"
"""
```
Usage Examples
Basic Query Processing
```python
from app.services.thinker_service import thinker_service

async def handle_voice_query(conversation_id: str, transcript: str, user_id: str):
    # Token streaming callback
    async def on_token(token: str):
        await send_to_tts(token)

    # Create session with callbacks
    session = thinker_service.create_session(
        conversation_id=conversation_id,
        on_token=on_token,
        user_id=user_id,
    )

    # Process the transcript
    response = await session.think(transcript, source_mode="voice")

    print(f"Response: {response.text}")
    print(f"Tools used: {response.tool_calls_made}")
    print(f"Latency: {response.latency_ms}ms")
```
With Tool Call Handling
```python
async def handle_tool_call(event: ToolCallEvent):
    """Called when LLM decides to call a tool."""
    await send_to_client({
        "type": "tool.call",
        "tool_name": event.tool_name,
        "arguments": event.arguments,
    })

async def handle_tool_result(event: ToolResultEvent):
    """Called when tool execution completes."""
    await send_to_client({
        "type": "tool.result",
        "tool_name": event.tool_name,
        "result": event.result,
    })

session = thinker_service.create_session(
    conversation_id="conv-123",
    on_token=on_token,
    on_tool_call=handle_tool_call,
    on_tool_result=handle_tool_result,
    user_id="user-456",
)
```
Cancellation (Barge-in)
```python
# Store session reference
active_session = thinker_service.create_session(...)

# When user barges in:
async def handle_barge_in():
    await active_session.cancel()
    print(f"Cancelled: {active_session.is_cancelled()}")
```
Context Persistence
Conversation contexts are persisted across turns:
```python
# Class-level storage
_conversation_contexts: Dict[str, ConversationContext] = {}
_context_last_access: Dict[str, float] = {}
CONTEXT_TTL_SECONDS = 3600  # 1 hour TTL
```
- Contexts are automatically cleaned up after 1 hour of inactivity
- Same conversation_id reuses existing context
- Context persists across voice and chat modes
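A simplified sketch of how a TTL sweep over these dictionaries could look (illustrative; the actual cleanup logic in the service may differ):

```python
import time
from typing import Dict

def cleanup_stale_contexts(
    contexts: Dict[str, "ConversationContext"],
    last_access: Dict[str, float],
    ttl_seconds: float = 3600,
) -> None:
    """Illustrative sweep: drop contexts idle longer than the TTL."""
    now = time.time()
    for conversation_id in list(contexts.keys()):
        if now - last_access.get(conversation_id, 0) > ttl_seconds:
            contexts.pop(conversation_id, None)
            last_access.pop(conversation_id, None)
```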
Error Handling
```python
try:
    response = await session.think(transcript)
except Exception as e:
    # Errors are caught and returned in response
    response = ThinkerResponse(
        text=f"I apologize, but I encountered an error: {str(e)}",
        message_id=message_id,
        state=ThinkingState.ERROR,
    )
```
Related Documentation
Talker Service
Location: `services/api-gateway/app/services/talker_service.py`
Status: Production Ready Last Updated: 2025-12-01
Overview
The TalkerService handles text-to-speech synthesis for the Thinker-Talker voice pipeline. It streams LLM tokens through a sentence chunker and synthesizes speech via ElevenLabs for gapless audio playback.
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ TalkerService │
│ │
│ LLM Tokens ──►┌──────────────────┐ │
│ │ Markdown Buffer │ (accumulates for pattern │
│ │ │ detection before strip) │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ SentenceChunker │ (splits at natural │
│ │ (40-120-200 chars)│ boundaries) │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ strip_markdown │ (removes **bold**, │
│ │ _for_tts() │ [links](url), LaTeX) │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ ElevenLabs TTS │ (streaming synthesis │
│ │ (sequential) │ with previous_text) │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ Audio Chunks ──► on_audio_chunk callback │
└─────────────────────────────────────────────────────────────────┘
Classes
TalkerService
Main service class (singleton pattern).
```python
from app.services.talker_service import talker_service

# Check if TTS is available
if talker_service.is_enabled():
    # Start a speaking session (uses DEFAULT_VOICE_ID from voice_constants.py)
    session = await talker_service.start_session(
        on_audio_chunk=handle_audio,
        voice_config=VoiceConfig(
            # voice_id defaults to DEFAULT_VOICE_ID (Brian)
            stability=0.65,
        ),
    )

    # Feed tokens from LLM
    for token in llm_stream:
        await session.add_token(token)

    # Finish and get metrics
    metrics = await session.finish()
```
Methods
| Method | Description | Parameters | Returns |
|---|---|---|---|
is_enabled() | Check if TTS is available | None | bool |
get_provider() | Get active TTS provider | None | TTSProvider |
start_session() | Start a TTS session | on_audio_chunk, voice_config | TalkerSession |
synthesize_text() | Simple text synthesis | text, voice_config | AsyncIterator[bytes] |
get_available_voices() | List available voices | None | List[Dict] |
TalkerSession
Session class for streaming TTS.
```python
class TalkerSession:
    """
    A single TTS speaking session with streaming support.

    Manages the flow:
    1. Receive LLM tokens
    2. Chunk into sentences
    3. Synthesize each sentence
    4. Stream audio chunks to callback
    """
```
Methods
| Method | Description | Parameters | Returns |
|---|---|---|---|
add_token() | Add token from LLM | token: str | None |
finish() | Complete synthesis | None | TalkerMetrics |
cancel() | Cancel for barge-in | None | None |
get_metrics() | Get session metrics | None | TalkerMetrics |
Properties
| Property | Type | Description |
|---|---|---|
state | TalkerState | Current state |
AudioQueue
Queue management for gapless playback.
```python
class AudioQueue:
    """
    Manages audio chunks for gapless playback with cancellation support.

    Features:
    - Async queue for audio chunks
    - Cancellation clears pending audio
    - Tracks queue state
    """

    async def put(self, chunk: AudioChunk) -> bool
    async def get(self) -> Optional[AudioChunk]
    async def cancel(self) -> None
    def finish(self) -> None
    def reset(self) -> None
```
Data Classes
TalkerState
```python
class TalkerState(str, Enum):
    IDLE = "idle"            # Ready for input
    SPEAKING = "speaking"    # Synthesizing/playing
    CANCELLED = "cancelled"  # Interrupted by barge-in
```
TTSProvider
```python
class TTSProvider(str, Enum):
    ELEVENLABS = "elevenlabs"
    OPENAI = "openai"  # Fallback
```
VoiceConfig
Note: The default voice is configured in `app/core/voice_constants.py`. See Voice Configuration for details.
```python
from app.core.voice_constants import DEFAULT_VOICE_ID, DEFAULT_TTS_MODEL

@dataclass
class VoiceConfig:
    provider: TTSProvider = TTSProvider.ELEVENLABS
    voice_id: str = DEFAULT_VOICE_ID   # Brian (from voice_constants.py)
    model_id: str = DEFAULT_TTS_MODEL  # eleven_flash_v2_5
    stability: float = 0.65            # 0.0-1.0, higher = consistent
    similarity_boost: float = 0.80     # 0.0-1.0, higher = clearer
    style: float = 0.15                # 0.0-1.0, lower = natural
    use_speaker_boost: bool = True
    output_format: str = "pcm_24000"
```
AudioChunk
```python
@dataclass
class AudioChunk:
    data: bytes          # Raw audio bytes
    format: str          # "pcm16" or "mp3"
    is_final: bool       # True for last chunk
    sentence_index: int  # Which sentence this is from
    latency_ms: int      # Time since synthesis started
```
TalkerMetrics
```python
@dataclass
class TalkerMetrics:
    sentences_processed: int = 0
    total_chars_synthesized: int = 0
    total_audio_bytes: int = 0
    total_latency_ms: int = 0
    first_audio_latency_ms: int = 0
    cancelled: bool = False
```
Sentence Chunking
The TalkerSession uses SentenceChunker with these settings:
```python
self._chunker = SentenceChunker(
    ChunkerConfig(
        min_chunk_chars=40,       # Avoid tiny fragments
        optimal_chunk_chars=120,  # Full sentences
        max_chunk_chars=200,      # Allow complete thoughts
    )
)
```
Why These Settings?
| Parameter | Value | Rationale |
|---|---|---|
min_chunk_chars | 40 | Prevents choppy TTS from short phrases |
optimal_chunk_chars | 120 | Full sentences sound more natural |
max_chunk_chars | 200 | Prevents excessive buffering |
Trade-off: Larger chunks = better prosody but higher latency to first audio.
Markdown Stripping
LLM responses often contain markdown that sounds unnatural when spoken:
````python
def strip_markdown_for_tts(text: str) -> str:
    """
    Converts:
    - [Link Text](URL) → "Link Text"
    - **bold** → "bold"
    - *italic* → "italic"
    - `code` → "code"
    - ```blocks``` → (removed)
    - # Headers → "Headers"
    - LaTeX formulas → (removed)
    """
````
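As a rough illustration of the kinds of transforms involved (the production function also handles LaTeX and other cases), a minimal regex-based subset could look like this:

````python
import re

def strip_markdown_for_tts_sketch(text: str) -> str:
    """Illustrative subset of the real stripping logic (not the production code)."""
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)  # drop fenced code blocks
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)    # [Link Text](URL) -> Link Text
    text = re.sub(r"\*\*([^*]+)\*\*", r"\1", text)          # **bold** -> bold
    text = re.sub(r"\*([^*]+)\*", r"\1", text)              # *italic* -> italic
    text = re.sub(r"`([^`]+)`", r"\1", text)                # `code` -> code
    text = re.sub(r"^#+\s*", "", text, flags=re.MULTILINE)  # # Headers -> Headers
    return text
````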
Markdown-Aware Token Buffering
The TalkerSession buffers tokens to detect incomplete patterns:
```python
def _process_markdown_token(self, token: str) -> str:
    """
    Accumulates tokens to detect patterns that should be stripped:
    - Markdown links: [text](url) - wait for closing )
    - LaTeX display: [ ... ] with backslashes
    - LaTeX inline: \\( ... \\)
    - Bold/italic: **text** - wait for closing **
    """
```
This prevents sending "[Link Te" to TTS before we know it's a markdown link.
Voice Continuity
For consistent voice across sentences:
```python
async for audio_data in self._elevenlabs.synthesize_stream(
    text=tts_text,
    previous_text=self._previous_text,  # Context for voice continuity
    ...
):
    ...

# Update for next synthesis
self._previous_text = tts_text
```
The previous_text parameter helps ElevenLabs maintain consistent prosody.
Sequential Synthesis
To prevent voice variations between chunks:
```python
# Semaphore ensures one synthesis at a time
self._synthesis_semaphore = asyncio.Semaphore(1)

async with self._synthesis_semaphore:
    async for audio_data in self._elevenlabs.synthesize_stream(...):
        ...
```
Parallel synthesis can cause noticeable voice quality differences between sentences.
Usage Examples
Basic Token Streaming
```python
async def handle_llm_response(llm_stream):
    async def on_audio_chunk(chunk: AudioChunk):
        # Send to client via WebSocket
        await websocket.send_json({
            "type": "audio.output",
            "audio": base64.b64encode(chunk.data).decode(),
            "is_final": chunk.is_final,
        })

    session = await talker_service.start_session(on_audio_chunk=on_audio_chunk)

    async for token in llm_stream:
        await session.add_token(token)

    metrics = await session.finish()
    print(f"Synthesized {metrics.sentences_processed} sentences")
    print(f"First audio in {metrics.first_audio_latency_ms}ms")
```
Custom Voice Configuration
```python
config = VoiceConfig(
    voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel (female)
    model_id="eleven_flash_v2_5",     # Lower latency
    stability=0.65,                   # More variation
    similarity_boost=0.90,            # Very clear
    style=0.15,                       # Slightly expressive
)

session = await talker_service.start_session(
    on_audio_chunk=handle_audio,
    voice_config=config,
)
```
Handling Barge-in
```python
active_session = None

async def start_speaking(llm_stream):
    global active_session
    active_session = await talker_service.start_session(on_audio_chunk=send_audio)

    for token in llm_stream:
        if active_session.is_cancelled():
            break
        await active_session.add_token(token)

    await active_session.finish()

async def handle_barge_in():
    global active_session
    if active_session:
        await active_session.cancel()
        # Cancels pending synthesis and clears audio queue
```
Simple Text Synthesis
```python
# For non-streaming use cases
async for audio_chunk in talker_service.synthesize_text(
    text="Hello, how can I help you today?",
    voice_config=VoiceConfig(voice_id="TxGEqnHWrfWFTfGW9XjX"),
):
    await send_audio(audio_chunk)
```
Available Voices
```python
voices = talker_service.get_available_voices()

# Returns:
[
    {"id": "TxGEqnHWrfWFTfGW9XjX", "name": "Josh", "gender": "male", "premium": True},
    {"id": "pNInz6obpgDQGcFmaJgB", "name": "Adam", "gender": "male", "premium": True},
    {"id": "EXAVITQu4vr4xnSDxMaL", "name": "Bella", "gender": "female", "premium": True},
    {"id": "21m00Tcm4TlvDq8ikWAM", "name": "Rachel", "gender": "female", "premium": True},
    # ... more voices
]
```
Performance Tuning
Latency Optimization
| Setting | Lower Latency | Higher Quality |
|---|---|---|
model_id | eleven_flash_v2_5 | eleven_turbo_v2_5 |
min_chunk_chars | 15 | 40 |
optimal_chunk_chars | 50 | 120 |
output_format | pcm_24000 | mp3_44100_192 |
Quality Optimization
| Setting | More Natural | More Consistent |
|---|---|---|
stability | 0.50 | 0.85 |
similarity_boost | 0.70 | 0.90 |
style | 0.20 | 0.05 |
Error Handling
Synthesis errors don't fail the entire session:
```python
async def _synthesize_sentence(self, sentence: str) -> None:
    try:
        async for audio_data in self._elevenlabs.synthesize_stream(...):
            if self._state == TalkerState.CANCELLED:
                return
            await self._on_audio_chunk(chunk)
    except Exception as e:
        logger.error(f"TTS synthesis error: {e}")
        # Session continues, just skips this sentence
```
Related Documentation
Voice Pipeline WebSocket API
Endpoint: `wss://{host}/api/voice/pipeline-ws`
Protocol: JSON over WebSocket Status: Production Ready Last Updated: 2025-12-02
Overview
The Voice Pipeline WebSocket provides bidirectional communication for the Thinker-Talker voice mode. It handles audio streaming, transcription, LLM responses, and TTS playback.
Connection
Authentication
Include JWT token in connection URL or headers:
const ws = new WebSocket(`wss://assist.asimo.io/api/voice/pipeline-ws?token=${accessToken}`);
Connection Lifecycle
1. Client connects with auth token
│
2. Server accepts, creates pipeline session
│
3. Server sends: session.ready
│
4. Client sends: session.init (optional config)
│
5. Server acknowledges: session.init.ack
│
6. Voice mode active - bidirectional streaming
│
7. Client or server closes connection
Message Format
All messages are JSON objects with a type field:
{ "type": "message_type", "field1": "value1", "field2": "value2" }
Client → Server Messages
session.init
Initialize or reconfigure the session.
{ "type": "session.init", "conversation_id": "conv-123", "voice_settings": { "voice_id": "TxGEqnHWrfWFTfGW9XjX", "language": "en", "barge_in_enabled": true } }
| Field | Type | Required | Description |
|---|---|---|---|
conversation_id | string | No | Link to existing chat conversation |
voice_settings.voice_id | string | No | ElevenLabs voice ID |
voice_settings.language | string | No | STT language code (default: "en") |
voice_settings.barge_in_enabled | boolean | No | Allow user interruption (default: true) |
audio.input
Stream audio from microphone.
{ "type": "audio.input", "audio": "base64_encoded_pcm16_audio" }
| Field | Type | Required | Description |
|---|---|---|---|
audio | string | Yes | Base64-encoded PCM16 audio (16kHz, mono) |
Audio Format Requirements (see the encoding sketch after this list):
- Sample rate: 16000 Hz
- Channels: 1 (mono)
- Bit depth: 16-bit signed PCM
- Encoding: Little-endian
- Chunk size: ~100ms recommended (1600 samples)
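A sketch of producing this payload from Float32 microphone samples (for example, from an AudioWorklet); the `encodeAudioInput` helper is illustrative, and it assumes the platform's native little-endian byte order:

```typescript
// Sketch: encode Float32 mic samples as base64 PCM16 (16 kHz mono, little-endian).
function encodeAudioInput(samples: Float32Array): string {
  const pcm16 = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to [-1, 1]
    pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;      // scale to signed 16-bit
  }

  // Base64-encode the raw bytes
  const bytes = new Uint8Array(pcm16.buffer);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

// Usage: ws.send(JSON.stringify({ type: "audio.input", audio: encodeAudioInput(chunk) }));
```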
audio.input.complete
Signal end of user speech (manual commit).
{ "type": "audio.input.complete" }
Normally, VAD auto-detects speech end. Use this for push-to-talk implementations.
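For a push-to-talk UI, the commit could be sent when the user releases the talk button (sketch; `ws` and `talkButton` are assumed to already exist in scope):

```typescript
// Sketch: push-to-talk release sends a manual end-of-speech commit.
declare const ws: WebSocket;
declare const talkButton: HTMLButtonElement;

talkButton.addEventListener("pointerup", () => {
  ws.send(JSON.stringify({ type: "audio.input.complete" }));
});
```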
barge_in
Interrupt AI response.
{ "type": "barge_in" }
When received:
- Cancels TTS synthesis
- Clears audio queue
- Resets pipeline to listening state
message
Send text input (fallback when mic unavailable).
{ "type": "message", "content": "What's the weather like?" }
ping
Keep-alive heartbeat.
{ "type": "ping" }
Server responds with pong.
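A client-side keep-alive could look like the sketch below; the 15-second interval is an arbitrary illustrative choice, not a documented requirement:

```typescript
// Sketch: periodic keep-alive ping; the server replies with { "type": "pong" }.
declare const ws: WebSocket;

const keepAlive = setInterval(() => {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({ type: "ping" }));
  }
}, 15_000);

ws.addEventListener("close", () => clearInterval(keepAlive));
```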
Server → Client Messages
session.ready
Session initialized successfully.
{ "type": "session.ready", "session_id": "sess-abc123", "pipeline_mode": "thinker_talker" }
session.init.ack
Acknowledges session.init message.
{ "type": "session.init.ack" }
transcript.delta
Partial STT transcript (streaming).
{ "type": "transcript.delta", "text": "What is the", "is_final": false }
| Field | Type | Description |
|---|---|---|
text | string | Partial transcript text |
is_final | boolean | Always false for delta |
transcript.complete
Final STT transcript.
{ "type": "transcript.complete", "text": "What is the weather today?", "message_id": "msg-xyz789" }
| Field | Type | Description |
|---|---|---|
text | string | Complete transcript |
message_id | string | Unique message identifier |
response.delta
Streaming LLM response token.
{ "type": "response.delta", "delta": "The", "message_id": "resp-123" }
| Field | Type | Description |
|---|---|---|
delta | string | Response token/chunk |
message_id | string | Response message ID |
response.complete
Complete LLM response.
{ "type": "response.complete", "text": "The weather today is sunny with a high of 72 degrees.", "message_id": "resp-123" }
audio.output
TTS audio chunk.
{ "type": "audio.output", "audio": "base64_encoded_pcm_audio", "is_final": false, "sentence_index": 0 }
| Field | Type | Description |
|---|---|---|
audio | string | Base64-encoded PCM audio (24kHz, mono) |
is_final | boolean | True for last chunk |
sentence_index | number | Which sentence this is from |
Output Audio Format:
- Sample rate: 24000 Hz
- Channels: 1 (mono)
- Bit depth: 16-bit signed PCM
- Encoding: Little-endian
tool.call
Tool invocation started.
{ "type": "tool.call", "id": "call-abc", "name": "calendar_list_events", "arguments": { "start_date": "2025-12-01", "end_date": "2025-12-07" } }
| Field | Type | Description |
|---|---|---|
id | string | Tool call ID |
name | string | Tool function name |
arguments | object | Tool arguments |
tool.result
Tool execution completed.
{ "type": "tool.result", "id": "call-abc", "name": "calendar_list_events", "result": { "events": [{ "title": "Team Meeting", "start": "2025-12-02T10:00:00" }] } }
| Field | Type | Description |
|---|---|---|
id | string | Tool call ID |
name | string | Tool function name |
result | any | Tool execution result |
voice.state
Pipeline state change.
{ "type": "voice.state", "state": "speaking" }
| State | Description |
|---|---|
idle | Waiting for user input |
listening | Receiving audio, STT active |
processing | LLM thinking |
speaking | TTS playing |
cancelled | Barge-in occurred |
heartbeat
Server heartbeat (every 30s).
{ "type": "heartbeat" }
pong
Response to client ping.
{ "type": "pong" }
error
Error occurred.
{ "type": "error", "code": "stt_failed", "message": "Speech-to-text service unavailable", "recoverable": true }
| Field | Type | Description |
|---|---|---|
code | string | Error code |
message | string | Human-readable message |
recoverable | boolean | True if client can retry |
Error Codes (see the handling sketch after the table):
| Code | Description | Recoverable |
|---|---|---|
invalid_json | Malformed JSON message | Yes |
connection_failed | Pipeline init failed | No |
stt_failed | STT service error | Yes |
llm_failed | LLM service error | Yes |
tts_failed | TTS service error | Yes |
auth_failed | Authentication error | No |
rate_limited | Too many requests | Yes |
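One way a client might route these errors using the `recoverable` flag (sketch; the retry/backoff policy shown is illustrative, not prescribed by the protocol):

```typescript
// Sketch: route errors based on the `recoverable` flag.
interface PipelineError {
  type: "error";
  code: string;
  message: string;
  recoverable: boolean;
}

function handlePipelineError(err: PipelineError, reconnect: () => void, teardown: () => void) {
  console.warn(`Pipeline error [${err.code}]: ${err.message}`);
  if (!err.recoverable) {
    teardown();                   // e.g. auth_failed, connection_failed
    return;
  }
  if (err.code === "rate_limited") {
    setTimeout(reconnect, 5_000); // back off before retrying (illustrative delay)
  } else {
    reconnect();                  // transient stt/llm/tts failures
  }
}
```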
Example: Complete Session
```javascript
// 1. Connect
const ws = new WebSocket(`wss://assist.asimo.io/api/voice/pipeline-ws?token=${token}`);

ws.onopen = () => {
  console.log("Connected");
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);

  switch (msg.type) {
    case "session.ready":
      // 2. Initialize with settings
      ws.send(
        JSON.stringify({
          type: "session.init",
          conversation_id: currentConversationId,
          voice_settings: {
            voice_id: "TxGEqnHWrfWFTfGW9XjX",
            language: "en",
          },
        }),
      );
      break;

    case "session.init.ack":
      // 3. Start sending audio
      startMicrophoneCapture();
      break;

    case "transcript.delta":
      // Show partial transcript
      updatePartialTranscript(msg.text);
      break;

    case "transcript.complete":
      // Show final transcript
      setTranscript(msg.text);
      break;

    case "response.delta":
      // Append LLM response
      appendResponse(msg.delta);
      break;

    case "audio.output":
      // Play TTS audio
      if (msg.audio) {
        const pcm = base64ToArrayBuffer(msg.audio);
        audioPlayer.queueChunk(pcm);
      }
      if (msg.is_final) {
        audioPlayer.finish();
      }
      break;

    case "tool.call":
      // Show tool being called
      showToolCall(msg.name, msg.arguments);
      break;

    case "tool.result":
      // Show tool result
      showToolResult(msg.name, msg.result);
      break;

    case "error":
      console.error(`Error [${msg.code}]: ${msg.message}`);
      if (!msg.recoverable) {
        ws.close();
      }
      break;
  }
};

// Send audio chunks from microphone
function sendAudioChunk(pcmData) {
  ws.send(
    JSON.stringify({
      type: "audio.input",
      audio: arrayBufferToBase64(pcmData),
    }),
  );
}

// Handle barge-in (user speaks while AI is talking)
function handleBargeIn() {
  ws.send(JSON.stringify({ type: "barge_in" }));
  audioPlayer.stop();
}
```
Configuration Reference
TTSessionConfig (Backend)
```python
@dataclass
class TTSessionConfig:
    user_id: str
    session_id: str
    conversation_id: Optional[str] = None

    # Voice settings
    voice_id: str = "TxGEqnHWrfWFTfGW9XjX"
    tts_model: str = "eleven_flash_v2_5"
    language: str = "en"

    # STT settings
    stt_sample_rate: int = 16000
    stt_endpointing_ms: int = 800
    stt_utterance_end_ms: int = 1500

    # Barge-in
    barge_in_enabled: bool = True

    # Timeouts
    connection_timeout_sec: float = 10.0
    idle_timeout_sec: float = 300.0
```
Rate Limiting
| Limit | Value |
|---|---|
| Max concurrent sessions per user | 2 |
| Max concurrent sessions total | 100 |
| Audio chunk rate | ~10/second recommended |
| Idle timeout | 300 seconds |
Related Documentation
Thinker-Talker Frontend Hooks
Location: `apps/web-app/src/hooks/`
Status: Production Ready Last Updated: 2025-12-01
Overview
The Thinker-Talker frontend integration consists of several React hooks that manage WebSocket connections, audio capture, and playback. These hooks provide a complete voice mode implementation.
Hook Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Voice Mode Components │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ useThinkerTalkerVoiceMode │ │
│ │ (High-level orchestration hook) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────┴───────────────┐ │
│ ▼ ▼ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ useThinkerTalkerSession │ │ useTTAudioPlayback │ │
│ │ (WebSocket + Protocol) │ │ (Audio Queue + Play) │ │
│ └─────────────────────────┘ └─────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ WebSocket API │ │ Web Audio API │ │
│ │ (Backend T/T) │ │ (AudioContext) │ │
│ └─────────────────────────┘ └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
useThinkerTalkerSession
Main hook for WebSocket communication with the T/T pipeline.
Import
import { useThinkerTalkerSession } from "../hooks/useThinkerTalkerSession";
Usage
```typescript
const {
  status,
  error,
  transcript,
  partialTranscript,
  pipelineState,
  currentToolCalls,
  metrics,
  connect,
  disconnect,
  sendAudioChunk,
  bargeIn,
} = useThinkerTalkerSession({
  conversation_id: "conv-123",
  voiceSettings: {
    voice_id: "TxGEqnHWrfWFTfGW9XjX",
    language: "en",
    barge_in_enabled: true,
  },
  onTranscript: (t) => console.log("Transcript:", t.text),
  onResponseDelta: (delta, id) => appendToChat(delta),
  onAudioChunk: (audio) => playAudio(audio),
  onToolCall: (tool) => showToolUI(tool),
});
```
Options
```typescript
interface UseThinkerTalkerSessionOptions {
  conversation_id?: string;
  voiceSettings?: TTVoiceSettings;
  onTranscript?: (transcript: TTTranscript) => void;
  onResponseDelta?: (delta: string, messageId: string) => void;
  onResponseComplete?: (content: string, messageId: string) => void;
  onAudioChunk?: (audioBase64: string) => void;
  onToolCall?: (toolCall: TTToolCall) => void;
  onToolResult?: (toolCall: TTToolCall) => void;
  onError?: (error: Error) => void;
  onConnectionChange?: (status: TTConnectionStatus) => void;
  onPipelineStateChange?: (state: PipelineState) => void;
  onMetricsUpdate?: (metrics: TTVoiceMetrics) => void;
  onSpeechStarted?: () => void;
  onStopPlayback?: () => void;
  autoConnect?: boolean;
}
```
Return Values
| Field | Type | Description |
|---|---|---|
status | TTConnectionStatus | Connection state |
error | Error | null | Last error |
transcript | string | Final user transcript |
partialTranscript | string | Streaming transcript |
pipelineState | PipelineState | Backend pipeline state |
currentToolCalls | TTToolCall[] | Active tool calls |
metrics | TTVoiceMetrics | Performance metrics |
connect | () => Promise<void> | Start session |
disconnect | () => void | End session |
sendAudioChunk | (data: ArrayBuffer) => void | Send audio |
bargeIn | () => void | Interrupt AI |
Types
```typescript
type TTConnectionStatus =
  | "disconnected"
  | "connecting"
  | "connected"
  | "ready"
  | "reconnecting"
  | "error"
  | "failed"
  | "mic_permission_denied";

type PipelineState = "idle" | "listening" | "processing" | "speaking" | "cancelled";

interface TTTranscript {
  text: string;
  is_final: boolean;
  timestamp: number;
  message_id?: string;
}

interface TTToolCall {
  id: string;
  name: string;
  arguments: Record<string, unknown>;
  status: "pending" | "running" | "completed" | "failed";
  result?: unknown;
}

interface TTVoiceMetrics {
  connectionTimeMs: number | null;
  sttLatencyMs: number | null;
  llmFirstTokenMs: number | null;
  ttsFirstAudioMs: number | null;
  totalLatencyMs: number | null;
  sessionDurationMs: number | null;
  userUtteranceCount: number;
  aiResponseCount: number;
  toolCallCount: number;
  bargeInCount: number;
  reconnectCount: number;
  sessionStartedAt: number | null;
}

interface TTVoiceSettings {
  voice_id?: string;
  language?: string;
  barge_in_enabled?: boolean;
  tts_model?: string;
}
```
Reconnection
The hook implements automatic reconnection with exponential backoff:
```typescript
const MAX_RECONNECT_ATTEMPTS = 5;
const BASE_RECONNECT_DELAY = 300; // 300ms
const MAX_RECONNECT_DELAY = 30000; // 30s

// Delay calculation (exponential backoff, capped at the maximum)
const delay = Math.min(BASE_RECONNECT_DELAY * 2 ** attempt, MAX_RECONNECT_DELAY);
```
Fatal errors (mic permission denied) do not trigger reconnection.
useTTAudioPlayback
Handles streaming PCM audio playback with queue management.
Import
import { useTTAudioPlayback } from "../hooks/useTTAudioPlayback";
Usage
```typescript
const { isPlaying, queuedChunks, currentLatency, playAudioChunk, stopPlayback, clearQueue, getAudioContext } =
  useTTAudioPlayback({
    sampleRate: 24000,
    onPlaybackStart: () => console.log("Started playing"),
    onPlaybackEnd: () => console.log("Finished playing"),
    onError: (err) => console.error("Playback error:", err),
  });

// Queue audio from WebSocket
function handleAudioChunk(base64Audio: string) {
  const pcmData = base64ToArrayBuffer(base64Audio);
  playAudioChunk(pcmData);
}

// Handle barge-in
function handleBargeIn() {
  stopPlayback();
  clearQueue();
}
```
Options
```typescript
interface UseTTAudioPlaybackOptions {
  sampleRate?: number; // Default: 24000
  bufferSize?: number; // Default: 4096
  onPlaybackStart?: () => void;
  onPlaybackEnd?: () => void;
  onError?: (error: Error) => void;
}
```
Return Values
| Field | Type | Description |
|---|---|---|
isPlaying | boolean | Audio currently playing |
queuedChunks | number | Chunks waiting to play |
currentLatency | number | Playback latency (ms) |
playAudioChunk | (data: ArrayBuffer) => void | Queue chunk |
stopPlayback | () => void | Stop immediately |
clearQueue | () => void | Clear pending chunks |
getAudioContext | () => AudioContext | Get context |
Audio Format
Expects 24kHz mono PCM16 (little-endian):
```typescript
// Convert base64 to playable audio
function base64ToFloat32(base64: string): Float32Array {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }

  // Convert PCM16 to Float32 for Web Audio
  const pcm16 = new Int16Array(bytes.buffer);
  const float32 = new Float32Array(pcm16.length);
  for (let i = 0; i < pcm16.length; i++) {
    float32[i] = pcm16[i] / 32768;
  }
  return float32;
}
```
useThinkerTalkerVoiceMode
High-level orchestration combining session and playback.
Import
import { useThinkerTalkerVoiceMode } from "../hooks/useThinkerTalkerVoiceMode";
Usage
```typescript
const {
  // Connection
  isConnected,
  isConnecting,
  connectionError,
  // State
  voiceState,
  isListening,
  isProcessing,
  isSpeaking,
  // Transcripts
  transcript,
  partialTranscript,
  // Audio
  isPlaying,
  audioLevel,
  // Tools
  activeToolCalls,
  // Metrics
  metrics,
  // Actions
  connect,
  disconnect,
  toggleVoice,
  bargeIn,
} = useThinkerTalkerVoiceMode({
  conversationId: "conv-123",
  voiceId: "TxGEqnHWrfWFTfGW9XjX",
  onTranscriptComplete: (text) => addMessage("user", text),
  onResponseComplete: (text) => addMessage("assistant", text),
});
```
Options
```typescript
interface UseThinkerTalkerVoiceModeOptions {
  conversationId?: string;
  voiceId?: string;
  language?: string;
  bargeInEnabled?: boolean;
  autoConnect?: boolean;
  onTranscriptComplete?: (text: string) => void;
  onResponseDelta?: (delta: string) => void;
  onResponseComplete?: (text: string) => void;
  onToolCall?: (tool: TTToolCall) => void;
  onError?: (error: Error) => void;
}
```
Return Values
| Field | Type | Description |
|---|---|---|
isConnected | boolean | WebSocket connected |
isConnecting | boolean | Connection in progress |
connectionError | Error | null | Connection error |
voiceState | PipelineState | Current state |
isListening | boolean | STT active |
isProcessing | boolean | LLM thinking |
isSpeaking | boolean | TTS playing |
transcript | string | Final transcript |
partialTranscript | string | Partial transcript |
isPlaying | boolean | Audio playing |
audioLevel | number | Mic level (0-1) |
activeToolCalls | TTToolCall[] | Current tools |
metrics | TTVoiceMetrics | Performance data |
connect | () => Promise<void> | Start voice |
disconnect | () => void | End voice |
toggleVoice | () => void | Toggle on/off |
bargeIn | () => void | Interrupt |
useVoicePreferencesSync
Syncs voice settings with backend.
Import
import { useVoicePreferencesSync } from "../hooks/useVoicePreferencesSync";
Usage
```typescript
const { preferences, isLoading, error, updatePreferences, resetToDefaults } = useVoicePreferencesSync();

// Update voice
await updatePreferences({
  voice_id: "21m00Tcm4TlvDq8ikWAM", // Rachel
  stability: 0.7,
  similarity_boost: 0.8,
});
```
Return Values
| Field | Type | Description |
|---|---|---|
preferences | VoicePreferences | Current settings |
isLoading | boolean | Loading state |
error | Error | null | Last error |
updatePreferences | (prefs) => Promise | Save settings |
resetToDefaults | () => Promise | Reset all |
Complete Example
```tsx
import React, { useCallback } from "react";
import { useThinkerTalkerVoiceMode } from "../hooks/useThinkerTalkerVoiceMode";
import { useVoicePreferencesSync } from "../hooks/useVoicePreferencesSync";

function VoicePanel({ conversationId }: { conversationId: string }) {
  const { preferences } = useVoicePreferencesSync();

  const {
    isConnected,
    isConnecting,
    voiceState,
    transcript,
    partialTranscript,
    activeToolCalls,
    metrics,
    connect,
    disconnect,
    bargeIn,
  } = useThinkerTalkerVoiceMode({
    conversationId,
    voiceId: preferences.voice_id,
    onTranscriptComplete: useCallback((text) => {
      console.log("User said:", text);
    }, []),
    onResponseComplete: useCallback((text) => {
      console.log("AI said:", text);
    }, []),
    onToolCall: useCallback((tool) => {
      console.log("Tool called:", tool.name);
    }, []),
  });

  return (
    <div className="voice-panel">
      {/* Connection status */}
      <div className="status">
        {isConnecting ? "Connecting..." : isConnected ? `Status: ${voiceState}` : "Disconnected"}
      </div>

      {/* Transcript display */}
      <div className="transcript">{transcript || partialTranscript || "Listening..."}</div>

      {/* Tool calls */}
      {activeToolCalls.map((tool) => (
        <div key={tool.id} className="tool-call">
          {tool.name}: {tool.status}
        </div>
      ))}

      {/* Metrics */}
      <div className="metrics">Latency: {metrics.totalLatencyMs}ms</div>

      {/* Controls */}
      <button onClick={isConnected ? disconnect : connect}>{isConnected ? "Stop" : "Start"} Voice</button>
      {voiceState === "speaking" && <button onClick={bargeIn}>Interrupt</button>}
    </div>
  );
}
```
Error Handling
Microphone Permission
```tsx
// The hook detects permission errors
if (status === "mic_permission_denied") {
  return (
    <div className="error">
      <p>Microphone access is required for voice mode.</p>
      <button onClick={requestMicPermission}>Allow Microphone</button>
    </div>
  );
}
```
Connection Errors
```tsx
const { error, status, reconnectAttempts } = useThinkerTalkerSession({
  onError: (err) => {
    if (isMicPermissionError(err)) {
      showPermissionDialog();
    } else {
      showErrorToast(err.message);
    }
  },
});

if (status === "reconnecting") {
  return <div>Reconnecting... (attempt {reconnectAttempts}/5)</div>;
}

if (status === "failed") {
  return <div>Connection failed. Please refresh.</div>;
}
```
Performance Tips
1. Memoize Callbacks
```typescript
const onTranscript = useCallback((t: TTTranscript) => {
  // Handle transcript
}, []);

const onAudioChunk = useCallback(
  (audio: string) => {
    playAudioChunk(base64ToArrayBuffer(audio));
  },
  [playAudioChunk],
);
```
2. Avoid Re-renders
```typescript
// Use refs for frequently updating values
const metricsRef = useRef(metrics);
useEffect(() => {
  metricsRef.current = metrics;
}, [metrics]);
```
3. Batch State Updates
```typescript
// In the hook implementation
const handleMessage = useCallback((msg) => {
  // React 18 automatically batches these
  setTranscript(msg.text);
  setPipelineState(msg.state);
  setMetrics((prev) => ({ ...prev, ...msg.metrics }));
}, []);
```