Voice Mode Architecture

Comprehensive technical reference for the VoiceAssist Voice Mode implementation, covering the Thinker/Talker pipeline, providers, and latency characteristics.

Voice Mode Stack

STT Provider

  • Deepgram (Primary)
  • Whisper (Fallback)
  • 100-150ms latency

LLM Layer

  • GPT-4o (Cloud)
  • Llama (Local/PHI)
  • Streaming tokens

TTS Provider

  • ElevenLabs (Primary)
  • OpenAI TTS (Fallback)
  • 28+ languages

Latency Target

  • <500ms end-to-end
  • Streaming at all stages
  • Barge-in support

Voice Mode v2.0 - Thinker/Talker Pipeline

Document Purpose

This document provides a comprehensive technical reference for the VoiceAssist Voice Mode implementation. It covers the current architecture, identifies known limitations, and serves as the authoritative source for understanding how voice interactions work in the system.

Overview

VoiceAssist implements a sophisticated voice-first interface for healthcare professionals, enabling natural spoken interactions with an AI medical assistant. The system uses a Thinker/Talker pipeline architecture that decouples speech recognition, language model reasoning, and speech synthesis for maximum flexibility and low latency.

High-Level Architecture


Current Implementation of Voice Mode

End-to-End Pipeline

The voice interaction follows this sequence: audio capture → speech-to-text → LLM reasoning (Thinker) → text-to-speech (Talker) → streamed audio playback. Each stage is described below.

Audio Capture

Audio is captured using the Web Audio API and MediaRecorder (a minimal capture sketch follows the list):

  • Hook: useThinkerTalkerSession.ts manages the voice session
  • Component: ThinkerTalkerVoicePanel.tsx provides the UI
  • Capture: MediaRecorder API with audio/webm;codecs=opus encoding
  • Sample Rate: 16kHz mono (resampled for Deepgram)
  • Chunk Size: 250ms intervals for streaming
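
A minimal sketch of this capture flow is shown below; the WebSocket endpoint and message framing are illustrative, not the actual protocol used by useThinkerTalkerSession.ts.

// Minimal capture sketch (assumed WebSocket framing).
async function startCapture(ws: WebSocket): Promise<MediaRecorder> {
  // Request a mono input stream; browsers may ignore sample-rate hints,
  // so the backend resamples to 16 kHz before forwarding to Deepgram.
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { channelCount: 1, echoCancellation: true, noiseSuppression: true },
  });

  const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm;codecs=opus' });

  // Emit a chunk every 250 ms and forward it over the voice WebSocket.
  recorder.ondataavailable = (event: BlobEvent) => {
    if (event.data.size > 0 && ws.readyState === WebSocket.OPEN) {
      ws.send(event.data);
    }
  };

  recorder.start(250); // 250 ms chunk interval for streaming
  return recorder;
}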

Speech-to-Text (STT) Providers

Deepgram is the primary STT provider, chosen for its low-latency streaming capabilities.

| Property   | Value                                                  |
|------------|--------------------------------------------------------|
| Mode       | WebSocket streaming                                    |
| Latency    | 100-150ms to first transcript                          |
| Features   | Interim results, VAD events, punctuation, diarization  |
| Languages  | English (primary), multilingual support                |
| Config Key | DEEPGRAM_API_KEY                                       |

Deepgram provides real-time VAD (Voice Activity Detection) events, enabling accurate end-of-utterance detection without client-side inference.
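
For illustration, a server-side connection to Deepgram's streaming endpoint might look like the sketch below (Node with the ws package). The query parameters follow Deepgram's public streaming API, but the exact options, model, and message handling used by streaming_stt_service.py are assumptions.

// Server-side Deepgram streaming sketch (Node + the `ws` package).
import WebSocket from 'ws';

const params = new URLSearchParams({
  model: 'nova-2',            // assumed model; configured per deployment
  language: 'en-US',
  punctuate: 'true',
  interim_results: 'true',
  vad_events: 'true',
});

const dg = new WebSocket(`wss://api.deepgram.com/v1/listen?${params}`, {
  headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` },
});

dg.on('message', (raw) => {
  const msg = JSON.parse(raw.toString());
  if (msg.type === 'Results') {
    const transcript = msg.channel?.alternatives?.[0]?.transcript ?? '';
    if (msg.is_final && transcript) {
      console.log('final:', transcript);   // forward to the Thinker service
    }
  } else if (msg.type === 'SpeechStarted') {
    console.log('VAD: speech started');    // also used for barge-in detection
  }
});

// Audio chunks from the browser are forwarded as binary frames:
// dg.send(audioChunk);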

LLM / Assistant Layer

The Thinker Service (thinker_service.py) handles language model reasoning with intelligent routing:

OpenAI GPT-4o is the primary LLM for general queries.

| Property | Value                                              |
|----------|----------------------------------------------------|
| Model    | gpt-4o                                             |
| Mode     | Streaming                                          |
| Latency  | 200-500ms to first token                           |
| Features | Tool calling, RAG integration, citations           |
| Use Case | General medical queries, clinical decision support |

# Query classification determines urgency
URGENT → prioritized, faster response
SIMPLE → direct answer, minimal context
COMPLEX → multi-hop reasoning, RAG retrieval
CLARIFICATION → follow-up questions
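
The sketch below expresses this routing idea in TypeScript. The classifier wiring and the retrieveEvidence helper are hypothetical; only the streaming call shape follows the OpenAI Node SDK.

// Routing + streaming sketch (illustrative; the real logic is in thinker_service.py).
import OpenAI from 'openai';

type QueryClass = 'URGENT' | 'SIMPLE' | 'COMPLEX' | 'CLARIFICATION';

const client = new OpenAI();

async function* think(query: string, cls: QueryClass): AsyncGenerator<string> {
  // COMPLEX queries would first run RAG retrieval; URGENT ones skip extra context.
  const context = cls === 'COMPLEX' ? await retrieveEvidence(query) : '';

  const stream = await client.chat.completions.create({
    model: 'gpt-4o',
    stream: true, // tokens are forwarded to the Talker as they arrive
    messages: [
      { role: 'system', content: `Clinical assistant. Query class: ${cls}. ${context}` },
      { role: 'user', content: query },
    ],
  });

  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content;
    if (token) yield token;
  }
}

// Hypothetical retrieval helper; stands in for the RAG integration.
async function retrieveEvidence(query: string): Promise<string> {
  return ''; // e.g. top-k passages from the medical vector DB
}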

Text-to-Speech (TTS) Providers

ElevenLabs provides premium neural TTS with emotional expressiveness.

| Property  | Value                                      |
|-----------|--------------------------------------------|
| Model     | eleven_multilingual_v2, eleven_turbo_v2_5  |
| Mode      | HTTP streaming                             |
| Latency   | 50-100ms TTFA (time to first audio)        |
| Languages | 28+ languages                              |
| Voices    | Custom voice IDs, professional cloning     |

Voice Parameters:

  • Stability: 0.0-1.0 (consistency vs. expressiveness)
  • Clarity: 0.0-1.0 (pronunciation precision)
  • Style: 0.0-1.0 (emotional intensity)

ElevenLabs supports SSML tags for prosody control (emphasis, pauses, rate), enabling natural-sounding medical terminology pronunciation.
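
A hedged sketch of a streaming synthesis request is shown below. The endpoint, headers, and voice_settings fields follow ElevenLabs' public REST API; mapping the UI's Clarity parameter to similarity_boost is an assumption.

// Streaming ElevenLabs request sketch (not the actual elevenlabs_service.py logic).
async function synthesize(ssmlText: string, voiceId: string): Promise<ReadableStream<Uint8Array>> {
  const res = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream`, {
    method: 'POST',
    headers: {
      'xi-api-key': process.env.ELEVENLABS_API_KEY ?? '',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      text: ssmlText,                      // e.g. 'Take 5 <break time="200ms"/> milligrams daily.'
      model_id: 'eleven_turbo_v2_5',       // low-latency model listed above
      voice_settings: {
        stability: 0.5,                    // consistency vs. expressiveness
        similarity_boost: 0.75,            // assumed to back the "Clarity" setting
        style: 0.3,                        // emotional intensity
      },
    }),
  });
  if (!res.ok || !res.body) throw new Error(`TTS request failed: ${res.status}`);
  return res.body;                         // chunked audio stream for playback
}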


Streaming and Latency Behavior

Streaming Architecture

All pipeline components support streaming to minimize perceived latency:

| Component        | Streaming Mode          | Chunk Size          |
|------------------|-------------------------|---------------------|
| STT (Deepgram)   | WebSocket bidirectional | Continuous          |
| LLM (GPT-4o)     | Server-sent events      | Token-by-token      |
| TTS (ElevenLabs) | HTTP chunked            | 256 samples (24kHz) |
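
As an illustration of the playback side, the sketch below schedules streamed chunks back-to-back with the Web Audio API. It assumes 16-bit mono PCM at 24 kHz and is not the actual logic in useTTAudioPlayback.ts.

// Gapless playback of streamed PCM chunks (illustrative sketch).
const ctx = new AudioContext({ sampleRate: 24_000 });
let playhead = 0; // absolute time (in AudioContext seconds) of the next chunk

function enqueuePcmChunk(pcm: Int16Array): void {
  // Convert 16-bit PCM to the float samples Web Audio expects.
  const floats = Float32Array.from(pcm, (s) => s / 32768);
  const buffer = ctx.createBuffer(1, floats.length, ctx.sampleRate);
  buffer.copyToChannel(floats, 0);

  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);

  // Schedule back-to-back so small chunks play without audible gaps.
  playhead = Math.max(playhead, ctx.currentTime + 0.02);
  source.start(playhead);
  playhead += buffer.duration;
}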

Latency Targets

Performance Goals

VoiceAssist targets sub-500ms end-to-end latency for optimal conversational UX.

| Stage                   | Target Latency | Actual (P95) |
|-------------------------|----------------|--------------|
| Audio capture → STT     | 100-150ms      | ~120ms       |
| STT → LLM first token   | 200-300ms      | ~250ms       |
| LLM → TTS first audio   | 50-100ms       | ~80ms        |
| Total (speech-to-audio) | under 500ms    | ~450ms       |

Voice Quality Presets

Users can select latency vs. quality trade-offs:

// voiceSettingsStore.ts
type VoiceQualityPreset = 'speed' | 'balanced' | 'natural';
 
const presets = {
  speed: { ttfa: '100-150ms', description: 'Fastest response' },
  balanced: { ttfa: '200-250ms', description: 'Recommended default' },
  natural: { ttfa: '300-400ms', description: 'Most natural prosody' }
};

VAD and End-of-Utterance Detection

The system determines when the user has finished speaking using the following signals (a simplified sketch follows the list):

  1. Deepgram VAD Events: Server-side voice activity detection
  2. Silence Threshold: 800ms of silence triggers end-of-utterance
  3. VAD Sensitivity: 200ms minimum speech duration to avoid false triggers
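
The sketch below shows how the two timing values interact in a simplified, client-side form; in production the decision is driven by Deepgram's server-side VAD events.

// Simplified end-of-utterance logic combining the thresholds above.
const SILENCE_THRESHOLD_MS = 800; // VAD_SILENCE_THRESHOLD_MS
const MIN_SPEECH_MS = 200;        // VAD_SENSITIVITY_MS

let speechStart: number | null = null;
let lastSpeech: number | null = null;

function onVadFrame(isSpeech: boolean, now: number, onUtteranceEnd: () => void): void {
  if (isSpeech) {
    speechStart ??= now;
    lastSpeech = now;
    return;
  }
  // End the utterance only if at least 200 ms of real speech was heard
  // and 800 ms of silence has followed it.
  if (
    speechStart !== null &&
    lastSpeech !== null &&
    lastSpeech - speechStart >= MIN_SPEECH_MS &&
    now - lastSpeech >= SILENCE_THRESHOLD_MS
  ) {
    onUtteranceEnd();
    speechStart = lastSpeech = null;
  }
}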

Barge-In Support

Users can interrupt the AI's response mid-playback (see the sketch after this list):

  • Detection: barge_in_classifier.py monitors for new speech during playback
  • Action: Current audio playback stops, new utterance is processed
  • UI: VoiceBargeInIndicator.tsx provides visual feedback
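
A simplified sketch of this flow follows; the PlaybackController interface and the barge_in message are hypothetical stand-ins for the real barge_in_classifier.py / playback integration.

// Barge-in handling sketch (illustrative).
interface PlaybackController {
  isPlaying(): boolean;
  stop(): void; // halt current TTS audio immediately
}

function handleVadSpeechStarted(
  playback: PlaybackController,
  notifyServer: (event: { type: 'barge_in' }) => void,
): void {
  if (playback.isPlaying()) {
    playback.stop();                      // cut the assistant off mid-sentence
    notifyServer({ type: 'barge_in' });   // server cancels in-flight LLM/TTS work
    // The new utterance is then transcribed and processed as usual.
  }
}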

Multilingual and Pronunciation Behavior

Supported Languages

Deepgram STT supports multiple languages, but the system is primarily configured for:

  • English (US) - Primary
  • Spanish
  • French
  • German
  • Italian
  • Portuguese

Automatic language detection is not currently implemented in STT. The language must be pre-configured or selected by the user.

Mixed-Language Support

Current Limitation

Mixed-language utterances (e.g., English with Arabic terms) are not fully supported. The STT provider may fail to accurately transcribe code-switched speech.

Workarounds:

  • Configure STT for the dominant language
  • Use medical terminology in the configured language
  • Rely on TTS's multilingual model for pronunciation

Pronunciation Handling

| Feature             | Status          | Notes                           |
|---------------------|-----------------|---------------------------------|
| Custom lexicons     | Not implemented | No phoneme dictionaries         |
| Medical terminology | Partial         | ElevenLabs handles common terms |
| SSML pronunciation  | Supported       | Via ssml_processor.py           |
| Per-language tuning | Not implemented | Single-language configuration   |

Known Issues:

  • Uncommon drug names may be mispronounced
  • Eponyms (e.g., "Parkinson's", "Alzheimer's") generally work well
  • Abbreviations (e.g., "mg", "mL") require SSML hints (see the sketch below)
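
As an illustration of such hints, the sketch below rewrites unit abbreviations with the standard SSML <sub alias> tag. Whether a given TTS provider honors <sub> should be verified, so treat this as an assumption rather than documented behavior of ssml_processor.py.

// Sketch of SSML unit hinting (illustrative).
const UNIT_ALIASES: Record<string, string> = {
  mg: 'milligrams',
  mL: 'milliliters',
  mcg: 'micrograms',
};

function hintUnits(text: string): string {
  // "500 mg" -> '500 <sub alias="milligrams">mg</sub>'
  return text.replace(/\b(mg|mL|mcg)\b/g, (unit) =>
    `<sub alias="${UNIT_ALIASES[unit]}">${unit}</sub>`,
  );
}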

Architecture and Module Integration

Backend Service Structure

The voice pipeline is implemented across multiple services in services/api-gateway/app/services/:

services/
├── voice_pipeline_service.py      # Main orchestrator
├── streaming_stt_service.py       # Deepgram/Whisper STT
├── thinker_service.py             # LLM reasoning
├── talker_service.py              # TTS orchestration
├── voice_websocket_handler.py     # WebSocket management
├── thinker_talker_websocket_handler.py  # T/T protocol
├── voice_activity_detector.py     # VAD logic
├── barge_in_classifier.py         # Interrupt detection
├── elevenlabs_service.py          # ElevenLabs client
├── openai_tts_service.py          # OpenAI TTS client
├── ssml_processor.py              # SSML generation
├── emotion_detection_service.py   # User emotion analysis
├── prosody_analysis_service.py    # Speech prosody
├── backchannel_service.py         # Conversational cues
└── dictation_service.py           # Medical dictation

Frontend Hook Structure

Voice features are exposed via React hooks in apps/web-app/src/hooks/:

// Primary hooks (current production)
useThinkerTalkerSession.ts      // Session management
useThinkerTalkerVoiceMode.ts    // Combined session + playback
useTTAudioPlayback.ts           // Audio streaming playback
 
// Supporting hooks
useVoiceMetrics.ts              // Latency tracking
useVoiceModeStateMachine.ts     // State management
useStreamingAudio.ts            // Audio stream handling
useBackchannelAudio.ts          // AI conversational cues
useVoicePreferencesSync.ts      // Settings persistence
 
// Legacy (deprecated)
useRealtimeVoiceSession.ts      // OpenAI Realtime API (deprecated)

Pipeline Modes

The voice pipeline supports multiple operating modes:

| Mode         | Description                    | Use Case               |
|--------------|--------------------------------|------------------------|
| CONVERSATION | Full Thinker/Talker pipeline   | Normal voice chat      |
| DICTATION    | Speech-to-text with formatting | Medical note dictation |
| COMMAND      | Voice command processing       | Quick actions          |

Error Handling and Retries

// Circuit breaker pattern for external APIs
const circuitBreaker = {
  failureThreshold: 5,
  recoveryTimeout: 30000, // 30 seconds
  halfOpenRequests: 3
};
 
// Retry strategy
const retryPolicy = {
  maxRetries: 3,
  baseDelay: 1000,
  maxDelay: 10000,
  backoffMultiplier: 2
};

When ElevenLabs fails, the system automatically falls back to OpenAI TTS. When Deepgram fails, batch Whisper transcription is used.
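
Combining the two snippets above, a fallback wrapper might look like the sketch below. The TtsProvider interface is a hypothetical stand-in; the real orchestration lives in talker_service.py.

// Retry-then-fallback sketch reusing the retryPolicy values above.
interface TtsProvider {
  name: string;
  synthesize(text: string): Promise<ReadableStream<Uint8Array>>;
}

const retryPolicy = { maxRetries: 3, baseDelay: 1000, maxDelay: 10000, backoffMultiplier: 2 };

async function synthesizeWithFallback(
  text: string,
  primary: TtsProvider,   // e.g. ElevenLabs
  fallback: TtsProvider,  // e.g. OpenAI TTS
): Promise<ReadableStream<Uint8Array>> {
  for (let attempt = 0; attempt <= retryPolicy.maxRetries; attempt++) {
    try {
      return await primary.synthesize(text);
    } catch {
      // Exponential backoff between retries: 1s, 2s, 4s, ... capped at 10s.
      const delay = Math.min(
        retryPolicy.baseDelay * retryPolicy.backoffMultiplier ** attempt,
        retryPolicy.maxDelay,
      );
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  // Primary exhausted its retries; hand the request to the fallback provider.
  return fallback.synthesize(text);
}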


Medical Intelligence and Data Sources

Currently Integrated Sources

| Source             | Type               | Integration     |
|--------------------|--------------------|-----------------|
| PubMed (NCBI)      | Research articles  | E-utilities API |
| OpenEvidence       | Clinical evidence  | REST API        |
| Medical Guidelines | Curated guidelines | Local vector DB |
| Epic FHIR          | EHR data           | FHIR R4 API     |

RAG Architecture

The system uses Retrieval-Augmented Generation (RAG) for evidence-based responses: relevant passages are retrieved from the sources above via semantic search and supplied to the LLM as grounding context, so answers can cite their evidence.
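
A minimal sketch of the retrieve-then-generate flow is shown below. The vector-store lookup and citation format are hypothetical; the embedding and chat calls follow the OpenAI Node SDK.

// Retrieve-then-generate sketch (illustrative).
import OpenAI from 'openai';

const client = new OpenAI();

interface Passage { text: string; source: string; } // e.g. a PubMed citation

async function answerWithEvidence(
  question: string,
  search: (embedding: number[], k: number) => Promise<Passage[]>, // vector DB lookup
): Promise<string> {
  // 1. Embed the question with one of the models listed below.
  const emb = await client.embeddings.create({
    model: 'text-embedding-3-large',
    input: question,
  });

  // 2. Retrieve the top-k supporting passages.
  const passages = await search(emb.data[0].embedding, 5);
  const context = passages.map((p, i) => `[${i + 1}] ${p.text} (${p.source})`).join('\n');

  // 3. Generate a grounded answer that cites the retrieved passages.
  const completion = await client.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: `Answer using only the evidence below. Cite as [n].\n${context}` },
      { role: 'user', content: question },
    ],
  });
  return completion.choices[0].message.content ?? '';
}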

Medical Embedding Models

Multiple embedding models are available for semantic search:

| Model                         | Dimensions | Best For            |
|-------------------------------|------------|---------------------|
| OpenAI text-embedding-3-large | 3072       | General queries     |
| PubMedBERT                    | 768        | Research literature |
| BioBERT                       | 768        | Biomedical text     |
| MedCPT                        | 768        | Clinical queries    |

FHIR Integration

Fully Implemented (a query sketch follows the list):

  • Patient demographics
  • MedicationRequest (active/historical)
  • Condition (diagnoses, ICD-10)
  • Observation (labs, vitals, LOINC)
  • AllergyIntolerance
  • Procedure (CPT codes)
  • Encounter history
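
For illustration, pulling a patient's active medications over FHIR R4 REST might look like the sketch below. The resource and search parameters are standard FHIR; the base URL and token handling are deployment-specific assumptions (Epic requires OAuth 2.0).

// FHIR R4 query sketch (illustrative).
interface FhirBundle { entry?: { resource: Record<string, unknown> }[]; }

async function getActiveMedications(
  fhirBase: string,       // deployment-specific FHIR R4 base URL (placeholder)
  patientId: string,
  accessToken: string,
): Promise<Record<string, unknown>[]> {
  const url = `${fhirBase}/MedicationRequest?patient=${patientId}&status=active`;
  const res = await fetch(url, {
    headers: {
      Authorization: `Bearer ${accessToken}`,
      Accept: 'application/fhir+json',
    },
  });
  if (!res.ok) throw new Error(`FHIR query failed: ${res.status}`);
  const bundle = (await res.json()) as FhirBundle;
  return (bundle.entry ?? []).map((e) => e.resource); // MedicationRequest resources
}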

Known Gaps and TODOs

Voice Pipeline Gaps

| Gap                | Description                           | Priority |
|--------------------|---------------------------------------|----------|
| Language Detection | No automatic STT language detection   | High     |
| Mixed Language     | Code-switched speech not supported    | Medium   |
| Custom Lexicons    | No phoneme/pronunciation dictionaries | Medium   |
| Speaker ID         | No multi-speaker diarization          | Low      |
| Noise Suppression  | Limited background noise handling     | Medium   |

Medical Intelligence Gaps

| Gap               | Description                                  | Priority |
|-------------------|----------------------------------------------|----------|
| Drug Interactions | No PharmGKB integration                      | High     |
| Real-time EHR     | No streaming vital signs                     | Medium   |
| Clinical NER      | No medication/condition extraction from text | High     |
| SNOMED CT         | No ontology mapping                          | Medium   |
| Evidence Grading  | Limited quality assessment                   | Medium   |

Documentation Gaps

| Gap                  | Information Needed                                     |
|----------------------|--------------------------------------------------------|
| Exact VAD thresholds | Configurable silence duration and sensitivity          |
| ElevenLabs voice IDs | Complete list of available voices and characteristics  |
| PHI detection rules  | Full regex patterns and Presidio configuration         |
| Fallback behavior    | Exact conditions triggering provider fallbacks         |
| WebSocket protocol   | Complete message schema and error codes                |

Configuration Reference

Environment Variables

# STT Configuration
DEEPGRAM_API_KEY=your-deepgram-key
VOICE_PIPELINE_STT_PRIMARY=deepgram
VOICE_PIPELINE_STT_FALLBACK=whisper
 
# TTS Configuration
ELEVENLABS_API_KEY=your-elevenlabs-key
VOICE_PIPELINE_TTS_PROVIDER=elevenlabs
TTS_VOICE=default-voice-id
 
# LLM Configuration
OPENAI_API_KEY=your-openai-key
LOCAL_LLM_ENDPOINT=http://localhost:11434
 
# Voice Pipeline
VOICE_WS_MAX_INFLIGHT=10
VAD_SILENCE_THRESHOLD_MS=800
VAD_SENSITIVITY_MS=200

User Preferences (voiceSettingsStore)

interface VoiceSettings {
  voiceId: string;           // ElevenLabs voice ID
  language: string;          // ISO language code
  playbackSpeed: number;     // 0.5-2.0x
  stability: number;         // 0.0-1.0
  clarity: number;           // 0.0-1.0
  expressiveness: number;    // 0.0-1.0
  qualityPreset: 'speed' | 'balanced' | 'natural';
  pushToTalk: boolean;
  autoPlay: boolean;
}

Cost Philosophy

Important Context

The product team is not trying to reduce costs at the expense of quality. We are willing to increase costs when it demonstrably improves the voice experience. However, we aim to avoid wasteful spending and prefer solutions with strong cost-benefit ratios.

Guiding Principles:

  1. Quality First: Premium providers (ElevenLabs, Deepgram) are preferred for their superior quality
  2. Smart Fallbacks: Cost-effective alternatives only activate when primary providers fail
  3. No Downgrades: Never propose replacing current components with cheaper, lower-quality alternatives
  4. Measured Upgrades: New features should justify their cost with measurable UX improvements

References

Backend Files

  • services/api-gateway/app/services/voice_pipeline_service.py
  • services/api-gateway/app/services/streaming_stt_service.py
  • services/api-gateway/app/services/thinker_service.py
  • services/api-gateway/app/services/talker_service.py
  • services/api-gateway/app/services/elevenlabs_service.py

Frontend Files

  • apps/web-app/src/hooks/useThinkerTalkerSession.ts
  • apps/web-app/src/hooks/useThinkerTalkerVoiceMode.ts
  • apps/web-app/src/components/voice/ThinkerTalkerVoicePanel.tsx
  • apps/web-app/src/stores/voiceSettingsStore.ts