This document provides a comprehensive technical reference for the VoiceAssist Voice Mode implementation. It covers the current architecture, identifies known limitations, and serves as the authoritative source for understanding how voice interactions work in the system.
Overview
VoiceAssist implements a sophisticated voice-first interface for healthcare professionals, enabling natural spoken interactions with an AI medical assistant. The system uses a Thinker/Talker pipeline architecture that decouples speech recognition, language model reasoning, and speech synthesis for maximum flexibility and low latency.
High-Level Architecture
Current Implementation of Voice Mode
End-to-End Pipeline
The voice interaction follows this sequence:
Audio Capture
Audio is captured using the Web Audio API and MediaRecorder:
- Hook: useThinkerTalkerSession.ts manages the voice session
- Component: ThinkerTalkerVoicePanel.tsx provides the UI
- Capture: MediaRecorder API with audio/webm;codecs=opus encoding
- Sample Rate: 16kHz mono (resampled for Deepgram)
- Chunk Size: 250ms intervals for streaming (see the capture sketch below)
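A minimal capture sketch, assuming a hypothetical onChunk callback in place of the WebSocket send performed by useThinkerTalkerSession.ts; resampling to 16kHz happens downstream, as noted above:

```typescript
// Sketch only: onChunk stands in for the session hook's WebSocket send.
async function startCapture(onChunk: (chunk: Blob) => void): Promise<MediaRecorder> {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { channelCount: 1, echoCancellation: true },
  });
  const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm;codecs=opus' });
  recorder.ondataavailable = (event) => {
    if (event.data.size > 0) onChunk(event.data); // one Opus chunk per interval
  };
  recorder.start(250); // emit a chunk every 250ms, matching the streaming interval above
  return recorder;
}
```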
Speech-to-Text (STT) Providers
Deepgram is the primary STT provider, chosen for its low-latency streaming capabilities.
| Property | Value |
|---|---|
| Mode | WebSocket streaming |
| Latency | 100-150ms to first transcript |
| Features | Interim results, VAD events, punctuation, diarization |
| Languages | English (primary), multilingual support |
| Config Key | DEEPGRAM_API_KEY |
Deepgram provides real-time VAD (Voice Activity Detection) events, enabling accurate end-of-utterance detection without client-side inference.
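For orientation, a sketch of the server-side streaming connection, written in TypeScript with the ws package rather than the Python used by streaming_stt_service.py. The endpoint and query parameters are standard Deepgram live-transcription options reflecting the features listed above, but names and defaults should be checked against current Deepgram documentation:

```typescript
import WebSocket from 'ws';

// Sketch of a Deepgram live-transcription connection.
const params = new URLSearchParams({
  encoding: 'linear16',
  sample_rate: '16000',
  channels: '1',
  interim_results: 'true',
  punctuate: 'true',
  diarize: 'true',
  vad_events: 'true',
});

const dg = new WebSocket(`wss://api.deepgram.com/v1/listen?${params}`, {
  headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` },
});

dg.on('message', (raw) => {
  const msg = JSON.parse(raw.toString());
  if (msg.type === 'Results') {
    const alt = msg.channel?.alternatives?.[0];
    console.log(alt?.transcript, msg.is_final ? '(final)' : '(interim)');
  } else if (msg.type === 'SpeechStarted') {
    console.log('VAD: speech started'); // server-side VAD event
  }
});

// Audio chunks (16kHz mono PCM) are forwarded as binary frames:
// dg.send(pcmChunk);
```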
LLM / Assistant Layer
The Thinker Service (thinker_service.py) handles language model reasoning with intelligent routing:
OpenAI GPT-4o is the primary LLM for general queries.
| Property | Value |
|---|---|
| Model | gpt-4o |
| Mode | Streaming |
| Latency | 200-500ms to first token |
| Features | Tool calling, RAG integration, citations |
| Use Case | General medical queries, clinical decision support |
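The Thinker itself is Python (thinker_service.py); the sketch below only illustrates the token-by-token streaming pattern against gpt-4o using the OpenAI SDK, with tool calling and RAG context omitted:

```typescript
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Stream an answer token by token; each delta can be forwarded to the Talker/TTS stage as it arrives.
export async function* streamAnswer(question: string): AsyncGenerator<string> {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    stream: true,
    messages: [
      { role: 'system', content: 'You are a clinical assistant. Cite evidence.' },
      { role: 'user', content: question },
    ],
  });
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) yield delta;
  }
}
```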
Incoming queries are first classified to pick a response strategy:

```text
# Query classification determines urgency
URGENT        → prioritized, faster response
SIMPLE        → direct answer, minimal context
COMPLEX       → multi-hop reasoning, RAG retrieval
CLARIFICATION → follow-up questions
```
Text-to-Speech (TTS) Providers
ElevenLabs provides premium neural TTS with emotional expressiveness.
| Property | Value |
|---|---|
| Models | eleven_multilingual_v2, eleven_turbo_v2_5 |
| Mode | HTTP streaming |
| Latency | 50-100ms TTFA (time to first audio) |
| Languages | 28+ languages |
| Voices | Custom voice IDs, professional cloning |
Voice Parameters:
- Stability: 0.0-1.0 (consistency vs. expressiveness)
- Clarity: 0.0-1.0 (pronunciation precision)
- Style: 0.0-1.0 (emotional intensity)
ElevenLabs supports SSML tags for prosody control (emphasis, pauses, rate), enabling natural-sounding medical terminology pronunciation.
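A hedged sketch of the streaming synthesis call, approximating what elevenlabs_service.py does; the mapping of the Clarity slider onto ElevenLabs' similarity_boost field is an assumption, and the voice ID and parameter values are placeholders:

```typescript
// Sketch of an ElevenLabs HTTP streaming request. Values are illustrative.
async function* streamSpeech(text: string, voiceId: string): AsyncGenerator<Uint8Array> {
  const res = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream`, {
    method: 'POST',
    headers: {
      'xi-api-key': process.env.ELEVENLABS_API_KEY ?? '',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      text,
      model_id: 'eleven_turbo_v2_5',
      voice_settings: { stability: 0.5, similarity_boost: 0.75, style: 0.3 }, // assumed mapping
    }),
  });
  if (!res.ok || !res.body) throw new Error(`TTS request failed: ${res.status}`);
  const reader = res.body.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    yield value; // audio bytes, forwarded to playback as they arrive
  }
}
```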
Streaming and Latency Behavior
Streaming Architecture
All pipeline components support streaming to minimize perceived latency:
| Component | Streaming Mode | Chunk Size |
|---|---|---|
| STT (Deepgram) | WebSocket bidirectional | Continuous |
| LLM (GPT-4o) | Server-sent events | Token-by-token |
| TTS (ElevenLabs) | HTTP chunked | 256 samples (24kHz) |
Latency Targets
VoiceAssist targets sub-500ms end-to-end latency for optimal conversational UX.
| Stage | Target Latency | Actual (P95) |
|---|---|---|
| Audio capture → STT | 100-150ms | ~120ms |
| STT → LLM first token | 200-300ms | ~250ms |
| LLM → TTS first audio | 50-100ms | ~80ms |
| Total (speech-to-audio) | under 500ms | ~450ms |
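How these figures are captured can be sketched with simple timestamps; the stage names below are hypothetical and are not the actual fields of useVoiceMetrics.ts:

```typescript
// Record a timestamp when each stage produces its first byte/token/sample;
// deltas give the per-stage and end-to-end numbers in the table above.
type StageMark = 'speechEnd' | 'sttFinal' | 'llmFirstToken' | 'ttsFirstAudio';

const marks = new Map<StageMark, number>();

export function mark(stage: StageMark): void {
  marks.set(stage, performance.now());
}

export function report(): Record<string, number> {
  const delta = (a: StageMark, b: StageMark) =>
    (marks.get(b) ?? NaN) - (marks.get(a) ?? NaN);
  return {
    sttMs: delta('speechEnd', 'sttFinal'),
    llmMs: delta('sttFinal', 'llmFirstToken'),
    ttsMs: delta('llmFirstToken', 'ttsFirstAudio'),
    totalMs: delta('speechEnd', 'ttsFirstAudio'),
  };
}
```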
Voice Quality Presets
Users can select latency vs. quality trade-offs:
```typescript
// voiceSettingsStore.ts
type VoiceQualityPreset = 'speed' | 'balanced' | 'natural';

const presets = {
  speed:    { ttfa: '100-150ms', description: 'Fastest response' },
  balanced: { ttfa: '200-250ms', description: 'Recommended default' },
  natural:  { ttfa: '300-400ms', description: 'Most natural prosody' }
};
```
VAD and End-of-Utterance Detection
The system determines when the user has finished speaking using the following signals (a timing sketch follows the list):
- Deepgram VAD Events: Server-side voice activity detection
- Silence Threshold: 800ms of silence triggers end-of-utterance
- VAD Sensitivity: 200ms minimum speech duration to avoid false triggers
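A sketch of the silence-timer logic implied by these thresholds, with hypothetical callbacks; in production the speech/silence signals come from Deepgram's VAD events rather than local inference:

```typescript
const SILENCE_THRESHOLD_MS = 800; // silence duration that ends the utterance
const MIN_SPEECH_MS = 200;        // ignore bursts shorter than this

function createUtteranceDetector(onUtteranceEnd: () => void) {
  let speechStart: number | null = null;
  let silenceTimer: ReturnType<typeof setTimeout> | null = null;

  return {
    onSpeechStarted() {
      if (speechStart === null) speechStart = Date.now();
      if (silenceTimer) { clearTimeout(silenceTimer); silenceTimer = null; }
    },
    onSilence() {
      if (speechStart === null) return; // nothing spoken yet
      silenceTimer = setTimeout(() => {
        // Approximate speech duration: time silence began minus speech start.
        const spokenMs = Date.now() - SILENCE_THRESHOLD_MS - speechStart!;
        if (spokenMs >= MIN_SPEECH_MS) onUtteranceEnd(); // skip false triggers
        speechStart = null;
      }, SILENCE_THRESHOLD_MS);
    },
  };
}
```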
Barge-In Support
Users can interrupt the AI's response mid-playback:
- Detection: barge_in_classifier.py monitors for new speech during playback
- Action: Current audio playback stops and the new utterance is processed (sketched below)
- UI: VoiceBargeInIndicator.tsx provides visual feedback
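A client-side sketch of that flow, using an illustrative playback interface; the actual interrupt decision is made server-side by barge_in_classifier.py:

```typescript
interface PlaybackController {
  isPlaying(): boolean;
  stop(): void;       // halt the current audio element or AudioContext source
  clearQueue(): void; // drop buffered-but-unplayed TTS chunks
}

// When VAD reports new speech during playback, stop and flush before
// handing control back to the capture/STT path.
function handleSpeechStarted(playback: PlaybackController, beginNewUtterance: () => void) {
  if (playback.isPlaying()) {
    playback.stop();
    playback.clearQueue();
  }
  beginNewUtterance();
}
```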
Multilingual and Pronunciation Behavior
Supported Languages
Deepgram STT supports multiple languages, but the system is primarily configured for:
- English (US) - Primary
- Spanish
- French
- German
- Italian
- Portuguese
Automatic language detection is not currently implemented in STT. The language must be pre-configured or selected by the user.
Mixed-Language Support
Mixed-language utterances (e.g., English with Arabic terms) are not fully supported. The STT provider may fail to accurately transcribe code-switched speech.
Workarounds:
- Configure STT for the dominant language
- Use medical terminology in the configured language
- Rely on TTS's multilingual model for pronunciation
Pronunciation Handling
| Feature | Status | Notes |
|---|---|---|
| Custom lexicons | Not implemented | No phoneme dictionaries |
| Medical terminology | Partial | ElevenLabs handles common terms |
| SSML pronunciation | Supported | Via ssml_processor.py |
| Per-language tuning | Not implemented | Single-language configuration |
Known Issues:
- Uncommon drug names may be mispronounced
- Eponyms (e.g., "Parkinson's", "Alzheimer's") generally work well
- Abbreviations (e.g., "mg", "mL") require SSML hints (see the sketch below)
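One way such hints can be generated is a simple normalization pass before synthesis. The lexicon and regex below are a tiny hypothetical sample, not the contents of ssml_processor.py, and provider support for the SSML sub element should be verified:

```typescript
// Hypothetical abbreviation lexicon; <sub alias="..."> is standard SSML.
const SPOKEN_FORMS: Record<string, string> = {
  mg: 'milligrams',
  mL: 'milliliters',
  mcg: 'micrograms',
  IV: 'intravenous',
};

function addPronunciationHints(text: string): string {
  return text.replace(/\b(mg|mL|mcg|IV)\b/g, (abbr) =>
    `<sub alias="${SPOKEN_FORMS[abbr]}">${abbr}</sub>`,
  );
}

// addPronunciationHints('Give 50 mg IV') →
// 'Give 50 <sub alias="milligrams">mg</sub> <sub alias="intravenous">IV</sub>'
```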
Architecture and Module Integration
Backend Service Structure
The voice pipeline is implemented across multiple services in services/api-gateway/app/services/:
```text
services/
├── voice_pipeline_service.py            # Main orchestrator
├── streaming_stt_service.py             # Deepgram/Whisper STT
├── thinker_service.py                   # LLM reasoning
├── talker_service.py                    # TTS orchestration
├── voice_websocket_handler.py           # WebSocket management
├── thinker_talker_websocket_handler.py  # T/T protocol
├── voice_activity_detector.py           # VAD logic
├── barge_in_classifier.py               # Interrupt detection
├── elevenlabs_service.py                # ElevenLabs client
├── openai_tts_service.py                # OpenAI TTS client
├── ssml_processor.py                    # SSML generation
├── emotion_detection_service.py         # User emotion analysis
├── prosody_analysis_service.py          # Speech prosody
├── backchannel_service.py               # Conversational cues
└── dictation_service.py                 # Medical dictation
```
Frontend Hook Structure
Voice features are exposed via React hooks in apps/web-app/src/hooks/:
```typescript
// Primary hooks (current production)
useThinkerTalkerSession.ts     // Session management
useThinkerTalkerVoiceMode.ts   // Combined session + playback
useTTAudioPlayback.ts          // Audio streaming playback

// Supporting hooks
useVoiceMetrics.ts             // Latency tracking
useVoiceModeStateMachine.ts    // State management
useStreamingAudio.ts           // Audio stream handling
useBackchannelAudio.ts         // AI conversational cues
useVoicePreferencesSync.ts     // Settings persistence

// Legacy (deprecated)
useRealtimeVoiceSession.ts     // OpenAI Realtime API (deprecated)
```
Pipeline Modes
The voice pipeline supports multiple operating modes:
| Mode | Description | Use Case |
|---|---|---|
| CONVERSATION | Full Thinker/Talker pipeline | Normal voice chat |
| DICTATION | Speech-to-text with formatting | Medical note dictation |
| COMMAND | Voice command processing | Quick actions |
Error Handling and Retries
```typescript
// Circuit breaker pattern for external APIs
const circuitBreaker = {
  failureThreshold: 5,
  recoveryTimeout: 30000, // 30 seconds
  halfOpenRequests: 3
};

// Retry strategy
const retryPolicy = {
  maxRetries: 3,
  baseDelay: 1000,
  maxDelay: 10000,
  backoffMultiplier: 2
};
```
When ElevenLabs fails, the system automatically falls back to OpenAI TTS. When Deepgram fails, batch Whisper transcription is used.
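A sketch of how the retryPolicy above and the provider fallback can compose; the TtsFn providers are injected rather than named, since the real clients live in elevenlabs_service.py and openai_tts_service.py:

```typescript
type TtsFn = (text: string) => Promise<Uint8Array>;

// Retry with exponential backoff, driven by the retryPolicy constant above.
async function withRetry<T>(fn: () => Promise<T>): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retryPolicy.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === retryPolicy.maxRetries) break;
      const delay = Math.min(
        retryPolicy.baseDelay * retryPolicy.backoffMultiplier ** attempt,
        retryPolicy.maxDelay,
      );
      await new Promise((resolve) => setTimeout(resolve, delay)); // back off before retrying
    }
  }
  throw lastError;
}

// Primary-then-fallback synthesis, e.g. ElevenLabs → OpenAI TTS.
async function synthesizeWithFallback(text: string, primary: TtsFn, fallback: TtsFn) {
  try {
    return await withRetry(() => primary(text));
  } catch {
    return await withRetry(() => fallback(text));
  }
}
```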
Medical Intelligence and Data Sources
Currently Integrated Sources
| Source | Type | Integration |
|---|---|---|
| PubMed (NCBI) | Research articles | E-utilities API |
| OpenEvidence | Clinical evidence | REST API |
| Medical Guidelines | Curated guidelines | Local vector DB |
| Epic FHIR | EHR data | FHIR R4 API |
RAG Architecture
The system uses Retrieval-Augmented Generation for evidence-based responses:
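In outline (names below are illustrative, not the actual service API): the query is embedded, the local vector store of guidelines and literature is searched, and the top passages are handed to the Thinker prompt together with citation metadata:

```typescript
interface Passage { text: string; source: string; score: number }

// Retrieval step of the RAG flow; embed and vectorSearch are injected so the
// sketch stays independent of any particular embedding model or vector DB.
async function retrieveEvidence(
  query: string,
  embed: (text: string) => Promise<number[]>,
  vectorSearch: (vector: number[], topK: number) => Promise<Passage[]>,
): Promise<{ context: string; citations: string[] }> {
  const queryVector = await embed(query);            // e.g. text-embedding-3-large or PubMedBERT
  const passages = await vectorSearch(queryVector, 5);
  return {
    context: passages.map((p, i) => `[${i + 1}] ${p.text}`).join('\n\n'),
    citations: passages.map((p) => p.source),
  };
}
```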
Medical Embedding Models
Multiple embedding models are available for semantic search:
| Model | Dimensions | Best For |
|---|---|---|
| OpenAI text-embedding-3-large | 3072 | General queries |
| PubMedBERT | 768 | Research literature |
| BioBERT | 768 | Biomedical text |
| MedCPT | 768 | Clinical queries |
FHIR Integration
Fully Implemented (a retrieval sketch follows the list):
- Patient demographics
- MedicationRequest (active/historical)
- Condition (diagnoses, ICD-10)
- Observation (labs, vitals, LOINC)
- AllergyIntolerance
- Procedure (CPT codes)
- Encounter history
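As an illustration of the resource-level access this implies, a sketch of an active-medication lookup over FHIR R4 REST; the base URL and bearer-token handling are placeholders, and Epic deployments additionally require SMART on FHIR authorization:

```typescript
// Fetch active MedicationRequest resources for a patient via FHIR R4 search.
async function fetchActiveMedications(baseUrl: string, patientId: string, accessToken: string) {
  const res = await fetch(
    `${baseUrl}/MedicationRequest?patient=${patientId}&status=active`,
    { headers: { Authorization: `Bearer ${accessToken}`, Accept: 'application/fhir+json' } },
  );
  if (!res.ok) throw new Error(`FHIR request failed: ${res.status}`);
  const bundle = await res.json(); // FHIR Bundle of MedicationRequest resources
  return (bundle.entry ?? []).map((e: { resource: unknown }) => e.resource);
}
```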
Known Gaps and TODOs
Voice Pipeline Gaps
| Gap | Description | Priority |
|---|---|---|
| Language Detection | No automatic STT language detection | High |
| Mixed Language | Code-switched speech not supported | Medium |
| Custom Lexicons | No phoneme/pronunciation dictionaries | Medium |
| Speaker ID | No multi-speaker diarization | Low |
| Noise Suppression | Limited background noise handling | Medium |
Medical Intelligence Gaps
| Gap | Description | Priority |
|---|---|---|
| Drug Interactions | No PharmGKB integration | High |
| Real-time EHR | No streaming vital signs | Medium |
| Clinical NER | No medication/condition extraction from text | High |
| SNOMED CT | No ontology mapping | Medium |
| Evidence Grading | Limited quality assessment | Medium |
Documentation Gaps
| Gap | Information Needed |
|---|---|
| Exact VAD thresholds | Configurable silence duration and sensitivity |
| ElevenLabs voice IDs | Complete list of available voices and characteristics |
| PHI detection rules | Full regex patterns and Presidio configuration |
| Fallback behavior | Exact conditions triggering provider fallbacks |
| WebSocket protocol | Complete message schema and error codes |
Configuration Reference
Environment Variables
```bash
# STT Configuration
DEEPGRAM_API_KEY=your-deepgram-key
VOICE_PIPELINE_STT_PRIMARY=deepgram
VOICE_PIPELINE_STT_FALLBACK=whisper

# TTS Configuration
ELEVENLABS_API_KEY=your-elevenlabs-key
VOICE_PIPELINE_TTS_PROVIDER=elevenlabs
TTS_VOICE=default-voice-id

# LLM Configuration
OPENAI_API_KEY=your-openai-key
LOCAL_LLM_ENDPOINT=http://localhost:11434

# Voice Pipeline
VOICE_WS_MAX_INFLIGHT=10
VAD_SILENCE_THRESHOLD_MS=800
VAD_SENSITIVITY_MS=200
```
User Preferences (voiceSettingsStore)
```typescript
interface VoiceSettings {
  voiceId: string;         // ElevenLabs voice ID
  language: string;        // ISO language code
  playbackSpeed: number;   // 0.5-2.0x
  stability: number;       // 0.0-1.0
  clarity: number;         // 0.0-1.0
  expressiveness: number;  // 0.0-1.0
  qualityPreset: 'speed' | 'balanced' | 'natural';
  pushToTalk: boolean;
  autoPlay: boolean;
}
```
Cost Philosophy
The product team is not trying to reduce costs at the expense of quality. We are willing to increase costs when it demonstrably improves the voice experience. However, we aim to avoid wasteful spending and prefer solutions with strong cost-benefit ratios.
Guiding Principles:
- Quality First: Premium providers (ElevenLabs, Deepgram) are preferred for their superior quality
- Smart Fallbacks: Cost-effective alternatives only activate when primary providers fail
- No Downgrades: Never propose replacing current components with cheaper, lower-quality alternatives
- Measured Upgrades: New features should justify their cost with measurable UX improvements
References
Backend Files
- services/api-gateway/app/services/voice_pipeline_service.py
- services/api-gateway/app/services/streaming_stt_service.py
- services/api-gateway/app/services/thinker_service.py
- services/api-gateway/app/services/talker_service.py
- services/api-gateway/app/services/elevenlabs_service.py
Frontend Files
- apps/web-app/src/hooks/useThinkerTalkerSession.ts
- apps/web-app/src/hooks/useThinkerTalkerVoiceMode.ts
- apps/web-app/src/components/voice/ThinkerTalkerVoicePanel.tsx
- apps/web-app/src/stores/voiceSettingsStore.ts