Voice Mode Enhancement - 10 Phase Implementation

Sourced from docs/VOICE_MODE_ENHANCEMENT_10_PHASE.md

Status: ✅ COMPLETE (2025-12-03). All 10 phases implemented with full backend-frontend integration.

This document describes the comprehensive 10-phase enhancement to VoiceAssist's voice mode, transforming it from a functional voice assistant into a human-like conversational partner with medical dictation capabilities.

Executive Summary

Primary Goals Achieved:

  1. ✅ Natural, human-like voice interactions
  2. ✅ Contextual memory across conversations
  3. ✅ Professional medical dictation
  4. ✅ Natural backchanneling
  5. ✅ Session analytics and feedback collection

Key External Services:

  • Hume AI - Emotion detection from audio (HIPAA BAA available)
  • Deepgram Nova-3 Medical - Upgraded STT for medical vocabulary
  • ElevenLabs - TTS with backchanneling support

Phase Implementation Status

Phase | Name | Status | Backend Service | Frontend Handler
----- | ---- | ------ | --------------- | ----------------
1 | Emotional Intelligence | ✅ | emotion_detection_service.py | emotion.detected
2 | Backchanneling System | ✅ | backchannel_service.py | backchannel.trigger
3 | Prosody Analysis | ✅ | prosody_analysis_service.py | Integrated
4 | Memory & Context | ✅ | memory_context_service.py | memory.context_loaded
5 | Advanced Turn-Taking | ✅ | Integrated in pipeline | turn.state
6 | Variable Response Timing | ✅ | Integrated in pipeline | Timing controls
7 | Conversational Repair | ✅ | repair_strategy_service.py | Repair flows
8 | Medical Dictation Core | ✅ | dictation_service.py, voice_command_service.py, note_formatter_service.py, medical_vocabulary_service.py | dictation.*
9 | Patient Context Integration | ✅ | patient_context_service.py, dictation_phi_monitor.py | patient.*, phi.*
10 | Frontend Integration & Analytics | ✅ | session_analytics_service.py, feedback_service.py | analytics.*, feedback.*

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                        ENHANCED VOICE PIPELINE                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   User Audio ──┬──> Deepgram Nova-3 ──> Transcript ──┐                      │
│                │    (Medical STT)                     │                      │
│                │                                      │                      │
│                ├──> Hume AI ──────────> Emotion ──────┼──> Context Builder  │
│                │    (Emotion)                         │                      │
│                │                                      │                      │
│                └──> Prosody Analyzer ──> Urgency ─────┘                      │
│                     (from Deepgram)                                          │
│                                                                              │
│   Context Builder ──┬──> Short-term (Redis) ─────────┐                      │
│                     ├──> Medium-term (PostgreSQL) ───┼──> Memory Service    │
│                     └──> Long-term (Qdrant vectors) ─┘                      │
│                                                                              │
│   Memory + Emotion + Transcript ──> Thinker (GPT-4o) ──> Response           │
│                                                                              │
│   Response ──> Turn Manager ──> TTS (ElevenLabs) ──> User                   │
│                    │                                                         │
│                    └──> Backchannel Service (parallel audio)                │
│                                                                              │
│   Session Analytics ──> Metrics + Latency Tracking ──> Feedback Prompts     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Phase 1: Emotional Intelligence

Goal: Detect user emotions from speech and adapt responses accordingly.

Backend Service

Location: services/api-gateway/app/services/emotion_detection_service.py

class EmotionDetectionService:
    """
    Wraps Hume AI Expression Measurement API.
    - Analyzes audio chunks (500ms) in parallel with STT
    - Returns: valence, arousal, discrete emotions
    - Caches recent emotion states for trending
    """

    async def analyze_audio_chunk(self, audio: bytes) -> EmotionResult
    async def get_emotion_trend(self, session_id: str) -> EmotionTrend
    def map_emotion_to_response_style(self, emotion: str) -> VoiceStyle

WebSocket Message

{ type: "emotion.detected", data: { emotion: string, confidence: number, valence: number, arousal: number } }

Frontend Handler

In useThinkerTalkerSession.ts:

onEmotionDetected?: (event: TTEmotionDetectedEvent) => void;

Latency Impact: +50-100ms (parallel, non-blocking)
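
The non-blocking latency profile comes from running emotion analysis concurrently with transcription. A minimal sketch of that pattern (the transcribe/analyze coroutines below are hypothetical stand-ins, not the actual service calls):

import asyncio

async def transcribe_chunk(audio: bytes) -> str:
    # Stand-in for the Deepgram STT request.
    await asyncio.sleep(0.15)
    return "partial transcript"

async def analyze_emotion(audio: bytes) -> dict:
    # Stand-in for the Hume AI Expression Measurement request.
    await asyncio.sleep(0.08)
    return {"emotion": "calm", "valence": 0.6, "arousal": 0.3, "confidence": 0.9}

async def process_chunk(audio: bytes) -> tuple[str, dict]:
    # Run STT and emotion detection in parallel; chunk latency is bounded
    # by the slower of the two calls, not their sum.
    transcript, emotion = await asyncio.gather(
        transcribe_chunk(audio),
        analyze_emotion(audio),
    )
    return transcript, emotion

print(asyncio.run(process_chunk(b"\x00" * 16000)))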


Phase 2: Backchanneling System

Goal: Natural verbal acknowledgments during user speech.

Backend Service

Location: services/api-gateway/app/services/backchannel_service.py

class BackchannelService:
    """
    Generates and manages backchanneling audio.
    - Pre-caches common phrases per voice
    - Triggers based on VAD pause detection
    """

    PHRASES = {
        "en": ["uh-huh", "mm-hmm", "I see", "right", "got it"],
        "ar": ["اها", "نعم", "صح"]
    }

    async def get_backchannel_audio(self, phrase: str, voice_id: str) -> bytes
    def should_trigger(self, session_state: SessionState) -> bool

Timing Logic

  • Trigger after 2-3 seconds of continuous user speech
  • Only during natural pauses (150-300ms silence)
  • Minimum 5 seconds between backchannels
  • Never interrupt mid-sentence
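
As a rough sketch of how these rules might combine inside should_trigger (the SessionState fields here are illustrative, not the real implementation):

from dataclasses import dataclass

@dataclass
class SessionState:
    # Illustrative fields only; the real SessionState lives in the pipeline service.
    user_speech_duration_ms: float    # Continuous user speech so far
    current_pause_ms: float           # Length of the ongoing silence
    ms_since_last_backchannel: float  # Time since the previous backchannel

def should_trigger(state: SessionState) -> bool:
    # Apply the timing rules above: enough continuous speech, a natural
    # pause window, and a minimum gap since the last backchannel.
    long_enough_speech = state.user_speech_duration_ms >= 2000
    natural_pause = 150 <= state.current_pause_ms <= 300
    cooled_down = state.ms_since_last_backchannel >= 5000
    return long_enough_speech and natural_pause and cooled_down

print(should_trigger(SessionState(2500, 200, 7000)))  # True
print(should_trigger(SessionState(2500, 50, 7000)))   # False (mid-sentence)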

WebSocket Message

{ type: "backchannel.trigger", data: { phrase: string, audio_base64: string } }

Latency Impact: ~0ms (pre-cached audio)


Phase 3: Prosody Analysis

Goal: Analyze speech patterns for better intent understanding.

Backend Service

Location: services/api-gateway/app/services/prosody_analysis_service.py

@dataclass
class ProsodyAnalysis:
    speech_rate_wpm: float       # Words per minute
    pitch_variance: float        # Emotion indicator
    loudness: float              # Urgency indicator
    pause_patterns: List[float]  # Hesitation detection
    urgency_score: float         # Derived 0-1 score
    confidence_score: float      # Speaker certainty
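
The derived urgency_score can be thought of as a weighted blend of the raw prosody features. The weights and normalization ranges below are assumptions for illustration, not the production tuning:

def derive_urgency_score(speech_rate_wpm: float, loudness: float, pitch_variance: float) -> float:
    # Normalize each feature to roughly 0-1 and blend; the weights are illustrative.
    rate_component = min(max((speech_rate_wpm - 120) / 80, 0.0), 1.0)  # 120-200 wpm band
    loudness_component = min(max(loudness, 0.0), 1.0)                  # assumed already 0-1
    pitch_component = min(max(pitch_variance, 0.0), 1.0)               # assumed already 0-1
    score = 0.4 * rate_component + 0.4 * loudness_component + 0.2 * pitch_component
    return round(score, 3)

print(derive_urgency_score(speech_rate_wpm=185, loudness=0.8, pitch_variance=0.5))  # 0.745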

Integration

  • Parses Deepgram's prosody/topics metadata
  • Matches response speech rate to user's rate
  • Detects uncertainty from pitch patterns

Latency Impact: +0ms (data from Deepgram)


Phase 4: Memory & Context System

Goal: Conversation memory across turns and sessions.

Backend Service

Location: services/api-gateway/app/services/memory_context_service.py

class MemoryContextService:
    """Three-tier memory management."""

    async def store_turn_context(self, user_id, session_id, turn) -> None
        # Redis: last 10 turns, TTL = session duration

    async def get_recent_context(self, user_id, session_id, turns=5) -> list
        # Retrieve from Redis

    async def summarize_session(self, session_id) -> SessionContext
        # LLM-generated summary at session end

    async def store_long_term_memory(self, user_id, memory) -> str
        # Store in PostgreSQL + Qdrant vector

    async def retrieve_relevant_memories(self, user_id, query, top_k=5) -> list
        # Semantic search over Qdrant

    async def build_context_window(self, user_id, session_id, query) -> str
        # Assemble optimized context for LLM (max 4K tokens)
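
A simplified sketch of how build_context_window might assemble the three tiers under the 4K-token budget (the helper and the 4-characters-per-token estimate are assumptions; the real service would use the model tokenizer and retrieval calls):

def build_context_window(recent_turns: list[str], session_summary: str,
                         long_term_memories: list[str], max_tokens: int = 4000) -> str:
    # Rough token estimate (~4 characters per token).
    def estimate_tokens(text: str) -> int:
        return max(1, len(text) // 4)

    # Priority order: recent turns (Redis), then session summary, then
    # long-term memories (Qdrant), stopping when the budget is spent.
    sections: list[str] = []
    budget = max_tokens
    for candidate in [*recent_turns, session_summary, *long_term_memories]:
        cost = estimate_tokens(candidate)
        if cost > budget:
            break
        sections.append(candidate)
        budget -= cost
    return "\n".join(sections)

context = build_context_window(
    recent_turns=["User: My knee still hurts.", "Assistant: How long has it hurt?"],
    session_summary="Session summary: patient reports knee pain after running.",
    long_term_memories=["Prior note: patient prefers concise answers."],
)
print(context)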

WebSocket Message

{ type: "memory.context_loaded", data: { memories: Memory[], relevance_scores: number[] } }

Phase 5 & 6: Turn-Taking and Response Timing

Goal: Fluid conversation flow with natural turn transitions and human-like timing.

Turn States

class TurnTakingState(Enum):
    USER_TURN = "user_turn"
    TRANSITION = "transition"  # Brief transition window
    AI_TURN = "ai_turn"
    OVERLAP = "overlap"        # Both speaking (barge-in)

Response Timing Configuration

RESPONSE_TIMING = {
    "urgent":        {"delay_ms": 0,   "use_filler": False},  # Medical emergency
    "simple":        {"delay_ms": 200, "use_filler": False},  # Yes/no, confirmations
    "complex":       {"delay_ms": 600, "use_filler": True},   # Multi-part questions
    "clarification": {"delay_ms": 0,   "use_filler": False}
}
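
A sketch of how these timings might be applied before handing a response to TTS. The classify_query heuristic and word-count threshold are assumptions; the pipeline would combine prosody, emotion, and LLM signals:

import asyncio

RESPONSE_TIMING = {
    "urgent":        {"delay_ms": 0,   "use_filler": False},
    "simple":        {"delay_ms": 200, "use_filler": False},
    "complex":       {"delay_ms": 600, "use_filler": True},
    "clarification": {"delay_ms": 0,   "use_filler": False},
}

def classify_query(transcript: str, urgency_score: float) -> str:
    # Illustrative classification only.
    if urgency_score > 0.8:
        return "urgent"
    if transcript.rstrip().endswith("?") and len(transcript.split()) > 8:
        return "complex"
    return "simple"

async def pace_response(transcript: str, urgency_score: float) -> None:
    timing = RESPONSE_TIMING[classify_query(transcript, urgency_score)]
    if timing["use_filler"]:
        print("(play filler, e.g. 'Let me think...')")
    await asyncio.sleep(timing["delay_ms"] / 1000)
    print("(start TTS playback)")

asyncio.run(pace_response("Can you compare the two treatment options we discussed earlier today?", 0.2))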

WebSocket Message

{ type: "turn.state", data: { state: "user_turn" | "transition" | "ai_turn" } }

Phase 7: Conversational Repair

Goal: Graceful handling of misunderstandings.

Backend Service

Location: services/api-gateway/app/services/repair_strategy_service.py

class RepairStrategy(Enum):
    ECHO_CHECK = "echo_check"              # "So you're asking about X?"
    CLARIFY_SPECIFIC = "clarify_specific"  # "Did you mean X or Y?"
    REQUEST_REPHRASE = "request_rephrase"  # "Could you say that differently?"
    PARTIAL_ANSWER = "partial_answer"      # "I'm not sure, but..."

Features

  • Confidence scoring for responses
  • Clarifying questions when confidence < 0.7
  • Natural upward inflection for questions (SSML)
  • Frustration detection from repeated corrections
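
One way the confidence threshold might drive strategy selection (a sketch; aside from the documented 0.7 cut-off, the thresholds and candidate-intent logic are assumptions):

from enum import Enum
from typing import Optional

class RepairStrategy(Enum):
    ECHO_CHECK = "echo_check"
    CLARIFY_SPECIFIC = "clarify_specific"
    REQUEST_REPHRASE = "request_rephrase"
    PARTIAL_ANSWER = "partial_answer"

def select_repair_strategy(confidence: float, candidate_intents: list) -> Optional[RepairStrategy]:
    # At or above the 0.7 threshold no repair is needed.
    if confidence >= 0.7:
        return None
    # Two plausible readings -> offer a specific choice ("Did you mean X or Y?").
    if len(candidate_intents) == 2:
        return RepairStrategy.CLARIFY_SPECIFIC
    # One shaky reading -> echo it back for confirmation.
    if len(candidate_intents) == 1 and confidence >= 0.4:
        return RepairStrategy.ECHO_CHECK
    # Very low confidence with no clear candidate -> ask for a rephrase.
    return RepairStrategy.REQUEST_REPHRASE

print(select_repair_strategy(0.55, ["dose of metformin", "dose of metoprolol"]))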

Phase 8: Medical Dictation Core

Goal: Hands-free clinical documentation.

Backend Services

Location: services/api-gateway/app/services/

dictation_service.py

class DictationState(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    PROCESSING = "processing"
    PAUSED = "paused"
    REVIEWING = "reviewing"

class NoteType(Enum):
    SOAP = "soap"          # Subjective, Objective, Assessment, Plan
    HP = "h_and_p"         # History and Physical
    PROGRESS = "progress"  # Progress Note
    PROCEDURE = "procedure"
    CUSTOM = "custom"

voice_command_service.py

# Navigation
"go to subjective", "move to objective", "next section", "previous section"

# Formatting
"new paragraph", "bullet point", "number one/two/three"

# Editing
"delete that", "scratch that", "read that back", "undo"

# Clinical
"check interactions", "what's the dosing for", "show labs", "show medications"

# Control
"start dictation", "pause", "stop dictation", "save note"
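
A minimal command matcher over these phrase families (a sketch only; the real voice_command_service.py presumably handles fuzzier matching, synonyms, and parameters):

COMMAND_MAP = {
    # phrase prefix -> (category, action)
    "go to": ("navigation", "goto_section"),
    "next section": ("navigation", "next_section"),
    "new paragraph": ("formatting", "new_paragraph"),
    "delete that": ("editing", "delete_last"),
    "scratch that": ("editing", "delete_last"),
    "read that back": ("editing", "read_back"),
    "check interactions": ("clinical", "check_interactions"),
    "start dictation": ("control", "start"),
    "save note": ("control", "save"),
}

def match_command(utterance: str):
    # Longest-prefix match so "go to subjective" resolves before shorter phrases.
    text = utterance.lower().strip()
    for phrase in sorted(COMMAND_MAP, key=len, reverse=True):
        if text.startswith(phrase):
            category, action = COMMAND_MAP[phrase]
            argument = text[len(phrase):].strip() or None
            return {"category": category, "action": action, "argument": argument}
    return None  # Not a command; treat as dictated text.

print(match_command("go to subjective"))
print(match_command("the patient reports intermittent chest pain"))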

note_formatter_service.py

  • LLM-assisted note formatting
  • Grammar correction preserving medical terminology
  • Auto-punctuation and abbreviation handling

medical_vocabulary_service.py

  • Specialty-specific keyword sets
  • User-customizable vocabulary
  • Medical abbreviation expansion
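
Abbreviation expansion can be as simple as a per-specialty lookup applied to transcript tokens. The entries below are examples, not the shipped vocabulary:

ABBREVIATIONS = {
    # Example entries only; the real service loads specialty-specific sets.
    "htn": "hypertension",
    "dm2": "type 2 diabetes mellitus",
    "sob": "shortness of breath",
    "bid": "twice daily",
}

def expand_abbreviations(text: str) -> str:
    # Token-wise replacement; unknown tokens pass through unchanged.
    words = []
    for token in text.split():
        words.append(ABBREVIATIONS.get(token.lower(), token))
    return " ".join(words)

print(expand_abbreviations("Patient with HTN and DM2 reports SOB"))
# -> "Patient with hypertension and type 2 diabetes mellitus reports shortness of breath"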

WebSocket Messages

{ type: "dictation.state", data: { state: DictationState, note_type: NoteType } } { type: "dictation.section_update", data: { section: string, content: string } } { type: "dictation.section_change", data: { previous: string, current: string } } { type: "dictation.command", data: { command: string, executed: boolean } }

Phase 9: Patient Context Integration

Goal: Context-aware clinical assistance with HIPAA compliance.

Backend Services

patient_context_service.py

class PatientContextService:
    async def get_context_for_dictation(self, user_id, patient_id) -> DictationContext
    def generate_context_prompts(self, context) -> List[str]
        # "I see 3 recent lab results. Would you like me to summarize them?"

dictation_phi_monitor.py

  • Real-time PHI detection during dictation
  • Alert if unexpected PHI spoken outside patient context
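
A toy illustration of the alerting idea using regex patterns; the payload shape follows the phi.alert message documented below, but the patterns are deliberately minimal and a production monitor would rely on a proper PHI/NER detector:

import re
from typing import Optional

# Toy patterns for illustration; real PHI detection needs far broader coverage.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "mrn": re.compile(r"\bMRN\s*\d{6,}\b", re.IGNORECASE),
}

def check_for_phi(text: str, has_patient_context: bool) -> Optional[dict]:
    detected = [name for name, pattern in PHI_PATTERNS.items() if pattern.search(text)]
    if detected and not has_patient_context:
        # Unexpected PHI outside a loaded patient context -> raise an alert.
        return {
            "type": "phi.alert",
            "data": {
                "severity": "warning",
                "message": "PHI detected outside of an active patient context",
                "detected_phi": detected,
            },
        }
    return None

print(check_for_phi("Patient MRN 1234567 called back", has_patient_context=False))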

HIPAA Audit Events

# Added to audit_service.py
DICTATION_STARTED = "dictation_started"
PATIENT_CONTEXT_ACCESSED = "patient_context_accessed"
NOTE_SAVED = "note_saved"
PHI_DETECTED = "phi_detected"

WebSocket Messages

{ type: "patient.context_loaded", data: { patientId: string, context: PatientContext } } { type: "phi.alert", data: { severity: string, message: string, detected_phi: string[] } }

Phase 10: Frontend Integration & Analytics

Goal: Session analytics, feedback collection, and full frontend integration.

Backend Services

session_analytics_service.py

Location: services/api-gateway/app/services/session_analytics_service.py

class SessionAnalyticsService:
    """
    Comprehensive voice session analytics tracking.

    Tracks:
    - Latency metrics (STT, LLM, TTS, E2E) with percentiles
    - Interaction counts (utterances, responses, tool calls, barge-ins)
    - Quality metrics (confidence scores, turn-taking, repairs)
    - Dictation-specific metrics
    """

    def create_session(self, session_id: str, user_id: Optional[str], mode: str,
                       on_analytics_update: Optional[Callable]) -> SessionAnalytics
    def record_latency(self, session_id: str, latency_type: str, latency_ms: float) -> None
    def record_interaction(self, session_id: str, interaction_type: InteractionType,
                           word_count: int, duration_ms: float) -> None
    def record_emotion(self, session_id: str, emotion: str, valence: float, arousal: float) -> None
    def record_barge_in(self, session_id: str) -> None
    def record_repair(self, session_id: str) -> None
    def record_error(self, session_id: str, error_type: str, message: str) -> None
    def end_session(self, session_id: str) -> Optional[Dict[str, Any]]
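
The percentile bookkeeping behind record_latency can be sketched as a small in-memory aggregator (simplified; the real service also streams updates via on_analytics_update):

from statistics import quantiles

class LatencyTracker:
    """Minimal latency aggregation with p50/p95/p99, for illustration only."""

    def __init__(self) -> None:
        self.samples: list[float] = []

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def summary(self) -> dict:
        if not self.samples:
            return {"count": 0}
        # quantiles(n=100) yields 99 cut points; index 49 ~ p50, 94 ~ p95, 98 ~ p99.
        cuts = quantiles(self.samples, n=100) if len(self.samples) > 1 else self.samples * 99
        return {
            "count": len(self.samples),
            "min": min(self.samples),
            "max": max(self.samples),
            "p50": round(cuts[49], 1),
            "p95": round(cuts[94], 1),
            "p99": round(cuts[98], 1),
        }

tracker = LatencyTracker()
for ms in [120, 140, 135, 180, 90, 210, 150, 145, 160, 130]:
    tracker.record(ms)
print(tracker.summary())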

feedback_service.py

Location: services/api-gateway/app/services/feedback_service.py

class FeedbackService:
    """
    User feedback collection for voice sessions.

    Features:
    - Quick thumbs up/down during session
    - Detailed session ratings with categories
    - Bug reports and suggestions
    - Feedback prompts based on session context
    """

    def record_quick_feedback(self, session_id: str, user_id: Optional[str] = None,
                              thumbs_up: bool = True, message_id: Optional[str] = None) -> FeedbackItem
    def record_session_rating(self, session_id: str, user_id: Optional[str] = None, rating: int = 5,
                              categories: Optional[Dict[str, int]] = None,
                              comment: Optional[str] = None) -> List[FeedbackItem]
    def get_feedback_prompts(self, session_id: str, session_duration_ms: float = 0,
                             interaction_count: int = 0, has_errors: bool = False) -> List[FeedbackPrompt]
    def generate_analytics_report(self, session_ids: Optional[List[str]] = None) -> Dict[str, Any]
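
Prompt selection in get_feedback_prompts might be driven by simple session heuristics; the thresholds and prompt wording below are assumptions for illustration:

def get_feedback_prompts(session_duration_ms: float, interaction_count: int,
                         has_errors: bool) -> list:
    # Illustrative heuristics only; the real service returns FeedbackPrompt objects.
    prompts = []
    if has_errors:
        prompts.append({"id": "error_report", "text": "Something went wrong. Want to tell us what happened?"})
    if session_duration_ms >= 60_000 and interaction_count >= 5:
        prompts.append({"id": "session_rating", "text": "How was this voice session? (1-5)"})
    if not prompts:
        prompts.append({"id": "quick_thumbs", "text": "Was this helpful?"})
    return prompts

print(get_feedback_prompts(session_duration_ms=95_000, interaction_count=8, has_errors=False))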

Analytics Data Structure

interface TTSessionAnalytics {
  sessionId: string;
  userId: string | null;
  phase: string;
  mode: string;
  timing: {
    startedAt: string;
    endedAt: string | null;
    durationMs: number;
  };
  latency: {
    stt: { count: number; total: number; min: number; max: number; p50: number; p95: number; p99: number };
    llm: { count: number; total: number; min: number; max: number; p50: number; p95: number; p99: number };
    tts: { count: number; total: number; min: number; max: number; p50: number; p95: number; p99: number };
    e2e: { count: number; total: number; min: number; max: number; p50: number; p95: number; p99: number };
  };
  interactions: {
    counts: Record<string, number>;
    words: { user: number; assistant: number };
    speakingTimeMs: { user: number; assistant: number };
  };
  quality: {
    sttConfidence: { count: number; total: number; min: number; max: number };
    aiConfidence: { count: number; total: number; min: number; max: number };
    emotion: { dominant: string | null; valence: number; arousal: number };
    turnTaking: { bargeIns: number; overlaps: number; smoothTransitions: number };
    repairs: number;
  };
  dictation: {
    sectionsEdited: string[];
    commandsExecuted: number;
    wordsTranscribed: number;
  } | null;
  errors: {
    count: number;
    details: Array<{ timestamp: string; type: string; message: string }>;
  };
}

WebSocket Messages

// Analytics
{ type: "analytics.update", data: TTSessionAnalytics }
{ type: "analytics.session_ended", data: TTSessionAnalytics }

// Feedback
{ type: "feedback.prompts", data: { prompts: TTFeedbackPrompt[] } }
{ type: "feedback.recorded", data: { thumbsUp: boolean, messageId: string | null } }

Frontend Handlers

In useThinkerTalkerSession.ts:

// Phase 10 callbacks
onAnalyticsUpdate?: (analytics: TTSessionAnalytics) => void;
onSessionEnded?: (analytics: TTSessionAnalytics) => void;
onFeedbackPrompts?: (event: TTFeedbackPromptsEvent) => void;
onFeedbackRecorded?: (event: TTFeedbackRecordedEvent) => void;

Complete WebSocket Protocol

All Message Types

Phase | Message Type | Direction | Description
----- | ------------ | --------- | -----------
1 | emotion.detected | Server → Client | User emotion detected
2 | backchannel.trigger | Server → Client | Play backchannel audio
4 | memory.context_loaded | Server → Client | Relevant memories loaded
5 | turn.state | Server → Client | Turn state changed
8 | dictation.state | Server → Client | Dictation state changed
8 | dictation.section_update | Server → Client | Section content updated
8 | dictation.section_change | Server → Client | Current section changed
8 | dictation.command | Server → Client | Voice command executed
9 | patient.context_loaded | Server → Client | Patient context loaded
9 | phi.alert | Server → Client | PHI detected alert
10 | analytics.update | Server → Client | Session analytics update
10 | analytics.session_ended | Server → Client | Final session analytics
10 | feedback.prompts | Server → Client | Feedback prompts
10 | feedback.recorded | Server → Client | Feedback recorded confirmation

Integration Points

Voice Pipeline Service

Location: services/api-gateway/app/services/voice_pipeline_service.py

The voice pipeline service orchestrates all 10 phases:

class VoicePipelineService:
    # Phase 1-9 services
    _emotion_detector: EmotionDetectionService
    _backchannel_service: BackchannelService
    _prosody_analyzer: ProsodyAnalysisService
    _memory_service: MemoryContextService
    _repair_service: RepairStrategyService
    _dictation_service: DictationService
    _voice_command_service: VoiceCommandService
    _note_formatter: NoteFormatterService
    _medical_vocabulary: MedicalVocabularyService
    _patient_context_service: PatientContextService
    _phi_monitor: DictationPHIMonitor

    # Phase 10 services
    _analytics: SessionAnalytics
    _analytics_service: SessionAnalyticsService
    _feedback_service: FeedbackService

    async def start(self):
        # Initialize analytics session
        self._analytics = self._analytics_service.create_session(
            session_id=self.session_id,
            user_id=self.user_id,
            mode="dictation" if self.config.mode == PipelineMode.DICTATION else "conversation",
            on_analytics_update=self._send_analytics_update,
        )

    async def stop(self):
        # Send feedback prompts
        prompts = self._feedback_service.get_feedback_prompts(...)
        await self._on_message(PipelineMessage(type="feedback.prompts", ...))

        # Finalize analytics
        final_analytics = self._analytics_service.end_session(self.session_id)
        await self._on_message(PipelineMessage(type="analytics.session_ended", ...))

Frontend Hook

Location: apps/web-app/src/hooks/useThinkerTalkerSession.ts

All 10 phases integrated with callbacks:

export interface UseThinkerTalkerSessionOptions {
  // ... existing options ...

  // Phase 1: Emotion
  onEmotionDetected?: (event: TTEmotionDetectedEvent) => void;

  // Phase 2: Backchanneling
  onBackchannelTrigger?: (event: TTBackchannelTriggerEvent) => void;

  // Phase 4: Memory
  onMemoryContextLoaded?: (event: TTMemoryContextLoadedEvent) => void;

  // Phase 5: Turn-taking
  onTurnStateChange?: (event: TTTurnStateChangeEvent) => void;

  // Phase 8: Dictation
  onDictationStateChange?: (event: TTDictationStateChangeEvent) => void;
  onDictationSectionUpdate?: (event: TTDictationSectionUpdateEvent) => void;
  onDictationSectionChange?: (event: TTDictationSectionChangeEvent) => void;
  onDictationCommand?: (event: TTDictationCommandEvent) => void;

  // Phase 9: Patient Context
  onPatientContextLoaded?: (event: TTPatientContextLoadedEvent) => void;
  onPHIAlert?: (event: TTPHIAlertEvent) => void;

  // Phase 10: Analytics & Feedback
  onAnalyticsUpdate?: (analytics: TTSessionAnalytics) => void;
  onSessionEnded?: (analytics: TTSessionAnalytics) => void;
  onFeedbackPrompts?: (event: TTFeedbackPromptsEvent) => void;
  onFeedbackRecorded?: (event: TTFeedbackRecordedEvent) => void;
}

File Reference

Backend Services (New)

File | Phase | Purpose
---- | ----- | -------
emotion_detection_service.py | 1 | Hume AI emotion detection
backchannel_service.py | 2 | Natural acknowledgments
prosody_analysis_service.py | 3 | Speech pattern analysis
memory_context_service.py | 4 | Three-tier memory system
repair_strategy_service.py | 7 | Conversational repair
dictation_service.py | 8 | Medical dictation state
voice_command_service.py | 8 | Voice command processing
note_formatter_service.py | 8 | Note formatting
medical_vocabulary_service.py | 8 | Medical terminology
patient_context_service.py | 9 | Patient context
dictation_phi_monitor.py | 9 | PHI monitoring
session_analytics_service.py | 10 | Session analytics
feedback_service.py | 10 | User feedback

Backend Services (Modified)

File | Changes
---- | -------
voice_pipeline_service.py | Orchestrates all 10 phases, analytics integration
thinker_service.py | Emotion context, repair strategies
talker_service.py | Variable timing, backchanneling
streaming_stt_service.py | Nova-3 Medical, prosody features
audit_service.py | Dictation audit events

Frontend

File | Purpose
---- | -------
useThinkerTalkerSession.ts | All message type handlers

Success Metrics

Metric | Target | Measurement
------ | ------ | -----------
Response latency | <200ms | P95 from analytics
Emotion detection accuracy | >80% | Manual validation
User satisfaction | >4.2/5 | Feedback ratings
Dictation word accuracy | >95% | WER on medical vocabulary tests
Memory retrieval relevance | >0.7 | Cosine similarity
Turn-taking smoothness | <5% interruption rate | Session analytics


Last updated: 2025-12-03. All 10 phases implemented and integrated.
