Voice Mode v4.1 Phase 3 Implementation Plan

Sourced from docs/voice/phase3-implementation-plan.md

Status: COMPLETE - PR #155 merged on 2024-12-04

This document outlines the work completed for Voice Mode v4.1 Phase 3, including UI integration, advanced services, and performance tuning.

Phase 3 Overview

gantt
    title Phase 3 Implementation Timeline
    dateFormat YYYY-MM-DD
    section UI Integration
    Voice-first input bar        :a1, 2024-12-05, 3d
    Streaming text rendering     :a2, after a1, 2d
    Latency indicator            :a3, after a2, 1d
    Thinking feedback panel      :a4, after a3, 2d
    section Advanced Services
    FHIR streaming integration   :b1, 2024-12-05, 4d
    Speaker diarization          :b2, after b1, 3d
    section Performance
    Final performance tuning     :c1, after a4, 2d
    Load testing & optimization  :c2, after c1, 2d

Workstream 1: UI Integration

1.1 Voice-First Input Bar

Feature Flag: frontend.voice_first_input_bar

A unified input component that prioritizes voice interaction:

interface VoiceFirstInputBarProps {
  mode: "voice" | "text" | "hybrid";
  onVoiceStart: () => void;
  onVoiceEnd: () => void;
  onTextSubmit: (text: string) => void;
  vadPreset: VADPresetType;
  rtlEnabled: boolean;
}
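
VADPresetType is referenced above but not defined in this plan; a minimal sketch of what the preset union and its tuning values might look like (the threshold and silence values are illustrative assumptions, not measured settings):

type VADPresetType = "sensitive" | "balanced" | "relaxed";

// Hypothetical per-preset tuning values; the actual numbers would come from VAD tuning work.
const VAD_PRESETS: Record<VADPresetType, { speechThreshold: number; trailingSilenceMs: number }> = {
  sensitive: { speechThreshold: 0.3, trailingSilenceMs: 300 }, // triggers early, ends turns quickly
  balanced: { speechThreshold: 0.5, trailingSilenceMs: 500 },
  relaxed: { speechThreshold: 0.7, trailingSilenceMs: 800 },   // tolerates longer pauses
};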

Tasks:

  • Create VoiceFirstInputBar component
  • Integrate VAD preset selector (sensitive/balanced/relaxed)
  • Add RTL layout support for Arabic/Hebrew
  • Implement smooth voice/text mode transition
  • Add accessibility keyboard shortcuts

Success Criteria:

  • Voice activation < 100ms
  • Mode switch < 200ms
  • Meets WCAG 2.1 AA accessibility standards

1.2 Streaming Text Rendering

Feature Flag: frontend.streaming_text_render

Real-time text display as Thinker generates response:
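
A minimal sketch of the token-by-token accumulation logic, assuming a hypothetical useStreamingText hook (the name matches the hook file listed under PR #1, but the shape shown here is an assumption):

import { useCallback, useRef, useState } from "react";

// Hypothetical hook: accumulate streamed tokens into displayable text.
export function useStreamingText() {
  const [text, setText] = useState("");
  const [isStreaming, setIsStreaming] = useState(false);
  const buffer = useRef("");

  // Called once per token received from the Thinker stream.
  const appendToken = useCallback((token: string) => {
    buffer.current += token;
    setText(buffer.current);
    setIsStreaming(true);
  }, []);

  // Called when the stream closes.
  const finish = useCallback(() => setIsStreaming(false), []);

  const reset = useCallback(() => {
    buffer.current = "";
    setText("");
    setIsStreaming(false);
  }, []);

  return { text, isStreaming, appendToken, finish, reset };
}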

Tasks:

  • Implement token-by-token streaming display
  • Add cursor animation during streaming
  • Support markdown rendering during stream
  • Handle RTL text direction switching
  • Add smooth scroll-to-bottom behavior

Success Criteria:

  • First token visible within 50ms of receipt
  • No flicker or reflow during streaming
  • RTL text renders correctly

1.3 Latency Indicator

Feature Flag: frontend.latency_indicator

Visual feedback showing response latency:

interface LatencyIndicatorProps {
  ttfa: number;         // Time to first audio (ms)
  totalLatency: number; // Total response time (ms)
  phiMode: PHIRoutingMode;
  showDetails: boolean;
}

Tasks:

  • Create LatencyIndicator component
  • Color-code by performance (green < 300ms, yellow < 500ms, red ≥ 500ms; see the sketch after this list)
  • Show PHI routing mode indicator (🛡️/🔒/☁️)
  • Add tooltip with detailed breakdown
  • Store latency history for user feedback
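
The color thresholds from the tasks above could map to a small pure helper, sketched here:

type LatencyLevel = "green" | "yellow" | "red";

// Map a latency measurement (ms) onto the color bands listed in the task above.
function latencyLevel(ms: number): LatencyLevel {
  if (ms < 300) return "green";
  if (ms < 500) return "yellow";
  return "red";
}

// Example: latencyLevel(ttfa) would drive the indicator's color class.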

Success Criteria:

  • Updates in real-time during response
  • Accurate to ±10ms
  • Non-intrusive visual design

1.4 Thinking Feedback Panel

Feature Flag: frontend.thinking_feedback

Visual and audio feedback while AI processes:
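
As one illustration of the audio tones task below, a gentle_beep could be produced with the Web Audio API; the frequency, gain, and duration values are assumptions:

// Play a short, quiet "gentle beep" while the assistant is thinking.
function playGentleBeep(volume = 0.1): void {
  const ctx = new AudioContext();
  const osc = ctx.createOscillator();
  const gain = ctx.createGain();

  osc.type = "sine";
  osc.frequency.value = 660; // assumed pitch for the gentle_beep tone
  gain.gain.value = volume;  // should respect the user's volume preference

  osc.connect(gain);
  gain.connect(ctx.destination);

  osc.start();
  osc.stop(ctx.currentTime + 0.15); // ~150 ms blip
  osc.onended = () => {
    void ctx.close();
  };
}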

Tasks:

  • Create ThinkingFeedbackPanel component
  • Implement audio tones (gentle_beep, soft_chime, subtle_tick)
  • Add visual indicators (dots, pulse, spinner, progress)
  • Support haptic feedback on mobile
  • Integrate with existing thinking tone settings

Success Criteria:

  • Feedback starts within 50ms of thinking state
  • Respects user volume preferences
  • Works across mobile and desktop

Workstream 2: Advanced Services

2.1 FHIR Streaming Integration

Feature Flag: backend.fhir_streaming

Real-time FHIR data streaming for clinical context:

class FHIRStreamingService:
    async def subscribe_to_patient(self, patient_id: str):
        """Subscribe to real-time patient updates."""
        pass

    async def stream_observations(self, patient_id: str):
        """Stream lab results, vitals as they arrive."""
        pass
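
On the client side, the WebSocket endpoint planned in the tasks below might be consumed roughly as follows; the message shape (one JSON-serialized Observation per message) and the unsubscribe pattern are assumptions:

// Hypothetical client subscription to streamed FHIR observations.
function subscribeToPatientObservations(
  wsUrl: string, // assumed endpoint, e.g. the real-time updates WebSocket with a patient_id query param
  onObservation: (obs: unknown) => void,
): () => void {
  const ws = new WebSocket(wsUrl);

  ws.onmessage = (event) => {
    // Each message is assumed to carry one FHIR Observation serialized as JSON.
    onObservation(JSON.parse(event.data));
  };

  ws.onclose = () => {
    // Reconnection with backoff would go here (see the tasks below).
  };

  // Caller gets an unsubscribe function.
  return () => ws.close();
}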

Tasks:

  • Implement FHIR subscription service
  • Add WebSocket endpoint for real-time updates
  • Integrate with Thinker context for live data
  • Add PHI detection for streamed data
  • Implement reconnection and error handling

Success Criteria:

  • New data visible within 2 seconds of FHIR event
  • PHI properly detected and routed
  • Handles network disconnections gracefully

2.2 Speaker Diarization

Feature Flag: backend.speaker_diarization

Multi-speaker detection and attribution:

class SpeakerDiarizationService:
    async def process_audio(
        self,
        audio: bytes,
        num_speakers: Optional[int] = None,
    ) -> List[SpeakerSegment]:
        """Identify speaker segments in audio."""
        pass

    def get_speaker_profile(self, speaker_id: str) -> SpeakerProfile:
        """Get or create speaker profile."""
        pass
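
The SpeakerSegment and SpeakerProfile shapes are not defined in this plan; a hypothetical sketch, limited to the fields the SpeakerAttributedTranscript prototype later in this document actually reads:

// Assumed shapes; the field names mirror what the transcript prototype reads.
interface SpeakerSegment {
  speakerId: string;  // e.g. "SPEAKER_00"
  startMs: number;    // segment start, in ms from session start
  endMs: number;      // segment end
  transcript: string; // text attributed to this speaker
}

interface SpeakerProfile {
  speakerId: string;
  name?: string;      // optional display name; the UI falls back to speakerId
}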

Tasks:

  • Implement pyannote.audio integration
  • Create speaker embedding database
  • Add real-time speaker change detection
  • Integrate with Thinker for multi-party context
  • Support up to 4 concurrent speakers

Success Criteria:

  • Speaker change detected within 500ms
  • 90% accuracy for 2-speaker conversations
  • Latency < 200ms per segment

Workstream 3: Performance Tuning

3.1 Final Performance Optimization

Tasks:

  • Profile end-to-end latency breakdown
  • Optimize VAD chunk size for latency/accuracy trade-off
  • Tune Thinker token generation parameters
  • Optimize Talker audio chunk sizes
  • Implement adaptive quality based on connection speed (sketched below)
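
A rough sketch of the adaptive-quality idea on the client, using the browser's Network Information API (navigator.connection is not available in every browser, and the quality tiers here are assumptions):

type AudioQuality = "low" | "medium" | "high";

// Pick an audio quality tier from the reported downlink estimate (Mbps).
function pickAudioQuality(): AudioQuality {
  // Network Information API; unsupported in some browsers, so default conservatively.
  const connection = (navigator as any).connection;
  const downlink: number | undefined = connection?.downlink;

  if (downlink === undefined) return "medium"; // unknown connection: assume mid tier
  if (downlink < 1) return "low";              // <1 Mbps: smaller chunks / lower bitrate
  if (downlink < 5) return "medium";
  return "high";
}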

Target Metrics:

Metric                      | Target    | Current
----------------------------|-----------|---------
Time to First Audio (TTFA)  | < 300ms   | ~400ms
End-to-End Latency          | < 1000ms  | ~1200ms
PHI Detection Latency       | < 50ms    | ~75ms
VAD Latency                 | < 20ms    | ~25ms

3.2 Load Testing

Tasks:

  • Create load testing scenarios (10, 50, 100 concurrent sessions; see the sketch after this list)
  • Test PHI routing under load
  • Measure memory usage over extended sessions
  • Validate WebSocket connection stability
  • Document performance characteristics
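
A minimal sketch of the concurrent-session scenario, written against the Node ws package; the WebSocket URL and the hold-open behavior are assumptions rather than the actual voice_load_test.py harness:

import WebSocket from "ws";

// Open n concurrent voice sessions, hold each open for holdMs, then close.
async function runSessions(n: number, url: string, holdMs = 60_000): Promise<void> {
  const sessions = Array.from({ length: n }, () =>
    new Promise<void>((resolve, reject) => {
      const ws = new WebSocket(url); // the voice WebSocket endpoint (assumed)
      ws.on("open", () => setTimeout(() => ws.close(), holdMs));
      ws.on("close", () => resolve());
      ws.on("error", (err) => reject(err));
    }),
  );
  await Promise.all(sessions);
}

// Scenario sizes from the task above: runSessions(10, url), runSessions(50, url), runSessions(100, url).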

Feature Flag Definitions

Add to flag_definitions.py:

# Phase 3 Feature Flags
PHASE_3_FLAGS = {
    # UI Features
    "frontend.voice_first_input_bar": {
        "default": False,
        "description": "Enable voice-first unified input bar",
        "rollout_percentage": 0,
    },
    "frontend.streaming_text_render": {
        "default": False,
        "description": "Enable streaming text rendering",
        "rollout_percentage": 0,
    },
    "frontend.latency_indicator": {
        "default": False,
        "description": "Show latency indicator in voice mode",
        "rollout_percentage": 0,
    },
    "frontend.thinking_feedback": {
        "default": True,  # Already partially implemented
        "description": "Enable thinking feedback panel",
        "rollout_percentage": 100,
    },
    # Backend Features
    "backend.fhir_streaming": {
        "default": False,
        "description": "Enable FHIR real-time streaming",
        "rollout_percentage": 0,
    },
    "backend.speaker_diarization": {
        "default": False,
        "description": "Enable multi-speaker detection",
        "rollout_percentage": 0,
    },
    # Performance Features
    "backend.adaptive_quality": {
        "default": False,
        "description": "Adapt quality based on connection speed",
        "rollout_percentage": 0,
    },
}
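
The rollout_percentage field implies percentage-based gating; one common, deterministic way to apply it is bucketing users by a hash of their id (a sketch of the general technique, not necessarily how these flags are consumed in the codebase):

// Deterministic percentage rollout: the same user always lands in the same bucket.
function isEnabledForUser(userId: string, rolloutPercentage: number): boolean {
  // Simple 32-bit string hash; any stable hash works, the point is per-user determinism.
  let hash = 0;
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash % 100 < rolloutPercentage;
}

// Example: isEnabledForUser("user-123", 25) puts roughly 25% of users in the enabled bucket.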

PR Templates

UI Feature PR Template

## Summary
[Brief description of UI feature]

## Changes
- [ ] Component implementation
- [ ] Store integration
- [ ] Accessibility support
- [ ] RTL support
- [ ] Unit tests
- [ ] Storybook stories

## Test Plan
- [ ] Manual testing on Chrome, Firefox, Safari
- [ ] Mobile testing (iOS Safari, Android Chrome)
- [ ] Screen reader testing
- [ ] RTL layout testing

## Screenshots
[Before/After screenshots]

## Performance Impact
[Any latency or bundle size changes]

Backend Service PR Template

## Summary
[Brief description of backend feature]

## Changes
- [ ] Service implementation
- [ ] API endpoints
- [ ] Feature flag integration
- [ ] PHI handling (if applicable)
- [ ] Unit tests
- [ ] Integration tests

## Test Plan
- [ ] pytest tests pass
- [ ] Load testing results
- [ ] PHI routing verification

## Metrics
- Latency impact: [expected change]
- Memory impact: [expected change]

## Rollback Plan
[How to disable/rollback if issues]

Success Criteria (Phase 3 Complete)

  • All UI components implemented and accessible
  • FHIR streaming integration functional
  • Speaker diarization working for 2+ speakers
  • TTFA < 300ms at the 95th percentile
  • All feature flags documented and functional
  • Load testing complete (100 concurrent sessions)
  • Documentation updated

Prototypes: Surfacing Data to Users

FHIR Data Display Prototype

When FHIR streaming detects new patient data, it will be surfaced in the voice interface:

// VitalsPanel component prototype
interface VitalsPanelProps {
  patientId: string;
  observations: FHIRObservation[];
  onVitalClick: (observation: FHIRObservation) => void;
}

const VitalsPanel: React.FC<VitalsPanelProps> = ({ patientId, observations, onVitalClick }) => {
  // Group by category
  const vitals = observations.filter((o) => o.resourceType === "vital-signs");
  const labs = observations.filter((o) => o.resourceType === "laboratory");

  return (
    <div className="vitals-panel bg-white dark:bg-gray-800 rounded-lg p-4">
      <h3 className="text-lg font-semibold mb-3">Latest Patient Data</h3>

      {/* Real-time indicator */}
      <div className="flex items-center gap-2 mb-3">
        <span className="w-2 h-2 bg-green-500 rounded-full animate-pulse" />
        <span className="text-sm text-gray-500">Live updates</span>
      </div>

      {/* Vital signs grid */}
      <div className="grid grid-cols-2 gap-3 mb-4">
        {vitals.slice(0, 4).map((vital) => (
          <VitalCard
            key={vital.resourceId}
            label={vital.codeDisplay}
            value={`${vital.valueQuantity} ${vital.valueUnit}`}
            interpretation={vital.interpretation}
            onClick={() => onVitalClick(vital)}
          />
        ))}
      </div>

      {/* Lab results list */}
      {labs.length > 0 && (
        <div className="mt-4">
          <h4 className="text-sm font-medium mb-2">Recent Labs</h4>
          <ul className="space-y-2">
            {labs.slice(0, 5).map((lab) => (
              <LabResultRow
                key={lab.resourceId}
                label={lab.codeDisplay}
                value={lab.value || `${lab.valueQuantity} ${lab.valueUnit}`}
                interpretation={lab.interpretation}
                referenceRange={lab.referenceRange}
              />
            ))}
          </ul>
        </div>
      )}
    </div>
  );
};

Voice Context Injection:

# In Thinker service, inject FHIR context into prompt
async def build_context_with_fhir(
    session_id: str,
    patient_id: str,
    query: str,
) -> str:
    # Get latest observations
    fhir_service = get_fhir_subscription_service()
    vitals = await fhir_service.get_latest_vitals(patient_id, max_results=5)
    labs = await fhir_service.get_latest_labs(patient_id, max_results=5)

    # Build context string
    context_parts = ["## Current Patient Data"]

    if vitals:
        context_parts.append("\n### Vital Signs")
        for v in vitals:
            context_parts.append(f"- {v.to_context_string()}")

    if labs:
        context_parts.append("\n### Recent Lab Results")
        for l in labs:
            context_parts.append(f"- {l.to_context_string()}")

    return "\n".join(context_parts)

Speaker Diarization Display Prototype

When multiple speakers are detected, the UI will show speaker attribution:

// SpeakerAttributedTranscript component prototype
interface SpeakerAttributedTranscriptProps {
  segments: SpeakerSegment[];
  speakerProfiles: Map<string, SpeakerProfile>;
  currentSpeaker?: string;
}

const SpeakerAttributedTranscript: React.FC<SpeakerAttributedTranscriptProps> = ({
  segments,
  speakerProfiles,
  currentSpeaker,
}) => {
  // Get speaker color
  const getSpeakerColor = (speakerId: string) => {
    const colors = ["blue", "green", "purple", "orange"];
    const index = parseInt(speakerId.replace("SPEAKER_", "")) || 0;
    return colors[index % colors.length];
  };

  return (
    <div className="speaker-transcript space-y-3">
      {/* Speaker legend */}
      <div className="flex gap-2 mb-4">
        {Array.from(speakerProfiles.entries()).map(([id, profile]) => (
          <div
            key={id}
            className={cn(
              "flex items-center gap-1 px-2 py-1 rounded-full text-xs",
              `bg-${getSpeakerColor(id)}-100 text-${getSpeakerColor(id)}-700`,
              currentSpeaker === id && "ring-2 ring-blue-500",
            )}
          >
            <span className="w-2 h-2 rounded-full bg-current" />
            {profile.name || id}
          </div>
        ))}
      </div>

      {/* Transcript with speaker labels */}
      {segments.map((segment, index) => (
        <div key={index} className={cn("flex gap-3", segment.speakerId === currentSpeaker && "animate-pulse")}>
          {/* Speaker indicator */}
          <div
            className={cn(
              "w-8 h-8 rounded-full flex items-center justify-center text-xs font-medium",
              `bg-${getSpeakerColor(segment.speakerId)}-100`,
              `text-${getSpeakerColor(segment.speakerId)}-700`,
            )}
          >
            {segment.speakerId.replace("SPEAKER_", "")}
          </div>

          {/* Transcript text */}
          <div className="flex-1">
            <div className="text-xs text-gray-500 mb-1">
              {formatTime(segment.startMs)} - {formatTime(segment.endMs)}
            </div>
            <div className="text-sm">{segment.transcript}</div>
          </div>
        </div>
      ))}
    </div>
  );
};

Multi-Party Context for Thinker:

# Build multi-speaker context for Thinker
def build_multi_speaker_context(
    diarization_result: DiarizationResult,
    transcripts: Dict[str, str],  # speaker_id -> transcript
) -> str:
    context_parts = ["## Conversation Participants"]

    speaker_summary = diarization_result.get_speaker_summary()
    for speaker_id, speaking_time_ms in speaker_summary.items():
        context_parts.append(
            f"- {speaker_id}: {speaking_time_ms / 1000:.1f}s speaking time"
        )

    context_parts.append("\n## Conversation Transcript")
    for speaker_id, transcript in transcripts.items():
        context_parts.append(f"\n### {speaker_id}:")
        context_parts.append(transcript)

    return "\n".join(context_parts)

PR Breakdown for Phase 3

PR #1: UI Integration (Voice-First Input)

Branch: feature/voice-mode-v4.1-phase3-ui

Files:

  • apps/web-app/src/components/voice/VoiceFirstInputBar.tsx
  • apps/web-app/src/components/voice/StreamingTextDisplay.tsx
  • apps/web-app/src/components/voice/LatencyIndicator.tsx
  • apps/web-app/src/components/voice/ThinkingFeedbackPanel.tsx
  • apps/web-app/src/hooks/useStreamingText.ts
  • apps/web-app/src/hooks/useThinkingFeedback.ts

PR #2: Advanced Services (FHIR + Diarization)

Branch: feature/voice-mode-v4.1-phase3-services

Files:

  • services/api-gateway/app/services/speaker_diarization_service.py
  • services/api-gateway/app/services/fhir_subscription_service.py
  • services/api-gateway/app/api/voice_fhir.py
  • services/api-gateway/app/api/voice_diarization.py
  • apps/web-app/src/components/voice/VitalsPanel.tsx
  • apps/web-app/src/components/voice/SpeakerAttributedTranscript.tsx

PR #3: Performance & Quality

Branch: feature/voice-mode-v4.1-phase3-performance

Files:

  • services/api-gateway/app/services/adaptive_quality_service.py
  • services/api-gateway/tests/load/voice_load_test.py
  • docs/voice/performance-tuning-guide.md