# Latency Budgets Guide

Voice Mode v4.1 introduces latency-aware orchestration to maintain responsive voice interactions, with a target of sub-700ms end-to-end latency.

## Overview

The latency-aware orchestrator monitors each processing stage and applies graceful degradation when a stage exceeds its budget.
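The core idea can be sketched as a small per-stage timer that records latency and flags any stage that blows its budget. This is a minimal sketch, not the orchestrator's actual API; `StageBudgetTracker` and its methods are hypothetical names for illustration.

```python
import time
from contextlib import contextmanager


class StageBudgetTracker:
    """Records per-stage latency and flags stages that exceed their budget."""

    def __init__(self, budgets_ms: dict):
        self.budgets_ms = budgets_ms
        self.stage_latencies = {}   # stage name -> elapsed ms
        self.over_budget = []       # stages that exceeded their budget

    @contextmanager
    def stage(self, name: str):
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            self.stage_latencies[name] = elapsed_ms
            # Stages without a configured budget are never flagged
            if elapsed_ms > self.budgets_ms.get(name, float("inf")):
                self.over_budget.append(name)


tracker = StageBudgetTracker({"stt": 200, "rag": 300})
with tracker.stage("stt"):
    time.sleep(0.01)  # stand-in for speech-to-text work
```

The real orchestrator would consult `over_budget` after each stage to decide which degradation (skip, limit, truncate) to apply before the next stage runs.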
```
┌─────────────────────────────────────────────────────────────────────┐
│                       Voice Pipeline Stages                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Audio       STT      Lang      Translation    RAG    LLM     TTS   │
│  Capture ──▶     ──▶  Detect ─▶             ──▶    ──▶     ──▶      │
│                                                                     │
│  [50ms]   [200ms]   [50ms]    [200ms]    [300ms]  [300ms]  [150ms]  │
│                                                                     │
│                     Total Budget: 700ms E2E                         │
└─────────────────────────────────────────────────────────────────────┘
```

Note that the per-stage budgets sum to more than the 700ms end-to-end target; this works in part because stages can overlap (streaming) and some stages, such as translation, do not run on every request.

## Budget Configuration

### Default Budgets

```python
from app.services.latency_aware_orchestrator import LatencyBudget

default_budget = LatencyBudget(
    audio_capture_ms=50,
    stt_ms=200,
    language_detection_ms=50,
    translation_ms=200,
    rag_ms=300,
    llm_first_token_ms=300,
    tts_first_chunk_ms=150,
    total_budget_ms=700,
)
```

### Stage Details

| Stage              | Budget | Description                         | Degradation          |
| ------------------ | ------ | ----------------------------------- | -------------------- |
| Audio capture      | 50ms   | Mic activation to first audio chunk | Log warning          |
| STT                | 200ms  | Speech-to-text processing           | Use cached partial   |
| Language detection | 50ms   | Detect query language               | Default to user lang |
| Translation        | 200ms  | Translate non-English queries       | Skip translation     |
| RAG retrieval      | 300ms  | Knowledge base search               | Limit results        |
| LLM first token    | 300ms  | Time to first LLM token             | Shorten context      |
| TTS first chunk    | 150ms  | Time to first audio chunk           | Use cached greeting  |

## Degradation Types

### Degradation Enum

`DegradationType` is importable from `app.services.latency_aware_orchestrator` and is defined as:

```python
from enum import Enum

class DegradationType(str, Enum):
    LANGUAGE_DETECTION_SKIPPED = "language_detection_skipped"
    LANGUAGE_DETECTION_BUDGET_EXCEEDED = "language_detection_budget_exceeded"
    TRANSLATION_SKIPPED = "translation_skipped"
    TRANSLATION_BUDGET_EXCEEDED = "translation_budget_exceeded"
    TRANSLATION_FAILED = "translation_failed"
    RAG_LIMITED_TO_1 = "rag_limited_to_1"
    RAG_LIMITED_TO_3 = "rag_limited_to_3"
    RAG_RETRIEVAL_FAILED = "rag_retrieval_failed"
    LLM_CONTEXT_SHORTENED = "llm_context_shortened"
    TTS_USED_CACHED_GREETING = "tts_used_cached_greeting"
    PARALLEL_STT_REDUCED = "parallel_stt_reduced"
```

### Degradation Actions

| Scenario                | Condition                    | Action                                |
| ----------------------- | ---------------------------- | ------------------------------------- |
| Language detection slow | > 50ms                       | Skip, use user's preferred language   |
| Translation slow        | > 200ms                      | Skip translation, use original query  |
| Translation failed      | API error or `result.failed` | Use original query + multilingual LLM |
| RAG under pressure      | < 500ms remaining            | Return top-1 result only              |
| RAG moderately slow     | < 700ms remaining            | Return top-3 results                  |
| LLM context too large   | Exceeds token limit          | Truncate context                      |
| TTS cold start          | First request                | Use cached greeting audio             |

### Translation Failure Handling

When translation fails, the orchestrator raises `TranslationFailedError`:

```python
from app.services.latency_aware_orchestrator import TranslationFailedError

try:
    result = await orchestrator.process_with_budgets(audio_data, user_language="es")
except TranslationFailedError as e:
    # Graceful degradation: use original query
    logger.warning(f"Translation failed: {e}, using original query")
```

The orchestrator checks both:

- **Exception handling**: Wraps translation API exceptions
- **Failed result flag**: Checks `result.failed` on translation results

This triggers `DegradationType.TRANSLATION_FAILED` in the degradation list, allowing the system to continue processing with the original (non-translated) query while informing the user of reduced accuracy.
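The budget-plus-failure handling described above can be sketched as a single helper. This is an illustrative sketch, not the orchestrator's actual implementation; `translate_with_budget` is a hypothetical name, and the string tags mirror the `DegradationType` values.

```python
import asyncio


async def translate_with_budget(translate, query: str, budget_ms: float,
                                degradations: list) -> str:
    """Attempt translation within its latency budget.

    Falls back to the original query on timeout or failure, recording
    the applied degradation so the caller can surface it to the user.
    """
    try:
        result = await asyncio.wait_for(translate(query), timeout=budget_ms / 1000)
    except asyncio.TimeoutError:
        degradations.append("translation_skipped")
        return query
    except Exception:
        degradations.append("translation_failed")
        return query
    # Honor a failed-result flag if the translation client sets one
    if getattr(result, "failed", False):
        degradations.append("translation_failed")
        return query
    return result
```

Both failure paths leave the pipeline running on the original query, matching the "use original query + multilingual LLM" action in the table above.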
## Usage

### Basic Usage

```python
from app.services.latency_aware_orchestrator import (
    LatencyAwareVoiceOrchestrator,
    get_latency_aware_orchestrator,
)

# Get singleton instance
orchestrator = get_latency_aware_orchestrator()

# Process voice request with budget tracking
result = await orchestrator.process_with_budgets(
    audio_data=audio_bytes,
    user_language="es",
)

# Check result
print(f"Transcript: {result.transcript}")
print(f"Response: {result.response}")
print(f"Total latency: {result.total_latency_ms}ms")
print(f"Degradations: {result.degradation_applied}")
print(f"Warnings: {result.warnings}")
```

### Result Structure

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class VoiceProcessingResult:
    transcript: str                    # STT result
    detected_language: str             # Detected query language
    response: str                      # LLM response
    sources: List[Dict]                # RAG sources
    audio_url: Optional[str]           # TTS audio URL
    total_latency_ms: float            # End-to-end latency
    stage_latencies: Dict[str, float]  # Per-stage timing
    degradation_applied: List[str]     # Applied degradations
    warnings: List[str]                # Warning messages
    success: bool                      # Overall success
```

## Frontend Integration

### LatencyIndicator Component

Display real-time latency status:

```tsx
import { LatencyIndicator } from "@/components/voice/LatencyIndicator";

// Minimal usage; see the component source for available props.
<LatencyIndicator />;
```

### Status Colors

| Status | Latency   | Color  |
| ------ | --------- | ------ |
| Good   | < 500ms   | Green  |
| Fair   | 500-700ms | Yellow |
| Slow   | > 700ms   | Red    |

### Degradation Tooltips

The component shows user-friendly labels for degradations:

```typescript
const DEGRADATION_LABELS = {
  language_detection_skipped: "Language detection skipped",
  translation_skipped: "Translation skipped",
  translation_failed: "Translation failed",
  rag_limited_to_1: "Search limited",
  rag_limited_to_3: "Search limited",
  llm_context_shortened: "Context shortened",
  tts_used_cached_greeting: "Audio cached",
  parallel_stt_reduced: "Speech recognition simplified",
};
```

## Monitoring

### Metrics

The orchestrator emits metrics for monitoring:

```
# Stage timing metrics
voice_stage_latency_ms{stage="stt"} 145
voice_stage_latency_ms{stage="translation"} 178
voice_stage_latency_ms{stage="rag"} 234

# Degradation counters
voice_degradation_total{type="translation_skipped"} 23
voice_degradation_total{type="rag_limited_to_3"} 156

# Overall latency histogram
voice_e2e_latency_ms_bucket{le="500"} 8234
voice_e2e_latency_ms_bucket{le="700"} 9156
voice_e2e_latency_ms_bucket{le="+Inf"} 9500
```

### Logging

```python
logger.info("Voice processing complete", extra={
    "total_latency_ms": result.total_latency_ms,
    "stage_latencies": result.stage_latencies,
    "degradations": result.degradation_applied,
    "user_language": result.detected_language,
})
```

## Configuration

### Environment Variables

```bash
# Latency budget overrides (milliseconds)
VOICE_LATENCY_BUDGET_TOTAL=700
VOICE_LATENCY_BUDGET_STT=200
VOICE_LATENCY_BUDGET_TRANSLATION=200
VOICE_LATENCY_BUDGET_RAG=300

# Feature flag
VOICE_V4_LATENCY_BUDGETS=true
```

### Runtime Configuration

```python
# Custom budget for high-latency scenarios
high_latency_budget = LatencyBudget(
    total_budget_ms=1000,
    stt_ms=300,
    translation_ms=300,
    rag_ms=400,
)

orchestrator = LatencyAwareVoiceOrchestrator(budget=high_latency_budget)
```

## Testing

### Unit Tests

```python
# Test translation timeout triggers degradation
@pytest.mark.asyncio
async def test_translation_timeout_triggers_degradation():
    orchestrator = LatencyAwareVoiceOrchestrator(
        budget=LatencyBudget(translation_ms=1)  # Very short
    )
    # ... setup mocks ...
    result = await orchestrator.process_with_budgets(
        audio_data=b"fake_audio",
        user_language="es",
    )
    assert DegradationType.TRANSLATION_SKIPPED.value in result.degradation_applied
```

### Integration Tests

```bash
# Run latency budget tests
pytest tests/services/test_voice_v4_services.py::TestLatencyOrchestration -v
```

## Best Practices

1. **Monitor degradation rates**: High degradation rates indicate capacity issues
2. **Tune budgets per environment**: Development can use looser budgets
3. **Cache aggressively**: Translation caching reduces degradation frequency
4. **Use feature flags**: Roll out gradually and monitor impact
5. **Alert on sustained degradation**: Set up alerts for > 10% degradation rate

## Related Documentation

- [Voice Mode v4.1 Overview](./voice-mode-v4-overview.md)
- [Multilingual RAG Architecture](./multilingual-rag-architecture.md)
- [Voice Pipeline Architecture](../VOICE_MODE_PIPELINE.md)
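As a closing illustration of best practice 5, the degradation rate can be approximated from the `voice_degradation_total` counters shown in the Monitoring section. This is a hedged sketch (in a real deployment this ratio would typically be expressed as a PromQL query over rate windows rather than computed in application code):

```python
def degradation_rate(degradation_counts: dict, total_requests: int) -> float:
    """Approximate share of requests that hit any degradation path.

    Assumes at most one counted degradation per request; when requests
    can record multiple degradations, this overestimates the rate.
    """
    if total_requests == 0:
        return 0.0
    return sum(degradation_counts.values()) / total_requests


# Using the sample counter values from the Monitoring section:
rate = degradation_rate(
    {"translation_skipped": 23, "rag_limited_to_3": 156},
    total_requests=9500,
)
# Alert when the sustained rate exceeds the 10% threshold
should_alert = rate > 0.10
```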