# Latency Budgets Guide

Voice Mode v4.1 introduces latency-aware orchestration to maintain responsive voice interactions, with a target of sub-700ms end-to-end latency.

## Overview

The latency-aware orchestrator monitors each processing stage and applies graceful degradation when a stage exceeds its budget.
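The core idea can be sketched as a small per-stage timer that records latency and flags any stage that blows its budget. This is a minimal sketch, not the orchestrator's actual API; `StageBudgetTracker` and its methods are hypothetical names for illustration.

```python
import time
from contextlib import contextmanager


class StageBudgetTracker:
    """Records per-stage latency and flags stages that exceed their budget."""

    def __init__(self, budgets_ms: dict):
        self.budgets_ms = budgets_ms
        self.stage_latencies = {}   # stage name -> elapsed ms
        self.over_budget = []       # stages that exceeded their budget

    @contextmanager
    def stage(self, name: str):
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            self.stage_latencies[name] = elapsed_ms
            # Stages without a configured budget are never flagged
            if elapsed_ms > self.budgets_ms.get(name, float("inf")):
                self.over_budget.append(name)


tracker = StageBudgetTracker({"stt": 200, "rag": 300})
with tracker.stage("stt"):
    time.sleep(0.01)  # stand-in for speech-to-text work
```

The real orchestrator would consult `over_budget` after each stage to decide which degradation (skip, limit, truncate) to apply before the next stage runs.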
```
┌─────────────────────────────────────────────────────────────────────┐
│                       Voice Pipeline Stages                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Audio       STT      Lang      Translation    RAG    LLM     TTS   │
│  Capture ──▶     ──▶  Detect ─▶             ──▶    ──▶     ──▶      │
│                                                                     │
│  [50ms]   [200ms]   [50ms]    [200ms]    [300ms]  [300ms]  [150ms]  │
│                                                                     │
│                     Total Budget: 700ms E2E                         │
└─────────────────────────────────────────────────────────────────────┘
```

Note that the per-stage budgets sum to more than the 700ms end-to-end target; this works in part because stages can overlap (streaming) and some stages, such as translation, do not run on every request.

## Budget Configuration

### Default Budgets

```python
from app.services.latency_aware_orchestrator import LatencyBudget

default_budget = LatencyBudget(
    audio_capture_ms=50,
    stt_ms=200,
    language_detection_ms=50,
    translation_ms=200,
    rag_ms=300,
    llm_first_token_ms=300,
    tts_first_chunk_ms=150,
    total_budget_ms=700,
)
```

### Stage Details

| Stage              | Budget | Description                         | Degradation          |
| ------------------ | ------ | ----------------------------------- | -------------------- |
| Audio capture      | 50ms   | Mic activation to first audio chunk | Log warning          |
| STT                | 200ms  | Speech-to-text processing           | Use cached partial   |
| Language detection | 50ms   | Detect query language               | Default to user lang |
| Translation        | 200ms  | Translate non-English queries       | Skip translation     |
| RAG retrieval      | 300ms  | Knowledge base search               | Limit results        |
| LLM first token    | 300ms  | Time to first LLM token             | Shorten context      |
| TTS first chunk    | 150ms  | Time to first audio chunk           | Use cached greeting  |

## Degradation Types

### Degradation Enum

`DegradationType` is importable from `app.services.latency_aware_orchestrator` and is defined as:

```python
from enum import Enum

class DegradationType(str, Enum):
    LANGUAGE_DETECTION_SKIPPED = "language_detection_skipped"
    LANGUAGE_DETECTION_BUDGET_EXCEEDED = "language_detection_budget_exceeded"
    TRANSLATION_SKIPPED = "translation_skipped"
    TRANSLATION_BUDGET_EXCEEDED = "translation_budget_exceeded"
    TRANSLATION_FAILED = "translation_failed"
    RAG_LIMITED_TO_1 = "rag_limited_to_1"
    RAG_LIMITED_TO_3 = "rag_limited_to_3"
    RAG_RETRIEVAL_FAILED = "rag_retrieval_failed"
    LLM_CONTEXT_SHORTENED = "llm_context_shortened"
    TTS_USED_CACHED_GREETING = "tts_used_cached_greeting"
    PARALLEL_STT_REDUCED = "parallel_stt_reduced"
```

### Degradation Actions

| Scenario                | Condition                    | Action                                |
| ----------------------- | ---------------------------- | ------------------------------------- |
| Language detection slow | > 50ms                       | Skip, use user's preferred language   |
| Translation slow        | > 200ms                      | Skip translation, use original query  |
| Translation failed      | API error or `result.failed` | Use original query + multilingual LLM |
| RAG under pressure      | < 500ms remaining            | Return top-1 result only              |
| RAG moderately slow     | < 700ms remaining            | Return top-3 results                  |
| LLM context too large   | Exceeds token limit          | Truncate context                      |
| TTS cold start          | First request                | Use cached greeting audio             |

### Translation Failure Handling

When translation fails, the orchestrator raises `TranslationFailedError`:

```python
from app.services.latency_aware_orchestrator import TranslationFailedError

try:
    result = await orchestrator.process_with_budgets(audio_data, user_language="es")
except TranslationFailedError as e:
    # Graceful degradation: use original query
    logger.warning(f"Translation failed: {e}, using original query")
```

The orchestrator checks both:

- **Exception handling**: Wraps translation API exceptions
- **Failed result flag**: Checks `result.failed` on translation results

This triggers `DegradationType.TRANSLATION_FAILED` in the degradation list, allowing the system to continue processing with the original (non-translated) query while informing the user of reduced accuracy.
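The budget-plus-failure handling described above can be sketched as a single helper. This is an illustrative sketch, not the orchestrator's actual implementation; `translate_with_budget` is a hypothetical name, and the string tags mirror the `DegradationType` values.

```python
import asyncio


async def translate_with_budget(translate, query: str, budget_ms: float,
                                degradations: list) -> str:
    """Attempt translation within its latency budget.

    Falls back to the original query on timeout or failure, recording
    the applied degradation so the caller can surface it to the user.
    """
    try:
        result = await asyncio.wait_for(translate(query), timeout=budget_ms / 1000)
    except asyncio.TimeoutError:
        degradations.append("translation_skipped")
        return query
    except Exception:
        degradations.append("translation_failed")
        return query
    # Honor a failed-result flag if the translation client sets one
    if getattr(result, "failed", False):
        degradations.append("translation_failed")
        return query
    return result
```

Both failure paths leave the pipeline running on the original query, matching the "use original query + multilingual LLM" action in the table above.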
## Usage

### Basic Usage

```python
from app.services.latency_aware_orchestrator import (
    LatencyAwareVoiceOrchestrator,
    get_latency_aware_orchestrator,
)

# Get singleton instance
orchestrator = get_latency_aware_orchestrator()

# Process voice request with budget tracking
result = await orchestrator.process_with_budgets(
    audio_data=audio_bytes,
    user_language="es",
)

# Check result
print(f"Transcript: {result.transcript}")
print(f"Response: {result.response}")
print(f"Total latency: {result.total_latency_ms}ms")
print(f"Degradations: {result.degradation_applied}")
print(f"Warnings: {result.warnings}")
```

### Result Structure

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class VoiceProcessingResult:
    transcript: str                    # STT result
    detected_language: str             # Detected query language
    response: str                      # LLM response
    sources: List[Dict]                # RAG sources
    audio_url: Optional[str]           # TTS audio URL
    total_latency_ms: float            # End-to-end latency
    stage_latencies: Dict[str, float]  # Per-stage timing
    degradation_applied: List[str]     # Applied degradations
    warnings: List[str]                # Warning messages
    success: bool                      # Overall success
```

## Frontend Integration

### LatencyIndicator Component

Display real-time latency status:

```tsx
import { LatencyIndicator } from "@/components/voice/LatencyIndicator";

// Minimal usage; see the component source for available props.
<LatencyIndicator />;
```

### Status Colors

| Status | Latency   | Color  |
| ------ | --------- | ------ |
| Good   | < 500ms   | Green  |
| Fair   | 500-700ms | Yellow |
| Slow   | > 700ms   | Red    |

### Degradation Tooltips

The component shows user-friendly labels for degradations:

```typescript
const DEGRADATION_LABELS = {
  language_detection_skipped: "Language detection skipped",
  translation_skipped: "Translation skipped",
  translation_failed: "Translation failed",
  rag_limited_to_1: "Search limited",
  rag_limited_to_3: "Search limited",
  llm_context_shortened: "Context shortened",
  tts_used_cached_greeting: "Audio cached",
  parallel_stt_reduced: "Speech recognition simplified",
};
```

## Monitoring

### Metrics

The orchestrator emits metrics for monitoring:

```
# Stage timing metrics
voice_stage_latency_ms{stage="stt"} 145
voice_stage_latency_ms{stage="translation"} 178
voice_stage_latency_ms{stage="rag"} 234

# Degradation counters
voice_degradation_total{type="translation_skipped"} 23
voice_degradation_total{type="rag_limited_to_3"} 156

# Overall latency histogram
voice_e2e_latency_ms_bucket{le="500"} 8234
voice_e2e_latency_ms_bucket{le="700"} 9156
voice_e2e_latency_ms_bucket{le="+Inf"} 9500
```

### Logging

```python
logger.info("Voice processing complete", extra={
    "total_latency_ms": result.total_latency_ms,
    "stage_latencies": result.stage_latencies,
    "degradations": result.degradation_applied,
    "user_language": result.detected_language,
})
```

## Configuration

### Environment Variables

```bash
# Latency budget overrides (milliseconds)
VOICE_LATENCY_BUDGET_TOTAL=700
VOICE_LATENCY_BUDGET_STT=200
VOICE_LATENCY_BUDGET_TRANSLATION=200
VOICE_LATENCY_BUDGET_RAG=300

# Feature flag
VOICE_V4_LATENCY_BUDGETS=true
```

### Runtime Configuration

```python
# Custom budget for high-latency scenarios
high_latency_budget = LatencyBudget(
    total_budget_ms=1000,
    stt_ms=300,
    translation_ms=300,
    rag_ms=400,
)

orchestrator = LatencyAwareVoiceOrchestrator(budget=high_latency_budget)
```

## Testing

### Unit Tests

```python
# Test translation timeout triggers degradation
@pytest.mark.asyncio
async def test_translation_timeout_triggers_degradation():
    orchestrator = LatencyAwareVoiceOrchestrator(
        budget=LatencyBudget(translation_ms=1)  # Very short
    )
    # ... setup mocks ...
    result = await orchestrator.process_with_budgets(
        audio_data=b"fake_audio",
        user_language="es",
    )
    assert DegradationType.TRANSLATION_SKIPPED.value in result.degradation_applied
```

### Integration Tests

```bash
# Run latency budget tests
pytest tests/services/test_voice_v4_services.py::TestLatencyOrchestration -v
```

## Best Practices

1. **Monitor degradation rates**: High degradation rates indicate capacity issues
2. **Tune budgets per environment**: Development can use looser budgets
3. **Cache aggressively**: Translation caching reduces degradation frequency
4. **Use feature flags**: Roll out gradually and monitor impact
5. **Alert on sustained degradation**: Set up alerts for > 10% degradation rate

## Related Documentation

- [Voice Mode v4.1 Overview](./voice-mode-v4-overview.md)
- [Multilingual RAG Architecture](./multilingual-rag-architecture.md)
- [Voice Pipeline Architecture](../VOICE_MODE_PIPELINE.md)
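As a closing illustration of best practice 5, the degradation rate can be approximated from the `voice_degradation_total` counters shown in the Monitoring section. This is a hedged sketch (in a real deployment this ratio would typically be expressed as a PromQL query over rate windows rather than computed in application code):

```python
def degradation_rate(degradation_counts: dict, total_requests: int) -> float:
    """Approximate share of requests that hit any degradation path.

    Assumes at most one counted degradation per request; when requests
    can record multiple degradations, this overestimates the rate.
    """
    if total_requests == 0:
        return 0.0
    return sum(degradation_counts.values()) / total_requests


# Using the sample counter values from the Monitoring section:
rate = degradation_rate(
    {"translation_skipped": 23, "rag_limited_to_3": 156},
    total_requests=9500,
)
# Alert when the sustained rate exceeds the 10% threshold
should_alert = rate > 0.10
```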