# Latency Budgets Guide
Voice Mode v4.1 introduces latency-aware orchestration to maintain responsive voice interactions with a target of sub-700ms end-to-end latency.
## Overview
The latency-aware orchestrator monitors each processing stage and applies graceful degradation when stages exceed their budgets.
```text
                           Voice Pipeline Stages

Audio Capture ─▶ STT ─▶ Lang Detect ─▶ Translation ─▶ RAG ─▶ LLM ─▶ TTS
   [50ms]      [200ms]    [50ms]         [200ms]    [300ms] [300ms] [150ms]

                         Total Budget: 700ms E2E
```
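The enforcement details live in the orchestrator, but the per-stage accounting can be pictured as a small timing helper along these lines (a minimal sketch; `timed_stage` and its arguments are illustrative names, not part of the actual API):

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_stage(name: str, budget_ms: float, stage_latencies: dict, warnings: list):
    """Time one pipeline stage and record a warning if it exceeds its budget."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        stage_latencies[name] = elapsed_ms
        if elapsed_ms > budget_ms:
            warnings.append(f"{name} exceeded budget: {elapsed_ms:.0f}ms > {budget_ms}ms")
```

Each stage in the diagram would run inside such a block so that the result's `stage_latencies` and `warnings` fields are populated consistently.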
## Budget Configuration

### Default Budgets

```python
from app.services.latency_aware_orchestrator import LatencyBudget

default_budget = LatencyBudget(
    audio_capture_ms=50,
    stt_ms=200,
    language_detection_ms=50,
    translation_ms=200,
    rag_ms=300,
    llm_first_token_ms=300,
    tts_first_chunk_ms=150,
    total_budget_ms=700,
)
```
### Stage Details
| Stage | Budget | Description | Degradation |
|---|---|---|---|
| Audio capture | 50ms | Mic activation to first audio chunk | Log warning |
| STT | 200ms | Speech-to-text processing | Use cached partial |
| Language detection | 50ms | Detect query language | Default to user lang |
| Translation | 200ms | Translate non-English queries | Skip translation |
| RAG retrieval | 300ms | Knowledge base search | Limit results |
| LLM first token | 300ms | Time to first LLM token | Shorten context |
| TTS first chunk | 150ms | Time to first audio chunk | Use cached greeting |
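How a budget overrun turns into one of these fallbacks is easiest to see for translation: a minimal sketch, assuming the stage is awaited with a hard timeout (the `translate` callable here is hypothetical):

```python
import asyncio


async def translate_with_budget(query: str, translate, budget_ms: float,
                                degradations: list) -> str:
    """Run translation under its budget; fall back to the original query on timeout."""
    try:
        return await asyncio.wait_for(translate(query), timeout=budget_ms / 1000)
    except asyncio.TimeoutError:
        # "Skip translation" fallback: keep the original query and record the degradation
        degradations.append("translation_skipped")
        return query
```

The same timeout-and-fallback pattern generalizes to the other rows, with the fallback action swapped per stage.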
## Degradation Types

### Degradation Enum

```python
from enum import Enum

# Importable as: from app.services.latency_aware_orchestrator import DegradationType
class DegradationType(str, Enum):
    LANGUAGE_DETECTION_SKIPPED = "language_detection_skipped"
    LANGUAGE_DETECTION_BUDGET_EXCEEDED = "language_detection_budget_exceeded"
    TRANSLATION_SKIPPED = "translation_skipped"
    TRANSLATION_BUDGET_EXCEEDED = "translation_budget_exceeded"
    TRANSLATION_FAILED = "translation_failed"
    RAG_LIMITED_TO_1 = "rag_limited_to_1"
    RAG_LIMITED_TO_3 = "rag_limited_to_3"
    RAG_RETRIEVAL_FAILED = "rag_retrieval_failed"
    LLM_CONTEXT_SHORTENED = "llm_context_shortened"
    TTS_USED_CACHED_GREETING = "tts_used_cached_greeting"
    PARALLEL_STT_REDUCED = "parallel_stt_reduced"
```
### Degradation Actions
| Scenario | Condition | Action |
|---|---|---|
| Language detection slow | > 50ms | Skip, use user's preferred language |
| Translation slow | > 200ms | Skip translation, use original query |
| Translation failed | API error or `result.failed` | Use original query + multilingual LLM |
| RAG under pressure | < 500ms remaining | Return top-1 result only |
| RAG moderately slow | < 700ms remaining | Return top-3 results |
| LLM context too large | Exceeds token limit | Truncate context |
| TTS cold start | First request | Use cached greeting audio |
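The RAG rows depend on how much of the total budget remains when retrieval starts. A rough sketch of that decision, with `choose_rag_top_k` and `default_top_k` as illustrative names:

```python
def choose_rag_top_k(remaining_budget_ms: float, degradations: list,
                     default_top_k: int = 5) -> int:
    """Pick how many RAG results to request, mirroring the thresholds in the table."""
    if remaining_budget_ms < 500:
        degradations.append("rag_limited_to_1")
        return 1
    if remaining_budget_ms < 700:
        degradations.append("rag_limited_to_3")
        return 3
    return default_top_k
```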
## Translation Failure Handling

When translation fails, the orchestrator raises `TranslationFailedError`:

```python
from app.services.latency_aware_orchestrator import TranslationFailedError

try:
    result = await orchestrator.process_with_budgets(audio_data, user_language="es")
except TranslationFailedError as e:
    # Graceful degradation: use original query
    logger.warning(f"Translation failed: {e}, using original query")
```
The orchestrator checks both:

- Exception handling: Wraps translation API exceptions
- Failed result flag: Checks `result.failed` on translation results

This triggers `DegradationType.TRANSLATION_FAILED` in the degradation list, allowing the system to continue processing with the original (non-translated) query while informing the user of reduced accuracy.
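Inside the orchestrator, the two checks could be wired together roughly as follows; `translation_client` and its `.failed`/`.text` attributes are assumed names based on the description above, not the real internals:

```python
from app.services.latency_aware_orchestrator import DegradationType


async def translate_or_fall_back(query: str, target_language: str,
                                 translation_client, degradations: list) -> str:
    """Try to translate the query; fall back to the original text on any failure."""
    try:
        result = await translation_client.translate(query, target_language)
    except Exception:
        # Exception handling path: the translation API raised an error
        degradations.append(DegradationType.TRANSLATION_FAILED.value)
        return query
    if getattr(result, "failed", False):
        # Failed-result-flag path: the call returned but reported failure
        degradations.append(DegradationType.TRANSLATION_FAILED.value)
        return query
    return result.text
```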
## Usage

### Basic Usage

```python
from app.services.latency_aware_orchestrator import (
    LatencyAwareVoiceOrchestrator,
    get_latency_aware_orchestrator,
)

# Get singleton instance
orchestrator = get_latency_aware_orchestrator()

# Process voice request with budget tracking
result = await orchestrator.process_with_budgets(
    audio_data=audio_bytes,
    user_language="es",
)

# Check result
print(f"Transcript: {result.transcript}")
print(f"Response: {result.response}")
print(f"Total latency: {result.total_latency_ms}ms")
print(f"Degradations: {result.degradation_applied}")
print(f"Warnings: {result.warnings}")
```
### Result Structure

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class VoiceProcessingResult:
    transcript: str                    # STT result
    detected_language: str             # Detected query language
    response: str                      # LLM response
    sources: List[Dict]                # RAG sources
    audio_url: Optional[str]           # TTS audio URL
    total_latency_ms: float            # End-to-end latency
    stage_latencies: Dict[str, float]  # Per-stage timing
    degradation_applied: List[str]     # Applied degradations
    warnings: List[str]                # Warning messages
    success: bool                      # Overall success
```
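The per-stage breakdown makes slow requests easy to diagnose; for example, finding the stage that consumed the most time:

```python
result = await orchestrator.process_with_budgets(
    audio_data=audio_bytes,
    user_language="es",
)

if not result.success or result.total_latency_ms > 700:
    # Identify the slowest stage from the per-stage timings
    slowest_stage, slowest_ms = max(result.stage_latencies.items(), key=lambda kv: kv[1])
    print(f"Slowest stage: {slowest_stage} ({slowest_ms:.0f}ms)")
    print(f"Degradations applied: {result.degradation_applied}")
```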
## Frontend Integration

### LatencyIndicator Component

Display real-time latency status:

```tsx
import { LatencyIndicator } from "@/components/voice/LatencyIndicator";

<LatencyIndicator
  latencyMs={523}
  degradations={["translation_skipped", "rag_limited_to_3"]}
  showDetails={true}
  size="sm"
/>;
```
### Status Colors
| Status | Latency | Color |
|---|---|---|
| Good | < 500ms | Green |
| Fair | 500-700ms | Yellow |
| Slow | > 700ms | Red |
### Degradation Tooltips
The component shows user-friendly labels for degradations:
```ts
const DEGRADATION_LABELS = {
  language_detection_skipped: "Language detection skipped",
  translation_skipped: "Translation skipped",
  translation_failed: "Translation failed",
  rag_limited_to_1: "Search limited",
  rag_limited_to_3: "Search limited",
  llm_context_shortened: "Context shortened",
  tts_used_cached_greeting: "Audio cached",
  parallel_stt_reduced: "Speech recognition simplified",
};
```
## Monitoring

### Metrics

The orchestrator emits metrics for monitoring:

```text
# Stage timing metrics
voice_stage_latency_ms{stage="stt"} 145
voice_stage_latency_ms{stage="translation"} 178
voice_stage_latency_ms{stage="rag"} 234

# Degradation counters
voice_degradation_total{type="translation_skipped"} 23
voice_degradation_total{type="rag_limited_to_3"} 156

# Overall latency histogram
voice_e2e_latency_ms_bucket{le="500"} 8234
voice_e2e_latency_ms_bucket{le="700"} 9156
voice_e2e_latency_ms_bucket{le="+Inf"} 9500
```
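The sample output above is Prometheus exposition format. If the service instruments with `prometheus_client`, the metrics could be declared roughly like this (a sketch; only the metric and label names are taken from the samples above):

```python
from prometheus_client import Counter, Gauge, Histogram

# Per-stage latency, labelled by stage name
STAGE_LATENCY = Gauge("voice_stage_latency_ms",
                      "Per-stage voice pipeline latency (ms)", ["stage"])

# Degradation counter, labelled by degradation type
DEGRADATIONS = Counter("voice_degradation_total",
                       "Applied voice degradations", ["type"])

# End-to-end latency histogram with budget-aligned buckets
E2E_LATENCY = Histogram("voice_e2e_latency_ms",
                        "End-to-end voice latency (ms)", buckets=(500, 700))

# Recording one processed request
STAGE_LATENCY.labels(stage="stt").set(145)
DEGRADATIONS.labels(type="translation_skipped").inc()
E2E_LATENCY.observe(523)
```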
### Logging

```python
logger.info(
    "Voice processing complete",
    extra={
        "total_latency_ms": result.total_latency_ms,
        "stage_latencies": result.stage_latencies,
        "degradations": result.degradation_applied,
        "user_language": result.detected_language,
    },
)
```
## Configuration

### Environment Variables

```bash
# Latency budget overrides (milliseconds)
VOICE_LATENCY_BUDGET_TOTAL=700
VOICE_LATENCY_BUDGET_STT=200
VOICE_LATENCY_BUDGET_TRANSLATION=200
VOICE_LATENCY_BUDGET_RAG=300

# Feature flag
VOICE_V4_LATENCY_BUDGETS=true
```
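A minimal sketch of how these overrides might be folded into a `LatencyBudget` at startup (the `budget_from_env` helper is illustrative; only the variable names come from the list above):

```python
import os

from app.services.latency_aware_orchestrator import LatencyBudget


def budget_from_env() -> LatencyBudget:
    """Build a LatencyBudget from environment overrides, keeping the documented defaults."""
    return LatencyBudget(
        total_budget_ms=int(os.getenv("VOICE_LATENCY_BUDGET_TOTAL", 700)),
        stt_ms=int(os.getenv("VOICE_LATENCY_BUDGET_STT", 200)),
        translation_ms=int(os.getenv("VOICE_LATENCY_BUDGET_TRANSLATION", 200)),
        rag_ms=int(os.getenv("VOICE_LATENCY_BUDGET_RAG", 300)),
    )
```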
### Runtime Configuration

```python
# Custom budget for high-latency scenarios
high_latency_budget = LatencyBudget(
    total_budget_ms=1000,
    stt_ms=300,
    translation_ms=300,
    rag_ms=400,
)

orchestrator = LatencyAwareVoiceOrchestrator(budget=high_latency_budget)
```
## Testing

### Unit Tests

```python
import pytest

from app.services.latency_aware_orchestrator import (
    DegradationType,
    LatencyAwareVoiceOrchestrator,
    LatencyBudget,
)


# Test that a translation timeout triggers degradation
@pytest.mark.asyncio
async def test_translation_timeout_triggers_degradation():
    orchestrator = LatencyAwareVoiceOrchestrator(
        budget=LatencyBudget(translation_ms=1)  # Very short
    )
    # ... setup mocks ...
    result = await orchestrator.process_with_budgets(
        audio_data=b"fake_audio",
        user_language="es",
    )
    assert DegradationType.TRANSLATION_SKIPPED.value in result.degradation_applied
```
### Integration Tests

```bash
# Run latency budget tests
pytest tests/services/test_voice_v4_services.py::TestLatencyOrchestration -v
```
## Best Practices
- Monitor degradation rates: High degradation rates indicate capacity issues
- Tune budgets per environment: Development can use looser budgets
- Cache aggressively: Translation caching reduces degradation frequency
- Use feature flags: Roll out gradually and monitor impact
- Alert on sustained degradation: Set up alerts for > 10% degradation rate
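As a rough illustration of the last point, the degradation rate can be tracked over a sliding window of recent requests and checked against the 10% threshold (an in-process sketch; a production setup would more likely alert on the `voice_degradation_total` counter):

```python
import logging
from collections import deque

logger = logging.getLogger(__name__)

# Whether each of the last 1000 requests had any degradation applied
recent_degraded = deque(maxlen=1000)


def record(result) -> None:
    recent_degraded.append(bool(result.degradation_applied))


def degradation_rate() -> float:
    return sum(recent_degraded) / len(recent_degraded) if recent_degraded else 0.0


def maybe_alert() -> None:
    """Log a warning when more than 10% of recent requests were degraded."""
    rate = degradation_rate()
    if rate > 0.10:
        logger.warning("Sustained degradation: %.1f%% of recent requests degraded",
                       rate * 100)
```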