Multilingual RAG Architecture

The multilingual RAG service enables voice interactions in multiple languages by implementing a translate-then-retrieve pattern with graceful degradation.

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                     Multilingual RAG Pipeline                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────┐   ┌─────────────┐   ┌───────────────┐             │
│  │  User    │──▶│  Language   │──▶│  Translation  │             │
│  │  Query   │   │  Detection  │   │  (if needed)  │             │
│  └──────────┘   └─────────────┘   └───────────────┘             │
│                        │                   │                     │
│                        ▼                   ▼                     │
│              ┌─────────────────────────────────┐                │
│              │    English Query for RAG        │                │
│              └─────────────────────────────────┘                │
│                              │                                   │
│                              ▼                                   │
│              ┌─────────────────────────────────┐                │
│              │      RAG Knowledge Base         │                │
│              │   (English embeddings only)     │                │
│              └─────────────────────────────────┘                │
│                              │                                   │
│                              ▼                                   │
│              ┌─────────────────────────────────┐                │
│              │      LLM Response Generation    │                │
│              │   (with language instruction)   │                │
│              └─────────────────────────────────┘                │
│                              │                                   │
│                              ▼                                   │
│              ┌─────────────────────────────────┐                │
│              │     Response in User Language   │                │
│              └─────────────────────────────────┘                │
└─────────────────────────────────────────────────────────────────┘
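
Put in code form, the end-to-end flow looks roughly like the sketch below. This is a simplified illustration of the translate-then-retrieve pattern, not the production implementation; the function, its parameters, and the `detector`/`translator`/`rag`/`llm` objects are assumed stand-ins for the real services described in the rest of this document.

```python
async def answer_multilingual(detector, translator, rag, llm, query: str, user_language: str) -> str:
    """Illustrative translate-then-retrieve flow (names are stand-ins, not the real API)."""
    detection = await detector.detect(query)
    language = detection.primary_language or user_language

    # Translate to English so the query matches the English-only embeddings
    rag_query = query
    if language != "en":
        translation = await translator.translate_with_fallback(
            text=query, source=language, target="en"
        )
        if not translation.failed:
            rag_query = translation.text  # otherwise degrade to the original query

    context = await rag.retrieve(rag_query)

    # The LLM is asked to answer in the user's language
    # (see "LLM Prompting for Multilingual Response" below)
    return await llm.generate(query=query, context=context, respond_in=language)
```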

Translation Service

Multi-Provider Fallback

```python
from app.services.translation_service import TranslationService

# Initialize with providers
service = TranslationService(
    primary_provider="google",
    fallback_provider="deepl",
)

# Translate with automatic fallback
result = await service.translate_with_fallback(
    text="¿Cuáles son los síntomas de la diabetes?",
    source="es",
    target="en",
)

if result.failed:
    # Graceful degradation - use original query
    print(f"Translation failed: {result.error_message}")
else:
    print(f"Translated: {result.text}")
    if result.used_fallback:
        print("Used fallback provider")
```

Caching Strategy

Translations are cached in Redis with a 7-day TTL:

```python
# Cache key format
cache_key = f"trans:{source}:{target}:{hash(text)}"

# TTL
TTL_DAYS = 7

# Cache hit rate typically >80% for common queries
```
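
A minimal sketch of how a cache lookup might wrap the translation call, assuming a `redis.asyncio` client and the `TranslationService` from the previous example. The `cached_translate` helper is illustrative rather than the service's actual code, and it swaps the built-in `hash()` for `sha256` so keys stay stable across processes.

```python
import hashlib

import redis.asyncio as redis

from app.services.translation_service import TranslationService

TTL_SECONDS = 7 * 24 * 60 * 60  # 7-day TTL

async def cached_translate(
    cache: redis.Redis,
    service: TranslationService,
    text: str,
    source: str,
    target: str,
) -> str:
    # sha256 keeps keys stable across processes (the built-in hash() is salted per process)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    cache_key = f"trans:{source}:{target}:{digest}"

    cached = await cache.get(cache_key)
    if cached is not None:
        return cached.decode("utf-8")

    result = await service.translate_with_fallback(text=text, source=source, target=target)
    if result.failed:
        return text  # graceful degradation: keep the original text, and don't cache it

    await cache.set(cache_key, result.text, ex=TTL_SECONDS)
    return result.text
```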

Supported Languages

| Code | Language   | Status      |
|------|------------|-------------|
| `en` | English    | Native      |
| `es` | Spanish    | Full        |
| `fr` | French     | Full        |
| `de` | German     | Full        |
| `it` | Italian    | Full        |
| `pt` | Portuguese | Full        |
| `ar` | Arabic     | Full        |
| `zh` | Chinese    | Full        |
| `hi` | Hindi      | Full        |
| `ur` | Urdu       | Full        |
| `ja` | Japanese   | Placeholder |
| `ko` | Korean     | Placeholder |
| `ru` | Russian    | Placeholder |
| `pl` | Polish     | Placeholder |
| `tr` | Turkish    | Placeholder |

Language Detection

Code-Switching Detection

The language detection service identifies when users mix languages:

```python
from app.services.multilingual_rag_service import LanguageDetectionService

detector = LanguageDetectionService()

# Detect primary language
result = await detector.detect("Tell me about مرض السكري please")
# result.primary_language = "en"
# result.secondary_languages = ["ar"]
# result.is_code_switched = True
```

Detection Algorithm

  1. Fast detection: Use langdetect for an initial guess (see the sketch after this list)
  2. Confidence check: Verify confidence > 0.7
  3. Code-switching scan: Check for embedded phrases in other languages
  4. Fallback: Default to the user's preferred language
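
Steps 1, 2, and 4 might combine along these lines. The sketch uses the real `langdetect` API (`detect_langs` returns candidates with `lang` and `prob` attributes), but the `detect_primary_language` helper and its threshold handling are illustrative, not the service's actual code; the code-switching scan (step 3) is omitted.

```python
from langdetect import detect_langs

CONFIDENCE_THRESHOLD = 0.7

def detect_primary_language(text: str, preferred_language: str = "en") -> str:
    """Fast detection with a confidence check, falling back to the user's preference."""
    try:
        candidates = detect_langs(text)  # e.g. [es:0.999997], ordered by probability
    except Exception:
        return preferred_language  # detection failed outright: use the user's preference

    best = candidates[0]
    if best.prob >= CONFIDENCE_THRESHOLD:
        return best.lang
    return preferred_language  # low confidence: fall back to the user's preference
```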

RAG Integration

Query Flow

```python
from app.services.multilingual_rag_service import MultilingualRAGService

service = MultilingualRAGService()

response = await service.query_multilingual(
    query="¿Qué medicamentos se usan para la diabetes?",
    user_language="es",
)

# Response structure
{
    "answer": "Los medicamentos más comunes para...",
    "language": "es",
    "sources": [...],
    "original_query": "¿Qué medicamentos se usan para...",
    "translated_query": "What medications are used for...",
    "translation_warning": None,  # or "Translation used fallback"
    "latency_ms": 523.4,
    "degradation_applied": []
}
```

LLM Prompting for Multilingual Response

The LLM is instructed to respond in the user's language:

```python
system_prompt = f"""You are a helpful medical assistant.
Respond to the user's question using the provided context.

IMPORTANT: Respond entirely in {language_name}. Do not mix languages unless
the user's query contains specific terms that should remain in their original
language (e.g., medication names).

Be accurate, helpful, and cite your sources when providing information."""
```
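
Here `language_name` is the human-readable name of the target language (for example, "Spanish" when the detected code is `es`); a simple code-to-name lookup is one way to supply it, though the exact mechanism is an implementation detail of the service.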

Graceful Degradation

When translation or retrieval fails, or translation runs over budget, the system degrades gracefully instead of returning an error:

Degradation Levels

| Scenario | Action | Degradation Type |
|----------|--------|------------------|
| Primary translation provider fails | Use fallback provider | `translation_used_fallback` |
| All translation providers fail | Use original query + LLM | `translation_failed` |
| Translation too slow | Skip translation | `translation_budget_exceeded` |
| RAG retrieval fails | Return empty results | `rag_retrieval_failed` |
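
Put together, the translation part of that degradation path might look roughly like this. The `translate_for_rag` helper, the `degradations` list handling, and the use of `asyncio.wait_for` for the 200 ms budget (see the latency table below) are a sketch, not the actual service code.

```python
import asyncio

TRANSLATION_BUDGET_S = 0.2  # 200 ms translation budget (see the latency table below)

async def translate_for_rag(service, query: str, source: str, degradations: list[str]) -> str:
    """Translate the query to English, recording whichever degradation was applied."""
    try:
        result = await asyncio.wait_for(
            service.translate_with_fallback(text=query, source=source, target="en"),
            timeout=TRANSLATION_BUDGET_S,
        )
    except asyncio.TimeoutError:
        degradations.append("translation_budget_exceeded")
        return query  # too slow: skip translation, keep the original query
    except Exception:
        degradations.append("translation_failed")
        return query  # all translation failed: send the original query to the LLM

    if result.failed:
        degradations.append("translation_failed")
        return query
    if result.used_fallback:
        degradations.append("translation_used_fallback")
    return result.text
```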

Error Messages by Language

```python
FALLBACK_MESSAGES = {
    "en": "I apologize, but I'm unable to process your request. Please try again.",
    "es": "Lo siento, no puedo procesar su solicitud. Por favor, inténtelo de nuevo.",
    "fr": "Je m'excuse, je ne peux pas traiter votre demande. Veuillez réessayer.",
    "de": "Es tut mir leid, ich kann Ihre Anfrage nicht bearbeiten. Bitte versuchen Sie es erneut.",
    "ar": "عذراً، لا أستطيع معالجة طلبك. يرجى المحاولة مرة أخرى.",
    "zh": "抱歉,我目前无法处理您的请求。请重试。",
    # ... more languages
}
```
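
When every degradation level has been exhausted, the reply typically reduces to a lookup with an English default, e.g. `FALLBACK_MESSAGES.get(user_language, FALLBACK_MESSAGES["en"])` (illustrative usage, not necessarily the exact call in the service).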

Performance Considerations

Latency Impact

| Stage | Typical Latency | Budget |
|-------|-----------------|--------|
| Language detection | 10-30 ms | 50 ms |
| Translation | 100-180 ms | 200 ms |
| RAG retrieval | 150-250 ms | 300 ms |
| Total impact | ~300 ms | 550 ms |

Optimization Strategies

  1. Translation caching: 7-day Redis cache
  2. Async detection: Run language detection in parallel with audio processing (sketched after this list)
  3. Skip translation for English: Detect English early and bypass translation
  4. Budget-aware skipping: Skip translation when the latency budget is tight
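
A minimal sketch of strategy 2, assuming the rest of the turn has a `process_audio` step that can run concurrently with detection (both the coroutine and the surrounding wiring are illustrative, not the actual pipeline code):

```python
import asyncio

async def process_audio(audio_chunk: bytes) -> bytes:
    # Placeholder for whatever audio work happens alongside detection
    await asyncio.sleep(0)
    return audio_chunk

async def handle_turn(detector, transcript: str, audio_chunk: bytes):
    # Strategy 2: language detection overlaps audio work that has to happen anyway,
    # so its 10-30 ms typically adds nothing to the critical path
    detection, processed = await asyncio.gather(
        detector.detect(transcript),
        process_audio(audio_chunk),
    )
    return detection, processed
```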

Configuration

Environment Variables

```bash
# Primary translation provider
TRANSLATION_PROVIDER=google

# API keys (store in secrets manager)
GOOGLE_TRANSLATE_API_KEY=xxx
DEEPL_API_KEY=xxx

# Cache settings
TRANSLATION_CACHE_TTL_DAYS=7
TRANSLATION_CACHE_PREFIX=trans

# Feature flag
VOICE_V4_MULTILINGUAL_RAG=true
```

Feature Flag

```python
from app.core.feature_flags import is_enabled

if is_enabled("voice_v4_multilingual_rag", user_id=user.id):
    service = MultilingualRAGService()
    response = await service.query_multilingual(query, user_language)
else:
    # Fall back to English-only RAG
    response = await rag_service.query(query)
```

Testing

```bash
# Test translation fallback
pytest tests/services/test_voice_v4_services.py::TestTranslationFailureHandling -v

# Test multilingual RAG
pytest tests/services/test_voice_v4_services.py::TestMultilingualRAG -v
```