Lexicon Service Guide

Voice Mode v4.1 includes a pronunciation lexicon service for accurate text-to-speech of medical terminology across multiple languages.

Overview

The lexicon service provides:

  • Medical pronunciation lexicons for 15 languages
  • Shared drug name pronunciations (100+ medications)
  • G2P fallback using espeak-ng
  • User custom pronunciations
  • Coverage validation tools

Supported Languages

Code  Language     Status        Terms  Domain Terms
en    English      Complete      146+   280 Quranic
es    Spanish      In Progress   30     -
fr    French       In Progress   10     -
de    German       In Progress   10     -
it    Italian      In Progress   10     -
pt    Portuguese   In Progress   10     -
ar    Arabic       Complete      155+   364 Quranic
zh    Chinese      In Progress   25     -
hi    Hindi        In Progress   10     -
ur    Urdu         In Progress   10     -
ja    Japanese     Placeholder   5      -
ko    Korean       Placeholder   5      -
ru    Russian      Placeholder   5      -
pl    Polish       Placeholder   5      -
tr    Turkish      Placeholder   5      -

Quranic Lexicons

For the Quran Voice Tutor application, dedicated Quranic lexicons provide pronunciations for:

  • 114 Surah names (Arabic and transliterated)
  • 50+ Tajweed terms (Idgham, Ikhfa, Qalqalah, Madd, etc.)
  • 200+ Quranic vocabulary (Islamic terms, prophets, concepts)

These are automatically loaded alongside the medical lexicons for Arabic and English.
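For example, a Tajweed term should resolve through the same lookup API as medical terms (an illustrative call; the exact phoneme returned depends on the lexicon contents):

from app.services.lexicon_service import get_lexicon_service

service = get_lexicon_service()

# Tajweed terms resolve through the same get_phoneme API
result = await service.get_phoneme("Qalqalah", "ar")
print(result.phoneme)  # IPA from the Arabic Quranic lexicon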

Basic Usage

Get Pronunciation

from app.services.lexicon_service import get_lexicon_service

service = get_lexicon_service()

# Get pronunciation for a term
result = await service.get_phoneme("diabetes", "en")

print(f"Term: {result.term}")              # diabetes
print(f"Phoneme: {result.phoneme}")        # ˌdaɪəˈbiːtiːz
print(f"Source: {result.source}")          # lexicon
print(f"Confidence: {result.confidence}")  # 1.0

Batch Lookup

terms = ["metformin", "lisinopril", "atorvastatin"]
results = await service.get_phonemes_batch(terms, "en")

for result in results:
    print(f"{result.term}: {result.phoneme}")

Custom Pronunciations

# Add user-defined pronunciation
service.add_user_pronunciation(
    term="Ozempic",
    phoneme="oʊˈzɛmpɪk",
    language="en",
)

# User pronunciations take priority
result = await service.get_phoneme("Ozempic", "en")
assert result.source == "user_custom"

Lookup Order

The service checks sources in this order:

  1. User custom pronunciations (highest priority)
  2. Language-specific lexicon (lexicons/{lang}/medical_phonemes.json)
  3. Shared drug lexicon (lexicons/shared/drug_names.json)
  4. G2P generation (espeak-ng)
  5. English G2P fallback (lowest priority)
Term: "metformin" (Spanish)
         │
         ▼
┌─────────────────────┐
│  User Custom (es)   │ ──▶ Not found
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│  Spanish Lexicon    │ ──▶ Not found
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│  Shared Drug Names  │ ──▶ Found! "mɛtˈfɔrmɪn"
└─────────────────────┘
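Conceptually, the chain can be sketched like this (a minimal illustration; the attribute and method names, and the confidence values, are assumptions rather than the actual LexiconService internals):

async def _lookup(self, term: str, language: str) -> tuple[str, str, float]:
    key = term.lower()

    # 1. User custom pronunciations (highest priority)
    custom = self._user_pronunciations.get(language, {})
    if key in custom:
        return custom[key], "user_custom", 1.0

    # 2. Language-specific lexicon
    lexicon = self._lexicons.get(language, {})
    if key in lexicon:
        return lexicon[key], "lexicon", 1.0

    # 3. Shared drug lexicon
    if key in self._shared_drugs:
        return self._shared_drugs[key], "shared_drug", 1.0

    # 4. G2P for the requested language
    phoneme = await self._g2p.generate(term, language)
    if phoneme:
        return phoneme, "g2p", 0.8

    # 5. English G2P, then the raw term as a last resort
    phoneme = await self._g2p.generate(term, "en")
    return phoneme or term, "g2p_en_fallback", 0.5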

Lexicon File Format

Language-Specific Lexicon

{ "_meta": { "version": "1.0.0", "term_count": 146, "last_updated": "2024-12-04", "language": "en", "alphabet": "ipa", "status": "complete", "categories": ["drug_names", "conditions", "procedures", "anatomy"] }, "diabetes": "ˌdaɪəˈbiːtiːz", "hypertension": "ˌhaɪpərˈtɛnʃən", "metformin": "mɛtˈfɔrmɪn" }

Shared Drug Lexicon

{ "_meta": { "version": "1.0.0", "term_count": 97, "note": "Common drug pronunciations shared across languages" }, "metformin": "mɛtˈfɔrmɪn", "lisinopril": "laɪˈsɪnəprɪl", "atorvastatin": "əˌtɔːvəˈstætɪn" }

Coverage Validation

Validate Single Language

report = await service.validate_lexicon_coverage("en")

print(f"Language: {report.language}")
print(f"Status: {report.status}")  # complete | partial | placeholder
print(f"Term count: {report.term_count}")
print(f"Coverage: {report.coverage_pct}%")
print(f"Missing: {report.missing_categories}")

Validate All Languages

reports = await service.validate_all_lexicons()

for lang, report in reports.items():
    print(f"{lang}: {report.term_count} terms ({report.status})")

CLI Validation

python -c " from app.services.lexicon_service import get_lexicon_service import asyncio async def validate(): service = get_lexicon_service() reports = await service.validate_all_lexicons() for lang, report in reports.items(): status_icon = '✓' if report.status == 'complete' else '○' print(f'{status_icon} {lang}: {report.term_count} terms ({report.status})') asyncio.run(validate()) "

G2P Fallback

espeak-ng Integration

For terms not in lexicons, the service uses espeak-ng:

from app.services.lexicon_service import G2PService

g2p = G2PService()

# Generate pronunciation
phoneme = await g2p.generate("Rybelsus", "en")
print(phoneme)  # ɹɪbɛlsəs

Language Support

Language     Engine     Voice
English      espeak-ng  en-us
Spanish      espeak-ng  es
French       espeak-ng  fr
German       espeak-ng  de
Italian      espeak-ng  it
Portuguese   espeak-ng  pt
Arabic       mishkal    ar
Chinese      pypinyin   zh
Hindi        espeak-ng  hi
Urdu         espeak-ng  ur

Fallback Chain

Term: "unknownterm" (Spanish)
         │
         ▼
┌─────────────────────┐
│  Spanish G2P        │ ──▶ espeak-ng -v es
└─────────────────────┘
         │ (if fails)
         ▼
┌─────────────────────┐
│  English G2P        │ ──▶ espeak-ng -v en-us
└─────────────────────┘
         │ (if fails)
         ▼
┌─────────────────────┐
│  Raw Term           │ ──▶ /unknownterm/
└─────────────────────┘
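The chain above can be approximated with a thin subprocess wrapper around espeak-ng (a hedged sketch; espeak_ipa and generate_with_fallback are illustrative helpers, and the real G2PService may work differently):

import asyncio

async def espeak_ipa(term: str, voice: str) -> str | None:
    # espeak-ng: -v selects the voice, -q suppresses audio, --ipa prints IPA
    proc = await asyncio.create_subprocess_exec(
        "espeak-ng", "-v", voice, "-q", "--ipa", term,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    ipa = stdout.decode().strip()
    return ipa or None

async def generate_with_fallback(term: str, voice: str) -> str:
    for v in (voice, "en-us"):          # requested voice, then English
        try:
            ipa = await espeak_ipa(term, v)
        except FileNotFoundError:       # espeak-ng not installed
            break
        if ipa:
            return ipa
    return f"/{term}/"                  # raw term as last resort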

Data Directory Configuration

Environment Variable

# Set custom data directory
export VOICEASSIST_DATA_DIR=/path/to/data

# Or in .env
VOICEASSIST_DATA_DIR=/opt/voiceassist/data

Auto-Discovery via _resolve_data_dir()

The _resolve_data_dir() function provides flexible path resolution that works across environments:

  1. Environment variable: Check VOICEASSIST_DATA_DIR for an absolute path
  2. Repository root: Walk up from the service file to find data/lexicons/ directory
  3. Current working directory: Check ./data/ relative to cwd
  4. Fallback: Use relative path from service file
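In sketch form, that resolution order might look like this (an illustration of the documented behavior, not the verbatim implementation):

import os
from pathlib import Path

def _resolve_data_dir() -> Path:
    # 1. Environment variable wins if set
    env_dir = os.environ.get("VOICEASSIST_DATA_DIR")
    if env_dir:
        return Path(env_dir)

    # 2. Walk up from the service file looking for data/lexicons/
    here = Path(__file__).resolve()
    for parent in here.parents:
        if (parent / "data" / "lexicons").is_dir():
            return parent / "data"

    # 3. Check ./data relative to the current working directory
    cwd_data = Path.cwd() / "data"
    if cwd_data.is_dir():
        return cwd_data

    # 4. Fall back to a path relative to the service file
    return here.parent / "data"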

This ensures the lexicon service works correctly in:

  • Local development (uses repo-relative path)
  • CI/CD pipelines (set VOICEASSIST_DATA_DIR to test fixture path)
  • Production (set VOICEASSIST_DATA_DIR to deployed data location)

You can inspect the resolved directory directly:

from app.services.lexicon_service import _resolve_data_dir

data_dir = _resolve_data_dir()
print(f"Using data directory: {data_dir}")

For production deployments, explicitly set the environment variable:

export VOICEASSIST_DATA_DIR=/opt/voiceassist/data

TTS Integration

With ElevenLabs

from app.services.elevenlabs_service import ElevenLabsService
from app.services.lexicon_service import get_lexicon_service

lexicon = get_lexicon_service()
tts = ElevenLabsService()

# Get pronunciation for medical terms
text = "Take metformin twice daily for diabetes."
terms_to_pronounce = ["metformin", "diabetes"]

pronunciations = {}
for term in terms_to_pronounce:
    result = await lexicon.get_phoneme(term, "en")
    pronunciations[term] = result.phoneme

# ElevenLabs supports IPA in <phoneme> SSML tags
ssml_text = (
    f'Take <phoneme alphabet="ipa" ph="{pronunciations["metformin"]}">'
    f'metformin</phoneme> twice daily.'
)
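To avoid building SSML by hand for each term, a small helper can substitute every known term in a sentence (apply_pronunciations is a hypothetical helper, not part of the service):

import re

def apply_pronunciations(text: str, pronunciations: dict[str, str]) -> str:
    """Wrap each known term in an IPA <phoneme> tag (whole words only)."""
    for term, ipa in pronunciations.items():
        tag = f'<phoneme alphabet="ipa" ph="{ipa}">{term}</phoneme>'
        text = re.sub(rf"\b{re.escape(term)}\b", tag, text)
    return text

ssml_text = apply_pronunciations(text, pronunciations)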

Adding New Lexicons

1. Create Lexicon File

# Create language directory
mkdir -p data/lexicons/pt

# Create lexicon file
cat > data/lexicons/pt/medical_phonemes.json << 'EOF'
{
  "_meta": {
    "version": "0.1.0",
    "term_count": 10,
    "last_updated": "2024-12-04",
    "language": "pt",
    "alphabet": "ipa",
    "status": "in_progress",
    "categories": ["conditions", "anatomy"]
  },
  "diabetes": "dʒiaˈbetʃis",
  "coração": "koɾaˈsɐ̃w̃"
}
EOF

2. Update LexiconService

Add to LEXICON_PATHS in lexicon_service.py:

LEXICON_PATHS = {
    # ...existing...
    "pt": "lexicons/pt/medical_phonemes.json",
}

3. Validate

pytest tests/services/test_voice_v4_services.py::TestLexiconLoading -v

Contributing to Lexicons

Adding New Terms

  1. Identify the correct lexicon file:

    • Medical terms: data/lexicons/{lang}/medical_phonemes.json
    • Quranic terms: data/lexicons/{lang}/quranic_phonemes.json
  2. Use proper IPA notation:

    • Consult an IPA chart
    • Use consistent stress markers (ˈ for primary, ˌ for secondary)
    • Include vowel length markers (ː) for Arabic/English
  3. Follow the JSON format:

    { "term_in_target_language": "ipa_pronunciation" }
  4. Update metadata (see the sketch after this list):

    • Increment term_count in _meta
    • Update last_updated date
    • Add new categories if needed
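
A minimal sketch of steps 3-4 together, assuming the file layout shown earlier (the term, its IPA, and the path below are examples only):

import json
from datetime import date
from pathlib import Path

path = Path("data/lexicons/en/medical_phonemes.json")
data = json.loads(path.read_text(encoding="utf-8"))

# Step 3: add the new term in "term": "ipa" form
data["semaglutide"] = "ˌsɛməˈɡluːtaɪd"

# Step 4: keep the metadata in sync
data["_meta"]["term_count"] = len(data) - 1  # every key except _meta
data["_meta"]["last_updated"] = date.today().isoformat()

path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")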

Quranic Term Guidelines

  • Surah names: Use both Arabic script and transliterated forms
  • Tajweed terms: Include common spelling variations
  • Arabic IPA: Use proper pharyngeal (ħ, ʕ), emphatic (tˤ, sˤ, dˤ), and uvular (q) consonants
  • English transliteration IPA: Approximate Arabic sounds with closest English equivalents

Validation

Run the lexicon validation to check coverage:

cd services/api-gateway
python -c "
from app.services.lexicon_service import get_lexicon_service
import asyncio

async def validate():
    service = get_lexicon_service()
    reports = await service.validate_all_lexicons()
    for lang, report in reports.items():
        status_icon = '✓' if report.status == 'complete' else '○'
        print(f'{status_icon} {lang}: {report.term_count} terms ({report.status})')

asyncio.run(validate())
"