# Talker Service

> **Location:** `services/api-gateway/app/services/talker_service.py`
> **Status:** Production Ready
> **Last Updated:** 2025-12-01

## Overview

The TalkerService handles text-to-speech synthesis for the Thinker-Talker voice pipeline. It streams LLM tokens through a sentence chunker and synthesizes speech via ElevenLabs for gapless audio playback.

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          TalkerService                          │
│                                                                 │
│  LLM Tokens ──► ┌──────────────────┐                            │
│                 │ Markdown Buffer  │  (accumulates for pattern  │
│                 │                  │   detection before strip)  │
│                 └────────┬─────────┘                            │
│                          │                                      │
│                          ▼                                      │
│                 ┌──────────────────┐                            │
│                 │ SentenceChunker  │  (splits at natural        │
│                 │(40-120-200 chars)│   boundaries)              │
│                 └────────┬─────────┘                            │
│                          │                                      │
│                          ▼                                      │
│                 ┌──────────────────┐                            │
│                 │ strip_markdown   │  (removes **bold**,        │
│                 │ _for_tts()       │   [links](url), LaTeX)     │
│                 └────────┬─────────┘                            │
│                          │                                      │
│                          ▼                                      │
│                 ┌──────────────────┐                            │
│                 │ ElevenLabs TTS   │  (streaming synthesis      │
│                 │ (sequential)     │   with previous_text)      │
│                 └────────┬─────────┘                            │
│                          │                                      │
│                          ▼                                      │
│            Audio Chunks ──► on_audio_chunk callback             │
└─────────────────────────────────────────────────────────────────┘
```
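Before diving into the classes, here is a minimal sketch of how these stages compose. It is illustrative only: `tokens`, `chunker.feed()`, `strip`, and `synthesize` are hypothetical stand-ins for the real components described in the sections below, and the markdown-buffering stage is folded out for brevity.

```python
# Illustrative sketch only -- all parameter names are hypothetical stand-ins.
async def run_pipeline(tokens, chunker, strip, synthesize, on_audio_chunk):
    async for token in tokens:                         # 1. receive LLM tokens
        for sentence in chunker.feed(token):           # 2. split at natural boundaries
            speakable = strip(sentence)                # 3. strip markdown for TTS
            async for audio in synthesize(speakable):  # 4. stream synthesized audio
                await on_audio_chunk(audio)            # 5. deliver to playback
```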
## Classes

### TalkerService

Main service class (singleton pattern).

```python
from app.services.talker_service import talker_service

# Check if TTS is available
if talker_service.is_enabled():
    # Start a speaking session (uses DEFAULT_VOICE_ID from voice_constants.py)
    session = await talker_service.start_session(
        on_audio_chunk=handle_audio,
        voice_config=VoiceConfig(
            # voice_id defaults to DEFAULT_VOICE_ID (Brian)
            stability=0.65,
        ),
    )

    # Feed tokens from the LLM stream
    async for token in llm_stream:
        await session.add_token(token)

    # Finish and get metrics
    metrics = await session.finish()
```

#### Methods

| Method                   | Description               | Parameters                       | Returns                |
| ------------------------ | ------------------------- | -------------------------------- | ---------------------- |
| `is_enabled()`           | Check if TTS is available | None                             | `bool`                 |
| `get_provider()`         | Get active TTS provider   | None                             | `TTSProvider`          |
| `start_session()`        | Start a TTS session       | `on_audio_chunk`, `voice_config` | `TalkerSession`        |
| `synthesize_text()`      | Simple text synthesis     | `text`, `voice_config`           | `AsyncIterator[bytes]` |
| `get_available_voices()` | List available voices     | None                             | `List[Dict]`           |

### TalkerSession

Session class for streaming TTS.

```python
class TalkerSession:
    """
    A single TTS speaking session with streaming support.

    Manages the flow:
    1. Receive LLM tokens
    2. Chunk into sentences
    3. Synthesize each sentence
    4. Stream audio chunks to callback
    """
```

#### Methods

| Method          | Description         | Parameters   | Returns         |
| --------------- | ------------------- | ------------ | --------------- |
| `add_token()`   | Add token from LLM  | `token: str` | `None`          |
| `finish()`      | Complete synthesis  | None         | `TalkerMetrics` |
| `cancel()`      | Cancel for barge-in | None         | `None`          |
| `get_metrics()` | Get session metrics | None         | `TalkerMetrics` |

#### Properties

| Property | Type          | Description   |
| -------- | ------------- | ------------- |
| `state`  | `TalkerState` | Current state |

### AudioQueue

Queue management for gapless playback.

```python
class AudioQueue:
    """
    Manages audio chunks for gapless playback with cancellation support.

    Features:
    - Async queue for audio chunks
    - Cancellation clears pending audio
    - Tracks queue state
    """

    async def put(self, chunk: AudioChunk) -> bool: ...
    async def get(self) -> Optional[AudioChunk]: ...
    async def cancel(self) -> None: ...
    def finish(self) -> None: ...
    def reset(self) -> None: ...
```

## Data Classes

### TalkerState

```python
class TalkerState(str, Enum):
    IDLE = "idle"            # Ready for input
    SPEAKING = "speaking"    # Synthesizing/playing
    CANCELLED = "cancelled"  # Interrupted by barge-in
```

### TTSProvider

```python
class TTSProvider(str, Enum):
    ELEVENLABS = "elevenlabs"
    OPENAI = "openai"  # Fallback
```

### VoiceConfig

> **Note:** Default voice is configured in `app/core/voice_constants.py`.
> See [Voice Configuration](/docs/voice/voice-configuration) for details.

```python
from app.core.voice_constants import DEFAULT_VOICE_ID, DEFAULT_TTS_MODEL

@dataclass
class VoiceConfig:
    provider: TTSProvider = TTSProvider.ELEVENLABS
    voice_id: str = DEFAULT_VOICE_ID   # Brian (from voice_constants.py)
    model_id: str = DEFAULT_TTS_MODEL  # eleven_flash_v2_5
    stability: float = 0.65            # 0.0-1.0, higher = more consistent
    similarity_boost: float = 0.80     # 0.0-1.0, higher = clearer
    style: float = 0.15                # 0.0-1.0, lower = more natural
    use_speaker_boost: bool = True
    output_format: str = "pcm_24000"
```

### AudioChunk

```python
@dataclass
class AudioChunk:
    data: bytes          # Raw audio bytes
    format: str          # "pcm16" or "mp3"
    is_final: bool       # True for last chunk
    sentence_index: int  # Which sentence this is from
    latency_ms: int      # Time since synthesis started
```

### TalkerMetrics

```python
@dataclass
class TalkerMetrics:
    sentences_processed: int = 0
    total_chars_synthesized: int = 0
    total_audio_bytes: int = 0
    total_latency_ms: int = 0
    first_audio_latency_ms: int = 0
    cancelled: bool = False
```

## Sentence Chunking

The TalkerSession uses `SentenceChunker` with these settings:

```python
self._chunker = SentenceChunker(
    ChunkerConfig(
        min_chunk_chars=40,       # Avoid tiny fragments
        optimal_chunk_chars=120,  # Full sentences
        max_chunk_chars=200,      # Allow complete thoughts
    )
)
```

### Why These Settings?

| Parameter             | Value | Rationale                              |
| --------------------- | ----- | -------------------------------------- |
| `min_chunk_chars`     | 40    | Prevents choppy TTS from short phrases |
| `optimal_chunk_chars` | 120   | Full sentences sound more natural      |
| `max_chunk_chars`     | 200   | Prevents excessive buffering           |

Trade-off: larger chunks give better prosody but higher latency to first audio. A rough sketch of how these thresholds combine is shown below.
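As an illustration only (the real `SentenceChunker` is more nuanced), the three thresholds could interact roughly like this; `sentence_end` and `clause_end` are hypothetical flags from boundary detection, not actual chunker attributes:

```python
def should_flush(buf: str, sentence_end: bool, clause_end: bool) -> bool:
    """Illustrative approximation of the 40/120/200 thresholds."""
    if sentence_end and len(buf) >= 40:  # min_chunk_chars: sentence long enough
        return True
    if clause_end and len(buf) >= 120:   # optimal_chunk_chars: good clause break
        return True
    return len(buf) >= 200               # max_chunk_chars: force a flush
```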
## Markdown Stripping

LLM responses often contain markdown that sounds unnatural when spoken:

````python
def strip_markdown_for_tts(text: str) -> str:
    """
    Converts:
    - [Link Text](URL)  → "Link Text"
    - **bold**          → "bold"
    - *italic*          → "italic"
    - `code`            → "code"
    - ```blocks```      → (removed)
    - # Headers         → "Headers"
    - LaTeX formulas    → (removed)
    """
````

### Markdown-Aware Token Buffering

The TalkerSession buffers tokens to detect incomplete patterns:

```python
def _process_markdown_token(self, token: str) -> str:
    """
    Accumulates tokens to detect patterns that should be stripped:
    - Markdown links: [text](url) - wait for the closing )
    - LaTeX display: \\[ ... \\] - wait for the closing \\]
    - LaTeX inline: \\( ... \\) - wait for the closing \\)
    - Bold/italic: **text** - wait for the closing **
    """
```

This prevents sending "[Link Te" to TTS before we know it's a markdown link.

## Voice Continuity

For consistent voice across sentences:

```python
async for audio_data in self._elevenlabs.synthesize_stream(
    text=tts_text,
    previous_text=self._previous_text,  # Context for voice continuity
    ...
):
    ...

# Update for next synthesis
self._previous_text = tts_text
```

The `previous_text` parameter helps ElevenLabs maintain consistent prosody.

## Sequential Synthesis

To prevent voice variations between chunks:

```python
# Semaphore ensures one synthesis at a time
self._synthesis_semaphore = asyncio.Semaphore(1)

async with self._synthesis_semaphore:
    async for audio_data in self._elevenlabs.synthesize_stream(...):
        ...
```

Parallel synthesis can cause noticeable voice quality differences between sentences.

## Usage Examples

### Basic Token Streaming

```python
import base64

async def handle_llm_response(llm_stream):
    async def on_audio_chunk(chunk: AudioChunk):
        # Send to client via WebSocket
        await websocket.send_json({
            "type": "audio.output",
            "audio": base64.b64encode(chunk.data).decode(),
            "is_final": chunk.is_final,
        })

    session = await talker_service.start_session(on_audio_chunk=on_audio_chunk)

    async for token in llm_stream:
        await session.add_token(token)

    metrics = await session.finish()
    print(f"Synthesized {metrics.sentences_processed} sentences")
    print(f"First audio in {metrics.first_audio_latency_ms}ms")
```

### Custom Voice Configuration

```python
config = VoiceConfig(
    voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel (female)
    model_id="eleven_flash_v2_5",     # Lower latency
    stability=0.65,                   # Balanced consistency
    similarity_boost=0.90,            # Very clear
    style=0.15,                       # Slightly expressive
)

session = await talker_service.start_session(
    on_audio_chunk=handle_audio,
    voice_config=config,
)
```

### Handling Barge-in

```python
active_session = None

async def start_speaking(llm_stream):
    global active_session
    active_session = await talker_service.start_session(on_audio_chunk=send_audio)

    async for token in llm_stream:
        if active_session.state == TalkerState.CANCELLED:
            break
        await active_session.add_token(token)

    await active_session.finish()

async def handle_barge_in():
    global active_session
    if active_session:
        await active_session.cancel()
        # Cancels pending synthesis and clears audio queue
```

### Simple Text Synthesis

```python
# For non-streaming use cases
async for audio_chunk in talker_service.synthesize_text(
    text="Hello, how can I help you today?",
    voice_config=VoiceConfig(voice_id="TxGEqnHWrfWFTfGW9XjX"),
):
    await send_audio(audio_chunk)
```

## Available Voices

```python
voices = talker_service.get_available_voices()
# Returns:
[
    {"id": "TxGEqnHWrfWFTfGW9XjX", "name": "Josh", "gender": "male", "premium": True},
    {"id": "pNInz6obpgDQGcFmaJgB", "name": "Adam", "gender": "male", "premium": True},
    {"id": "EXAVITQu4vr4xnSDxMaL", "name": "Bella", "gender": "female", "premium": True},
    {"id": "21m00Tcm4TlvDq8ikWAM", "name": "Rachel", "gender": "female", "premium": True},
    # ... more voices
]
```
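To select one of these voices by display name instead of hard-coding an ID, a small helper along these lines works. `voice_config_for` is a hypothetical convenience built on the documented `get_available_voices()` and `VoiceConfig`, not part of the service API:

```python
# Hypothetical helper -- not part of TalkerService's API.
def voice_config_for(name: str) -> VoiceConfig:
    """Build a VoiceConfig for a voice by display name, else use defaults."""
    for voice in talker_service.get_available_voices():
        if voice["name"].lower() == name.lower():
            return VoiceConfig(voice_id=voice["id"])
    return VoiceConfig()  # falls back to DEFAULT_VOICE_ID

session = await talker_service.start_session(
    on_audio_chunk=handle_audio,
    voice_config=voice_config_for("Rachel"),
)
```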
"name": "Bella", "gender": "female", "premium": True}, {"id": "21m00Tcm4TlvDq8ikWAM", "name": "Rachel", "gender": "female", "premium": True}, # ... more voices ] ``` ## Performance Tuning ### Latency Optimization | Setting | Lower Latency | Higher Quality | | --------------------- | ------------------- | ------------------- | | `model_id` | `eleven_flash_v2_5` | `eleven_turbo_v2_5` | | `min_chunk_chars` | 15 | 40 | | `optimal_chunk_chars` | 50 | 120 | | `output_format` | `pcm_24000` | `mp3_44100_192` | ### Quality Optimization | Setting | More Natural | More Consistent | | ------------------ | ------------ | --------------- | | `stability` | 0.50 | 0.85 | | `similarity_boost` | 0.70 | 0.90 | | `style` | 0.20 | 0.05 | ## Error Handling Synthesis errors don't fail the entire session: ```python async def _synthesize_sentence(self, sentence: str) -> None: try: async for audio_data in self._elevenlabs.synthesize_stream(...): if self._state == TalkerState.CANCELLED: return await self._on_audio_chunk(chunk) except Exception as e: logger.error(f"TTS synthesis error: {e}") # Session continues, just skips this sentence ``` ## Related Documentation - [Thinker-Talker Pipeline Overview](../THINKER_TALKER_PIPELINE.md) - [Thinker Service](thinker-service.md) - [Voice Pipeline WebSocket API](../api-reference/voice-pipeline-ws.md) 6:["slug","services/talker-service","c"] 0:["X7oMT3VrOffzp0qvbeOas",[[["",{"children":["docs",{"children":[["slug","services/talker-service","c"],{"children":["__PAGE__?{\"slug\":[\"services\",\"talker-service\"]}",{}]}]}]},"$undefined","$undefined",true],["",{"children":["docs",{"children":[["slug","services/talker-service","c"],{"children":["__PAGE__",{},[["$L1",["$","div",null,{"children":[["$","div",null,{"className":"mb-6 flex items-center justify-between gap-4","children":[["$","div",null,{"children":[["$","p",null,{"className":"text-sm text-gray-500 dark:text-gray-400","children":"Docs / Raw"}],["$","h1",null,{"className":"text-3xl font-bold text-gray-900 dark:text-white","children":"Talker Service"}],["$","p",null,{"className":"text-sm text-gray-600 dark:text-gray-400","children":["Sourced from"," ",["$","code",null,{"className":"font-mono text-xs","children":["docs/","services/talker-service.md"]}]]}]]}],["$","a",null,{"href":"https://github.com/mohammednazmy/VoiceAssist/edit/main/docs/services/talker-service.md","target":"_blank","rel":"noreferrer","className":"inline-flex items-center gap-2 rounded-md border border-gray-200 dark:border-gray-700 px-3 py-1.5 text-sm text-gray-700 dark:text-gray-200 hover:border-primary-500 dark:hover:border-primary-400 hover:text-primary-700 dark:hover:text-primary-300","children":"Edit on GitHub"}]]}],["$","div",null,{"className":"rounded-lg border border-gray-200 dark:border-gray-800 bg-white dark:bg-gray-900 p-6","children":["$","$L2",null,{"content":"$3"}]}],["$","div",null,{"className":"mt-6 flex flex-wrap gap-2 text-sm","children":[["$","$L4",null,{"href":"/reference/all-docs","className":"inline-flex items-center gap-1 rounded-md bg-gray-100 px-3 py-1 text-gray-700 hover:bg-gray-200 dark:bg-gray-800 dark:text-gray-200 dark:hover:bg-gray-700","children":"← All documentation"}],["$","$L4",null,{"href":"/","className":"inline-flex items-center gap-1 rounded-md bg-gray-100 px-3 py-1 text-gray-700 hover:bg-gray-200 dark:bg-gray-800 dark:text-gray-200 
## Error Handling

Synthesis errors don't fail the entire session:

```python
async def _synthesize_sentence(self, sentence: str) -> None:
    try:
        async for audio_data in self._elevenlabs.synthesize_stream(...):
            if self._state == TalkerState.CANCELLED:
                return
            # chunk: AudioChunk built from audio_data (construction elided here)
            await self._on_audio_chunk(chunk)
    except Exception as e:
        logger.error(f"TTS synthesis error: {e}")
        # Session continues, just skips this sentence
```

## Related Documentation

- [Thinker-Talker Pipeline Overview](../THINKER_TALKER_PIPELINE.md)
- [Thinker Service](thinker-service.md)
- [Voice Pipeline WebSocket API](../api-reference/voice-pipeline-ws.md)
Documentation"}],["$","meta","9",{"property":"og:description","content":"Comprehensive documentation for VoiceAssist - Enterprise Medical AI Assistant"}],["$","meta","10",{"property":"og:url","content":"https://assistdocs.asimo.io"}],["$","meta","11",{"property":"og:site_name","content":"VoiceAssist Docs"}],["$","meta","12",{"property":"og:type","content":"website"}],["$","meta","13",{"name":"twitter:card","content":"summary"}],["$","meta","14",{"name":"twitter:title","content":"VoiceAssist Documentation"}],["$","meta","15",{"name":"twitter:description","content":"Comprehensive documentation for VoiceAssist - Enterprise Medical AI Assistant"}],["$","meta","16",{"name":"next-size-adjust"}]] 1:null