# Talker Service

> **Location:** `services/api-gateway/app/services/talker_service.py`
> **Status:** Production Ready
> **Last Updated:** 2025-12-01

## Overview

The TalkerService handles text-to-speech synthesis for the Thinker-Talker voice pipeline. It streams LLM tokens through a sentence chunker and synthesizes speech via ElevenLabs for gapless audio playback.

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          TalkerService                          │
│                                                                 │
│  LLM Tokens ──► ┌──────────────────┐                            │
│                 │ Markdown Buffer  │  (accumulates for pattern  │
│                 │                  │   detection before strip)  │
│                 └────────┬─────────┘                            │
│                          │                                      │
│                          ▼                                      │
│                 ┌──────────────────┐                            │
│                 │ SentenceChunker  │  (splits at natural        │
│                 │(40-120-200 chars)│   boundaries)              │
│                 └────────┬─────────┘                            │
│                          │                                      │
│                          ▼                                      │
│                 ┌──────────────────┐                            │
│                 │ strip_markdown   │  (removes **bold**,        │
│                 │ _for_tts()       │   [links](url), LaTeX)     │
│                 └────────┬─────────┘                            │
│                          │                                      │
│                          ▼                                      │
│                 ┌──────────────────┐                            │
│                 │ ElevenLabs TTS   │  (streaming synthesis      │
│                 │ (sequential)     │   with previous_text)      │
│                 └────────┬─────────┘                            │
│                          │                                      │
│                          ▼                                      │
│            Audio Chunks ──► on_audio_chunk callback             │
└─────────────────────────────────────────────────────────────────┘
```
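Before diving into the classes, here is a minimal sketch of how these stages compose. It is illustrative only: `tokens`, `chunker.feed()`, `strip`, and `synthesize` are hypothetical stand-ins for the real components described in the sections below, and the markdown-buffering stage is folded out for brevity.

```python
# Illustrative sketch only -- all parameter names are hypothetical stand-ins.
async def run_pipeline(tokens, chunker, strip, synthesize, on_audio_chunk):
    async for token in tokens:                         # 1. receive LLM tokens
        for sentence in chunker.feed(token):           # 2. split at natural boundaries
            speakable = strip(sentence)                # 3. strip markdown for TTS
            async for audio in synthesize(speakable):  # 4. stream synthesized audio
                await on_audio_chunk(audio)            # 5. deliver to playback
```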
## Classes

### TalkerService

Main service class (singleton pattern).

```python
from app.services.talker_service import talker_service

# Check if TTS is available
if talker_service.is_enabled():
    # Start a speaking session (uses DEFAULT_VOICE_ID from voice_constants.py)
    session = await talker_service.start_session(
        on_audio_chunk=handle_audio,
        voice_config=VoiceConfig(
            # voice_id defaults to DEFAULT_VOICE_ID (Brian)
            stability=0.65,
        ),
    )

    # Feed tokens from the LLM stream
    async for token in llm_stream:
        await session.add_token(token)

    # Finish and get metrics
    metrics = await session.finish()
```

#### Methods

| Method                   | Description               | Parameters                       | Returns                |
| ------------------------ | ------------------------- | -------------------------------- | ---------------------- |
| `is_enabled()`           | Check if TTS is available | None                             | `bool`                 |
| `get_provider()`         | Get active TTS provider   | None                             | `TTSProvider`          |
| `start_session()`        | Start a TTS session       | `on_audio_chunk`, `voice_config` | `TalkerSession`        |
| `synthesize_text()`      | Simple text synthesis     | `text`, `voice_config`           | `AsyncIterator[bytes]` |
| `get_available_voices()` | List available voices     | None                             | `List[Dict]`           |

### TalkerSession

Session class for streaming TTS.

```python
class TalkerSession:
    """
    A single TTS speaking session with streaming support.

    Manages the flow:
    1. Receive LLM tokens
    2. Chunk into sentences
    3. Synthesize each sentence
    4. Stream audio chunks to callback
    """
```

#### Methods

| Method          | Description         | Parameters   | Returns         |
| --------------- | ------------------- | ------------ | --------------- |
| `add_token()`   | Add token from LLM  | `token: str` | `None`          |
| `finish()`      | Complete synthesis  | None         | `TalkerMetrics` |
| `cancel()`      | Cancel for barge-in | None         | `None`          |
| `get_metrics()` | Get session metrics | None         | `TalkerMetrics` |

#### Properties

| Property | Type          | Description   |
| -------- | ------------- | ------------- |
| `state`  | `TalkerState` | Current state |

### AudioQueue

Queue management for gapless playback.

```python
class AudioQueue:
    """
    Manages audio chunks for gapless playback with cancellation support.

    Features:
    - Async queue for audio chunks
    - Cancellation clears pending audio
    - Tracks queue state
    """

    async def put(self, chunk: AudioChunk) -> bool: ...
    async def get(self) -> Optional[AudioChunk]: ...
    async def cancel(self) -> None: ...
    def finish(self) -> None: ...
    def reset(self) -> None: ...
```

## Data Classes

### TalkerState

```python
class TalkerState(str, Enum):
    IDLE = "idle"            # Ready for input
    SPEAKING = "speaking"    # Synthesizing/playing
    CANCELLED = "cancelled"  # Interrupted by barge-in
```

### TTSProvider

```python
class TTSProvider(str, Enum):
    ELEVENLABS = "elevenlabs"
    OPENAI = "openai"  # Fallback
```

### VoiceConfig

> **Note:** Default voice is configured in `app/core/voice_constants.py`.
> See [Voice Configuration](/docs/voice/voice-configuration) for details.

```python
from app.core.voice_constants import DEFAULT_VOICE_ID, DEFAULT_TTS_MODEL

@dataclass
class VoiceConfig:
    provider: TTSProvider = TTSProvider.ELEVENLABS
    voice_id: str = DEFAULT_VOICE_ID   # Brian (from voice_constants.py)
    model_id: str = DEFAULT_TTS_MODEL  # eleven_flash_v2_5
    stability: float = 0.65            # 0.0-1.0, higher = more consistent
    similarity_boost: float = 0.80     # 0.0-1.0, higher = clearer
    style: float = 0.15                # 0.0-1.0, lower = more natural
    use_speaker_boost: bool = True
    output_format: str = "pcm_24000"
```

### AudioChunk

```python
@dataclass
class AudioChunk:
    data: bytes          # Raw audio bytes
    format: str          # "pcm16" or "mp3"
    is_final: bool       # True for last chunk
    sentence_index: int  # Which sentence this is from
    latency_ms: int      # Time since synthesis started
```

### TalkerMetrics

```python
@dataclass
class TalkerMetrics:
    sentences_processed: int = 0
    total_chars_synthesized: int = 0
    total_audio_bytes: int = 0
    total_latency_ms: int = 0
    first_audio_latency_ms: int = 0
    cancelled: bool = False
```

## Sentence Chunking

The TalkerSession uses `SentenceChunker` with these settings:

```python
self._chunker = SentenceChunker(
    ChunkerConfig(
        min_chunk_chars=40,       # Avoid tiny fragments
        optimal_chunk_chars=120,  # Full sentences
        max_chunk_chars=200,      # Allow complete thoughts
    )
)
```

### Why These Settings?

| Parameter             | Value | Rationale                              |
| --------------------- | ----- | -------------------------------------- |
| `min_chunk_chars`     | 40    | Prevents choppy TTS from short phrases |
| `optimal_chunk_chars` | 120   | Full sentences sound more natural      |
| `max_chunk_chars`     | 200   | Prevents excessive buffering           |

Trade-off: larger chunks give better prosody but higher latency to first audio. A rough sketch of how these thresholds combine is shown below.
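As an illustration only (the real `SentenceChunker` is more nuanced), the three thresholds could interact roughly like this; `sentence_end` and `clause_end` are hypothetical flags from boundary detection, not actual chunker attributes:

```python
def should_flush(buf: str, sentence_end: bool, clause_end: bool) -> bool:
    """Illustrative approximation of the 40/120/200 thresholds."""
    if sentence_end and len(buf) >= 40:  # min_chunk_chars: sentence long enough
        return True
    if clause_end and len(buf) >= 120:   # optimal_chunk_chars: good clause break
        return True
    return len(buf) >= 200               # max_chunk_chars: force a flush
```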
## Markdown Stripping

LLM responses often contain markdown that sounds unnatural when spoken:

````python
def strip_markdown_for_tts(text: str) -> str:
    """
    Converts:
    - [Link Text](URL)  → "Link Text"
    - **bold**          → "bold"
    - *italic*          → "italic"
    - `code`            → "code"
    - ```blocks```      → (removed)
    - # Headers         → "Headers"
    - LaTeX formulas    → (removed)
    """
````

### Markdown-Aware Token Buffering

The TalkerSession buffers tokens to detect incomplete patterns:

```python
def _process_markdown_token(self, token: str) -> str:
    """
    Accumulates tokens to detect patterns that should be stripped:
    - Markdown links: [text](url) - wait for the closing )
    - LaTeX display: \\[ ... \\] - wait for the closing \\]
    - LaTeX inline: \\( ... \\) - wait for the closing \\)
    - Bold/italic: **text** - wait for the closing **
    """
```

This prevents sending "[Link Te" to TTS before we know it's a markdown link.

## Voice Continuity

For consistent voice across sentences:

```python
async for audio_data in self._elevenlabs.synthesize_stream(
    text=tts_text,
    previous_text=self._previous_text,  # Context for voice continuity
    ...
):
    ...

# Update for next synthesis
self._previous_text = tts_text
```

The `previous_text` parameter helps ElevenLabs maintain consistent prosody.

## Sequential Synthesis

To prevent voice variations between chunks:

```python
# Semaphore ensures one synthesis at a time
self._synthesis_semaphore = asyncio.Semaphore(1)

async with self._synthesis_semaphore:
    async for audio_data in self._elevenlabs.synthesize_stream(...):
        ...
```

Parallel synthesis can cause noticeable voice quality differences between sentences.

## Usage Examples

### Basic Token Streaming

```python
import base64

async def handle_llm_response(llm_stream):
    async def on_audio_chunk(chunk: AudioChunk):
        # Send to client via WebSocket
        await websocket.send_json({
            "type": "audio.output",
            "audio": base64.b64encode(chunk.data).decode(),
            "is_final": chunk.is_final,
        })

    session = await talker_service.start_session(on_audio_chunk=on_audio_chunk)

    async for token in llm_stream:
        await session.add_token(token)

    metrics = await session.finish()
    print(f"Synthesized {metrics.sentences_processed} sentences")
    print(f"First audio in {metrics.first_audio_latency_ms}ms")
```

### Custom Voice Configuration

```python
config = VoiceConfig(
    voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel (female)
    model_id="eleven_flash_v2_5",     # Lower latency
    stability=0.65,                   # Balanced consistency
    similarity_boost=0.90,            # Very clear
    style=0.15,                       # Slightly expressive
)

session = await talker_service.start_session(
    on_audio_chunk=handle_audio,
    voice_config=config,
)
```

### Handling Barge-in

```python
active_session = None

async def start_speaking(llm_stream):
    global active_session
    active_session = await talker_service.start_session(on_audio_chunk=send_audio)

    async for token in llm_stream:
        if active_session.state == TalkerState.CANCELLED:
            break
        await active_session.add_token(token)

    await active_session.finish()

async def handle_barge_in():
    global active_session
    if active_session:
        await active_session.cancel()
        # Cancels pending synthesis and clears audio queue
```

### Simple Text Synthesis

```python
# For non-streaming use cases
async for audio_chunk in talker_service.synthesize_text(
    text="Hello, how can I help you today?",
    voice_config=VoiceConfig(voice_id="TxGEqnHWrfWFTfGW9XjX"),
):
    await send_audio(audio_chunk)
```

## Available Voices

```python
voices = talker_service.get_available_voices()
# Returns:
[
    {"id": "TxGEqnHWrfWFTfGW9XjX", "name": "Josh", "gender": "male", "premium": True},
    {"id": "pNInz6obpgDQGcFmaJgB", "name": "Adam", "gender": "male", "premium": True},
    {"id": "EXAVITQu4vr4xnSDxMaL", "name": "Bella", "gender": "female", "premium": True},
    {"id": "21m00Tcm4TlvDq8ikWAM", "name": "Rachel", "gender": "female", "premium": True},
    # ... more voices
]
```
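To select one of these voices by display name instead of hard-coding an ID, a small helper along these lines works. `voice_config_for` is a hypothetical convenience built on the documented `get_available_voices()` and `VoiceConfig`, not part of the service API:

```python
# Hypothetical helper -- not part of TalkerService's API.
def voice_config_for(name: str) -> VoiceConfig:
    """Build a VoiceConfig for a voice by display name, else use defaults."""
    for voice in talker_service.get_available_voices():
        if voice["name"].lower() == name.lower():
            return VoiceConfig(voice_id=voice["id"])
    return VoiceConfig()  # falls back to DEFAULT_VOICE_ID

session = await talker_service.start_session(
    on_audio_chunk=handle_audio,
    voice_config=voice_config_for("Rachel"),
)
```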
"name": "Bella", "gender": "female", "premium": True}, {"id": "21m00Tcm4TlvDq8ikWAM", "name": "Rachel", "gender": "female", "premium": True}, # ... more voices ] ``` ## Performance Tuning ### Latency Optimization | Setting | Lower Latency | Higher Quality | | --------------------- | ------------------- | ------------------- | | `model_id` | `eleven_flash_v2_5` | `eleven_turbo_v2_5` | | `min_chunk_chars` | 15 | 40 | | `optimal_chunk_chars` | 50 | 120 | | `output_format` | `pcm_24000` | `mp3_44100_192` | ### Quality Optimization | Setting | More Natural | More Consistent | | ------------------ | ------------ | --------------- | | `stability` | 0.50 | 0.85 | | `similarity_boost` | 0.70 | 0.90 | | `style` | 0.20 | 0.05 | ## Error Handling Synthesis errors don't fail the entire session: ```python async def _synthesize_sentence(self, sentence: str) -> None: try: async for audio_data in self._elevenlabs.synthesize_stream(...): if self._state == TalkerState.CANCELLED: return await self._on_audio_chunk(chunk) except Exception as e: logger.error(f"TTS synthesis error: {e}") # Session continues, just skips this sentence ``` ## Related Documentation - [Thinker-Talker Pipeline Overview](../THINKER_TALKER_PIPELINE.md) - [Thinker Service](thinker-service.md) - [Voice Pipeline WebSocket API](../api-reference/voice-pipeline-ws.md) 6:["slug","services/talker-service","c"] 0:["X7oMT3VrOffzp0qvbeOas",[[["",{"children":["docs",{"children":[["slug","services/talker-service","c"],{"children":["__PAGE__?{\"slug\":[\"services\",\"talker-service\"]}",{}]}]}]},"$undefined","$undefined",true],["",{"children":["docs",{"children":[["slug","services/talker-service","c"],{"children":["__PAGE__",{},[["$L1",["$","div",null,{"children":[["$","div",null,{"className":"mb-6 flex items-center justify-between gap-4","children":[["$","div",null,{"children":[["$","p",null,{"className":"text-sm text-gray-500 dark:text-gray-400","children":"Docs / Raw"}],["$","h1",null,{"className":"text-3xl font-bold text-gray-900 dark:text-white","children":"Talker Service"}],["$","p",null,{"className":"text-sm text-gray-600 dark:text-gray-400","children":["Sourced from"," ",["$","code",null,{"className":"font-mono text-xs","children":["docs/","services/talker-service.md"]}]]}]]}],["$","a",null,{"href":"https://github.com/mohammednazmy/VoiceAssist/edit/main/docs/services/talker-service.md","target":"_blank","rel":"noreferrer","className":"inline-flex items-center gap-2 rounded-md border border-gray-200 dark:border-gray-700 px-3 py-1.5 text-sm text-gray-700 dark:text-gray-200 hover:border-primary-500 dark:hover:border-primary-400 hover:text-primary-700 dark:hover:text-primary-300","children":"Edit on GitHub"}]]}],["$","div",null,{"className":"rounded-lg border border-gray-200 dark:border-gray-800 bg-white dark:bg-gray-900 p-6","children":["$","$L2",null,{"content":"$3"}]}],["$","div",null,{"className":"mt-6 flex flex-wrap gap-2 text-sm","children":[["$","$L4",null,{"href":"/reference/all-docs","className":"inline-flex items-center gap-1 rounded-md bg-gray-100 px-3 py-1 text-gray-700 hover:bg-gray-200 dark:bg-gray-800 dark:text-gray-200 dark:hover:bg-gray-700","children":"← All documentation"}],["$","$L4",null,{"href":"/","className":"inline-flex items-center gap-1 rounded-md bg-gray-100 px-3 py-1 text-gray-700 hover:bg-gray-200 dark:bg-gray-800 dark:text-gray-200 
## Error Handling

Synthesis errors don't fail the entire session:

```python
async def _synthesize_sentence(self, sentence: str) -> None:
    try:
        async for audio_data in self._elevenlabs.synthesize_stream(...):
            if self._state == TalkerState.CANCELLED:
                return
            # chunk: AudioChunk built from audio_data (construction elided here)
            await self._on_audio_chunk(chunk)
    except Exception as e:
        logger.error(f"TTS synthesis error: {e}")
        # Session continues, just skips this sentence
```

## Related Documentation

- [Thinker-Talker Pipeline Overview](../THINKER_TALKER_PIPELINE.md)
- [Thinker Service](thinker-service.md)
- [Voice Pipeline WebSocket API](../api-reference/voice-pipeline-ws.md)
Documentation"}],["$","meta","9",{"property":"og:description","content":"Comprehensive documentation for VoiceAssist - Enterprise Medical AI Assistant"}],["$","meta","10",{"property":"og:url","content":"https://assistdocs.asimo.io"}],["$","meta","11",{"property":"og:site_name","content":"VoiceAssist Docs"}],["$","meta","12",{"property":"og:type","content":"website"}],["$","meta","13",{"name":"twitter:card","content":"summary"}],["$","meta","14",{"name":"twitter:title","content":"VoiceAssist Documentation"}],["$","meta","15",{"name":"twitter:description","content":"Comprehensive documentation for VoiceAssist - Enterprise Medical AI Assistant"}],["$","meta","16",{"name":"next-size-adjust"}]] 1:null