2:I[7012,["4765","static/chunks/4765-f5afdf8061f456f3.js","9856","static/chunks/9856-3b185291364d9bef.js","6687","static/chunks/app/docs/%5B...slug%5D/page-e07536548216bee4.js"],"MarkdownRenderer"] 4:I[9856,["4765","static/chunks/4765-f5afdf8061f456f3.js","9856","static/chunks/9856-3b185291364d9bef.js","6687","static/chunks/app/docs/%5B...slug%5D/page-e07536548216bee4.js"],""] 5:I[4126,[],""] 7:I[9630,[],""] 8:I[4278,["9856","static/chunks/9856-3b185291364d9bef.js","8172","static/chunks/8172-b3a2d6fe4ae10d40.js","3185","static/chunks/app/layout-2814fa5d15b84fe4.js"],"HeadingProvider"] 9:I[1476,["9856","static/chunks/9856-3b185291364d9bef.js","8172","static/chunks/8172-b3a2d6fe4ae10d40.js","3185","static/chunks/app/layout-2814fa5d15b84fe4.js"],"Header"] a:I[3167,["9856","static/chunks/9856-3b185291364d9bef.js","8172","static/chunks/8172-b3a2d6fe4ae10d40.js","3185","static/chunks/app/layout-2814fa5d15b84fe4.js"],"Sidebar"] b:I[7409,["9856","static/chunks/9856-3b185291364d9bef.js","8172","static/chunks/8172-b3a2d6fe4ae10d40.js","3185","static/chunks/app/layout-2814fa5d15b84fe4.js"],"PageFrame"] 3:T2b61, # Voice Pipeline WebSocket API > **Endpoint:** `wss://{host}/api/voice/pipeline-ws` > **Protocol:** JSON over WebSocket > **Status:** Production Ready > **Last Updated:** 2025-12-02 ## Overview The Voice Pipeline WebSocket provides bidirectional communication for the Thinker-Talker voice mode. It handles audio streaming, transcription, LLM responses, and TTS playback. ## Connection ### Authentication Include JWT token in connection URL or headers: ```javascript const ws = new WebSocket(`wss://assist.asimo.io/api/voice/pipeline-ws?token=${accessToken}`); ``` ### Connection Lifecycle ``` 1. Client connects with auth token │ 2. Server accepts, creates pipeline session │ 3. Server sends: session.ready │ 4. Client sends: session.init (optional config) │ 5. Server acknowledges: session.init.ack │ 6. Voice mode active - bidirectional streaming │ 7. Client or server closes connection ``` ## Message Format All messages are JSON objects with a `type` field: ```json { "type": "message_type", "field1": "value1", "field2": "value2" } ``` ## Client → Server Messages ### session.init Initialize or reconfigure the session. ```json { "type": "session.init", "conversation_id": "conv-123", "voice_settings": { "voice_id": "TxGEqnHWrfWFTfGW9XjX", "language": "en", "barge_in_enabled": true } } ``` | Field | Type | Required | Description | | --------------------------------- | ------- | -------- | --------------------------------------- | | `conversation_id` | string | No | Link to existing chat conversation | | `voice_settings.voice_id` | string | No | ElevenLabs voice ID | | `voice_settings.language` | string | No | STT language code (default: "en") | | `voice_settings.barge_in_enabled` | boolean | No | Allow user interruption (default: true) | ### audio.input Stream audio from microphone. ```json { "type": "audio.input", "audio": "base64_encoded_pcm16_audio" } ``` | Field | Type | Required | Description | | ------- | ------ | -------- | ---------------------------------------- | | `audio` | string | Yes | Base64-encoded PCM16 audio (16kHz, mono) | **Audio Format Requirements:** - Sample rate: 16000 Hz - Channels: 1 (mono) - Bit depth: 16-bit signed PCM - Encoding: Little-endian - Chunk size: ~100ms recommended (1600 samples) ### audio.input.complete Signal end of user speech (manual commit). ```json { "type": "audio.input.complete" } ``` Normally, VAD auto-detects speech end. 
### barge_in

Interrupt the AI response.

```json
{
  "type": "barge_in"
}
```

When received:

- Cancels TTS synthesis
- Clears audio queue
- Resets pipeline to listening state

### message

Send text input (fallback when the mic is unavailable).

```json
{
  "type": "message",
  "content": "What's the weather like?"
}
```

### ping

Keep-alive heartbeat.

```json
{
  "type": "ping"
}
```

Server responds with `pong`.

## Server → Client Messages

### session.ready

Session initialized successfully.

```json
{
  "type": "session.ready",
  "session_id": "sess-abc123",
  "pipeline_mode": "thinker_talker"
}
```

### session.init.ack

Acknowledges the session.init message.

```json
{
  "type": "session.init.ack"
}
```

### transcript.delta

Partial STT transcript (streaming).

```json
{
  "type": "transcript.delta",
  "text": "What is the",
  "is_final": false
}
```

| Field      | Type    | Description             |
| ---------- | ------- | ----------------------- |
| `text`     | string  | Partial transcript text |
| `is_final` | boolean | Always false for delta  |

### transcript.complete

Final STT transcript.

```json
{
  "type": "transcript.complete",
  "text": "What is the weather today?",
  "message_id": "msg-xyz789"
}
```

| Field        | Type   | Description               |
| ------------ | ------ | ------------------------- |
| `text`       | string | Complete transcript       |
| `message_id` | string | Unique message identifier |

### response.delta

Streaming LLM response token.

```json
{
  "type": "response.delta",
  "delta": "The",
  "message_id": "resp-123"
}
```

| Field        | Type   | Description          |
| ------------ | ------ | -------------------- |
| `delta`      | string | Response token/chunk |
| `message_id` | string | Response message ID  |

### response.complete

Complete LLM response.

```json
{
  "type": "response.complete",
  "text": "The weather today is sunny with a high of 72 degrees.",
  "message_id": "resp-123"
}
```

### audio.output

TTS audio chunk.

```json
{
  "type": "audio.output",
  "audio": "base64_encoded_pcm_audio",
  "is_final": false,
  "sentence_index": 0
}
```

| Field            | Type    | Description                            |
| ---------------- | ------- | -------------------------------------- |
| `audio`          | string  | Base64-encoded PCM audio (24kHz, mono) |
| `is_final`       | boolean | True for last chunk                    |
| `sentence_index` | number  | Which sentence this is from            |

**Output Audio Format:**

- Sample rate: 24000 Hz
- Channels: 1 (mono)
- Bit depth: 16-bit signed PCM
- Encoding: Little-endian
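Because `audio.output` carries raw PCM16 rather than a container format, it cannot be fed to `decodeAudioData` directly. One way a client can play the chunks gaplessly (a sketch, not the reference player) is to wrap each one in an `AudioBuffer` and schedule it against the `AudioContext` clock:

```javascript
// Decode one audio.output payload (base64 PCM16, 24 kHz mono) and schedule it
// so that consecutive chunks play back-to-back.
const ctx = new AudioContext();
let playhead = 0; // next start time on the AudioContext clock

function playAudioOutput(msg) {
  const bytes = Uint8Array.from(atob(msg.audio), (c) => c.charCodeAt(0));
  const pcm = new Int16Array(bytes.buffer);

  // AudioBuffer expects Float32 in [-1, 1]; the browser resamples 24 kHz to the context rate.
  const buffer = ctx.createBuffer(1, pcm.length, 24000);
  const channel = buffer.getChannelData(0);
  for (let i = 0; i < pcm.length; i++) channel[i] = pcm[i] / 32768;

  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);

  playhead = Math.max(playhead, ctx.currentTime);
  source.start(playhead);
  playhead += buffer.duration;
}
```

A barge-in handler would additionally stop local playback (for example by suspending the context) alongside sending the `barge_in` message described above.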
### tool.call

Tool invocation started.

```json
{
  "type": "tool.call",
  "id": "call-abc",
  "name": "calendar_list_events",
  "arguments": {
    "start_date": "2025-12-01",
    "end_date": "2025-12-07"
  }
}
```

| Field       | Type   | Description        |
| ----------- | ------ | ------------------ |
| `id`        | string | Tool call ID       |
| `name`      | string | Tool function name |
| `arguments` | object | Tool arguments     |

### tool.result

Tool execution completed.

```json
{
  "type": "tool.result",
  "id": "call-abc",
  "name": "calendar_list_events",
  "result": {
    "events": [{ "title": "Team Meeting", "start": "2025-12-02T10:00:00" }]
  }
}
```

| Field    | Type   | Description           |
| -------- | ------ | --------------------- |
| `id`     | string | Tool call ID          |
| `name`   | string | Tool function name    |
| `result` | any    | Tool execution result |

### voice.state

Pipeline state change.

```json
{
  "type": "voice.state",
  "state": "speaking"
}
```

| State        | Description                 |
| ------------ | --------------------------- |
| `idle`       | Waiting for user input      |
| `listening`  | Receiving audio, STT active |
| `processing` | LLM thinking                |
| `speaking`   | TTS playing                 |
| `cancelled`  | Barge-in occurred           |

### heartbeat

Server heartbeat (every 30s).

```json
{
  "type": "heartbeat"
}
```

### pong

Response to client ping.

```json
{
  "type": "pong"
}
```

### error

Error occurred.

```json
{
  "type": "error",
  "code": "stt_failed",
  "message": "Speech-to-text service unavailable",
  "recoverable": true
}
```

| Field         | Type    | Description              |
| ------------- | ------- | ------------------------ |
| `code`        | string  | Error code               |
| `message`     | string  | Human-readable message   |
| `recoverable` | boolean | True if client can retry |

**Error Codes:**

| Code                | Description            | Recoverable |
| ------------------- | ---------------------- | ----------- |
| `invalid_json`      | Malformed JSON message | Yes         |
| `connection_failed` | Pipeline init failed   | No          |
| `stt_failed`        | STT service error      | Yes         |
| `llm_failed`        | LLM service error      | Yes         |
| `tts_failed`        | TTS service error      | Yes         |
| `auth_failed`       | Authentication error   | No          |
| `rate_limited`      | Too many requests      | Yes         |
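One reasonable client policy for `error` frames (a sketch, not mandated by the API) is to surface recoverable codes and keep the session alive, and to tear down and retry with backoff only on fatal ones. `reconnect` below stands in for whatever function re-opens the WebSocket with a fresh token:

```javascript
// Recoverable errors (stt_failed, tts_failed, rate_limited, ...) leave the
// session usable; auth_failed / connection_failed require a new connection.
let retries = 0;

function handleErrorMessage(ws, msg, reconnect) {
  if (msg.recoverable) {
    console.warn(`Recoverable voice error [${msg.code}]: ${msg.message}`);
    return;
  }
  ws.close();
  const delayMs = Math.min(30_000, 1_000 * 2 ** retries++); // exponential backoff, capped at 30 s
  setTimeout(reconnect, delayMs);
}
```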
## Example: Complete Session

```javascript
// 1. Connect
const ws = new WebSocket(`wss://assist.asimo.io/api/voice/pipeline-ws?token=${token}`);

ws.onopen = () => {
  console.log("Connected");
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);

  switch (msg.type) {
    case "session.ready":
      // 2. Initialize with settings
      ws.send(
        JSON.stringify({
          type: "session.init",
          conversation_id: currentConversationId,
          voice_settings: {
            voice_id: "TxGEqnHWrfWFTfGW9XjX",
            language: "en",
          },
        }),
      );
      break;

    case "session.init.ack":
      // 3. Start sending audio
      startMicrophoneCapture();
      break;

    case "transcript.delta":
      // Show partial transcript
      updatePartialTranscript(msg.text);
      break;

    case "transcript.complete":
      // Show final transcript
      setTranscript(msg.text);
      break;

    case "response.delta":
      // Append LLM response
      appendResponse(msg.delta);
      break;

    case "audio.output":
      // Play TTS audio
      if (msg.audio) {
        const pcm = base64ToArrayBuffer(msg.audio);
        audioPlayer.queueChunk(pcm);
      }
      if (msg.is_final) {
        audioPlayer.finish();
      }
      break;

    case "tool.call":
      // Show tool being called
      showToolCall(msg.name, msg.arguments);
      break;

    case "tool.result":
      // Show tool result
      showToolResult(msg.name, msg.result);
      break;

    case "error":
      console.error(`Error [${msg.code}]: ${msg.message}`);
      if (!msg.recoverable) {
        ws.close();
      }
      break;
  }
};

// Send audio chunks from microphone
function sendAudioChunk(pcmData) {
  ws.send(
    JSON.stringify({
      type: "audio.input",
      audio: arrayBufferToBase64(pcmData),
    }),
  );
}

// Handle barge-in (user speaks while AI is talking)
function handleBargeIn() {
  ws.send(JSON.stringify({ type: "barge_in" }));
  audioPlayer.stop();
}
```

## Configuration Reference

### TTSessionConfig (Backend)

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TTSessionConfig:
    user_id: str
    session_id: str
    conversation_id: Optional[str] = None

    # Voice settings
    voice_id: str = "TxGEqnHWrfWFTfGW9XjX"
    tts_model: str = "eleven_flash_v2_5"
    language: str = "en"

    # STT settings
    stt_sample_rate: int = 16000
    stt_endpointing_ms: int = 800
    stt_utterance_end_ms: int = 1500

    # Barge-in
    barge_in_enabled: bool = True

    # Timeouts
    connection_timeout_sec: float = 10.0
    idle_timeout_sec: float = 300.0
```

## Rate Limiting

| Limit                            | Value                  |
| -------------------------------- | ---------------------- |
| Max concurrent sessions per user | 2                      |
| Max concurrent sessions total    | 100                    |
| Audio chunk rate                 | ~10/second recommended |
| Idle timeout                     | 300 seconds            |
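A session that sits quiet (for example while the user reads a long response) will hit the 300-second idle timeout unless something keeps the connection warm. A minimal keep-alive sketch using the `ping` message defined earlier; the 60-second interval is an arbitrary choice, not a documented requirement, and `ws` is the socket from the example above:

```javascript
// Send a ping well inside the 300 s idle window; stop when the socket closes.
const KEEP_ALIVE_MS = 60_000;

const keepAlive = setInterval(() => {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({ type: "ping" }));
  }
}, KEEP_ALIVE_MS);

ws.addEventListener("close", () => clearInterval(keepAlive));
```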
## Related Documentation

- [Thinker-Talker Pipeline Overview](../THINKER_TALKER_PIPELINE.md)
- [Frontend Voice Hooks](../frontend/thinker-talker-hooks.md)
- [Voice Mode Settings Guide](../VOICE_MODE_SETTINGS_GUIDE.md)