Voice Pipeline WebSocket API

Endpoint: wss://{host}/api/voice/pipeline-ws
Protocol: JSON over WebSocket
Status: Production Ready
Last Updated: 2025-12-02

Overview

The Voice Pipeline WebSocket provides bidirectional communication for the Thinker-Talker voice mode. It handles audio streaming, transcription, LLM responses, and TTS playback.

Connection

Authentication

Include a JWT token in the connection URL or in the request headers:

const ws = new WebSocket(`wss://assist.asimo.io/api/voice/pipeline-ws?token=${accessToken}`);

Connection Lifecycle

```
1. Client connects with auth token
   │
2. Server accepts, creates pipeline session
   │
3. Server sends: session.ready
   │
4. Client sends: session.init (optional config)
   │
5. Server acknowledges: session.init.ack
   │
6. Voice mode active - bidirectional streaming
   │
7. Client or server closes connection
```
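
A minimal sketch of steps 1 through 5 (the complete example near the end of this page covers the rest of the flow):

```javascript
const ws = new WebSocket(`wss://assist.asimo.io/api/voice/pipeline-ws?token=${accessToken}`);

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "session.ready") {
    // Step 4: send optional configuration once the session exists.
    ws.send(JSON.stringify({ type: "session.init", voice_settings: { language: "en" } }));
  } else if (msg.type === "session.init.ack") {
    // Step 6: voice mode is active; begin streaming audio.input messages.
  }
};
```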

Message Format

All messages are JSON objects with a type field:

{ "type": "message_type", "field1": "value1", "field2": "value2" }

Client → Server Messages

session.init

Initialize or reconfigure the session.

{ "type": "session.init", "conversation_id": "conv-123", "voice_settings": { "voice_id": "TxGEqnHWrfWFTfGW9XjX", "language": "en", "barge_in_enabled": true } }

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| conversation_id | string | No | Link to existing chat conversation |
| voice_settings.voice_id | string | No | ElevenLabs voice ID |
| voice_settings.language | string | No | STT language code (default: "en") |
| voice_settings.barge_in_enabled | boolean | No | Allow user interruption (default: true) |

audio.input

Stream audio from the microphone.

{ "type": "audio.input", "audio": "base64_encoded_pcm16_audio" }

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| audio | string | Yes | Base64-encoded PCM16 audio (16 kHz, mono) |

Audio Format Requirements:

  • Sample rate: 16000 Hz
  • Channels: 1 (mono)
  • Bit depth: 16-bit signed PCM
  • Encoding: Little-endian
  • Chunk size: ~100ms recommended (1600 samples)
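
One hedged sketch of producing conforming chunks. It assumes the capture path already yields 16 kHz mono Float32 samples (for example, from an AudioContext created with sampleRate: 16000); floatTo16BitPcmBase64 is a name invented here, not part of this API:

```javascript
// Convert a Float32Array (16 kHz, mono) into base64-encoded
// little-endian PCM16, matching the audio.input requirements.
function floatTo16BitPcmBase64(float32) {
  const buf = new ArrayBuffer(float32.length * 2);
  const view = new DataView(buf);
  for (let i = 0; i < float32.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, float32[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // true = little-endian
  }
  let binary = "";
  const bytes = new Uint8Array(buf);
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// ~100 ms at 16 kHz mono = 1600 samples per chunk:
// ws.send(JSON.stringify({ type: "audio.input", audio: floatTo16BitPcmBase64(chunk) }));
```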

audio.input.complete

Signal end of user speech (manual commit).

{ "type": "audio.input.complete" }

Normally, voice activity detection (VAD) detects the end of speech automatically. Use this message for push-to-talk implementations.
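
A sketch of the push-to-talk case; talkButton, startMicrophoneCapture, and stopMicrophoneCapture are hypothetical client-side pieces:

```javascript
// Stream audio while the talk button is held; commit on release.
talkButton.addEventListener("mousedown", () => startMicrophoneCapture());
talkButton.addEventListener("mouseup", () => {
  stopMicrophoneCapture();
  ws.send(JSON.stringify({ type: "audio.input.complete" }));
});
```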

barge_in

Interrupt the AI response.

{ "type": "barge_in" }

When the server receives this message, it:

  • Cancels TTS synthesis
  • Clears the audio queue
  • Resets the pipeline to the listening state

message

Send text input (fallback when the microphone is unavailable).

{ "type": "message", "content": "What's the weather like?" }

ping

Keep-alive heartbeat.

{ "type": "ping" }

Server responds with pong.
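
A simple client-side keep-alive loop; the 15-second interval is an assumption, not a documented requirement:

```javascript
// Send a ping periodically while the socket is open.
const keepAlive = setInterval(() => {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({ type: "ping" }));
  }
}, 15000);

ws.addEventListener("close", () => clearInterval(keepAlive));
```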

Server → Client Messages

session.ready

Session initialized successfully.

{ "type": "session.ready", "session_id": "sess-abc123", "pipeline_mode": "thinker_talker" }

session.init.ack

Acknowledges session.init message.

{ "type": "session.init.ack" }

transcript.delta

Partial STT transcript (streaming).

{ "type": "transcript.delta", "text": "What is the", "is_final": false }

| Field | Type | Description |
| --- | --- | --- |
| text | string | Partial transcript text |
| is_final | boolean | Always false for delta |

transcript.complete

Final STT transcript.

{ "type": "transcript.complete", "text": "What is the weather today?", "message_id": "msg-xyz789" }

| Field | Type | Description |
| --- | --- | --- |
| text | string | Complete transcript |
| message_id | string | Unique message identifier |

response.delta

Streaming LLM response token.

{ "type": "response.delta", "delta": "The", "message_id": "resp-123" }

| Field | Type | Description |
| --- | --- | --- |
| delta | string | Response token/chunk |
| message_id | string | Response message ID |

response.complete

Complete LLM response.

{ "type": "response.complete", "text": "The weather today is sunny with a high of 72 degrees.", "message_id": "resp-123" }

audio.output

TTS audio chunk.

{ "type": "audio.output", "audio": "base64_encoded_pcm_audio", "is_final": false, "sentence_index": 0 }

| Field | Type | Description |
| --- | --- | --- |
| audio | string | Base64-encoded PCM audio (24 kHz, mono) |
| is_final | boolean | True for last chunk |
| sentence_index | number | Which sentence this is from |

Output Audio Format:

  • Sample rate: 24000 Hz
  • Channels: 1 (mono)
  • Bit depth: 16-bit signed PCM
  • Encoding: Little-endian
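
The audioPlayer referenced in the complete example below is not part of this API. A minimal sketch using the Web Audio API, scheduling chunks back-to-back (it omits the finish() bookkeeping and source tracking a production player would need):

```javascript
// Minimal sequential player for 24 kHz mono PCM16 chunks.
class PcmPlayer {
  constructor(sampleRate = 24000) {
    this.ctx = new AudioContext({ sampleRate });
    this.nextStartTime = 0;
  }

  queueChunk(arrayBuffer) {
    // Interpret the bytes as little-endian signed 16-bit samples.
    const pcm = new Int16Array(arrayBuffer);
    const floats = new Float32Array(pcm.length);
    for (let i = 0; i < pcm.length; i++) {
      floats[i] = pcm[i] / 0x8000;
    }
    const buffer = this.ctx.createBuffer(1, floats.length, this.ctx.sampleRate);
    buffer.copyToChannel(floats, 0);

    const src = this.ctx.createBufferSource();
    src.buffer = buffer;
    src.connect(this.ctx.destination);

    // Start each chunk exactly when the previous one ends.
    const startAt = Math.max(this.ctx.currentTime, this.nextStartTime);
    src.start(startAt);
    this.nextStartTime = startAt + buffer.duration;
  }

  stop() {
    // Crude barge-in handling: tear down the context to drop scheduled audio.
    this.ctx.close();
    this.ctx = new AudioContext({ sampleRate: 24000 });
    this.nextStartTime = 0;
  }
}
```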

tool.call

Tool invocation started.

{ "type": "tool.call", "id": "call-abc", "name": "calendar_list_events", "arguments": { "start_date": "2025-12-01", "end_date": "2025-12-07" } }

| Field | Type | Description |
| --- | --- | --- |
| id | string | Tool call ID |
| name | string | Tool function name |
| arguments | object | Tool arguments |

tool.result

Tool execution completed.

{ "type": "tool.result", "id": "call-abc", "name": "calendar_list_events", "result": { "events": [{ "title": "Team Meeting", "start": "2025-12-02T10:00:00" }] } }

| Field | Type | Description |
| --- | --- | --- |
| id | string | Tool call ID |
| name | string | Tool function name |
| result | any | Tool execution result |

voice.state

Pipeline state change.

{ "type": "voice.state", "state": "speaking" }

| State | Description |
| --- | --- |
| idle | Waiting for user input |
| listening | Receiving audio, STT active |
| processing | LLM thinking |
| speaking | TTS playing |
| cancelled | Barge-in occurred |
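
For example, a client might map these states onto its UI; the handler and element names here are hypothetical:

```javascript
// Hypothetical UI wiring for voice.state messages.
function handleVoiceState(state) {
  micIndicator.classList.toggle("active", state === "listening");
  thinkingSpinner.hidden = state !== "processing";
  speakerIcon.classList.toggle("playing", state === "speaking");
}
```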

heartbeat

Server heartbeat (every 30s).

{ "type": "heartbeat" }

pong

Response to client ping.

{ "type": "pong" }

error

An error occurred.

{ "type": "error", "code": "stt_failed", "message": "Speech-to-text service unavailable", "recoverable": true }

| Field | Type | Description |
| --- | --- | --- |
| code | string | Error code |
| message | string | Human-readable message |
| recoverable | boolean | True if client can retry |

Error Codes:

| Code | Description | Recoverable |
| --- | --- | --- |
| invalid_json | Malformed JSON message | Yes |
| connection_failed | Pipeline init failed | No |
| stt_failed | STT service error | Yes |
| llm_failed | LLM service error | Yes |
| tts_failed | TTS service error | Yes |
| auth_failed | Authentication error | No |
| rate_limited | Too many requests | Yes |
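
One possible client policy for the recoverable flag; the backoff values and the resumeAudioStreaming hook are assumptions, not documented server behavior:

```javascript
let retryDelayMs = 1000;

function handleError(msg) {
  if (!msg.recoverable) {
    // auth_failed / connection_failed: nothing to retry.
    ws.close();
    return;
  }
  if (msg.code === "rate_limited") {
    // Back off more aggressively when rate limited.
    retryDelayMs = Math.min(retryDelayMs * 2, 30000);
  }
  setTimeout(() => resumeAudioStreaming(), retryDelayMs);
}
```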

Example: Complete Session

```javascript
// 1. Connect
const ws = new WebSocket(`wss://assist.asimo.io/api/voice/pipeline-ws?token=${token}`);

ws.onopen = () => {
  console.log("Connected");
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);

  switch (msg.type) {
    case "session.ready":
      // 2. Initialize with settings
      ws.send(
        JSON.stringify({
          type: "session.init",
          conversation_id: currentConversationId,
          voice_settings: {
            voice_id: "TxGEqnHWrfWFTfGW9XjX",
            language: "en",
          },
        }),
      );
      break;

    case "session.init.ack":
      // 3. Start sending audio
      startMicrophoneCapture();
      break;

    case "transcript.delta":
      // Show partial transcript
      updatePartialTranscript(msg.text);
      break;

    case "transcript.complete":
      // Show final transcript
      setTranscript(msg.text);
      break;

    case "response.delta":
      // Append LLM response
      appendResponse(msg.delta);
      break;

    case "audio.output":
      // Play TTS audio
      if (msg.audio) {
        const pcm = base64ToArrayBuffer(msg.audio);
        audioPlayer.queueChunk(pcm);
      }
      if (msg.is_final) {
        audioPlayer.finish();
      }
      break;

    case "tool.call":
      // Show tool being called
      showToolCall(msg.name, msg.arguments);
      break;

    case "tool.result":
      // Show tool result
      showToolResult(msg.name, msg.result);
      break;

    case "error":
      console.error(`Error [${msg.code}]: ${msg.message}`);
      if (!msg.recoverable) {
        ws.close();
      }
      break;
  }
};

// Send audio chunks from microphone
function sendAudioChunk(pcmData) {
  ws.send(
    JSON.stringify({
      type: "audio.input",
      audio: arrayBufferToBase64(pcmData),
    }),
  );
}

// Handle barge-in (user speaks while AI is talking)
function handleBargeIn() {
  ws.send(JSON.stringify({ type: "barge_in" }));
  audioPlayer.stop();
}
```
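
The example assumes base64ToArrayBuffer and arrayBufferToBase64 helpers; minimal browser-only implementations could look like this (Node clients would use Buffer instead):

```javascript
function arrayBufferToBase64(buffer) {
  let binary = "";
  const bytes = new Uint8Array(buffer);
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

function base64ToArrayBuffer(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes.buffer;
}
```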

Configuration Reference

TTSessionConfig (Backend)

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TTSessionConfig:
    user_id: str
    session_id: str
    conversation_id: Optional[str] = None

    # Voice settings
    voice_id: str = "TxGEqnHWrfWFTfGW9XjX"
    tts_model: str = "eleven_flash_v2_5"
    language: str = "en"

    # STT settings
    stt_sample_rate: int = 16000
    stt_endpointing_ms: int = 800
    stt_utterance_end_ms: int = 1500

    # Barge-in
    barge_in_enabled: bool = True

    # Timeouts
    connection_timeout_sec: float = 10.0
    idle_timeout_sec: float = 300.0
```

Rate Limiting

| Limit | Value |
| --- | --- |
| Max concurrent sessions per user | 2 |
| Max concurrent sessions total | 100 |
| Audio chunk rate | ~10/second recommended |
| Idle timeout | 300 seconds |