Voice Pipeline WebSocket API
Endpoint: wss://{host}/api/voice/pipeline-ws
Protocol: JSON over WebSocket
Status: Production Ready
Last Updated: 2025-12-02
Overview
The Voice Pipeline WebSocket provides bidirectional communication for the Thinker-Talker voice mode. It handles audio streaming, transcription, LLM responses, and TTS playback.
Connection
Authentication
Include a JWT token in the connection URL query string or in the request headers:
const ws = new WebSocket(`wss://assist.asimo.io/api/voice/pipeline-ws?token=${accessToken}`);
Connection Lifecycle
1. Client connects with auth token
2. Server accepts, creates pipeline session
3. Server sends: session.ready
4. Client sends: session.init (optional config)
5. Server acknowledges: session.init.ack
6. Voice mode active - bidirectional streaming
7. Client or server closes connection
Message Format
All messages are JSON objects with a type field:
{ "type": "message_type", "field1": "value1", "field2": "value2" }
Client → Server Messages
session.init
Initialize or reconfigure the session.
{ "type": "session.init", "conversation_id": "conv-123", "voice_settings": { "voice_id": "TxGEqnHWrfWFTfGW9XjX", "language": "en", "barge_in_enabled": true } }
| Field | Type | Required | Description |
|---|---|---|---|
| conversation_id | string | No | Link to existing chat conversation |
| voice_settings.voice_id | string | No | ElevenLabs voice ID |
| voice_settings.language | string | No | STT language code (default: "en") |
| voice_settings.barge_in_enabled | boolean | No | Allow user interruption (default: true) |
audio.input
Stream audio from microphone.
{ "type": "audio.input", "audio": "base64_encoded_pcm16_audio" }
| Field | Type | Required | Description |
|---|---|---|---|
| audio | string | Yes | Base64-encoded PCM16 audio (16kHz, mono) |
Audio Format Requirements:
- Sample rate: 16000 Hz
- Channels: 1 (mono)
- Bit depth: 16-bit signed PCM
- Encoding: Little-endian
- Chunk size: ~100ms recommended (1600 samples)
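As a rough illustration, the sketch below converts a Float32 sample buffer (for example, from a Web Audio capture node running at 16 kHz mono) into base64-encoded little-endian PCM16 and sends it as an `audio.input` message. The capture setup itself and the open `ws` socket are assumed to exist already.

```javascript
// Sketch: convert Float32 samples (-1..1) to little-endian PCM16,
// base64-encode the bytes, and stream them as an audio.input message.
// Assumes `ws` is an open pipeline WebSocket and samples are 16 kHz mono.
function sendPcm16Chunk(ws, float32Samples) {
  const buffer = new ArrayBuffer(float32Samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp, scale to 16-bit signed range, write little-endian.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  // Base64-encode the raw bytes.
  let binary = "";
  const bytes = new Uint8Array(buffer);
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  ws.send(JSON.stringify({ type: "audio.input", audio: btoa(binary) }));
}
```

Passing around 1600 samples per call keeps each chunk near the recommended ~100 ms.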
audio.input.complete
Signal end of user speech (manual commit).
{ "type": "audio.input.complete" }
Normally, voice activity detection (VAD) detects the end of speech automatically. Use this message for push-to-talk implementations, as sketched below.
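For a push-to-talk UI, the commit can be tied to the talk button's release. A minimal sketch; `startMicrophoneCapture` and `stopMicrophoneCapture` are assumed helpers that start and stop emitting `audio.input` chunks:

```javascript
// Sketch: push-to-talk — stream audio while the button is held,
// then send audio.input.complete on release to commit the utterance.
talkButton.addEventListener("mousedown", () => {
  startMicrophoneCapture(); // assumed helper: begins sending audio.input chunks
});

talkButton.addEventListener("mouseup", () => {
  stopMicrophoneCapture(); // assumed helper: stops sending chunks
  ws.send(JSON.stringify({ type: "audio.input.complete" }));
});
```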
barge_in
Interrupt AI response.
{ "type": "barge_in" }
When received:
- Cancels TTS synthesis
- Clears audio queue
- Resets pipeline to listening state
message
Send text input (fallback when the microphone is unavailable).
{ "type": "message", "content": "What's the weather like?" }
ping
Keep-alive heartbeat.
{ "type": "ping" }
Server responds with pong.
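One way to keep the connection warm is a periodic client-side ping; the 20-second interval below is an illustrative choice, not a server requirement.

```javascript
// Sketch: send a ping every 20 seconds while the socket is open.
const pingTimer = setInterval(() => {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({ type: "ping" }));
  }
}, 20000);

// Stop pinging once the connection closes.
ws.addEventListener("close", () => clearInterval(pingTimer));
```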
Server → Client Messages
session.ready
Session initialized successfully.
{ "type": "session.ready", "session_id": "sess-abc123", "pipeline_mode": "thinker_talker" }
session.init.ack
Acknowledges session.init message.
{ "type": "session.init.ack" }
transcript.delta
Partial STT transcript (streaming).
{ "type": "transcript.delta", "text": "What is the", "is_final": false }
| Field | Type | Description |
|---|---|---|
| text | string | Partial transcript text |
| is_final | boolean | Always false for delta |
transcript.complete
Final STT transcript.
{ "type": "transcript.complete", "text": "What is the weather today?", "message_id": "msg-xyz789" }
| Field | Type | Description |
|---|---|---|
| text | string | Complete transcript |
| message_id | string | Unique message identifier |
response.delta
Streaming LLM response token.
{ "type": "response.delta", "delta": "The", "message_id": "resp-123" }
| Field | Type | Description |
|---|---|---|
| delta | string | Response token/chunk |
| message_id | string | Response message ID |
response.complete
Complete LLM response.
{ "type": "response.complete", "text": "The weather today is sunny with a high of 72 degrees.", "message_id": "resp-123" }
audio.output
TTS audio chunk.
{ "type": "audio.output", "audio": "base64_encoded_pcm_audio", "is_final": false, "sentence_index": 0 }
| Field | Type | Description |
|---|---|---|
| audio | string | Base64-encoded PCM audio (24kHz, mono) |
| is_final | boolean | True for last chunk |
| sentence_index | number | Which sentence this chunk is from |
Output Audio Format:
- Sample rate: 24000 Hz
- Channels: 1 (mono)
- Bit depth: 16-bit signed PCM
- Encoding: Little-endian
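To play these chunks in a browser, one approach is to decode each base64 payload into 16-bit samples, convert to Float32, and schedule buffers back-to-back on an `AudioContext` running at 24 kHz. This is a minimal sketch of the `audioPlayer.queueChunk` idea used in the full example below; names and structure are illustrative, and it assumes a little-endian host for the `Int16Array` view.

```javascript
// Sketch: queue 24 kHz PCM16 chunks for gapless playback via Web Audio.
const ctx = new AudioContext({ sampleRate: 24000 });
let playhead = 0; // time (in AudioContext seconds) at which the next chunk starts

function queuePcm16Chunk(base64Audio) {
  // Decode base64 -> bytes -> Int16 (PCM16 payloads always have an even byte count).
  const binary = atob(base64Audio);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  const samples = new Int16Array(bytes.buffer);

  // Convert to Float32 in -1..1 for Web Audio.
  const floats = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) floats[i] = samples[i] / 32768;

  // Copy into an AudioBuffer and schedule it right after the previous chunk.
  const buffer = ctx.createBuffer(1, floats.length, 24000);
  buffer.copyToChannel(floats, 0);
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);
  playhead = Math.max(playhead, ctx.currentTime);
  source.start(playhead);
  playhead += buffer.duration;
}
```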
tool.call
Tool invocation started.
{ "type": "tool.call", "id": "call-abc", "name": "calendar_list_events", "arguments": { "start_date": "2025-12-01", "end_date": "2025-12-07" } }
| Field | Type | Description |
|---|---|---|
| id | string | Tool call ID |
| name | string | Tool function name |
| arguments | object | Tool arguments |
tool.result
Tool execution completed.
{ "type": "tool.result", "id": "call-abc", "name": "calendar_list_events", "result": { "events": [{ "title": "Team Meeting", "start": "2025-12-02T10:00:00" }] } }
| Field | Type | Description |
|---|---|---|
| id | string | Tool call ID |
| name | string | Tool function name |
| result | any | Tool execution result |
voice.state
Pipeline state change.
{ "type": "voice.state", "state": "speaking" }
| State | Description |
|---|---|
| idle | Waiting for user input |
| listening | Receiving audio, STT active |
| processing | LLM thinking |
| speaking | TTS playing |
| cancelled | Barge-in occurred |
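A client will typically mirror these states in its UI. The sketch below is one way to react to `voice.state` messages; the UI helper functions are placeholders, not part of the API.

```javascript
// Sketch: reflect pipeline state changes in the client UI.
function handleVoiceState(msg) {
  switch (msg.state) {
    case "listening":
      showMicIndicator(true);  // placeholder UI helper
      break;
    case "processing":
      showThinkingSpinner();   // placeholder UI helper
      break;
    case "speaking":
      showSpeakingIndicator(); // placeholder UI helper
      break;
    case "cancelled":
    case "idle":
      resetVoiceUi();          // placeholder UI helper
      break;
  }
}
```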
heartbeat
Server heartbeat (every 30s).
{ "type": "heartbeat" }
pong
Response to client ping.
{ "type": "pong" }
error
Error occurred.
{ "type": "error", "code": "stt_failed", "message": "Speech-to-text service unavailable", "recoverable": true }
| Field | Type | Description |
|---|---|---|
| code | string | Error code |
| message | string | Human-readable message |
| recoverable | boolean | True if client can retry |
Error Codes:
| Code | Description | Recoverable |
|---|---|---|
| invalid_json | Malformed JSON message | Yes |
| connection_failed | Pipeline init failed | No |
| stt_failed | STT service error | Yes |
| llm_failed | LLM service error | Yes |
| tts_failed | TTS service error | Yes |
| auth_failed | Authentication error | No |
| rate_limited | Too many requests | Yes |
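A reasonable client policy is to keep the session alive on recoverable errors, stop on non-recoverable ones, and reconnect with backoff after unexpected closes. The sketch below illustrates that pattern; the backoff values and control flow are illustrative, not part of the API contract.

```javascript
// Sketch: reconnect with exponential backoff, but stop on fatal errors.
let backoffMs = 1000;
let stopped = false;

function connectVoicePipeline(token) {
  const ws = new WebSocket(`wss://assist.asimo.io/api/voice/pipeline-ws?token=${token}`);

  ws.onopen = () => {
    backoffMs = 1000; // reset backoff once a session is established
  };

  ws.onmessage = (event) => {
    const msg = JSON.parse(event.data);
    if (msg.type === "error" && !msg.recoverable) {
      // e.g. auth_failed or connection_failed: stop retrying until re-authenticated.
      console.error(`Fatal voice pipeline error [${msg.code}]: ${msg.message}`);
      stopped = true;
      ws.close();
    }
  };

  ws.onclose = () => {
    if (stopped) return;
    setTimeout(() => connectVoicePipeline(token), backoffMs);
    backoffMs = Math.min(backoffMs * 2, 30000);
  };

  return ws;
}
```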
Example: Complete Session
```javascript
// 1. Connect
const ws = new WebSocket(`wss://assist.asimo.io/api/voice/pipeline-ws?token=${token}`);

ws.onopen = () => {
  console.log("Connected");
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);

  switch (msg.type) {
    case "session.ready":
      // 2. Initialize with settings
      ws.send(
        JSON.stringify({
          type: "session.init",
          conversation_id: currentConversationId,
          voice_settings: {
            voice_id: "TxGEqnHWrfWFTfGW9XjX",
            language: "en",
          },
        }),
      );
      break;

    case "session.init.ack":
      // 3. Start sending audio
      startMicrophoneCapture();
      break;

    case "transcript.delta":
      // Show partial transcript
      updatePartialTranscript(msg.text);
      break;

    case "transcript.complete":
      // Show final transcript
      setTranscript(msg.text);
      break;

    case "response.delta":
      // Append LLM response
      appendResponse(msg.delta);
      break;

    case "audio.output":
      // Play TTS audio
      if (msg.audio) {
        const pcm = base64ToArrayBuffer(msg.audio);
        audioPlayer.queueChunk(pcm);
      }
      if (msg.is_final) {
        audioPlayer.finish();
      }
      break;

    case "tool.call":
      // Show tool being called
      showToolCall(msg.name, msg.arguments);
      break;

    case "tool.result":
      // Show tool result
      showToolResult(msg.name, msg.result);
      break;

    case "error":
      console.error(`Error [${msg.code}]: ${msg.message}`);
      if (!msg.recoverable) {
        ws.close();
      }
      break;
  }
};

// Send audio chunks from microphone
function sendAudioChunk(pcmData) {
  ws.send(
    JSON.stringify({
      type: "audio.input",
      audio: arrayBufferToBase64(pcmData),
    }),
  );
}

// Handle barge-in (user speaks while AI is talking)
function handleBargeIn() {
  ws.send(JSON.stringify({ type: "barge_in" }));
  audioPlayer.stop();
}
```
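The example above assumes `base64ToArrayBuffer` and `arrayBufferToBase64` helpers. A straightforward browser implementation looks like this:

```javascript
// Helpers assumed by the example above (browser environment).
function arrayBufferToBase64(buffer) {
  let binary = "";
  const bytes = new Uint8Array(buffer);
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

function base64ToArrayBuffer(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes.buffer;
}
```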
Configuration Reference
TTSessionConfig (Backend)
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TTSessionConfig:
    user_id: str
    session_id: str
    conversation_id: Optional[str] = None

    # Voice settings
    voice_id: str = "TxGEqnHWrfWFTfGW9XjX"
    tts_model: str = "eleven_flash_v2_5"
    language: str = "en"

    # STT settings
    stt_sample_rate: int = 16000
    stt_endpointing_ms: int = 800
    stt_utterance_end_ms: int = 1500

    # Barge-in
    barge_in_enabled: bool = True

    # Timeouts
    connection_timeout_sec: float = 10.0
    idle_timeout_sec: float = 300.0
```
Rate Limiting
| Limit | Value |
|---|---|
| Max concurrent sessions per user | 2 |
| Max concurrent sessions total | 100 |
| Audio chunk rate | ~10/second recommended |
| Idle timeout | 300 seconds |
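If the capture layer produces chunks faster than the recommended rate, one option is to buffer them locally and drain the queue on a fixed interval. A minimal sketch; the queue, interval, and open `ws` socket are assumptions for illustration.

```javascript
// Sketch: pace outgoing audio.input messages at roughly 10 chunks/second.
const outgoing = []; // base64-encoded PCM16 chunks awaiting send

setInterval(() => {
  if (outgoing.length > 0 && ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({ type: "audio.input", audio: outgoing.shift() }));
  }
}, 100);
```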