VoiceAssist Docs

Voice & Realtime Debugging

Troubleshooting guide for WebSocket, speech-to-text, text-to-speech, and realtime features

Status: stable · Area: backend · Updated: 2025-12-02 · Audience: human, agent, backend, frontend
Tags: debugging, runbook, voice, realtime, websocket

Voice & Realtime Debugging Guide

Last Updated: 2025-12-02 · Components: Voice pipeline, WebSocket service, STT/TTS


Voice Pipeline Overview

VoiceAssist has two voice pipelines:

| Pipeline | Status | Endpoint | Components |
| --- | --- | --- | --- |
| Thinker-Talker | Primary | /api/voice/pipeline-ws | Deepgram STT → GPT-4o → ElevenLabs TTS |
| OpenAI Realtime API | Legacy/Fallback | /api/realtime | OpenAI Realtime API (WebSocket) |

Always debug Thinker-Talker first unless specifically working with the legacy pipeline.


Part A: Thinker-Talker Voice Pipeline (Primary)

Architecture

┌─────────────┐    ┌─────────────┐    ┌──────────────┐    ┌─────────────┐
│   Browser   │───▶│ Deepgram    │───▶│  GPT-4o      │───▶│ ElevenLabs  │
│ Audio Input │    │ STT Service │    │ Thinker Svc  │    │ TTS Service │
└─────────────┘    └─────────────┘    └──────────────┘    └─────────────┘
       │                  │                  │                  │
       │                  ▼                  ▼                  ▼
       │           transcript.delta    response.delta     audio.output
       │           transcript.complete response.complete
       └───────────────────────────────────────────────────────────────▶
                                WebSocket Messages

Key Files

| File | Purpose |
| --- | --- |
| app/services/voice_pipeline_service.py | Main pipeline orchestrator |
| app/services/thinker_service.py | LLM service (GPT-4o, tool calling) |
| app/services/talker_service.py | TTS service (ElevenLabs streaming) |
| app/services/streaming_stt_service.py | STT service (Deepgram streaming) |
| app/services/sentence_chunker.py | Phrase-level chunking for low latency |
| app/services/thinker_talker_websocket_handler.py | WebSocket handler |
| apps/web-app/src/hooks/useThinkerTalkerSession.ts | Client WebSocket hook |
| apps/web-app/src/hooks/useThinkerTalkerVoiceMode.ts | Voice mode state machine |

WebSocket Message Types

| Message Type | Direction | Description |
| --- | --- | --- |
| audio.input | Client → Server | Base64-encoded PCM audio |
| transcript.delta | Server → Client | Partial transcript from STT |
| transcript.complete | Server → Client | Final transcript |
| response.delta | Server → Client | Streaming LLM token |
| response.complete | Server → Client | Full LLM response |
| audio.output | Server → Client | Base64-encoded TTS audio chunk |
| tool.call | Server → Client | Function/tool invocation |
| tool.result | Server → Client | Tool execution result |
| voice.state | Server → Client | Pipeline state change |
| error | Server → Client | Error notification |
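
A small client-side dispatcher over these message types is often easier to debug with than raw frames in DevTools. The sketch below is illustrative only: the payload field names (`text`, `audio`, `state`, `message`) are assumptions, and the real shapes live in `useThinkerTalkerSession.ts`.

```typescript
// Hypothetical message shapes -- confirm the actual payload fields in useThinkerTalkerSession.ts.
type PipelineMessage =
  | { type: "transcript.delta" | "transcript.complete"; text: string }
  | { type: "response.delta" | "response.complete"; text: string }
  | { type: "audio.output"; audio: string } // base64-encoded TTS chunk
  | { type: "voice.state"; state: string }  // IDLE / LISTENING / PROCESSING / ...
  | { type: "tool.call" | "tool.result"; payload: unknown }
  | { type: "error"; message: string };

function attachDebugLogger(socket: WebSocket): void {
  socket.onmessage = (event: MessageEvent<string>) => {
    const msg = JSON.parse(event.data) as PipelineMessage;
    switch (msg.type) {
      case "transcript.delta":
      case "response.delta":
        // High-volume streaming messages: log compactly.
        console.debug(msg.type, msg.text);
        break;
      case "audio.output":
        console.debug("audio.output chunk,", msg.audio.length, "base64 chars");
        break;
      case "error":
        console.error("pipeline error:", msg.message);
        break;
      default:
        console.log(msg.type, msg);
    }
  };
}
```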

Pipeline States

```python
PipelineState = {
    IDLE,        # Waiting for input
    LISTENING,   # Recording user audio
    PROCESSING,  # Running STT/LLM
    SPEAKING,    # Playing TTS audio
    CANCELLED,   # Barge-in triggered
    ERROR,
}
```
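
On the client, `useThinkerTalkerVoiceMode.ts` tracks a matching state machine. A hedged sketch of how those states might be mirrored and gated in TypeScript (the guard function is illustrative, not the hook's actual API):

```typescript
// Client-side mirror of the server's pipeline states.
type PipelineState =
  | "IDLE"       // waiting for input
  | "LISTENING"  // recording user audio
  | "PROCESSING" // running STT/LLM
  | "SPEAKING"   // playing TTS audio
  | "CANCELLED"  // barge-in triggered
  | "ERROR";

// Illustrative guard: start a new utterance only when idle, or while speaking
// (which becomes a barge-in).
function canStartListening(state: PipelineState): boolean {
  return state === "IDLE" || state === "SPEAKING";
}
```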

Thinker-Talker Debugging

No Transcripts

Likely Causes:

  • Deepgram API key invalid or expired
  • Audio not reaching server
  • Wrong audio format (expects 16kHz PCM16)
  • Deepgram service down

Steps to Investigate:

  1. Check Deepgram health:

```bash
# Check environment variable
echo $DEEPGRAM_API_KEY | head -c 10

# Test Deepgram directly
curl -X POST "https://api.deepgram.com/v1/listen" \
  -H "Authorization: Token $DEEPGRAM_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @test.wav
```

  2. Check server logs for STT errors:

```bash
docker logs voiceassist-server --since "5m" 2>&1 | grep -iE "deepgram|stt|transcri"
```

  3. Verify the audio format in the client (see the resampling sketch after this list):

```javascript
// Should be PCM16 at 16kHz
console.log("Sample rate:", audioContext.sampleRate);
// If 48kHz, ensure resampling is active
```

  4. Check WebSocket messages in browser DevTools → Network → WS.
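
If the capture rate is 48 kHz, the client has to downsample before sending `audio.input`. The sketch below shows the idea, assuming the server expects 16 kHz mono PCM16 in a base64 `audio` field; the actual framing is defined by `useThinkerTalkerSession.ts` and the pipeline service.

```typescript
// Downsample a Float32 mic frame to 16 kHz PCM16 and base64-encode it for audio.input.
// Nearest-sample decimation only -- production code should low-pass filter first.
function toPcm16kBase64(frame: Float32Array, inputRate: number): string {
  const ratio = inputRate / 16000;
  const outLength = Math.floor(frame.length / ratio);
  const pcm = new Int16Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const s = Math.max(-1, Math.min(1, frame[Math.floor(i * ratio)]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  let binary = "";
  const bytes = new Uint8Array(pcm.buffer);
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

// Usage from an audio capture callback (message shape is an assumption):
// socket.send(JSON.stringify({ type: "audio.input", audio: toPcm16kBase64(frame, audioContext.sampleRate) }));
```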

Relevant Code:

  • app/services/streaming_stt_service.py - Deepgram integration

No LLM Response

Likely Causes:

  • OpenAI API key invalid
  • Rate limiting
  • Context too long
  • Tool call hanging

Steps to Investigate:

  1. Check Thinker service logs:

```bash
docker logs voiceassist-server --since "5m" 2>&1 | grep -iE "thinker|openai|llm|gpt"
```

  2. Verify OpenAI API access:

```bash
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 10}'
```

  3. Check for tool call issues:

```bash
docker logs voiceassist-server --since "5m" 2>&1 | grep -iE "tool|function|call"
```

Relevant Code:

  • app/services/thinker_service.py - LLM orchestration
  • app/services/llm_client.py - OpenAI client

No Audio Output

Likely Causes:

  • ElevenLabs API key invalid
  • Voice ID not found
  • TTS service failed
  • Audio not playing in browser (autoplay policy)

Steps to Investigate:

  1. Check ElevenLabs health:

```bash
curl https://api.elevenlabs.io/v1/voices \
  -H "xi-api-key: $ELEVENLABS_API_KEY" | jq '.voices[0].voice_id'
```

  2. Check Talker service logs:

```bash
docker logs voiceassist-server --since "5m" 2>&1 | grep -iE "talker|elevenlabs|tts|audio"
```

  3. Verify the voice ID in config:

```bash
grep -r "voice_id" services/api-gateway/app/core/config.py
# Default: TxGEqnHWrfWFTfGW9XjX (Josh)
```

  4. Check browser autoplay policy (see the playback sketch after this list):

```javascript
// AudioContext must be resumed after user interaction
if (audioContext.state === "suspended") {
  await audioContext.resume();
}
```
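
If the API calls succeed but nothing is audible, confirm the client can actually turn `audio.output` chunks into sound. The sketch below queues chunks for gapless playback, assuming raw PCM16 mono at a fixed sample rate; the real format and rate are decided in `talker_service.py` (and may be MP3, in which case `decodeAudioData` is the right tool).

```typescript
// Decode a base64 audio.output chunk and schedule it back-to-back with earlier chunks.
const OUTPUT_RATE = 24000; // assumption -- match whatever the TTS service actually emits
let playCursor = 0;        // AudioContext time at which the next chunk should start

function enqueueAudioChunk(ctx: AudioContext, b64: string): void {
  const bytes = Uint8Array.from(atob(b64), (c) => c.charCodeAt(0));
  const pcm = new Int16Array(bytes.buffer);
  const floats = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) floats[i] = pcm[i] / 0x8000;

  const buffer = ctx.createBuffer(1, floats.length, OUTPUT_RATE);
  buffer.copyToChannel(floats, 0);

  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);

  // Queue chunks back-to-back so streaming audio doesn't stutter.
  playCursor = Math.max(playCursor, ctx.currentTime);
  source.start(playCursor);
  playCursor += buffer.duration;
}
```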

Relevant Code:

  • app/services/talker_service.py - TTS orchestration
  • app/services/elevenlabs_service.py - ElevenLabs client

Barge-in Not Working

Likely Causes:

  • Barge-in disabled in config
  • Voice Activity Detection (VAD) not triggering
  • Audio overlap prevention issue

Steps to Investigate:

  1. Check config:

```python
# In voice_pipeline_service.py
PipelineConfig:
    barge_in_enabled: True  # Should be True
```

  2. Check VAD sensitivity (a client-side cancel sketch follows this list):

```javascript
// Client-side VAD config
const vadConfig = {
  threshold: 0.5, // Lower = more sensitive
  minSpeechFrames: 3,
};
```

  3. Check logs for barge-in events:

```bash
docker logs voiceassist-server --since "5m" 2>&1 | grep -iE "barge|cancel|interrupt"
```
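
Client-side barge-in boils down to: when VAD fires while the pipeline is SPEAKING, stop local playback and tell the server to cancel. A minimal sketch; the cancel message type and the helper functions are placeholders, and the real behavior lives in `useThinkerTalkerVoiceMode.ts` and `voice_pipeline_service.py`'s `barge_in()`.

```typescript
declare function stopLocalPlayback(): void;          // placeholder: flush queued TTS audio
declare function setPipelineState(s: string): void;  // placeholder: client state setter

// Called by the VAD when speech starts. Only interrupts if TTS is currently playing.
function onVadSpeechStart(socket: WebSocket, state: string): void {
  if (state !== "SPEAKING") return;

  stopLocalPlayback();                                       // silence pending audio immediately
  socket.send(JSON.stringify({ type: "response.cancel" }));  // assumed message name
  setPipelineState("LISTENING");                             // resume capturing the user
}
```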

Relevant Code:

  • app/services/voice_pipeline_service.py - barge_in() method
  • apps/web-app/src/hooks/useThinkerTalkerVoiceMode.ts - Client barge-in

High Latency

Targets:

| Metric | Target | Alert |
| --- | --- | --- |
| STT latency | < 300ms | > 800ms |
| First LLM token | < 500ms | > 1.5s |
| First TTS audio | < 200ms | > 600ms |
| Total (speech-to-speech) | < 1.2s | > 3s |

Steps to Investigate:

  1. Check pipeline metrics:

```bash
curl http://localhost:8000/api/voice/metrics | jq '.'
```

  2. Check sentence chunker config:

```python
# In sentence_chunker.py - phrase-level for low latency
ChunkerConfig:
    min_chunk_chars: 15      # Avoid tiny fragments
    optimal_chunk_chars: 50  # Clause boundary
    max_chunk_chars: 80      # Force split limit
```

  3. Enable debug logging:

```bash
export VOICE_LOG_LEVEL=DEBUG
docker restart voiceassist-server
```
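
For a client-side view of where the time goes, timestamping pipeline messages against the end of user speech is usually enough to tell STT, LLM, and TTS apart. A sketch, using the message types listed earlier:

```typescript
// Record millisecond offsets from end-of-speech to the first STT, LLM, and TTS events.
const marks: Record<string, number> = {};

function onSpeechEnd(): void {
  marks.speechEnd = performance.now();
}

function onPipelineMessage(type: string): void {
  if (!marks.speechEnd) return;
  const elapsed = performance.now() - marks.speechEnd;
  if (type === "transcript.complete" && marks.stt === undefined) {
    marks.stt = elapsed;          // roughly the STT latency
  } else if (type === "response.delta" && marks.firstToken === undefined) {
    marks.firstToken = elapsed;   // first LLM token after end of speech
  } else if (type === "audio.output" && marks.firstAudio === undefined) {
    marks.firstAudio = elapsed;   // first audio back ≈ speech-to-speech latency
    console.table(marks);         // compare against the targets above
  }
}
```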

Part B: Legacy OpenAI Realtime API (Fallback)

Note: This pipeline is maintained for backward compatibility. Prefer Thinker-Talker for new development.

Key Files

| File | Purpose |
| --- | --- |
| app/api/realtime.py | Legacy WebSocket endpoint |
| app/services/realtime_voice_service.py | OpenAI Realtime integration |
| apps/web-app/src/hooks/useRealtimeVoiceSession.ts | Legacy client hook |

Legacy Debugging

For OpenAI Realtime API issues:

  • Refer to the OpenAI Realtime API documentation
  • Check the OPENAI_API_KEY environment variable
  • Verify the WebSocket connection to /api/realtime

Part C: Common Issues (Both Pipelines)

Symptoms

WebSocket Won't Connect

Likely Causes:

  • CORS blocking WebSocket upgrade
  • Wrong WebSocket URL (ws vs wss)
  • Proxy not forwarding upgrade headers
  • Auth token invalid

Steps to Investigate:

  1. Check the browser console for errors:

```
WebSocket connection to 'wss://...' failed
```

  2. Verify the WebSocket URL:

```javascript
// Thinker-Talker voice pipeline (primary)
const voiceWsUrl = `wss://assist.asimo.io/api/voice/pipeline-ws?token=${accessToken}`;

// Chat streaming
const chatWsUrl = `wss://assist.asimo.io/api/realtime/ws?token=${accessToken}`;
```

  3. Test the WebSocket connection manually:

```bash
# Test Thinker-Talker voice pipeline (primary)
websocat "wss://assist.asimo.io/api/voice/pipeline-ws?token=YOUR_TOKEN"

# Test chat streaming WebSocket
wscat -c "wss://assist.asimo.io/api/realtime/ws?token=YOUR_TOKEN"
```

  4. Check the Apache/Nginx proxy config:

```apache
# WebSocket proxy for API endpoints
ProxyPass /api/voice/pipeline-ws ws://127.0.0.1:8000/api/voice/pipeline-ws
ProxyPassReverse /api/voice/pipeline-ws ws://127.0.0.1:8000/api/voice/pipeline-ws
ProxyPass /api/realtime/ws ws://127.0.0.1:8000/api/realtime/ws
ProxyPassReverse /api/realtime/ws ws://127.0.0.1:8000/api/realtime/ws

# WebSocket upgrade headers
RewriteCond %{HTTP:Upgrade} websocket [NC]
RewriteCond %{HTTP:Connection} upgrade [NC]
RewriteRule ^/api/(.*)$ ws://127.0.0.1:8000/api/$1 [P,L]
```
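
To separate a blocked upgrade (CORS/proxy) from an auth rejection, a throwaway connection from the browser console with close-code logging is often enough (assumes a valid `accessToken` is in scope):

```typescript
// Smoke-test the WebSocket upgrade path from the browser console.
const ws = new WebSocket(`wss://assist.asimo.io/api/voice/pipeline-ws?token=${accessToken}`);
ws.onopen = () => console.log("upgrade OK");
ws.onerror = (e) => console.error("socket error (check proxy/CORS):", e);
ws.onclose = (e) => console.log("closed:", e.code, e.reason || "(no reason)");
// An immediate close with code 1006 and no "upgrade OK" usually means the upgrade was
// blocked (proxy/CORS); a close shortly after opening more often points at auth.
```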

Relevant Code Paths:

  • services/api-gateway/app/api/websocket.py - WebSocket endpoint
  • apps/web-app/src/services/websocket/ - Client connection
  • Apache config: /etc/apache2/sites-available/

WebSocket Disconnects Frequently

Likely Causes:

  • Idle timeout (30-60s default)
  • Network instability
  • Server restarting
  • Memory pressure on server

Steps to Investigate:

  1. Check the disconnect reason:

```javascript
socket.onclose = (event) => {
  console.log("Close code:", event.code);
  console.log("Close reason:", event.reason);
};
// 1000 = normal, 1001 = going away, 1006 = abnormal
```

  2. Check server logs for connection drops:

```bash
docker logs voiceassist-server --since "10m" 2>&1 | grep -i "websocket\|disconnect"
```

  3. Implement a heartbeat/ping:

```javascript
// Client side
setInterval(() => {
  if (socket.readyState === WebSocket.OPEN) {
    socket.send(JSON.stringify({ type: "ping" }));
  }
}, 30000);
```

  4. Check proxy timeouts:

```apache
ProxyTimeout 300
# Or
ProxyBadHeader Ignore
```
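
If occasional drops are unavoidable (mobile networks, proxy restarts, deploys), the client should reconnect with backoff instead of surfacing every drop. A minimal sketch; the project's actual reconnection logic is in `WebSocketService.ts`:

```typescript
// Reconnect with capped exponential backoff plus jitter.
function connectWithBackoff(url: string, attempt = 0): WebSocket {
  const ws = new WebSocket(url);

  ws.onopen = () => {
    attempt = 0; // connection is healthy again: reset the backoff window
  };

  ws.onclose = (event) => {
    if (event.code === 1000) return; // normal closure -- don't reconnect
    const delay = Math.min(30_000, 1_000 * 2 ** attempt) + Math.random() * 500;
    setTimeout(() => connectWithBackoff(url, attempt + 1), delay);
  };

  return ws;
}
```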

Relevant Code Paths:

  • services/api-gateway/app/api/websocket.py - Connection handling
  • apps/web-app/src/services/websocket/WebSocketService.ts - Reconnection logic

Audio Not Recording

Likely Causes:

  • Browser permission denied
  • MediaRecorder not supported
  • Wrong audio format
  • AudioContext suspended

Steps to Investigate:

  1. Check browser permissions:

```javascript
const permission = await navigator.permissions.query({ name: "microphone" });
console.log("Microphone permission:", permission.state);
// 'granted', 'denied', or 'prompt'
```

  2. Request microphone access:

```javascript
try {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  console.log("Got audio stream:", stream);
} catch (err) {
  console.error("Microphone error:", err.name, err.message);
}
```

  3. Check MediaRecorder support:

```javascript
console.log("MediaRecorder supported:", typeof MediaRecorder !== "undefined");
console.log("Supported MIME types:");
["audio/webm", "audio/mp4", "audio/ogg"].forEach((type) => {
  console.log(type, MediaRecorder.isTypeSupported(type));
});
```

  4. Resume the AudioContext (required after user interaction):

```javascript
const audioContext = new AudioContext();
if (audioContext.state === "suspended") {
  await audioContext.resume();
}
```

Relevant Code Paths:

  • apps/web-app/src/services/voice/VoiceRecorder.ts
  • apps/web-app/src/hooks/useVoiceInput.ts

Speech-to-Text (STT) Not Working

Likely Causes:

  • Audio format not supported
  • STT service down
  • API key invalid
  • Audio too quiet/noisy

Steps to Investigate:

  1. Check STT service logs:

```bash
docker logs voiceassist-server --since "5m" 2>&1 | grep -i "stt\|transcri\|whisper"
```

  2. Verify audio is being sent:

```javascript
// Log audio blob details
console.log("Audio blob:", blob.size, blob.type);
// Should be > 0 bytes and correct MIME type
```

  3. Test STT directly:

```bash
# Test OpenAI Whisper API
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@test.mp3 \
  -F model=whisper-1
```

  4. Check audio quality:

```javascript
// Analyze input level: connect the mic stream (from getUserMedia above) to an analyser
const analyser = audioContext.createAnalyser();
audioContext.createMediaStreamSource(stream).connect(analyser);
const data = new Uint8Array(analyser.fftSize);
analyser.getByteTimeDomainData(data);
// Values hugging 128 mean near-silence; check gain and mic selection
const peak = Math.max(...data.map((v) => Math.abs(v - 128)));
console.log("Peak amplitude (0-128):", peak);
```

Relevant Code Paths:

  • services/api-gateway/app/services/stt_service.py
  • services/api-gateway/app/api/voice.py

Text-to-Speech (TTS) Not Playing

Likely Causes:

  • AudioContext suspended (autoplay policy)
  • Audio element not connected
  • TTS API failure
  • Wrong audio format/codec

Steps to Investigate:

  1. Check the AudioContext state:

```javascript
console.log("AudioContext state:", audioContext.state);
// Should be 'running', not 'suspended'
```

  2. Resume after user interaction:

```javascript
button.onclick = async () => {
  if (audioContext.state === "suspended") {
    await audioContext.resume();
  }
  playTTS();
};
```

  3. Check the TTS service:

```bash
# Test OpenAI TTS API
curl https://api.openai.com/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello", "voice": "alloy"}' \
  --output test.mp3
```

  4. Verify audio playback:

```javascript
const audio = new Audio();
audio.src = URL.createObjectURL(audioBlob);
audio.oncanplay = () => console.log("Audio can play");
audio.onerror = (e) => console.error("Audio error:", e);
await audio.play();
```

Relevant Code Paths:

  • services/api-gateway/app/services/tts_service.py
  • apps/web-app/src/services/voice/TTSPlayer.ts

Voice Activity Detection (VAD) Issues

Likely Causes:

  • Threshold too high/low
  • Background noise
  • Wrong sample rate
  • VAD model not loaded

Steps to Investigate:

  1. Check the VAD configuration:

```javascript
const vadConfig = {
  threshold: 0.5,         // Adjust based on noise level
  minSpeechFrames: 3,     // Minimum frames to trigger
  preSpeechPadFrames: 10,
  redemptionFrames: 8,
};
```

  2. Visualize audio levels:

```javascript
// Use canvas to show real-time levels
const draw = () => {
  analyser.getByteFrequencyData(dataArray);
  // Draw to canvas
  requestAnimationFrame(draw);
};
```

  3. Check the sample rate:

```javascript
console.log("AudioContext sample rate:", audioContext.sampleRate);
// VAD typically expects 16000 Hz
```

Relevant Code Paths:

  • apps/web-app/src/services/voice/VADService.ts
  • apps/web-app/src/utils/vad.ts

Debugging Tools

Browser DevTools

Monitor WebSocket traffic: DevTools → Network → WS → select the connection → Messages.

Audio Debugging

```javascript
// Create audio visualizer
const analyser = audioContext.createAnalyser();
analyser.fftSize = 2048;
const bufferLength = analyser.frequencyBinCount;
const dataArray = new Uint8Array(bufferLength);

function draw() {
  requestAnimationFrame(draw);
  analyser.getByteTimeDomainData(dataArray);
  // Draw waveform to canvas
}
```

WebSocket Testing

```bash
# websocat - WebSocket Swiss Army knife
# Test Thinker-Talker voice pipeline
websocat -v "wss://assist.asimo.io/api/voice/pipeline-ws?token=YOUR_TOKEN"

# wscat - WebSocket cat
npm install -g wscat
# Test chat streaming WebSocket
wscat -c "wss://assist.asimo.io/api/realtime/ws?token=YOUR_TOKEN"
```

Common Error Messages

| Error | Cause | Fix |
| --- | --- | --- |
| NotAllowedError: Permission denied | Microphone blocked | Request permission with user interaction |
| NotFoundError: Device not found | No microphone | Check hardware/drivers |
| AudioContext was not allowed to start | Autoplay policy | Resume after user click |
| WebSocket is closed before connection established | Connection rejected | Check auth, CORS, proxy |
| MediaRecorder: not supported | Browser compatibility | Use audio/webm, add polyfill |

Performance Metrics

| Metric | Target | Alert |
| --- | --- | --- |
| WebSocket latency | < 100ms | > 500ms |
| STT processing time | < 2s | > 5s |
| TTS generation time | < 1s | > 3s |
| Audio capture to response | < 3s | > 7s |

Voice Health Endpoint

Check voice pipeline health:

```bash
# Health check for all voice components
curl http://localhost:8000/health/voice | jq '.'
```

Example response:

```json
{
  "status": "healthy",
  "components": {
    "deepgram": "healthy",
    "openai": "healthy",
    "elevenlabs": "healthy"
  }
}
```