VoiceAssist Docs

Voice & Realtime Debugging

Troubleshooting guide for WebSocket, speech-to-text, text-to-speech, and realtime features

Status: stable · Area: backend · Updated: 2025-12-02 · Audience: human, agent, backend, frontend
Tags: debugging, runbook, voice, realtime, websocket

Voice & Realtime Debugging Guide

Last Updated: 2025-12-02 · Components: Voice pipeline, WebSocket service, STT/TTS


Voice Pipeline Overview

VoiceAssist has two voice pipelines:

| Pipeline | Status | Endpoint | Components |
| --- | --- | --- | --- |
| Thinker-Talker | Primary | /api/voice/pipeline-ws | Deepgram STT → GPT-4o → ElevenLabs TTS |
| OpenAI Realtime API | Legacy/Fallback | /api/realtime | OpenAI Realtime API (WebSocket) |

Always debug Thinker-Talker first unless specifically working with the legacy pipeline.


Part A: Thinker-Talker Voice Pipeline (Primary)

Architecture

┌─────────────┐    ┌─────────────┐    ┌──────────────┐    ┌─────────────┐
│   Browser   │───▶│ Deepgram    │───▶│  GPT-4o      │───▶│ ElevenLabs  │
│ Audio Input │    │ STT Service │    │ Thinker Svc  │    │ TTS Service │
└─────────────┘    └─────────────┘    └──────────────┘    └─────────────┘
       │                  │                  │                  │
       │                  ▼                  ▼                  ▼
       │           transcript.delta    response.delta     audio.output
       │           transcript.complete response.complete
       └───────────────────────────────────────────────────────────────▶
                                WebSocket Messages

Key Files

| File | Purpose |
| --- | --- |
| app/services/voice_pipeline_service.py | Main pipeline orchestrator |
| app/services/thinker_service.py | LLM service (GPT-4o, tool calling) |
| app/services/talker_service.py | TTS service (ElevenLabs streaming) |
| app/services/streaming_stt_service.py | STT service (Deepgram streaming) |
| app/services/sentence_chunker.py | Phrase-level chunking for low latency |
| app/services/thinker_talker_websocket_handler.py | WebSocket handler |
| apps/web-app/src/hooks/useThinkerTalkerSession.ts | Client WebSocket hook |
| apps/web-app/src/hooks/useThinkerTalkerVoiceMode.ts | Voice mode state machine |

WebSocket Message Types

| Message Type | Direction | Description |
| --- | --- | --- |
| audio.input | Client → Server | Base64-encoded PCM audio |
| transcript.delta | Server → Client | Partial transcript from STT |
| transcript.complete | Server → Client | Final transcript |
| response.delta | Server → Client | Streaming LLM token |
| response.complete | Server → Client | Full LLM response |
| audio.output | Server → Client | Base64-encoded TTS audio chunk |
| tool.call | Server → Client | Function/tool invocation |
| tool.result | Server → Client | Tool execution result |
| voice.state | Server → Client | Pipeline state change |
| error | Server → Client | Error notification |
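
A small client-side dispatcher over these message types is often easier to debug with than raw frames in DevTools. The sketch below is illustrative only: the payload field names (`text`, `audio`, `state`, `message`) are assumptions, and the real shapes live in `useThinkerTalkerSession.ts`.

```typescript
// Hypothetical message shapes -- confirm the actual payload fields in useThinkerTalkerSession.ts.
type PipelineMessage =
  | { type: "transcript.delta" | "transcript.complete"; text: string }
  | { type: "response.delta" | "response.complete"; text: string }
  | { type: "audio.output"; audio: string } // base64-encoded TTS chunk
  | { type: "voice.state"; state: string }  // IDLE / LISTENING / PROCESSING / ...
  | { type: "tool.call" | "tool.result"; payload: unknown }
  | { type: "error"; message: string };

function attachDebugLogger(socket: WebSocket): void {
  socket.onmessage = (event: MessageEvent<string>) => {
    const msg = JSON.parse(event.data) as PipelineMessage;
    switch (msg.type) {
      case "transcript.delta":
      case "response.delta":
        // High-volume streaming messages: log compactly.
        console.debug(msg.type, msg.text);
        break;
      case "audio.output":
        console.debug("audio.output chunk,", msg.audio.length, "base64 chars");
        break;
      case "error":
        console.error("pipeline error:", msg.message);
        break;
      default:
        console.log(msg.type, msg);
    }
  };
}
```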

Pipeline States

```python
PipelineState = {
    IDLE,        # Waiting for input
    LISTENING,   # Recording user audio
    PROCESSING,  # Running STT/LLM
    SPEAKING,    # Playing TTS audio
    CANCELLED,   # Barge-in triggered
    ERROR,
}
```
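
On the client, `useThinkerTalkerVoiceMode.ts` tracks a matching state machine. A hedged sketch of how those states might be mirrored and gated in TypeScript (the guard function is illustrative, not the hook's actual API):

```typescript
// Client-side mirror of the server's pipeline states.
type PipelineState =
  | "IDLE"       // waiting for input
  | "LISTENING"  // recording user audio
  | "PROCESSING" // running STT/LLM
  | "SPEAKING"   // playing TTS audio
  | "CANCELLED"  // barge-in triggered
  | "ERROR";

// Illustrative guard: start a new utterance only when idle, or while speaking
// (which becomes a barge-in).
function canStartListening(state: PipelineState): boolean {
  return state === "IDLE" || state === "SPEAKING";
}
```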

Thinker-Talker Debugging

No Transcripts

Likely Causes:

  • Deepgram API key invalid or expired
  • Audio not reaching server
  • Wrong audio format (expects 16kHz PCM16)
  • Deepgram service down

Steps to Investigate:

  1. Check Deepgram health:

```bash
# Check environment variable
echo $DEEPGRAM_API_KEY | head -c 10

# Test Deepgram directly
curl -X POST "https://api.deepgram.com/v1/listen" \
  -H "Authorization: Token $DEEPGRAM_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @test.wav
```

  2. Check server logs for STT errors:

```bash
docker logs voiceassist-server --since "5m" 2>&1 | grep -iE "deepgram|stt|transcri"
```

  3. Verify the audio format in the client (see the resampling sketch after this list):

```javascript
// Should be PCM16 at 16kHz
console.log("Sample rate:", audioContext.sampleRate);
// If 48kHz, ensure resampling is active
```

  4. Check WebSocket messages in browser DevTools → Network → WS.
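
If the capture rate is 48 kHz, the client has to downsample before sending `audio.input`. The sketch below shows the idea, assuming the server expects 16 kHz mono PCM16 in a base64 `audio` field; the actual framing is defined by `useThinkerTalkerSession.ts` and the pipeline service.

```typescript
// Downsample a Float32 mic frame to 16 kHz PCM16 and base64-encode it for audio.input.
// Nearest-sample decimation only -- production code should low-pass filter first.
function toPcm16kBase64(frame: Float32Array, inputRate: number): string {
  const ratio = inputRate / 16000;
  const outLength = Math.floor(frame.length / ratio);
  const pcm = new Int16Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const s = Math.max(-1, Math.min(1, frame[Math.floor(i * ratio)]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  let binary = "";
  const bytes = new Uint8Array(pcm.buffer);
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

// Usage from an audio capture callback (message shape is an assumption):
// socket.send(JSON.stringify({ type: "audio.input", audio: toPcm16kBase64(frame, audioContext.sampleRate) }));
```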

Relevant Code:

  • app/services/streaming_stt_service.py - Deepgram integration

No LLM Response

Likely Causes:

  • OpenAI API key invalid
  • Rate limiting
  • Context too long
  • Tool call hanging

Steps to Investigate:

  1. Check Thinker service logs:

```bash
docker logs voiceassist-server --since "5m" 2>&1 | grep -iE "thinker|openai|llm|gpt"
```

  2. Verify OpenAI API access:

```bash
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 10}'
```

  3. Check for tool call issues:

```bash
docker logs voiceassist-server --since "5m" 2>&1 | grep -iE "tool|function|call"
```

Relevant Code:

  • app/services/thinker_service.py - LLM orchestration
  • app/services/llm_client.py - OpenAI client

No Audio Output

Likely Causes:

  • ElevenLabs API key invalid
  • Voice ID not found
  • TTS service failed
  • Audio not playing in browser (autoplay policy)

Steps to Investigate:

  1. Check ElevenLabs health:

```bash
curl https://api.elevenlabs.io/v1/voices \
  -H "xi-api-key: $ELEVENLABS_API_KEY" | jq '.voices[0].voice_id'
```

  2. Check Talker service logs:

```bash
docker logs voiceassist-server --since "5m" 2>&1 | grep -iE "talker|elevenlabs|tts|audio"
```

  3. Verify the voice ID in config:

```bash
grep -r "voice_id" services/api-gateway/app/core/config.py
# Default: TxGEqnHWrfWFTfGW9XjX (Josh)
```

  4. Check browser autoplay policy (see the playback sketch after this list):

```javascript
// AudioContext must be resumed after user interaction
if (audioContext.state === "suspended") {
  await audioContext.resume();
}
```
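
If the API calls succeed but nothing is audible, confirm the client can actually turn `audio.output` chunks into sound. The sketch below queues chunks for gapless playback, assuming raw PCM16 mono at a fixed sample rate; the real format and rate are decided in `talker_service.py` (and may be MP3, in which case `decodeAudioData` is the right tool).

```typescript
// Decode a base64 audio.output chunk and schedule it back-to-back with earlier chunks.
const OUTPUT_RATE = 24000; // assumption -- match whatever the TTS service actually emits
let playCursor = 0;        // AudioContext time at which the next chunk should start

function enqueueAudioChunk(ctx: AudioContext, b64: string): void {
  const bytes = Uint8Array.from(atob(b64), (c) => c.charCodeAt(0));
  const pcm = new Int16Array(bytes.buffer);
  const floats = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) floats[i] = pcm[i] / 0x8000;

  const buffer = ctx.createBuffer(1, floats.length, OUTPUT_RATE);
  buffer.copyToChannel(floats, 0);

  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);

  // Queue chunks back-to-back so streaming audio doesn't stutter.
  playCursor = Math.max(playCursor, ctx.currentTime);
  source.start(playCursor);
  playCursor += buffer.duration;
}
```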

Relevant Code:

  • app/services/talker_service.py - TTS orchestration
  • app/services/elevenlabs_service.py - ElevenLabs client

Barge-in Not Working

Likely Causes:

  • Barge-in disabled in config
  • Voice Activity Detection (VAD) not triggering
  • Audio overlap prevention issue

Steps to Investigate:

  1. Check config:

```python
# In voice_pipeline_service.py
PipelineConfig:
    barge_in_enabled: True  # Should be True
```

  2. Check VAD sensitivity (a client-side cancel sketch follows this list):

```javascript
// Client-side VAD config
const vadConfig = {
  threshold: 0.5, // Lower = more sensitive
  minSpeechFrames: 3,
};
```

  3. Check logs for barge-in events:

```bash
docker logs voiceassist-server --since "5m" 2>&1 | grep -iE "barge|cancel|interrupt"
```
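
Client-side barge-in boils down to: when VAD fires while the pipeline is SPEAKING, stop local playback and tell the server to cancel. A minimal sketch; the cancel message type and the helper functions are placeholders, and the real behavior lives in `useThinkerTalkerVoiceMode.ts` and `voice_pipeline_service.py`'s `barge_in()`.

```typescript
declare function stopLocalPlayback(): void;          // placeholder: flush queued TTS audio
declare function setPipelineState(s: string): void;  // placeholder: client state setter

// Called by the VAD when speech starts. Only interrupts if TTS is currently playing.
function onVadSpeechStart(socket: WebSocket, state: string): void {
  if (state !== "SPEAKING") return;

  stopLocalPlayback();                                       // silence pending audio immediately
  socket.send(JSON.stringify({ type: "response.cancel" }));  // assumed message name
  setPipelineState("LISTENING");                             // resume capturing the user
}
```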

Relevant Code:

  • app/services/voice_pipeline_service.py - barge_in() method
  • apps/web-app/src/hooks/useThinkerTalkerVoiceMode.ts - Client barge-in

High Latency

Targets:

| Metric | Target | Alert |
| --- | --- | --- |
| STT latency | < 300ms | > 800ms |
| First LLM token | < 500ms | > 1.5s |
| First TTS audio | < 200ms | > 600ms |
| Total (speech-to-speech) | < 1.2s | > 3s |

Steps to Investigate:

  1. Check pipeline metrics:

```bash
curl http://localhost:8000/api/voice/metrics | jq '.'
```

  2. Check sentence chunker config:

```python
# In sentence_chunker.py - phrase-level for low latency
ChunkerConfig:
    min_chunk_chars: 15      # Avoid tiny fragments
    optimal_chunk_chars: 50  # Clause boundary
    max_chunk_chars: 80      # Force split limit
```

  3. Enable debug logging:

```bash
export VOICE_LOG_LEVEL=DEBUG
docker restart voiceassist-server
```
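
For a client-side view of where the time goes, timestamping pipeline messages against the end of user speech is usually enough to tell STT, LLM, and TTS apart. A sketch, using the message types listed earlier:

```typescript
// Record millisecond offsets from end-of-speech to the first STT, LLM, and TTS events.
const marks: Record<string, number> = {};

function onSpeechEnd(): void {
  marks.speechEnd = performance.now();
}

function onPipelineMessage(type: string): void {
  if (!marks.speechEnd) return;
  const elapsed = performance.now() - marks.speechEnd;
  if (type === "transcript.complete" && marks.stt === undefined) {
    marks.stt = elapsed;          // roughly the STT latency
  } else if (type === "response.delta" && marks.firstToken === undefined) {
    marks.firstToken = elapsed;   // first LLM token after end of speech
  } else if (type === "audio.output" && marks.firstAudio === undefined) {
    marks.firstAudio = elapsed;   // first audio back ≈ speech-to-speech latency
    console.table(marks);         // compare against the targets above
  }
}
```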

Part B: Legacy OpenAI Realtime API (Fallback)

Note: This pipeline is maintained for backward compatibility. Prefer Thinker-Talker for new development.

Key Files

| File | Purpose |
| --- | --- |
| app/api/realtime.py | Legacy WebSocket endpoint |
| app/services/realtime_voice_service.py | OpenAI Realtime integration |
| apps/web-app/src/hooks/useRealtimeVoiceSession.ts | Legacy client hook |

Legacy Debugging

For OpenAI Realtime API issues:

  • Refer to the OpenAI Realtime API documentation
  • Check the OPENAI_API_KEY environment variable
  • Verify the WebSocket connection to /api/realtime

Part C: Common Issues (Both Pipelines)

Symptoms

WebSocket Won't Connect

Likely Causes:

  • CORS blocking WebSocket upgrade
  • Wrong WebSocket URL (ws vs wss)
  • Proxy not forwarding upgrade headers
  • Auth token invalid

Steps to Investigate:

  1. Check the browser console for errors:

```
WebSocket connection to 'wss://...' failed
```

  2. Verify the WebSocket URL:

```javascript
// Thinker-Talker voice pipeline (primary)
const voiceWsUrl = `wss://assist.asimo.io/api/voice/pipeline-ws?token=${accessToken}`;

// Chat streaming
const chatWsUrl = `wss://assist.asimo.io/api/realtime/ws?token=${accessToken}`;
```

  3. Test the WebSocket connection manually:

```bash
# Test Thinker-Talker voice pipeline (primary)
websocat "wss://assist.asimo.io/api/voice/pipeline-ws?token=YOUR_TOKEN"

# Test chat streaming WebSocket
wscat -c "wss://assist.asimo.io/api/realtime/ws?token=YOUR_TOKEN"
```

  4. Check the Apache/Nginx proxy config:

```apache
# WebSocket proxy for API endpoints
ProxyPass /api/voice/pipeline-ws ws://127.0.0.1:8000/api/voice/pipeline-ws
ProxyPassReverse /api/voice/pipeline-ws ws://127.0.0.1:8000/api/voice/pipeline-ws
ProxyPass /api/realtime/ws ws://127.0.0.1:8000/api/realtime/ws
ProxyPassReverse /api/realtime/ws ws://127.0.0.1:8000/api/realtime/ws

# WebSocket upgrade headers
RewriteCond %{HTTP:Upgrade} websocket [NC]
RewriteCond %{HTTP:Connection} upgrade [NC]
RewriteRule ^/api/(.*)$ ws://127.0.0.1:8000/api/$1 [P,L]
```
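
To separate a blocked upgrade (CORS/proxy) from an auth rejection, a throwaway connection from the browser console with close-code logging is often enough (assumes a valid `accessToken` is in scope):

```typescript
// Smoke-test the WebSocket upgrade path from the browser console.
const ws = new WebSocket(`wss://assist.asimo.io/api/voice/pipeline-ws?token=${accessToken}`);
ws.onopen = () => console.log("upgrade OK");
ws.onerror = (e) => console.error("socket error (check proxy/CORS):", e);
ws.onclose = (e) => console.log("closed:", e.code, e.reason || "(no reason)");
// An immediate close with code 1006 and no "upgrade OK" usually means the upgrade was
// blocked (proxy/CORS); a close shortly after opening more often points at auth.
```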

Relevant Code Paths:

  • services/api-gateway/app/api/websocket.py - WebSocket endpoint
  • apps/web-app/src/services/websocket/ - Client connection
  • Apache config: /etc/apache2/sites-available/

WebSocket Disconnects Frequently

Likely Causes:

  • Idle timeout (30-60s default)
  • Network instability
  • Server restarting
  • Memory pressure on server

Steps to Investigate:

  1. Check the disconnect reason:

```javascript
socket.onclose = (event) => {
  console.log("Close code:", event.code);
  console.log("Close reason:", event.reason);
};
// 1000 = normal, 1001 = going away, 1006 = abnormal
```

  2. Check server logs for connection drops:

```bash
docker logs voiceassist-server --since "10m" 2>&1 | grep -i "websocket\|disconnect"
```

  3. Implement a heartbeat/ping:

```javascript
// Client side
setInterval(() => {
  if (socket.readyState === WebSocket.OPEN) {
    socket.send(JSON.stringify({ type: "ping" }));
  }
}, 30000);
```

  4. Check proxy timeouts:

```apache
ProxyTimeout 300
# Or
ProxyBadHeader Ignore
```
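
If occasional drops are unavoidable (mobile networks, proxy restarts, deploys), the client should reconnect with backoff instead of surfacing every drop. A minimal sketch; the project's actual reconnection logic is in `WebSocketService.ts`:

```typescript
// Reconnect with capped exponential backoff plus jitter.
function connectWithBackoff(url: string, attempt = 0): WebSocket {
  const ws = new WebSocket(url);

  ws.onopen = () => {
    attempt = 0; // connection is healthy again: reset the backoff window
  };

  ws.onclose = (event) => {
    if (event.code === 1000) return; // normal closure -- don't reconnect
    const delay = Math.min(30_000, 1_000 * 2 ** attempt) + Math.random() * 500;
    setTimeout(() => connectWithBackoff(url, attempt + 1), delay);
  };

  return ws;
}
```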

Relevant Code Paths:

  • services/api-gateway/app/api/websocket.py - Connection handling
  • apps/web-app/src/services/websocket/WebSocketService.ts - Reconnection logic

Audio Not Recording

Likely Causes:

  • Browser permission denied
  • MediaRecorder not supported
  • Wrong audio format
  • AudioContext suspended

Steps to Investigate:

  1. Check browser permissions:

```javascript
const permission = await navigator.permissions.query({ name: "microphone" });
console.log("Microphone permission:", permission.state);
// 'granted', 'denied', or 'prompt'
```

  2. Request microphone access:

```javascript
try {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  console.log("Got audio stream:", stream);
} catch (err) {
  console.error("Microphone error:", err.name, err.message);
}
```

  3. Check MediaRecorder support:

```javascript
console.log("MediaRecorder supported:", typeof MediaRecorder !== "undefined");
console.log("Supported MIME types:");
["audio/webm", "audio/mp4", "audio/ogg"].forEach((type) => {
  console.log(type, MediaRecorder.isTypeSupported(type));
});
```

  4. Resume the AudioContext (required after user interaction):

```javascript
const audioContext = new AudioContext();
if (audioContext.state === "suspended") {
  await audioContext.resume();
}
```

Relevant Code Paths:

  • apps/web-app/src/services/voice/VoiceRecorder.ts
  • apps/web-app/src/hooks/useVoiceInput.ts

Speech-to-Text (STT) Not Working

Likely Causes:

  • Audio format not supported
  • STT service down
  • API key invalid
  • Audio too quiet/noisy

Steps to Investigate:

  1. Check STT service logs:

```bash
docker logs voiceassist-server --since "5m" 2>&1 | grep -i "stt\|transcri\|whisper"
```

  2. Verify audio is being sent:

```javascript
// Log audio blob details
console.log("Audio blob:", blob.size, blob.type);
// Should be > 0 bytes and correct MIME type
```

  3. Test STT directly:

```bash
# Test OpenAI Whisper API
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@test.mp3 \
  -F model=whisper-1
```

  4. Check audio quality:

```javascript
// Analyze input level: connect the mic stream (from getUserMedia above) to an analyser
const analyser = audioContext.createAnalyser();
audioContext.createMediaStreamSource(stream).connect(analyser);
const data = new Uint8Array(analyser.fftSize);
analyser.getByteTimeDomainData(data);
// Values hugging 128 mean near-silence; check gain and mic selection
const peak = Math.max(...data.map((v) => Math.abs(v - 128)));
console.log("Peak amplitude (0-128):", peak);
```

Relevant Code Paths:

  • services/api-gateway/app/services/stt_service.py
  • services/api-gateway/app/api/voice.py

Text-to-Speech (TTS) Not Playing

Likely Causes:

  • AudioContext suspended (autoplay policy)
  • Audio element not connected
  • TTS API failure
  • Wrong audio format/codec

Steps to Investigate:

  1. Check the AudioContext state:

```javascript
console.log("AudioContext state:", audioContext.state);
// Should be 'running', not 'suspended'
```

  2. Resume after user interaction:

```javascript
button.onclick = async () => {
  if (audioContext.state === "suspended") {
    await audioContext.resume();
  }
  playTTS();
};
```

  3. Check the TTS service:

```bash
# Test OpenAI TTS API
curl https://api.openai.com/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello", "voice": "alloy"}' \
  --output test.mp3
```

  4. Verify audio playback:

```javascript
const audio = new Audio();
audio.src = URL.createObjectURL(audioBlob);
audio.oncanplay = () => console.log("Audio can play");
audio.onerror = (e) => console.error("Audio error:", e);
await audio.play();
```

Relevant Code Paths:

  • services/api-gateway/app/services/tts_service.py
  • apps/web-app/src/services/voice/TTSPlayer.ts

Voice Activity Detection (VAD) Issues

Likely Causes:

  • Threshold too high/low
  • Background noise
  • Wrong sample rate
  • VAD model not loaded

Steps to Investigate:

  1. Check the VAD configuration:

```javascript
const vadConfig = {
  threshold: 0.5,         // Adjust based on noise level
  minSpeechFrames: 3,     // Minimum frames to trigger
  preSpeechPadFrames: 10,
  redemptionFrames: 8,
};
```

  2. Visualize audio levels:

```javascript
// Use canvas to show real-time levels
const draw = () => {
  analyser.getByteFrequencyData(dataArray);
  // Draw to canvas
  requestAnimationFrame(draw);
};
```

  3. Check the sample rate:

```javascript
console.log("AudioContext sample rate:", audioContext.sampleRate);
// VAD typically expects 16000 Hz
```

Relevant Code Paths:

  • apps/web-app/src/services/voice/VADService.ts
  • apps/web-app/src/utils/vad.ts

Debugging Tools

Browser DevTools

Monitor WebSocket traffic: DevTools → Network → WS → select the connection → Messages.

Audio Debugging

```javascript
// Create audio visualizer
const analyser = audioContext.createAnalyser();
analyser.fftSize = 2048;
const bufferLength = analyser.frequencyBinCount;
const dataArray = new Uint8Array(bufferLength);

function draw() {
  requestAnimationFrame(draw);
  analyser.getByteTimeDomainData(dataArray);
  // Draw waveform to canvas
}
```

WebSocket Testing

```bash
# websocat - WebSocket Swiss Army knife
# Test Thinker-Talker voice pipeline
websocat -v "wss://assist.asimo.io/api/voice/pipeline-ws?token=YOUR_TOKEN"

# wscat - WebSocket cat
npm install -g wscat
# Test chat streaming WebSocket
wscat -c "wss://assist.asimo.io/api/realtime/ws?token=YOUR_TOKEN"
```

Common Error Messages

| Error | Cause | Fix |
| --- | --- | --- |
| NotAllowedError: Permission denied | Microphone blocked | Request permission with user interaction |
| NotFoundError: Device not found | No microphone | Check hardware/drivers |
| AudioContext was not allowed to start | Autoplay policy | Resume after user click |
| WebSocket is closed before connection established | Connection rejected | Check auth, CORS, proxy |
| MediaRecorder: not supported | Browser compatibility | Use audio/webm, add polyfill |

Performance Metrics

| Metric | Target | Alert |
| --- | --- | --- |
| WebSocket latency | < 100ms | > 500ms |
| STT processing time | < 2s | > 5s |
| TTS generation time | < 1s | > 3s |
| Audio capture to response | < 3s | > 7s |

Voice Health Endpoint

Check voice pipeline health:

```bash
# Health check for all voice components
curl http://localhost:8000/health/voice | jq '.'
```

Example response:

```json
{
  "status": "healthy",
  "components": {
    "deepgram": "healthy",
    "openai": "healthy",
    "elevenlabs": "healthy"
  }
}
```