# Voice & Realtime Debugging Guide

**Last Updated:** 2025-12-02
**Components:** Voice pipeline, WebSocket service, STT/TTS

## Voice Pipeline Overview
VoiceAssist has two voice pipelines:
| Pipeline | Status | Endpoint | Components |
|---|---|---|---|
| Thinker-Talker | Primary | `/api/voice/pipeline-ws` | Deepgram STT → GPT-4o → ElevenLabs TTS |
| OpenAI Realtime API | Legacy/Fallback | `/api/realtime` | OpenAI Realtime API (WebSocket) |
Always debug Thinker-Talker first unless specifically working with the legacy pipeline.
## Part A: Thinker-Talker Voice Pipeline (Primary)

### Architecture
```
┌─────────────┐    ┌─────────────┐    ┌──────────────┐    ┌─────────────┐
│   Browser   │───▶│  Deepgram   │───▶│   GPT-4o     │───▶│ ElevenLabs  │
│ Audio Input │    │ STT Service │    │ Thinker Svc  │    │ TTS Service │
└─────────────┘    └─────────────┘    └──────────────┘    └─────────────┘
       ▲                  │                   │                  │
       │                  ▼                   ▼                  ▼
       │          transcript.delta      response.delta      audio.output
       │          transcript.complete   response.complete
       └───────────────────── WebSocket Messages ◀──────────────────────┘
```
### Key Files

| File | Purpose |
|---|---|
| `app/services/voice_pipeline_service.py` | Main pipeline orchestrator |
| `app/services/thinker_service.py` | LLM service (GPT-4o, tool calling) |
| `app/services/talker_service.py` | TTS service (ElevenLabs streaming) |
| `app/services/streaming_stt_service.py` | STT service (Deepgram streaming) |
| `app/services/sentence_chunker.py` | Phrase-level chunking for low latency |
| `app/services/thinker_talker_websocket_handler.py` | WebSocket handler |
| `apps/web-app/src/hooks/useThinkerTalkerSession.ts` | Client WebSocket hook |
| `apps/web-app/src/hooks/useThinkerTalkerVoiceMode.ts` | Voice mode state machine |
### WebSocket Message Types

| Message Type | Direction | Description |
|---|---|---|
| `audio.input` | Client→Server | Base64-encoded PCM audio |
| `transcript.delta` | Server→Client | Partial transcript from STT |
| `transcript.complete` | Server→Client | Final transcript |
| `response.delta` | Server→Client | Streaming LLM token |
| `response.complete` | Server→Client | Full LLM response |
| `audio.output` | Server→Client | Base64-encoded TTS audio chunk |
| `tool.call` | Server→Client | Function/tool invocation |
| `tool.result` | Server→Client | Tool execution result |
| `voice.state` | Server→Client | Pipeline state change |
| `error` | Server→Client | Error notification |
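While watching the WS tab, it helps to log these messages in one place. Below is a minimal client-side dispatcher over the message types above; the payload field names (`text`, `audio`, `state`, `message`) are assumptions for illustration and should be checked against `thinker_talker_websocket_handler.py`:

```typescript
// Hypothetical envelope; the real field names may differ per message type.
interface PipelineMessage {
  type: string;
  text?: string;    // transcript.* / response.*
  audio?: string;   // base64 audio for audio.output
  state?: string;   // voice.state
  message?: string; // error
}

const accessToken = "YOUR_TOKEN"; // replace with a valid token
const socket = new WebSocket(
  `wss://assist.asimo.io/api/voice/pipeline-ws?token=${accessToken}`
);

socket.onmessage = (event: MessageEvent<string>) => {
  const msg: PipelineMessage = JSON.parse(event.data);
  switch (msg.type) {
    case "transcript.delta":
    case "transcript.complete":
      console.log("[STT]", msg.type, msg.text);
      break;
    case "response.delta":
    case "response.complete":
      console.log("[LLM]", msg.type, msg.text);
      break;
    case "audio.output":
      console.log("[TTS] chunk bytes:", msg.audio ? atob(msg.audio).length : 0);
      break;
    case "voice.state":
      console.log("[STATE]", msg.state);
      break;
    case "error":
      console.error("[PIPELINE ERROR]", msg.message);
      break;
  }
};
```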
### Pipeline States

```python
PipelineState = {
    IDLE,        # Waiting for input
    LISTENING,   # Recording user audio
    PROCESSING,  # Running STT/LLM
    SPEAKING,    # Playing TTS audio
    CANCELLED,   # Barge-in triggered
    ERROR,
}
```
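Client code can mirror these states to sanity-check `voice.state` sequences in the console. The transition map below is illustrative only, inferred from the state names rather than taken from `voice_pipeline_service.py`:

```typescript
type PipelineState =
  | "IDLE" | "LISTENING" | "PROCESSING" | "SPEAKING" | "CANCELLED" | "ERROR";

// Illustrative transition map for sanity-checking voice.state sequences in logs.
const expectedNext: Record<PipelineState, PipelineState[]> = {
  IDLE: ["LISTENING"],
  LISTENING: ["PROCESSING", "IDLE"],
  PROCESSING: ["SPEAKING", "CANCELLED", "ERROR"],
  SPEAKING: ["IDLE", "CANCELLED"],
  CANCELLED: ["LISTENING", "IDLE"],
  ERROR: ["IDLE"],
};

function checkTransition(from: PipelineState, to: PipelineState): void {
  if (!expectedNext[from].includes(to)) {
    console.warn(`Unexpected pipeline transition: ${from} -> ${to}`);
  }
}
```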
### Thinker-Talker Debugging

#### No Transcripts

**Likely Causes:**
- Deepgram API key invalid or expired
- Audio not reaching server
- Wrong audio format (expects 16kHz PCM16)
- Deepgram service down
**Steps to Investigate:**
- Check Deepgram health:
  ```bash
  # Check environment variable
  echo $DEEPGRAM_API_KEY | head -c 10

  # Test Deepgram directly
  curl -X POST "https://api.deepgram.com/v1/listen" \
    -H "Authorization: Token $DEEPGRAM_API_KEY" \
    -H "Content-Type: audio/wav" \
    --data-binary @test.wav
  ```
- Check server logs for STT errors:
  ```bash
  docker logs voiceassist-server --since "5m" 2>&1 | grep -iE "deepgram|stt|transcri"
  ```
- Verify audio format in client (see the resampling sketch at the end of this section):

  ```javascript
  // Should be PCM16 at 16kHz
  console.log("Sample rate:", audioContext.sampleRate);
  // If 48kHz, ensure resampling is active
  ```
- Check WebSocket messages in browser DevTools → Network → WS.
**Relevant Code:**

- `app/services/streaming_stt_service.py` - Deepgram integration
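If the capture context runs at 48 kHz (the browser default on most hardware), audio must be downsampled to 16 kHz PCM16 before it is sent as `audio.input`. A minimal sketch using naive decimation, good enough for debugging (the production recorder presumably does filtered resampling); the `audio` field name in the usage comment is an assumption:

```typescript
// Downsample Float32 samples to 16 kHz and encode as base64 little-endian PCM16.
// Naive decimation (no anti-aliasing filter) -- for debugging only.
function toPcm16Base64(input: Float32Array, inputRate: number, targetRate = 16000): string {
  const ratio = inputRate / targetRate;
  const outLength = Math.floor(input.length / ratio);
  const pcm = new Int16Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const s = Math.max(-1, Math.min(1, input[Math.floor(i * ratio)]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  // Base64-encode the raw PCM16 bytes.
  const bytes = new Uint8Array(pcm.buffer);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

// Usage (the "audio" field name is hypothetical -- check the server handler):
// socket.send(JSON.stringify({ type: "audio.input", audio: toPcm16Base64(chunk, 48000) }));
```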
#### No LLM Response

**Likely Causes:**
- OpenAI API key invalid
- Rate limiting
- Context too long
- Tool call hanging
**Steps to Investigate:**
- Check Thinker service logs:
  ```bash
  docker logs voiceassist-server --since "5m" 2>&1 | grep -iE "thinker|openai|llm|gpt"
  ```
- Verify OpenAI API:
  ```bash
  curl https://api.openai.com/v1/chat/completions \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 10}'
  ```
- Check for tool call issues:
  ```bash
  docker logs voiceassist-server --since "5m" 2>&1 | grep -iE "tool|function|call"
  ```
**Relevant Code:**

- `app/services/thinker_service.py` - LLM orchestration
- `app/services/llm_client.py` - OpenAI client
#### No Audio Output

**Likely Causes:**
- ElevenLabs API key invalid
- Voice ID not found
- TTS service failed
- Audio not playing in browser (autoplay policy)
**Steps to Investigate:**
- Check ElevenLabs health:
  ```bash
  curl https://api.elevenlabs.io/v1/voices \
    -H "xi-api-key: $ELEVENLABS_API_KEY" | jq '.voices[0].voice_id'
  ```
- Check Talker service logs:
  ```bash
  docker logs voiceassist-server --since "5m" 2>&1 | grep -iE "talker|elevenlabs|tts|audio"
  ```
- Verify voice ID in config:
  ```bash
  grep -r "voice_id" services/api-gateway/app/core/config.py
  # Default: TxGEqnHWrfWFTfGW9XjX (Josh)
  ```
- Check browser autoplay:
  ```javascript
  // AudioContext must be resumed after user interaction
  if (audioContext.state === "suspended") {
    await audioContext.resume();
  }
  ```
**Relevant Code:**

- `app/services/talker_service.py` - TTS orchestration
- `app/services/elevenlabs_service.py` - ElevenLabs client
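When ElevenLabs responds but nothing is audible, playing `audio.output` chunks through a bare-bones scheduler isolates the problem to either the network or the playback path. A sketch, assuming chunks are base64 little-endian PCM16 at 16 kHz; verify the actual format against `talker_service.py` before trusting the numbers:

```typescript
const ctx = new AudioContext();
let playhead = 0; // next scheduled start time, in AudioContext seconds

// Decode one base64 PCM16 chunk and schedule it gaplessly after the previous one.
function playChunk(base64Audio: string, sampleRate = 16000): void {
  const binary = atob(base64Audio);
  const pcm = new Int16Array(binary.length / 2);
  for (let i = 0; i < pcm.length; i++) {
    // little-endian: low byte first
    pcm[i] = (binary.charCodeAt(2 * i + 1) << 8) | binary.charCodeAt(2 * i);
  }

  const buffer = ctx.createBuffer(1, pcm.length, sampleRate);
  const channel = buffer.getChannelData(0);
  for (let i = 0; i < pcm.length; i++) channel[i] = pcm[i] / 0x8000;

  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);
  playhead = Math.max(playhead, ctx.currentTime);
  source.start(playhead);
  playhead += buffer.duration;
}
```

Remember to call `ctx.resume()` from a user gesture first, or the scheduler will run silently under the autoplay policy.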
#### Barge-in Not Working

**Likely Causes:**
- Barge-in disabled in config
- Voice Activity Detection (VAD) not triggering
- Audio overlap prevention issue
**Steps to Investigate:**
- Check config:
  ```python
  # In voice_pipeline_service.py
  PipelineConfig:
      barge_in_enabled: True  # Should be True
  ```
- Check VAD sensitivity:
  ```javascript
  // Client-side VAD config
  const vadConfig = {
    threshold: 0.5,     // Lower = more sensitive
    minSpeechFrames: 3,
  };
  ```
- Check logs for barge-in events:
  ```bash
  docker logs voiceassist-server --since "5m" 2>&1 | grep -iE "barge|cancel|interrupt"
  ```
**Relevant Code:**

- `app/services/voice_pipeline_service.py` - `barge_in()` method
- `apps/web-app/src/hooks/useThinkerTalkerVoiceMode.ts` - Client barge-in
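To confirm the client reacts at all, a stripped-down barge-in handler can be wired directly to the VAD: stop local playback the moment speech starts while the assistant is speaking, then tell the server. The `voice.cancel` message name below is a placeholder, not a documented type; use whatever cancel/interrupt message `thinker_talker_websocket_handler.py` actually accepts:

```typescript
// Hypothetical client-side barge-in hook; "voice.cancel" is a placeholder name.
function onSpeechStart(state: string, socket: WebSocket, ctx: AudioContext): void {
  if (state !== "SPEAKING") return; // only barge in while TTS is playing

  // 1. Silence local playback immediately so the interruption feels instant.
  void ctx.suspend();

  // 2. Ask the server to cancel the in-flight response (placeholder message type).
  socket.send(JSON.stringify({ type: "voice.cancel" }));

  console.log("Barge-in sent; expect a voice.state change to CANCELLED");
}
```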
#### High Latency

**Targets:**
| Metric | Target | Alert |
|---|---|---|
| STT latency | < 300ms | > 800ms |
| First LLM token | < 500ms | > 1.5s |
| First TTS audio | < 200ms | > 600ms |
| Total (speech-to-speech) | < 1.2s | > 3s |
**Steps to Investigate:**
- Check pipeline metrics:
  ```bash
  curl http://localhost:8000/api/voice/metrics | jq '.'
  ```
- Check sentence chunker config:
  ```python
  # In sentence_chunker.py - phrase-level for low latency
  ChunkerConfig:
      min_chunk_chars: 15      # Avoid tiny fragments
      optimal_chunk_chars: 50  # Clause boundary
      max_chunk_chars: 80      # Force split limit
  ```
- Enable debug logging:
  ```bash
  export VOICE_LOG_LEVEL=DEBUG
  docker restart voiceassist-server
  ```
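When the metrics endpoint is not granular enough, per-stage timings can be read straight off the message stream in the browser and compared against the targets above. A sketch, assuming the message types from Part A:

```typescript
// Client-side latency breakdown taken from WebSocket message timestamps.
const accessToken = "YOUR_TOKEN"; // replace with a valid token
const socket = new WebSocket(
  `wss://assist.asimo.io/api/voice/pipeline-ws?token=${accessToken}`
);

const marks: { sttDone?: number; firstToken?: number; firstAudio?: number } = {};

socket.addEventListener("message", (event: MessageEvent<string>) => {
  const { type } = JSON.parse(event.data);
  const now = performance.now();

  if (type === "transcript.complete" && !marks.sttDone) marks.sttDone = now;
  if (type === "response.delta" && !marks.firstToken) marks.firstToken = now;
  if (type === "audio.output" && !marks.firstAudio && marks.sttDone && marks.firstToken) {
    marks.firstAudio = now;
    console.table({
      "final transcript -> first LLM token (ms)": marks.firstToken - marks.sttDone,
      "final transcript -> first TTS audio (ms)": marks.firstAudio - marks.sttDone,
    });
  }
});
```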
## Part B: Legacy OpenAI Realtime API (Fallback)

**Note:** This pipeline is maintained for backward compatibility. Prefer Thinker-Talker for new development.
### Key Files

| File | Purpose |
|---|---|
| `app/api/realtime.py` | Legacy WebSocket endpoint |
| `app/services/realtime_voice_service.py` | OpenAI Realtime integration |
| `apps/web-app/src/hooks/useRealtimeVoiceSession.ts` | Legacy client hook |
### Legacy Debugging

For OpenAI Realtime API issues:

- Refer to the OpenAI Realtime API documentation
- Check the `OPENAI_API_KEY` environment variable
- Verify the WebSocket connection to `/api/realtime`
## Part C: Common Issues (Both Pipelines)

### Symptoms
#### WebSocket Won't Connect

**Likely Causes:**
- CORS blocking WebSocket upgrade
- Wrong WebSocket URL (ws vs wss)
- Proxy not forwarding upgrade headers
- Auth token invalid
**Steps to Investigate:**
- Check browser console for errors:
  ```
  WebSocket connection to 'wss://...' failed
  ```
- Verify WebSocket URL:
  ```javascript
  // Thinker-Talker voice pipeline (primary)
  const voiceWsUrl = `wss://assist.asimo.io/api/voice/pipeline-ws?token=${accessToken}`;

  // Chat streaming
  const chatWsUrl = `wss://assist.asimo.io/api/realtime/ws?token=${accessToken}`;
  ```
- Test WebSocket connection manually:
  ```bash
  # Test Thinker-Talker voice pipeline (primary)
  websocat "wss://assist.asimo.io/api/voice/pipeline-ws?token=YOUR_TOKEN"

  # Test chat streaming WebSocket
  wscat -c "wss://assist.asimo.io/api/realtime/ws?token=YOUR_TOKEN"
  ```
- Check Apache/Nginx proxy config:
  ```apache
  # WebSocket proxy for API endpoints
  ProxyPass /api/voice/pipeline-ws ws://127.0.0.1:8000/api/voice/pipeline-ws
  ProxyPassReverse /api/voice/pipeline-ws ws://127.0.0.1:8000/api/voice/pipeline-ws
  ProxyPass /api/realtime/ws ws://127.0.0.1:8000/api/realtime/ws
  ProxyPassReverse /api/realtime/ws ws://127.0.0.1:8000/api/realtime/ws

  # WebSocket upgrade headers
  RewriteCond %{HTTP:Upgrade} websocket [NC]
  RewriteCond %{HTTP:Connection} upgrade [NC]
  RewriteRule ^/api/(.*)$ ws://127.0.0.1:8000/api/$1 [P,L]
  ```
**Relevant Code Paths:**

- `services/api-gateway/app/api/websocket.py` - WebSocket endpoint
- `apps/web-app/src/services/websocket/` - Client connection
- Apache config: `/etc/apache2/sites-available/`
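For a quick in-browser diagnosis without CLI tools, the sketch below opens the socket and logs every lifecycle event; a close code of 1006 with no preceding `open` usually points at the proxy or TLS layer rather than the application:

```typescript
const accessToken = "YOUR_TOKEN"; // replace with a valid token
const ws = new WebSocket(
  `wss://assist.asimo.io/api/voice/pipeline-ws?token=${accessToken}`
);

ws.onopen = () => console.log("open: upgrade succeeded");
ws.onerror = (e) => console.error("error event (details are in the Network tab):", e);
ws.onclose = (e) => console.log("close:", e.code, e.reason || "(no reason)");
// 1000 = normal, 1006 = abnormal (never connected, or dropped without a close frame)
```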
#### WebSocket Disconnects Frequently

**Likely Causes:**
- Idle timeout (30-60s default)
- Network instability
- Server restarting
- Memory pressure on server
**Steps to Investigate:**
- Check disconnect reason:
  ```javascript
  socket.onclose = (event) => {
    console.log("Close code:", event.code);
    console.log("Close reason:", event.reason);
  };
  // 1000 = normal, 1001 = going away, 1006 = abnormal
  ```
- Check server logs for connection drops:
  ```bash
  docker logs voiceassist-server --since "10m" 2>&1 | grep -i "websocket\|disconnect"
  ```
- Implement heartbeat/ping:
  ```javascript
  // Client side
  setInterval(() => {
    if (socket.readyState === WebSocket.OPEN) {
      socket.send(JSON.stringify({ type: "ping" }));
    }
  }, 30000);
  ```
- Check proxy timeouts:
  ```apache
  ProxyTimeout 300
  # Or ProxyBadHeader Ignore
  ```
**Relevant Code Paths:**

- `services/api-gateway/app/api/websocket.py` - Connection handling
- `apps/web-app/src/services/websocket/WebSocketService.ts` - Reconnection logic
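If occasional drops are unavoidable (mobile networks, rolling restarts), reconnection with exponential backoff keeps sessions usable. A sketch of the pattern; `WebSocketService.ts` owns the real logic and may differ in details:

```typescript
// Exponential backoff reconnect: 1s, 2s, 4s, ... capped at 30s.
function connectWithRetry(url: string, attempt = 0): void {
  const ws = new WebSocket(url);

  ws.onopen = () => {
    attempt = 0; // reset backoff once a healthy connection is established
    console.log("connected");
  };

  ws.onclose = (e) => {
    const delay = Math.min(1000 * 2 ** attempt, 30000);
    console.log(`closed (${e.code}); reconnecting in ${delay}ms`);
    setTimeout(() => connectWithRetry(url, attempt + 1), delay);
  };
}
```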
#### Audio Not Recording

**Likely Causes:**
- Browser permission denied
- MediaRecorder not supported
- Wrong audio format
- AudioContext suspended
**Steps to Investigate:**
- Check browser permissions:
  ```javascript
  const permission = await navigator.permissions.query({ name: "microphone" });
  console.log("Microphone permission:", permission.state);
  // 'granted', 'denied', or 'prompt'
  ```
- Request microphone access:
  ```javascript
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    console.log("Got audio stream:", stream);
  } catch (err) {
    console.error("Microphone error:", err.name, err.message);
  }
  ```
- Check MediaRecorder support:
console.log("MediaRecorder supported:", typeof MediaRecorder !== "undefined"); console.log("Supported MIME types:"); ["audio/webm", "audio/mp4", "audio/ogg"].forEach((type) => { console.log(type, MediaRecorder.isTypeSupported(type)); });
- Resume AudioContext (required after user interaction):
  ```javascript
  const audioContext = new AudioContext();
  if (audioContext.state === "suspended") {
    await audioContext.resume();
  }
  ```
**Relevant Code Paths:**

- `apps/web-app/src/services/voice/VoiceRecorder.ts`
- `apps/web-app/src/hooks/useVoiceInput.ts`
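Putting the checks together, a minimal end-to-end capture smoke test exercises permission, recording, and data flow in one call. Run it from a click handler so the permission prompt and AudioContext are allowed:

```typescript
// Minimal capture smoke test: records 2 seconds and reports the blob size.
async function captureSmokeTest(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  // "audio/webm" may be unsupported on Safari; check isTypeSupported first.
  const recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
  const chunks: Blob[] = [];

  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = () => {
    const blob = new Blob(chunks, { type: "audio/webm" });
    console.log("Recorded blob:", blob.size, "bytes"); // should be > 0
    stream.getTracks().forEach((t) => t.stop()); // release the microphone
  };

  recorder.start();
  setTimeout(() => recorder.stop(), 2000);
}
```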
#### Speech-to-Text (STT) Not Working

**Likely Causes:**
- Audio format not supported
- STT service down
- API key invalid
- Audio too quiet/noisy
**Steps to Investigate:**
- Check STT service logs:
  ```bash
  docker logs voiceassist-server --since "5m" 2>&1 | grep -i "stt\|transcri\|whisper"
  ```
- Verify audio is being sent:
  ```javascript
  // Log audio blob details
  console.log("Audio blob:", blob.size, blob.type);
  // Should be > 0 bytes and correct MIME type
  ```
- Test STT directly:
  ```bash
  # Test OpenAI Whisper API
  curl https://api.openai.com/v1/audio/transcriptions \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -F file=@test.mp3 \
    -F model=whisper-1
  ```
- Check audio quality:
  ```javascript
  // Analyze audio levels: connect the microphone stream to an analyser
  const analyser = audioContext.createAnalyser();
  audioContext.createMediaStreamSource(stream).connect(analyser);
  const data = new Uint8Array(analyser.fftSize);
  analyser.getByteTimeDomainData(data);
  // Peak deviation from the 128 midpoint; near 0 means the audio is silent
  console.log("Peak amplitude:", Math.max(...data.map((v) => Math.abs(v - 128))));
  ```
**Relevant Code Paths:**

- `services/api-gateway/app/services/stt_service.py`
- `services/api-gateway/app/api/voice.py`
#### Text-to-Speech (TTS) Not Playing

**Likely Causes:**
- AudioContext suspended (autoplay policy)
- Audio element not connected
- TTS API failure
- Wrong audio format/codec
**Steps to Investigate:**
- Check AudioContext state:
console.log("AudioContext state:", audioContext.state); // Should be 'running', not 'suspended'
- Resume after user interaction:
  ```javascript
  button.onclick = async () => {
    if (audioContext.state === "suspended") {
      await audioContext.resume();
    }
    playTTS();
  };
  ```
- Check TTS service:
  ```bash
  # Test OpenAI TTS API
  curl https://api.openai.com/v1/audio/speech \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "tts-1", "input": "Hello", "voice": "alloy"}' \
    --output test.mp3
  ```
- Verify audio playback:
  ```javascript
  const audio = new Audio();
  audio.src = URL.createObjectURL(audioBlob);
  audio.oncanplay = () => console.log("Audio can play");
  audio.onerror = (e) => console.error("Audio error:", e);
  await audio.play();
  ```
**Relevant Code Paths:**

- `services/api-gateway/app/services/tts_service.py`
- `apps/web-app/src/services/voice/TTSPlayer.ts`
#### Voice Activity Detection (VAD) Issues

**Likely Causes:**
- Threshold too high/low
- Background noise
- Wrong sample rate
- VAD model not loaded
**Steps to Investigate:**
- Check VAD configuration:
  ```javascript
  const vadConfig = {
    threshold: 0.5,         // Adjust based on noise level
    minSpeechFrames: 3,     // Minimum frames to trigger
    preSpeechPadFrames: 10,
    redemptionFrames: 8,
  };
  ```
- Visualize audio levels:
  ```javascript
  // Use canvas to show real-time levels
  const canvas = document.querySelector("canvas");
  const canvasCtx = canvas.getContext("2d");
  const draw = () => {
    analyser.getByteFrequencyData(dataArray);
    canvasCtx.clearRect(0, 0, canvas.width, canvas.height);
    // One bar per frequency bin
    dataArray.forEach((v, i) => canvasCtx.fillRect(i, canvas.height - v / 2, 1, v / 2));
    requestAnimationFrame(draw);
  };
  draw();
  ```
- Check sample rate:
console.log("AudioContext sample rate:", audioContext.sampleRate); // VAD typically expects 16000 Hz
**Relevant Code Paths:**

- `apps/web-app/src/services/voice/VADService.ts`
- `apps/web-app/src/utils/vad.ts`
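If the VAD library itself is suspect, a crude RMS-energy detector is enough to verify that speech frames arrive at all. This is a debugging stand-in mirroring the `threshold`/`minSpeechFrames` config above, not the VAD the app ships:

```typescript
// Crude RMS-energy VAD for debugging: fires after consecutive loud frames.
function makeEnergyVad(threshold = 0.02, minSpeechFrames = 3) {
  let loudFrames = 0;
  return (frame: Float32Array): boolean => {
    let sum = 0;
    for (const s of frame) sum += s * s;
    const rms = Math.sqrt(sum / frame.length);
    loudFrames = rms > threshold ? loudFrames + 1 : 0;
    return loudFrames >= minSpeechFrames; // true while speech is detected
  };
}

// Usage: call once per audio frame (e.g., from an AudioWorklet processor).
const isSpeech = makeEnergyVad();
```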
## Debugging Tools

### Browser DevTools

Monitor WebSocket traffic in DevTools → Network → WS → select the connection → Messages.
### Audio Debugging

```javascript
// Create audio visualizer
const analyser = audioContext.createAnalyser();
analyser.fftSize = 2048;
const bufferLength = analyser.frequencyBinCount;
const dataArray = new Uint8Array(bufferLength);

function draw() {
  requestAnimationFrame(draw);
  analyser.getByteTimeDomainData(dataArray);
  // Draw waveform to canvas (see the VAD section for a bar-graph variant)
}
draw(); // start the render loop
```
### WebSocket Testing

```bash
# websocat - WebSocket Swiss Army knife
# Test Thinker-Talker voice pipeline
websocat -v "wss://assist.asimo.io/api/voice/pipeline-ws?token=YOUR_TOKEN"

# wscat - WebSocket cat
npm install -g wscat

# Test chat streaming WebSocket
wscat -c "wss://assist.asimo.io/api/realtime/ws?token=YOUR_TOKEN"
```
## Common Error Messages

| Error | Cause | Fix |
|---|---|---|
| `NotAllowedError: Permission denied` | Microphone blocked | Request permission with user interaction |
| `NotFoundError: Device not found` | No microphone | Check hardware/drivers |
| `AudioContext was not allowed to start` | Autoplay policy | Resume after user click |
| `WebSocket is closed before connection established` | Connection rejected | Check auth, CORS, proxy |
| `MediaRecorder: not supported` | Browser compatibility | Use audio/webm, add polyfill |
## Performance Metrics
| Metric | Target | Alert |
|---|---|---|
| WebSocket latency | < 100ms | > 500ms |
| STT processing time | < 2s | > 5s |
| TTS generation time | < 1s | > 3s |
| Audio capture to response | < 3s | > 7s |
## Voice Health Endpoint
Check voice pipeline health:
```bash
# Health check for all voice components
curl http://localhost:8000/health/voice | jq '.'
```

Example response:

```json
{
  "status": "healthy",
  "components": {
    "deepgram": "healthy",
    "openai": "healthy",
    "elevenlabs": "healthy"
  }
}
```
## Related Documentation
- Debugging Overview
- Voice Mode Pipeline - Detailed Thinker-Talker architecture
- Thinker-Talker Pipeline - Pipeline design and implementation
- Implementation Status - Voice feature status
- API Reference - Voice endpoints