# OpenAI Realtime API Integration
Comprehensive documentation for the OpenAI Realtime API integration in VoiceAssist.
## Overview
This integration adds full-duplex voice conversation capabilities using OpenAI's Realtime API. Unlike the previous voice mode (Whisper + TTS), the Realtime API provides:
- **Bidirectional Audio Streaming**: Real-time audio input and output
- **Server-Side VAD**: OpenAI handles voice activity detection
- **Lower Latency**: Direct streaming reduces round-trip time
- **Natural Conversations**: More fluid, conversational interactions
## Architecture

### Backend Components

#### 1. Configuration (`services/api-gateway/app/core/config.py`)

```python
# OpenAI Realtime API settings
REALTIME_ENABLED: bool = True
REALTIME_MODEL: str = "gpt-4o-realtime-preview-2024-10-01"
REALTIME_BASE_URL: str = "wss://api.openai.com/v1/realtime"
REALTIME_TOKEN_EXPIRY_SEC: int = 300  # 5 minutes
```
#### 2. Realtime Voice Service (`services/api-gateway/app/services/realtime_voice_service.py`)

**Purpose**: Generates ephemeral session configuration for frontend WebSocket connections.

**Key Methods**:

- `generate_session_config(user_id, conversation_id)`: Creates session configuration
- `get_session_instructions(conversation_id)`: Returns system instructions for the AI
- `validate_session(session_id)`: Validates session format

**Session Configuration Structure**:

```json
{
  "url": "wss://api.openai.com/v1/realtime",
  "model": "gpt-4o-realtime-preview-2024-10-01",
  "api_key": "<OPENAI_API_KEY>",
  "session_id": "rtc_<user_id>_<random_token>",
  "expires_at": <unix_timestamp>,
  "conversation_id": "<optional_conversation_id>",
  "voice_config": {
    "voice": "alloy",
    "modalities": ["text", "audio"],
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_transcription": { "model": "whisper-1" },
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500
    }
  }
}
```
#### 3. API Endpoint (`services/api-gateway/app/api/voice.py`)

**Endpoint**: `POST /api/voice/realtime-session`

**Request**:

```json
{ "conversation_id": "optional_conversation_id" }
```

**Response**: Session configuration object (see above)

**Authentication**: Requires valid JWT token

**Errors**:

- `503 SERVICE_UNAVAILABLE`: Realtime API not enabled or configured
- `500 INTERNAL_SERVER_ERROR`: Session creation failed
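A minimal client-side call might look like this (a sketch; `fetchRealtimeSession` is a hypothetical helper, and the app normally goes through the API client method described under Frontend Components):

```typescript
// Hypothetical direct call to the session endpoint.
// RealtimeSessionConfig is the type sketched above.
async function fetchRealtimeSession(
  accessToken: string,
  conversationId?: string,
): Promise<RealtimeSessionConfig> {
  const res = await fetch("/api/voice/realtime-session", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${accessToken}`,
    },
    body: JSON.stringify({ conversation_id: conversationId ?? null }),
  });
  if (!res.ok) {
    // 503: Realtime API not enabled; 500: session creation failed
    throw new Error(`Session request failed: ${res.status}`);
  }
  return res.json();
}
```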
### Frontend Components

#### 1. useRealtimeVoiceSession Hook (`apps/web-app/src/hooks/useRealtimeVoiceSession.ts`)

**Purpose**: Manages WebSocket connection lifecycle and audio streaming.

**Usage**:

```tsx
const {
  status,        // Connection status
  error,         // Error object (if any)
  transcript,    // Current transcript
  isSpeaking,    // User speaking indicator
  sessionConfig, // Session configuration
  connect,       // Start connection
  disconnect,    // End connection
  sendMessage,   // Send text message
  isConnected,   // Boolean: connected
  isConnecting,  // Boolean: connecting
  canSend,       // Boolean: can send
} = useRealtimeVoiceSession({
  conversation_id: "conv_123",
  onTranscript: (transcript) => {
    console.log(transcript.text, transcript.is_final);
  },
  onAudioChunk: (chunk) => {
    // Handle audio playback
  },
  onError: (error) => {
    console.error(error);
  },
  onConnectionChange: (status) => {
    console.log(status);
  },
  autoConnect: false, // Manual connect
});
```
**Features**:
- WebSocket connection management
- Microphone capture (24kHz PCM16)
- Audio streaming to OpenAI
- Real-time transcript updates
- Audio chunk buffering
- Automatic cleanup
#### 2. VoiceModePanel Component (`apps/web-app/src/components/voice/VoiceModePanel.tsx`)

**Purpose**: UI component for Realtime voice conversations.

**Usage**:

```tsx
<VoiceModePanel
  conversationId="conv_123"
  onClose={() => setShowVoice(false)}
  onTranscriptReceived={(text, isFinal) => {
    if (isFinal) {
      // Save to conversation history
    }
  }}
/>
```
**Features**:
- Connection status indicator
- Waveform visualization
- Live transcript display (user + AI)
- Connection controls (start/stop)
- Error handling UI
- Instructions panel
#### 3. Chat UI Integration (`apps/web-app/src/components/chat/MessageInput.tsx`)

**New Props**:

- `enableRealtimeVoice?: boolean` - Enable Realtime voice button
- `conversationId?: string` - Current conversation ID

**UI Elements**:
- Purple speaker icon button
- Inline VoiceModePanel when active
- Separate from existing voice-to-text
#### 4. API Client Method (`packages/api-client/src/index.ts`)

```typescript
async createRealtimeSession(request: {
  conversation_id?: string | null;
}): Promise<RealtimeSessionConfig>
```
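Typical usage (a sketch; `apiClient` is a placeholder for however the client is instantiated in the app):

```typescript
// apiClient stands in for the app's configured client instance.
const sessionConfig = await apiClient.createRealtimeSession({
  conversation_id: "conv_123",
});
console.log(sessionConfig.url, sessionConfig.expires_at);
```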
## Audio Processing

### Input (Microphone → OpenAI)

- **Capture**: Web Audio API (24kHz, mono)
- **Process**: ScriptProcessorNode converts Float32 → Int16 (PCM16)
- **Encode**: Base64 encoding for JSON transport
- **Send**: WebSocket `input_audio_buffer.append` message

```typescript
// Example audio processing
processor.onaudioprocess = (event) => {
  const inputData = event.inputBuffer.getChannelData(0);
  const pcm16 = new Int16Array(inputData.length);

  // Convert Float32 (-1.0 to 1.0) to Int16 (-32768 to 32767)
  for (let i = 0; i < inputData.length; i++) {
    const s = Math.max(-1, Math.min(1, inputData[i]));
    pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }

  // Send to OpenAI as base64-encoded PCM16
  ws.send(
    JSON.stringify({
      type: "input_audio_buffer.append",
      audio: btoa(String.fromCharCode(...new Uint8Array(pcm16.buffer))),
    }),
  );
};
```
### Output (OpenAI → Speakers)

- **Receive**: WebSocket `response.audio.delta` message
- **Decode**: Base64 → Int16 PCM16
- **Convert**: Int16 → Float32 for Web Audio API
- **Queue**: Buffer audio chunks
- **Play**: `AudioContext.createBufferSource()`

```typescript
// Example audio playback
const audioBuffer = audioContext.createBuffer(1, pcm16.length, 24000);
const channelData = audioBuffer.getChannelData(0);

// Convert Int16 to Float32
for (let i = 0; i < pcm16.length; i++) {
  channelData[i] = pcm16[i] / 32768.0;
}

const source = audioContext.createBufferSource();
source.buffer = audioBuffer;
source.connect(audioContext.destination);
source.start();
```
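The example above plays a single chunk immediately; deltas arriving back-to-back would overlap. One way to implement the queueing step is to schedule each chunk at the end of the previous one (a sketch, not the hook's exact implementation):

```typescript
// Sketch: schedule chunks back-to-back so audio deltas don't overlap.
let nextStartTime = 0;

function enqueuePcm16Chunk(audioContext: AudioContext, pcm16: Int16Array) {
  const buffer = audioContext.createBuffer(1, pcm16.length, 24000);
  const channel = buffer.getChannelData(0);
  for (let i = 0; i < pcm16.length; i++) {
    channel[i] = pcm16[i] / 32768.0;
  }

  const source = audioContext.createBufferSource();
  source.buffer = buffer;
  source.connect(audioContext.destination);

  // Start at the later of "now" and the end of the previous chunk.
  const startAt = Math.max(audioContext.currentTime, nextStartTime);
  source.start(startAt);
  nextStartTime = startAt + buffer.duration;
}
```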
## WebSocket Protocol

### Connection Flow

1. **Fetch Session Config**: `POST /api/voice/realtime-session`
2. **Connect WebSocket**: `wss://api.openai.com/v1/realtime?model=<model>`
3. **Send `session.update`**: Configure voice, VAD, transcription
4. **Stream Audio**: Bidirectional audio/text streaming
5. **Close Connection**: Cleanup and disconnect
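Put together, the flow looks roughly like the sketch below. The real hook adds audio capture, error handling, and cleanup; `fetchRealtimeSession` is the hypothetical helper from the endpoint section above, and the auth handshake on the WebSocket itself is omitted:

```typescript
// Sketch of the connection flow, under the assumptions noted above.
async function startRealtimeSession(accessToken: string) {
  // 1. Fetch session config from the backend
  const config = await fetchRealtimeSession(accessToken);

  // 2. Connect the WebSocket
  const ws = new WebSocket(`${config.url}?model=${config.model}`);

  ws.onopen = () => {
    // 3. Configure the session (voice, VAD, transcription)
    ws.send(
      JSON.stringify({
        type: "session.update",
        session: { ...config.voice_config },
      }),
    );
    // 4. Audio streaming starts here (see Audio Processing above)
  };

  // 5. Caller closes the connection and cleans up when done
  return () => ws.close();
}
```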
### Message Types

#### Sent to OpenAI

```jsonc
// Session configuration
{
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "instructions": "You are a helpful medical AI assistant.",
    "voice": "alloy",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_transcription": { "model": "whisper-1" },
    "turn_detection": { ... }
  }
}

// Audio data
{ "type": "input_audio_buffer.append", "audio": "<base64_pcm16>" }

// Text message
{
  "type": "conversation.item.create",
  "item": {
    "type": "message",
    "role": "user",
    "content": [{ "type": "input_text", "text": "Hello" }]
  }
}

// Trigger response
{ "type": "response.create" }
```
#### Received from OpenAI

```jsonc
// Session created
{ "type": "session.created", "session": { ... } }

// User speech transcription
{
  "type": "conversation.item.input_audio_transcription.completed",
  "transcript": "Hello, how are you?"
}

// AI audio response
{ "type": "response.audio.delta", "delta": "<base64_pcm16>" }

// AI speech transcript (partial)
{ "type": "response.audio_transcript.delta", "delta": "I'm doing well" }

// AI speech transcript (final)
{
  "type": "response.audio_transcript.done",
  "transcript": "I'm doing well, thank you for asking!"
}

// Speech detection
{ "type": "input_audio_buffer.speech_started" }
{ "type": "input_audio_buffer.speech_stopped" }

// Error
{
  "type": "error",
  "error": { "message": "Error message", "code": "error_code" }
}
```
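On the client, these events typically funnel through a single `onmessage` handler. A sketch, where `onTranscript`, `onAudioChunk`, and `onError` stand in for the hook's callbacks:

```typescript
// Sketch: dispatching Realtime events to the hook's callbacks.
function attachMessageHandler(
  ws: WebSocket,
  onTranscript: (t: { text: string; is_final: boolean }) => void,
  onAudioChunk: (base64Pcm16: string) => void,
  onError: (e: Error) => void,
) {
  ws.onmessage = (event: MessageEvent<string>) => {
    const msg = JSON.parse(event.data);
    switch (msg.type) {
      case "conversation.item.input_audio_transcription.completed":
        onTranscript({ text: msg.transcript, is_final: true });
        break;
      case "response.audio.delta":
        onAudioChunk(msg.delta); // base64 PCM16 → playback queue
        break;
      case "response.audio_transcript.delta":
        onTranscript({ text: msg.delta, is_final: false });
        break;
      case "response.audio_transcript.done":
        onTranscript({ text: msg.transcript, is_final: true });
        break;
      case "error":
        onError(new Error(msg.error.message));
        break;
    }
  };
}
```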
## Configuration

### Environment Variables

```bash
# Required
OPENAI_API_KEY=sk-...

# Optional (with defaults)
REALTIME_ENABLED=true
REALTIME_MODEL=gpt-4o-realtime-preview-2024-10-01
REALTIME_BASE_URL=wss://api.openai.com/v1/realtime
REALTIME_TOKEN_EXPIRY_SEC=300
```
### Feature Flags

Enable Realtime voice mode in the Chat UI:

```tsx
<MessageInput
  enableRealtimeVoice={true} // Enable Realtime voice button
  conversationId={conversationId}
  ...
/>
```
## Testing

### Manual Testing

1. **Navigate to Test Page**: `/voice-test`
2. **Locate Realtime Section**: "Realtime Voice Mode (OpenAI Realtime API)"
3. **Start Session**: Click "Start Voice Session"
4. **Test Connection**: Verify status indicator turns green
5. **Speak**: Say something and verify transcript appears
6. **Listen**: Verify AI responds with voice
7. **End Session**: Click "End Session"
### Test Scenarios

#### Scenario 1: Basic Connection
1. Start voice session
2. Verify WebSocket connects
3. Verify microphone permission granted
4. Verify waveform visualization appears
5. End session
6. Verify cleanup completes
#### Scenario 2: Voice Conversation
1. Start voice session
2. Speak: "Hello, can you hear me?"
3. Verify user transcript appears
4. Wait for AI response
5. Verify AI transcript appears
6. Verify audio plays
7. End session
#### Scenario 3: Error Handling
1. Disable network
2. Start voice session
3. Verify error message appears
4. Re-enable network
5. Retry connection
6. Verify successful connection
#### Scenario 4: Conversation Context
1. Start voice session in existing conversation
2. Ask question related to previous messages
3. Verify AI has context from conversation history
4. End session
## Browser Compatibility

### Supported Browsers
- ✅ Chrome 80+
- ✅ Edge 80+
- ✅ Safari 14.1+
- ✅ Firefox 75+
### Required APIs
- WebSocket API
- Web Audio API (AudioContext, ScriptProcessorNode)
- MediaDevices.getUserMedia
- Canvas API (for waveform)
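A capability check before enabling the feature might look like this (a sketch):

```typescript
// Sketch: detect the APIs this feature depends on.
function supportsRealtimeVoice(): boolean {
  return (
    typeof WebSocket !== "undefined" &&
    typeof AudioContext !== "undefined" &&
    typeof navigator !== "undefined" &&
    !!navigator.mediaDevices?.getUserMedia
  );
}
```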
### Known Limitations

- **Safari**: Requires a user gesture to start the AudioContext
- **Mobile**: May have higher latency
- **iOS**: Requires iOS 14.3+ for Web Audio API support
## Troubleshooting

### Common Issues

**1. "Microphone Access Denied"**

- **Cause**: Browser blocked microphone permission
- **Solution**: Grant microphone access in browser settings
2. "WebSocket connection failed"
Cause: Network issues or invalid session configuration Solution: Check network, verify API key is set
3. "Session configuration expired"
Cause: Session config older than 5 minutes Solution: Retry connection to fetch new config
**4. No audio playback**

- **Cause**: AudioContext not started or suspended
- **Solution**: Ensure a user gesture occurred before playback (see the sketch after this list)
**5. High latency**

- **Cause**: Network congestion or a slow connection
- **Solution**: Check network speed and reduce concurrent connections
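For issue 4, resuming the AudioContext from inside a user-gesture handler is usually enough (a sketch; `startButton` and the selector are placeholders for the app's own elements):

```typescript
// Sketch: resume a suspended AudioContext from a user gesture.
const audioContext = new AudioContext();
const startButton = document.querySelector<HTMLButtonElement>("#start")!;

startButton.addEventListener("click", async () => {
  if (audioContext.state === "suspended") {
    await audioContext.resume();
  }
  // Safe to start playback from here.
});
```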
### Debug Mode

Enable detailed logging:

```typescript
// In useRealtimeVoiceSession.ts
console.log("[RealtimeVoiceSession] Debug message");
```
Check browser console for:
- WebSocket connection status
- Audio processing metrics
- Error messages with stack traces
## Security Considerations

### API Key Protection

- ✅ API key stored server-side
- ✅ Session tokens generated by the backend
- ✅ Short-lived sessions (5 minutes)
- ✅ User authentication required
### Data Privacy
- Audio streams directly to OpenAI (not stored on server)
- Transcripts saved to conversation history (optional)
- No audio recording stored locally
- WebSocket connections encrypted (WSS)
### Rate Limiting
- Session creation rate-limited per user
- WebSocket connections monitored
- Audio streaming bandwidth monitored
## Performance Optimization

### Recommendations

1. **Audio Processing**:
   - Use ScriptProcessorNode for now (AudioWorklet in future)
   - Buffer size: 4096 samples (good balance)
   - Sample rate: 24000 Hz (Realtime API native)

2. **Network**:
   - Minimize concurrent WebSocket connections
   - Use binary frames for audio (future enhancement)
   - Monitor connection quality

3. **UI**:
   - Throttle waveform updates to 60 FPS
   - Debounce transcript updates
   - Lazy load VoiceModePanel

4. **Memory** (see the cleanup sketch below):
   - Clean up audio contexts on disconnect
   - Clear audio buffers regularly
   - Release MediaStream tracks
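The memory items translate to a disconnect routine along these lines (a sketch; the names are illustrative, not the hook's actual internals):

```typescript
// Sketch of disconnect cleanup; names are illustrative.
function cleanup(
  ws: WebSocket | null,
  audioContext: AudioContext | null,
  mediaStream: MediaStream | null,
) {
  // Stop the microphone and release the device
  mediaStream?.getTracks().forEach((track) => track.stop());

  // Close the audio graph and free its resources
  void audioContext?.close();

  // Close the WebSocket last so in-flight messages can drain
  ws?.close();
}
```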
## Future Enhancements

### Planned Features
- Conversation history integration
- Audio recording/playback controls
- Voice settings (speed, pitch)
- Multiple voice options
- Text-only mode toggle
- Noise cancellation
- Echo cancellation tuning
- Mobile app support
- Offline mode
### Technical Improvements
- Migrate to AudioWorklet for better performance
- Use binary WebSocket frames for audio
- Add reconnection logic with exponential backoff (see the sketch below)
- Implement session resumption
- Add metrics and monitoring
- Create unit tests for audio processing
- Add E2E tests for voice flows
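For the reconnection item, one common shape is exponential backoff with a retry cap (a sketch of a possible approach, not current behavior):

```typescript
// Sketch: reconnect with exponential backoff (not yet implemented).
async function connectWithBackoff(
  connect: () => Promise<void>,
  maxRetries = 5,
): Promise<void> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      await connect();
      return;
    } catch (err) {
      if (attempt === maxRetries - 1) throw err;
      // 1s, 2s, 4s, ... capped at 30s
      const delayMs = Math.min(1000 * 2 ** attempt, 30_000);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```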
## Support

For issues or questions:

- Check this documentation
- Review browser console logs
- Test on the `/voice-test` page
- File an issue with reproduction steps
## Changelog

### 2025-11-24 - Initial Release
- Backend Realtime session management
- Frontend useRealtimeVoiceSession hook
- VoiceModePanel component
- Chat UI integration
- Test page updates
- Documentation
**Status**: ✅ Production Ready
**Version**: 1.0.0
**Last Updated**: 2025-11-24