Smart Conversational Voice Design
Status: Design Document
Version: 1.0
Last Updated: 2025-12-04
Authors: AI Assistant, Development Team
Executive Summary
This document outlines the technical design for making VoiceAssist voice mode feel natural and conversational. It covers two phases:
- Phase 2: Smart Acknowledgments - Context-aware barge-in responses
- Phase 3: Natural Conversational Flow - Human-like turn-taking and prosody
Current State Analysis
What Works
- Thinking Tones (Just Implemented)
  - ThinkingFeedbackPanel now integrated into ThinkerTalkerVoicePanel.tsx
  - Plays configurable audio tones during "processing" state
  - Settings in voiceSettingsStore.ts (enabled by default)
- Barge-in Detection
  - Fast VAD detection (<30ms latency)
  - Pattern classification (backchannel, soft barge, hard barge)
  - ElevenLabs TTS for consistent voice
What Needs Improvement
- Static Acknowledgments
  - Current: Always plays "I'm listening" regardless of context
  - Problem: Unnatural; doesn't match what the user is saying
- Poor Timing
  - Current: Fixed thresholds for pause detection
  - Problem: Doesn't adapt to the user's natural speech rhythm
- No Conversational Intelligence
  - Current: Pattern matching only
  - Problem: Can't understand user intent or generate appropriate responses
Phase 2: Smart Acknowledgments
Overview
Replace static barge-in phrases with context-aware acknowledgments that reflect what the user is saying and why they're interrupting.
Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ Smart Acknowledgment Pipeline │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ User Speech ──► STT ──► Partial Transcript │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Intent Classifier │ │
│ │ (Fast, <50ms) │ │
│ └──────────┬───────────┘ │
│ │ │
│ ┌───────────────────┼───────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Question │ │ Correction │ │ Interruption│ │
│ │ Intent │ │ Intent │ │ Intent │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Contextual Phrase Selector │ │
│ │ - Matches intent to phrase library │ │
│ │ - Considers conversation history │ │
│ │ - Respects user's language preference │ │
│ └────────────────────────┬────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Cached TTS Lookup │ │
│ │ (Pre-synthesized) │ │
│ └──────────┬───────────┘ │
│ │ │
│ ▼ │
│ Audio Playback │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Implementation Details
1. Intent Classifier Service
Location: services/api-gateway/app/services/acknowledgment_intent_classifier.py
""" Acknowledgment Intent Classifier Fast intent classification for barge-in acknowledgments. Must complete in <50ms to maintain conversational flow. """ from enum import Enum from dataclasses import dataclass from typing import Optional, List import re class AcknowledgmentIntent(str, Enum): """Detected intent for acknowledgment selection.""" QUESTION = "question" # User asking a question CORRECTION = "correction" # User correcting AI CLARIFICATION = "clarification" # User needs more info AGREEMENT = "agreement" # User agrees/confirms DISAGREEMENT = "disagreement" # User disagrees HESITATION = "hesitation" # User is thinking/unsure INTERRUPTION = "interruption" # User wants to change topic COMMAND = "command" # User giving instruction CONTINUATION = "continuation" # User wants AI to continue UNKNOWN = "unknown" # Cannot determine @dataclass class IntentResult: """Result of intent classification.""" intent: AcknowledgmentIntent confidence: float # 0.0 - 1.0 keywords: List[str] # Matched keywords suggested_phrases: List[str] # Pre-selected phrases class AcknowledgmentIntentClassifier: """ Fast intent classifier for barge-in acknowledgments. Uses keyword matching and pattern analysis for speed. No ML inference - must be <50ms. """ # Intent detection patterns (compiled for speed) PATTERNS = { AcknowledgmentIntent.QUESTION: [ r'\b(what|where|when|why|how|who|which|can you|could you|is it|are you|do you)\b', r'\?$', r'\b(tell me|explain|describe)\b', ], AcknowledgmentIntent.CORRECTION: [ r'\b(no|not|wrong|incorrect|actually|but|wait)\b', r'\b(i (meant|said|wanted)|that\'s not)\b', ], AcknowledgmentIntent.CLARIFICATION: [ r'\b(what do you mean|i don\'t understand|clarify|repeat|again)\b', r'\b(sorry|pardon|huh)\b', ], AcknowledgmentIntent.AGREEMENT: [ r'\b(yes|yeah|yep|correct|right|exactly|sure|ok|okay)\b', r'\b(that\'s right|i agree|makes sense)\b', ], AcknowledgmentIntent.DISAGREEMENT: [ r'\b(no|nope|wrong|i disagree|that\'s not|but)\b', r'\b(i don\'t think|not quite|actually)\b', ], AcknowledgmentIntent.HESITATION: [ r'\b(um+|uh+|hmm+|well|let me think|i\'m not sure)\b', r'\.\.\.$', ], AcknowledgmentIntent.INTERRUPTION: [ r'\b(stop|wait|hold on|one moment|let me|can i)\b', r'\b(before you|hang on|pause)\b', ], AcknowledgmentIntent.COMMAND: [ r'\b(go to|open|show|play|start|stop|read|skip)\b', r'\b(louder|quieter|slower|faster|repeat)\b', ], AcknowledgmentIntent.CONTINUATION: [ r'\b(continue|go on|keep going|and then|more)\b', r'\b(what else|tell me more|go ahead)\b', ], } def __init__(self): # Pre-compile patterns for speed self._compiled_patterns = { intent: [re.compile(p, re.IGNORECASE) for p in patterns] for intent, patterns in self.PATTERNS.items() } def classify( self, transcript: str, duration_ms: int, during_ai_speech: bool, conversation_context: Optional[str] = None, ) -> IntentResult: """ Classify user intent for acknowledgment selection. 
Args: transcript: The partial or final transcript duration_ms: How long the user has been speaking during_ai_speech: Whether AI was speaking when user started conversation_context: Recent conversation for context Returns: IntentResult with classified intent and suggestions """ transcript_lower = transcript.lower().strip() # Score each intent scores = {} matched_keywords = {} for intent, patterns in self._compiled_patterns.items(): score = 0.0 keywords = [] for pattern in patterns: matches = pattern.findall(transcript_lower) if matches: score += 0.3 * len(matches) keywords.extend(matches) scores[intent] = min(score, 1.0) matched_keywords[intent] = keywords # Apply contextual adjustments if during_ai_speech: # More likely to be interruption if AI was speaking scores[AcknowledgmentIntent.INTERRUPTION] *= 1.5 scores[AcknowledgmentIntent.CORRECTION] *= 1.3 if duration_ms < 500: # Short utterances more likely to be backchannels scores[AcknowledgmentIntent.AGREEMENT] *= 1.3 scores[AcknowledgmentIntent.HESITATION] *= 1.2 # Find best match best_intent = max(scores, key=scores.get) best_score = scores[best_intent] # Fall back to unknown if confidence too low if best_score < 0.2: best_intent = AcknowledgmentIntent.UNKNOWN best_score = 0.0 return IntentResult( intent=best_intent, confidence=best_score, keywords=matched_keywords.get(best_intent, []), suggested_phrases=self._get_phrases(best_intent), ) def _get_phrases(self, intent: AcknowledgmentIntent) -> List[str]: """Get suggested acknowledgment phrases for an intent.""" # These will be selected from the phrase library # See SmartPhraseLibrary below from .smart_phrase_library import get_phrases_for_intent return get_phrases_for_intent(intent)
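For concreteness, here is a minimal usage sketch of the classifier as the acknowledgment service is expected to call it; the transcript and timing values are made-up example inputs, not fixtures from the codebase:

```python
# Illustrative usage only; the inputs below are hypothetical examples.
from app.services.acknowledgment_intent_classifier import AcknowledgmentIntentClassifier

classifier = AcknowledgmentIntentClassifier()

# User barges in while the AI is speaking.
result = classifier.classify(
    transcript="wait, that's not what I meant",
    duration_ms=900,
    during_ai_speech=True,
)

# Expected to resolve to a correction-type intent with a few keyword hits
# and a short list of suggested acknowledgment phrases.
print(result.intent, result.confidence, result.suggested_phrases[:2])
```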
2. Smart Phrase Library
Location: services/api-gateway/app/services/smart_phrase_library.py
""" Smart Phrase Library Contextual acknowledgment phrases organized by intent. Supports multiple languages and formality levels. """ from typing import List, Dict from enum import Enum class FormalityLevel(str, Enum): CASUAL = "casual" NEUTRAL = "neutral" FORMAL = "formal" # Phrase library organized by intent PHRASE_LIBRARY: Dict[str, Dict[str, List[str]]] = { "question": { "en": [ "Yes?", "What is it?", "Go ahead", "I'm listening", "What would you like to know?", ], "ar": [ "نعم؟", "تفضل", "ما هو سؤالك؟", "أنا أستمع", ], }, "correction": { "en": [ "I see", "Got it", "Understood", "Let me correct that", "My apologies", "Thanks for the correction", ], "ar": [ "فهمت", "حسناً", "أعتذر", "شكراً للتصحيح", ], }, "clarification": { "en": [ "Let me explain", "Of course", "I'll clarify", "What specifically?", ], "ar": [ "دعني أوضح", "بالتأكيد", "ما الذي تريد توضيحه؟", ], }, "agreement": { "en": [ "Great", "Perfect", "Excellent", "Wonderful", ], "ar": [ "ممتاز", "رائع", "جميل", ], }, "disagreement": { "en": [ "I understand", "I hear you", "Let me reconsider", "Fair point", ], "ar": [ "أفهم", "أسمعك", "نقطة جيدة", ], }, "hesitation": { "en": [ "Take your time", "No rush", "I'm here", "Whenever you're ready", ], "ar": [ "خذ وقتك", "لا تستعجل", "أنا هنا", ], }, "interruption": { "en": [ "Of course", "Go ahead", "Yes?", "Please, continue", ], "ar": [ "بالتأكيد", "تفضل", "نعم؟", ], }, "command": { "en": [ "Right away", "On it", "Sure thing", "Doing that now", ], "ar": [ "حالاً", "فوراً", "بالتأكيد", ], }, "continuation": { "en": [ "Certainly", "Of course", "Let me continue", "Where was I...", ], "ar": [ "بالتأكيد", "حسناً", "دعني أكمل", ], }, "unknown": { "en": [ "I'm listening", "Go ahead", "Yes?", ], "ar": [ "أنا أستمع", "تفضل", "نعم؟", ], }, } def get_phrases_for_intent( intent: str, language: str = "en", formality: FormalityLevel = FormalityLevel.NEUTRAL, ) -> List[str]: """ Get acknowledgment phrases for a given intent. Args: intent: The classified intent (from AcknowledgmentIntent) language: Language code (en, ar, etc.) formality: Formality level for phrase selection Returns: List of suitable phrases """ intent_phrases = PHRASE_LIBRARY.get(intent, PHRASE_LIBRARY["unknown"]) return intent_phrases.get(language, intent_phrases.get("en", ["I'm listening"])) def select_phrase( intent: str, language: str = "en", avoid_recent: List[str] = None, ) -> str: """ Select a single phrase, avoiding recently used ones. Args: intent: The classified intent language: Language code avoid_recent: List of recently used phrases to avoid Returns: Selected phrase """ import random phrases = get_phrases_for_intent(intent, language) if avoid_recent: available = [p for p in phrases if p not in avoid_recent] if available: phrases = available return random.choice(phrases)
3. Phrase Cache Service
Location: services/api-gateway/app/services/phrase_cache_service.py
""" Phrase Cache Service Pre-synthesizes and caches acknowledgment phrases for instant playback. Uses ElevenLabs TTS with the user's selected voice. """ import asyncio import hashlib from typing import Dict, Optional from dataclasses import dataclass import aioredis from app.services.elevenlabs_service import ElevenLabsService from app.services.smart_phrase_library import PHRASE_LIBRARY from app.core.voice_constants import DEFAULT_VOICE_ID @dataclass class CachedPhrase: """A pre-synthesized phrase.""" text: str voice_id: str language: str audio_data: bytes duration_ms: int class PhraseCacheService: """ Manages pre-synthesized acknowledgment phrases. Caches phrases in Redis for fast retrieval. Pre-warms cache on voice selection. """ CACHE_PREFIX = "phrase_cache:" CACHE_TTL = 86400 * 7 # 7 days def __init__( self, redis_client: aioredis.Redis, elevenlabs: ElevenLabsService, ): self.redis = redis_client self.elevenlabs = elevenlabs self._warming = False def _cache_key(self, text: str, voice_id: str, language: str) -> str: """Generate cache key for a phrase.""" content = f"{text}:{voice_id}:{language}" hash_val = hashlib.sha256(content.encode()).hexdigest()[:16] return f"{self.CACHE_PREFIX}{hash_val}" async def get_phrase( self, text: str, voice_id: str = DEFAULT_VOICE_ID, language: str = "en", ) -> Optional[bytes]: """ Get cached audio for a phrase. Args: text: The phrase text voice_id: ElevenLabs voice ID language: Language code Returns: Audio bytes if cached, None otherwise """ key = self._cache_key(text, voice_id, language) data = await self.redis.get(key) return data async def cache_phrase( self, text: str, voice_id: str, language: str, audio_data: bytes, ) -> None: """Cache synthesized audio for a phrase.""" key = self._cache_key(text, voice_id, language) await self.redis.setex(key, self.CACHE_TTL, audio_data) async def synthesize_and_cache( self, text: str, voice_id: str = DEFAULT_VOICE_ID, language: str = "en", ) -> bytes: """ Synthesize a phrase and cache it. Args: text: Phrase to synthesize voice_id: ElevenLabs voice ID language: Language code Returns: Synthesized audio bytes """ # Check cache first cached = await self.get_phrase(text, voice_id, language) if cached: return cached # Synthesize audio_chunks = [] async for chunk in self.elevenlabs.synthesize_stream( text=text, voice_id=voice_id, model_id="eleven_flash_v2_5", # Fast model for acknowledgments ): audio_chunks.append(chunk) audio_data = b"".join(audio_chunks) # Cache for future use await self.cache_phrase(text, voice_id, language, audio_data) return audio_data async def warm_cache( self, voice_id: str, languages: List[str] = ["en", "ar"], ) -> None: """ Pre-synthesize all phrases for a voice. Called when user selects a voice to pre-warm cache. Args: voice_id: ElevenLabs voice ID languages: Languages to cache """ if self._warming: return self._warming = True try: tasks = [] for intent, lang_phrases in PHRASE_LIBRARY.items(): for lang in languages: if lang not in lang_phrases: continue for phrase in lang_phrases[lang]: # Check if already cached cached = await self.get_phrase(phrase, voice_id, lang) if not cached: tasks.append( self.synthesize_and_cache(phrase, voice_id, lang) ) # Synthesize in batches to avoid rate limits batch_size = 5 for i in range(0, len(tasks), batch_size): batch = tasks[i:i + batch_size] await asyncio.gather(*batch, return_exceptions=True) await asyncio.sleep(0.5) # Rate limit buffer finally: self._warming = False
4. Smart Acknowledgment Service
Location: services/api-gateway/app/services/smart_acknowledgment_service.py
""" Smart Acknowledgment Service Orchestrates intent classification, phrase selection, and audio playback for contextual barge-in acknowledgments. """ import asyncio from typing import Optional, List, Callable from dataclasses import dataclass import time from app.services.acknowledgment_intent_classifier import ( AcknowledgmentIntentClassifier, IntentResult, ) from app.services.smart_phrase_library import select_phrase from app.services.phrase_cache_service import PhraseCacheService from app.core.voice_constants import DEFAULT_VOICE_ID @dataclass class AcknowledgmentResult: """Result of acknowledgment generation.""" phrase: str intent: str confidence: float audio_data: Optional[bytes] latency_ms: int class SmartAcknowledgmentService: """ Generates contextual acknowledgments for barge-in events. Pipeline: 1. Classify user intent from transcript 2. Select appropriate phrase 3. Retrieve cached audio (or synthesize) 4. Return for playback Target latency: <100ms """ def __init__( self, phrase_cache: PhraseCacheService, ): self.classifier = AcknowledgmentIntentClassifier() self.phrase_cache = phrase_cache self._recent_phrases: List[str] = [] self._max_recent = 5 async def generate_acknowledgment( self, transcript: str, duration_ms: int, during_ai_speech: bool, voice_id: str = DEFAULT_VOICE_ID, language: str = "en", conversation_context: Optional[str] = None, ) -> AcknowledgmentResult: """ Generate a contextual acknowledgment for a barge-in. Args: transcript: User's speech transcript duration_ms: How long user has been speaking during_ai_speech: Whether AI was speaking voice_id: Voice to use for TTS language: User's language preference conversation_context: Recent conversation for context Returns: AcknowledgmentResult with phrase and audio """ start_time = time.monotonic() # 1. Classify intent intent_result = self.classifier.classify( transcript=transcript, duration_ms=duration_ms, during_ai_speech=during_ai_speech, conversation_context=conversation_context, ) # 2. Select phrase (avoid repetition) phrase = select_phrase( intent=intent_result.intent.value, language=language, avoid_recent=self._recent_phrases, ) # Track recent phrases self._recent_phrases.append(phrase) if len(self._recent_phrases) > self._max_recent: self._recent_phrases.pop(0) # 3. Get cached audio audio_data = await self.phrase_cache.get_phrase( text=phrase, voice_id=voice_id, language=language, ) # 4. Synthesize if not cached (should be rare after warm-up) if not audio_data: audio_data = await self.phrase_cache.synthesize_and_cache( text=phrase, voice_id=voice_id, language=language, ) latency_ms = int((time.monotonic() - start_time) * 1000) return AcknowledgmentResult( phrase=phrase, intent=intent_result.intent.value, confidence=intent_result.confidence, audio_data=audio_data, latency_ms=latency_ms, ) def reset_recent_phrases(self) -> None: """Reset recent phrase tracking (e.g., on new session).""" self._recent_phrases.clear()
5. Frontend Integration
Location: apps/web-app/src/hooks/useSmartAcknowledgment.ts
/**
 * useSmartAcknowledgment Hook
 *
 * Fetches and plays contextual acknowledgment audio based on
 * user intent classification.
 */

import { useCallback, useRef } from "react";
import { voiceLog } from "../lib/logger";

interface SmartAcknowledgmentOptions {
  /** ElevenLabs voice ID */
  voiceId?: string;
  /** Language code */
  language?: string;
  /** API base URL */
  apiBaseUrl?: string;
  /** Auth token getter */
  getAccessToken?: () => string | null;
  /** Volume (0-1) */
  volume?: number;
}

interface AcknowledgmentResult {
  phrase: string;
  intent: string;
  confidence: number;
  latency_ms: number;
}

export function useSmartAcknowledgment(options: SmartAcknowledgmentOptions = {}) {
  const {
    voiceId,
    language = "en",
    apiBaseUrl = typeof window !== "undefined" ? window.location.origin : "",
    getAccessToken,
    volume = 0.8,
  } = options;

  const audioContextRef = useRef<AudioContext | null>(null);
  const gainNodeRef = useRef<GainNode | null>(null);

  const getAudioContext = useCallback((): AudioContext => {
    if (!audioContextRef.current) {
      audioContextRef.current = new (
        window.AudioContext ||
        (window as unknown as { webkitAudioContext: typeof AudioContext }).webkitAudioContext
      )();
    }
    if (audioContextRef.current.state === "suspended") {
      audioContextRef.current.resume();
    }
    return audioContextRef.current;
  }, []);

  const playAudioBuffer = useCallback(
    async (audioData: ArrayBuffer): Promise<void> => {
      const ctx = getAudioContext();
      const audioBuffer = await ctx.decodeAudioData(audioData);
      const source = ctx.createBufferSource();
      source.buffer = audioBuffer;

      if (!gainNodeRef.current) {
        gainNodeRef.current = ctx.createGain();
        gainNodeRef.current.connect(ctx.destination);
      }
      gainNodeRef.current.gain.value = volume;

      source.connect(gainNodeRef.current);
      source.start(0);
    },
    [getAudioContext, volume],
  );

  /**
   * Play a smart acknowledgment based on user transcript.
   */
  const playAcknowledgment = useCallback(
    async (
      transcript: string,
      durationMs: number,
      duringAiSpeech: boolean,
    ): Promise<AcknowledgmentResult | null> => {
      try {
        const token = getAccessToken?.();
        const url = `${apiBaseUrl}/api/voice/smart-acknowledgment`;

        const response = await fetch(url, {
          method: "POST",
          headers: {
            "Content-Type": "application/json",
            ...(token ? { Authorization: `Bearer ${token}` } : {}),
          },
          body: JSON.stringify({
            transcript,
            duration_ms: durationMs,
            during_ai_speech: duringAiSpeech,
            voice_id: voiceId,
            language,
          }),
        });

        if (!response.ok) {
          throw new Error(`Acknowledgment request failed: ${response.status}`);
        }

        // Response contains either metadata only or audio with metadata in a header
        const contentType = response.headers.get("content-type");

        if (contentType?.includes("application/json")) {
          // Metadata-only response (audio from cache)
          const data = await response.json();

          // Fetch audio separately
          const audioResponse = await fetch(
            `${apiBaseUrl}/api/voice/phrase-audio?` +
              `phrase=${encodeURIComponent(data.phrase)}&` +
              `voice_id=${voiceId}&language=${language}`,
            {
              headers: token ? { Authorization: `Bearer ${token}` } : {},
            },
          );

          if (audioResponse.ok) {
            const audioData = await audioResponse.arrayBuffer();
            await playAudioBuffer(audioData);
          }

          return data;
        } else {
          // Binary response with audio embedded
          const audioData = await response.arrayBuffer();
          await playAudioBuffer(audioData);

          // Extract metadata from header
          const metadata = response.headers.get("X-Acknowledgment-Metadata");
          return metadata ? JSON.parse(metadata) : null;
        }
      } catch (error) {
        voiceLog.error("[SmartAcknowledgment] Failed to play:", error);
        return null;
      }
    },
    [apiBaseUrl, voiceId, language, getAccessToken, playAudioBuffer],
  );

  return {
    playAcknowledgment,
  };
}
6. Voice Pipeline Integration
Location: Update services/api-gateway/app/services/voice_pipeline_service.py
# Add to VoicePipelineService class
# (requires `import base64` at module level)

async def _handle_barge_in_with_smart_ack(
    self,
    transcript: str,
    duration_ms: int,
) -> None:
    """
    Handle barge-in with smart acknowledgment.

    Called when user interrupts AI speech.
    """
    # Stop current speech
    await self._stop_speaking(reason="barge_in")

    # Generate smart acknowledgment
    if self._smart_ack_service:
        result = await self._smart_ack_service.generate_acknowledgment(
            transcript=transcript,
            duration_ms=duration_ms,
            during_ai_speech=True,
            voice_id=self._config.voice_id,
            language=self._config.language,
        )

        # Send acknowledgment audio to client
        if result.audio_data:
            await self._on_message(
                PipelineMessage(
                    type="voice.acknowledgment",
                    data={
                        "phrase": result.phrase,
                        "intent": result.intent,
                        "confidence": result.confidence,
                        "audio": base64.b64encode(result.audio_data).decode(),
                    },
                )
            )
Phase 3: Natural Conversational Flow
Overview
Make voice mode feel like a natural human conversation with proper turn-taking, prosody variation, and adaptive timing.
Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ Natural Conversational Flow System │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Turn-Taking Manager │ │
│ │ │ │
│ │ User Speaking ◄────────────────────────────► AI Speaking │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ End-of-Turn │ │ Yield Point │ │ │
│ │ │ Detector │ │ Detector │ │ │
│ │ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ └──────────────────────────────┬──────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────▼──────────────────────────────────┐ │
│ │ Prosody Controller │ │
│ │ │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Emotion │ │ Emphasis │ │ Pacing │ │ Intonation│ │ │
│ │ │ Mapping │ │ Markers │ │ Control │ │ Patterns │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │ │
│ │ │ │
│ └──────────────────────────────┬──────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────▼──────────────────────────────────┐ │
│ │ Adaptive Timing Engine │ │
│ │ │ │
│ │ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │ │
│ │ │ User Pattern │ │ Response Gap │ │ Hesitation │ │ │
│ │ │ Learning │ │ Calibration │ │ Detection │ │ │
│ │ └───────────────┘ └───────────────┘ └───────────────┘ │ │
│ │ │ │
│ └──────────────────────────────┬──────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────▼──────────────────────────────────┐ │
│ │ Conversational Fillers │ │
│ │ │ │
│ │ "So..." "Well..." "Let me see..." "Hmm..." [thinking] │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Implementation Details
1. Turn-Taking Manager
Location: services/api-gateway/app/services/turn_taking_manager.py
""" Turn-Taking Manager Manages conversational turn-taking between user and AI. Detects end-of-turn signals and yield points for natural flow. """ from enum import Enum from dataclasses import dataclass, field from typing import Optional, List, Callable import asyncio import time class TurnState(str, Enum): """Current turn state.""" USER_SPEAKING = "user_speaking" USER_FINISHED = "user_finished" AI_THINKING = "ai_thinking" AI_SPEAKING = "ai_speaking" AI_YIELDING = "ai_yielding" # AI pausing for potential user input IDLE = "idle" class EndOfTurnSignal(str, Enum): """Detected end-of-turn signals.""" SILENCE = "silence" # Long pause FALLING_INTONATION = "falling_intonation" # Voice pitch drops COMPLETE_SENTENCE = "complete_sentence" # Syntactically complete QUESTION_MARKER = "question_marker" # Rising intonation / "?" EXPLICIT_HANDOFF = "explicit_handoff" # "What do you think?" BACKCHANNEL_REQUEST = "backchannel_request" # Trailing "right?", "you know?" @dataclass class TurnEvent: """A turn-taking event.""" timestamp: float previous_state: TurnState new_state: TurnState signal: Optional[EndOfTurnSignal] = None confidence: float = 1.0 metadata: dict = field(default_factory=dict) class TurnTakingManager: """ Manages natural turn-taking in voice conversations. Features: - End-of-turn detection with multiple signals - Yield point detection for AI speech - Overlap handling - Turn history tracking """ # Timing thresholds (ms) SILENCE_THRESHOLD_SHORT = 300 # Brief pause (continue listening) SILENCE_THRESHOLD_MEDIUM = 700 # Possible end of turn SILENCE_THRESHOLD_LONG = 1200 # Definite end of turn # AI yield points (opportunities to yield floor to user) YIELD_AFTER_QUESTION = True YIELD_AFTER_LIST_ITEM = True YIELD_AFTER_PARAGRAPH = True def __init__( self, on_turn_change: Optional[Callable[[TurnEvent], None]] = None, ): self._state = TurnState.IDLE self._on_turn_change = on_turn_change self._turn_history: List[TurnEvent] = [] self._last_activity_time = time.monotonic() self._user_pattern_stats = UserPatternStats() @property def state(self) -> TurnState: return self._state def user_started_speaking(self) -> TurnEvent: """Called when VAD detects user speech start.""" return self._transition( TurnState.USER_SPEAKING, signal=None, ) def user_silence_detected(self, silence_ms: int) -> Optional[TurnEvent]: """ Called periodically during user silence. Returns TurnEvent if this silence indicates end of turn. """ if self._state != TurnState.USER_SPEAKING: return None # Use adaptive threshold based on user patterns threshold = self._user_pattern_stats.get_silence_threshold() if silence_ms >= threshold: return self._transition( TurnState.USER_FINISHED, signal=EndOfTurnSignal.SILENCE, confidence=min(silence_ms / self.SILENCE_THRESHOLD_LONG, 1.0), ) return None def user_sentence_complete( self, transcript: str, is_question: bool = False, ) -> Optional[TurnEvent]: """ Called when syntactic analysis detects complete sentence. 
""" if self._state != TurnState.USER_SPEAKING: return None signal = ( EndOfTurnSignal.QUESTION_MARKER if is_question else EndOfTurnSignal.COMPLETE_SENTENCE ) return self._transition( TurnState.USER_FINISHED, signal=signal, metadata={"transcript": transcript, "is_question": is_question}, ) def ai_started_thinking(self) -> TurnEvent: """Called when LLM processing begins.""" return self._transition(TurnState.AI_THINKING) def ai_started_speaking(self) -> TurnEvent: """Called when TTS audio starts playing.""" return self._transition(TurnState.AI_SPEAKING) def ai_yield_point(self, reason: str) -> TurnEvent: """ Called at natural yield points in AI speech. Yield points: - After asking a question - After each list item - After paragraph breaks - After "What do you think?" type phrases """ if self._state != TurnState.AI_SPEAKING: return self._state return self._transition( TurnState.AI_YIELDING, metadata={"yield_reason": reason}, ) def ai_finished_speaking(self) -> TurnEvent: """Called when AI finishes speaking.""" return self._transition(TurnState.IDLE) def user_interrupted(self) -> TurnEvent: """Called when user barges in during AI speech.""" return self._transition( TurnState.USER_SPEAKING, signal=None, metadata={"was_interruption": True}, ) def _transition( self, new_state: TurnState, signal: Optional[EndOfTurnSignal] = None, confidence: float = 1.0, metadata: dict = None, ) -> TurnEvent: """Perform state transition and notify listeners.""" event = TurnEvent( timestamp=time.monotonic(), previous_state=self._state, new_state=new_state, signal=signal, confidence=confidence, metadata=metadata or {}, ) self._state = new_state self._turn_history.append(event) self._last_activity_time = event.timestamp # Keep history bounded if len(self._turn_history) > 100: self._turn_history = self._turn_history[-50:] if self._on_turn_change: self._on_turn_change(event) return event def get_conversation_stats(self) -> dict: """Get statistics about turn-taking patterns.""" user_turns = [e for e in self._turn_history if e.new_state == TurnState.USER_SPEAKING] ai_turns = [e for e in self._turn_history if e.new_state == TurnState.AI_SPEAKING] interruptions = [e for e in self._turn_history if e.metadata.get("was_interruption")] return { "user_turns": len(user_turns), "ai_turns": len(ai_turns), "interruptions": len(interruptions), "avg_user_turn_duration_ms": self._user_pattern_stats.avg_turn_duration_ms, "learned_silence_threshold_ms": self._user_pattern_stats.get_silence_threshold(), } @dataclass class UserPatternStats: """ Tracks user's speech patterns for adaptive timing. 
Learns: - Typical pause duration within utterances - Typical turn duration - Speaking rate """ pause_durations: List[int] = field(default_factory=list) turn_durations: List[int] = field(default_factory=list) @property def avg_pause_ms(self) -> int: if not self.pause_durations: return 500 return int(sum(self.pause_durations) / len(self.pause_durations)) @property def avg_turn_duration_ms(self) -> int: if not self.turn_durations: return 3000 return int(sum(self.turn_durations) / len(self.turn_durations)) def record_pause(self, duration_ms: int) -> None: """Record a within-utterance pause.""" self.pause_durations.append(duration_ms) if len(self.pause_durations) > 20: self.pause_durations.pop(0) def record_turn(self, duration_ms: int) -> None: """Record a complete turn duration.""" self.turn_durations.append(duration_ms) if len(self.turn_durations) > 20: self.turn_durations.pop(0) def get_silence_threshold(self) -> int: """ Get adaptive silence threshold for end-of-turn detection. Based on user's typical pause patterns. """ # Use 1.5x the average pause as threshold # Bounded between 400-1200ms threshold = int(self.avg_pause_ms * 1.5) return max(400, min(1200, threshold))
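A minimal sketch of how the VAD/STT layer might drive the manager; the wiring shown here is illustrative, and the concrete event names on the pipeline side are assumptions:

```python
# Illustrative wiring only; the VAD/STT callback points are hypothetical.
from app.services.turn_taking_manager import TurnTakingManager, TurnEvent


def log_turn(event: TurnEvent) -> None:
    # Simple listener; the real pipeline would forward this to the client.
    print(f"{event.previous_state.value} -> {event.new_state.value} ({event.signal})")


manager = TurnTakingManager(on_turn_change=log_turn)

# VAD reports that the user started speaking.
manager.user_started_speaking()

# VAD reports ongoing silence; a long enough pause ends the turn
# using the adaptive threshold learned from the user's pauses.
event = manager.user_silence_detected(silence_ms=900)
if event is not None:
    manager.ai_started_thinking()
```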
2. Prosody Controller
Location: services/api-gateway/app/services/prosody_controller.py
""" Prosody Controller Controls speech prosody (pitch, rate, emphasis) for natural delivery. Generates SSML-like annotations for ElevenLabs TTS. """ from enum import Enum from dataclasses import dataclass from typing import List, Optional, Tuple import re class EmotionalTone(str, Enum): """Emotional tone for speech.""" NEUTRAL = "neutral" WARM = "warm" CONCERNED = "concerned" ENTHUSIASTIC = "enthusiastic" THOUGHTFUL = "thoughtful" ENCOURAGING = "encouraging" APOLOGETIC = "apologetic" class EmphasisLevel(str, Enum): """Level of emphasis for words/phrases.""" NONE = "none" MODERATE = "moderate" STRONG = "strong" @dataclass class ProsodyMarker: """A prosody annotation for a text segment.""" start_pos: int end_pos: int tone: Optional[EmotionalTone] = None emphasis: EmphasisLevel = EmphasisLevel.NONE rate_multiplier: float = 1.0 # 0.5 = half speed, 2.0 = double pause_before_ms: int = 0 pause_after_ms: int = 0 @dataclass class ProsodyAnnotatedText: """Text with prosody annotations.""" text: str markers: List[ProsodyMarker] overall_tone: EmotionalTone overall_rate: float class ProsodyController: """ Analyzes text and generates prosody annotations. Features: - Emotion detection from content - Emphasis marking for key words - Rate adjustment for complexity - Natural pause insertion """ # Patterns for prosody detection QUESTION_PATTERNS = [ r'\?$', r'\b(what|where|when|why|how|who|which|can|could|would|should)\b', ] EMPHASIS_PATTERNS = [ (r'\b(important|critical|essential|key|main|primary)\b', EmphasisLevel.STRONG), (r'\b(note|remember|consider|notice)\b', EmphasisLevel.MODERATE), (r'\*\*([^*]+)\*\*', EmphasisLevel.STRONG), # Markdown bold ] PAUSE_PATTERNS = [ (r'\.\s+', 400), # Period (r',\s+', 150), # Comma (r':\s+', 300), # Colon (r';\s+', 250), # Semicolon (r'\n\n', 600), # Paragraph ] TONE_KEYWORDS = { EmotionalTone.WARM: ['welcome', 'glad', 'happy', 'pleased', 'wonderful'], EmotionalTone.CONCERNED: ['sorry', 'unfortunately', 'problem', 'issue', 'concern'], EmotionalTone.ENTHUSIASTIC: ['great', 'excellent', 'amazing', 'fantastic', 'exciting'], EmotionalTone.THOUGHTFUL: ['consider', 'perhaps', 'maybe', 'interesting', 'curious'], EmotionalTone.ENCOURAGING: ['you can', 'try', 'keep going', 'good job', 'well done'], EmotionalTone.APOLOGETIC: ['sorry', 'apologies', 'my mistake', 'I apologize'], } def __init__(self): self._compiled_patterns = { 'questions': [re.compile(p, re.IGNORECASE) for p in self.QUESTION_PATTERNS], 'emphasis': [(re.compile(p, re.IGNORECASE), level) for p, level in self.EMPHASIS_PATTERNS], 'pauses': [(re.compile(p), duration) for p, duration in self.PAUSE_PATTERNS], } def analyze(self, text: str, context: Optional[str] = None) -> ProsodyAnnotatedText: """ Analyze text and generate prosody annotations. 
Args: text: The text to analyze context: Optional conversation context Returns: ProsodyAnnotatedText with markers """ markers = [] # Detect overall tone overall_tone = self._detect_tone(text) # Find emphasis points for pattern, level in self._compiled_patterns['emphasis']: for match in pattern.finditer(text): markers.append(ProsodyMarker( start_pos=match.start(), end_pos=match.end(), emphasis=level, )) # Find pause points for pattern, duration in self._compiled_patterns['pauses']: for match in pattern.finditer(text): # Add pause after the punctuation markers.append(ProsodyMarker( start_pos=match.end(), end_pos=match.end(), pause_before_ms=duration, )) # Detect questions (for rising intonation) is_question = any(p.search(text) for p in self._compiled_patterns['questions']) # Adjust rate based on complexity overall_rate = self._calculate_rate(text) return ProsodyAnnotatedText( text=text, markers=sorted(markers, key=lambda m: m.start_pos), overall_tone=overall_tone, overall_rate=overall_rate, ) def _detect_tone(self, text: str) -> EmotionalTone: """Detect the emotional tone of the text.""" text_lower = text.lower() scores = {} for tone, keywords in self.TONE_KEYWORDS.items(): score = sum(1 for kw in keywords if kw in text_lower) scores[tone] = score if max(scores.values()) > 0: return max(scores, key=scores.get) return EmotionalTone.NEUTRAL def _calculate_rate(self, text: str) -> float: """ Calculate speaking rate based on content complexity. Complex content (technical terms, numbers) spoken slower. Simple greetings/acknowledgments spoken at normal pace. """ # Count complexity indicators technical_terms = len(re.findall(r'\b[A-Z]{2,}\b', text)) # Acronyms numbers = len(re.findall(r'\d+', text)) long_words = len(re.findall(r'\b\w{10,}\b', text)) complexity = technical_terms * 2 + numbers + long_words # Adjust rate (0.85 to 1.1) if complexity > 5: return 0.85 # Slower for complex content elif complexity > 2: return 0.92 else: return 1.0 def to_elevenlabs_params( self, annotated: ProsodyAnnotatedText, ) -> dict: """ Convert prosody annotations to ElevenLabs parameters. ElevenLabs doesn't support SSML, so we adjust: - stability (0-1): lower for more expressive - similarity_boost (0-1): voice clarity - style (0-1): expressiveness """ # Map emotional tone to ElevenLabs parameters tone_params = { EmotionalTone.NEUTRAL: {"stability": 0.65, "style": 0.15}, EmotionalTone.WARM: {"stability": 0.55, "style": 0.25}, EmotionalTone.CONCERNED: {"stability": 0.70, "style": 0.20}, EmotionalTone.ENTHUSIASTIC: {"stability": 0.45, "style": 0.35}, EmotionalTone.THOUGHTFUL: {"stability": 0.70, "style": 0.15}, EmotionalTone.ENCOURAGING: {"stability": 0.50, "style": 0.30}, EmotionalTone.APOLOGETIC: {"stability": 0.75, "style": 0.10}, } params = tone_params.get(annotated.overall_tone, tone_params[EmotionalTone.NEUTRAL]) return { "stability": params["stability"], "similarity_boost": 0.80, "style": params["style"], "speaking_rate": annotated.overall_rate, }
3. Conversational Filler Service
Location: services/api-gateway/app/services/conversational_filler_service.py
""" Conversational Filler Service Generates natural filler phrases during AI thinking/processing. Makes the AI feel more human-like by not having awkward silences. """ from enum import Enum from dataclasses import dataclass from typing import List, Optional import random import time class FillerType(str, Enum): """Type of conversational filler.""" THINKING = "thinking" # "Hmm...", "Let me see..." SEARCHING = "searching" # "Looking that up...", "Checking..." PROCESSING = "processing" # "One moment...", "Just a second..." TRANSITIONING = "transitioning" # "So...", "Well..." ACKNOWLEDGING = "acknowledging" # "Right...", "I see..." @dataclass class FillerPhrase: """A filler phrase with metadata.""" text: str filler_type: FillerType duration_estimate_ms: int # How long this typically takes to say can_interrupt: bool # Can be cut off if response is ready # Filler phrase library FILLER_LIBRARY = { FillerType.THINKING: [ FillerPhrase("Hmm...", FillerType.THINKING, 400, True), FillerPhrase("Let me think...", FillerType.THINKING, 600, True), FillerPhrase("Let me see...", FillerType.THINKING, 500, True), FillerPhrase("That's an interesting question...", FillerType.THINKING, 900, True), ], FillerType.SEARCHING: [ FillerPhrase("Looking that up...", FillerType.SEARCHING, 600, True), FillerPhrase("Let me find that for you...", FillerType.SEARCHING, 800, True), FillerPhrase("Checking the knowledge base...", FillerType.SEARCHING, 900, True), ], FillerType.PROCESSING: [ FillerPhrase("One moment...", FillerType.PROCESSING, 500, True), FillerPhrase("Just a second...", FillerType.PROCESSING, 600, True), FillerPhrase("Bear with me...", FillerType.PROCESSING, 500, True), ], FillerType.TRANSITIONING: [ FillerPhrase("So...", FillerType.TRANSITIONING, 300, False), FillerPhrase("Well...", FillerType.TRANSITIONING, 300, False), FillerPhrase("Alright...", FillerType.TRANSITIONING, 400, False), ], FillerType.ACKNOWLEDGING: [ FillerPhrase("Right...", FillerType.ACKNOWLEDGING, 300, True), FillerPhrase("I see...", FillerType.ACKNOWLEDGING, 400, True), FillerPhrase("Understood...", FillerType.ACKNOWLEDGING, 500, True), ], } # Arabic fillers FILLER_LIBRARY_AR = { FillerType.THINKING: [ FillerPhrase("هممم...", FillerType.THINKING, 400, True), FillerPhrase("دعني أفكر...", FillerType.THINKING, 600, True), ], FillerType.SEARCHING: [ FillerPhrase("أبحث عن ذلك...", FillerType.SEARCHING, 700, True), FillerPhrase("لحظة...", FillerType.SEARCHING, 400, True), ], FillerType.PROCESSING: [ FillerPhrase("لحظة من فضلك...", FillerType.PROCESSING, 600, True), ], } class ConversationalFillerService: """ Manages conversational fillers during AI processing. Features: - Context-appropriate filler selection - Timing coordination with response generation - Avoids repetition - Language support """ def __init__(self): self._recent_fillers: List[str] = [] self._max_recent = 5 self._last_filler_time = 0 self._min_filler_interval_ms = 2000 # Don't spam fillers def should_play_filler( self, processing_duration_ms: int, has_tool_call: bool = False, expected_response_time_ms: Optional[int] = None, ) -> bool: """ Determine if a filler should be played. 
Args: processing_duration_ms: How long processing has taken has_tool_call: Whether a tool call is in progress expected_response_time_ms: Estimated time until response Returns: True if a filler should be played """ current_time = time.monotonic() * 1000 # Don't spam fillers if current_time - self._last_filler_time < self._min_filler_interval_ms: return False # Play filler if processing is taking a while if processing_duration_ms > 1500: return True # Play filler for tool calls (searching/processing) if has_tool_call and processing_duration_ms > 800: return True return False def select_filler( self, filler_type: FillerType, language: str = "en", ) -> Optional[FillerPhrase]: """ Select an appropriate filler phrase. Args: filler_type: Type of filler needed language: Language code Returns: Selected filler phrase, or None if none available """ # Get library for language library = FILLER_LIBRARY_AR if language == "ar" else FILLER_LIBRARY phrases = library.get(filler_type, []) if not phrases: return None # Filter out recent fillers available = [p for p in phrases if p.text not in self._recent_fillers] if not available: available = phrases # Reset if all used selected = random.choice(available) # Track recent usage self._recent_fillers.append(selected.text) if len(self._recent_fillers) > self._max_recent: self._recent_fillers.pop(0) self._last_filler_time = time.monotonic() * 1000 return selected def get_filler_for_context( self, is_tool_call: bool = False, tool_name: Optional[str] = None, is_complex_query: bool = False, language: str = "en", ) -> Optional[FillerPhrase]: """ Get a context-appropriate filler phrase. Args: is_tool_call: Whether a tool is being called tool_name: Name of the tool (for specific fillers) is_complex_query: Whether the query is complex language: Language code Returns: Appropriate filler phrase """ # Determine filler type based on context if is_tool_call: if tool_name and "search" in tool_name.lower(): filler_type = FillerType.SEARCHING else: filler_type = FillerType.PROCESSING elif is_complex_query: filler_type = FillerType.THINKING else: filler_type = FillerType.THINKING return self.select_filler(filler_type, language)
4. Frontend: Adaptive Timing Hook
Location: apps/web-app/src/hooks/useAdaptiveTiming.ts
/** * useAdaptiveTiming Hook * * Learns user's speech patterns and adapts timing thresholds. */ import { useCallback, useRef, useState, useEffect } from "react"; interface TimingStats { avgPauseMs: number; avgTurnDurationMs: number; silenceThresholdMs: number; turnCount: number; } interface UseAdaptiveTimingOptions { /** Initial silence threshold */ initialSilenceThresholdMs?: number; /** Maximum turns to track */ maxTurnsTracked?: number; /** Callback when threshold changes significantly */ onThresholdChange?: (newThreshold: number) => void; } export function useAdaptiveTiming(options: UseAdaptiveTimingOptions = {}) { const { initialSilenceThresholdMs = 700, maxTurnsTracked = 20, onThresholdChange } = options; const [stats, setStats] = useState<TimingStats>({ avgPauseMs: 500, avgTurnDurationMs: 3000, silenceThresholdMs: initialSilenceThresholdMs, turnCount: 0, }); const pauseDurationsRef = useRef<number[]>([]); const turnDurationsRef = useRef<number[]>([]); const turnStartTimeRef = useRef<number | null>(null); const lastThresholdRef = useRef(initialSilenceThresholdMs); /** * Called when user starts speaking. */ const onSpeechStart = useCallback(() => { turnStartTimeRef.current = Date.now(); }, []); /** * Called when user stops speaking (end of turn). */ const onSpeechEnd = useCallback(() => { if (turnStartTimeRef.current) { const duration = Date.now() - turnStartTimeRef.current; turnDurationsRef.current.push(duration); // Keep bounded if (turnDurationsRef.current.length > maxTurnsTracked) { turnDurationsRef.current.shift(); } // Update stats const avgTurn = turnDurationsRef.current.reduce((a, b) => a + b, 0) / turnDurationsRef.current.length; setStats((prev) => ({ ...prev, avgTurnDurationMs: avgTurn, turnCount: prev.turnCount + 1, })); turnStartTimeRef.current = null; } }, [maxTurnsTracked]); /** * Called when a pause is detected within user speech. */ const onPauseDetected = useCallback( (pauseMs: number) => { pauseDurationsRef.current.push(pauseMs); // Keep bounded if (pauseDurationsRef.current.length > maxTurnsTracked) { pauseDurationsRef.current.shift(); } // Calculate new threshold (1.5x average pause) const avgPause = pauseDurationsRef.current.reduce((a, b) => a + b, 0) / pauseDurationsRef.current.length; const newThreshold = Math.max(400, Math.min(1200, avgPause * 1.5)); setStats((prev) => ({ ...prev, avgPauseMs: avgPause, silenceThresholdMs: newThreshold, })); // Notify if threshold changed significantly if (Math.abs(newThreshold - lastThresholdRef.current) > 100) { onThresholdChange?.(newThreshold); lastThresholdRef.current = newThreshold; } }, [maxTurnsTracked, onThresholdChange], ); /** * Reset learned patterns (e.g., for new session). */ const reset = useCallback(() => { pauseDurationsRef.current = []; turnDurationsRef.current = []; turnStartTimeRef.current = null; lastThresholdRef.current = initialSilenceThresholdMs; setStats({ avgPauseMs: 500, avgTurnDurationMs: 3000, silenceThresholdMs: initialSilenceThresholdMs, turnCount: 0, }); }, [initialSilenceThresholdMs]); return { stats, onSpeechStart, onSpeechEnd, onPauseDetected, reset, silenceThresholdMs: stats.silenceThresholdMs, }; }
5. Integration into Voice Pipeline
Location: Update services/api-gateway/app/services/voice_pipeline_service.py
Add these integrations:
class VoicePipelineService:
    """Updated with Phase 2 & 3 components."""

    def __init__(self, ...):
        # ... existing init ...

        # Phase 2: Smart acknowledgments
        self._smart_ack_service = SmartAcknowledgmentService(
            phrase_cache=phrase_cache_service,
        )

        # Phase 3: Natural conversational flow
        self._turn_manager = TurnTakingManager(
            on_turn_change=self._handle_turn_change,
        )
        self._prosody_controller = ProsodyController()
        self._filler_service = ConversationalFillerService()

    async def _handle_turn_change(self, event: TurnEvent) -> None:
        """Handle turn-taking state changes."""
        await self._on_message(
            PipelineMessage(
                type="voice.turn",
                data={
                    "state": event.new_state.value,
                    "previous_state": event.previous_state.value,
                    "signal": event.signal.value if event.signal else None,
                    "confidence": event.confidence,
                },
            )
        )

    async def _maybe_play_filler(
        self,
        processing_start_time: float,
        has_tool_call: bool,
        tool_name: Optional[str] = None,
    ) -> None:
        """Play a conversational filler if appropriate."""
        processing_ms = int((time.monotonic() - processing_start_time) * 1000)

        if self._filler_service.should_play_filler(
            processing_duration_ms=processing_ms,
            has_tool_call=has_tool_call,
        ):
            filler = self._filler_service.get_filler_for_context(
                is_tool_call=has_tool_call,
                tool_name=tool_name,
                language=self._config.language,
            )

            if filler:
                # Synthesize and send filler audio
                audio_data = await self._phrase_cache.synthesize_and_cache(
                    text=filler.text,
                    voice_id=self._config.voice_id,
                    language=self._config.language,
                )

                await self._on_message(
                    PipelineMessage(
                        type="voice.filler",
                        data={
                            "text": filler.text,
                            "type": filler.filler_type.value,
                            "can_interrupt": filler.can_interrupt,
                            "audio": base64.b64encode(audio_data).decode(),
                        },
                    )
                )

    async def _synthesize_with_prosody(
        self,
        text: str,
    ) -> AsyncIterator[bytes]:
        """Synthesize text with prosody control."""
        # Analyze prosody
        annotated = self._prosody_controller.analyze(text)
        params = self._prosody_controller.to_elevenlabs_params(annotated)

        # Synthesize with adjusted parameters
        async for chunk in self._elevenlabs.synthesize_stream(
            text=text,
            voice_id=self._config.voice_id,
            stability=params["stability"],
            similarity_boost=params["similarity_boost"],
            style=params["style"],
        ):
            yield chunk
API Endpoints
Phase 2 Endpoints
POST /api/voice/smart-acknowledgment
Request: { transcript, duration_ms, during_ai_speech, voice_id, language }
Response: { phrase, intent, confidence, audio (base64) }
GET /api/voice/phrase-audio
Query: phrase, voice_id, language
Response: audio/mpeg
POST /api/voice/warm-phrase-cache
Request: { voice_id, languages }
Response: { cached_count, total_count }
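A possible FastAPI shape for the smart-acknowledgment endpoint is sketched below; the router wiring, the `get_smart_ack_service` dependency provider, and the JSON response layout are assumptions about how the service above would be exposed, not finalized API code:

```python
# Sketch only: get_smart_ack_service() is a hypothetical dependency provider;
# the real wiring would construct SmartAcknowledgmentService with the shared
# PhraseCacheService.
import base64
from typing import Optional

from fastapi import APIRouter, Depends
from pydantic import BaseModel

from app.core.voice_constants import DEFAULT_VOICE_ID
from app.services.smart_acknowledgment_service import SmartAcknowledgmentService

router = APIRouter(prefix="/api/voice")


class SmartAckRequest(BaseModel):
    transcript: str
    duration_ms: int
    during_ai_speech: bool
    voice_id: Optional[str] = None
    language: str = "en"


def get_smart_ack_service() -> SmartAcknowledgmentService:
    # Hypothetical provider; assumed to return an app-scoped instance.
    raise NotImplementedError


@router.post("/smart-acknowledgment")
async def smart_acknowledgment(
    req: SmartAckRequest,
    service: SmartAcknowledgmentService = Depends(get_smart_ack_service),
):
    result = await service.generate_acknowledgment(
        transcript=req.transcript,
        duration_ms=req.duration_ms,
        during_ai_speech=req.during_ai_speech,
        voice_id=req.voice_id or DEFAULT_VOICE_ID,
        language=req.language,
    )
    return {
        "phrase": result.phrase,
        "intent": result.intent,
        "confidence": result.confidence,
        "latency_ms": result.latency_ms,
        "audio": base64.b64encode(result.audio_data or b"").decode(),
    }
```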
Phase 3 Endpoints
POST /api/voice/analyze-prosody
Request: { text, context }
Response: { tone, rate, emphasis_points }
GET /api/voice/timing-stats
Response: { avg_pause_ms, avg_turn_ms, threshold_ms }
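A similarly hedged sketch for the prosody endpoint; the field names mirror the response shape listed above, and the module-level controller instance is an assumption:

```python
# Sketch only: response field names follow the table above.
from typing import Optional

from fastapi import APIRouter
from pydantic import BaseModel

from app.services.prosody_controller import EmphasisLevel, ProsodyController

router = APIRouter(prefix="/api/voice")
_prosody = ProsodyController()  # assumed app-scoped instance


class ProsodyRequest(BaseModel):
    text: str
    context: Optional[str] = None


@router.post("/analyze-prosody")
async def analyze_prosody(req: ProsodyRequest):
    annotated = _prosody.analyze(req.text, context=req.context)
    return {
        "tone": annotated.overall_tone.value,
        "rate": annotated.overall_rate,
        "emphasis_points": [
            {"start": m.start_pos, "end": m.end_pos, "level": m.emphasis.value}
            for m in annotated.markers
            if m.emphasis != EmphasisLevel.NONE
        ],
    }
```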
Configuration
voiceSettingsStore.ts additions
// Phase 2: Smart Acknowledgments
smartAcknowledgmentsEnabled: boolean;  // Default: true
acknowledgmentVolume: number;          // 0-100, default: 70

// Phase 3: Natural Flow
conversationalFillersEnabled: boolean; // Default: true
fillerVolume: number;                  // 0-100, default: 50
adaptiveTimingEnabled: boolean;        // Default: true
prosodyEnhancementEnabled: boolean;    // Default: true
Testing Plan
Phase 2 Tests
- Intent Classification Accuracy (see the test sketch after this list)
  - Test with 100+ sample transcripts
  - Target: >80% intent accuracy
- Phrase Cache Performance
  - Measure cache hit rate
  - Target: >95% hit rate after warm-up
- End-to-End Latency
  - From barge-in detection to acknowledgment playback
  - Target: <150ms
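A hedged pytest sketch for the intent-accuracy test referenced above; `tests/fixtures/intent_samples.json` is a hypothetical fixture of labeled (transcript, expected_intent) pairs standing in for the 100+ sample set:

```python
# Sketch only: the fixture path and sample format are assumptions.
import json

import pytest

from app.services.acknowledgment_intent_classifier import (
    AcknowledgmentIntentClassifier,
)


@pytest.fixture(scope="module")
def samples():
    # Hypothetical fixture: list of {"transcript", "expected_intent", ...} dicts.
    with open("tests/fixtures/intent_samples.json") as f:
        return json.load(f)


def test_intent_accuracy_above_threshold(samples):
    classifier = AcknowledgmentIntentClassifier()
    correct = 0
    for sample in samples:
        result = classifier.classify(
            transcript=sample["transcript"],
            duration_ms=sample.get("duration_ms", 1000),
            during_ai_speech=sample.get("during_ai_speech", False),
        )
        if result.intent.value == sample["expected_intent"]:
            correct += 1
    accuracy = correct / len(samples)
    assert accuracy > 0.80, f"Intent accuracy {accuracy:.2%} below 80% target"
```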
Phase 3 Tests
- Turn-Taking Accuracy
  - Measure false end-of-turn detections
  - Target: <10% false positives
- Filler Timing
  - Ensure fillers don't overlap with responses
  - Measure user perception of naturalness
- Prosody Quality
  - A/B test with/without prosody enhancement
  - User preference surveys
Rollout Plan
Phase 2: Smart Acknowledgments
- Week 1: Implement backend services
- Week 2: Integrate with voice pipeline
- Week 3: Frontend integration
- Week 4: Testing and tuning
Phase 3: Natural Conversational Flow
- Week 1: Turn-taking manager
- Week 2: Prosody controller
- Week 3: Filler service
- Week 4: Frontend adaptive timing
- Week 5: Integration testing