# Adaptive VAD Presets

Voice Mode v4.1 introduces user-tunable Voice Activity Detection (VAD) presets to accommodate different speaking styles, environments, and accessibility needs.
## Overview

The adaptive VAD system allows users to choose from presets optimized for different scenarios:

```
┌─────────────────────────────────────────────────────────────────┐
│                     VAD Preset Selection                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌────────────┐      ┌────────────┐      ┌────────────┐        │
│   │ Sensitive  │      │  Balanced  │      │  Relaxed   │        │
│   │  (Quiet)   │      │ (Default)  │      │  (Noisy)   │        │
│   └────────────┘      └────────────┘      └────────────┘        │
│                                                                 │
│   Energy: -45 dB      Energy: -35 dB      Energy: -25 dB        │
│   Silence: 300ms      Silence: 500ms      Silence: 800ms        │
│   Min: 100ms          Min: 150ms          Min: 200ms            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### Thinker-Talker Pipeline Integration

```mermaid
sequenceDiagram
    participant Mic as Microphone
    participant VAD as Adaptive VAD
    participant STT
    participant Thinker
    participant Talker

    Mic->>VAD: Audio stream (16kHz PCM)
    Note over VAD: Apply preset thresholds<br/>Energy: -45 to -25 dB
    loop Voice Activity Detection
        VAD->>VAD: Check energy > threshold
        alt Speech detected
            VAD->>VAD: Buffer speech segment
        else Silence > preset duration
            VAD->>STT: Speech segment complete
        end
    end
    STT->>Thinker: Transcript
    Thinker->>Talker: Response text
    Talker-->>Mic: Playback (VAD pauses during output)
```

### VAD Preset Selection Flow

```mermaid
flowchart LR
    subgraph User Settings
        A[Voice Settings Panel]
    end
    subgraph Presets
        S["🤫 Sensitive<br/>-45dB / 300ms"]
        B["⚖️ Balanced<br/>-35dB / 500ms"]
        R["🔊 Relaxed<br/>-25dB / 800ms"]
        AC["♿ Accessibility<br/>-42dB / 1000ms"]
        C["⚙️ Custom<br/>User-defined"]
    end
    subgraph Backend
        VAD[Adaptive VAD Service]
        Pipeline[Voice Pipeline]
    end
    A --> S
    A --> B
    A --> R
    A --> AC
    A --> C
    S --> VAD
    B --> VAD
    R --> VAD
    AC --> VAD
    C --> VAD
    VAD --> Pipeline
    style B fill:#90EE90
```

### Cross-Link to Voice Settings

See [Voice First Input Bar](./voice-first-input-bar.md) for UI implementation details.

See [RTL Support](./rtl-support-guide.md) for right-to-left language support in the voice interface.
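The preset-to-service mapping in the flow above can be sketched as a small registry with a safe fallback to the default. This is an illustrative sketch using hypothetical names (`VADPresetConfig`, `resolve_preset`), not the actual VoiceAssist API; the numeric values come from the preset tables in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VADPresetConfig:
    """Illustrative preset parameters (field names are hypothetical)."""
    energy_threshold_db: float
    silence_duration_ms: int
    min_speech_duration_ms: int


# Values taken from the preset comparison table below.
PRESETS = {
    "sensitive": VADPresetConfig(-45, 300, 100),
    "balanced": VADPresetConfig(-35, 500, 150),
    "relaxed": VADPresetConfig(-25, 800, 200),
    "accessibility": VADPresetConfig(-42, 1000, 80),
}


def resolve_preset(name: str) -> VADPresetConfig:
    """Fall back to the balanced default for unknown preset names."""
    return PRESETS.get(name, PRESETS["balanced"])


print(resolve_preset("relaxed").energy_threshold_db)  # → -25
```

Keeping the fallback in one place mirrors the backend default of `balanced` used elsewhere in this guide.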
## Choosing the Right VAD Preset

### Quick Selection Guide

```mermaid
flowchart TD
    Q1{"Where are you<br/>using voice mode?"}
    Q1 -->|Quiet room| S[🤫 Sensitive]
    Q1 -->|Office/Home| B[⚖️ Balanced]
    Q1 -->|Public/Noisy| R[🔊 Relaxed]
    Q1 -->|Speech difficulties| A[♿ Accessibility]
    Q1 -->|Need specific tuning| C[⚙️ Custom]

    S --> S1["Best for:<br/>• Home office<br/>• Private rooms<br/>• Close mic"]
    B --> B1["Best for:<br/>• Normal offices<br/>• Mixed environments<br/>• Default choice"]
    R --> R1["Best for:<br/>• Open offices<br/>• Public spaces<br/>• Distant mic"]
    A --> A1["Best for:<br/>• Speech impairments<br/>• Stuttering<br/>• Slow speech"]
    C --> C1["Best for:<br/>• Power users<br/>• Specific needs<br/>• Testing"]

    style S fill:#E6F3FF
    style B fill:#90EE90
    style R fill:#FFE4B5
    style A fill:#DDA0DD
    style C fill:#D3D3D3
```

### Preset Comparison Table

| Preset               | Energy Threshold | Silence Duration | Min Speech   | Best For                          |
| -------------------- | ---------------- | ---------------- | ------------ | --------------------------------- |
| 🤫 **Sensitive**     | -45 dB           | 300 ms           | 100 ms       | Quiet environments, soft speakers |
| ⚖️ **Balanced**      | -35 dB           | 500 ms           | 150 ms       | General use (recommended default) |
| 🔊 **Relaxed**       | -25 dB           | 800 ms           | 200 ms       | Noisy environments, distant mics  |
| ♿ **Accessibility** | -42 dB           | 1000 ms          | 80 ms        | Speech impairments, slow speakers |
| ⚙️ **Custom**        | User-defined     | User-defined     | User-defined | Power users, specific needs       |

## Understanding VAD Parameters

### Energy Threshold (dB)

The **energy threshold** determines how loud speech must be to be detected:

```
Sound Level (dB)    Example
─────────────────────────────────
-50 dB              Very soft whisper
-45 dB              Soft speech / quiet room
-35 dB              Normal conversation
-25 dB              Raised voice
-20 dB              Loud speech

More negative = More sensitive (detects softer sounds)
Less negative = Less sensitive (requires louder speech)
```

**Recommendations:**

- **-45 dB**: Use in quiet environments or with soft speakers
- **-35 dB**: A good default for most situations
- **-25 dB**: Use when background noise is present

### Silence Duration (ms)

The **silence duration** determines how long the detector waits after speech stops before finalizing a segment:

```
Duration    Effect
─────────────────────────────────
300 ms      Quick response, may cut off pauses
500 ms      Balanced (recommended default)
800 ms      Tolerates longer pauses
1000 ms     For speakers who pause frequently
1500 ms     Maximum tolerance for hesitant speech
```

**Trade-offs:**

- **Shorter (< 400 ms)**: Faster response but may interrupt natural pauses
- **Medium (400-600 ms)**: A good balance for most speakers
- **Longer (> 700 ms)**: Better for thoughtful speech but slower response
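The two parameters interact as follows: per-frame energy decides when speech starts, and the silence timer decides when it ends. A minimal, self-contained sketch of that loop, with illustrative helper names rather than the production service:

```python
import math


def frame_db(samples: list[float]) -> float:
    """RMS energy of one frame of [-1, 1] PCM samples, in dBFS."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-10))  # clamp to avoid log(0)


def detect_segments(frames, threshold_db=-35.0, silence_ms=500, frame_ms=20):
    """Yield (start_ms, end_ms) speech segments from per-frame audio."""
    start = None   # start time of the segment currently being buffered
    silent = 0     # silence accumulated inside the current segment
    for i, frame in enumerate(frames):
        t = i * frame_ms
        if frame_db(frame) > threshold_db:
            start = t if start is None else start  # speech begins/continues
            silent = 0
        elif start is not None:
            silent += frame_ms
            if silent >= silence_ms:  # pause long enough: finalize segment
                yield start, t - silent + frame_ms
                start, silent = None, 0
    if start is not None:  # audio ended mid-segment
        yield start, len(frames) * frame_ms


# 200 ms of speech at -20 dB followed by 600 ms of near-silence:
loud = [[0.1] * 16] * 10
quiet = [[0.001] * 16] * 30
print(list(detect_segments(loud + quiet)))  # → [(0, 200)]
```

With the Balanced defaults, the segment closes 500 ms after the last loud frame; raising `silence_ms` to 800 (Relaxed) would tolerate longer mid-sentence pauses at the cost of a slower response.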
### How Energy and Silence Work Together

```mermaid
sequenceDiagram
    participant Audio as Audio Input
    participant VAD as VAD Detector
    participant STT as Speech-to-Text

    Note over Audio,STT: Example with Balanced preset (-35 dB, 500 ms)

    Audio->>VAD: Audio chunk (-40 dB)
    Note over VAD: Below threshold (-35 dB)<br/>No speech detected
    Audio->>VAD: Audio chunk (-30 dB)
    Note over VAD: Above threshold!<br/>Speech started
    loop Speech continues
        Audio->>VAD: Audio chunks (-25 to -30 dB)
        Note over VAD: Buffering speech...
    end
    Audio->>VAD: Audio chunk (-45 dB)
    Note over VAD: Below threshold<br/>Start silence timer
    Note over VAD: 500 ms silence elapsed
    VAD->>STT: Speech segment complete
```

## Detailed Preset Explanations

### 1. Sensitive (Quiet Environment)

Optimized for quiet rooms with minimal background noise:

| Parameter           | Value  | Description                        |
| ------------------- | ------ | ---------------------------------- |
| Energy threshold    | -45 dB | Very low threshold for soft speech |
| Silence duration    | 300 ms | Quick end-of-speech detection      |
| Min speech duration | 100 ms | Captures short utterances          |
| Pre-speech buffer   | 200 ms | Captures speech start              |

**Best for:**

- Quiet home offices
- Private rooms
- Users with soft voices
- Close microphone positioning

### 2. Balanced (Default)

A general-purpose preset for typical environments:

| Parameter           | Value  | Description          |
| ------------------- | ------ | -------------------- |
| Energy threshold    | -35 dB | Standard threshold   |
| Silence duration    | 500 ms | Balanced response    |
| Min speech duration | 150 ms | Filters brief noises |
| Pre-speech buffer   | 250 ms | Good speech capture  |

**Best for:**

- Normal office environments
- Homes with moderate ambient noise
- Standard microphone distance

### 3. Relaxed (Noisy Environment)

Optimized for noisy environments or distant microphones:

| Parameter           | Value  | Description                    |
| ------------------- | ------ | ------------------------------ |
| Energy threshold    | -25 dB | Higher threshold filters noise |
| Silence duration    | 800 ms | Longer pause tolerance         |
| Min speech duration | 200 ms | Filters more transient noises  |
| Pre-speech buffer   | 300 ms | Extra buffer for clarity       |

**Best for:**

- Open offices
- Public spaces
- Users with microphones far from the mouth
- Background music/TV
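To make the threshold differences concrete: the same soft utterance at -40 dB counts as speech under Sensitive (-45 dB) but not under Balanced (-35 dB) or Relaxed (-25 dB). A small illustration using the values from the tables above (the `classify` helper is hypothetical, not the service API):

```python
# Energy thresholds from the preset tables in this document.
PRESET_THRESHOLDS_DB = {
    "sensitive": -45.0,
    "balanced": -35.0,
    "relaxed": -25.0,
}


def classify(level_db: float) -> dict[str, bool]:
    """Which presets would treat a frame at level_db as speech?"""
    return {name: level_db > thr for name, thr in PRESET_THRESHOLDS_DB.items()}


print(classify(-40.0))
# → {'sensitive': True, 'balanced': False, 'relaxed': False}
```

This is why a soft speaker in a quiet room should prefer Sensitive, while the same settings in an open office would fire constantly on background chatter.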
### 4. Custom (Advanced)

User-defined parameters for specific needs:

```python
custom_preset = VADPreset(
    name="custom",
    energy_threshold_db=-40,
    silence_duration_ms=400,
    min_speech_duration_ms=120,
    pre_speech_buffer_ms=200,
    post_speech_buffer_ms=150,
)
```

## Configuration

### Backend Configuration

```python
from app.services.adaptive_vad import AdaptiveVADService, VADPreset

# Get VAD service
vad_service = AdaptiveVADService()

# Set preset for user session
await vad_service.set_preset(
    session_id="session_123",
    preset="sensitive",
)

# Get current configuration
config = await vad_service.get_config(session_id="session_123")
print(f"Energy threshold: {config.energy_threshold_db} dB")
```

### User Settings Storage

VAD preferences are stored in the user profile:

```python
# Save user preference
await user_settings_service.update(
    user_id="user_123",
    settings={"vad_preset": "relaxed"},
)

# Load on session start
user_settings = await user_settings_service.get(user_id="user_123")
vad_preset = user_settings.get("vad_preset", "balanced")
```

### Environment Variables

```bash
# Default VAD preset
VAD_DEFAULT_PRESET=balanced

# Preset overrides (optional)
VAD_SENSITIVE_ENERGY_THRESHOLD=-45
VAD_SENSITIVE_SILENCE_DURATION=300
VAD_BALANCED_ENERGY_THRESHOLD=-35
VAD_RELAXED_ENERGY_THRESHOLD=-25

# Custom preset limits
VAD_MIN_ENERGY_THRESHOLD=-50
VAD_MAX_ENERGY_THRESHOLD=-20
VAD_MIN_SILENCE_DURATION=200
VAD_MAX_SILENCE_DURATION=1500
```

## Frontend Integration

### VAD Settings Component

```tsx
import { VADSettings } from "@/components/voice/VADSettings";

<VADSettings />;
```

### Preset Selector UI

```tsx
const VADPresetSelector = () => {
  const { preset, setPreset } = useVoiceSettings();

  return (
    <select
      value={preset}
      onChange={(e) => setPreset(e.target.value)}
      aria-label="Voice detection preset"
    >
      <option value="sensitive">🤫 Sensitive</option>
      <option value="balanced">⚖️ Balanced</option>
      <option value="relaxed">🔊 Relaxed</option>
      <option value="accessibility">♿ Accessibility</option>
      <option value="custom">⚙️ Custom</option>
    </select>
  );
};
```

### Advanced Controls

For power users, expose individual parameters:

```tsx
const AdvancedVADSettings = () => {
  const { config, updateConfig } = useVoiceSettings();

  return (
    <div>
      {/* Ranges mirror the VAD_MIN/MAX environment limits */}
      <Slider
        value={config.energy_threshold_db}
        min={-50}
        max={-20}
        onChange={(v) => updateConfig({ energy_threshold_db: v })}
        aria-label="Voice detection sensitivity"
      />
      <Slider
        value={config.silence_duration_ms}
        min={200}
        max={1500}
        onChange={(v) => updateConfig({ silence_duration_ms: v })}
        aria-label="How long to wait for a pause before ending"
      />
    </div>
  );
};
```

## Accessibility Considerations

### Speech Impairments

Users with speech impairments may benefit from:

- **Longer silence duration**: Allows more time between words
- **Lower minimum speech duration**: Captures shorter utterances
- **Larger pre-speech buffer**: Ensures the start of speech is captured

```python
accessibility_preset = VADPreset(
    name="accessibility",
    energy_threshold_db=-42,
    silence_duration_ms=1000,   # Long pause tolerance
    min_speech_duration_ms=80,  # Capture short sounds
    pre_speech_buffer_ms=400,   # Extra lead time
)
```

### Auto-Calibration

The system can auto-calibrate based on ambient noise:

```python
# Calibrate during session start
calibration = await vad_service.calibrate(
    audio_sample=ambient_audio,
    duration_ms=3000,
)

# Apply calibrated settings
await vad_service.set_calibrated_config(
    session_id="session_123",
    base_preset="balanced",
    noise_floor_db=calibration.noise_floor_db,
)
```

## Monitoring

### Prometheus Metrics

```python
# VAD activation accuracy
vad_false_positive_rate.labels(preset="sensitive").observe(0.05)
vad_false_negative_rate.labels(preset="relaxed").observe(0.08)

# Preset usage distribution
vad_preset_usage_total.labels(preset="balanced").inc()

# Speech detection latency
vad_detection_latency_ms.labels(preset="sensitive").observe(45)
```

### Logging

```python
logger.info("VAD configuration applied", extra={
    "session_id": session_id,
    "preset": "sensitive",
    "energy_threshold_db": -45,
    "silence_duration_ms": 300,
    "calibrated": True,
})
```

## Testing

### Unit Tests

```python
@pytest.mark.asyncio
async def test_sensitive_preset_detects_soft_speech():
    """Sensitive preset should detect soft speech at -40 dB."""
    vad = AdaptiveVADService()
    await vad.set_preset("session_1", "sensitive")

    # Generate soft speech audio (-40 dB)
    audio = generate_audio(speech="hello", volume_db=-40)
    result = await vad.process(audio)

    assert result.speech_detected is True
    assert result.segments[0].start_ms < 100


@pytest.mark.asyncio
async def test_relaxed_preset_filters_background_noise():
    """Relaxed preset should filter background noise at -30 dB."""
    vad = AdaptiveVADService()
    await vad.set_preset("session_1", "relaxed")

    # Generate background noise (-30 dB)
    audio = generate_noise(type="office", volume_db=-30)
    result = await vad.process(audio)

    assert result.speech_detected is False
```

### Integration Tests

```bash
# Run VAD preset tests
pytest tests/services/test_adaptive_vad.py -v

# Test with real audio samples
pytest tests/integration/test_vad_presets_e2e.py -v --audio-samples ./test_audio/
```

## Best Practices

1. **Start with Balanced**: Recommend the balanced preset for new users
2. **Offer calibration**: Prompt users to calibrate in noisy environments
3. **Persist preferences**: Save the preset choice per user, not per session
4. **Monitor false positives**: High false-positive rates suggest settings that are too sensitive
5. **Consider context**: Auto-switch to relaxed in detected noisy environments

## Related Documentation

- [Voice Mode v4.1 Overview](./voice-mode-v4-overview.md)
- [Latency Budgets Guide](./latency-budgets-guide.md)
- [Thinking Tone Settings](./thinking-tone-settings.md)