Operations Overview
Last Updated: 2025-11-27
This document provides a central hub for all operations-related documentation for VoiceAssist.
Quick Links
| Category | Document | Purpose |
|---|---|---|
| SLOs | SLO Definitions | Reliability targets and error budgets |
| Metrics | Business Metrics | Key performance indicators |
| Performance | Connection Pool Optimization | Database connection tuning |
Runbooks
All runbooks follow a standardized format with severity levels, step-by-step procedures, and verification steps.
| Runbook | Purpose | Primary Audience |
|---|---|---|
| Deployment | Deploy VoiceAssist to production | DevOps, Backend |
| Monitoring | Set up and manage observability stack | DevOps |
| Troubleshooting | Diagnose and fix common issues | DevOps, Backend |
| Incident Response | Handle production incidents | On-call, DevOps |
| Backup & Restore | Data backup and recovery procedures | DevOps |
| Scaling | Scale infrastructure for load | DevOps, Backend |
Compliance
| Document | Purpose |
|---|---|
| Analytics Data Policy | Data handling for analytics |
For HIPAA compliance, see Security & Compliance.
Incident Severity Levels
| Severity | Description | Response Time |
|---|---|---|
| P1 - Critical | Complete service outage, data loss risk | 15 minutes |
| P2 - High | Major feature broken, significant degradation | 1 hour |
| P3 - Medium | Minor feature broken, degraded performance | 4 hours |
| P4 - Low | Cosmetic issues, minimal impact | 24 hours |
Key SLOs
| Metric | Target | Measurement Window |
|---|---|---|
| API Availability | 99.9% | 30 days |
| Success Rate | 99.5% | 30 days |
| P95 Latency | < 200ms | 30 days |
| Error Rate | < 0.5% | 30 days |
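These targets can be spot-checked from the command line between full SLO reviews. The sketch below derives a rough error rate from the Prometheus-style `/metrics` endpoint referenced elsewhere in this document; the counter name `http_requests_total` and its `status` label are assumptions and may need adjusting to the metrics VoiceAssist actually exports.

```bash
#!/bin/bash
# Rough SLO spot-check (sketch). Assumes Prometheus-style counters such as
#   http_requests_total{status="200"} 1234
# Adjust the metric name / label to match the real exporter.
METRICS_URL="${METRICS_URL:-http://localhost:8000/metrics}"

# Sum all request counts and all 5xx counts since process start
TOTAL=$(curl -s "$METRICS_URL" | awk '/^http_requests_total\{/ {sum += $2} END {printf "%d", sum}')
ERRORS=$(curl -s "$METRICS_URL" | awk '/^http_requests_total\{.*status="5[0-9][0-9]"/ {sum += $2} END {printf "%d", sum}')

if [ "$TOTAL" -eq 0 ]; then
  echo "No request samples found - check the metric name"
  exit 1
fi

# Integer math in hundredths of a percent (avoids a bc dependency)
ERROR_PCT=$((ERRORS * 10000 / TOTAL))
printf "Requests: %s, 5xx errors: %s (approx %d.%02d%% error rate)\n" \
  "$TOTAL" "$ERRORS" $((ERROR_PCT / 100)) $((ERROR_PCT % 100))

# SLO: error rate < 0.5% (i.e. < 50 hundredths of a percent)
if [ "$ERROR_PCT" -lt 50 ]; then
  echo "Within error-rate SLO (< 0.5%)"
else
  echo "ERROR BUDGET AT RISK: error rate exceeds 0.5%"
fi
```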
On-Call Essentials
Quick Diagnostic Commands
```bash
# Check service health
curl http://localhost:8000/health
curl http://localhost:8000/ready

# Check all containers
docker compose ps

# View recent logs
docker compose logs --tail=100 voiceassist-server

# Check database
docker compose exec postgres psql -U voiceassist -c "SELECT 1"

# Check Redis
docker compose exec redis redis-cli ping
```
Escalation Path
- L1 Support: Check health endpoints, restart services
- L2 DevOps: Investigate logs, check metrics, apply standard fixes
- L3 Engineering: Deep debugging, code-level investigation
- Management: Major incidents requiring business decisions
Related Documentation
- Unified Architecture - System architecture
- Backend Architecture - Backend details
- Security & Compliance - HIPAA compliance
- Implementation Status - Component status
Version History
| Date | Version | Changes |
|---|---|---|
| 2025-11-27 | 1.0.0 | Initial operations overview |
Deployment Runbook
Last Updated: 2025-11-27
Purpose: Step-by-step guide for deploying VoiceAssist V2
Pre-Deployment Checklist
- All tests passing in CI/CD
- Code reviewed and approved
- Database migrations reviewed
- Breaking changes documented
- Rollback plan documented
- Stakeholders notified
- Maintenance window scheduled (if required)
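Most of these items are manual, but a few can be verified from the shell before starting. The gate below is a minimal sketch under the compose setup used throughout this runbook; CI status, reviews, and stakeholder notification stay manual.

```bash
#!/bin/bash
# Minimal pre-deployment gate (sketch): covers only the checklist items
# that can be verified from the shell.
set -e

echo "=== Uncommitted changes (should be empty) ==="
git status --porcelain

echo "=== Incoming changes ==="
git fetch origin
git log --oneline HEAD..origin/main | head -20

echo "=== Migration state ==="
docker compose run --rm voiceassist-server alembic current
docker compose run --rm voiceassist-server alembic heads

echo "=== Baseline health before deploy ==="
curl -sf http://localhost:8000/health | jq '.' || echo "WARNING: service unhealthy before deploy"
```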
Deployment Steps
1. Pre-Deployment Verification
```bash
# Check current system health
curl http://localhost:8000/health
curl http://localhost:8000/ready

# Verify all containers running
docker compose ps

# Check database connection
docker compose exec postgres psql -U voiceassist -d voiceassist -c "SELECT version();"

# Check Redis
docker compose exec redis redis-cli ping

# Check Qdrant
curl http://localhost:6333/collections
```
2. Backup Current State
```bash
# Backup database
docker compose exec postgres pg_dump -U voiceassist voiceassist > backup_$(date +%Y%m%d_%H%M%S).sql

# Backup environment configuration
cp .env .env.backup_$(date +%Y%m%d_%H%M%S)

# Tag current Docker images
docker tag voiceassist-voiceassist-server:latest voiceassist-voiceassist-server:pre-deploy-$(date +%Y%m%d_%H%M%S)
```
3. Pull Latest Code
```bash
# Fetch latest changes
git fetch origin

# Check what's changing
git log --oneline HEAD..origin/main

# Pull changes
git pull origin main

# Verify correct branch
git branch --show-current
git log -1 --oneline
```
4. Update Environment Configuration
```bash
# Review .env changes
diff .env.example .env

# Update .env if needed
vim .env

# Validate configuration (count non-empty, non-comment lines)
grep -v '^#' .env | grep -v '^$' | wc -l
```
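Counting non-empty lines only confirms the file is not empty. A slightly stronger check, sketched below, compares variable names against `.env.example` so newly introduced settings are not missed; it assumes both files use simple `KEY=value` lines.

```bash
# List variables present in .env.example but missing from .env (sketch)
comm -23 \
  <(grep -v '^#' .env.example | grep '=' | cut -d= -f1 | sort -u) \
  <(grep -v '^#' .env         | grep '=' | cut -d= -f1 | sort -u)
# Any names printed above need to be added to .env before deploying
```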
5. Run Database Migrations
```bash
# Check current migration status
docker compose run --rm voiceassist-server alembic current

# Review pending migrations
docker compose run --rm voiceassist-server alembic history

# Run migrations
docker compose run --rm voiceassist-server alembic upgrade head

# Verify migration success
docker compose run --rm voiceassist-server alembic current
```
6. Build New Images
```bash
# Build updated images
docker compose build voiceassist-server

# Verify image built
docker images | grep voiceassist-server

# Check image size (compare against the previous build; a large jump may indicate a build problem)
docker images voiceassist-voiceassist-server:latest --format "{{.Size}}"
```
7. Deploy Services
```bash
# Deploy with zero-downtime (recreate containers)
docker compose up -d voiceassist-server

# Watch logs for startup
docker compose logs -f voiceassist-server

# Wait for healthcheck
sleep 10
```
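A fixed `sleep` can declare success before the container is actually healthy. The optional snippet below polls the health endpoint for up to ~60 seconds instead, using only the endpoints already referenced in this runbook.

```bash
# Poll the health endpoint instead of sleeping a fixed 10 seconds (optional)
for i in $(seq 1 30); do
  if curl -sf http://localhost:8000/health > /dev/null; then
    echo "Service healthy after ~$((i * 2))s"
    break
  fi
  echo "Waiting for service to become healthy... ($i/30)"
  sleep 2
done
```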
8. Post-Deployment Verification
```bash
# Check health endpoint
curl http://localhost:8000/health

# Check readiness
curl http://localhost:8000/ready

# Verify version
curl http://localhost:8000/health | jq '.version'

# Check all containers running
docker compose ps

# Check logs for errors
docker compose logs --tail=100 voiceassist-server | grep -i error

# Verify metrics endpoint
curl http://localhost:8000/metrics | head -20

# Test a sample API endpoint (requires auth)
# curl -H "Authorization: Bearer $TOKEN" http://localhost:8000/api/users/me
```
9. Smoke Tests
```bash
# Test authentication
curl -X POST http://localhost:8000/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@example.com","password":"password"}' | jq '.'

# Test database connectivity
docker compose exec postgres psql -U voiceassist -d voiceassist -c "SELECT COUNT(*) FROM users;"

# Test Redis
docker compose exec redis redis-cli --raw incr deployment_test

# Test Qdrant
curl http://localhost:6333/collections
```
10. Monitor Initial Traffic
```bash
# Watch logs for first 5 minutes
docker compose logs -f --tail=100 voiceassist-server

# Monitor metrics
watch -n 5 'curl -s http://localhost:8000/metrics | grep -E "(http_requests_total|http_request_duration)"'

# Check error rate
docker compose logs --since 5m voiceassist-server | grep -i error | wc -l
```
Rollback Procedure
If deployment fails, follow these steps:
Quick Rollback (Image-Based)
```bash
# Stop current containers
docker compose stop voiceassist-server

# Revert to previous image
PREVIOUS_TAG="pre-deploy-YYYYMMDD_HHMMSS"  # From backup step
docker tag voiceassist-voiceassist-server:$PREVIOUS_TAG voiceassist-voiceassist-server:latest

# Start previous version
docker compose up -d voiceassist-server

# Verify rollback
curl http://localhost:8000/health | jq '.version'
```
Full Rollback (Code + Database)
```bash
# Stop services
docker compose stop voiceassist-server

# Revert code
git log -1 --oneline   # Note current commit
git checkout HEAD~1    # Or specific commit hash

# Rollback database from backup
BACKUP_FILE="backup_YYYYMMDD_HHMMSS.sql"
docker compose exec -T postgres psql -U voiceassist voiceassist < $BACKUP_FILE

# Rebuild image
docker compose build voiceassist-server

# Start services
docker compose up -d voiceassist-server

# Verify rollback
curl http://localhost:8000/health
```
Deployment Checklist
Post-Deployment:
- Health endpoint returning 200
- Readiness endpoint returning 200
- No error logs in last 5 minutes
- Metrics endpoint accessible
- Database migrations applied
- All containers running
- Sample API requests successful
- Version number updated
- Stakeholders notified of completion
- Documentation updated (if needed)
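Where possible, run the automatable items above as a single script after every deploy. The sketch below covers only those items; the expected version string is an assumption and should be set to the release being deployed.

```bash
#!/bin/bash
# Post-deployment checklist sketch - automatable items only.
EXPECTED_VERSION="${EXPECTED_VERSION:-2.0.0}"   # assumption: set to the release being deployed
FAIL=0

check() {  # usage: check <description> <command...>
  DESC="$1"; shift
  if "$@" > /dev/null 2>&1; then
    echo "OK  $DESC"
  else
    echo "FAIL $DESC"
    FAIL=1
  fi
}

check "Health endpoint returns 200"    curl -sf http://localhost:8000/health
check "Readiness endpoint returns 200" curl -sf http://localhost:8000/ready
check "Metrics endpoint accessible"    curl -sf http://localhost:8000/metrics

DEPLOYED=$(curl -sf http://localhost:8000/health | jq -r '.version')
if [ "$DEPLOYED" = "$EXPECTED_VERSION" ]; then
  echo "OK  Version is $DEPLOYED"
else
  echo "FAIL Version is $DEPLOYED (expected $EXPECTED_VERSION)"
  FAIL=1
fi

ERRORS=$(docker compose logs --since 5m voiceassist-server 2>/dev/null | grep -ci error)
[ "$ERRORS" -eq 0 ] && echo "OK  No errors in last 5 minutes" || echo "WARN $ERRORS error lines in last 5 minutes"

exit $FAIL
```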
Common Issues & Solutions
Issue: Database Migration Fails
Symptoms: Migration command returns error
Solution:
```bash
# Check current state
docker compose run --rm voiceassist-server alembic current

# Manually review SQL
docker compose run --rm voiceassist-server alembic show <revision>

# If safe, downgrade one step
docker compose run --rm voiceassist-server alembic downgrade -1

# Fix issue and retry
docker compose run --rm voiceassist-server alembic upgrade head
```
Issue: Container Won't Start
Symptoms: Container crashes immediately or fails healthcheck
Solution:
```bash
# Check logs
docker compose logs --tail=50 voiceassist-server

# Check container exit code
docker compose ps -a voiceassist-server

# Verify environment variables
docker compose config | grep -A 20 voiceassist-server

# Test dependencies
docker compose exec postgres pg_isready
docker compose exec redis redis-cli ping
```
Issue: High Error Rate After Deployment
Symptoms: Increased 5xx errors in logs/metrics
Solution:
```bash
# Check error logs
docker compose logs voiceassist-server | grep -i error

# Check database connections
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

# Check Redis memory
docker compose exec redis redis-cli INFO memory | grep used_memory_human

# Rollback if errors > 5% of traffic
```
Emergency Contacts
- On-Call Engineer: Check PagerDuty
- Database Admin: DBA on-call rotation
- DevOps Lead: ops-team@voiceassist.local
- Product Owner: product@voiceassist.local
Related Documentation
- UNIFIED_ARCHITECTURE.md
- CONNECTION_POOL_OPTIMIZATION.md
- Incident Response Runbook
- Backup & Restore Runbook
Document Version: 1.0
Last Updated: 2025-11-21
Maintained By: VoiceAssist DevOps Team
Review Cycle: After each major deployment or quarterly
Incident Response Runbook
Last Updated: 2025-11-27
Purpose: Comprehensive guide for handling incidents in VoiceAssist V2
Incident Severity Levels
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| P1 - Critical | Complete service outage, data loss risk | 15 minutes | Database down, complete API failure, security breach |
| P2 - High | Major feature broken, significant performance degradation | 1 hour | Authentication failing, voice processing unavailable |
| P3 - Medium | Minor feature broken, degraded performance | 4 hours | Specific API endpoint failing, slow response times |
| P4 - Low | Cosmetic issues, minimal impact | 24 hours | UI glitches, non-critical warnings in logs |
Initial Response Procedure
1. Incident Detection
```bash
# Check system health
curl -s http://localhost:8000/health | jq '.'

# Expected output:
# {
#   "status": "healthy",
#   "version": "2.0.0",
#   "timestamp": "2025-11-21T..."
# }

# Check all services
docker compose ps

# Check recent error logs
docker compose logs --since 10m voiceassist-server | grep -i error

# Check metrics for anomalies
curl -s http://localhost:8000/metrics | grep -E "(error|failure)"
```
2. Immediate Triage (First 5 Minutes)
Checklist:
- Acknowledge the incident (update status page if available)
- Determine severity level using table above
- Notify on-call engineer if P1/P2
- Create incident tracking ticket/document
- Join incident response channel (Slack/Teams)
```bash
# Quick system overview
echo "=== System Status ==="
docker compose ps
echo ""

echo "=== Error Count (Last 10 min) ==="
docker compose logs --since 10m | grep -i error | wc -l
echo ""

echo "=== Active Database Connections ==="
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
echo ""

echo "=== Redis Memory ==="
docker compose exec redis redis-cli INFO memory | grep used_memory_human
echo ""

echo "=== Disk Usage ==="
df -h
```
3. Assess Impact
```bash
# Check request success rate
docker compose logs --since 15m voiceassist-server | \
  grep -oE "status=[0-9]+" | sort | uniq -c

# Check database connectivity
docker compose exec postgres pg_isready
docker compose exec postgres psql -U voiceassist -d voiceassist -c "SELECT 1;"

# Check Redis connectivity
docker compose exec redis redis-cli ping

# Check Qdrant connectivity
curl -s http://localhost:6333/healthz

# Check network connectivity
docker compose exec voiceassist-server ping -c 3 postgres
docker compose exec voiceassist-server ping -c 3 redis
docker compose exec voiceassist-server ping -c 3 qdrant
```
Incident Response by Severity
P1 - Critical Incident Response
Timeline: 0-15 minutes
- Immediate Actions:
- Page on-call engineer
- Notify management
- Update status page: "Investigating outage"
- Join war room/incident call
- Rapid Assessment:
```bash
# Check if complete outage
curl -s http://localhost:8000/health || echo "COMPLETE OUTAGE"

# Check all infrastructure
docker compose ps -a

# Check for recent deployments
git log -5 --oneline --since="2 hours ago"

# Check system resources
docker stats --no-stream

# Check disk space (common cause)
df -h
du -sh /var/lib/docker
```
- Emergency Mitigation:
```bash
# Option 1: Restart all services
docker compose restart

# Option 2: Rollback recent deployment (if within 2 hours)
git log -1 --oneline   # Current version
git checkout HEAD~1    # Previous version
docker compose build voiceassist-server
docker compose up -d voiceassist-server

# Option 3: Scale up resources (if performance issue)
docker compose up -d --scale voiceassist-server=3

# Option 4: Enable maintenance mode
# Create maintenance mode flag
touch /tmp/maintenance_mode
docker compose exec voiceassist-server touch /app/maintenance_mode
```
- Communication Template (P1):
Subject: [P1 INCIDENT] VoiceAssist Service Outage
Status: INVESTIGATING
Start Time: [TIME]
Impact: Complete service unavailable
Affected Users: All users
Incident Commander: [NAME]
Current Actions:
- Identified root cause as [X]
- Attempting mitigation via [Y]
- ETR: [TIME] (or "investigating")
Next Update: [TIME] (within 15 minutes)
P2 - High Severity Response
Timeline: 0-60 minutes
- Assessment (First 15 minutes):
```bash
# Identify affected component
docker compose logs --since 30m voiceassist-server | grep -i error | tail -50

# Check specific service health
curl -s http://localhost:8000/ready | jq '.'

# Check database performance
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT pid, usename, application_name, state, query_start, wait_event_type, query
   FROM pg_stat_activity
   WHERE state != 'idle'
   ORDER BY query_start DESC LIMIT 20;"

# Check slow queries (requires the pg_stat_statements extension;
# on PostgreSQL 13+ the columns are total_exec_time / mean_exec_time)
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT query, calls, total_time, mean_time, max_time
   FROM pg_stat_statements
   ORDER BY mean_time DESC LIMIT 10;"
```
- Mitigation Actions:
- Isolate affected component
- Enable fallback mechanisms
- Scale affected service
- Update monitoring thresholds
- Communication Template (P2):
Subject: [P2 INCIDENT] VoiceAssist Degraded Performance
Status: MITIGATING
Start Time: [TIME]
Impact: [Specific feature] unavailable/degraded
Affected Users: [Percentage or specific user group]
Incident Commander: [NAME]
Timeline:
- [TIME]: Issue detected
- [TIME]: Root cause identified
- [TIME]: Mitigation in progress
Root Cause: [Brief description]
Mitigation: [Actions being taken]
ETR: [TIME]
Next Update: [TIME] (within 30 minutes)
P3 - Medium Severity Response
Timeline: 0-4 hours
- Standard Investigation:
```bash
# Detailed log analysis
docker compose logs --since 1h voiceassist-server | grep -A 5 -B 5 "error"

# Check resource utilization trends
docker stats --no-stream

# Review recent changes
git log --since="24 hours ago" --oneline

# Check configuration
docker compose config | grep -A 10 voiceassist-server
```
- Documented Fix Process:
- Create issue in tracking system
- Assign to appropriate team
- Document reproduction steps
- Implement fix
- Test in staging (if available)
- Deploy fix
- Verify resolution
P4 - Low Severity Response
Standard ticket workflow - no immediate response required
Escalation Paths
When to Escalate
Escalate Immediately If:
- Unable to identify root cause within 30 minutes (P1) or 2 hours (P2)
- Mitigation attempts unsuccessful
- Data loss suspected
- Security breach suspected
- Multiple systems affected
- Customer data at risk
Escalation Chain
L1 - On-Call Engineer
↓ (30 min for P1, 2 hrs for P2)
L2 - Team Lead
↓ (1 hr for P1, 4 hrs for P2)
L3 - Engineering Manager
↓ (2 hrs for P1)
L4 - CTO / VP Engineering
Escalation Command Script
# Document current state before escalating cat > /tmp/escalation_report_$(date +%Y%m%d_%H%M%S).txt <<EOF ESCALATION REPORT ================= Time: $(date) Severity: P1/P2/P3/P4 Duration: [X hours] Impact: [Description] Current System State: $(docker compose ps) Recent Errors: $(docker compose logs --since 30m voiceassist-server | grep -i error | tail -20) Actions Attempted: - [List all mitigation attempts] - [Include results of each attempt] Reason for Escalation: [Clear explanation of why escalating] Additional Context: [Any other relevant information] EOF cat /tmp/escalation_report_$(date +%Y%m%d_%H%M%S).txt
Common Incident Types
Database Connection Issues
Symptoms:
- "Connection pool exhausted" errors
- "Too many connections" errors
- Slow response times
Investigation:
```bash
# Check connection pool status
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

# Check max connections
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SHOW max_connections;"

# Check current connections
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT datname, usename, application_name, count(*)
   FROM pg_stat_activity GROUP BY datname, usename, application_name;"

# Kill idle connections
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
   WHERE state = 'idle' AND state_change < current_timestamp - INTERVAL '10 minutes';"
```
Resolution:
```bash
# Restart application to reset connection pool
docker compose restart voiceassist-server

# Temporarily increase connection pool
docker compose exec voiceassist-server sh -c \
  "export DB_POOL_SIZE=30 && supervisorctl restart all"

# Long-term: Update docker-compose.yml or .env
echo "DB_POOL_SIZE=30" >> .env
docker compose up -d voiceassist-server
```
Memory/Resource Exhaustion
Symptoms:
- Container restarts
- OOMKilled status
- Slow performance
Investigation:
```bash
# Check container memory usage
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"

# Check for OOMKilled containers
docker inspect voiceassist-voiceassist-server-1 | grep OOMKilled

# Check system memory
free -h

# Check Redis memory
docker compose exec redis redis-cli INFO memory
```
Resolution:
```bash
# Increase memory limits in docker-compose.yml
# Edit docker-compose.yml to increase mem_limit

# Clear Redis cache if needed
docker compose exec redis redis-cli FLUSHDB

# Restart affected container
docker compose restart voiceassist-server

# Monitor memory after restart
watch -n 5 'docker stats --no-stream | grep voiceassist-server'
```
API Performance Degradation
Symptoms:
- Slow response times
- Timeout errors
- High request queue
Investigation:
```bash
# Check response times in metrics
curl -s http://localhost:8000/metrics | grep http_request_duration

# Check slow queries
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT pid, now() - query_start as duration, query
   FROM pg_stat_activity
   WHERE state != 'idle' AND now() - query_start > interval '5 seconds'
   ORDER BY duration DESC;"

# Check for locks
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT * FROM pg_locks WHERE NOT granted;"

# Check CPU usage
docker stats --no-stream
```
Resolution:
```bash
# Scale horizontally if needed
docker compose up -d --scale voiceassist-server=3

# Kill slow queries
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
   WHERE state != 'idle' AND now() - query_start > interval '30 seconds';"

# Set the Redis eviction policy so least-recently-used keys are evicted under memory pressure
docker compose exec redis redis-cli CONFIG SET maxmemory-policy allkeys-lru
```
Security Incidents
Symptoms:
- Unusual traffic patterns
- Unauthorized access attempts
- Data breach alerts
IMMEDIATE ACTIONS:
```bash
# 1. DO NOT DESTROY EVIDENCE
# 2. Document everything
# 3. Isolate affected systems

# Stop accepting new connections (if breach confirmed)
docker compose exec voiceassist-server iptables -A INPUT -p tcp --dport 8000 -j DROP

# Capture current state
docker compose logs > /tmp/security_incident_logs_$(date +%Y%m%d_%H%M%S).txt
docker compose exec postgres pg_dump -U voiceassist voiceassist > \
  /tmp/security_incident_db_$(date +%Y%m%d_%H%M%S).sql

# Check for suspicious activity
docker compose logs voiceassist-server | grep -E "401|403|429" | tail -100

# Check database for unauthorized access
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT * FROM user_sessions WHERE created_at > NOW() - INTERVAL '1 hour' ORDER BY created_at DESC;"

# Rotate credentials IMMEDIATELY
# Generate new secrets
openssl rand -base64 32 > /tmp/new_secret_key.txt
# Update .env with new credentials

# Force logout all users
docker compose exec redis redis-cli FLUSHALL
```
ESCALATION: Security incidents ALWAYS require immediate escalation to the security team.
Post-Incident Activities
Immediate Post-Incident (Within 1 Hour)
Checklist:
- Verify incident fully resolved
- Update status page to "Resolved"
- Send final communication to stakeholders
- Document timeline in incident ticket
- Schedule post-mortem meeting (within 48 hours for P1/P2)
# Verification script echo "=== Post-Incident Verification ===" echo "Health Check:" curl -s http://localhost:8000/health | jq '.' echo "" echo "Error Rate (Last 30 min):" docker compose logs --since 30m voiceassist-server | grep -i error | wc -l echo "" echo "Container Status:" docker compose ps echo "" echo "Database Connections:" docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
Post-Mortem Process
Post-Mortem Template:
# Post-Mortem: [Incident Title] ## Incident Details - **Date**: YYYY-MM-DD - **Duration**: X hours Y minutes - **Severity**: P1/P2/P3/P4 - **Incident Commander**: [Name] - **Participants**: [Names] ## Impact - **Users Affected**: [Number or percentage] - **Services Affected**: [List] - **Financial Impact**: [If applicable] - **Data Loss**: None / [Description] ## Timeline | Time | Event | | ----- | --------------------------- | | HH:MM | Incident began | | HH:MM | Detected by [person/system] | | HH:MM | Initial response started | | HH:MM | Root cause identified | | HH:MM | Mitigation deployed | | HH:MM | Incident resolved | ## Root Cause [Detailed explanation of what caused the incident] ## What Went Well - [Things that worked during response] - [Effective tools/processes] ## What Went Wrong - [Issues encountered during response] - [Gaps in tooling/process] ## Action Items | Action | Owner | Due Date | Priority | | ------------------------ | ------ | -------- | -------- | | [Preventive measure] | [Name] | [Date] | P1/P2/P3 | | [Monitoring improvement] | [Name] | [Date] | P1/P2/P3 | | [Documentation update] | [Name] | [Date] | P1/P2/P3 | ## Lessons Learned - [Key takeaway 1] - [Key takeaway 2] - [Key takeaway 3]
Post-Mortem Meeting Agenda
- Review Timeline (10 minutes)
- Walk through incident from detection to resolution
- No blame, focus on facts
- Root Cause Analysis (15 minutes)
- Technical deep-dive
- Use "5 Whys" technique
- Impact Assessment (10 minutes)
- User impact
- Business impact
- Reputation impact
- Prevention Discussion (20 minutes)
- How to prevent recurrence
- Monitoring improvements
- Process improvements
- Action Items (5 minutes)
- Assign owners and due dates
- Set follow-up meeting
Communication Templates
Initial Notification (P1/P2)
Subject: [P1/P2] VoiceAssist Service Issue - [Brief Description]
Dear Team,
We are currently experiencing [issue description] affecting [scope of impact].
Status: INVESTIGATING
Start Time: [TIME]
Severity: P1/P2
Impact: [Description]
Affected Systems: [List]
Incident Commander: [NAME]
We are actively working to resolve this issue and will provide updates
every [15 minutes for P1, 30 minutes for P2].
Next Update: [TIME]
VoiceAssist Operations Team
Status Update (During Incident)
Subject: [UPDATE - P1/P2] VoiceAssist Service Issue - [Brief Description]
Update #[N] - [TIME]
Current Status: [INVESTIGATING/IDENTIFIED/MITIGATING/RESOLVED]
Progress:
- [What we've learned]
- [What we've tried]
- [Current approach]
Impact Update: [Any changes to scope]
Next Steps:
- [Action 1]
- [Action 2]
ETR: [Estimated Time to Resolution or "investigating"]
Next Update: [TIME]
VoiceAssist Operations Team
Resolution Notification
Subject: [RESOLVED - P1/P2] VoiceAssist Service Issue - [Brief Description]
Status: RESOLVED
Resolution Time: [TIME]
Total Duration: [X hours Y minutes]
The issue affecting [description] has been fully resolved.
Root Cause: [Brief explanation]
Resolution: [What was done to fix it]
Impact Summary:
- Users Affected: [Number/Percentage]
- Duration: [X hours Y minutes]
- Data Loss: None / [Description]
Next Steps:
- Post-mortem scheduled for [DATE/TIME]
- Preventive measures being implemented
We apologize for any inconvenience this may have caused.
VoiceAssist Operations Team
Incident Response Tools
Quick Command Reference
# Health Check Bundle alias va-health='curl -s http://localhost:8000/health | jq .' alias va-ready='curl -s http://localhost:8000/ready | jq .' alias va-metrics='curl -s http://localhost:8000/metrics' # Log Analysis alias va-errors='docker compose logs --since 10m voiceassist-server | grep -i error' alias va-errors-count='docker compose logs --since 10m voiceassist-server | grep -i error | wc -l' alias va-logs-tail='docker compose logs -f --tail=100 voiceassist-server' # Resource Check alias va-stats='docker stats --no-stream | grep voiceassist' alias va-disk='df -h | grep -E "(Filesystem|/dev/)"' # Database Quick Checks alias va-db-connections='docker compose exec postgres psql -U voiceassist -d voiceassist -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"' alias va-db-slow='docker compose exec postgres psql -U voiceassist -d voiceassist -c "SELECT pid, now() - query_start as duration, query FROM pg_stat_activity WHERE state != '\''idle'\'' ORDER BY duration DESC LIMIT 10;"' # Redis Checks alias va-redis-info='docker compose exec redis redis-cli INFO' alias va-redis-memory='docker compose exec redis redis-cli INFO memory | grep used_memory_human'
Incident Response Script
#!/bin/bash # Save as: /usr/local/bin/va-incident-check echo "=== VoiceAssist Incident Response Check ===" echo "Time: $(date)" echo "" echo "=== 1. Service Health ===" curl -s http://localhost:8000/health | jq '.' || echo "HEALTH CHECK FAILED" echo "" echo "=== 2. Container Status ===" docker compose ps echo "" echo "=== 3. Recent Errors (Last 10 min) ===" ERROR_COUNT=$(docker compose logs --since 10m voiceassist-server 2>/dev/null | grep -i error | wc -l) echo "Error Count: $ERROR_COUNT" if [ "$ERROR_COUNT" -gt 10 ]; then echo "⚠️ HIGH ERROR RATE DETECTED" docker compose logs --since 10m voiceassist-server | grep -i error | tail -10 fi echo "" echo "=== 4. Database Status ===" docker compose exec -T postgres pg_isready || echo "DATABASE NOT READY" docker compose exec -T postgres psql -U voiceassist -d voiceassist -c \ "SELECT count(*), state FROM pg_stat_activity GROUP BY state;" 2>/dev/null echo "" echo "=== 5. Redis Status ===" docker compose exec -T redis redis-cli ping || echo "REDIS NOT RESPONDING" docker compose exec -T redis redis-cli INFO memory | grep used_memory_human echo "" echo "=== 6. Resource Usage ===" docker stats --no-stream | grep voiceassist echo "" echo "=== 7. Disk Space ===" df -h | grep -E "(Filesystem|/$|/var)" echo "" echo "=== Summary ===" if [ "$ERROR_COUNT" -gt 50 ]; then echo "🔴 CRITICAL - High error rate detected" elif [ "$ERROR_COUNT" -gt 10 ]; then echo "🟡 WARNING - Elevated error rate" else echo "🟢 OK - System appears healthy" fi
Emergency Contacts
Primary Contacts
| Role | Contact | Availability |
|---|---|---|
| On-Call Engineer | PagerDuty alert | 24/7 |
| Backup On-Call | PagerDuty escalation | 24/7 |
| Engineering Manager | ops-manager@voiceassist.local | Business hours |
| DevOps Lead | devops-lead@voiceassist.local | Business hours + on-call |
| Database Admin | dba-oncall@voiceassist.local | 24/7 |
| Security Team | security@voiceassist.local | 24/7 for P1 security |
Escalation Contacts
| Level | Contact | When to Escalate |
|---|---|---|
| L1 | On-Call Engineer | Initial response |
| L2 | Team Lead | No resolution in 30 min (P1) or 2 hrs (P2) |
| L3 | Engineering Manager | No resolution in 1 hr (P1) or 4 hrs (P2) |
| L4 | VP Engineering / CTO | Major outage > 2 hours, data loss, security breach |
External Contacts
- Cloud Provider Support: [Support portal URL]
- Third-party Services: [Service provider contacts]
- Legal (for security incidents): legal@voiceassist.local
Related Documentation
- Deployment Runbook
- Backup & Restore Runbook
- Monitoring Runbook
- Troubleshooting Runbook
- Scaling Runbook
- UNIFIED_ARCHITECTURE.md
- CONNECTION_POOL_OPTIMIZATION.md
Document Version: 1.0
Last Updated: 2025-11-21
Maintained By: VoiceAssist DevOps Team
Review Cycle: Monthly or after each P1/P2 incident
Next Review: 2025-12-21
Backup & Restore Runbook
Last Updated: 2025-11-27
Purpose: Comprehensive guide for backup and restore operations in VoiceAssist V2
Backup Strategy Overview
Backup Schedule
| Component | Frequency | Retention | Method |
|---|---|---|---|
| PostgreSQL Database | Every 6 hours | 30 days | pg_dump + automated snapshots |
| Redis Cache | Daily | 7 days | RDB snapshots |
| Qdrant Vectors | Daily | 14 days | Collection snapshots |
| Configuration Files | On change | 90 days | Git + encrypted backups |
| Application Logs | Hourly | 30 days | Log aggregation |
| Docker Volumes | Weekly | 30 days | Volume snapshots |
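This schedule has to be wired into a scheduler to actually run. A minimal cron sketch is shown below; it assumes the `va-backup-*` scripts defined later in this runbook are installed in `/usr/local/bin`, and the times are examples to adjust to the table above.

```bash
# Example crontab entries (sketch). Edit with: crontab -e
# Assumes the va-backup-* scripts from this runbook exist in /usr/local/bin
# and /backups is writable by the cron user.

# PostgreSQL: every 6 hours
0 */6 * * * /usr/local/bin/va-backup-postgres >> /backups/postgres/cron.log 2>&1

# Redis: daily at 02:00
0 2 * * * /usr/local/bin/va-backup-redis >> /backups/redis/cron.log 2>&1

# Qdrant: daily at 03:00
0 3 * * * /usr/local/bin/va-backup-qdrant >> /backups/qdrant/cron.log 2>&1

# Configuration: daily at 04:00 (the table says "on change"; cron acts as a fallback)
0 4 * * * /usr/local/bin/va-backup-config >> /backups/config/cron.log 2>&1

# Docker volumes: weekly on Sunday at 05:00
0 5 * * 0 /usr/local/bin/va-backup-volumes >> /backups/volumes/cron.log 2>&1

# Backup health check: daily at 08:00
0 8 * * * /usr/local/bin/va-backup-health >> /backups/health.log 2>&1
```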
Backup Storage Locations
```bash
# Default backup directory structure
/backups/
├── postgres/
│   ├── daily/
│   ├── weekly/
│   └── monthly/
├── redis/
├── qdrant/
├── config/
├── volumes/
└── logs/
```
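If the tree does not exist yet, it can be created in one step (sketch):

```bash
# Create the backup directory skeleton shown above
mkdir -p /backups/{postgres/{daily,weekly,monthly},redis,qdrant,config,volumes,logs}
```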
PostgreSQL Database Backup
Full Database Backup
# Create timestamped backup BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_DIR="/backups/postgres/daily" # Ensure backup directory exists mkdir -p $BACKUP_DIR # Full database dump docker compose exec -T postgres pg_dump \ -U voiceassist \ -d voiceassist \ -F c \ -b \ -v \ -f /tmp/voiceassist_${BACKUP_DATE}.dump # Copy from container to host docker compose cp postgres:/tmp/voiceassist_${BACKUP_DATE}.dump \ ${BACKUP_DIR}/voiceassist_${BACKUP_DATE}.dump # Verify backup ls -lh ${BACKUP_DIR}/voiceassist_${BACKUP_DATE}.dump # Expected output: File size should be > 0 bytes
Compressed SQL Backup
# SQL format with compression BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_DIR="/backups/postgres/daily" mkdir -p $BACKUP_DIR # Create compressed SQL dump docker compose exec -T postgres pg_dump \ -U voiceassist \ -d voiceassist \ --clean \ --if-exists \ --verbose \ | gzip > ${BACKUP_DIR}/voiceassist_${BACKUP_DATE}.sql.gz # Verify backup ls -lh ${BACKUP_DIR}/voiceassist_${BACKUP_DATE}.sql.gz gunzip -t ${BACKUP_DIR}/voiceassist_${BACKUP_DATE}.sql.gz && echo "✓ Backup file is valid"
Schema-Only Backup
# Backup schema structure only (useful for development) BACKUP_DATE=$(date +%Y%m%d_%H%M%S) docker compose exec -T postgres pg_dump \ -U voiceassist \ -d voiceassist \ --schema-only \ --no-owner \ --no-acl \ > /backups/postgres/schema_${BACKUP_DATE}.sql echo "Schema backup completed: schema_${BACKUP_DATE}.sql"
Table-Specific Backup
# Backup specific tables BACKUP_DATE=$(date +%Y%m%d_%H%M%S) TABLES="users conversations messages" for TABLE in $TABLES; do echo "Backing up table: $TABLE" docker compose exec -T postgres pg_dump \ -U voiceassist \ -d voiceassist \ -t $TABLE \ --data-only \ | gzip > /backups/postgres/table_${TABLE}_${BACKUP_DATE}.sql.gz done echo "Table backups completed"
Automated Backup Script
#!/bin/bash # Save as: /usr/local/bin/va-backup-postgres set -e BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_DIR="/backups/postgres" DAILY_DIR="${BACKUP_DIR}/daily" WEEKLY_DIR="${BACKUP_DIR}/weekly" MONTHLY_DIR="${BACKUP_DIR}/monthly" LOG_FILE="${BACKUP_DIR}/backup.log" # Ensure directories exist mkdir -p $DAILY_DIR $WEEKLY_DIR $MONTHLY_DIR # Function to log messages log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a $LOG_FILE } log "Starting PostgreSQL backup" # Daily backup log "Creating daily backup" docker compose exec -T postgres pg_dump \ -U voiceassist \ -d voiceassist \ -F c \ -b \ | gzip > ${DAILY_DIR}/voiceassist_${BACKUP_DATE}.dump.gz if [ $? -eq 0 ]; then log "Daily backup completed: ${BACKUP_DATE}.dump.gz" BACKUP_SIZE=$(du -h ${DAILY_DIR}/voiceassist_${BACKUP_DATE}.dump.gz | cut -f1) log "Backup size: ${BACKUP_SIZE}" else log "ERROR: Daily backup failed" exit 1 fi # Weekly backup (every Sunday) if [ $(date +%u) -eq 7 ]; then log "Creating weekly backup" cp ${DAILY_DIR}/voiceassist_${BACKUP_DATE}.dump.gz \ ${WEEKLY_DIR}/voiceassist_week_$(date +%Y%U).dump.gz log "Weekly backup created" fi # Monthly backup (first day of month) if [ $(date +%d) -eq 01 ]; then log "Creating monthly backup" cp ${DAILY_DIR}/voiceassist_${BACKUP_DATE}.dump.gz \ ${MONTHLY_DIR}/voiceassist_$(date +%Y%m).dump.gz log "Monthly backup created" fi # Cleanup old daily backups (keep 30 days) log "Cleaning up old daily backups" find ${DAILY_DIR} -name "voiceassist_*.dump.gz" -mtime +30 -delete # Cleanup old weekly backups (keep 12 weeks) find ${WEEKLY_DIR} -name "voiceassist_week_*.dump.gz" -mtime +84 -delete # Cleanup old monthly backups (keep 12 months) find ${MONTHLY_DIR} -name "voiceassist_*.dump.gz" -mtime +365 -delete log "Backup process completed successfully"
Backup Verification
# Verify backup integrity BACKUP_FILE="/backups/postgres/daily/voiceassist_20251121_120000.dump.gz" # Check file exists and size if [ -f "$BACKUP_FILE" ]; then echo "✓ Backup file exists" ls -lh $BACKUP_FILE else echo "✗ Backup file not found" exit 1 fi # Test extraction gunzip -t $BACKUP_FILE if [ $? -eq 0 ]; then echo "✓ Backup file is not corrupted" else echo "✗ Backup file is corrupted" exit 1 fi # Test restore to temporary database (recommended) echo "Testing restore to temporary database..." docker compose exec -T postgres psql -U voiceassist -c "CREATE DATABASE test_restore;" gunzip -c $BACKUP_FILE | docker compose exec -T postgres pg_restore \ -U voiceassist \ -d test_restore \ --verbose if [ $? -eq 0 ]; then echo "✓ Backup restore test successful" docker compose exec -T postgres psql -U voiceassist -c "DROP DATABASE test_restore;" else echo "✗ Backup restore test failed" docker compose exec -T postgres psql -U voiceassist -c "DROP DATABASE IF EXISTS test_restore;" exit 1 fi
PostgreSQL Database Restore
Pre-Restore Checklist
- Verify backup file integrity
- Ensure sufficient disk space
- Notify all users of maintenance
- Stop application services
- Create a backup of current database (before restore)
- Document current state
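The first two checklist items can be verified quickly before touching anything. The sketch below assumes a gzip-compressed dump in the default backup location used throughout this runbook.

```bash
# Quick pre-restore verification (sketch)
BACKUP_FILE="/backups/postgres/daily/voiceassist_YYYYMMDD_HHMMSS.dump.gz"  # set to the backup being restored

# Backup exists and is readable
ls -lh "$BACKUP_FILE" || exit 1

# Archive is not corrupted
gunzip -t "$BACKUP_FILE" && echo "Backup archive is valid" || exit 1

# Enough free space on the database volume (the dump expands when restored)
df -h /var/lib/docker
```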
Full Database Restore
# Stop application to prevent connections docker compose stop voiceassist-server # Verify no active connections docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT count(*) FROM pg_stat_activity WHERE datname = 'voiceassist' AND pid != pg_backend_pid();" # Terminate active connections if any docker compose exec postgres psql -U voiceassist -d postgres -c \ "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'voiceassist' AND pid != pg_backend_pid();" # Drop and recreate database docker compose exec postgres psql -U voiceassist -d postgres <<EOF DROP DATABASE IF EXISTS voiceassist; CREATE DATABASE voiceassist OWNER voiceassist; EOF # Restore from custom format dump BACKUP_FILE="/backups/postgres/daily/voiceassist_20251121_120000.dump.gz" gunzip -c $BACKUP_FILE | docker compose exec -T postgres pg_restore \ -U voiceassist \ -d voiceassist \ --verbose \ --no-owner \ --no-acl # Verify restore docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT schemaname, tablename FROM pg_tables WHERE schemaname = 'public';" # Restart application docker compose start voiceassist-server echo "Database restore completed"
Restore from SQL Dump
# For plain SQL dumps BACKUP_FILE="/backups/postgres/daily/voiceassist_20251121_120000.sql.gz" # Stop application docker compose stop voiceassist-server # Restore SQL gunzip -c $BACKUP_FILE | docker compose exec -T postgres psql \ -U voiceassist \ -d voiceassist # Restart application docker compose start voiceassist-server
Point-in-Time Recovery (PITR)
# Requires WAL archiving to be enabled in PostgreSQL configuration # 1. Stop database docker compose stop postgres # 2. Replace data directory with base backup BACKUP_DIR="/backups/postgres/base" DATA_DIR="/var/lib/docker/volumes/voiceassist_postgres_data/_data" # Backup current data mv $DATA_DIR ${DATA_DIR}.backup_$(date +%Y%m%d_%H%M%S) # Restore base backup cp -r $BACKUP_DIR $DATA_DIR # 3. Create recovery configuration cat > ${DATA_DIR}/recovery.conf <<EOF restore_command = 'cp /backups/postgres/wal_archive/%f %p' recovery_target_time = '2025-11-21 12:00:00' EOF # 4. Start PostgreSQL (will perform recovery) docker compose start postgres # 5. Monitor recovery docker compose logs -f postgres | grep -i recovery
Partial Restore (Single Table)
# Restore specific table from backup TABLE_NAME="users" BACKUP_FILE="/backups/postgres/table_users_20251121_120000.sql.gz" # Drop existing table data docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "TRUNCATE TABLE ${TABLE_NAME} CASCADE;" # Restore table gunzip -c $BACKUP_FILE | docker compose exec -T postgres psql \ -U voiceassist \ -d voiceassist # Verify docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT COUNT(*) FROM ${TABLE_NAME};"
Redis Backup
Manual Redis Backup
# Trigger Redis save docker compose exec redis redis-cli BGSAVE # Wait for save to complete docker compose exec redis redis-cli LASTSAVE # Copy RDB file BACKUP_DATE=$(date +%Y%m%d_%H%M%S) mkdir -p /backups/redis docker compose cp redis:/data/dump.rdb \ /backups/redis/dump_${BACKUP_DATE}.rdb # Verify backup ls -lh /backups/redis/dump_${BACKUP_DATE}.rdb
Automated Redis Backup Script
#!/bin/bash # Save as: /usr/local/bin/va-backup-redis set -e BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_DIR="/backups/redis" LOG_FILE="${BACKUP_DIR}/backup.log" mkdir -p $BACKUP_DIR log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a $LOG_FILE } log "Starting Redis backup" # Trigger background save docker compose exec -T redis redis-cli BGSAVE > /dev/null # Wait for save to complete (check every 2 seconds) TIMEOUT=60 ELAPSED=0 while [ $ELAPSED -lt $TIMEOUT ]; do STATUS=$(docker compose exec -T redis redis-cli LASTSAVE 2>/dev/null || echo "0") if [ ! -z "$STATUS" ]; then break fi sleep 2 ELAPSED=$((ELAPSED + 2)) done # Copy RDB file docker compose cp redis:/data/dump.rdb \ ${BACKUP_DIR}/dump_${BACKUP_DATE}.rdb if [ $? -eq 0 ]; then log "Redis backup completed: dump_${BACKUP_DATE}.rdb" BACKUP_SIZE=$(du -h ${BACKUP_DIR}/dump_${BACKUP_DATE}.rdb | cut -f1) log "Backup size: ${BACKUP_SIZE}" else log "ERROR: Redis backup failed" exit 1 fi # Cleanup old backups (keep 7 days) find ${BACKUP_DIR} -name "dump_*.rdb" -mtime +7 -delete log "Cleanup completed" log "Redis backup process completed successfully"
Redis Restore
# Stop Redis docker compose stop redis # Replace RDB file BACKUP_FILE="/backups/redis/dump_20251121_120000.rdb" docker compose cp $BACKUP_FILE redis:/data/dump.rdb # Start Redis (will load from dump.rdb) docker compose start redis # Verify data loaded docker compose exec redis redis-cli DBSIZE echo "Redis restore completed"
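One caveat worth checking before relying on the RDB copy above: if Redis runs with AOF persistence enabled (`appendonly yes`), it reloads state from the append-only file rather than `dump.rdb`, so replacing the RDB file would have no effect.

```bash
# If this prints "appendonly yes", disable AOF or restore the AOF files instead,
# because Redis ignores dump.rdb on startup when AOF is enabled.
docker compose exec redis redis-cli CONFIG GET appendonly
```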
Qdrant Vector Database Backup
Create Qdrant Snapshot
# Create snapshot for specific collection COLLECTION_NAME="voice_embeddings" BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_DIR="/backups/qdrant" mkdir -p $BACKUP_DIR # Create snapshot via API SNAPSHOT_NAME=$(curl -X POST \ "http://localhost:6333/collections/${COLLECTION_NAME}/snapshots" \ | jq -r '.result.name') echo "Snapshot created: $SNAPSHOT_NAME" # Download snapshot curl -X GET \ "http://localhost:6333/collections/${COLLECTION_NAME}/snapshots/${SNAPSHOT_NAME}" \ -o ${BACKUP_DIR}/${COLLECTION_NAME}_${BACKUP_DATE}.snapshot # Verify backup ls -lh ${BACKUP_DIR}/${COLLECTION_NAME}_${BACKUP_DATE}.snapshot
Backup All Qdrant Collections
#!/bin/bash # Save as: /usr/local/bin/va-backup-qdrant set -e BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_DIR="/backups/qdrant" LOG_FILE="${BACKUP_DIR}/backup.log" mkdir -p $BACKUP_DIR log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a $LOG_FILE } log "Starting Qdrant backup" # Get all collections COLLECTIONS=$(curl -s http://localhost:6333/collections | jq -r '.result.collections[].name') if [ -z "$COLLECTIONS" ]; then log "No collections found" exit 0 fi # Backup each collection for COLLECTION in $COLLECTIONS; do log "Backing up collection: $COLLECTION" # Create snapshot SNAPSHOT_NAME=$(curl -s -X POST \ "http://localhost:6333/collections/${COLLECTION}/snapshots" \ | jq -r '.result.name') if [ ! -z "$SNAPSHOT_NAME" ] && [ "$SNAPSHOT_NAME" != "null" ]; then # Download snapshot curl -s -X GET \ "http://localhost:6333/collections/${COLLECTION}/snapshots/${SNAPSHOT_NAME}" \ -o ${BACKUP_DIR}/${COLLECTION}_${BACKUP_DATE}.snapshot log "Backup completed: ${COLLECTION}_${BACKUP_DATE}.snapshot" BACKUP_SIZE=$(du -h ${BACKUP_DIR}/${COLLECTION}_${BACKUP_DATE}.snapshot | cut -f1) log "Backup size: ${BACKUP_SIZE}" # Delete remote snapshot to save space curl -s -X DELETE \ "http://localhost:6333/collections/${COLLECTION}/snapshots/${SNAPSHOT_NAME}" \ > /dev/null else log "ERROR: Failed to create snapshot for $COLLECTION" fi done # Cleanup old backups (keep 14 days) find ${BACKUP_DIR} -name "*.snapshot" -mtime +14 -delete log "Cleanup completed" log "Qdrant backup process completed successfully"
Qdrant Restore
# Stop Qdrant docker compose stop qdrant # Clear existing data (optional, for full restore) docker compose exec qdrant rm -rf /qdrant/storage/* # Start Qdrant docker compose start qdrant # Wait for Qdrant to be ready sleep 5 # Restore each collection COLLECTION_NAME="voice_embeddings" BACKUP_FILE="/backups/qdrant/voice_embeddings_20251121_120000.snapshot" # Upload snapshot curl -X POST \ "http://localhost:6333/collections/${COLLECTION_NAME}/snapshots/upload" \ -H "Content-Type: multipart/form-data" \ -F "snapshot=@${BACKUP_FILE}" # Verify collection restored curl -s http://localhost:6333/collections/${COLLECTION_NAME} | jq '.result' echo "Qdrant restore completed"
Configuration Files Backup
Backup Configuration
#!/bin/bash # Save as: /usr/local/bin/va-backup-config set -e BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_DIR="/backups/config" PROJECT_DIR="/Users/mohammednazmy/VoiceAssist" mkdir -p $BACKUP_DIR echo "Starting configuration backup" # Create tarball of configuration files tar -czf ${BACKUP_DIR}/config_${BACKUP_DATE}.tar.gz \ -C $PROJECT_DIR \ .env \ docker-compose.yml \ docker-compose.override.yml \ alembic.ini \ pyproject.toml \ --exclude='.git' \ --exclude='__pycache__' # Encrypt backup (recommended for sensitive configs) if command -v gpg &> /dev/null; then gpg --symmetric --cipher-algo AES256 \ -o ${BACKUP_DIR}/config_${BACKUP_DATE}.tar.gz.gpg \ ${BACKUP_DIR}/config_${BACKUP_DATE}.tar.gz # Remove unencrypted version rm ${BACKUP_DIR}/config_${BACKUP_DATE}.tar.gz echo "Configuration backup encrypted: config_${BACKUP_DATE}.tar.gz.gpg" else echo "Configuration backup created: config_${BACKUP_DATE}.tar.gz" echo "WARNING: Backup is not encrypted. Consider installing gpg." fi # Cleanup old backups (keep 90 days) find ${BACKUP_DIR} -name "config_*.tar.gz*" -mtime +90 -delete echo "Configuration backup completed"
Restore Configuration
# For encrypted backups BACKUP_FILE="/backups/config/config_20251121_120000.tar.gz.gpg" PROJECT_DIR="/Users/mohammednazmy/VoiceAssist" # Decrypt and extract gpg --decrypt $BACKUP_FILE | tar -xzf - -C $PROJECT_DIR # For unencrypted backups BACKUP_FILE="/backups/config/config_20251121_120000.tar.gz" tar -xzf $BACKUP_FILE -C $PROJECT_DIR echo "Configuration restored"
Docker Volumes Backup
Backup Docker Volumes
#!/bin/bash # Save as: /usr/local/bin/va-backup-volumes set -e BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_DIR="/backups/volumes" mkdir -p $BACKUP_DIR echo "Starting Docker volumes backup" # List of volumes to backup VOLUMES=( "voiceassist_postgres_data" "voiceassist_redis_data" "voiceassist_qdrant_storage" ) for VOLUME in "${VOLUMES[@]}"; do echo "Backing up volume: $VOLUME" # Create tarball of volume docker run --rm \ -v ${VOLUME}:/source:ro \ -v ${BACKUP_DIR}:/backup \ alpine \ tar -czf /backup/${VOLUME}_${BACKUP_DATE}.tar.gz -C /source . if [ $? -eq 0 ]; then echo "Backup completed: ${VOLUME}_${BACKUP_DATE}.tar.gz" BACKUP_SIZE=$(du -h ${BACKUP_DIR}/${VOLUME}_${BACKUP_DATE}.tar.gz | cut -f1) echo "Backup size: ${BACKUP_SIZE}" else echo "ERROR: Backup failed for $VOLUME" fi done # Cleanup old backups (keep 30 days) find ${BACKUP_DIR} -name "*.tar.gz" -mtime +30 -delete echo "Docker volumes backup completed"
Restore Docker Volumes
# Stop services docker compose down # Restore specific volume VOLUME_NAME="voiceassist_postgres_data" BACKUP_FILE="/backups/volumes/voiceassist_postgres_data_20251121_120000.tar.gz" # Remove existing volume (WARNING: destructive) docker volume rm $VOLUME_NAME # Create new volume docker volume create $VOLUME_NAME # Restore data docker run --rm \ -v ${VOLUME_NAME}:/target \ -v $(dirname $BACKUP_FILE):/backup \ alpine \ tar -xzf /backup/$(basename $BACKUP_FILE) -C /target echo "Volume $VOLUME_NAME restored" # Start services docker compose up -d
Disaster Recovery
Complete System Backup
#!/bin/bash # Save as: /usr/local/bin/va-backup-full set -e BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_ROOT="/backups" DR_DIR="${BACKUP_ROOT}/disaster_recovery" mkdir -p $DR_DIR echo "============================================" echo "Starting Full System Backup for DR" echo "Date: $(date)" echo "============================================" # Stop application (keep databases running) docker compose stop voiceassist-server # 1. Backup PostgreSQL echo "[1/5] Backing up PostgreSQL..." /usr/local/bin/va-backup-postgres # 2. Backup Redis echo "[2/5] Backing up Redis..." /usr/local/bin/va-backup-redis # 3. Backup Qdrant echo "[3/5] Backing up Qdrant..." /usr/local/bin/va-backup-qdrant # 4. Backup Configuration echo "[4/5] Backing up Configuration..." /usr/local/bin/va-backup-config # 5. Backup Docker Volumes echo "[5/5] Backing up Docker Volumes..." /usr/local/bin/va-backup-volumes # Create DR manifest cat > ${DR_DIR}/manifest_${BACKUP_DATE}.txt <<EOF VoiceAssist V2 Disaster Recovery Backup ======================================== Date: $(date) Backup ID: ${BACKUP_DATE} Components Backed Up: - PostgreSQL Database - Redis Cache - Qdrant Vector Database - Configuration Files - Docker Volumes Backup Locations: - PostgreSQL: ${BACKUP_ROOT}/postgres/daily/ - Redis: ${BACKUP_ROOT}/redis/ - Qdrant: ${BACKUP_ROOT}/qdrant/ - Config: ${BACKUP_ROOT}/config/ - Volumes: ${BACKUP_ROOT}/volumes/ Backup Sizes: $(du -sh ${BACKUP_ROOT}/postgres/daily/voiceassist_${BACKUP_DATE}* 2>/dev/null || echo "PostgreSQL: N/A") $(du -sh ${BACKUP_ROOT}/redis/dump_${BACKUP_DATE}.rdb 2>/dev/null || echo "Redis: N/A") $(du -sh ${BACKUP_ROOT}/qdrant/*_${BACKUP_DATE}.snapshot 2>/dev/null || echo "Qdrant: N/A") $(du -sh ${BACKUP_ROOT}/config/config_${BACKUP_DATE}.tar.gz* 2>/dev/null || echo "Config: N/A") Total Backup Size: $(du -sh ${BACKUP_ROOT} | cut -f1) Verification Status: - PostgreSQL: $(test -f ${BACKUP_ROOT}/postgres/daily/voiceassist_${BACKUP_DATE}* && echo "✓" || echo "✗") - Redis: $(test -f ${BACKUP_ROOT}/redis/dump_${BACKUP_DATE}.rdb && echo "✓" || echo "✗") - Config: $(test -f ${BACKUP_ROOT}/config/config_${BACKUP_DATE}.tar.gz* && echo "✓" || echo "✗") Restore Command: /usr/local/bin/va-restore-full ${BACKUP_DATE} EOF # Create compressed archive of entire backup echo "Creating DR archive..." tar -czf ${DR_DIR}/voiceassist_dr_${BACKUP_DATE}.tar.gz \ -C ${BACKUP_ROOT} \ postgres/daily \ redis \ qdrant \ config \ volumes # Restart application docker compose start voiceassist-server echo "============================================" echo "Full System Backup Completed" echo "Manifest: ${DR_DIR}/manifest_${BACKUP_DATE}.txt" echo "Archive: ${DR_DIR}/voiceassist_dr_${BACKUP_DATE}.tar.gz" echo "============================================" cat ${DR_DIR}/manifest_${BACKUP_DATE}.txt
Complete System Restore
#!/bin/bash # Save as: /usr/local/bin/va-restore-full set -e if [ -z "$1" ]; then echo "Usage: $0 <backup_date>" echo "Example: $0 20251121_120000" exit 1 fi BACKUP_DATE=$1 BACKUP_ROOT="/backups" echo "============================================" echo "Starting Full System Restore" echo "Backup Date: ${BACKUP_DATE}" echo "============================================" # Verify manifest exists MANIFEST="${BACKUP_ROOT}/disaster_recovery/manifest_${BACKUP_DATE}.txt" if [ ! -f "$MANIFEST" ]; then echo "ERROR: Manifest not found: $MANIFEST" exit 1 fi echo "Manifest found. Displaying backup details:" cat $MANIFEST echo "" read -p "Do you want to proceed with restore? This will OVERWRITE all data (yes/no): " CONFIRM if [ "$CONFIRM" != "yes" ]; then echo "Restore cancelled" exit 0 fi # Stop all services echo "Stopping services..." docker compose down # 1. Restore PostgreSQL echo "[1/5] Restoring PostgreSQL..." POSTGRES_BACKUP="${BACKUP_ROOT}/postgres/daily/voiceassist_${BACKUP_DATE}.dump.gz" if [ -f "$POSTGRES_BACKUP" ]; then docker compose up -d postgres sleep 10 docker compose exec postgres psql -U voiceassist -d postgres -c \ "DROP DATABASE IF EXISTS voiceassist;" docker compose exec postgres psql -U voiceassist -d postgres -c \ "CREATE DATABASE voiceassist OWNER voiceassist;" gunzip -c $POSTGRES_BACKUP | docker compose exec -T postgres pg_restore \ -U voiceassist \ -d voiceassist \ --verbose \ --no-owner \ --no-acl echo "✓ PostgreSQL restored" else echo "✗ PostgreSQL backup not found" fi # 2. Restore Redis echo "[2/5] Restoring Redis..." REDIS_BACKUP="${BACKUP_ROOT}/redis/dump_${BACKUP_DATE}.rdb" if [ -f "$REDIS_BACKUP" ]; then docker compose stop redis docker compose cp $REDIS_BACKUP redis:/data/dump.rdb docker compose start redis sleep 5 echo "✓ Redis restored" else echo "✗ Redis backup not found" fi # 3. Restore Qdrant echo "[3/5] Restoring Qdrant..." docker compose up -d qdrant sleep 10 for SNAPSHOT in ${BACKUP_ROOT}/qdrant/*_${BACKUP_DATE}.snapshot; do if [ -f "$SNAPSHOT" ]; then COLLECTION=$(basename $SNAPSHOT | sed "s/_${BACKUP_DATE}.snapshot//") echo "Restoring collection: $COLLECTION" curl -X POST \ "http://localhost:6333/collections/${COLLECTION}/snapshots/upload" \ -H "Content-Type: multipart/form-data" \ -F "snapshot=@${SNAPSHOT}" echo "✓ Collection $COLLECTION restored" fi done # 4. Restore Configuration echo "[4/5] Restoring Configuration..." CONFIG_BACKUP="${BACKUP_ROOT}/config/config_${BACKUP_DATE}.tar.gz" CONFIG_BACKUP_ENC="${CONFIG_BACKUP}.gpg" if [ -f "$CONFIG_BACKUP_ENC" ]; then gpg --decrypt $CONFIG_BACKUP_ENC | tar -xzf - -C /Users/mohammednazmy/VoiceAssist echo "✓ Configuration restored (encrypted)" elif [ -f "$CONFIG_BACKUP" ]; then tar -xzf $CONFIG_BACKUP -C /Users/mohammednazmy/VoiceAssist echo "✓ Configuration restored" else echo "✗ Configuration backup not found" fi # 5. Start all services echo "[5/5] Starting all services..." docker compose up -d # Wait for services to be ready echo "Waiting for services to be ready..." sleep 30 # Verify system health echo "" echo "============================================" echo "Restore Completed - Verifying System Health" echo "============================================" curl -s http://localhost:8000/health | jq '.' docker compose ps echo "" echo "Full system restore completed" echo "Please verify all functionality before resuming operations"
Disaster Recovery Scenarios
Scenario 1: Complete Hardware Failure
# On NEW hardware: # 1. Install Docker and Docker Compose # 2. Clone repository git clone <repository_url> /Users/mohammednazmy/VoiceAssist cd /Users/mohammednazmy/VoiceAssist # 3. Copy DR archive from backup location scp backup-server:/backups/disaster_recovery/voiceassist_dr_YYYYMMDD_HHMMSS.tar.gz /tmp/ # 4. Extract DR archive mkdir -p /backups tar -xzf /tmp/voiceassist_dr_YYYYMMDD_HHMMSS.tar.gz -C /backups # 5. Run full restore /usr/local/bin/va-restore-full YYYYMMDD_HHMMSS # 6. Verify and resume operations
Scenario 2: Data Corruption
# 1. Stop application docker compose stop voiceassist-server # 2. Create backup of corrupted data (for analysis) /usr/local/bin/va-backup-full # 3. Identify last known good backup ls -lh /backups/disaster_recovery/manifest_*.txt # 4. Restore from last good backup /usr/local/bin/va-restore-full YYYYMMDD_HHMMSS # 5. Verify data integrity # Run data validation scripts # 6. Resume operations docker compose start voiceassist-server
Scenario 3: Accidental Data Deletion
# Restore specific component only (faster than full restore) # For deleted PostgreSQL table/data: BACKUP_FILE="/backups/postgres/daily/voiceassist_20251121_120000.dump.gz" # Use table-specific restore procedure # For deleted Redis data: # Use Redis restore procedure # For deleted Qdrant collection: # Use Qdrant restore procedure
Backup Monitoring
Backup Health Check
#!/bin/bash # Save as: /usr/local/bin/va-backup-health BACKUP_ROOT="/backups" ALERT_EMAIL="ops-team@voiceassist.local" echo "Backup Health Check - $(date)" echo "========================================" # Check PostgreSQL backups LATEST_PG=$(find ${BACKUP_ROOT}/postgres/daily -name "*.dump.gz" -mtime -1 | wc -l) if [ $LATEST_PG -eq 0 ]; then echo "⚠️ WARNING: No PostgreSQL backup in last 24 hours" else echo "✓ PostgreSQL backups are current" fi # Check Redis backups LATEST_REDIS=$(find ${BACKUP_ROOT}/redis -name "*.rdb" -mtime -1 | wc -l) if [ $LATEST_REDIS -eq 0 ]; then echo "⚠️ WARNING: No Redis backup in last 24 hours" else echo "✓ Redis backups are current" fi # Check Qdrant backups LATEST_QDRANT=$(find ${BACKUP_ROOT}/qdrant -name "*.snapshot" -mtime -1 | wc -l) if [ $LATEST_QDRANT -eq 0 ]; then echo "⚠️ WARNING: No Qdrant backup in last 24 hours" else echo "✓ Qdrant backups are current" fi # Check disk space DISK_USAGE=$(df -h ${BACKUP_ROOT} | tail -1 | awk '{print $5}' | sed 's/%//') if [ $DISK_USAGE -gt 80 ]; then echo "⚠️ WARNING: Backup disk usage at ${DISK_USAGE}%" else echo "✓ Backup disk space is adequate (${DISK_USAGE}%)" fi # Check backup sizes echo "" echo "Backup Sizes:" echo "PostgreSQL: $(du -sh ${BACKUP_ROOT}/postgres | cut -f1)" echo "Redis: $(du -sh ${BACKUP_ROOT}/redis | cut -f1)" echo "Qdrant: $(du -sh ${BACKUP_ROOT}/qdrant | cut -f1)" echo "Config: $(du -sh ${BACKUP_ROOT}/config | cut -f1)" echo "Total: $(du -sh ${BACKUP_ROOT} | cut -f1)"
Related Documentation
- Deployment Runbook
- Incident Response Runbook
- Troubleshooting Runbook
- Monitoring Runbook
- UNIFIED_ARCHITECTURE.md
Document Version: 1.0
Last Updated: 2025-11-21
Maintained By: VoiceAssist DevOps Team
Review Cycle: Quarterly or after each disaster recovery event
Next Review: 2026-02-21
Scaling Runbook
Last Updated: 2025-11-27
Purpose: Comprehensive guide for scaling VoiceAssist V2 infrastructure
Scaling Overview
Current Architecture
Load Balancer (if configured)
↓
VoiceAssist Server (Scalable)
↓
├── PostgreSQL (Primary + Read Replicas)
├── Redis (Cluster or Sentinel)
└── Qdrant (Distributed)
Scaling Strategy
| Component | Type | Method | Max Recommended |
|---|---|---|---|
| VoiceAssist Server | Stateless | Horizontal | 10+ instances |
| PostgreSQL | Stateful | Vertical + Read Replicas | 1 primary + 5 replicas |
| Redis | Stateful | Vertical + Cluster | 6 nodes (3 primaries + 3 replicas) |
| Qdrant | Stateful | Horizontal + Sharding | 6+ nodes |
When to Scale
Scaling Triggers
Immediate Scaling (Reactive)
Scale immediately if:
- CPU usage > 80% for 10+ minutes
- Memory usage > 85%
- Response time > 2 seconds (p95)
- Error rate > 5%
- Connection pool exhausted
- Queue depth > 1000
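The CPU, memory, and connection-pool triggers are covered by the `va-capacity-check` script later in this runbook. The rough sketch below adds a check for the error-rate trigger from recent logs and surfaces the latency metrics for inspection; the log format (`status=NNN`) and metric name are taken from earlier sections and may need adjusting.

```bash
# Rough check of the error-rate / latency scaling triggers (sketch).

# Error rate over the last 15 minutes, from request logs (assumes "status=NNN" in log lines)
TOTAL=$(docker compose logs --since 15m voiceassist-server | grep -cE "status=[0-9]+")
ERRORS=$(docker compose logs --since 15m voiceassist-server | grep -cE "status=5[0-9][0-9]")
echo "Requests: $TOTAL, 5xx: $ERRORS"
if [ "$TOTAL" -gt 0 ] && [ $((ERRORS * 100 / TOTAL)) -ge 5 ]; then
  echo "Error rate >= 5% - reactive scaling trigger met"
fi

# Latency: inspect the duration histogram/summary exported at /metrics
# (assumes a Prometheus-style http_request_duration metric; adjust the pattern if needed)
curl -s http://localhost:8000/metrics | grep http_request_duration | head -20
```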
Planned Scaling (Proactive)
Schedule scaling if:
- Expected traffic increase (events, marketing campaigns)
- New feature launch with heavy load
- Approaching 70% capacity on any metric
- Seasonal traffic patterns
Scaling Decision Matrix
# Quick capacity check cat > /usr/local/bin/va-capacity-check <<'EOF' #!/bin/bash echo "VoiceAssist Capacity Check - $(date)" echo "========================================" # Check application load CPU=$(docker stats --no-stream --format "{{.CPUPerc}}" voiceassist-voiceassist-server-1 | sed 's/%//') MEM=$(docker stats --no-stream --format "{{.MemPerc}}" voiceassist-voiceassist-server-1 | sed 's/%//') echo "Application:" echo " CPU: ${CPU}%" echo " Memory: ${MEM}%" # Database connections DB_CONN=$(docker compose exec -T postgres psql -U voiceassist -d voiceassist -t -c \ "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';" | tr -d ' ') DB_MAX=$(docker compose exec -T postgres psql -U voiceassist -d voiceassist -t -c \ "SHOW max_connections;" | tr -d ' ') DB_USAGE=$((DB_CONN * 100 / DB_MAX)) echo "Database:" echo " Active Connections: ${DB_CONN}/${DB_MAX} (${DB_USAGE}%)" # Redis memory REDIS_MEM=$(docker compose exec -T redis redis-cli INFO memory | grep used_memory_human | cut -d: -f2 | tr -d '\r') echo "Redis:" echo " Memory Usage: ${REDIS_MEM}" # Recommendation echo "" echo "Scaling Recommendations:" if (( $(echo "$CPU > 80" | bc -l) )) || (( $(echo "$MEM > 85" | bc -l) )); then echo "🔴 IMMEDIATE: Scale application horizontally" elif (( $(echo "$CPU > 70" | bc -l) )) || (( $(echo "$MEM > 70" | bc -l) )); then echo "🟡 SOON: Plan to scale within 24 hours" elif [ $DB_USAGE -gt 80 ]; then echo "🔴 IMMEDIATE: Scale database connections or add read replica" else echo "🟢 OK: Current capacity is adequate" fi EOF chmod +x /usr/local/bin/va-capacity-check
Horizontal Scaling - Application Server
Quick Scale Up
# Scale to 3 instances docker compose up -d --scale voiceassist-server=3 # Verify all instances running docker compose ps voiceassist-server # Expected output: 3 containers running # voiceassist-voiceassist-server-1 # voiceassist-voiceassist-server-2 # voiceassist-voiceassist-server-3 # Check health of all instances for i in {1..3}; do echo "Instance $i:" docker inspect voiceassist-voiceassist-server-$i | jq '.[0].State.Health.Status' done
Scale with Load Balancer
# Add to docker-compose.yml services: nginx: image: nginx:alpine ports: - "80:80" volumes: - ./nginx.conf:/etc/nginx/nginx.conf:ro depends_on: - voiceassist-server voiceassist-server: # ... existing config ... deploy: replicas: 3 resources: limits: cpus: "2" memory: 2G reservations: cpus: "1" memory: 1G
# Create nginx.conf for load balancing upstream voiceassist_backend { least_conn; # Use least connections algorithm server voiceassist-server-1:8000 max_fails=3 fail_timeout=30s; server voiceassist-server-2:8000 max_fails=3 fail_timeout=30s; server voiceassist-server-3:8000 max_fails=3 fail_timeout=30s; keepalive 32; } server { listen 80; location / { proxy_pass http://voiceassist_backend; proxy_http_version 1.1; proxy_set_header Connection ""; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; # Timeouts proxy_connect_timeout 5s; proxy_send_timeout 60s; proxy_read_timeout 60s; # Health check proxy_next_upstream error timeout http_500 http_502 http_503; } location /health { access_log off; proxy_pass http://voiceassist_backend; } }
# Deploy with load balancer docker compose up -d --scale voiceassist-server=3 # Verify load balancing for i in {1..10}; do curl -s http://localhost/health | jq -r '.hostname' done # Should show different hostnames, confirming requests are spread across instances (least_conn balancing)
Auto-Scaling with Metrics
#!/bin/bash # Save as: /usr/local/bin/va-autoscale MIN_INSTANCES=2 MAX_INSTANCES=10 SCALE_UP_THRESHOLD=70 SCALE_DOWN_THRESHOLD=30 CHECK_INTERVAL=60 while true; do # Get current instance count CURRENT=$(docker compose ps -q voiceassist-server | wc -l) # Get average CPU across all instances AVG_CPU=$(docker stats --no-stream --format "{{.CPUPerc}}" \ $(docker compose ps -q voiceassist-server) | \ sed 's/%//g' | \ awk '{s+=$1; n++} END {print s/n}') echo "[$(date)] Instances: $CURRENT, Avg CPU: ${AVG_CPU}%" # Scale up if (( $(echo "$AVG_CPU > $SCALE_UP_THRESHOLD" | bc -l) )) && [ $CURRENT -lt $MAX_INSTANCES ]; then NEW_COUNT=$((CURRENT + 1)) echo "Scaling UP to $NEW_COUNT instances (CPU: ${AVG_CPU}%)" docker compose up -d --scale voiceassist-server=$NEW_COUNT # Scale down elif (( $(echo "$AVG_CPU < $SCALE_DOWN_THRESHOLD" | bc -l) )) && [ $CURRENT -gt $MIN_INSTANCES ]; then NEW_COUNT=$((CURRENT - 1)) echo "Scaling DOWN to $NEW_COUNT instances (CPU: ${AVG_CPU}%)" docker compose up -d --scale voiceassist-server=$NEW_COUNT else echo "No scaling needed" fi sleep $CHECK_INTERVAL done
Graceful Instance Shutdown
# Scale down with zero downtime CURRENT=$(docker compose ps -q voiceassist-server | wc -l) TARGET=$((CURRENT - 1)) echo "Scaling from $CURRENT to $TARGET instances" # Get last instance LAST_INSTANCE="voiceassist-voiceassist-server-${CURRENT}" # Stop accepting new connections (if using load balancer) docker compose exec nginx nginx -s reload # Wait for existing connections to drain (30 seconds) echo "Draining connections..." sleep 30 # Check remaining connections ACTIVE_CONN=$(docker exec $LAST_INSTANCE netstat -an | grep :8000 | grep ESTABLISHED | wc -l) echo "Active connections on instance: $ACTIVE_CONN" # Scale down docker compose up -d --scale voiceassist-server=$TARGET echo "Scaled down to $TARGET instances"
Vertical Scaling - Application Server
Increase CPU and Memory
# Update docker-compose.yml services: voiceassist-server: deploy: resources: limits: cpus: "4" # Increased from 2 memory: 4G # Increased from 2G reservations: cpus: "2" # Increased from 1 memory: 2G # Increased from 1G
# Apply changes docker compose up -d voiceassist-server # Verify new limits docker inspect voiceassist-voiceassist-server-1 | \ jq '.[0].HostConfig.Memory, .[0].HostConfig.NanoCpus' # Monitor performance improvement docker stats voiceassist-voiceassist-server-1
Optimize Application Workers
# Increase Gunicorn workers in docker-compose.yml or .env # Rule: workers = (2 x CPU cores) + 1 # For 4 CPU cores: GUNICORN_WORKERS=9 # (2 x 4) + 1 # Set the value where the container reads its environment, then recreate the service echo "GUNICORN_WORKERS=9" >> .env docker compose up -d voiceassist-server # Verify worker count docker compose exec voiceassist-server ps aux | grep gunicorn
PostgreSQL Scaling
Vertical Scaling - Increase Resources
# Update docker-compose.yml services: postgres: deploy: resources: limits: cpus: "4" memory: 8G reservations: cpus: "2" memory: 4G command: - "postgres" - "-c" - "max_connections=200" # Increased from 100 - "-c" - "shared_buffers=2GB" # Increased from 256MB - "-c" - "effective_cache_size=6GB" # Increased - "-c" - "maintenance_work_mem=512MB" # Increased - "-c" - "checkpoint_completion_target=0.9" - "-c" - "wal_buffers=16MB" - "-c" - "default_statistics_target=100" - "-c" - "random_page_cost=1.1" - "-c" - "effective_io_concurrency=200" - "-c" - "work_mem=10MB" # Increased - "-c" - "min_wal_size=1GB" - "-c" - "max_wal_size=4GB" # Increased
# Apply changes docker compose up -d postgres # Verify new settings docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SHOW max_connections; SHOW shared_buffers; SHOW effective_cache_size;"
Read Replica Setup
# Add to docker-compose.yml services: postgres-replica: image: postgres:15 environment: POSTGRES_PASSWORD: ${POSTGRES_PASSWORD} volumes: - postgres_replica_data:/var/lib/postgresql/data command: - "postgres" - "-c" - "hot_standby=on" - "-c" - "max_connections=200" depends_on: - postgres volumes: postgres_replica_data:
# Setup replication on primary docker compose exec postgres psql -U voiceassist -d postgres <<EOF -- Create replication user CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'replica_password'; -- Configure pg_hba.conf for replication -- Add to postgresql.conf: -- wal_level = replica -- max_wal_senders = 10 -- max_replication_slots = 10 -- hot_standby = on EOF # Restart primary docker compose restart postgres # Take a base backup for the replica; -R writes standby.signal and primary_conninfo # (recovery.conf is no longer used on PostgreSQL 12+) docker compose exec postgres pg_basebackup \ -h postgres \ -D /var/lib/postgresql/data-replica \ -U replicator \ -R \ -v \ -P \ -W # Start replica docker compose up -d postgres-replica # Verify replication docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT * FROM pg_stat_replication;"
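The replica only pays off if the application actually routes read traffic to it. Below is a minimal sketch of read/write routing, assuming SQLAlchemy async engines; the replica URL, engine options, and `get_session` helper are illustrative assumptions, not part of the current codebase.

```python
# Hypothetical read/write routing sketch (assumes SQLAlchemy 2.x async engines).
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker

# Writes go to the primary; lag-tolerant reads can go to the replica.
primary_engine = create_async_engine(
    "postgresql+asyncpg://voiceassist:password@postgres:5432/voiceassist"
)
replica_engine = create_async_engine(
    "postgresql+asyncpg://voiceassist:password@postgres-replica:5432/voiceassist"
)

PrimarySession = async_sessionmaker(primary_engine, expire_on_commit=False)
ReplicaSession = async_sessionmaker(replica_engine, expire_on_commit=False)

def get_session(readonly: bool = False):
    """Return a session bound to the replica for read-only work, else the primary."""
    return ReplicaSession() if readonly else PrimarySession()
```

Keep replication lag in mind: only lag-tolerant queries (listings, analytics) should be routed to the replica.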
Connection Pooling with PgBouncer
# Add to docker-compose.yml services: pgbouncer: image: pgbouncer/pgbouncer:latest environment: DATABASES_HOST: postgres DATABASES_PORT: 5432 DATABASES_USER: voiceassist DATABASES_PASSWORD: ${POSTGRES_PASSWORD} DATABASES_DBNAME: voiceassist PGBOUNCER_POOL_MODE: transaction PGBOUNCER_MAX_CLIENT_CONN: 1000 PGBOUNCER_DEFAULT_POOL_SIZE: 25 PGBOUNCER_MIN_POOL_SIZE: 10 PGBOUNCER_RESERVE_POOL_SIZE: 5 PGBOUNCER_SERVER_IDLE_TIMEOUT: 600 ports: - "6432:6432" depends_on: - postgres
# Update application to use PgBouncer # Change DATABASE_URL in .env DATABASE_URL=postgresql://voiceassist:password@pgbouncer:6432/voiceassist # Restart application docker compose up -d voiceassist-server # Monitor PgBouncer docker compose exec pgbouncer psql -h localhost -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;" docker compose exec pgbouncer psql -h localhost -p 6432 -U pgbouncer pgbouncer -c "SHOW STATS;"
Redis Scaling
Vertical Scaling - Increase Memory
# Update docker-compose.yml services: redis: deploy: resources: limits: cpus: "2" memory: 4G # Increased from 2G reservations: cpus: "1" memory: 2G command: - redis-server - --maxmemory 3gb # Increased from 1gb - --maxmemory-policy allkeys-lru
# Apply changes docker compose up -d redis # Verify new memory limit docker compose exec redis redis-cli CONFIG GET maxmemory
Redis Cluster Setup (Horizontal Scaling)
# Add to docker-compose.yml services: redis-node-1: image: redis:7-alpine command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendonly yes --port 6379 volumes: - redis_node_1_data:/data redis-node-2: image: redis:7-alpine command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendonly yes --port 6379 volumes: - redis_node_2_data:/data redis-node-3: image: redis:7-alpine command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendonly yes --port 6379 volumes: - redis_node_3_data:/data redis-node-4: image: redis:7-alpine command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendonly yes --port 6379 volumes: - redis_node_4_data:/data redis-node-5: image: redis:7-alpine command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendonly yes --port 6379 volumes: - redis_node_5_data:/data redis-node-6: image: redis:7-alpine command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendonly yes --port 6379 volumes: - redis_node_6_data:/data volumes: redis_node_1_data: redis_node_2_data: redis_node_3_data: redis_node_4_data: redis_node_5_data: redis_node_6_data:
# Start all nodes docker compose up -d redis-node-{1..6} # Create cluster docker compose exec redis-node-1 redis-cli --cluster create \ redis-node-1:6379 \ redis-node-2:6379 \ redis-node-3:6379 \ redis-node-4:6379 \ redis-node-5:6379 \ redis-node-6:6379 \ --cluster-replicas 1 # Verify cluster docker compose exec redis-node-1 redis-cli CLUSTER INFO docker compose exec redis-node-1 redis-cli CLUSTER NODES
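Application code must switch to a cluster-aware client, because keys are sharded across nodes. A minimal sketch with redis-py's `RedisCluster` follows; the bootstrap hostname is an assumption based on the compose service names above.

```python
# Minimal cluster client sketch (assumes redis-py >= 4.1, which ships redis.cluster).
from redis.cluster import RedisCluster

# Any reachable node can bootstrap the client; it discovers the rest of the cluster.
rc = RedisCluster(host="redis-node-1", port=6379, decode_responses=True)

rc.ping()                        # sanity check against the cluster
rc.set("session:123", "active")  # routed to the node owning this key's hash slot
print(rc.get("session:123"))
```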
Redis Sentinel (High Availability)
# Add to docker-compose.yml services: redis-master: image: redis:7-alpine command: redis-server --port 6379 volumes: - redis_master_data:/data redis-slave-1: image: redis:7-alpine command: redis-server --port 6379 --slaveof redis-master 6379 volumes: - redis_slave_1_data:/data depends_on: - redis-master redis-slave-2: image: redis:7-alpine command: redis-server --port 6379 --slaveof redis-master 6379 volumes: - redis_slave_2_data:/data depends_on: - redis-master redis-sentinel-1: image: redis:7-alpine command: redis-sentinel /etc/redis/sentinel.conf volumes: - ./redis-sentinel.conf:/etc/redis/sentinel.conf depends_on: - redis-master redis-sentinel-2: image: redis:7-alpine command: redis-sentinel /etc/redis/sentinel.conf volumes: - ./redis-sentinel.conf:/etc/redis/sentinel.conf depends_on: - redis-master redis-sentinel-3: image: redis:7-alpine command: redis-sentinel /etc/redis/sentinel.conf volumes: - ./redis-sentinel.conf:/etc/redis/sentinel.conf depends_on: - redis-master
# Create redis-sentinel.conf cat > redis-sentinel.conf <<EOF port 26379 sentinel monitor mymaster redis-master 6379 2 sentinel down-after-milliseconds mymaster 5000 sentinel parallel-syncs mymaster 1 sentinel failover-timeout mymaster 10000 EOF # Start Sentinel setup docker compose up -d redis-master redis-slave-1 redis-slave-2 docker compose up -d redis-sentinel-1 redis-sentinel-2 redis-sentinel-3 # Verify Sentinel docker compose exec redis-sentinel-1 redis-cli -p 26379 SENTINEL masters
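With Sentinel in place, the application should discover the current master through the Sentinels instead of hard-coding `redis-master`. A minimal redis-py sketch, assuming the Sentinel containers are reachable by their compose service names:

```python
# Minimal Sentinel discovery sketch (redis-py provides redis.sentinel.Sentinel).
from redis.sentinel import Sentinel

sentinel = Sentinel(
    [("redis-sentinel-1", 26379), ("redis-sentinel-2", 26379), ("redis-sentinel-3", 26379)],
    socket_timeout=0.5,
)

master = sentinel.master_for("mymaster", decode_responses=True)   # writes
replica = sentinel.slave_for("mymaster", decode_responses=True)   # reads

master.set("health", "ok")
print(replica.get("health"))
```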
Qdrant Scaling
Vertical Scaling - Increase Resources
# Update docker-compose.yml services: qdrant: deploy: resources: limits: cpus: "4" # Increased from 2 memory: 8G # Increased from 4G reservations: cpus: "2" memory: 4G
Horizontal Scaling - Distributed Cluster
# Add to docker-compose.yml services: qdrant-node-1: image: qdrant/qdrant:latest ports: - "6333:6333" - "6334:6334" environment: QDRANT__CLUSTER__ENABLED: "true" QDRANT__CLUSTER__P2P__PORT: "6335" volumes: - qdrant_node_1_storage:/qdrant/storage qdrant-node-2: image: qdrant/qdrant:latest ports: - "6343:6333" - "6344:6334" environment: QDRANT__CLUSTER__ENABLED: "true" QDRANT__CLUSTER__P2P__PORT: "6335" QDRANT__CLUSTER__P2P__BOOTSTRAP__URI: "http://qdrant-node-1:6335" volumes: - qdrant_node_2_storage:/qdrant/storage depends_on: - qdrant-node-1 qdrant-node-3: image: qdrant/qdrant:latest ports: - "6353:6333" - "6354:6334" environment: QDRANT__CLUSTER__ENABLED: "true" QDRANT__CLUSTER__P2P__PORT: "6335" QDRANT__CLUSTER__P2P__BOOTSTRAP__URI: "http://qdrant-node-1:6335" volumes: - qdrant_node_3_storage:/qdrant/storage depends_on: - qdrant-node-1 volumes: qdrant_node_1_storage: qdrant_node_2_storage: qdrant_node_3_storage:
# Start cluster docker compose up -d qdrant-node-{1..3} # Verify cluster curl -s http://localhost:6333/cluster | jq '.' # Create sharded collection curl -X PUT http://localhost:6333/collections/voice_embeddings \ -H 'Content-Type: application/json' \ -d '{ "vectors": { "size": 384, "distance": "Cosine" }, "shard_number": 3, "replication_factor": 2 }' # Verify sharding curl -s http://localhost:6333/collections/voice_embeddings/cluster | jq '.'
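Once the sharded collection exists, the application can talk to any node and Qdrant routes the request to the shards that own the data. A minimal search sketch with the official `qdrant-client` package; the query vector below is a placeholder.

```python
# Minimal sketch using the qdrant-client package against the sharded collection.
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# With replication_factor=2, the query still succeeds if one node holding
# a shard copy is unavailable.
hits = client.search(
    collection_name="voice_embeddings",
    query_vector=[0.0] * 384,   # placeholder 384-dim query vector
    limit=5,
)
for hit in hits:
    print(hit.id, hit.score)
```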
Load Testing
Setup Load Testing Tools
# Install Apache Bench (simple HTTP testing) # macOS: brew install httpd # Install Locust (Python load testing) pip install locust # Install k6 (modern load testing) brew install k6
Basic Load Test with Apache Bench
# Test health endpoint ab -n 1000 -c 10 http://localhost:8000/health # Test with authentication ab -n 1000 -c 10 -H "Authorization: Bearer YOUR_TOKEN" \ http://localhost:8000/api/users/me # Results show: # - Requests per second # - Time per request # - Transfer rate # - Distribution of response times
Advanced Load Test with Locust
# Create locustfile.py from locust import HttpUser, task, between class VoiceAssistUser(HttpUser): wait_time = between(1, 3) def on_start(self): # Login and get token response = self.client.post("/api/auth/login", json={ "email": "test@example.com", "password": "password" }) self.token = response.json()["access_token"] @task(3) def view_profile(self): self.client.get("/api/users/me", headers={"Authorization": f"Bearer {self.token}"}) @task(2) def list_conversations(self): self.client.get("/api/conversations", headers={"Authorization": f"Bearer {self.token}"}) @task(1) def create_message(self): self.client.post("/api/conversations/1/messages", headers={"Authorization": f"Bearer {self.token}"}, json={"content": "Test message"})
# Run load test locust -f locustfile.py --host=http://localhost:8000 # Open browser to http://localhost:8089 # Configure: # - Number of users: 100 # - Spawn rate: 10 users/second # - Host: http://localhost:8000 # Command line mode (headless) locust -f locustfile.py --host=http://localhost:8000 \ --users 100 --spawn-rate 10 --run-time 5m --headless
Load Test with k6
// Create loadtest.js import http from "k6/http"; import { check, sleep } from "k6"; export let options = { stages: [ { duration: "2m", target: 50 }, // Ramp up to 50 users { duration: "5m", target: 50 }, // Stay at 50 users { duration: "2m", target: 100 }, // Ramp up to 100 users { duration: "5m", target: 100 }, // Stay at 100 users { duration: "2m", target: 0 }, // Ramp down ], thresholds: { http_req_duration: ["p(95)<500"], // 95% of requests under 500ms http_req_failed: ["rate<0.01"], // Less than 1% errors }, }; export default function () { // Login let loginRes = http.post( "http://localhost:8000/api/auth/login", JSON.stringify({ email: "test@example.com", password: "password", }), { headers: { "Content-Type": "application/json" } }, ); check(loginRes, { "login successful": (r) => r.status === 200, }); let token = loginRes.json("access_token"); // Make authenticated requests let headers = { Authorization: `Bearer ${token}`, }; let profileRes = http.get("http://localhost:8000/api/users/me", { headers }); check(profileRes, { "profile retrieved": (r) => r.status === 200, }); sleep(1); }
# Run k6 load test k6 run loadtest.js # With custom output k6 run --out json=results.json loadtest.js # View results cat results.json | jq '.metrics'
Database Load Testing
# Test PostgreSQL under load # Create pgbench database docker compose exec postgres createdb -U voiceassist pgbench_test # Initialize pgbench docker compose exec postgres pgbench -i -U voiceassist pgbench_test # Run benchmark (100 clients, 1000 transactions each) docker compose exec postgres pgbench \ -c 100 \ -t 1000 \ -U voiceassist \ pgbench_test # Results show: # - TPS (transactions per second) # - Average latency # - Connection time
Redis Load Testing
# Use redis-benchmark docker compose exec redis redis-benchmark \ -h localhost \ -p 6379 \ -c 100 \ -n 100000 \ -d 100 \ --csv # Test specific commands docker compose exec redis redis-benchmark \ -t set,get,incr,lpush,lpop \ -n 100000 \ -q
Capacity Planning
Current Capacity Assessment
#!/bin/bash # Save as: /usr/local/bin/va-capacity-report echo "VoiceAssist Capacity Report - $(date)" echo "========================================" echo "" # Application instances APP_INSTANCES=$(docker compose ps -q voiceassist-server | wc -l) echo "Application Instances: $APP_INSTANCES" # Resource usage per instance docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" \ $(docker compose ps -q voiceassist-server) echo "" # Database metrics echo "Database Metrics:" docker compose exec -T postgres psql -U voiceassist -d voiceassist <<EOF SELECT 'Active Connections' as metric, count(*) as value FROM pg_stat_activity WHERE state = 'active' UNION ALL SELECT 'Database Size', pg_size_pretty(pg_database_size('voiceassist'))::text UNION ALL SELECT 'Largest Table', pg_size_pretty(max(pg_total_relation_size(schemaname||'.'||tablename)))::text FROM pg_tables WHERE schemaname = 'public'; EOF echo "" # Redis metrics echo "Redis Metrics:" docker compose exec -T redis redis-cli INFO stats | grep -E "(total_commands_processed|instantaneous_ops_per_sec|used_memory_human)" echo "" # Qdrant metrics echo "Qdrant Metrics:" curl -s http://localhost:6333/metrics | grep -E "(collections_total|points_total)" echo "" # Estimated capacity echo "Capacity Estimates:" echo " Current RPS: [Calculate from metrics]" echo " Max RPS (current setup): [Estimate based on testing]" echo " Headroom: [Percentage]" echo "" # Scaling recommendations echo "Scaling Recommendations:" echo " - Application: Scale to $(( APP_INSTANCES + 2 )) instances for 50% more capacity" echo " - Database: Consider read replica when connections > 150" echo " - Redis: Current memory usage allows 2x data growth"
Growth Planning
# Estimate required resources for growth # Current metrics (example) CURRENT_USERS=1000 CURRENT_RPS=50 CURRENT_DB_SIZE_GB=10 # Growth projections GROWTH_RATE=1.5 # 50% growth MONTHS=6 # Calculate future requirements cat > /tmp/capacity_projection.py <<EOF import math current_users = ${CURRENT_USERS} current_rps = ${CURRENT_RPS} current_db_gb = ${CURRENT_DB_SIZE_GB} monthly_growth = ${GROWTH_RATE} months = ${MONTHS} future_users = current_users * (monthly_growth ** months) future_rps = current_rps * (monthly_growth ** months) future_db_gb = current_db_gb * (monthly_growth ** months) # Resource estimates # Assuming 1 app instance handles 50 RPS app_instances = math.ceil(future_rps / 50) # Database: 100 connections per 1000 users db_connections = math.ceil((future_users / 1000) * 100) # Redis: 1GB per 10000 users redis_gb = math.ceil(future_users / 10000) print(f"Capacity Projection for {months} months:") print(f"=" * 50) print(f"Current Users: {current_users:,.0f}") print(f"Projected Users: {future_users:,.0f} ({future_users/current_users:.1f}x)") print(f"") print(f"Current RPS: {current_rps}") print(f"Projected RPS: {future_rps:.0f} ({future_rps/current_rps:.1f}x)") print(f"") print(f"Resource Requirements:") print(f" Application Instances: {app_instances}") print(f" Database Connections: {db_connections}") print(f" Database Storage: {future_db_gb:.0f} GB") print(f" Redis Memory: {redis_gb} GB") print(f"") print(f"Recommended Setup:") if app_instances <= 5: print(f" Application: {app_instances} instances with load balancer") else: print(f" Application: {app_instances} instances with auto-scaling") if db_connections > 150: print(f" Database: Primary + 2 read replicas + PgBouncer") else: print(f" Database: Primary + PgBouncer") if redis_gb > 4: print(f" Redis: 3-node cluster") else: print(f" Redis: Single instance ({redis_gb}GB)") EOF python3 /tmp/capacity_projection.py
Performance Optimization
Application Optimization
# Enable response caching cat >> .env <<EOF CACHE_ENABLED=true CACHE_TTL=300 CACHE_MAX_SIZE=1000 EOF # Enable gzip compression in nginx cat > nginx-compression.conf <<EOF gzip on; gzip_vary on; gzip_min_length 1024; gzip_comp_level 6; gzip_types text/plain text/css text/xml text/javascript application/json application/javascript application/xml+rss; EOF # Optimize database queries docker compose exec postgres psql -U voiceassist -d voiceassist <<EOF -- Create missing indexes CREATE INDEX IF NOT EXISTS idx_conversations_user_id ON conversations(user_id); CREATE INDEX IF NOT EXISTS idx_messages_conversation_id ON messages(conversation_id); CREATE INDEX IF NOT EXISTS idx_messages_created_at ON messages(created_at DESC); -- Analyze tables ANALYZE conversations; ANALYZE messages; ANALYZE users; EOF
Database Query Optimization
# Identify slow queries docker compose exec postgres psql -U voiceassist -d voiceassist <<EOF -- Enable pg_stat_statements CREATE EXTENSION IF NOT EXISTS pg_stat_statements; -- Top 10 slowest queries SELECT substring(query, 1, 100) AS short_query, calls, total_time, mean_time, max_time, stddev_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10; EOF # Optimize connection management cat >> .env <<EOF DB_POOL_SIZE=20 DB_MAX_OVERFLOW=10 DB_POOL_TIMEOUT=30 DB_POOL_RECYCLE=1800 EOF
Caching Strategy
# Implement multi-layer caching in application # Example: cache.py import redis import json import hashlib from functools import wraps redis_client = redis.Redis(host='redis', port=6379, decode_responses=True) def cache_result(ttl=300): """Cache function results in Redis""" def decorator(func): @wraps(func) def wrapper(*args, **kwargs): # Generate cache key key_data = f"{func.__name__}:{args}:{kwargs}" cache_key = hashlib.md5(key_data.encode()).hexdigest() # Try to get from cache cached = redis_client.get(cache_key) if cached: return json.loads(cached) # Execute function result = func(*args, **kwargs) # Store in cache redis_client.setex(cache_key, ttl, json.dumps(result)) return result return wrapper return decorator # Usage: @cache_result(ttl=600) def get_user_conversations(user_id): # Expensive database query return db.query(Conversation).filter_by(user_id=user_id).all()
Monitoring During Scaling
Real-time Metrics
#!/bin/bash # Save as: /usr/local/bin/va-scaling-monitor watch -n 5 ' echo "=== Application Instances ===" docker compose ps voiceassist-server | grep Up | wc -l echo "" echo "=== Resource Usage ===" docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemPerc}}" | grep voiceassist echo "" echo "=== Request Rate (approx) ===" docker compose logs --since 1m voiceassist-server | grep "200 OK" | wc -l echo "requests/min" echo "" echo "=== Error Rate ===" docker compose logs --since 1m voiceassist-server | grep -i error | wc -l echo "errors/min" echo "" echo "=== Database Connections ===" docker compose exec -T postgres psql -U voiceassist -d voiceassist -t -c \ "SELECT count(*), state FROM pg_stat_activity GROUP BY state;" '
Scaling Checklist
Pre-Scaling
- Review current metrics and capacity
- Identify bottlenecks
- Test scaling in staging environment
- Update monitoring thresholds
- Prepare rollback plan
- Notify team of scaling activity
During Scaling
- Monitor all metrics closely
- Watch for errors or anomalies
- Verify new instances are healthy
- Check load distribution
- Test critical functionality
Post-Scaling
- Verify performance improvement (see the verification sketch after this checklist)
- Update documentation
- Review metrics for 24 hours
- Adjust monitoring alerts
- Document lessons learned
- Update capacity planning estimates
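Much of the post-scaling verification can be scripted. Below is a minimal sketch that hits the health endpoint through the load balancer and reports error counts and a rough p95 latency; the URL, sample count, and thresholds are assumptions to adapt to your environment.

```python
# Minimal post-scaling smoke check (assumes the stack is reachable at this URL).
import statistics
import time

import requests

URL = "http://localhost/health"   # load balancer endpoint; adjust as needed
SAMPLES = 50

latencies, errors = [], 0
for _ in range(SAMPLES):
    start = time.monotonic()
    try:
        resp = requests.get(URL, timeout=5)
        if resp.status_code != 200:
            errors += 1
    except requests.RequestException:
        errors += 1
    latencies.append(time.monotonic() - start)

p95 = sorted(latencies)[int(0.95 * len(latencies)) - 1]
print(f"errors: {errors}/{SAMPLES}")
print(f"mean latency: {statistics.mean(latencies) * 1000:.0f} ms")
print(f"p95 latency: {p95 * 1000:.0f} ms")
```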
Related Documentation
- Deployment Runbook
- Monitoring Runbook
- Troubleshooting Runbook
- CONNECTION_POOL_OPTIMIZATION.md
- UNIFIED_ARCHITECTURE.md
Document Version: 1.0 Last Updated: 2025-11-21 Maintained By: VoiceAssist DevOps Team Review Cycle: Quarterly or after significant scaling events Next Review: 2026-02-21
Monitoring Runbook
Last Updated: 2025-11-27 Purpose: Comprehensive guide for monitoring and observability in VoiceAssist V2
Monitoring Architecture
Application Metrics
↓
Prometheus (Metrics Collection)
↓
Grafana (Visualization)
↓
AlertManager (Alerting)
↓
PagerDuty/Slack/Email
Key Monitoring Components
| Component | Purpose | Port | Dashboard |
|---|---|---|---|
| Prometheus | Metrics collection & storage | 9090 | http://localhost:9090 |
| Grafana | Metrics visualization | 3000 | http://localhost:3000 |
| AlertManager | Alert routing & management | 9093 | http://localhost:9093 |
| Application Metrics | Custom app metrics | 8000/metrics | http://localhost:8000/metrics |
Setup Monitoring Stack
Docker Compose Configuration
# Add to docker-compose.yml services: prometheus: image: prom/prometheus:latest ports: - "9090:9090" volumes: - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml - ./monitoring/alerts.yml:/etc/prometheus/alerts.yml - prometheus_data:/prometheus command: - "--config.file=/etc/prometheus/prometheus.yml" - "--storage.tsdb.path=/prometheus" - "--storage.tsdb.retention.time=30d" - "--web.console.libraries=/etc/prometheus/console_libraries" - "--web.console.templates=/etc/prometheus/consoles" grafana: image: grafana/grafana:latest ports: - "3000:3000" volumes: - grafana_data:/var/lib/grafana - ./monitoring/grafana/provisioning:/etc/grafana/provisioning - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards environment: - GF_SECURITY_ADMIN_USER=admin - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin} - GF_USERS_ALLOW_SIGN_UP=false depends_on: - prometheus alertmanager: image: prom/alertmanager:latest ports: - "9093:9093" volumes: - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml - alertmanager_data:/alertmanager command: - "--config.file=/etc/alertmanager/alertmanager.yml" - "--storage.path=/alertmanager" node-exporter: image: prom/node-exporter:latest ports: - "9100:9100" command: - "--path.procfs=/host/proc" - "--path.sysfs=/host/sys" - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)" volumes: - /proc:/host/proc:ro - /sys:/host/sys:ro - /:/rootfs:ro postgres-exporter: image: prometheuscommunity/postgres-exporter:latest ports: - "9187:9187" environment: DATA_SOURCE_NAME: "postgresql://voiceassist:${POSTGRES_PASSWORD}@postgres:5432/voiceassist?sslmode=disable" depends_on: - postgres redis-exporter: image: oliver006/redis_exporter:latest ports: - "9121:9121" environment: REDIS_ADDR: "redis:6379" depends_on: - redis volumes: prometheus_data: grafana_data: alertmanager_data:
Prometheus Configuration
# Create monitoring/prometheus.yml global: scrape_interval: 15s evaluation_interval: 15s external_labels: cluster: "voiceassist-prod" environment: "production" # Load alerting rules rule_files: - "/etc/prometheus/alerts.yml" # Alertmanager configuration alerting: alertmanagers: - static_configs: - targets: ["alertmanager:9093"] # Scrape configurations scrape_configs: # VoiceAssist Application - job_name: "voiceassist-app" static_configs: - targets: ["voiceassist-server:8000"] metrics_path: "/metrics" scrape_interval: 10s # PostgreSQL - job_name: "postgresql" static_configs: - targets: ["postgres-exporter:9187"] # Redis - job_name: "redis" static_configs: - targets: ["redis-exporter:9121"] # Node metrics - job_name: "node" static_configs: - targets: ["node-exporter:9100"] # Prometheus itself - job_name: "prometheus" static_configs: - targets: ["localhost:9090"] # Grafana - job_name: "grafana" static_configs: - targets: ["grafana:3000"]
Alert Rules
# Create monitoring/alerts.yml groups: - name: voiceassist_alerts interval: 30s rules: # Application availability - alert: ApplicationDown expr: up{job="voiceassist-app"} == 0 for: 1m labels: severity: critical component: application annotations: summary: "VoiceAssist application is down" description: "Application {{ $labels.instance }} is not responding" # High error rate - alert: HighErrorRate expr: | rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05 for: 5m labels: severity: warning component: application annotations: summary: "High error rate detected" description: "Error rate is {{ $value | humanizePercentage }} over last 5 minutes" # Slow response times - alert: SlowResponseTime expr: | histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]) ) > 2 for: 5m labels: severity: warning component: application annotations: summary: "Slow API response times" description: "95th percentile response time is {{ $value }}s" # High CPU usage - alert: HighCPUUsage expr: | 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80 for: 10m labels: severity: warning component: infrastructure annotations: summary: "High CPU usage" description: "CPU usage is {{ $value }}% on {{ $labels.instance }}" # High memory usage - alert: HighMemoryUsage expr: | (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85 for: 10m labels: severity: warning component: infrastructure annotations: summary: "High memory usage" description: "Memory usage is {{ $value }}% on {{ $labels.instance }}" # Database connection pool exhaustion - alert: DatabaseConnectionPoolExhausted expr: | pg_stat_database_numbackends / pg_settings_max_connections > 0.8 for: 5m labels: severity: warning component: database annotations: summary: "Database connection pool nearly exhausted" description: "Database connections at {{ $value | humanizePercentage }} of maximum" # Database down - alert: DatabaseDown expr: up{job="postgresql"} == 0 for: 1m labels: severity: critical component: database annotations: summary: "PostgreSQL database is down" description: "Database {{ $labels.instance }} is not responding" # Redis down - alert: RedisDown expr: up{job="redis"} == 0 for: 1m labels: severity: critical component: cache annotations: summary: "Redis is down" description: "Redis {{ $labels.instance }} is not responding" # High Redis memory usage - alert: HighRedisMemory expr: | redis_memory_used_bytes / redis_memory_max_bytes > 0.9 for: 5m labels: severity: warning component: cache annotations: summary: "Redis memory usage high" description: "Redis memory usage at {{ $value | humanizePercentage }}" # Disk space low - alert: DiskSpaceLow expr: | (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20 for: 10m labels: severity: warning component: infrastructure annotations: summary: "Low disk space" description: "Only {{ $value }}% disk space remaining on {{ $labels.instance }}" # Certificate expiration - alert: SSLCertificateExpiring expr: | (ssl_certificate_expiry_seconds - time()) / 86400 < 30 for: 1h labels: severity: warning component: infrastructure annotations: summary: "SSL certificate expiring soon" description: "SSL certificate expires in {{ $value }} days"
AlertManager Configuration
# Create monitoring/alertmanager.yml global: resolve_timeout: 5m slack_api_url: "${SLACK_WEBHOOK_URL}" # Default route route: receiver: "default" group_by: ["alertname", "cluster", "service"] group_wait: 10s group_interval: 10s repeat_interval: 12h routes: # Critical alerts -> PagerDuty + Slack - match: severity: critical receiver: "pagerduty-critical" continue: true - match: severity: critical receiver: "slack-critical" # Warning alerts -> Slack only - match: severity: warning receiver: "slack-warnings" # Receivers receivers: - name: "default" slack_configs: - channel: "#voiceassist-alerts" title: "VoiceAssist Alert" text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}\n{{ end }}' - name: "pagerduty-critical" pagerduty_configs: - service_key: "${PAGERDUTY_SERVICE_KEY}" description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}" - name: "slack-critical" slack_configs: - channel: "#voiceassist-critical" username: "AlertManager" color: "danger" title: "🔴 CRITICAL: {{ .GroupLabels.alertname }}" text: | *Summary:* {{ .CommonAnnotations.summary }} *Description:* {{ .CommonAnnotations.description }} *Severity:* {{ .GroupLabels.severity }} *Component:* {{ .GroupLabels.component }} - name: "slack-warnings" slack_configs: - channel: "#voiceassist-alerts" username: "AlertManager" color: "warning" title: "⚠️ WARNING: {{ .GroupLabels.alertname }}" text: | *Summary:* {{ .CommonAnnotations.summary }} *Description:* {{ .CommonAnnotations.description }} *Severity:* {{ .GroupLabels.severity }} *Component:* {{ .GroupLabels.component }} - name: "email-ops" email_configs: - to: "ops-team@voiceassist.local" from: "alertmanager@voiceassist.local" smarthost: "smtp.gmail.com:587" auth_username: "${SMTP_USERNAME}" auth_password: "${SMTP_PASSWORD}" headers: Subject: "[VoiceAssist] {{ .GroupLabels.alertname }}"
Deploy Monitoring Stack
# Create monitoring directory (relative to the project root) mkdir -p monitoring/grafana/{provisioning,dashboards} # Start monitoring stack docker compose up -d prometheus grafana alertmanager node-exporter postgres-exporter redis-exporter # Verify services docker compose ps | grep -E "(prometheus|grafana|alertmanager)" # Check Prometheus targets curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}' # Access Grafana echo "Grafana: http://localhost:3000 (admin/admin)" echo "Prometheus: http://localhost:9090" echo "AlertManager: http://localhost:9093"
Grafana Dashboards
Provision Datasource
# Create monitoring/grafana/provisioning/datasources/prometheus.yml apiVersion: 1 datasources: - name: Prometheus type: prometheus access: proxy url: http://prometheus:9090 isDefault: true editable: false
Provision Dashboards
# Create monitoring/grafana/provisioning/dashboards/dashboards.yml apiVersion: 1 providers: - name: "VoiceAssist" orgId: 1 folder: "VoiceAssist V2" type: file disableDeletion: false updateIntervalSeconds: 30 allowUiUpdates: true options: path: /var/lib/grafana/dashboards
Application Overview Dashboard
// Create monitoring/grafana/dashboards/application-overview.json { "dashboard": { "title": "VoiceAssist - Application Overview", "tags": ["voiceassist", "application"], "timezone": "browser", "panels": [ { "title": "Request Rate", "type": "graph", "targets": [ { "expr": "rate(http_requests_total{job=\"voiceassist-app\"}[5m])", "legendFormat": "{{method}} {{endpoint}}" } ] }, { "title": "Response Time (p95)", "type": "graph", "targets": [ { "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))", "legendFormat": "p95" } ] }, { "title": "Error Rate", "type": "graph", "targets": [ { "expr": "rate(http_requests_total{status=~\"5..\"}[5m])", "legendFormat": "5xx errors" } ] }, { "title": "Active Instances", "type": "stat", "targets": [ { "expr": "count(up{job=\"voiceassist-app\"} == 1)" } ] } ] } }
Database Dashboard
// Create monitoring/grafana/dashboards/database.json { "dashboard": { "title": "VoiceAssist - Database", "tags": ["voiceassist", "database", "postgresql"], "panels": [ { "title": "Database Connections", "type": "graph", "targets": [ { "expr": "pg_stat_database_numbackends", "legendFormat": "Active connections" } ] }, { "title": "Query Duration", "type": "graph", "targets": [ { "expr": "rate(pg_stat_database_tup_fetched[5m])", "legendFormat": "Rows fetched/sec" } ] }, { "title": "Database Size", "type": "graph", "targets": [ { "expr": "pg_database_size_bytes", "legendFormat": "Database size" } ] }, { "title": "Cache Hit Ratio", "type": "gauge", "targets": [ { "expr": "rate(pg_stat_database_blks_hit[5m]) / (rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))" } ] } ] } }
Import Pre-built Dashboards
# Import Node Exporter dashboard curl -X POST http://localhost:3000/api/dashboards/import \ -H "Content-Type: application/json" \ -u admin:admin \ -d '{ "dashboard": { "id": null, "uid": null, "title": "Node Exporter Full", "gnetId": 1860 }, "overwrite": false, "inputs": [ { "name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus", "value": "Prometheus" } ] }' # Import PostgreSQL dashboard curl -X POST http://localhost:3000/api/dashboards/import \ -H "Content-Type: application/json" \ -u admin:admin \ -d '{ "dashboard": { "id": null, "uid": null, "title": "PostgreSQL Database", "gnetId": 9628 }, "overwrite": false, "inputs": [ { "name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus", "value": "Prometheus" } ] }' # Import Redis dashboard curl -X POST http://localhost:3000/api/dashboards/import \ -H "Content-Type: application/json" \ -u admin:admin \ -d '{ "dashboard": { "id": null, "uid": null, "title": "Redis Dashboard", "gnetId": 11835 }, "overwrite": false, "inputs": [ { "name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus", "value": "Prometheus" } ] }'
Application Metrics
Instrument Application Code
# Add to application code (e.g., app/monitoring.py) from prometheus_client import Counter, Histogram, Gauge, generate_latest from fastapi import FastAPI, Response import time app = FastAPI() # Metrics REQUEST_COUNT = Counter( 'http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'] ) REQUEST_DURATION = Histogram( 'http_request_duration_seconds', 'HTTP request duration in seconds', ['method', 'endpoint'] ) ACTIVE_REQUESTS = Gauge( 'http_requests_active', 'Number of active HTTP requests', ['method', 'endpoint'] ) DB_CONNECTION_POOL = Gauge( 'db_connection_pool_size', 'Database connection pool size', ['state'] # active, idle ) CACHE_OPERATIONS = Counter( 'cache_operations_total', 'Total cache operations', ['operation', 'status'] # get/set, hit/miss ) # Middleware to track metrics @app.middleware("http") async def track_metrics(request, call_next): method = request.method endpoint = request.url.path ACTIVE_REQUESTS.labels(method=method, endpoint=endpoint).inc() start_time = time.time() try: response = await call_next(request) status = response.status_code except Exception as e: status = 500 raise finally: duration = time.time() - start_time REQUEST_COUNT.labels( method=method, endpoint=endpoint, status=status ).inc() REQUEST_DURATION.labels( method=method, endpoint=endpoint ).observe(duration) ACTIVE_REQUESTS.labels(method=method, endpoint=endpoint).dec() return response # Metrics endpoint @app.get("/metrics") async def metrics(): return Response( content=generate_latest(), media_type="text/plain" ) # Custom metric tracking def track_cache_operation(operation: str, hit: bool): """Track cache hit/miss""" status = "hit" if hit else "miss" CACHE_OPERATIONS.labels(operation=operation, status=status).inc() def update_connection_pool_metrics(active: int, idle: int): """Update database connection pool metrics""" DB_CONNECTION_POOL.labels(state="active").set(active) DB_CONNECTION_POOL.labels(state="idle").set(idle)
Custom Business Metrics
# Track business-specific metrics import time from prometheus_client import Counter, Gauge, Histogram # User metrics USER_REGISTRATIONS = Counter( 'user_registrations_total', 'Total user registrations' ) ACTIVE_USERS = Gauge( 'active_users', 'Number of currently active users' ) # Conversation metrics CONVERSATIONS_CREATED = Counter( 'conversations_created_total', 'Total conversations created' ) MESSAGES_SENT = Counter( 'messages_sent_total', 'Total messages sent', ['conversation_type'] ) # Voice processing metrics VOICE_PROCESSING_DURATION = Histogram( 'voice_processing_duration_seconds', 'Voice processing duration in seconds' ) VOICE_PROCESSING_ERRORS = Counter( 'voice_processing_errors_total', 'Total voice processing errors', ['error_type'] ) # Usage in application def create_conversation(user_id: int): CONVERSATIONS_CREATED.inc() # ... rest of the logic def send_message(conversation_id: int, message: str): MESSAGES_SENT.labels(conversation_type="text").inc() # ... rest of the logic def process_voice(audio_data: bytes): start_time = time.time() try: result = process_audio(audio_data) VOICE_PROCESSING_DURATION.observe(time.time() - start_time) return result except Exception as e: VOICE_PROCESSING_ERRORS.labels(error_type=type(e).__name__).inc() raise
Log Aggregation
Structured Logging
# Configure structured logging import logging import json from datetime import datetime class JSONFormatter(logging.Formatter): def format(self, record): log_data = { 'timestamp': datetime.utcnow().isoformat(), 'level': record.levelname, 'logger': record.name, 'message': record.getMessage(), 'module': record.module, 'function': record.funcName, 'line': record.lineno } if record.exc_info: log_data['exception'] = self.formatException(record.exc_info) if hasattr(record, 'user_id'): log_data['user_id'] = record.user_id if hasattr(record, 'request_id'): log_data['request_id'] = record.request_id return json.dumps(log_data) # Configure logger handler = logging.StreamHandler() handler.setFormatter(JSONFormatter()) logger = logging.getLogger('voiceassist') logger.addHandler(handler) logger.setLevel(logging.INFO) # Usage logger.info("User logged in", extra={'user_id': 123}) logger.error("Database connection failed", exc_info=True)
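The formatter above only emits `request_id` when something attaches it to the log record. One common pattern is a contextvar set by request middleware plus a logging filter; a minimal sketch follows (the filter and middleware names are illustrative assumptions).

```python
# Sketch: propagate a per-request ID into every log record via a contextvar.
import contextvars
import logging
import uuid

request_id_var = contextvars.ContextVar("request_id", default=None)

class RequestIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        rid = request_id_var.get()
        if rid is not None:
            record.request_id = rid   # picked up by JSONFormatter above
        return True

logger = logging.getLogger("voiceassist")
logger.addFilter(RequestIdFilter())

# In a FastAPI middleware (illustrative):
# @app.middleware("http")
# async def add_request_id(request, call_next):
#     request_id_var.set(request.headers.get("X-Request-ID", str(uuid.uuid4())))
#     return await call_next(request)
```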
Centralized Logging with Loki
# Add to docker-compose.yml services: loki: image: grafana/loki:latest ports: - "3100:3100" volumes: - ./monitoring/loki-config.yml:/etc/loki/local-config.yaml - loki_data:/loki command: -config.file=/etc/loki/local-config.yaml promtail: image: grafana/promtail:latest volumes: - ./monitoring/promtail-config.yml:/etc/promtail/config.yml - /var/lib/docker/containers:/var/lib/docker/containers:ro - /var/run/docker.sock:/var/run/docker.sock command: -config.file=/etc/promtail/config.yml depends_on: - loki volumes: loki_data:
# Create monitoring/loki-config.yml auth_enabled: false server: http_listen_port: 3100 ingester: lifecycler: address: 127.0.0.1 ring: kvstore: store: inmemory replication_factor: 1 chunk_idle_period: 5m chunk_retain_period: 30s schema_config: configs: - from: 2020-10-24 store: boltdb object_store: filesystem schema: v11 index: prefix: index_ period: 168h storage_config: boltdb: directory: /loki/index filesystem: directory: /loki/chunks limits_config: enforce_metric_name: false reject_old_samples: true reject_old_samples_max_age: 168h chunk_store_config: max_look_back_period: 0s table_manager: retention_deletes_enabled: false retention_period: 0s
# Create monitoring/promtail-config.yml server: http_listen_port: 9080 grpc_listen_port: 0 positions: filename: /tmp/positions.yaml clients: - url: http://loki:3100/loki/api/v1/push scrape_configs: - job_name: docker docker_sd_configs: - host: unix:///var/run/docker.sock refresh_interval: 5s relabel_configs: - source_labels: ["__meta_docker_container_name"] regex: "/(.*)" target_label: "container" - source_labels: ["__meta_docker_container_log_stream"] target_label: "stream"
# Add Loki datasource to Grafana curl -X POST http://localhost:3000/api/datasources \ -H "Content-Type: application/json" \ -u admin:admin \ -d '{ "name": "Loki", "type": "loki", "url": "http://loki:3100", "access": "proxy", "isDefault": false }'
Health Checks
Application Health Endpoints
# Comprehensive health check endpoints from datetime import datetime from fastapi import APIRouter from typing import Dict router = APIRouter() @router.get("/health") async def health_check() -> Dict: """Basic health check - always returns 200 if app is running""" return { "status": "healthy", "timestamp": datetime.utcnow().isoformat(), "version": "2.0.0" } @router.get("/ready") async def readiness_check() -> Dict: """Readiness check - verifies all dependencies""" checks = { "database": await check_database(), "redis": await check_redis(), "qdrant": await check_qdrant() } all_healthy = all(checks.values()) return { "status": "ready" if all_healthy else "not_ready", "timestamp": datetime.utcnow().isoformat(), "checks": checks } async def check_database() -> bool: """Check database connectivity""" try: await db.execute("SELECT 1") return True except Exception: return False async def check_redis() -> bool: """Check Redis connectivity""" try: redis_client.ping() return True except Exception: return False async def check_qdrant() -> bool: """Check Qdrant connectivity""" try: response = await http_client.get("http://qdrant:6333/healthz") return response.status_code == 200 except Exception: return False @router.get("/live") async def liveness_check() -> Dict: """Liveness check - for Kubernetes/Docker""" return {"status": "alive"}
Docker Health Checks
# Update docker-compose.yml with health checks services: voiceassist-server: # ... existing config ... healthcheck: test: ["CMD", "curl", "-f", "http://localhost:8000/health"] interval: 30s timeout: 10s retries: 3 start_period: 40s postgres: # ... existing config ... healthcheck: test: ["CMD-SHELL", "pg_isready -U voiceassist"] interval: 10s timeout: 5s retries: 5 redis: # ... existing config ... healthcheck: test: ["CMD", "redis-cli", "ping"] interval: 10s timeout: 3s retries: 3 qdrant: # ... existing config ... healthcheck: test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"] interval: 30s timeout: 10s retries: 3
Monitoring Operations
Daily Monitoring Routine
#!/bin/bash # Save as: /usr/local/bin/va-monitoring-daily echo "VoiceAssist Daily Monitoring Report - $(date)" echo "==============================================" echo "" # 1. Check all services are up echo "1. Service Health:" docker compose ps | grep -E "(Up|healthy)" | wc -l docker compose ps echo "" # 2. Check Prometheus targets echo "2. Prometheus Targets:" curl -s http://localhost:9090/api/v1/targets | \ jq '.data.activeTargets[] | {job: .labels.job, health: .health}' echo "" # 3. Check for active alerts echo "3. Active Alerts:" curl -s http://localhost:9093/api/v1/alerts | \ jq '.data[] | select(.status.state=="active") | {name: .labels.alertname, severity: .labels.severity}' echo "" # 4. Resource usage summary echo "4. Resource Usage:" docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemPerc}}" | head -10 echo "" # 5. Error rate (last 24 hours) echo "5. Error Rate (24h):" docker compose logs --since 24h voiceassist-server | grep -i error | wc -l echo "" # 6. Database health echo "6. Database Health:" docker compose exec -T postgres psql -U voiceassist -d voiceassist <<EOF SELECT 'Connections' as metric, count(*)::text as value FROM pg_stat_activity UNION ALL SELECT 'Database Size', pg_size_pretty(pg_database_size('voiceassist')) UNION ALL SELECT 'Cache Hit Ratio', round((sum(blks_hit) * 100.0 / NULLIF(sum(blks_hit) + sum(blks_read), 0))::numeric, 2)::text || '%' FROM pg_stat_database; EOF echo "" # 7. Backup status echo "7. Last Backup:" ls -lh /backups/postgres/daily/*.dump.gz 2>/dev/null | tail -1 echo "" echo "==============================================" echo "Report completed"
Troubleshooting Monitoring Issues
Prometheus Not Scraping Targets
# Check Prometheus logs docker compose logs prometheus | tail -50 # Check target configuration curl -s http://localhost:9090/api/v1/targets | jq '.' # Verify network connectivity docker compose exec prometheus wget -O- http://voiceassist-server:8000/metrics # Reload Prometheus configuration curl -X POST http://localhost:9090/-/reload
Grafana Dashboards Not Loading
# Check Grafana logs docker compose logs grafana | tail -50 # Verify datasource connection curl -s http://localhost:3000/api/datasources \ -u admin:admin | jq '.' # Test Prometheus connection from Grafana curl -s http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up \ -u admin:admin | jq '.' # Restart Grafana docker compose restart grafana
Alerts Not Firing
# Check AlertManager status curl -s http://localhost:9093/api/v1/status | jq '.' # Check alert rules in Prometheus curl -s http://localhost:9090/api/v1/rules | jq '.' # Check specific alert state curl -s 'http://localhost:9090/api/v1/query?query=ALERTS{alertname="HighErrorRate"}' | jq '.' # Verify AlertManager configuration docker compose exec alertmanager amtool config show # Check AlertManager logs docker compose logs alertmanager | tail -50
Monitoring Best Practices
1. Define SLOs (Service Level Objectives)
# Document SLOs SLOs: - name: Availability target: 99.9% measurement: uptime over 30 days - name: Response Time target: p95 < 500ms measurement: 95th percentile of all API requests - name: Error Rate target: < 0.1% measurement: 5xx errors / total requests - name: Data Durability target: 99.999% measurement: no data loss events
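These targets become actionable once they are computed from the metrics Prometheus already scrapes. Below is a minimal sketch that queries the Prometheus HTTP API for 30-day availability and the consumed error budget; the query expressions assume the `up` and `http_requests_total` series defined earlier in this runbook.

```python
# Sketch: compute SLO attainment and error budget from the Prometheus HTTP API.
import requests

PROM = "http://localhost:9090/api/v1/query"

def instant_query(expr: str) -> float:
    result = requests.get(PROM, params={"query": expr}, timeout=10).json()
    return float(result["data"]["result"][0]["value"][1])

# Availability: fraction of time the app target was up over 30 days.
availability = instant_query('avg_over_time(up{job="voiceassist-app"}[30d])')

# Success rate: non-5xx requests over total requests, 30-day window.
success_rate = instant_query(
    'sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))'
)

slo_target = 0.999
budget_used = (1 - availability) / (1 - slo_target)
print(f"availability: {availability:.4%}, success rate: {success_rate:.4%}")
print(f"error budget consumed: {budget_used:.0%}")
```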
2. Alert Fatigue Prevention
# Guidelines for creating alerts: # - Alert on symptoms, not causes # - Make alerts actionable # - Include runbook links # - Set appropriate thresholds # - Use proper severity levels # - Group related alerts # Good alert example: - alert: UserFacingErrorRate expr: rate(http_requests_total{status="500"}[5m]) > 0.05 for: 5m annotations: summary: "High user-facing error rate" description: "More than 5% of requests failing" runbook_url: "https://docs.voiceassist.local/runbooks/troubleshooting#high-error-rate" # Bad alert example (too noisy): - alert: SingleError expr: increase(http_requests_total{status="500"}[1m]) > 0 for: 0s
3. Dashboard Organization
Dashboards Structure:
├── Executive Dashboard (high-level KPIs)
├── Application Overview (request rate, errors, latency)
├── Infrastructure (CPU, memory, disk, network)
├── Database Performance (connections, queries, cache hit ratio)
├── Cache Performance (Redis operations, memory, hit rate)
├── Business Metrics (users, conversations, messages)
└── On-Call Dashboard (active alerts, recent incidents)
Related Documentation
- Incident Response Runbook
- Troubleshooting Runbook
- Deployment Runbook
- Scaling Runbook
- UNIFIED_ARCHITECTURE.md
Document Version: 1.0 Last Updated: 2025-11-21 Maintained By: VoiceAssist DevOps Team Review Cycle: Quarterly Next Review: 2026-02-21
Troubleshooting Runbook
Last Updated: 2025-11-27 Purpose: Comprehensive troubleshooting guide for VoiceAssist V2 common issues
Quick Diagnostic Commands
# Save as: /usr/local/bin/va-diagnose #!/bin/bash echo "VoiceAssist Quick Diagnostics - $(date)" echo "=========================================" # System health echo -e "\n[1] Service Status:" docker compose ps echo -e "\n[2] Health Checks:" curl -s http://localhost:8000/health | jq '.' || echo "❌ Application not responding" echo -e "\n[3] Recent Errors (last 5 min):" docker compose logs --since 5m voiceassist-server 2>&1 | grep -i error | tail -10 echo -e "\n[4] Resource Usage:" docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" echo -e "\n[5] Database Connections:" docker compose exec -T postgres psql -U voiceassist -d voiceassist -t -c \ "SELECT count(*), state FROM pg_stat_activity GROUP BY state;" 2>/dev/null echo -e "\n[6] Redis Status:" docker compose exec -T redis redis-cli INFO server | grep -E "(redis_version|uptime_in_seconds)" 2>/dev/null echo -e "\n[7] Disk Space:" df -h | grep -E "(Filesystem|/$)" echo -e "\n========================================="
Issues by Symptom
1. Application Won't Start
Symptom
- Container exits immediately
- Health check fails
- "Connection refused" errors
Investigation
# Check container logs docker compose logs --tail=100 voiceassist-server # Check exit code docker compose ps -a voiceassist-server # Exit code 0 = normal, 1 = error, 137 = OOM killed, 139 = segfault # Check if port is already in use lsof -i :8000 # Verify environment variables docker compose config | grep -A 20 voiceassist-server # Check for missing dependencies docker compose exec voiceassist-server python -c "import sys; print(sys.path)"
Common Causes & Solutions
Cause: Missing environment variables
# Check required variables cat .env | grep -E "(DATABASE_URL|REDIS_URL|SECRET_KEY)" # Copy from example cp .env.example .env # Edit with correct values vim .env # Restart docker compose up -d voiceassist-server
Cause: Database not ready
# Check PostgreSQL status docker compose exec postgres pg_isready # Wait for database sleep 10 # Try starting again docker compose up -d voiceassist-server # Or add depends_on with health check in docker-compose.yml
Cause: Port conflict
# Find process using port lsof -i :8000 # Kill conflicting process kill -9 <PID> # Or change application port in docker-compose.yml ports: - "8001:8000" # Changed from 8000:8000
Cause: Corrupted Python cache
# Remove Python cache docker compose exec voiceassist-server find . -type d -name __pycache__ -exec rm -r {} + docker compose exec voiceassist-server find . -type f -name "*.pyc" -delete # Rebuild image docker compose build --no-cache voiceassist-server docker compose up -d voiceassist-server
2. Database Connection Issues
Symptom
- "Connection pool exhausted"
- "Too many connections"
- "Could not connect to database"
- Slow database queries
Investigation
# Check database is running docker compose ps postgres docker compose exec postgres pg_isready # Check active connections docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT count(*), state, wait_event_type FROM pg_stat_activity WHERE datname = 'voiceassist' GROUP BY state, wait_event_type;" # Check connection limit docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SHOW max_connections;" # Check for connection leaks docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT pid, usename, application_name, state, state_change, query FROM pg_stat_activity WHERE datname = 'voiceassist' ORDER BY state_change DESC LIMIT 20;" # Check for locks docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT pg_stat_activity.pid, pg_stat_activity.query, pg_locks.granted FROM pg_stat_activity JOIN pg_locks ON pg_stat_activity.pid = pg_locks.pid WHERE NOT pg_locks.granted LIMIT 10;"
Solutions
Solution 1: Increase connection pool size
# Update .env cat >> .env <<EOF DB_POOL_SIZE=30 DB_MAX_OVERFLOW=10 DB_POOL_TIMEOUT=30 DB_POOL_RECYCLE=1800 EOF # Restart application docker compose restart voiceassist-server # Verify new pool size docker compose logs voiceassist-server | grep -i "pool size"
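These variables only help if the engine factory reads them. A minimal sketch of how they would typically map onto SQLAlchemy pool options; how the application actually consumes them is an assumption.

```python
# Sketch: map the pool-related environment variables onto SQLAlchemy engine options.
import os
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    os.environ["DATABASE_URL"],
    pool_size=int(os.getenv("DB_POOL_SIZE", "30")),          # steady-state connections
    max_overflow=int(os.getenv("DB_MAX_OVERFLOW", "10")),     # extra connections under burst
    pool_timeout=int(os.getenv("DB_POOL_TIMEOUT", "30")),     # seconds to wait for a connection
    pool_recycle=int(os.getenv("DB_POOL_RECYCLE", "1800")),   # recycle before idle disconnects
    pool_pre_ping=True,                                       # drop dead connections transparently
)
```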
Solution 2: Kill idle connections
# Terminate idle connections older than 10 minutes docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'voiceassist' AND state = 'idle' AND state_change < current_timestamp - INTERVAL '10 minutes';" # Verify connections reduced docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT count(*) FROM pg_stat_activity WHERE datname = 'voiceassist';"
Solution 3: Increase max_connections in PostgreSQL
# Update docker-compose.yml services: postgres: command: - "postgres" - "-c" - "max_connections=200" # Increased from 100
# Restart PostgreSQL docker compose restart postgres # Verify docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SHOW max_connections;"
Solution 4: Add PgBouncer for connection pooling
# Add to docker-compose.yml services: pgbouncer: image: pgbouncer/pgbouncer:latest environment: DATABASES_HOST: postgres DATABASES_PORT: 5432 DATABASES_USER: voiceassist DATABASES_PASSWORD: ${POSTGRES_PASSWORD} DATABASES_DBNAME: voiceassist PGBOUNCER_POOL_MODE: transaction PGBOUNCER_MAX_CLIENT_CONN: 1000 PGBOUNCER_DEFAULT_POOL_SIZE: 25 ports: - "6432:6432"
# Update DATABASE_URL in .env DATABASE_URL=postgresql://voiceassist:password@pgbouncer:6432/voiceassist # Restart docker compose up -d
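One caveat with `PGBOUNCER_POOL_MODE: transaction`: server-side prepared statements do not survive across pooled transactions, so drivers that cache them need that cache disabled. A hedged sketch, assuming the application connects through SQLAlchemy with the asyncpg driver:

```python
# Sketch: disable asyncpg's prepared-statement cache when connecting through
# PgBouncer in transaction pooling mode.
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://voiceassist:password@pgbouncer:6432/voiceassist",
    connect_args={"statement_cache_size": 0},  # asyncpg option; needed for transaction pooling
)
```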
Solution 5: Fix connection leaks in code
# Ensure proper connection cleanup from contextlib import asynccontextmanager @asynccontextmanager async def get_db(): db = SessionLocal() try: yield db finally: await db.close() # Use context manager async with get_db() as db: result = await db.execute(query) # Connection automatically closed
3. High Response Times / Performance Issues
Symptom
- API requests taking > 2 seconds
- Timeout errors
- Slow page loads
Investigation
# Check current response times curl -o /dev/null -s -w "Time: %{time_total}s\n" http://localhost:8000/health # Check application metrics curl -s http://localhost:8000/metrics | grep http_request_duration # Monitor in real-time watch -n 2 'curl -o /dev/null -s -w "Time: %{time_total}s\n" http://localhost:8000/api/users/me -H "Authorization: Bearer TOKEN"' # Check for resource constraints docker stats --no-stream | grep voiceassist # Identify slow database queries docker compose exec postgres psql -U voiceassist -d voiceassist <<EOF SELECT pid, now() - query_start as duration, state, query FROM pg_stat_activity WHERE state != 'idle' AND now() - query_start > interval '5 seconds' ORDER BY duration DESC; EOF # Check query statistics docker compose exec postgres psql -U voiceassist -d voiceassist <<EOF SELECT substring(query, 1, 100) AS query, calls, total_time, mean_time, max_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10; EOF # Check Redis latency docker compose exec redis redis-cli --latency # Check if Redis is slow docker compose exec redis redis-cli SLOWLOG GET 10
Solutions
Solution 1: Add database indexes
```bash
# Identify missing indexes
docker compose exec postgres psql -U voiceassist -d voiceassist <<EOF
-- Find tables with sequential scans
SELECT schemaname, relname, seq_scan, seq_tup_read, idx_scan,
       seq_tup_read / seq_scan AS avg_seq_tup_read
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 10;
EOF

# Add recommended indexes
docker compose exec postgres psql -U voiceassist -d voiceassist <<EOF
-- Common indexes for VoiceAssist
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_conversations_user_id ON conversations(user_id);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_messages_conversation_id ON messages(conversation_id);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_messages_created_at ON messages(created_at DESC);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email ON users(email);
-- Analyze tables
ANALYZE conversations;
ANALYZE messages;
ANALYZE users;
EOF

# Verify index usage
docker compose exec postgres psql -U voiceassist -d voiceassist <<EOF
SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
EOF
```
Solution 2: Enable query result caching
```python
# Implement Redis caching for expensive queries
import hashlib
import json
from functools import wraps

import redis

redis_client = redis.Redis(host='redis', port=6379, decode_responses=True)

def cache_query(ttl=300):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            # Generate cache key from function name and arguments
            cache_key = f"query:{func.__name__}:{hashlib.md5(str(args).encode()).hexdigest()}"

            # Try cache first
            cached = redis_client.get(cache_key)
            if cached:
                return json.loads(cached)

            # Execute query
            result = await func(*args, **kwargs)

            # Cache result (must be JSON-serializable)
            redis_client.setex(cache_key, ttl, json.dumps(result))
            return result
        return wrapper
    return decorator

# Usage
@cache_query(ttl=600)
async def get_user_conversations(user_id: int):
    return await db.query(Conversation).filter_by(user_id=user_id).all()
```
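Cached results go stale when the underlying data changes, so write paths need a matching invalidation step. A minimal sketch that builds on the key-prefix convention used by the decorator above:

```python
def invalidate_query_cache(func_name: str) -> int:
    """Delete every cached result for the given decorated function."""
    deleted = 0
    # SCAN avoids blocking Redis the way KEYS would on a large keyspace
    for key in redis_client.scan_iter(match=f"query:{func_name}:*"):
        redis_client.delete(key)
        deleted += 1
    return deleted

# Example: after a new message is written, drop the cached conversation lists
invalidate_query_cache("get_user_conversations")
```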
Solution 3: Optimize database queries
```python
# Use eager loading to avoid N+1 queries
from sqlalchemy.orm import joinedload, selectinload

# Bad - causes N+1 queries
conversations = db.query(Conversation).all()
for conv in conversations:
    messages = conv.messages  # Separate query for each conversation

# Good - single query with join
conversations = db.query(Conversation)\
    .options(joinedload(Conversation.messages))\
    .all()

# Use selectinload for large collections (one extra query instead of a huge join)
conversations = db.query(Conversation)\
    .options(selectinload(Conversation.messages))\
    .all()
```
Solution 4: Scale application horizontally
```bash
# Add more application instances
docker compose up -d --scale voiceassist-server=3

# Verify instances
docker compose ps voiceassist-server

# Add load balancer (nginx)
# See SCALING.md for details
```
Solution 5: Increase resource limits
```yaml
# Update docker-compose.yml
services:
  voiceassist-server:
    deploy:
      resources:
        limits:
          cpus: "4"
          memory: 4G
```
docker compose up -d voiceassist-server
4. Redis Connection Issues
Symptom
- "Connection to Redis failed"
- "Redis timeout"
- Cache not working
Investigation
```bash
# Check Redis status
docker compose ps redis
docker compose exec redis redis-cli ping

# Check Redis connections
docker compose exec redis redis-cli CLIENT LIST

# Check Redis memory
docker compose exec redis redis-cli INFO memory

# Check Redis logs
docker compose logs --tail=100 redis

# Test connection from application
docker compose exec voiceassist-server python -c "
import redis
r = redis.Redis(host='redis', port=6379)
print(r.ping())
"
```
Solutions
Solution 1: Restart Redis
```bash
# Restart Redis
docker compose restart redis

# Wait for startup
sleep 5

# Verify
docker compose exec redis redis-cli ping

# Restart application
docker compose restart voiceassist-server
```
Solution 2: Clear Redis if memory full
```bash
# Check memory usage
docker compose exec redis redis-cli INFO memory | grep used_memory_human

# Clear all keys (WARNING: destroys cache)
docker compose exec redis redis-cli FLUSHALL

# Or clear specific database
docker compose exec redis redis-cli -n 0 FLUSHDB

# Verify memory freed
docker compose exec redis redis-cli INFO memory | grep used_memory_human
```
Solution 3: Increase Redis memory limit
```yaml
# Update docker-compose.yml (maxmemory increased from 1gb)
services:
  redis:
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
```
docker compose up -d redis
Solution 4: Fix connection string
```bash
# Verify REDIS_URL in .env
cat .env | grep REDIS_URL
# Should be: REDIS_URL=redis://redis:6379/0

# Update if wrong
echo "REDIS_URL=redis://redis:6379/0" >> .env

# Restart application
docker compose restart voiceassist-server
```
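After correcting the value, confirm the application container resolves it the same way. A quick check using the standard redis-py client:

```bash
docker compose exec voiceassist-server python -c "
import os, redis
url = os.getenv('REDIS_URL', 'redis://redis:6379/0')
client = redis.Redis.from_url(url)
print(url, '->', client.ping())
"
```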
5. Service Container Keeps Restarting
Symptom
- Container exits and restarts repeatedly
- "Restarting (1) X seconds ago" in docker compose ps
Investigation
```bash
# Check restart count
docker inspect voiceassist-voiceassist-server-1 | grep -A 5 RestartCount

# Check exit code
docker compose ps -a voiceassist-server

# Check recent logs
docker compose logs --tail=200 voiceassist-server

# Check health check
docker inspect voiceassist-voiceassist-server-1 | grep -A 20 Health

# Check resource limits
docker stats --no-stream voiceassist-voiceassist-server-1
```
Solutions
Solution 1: OOMKilled (exit code 137)
```bash
# Verify OOM kill
docker inspect voiceassist-voiceassist-server-1 | grep OOMKilled

# Check memory usage
docker stats --no-stream | grep voiceassist-server

# Increase memory limit in docker-compose.yml:
#   deploy:
#     resources:
#       limits:
#         memory: 4G   # Increased from 2G

# Restart
docker compose up -d voiceassist-server
```
Solution 2: Application crash loop
```bash
# Check for Python errors
docker compose logs voiceassist-server | grep -i "traceback\|error\|exception"

# Common fixes:
# - Fix missing environment variables
# - Fix import errors
# - Fix database connection issues

# Disable auto-restart temporarily to debug
docker update --restart=no voiceassist-voiceassist-server-1

# Check logs without restart interference
docker compose logs -f voiceassist-server
```
Solution 3: Failed health check
```bash
# Check health check command
docker inspect voiceassist-voiceassist-server-1 | grep -A 10 Healthcheck

# Test health check manually
docker compose exec voiceassist-server curl -f http://localhost:8000/health

# Increase health check timeout in docker-compose.yml:
#   healthcheck:
#     test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
#     interval: 30s
#     timeout: 10s       # Increased from 5s
#     retries: 5         # Increased from 3
#     start_period: 60s  # Increased from 40s

# Restart
docker compose up -d voiceassist-server
```
6. Authentication / JWT Issues
Symptom
- "Invalid token" errors
- "Token expired" errors
- Users logged out unexpectedly
Investigation
```bash
# Check JWT configuration
cat .env | grep -E "(SECRET_KEY|JWT_)"

# Test token generation
docker compose exec voiceassist-server python -c "
from jose import jwt
from datetime import datetime, timedelta
import os

secret = os.getenv('SECRET_KEY')
payload = {'sub': 'test', 'exp': datetime.utcnow() + timedelta(hours=1)}
token = jwt.encode(payload, secret, algorithm='HS256')
print('Token:', token)

# Decode
decoded = jwt.decode(token, secret, algorithms=['HS256'])
print('Decoded:', decoded)
"

# Check for token in Redis
docker compose exec redis redis-cli KEYS "session:*"
docker compose exec redis redis-cli GET "session:some-session-id"
```
Solutions
Solution 1: SECRET_KEY changed
```bash
# This invalidates all tokens
# Generate new SECRET_KEY
openssl rand -base64 32

# Update .env
echo "SECRET_KEY=<new-secret>" >> .env

# Restart application
docker compose restart voiceassist-server

# Note: All users will need to log in again
# Clear Redis sessions
docker compose exec redis redis-cli FLUSHDB
```
Solution 2: Token expiration too short
```bash
# Update .env
cat >> .env <<EOF
JWT_EXPIRATION_HOURS=24
JWT_REFRESH_EXPIRATION_DAYS=30
EOF

# Restart
docker compose restart voiceassist-server
```
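These variables only take effect if token creation reads them. A minimal sketch of that wiring, assuming python-jose as in the investigation step above (the helper name and defaults are illustrative, not the actual VoiceAssist code):

```python
import os
from datetime import datetime, timedelta, timezone

from jose import jwt

def create_access_token(subject: str) -> str:
    # Lifetime comes from the env var set above, with a sane default
    hours = int(os.getenv("JWT_EXPIRATION_HOURS", "24"))
    payload = {
        "sub": subject,
        "exp": datetime.now(timezone.utc) + timedelta(hours=hours),
    }
    return jwt.encode(payload, os.environ["SECRET_KEY"], algorithm="HS256")
```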
Solution 3: Clock skew issues
```bash
# Check system time
date

# Sync time (macOS)
sudo sntp -sS time.apple.com

# Restart containers to pick up the corrected clock
docker compose restart
```
7. Database Migration Issues
Symptom
- "Duplicate column" errors
- "Table already exists" errors
- Migration fails to apply
Investigation
```bash
# Check current migration version
docker compose run --rm voiceassist-server alembic current

# Check migration history
docker compose run --rm voiceassist-server alembic history

# Check pending migrations
docker compose run --rm voiceassist-server alembic show head

# Check database schema
docker compose exec postgres psql -U voiceassist -d voiceassist -c "\dt"
docker compose exec postgres psql -U voiceassist -d voiceassist -c "\d users"
```
Solutions
Solution 1: Migration already applied manually
```bash
# Stamp database with current migration
docker compose run --rm voiceassist-server alembic stamp head

# Verify
docker compose run --rm voiceassist-server alembic current
```
Solution 2: Conflicting migrations
```bash
# Check for branches
docker compose run --rm voiceassist-server alembic branches

# Merge branches if needed
docker compose run --rm voiceassist-server alembic merge -m "merge branches" <revision1> <revision2>

# Upgrade to merged revision
docker compose run --rm voiceassist-server alembic upgrade head
```
Solution 3: Rollback and retry
```bash
# Downgrade one version
docker compose run --rm voiceassist-server alembic downgrade -1

# Fix migration file
vim app/alembic/versions/<migration-file>.py

# Retry upgrade
docker compose run --rm voiceassist-server alembic upgrade head
```
Solution 4: Reset migrations (DESTRUCTIVE)
```bash
# ⚠️ WARNING: This will destroy all data!
# Backup first
docker compose exec postgres pg_dump -U voiceassist voiceassist > backup.sql

# Drop and recreate database
docker compose exec postgres psql -U voiceassist -d postgres <<EOF
DROP DATABASE voiceassist;
CREATE DATABASE voiceassist OWNER voiceassist;
EOF

# Run all migrations
docker compose run --rm voiceassist-server alembic upgrade head

# Verify
docker compose run --rm voiceassist-server alembic current
```
8. Disk Space Issues
Symptom
- "No space left on device"
- Services failing to start
- Logs not writing
Investigation
```bash
# Check disk usage
df -h

# Check Docker disk usage
docker system df

# Find large files
du -sh /var/lib/docker/*
du -sh ~/Library/Containers/com.docker.docker/Data/*

# Check logs size
docker compose logs voiceassist-server | wc -c

# Find large Docker objects
docker image ls --format "{{.Repository}}:{{.Tag}}\t{{.Size}}"
docker volume ls
docker ps -a --format "{{.Names}}\t{{.Size}}"
```
Solutions
Solution 1: Clean up Docker
```bash
# Remove unused containers
docker container prune -f

# Remove unused images
docker image prune -a -f

# Remove unused volumes
docker volume prune -f

# Remove unused networks
docker network prune -f

# Or clean everything (⚠️ removes all stopped containers, unused images, networks, and volumes)
docker system prune -a --volumes -f

# Check space freed
docker system df
```
Solution 2: Clean up old backups
```bash
# Remove old backups (keep last 7 days; Qdrant snapshots kept 14 days)
find /backups/postgres/daily -name "*.dump.gz" -mtime +7 -delete
find /backups/redis -name "*.rdb" -mtime +7 -delete
find /backups/qdrant -name "*.snapshot" -mtime +14 -delete

# Check backup directory size
du -sh /backups/*
```
Solution 3: Configure log rotation
Create /etc/docker/daemon.json with:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```
```bash
# Restart Docker daemon
sudo systemctl restart docker
# Or on macOS, restart Docker Desktop
```
Solution 4: Clear application logs
```bash
# Clear Docker logs for specific container
truncate -s 0 $(docker inspect --format='{{.LogPath}}' voiceassist-voiceassist-server-1)

# Remove old log files
find /var/log -name "*.log" -mtime +30 -delete
```
9. Network Connectivity Issues
Symptom
- "Connection refused"
- "Host unreachable"
- Containers can't communicate
Investigation
```bash
# Check Docker networks
docker network ls
docker network inspect voiceassist_default

# Test connectivity between containers
docker compose exec voiceassist-server ping -c 3 postgres
docker compose exec voiceassist-server ping -c 3 redis
docker compose exec voiceassist-server ping -c 3 qdrant

# Check DNS resolution
docker compose exec voiceassist-server nslookup postgres
docker compose exec voiceassist-server getent hosts postgres

# Check if ports are exposed
docker compose ps
docker port voiceassist-voiceassist-server-1

# Test from host
curl http://localhost:8000/health
telnet localhost 8000
```
Solutions
Solution 1: Recreate network
```bash
# Stop all services (this also removes the default network)
docker compose down

# Remove network if it still exists
docker network rm voiceassist_default

# Recreate everything
docker compose up -d

# Verify network
docker network inspect voiceassist_default
```
Solution 2: Fix DNS issues
```yaml
# Add to docker-compose.yml
services:
  voiceassist-server:
    dns:
      - 8.8.8.8
      - 8.8.4.4
```
docker compose up -d voiceassist-server
Solution 3: Use explicit links
```yaml
# Add to docker-compose.yml (if needed)
services:
  voiceassist-server:
    links:
      - postgres:postgres
      - redis:redis
      - qdrant:qdrant
```
Solution 4: Check firewall
```bash
# macOS - check if firewall is blocking Docker
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --getglobalstate

# Temporarily disable for testing
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate off

# Re-enable after testing
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate on
```
10. Qdrant Vector Search Issues
Symptom
- "Collection not found"
- "Vector dimension mismatch"
- Slow search results
Investigation
```bash
# Check Qdrant status
curl -s http://localhost:6333/healthz

# List collections
curl -s http://localhost:6333/collections | jq '.'

# Get collection info
curl -s http://localhost:6333/collections/voice_embeddings | jq '.'

# Check collection size
curl -s http://localhost:6333/collections/voice_embeddings | jq '.result.points_count'

# Check Qdrant logs
docker compose logs --tail=100 qdrant
```
Solutions
Solution 1: Create missing collection
```bash
# Create collection
curl -X PUT http://localhost:6333/collections/voice_embeddings \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": {
      "size": 384,
      "distance": "Cosine"
    }
  }'

# Verify creation
curl -s http://localhost:6333/collections/voice_embeddings | jq '.result.status'
```
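If the collection is created from application code instead, the equivalent with the official qdrant-client package looks roughly like this (the client construction, host, and recent-client API are assumptions based on the defaults above):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Create the collection only if it does not already exist
if not client.collection_exists("voice_embeddings"):
    client.create_collection(
        collection_name="voice_embeddings",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )
```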
Solution 2: Fix dimension mismatch
```bash
# Delete and recreate collection with correct dimensions
# ("size" must match the output dimension of your embedding model)
curl -X DELETE http://localhost:6333/collections/voice_embeddings

curl -X PUT http://localhost:6333/collections/voice_embeddings \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": {
      "size": 384,
      "distance": "Cosine"
    }
  }'
```
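To confirm which side is wrong before destroying anything, compare the collection's configured vector size with what the embedding model actually produces. A sketch, assuming sentence-transformers; the model name below is a placeholder for whatever the application really loads:

```python
import requests
from sentence_transformers import SentenceTransformer

# Collection config path assumes a single unnamed vector configuration
info = requests.get("http://localhost:6333/collections/voice_embeddings").json()
collection_dim = info["result"]["config"]["params"]["vectors"]["size"]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder model
model_dim = model.get_sentence_embedding_dimension()

print(f"collection={collection_dim}, model={model_dim}, match={collection_dim == model_dim}")
```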
Solution 3: Optimize collection for performance
```bash
# Create payload index
curl -X POST http://localhost:6333/collections/voice_embeddings/index \
  -H 'Content-Type: application/json' \
  -d '{
    "field_name": "text",
    "field_schema": "keyword"
  }'

# Optimize collection
curl -X POST http://localhost:6333/collections/voice_embeddings/optimizer
```
Solution 4: Clear and reindex
```bash
# Delete all points
curl -X POST http://localhost:6333/collections/voice_embeddings/points/delete \
  -H 'Content-Type: application/json' \
  -d '{
    "filter": {}
  }'

# Trigger reindexing from the application
# (application-specific code to rebuild vectors; see the sketch below)
```
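Rebuilding the vectors is application-specific; the general shape is to re-embed each source document and upsert it in batches. A rough sketch with qdrant-client, where `embed_text()` and the document source are placeholders for the real pipeline:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")

def reindex(documents):
    """documents: iterable of (id, text) pairs read from the source of truth (e.g. Postgres)."""
    batch = []
    for doc_id, text in documents:
        # embed_text() is a placeholder for the application's embedding call
        batch.append(PointStruct(id=doc_id, vector=embed_text(text), payload={"text": text}))
        if len(batch) >= 100:
            client.upsert(collection_name="voice_embeddings", points=batch)
            batch = []
    if batch:
        client.upsert(collection_name="voice_embeddings", points=batch)
```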
Troubleshooting Checklist
Before Escalating
- Checked recent logs (5-15 minutes)
- Verified all services are running
- Checked system resources (CPU, memory, disk)
- Reviewed recent changes (deployments, config)
- Attempted restart of affected service
- Checked for known issues in documentation
- Verified network connectivity
- Checked monitoring dashboards
- Documented symptoms and attempted solutions
Information to Collect for Escalation
```bash
#!/bin/bash
# Save as: /usr/local/bin/va-collect-debug-info

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
OUTPUT_DIR="/tmp/voiceassist-debug-${TIMESTAMP}"
mkdir -p $OUTPUT_DIR

echo "Collecting debug information..."

# System info
uname -a > $OUTPUT_DIR/system-info.txt
docker version >> $OUTPUT_DIR/system-info.txt
docker compose version >> $OUTPUT_DIR/system-info.txt

# Service status
docker compose ps > $OUTPUT_DIR/service-status.txt

# Logs
docker compose logs --tail=500 > $OUTPUT_DIR/all-logs.txt
docker compose logs --tail=500 voiceassist-server > $OUTPUT_DIR/app-logs.txt
docker compose logs --tail=200 postgres > $OUTPUT_DIR/postgres-logs.txt
docker compose logs --tail=200 redis > $OUTPUT_DIR/redis-logs.txt

# Configuration (values redacted before sharing)
docker compose config > $OUTPUT_DIR/docker-compose-config.yml
cp .env $OUTPUT_DIR/env-sanitized.txt
sed -i '' 's/=.*/=REDACTED/g' $OUTPUT_DIR/env-sanitized.txt

# Resource usage
docker stats --no-stream > $OUTPUT_DIR/resource-usage.txt
df -h > $OUTPUT_DIR/disk-usage.txt

# Network
docker network ls > $OUTPUT_DIR/networks.txt
docker network inspect voiceassist_default > $OUTPUT_DIR/network-inspect.json

# Database state
docker compose exec -T postgres psql -U voiceassist -d voiceassist -c \
  "SELECT count(*), state FROM pg_stat_activity GROUP BY state;" \
  > $OUTPUT_DIR/db-connections.txt

# Create archive
tar -czf voiceassist-debug-${TIMESTAMP}.tar.gz -C /tmp voiceassist-debug-${TIMESTAMP}

echo "Debug information collected: voiceassist-debug-${TIMESTAMP}.tar.gz"
echo "Please attach this file when escalating the issue"
```
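Once the script is saved, make it executable and run it from the repository root so `docker compose` picks up the right project:

```bash
chmod +x /usr/local/bin/va-collect-debug-info
cd ~/VoiceAssist && va-collect-debug-info
```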
Common Error Messages
Error: "bind: address already in use"
Solution:
```bash
# Find and kill process using the port
lsof -i :8000
kill -9 <PID>

# Or change port in docker-compose.yml
```
Error: "ERROR: could not find an available, non-overlapping IPv4 address pool"
Solution:
```bash
# Clean up unused networks
docker network prune

# Or specify custom network in docker-compose.yml:
#   networks:
#     default:
#       ipam:
#         config:
#           - subnet: 172.25.0.0/16
```
Error: "ERROR: Service 'X' failed to build"
Solution:
```bash
# Clean Docker build cache
docker builder prune -a -f

# Rebuild with no cache
docker compose build --no-cache

# Check Dockerfile syntax
docker compose config
```
Error: "sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL: password authentication failed"
Solution:
```bash
# Verify credentials in .env
cat .env | grep -E "(POSTGRES_USER|POSTGRES_PASSWORD)"

# Reset password
docker compose exec postgres psql -U postgres -c \
  "ALTER USER voiceassist WITH PASSWORD 'new_password';"

# Update .env
vim .env

# Restart application
docker compose restart voiceassist-server
```
Error: "redis.exceptions.ConnectionError: Error connecting to redis"
Solution:
```bash
# Check Redis is running
docker compose ps redis

# Check Redis URL in .env
cat .env | grep REDIS_URL

# Test connection
docker compose exec redis redis-cli ping

# Restart Redis and app
docker compose restart redis voiceassist-server
```
Performance Tuning Quick Wins
```bash
# 1. Add database indexes
docker compose exec postgres psql -U voiceassist -d voiceassist -f - <<EOF
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_conversations_user_id ON conversations(user_id);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_messages_conversation_id ON messages(conversation_id);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_messages_created_at ON messages(created_at DESC);
ANALYZE;
EOF

# 2. Increase connection pool
echo "DB_POOL_SIZE=30" >> .env
echo "DB_MAX_OVERFLOW=10" >> .env

# 3. Enable Redis caching
echo "CACHE_ENABLED=true" >> .env
echo "CACHE_TTL=300" >> .env

# 4. Increase worker count
# For 4 CPU cores: workers = (2 x 4) + 1 = 9
echo "GUNICORN_WORKERS=9" >> .env

# 5. Optimize PostgreSQL settings
# See SCALING.md for detailed configuration

# Restart to apply changes
docker compose restart
```
Related Documentation
- Incident Response Runbook
- Deployment Runbook
- Monitoring Runbook
- Scaling Runbook
- Backup & Restore Runbook
- UNIFIED_ARCHITECTURE.md
- CONNECTION_POOL_OPTIMIZATION.md
Document Version: 1.0
Last Updated: 2025-11-21
Maintained By: VoiceAssist DevOps Team
Review Cycle: Monthly or after each major incident
Next Review: 2025-12-21
Docs Site Deployment and TLS Runbook
Last Updated: 2025-11-27
URL: https://assistdocs.asimo.io
Document Root: /var/www/assistdocs.asimo.io
Quick Deployment Checklist
```bash
# 1. Navigate to repo
cd ~/VoiceAssist

# 2. Pull latest changes
git pull origin main

# 3. Install dependencies (if needed)
pnpm install

# 4. Navigate to docs-site
cd apps/docs-site

# 5. Validate metadata and links
pnpm validate:metadata
pnpm check:links

# 6. Generate agent JSON (if docs changed)
pnpm generate-agent-json

# 7. Build the static site
pnpm build

# 8. Deploy to Apache document root
sudo rm -rf /var/www/assistdocs.asimo.io/*
sudo cp -r out/* /var/www/assistdocs.asimo.io/

# 9. Verify deployment
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/agent/index.json
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/agent/docs.json
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/search-index.json
```
Architecture Overview
Build Process
docs/*.md → Next.js static export
apps/docs-site/ → Build artifacts in out/
scripts/generate-agent-json → public/agent/*.json
→ search-index.json
Deployment Architecture
┌──────────────────────────────────────────────────┐
│ Apache2 (mod_ssl, mod_rewrite) │
│ - assistdocs.asimo.io-le-ssl.conf │
│ - DocumentRoot: /var/www/assistdocs.asimo.io │
│ - RewriteEngine for clean URLs │
└──────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ Static Files │
│ - /*.html (Next.js pages) │
│ - /agent/*.json (AI agent endpoints) │
│ - /search-index.json (Fuse.js) │
│ - /sitemap.xml │
└──────────────────────────────────────────────────┘
Step 1: Prepare for Deployment
1.1 Sync Repository
```bash
cd ~/VoiceAssist
git pull origin main
git status  # Verify clean state
```
1.2 Install Dependencies
```bash
# Root level (pnpm workspace)
pnpm install

# Verify docs-site dependencies
cd apps/docs-site
ls node_modules/.bin/next  # Should exist
```
Step 2: Validate Documentation
2.1 Metadata Validation
```bash
cd ~/VoiceAssist/apps/docs-site
pnpm validate:metadata
```
Expected Output: No errors about missing or invalid frontmatter.
2.2 Link Validation
pnpm check:links
Expected Output: All internal links resolve correctly.
2.3 Fix Common Issues
Missing frontmatter:
```yaml
---
title: "Document Title"
slug: "path/to-document"
summary: "Brief description"
status: stable
stability: production
owner: team
lastUpdated: "YYYY-MM-DD"
audience: ["human", "agent"]
tags: ["tag1", "tag2"]
category: category-name
---
```
Broken links: Update markdown links to use relative paths from docs/ directory.
Step 3: Generate Agent JSON
The agent JSON files provide machine-readable access to documentation.
3.1 Run Generation Script
```bash
cd ~/VoiceAssist/apps/docs-site
pnpm generate-agent-json
```
3.2 Verify Output
```bash
# Check index.json
cat public/agent/index.json | jq '.name'
# Should output: "VoiceAssist Documentation"

# Check docs.json count
cat public/agent/docs.json | jq 'length'
# Should output document count (e.g., 220+)

# Check search index
ls -la public/search-index.json
```
Step 4: Build Static Site
4.1 Run Build
```bash
cd ~/VoiceAssist/apps/docs-site
pnpm build
```
Expected Output:
- `✓ Compiled successfully`
- `Export successful`
- Files in the `out/` directory
4.2 Verify Build Output
```bash
ls out/
# Should contain: index.html, ai/, docs/, agent/, search-index.json, sitemap.xml

ls out/agent/
# Should contain: index.json, docs.json, schema.json
```
Step 5: Deploy to Apache
5.1 Clear Old Files
sudo rm -rf /var/www/assistdocs.asimo.io/*
5.2 Copy New Build
sudo cp -r ~/VoiceAssist/apps/docs-site/out/* /var/www/assistdocs.asimo.io/
5.3 Set Permissions
```bash
sudo chown -R www-data:www-data /var/www/assistdocs.asimo.io
sudo chmod -R 755 /var/www/assistdocs.asimo.io
```
5.4 Reload Apache (if config changed)
```bash
sudo apache2ctl configtest
sudo systemctl reload apache2
```
Step 6: Verify Deployment
6.1 Check HTTP Status
```bash
# Main page
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/

# AI agent endpoints
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/agent/index.json
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/agent/docs.json
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/search-index.json

# Clean URLs (should return 200, not 404)
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/ai/onboarding
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/ai/status
```
Expected: All should return 200.
6.2 Check Content
```bash
# Verify agent JSON content
curl -s https://assistdocs.asimo.io/agent/index.json | jq '.endpoints'

# Verify sitemap
curl -s https://assistdocs.asimo.io/sitemap.xml | head -20
```
TLS Certificate Management
Current Certificate Status
sudo certbot certificates | grep -A 5 "assistdocs.asimo.io"
Current Certificate:
- Domain: assistdocs.asimo.io
- Issuer: Let's Encrypt
- Key Type: ECDSA
- Certificate Path: /etc/letsencrypt/live/assistdocs.asimo.io/fullchain.pem
- Private Key Path: /etc/letsencrypt/live/assistdocs.asimo.io/privkey.pem
- Expiry: 2026-02-19 (auto-renewed)
Automatic Renewal
Certbot automatically renews certificates via systemd timer.
```bash
# Check timer status
sudo systemctl status certbot.timer

# View renewal schedule
sudo systemctl list-timers | grep certbot

# Test renewal (dry run)
sudo certbot renew --dry-run
```
Manual Renewal (if needed)
```bash
# Renew specific certificate
sudo certbot renew --cert-name assistdocs.asimo.io

# Force renewal
sudo certbot renew --cert-name assistdocs.asimo.io --force-renewal

# Reload Apache after renewal
sudo systemctl reload apache2
```
New Certificate (if domain changes)
sudo certbot --apache -d assistdocs.asimo.io
Apache Configuration
Configuration File
Location: /etc/apache2/sites-available/assistdocs.asimo.io-le-ssl.conf
Key Configuration
```apache
<VirtualHost *:443>
    ServerName assistdocs.asimo.io
    DocumentRoot /var/www/assistdocs.asimo.io

    <Directory /var/www/assistdocs.asimo.io>
        Options Indexes FollowSymLinks
        AllowOverride All
        Require all granted
        DirectoryIndex index.html

        # Clean URLs for Next.js static export
        RewriteEngine On
        RewriteCond %{REQUEST_FILENAME} !-f
        RewriteCond %{REQUEST_FILENAME} !-d
        RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI}.html -f
        RewriteRule ^(.*)$ $1.html [L]
    </Directory>

    # SSL (managed by Certbot)
    SSLEngine on
    SSLCertificateFile /etc/letsencrypt/live/assistdocs.asimo.io/fullchain.pem
    SSLCertificateKeyFile /etc/letsencrypt/live/assistdocs.asimo.io/privkey.pem
</VirtualHost>
```
Test Configuration
sudo apache2ctl configtest
Reload After Changes
sudo systemctl reload apache2
Troubleshooting
404 for Clean URLs
Symptom: /ai/onboarding returns 404 but /ai/onboarding.html works.
Cause: RewriteEngine rules not applied.
Fix:
- Ensure `mod_rewrite` is enabled: `sudo a2enmod rewrite`
- Verify the rewrite rules are inside the `<Directory>` block
- Reload Apache: `sudo systemctl reload apache2`
Build Fails
Symptom: pnpm build fails with errors.
Checks:
```bash
# Check for TypeScript errors
pnpm tsc --noEmit

# Check for missing dependencies
pnpm install

# Clear cache
rm -rf .next out
pnpm build
```
Agent JSON Not Updated
Symptom: /agent/docs.json shows old documents.
Fix:
```bash
# Regenerate agent JSON
pnpm generate-agent-json

# Rebuild and redeploy
pnpm build
sudo cp -r out/* /var/www/assistdocs.asimo.io/
```
TLS Certificate Expired
Symptom: Browser shows certificate error.
Fix:
```bash
# Check certificate status
sudo certbot certificates

# Force renewal
sudo certbot renew --cert-name assistdocs.asimo.io --force-renewal

# Reload Apache
sudo systemctl reload apache2
```