Operations Overview
Last Updated: 2025-11-27
This document provides a central hub for all operations-related documentation for VoiceAssist.
Quick Links
| Category | Document | Purpose |
|---|---|---|
| SLOs | SLO Definitions | Reliability targets and error budgets |
| Metrics | Business Metrics | Key performance indicators |
| Performance | Connection Pool Optimization | Database connection tuning |
Runbooks
All runbooks follow a standardized format with severity levels, step-by-step procedures, and verification steps.
| Runbook | Purpose | Primary Audience |
|---|---|---|
| Deployment | Deploy VoiceAssist to production | DevOps, Backend |
| Monitoring | Set up and manage observability stack | DevOps |
| Troubleshooting | Diagnose and fix common issues | DevOps, Backend |
| Incident Response | Handle production incidents | On-call, DevOps |
| Backup & Restore | Data backup and recovery procedures | DevOps |
| Scaling | Scale infrastructure for load | DevOps, Backend |
Compliance
| Document | Purpose |
|---|---|
| Analytics Data Policy | Data handling for analytics |
For HIPAA compliance, see Security & Compliance.
Incident Severity Levels
| Severity | Description | Response Time |
|---|---|---|
| P1 - Critical | Complete service outage, data loss risk | 15 minutes |
| P2 - High | Major feature broken, significant degradation | 1 hour |
| P3 - Medium | Minor feature broken, degraded performance | 4 hours |
| P4 - Low | Cosmetic issues, minimal impact | 24 hours |
Key SLOs
| Metric | Target | Measurement Window |
|---|---|---|
| API Availability | 99.9% | 30 days |
| Success Rate | 99.5% | 30 days |
| P95 Latency | < 200ms | 30 days |
| Error Rate | < 0.5% | 30 days |
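These targets can be spot-checked from the command line between full SLO reviews. The sketch below derives a rough error rate from the Prometheus-style `/metrics` endpoint referenced elsewhere in this document; the counter name `http_requests_total` and its `status` label are assumptions and may need adjusting to the metrics VoiceAssist actually exports.

```bash
#!/bin/bash
# Rough SLO spot-check (sketch). Assumes Prometheus-style counters such as
#   http_requests_total{status="200"} 1234
# Adjust the metric name / label to match the real exporter.
METRICS_URL="${METRICS_URL:-http://localhost:8000/metrics}"

# Sum all request counts and all 5xx counts since process start
TOTAL=$(curl -s "$METRICS_URL" | awk '/^http_requests_total\{/ {sum += $2} END {printf "%d", sum}')
ERRORS=$(curl -s "$METRICS_URL" | awk '/^http_requests_total\{.*status="5[0-9][0-9]"/ {sum += $2} END {printf "%d", sum}')

if [ "$TOTAL" -eq 0 ]; then
  echo "No request samples found - check the metric name"
  exit 1
fi

# Integer math in hundredths of a percent (avoids a bc dependency)
ERROR_PCT=$((ERRORS * 10000 / TOTAL))
printf "Requests: %s, 5xx errors: %s (approx %d.%02d%% error rate)\n" \
  "$TOTAL" "$ERRORS" $((ERROR_PCT / 100)) $((ERROR_PCT % 100))

# SLO: error rate < 0.5% (i.e. < 50 hundredths of a percent)
if [ "$ERROR_PCT" -lt 50 ]; then
  echo "Within error-rate SLO (< 0.5%)"
else
  echo "ERROR BUDGET AT RISK: error rate exceeds 0.5%"
fi
```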
On-Call Essentials
Quick Diagnostic Commands
```bash
# Check service health
curl http://localhost:8000/health
curl http://localhost:8000/ready

# Check all containers
docker compose ps

# View recent logs
docker compose logs --tail=100 voiceassist-server

# Check database
docker compose exec postgres psql -U voiceassist -c "SELECT 1"

# Check Redis
docker compose exec redis redis-cli ping
```
Escalation Path
- L1 Support: Check health endpoints, restart services
- L2 DevOps: Investigate logs, check metrics, apply standard fixes
- L3 Engineering: Deep debugging, code-level investigation
- Management: Major incidents requiring business decisions
Related Documentation
- Unified Architecture - System architecture
- Backend Architecture - Backend details
- Security & Compliance - HIPAA compliance
- Implementation Status - Component status
Version History
| Date | Version | Changes |
|---|---|---|
| 2025-11-27 | 1.0.0 | Initial operations overview |
Deployment Runbook
Last Updated: 2025-11-27
Purpose: Step-by-step guide for deploying VoiceAssist V2
Pre-Deployment Checklist
- All tests passing in CI/CD
- Code reviewed and approved
- Database migrations reviewed
- Breaking changes documented
- Rollback plan documented
- Stakeholders notified
- Maintenance window scheduled (if required)
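Most of these items are manual, but a few can be verified from the shell before starting. The gate below is a minimal sketch under the compose setup used throughout this runbook; CI status, reviews, and stakeholder notification stay manual.

```bash
#!/bin/bash
# Minimal pre-deployment gate (sketch): covers only the checklist items
# that can be verified from the shell.
set -e

echo "=== Uncommitted changes (should be empty) ==="
git status --porcelain

echo "=== Incoming changes ==="
git fetch origin
git log --oneline HEAD..origin/main | head -20

echo "=== Migration state ==="
docker compose run --rm voiceassist-server alembic current
docker compose run --rm voiceassist-server alembic heads

echo "=== Baseline health before deploy ==="
curl -sf http://localhost:8000/health | jq '.' || echo "WARNING: service unhealthy before deploy"
```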
Deployment Steps
1. Pre-Deployment Verification
```bash
# Check current system health
curl http://localhost:8000/health
curl http://localhost:8000/ready

# Verify all containers running
docker compose ps

# Check database connection
docker compose exec postgres psql -U voiceassist -d voiceassist -c "SELECT version();"

# Check Redis
docker compose exec redis redis-cli ping

# Check Qdrant
curl http://localhost:6333/collections
```
2. Backup Current State
```bash
# Backup database
docker compose exec postgres pg_dump -U voiceassist voiceassist > backup_$(date +%Y%m%d_%H%M%S).sql

# Backup environment configuration
cp .env .env.backup_$(date +%Y%m%d_%H%M%S)

# Tag current Docker images
docker tag voiceassist-voiceassist-server:latest voiceassist-voiceassist-server:pre-deploy-$(date +%Y%m%d_%H%M%S)
```
3. Pull Latest Code
```bash
# Fetch latest changes
git fetch origin

# Check what's changing
git log --oneline HEAD..origin/main

# Pull changes
git pull origin main

# Verify correct branch
git branch --show-current
git log -1 --oneline
```
4. Update Environment Configuration
```bash
# Review .env changes
diff .env.example .env

# Update .env if needed
vim .env

# Validate configuration (count non-empty, non-comment lines)
grep -v '^#' .env | grep -v '^$' | wc -l
```
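Counting non-empty lines only confirms the file is not empty. A slightly stronger check, sketched below, compares variable names against `.env.example` so newly introduced settings are not missed; it assumes both files use simple `KEY=value` lines.

```bash
# List variables present in .env.example but missing from .env (sketch)
comm -23 \
  <(grep -v '^#' .env.example | grep '=' | cut -d= -f1 | sort -u) \
  <(grep -v '^#' .env         | grep '=' | cut -d= -f1 | sort -u)
# Any names printed above need to be added to .env before deploying
```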
5. Run Database Migrations
```bash
# Check current migration status
docker compose run --rm voiceassist-server alembic current

# Review pending migrations
docker compose run --rm voiceassist-server alembic history

# Run migrations
docker compose run --rm voiceassist-server alembic upgrade head

# Verify migration success
docker compose run --rm voiceassist-server alembic current
```
6. Build New Images
```bash
# Build updated images
docker compose build voiceassist-server

# Verify image built
docker images | grep voiceassist-server

# Check image size (compare against the previous build; a large jump may indicate a build problem)
docker images voiceassist-voiceassist-server:latest --format "{{.Size}}"
```
7. Deploy Services
```bash
# Deploy with zero-downtime (recreate containers)
docker compose up -d voiceassist-server

# Watch logs for startup
docker compose logs -f voiceassist-server

# Wait for healthcheck
sleep 10
```
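A fixed `sleep` can declare success before the container is actually healthy. The optional snippet below polls the health endpoint for up to ~60 seconds instead, using only the endpoints already referenced in this runbook.

```bash
# Poll the health endpoint instead of sleeping a fixed 10 seconds (optional)
for i in $(seq 1 30); do
  if curl -sf http://localhost:8000/health > /dev/null; then
    echo "Service healthy after ~$((i * 2))s"
    break
  fi
  echo "Waiting for service to become healthy... ($i/30)"
  sleep 2
done
```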
8. Post-Deployment Verification
```bash
# Check health endpoint
curl http://localhost:8000/health

# Check readiness
curl http://localhost:8000/ready

# Verify version
curl http://localhost:8000/health | jq '.version'

# Check all containers running
docker compose ps

# Check logs for errors
docker compose logs --tail=100 voiceassist-server | grep -i error

# Verify metrics endpoint
curl http://localhost:8000/metrics | head -20

# Test a sample API endpoint (requires auth)
# curl -H "Authorization: Bearer $TOKEN" http://localhost:8000/api/users/me
```
9. Smoke Tests
```bash
# Test authentication
curl -X POST http://localhost:8000/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@example.com","password":"password"}' | jq '.'

# Test database connectivity
docker compose exec postgres psql -U voiceassist -d voiceassist -c "SELECT COUNT(*) FROM users;"

# Test Redis
docker compose exec redis redis-cli --raw incr deployment_test

# Test Qdrant
curl http://localhost:6333/collections
```
10. Monitor Initial Traffic
```bash
# Watch logs for first 5 minutes
docker compose logs -f --tail=100 voiceassist-server

# Monitor metrics
watch -n 5 'curl -s http://localhost:8000/metrics | grep -E "(http_requests_total|http_request_duration)"'

# Check error rate
docker compose logs --since 5m voiceassist-server | grep -i error | wc -l
```
Rollback Procedure
If deployment fails, follow these steps:
Quick Rollback (Image-Based)
```bash
# Stop current containers
docker compose stop voiceassist-server

# Revert to previous image
PREVIOUS_TAG="pre-deploy-YYYYMMDD_HHMMSS"  # From backup step
docker tag voiceassist-voiceassist-server:$PREVIOUS_TAG voiceassist-voiceassist-server:latest

# Start previous version
docker compose up -d voiceassist-server

# Verify rollback
curl http://localhost:8000/health | jq '.version'
```
Full Rollback (Code + Database)
```bash
# Stop services
docker compose stop voiceassist-server

# Revert code
git log -1 --oneline   # Note current commit
git checkout HEAD~1    # Or specific commit hash

# Rollback database from backup
BACKUP_FILE="backup_YYYYMMDD_HHMMSS.sql"
docker compose exec -T postgres psql -U voiceassist voiceassist < $BACKUP_FILE

# Rebuild image
docker compose build voiceassist-server

# Start services
docker compose up -d voiceassist-server

# Verify rollback
curl http://localhost:8000/health
```
Deployment Checklist
Post-Deployment:
- Health endpoint returning 200
- Readiness endpoint returning 200
- No error logs in last 5 minutes
- Metrics endpoint accessible
- Database migrations applied
- All containers running
- Sample API requests successful
- Version number updated
- Stakeholders notified of completion
- Documentation updated (if needed)
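Where possible, run the automatable items above as a single script after every deploy. The sketch below covers only those items; the expected version string is an assumption and should be set to the release being deployed.

```bash
#!/bin/bash
# Post-deployment checklist sketch - automatable items only.
EXPECTED_VERSION="${EXPECTED_VERSION:-2.0.0}"   # assumption: set to the release being deployed
FAIL=0

check() {  # usage: check <description> <command...>
  DESC="$1"; shift
  if "$@" > /dev/null 2>&1; then
    echo "OK  $DESC"
  else
    echo "FAIL $DESC"
    FAIL=1
  fi
}

check "Health endpoint returns 200"    curl -sf http://localhost:8000/health
check "Readiness endpoint returns 200" curl -sf http://localhost:8000/ready
check "Metrics endpoint accessible"    curl -sf http://localhost:8000/metrics

DEPLOYED=$(curl -sf http://localhost:8000/health | jq -r '.version')
if [ "$DEPLOYED" = "$EXPECTED_VERSION" ]; then
  echo "OK  Version is $DEPLOYED"
else
  echo "FAIL Version is $DEPLOYED (expected $EXPECTED_VERSION)"
  FAIL=1
fi

ERRORS=$(docker compose logs --since 5m voiceassist-server 2>/dev/null | grep -ci error)
[ "$ERRORS" -eq 0 ] && echo "OK  No errors in last 5 minutes" || echo "WARN $ERRORS error lines in last 5 minutes"

exit $FAIL
```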
Common Issues & Solutions
Issue: Database Migration Fails
Symptoms: Migration command returns error
Solution:
```bash
# Check current state
docker compose run --rm voiceassist-server alembic current

# Manually review SQL
docker compose run --rm voiceassist-server alembic show <revision>

# If safe, downgrade one step
docker compose run --rm voiceassist-server alembic downgrade -1

# Fix issue and retry
docker compose run --rm voiceassist-server alembic upgrade head
```
Issue: Container Won't Start
Symptoms: Container crashes immediately or fails healthcheck
Solution:
```bash
# Check logs
docker compose logs --tail=50 voiceassist-server

# Check container exit code
docker compose ps -a voiceassist-server

# Verify environment variables
docker compose config | grep -A 20 voiceassist-server

# Test dependencies
docker compose exec postgres pg_isready
docker compose exec redis redis-cli ping
```
Issue: High Error Rate After Deployment
Symptoms: Increased 5xx errors in logs/metrics
Solution:
```bash
# Check error logs
docker compose logs voiceassist-server | grep -i error

# Check database connections
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

# Check Redis memory
docker compose exec redis redis-cli INFO memory | grep used_memory_human

# Rollback if errors > 5% of traffic
```
Emergency Contacts
- On-Call Engineer: Check PagerDuty
- Database Admin: DBA on-call rotation
- DevOps Lead: ops-team@voiceassist.local
- Product Owner: product@voiceassist.local
Related Documentation
- UNIFIED_ARCHITECTURE.md
- CONNECTION_POOL_OPTIMIZATION.md
- Incident Response Runbook
- Backup & Restore Runbook
Document Version: 1.0
Last Updated: 2025-11-21
Maintained By: VoiceAssist DevOps Team
Review Cycle: After each major deployment or quarterly
Incident Response Runbook
Last Updated: 2025-11-27
Purpose: Comprehensive guide for handling incidents in VoiceAssist V2
Incident Severity Levels
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| P1 - Critical | Complete service outage, data loss risk | 15 minutes | Database down, complete API failure, security breach |
| P2 - High | Major feature broken, significant performance degradation | 1 hour | Authentication failing, voice processing unavailable |
| P3 - Medium | Minor feature broken, degraded performance | 4 hours | Specific API endpoint failing, slow response times |
| P4 - Low | Cosmetic issues, minimal impact | 24 hours | UI glitches, non-critical warnings in logs |
Initial Response Procedure
1. Incident Detection
```bash
# Check system health
curl -s http://localhost:8000/health | jq '.'

# Expected output:
# {
#   "status": "healthy",
#   "version": "2.0.0",
#   "timestamp": "2025-11-21T..."
# }

# Check all services
docker compose ps

# Check recent error logs
docker compose logs --since 10m voiceassist-server | grep -i error

# Check metrics for anomalies
curl -s http://localhost:8000/metrics | grep -E "(error|failure)"
```
2. Immediate Triage (First 5 Minutes)
Checklist:
- Acknowledge the incident (update status page if available)
- Determine severity level using table above
- Notify on-call engineer if P1/P2
- Create incident tracking ticket/document
- Join incident response channel (Slack/Teams)
```bash
# Quick system overview
echo "=== System Status ==="
docker compose ps
echo ""

echo "=== Error Count (Last 10 min) ==="
docker compose logs --since 10m | grep -i error | wc -l
echo ""

echo "=== Active Database Connections ==="
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
echo ""

echo "=== Redis Memory ==="
docker compose exec redis redis-cli INFO memory | grep used_memory_human
echo ""

echo "=== Disk Usage ==="
df -h
```
3. Assess Impact
```bash
# Check request success rate
docker compose logs --since 15m voiceassist-server | \
  grep -oE "status=[0-9]+" | sort | uniq -c

# Check database connectivity
docker compose exec postgres pg_isready
docker compose exec postgres psql -U voiceassist -d voiceassist -c "SELECT 1;"

# Check Redis connectivity
docker compose exec redis redis-cli ping

# Check Qdrant connectivity
curl -s http://localhost:6333/healthz

# Check network connectivity
docker compose exec voiceassist-server ping -c 3 postgres
docker compose exec voiceassist-server ping -c 3 redis
docker compose exec voiceassist-server ping -c 3 qdrant
```
Incident Response by Severity
P1 - Critical Incident Response
Timeline: 0-15 minutes
- Immediate Actions:
- Page on-call engineer
- Notify management
- Update status page: "Investigating outage"
- Join war room/incident call
- Rapid Assessment:
```bash
# Check if complete outage
curl -s http://localhost:8000/health || echo "COMPLETE OUTAGE"

# Check all infrastructure
docker compose ps -a

# Check for recent deployments
git log -5 --oneline --since="2 hours ago"

# Check system resources
docker stats --no-stream

# Check disk space (common cause)
df -h
du -sh /var/lib/docker
```
- Emergency Mitigation:
```bash
# Option 1: Restart all services
docker compose restart

# Option 2: Rollback recent deployment (if within 2 hours)
git log -1 --oneline   # Current version
git checkout HEAD~1    # Previous version
docker compose build voiceassist-server
docker compose up -d voiceassist-server

# Option 3: Scale up resources (if performance issue)
docker compose up -d --scale voiceassist-server=3

# Option 4: Enable maintenance mode
# Create maintenance mode flag
touch /tmp/maintenance_mode
docker compose exec voiceassist-server touch /app/maintenance_mode
```
- Communication Template (P1):
Subject: [P1 INCIDENT] VoiceAssist Service Outage
Status: INVESTIGATING
Start Time: [TIME]
Impact: Complete service unavailable
Affected Users: All users
Incident Commander: [NAME]
Current Actions:
- Identified root cause as [X]
- Attempting mitigation via [Y]
- ETR: [TIME] (or "investigating")
Next Update: [TIME] (within 15 minutes)
P2 - High Severity Response
Timeline: 0-60 minutes
- Assessment (First 15 minutes):
```bash
# Identify affected component
docker compose logs --since 30m voiceassist-server | grep -i error | tail -50

# Check specific service health
curl -s http://localhost:8000/ready | jq '.'

# Check database performance
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT pid, usename, application_name, state, query_start, wait_event_type, query
   FROM pg_stat_activity
   WHERE state != 'idle'
   ORDER BY query_start DESC LIMIT 20;"

# Check slow queries (requires the pg_stat_statements extension;
# on PostgreSQL 13+ the columns are total_exec_time / mean_exec_time)
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT query, calls, total_time, mean_time, max_time
   FROM pg_stat_statements
   ORDER BY mean_time DESC LIMIT 10;"
```
- Mitigation Actions:
- Isolate affected component
- Enable fallback mechanisms
- Scale affected service
- Update monitoring thresholds
- Communication Template (P2):
Subject: [P2 INCIDENT] VoiceAssist Degraded Performance
Status: MITIGATING
Start Time: [TIME]
Impact: [Specific feature] unavailable/degraded
Affected Users: [Percentage or specific user group]
Incident Commander: [NAME]
Timeline:
- [TIME]: Issue detected
- [TIME]: Root cause identified
- [TIME]: Mitigation in progress
Root Cause: [Brief description]
Mitigation: [Actions being taken]
ETR: [TIME]
Next Update: [TIME] (within 30 minutes)
P3 - Medium Severity Response
Timeline: 0-4 hours
- Standard Investigation:
```bash
# Detailed log analysis
docker compose logs --since 1h voiceassist-server | grep -A 5 -B 5 "error"

# Check resource utilization trends
docker stats --no-stream

# Review recent changes
git log --since="24 hours ago" --oneline

# Check configuration
docker compose config | grep -A 10 voiceassist-server
```
- Documented Fix Process:
- Create issue in tracking system
- Assign to appropriate team
- Document reproduction steps
- Implement fix
- Test in staging (if available)
- Deploy fix
- Verify resolution
P4 - Low Severity Response
Standard ticket workflow - no immediate response required
Escalation Paths
When to Escalate
Escalate Immediately If:
- Unable to identify root cause within 30 minutes (P1) or 2 hours (P2)
- Mitigation attempts unsuccessful
- Data loss suspected
- Security breach suspected
- Multiple systems affected
- Customer data at risk
Escalation Chain
L1 - On-Call Engineer
↓ (30 min for P1, 2 hrs for P2)
L2 - Team Lead
↓ (1 hr for P1, 4 hrs for P2)
L3 - Engineering Manager
↓ (2 hrs for P1)
L4 - CTO / VP Engineering
Escalation Command Script
# Document current state before escalating cat > /tmp/escalation_report_$(date +%Y%m%d_%H%M%S).txt <<EOF ESCALATION REPORT ================= Time: $(date) Severity: P1/P2/P3/P4 Duration: [X hours] Impact: [Description] Current System State: $(docker compose ps) Recent Errors: $(docker compose logs --since 30m voiceassist-server | grep -i error | tail -20) Actions Attempted: - [List all mitigation attempts] - [Include results of each attempt] Reason for Escalation: [Clear explanation of why escalating] Additional Context: [Any other relevant information] EOF cat /tmp/escalation_report_$(date +%Y%m%d_%H%M%S).txt
Common Incident Types
Database Connection Issues
Symptoms:
- "Connection pool exhausted" errors
- "Too many connections" errors
- Slow response times
Investigation:
```bash
# Check connection pool status
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

# Check max connections
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SHOW max_connections;"

# Check current connections
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT datname, usename, application_name, count(*)
   FROM pg_stat_activity GROUP BY datname, usename, application_name;"

# Kill idle connections
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
   WHERE state = 'idle' AND state_change < current_timestamp - INTERVAL '10 minutes';"
```
Resolution:
```bash
# Restart application to reset connection pool
docker compose restart voiceassist-server

# Temporarily increase connection pool
docker compose exec voiceassist-server sh -c \
  "export DB_POOL_SIZE=30 && supervisorctl restart all"

# Long-term: Update docker-compose.yml or .env
echo "DB_POOL_SIZE=30" >> .env
docker compose up -d voiceassist-server
```
Memory/Resource Exhaustion
Symptoms:
- Container restarts
- OOMKilled status
- Slow performance
Investigation:
```bash
# Check container memory usage
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"

# Check for OOMKilled containers
docker inspect voiceassist-voiceassist-server-1 | grep OOMKilled

# Check system memory
free -h

# Check Redis memory
docker compose exec redis redis-cli INFO memory
```
Resolution:
```bash
# Increase memory limits in docker-compose.yml
# Edit docker-compose.yml to increase mem_limit

# Clear Redis cache if needed
docker compose exec redis redis-cli FLUSHDB

# Restart affected container
docker compose restart voiceassist-server

# Monitor memory after restart
watch -n 5 'docker stats --no-stream | grep voiceassist-server'
```
API Performance Degradation
Symptoms:
- Slow response times
- Timeout errors
- High request queue
Investigation:
```bash
# Check response times in metrics
curl -s http://localhost:8000/metrics | grep http_request_duration

# Check slow queries
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT pid, now() - query_start as duration, query
   FROM pg_stat_activity
   WHERE state != 'idle' AND now() - query_start > interval '5 seconds'
   ORDER BY duration DESC;"

# Check for locks
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT * FROM pg_locks WHERE NOT granted;"

# Check CPU usage
docker stats --no-stream
```
Resolution:
```bash
# Scale horizontally if needed
docker compose up -d --scale voiceassist-server=3

# Kill slow queries
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
   WHERE state != 'idle' AND now() - query_start > interval '30 seconds';"

# Set the Redis eviction policy so least-recently-used keys are evicted under memory pressure
docker compose exec redis redis-cli CONFIG SET maxmemory-policy allkeys-lru
```
Security Incidents
Symptoms:
- Unusual traffic patterns
- Unauthorized access attempts
- Data breach alerts
IMMEDIATE ACTIONS:
```bash
# 1. DO NOT DESTROY EVIDENCE
# 2. Document everything
# 3. Isolate affected systems

# Stop accepting new connections (if breach confirmed)
docker compose exec voiceassist-server iptables -A INPUT -p tcp --dport 8000 -j DROP

# Capture current state
docker compose logs > /tmp/security_incident_logs_$(date +%Y%m%d_%H%M%S).txt
docker compose exec postgres pg_dump -U voiceassist voiceassist > \
  /tmp/security_incident_db_$(date +%Y%m%d_%H%M%S).sql

# Check for suspicious activity
docker compose logs voiceassist-server | grep -E "401|403|429" | tail -100

# Check database for unauthorized access
docker compose exec postgres psql -U voiceassist -d voiceassist -c \
  "SELECT * FROM user_sessions WHERE created_at > NOW() - INTERVAL '1 hour' ORDER BY created_at DESC;"

# Rotate credentials IMMEDIATELY
# Generate new secrets
openssl rand -base64 32 > /tmp/new_secret_key.txt
# Update .env with new credentials

# Force logout all users
docker compose exec redis redis-cli FLUSHALL
```
ESCALATION: Security incidents ALWAYS require immediate escalation to the security team.
Post-Incident Activities
Immediate Post-Incident (Within 1 Hour)
Checklist:
- Verify incident fully resolved
- Update status page to "Resolved"
- Send final communication to stakeholders
- Document timeline in incident ticket
- Schedule post-mortem meeting (within 48 hours for P1/P2)
# Verification script echo "=== Post-Incident Verification ===" echo "Health Check:" curl -s http://localhost:8000/health | jq '.' echo "" echo "Error Rate (Last 30 min):" docker compose logs --since 30m voiceassist-server | grep -i error | wc -l echo "" echo "Container Status:" docker compose ps echo "" echo "Database Connections:" docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
Post-Mortem Process
Post-Mortem Template:
# Post-Mortem: [Incident Title] ## Incident Details - **Date**: YYYY-MM-DD - **Duration**: X hours Y minutes - **Severity**: P1/P2/P3/P4 - **Incident Commander**: [Name] - **Participants**: [Names] ## Impact - **Users Affected**: [Number or percentage] - **Services Affected**: [List] - **Financial Impact**: [If applicable] - **Data Loss**: None / [Description] ## Timeline | Time | Event | | ----- | --------------------------- | | HH:MM | Incident began | | HH:MM | Detected by [person/system] | | HH:MM | Initial response started | | HH:MM | Root cause identified | | HH:MM | Mitigation deployed | | HH:MM | Incident resolved | ## Root Cause [Detailed explanation of what caused the incident] ## What Went Well - [Things that worked during response] - [Effective tools/processes] ## What Went Wrong - [Issues encountered during response] - [Gaps in tooling/process] ## Action Items | Action | Owner | Due Date | Priority | | ------------------------ | ------ | -------- | -------- | | [Preventive measure] | [Name] | [Date] | P1/P2/P3 | | [Monitoring improvement] | [Name] | [Date] | P1/P2/P3 | | [Documentation update] | [Name] | [Date] | P1/P2/P3 | ## Lessons Learned - [Key takeaway 1] - [Key takeaway 2] - [Key takeaway 3]
Post-Mortem Meeting Agenda
- Review Timeline (10 minutes)
- Walk through incident from detection to resolution
- No blame, focus on facts
- Root Cause Analysis (15 minutes)
- Technical deep-dive
- Use "5 Whys" technique
- Impact Assessment (10 minutes)
- User impact
- Business impact
- Reputation impact
- Prevention Discussion (20 minutes)
- How to prevent recurrence
- Monitoring improvements
- Process improvements
- Action Items (5 minutes)
- Assign owners and due dates
- Set follow-up meeting
Communication Templates
Initial Notification (P1/P2)
Subject: [P1/P2] VoiceAssist Service Issue - [Brief Description]
Dear Team,
We are currently experiencing [issue description] affecting [scope of impact].
Status: INVESTIGATING
Start Time: [TIME]
Severity: P1/P2
Impact: [Description]
Affected Systems: [List]
Incident Commander: [NAME]
We are actively working to resolve this issue and will provide updates
every [15 minutes for P1, 30 minutes for P2].
Next Update: [TIME]
VoiceAssist Operations Team
Status Update (During Incident)
Subject: [UPDATE - P1/P2] VoiceAssist Service Issue - [Brief Description]
Update #[N] - [TIME]
Current Status: [INVESTIGATING/IDENTIFIED/MITIGATING/RESOLVED]
Progress:
- [What we've learned]
- [What we've tried]
- [Current approach]
Impact Update: [Any changes to scope]
Next Steps:
- [Action 1]
- [Action 2]
ETR: [Estimated Time to Resolution or "investigating"]
Next Update: [TIME]
VoiceAssist Operations Team
Resolution Notification
Subject: [RESOLVED - P1/P2] VoiceAssist Service Issue - [Brief Description]
Status: RESOLVED
Resolution Time: [TIME]
Total Duration: [X hours Y minutes]
The issue affecting [description] has been fully resolved.
Root Cause: [Brief explanation]
Resolution: [What was done to fix it]
Impact Summary:
- Users Affected: [Number/Percentage]
- Duration: [X hours Y minutes]
- Data Loss: None / [Description]
Next Steps:
- Post-mortem scheduled for [DATE/TIME]
- Preventive measures being implemented
We apologize for any inconvenience this may have caused.
VoiceAssist Operations Team
Incident Response Tools
Quick Command Reference
# Health Check Bundle alias va-health='curl -s http://localhost:8000/health | jq .' alias va-ready='curl -s http://localhost:8000/ready | jq .' alias va-metrics='curl -s http://localhost:8000/metrics' # Log Analysis alias va-errors='docker compose logs --since 10m voiceassist-server | grep -i error' alias va-errors-count='docker compose logs --since 10m voiceassist-server | grep -i error | wc -l' alias va-logs-tail='docker compose logs -f --tail=100 voiceassist-server' # Resource Check alias va-stats='docker stats --no-stream | grep voiceassist' alias va-disk='df -h | grep -E "(Filesystem|/dev/)"' # Database Quick Checks alias va-db-connections='docker compose exec postgres psql -U voiceassist -d voiceassist -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"' alias va-db-slow='docker compose exec postgres psql -U voiceassist -d voiceassist -c "SELECT pid, now() - query_start as duration, query FROM pg_stat_activity WHERE state != '\''idle'\'' ORDER BY duration DESC LIMIT 10;"' # Redis Checks alias va-redis-info='docker compose exec redis redis-cli INFO' alias va-redis-memory='docker compose exec redis redis-cli INFO memory | grep used_memory_human'
Incident Response Script
#!/bin/bash # Save as: /usr/local/bin/va-incident-check echo "=== VoiceAssist Incident Response Check ===" echo "Time: $(date)" echo "" echo "=== 1. Service Health ===" curl -s http://localhost:8000/health | jq '.' || echo "HEALTH CHECK FAILED" echo "" echo "=== 2. Container Status ===" docker compose ps echo "" echo "=== 3. Recent Errors (Last 10 min) ===" ERROR_COUNT=$(docker compose logs --since 10m voiceassist-server 2>/dev/null | grep -i error | wc -l) echo "Error Count: $ERROR_COUNT" if [ "$ERROR_COUNT" -gt 10 ]; then echo "⚠️ HIGH ERROR RATE DETECTED" docker compose logs --since 10m voiceassist-server | grep -i error | tail -10 fi echo "" echo "=== 4. Database Status ===" docker compose exec -T postgres pg_isready || echo "DATABASE NOT READY" docker compose exec -T postgres psql -U voiceassist -d voiceassist -c \ "SELECT count(*), state FROM pg_stat_activity GROUP BY state;" 2>/dev/null echo "" echo "=== 5. Redis Status ===" docker compose exec -T redis redis-cli ping || echo "REDIS NOT RESPONDING" docker compose exec -T redis redis-cli INFO memory | grep used_memory_human echo "" echo "=== 6. Resource Usage ===" docker stats --no-stream | grep voiceassist echo "" echo "=== 7. Disk Space ===" df -h | grep -E "(Filesystem|/$|/var)" echo "" echo "=== Summary ===" if [ "$ERROR_COUNT" -gt 50 ]; then echo "🔴 CRITICAL - High error rate detected" elif [ "$ERROR_COUNT" -gt 10 ]; then echo "🟡 WARNING - Elevated error rate" else echo "🟢 OK - System appears healthy" fi
Emergency Contacts
Primary Contacts
| Role | Contact | Availability |
|---|---|---|
| On-Call Engineer | PagerDuty alert | 24/7 |
| Backup On-Call | PagerDuty escalation | 24/7 |
| Engineering Manager | ops-manager@voiceassist.local | Business hours |
| DevOps Lead | devops-lead@voiceassist.local | Business hours + on-call |
| Database Admin | dba-oncall@voiceassist.local | 24/7 |
| Security Team | security@voiceassist.local | 24/7 for P1 security |
Escalation Contacts
| Level | Contact | When to Escalate |
|---|---|---|
| L1 | On-Call Engineer | Initial response |
| L2 | Team Lead | No resolution in 30 min (P1) or 2 hrs (P2) |
| L3 | Engineering Manager | No resolution in 1 hr (P1) or 4 hrs (P2) |
| L4 | VP Engineering / CTO | Major outage > 2 hours, data loss, security breach |
External Contacts
- Cloud Provider Support: [Support portal URL]
- Third-party Services: [Service provider contacts]
- Legal (for security incidents): legal@voiceassist.local
Related Documentation
- Deployment Runbook
- Backup & Restore Runbook
- Monitoring Runbook
- Troubleshooting Runbook
- Scaling Runbook
- UNIFIED_ARCHITECTURE.md
- CONNECTION_POOL_OPTIMIZATION.md
Document Version: 1.0
Last Updated: 2025-11-21
Maintained By: VoiceAssist DevOps Team
Review Cycle: Monthly or after each P1/P2 incident
Next Review: 2025-12-21
Backup & Restore Runbook
Last Updated: 2025-11-27
Purpose: Comprehensive guide for backup and restore operations in VoiceAssist V2
Backup Strategy Overview
Backup Schedule
| Component | Frequency | Retention | Method |
|---|---|---|---|
| PostgreSQL Database | Every 6 hours | 30 days | pg_dump + automated snapshots |
| Redis Cache | Daily | 7 days | RDB snapshots |
| Qdrant Vectors | Daily | 14 days | Collection snapshots |
| Configuration Files | On change | 90 days | Git + encrypted backups |
| Application Logs | Hourly | 30 days | Log aggregation |
| Docker Volumes | Weekly | 30 days | Volume snapshots |
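This schedule has to be wired into a scheduler to actually run. A minimal cron sketch is shown below; it assumes the `va-backup-*` scripts defined later in this runbook are installed in `/usr/local/bin`, and the times are examples to adjust to the table above.

```bash
# Example crontab entries (sketch). Edit with: crontab -e
# Assumes the va-backup-* scripts from this runbook exist in /usr/local/bin
# and /backups is writable by the cron user.

# PostgreSQL: every 6 hours
0 */6 * * * /usr/local/bin/va-backup-postgres >> /backups/postgres/cron.log 2>&1

# Redis: daily at 02:00
0 2 * * * /usr/local/bin/va-backup-redis >> /backups/redis/cron.log 2>&1

# Qdrant: daily at 03:00
0 3 * * * /usr/local/bin/va-backup-qdrant >> /backups/qdrant/cron.log 2>&1

# Configuration: daily at 04:00 (the table says "on change"; cron acts as a fallback)
0 4 * * * /usr/local/bin/va-backup-config >> /backups/config/cron.log 2>&1

# Docker volumes: weekly on Sunday at 05:00
0 5 * * 0 /usr/local/bin/va-backup-volumes >> /backups/volumes/cron.log 2>&1

# Backup health check: daily at 08:00
0 8 * * * /usr/local/bin/va-backup-health >> /backups/health.log 2>&1
```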
Backup Storage Locations
```bash
# Default backup directory structure
/backups/
├── postgres/
│   ├── daily/
│   ├── weekly/
│   └── monthly/
├── redis/
├── qdrant/
├── config/
├── volumes/
└── logs/
```
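If the tree does not exist yet, it can be created in one step (sketch):

```bash
# Create the backup directory skeleton shown above
mkdir -p /backups/{postgres/{daily,weekly,monthly},redis,qdrant,config,volumes,logs}
```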
PostgreSQL Database Backup
Full Database Backup
# Create timestamped backup BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_DIR="/backups/postgres/daily" # Ensure backup directory exists mkdir -p $BACKUP_DIR # Full database dump docker compose exec -T postgres pg_dump \ -U voiceassist \ -d voiceassist \ -F c \ -b \ -v \ -f /tmp/voiceassist_${BACKUP_DATE}.dump # Copy from container to host docker compose cp postgres:/tmp/voiceassist_${BACKUP_DATE}.dump \ ${BACKUP_DIR}/voiceassist_${BACKUP_DATE}.dump # Verify backup ls -lh ${BACKUP_DIR}/voiceassist_${BACKUP_DATE}.dump # Expected output: File size should be > 0 bytes
Compressed SQL Backup
# SQL format with compression BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_DIR="/backups/postgres/daily" mkdir -p $BACKUP_DIR # Create compressed SQL dump docker compose exec -T postgres pg_dump \ -U voiceassist \ -d voiceassist \ --clean \ --if-exists \ --verbose \ | gzip > ${BACKUP_DIR}/voiceassist_${BACKUP_DATE}.sql.gz # Verify backup ls -lh ${BACKUP_DIR}/voiceassist_${BACKUP_DATE}.sql.gz gunzip -t ${BACKUP_DIR}/voiceassist_${BACKUP_DATE}.sql.gz && echo "✓ Backup file is valid"
Schema-Only Backup
# Backup schema structure only (useful for development) BACKUP_DATE=$(date +%Y%m%d_%H%M%S) docker compose exec -T postgres pg_dump \ -U voiceassist \ -d voiceassist \ --schema-only \ --no-owner \ --no-acl \ > /backups/postgres/schema_${BACKUP_DATE}.sql echo "Schema backup completed: schema_${BACKUP_DATE}.sql"
Table-Specific Backup
# Backup specific tables BACKUP_DATE=$(date +%Y%m%d_%H%M%S) TABLES="users conversations messages" for TABLE in $TABLES; do echo "Backing up table: $TABLE" docker compose exec -T postgres pg_dump \ -U voiceassist \ -d voiceassist \ -t $TABLE \ --data-only \ | gzip > /backups/postgres/table_${TABLE}_${BACKUP_DATE}.sql.gz done echo "Table backups completed"
Automated Backup Script
#!/bin/bash # Save as: /usr/local/bin/va-backup-postgres set -e BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_DIR="/backups/postgres" DAILY_DIR="${BACKUP_DIR}/daily" WEEKLY_DIR="${BACKUP_DIR}/weekly" MONTHLY_DIR="${BACKUP_DIR}/monthly" LOG_FILE="${BACKUP_DIR}/backup.log" # Ensure directories exist mkdir -p $DAILY_DIR $WEEKLY_DIR $MONTHLY_DIR # Function to log messages log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a $LOG_FILE } log "Starting PostgreSQL backup" # Daily backup log "Creating daily backup" docker compose exec -T postgres pg_dump \ -U voiceassist \ -d voiceassist \ -F c \ -b \ | gzip > ${DAILY_DIR}/voiceassist_${BACKUP_DATE}.dump.gz if [ $? -eq 0 ]; then log "Daily backup completed: ${BACKUP_DATE}.dump.gz" BACKUP_SIZE=$(du -h ${DAILY_DIR}/voiceassist_${BACKUP_DATE}.dump.gz | cut -f1) log "Backup size: ${BACKUP_SIZE}" else log "ERROR: Daily backup failed" exit 1 fi # Weekly backup (every Sunday) if [ $(date +%u) -eq 7 ]; then log "Creating weekly backup" cp ${DAILY_DIR}/voiceassist_${BACKUP_DATE}.dump.gz \ ${WEEKLY_DIR}/voiceassist_week_$(date +%Y%U).dump.gz log "Weekly backup created" fi # Monthly backup (first day of month) if [ $(date +%d) -eq 01 ]; then log "Creating monthly backup" cp ${DAILY_DIR}/voiceassist_${BACKUP_DATE}.dump.gz \ ${MONTHLY_DIR}/voiceassist_$(date +%Y%m).dump.gz log "Monthly backup created" fi # Cleanup old daily backups (keep 30 days) log "Cleaning up old daily backups" find ${DAILY_DIR} -name "voiceassist_*.dump.gz" -mtime +30 -delete # Cleanup old weekly backups (keep 12 weeks) find ${WEEKLY_DIR} -name "voiceassist_week_*.dump.gz" -mtime +84 -delete # Cleanup old monthly backups (keep 12 months) find ${MONTHLY_DIR} -name "voiceassist_*.dump.gz" -mtime +365 -delete log "Backup process completed successfully"
Backup Verification
# Verify backup integrity BACKUP_FILE="/backups/postgres/daily/voiceassist_20251121_120000.dump.gz" # Check file exists and size if [ -f "$BACKUP_FILE" ]; then echo "✓ Backup file exists" ls -lh $BACKUP_FILE else echo "✗ Backup file not found" exit 1 fi # Test extraction gunzip -t $BACKUP_FILE if [ $? -eq 0 ]; then echo "✓ Backup file is not corrupted" else echo "✗ Backup file is corrupted" exit 1 fi # Test restore to temporary database (recommended) echo "Testing restore to temporary database..." docker compose exec -T postgres psql -U voiceassist -c "CREATE DATABASE test_restore;" gunzip -c $BACKUP_FILE | docker compose exec -T postgres pg_restore \ -U voiceassist \ -d test_restore \ --verbose if [ $? -eq 0 ]; then echo "✓ Backup restore test successful" docker compose exec -T postgres psql -U voiceassist -c "DROP DATABASE test_restore;" else echo "✗ Backup restore test failed" docker compose exec -T postgres psql -U voiceassist -c "DROP DATABASE IF EXISTS test_restore;" exit 1 fi
PostgreSQL Database Restore
Pre-Restore Checklist
- Verify backup file integrity
- Ensure sufficient disk space
- Notify all users of maintenance
- Stop application services
- Create a backup of current database (before restore)
- Document current state
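The first two checklist items can be verified quickly before touching anything. The sketch below assumes a gzip-compressed dump in the default backup location used throughout this runbook.

```bash
# Quick pre-restore verification (sketch)
BACKUP_FILE="/backups/postgres/daily/voiceassist_YYYYMMDD_HHMMSS.dump.gz"  # set to the backup being restored

# Backup exists and is readable
ls -lh "$BACKUP_FILE" || exit 1

# Archive is not corrupted
gunzip -t "$BACKUP_FILE" && echo "Backup archive is valid" || exit 1

# Enough free space on the database volume (the dump expands when restored)
df -h /var/lib/docker
```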
Full Database Restore
# Stop application to prevent connections docker compose stop voiceassist-server # Verify no active connections docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT count(*) FROM pg_stat_activity WHERE datname = 'voiceassist' AND pid != pg_backend_pid();" # Terminate active connections if any docker compose exec postgres psql -U voiceassist -d postgres -c \ "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'voiceassist' AND pid != pg_backend_pid();" # Drop and recreate database docker compose exec postgres psql -U voiceassist -d postgres <<EOF DROP DATABASE IF EXISTS voiceassist; CREATE DATABASE voiceassist OWNER voiceassist; EOF # Restore from custom format dump BACKUP_FILE="/backups/postgres/daily/voiceassist_20251121_120000.dump.gz" gunzip -c $BACKUP_FILE | docker compose exec -T postgres pg_restore \ -U voiceassist \ -d voiceassist \ --verbose \ --no-owner \ --no-acl # Verify restore docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT schemaname, tablename FROM pg_tables WHERE schemaname = 'public';" # Restart application docker compose start voiceassist-server echo "Database restore completed"
Restore from SQL Dump
# For plain SQL dumps BACKUP_FILE="/backups/postgres/daily/voiceassist_20251121_120000.sql.gz" # Stop application docker compose stop voiceassist-server # Restore SQL gunzip -c $BACKUP_FILE | docker compose exec -T postgres psql \ -U voiceassist \ -d voiceassist # Restart application docker compose start voiceassist-server
Point-in-Time Recovery (PITR)
# Requires WAL archiving to be enabled in PostgreSQL configuration # 1. Stop database docker compose stop postgres # 2. Replace data directory with base backup BACKUP_DIR="/backups/postgres/base" DATA_DIR="/var/lib/docker/volumes/voiceassist_postgres_data/_data" # Backup current data mv $DATA_DIR ${DATA_DIR}.backup_$(date +%Y%m%d_%H%M%S) # Restore base backup cp -r $BACKUP_DIR $DATA_DIR # 3. Create recovery configuration cat > ${DATA_DIR}/recovery.conf <<EOF restore_command = 'cp /backups/postgres/wal_archive/%f %p' recovery_target_time = '2025-11-21 12:00:00' EOF # 4. Start PostgreSQL (will perform recovery) docker compose start postgres # 5. Monitor recovery docker compose logs -f postgres | grep -i recovery
Partial Restore (Single Table)
# Restore specific table from backup TABLE_NAME="users" BACKUP_FILE="/backups/postgres/table_users_20251121_120000.sql.gz" # Drop existing table data docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "TRUNCATE TABLE ${TABLE_NAME} CASCADE;" # Restore table gunzip -c $BACKUP_FILE | docker compose exec -T postgres psql \ -U voiceassist \ -d voiceassist # Verify docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT COUNT(*) FROM ${TABLE_NAME};"
Redis Backup
Manual Redis Backup
# Trigger Redis save docker compose exec redis redis-cli BGSAVE # Wait for save to complete docker compose exec redis redis-cli LASTSAVE # Copy RDB file BACKUP_DATE=$(date +%Y%m%d_%H%M%S) mkdir -p /backups/redis docker compose cp redis:/data/dump.rdb \ /backups/redis/dump_${BACKUP_DATE}.rdb # Verify backup ls -lh /backups/redis/dump_${BACKUP_DATE}.rdb
Automated Redis Backup Script
#!/bin/bash # Save as: /usr/local/bin/va-backup-redis set -e BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_DIR="/backups/redis" LOG_FILE="${BACKUP_DIR}/backup.log" mkdir -p $BACKUP_DIR log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a $LOG_FILE } log "Starting Redis backup" # Trigger background save docker compose exec -T redis redis-cli BGSAVE > /dev/null # Wait for save to complete (check every 2 seconds) TIMEOUT=60 ELAPSED=0 while [ $ELAPSED -lt $TIMEOUT ]; do STATUS=$(docker compose exec -T redis redis-cli LASTSAVE 2>/dev/null || echo "0") if [ ! -z "$STATUS" ]; then break fi sleep 2 ELAPSED=$((ELAPSED + 2)) done # Copy RDB file docker compose cp redis:/data/dump.rdb \ ${BACKUP_DIR}/dump_${BACKUP_DATE}.rdb if [ $? -eq 0 ]; then log "Redis backup completed: dump_${BACKUP_DATE}.rdb" BACKUP_SIZE=$(du -h ${BACKUP_DIR}/dump_${BACKUP_DATE}.rdb | cut -f1) log "Backup size: ${BACKUP_SIZE}" else log "ERROR: Redis backup failed" exit 1 fi # Cleanup old backups (keep 7 days) find ${BACKUP_DIR} -name "dump_*.rdb" -mtime +7 -delete log "Cleanup completed" log "Redis backup process completed successfully"
Redis Restore
# Stop Redis docker compose stop redis # Replace RDB file BACKUP_FILE="/backups/redis/dump_20251121_120000.rdb" docker compose cp $BACKUP_FILE redis:/data/dump.rdb # Start Redis (will load from dump.rdb) docker compose start redis # Verify data loaded docker compose exec redis redis-cli DBSIZE echo "Redis restore completed"
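One caveat worth checking before relying on the RDB copy above: if Redis runs with AOF persistence enabled (`appendonly yes`), it reloads state from the append-only file rather than `dump.rdb`, so replacing the RDB file would have no effect.

```bash
# If this prints "appendonly yes", disable AOF or restore the AOF files instead,
# because Redis ignores dump.rdb on startup when AOF is enabled.
docker compose exec redis redis-cli CONFIG GET appendonly
```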
Qdrant Vector Database Backup
Create Qdrant Snapshot
# Create snapshot for specific collection COLLECTION_NAME="voice_embeddings" BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_DIR="/backups/qdrant" mkdir -p $BACKUP_DIR # Create snapshot via API SNAPSHOT_NAME=$(curl -X POST \ "http://localhost:6333/collections/${COLLECTION_NAME}/snapshots" \ | jq -r '.result.name') echo "Snapshot created: $SNAPSHOT_NAME" # Download snapshot curl -X GET \ "http://localhost:6333/collections/${COLLECTION_NAME}/snapshots/${SNAPSHOT_NAME}" \ -o ${BACKUP_DIR}/${COLLECTION_NAME}_${BACKUP_DATE}.snapshot # Verify backup ls -lh ${BACKUP_DIR}/${COLLECTION_NAME}_${BACKUP_DATE}.snapshot
Backup All Qdrant Collections
#!/bin/bash # Save as: /usr/local/bin/va-backup-qdrant set -e BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_DIR="/backups/qdrant" LOG_FILE="${BACKUP_DIR}/backup.log" mkdir -p $BACKUP_DIR log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a $LOG_FILE } log "Starting Qdrant backup" # Get all collections COLLECTIONS=$(curl -s http://localhost:6333/collections | jq -r '.result.collections[].name') if [ -z "$COLLECTIONS" ]; then log "No collections found" exit 0 fi # Backup each collection for COLLECTION in $COLLECTIONS; do log "Backing up collection: $COLLECTION" # Create snapshot SNAPSHOT_NAME=$(curl -s -X POST \ "http://localhost:6333/collections/${COLLECTION}/snapshots" \ | jq -r '.result.name') if [ ! -z "$SNAPSHOT_NAME" ] && [ "$SNAPSHOT_NAME" != "null" ]; then # Download snapshot curl -s -X GET \ "http://localhost:6333/collections/${COLLECTION}/snapshots/${SNAPSHOT_NAME}" \ -o ${BACKUP_DIR}/${COLLECTION}_${BACKUP_DATE}.snapshot log "Backup completed: ${COLLECTION}_${BACKUP_DATE}.snapshot" BACKUP_SIZE=$(du -h ${BACKUP_DIR}/${COLLECTION}_${BACKUP_DATE}.snapshot | cut -f1) log "Backup size: ${BACKUP_SIZE}" # Delete remote snapshot to save space curl -s -X DELETE \ "http://localhost:6333/collections/${COLLECTION}/snapshots/${SNAPSHOT_NAME}" \ > /dev/null else log "ERROR: Failed to create snapshot for $COLLECTION" fi done # Cleanup old backups (keep 14 days) find ${BACKUP_DIR} -name "*.snapshot" -mtime +14 -delete log "Cleanup completed" log "Qdrant backup process completed successfully"
Qdrant Restore
# Stop Qdrant docker compose stop qdrant # Clear existing data (optional, for full restore) docker compose exec qdrant rm -rf /qdrant/storage/* # Start Qdrant docker compose start qdrant # Wait for Qdrant to be ready sleep 5 # Restore each collection COLLECTION_NAME="voice_embeddings" BACKUP_FILE="/backups/qdrant/voice_embeddings_20251121_120000.snapshot" # Upload snapshot curl -X POST \ "http://localhost:6333/collections/${COLLECTION_NAME}/snapshots/upload" \ -H "Content-Type: multipart/form-data" \ -F "snapshot=@${BACKUP_FILE}" # Verify collection restored curl -s http://localhost:6333/collections/${COLLECTION_NAME} | jq '.result' echo "Qdrant restore completed"
Configuration Files Backup
Backup Configuration
#!/bin/bash # Save as: /usr/local/bin/va-backup-config set -e BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_DIR="/backups/config" PROJECT_DIR="/Users/mohammednazmy/VoiceAssist" mkdir -p $BACKUP_DIR echo "Starting configuration backup" # Create tarball of configuration files tar -czf ${BACKUP_DIR}/config_${BACKUP_DATE}.tar.gz \ -C $PROJECT_DIR \ .env \ docker-compose.yml \ docker-compose.override.yml \ alembic.ini \ pyproject.toml \ --exclude='.git' \ --exclude='__pycache__' # Encrypt backup (recommended for sensitive configs) if command -v gpg &> /dev/null; then gpg --symmetric --cipher-algo AES256 \ -o ${BACKUP_DIR}/config_${BACKUP_DATE}.tar.gz.gpg \ ${BACKUP_DIR}/config_${BACKUP_DATE}.tar.gz # Remove unencrypted version rm ${BACKUP_DIR}/config_${BACKUP_DATE}.tar.gz echo "Configuration backup encrypted: config_${BACKUP_DATE}.tar.gz.gpg" else echo "Configuration backup created: config_${BACKUP_DATE}.tar.gz" echo "WARNING: Backup is not encrypted. Consider installing gpg." fi # Cleanup old backups (keep 90 days) find ${BACKUP_DIR} -name "config_*.tar.gz*" -mtime +90 -delete echo "Configuration backup completed"
Restore Configuration
# For encrypted backups BACKUP_FILE="/backups/config/config_20251121_120000.tar.gz.gpg" PROJECT_DIR="/Users/mohammednazmy/VoiceAssist" # Decrypt and extract gpg --decrypt $BACKUP_FILE | tar -xzf - -C $PROJECT_DIR # For unencrypted backups BACKUP_FILE="/backups/config/config_20251121_120000.tar.gz" tar -xzf $BACKUP_FILE -C $PROJECT_DIR echo "Configuration restored"
Docker Volumes Backup
Backup Docker Volumes
#!/bin/bash # Save as: /usr/local/bin/va-backup-volumes set -e BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_DIR="/backups/volumes" mkdir -p $BACKUP_DIR echo "Starting Docker volumes backup" # List of volumes to backup VOLUMES=( "voiceassist_postgres_data" "voiceassist_redis_data" "voiceassist_qdrant_storage" ) for VOLUME in "${VOLUMES[@]}"; do echo "Backing up volume: $VOLUME" # Create tarball of volume docker run --rm \ -v ${VOLUME}:/source:ro \ -v ${BACKUP_DIR}:/backup \ alpine \ tar -czf /backup/${VOLUME}_${BACKUP_DATE}.tar.gz -C /source . if [ $? -eq 0 ]; then echo "Backup completed: ${VOLUME}_${BACKUP_DATE}.tar.gz" BACKUP_SIZE=$(du -h ${BACKUP_DIR}/${VOLUME}_${BACKUP_DATE}.tar.gz | cut -f1) echo "Backup size: ${BACKUP_SIZE}" else echo "ERROR: Backup failed for $VOLUME" fi done # Cleanup old backups (keep 30 days) find ${BACKUP_DIR} -name "*.tar.gz" -mtime +30 -delete echo "Docker volumes backup completed"
Restore Docker Volumes
# Stop services docker compose down # Restore specific volume VOLUME_NAME="voiceassist_postgres_data" BACKUP_FILE="/backups/volumes/voiceassist_postgres_data_20251121_120000.tar.gz" # Remove existing volume (WARNING: destructive) docker volume rm $VOLUME_NAME # Create new volume docker volume create $VOLUME_NAME # Restore data docker run --rm \ -v ${VOLUME_NAME}:/target \ -v $(dirname $BACKUP_FILE):/backup \ alpine \ tar -xzf /backup/$(basename $BACKUP_FILE) -C /target echo "Volume $VOLUME_NAME restored" # Start services docker compose up -d
Disaster Recovery
Complete System Backup
#!/bin/bash # Save as: /usr/local/bin/va-backup-full set -e BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_ROOT="/backups" DR_DIR="${BACKUP_ROOT}/disaster_recovery" mkdir -p $DR_DIR echo "============================================" echo "Starting Full System Backup for DR" echo "Date: $(date)" echo "============================================" # Stop application (keep databases running) docker compose stop voiceassist-server # 1. Backup PostgreSQL echo "[1/5] Backing up PostgreSQL..." /usr/local/bin/va-backup-postgres # 2. Backup Redis echo "[2/5] Backing up Redis..." /usr/local/bin/va-backup-redis # 3. Backup Qdrant echo "[3/5] Backing up Qdrant..." /usr/local/bin/va-backup-qdrant # 4. Backup Configuration echo "[4/5] Backing up Configuration..." /usr/local/bin/va-backup-config # 5. Backup Docker Volumes echo "[5/5] Backing up Docker Volumes..." /usr/local/bin/va-backup-volumes # Create DR manifest cat > ${DR_DIR}/manifest_${BACKUP_DATE}.txt <<EOF VoiceAssist V2 Disaster Recovery Backup ======================================== Date: $(date) Backup ID: ${BACKUP_DATE} Components Backed Up: - PostgreSQL Database - Redis Cache - Qdrant Vector Database - Configuration Files - Docker Volumes Backup Locations: - PostgreSQL: ${BACKUP_ROOT}/postgres/daily/ - Redis: ${BACKUP_ROOT}/redis/ - Qdrant: ${BACKUP_ROOT}/qdrant/ - Config: ${BACKUP_ROOT}/config/ - Volumes: ${BACKUP_ROOT}/volumes/ Backup Sizes: $(du -sh ${BACKUP_ROOT}/postgres/daily/voiceassist_${BACKUP_DATE}* 2>/dev/null || echo "PostgreSQL: N/A") $(du -sh ${BACKUP_ROOT}/redis/dump_${BACKUP_DATE}.rdb 2>/dev/null || echo "Redis: N/A") $(du -sh ${BACKUP_ROOT}/qdrant/*_${BACKUP_DATE}.snapshot 2>/dev/null || echo "Qdrant: N/A") $(du -sh ${BACKUP_ROOT}/config/config_${BACKUP_DATE}.tar.gz* 2>/dev/null || echo "Config: N/A") Total Backup Size: $(du -sh ${BACKUP_ROOT} | cut -f1) Verification Status: - PostgreSQL: $(test -f ${BACKUP_ROOT}/postgres/daily/voiceassist_${BACKUP_DATE}* && echo "✓" || echo "✗") - Redis: $(test -f ${BACKUP_ROOT}/redis/dump_${BACKUP_DATE}.rdb && echo "✓" || echo "✗") - Config: $(test -f ${BACKUP_ROOT}/config/config_${BACKUP_DATE}.tar.gz* && echo "✓" || echo "✗") Restore Command: /usr/local/bin/va-restore-full ${BACKUP_DATE} EOF # Create compressed archive of entire backup echo "Creating DR archive..." tar -czf ${DR_DIR}/voiceassist_dr_${BACKUP_DATE}.tar.gz \ -C ${BACKUP_ROOT} \ postgres/daily \ redis \ qdrant \ config \ volumes # Restart application docker compose start voiceassist-server echo "============================================" echo "Full System Backup Completed" echo "Manifest: ${DR_DIR}/manifest_${BACKUP_DATE}.txt" echo "Archive: ${DR_DIR}/voiceassist_dr_${BACKUP_DATE}.tar.gz" echo "============================================" cat ${DR_DIR}/manifest_${BACKUP_DATE}.txt
Complete System Restore
#!/bin/bash # Save as: /usr/local/bin/va-restore-full set -e if [ -z "$1" ]; then echo "Usage: $0 <backup_date>" echo "Example: $0 20251121_120000" exit 1 fi BACKUP_DATE=$1 BACKUP_ROOT="/backups" echo "============================================" echo "Starting Full System Restore" echo "Backup Date: ${BACKUP_DATE}" echo "============================================" # Verify manifest exists MANIFEST="${BACKUP_ROOT}/disaster_recovery/manifest_${BACKUP_DATE}.txt" if [ ! -f "$MANIFEST" ]; then echo "ERROR: Manifest not found: $MANIFEST" exit 1 fi echo "Manifest found. Displaying backup details:" cat $MANIFEST echo "" read -p "Do you want to proceed with restore? This will OVERWRITE all data (yes/no): " CONFIRM if [ "$CONFIRM" != "yes" ]; then echo "Restore cancelled" exit 0 fi # Stop all services echo "Stopping services..." docker compose down # 1. Restore PostgreSQL echo "[1/5] Restoring PostgreSQL..." POSTGRES_BACKUP="${BACKUP_ROOT}/postgres/daily/voiceassist_${BACKUP_DATE}.dump.gz" if [ -f "$POSTGRES_BACKUP" ]; then docker compose up -d postgres sleep 10 docker compose exec postgres psql -U voiceassist -d postgres -c \ "DROP DATABASE IF EXISTS voiceassist;" docker compose exec postgres psql -U voiceassist -d postgres -c \ "CREATE DATABASE voiceassist OWNER voiceassist;" gunzip -c $POSTGRES_BACKUP | docker compose exec -T postgres pg_restore \ -U voiceassist \ -d voiceassist \ --verbose \ --no-owner \ --no-acl echo "✓ PostgreSQL restored" else echo "✗ PostgreSQL backup not found" fi # 2. Restore Redis echo "[2/5] Restoring Redis..." REDIS_BACKUP="${BACKUP_ROOT}/redis/dump_${BACKUP_DATE}.rdb" if [ -f "$REDIS_BACKUP" ]; then docker compose stop redis docker compose cp $REDIS_BACKUP redis:/data/dump.rdb docker compose start redis sleep 5 echo "✓ Redis restored" else echo "✗ Redis backup not found" fi # 3. Restore Qdrant echo "[3/5] Restoring Qdrant..." docker compose up -d qdrant sleep 10 for SNAPSHOT in ${BACKUP_ROOT}/qdrant/*_${BACKUP_DATE}.snapshot; do if [ -f "$SNAPSHOT" ]; then COLLECTION=$(basename $SNAPSHOT | sed "s/_${BACKUP_DATE}.snapshot//") echo "Restoring collection: $COLLECTION" curl -X POST \ "http://localhost:6333/collections/${COLLECTION}/snapshots/upload" \ -H "Content-Type: multipart/form-data" \ -F "snapshot=@${SNAPSHOT}" echo "✓ Collection $COLLECTION restored" fi done # 4. Restore Configuration echo "[4/5] Restoring Configuration..." CONFIG_BACKUP="${BACKUP_ROOT}/config/config_${BACKUP_DATE}.tar.gz" CONFIG_BACKUP_ENC="${CONFIG_BACKUP}.gpg" if [ -f "$CONFIG_BACKUP_ENC" ]; then gpg --decrypt $CONFIG_BACKUP_ENC | tar -xzf - -C /Users/mohammednazmy/VoiceAssist echo "✓ Configuration restored (encrypted)" elif [ -f "$CONFIG_BACKUP" ]; then tar -xzf $CONFIG_BACKUP -C /Users/mohammednazmy/VoiceAssist echo "✓ Configuration restored" else echo "✗ Configuration backup not found" fi # 5. Start all services echo "[5/5] Starting all services..." docker compose up -d # Wait for services to be ready echo "Waiting for services to be ready..." sleep 30 # Verify system health echo "" echo "============================================" echo "Restore Completed - Verifying System Health" echo "============================================" curl -s http://localhost:8000/health | jq '.' docker compose ps echo "" echo "Full system restore completed" echo "Please verify all functionality before resuming operations"
Disaster Recovery Scenarios
Scenario 1: Complete Hardware Failure
# On NEW hardware: # 1. Install Docker and Docker Compose # 2. Clone repository git clone <repository_url> /Users/mohammednazmy/VoiceAssist cd /Users/mohammednazmy/VoiceAssist # 3. Copy DR archive from backup location scp backup-server:/backups/disaster_recovery/voiceassist_dr_YYYYMMDD_HHMMSS.tar.gz /tmp/ # 4. Extract DR archive mkdir -p /backups tar -xzf /tmp/voiceassist_dr_YYYYMMDD_HHMMSS.tar.gz -C /backups # 5. Run full restore /usr/local/bin/va-restore-full YYYYMMDD_HHMMSS # 6. Verify and resume operations
Scenario 2: Data Corruption
# 1. Stop application docker compose stop voiceassist-server # 2. Create backup of corrupted data (for analysis) /usr/local/bin/va-backup-full # 3. Identify last known good backup ls -lh /backups/disaster_recovery/manifest_*.txt # 4. Restore from last good backup /usr/local/bin/va-restore-full YYYYMMDD_HHMMSS # 5. Verify data integrity # Run data validation scripts # 6. Resume operations docker compose start voiceassist-server
Scenario 3: Accidental Data Deletion
# Restore specific component only (faster than full restore) # For deleted PostgreSQL table/data: BACKUP_FILE="/backups/postgres/daily/voiceassist_20251121_120000.dump.gz" # Use table-specific restore procedure # For deleted Redis data: # Use Redis restore procedure # For deleted Qdrant collection: # Use Qdrant restore procedure
Backup Monitoring
Backup Health Check
#!/bin/bash # Save as: /usr/local/bin/va-backup-health BACKUP_ROOT="/backups" ALERT_EMAIL="ops-team@voiceassist.local" echo "Backup Health Check - $(date)" echo "========================================" # Check PostgreSQL backups LATEST_PG=$(find ${BACKUP_ROOT}/postgres/daily -name "*.dump.gz" -mtime -1 | wc -l) if [ $LATEST_PG -eq 0 ]; then echo "⚠️ WARNING: No PostgreSQL backup in last 24 hours" else echo "✓ PostgreSQL backups are current" fi # Check Redis backups LATEST_REDIS=$(find ${BACKUP_ROOT}/redis -name "*.rdb" -mtime -1 | wc -l) if [ $LATEST_REDIS -eq 0 ]; then echo "⚠️ WARNING: No Redis backup in last 24 hours" else echo "✓ Redis backups are current" fi # Check Qdrant backups LATEST_QDRANT=$(find ${BACKUP_ROOT}/qdrant -name "*.snapshot" -mtime -1 | wc -l) if [ $LATEST_QDRANT -eq 0 ]; then echo "⚠️ WARNING: No Qdrant backup in last 24 hours" else echo "✓ Qdrant backups are current" fi # Check disk space DISK_USAGE=$(df -h ${BACKUP_ROOT} | tail -1 | awk '{print $5}' | sed 's/%//') if [ $DISK_USAGE -gt 80 ]; then echo "⚠️ WARNING: Backup disk usage at ${DISK_USAGE}%" else echo "✓ Backup disk space is adequate (${DISK_USAGE}%)" fi # Check backup sizes echo "" echo "Backup Sizes:" echo "PostgreSQL: $(du -sh ${BACKUP_ROOT}/postgres | cut -f1)" echo "Redis: $(du -sh ${BACKUP_ROOT}/redis | cut -f1)" echo "Qdrant: $(du -sh ${BACKUP_ROOT}/qdrant | cut -f1)" echo "Config: $(du -sh ${BACKUP_ROOT}/config | cut -f1)" echo "Total: $(du -sh ${BACKUP_ROOT} | cut -f1)"
Related Documentation
- Deployment Runbook
- Incident Response Runbook
- Troubleshooting Runbook
- Monitoring Runbook
- UNIFIED_ARCHITECTURE.md
Document Version: 1.0
Last Updated: 2025-11-21
Maintained By: VoiceAssist DevOps Team
Review Cycle: Quarterly or after each disaster recovery event
Next Review: 2026-02-21
Scaling Runbook
Last Updated: 2025-11-27
Purpose: Comprehensive guide for scaling VoiceAssist V2 infrastructure
Scaling Overview
Current Architecture
Load Balancer (if configured)
↓
VoiceAssist Server (Scalable)
↓
├── PostgreSQL (Primary + Read Replicas)
├── Redis (Cluster or Sentinel)
└── Qdrant (Distributed)
Scaling Strategy
| Component | Type | Method | Max Recommended |
|---|---|---|---|
| VoiceAssist Server | Stateless | Horizontal | 10+ instances |
| PostgreSQL | Stateful | Vertical + Read Replicas | 1 primary + 5 replicas |
| Redis | Stateful | Vertical + Cluster | 6 nodes (3 primaries + 3 replicas) |
| Qdrant | Stateful | Horizontal + Sharding | 6+ nodes |
When to Scale
Scaling Triggers
Immediate Scaling (Reactive)
Scale immediately if:
- CPU usage > 80% for 10+ minutes
- Memory usage > 85%
- Response time > 2 seconds (p95)
- Error rate > 5%
- Connection pool exhausted
- Queue depth > 1000
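The CPU, memory, and connection-pool triggers are covered by the `va-capacity-check` script later in this runbook. The rough sketch below adds a check for the error-rate trigger from recent logs and surfaces the latency metrics for inspection; the log format (`status=NNN`) and metric name are taken from earlier sections and may need adjusting.

```bash
# Rough check of the error-rate / latency scaling triggers (sketch).

# Error rate over the last 15 minutes, from request logs (assumes "status=NNN" in log lines)
TOTAL=$(docker compose logs --since 15m voiceassist-server | grep -cE "status=[0-9]+")
ERRORS=$(docker compose logs --since 15m voiceassist-server | grep -cE "status=5[0-9][0-9]")
echo "Requests: $TOTAL, 5xx: $ERRORS"
if [ "$TOTAL" -gt 0 ] && [ $((ERRORS * 100 / TOTAL)) -ge 5 ]; then
  echo "Error rate >= 5% - reactive scaling trigger met"
fi

# Latency: inspect the duration histogram/summary exported at /metrics
# (assumes a Prometheus-style http_request_duration metric; adjust the pattern if needed)
curl -s http://localhost:8000/metrics | grep http_request_duration | head -20
```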
Planned Scaling (Proactive)
Schedule scaling if:
- Expected traffic increase (events, marketing campaigns)
- New feature launch with heavy load
- Approaching 70% capacity on any metric
- Seasonal traffic patterns
Scaling Decision Matrix
# Quick capacity check cat > /usr/local/bin/va-capacity-check <<'EOF' #!/bin/bash echo "VoiceAssist Capacity Check - $(date)" echo "========================================" # Check application load CPU=$(docker stats --no-stream --format "{{.CPUPerc}}" voiceassist-voiceassist-server-1 | sed 's/%//') MEM=$(docker stats --no-stream --format "{{.MemPerc}}" voiceassist-voiceassist-server-1 | sed 's/%//') echo "Application:" echo " CPU: ${CPU}%" echo " Memory: ${MEM}%" # Database connections DB_CONN=$(docker compose exec -T postgres psql -U voiceassist -d voiceassist -t -c \ "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';" | tr -d ' ') DB_MAX=$(docker compose exec -T postgres psql -U voiceassist -d voiceassist -t -c \ "SHOW max_connections;" | tr -d ' ') DB_USAGE=$((DB_CONN * 100 / DB_MAX)) echo "Database:" echo " Active Connections: ${DB_CONN}/${DB_MAX} (${DB_USAGE}%)" # Redis memory REDIS_MEM=$(docker compose exec -T redis redis-cli INFO memory | grep used_memory_human | cut -d: -f2 | tr -d '\r') echo "Redis:" echo " Memory Usage: ${REDIS_MEM}" # Recommendation echo "" echo "Scaling Recommendations:" if (( $(echo "$CPU > 80" | bc -l) )) || (( $(echo "$MEM > 85" | bc -l) )); then echo "🔴 IMMEDIATE: Scale application horizontally" elif (( $(echo "$CPU > 70" | bc -l) )) || (( $(echo "$MEM > 70" | bc -l) )); then echo "🟡 SOON: Plan to scale within 24 hours" elif [ $DB_USAGE -gt 80 ]; then echo "🔴 IMMEDIATE: Scale database connections or add read replica" else echo "🟢 OK: Current capacity is adequate" fi EOF chmod +x /usr/local/bin/va-capacity-check
Horizontal Scaling - Application Server
Quick Scale Up
# Scale to 3 instances docker compose up -d --scale voiceassist-server=3 # Verify all instances running docker compose ps voiceassist-server # Expected output: 3 containers running # voiceassist-voiceassist-server-1 # voiceassist-voiceassist-server-2 # voiceassist-voiceassist-server-3 # Check health of all instances for i in {1..3}; do echo "Instance $i:" docker inspect voiceassist-voiceassist-server-$i | jq '.[0].State.Health.Status' done
Scale with Load Balancer
# Add to docker-compose.yml services: nginx: image: nginx:alpine ports: - "80:80" volumes: - ./nginx.conf:/etc/nginx/nginx.conf:ro depends_on: - voiceassist-server voiceassist-server: # ... existing config ... deploy: replicas: 3 resources: limits: cpus: "2" memory: 2G reservations: cpus: "1" memory: 1G
# Create nginx.conf for load balancing upstream voiceassist_backend { least_conn; # Use least connections algorithm server voiceassist-server-1:8000 max_fails=3 fail_timeout=30s; server voiceassist-server-2:8000 max_fails=3 fail_timeout=30s; server voiceassist-server-3:8000 max_fails=3 fail_timeout=30s; keepalive 32; } server { listen 80; location / { proxy_pass http://voiceassist_backend; proxy_http_version 1.1; proxy_set_header Connection ""; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; # Timeouts proxy_connect_timeout 5s; proxy_send_timeout 60s; proxy_read_timeout 60s; # Health check proxy_next_upstream error timeout http_500 http_502 http_503; } location /health { access_log off; proxy_pass http://voiceassist_backend; } }
# Deploy with load balancer docker compose up -d --scale voiceassist-server=3 # Verify load balancing for i in {1..10}; do curl -s http://localhost/health | jq -r '.hostname' done # Should show different hostnames, confirming requests are spread across instances (least_conn balancing)
Auto-Scaling with Metrics
#!/bin/bash # Save as: /usr/local/bin/va-autoscale MIN_INSTANCES=2 MAX_INSTANCES=10 SCALE_UP_THRESHOLD=70 SCALE_DOWN_THRESHOLD=30 CHECK_INTERVAL=60 while true; do # Get current instance count CURRENT=$(docker compose ps -q voiceassist-server | wc -l) # Get average CPU across all instances AVG_CPU=$(docker stats --no-stream --format "{{.CPUPerc}}" \ $(docker compose ps -q voiceassist-server) | \ sed 's/%//g' | \ awk '{s+=$1; n++} END {print s/n}') echo "[$(date)] Instances: $CURRENT, Avg CPU: ${AVG_CPU}%" # Scale up if (( $(echo "$AVG_CPU > $SCALE_UP_THRESHOLD" | bc -l) )) && [ $CURRENT -lt $MAX_INSTANCES ]; then NEW_COUNT=$((CURRENT + 1)) echo "Scaling UP to $NEW_COUNT instances (CPU: ${AVG_CPU}%)" docker compose up -d --scale voiceassist-server=$NEW_COUNT # Scale down elif (( $(echo "$AVG_CPU < $SCALE_DOWN_THRESHOLD" | bc -l) )) && [ $CURRENT -gt $MIN_INSTANCES ]; then NEW_COUNT=$((CURRENT - 1)) echo "Scaling DOWN to $NEW_COUNT instances (CPU: ${AVG_CPU}%)" docker compose up -d --scale voiceassist-server=$NEW_COUNT else echo "No scaling needed" fi sleep $CHECK_INTERVAL done
Graceful Instance Shutdown
# Scale down with zero downtime CURRENT=$(docker compose ps -q voiceassist-server | wc -l) TARGET=$((CURRENT - 1)) echo "Scaling from $CURRENT to $TARGET instances" # Get last instance LAST_INSTANCE="voiceassist-voiceassist-server-${CURRENT}" # Stop accepting new connections (if using load balancer) docker compose exec nginx nginx -s reload # Wait for existing connections to drain (30 seconds) echo "Draining connections..." sleep 30 # Check remaining connections ACTIVE_CONN=$(docker exec $LAST_INSTANCE netstat -an | grep :8000 | grep ESTABLISHED | wc -l) echo "Active connections on instance: $ACTIVE_CONN" # Scale down docker compose up -d --scale voiceassist-server=$TARGET echo "Scaled down to $TARGET instances"
Vertical Scaling - Application Server
Increase CPU and Memory
# Update docker-compose.yml services: voiceassist-server: deploy: resources: limits: cpus: "4" # Increased from 2 memory: 4G # Increased from 2G reservations: cpus: "2" # Increased from 1 memory: 2G # Increased from 1G
# Apply changes docker compose up -d voiceassist-server # Verify new limits docker inspect voiceassist-voiceassist-server-1 | \ jq '.[0].HostConfig.Memory, .[0].HostConfig.NanoCpus' # Monitor performance improvement docker stats voiceassist-voiceassist-server-1
Optimize Application Workers
# Increase Gunicorn workers in docker-compose.yml or .env # Rule: workers = (2 x CPU cores) + 1 # For 4 CPU cores: GUNICORN_WORKERS=9 # (2 x 4) + 1 # Set the value where the container reads its environment, then recreate the service echo "GUNICORN_WORKERS=9" >> .env docker compose up -d voiceassist-server # Verify worker count docker compose exec voiceassist-server ps aux | grep gunicorn
PostgreSQL Scaling
Vertical Scaling - Increase Resources
# Update docker-compose.yml services: postgres: deploy: resources: limits: cpus: "4" memory: 8G reservations: cpus: "2" memory: 4G command: - "postgres" - "-c" - "max_connections=200" # Increased from 100 - "-c" - "shared_buffers=2GB" # Increased from 256MB - "-c" - "effective_cache_size=6GB" # Increased - "-c" - "maintenance_work_mem=512MB" # Increased - "-c" - "checkpoint_completion_target=0.9" - "-c" - "wal_buffers=16MB" - "-c" - "default_statistics_target=100" - "-c" - "random_page_cost=1.1" - "-c" - "effective_io_concurrency=200" - "-c" - "work_mem=10MB" # Increased - "-c" - "min_wal_size=1GB" - "-c" - "max_wal_size=4GB" # Increased
# Apply changes docker compose up -d postgres # Verify new settings docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SHOW max_connections; SHOW shared_buffers; SHOW effective_cache_size;"
Read Replica Setup
# Add to docker-compose.yml services: postgres-replica: image: postgres:15 environment: POSTGRES_PASSWORD: ${POSTGRES_PASSWORD} volumes: - postgres_replica_data:/var/lib/postgresql/data command: - "postgres" - "-c" - "hot_standby=on" - "-c" - "max_connections=200" depends_on: - postgres volumes: postgres_replica_data:
# Setup replication on primary docker compose exec postgres psql -U voiceassist -d postgres <<EOF -- Create replication user CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'replica_password'; -- Configure pg_hba.conf for replication -- Add to postgresql.conf: -- wal_level = replica -- max_wal_senders = 10 -- max_replication_slots = 10 -- hot_standby = on EOF # Restart primary docker compose restart postgres # Take a base backup for the replica; -R writes standby.signal and primary_conninfo # (recovery.conf is no longer used on PostgreSQL 12+) docker compose exec postgres pg_basebackup \ -h postgres \ -D /var/lib/postgresql/data-replica \ -U replicator \ -R \ -v \ -P \ -W # Start replica docker compose up -d postgres-replica # Verify replication docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT * FROM pg_stat_replication;"
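The replica only pays off if the application actually routes read traffic to it. Below is a minimal sketch of read/write routing, assuming SQLAlchemy async engines; the replica URL, engine options, and `get_session` helper are illustrative assumptions, not part of the current codebase.

```python
# Hypothetical read/write routing sketch (assumes SQLAlchemy 2.x async engines).
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker

# Writes go to the primary; lag-tolerant reads can go to the replica.
primary_engine = create_async_engine(
    "postgresql+asyncpg://voiceassist:password@postgres:5432/voiceassist"
)
replica_engine = create_async_engine(
    "postgresql+asyncpg://voiceassist:password@postgres-replica:5432/voiceassist"
)

PrimarySession = async_sessionmaker(primary_engine, expire_on_commit=False)
ReplicaSession = async_sessionmaker(replica_engine, expire_on_commit=False)

def get_session(readonly: bool = False):
    """Return a session bound to the replica for read-only work, else the primary."""
    return ReplicaSession() if readonly else PrimarySession()
```

Keep replication lag in mind: only lag-tolerant queries (listings, analytics) should be routed to the replica.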
Connection Pooling with PgBouncer
# Add to docker-compose.yml services: pgbouncer: image: pgbouncer/pgbouncer:latest environment: DATABASES_HOST: postgres DATABASES_PORT: 5432 DATABASES_USER: voiceassist DATABASES_PASSWORD: ${POSTGRES_PASSWORD} DATABASES_DBNAME: voiceassist PGBOUNCER_POOL_MODE: transaction PGBOUNCER_MAX_CLIENT_CONN: 1000 PGBOUNCER_DEFAULT_POOL_SIZE: 25 PGBOUNCER_MIN_POOL_SIZE: 10 PGBOUNCER_RESERVE_POOL_SIZE: 5 PGBOUNCER_SERVER_IDLE_TIMEOUT: 600 ports: - "6432:6432" depends_on: - postgres
# Update application to use PgBouncer # Change DATABASE_URL in .env DATABASE_URL=postgresql://voiceassist:password@pgbouncer:6432/voiceassist # Restart application docker compose up -d voiceassist-server # Monitor PgBouncer docker compose exec pgbouncer psql -h localhost -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;" docker compose exec pgbouncer psql -h localhost -p 6432 -U pgbouncer pgbouncer -c "SHOW STATS;"
Redis Scaling
Vertical Scaling - Increase Memory
# Update docker-compose.yml services: redis: deploy: resources: limits: cpus: "2" memory: 4G # Increased from 2G reservations: cpus: "1" memory: 2G command: - redis-server - --maxmemory 3gb # Increased from 1gb - --maxmemory-policy allkeys-lru
# Apply changes docker compose up -d redis # Verify new memory limit docker compose exec redis redis-cli CONFIG GET maxmemory
Redis Cluster Setup (Horizontal Scaling)
# Add to docker-compose.yml services: redis-node-1: image: redis:7-alpine command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendonly yes --port 6379 volumes: - redis_node_1_data:/data redis-node-2: image: redis:7-alpine command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendonly yes --port 6379 volumes: - redis_node_2_data:/data redis-node-3: image: redis:7-alpine command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendonly yes --port 6379 volumes: - redis_node_3_data:/data redis-node-4: image: redis:7-alpine command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendonly yes --port 6379 volumes: - redis_node_4_data:/data redis-node-5: image: redis:7-alpine command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendonly yes --port 6379 volumes: - redis_node_5_data:/data redis-node-6: image: redis:7-alpine command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendonly yes --port 6379 volumes: - redis_node_6_data:/data volumes: redis_node_1_data: redis_node_2_data: redis_node_3_data: redis_node_4_data: redis_node_5_data: redis_node_6_data:
# Start all nodes docker compose up -d redis-node-{1..6} # Create cluster docker compose exec redis-node-1 redis-cli --cluster create \ redis-node-1:6379 \ redis-node-2:6379 \ redis-node-3:6379 \ redis-node-4:6379 \ redis-node-5:6379 \ redis-node-6:6379 \ --cluster-replicas 1 # Verify cluster docker compose exec redis-node-1 redis-cli CLUSTER INFO docker compose exec redis-node-1 redis-cli CLUSTER NODES
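Application code must switch to a cluster-aware client, because keys are sharded across nodes. A minimal sketch with redis-py's `RedisCluster` follows; the bootstrap hostname is an assumption based on the compose service names above.

```python
# Minimal cluster client sketch (assumes redis-py >= 4.1, which ships redis.cluster).
from redis.cluster import RedisCluster

# Any reachable node can bootstrap the client; it discovers the rest of the cluster.
rc = RedisCluster(host="redis-node-1", port=6379, decode_responses=True)

rc.ping()                        # sanity check against the cluster
rc.set("session:123", "active")  # routed to the node owning this key's hash slot
print(rc.get("session:123"))
```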
Redis Sentinel (High Availability)
# Add to docker-compose.yml services: redis-master: image: redis:7-alpine command: redis-server --port 6379 volumes: - redis_master_data:/data redis-slave-1: image: redis:7-alpine command: redis-server --port 6379 --slaveof redis-master 6379 volumes: - redis_slave_1_data:/data depends_on: - redis-master redis-slave-2: image: redis:7-alpine command: redis-server --port 6379 --slaveof redis-master 6379 volumes: - redis_slave_2_data:/data depends_on: - redis-master redis-sentinel-1: image: redis:7-alpine command: redis-sentinel /etc/redis/sentinel.conf volumes: - ./redis-sentinel.conf:/etc/redis/sentinel.conf depends_on: - redis-master redis-sentinel-2: image: redis:7-alpine command: redis-sentinel /etc/redis/sentinel.conf volumes: - ./redis-sentinel.conf:/etc/redis/sentinel.conf depends_on: - redis-master redis-sentinel-3: image: redis:7-alpine command: redis-sentinel /etc/redis/sentinel.conf volumes: - ./redis-sentinel.conf:/etc/redis/sentinel.conf depends_on: - redis-master
# Create redis-sentinel.conf cat > redis-sentinel.conf <<EOF port 26379 sentinel monitor mymaster redis-master 6379 2 sentinel down-after-milliseconds mymaster 5000 sentinel parallel-syncs mymaster 1 sentinel failover-timeout mymaster 10000 EOF # Start Sentinel setup docker compose up -d redis-master redis-slave-1 redis-slave-2 docker compose up -d redis-sentinel-1 redis-sentinel-2 redis-sentinel-3 # Verify Sentinel docker compose exec redis-sentinel-1 redis-cli -p 26379 SENTINEL masters
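With Sentinel in place, the application should discover the current master through the Sentinels instead of hard-coding `redis-master`. A minimal redis-py sketch, assuming the Sentinel containers are reachable by their compose service names:

```python
# Minimal Sentinel discovery sketch (redis-py provides redis.sentinel.Sentinel).
from redis.sentinel import Sentinel

sentinel = Sentinel(
    [("redis-sentinel-1", 26379), ("redis-sentinel-2", 26379), ("redis-sentinel-3", 26379)],
    socket_timeout=0.5,
)

master = sentinel.master_for("mymaster", decode_responses=True)   # writes
replica = sentinel.slave_for("mymaster", decode_responses=True)   # reads

master.set("health", "ok")
print(replica.get("health"))
```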
Qdrant Scaling
Vertical Scaling - Increase Resources
# Update docker-compose.yml services: qdrant: deploy: resources: limits: cpus: "4" # Increased from 2 memory: 8G # Increased from 4G reservations: cpus: "2" memory: 4G
Horizontal Scaling - Distributed Cluster
# Add to docker-compose.yml services: qdrant-node-1: image: qdrant/qdrant:latest ports: - "6333:6333" - "6334:6334" environment: QDRANT__CLUSTER__ENABLED: "true" QDRANT__CLUSTER__P2P__PORT: "6335" volumes: - qdrant_node_1_storage:/qdrant/storage qdrant-node-2: image: qdrant/qdrant:latest ports: - "6343:6333" - "6344:6334" environment: QDRANT__CLUSTER__ENABLED: "true" QDRANT__CLUSTER__P2P__PORT: "6335" QDRANT__CLUSTER__P2P__BOOTSTRAP__URI: "http://qdrant-node-1:6335" volumes: - qdrant_node_2_storage:/qdrant/storage depends_on: - qdrant-node-1 qdrant-node-3: image: qdrant/qdrant:latest ports: - "6353:6333" - "6354:6334" environment: QDRANT__CLUSTER__ENABLED: "true" QDRANT__CLUSTER__P2P__PORT: "6335" QDRANT__CLUSTER__P2P__BOOTSTRAP__URI: "http://qdrant-node-1:6335" volumes: - qdrant_node_3_storage:/qdrant/storage depends_on: - qdrant-node-1 volumes: qdrant_node_1_storage: qdrant_node_2_storage: qdrant_node_3_storage:
# Start cluster docker compose up -d qdrant-node-{1..3} # Verify cluster curl -s http://localhost:6333/cluster | jq '.' # Create sharded collection curl -X PUT http://localhost:6333/collections/voice_embeddings \ -H 'Content-Type: application/json' \ -d '{ "vectors": { "size": 384, "distance": "Cosine" }, "shard_number": 3, "replication_factor": 2 }' # Verify sharding curl -s http://localhost:6333/collections/voice_embeddings/cluster | jq '.'
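Once the sharded collection exists, the application can talk to any node and Qdrant routes the request to the shards that own the data. A minimal search sketch with the official `qdrant-client` package; the query vector below is a placeholder.

```python
# Minimal sketch using the qdrant-client package against the sharded collection.
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# With replication_factor=2, the query still succeeds if one node holding
# a shard copy is unavailable.
hits = client.search(
    collection_name="voice_embeddings",
    query_vector=[0.0] * 384,   # placeholder 384-dim query vector
    limit=5,
)
for hit in hits:
    print(hit.id, hit.score)
```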
Load Testing
Setup Load Testing Tools
# Install Apache Bench (simple HTTP testing) # macOS: brew install httpd # Install Locust (Python load testing) pip install locust # Install k6 (modern load testing) brew install k6
Basic Load Test with Apache Bench
# Test health endpoint ab -n 1000 -c 10 http://localhost:8000/health # Test with authentication ab -n 1000 -c 10 -H "Authorization: Bearer YOUR_TOKEN" \ http://localhost:8000/api/users/me # Results show: # - Requests per second # - Time per request # - Transfer rate # - Distribution of response times
Advanced Load Test with Locust
# Create locustfile.py from locust import HttpUser, task, between class VoiceAssistUser(HttpUser): wait_time = between(1, 3) def on_start(self): # Login and get token response = self.client.post("/api/auth/login", json={ "email": "test@example.com", "password": "password" }) self.token = response.json()["access_token"] @task(3) def view_profile(self): self.client.get("/api/users/me", headers={"Authorization": f"Bearer {self.token}"}) @task(2) def list_conversations(self): self.client.get("/api/conversations", headers={"Authorization": f"Bearer {self.token}"}) @task(1) def create_message(self): self.client.post("/api/conversations/1/messages", headers={"Authorization": f"Bearer {self.token}"}, json={"content": "Test message"})
# Run load test locust -f locustfile.py --host=http://localhost:8000 # Open browser to http://localhost:8089 # Configure: # - Number of users: 100 # - Spawn rate: 10 users/second # - Host: http://localhost:8000 # Command line mode (headless) locust -f locustfile.py --host=http://localhost:8000 \ --users 100 --spawn-rate 10 --run-time 5m --headless
Load Test with k6
// Create loadtest.js import http from "k6/http"; import { check, sleep } from "k6"; export let options = { stages: [ { duration: "2m", target: 50 }, // Ramp up to 50 users { duration: "5m", target: 50 }, // Stay at 50 users { duration: "2m", target: 100 }, // Ramp up to 100 users { duration: "5m", target: 100 }, // Stay at 100 users { duration: "2m", target: 0 }, // Ramp down ], thresholds: { http_req_duration: ["p(95)<500"], // 95% of requests under 500ms http_req_failed: ["rate<0.01"], // Less than 1% errors }, }; export default function () { // Login let loginRes = http.post( "http://localhost:8000/api/auth/login", JSON.stringify({ email: "test@example.com", password: "password", }), { headers: { "Content-Type": "application/json" } }, ); check(loginRes, { "login successful": (r) => r.status === 200, }); let token = loginRes.json("access_token"); // Make authenticated requests let headers = { Authorization: `Bearer ${token}`, }; let profileRes = http.get("http://localhost:8000/api/users/me", { headers }); check(profileRes, { "profile retrieved": (r) => r.status === 200, }); sleep(1); }
# Run k6 load test k6 run loadtest.js # With custom output k6 run --out json=results.json loadtest.js # View results cat results.json | jq '.metrics'
Database Load Testing
# Test PostgreSQL under load # Create pgbench database docker compose exec postgres createdb -U voiceassist pgbench_test # Initialize pgbench docker compose exec postgres pgbench -i -U voiceassist pgbench_test # Run benchmark (100 clients, 1000 transactions each) docker compose exec postgres pgbench \ -c 100 \ -t 1000 \ -U voiceassist \ pgbench_test # Results show: # - TPS (transactions per second) # - Average latency # - Connection time
Redis Load Testing
# Use redis-benchmark docker compose exec redis redis-benchmark \ -h localhost \ -p 6379 \ -c 100 \ -n 100000 \ -d 100 \ --csv # Test specific commands docker compose exec redis redis-benchmark \ -t set,get,incr,lpush,lpop \ -n 100000 \ -q
Capacity Planning
Current Capacity Assessment
#!/bin/bash # Save as: /usr/local/bin/va-capacity-report echo "VoiceAssist Capacity Report - $(date)" echo "========================================" echo "" # Application instances APP_INSTANCES=$(docker compose ps -q voiceassist-server | wc -l) echo "Application Instances: $APP_INSTANCES" # Resource usage per instance docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" \ $(docker compose ps -q voiceassist-server) echo "" # Database metrics echo "Database Metrics:" docker compose exec -T postgres psql -U voiceassist -d voiceassist <<EOF SELECT 'Active Connections' as metric, count(*) as value FROM pg_stat_activity WHERE state = 'active' UNION ALL SELECT 'Database Size', pg_size_pretty(pg_database_size('voiceassist'))::text UNION ALL SELECT 'Largest Table', pg_size_pretty(max(pg_total_relation_size(schemaname||'.'||tablename)))::text FROM pg_tables WHERE schemaname = 'public'; EOF echo "" # Redis metrics echo "Redis Metrics:" docker compose exec -T redis redis-cli INFO stats | grep -E "(total_commands_processed|instantaneous_ops_per_sec|used_memory_human)" echo "" # Qdrant metrics echo "Qdrant Metrics:" curl -s http://localhost:6333/metrics | grep -E "(collections_total|points_total)" echo "" # Estimated capacity echo "Capacity Estimates:" echo " Current RPS: [Calculate from metrics]" echo " Max RPS (current setup): [Estimate based on testing]" echo " Headroom: [Percentage]" echo "" # Scaling recommendations echo "Scaling Recommendations:" echo " - Application: Scale to $(( APP_INSTANCES + 2 )) instances for 50% more capacity" echo " - Database: Consider read replica when connections > 150" echo " - Redis: Current memory usage allows 2x data growth"
Growth Planning
# Estimate required resources for growth # Current metrics (example) CURRENT_USERS=1000 CURRENT_RPS=50 CURRENT_DB_SIZE_GB=10 # Growth projections GROWTH_RATE=1.5 # 50% growth MONTHS=6 # Calculate future requirements cat > /tmp/capacity_projection.py <<EOF import math current_users = ${CURRENT_USERS} current_rps = ${CURRENT_RPS} current_db_gb = ${CURRENT_DB_SIZE_GB} monthly_growth = ${GROWTH_RATE} months = ${MONTHS} future_users = current_users * (monthly_growth ** months) future_rps = current_rps * (monthly_growth ** months) future_db_gb = current_db_gb * (monthly_growth ** months) # Resource estimates # Assuming 1 app instance handles 50 RPS app_instances = math.ceil(future_rps / 50) # Database: 100 connections per 1000 users db_connections = math.ceil((future_users / 1000) * 100) # Redis: 1GB per 10000 users redis_gb = math.ceil(future_users / 10000) print(f"Capacity Projection for {months} months:") print(f"=" * 50) print(f"Current Users: {current_users:,.0f}") print(f"Projected Users: {future_users:,.0f} ({future_users/current_users:.1f}x)") print(f"") print(f"Current RPS: {current_rps}") print(f"Projected RPS: {future_rps:.0f} ({future_rps/current_rps:.1f}x)") print(f"") print(f"Resource Requirements:") print(f" Application Instances: {app_instances}") print(f" Database Connections: {db_connections}") print(f" Database Storage: {future_db_gb:.0f} GB") print(f" Redis Memory: {redis_gb} GB") print(f"") print(f"Recommended Setup:") if app_instances <= 5: print(f" Application: {app_instances} instances with load balancer") else: print(f" Application: {app_instances} instances with auto-scaling") if db_connections > 150: print(f" Database: Primary + 2 read replicas + PgBouncer") else: print(f" Database: Primary + PgBouncer") if redis_gb > 4: print(f" Redis: 3-node cluster") else: print(f" Redis: Single instance ({redis_gb}GB)") EOF python3 /tmp/capacity_projection.py
Performance Optimization
Application Optimization
# Enable response caching cat >> .env <<EOF CACHE_ENABLED=true CACHE_TTL=300 CACHE_MAX_SIZE=1000 EOF # Enable gzip compression in nginx cat > nginx-compression.conf <<EOF gzip on; gzip_vary on; gzip_min_length 1024; gzip_comp_level 6; gzip_types text/plain text/css text/xml text/javascript application/json application/javascript application/xml+rss; EOF # Optimize database queries docker compose exec postgres psql -U voiceassist -d voiceassist <<EOF -- Create missing indexes CREATE INDEX IF NOT EXISTS idx_conversations_user_id ON conversations(user_id); CREATE INDEX IF NOT EXISTS idx_messages_conversation_id ON messages(conversation_id); CREATE INDEX IF NOT EXISTS idx_messages_created_at ON messages(created_at DESC); -- Analyze tables ANALYZE conversations; ANALYZE messages; ANALYZE users; EOF
Database Query Optimization
# Identify slow queries docker compose exec postgres psql -U voiceassist -d voiceassist <<EOF -- Enable pg_stat_statements CREATE EXTENSION IF NOT EXISTS pg_stat_statements; -- Top 10 slowest queries SELECT substring(query, 1, 100) AS short_query, calls, total_time, mean_time, max_time, stddev_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10; EOF # Optimize connection management cat >> .env <<EOF DB_POOL_SIZE=20 DB_MAX_OVERFLOW=10 DB_POOL_TIMEOUT=30 DB_POOL_RECYCLE=1800 EOF
Caching Strategy
# Implement multi-layer caching in application # Example: cache.py import redis import json import hashlib from functools import wraps redis_client = redis.Redis(host='redis', port=6379, decode_responses=True) def cache_result(ttl=300): """Cache function results in Redis""" def decorator(func): @wraps(func) def wrapper(*args, **kwargs): # Generate cache key key_data = f"{func.__name__}:{args}:{kwargs}" cache_key = hashlib.md5(key_data.encode()).hexdigest() # Try to get from cache cached = redis_client.get(cache_key) if cached: return json.loads(cached) # Execute function result = func(*args, **kwargs) # Store in cache redis_client.setex(cache_key, ttl, json.dumps(result)) return result return wrapper return decorator # Usage: @cache_result(ttl=600) def get_user_conversations(user_id): # Expensive database query return db.query(Conversation).filter_by(user_id=user_id).all()
Monitoring During Scaling
Real-time Metrics
#!/bin/bash # Save as: /usr/local/bin/va-scaling-monitor watch -n 5 ' echo "=== Application Instances ===" docker compose ps voiceassist-server | grep Up | wc -l echo "" echo "=== Resource Usage ===" docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemPerc}}" | grep voiceassist echo "" echo "=== Request Rate (approx) ===" docker compose logs --since 1m voiceassist-server | grep "200 OK" | wc -l echo "requests/min" echo "" echo "=== Error Rate ===" docker compose logs --since 1m voiceassist-server | grep -i error | wc -l echo "errors/min" echo "" echo "=== Database Connections ===" docker compose exec -T postgres psql -U voiceassist -d voiceassist -t -c \ "SELECT count(*), state FROM pg_stat_activity GROUP BY state;" '
Scaling Checklist
Pre-Scaling
- Review current metrics and capacity
- Identify bottlenecks
- Test scaling in staging environment
- Update monitoring thresholds
- Prepare rollback plan
- Notify team of scaling activity
During Scaling
- Monitor all metrics closely
- Watch for errors or anomalies
- Verify new instances are healthy
- Check load distribution
- Test critical functionality
Post-Scaling
- Verify performance improvement (see the verification sketch after this checklist)
- Update documentation
- Review metrics for 24 hours
- Adjust monitoring alerts
- Document lessons learned
- Update capacity planning estimates
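Much of the post-scaling verification can be scripted. Below is a minimal sketch that hits the health endpoint through the load balancer and reports error counts and a rough p95 latency; the URL, sample count, and thresholds are assumptions to adapt to your environment.

```python
# Minimal post-scaling smoke check (assumes the stack is reachable at this URL).
import statistics
import time

import requests

URL = "http://localhost/health"   # load balancer endpoint; adjust as needed
SAMPLES = 50

latencies, errors = [], 0
for _ in range(SAMPLES):
    start = time.monotonic()
    try:
        resp = requests.get(URL, timeout=5)
        if resp.status_code != 200:
            errors += 1
    except requests.RequestException:
        errors += 1
    latencies.append(time.monotonic() - start)

p95 = sorted(latencies)[int(0.95 * len(latencies)) - 1]
print(f"errors: {errors}/{SAMPLES}")
print(f"mean latency: {statistics.mean(latencies) * 1000:.0f} ms")
print(f"p95 latency: {p95 * 1000:.0f} ms")
```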
Related Documentation
- Deployment Runbook
- Monitoring Runbook
- Troubleshooting Runbook
- CONNECTION_POOL_OPTIMIZATION.md
- UNIFIED_ARCHITECTURE.md
Document Version: 1.0 Last Updated: 2025-11-21 Maintained By: VoiceAssist DevOps Team Review Cycle: Quarterly or after significant scaling events Next Review: 2026-02-21
Monitoring Runbook
Last Updated: 2025-11-27 Purpose: Comprehensive guide for monitoring and observability in VoiceAssist V2
Monitoring Architecture
Application Metrics
↓
Prometheus (Metrics Collection)
↓
Grafana (Visualization)
↓
AlertManager (Alerting)
↓
PagerDuty/Slack/Email
Key Monitoring Components
| Component | Purpose | Port | Dashboard |
|---|---|---|---|
| Prometheus | Metrics collection & storage | 9090 | http://localhost:9090 |
| Grafana | Metrics visualization | 3000 | http://localhost:3000 |
| AlertManager | Alert routing & management | 9093 | http://localhost:9093 |
| Application Metrics | Custom app metrics | 8000/metrics | http://localhost:8000/metrics |
Setup Monitoring Stack
Docker Compose Configuration
# Add to docker-compose.yml services: prometheus: image: prom/prometheus:latest ports: - "9090:9090" volumes: - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml - ./monitoring/alerts.yml:/etc/prometheus/alerts.yml - prometheus_data:/prometheus command: - "--config.file=/etc/prometheus/prometheus.yml" - "--storage.tsdb.path=/prometheus" - "--storage.tsdb.retention.time=30d" - "--web.console.libraries=/etc/prometheus/console_libraries" - "--web.console.templates=/etc/prometheus/consoles" grafana: image: grafana/grafana:latest ports: - "3000:3000" volumes: - grafana_data:/var/lib/grafana - ./monitoring/grafana/provisioning:/etc/grafana/provisioning - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards environment: - GF_SECURITY_ADMIN_USER=admin - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin} - GF_USERS_ALLOW_SIGN_UP=false depends_on: - prometheus alertmanager: image: prom/alertmanager:latest ports: - "9093:9093" volumes: - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml - alertmanager_data:/alertmanager command: - "--config.file=/etc/alertmanager/alertmanager.yml" - "--storage.path=/alertmanager" node-exporter: image: prom/node-exporter:latest ports: - "9100:9100" command: - "--path.procfs=/host/proc" - "--path.sysfs=/host/sys" - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)" volumes: - /proc:/host/proc:ro - /sys:/host/sys:ro - /:/rootfs:ro postgres-exporter: image: prometheuscommunity/postgres-exporter:latest ports: - "9187:9187" environment: DATA_SOURCE_NAME: "postgresql://voiceassist:${POSTGRES_PASSWORD}@postgres:5432/voiceassist?sslmode=disable" depends_on: - postgres redis-exporter: image: oliver006/redis_exporter:latest ports: - "9121:9121" environment: REDIS_ADDR: "redis:6379" depends_on: - redis volumes: prometheus_data: grafana_data: alertmanager_data:
Prometheus Configuration
# Create monitoring/prometheus.yml global: scrape_interval: 15s evaluation_interval: 15s external_labels: cluster: "voiceassist-prod" environment: "production" # Load alerting rules rule_files: - "/etc/prometheus/alerts.yml" # Alertmanager configuration alerting: alertmanagers: - static_configs: - targets: ["alertmanager:9093"] # Scrape configurations scrape_configs: # VoiceAssist Application - job_name: "voiceassist-app" static_configs: - targets: ["voiceassist-server:8000"] metrics_path: "/metrics" scrape_interval: 10s # PostgreSQL - job_name: "postgresql" static_configs: - targets: ["postgres-exporter:9187"] # Redis - job_name: "redis" static_configs: - targets: ["redis-exporter:9121"] # Node metrics - job_name: "node" static_configs: - targets: ["node-exporter:9100"] # Prometheus itself - job_name: "prometheus" static_configs: - targets: ["localhost:9090"] # Grafana - job_name: "grafana" static_configs: - targets: ["grafana:3000"]
Alert Rules
# Create monitoring/alerts.yml groups: - name: voiceassist_alerts interval: 30s rules: # Application availability - alert: ApplicationDown expr: up{job="voiceassist-app"} == 0 for: 1m labels: severity: critical component: application annotations: summary: "VoiceAssist application is down" description: "Application {{ $labels.instance }} is not responding" # High error rate - alert: HighErrorRate expr: | rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05 for: 5m labels: severity: warning component: application annotations: summary: "High error rate detected" description: "Error rate is {{ $value | humanizePercentage }} over last 5 minutes" # Slow response times - alert: SlowResponseTime expr: | histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]) ) > 2 for: 5m labels: severity: warning component: application annotations: summary: "Slow API response times" description: "95th percentile response time is {{ $value }}s" # High CPU usage - alert: HighCPUUsage expr: | 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80 for: 10m labels: severity: warning component: infrastructure annotations: summary: "High CPU usage" description: "CPU usage is {{ $value }}% on {{ $labels.instance }}" # High memory usage - alert: HighMemoryUsage expr: | (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85 for: 10m labels: severity: warning component: infrastructure annotations: summary: "High memory usage" description: "Memory usage is {{ $value }}% on {{ $labels.instance }}" # Database connection pool exhaustion - alert: DatabaseConnectionPoolExhausted expr: | pg_stat_database_numbackends / pg_settings_max_connections > 0.8 for: 5m labels: severity: warning component: database annotations: summary: "Database connection pool nearly exhausted" description: "Database connections at {{ $value | humanizePercentage }} of maximum" # Database down - alert: DatabaseDown expr: up{job="postgresql"} == 0 for: 1m labels: severity: critical component: database annotations: summary: "PostgreSQL database is down" description: "Database {{ $labels.instance }} is not responding" # Redis down - alert: RedisDown expr: up{job="redis"} == 0 for: 1m labels: severity: critical component: cache annotations: summary: "Redis is down" description: "Redis {{ $labels.instance }} is not responding" # High Redis memory usage - alert: HighRedisMemory expr: | redis_memory_used_bytes / redis_memory_max_bytes > 0.9 for: 5m labels: severity: warning component: cache annotations: summary: "Redis memory usage high" description: "Redis memory usage at {{ $value | humanizePercentage }}" # Disk space low - alert: DiskSpaceLow expr: | (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20 for: 10m labels: severity: warning component: infrastructure annotations: summary: "Low disk space" description: "Only {{ $value }}% disk space remaining on {{ $labels.instance }}" # Certificate expiration - alert: SSLCertificateExpiring expr: | (ssl_certificate_expiry_seconds - time()) / 86400 < 30 for: 1h labels: severity: warning component: infrastructure annotations: summary: "SSL certificate expiring soon" description: "SSL certificate expires in {{ $value }} days"
AlertManager Configuration
# Create monitoring/alertmanager.yml global: resolve_timeout: 5m slack_api_url: "${SLACK_WEBHOOK_URL}" # Default route route: receiver: "default" group_by: ["alertname", "cluster", "service"] group_wait: 10s group_interval: 10s repeat_interval: 12h routes: # Critical alerts -> PagerDuty + Slack - match: severity: critical receiver: "pagerduty-critical" continue: true - match: severity: critical receiver: "slack-critical" # Warning alerts -> Slack only - match: severity: warning receiver: "slack-warnings" # Receivers receivers: - name: "default" slack_configs: - channel: "#voiceassist-alerts" title: "VoiceAssist Alert" text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}\n{{ end }}' - name: "pagerduty-critical" pagerduty_configs: - service_key: "${PAGERDUTY_SERVICE_KEY}" description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}" - name: "slack-critical" slack_configs: - channel: "#voiceassist-critical" username: "AlertManager" color: "danger" title: "🔴 CRITICAL: {{ .GroupLabels.alertname }}" text: | *Summary:* {{ .CommonAnnotations.summary }} *Description:* {{ .CommonAnnotations.description }} *Severity:* {{ .GroupLabels.severity }} *Component:* {{ .GroupLabels.component }} - name: "slack-warnings" slack_configs: - channel: "#voiceassist-alerts" username: "AlertManager" color: "warning" title: "⚠️ WARNING: {{ .GroupLabels.alertname }}" text: | *Summary:* {{ .CommonAnnotations.summary }} *Description:* {{ .CommonAnnotations.description }} *Severity:* {{ .GroupLabels.severity }} *Component:* {{ .GroupLabels.component }} - name: "email-ops" email_configs: - to: "ops-team@voiceassist.local" from: "alertmanager@voiceassist.local" smarthost: "smtp.gmail.com:587" auth_username: "${SMTP_USERNAME}" auth_password: "${SMTP_PASSWORD}" headers: Subject: "[VoiceAssist] {{ .GroupLabels.alertname }}"
Deploy Monitoring Stack
# Create monitoring directory (relative to the project root) mkdir -p monitoring/grafana/{provisioning,dashboards} # Start monitoring stack docker compose up -d prometheus grafana alertmanager node-exporter postgres-exporter redis-exporter # Verify services docker compose ps | grep -E "(prometheus|grafana|alertmanager)" # Check Prometheus targets curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}' # Access Grafana echo "Grafana: http://localhost:3000 (admin/admin)" echo "Prometheus: http://localhost:9090" echo "AlertManager: http://localhost:9093"
Grafana Dashboards
Provision Datasource
# Create monitoring/grafana/provisioning/datasources/prometheus.yml apiVersion: 1 datasources: - name: Prometheus type: prometheus access: proxy url: http://prometheus:9090 isDefault: true editable: false
Provision Dashboards
# Create monitoring/grafana/provisioning/dashboards/dashboards.yml apiVersion: 1 providers: - name: "VoiceAssist" orgId: 1 folder: "VoiceAssist V2" type: file disableDeletion: false updateIntervalSeconds: 30 allowUiUpdates: true options: path: /var/lib/grafana/dashboards
Application Overview Dashboard
// Create monitoring/grafana/dashboards/application-overview.json { "dashboard": { "title": "VoiceAssist - Application Overview", "tags": ["voiceassist", "application"], "timezone": "browser", "panels": [ { "title": "Request Rate", "type": "graph", "targets": [ { "expr": "rate(http_requests_total{job=\"voiceassist-app\"}[5m])", "legendFormat": "{{method}} {{endpoint}}" } ] }, { "title": "Response Time (p95)", "type": "graph", "targets": [ { "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))", "legendFormat": "p95" } ] }, { "title": "Error Rate", "type": "graph", "targets": [ { "expr": "rate(http_requests_total{status=~\"5..\"}[5m])", "legendFormat": "5xx errors" } ] }, { "title": "Active Instances", "type": "stat", "targets": [ { "expr": "count(up{job=\"voiceassist-app\"} == 1)" } ] } ] } }
Database Dashboard
// Create monitoring/grafana/dashboards/database.json { "dashboard": { "title": "VoiceAssist - Database", "tags": ["voiceassist", "database", "postgresql"], "panels": [ { "title": "Database Connections", "type": "graph", "targets": [ { "expr": "pg_stat_database_numbackends", "legendFormat": "Active connections" } ] }, { "title": "Query Duration", "type": "graph", "targets": [ { "expr": "rate(pg_stat_database_tup_fetched[5m])", "legendFormat": "Rows fetched/sec" } ] }, { "title": "Database Size", "type": "graph", "targets": [ { "expr": "pg_database_size_bytes", "legendFormat": "Database size" } ] }, { "title": "Cache Hit Ratio", "type": "gauge", "targets": [ { "expr": "rate(pg_stat_database_blks_hit[5m]) / (rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))" } ] } ] } }
Import Pre-built Dashboards
# Import Node Exporter dashboard curl -X POST http://localhost:3000/api/dashboards/import \ -H "Content-Type: application/json" \ -u admin:admin \ -d '{ "dashboard": { "id": null, "uid": null, "title": "Node Exporter Full", "gnetId": 1860 }, "overwrite": false, "inputs": [ { "name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus", "value": "Prometheus" } ] }' # Import PostgreSQL dashboard curl -X POST http://localhost:3000/api/dashboards/import \ -H "Content-Type: application/json" \ -u admin:admin \ -d '{ "dashboard": { "id": null, "uid": null, "title": "PostgreSQL Database", "gnetId": 9628 }, "overwrite": false, "inputs": [ { "name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus", "value": "Prometheus" } ] }' # Import Redis dashboard curl -X POST http://localhost:3000/api/dashboards/import \ -H "Content-Type: application/json" \ -u admin:admin \ -d '{ "dashboard": { "id": null, "uid": null, "title": "Redis Dashboard", "gnetId": 11835 }, "overwrite": false, "inputs": [ { "name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus", "value": "Prometheus" } ] }'
Application Metrics
Instrument Application Code
# Add to application code (e.g., app/monitoring.py) from prometheus_client import Counter, Histogram, Gauge, generate_latest from fastapi import FastAPI, Response import time app = FastAPI() # Metrics REQUEST_COUNT = Counter( 'http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'] ) REQUEST_DURATION = Histogram( 'http_request_duration_seconds', 'HTTP request duration in seconds', ['method', 'endpoint'] ) ACTIVE_REQUESTS = Gauge( 'http_requests_active', 'Number of active HTTP requests', ['method', 'endpoint'] ) DB_CONNECTION_POOL = Gauge( 'db_connection_pool_size', 'Database connection pool size', ['state'] # active, idle ) CACHE_OPERATIONS = Counter( 'cache_operations_total', 'Total cache operations', ['operation', 'status'] # get/set, hit/miss ) # Middleware to track metrics @app.middleware("http") async def track_metrics(request, call_next): method = request.method endpoint = request.url.path ACTIVE_REQUESTS.labels(method=method, endpoint=endpoint).inc() start_time = time.time() try: response = await call_next(request) status = response.status_code except Exception as e: status = 500 raise finally: duration = time.time() - start_time REQUEST_COUNT.labels( method=method, endpoint=endpoint, status=status ).inc() REQUEST_DURATION.labels( method=method, endpoint=endpoint ).observe(duration) ACTIVE_REQUESTS.labels(method=method, endpoint=endpoint).dec() return response # Metrics endpoint @app.get("/metrics") async def metrics(): return Response( content=generate_latest(), media_type="text/plain" ) # Custom metric tracking def track_cache_operation(operation: str, hit: bool): """Track cache hit/miss""" status = "hit" if hit else "miss" CACHE_OPERATIONS.labels(operation=operation, status=status).inc() def update_connection_pool_metrics(active: int, idle: int): """Update database connection pool metrics""" DB_CONNECTION_POOL.labels(state="active").set(active) DB_CONNECTION_POOL.labels(state="idle").set(idle)
Custom Business Metrics
# Track business-specific metrics import time from prometheus_client import Counter, Gauge, Histogram # User metrics USER_REGISTRATIONS = Counter( 'user_registrations_total', 'Total user registrations' ) ACTIVE_USERS = Gauge( 'active_users', 'Number of currently active users' ) # Conversation metrics CONVERSATIONS_CREATED = Counter( 'conversations_created_total', 'Total conversations created' ) MESSAGES_SENT = Counter( 'messages_sent_total', 'Total messages sent', ['conversation_type'] ) # Voice processing metrics VOICE_PROCESSING_DURATION = Histogram( 'voice_processing_duration_seconds', 'Voice processing duration in seconds' ) VOICE_PROCESSING_ERRORS = Counter( 'voice_processing_errors_total', 'Total voice processing errors', ['error_type'] ) # Usage in application def create_conversation(user_id: int): CONVERSATIONS_CREATED.inc() # ... rest of the logic def send_message(conversation_id: int, message: str): MESSAGES_SENT.labels(conversation_type="text").inc() # ... rest of the logic def process_voice(audio_data: bytes): start_time = time.time() try: result = process_audio(audio_data) VOICE_PROCESSING_DURATION.observe(time.time() - start_time) return result except Exception as e: VOICE_PROCESSING_ERRORS.labels(error_type=type(e).__name__).inc() raise
Log Aggregation
Structured Logging
# Configure structured logging import logging import json from datetime import datetime class JSONFormatter(logging.Formatter): def format(self, record): log_data = { 'timestamp': datetime.utcnow().isoformat(), 'level': record.levelname, 'logger': record.name, 'message': record.getMessage(), 'module': record.module, 'function': record.funcName, 'line': record.lineno } if record.exc_info: log_data['exception'] = self.formatException(record.exc_info) if hasattr(record, 'user_id'): log_data['user_id'] = record.user_id if hasattr(record, 'request_id'): log_data['request_id'] = record.request_id return json.dumps(log_data) # Configure logger handler = logging.StreamHandler() handler.setFormatter(JSONFormatter()) logger = logging.getLogger('voiceassist') logger.addHandler(handler) logger.setLevel(logging.INFO) # Usage logger.info("User logged in", extra={'user_id': 123}) logger.error("Database connection failed", exc_info=True)
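The formatter above only emits `request_id` when something attaches it to the log record. One common pattern is a contextvar set by request middleware plus a logging filter; a minimal sketch follows (the filter and middleware names are illustrative assumptions).

```python
# Sketch: propagate a per-request ID into every log record via a contextvar.
import contextvars
import logging
import uuid

request_id_var = contextvars.ContextVar("request_id", default=None)

class RequestIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        rid = request_id_var.get()
        if rid is not None:
            record.request_id = rid   # picked up by JSONFormatter above
        return True

logger = logging.getLogger("voiceassist")
logger.addFilter(RequestIdFilter())

# In a FastAPI middleware (illustrative):
# @app.middleware("http")
# async def add_request_id(request, call_next):
#     request_id_var.set(request.headers.get("X-Request-ID", str(uuid.uuid4())))
#     return await call_next(request)
```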
Centralized Logging with Loki
# Add to docker-compose.yml services: loki: image: grafana/loki:latest ports: - "3100:3100" volumes: - ./monitoring/loki-config.yml:/etc/loki/local-config.yaml - loki_data:/loki command: -config.file=/etc/loki/local-config.yaml promtail: image: grafana/promtail:latest volumes: - ./monitoring/promtail-config.yml:/etc/promtail/config.yml - /var/lib/docker/containers:/var/lib/docker/containers:ro - /var/run/docker.sock:/var/run/docker.sock command: -config.file=/etc/promtail/config.yml depends_on: - loki volumes: loki_data:
# Create monitoring/loki-config.yml auth_enabled: false server: http_listen_port: 3100 ingester: lifecycler: address: 127.0.0.1 ring: kvstore: store: inmemory replication_factor: 1 chunk_idle_period: 5m chunk_retain_period: 30s schema_config: configs: - from: 2020-10-24 store: boltdb object_store: filesystem schema: v11 index: prefix: index_ period: 168h storage_config: boltdb: directory: /loki/index filesystem: directory: /loki/chunks limits_config: enforce_metric_name: false reject_old_samples: true reject_old_samples_max_age: 168h chunk_store_config: max_look_back_period: 0s table_manager: retention_deletes_enabled: false retention_period: 0s
# Create monitoring/promtail-config.yml server: http_listen_port: 9080 grpc_listen_port: 0 positions: filename: /tmp/positions.yaml clients: - url: http://loki:3100/loki/api/v1/push scrape_configs: - job_name: docker docker_sd_configs: - host: unix:///var/run/docker.sock refresh_interval: 5s relabel_configs: - source_labels: ["__meta_docker_container_name"] regex: "/(.*)" target_label: "container" - source_labels: ["__meta_docker_container_log_stream"] target_label: "stream"
# Add Loki datasource to Grafana curl -X POST http://localhost:3000/api/datasources \ -H "Content-Type: application/json" \ -u admin:admin \ -d '{ "name": "Loki", "type": "loki", "url": "http://loki:3100", "access": "proxy", "isDefault": false }'
Health Checks
Application Health Endpoints
# Comprehensive health check endpoints from datetime import datetime from fastapi import APIRouter from typing import Dict router = APIRouter() @router.get("/health") async def health_check() -> Dict: """Basic health check - always returns 200 if app is running""" return { "status": "healthy", "timestamp": datetime.utcnow().isoformat(), "version": "2.0.0" } @router.get("/ready") async def readiness_check() -> Dict: """Readiness check - verifies all dependencies""" checks = { "database": await check_database(), "redis": await check_redis(), "qdrant": await check_qdrant() } all_healthy = all(checks.values()) return { "status": "ready" if all_healthy else "not_ready", "timestamp": datetime.utcnow().isoformat(), "checks": checks } async def check_database() -> bool: """Check database connectivity""" try: await db.execute("SELECT 1") return True except Exception: return False async def check_redis() -> bool: """Check Redis connectivity""" try: redis_client.ping() return True except Exception: return False async def check_qdrant() -> bool: """Check Qdrant connectivity""" try: response = await http_client.get("http://qdrant:6333/healthz") return response.status_code == 200 except Exception: return False @router.get("/live") async def liveness_check() -> Dict: """Liveness check - for Kubernetes/Docker""" return {"status": "alive"}
Docker Health Checks
# Update docker-compose.yml with health checks services: voiceassist-server: # ... existing config ... healthcheck: test: ["CMD", "curl", "-f", "http://localhost:8000/health"] interval: 30s timeout: 10s retries: 3 start_period: 40s postgres: # ... existing config ... healthcheck: test: ["CMD-SHELL", "pg_isready -U voiceassist"] interval: 10s timeout: 5s retries: 5 redis: # ... existing config ... healthcheck: test: ["CMD", "redis-cli", "ping"] interval: 10s timeout: 3s retries: 3 qdrant: # ... existing config ... healthcheck: test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"] interval: 30s timeout: 10s retries: 3
Monitoring Operations
Daily Monitoring Routine
#!/bin/bash # Save as: /usr/local/bin/va-monitoring-daily echo "VoiceAssist Daily Monitoring Report - $(date)" echo "==============================================" echo "" # 1. Check all services are up echo "1. Service Health:" docker compose ps | grep -E "(Up|healthy)" | wc -l docker compose ps echo "" # 2. Check Prometheus targets echo "2. Prometheus Targets:" curl -s http://localhost:9090/api/v1/targets | \ jq '.data.activeTargets[] | {job: .labels.job, health: .health}' echo "" # 3. Check for active alerts echo "3. Active Alerts:" curl -s http://localhost:9093/api/v1/alerts | \ jq '.data[] | select(.status.state=="active") | {name: .labels.alertname, severity: .labels.severity}' echo "" # 4. Resource usage summary echo "4. Resource Usage:" docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemPerc}}" | head -10 echo "" # 5. Error rate (last 24 hours) echo "5. Error Rate (24h):" docker compose logs --since 24h voiceassist-server | grep -i error | wc -l echo "" # 6. Database health echo "6. Database Health:" docker compose exec -T postgres psql -U voiceassist -d voiceassist <<EOF SELECT 'Connections' as metric, count(*)::text as value FROM pg_stat_activity UNION ALL SELECT 'Database Size', pg_size_pretty(pg_database_size('voiceassist')) UNION ALL SELECT 'Cache Hit Ratio', round((sum(blks_hit) * 100.0 / NULLIF(sum(blks_hit) + sum(blks_read), 0))::numeric, 2)::text || '%' FROM pg_stat_database; EOF echo "" # 7. Backup status echo "7. Last Backup:" ls -lh /backups/postgres/daily/*.dump.gz 2>/dev/null | tail -1 echo "" echo "==============================================" echo "Report completed"
Troubleshooting Monitoring Issues
Prometheus Not Scraping Targets
# Check Prometheus logs docker compose logs prometheus | tail -50 # Check target configuration curl -s http://localhost:9090/api/v1/targets | jq '.' # Verify network connectivity docker compose exec prometheus wget -O- http://voiceassist-server:8000/metrics # Reload Prometheus configuration curl -X POST http://localhost:9090/-/reload
Grafana Dashboards Not Loading
# Check Grafana logs docker compose logs grafana | tail -50 # Verify datasource connection curl -s http://localhost:3000/api/datasources \ -u admin:admin | jq '.' # Test Prometheus connection from Grafana curl -s http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up \ -u admin:admin | jq '.' # Restart Grafana docker compose restart grafana
Alerts Not Firing
# Check AlertManager status curl -s http://localhost:9093/api/v1/status | jq '.' # Check alert rules in Prometheus curl -s http://localhost:9090/api/v1/rules | jq '.' # Check specific alert state curl -s 'http://localhost:9090/api/v1/query?query=ALERTS{alertname="HighErrorRate"}' | jq '.' # Verify AlertManager configuration docker compose exec alertmanager amtool config show # Check AlertManager logs docker compose logs alertmanager | tail -50
Monitoring Best Practices
1. Define SLOs (Service Level Objectives)
# Document SLOs SLOs: - name: Availability target: 99.9% measurement: uptime over 30 days - name: Response Time target: p95 < 500ms measurement: 95th percentile of all API requests - name: Error Rate target: < 0.1% measurement: 5xx errors / total requests - name: Data Durability target: 99.999% measurement: no data loss events
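These targets become actionable once they are computed from the metrics Prometheus already scrapes. Below is a minimal sketch that queries the Prometheus HTTP API for 30-day availability and the consumed error budget; the query expressions assume the `up` and `http_requests_total` series defined earlier in this runbook.

```python
# Sketch: compute SLO attainment and error budget from the Prometheus HTTP API.
import requests

PROM = "http://localhost:9090/api/v1/query"

def instant_query(expr: str) -> float:
    result = requests.get(PROM, params={"query": expr}, timeout=10).json()
    return float(result["data"]["result"][0]["value"][1])

# Availability: fraction of time the app target was up over 30 days.
availability = instant_query('avg_over_time(up{job="voiceassist-app"}[30d])')

# Success rate: non-5xx requests over total requests, 30-day window.
success_rate = instant_query(
    'sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))'
)

slo_target = 0.999
budget_used = (1 - availability) / (1 - slo_target)
print(f"availability: {availability:.4%}, success rate: {success_rate:.4%}")
print(f"error budget consumed: {budget_used:.0%}")
```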
2. Alert Fatigue Prevention
# Guidelines for creating alerts: # - Alert on symptoms, not causes # - Make alerts actionable # - Include runbook links # - Set appropriate thresholds # - Use proper severity levels # - Group related alerts # Good alert example: - alert: UserFacingErrorRate expr: rate(http_requests_total{status="500"}[5m]) > 0.05 for: 5m annotations: summary: "High user-facing error rate" description: "More than 5% of requests failing" runbook_url: "https://docs.voiceassist.local/runbooks/troubleshooting#high-error-rate" # Bad alert example (too noisy): - alert: SingleError expr: increase(http_requests_total{status="500"}[1m]) > 0 for: 0s
3. Dashboard Organization
Dashboards Structure:
├── Executive Dashboard (high-level KPIs)
├── Application Overview (request rate, errors, latency)
├── Infrastructure (CPU, memory, disk, network)
├── Database Performance (connections, queries, cache hit ratio)
├── Cache Performance (Redis operations, memory, hit rate)
├── Business Metrics (users, conversations, messages)
└── On-Call Dashboard (active alerts, recent incidents)
Related Documentation
- Incident Response Runbook
- Troubleshooting Runbook
- Deployment Runbook
- Scaling Runbook
- UNIFIED_ARCHITECTURE.md
Document Version: 1.0 Last Updated: 2025-11-21 Maintained By: VoiceAssist DevOps Team Review Cycle: Quarterly Next Review: 2026-02-21
Troubleshooting Runbook
Last Updated: 2025-11-27 Purpose: Comprehensive troubleshooting guide for VoiceAssist V2 common issues
Quick Diagnostic Commands
# Save as: /usr/local/bin/va-diagnose #!/bin/bash echo "VoiceAssist Quick Diagnostics - $(date)" echo "=========================================" # System health echo -e "\n[1] Service Status:" docker compose ps echo -e "\n[2] Health Checks:" curl -s http://localhost:8000/health | jq '.' || echo "❌ Application not responding" echo -e "\n[3] Recent Errors (last 5 min):" docker compose logs --since 5m voiceassist-server 2>&1 | grep -i error | tail -10 echo -e "\n[4] Resource Usage:" docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" echo -e "\n[5] Database Connections:" docker compose exec -T postgres psql -U voiceassist -d voiceassist -t -c \ "SELECT count(*), state FROM pg_stat_activity GROUP BY state;" 2>/dev/null echo -e "\n[6] Redis Status:" docker compose exec -T redis redis-cli INFO server | grep -E "(redis_version|uptime_in_seconds)" 2>/dev/null echo -e "\n[7] Disk Space:" df -h | grep -E "(Filesystem|/$)" echo -e "\n========================================="
Issues by Symptom
1. Application Won't Start
Symptom
- Container exits immediately
- Health check fails
- "Connection refused" errors
Investigation
# Check container logs docker compose logs --tail=100 voiceassist-server # Check exit code docker compose ps -a voiceassist-server # Exit code 0 = normal, 1 = error, 137 = OOM killed, 139 = segfault # Check if port is already in use lsof -i :8000 # Verify environment variables docker compose config | grep -A 20 voiceassist-server # Check for missing dependencies docker compose exec voiceassist-server python -c "import sys; print(sys.path)"
Common Causes & Solutions
Cause: Missing environment variables
# Check required variables cat .env | grep -E "(DATABASE_URL|REDIS_URL|SECRET_KEY)" # Copy from example cp .env.example .env # Edit with correct values vim .env # Restart docker compose up -d voiceassist-server
Cause: Database not ready
# Check PostgreSQL status docker compose exec postgres pg_isready # Wait for database sleep 10 # Try starting again docker compose up -d voiceassist-server # Or add depends_on with health check in docker-compose.yml
Cause: Port conflict
# Find process using port lsof -i :8000 # Kill conflicting process kill -9 <PID> # Or change application port in docker-compose.yml ports: - "8001:8000" # Changed from 8000:8000
Cause: Corrupted Python cache
# Remove Python cache docker compose exec voiceassist-server find . -type d -name __pycache__ -exec rm -r {} + docker compose exec voiceassist-server find . -type f -name "*.pyc" -delete # Rebuild image docker compose build --no-cache voiceassist-server docker compose up -d voiceassist-server
2. Database Connection Issues
Symptom
- "Connection pool exhausted"
- "Too many connections"
- "Could not connect to database"
- Slow database queries
Investigation
# Check database is running docker compose ps postgres docker compose exec postgres pg_isready # Check active connections docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT count(*), state, wait_event_type FROM pg_stat_activity WHERE datname = 'voiceassist' GROUP BY state, wait_event_type;" # Check connection limit docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SHOW max_connections;" # Check for connection leaks docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT pid, usename, application_name, state, state_change, query FROM pg_stat_activity WHERE datname = 'voiceassist' ORDER BY state_change DESC LIMIT 20;" # Check for locks docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT pg_stat_activity.pid, pg_stat_activity.query, pg_locks.granted FROM pg_stat_activity JOIN pg_locks ON pg_stat_activity.pid = pg_locks.pid WHERE NOT pg_locks.granted LIMIT 10;"
Solutions
Solution 1: Increase connection pool size
# Update .env cat >> .env <<EOF DB_POOL_SIZE=30 DB_MAX_OVERFLOW=10 DB_POOL_TIMEOUT=30 DB_POOL_RECYCLE=1800 EOF # Restart application docker compose restart voiceassist-server # Verify new pool size docker compose logs voiceassist-server | grep -i "pool size"
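These variables only help if the engine factory reads them. A minimal sketch of how they would typically map onto SQLAlchemy pool options; how the application actually consumes them is an assumption.

```python
# Sketch: map the pool-related environment variables onto SQLAlchemy engine options.
import os
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    os.environ["DATABASE_URL"],
    pool_size=int(os.getenv("DB_POOL_SIZE", "30")),          # steady-state connections
    max_overflow=int(os.getenv("DB_MAX_OVERFLOW", "10")),     # extra connections under burst
    pool_timeout=int(os.getenv("DB_POOL_TIMEOUT", "30")),     # seconds to wait for a connection
    pool_recycle=int(os.getenv("DB_POOL_RECYCLE", "1800")),   # recycle before idle disconnects
    pool_pre_ping=True,                                       # drop dead connections transparently
)
```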
Solution 2: Kill idle connections
# Terminate idle connections older than 10 minutes docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'voiceassist' AND state = 'idle' AND state_change < current_timestamp - INTERVAL '10 minutes';" # Verify connections reduced docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SELECT count(*) FROM pg_stat_activity WHERE datname = 'voiceassist';"
Solution 3: Increase max_connections in PostgreSQL
# Update docker-compose.yml services: postgres: command: - "postgres" - "-c" - "max_connections=200" # Increased from 100
# Restart PostgreSQL docker compose restart postgres # Verify docker compose exec postgres psql -U voiceassist -d voiceassist -c \ "SHOW max_connections;"
Solution 4: Add PgBouncer for connection pooling
# Add to docker-compose.yml services: pgbouncer: image: pgbouncer/pgbouncer:latest environment: DATABASES_HOST: postgres DATABASES_PORT: 5432 DATABASES_USER: voiceassist DATABASES_PASSWORD: ${POSTGRES_PASSWORD} DATABASES_DBNAME: voiceassist PGBOUNCER_POOL_MODE: transaction PGBOUNCER_MAX_CLIENT_CONN: 1000 PGBOUNCER_DEFAULT_POOL_SIZE: 25 ports: - "6432:6432"
# Update DATABASE_URL in .env DATABASE_URL=postgresql://voiceassist:password@pgbouncer:6432/voiceassist # Restart docker compose up -d
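One caveat with `PGBOUNCER_POOL_MODE: transaction`: server-side prepared statements do not survive across pooled transactions, so drivers that cache them need that cache disabled. A hedged sketch, assuming the application connects through SQLAlchemy with the asyncpg driver:

```python
# Sketch: disable asyncpg's prepared-statement cache when connecting through
# PgBouncer in transaction pooling mode.
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://voiceassist:password@pgbouncer:6432/voiceassist",
    connect_args={"statement_cache_size": 0},  # asyncpg option; needed for transaction pooling
)
```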
Solution 5: Fix connection leaks in code
# Ensure proper connection cleanup from contextlib import asynccontextmanager @asynccontextmanager async def get_db(): db = SessionLocal() try: yield db finally: await db.close() # Use context manager async with get_db() as db: result = await db.execute(query) # Connection automatically closed
3. High Response Times / Performance Issues
Symptom
- API requests taking > 2 seconds
- Timeout errors
- Slow page loads
Investigation
# Check current response times curl -o /dev/null -s -w "Time: %{time_total}s\n" http://localhost:8000/health # Check application metrics curl -s http://localhost:8000/metrics | grep http_request_duration # Monitor in real-time watch -n 2 'curl -o /dev/null -s -w "Time: %{time_total}s\n" http://localhost:8000/api/users/me -H "Authorization: Bearer TOKEN"' # Check for resource constraints docker stats --no-stream | grep voiceassist # Identify slow database queries docker compose exec postgres psql -U voiceassist -d voiceassist <<EOF SELECT pid, now() - query_start as duration, state, query FROM pg_stat_activity WHERE state != 'idle' AND now() - query_start > interval '5 seconds' ORDER BY duration DESC; EOF # Check query statistics docker compose exec postgres psql -U voiceassist -d voiceassist <<EOF SELECT substring(query, 1, 100) AS query, calls, total_time, mean_time, max_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10; EOF # Check Redis latency docker compose exec redis redis-cli --latency # Check if Redis is slow docker compose exec redis redis-cli SLOWLOG GET 10
Solutions
Solution 1: Add database indexes
```bash
# Identify missing indexes
docker compose exec postgres psql -U voiceassist -d voiceassist <<EOF
-- Find tables with sequential scans
SELECT schemaname, relname, seq_scan, seq_tup_read, idx_scan,
       seq_tup_read / seq_scan AS avg_seq_tup_read
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 10;
EOF

# Add recommended indexes
docker compose exec postgres psql -U voiceassist -d voiceassist <<EOF
-- Common indexes for VoiceAssist
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_conversations_user_id ON conversations(user_id);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_messages_conversation_id ON messages(conversation_id);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_messages_created_at ON messages(created_at DESC);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email ON users(email);
-- Analyze tables
ANALYZE conversations;
ANALYZE messages;
ANALYZE users;
EOF

# Verify index usage
docker compose exec postgres psql -U voiceassist -d voiceassist <<EOF
SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
EOF
```
Solution 2: Enable query result caching
```python
# Implement Redis caching for expensive queries
import hashlib
import json
from functools import wraps

import redis

redis_client = redis.Redis(host='redis', port=6379, decode_responses=True)

def cache_query(ttl=300):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            # Generate cache key from function name and arguments
            cache_key = f"query:{func.__name__}:{hashlib.md5(str(args).encode()).hexdigest()}"

            # Try cache first
            cached = redis_client.get(cache_key)
            if cached:
                return json.loads(cached)

            # Execute query
            result = await func(*args, **kwargs)

            # Cache result (must be JSON-serializable)
            redis_client.setex(cache_key, ttl, json.dumps(result))
            return result
        return wrapper
    return decorator

# Usage
@cache_query(ttl=600)
async def get_user_conversations(user_id: int):
    return await db.query(Conversation).filter_by(user_id=user_id).all()
```
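Cached results go stale when the underlying data changes, so write paths need a matching invalidation step. A minimal sketch that builds on the key-prefix convention used by the decorator above:

```python
def invalidate_query_cache(func_name: str) -> int:
    """Delete every cached result for the given decorated function."""
    deleted = 0
    # SCAN avoids blocking Redis the way KEYS would on a large keyspace
    for key in redis_client.scan_iter(match=f"query:{func_name}:*"):
        redis_client.delete(key)
        deleted += 1
    return deleted

# Example: after a new message is written, drop the cached conversation lists
invalidate_query_cache("get_user_conversations")
```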
Solution 3: Optimize database queries
```python
# Use eager loading to avoid N+1 queries
from sqlalchemy.orm import joinedload, selectinload

# Bad - causes N+1 queries
conversations = db.query(Conversation).all()
for conv in conversations:
    messages = conv.messages  # Separate query for each conversation

# Good - single query with join
conversations = db.query(Conversation)\
    .options(joinedload(Conversation.messages))\
    .all()

# Use selectinload for large collections (one extra query instead of a huge join)
conversations = db.query(Conversation)\
    .options(selectinload(Conversation.messages))\
    .all()
```
Solution 4: Scale application horizontally
```bash
# Add more application instances
docker compose up -d --scale voiceassist-server=3

# Verify instances
docker compose ps voiceassist-server

# Add load balancer (nginx)
# See SCALING.md for details
```
Solution 5: Increase resource limits
```yaml
# Update docker-compose.yml
services:
  voiceassist-server:
    deploy:
      resources:
        limits:
          cpus: "4"
          memory: 4G
```
docker compose up -d voiceassist-server
4. Redis Connection Issues
Symptom
- "Connection to Redis failed"
- "Redis timeout"
- Cache not working
Investigation
```bash
# Check Redis status
docker compose ps redis
docker compose exec redis redis-cli ping

# Check Redis connections
docker compose exec redis redis-cli CLIENT LIST

# Check Redis memory
docker compose exec redis redis-cli INFO memory

# Check Redis logs
docker compose logs --tail=100 redis

# Test connection from application
docker compose exec voiceassist-server python -c "
import redis
r = redis.Redis(host='redis', port=6379)
print(r.ping())
"
```
Solutions
Solution 1: Restart Redis
```bash
# Restart Redis
docker compose restart redis

# Wait for startup
sleep 5

# Verify
docker compose exec redis redis-cli ping

# Restart application
docker compose restart voiceassist-server
```
Solution 2: Clear Redis if memory full
```bash
# Check memory usage
docker compose exec redis redis-cli INFO memory | grep used_memory_human

# Clear all keys (WARNING: destroys cache)
docker compose exec redis redis-cli FLUSHALL

# Or clear specific database
docker compose exec redis redis-cli -n 0 FLUSHDB

# Verify memory freed
docker compose exec redis redis-cli INFO memory | grep used_memory_human
```
Solution 3: Increase Redis memory limit
```yaml
# Update docker-compose.yml (maxmemory increased from 1gb)
services:
  redis:
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
```
docker compose up -d redis
Solution 4: Fix connection string
```bash
# Verify REDIS_URL in .env
cat .env | grep REDIS_URL
# Should be: REDIS_URL=redis://redis:6379/0

# Update if wrong
echo "REDIS_URL=redis://redis:6379/0" >> .env

# Restart application
docker compose restart voiceassist-server
```
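After correcting the value, confirm the application container resolves it the same way. A quick check using the standard redis-py client:

```bash
docker compose exec voiceassist-server python -c "
import os, redis
url = os.getenv('REDIS_URL', 'redis://redis:6379/0')
client = redis.Redis.from_url(url)
print(url, '->', client.ping())
"
```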
5. Service Container Keeps Restarting
Symptom
- Container exits and restarts repeatedly
- "Restarting (1) X seconds ago" in docker compose ps
Investigation
```bash
# Check restart count
docker inspect voiceassist-voiceassist-server-1 | grep -A 5 RestartCount

# Check exit code
docker compose ps -a voiceassist-server

# Check recent logs
docker compose logs --tail=200 voiceassist-server

# Check health check
docker inspect voiceassist-voiceassist-server-1 | grep -A 20 Health

# Check resource limits
docker stats --no-stream voiceassist-voiceassist-server-1
```
Solutions
Solution 1: OOMKilled (exit code 137)
```bash
# Verify OOM kill
docker inspect voiceassist-voiceassist-server-1 | grep OOMKilled

# Check memory usage
docker stats --no-stream | grep voiceassist-server

# Increase memory limit in docker-compose.yml:
#   deploy:
#     resources:
#       limits:
#         memory: 4G   # Increased from 2G

# Restart
docker compose up -d voiceassist-server
```
Solution 2: Application crash loop
```bash
# Check for Python errors
docker compose logs voiceassist-server | grep -i "traceback\|error\|exception"

# Common fixes:
# - Fix missing environment variables
# - Fix import errors
# - Fix database connection issues

# Disable auto-restart temporarily to debug
docker update --restart=no voiceassist-voiceassist-server-1

# Check logs without restart interference
docker compose logs -f voiceassist-server
```
Solution 3: Failed health check
```bash
# Check health check command
docker inspect voiceassist-voiceassist-server-1 | grep -A 10 Healthcheck

# Test health check manually
docker compose exec voiceassist-server curl -f http://localhost:8000/health

# Increase health check timeout in docker-compose.yml:
#   healthcheck:
#     test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
#     interval: 30s
#     timeout: 10s       # Increased from 5s
#     retries: 5         # Increased from 3
#     start_period: 60s  # Increased from 40s

# Restart
docker compose up -d voiceassist-server
```
6. Authentication / JWT Issues
Symptom
- "Invalid token" errors
- "Token expired" errors
- Users logged out unexpectedly
Investigation
```bash
# Check JWT configuration
cat .env | grep -E "(SECRET_KEY|JWT_)"

# Test token generation
docker compose exec voiceassist-server python -c "
from jose import jwt
from datetime import datetime, timedelta
import os

secret = os.getenv('SECRET_KEY')
payload = {'sub': 'test', 'exp': datetime.utcnow() + timedelta(hours=1)}
token = jwt.encode(payload, secret, algorithm='HS256')
print('Token:', token)

# Decode
decoded = jwt.decode(token, secret, algorithms=['HS256'])
print('Decoded:', decoded)
"

# Check for token in Redis
docker compose exec redis redis-cli KEYS "session:*"
docker compose exec redis redis-cli GET "session:some-session-id"
```
Solutions
Solution 1: SECRET_KEY changed
```bash
# This invalidates all tokens
# Generate new SECRET_KEY
openssl rand -base64 32

# Update .env
echo "SECRET_KEY=<new-secret>" >> .env

# Restart application
docker compose restart voiceassist-server

# Note: All users will need to log in again
# Clear Redis sessions
docker compose exec redis redis-cli FLUSHDB
```
Solution 2: Token expiration too short
```bash
# Update .env
cat >> .env <<EOF
JWT_EXPIRATION_HOURS=24
JWT_REFRESH_EXPIRATION_DAYS=30
EOF

# Restart
docker compose restart voiceassist-server
```
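These variables only take effect if token creation reads them. A minimal sketch of that wiring, assuming python-jose as in the investigation step above (the helper name and defaults are illustrative, not the actual VoiceAssist code):

```python
import os
from datetime import datetime, timedelta, timezone

from jose import jwt

def create_access_token(subject: str) -> str:
    # Lifetime comes from the env var set above, with a sane default
    hours = int(os.getenv("JWT_EXPIRATION_HOURS", "24"))
    payload = {
        "sub": subject,
        "exp": datetime.now(timezone.utc) + timedelta(hours=hours),
    }
    return jwt.encode(payload, os.environ["SECRET_KEY"], algorithm="HS256")
```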
Solution 3: Clock skew issues
```bash
# Check system time
date

# Sync time (macOS)
sudo sntp -sS time.apple.com

# Restart containers to pick up the corrected clock
docker compose restart
```
7. Database Migration Issues
Symptom
- "Duplicate column" errors
- "Table already exists" errors
- Migration fails to apply
Investigation
```bash
# Check current migration version
docker compose run --rm voiceassist-server alembic current

# Check migration history
docker compose run --rm voiceassist-server alembic history

# Check pending migrations
docker compose run --rm voiceassist-server alembic show head

# Check database schema
docker compose exec postgres psql -U voiceassist -d voiceassist -c "\dt"
docker compose exec postgres psql -U voiceassist -d voiceassist -c "\d users"
```
Solutions
Solution 1: Migration already applied manually
```bash
# Stamp database with current migration
docker compose run --rm voiceassist-server alembic stamp head

# Verify
docker compose run --rm voiceassist-server alembic current
```
Solution 2: Conflicting migrations
```bash
# Check for branches
docker compose run --rm voiceassist-server alembic branches

# Merge branches if needed
docker compose run --rm voiceassist-server alembic merge -m "merge branches" <revision1> <revision2>

# Upgrade to merged revision
docker compose run --rm voiceassist-server alembic upgrade head
```
Solution 3: Rollback and retry
```bash
# Downgrade one version
docker compose run --rm voiceassist-server alembic downgrade -1

# Fix migration file
vim app/alembic/versions/<migration-file>.py

# Retry upgrade
docker compose run --rm voiceassist-server alembic upgrade head
```
Solution 4: Reset migrations (DESTRUCTIVE)
```bash
# ⚠️ WARNING: This will destroy all data!
# Backup first
docker compose exec postgres pg_dump -U voiceassist voiceassist > backup.sql

# Drop and recreate database
docker compose exec postgres psql -U voiceassist -d postgres <<EOF
DROP DATABASE voiceassist;
CREATE DATABASE voiceassist OWNER voiceassist;
EOF

# Run all migrations
docker compose run --rm voiceassist-server alembic upgrade head

# Verify
docker compose run --rm voiceassist-server alembic current
```
8. Disk Space Issues
Symptom
- "No space left on device"
- Services failing to start
- Logs not writing
Investigation
```bash
# Check disk usage
df -h

# Check Docker disk usage
docker system df

# Find large files
du -sh /var/lib/docker/*
du -sh ~/Library/Containers/com.docker.docker/Data/*

# Check logs size
docker compose logs voiceassist-server | wc -c

# Find large Docker objects
docker image ls --format "{{.Repository}}:{{.Tag}}\t{{.Size}}"
docker volume ls
docker ps -a --format "{{.Names}}\t{{.Size}}"
```
Solutions
Solution 1: Clean up Docker
```bash
# Remove unused containers
docker container prune -f

# Remove unused images
docker image prune -a -f

# Remove unused volumes
docker volume prune -f

# Remove unused networks
docker network prune -f

# Or clean everything (⚠️ removes all stopped containers, unused images, networks, and volumes)
docker system prune -a --volumes -f

# Check space freed
docker system df
```
Solution 2: Clean up old backups
```bash
# Remove old backups (keep last 7 days; Qdrant snapshots kept 14 days)
find /backups/postgres/daily -name "*.dump.gz" -mtime +7 -delete
find /backups/redis -name "*.rdb" -mtime +7 -delete
find /backups/qdrant -name "*.snapshot" -mtime +14 -delete

# Check backup directory size
du -sh /backups/*
```
Solution 3: Configure log rotation
Create /etc/docker/daemon.json with:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```
```bash
# Restart Docker daemon
sudo systemctl restart docker
# Or on macOS, restart Docker Desktop
```
Solution 4: Clear application logs
```bash
# Clear Docker logs for specific container
truncate -s 0 $(docker inspect --format='{{.LogPath}}' voiceassist-voiceassist-server-1)

# Remove old log files
find /var/log -name "*.log" -mtime +30 -delete
```
9. Network Connectivity Issues
Symptom
- "Connection refused"
- "Host unreachable"
- Containers can't communicate
Investigation
```bash
# Check Docker networks
docker network ls
docker network inspect voiceassist_default

# Test connectivity between containers
docker compose exec voiceassist-server ping -c 3 postgres
docker compose exec voiceassist-server ping -c 3 redis
docker compose exec voiceassist-server ping -c 3 qdrant

# Check DNS resolution
docker compose exec voiceassist-server nslookup postgres
docker compose exec voiceassist-server getent hosts postgres

# Check if ports are exposed
docker compose ps
docker port voiceassist-voiceassist-server-1

# Test from host
curl http://localhost:8000/health
telnet localhost 8000
```
Solutions
Solution 1: Recreate network
```bash
# Stop all services (this also removes the default network)
docker compose down

# Remove network if it still exists
docker network rm voiceassist_default

# Recreate everything
docker compose up -d

# Verify network
docker network inspect voiceassist_default
```
Solution 2: Fix DNS issues
```yaml
# Add to docker-compose.yml
services:
  voiceassist-server:
    dns:
      - 8.8.8.8
      - 8.8.4.4
```
docker compose up -d voiceassist-server
Solution 3: Use explicit links
```yaml
# Add to docker-compose.yml (if needed)
services:
  voiceassist-server:
    links:
      - postgres:postgres
      - redis:redis
      - qdrant:qdrant
```
Solution 4: Check firewall
```bash
# macOS - check if firewall is blocking Docker
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --getglobalstate

# Temporarily disable for testing
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate off

# Re-enable after testing
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate on
```
10. Qdrant Vector Search Issues
Symptom
- "Collection not found"
- "Vector dimension mismatch"
- Slow search results
Investigation
```bash
# Check Qdrant status
curl -s http://localhost:6333/healthz

# List collections
curl -s http://localhost:6333/collections | jq '.'

# Get collection info
curl -s http://localhost:6333/collections/voice_embeddings | jq '.'

# Check collection size
curl -s http://localhost:6333/collections/voice_embeddings | jq '.result.points_count'

# Check Qdrant logs
docker compose logs --tail=100 qdrant
```
Solutions
Solution 1: Create missing collection
```bash
# Create collection
curl -X PUT http://localhost:6333/collections/voice_embeddings \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": {
      "size": 384,
      "distance": "Cosine"
    }
  }'

# Verify creation
curl -s http://localhost:6333/collections/voice_embeddings | jq '.result.status'
```
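If the collection is created from application code instead, the equivalent with the official qdrant-client package looks roughly like this (the client construction, host, and recent-client API are assumptions based on the defaults above):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Create the collection only if it does not already exist
if not client.collection_exists("voice_embeddings"):
    client.create_collection(
        collection_name="voice_embeddings",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )
```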
Solution 2: Fix dimension mismatch
```bash
# Delete and recreate collection with correct dimensions
# ("size" must match the output dimension of your embedding model)
curl -X DELETE http://localhost:6333/collections/voice_embeddings

curl -X PUT http://localhost:6333/collections/voice_embeddings \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": {
      "size": 384,
      "distance": "Cosine"
    }
  }'
```
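To confirm which side is wrong before destroying anything, compare the collection's configured vector size with what the embedding model actually produces. A sketch, assuming sentence-transformers; the model name below is a placeholder for whatever the application really loads:

```python
import requests
from sentence_transformers import SentenceTransformer

# Collection config path assumes a single unnamed vector configuration
info = requests.get("http://localhost:6333/collections/voice_embeddings").json()
collection_dim = info["result"]["config"]["params"]["vectors"]["size"]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder model
model_dim = model.get_sentence_embedding_dimension()

print(f"collection={collection_dim}, model={model_dim}, match={collection_dim == model_dim}")
```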
Solution 3: Optimize collection for performance
```bash
# Create payload index
curl -X POST http://localhost:6333/collections/voice_embeddings/index \
  -H 'Content-Type: application/json' \
  -d '{
    "field_name": "text",
    "field_schema": "keyword"
  }'

# Optimize collection
curl -X POST http://localhost:6333/collections/voice_embeddings/optimizer
```
Solution 4: Clear and reindex
```bash
# Delete all points
curl -X POST http://localhost:6333/collections/voice_embeddings/points/delete \
  -H 'Content-Type: application/json' \
  -d '{
    "filter": {}
  }'

# Trigger reindexing from the application
# (application-specific code to rebuild vectors; see the sketch below)
```
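Rebuilding the vectors is application-specific; the general shape is to re-embed each source document and upsert it in batches. A rough sketch with qdrant-client, where `embed_text()` and the document source are placeholders for the real pipeline:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")

def reindex(documents):
    """documents: iterable of (id, text) pairs read from the source of truth (e.g. Postgres)."""
    batch = []
    for doc_id, text in documents:
        # embed_text() is a placeholder for the application's embedding call
        batch.append(PointStruct(id=doc_id, vector=embed_text(text), payload={"text": text}))
        if len(batch) >= 100:
            client.upsert(collection_name="voice_embeddings", points=batch)
            batch = []
    if batch:
        client.upsert(collection_name="voice_embeddings", points=batch)
```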
Troubleshooting Checklist
Before Escalating
- Checked recent logs (5-15 minutes)
- Verified all services are running
- Checked system resources (CPU, memory, disk)
- Reviewed recent changes (deployments, config)
- Attempted restart of affected service
- Checked for known issues in documentation
- Verified network connectivity
- Checked monitoring dashboards
- Documented symptoms and attempted solutions
Information to Collect for Escalation
```bash
#!/bin/bash
# Save as: /usr/local/bin/va-collect-debug-info

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
OUTPUT_DIR="/tmp/voiceassist-debug-${TIMESTAMP}"
mkdir -p $OUTPUT_DIR

echo "Collecting debug information..."

# System info
uname -a > $OUTPUT_DIR/system-info.txt
docker version >> $OUTPUT_DIR/system-info.txt
docker compose version >> $OUTPUT_DIR/system-info.txt

# Service status
docker compose ps > $OUTPUT_DIR/service-status.txt

# Logs
docker compose logs --tail=500 > $OUTPUT_DIR/all-logs.txt
docker compose logs --tail=500 voiceassist-server > $OUTPUT_DIR/app-logs.txt
docker compose logs --tail=200 postgres > $OUTPUT_DIR/postgres-logs.txt
docker compose logs --tail=200 redis > $OUTPUT_DIR/redis-logs.txt

# Configuration (values redacted before sharing)
docker compose config > $OUTPUT_DIR/docker-compose-config.yml
cp .env $OUTPUT_DIR/env-sanitized.txt
sed -i '' 's/=.*/=REDACTED/g' $OUTPUT_DIR/env-sanitized.txt

# Resource usage
docker stats --no-stream > $OUTPUT_DIR/resource-usage.txt
df -h > $OUTPUT_DIR/disk-usage.txt

# Network
docker network ls > $OUTPUT_DIR/networks.txt
docker network inspect voiceassist_default > $OUTPUT_DIR/network-inspect.json

# Database state
docker compose exec -T postgres psql -U voiceassist -d voiceassist -c \
  "SELECT count(*), state FROM pg_stat_activity GROUP BY state;" \
  > $OUTPUT_DIR/db-connections.txt

# Create archive
tar -czf voiceassist-debug-${TIMESTAMP}.tar.gz -C /tmp voiceassist-debug-${TIMESTAMP}

echo "Debug information collected: voiceassist-debug-${TIMESTAMP}.tar.gz"
echo "Please attach this file when escalating the issue"
```
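Once the script is saved, make it executable and run it from the repository root so `docker compose` picks up the right project:

```bash
chmod +x /usr/local/bin/va-collect-debug-info
cd ~/VoiceAssist && va-collect-debug-info
```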
Common Error Messages
Error: "bind: address already in use"
Solution:
```bash
# Find and kill process using the port
lsof -i :8000
kill -9 <PID>

# Or change port in docker-compose.yml
```
Error: "ERROR: could not find an available, non-overlapping IPv4 address pool"
Solution:
```bash
# Clean up unused networks
docker network prune

# Or specify custom network in docker-compose.yml:
#   networks:
#     default:
#       ipam:
#         config:
#           - subnet: 172.25.0.0/16
```
Error: "ERROR: Service 'X' failed to build"
Solution:
```bash
# Clean Docker build cache
docker builder prune -a -f

# Rebuild with no cache
docker compose build --no-cache

# Check Dockerfile syntax
docker compose config
```
Error: "sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL: password authentication failed"
Solution:
```bash
# Verify credentials in .env
cat .env | grep -E "(POSTGRES_USER|POSTGRES_PASSWORD)"

# Reset password
docker compose exec postgres psql -U postgres -c \
  "ALTER USER voiceassist WITH PASSWORD 'new_password';"

# Update .env
vim .env

# Restart application
docker compose restart voiceassist-server
```
Error: "redis.exceptions.ConnectionError: Error connecting to redis"
Solution:
```bash
# Check Redis is running
docker compose ps redis

# Check Redis URL in .env
cat .env | grep REDIS_URL

# Test connection
docker compose exec redis redis-cli ping

# Restart Redis and app
docker compose restart redis voiceassist-server
```
Performance Tuning Quick Wins
```bash
# 1. Add database indexes
docker compose exec postgres psql -U voiceassist -d voiceassist -f - <<EOF
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_conversations_user_id ON conversations(user_id);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_messages_conversation_id ON messages(conversation_id);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_messages_created_at ON messages(created_at DESC);
ANALYZE;
EOF

# 2. Increase connection pool
echo "DB_POOL_SIZE=30" >> .env
echo "DB_MAX_OVERFLOW=10" >> .env

# 3. Enable Redis caching
echo "CACHE_ENABLED=true" >> .env
echo "CACHE_TTL=300" >> .env

# 4. Increase worker count
# For 4 CPU cores: workers = (2 x 4) + 1 = 9
echo "GUNICORN_WORKERS=9" >> .env

# 5. Optimize PostgreSQL settings
# See SCALING.md for detailed configuration

# Restart to apply changes
docker compose restart
```
Related Documentation
- Incident Response Runbook
- Deployment Runbook
- Monitoring Runbook
- Scaling Runbook
- Backup & Restore Runbook
- UNIFIED_ARCHITECTURE.md
- CONNECTION_POOL_OPTIMIZATION.md
Document Version: 1.0
Last Updated: 2025-11-21
Maintained By: VoiceAssist DevOps Team
Review Cycle: Monthly or after each major incident
Next Review: 2025-12-21
Docs Site Deployment and TLS Runbook
Last Updated: 2025-11-27
URL: https://assistdocs.asimo.io
Document Root: /var/www/assistdocs.asimo.io
Quick Deployment Checklist
```bash
# 1. Navigate to repo
cd ~/VoiceAssist

# 2. Pull latest changes
git pull origin main

# 3. Install dependencies (if needed)
pnpm install

# 4. Navigate to docs-site
cd apps/docs-site

# 5. Validate metadata and links
pnpm validate:metadata
pnpm check:links

# 6. Generate agent JSON (if docs changed)
pnpm generate-agent-json

# 7. Build the static site
pnpm build

# 8. Deploy to Apache document root
sudo rm -rf /var/www/assistdocs.asimo.io/*
sudo cp -r out/* /var/www/assistdocs.asimo.io/

# 9. Verify deployment
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/agent/index.json
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/agent/docs.json
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/search-index.json
```
Architecture Overview
Build Process
docs/*.md → Next.js static export
apps/docs-site/ → Build artifacts in out/
scripts/generate-agent-json → public/agent/*.json
→ search-index.json
Deployment Architecture
┌──────────────────────────────────────────────────┐
│ Apache2 (mod_ssl, mod_rewrite) │
│ - assistdocs.asimo.io-le-ssl.conf │
│ - DocumentRoot: /var/www/assistdocs.asimo.io │
│ - RewriteEngine for clean URLs │
└──────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ Static Files │
│ - /*.html (Next.js pages) │
│ - /agent/*.json (AI agent endpoints) │
│ - /search-index.json (Fuse.js) │
│ - /sitemap.xml │
└──────────────────────────────────────────────────┘
Step 1: Prepare for Deployment
1.1 Sync Repository
```bash
cd ~/VoiceAssist
git pull origin main
git status  # Verify clean state
```
1.2 Install Dependencies
```bash
# Root level (pnpm workspace)
pnpm install

# Verify docs-site dependencies
cd apps/docs-site
ls node_modules/.bin/next  # Should exist
```
Step 2: Validate Documentation
2.1 Metadata Validation
```bash
cd ~/VoiceAssist/apps/docs-site
pnpm validate:metadata
```
Expected Output: No errors about missing or invalid frontmatter.
2.2 Link Validation
pnpm check:links
Expected Output: All internal links resolve correctly.
2.3 Fix Common Issues
Missing frontmatter:
```yaml
---
title: "Document Title"
slug: "path/to-document"
summary: "Brief description"
status: stable
stability: production
owner: team
lastUpdated: "YYYY-MM-DD"
audience: ["human", "agent"]
tags: ["tag1", "tag2"]
category: category-name
---
```
Broken links: Update markdown links to use relative paths from docs/ directory.
Step 3: Generate Agent JSON
The agent JSON files provide machine-readable access to documentation.
3.1 Run Generation Script
```bash
cd ~/VoiceAssist/apps/docs-site
pnpm generate-agent-json
```
3.2 Verify Output
```bash
# Check index.json
cat public/agent/index.json | jq '.name'
# Should output: "VoiceAssist Documentation"

# Check docs.json count
cat public/agent/docs.json | jq 'length'
# Should output document count (e.g., 220+)

# Check search index
ls -la public/search-index.json
```
Step 4: Build Static Site
4.1 Run Build
```bash
cd ~/VoiceAssist/apps/docs-site
pnpm build
```
Expected Output:
- `✓ Compiled successfully`
- `Export successful`
- Files in the `out/` directory
4.2 Verify Build Output
```bash
ls out/
# Should contain: index.html, ai/, docs/, agent/, search-index.json, sitemap.xml

ls out/agent/
# Should contain: index.json, docs.json, schema.json
```
Step 5: Deploy to Apache
5.1 Clear Old Files
sudo rm -rf /var/www/assistdocs.asimo.io/*
5.2 Copy New Build
sudo cp -r ~/VoiceAssist/apps/docs-site/out/* /var/www/assistdocs.asimo.io/
5.3 Set Permissions
```bash
sudo chown -R www-data:www-data /var/www/assistdocs.asimo.io
sudo chmod -R 755 /var/www/assistdocs.asimo.io
```
5.4 Reload Apache (if config changed)
```bash
sudo apache2ctl configtest
sudo systemctl reload apache2
```
Step 6: Verify Deployment
6.1 Check HTTP Status
```bash
# Main page
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/

# AI agent endpoints
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/agent/index.json
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/agent/docs.json
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/search-index.json

# Clean URLs (should return 200, not 404)
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/ai/onboarding
curl -s -o /dev/null -w "%{http_code}" https://assistdocs.asimo.io/ai/status
```
Expected: All should return 200.
6.2 Check Content
```bash
# Verify agent JSON content
curl -s https://assistdocs.asimo.io/agent/index.json | jq '.endpoints'

# Verify sitemap
curl -s https://assistdocs.asimo.io/sitemap.xml | head -20
```
TLS Certificate Management
Current Certificate Status
sudo certbot certificates | grep -A 5 "assistdocs.asimo.io"
Current Certificate:
- Domain: assistdocs.asimo.io
- Issuer: Let's Encrypt
- Key Type: ECDSA
- Certificate Path: /etc/letsencrypt/live/assistdocs.asimo.io/fullchain.pem
- Private Key Path: /etc/letsencrypt/live/assistdocs.asimo.io/privkey.pem
- Expiry: 2026-02-19 (auto-renewed)
Automatic Renewal
Certbot automatically renews certificates via systemd timer.
```bash
# Check timer status
sudo systemctl status certbot.timer

# View renewal schedule
sudo systemctl list-timers | grep certbot

# Test renewal (dry run)
sudo certbot renew --dry-run
```
Manual Renewal (if needed)
```bash
# Renew specific certificate
sudo certbot renew --cert-name assistdocs.asimo.io

# Force renewal
sudo certbot renew --cert-name assistdocs.asimo.io --force-renewal

# Reload Apache after renewal
sudo systemctl reload apache2
```
New Certificate (if domain changes)
sudo certbot --apache -d assistdocs.asimo.io
Apache Configuration
Configuration File
Location: /etc/apache2/sites-available/assistdocs.asimo.io-le-ssl.conf
Key Configuration
```apache
<VirtualHost *:443>
    ServerName assistdocs.asimo.io
    DocumentRoot /var/www/assistdocs.asimo.io

    <Directory /var/www/assistdocs.asimo.io>
        Options Indexes FollowSymLinks
        AllowOverride All
        Require all granted
        DirectoryIndex index.html

        # Clean URLs for Next.js static export
        RewriteEngine On
        RewriteCond %{REQUEST_FILENAME} !-f
        RewriteCond %{REQUEST_FILENAME} !-d
        RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI}.html -f
        RewriteRule ^(.*)$ $1.html [L]
    </Directory>

    # SSL (managed by Certbot)
    SSLEngine on
    SSLCertificateFile /etc/letsencrypt/live/assistdocs.asimo.io/fullchain.pem
    SSLCertificateKeyFile /etc/letsencrypt/live/assistdocs.asimo.io/privkey.pem
</VirtualHost>
```
Test Configuration
sudo apache2ctl configtest
Reload After Changes
sudo systemctl reload apache2
Troubleshooting
404 for Clean URLs
Symptom: /ai/onboarding returns 404 but /ai/onboarding.html works.
Cause: RewriteEngine rules not applied.
Fix:
- Ensure `mod_rewrite` is enabled: `sudo a2enmod rewrite`
- Verify the rewrite rules are inside the `<Directory>` block
- Reload Apache: `sudo systemctl reload apache2`
Build Fails
Symptom: pnpm build fails with errors.
Checks:
```bash
# Check for TypeScript errors
pnpm tsc --noEmit

# Check for missing dependencies
pnpm install

# Clear cache
rm -rf .next out
pnpm build
```
Agent JSON Not Updated
Symptom: /agent/docs.json shows old documents.
Fix:
```bash
# Regenerate agent JSON
pnpm generate-agent-json

# Rebuild and redeploy
pnpm build
sudo cp -r out/* /var/www/assistdocs.asimo.io/
```
TLS Certificate Expired
Symptom: Browser shows certificate error.
Fix:
```bash
# Check certificate status
sudo certbot certificates

# Force renewal
sudo certbot renew --cert-name assistdocs.asimo.io --force-renewal

# Reload Apache
sudo systemctl reload apache2
```