Disaster Recovery Runbook
Document Version: 1.0 | Last Updated: 2025-11-21 | Status: Production-Ready | Phase: Phase 12 - High Availability & Disaster Recovery
Executive Summary
This runbook provides step-by-step procedures for recovering the VoiceAssist platform from various disaster scenarios. It covers database failures, complete system failures, data corruption, and ransomware attacks.
Recovery Objectives:
- RTO (Recovery Time Objective): 4 hours (restore from daily backup)
- RPO (Recovery Point Objective): 24 hours (daily backups)
- RTO with replication: 30 minutes (failover to replica)
- RPO with replication: < 1 minute (streaming replication)
Table of Contents
- Disaster Scenarios
- Pre-Disaster Preparation
- Recovery Procedures
- Post-Recovery Procedures
- Testing and Validation
- Contact Information
Disaster Scenarios
Covered Scenarios
1. Database Failure - Primary PostgreSQL server becomes unavailable
2. Complete System Failure - Entire infrastructure is lost (fire, flood, hardware failure)
3. Data Corruption - Database corruption or accidental data deletion
4. Ransomware Attack - Data encrypted by ransomware
5. Application Server Failure - API Gateway or worker services fail
Not Covered (Escalate to Management)
- Physical security breaches
- Large-scale network outages beyond your control
- Legal or regulatory issues requiring counsel
Pre-Disaster Preparation
Before a Disaster Occurs
1. Verify Backup System
```bash
# Check last backup status
ls -lht /var/backups/voiceassist/ | head -5

# Verify automated backups are running
crontab -l | grep backup

# Test backup restoration (recommended monthly)
sudo /opt/voiceassist/ha-dr/backup/verify-backup.sh
```
2. Document Current Configuration
```bash
# Export current environment configuration
cd ~/VoiceAssist
docker-compose config > docker-compose.current.yml

# Document container versions
docker-compose images > container-versions.txt

# Save database schema
pg_dump -h localhost -U voiceassist -d voiceassist --schema-only > schema-backup.sql
```
3. Verify Off-Site Backups
```bash
# Check S3 backups (if using AWS)
aws s3 ls s3://voiceassist-backups/ --recursive | tail -10

# Check Nextcloud backups (if using Nextcloud)
curl -u admin:password \
  "http://nextcloud:8080/remote.php/dav/files/admin/backups/voiceassist/" \
  | grep -o 'voiceassist_backup_[^<]*'
```
4. Maintain Contact List
Keep updated contact list for:
- On-call engineers
- Database administrators
- Cloud provider support
- Management escalation chain
Recovery Procedures
General Recovery Steps
1. Assess the situation - Determine the extent of the disaster
2. Communicate - Notify stakeholders and team members
3. Execute recovery - Follow the appropriate scenario procedure
4. Verify recovery - Test system functionality
5. Document - Record what happened and lessons learned (see the log-starter sketch below)
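To support the Document step under pressure, here is a minimal log-starter sketch; the log path and fields are illustrative assumptions, not an existing VoiceAssist convention:

```bash
# Open a timestamped incident log as soon as recovery begins.
# NOTE: /var/log/voiceassist/incidents is an assumed location.
INCIDENT_DIR=/var/log/voiceassist/incidents
INCIDENT_FILE="$INCIDENT_DIR/incident_$(date +%Y%m%d_%H%M%S).md"

sudo mkdir -p "$INCIDENT_DIR"
{
  echo "# Incident Log"
  echo "Started: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "Scenario: <1-5, see Disaster Scenarios>"
  echo "Responder: $(whoami)"
  echo ""
  echo "## Timeline"
} | sudo tee "$INCIDENT_FILE"
```

Append a timestamped line to the Timeline section after each recovery step; the file becomes the raw material for the post-mortem.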
Scenario 1: Database Failure (Primary Down)
Situation: Primary PostgreSQL server is unavailable but replica is operational
RTO: 30 minutes | RPO: < 1 minute | Severity: High
Detection
```bash
# Check primary database health
docker-compose ps postgres

# Test database connection
psql -h localhost -p 5432 -U voiceassist -d voiceassist -c "SELECT 1;"
```
Recovery Steps
Step 1: Verify Replica Status
```bash
# Check the replica is up and running
docker exec voiceassist-postgres-replica psql -U voiceassist -c "SELECT pg_is_in_recovery();"
# Expected: t (true - in recovery/standby mode)

# Check replication lag
docker exec voiceassist-postgres-replica psql -U voiceassist -c \
  "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"
```
Step 2: Promote Replica to Primary
```bash
# Promote the replica to become the new primary
# (pg_ctl refuses to run as root, so exec as the postgres user)
docker exec -u postgres voiceassist-postgres-replica \
  pg_ctl promote -D /var/lib/postgresql/data

# Wait for promotion to complete (about 30 seconds)
sleep 30

# Verify the replica is now a primary
docker exec voiceassist-postgres-replica psql -U voiceassist -c "SELECT pg_is_in_recovery();"
# Expected: f (false - not in recovery / now primary)
```
Step 3: Update Application Configuration
```bash
# Update API Gateway to point to the new primary
# Edit .env (or docker-compose.yml): set DB_HOST=postgres-replica, DB_PORT=5432
nano .env

# Recreate the containers (a plain restart does not pick up new environment values)
docker-compose up -d --force-recreate voiceassist-server voiceassist-worker
```
Step 4: Verify Application Functionality
```bash
# Test health endpoint
curl http://localhost:8000/health

# Test database connectivity
curl http://localhost:8000/ready

# Check application logs
docker-compose logs -f voiceassist-server | grep -i error
```
Step 5: Restore Original Primary (when available)
```bash
# Once the original primary hardware is fixed, reconfigure it as a replica
# Follow the replication setup procedures in ha-dr/postgresql/
cd ~/VoiceAssist/ha-dr/postgresql/
docker-compose -f docker-compose.replication.yml up -d postgres-replica
```
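If the original primary's data directory cannot be trusted, one common approach is to re-seed it with pg_basebackup before starting it as a replica. A sketch, assuming a replication role named `replicator` on the new primary and a data volume named `voiceassist_pgdata` (both assumptions; check your replication setup):

```bash
# Wipe the stale data directory and clone from the new primary.
# ASSUMPTIONS: volume voiceassist_pgdata, network voiceassist_default,
# and a "replicator" role - adjust to your actual setup.
docker run --rm \
  -v voiceassist_pgdata:/var/lib/postgresql/data \
  --network voiceassist_default \
  -e PGPASSWORD="$REPLICATION_PASSWORD" \
  postgres:16 \
  bash -c 'rm -rf /var/lib/postgresql/data/* && \
           pg_basebackup -h postgres-replica -U replicator \
             -D /var/lib/postgresql/data -R -X stream -P'
```

The `-R` flag writes `standby.signal` and the `primary_conninfo` setting, so the container comes up as a standby on its next start.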
Estimated Recovery Time: 30 minutes
Scenario 2: Complete System Failure
Situation: Entire infrastructure is lost (server destroyed, complete hardware failure)
RTO: 4 hours | RPO: 24 hours | Severity: Critical
Detection
- Server is completely inaccessible
- All services are down
- No response from any system components
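Before declaring a total loss, a quick triage sketch run from a machine outside the affected infrastructure can confirm that nothing is reachable (the hostname is a placeholder):

```bash
# Run from an EXTERNAL host; voiceassist.example.com is a placeholder.
HOST=voiceassist.example.com

ping -c 3 "$HOST"                      || echo "ICMP unreachable"
nc -zv -w 5 "$HOST" 22                 || echo "SSH unreachable"
curl -fsS -m 10 "https://$HOST/health" || echo "API unreachable"
```

If all three fail, proceed with this scenario; if SSH still answers, Scenario 1 or 5 may apply instead.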
Recovery Steps
Step 1: Provision New Infrastructure
```bash
# If using cloud (AWS, GCP, Azure):
# Use Terraform to provision new infrastructure
cd ~/VoiceAssist/infrastructure/terraform/
terraform init
terraform plan -out=disaster-recovery.tfplan
terraform apply disaster-recovery.tfplan

# If using a physical server:
# 1. Install Ubuntu Server 22.04 LTS
# 2. Configure network and SSH access
# 3. Install Docker and Docker Compose
```
Step 2: Clone VoiceAssist Repository
```bash
# On the new server
git clone https://github.com/your-org/VoiceAssist.git ~/VoiceAssist
cd ~/VoiceAssist
```
Step 3: Download Latest Backup
```bash
# Create backup directory
sudo mkdir -p /var/backups/voiceassist

# Download from S3
aws s3 cp s3://voiceassist-backups/voiceassist_backup_latest.sql.gpg \
  /var/backups/voiceassist/

# Or download from Nextcloud
curl -u admin:password \
  -o /var/backups/voiceassist/voiceassist_backup_latest.sql.gpg \
  "http://nextcloud:8080/remote.php/dav/files/admin/backups/voiceassist/voiceassist_backup_latest.sql.gpg"

# Download checksum
aws s3 cp s3://voiceassist-backups/voiceassist_backup_latest.sql.gpg.sha256 \
  /var/backups/voiceassist/
```
Step 4: Verify Backup Integrity
```bash
# Verify checksum
cd /var/backups/voiceassist
sha256sum -c voiceassist_backup_latest.sql.gpg.sha256
```
Step 5: Start Infrastructure Services
```bash
cd ~/VoiceAssist

# Copy environment file
cp .env.example .env

# Edit .env with production credentials
nano .env

# Start infrastructure services (PostgreSQL, Redis, Qdrant)
docker-compose up -d postgres redis qdrant

# Wait for services to be ready
sleep 30
docker-compose ps
```
Step 6: Restore Database
```bash
# Set encryption key
export BACKUP_ENCRYPTION_KEY="your-encryption-passphrase"
export POSTGRES_PASSWORD="your-db-password"

# Restore database
cd ~/VoiceAssist/ha-dr/backup/
./restore-database.sh /var/backups/voiceassist/voiceassist_backup_latest.sql.gpg
```
Step 7: Start Application Services
```bash
cd ~/VoiceAssist

# Build and start API Gateway and workers
docker-compose up -d --build voiceassist-server voiceassist-worker

# Wait for services to start
sleep 60
```
Step 8: Verify System Functionality
```bash
# Check all services are running
docker-compose ps

# Test health endpoints
curl http://localhost:8000/health
curl http://localhost:8000/ready

# Test authentication
curl -X POST http://localhost:8000/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com", "password": "testpass"}'

# Check application logs
docker-compose logs -f voiceassist-server
```
Step 9: Update DNS and Load Balancers
```bash
# Update DNS records to point to the new server
# Update load balancer configuration
# The exact commands depend on your infrastructure setup
```
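As one example, if DNS is hosted in AWS Route 53, the cutover might look like the sketch below; the hosted zone ID, record name, and IP address are placeholders:

```bash
# Point the A record at the new server (all values are placeholders).
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000000000 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.voiceassist.example.com",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "203.0.113.10"}]
      }
    }]
  }'
```

A short TTL (60 seconds here) keeps client caches from pinning the old address during cutover.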
Estimated Recovery Time: 4 hours
Scenario 3: Data Corruption
Situation: Database corruption or accidental data deletion detected
RTO: 2 hours | RPO: 24 hours | Severity: High
Detection
```bash
# Database integrity check
docker exec postgres pg_dump -U voiceassist voiceassist --schema-only > /dev/null

# Check for corruption errors in logs
docker-compose logs postgres | grep -i "corrupt\|error"

# Verify table row counts
psql -h localhost -U voiceassist -d voiceassist -c \
  "SELECT schemaname, tablename, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"
```
Recovery Steps
Step 1: Stop Application Services
```bash
# Stop API Gateway and workers to prevent further writes
docker-compose stop voiceassist-server voiceassist-worker

# Verify services are stopped
docker-compose ps
```
Step 2: Create Emergency Backup of Current State
```bash
# Even if corrupted, back up the current state for forensics
pg_dump -h localhost -U voiceassist voiceassist \
  > /tmp/emergency_backup_$(date +%Y%m%d_%H%M%S).sql
```
Step 3: Identify Point-in-Time to Restore
```bash
# List available backups
ls -lht /var/backups/voiceassist/

# Check backup metadata to find the appropriate backup
cat /var/backups/voiceassist/voiceassist_backup_YYYYMMDD_HHMMSS.sql.gpg.metadata
```
Step 4: Restore from Backup
```bash
# Set environment variables
export BACKUP_ENCRYPTION_KEY="your-encryption-passphrase"
export POSTGRES_PASSWORD="your-db-password"

# Restore database
cd ~/VoiceAssist/ha-dr/backup/
./restore-database.sh /var/backups/voiceassist/voiceassist_backup_YYYYMMDD_HHMMSS.sql.gpg
```
Step 5: Verify Data Integrity
```bash
# Check table counts
psql -h localhost -U voiceassist -d voiceassist -c \
  "SELECT schemaname, tablename, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"

# Rebuild planner statistics (also surfaces some corruption errors)
psql -h localhost -U voiceassist -d voiceassist -c "VACUUM ANALYZE;"

# Query critical data
psql -h localhost -U voiceassist -d voiceassist -c "SELECT COUNT(*) FROM users;"
psql -h localhost -U voiceassist -d voiceassist -c "SELECT COUNT(*) FROM messages;"
```
Step 6: Restart Application Services
```bash
docker-compose up -d voiceassist-server voiceassist-worker

# Monitor logs for errors
docker-compose logs -f voiceassist-server
```
Step 7: Test Application Functionality
```bash
# Test critical workflows
curl http://localhost:8000/health
curl http://localhost:8000/api/v1/users/me -H "Authorization: Bearer $TOKEN"
```
Estimated Recovery Time: 2 hours
Scenario 4: Ransomware Attack
Situation: Data has been encrypted by ransomware
RTO: 6 hours | RPO: 24 hours | Severity: Critical
Detection
- Files encrypted with unusual extensions (.encrypted, .locked, etc.)
- Ransom note present in directories
- Unusual file access patterns in logs
- Database access denied or data appears encrypted
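A minimal scan sketch for the encrypted-extension symptom; the extension list is an assumption and should be adjusted to the variant actually observed:

```bash
# Look for common ransomware extensions in data volumes and backups.
# The extension list is illustrative, not exhaustive.
find /var/lib/docker/volumes /var/backups \
  -type f \( -name '*.encrypted' -o -name '*.locked' -o -name '*.crypt' \) \
  2>/dev/null | head -20

# Ransom notes are often dropped as README/HOW_TO text files.
find / -maxdepth 4 -type f \
  \( -iname '*readme*decrypt*' -o -iname '*how_to*recover*' \) \
  2>/dev/null
```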
Recovery Steps
Step 1: Immediate Containment
```bash
# IMMEDIATELY disconnect from the network
sudo ip link set eth0 down

# Stop all services to prevent spread
docker-compose down

# Document everything visible (take screenshots of ransom notes)
```
Step 2: Incident Response
- Notify management and security team
- Contact law enforcement (FBI Cyber Division)
- Preserve evidence (do not delete anything)
- Document timeline of events
Step 3: Assess Damage
```bash
# Check which files are encrypted
find /var/lib/docker/volumes -type f -exec file {} \; | grep -i encrypted

# Check backup integrity (verify backups are not infected)
sha256sum -c /var/backups/voiceassist/voiceassist_backup_latest.sql.gpg.sha256
```
Step 4: Provision Clean Infrastructure
```bash
# Build completely new infrastructure (do not reuse infected systems)
# Follow Scenario 2 (Complete System Failure), Steps 1-3
```
Step 5: Restore from Clean Backup
```bash
# Use a backup from BEFORE the infection timeline
# Identify a clean backup (check backup dates against the infection timeline)
ls -lt /var/backups/voiceassist/

# Restore from the clean backup
cd ~/VoiceAssist/ha-dr/backup/
./restore-database.sh /var/backups/voiceassist/voiceassist_backup_YYYYMMDD_HHMMSS.sql.gpg
```
Step 6: Security Hardening
```bash
# Reset all passwords
# Rotate all API keys and secrets
# Update all access credentials
# Review audit logs for indicators of compromise

# Apply security patches
sudo apt update && sudo apt upgrade -y

# Run security audit
cd ~/VoiceAssist/security/audit/
./security-audit.sh
```
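For the credential-rotation steps above, a hedged sketch that generates fresh secrets; the variable names mirror a typical `.env` layout and are assumptions, not confirmed VoiceAssist settings:

```bash
# Generate replacement secrets (hex output avoids sed-delimiter issues).
# ASSUMPTION: POSTGRES_PASSWORD and JWT_SECRET exist in .env.
NEW_DB_PASSWORD=$(openssl rand -hex 32)
NEW_JWT_SECRET=$(openssl rand -hex 64)

# Apply the new database password on the restored (clean) instance
docker exec postgres psql -U voiceassist -c \
  "ALTER USER voiceassist WITH PASSWORD '$NEW_DB_PASSWORD';"

# Update .env to match, then recreate services so they pick it up
sed -i "s/^POSTGRES_PASSWORD=.*/POSTGRES_PASSWORD=$NEW_DB_PASSWORD/" ~/VoiceAssist/.env
sed -i "s/^JWT_SECRET=.*/JWT_SECRET=$NEW_JWT_SECRET/" ~/VoiceAssist/.env
```

Rotate external API keys (cloud provider, email, etc.) through each vendor's console; those cannot be regenerated locally.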
Step 7: Gradual Restoration
```bash
# Bring up services one at a time
docker-compose up -d postgres
docker-compose up -d redis
docker-compose up -d qdrant
docker-compose up -d voiceassist-server

# Monitor for suspicious activity
docker-compose logs -f
```
Estimated Recovery Time: 6 hours (excluding investigation time)
Important: Do NOT pay the ransom. Contact law enforcement instead.
Scenario 5: Application Server Failure
Situation: API Gateway or worker services fail but database is healthy
RTO: 15 minutes | RPO: 0 (no data loss) | Severity: Medium
Detection
```bash
# Check service status
docker-compose ps

# Check for container crashes
docker-compose ps -a | grep -i exit

# Check logs for errors
docker-compose logs voiceassist-server | tail -50
```
Recovery Steps
Step 1: Restart Failed Services
```bash
# Restart API Gateway
docker-compose restart voiceassist-server

# Restart workers
docker-compose restart voiceassist-worker

# Check status
docker-compose ps
```
Step 2: If Restart Fails, Rebuild
```bash
# Rebuild and restart
docker-compose up -d --build voiceassist-server voiceassist-worker

# Check logs
docker-compose logs -f voiceassist-server
```
Step 3: Verify Functionality
```bash
curl http://localhost:8000/health
curl http://localhost:8000/ready
```
Estimated Recovery Time: 15 minutes
Post-Recovery Procedures
After Successful Recovery
1. Verify All Systems
```bash
# Run comprehensive health checks
cd ~/VoiceAssist/ha-dr/testing/
./comprehensive-health-check.sh

# Verify data integrity
./verify-data-integrity.sh

# Test critical workflows
./test-critical-workflows.sh
```
2. Update Monitoring
```bash
# Check all monitoring alerts are active
# Verify metrics are flowing to Grafana/Prometheus
curl http://localhost:8000/metrics

# Check audit logs are being written
psql -h localhost -U voiceassist -d voiceassist -c \
  "SELECT COUNT(*) FROM audit_logs WHERE timestamp > NOW() - INTERVAL '1 hour';"
```
3. Communication
- Notify stakeholders that system is restored
- Provide timeline of outage and recovery
- Document lessons learned
- Update runbook if new issues discovered
4. Post-Mortem
Conduct a post-mortem meeting within 48 hours:
- What happened? - Root cause analysis
- Why did it happen? - Contributing factors
- How was it detected? - Monitoring effectiveness
- How was it resolved? - Recovery effectiveness
- How can we prevent it? - Action items
Template: Use docs/templates/post-mortem-template.md
Testing and Validation
Regular DR Drills
Frequency: Quarterly
Drill Procedures:

1. Backup Restoration Test (Monthly)

```bash
cd ~/VoiceAssist/ha-dr/backup/
./verify-backup.sh
```

2. Failover Test (Quarterly)

```bash
# Simulate primary failure and promote replica
cd ~/VoiceAssist/ha-dr/testing/
./test-failover.sh
```

3. Full DR Test (Annually)
- Provision new infrastructure
- Restore from backup
- Verify all functionality
- Document timing and issues (a timing sketch follows)
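A timing sketch for the "document timing" item; the log path is arbitrary:

```bash
# Record phase timings during a drill so RTO figures stay honest.
DRILL_LOG="/tmp/dr_drill_$(date +%Y%m%d).log"
phase() { echo "$(date -u +%H:%M:%S) $1" | tee -a "$DRILL_LOG"; }

phase "provisioning started"
# ... provision infrastructure ...
phase "restore started"
# ... restore from backup ...
phase "verification complete"
```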
Validation Checklist
After each test or real recovery:
- All services running and healthy
- Database connectivity verified
- Authentication working
- Critical workflows tested
- Monitoring and alerting active
- Backups resuming normally
- Audit logging functional
- RTO/RPO objectives met
- Documentation updated
- Stakeholders notified
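To make the first few checklist items repeatable, a minimal sketch using the same endpoints as the procedures above (extend it with your own critical workflows):

```bash
#!/usr/bin/env bash
# Automates the first checklist items; exits non-zero on first failure.
set -e

echo "== Services =="
docker-compose ps

echo "== Health endpoints =="
curl -fsS http://localhost:8000/health
curl -fsS http://localhost:8000/ready

echo "== Database connectivity =="
psql -h localhost -U voiceassist -d voiceassist -c "SELECT 1;"

echo "All automated checks passed."
```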
Contact Information
Emergency Contacts
| Role | Name | Phone | Email | Escalation Time |
|---|---|---|---|---|
| On-Call Engineer | [Name] | [Phone] | [Email] | Immediate |
| Database Admin | [Name] | [Phone] | [Email] | 30 minutes |
| Infrastructure Lead | [Name] | [Phone] | [Email] | 1 hour |
| Engineering Manager | [Name] | [Phone] | [Email] | 2 hours |
| CTO | [Name] | [Phone] | [Email] | 4 hours |
Vendor Contacts
| Vendor | Support | Phone | Website |
|---|---|---|---|
| AWS Support | Premium | 1-xxx-xxx-xxxx | https://console.aws.amazon.com/support |
| PostgreSQL Support | [Company] | 1-xxx-xxx-xxxx | [URL] |
| Docker Support | Enterprise | 1-xxx-xxx-xxxx | [URL] |
Internal Resources
- Wiki: https://wiki.company.com/voiceassist
- Status Page: https://status.voiceassist.com
- Slack Channel: #voiceassist-incidents
- Incident Management: https://pagerduty.com
Appendix
A. Backup Schedule
| Backup Type | Frequency | Retention | Location |
|---|---|---|---|
| Full Database | Daily (2 AM) | 30 days | Local + S3 |
| Incremental WAL | Continuous | 7 days | Local |
| Configuration | Daily | 90 days | Git + S3 |
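A crontab sketch matching the schedule above; the script paths follow Appendix B, while the 3 AM configuration-snapshot time is an assumption (continuous WAL archiving is handled by PostgreSQL itself, not cron):

```bash
# m h dom mon dow  command
# Full database backup daily at 2 AM, then push off-site
0 2 * * * $HOME/VoiceAssist/ha-dr/backup/backup-database.sh && $HOME/VoiceAssist/ha-dr/backup/upload-backup.sh

# Configuration snapshot daily (3 AM is an assumed time)
0 3 * * * cd $HOME/VoiceAssist && docker-compose config > /var/backups/voiceassist/docker-compose.$(date +\%Y\%m\%d).yml
```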
B. Recovery Scripts
All recovery scripts are located in `~/VoiceAssist/ha-dr/`:

- `backup/backup-database.sh` - Manual backup
- `backup/restore-database.sh` - Manual restore
- `backup/verify-backup.sh` - Backup verification
- `backup/upload-backup.sh` - Off-site upload
- `testing/test-failover.sh` - Failover simulation
- `testing/comprehensive-health-check.sh` - Post-recovery validation
C. Change Log
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-11-21 | Phase 12 Team | Initial creation |
Document Control:
- Classification: Internal Use Only - CONFIDENTIAL
- Distribution: Engineering Team, Operations Team
- Review Frequency: Quarterly
- Next Review: 2026-02-21
Version: 1.0 | Last Updated: 2025-11-21 | Phase: Phase 12 - High Availability & Disaster Recovery