RTO/RPO Documentation
Document Version: 1.0 | Last Updated: 2025-11-21 | Status: Production-Ready | Phase: Phase 12 - High Availability & Disaster Recovery
Executive Summary
This document defines the Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) for the VoiceAssist platform. These metrics establish the maximum acceptable downtime and data loss for various disaster scenarios.
Key Commitments:
- Primary RTO: 4 hours (complete system failure)
- Primary RPO: 24 hours (daily backups)
- Failover RTO: 30 minutes (with replication)
- Failover RPO: < 1 minute (with streaming replication)
Table of Contents
- Definitions
- RTO/RPO Objectives
- Disaster Scenarios
- Recovery Capabilities
- Measurement and Monitoring
- Continuous Improvement
Definitions
Recovery Time Objective (RTO)
Definition: The maximum acceptable length of time that a system can be down after a failure or disaster begins.
Measured from: Time the failure occurs (detection time is counted as part of the RTO, per the components below)
Measured to: Time when the system is fully operational and serving production traffic
Components of RTO:
- Detection Time - Time to detect the failure
- Response Time - Time to initiate recovery procedures
- Recovery Time - Time to execute recovery procedures
- Verification Time - Time to verify system is operational
Example: If a database fails at 2:00 AM and is restored at 3:30 AM, the actual recovery time is 90 minutes; that figure is then compared against the RTO target.
Recovery Point Objective (RPO)
Definition: The maximum acceptable amount of data that can be lost, measured in time.
Measured as: The age of the most recent recoverable copy of the data at the moment of failure
Example: If backups run daily at midnight and a failure occurs at 11:00 PM, up to 23 hours of data could be lost; the 24-hour RPO covers the worst case of a failure arriving just before the next backup.
Business Impact
| Metric | Low Impact | Medium Impact | High Impact | Critical Impact |
|---|---|---|---|---|
| RTO | > 24 hours | 4-24 hours | 1-4 hours | < 1 hour |
| RPO | > 7 days | 1-7 days | 1-24 hours | < 1 hour |
VoiceAssist Classification: High Impact (Healthcare application with PHI)
RTO/RPO Objectives
Tier 1: Critical Components (Database)
PostgreSQL Database
Scenario 1: Primary Database Failure (with replication)
- RTO: 30 minutes
- RPO: < 1 minute
- Recovery Method: Failover to streaming replica
- Business Impact: Minimal - brief service interruption
Scenario 2: Complete Database Loss (restore from backup)
- RTO: 4 hours
- RPO: 24 hours
- Recovery Method: Restore from encrypted backup
- Business Impact: Moderate - up to 1 day of data loss
Justification:
- Healthcare applications require high availability
- Streaming replication provides near-zero data loss
- Daily backups balance protection with operational overhead
Redis Cache
- RTO: 15 minutes
- RPO: 0 (cache can be regenerated)
- Recovery Method: Restart and repopulate from database
- Business Impact: Low - temporary performance degradation
Qdrant Vector Store
- RTO: 2 hours
- RPO: 24 hours
- Recovery Method: Restore from backup or rebuild from documents
- Business Impact: Moderate - search functionality degraded
Tier 2: Application Services
API Gateway (FastAPI)
- RTO: 15 minutes
- RPO: 0 (stateless service)
- Recovery Method: Container restart or redeploy
- Business Impact: High - service unavailable
Worker Services
- RTO: 30 minutes
- RPO: 0 (jobs can be reprocessed)
- Recovery Method: Container restart or redeploy
- Business Impact: Medium - background processing delayed
Tier 3: Infrastructure
Complete Infrastructure Loss
- RTO: 8 hours
- RPO: 24 hours
- Recovery Method: Provision new infrastructure + restore from backup
- Business Impact: Critical - complete service outage
Network Outage
- RTO: Depends on provider (escalate immediately)
- RPO: 0
- Recovery Method: Provider resolution + failover to alternate region
- Business Impact: Critical if single region
Disaster Scenarios
Scenario Matrix
| Scenario | Likelihood | Impact | RTO | RPO | Mitigation |
|---|---|---|---|---|---|
| Database server failure | Medium | High | 30 min | < 1 min | Streaming replication |
| Database corruption | Low | High | 2 hours | 24 hours | Daily backups (+ planned PITR) |
| Complete data center loss | Very Low | Critical | 8 hours | 24 hours | Off-site backups |
| Ransomware attack | Low | Critical | 6 hours | 24 hours | Immutable backups |
| Application container crash | Medium | Medium | 15 min | 0 | Auto-restart + monitoring |
| Network partition | Low | High | 30 min | 0 | Multiple availability zones |
| Human error (accidental deletion) | Medium | Medium | 2 hours | 24 hours | Audit logging + backups |
| Hardware failure | Medium | Medium | 4 hours | 24 hours | Cloud infrastructure |
| Power outage | Low | High | Immediate* | 0 | Battery backup + generator |
| Natural disaster | Very Low | Critical | 8 hours | 24 hours | Geographic redundancy |
*Power outages are typically handled by data center infrastructure
RTO/RPO by Scenario
Scenario 1: Primary Database Failure
Detection: Automatic (health checks fail within 30 seconds)
RTO Breakdown:
- Detection: 30 seconds
- Notification: 1 minute
- Decision to failover: 5 minutes
- Replica promotion: 30 seconds
- Application reconfiguration: 5 minutes
- Verification: 5 minutes
- Total: 17 minutes (well within 30-minute target)
RPO Analysis:
- Streaming replication lag: typically < 1 second
- Maximum lag before failover: < 5 seconds
- Data loss: < 1 minute of transactions, if any (measurable with the standby-side check below)
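Before committing to the failover decision, the replica's staleness can be confirmed directly on the standby, which works even while the primary is unreachable. A minimal sketch:

```sql
-- Run on the standby: how stale is the replayed data?
-- Note: on an idle primary this value grows even with zero real lag,
-- because no new transactions are arriving to replay.
SELECT
  now() - pg_last_xact_replay_timestamp() AS replication_delay,
  pg_last_wal_replay_lsn() AS last_replayed_lsn;
```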
Scenario 2: Complete Data Center Loss
Detection: Automatic (all health checks fail)
RTO Breakdown:
- Detection: 5 minutes
- Notification: 5 minutes
- Provision new infrastructure: 2 hours
- Download and verify backup: 30 minutes
- Restore database: 45 minutes
- Start services: 15 minutes
- Verification and testing: 30 minutes
- DNS/load balancer updates: 15 minutes
- Total: 4 hours 25 minutes (exceeds 4-hour target)
Improvement Actions:
- Pre-provision standby infrastructure (reduce to 2 hours)
- Use faster backup restoration (reduce to 1.5 hours)
RPO Analysis:
- Last backup: Up to 24 hours old
- Data loss: All transactions since last backup
- Maximum: 24 hours
Scenario 3: Data Corruption
Detection: Manual (user reports or data validation)
RTO Breakdown:
- Detection: 15 minutes (average)
- Investigation: 30 minutes
- Decision to restore: 15 minutes
- Identify clean backup: 15 minutes
- Stop services: 5 minutes
- Restore database: 45 minutes
- Verify data integrity: 15 minutes (sample checks sketched at the end of this scenario)
- Restart services: 10 minutes
- Total: 2 hours 30 minutes (within 4-hour target)
RPO Analysis:
- Restore from backup before corruption
- Depends on when corruption occurred
- Maximum: 24 hours
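The integrity-verification step goes faster with a pre-agreed checklist of queries. A minimal sketch, assuming hypothetical table names (audit_log, users); replace with the platform's actual schema:

```sql
-- Post-restore sanity checks (table and column names are assumptions).
SELECT max(created_at) AS newest_row FROM audit_log;     -- should predate the corruption window
SELECT count(*) AS user_count FROM users;                -- compare against last known-good count
SELECT pg_database_size(current_database()) AS db_bytes; -- rough comparison vs. backup manifest
```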
Recovery Capabilities
Current Capabilities
High Availability (HA)
PostgreSQL Streaming Replication:
- Primary-replica setup with automatic replication
- Synchronous or asynchronous replication (configurable)
- Automatic failure detection via health checks
- Manual or automatic failover (promotion)
Benefits:
- Near-zero data loss (RPO < 1 minute)
- Fast failover (RTO < 30 minutes)
- Read scalability (queries can be distributed to replica)
Limitations:
- Manual failover process (can be automated with Patroni/stolon); a promotion sketch follows this list
- Single replica (no automatic multi-replica configuration)
- Same data center (no geographic redundancy)
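As a reference for the manual process, promotion itself is a single call on the replica. A minimal sketch using built-in PostgreSQL functions (available in version 12+):

```sql
-- Run on the replica to promote it to primary; waits up to 60 seconds.
SELECT pg_promote(wait := true, wait_seconds := 60);

-- Confirm promotion: false means the node now accepts writes as primary.
SELECT pg_is_in_recovery();
```

Application connection strings still need to be repointed (or a virtual IP moved) afterward, which is what the five-minute application-reconfiguration step in Scenario 1 accounts for.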
Backup and Restore
Daily Encrypted Backups:
- Full database dumps using pg_dump
- AES-256 encryption (GPG)
- SHA-256 checksum verification
- Off-site storage (S3, Nextcloud, or local)
- 30-day retention
Benefits:
- Protection against data corruption
- Protection against ransomware
- Restore to the state of any retained daily backup (true point-in-time recovery requires the WAL archiving planned under Future Enhancements)
- Compliance with data retention requirements
Limitations:
- 24-hour RPO (daily backups)
- Restore time depends on database size (currently ~45 minutes)
- Manual restore process
Monitoring and Alerting
Health Checks:
- Database connectivity checks (every 30 seconds)
- Replication lag monitoring
- Service availability monitoring
- Disk space monitoring (database-level size queries sketched below)
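Replication lag and backup age have dedicated queries later in this document. For the disk-space item, database-level growth can be sampled directly; a sketch (OS-level free-space checks still belong to the infrastructure monitoring layer):

```sql
-- Current database size, to correlate with OS-level disk alerts.
SELECT pg_size_pretty(pg_database_size(current_database())) AS database_size;

-- Five largest tables: useful triage when the 80% disk warning fires.
SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 5;
```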
Alerts:
- PagerDuty/Slack integration (when configured)
- Email notifications
- Automated incident creation
Future Enhancements
Short-Term (1-3 months)
- Continuous Archiving (PITR)
  - Implement WAL archiving for point-in-time recovery
  - Benefit: Reduce RPO from 24 hours to < 1 minute
  - Implementation: Configure archive_command in PostgreSQL (see the sketch after this list)
- Automated Failover
  - Deploy Patroni or stolon for automatic failover
  - Benefit: Reduce RTO from 30 minutes to < 5 minutes
  - Implementation: Patroni + etcd/consul cluster
- Multi-Region Replication
  - Configure a replica in a different geographic region
  - Benefit: Protection against regional disasters
  - Implementation: Cross-region streaming replication + VPN
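A minimal sketch of the archive_command configuration, with an assumed archive path; archive_mode only takes effect after a server restart:

```sql
-- Sketch: enable WAL archiving for PITR. The archive path is an assumption,
-- and the archive destination should itself be replicated off-site.
ALTER SYSTEM SET wal_level = 'replica';   -- already required for streaming replication
ALTER SYSTEM SET archive_mode = 'on';     -- requires a server restart
ALTER SYSTEM SET archive_command =
  'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f';
SELECT pg_reload_conf();                  -- picks up archive_command; archive_mode waits for the restart
```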
Medium-Term (3-6 months)
- Backup Optimization
  - Implement incremental backups
  - Parallel backup/restore processes
  - Benefit: Reduce restore time by 50%
- Read Replicas
  - Add multiple read replicas for load distribution
  - Benefit: Improved read scalability and HA
- Automated DR Testing
  - Monthly automated failover drills
  - Automated restore validation
  - Benefit: Ensure DR procedures remain effective
Long-Term (6-12 months)
- Active-Active Configuration
  - Multi-master database setup (with conflict resolution)
  - Benefit: Near-zero downtime and data loss
- Global Load Balancing
  - Multi-region deployment with a global load balancer
  - Benefit: Geographic redundancy + reduced latency
Measurement and Monitoring
Key Metrics
RTO Metrics
Measured: Time from failure to full recovery
Dashboard Metrics:
- Average RTO (last 30 days)
- Maximum RTO (last 30 days)
- RTO by scenario type
- RTO vs. target comparison
Calculation:
RTO = Recovery_Time - Failure_Time
Example Query (from audit logs):
```sql
SELECT
  incident_type,
  AVG(EXTRACT(EPOCH FROM (recovery_time - failure_time))) AS avg_rto_seconds,
  MAX(EXTRACT(EPOCH FROM (recovery_time - failure_time))) AS max_rto_seconds
FROM incident_log
WHERE incident_date >= NOW() - INTERVAL '30 days'
GROUP BY incident_type;
```
RPO Metrics
Measured: Age of data at time of recovery
Dashboard Metrics:
- Last backup timestamp
- Replication lag (real-time)
- Data loss estimation (during incidents)
- RPO vs. target comparison
Calculation:
RPO = Recovery_Data_Timestamp - Latest_Available_Data_Timestamp
Example Query (replication lag, run on the primary):
```sql
-- replay_lag is reported per connected standby (PostgreSQL 10+).
-- On a standby, use now() - pg_last_xact_replay_timestamp() instead;
-- pg_last_xact_replay_timestamp() returns NULL on a primary.
SELECT
  application_name,
  client_addr,
  state,
  EXTRACT(EPOCH FROM replay_lag) AS lag_seconds
FROM pg_stat_replication;
```
Availability Metrics
Service Level Agreement (SLA):
- Target Availability: 99.9% (8.76 hours downtime/year)
- Actual Availability: Measured monthly
Calculation:
Availability = (Total_Time - Downtime) / Total_Time * 100%
Example:
- Month: 720 hours (30 days * 24 hours)
- Downtime: 1 hour
- Availability: (720 - 1) / 720 * 100% = 99.86%
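Assuming incidents land in the same incident_log table used for the RTO query above, the availability arithmetic can be automated for the dashboard (a sketch; column names are assumptions):

```sql
-- Monthly availability from logged incidents; approximates each month
-- as 30 days, which is adequate for trend dashboards.
SELECT
  date_trunc('month', failure_time) AS month,
  round((100.0 * (1 - SUM(EXTRACT(EPOCH FROM (recovery_time - failure_time)))
                      / (30 * 24 * 3600)))::numeric, 2) AS availability_pct
FROM incident_log
GROUP BY 1
ORDER BY 1;
```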
Monitoring Tools
Prometheus Metrics
Database Metrics:
```
# Replication lag
pg_stat_replication_lag_seconds

# Backup age
pg_backup_age_seconds

# Database availability
pg_up
```
Application Metrics:
```
# Service uptime
service_uptime_seconds

# Request success rate
http_requests_success_rate
```
Grafana Dashboards
- HA/DR Dashboard
  - Replication status
  - Backup status
  - Recovery time trends
  - Availability percentage
- Incident Dashboard
  - Active incidents
  - RTO/RPO tracking
  - Recovery progress
Alert Thresholds
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Replication Lag | > 10 seconds | > 60 seconds | Check network, investigate primary load |
| Backup Age | > 26 hours | > 48 hours | Investigate backup job, manual backup |
| Database Availability | N/A | Down | Initiate failover procedures |
| Disk Space | > 80% | > 90% | Cleanup old backups, expand storage |
| RTO Exceeded | N/A | > target | Post-mortem, process improvement |
Continuous Improvement
Review Cycle
Quarterly Reviews:
- Review RTO/RPO objectives
- Analyze incident trends
- Update disaster recovery procedures
- Test DR plans
Annual Reviews:
- Full DR drill (complete system recovery)
- Capacity planning
- Infrastructure upgrades
- Budget planning for HA/DR improvements
Incident Analysis
Post-Incident Review. After each incident:
- Calculate actual RTO and RPO
- Compare to targets
- Identify improvement opportunities
- Update procedures
- Implement improvements
Template:
```markdown
## Incident: [Name]

**Date:** [Date]
**Duration:** [Duration]
**RTO Target:** [Target] | **RTO Actual:** [Actual]
**RPO Target:** [Target] | **RPO Actual:** [Actual]

### Root Cause
[Description]

### Timeline
[Event timeline]

### Impact
[Business impact]

### Action Items
- [ ] [Action 1]
- [ ] [Action 2]
```
Performance Trends
Track Over Time:
- RTO Trends
  - Are we getting faster at recovery?
  - Which scenarios need improvement?
- RPO Trends
  - Is replication lag increasing?
  - Are backups completing on time?
- Availability Trends
  - Are we meeting SLA targets?
  - What are the common failure modes?
Capacity Planning
Annual Assessment:
- Database growth rate
- Backup storage requirements
- Recovery time scalability
- Infrastructure capacity
Example:
```
Current Database Size:   100 GB
Growth Rate:             20% per year
Restore Time:            45 minutes

Projected (Year 2):
Database Size:           120 GB
Estimated Restore Time:  54 minutes

Action: Implement incremental backups to maintain < 1 hour restore
```
Appendix
A. RTO/RPO Calculation Examples
Example 1: Database Failover
```
Failure Time:                2025-01-15 14:30:00
Detection Time:              2025-01-15 14:30:30  (30 seconds)
Failover Started:            2025-01-15 14:35:00  (5 minutes decision)
Replica Promoted:            2025-01-15 14:35:30  (30 seconds promotion)
App Reconfigured:            2025-01-15 14:40:00  (5 minutes reconfiguration)
Service Restored:            2025-01-15 14:45:00  (5 minutes verification)

RTO = 14:45:00 - 14:30:00 = 15 minutes ✓ (within 30-minute target)

Last Replicated Transaction: 14:29:58
RPO = 14:30:00 - 14:29:58 = 2 seconds ✓ (within 1-minute target)
```
Example 2: Restore from Backup
```
Failure Time:           2025-01-15 10:00:00
Last Backup:            2025-01-15 02:00:00  (daily backup)
Restoration Started:    2025-01-15 10:30:00
Restoration Completed:  2025-01-15 11:15:00
Service Restored:       2025-01-15 11:30:00

RTO = 11:30:00 - 10:00:00 = 1.5 hours ✓ (within 4-hour target)
RPO = 10:00:00 - 02:00:00 = 8 hours ✓ (within 24-hour target)

Data Loss: All transactions between 02:00 and 10:00 (8 hours)
```
B. Testing Schedule
| Test Type | Frequency | Last Performed | Next Scheduled | Owner |
|---|---|---|---|---|
| Backup Verification | Weekly | 2025-11-21 | 2025-11-28 | Ops Team |
| Failover Test | Quarterly | 2025-11-21 | 2026-02-21 | DB Admin |
| Full DR Drill | Annually | N/A | 2026-06-01 | Engineering Manager |
| RTO/RPO Review | Quarterly | 2025-11-21 | 2026-02-21 | Leadership |
C. References
- DISASTER_RECOVERY_RUNBOOK.md - Step-by-step recovery procedures
- ha-dr/backup/ - Backup and restore scripts
- ha-dr/testing/ - DR testing scripts
- HIPAA_COMPLIANCE_MATRIX.md - Compliance documentation
Document Control:
- Classification: Internal Use Only - CONFIDENTIAL
- Distribution: Engineering Team, Operations Team, Management
- Review Frequency: Quarterly
- Next Review: 2026-02-21
Version: 1.0 | Last Updated: 2025-11-21 | Phase: Phase 12 - High Availability & Disaster Recovery