
RTO/RPO Documentation

Document Version: 1.0
Last Updated: 2025-11-21
Status: Production-Ready
Phase: Phase 12 - High Availability & Disaster Recovery


Executive Summary

This document defines the Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) for the VoiceAssist platform. These metrics establish the maximum acceptable downtime and data loss for various disaster scenarios.

Key Commitments:

  • Primary RTO: 4 hours (complete system failure)
  • Primary RPO: 24 hours (daily backups)
  • Failover RTO: 30 minutes (with replication)
  • Failover RPO: < 1 minute (with streaming replication)

Table of Contents

  1. Definitions
  2. RTO/RPO Objectives
  3. Disaster Scenarios
  4. Recovery Capabilities
  5. Measurement and Monitoring
  6. Continuous Improvement

Definitions

Recovery Time Objective (RTO)

Definition: The maximum acceptable length of time that a system can be down after a failure or disaster begins.

Measured from: Time of failure detection
Measured to: Time when the system is fully operational and serving production traffic

Components of RTO:

  1. Detection Time - Time to detect the failure
  2. Response Time - Time to initiate recovery procedures
  3. Recovery Time - Time to execute recovery procedures
  4. Verification Time - Time to verify system is operational

Example: If a database fails at 2:00 AM and is restored at 3:30 AM, the RTO is 90 minutes.

Recovery Point Objective (RPO)

Definition: The maximum acceptable amount of data that can be lost, measured in time.

Measured as: The interval between the last recoverable data point (for example, the last backup or last replicated transaction) and the time of failure

Example: If backups run daily at midnight and a failure occurs at 11:00 PM, up to 23 hours of data could be lost (RPO = 24 hours).

Business Impact

| Metric | Low Impact | Medium Impact | High Impact | Critical Impact |
| ------ | ---------- | ------------- | ----------- | --------------- |
| RTO    | > 24 hours | 4-24 hours    | 1-4 hours   | < 1 hour        |
| RPO    | > 7 days   | 1-7 days      | 1-24 hours  | < 1 hour        |

VoiceAssist Classification: High Impact (Healthcare application with PHI)


RTO/RPO Objectives

Tier 1: Critical Components (Database)

PostgreSQL Database

Scenario 1: Primary Database Failure (with replication)

  • RTO: 30 minutes
  • RPO: < 1 minute
  • Recovery Method: Failover to streaming replica
  • Business Impact: Minimal - brief service interruption

Scenario 2: Complete Database Loss (restore from backup)

  • RTO: 4 hours
  • RPO: 24 hours
  • Recovery Method: Restore from encrypted backup
  • Business Impact: Moderate - up to 1 day of data loss

Justification:

  • Healthcare applications require high availability
  • Streaming replication provides near-zero data loss
  • Daily backups balance protection with operational overhead

Redis Cache

  • RTO: 15 minutes
  • RPO: 0 (cache can be regenerated)
  • Recovery Method: Restart and repopulate from database
  • Business Impact: Low - temporary performance degradation

Qdrant Vector Store

  • RTO: 2 hours
  • RPO: 24 hours
  • Recovery Method: Restore from backup or rebuild from documents
  • Business Impact: Moderate - search functionality degraded

Tier 2: Application Services

API Gateway (FastAPI)

  • RTO: 15 minutes
  • RPO: 0 (stateless service)
  • Recovery Method: Container restart or redeploy
  • Business Impact: High - service unavailable

Worker Services

  • RTO: 30 minutes
  • RPO: 0 (jobs can be reprocessed)
  • Recovery Method: Container restart or redeploy
  • Business Impact: Medium - background processing delayed

Tier 3: Infrastructure

Complete Infrastructure Loss

  • RTO: 8 hours
  • RPO: 24 hours
  • Recovery Method: Provision new infrastructure + restore from backup
  • Business Impact: Critical - complete service outage

Network Outage

  • RTO: Depends on provider (escalate immediately)
  • RPO: 0
  • Recovery Method: Provider resolution + failover to alternate region
  • Business Impact: Critical if single region

Disaster Scenarios

Scenario Matrix

| Scenario | Likelihood | Impact | RTO | RPO | Mitigation |
| -------- | ---------- | ------ | --- | --- | ---------- |
| Database server failure | Medium | High | 30 min | < 1 min | Streaming replication |
| Database corruption | Low | High | 2 hours | 24 hours | Daily backups + PITR |
| Complete data center loss | Very Low | Critical | 8 hours | 24 hours | Off-site backups |
| Ransomware attack | Low | Critical | 6 hours | 24 hours | Immutable backups |
| Application container crash | Medium | Medium | 15 min | 0 | Auto-restart + monitoring |
| Network partition | Low | High | 30 min | 0 | Multiple availability zones |
| Human error (accidental deletion) | Medium | Medium | 2 hours | 24 hours | Audit logging + backups |
| Hardware failure | Medium | Medium | 4 hours | 24 hours | Cloud infrastructure |
| Power outage | Low | High | Immediate* | 0 | Battery backup + generator |
| Natural disaster | Very Low | Critical | 8 hours | 24 hours | Geographic redundancy |

*Power outages are typically handled by data center infrastructure

RTO/RPO by Scenario

Scenario 1: Primary Database Failure

Detection: Automatic (health checks fail within 30 seconds)

RTO Breakdown:

  1. Detection: 30 seconds
  2. Notification: 1 minute
  3. Decision to failover: 5 minutes
  4. Replica promotion: 30 seconds (see the promotion sketch after this scenario)
  5. Application reconfiguration: 5 minutes
  6. Verification: 5 minutes
  7. Total: 17 minutes (well within 30-minute target)

RPO Analysis:

  • Streaming replication lag: typically < 1 second
  • Maximum lag before failover: < 5 seconds
  • Data loss: < 1 minute of transactions (if any)
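
The promotion step in the breakdown above can be executed directly on the standby. A minimal sketch, assuming PostgreSQL 12 or later (pg_promote is not available on older versions):

-- Run on the standby once the decision to fail over has been made.
SELECT pg_promote(wait => true, wait_seconds => 60);

-- Confirm the node has left recovery and now accepts writes.
SELECT pg_is_in_recovery();  -- expected: false after a successful promotion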

Scenario 2: Complete Data Center Loss

Detection: Automatic (all health checks fail)

RTO Breakdown:

  1. Detection: 5 minutes
  2. Notification: 5 minutes
  3. Provision new infrastructure: 2 hours
  4. Download and verify backup: 30 minutes
  5. Restore database: 45 minutes
  6. Start services: 15 minutes
  7. Verification and testing: 30 minutes
  8. DNS/load balancer updates: 15 minutes
  9. Total: 4 hours 25 minutes (exceeds the 4-hour target)

Improvement Actions:

  • Pre-provision standby infrastructure (reduce to 2 hours)
  • Use faster backup restoration (reduce to 1.5 hours)

RPO Analysis:

  • Last backup: Up to 24 hours old
  • Data loss: All transactions since last backup
  • Maximum: 24 hours

Scenario 3: Data Corruption

Detection: Manual (user reports or data validation)

RTO Breakdown:

  1. Detection: 15 minutes (average)
  2. Investigation: 30 minutes
  3. Decision to restore: 15 minutes
  4. Identify clean backup: 15 minutes
  5. Stop services: 5 minutes
  6. Restore database: 45 minutes
  7. Verify data integrity: 15 minutes (see the checksum query after this scenario)
  8. Restart services: 10 minutes
  9. Total: 2 hours 30 minutes (within 4-hour target)

RPO Analysis:

  • Restore from backup before corruption
  • Depends on when corruption occurred
  • Maximum: 24 hours
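
For the detection and integrity-verification steps above, clusters running with data checksums can surface corruption directly from statistics. A minimal sketch, assuming PostgreSQL 12 or later with data checksums enabled (an assumption about the deployment, not a documented setting):

-- Databases that have reported checksum failures since statistics were last reset.
SELECT datname, checksum_failures, checksum_last_failure
FROM pg_stat_database
WHERE checksum_failures > 0;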

Recovery Capabilities

Current Capabilities

High Availability (HA)

PostgreSQL Streaming Replication:

  • Primary-replica setup with automatic replication
  • Synchronous or asynchronous replication (configurable)
  • Automatic failure detection via health checks
  • Manual or automatic failover (promotion)

Benefits:

  • Near-zero data loss (RPO < 1 minute)
  • Fast failover (RTO < 30 minutes)
  • Read scalability (queries can be distributed to replica)

Limitations:

  • Manual failover process (can be automated with Patroni/stolon)
  • Single replica (no automatic multi-replica configuration)
  • Same data center (no geographic redundancy)
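
Replication health can also be verified from the standby itself (the primary-side view appears in the Measurement and Monitoring section). A minimal sketch, assuming a standard streaming-replication standby:

-- Run on the standby: how far WAL receipt and replay are behind.
SELECT pg_last_wal_receive_lsn()                AS received_lsn,
       pg_last_wal_replay_lsn()                 AS replayed_lsn,
       now() - pg_last_xact_replay_timestamp()  AS replay_delay;
-- Note: replay_delay grows while the primary is idle; read it alongside the LSN columns.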

Backup and Restore

Daily Encrypted Backups:

  • Full database dumps using pg_dump
  • AES-256 encryption (GPG)
  • SHA-256 checksum verification
  • Off-site storage (S3, Nextcloud, or local)
  • 30-day retention

Benefits:

  • Protection against data corruption
  • Protection against ransomware
  • Restore to any retained daily backup within the 30-day window
  • Compliance with data retention requirements

Limitations:

  • 24-hour RPO (daily backups)
  • Restore time depends on database size (currently ~45 minutes)
  • Manual restore process

Monitoring and Alerting

Health Checks:

  • Database connectivity checks (every 30 seconds)
  • Replication lag monitoring
  • Service availability monitoring
  • Disk space monitoring

Alerts:

  • PagerDuty/Slack integration (when configured)
  • Email notifications
  • Automated incident creation

Future Enhancements

Short-Term (1-3 months)

  1. Continuous Archiving (PITR)

    • Implement WAL archiving for point-in-time recovery
    • Benefit: Reduce RPO from 24 hours to < 1 minute
    • Implementation: Configure archive_command in PostgreSQL (a configuration sketch follows this list)
  2. Automated Failover

    • Deploy Patroni or stolon for automatic failover
    • Benefit: Reduce RTO from 30 minutes to < 5 minutes
    • Implementation: Patroni + etcd/consul cluster
  3. Multi-Region Replication

    • Configure replica in different geographic region
    • Benefit: Protection against regional disasters
    • Implementation: Cross-region streaming replication + VPN
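
For item 1, a minimal configuration sketch for enabling WAL archiving. The archive path shown is illustrative only (an assumption, not the project's actual destination), and the first two settings only take effect after a server restart:

-- Illustrative PITR prerequisites; adjust values per environment.
ALTER SYSTEM SET wal_level = 'replica';      -- restart required
ALTER SYSTEM SET archive_mode = 'on';        -- restart required
ALTER SYSTEM SET archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f';
SELECT pg_reload_conf();                     -- archive_command is reloadable without a restart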

Medium-Term (3-6 months)

  1. Backup Optimization

    • Implement incremental backups
    • Parallel backup/restore processes
    • Benefit: Reduce restore time by 50%
  2. Read Replicas

    • Add multiple read replicas for load distribution
    • Benefit: Improved read scalability and HA
  3. Automated DR Testing

    • Monthly automated failover drills
    • Automated restore validation
    • Benefit: Ensure DR procedures remain effective

Long-Term (6-12 months)

  1. Active-Active Configuration

    • Multi-master database setup (with conflict resolution)
    • Benefit: Zero downtime, zero data loss
  2. Global Load Balancing

    • Multi-region deployment with global load balancer
    • Benefit: Geographic redundancy + reduced latency

Measurement and Monitoring

Key Metrics

RTO Metrics

Measured: Time from failure to full recovery

Dashboard Metrics:

  • Average RTO (last 30 days)
  • Maximum RTO (last 30 days)
  • RTO by scenario type
  • RTO vs. target comparison

Calculation:

RTO = Recovery_Time - Failure_Time

Example Query (from audit logs):

SELECT
  incident_type,
  AVG(EXTRACT(EPOCH FROM (recovery_time - failure_time))) AS avg_rto_seconds,
  MAX(EXTRACT(EPOCH FROM (recovery_time - failure_time))) AS max_rto_seconds
FROM incident_log
WHERE incident_date >= NOW() - INTERVAL '30 days'
GROUP BY incident_type;

RPO Metrics

Measured: Age of data at time of recovery

Dashboard Metrics:

  • Last backup timestamp
  • Replication lag (real-time)
  • Data loss estimation (during incidents)
  • RPO vs. target comparison

Calculation:

RPO = Failure_Time - Last_Recoverable_Data_Timestamp

Example Query (replication lag):

-- Run on the primary; replay_lag requires PostgreSQL 10 or later.
SELECT
  application_name,
  client_addr,
  state,
  EXTRACT(EPOCH FROM replay_lag) AS lag_seconds
FROM pg_stat_replication;

Availability Metrics

Service Level Agreement (SLA):

  • Target Availability: 99.9% (8.76 hours downtime/year)
  • Actual Availability: Measured monthly

Calculation:

Availability = (Total_Time - Downtime) / Total_Time * 100%

Example:

  • Month: 720 hours (30 days * 24 hours)
  • Downtime: 1 hour
  • Availability: (720 - 1) / 720 * 100% = 99.86%
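
The same arithmetic as a quick SQL check (purely illustrative):

-- 720-hour month with 1 hour of downtime.
SELECT round((720.0 - 1) / 720 * 100, 2) AS availability_pct;  -- 99.86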

Monitoring Tools

Prometheus Metrics

Database Metrics:

# Replication lag
pg_stat_replication_lag_seconds

# Backup age
pg_backup_age_seconds

# Database availability
pg_up

Application Metrics:

# Service uptime
service_uptime_seconds

# Request success rate
http_requests_success_rate

Grafana Dashboards

  1. HA/DR Dashboard

    • Replication status
    • Backup status
    • Recovery time trends
    • Availability percentage
  2. Incident Dashboard

    • Active incidents
    • RTO/RPO tracking
    • Recovery progress

Alert Thresholds

| Metric | Warning | Critical | Action |
| ------ | ------- | -------- | ------ |
| Replication Lag | > 10 seconds | > 60 seconds | Check network, investigate primary load |
| Backup Age | > 26 hours | > 48 hours | Investigate backup job, take manual backup |
| Database Availability | N/A | Down | Initiate failover procedures |
| Disk Space | > 80% | > 90% | Clean up old backups, expand storage |
| RTO Exceeded | N/A | > target | Post-mortem, process improvement |
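
The replication-lag thresholds can be evaluated against the same view used for RPO monitoring. A minimal sketch, assuming PostgreSQL 10 or later, where pg_stat_replication exposes replay_lag:

-- Standbys currently breaching the 10-second warning threshold.
SELECT application_name, client_addr, replay_lag
FROM pg_stat_replication
WHERE replay_lag > interval '10 seconds';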

Continuous Improvement

Review Cycle

Quarterly Reviews:

  • Review RTO/RPO objectives
  • Analyze incident trends
  • Update disaster recovery procedures
  • Test DR plans

Annual Reviews:

  • Full DR drill (complete system recovery)
  • Capacity planning
  • Infrastructure upgrades
  • Budget planning for HA/DR improvements

Incident Analysis

Post-Incident Review. After each incident:

  1. Calculate actual RTO and RPO
  2. Compare to targets
  3. Identify improvement opportunities
  4. Update procedures
  5. Implement improvements

Template:

## Incident: [Name]

**Date:** [Date]
**Duration:** [Duration]
**RTO Target:** [Target]
**RTO Actual:** [Actual]
**RPO Target:** [Target]
**RPO Actual:** [Actual]

### Root Cause
[Description]

### Timeline
[Event timeline]

### Impact
[Business impact]

### Action Items
- [ ] [Action 1]
- [ ] [Action 2]

Track Over Time:

  1. RTO Trends

    • Are we getting faster at recovery?
    • Which scenarios need improvement?
  2. RPO Trends

    • Is replication lag increasing?
    • Are backups completing on time?
  3. Availability Trends

    • Are we meeting SLA targets?
    • What are the common failure modes?

Capacity Planning

Annual Assessment:

  • Database growth rate
  • Backup storage requirements
  • Recovery time scalability
  • Infrastructure capacity

Example:

Current Database Size: 100 GB
Growth Rate: 20% per year
Restore Time: 45 minutes

Projected (Year 2):
Database Size: 120 GB
Estimated Restore Time: 54 minutes
Action: Implement incremental backups to maintain < 1 hour restore
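
The current-size input to this projection can be measured directly; a minimal sketch (the 100 GB figure above is the documented baseline, not the output of this query):

-- Size of the connected database.
SELECT pg_size_pretty(pg_database_size(current_database())) AS current_db_size;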

Appendix

A. RTO/RPO Calculation Examples

Example 1: Database Failover

Failure Time: 2025-01-15 14:30:00
Detection Time: 2025-01-15 14:30:30 (30 seconds)
Failover Started: 2025-01-15 14:35:00 (5 minutes decision)
Replica Promoted: 2025-01-15 14:35:30 (30 seconds promotion)
App Reconfigured: 2025-01-15 14:40:00 (5 minutes reconfiguration)
Service Restored: 2025-01-15 14:45:00 (5 minutes verification)

RTO = 14:45:00 - 14:30:00 = 15 minutes ✓ (within 30-minute target)

Last Replicated Transaction: 14:29:58
RPO = 14:30:00 - 14:29:58 = 2 seconds ✓ (within 1-minute target)

Example 2: Restore from Backup

Failure Time: 2025-01-15 10:00:00
Last Backup: 2025-01-15 02:00:00 (daily backup)
Restoration Started: 2025-01-15 10:30:00
Restoration Completed: 2025-01-15 11:15:00
Service Restored: 2025-01-15 11:30:00

RTO = 11:30:00 - 10:00:00 = 1.5 hours ✓ (within 4-hour target)

RPO = 10:00:00 - 02:00:00 = 8 hours ✓ (within 24-hour target)
Data Loss: All transactions between 02:00 and 10:00 (8 hours)
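
The arithmetic can be reproduced with plain interval subtraction, for instance for Example 2:

-- Example 2: RTO and RPO as intervals.
SELECT timestamp '2025-01-15 11:30:00' - timestamp '2025-01-15 10:00:00' AS rto,  -- 01:30:00
       timestamp '2025-01-15 10:00:00' - timestamp '2025-01-15 02:00:00' AS rpo;  -- 08:00:00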

B. Testing Schedule

| Test Type | Frequency | Last Performed | Next Scheduled | Owner |
| --------- | --------- | -------------- | -------------- | ----- |
| Backup Verification | Weekly | 2025-11-21 | 2025-11-28 | Ops Team |
| Failover Test | Quarterly | 2025-11-21 | 2026-02-21 | DB Admin |
| Full DR Drill | Annually | N/A | 2026-06-01 | Engineering Manager |
| RTO/RPO Review | Quarterly | 2025-11-21 | 2026-02-21 | Leadership |

C. References

  • DISASTER_RECOVERY_RUNBOOK.md - Step-by-step recovery procedures
  • ha-dr/backup/ - Backup and restore scripts
  • ha-dr/testing/ - DR testing scripts
  • HIPAA_COMPLIANCE_MATRIX.md - Compliance documentation

Document Control:

  • Classification: Internal Use Only - CONFIDENTIAL
  • Distribution: Engineering Team, Operations Team, Management
  • Review Frequency: Quarterly
  • Next Review: 2026-02-21

Version: 1.0
Last Updated: 2025-11-21
Phase: Phase 12 - High Availability & Disaster Recovery
