Cloud Architecture · Advanced
How do you design a multi-region high-availability architecture? What are RPO and RTO?
Key Metric Definitions
RTO (Recovery Time Objective): The maximum acceptable time between a disaster and the restoration of service. Example: RTO = 4 hours means the service may be down for up to 4 hours.
RPO (Recovery Point Objective): The maximum acceptable data loss, measured as the time between the last recoverable copy of the data and the disaster. Example: RPO = 1 hour means losing up to the most recent hour of writes is acceptable.
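To make the two definitions concrete, here is a minimal sketch that measures the RTO and RPO actually achieved in one incident from three timestamps, then checks them against the 4-hour and 1-hour targets used in the examples. The function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def achieved_rto_rpo(last_recoverable_point: datetime,
                     disaster_at: datetime,
                     restored_at: datetime) -> tuple[timedelta, timedelta]:
    """Return (downtime, data-loss window) for a single incident."""
    rto = restored_at - disaster_at              # how long the service was down
    rpo = disaster_at - last_recoverable_point   # writes in this window are lost
    return rto, rpo


def meets_objectives(rto: timedelta, rpo: timedelta,
                     rto_target: timedelta, rpo_target: timedelta) -> bool:
    return rto <= rto_target and rpo <= rpo_target


# Last good copy at 02:00, disaster at 02:45, service back at 05:30.
rto, rpo = achieved_rto_rpo(datetime(2024, 1, 1, 2, 0),
                            datetime(2024, 1, 1, 2, 45),
                            datetime(2024, 1, 1, 5, 30))
print(rto, rpo)  # 2:45:00 0:45:00
print(meets_objectives(rto, rpo, timedelta(hours=4), timedelta(hours=1)))  # True
```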
Multi-Region Architecture Patterns
Backup and Restore
- RTO: Hours, RPO: Hours
- Lowest cost: data is periodically backed up to another Region; infrastructure is provisioned and data is restored only after a disaster
- Best for: Non-critical systems with tight budgets
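In this pattern the RPO is set by how often backups run and how long each one takes to arrive in the other Region. A minimal sketch of that arithmetic, assuming the only recovery source is the most recent backup copy that has fully landed in the backup Region (both function names are hypothetical):

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta, copy_duration: timedelta) -> timedelta:
    """Upper bound on data loss for periodic cross-Region backups.

    Worst case: the primary Region fails just before the next backup copy
    finishes arriving, so everything written since the previous snapshot
    was taken is lost.
    """
    if backup_interval <= timedelta(0):
        raise ValueError("backup_interval must be positive")
    return backup_interval + copy_duration


def max_backup_interval(rpo_target: timedelta, copy_duration: timedelta) -> timedelta:
    """Longest backup interval that still satisfies rpo_target."""
    interval = rpo_target - copy_duration
    if interval <= timedelta(0):
        raise ValueError("copy takes longer than the RPO target allows")
    return interval


print(worst_case_rpo(timedelta(days=1), timedelta(hours=2)))           # 1 day, 2:00:00
print(max_backup_interval(timedelta(hours=1), timedelta(minutes=10)))  # 0:50:00
```

So daily backups cannot meet an RPO of 1 hour; with a 10-minute copy time, backups would have to run at least every 50 minutes.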
Pilot Light
- RTO: Tens of minutes, RPO: Minutes
- Keep only the core components running in the backup Region (e.g., a continuously replicated database); application servers are switched off or not yet provisioned, and are started and scaled up during a disaster
- Best for: Moderately important systems
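The "scale up quickly" step is usually scripted in advance. A minimal sketch, assuming the replicated database is an RDS cross-Region read replica and the application tier is an EC2 Auto Scaling group that idles at a small size in the backup Region. The function name and resource identifiers are hypothetical; `promote_read_replica` and `update_auto_scaling_group` are the boto3 RDS and Auto Scaling client methods. The clients are passed in so the step can be exercised without AWS access, and DNS cutover is left to Route 53 failover.

```python
from typing import Any


def activate_pilot_light(rds: Any, autoscaling: Any, *,
                         replica_id: str, asg_name: str, capacity: int) -> None:
    """Hypothetical failover step for a pilot-light backup Region.

    `rds` and `autoscaling` are assumed to be boto3 clients for the backup
    Region, e.g. boto3.client("rds", region_name="<backup-region>").
    """
    # 1. Promote the cross-Region read replica to a standalone, writable instance.
    #    Promotion is asynchronous; a real runbook would wait for it to finish
    #    before sending writes (not shown).
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # 2. Scale the idle application tier up to production size.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=capacity,
        MaxSize=capacity,
        DesiredCapacity=capacity,
    )
```

Pinning `MinSize`, `MaxSize`, and `DesiredCapacity` to one value is a simplification; a production group would usually be allowed to scale beyond the failover size.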
Warm Standby
- RTO: Minutes, RPO: Seconds
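- Backup Region runs a scaled-down but fully functional copy of the environment that continuously receives replicated data; it can take traffic immediately and scales up to full capacity during failover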
- Best for: Important business systems
Active-Active
- RTO: Seconds, RPO: Near-zero
- Both Regions serve traffic simultaneously; if one fails, its traffic is routed to the remaining Region
- Highest cost and the hardest data consistency problems (e.g., conflicting writes to the same record in different Regions)
- Best for: Globally critical systems (finance, e-commerce)
Key Technical Components
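- Data replication: Cross-Region replication, e.g., RDS cross-Region read replicas or DynamoDB global tables. Asynchronous replication keeps write latency low, but the RPO is roughly the replication lag; synchronous replication can bring the RPO to zero, but adds a cross-Region round trip to every write
- DNS failover: Route 53 health checks combined with a failover routing policy automatically direct DNS answers to the healthy Region (clients may keep using the old answer until the record's TTL expires)
- Global load balancing: Amazon CloudFront (origin failover) and AWS Global Accelerator (static anycast IPs with health-based routing to Regional endpoints)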
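The RPO difference between the two replication modes can be shown with a toy simulation. The class and method names are hypothetical, and the model does not reflect RDS or DynamoDB internals: with asynchronous replication, writes already acknowledged to clients but not yet shipped are lost on failover; with synchronous replication, a write is acknowledged only after the other Region has it, so no acknowledged write is lost.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicatedStore:
    """Toy primary/replica pair in two Regions (not a model of any AWS service)."""
    synchronous: bool
    primary: list[str] = field(default_factory=list)
    replica: list[str] = field(default_factory=list)
    in_flight: list[str] = field(default_factory=list)  # acknowledged, not yet replicated

    def write(self, record: str) -> None:
        self.primary.append(record)
        if self.synchronous:
            # Acknowledged only after the remote Region has the record.
            self.replica.append(record)
        else:
            # Acknowledged immediately; shipped to the replica later.
            self.in_flight.append(record)

    def replicate(self) -> None:
        """Drain the async backlog (a real system does this continuously)."""
        self.replica.extend(self.in_flight)
        self.in_flight.clear()

    def fail_over(self) -> list[str]:
        """Primary Region is lost; return the acknowledged writes that are gone."""
        lost = self.primary[len(self.replica):]
        self.primary = list(self.replica)
        self.in_flight.clear()
        return lost


for sync in (False, True):
    store = ReplicatedStore(synchronous=sync)
    store.write("order-1")
    store.replicate()
    store.write("order-2")
    store.write("order-3")  # disaster strikes before the next replicate()
    print(sync, store.fail_over())
# False ['order-2', 'order-3']
# True []
```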