What are the common reliability design patterns in distributed systems?
Why Reliability Patterns Are Needed
In distributed systems, any service dependency can fail. If a dependency failure is left unprotected, it can cascade: callers block or fail in turn, and the entire system can collapse.
Core Reliability Patterns
Circuit Breaker: Monitors the failure rate of service calls and automatically "trips" when it exceeds a threshold, stopping requests to the downstream service until it recovers.
Three states:
- Closed (normal): Requests pass through normally
- Open (tripped): Returns errors immediately without calling downstream
- Half-Open (probing): Allows a small number of requests to test recovery
Tools: Resilience4j (Java), Polly (.NET), opossum (Node.js)
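The three states can be sketched in a few lines of Python. This is a minimal illustration, not how any of the libraries above implement it: it counts consecutive failures instead of tracking a failure rate, lets a single call act as the half-open probe, and is not thread-safe. The names CircuitBreaker, CircuitOpenError, failure_threshold and reset_timeout are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open before probing
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None
        self.state = "closed"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"  # let this call through as a probe
            else:
                raise CircuitOpenError("circuit is open")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe, or too many consecutive failures, (re)opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "closed"
```

A caller would wrap each downstream call as breaker.call(fetch_profile, user_id), where fetch_profile is a hypothetical client function, and treat CircuitOpenError as a signal to fail fast or fall back.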
Timeout: Set timeouts on all external calls to prevent slow dependencies from exhausting the connection pool.
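As a rough sketch of the idea, a call can be run on a worker thread and abandoned once a deadline passes. The helper name call_with_timeout and the function slow_dependency are hypothetical. Note the limitation: Python cannot forcibly stop a running thread, so the worker keeps running in the background. Where the client library accepts its own timeout parameter, that is usually the better choice.

```python
import concurrent.futures
import time

# Shared pool for outbound calls; the size of 4 is an arbitrary choice for this sketch.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func on a worker thread and stop waiting after timeout_s seconds."""
    future = _executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        future.cancel()  # only has an effect if the task has not started yet
        raise TimeoutError(f"call exceeded {timeout_s}s")


def slow_dependency():
    """Hypothetical stand-in for a remote call that responds too slowly."""
    time.sleep(0.5)
    return "late"
```

Calling call_with_timeout(slow_dependency, 0.05) raises TimeoutError after about 50 ms instead of blocking the caller for the full half second.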
Retry: Automatically retries failed calls, but must be combined with:
- Exponential backoff: Spaces out retries so that many clients retrying at once do not cause an avalanche
- Maximum retry count: Prevent infinite retries
- Idempotency: Ensure retries don't cause duplicate operations
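The three safeguards above might look like this in Python. It is a sketch under assumptions: the function name retry and its parameters are made up for the example, the random jitter is an addition not mentioned in the list (it spreads clients out so they do not retry in lockstep), and func is assumed to be idempotent, since the helper cannot enforce that.

```python
import random
import time


def retry(func, max_attempts=4, base_delay=0.1, max_delay=2.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func until it succeeds or max_attempts is reached.

    func must be safe to call more than once (idempotent).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the last error
            # Exponential backoff: 0.1s, 0.2s, 0.4s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount up to the computed delay.
            sleep(random.uniform(0, delay))
```

Only errors listed in retry_on are retried; anything else, such as a validation error, propagates immediately because retrying would not help.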
Bulkhead: Isolate resources (thread pools, connection pools) by service, so one downstream service's problems don't exhaust all resources.
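One lightweight way to sketch this is a semaphore per downstream service that caps concurrent calls. This is a variant of the dedicated thread pool or connection pool described above, not the same mechanism, and the names Bulkhead and BulkheadFullError are assumptions for the example.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when every slot reserved for a dependency is already in use."""


class Bulkhead:
    def __init__(self, max_concurrent):
        # One semaphore per downstream service caps its share of resources.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing, so a slow dependency
        # cannot tie up callers that are waiting for a slot.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slots for this dependency")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

With one instance per dependency, for example Bulkhead(20) for a payments client and Bulkhead(5) for a recommendations client (both hypothetical), a stalled recommendations service can occupy at most five concurrent callers.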
Fallback: Provide alternative responses when dependencies fail:
- Return cached stale data
- Return default values
- Degrade gracefully (e.g., when the recommendation system is down, return a popular items list)
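The recommendation example can be sketched as a small function that tries the live service, then stale cache, then a default list. All names here (get_recommendations, fetch, cache, popular_items) are hypothetical, and the cache is a plain dict for simplicity.

```python
def get_recommendations(user_id, fetch, cache, popular_items):
    """Return (items, source), where source tells the caller how fresh the data is."""
    try:
        items = fetch(user_id)  # live call to the recommendation service
    except Exception:
        if user_id in cache:
            return cache[user_id], "stale_cache"  # serve the last known good result
        return popular_items, "default"           # degrade to a generic list
    cache[user_id] = items  # refresh the cache on every successful call
    return items, "live"
```

Returning the source alongside the data lets the caller log or flag degraded responses instead of silently treating them as fresh.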
Combining Patterns
Best practice is to combine them: Timeout + Retry + Circuit Breaker + Bulkhead + Fallback, forming a complete resilience protection layer.
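A rough sketch of how the layers might nest, using small wrapper functions. Everything here is illustrative: the wrapper names are made up, each wrapper is a stripped-down version of its pattern, and the innermost call is assumed to enforce its own timeout. The ordering shown is one possible choice, not the only valid one; libraries differ on whether retry sits inside or outside the circuit breaker.

```python
import threading
import time


def with_retry(func, attempts=3, base_delay=0.05, sleep=time.sleep):
    """Retry with exponential backoff, up to `attempts` tries."""
    def wrapped():
        for attempt in range(1, attempts + 1):
            try:
                return func()
            except Exception:
                if attempt == attempts:
                    raise
                sleep(base_delay * 2 ** (attempt - 1))
    return wrapped


def with_breaker(func, threshold=2, reset_timeout=30.0, clock=time.monotonic):
    """Fail fast after `threshold` consecutive failures; probe after reset_timeout."""
    state = {"failures": 0, "opened_at": None}

    def wrapped():
        opened_at = state["opened_at"]
        if opened_at is not None and clock() - opened_at < reset_timeout:
            raise RuntimeError("circuit open")
        # If the circuit was open but the timeout has passed, this call is the probe.
        try:
            result = func()
        except Exception:
            state["failures"] += 1
            if state["failures"] >= threshold:
                state["opened_at"] = clock()
            raise
        state["failures"] = 0
        state["opened_at"] = None
        return result
    return wrapped


def with_bulkhead(func, max_concurrent=10):
    """Cap concurrent calls to this dependency."""
    slots = threading.BoundedSemaphore(max_concurrent)

    def wrapped():
        if not slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full")
        try:
            return func()
        finally:
            slots.release()
    return wrapped


def with_fallback(func, fallback):
    """Return fallback() whenever the protected call fails for any reason."""
    def wrapped():
        try:
            return func()
        except Exception:
            return fallback()
    return wrapped


def protect(call, fallback, **retry_opts):
    # Innermost to outermost: call (with its own timeout) -> retry
    # -> circuit breaker -> bulkhead -> fallback.
    return with_fallback(
        with_bulkhead(with_breaker(with_retry(call, **retry_opts))), fallback)
```

With the breaker outside the retry layer, one exhausted retry sequence counts as a single failure toward the threshold, and once the circuit is open no further retries reach the downstream service.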
