Reliability / SRE · Intermediate

What are the common reliability design patterns in distributed systems?

Why Reliability Patterns Are Needed

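In a distributed system, any service dependency can fail or slow down. Without protection, a single failing dependency ties up its callers' threads and connections, and the failure cascades from service to service until the whole system is affected.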

Core Reliability Patterns

Circuit Breaker: Monitors the failure rate of calls to a downstream service and "trips" when that rate exceeds a threshold. While tripped, calls fail fast instead of reaching the downstream service, which gives it time to recover.

Three states:

  • Closed (normal): Requests pass through normally
  • Open (tripped): Returns errors immediately without calling downstream
  • Half-Open (probing): After a cool-down period, allows a small number of requests through to test recovery; success closes the circuit, failure reopens it

Tools: Resilience4j (Java), Polly (.NET), opossum (Node.js)
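
The three states can be sketched as a small state machine. This is a minimal, single-threaded illustration, not how any of the libraries above implement it: it trips on a count of consecutive failures rather than a true failure rate, lets exactly one probe through in the half-open state, and the names CircuitBreaker, CircuitOpenError, failure_threshold and reset_timeout are made up for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without reaching downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open before probing
        self.clock = clock                          # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open")
            self.state = "half_open"  # cool-down elapsed: let this call through as a probe
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise trip at the threshold.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "closed"
```

A caller would wrap each downstream call, for example breaker.call(fetch_profile, user_id), and treat CircuitOpenError as the signal to fall back. A production breaker would also need locking and a sliding-window failure rate.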

Timeout: Set a timeout on every external call so that a slow dependency cannot hold threads and pooled connections indefinitely.
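
As a rough sketch of what this looks like at the lowest level, the example below uses Python's standard socket module. The function name fetch_with_timeout and its parameters are hypothetical. Note that the timeout applies to each socket operation (connect, send, receive) separately, so it is not a total deadline for the whole call.

```python
import socket


def fetch_with_timeout(host, port, payload, timeout_s=2.0):
    """Send payload and read one reply, giving up if any socket operation exceeds timeout_s."""
    # create_connection applies the timeout to the connect and leaves it set on the socket.
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.sendall(payload)
        # Raises socket.timeout if the peer does not answer in time;
        # the with-block then closes the socket instead of leaving it tied up.
        return sock.recv(4096)
```

Most HTTP and database clients expose an equivalent timeout setting; the point is that it has to be set explicitly, because the default is often to wait forever.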

Retry: Automatically retry failed calls, but only in combination with:

  • Exponential backoff: Increase the delay between attempts (ideally with random jitter) so that clients do not all retry at once and overwhelm a recovering service
  • Maximum retry count: Prevent infinite retries
  • Idempotency: Ensure a retried request does not perform the operation twice
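
The three conditions above can be sketched as one helper. This is an illustrative sketch, not a library API: the name retry and its parameters are assumptions, and the random "full jitter" on the delay is an addition to plain exponential backoff that spreads clients out so they do not retry in lockstep.

```python
import random
import time


def retry(func, max_attempts=4, base_delay=0.1, max_delay=5.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func until it succeeds or max_attempts is reached.

    func must be idempotent: it may run several times for one logical request.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the last error
            # Exponential backoff: the cap doubles each attempt, up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount between 0 and the cap.
            sleep(random.uniform(0, cap))
```

Only errors listed in retry_on are retried; anything else (a validation error, for instance) propagates immediately, since retrying it would not help. For non-idempotent operations such as payments, a common approach is to send a client-generated idempotency key so the server can deduplicate.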

Bulkhead: Isolate resources (thread pools, connection pools) per downstream service, so that problems in one service cannot exhaust the resources shared by all the others.
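
One lightweight way to get this isolation, sketched below, is a semaphore per dependency that caps concurrent calls. This is a variant of the dedicated thread pool or connection pool described above, not the same thing. The names Bulkhead and BulkheadFullError and the limits shown are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency limit is already in use."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing, so callers are never stuck waiting.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per downstream service (illustrative names and limits).
bulkheads = {
    "payments": Bulkhead(max_concurrent=10),
    "recommendations": Bulkhead(max_concurrent=5),
}
```

With this layout, a hung recommendations service can occupy at most five callers; calls to payments keep their own ten slots.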

Fallback: Provide an alternative response when a dependency fails:

  • Return stale cached data
  • Return default values
  • Degrade gracefully (e.g., when the recommendation system is down, return a popular items list)
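
The options above can be chained in order of preference. The sketch below follows the recommendation example; get_recommendations, POPULAR_ITEMS, and the plain dict used as a cache are all hypothetical stand-ins.

```python
# Hypothetical static list served when nothing better is available.
POPULAR_ITEMS = ["item-1", "item-2", "item-3"]


def get_recommendations(user_id, fetch, cache):
    """Return (items, source), where source is 'live', 'stale-cache', or 'default'."""
    try:
        items = fetch(user_id)
    except Exception:
        # Dependency failed: prefer the last known good answer, even if stale.
        if user_id in cache:
            return cache[user_id], "stale-cache"
        # Nothing cached for this user: degrade to a generic popular-items list.
        return POPULAR_ITEMS, "default"
    cache[user_id] = items  # remember the good answer for future fallbacks
    return items, "live"
```

Returning the source alongside the data lets the caller log or display that the response is degraded.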

Combining Patterns

In practice the patterns are combined: Timeout + Retry + Circuit Breaker + Bulkhead + Fallback together form a layered defense around each dependency.
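
One possible layering is sketched below, under several assumptions: the wrapped func enforces its own timeout, the breaker counts consecutive failures rather than a rate, and every rejected or failed path ends in the fallback. Breaker and resilient_call are hypothetical names, and other orderings (for example, retrying outside the breaker) are also used in practice.

```python
import threading
import time


class Breaker:
    """Tiny consecutive-failure breaker: closed, open, then a half-open probe."""

    def __init__(self, threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.threshold, self.reset_timeout, self.clock = threshold, reset_timeout, clock
        self.failures, self.opened_at = 0, None

    def allow(self):
        if self.opened_at is None:
            return True  # closed
        # Open until the cool-down elapses, then allow a probe (half-open).
        return self.clock() - self.opened_at >= self.reset_timeout

    def record(self, ok):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()


def resilient_call(func, breaker, slots, fallback,
                   attempts=3, base_delay=0.1, sleep=time.sleep):
    """Breaker check, then bulkhead slot, then retries with backoff; fallback on any rejection."""
    if not breaker.allow():
        return fallback()              # circuit open: fail fast
    if not slots.acquire(blocking=False):
        return fallback()              # bulkhead full: do not queue
    try:
        for attempt in range(1, attempts + 1):
            try:
                result = func()        # func is assumed to apply its own timeout
                breaker.record(True)
                return result
            except Exception:
                breaker.record(False)
                if attempt == attempts or not breaker.allow():
                    return fallback()  # out of retries, or the breaker just tripped
                sleep(base_delay * 2 ** (attempt - 1))
    finally:
        slots.release()
```

Here slots is a threading.BoundedSemaphore dedicated to one dependency. The slot is held during backoff sleeps, which keeps the sketch short but counts waiting callers against the concurrency limit.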

Copyright © 2026 Wood All Rights Reserved · FE Interview Hub