System Design

Failure Is Not an Option
(But It Will Happen)

How to build systems that survive when everything goes wrong. Timeouts, retries, circuit breakers, and the patterns that keep distributed systems alive.

Bahgat Bahgat Ahmed · January 2025 · 35 min read


The Day Everything Fell Down

Tuesday, 2:47 PM. You're in a meeting when your phone buzzes. Then buzzes again. Then doesn't stop.

The payment service is slow. Then the checkout page times out. Then the product catalog stops responding. Within 8 minutes, your entire e-commerce platform is down. 50,000 users. Zero transactions.

The root cause? A third-party payment API started responding slowly instead of failing fast.

The 8-Minute Cascade: How One Slow Service Killed Everything
2:47 PM

Everything is Fine

Payment API responding in ~200ms as usual. All systems nominal.

2:48 PM

Payment API Slows Down

Response time jumps to 5 seconds. Your checkout threads start waiting...

2:49 PM

Thread Pool Exhaustion

All checkout worker threads are blocked waiting for payment. New requests queue up.

2:51 PM

Connection Pool Drained

Database connections held by blocked threads. Product service can't get connections.

2:53 PM

Load Balancer Timeouts

Health checks failing. Load balancer marks servers as unhealthy. Traffic concentrates on remaining servers.

2:55 PM

Complete System Failure

Remaining servers overwhelmed. Homepage returns 503. Platform is down.

A slow dependency is worse than a dead one. Dead services fail fast. Slow services hold resources hostage.

The Brutal Truth

The payment service never crashed. It never returned errors. It just got slow. And that slowness spread through every system that called it, like a virus.

Here's the code that caused the outage:

Python - The code that took down the platform
async def checkout(order):
    # No timeout - waits forever
    payment = await payment_api.charge(order.total)

    # No retry - fails permanently on transient errors
    if not payment.success:
        raise PaymentError("Payment failed")

    # No fallback - nothing to show when broken
    return {"order_id": order.id, "status": "confirmed"}

And here's what it should have looked like — the same function, protected:

Python - The resilient version
async def checkout(order):
    try:
        # Timeout: don't wait forever (5s total budget)
        payment = await asyncio.wait_for(
            # Circuit breaker: fail fast if payment is known-down
            circuit_breaker.call(
                # Retry: handle transient failures (2 attempts)
                lambda: retry_with_backoff(
                    lambda: payment_api.charge(order.total),
                    max_retries=2
                )
            ),
            timeout=5.0  # Total budget including retries
        )
        return {"order_id": order.id, "status": "confirmed"}

    except (asyncio.TimeoutError, CircuitOpenError):
        # Fallback: queue the payment for later processing
        await queue.send({"order": order.id, "retry_at": now() + minutes(5)})
        return {"order_id": order.id, "status": "pending",
                "message": "Payment is being processed. You'll receive confirmation shortly."}

The rest of this post teaches you each layer of defense. By the end, you'll know exactly when and how to add each one.

Why Distributed Systems Fail Differently

In a monolith, when something breaks, you usually get an exception. Stack trace. Clear error. Easy to debug.

In distributed systems, failures are partial, delayed, and ambiguous:

  • Partial failure — Service A is up, Service B is down, Service C is slow. The system is simultaneously healthy AND broken.
  • Network ambiguity — Did the request fail? Did it succeed but the response got lost? If you sent a payment request and got no response, was the customer charged?
  • Cascade effects — One sick service infects everything that depends on it, like the cascade in the story above.
  • No single source of truth — Each service has its own view of the world. Service A thinks the order was placed. Service B never got the message.
The 8 fallacies of distributed computing

In 1994, L. Peter Deutsch and colleagues at Sun Microsystems documented the false assumptions developers make about networks (James Gosling added the eighth a few years later). Every failure pattern in this post exists because one or more of these assumptions is wrong:

  1. The network is reliable — Packets get dropped. Cables get cut. DNS fails. (Why we need retries)
  2. Latency is zero — Cross-datacenter calls add 30-100ms. Slow services add seconds. (Why we need timeouts)
  3. Bandwidth is infinite — Large payloads cause backpressure. (Why we need bulkheads)
  4. The network is secure — TLS handshakes fail, certificates expire.
  5. Topology doesn't change — Load balancers add/remove instances. DNS changes.
  6. There is one administrator — Your dependency's team deploys on their own schedule.
  7. Transport cost is zero — Each network call costs CPU, memory, and latency.
  8. The network is homogeneous — Different services use different protocols, versions, and behaviors.

The takeaway: Every call across a network boundary is a potential failure point. The patterns in this post are your defense against each of these false assumptions.

Slow is Worse Than Down
Service is DOWN
  • Fails immediately
  • Releases resources fast
  • Retry kicks in quickly
  • Circuit breaker opens
  • Impact: Contained
Service is SLOW
  • Holds threads hostage
  • Drains connection pools
  • Timeouts may not trigger
  • Looks "almost working"
  • Impact: Cascades

A dead service is predictable. A slow service is a resource vampire that drains everything around it.

This is why failure handling in distributed systems requires multiple layers of defense. No single technique is enough. You need a complete toolkit.

The Cost of Failure (Real Numbers)

How much does a slow service actually cost? Let's do the math for a typical e-commerce site handling 1,000 requests/second:

Scenario: Payment API goes from 200ms to 10s response time.

Without resilience patterns:

  • 200 worker threads, each held for 10s = the entire pool saturates within seconds
  • Remaining 998 req/s queued, then rejected
  • Average order value: $80 × 1% conversion = $800/sec revenue
  • 8-minute outage = $384,000 in lost revenue
  • Plus: customer trust damage, social media complaints, SEO impact

With resilience patterns:

  • Timeout at 2s, circuit opens after 5 failures (10 seconds)
  • Fallback: "Payment pending" with queue for later processing
  • 98% of requests served normally (non-payment features unaffected)
  • 2% of requests get "pending" status, processed within 5 minutes
  • Revenue impact: ~$0 (all orders eventually processed)
Looking for the "Why"?

This post focuses on implementation — the specific patterns, code examples, and configuration values. For the conceptual foundation of why we need these patterns and how they fit into a broader reliability strategy, see Building Resilient Systems.

Defense Layer 1
Timeouts: The First Line of Defense

Timeouts

Think of calling a restaurant to make a reservation. If no one picks up after 10 rings, you hang up and try another restaurant. You don't hold the phone for 30 minutes hoping someone eventually answers. That's a timeout.

A timeout is a promise: "I will not wait forever." It's the simplest and most important failure handling mechanism. Without timeouts, a slow dependency can hold your resources indefinitely. With timeouts, you bound the worst-case wait time.

The Math That Matters

With 200 threads and no timeout, a dependency that responds in 30 seconds means one wave of requests consumes 200 threads × 30s = 6,000 thread-seconds. Your entire server is frozen waiting. With a 2-second timeout, the same wave costs 200 × 2s = 400 thread-seconds — 15x less resource waste, and those threads are freed to serve other requests.

Python
import httpx

# BAD: relying on defaults (httpx defaults to 5s; many other clients wait forever)
response = httpx.get("https://payment-api.com/charge")

# GOOD: Explicit timeouts
response = httpx.get(
    "https://payment-api.com/charge",
    timeout=httpx.Timeout(
        connect=5.0,    # Max time to establish connection
        read=10.0,       # Max time to read response
        write=5.0,       # Max time to send request
        pool=5.0         # Max time to get connection from pool
    )
)
How to choose timeout values

Start with your SLA. If users expect responses in 2 seconds, your timeout can't be 30 seconds.

Measure p99 latency. Look at your dependency's 99th percentile response time. Set timeout slightly above that.

Rule of thumb: timeout = p99 + buffer. If p99 is 500ms, timeout at 1-2 seconds.

Different operations need different timeouts:

  • Connection timeout: 1-5 seconds (how long to wait for TCP handshake)
  • Read timeout: varies by operation (simple lookup: 1-5s, complex query: 10-30s)
  • Total request timeout: your user-facing SLA minus processing overhead

What Can Go Wrong With Timeouts

Timeouts seem simple, but misconfigured timeouts cause their own disasters:

  • Timeout too short — You time out legitimate requests. If your payment API normally takes 800ms but occasionally takes 1.5s during peak load, a 1s timeout means you'll reject ~5% of valid payments. Customers get charged but your system thinks it failed.
  • Timeout too long — You're back to the original problem. A 60s timeout on a dependency that usually responds in 200ms means threads are held hostage for 60 seconds when things go wrong.
  • Same timeout everywhere — A health check endpoint and a complex report generation shouldn't share the same timeout value. Match timeout to the operation.
  • No timeout at all — Many HTTP clients default to "wait forever." Check your defaults.
Timeouts in Other Languages & Frameworks

Java (OkHttp):

new OkHttpClient.Builder()
    .connectTimeout(5, TimeUnit.SECONDS)
    .readTimeout(10, TimeUnit.SECONDS)
    .writeTimeout(5, TimeUnit.SECONDS)
    .build();

Node.js (Axios):

axios.get('https://payment-api.com/charge', {
  timeout: 5000,  // 5 seconds total
  signal: AbortSignal.timeout(10000)  // hard abort at 10s
});

Go (net/http):

client := &http.Client{
    Timeout: 5 * time.Second,
    Transport: &http.Transport{
        DialContext:         (&net.Dialer{Timeout: 2 * time.Second}).DialContext,
        TLSHandshakeTimeout: 2 * time.Second,
    },
}
Real timeout values for common services

Here are typical timeout values based on real production data. Use these as starting points, then adjust based on your own p99 measurements:

| Service Type | Typical p99 | Recommended Timeout | Notes |
| --- | --- | --- | --- |
| Database (simple query) | 5-50ms | 200-500ms | Generous because spikes during vacuuming/replication |
| Database (complex query) | 100-500ms | 2-5s | Consider query optimization first |
| Redis/Memcached | 1-5ms | 50-200ms | Cache should be fast; long timeout defeats the purpose |
| Internal microservice | 10-100ms | 500ms-2s | Same datacenter, should be fast |
| Payment gateway (Stripe, etc.) | 500ms-2s | 5-10s | External, fraud checks add latency |
| Email API (SendGrid, SES) | 200-800ms | 5s | Should be async anyway |
| ML inference | 100ms-5s | 10-30s | Varies wildly by model size |
| File upload to S3 | Varies by size | 30-60s | Size-dependent; use multipart for large files |

The measurement approach: Run for 1 week, collect p50/p95/p99 latency per endpoint, then set timeout at p99 × 2. Revisit monthly as traffic patterns change.
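The percentile step needs only a few lines. A minimal sketch using Python's standard library (the multiplier and the sample data are illustrative, not from production):

```python
import statistics

def suggest_timeout(latencies_ms, multiplier=2.0):
    """Suggest a timeout from collected latency samples: p99 x multiplier."""
    # quantiles(n=100) returns 99 cut points; the last one approximates p99
    p99 = statistics.quantiles(latencies_ms, n=100)[-1]
    return p99 * multiplier

# Hypothetical week of samples for one endpoint (milliseconds)
samples = [12, 15, 11, 14, 500, 13, 16, 12, 18, 13] * 100
print(f"suggested timeout: {suggest_timeout(samples):.0f}ms")
```

In practice you'd pull the samples from your metrics backend per endpoint rather than a Python list, but the p99 × 2 rule is the same.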

Timeout configuration at the infrastructure level

Application-level timeouts (httpx, OkHttp, Axios) are just one layer. Your infrastructure has its own timeouts, and they must align. If your NGINX proxy timeout is 60s but your app timeout is 2s, the proxy holds the connection for 58 extra seconds after your app has given up.

NGINX:

# NGINX proxy timeouts - must be >= app timeout
proxy_connect_timeout 5s;     # TCP connection to upstream
proxy_send_timeout    10s;    # Sending request to upstream
proxy_read_timeout    15s;    # Reading response from upstream

# Client-side timeouts
client_header_timeout 10s;    # Reading client request headers
client_body_timeout   10s;    # Reading client request body
send_timeout          10s;    # Sending response to client

AWS Application Load Balancer:

# ALB idle timeout (default 60s - almost always too long)
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn $ALB_ARN \
  --attributes Key=idle_timeout.timeout_seconds,Value=30

# Target group health check timeouts
aws elbv2 modify-target-group \
  --target-group-arn $TG_ARN \
  --health-check-timeout-seconds 5 \
  --health-check-interval-seconds 10

Kubernetes:

# Readiness probe - K8s checks if pod can handle traffic
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3       # Must be < periodSeconds
  failureThreshold: 3     # 3 failures = stop traffic

# Liveness probe - K8s checks if pod is alive
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  timeoutSeconds: 5
  failureThreshold: 3     # 3 failures = kill & restart pod

The timeout stack (all must align):

  • CDN/Edge ≥ Load Balancer ≥ Proxy ≥ App ≥ DB client
  • Each layer should be slightly longer than the one it proxies to
  • If any layer is shorter, it'll cut off requests the downstream is still processing

Timeout Interacts With Retry

If your per-attempt timeout is 2s and you make 3 attempts (1 original + 2 retries), the user waits up to 6 seconds, plus any backoff delays between attempts. Keep this in mind when setting values — the total budget is roughly timeout × attempts plus backoff.

The Timeout Cascade Problem

If Service A calls Service B, which calls Service C:

Timeout Budgets Must Cascade
User
5s total budget
Service A
4s timeout
Service B
3s timeout
Service C
2s timeout

Each downstream service must have a shorter timeout than its caller. Otherwise, the caller times out while the downstream is still working.
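One way to enforce this cascade is to pass a single absolute deadline down the chain and derive each hop's timeout from whatever budget remains, instead of configuring independent guesses per service. A sketch under assumed values (the 3s cap and 1s headroom are illustrative):

```python
import time

class Deadline:
    """Carries one absolute deadline through the call chain."""
    def __init__(self, budget_s: float):
        self.expires_at = time.monotonic() + budget_s

    def remaining(self) -> float:
        return max(0.0, self.expires_at - time.monotonic())

def call_downstream(deadline: Deadline, cap_s: float = 3.0, headroom_s: float = 1.0):
    # Use what's left of the budget, minus headroom so the caller can still
    # respond after we return, capped at this hop's own maximum
    timeout = min(cap_s, deadline.remaining() - headroom_s)
    if timeout <= 0:
        raise TimeoutError("budget exhausted before the call was made")
    # e.g. httpx.get(url, timeout=timeout) in a real service
    return timeout

deadline = Deadline(budget_s=5.0)  # the user-facing 5s budget
```

Each hop repeats the same calculation, so timeouts shrink naturally as the request moves down the chain.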

Think About It
Your database has a p99 latency of 50ms. What timeout should you set for database queries?
  • 50ms - match the p99 exactly
  • 100ms - 2x the p99
  • 200-500ms - 4-10x the p99 to handle spikes
  • No timeout - let the database decide
A good timeout is typically 4-10x your p99 latency. Too tight (50-100ms) means you reject valid requests during normal spikes (garbage collection, replication lag). Too loose or no timeout means you risk thread starvation when the database hangs. 200-500ms gives headroom for occasional slowness without letting a hung connection block forever.
Defense Layer 2
Retries: Handling Transient Failures

Retries with Exponential Backoff

You call a friend and they don't answer. Do you call back immediately? Maybe once. Do you call 100 times in 10 seconds? That's harassment. Instead, you wait a minute, try again. Then wait five minutes. Then maybe leave a message. That's exponential backoff.

Many failures are transient: network blips, temporary overload, brief downtime during deployment. A simple retry often succeeds. But naive retries are dangerous — if a service is struggling, hammering it with immediate retries makes things worse.

The Math That Matters

With base delay of 1s and 5 retries: 1 + 2 + 4 + 8 + 16 = 31 seconds maximum wait. With jitter, the actual wait is randomly spread across this range. Compare to 5 instant retries that complete in <1s but hammer the recovering service with 5x the load.

Python
import random
import time

def retry_with_backoff(func, max_retries=3, base_delay=1.0):
    """Retry with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except (ConnectionError, TimeoutError) as e:
            if attempt == max_retries - 1:
                raise  # Last attempt, give up

            # Exponential backoff: 1s, 2s, 4s, 8s...
            delay = base_delay * (2 ** attempt)

            # Add jitter to prevent thundering herd
            jitter = random.uniform(0, delay * 0.1)

            time.sleep(delay + jitter)
Why jitter matters: The Thundering Herd

Imagine 1,000 requests fail at the same time. Without jitter, all 1,000 retry at exactly t+1s. Then all retry again at t+3s. Then t+7s.

This synchronized retry creates thundering herd — massive traffic spikes that overwhelm the recovering service.

Jitter spreads retries over time, smoothing the load. Instead of 1,000 requests landing at exactly t+1s, they spread across a window — and with full jitter, across the entire 0-2s range.

Full jitter (recommended): delay = random.uniform(0, base_delay * 2**attempt)
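The difference between the 10% jitter in the code above and full jitter is just the range of the random draw. A sketch of the full-jitter variant (the injectable `sleep` parameter is only there so delays can be observed in tests):

```python
import random
import time

def retry_full_jitter(func, max_attempts=3, base_delay=1.0, cap=30.0,
                      sleep=time.sleep):
    """Retry with AWS-style full jitter: wait a random time in [0, backoff]."""
    for attempt in range(max_attempts):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts, give up
            backoff = min(cap, base_delay * (2 ** attempt))
            # Full jitter: draw from the whole interval, so failed requests
            # never retry in lockstep
            sleep(random.uniform(0, backoff))
```

Note the `cap`: without it, the exponential doubling eventually produces multi-minute waits.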

Exponential Backoff: Giving the Service Time to Recover
Attempt 1 1s wait
Attempt 2 2s wait
Attempt 3 4s wait
Attempt 4 8s wait

Each retry waits longer, giving the failing service more time to recover. After max retries, give up and fail gracefully.

What to Retry

Retry: Network errors, 5xx errors, timeouts, connection refused.
Don't retry: 4xx errors (client errors), authentication failures, validation errors. These won't succeed on retry.

War Story: The Retry Storm That Kept the Service Down

A team had a notification service that pushed messages to a third-party SMS API. They added retries (good!) but forgot jitter (bad). When the SMS API had a brief 30-second outage, here's what happened:

T=0: SMS API goes down. 500 messages in queue.
T=1s: All 500 messages retry simultaneously.
T=1s: SMS API gets 500 requests at once. Still recovering — all fail.
T=3s: All 500 retry again (2s backoff, no jitter).
T=3s: SMS API gets another 500 requests. Collapses again.
T=7s: 500 more retries. New messages arriving too — now 700 in queue.
T=15s: Queue grows to 1,200. Each retry cycle adds more load.
T=30s: SMS API would have recovered... but the retry storm prevents it.
T=5min: SMS API team triggers circuit breaker on THEIR side. Returns 429.
T=6min: Our team notices. Adds jitter. Deploys fix. Queue drains over 2 min.

The fix was one line: jitter = random.uniform(0, delay). With full jitter, those 500 retries spread across 0-1s, 0-2s, 0-4s instead of hitting at exactly 1s, 2s, 4s. The SMS API never sees a spike.

Lesson: Retries without jitter are a DDoS attack on your own dependencies. Always add randomness.

What Can Go Wrong With Retries

The Idempotency Trap

If your payment API times out, did the charge go through? If you retry, will the customer be charged twice? Never retry non-idempotent operations blindly. Use idempotency keys (like Stripe's Idempotency-Key header) to ensure retrying is safe. If an operation can't be made idempotent, don't retry it — use a fallback instead.

  • Retry amplification — If Service A retries 3x against Service B, and B retries 3x against Service C, a single user request can generate 3 × 3 = 9 requests to Service C. At scale, this multiplies load exponentially.
  • Retrying permanent errors — A 400 Bad Request will fail every time you retry. Wasting time and resources on requests that will never succeed.
  • Missing jitter — Without randomness in your backoff, all failed requests retry at the exact same moment, creating traffic spikes that re-crash the recovering service.
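To make the idempotency-key mechanism from the trap above concrete, here's a toy server-side sketch of the dedup behavior it relies on. The gateway class is hypothetical; Stripe's `Idempotency-Key` header works along these lines:

```python
import uuid

class PaymentGateway:
    """Toy gateway: the same idempotency key replays the cached result
    instead of charging twice (hypothetical, for illustration)."""
    def __init__(self):
        self._seen = {}
        self.charges = 0

    def charge(self, idempotency_key: str, amount_cents: int):
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]  # replay: no double charge
        self.charges += 1
        result = {"charge_id": str(uuid.uuid4()), "amount": amount_cents}
        self._seen[idempotency_key] = result
        return result

gateway = PaymentGateway()
key = str(uuid.uuid4())              # generate ONCE per logical operation
first = gateway.charge(key, 8000)
retried = gateway.charge(key, 8000)  # timeout happened; retry is now safe
```

The key must be generated once per logical operation and reused on every retry — generating a fresh key per attempt defeats the whole mechanism.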
Retries in Other Languages & Frameworks

Java (Resilience4j):

RetryConfig config = RetryConfig.custom()
    .maxAttempts(3)
    .waitDuration(Duration.ofMillis(500))
    .retryExceptions(IOException.class, TimeoutException.class)
    .ignoreExceptions(BusinessException.class)
    .build();
Retry retry = Retry.of("paymentRetry", config);

Node.js (p-retry):

import pRetry from 'p-retry';

const result = await pRetry(
  () => fetch('https://payment-api.com/charge'),
  { retries: 3, minTimeout: 1000, factor: 2 }
);
The Retry Budget: How many retries can you actually afford?

Before choosing retry settings, calculate your budget:

Given:
  User-facing SLA = 5 seconds
  Processing overhead = 500ms
  Per-request timeout = 2 seconds
Budget:
  Available for retries = 5s - 0.5s = 4.5 seconds
  Max attempts = floor(4.5 / 2.0) = 2 attempts
  That's 1 original + 1 retry
With backoff:
  Attempt 1: 2s timeout
  Wait: 1s backoff
  Attempt 2: 2s timeout
  Total: 2 + 1 + 2 = 5s — over the 4.5s budget, so trim the backoff or the timeout

The formula: max_attempts = floor((SLA - overhead) / per_attempt_timeout); retries = max_attempts - 1. Then check that the backoff delays between attempts still fit.

The common mistake: Teams set retries to 3-5 by default without calculating the budget. With 5 retries × 2s timeout + backoff, total worst case is 31+ seconds — 6x the SLA.

For real-time operations (payment, search): 1-2 retries max. For background jobs (email, analytics): 5-10 retries with longer backoff is fine because there's no user waiting.
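The budget arithmetic above is easy to automate. A small sketch using the section's worked-example numbers (the function name and return shape are illustrative):

```python
import math

def retry_budget(sla_s, overhead_s, timeout_s):
    """How many attempts (1 original + N retries) fit inside the SLA?
    Backoff delays between attempts still need a separate sanity check."""
    budget = sla_s - overhead_s
    attempts = math.floor(budget / timeout_s)
    return {"budget_s": budget, "attempts": attempts, "retries": attempts - 1}

print(retry_budget(sla_s=5.0, overhead_s=0.5, timeout_s=2.0))
# attempts = floor(4.5 / 2.0) = 2 -> 1 original + 1 retry
```

Run this once per endpoint during design review; it catches the "retries=5 by default" mistake before it ships.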

Retry Interacts With Circuit Breaker

If each retry counts as a failure in the circuit breaker, two user requests with 3 retries each = 6 failures = circuit opens. We'll cover this interaction in When Patterns Collide.

Think About It
Your payment API call times out. The customer might have been charged, but you never got the response. Should you retry?
  • Yes - always retry, that's what retries are for
  • No - not without an idempotency key that lets the API detect duplicates
  • Yes - but wait 10 minutes first to be safe
  • No - retrying payments is never safe
Retrying non-idempotent operations (like payments) can cause double-charges. The correct approach is using an idempotency key: a unique identifier you send with the request. If the API sees the same key twice, it returns the original result instead of processing again. Stripe, PayPal, and most payment APIs support this. Never retry a payment without one.
Defense Layer 3
Circuit Breakers: Fail Fast When It's Hopeless

Circuit Breakers

Your house has a circuit breaker box. When too much current flows through a circuit, the breaker trips — cutting power to that circuit instantly. You don't keep pushing more electricity through a failing wire and hope it works. The breaker protects the rest of the house.

In software, a circuit breaker does the same thing. It monitors failure rates and "trips" when too many failures occur. Once tripped, it fails fast — returning an error immediately without even attempting to call the service. This protects your system from wasting resources on requests that will almost certainly fail.

Retries handle transient failures (the kind that fix themselves). Circuit breakers handle persistent failures (the kind where the service is actually down). Without a circuit breaker, retries against a dead service just pile on more load, consuming threads, connections, and time.

The Math That Matters

Without a circuit breaker: 500 req/s × 3 retries × 2s timeout = 3,000 wasted request-seconds per second. Your retry logic is spending 3,000 thread-seconds every second waiting for a service that isn't coming back. With a circuit breaker that opens after 5 failures, you spend 5 × 2s = 10 seconds discovering the problem, then fail instantly (in ~1ms) for the next 30 seconds until recovery is tested.

Circuit Breaker State Machine
CLOSED Normal operation
failures ≥ threshold
OPEN Failing fast
timeout expires
HALF-OPEN Testing recovery
success in HALF-OPEN → back to CLOSED
failure in HALF-OPEN → back to OPEN

The circuit breaker automatically recovers: after the timeout, it lets one request through to test if the service is back.

Python
import time
from enum import Enum

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls fail fast."""

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = 0

    def call(self, func):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise CircuitOpenError("Circuit is open")

        try:
            result = func()
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
Count-based vs sliding window circuit breakers

The basic circuit breaker above uses a count-based threshold: "open after N consecutive failures." This works but has edge cases:

  • If you get 4 failures, then 1 success, then 4 more failures, the counter resets. You never trip despite a 90% failure rate.

A sliding window circuit breaker is more robust. It tracks the failure rate over the last N calls:

class SlidingWindowBreaker:
    def __init__(self, window_size=10, failure_threshold=0.5):
        self.window = []         # Last N results (True=success, False=failure)
        self.window_size = window_size
        self.failure_threshold = failure_threshold  # 50% failure rate

    def should_trip(self):
        if len(self.window) < self.window_size:
            return False  # Not enough data yet

        failure_rate = self.window.count(False) / len(self.window)
        return failure_rate >= self.failure_threshold

    def record(self, success: bool):
        self.window.append(success)
        if len(self.window) > self.window_size:
            self.window.pop(0)  # Slide the window

When to use which:

  • Count-based: Simple services, low traffic (under 10 req/s). Fewer edge cases matter at low volume.
  • Sliding window: High traffic services where a small number of failures mixed with successes shouldn't trip the breaker. This is what Resilience4j and most production libraries use.

What Can Go Wrong With Circuit Breakers

  • Threshold too low — Circuit opens after 2-3 failures. Normal transient errors (brief network blips) trip the breaker unnecessarily, cutting off a healthy service. Your users see "service unavailable" when the service is actually fine.
  • Threshold too high — Circuit doesn't open until 50+ failures. By then, you've already exhausted thread pools and cascaded the failure. The breaker opens too late to help.
  • Recovery timeout too short — Circuit enters HALF-OPEN after 5 seconds, sends a test request, and it fails because the service needs 60 seconds to recover. Circuit reopens. This cycle repeats, and the service never gets a chance to fully recover because you keep testing too early.
  • Recovery timeout too long — Circuit stays OPEN for 5 minutes. The service recovered after 30 seconds, but your system doesn't know. Users get fallback responses for 4.5 minutes unnecessarily.
  • HALF-OPEN lets too many through — Some implementations let multiple requests through in HALF-OPEN state. If the service is still struggling, these requests make it worse. Best practice: let exactly ONE request through to test recovery.
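That last pitfall can be fixed with a non-blocking lock: the one request that wins the lock becomes the recovery probe, and everyone else fails fast to a fallback. A minimal sketch (names are illustrative; this would bolt onto the breaker's HALF-OPEN branch):

```python
import threading

class ProbeGate:
    """Admits exactly one in-flight probe while the circuit is HALF-OPEN."""
    def __init__(self):
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # Non-blocking: returns False immediately if a probe is in flight
        return self._lock.acquire(blocking=False)

    def release(self):
        self._lock.release()

gate = ProbeGate()
if gate.try_acquire():
    try:
        pass  # send the single test request here
    finally:
        gate.release()
else:
    pass  # another request is already probing: fail fast with a fallback
```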
Circuit Breakers in Other Languages & Frameworks

Java (Resilience4j):

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)          // Open after 50% failure rate
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .slidingWindowSize(10)               // Measure over last 10 calls
    .permittedNumberOfCallsInHalfOpenState(3)
    .build();

CircuitBreaker breaker = CircuitBreaker.of("payment", config);
Supplier<String> decorated = CircuitBreaker
    .decorateSupplier(breaker, () -> paymentApi.charge(amount));

Node.js (opossum):

import CircuitBreaker from 'opossum';

const breaker = new CircuitBreaker(paymentApi.charge, {
  timeout: 3000,          // Treat calls > 3s as failures
  errorThresholdPercentage: 50,
  resetTimeout: 30000,    // Try again after 30s
  volumeThreshold: 5      // Need at least 5 calls to trip
});

breaker.fallback(() => ({ status: 'pending', queued: true }));
breaker.on('open', () => logger.warn('Payment circuit OPEN'));
breaker.on('halfOpen', () => logger.info('Payment circuit testing...'));
breaker.on('close', () => logger.info('Payment circuit recovered'));

Go (sony/gobreaker):

cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:        "payment",
    MaxRequests: 1,                     // Allow 1 request in half-open
    Interval:    60 * time.Second,       // Reset failure count after 60s
    Timeout:     30 * time.Second,       // Time in open before half-open
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        return counts.ConsecutiveFailures > 5
    },
})
Infrastructure-level circuit breakers (Envoy, Istio, AWS)

You can implement circuit breakers at the infrastructure level without changing application code. This is especially useful for service meshes:

Envoy Proxy / Istio:

# Istio DestinationRule with circuit breaking
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100           # Bulkhead: max connections
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 10   # Queue limit
        http2MaxRequests: 100         # Max concurrent
        maxRequestsPerConnection: 10  # Connection reuse
    outlierDetection:
      consecutive5xxErrors: 5        # CB threshold
      interval: 30s                    # Evaluation window
      baseEjectionTime: 30s           # Recovery timeout
      maxEjectionPercent: 50          # Max % of hosts ejected

AWS App Mesh:

// Virtual node with circuit breaker
{
  "connectionPool": {
    "http": {
      "maxConnections": 100,
      "maxPendingRequests": 10
    }
  },
  "outlierDetection": {
    "maxServerErrors": 5,
    "maxEjectionPercent": 50,
    "interval": { "value": 30, "unit": "s" },
    "baseEjectionDuration": { "value": 30, "unit": "s" }
  }
}

When to use infrastructure vs application circuit breakers:

  • Infrastructure (Envoy/Istio): Polyglot environment, many services, team can't modify every app. Circuit breaking is applied uniformly.
  • Application (Resilience4j/opossum): More control over thresholds per-endpoint, custom fallback logic, more granular metrics.
  • Both: Infrastructure for coarse protection, application for fine-grained business logic fallbacks.

Circuit Breaker Interacts With Retry

Each retry that fails counts as a separate failure in the circuit breaker. If your threshold is 5 and each request retries 3 times, it only takes 2 user requests to trip the circuit (2 × 3 = 6 failures). We'll cover this trap in detail in When Patterns Collide.
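A sketch of the fix this interaction requires: keep the retry loop inside the breaker, so the breaker sees one failure per user request rather than one per attempt. The classes here are simplified stand-ins for the fuller implementations earlier in this post:

```python
class Breaker:
    """Minimal count-based breaker: counts failures of whatever it wraps."""
    def __init__(self, threshold=5):
        self.failures = 0
        self.threshold = threshold

    @property
    def open(self):
        return self.failures >= self.threshold

    def call(self, func):
        if self.open:
            raise RuntimeError("circuit open")
        try:
            result = func()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            raise

def with_retries(func, attempts=3):
    last_err = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as e:
            last_err = e
    raise last_err

def user_request(breaker, func):
    # Retries INSIDE the breaker: 3 failed attempts = 1 breaker failure
    return breaker.call(lambda: with_retries(func))
```

With the nesting reversed (`with_retries(lambda: breaker.call(func))`), every attempt increments the failure count and two user requests can trip a threshold of 5.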

War Story: The Redis Singleton That Took Down Production

A production chatbot used Redis for session management. Classic singleton pattern — one Redis connection shared across the entire application:

Python
class InfrastructureManager:
    _instance = None

    def __new__(cls):
        # Singleton: every caller shares ONE Redis connection
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.redis = Redis(host='redis.internal', port=6379)
        return cls._instance

The Trigger: A routine network blip. Redis connection drops for 3 seconds. Should be a non-event.

The Cascade:

T=0: Network blip, Redis connection drops
T=0.5s: Redis client enters 'reconnecting' state
T=1s: Application code checks "is Redis healthy?"
T=1.1s: Code tries to create "new" Redis connection
T=1.2s: ioredis throws: "Redis is already connecting/connected"
↓ Error propagates to webhook handler
T=1.4s: Webhook returns 500 to WhatsApp
↓ WhatsApp marks bot as "unhealthy"
T=3s: WhatsApp stops sending messages to bot
T=30min: Users see "bot offline" until manual restart

The Root Cause: Redis client manages reconnection internally. The singleton pattern meant only ONE connection existed. When the application code tried to "help" by creating a new connection, it conflicted with the client's own reconnection logic.

Redis connection states you must handle

Redis clients go through specific states during disconnection:

ready → connecting → reconnecting → (on success) → ready
  • During 'reconnecting': DON'T create new connections. The client is handling it.
  • During 'connecting': DON'T queue commands. They'll pile up.
  • Only in 'ready': Safe to use normally.

The Fix: Respect the client's connection states and use fallbacks during recovery:

Python
def execute_redis_command(self, command):
    # '.status' mirrors ioredis; with redis-py you'd track this
    # yourself via connection error callbacks
    if self.redis.status == 'ready':
        return command()
    elif self.redis.status in ['connecting', 'reconnecting']:
        # Don't interfere - let the client handle reconnection
        return self.fallback_response()
    else:
        # Actually disconnected ('end'/'close') - now we can reconnect
        self.redis = Redis(host='redis.internal', port=6379)
        return command()

A circuit breaker around the Redis calls would have caught this at T=1s: after the first few failures, it would fail fast with a fallback instead of letting the error propagate to WhatsApp.

Think About It
Your circuit breaker is configured for 5 failures to trip. Each request does 3 retries before giving up. How many user requests will it take to trip the circuit?
5 requests (5 failures)
2 requests (2 requests x 3 retries = 6 failures > threshold)
15 requests (5 threshold x 3 retries)
It depends on the success rate
If each retry counts as a failure in your circuit breaker, 2 requests x 3 retries = 6 failures, which exceeds the 5-failure threshold. This is the "retry storm" problem. The fix: only count the final failure (after all retries exhausted) against the circuit breaker, not each individual retry attempt. Otherwise, your circuit breaker trips way too fast.
Defense Layer 4
Bulkheads: Isolating Failures

Bulkheads

The Titanic sank because water flooded from compartment to compartment — there were walls between compartments, but they didn't go all the way to the ceiling. Modern ships have bulkheads: fully sealed, watertight compartments. One compartment floods? The rest stay dry. The ship stays afloat.

In software, bulkheads isolate resources so one failing component can't consume all resources and take down everything else. Without bulkheads, a single slow dependency can starve every other feature of threads, connections, or memory.

The Math That Matters

Your server has 200 threads shared across 4 services. Payment API slows to 30s per request. Without bulkheads: all 200 threads get consumed waiting for payment, and every other feature dies. With bulkheads allocating 50 threads per service: only payment's 50 threads are affected. Product catalog, search, and user profiles keep their 150 threads and continue serving traffic as if nothing happened.

Bulkhead Pattern: Isolated Thread Pools
Without Bulkheads
Shared Thread Pool (20 threads)
All threads blocked by slow Payment API
Product, Search, Cart all waiting for threads...
With Bulkheads
Payment Pool (5 threads)
Product Pool (5 threads)
Search Pool (5 threads)
Payment is blocked, but Product & Search work fine!

Bulkheads limit the blast radius. A slow dependency can only affect its allocated resources.

Types of Bulkheads

Thread Pool Isolation

Separate thread pools for different dependencies. Payment gets 10 threads, catalog gets 20.

Most common
Connection Pool Isolation

Dedicated DB/Redis connections per feature. See DB Connections post for deep treatment.

Critical for databases
Semaphores

Limit concurrent requests to a dependency. Simpler than thread pools, no resource allocation.

Lightweight
Process/Container Isolation

Separate processes or containers for critical services. Memory/CPU isolation at OS level.

Strongest isolation
Kubernetes as a bulkhead: resource limits and pod isolation

If you run Kubernetes, you already have a powerful bulkhead mechanism: resource limits. By setting CPU and memory limits per pod/container, one misbehaving service can't starve the rest of the cluster.

# Payment service: critical, gets generous resources
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: payment
        resources:
          requests:
            cpu: "500m"       # Guaranteed 0.5 CPU
            memory: "512Mi"    # Guaranteed 512MB
          limits:
            cpu: "1000m"      # Max 1 CPU (can't steal more)
            memory: "1Gi"      # Max 1GB (OOMKilled if exceeded)

---
# Analytics service: non-critical, gets fewer resources
apiVersion: apps/v1
kind: Deployment
metadata:
  name: analytics-service
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: analytics
        resources:
          requests:
            cpu: "100m"       # Modest allocation
            memory: "128Mi"
          limits:
            cpu: "200m"       # Strict limit
            memory: "256Mi"    # If it leaks memory, K8s kills it

Why this is a bulkhead: If the analytics service has a memory leak or CPU spike, Kubernetes limits the damage to its allocated resources. The payment service (with its own allocation) is completely unaffected. The OOMKiller handles the cleanup.

Combined with PodDisruptionBudgets: You can also ensure that even during node failures or updates, critical services always have minimum replicas running:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-pdb
spec:
  minAvailable: 2   # Always keep 2 payment pods running
  selector:
    matchLabels:
      app: payment-service
At the application level, a semaphore bulkhead needs only a few lines of asyncio:

Python
import asyncio

# Semaphore as a simple bulkhead
payment_semaphore = asyncio.Semaphore(10)  # Max 10 concurrent payment calls
inventory_semaphore = asyncio.Semaphore(20)  # Max 20 concurrent inventory calls

async def process_payment(order):
    async with payment_semaphore:
        # Only 10 payments can run concurrently
        return await payment_api.charge(order)

async def check_inventory(items):
    async with inventory_semaphore:
        # Inventory checks don't compete with payments
        return await inventory_api.check(items)

What Can Go Wrong With Bulkheads

  • Pools too small — You give payment only 5 threads, but normal traffic needs 15. Now you're throttling healthy traffic during peak load. Size bulkheads based on actual traffic patterns, not guesses.
  • Pools too large — You give payment 150 out of 200 threads "because it's important." Now a payment slowdown still consumes 75% of your capacity. The point of a bulkhead is to limit impact.
  • No overflow strategy — When the bulkhead is full (all threads busy), what happens to new requests? They need to fail fast with a clear error, not queue up indefinitely.
  • Static allocation — Traffic patterns change. Your bulkhead sizes should be based on monitoring, not set-and-forget. Review quarterly at minimum.
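One way to implement the overflow strategy from the third bullet: a semaphore-style bulkhead that rejects instead of queueing. A minimal asyncio sketch — the `payment_api` call and the pending-order response are placeholder assumptions, not part of any library:

```python
import asyncio

class BulkheadFullError(Exception):
    """Raised when the bulkhead rejects a request instead of queueing it."""

class FailFastBulkhead:
    """Counts in-flight requests; overflow fails fast with a clear error."""

    def __init__(self, max_concurrent: int):
        self.max = max_concurrent
        self.active = 0

    async def execute(self, coro_fn):
        if self.active >= self.max:
            # Reject immediately - never queue behind a slow dependency
            raise BulkheadFullError(
                f"bulkhead full ({self.active}/{self.max}), request rejected")
        self.active += 1  # safe: asyncio is single-threaded, no await above
        try:
            return await coro_fn()
        finally:
            self.active -= 1

payment_bulkhead = FailFastBulkhead(10)

async def charge(order):
    try:
        # payment_api is assumed from the surrounding examples
        return await payment_bulkhead.execute(lambda: payment_api.charge(order))
    except BulkheadFullError:
        # Overflow strategy: degrade instead of hanging
        return {"status": "pending", "reason": "payment busy"}
```

The rejection path doubles as the natural hook for a fallback — the same reject-on-overflow approach as the Node.js `Bulkhead` sketch later in this section.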
Real Example: Redis Connection Pool Bulkheads

The bulkhead principle applies to any shared resource — not just threads. Here's a common Redis anti-pattern and the fix:

Python - BAD: Shared Redis client
# BAD: One shared Redis client for everything
redis = Redis(connection_pool=ConnectionPool(max_connections=50))

def handle_webhook(data):       # Uses shared redis
    redis.get(f"session:{data.user}")

def cache_response(key, val):   # Uses shared redis
    redis.setex(key, 3600, val)

def log_analytics(event):       # Uses shared redis
    redis.lpush("analytics", event)

# If analytics logging blocks, sessions AND cache break too
Python - GOOD: Separate pools by criticality
# GOOD: Separate pools by criticality
pool_sessions = ConnectionPool(max_connections=30)    # Critical
pool_cache = ConnectionPool(max_connections=15)       # Can fail
pool_analytics = ConnectionPool(max_connections=5)    # Best-effort

redis_sessions = Redis(connection_pool=pool_sessions)
redis_cache = Redis(connection_pool=pool_cache)
redis_analytics = Redis(connection_pool=pool_analytics)

def handle_webhook(data):
    return redis_sessions.get(f"session:{data.user}")  # Protected

def cache_response(key, val):
    return redis_cache.setex(key, 3600, val)  # Isolated

def log_analytics(event):
    return redis_analytics.lpush("analytics", event)  # Can't hurt others

Why this works: Analytics Redis slow? Only 5 connections blocked. Sessions (30 connections) completely unaffected. Each pool has its own failure domain.

When to use: Multiple features share the same Redis/database, features have different criticality levels, some operations are "nice to have" vs "must have."

When NOT to use: Single-purpose Redis (only sessions, only cache), low traffic where pool exhaustion is unlikely.

Bulkheads in Other Languages & Frameworks

Java (Resilience4j):

// Thread pool bulkhead: dedicated thread pool per dependency
ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
    .maxThreadPoolSize(10)
    .coreThreadPoolSize(5)
    .queueCapacity(20)
    .build();

ThreadPoolBulkhead bulkhead = ThreadPoolBulkhead.of("payment", config);

// Semaphore bulkhead: simpler, limits concurrency
BulkheadConfig semConfig = BulkheadConfig.custom()
    .maxConcurrentCalls(10)
    .maxWaitDuration(Duration.ofMillis(500))
    .build();

Node.js (custom semaphore):

class Bulkhead {
  constructor(maxConcurrent) {
    this.max = maxConcurrent;
    this.current = 0;
  }

  async execute(fn) {
    if (this.current >= this.max) {
      throw new Error('Bulkhead full - request rejected');
    }
    this.current++;
    try { return await fn(); }
    finally { this.current--; }
  }
}

const paymentBulkhead = new Bulkhead(10);
const searchBulkhead = new Bulkhead(20);
Defense Layer 5
Fallbacks: Graceful Degradation

Fallbacks

When your flight gets cancelled, the airline doesn't tell you "no flight, go home." They rebook you on the next available flight, offer a hotel voucher, or at minimum give you a refund. They have a hierarchy of backup plans. That's a fallback strategy.

When all other defenses fail — the timeout fires, retries are exhausted, circuit breaker is open — the fallback is your last line of defense. It provides degraded but functional behavior instead of a crash or error page. The user experience goes from "this is broken" to "this feature is temporarily limited."

The User Experience Priority

A page with "Recommendations unavailable" is infinitely better than a 500 error. A checkout with "Express shipping unavailable, standard only" is better than no checkout at all. Partial service beats no service.

Fallback Strategies
Cached Value
Return stale data from cache
Default Value
Return safe default response
Alternate Service
Try a backup provider
Feature Disable
Hide broken feature
Python
async def get_product_recommendations(user_id):
    try:
        # Try the ML recommendation service
        return await recommendation_service.get(user_id)
    except (TimeoutError, CircuitOpenError):
        # Fallback 1: Try cached recommendations
        cached = await cache.get(f"recs:{user_id}")
        if cached:
            return cached

        # Fallback 2: Return popular products
        return await get_popular_products()
    except Exception:
        # Fallback 3: Empty recommendations (feature disabled)
        return []

What Can Go Wrong With Fallbacks

  • Stale cached data — Your fallback returns prices from 3 hours ago. A customer sees $99 but gets charged $129. Fallback data needs a staleness limit — better to show "price unavailable" than a wrong price.
  • Untested fallbacks — The most common failure. You write a fallback path, never test it, and when it's finally triggered in production, it throws its own exception. If you haven't tested your fallback, it doesn't work.
  • Fallback cascades — Fallback A calls Service B, which is also down. Now your fallback needs a fallback. Keep fallback logic simple and local — don't make external calls in fallback paths.
  • Silent degradation — Your search fallback returns empty results. Users think there are no products. Always communicate when you're in degraded mode: "Search is temporarily limited, showing popular items instead."
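The staleness limit from the first bullet can be enforced by timestamping every cache write. A sketch — the 600-second limit is an arbitrary example, tune it per feature:

```python
import time

STALENESS_LIMIT = 600  # seconds - illustrative; wrong prices need a tight limit

class TimestampedCache:
    """Cache that records write time so fallbacks can refuse stale data."""

    def __init__(self):
        self._store = {}  # key -> (value, written_at)

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())

    def get_if_fresh(self, key, max_age=STALENESS_LIMIT):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, written_at = entry
        if time.monotonic() - written_at > max_age:
            return None  # too stale - better "unavailable" than wrong
        return value

# Usage sketch: a price fallback that admits the gap instead of lying
cache = TimestampedCache()
cache.set("price:sku-1", 99)
price = cache.get_if_fresh("price:sku-1", max_age=600)
display = f"${price}" if price is not None else "Price unavailable"
```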
Fallback Strategy Decision Guide

Choosing the right fallback depends on the feature:

  • Recommendations engine down? → Cached recommendations → Popular products → Empty list. Users barely notice.
  • Search down? → Show browse categories. Users can still navigate, just differently.
  • Payment processor down?Don't fall back silently. Show "Checkout temporarily unavailable, try again in a few minutes." Some features shouldn't degrade — they should communicate clearly.
  • User profile down? → Show name from JWT token + "Some features unavailable." Basic info from the auth token, no external call needed.
  • Analytics/logging down? → Drop silently. Users shouldn't know or care. Log locally and batch-send later.

The rule: Critical business operations (payment, orders) should fail loudly with clear messaging. Non-critical features (recommendations, personalization, analytics) should degrade silently or with minimal user impact.

Fallback patterns in other languages

Java (Resilience4j):

// Decorate supplier with circuit breaker + fallback
Supplier<String> decorated = CircuitBreaker
    .decorateSupplier(breaker, () -> recommendationService.get(userId));

String result = Try.ofSupplier(decorated)
    .recover(CallNotPermittedException.class,
        e -> cache.get("recs:" + userId))        // CB open: use cache
    .recover(TimeoutException.class,
        e -> getPopularProducts())                  // Timeout: popular items
    .recover(Exception.class,
        e -> Collections.emptyList())               // Anything else: empty
    .get();

Node.js (opossum):

const breaker = new CircuitBreaker(getRecommendations, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000
});

// Chain of fallbacks
breaker.fallback(async (userId) => {
  // Try cache first
  const cached = await redis.get(`recs:${userId}`);
  if (cached) return JSON.parse(cached);

  // Then popular products
  return getPopularProducts();
});

const result = await breaker.fire(userId);

Go (custom with generics):

func withFallback[T any](
    primary func() (T, error),
    fallbacks ...func() (T, error),
) (T, error) {
    result, err := primary()
    if err == nil {
        return result, nil
    }

    for _, fb := range fallbacks {
        result, err = fb()
        if err == nil {
            return result, nil
        }
    }
    return result, fmt.Errorf("all fallbacks exhausted: %w", err)
}

// Usage
recs, err := withFallback(
    func() ([]Product, error) { return recsService.Get(userID) },
    func() ([]Product, error) { return cache.GetRecs(userID) },
    func() ([]Product, error) { return GetPopular(), nil },
)
Real-world: Building a fallback hierarchy for a product page

A product page depends on 5 services. Here's how to degrade each independently so the page always loads:

Python - Product page with independent fallbacks
async def render_product_page(product_id: str):
    # Fetch all data concurrently with independent fallbacks
    product, reviews, recs, inventory, pricing = await asyncio.gather(
        fetch_with_fallback(
            primary=lambda: product_service.get(product_id),
            fallback=lambda: cache.get(f"product:{product_id}"),
            criticality="critical"  # No product = no page
        ),
        fetch_with_fallback(
            primary=lambda: review_service.get(product_id),
            fallback=lambda: {"reviews": [], "message": "Reviews loading..."},
            criticality="optional"
        ),
        fetch_with_fallback(
            primary=lambda: recommendation_service.get(product_id),
            fallback=lambda: get_popular_in_category(product_id),
            criticality="optional"
        ),
        fetch_with_fallback(
            primary=lambda: inventory_service.check(product_id),
            fallback=lambda: {"available": True, "estimated": True},
            criticality="important"  # Show "likely available"
        ),
        fetch_with_fallback(
            primary=lambda: pricing_service.get(product_id),
            fallback=lambda: cache.get(f"price:{product_id}"),
            criticality="critical"  # Wrong price = liability
        ),
        return_exceptions=True
    )

    # Critical services failed = show error page
    if isinstance(product, Exception):
        return error_page("Product not found")
    if isinstance(pricing, Exception):
        return error_page("Price unavailable, try again")

    # Optional services failed = show page with gaps
    return render_template("product.html",
        product=product,
        reviews=reviews if not isinstance(reviews, Exception) else [],
        recs=recs if not isinstance(recs, Exception) else [],
        inventory=inventory,
        pricing=pricing,
    )

Key insight: Use asyncio.gather(return_exceptions=True) so one failing service doesn't cancel the others. Then check each result independently. This is the bulkhead principle applied at the application level.

Putting It Together
Defense in Depth

The Complete Stack

Real resilience comes from layering these patterns. Each layer catches what the previous layer missed. Think of it like a building's safety systems:

Timeout
Like a fire alarm — detects the problem early and alerts you. Without it, the fire spreads before anyone notices.
Retry
Like a sprinkler system — handles small fires automatically. Most fires (transient errors) get extinguished without human intervention.
Circuit Breaker
Like fire doors — when the fire is too big for sprinklers, close the doors to contain it. Stop trying to fight what's already lost.
Bulkhead
Like fireproof walls — even if one room burns, the rest of the building is protected. Different departments (dependencies) in isolated compartments.
Fallback
Like emergency exits — when everything else fails, people (users) still get out safely. They reach a degraded-but-functional experience.
Defense in Depth: The Complete Protection Stack
Timeout (2s) Bound the wait
Retry with Backoff (3x) Handle transient
Circuit Breaker Fail fast
Bulkhead (10 concurrent) Limit blast
Fallback Graceful degrade

Request flows through each layer. If one catches the failure, the layers below don't need to activate.

Python - Complete Pattern
class ResilientClient:
    def __init__(self):
        self.circuit = CircuitBreaker(failure_threshold=5)
        self.semaphore = asyncio.Semaphore(10)  # Bulkhead
        self.cache = {}

    async def call(self, request):
        # Layer 1: Bulkhead
        async with self.semaphore:
            # Layer 2: Circuit Breaker
            if self.circuit.is_open:
                return self._fallback(request)

            # Layer 3: Retry with Backoff
            for attempt in range(3):
                try:
                    # Layer 4: Timeout
                    result = await asyncio.wait_for(
                        self._do_request(request),
                        timeout=2.0
                    )
                    self.circuit.record_success()
                    self._update_cache(request, result)
                    return result

                except asyncio.TimeoutError:
                    if attempt < 2:
                        await asyncio.sleep(1 * (2 ** attempt))
                    else:
                        # Record one failure per request, not per retry
                        # (see Trap 1 in When Patterns Collide)
                        self.circuit.record_failure()

            # Layer 5: Fallback
            return self._fallback(request)

    def _fallback(self, request):
        return self.cache.get(request.key, request.default)
The Hidden Dangers
When Patterns Collide

When Patterns Collide

Each pattern works well in isolation. But when you combine them (which you must), they can interact in ways that create new failure modes. These interactions are the most insidious bugs because each pattern appears to be working correctly.

Pattern Interaction Map
Timeout
Bounds wait time
TRAP: budget overflow
Retry
Handles transient
↓ triggers
TRAP: failure count ×3
↓ feeds into
Bulkhead
Limits blast radius
Circuit Breaker
Fails fast
Fallback
Graceful degrade
TRAP: pool ≠ bulkhead
TRAP: health check dep

Red labels show where patterns interact dangerously. Each trap is explained below.

Trap 1: Retry + Circuit Breaker

Your circuit breaker opens after 5 failures. Your retry policy retries 3 times per request. A user makes 2 requests to a failing service:

Request 1: try → fail, retry 1 → fail, retry 2 → fail (3 failures counted)
Request 2: try → fail, retry 1 → fail, retry 2 → fail (6 failures total)
Circuit OPENS after just 2 user requests!

The fix: Don't count retries as separate failures in the circuit breaker. Count each original request as one failure, regardless of how many retries it took. Or place the circuit breaker outside the retry logic — if the circuit is open, skip retries entirely.

Python - Correct ordering
# CORRECT: Circuit breaker wraps the entire retry block
async def call_payment(request):
    if circuit_breaker.is_open:
        return fallback(request)  # Skip retries entirely

    for attempt in range(3):
        try:
            result = await payment_api.charge(request)
            circuit_breaker.record_success()
            return result
        except TransientError:
            if attempt == 2:  # Last attempt
                circuit_breaker.record_failure()  # Count once, not 3x
                raise
            await asyncio.sleep(1 * (2 ** attempt))

Trap 2: Timeout + Retry Budget

Your user-facing SLA is 5 seconds. Your timeout per request is 2 seconds. You retry 3 times. Here's what actually happens:

T=0.0s: User clicks "Place Order"
T=0.0s: Attempt 1 → payment service slow...
T=2.0s: Timeout! Start retry 1 (wait 1s backoff)
T=3.0s: Attempt 2 → payment service still slow...
T=5.0s: SLA breached! User still waiting. Timeout! Start retry 2 (wait 2s backoff)
T=7.0s: Attempt 3 → payment service still slow...
T=9.0s: Final timeout. Return error to user.
Total: 9 seconds. User's SLA was 5 seconds. They left at T=6s.
The Retry Budget Rule

Your total retry budget must fit within your user-facing timeout: per_request_timeout × attempts + total_backoff ≤ SLA_timeout. If the SLA is 5s and the per-request timeout is 2s with 1s base backoff, you can afford 2 attempts, i.e. 1 retry (2s + 1s backoff + 2s = 5s). Not 3 retries. Always account for backoff delays in the budget.
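The budget math can be enforced in code with a deadline instead of a fixed retry count. A sketch of a deadline-aware retry loop — the function and parameter names are illustrative:

```python
import asyncio
import time

async def call_with_budget(fn, sla_seconds=5.0, per_request_timeout=2.0,
                           base_backoff=1.0, max_attempts=3):
    """Retry loop that never exceeds the user-facing SLA.

    Before each attempt and each backoff sleep, check the remaining
    deadline and give up early rather than blow the SLA.
    """
    deadline = time.monotonic() + sla_seconds
    last_error = None
    for attempt in range(max_attempts):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # budget exhausted - don't start another attempt
        try:
            # Never wait longer than the remaining budget allows
            return await asyncio.wait_for(
                fn(), timeout=min(per_request_timeout, remaining))
        except (asyncio.TimeoutError, ConnectionError) as e:
            last_error = e
            backoff = base_backoff * (2 ** attempt)
            if time.monotonic() + backoff >= deadline:
                break  # the backoff alone would bust the budget
            await asyncio.sleep(backoff)
    raise TimeoutError(f"gave up within SLA budget: {last_error!r}")
```

With the defaults above, the loop stops after whichever comes first: 3 attempts or the 5-second deadline.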

Trap 3: Bulkhead + Connection Pool Sizing

Your bulkhead allows 10 concurrent requests to the payment service. But your connection pool to the payment service has 5 connections. What happens?

Requests 1-5: Get connections from pool → call payment service (fine)
Requests 6-10: Bulkhead says "proceed"... but no connections available
↓ Block waiting for connection pool (another timeout!)
Result: Bulkhead lets 10 through, but 5 just block inside.
You've moved the bottleneck from the thread pool to the connection pool.

The fix: Size connection pools to at least match bulkhead limits. If your bulkhead allows N concurrent requests, your connection pool needs at least N connections. Better yet: set the pool timeout shorter than the bulkhead timeout, so pool exhaustion triggers a clean rejection.
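This mismatch is easy to catch at startup rather than in production. A sketch of a config validator encoding both rules — the config class and field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class DependencyConfig:
    """Hypothetical resilience config for one downstream dependency."""
    name: str
    bulkhead_limit: int        # max concurrent requests admitted
    pool_size: int             # connections available underneath
    pool_timeout_s: float      # how long to wait for a connection
    bulkhead_timeout_s: float  # how long a request may hold a slot

def validate(cfg: DependencyConfig) -> list:
    """Flag the Trap 3 misconfigurations before they reach production."""
    problems = []
    if cfg.pool_size < cfg.bulkhead_limit:
        problems.append(
            f"{cfg.name}: pool_size {cfg.pool_size} < bulkhead_limit "
            f"{cfg.bulkhead_limit} - admitted requests will block on the pool")
    if cfg.pool_timeout_s >= cfg.bulkhead_timeout_s:
        problems.append(
            f"{cfg.name}: pool_timeout {cfg.pool_timeout_s}s should be shorter "
            f"than bulkhead_timeout {cfg.bulkhead_timeout_s}s for clean rejection")
    return problems

# The exact trap from above: bulkhead admits 10, pool has only 5
bad = DependencyConfig("payment", bulkhead_limit=10, pool_size=5,
                       pool_timeout_s=5.0, bulkhead_timeout_s=2.0)
```

Run `validate()` over every dependency config in a startup check or a unit test, so a sizing mistake fails the build instead of the 2:47 PM deploy.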

Trap 4: Circuit Breaker + Health Checks

Your circuit breaker opens because the payment service is down. Now watch the cascade:

T=0: Payment service goes down
T=10s: Circuit breaker opens (good!)
T=15s: ALB health check hits /health
T=15.1s: /health endpoint tries to ping payment service
T=15.2s: Circuit breaker rejects → health check returns 500
T=45s: 3 failed health checks → ALB marks instance UNHEALTHY
T=45s: ALL traffic stops — even search, catalog, profile
Payment was down. Now EVERYTHING is down.

The fix: Health check endpoints should only check the service's own health (can it respond?), not the health of its dependencies. Dependency health is the circuit breaker's job, not the health check's job.

Python - Correct health check
@app.get("/health")
async def health_check():
    # GOOD: Only check OWN health
    return {
        "status": "healthy",
        "uptime": get_uptime(),
        "memory_mb": get_memory_usage(),
    }

@app.get("/health/dependencies")
async def deep_health_check():
    # Separate endpoint for monitoring dashboards (NOT for ALB)
    return {
        "payment": circuit_breaker_payment.state.value,
        "inventory": circuit_breaker_inventory.state.value,
        "recommendations": circuit_breaker_recs.state.value,
    }
Observability
Monitoring Your Defenses

Monitoring Your Defenses

You've added timeouts, retries, circuit breakers, bulkheads, and fallbacks. But how do you know they're actually working? Without monitoring, your defenses are invisible — you won't know if they're firing too often, not firing at all, or misconfigured.

Failure Symptom Taxonomy

When something goes wrong, start with the symptom and work backward to the pattern that should have caught it:

Symptom → Root Cause → Missing Pattern
Symptom You See Root Cause Pattern That Prevents It
Threads exhausted, all requests queued Dependency slow (not down), no timeout set Timeout — bound the wait to 2-5s
Requests fail once then succeed on refresh Transient network errors, no retry logic Retry + Backoff — auto-recover from blips
Sustained 5xx errors for 5+ minutes Dead dependency, wasting resources on every request Circuit Breaker — fail fast, stop hammering
Payment down takes product catalog down Shared thread pool, one dependency drains all resources Bulkhead — isolate thread/connection pools
Users see blank pages or 500 errors All-or-nothing behavior, no degraded mode Fallback — serve cached/default data
Retry storm makes outage worse 1000 clients retry simultaneously Jitter — randomize retry timing
Circuit breaker opens from 2 requests Retries counted as separate failures Correct ordering — CB wraps retry block
User waits 30 seconds for error Retry budget exceeds SLA Budget math — timeout × attempts + backoff ≤ SLA

What to Monitor Per Pattern

Resilience Metrics Dashboard
Pattern Key Metrics Alert When...
Timeouts Timeout rate (%), p99 latency, timeout count by endpoint Timeout rate > 5% for any dependency
Retries Retry rate, retries per request, success-after-retry rate Retry rate > 20% (something is broken, not transient)
Circuit Breakers State (OPEN/CLOSED/HALF-OPEN), trips per hour, time spent OPEN Circuit opens more than 3 times per hour
Bulkheads Pool utilization (%), rejected requests, queue depth Pool utilization > 80% sustained
Fallbacks Fallback invocation rate, fallback type used, duration in fallback Any fallback active > 5 minutes
What the logs look like when defenses fire

Timeout firing:

WARN  [payment-client] Request to /charge timed out after 2000ms
      request_id=abc-123 attempt=1/3 endpoint=payment-api.com
      // Look for: sustained timeout warnings = dependency issue

Circuit breaker opening:

ERROR [circuit-breaker] Circuit OPENED for payment-service
      failure_count=5 threshold=5 window=60s
      last_error="Connection timed out" recovery_timeout=30s
      // Look for: circuit state changes = investigate immediately

Bulkhead rejecting:

WARN  [bulkhead] Payment pool exhausted, rejecting request
      pool_size=10 active=10 queued=0 rejected=1
      // Look for: rejections = pool too small or dependency too slow

Fallback activating:

INFO  [recommendations] Serving cached recommendations (fallback)
      reason=circuit_open cache_age=3600s user_id=usr-456
      // Look for: sustained fallback usage = service needs attention
The Most Dangerous Signal

Watch for patterns that fire but nobody notices. If your circuit breaker has been opening and closing 5 times a day for the past month, you have a dependency that's unreliable enough to trip breakers but "working enough" that nobody investigates. These slow-burning issues become outages.
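One way to surface this signal: count OPEN transitions per service over a sliding window and flag breakers that flap just below your alert threshold. A sketch — the thresholds are illustrative:

```python
import time
from collections import deque

class FlapDetector:
    """Flags circuit breakers that trip 'a little, all the time'.

    Record every OPEN transition; warn when a breaker has tripped more
    than max_trips times inside the window. Reliable-enough-to-ignore
    is exactly what deserves investigation.
    """

    def __init__(self, max_trips=3, window_seconds=86400):
        self.max_trips = max_trips
        self.window = window_seconds
        self.trips = {}  # service name -> deque of trip timestamps

    def record_trip(self, service, now=None):
        now = time.monotonic() if now is None else now
        q = self.trips.setdefault(service, deque())
        q.append(now)
        # Drop trips that have aged out of the window
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_trips  # True = flapping, page a human

detector = FlapDetector(max_trips=3, window_seconds=86400)
```

Call `detector.record_trip("payment")` wherever your circuit breaker transitions to OPEN; when it returns True, emit an alert instead of another ignorable log line.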

What a Real Outage Looks Like (With Resilience Patterns)

Here's the same Tuesday 2:47 PM scenario from the opening — but this time, with all five patterns in place. Watch how the system self-heals:

2:47:00 PM — Everything normal. Payment API responds in ~200ms.
2:48:00 PM — Payment API slows to 5s. TIMEOUT fires at 2s. Thread freed.
2:48:02 PMRETRY #1 (after 1s backoff). Still slow. Timeout fires again.
2:48:05 PMRETRY #2 (after 2s backoff). Still slow. Give up on this request.
2:48:05 PMFALLBACK: Queue payment for later, return "order pending" to user.
2:48:10 PM — 5 requests have failed. CIRCUIT OPENS.
2:48:11 PM — Next payment request: circuit is OPEN. Fail fast in 1ms (no 2s wait).
2:48:11 PMFALLBACK: Queue payment, "order pending." User barely notices.
2:48:12 PMBULKHEAD: Only payment threads affected. Catalog, search, profile all serving normally.
2:48:40 PMHALF-OPEN: Circuit lets 1 test request through.
2:48:42 PM — Test request times out. Circuit re-opens for another 30s.
2:49:12 PMHALF-OPEN: Another test request. Payment responds in 300ms!
2:49:12 PMCIRCUIT CLOSES. Normal operation resumes.
2:49:13 PM — Queued payments start processing from the backlog.
2:49:30 PM — All queued payments processed. Full recovery.
Total user impact: ~90 seconds of "order pending" messages.
Vs. without patterns: complete platform outage for 8+ minutes.
Prometheus metrics for resilience patterns

If you use Prometheus (or any metrics system), here are the exact metrics to expose from your resilience layer:

Python - Prometheus metrics
from prometheus_client import Counter, Histogram, Gauge

# Timeout metrics
request_duration = Histogram(
    'http_client_request_duration_seconds',
    'Request duration in seconds',
    ['service', 'endpoint', 'status'],
    buckets=[.1, .25, .5, 1, 2, 5, 10]
)
timeout_total = Counter(
    'http_client_timeouts_total',
    'Total number of timeouts',
    ['service', 'endpoint']
)

# Retry metrics
retry_total = Counter(
    'http_client_retries_total',
    'Total retry attempts',
    ['service', 'attempt', 'result']
)

# Circuit breaker metrics
circuit_state = Gauge(
    'circuit_breaker_state',
    'Circuit breaker state (0=closed, 1=open, 2=half-open)',
    ['service']
)
circuit_trips = Counter(
    'circuit_breaker_trips_total',
    'Times circuit breaker tripped open',
    ['service']
)

# Bulkhead metrics
bulkhead_active = Gauge(
    'bulkhead_active_count',
    'Currently active requests in bulkhead',
    ['service']
)
bulkhead_rejected = Counter(
    'bulkhead_rejected_total',
    'Requests rejected by bulkhead',
    ['service']
)

# Fallback metrics
fallback_invocations = Counter(
    'fallback_invocations_total',
    'Fallback activations',
    ['service', 'fallback_type']
)

Key Grafana alerts to set up:

  • Timeout rate > 5% for 5 minutes — Dependency is degraded, investigate immediately
  • Circuit breaker opens — Instant alert, this means a dependency is down
  • Retry success rate < 50% — Retries aren't helping, the problem isn't transient
  • Bulkhead utilization > 80% for 10 minutes — Either traffic spike or dependency slowdown
  • Fallback active > 5 minutes — Service isn't recovering, needs human investigation
Defense Layer
Rate Limiting: Protecting Your Own Service

Rate Limiting

So far we've talked about protecting your system from failing dependencies. Rate limiting protects your system from too much incoming traffic — a related but different concern.

Rate limiting caps how many requests you'll accept, preventing overload from traffic spikes, attacks, or misbehaving clients.

Token Bucket Algorithm

The most common approach. Imagine a bucket that fills with tokens at a steady rate. Each request consumes a token. No tokens? Request denied (HTTP 429).

  • Bucket size determines burst capacity (e.g., 100 tokens = 100 rapid requests)
  • Refill rate determines sustained throughput (e.g., 10 tokens/second)
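The algorithm fits in a few lines of Python. A minimal in-process sketch using the example numbers above — production limiters usually live in Redis or at the gateway, not in application memory:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter.

    capacity = burst size, refill_rate = sustained tokens per second.
    """

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity  # start full: allow an initial burst
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should respond 429 with Retry-After

# Usage sketch: 100-request burst, 10 req/s sustained
limiter = TokenBucket(capacity=100, refill_rate=10)
if not limiter.allow():
    response = {"status": 429, "headers": {"Retry-After": "1"}}
```

Note the lazy refill: tokens are computed on demand from elapsed time, so there is no background timer to run or leak.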

Where to Apply Rate Limits

  • Edge/CDN: DDoS protection, geographic limits
  • API Gateway: Per-user limits, API key quotas
  • Application: Business logic limits, per-endpoint
  • Internal Services: Prevent one service from overwhelming another

The Response

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1706540400

Best Practices

  • Always return Retry-After header — Tell clients when to retry
  • Different limits for different tiers — Free vs paid users
  • Don't rate limit health checks — Your monitoring needs access
  • Log rate limited requests — Detect abuse patterns vs legitimate spikes

Real-World: The 2017 Amazon S3 Outage

On February 28, 2017, a typo during routine maintenance took down Amazon S3 in us-east-1 for nearly 4 hours. The blast radius was enormous: Slack, Trello, Quora, IFTTT, and thousands of other services went down. The internet felt like it was broken.

9:37 AM PT: Engineer runs maintenance script, typos a parameter
9:37 AM: More servers removed from S3 than intended
9:40 AM: S3 starts returning errors and slow responses
9:45 AM: Services using S3 start timing out
↓ Key detail: S3 wasn't fully down. It was SLOW.
9:50 AM: Thread pools exhaust across thousands of services
10:00 AM: Cascading failures. Services that don't use S3 go down because they share thread pools with S3-dependent code
10:30 AM: Even the AWS health dashboard was down (it used S3)
1:54 PM: Full recovery after nearly 4 hours

Why "slow" is worse than "down": If S3 had returned errors immediately, circuit breakers would have tripped and services would have served fallbacks. But S3 was responding slowly — 10-30 second response times. Without proper timeouts, threads waited. And waited. And consumed all available resources.

Every pattern from this post would have helped:

  • Timeouts (2s): Would have freed threads after 2s instead of holding them for 30s+. A 15x improvement in resource utilization.
  • Circuit breakers: After 5 timeouts, would have failed fast (1ms) instead of waiting 2s per request. Orders of magnitude less resource waste.
  • Bulkheads: Would have limited S3 thread consumption to its own pool. Product catalog, user profiles, and checkout — anything not needing S3 — would have kept working.
  • Fallbacks: Would have served locally cached content. Users see yesterday's data instead of an error page.

Services that survived (like Netflix) had all of these layered together. They showed degraded experience instead of complete failure. Netflix subscribers watched movies uninterrupted while much of the internet was down.
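The resource math behind that 15x claim is just Little's law: threads in flight = arrival rate × time each request holds a thread. A quick back-of-the-envelope sketch with illustrative numbers (100 req/s is our assumption, not a figure from the AWS postmortem):

```python
# Little's law: concurrent requests in flight = arrival rate * seconds each request holds a thread.
def threads_needed(requests_per_second: float, seconds_held: float) -> float:
    return requests_per_second * seconds_held

# Illustrative: 100 req/s hitting an S3-backed endpoint
no_timeout = threads_needed(100, 30)   # S3 responding in 30s: 3000 threads tied up
with_timeout = threads_needed(100, 2)  # 2s timeout: only 200 threads tied up

print(no_timeout / with_timeout)  # -> 15.0, the 15x improvement cited above
```

At 3,000 blocked threads, most thread pools are long since exhausted; at 200, a reasonably sized service keeps serving.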

Verification
Testing Your Defenses

Here's the uncomfortable truth: if you haven't tested your failure handling, it doesn't work. Every pattern in this post can have bugs — a timeout that's never applied, a circuit breaker with the wrong threshold, a fallback that throws its own exception. You won't discover these bugs during normal operation. You'll discover them during an outage, when it's too late.

The Untested Fallback Problem

The most common failure handling bug: you write a beautiful fallback path, deploy it, and never trigger it. Six months later, the cache layer it depends on gets refactored. When the fallback finally fires in production, it throws a NullPointerException. Your safety net had a hole in it.

Unit Testing Each Pattern

Every pattern can be tested in isolation with dependency injection and controlled failures:

Python - Testing Timeout Behavior
import pytest
import asyncio

# The async tests assume pytest-asyncio; retry_with_backoff, CircuitBreaker,
# CircuitOpenError, and ResilientClient are the implementations from earlier
# in this post.

@pytest.mark.asyncio
async def test_timeout_fires_on_slow_dependency():
    """Verify timeout actually triggers when dependency is slow."""

    async def slow_service():
        await asyncio.sleep(5.0)  # Simulate slow response
        return "should never reach here"

    with pytest.raises(asyncio.TimeoutError):
        await asyncio.wait_for(slow_service(), timeout=2.0)

def test_retry_succeeds_on_transient_failure():
    """Verify retry recovers from transient errors."""
    call_count = 0

    def flaky_service():
        nonlocal call_count
        call_count += 1
        if call_count < 3:
            raise ConnectionError("transient failure")
        return "success"

    result = retry_with_backoff(flaky_service, max_retries=3)
    assert result == "success"
    assert call_count == 3  # Failed twice, succeeded on third

def test_circuit_breaker_opens_after_threshold():
    """Verify circuit opens and fails fast."""
    cb = CircuitBreaker(failure_threshold=3, recovery_timeout=30)

    def always_fails():
        raise Exception("fail")

    # Trigger failures to open the circuit
    for _ in range(3):
        with pytest.raises(Exception):
            cb.call(always_fails)

    # Circuit should be OPEN now - next call fails fast
    with pytest.raises(CircuitOpenError):
        cb.call(lambda: "this should never execute")

@pytest.mark.asyncio
async def test_fallback_returns_cached_data():
    """Verify fallback serves cached data when primary fails."""
    client = ResilientClient()
    client.cache["product:123"] = {"name": "Widget", "price": 29.99}

    # Force circuit open to trigger fallback path
    client.circuit.state = CircuitState.OPEN

    result = await client.call(Request(key="product:123"))
    assert result["name"] == "Widget"  # Got cached data, not an error
What to test for each pattern

Timeouts:

  • Timeout actually fires (not silently ignored by the client)
  • Resources are released after timeout (no thread/connection leak)
  • The right exception type is thrown (so retry logic catches it)

Retries:

  • Retries happen the right number of times
  • Backoff delays increase exponentially
  • Non-retryable errors (4xx) are NOT retried
  • Final failure propagates after max retries

Circuit Breakers:

  • Opens after exactly N failures
  • Fails fast when OPEN (no actual call made)
  • Transitions to HALF-OPEN after recovery timeout
  • Closes after successful test request in HALF-OPEN
  • Re-opens if test request fails in HALF-OPEN

Bulkheads:

  • Rejects when pool is full
  • Doesn't block requests to other pools
  • Resources are released even on failure

Fallbacks:

  • Returns correct cached/default data
  • Handles missing cache gracefully
  • Doesn't make external calls (no cascading fallback failures)
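One item on the Retries list deserves a deterministic test: "backoff delays increase exponentially." You can verify the schedule without ever sleeping by computing the delays up front. A small sketch (the `backoff_delays` helper is hypothetical, not part of the retry implementation shown earlier):

```python
import random

def backoff_delays(base: float, max_retries: int, seed=None):
    """Compute a full-jitter backoff schedule without sleeping."""
    rng = random.Random(seed)
    return [rng.uniform(0, base * (2 ** attempt)) for attempt in range(max_retries)]

def test_backoff_grows_and_stays_bounded():
    delays = backoff_delays(base=1.0, max_retries=4, seed=42)
    # Full jitter: each delay falls in [0, base * 2^attempt]
    for attempt, delay in enumerate(delays):
        assert 0 <= delay <= 1.0 * (2 ** attempt)
    # The worst-case total bounds your retry budget: 1 + 2 + 4 + 8 = 15s
    assert sum(1.0 * (2 ** a) for a in range(4)) == 15.0
```

If your retry helper accepts an injectable sleep function, you can capture the actual delays the same way and assert on them directly.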

Chaos Engineering: Testing in Production

Unit tests verify your code logic. Chaos engineering verifies your system behavior. The idea is simple: intentionally inject failures and observe what happens.

Netflix's Simian Army Philosophy

Netflix runs chaos experiments in production because "the best way to verify that something works is to break it." Their Chaos Monkey randomly kills instances. Chaos Kong simulates entire region failures. If their systems survive these attacks during business hours, they'll survive real failures at 3am.

You don't need Netflix's scale to practice chaos engineering. Start small:

Python - Simple Fault Injection Decorator
import asyncio
import logging
import os
import random

logger = logging.getLogger(__name__)

class FaultInjector:
    """Inject failures for testing resilience patterns."""

    def __init__(self):
        self.enabled = os.getenv("CHAOS_ENABLED", "false") == "true"
        self.failure_rate = float(os.getenv("CHAOS_FAILURE_RATE", "0.0"))
        self.latency_ms = int(os.getenv("CHAOS_LATENCY_MS", "0"))

    def maybe_fail(self, service_name: str):
        if not self.enabled:
            return

        if random.random() < self.failure_rate:
            logger.warning(f"CHAOS: Injecting failure for {service_name}")
            raise ConnectionError(f"Chaos: {service_name} failure injected")

    async def maybe_delay(self, service_name: str):
        if not self.enabled or self.latency_ms == 0:
            return

        delay = random.uniform(0, self.latency_ms / 1000.0)
        logger.warning(f"CHAOS: Injecting {delay * 1000:.0f}ms delay for {service_name}")
        await asyncio.sleep(delay)

# Usage in your service client:
chaos = FaultInjector()

async def call_payment_service(order):
    chaos.maybe_fail("payment")
    await chaos.maybe_delay("payment")
    return await payment_api.charge(order)

# Enable in staging with env vars:
# CHAOS_ENABLED=true CHAOS_FAILURE_RATE=0.1 CHAOS_LATENCY_MS=3000
Your first chaos experiment: a step-by-step guide

Step 1: Pick one non-critical dependency (recommendation engine, analytics, notifications). Never start with payment or auth.

Step 2: Define your hypothesis. "If the recommendation service is down, the product page should still load in under 2 seconds, showing popular products instead of personalized ones."

Step 3: Inject the failure in staging. Use the fault injector above, or simply block the dependency's DNS/port with iptables:

# Block traffic to recommendation service (staging only!)
iptables -A OUTPUT -d recommendation-service.internal -j DROP

# Simulate slow responses (add 2s latency)
tc qdisc add dev eth0 root netem delay 2000ms

# Restore after test
iptables -D OUTPUT -d recommendation-service.internal -j DROP
tc qdisc del dev eth0 root netem

Step 4: Observe. Does the page still load? How long? What do users see? Check your monitoring dashboards: did the circuit breaker open? Did fallbacks fire? Did latency stay within SLA?

Step 5: Fix what broke. Then repeat with a different dependency.

Graduation path:

  • Level 1: Kill a non-critical dependency in staging
  • Level 2: Kill a critical dependency in staging
  • Level 3: Add random latency to all dependencies in staging
  • Level 4: Kill a non-critical dependency in production (during business hours, with the team watching)
Chaos Engineering Tools

If you outgrow the DIY approach, these tools automate chaos experiments:

Tool | Best For | Complexity
Chaos Monkey | Random instance termination (AWS) | Low
Litmus | Kubernetes-native chaos (pod/node/network) | Medium
Gremlin | Enterprise chaos platform (SaaS) | Low (managed)
Toxiproxy | Network-level fault injection (local/CI) | Low
tc / iptables | Linux kernel-level network manipulation | High (manual)

Start with Toxiproxy for local testing and CI pipelines. It sits between your service and dependencies, letting you inject latency, connection failures, bandwidth limits, and timeouts at the TCP level:

# Create a proxy for your payment dependency
# (listen on a free local port; 8474 is reserved for the toxiproxy API itself)
toxiproxy-cli create payment -l localhost:26443 -u payment-api:443

# Add 2 second latency
toxiproxy-cli toxic add payment -t latency -a latency=2000

# Simulate connection reset (service crash)
toxiproxy-cli toxic add payment -t reset_peer -a timeout=500

Failure Handling Cheat Sheet

TIMEOUTS

  • Connect: 1-5s | Read: 1-30s (by operation)
  • Rule: timeout = p99 latency + buffer
  • Cascade: downstream < upstream
  • Total budget: timeout × retries ≤ SLA

RETRIES

  • Backoff: delay = base × 2^attempt
  • Always add jitter (full jitter best)
  • Only retry 5xx/timeout, never 4xx
  • Use idempotency keys for non-idempotent ops

CIRCUIT BREAKERS

  • States: CLOSED → OPEN → HALF-OPEN
  • Threshold: 5-10 failures or 50% rate
  • Recovery timeout: 15-60 seconds
  • HALF-OPEN: let exactly 1 request through

BULKHEADS

  • Match pool size to bulkhead limit
  • Size by criticality, not equally
  • Reject (don't queue) when full
  • Review sizing quarterly

FALLBACKS

  • Hierarchy: cache → default → degrade → disable
  • No external calls in fallback paths
  • Test fallbacks regularly
  • Critical ops: fail loudly, don't silently degrade

PATTERN TRAPS

  • Retry + CB: count per-request, not per-attempt
  • Timeout + Retry: total ≤ SLA budget
  • Bulkhead + Pool: pool ≥ bulkhead limit
  • CB + Health: don't check deps in health endpoint
Practice Mode: Test Your Understanding
Scenario 1 of 4
Your e-commerce checkout calls a payment API, an inventory service, and a shipping calculator. The payment API is responding in 30 seconds instead of the usual 500ms. Users are complaining that "the whole site is broken."
What's the most likely root cause of the entire site being affected?
A
The payment API is down
B
No timeouts or shared thread pool — slow payment blocks everything
C
The database is down
Scenario 2 of 4
Your service needs a 5-second SLA. You're calling an external API with a 2-second timeout and 500ms processing overhead. Your team suggests 5 retries "to be safe."
Why is 5 retries a bad idea here?
A
5 retries is never appropriate
B
5 retries x 2s timeout = 10+ seconds, exceeding the 5s SLA
C
The external API might charge the user 5 times
Scenario 3 of 4
Your notification service depends on three downstream services: email API, SMS gateway, and push notification service. The email API just started failing 100% of requests. Your notification service is now returning 500 errors even for SMS-only requests.
Which pattern would have prevented this?
A
More aggressive timeouts
B
Faster retries
C
Bulkheads (isolated resources per downstream service)
Scenario 4 of 4
Your analytics service is completely down. You've added a circuit breaker that opens after 5 failures. But users are still seeing slow page loads (10+ seconds) before the analytics call gives up.
What's missing from your failure handling?
A
Timeouts — the circuit breaker needs failures to trip, but without timeouts, requests hang
B
More retries before circuit breaker trips
C
A fallback service for analytics
Action Items
Decision Checklist

What to Implement

Not every service needs every pattern. Here are the tradeoffs and a decision guide:

Tradeoffs: What Each Pattern Costs

Every pattern adds complexity. Use them when the protection justifies the cost:

Timeouts
Pros: Simple to add, prevents thread starvation, near-zero overhead
Cons: Choosing the right value is hard, too-low timeouts reject valid requests, must cascade correctly across services
Always use. The cost is near-zero, the risk of NOT having them is catastrophic.
Retries
Pros: Handles transient failures automatically, invisible to users
Cons: Can amplify load on failing services, must handle idempotency, adds latency, retry budget math required
Use for: Any external call that fails transiently. Skip for: Non-idempotent operations without idempotency keys.
Circuit Breakers
Pros: Stops wasting resources on known-dead services, auto-recovers
Cons: More state to manage, threshold tuning is non-trivial, can mask underlying issues, need monitoring
Use for: Any dependency that has outages lasting > 10 seconds. Skip for: Internal function calls, rarely-failing dependencies.
Bulkheads
Pros: Limits blast radius, one failure can't take everything down
Cons: Reduces total throughput (resources partitioned), sizing requires traffic analysis, adds operational complexity
Use for: Services calling 3+ dependencies. Skip for: Simple services with 1 dependency.
Fallbacks
Pros: Users see degraded service instead of errors, preserves user experience
Cons: Must be tested separately, stale data risks, silent degradation can mask issues, additional code paths to maintain
Use for: Any non-critical feature. Skip for: Financial operations where wrong data is worse than no data.
If your service... | You need... | Priority
Calls any external API | Timeouts on every call | Critical
Has transient failures | Retry with exponential backoff | Critical
Has unreliable dependencies | Circuit breakers | High
Can show partial data | Fallbacks with cached/default values | High
Calls multiple dependencies | Bulkheads (separate pools) | Medium
Receives external traffic | Rate limiting | High
Is business critical | All of the above + monitoring | Critical
Monday Morning Checklist

This week:

  • Audit all external API calls. Do they have timeouts? Add them.
  • Check your HTTP client defaults. Many have no timeout by default!
  • Add retry with backoff to your most critical integration.

This month:

  • Implement circuit breakers for your top 3 flakiest dependencies.
  • Add fallback responses for non-critical features.
  • Set up monitoring for timeout rates, retry rates, circuit breaker state.

This quarter:

  • Run chaos engineering: randomly inject failures and measure impact.
  • Review bulkhead sizing based on actual traffic patterns.
  • Document your failure handling strategy for the team.

Key Takeaways

  1. Slow is worse than down. A dead service fails fast. A slow service holds resources hostage and cascades.
  2. Layer your defenses. Timeout → Retry → Circuit Breaker → Bulkhead → Fallback. Each catches what the previous missed.
  3. Fail fast, recover gracefully. The goal isn't preventing all failures—it's containing their impact.
  4. Test your failure handling. If you haven't tested it, it doesn't work. Chaos engineering isn't optional.
  5. Monitor everything. You can't fix what you can't see. Track timeout rates, circuit breaker trips, fallback usage.
The Bottom Line

Distributed systems will fail. Your job isn't to prevent all failures—it's to build systems where failures are contained, fast, and recoverable. The patterns in this post are your toolkit for doing exactly that.

Complete production configuration: all 5 patterns for a real service

Here's a complete, copy-pastable configuration for an order service that calls payment, inventory, and notification services. This combines all 5 patterns with production-ready settings:

Python - Production Resilient Service Client
import asyncio
import random
import time
import logging

logger = logging.getLogger(__name__)

# ─── CONFIGURATION ───
# Tune these per-dependency based on your SLA and traffic patterns

PAYMENT_CONFIG = {
    "timeout_seconds": 5.0,        # p99 is 2s, generous buffer
    "max_retries": 2,              # Budget: 5s × 2 = 10s < 15s SLA
    "backoff_base": 1.0,           # 1s, 2s delays
    "cb_failure_threshold": 5,     # Open after 5 failures
    "cb_recovery_timeout": 30,     # Test recovery after 30s
    "bulkhead_max_concurrent": 20, # Max 20 parallel payment calls
}

INVENTORY_CONFIG = {
    "timeout_seconds": 2.0,        # Internal service, should be fast
    "max_retries": 1,              # Quick: 2s × 1 = 2s
    "backoff_base": 0.5,
    "cb_failure_threshold": 10,    # Higher tolerance (internal)
    "cb_recovery_timeout": 15,
    "bulkhead_max_concurrent": 30,
}

NOTIFICATION_CONFIG = {
    "timeout_seconds": 3.0,
    "max_retries": 3,              # More retries OK (async, no user waiting)
    "backoff_base": 2.0,
    "cb_failure_threshold": 5,
    "cb_recovery_timeout": 60,     # Longer recovery (external service)
    "bulkhead_max_concurrent": 10, # Non-critical, small pool
}

# ─── THE RESILIENT CLIENT ───

class ResilientServiceClient:
    def __init__(self, service_name: str, config: dict):
        self.name = service_name
        self.config = config

        # Circuit breaker state
        self.cb_state = "closed"
        self.cb_failure_count = 0
        self.cb_last_failure = 0

        # Bulkhead
        self.semaphore = asyncio.Semaphore(config["bulkhead_max_concurrent"])

        # Metrics (replace with Prometheus in production)
        self.metrics = {"timeouts": 0, "retries": 0, "cb_trips": 0,
                        "fallbacks": 0, "rejections": 0}

    async def call(self, func, fallback=None):
        # Layer 1: Bulkhead - limit concurrency
        if self.semaphore.locked():
            self.metrics["rejections"] += 1
            logger.warning(f"[{self.name}] Bulkhead full, using fallback")
            return fallback() if fallback else None

        async with self.semaphore:
            # Layer 2: Circuit Breaker - fail fast if known-broken
            if self.cb_state == "open":
                if time.time() - self.cb_last_failure > self.config["cb_recovery_timeout"]:
                    self.cb_state = "half_open"
                    logger.info(f"[{self.name}] Circuit HALF-OPEN, testing...")
                else:
                    self.metrics["fallbacks"] += 1
                    return fallback() if fallback else None

            # Layer 3: Retry with backoff
            for attempt in range(self.config["max_retries"]):
                try:
                    # Layer 4: Timeout
                    result = await asyncio.wait_for(
                        func(), timeout=self.config["timeout_seconds"]
                    )
                    self._record_success()
                    return result

                except asyncio.TimeoutError:
                    self.metrics["timeouts"] += 1
                    self._record_failure()

                    if attempt < self.config["max_retries"] - 1:
                        self.metrics["retries"] += 1
                        delay = self.config["backoff_base"] * (2 ** attempt)
                        jitter = random.uniform(0, delay)
                        await asyncio.sleep(jitter)

                except Exception as e:
                    self._record_failure()
                    if attempt < self.config["max_retries"] - 1:
                        self.metrics["retries"] += 1
                        await asyncio.sleep(self.config["backoff_base"])
                    else:
                        break

            # Layer 5: Fallback
            self.metrics["fallbacks"] += 1
            logger.warning(f"[{self.name}] All attempts failed, using fallback")
            return fallback() if fallback else None

    def _record_success(self):
        self.cb_failure_count = 0
        if self.cb_state != "closed":
            logger.info(f"[{self.name}] Circuit CLOSED (recovered)")
            self.cb_state = "closed"

    def _record_failure(self):
        self.cb_failure_count += 1
        self.cb_last_failure = time.time()
        if self.cb_failure_count >= self.config["cb_failure_threshold"]:
            if self.cb_state != "open":
                self.metrics["cb_trips"] += 1
                logger.error(f"[{self.name}] Circuit OPENED after {self.cb_failure_count} failures")
            self.cb_state = "open"

# ─── USAGE ───
payment_client = ResilientServiceClient("payment", PAYMENT_CONFIG)
inventory_client = ResilientServiceClient("inventory", INVENTORY_CONFIG)
notification_client = ResilientServiceClient("notification", NOTIFICATION_CONFIG)

async def place_order(order):
    inventory = await inventory_client.call(
        lambda: inventory_api.reserve(order.items),
        fallback=lambda: {"status": "estimated_available"}
    )

    payment = await payment_client.call(
        lambda: payment_api.charge(order.total),
        fallback=lambda: queue_for_later(order)
    )

    # Notification is fire-and-forget (non-critical)
    asyncio.create_task(notification_client.call(
        lambda: email_api.send_confirmation(order),
        fallback=lambda: None  # Silent fail OK
    ))

Key design decisions:

  • Payment has a generous timeout (5s) because external APIs are slower, but only 2 retries to keep within SLA
  • Inventory has a tight timeout (2s) because it's internal, and only 1 retry
  • Notifications get more retries (3) because they're async — no user waiting
  • Bulkhead sizes reflect criticality: payment (20), inventory (30), notifications (10)
  • Each service has independent circuit breakers so payment going down doesn't affect inventory

وَاللَّهُ أَعْلَمُ

And Allah knows best

وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ

May Allah's peace and blessings be upon our master Muhammad and his family
