The Day Everything Fell Down
Tuesday, 2:47 PM. You're in a meeting when your phone buzzes. Then buzzes again. Then doesn't stop.
The payment service is slow. Then the checkout page times out. Then the product catalog stops responding. Within 8 minutes, your entire e-commerce platform is down. 50,000 users. Zero transactions.
The root cause? A third-party payment API started responding slowly instead of failing fast.
1. Everything is fine: Payment API responding in ~200ms as usual. All systems nominal.
2. Payment API slows down: Response time jumps to 5 seconds. Your checkout threads start waiting...
3. Thread pool exhaustion: All checkout worker threads are blocked waiting for payment. New requests queue up.
4. Connection pool drained: Database connections held by blocked threads. Product service can't get connections.
5. Load balancer timeouts: Health checks failing. Load balancer marks servers as unhealthy. Traffic concentrates on remaining servers.
6. Complete system failure: Remaining servers overwhelmed. Homepage returns 503. Platform is down.
A slow dependency is worse than a dead one. Dead services fail fast. Slow services hold resources hostage.
The payment service never crashed. It never returned errors. It just got slow. And that slowness spread through every system that called it, like a virus.
Here's the code that caused the outage:
async def checkout(order):
    # No timeout - waits forever
    payment = await payment_api.charge(order.total)
    # No retry - fails permanently on transient errors
    if not payment.success:
        raise PaymentError("Payment failed")
    # No fallback - nothing to show when broken
    return {"order_id": order.id, "status": "confirmed"}
And here's what it should have looked like — the same function, protected:
async def checkout(order):
    try:
        # Timeout: don't wait forever
        payment = await asyncio.wait_for(
            # Circuit breaker: fail fast if payment is known-down
            circuit_breaker.call(
                # Retry: handle transient failures (2 attempts)
                lambda: retry_with_backoff(
                    lambda: payment_api.charge(order.total),
                    max_retries=2
                )
            ),
            timeout=5.0  # Total budget including retries
        )
        return {"order_id": order.id, "status": "confirmed"}
    except (asyncio.TimeoutError, CircuitOpenError):
        # Fallback: queue the payment for later processing
        await queue.send({"order": order.id, "retry_at": now() + minutes(5)})
        return {
            "order_id": order.id,
            "status": "pending",
            "message": "Payment is being processed. You'll receive confirmation shortly."
        }
The rest of this post teaches you each layer of defense. By the end, you'll know exactly when and how to add each one.
Why Distributed Systems Fail Differently
In a monolith, when something breaks, you usually get an exception. Stack trace. Clear error. Easy to debug.
In distributed systems, failures are partial, delayed, and ambiguous:
- Partial failure — Service A is up, Service B is down, Service C is slow. The system is simultaneously healthy AND broken.
- Network ambiguity — Did the request fail? Did it succeed but the response got lost? If you sent a payment request and got no response, was the customer charged?
- Cascade effects — One sick service infects everything that depends on it, like the cascade in the story above.
- No single source of truth — Each service has its own view of the world. Service A thinks the order was placed. Service B never got the message.
In 1994, Peter Deutsch (and others at Sun Microsystems) documented the false assumptions developers make about networks, now known as the fallacies of distributed computing. Every failure pattern in this post exists because one or more of these assumptions is wrong:
- The network is reliable — Packets get dropped. Cables get cut. DNS fails. (Why we need retries)
- Latency is zero — Cross-datacenter calls add 30-100ms. Slow services add seconds. (Why we need timeouts)
- Bandwidth is infinite — Large payloads cause backpressure. (Why we need bulkheads)
- The network is secure — TLS handshakes fail, certificates expire.
- Topology doesn't change — Load balancers add/remove instances. DNS changes.
- There is one administrator — Your dependency's team deploys on their own schedule.
- Transport cost is zero — Each network call costs CPU, memory, and latency.
- The network is homogeneous — Different services use different protocols, versions, and behaviors.
The takeaway: Every call across a network boundary is a potential failure point. The patterns in this post are your defense against each of these false assumptions.
A dead service:
- Fails immediately
- Releases resources fast
- Retry kicks in quickly
- Circuit breaker opens
- Impact: Contained

A slow service:
- Holds threads hostage
- Drains connection pools
- Timeouts may not trigger
- Looks "almost working"
- Impact: Cascades
A dead service is predictable. A slow service is a resource vampire that drains everything around it.
This is why failure handling in distributed systems requires multiple layers of defense. No single technique is enough. You need a complete toolkit.
How much does a slow service actually cost? Let's do the math for a typical e-commerce site handling 1,000 requests/second:
Scenario: Payment API goes from 200ms to 10s response time.
Without resilience patterns:
- 200 thread pool threads, each blocked for 10s: the entire pool is exhausted within seconds
- Remaining 998 req/s queued, then rejected
- Average order value: $80 × 1% conversion = $800/sec revenue
- 8-minute outage = $384,000 in lost revenue
- Plus: customer trust damage, social media complaints, SEO impact
With resilience patterns:
- Timeout at 2s, circuit opens after 5 failures (10 seconds)
- Fallback: "Payment pending" with queue for later processing
- 98% of requests served normally (non-payment features unaffected)
- 2% of requests get "pending" status, processed within 5 minutes
- Revenue impact: ~$0 (all orders eventually processed)
This post focuses on implementation — the specific patterns, code examples, and configuration values. For the conceptual foundation of why we need these patterns and how they fit into a broader reliability strategy, see Building Resilient Systems.
Timeouts
Think of calling a restaurant to make a reservation. If no one picks up after 10 rings, you hang up and try another restaurant. You don't hold the phone for 30 minutes hoping someone eventually answers. That's a timeout.
A timeout is a promise: "I will not wait forever." It's the simplest and most important failure handling mechanism. Without timeouts, a slow dependency can hold your resources indefinitely. With timeouts, you bound the worst-case wait time.
With 200 threads and no timeout, a dependency that responds in 30 seconds means one wave of calls consumes 200 threads × 30s = 6,000 thread-seconds. Your entire server is frozen waiting. With a 2-second timeout, the same wave costs 200 × 2s = 400 thread-seconds: 15x less resource waste, and those threads are freed to serve other requests.
import httpx

# BAD: No timeout - can hang forever
response = httpx.get("https://payment-api.com/charge")

# GOOD: Explicit timeouts
response = httpx.get(
    "https://payment-api.com/charge",
    timeout=httpx.Timeout(
        connect=5.0,  # Max time to establish connection
        read=10.0,    # Max time to read response
        write=5.0,    # Max time to send request
        pool=5.0      # Max time to get connection from pool
    )
)
Start with your SLA. If users expect responses in 2 seconds, your timeout can't be 30 seconds.
Measure p99 latency. Look at your dependency's 99th percentile response time. Set timeout slightly above that.
Rule of thumb: timeout = p99 + buffer. If p99 is 500ms, timeout at 1-2 seconds.
Different operations need different timeouts:
- Connection timeout: 1-5 seconds (how long to wait for TCP handshake)
- Read timeout: varies by operation (simple lookup: 1-5s, complex query: 10-30s)
- Total request timeout: your user-facing SLA minus processing overhead
What Can Go Wrong With Timeouts
Timeouts seem simple, but misconfigured timeouts cause their own disasters:
- Timeout too short — You time out legitimate requests. If your payment API normally takes 800ms but occasionally takes 1.5s during peak load, a 1s timeout means you'll reject ~5% of valid payments. Customers get charged but your system thinks it failed.
- Timeout too long — You're back to the original problem. A 60s timeout on a dependency that usually responds in 200ms means threads are held hostage for 60 seconds when things go wrong.
- Same timeout everywhere — A health check endpoint and a complex report generation shouldn't share the same timeout value. Match timeout to the operation.
- No timeout at all — Many HTTP clients default to "wait forever." Check your defaults.
Java (OkHttp):
new OkHttpClient.Builder()
    .connectTimeout(5, TimeUnit.SECONDS)
    .readTimeout(10, TimeUnit.SECONDS)
    .writeTimeout(5, TimeUnit.SECONDS)
    .build();
Node.js (Axios):
axios.get('https://payment-api.com/charge', {
  timeout: 5000,                      // 5 seconds total
  signal: AbortSignal.timeout(10000)  // hard abort at 10s
});
Go (net/http):
client := &http.Client{
Timeout: 5 * time.Second,
Transport: &http.Transport{
DialContext: (&net.Dialer{Timeout: 2 * time.Second}).DialContext,
TLSHandshakeTimeout: 2 * time.Second,
},
}
Here are typical timeout values based on real production data. Use these as starting points, then adjust based on your own p99 measurements:
| Service Type | Typical p99 | Recommended Timeout | Notes |
|---|---|---|---|
| Database (simple query) | 5-50ms | 200-500ms | Generous because spikes during vacuuming/replication |
| Database (complex query) | 100-500ms | 2-5s | Consider query optimization first |
| Redis/Memcached | 1-5ms | 50-200ms | Cache should be fast; long timeout defeats the purpose |
| Internal microservice | 10-100ms | 500ms-2s | Same datacenter, should be fast |
| Payment gateway (Stripe, etc.) | 500ms-2s | 5-10s | External, fraud checks add latency |
| Email API (SendGrid, SES) | 200-800ms | 5s | Should be async anyway |
| ML inference | 100ms-5s | 10-30s | Varies wildly by model size |
| File upload to S3 | Varies by size | 30-60s | Size-dependent; use multipart for large files |
The measurement approach: Run for 1 week, collect p50/p95/p99 latency per endpoint, then set timeout at p99 × 2. Revisit monthly as traffic patterns change.
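That measurement step can be sketched in a few lines. This is illustrative, not a library function: `recommended_timeout` is a hypothetical helper that assumes you have already collected per-endpoint latency samples (in milliseconds).

```python
import statistics

def recommended_timeout(latencies_ms, multiplier=2.0):
    """Suggest a timeout: p99 of observed latencies times a safety multiplier."""
    # quantiles(n=100) returns the 1st..99th percentile cut points;
    # index 98 is the 99th percentile
    p99 = statistics.quantiles(latencies_ms, n=100)[98]
    return p99 * multiplier

# Example: simulated latency samples for one endpoint (50-89ms spread)
samples = [50 + (i % 40) for i in range(1000)]
print(recommended_timeout(samples))
```

Feed it a week of real samples per endpoint and re-run monthly, as the text suggests, rather than trusting a one-off measurement.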
Application-level timeouts (httpx, OkHttp, Axios) are just one layer. Your infrastructure has its own timeouts, and they must align. If your NGINX proxy timeout is 60s but your app timeout is 2s, the proxy holds the connection for 58 extra seconds after your app has given up.
NGINX:
# NGINX proxy timeouts - must be >= app timeout
proxy_connect_timeout 5s;    # TCP connection to upstream
proxy_send_timeout 10s;      # Sending request to upstream
proxy_read_timeout 15s;      # Reading response from upstream

# Client-side timeouts
client_header_timeout 10s;   # Reading client request headers
client_body_timeout 10s;     # Reading client request body
send_timeout 10s;            # Sending response to client
AWS Application Load Balancer:
# ALB idle timeout (default 60s - almost always too long)
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn $ALB_ARN \
  --attributes Key=idle_timeout.timeout_seconds,Value=30

# Target group health check timeouts
aws elbv2 modify-target-group \
  --target-group-arn $TG_ARN \
  --health-check-timeout-seconds 5 \
  --health-check-interval-seconds 10
Kubernetes:
# Readiness probe - K8s checks if pod can handle traffic
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3    # Must be < periodSeconds
  failureThreshold: 3  # 3 failures = stop traffic

# Liveness probe - K8s checks if pod is alive
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  timeoutSeconds: 5
  failureThreshold: 3  # 3 failures = kill & restart pod
The timeout stack (all must align):
- CDN/Edge ≥ Load Balancer ≥ Proxy ≥ App ≥ DB client
- Each layer should be slightly longer than the one it proxies to
- If any layer is shorter, it'll cut off requests the downstream is still processing
Timeout Interacts With Retry
If your timeout is 2s and you allow up to 3 attempts, the user can wait up to 6 seconds before backoff delays are even counted. Keep this in mind when setting values: the total budget is roughly timeout × attempts plus backoff.
The Timeout Cascade Problem
If Service A calls Service B, which calls Service C:
Each downstream service must have a shorter timeout than its caller. Otherwise, the caller times out while the downstream is still working.
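One common way to enforce "downstream shorter than caller" is to carry a single deadline through the call chain and derive each hop's timeout from the remaining budget. A minimal sketch; `remaining_budget` and the reserve value are illustrative, not a standard API:

```python
import time

def remaining_budget(deadline, reserve=0.1):
    """Time left until the deadline, minus a reserve for our own processing."""
    budget = deadline - time.monotonic() - reserve
    if budget <= 0:
        raise TimeoutError("Deadline already exhausted")
    return budget

# Service A starts with a 2s budget...
deadline = time.monotonic() + 2.0
# ...and each hop computes its timeout from what's left,
# so B's timeout is always shorter than A's, and C's shorter than B's.
timeout_for_b = remaining_budget(deadline)
print(round(timeout_for_b, 1))  # ~1.9
```

In gRPC this is built in as deadline propagation; over plain HTTP, teams often pass the deadline in a request header so every service computes against the same clock budget.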
Retries with Exponential Backoff
You call a friend and they don't answer. Do you call back immediately? Maybe once. Do you call 100 times in 10 seconds? That's harassment. Instead, you wait a minute, try again. Then wait five minutes. Then maybe leave a message. That's exponential backoff.
Many failures are transient: network blips, temporary overload, brief downtime during deployment. A simple retry often succeeds. But naive retries are dangerous — if a service is struggling, hammering it with immediate retries makes things worse.
With base delay of 1s and 5 retries: 1 + 2 + 4 + 8 + 16 = 31 seconds maximum wait. With jitter, the actual wait is randomly spread across this range. Compare to 5 instant retries that complete in <1s but hammer the recovering service with 5x the load.
import random
import time

def retry_with_backoff(func, max_retries=3, base_delay=1.0):
    """Retry with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            if attempt == max_retries - 1:
                raise  # Last attempt, give up
            # Exponential backoff: 1s, 2s, 4s, 8s...
            delay = base_delay * (2 ** attempt)
            # Add jitter to prevent thundering herd
            jitter = random.uniform(0, delay * 0.1)
            time.sleep(delay + jitter)
Imagine 1,000 requests fail at the same time. Without jitter, all 1,000 retry at exactly t+1s. Then all retry again at t+3s. Then t+7s.
This synchronized retry creates thundering herd — massive traffic spikes that overwhelm the recovering service.
Jitter spreads retries over time, smoothing the load. Instead of 1,000 requests at t+1s, you get ~100 requests spread across t+0.9s to t+1.1s.
Full jitter (recommended): delay = random.uniform(0, base_delay * 2**attempt)
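The full-jitter formula translates to a small helper. A sketch; the `cap` parameter is an addition beyond the formula above (production implementations usually bound the exponential growth so late attempts don't wait minutes):

```python
import random

def full_jitter_delay(attempt, base_delay=1.0, cap=30.0):
    """Full jitter: wait a uniformly random time below the exponential ceiling."""
    ceiling = min(cap, base_delay * (2 ** attempt))  # 1s, 2s, 4s, ... capped
    return random.uniform(0, ceiling)
```

Plugging this into the earlier `retry_with_backoff` in place of its `delay + jitter` calculation gives full jitter instead of the 10% jitter shown there.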
Each retry waits longer, giving the failing service more time to recover. After max retries, give up and fail gracefully.
Retry: Network errors, 5xx errors, timeouts, connection refused.
Don't retry: 4xx errors (client errors), authentication failures, validation errors. These won't succeed on retry.
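That decision is worth encoding once rather than scattering across call sites. A sketch; treating 429 (rate limited) as retryable-with-backoff is a common refinement beyond the lists above:

```python
def is_retryable(status_code=None, exc=None):
    """Decide whether a failed HTTP call is worth retrying."""
    if exc is not None:
        # Network-level failures are usually transient
        return isinstance(exc, (ConnectionError, TimeoutError))
    if status_code is not None:
        # 5xx: server-side, may recover. 429: back off and retry.
        if status_code >= 500 or status_code == 429:
            return True
        # Other 4xx: our request is wrong; retrying won't help
        return False
    return False
```

A retry loop then becomes: catch the error, ask `is_retryable`, and either back off or fail immediately.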
War Story: The Retry Storm That Kept the Service Down
A team had a notification service that pushed messages to a third-party SMS API. They added retries (good!) but forgot jitter (bad). When the SMS API had a brief 30-second outage, here's what happened:
The fix was one line: jitter = random.uniform(0, delay). With full jitter, those 500 retries spread across 0-1s, 0-2s, 0-4s instead of hitting at exactly 1s, 2s, 4s. The SMS API never sees a spike.
Lesson: Retries without jitter are a DDoS attack on your own dependencies. Always add randomness.
What Can Go Wrong With Retries
If your payment API times out, did the charge go through? If you retry, will the customer be charged twice? Never retry non-idempotent operations blindly. Use idempotency keys (like Stripe's Idempotency-Key header) to ensure retrying is safe. If an operation can't be made idempotent, don't retry it — use a fallback instead.
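The key detail with idempotency keys is that the key is generated once and reused across every retry attempt, so the provider can deduplicate. A sketch using a hypothetical `api.charge` client (real APIs such as Stripe accept a similar `Idempotency-Key` header):

```python
import uuid

def charge_with_idempotency(api, amount_cents):
    """Retry a payment safely by reusing one client-generated idempotency key."""
    key = str(uuid.uuid4())  # generated once, reused on every attempt
    for attempt in range(3):
        try:
            # Hypothetical client call; the header is the important part
            return api.charge(amount_cents, headers={"Idempotency-Key": key})
        except (ConnectionError, TimeoutError):
            if attempt == 2:
                raise
```

If the first attempt actually succeeded but the response was lost, the retry with the same key returns the original result instead of charging twice.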
- Retry amplification — If Service A retries 3x against Service B, and B retries 3x against Service C, a single user request can generate 3 × 3 = 9 requests to Service C. At scale, this multiplies load exponentially.
- Retrying permanent errors — A 400 Bad Request will fail every time you retry. Wasting time and resources on requests that will never succeed.
- Missing jitter — Without randomness in your backoff, all failed requests retry at the exact same moment, creating traffic spikes that re-crash the recovering service.
Java (Resilience4j):
RetryConfig config = RetryConfig.custom()
.maxAttempts(3)
.waitDuration(Duration.ofMillis(500))
.retryExceptions(IOException.class, TimeoutException.class)
.ignoreExceptions(BusinessException.class)
.build();
Retry retry = Retry.of("paymentRetry", config);
Node.js (p-retry):
import pRetry from 'p-retry';

const result = await pRetry(
  () => fetch('https://payment-api.com/charge'),
  { retries: 3, minTimeout: 1000, factor: 2 }
);
Before choosing retry settings, calculate your budget:
The formula: max_retries = floor((SLA - overhead) / (timeout + avg_backoff)) - 1
The common mistake: Teams set retries to 3-5 by default without calculating the budget. With 5 retries × 2s timeout plus backoff, the worst case is 31+ seconds, many times over any reasonable user-facing SLA.
For real-time operations (payment, search): 1-2 retries max. For background jobs (email, analytics): 5-10 retries with longer backoff is fine because there's no user waiting.
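The budget formula translates directly to code. A sketch; `overhead_s` is a stand-in for your own processing time:

```python
import math

def max_retries_for_sla(sla_s, timeout_s, avg_backoff_s, overhead_s=0.1):
    """How many retries fit inside the SLA, per the budget formula above."""
    budget = (sla_s - overhead_s) / (timeout_s + avg_backoff_s)
    return max(0, math.floor(budget) - 1)

# A 5s user-facing SLA with 1s timeouts and ~0.5s average backoff:
print(max_retries_for_sla(5.0, 1.0, 0.5))  # → 2
```

Note how quickly the answer drops: with 2s timeouts against the same 5s SLA, the budget leaves room for zero retries, which is exactly why real-time paths get 1-2 retries at most.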
Retry Interacts With Circuit Breaker
If each retry counts as a failure in the circuit breaker, two user requests with 3 retries each = 6 failures = circuit opens. We'll cover this interaction in When Patterns Collide.
Circuit Breakers
Your house has a circuit breaker box. When too much current flows through a circuit, the breaker trips — cutting power to that circuit instantly. You don't keep pushing more electricity through a failing wire and hope it works. The breaker protects the rest of the house.
In software, a circuit breaker does the same thing. It monitors failure rates and "trips" when too many failures occur. Once tripped, it fails fast — returning an error immediately without even attempting to call the service. This protects your system from wasting resources on requests that will almost certainly fail.
Retries handle transient failures (the kind that fix themselves). Circuit breakers handle persistent failures (the kind where the service is actually down). Without a circuit breaker, retries against a dead service just pile on more load, consuming threads, connections, and time.
Without a circuit breaker: 500 req/s × 3 retries × 2s timeout = 3,000 wasted request-seconds per second. Your retry logic is spending 3,000 thread-seconds every second waiting for a service that isn't coming back. With a circuit breaker that opens after 5 failures, you spend 5 × 2s = 10 seconds discovering the problem, then fail instantly (in ~1ms) for the next 30 seconds until recovery is tested.
CLOSED → OPEN: failure threshold reached
OPEN → HALF-OPEN: recovery timeout elapses
HALF-OPEN → CLOSED: test request succeeds; failure in HALF-OPEN → back to OPEN
The circuit breaker automatically recovers: after the timeout, it lets one request through to test if the service is back.
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected fast."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = 0

    def call(self, func):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise CircuitOpenError("Circuit is open")
        try:
            result = func()
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
The basic circuit breaker above uses a count-based threshold: "open after N consecutive failures." This works but has edge cases:
- If you get 4 failures, then 1 success, then 4 more failures, the counter resets. You never trip despite a 90% failure rate.
A sliding window circuit breaker is more robust. It tracks the failure rate over the last N calls:
class SlidingWindowBreaker:
    def __init__(self, window_size=10, failure_threshold=0.5):
        self.window = []  # Last N results (True=success, False=failure)
        self.window_size = window_size
        self.failure_threshold = failure_threshold  # 50% failure rate

    def should_trip(self):
        if len(self.window) < self.window_size:
            return False  # Not enough data yet
        failure_rate = self.window.count(False) / len(self.window)
        return failure_rate >= self.failure_threshold

    def record(self, success: bool):
        self.window.append(success)
        if len(self.window) > self.window_size:
            self.window.pop(0)  # Slide the window
When to use which:
- Count-based: Simple services, low traffic (under 10 req/s). Fewer edge cases matter at low volume.
- Sliding window: High traffic services where a small number of failures mixed with successes shouldn't trip the breaker. This is what Resilience4j and most production libraries use.
What Can Go Wrong With Circuit Breakers
- Threshold too low — Circuit opens after 2-3 failures. Normal transient errors (brief network blips) trip the breaker unnecessarily, cutting off a healthy service. Your users see "service unavailable" when the service is actually fine.
- Threshold too high — Circuit doesn't open until 50+ failures. By then, you've already exhausted thread pools and cascaded the failure. The breaker opens too late to help.
- Recovery timeout too short — Circuit enters HALF-OPEN after 5 seconds, sends a test request, and it fails because the service needs 60 seconds to recover. Circuit reopens. This cycle repeats, and the service never gets a chance to fully recover because you keep testing too early.
- Recovery timeout too long — Circuit stays OPEN for 5 minutes. The service recovered after 30 seconds, but your system doesn't know. Users get fallback responses for 4.5 minutes unnecessarily.
- HALF-OPEN lets too many through — Some implementations let multiple requests through in HALF-OPEN state. If the service is still struggling, these requests make it worse. Best practice: let exactly ONE request through to test recovery.
Java (Resilience4j):
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.failureRateThreshold(50) // Open after 50% failure rate
.waitDurationInOpenState(Duration.ofSeconds(30))
.slidingWindowSize(10) // Measure over last 10 calls
.permittedNumberOfCallsInHalfOpenState(3)
.build();
CircuitBreaker breaker = CircuitBreaker.of("payment", config);
Supplier<String> decorated = CircuitBreaker
.decorateSupplier(breaker, () -> paymentApi.charge(amount));
Node.js (opossum):
import CircuitBreaker from 'opossum';

const breaker = new CircuitBreaker(paymentApi.charge, {
  timeout: 3000,                 // Treat calls > 3s as failures
  errorThresholdPercentage: 50,
  resetTimeout: 30000,           // Try again after 30s
  volumeThreshold: 5             // Need at least 5 calls to trip
});

breaker.fallback(() => ({ status: 'pending', queued: true }));
breaker.on('open', () => logger.warn('Payment circuit OPEN'));
breaker.on('halfOpen', () => logger.info('Payment circuit testing...'));
breaker.on('close', () => logger.info('Payment circuit recovered'));
Go (sony/gobreaker):
cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
Name: "payment",
MaxRequests: 1, // Allow 1 request in half-open
Interval: 60 * time.Second, // Reset failure count after 60s
Timeout: 30 * time.Second, // Time in open before half-open
ReadyToTrip: func(counts gobreaker.Counts) bool {
return counts.ConsecutiveFailures > 5
},
})
You can implement circuit breakers at the infrastructure level without changing application code. This is especially useful for service meshes:
Envoy Proxy / Istio:
# Istio DestinationRule with circuit breaking
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100           # Bulkhead: max connections
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 10   # Queue limit
        http2MaxRequests: 100         # Max concurrent
        maxRequestsPerConnection: 10  # Connection reuse
    outlierDetection:
      consecutive5xxErrors: 5         # CB threshold
      interval: 30s                   # Evaluation window
      baseEjectionTime: 30s           # Recovery timeout
      maxEjectionPercent: 50          # Max % of hosts ejected
AWS App Mesh:
// Virtual node with circuit breaker
{
  "connectionPool": {
    "http": {
      "maxConnections": 100,
      "maxPendingRequests": 10
    }
  },
  "outlierDetection": {
    "maxServerErrors": 5,
    "maxEjectionPercent": 50,
    "interval": { "value": 30, "unit": "s" },
    "baseEjectionDuration": { "value": 30, "unit": "s" }
  }
}
When to use infrastructure vs application circuit breakers:
- Infrastructure (Envoy/Istio): Polyglot environment, many services, team can't modify every app. Circuit breaking is applied uniformly.
- Application (Resilience4j/opossum): More control over thresholds per-endpoint, custom fallback logic, more granular metrics.
- Both: Infrastructure for coarse protection, application for fine-grained business logic fallbacks.
Circuit Breaker Interacts With Retry
Each retry that fails counts as a separate failure in the circuit breaker. If your threshold is 5 and each request retries 3 times, it only takes 2 user requests to trip the circuit (2 × 3 = 6 failures). We'll cover this trap in detail in When Patterns Collide.
War Story: The Redis Singleton That Took Down Production
A production chatbot used Redis for session management. Classic singleton pattern — one Redis connection shared across the entire application:
class InfrastructureManager:
    _instance = None

    def __init__(self):
        self.redis = Redis(host='redis.internal', port=6379)
The Trigger: A routine network blip. Redis connection drops for 3 seconds. Should be a non-event.
The Cascade: The client started reconnecting on its own. Application code, seeing errors, tried to "help" by creating a new connection, which fought with the client's internal reconnection logic. Errors kept flowing upstream, and a 3-second blip became a full outage.
The Root Cause: Redis client manages reconnection internally. The singleton pattern meant only ONE connection existed. When the application code tried to "help" by creating a new connection, it conflicted with the client's own reconnection logic.
Redis clients go through specific states during disconnection:
- During 'reconnecting': DON'T create new connections. The client is handling it.
- During 'connecting': DON'T queue commands. They'll pile up.
- Only in 'ready': Safe to use normally.
The Fix: Respect the client's connection states and use fallbacks during recovery:
def execute_redis_command(self, command):
    if self.redis.status == 'ready':
        return command()
    elif self.redis.status in ['connecting', 'reconnecting']:
        # Don't interfere - let client handle reconnection
        return self.fallback_response()
    else:
        # Actually disconnected - now we can reconnect
        self.redis = Redis(host='redis.internal', port=6379)
        return command()
A circuit breaker around the Redis calls would have caught this at T=1s: after the first few failures, it would fail fast with a fallback instead of letting the error propagate to WhatsApp.
Bulkheads
The Titanic sank because water flooded from compartment to compartment — there were walls between compartments, but they didn't go all the way to the ceiling. Modern ships have bulkheads: fully sealed, watertight compartments. One compartment floods? The rest stay dry. The ship stays afloat.
In software, bulkheads isolate resources so one failing component can't consume all resources and take down everything else. Without bulkheads, a single slow dependency can starve every other feature of threads, connections, or memory.
Your server has 200 threads shared across 4 services. Payment API slows to 30s per request. Without bulkheads: all 200 threads get consumed waiting for payment, and every other feature dies. With bulkheads allocating 50 threads per service: only 50 threads are affected. Product catalog, search, and user profiles keep serving traffic on their remaining 150 threads as if nothing happened.
Bulkheads limit the blast radius. A slow dependency can only affect its allocated resources.
Types of Bulkheads
Separate thread pools for different dependencies. Payment gets 10 threads, catalog gets 20.
Dedicated DB/Redis connections per feature. See DB Connections post for deep treatment.
Limit concurrent requests to a dependency. Simpler than thread pools, no resource allocation.
Separate processes or containers for critical services. Memory/CPU isolation at OS level.
If you run Kubernetes, you already have a powerful bulkhead mechanism: resource limits. By setting CPU and memory limits per pod/container, one misbehaving service can't starve the rest of the cluster.
# Payment service: critical, gets generous resources
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: payment
        resources:
          requests:
            cpu: "500m"      # Guaranteed 0.5 CPU
            memory: "512Mi"  # Guaranteed 512MB
          limits:
            cpu: "1000m"     # Max 1 CPU (can't steal more)
            memory: "1Gi"    # Max 1GB (OOMKilled if exceeded)
---
# Analytics service: non-critical, gets fewer resources
apiVersion: apps/v1
kind: Deployment
metadata:
  name: analytics-service
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: analytics
        resources:
          requests:
            cpu: "100m"      # Modest allocation
            memory: "128Mi"
          limits:
            cpu: "200m"      # Strict limit
            memory: "256Mi"  # If it leaks memory, K8s kills it
Why this is a bulkhead: If the analytics service has a memory leak or CPU spike, Kubernetes limits the damage to its allocated resources. The payment service (with its own allocation) is completely unaffected. The OOMKiller handles the cleanup.
Combined with PodDisruptionBudgets: You can also ensure that even during node failures or updates, critical services always have minimum replicas running:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-pdb
spec:
  minAvailable: 2  # Always keep 2 payment pods running
  selector:
    matchLabels:
      app: payment-service
import asyncio

# Semaphore as a simple bulkhead
payment_semaphore = asyncio.Semaphore(10)    # Max 10 concurrent payment calls
inventory_semaphore = asyncio.Semaphore(20)  # Max 20 concurrent inventory calls

async def process_payment(order):
    async with payment_semaphore:
        # Only 10 payments can run concurrently
        return await payment_api.charge(order)

async def check_inventory(items):
    async with inventory_semaphore:
        # Inventory checks don't compete with payments
        return await inventory_api.check(items)
What Can Go Wrong With Bulkheads
- Pools too small — You give payment only 5 threads, but normal traffic needs 15. Now you're throttling healthy traffic during peak load. Size bulkheads based on actual traffic patterns, not guesses.
- Pools too large — You give payment 150 out of 200 threads "because it's important." Now a payment slowdown still consumes 75% of your capacity. The point of a bulkhead is to limit impact.
- No overflow strategy — When the bulkhead is full (all threads busy), what happens to new requests? They need to fail fast with a clear error, not queue up indefinitely.
- Static allocation — Traffic patterns change. Your bulkhead sizes should be based on monitoring, not set-and-forget. Review quarterly at minimum.
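The "no overflow strategy" pitfall deserves its own sketch: when a bulkhead is full, reject immediately rather than queue. A minimal asyncio version (`RejectingBulkhead` is illustrative, assuming Python 3.10+ for `Semaphore.locked()`):

```python
import asyncio

class RejectingBulkhead:
    """Semaphore bulkhead that fails fast when full instead of queueing."""

    def __init__(self, max_concurrent):
        self._sem = asyncio.Semaphore(max_concurrent)

    async def run(self, coro_fn):
        if self._sem.locked():  # every slot taken: reject immediately
            raise RuntimeError("Bulkhead full - request rejected")
        async with self._sem:
            return await coro_fn()
```

Contrast with a plain `async with semaphore:` block, which silently queues callers when full; the queue is exactly the unbounded wait the bulkhead was supposed to prevent.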
The bulkhead principle applies to any shared resource — not just threads. Here's a common Redis anti-pattern and the fix:
# BAD: One shared Redis client for everything
redis = Redis(connection_pool=ConnectionPool(max_connections=50))

def handle_webhook(data):
    # Uses shared redis
    redis.get(f"session:{data.user}")

def cache_response(key, val):
    # Uses shared redis
    redis.setex(key, 3600, val)

def log_analytics(event):
    # Uses shared redis
    redis.lpush("analytics", event)

# If analytics logging blocks, sessions AND cache break too
# GOOD: Separate pools by criticality
pool_sessions = ConnectionPool(max_connections=30)   # Critical
pool_cache = ConnectionPool(max_connections=15)      # Can fail
pool_analytics = ConnectionPool(max_connections=5)   # Best-effort

redis_sessions = Redis(connection_pool=pool_sessions)
redis_cache = Redis(connection_pool=pool_cache)
redis_analytics = Redis(connection_pool=pool_analytics)

def handle_webhook(data):
    return redis_sessions.get(f"session:{data.user}")  # Protected

def cache_response(key, val):
    return redis_cache.setex(key, 3600, val)  # Isolated

def log_analytics(event):
    return redis_analytics.lpush("analytics", event)  # Can't hurt others
Why this works: Analytics Redis slow? Only 5 connections blocked. Sessions (30 connections) completely unaffected. Each pool has its own failure domain.
When to use: Multiple features share the same Redis/database, features have different criticality levels, some operations are "nice to have" vs "must have."
When NOT to use: Single-purpose Redis (only sessions, only cache), low traffic where pool exhaustion is unlikely.
Java (Resilience4j):
```java
// Thread pool bulkhead: dedicated thread pool per dependency
ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
    .maxThreadPoolSize(10)
    .coreThreadPoolSize(5)
    .queueCapacity(20)
    .build();
ThreadPoolBulkhead bulkhead = ThreadPoolBulkhead.of("payment", config);

// Semaphore bulkhead: simpler, limits concurrency
BulkheadConfig semConfig = BulkheadConfig.custom()
    .maxConcurrentCalls(10)
    .maxWaitDuration(Duration.ofMillis(500))
    .build();
```
Node.js (custom semaphore):
```javascript
class Bulkhead {
  constructor(maxConcurrent) {
    this.max = maxConcurrent;
    this.current = 0;
  }

  async execute(fn) {
    if (this.current >= this.max) {
      throw new Error('Bulkhead full - request rejected');
    }
    this.current++;
    try {
      return await fn();
    } finally {
      this.current--;
    }
  }
}

const paymentBulkhead = new Bulkhead(10);
const searchBulkhead = new Bulkhead(20);
```
Fallbacks
When your flight gets cancelled, the airline doesn't tell you "no flight, go home." They rebook you on the next available flight, offer a hotel voucher, or at minimum give you a refund. They have a hierarchy of backup plans. That's a fallback strategy.
When all other defenses fail — the timeout fires, retries are exhausted, circuit breaker is open — the fallback is your last line of defense. It provides degraded but functional behavior instead of a crash or error page. The user experience goes from "this is broken" to "this feature is temporarily limited."
A page with "Recommendations unavailable" is infinitely better than a 500 error. A checkout with "Express shipping unavailable, standard only" is better than no checkout at all. Partial service beats no service.
```python
async def get_product_recommendations(user_id):
    try:
        # Try the ML recommendation service
        return await recommendation_service.get(user_id)
    except (TimeoutError, CircuitOpenError):
        # Fallback 1: Try cached recommendations
        cached = await cache.get(f"recs:{user_id}")
        if cached:
            return cached
        # Fallback 2: Return popular products
        return await get_popular_products()
    except Exception:
        # Fallback 3: Empty recommendations (feature disabled)
        return []
```
What Can Go Wrong With Fallbacks
- Stale cached data — Your fallback returns prices from 3 hours ago. A customer sees $99 but gets charged $129. Fallback data needs a staleness limit — better to show "price unavailable" than a wrong price.
- Untested fallbacks — The most common failure. You write a fallback path, never test it, and when it's finally triggered in production, it throws its own exception. If you haven't tested your fallback, it doesn't work.
- Fallback cascades — Fallback A calls Service B, which is also down. Now your fallback needs a fallback. Keep fallback logic simple and local — don't make external calls in fallback paths.
- Silent degradation — Your search fallback returns empty results. Users think there are no products. Always communicate when you're in degraded mode: "Search is temporarily limited, showing popular items instead."
Choosing the right fallback depends on the feature:
- Recommendations engine down? → Cached recommendations → Popular products → Empty list. Users barely notice.
- Search down? → Show browse categories. Users can still navigate, just differently.
- Payment processor down? → Don't fall back silently. Show "Checkout temporarily unavailable, try again in a few minutes." Some features shouldn't degrade — they should communicate clearly.
- User profile down? → Show name from JWT token + "Some features unavailable." Basic info from the auth token, no external call needed.
- Analytics/logging down? → Drop silently. Users shouldn't know or care. Log locally and batch-send later.
The rule: Critical business operations (payment, orders) should fail loudly with clear messaging. Non-critical features (recommendations, personalization, analytics) should degrade silently or with minimal user impact.
Java (Resilience4j):
```java
// Decorate supplier with circuit breaker + fallback
Supplier<String> decorated = CircuitBreaker
    .decorateSupplier(breaker, () -> recommendationService.get(userId));

String result = Try.ofSupplier(decorated)
    .recover(CallNotPermittedException.class, e -> cache.get("recs:" + userId))  // CB open: use cache
    .recover(TimeoutException.class, e -> getPopularProducts())                  // Timeout: popular items
    .recover(Exception.class, e -> Collections.emptyList())                      // Anything else: empty
    .get();
```
Node.js (opossum):
```javascript
const breaker = new CircuitBreaker(getRecommendations, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000
});

// Chain of fallbacks
breaker.fallback(async (userId) => {
  // Try cache first
  const cached = await redis.get(`recs:${userId}`);
  if (cached) return JSON.parse(cached);
  // Then popular products
  return getPopularProducts();
});

const result = await breaker.fire(userId);
```
Go (custom with generics):
```go
func withFallback[T any](
    primary func() (T, error),
    fallbacks ...func() (T, error),
) (T, error) {
    result, err := primary()
    if err == nil {
        return result, nil
    }
    for _, fb := range fallbacks {
        result, err = fb()
        if err == nil {
            return result, nil
        }
    }
    return result, fmt.Errorf("all fallbacks exhausted: %w", err)
}

// Usage
recs, err := withFallback(
    func() ([]Product, error) { return recsService.Get(userID) },
    func() ([]Product, error) { return cache.GetRecs(userID) },
    func() ([]Product, error) { return GetPopular(), nil },
)
```
A product page depends on 5 services. Here's how to degrade each independently so the page always loads:
```python
async def render_product_page(product_id: str):
    # Fetch all data concurrently with independent fallbacks
    product, reviews, recs, inventory, pricing = await asyncio.gather(
        fetch_with_fallback(
            primary=lambda: product_service.get(product_id),
            fallback=lambda: cache.get(f"product:{product_id}"),
            criticality="critical"  # No product = no page
        ),
        fetch_with_fallback(
            primary=lambda: review_service.get(product_id),
            fallback=lambda: {"reviews": [], "message": "Reviews loading..."},
            criticality="optional"
        ),
        fetch_with_fallback(
            primary=lambda: recommendation_service.get(product_id),
            fallback=lambda: get_popular_in_category(product_id),
            criticality="optional"
        ),
        fetch_with_fallback(
            primary=lambda: inventory_service.check(product_id),
            fallback=lambda: {"available": True, "estimated": True},
            criticality="important"  # Show "likely available"
        ),
        fetch_with_fallback(
            primary=lambda: pricing_service.get(product_id),
            fallback=lambda: cache.get(f"price:{product_id}"),
            criticality="critical"  # Wrong price = liability
        ),
        return_exceptions=True
    )

    # Critical services failed = show error page
    if isinstance(product, Exception):
        return error_page("Product not found")
    if isinstance(pricing, Exception):
        return error_page("Price unavailable, try again")

    # Optional services failed = show page with gaps
    return render_template("product.html",
        product=product,
        reviews=reviews if not isinstance(reviews, Exception) else [],
        recs=recs if not isinstance(recs, Exception) else [],
        inventory=inventory,
        pricing=pricing,
    )
```
Key insight: Use asyncio.gather(return_exceptions=True) so one failing service doesn't cancel the others. Then check each result independently. This is the bulkhead principle applied at the application level.
The Complete Stack
Real resilience comes from layering these patterns. Each layer catches what the previous layer missed. Think of it like a building's safety systems:
Request flows through each layer. If one catches the failure, the layers below don't need to activate.
```python
class ResilientClient:
    def __init__(self):
        self.circuit = CircuitBreaker(failure_threshold=5)
        self.semaphore = asyncio.Semaphore(10)  # Bulkhead
        self.cache = {}

    async def call(self, request):
        # Layer 1: Bulkhead
        async with self.semaphore:
            # Layer 2: Circuit Breaker
            if self.circuit.is_open:
                return self._fallback(request)

            # Layer 3: Retry with Backoff
            for attempt in range(3):
                try:
                    # Layer 4: Timeout
                    result = await asyncio.wait_for(
                        self._do_request(request),
                        timeout=2.0
                    )
                    self.circuit.record_success()
                    self._update_cache(request, result)
                    return result
                except asyncio.TimeoutError:
                    self.circuit.record_failure()
                    if attempt < 2:
                        await asyncio.sleep(1 * (2 ** attempt))

            # Layer 5: Fallback
            return self._fallback(request)

    def _fallback(self, request):
        return self.cache.get(request.key, request.default)
```
When Patterns Collide
Each pattern works well in isolation. But when you combine them (which you must), they can interact in ways that create new failure modes. These interactions produce the most insidious bugs, because each pattern appears to be working correctly on its own.
Red labels show where patterns interact dangerously. Each trap is explained below.
Trap 1: Retry + Circuit Breaker
Your circuit breaker opens after 5 failures. Your retry policy retries 3 times per request. A user makes 2 requests to a failing service:
The fix: Don't count retries as separate failures in the circuit breaker. Count each original request as one failure, regardless of how many retries it took. Or place the circuit breaker outside the retry logic — if the circuit is open, skip retries entirely.
```python
# CORRECT: Circuit breaker wraps the entire retry block
async def call_payment(request):
    if circuit_breaker.is_open:
        return fallback(request)  # Skip retries entirely

    for attempt in range(3):
        try:
            result = await payment_api.charge(request)
            circuit_breaker.record_success()
            return result
        except TransientError:
            if attempt == 2:  # Last attempt
                circuit_breaker.record_failure()  # Count once, not 3x
                raise
            await asyncio.sleep(1 * (2 ** attempt))
```
Trap 2: Timeout + Retry Budget
Your user-facing SLA is 5 seconds. Your timeout per request is 2 seconds. You retry 3 times. Here's what actually happens:
Your total retry budget must fit within your user-facing timeout: (per_request_timeout + backoff) × max_retries ≤ SLA_timeout. If SLA is 5s and per-request timeout is 2s with 1s base backoff, you can afford 1 retry (2s + 1s + 2s = 5s). Not 3. Always account for backoff delays in the budget.
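A quick way to keep this budget honest is to compute the worst case in code. This helper is a hypothetical sketch, just illustrating the arithmetic above: it sums per-attempt timeouts plus the exponential backoff delays between attempts.

```python
def worst_case_latency(timeout_s: float, base_backoff_s: float, retries: int) -> float:
    """Worst case: every attempt times out.

    attempts = 1 original + `retries`; the backoff before retry i is base * 2**i.
    """
    total = timeout_s  # the original attempt
    for i in range(retries):
        total += base_backoff_s * (2 ** i)  # backoff before retry i+1
        total += timeout_s                  # retry i+1 itself
    return total

# The example from the text: 2s timeout, 1s base backoff, 1 retry -> exactly the 5s SLA
assert worst_case_latency(2.0, 1.0, retries=1) == 5.0

# Three retries blow the budget: 2 + (1+2) + (2+2) + (4+2) = 15s
assert worst_case_latency(2.0, 1.0, retries=3) == 15.0
```

Run this as a unit test against your actual config values, and the budget can never silently drift past the SLA when someone bumps `max_retries`.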
Trap 3: Bulkhead + Connection Pool Sizing
Your bulkhead allows 10 concurrent requests to the payment service. But your connection pool to the payment service has 5 connections. What happens?
The fix: Size connection pools to at least match bulkhead limits. If your bulkhead allows N concurrent requests, your connection pool needs at least N connections. Better yet: set the pool timeout shorter than the bulkhead timeout, so pool exhaustion triggers a clean rejection.
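This rule is easy to enforce with a startup-time sanity check. A sketch — the function name and shape are assumptions, not a real library API:

```python
def check_pool_sizing(bulkhead_limit: int, pool_size: int,
                      pool_timeout_s: float, bulkhead_timeout_s: float):
    """Return a list of configuration problems; empty means sizing is consistent."""
    problems = []
    if pool_size < bulkhead_limit:
        problems.append(
            f"pool ({pool_size}) smaller than bulkhead ({bulkhead_limit}): "
            "requests admitted by the bulkhead will stall waiting for a connection"
        )
    if pool_timeout_s >= bulkhead_timeout_s:
        problems.append(
            "pool timeout should be shorter than bulkhead timeout "
            "so pool exhaustion fails cleanly instead of timing out upstream"
        )
    return problems
```

Call it once at service startup and refuse to boot (or at least log loudly) if it returns anything — misconfigurations like this are much cheaper to catch at deploy time than during an incident.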
Trap 4: Circuit Breaker + Health Checks
Your circuit breaker opens because the payment service is down. Now watch the cascade:
The fix: Health check endpoints should only check the service's own health (can it respond?), not the health of its dependencies. Dependency health is the circuit breaker's job, not the health check's job.
@app.get("/health") async def health_check(): # GOOD: Only check OWN health return { "status": "healthy", "uptime": get_uptime(), "memory_mb": get_memory_usage(), } @app.get("/health/dependencies") async def deep_health_check(): # Separate endpoint for monitoring dashboards (NOT for ALB) return { "payment": circuit_breaker_payment.state.value, "inventory": circuit_breaker_inventory.state.value, "recommendations": circuit_breaker_recs.state.value, }
Monitoring Your Defenses
You've added timeouts, retries, circuit breakers, bulkheads, and fallbacks. But how do you know they're actually working? Without monitoring, your defenses are invisible — you won't know if they're firing too often, not firing at all, or misconfigured.
Failure Symptom Taxonomy
When something goes wrong, start with the symptom and work backward to the pattern that should have caught it:
| Symptom You See | Root Cause | Pattern That Prevents It |
|---|---|---|
| Threads exhausted, all requests queued | Dependency slow (not down), no timeout set | Timeout — bound the wait to 2-5s |
| Requests fail once then succeed on refresh | Transient network errors, no retry logic | Retry + Backoff — auto-recover from blips |
| Sustained 5xx errors for 5+ minutes | Dead dependency, wasting resources on every request | Circuit Breaker — fail fast, stop hammering |
| Payment down takes product catalog down | Shared thread pool, one dependency drains all resources | Bulkhead — isolate thread/connection pools |
| Users see blank pages or 500 errors | All-or-nothing behavior, no degraded mode | Fallback — serve cached/default data |
| Retry storm makes outage worse | 1000 clients retry simultaneously | Jitter — randomize retry timing |
| Circuit breaker opens from 2 requests | Retries counted as separate failures | Correct ordering — CB wraps retry block |
| User waits 30 seconds for error | Retry budget exceeds SLA | Budget math — timeout × retries ≤ SLA |
What to Monitor Per Pattern
| Pattern | Key Metrics | Alert When... |
|---|---|---|
| Timeouts | Timeout rate (%), p99 latency, timeout count by endpoint | Timeout rate > 5% for any dependency |
| Retries | Retry rate, retries per request, success-after-retry rate | Retry rate > 20% (something is broken, not transient) |
| Circuit Breakers | State (OPEN/CLOSED/HALF-OPEN), trips per hour, time spent OPEN | Circuit opens more than 3 times per hour |
| Bulkheads | Pool utilization (%), rejected requests, queue depth | Pool utilization > 80% sustained |
| Fallbacks | Fallback invocation rate, fallback type used, duration in fallback | Any fallback active > 5 minutes |
Timeout firing:
WARN [payment-client] Request to /charge timed out after 2000ms
request_id=abc-123 attempt=1/3 endpoint=payment-api.com
// Look for: sustained timeout warnings = dependency issue
Circuit breaker opening:
ERROR [circuit-breaker] Circuit OPENED for payment-service
failure_count=5 threshold=5 window=60s
last_error="Connection timed out" recovery_timeout=30s
// Look for: circuit state changes = investigate immediately
Bulkhead rejecting:
WARN [bulkhead] Payment pool exhausted, rejecting request
pool_size=10 active=10 queued=0 rejected=1
// Look for: rejections = pool too small or dependency too slow
Fallback activating:
INFO [recommendations] Serving cached recommendations (fallback)
reason=circuit_open cache_age=3600s user_id=usr-456
// Look for: sustained fallback usage = service needs attention
Watch for patterns that fire but nobody notices. If your circuit breaker has been opening and closing 5 times a day for the past month, you have a dependency that's unreliable enough to trip breakers but "working enough" that nobody investigates. These slow-burning issues become outages.
What a Real Outage Looks Like (With Resilience Patterns)
Here's the same Tuesday 2:47 PM scenario from the opening — but this time, with all five patterns in place. Watch how the system self-heals:
If you use Prometheus (or any metrics system), here are the exact metrics to expose from your resilience layer:
```python
from prometheus_client import Counter, Histogram, Gauge

# Timeout metrics
request_duration = Histogram(
    'http_client_request_duration_seconds',
    'Request duration in seconds',
    ['service', 'endpoint', 'status'],
    buckets=[.1, .25, .5, 1, 2, 5, 10]
)
timeout_total = Counter(
    'http_client_timeouts_total',
    'Total number of timeouts',
    ['service', 'endpoint']
)

# Retry metrics
retry_total = Counter(
    'http_client_retries_total',
    'Total retry attempts',
    ['service', 'attempt', 'result']
)

# Circuit breaker metrics
circuit_state = Gauge(
    'circuit_breaker_state',
    'Circuit breaker state (0=closed, 1=open, 2=half-open)',
    ['service']
)
circuit_trips = Counter(
    'circuit_breaker_trips_total',
    'Times circuit breaker tripped open',
    ['service']
)

# Bulkhead metrics
bulkhead_active = Gauge(
    'bulkhead_active_count',
    'Currently active requests in bulkhead',
    ['service']
)
bulkhead_rejected = Counter(
    'bulkhead_rejected_total',
    'Requests rejected by bulkhead',
    ['service']
)

# Fallback metrics
fallback_invocations = Counter(
    'fallback_invocations_total',
    'Fallback activations',
    ['service', 'fallback_type']
)
```
Key Grafana alerts to set up:
- Timeout rate > 5% for 5 minutes — Dependency is degraded, investigate immediately
- Circuit breaker opens — Instant alert, this means a dependency is down
- Retry success rate < 50% — Retries aren't helping, the problem isn't transient
- Bulkhead utilization > 80% for 10 minutes — Either traffic spike or dependency slowdown
- Fallback active > 5 minutes — Service isn't recovering, needs human investigation
Rate Limiting
So far we've talked about protecting your system from failing dependencies. Rate limiting protects your system from too much incoming traffic — a related but different concern.
Rate limiting caps how many requests you'll accept, preventing overload from traffic spikes, attacks, or misbehaving clients.
Token Bucket Algorithm
The most common approach. Imagine a bucket that fills with tokens at a steady rate. Each request consumes a token. No tokens? Request denied (HTTP 429).
- Bucket size determines burst capacity (e.g., 100 tokens = 100 rapid requests)
- Refill rate determines sustained throughput (e.g., 10 tokens/second)
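A minimal token bucket is only a few lines. This sketch (hypothetical, not a specific library's implementation) shows both knobs — capacity for burst, refill rate for sustained throughput:

```python
import time

class TokenBucket:
    """Minimal token bucket: `capacity` = burst size, `refill_rate` = tokens/second."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity          # start full: full burst available
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at bucket capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller responds 429 with Retry-After
```

`TokenBucket(100, 10)` gives exactly the behavior in the bullets above: a burst of 100 rapid requests, then a sustained 10 requests/second.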
Where to Apply Rate Limits
- Edge/CDN: DDoS protection, geographic limits
- API Gateway: Per-user limits, API key quotas
- Application: Business logic limits, per-endpoint
- Internal Services: Prevent one service from overwhelming another
The Response
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1706540400
```
Best Practices
- Always return Retry-After header — Tell clients when to retry
- Different limits for different tiers — Free vs paid users
- Don't rate limit health checks — Your monitoring needs access
- Log rate limited requests — Detect abuse patterns vs legitimate spikes
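The practices above can be sketched together. This toy fixed-window limiter is purely illustrative — the tier names and function shape are assumptions, and a real deployment would use a shared store like Redis rather than process-local state — but it returns the headers from the response example and exempts health checks:

```python
import time

TIER_LIMITS = {"free": 60, "paid": 600}   # requests per minute (hypothetical tiers)
EXEMPT_PATHS = {"/health"}                # never rate limit monitoring
_windows = {}                             # (api_key, minute) -> request count

def check_rate_limit(api_key, tier, path, now=None):
    """Return (allowed, headers). Fixed-window sketch, not production code."""
    if path in EXEMPT_PATHS:
        return True, {}
    now = time.time() if now is None else now
    minute = int(now // 60)
    count = _windows.get((api_key, minute), 0) + 1
    _windows[(api_key, minute)] = count
    limit = TIER_LIMITS[tier]
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, limit - count)),
        "X-RateLimit-Reset": str((minute + 1) * 60),
    }
    if count > limit:
        # Tell clients exactly when the window resets
        headers["Retry-After"] = str(int((minute + 1) * 60 - now) or 1)
        return False, headers  # respond 429 Too Many Requests
    return True, headers
```

When `allowed` is false, log the rejection with the API key — that's how you tell an abusive client from a legitimate traffic spike.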
Real-World: The 2017 Amazon S3 Outage
On February 28, 2017, a typo during routine maintenance took down Amazon S3 in us-east-1 for nearly 4 hours. The blast radius was enormous: Slack, Trello, Quora, IFTTT, and thousands of other services went down. The internet felt like it was broken.
Why "slow" is worse than "down": If S3 had returned errors immediately, circuit breakers would have tripped and services would have served fallbacks. But S3 was responding slowly — 10-30 second response times. Without proper timeouts, threads waited. And waited. And consumed all available resources.
Every pattern from this post would have helped:
- Timeouts (2s): Would have freed threads after 2s instead of holding them for 30s+. A 15x improvement in resource utilization.
- Circuit breakers: After 5 timeouts, would have failed fast (1ms) instead of waiting 2s per request. Orders of magnitude less resource waste.
- Bulkheads: Would have limited S3 thread consumption to its own pool. Product catalog, user profiles, and checkout — anything not needing S3 — would have kept working.
- Fallbacks: Would have served locally cached content. Users see yesterday's data instead of an error page.
Services that survived (like Netflix) had all of these layered together. They showed degraded experience instead of complete failure. Netflix subscribers watched movies uninterrupted while much of the internet was down.
Testing Your Defenses
Here's the uncomfortable truth: if you haven't tested your failure handling, it doesn't work. Every pattern in this post can have bugs — a timeout that's never applied, a circuit breaker with the wrong threshold, a fallback that throws its own exception. You won't discover these bugs during normal operation. You'll discover them during an outage, when it's too late.
The most common failure handling bug: you write a beautiful fallback path, deploy it, and never trigger it. Six months later the cache layer it depends on got refactored. When the fallback finally fires in production, it throws a NullPointerException. Your safety net had a hole in it.
Unit Testing Each Pattern
Every pattern can be tested in isolation with dependency injection and controlled failures:
```python
import asyncio

import pytest  # async tests assume the pytest-asyncio plugin

@pytest.mark.asyncio
async def test_timeout_fires_on_slow_dependency():
    """Verify timeout actually triggers when dependency is slow."""
    async def slow_service():
        await asyncio.sleep(5.0)  # Simulate slow response
        return "should never reach here"

    with pytest.raises(asyncio.TimeoutError):
        await asyncio.wait_for(slow_service(), timeout=2.0)

def test_retry_succeeds_on_transient_failure():
    """Verify retry recovers from transient errors."""
    call_count = 0

    def flaky_service():
        nonlocal call_count
        call_count += 1
        if call_count < 3:
            raise ConnectionError("transient failure")
        return "success"

    result = retry_with_backoff(flaky_service, max_retries=3)
    assert result == "success"
    assert call_count == 3  # Failed twice, succeeded on third

def test_circuit_breaker_opens_after_threshold():
    """Verify circuit opens and fails fast."""
    cb = CircuitBreaker(failure_threshold=3, recovery_timeout=30)

    # Trigger failures to open the circuit
    for _ in range(3):
        with pytest.raises(Exception):
            cb.call(lambda: (_ for _ in ()).throw(Exception("fail")))

    # Circuit should be OPEN now - next call fails fast
    with pytest.raises(CircuitOpenError):
        cb.call(lambda: "this should never execute")

@pytest.mark.asyncio
async def test_fallback_returns_cached_data():
    """Verify fallback serves cached data when primary fails."""
    client = ResilientClient()
    client.cache["product:123"] = {"name": "Widget", "price": 29.99}

    # Force circuit open to trigger fallback path
    client.circuit.state = CircuitState.OPEN

    result = await client.call(Request(key="product:123"))
    assert result["name"] == "Widget"  # Got cached data, not an error
```
Timeouts:
- Timeout actually fires (not silently ignored by the client)
- Resources are released after timeout (no thread/connection leak)
- The right exception type is thrown (so retry logic catches it)
Retries:
- Retries happen the right number of times
- Backoff delays increase exponentially
- Non-retryable errors (4xx) are NOT retried
- Final failure propagates after max retries
Circuit Breakers:
- Opens after exactly N failures
- Fails fast when OPEN (no actual call made)
- Transitions to HALF-OPEN after recovery timeout
- Closes after successful test request in HALF-OPEN
- Re-opens if test request fails in HALF-OPEN
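The HALF-OPEN items in this checklist are easiest to drive with an injectable clock, so tests can jump time instead of sleeping. Here's a sketch against a deliberately tiny, hypothetical breaker (not the classes used earlier in the post):

```python
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    """Tiny breaker, just enough to exercise the HALF-OPEN transitions.

    `clock` is injectable so tests can fast-forward past the recovery timeout.
    """

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"  # let the probe request through
            else:
                raise CircuitOpenError("failing fast, circuit open")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        # Any success (normal, or the half-open probe) closes the circuit
        self.state = "closed"
        self.failures = 0
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed half-open probe re-opens immediately; otherwise count to threshold
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

A test then drives the full CLOSED → OPEN → HALF-OPEN → CLOSED cycle by failing twice, asserting fail-fast behavior, advancing the fake clock past `recovery_timeout`, and asserting the probe closes the circuit.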
Bulkheads:
- Rejects when pool is full
- Doesn't block requests to other pools
- Resources are released even on failure
Fallbacks:
- Returns correct cached/default data
- Handles missing cache gracefully
- Doesn't make external calls (no cascading fallback failures)
Chaos Engineering: Testing in Production
Unit tests verify your code logic. Chaos engineering verifies your system behavior. The idea is simple: intentionally inject failures and observe what happens.
Netflix runs chaos experiments in production because "the best way to verify that something works is to break it." Their Chaos Monkey randomly kills instances. Chaos Kong simulates entire region failures. If their systems survive these attacks during business hours, they'll survive real failures at 3am.
You don't need Netflix's scale to practice chaos engineering. Start small:
```python
import asyncio
import logging
import os
import random

logger = logging.getLogger(__name__)

class FaultInjector:
    """Inject failures for testing resilience patterns."""

    def __init__(self):
        self.enabled = os.getenv("CHAOS_ENABLED", "false") == "true"
        self.failure_rate = float(os.getenv("CHAOS_FAILURE_RATE", "0.0"))
        self.latency_ms = int(os.getenv("CHAOS_LATENCY_MS", "0"))

    def maybe_fail(self, service_name: str):
        if not self.enabled:
            return
        if random.random() < self.failure_rate:
            logger.warning(f"CHAOS: Injecting failure for {service_name}")
            raise ConnectionError(f"Chaos: {service_name} failure injected")

    async def maybe_delay(self, service_name: str):
        if not self.enabled or self.latency_ms == 0:
            return
        delay = random.uniform(0, self.latency_ms / 1000.0)  # seconds
        logger.warning(f"CHAOS: Injecting {delay * 1000:.0f}ms delay for {service_name}")
        await asyncio.sleep(delay)

# Usage in your service client:
chaos = FaultInjector()

async def call_payment_service(order):
    chaos.maybe_fail("payment")
    await chaos.maybe_delay("payment")
    return await payment_api.charge(order)

# Enable in staging with env vars:
# CHAOS_ENABLED=true CHAOS_FAILURE_RATE=0.1 CHAOS_LATENCY_MS=3000
```
Step 1: Pick one non-critical dependency (recommendation engine, analytics, notifications). Never start with payment or auth.
Step 2: Define your hypothesis. "If the recommendation service is down, the product page should still load in under 2 seconds, showing popular products instead of personalized ones."
Step 3: Inject the failure in staging. Use the fault injector above, or simply block the dependency's DNS/port with iptables:
```shell
# Block traffic to recommendation service (staging only!)
iptables -A OUTPUT -d recommendation-service.internal -j DROP

# Simulate slow responses (add 2s latency)
tc qdisc add dev eth0 root netem delay 2000ms

# Restore after test
iptables -D OUTPUT -d recommendation-service.internal -j DROP
tc qdisc del dev eth0 root netem
```
Step 4: Observe. Does the page still load? How long? What do users see? Check your monitoring dashboards: did the circuit breaker open? Did fallbacks fire? Did latency stay within SLA?
Step 5: Fix what broke. Then repeat with a different dependency.
Graduation path:
- Level 1: Kill a non-critical dependency in staging
- Level 2: Kill a critical dependency in staging
- Level 3: Add random latency to all dependencies in staging
- Level 4: Kill a non-critical dependency in production (during business hours, with the team watching)
If you outgrow the DIY approach, these tools automate chaos experiments:
| Tool | Best For | Complexity |
|---|---|---|
| Chaos Monkey | Random instance termination (AWS) | Low |
| Litmus | Kubernetes-native chaos (pod/node/network) | Medium |
| Gremlin | Enterprise chaos platform (SaaS) | Low (managed) |
| Toxiproxy | Network-level fault injection (local/CI) | Low |
| tc / iptables | Linux kernel-level network manipulation | High (manual) |
Start with Toxiproxy for local testing and CI pipelines. It sits between your service and dependencies, letting you inject latency, connection failures, bandwidth limits, and timeouts at the TCP level:
```shell
# Create a proxy for your payment dependency
toxiproxy-cli create payment -l localhost:8474 -u payment-api:443

# Add 2 second latency
toxiproxy-cli toxic add payment -t latency -a latency=2000

# Simulate connection reset (service crash)
toxiproxy-cli toxic add payment -t reset_peer -a timeout=500
```
Failure Handling Cheat Sheet
TIMEOUTS
- Connect: 1-5s | Read: 1-30s (by operation)
- Rule: timeout = p99 latency + buffer
- Cascade: downstream < upstream
- Total budget: timeout × retries ≤ SLA
RETRIES
- Backoff: delay = base × 2^attempt
- Always add jitter (full jitter best)
- Only retry 5xx/timeout, never 4xx
- Use idempotency keys for non-idempotent ops
CIRCUIT BREAKERS
- States: CLOSED → OPEN → HALF-OPEN
- Threshold: 5-10 failures or 50% rate
- Recovery timeout: 15-60 seconds
- HALF-OPEN: let exactly 1 request through
BULKHEADS
- Match pool size to bulkhead limit
- Size by criticality, not equally
- Reject (don't queue) when full
- Review sizing quarterly
FALLBACKS
- Hierarchy: cache → default → degrade → disable
- No external calls in fallback paths
- Test fallbacks regularly
- Critical ops: fail loudly, don't silently degrade
PATTERN TRAPS
- Retry + CB: count per-request, not per-attempt
- Timeout + Retry: total ≤ SLA budget
- Bulkhead + Pool: pool ≥ bulkhead limit
- CB + Health: don't check deps in health endpoint
What to Implement
Not every service needs every pattern. Here are the tradeoffs and a decision guide:
Tradeoffs: What Each Pattern Costs
Every pattern adds complexity. Use them when the protection justifies the cost:
| If your service... | You need... | Priority |
|---|---|---|
| Calls any external API | Timeouts on every call | Critical |
| Has transient failures | Retry with exponential backoff | Critical |
| Has unreliable dependencies | Circuit breakers | High |
| Can show partial data | Fallbacks with cached/default values | High |
| Calls multiple dependencies | Bulkheads (separate pools) | Medium |
| Receives external traffic | Rate limiting | High |
| Is business critical | All of the above + monitoring | Critical |
This week:
- Audit all external API calls. Do they have timeouts? Add them.
- Check your HTTP client defaults. Many have no timeout by default!
- Add retry with backoff to your most critical integration.
This month:
- Implement circuit breakers for your top 3 flakiest dependencies.
- Add fallback responses for non-critical features.
- Set up monitoring for timeout rates, retry rates, circuit breaker state.
This quarter:
- Run chaos engineering: randomly inject failures and measure impact.
- Review bulkhead sizing based on actual traffic patterns.
- Document your failure handling strategy for the team.
Key Takeaways
- Slow is worse than down. A dead service fails fast. A slow service holds resources hostage and cascades.
- Layer your defenses. Timeout → Retry → Circuit Breaker → Bulkhead → Fallback. Each catches what the previous missed.
- Fail fast, recover gracefully. The goal isn't preventing all failures—it's containing their impact.
- Test your failure handling. If you haven't tested it, it doesn't work. Chaos engineering isn't optional.
- Monitor everything. You can't fix what you can't see. Track timeout rates, circuit breaker trips, fallback usage.
Distributed systems will fail. Your job isn't to prevent all failures—it's to build systems where failures are contained, fast, and recoverable. The patterns in this post are your toolkit for doing exactly that.
Here's a complete, copy-pastable configuration for an order service that calls payment, inventory, and notification services. This combines all 5 patterns with production-ready settings:
```python
import asyncio
import logging
import random
import time

logger = logging.getLogger(__name__)

# ─── CONFIGURATION ───
# Tune these per-dependency based on your SLA and traffic patterns

PAYMENT_CONFIG = {
    "timeout_seconds": 5.0,          # p99 is 2s, generous buffer
    "max_retries": 2,                # Budget: 5s × 2 = 10s < 15s SLA
    "backoff_base": 1.0,             # 1s, 2s delays
    "cb_failure_threshold": 5,       # Open after 5 failures
    "cb_recovery_timeout": 30,       # Test recovery after 30s
    "bulkhead_max_concurrent": 20,   # Max 20 parallel payment calls
}

INVENTORY_CONFIG = {
    "timeout_seconds": 2.0,          # Internal service, should be fast
    "max_retries": 1,                # Quick: 2s × 1 = 2s
    "backoff_base": 0.5,
    "cb_failure_threshold": 10,      # Higher tolerance (internal)
    "cb_recovery_timeout": 15,
    "bulkhead_max_concurrent": 30,
}

NOTIFICATION_CONFIG = {
    "timeout_seconds": 3.0,
    "max_retries": 3,                # More retries OK (async, no user waiting)
    "backoff_base": 2.0,
    "cb_failure_threshold": 5,
    "cb_recovery_timeout": 60,       # Longer recovery (external service)
    "bulkhead_max_concurrent": 10,   # Non-critical, small pool
}

# ─── THE RESILIENT CLIENT ───

class ResilientServiceClient:
    def __init__(self, service_name: str, config: dict):
        self.name = service_name
        self.config = config
        # Circuit breaker state
        self.cb_state = "closed"
        self.cb_failure_count = 0
        self.cb_last_failure = 0
        # Bulkhead
        self.semaphore = asyncio.Semaphore(config["bulkhead_max_concurrent"])
        # Metrics (replace with Prometheus in production)
        self.metrics = {"timeouts": 0, "retries": 0, "cb_trips": 0,
                        "fallbacks": 0, "rejections": 0}

    async def call(self, func, fallback=None):
        # Layer 1: Bulkhead - limit concurrency
        if self.semaphore.locked():
            self.metrics["rejections"] += 1
            logger.warning(f"[{self.name}] Bulkhead full, using fallback")
            return fallback() if fallback else None

        async with self.semaphore:
            # Layer 2: Circuit Breaker - fail fast if known-broken
            if self.cb_state == "open":
                if time.time() - self.cb_last_failure > self.config["cb_recovery_timeout"]:
                    self.cb_state = "half_open"
                    logger.info(f"[{self.name}] Circuit HALF-OPEN, testing...")
                else:
                    self.metrics["fallbacks"] += 1
                    return fallback() if fallback else None

            # Layer 3: Retry with backoff
            for attempt in range(self.config["max_retries"]):
                try:
                    # Layer 4: Timeout
                    result = await asyncio.wait_for(
                        func(), timeout=self.config["timeout_seconds"]
                    )
                    self._record_success()
                    return result
                except asyncio.TimeoutError:
                    self.metrics["timeouts"] += 1
                    self._record_failure()
                    if attempt < self.config["max_retries"] - 1:
                        self.metrics["retries"] += 1
                        delay = self.config["backoff_base"] * (2 ** attempt)
                        jitter = random.uniform(0, delay)
                        await asyncio.sleep(jitter)
                except Exception:
                    self._record_failure()
                    if attempt < self.config["max_retries"] - 1:
                        self.metrics["retries"] += 1
                        await asyncio.sleep(self.config["backoff_base"])
                    else:
                        break

            # Layer 5: Fallback
            self.metrics["fallbacks"] += 1
            logger.warning(f"[{self.name}] All attempts failed, using fallback")
            return fallback() if fallback else None

    def _record_success(self):
        self.cb_failure_count = 0
        if self.cb_state != "closed":
            logger.info(f"[{self.name}] Circuit CLOSED (recovered)")
        self.cb_state = "closed"

    def _record_failure(self):
        self.cb_failure_count += 1
        self.cb_last_failure = time.time()
        if self.cb_failure_count >= self.config["cb_failure_threshold"]:
            if self.cb_state != "open":
                self.metrics["cb_trips"] += 1
                logger.error(f"[{self.name}] Circuit OPENED after "
                             f"{self.cb_failure_count} failures")
            self.cb_state = "open"

# ─── USAGE ───

payment_client = ResilientServiceClient("payment", PAYMENT_CONFIG)
inventory_client = ResilientServiceClient("inventory", INVENTORY_CONFIG)
notification_client = ResilientServiceClient("notification", NOTIFICATION_CONFIG)

async def place_order(order):
    inventory = await inventory_client.call(
        lambda: inventory_api.reserve(order.items),
        fallback=lambda: {"status": "estimated_available"}
    )
    payment = await payment_client.call(
        lambda: payment_api.charge(order.total),
        fallback=lambda: queue_for_later(order)
    )
    # Notification is fire-and-forget (non-critical)
    asyncio.create_task(notification_client.call(
        lambda: email_api.send_confirmation(order),
        fallback=lambda: None  # Silent fail OK
    ))
```
Key design decisions:
- Payment has a generous timeout (5s) because external APIs are slower, but only 2 retries to keep within SLA
- Inventory has a tight timeout (2s) because it's internal, and only 1 retry
- Notifications get more retries (3) because they're async — no user waiting
- Bulkhead sizes reflect criticality: payment (20), inventory (30), notifications (10)
- Each service has independent circuit breakers so payment going down doesn't affect inventory
Where To Go Deep
وَاللَّهُ أَعْلَمُ
And Allah knows best
وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ
May Allah's peace and blessings be upon our master Muhammad and his family