System Design

Building Resilient Systems:
What Happens When Things Fail?

Your payment API is down. What happens next? The difference between a crashed checkout and a happy customer is resilience: designing for failure before it happens.

Bahgat Bahgat Ahmed
· February 2026 · 20 min read

In the name of Allah, the Most Gracious, the Most Merciful

Your app calls a payment API. The API is down.

Option A: Your entire checkout flow crashes. Users see 500 errors. Revenue stops. Support tickets pile up. Your phone buzzes at 3am.

Option B: Your app retries once, then shows "Payment delayed, we'll process shortly." Order is queued. User is happy. You sleep through the night.

Same failure. Completely different outcome. The difference is resilience: designing for failure before it happens.

Quick Summary
  • Resilience = staying up when dependencies fail (different from scaling)
  • Each pattern has a specific use case - don't apply blindly
  • Start with timeouts - the simplest pattern that prevents cascade failures
  • Add complexity only when needed - premature resilience is over-engineering

Want the full story? Keep reading.

About This Post

This post covers WHEN to use resilience patterns and the thinking behind each one. For implementation details and code examples, see Failure Handling.

This post is for you if:

  • Your app crashes when a third-party API is slow or down
  • You've seen cascade failures bring down your whole system
  • You want to understand circuit breakers, retries, and fallbacks
  • You're calling external services and need to handle their failures gracefully

Resilience vs. Scaling: Know the Difference

These terms get confused constantly. They solve completely different problems:

Two Different Problems

Scaling - handle more load
  Question: "Can we handle 10,000 users instead of 1,000?"
  Solutions: caching, load balancing, sharding

Resilience - handle failures
  Question: "What happens when the payment API goes down?"
  Solutions: timeouts, retries, circuit breakers

A system can be highly scalable but not resilient (handles 10K RPS but crashes on a 1-second DB hiccup), or resilient but not scalable (gracefully handles all failures but only serves 100 users). You need both.

Why is failure "normal" in distributed systems?
Think of a City

In a small town, you might drive the same route every day without issues. In a city with millions of people and thousands of roads, there's ALWAYS something broken somewhere: a traffic light out, road construction, an accident. The city doesn't shut down because of one issue - it routes around problems. Distributed systems are the same.

Why Failures Are Inevitable
Networks are unreliable
Services restart for deploys
Databases have hiccups
Third-party APIs have outages
Hardware fails eventually
Resources get exhausted

The question isn't IF things will fail. It's WHEN, and what happens then.

When NOT to Add Resilience Patterns

Before diving into patterns, a critical warning: resilience patterns have costs. Adding them everywhere is over-engineering.

Each Pattern Has a Cost
Retries
Can amplify load on failing service
Circuit Breaker
Added complexity, state management
Fallbacks
Stale/degraded data, user confusion
Bulkheads
Resource overhead, config complexity

Skip Resilience Patterns When...

Calling services in a monolith

If InventoryService is in the same process, circuit breakers are overkill. If it fails, you have bigger problems - just let it throw.

You control both ends

Your App to Your API to Your Database? If it's failing, fix the root cause. Don't paper over it with retries.

The failure is not transient

400 Bad Request, 401 Unauthorized, 404 Not Found - retrying these will never work. Fix the request.

You're at PoC/MVP stage

Focus: "Does anyone want this?" NOT "What if the payment API is down?" Add resilience when you have users who depend on uptime.

Add Resilience Patterns When...

Signal → Pattern to Consider
  • Calling external APIs you don't control → Timeouts + Retries + Circuit Breaker
  • User-facing critical paths (checkout, login) → Fallbacks + Graceful Degradation
  • High traffic to an unreliable dependency → Circuit Breaker
  • Occasional network blips causing errors → Retries with Backoff
  • One slow service blocking everything → Bulkheads + Timeouts
Quick Check
Your e-commerce app has a "product recommendations" widget that sometimes loads slowly. Should you add a circuit breaker to it?
  a) Yes - protect against the slow service
  b) Depends - is it external or internal? How much traffic?
  c) No - it's just a widget

Answer: (b). Context matters. If it's your own internal recommendation service with low traffic, a timeout + fallback (show popular items) is enough - a circuit breaker for a low-traffic internal service is over-engineering. If it's a high-traffic external API that fails frequently, a circuit breaker makes sense. Don't apply patterns blindly.
Part 2
The Resilience Toolkit

Let's go through each pattern, from simplest to most complex. Always start with timeouts.

Add Complexity Only When Needed
  1. Timeouts - always, everywhere
  2. Retries - transient failures
  3. Fallbacks - critical user paths
  4. Circuit Breakers - high traffic + unreliable dependency
  5. Bulkheads - isolation needs

Don't start with circuit breakers. Start with timeouts. Add patterns as your system and traffic justify them.

Pattern 1: Timeouts (Start Here)

The simplest and most important pattern. Every external call should have a timeout. No exceptions.

What happens without a timeout?
The Endless Phone Call

Imagine calling customer support and being put on hold. Without a "timeout" (hanging up after 10 minutes), you could be stuck forever, unable to do anything else. Your system is the same - a request without a timeout can hang forever, consuming a connection/thread that never gets released.

Without Timeout

API hangs. Request waits forever. Connection pool exhausted. All requests start failing. Cascade failure.

With Timeout

API hangs. After 5 seconds, request fails fast. Connection released. Error handled. System stays healthy.

Choosing Timeout Values
Call Type → Typical Timeout → Why
  • Health check: 1-2 seconds - quick yes/no; if it takes longer, something's wrong
  • Internal service: 2-5 seconds - same network, optimized, should be fast
  • Database query: 5-10 seconds - if it takes longer, the query needs optimization
  • External API: 10-30 seconds - they might be slow, but not forever
The Formula

timeout = (expected_response_time x 2) + buffer

If your p95 is 200ms, a timeout of 1-2 seconds is reasonable. Don't set timeout = 60 seconds "just in case" - that defeats the purpose.

Show me the code (JavaScript)
// Without timeout - DANGEROUS
const response = await fetch('https://api.payment.com/charge');
// If API hangs, your request hangs forever
// Connection pool exhausted. Everything dies.

// With timeout - SAFE
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 5000);

try {
  const response = await fetch('https://api.payment.com/charge', {
    signal: controller.signal
  });
  // Process response
} catch (e) {
  if (e.name === 'AbortError') {
    // Handle timeout - fail gracefully
    console.log('Request timed out after 5 seconds');
  }
} finally {
  clearTimeout(timeout);
}
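If your runtime supports it (Node 17.3+ and current browsers), `AbortSignal.timeout()` does the same job without the manual controller bookkeeping - a minimal sketch:

```javascript
// fetchWithTimeout: same behavior as the AbortController version above,
// using the AbortSignal.timeout() helper instead
async function fetchWithTimeout(url, ms = 5000) {
  try {
    return await fetch(url, { signal: AbortSignal.timeout(ms) });
  } catch (e) {
    // The spec says TimeoutError; some runtimes surface a generic AbortError
    if (e.name === 'TimeoutError' || e.name === 'AbortError') {
      // The signal fired before the server responded - fail fast
      throw new Error(`Request timed out after ${ms}ms`);
    }
    throw e; // other failures (DNS, connection refused) pass through unchanged
  }
}
```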
Use WHEN
  • Any external call (APIs, databases, services)
  • Any network request whatsoever
  • Always. No exceptions.
DON'T Use WHEN
  • Never skip timeouts
  • This is the one pattern that's always appropriate
  • Seriously. Always use timeouts.

Pattern 2: Retries with Backoff

Sometimes things fail temporarily: a network blip, a service restarting, momentary overload. Retrying can help - but only for transient failures.

What to Retry vs What NOT to Retry
Retry These
  • 503 Service Unavailable
  • 429 Too Many Requests (with backoff)
  • Connection timeout
  • ECONNRESET (connection reset)
  • Network errors

These are transient - might work on retry

Never Retry These
  • 400 Bad Request
  • 401 Unauthorized
  • 403 Forbidden
  • 404 Not Found
  • Business logic failures ("insufficient funds")

These will fail again - fix the request

What is exponential backoff? Why add jitter?

When retrying, HOW you wait between retries matters as much as whether you retry.

Fixed Backoff
Wait the same time each attempt: 1s, 1s, 1s, 1s
Problem: if 1,000 clients retry at the same time, they all hit the server together again.

Exponential Backoff
Wait an increasing time: 1s, 2s, 4s, 8s
Better: gives the service time to recover, but retries are still synchronized.

Exponential + Jitter (Best Practice)
Wait an increasing time plus random variation: 1.2s, 2.7s, 4.1s, 8.9s
Best: spreads out retries and prevents the "thundering herd" - 1,000 clients don't all retry at once.
The Formula
const delay = Math.min(1000 * Math.pow(2, attempt), maxDelay);
const jitter = delay * 0.5 * Math.random();
await sleep(delay + jitter);
Retry implementation with backoff and jitter
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await fetch(url);
      // fetch() resolves on HTTP errors, so check the status explicitly.
      // Non-transient statuses (2xx, 4xx) are returned to the caller as-is.
      if (!isTransientStatus(response.status)) return response;
    } catch (error) {
      // Network-level failures (connection reset, timeout) may be transient
      if (!isTransientError(error)) throw error;
    }
    // Exponential backoff with jitter before the next attempt
    const delay = Math.min(1000 * Math.pow(2, attempt), 10000);
    const jitter = delay * 0.5 * Math.random();
    await sleep(delay + jitter);
  }
  throw new Error(`Failed after ${maxRetries} attempts`);
}

function isTransientStatus(status) {
  // Only retry server overload/unavailability - never 4xx client errors
  return status === 503 || status === 429;
}

function isTransientError(error) {
  const code = error.code ?? error.cause?.code; // Node's fetch nests the code in `cause`
  return code === 'ECONNRESET' || code === 'ETIMEDOUT';
}
Critical: Idempotency

Retrying is only safe if the operation is idempotent (can be done multiple times without side effects). Charging a credit card twice is NOT idempotent. Use idempotency keys:

// Dangerous: Retrying might charge twice
POST /payments { amount: 100 }

// Safe: Idempotency key ensures single charge
POST /payments { amount: 100, idempotency_key: "order-123" }
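In practice the key is usually sent as a header, and the crucial rule is to derive it from the business operation rather than generate a fresh one per attempt. A sketch - the endpoint is a placeholder, and the header name varies by provider (Stripe, for example, uses `Idempotency-Key`):

```javascript
// Derive the key from the business operation, NOT fresh randomness per attempt:
// every retry of the SAME order must carry the SAME key, so the provider
// can detect and ignore duplicates.
function idempotencyKeyFor(order) {
  return `order-${order.id}`;
}

// Build the charge request; pass the result to fetch('https://...', chargeRequest(order))
function chargeRequest(order) {
  return {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Idempotency-Key': idempotencyKeyFor(order)
    },
    body: JSON.stringify({ amount: order.amount })
  };
}
```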
Use WHEN
  • Transient network errors
  • 503, 429 responses
  • Connection timeouts
  • Operation is idempotent
DON'T Use WHEN
  • 4xx client errors
  • Business logic failures
  • Non-idempotent operations without idempotency keys

Pattern 3: Circuit Breakers

When a service is failing repeatedly, continuing to call it wastes resources and can make things worse. A circuit breaker "trips" to fail fast instead of waiting.

Circuit Breaker State Machine

CLOSED - normal operation, requests flow through
  → (failures > threshold) → OPEN
OPEN - fail immediately, no requests sent
  → (timeout expires) → HALF-OPEN
HALF-OPEN - let 1-3 test requests through: is the service back?
  → success → CLOSED
  → failure → OPEN again
Why is it called a "circuit breaker"?
Like Your Home's Breaker Box

In your house, if there's an electrical short, the circuit breaker "trips" to cut power and prevent fire. It stays open until you manually reset it. Similarly, a software circuit breaker "trips" when a service is failing, preventing your system from wasting resources on doomed requests. After a timeout, it tests if the service is back (half-open) before fully resetting.

Configuration Guidelines

Setting → Typical Value → Tune Based On
  • Failure threshold: 5-10 failures - traffic volume, expected error rate
  • Reset timeout: 30-60 seconds - how long the service typically takes to recover
  • Half-open requests: 1-3 requests - how quickly to test recovery
Simple Circuit Breaker implementation
class CircuitBreaker {
  constructor(options) {
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeout = options.resetTimeout || 30000;
    this.state = 'CLOSED';
    this.failures = 0;
    this.lastFailure = null;
  }

  async call(fn) {
    // If OPEN, check if we should try half-open
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is open');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
    }
  }
}

// Usage
const paymentCircuit = new CircuitBreaker({
  failureThreshold: 5,
  resetTimeout: 30000
});

async function chargeUser(amount) {
  return paymentCircuit.call(() => paymentAPI.charge(amount));
}
Use WHEN
  • High traffic to unreliable external services
  • When failures cascade (one down brings others)
  • Need to fail fast under heavy load
DON'T Use WHEN
  • Low traffic services (not enough data to trip)
  • Calling your own internal services (fix them instead)
  • PoC/MVP stage (over-engineering)
Quick Check
Your app calls a weather API to show forecast widgets. You get about 100 requests/day. The API has been reliable. Should you add a circuit breaker?
  a) Yes - always use circuit breakers for external APIs
  b) No - timeout + fallback (show cached weather) is enough
  c) Yes - but only with a failure threshold of 1

Answer: (b). Circuit breakers need sufficient traffic to detect patterns. At 100 requests/day you might see one failure every few days - not enough data to trigger the breaker meaningfully. A timeout + simple fallback (show cached weather or "Weather unavailable") is the right level of complexity for this traffic volume.

Pattern 4: Fallbacks

When the primary call fails, return something useful instead of an error. The key word is useful - a bad fallback can be worse than no fallback.

Fallback Hierarchy: Degrade Gracefully
  1. Primary (best): personalized recommendations from the ML service
  2. Cached data: last known recommendations (stale but personalized)
  3. Degraded response: popular products (less personalized)
  4. Static default: a curated "best of" list
  5. Graceful error: "Recommendations unavailable" (better than a 500 error)
Each level is worse than the previous but still provides value. The goal: never show an error if you can show something useful.
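The hierarchy above can be expressed as an ordered chain of providers, where the first success wins. A minimal sketch - the provider functions in the usage comment are hypothetical stand-ins:

```javascript
// Try each provider in order; the first success wins.
// Make the LAST provider one that cannot fail (static data or a graceful message).
async function withFallbacks(providers) {
  let lastError;
  for (const provider of providers) {
    try {
      return await provider();
    } catch (e) {
      lastError = e; // fall through to the next (worse but still useful) level
    }
  }
  throw lastError ?? new Error('no providers given');
}

// Usage, mirroring the hierarchy above (all names hypothetical):
// const items = await withFallbacks([
//   fetchPersonalized,          // primary: ML service
//   fetchCachedRecommendations, // stale but personalized
//   fetchPopularProducts,       // degraded
//   async () => STATIC_BEST_OF  // static default - always succeeds
// ]);
```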

Good vs Bad Fallbacks

Scenario → Good Fallback vs. Bad Fallback
  • Recommendations down: show popular items (good) vs. show random items (bad)
  • User profile unavailable: show cached profile (good) vs. show blank profile (bad)
  • Payment API down: queue for retry and notify the user (good) vs. silently skip payment (bad)
  • Inventory check fails: "Check availability in store" (good) vs. showing "In Stock" - a lie (bad)
Use WHEN
  • Non-critical features that enhance UX
  • Stale data is acceptable
  • Must show something (empty state is worse)
DON'T Use WHEN
  • Fallback would be misleading/incorrect
  • Financial transactions
  • Users expect real-time accuracy
  • Security-sensitive operations

Pattern 5: Bulkheads

Named after ship compartments that prevent one leak from sinking the whole vessel. Isolate components so one failure doesn't bring down everything.

Bulkhead Isolation
Without Bulkhead
A shared thread pool (100 threads) serves API A, API B, and API C (slow).
If API C is slow, all 100 threads end up waiting on it. API A and B can't get threads. Everything dies.

With Bulkhead
API A: 30 threads | API B: 30 threads | API C: 40 threads
If API C is slow, only its 40 threads are stuck. API A and B continue working.

How to implement bulkheads
Separate Thread Pools
Each integration gets its own pool. Easy to implement in most frameworks.
Separate Service Instances
Critical path gets dedicated servers. Non-critical shares infrastructure.
Queue-Based Isolation
Each integration gets its own queue. Slow consumer doesn't block others.
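In Node, where there is one event loop rather than thread pools, the same idea shows up as a per-dependency concurrency cap. A minimal sketch - a production limiter (e.g. the p-limit package) handles more edge cases:

```javascript
// A tiny semaphore: at most `limit` in-flight calls per compartment.
function bulkhead(limit) {
  let active = 0;
  const waiters = [];
  return async function run(fn) {
    if (active >= limit) {
      // Compartment full: wait until a slot frees up
      await new Promise((resolve) => waiters.push(resolve));
    }
    active++;
    try {
      return await fn();
    } finally {
      active--;
      const next = waiters.shift();
      if (next) next(); // hand the freed slot to the next waiter
    }
  };
}

// Each integration gets its own compartment - a slow API C can only
// tie up its own 40 slots, never API A's or B's.
const apiA = bulkhead(30);
const apiC = bulkhead(40);
```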
Use WHEN
  • Mission-critical paths that must stay up
  • Unreliable dependencies you can't control
  • Multi-tenant systems (isolate tenants)
  • One slow call can exhaust resources
DON'T Use WHEN
  • Simple systems with few dependencies
  • All calls are equally important
  • Resource overhead isn't justified
  • PoC/MVP stage

Pattern 6: Health Checks

Proactively know when services are healthy or unhealthy instead of waiting for failures.

Types of Health Checks
  • Liveness - is the process running? Used by: Kubernetes, to restart dead pods
  • Readiness - can it serve traffic? Used by: load balancers, to decide where to route
  • Deep health - are all dependencies up? Used by: monitoring and alerting
Health check endpoint example
app.get('/health', async (req, res) => {
  const health = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    checks: {}
  };

  // Check database
  try {
    await db.query('SELECT 1');
    health.checks.database = 'healthy';
  } catch (e) {
    health.checks.database = 'unhealthy';
    health.status = 'unhealthy';
  }

  // Check Redis
  try {
    await redis.ping();
    health.checks.cache = 'healthy';
  } catch (e) {
    health.checks.cache = 'unhealthy';
    health.status = 'unhealthy';
  }

  const statusCode = health.status === 'healthy' ? 200 : 503;
  res.status(statusCode).json(health);
});
Part 3
Putting It Together

Combining Patterns: A Real Example

Here's how patterns work together in a payment service:

Combined Resilience: Payment Service
  1. Timeout (10 seconds) - don't wait forever if the payment API hangs
  2. Circuit breaker - after 5 failures, fail fast for 30 seconds
  3. Retry (with idempotency key) - try up to 3 times for transient failures
  4. Fallback (queue for later) - if all else fails, save the order and retry async
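One way to sketch the composition order - breaker check, then the primary call (which carries its own timeout and retries, like `fetchWithRetry` from Pattern 2), then the fallback. The names in the usage comment are hypothetical stand-ins:

```javascript
// Compose the layers: fail fast if the breaker is open, otherwise try the
// primary call, and degrade to the fallback instead of crashing.
function resilient({ call, fallback, isOpen = () => false }) {
  return async (...args) => {
    try {
      if (isOpen()) throw new Error('circuit open'); // step 2: fail fast
      return await call(...args); // steps 1+3: timeout + retries live inside `call`
    } catch (e) {
      return fallback(...args); // step 4: degrade, don't crash
    }
  };
}

// Usage (all names hypothetical):
// const processPayment = resilient({
//   call: (order) => fetchWithRetry(chargeUrl(order)),      // Pattern 2
//   isOpen: () => paymentCircuit.state === 'OPEN',          // Pattern 3
//   fallback: (order) => queueOrder(order)                  // async retry queue
// });
```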

Pattern Combination Guide

Scenario → Recommended Patterns
  • Calling a payment API: Timeout + Retry (idempotent) + Circuit Breaker + Fallback (queue)
  • Fetching recommendations: Timeout + Fallback (cached/popular)
  • Sending emails: Timeout + Queue (async)
  • Critical database call: Timeout + Connection pool
  • Third-party API integration: Timeout + Retry + Circuit Breaker

Practice Mode: Apply What You Learned

Test your understanding with these scenarios. Think through the answer before clicking.

Practice Scenario 1
Your e-commerce app calls an inventory service before checkout. The service is internal (you control it), gets moderate traffic (5K req/day), and occasionally times out. What patterns should you add?
  a) Timeout + Retry (limited) + Fallback ("check in-store")
  b) Circuit Breaker + Bulkhead + Health Check
  c) Just timeout - it's internal, fix the service

Answer: (a). For an internal service you control with moderate traffic: a timeout is essential, 1-2 retries can handle transient issues, and a user-friendly fallback ("check availability in-store") improves UX. Circuit breakers are overkill for internal services - fix reliability at the source. If it keeps timing out, fix the service itself.
Practice Scenario 2
Your app integrates with Stripe for payments. You get 50K transactions/day. Stripe occasionally has brief outages. What's your resilience strategy?
  a) Timeout + Retry - Stripe is reliable, keep it simple
  b) Timeout + Retry (with idempotency) + Circuit Breaker + Queue fallback
  c) Circuit Breaker only - fail fast during outages

Answer: (b). At 50K transactions/day with an external API you don't control, brief outages affect thousands of users, so you need the full toolkit: timeouts prevent hanging, retries with idempotency keys handle transient failures without double-charging, a circuit breaker prevents a thundering herd during outages, and a queue fallback ensures no orders are lost. This is exactly when circuit breakers shine.
Practice Scenario 3
You're building a new feature that calls an ML model for content moderation. It's a PoC, you're the only user, and the ML service is experimental. What resilience do you add?
  a) Timeout + Retry + Circuit Breaker - protect against failures
  b) Timeout only - it's a PoC, focus on the feature
  c) No resilience - it's experimental anyway

Answer: (b). At PoC stage the question is "does this ML moderation even work?", not "what if it's down?" A simple timeout prevents infinite hangs - that's all you need. Adding circuit breakers to a one-user PoC is classic over-engineering. Add resilience patterns when you have real users who depend on uptime.
Part 4
Reference

The Resilience Checklist

Use this as a quick reference when building or reviewing systems.

Always (Every External Call)

  • Timeout configured
  • Error handling that doesn't crash the app

When Calling External APIs

  • Retries for transient failures only
  • Exponential backoff with jitter
  • Idempotency keys for non-idempotent calls

For Critical Paths

  • Fallback strategy defined
  • Graceful degradation tested

For High-Traffic Systems

  • Circuit breakers on unreliable deps
  • Bulkheads for isolation

For Production

  • Health check endpoints
  • Monitoring and alerting on failures
  • Runbook for common scenarios

Key Takeaways

  • Resilience != Scaling
  • Timeouts are non-negotiable
  • Context matters - don't apply blindly
  • Start simple, add complexity when needed

Related Posts

Failure Handling: Timeouts, Retries, and More
A deeper dive into failure patterns with more code examples
The 95% Problem: Understanding DB Connections
Connection resilience, pooling, and idle timeouts
Async Processing: Don't Make Users Wait
Dead letter queues and background processing for fallbacks
Designing for 10,000 Requests/Second
The complete scaling toolkit: caching, queuing, load balancing

Failure will happen. The question is whether you've designed for it.

Start with timeouts. Add complexity only when you have the problem. And remember: a system that fails gracefully is more valuable than one that "never fails" but catastrophically crashes when it eventually does.
