System Design

Building Resilient Systems:
What Happens When Things Fail?

Your payment API is down. What happens next? The difference between a crashed checkout and a happy customer is resilience: designing for failure before it happens.

Bahgat Bahgat Ahmed
· February 2026 · 20 min read

In the name of Allah, the Most Gracious, the Most Merciful

Your app calls a payment API. The API is down.

Option A: Your entire checkout flow crashes. Users see 500 errors. Revenue stops. Support tickets pile up. Your phone buzzes at 3am.

Option B: Your app retries once, then shows "Payment delayed, we'll process shortly." Order is queued. User is happy. You sleep through the night.

Same failure. Completely different outcome. The difference is resilience: designing for failure before it happens.

Quick Summary
  • Resilience = staying up when dependencies fail (different from scaling)
  • Each pattern has a specific use case - don't apply blindly
  • Start with timeouts - the simplest pattern that prevents cascade failures
  • Add complexity only when needed - premature resilience is over-engineering

Want the full story? Keep reading.

About This Post

This post covers WHEN to use resilience patterns and the thinking behind each one. For implementation details and code examples, see Failure Handling.

This post is for you if:

  • Your app crashes when a third-party API is slow or down
  • You've seen cascade failures bring down your whole system
  • You want to understand circuit breakers, retries, and fallbacks
  • You're calling external services and need to handle their failures gracefully

Resilience vs. Scaling: Know the Difference

These terms get confused constantly. They solve completely different problems:

Two Different Problems

Scaling - handle more load
  Question: "Can we handle 10,000 users instead of 1,000?"
  Solutions: caching, load balancing, sharding

Resilience - handle failures
  Question: "What happens when the payment API goes down?"
  Solutions: timeouts, retries, circuit breakers

A system can be highly scalable but not resilient (handles 10K RPS but crashes on a 1-second DB hiccup), or resilient but not scalable (gracefully handles all failures but only serves 100 users). You need both.

Why is failure "normal" in distributed systems?
Think of a City

In a small town, you might drive the same route every day without issues. In a city with millions of people and thousands of roads, there's ALWAYS something broken somewhere: a traffic light out, road construction, an accident. The city doesn't shut down because of one issue - it routes around problems. Distributed systems are the same.

Why Failures Are Inevitable
Networks are unreliable
Services restart for deploys
Databases have hiccups
Third-party APIs have outages
Hardware fails eventually
Resources get exhausted

The question isn't IF things will fail. It's WHEN, and what happens then.

When NOT to Add Resilience Patterns

Before diving into patterns, a critical warning: resilience patterns have costs. Adding them everywhere is over-engineering.

Each Pattern Has a Cost
Retries
Can amplify load on failing service
Circuit Breaker
Added complexity, state management
Fallbacks
Stale/degraded data, user confusion
Bulkheads
Resource overhead, config complexity

Skip Resilience Patterns When...

Calling services in a monolith

If InventoryService is in the same process, circuit breakers are overkill. If it fails, you have bigger problems - just let it throw.

You control both ends

Your App to Your API to Your Database? If it's failing, fix the root cause. Don't paper over it with retries.

The failure is not transient

400 Bad Request, 401 Unauthorized, 404 Not Found - retrying these will never work. Fix the request.

You're at PoC/MVP stage

Focus: "Does anyone want this?" NOT "What if the payment API is down?" Add resilience when you have users who depend on uptime.

Add Resilience Patterns When...

Signal → Pattern to Consider
  • Calling external APIs you don't control → Timeouts + Retries + Circuit Breaker
  • User-facing critical paths (checkout, login) → Fallbacks + Graceful Degradation
  • High traffic to an unreliable dependency → Circuit Breaker
  • Occasional network blips causing errors → Retries with Backoff
  • One slow service blocking everything → Bulkheads + Timeouts
Quick Check
Your e-commerce app has a "product recommendations" widget that sometimes loads slowly. Should you add a circuit breaker to it?
  a) Yes - protect against the slow service
  b) Depends - is it external or internal? How much traffic?
  c) No - it's just a widget

Answer: (b). Context matters. If it's your own internal recommendation service with low traffic, a timeout + fallback (show popular items) is enough - a circuit breaker for a low-traffic internal service is over-engineering. If it's a high-traffic external API that fails frequently, a circuit breaker makes sense. Don't apply patterns blindly.
Part 2
The Resilience Toolkit

Let's go through each pattern, from simplest to most complex. Always start with timeouts.

Add Complexity Only When Needed
  1. Timeouts - always, everywhere
  2. Retries - transient failures
  3. Fallbacks - critical user paths
  4. Circuit Breakers - high traffic + unreliable dependency
  5. Bulkheads - isolation needs

Don't start with circuit breakers. Start with timeouts. Add patterns as your system and traffic justify them.

Pattern 1: Timeouts (Start Here)

The simplest and most important pattern. Every external call should have a timeout. No exceptions.

What happens without a timeout?
The Endless Phone Call

Imagine calling customer support and being put on hold. Without a "timeout" (hanging up after 10 minutes), you could be stuck forever, unable to do anything else. Your system is the same - a request without a timeout can hang forever, consuming a connection/thread that never gets released.

Without Timeout

API hangs. Request waits forever. Connection pool exhausted. All requests start failing. Cascade failure.

With Timeout

API hangs. After 5 seconds, request fails fast. Connection released. Error handled. System stays healthy.

Choosing Timeout Values
Call Type → Typical Timeout → Why
  • Health check: 1-2 seconds - quick yes/no; if it takes longer, something's wrong
  • Internal service: 2-5 seconds - same network, optimized, should be fast
  • Database query: 5-10 seconds - if it takes longer, the query needs optimization
  • External API: 10-30 seconds - they might be slow, but not forever
The Formula

timeout = (expected_response_time x 2) + buffer

If your p95 is 200ms, a timeout of 1-2 seconds is reasonable. Don't set timeout = 60 seconds "just in case" - that defeats the purpose.

Show me the code (JavaScript)
// Without timeout - DANGEROUS
const response = await fetch('https://api.payment.com/charge');
// If API hangs, your request hangs forever
// Connection pool exhausted. Everything dies.

// With timeout - SAFE
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 5000);

try {
  const response = await fetch('https://api.payment.com/charge', {
    signal: controller.signal
  });
  // Process response
} catch (e) {
  if (e.name === 'AbortError') {
    // Handle timeout - fail gracefully
    console.log('Request timed out after 5 seconds');
  }
} finally {
  clearTimeout(timeout);
}
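If your runtime supports it (Node 17.3+ and current browsers), `AbortSignal.timeout()` does the same job without the manual controller bookkeeping - a minimal sketch:

```javascript
// fetchWithTimeout: same behavior as the AbortController version above,
// using the AbortSignal.timeout() helper instead
async function fetchWithTimeout(url, ms = 5000) {
  try {
    return await fetch(url, { signal: AbortSignal.timeout(ms) });
  } catch (e) {
    // The spec says TimeoutError; some runtimes surface a generic AbortError
    if (e.name === 'TimeoutError' || e.name === 'AbortError') {
      // The signal fired before the server responded - fail fast
      throw new Error(`Request timed out after ${ms}ms`);
    }
    throw e; // other failures (DNS, connection refused) pass through unchanged
  }
}
```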
Use WHEN
  • Any external call (APIs, databases, services)
  • Any network request whatsoever
  • Always. No exceptions.
DON'T Use WHEN
  • Never skip timeouts
  • This is the one pattern that's always appropriate
  • Seriously. Always use timeouts.

Pattern 2: Retries with Backoff

Sometimes things fail temporarily: a network blip, a service restarting, momentary overload. Retrying can help - but only for transient failures.

What to Retry vs What NOT to Retry
Retry These
  • 503 Service Unavailable
  • 429 Too Many Requests (with backoff)
  • Connection timeout
  • ECONNRESET (connection reset)
  • Network errors

These are transient - might work on retry

Never Retry These
  • 400 Bad Request
  • 401 Unauthorized
  • 403 Forbidden
  • 404 Not Found
  • Business logic failures ("insufficient funds")

These will fail again - fix the request

What is exponential backoff? Why add jitter?

When retrying, HOW you wait between retries matters as much as whether you retry.

Fixed Backoff
Wait the same time each attempt: 1s, 1s, 1s, 1s
Problem: if 1,000 clients retry at the same time, they all hit the server together again.

Exponential Backoff
Wait an increasing time: 1s, 2s, 4s, 8s
Better: gives the service time to recover, but retries are still synchronized.

Exponential + Jitter (Best Practice)
Wait an increasing time plus random variation: 1.2s, 2.7s, 4.1s, 8.9s
Best: spreads out retries and prevents the "thundering herd" - 1,000 clients don't all retry at once.
The Formula
const delay = Math.min(1000 * Math.pow(2, attempt), maxDelay);
const jitter = delay * 0.5 * Math.random();
await sleep(delay + jitter);
Retry implementation with backoff and jitter
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await fetch(url);
      // fetch() resolves on HTTP errors, so check the status explicitly.
      // Non-transient statuses (2xx, 4xx) are returned to the caller as-is.
      if (!isTransientStatus(response.status)) return response;
    } catch (error) {
      // Network-level failures (connection reset, timeout) may be transient
      if (!isTransientError(error)) throw error;
    }
    // Exponential backoff with jitter before the next attempt
    const delay = Math.min(1000 * Math.pow(2, attempt), 10000);
    const jitter = delay * 0.5 * Math.random();
    await sleep(delay + jitter);
  }
  throw new Error(`Failed after ${maxRetries} attempts`);
}

function isTransientStatus(status) {
  // Only retry server overload/unavailability - never 4xx client errors
  return status === 503 || status === 429;
}

function isTransientError(error) {
  const code = error.code ?? error.cause?.code; // Node's fetch nests the code in `cause`
  return code === 'ECONNRESET' || code === 'ETIMEDOUT';
}
Critical: Idempotency

Retrying is only safe if the operation is idempotent (can be done multiple times without side effects). Charging a credit card twice is NOT idempotent. Use idempotency keys:

// Dangerous: Retrying might charge twice
POST /payments { amount: 100 }

// Safe: Idempotency key ensures single charge
POST /payments { amount: 100, idempotency_key: "order-123" }
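In practice the key is usually sent as a header, and the crucial rule is to derive it from the business operation rather than generate a fresh one per attempt. A sketch - the endpoint is a placeholder, and the header name varies by provider (Stripe, for example, uses `Idempotency-Key`):

```javascript
// Derive the key from the business operation, NOT fresh randomness per attempt:
// every retry of the SAME order must carry the SAME key, so the provider
// can detect and ignore duplicates.
function idempotencyKeyFor(order) {
  return `order-${order.id}`;
}

// Build the charge request; pass the result to fetch('https://...', chargeRequest(order))
function chargeRequest(order) {
  return {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Idempotency-Key': idempotencyKeyFor(order)
    },
    body: JSON.stringify({ amount: order.amount })
  };
}
```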
Use WHEN
  • Transient network errors
  • 503, 429 responses
  • Connection timeouts
  • Operation is idempotent
DON'T Use WHEN
  • 4xx client errors
  • Business logic failures
  • Non-idempotent operations without idempotency keys

Pattern 3: Circuit Breakers

When a service is failing repeatedly, continuing to call it wastes resources and can make things worse. A circuit breaker "trips" to fail fast instead of waiting.

Circuit Breaker State Machine

CLOSED - normal operation, requests flow through
  → (failures > threshold) → OPEN
OPEN - fail immediately, no requests sent
  → (timeout expires) → HALF-OPEN
HALF-OPEN - let 1-3 test requests through: is the service back?
  → success → CLOSED
  → failure → OPEN again
Why is it called a "circuit breaker"?
Like Your Home's Breaker Box

In your house, if there's an electrical short, the circuit breaker "trips" to cut power and prevent fire. It stays open until you manually reset it. Similarly, a software circuit breaker "trips" when a service is failing, preventing your system from wasting resources on doomed requests. After a timeout, it tests if the service is back (half-open) before fully resetting.

Configuration Guidelines

Setting → Typical Value → Tune Based On
  • Failure threshold: 5-10 failures - traffic volume, expected error rate
  • Reset timeout: 30-60 seconds - how long the service typically takes to recover
  • Half-open requests: 1-3 requests - how quickly to test recovery
Simple Circuit Breaker implementation
class CircuitBreaker {
  constructor(options) {
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeout = options.resetTimeout || 30000;
    this.state = 'CLOSED';
    this.failures = 0;
    this.lastFailure = null;
  }

  async call(fn) {
    // If OPEN, check if we should try half-open
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is open');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
    }
  }
}

// Usage
const paymentCircuit = new CircuitBreaker({
  failureThreshold: 5,
  resetTimeout: 30000
});

async function chargeUser(amount) {
  return paymentCircuit.call(() => paymentAPI.charge(amount));
}
Use WHEN
  • High traffic to unreliable external services
  • When failures cascade (one down brings others)
  • Need to fail fast under heavy load
DON'T Use WHEN
  • Low traffic services (not enough data to trip)
  • Calling your own internal services (fix them instead)
  • PoC/MVP stage (over-engineering)
Quick Check
Your app calls a weather API to show forecast widgets. You get about 100 requests/day. The API has been reliable. Should you add a circuit breaker?
  a) Yes - always use circuit breakers for external APIs
  b) No - timeout + fallback (show cached weather) is enough
  c) Yes - but only with a failure threshold of 1

Answer: (b). Circuit breakers need sufficient traffic to detect patterns. At 100 requests/day you might see one failure every few days - not enough data to trigger the breaker meaningfully. A timeout + simple fallback (show cached weather or "Weather unavailable") is the right level of complexity for this traffic volume.

Pattern 4: Fallbacks

When the primary call fails, return something useful instead of an error. The key word is useful - a bad fallback can be worse than no fallback.

Fallback Hierarchy: Degrade Gracefully
  1. Primary (best): personalized recommendations from the ML service
  2. Cached data: last known recommendations (stale but personalized)
  3. Degraded response: popular products (less personalized)
  4. Static default: a curated "best of" list
  5. Graceful error: "Recommendations unavailable" (better than a 500 error)
Each level is worse than the previous but still provides value. The goal: never show an error if you can show something useful.
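The hierarchy above can be expressed as an ordered chain of providers, where the first success wins. A minimal sketch - the provider functions in the usage comment are hypothetical stand-ins:

```javascript
// Try each provider in order; the first success wins.
// Make the LAST provider one that cannot fail (static data or a graceful message).
async function withFallbacks(providers) {
  let lastError;
  for (const provider of providers) {
    try {
      return await provider();
    } catch (e) {
      lastError = e; // fall through to the next (worse but still useful) level
    }
  }
  throw lastError ?? new Error('no providers given');
}

// Usage, mirroring the hierarchy above (all names hypothetical):
// const items = await withFallbacks([
//   fetchPersonalized,          // primary: ML service
//   fetchCachedRecommendations, // stale but personalized
//   fetchPopularProducts,       // degraded
//   async () => STATIC_BEST_OF  // static default - always succeeds
// ]);
```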

Good vs Bad Fallbacks

Scenario → Good Fallback vs. Bad Fallback
  • Recommendations down: show popular items (good) vs. show random items (bad)
  • User profile unavailable: show cached profile (good) vs. show blank profile (bad)
  • Payment API down: queue for retry and notify the user (good) vs. silently skip payment (bad)
  • Inventory check fails: "Check availability in store" (good) vs. showing "In Stock" - a lie (bad)
Use WHEN
  • Non-critical features that enhance UX
  • Stale data is acceptable
  • Must show something (empty state is worse)
DON'T Use WHEN
  • Fallback would be misleading/incorrect
  • Financial transactions
  • Users expect real-time accuracy
  • Security-sensitive operations

Pattern 5: Bulkheads

Named after ship compartments that prevent one leak from sinking the whole vessel. Isolate components so one failure doesn't bring down everything.

Bulkhead Isolation
Without Bulkhead
A shared thread pool (100 threads) serves API A, API B, and API C (slow).
If API C is slow, all 100 threads end up waiting on it. API A and B can't get threads. Everything dies.

With Bulkhead
API A: 30 threads | API B: 30 threads | API C: 40 threads
If API C is slow, only its 40 threads are stuck. API A and B continue working.

How to implement bulkheads
Separate Thread Pools
Each integration gets its own pool. Easy to implement in most frameworks.
Separate Service Instances
Critical path gets dedicated servers. Non-critical shares infrastructure.
Queue-Based Isolation
Each integration gets its own queue. Slow consumer doesn't block others.
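In Node, where there is one event loop rather than thread pools, the same idea shows up as a per-dependency concurrency cap. A minimal sketch - a production limiter (e.g. the p-limit package) handles more edge cases:

```javascript
// A tiny semaphore: at most `limit` in-flight calls per compartment.
function bulkhead(limit) {
  let active = 0;
  const waiters = [];
  return async function run(fn) {
    if (active >= limit) {
      // Compartment full: wait until a slot frees up
      await new Promise((resolve) => waiters.push(resolve));
    }
    active++;
    try {
      return await fn();
    } finally {
      active--;
      const next = waiters.shift();
      if (next) next(); // hand the freed slot to the next waiter
    }
  };
}

// Each integration gets its own compartment - a slow API C can only
// tie up its own 40 slots, never API A's or B's.
const apiA = bulkhead(30);
const apiC = bulkhead(40);
```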
Use WHEN
  • Mission-critical paths that must stay up
  • Unreliable dependencies you can't control
  • Multi-tenant systems (isolate tenants)
  • One slow call can exhaust resources
DON'T Use WHEN
  • Simple systems with few dependencies
  • All calls are equally important
  • Resource overhead isn't justified
  • PoC/MVP stage

Pattern 6: Health Checks

Proactively know when services are healthy or unhealthy instead of waiting for failures.

Types of Health Checks
  • Liveness - is the process running? Used by: Kubernetes, to restart dead pods
  • Readiness - can it serve traffic? Used by: load balancers, to decide where to route
  • Deep health - are all dependencies up? Used by: monitoring and alerting
Health check endpoint example
app.get('/health', async (req, res) => {
  const health = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    checks: {}
  };

  // Check database
  try {
    await db.query('SELECT 1');
    health.checks.database = 'healthy';
  } catch (e) {
    health.checks.database = 'unhealthy';
    health.status = 'unhealthy';
  }

  // Check Redis
  try {
    await redis.ping();
    health.checks.cache = 'healthy';
  } catch (e) {
    health.checks.cache = 'unhealthy';
    health.status = 'unhealthy';
  }

  const statusCode = health.status === 'healthy' ? 200 : 503;
  res.status(statusCode).json(health);
});
Part 3
Putting It Together

Combining Patterns: A Real Example

Here's how patterns work together in a payment service:

Combined Resilience: Payment Service
  1. Timeout (10 seconds) - don't wait forever if the payment API hangs
  2. Circuit breaker - after 5 failures, fail fast for 30 seconds
  3. Retry (with idempotency key) - try up to 3 times for transient failures
  4. Fallback (queue for later) - if all else fails, save the order and retry async
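One way to sketch the composition order - breaker check, then the primary call (which carries its own timeout and retries, like `fetchWithRetry` from Pattern 2), then the fallback. The names in the usage comment are hypothetical stand-ins:

```javascript
// Compose the layers: fail fast if the breaker is open, otherwise try the
// primary call, and degrade to the fallback instead of crashing.
function resilient({ call, fallback, isOpen = () => false }) {
  return async (...args) => {
    try {
      if (isOpen()) throw new Error('circuit open'); // step 2: fail fast
      return await call(...args); // steps 1+3: timeout + retries live inside `call`
    } catch (e) {
      return fallback(...args); // step 4: degrade, don't crash
    }
  };
}

// Usage (all names hypothetical):
// const processPayment = resilient({
//   call: (order) => fetchWithRetry(chargeUrl(order)),      // Pattern 2
//   isOpen: () => paymentCircuit.state === 'OPEN',          // Pattern 3
//   fallback: (order) => queueOrder(order)                  // async retry queue
// });
```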

Pattern Combination Guide

Scenario → Recommended Patterns
  • Calling a payment API: Timeout + Retry (idempotent) + Circuit Breaker + Fallback (queue)
  • Fetching recommendations: Timeout + Fallback (cached/popular)
  • Sending emails: Timeout + Queue (async)
  • Critical database call: Timeout + Connection pool
  • Third-party API integration: Timeout + Retry + Circuit Breaker

Practice Mode: Apply What You Learned

Test your understanding with these scenarios. Think through the answer before clicking.

Practice Scenario 1
Your e-commerce app calls an inventory service before checkout. The service is internal (you control it), gets moderate traffic (5K req/day), and occasionally times out. What patterns should you add?
  a) Timeout + Retry (limited) + Fallback ("check in-store")
  b) Circuit Breaker + Bulkhead + Health Check
  c) Just timeout - it's internal, fix the service

Answer: (a). For an internal service you control with moderate traffic: a timeout is essential, 1-2 retries can handle transient issues, and a user-friendly fallback ("check availability in-store") improves UX. Circuit breakers are overkill for internal services - fix reliability at the source. If it keeps timing out, fix the service itself.
Practice Scenario 2
Your app integrates with Stripe for payments. You get 50K transactions/day. Stripe occasionally has brief outages. What's your resilience strategy?
  a) Timeout + Retry - Stripe is reliable, keep it simple
  b) Timeout + Retry (with idempotency) + Circuit Breaker + Queue fallback
  c) Circuit Breaker only - fail fast during outages

Answer: (b). At 50K transactions/day with an external API you don't control, brief outages affect thousands of users, so you need the full toolkit: timeouts prevent hanging, retries with idempotency keys handle transient failures without double-charging, a circuit breaker prevents a thundering herd during outages, and a queue fallback ensures no orders are lost. This is exactly when circuit breakers shine.
Practice Scenario 3
You're building a new feature that calls an ML model for content moderation. It's a PoC, you're the only user, and the ML service is experimental. What resilience do you add?
  a) Timeout + Retry + Circuit Breaker - protect against failures
  b) Timeout only - it's a PoC, focus on the feature
  c) No resilience - it's experimental anyway

Answer: (b). At PoC stage the question is "does this ML moderation even work?", not "what if it's down?" A simple timeout prevents infinite hangs - that's all you need. Adding circuit breakers to a one-user PoC is classic over-engineering. Add resilience patterns when you have real users who depend on uptime.
Part 4
Reference

The Resilience Checklist

Use this as a quick reference when building or reviewing systems.

Always (Every External Call)

  • Timeout configured
  • Error handling that doesn't crash the app

When Calling External APIs

  • Retries for transient failures only
  • Exponential backoff with jitter
  • Idempotency keys for non-idempotent calls

For Critical Paths

  • Fallback strategy defined
  • Graceful degradation tested

For High-Traffic Systems

  • Circuit breakers on unreliable deps
  • Bulkheads for isolation

For Production

  • Health check endpoints
  • Monitoring and alerting on failures
  • Runbook for common scenarios

Key Takeaways

  • Resilience != Scaling
  • Timeouts are non-negotiable
  • Context matters - don't apply blindly
  • Start simple, add complexity when needed

Related Posts

Failure Handling: Timeouts, Retries, and More
A deeper dive into failure patterns with more code examples
The 95% Problem: Understanding DB Connections
Connection resilience, pooling, and idle timeouts
Async Processing: Don't Make Users Wait
Dead letter queues and background processing for fallbacks
Designing for 10,000 Requests/Second
The complete scaling toolkit: caching, queuing, load balancing

Failure will happen. The question is whether you've designed for it.

Start with timeouts. Add complexity only when you have the problem. And remember: a system that fails gracefully is more valuable than one that "never fails" but catastrophically crashes when it eventually does.
