In the name of Allah, the Most Gracious, the Most Merciful
Your app calls a payment API. The API is down.
Option A: Your entire checkout flow crashes. Users see 500 errors. Revenue stops. Support tickets pile up. Your phone buzzes at 3am.
Option B: Your app retries once, then shows "Payment delayed, we'll process shortly." Order is queued. User is happy. You sleep through the night.
Same failure. Completely different outcome. The difference is resilience: designing for failure before it happens.
- Resilience = staying up when dependencies fail (different from scaling)
- Each pattern has a specific use case - don't apply blindly
- Start with timeouts - the simplest pattern that prevents cascade failures
- Add complexity only when needed - premature resilience is over-engineering
Want the full story? Keep reading.
This post covers WHEN to use resilience patterns and the thinking behind each one. For implementation details and code examples, see Failure Handling.
This post is for you if:
- Your app crashes when a third-party API is slow or down
- You've seen cascade failures bring down your whole system
- You want to understand circuit breakers, retries, and fallbacks
- You're calling external services and need to handle their failures gracefully
Resilience vs. Scaling: Know the Difference
These terms get confused constantly. They solve completely different problems:
**Scaling** asks: "Can we handle 10,000 users instead of 1,000?"
**Resilience** asks: "What happens when the payment API goes down?"
A system can be highly scalable but not resilient (handles 10K RPS but crashes on a 1-second DB hiccup), or resilient but not scalable (gracefully handles all failures but only serves 100 users). You need both.
In a small town, you might drive the same route every day without issues. In a city with millions of people and thousands of roads, there's ALWAYS something broken somewhere: a traffic light out, road construction, an accident. The city doesn't shut down because of one issue - it routes around problems. Distributed systems are the same.
The question isn't IF things will fail. It's WHEN, and what happens then.
When NOT to Add Resilience Patterns
Before diving into patterns, a critical warning: resilience patterns have costs. Adding them everywhere is over-engineering.
Skip Resilience Patterns When...
- **The dependency lives in the same process.** If InventoryService is in the same process, circuit breakers are overkill. If it fails, you have bigger problems - just let it throw.
- **You own the whole chain.** Your App → Your API → Your Database? If it's failing, fix the root cause. Don't paper over it with retries.
- **The failure is permanent.** 400 Bad Request, 401 Unauthorized, 404 Not Found - retrying these will never work. Fix the request.
- **You're at PoC/MVP stage.** Focus on "Does anyone want this?" NOT "What if the payment API is down?" Add resilience when you have users who depend on uptime.
Add Resilience Patterns When...
| Signal | Pattern to Consider |
|---|---|
| Calling external APIs you don't control | Timeouts + Retries + Circuit Breaker |
| User-facing critical paths (checkout, login) | Fallbacks + Graceful Degradation |
| High traffic to unreliable dependency | Circuit Breaker |
| Occasional network blips causing errors | Retries with Backoff |
| One slow service blocking everything | Bulkheads + Timeouts |
Let's go through each pattern, from simplest to most complex. Don't start with circuit breakers - start with timeouts, and add the other patterns only as your system and traffic justify them.
Pattern 1: Timeouts (Start Here)
The simplest and most important pattern. Every external call should have a timeout. No exceptions.
Imagine calling customer support and being put on hold. Without a "timeout" (hanging up after 10 minutes), you could be stuck forever, unable to do anything else. Your system is the same - a request without a timeout can hang forever, consuming a connection/thread that never gets released.
**Without a timeout:** API hangs. Request waits forever. Connection pool exhausted. All requests start failing. Cascade failure.
**With a 5-second timeout:** API hangs. After 5 seconds, the request fails fast. Connection released. Error handled. System stays healthy.
| Call Type | Typical Timeout | Why |
|---|---|---|
| Health check | 1-2 seconds | Quick yes/no - if it takes longer, something's wrong |
| Internal service | 2-5 seconds | Should be fast - same network, optimized |
| Database query | 5-10 seconds | If it takes longer, the query needs optimization |
| External API | 10-30 seconds | They might be slow, but not forever |
```
timeout = (expected_response_time x 2) + buffer
```
If your p95 is 200ms, a timeout of 1-2 seconds is reasonable. Don't set timeout = 60 seconds "just in case" - that defeats the purpose.
```javascript
// Without timeout - DANGEROUS
const response = await fetch('https://api.payment.com/charge');
// If the API hangs, your request hangs forever.
// Connection pool exhausted. Everything dies.
```

```javascript
// With timeout - SAFE
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 5000);

try {
  const response = await fetch('https://api.payment.com/charge', {
    signal: controller.signal
  });
  // Process response
} catch (e) {
  if (e.name === 'AbortError') {
    // Handle timeout - fail gracefully
    console.log('Request timed out after 5 seconds');
  }
} finally {
  clearTimeout(timeout);
}
```
**Use when:**
- Any external call (APIs, databases, services)
- Any network request whatsoever
- Always. No exceptions.

**Avoid when:**
- Never skip timeouts
- This is the one pattern that's always appropriate
- Seriously. Always use timeouts.
Pattern 2: Retries with Backoff
Sometimes things fail temporarily: a network blip, a service restarting, momentary overload. Retrying can help - but only for transient failures.
**Retry these** (transient - might work on retry):
- `503 Service Unavailable`
- `429 Too Many Requests` (with backoff)
- Connection timeout
- `ECONNRESET` (connection reset)
- Network errors

**Don't retry these** (permanent - will fail again; fix the request):
- `400 Bad Request`
- `401 Unauthorized`
- `403 Forbidden`
- `404 Not Found`
- Business logic failures ("insufficient funds")
When retrying, HOW you wait between retries matters as much as whether you retry.
```javascript
// Exponential backoff with jitter
const delay = Math.min(1000 * Math.pow(2, attempt), maxDelay);
const jitter = delay * 0.5 * Math.random();
await sleep(delay + jitter);
```
```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fetch(url);
    } catch (error) {
      // Don't retry permanent failures
      if (!isTransient(error)) throw error;
      // No point backing off after the final attempt
      if (attempt === maxRetries - 1) break;
      // Exponential backoff with jitter
      const delay = Math.min(1000 * Math.pow(2, attempt), 10000);
      const jitter = delay * 0.5 * Math.random();
      await sleep(delay + jitter);
    }
  }
  throw new Error(`Failed after ${maxRetries} attempts`);
}

function isTransient(error) {
  // Only retry transient errors
  return error.code === 'ECONNRESET' ||
         error.code === 'ETIMEDOUT' ||
         error.status === 503 ||
         error.status === 429;
}
```
Retrying is only safe if the operation is idempotent (can be done multiple times without side effects). Charging a credit card twice is NOT idempotent. Use idempotency keys:
```
// Dangerous: retrying might charge twice
POST /payments { amount: 100 }

// Safe: idempotency key ensures a single charge
POST /payments { amount: 100, idempotency_key: "order-123" }
```
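As a sketch, here's one way to attach an idempotency key in JavaScript. The endpoint and header name are assumptions (Stripe, for instance, uses an `Idempotency-Key` header) - check your payment provider's docs:

```javascript
// Build a POST request whose retries are safe: the same key on every
// attempt lets the server deduplicate the charge. The endpoint and
// header name below are illustrative, not any specific provider's API.
function buildChargeRequest(amount, orderId) {
  return {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // Reused verbatim on every retry of this order
      'Idempotency-Key': `order-${orderId}`
    },
    body: JSON.stringify({ amount })
  };
}

// Usage (with the retry helper from above):
// await fetchWithRetry('https://api.payment.example/charge');
// would be called with buildChargeRequest(100, '123') as options.
```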
**Use when:**
- Transient network errors
- `503`, `429` responses
- Connection timeouts
- Operation is idempotent

**Avoid when:**
- 4xx client errors
- Business logic failures
- Non-idempotent operations without idempotency keys
Pattern 3: Circuit Breakers
When a service is failing repeatedly, continuing to call it wastes resources and can make things worse. A circuit breaker "trips" to fail fast instead of waiting.
- **Closed (normal):** requests flow through
- **Open (tripped):** no requests sent - fail fast
- **Half-open (testing):** is the service back?
In your house, if there's an electrical short, the circuit breaker "trips" to cut power and prevent fire. It stays open until you manually reset it. Similarly, a software circuit breaker "trips" when a service is failing, preventing your system from wasting resources on doomed requests. After a timeout, it tests if the service is back (half-open) before fully resetting.
Configuration Guidelines
| Setting | Typical Value | Tune Based On |
|---|---|---|
| Failure threshold | 5-10 failures | Traffic volume, expected error rate |
| Reset timeout | 30-60 seconds | How long the service typically takes to recover |
| Half-open requests | 1-3 requests | How quickly to test recovery |
```javascript
class CircuitBreaker {
  constructor(options) {
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeout = options.resetTimeout || 30000;
    this.state = 'CLOSED';
    this.failures = 0;
    this.lastFailure = null;
  }

  async call(fn) {
    // If OPEN, check if we should try half-open
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is open');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    // A failure while HALF_OPEN also re-opens: failures was never
    // reset while OPEN, so it is still at or above the threshold.
    if (this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
    }
  }
}

// Usage
const paymentCircuit = new CircuitBreaker({
  failureThreshold: 5,
  resetTimeout: 30000
});

async function chargeUser(amount) {
  return paymentCircuit.call(() => paymentAPI.charge(amount));
}
```
**Use when:**
- High traffic to unreliable external services
- When failures cascade (one service down brings others down)
- Need to fail fast under heavy load

**Avoid when:**
- Low-traffic services (not enough data to trip)
- Calling your own internal services (fix them instead)
- PoC/MVP stage (over-engineering)
Pattern 4: Fallbacks
When the primary call fails, return something useful instead of an error. The key word is useful - a bad fallback can be worse than no fallback.
A typical chain degrades step by step: live data → cached data → a sensible default → an honest error message. Each level is worse than the previous but still provides value. The goal: never show an error if you can show something useful.
Good vs Bad Fallbacks
| Scenario | Good Fallback | Bad Fallback |
|---|---|---|
| Recommendations down | Show popular items | Show random items |
| User profile unavailable | Show cached profile | Show blank profile |
| Payment API down | Queue for retry, notify user | Silently skip payment |
| Inventory check fails | "Check availability in store" | Show as "In Stock" (lie) |
**Use when:**
- Non-critical features that enhance UX
- Stale data is acceptable
- Must show something (empty state is worse)

**Avoid when:**
- Fallback would be misleading/incorrect
- Financial transactions
- Users expect real-time accuracy
- Security-sensitive operations
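A minimal fallback wrapper might look like the sketch below. `getRecommendations` and `getPopularItems` are hypothetical names used only for illustration:

```javascript
// Try the primary call; on failure, return a degraded-but-honest
// alternative instead of an error.
async function withFallback(primary, fallback) {
  try {
    return await primary();
  } catch (err) {
    // Log so the failure stays visible even though users never see it
    console.warn('Primary failed, using fallback:', err.message);
    return await fallback();
  }
}

// Usage: recommendations down? Show popular items instead of an error.
// const items = await withFallback(
//   () => getRecommendations(userId),
//   () => getPopularItems()
// );
```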
Pattern 5: Bulkheads
Named after ship compartments that prevent one leak from sinking the whole vessel. Isolate components so one failure doesn't bring down everything.
**Shared pool (100 threads):** if API C is slow, all 100 threads wait on it. API A and B can't get threads. Everything dies.
**Bulkheaded pools (say 30 / 30 / 40):** if API C is slow, only its 40 threads are stuck. API A and B continue working.
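A minimal sketch of a bulkhead as a per-dependency concurrency cap (names illustrative - in production this usually comes from a library or from separate connection pools):

```javascript
// Cap concurrent calls to one dependency. If the cap is reached,
// new callers wait for a slot instead of consuming shared resources.
// (An alternative design rejects immediately instead of queueing.)
class Bulkhead {
  constructor(limit) {
    this.limit = limit;
    this.active = 0;
    this.queue = [];
  }

  async run(fn) {
    if (this.active >= this.limit) {
      // Park this caller until a running call frees a slot
      await new Promise((resolve) => this.queue.push(resolve));
    }
    this.active++;
    try {
      return await fn();
    } finally {
      this.active--;
      const next = this.queue.shift();
      if (next) next(); // wake exactly one waiter
    }
  }
}

// Usage: give the flaky API its own small pool
// const apiC = new Bulkhead(40);
// await apiC.run(() => fetch('https://api-c.example/data'));
```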
**Use when:**
- Mission-critical paths that must stay up
- Unreliable dependencies you can't control
- Multi-tenant systems (isolate tenants)
- One slow call can exhaust resources

**Avoid when:**
- Simple systems with few dependencies
- All calls are equally important
- Resource overhead isn't justified
- PoC/MVP stage
Pattern 6: Health Checks
Proactively know when services are healthy or unhealthy instead of waiting for failures.
- **Liveness** - is the process running?
- **Readiness** - can it serve traffic?
- **Deep health** - are all dependencies up?
```javascript
app.get('/health', async (req, res) => {
  const health = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    checks: {}
  };

  // Check database
  try {
    await db.query('SELECT 1');
    health.checks.database = 'healthy';
  } catch (e) {
    health.checks.database = 'unhealthy';
    health.status = 'unhealthy';
  }

  // Check Redis
  try {
    await redis.ping();
    health.checks.cache = 'healthy';
  } catch (e) {
    health.checks.cache = 'unhealthy';
    health.status = 'unhealthy';
  }

  const statusCode = health.status === 'healthy' ? 200 : 503;
  res.status(statusCode).json(health);
});
```
Combining Patterns: A Real Example
Here's how patterns work together in a payment service:
Pattern Combination Guide
| Scenario | Recommended Patterns |
|---|---|
| Calling payment API | Timeout + Retry (idempotent) + Circuit Breaker + Fallback (queue) |
| Fetching recommendations | Timeout + Fallback (cached/popular) |
| Sending emails | Timeout + Queue (async) |
| Critical database call | Timeout + Connection pool |
| Third-party API integration | Timeout + Retry + Circuit Breaker |
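As a sketch, here's the shape of the payment row from the table, composing the building blocks from the earlier patterns. `breaker` is a circuit-breaker instance, `callWithRetry` stands in for the retry helper, and `queueForLater` is a hypothetical fallback that persists the order for later processing:

```javascript
// Circuit breaker wraps the retrying call (whose underlying fetch has
// a timeout); if everything fails, fall back to queueing the payment
// and telling the user honestly.
async function resilientCharge(breaker, callWithRetry, queueForLater, order) {
  try {
    return await breaker.call(() => callWithRetry(order));
  } catch (err) {
    // Fallback: queue for later instead of losing the order
    await queueForLater(order);
    return { status: 'queued', message: "Payment delayed, we'll process shortly." };
  }
}
```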
The Resilience Checklist
Use this as a quick reference when building or reviewing systems.
Always (Every External Call)
- Timeout configured
- Error handling that doesn't crash the app
When Calling External APIs
- Retries for transient failures only
- Exponential backoff with jitter
- Idempotency keys for non-idempotent calls
For Critical Paths
- Fallback strategy defined
- Graceful degradation tested
For High-Traffic Systems
- Circuit breakers on unreliable deps
- Bulkheads for isolation
For Production
- Health check endpoints
- Monitoring and alerting on failures
- Runbook for common scenarios
Key Takeaways
- Resilience != Scaling
- Timeouts are non-negotiable
- Context matters - don't apply blindly
- Start simple, add complexity when needed
Failure will happen. The question is whether you've designed for it.
Start with timeouts. Add complexity only when you have the problem. And remember: a system that fails gracefully is more valuable than one that "never fails" but catastrophically crashes when it eventually does.