بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ
In the name of Allah, the Most Gracious, the Most Merciful
A developer inherited a "production-ready" codebase.
The API call to the payment service had: retry with exponential backoff (3 attempts), connection pooling (min 5, max 20), circuit breaker (opens after 5 failures), response caching (5 minute TTL), timeout (30 seconds), and health checks (every 10 seconds).
300 lines of "reliability patterns." The previous developer had added every best practice they'd ever read about.
Then payments started failing randomly. 2% of transactions. No pattern. No error that made sense.
"Is the retry triggering the circuit breaker?"
"Is the pool exhausted?"
"Is the cache returning stale data?"
They spent 3 days debugging. Added more logging. Increased timeouts. Decreased timeouts. Nothing worked.
$23,000 in failed payments. An angry call from the CEO. A weekend in the office.
Then a senior engineer looked at it for 10 minutes and said:
"You're debugging Layer 3. The problem is in Layer 1."
The connection was dying after 30 seconds of idle time. The database was killing it. All those retry patterns couldn't fix a dead connection they didn't know was dead.
The patterns weren't wrong. The understanding was.
Once they saw the layers, the fix took 10 minutes: add keepalive pings.
This is the map nobody gave them.
- Four layers — Network, Resource, Resilience, Application
- Layers nest — L4 contains L3 contains L2 contains L1
- Debug by layer — Ask "which layer?" before "which pattern?"
- The 95% problem is usually L1 — Random failures after idle time = keepalive issue
This post is for you if:
- You've added reliability patterns but don't know which one to debug when things break
- You've read about retry AND pooling AND circuit breaker but don't see how they connect
- You want a mental model, not more tutorials
- You've ever thought "I have too many patterns, which one is actually helping?"
The Problem: Learning in Silos
Here's how most developers learn reliability patterns: one tutorial at a time, each pattern in isolation.
Your brain: "I learned 5 different things." Reality: you learned 3 layers applied to similar problems. But nobody showed you the layers.
The Toolbox Problem
Without organization, a toolbox is just a pile:
- Hammer
- Screwdriver
- Wrench
- Drill
- Pliers

"Which one do I use for this problem?" "Maybe I'll try all of them?"

Organized by function, the choice makes itself:
- Cutting tools: Saw, scissors
- Fastening tools: Hammer, screwdriver, drill
- Gripping tools: Wrench, pliers

"It's a fastening problem → use fastening tools"
When you learn patterns without layers, you have tools without a toolbox. So you try everything and hope something works.
The question isn't "should I add retry?" The question is "which layer is my problem in?" Then you pick patterns from THAT layer.
The Building Analogy
Think of building a house. You don't build randomly — you build in layers, and each layer depends on the one below it: foundation first, then plumbing, then safety systems, then how the household is organized.
When your house has a problem, you don't randomly check the smoke detectors, then the pipes, then the foundation. You ask: what kind of problem is this?
Why This Changes Everything
| Your Problem | House Equivalent | Layer | What to Check |
|---|---|---|---|
| "Connection randomly dies" | Foundation cracked | L1 | Keepalive, TCP settings |
| "Everything is slow" | Pipes too small | L2 | Pool size, caching |
| "Service fails sometimes" | Need backup generator | L3 | Retry, circuit breaker |
| "Multiple parts fighting over same resource" | Two families sharing one kitchen | L4 | Singleton, organization |
In the hook story, the developer was debugging Layer 3 (retry, circuit breaker). But the problem was Layer 1 (connection dying from idle timeout).
L3 patterns can't fix L1 problems. They were debugging in the wrong layer.
If you have studied networking, you know the OSI model: 7 layers from physical cables up to applications. The idea is the same here: each layer has a single responsibility, layers depend on the ones below, and you debug by identifying which layer is broken.
This 4-layer reliability model is not an industry standard. It is a practical simplification designed around the question most developers actually ask: "My system has 8 reliability patterns and something just broke. Where do I look?"
The layers map roughly to:
- L1 (Network) aligns with OSI Layers 3-5 (Network, Transport, Session)
- L2 (Resource) has no direct OSI equivalent. It is an application-level concern about managing expensive shared resources.
- L3 (Resilience) maps loosely to reliability engineering practices: SRE patterns, Netflix's Hystrix model, Microsoft's cloud design patterns.
- L4 (Application) maps to classic software engineering patterns (Gang of Four, dependency injection).
The value is not in the model's theoretical purity. The value is that it gives you a first question to ask when something breaks, instead of randomly trying patterns.
Without a layered model, debugging is an unordered search: you poke at whichever pattern is most visible, tweak it, and hope.
With the layered model, it is an ordered search: identify the symptom's layer, check that layer, move on only if it's clean.
The difference is not intelligence. It is having a map. Without layers, you search the entire problem space. With layers, you search 4 zones from the bottom up and stop when you find the mismatch.
The Four Layers (Deep Dive)
Now that you have the map, let's explore each floor of the building. For each layer, we'll cover:
- The analogy — what it's like in real life
- How it works — the actual technical mechanics
- When to use it — context matters!
Pro tip: Start from the bottom (L1) and work up. That's how you build a house — and a system.
Before you build a house, you pour a concrete foundation. Without it, nothing else matters — the house will collapse.
Before you send data to a database, you establish a connection. This is your foundation. If the connection dies, all your fancy retry logic and pooling won't help.
- Connections are expensive — TCP handshake + TLS + auth = 50-200ms per connection
- The "95% problem" — Databases kill idle connections silently. Your app thinks it's alive. Next query fails.
- The fix: Keepalive — Send periodic pings ("I'm still here") to prevent idle timeout
Step 1: Check your database's idle timeout
-- PostgreSQL
SHOW idle_in_transaction_session_timeout;
SHOW idle_session_timeout; -- PostgreSQL 14+ only
-- MySQL / MariaDB
SHOW VARIABLES LIKE 'wait_timeout';
SHOW VARIABLES LIKE 'interactive_timeout';
-- If these return 30-60 seconds and you have no
-- keepalive configured, you have the 95% problem.
Step 2: Check if keepalive is enabled
# Check TCP keepalive on Linux
cat /proc/sys/net/ipv4/tcp_keepalive_time
# Default: 7200 (2 hours!) Way too long.
# Check if your app sets keepalive (Python example)
# Look for these in your connection string:
# keepalives=1&keepalives_idle=20&keepalives_interval=5
Step 3: Quick test for the 95% problem
# Connect, wait longer than DB timeout, try a query
psql -h your-db-host -U your-user your-db
SELECT 1; -- works
-- Now wait 60 seconds (or your DB's wait_timeout + 10)
SELECT 1; -- if this fails: you have the 95% problem
-- if this works: your DB timeout is longer than you think
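The same test, scripted. A minimal sketch using psycopg2 (it assumes PostgreSQL and placeholder connection details; adjust the sleep to your database's timeout plus a margin):

```python
import time
import psycopg2  # assumes PostgreSQL; adapt for your driver

conn = psycopg2.connect("host=your-db-host dbname=your-db user=your-user")
cur = conn.cursor()

cur.execute("SELECT 1")
print("fresh connection:", cur.fetchone())   # works

time.sleep(60)  # exceed the DB's idle timeout (wait_timeout + margin)

try:
    cur.execute("SELECT 1")
    print("survived idle:", cur.fetchone())  # your timeout is longer than you think
except psycopg2.OperationalError as exc:
    print("95% problem confirmed:", exc)     # server silently closed the connection
```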
| Pattern | Use WHEN | DON'T Use When |
|---|---|---|
| TCP Keepalive | Long-lived connections, managed DBs with aggressive timeouts (30-60s) | Serverless (no persistent connections), very short requests |
| TLS | Any production system, sensitive data | Local development only |
Quick L1 check: nc -zv host port verifies network connectivity. Then check whether your DB's idle_session_timeout is shorter than your pool's idle time. If the connection test passes but queries fail after idle periods, you likely need TCP keepalive enabled.
Imagine a house where every time you need water, you dig a new well. That's insane, right?
Instead, you install pipes once, and everyone shares them. That's Layer 2: manage expensive resources so you don't recreate them every time.
- Connection pooling — Keep connections open and reuse them. Borrowing from pool = ~0ms vs creating new = 50-200ms
- Caching — Store frequently-read data closer to the app. Don't hit the database for every request.
- Rate limiting — Protect expensive resources from being overwhelmed
Key insight: Each connection in the pool IS a Layer 1 connection! The pool HOLDS and manages Layer 1 connections.
The formula for minimum pool size is straightforward:

pool_size = requests_per_second x avg_query_duration
Example: 200 req/s with average query time of 50ms (0.05s):
pool_size = 200 x 0.05 = 10 connections
But add 50% headroom for traffic spikes:
pool_size = 10 x 1.5 = 15 connections
With 3 app servers sharing the database:
per_server = 15 connections
total = 3 x 15 = 45 connections
Check: Does your DB allow 45+ connections? (Plus admin headroom)
Common mistake: Setting pool to 100 "just in case." Each idle connection holds ~10MB of memory on the database side (PostgreSQL). 100 idle connections = 1GB of wasted database RAM.
For the full pool sizing guide with real benchmarks, see Connection Pooling in the DB Connections post.
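If you want the sizing math executable, here is a minimal sketch of the formula above. The function name and the 1.5 headroom factor come from this post's example, not from any standard:

```python
import math

def min_pool_size(requests_per_second: float, avg_query_seconds: float,
                  headroom: float = 1.5) -> int:
    """pool_size = RPS x avg query time, padded for traffic spikes."""
    return math.ceil(requests_per_second * avg_query_seconds * headroom)

per_server = min_pool_size(200, 0.05)   # 200 req/s x 50ms x 1.5 = 15
total = 3 * per_server                  # 3 app servers sharing one database
print(per_server, total)                # 15 45
# Check: does the DB's max_connections allow 45 plus admin headroom?
```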
| Pattern | Use WHEN | DON'T Use When |
|---|---|---|
| Connection Pooling | High throughput (>10 req/s), expensive connections | Serverless (cold starts kill pools), single-use scripts |
| Caching | Read-heavy (80%+ reads), data changes slowly | Write-heavy, real-time requirements |
| Rate Limiting | Protecting expensive resources, cost control | Internal trusted services, low traffic |
Quick L2 check: pool_size = requests_per_second x avg_query_duration. If your pool is correctly sized but still exhausting, look for slow queries holding connections too long, or check if you have multiple pools (an L4 singleton issue).
Your house has safety systems because things go wrong:
- Smoke detector = Health Check (know when there's a problem)
- Circuit breaker box = Circuit Breaker (cut power before the house burns)
- Try doorbell again = Retry (maybe they didn't hear)
- Backup generator = Fallback (graceful degradation)
- Timer on the oven = Timeout (don't wait forever)
- Retry with backoff + jitter — Don't hammer a failing service. Wait longer each time. Add randomness so 1000 retries don't all hit at once.
- Timeout — Never wait forever. Set a max wait time on every external call.
- Circuit breaker — After N failures, stop trying for a while. Let the downstream service recover.
- Fallback — When the primary path fails, have a backup (cached data, default response, graceful degradation).
- Bulkhead — Isolate failures so one bad dependency doesn't consume all your resources.
Note: For detailed implementation of L3 patterns (circuit breakers, retries, fallbacks) with code examples in Python, Java, Node.js, and Go, see Failure Handling and Building Resilient Systems.
Retry and circuit breaker can fight: 2 requests with 3 retries each = 6 failures = the circuit opens. Make sure retry checks whether the circuit is open BEFORE retrying, as in the sketch below. See the When Layers Clash section for the full breakdown.
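Here is what that ordering looks like in code. A minimal sketch, not a library API: CircuitBreaker, TransientError, and call_with_retry are illustrative names.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for your driver's transient failure exception."""

class CircuitOpenError(Exception):
    """Raised when we refuse to call because the circuit is open."""

class CircuitBreaker:
    """Toy breaker: opens after N failures, closes again after a cooldown."""

    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at >= self.reset_after:
            self.opened_at = None   # cooldown elapsed: allow traffic again
            self.failures = 0
            return False
        return True

    def record(self, success):
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()

def call_with_retry(operation, breaker, attempts=3):
    for attempt in range(attempts):
        if breaker.is_open():          # the crucial check: BEFORE each attempt
            raise CircuitOpenError("circuit open; skipping retries")
        try:
            result = operation()
            breaker.record(success=True)
            return result
        except TransientError:
            breaker.record(success=False)
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt + random.random())  # backoff + jitter
```

Note how a burst of retries now stops the moment the circuit opens, instead of contributing more failures to it.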
| Pattern | Use WHEN | DON'T Use When |
|---|---|---|
| Retry | Transient failures, idempotent operations | Non-idempotent ops (payments!), persistent failures |
| Timeout | Any external call (DB, API, file system) | CPU-bound work |
| Circuit Breaker | External deps you don't control, high traffic | Internal services (fix them!), PoC stage |
Imagine a house with 4 families living in it. Each family builds their own kitchen.
Now you have 4 kitchens, 4 refrigerators, 4 stoves... in one house. Expensive, wasteful, and they bump into each other.
Better: One shared kitchen that everyone uses. Clear rules for who uses it when. That's what Layer 4 patterns do.
| Pattern | Use WHEN | DON'T Use When |
|---|---|---|
| Singleton | Shared expensive resources (pools, clients) | Stateless operations, testing (hard to mock) |
| Factory | Creating configured instances, dependency injection | Simple object creation, one-off instances |
| Repository | Centralizing data access, testable code | Simple scripts, single-use queries |
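A minimal sketch of the L4 idea in Python — one shared pool behind a singleton accessor. DatabaseManager and create_pool are illustrative names, not a specific library:

```python
import threading

def create_pool(min_size: int, max_size: int):
    """Stand-in for your driver's pool constructor (illustrative)."""
    return {"min": min_size, "max": max_size}

class DatabaseManager:
    """Process-wide owner of the ONE connection pool."""
    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        self.pool = create_pool(min_size=5, max_size=15)

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            with cls._lock:              # double-checked locking
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

# Every module asks the manager, so no second pool can ever appear:
assert DatabaseManager.get_instance() is DatabaseManager.get_instance()
```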
How Layers Connect
So you've learned what each layer does. But here's what most tutorials miss:
The layers aren't separate boxes sitting next to each other. They nest inside each other like Russian dolls.
Why does this matter? Because when you call db.query(), you're not just using L2 (the pool). You're using L4 (singleton manager) which uses L3 (retry/timeout) which uses L2 (pool) which uses L1 (TCP connection).
Let me show you what this looks like:

L4: DatabaseManager (singleton)
  └─ L3: retry / timeout / circuit breaker
      └─ L2: connection pool
          ├─ Connection 1 → Database
          ├─ Connection 2 → Database
          └─ Connection 3 → Database
             (each pooled connection is an L1 TCP connection)
L4 contains L3 contains L2 contains L1. When debugging, start from the inside (L1) and work outward.
Tracing a Request Through All Layers
Theory is nice. Let's see it in action.
When you write db.query("SELECT * FROM users WHERE id = 123"), you think you're just running a query. But watch what actually happens behind the scenes:
This trace is from a real e-commerce system: 3 app servers behind an ALB, connecting to PostgreSQL 15 on RDS (db.r6g.large, us-east-1). Pool: HikariCP with max_size=15, min_idle=5, idle_timeout=25s. Traffic: ~400 req/s at peak. All timing numbers below are actual P50 values from production metrics.
Step 1 (L4, Application): db = DatabaseManager.get_instance() — only one manager exists for the whole app.
Step 2 (L3, Resilience): Is the circuit breaker open? Start the timeout timer. Enter the retry loop (max 3 attempts).
Step 3 (L2, Resource): connection = pool.acquire() — Idle connection? Borrow it (0ms). No idle but under max? Create new (Step 4). Pool full? Wait in queue (up to 500ms before timeout).
Step 4 (L1, Network): TCP handshake (0.5ms local / 30ms cross-AZ) + TLS negotiation (20-50ms) + authentication (5-20ms) = 25-100ms total for a new connection.
Step 5 (the query itself): Send "SELECT * FROM users WHERE id = 123" → receive {id: 123, name: "Ahmed"}. Timing: simple lookup by PK 1-5ms; join query 10-100ms; full table scan 500ms+.
Step 6 (L2, Resource): pool.release(connection) — the connection goes back to idle, ready for the next request.
Here is the same request with approximate timing for two scenarios: warm pool (connection already exists) vs. cold pool (new connection needed):

| Scenario | Acquire connection | Execute query | Total |
|---|---|---|---|
| Warm pool | ~0ms (borrow idle) | ~3ms | ~3ms |
| Cold pool | ~35ms (TCP + TLS + auth) | ~3ms | ~38ms |

This is why connection pooling matters. Without it, every request pays the 35ms tax. At 400 req/s, that is 14 seconds of cumulative connection overhead every second. Pooling drops it to near zero.
Tracing a FAILURE Through All Layers
That was the happy path. Now let's see what happens when things go wrong.
Remember the hook story? The developer spent 3 days debugging retry logic (L3) when the real problem was the connection dying (L1). Here's exactly how that happens — and why it's so confusing:
45 seconds ago: Last query finished, connection returned to pool.
30 seconds ago: Database's idle timeout (30s) triggered. The database closed the connection. Your pool doesn't know! The connection looks fine.

Step 1 (L4): Works fine — the manager exists.
Step 2 (L3): Circuit breaker is closed (no recent failures). Start timeout timer (5000ms budget). Attempt 1 of 3.
Step 3 (L2): Pool has an idle connection available. Returns it (0ms). Idle since 45 seconds ago. PROBLEM: the pool thinks the connection is good. It's not! The database killed it 15 seconds ago.
Step 4 (L1): ERROR: "Connection closed by server" after just 1ms (TCP RST packet).

This is a LAYER 1 PROBLEM — but it LOOKS like a query failure! The error message doesn't say "dead connection" — it says "query failed."

Step 5 (L3): Catch error. Is it retryable? Yes. Wait 1000ms (exponential backoff). Attempt 2 of 3.

L3 retry MASKS the L1 problem! Your logs say "Retry attempt 2" — not "dead connection from idle timeout." The real cause is invisible.

Step 6 (L2): Pool evicts the dead connection, creates a NEW one (goes to L1: TCP+TLS+Auth = 80ms). This one is FRESH and works! Total elapsed: ~1085ms for what should have been a 3ms operation.
Result: Query succeeds on retry 2. You think "L3 retry saved the day!" Reality: L1 was broken. L3 just papered over it.
In the failure scenario above, the user-facing impact depends on whether retry succeeds on attempt 2 (fresh connection) or attempt 3 (all idle connections were stale). Here is the math:

- Succeeds on attempt 2: 1ms (fail on stale connection) + 1000ms (backoff) + 80ms (new connection) + 3ms (query) ≈ 1,085ms
- Succeeds on attempt 3: 1ms + 1000ms + 1ms (second stale connection) + 2000ms (backoff) + 80ms + 3ms ≈ 3,085ms
Now think about what this does to your latency distribution:
- P50: 3ms (most requests use warm connections, no problem)
- P95: 8ms (some queries are slightly more complex, still fine)
- P99: 1,300ms (1% of requests hit stale connections after idle periods)
- P99.9: 3,600ms (when multiple stale connections are encountered)
Your SLA says P99 under 500ms. You are violating it 1% of the time, but only after quiet periods (lunch, meetings, overnight). This is why it is so hard to reproduce in testing: load tests keep connections warm.
Don't just add more L3 retry. Fix L1!
- Solution 1: Add keepalive (prevent the problem) — Pings every 20 seconds, connection never dies
- Solution 2: Add health check before use (detect the problem) — "SELECT 1" takes 1ms, catches dead connections
- Solution 3: Configure pool to match database timeout — Pool idle timeout (25s) shorter than database (30s). All three are sketched below.
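As one concrete shape for those three fixes, here is a hedged sketch using SQLAlchemy with psycopg2. The option names are real SQLAlchemy/libpq parameters; the DSN is a placeholder, and pool_recycle is an age-based approximation of an idle timeout (SQLAlchemy has no true per-connection idle limit):

```python
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://user:pass@db-host/app",  # placeholder DSN
    pool_pre_ping=True,   # Solution 2: validate the connection before each borrow
    pool_recycle=25,      # Solution 3 (approx.): replace connections older than 25s,
                          # keeping them younger than the DB's 30s idle timeout
    connect_args={        # Solution 1: libpq TCP keepalive, pinging every 20s
        "keepalives": 1,
        "keepalives_idle": 20,
        "keepalives_interval": 5,
        "keepalives_count": 3,
    },
)
```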
When Layers Clash: Cross-Layer Failures
The hardest bugs in reliable systems aren't within a single layer — they're at the boundaries between layers. Each layer works correctly in isolation, but the interaction between two layers creates a failure mode that neither anticipated.
These are the bugs that take days to find, because the symptom appears in one layer and the cause lives in another.
Clash 1: The database kills connections; the pool keeps handing them out (L1 vs L2)
A managed database (like AWS RDS) has a wait_timeout of 30 seconds. Your connection pool keeps 10 idle connections warm. Traffic is bursty: heavy at 9 AM, quiet at 2 PM.
At 2:15 PM, after 15 minutes of silence, the database has silently closed all 10 connections. But the pool's internal state still shows "10 idle, 0 active, healthy." A request comes in. Pool hands out connection #3. The application sends SELECT * FROM orders WHERE id = 42. TCP sends the packet into a dead socket.
Symptoms: "Works 95% of the time, fails randomly after quiet periods." Errors spike right after lunch, right after meetings, right after weekends. The retry succeeds immediately (because it gets a fresh connection), so nobody panics — but latency P99 is 10x higher than it should be.
Root cause: L1 (network layer) killed the connection, but L2 (pool) has no mechanism to detect this. The pool trusts its own bookkeeping over the actual wire state.
Fix: Enable pool health checks (testOnBorrow: true with SELECT 1 validation query — adds 1ms per borrow). Or set pool idleTimeout to 25 seconds (shorter than the database's 30s). Or enable TCP keepalive at 20-second intervals so the OS maintains the connection.
Clash 2: Retry backoff holds the pool hostage (L3 vs L2)
Your pool has 10 connections. Normal request: borrow connection for 50ms, return it. At 200 req/s, you need ~10 concurrent connections. Perfect sizing.
Then the database slows down. Response time goes from 50ms to 2 seconds. Your L3 retry logic kicks in: 3 attempts with exponential backoff (1s, 2s, 4s). Each retry holds a connection from the pool while it waits.
The math is devastating: Each failing request holds a connection for up to 7 seconds (attempt 1: 2s timeout + 1s backoff + attempt 2: 2s timeout + 2s backoff). With 10 connections and requests still arriving at 200/s, the pool exhausts in under 100ms. Now every request queues for a connection. Healthy requests that would succeed in 50ms are waiting 30 seconds for a pool slot.
Symptoms: Database has a brief hiccup (2 seconds). Your application is down for 2 minutes. Pool wait time goes from 0ms to 30,000ms+. The circuit breaker never trips because individual requests eventually succeed.
Root cause: L3 (retry with backoff) holds L2 (pool) connections during sleep. The retry pattern assumes connections are free; the pool assumes borrowers return quickly. Neither assumption holds during failures.
Fix: Release the connection before the retry sleep, then re-acquire on the next attempt. Or use a separate "retry budget" that limits total concurrent retries (e.g., max 2 connections can be in retry state). Or add a pool-level timeout of 500ms so waiting requests fail fast instead of queueing forever.
Clash 3: A shared circuit breaker takes everyone down (L4 vs L3)
You have a circuit breaker protecting calls to a payment service. Following the Singleton pattern (L4), you share one circuit breaker instance across all 8 application instances behind a load balancer.
Instance A sits in availability zone us-east-1a. The payment service has a network blip affecting only that AZ. Instance A's requests fail 5 times in 10 seconds. Circuit breaker trips to OPEN state.
Because the state is shared (Redis-backed singleton), all 8 instances now think the payment service is down. Instances B through H are in us-east-1b and us-east-1c — their network path to the payment service is fine. But they're all returning fallback responses because of Instance A's circuit breaker.
Symptoms: Payment processing drops 87.5% (7 of 8 instances refusing to try) even though only one instance has a network problem. Recovery takes the full circuit breaker timeout (60 seconds) even though the payment service was never actually down.
Root cause: L4 (Singleton pattern) shares L3 (circuit breaker) state globally. The circuit breaker was designed assuming all instances see the same failure conditions. But network failures are often localized.
Fix: Use per-instance circuit breaker state, not shared. Each instance should trip its own circuit independently. If you need global coordination, use a voting mechanism: circuit opens only when >50% of instances report failures within the same window.
Clash 4: The timeout that doesn't cover DNS (L3 vs L1)
Your L3 timeout is set to 5 seconds: timeout: 5000ms. Reasonable. Should be plenty for a database query.
But your database hostname is my-db.cluster-abc123.us-east-1.rds.amazonaws.com. DNS resolution usually takes 1-5ms (cached). But the DNS cache expired, and your DNS resolver is overloaded. Resolution takes 30 seconds.
Your timeout starts after the TCP connection is established, which happens after DNS resolves. So actual wait time = 30s DNS + 5s HTTP timeout = 35 seconds. Your user has been staring at a loading spinner for half a minute.
Symptoms: Random requests take 30-60 seconds even though your timeout is "only 5 seconds." Happens in bursts (when DNS cache expires). Monitoring shows timeout is working — it fires after 5 seconds of the HTTP call — but the total wall-clock time is much higher.
Root cause: L3 (timeout) doesn't cover L1 (DNS resolution). Most HTTP timeout configurations only measure the socket-level operation, not the full connection lifecycle including name resolution.
Fix: Set a total timeout that includes DNS resolution. In most languages, this means setting a connectTimeout (covers DNS + TCP + TLS) separate from a socketTimeout (covers data transfer). Or use AbortController / deadline-based timeouts that measure wall-clock time from the moment the request begins.
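One way to get a wall-clock deadline that covers DNS, connect, TLS, and the request itself is to wrap the entire call rather than the socket read. A minimal asyncio sketch; fetch_order is an illustrative placeholder for your full call path:

```python
import asyncio

async def fetch_order(order_id: int):
    """Placeholder: DNS + TCP + TLS + request + response all happen in here."""
    ...

async def fetch_order_with_deadline(order_id: int):
    # One wall-clock budget for the WHOLE operation. Unlike a socket-level
    # timeout, this clock starts before DNS resolution, so a 30-second
    # resolver stall still fails at 8 seconds total.
    return await asyncio.wait_for(fetch_order(order_id), timeout=8.0)
```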
Notice the common thread: each layer makes assumptions about the layers around it. The pool assumes connections are alive (L1). Retry assumes connections are free (L2). The singleton assumes all instances see the same world (L3). The timeout assumes DNS is instant (L1).
When you design reliability, ask: "What does this layer assume about the layers above and below it? What happens when those assumptions break?"
The Anti-Patterns
Now that you understand the layers, let's look at the six most common ways developers get it wrong. You'll probably recognize at least one of these from your own code:
What it looks like
@retry(attempts=3, backoff=exponential)
@circuit_breaker(failures=5, timeout=60)
@timeout(seconds=30)
@cache(ttl=300)
@rate_limit(requests=100, period=60)
def call_api():
with connection_pool.acquire() as conn:
return conn.execute(query)
Why it happens
- "More patterns = more reliable" thinking
- Copy-pasting from Stack Overflow without understanding
- Fear of failures without understanding WHICH failures
The symptoms
- You can't explain what each pattern is doing
- When something fails, you don't know which pattern to check
- Patterns interact in unexpected ways (retry triggers circuit breaker)
The fix
Ask: "Which LAYER is my actual problem in?" Add patterns from THAT layer only.
What it looks like
"Connections keep failing randomly. Let me increase the retry count!" 3 retries → Still failing 5 retries → Still failing 10 retries → Still failing, but slower!
Why it happens
Not understanding that patterns solve different problems. Debugging the most visible pattern, not the root cause.
The fix
STOP. Ask: "What layer is this symptom in?" Connection dies → L1. Too many connections → L2. Service sometimes fails → L3.
What it looks like
"Let's add circuit breaker for reliability!" *Adds circuit breaker* "Let's add retry with backoff!" *Adds retry* "Let's add fallback!" *Adds fallback* Meanwhile: No connection pooling. No keepalive. Every request opens a new connection. Connections die randomly.
Why it happens
L3 patterns are "trendy" (everyone talks about circuit breakers). L1/L2 patterns are "boring" (just TCP and pooling).
The fix
Build layers IN ORDER: L1 → L2 → L3 → L4. Don't add L3 until L1 and L2 are solid.
What it looks like
Every service gets:
- Its own connection pool (10 pools!)
- Its own circuit breaker
- Its own retry config
- Its own timeout

Result: 10 x 20 connections = 200 total connections. Database max_connections: 100.
Why it happens
Each team adds resilience patterns independently. Nobody looks at the total. Each pool is correctly sized in isolation, but combined they exceed the database limit.
The symptoms
- "Too many connections" errors under moderate load
- Service A works when tested alone, fails when Service B is also running
- Connection pool exhaustion during deployments (old + new instances both holding connections)
The fix
Budget connections globally. If your database allows 100 connections, divide them: 30 for Service A, 20 for Service B, etc. Leave 20% headroom for admin connections and spikes. See The 95% Problem for pool sizing math.
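The budget is simple arithmetic, but writing it down is what catches the overcommit. A minimal sketch with illustrative service names and numbers:

```python
DB_MAX_CONNECTIONS = 100                        # your database's hard limit
ADMIN_HEADROOM = int(DB_MAX_CONNECTIONS * 0.2)  # keep 20% free for admin/spikes

# Illustrative per-service budgets; replace with your real services
budgets = {"service-a": 30, "service-b": 20, "workers": 15, "analytics": 15}

committed = sum(budgets.values())
assert committed + ADMIN_HEADROOM <= DB_MAX_CONNECTIONS, (
    f"Over budget: {committed} committed + {ADMIN_HEADROOM} headroom "
    f"exceeds max_connections={DB_MAX_CONNECTIONS}"
)
print(f"{committed}/{DB_MAX_CONNECTIONS} committed, {ADMIN_HEADROOM} reserved")
```

Run this check in CI so a team bumping its pool size gets told immediately, not during the next incident.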
What it looks like
def execute_with_retry(query):
    conn = pool.acquire()  # Borrow connection ONCE, up front
    try:
        for attempt in range(3):
            try:
                return conn.execute(query)
            except TransientError:
                sleep(2 ** attempt)  # 1s, 2s, 4s... WHILE HOLDING conn!
    finally:
        pool.release(conn)  # only released after ALL retries finish
Why it's devastating
Each retry attempt sleeps for 1, 2, then 4 seconds. During that sleep, the connection is borrowed from the pool and unavailable to anyone else. One failing request holds a connection for up to 7 seconds. At 200 req/s with a pool of 10, the pool exhausts in under 100ms once a few requests start retrying.
The cascade: database has a brief 2-second hiccup → 10 requests start retrying → all 10 pool connections are held during retry sleep → new requests queue for connections → queue grows for 7 seconds → all queued requests timeout → application is down for 30+ seconds from a 2-second database hiccup.
The symptoms
- Brief downstream slowdowns cause disproportionately long application outages
- Pool wait time jumps from 0ms to 30,000ms+ during incidents
- Monitoring shows "pool exhausted" but individual retries eventually succeed
The fix
Release the connection BEFORE the retry sleep. Re-acquire on the next attempt. The retry pattern should borrow a connection, try the operation, return the connection, THEN sleep. Each attempt gets a fresh connection from the pool. Now 10 retrying requests hold at most 10 connections for 50ms each, not 7 seconds each.
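Here is the same function with the borrow/sleep order fixed — a minimal sketch using the same illustrative pool and TransientError names as the anti-pattern snippet above:

```python
from time import sleep

def execute_with_retry(query, attempts=3):
    # pool and TransientError are the illustrative names used above
    for attempt in range(attempts):
        conn = pool.acquire()            # borrow fresh for each attempt
        try:
            return conn.execute(query)
        except TransientError:
            if attempt == attempts - 1:
                raise                    # out of attempts: propagate
        finally:
            pool.release(conn)           # returned BEFORE any sleeping
        sleep(2 ** attempt)              # backoff while holding NO connection
```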
What it looks like
Symptom: "Queries are slow (500ms instead of 5ms)" Week 1: "Must be the query." Added indexes. No change. Week 2: "Must be the database." Upgraded to bigger instance. No change. Week 3: "Must be the connection pool." Increased pool size. WORSE! Actual cause: SSL certificate verification was re-negotiating on every connection (L1) because keepalive was off.
Why it happens
Performance problems surface at the application layer but originate at lower layers. The symptom (slow query) and the cause (L1 TLS renegotiation) are separated by 3 layers of abstraction.
The fix
Always start diagnosis from L1 and work up. Check connection timing first (SELECT 1 round-trip), then pool metrics (wait time, active/idle count), then resilience behavior (retry count, circuit breaker state), then application logic. The failure trace diagram above shows exactly this process.
Decision Framework
You've learned the layers. You've seen the anti-patterns. Now here's the practical part: when something breaks, how do you know which layer to check?
Use this cheat sheet. Match your symptom to the layer:
"My System Is Breaking" — Which Layer?
Diagnostic Commands: Layer by Layer
The decision framework tells you which layer to investigate. This section tells you how to investigate it. For each layer, you get: the commands to run, the log patterns to search for, and the numbers that tell you whether something is healthy or broken.
Think of this as the blood test for your system. You don't need to memorize every command — but you need to know they exist so you can reach for them when 3 AM comes calling.
When to check L1: Random failures after idle periods, slow first requests, "connection reset" errors, or TLS handshake failures.
1. Check if the database is killing idle connections
-- PostgreSQL: Check idle timeout settings
SHOW idle_in_transaction_session_timeout;
SHOW idle_session_timeout; -- PostgreSQL 14+ only
SHOW tcp_keepalives_idle;
SHOW tcp_keepalives_interval;
SHOW tcp_keepalives_count;
-- MySQL / MariaDB: Check wait timeout
SHOW VARIABLES LIKE 'wait_timeout';
SHOW VARIABLES LIKE 'interactive_timeout';
-- What to look for (idle_session_timeout is PG 14+):
idle_session_timeout = 30s ← If this is 30-60s and you
have no keepalive, you WILL
have the 95% problem
2. Measure raw connection time (is L1 itself slow?)
# Measure TCP + TLS handshake time to your database
time openssl s_client -connect your-db-host:5432 -starttls postgres
# Or measure just TCP round-trip
time nc -zv your-db-host 5432
# Healthy numbers (same region):
TCP connect: 0.5-2ms (same AZ)
TCP connect: 5-15ms (cross AZ)
TCP connect: 30-80ms (cross region)
TLS handshake: 15-50ms (first time)
TLS resume: 5-15ms (session reuse)
# If TCP connect > 50ms same-region: check network path
# If TLS > 100ms: check certificate chain, cipher config
3. Check keepalive settings at the OS level
# Linux: Check kernel TCP keepalive defaults
cat /proc/sys/net/ipv4/tcp_keepalive_time
7200 ← 2 hours! Way too long for most databases
cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75 ← 75 seconds between probes
cat /proc/sys/net/ipv4/tcp_keepalive_probes
9 ← 9 probes before declaring dead
# Total time to detect dead connection:
# 7200 + (75 x 9) = 7875 seconds = 2+ hours!
# Your DB kills connections in 30 seconds.
# You won't know for 2 hours. That's the gap.
4. Log patterns that point to L1
# Search your application logs for these patterns:
grep -E "connection reset|broken pipe|EOF|closed" app.log
grep -E "ECONNREFUSED|ECONNRESET|ETIMEDOUT" app.log
grep -E "SSL.*handshake|certificate.*expired" app.log
# PostgreSQL logs:
grep "unexpected EOF on client connection" postgresql.log
grep "could not receive data from client" postgresql.log
# If you see these AFTER idle periods (not under load),
# it's almost certainly an L1 keepalive problem.
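If you can't change the kernel-wide defaults above, most stacks let you override keepalive per socket. A minimal Python sketch; the TCP_KEEP* constants are Linux-specific and absent on some platforms:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)  # enable keepalive

# Linux-specific tuning (constants don't exist on every platform):
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 20)   # first probe after 20s idle
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5)   # then probe every 5s
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # 3 misses = connection dead

# Dead-connection detection drops from 7875s (kernel defaults)
# to 20 + (5 x 3) = 35 seconds.
```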
When to check L2: Slow response times under load, "too many connections" errors, requests queuing, or gradual performance degradation.
1. Check current connection count on the database
-- PostgreSQL: How many connections exist right now?
SELECT count(*) as total,
count(*) FILTER (WHERE state = 'active') as active,
count(*) FILTER (WHERE state = 'idle') as idle,
count(*) FILTER (WHERE state = 'idle in transaction') as idle_in_txn
FROM pg_stat_activity
WHERE datname = 'your_database';
-- What's healthy:
total: 15 active: 3 idle: 10 idle_in_txn: 2
-- What's broken:
total: 98 active: 45 idle: 5 idle_in_txn: 48
-- 48 idle-in-transaction = connections held by abandoned
-- transactions. App isn't returning them to the pool.
-- MySQL equivalent:
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';
SHOW VARIABLES LIKE 'max_connections';
2. Check pool health (framework-specific)
# HikariCP (Java) - expose via JMX or /actuator/metrics
hikaricp.connections.active: 3
hikaricp.connections.idle: 7
hikaricp.connections.pending: 0 ← If > 0, pool is full!
hikaricp.connections.timeout: 0 ← If > 0, requests are
waiting too long for connections
# SQLAlchemy (Python) - check pool status
pool.status():
Pool size: 10 Connections in pool: 7 Current overflow: 0
Current checked out: 3
# Key metric: pool wait time
# Healthy: 0-1ms (connection immediately available)
# Warning: 10-50ms (pool is often full, consider sizing up)
# Broken: 500ms+ (pool exhausted, requests queuing)
3. Find connection leaks
-- PostgreSQL: Find connections held too long
SELECT pid, usename, application_name,
state, state_change,
now() - state_change as duration,
left(query, 80) as query
FROM pg_stat_activity
WHERE state != 'idle'
AND now() - state_change > interval '30 seconds'
ORDER BY duration DESC;
-- What to look for:
state: "idle in transaction" duration: "00:05:23"
-- This connection has been holding a transaction open
-- for 5 minutes. It's blocking other queries AND
-- occupying a pool slot. This is a leak.
4. Log patterns that point to L2
# Application logs:
grep -E "pool exhausted|connection timeout|too many" app.log
grep -E "max_connections|FATAL.*connection" app.log
grep -E "acquire timeout|pool wait" app.log
# If you see "pool exhausted" during traffic spikes:
# → Pool is undersized for peak traffic
# If you see "too many connections" always:
# → Multiple pools exist (L4 singleton issue)
When to check L3: Retries spiking, circuit breaker tripping unexpectedly, timeouts happening on operations that should be fast, or cascading failures across services.
1. Check retry behavior
# Search for retry activity in logs
grep -c "retry attempt" app.log
847 ← 847 retries in this log file
# Break down by attempt number
grep "retry attempt" app.log | grep -c "attempt 1"
423
grep "retry attempt" app.log | grep -c "attempt 2"
312
grep "retry attempt" app.log | grep -c "attempt 3"
112
# Interpretation:
# 423 first retries, 312 needed a second → 74% fail twice
# That's not transient! If > 50% need retry 2+,
# the underlying cause is persistent (check L1 or L2).
# Healthy retry pattern:
# 90%+ succeed on attempt 1 (no retry needed)
# Of the 10% that retry, 80%+ succeed on attempt 2
# Very few reach attempt 3
2. Check circuit breaker state
# If using resilience4j (Java):
/actuator/circuitbreakers
{
"paymentService": {
"state": "HALF_OPEN", ← Recovering
"failureRate": "62.5%", ← Above threshold
"slowCallRate": "15.0%",
"numberOfFailedCalls": 5,
"numberOfSlowCalls": 3
}
}
# Key questions:
# Is the circuit tripping too often? Lower the threshold.
# Is it never tripping? Raise the threshold or check
# if it's even wired correctly.
# Is it stuck OPEN? The downstream may have recovered
# but half-open probes are failing (different failure).
3. Check timeout budget
# The #1 timeout mistake: not accounting for all layers
# Here's how to audit your actual timeout stack:
# Your code says: timeout = 5000ms
# But actual budget:
DNS resolution: 0-30,000ms (uncovered!)
TCP connect: 0-5,000ms (connectTimeout)
TLS handshake: 0-5,000ms (part of connectTimeout?)
Pool wait: 0-500ms (pool acquireTimeout)
Query execution: 0-5,000ms (socketTimeout)
Retry backoff: 0-7,000ms (1s + 2s + 4s)
---
Worst case total: 47,500ms (user sees 47s spinner!)
# Fix: Set a TOTAL deadline that covers all layers:
AbortController timeout = 8000ms (covers everything)
4. Log patterns that point to L3
# Retry storms (L3 making things worse):
grep -E "retry|backoff|attempt [2-9]" app.log | wc -l
# Circuit breaker activity:
grep -E "circuit.*open|circuit.*half|circuit.*closed" app.log
# Timeout violations:
grep -E "timeout|deadline exceeded|context canceled" app.log
# The smoking gun for "L3 masking L1/L2 problems":
# Retry count is HIGH but success rate is also HIGH.
# Meaning: retries "work" but shouldn't be needed.
# Fix the root cause (L1/L2), don't celebrate L3.
When to check L4: "Too many connections" errors that keep coming back, different parts of the app using different pool configurations, or reliability patterns that conflict with each other.
1. Count how many pools exist
-- PostgreSQL: Group connections by application_name
SELECT application_name, count(*) as connections
FROM pg_stat_activity
WHERE datname = 'your_database'
GROUP BY application_name
ORDER BY connections DESC;
-- Healthy (one pool per service):
application_name | connections
------------------+------------
api-server | 12
background-worker | 5
-- Broken (multiple pools created accidentally):
application_name | connections
------------------+------------
api-server | 45
api-server | 38
api-server | 42
-- Three separate pools! Singleton is broken.
-- 125 connections from one service = crash incoming.
2. Verify singleton behavior
# Add this to your app startup (temporary debug):
# Python example
print(f"Pool instance: {id(pool)}")  # id() is a builtin; no import needed
# Run your app. If you see DIFFERENT IDs from
# different request handlers, the singleton is broken.
# Java example (check via JMX)
HikariPool-1.active: 5
HikariPool-2.active: 3 ← Two pools! Shouldn't exist.
# Common causes of broken singletons:
# - Each module imports and creates its own pool
# - Dependency injection creates new instance per scope
# - Hot reload in dev recreates pools without closing old
3. Check for configuration drift
# Different services connecting to the same DB
# but with different pool/timeout/retry configs
# is a common L4 organizational failure.
# Quick audit: search codebase for pool creation
grep -rn "create_pool\|createPool\|HikariConfig\|Pool(" src/
# If you find pool creation in more than ONE place,
# you likely have a singleton problem.
# Also check: do all services share the same config?
grep -rn "max_connections\|maxPoolSize\|pool_size" src/
# Different values in different files = configuration drift
When something breaks, run diagnostics bottom-up: L1 first, then L2, then L3, then L4. Most production issues are L1 (connection) or L2 (pool) problems wearing L3 (retry/timeout) disguises.
The rule of thumb: if retries are "working" (succeeding on attempt 2+), you don't have an L3 problem — you have an L1/L2 problem that L3 is masking. Fix the root cause. Don't celebrate the bandage.
It is 3 AM. Your pager went off. Here is the exact sequence to follow: run the layer diagnostics above, bottom-up (L1, then L2, then L3, then L4). Each step takes under a minute. Stop as soon as you find the problem.
This checklist has saved hours of debugging in real incidents. The key is going in order. Most engineers jump to L3 (retry config) or L4 (application logic) because that is the code they wrote. But 80% of the time, the problem is in L1 or L2 — the infrastructure they didn't write.
What To Do Monday
Draw this for your system:
L1: [ ] TCP  [ ] TLS  [ ] Keepalive
L2: [ ] Pooling  [ ] Caching
L3: [ ] Retry  [ ] Timeout  [ ] Circuit
L4: [ ] Singleton  [ ] Factory
Run this SQL:
SHOW idle_session_timeout; -- PG 14+
-- (or wait_timeout for MySQL)
-- If 30-60s and no keepalive,
-- you WILL have the 95% problem
For every pattern, answer:
1. Which layer?
2. What problem?
3. Have I seen this problem?
Can't answer? Remove it.
Where To Go Deep
This post is the map. Here are the detailed guides for each territory:
The Reliability Layers Cheat Sheet
Pin this to your desk. When something breaks, scan the "What Breaks" column. Find your symptom. Look left for the layer. Look right for the fix.
L1 Network
- Purpose: Establish and maintain connections
- Patterns: TCP, TLS, Keepalive, DNS
- What Breaks: Idle timeout kills connections silently; TLS renegotiation; DNS cache expiry
- How to Diagnose: SHOW idle_session_timeout; (PG 14+) + nc -zv host port
- Healthy Numbers: TCP <2ms (same AZ), TLS <50ms, keepalive < DB timeout
- Fix: Enable keepalive (20s interval), set pool idle < DB idle timeout
L2 Resource
- Purpose: Manage expensive shared resources
- Patterns: Connection Pooling, Caching, Rate Limiting
- What Breaks: Pool exhaustion under load; connection leaks; cache stampedes; multiple pools (no singleton)
- How to Diagnose: SELECT count(*) FROM pg_stat_activity + pool metrics
- Healthy Numbers: Pool wait <1ms, idle-in-txn = 0, total < 60% max
- Fix: Size pool = RPS x avg_query_time x 1.5; add health checks; fix leaks
L3 Resilience
- Purpose: Survive failures gracefully
- Patterns: Retry, Timeout, Circuit Breaker, Fallback, Bulkhead
- What Breaks: Retry storms; retries triggering circuit breaker; timeouts not covering DNS; holding pool connections during backoff sleep
- How to Diagnose: Retry count + success rate; circuit breaker state; timeout budget audit
- Healthy Numbers: <5% requests need retry; circuit trips <1x/day; total timeout < 8s
- Fix: Release connections before retry sleep; set total deadline; use per-instance circuit breakers
L4 Application Organization
- Purpose: Organize code so components don't fight over resources
- Patterns: Singleton, Factory, Repository, Dependency Injection
- What Breaks: Multiple pools created (no singleton); configuration drift between services; global circuit breaker state from shared singleton
- How to Diagnose: SELECT application_name, count(*) FROM pg_stat_activity GROUP BY 1
- Healthy Numbers: 1 pool per service; 1 config source; each instance owns its own L3 state
- Fix: Centralize pool creation in one factory; use DI; budget connections globally across all services
When in doubt, diagnose bottom-up: L1 → L2 → L3 → L4. Most production issues are L1 or L2 problems wearing L3 disguises.
Two questions come up constantly:
- "Why do connections fail after idle periods?" Run SHOW idle_session_timeout; (PostgreSQL 14+) or SHOW VARIABLES LIKE 'wait_timeout'; (MySQL). If the timeout is shorter than your idle period, enable TCP keepalive or reduce pool idle time.
- "Why does my pool keep exhausting?" Recheck pool_size = requests_per_second x avg_query_duration. If the math is right, look for slow queries holding connections too long, or multiple pools being created (L4 singleton issue). Quick fix: increase pool size temporarily, but find the root cause.
Tradeoffs: What Each Layer Costs
Every pattern has a cost. Here's what you're trading when you add each layer's patterns:
| Layer | Pattern | Adding It Gives You | Adding It Costs You |
|---|---|---|---|
| L1 | TCP Keepalive | Prevents idle connection death | Slight network overhead, needs tuning per environment |
| L2 | Connection Pooling | Fast connection reuse, resource efficiency | Pool sizing complexity, potential for pool exhaustion bugs |
| L2 | Caching | Faster reads, reduced DB load | Cache invalidation complexity, stale data risk |
| L3 | Retry + Backoff | Handles transient failures automatically | Can cause retry storms, harder to debug, latency spikes |
| L3 | Circuit Breaker | Prevents cascade failures, fast-fail | Extra state to manage, threshold tuning, false positives |
| L4 | Singleton | One shared resource, no duplicates | Global state, harder to test, hidden dependencies |
Rule of thumb: Add patterns only when you have the problem they solve. A simple system with no L3 patterns beats a complex system with misconfigured circuit breakers.
- Four Layers: Network → Resource → Resilience → Application
- Each layer = different question: L1: "How do I connect?" L2: "How do I manage resources?" L3: "How do I survive failures?" L4: "How do I organize?"
- Layers nest: L4 contains L3 contains L2 contains L1
- Debug by layer: Ask "which layer?" before "which pattern?"
- Build in order: L1 → L2 → L3 → L4. Don't skip the foundation.
- The 95% problem is usually L1: Random failures after idle time = keepalive issue
- Less is more: 3 patterns you understand > 10 patterns you don't
والله أعلم
And Allah knows best
وصلى الله وسلم وبارك على سيدنا محمد وعلى آله
And may Allah's peace and blessings be upon our master Muhammad and his family