System Design

The 95% Problem: Understanding DB Connections

Your system works 95% of the time. But randomly, once every few hours, a request fails with "connection closed". The code hasn't changed. The database is running. What's going on?

Bahgat Bahgat Ahmed
· January 2025 · 18 min read

بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

In the name of Allah, the Most Gracious, the Most Merciful

Your system works 95% of the time. But randomly, once every few hours, a request fails with a cryptic error: connection was closed in the middle of operation.

The code hasn't changed. The database is running. Everything looks fine in the logs... until it doesn't.

If you've been here, you know the frustration. If you haven't, you will be.

This post explains what's actually happening, why it's so hard to debug, and how to fix it.

Quick Summary
  • Connections are expensive — creating one takes 50-200ms and several network round trips
  • Connection pools solve this — reuse connections instead of creating new ones
  • Idle timeouts kill connections silently — your database closes connections your app thinks are alive

Want the full story? Keep reading.

This post is for you if:

  • You're getting random "connection closed" errors that you can't reproduce
  • Your app works 95% of the time but fails mysteriously
  • You're using a managed database (Supabase, RDS, PlanetScale) and hitting connection limits

The Big Picture: Where Do Connections Fit?

Before diving into the details, let's zoom out. When you click a button on a website, here's what happens:

The Software Stack

  • User action: you click "Load Data"
  • Frontend: the browser sends an HTTP request
  • Backend server: Python/Node/Go processes the request
  • Database connection: THIS IS WHERE THINGS BREAK
  • Database: PostgreSQL/MySQL returns the data

The database connection is the bridge between your application code and your data. When this bridge breaks, everything falls apart.

New to this? What are TCP, TLS, and connections?

If terms like TCP and TLS are new to you, here's the quick version:

TCP (Transmission Control Protocol)

The "reliable pipe" between two computers. Like a phone call—you dial, connect, talk, hang up. TCP guarantees messages arrive in order and nothing gets lost.

TLS (Transport Layer Security)

Encryption on top of TCP. Like speaking in a secret code so nobody can eavesdrop on your phone call. The "S" in HTTPS.

Database Connection

A TCP pipe (usually encrypted with TLS) where you've also logged in with a username/password. It's an authenticated, secure channel to send SQL queries and get results back.

The key insight: Creating this connection takes 50-200ms (TCP handshake + TLS + authentication). That's why we reuse connections instead of creating new ones for every query.

The database connection is often invisible in tutorials. You write db.query("SELECT * FROM users") and magically get data back. But that magic hides a complex dance of network protocols, authentication, and resource management.

What IS a Database Connection?

Think of a database connection like a phone call:

  • Dial the number — Your app reaches out to the database server over the network
  • Verify identity — "Hi, I'm the backend app, password is xyz123"
  • Have the conversation — Send queries, receive results
  • Hang up — Close the connection when done

Simple, right? But here's where it gets interesting...

Why Connections Are Expensive

Opening a new database connection isn't like opening a file. It's a multi-step handshake that takes real time:

Connection Lifecycle: The Hidden Cost

  1. TCP SYN (~10ms)
  2. TCP SYN-ACK (~10ms)
  3. TCP ACK (~10ms)
  4. TLS handshake (~30-50ms)
  5. PostgreSQL authentication (~20-30ms)

Connection ready. Total: 50-200ms, depending on network latency.

Every new connection costs 50-200ms. At 1,000 requests/second, that's 50-200 seconds of waiting per second just for connection overhead.

The Math

If your app serves 1,000 requests/second and each opens a new connection (150ms), you'd need 150,000ms of connection time per second. That's 150 seconds of work crammed into 1 second. Impossible without connection reuse.
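
If you want to see this cost on your own setup, here is a rough, minimal benchmark sketch using asyncpg (the DSN is a placeholder). It compares opening a fresh connection per query with reusing connections from a small pool, which is exactly the idea the next section formalizes.

    import asyncio
    import time
    import asyncpg

    DSN = "postgresql://app:secret@localhost/appdb"   # placeholder DSN

    async def main():
        # 20 queries, each paying the full TCP + TLS + auth cost
        t0 = time.perf_counter()
        for _ in range(20):
            conn = await asyncpg.connect(DSN)
            await conn.fetchval("SELECT 1")
            await conn.close()
        print("connect per query:", time.perf_counter() - t0, "seconds")

        # 20 queries reusing already-open connections from a pool
        pool = await asyncpg.create_pool(DSN, min_size=5, max_size=5)
        t0 = time.perf_counter()
        for _ in range(20):
            async with pool.acquire() as conn:
                await conn.fetchval("SELECT 1")
        print("pooled:", time.perf_counter() - t0, "seconds")
        await pool.close()

    asyncio.run(main())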

Connection Pooling: The Solution

Instead of dialing a new phone call for every question, what if you had a team of operators with phones already connected, ready to relay your messages?

That's connection pooling. Your app maintains a "pool" of pre-opened connections that can be borrowed and returned:

Connection Pooling: Pre-Opened Connections Ready to Use

  • The application's requests borrow connections from the pool and return them when done
  • The pool holds connections #1-#5: some active (in use), some idle (waiting)
  • Pool settings: min_size: 5, max_size: 20
  • Borrowed connections send queries to the database

Connections are pre-opened and reused. Borrowing from the pool takes ~0.1ms instead of 50-200ms for a new connection.

Key concepts:

  • min_size — Minimum connections to keep open (e.g., 5). Never go below this.
  • max_size — Maximum connections allowed (e.g., 20). If all are busy, requests wait.
  • Idle connections — Open but not currently executing a query. Ready to be borrowed.
  • Active connections — Currently executing a query.
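
As a concrete, minimal sketch of those settings, here is what a pool with min_size=5 and max_size=20 looks like with asyncpg; the DSN is a placeholder.

    import asyncio
    import asyncpg

    async def main():
        pool = await asyncpg.create_pool(
            "postgresql://app:secret@db.example.com/appdb",  # placeholder DSN
            min_size=5,    # keep at least 5 connections open and ready
            max_size=20,   # hard cap; if all 20 are busy, acquire() waits
        )
        async with pool.acquire() as conn:   # borrow (~0.1ms), query, auto-return
            print(await conn.fetchval("SELECT now()"))
        await pool.close()

    asyncio.run(main())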

Connection pooling solves the performance problem. But it introduces a new one...

Part 2
The Problem

The Hidden Killer: Idle Timeouts

An idle timeout is a safety mechanism: if a connection hasn't sent any data for X seconds, the database assumes it's dead and closes it. Makes sense—but it causes a sneaky problem.

Here's the "aha" moment of this post. The scenario that causes 95% of connection errors:

  1. Your app borrows a connection from the pool
  2. Your app starts some long work (like calling an LLM API that takes 30+ seconds)
  3. The connection sits idle during this time — it's borrowed, but not actively querying
  4. The database server thinks: "This connection has been silent too long. Is anyone there?"
  5. The database closes the connection
  6. Your app finishes the LLM call and tries to use the connection — BOOM

The Idle Timeout Timeline: How 95% Becomes 5% Failures

  • t = 0s: Connection acquired. The app borrows a connection from the pool. Everything is fine.
  • t = 0-30s: Long operation (LLM call). The app calls an external API. The connection sits idle; no queries are sent.
  • t = 30s: Database timeout triggers. idle_in_transaction_session_timeout = 30s, so the DB closes the connection silently.
  • t = 35s: App tries to query: ERROR! ConnectionDoesNotExistError: connection was closed in the middle of operation

The database doesn't tell your app it closed the connection. Your app still thinks the connection is open until it tries to use it.
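
If you want to see the failure on demand instead of waiting for it at 3am, here is a minimal reproduction sketch with asyncpg, assuming a Postgres instance (placeholder DSN) where idle_in_transaction_session_timeout is set to 30s.

    import asyncio
    import asyncpg

    async def main():
        conn = await asyncpg.connect("postgresql://app:secret@localhost/appdb")  # placeholder DSN
        tr = conn.transaction()
        await tr.start()                 # t = 0s: session is now "idle in transaction"
        await asyncio.sleep(35)          # stand-in for a 30s+ LLM call; no queries sent
        # At t = 30s the server silently terminated the session. The app only
        # finds out now, typically as "connection was closed in the middle of operation".
        await conn.fetchval("SELECT 1")  # raises ConnectionDoesNotExistError (or similar)

    asyncio.run(main())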

Why 95% Works

Most requests complete in under 30 seconds, so the timeout never triggers. Only the unlucky requests that hit a slow external service at the wrong time get killed. That's why the failure is random and maddening to debug.

Why 95% Success is Harder to Debug Than 0%

  • 0% (everything fails): easy to debug. Something is fundamentally broken. Check credentials, network, firewall.
  • 100% (everything works): nothing to debug. Ship it! Deploy with confidence.
  • 95% (random failures): the worst. The code is correct, the DB is running, but something invisible is interfering. What's going on?!

Clues That Point to Idle Timeout Issues

  • Errors mention "connection closed"
  • Failures are random, not consistent
  • Failures correlate with slow operations
  • Failures happen after a long gap of inactivity

So how do we fix this? There are three defensive tools every production system should have:

The Three Lines of Defense

  • Keepalive pings: "Hey DB, I'm still here!" Prevents idle disconnection.
  • Health checks: test connections before use. Catch dead ones early.
  • Retry logic: if it fails, try again. Graceful error recovery.

But to use these tools effectively, you first need to understand where the settings live...

The Two Sides: Database vs Application

Connection settings live in two places, and they need to work together:

Two Sides of Configuration

Application (your code controls this):
  • Pool size (min/max)
  • Keepalive pings
  • Health checks
  • Retry logic

Database (the DBA or cloud console controls this):
  • Idle timeout
  • Max connections
  • Statement timeout
  • TCP keepalives

Both sides have settings that affect connections. They need to work together.

Database-Side Settings

These are configured by whoever manages the database (DBA, or you in the cloud console):

Setting: what it does (typical default)

  • idle_in_transaction_session_timeout: kills connections sitting idle inside a transaction (0 / disabled, or 30s-5min)
  • idle_session_timeout: kills connections idle outside transactions (0 / disabled, or 10-30min)
  • statement_timeout: kills queries that run too long (0 / disabled, or 30s-5min)
  • max_connections: total connections the DB will accept (100-500)

Where to find these

AWS RDS: Parameter Groups in the console. Supabase: Dashboard → Database Settings. Self-hosted: Edit postgresql.conf.
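
You can also ask the server directly. A small sketch (asyncpg, placeholder DSN) that reads the relevant values from pg_settings:

    import asyncio
    import asyncpg

    async def show_timeouts(dsn: str) -> None:
        conn = await asyncpg.connect(dsn)
        try:
            rows = await conn.fetch(
                """
                SELECT name, setting, unit
                FROM pg_settings
                WHERE name IN ('idle_in_transaction_session_timeout',
                               'idle_session_timeout',          -- PostgreSQL 14+
                               'statement_timeout',
                               'max_connections')
                """
            )
            for r in rows:
                print(f"{r['name']} = {r['setting']} {r['unit'] or ''}")
        finally:
            await conn.close()

    asyncio.run(show_timeouts("postgresql://app:secret@localhost/appdb"))  # placeholder DSN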

Application-Side Settings

These live in YOUR code, in the connection pool configuration:

Setting: what it does (typical default)

  • min_pool_size: connections always kept open and ready (1-5)
  • max_pool_size: maximum connections the pool will create (10-20)
  • connection_timeout: how long to wait when opening a new connection (30s)
  • command_timeout: how long to wait for a query to complete (30s or no limit)
  • keepalive_interval: how often to ping idle connections (often disabled by default)
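
The exact names vary by library. As one hedged example of how these settings map onto a popular pool, here is a SQLAlchemy + psycopg2 engine (placeholder DSN); check your own library's docs for the equivalents.

    from sqlalchemy import create_engine

    engine = create_engine(
        "postgresql+psycopg2://app:secret@db.example.com/appdb",  # placeholder DSN
        pool_size=5,        # roughly min_pool_size: connections kept open
        max_overflow=15,    # pool_size + max_overflow is the effective max (20)
        pool_timeout=30,    # like connection_timeout: how long a checkout waits
        pool_recycle=1800,  # proactively replace connections older than 30 min
        connect_args={
            "connect_timeout": 10,                    # libpq connect timeout
            "options": "-c statement_timeout=30000",  # server-side 30s query limit
        },
    )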

How They Interact

The problem happens when these settings don't align. Here's the critical scenario:

When Settings Don't Align

No keepalive configured:
  • 0s: connection acquired
  • 0-30s: app doing long work...
  • 30s: DB closes the connection
  • 35s: app queries → ERROR

Keepalive every 20s:
  • 0s: connection acquired
  • 20s: PING → DB resets its timer
  • 35s: app queries → works
  • 40s: PING → DB resets its timer
Keepalive pings tell the database "I'm still here" and prevent the idle timeout from triggering.

Part 3
Going Deeper

How It All Works at Scale

Now that you understand the basics and the problem, let's see how requests actually flow through the system—and what happens as you scale up.

One Request, Start to Finish

  1. Request arrives: user clicks a button → API endpoint called → needs database data.
  2. Ask the pool for a connection: is an idle connection available? Yes: grab it. No: create a new one (if under max) or wait.
  3. Health check (DEFENSE #2): a quick SELECT 1 to verify the connection is alive before using it.
  4. Execute your query: SELECT * FROM users WHERE id = 123 → database processes it → returns data.
  5. Return the connection to the pool: it goes back to the idle state, ready for the next request.

While idle: keepalive pings (DEFENSE #1). Every 30 seconds: "Hey DB, still here!" → resets the DB's idle timer → the connection stays alive.

This cycle repeats thousands of times per second in a busy application.
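
In code, one request's trip through the pool can be as small as the sketch below (asyncpg, with a hypothetical users table); most libraries run the health-check step for you when configured to, so the explicit SELECT 1 here is just to make the flow visible.

    import asyncpg

    async def get_user(pool: asyncpg.Pool, user_id: int):
        async with pool.acquire() as conn:        # step 2: borrow from the pool
            await conn.fetchval("SELECT 1")       # step 3: cheap health check
            row = await conn.fetchrow(            # step 4: the real query
                "SELECT * FROM users WHERE id = $1", user_id
            )
        return row                                # step 5: released by `async with`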

That's one request. But what happens when you have hundreds of requests per second across multiple servers?

Scaling Up: The Connection Math Problem

Here's where many teams run into trouble. Let's do the math:

The Connection Math

  • Small scale: 3 app servers × 20 pool size = 60 total connections. DB limit of 100: fine ✓
  • Medium scale: 10 app servers × 20 pool size = 200 total connections. DB limit of 100: exceeded ✗
  • Large scale: 50 app servers × 20 pool size = 1,000 total connections. DB limit of 100: far exceeded ✗✗

As you scale horizontally, connection counts multiply. This is when you need a connection pooler.

The Solution: Connection Poolers (PgBouncer)

At scale, you add a connection pooler between your apps and the database. It multiplexes many app connections into fewer real database connections, sharing a small set of real connections among many requests.

Architecture at Scale

  • App 1 ... App 50, each with a pool of 20: 1,000 "virtual" connections
  • PgBouncer (connection pooler / multiplexer): shares a small number of real connections across all apps
  • PostgreSQL (max_connections = 100): sees only 100 real connections

PgBouncer sits between your apps and database, multiplexing 1000 app connections into 100 real DB connections.

Wait—we already have a pool. Why do we need PgBouncer?

Great question! This confuses a lot of people. Here's the key difference:

Application Pool (e.g., SQLAlchemy, HikariCP)
  • Lives inside each app server
  • Each app has its own separate pool
  • 10 apps × 20 pool = 200 DB connections
  • Database sees ALL of them
External Pooler (PgBouncer)
  • Lives between all apps and DB
  • Consolidates all app connections
  • 200 app connections → 50 DB connections
  • Database only sees 50

Think of it this way: App pools manage connections within one server. PgBouncer manages connections across all servers. It's a "pool of pools."

Deep Dive: What is "multiplexing"?

Multiplexing means sharing one resource among many users. Here's how PgBouncer does it:

App 1: "I need to run a query"
PgBouncer: "Here, use DB connection #3"
App 1: *runs query, returns connection*
App 2: "I need to run a query"
PgBouncer: "Here, use DB connection #3" ← same connection!

The apps think they have dedicated connections, but PgBouncer is secretly sharing a smaller set of real connections among them. It's like having 10 phone lines shared among 100 employees—most calls are short, so they rarely need all 10 at once.
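
For reference, a stripped-down pgbouncer.ini for the numbers in the diagram above might look like the sketch below. The values are illustrative, and a real setup also needs authentication settings (auth_type, auth_file).

    ; illustrative pgbouncer.ini sketch
    [databases]
    appdb = host=10.0.0.5 port=5432 dbname=appdb

    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 6432
    pool_mode = transaction      ; a server connection is shared per transaction
    max_client_conn = 1000       ; app-side ("virtual") connections accepted
    default_pool_size = 100      ; real connections opened to PostgreSQL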

The Bigger Picture: How Systems Actually Scale

Before we continue, let's zoom out and understand the full scaling journey. This will make everything click.

First, Let's Clear Up a Confusion

  • Load balancer: distributes requests across multiple backends. 1,000 requests fan out to Server 1, Server 2, Server 3.
  • PgBouncer: shares connections to ONE database. 1,000 app connections are squeezed into 100 real connections to the same DB.

Key insight: PgBouncer is a connection multiplexer, not a load balancer. It doesn't give you more database capacity; it just lets more apps share fewer connections to the same database.

The Scaling Journey: 4 Stages

As your app grows, you'll go through these stages. Knowing where you are helps you pick the right solution.

The Scaling Journey: What It Actually Looks Like

  1. Small (1-3 app servers): app pools connect directly to the DB. No extra infrastructure needed. 20-60 connections, simple setup.
  2. Medium (5-20 app servers): add PgBouncer between the apps and the DB to multiplex connections. 200 app connections become ~50 real ones.
  3. Large (20-100 app servers): add read replicas. The primary handles writes, the replicas handle reads. 1 primary + N replicas.
  4. Very Large (100+ servers, millions of queries): sharding, splitting data across multiple DBs. A different, more complex architecture (Citus, Vitess).

Most teams stay at Stage 1-2. Read replicas (Stage 3) are for high traffic. Sharding (Stage 4) is rare.

Everything Has Limits

Here's the reality: every component has a ceiling. Understanding these limits helps you plan.

What Can You Scale? (And What Stops You)

  • DB max_connections. Can increase: 100 → 500+. Limit: each connection uses 5-10MB of RAM, so 500 connections means 2.5-5GB just for connections.
  • DB memory/CPU. Can increase: a bigger instance. Limit: cost, and eventually a ceiling. You can't scale one machine forever.
  • PgBouncer. Has its own settings: max_client_conn, pool_size. Limit: a single PgBouncer can become a CPU bottleneck; you may need multiple instances.
  • Read replicas. Can add: multiple replicas for reads. Limit: only helps READ queries. Writes still go to one primary DB.

Read Replicas: The Most Common Scaling Solution

Most apps are 80-90% reads. Read replicas let you scale reads without changing your database architecture.

How Read Replicas Work

  • Your app servers (App 1, App 2, App 3) send WRITES (INSERT, UPDATE, DELETE) to the primary
  • The primary replicates changes to Replica 1 and Replica 2
  • READS go to the replicas

How it works

Writes (INSERT, UPDATE, DELETE) go to the primary DB only. The primary then replicates changes to replicas automatically. Reads (SELECT) can go to any replica, spreading the load. Add more replicas = handle more read traffic.
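
In application code, the simplest version of this is to keep separate pools and route by query type. A minimal sketch with asyncpg (the primary and replica pools, and their DSNs, are assumed to be created elsewhere):

    import asyncpg

    class RoutingDB:
        """Send writes to the primary pool, spread reads across replica pools."""

        def __init__(self, primary: asyncpg.Pool, replicas: list[asyncpg.Pool]):
            self.primary = primary
            self.replicas = replicas
            self._i = 0

        def _next_replica(self) -> asyncpg.Pool:
            if not self.replicas:          # no replicas yet: fall back to the primary
                return self.primary
            self._i = (self._i + 1) % len(self.replicas)
            return self.replicas[self._i]

        async def fetch(self, query: str, *args):      # reads -> a replica
            async with self._next_replica().acquire() as conn:
                return await conn.fetch(query, *args)

        async def execute(self, query: str, *args):    # writes -> the primary
            async with self.primary.acquire() as conn:
                return await conn.execute(query, *args)

One caveat: replication is asynchronous, so a read sent to a replica immediately after a write may not see that write yet; read-your-own-writes traffic usually goes to the primary.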

The Decision Tree: What Should You Do?

When you're hitting connection limits, here's how to decide what to do:

Scaling Decision Tree

  • Don't have PgBouncer yet? Add PgBouncer. It's the first step and usually solves the problem.
  • Already have PgBouncer and the workload is mostly reads? Add read replicas and send reads to them.
  • Already have PgBouncer and the workload is mostly writes? Scale up the DB (more RAM/CPU), or consider sharding.

Most apps are read-heavy, so read replicas are usually the answer after PgBouncer.

Where Are You on the Journey?

Most teams never get past Stage 2. If you have PgBouncer configured correctly, you can handle thousands of requests per second with a single database. Read replicas (Stage 3) are for when you're pushing that limit. Sharding (Stage 4) is rare—most companies never need it.

Do Best Practices Hurt Scale?

A common concern: "If I add health checks and keepalives, won't that add overhead?" Let's look at the numbers:

The Real Overhead

  • Keepalive: negligible. Cost: 1 tiny ping per connection per 30s. At 1,000 req/s: ~3 pings/second total.
  • Health check: tiny. Cost: ~1ms per request. At 1,000 req/s: 1,000ms spread across the pool.
  • Retry logic: free. Cost: only triggers on failures. At 1,000 req/s: zero overhead normally.
  • Acquire late: HELPS scale. Cost: none, just discipline. At 1,000 req/s: connections free up faster.

These practices cost almost nothing but save you from mysterious 3am failures.

Putting It All Together: The Timeline

Here's what actually happens over time with a healthy connection:

A Day in the Life of a Connection

  • 0:00:00 - pool creates the connection (TCP + TLS + auth = 150ms)
  • 0:00:01 - request #1: health check ✓ → query → return to pool
  • 0:00:05 - request #2: health check ✓ → query → return to pool
  • 0:00:30 - keepalive ping → DB resets its idle timer
  • 0:01:00 - keepalive ping → DB resets its idle timer
  • 0:02:15 - request #3 arrives after a long gap: health check ✓, the connection is still alive (thanks to keepalive!), and the query executes successfully ✓

Without keepalive, request #3 would have failed. The connection would have been silently closed during the 2+ minute gap.

Key Takeaway

Best practices don't slow you down—they prevent failures. The real bottleneck at scale is the total number of connections, not the keepalives or health checks. Use PgBouncer when your app servers × pool size exceeds your DB's max_connections.

Part 4
Solutions

Best Practices for Production

Now that you understand how connections work at scale, let's implement the defenses properly. Each tool attacks the problem from a different angle.

1. Keepalive Pings

Keepalive: The Heartbeat

The app sends a small PING to the database every 20 seconds, even when no queries are running.

Result: each PING resets the idle timer. The connection stays alive even during long operations.
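
What this looks like in code depends heavily on your driver. As one hedged approximation with asyncpg: a background task that periodically borrows each idle connection and runs a trivial query so the server sees activity on it.

    import asyncio
    import asyncpg

    async def keepalive_loop(pool: asyncpg.Pool, interval: float = 20.0) -> None:
        while True:
            await asyncio.sleep(interval)
            idle = pool.get_idle_size()                  # connections open but unused
            conns = [await pool.acquire() for _ in range(idle)]
            try:
                for conn in conns:
                    await conn.execute("SELECT 1")       # the "PING"
            finally:
                for conn in conns:
                    await pool.release(conn)
            # Caveat: this only reaches connections that are idle *in the pool*.
            # A connection checked out during a 30s LLM call is not protected,
            # which is why "acquire late, release early" (practice 4) still matters.

You would start it once at application startup, for example with asyncio.create_task(keepalive_loop(pool)).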

2. Health Checks

Health Check: Validate Before Use

  • A request asks the pool for a connection
  • The pool runs a quick SELECT 1 on it
  • Alive: use it
  • Dead: discard it and get a new one

Cost: ~1-5ms per request. Benefit: zero dead-connection errors.
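
Most pool libraries have this built in; you mostly just turn it on. For example, with SQLAlchemy (placeholder DSN), pool_pre_ping issues a lightweight ping on every checkout and transparently replaces dead connections:

    from sqlalchemy import create_engine, text

    engine = create_engine(
        "postgresql+psycopg2://app:secret@db.example.com/appdb",  # placeholder DSN
        pool_pre_ping=True,   # Defense #2: validate the connection before use
    )

    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))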

3. Retry Logic

Retry with Exponential Backoff

  • Attempt 1: connection error → wait 100ms
  • Attempt 2: still failing → wait 200ms (doubled)
  • Attempt 3: success! The new connection worked.

Caution: only retry read operations or idempotent writes. Don't blindly retry "INSERT payment" or you might charge twice!
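
Here is a sketch of that pattern with the tenacity library and asyncpg, retrying only connection-level failures; it is safe here because a plain SELECT is idempotent.

    import asyncpg
    from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

    @retry(
        retry=retry_if_exception_type((asyncpg.ConnectionDoesNotExistError, OSError)),
        wait=wait_exponential(multiplier=0.1, max=2),   # roughly 0.1s, 0.2s, 0.4s, ...
        stop=stop_after_attempt(3),
    )
    async def get_user(pool: asyncpg.Pool, user_id: int):
        async with pool.acquire() as conn:
            return await conn.fetchrow("SELECT * FROM users WHERE id = $1", user_id)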

4. Acquire Late, Release Early

This is the most overlooked best practice — don't hold connections while doing work that doesn't need them:

Connection Acquisition Pattern

Problematic pattern:
  acquire_connection()
  do_long_llm_call()    // 30s idle!
  run_query()
  release_connection()

Better pattern:
  do_long_llm_call()    // No connection held
  acquire_connection()
  run_query()           // Quick
  release_connection()

Don't hold a database connection while doing work that doesn't need it. Acquire right before you query, release right after.
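
The same two patterns in Python, as a sketch: call_llm() is a hypothetical stand-in for a slow external API, and the answers table is made up for illustration.

    import asyncio
    import asyncpg

    async def call_llm(prompt: str) -> str:
        await asyncio.sleep(30)            # hypothetical 30s external API call
        return f"answer to: {prompt}"

    async def handler_problematic(pool: asyncpg.Pool, prompt: str) -> None:
        async with pool.acquire() as conn:                 # connection held the whole time
            answer = await call_llm(prompt)                # ~30s: connection sits idle
            await conn.execute("INSERT INTO answers (text) VALUES ($1)", answer)

    async def handler_better(pool: asyncpg.Pool, prompt: str) -> None:
        answer = await call_llm(prompt)                    # no connection held here
        async with pool.acquire() as conn:                 # acquire right before the query
            await conn.execute("INSERT INTO answers (text) VALUES ($1)", answer)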

Defense in Depth

Production systems should have multiple layers of protection:

Production-Ready Setup

  1. DB configuration: reasonable timeouts (not too aggressive)
  2. Pool configuration: keepalive pings + health checks
  3. Application code: acquire late, release early + retry logic
  4. Monitoring: track pool stats, error rates, connection age

Each layer catches what the previous layer missed. No single point of failure.

When Things Fail: The Defense in Action

We've talked about defenses. Now let's see them actually work when things go wrong.

The Defensive Flow: A Real Failure Scenario

Scenario: a connection died while sitting idle. The database closed connection #3 after 30 seconds of inactivity. Your app doesn't know yet.

  1. A request arrives and needs the database. The app asks the pool: "Give me a connection."
  2. The pool hands over connection #3. The pool thinks it's fine; it still looks open on our side.
  3. Health check (SELECT 1) FAILS! The connection is dead; the DB already closed it.
  4. The pool discards connection #3, removes it from the pool, and tries another.
  5. The pool tries connection #4, or creates a new one if none is available.
  6. The health check passes: connection #4 is alive.
  7. The query executes successfully. The user never knew anything went wrong!

Without health checks, the query would have failed with "connection closed".

Health checks catch dead connections before your code tries to use them. The failure is handled silently.

What do I actually need to do in my code?

Good news: Most of this is handled by your connection pool library. You just need to configure it right.

  • DO enable health checks in your pool config: pool_pre_ping=True (SQLAlchemy) or testOnBorrow=true (HikariCP).
  • DO configure keepalive: enable TCP keepalives in your connection parameters, or periodically ping / recycle idle connections (e.g., pool_recycle in SQLAlchemy).
  • DO add retry logic for transient failures: use tenacity (Python) or a similar retry library.
  • DON'T catch and ignore connection errors silently. Let them bubble up so the retry logic can handle them.

Part 5
Reference

Common Failure Patterns

When things go wrong, these are the usual suspects. Knowing their names helps you diagnose faster.

Failure Taxonomy: Know Your Enemy

  • The Phantom Disconnect. Symptom: "connection closed" on random requests. Root cause: the database's idle timeout closed the connection silently.
  • The Pool Drought. Symptom: "pool exhausted" under load. Root cause: too many concurrent requests; all connections are busy.
  • The Zombie Connection. Symptom: the connection looks open but queries hang forever. Root cause: a network issue severed the connection and no keepalive detected it.
  • The Slow Leak. Symptom: the pool slowly runs out over hours or days. Root cause: code borrows connections but doesn't return them properly.

The Tradeoffs Table

There's no perfect fix. Each solution has pros and cons:

Tradeoffs: Pick Your Strategy

  • Increase timeout (database side). Pro: a clean fix at the source. Con: requires a DB config change and may need a restart.
  • Keepalive pings (application side). Pro: works anywhere and keeps connections alive. Con: adds complexity and slight overhead.
  • Retry logic (application side). Pro: handles any transient failure. Con: may hide problems; be careful with non-idempotent operations.
  • Health checks (application side). Pro: validates the connection before use. Con: adds ~1ms of latency to every request.

Most production systems use a combination: health checks + retry logic as a baseline, with keepalive or increased timeout as needed.

Decision Framework

Here's how to think about fixing connection issues:

Decision Framework: What To Fix

  • Do you control the database settings? Increase the timeouts: set the idle timeout to 5+ minutes.
  • On a managed DB you don't control? Add keepalive + retry as an app-side workaround.
  • Either way: add monitoring for connection pool stats and error rates.

What To Do Monday

Seeing random connection errors?

Check your database timeout settings. Most cloud databases (AWS RDS, Supabase, etc.) have aggressive 30-60 second defaults.

Run: SHOW idle_in_transaction_session_timeout;

Have long-running operations?

If you call external APIs (especially LLMs) that take 30+ seconds, add retry logic around your database queries.

Pattern: try/catch with exponential backoff

Want to prevent issues?

Monitor your connection pool. Track active/idle connections, wait times, and error rates. Problems show up in metrics before they hit users.

Add pool stats to your observability stack
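
If you use asyncpg, the pool already exposes the basic counters; here is a minimal sketch that logs them periodically (in production you would export these to your metrics system instead of printing).

    import asyncio
    import asyncpg

    async def log_pool_stats(pool: asyncpg.Pool, every: float = 15.0) -> None:
        while True:
            size = pool.get_size()          # connections currently open
            idle = pool.get_idle_size()     # open but not in use
            print(f"pool size={size} idle={idle} active={size - idle} "
                  f"max={pool.get_max_size()}")
            await asyncio.sleep(every)
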
Cheat Sheet: Database Connections

The Problem

  • Connections take 50-200ms to open
  • Idle connections get killed by DB timeouts
  • Your app doesn't know the connection died

Key Numbers

  • TCP handshake: 10-30ms
  • TLS negotiation: 20-50ms
  • Auth exchange: 10-30ms
  • Default idle timeout: 30-60s

Solutions

  • Pooling: Reuse connections
  • Timeouts: Increase on DB side
  • Keepalive: Ping idle connections
  • Retry: Handle transient failures

The Bottom Line

The 95% problem isn't a code bug. It's a mismatch between your database's timeout settings and your application's behavior. Once you understand the invisible handshake happening between your app and database, these random failures become predictable—and fixable.

وَاللَّهُ أَعْلَمُ

And Allah knows best

وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ

May Allah's peace and blessings be upon our master Muhammad and his family
