System Design

The 95% Problem: Understanding DB Connections

Your system works 95% of the time. But randomly, once every few hours, a request fails with "connection closed". The code hasn't changed. The database is running. What's going on?

Bahgat Bahgat Ahmed
· January 2025 · 18 min read

بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

In the name of Allah, the Most Gracious, the Most Merciful

Your system works 95% of the time. But randomly, once every few hours, a request fails with a cryptic error: connection was closed in the middle of operation.

The code hasn't changed. The database is running. Everything looks fine in the logs... until it doesn't.

If you've been here, you know the frustration. If you haven't, you will be.

This post explains what's actually happening, why it's so hard to debug, and how to fix it.

Quick Summary
  • Connections are expensive — creating one takes 50-200ms and several network round trips
  • Connection pools solve this — reuse connections instead of creating new ones
  • Idle timeouts kill connections silently — your database closes connections your app thinks are alive

Want the full story? Keep reading.

This post is for you if:

  • You're getting random "connection closed" errors that you can't reproduce
  • Your app works 95% of the time but fails mysteriously
  • You're using a managed database (Supabase, RDS, PlanetScale) and hitting connection limits

The Big Picture: Where Do Connections Fit?

Before diving into the details, let's zoom out. When you click a button on a website, here's what happens:

The Software Stack

  • User action: you click "Load Data"
  • Frontend: the browser sends an HTTP request
  • Backend server: Python/Node/Go processes the request
  • Database connection: THIS IS WHERE THINGS BREAK
  • Database: PostgreSQL/MySQL returns the data

The database connection is the bridge between your application code and your data. When this bridge breaks, everything falls apart.

New to this? What are TCP, TLS, and connections?

If terms like TCP and TLS are new to you, here's the quick version:

TCP (Transmission Control Protocol)

The "reliable pipe" between two computers. Like a phone call—you dial, connect, talk, hang up. TCP guarantees messages arrive in order and nothing gets lost.

TLS (Transport Layer Security)

Encryption on top of TCP. Like speaking in a secret code so nobody can eavesdrop on your phone call. The "S" in HTTPS.

Database Connection

A TCP pipe (usually encrypted with TLS) where you've also logged in with a username/password. It's an authenticated, secure channel to send SQL queries and get results back.

The key insight: Creating this connection takes 50-200ms (TCP handshake + TLS + authentication). That's why we reuse connections instead of creating new ones for every query.

The database connection is often invisible in tutorials. You write db.query("SELECT * FROM users") and magically get data back. But that magic hides a complex dance of network protocols, authentication, and resource management.

What IS a Database Connection?

Think of a database connection like a phone call:

  • Dial the number — Your app reaches out to the database server over the network
  • Verify identity — "Hi, I'm the backend app, password is xyz123"
  • Have the conversation — Send queries, receive results
  • Hang up — Close the connection when done

Simple, right? But here's where it gets interesting...

Why Connections Are Expensive

Opening a new database connection isn't like opening a file. It's a multi-step handshake that takes real time:

Connection Lifecycle: The Hidden Cost

  1. TCP SYN (~10ms)
  2. TCP SYN-ACK (~10ms)
  3. TCP ACK (~10ms)
  4. TLS handshake (~30-50ms)
  5. PostgreSQL authentication (~20-30ms)

Connection ready. Total: 50-200ms, depending on network latency.

Every new connection costs 50-200ms. At 1,000 requests/second, that's 50-200 seconds of waiting per second just for connection overhead.

The Math

If your app serves 1,000 requests/second and each opens a new connection (150ms), you'd need 150,000ms of connection time per second. That's 150 seconds of work crammed into 1 second. Impossible without connection reuse.
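
If you want to see this cost on your own setup, here is a rough, minimal benchmark sketch using asyncpg (the DSN is a placeholder). It compares opening a fresh connection per query with reusing connections from a small pool, which is exactly the idea the next section formalizes.

    import asyncio
    import time
    import asyncpg

    DSN = "postgresql://app:secret@localhost/appdb"   # placeholder DSN

    async def main():
        # 20 queries, each paying the full TCP + TLS + auth cost
        t0 = time.perf_counter()
        for _ in range(20):
            conn = await asyncpg.connect(DSN)
            await conn.fetchval("SELECT 1")
            await conn.close()
        print("connect per query:", time.perf_counter() - t0, "seconds")

        # 20 queries reusing already-open connections from a pool
        pool = await asyncpg.create_pool(DSN, min_size=5, max_size=5)
        t0 = time.perf_counter()
        for _ in range(20):
            async with pool.acquire() as conn:
                await conn.fetchval("SELECT 1")
        print("pooled:", time.perf_counter() - t0, "seconds")
        await pool.close()

    asyncio.run(main())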

Connection Pooling: The Solution

Instead of dialing a new phone call for every question, what if you had a team of operators with phones already connected, ready to relay your messages?

That's connection pooling. Your app maintains a "pool" of pre-opened connections that can be borrowed and returned:

Connection Pooling: Pre-Opened Connections Ready to Use

  • The application's requests borrow connections from the pool and return them when done
  • The pool holds connections #1-#5: some active (in use), some idle (waiting)
  • Pool settings: min_size: 5, max_size: 20
  • Borrowed connections send queries to the database

Connections are pre-opened and reused. Borrowing from the pool takes ~0.1ms instead of 50-200ms for a new connection.

Key concepts:

  • min_size — Minimum connections to keep open (e.g., 5). Never go below this.
  • max_size — Maximum connections allowed (e.g., 20). If all are busy, requests wait.
  • Idle connections — Open but not currently executing a query. Ready to be borrowed.
  • Active connections — Currently executing a query.
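
As a concrete, minimal sketch of those settings, here is what a pool with min_size=5 and max_size=20 looks like with asyncpg; the DSN is a placeholder.

    import asyncio
    import asyncpg

    async def main():
        pool = await asyncpg.create_pool(
            "postgresql://app:secret@db.example.com/appdb",  # placeholder DSN
            min_size=5,    # keep at least 5 connections open and ready
            max_size=20,   # hard cap; if all 20 are busy, acquire() waits
        )
        async with pool.acquire() as conn:   # borrow (~0.1ms), query, auto-return
            print(await conn.fetchval("SELECT now()"))
        await pool.close()

    asyncio.run(main())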

Connection pooling solves the performance problem. But it introduces a new one...

Part 2
The Problem

The Hidden Killer: Idle Timeouts

An idle timeout is a safety mechanism: if a connection hasn't sent any data for X seconds, the database assumes it's dead and closes it. Makes sense—but it causes a sneaky problem.

Here's the "aha" moment of this post. The scenario that causes 95% of connection errors:

  1. Your app borrows a connection from the pool
  2. Your app starts some long work (like calling an LLM API that takes 30+ seconds)
  3. The connection sits idle during this time — it's borrowed, but not actively querying
  4. The database server thinks: "This connection has been silent too long. Is anyone there?"
  5. The database closes the connection
  6. Your app finishes the LLM call and tries to use the connection — BOOM

The Idle Timeout Timeline: How 95% Becomes 5% Failures

  • t = 0s: Connection acquired. The app borrows a connection from the pool. Everything is fine.
  • t = 0-30s: Long operation (LLM call). The app calls an external API. The connection sits idle; no queries are sent.
  • t = 30s: Database timeout triggers. idle_in_transaction_session_timeout = 30s, so the DB closes the connection silently.
  • t = 35s: App tries to query: ERROR! ConnectionDoesNotExistError: connection was closed in the middle of operation

The database doesn't tell your app it closed the connection. Your app still thinks the connection is open until it tries to use it.
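
If you want to see the failure on demand instead of waiting for it at 3am, here is a minimal reproduction sketch with asyncpg, assuming a Postgres instance (placeholder DSN) where idle_in_transaction_session_timeout is set to 30s.

    import asyncio
    import asyncpg

    async def main():
        conn = await asyncpg.connect("postgresql://app:secret@localhost/appdb")  # placeholder DSN
        tr = conn.transaction()
        await tr.start()                 # t = 0s: session is now "idle in transaction"
        await asyncio.sleep(35)          # stand-in for a 30s+ LLM call; no queries sent
        # At t = 30s the server silently terminated the session. The app only
        # finds out now, typically as "connection was closed in the middle of operation".
        await conn.fetchval("SELECT 1")  # raises ConnectionDoesNotExistError (or similar)

    asyncio.run(main())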

Why 95% Works

Most requests complete in under 30 seconds, so the timeout never triggers. Only the unlucky requests that hit a slow external service at the wrong time get killed. That's why the failure is random and maddening to debug.

Why 95% Success is Harder to Debug Than 0%

  • 0% (everything fails): easy to debug. Something is fundamentally broken. Check credentials, network, firewall.
  • 100% (everything works): nothing to debug. Ship it! Deploy with confidence.
  • 95% (random failures): the worst. The code is correct, the DB is running, but something invisible is interfering. What's going on?!

Clues That Point to Idle Timeout Issues

  • Errors mention "connection closed"
  • Failures are random, not consistent
  • Failures correlate with slow operations
  • Failures happen after a long gap of inactivity

So how do we fix this? There are three defensive tools every production system should have:

The Three Lines of Defense

  • Keepalive pings: "Hey DB, I'm still here!" Prevents idle disconnection.
  • Health checks: test connections before use. Catch dead ones early.
  • Retry logic: if it fails, try again. Graceful error recovery.

But to use these tools effectively, you first need to understand where the settings live...

The Two Sides: Database vs Application

Connection settings live in two places, and they need to work together:

Two Sides of Configuration

Application (your code controls this):
  • Pool size (min/max)
  • Keepalive pings
  • Health checks
  • Retry logic

Database (the DBA or cloud console controls this):
  • Idle timeout
  • Max connections
  • Statement timeout
  • TCP keepalives

Both sides have settings that affect connections. They need to work together.

Database-Side Settings

These are configured by whoever manages the database (DBA, or you in the cloud console):

Setting: what it does (typical default)

  • idle_in_transaction_session_timeout: kills connections sitting idle inside a transaction (0 / disabled, or 30s-5min)
  • idle_session_timeout: kills connections idle outside transactions (0 / disabled, or 10-30min)
  • statement_timeout: kills queries that run too long (0 / disabled, or 30s-5min)
  • max_connections: total connections the DB will accept (100-500)

Where to find these

AWS RDS: Parameter Groups in the console. Supabase: Dashboard → Database Settings. Self-hosted: Edit postgresql.conf.
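
You can also ask the server directly. A small sketch (asyncpg, placeholder DSN) that reads the relevant values from pg_settings:

    import asyncio
    import asyncpg

    async def show_timeouts(dsn: str) -> None:
        conn = await asyncpg.connect(dsn)
        try:
            rows = await conn.fetch(
                """
                SELECT name, setting, unit
                FROM pg_settings
                WHERE name IN ('idle_in_transaction_session_timeout',
                               'idle_session_timeout',          -- PostgreSQL 14+
                               'statement_timeout',
                               'max_connections')
                """
            )
            for r in rows:
                print(f"{r['name']} = {r['setting']} {r['unit'] or ''}")
        finally:
            await conn.close()

    asyncio.run(show_timeouts("postgresql://app:secret@localhost/appdb"))  # placeholder DSN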

Application-Side Settings

These live in YOUR code, in the connection pool configuration:

Setting: what it does (typical default)

  • min_pool_size: connections always kept open and ready (1-5)
  • max_pool_size: maximum connections the pool will create (10-20)
  • connection_timeout: how long to wait when opening a new connection (30s)
  • command_timeout: how long to wait for a query to complete (30s or no limit)
  • keepalive_interval: how often to ping idle connections (often disabled by default)
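
The exact names vary by library. As one hedged example of how these settings map onto a popular pool, here is a SQLAlchemy + psycopg2 engine (placeholder DSN); check your own library's docs for the equivalents.

    from sqlalchemy import create_engine

    engine = create_engine(
        "postgresql+psycopg2://app:secret@db.example.com/appdb",  # placeholder DSN
        pool_size=5,        # roughly min_pool_size: connections kept open
        max_overflow=15,    # pool_size + max_overflow is the effective max (20)
        pool_timeout=30,    # like connection_timeout: how long a checkout waits
        pool_recycle=1800,  # proactively replace connections older than 30 min
        connect_args={
            "connect_timeout": 10,                    # libpq connect timeout
            "options": "-c statement_timeout=30000",  # server-side 30s query limit
        },
    )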

How They Interact

The problem happens when these settings don't align. Here's the critical scenario:

When Settings Don't Align

No keepalive configured:
  • 0s: connection acquired
  • 0-30s: app doing long work...
  • 30s: DB closes the connection
  • 35s: app queries → ERROR

Keepalive every 20s:
  • 0s: connection acquired
  • 20s: PING → DB resets its timer
  • 35s: app queries → works
  • 40s: PING → DB resets its timer
Keepalive pings tell the database "I'm still here" and prevent the idle timeout from triggering.

Part 3
Going Deeper

How It All Works at Scale

Now that you understand the basics and the problem, let's see how requests actually flow through the system—and what happens as you scale up.

One Request, Start to Finish

  1. Request arrives: user clicks a button → API endpoint called → needs database data.
  2. Ask the pool for a connection: is an idle connection available? Yes: grab it. No: create a new one (if under max) or wait.
  3. Health check (DEFENSE #2): a quick SELECT 1 to verify the connection is alive before using it.
  4. Execute your query: SELECT * FROM users WHERE id = 123 → database processes it → returns data.
  5. Return the connection to the pool: it goes back to the idle state, ready for the next request.

While idle: keepalive pings (DEFENSE #1). Every 30 seconds: "Hey DB, still here!" → resets the DB's idle timer → the connection stays alive.

This cycle repeats thousands of times per second in a busy application.
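
In code, one request's trip through the pool can be as small as the sketch below (asyncpg, with a hypothetical users table); most libraries run the health-check step for you when configured to, so the explicit SELECT 1 here is just to make the flow visible.

    import asyncpg

    async def get_user(pool: asyncpg.Pool, user_id: int):
        async with pool.acquire() as conn:        # step 2: borrow from the pool
            await conn.fetchval("SELECT 1")       # step 3: cheap health check
            row = await conn.fetchrow(            # step 4: the real query
                "SELECT * FROM users WHERE id = $1", user_id
            )
        return row                                # step 5: released by `async with`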

That's one request. But what happens when you have hundreds of requests per second across multiple servers?

Scaling Up: The Connection Math Problem

Here's where many teams run into trouble. Let's do the math:

The Connection Math

  • Small scale: 3 app servers × 20 pool size = 60 total connections. DB limit of 100: fine ✓
  • Medium scale: 10 app servers × 20 pool size = 200 total connections. DB limit of 100: exceeded ✗
  • Large scale: 50 app servers × 20 pool size = 1,000 total connections. DB limit of 100: far exceeded ✗✗

As you scale horizontally, connection counts multiply. This is when you need a connection pooler.

The Solution: Connection Poolers (PgBouncer)

At scale, you add a connection pooler between your apps and the database. It multiplexes many app connections into fewer real database connections, sharing a small set of real connections among many requests.

Architecture at Scale

  • App 1 ... App 50, each with a pool of 20: 1,000 "virtual" connections
  • PgBouncer (connection pooler / multiplexer): shares a small number of real connections across all apps
  • PostgreSQL (max_connections = 100): sees only 100 real connections

PgBouncer sits between your apps and database, multiplexing 1000 app connections into 100 real DB connections.

Wait—we already have a pool. Why do we need PgBouncer?

Great question! This confuses a lot of people. Here's the key difference:

Application Pool (e.g., SQLAlchemy, HikariCP)
  • Lives inside each app server
  • Each app has its own separate pool
  • 10 apps × 20 pool = 200 DB connections
  • Database sees ALL of them
External Pooler (PgBouncer)
  • Lives between all apps and DB
  • Consolidates all app connections
  • 200 app connections → 50 DB connections
  • Database only sees 50

Think of it this way: App pools manage connections within one server. PgBouncer manages connections across all servers. It's a "pool of pools."

Deep Dive: What is "multiplexing"?

Multiplexing means sharing one resource among many users. Here's how PgBouncer does it:

App 1: "I need to run a query"
PgBouncer: "Here, use DB connection #3"
App 1: *runs query, returns connection*
App 2: "I need to run a query"
PgBouncer: "Here, use DB connection #3" ← same connection!

The apps think they have dedicated connections, but PgBouncer is secretly sharing a smaller set of real connections among them. It's like having 10 phone lines shared among 100 employees—most calls are short, so they rarely need all 10 at once.
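
For reference, a stripped-down pgbouncer.ini for the numbers in the diagram above might look like the sketch below. The values are illustrative, and a real setup also needs authentication settings (auth_type, auth_file).

    ; illustrative pgbouncer.ini sketch
    [databases]
    appdb = host=10.0.0.5 port=5432 dbname=appdb

    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 6432
    pool_mode = transaction      ; a server connection is shared per transaction
    max_client_conn = 1000       ; app-side ("virtual") connections accepted
    default_pool_size = 100      ; real connections opened to PostgreSQL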

The Bigger Picture: How Systems Actually Scale

Before we continue, let's zoom out and understand the full scaling journey. This will make everything click.

First, Let's Clear Up a Confusion

  • Load balancer: distributes requests across multiple backends. 1,000 requests fan out to Server 1, Server 2, Server 3.
  • PgBouncer: shares connections to ONE database. 1,000 app connections are squeezed into 100 real connections to the same DB.

Key insight: PgBouncer is a connection multiplexer, not a load balancer. It doesn't give you more database capacity; it just lets more apps share fewer connections to the same database.

The Scaling Journey: 4 Stages

As your app grows, you'll go through these stages. Knowing where you are helps you pick the right solution.

The Scaling Journey: What It Actually Looks Like

  1. Small (1-3 app servers): app pools connect directly to the DB. No extra infrastructure needed. 20-60 connections, simple setup.
  2. Medium (5-20 app servers): add PgBouncer between the apps and the DB to multiplex connections. 200 app connections become ~50 real ones.
  3. Large (20-100 app servers): add read replicas. The primary handles writes, the replicas handle reads. 1 primary + N replicas.
  4. Very Large (100+ servers, millions of queries): sharding, splitting data across multiple DBs. A different, more complex architecture (Citus, Vitess).

Most teams stay at Stage 1-2. Read replicas (Stage 3) are for high traffic. Sharding (Stage 4) is rare.

Everything Has Limits

Here's the reality: every component has a ceiling. Understanding these limits helps you plan.

What Can You Scale? (And What Stops You)

  • DB max_connections. Can increase: 100 → 500+. Limit: each connection uses 5-10MB of RAM, so 500 connections means 2.5-5GB just for connections.
  • DB memory/CPU. Can increase: a bigger instance. Limit: cost, and eventually a ceiling. You can't scale one machine forever.
  • PgBouncer. Has its own settings: max_client_conn, pool_size. Limit: a single PgBouncer can become a CPU bottleneck; you may need multiple instances.
  • Read replicas. Can add: multiple replicas for reads. Limit: only helps READ queries. Writes still go to one primary DB.

Read Replicas: The Most Common Scaling Solution

Most apps are 80-90% reads. Read replicas let you scale reads without changing your database architecture.

How Read Replicas Work

  • Your app servers (App 1, App 2, App 3) send WRITES (INSERT, UPDATE, DELETE) to the primary
  • The primary replicates changes to Replica 1 and Replica 2
  • READS go to the replicas

How it works

Writes (INSERT, UPDATE, DELETE) go to the primary DB only. The primary then replicates changes to replicas automatically. Reads (SELECT) can go to any replica, spreading the load. Add more replicas = handle more read traffic.
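
In application code, the simplest version of this is to keep separate pools and route by query type. A minimal sketch with asyncpg (the primary and replica pools, and their DSNs, are assumed to be created elsewhere):

    import asyncpg

    class RoutingDB:
        """Send writes to the primary pool, spread reads across replica pools."""

        def __init__(self, primary: asyncpg.Pool, replicas: list[asyncpg.Pool]):
            self.primary = primary
            self.replicas = replicas
            self._i = 0

        def _next_replica(self) -> asyncpg.Pool:
            if not self.replicas:          # no replicas yet: fall back to the primary
                return self.primary
            self._i = (self._i + 1) % len(self.replicas)
            return self.replicas[self._i]

        async def fetch(self, query: str, *args):      # reads -> a replica
            async with self._next_replica().acquire() as conn:
                return await conn.fetch(query, *args)

        async def execute(self, query: str, *args):    # writes -> the primary
            async with self.primary.acquire() as conn:
                return await conn.execute(query, *args)

One caveat: replication is asynchronous, so a read sent to a replica immediately after a write may not see that write yet; read-your-own-writes traffic usually goes to the primary.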

The Decision Tree: What Should You Do?

When you're hitting connection limits, here's how to decide what to do:

Scaling Decision Tree

  • Don't have PgBouncer yet? Add PgBouncer. It's the first step and usually solves the problem.
  • Already have PgBouncer and the workload is mostly reads? Add read replicas and send reads to them.
  • Already have PgBouncer and the workload is mostly writes? Scale up the DB (more RAM/CPU), or consider sharding.

Most apps are read-heavy, so read replicas are usually the answer after PgBouncer.

Where Are You on the Journey?

Most teams never get past Stage 2. If you have PgBouncer configured correctly, you can handle thousands of requests per second with a single database. Read replicas (Stage 3) are for when you're pushing that limit. Sharding (Stage 4) is rare—most companies never need it.

Do Best Practices Hurt Scale?

A common concern: "If I add health checks and keepalives, won't that add overhead?" Let's look at the numbers:

The Real Overhead

  • Keepalive: negligible. Cost: 1 tiny ping per connection per 30s. At 1,000 req/s: ~3 pings/second total.
  • Health check: tiny. Cost: ~1ms per request. At 1,000 req/s: 1,000ms spread across the pool.
  • Retry logic: free. Cost: only triggers on failures. At 1,000 req/s: zero overhead normally.
  • Acquire late: HELPS scale. Cost: none, just discipline. At 1,000 req/s: connections free up faster.

These practices cost almost nothing but save you from mysterious 3am failures.

Putting It All Together: The Timeline

Here's what actually happens over time with a healthy connection:

A Day in the Life of a Connection

  • 0:00:00 - pool creates the connection (TCP + TLS + auth = 150ms)
  • 0:00:01 - request #1: health check ✓ → query → return to pool
  • 0:00:05 - request #2: health check ✓ → query → return to pool
  • 0:00:30 - keepalive ping → DB resets its idle timer
  • 0:01:00 - keepalive ping → DB resets its idle timer
  • 0:02:15 - request #3 arrives after a long gap: health check ✓, the connection is still alive (thanks to keepalive!), and the query executes successfully ✓

Without keepalive, request #3 would have failed. The connection would have been silently closed during the 2+ minute gap.

Key Takeaway

Best practices don't slow you down—they prevent failures. The real bottleneck at scale is the total number of connections, not the keepalives or health checks. Use PgBouncer when your app servers × pool size exceeds your DB's max_connections.

Part 4
Solutions

Best Practices for Production

Now that you understand how connections work at scale, let's implement the defenses properly. Each tool attacks the problem from a different angle.

1. Keepalive Pings

Keepalive: The Heartbeat

The app sends a small PING to the database every 20 seconds, even when no queries are running.

Result: each PING resets the idle timer. The connection stays alive even during long operations.
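
What this looks like in code depends heavily on your driver. As one hedged approximation with asyncpg: a background task that periodically borrows each idle connection and runs a trivial query so the server sees activity on it.

    import asyncio
    import asyncpg

    async def keepalive_loop(pool: asyncpg.Pool, interval: float = 20.0) -> None:
        while True:
            await asyncio.sleep(interval)
            idle = pool.get_idle_size()                  # connections open but unused
            conns = [await pool.acquire() for _ in range(idle)]
            try:
                for conn in conns:
                    await conn.execute("SELECT 1")       # the "PING"
            finally:
                for conn in conns:
                    await pool.release(conn)
            # Caveat: this only reaches connections that are idle *in the pool*.
            # A connection checked out during a 30s LLM call is not protected,
            # which is why "acquire late, release early" (practice 4) still matters.

You would start it once at application startup, for example with asyncio.create_task(keepalive_loop(pool)).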

2. Health Checks

Health Check: Validate Before Use

  • A request asks the pool for a connection
  • The pool runs a quick SELECT 1 on it
  • Alive: use it
  • Dead: discard it and get a new one

Cost: ~1-5ms per request. Benefit: zero dead-connection errors.
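
Most pool libraries have this built in; you mostly just turn it on. For example, with SQLAlchemy (placeholder DSN), pool_pre_ping issues a lightweight ping on every checkout and transparently replaces dead connections:

    from sqlalchemy import create_engine, text

    engine = create_engine(
        "postgresql+psycopg2://app:secret@db.example.com/appdb",  # placeholder DSN
        pool_pre_ping=True,   # Defense #2: validate the connection before use
    )

    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))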

3. Retry Logic

Retry with Exponential Backoff

  • Attempt 1: connection error → wait 100ms
  • Attempt 2: still failing → wait 200ms (doubled)
  • Attempt 3: success! The new connection worked.

Caution: only retry read operations or idempotent writes. Don't blindly retry "INSERT payment" or you might charge twice!
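
Here is a sketch of that pattern with the tenacity library and asyncpg, retrying only connection-level failures; it is safe here because a plain SELECT is idempotent.

    import asyncpg
    from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

    @retry(
        retry=retry_if_exception_type((asyncpg.ConnectionDoesNotExistError, OSError)),
        wait=wait_exponential(multiplier=0.1, max=2),   # roughly 0.1s, 0.2s, 0.4s, ...
        stop=stop_after_attempt(3),
    )
    async def get_user(pool: asyncpg.Pool, user_id: int):
        async with pool.acquire() as conn:
            return await conn.fetchrow("SELECT * FROM users WHERE id = $1", user_id)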

4. Acquire Late, Release Early

This is the most overlooked best practice — don't hold connections while doing work that doesn't need them:

Connection Acquisition Pattern

Problematic pattern:
  acquire_connection()
  do_long_llm_call()    // 30s idle!
  run_query()
  release_connection()

Better pattern:
  do_long_llm_call()    // No connection held
  acquire_connection()
  run_query()           // Quick
  release_connection()

Don't hold a database connection while doing work that doesn't need it. Acquire right before you query, release right after.
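
The same two patterns in Python, as a sketch: call_llm() is a hypothetical stand-in for a slow external API, and the answers table is made up for illustration.

    import asyncio
    import asyncpg

    async def call_llm(prompt: str) -> str:
        await asyncio.sleep(30)            # hypothetical 30s external API call
        return f"answer to: {prompt}"

    async def handler_problematic(pool: asyncpg.Pool, prompt: str) -> None:
        async with pool.acquire() as conn:                 # connection held the whole time
            answer = await call_llm(prompt)                # ~30s: connection sits idle
            await conn.execute("INSERT INTO answers (text) VALUES ($1)", answer)

    async def handler_better(pool: asyncpg.Pool, prompt: str) -> None:
        answer = await call_llm(prompt)                    # no connection held here
        async with pool.acquire() as conn:                 # acquire right before the query
            await conn.execute("INSERT INTO answers (text) VALUES ($1)", answer)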

Defense in Depth

Production systems should have multiple layers of protection:

Production-Ready Setup

  1. DB configuration: reasonable timeouts (not too aggressive)
  2. Pool configuration: keepalive pings + health checks
  3. Application code: acquire late, release early + retry logic
  4. Monitoring: track pool stats, error rates, connection age

Each layer catches what the previous layer missed. No single point of failure.

When Things Fail: The Defense in Action

We've talked about defenses. Now let's see them actually work when things go wrong.

The Defensive Flow: A Real Failure Scenario

Scenario: a connection died while sitting idle. The database closed connection #3 after 30 seconds of inactivity. Your app doesn't know yet.

  1. A request arrives and needs the database. The app asks the pool: "Give me a connection."
  2. The pool hands over connection #3. The pool thinks it's fine; it still looks open on our side.
  3. Health check (SELECT 1) FAILS! The connection is dead; the DB already closed it.
  4. The pool discards connection #3, removes it from the pool, and tries another.
  5. The pool tries connection #4, or creates a new one if none is available.
  6. The health check passes: connection #4 is alive.
  7. The query executes successfully. The user never knew anything went wrong!

Without health checks, the query would have failed with "connection closed".

Health checks catch dead connections before your code tries to use them. The failure is handled silently.

What do I actually need to do in my code?

Good news: Most of this is handled by your connection pool library. You just need to configure it right.

  • DO enable health checks in your pool config: pool_pre_ping=True (SQLAlchemy) or testOnBorrow=true (HikariCP).
  • DO configure keepalive: enable TCP keepalives in your connection parameters, or periodically ping / recycle idle connections (e.g., pool_recycle in SQLAlchemy).
  • DO add retry logic for transient failures: use tenacity (Python) or a similar retry library.
  • DON'T catch and ignore connection errors silently. Let them bubble up so the retry logic can handle them.

Part 5
Reference

Common Failure Patterns

When things go wrong, these are the usual suspects. Knowing their names helps you diagnose faster.

Failure Taxonomy: Know Your Enemy

  • The Phantom Disconnect. Symptom: "connection closed" on random requests. Root cause: the database's idle timeout closed the connection silently.
  • The Pool Drought. Symptom: "pool exhausted" under load. Root cause: too many concurrent requests; all connections are busy.
  • The Zombie Connection. Symptom: the connection looks open but queries hang forever. Root cause: a network issue severed the connection and no keepalive detected it.
  • The Slow Leak. Symptom: the pool slowly runs out over hours or days. Root cause: code borrows connections but doesn't return them properly.

The Tradeoffs Table

There's no perfect fix. Each solution has pros and cons:

Tradeoffs: Pick Your Strategy

  • Increase timeout (database side). Pro: a clean fix at the source. Con: requires a DB config change and may need a restart.
  • Keepalive pings (application side). Pro: works anywhere and keeps connections alive. Con: adds complexity and slight overhead.
  • Retry logic (application side). Pro: handles any transient failure. Con: may hide problems; be careful with non-idempotent operations.
  • Health checks (application side). Pro: validates the connection before use. Con: adds ~1ms of latency to every request.

Most production systems use a combination: health checks + retry logic as a baseline, with keepalive or increased timeout as needed.

Decision Framework

Here's how to think about fixing connection issues:

Decision Framework: What To Fix

  • Do you control the database settings? Increase the timeouts: set the idle timeout to 5+ minutes.
  • On a managed DB you don't control? Add keepalive + retry as an app-side workaround.
  • Either way: add monitoring for connection pool stats and error rates.

What To Do Monday

Seeing random connection errors?

Check your database timeout settings. Most cloud databases (AWS RDS, Supabase, etc.) have aggressive 30-60 second defaults.

Run: SHOW idle_in_transaction_session_timeout;

Have long-running operations?

If you call external APIs (especially LLMs) that take 30+ seconds, add retry logic around your database queries.

Pattern: try/catch with exponential backoff

Want to prevent issues?

Monitor your connection pool. Track active/idle connections, wait times, and error rates. Problems show up in metrics before they hit users.

Add pool stats to your observability stack
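
If you use asyncpg, the pool already exposes the basic counters; here is a minimal sketch that logs them periodically (in production you would export these to your metrics system instead of printing).

    import asyncio
    import asyncpg

    async def log_pool_stats(pool: asyncpg.Pool, every: float = 15.0) -> None:
        while True:
            size = pool.get_size()          # connections currently open
            idle = pool.get_idle_size()     # open but not in use
            print(f"pool size={size} idle={idle} active={size - idle} "
                  f"max={pool.get_max_size()}")
            await asyncio.sleep(every)
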
Cheat Sheet: Database Connections

The Problem

  • Connections take 50-200ms to open
  • Idle connections get killed by DB timeouts
  • Your app doesn't know the connection died

Key Numbers

  • TCP handshake: 10-30ms
  • TLS negotiation: 20-50ms
  • Auth exchange: 10-30ms
  • Default idle timeout: 30-60s

Solutions

  • Pooling: Reuse connections
  • Timeouts: Increase on DB side
  • Keepalive: Ping idle connections
  • Retry: Handle transient failures

The Bottom Line

The 95% problem isn't a code bug. It's a mismatch between your database's timeout settings and your application's behavior. Once you understand the invisible handshake happening between your app and database, these random failures become predictable—and fixable.

وَاللَّهُ أَعْلَمُ

And Allah knows best

وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ

May Allah's peace and blessings be upon our master Muhammad and his family
