بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ
In the name of Allah, the Most Gracious, the Most Merciful
Your system works 95% of the time. But randomly, once every few hours, a request fails with a cryptic error: `connection was closed in the middle of operation`.
The code hasn't changed. The database is running. Everything looks fine in the logs... until it doesn't.
If you've been here, you know the frustration. If you haven't, you will be.
This post explains what's actually happening, why it's so hard to debug, and how to fix it.
- Connections are expensive — creating one takes 50-200ms of handshakes and network round trips
- Connection pools solve this — reuse connections instead of creating new ones
- Idle timeouts kill connections silently — your database closes connections your app thinks are alive
Want the full story? Keep reading.
This post is for you if:
- You're getting random "connection closed" errors that you can't reproduce
- Your app works 95% of the time but fails mysteriously
- You're using a managed database (Supabase, RDS, PlanetScale) and hitting connection limits
The Big Picture: Where Do Connections Fit?
Before diving into the details, let's zoom out. When you click a button on a website, the request travels from your browser to a backend server, and that backend talks to a database to read or write your data.
The database connection is the bridge between your application code and your data. When this bridge breaks, everything falls apart.
If terms like TCP and TLS are new to you, here's the quick version:
The "reliable pipe" between two computers. Like a phone call—you dial, connect, talk, hang up. TCP guarantees messages arrive in order and nothing gets lost.
Encryption on top of TCP. Like speaking in a secret code so nobody can eavesdrop on your phone call. The "S" in HTTPS.
A TCP pipe (usually encrypted with TLS) where you've also logged in with a username/password. It's an authenticated, secure channel to send SQL queries and get results back.
The key insight: Creating this connection takes 50-200ms (TCP handshake + TLS + authentication). That's why we reuse connections instead of creating new ones for every query.
The database connection is often invisible in tutorials. You write db.query("SELECT * FROM users") and magically get data back. But that magic hides a complex dance of network protocols, authentication, and resource management.
What IS a Database Connection?
Think of a database connection like a phone call:
- Dial the number — Your app reaches out to the database server over the network
- Verify identity — "Hi, I'm the backend app, password is xyz123"
- Have the conversation — Send queries, receive results
- Hang up — Close the connection when done
Simple, right? But here's where it gets interesting...
Why Connections Are Expensive
Opening a new database connection isn't like opening a file. It's a multi-step handshake that takes real time:
Every new connection costs 50-200ms. At 1,000 requests/second, that's 50-200 seconds of waiting per second just for connection overhead.
If your app serves 1,000 requests/second and each opens a new connection (150ms), you'd need 150,000ms of connection time per second. That's 150 seconds of work crammed into 1 second. Impossible without connection reuse.
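To make that arithmetic concrete, here's the same calculation as a tiny Python sketch (150ms is an assumed mid-range of the 50-200ms figure):

```python
# Back-of-the-envelope cost of opening a fresh connection for every request.
requests_per_second = 1_000
seconds_per_connection_open = 0.150  # assumed mid-range of the 50-200ms figure

setup_work_per_second = requests_per_second * seconds_per_connection_open
print(setup_work_per_second)  # 150.0 -> 150 seconds of handshake work demanded
                              # every wall-clock second, so ~150 connection
                              # opens would have to be in flight at all times
```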
Connection Pooling: The Solution
Instead of dialing a new phone call for every question, what if you had a team of operators with phones already connected, ready to relay your messages?
That's connection pooling. Your app maintains a "pool" of pre-opened connections that can be borrowed and returned:
[Diagram: incoming requests borrow connections from a pool configured with min_size: 5 and max_size: 20]
Connections are pre-opened and reused. Borrowing from the pool takes ~0.1ms instead of 50-200ms for a new connection.
Key concepts:
- min_size — Minimum connections to keep open (e.g., 5). Never go below this.
- max_size — Maximum connections allowed (e.g., 20). If all are busy, requests wait.
- Idle connections — Open but not currently executing a query. Ready to be borrowed.
- Active connections — Currently executing a query.
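To make this concrete, here's a minimal sketch of creating and using such a pool with asyncpg; the DSN and pool sizes are illustrative assumptions:

```python
import asyncio
import asyncpg  # PostgreSQL driver with a built-in connection pool

async def main():
    pool = await asyncpg.create_pool(
        dsn="postgresql://app:secret@db.example.com:5432/appdb",  # placeholder DSN
        min_size=5,    # keep at least 5 connections open and ready
        max_size=20,   # never open more than 20; extra requests wait their turn
    )

    # Borrowing a pre-opened connection takes ~0.1ms instead of a full handshake.
    async with pool.acquire() as conn:
        rows = await conn.fetch("SELECT * FROM users LIMIT 10")
        print(len(rows), "rows")

    await pool.close()

asyncio.run(main())
```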
Connection pooling solves the performance problem. But it introduces a new one...
The Hidden Killer: Idle Timeouts
An idle timeout is a safety mechanism: if a connection hasn't sent any data for X seconds, the database assumes it's dead and closes it. Makes sense—but it causes a sneaky problem.
Here's the "aha" moment of this post. The scenario that causes 95% of connection errors:
- Your app borrows a connection from the pool
- Your app starts some long work (like calling an LLM API that takes 30+ seconds)
- The connection sits idle during this time — it's borrowed, but not actively querying
- The database server thinks: "This connection has been silent too long. Is anyone there?"
- The database closes the connection
- Your app finishes the LLM call and tries to use the connection — BOOM
1. **Connection Acquired.** App borrows a connection from the pool. Everything is fine.
2. **Long Operation (LLM Call).** App calls an external API. The connection sits idle. No queries sent.
3. **Database Timeout Triggers.** `idle_in_transaction_session_timeout = 30s`: the DB closes the connection silently.
4. **App Tries to Query: ERROR!** `ConnectionDoesNotExistError: connection was closed in the middle of operation`
The database doesn't tell your app it closed the connection. Your app still thinks the connection is open until it tries to use it.
Most requests complete in under 30 seconds, so the timeout never triggers. Only the unlucky requests that hit a slow external service at the wrong time get killed. That's why the failure is random and maddening to debug.
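In code, the failing pattern looks roughly like this; a sketch using asyncpg, where `llm_client.summarize` is a hypothetical stand-in for any slow external call:

```python
import asyncpg  # raises ConnectionDoesNotExistError in this scenario

async def handle_request(pool, llm_client, user_id):
    # Anti-pattern: a connection (and an open transaction) is held
    # across a slow external call.
    async with pool.acquire() as conn:
        async with conn.transaction():
            profile = await conn.fetchrow(
                "SELECT * FROM users WHERE id = $1", user_id
            )

            # 30+ seconds with no queries on `conn`. If the database's idle
            # timeout is shorter than this, it closes the connection silently.
            summary = await llm_client.summarize(profile["bio"])  # hypothetical API

            # BOOM: asyncpg.ConnectionDoesNotExistError
            # "connection was closed in the middle of operation"
            await conn.execute(
                "UPDATE users SET summary = $1 WHERE id = $2", summary, user_id
            )
```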
"connection closed"
So how do we fix this? There are three defensive tools every production system should have: keepalive pings, health checks, and retry logic.
But to use these tools effectively, you first need to understand where the settings live...
The Two Sides: Database vs Application
Connection settings live in two places, and they need to work together:
**Application side (your code):**
- Pool size (min/max)
- Keepalive pings
- Health checks
- Retry logic

**Database side (server config):**
- Idle timeout
- Max connections
- Statement timeout
- TCP keepalives
Both sides have settings that affect connections. They need to work together.
Database-Side Settings
These are configured by whoever manages the database (DBA, or you in the cloud console):
| Setting | What It Does | Typical Default |
|---|---|---|
| `idle_in_transaction_session_timeout` | Kills connections sitting idle inside a transaction | 0 (disabled) or 30s-5min |
| `idle_session_timeout` | Kills connections idle outside transactions | 0 (disabled) or 10-30min |
| `statement_timeout` | Kills queries that run too long | 0 (disabled) or 30s-5min |
| `max_connections` | Total connections the DB will accept | 100-500 |
AWS RDS: Parameter Groups in the console. Supabase: Dashboard → Database Settings. Self-hosted: Edit postgresql.conf.
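If you're not sure what your database is currently set to, you can ask it directly. A small sketch with asyncpg (the DSN is a placeholder; `idle_session_timeout` only exists on PostgreSQL 14+):

```python
import asyncio
import asyncpg

SETTINGS = [
    "idle_in_transaction_session_timeout",
    "idle_session_timeout",  # PostgreSQL 14+ only
    "statement_timeout",
    "max_connections",
]

async def show_settings(dsn: str):
    conn = await asyncpg.connect(dsn)
    try:
        for name in SETTINGS:
            value = await conn.fetchval(f"SHOW {name}")
            print(f"{name} = {value}")
    finally:
        await conn.close()

asyncio.run(show_settings("postgresql://app:secret@db.example.com:5432/appdb"))
```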
Application-Side Settings
These live in YOUR code, in the connection pool configuration:
| Setting | What It Does | Typical Default |
|---|---|---|
| `min_pool_size` | Connections always kept open and ready | 1-5 |
| `max_pool_size` | Maximum connections the pool will create | 10-20 |
| `connection_timeout` | How long to wait when opening a new connection | 30s |
| `command_timeout` | How long to wait for a query to complete | 30s or no limit |
| `keepalive_interval` | How often to ping idle connections | Often disabled by default |
How They Interact
The problem happens when these settings don't align. The critical scenario: the database's idle timeout (often 30-60 seconds) is shorter than the quiet stretches between your app's queries, and the app has keepalives disabled, so the database closes connections the pool still considers healthy.
Keepalive pings tell the database "I'm still here" and prevent the idle timeout from triggering.
How It All Works at Scale
Now that you understand the basics and the problem, let's see how requests actually flow through the system—and what happens as you scale up.
A single request's cycle: the app borrows a connection from the pool, a health check (`SELECT 1`) verifies the connection is alive before using it, then the real query (`SELECT * FROM users WHERE id = 123`) goes to the database, which processes it and returns the data, and the connection goes back to the pool. This cycle repeats thousands of times per second in a busy application.
That's one request. But what happens when you have hundreds of requests per second across multiple servers?
Scaling Up: The Connection Math Problem
Here's where many teams run into trouble. Let's do the math:
As you scale horizontally, connection counts multiply: 10 app servers × a pool of 20 each = 200 database connections, which is already brushing up against a typical max_connections of a few hundred. This is when you need a connection pooler.
The Solution: Connection Poolers (PgBouncer)
At scale, you add a connection pooler between your apps and the database. It multiplexes many app connections into fewer real database connections, sharing a small set of real connections among many requests.
PgBouncer sits between your apps and database, multiplexing 1000 app connections into 100 real DB connections.
How is PgBouncer different from the connection pool inside your app? This confuses a lot of people. Here's the key difference:
**Your app's connection pool:**
- Lives inside each app server
- Each app has its own separate pool
- 10 apps × 20 pool = 200 DB connections
- Database sees ALL of them

**PgBouncer:**
- Lives between all apps and the DB
- Consolidates all app connections
- 200 app connections → 50 DB connections
- Database only sees 50
Think of it this way: App pools manage connections within one server. PgBouncer manages connections across all servers. It's a "pool of pools."
Multiplexing means sharing one resource among many users. Here's how PgBouncer does it:
The apps think they have dedicated connections, but PgBouncer is secretly sharing a smaller set of real connections among them. It's like having 10 phone lines shared among 100 employees—most calls are short, so they rarely need all 10 at once.
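As a rough sketch, a PgBouncer configuration for this kind of setup might look like the following; all values are illustrative, and transaction pooling is what enables the multiplexing:

```ini
; pgbouncer.ini (illustrative values only)
[databases]
appdb = host=db.example.com port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction      ; return the real connection after each transaction
max_client_conn = 1000       ; app-facing connections PgBouncer will accept
default_pool_size = 100      ; real connections it opens to the database
```

Your apps then point their connection strings at port 6432 instead of 5432; nothing else in the application code changes.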
The Bigger Picture: How Systems Actually Scale
Before we continue, let's zoom out and understand the full scaling journey. This will make everything click.
Two pieces of infrastructure are easy to confuse here: a load balancer distributes requests across multiple backends, while a connection pooler shares connections to ONE database.
The Scaling Journey: 4 Stages
As your app grows, you'll go through these stages. Knowing where you are helps you pick the right solution:

- Stage 1: one app server talking directly to one database
- Stage 2: multiple app servers, with PgBouncer pooling their connections into one database
- Stage 3: read replicas to spread read traffic
- Stage 4: sharding data across multiple databases

Most teams stay at Stage 1-2. Read replicas (Stage 3) are for high traffic. Sharding (Stage 4) is rare.
Everything Has Limits
Here's the reality: every component has a ceiling. Understanding these limits helps you plan.
Read Replicas: The Most Common Scaling Solution
Most apps are 80-90% reads. Read replicas let you scale reads without changing your database architecture.
Writes (INSERT, UPDATE, DELETE) go to the primary DB only. The primary then replicates changes to replicas automatically. Reads (SELECT) can go to any replica, spreading the load. Add more replicas = handle more read traffic.
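In application code, one common way to wire this up is two pools: writes always go to the primary, reads go to a replica. A minimal sketch with asyncpg (hostnames are placeholders, and it ignores replication lag):

```python
import asyncpg

class RoutingDB:
    """Route writes to the primary and reads to a replica."""

    def __init__(self, primary_pool, replica_pool):
        self.primary = primary_pool
        self.replica = replica_pool

    async def fetch(self, query, *args):
        # SELECTs can be served by a replica.
        async with self.replica.acquire() as conn:
            return await conn.fetch(query, *args)

    async def execute(self, query, *args):
        # INSERT / UPDATE / DELETE must hit the primary.
        async with self.primary.acquire() as conn:
            return await conn.execute(query, *args)

async def build_db():
    primary = await asyncpg.create_pool("postgresql://app:secret@db-primary/appdb")
    replica = await asyncpg.create_pool("postgresql://app:secret@db-replica-1/appdb")
    return RoutingDB(primary, replica)
```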
The Decision Tree: What Should You Do?
When you're hitting connection limits, here's how to decide what to do:
Most apps are read-heavy, so read replicas are usually the answer after PgBouncer.
Most teams never get past Stage 2. If you have PgBouncer configured correctly, you can handle thousands of requests per second with a single database. Read replicas (Stage 3) are for when you're pushing that limit. Sharding (Stage 4) is rare—most companies never need it.
Do Best Practices Hurt Scale?
A common concern: "If I add health checks and keepalives, won't that add overhead?" Let's look at the numbers: a keepalive is a tiny packet every 30 seconds or so, and a health check is a single `SELECT 1` per checkout, typically well under a millisecond, versus 50-200ms to open a replacement connection after a failure.
These practices cost almost nothing but save you from mysterious 3am failures.
Putting It All Together: The Timeline
Here's what actually happens over time with a healthy connection: queries arrive in bursts, the connection sits idle in the gaps, and keepalive pings during those quiet stretches keep the database from closing it.
Without keepalives, a request arriving after a 2+ minute gap would have failed: the connection would have been silently closed during the idle period.
Best practices don't slow you down—they prevent failures. The real bottleneck at scale is the total number of connections, not the keepalives or health checks. Use PgBouncer when your app servers × pool size exceeds your DB's max_connections.
Best Practices for Production
Now that you understand how connections work at scale, let's implement the defenses properly. Each tool attacks the problem from a different angle.
1. Keepalive Pings
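A minimal sketch, assuming SQLAlchemy with the psycopg2 driver: libpq's TCP keepalive parameters make the operating system send small probes on silent connections, so neither the network nor the server sees the connection as dead while your app is busy elsewhere. The specific intervals are assumptions to tune against your database's timeout:

```python
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://app:secret@db.example.com:5432/appdb",  # placeholder URL
    connect_args={
        "keepalives": 1,            # enable TCP keepalives on each connection
        "keepalives_idle": 30,      # first probe after 30s of silence
        "keepalives_interval": 10,  # then re-probe every 10s
        "keepalives_count": 3,      # declare the connection dead after 3 failed probes
    },
)
```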
2. Health Checks
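A minimal sketch using SQLAlchemy's built-in pre-ping, which runs a lightweight liveness check when a connection is checked out and transparently replaces dead ones:

```python
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://app:secret@db.example.com:5432/appdb",  # placeholder URL
    pool_pre_ping=True,  # test each connection on checkout; swap it out if it's dead
)

# If the database silently closed the underlying connection, the pre-ping
# detects it here and a fresh connection is opened before the query runs.
with engine.connect() as conn:
    rows = conn.execute(text("SELECT * FROM users LIMIT 5")).all()
```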
3. Retry Logic
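A minimal sketch using the tenacity library: retry only errors that look transient, with exponential backoff, and let genuine bugs surface immediately. The exception list and limits are assumptions to adapt:

```python
import asyncpg
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Errors worth retrying: the connection died or the network hiccuped.
TRANSIENT_ERRORS = (
    asyncpg.ConnectionDoesNotExistError,
    asyncpg.InterfaceError,
    ConnectionResetError,
)

@retry(
    retry=retry_if_exception_type(TRANSIENT_ERRORS),
    stop=stop_after_attempt(3),                    # give up after 3 attempts
    wait=wait_exponential(multiplier=0.5, max=5),  # back off 0.5s, 1s, 2s, capped at 5s
)
async def get_user(pool, user_id):
    async with pool.acquire() as conn:
        return await conn.fetchrow("SELECT * FROM users WHERE id = $1", user_id)
```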
4. Acquire Late, Release Early
This is the most overlooked best practice — don't hold connections while doing work that doesn't need them:
Don't hold a database connection while doing work that doesn't need it. Acquire right before you query, release right after.
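Concretely, the fix for the phantom-disconnect sketch shown earlier is to split the work into two short borrows, with the slow external call (the hypothetical `llm_client`) holding no connection at all:

```python
async def handle_request(pool, llm_client, user_id):
    # Borrow #1: quick read, released immediately.
    async with pool.acquire() as conn:
        profile = await conn.fetchrow(
            "SELECT * FROM users WHERE id = $1", user_id
        )

    # The slow external work runs with NO connection held.
    summary = await llm_client.summarize(profile["bio"])

    # Borrow #2: quick write on a fresh, healthy connection.
    async with pool.acquire() as conn:
        await conn.execute(
            "UPDATE users SET summary = $1 WHERE id = $2", summary, user_id
        )
```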
Defense in Depth
Production systems should have multiple layers of protection:
Each layer catches what the previous layer missed. No single point of failure.
When Things Fail: The Defense in Action
We've talked about defenses. Now let's see them actually work when things go wrong.
The database closed connection #3 after 30 seconds of inactivity. Your app doesn't know yet.
The pool's health check (`SELECT 1`) on connection #3 FAILS, so the pool discards the dead connection and hands your code a freshly opened one instead.
Health checks catch dead connections before your code tries to use them. The failure is handled silently.
Good news: Most of this is handled by your connection pool library. You just need to configure it right.
- Health checks: `pool_pre_ping=True` (SQLAlchemy) or `testOnBorrow=true` (HikariCP)
- Keepalives: `pool_keepalive=30` or set TCP keepalive in the connection string
- Retries: use `tenacity` (Python) or a similar retry library
- Connection errors: let them bubble up so the retry logic can handle them
Common Failure Patterns
When things go wrong, these are the usual suspects. Knowing their names helps you diagnose faster.
| Pattern Name | Symptom | Root Cause |
|---|---|---|
| The Phantom Disconnect | `connection closed` on random requests | Database idle timeout closed the connection silently |
| The Pool Drought | `pool exhausted` under load | Too many concurrent requests, all connections busy |
| The Zombie Connection | Connection looks open but queries hang forever | Network issues severed the connection, no keepalive detected it |
| The Slow Leak | Pool slowly runs out over hours/days | Code borrows connections but doesn't return them properly |
The Tradeoffs Table
There's no perfect fix. Each solution has pros and cons:
Most production systems use a combination: health checks + retry logic as a baseline, with keepalive or increased timeout as needed.
Decision Framework
Here's how to think about fixing connection issues: if you control the database settings, set the idle timeout to 5+ minutes so it comfortably outlasts your longest operations. If you don't (many managed databases restrict this), lean on app-side workarounds: keepalive pings, health checks, and retry logic.
Seeing random connection errors?
Check your database timeout settings. Most cloud databases (AWS RDS, Supabase, etc.) have aggressive 30-60 second defaults.
Have long-running operations?
If you call external APIs (especially LLMs) that take 30+ seconds, add retry logic around your database queries.
Want to prevent issues?
Monitor your connection pool. Track active/idle connections, wait times, and error rates. Problems show up in metrics before they hit users.
The Problem
- Connections take 50-200ms to open
- Idle connections get killed by DB timeouts
- Your app doesn't know the connection died
Key Numbers
- TCP handshake: 10-30ms
- TLS negotiation: 20-50ms
- Auth exchange: 10-30ms
- Default idle timeout: 30-60s
Solutions
- Pooling: Reuse connections
- Timeouts: Increase on DB side
- Keepalive: Ping idle connections
- Retry: Handle transient failures
The 95% problem isn't a code bug. It's a mismatch between your database's timeout settings and your application's behavior. Once you understand the invisible handshake happening between your app and database, these random failures become predictable—and fixable.
وَاللَّهُ أَعْلَمُ
And Allah knows best
وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ
May Allah's peace and blessings be upon our master Muhammad and his family