System Design

Designing for 10,000 Requests/Second


Bahgat Bahgat Ahmed · February 2026 · 20 min read
Tags: Caching · Queuing · Connection Pooling · Load Balancing · Replication · Sharding

بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

In the name of Allah, the Most Gracious, the Most Merciful

Your API handles 100 requests per second beautifully. Then you get featured on Hacker News.

Traffic spikes to 10,000 requests per second. Your database melts. Your servers crash. Users see 503 errors.

And you're scrambling at 2 AM trying to figure out why everything is on fire.

I've been there. And I've learned that scaling isn't magic - it's a toolkit. This is the toolkit.

Quick Summary
  • Scaling is about removing bottlenecks - one at a time, not all at once
  • The toolkit: caching, queuing, pooling, sharding, replication, load balancing
  • Horizontal beats vertical (usually) - but start vertical, it's simpler
  • Design for 10x your current load, not 1000x

Want the full story? Keep reading.

This post is for you if:

  • Your system works but you're worried about "what if traffic grows?"
  • You've hit scaling limits and don't know where to start
  • You want to understand caching, queuing, and load balancing
  • You're designing a new system and want to get it right from the start

Part 1: The Scaling Mindset

The Wrong Approach

"Let's add Kubernetes, Redis, Kafka, and a CDN. Then we'll be ready for scale."

No. You'll be ready for complexity.

Wrong vs Right Approach to Scaling
The Wrong Way

"Add everything at once"

  • Kubernetes before you need it
  • Microservices with 2 engineers
  • Redis "just in case"
  • Sharding before 1M rows
Result: All time on infrastructure, zero on features
The Right Way

"Fix one bottleneck at a time"

  1. Measure where you are
  2. Find THE bottleneck
  3. Fix that one thing
  4. Repeat
Result: Minimal complexity, maximum impact
What's the difference between premature optimization and premature scaling?
Premature Optimization

Making code faster before it's slow. Micro-optimizing a function that runs once per request.

Premature Scaling

Adding infrastructure before you need it. Sharding a database with 10,000 rows.

Both waste time. But premature scaling also adds: operational complexity, more things that can break, higher costs, and slower development.

The 10x Rule

Design for 10x your current traffic. Not 100x. Not 1000x.

Why?

  • 10x is achievable with known techniques
  • 100x often requires architectural changes
  • When you hit 10x, you'll have money/time to redesign for 100x
Know Your Numbers

Before scaling, you must know these metrics. If you don't, you can't scale intelligently:

Current RPS: ___
Peak RPS: ___
DB queries/request: ___
Avg response time: ___
P95 response time: ___
Error rate: ___

When NOT to Scale

This section exists because context matters. Scaling advice without context is dangerous.

Don't Scale When...

PoC/MVP Stage

You have 50 users but asking about 1 million.

Focus on: Does anyone want this product?

Haven't Measured

"We need caching" but don't know what's slow.

This is guessing, not engineering.

Team is Small

2 engineers + Kubernetes + Microservices = ?

Result: All time on infra, zero on features.
Quick Check

Your startup has 200 daily active users. The CEO asks: "Should we add Redis for caching?" What's your answer?

  A. Yes - caching is a best practice, better to add it now
  B. First, measure what's slow. Then decide if caching is the solution.
  C. No - 200 users is too small to ever need caching

The answer is B. "Best practices" without measurement is guessing. Maybe your bottleneck is a slow external API, an N+1 query, or unindexed database columns - caching won't help any of those. Even with 200 users you might have a bottleneck that caching doesn't solve, and even with 10,000 users you might not need caching if your queries are already fast. Measure first, scale second.

Scale When...

Signals That Mean "Time to Scale"
| Signal | What It Means | Then Consider |
|---|---|---|
| Measured bottleneck exists | You've profiled and found THE slow thing | Fix that specific bottleneck |
| Database CPU > 70% sustained | Database is working too hard | Query optimization, then read replicas |
| App servers maxed out | CPU/memory at limits | Horizontal scaling |
| Response time degrading under load | System can't keep up | Caching, then more resources |
| Paying customers waiting | Real business impact | Now it's worth the investment |

The golden rule: Don't solve problems you don't have. Measure, identify, fix, repeat.

Find the Bottleneck First

Your system is only as fast as its slowest component. This is the bottleneck principle.

The Bottleneck Principle

Request -> Web Server (10ms) -> App Server (50ms) -> Database (200ms, BOTTLENECK) -> Response

Optimizing the web server (10ms to 5ms) saves 5ms. Optimizing the database (200ms to 50ms) saves 150ms.

Always fix the biggest bottleneck first. Optimizing the wrong component wastes time.

Common Bottlenecks

| Symptom | Likely Bottleneck | How to Confirm |
|---|---|---|
| High CPU on app servers | Inefficient code | Profile the code |
| High CPU on database | Expensive queries | Check the slow query log |
| Low CPU, high latency | Waiting for I/O | Check connection pools |
| Memory keeps growing | Memory leak | Profile memory |
| Errors under load | Connection exhaustion | Check pool sizes |
| Sudden failures | Resource limits | Check ulimits, max connections |

The 3 Questions

Before adding any infrastructure, ask:

  1. Where is time being spent? (Profiling)
  2. What resource is exhausted? (Monitoring)
  3. What's the simplest fix? (Usually not "add more servers")
Part 2: The Scaling Toolkit

The Complete Scaling Toolkit

Here are the tools in your scaling arsenal. Each solves a specific problem - the key is knowing when to use each one.

The Scaling Toolkit at a Glance

  • Caching - store computed results for reuse. Use when: repeated reads of the same data.
  • Queuing - defer work for later processing. Use when: heavy processing, tasks > 1 second.
  • Pooling - reuse expensive connections. Use when: database/API calls.
  • Load Balancing - distribute traffic across servers. Use when: running multiple servers.
  • Replication - copy data for read scaling. Use when: read-heavy workloads.
  • Sharding - split data across databases. Use when: write-heavy workloads - and as a last resort.
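
Pooling gets less attention than the others, so here is a minimal sketch of the idea: pre-create a fixed number of connections and recycle them instead of opening a new one per request. The `create_conn` factory and plain `object()` stand-ins are hypothetical; a real app would use its database driver's pool (e.g. what SQLAlchemy or pgbouncer provide).

```python
import queue

class ConnectionPool:
    """Minimal connection pool: create N connections up front, hand them out, take them back."""
    def __init__(self, create_conn, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(create_conn())

    def acquire(self, timeout=None):
        # Blocks until a connection is free instead of opening a new one.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

# Usage with stand-in "connections" (plain objects here):
pool = ConnectionPool(create_conn=lambda: object(), size=2)
c1 = pool.acquire()
c2 = pool.acquire()
pool.release(c1)
c3 = pool.acquire()  # reuses c1 rather than creating a third connection
```

The payoff is bounded resource use: under load, requests queue briefly for a connection instead of exhausting the database's connection limit.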

Caching: The Most Powerful Tool

Caching is often your first and best option. It works because the fastest request is one you don't have to make.

What is caching? (Analogy + How it works)
The Library Help Desk

Imagine you work at a library help desk. People keep asking "Where are the Harry Potter books?" 50 times a day. Without a sticky note, you walk to the back office every time. With a sticky note at your desk saying "Harry Potter = Aisle 7, Shelf 3" - instant answer!

How Caching Actually Works

  1. Request comes in: "Get user #123"
  2. Check the cache first (1-5ms)
  3. Cache HIT? Return immediately. Cache MISS? Query the database (100-500ms), store the result in the cache, then return it.

Popular options: Redis (most popular) and Memcached (simple, fast).
Cache Levels: From Fastest to Slowest

  • Browser cache - user's device, 0ms latency. Static assets.
  • CDN (edge) - edge servers, 10-50ms. Static plus some dynamic content.
  • Application cache - app memory, < 1ms. Hot data.
  • Distributed cache (Redis) - network call, 1-5ms. Shared across servers.
  • Database (no cache) - disk I/O, 100-500ms. Source of truth.

Each level is 10-100x slower than the one above it. Start from the top.

When to use caching (and when NOT to)
Use WHEN
  • Same data requested repeatedly
  • Data doesn't change often
  • Stale data is acceptable briefly
  • Database is the bottleneck
DON'T use when
  • Every request is unique
  • Data changes every second
  • Consistency is critical (bank balance)
  • Cache invalidation is complex
The trade-off: Caching trades consistency for speed. You might serve slightly stale data. For most apps, this is fine. For bank balances, it's not.

Queuing: Defer Heavy Work

For work that doesn't need an immediate response, queuing lets you return quickly to the user while processing in the background.

Queuing: User Doesn't Wait

Without a queue: User -> App -> Process (5s). The user waits 5 seconds. Timeout risk. Bad UX.

With a queue: User -> App -> Queue. Instant response: "Job submitted". A worker processes the job in the background.

Good candidates: email sending, image processing, report generation, webhook delivery - anything that takes > 1 second.
Use WHEN
  • Task takes > 1 second
  • User doesn't need immediate result
  • Work can fail and retry later
  • Processing spikes exceed capacity
DON'T use when
  • User needs immediate response
  • Task is quick (< 500ms)
  • Order of execution matters strictly
  • Adds complexity you don't need yet
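
A sketch of the pattern with Python's standard library: the handler enqueues and returns immediately, a worker thread drains the queue. The names (`submit_email`, `worker`) are illustrative; production systems use a broker like Celery, Sidekiq, or SQS so jobs survive restarts.

```python
import queue
import threading

jobs = queue.Queue()
results = []

def worker():
    # Runs in the background, pulling jobs off the queue one at a time.
    while True:
        address = jobs.get()
        results.append(f"sent email to {address}")   # the slow work happens here
        jobs.task_done()

def submit_email(address):
    """The request handler: enqueue and return instantly ("job submitted")."""
    jobs.put(address)
    return {"status": "submitted"}

threading.Thread(target=worker, daemon=True).start()
submit_email("a@example.com")
submit_email("b@example.com")
jobs.join()   # only for this demo; a real handler never waits for the worker
```

Note the in-memory queue is lost if the process dies - that durability gap is exactly what a real broker closes.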

Load Balancing: Distribute Traffic

When one server isn't enough, load balancing distributes requests across multiple servers.

Load Balancing: Traffic Distribution

Load Balancer -> Server 1 / Server 2 / Server 3

| Algorithm | How It Works | Best For |
|---|---|---|
| Round Robin | Each server in turn | Equal servers |
| Least Connections | Server with fewest active requests | Varying request times |
| IP Hash | Same user to same server | Session affinity |
| Weighted | More traffic to stronger servers | Mixed hardware |
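
The first two algorithms are simple enough to sketch in a few lines. This is illustrative only - in practice the balancer is nginx, HAProxy, or a cloud LB, not application code - and the server names are made up.

```python
import itertools

class RoundRobin:
    """Hand out servers in turn."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Hand out the server with the fewest active requests."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def done(self, server):
        # Call when the request finishes, so counts stay accurate.
        self.active[server] -= 1

rr = RoundRobin(["s1", "s2", "s3"])
picks = [rr.pick() for _ in range(4)]   # cycles: s1, s2, s3, s1
```

Round robin is stateless and fair when servers and requests are uniform; least connections adapts when some requests run much longer than others.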
Use WHEN
  • Single server can't handle load
  • Need high availability (redundancy)
  • Zero-downtime deployments needed
  • Horizontal scaling strategy
DON'T use when
  • Single server handles your traffic fine
  • App has server-local state (fix that first)
  • Vertical scaling is still cheaper
  • Development/testing environments

Horizontal vs Vertical Scaling

Vertical (scale up): a bigger machine - from 4 CPU/8GB to 32 CPU/64GB. Simple, no code changes, but it has hard limits and remains a single point of failure.

Horizontal (scale out): more machines. No hard ceiling and built-in redundancy, but it adds complexity and requires stateless design.

The Rule: Start vertical (it's simpler). Go horizontal when vertical hits limits. Design for horizontal from the start (stateless).

Database Scaling

The database is usually the bottleneck. Why? Disk I/O is slow, locks cause contention, connections are limited, and data must be consistent.

Read Scaling: Replicas

If your workload is read-heavy (most web apps), read replicas can dramatically increase capacity.

Read Replicas: Scale Read Capacity

Primary (writes only) -> Replica 1 / Replica 2 / Replica 3 (reads)
Watch Out: Replication Lag

Replicas sync from primary asynchronously. If you write data and immediately read from a replica, you might get stale data. For critical reads (like showing a user their own data after an update), read from primary.
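
One common way to handle this is "read your own writes" routing: remember when each user last wrote, and pin their reads to the primary for a short window. This is a sketch under assumptions - the 2-second window is a made-up stand-in for your measured replication lag, and `ReadRouter` is a hypothetical name.

```python
import time

class ReadRouter:
    """Send reads to a replica, except right after that user wrote (lag window)."""
    def __init__(self, lag_window=2.0):
        self.last_write = {}          # user_id -> timestamp of their last write
        self.lag_window = lag_window  # should exceed your observed replication lag

    def record_write(self, user_id):
        self.last_write[user_id] = time.time()

    def choose(self, user_id):
        wrote_recently = time.time() - self.last_write.get(user_id, 0) < self.lag_window
        return "primary" if wrote_recently else "replica"

router = ReadRouter(lag_window=2.0)
router.record_write("u1")
router.choose("u1")   # "primary": their write may not have reached replicas yet
router.choose("u2")   # "replica": they haven't written, stale risk is acceptable
```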

Write Scaling: Sharding

Sharding splits your data across multiple databases. It's powerful but complex - avoid it as long as possible.

Deep dive: Sharding strategies and trade-offs
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Range | IDs 1-1000 to shard 1, etc. | Simple, supports range queries | Uneven distribution |
| Hash | hash(ID) mod N | Even distribution | No range queries |
| Geographic | Region to shard | Data locality | Cross-region queries |
| Tenant | Customer to shard | Isolation | Varying tenant sizes |

The Hard Parts of Sharding
  • Cross-shard queries (expensive)
  • Rebalancing when adding shards
  • Transactions across shards
  • Unique constraints across shards
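
The hash strategy from the table is tiny in code - the hard parts listed above are everything around it. A sketch (shard names are hypothetical; note the stable hash, since Python's built-in `hash()` varies per process):

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(user_id):
    """Hash sharding: hash the key, mod by shard count. Even spread, no range queries."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Every lookup for a given ID lands on the same shard, but "all users created last week" now touches all four - and adding a fifth shard remaps most keys, which is why real systems reach for consistent hashing or directory-based schemes.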

Query Optimization: Before You Scale Hardware

Before adding more hardware, optimize what you have. Often the fix is simpler than you think.

Quick Wins: Query Optimization

  • Add indexes: 1000ms (full table scan) to 5ms (index scan)
  • Fix N+1 queries: 101 queries (1 for users + 100 for orders) to 2 queries (eager loading)
  • Paginate results: 1M rows (load everything) to 100 rows (LIMIT + cursor)
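
The N+1 fix is worth seeing concretely. A minimal sketch with an in-memory SQLite database (the tables and data are made up for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'amina'), (2, 'omar');
    INSERT INTO orders VALUES (1, 1, 9.99), (2, 1, 5.00), (3, 2, 20.00);
""")

# N+1: one query for users, then one more query PER user for their orders.
users = db.execute("SELECT id, name FROM users").fetchall()
for uid, _name in users:
    db.execute("SELECT total FROM orders WHERE user_id = ?", (uid,)).fetchall()
# => 1 + N queries; with 100 users that's 101 round trips

# Fix: fetch all orders for those users in one second query (eager loading).
ids = [uid for uid, _name in users]
placeholders = ",".join("?" * len(ids))
orders = db.execute(
    f"SELECT user_id, total FROM orders WHERE user_id IN ({placeholders})", ids
).fetchall()
# => 2 queries total, regardless of how many users
```

ORMs do this for you when asked - e.g. eager-loading options exist in most of them - but the generated SQL is the same idea: batch the child lookup instead of looping it.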

API Design for Scale

How you design your API affects how well it scales. Here are the patterns that matter.

Rate Limiting

Protect your API from overuse. Without rate limiting, one bad actor (or one bug) can take down your entire system.
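
A common implementation is the token bucket: tokens refill at a steady rate, each request spends one, and an empty bucket means 429. A self-contained sketch (the 100/minute rate and burst of 10 are example numbers, and production systems keep the counters in Redis so all servers share them):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: `rate` tokens/second, bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should respond 429 Too Many Requests

bucket = TokenBucket(rate=100 / 60, capacity=10)   # ~100 req/min, burst of 10
```

The bucket allows short bursts up to `capacity` while enforcing the average rate - gentler than a hard fixed window, which lets 2x the limit through at window boundaries.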

Pagination

Never return unbounded lists. If someone can request GET /users and get 1 million rows, you have a problem.
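
Cursor pagination fixes this: the client asks for "rows after cursor X", so every response is bounded. A sketch over an in-memory list standing in for the users table (`list_users` and the 250-row dataset are illustrative; a real version is `WHERE id > ? ORDER BY id LIMIT ?`):

```python
ROWS = [{"id": i, "name": f"user-{i}"} for i in range(1, 251)]   # stand-in table

def list_users(limit=100, cursor=0):
    """Cursor pagination: 'rows with id > cursor', never the whole table."""
    page = [r for r in ROWS if r["id"] > cursor][:limit]
    # A full page may have more after it; a short page is the last one.
    next_cursor = page[-1]["id"] if len(page) == limit else None
    return {"items": page, "next_cursor": next_cursor}

page1 = list_users()                              # ids 1-100
page2 = list_users(cursor=page1["next_cursor"])   # ids 101-200
page3 = list_users(cursor=page2["next_cursor"])   # ids 201-250, next_cursor=None
```

Unlike `OFFSET`, a cursor stays fast on page 10,000 (the database seeks the index instead of skipping a million rows) and stays stable when rows are inserted mid-scroll.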

Async Operations

For long-running tasks, return immediately with a job ID, not the result.
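
The shape of that pattern, sketched with an in-memory job store (the function names and the `JOBS` dict are illustrative; a real system keeps job state in Redis or a database and runs the work through the queue from earlier):

```python
import uuid

JOBS = {}   # job_id -> state; stand-in for Redis or a jobs table

def submit_report(params):
    """POST /reports: register the job and respond immediately with its ID."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "processing", "params": params, "result": None}
    return {"job_id": job_id, "status": "processing"}

def get_report(job_id):
    """GET /reports/{job_id}: clients poll this until status is 'done'."""
    return JOBS[job_id]

def worker_finish(job_id, result):
    # A background worker calls this when the heavy work completes.
    JOBS[job_id].update(status="done", result=result)
```

The client polls `get_report` (or receives a webhook) instead of holding a connection open for the full run.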

API Design Patterns for Scale

  • Rate limiting: a 100 requests/minute limit. Hit the limit? 429 Too Many Requests, with headers like X-RateLimit-Remaining: 45.
  • Pagination: Bad: GET /users returns 1M rows. Good: GET /users?limit=100&cursor=abc.
  • Compression: 500KB before, 50KB after gzip.
  • Async operations: POST /reports responds with {"job_id": "abc123", "status": "processing"}.
Part 3: Real-World Application

Real Example: 100 to 10,000 RPS

Let's walk through a real scaling journey, week by week. This is how you'd actually do it in practice.

The Scaling Journey: 8 Weeks

Starting point: 100 RPS peak, 200ms average response, 1 app server, 1 PostgreSQL database.

  1. Week 1-2: Find the bottleneck. DB CPU at 90%, app CPU at 30% - the database is the bottleneck.
  2. Week 2-3: Add caching (Redis). DB load 90% to 40%, response 200ms to 80ms. Capacity: 300 RPS.
  3. Week 3-4: Optimize queries. Added 5 indexes, fixed 3 N+1 queries. DB load 40% to 15%. Capacity: 800 RPS.
  4. Week 4-5: Add read replicas. 2 read replicas with reads distributed. Primary load 15% to 5%. Capacity: 2,000 RPS.
  5. Week 5-6: Horizontal scaling. Load balancer + 3 app servers, each handling 700 RPS. Capacity: 2,100 RPS.
  6. Week 6-7: CDN + edge caching. 40% of requests served by the CDN. Capacity: 5,000+ RPS.
  7. Week 7-8: Queue heavy work. Email and reports moved to a queue. Response 50ms to 30ms. Capacity: 10,000 RPS.
Final Architecture: 10,000 RPS
                     CDN (static + cached)
                            |
Users ---> Load Balancer ---+---> App Server 1 ---> Redis Cache
                            |                            |
                            +---> App Server 2 ---> Queue ---> Workers
                            |                            |
                            +---> App Server 3 ---> Primary DB
                                                         |
                                                   Read Replicas
            
Final result: 10,000 RPS capacity, 30ms average response, 8 weeks total.

The Scaling Checklist

Use this checklist when you need to scale. Work through it in order - quick wins first, major changes last.

Before You Scale

  • Know your current RPS and latency
  • Identify the bottleneck with data
  • Set target numbers (success criteria)

Quick Wins (Do First)

  • Add database indexes for slow queries
  • Fix N+1 queries
  • Enable gzip compression
  • Add connection pooling
  • Add caching for repeated reads

Medium Effort

  • Add Redis/Memcached for distributed cache
  • Add CDN for static assets
  • Implement rate limiting
  • Add database read replicas
  • Move heavy work to background queues

Major Changes

  • Horizontal scaling with load balancer
  • Database sharding (if writes bottleneck)
  • Microservices (if monolith is the issue)
  • Multi-region deployment

Always

  • Monitor everything
  • Load test before launching
  • Have a rollback plan
  • Document what you changed

Key Takeaways

  • Find bottleneck first - don't guess
  • Caching is your friend - fastest request = no request
  • Design for 10x, not 1000x
  • Database is usually the problem

Practice Mode: Test Your Scaling Intuition

Scenario 1: Your e-commerce site's database CPU is at 85% and climbing during peak hours. The application servers are sitting at 20% CPU. Response times are degrading. What should you scale first?

  A. Add more application servers - they can handle more load
  B. Focus on the database - add read replicas or optimize queries
  C. Add caching in front of everything

Scenario 2: Your API receives 1,000 RPS. Users upload profile images, which get processed (resize, compress, thumbnail). Image processing takes 3-5 seconds. Users complain about slow uploads. What's the best approach?

  A. Get faster servers to process images quicker
  B. Add a CDN to serve images faster
  C. Queue the processing - accept the upload immediately, process in the background

Scenario 3: You have a single server handling 500 RPS at 40% CPU utilization. Your CEO asks: "Should we add Kubernetes and microservices for scalability?" What's your response?

  A. Yes - better to prepare for growth now before we need it
  B. Not yet - we're at 40% capacity with room to grow. Add complexity when the data demands it.
  C. Maybe - it depends on what technologies our competitors use

Related Posts in This Series

  • Caching Strategies Deep Dive
  • Database Connections
  • Async Processing & Queues
  • Failure Handling
  • Building Resilient Systems

والله أعلم - And Allah knows best.

Scaling isn't about adding complexity. It's about removing bottlenecks.
Start simple. Measure everything. Add complexity only when the data tells you to.
