System Design

Designing for 10,000 Requests/Second


Bahgat Bahgat Ahmed · February 2026 · 20 min read
Tags: Caching · Queuing · Connection Pooling · Load Balancing · Replication · Sharding

بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

In the name of Allah, the Most Gracious, the Most Merciful

Your API handles 100 requests per second beautifully. Then you get featured on Hacker News.

Traffic spikes to 10,000 requests per second. Your database melts. Your servers crash. Users see 503 errors.

And you're scrambling at 2 AM trying to figure out why everything is on fire.

I've been there. And I've learned that scaling isn't magic - it's a toolkit. This is the toolkit.

Quick Summary
  • Scaling is about removing bottlenecks - one at a time, not all at once
  • The toolkit: caching, queuing, pooling, sharding, replication, load balancing
  • Horizontal beats vertical (usually) - but start vertical, it's simpler
  • Design for 10x your current load, not 1000x

Want the full story? Keep reading.

This post is for you if:

  • Your system works but you're worried about "what if traffic grows?"
  • You've hit scaling limits and don't know where to start
  • You want to understand caching, queuing, and load balancing
  • You're designing a new system and want to get it right from the start

Part 1: The Scaling Mindset

The Wrong Approach

"Let's add Kubernetes, Redis, Kafka, and a CDN. Then we'll be ready for scale."

No. You'll be ready for complexity.

Wrong vs Right Approach to Scaling
The Wrong Way

"Add everything at once"

  • Kubernetes before you need it
  • Microservices with 2 engineers
  • Redis "just in case"
  • Sharding before 1M rows
Result: All time on infrastructure, zero on features
The Right Way

"Fix one bottleneck at a time"

  1. Measure where you are
  2. Find THE bottleneck
  3. Fix that one thing
  4. Repeat
Result: Minimal complexity, maximum impact
What's the difference between premature optimization and premature scaling?
Premature Optimization

Making code faster before it's slow. Micro-optimizing a function that runs once per request.

Premature Scaling

Adding infrastructure before you need it. Sharding a database with 10,000 rows.

Both waste time. But premature scaling also adds: operational complexity, more things that can break, higher costs, and slower development.

The 10x Rule

Design for 10x your current traffic. Not 100x. Not 1000x.

Why?

  • 10x is achievable with known techniques
  • 100x often requires architectural changes
  • When you hit 10x, you'll have money/time to redesign for 100x
Know Your Numbers

Before scaling, you must know these metrics. If you don't, you can't scale intelligently:

Current RPS: ___
Peak RPS: ___
DB queries/request: ___
Avg response time: ___
P95 response time: ___
Error rate: ___

When NOT to Scale

This section exists because context matters. Scaling advice without context is dangerous.

Don't Scale When...

PoC/MVP Stage

You have 50 users but asking about 1 million.

Focus on: Does anyone want this product?

Haven't Measured

"We need caching" but don't know what's slow.

This is guessing, not engineering.

Team is Small

2 engineers + Kubernetes + Microservices = ?

Result: All time on infra, zero on features.
Quick Check

Your startup has 200 daily active users. The CEO asks: "Should we add Redis for caching?" What's your answer?

  A. Yes - caching is a best practice, better to add it now
  B. First, measure what's slow. Then decide if caching is the solution.
  C. No - 200 users is too small to ever need caching

The answer is B. "Best practices" without measurement is guessing. Maybe your bottleneck is a slow external API, an N+1 query, or unindexed database columns - caching won't help any of those. Even with 200 users you might have a bottleneck that caching doesn't solve, and even with 10,000 users you might not need caching if your queries are already fast. Measure first, scale second.

Scale When...

Signals That Mean "Time to Scale"
| Signal | What It Means | Then Consider |
|---|---|---|
| Measured bottleneck exists | You've profiled and found THE slow thing | Fix that specific bottleneck |
| Database CPU > 70% sustained | Database is working too hard | Query optimization, then read replicas |
| App servers maxed out | CPU/memory at limits | Horizontal scaling |
| Response time degrading under load | System can't keep up | Caching, then more resources |
| Paying customers waiting | Real business impact | Now it's worth the investment |

The golden rule: Don't solve problems you don't have. Measure, identify, fix, repeat.

Find the Bottleneck First

Your system is only as fast as its slowest component. This is the bottleneck principle.

The Bottleneck Principle

Request -> Web Server (10ms) -> App Server (50ms) -> Database (200ms, BOTTLENECK) -> Response

Optimizing the web server (10ms to 5ms) saves 5ms. Optimizing the database (200ms to 50ms) saves 150ms.

Always fix the biggest bottleneck first. Optimizing the wrong component wastes time.

Common Bottlenecks

| Symptom | Likely Bottleneck | How to Confirm |
|---|---|---|
| High CPU on app servers | Inefficient code | Profile the code |
| High CPU on database | Expensive queries | Check the slow query log |
| Low CPU, high latency | Waiting for I/O | Check connection pools |
| Memory keeps growing | Memory leak | Profile memory |
| Errors under load | Connection exhaustion | Check pool sizes |
| Sudden failures | Resource limits | Check ulimits, max connections |

The 3 Questions

Before adding any infrastructure, ask:

  1. Where is time being spent? (Profiling)
  2. What resource is exhausted? (Monitoring)
  3. What's the simplest fix? (Usually not "add more servers")
Part 2: The Scaling Toolkit

The Complete Scaling Toolkit

Here are the tools in your scaling arsenal. Each solves a specific problem - the key is knowing when to use each one.

The Scaling Toolkit at a Glance

  • Caching - store computed results for reuse. Use when: repeated reads of the same data.
  • Queuing - defer work for later processing. Use when: heavy processing, tasks > 1 second.
  • Pooling - reuse expensive connections. Use when: database/API calls.
  • Load Balancing - distribute traffic across servers. Use when: running multiple servers.
  • Replication - copy data for read scaling. Use when: read-heavy workloads.
  • Sharding - split data across databases. Use when: write-heavy workloads - and as a last resort.
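
Pooling gets less attention than the others, so here is a minimal sketch of the idea: pre-create a fixed number of connections and recycle them instead of opening a new one per request. The `create_conn` factory and plain `object()` stand-ins are hypothetical; a real app would use its database driver's pool (e.g. what SQLAlchemy or pgbouncer provide).

```python
import queue

class ConnectionPool:
    """Minimal connection pool: create N connections up front, hand them out, take them back."""
    def __init__(self, create_conn, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(create_conn())

    def acquire(self, timeout=None):
        # Blocks until a connection is free instead of opening a new one.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

# Usage with stand-in "connections" (plain objects here):
pool = ConnectionPool(create_conn=lambda: object(), size=2)
c1 = pool.acquire()
c2 = pool.acquire()
pool.release(c1)
c3 = pool.acquire()  # reuses c1 rather than creating a third connection
```

The payoff is bounded resource use: under load, requests queue briefly for a connection instead of exhausting the database's connection limit.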

Caching: The Most Powerful Tool

Caching is often your first and best option. It works because the fastest request is one you don't have to make.

What is caching? (Analogy + How it works)
The Library Help Desk

Imagine you work at a library help desk. People keep asking "Where are the Harry Potter books?" 50 times a day. Without a sticky note, you walk to the back office every time. With a sticky note at your desk saying "Harry Potter = Aisle 7, Shelf 3" - instant answer!

How Caching Actually Works

  1. Request comes in: "Get user #123"
  2. Check the cache first (1-5ms)
  3. Cache HIT? Return immediately. Cache MISS? Query the database (100-500ms), store the result in the cache, then return it.

Popular options: Redis (most popular) and Memcached (simple, fast).
Cache Levels: From Fastest to Slowest

  • Browser cache - user's device, 0ms latency. Static assets.
  • CDN (edge) - edge servers, 10-50ms. Static plus some dynamic content.
  • Application cache - app memory, < 1ms. Hot data.
  • Distributed cache (Redis) - network call, 1-5ms. Shared across servers.
  • Database (no cache) - disk I/O, 100-500ms. Source of truth.

Each level is 10-100x slower than the one above it. Start from the top.

When to use caching (and when NOT to)
Use WHEN
  • Same data requested repeatedly
  • Data doesn't change often
  • Stale data is acceptable briefly
  • Database is the bottleneck
DON'T use when
  • Every request is unique
  • Data changes every second
  • Consistency is critical (bank balance)
  • Cache invalidation is complex
The trade-off: Caching trades consistency for speed. You might serve slightly stale data. For most apps, this is fine. For bank balances, it's not.

Queuing: Defer Heavy Work

For work that doesn't need an immediate response, queuing lets you return quickly to the user while processing in the background.

Queuing: User Doesn't Wait

Without a queue: User -> App -> Process (5s). The user waits 5 seconds. Timeout risk. Bad UX.

With a queue: User -> App -> Queue. Instant response: "Job submitted". A worker processes the job in the background.

Good candidates: email sending, image processing, report generation, webhook delivery - anything that takes > 1 second.
Use WHEN
  • Task takes > 1 second
  • User doesn't need immediate result
  • Work can fail and retry later
  • Processing spikes exceed capacity
DON'T use when
  • User needs immediate response
  • Task is quick (< 500ms)
  • Order of execution matters strictly
  • Adds complexity you don't need yet
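
A sketch of the pattern with Python's standard library: the handler enqueues and returns immediately, a worker thread drains the queue. The names (`submit_email`, `worker`) are illustrative; production systems use a broker like Celery, Sidekiq, or SQS so jobs survive restarts.

```python
import queue
import threading

jobs = queue.Queue()
results = []

def worker():
    # Runs in the background, pulling jobs off the queue one at a time.
    while True:
        address = jobs.get()
        results.append(f"sent email to {address}")   # the slow work happens here
        jobs.task_done()

def submit_email(address):
    """The request handler: enqueue and return instantly ("job submitted")."""
    jobs.put(address)
    return {"status": "submitted"}

threading.Thread(target=worker, daemon=True).start()
submit_email("a@example.com")
submit_email("b@example.com")
jobs.join()   # only for this demo; a real handler never waits for the worker
```

Note the in-memory queue is lost if the process dies - that durability gap is exactly what a real broker closes.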

Load Balancing: Distribute Traffic

When one server isn't enough, load balancing distributes requests across multiple servers.

Load Balancing: Traffic Distribution

Load Balancer -> Server 1 / Server 2 / Server 3

| Algorithm | How It Works | Best For |
|---|---|---|
| Round Robin | Each server in turn | Equal servers |
| Least Connections | Server with fewest active requests | Varying request times |
| IP Hash | Same user to same server | Session affinity |
| Weighted | More traffic to stronger servers | Mixed hardware |
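
The first two algorithms are simple enough to sketch in a few lines. This is illustrative only - in practice the balancer is nginx, HAProxy, or a cloud LB, not application code - and the server names are made up.

```python
import itertools

class RoundRobin:
    """Hand out servers in turn."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Hand out the server with the fewest active requests."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def done(self, server):
        # Call when the request finishes, so counts stay accurate.
        self.active[server] -= 1

rr = RoundRobin(["s1", "s2", "s3"])
picks = [rr.pick() for _ in range(4)]   # cycles: s1, s2, s3, s1
```

Round robin is stateless and fair when servers and requests are uniform; least connections adapts when some requests run much longer than others.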
Use WHEN
  • Single server can't handle load
  • Need high availability (redundancy)
  • Zero-downtime deployments needed
  • Horizontal scaling strategy
DON'T use when
  • Single server handles your traffic fine
  • App has server-local state (fix that first)
  • Vertical scaling is still cheaper
  • Development/testing environments

Horizontal vs Vertical Scaling

Vertical (scale up): a bigger machine - from 4 CPU/8GB to 32 CPU/64GB. Simple, no code changes, but it has hard limits and remains a single point of failure.

Horizontal (scale out): more machines. No hard ceiling and built-in redundancy, but it adds complexity and requires stateless design.

The Rule: Start vertical (it's simpler). Go horizontal when vertical hits limits. Design for horizontal from the start (stateless).

Database Scaling

The database is usually the bottleneck. Why? Disk I/O is slow, locks cause contention, connections are limited, and data must be consistent.

Read Scaling: Replicas

If your workload is read-heavy (most web apps), read replicas can dramatically increase capacity.

Read Replicas: Scale Read Capacity

Primary (writes only) -> Replica 1 / Replica 2 / Replica 3 (reads)
Watch Out: Replication Lag

Replicas sync from primary asynchronously. If you write data and immediately read from a replica, you might get stale data. For critical reads (like showing a user their own data after an update), read from primary.
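
One common way to handle this is "read your own writes" routing: remember when each user last wrote, and pin their reads to the primary for a short window. This is a sketch under assumptions - the 2-second window is a made-up stand-in for your measured replication lag, and `ReadRouter` is a hypothetical name.

```python
import time

class ReadRouter:
    """Send reads to a replica, except right after that user wrote (lag window)."""
    def __init__(self, lag_window=2.0):
        self.last_write = {}          # user_id -> timestamp of their last write
        self.lag_window = lag_window  # should exceed your observed replication lag

    def record_write(self, user_id):
        self.last_write[user_id] = time.time()

    def choose(self, user_id):
        wrote_recently = time.time() - self.last_write.get(user_id, 0) < self.lag_window
        return "primary" if wrote_recently else "replica"

router = ReadRouter(lag_window=2.0)
router.record_write("u1")
router.choose("u1")   # "primary": their write may not have reached replicas yet
router.choose("u2")   # "replica": they haven't written, stale risk is acceptable
```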

Write Scaling: Sharding

Sharding splits your data across multiple databases. It's powerful but complex - avoid it as long as possible.

Deep dive: Sharding strategies and trade-offs
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Range | IDs 1-1000 to shard 1, etc. | Simple, supports range queries | Uneven distribution |
| Hash | hash(ID) mod N | Even distribution | No range queries |
| Geographic | Region to shard | Data locality | Cross-region queries |
| Tenant | Customer to shard | Isolation | Varying tenant sizes |

The Hard Parts of Sharding
  • Cross-shard queries (expensive)
  • Rebalancing when adding shards
  • Transactions across shards
  • Unique constraints across shards
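
The hash strategy from the table is tiny in code - the hard parts listed above are everything around it. A sketch (shard names are hypothetical; note the stable hash, since Python's built-in `hash()` varies per process):

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(user_id):
    """Hash sharding: hash the key, mod by shard count. Even spread, no range queries."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Every lookup for a given ID lands on the same shard, but "all users created last week" now touches all four - and adding a fifth shard remaps most keys, which is why real systems reach for consistent hashing or directory-based schemes.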

Query Optimization: Before You Scale Hardware

Before adding more hardware, optimize what you have. Often the fix is simpler than you think.

Quick Wins: Query Optimization

  • Add indexes: 1000ms (full table scan) to 5ms (index scan)
  • Fix N+1 queries: 101 queries (1 for users + 100 for orders) to 2 queries (eager loading)
  • Paginate results: 1M rows (load everything) to 100 rows (LIMIT + cursor)
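
The N+1 fix is worth seeing concretely. A minimal sketch with an in-memory SQLite database (the tables and data are made up for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'amina'), (2, 'omar');
    INSERT INTO orders VALUES (1, 1, 9.99), (2, 1, 5.00), (3, 2, 20.00);
""")

# N+1: one query for users, then one more query PER user for their orders.
users = db.execute("SELECT id, name FROM users").fetchall()
for uid, _name in users:
    db.execute("SELECT total FROM orders WHERE user_id = ?", (uid,)).fetchall()
# => 1 + N queries; with 100 users that's 101 round trips

# Fix: fetch all orders for those users in one second query (eager loading).
ids = [uid for uid, _name in users]
placeholders = ",".join("?" * len(ids))
orders = db.execute(
    f"SELECT user_id, total FROM orders WHERE user_id IN ({placeholders})", ids
).fetchall()
# => 2 queries total, regardless of how many users
```

ORMs do this for you when asked - e.g. eager-loading options exist in most of them - but the generated SQL is the same idea: batch the child lookup instead of looping it.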

API Design for Scale

How you design your API affects how well it scales. Here are the patterns that matter.

Rate Limiting

Protect your API from overuse. Without rate limiting, one bad actor (or one bug) can take down your entire system.
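
A common implementation is the token bucket: tokens refill at a steady rate, each request spends one, and an empty bucket means 429. A self-contained sketch (the 100/minute rate and burst of 10 are example numbers, and production systems keep the counters in Redis so all servers share them):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: `rate` tokens/second, bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should respond 429 Too Many Requests

bucket = TokenBucket(rate=100 / 60, capacity=10)   # ~100 req/min, burst of 10
```

The bucket allows short bursts up to `capacity` while enforcing the average rate - gentler than a hard fixed window, which lets 2x the limit through at window boundaries.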

Pagination

Never return unbounded lists. If someone can request GET /users and get 1 million rows, you have a problem.
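
Cursor pagination fixes this: the client asks for "rows after cursor X", so every response is bounded. A sketch over an in-memory list standing in for the users table (`list_users` and the 250-row dataset are illustrative; a real version is `WHERE id > ? ORDER BY id LIMIT ?`):

```python
ROWS = [{"id": i, "name": f"user-{i}"} for i in range(1, 251)]   # stand-in table

def list_users(limit=100, cursor=0):
    """Cursor pagination: 'rows with id > cursor', never the whole table."""
    page = [r for r in ROWS if r["id"] > cursor][:limit]
    # A full page may have more after it; a short page is the last one.
    next_cursor = page[-1]["id"] if len(page) == limit else None
    return {"items": page, "next_cursor": next_cursor}

page1 = list_users()                              # ids 1-100
page2 = list_users(cursor=page1["next_cursor"])   # ids 101-200
page3 = list_users(cursor=page2["next_cursor"])   # ids 201-250, next_cursor=None
```

Unlike `OFFSET`, a cursor stays fast on page 10,000 (the database seeks the index instead of skipping a million rows) and stays stable when rows are inserted mid-scroll.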

Async Operations

For long-running tasks, return immediately with a job ID, not the result.
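
The shape of that pattern, sketched with an in-memory job store (the function names and the `JOBS` dict are illustrative; a real system keeps job state in Redis or a database and runs the work through the queue from earlier):

```python
import uuid

JOBS = {}   # job_id -> state; stand-in for Redis or a jobs table

def submit_report(params):
    """POST /reports: register the job and respond immediately with its ID."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "processing", "params": params, "result": None}
    return {"job_id": job_id, "status": "processing"}

def get_report(job_id):
    """GET /reports/{job_id}: clients poll this until status is 'done'."""
    return JOBS[job_id]

def worker_finish(job_id, result):
    # A background worker calls this when the heavy work completes.
    JOBS[job_id].update(status="done", result=result)
```

The client polls `get_report` (or receives a webhook) instead of holding a connection open for the full run.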

API Design Patterns for Scale

  • Rate limiting: a 100 requests/minute limit. Hit the limit? 429 Too Many Requests, with headers like X-RateLimit-Remaining: 45.
  • Pagination: Bad: GET /users returns 1M rows. Good: GET /users?limit=100&cursor=abc.
  • Compression: 500KB before, 50KB after gzip.
  • Async operations: POST /reports responds with {"job_id": "abc123", "status": "processing"}.
Part 3: Real-World Application

Real Example: 100 to 10,000 RPS

Let's walk through a real scaling journey, week by week. This is how you'd actually do it in practice.

The Scaling Journey: 8 Weeks

Starting point: 100 RPS peak, 200ms average response, 1 app server, 1 PostgreSQL database.

  1. Week 1-2: Find the bottleneck. DB CPU at 90%, app CPU at 30% - the database is the bottleneck.
  2. Week 2-3: Add caching (Redis). DB load 90% to 40%, response 200ms to 80ms. Capacity: 300 RPS.
  3. Week 3-4: Optimize queries. Added 5 indexes, fixed 3 N+1 queries. DB load 40% to 15%. Capacity: 800 RPS.
  4. Week 4-5: Add read replicas. 2 read replicas with reads distributed. Primary load 15% to 5%. Capacity: 2,000 RPS.
  5. Week 5-6: Horizontal scaling. Load balancer + 3 app servers, each handling 700 RPS. Capacity: 2,100 RPS.
  6. Week 6-7: CDN + edge caching. 40% of requests served by the CDN. Capacity: 5,000+ RPS.
  7. Week 7-8: Queue heavy work. Email and reports moved to a queue. Response 50ms to 30ms. Capacity: 10,000 RPS.
Final Architecture: 10,000 RPS
                     CDN (static + cached)
                            |
Users ---> Load Balancer ---+---> App Server 1 ---> Redis Cache
                            |                            |
                            +---> App Server 2 ---> Queue ---> Workers
                            |                            |
                            +---> App Server 3 ---> Primary DB
                                                         |
                                                   Read Replicas
            
Final result: 10,000 RPS capacity, 30ms average response, 8 weeks total.

The Scaling Checklist

Use this checklist when you need to scale. Work through it in order - quick wins first, major changes last.

Before You Scale

  • Know your current RPS and latency
  • Identify the bottleneck with data
  • Set target numbers (success criteria)

Quick Wins (Do First)

  • Add database indexes for slow queries
  • Fix N+1 queries
  • Enable gzip compression
  • Add connection pooling
  • Add caching for repeated reads

Medium Effort

  • Add Redis/Memcached for distributed cache
  • Add CDN for static assets
  • Implement rate limiting
  • Add database read replicas
  • Move heavy work to background queues

Major Changes

  • Horizontal scaling with load balancer
  • Database sharding (if writes bottleneck)
  • Microservices (if monolith is the issue)
  • Multi-region deployment

Always

  • Monitor everything
  • Load test before launching
  • Have a rollback plan
  • Document what you changed

Key Takeaways

  • Find bottleneck first - don't guess
  • Caching is your friend - fastest request = no request
  • Design for 10x, not 1000x
  • Database is usually the problem

Practice Mode: Test Your Scaling Intuition

Scenario 1: Your e-commerce site's database CPU is at 85% and climbing during peak hours. The application servers are sitting at 20% CPU. Response times are degrading. What should you scale first?

  A. Add more application servers - they can handle more load
  B. Focus on the database - add read replicas or optimize queries
  C. Add caching in front of everything

Scenario 2: Your API receives 1,000 RPS. Users upload profile images, which get processed (resize, compress, thumbnail). Image processing takes 3-5 seconds. Users complain about slow uploads. What's the best approach?

  A. Get faster servers to process images quicker
  B. Add a CDN to serve images faster
  C. Queue the processing - accept the upload immediately, process in the background

Scenario 3: You have a single server handling 500 RPS at 40% CPU utilization. Your CEO asks: "Should we add Kubernetes and microservices for scalability?" What's your response?

  A. Yes - better to prepare for growth now before we need it
  B. Not yet - we're at 40% capacity with room to grow. Add complexity when the data demands it.
  C. Maybe - it depends on what technologies our competitors use

Related Posts in This Series

  • Caching Strategies Deep Dive
  • Database Connections
  • Async Processing & Queues
  • Failure Handling
  • Building Resilient Systems

والله أعلم - And Allah knows best.

Scaling isn't about adding complexity. It's about removing bottlenecks.
Start simple. Measure everything. Add complexity only when the data tells you to.
