Production Case Study

From Idea to Production

Ahmed's journey building a real SaaS on AWS. Every architecture decision, every tradeoff, every 3 AM incident.



This is the case study version. For the principles framework (what to add at each maturity stage), see PoC to Production.

In the name of Allah, the Most Gracious, the Most Merciful.

January 28, 2026. 3:12 AM.

Ahmed's phone buzzes. Then again. Then it won't stop.

"Your EC2 instance is generating unusual outbound traffic."

He opens CloudWatch. Outbound bandwidth: 10x normal. CPU: 100%. His server—the one running his entire business—is attacking other servers on the internet.

A security vulnerability he didn't know existed. A botnet he's now part of. And AWS will suspend his account in 30 minutes if he doesn't respond.

This is the story of what Ahmed built, what broke, and what he learned.

1

The Dream

Why Ahmed started building

Ahmed noticed something. Small businesses spend hours every day answering the same customer questions—order status, business hours, product availability. One person, typing the same answers over and over.

"What if I could automate this?" he thought. "What if I could build a platform where any business could have an AI chatbot handling their customer inquiries?"

The vision was clear:

  • Multi-tenant - One platform, many businesses
  • AI-powered - Chatbots that actually understand questions
  • Multi-channel - Website widget, mobile app, messaging platforms

The Starting Point

Ahmed had a laptop, some Node.js experience, and $200 in savings. No team. No infrastructure expertise. Just an idea and determination.

2

The MVP Era

Week 1-4: "It works on my machine"

Ahmed's first architecture was beautifully simple:

MVP Architecture (Week 1)
Single VPS - $20/month
DigitalOcean droplet running everything
Docker Compose
4 containers: Backend, Frontend, PostgreSQL, Redis

Total cost: $20/month. Total complexity: One docker-compose.yml file.

docker-compose.yml
version: '3.8'
services:
  backend:
    build: ./backend
    ports: ["6000:6000"]
    depends_on: [postgres, redis]

  frontend:
    build: ./frontend
    ports: ["3000:3000"]

  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: example  # required by the postgres image, or it won't start
    volumes: [postgres_data:/var/lib/postgresql/data]

  redis:
    image: redis:7-alpine

# Named volumes must be declared at the top level
volumes:
  postgres_data:
Alternatives Ahmed Considered

| Option | Pros | Cons | Best For |
|---|---|---|---|
| Single VPS | Cheap, simple, fast to deploy | No redundancy, limited scale | MVP, <100 users |
| Heroku/Railway | Even simpler, managed | More expensive, less control | Quick prototypes |
| Kubernetes | Scalable, industry standard | Massive overkill, complex | Never for MVP |

Lesson Learned

Start with the simplest thing that works. Ahmed could have spent weeks setting up Kubernetes. Instead, he had paying customers in 2 weeks.

The MVP worked. 10 businesses signed up. Ahmed was charging $50/month each. Revenue: $500/month. Profit after server costs: $480.

Life was good. For about 6 weeks.

3

Growing Pains

When "it works" stops working

Month 2. Ahmed wakes up to 47 support messages. All variations of the same thing:

"The app is down."

What Went Wrong (The First Time)

The VPS ran out of disk space. PostgreSQL logs had grown to 45GB. The database crashed. No backups existed.

Result: 3 hours of downtime. 2 customers asked for refunds.

Ahmed fixed it. Added log rotation. Set up daily backups to S3. Felt smart.

Two weeks later:

What Went Wrong (The Second Time)

Black Friday. Traffic spiked 5x. The single server couldn't handle it. Response times went from 200ms to 15 seconds. Then timeout errors.

Result: Lost orders. Angry business owners. One customer's customer thought the store was closed.

The pattern was clear. Every month brought a new "surprise":

  • Week 6, Disk Full: Database crashed, no backups
  • Week 8, Traffic Spike: Server couldn't handle load
  • Week 10, Memory Leak: Node.js process grew until OOM killed it
  • Week 12, "Is it down?": No monitoring. Ahmed found out from customers.

The Realization

Pain is the best teacher. Each crash taught Ahmed what production actually requires. The MVP architecture had served its purpose. Now it was time to grow up.

The Failure Taxonomy

These failures aren't random. They follow predictable patterns. Name them so you can spot them before they hit you:

TRAP #1 The Friday Deploy Trap

Deploying before a weekend when you won't be around to fix issues. Ahmed deployed on Friday at 5pm. The bug showed up Saturday morning. He didn't see it until Monday.

Prevention: No deploys after Wednesday. Friday deploys need explicit approval and on-call coverage.

TRAP #2 The Disk Space Surprise

Running out of storage space because logs, temp files, or data grew unchecked. Ahmed's PostgreSQL logs grew to 45GB. Database crashed with "no space left on device".

Prevention: Log rotation, disk space alerts at 70%, autoscaling storage (or RDS managed storage).
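That 70% threshold is trivial to encode. A minimal sketch, assuming some metrics source reports used/total bytes (function names are illustrative, not part of Ahmed's stack):

```typescript
// Sketch: fire a disk-space alert when usage crosses a threshold,
// well before "no space left on device" takes the database down.
function diskUsagePercent(usedBytes: number, totalBytes: number): number {
  if (totalBytes <= 0) throw new Error("totalBytes must be positive");
  return Math.round((usedBytes / totalBytes) * 100);
}

function shouldAlert(usedBytes: number, totalBytes: number, thresholdPct = 70): boolean {
  return diskUsagePercent(usedBytes, totalBytes) >= thresholdPct;
}

// 36GB used of 50GB is 72%, so the alert fires with room to react
```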

TRAP #3 The Egress Blindspot

Unexpected outbound data transfer costs or security issues. Ahmed's compromised server was sending traffic to a botnet. He noticed only when AWS throttled him.

Prevention: Restrict egress in security groups. Only allow outbound to specific IPs/ports you need.

TRAP #4 The Traffic Spike Trap

Not being ready for sudden traffic increases. Ahmed got mentioned on Twitter. Traffic 10x'd. His single server couldn't handle it. Site went down during his biggest opportunity.

Prevention: Autoscaling, load testing, CDN for static assets, graceful degradation.
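Graceful degradation deserves a concrete shape. One common form is a circuit breaker that serves a cached response once the upstream starts failing, a sketch under illustrative names and thresholds:

```typescript
// Sketch: graceful degradation via a simple circuit breaker. After
// maxFailures consecutive upstream errors, serve the cached value
// instead of hammering the struggling service.
class DegradingClient {
  private failures = 0;
  private cached: string | null = null;

  constructor(private maxFailures = 3) {}

  request(upstream: () => string): string {
    if (this.failures >= this.maxFailures && this.cached !== null) {
      return this.cached; // degraded mode: stale but fast
    }
    try {
      const fresh = upstream();
      this.failures = 0;
      this.cached = fresh;
      return fresh;
    } catch {
      this.failures++;
      if (this.cached !== null) return this.cached; // stale beats an error page
      throw new Error("no fallback available");
    }
  }
}
```

During a Black Friday spike, product data that is a few minutes stale beats a 15-second timeout.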

4

Going Serious

The AWS decision

Ahmed had 30 paying customers now. $1,500/month revenue. Enough to justify real infrastructure. But which cloud?

The Options

| Provider | Pros | Cons | Best For |
|---|---|---|---|
| AWS | Most services, best docs, largest community | Complex pricing, learning curve | Most startups |
| GCP | Great for AI/ML, simpler pricing | Smaller ecosystem | AI-heavy apps |
| Azure | Enterprise integration, Microsoft stack | Less startup-friendly | Enterprise, .NET |
| Self-hosted | Full control, potentially cheaper | You manage everything | Expert teams only |

Ahmed chose AWS because:

  • Most tutorials and Stack Overflow answers
  • ECS for containers (simpler than Kubernetes)
  • RDS for managed database (no more disk space surprises)
  • His potential enterprise customers already used AWS

Decision Framework

Choose the cloud your customers use, or the one with the best docs for your skill level. The "best" cloud is the one you can actually operate.

The Transformation

Here's what changed when Ahmed moved from a $20 VPS to AWS:

BEFORE: MVP
  • Single VPS, $20/month
  • Docker Compose
  • PostgreSQL (local), Redis (local)
  • Problems: no redundancy, manual backups, can't scale, single point of failure

Pain-driven evolution led to:

AFTER: Production
  • AWS infrastructure, ~$500/month
  • ECS Fargate (5 services)
  • RDS PostgreSQL (Multi-AZ)
  • ElastiCache (managed)
  • ALB + Auto Scaling
  • Gains: auto failover, automated backups, scales to demand, Multi-AZ redundancy

Cost increased 25x, from $20 to $500/month, and revenue tripled to $1,500/month — but the platform can now handle 100x the load.
5

The Five Services

What Ahmed is actually building

Before diving into infrastructure, let's understand what Ahmed's platform actually does:

  • Frontend (Next.js / React): Dashboard for business owners
  • Backend (Node.js / Fastify): API, business logic, webhooks
  • Chatbot API (Python / FastAPI): AI conversation engine
  • Admin (Node.js): Tenant management
  • Weaviate (Vector Database): Semantic product search

Why Separate Services?

| Service | Language | Why Separate? |
|---|---|---|
| Frontend | JavaScript | Static files can be cached, different deploy cycle |
| Backend | Node.js | Business logic, real-time updates, API |
| Chatbot API | Python | AI libraries are Python-first; heavy computation shouldn't block Node.js |
| Weaviate | Go | Specialized vector database, not something you build yourself |

The Pattern

Separate services when they have different scaling needs or different technology requirements. The AI chatbot needs Python. It also needs to scale independently when message volume spikes.

Related: Software Map - How Services Talk to Each Other
The Complete Production Architecture

Users (Browser / Mobile / API)
  → Route53 (DNS)
  → CloudFront (CDN)
  → Application Load Balancer (routes traffic, TLS termination, health checks)
  → ECS Fargate Cluster
      • Frontend (Next.js)
      • Backend (Node.js)
      • Chatbot API (Python)
      • Admin (Node.js)
      • Weaviate (Vector DB)

Supporting services, all inside the VPC (private network):
  • RDS PostgreSQL: Multi-AZ, automated backups
  • ElastiCache (Valkey): cache, queue, sessions, pub/sub
  • Secrets Manager: DB passwords, API keys

How Services Talk to Each Other

Not all services talk to all services. Here's the actual communication map:

Service Communication Map (summary): Users reach the Frontend (and messaging channels); the Frontend talks to the Backend over HTTP and WebSocket. The Backend is the hub — all requests go through it — and connects to the Chatbot API (HTTP), PostgreSQL (TCP), and Redis (TCP). The Chatbot API calls Weaviate (HTTP) and the external OpenAI API.

Who Calls Who

| From | To | How | What For |
|---|---|---|---|
| Frontend | Backend | HTTP REST | All user actions, data fetching |
| Frontend | Backend | WebSocket | Real-time updates, notifications |
| Backend | Chatbot API | HTTP REST | Process AI chat messages |
| Backend | PostgreSQL | TCP (pg) | All data reads/writes |
| Backend | Redis | TCP | Cache, sessions, pub/sub, queues |
| Chatbot API | Weaviate | HTTP REST | Semantic search for products |
| Chatbot API | OpenAI/LLM | HTTPS | Generate AI responses |
| Admin | PostgreSQL | TCP (pg) | Tenant management, config |

The Hub Pattern

Backend is the hub. Frontend never talks to Chatbot API directly. This gives Backend control over authentication, rate limiting, and request validation. If you need to add security rules, you add them in one place.

6

Container Layer

ECS: Running containers without managing servers

Ahmed's apps run in Docker containers. But where do those containers run?

What is ECS? (Elastic Container Service)

ECS (Elastic Container Service) is AWS's container orchestration service. Think of it as a manager that:

  • Runs your Docker containers across multiple servers
  • Restarts them if they crash
  • Scales them up or down based on demand
  • Routes traffic to healthy containers

Why not just Docker on EC2? You could, but then YOU manage server patching, container placement, health checks, and scaling. ECS handles all that. With Fargate (serverless mode), you don't even manage the underlying servers.

The Options

| Option | Complexity | Cost | Best For |
|---|---|---|---|
| EC2 + Docker | Low | Low-Medium | Simple apps, full control needed |
| ECS Fargate | Medium | Medium | Container apps, no server management |
| EKS (Kubernetes) | High | High | Large teams, complex orchestration |
| Lambda | Low | Variable | Event-driven, sporadic traffic |
| App Runner | Very Low | Medium | Simple web apps |

Ahmed chose ECS Fargate because:

  • No servers to manage - Fargate handles the underlying infrastructure
  • Simpler than Kubernetes - One less thing to learn and debug
  • Good enough for 5 services - EKS would be overkill
  • Scales automatically - Add tasks when load increases

Key Concepts

ECS Hierarchy
  1. Cluster: A logical grouping of services. Ahmed has one cluster for production.
  2. Service: Keeps N copies of a task running. "Always run 2 backend tasks."
  3. Task: One or more containers running together. Like one docker-compose "up".
  4. Task Definition: The recipe. "Use this image, this much CPU, these env vars."

Task Definition Anatomy

backend-task-definition.json (simplified)
{
  "family": "backend",
  "cpu": "512",           // 0.5 vCPU
  "memory": "1024",       // 1 GB RAM
  "containerDefinitions": [{
    "name": "api",
    "image": "123456.dkr.ecr.eu-west-1.amazonaws.com/backend:latest",
    "portMappings": [{ "containerPort": 6000 }],
    "healthCheck": {
      "command": ["CMD-SHELL", "wget -q --spider http://127.0.0.1:6000/health"]
    },
    "environment": [
      { "name": "NODE_ENV", "value": "production" }
    ],
    "secrets": [
      { "name": "DATABASE_URL", "valueFrom": "arn:aws:secretsmanager:..." }
    ]
  }]
}
Health Checks Matter

The health check command must actually verify your app is working. Don't just check if the port is open. Hit an endpoint that touches the database. If the health check passes but the app is broken, ECS won't know to restart it.
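A deep health check aggregates dependency checks instead of just answering "process is up". A minimal sketch, with injected check functions standing in for real DB/Redis pings (names are illustrative, not Ahmed's code):

```typescript
// Sketch: a deep health check that verifies dependencies, not just
// that the process is alive. Check functions are injected so the
// example stays self-contained.
type Check = () => boolean;

function deepHealth(checks: Record<string, Check>): {
  healthy: boolean;
  detail: Record<string, boolean>;
} {
  const detail: Record<string, boolean> = {};
  for (const [name, check] of Object.entries(checks)) {
    try {
      detail[name] = check();
    } catch {
      detail[name] = false; // a throwing dependency is an unhealthy one
    }
  }
  return { healthy: Object.values(detail).every(Boolean), detail };
}

// Wire this to GET /health: return 200 when healthy, 503 otherwise,
// so the ECS health checker recycles genuinely broken tasks.
```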

Decision Check: Container Platform

Your team is 2 engineers with no Kubernetes experience. What should you use?


ECS Fargate is the right choice. Here's why:

  • Kubernetes requires expertise you don't have. Learning curve is 3-6 months.
  • ECS is AWS-native, integrates with everything else you're using.
  • Fargate = no servers. You don't manage EC2 instances at all.
  • You can migrate to EKS later if you outgrow ECS (rare at <50 services).

When Kubernetes makes sense: Large team (5+ engineers), 20+ services, need for advanced features like service mesh, or company-wide Kubernetes standardization.

7

Docker & ECR

Building and storing container images

Multi-Stage Builds

Ahmed's Docker images use multi-stage builds to keep them small and secure:

Dockerfile (multi-stage)
# Stage 1: Build (includes dev dependencies, source code)
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: Run (only what's needed)
FROM node:20-alpine AS runner
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
CMD ["node", "dist/server.js"]

Single-Stage Build: 800MB image
Includes dev dependencies, source code, build tools. Slow to push, slow to pull.

Multi-Stage Build: 200MB image
Only production code. Faster deploys, smaller attack surface.

ECR: Where Images Live

ECR (Elastic Container Registry) is AWS's Docker image storage. The flow:

Build → Push → Deploy
  1. Build locally: docker build -t backend .
  2. Tag with git SHA: docker tag backend:latest 123456.dkr.ecr.../backend:abc123
  3. Push to ECR: docker push 123456.dkr.ecr.../backend:abc123
  4. ECS pulls: Task definition references the new image tag

Alpine Linux Gotchas (Lessons Ahmed Learned the Hard Way)

Ahmed uses Alpine Linux for smaller images (200MB vs 800MB), but hit several quirks:

  • No curl by default — Alpine ships with wget instead of curl. Health check commands that use curl silently fail. Use wget --spider http://localhost:3000/health or install curl with apk add --no-cache curl.
  • IPv6 issues — Alpine resolves localhost to ::1 (IPv6) by default. If your app only binds to IPv4, use 127.0.0.1 explicitly in all health checks and connection strings.
  • Next.js binding — Next.js by default binds to localhost only. In a container, that means external connections are refused. Set HOSTNAME=0.0.0.0 in your environment variables or it won't accept connections from the ECS health checker.
  • Native modules — If your Node.js dependencies include native C++ addons (bcrypt, sharp, canvas), they need build-base and python3 installed in Alpine. This increases build time. Consider using pre-built binaries: bcryptjs instead of bcrypt, @img/sharp-linux-arm64 etc.
  • Timezone — Alpine doesn't include timezone data by default. If your app logs timestamps, add apk add --no-cache tzdata and set TZ=UTC.

Despite these quirks, Alpine is worth it: 200MB images deploy 4x faster than 800MB ones, saving ~30 seconds per deployment across all ECS tasks.

8

Database Layer

RDS: Managed PostgreSQL

Remember Ahmed's disk space disaster with the MVP? That's why he uses RDS now—AWS manages backups, patches, and disk space.

The Options

| Option | Managed? | Cost | Best For |
|---|---|---|---|
| Self-hosted PostgreSQL | No | Low | Experts with time to manage |
| RDS PostgreSQL | Yes | Medium | Most production apps |
| Aurora | Yes | High | High performance, auto-scaling storage |
| PlanetScale/Neon | Yes | Variable | Serverless, branching for dev |

High Availability

| Option | Failover Time | Cost | When to Use |
|---|---|---|---|
| Single-AZ | Manual (hours) | $ | Dev/staging environments |
| Multi-AZ | Automatic (~60s) | $$ | Production databases |
| Aurora Multi-Master | Zero | $$$ | Mission-critical, zero downtime |

Ahmed uses:

  • Multi-AZ for production - Can't afford downtime for customer orders
  • Single-AZ for dev - Acceptable risk, saves money

Future Problem

Connection pooling. 10 ECS tasks × 10 connections each = 100 connections. RDS has limits. Ahmed will need PgBouncer or RDS Proxy as he scales further.
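The arithmetic above generalizes into a small sizing function: divide the database's connection ceiling (minus headroom for admin sessions and migrations) by the number of tasks. A sketch under illustrative numbers; tune for your actual instance's max_connections:

```typescript
// Sketch: size per-task connection pools so total connections stay
// under the RDS max_connections ceiling, with headroom reserved for
// admin sessions and migrations. Formula is illustrative.
function poolSizePerTask(
  rdsMaxConnections: number,
  taskCount: number,
  headroom = 20,
): number {
  const usable = rdsMaxConnections - headroom;
  if (usable <= 0 || taskCount <= 0) throw new Error("invalid sizing inputs");
  return Math.max(1, Math.floor(usable / taskCount));
}

// Suppose max_connections is 200: 10 tasks → 18 connections each,
// 180 total, 20 spare for everything else.
```

When autoscaling raises the task count, the per-task pool must shrink accordingly, which is exactly why PgBouncer or RDS Proxy becomes attractive: they decouple app-side pools from the database's hard limit.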

Deep Dive: Database Connections Explained

Decision Check: Database

You need ACID transactions and your data is relational. What database?


PostgreSQL (RDS or Aurora) is the standard answer. Here's why:

  • ACID compliance: Transactions, constraints, referential integrity built-in
  • Relational data: JOINs, foreign keys, normalization
  • RDS vs Aurora: RDS is cheaper for small workloads. Aurora for high-performance, auto-scaling storage.
  • Multi-AZ: Automatic failover in ~60 seconds

When NOT PostgreSQL: If you need horizontal scaling for massive writes (DynamoDB), document storage (MongoDB), or time-series data (TimescaleDB, built on Postgres anyway).

9

Multi-Tenancy

One platform, many businesses

Ahmed has 30 businesses on his platform. Each thinks they have their own system. But they all share the same database. How does this work?

The Options

| Pattern | Isolation | Cost | Best For |
|---|---|---|---|
| Shared DB + tenant_id | Low | $ | Most SaaS startups |
| Schema per tenant | Medium | $$ | Compliance requirements |
| Database per tenant | High | $$$ | Enterprise, regulated industries |

Ahmed uses Shared DB + tenant_id:

Every table has tenant_id
CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  tenant_id UUID NOT NULL,  -- The isolation key
  customer_id INT,
  total DECIMAL,
  created_at TIMESTAMP
);

-- Every query filters by tenant
SELECT * FROM orders WHERE tenant_id = $1;
The Danger

Forget the WHERE tenant_id = ... clause and one business sees another's data. This is a data leak.

Solution: Use PostgreSQL Row-Level Security (RLS) to enforce tenant isolation at the database level.

Row-Level Security (RLS): How Ahmed Prevents Data Leaks

RLS makes the database enforce tenant isolation automatically. Even if application code forgets the WHERE tenant_id = ... clause, the database won't return other tenants' data.

rls_setup.sql
-- Enable RLS on the orders table
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
-- Table owners bypass RLS by default; force it for the owner too
ALTER TABLE orders FORCE ROW LEVEL SECURITY;

-- Create a policy: users can only see rows where
-- tenant_id matches the current session variable.
-- current_setting(..., true) returns NULL instead of raising an
-- error when the variable is unset, so an unscoped session sees no rows.
CREATE POLICY tenant_isolation ON orders
  USING (tenant_id = current_setting('app.current_tenant', true)::uuid);

-- In your middleware (runs before every query):
-- SET app.current_tenant = 'tenant-abc';

-- Now this query automatically filters by tenant:
SELECT * FROM orders;
-- ↑ Only returns orders for tenant-abc, even without WHERE clause

How Ahmed uses it in Express middleware:

tenant_middleware.ts
async function tenantMiddleware(req, res, next) {
  const tenantId = req.user?.tenantId;
  if (!tenantId) return res.status(401).json({ error: 'No tenant' });

  // IMPORTANT: Validate tenantId format to prevent SQL injection!
  if (!/^[a-f0-9-]{36}$/.test(tenantId)) {
    return res.status(400).json({ error: 'Invalid tenant ID' });
  }

  // Set the tenant for this connection. SET cannot take bind
  // parameters, but set_config() can, which avoids string
  // interpolation entirely. Third arg false = session scope;
  // with a shared pool, prefer is_local = true inside a transaction.
  await db.query("SELECT set_config('app.current_tenant', $1, false)", [tenantId]);
  next();
}

// Now every database query in this request is automatically
// scoped to the authenticated user's tenant. No WHERE clause needed.
SQL Injection Warning

Always validate tenantId before using it in SET commands. If tenantId comes from user input (even indirectly through JWT claims), a malicious value like '; DROP TABLE orders; -- could execute arbitrary SQL. Use a strict regex pattern (UUID format) or parameterized queries where possible.

Why this matters: Without RLS, a single missed WHERE clause in any of your hundreds of queries could leak data. With RLS, the database is the safety net. Even a bug in application code can't cross tenant boundaries.

Decision Check: Multi-Tenancy

You have 5 enterprise customers who each want data isolation. What pattern?


Database per tenant is likely the right choice for this scenario. Here's why:

  • Enterprise customers demand isolation: They often require it contractually for compliance (SOC2, HIPAA)
  • 5 customers is manageable: The operational overhead is acceptable at this scale
  • Easy data export/deletion: Customer leaves? Drop the database.
  • Independent backups: Restore one customer without affecting others

When shared DB with RLS works: Many small customers (100+), cost-sensitive, customers don't require strict isolation. Schema per tenant: Middle ground, but adds migration complexity.

10

Caching & Queues

Redis for everything

Ahmed uses Redis (actually AWS ElastiCache with Valkey) for four different purposes:

Queue
BullMQ job processing. Send emails, process webhooks, run background tasks.
Cache
Store expensive query results. Don't hit the database for the same data.
Sessions
User login sessions. Any server can validate any user.
Real-time
Socket.IO pub/sub. Broadcast updates to all connected users.

Priority Queues

Not all jobs are equal. A payment webhook is more urgent than an analytics update:

Queue Priorities
  • Critical (50 workers): Real-time broadcasts, payment webhooks. Must process immediately.
  • High (20 workers): User actions, order processing. Within seconds.
  • Medium (10 workers): Emails, imports. Within minutes is fine.
  • Low (5 workers): Analytics, cleanup. Can wait.
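The scheduling rule behind those tiers is simple: always drain the highest-priority non-empty queue first, FIFO within a tier. In Ahmed's stack BullMQ handles this; the sketch below just makes the rule concrete (data shapes are illustrative):

```typescript
// Sketch: pick the next job by queue priority. A worker drains
// Critical before High before Medium before Low, FIFO within a tier.
const PRIORITIES = ["critical", "high", "medium", "low"] as const;
type Priority = (typeof PRIORITIES)[number];

function nextJob(queues: Record<Priority, string[]>): string | undefined {
  for (const p of PRIORITIES) {
    const job = queues[p].shift(); // FIFO within a priority level
    if (job !== undefined) return job;
  }
  return undefined; // all queues empty
}
```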

Dead Letter Queue

What happens when a job fails after all retries?

Dead Letter Queue

Failed jobs move to a separate queue for manual review. They don't disappear, and they don't block other jobs.
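The retry-then-park lifecycle looks like this in miniature. BullMQ provides this behavior natively (with exponential backoff); the sketch below only illustrates the flow, and the shapes are invented for the example:

```typescript
// Sketch: retry a job up to maxAttempts, then move it to a dead
// letter queue instead of losing it or retrying forever.
type Job = { id: string; attempts: number };

function processWithDlq(
  job: Job,
  handler: (job: Job) => void,
  deadLetter: Job[],
  maxAttempts = 3,
): "done" | "retry" | "dead" {
  try {
    handler(job);
    return "done";
  } catch {
    job.attempts++;
    if (job.attempts >= maxAttempts) {
      deadLetter.push(job); // parked for manual review, not lost
      return "dead";
    }
    return "retry"; // re-enqueue; real queues add exponential backoff
  }
}
```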

Deep Dive: Dead Letter Queues in Workflow Orchestration
11

Autoscaling

Add capacity when needed, remove when not

Ahmed's traffic varies. Quiet at night, busy during business hours. Why pay for 10 servers at 3 AM when 1 is enough?

Ahmed's Autoscaling Rules

| Service | Min | Max | Scale Out When | Scale In When |
|---|---|---|---|---|
| Backend | 1 | 10 | CPU > 50% for 60s | CPU < 30% for 300s |
| Frontend | 1 | 6 | CPU > 50% for 60s | CPU < 30% for 300s |
| Chatbot API | 1 | 4 | CPU > 70% for 60s | CPU < 40% for 300s |

Why Asymmetric Cooldowns?

Scale Out: 60 seconds

React fast to load. Users are waiting.

Scale In: 300 seconds

Don't flap. Avoid add/remove/add/remove cycles.

Custom Metrics

You can scale on more than CPU. Ahmed scales his chatbot based on queue depth: if there are more than 1000 messages waiting, add workers.
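Queue-depth scaling reduces to one formula: backlog divided by what one worker can chew through, clamped to the service's min/max. The numbers below are illustrative, not Ahmed's actual thresholds:

```typescript
// Sketch: target-tracking on queue depth. Desired worker count is
// the backlog divided by one worker's throughput, clamped to the
// service's configured min/max task counts.
function desiredWorkers(
  queueDepth: number,
  jobsPerWorker: number,
  min: number,
  max: number,
): number {
  const target = Math.ceil(queueDepth / jobsPerWorker);
  return Math.min(max, Math.max(min, target));
}

// 2,500 messages waiting, each worker handles ~1,000 per interval
// → 3 workers (within the 1..4 bounds)
```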

12

Networking

How requests flow through the system

A request from a user's browser goes through multiple layers. Let's trace a real request: "User clicks 'Ask AI' to get product recommendations."

Complete Request Flow: "Ask AI for Product Recommendations"

User (clicks button) → Route53 (DNS lookup) → ALB (route + TLS) → Backend (auth + logic) → Chatbot API (AI processing) → Weaviate (vector search)

Request Timeline (Total: ~850ms)
  • DNS: ~10ms
  • TLS + ALB: ~30ms
  • Backend: ~100ms
  • Weaviate: ~200ms
  • LLM API: ~500ms (the slow part!)

Where's the Bottleneck?

The LLM API call (500ms) is 60% of the total request time. This is why Ahmed uses streaming responses—show partial results while waiting—and caches frequent questions.
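Caching frequent questions is the cheaper half of that fix. A minimal sketch of a TTL cache, with the clock injected to keep the example testable; in production this would be Redis with a TTL rather than an in-memory map:

```typescript
// Sketch: cache answers to frequent questions so repeat queries skip
// the ~500ms LLM call. Illustrative only; production uses Redis.
class AnswerCache {
  private store = new Map<string, { answer: string; expiresAt: number }>();

  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  get(question: string): string | undefined {
    const hit = this.store.get(question);
    if (!hit) return undefined;
    if (this.now() > hit.expiresAt) {
      this.store.delete(question); // expired: force a fresh LLM call
      return undefined;
    }
    return hit.answer;
  }

  set(question: string, answer: string): void {
    this.store.set(question, { answer, expiresAt: this.now() + this.ttlMs });
  }
}
```

"What are your business hours?" gets asked hundreds of times a day per tenant; every cache hit turns an 850ms request into a ~150ms one.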

ALB Routing Rules

What is an ALB? (Application Load Balancer)

ALB (Application Load Balancer) is AWS's Layer 7 load balancer. It sits in front of your services and:

  • Distributes traffic across multiple containers/servers
  • Routes by path or hostname: /api/* goes to backend, /admin/* goes to admin service
  • Terminates TLS: Handles HTTPS, your containers only see HTTP
  • Health checks: Only sends traffic to healthy containers
  • Sticky sessions: Can route a user to the same container (for WebSocket)

ALB vs NLB: ALB understands HTTP (paths, headers, cookies). NLB is Layer 4 (TCP/UDP only, faster, for non-HTTP traffic). For web apps, use ALB.

One load balancer routes to multiple services based on path:

Routing Rules (Priority Order)
# Path-based routing
/admin/*        → chatbot-admin:8001
/socket.io/*    → backend:6000      # WebSocket
/api/*          → backend:6000

# Host-based routing
api.example.com → chatbot-api:8000

# Default
/*              → frontend:5000
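The "priority order" above matters: rules are evaluated top to bottom and the first match wins, with `/*` as the catch-all. A sketch of that matching logic (targets mirror the rules above; the matcher itself is illustrative, ALB does this for you):

```typescript
// Sketch of ALB-style priority routing: first matching rule wins,
// with a catch-all default at the bottom.
type Rule = { pattern: string; target: string };

const rules: Rule[] = [
  { pattern: "/admin/*", target: "chatbot-admin:8001" },
  { pattern: "/socket.io/*", target: "backend:6000" },
  { pattern: "/api/*", target: "backend:6000" },
  { pattern: "/*", target: "frontend:5000" }, // default
];

function matchRoute(path: string): string {
  for (const rule of rules) {
    const prefix = rule.pattern.slice(0, -1); // drop trailing '*'
    if (path.startsWith(prefix)) return rule.target;
  }
  throw new Error("unreachable: /* matches everything");
}
```

If `/admin/*` were listed after `/*`, every admin request would hit the frontend; ordering mistakes like this are a classic ALB misconfiguration.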

VPC: Your Private Network

What is a VPC? (Virtual Private Cloud)

VPC (Virtual Private Cloud) is your own isolated section of AWS's network. Think of it like having your own private building in a shared office complex:

  • Isolation: Other AWS customers can't see or access your resources
  • Subnets: You divide your VPC into smaller networks (public-facing vs internal-only)
  • Security Groups: Firewall rules that control what traffic is allowed
  • IP Addresses: You control the IP range (e.g., 10.0.0.0/16)

Why it matters: Without a VPC, your database would be directly on the internet. With a VPC, only your application servers can reach it.

Ahmed's infrastructure lives in a Virtual Private Cloud (VPC)—an isolated network in AWS:

  • Public subnets: ALB lives here (accessible from internet)
  • Private subnets: Databases, Redis (not accessible from internet)
  • Security groups: Firewall rules per service
Security Rule

The database should never be accessible from the internet. Only your application servers can talk to it. This is enforced by security groups.

Related: Software Map - Infrastructure Layer
13

The Security Incident

3 AM. Everything is on fire.

Back to that phone call at 3:12 AM.

  • 3:12 AM, Alert Received: "Unusual outbound traffic detected"
  • 3:15 AM, Confirm It's Real: CloudWatch shows 10x normal egress. CPU at 100%.
  • 3:25 AM, Identify the Vulnerability: CVE-2025-55182 in an npm dependency. Remote code execution. (CVE = Common Vulnerabilities and Exposures, a unique ID for security flaws. When you see "CVE-YYYY-NNNNN", that's a tracked vulnerability in a public database.)
  • 3:40 AM, Block Outbound Traffic: Restrict security group egress to only HTTP/HTTPS/DNS
  • 3:55 AM, Rotate All Secrets: Database passwords, API keys, JWT secrets. All new.
  • 4:30 AM, Enable VPC Flow Logs: For forensics. Should have done this before.
  • 5:00 AM, Incident Contained: Patched dependency. Rebuilt images. Redeployed.

What Changed After

| Before | After |
|---|---|
| No VPC Flow Logs | Enabled: can see all network traffic for forensics |
| Open egress (any outbound) | Restricted: only HTTP/HTTPS/DNS allowed |
| Secrets in environment variables | Secrets Manager with automatic rotation |
| No incident runbook | Documented step-by-step response process |

Lessons Learned
  • You don't know what you don't log. VPC Flow Logs should be on from day 1.
  • Restrict egress. Why can your server talk to any IP on any port?
  • Have a runbook before 3 AM happens. Half-asleep you needs clear instructions.

WAF & Rate Limiting

After the incident, Ahmed added two more protection layers:

AWS WAF
Web Application Firewall. Blocks SQL injection, XSS, and known bad IPs before they reach the app.
Rate Limiting
100 requests per IP per minute. Stops credential stuffing, scraping, and DDoS attempts.
| Protection Layer | Where | What It Blocks |
|---|---|---|
| WAF Rules | CloudFront / ALB | SQL injection, XSS, bad bots, known malicious IPs |
| Rate Limiting | WAF or ALB | Brute force, credential stuffing, API abuse |
| Geo Blocking | WAF | Traffic from countries you don't serve |
| Bot Control | WAF | Scrapers, automated attacks (costs extra) |

Why Not Earlier?

WAF costs ~$5/month + $0.60 per million requests. At MVP stage with 100 users, it's overkill. After an incident with 10,000 users? Essential.

Rule of thumb: Add WAF when you have something worth protecting.
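The "100 requests per IP per minute" rule is a fixed-window counter. A minimal sketch with an injected clock; in production this lives in WAF or Redis, not an in-memory map (which resets on deploy and isn't shared across tasks):

```typescript
// Sketch: fixed-window rate limiting, e.g. 100 requests per IP per
// minute. In-memory map and injected clock are for illustration only.
class RateLimiter {
  private windows = new Map<string, { windowStart: number; count: number }>();

  constructor(
    private limit = 100,
    private windowMs = 60_000,
    private now: () => number = Date.now,
  ) {}

  allow(ip: string): boolean {
    const t = this.now();
    const w = this.windows.get(ip);
    if (!w || t - w.windowStart >= this.windowMs) {
      this.windows.set(ip, { windowStart: t, count: 1 }); // fresh window
      return true;
    }
    w.count++;
    return w.count <= this.limit;
  }
}
```

Fixed windows allow brief bursts at window boundaries; sliding-window or token-bucket variants smooth that out if it matters for your API.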

14

Monitoring

Knowing what's happening before customers tell you

Ahmed's MVP had no monitoring. He found out about outages from angry customers. Now he has the three pillars:

Logs
"What happened?" A timeline of events. Request received, auth passed, payment failed.
Metrics
"How much?" Request rate, error rate, latency percentiles, CPU usage.
Traces
"Where did time go?" Request took 2.5s. Auth: 15ms. DB: 120ms. External API: 2300ms.

AWS X-Ray: Distributed Tracing

When a request touches 5 services, which one is slow? AWS X-Ray traces the entire journey:

X-Ray Trace Visualization
Request ID: abc-123-def
│
├── Frontend (50ms)
│
├── Backend (350ms)
│   ├── Auth middleware (15ms)
│   ├── Database query (120ms)
│   └── Redis cache check (5ms)
│
└── Chatbot API (1200ms) ← The slow one!
    ├── Weaviate search (400ms)
    └── LLM API call (800ms)

X-Ray answers: "The chatbot's LLM call takes 800ms. Can we cache frequent questions?"

Structured Logging

Ahmed's MVP logs were useless:

Bad: Unstructured Logs
// Impossible to search or aggregate
console.log("Order created")
console.log("User logged in")
console.log("Something went wrong")

Now every log is JSON with context:

Good: Structured JSON Logs
logger.info({
  event: "order.created",
  orderId: "ord-789",
  tenantId: "tenant-abc",
  userId: "user-456",
  amount: 299.99,
  duration: 145,
  correlationId: req.headers['x-correlation-id']
})

Correlation IDs: Connecting the Dots

A single user action might hit 5 services. How do you find all related logs?

Correlation ID Flow
  1. Request arrives: Generate correlation ID corr-xyz-123
  2. Pass to all services: Header X-Correlation-ID: corr-xyz-123
  3. Every log includes it: All 5 services log with same correlation ID
  4. Debug easily: Search CloudWatch for corr-xyz-123 → see entire journey
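The first step of that flow fits in one function: reuse an incoming correlation ID if present, otherwise mint one. A sketch (the generator is injected for testability; the header name matches the flow above):

```typescript
// Sketch: reuse an incoming correlation ID or mint a new one, so
// every downstream log line ties back to the same request.
function getOrCreateCorrelationId(
  headers: Record<string, string | undefined>,
  generate: () => string = () => `corr-${Math.random().toString(36).slice(2, 10)}`,
): string {
  return headers["x-correlation-id"] ?? generate();
}

// Then forward it on every outbound call, e.g.
// fetch(url, { headers: { "X-Correlation-ID": correlationId } })
// and include it in every structured log line.
```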

Critical Alarms

Must-Have Alarms
  • Service down: Running tasks < 1
  • High error rate: 5xx errors > 5%
  • High latency: p95 > 2 seconds
  • Database connections: Approaching limit
  • Disk space: < 10% remaining

Ahmed's Gap

Ahmed has CloudWatch alarms. They trigger SNS topics. But nobody subscribes to those topics. The alarms fire into the void. Nobody gets paged at 3 AM.

Fix: Connect SNS to Slack, PagerDuty, or email. Takes 30 minutes.

Related: Software Map - Observability Chapter
15

Deployment

How code gets to production

Ahmed's deployment is a shell script. Not ideal, but it works:

Deployment Phases
  1. Run tests: Unit and integration tests must pass
  2. Security audit: npm audit --audit-level=high
  3. Build Docker images: Multi-stage builds for each service
  4. Push to ECR: Tag with git SHA for traceability
  5. Run migrations: Database schema updates
  6. Update ECS: aws ecs update-service --force-new-deployment
  7. Wait for stability: Poll until new tasks are healthy

What's Missing

Current State
  • Manual trigger (run script)
  • No automatic rollback
  • No staging environment first
  • No approval gates

Ideal State
  • Push to main triggers deploy
  • Auto rollback on failure
  • Deploy to staging first
  • Manual approval for prod

Next Step

Ahmed's next priority: GitHub Actions CI/CD. Automate the deploy script, add staging environment, add rollback on health check failure.

The GitHub Actions Workflow Ahmed Should Build

Here's what a proper CI/CD pipeline looks like for Ahmed's ECS setup. Push to main triggers the full pipeline:

.github/workflows/deploy.yml
name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - run: npm test
      - run: npm audit --audit-level=high

  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-west-1

      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push Docker image
        env:
          # ECR_REPO is a repository variable, e.g.
          # 123456.dkr.ecr.eu-west-1.amazonaws.com/backend
          ECR_REPO: ${{ vars.ECR_REPO }}
        run: |
          docker build -t $ECR_REPO:${{ github.sha }} .
          docker push $ECR_REPO:${{ github.sha }}

      - name: Deploy to ECS
        run: |
          # Update task definition with new image
          aws ecs update-service \
            --cluster production \
            --service api \
            --force-new-deployment

      - name: Wait for stability
        run: |
          aws ecs wait services-stable \
            --cluster production \
            --services api
          # Fails the job if tasks never stabilize. With the ECS
          # deployment circuit breaker enabled, ECS also rolls back.

The key insight: ECS has built-in rollback via the deployment circuit breaker. Enable it on the service, and if the new containers fail health checks, ECS automatically reverts to the last working deployment. You don't need custom rollback logic — just good health checks.

What this gives Ahmed over the shell script: automated on every push, no human needed, rollback if anything fails, full audit trail in GitHub Actions history.
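Enabling that built-in rollback is a one-time service setting. A sketch with the AWS CLI, using the cluster and service names from the workflow (the percentages are typical defaults, tune to taste):

```shell
aws ecs update-service \
  --cluster production \
  --service api \
  --deployment-configuration \
    "deploymentCircuitBreaker={enable=true,rollback=true},maximumPercent=200,minimumHealthyPercent=100"
```

With `rollback=true`, a deployment that trips the circuit breaker reverts to the previous task definition on its own, no human in the loop.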

16

Cost Breakdown

Where the money goes

Ahmed's monthly AWS bill: ~$500

  • RDS (3 DBs): $225 (45%)
  • ECS Fargate: $188 (38%)
  • ElastiCache: $45 (9%)
  • Load Balancers: $32 (6%)
  • Other: $10 (2%)

Optimizations Applied

  • Right-sized instances: t4g.small instead of m5.large (ARM is cheaper)
  • Weekend shutdown: Script stops non-essential services Fri night, starts Mon morning
  • Reserved instances: Considering for database (30-70% savings)
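The weekend-shutdown script can be as small as a cron-driven shell script. A sketch, assuming the non-essential pieces are a staging ECS service and a staging RDS instance (all names are placeholders):

```shell
#!/usr/bin/env bash
# Cron: "0 22 * * 5" with arg "stop", "0 6 * * 1" with arg "start"
set -euo pipefail

case "${1:-}" in
  stop)
    # Scale the non-essential service to zero, then stop its database
    aws ecs update-service --cluster staging --service api --desired-count 0
    aws rds stop-db-instance --db-instance-identifier staging-db
    ;;
  start)
    # Database first so the app doesn't crash-loop on boot
    aws rds start-db-instance --db-instance-identifier staging-db
    aws ecs update-service --cluster staging --service api --desired-count 1
    ;;
  *)
    echo "usage: $0 stop|start" >&2
    exit 1
    ;;
esac
```

One caveat: RDS automatically restarts a stopped instance after seven days, so this pattern only works for weekend-length stops.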
Cost vs Revenue

$500/month infrastructure against $1,500/month revenue means infrastructure consumes 33% of revenue. Not great, but acceptable for a growing startup. As customers grow, cost per customer decreases.

Cost Optimization Tips Ahmed Hasn't Applied Yet

1. Graviton (ARM) instances everywhere. Ahmed already uses t4g.small for compute. But his RDS is still x86. Switching to db.t4g.small saves ~20% on the biggest cost item ($225 → ~$180).

2. Reserved Instances for RDS. Ahmed's databases run 24/7. A 1-year reserved instance commitment saves 30-40%. That's $225 → ~$140. The catch: you're committed for a year.

3. Fargate Spot for the AI worker. The AI processing service handles async tasks from a queue. If a Spot instance gets reclaimed, the task goes back to the queue. No data loss. Savings: 50-70% on compute for that service.

4. Aurora Serverless v2 instead of RDS. If traffic is bursty (high during business hours, near-zero at night), Aurora Serverless scales to near-zero and back. For Ahmed's pattern, this could be cheaper than always-on RDS.

Projected savings: With all optimizations, Ahmed could get from $500 to ~$300/month — dropping infrastructure cost to 20% of revenue.
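Of these, Fargate Spot is the lowest-risk change to try first, since the queue absorbs interruptions. A sketch of moving the AI worker to a mixed capacity-provider strategy (the service name and weights are assumptions, and the cluster must already have both providers attached):

```shell
aws ecs update-service \
  --cluster production \
  --service ai-worker \
  --capacity-provider-strategy \
    capacityProvider=FARGATE,base=1,weight=1 \
    capacityProvider=FARGATE_SPOT,weight=3 \
  --force-new-deployment
```

The `base=1` keeps one task on regular Fargate so the queue always drains, even during a Spot capacity drought.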

What To Do Monday

Based on where you are in your journey, here's your next concrete step:

MVP STAGE

If you're at MVP stage...

Focus on getting users first. Stay on the $20 VPS. Add daily automated backups (takes 30 minutes to set up). Don't add complexity until pain forces you. Ship your product, not your infrastructure.
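"Daily automated backups" on the Docker Compose VPS is roughly this much work. A sketch, assuming the Compose service is called postgres and some off-box target like an S3 bucket exists (names are placeholders):

```shell
#!/usr/bin/env bash
# Cron: 0 3 * * * /opt/backup.sh
set -euo pipefail
STAMP=$(date +%F)

# Dump straight out of the running container
docker compose exec -T postgres pg_dump -U app appdb \
  | gzip > "/backups/appdb-$STAMP.sql.gz"

# Ship off-box: a dead VPS must not take its backups with it
aws s3 cp "/backups/appdb-$STAMP.sql.gz" "s3://example-backups/db/"

# Keep only the last 7 local copies
ls -1t /backups/appdb-*.sql.gz | tail -n +8 | xargs -r rm --
```

And restore-test it once: a backup you've never restored is a hope, not a backup.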

PRE-PRODUCTION

If you're preparing for production...

Prioritize observability. Set up structured logging, basic CloudWatch alarms (service down, high error rate), and connect alerts to Slack/email. You can't fix what you can't see. This is Day 1 infrastructure.
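A "high error rate" alarm wired to an SNS topic is one CLI call. A sketch (the load balancer dimension, thresholds, and topic ARN are placeholders; subscribing Slack or email to the topic is a separate step):

```shell
aws cloudwatch put-metric-alarm \
  --alarm-name api-5xx-spike \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/prod-alb/1234567890abcdef \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:alerts
```

`--treat-missing-data notBreaching` matters here: zero traffic means zero 5xx datapoints, and you don't want a quiet night paging you.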

SCALING

If you're scaling...

Consider connection pooling and caching. Database connections become a bottleneck around 50+ concurrent users. Add RDS Proxy or PgBouncer. Cache expensive queries in Redis. Review your autoscaling rules.
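RDS Proxy is the managed route to that pooling (PgBouncer is the self-hosted equivalent). A sketch of fronting the production database, where every name and ARN is a placeholder:

```shell
# Create the proxy; it authenticates to Postgres via a Secrets Manager secret
aws rds create-db-proxy \
  --db-proxy-name api-proxy \
  --engine-family POSTGRESQL \
  --auth "AuthScheme=SECRETS,SecretArn=arn:aws:secretsmanager:eu-west-1:123456789012:secret:db-creds,IAMAuth=DISABLED" \
  --role-arn arn:aws:iam::123456789012:role/rds-proxy-role \
  --vpc-subnet-ids subnet-0aaa subnet-0bbb

# Point the proxy at the existing database instance
aws rds register-db-proxy-targets \
  --db-proxy-name api-proxy \
  --db-instance-identifiers production-db
```

The app then connects to the proxy endpoint instead of the instance endpoint; nothing else in the code changes.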

Practice Mode: Test Your Understanding

Before you move on, make sure you can answer these scenarios:

Scenario 1: Your ECS service is restarting containers every few minutes. Health checks pass. What do you check?

Check for:

  • Memory limits: Container OOM killed? Check CloudWatch for memory usage hitting 100%.
  • Exit codes: Look at stopped task details. Exit code 137 = OOM. Exit code 1 = app crash.
  • Startup time: Is the health check timing out before the app starts? Increase health check grace period.
  • Dependencies: Is the app crashing because it can't reach the database or Redis?

Quick command: aws ecs describe-tasks --cluster prod --tasks $(aws ecs list-tasks --cluster prod --desired-status STOPPED --query 'taskArns[0]' --output text)

Scenario 2: Response times jumped from 200ms to 3 seconds, but CPU is at 20%. Where do you look?

Low CPU + high latency = waiting on something external:

  • Database: Check RDS CloudWatch for high latency, connection count near limit, or replica lag.
  • Redis: Check ElastiCache metrics for evictions or high memory usage.
  • External APIs: Is your LLM provider slow? Add tracing to identify the slow call.
  • Connection starvation: Pool exhausted? All threads waiting for a connection?

The pattern: High CPU = your code is slow. Low CPU + high latency = you're waiting on I/O.
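When the database is the suspect, a quick way to see what those waits actually are, assuming psql access to the instance:

```shell
# What are connections doing right now? Lots of "idle in transaction",
# "Client" or "Lock" waits point at the app; "IO" waits point at the DB.
psql "$DATABASE_URL" -c "
  SELECT state, wait_event_type, count(*)
  FROM pg_stat_activity
  GROUP BY 1, 2
  ORDER BY 3 DESC;"
```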

Scenario 3: You deployed a new version and 5xx errors spike. What's your rollback plan?

ECS rollback options (fastest to slowest):

  • Do nothing: If health checks are configured correctly, ECS won't drain old containers until new ones are healthy. Bad deploys auto-stop.
  • Update task definition: Point back to the previous image tag and update service.
  • ECS service auto-rollback: Enable "circuit breaker" with rollback. ECS automatically reverts if deployment fails.

Prevention: Tag images with git SHA, not just :latest. You need to know exactly which version to roll back to.

Where To Go Deep

Ahmed's journey touched every reliability pattern: health checks, egress control, observability, CI/CD, cost management. Each one is worth a deep dive of its own.

Key Takeaways

1. Start with the simplest thing that works. Docker Compose on a $20 VPS got Ahmed his first paying customers.
2. Pain teaches you what to build next. Each crash showed Ahmed what production actually requires.
3. Managed services save time. RDS, ElastiCache, Fargate—let AWS handle the undifferentiated heavy lifting.
4. Separate services when they have different needs. Python for AI, Node.js for API, each scales independently.
5. Health checks must actually check health. Don't just verify the port is open. Hit an endpoint that touches the database.
6. Restrict egress before you need to. Your server shouldn't be able to talk to any IP on any port.
7. Log everything you might need at 3 AM. VPC Flow Logs, structured application logs, request traces.
8. Alarms are useless if nobody receives them. Connect CloudWatch → SNS → Slack/PagerDuty.
9. Document your decisions. Future you (and your team) will thank present you.
10. Every architecture is a tradeoff. There's no "best" architecture, only "best for your current scale and constraints."

Continue Learning

والله أعلم

وصلى الله وسلم على نبينا محمد وعلى آله وصحبه أجمعين
