بِسْمِ اللهِ الرَّحْمٰنِ الرَّحِيْمِ
Ahmed's phone buzzes. Then again. Then it won't stop.
"Your EC2 instance is generating unusual outbound traffic."
He opens CloudWatch. Outbound bandwidth: 10x normal. CPU: 100%. His server—the one running his entire business—is attacking other servers on the internet.
A security vulnerability he didn't know existed. A botnet he's now part of. And AWS will suspend his account in 30 minutes if he doesn't respond.
This is the story of what Ahmed built, what broke, and what he learned.
The Dream
Why Ahmed started building
Ahmed noticed something. Small businesses spend hours every day answering the same customer questions—order status, business hours, product availability. One person, typing the same answers over and over.
"What if I could automate this?" he thought. "What if I could build a platform where any business could have an AI chatbot handling their customer inquiries?"
The vision was clear:
- Multi-tenant - One platform, many businesses
- AI-powered - Chatbots that actually understand questions
- Multi-channel - Website widget, mobile app, messaging platforms
Ahmed had a laptop, some Node.js experience, and $200 in savings. No team. No infrastructure expertise. Just an idea and determination.
The MVP Era
Week 1-4: "It works on my machine"
Ahmed's first architecture was beautifully simple:
Total cost: $20/month. Total complexity: One docker-compose.yml file.
```yaml
version: '3.8'
services:
  backend:
    build: ./backend
    ports: ["6000:6000"]
    depends_on: [postgres, redis]
  frontend:
    build: ./frontend
    ports: ["3000:3000"]
  postgres:
    image: postgres:15
    volumes: [postgres_data:/var/lib/postgresql/data]
  redis:
    image: redis:7-alpine

volumes:
  postgres_data:
```
Alternatives Ahmed Considered
| Option | Pros | Cons | Best For |
|---|---|---|---|
| Single VPS | Cheap, simple, fast to deploy | No redundancy, limited scale | MVP, <100 users |
| Heroku/Railway | Even simpler, managed | More expensive, less control | Quick prototypes |
| Kubernetes | Scalable, industry standard | Massive overkill, complex | Never for MVP |
Start with the simplest thing that works. Ahmed could have spent weeks setting up Kubernetes. Instead, he had paying customers in 2 weeks.
The MVP worked. 10 businesses signed up. Ahmed was charging $50/month each. Revenue: $500/month. Profit after server costs: $480.
Life was good. For about 6 weeks.
Growing Pains
When "it works" stops working
Month 2. Ahmed wakes up to 47 support messages. All variations of the same thing:
"The app is down."
The VPS ran out of disk space. PostgreSQL logs had grown to 45GB. The database crashed. No backups existed.
Result: 3 hours of downtime. 2 customers asked for refunds.
Ahmed fixed it. Added log rotation. Set up daily backups to S3. Felt smart.
Two weeks later:
Black Friday. Traffic spiked 5x. The single server couldn't handle it. Response times went from 200ms to 15 seconds. Then timeout errors.
Result: Lost orders. Angry business owners. One customer's customer thought the store was closed.
The pattern was clear: every month brought a new "surprise."
Pain is the best teacher. Each crash taught Ahmed what production actually requires. The MVP architecture had served its purpose. Now it was time to grow up.
The Failure Taxonomy
These failures aren't random. They follow predictable patterns. Name them so you can spot them before they hit you:
**The Friday deploy.** Deploying before a weekend when you won't be around to fix issues. Ahmed deployed on Friday at 5pm. The bug showed up Saturday morning. He didn't see it until Monday.
Prevention: No deploys after Wednesday. Friday deploys need explicit approval and on-call coverage.
**The full disk.** Running out of storage space because logs, temp files, or data grew unchecked. Ahmed's PostgreSQL logs grew to 45GB. The database crashed with "no space left on device".
Prevention: Log rotation, disk space alerts at 70%, autoscaling storage (or RDS managed storage).
**The egress surprise.** Unexpected outbound data transfer costs or security issues. Ahmed's compromised server was sending traffic to a botnet. He noticed only when AWS throttled him.
Prevention: Restrict egress in security groups. Only allow outbound to the specific IPs/ports you need.
**The traffic spike.** Not being ready for sudden traffic increases. Ahmed got mentioned on Twitter. Traffic 10x'd. His single server couldn't handle it. The site went down during his biggest opportunity.
Prevention: Autoscaling, load testing, CDN for static assets, graceful degradation.
Going Serious
The AWS decision
Ahmed had 30 paying customers now. $1,500/month revenue. Enough to justify real infrastructure. But which cloud?
The Options
| Provider | Pros | Cons | Best For |
|---|---|---|---|
| AWS | Most services, best docs, largest community | Complex pricing, learning curve | Most startups |
| GCP | Great for AI/ML, simpler pricing | Smaller ecosystem | AI-heavy apps |
| Azure | Enterprise integration, Microsoft stack | Less startup-friendly | Enterprise, .NET |
| Self-hosted | Full control, potentially cheaper | You manage everything | Expert teams only |
Ahmed chose AWS because:
- Most tutorials and Stack Overflow answers
- ECS for containers (simpler than Kubernetes)
- RDS for managed database (no more disk space surprises)
- His potential enterprise customers already used AWS
Choose the cloud your customers use, or the one with the best docs for your skill level. The "best" cloud is the one you can actually operate.
The Transformation
Here's what changed when Ahmed moved from a $20 VPS to AWS:
Before (single VPS):
- No redundancy
- Manual backups
- Can't scale
- Single point of failure

After (AWS):
- Auto failover
- Automated backups
- Scales to demand
- Multi-AZ redundancy
The Five Services
What Ahmed is actually building
Before diving into infrastructure, let's understand what Ahmed's platform actually does:
Why Separate Services?
| Service | Language | Why Separate? |
|---|---|---|
| Frontend | JavaScript | Static files can be cached, different deploy cycle |
| Backend | Node.js | Business logic, real-time updates, API |
| Chatbot API | Python | AI libraries are Python-first. Heavy computation shouldn't block Node.js |
| Weaviate | Go | Specialized vector database, not something you build yourself |
| Admin | Python | Tenant management and configuration; low traffic, stricter access control |
Separate services when they have different scaling needs or different technology requirements. The AI chatbot needs Python. It also needs to scale independently when message volume spikes.
How Services Talk to Each Other
Not all services talk to all services. Here's the actual communication map:
Who Calls Who
| From | To | How | What For |
|---|---|---|---|
| Frontend | Backend | HTTP REST | All user actions, data fetching |
| Frontend | Backend | WebSocket | Real-time updates, notifications |
| Backend | Chatbot API | HTTP REST | Process AI chat messages |
| Backend | PostgreSQL | TCP (pg) | All data reads/writes |
| Backend | Redis | TCP | Cache, sessions, pub/sub, queues |
| Chatbot API | Weaviate | HTTP REST | Semantic search for products |
| Chatbot API | OpenAI/LLM | HTTPS | Generate AI responses |
| Admin | PostgreSQL | TCP (pg) | Tenant management, config |
Backend is the hub. Frontend never talks to Chatbot API directly. This gives Backend control over authentication, rate limiting, and request validation. If you need to add security rules, you add them in one place.
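The hub pattern can be sketched as a single gate that every chat request must pass before the Chatbot API is ever called. This is an illustrative sketch, not Ahmed's actual code — the function names and the 30-requests-per-minute budget are assumptions:

```javascript
// Backend-as-hub sketch: auth and rate limiting happen in one place,
// and only this gate ever forwards to the Chatbot API.

const WINDOW_MS = 60_000;   // 1-minute fixed window (assumed)
const MAX_PER_WINDOW = 30;  // per-tenant request budget (assumed)
const counters = new Map(); // tenantId -> { windowStart, count }

function allowRequest(tenantId, now = Date.now()) {
  const entry = counters.get(tenantId);
  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    counters.set(tenantId, { windowStart: now, count: 1 });
    return true;
  }
  entry.count += 1;
  return entry.count <= MAX_PER_WINDOW;
}

async function gateChatRequest(user, message, forwardToChatbot) {
  if (!user) return { status: 401, body: { error: 'Unauthorized' } };
  if (!allowRequest(user.tenantId)) {
    return { status: 429, body: { error: 'Too many requests' } };
  }
  // The frontend never calls the Chatbot API directly; only this path does.
  return { status: 200, body: await forwardToChatbot(user.tenantId, message) };
}
```

Adding a new security rule (say, per-plan quotas) means touching this one gate, not every client.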
Container Layer
ECS: Running containers without managing servers
Ahmed's apps run in Docker containers. But where do those containers run?
ECS (Elastic Container Service) is AWS's container orchestration service. Think of it as a manager that:
- Runs your Docker containers across multiple servers
- Restarts them if they crash
- Scales them up or down based on demand
- Routes traffic to healthy containers
Why not just Docker on EC2? You could, but then YOU manage server patching, container placement, health checks, and scaling. ECS handles all that. With Fargate (serverless mode), you don't even manage the underlying servers.
The Options
| Option | Complexity | Cost | Best For |
|---|---|---|---|
| EC2 + Docker | Low | Low-Medium | Simple apps, full control needed |
| ECS Fargate | Medium | Medium | Container apps, no server management |
| EKS (Kubernetes) | High | High | Large teams, complex orchestration |
| Lambda | Low | Variable | Event-driven, sporadic traffic |
| App Runner | Very Low | Medium | Simple web apps |
Ahmed chose ECS Fargate because:
- No servers to manage - Fargate handles the underlying infrastructure
- Simpler than Kubernetes - One less thing to learn and debug
- Good enough for 5 services - EKS would be overkill
- Scales automatically - Add tasks when load increases
Key Concepts
Task Definition Anatomy
{
"family": "backend",
"cpu": "512", // 0.5 vCPU
"memory": "1024", // 1 GB RAM
"containerDefinitions": [{
"name": "api",
"image": "123456.dkr.ecr.eu-west-1.amazonaws.com/backend:latest",
"portMappings": [{ "containerPort": 6000 }],
"healthCheck": {
"command": ["CMD-SHELL", "wget -q --spider http://127.0.0.1:6000/health"]
},
"environment": [
{ "name": "NODE_ENV", "value": "production" }
],
"secrets": [
{ "name": "DATABASE_URL", "valueFrom": "arn:aws:secretsmanager:..." }
]
}]
}
The health check command must actually verify your app is working. Don't just check if the port is open. Hit an endpoint that touches the database. If the health check passes but the app is broken, ECS won't know to restart it.
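A minimal sketch of such a health check, with the database client injected so the handler stays testable. The names are illustrative; `db` is whatever client the app uses (for example a pg Pool), and only needs a `query()` method:

```javascript
// Health check that verifies the database, not just process liveness.
async function healthCheck(db) {
  try {
    // A trivial query proves we can reach the database and get an answer back.
    await db.query('SELECT 1');
    return { status: 200, body: { ok: true } };
  } catch (err) {
    // Failing loudly lets ECS replace this task instead of routing traffic to it.
    return { status: 503, body: { ok: false, error: err.message } };
  }
}

// Wired into Express it might look like:
// app.get('/health', async (req, res) => {
//   const result = await healthCheck(pool);
//   res.status(result.status).json(result.body);
// });
```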
Decision Check: Container Platform
Your team is 2 engineers with no Kubernetes experience. What should you use?
ECS Fargate is the right choice. Here's why:
- Kubernetes requires expertise you don't have. Learning curve is 3-6 months.
- ECS is AWS-native, integrates with everything else you're using.
- Fargate = no servers. You don't manage EC2 instances at all.
- You can migrate to EKS later if you outgrow ECS (rare at <50 services).
When Kubernetes makes sense: Large team (5+ engineers), 20+ services, need for advanced features like service mesh, or company-wide Kubernetes standardization.
Docker & ECR
Building and storing container images
Multi-Stage Builds
Ahmed's Docker images use multi-stage builds to keep them small and secure:
```dockerfile
# Stage 1: Build (includes dev dependencies, source code)
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: Run (only what's needed)
FROM node:20-alpine AS runner
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
CMD ["node", "dist/server.js"]
```
**800MB image (single-stage):** includes dev dependencies, source code, build tools. Slow to push, slow to pull.

**200MB image (multi-stage):** only production code. Faster deploys, smaller attack surface.
ECR: Where Images Live
ECR (Elastic Container Registry) is AWS's Docker image storage. The flow:
```shell
docker build -t backend .
docker tag backend:latest 123456.dkr.ecr.../backend:abc123
docker push 123456.dkr.ecr.../backend:abc123
```

Ahmed uses Alpine Linux for smaller images (200MB vs 800MB), but hit several quirks:
- **No curl by default** — Alpine ships with `wget` instead of `curl`. Health check commands that use `curl` silently fail. Use `wget --spider http://localhost:3000/health` or install curl with `apk add --no-cache curl`.
- **IPv6 issues** — Alpine resolves `localhost` to `::1` (IPv6) by default. If your app only binds to IPv4, use `127.0.0.1` explicitly in all health checks and connection strings.
- **Next.js binding** — Next.js binds to `localhost` only by default. In a container, that means external connections are refused. Set `HOSTNAME=0.0.0.0` in your environment variables or it won't accept connections from the ECS health checker.
- **Native modules** — If your Node.js dependencies include native C++ addons (bcrypt, sharp, canvas), they need `build-base` and `python3` installed in Alpine. This increases build time. Consider pre-built alternatives: `bcryptjs` instead of `bcrypt`, `@img/sharp-linux-arm64`, etc.
- **Timezone** — Alpine doesn't include timezone data by default. If your app logs timestamps, add `apk add --no-cache tzdata` and set `TZ=UTC`.
Despite these quirks, Alpine is worth it: 200MB images deploy 4x faster than 800MB ones, saving ~30 seconds per deployment across all ECS tasks.
Database Layer
RDS: Managed PostgreSQL
Remember Ahmed's disk space disaster with the MVP? That's why he uses RDS now—AWS manages backups, patches, and disk space.
The Options
| Option | Managed? | Cost | Best For |
|---|---|---|---|
| Self-hosted PostgreSQL | No | Low | Experts with time to manage |
| RDS PostgreSQL | Yes | Medium | Most production apps |
| Aurora | Yes | High | High performance, auto-scaling storage |
| PlanetScale/Neon | Yes | Variable | Serverless, branching for dev |
High Availability
| Option | Failover Time | Cost | When to Use |
|---|---|---|---|
| Single-AZ | Manual (hours) | $ | Dev/staging environments |
| Multi-AZ | Automatic (~60s) | $$ | Production databases |
| Aurora Multi-Master | Zero | $$$ | Mission-critical, zero downtime |
Ahmed uses:
- Multi-AZ for production - Can't afford downtime for customer orders
- Single-AZ for dev - Acceptable risk, saves money
Connection pooling. 10 ECS tasks × 10 connections each = 100 connections. RDS has limits. Ahmed will need PgBouncer or RDS Proxy as he scales further.
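The connection math can be made explicit. A small sketch of the budget calculation — the 100-connection limit and the 10-connection admin reserve are illustrative assumptions, not RDS defaults:

```javascript
// Each ECS task opens its own pool, so total connections = tasks × pool size.
// That total must stay under the RDS connection limit, with headroom left
// for admin and monitoring sessions.

function maxPoolSizePerTask(maxTasks, rdsMaxConnections, reservedForAdmin = 10) {
  const budget = rdsMaxConnections - reservedForAdmin;
  return Math.floor(budget / maxTasks);
}

// Ahmed's backend autoscales to 10 tasks; assume RDS allows 100 connections.
// Each task's pool should then be capped at:
const poolMax = maxPoolSizePerTask(10, 100); // 9 connections per task
```

With PgBouncer or RDS Proxy in front, tasks share a pooled set of server connections instead, and this per-task cap stops being the bottleneck.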
Decision Check: Database
You need ACID transactions and your data is relational. What database?
PostgreSQL (RDS or Aurora) is the standard answer. Here's why:
- ACID compliance: Transactions, constraints, referential integrity built-in
- Relational data: JOINs, foreign keys, normalization
- RDS vs Aurora: RDS is cheaper for small workloads. Aurora for high-performance, auto-scaling storage.
- Multi-AZ: Automatic failover in ~60 seconds
When NOT PostgreSQL: If you need horizontal scaling for massive writes (DynamoDB), document storage (MongoDB), or time-series data (TimescaleDB, built on Postgres anyway).
Multi-Tenancy
One platform, many businesses
Ahmed has 30 businesses on his platform. Each thinks they have their own system. But they all share the same database. How does this work?
The Options
| Pattern | Isolation | Cost | Best For |
|---|---|---|---|
| Shared DB + tenant_id | Low | $ | Most SaaS startups |
| Schema per tenant | Medium | $$ | Compliance requirements |
| Database per tenant | High | $$$ | Enterprise, regulated industries |
Ahmed uses Shared DB + tenant_id:
```sql
CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  tenant_id UUID NOT NULL,  -- The isolation key
  customer_id INT,
  total DECIMAL,
  created_at TIMESTAMP
);

-- Every query filters by tenant
SELECT * FROM orders WHERE tenant_id = $1;
```
Forget the WHERE tenant_id = ... clause and one business sees another's data. This is a data leak.
Solution: Use PostgreSQL Row-Level Security (RLS) to enforce tenant isolation at the database level.
RLS makes the database enforce tenant isolation automatically. Even if application code forgets the WHERE tenant_id = ... clause, the database won't return other tenants' data.
-- Enable RLS on the orders table
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
-- Create a policy: users can only see rows where
-- tenant_id matches the current session variable
CREATE POLICY tenant_isolation ON orders
USING (tenant_id = current_setting('app.current_tenant')::uuid);
-- In your middleware (runs before every query):
-- SET app.current_tenant = 'tenant-abc';
-- Now this query automatically filters by tenant:
SELECT * FROM orders;
-- ↑ Only returns orders for tenant-abc, even without WHERE clause
How Ahmed uses it in Express middleware:
async function tenantMiddleware(req, res, next) {
const tenantId = req.user?.tenantId;
if (!tenantId) return res.status(401).json({ error: 'No tenant' });
// IMPORTANT: Validate tenantId format to prevent SQL injection!
if (!/^[a-f0-9-]{36}$/.test(tenantId)) {
return res.status(400).json({ error: 'Invalid tenant ID' });
}
// Set the tenant for this database connection.
// set_config() takes the value as a query parameter, unlike SET,
// so nothing is string-interpolated into the SQL.
await db.query(`SELECT set_config('app.current_tenant', $1, false)`, [tenantId]);
next();
}
// Now every database query in this request is automatically
// scoped to the authenticated user's tenant. No WHERE clause needed.
Always validate tenantId before using it in SET commands. If tenantId comes from user input (even indirectly through JWT claims), a malicious value like '; DROP TABLE orders; -- could execute arbitrary SQL. Use a strict regex pattern (UUID format) or parameterized queries where possible.
Why this matters: Without RLS, a single missed WHERE clause in any of your hundreds of queries could leak data. With RLS, the database is the safety net. Even a bug in application code can't cross tenant boundaries.
Decision Check: Multi-Tenancy
You have 5 enterprise customers who each want data isolation. What pattern?
Database per tenant is likely the right choice for this scenario. Here's why:
- Enterprise customers demand isolation: They often require it contractually for compliance (SOC2, HIPAA)
- 5 customers is manageable: The operational overhead is acceptable at this scale
- Easy data export/deletion: Customer leaves? Drop the database.
- Independent backups: Restore one customer without affecting others
When shared DB with RLS works: Many small customers (100+), cost-sensitive, customers don't require strict isolation. Schema per tenant: Middle ground, but adds migration complexity.
Caching & Queues
Redis for everything
Ahmed uses Redis (actually AWS ElastiCache running Valkey) for four different purposes: caching, sessions, pub/sub, and job queues.
Priority Queues
Not all jobs are equal. A payment webhook is more urgent than an analytics update:
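One way to sketch priority-ordered consumption: the worker always drains higher-priority queues first, so a payment webhook is picked before an analytics update. The queue names here are illustrative:

```javascript
// Priority queues sketch: higher-priority queues are always checked first.
const QUEUES_BY_PRIORITY = ['critical', 'default', 'low'];

function nextJob(queues) {
  // queues: { critical: [...], default: [...], low: [...] }
  for (const name of QUEUES_BY_PRIORITY) {
    const queue = queues[name];
    if (queue && queue.length > 0) return queue.shift();
  }
  return null; // nothing to do
}
```

With Redis lists, `BLPOP queue:critical queue:default queue:low 0` gives the same behavior, since BLPOP checks the given keys in order.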
Dead Letter Queue
What happens when a job fails after all retries?
Failed jobs move to a separate queue for manual review. They don't disappear, and they don't block other jobs.
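The retry-then-park flow might look like this sketch. The `maxAttempts` of 3 is an assumed default, and in production this logic usually lives in a job library rather than hand-rolled code:

```javascript
// Retry-then-dead-letter sketch: a job gets maxAttempts tries; after that
// it is parked on a dead-letter queue for manual review, rather than
// retried forever or silently dropped.

async function processWithDLQ(job, handler, deadLetterQueue, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await handler(job);
    } catch (err) {
      if (attempt === maxAttempts) {
        // Keep the job and its failure reason so a human can inspect it later.
        deadLetterQueue.push({ job, error: err.message, failedAt: Date.now() });
        return null; // job is parked; the worker moves on
      }
    }
  }
}
```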
Autoscaling
Add capacity when needed, remove when not
Ahmed's traffic varies. Quiet at night, busy during business hours. Why pay for 10 servers at 3 AM when 1 is enough?
Ahmed's Autoscaling Rules
| Service | Min | Max | Scale Out When | Scale In When |
|---|---|---|---|---|
| Backend | 1 | 10 | CPU > 50% for 60s | CPU < 30% for 300s |
| Frontend | 1 | 6 | CPU > 50% for 60s | CPU < 30% for 300s |
| Chatbot API | 1 | 4 | CPU > 70% for 60s | CPU < 40% for 300s |
Why Asymmetric Cooldowns?
React fast to load. Users are waiting.
Don't flap. Avoid add/remove/add/remove cycles.
You can scale on more than CPU. Ahmed scales his chatbot based on queue depth: if there are more than 1000 messages waiting, add workers.
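A sketch of that queue-depth policy. The 1,000-message threshold comes from the text; treating it as a per-worker target, and reusing the 1–4 bounds from the autoscaling table, are illustrative assumptions:

```javascript
// Queue-depth scaling sketch: size the worker fleet from the backlog
// instead of CPU.
function desiredWorkers(queueDepth, { messagesPerWorker = 1000, min = 1, max = 4 } = {}) {
  const needed = Math.ceil(queueDepth / messagesPerWorker);
  // Clamp to the service's autoscaling bounds.
  return Math.min(max, Math.max(min, needed));
}
```

On AWS this maps to a CloudWatch metric (queue depth) driving a target-tracking or step-scaling policy on the ECS service.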
Networking
How requests flow through the system
A request from a user's browser goes through multiple layers. Let's trace a real request: "User clicks 'Ask AI' to get product recommendations."
The LLM API call (500ms) is 60% of the total request time. This is why Ahmed uses streaming responses—show partial results while waiting—and caches frequent questions.
ALB Routing Rules
ALB (Application Load Balancer) is AWS's Layer 7 load balancer. It sits in front of your services and:
- Distributes traffic across multiple containers/servers
- Routes by path or hostname: /api/* goes to backend, /admin/* goes to admin service
- Terminates TLS: Handles HTTPS, your containers only see HTTP
- Health checks: Only sends traffic to healthy containers
- Sticky sessions: Can route a user to the same container (for WebSocket)
ALB vs NLB: ALB understands HTTP (paths, headers, cookies). NLB is Layer 4 (TCP/UDP only, faster, for non-HTTP traffic). For web apps, use ALB.
One load balancer routes to multiple services based on path:
```text
# Path-based routing
/admin/*      → chatbot-admin:8001
/socket.io/*  → backend:6000    # WebSocket
/api/*        → backend:6000

# Host-based routing
api.example.com → chatbot-api:8000

# Default
/* → frontend:5000
```
VPC: Your Private Network
VPC (Virtual Private Cloud) is your own isolated section of AWS's network. Think of it like having your own private building in a shared office complex:
- Isolation: Other AWS customers can't see or access your resources
- Subnets: You divide your VPC into smaller networks (public-facing vs internal-only)
- Security Groups: Firewall rules that control what traffic is allowed
- IP Addresses: You control the IP range (e.g., 10.0.0.0/16)
Why it matters: Without a VPC, your database would be directly on the internet. With a VPC, only your application servers can reach it.
Ahmed's infrastructure lives in a Virtual Private Cloud (VPC)—an isolated network in AWS:
- Public subnets: ALB lives here (accessible from internet)
- Private subnets: Databases, Redis (not accessible from internet)
- Security groups: Firewall rules per service
The database should never be accessible from the internet. Only your application servers can talk to it. This is enforced by security groups.
The Security Incident
3 AM. Everything is on fire.
Back to that phone call at 3:12 AM.
What Changed After
| Before | After |
|---|---|
| No VPC Flow Logs | Enabled - Can see all network traffic for forensics |
| Open egress (any outbound) | Restricted - Only HTTP/HTTPS/DNS allowed |
| Secrets in environment variables | Secrets Manager with automatic rotation |
| No incident runbook | Documented step-by-step response process |
- You don't know what you don't log. VPC Flow Logs should be on from day 1.
- Restrict egress. Why can your server talk to any IP on any port?
- Have a runbook before 3 AM happens. Half-asleep you needs clear instructions.
WAF & Rate Limiting
After the incident, Ahmed added two more protection layers:
| Protection Layer | Where | What It Blocks |
|---|---|---|
| WAF Rules | CloudFront / ALB | SQL injection, XSS, bad bots, known malicious IPs |
| Rate Limiting | WAF or ALB | Brute force, credential stuffing, API abuse |
| Geo Blocking | WAF | Traffic from countries you don't serve |
| Bot Control | WAF | Scrapers, automated attacks (costs extra) |
WAF costs ~$5/month + $0.60 per million requests. At MVP stage with 100 users, it's overkill. After an incident with 10,000 users? Essential.
Rule of thumb: Add WAF when you have something worth protecting.
Monitoring
Knowing what's happening before customers tell you
Ahmed's MVP had no monitoring. He found out about outages from angry customers. Now he has the three pillars:
AWS X-Ray: Distributed Tracing
When a request touches 5 services, which one is slow? AWS X-Ray traces the entire journey:
```text
Request ID: abc-123-def
│
├── Frontend (50ms)
│
├── Backend (350ms)
│   ├── Auth middleware (15ms)
│   ├── Database query (120ms)
│   └── Redis cache check (5ms)
│
└── Chatbot API (1200ms)   ← The slow one!
    ├── Weaviate search (400ms)
    └── LLM API call (800ms)
```
X-Ray answers: "The chatbot's LLM call takes 800ms. Can we cache frequent questions?"
Structured Logging
Ahmed's MVP logs were useless:
```javascript
// Impossible to search or aggregate
console.log("Order created")
console.log("User logged in")
console.log("Something went wrong")
```
Now every log is JSON with context:
logger.info({
event: "order.created",
orderId: "ord-789",
tenantId: "tenant-abc",
userId: "user-456",
amount: 299.99,
duration: 145,
correlationId: req.headers['x-correlation-id']
})
Correlation IDs: Connecting the Dots
A single user action might hit 5 services. How do you find all related logs?
1. The frontend generates a correlation ID: `corr-xyz-123`
2. Every downstream call carries the header `X-Correlation-ID: corr-xyz-123`
3. Search logs for `corr-xyz-123` → see the entire journey

Critical Alarms
Ahmed has CloudWatch alarms. They trigger SNS topics. But nobody subscribes to those topics. The alarms fire into the void. Nobody gets paged at 3 AM.
Fix: Connect SNS to Slack, PagerDuty, or email. Takes 30 minutes.
Deployment
How code gets to production
Ahmed's deployment is a shell script. Not ideal, but it works:
```shell
npm audit --audit-level=high
aws ecs update-service --force-new-deployment
```

What's Missing
Today (shell script):
- Manual trigger (run script)
- No automatic rollback
- No staging environment first
- No approval gates

With CI/CD:
- Push to main triggers deploy
- Auto rollback on failure
- Deploy to staging first
- Manual approval for prod
Ahmed's next priority: GitHub Actions CI/CD. Automate the deploy script, add staging environment, add rollback on health check failure.
Here's what a proper CI/CD pipeline looks like for Ahmed's ECS setup. Push to main triggers the full pipeline:
```yaml
name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - run: npm test
      - run: npm audit --audit-level=high

  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-west-1
      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push Docker image
        run: |
          docker build -t $ECR_REPO:${{ github.sha }} .
          docker push $ECR_REPO:${{ github.sha }}
      - name: Deploy to ECS
        run: |
          # Update task definition with new image
          aws ecs update-service \
            --cluster production \
            --service api \
            --force-new-deployment
      - name: Wait for stability
        run: |
          aws ecs wait services-stable \
            --cluster production \
            --services api
          # If this times out, ECS auto-rolls back
```
The key insight: ECS has built-in rollback. If the new containers fail health checks, ECS keeps the old containers running. You don't need custom rollback logic — just good health checks.
What this gives Ahmed over the shell script: automated on every push, no human needed, rollback if anything fails, full audit trail in GitHub Actions history.
Cost Breakdown
Where the money goes
Ahmed's monthly AWS bill: ~$500
Optimizations Applied
- Right-sized instances: t4g.small instead of m5.large (ARM is cheaper)
- Weekend shutdown: Script stops non-essential services Fri night, starts Mon morning
- Reserved instances: Considering for database (30-70% savings)
$500/month infrastructure against $1,500/month revenue means infrastructure consumes a third of revenue. Not great, but acceptable for a growing startup. As the customer base grows, cost per customer decreases.
Four more optimizations are on the table:
1. Graviton (ARM) instances everywhere. Ahmed already uses t4g.small for compute. But his RDS is still x86. Switching to db.t4g.small saves ~20% on the biggest cost item ($225 → ~$180).
2. Reserved Instances for RDS. Ahmed's databases run 24/7. A 1-year reserved instance commitment saves 30-40%. That's $225 → ~$140. The catch: you're committed for a year.
3. Fargate Spot for the AI worker. The AI processing service handles async tasks from a queue. If a Spot instance gets reclaimed, the task goes back to the queue. No data loss. Savings: 50-70% on compute for that service.
4. Aurora Serverless v2 instead of RDS. If traffic is bursty (high during business hours, near-zero at night), Aurora Serverless scales to near-zero and back. For Ahmed's pattern, this could be cheaper than always-on RDS.
Projected savings: With all optimizations, Ahmed could get from $500 to ~$300/month — dropping infrastructure cost to 20% of revenue.
What To Do Monday
Based on where you are in your journey, here's your next concrete step:
If you're at MVP stage...
Focus on getting users first. Stay on the $20 VPS. Add daily automated backups (takes 30 minutes to set up). Don't add complexity until pain forces you. Ship your product, not your infrastructure.
If you're preparing for production...
Prioritize observability. Set up structured logging, basic CloudWatch alarms (service down, high error rate), and connect alerts to Slack/email. You can't fix what you can't see. This is Day 1 infrastructure.
If you're scaling...
Consider connection pooling and caching. Database connections become a bottleneck around 50+ concurrent users. Add RDS Proxy or PgBouncer. Cache expensive queries in Redis. Review your autoscaling rules.
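Cache-aside is the usual starting point for that query caching. A sketch with the Redis and database clients injected as stand-ins — the key name and the 60-second TTL are assumptions:

```javascript
// Cache-aside sketch: check the cache first, fall back to the database,
// then store the result with a TTL so stale data ages out on its own.
// `cache` needs get(key) and set(key, value, ttlSeconds);
// `db` needs query(sql).

async function getCached(cache, db, key, sql, ttlSeconds = 60) {
  const hit = await cache.get(key);
  if (hit !== null && hit !== undefined) return JSON.parse(hit);
  const rows = await db.query(sql);
  await cache.set(key, JSON.stringify(rows), ttlSeconds);
  return rows;
}
```

With a real Redis client the `set` call would pass the TTL as an expiry option; the shape above is kept generic on purpose.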
Practice Mode: Test Your Understanding
Before you move on, make sure you can answer these scenarios:
Scenario 1: Your ECS task keeps stopping and restarting. Check for:
- Memory limits: Container OOM killed? Check CloudWatch for memory usage hitting 100%.
- Exit codes: Look at stopped task details. Exit code 137 = OOM. Exit code 1 = app crash.
- Startup time: Is the health check timing out before the app starts? Increase health check grace period.
- Dependencies: Is the app crashing because it can't reach the database or Redis?
Quick command: aws ecs describe-tasks --cluster prod --tasks $(aws ecs list-tasks --cluster prod --desired-status STOPPED --query 'taskArns[0]' --output text)
Scenario 2: Latency is high but CPU is low. Low CPU + high latency = waiting on something external:
- Database: Check RDS CloudWatch for high latency, connection count near limit, or replica lag.
- Redis: Check ElastiCache metrics for evictions or high memory usage.
- External APIs: Is your LLM provider slow? Add tracing to identify the slow call.
- Connection starvation: Pool exhausted? All threads waiting for a connection?
The pattern: High CPU = your code is slow. Low CPU + high latency = you're waiting on I/O.
Scenario 3: You shipped a bad deploy to production. ECS rollback options (fastest to slowest):
- Do nothing: If health checks are configured correctly, ECS won't drain old containers until new ones are healthy. Bad deploys auto-stop.
- Update task definition: Point back to the previous image tag and update service.
- ECS service auto-rollback: Enable "circuit breaker" with rollback. ECS automatically reverts if deployment fails.
Prevention: Tag images with git SHA, not just :latest. You need to know exactly which version to roll back to.
Where To Go Deep
Ahmed's journey touched every reliability pattern. Here's where to learn each one in depth:
Key Takeaways
Continue Learning
والله أعلم
وصلى الله وسلم على نبينا محمد وعلى آله وصحبه أجمعين