بِسْمِ اللهِ الرَّحْمٰنِ الرَّحِيْمِ
Ahmed's phone buzzes. Then again. Then it won't stop.
"Your EC2 instance is generating unusual outbound traffic."
He opens CloudWatch. Outbound bandwidth: 10x normal. CPU: 100%. His server—the one running his entire business—is attacking other servers on the internet.
A security vulnerability he didn't know existed. A botnet he's now part of. And AWS will suspend his account in 30 minutes if he doesn't respond.
This is the story of what Ahmed built, what broke, and what he learned.
The Dream
Why Ahmed started building
Ahmed noticed something. Small businesses spend hours every day answering the same customer questions—order status, business hours, product availability. One person, typing the same answers over and over.
"What if I could automate this?" he thought. "What if I could build a platform where any business could have an AI chatbot handling their customer inquiries?"
The vision was clear:
- Multi-tenant - One platform, many businesses
- AI-powered - Chatbots that actually understand questions
- Multi-channel - Website widget, mobile app, messaging platforms
Ahmed had a laptop, some Node.js experience, and $200 in savings. No team. No infrastructure expertise. Just an idea and determination.
The MVP Era
Week 1-4: "It works on my machine"
Ahmed's first architecture was beautifully simple:
Total cost: $20/month. Total complexity: One docker-compose.yml file.
```yaml
version: '3.8'
services:
  backend:
    build: ./backend
    ports: ["6000:6000"]
    depends_on: [postgres, redis]
  frontend:
    build: ./frontend
    ports: ["3000:3000"]
  postgres:
    image: postgres:15
    volumes: [postgres_data:/var/lib/postgresql/data]
  redis:
    image: redis:7-alpine

volumes:
  postgres_data:
```
Alternatives Ahmed Considered
| Option | Pros | Cons | Best For |
|---|---|---|---|
| Single VPS | Cheap, simple, fast to deploy | No redundancy, limited scale | MVP, <100 users |
| Heroku/Railway | Even simpler, managed | More expensive, less control | Quick prototypes |
| Kubernetes | Scalable, industry standard | Massive overkill, complex | Never for MVP |
Start with the simplest thing that works. Ahmed could have spent weeks setting up Kubernetes. Instead, he had paying customers in 2 weeks.
The MVP worked. 10 businesses signed up. Ahmed was charging $50/month each. Revenue: $500/month. Profit after server costs: $480.
Life was good. For about 6 weeks.
Growing Pains
When "it works" stops working
Month 2. Ahmed wakes up to 47 support messages. All variations of the same thing:
"The app is down."
The VPS ran out of disk space. PostgreSQL logs had grown to 45GB. The database crashed. No backups existed.
Result: 3 hours of downtime. 2 customers asked for refunds.
Ahmed fixed it. Added log rotation. Set up daily backups to S3. Felt smart.
Two weeks later:
Black Friday. Traffic spiked 5x. The single server couldn't handle it. Response times went from 200ms to 15 seconds. Then timeout errors.
Result: Lost orders. Angry business owners. One customer's customer thought the store was closed.
The pattern was clear: every month brought a new "surprise."
Pain is the best teacher. Each crash taught Ahmed what production actually requires. The MVP architecture had served its purpose. Now it was time to grow up.
The Failure Taxonomy
These failures aren't random. They follow predictable patterns. Name them so you can spot them before they hit you:
**The Friday deploy.** Deploying before a weekend when you won't be around to fix issues. Ahmed deployed on Friday at 5pm. The bug showed up Saturday morning. He didn't see it until Monday.
Prevention: No deploys after Wednesday. Friday deploys need explicit approval and on-call coverage.
**The full disk.** Running out of storage space because logs, temp files, or data grew unchecked. Ahmed's PostgreSQL logs grew to 45GB. The database crashed with "no space left on device".
Prevention: Log rotation, disk space alerts at 70%, autoscaling storage (or RDS managed storage).
**The egress surprise.** Unexpected outbound data transfer costs or security issues. Ahmed's compromised server was sending traffic to a botnet. He noticed only when AWS throttled him.
Prevention: Restrict egress in security groups. Only allow outbound to the specific IPs/ports you need.
**The traffic spike.** Not being ready for sudden traffic increases. Ahmed got mentioned on Twitter. Traffic 10x'd. His single server couldn't handle it. The site went down during his biggest opportunity.
Prevention: Autoscaling, load testing, CDN for static assets, graceful degradation.
Going Serious
The AWS decision
Ahmed had 30 paying customers now. $1,500/month revenue. Enough to justify real infrastructure. But which cloud?
The Options
| Provider | Pros | Cons | Best For |
|---|---|---|---|
| AWS | Most services, best docs, largest community | Complex pricing, learning curve | Most startups |
| GCP | Great for AI/ML, simpler pricing | Smaller ecosystem | AI-heavy apps |
| Azure | Enterprise integration, Microsoft stack | Less startup-friendly | Enterprise, .NET |
| Self-hosted | Full control, potentially cheaper | You manage everything | Expert teams only |
Ahmed chose AWS because:
- Most tutorials and Stack Overflow answers
- ECS for containers (simpler than Kubernetes)
- RDS for managed database (no more disk space surprises)
- His potential enterprise customers already used AWS
Choose the cloud your customers use, or the one with the best docs for your skill level. The "best" cloud is the one you can actually operate.
The Transformation
Here's what changed when Ahmed moved from a $20 VPS to AWS:
Before (single VPS):
- No redundancy
- Manual backups
- Can't scale
- Single point of failure

After (AWS):
- Auto failover
- Automated backups
- Scales to demand
- Multi-AZ redundancy
The Five Services
What Ahmed is actually building
Before diving into infrastructure, let's understand what Ahmed's platform actually does:
Why Separate Services?
| Service | Language | Why Separate? |
|---|---|---|
| Frontend | JavaScript | Static files can be cached, different deploy cycle |
| Backend | Node.js | Business logic, real-time updates, API |
| Chatbot API | Python | AI libraries are Python-first. Heavy computation shouldn't block Node.js |
| Weaviate | Go | Specialized vector database, not something you build yourself |
| Admin | Python | Tenant management and configuration; low traffic, stricter access control |
Separate services when they have different scaling needs or different technology requirements. The AI chatbot needs Python. It also needs to scale independently when message volume spikes.
How Services Talk to Each Other
Not all services talk to all services. Here's the actual communication map:
Who Calls Who
| From | To | How | What For |
|---|---|---|---|
| Frontend | Backend | HTTP REST | All user actions, data fetching |
| Frontend | Backend | WebSocket | Real-time updates, notifications |
| Backend | Chatbot API | HTTP REST | Process AI chat messages |
| Backend | PostgreSQL | TCP (pg) | All data reads/writes |
| Backend | Redis | TCP | Cache, sessions, pub/sub, queues |
| Chatbot API | Weaviate | HTTP REST | Semantic search for products |
| Chatbot API | OpenAI/LLM | HTTPS | Generate AI responses |
| Admin | PostgreSQL | TCP (pg) | Tenant management, config |
Backend is the hub. Frontend never talks to Chatbot API directly. This gives Backend control over authentication, rate limiting, and request validation. If you need to add security rules, you add them in one place.
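The hub pattern can be sketched as a single gate that every chat request must pass before the Chatbot API is ever called. This is an illustrative sketch, not Ahmed's actual code — the function names and the 30-requests-per-minute budget are assumptions:

```javascript
// Backend-as-hub sketch: auth and rate limiting happen in one place,
// and only this gate ever forwards to the Chatbot API.

const WINDOW_MS = 60_000;   // 1-minute fixed window (assumed)
const MAX_PER_WINDOW = 30;  // per-tenant request budget (assumed)
const counters = new Map(); // tenantId -> { windowStart, count }

function allowRequest(tenantId, now = Date.now()) {
  const entry = counters.get(tenantId);
  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    counters.set(tenantId, { windowStart: now, count: 1 });
    return true;
  }
  entry.count += 1;
  return entry.count <= MAX_PER_WINDOW;
}

async function gateChatRequest(user, message, forwardToChatbot) {
  if (!user) return { status: 401, body: { error: 'Unauthorized' } };
  if (!allowRequest(user.tenantId)) {
    return { status: 429, body: { error: 'Too many requests' } };
  }
  // The frontend never calls the Chatbot API directly; only this path does.
  return { status: 200, body: await forwardToChatbot(user.tenantId, message) };
}
```

Adding a new security rule (say, per-plan quotas) means touching this one gate, not every client.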
Container Layer
ECS: Running containers without managing servers
Ahmed's apps run in Docker containers. But where do those containers run?
ECS (Elastic Container Service) is AWS's container orchestration service. Think of it as a manager that:
- Runs your Docker containers across multiple servers
- Restarts them if they crash
- Scales them up or down based on demand
- Routes traffic to healthy containers
Why not just Docker on EC2? You could, but then YOU manage server patching, container placement, health checks, and scaling. ECS handles all that. With Fargate (serverless mode), you don't even manage the underlying servers.
The Options
| Option | Complexity | Cost | Best For |
|---|---|---|---|
| EC2 + Docker | Low | Low-Medium | Simple apps, full control needed |
| ECS Fargate | Medium | Medium | Container apps, no server management |
| EKS (Kubernetes) | High | High | Large teams, complex orchestration |
| Lambda | Low | Variable | Event-driven, sporadic traffic |
| App Runner | Very Low | Medium | Simple web apps |
Ahmed chose ECS Fargate because:
- No servers to manage - Fargate handles the underlying infrastructure
- Simpler than Kubernetes - One less thing to learn and debug
- Good enough for 5 services - EKS would be overkill
- Scales automatically - Add tasks when load increases
Key Concepts
Task Definition Anatomy
{
"family": "backend",
"cpu": "512", // 0.5 vCPU
"memory": "1024", // 1 GB RAM
"containerDefinitions": [{
"name": "api",
"image": "123456.dkr.ecr.eu-west-1.amazonaws.com/backend:latest",
"portMappings": [{ "containerPort": 6000 }],
"healthCheck": {
"command": ["CMD-SHELL", "wget -q --spider http://127.0.0.1:6000/health"]
},
"environment": [
{ "name": "NODE_ENV", "value": "production" }
],
"secrets": [
{ "name": "DATABASE_URL", "valueFrom": "arn:aws:secretsmanager:..." }
]
}]
}
The health check command must actually verify your app is working. Don't just check if the port is open. Hit an endpoint that touches the database. If the health check passes but the app is broken, ECS won't know to restart it.
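A minimal sketch of such a health check, with the database client injected so the handler stays testable. The names are illustrative; `db` is whatever client the app uses (for example a pg Pool), and only needs a `query()` method:

```javascript
// Health check that verifies the database, not just process liveness.
async function healthCheck(db) {
  try {
    // A trivial query proves we can reach the database and get an answer back.
    await db.query('SELECT 1');
    return { status: 200, body: { ok: true } };
  } catch (err) {
    // Failing loudly lets ECS replace this task instead of routing traffic to it.
    return { status: 503, body: { ok: false, error: err.message } };
  }
}

// Wired into Express it might look like:
// app.get('/health', async (req, res) => {
//   const result = await healthCheck(pool);
//   res.status(result.status).json(result.body);
// });
```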
Decision Check: Container Platform
Your team is 2 engineers with no Kubernetes experience. What should you use?
ECS Fargate is the right choice. Here's why:
- Kubernetes requires expertise you don't have. Learning curve is 3-6 months.
- ECS is AWS-native, integrates with everything else you're using.
- Fargate = no servers. You don't manage EC2 instances at all.
- You can migrate to EKS later if you outgrow ECS (rare at <50 services).
When Kubernetes makes sense: Large team (5+ engineers), 20+ services, need for advanced features like service mesh, or company-wide Kubernetes standardization.
Docker & ECR
Building and storing container images
Multi-Stage Builds
Ahmed's Docker images use multi-stage builds to keep them small and secure:
```dockerfile
# Stage 1: Build (includes dev dependencies, source code)
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: Run (only what's needed)
FROM node:20-alpine AS runner
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
CMD ["node", "dist/server.js"]
```
**800MB image (single-stage):** includes dev dependencies, source code, build tools. Slow to push, slow to pull.

**200MB image (multi-stage):** only production code. Faster deploys, smaller attack surface.
ECR: Where Images Live
ECR (Elastic Container Registry) is AWS's Docker image storage. The flow:
```shell
docker build -t backend .
docker tag backend:latest 123456.dkr.ecr.../backend:abc123
docker push 123456.dkr.ecr.../backend:abc123
```

Ahmed uses Alpine Linux for smaller images (200MB vs 800MB), but hit several quirks:
- **No curl by default** — Alpine ships with `wget` instead of `curl`. Health check commands that use `curl` silently fail. Use `wget --spider http://localhost:3000/health` or install curl with `apk add --no-cache curl`.
- **IPv6 issues** — Alpine resolves `localhost` to `::1` (IPv6) by default. If your app only binds to IPv4, use `127.0.0.1` explicitly in all health checks and connection strings.
- **Next.js binding** — Next.js binds to `localhost` only by default. In a container, that means external connections are refused. Set `HOSTNAME=0.0.0.0` in your environment variables or it won't accept connections from the ECS health checker.
- **Native modules** — If your Node.js dependencies include native C++ addons (bcrypt, sharp, canvas), they need `build-base` and `python3` installed in Alpine. This increases build time. Consider pre-built alternatives: `bcryptjs` instead of `bcrypt`, `@img/sharp-linux-arm64`, etc.
- **Timezone** — Alpine doesn't include timezone data by default. If your app logs timestamps, add `apk add --no-cache tzdata` and set `TZ=UTC`.
Despite these quirks, Alpine is worth it: 200MB images deploy 4x faster than 800MB ones, saving ~30 seconds per deployment across all ECS tasks.
Database Layer
RDS: Managed PostgreSQL
Remember Ahmed's disk space disaster with the MVP? That's why he uses RDS now—AWS manages backups, patches, and disk space.
The Options
| Option | Managed? | Cost | Best For |
|---|---|---|---|
| Self-hosted PostgreSQL | No | Low | Experts with time to manage |
| RDS PostgreSQL | Yes | Medium | Most production apps |
| Aurora | Yes | High | High performance, auto-scaling storage |
| PlanetScale/Neon | Yes | Variable | Serverless, branching for dev |
High Availability
| Option | Failover Time | Cost | When to Use |
|---|---|---|---|
| Single-AZ | Manual (hours) | $ | Dev/staging environments |
| Multi-AZ | Automatic (~60s) | $$ | Production databases |
| Aurora Multi-Master | Zero | $$$ | Mission-critical, zero downtime |
Ahmed uses:
- Multi-AZ for production - Can't afford downtime for customer orders
- Single-AZ for dev - Acceptable risk, saves money
Connection pooling. 10 ECS tasks × 10 connections each = 100 connections. RDS has limits. Ahmed will need PgBouncer or RDS Proxy as he scales further.
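The connection math can be made explicit. A small sketch of the budget calculation — the 100-connection limit and the 10-connection admin reserve are illustrative assumptions, not RDS defaults:

```javascript
// Each ECS task opens its own pool, so total connections = tasks × pool size.
// That total must stay under the RDS connection limit, with headroom left
// for admin and monitoring sessions.

function maxPoolSizePerTask(maxTasks, rdsMaxConnections, reservedForAdmin = 10) {
  const budget = rdsMaxConnections - reservedForAdmin;
  return Math.floor(budget / maxTasks);
}

// Ahmed's backend autoscales to 10 tasks; assume RDS allows 100 connections.
// Each task's pool should then be capped at:
const poolMax = maxPoolSizePerTask(10, 100); // 9 connections per task
```

With PgBouncer or RDS Proxy in front, tasks share a pooled set of server connections instead, and this per-task cap stops being the bottleneck.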
Decision Check: Database
You need ACID transactions and your data is relational. What database?
PostgreSQL (RDS or Aurora) is the standard answer. Here's why:
- ACID compliance: Transactions, constraints, referential integrity built-in
- Relational data: JOINs, foreign keys, normalization
- RDS vs Aurora: RDS is cheaper for small workloads. Aurora for high-performance, auto-scaling storage.
- Multi-AZ: Automatic failover in ~60 seconds
When NOT PostgreSQL: If you need horizontal scaling for massive writes (DynamoDB), document storage (MongoDB), or time-series data (TimescaleDB, built on Postgres anyway).
Multi-Tenancy
One platform, many businesses
Ahmed has 30 businesses on his platform. Each thinks they have their own system. But they all share the same database. How does this work?
The Options
| Pattern | Isolation | Cost | Best For |
|---|---|---|---|
| Shared DB + tenant_id | Low | $ | Most SaaS startups |
| Schema per tenant | Medium | $$ | Compliance requirements |
| Database per tenant | High | $$$ | Enterprise, regulated industries |
Ahmed uses Shared DB + tenant_id:
```sql
CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  tenant_id UUID NOT NULL,  -- The isolation key
  customer_id INT,
  total DECIMAL,
  created_at TIMESTAMP
);

-- Every query filters by tenant
SELECT * FROM orders WHERE tenant_id = $1;
```
Forget the WHERE tenant_id = ... clause and one business sees another's data. This is a data leak.
Solution: Use PostgreSQL Row-Level Security (RLS) to enforce tenant isolation at the database level.
RLS makes the database enforce tenant isolation automatically. Even if application code forgets the WHERE tenant_id = ... clause, the database won't return other tenants' data.
-- Enable RLS on the orders table
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
-- Create a policy: users can only see rows where
-- tenant_id matches the current session variable
CREATE POLICY tenant_isolation ON orders
USING (tenant_id = current_setting('app.current_tenant')::uuid);
-- In your middleware (runs before every query):
-- SET app.current_tenant = 'tenant-abc';
-- Now this query automatically filters by tenant:
SELECT * FROM orders;
-- ↑ Only returns orders for tenant-abc, even without WHERE clause
How Ahmed uses it in Express middleware:
async function tenantMiddleware(req, res, next) {
const tenantId = req.user?.tenantId;
if (!tenantId) return res.status(401).json({ error: 'No tenant' });
// IMPORTANT: Validate tenantId format to prevent SQL injection!
if (!/^[a-f0-9-]{36}$/.test(tenantId)) {
return res.status(400).json({ error: 'Invalid tenant ID' });
}
// Set the tenant for this database connection.
// set_config() takes the value as a query parameter, unlike SET,
// so nothing is string-interpolated into the SQL.
await db.query(`SELECT set_config('app.current_tenant', $1, false)`, [tenantId]);
next();
}
// Now every database query in this request is automatically
// scoped to the authenticated user's tenant. No WHERE clause needed.
Always validate tenantId before using it in SET commands. If tenantId comes from user input (even indirectly through JWT claims), a malicious value like '; DROP TABLE orders; -- could execute arbitrary SQL. Use a strict regex pattern (UUID format) or parameterized queries where possible.
Why this matters: Without RLS, a single missed WHERE clause in any of your hundreds of queries could leak data. With RLS, the database is the safety net. Even a bug in application code can't cross tenant boundaries.
Decision Check: Multi-Tenancy
You have 5 enterprise customers who each want data isolation. What pattern?
Database per tenant is likely the right choice for this scenario. Here's why:
- Enterprise customers demand isolation: They often require it contractually for compliance (SOC2, HIPAA)
- 5 customers is manageable: The operational overhead is acceptable at this scale
- Easy data export/deletion: Customer leaves? Drop the database.
- Independent backups: Restore one customer without affecting others
When shared DB with RLS works: Many small customers (100+), cost-sensitive, customers don't require strict isolation. Schema per tenant: Middle ground, but adds migration complexity.
Caching & Queues
Redis for everything
Ahmed uses Redis (actually AWS ElastiCache running Valkey) for four different purposes: caching, sessions, pub/sub, and job queues.
Priority Queues
Not all jobs are equal. A payment webhook is more urgent than an analytics update:
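One way to sketch priority-ordered consumption: the worker always drains higher-priority queues first, so a payment webhook is picked before an analytics update. The queue names here are illustrative:

```javascript
// Priority queues sketch: higher-priority queues are always checked first.
const QUEUES_BY_PRIORITY = ['critical', 'default', 'low'];

function nextJob(queues) {
  // queues: { critical: [...], default: [...], low: [...] }
  for (const name of QUEUES_BY_PRIORITY) {
    const queue = queues[name];
    if (queue && queue.length > 0) return queue.shift();
  }
  return null; // nothing to do
}
```

With Redis lists, `BLPOP queue:critical queue:default queue:low 0` gives the same behavior, since BLPOP checks the given keys in order.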
Dead Letter Queue
What happens when a job fails after all retries?
Failed jobs move to a separate queue for manual review. They don't disappear, and they don't block other jobs.
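The retry-then-park flow might look like this sketch. The `maxAttempts` of 3 is an assumed default, and in production this logic usually lives in a job library rather than hand-rolled code:

```javascript
// Retry-then-dead-letter sketch: a job gets maxAttempts tries; after that
// it is parked on a dead-letter queue for manual review, rather than
// retried forever or silently dropped.

async function processWithDLQ(job, handler, deadLetterQueue, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await handler(job);
    } catch (err) {
      if (attempt === maxAttempts) {
        // Keep the job and its failure reason so a human can inspect it later.
        deadLetterQueue.push({ job, error: err.message, failedAt: Date.now() });
        return null; // job is parked; the worker moves on
      }
    }
  }
}
```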
Autoscaling
Add capacity when needed, remove when not
Ahmed's traffic varies. Quiet at night, busy during business hours. Why pay for 10 servers at 3 AM when 1 is enough?
Ahmed's Autoscaling Rules
| Service | Min | Max | Scale Out When | Scale In When |
|---|---|---|---|---|
| Backend | 1 | 10 | CPU > 50% for 60s | CPU < 30% for 300s |
| Frontend | 1 | 6 | CPU > 50% for 60s | CPU < 30% for 300s |
| Chatbot API | 1 | 4 | CPU > 70% for 60s | CPU < 40% for 300s |
Why Asymmetric Cooldowns?
React fast to load. Users are waiting.
Don't flap. Avoid add/remove/add/remove cycles.
You can scale on more than CPU. Ahmed scales his chatbot based on queue depth: if there are more than 1000 messages waiting, add workers.
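A sketch of that queue-depth policy. The 1,000-message threshold comes from the text; treating it as a per-worker target, and reusing the 1–4 bounds from the autoscaling table, are illustrative assumptions:

```javascript
// Queue-depth scaling sketch: size the worker fleet from the backlog
// instead of CPU.
function desiredWorkers(queueDepth, { messagesPerWorker = 1000, min = 1, max = 4 } = {}) {
  const needed = Math.ceil(queueDepth / messagesPerWorker);
  // Clamp to the service's autoscaling bounds.
  return Math.min(max, Math.max(min, needed));
}
```

On AWS this maps to a CloudWatch metric (queue depth) driving a target-tracking or step-scaling policy on the ECS service.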
Networking
How requests flow through the system
A request from a user's browser goes through multiple layers. Let's trace a real request: "User clicks 'Ask AI' to get product recommendations."
The LLM API call (500ms) is 60% of the total request time. This is why Ahmed uses streaming responses—show partial results while waiting—and caches frequent questions.
ALB Routing Rules
ALB (Application Load Balancer) is AWS's Layer 7 load balancer. It sits in front of your services and:
- Distributes traffic across multiple containers/servers
- Routes by path or hostname: /api/* goes to backend, /admin/* goes to admin service
- Terminates TLS: Handles HTTPS, your containers only see HTTP
- Health checks: Only sends traffic to healthy containers
- Sticky sessions: Can route a user to the same container (for WebSocket)
ALB vs NLB: ALB understands HTTP (paths, headers, cookies). NLB is Layer 4 (TCP/UDP only, faster, for non-HTTP traffic). For web apps, use ALB.
One load balancer routes to multiple services based on path:
```text
# Path-based routing
/admin/*      → chatbot-admin:8001
/socket.io/*  → backend:6000    # WebSocket
/api/*        → backend:6000

# Host-based routing
api.example.com → chatbot-api:8000

# Default
/* → frontend:5000
```
VPC: Your Private Network
VPC (Virtual Private Cloud) is your own isolated section of AWS's network. Think of it like having your own private building in a shared office complex:
- Isolation: Other AWS customers can't see or access your resources
- Subnets: You divide your VPC into smaller networks (public-facing vs internal-only)
- Security Groups: Firewall rules that control what traffic is allowed
- IP Addresses: You control the IP range (e.g., 10.0.0.0/16)
Why it matters: Without a VPC, your database would be directly on the internet. With a VPC, only your application servers can reach it.
Ahmed's infrastructure lives in a Virtual Private Cloud (VPC)—an isolated network in AWS:
- Public subnets: ALB lives here (accessible from internet)
- Private subnets: Databases, Redis (not accessible from internet)
- Security groups: Firewall rules per service
The database should never be accessible from the internet. Only your application servers can talk to it. This is enforced by security groups.
The Security Incident
3 AM. Everything is on fire.
Back to that phone call at 3:12 AM.
What Changed After
| Before | After |
|---|---|
| No VPC Flow Logs | Enabled - Can see all network traffic for forensics |
| Open egress (any outbound) | Restricted - Only HTTP/HTTPS/DNS allowed |
| Secrets in environment variables | Secrets Manager with automatic rotation |
| No incident runbook | Documented step-by-step response process |
- You don't know what you don't log. VPC Flow Logs should be on from day 1.
- Restrict egress. Why can your server talk to any IP on any port?
- Have a runbook before 3 AM happens. Half-asleep you needs clear instructions.
WAF & Rate Limiting
After the incident, Ahmed added two more protection layers:
| Protection Layer | Where | What It Blocks |
|---|---|---|
| WAF Rules | CloudFront / ALB | SQL injection, XSS, bad bots, known malicious IPs |
| Rate Limiting | WAF or ALB | Brute force, credential stuffing, API abuse |
| Geo Blocking | WAF | Traffic from countries you don't serve |
| Bot Control | WAF | Scrapers, automated attacks (costs extra) |
WAF costs ~$5/month + $0.60 per million requests. At MVP stage with 100 users, it's overkill. After an incident with 10,000 users? Essential.
Rule of thumb: Add WAF when you have something worth protecting.
Monitoring
Knowing what's happening before customers tell you
Ahmed's MVP had no monitoring. He found out about outages from angry customers. Now he has the three pillars:
AWS X-Ray: Distributed Tracing
When a request touches 5 services, which one is slow? AWS X-Ray traces the entire journey:
```text
Request ID: abc-123-def
│
├── Frontend (50ms)
│
├── Backend (350ms)
│   ├── Auth middleware (15ms)
│   ├── Database query (120ms)
│   └── Redis cache check (5ms)
│
└── Chatbot API (1200ms)   ← The slow one!
    ├── Weaviate search (400ms)
    └── LLM API call (800ms)
```
X-Ray answers: "The chatbot's LLM call takes 800ms. Can we cache frequent questions?"
Structured Logging
Ahmed's MVP logs were useless:
```javascript
// Impossible to search or aggregate
console.log("Order created")
console.log("User logged in")
console.log("Something went wrong")
```
Now every log is JSON with context:
logger.info({
event: "order.created",
orderId: "ord-789",
tenantId: "tenant-abc",
userId: "user-456",
amount: 299.99,
duration: 145,
correlationId: req.headers['x-correlation-id']
})
Correlation IDs: Connecting the Dots
A single user action might hit 5 services. How do you find all related logs?
1. The frontend generates a correlation ID: `corr-xyz-123`
2. Every downstream call carries the header `X-Correlation-ID: corr-xyz-123`
3. Search logs for `corr-xyz-123` → see the entire journey

Critical Alarms
Ahmed has CloudWatch alarms. They trigger SNS topics. But nobody subscribes to those topics. The alarms fire into the void. Nobody gets paged at 3 AM.
Fix: Connect SNS to Slack, PagerDuty, or email. Takes 30 minutes.
Deployment
How code gets to production
Ahmed's deployment is a shell script. Not ideal, but it works:
```shell
npm audit --audit-level=high
aws ecs update-service --force-new-deployment
```

What's Missing
Today (shell script):
- Manual trigger (run script)
- No automatic rollback
- No staging environment first
- No approval gates

With CI/CD:
- Push to main triggers deploy
- Auto rollback on failure
- Deploy to staging first
- Manual approval for prod
Ahmed's next priority: GitHub Actions CI/CD. Automate the deploy script, add staging environment, add rollback on health check failure.
Here's what a proper CI/CD pipeline looks like for Ahmed's ECS setup. Push to main triggers the full pipeline:
```yaml
name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - run: npm test
      - run: npm audit --audit-level=high

  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-west-1
      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push Docker image
        run: |
          docker build -t $ECR_REPO:${{ github.sha }} .
          docker push $ECR_REPO:${{ github.sha }}
      - name: Deploy to ECS
        run: |
          # Update task definition with new image
          aws ecs update-service \
            --cluster production \
            --service api \
            --force-new-deployment
      - name: Wait for stability
        run: |
          aws ecs wait services-stable \
            --cluster production \
            --services api
          # If this times out, ECS auto-rolls back
```
The key insight: ECS has built-in rollback. If the new containers fail health checks, ECS keeps the old containers running. You don't need custom rollback logic — just good health checks.
What this gives Ahmed over the shell script: automated on every push, no human needed, rollback if anything fails, full audit trail in GitHub Actions history.
Cost Breakdown
Where the money goes
Ahmed's monthly AWS bill: ~$500
Optimizations Applied
- Right-sized instances: t4g.small instead of m5.large (ARM is cheaper)
- Weekend shutdown: Script stops non-essential services Fri night, starts Mon morning
- Reserved instances: Considering for database (30-70% savings)
$500/month infrastructure against $1,500/month revenue means infrastructure consumes a third of revenue. Not great, but acceptable for a growing startup. As the customer base grows, cost per customer decreases.
Four more optimizations are on the table:
1. Graviton (ARM) instances everywhere. Ahmed already uses t4g.small for compute. But his RDS is still x86. Switching to db.t4g.small saves ~20% on the biggest cost item ($225 → ~$180).
2. Reserved Instances for RDS. Ahmed's databases run 24/7. A 1-year reserved instance commitment saves 30-40%. That's $225 → ~$140. The catch: you're committed for a year.
3. Fargate Spot for the AI worker. The AI processing service handles async tasks from a queue. If a Spot instance gets reclaimed, the task goes back to the queue. No data loss. Savings: 50-70% on compute for that service.
4. Aurora Serverless v2 instead of RDS. If traffic is bursty (high during business hours, near-zero at night), Aurora Serverless scales to near-zero and back. For Ahmed's pattern, this could be cheaper than always-on RDS.
Projected savings: With all optimizations, Ahmed could get from $500 to ~$300/month — dropping infrastructure cost to 20% of revenue.
What To Do Monday
Based on where you are in your journey, here's your next concrete step:
If you're at MVP stage...
Focus on getting users first. Stay on the $20 VPS. Add daily automated backups (takes 30 minutes to set up). Don't add complexity until pain forces you. Ship your product, not your infrastructure.
If you're preparing for production...
Prioritize observability. Set up structured logging, basic CloudWatch alarms (service down, high error rate), and connect alerts to Slack/email. You can't fix what you can't see. This is Day 1 infrastructure.
If you're scaling...
Consider connection pooling and caching. Database connections become a bottleneck around 50+ concurrent users. Add RDS Proxy or PgBouncer. Cache expensive queries in Redis. Review your autoscaling rules.
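Cache-aside is the usual starting point for that query caching. A sketch with the Redis and database clients injected as stand-ins — the key name and the 60-second TTL are assumptions:

```javascript
// Cache-aside sketch: check the cache first, fall back to the database,
// then store the result with a TTL so stale data ages out on its own.
// `cache` needs get(key) and set(key, value, ttlSeconds);
// `db` needs query(sql).

async function getCached(cache, db, key, sql, ttlSeconds = 60) {
  const hit = await cache.get(key);
  if (hit !== null && hit !== undefined) return JSON.parse(hit);
  const rows = await db.query(sql);
  await cache.set(key, JSON.stringify(rows), ttlSeconds);
  return rows;
}
```

With a real Redis client the `set` call would pass the TTL as an expiry option; the shape above is kept generic on purpose.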
Practice Mode: Test Your Understanding
Before you move on, make sure you can answer these scenarios:
Scenario 1: Your ECS task keeps stopping and restarting. Check for:
- Memory limits: Container OOM killed? Check CloudWatch for memory usage hitting 100%.
- Exit codes: Look at stopped task details. Exit code 137 = OOM. Exit code 1 = app crash.
- Startup time: Is the health check timing out before the app starts? Increase health check grace period.
- Dependencies: Is the app crashing because it can't reach the database or Redis?
Quick command: aws ecs describe-tasks --cluster prod --tasks $(aws ecs list-tasks --cluster prod --desired-status STOPPED --query 'taskArns[0]' --output text)
Scenario 2: Latency is high but CPU is low. Low CPU + high latency = waiting on something external:
- Database: Check RDS CloudWatch for high latency, connection count near limit, or replica lag.
- Redis: Check ElastiCache metrics for evictions or high memory usage.
- External APIs: Is your LLM provider slow? Add tracing to identify the slow call.
- Connection starvation: Pool exhausted? All threads waiting for a connection?
The pattern: High CPU = your code is slow. Low CPU + high latency = you're waiting on I/O.
Scenario 3: You shipped a bad deploy to production. ECS rollback options (fastest to slowest):
- Do nothing: If health checks are configured correctly, ECS won't drain old containers until new ones are healthy. Bad deploys auto-stop.
- Update task definition: Point back to the previous image tag and update service.
- ECS service auto-rollback: Enable "circuit breaker" with rollback. ECS automatically reverts if deployment fails.
Prevention: Tag images with git SHA, not just :latest. You need to know exactly which version to roll back to.
Where To Go Deep
Ahmed's journey touched every reliability pattern. Here's where to learn each one in depth:
Key Takeaways
Continue Learning
والله أعلم
وصلى الله وسلم على نبينا محمد وعلى آله وصحبه أجمعين