In the name of Allah, the Most Gracious, the Most Merciful
Quick Summary
- Multi-step operations (checkout, onboarding, data pipelines) break in ways that single requests don't
- The saga pattern lets you undo distributed operations when steps fail (compensation)
- Idempotency makes retries safe — without it, retrying can double-charge customers
- Real tool comparison: Step Functions vs Temporal vs Airflow vs task queues
This is for you if...
- You've had orders stuck in limbo (charged but never shipped)
- You've struggled to debug "where did this request fail?"
- You're building anything with multiple steps (checkout, pipelines, workflows)
- You want to understand when to use tools like Temporal, Step Functions, or Airflow
What We'll Cover
- 1 The Checkout That Never Completed
- 2 The Simple Function Trap
- 3 In-Process vs Distributed
- 4 Orchestration vs Choreography
- 5 The Saga Pattern
- 6 Idempotency & Exactly-Once
- 7 State, Coupling & Data Flow
- 8 Failure Taxonomy
- 9 Monitoring Workflows
- 10 Anti-Patterns
- 11 Testing Workflow Patterns
- 12 Choosing Your Tools
- 13 When To Use What
- 14 Where To Go Deep
The Checkout That Never Completed
A startup processed orders with a simple function: charge card, reserve inventory, send email, update analytics.
One Friday, Stripe had a 30-second timeout. The card charged, but the function crashed before reserving inventory. Customer got charged. No product. No email. Analytics said "0 orders."
The fix? They added a try/catch with retry. But retries meant double-charging customers. They lost $12,000 in refunds and 3 angry customers went to Twitter.
The real problem wasn't the timeout. It was that nobody knew where the order was when things failed.
Was the card charged? Was inventory reserved? Was the email sent? The state was in memory. When the process died, the state died with it.
This post is about fixing that problem. Not with more try/catch blocks, but with architecture patterns that make multi-step operations reliable.
Before:

def checkout(order):
    charge_card(order)
    reserve_inventory(order)
    send_email(order)
    # Crash here = charged,
    # reserved, no email, no
    # record of what happened

State in memory. Crash = data loss. No undo. No visibility.

After:

def checkout(order):
    saga = CheckoutSaga(order)
    saga.execute()
    # Each step: persisted,
    # idempotent, compensatable.
    # Crash = resume from last
    # recorded state

Durable state. Auto-compensation. Full audit trail.
The Simple Function Trap
Every complex system starts simple. A function that does everything:
function checkout(order) {
  chargeCard(order)       // 500ms - talks to Stripe
  reserveInventory(order) // 200ms - talks to database
  sendEmail(order)        // 300ms - talks to email service
  updateAnalytics(order)  // 100ms - talks to analytics
}
It works. Ship it. Then the questions start:
- chargeCard() takes 30 seconds? User stares at spinner. HTTP times out. Did it charge?
- sendEmail() fails? Do we fail the whole checkout? Customer already charged.
- The server crashes after chargeCard()? Money taken, nothing else happened.
- We need to debug a stuck order? No visibility into what step failed.
All state lives in memory. The variables, the progress, the "where are we" - all gone when the process dies. You have no idea what happened.
The Try/Catch Trap
The first instinct is to add error handling. But watch what happens:
function checkout(order) {
  try {
    chargeCard(order)        // Succeeds — $149 charged
    reserveInventory(order)  // Fails! OutOfStockError
    sendEmail(order)
  } catch (error) {
    // Now what? Card is already charged.
    // Do we refund? What if the refund fails?
    // What if chargeCard() succeeded but we don't know
    // because the response timed out?
    refundCard(order)  // What if THIS fails too?
  }
}
Every fix creates new questions:
| "Fix" | New Problem |
|---|---|
| Add try/catch | Catch block doesn't know which steps completed |
| Add refund in catch | Refund can also fail — now you need error handling for error handling |
| Add retry | Retry without idempotency → double charge |
| Add status flags | Flags in memory → lost on crash |
| Add logging | Logs tell you what happened, but can't resume the workflow |
The simple function approach is fundamentally broken for multi-step operations that cross service boundaries. No amount of try/catch can fix the fact that each external call (Stripe, database, email) is a separate system with its own failure modes. You need architecture, not more error handling.
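To see the retry trap concretely, here's a toy sketch (the payment service and all names are hypothetical): the first charge succeeds but its response is lost, so a naive retry charges the customer a second time.

```python
# Hypothetical in-memory payment service to illustrate the retry trap.
class FakePaymentService:
    def __init__(self):
        self.charges = []  # every successful charge lands here
        self.calls = 0

    def charge(self, order_id, amount):
        self.calls += 1
        self.charges.append((order_id, amount))
        if self.calls == 1:
            # The charge SUCCEEDED, but the response is lost in transit.
            raise TimeoutError("response never arrived")
        return "ok"

def checkout_with_naive_retry(payments, order_id, amount, attempts=3):
    for _ in range(attempts):
        try:
            return payments.charge(order_id, amount)
        except TimeoutError:
            continue  # "just retry" -- but the charge already happened

payments = FakePaymentService()
checkout_with_naive_retry(payments, "order-123", 149.00)
print(len(payments.charges))  # prints 2: the customer paid twice
```

Without an idempotency key, the retry has no way to know the first attempt succeeded; that fix comes later in the post.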
In-Process vs Distributed: The Fundamental Choice
There are fundamentally two ways to run multi-step operations. Everything else is details.
In-Process: One chef does everything. Preps, cooks, plates. Fast and simple. But if they get sick mid-meal, the whole order is lost. Nobody else knows where they were.
Distributed: A head chef coordinates. One cook preps, one grills, one plates. The head chef tracks "prep done, grill in progress." If the grill cook gets sick, you replace them. The order continues from where it stopped.
In-Process: All steps in one program
- State lives in memory
- Fast (no network between steps)
- Simple to build and debug
- Dies together - crash loses everything
- Can't scale steps independently
Good for: Quick tasks, prototypes, operations under 30 seconds
Distributed: Steps run on separate workers
- State persisted externally
- Slower (network hops)
- More complex to build
- Workers independent - one crash doesn't kill all
- Scale bottleneck steps separately
Good for: Long operations, critical flows, scale, reliability
In-process, data flows by direct calls: step2(step1Result). Simple, direct, fragile.

Orchestration vs Choreography
When you go distributed, you face another choice: who controls the flow?
| Aspect | Orchestration | Choreography |
|---|---|---|
| Control | Central coordinator knows everything | No central control, services autonomous |
| Visibility | Easy to trace "where is my order?" | Hard - events scattered across services |
| Coupling | Services depend on coordinator | Services only know events, not each other |
| Failure | Coordinator is single point of failure | No single point, but cascade failures possible |
| Best For | Business-critical flows, payments, orders | Loose integrations, notifications, analytics |
Orchestration: Easier to understand and debug, but coordinator becomes critical. Choreography: More resilient, but "what happened to order #123?" becomes a detective game across 5 services.
Most payment flows use orchestration — when money is involved, you need to know exactly where every transaction is. Large-scale systems often use choreography for loose integrations (notifications, analytics) while keeping orchestration for critical flows.
Choose choreography when:
- Multi-team organization — teams own their services and don't want central coordination
- Loose integrations — notifications, analytics, audit logs
- Strong event infrastructure — mature Kafka/SNS setup with good observability
- High autonomy required — services need to evolve independently

Choose orchestration when:
- Single team — you own the whole flow, no coordination overhead needed
- Need to trace "where is order #123?" — choreography makes this a detective game
- Money is involved — payments, refunds, financial reconciliation
- Need to debug flows — "which service failed?" is hard without a coordinator
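At its core, an orchestrator is just a loop that calls each step and durably records progress. A minimal sketch (all names hypothetical; real systems persist the log to a database):

```python
# Minimal orchestrator: the coordinator owns the flow and the state.
def run_order(order_id, steps, state_log):
    """steps: list of (name, fn) pairs. state_log records every completed step."""
    for name, fn in steps:
        fn(order_id)
        # In production this append is a database write, so a crash
        # never loses the answer to "where is order X?"
        state_log.append((order_id, name))
    return state_log

log = []
run_order("order-123",
          [("charge", lambda oid: None),
           ("reserve", lambda oid: None),
           ("email", lambda oid: None)],
          log)
print(log)  # [('order-123', 'charge'), ('order-123', 'reserve'), ('order-123', 'email')]
```

The coordinator can answer "where is order-123?" by reading the log, which is exactly what choreography makes hard.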
AWS Step Functions lets you define workflows as state machines in JSON. Here's the checkout flow as a Step Function:
{
  "Comment": "Checkout workflow with compensation",
  "StartAt": "ChargeCard",
  "States": {
    "ChargeCard": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:charge-card",
      "TimeoutSeconds": 30,
      "Retry": [{
        "ErrorEquals": ["TransientError"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0
      }],
      "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "Next": "NotifyFailure"
      }],
      "Next": "ReserveInventory"
    },
    "ReserveInventory": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:reserve-inventory",
      "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "Next": "RefundCard"
      }],
      "Next": "SendConfirmation"
    },
    "RefundCard": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:refund-card",
      "Next": "NotifyFailure"
    },
    "SendConfirmation": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:send-email",
      "End": true
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:notify-failure",
      "End": true
    }
  }
}
Notice the Catch block on ReserveInventory — if inventory fails, the workflow automatically jumps to RefundCard. This is compensation, and it's the foundation of the Saga pattern (next section).
Temporal takes a different approach — you write workflows as regular code. The framework handles persistence, retries, and crash recovery automatically:
from temporalio import workflow, activity
from temporalio.common import RetryPolicy
from datetime import timedelta

@activity.defn
async def charge_card(order_id: str, amount: float) -> str:
    # This is a regular function. Temporal handles retries.
    result = stripe.PaymentIntent.create(
        amount=int(amount * 100),
        currency="usd",
        idempotency_key=f"charge-{order_id}"
    )
    return result.id

@activity.defn
async def reserve_inventory(order_id: str, items: list) -> bool:
    return inventory_service.reserve(order_id, items)

@activity.defn
async def refund_card(payment_id: str) -> bool:
    stripe.Refund.create(payment_intent=payment_id)
    return True

@activity.defn
async def send_confirmation(order_id: str) -> None:
    email_service.send_order_confirmation(order_id)

@workflow.defn
class CheckoutWorkflow:
    @workflow.run
    async def run(self, order: dict) -> dict:
        # Step 1: Charge card
        payment_id = await workflow.execute_activity(
            charge_card,
            args=[order["id"], order["total"]],
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3)
        )
        # Step 2: Reserve inventory (with compensation on failure)
        try:
            await workflow.execute_activity(
                reserve_inventory,
                args=[order["id"], order["items"]],
                start_to_close_timeout=timedelta(seconds=10)
            )
        except Exception:
            # Compensation: refund the card
            await workflow.execute_activity(
                refund_card, args=[payment_id],
                start_to_close_timeout=timedelta(seconds=30)
            )
            raise
        # Step 3: Send email (fire and forget)
        await workflow.execute_activity(
            send_confirmation, args=[order["id"]],
            start_to_close_timeout=timedelta(seconds=10)
        )
        return {"status": "completed", "payment_id": payment_id}
The key insight: this looks like normal code. But if the server crashes between Step 1 and Step 2, Temporal replays the workflow from its recorded event history; completed activities are not re-executed, their stored results are simply returned. The idempotency key on charge_card adds a second safety net in case the activity itself is retried. Either way, the workflow resumes exactly where it left off without double-charging.
In choreography, there's no central coordinator. Each service publishes events, and other services react:
# Payment service: charges card, publishes event
def handle_order_created(event):
    order = event["detail"]
    result = stripe.charge(order["amount"])
    # Publish event - inventory service is listening
    sns.publish(
        TopicArn="arn:aws:sns:...:order-events",
        Message=json.dumps({
            "type": "PaymentCompleted",
            "orderId": order["id"],
            "paymentId": result.id
        })
    )

# Inventory service: listens for PaymentCompleted
def handle_payment_completed(event):
    order_id = event["orderId"]
    reserved = inventory.reserve(order_id)
    if reserved:
        sns.publish(Message={"type": "InventoryReserved", ...})
    else:
        # Publish compensation event
        sns.publish(Message={"type": "InventoryFailed", ...})

# Payment service ALSO listens for InventoryFailed
def handle_inventory_failed(event):
    stripe.refund(event["paymentId"])  # Compensation
Notice the problem: the flow is invisible. To understand what happens when you create an order, you need to trace events across 3+ services. If InventoryFailed gets published but the payment service's listener is down, the refund never happens. This is why choreography requires excellent monitoring and dead letter queues.
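The lost-event failure mode can be reproduced with a toy in-memory event bus (all names hypothetical): if no one is subscribed when an event fires, it vanishes silently.

```python
# Toy event bus: publish delivers only to whoever is subscribed right now.
class EventBus:
    def __init__(self):
        self.handlers = {}  # event type -> list of callbacks
        self.dropped = []   # events that nobody was listening for

    def subscribe(self, event_type, handler):
        self.handlers.setdefault(event_type, []).append(handler)

    def publish(self, event_type, payload):
        subs = self.handlers.get(event_type, [])
        if not subs:
            self.dropped.append((event_type, payload))  # silently lost
        for h in subs:
            h(payload)

bus = EventBus()
refunds = []  # would be appended by the refund handler, if it were subscribed

# The payment service's InventoryFailed listener is down, so nothing
# is subscribed when the compensation event fires:
bus.publish("InventoryFailed", {"paymentId": "pay-1"})
print(bus.dropped)  # [('InventoryFailed', {'paymentId': 'pay-1'})]
```

No exception, no log line, no refund. This is why real event buses need dead letter queues and delivery acknowledgments rather than fire-and-forget publish.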
The Saga Pattern: Undoing What Went Wrong
Here's the fundamental problem with distributed transactions: you can't just ROLLBACK. When you charge a card through Stripe, there's no database transaction wrapping the whole flow. If inventory reservation fails, the charge already happened. The money already moved.
You're planning a trip. You book the flight, then the hotel, then the rental car. The rental car company says "no cars available." You can't undo the flight by pressing Ctrl+Z — you need to call the airline and actively cancel the booking. Then call the hotel and cancel that too. Each undo is a separate action.
That's a saga. Every forward step has a corresponding compensation step. If step 3 fails, you run compensation for step 2, then step 1, in reverse order.
Forward Flow vs Compensation Flow
Happy Path (Forward)
Failure at Step 3 (Compensation)
class CheckoutSaga:
    def __init__(self, order):
        self.order = order
        self.completed_steps = []  # Track what we've done

    def execute(self):
        steps = [
            ("charge", self.charge_card, self.refund_card),
            ("reserve", self.reserve_inventory, self.release_inventory),
            ("shipment", self.create_shipment, self.cancel_shipment),
            ("email", self.send_email, None),  # No compensation needed
        ]
        for name, forward, compensate in steps:
            try:
                result = forward()
                self.completed_steps.append((name, compensate, result))
            except Exception as e:
                logger.error(f"Step '{name}' failed: {e}. Compensating...")
                self.compensate()
                raise SagaFailed(f"Failed at {name}", completed=self.completed_steps)

    def compensate(self):
        # Undo in reverse order
        for name, undo_fn, result in reversed(self.completed_steps):
            if undo_fn:
                try:
                    undo_fn(result)
                    logger.info(f"Compensated '{name}'")
                except Exception as e:
                    # Compensation failure! This is the worst case.
                    logger.critical(f"COMPENSATION FAILED for '{name}': {e}")
                    alert_ops_team(name, e)  # Human intervention needed
An e-commerce company ran a flash sale on Black Friday. Their checkout saga: charge card → reserve inventory → generate shipping label → send confirmation. The saga was in-memory — no durable state.
11:42 AM: The shipping API starts returning 503s. The saga triggers compensation: release inventory, then refund card. But Stripe was also under load from the flash sale. Refund requests were timing out at 15 seconds.
11:43 AM: The compensation function retries the refund. Meanwhile, the server runs out of memory from all the in-flight compensations (each holding onto the full order context) and crashes.
11:44 AM: Server restarts. All in-flight compensations are gone. No record of which orders need refunds. The saga state was in memory. It died with the process.
The aftermath: 312 customers charged between $49 and $299, no products shipped, no refunds issued. Total: $47,000 in charges that needed manual reconciliation. It took 2 engineers 3 days to cross-reference Stripe charges with inventory records and issue refunds.
The fix: They moved to a durable saga with PostgreSQL state tracking. Every compensation is now recorded in the database. If the server crashes, a recovery cron picks up unfinished compensations within 5 minutes. They also added a separate compensation worker pool with its own memory budget — compensations can't starve the main checkout process.
The lesson: In-memory sagas are fine for prototypes. For anything involving money, your compensation state must survive a crash. The database write adds 10ms per step. That 10ms would have saved $47,000 and 6 engineer-days.
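The recovery-cron idea from the fix can be sketched with stdlib sqlite3 (the table and column names here are hypothetical): record the compensation intent in the database before running it, and any row still 'pending' after a crash gets picked up on the next sweep.

```python
import sqlite3

# Durable compensation log: survives a process crash because it lives in the DB.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE compensations (
    order_id TEXT PRIMARY KEY,
    step     TEXT NOT NULL,
    status   TEXT NOT NULL DEFAULT 'pending')""")

# Before running a refund, record the intent...
db.execute("INSERT INTO compensations VALUES ('order-1', 'refund', 'pending')")
db.commit()
# ...the process crashes here. On restart, the recovery sweep finds it:

def recover_pending(db):
    rows = db.execute(
        "SELECT order_id, step FROM compensations WHERE status = 'pending'"
    ).fetchall()
    for order_id, step in rows:
        # In real life: re-run the compensation here (idempotently),
        # then mark it done so the next sweep skips it.
        db.execute("UPDATE compensations SET status = 'done' WHERE order_id = ?",
                   (order_id,))
    db.commit()
    return rows

print(recover_pending(db))  # [('order-1', 'refund')]
```

The write-intent-first ordering is the key: the worst case is running a compensation twice (safe, if it's idempotent), never losing it entirely.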
Orchestrated vs Choreographed Sagas
Sagas can be implemented with either coordination style:
Orchestrated Saga: A central coordinator runs the saga. It calls each step, tracks what completed, and runs compensation in reverse on failure. Easier to reason about. The examples above (Step Functions, Temporal) are orchestrated sagas.
Choreographed Saga: Each service publishes events. On failure, the failing service publishes a "compensate" event, and upstream services listen and undo their work. No central coordinator — but you need to guarantee that every compensation event gets processed.
| Aspect | Orchestrated Saga | Choreographed Saga |
|---|---|---|
| Compensation order | Guaranteed reverse order | Depends on event delivery order |
| Debugging | One place to look | Trace events across services |
| Failure visibility | Coordinator knows full state | Need distributed tracing |
| Best for | Payment flows, order processing | Multi-team, loosely coupled systems |
Recommendation: Start with orchestrated sagas. Move to choreographed only when your organization has multiple teams that need to own their services independently and you have strong event infrastructure (Kafka, EventBridge) with dead letter queues.
Idempotency: The Pattern That Makes Everything Else Work
Retries are useless without idempotency. Sagas are dangerous without idempotency. Every pattern in this post assumes your operations are idempotent. An idempotent operation produces the same result whether you call it once or ten times.
Not idempotent (light switch): Press once = on. Press again = off. Press again = on. Each press changes the state. If you're not sure whether you pressed it, pressing again might make things worse.
Idempotent (elevator button): Press "Floor 3." Press it again. Press it 10 times. You still go to Floor 3. Multiple presses have the same effect as one. This is what your payment API needs to be.
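The two behaviors side by side, as a toy example:

```python
# Light switch: every call flips the state. NOT idempotent.
def toggle(state):
    return not state

# Elevator button: pressing again changes nothing. Idempotent.
def request_floor(requested, floor):
    requested.add(floor)  # adding an element already present is a no-op
    return requested

# Two presses of the switch leave you somewhere different than one press:
assert toggle(toggle(False)) != toggle(False)

# Any number of presses of the button gives the same result:
floors = request_floor(request_floor(set(), 3), 3)
assert floors == {3}
```

Formally: an operation f is idempotent when f(f(x)) = f(x). Your payment API needs to be the elevator button.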
Three Levels of Idempotency
-- Level 1: Natural Idempotency (cheapest)
-- Some operations are naturally idempotent.
-- "Set user email to X" is the same whether you do it once or 10 times.
UPDATE users SET email = 'new@example.com' WHERE id = 42;

-- Level 2: Database-Level Idempotency
-- Use INSERT ON CONFLICT to prevent duplicates.
INSERT INTO orders (order_id, amount, status)
VALUES ('order-123', 99.99, 'pending')
ON CONFLICT (order_id) DO NOTHING; -- Second insert is a no-op

# Level 3: Application-Level Idempotency (most flexible)
# Track processed requests with idempotency keys.
def charge_card(order_id, amount, idempotency_key):
    # Check if already processed
    existing = db.get(f"charge:{idempotency_key}")
    if existing:
        return existing  # Same result, no double charge
    # Process and store atomically
    result = stripe.charge(amount, idempotency_key=idempotency_key)
    db.set(f"charge:{idempotency_key}", result, ttl=86400)  # Keep for 24h
    return result
Stripe implements application-level idempotency natively: send an Idempotency-Key header with your request. If Stripe sees the same key within 24 hours, it returns the original response without processing again. Your keys should be deterministic — use charge-{order_id}, not a random UUID.
What Can Go Wrong With Idempotency
- Wrong key scope: using user_id as the idempotency key means a user's two different orders share one key, so the second order is silently ignored. Derive keys from the operation and the entity: operation-{entity_id}; for charges, charge-{order_id}.
- Check-then-set races: two concurrent requests can both see "no existing key" and both execute. Make the check-and-set atomic: a database unique constraint, or Redis SET NX (set-if-not-exists).
Here's a battle-tested pattern using PostgreSQL advisory locks for atomic idempotency:
import hashlib
import json

def idempotent(operation_name):
    """Decorator that makes any operation idempotent."""
    def decorator(func):
        def wrapper(*, idempotency_key, **kwargs):
            full_key = f"{operation_name}:{idempotency_key}"
            # Try to get existing result
            existing = db.execute(
                "SELECT result FROM idempotency_keys WHERE key = %s",
                [full_key]
            )
            if existing:
                return json.loads(existing["result"])
            # Acquire advisory lock (prevents race conditions)
            lock_id = int(hashlib.md5(full_key.encode()).hexdigest()[:8], 16)
            db.execute("SELECT pg_advisory_lock(%s)", [lock_id])
            try:
                # Double-check after acquiring lock
                existing = db.execute(
                    "SELECT result FROM idempotency_keys WHERE key = %s",
                    [full_key]
                )
                if existing:
                    return json.loads(existing["result"])
                # Execute the actual operation
                result = func(**kwargs)
                # Store result
                db.execute(
                    """INSERT INTO idempotency_keys (key, result, created_at)
                       VALUES (%s, %s, NOW())""",
                    [full_key, json.dumps(result)]
                )
                return result
            finally:
                db.execute("SELECT pg_advisory_unlock(%s)", [lock_id])
        return wrapper
    return decorator

# Usage:
@idempotent("charge")
def charge_card(order_id, amount):
    return stripe.PaymentIntent.create(
        amount=int(amount * 100), currency="usd",
        idempotency_key=f"stripe-charge-{order_id}"
    )

# Both calls return the same result, charge happens once:
charge_card(idempotency_key="order-123", order_id="123", amount=99.99)
charge_card(idempotency_key="order-123", order_id="123", amount=99.99)  # Returns cached
The advisory lock prevents the race condition where two concurrent requests both see "no existing result" and both execute. This is the double-checked locking pattern applied to idempotency.
The Myth of Exactly-Once Delivery
You'll hear people talk about "exactly-once delivery" in distributed systems. The uncomfortable truth: exactly-once delivery is impossible in a distributed system. Here's why.
Imagine two armies on opposite hills need to coordinate an attack. They send messengers through enemy territory. Army A sends "Attack at dawn." Did Army B receive it? A doesn't know. B sends back "Confirmed." Did A receive the confirmation? B doesn't know. No matter how many confirmations you send, you can never be certain the last message arrived.
This is your service calling Stripe. You send a charge request. Stripe processes it. Stripe sends back "success." But your server crashes before reading the response. Did the charge happen? Yes. Does your system know? No.
Every messaging system gives you one of two guarantees:
At-most-once: Fire and forget. Send the message once. If it's lost, it's lost.
- Fast, simple
- Message might never arrive
- No duplicates
Use for: Metrics, analytics, logging — losing one data point is fine
At-least-once: Keep retrying until acknowledged. The message definitely arrives — maybe more than once.
- Reliable, guaranteed delivery
- Message might arrive 2x, 3x, 10x
- Requires idempotent consumers
Use for: Payments, orders, anything where losing a message is unacceptable
Exactly-once delivery is impossible. Exactly-once processing is achievable. Combine at-least-once delivery (guaranteed arrival) with idempotent consumers (safe to process twice). The message might arrive 3 times, but the operation only happens once. This is what Kafka calls "exactly-once semantics" — it's really at-least-once delivery + idempotent processing under the hood.
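That combination, exactly-once processing built from at-least-once delivery plus an idempotent consumer, fits in a few lines (a toy sketch; in production the processed-ID set is a database table or Redis set, not in-memory):

```python
# At-least-once delivery: the broker may hand over the same message repeatedly.
deliveries = [
    {"id": "msg-1", "order": "order-1"},
    {"id": "msg-1", "order": "order-1"},  # duplicate delivery
    {"id": "msg-2", "order": "order-2"},
]

processed_ids = set()  # durable in real life (DB table, Redis set)
side_effects = []      # stands in for "charge the card"

def handle(message):
    if message["id"] in processed_ids:
        return  # duplicate: already processed, skip silently
    processed_ids.add(message["id"])
    side_effects.append(message["order"])  # the real work happens once

for m in deliveries:
    handle(m)

print(side_effects)  # ['order-1', 'order-2']
```

Three deliveries, two unique messages, two side effects. The delivery layer stays unreliable; the processing layer absorbs the duplicates.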
Kafka's exactly-once semantics (EOS) was introduced in Kafka 0.11 and works through three mechanisms:
1. Idempotent Producer: Each producer gets a unique ID. Each message gets a sequence number. If the broker receives a duplicate (same producer ID + sequence), it silently deduplicates it. This prevents network retries from creating duplicate messages in the topic.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['kafka:9092'],
    enable_idempotence=True,  # Prevents duplicate messages
    acks='all',               # Wait for all replicas
    retries=3,                # Retry on failure
    max_in_flight_requests_per_connection=5
)

# Even if this send is retried due to network issues,
# the broker will only store the message once
producer.send('orders', key=order_id, value=order_data)
2. Transactional Producer: Groups multiple writes (to different topics/partitions) into an atomic transaction. Either all writes succeed or none do. This is how you implement consume-transform-produce pipelines without duplicates.
3. Consumer Offset Management: The consumer's offset commit is part of the same transaction as the output write. If the consumer crashes after processing but before committing, it reprocesses the message — but the idempotent producer ensures the output isn't duplicated.
The outbox pattern solves a common problem: you need to update your database AND publish a message, but these are two different systems. If the database write succeeds but the message publish fails, you have inconsistency.
The solution: write the message to an "outbox" table in the same database transaction as your business logic. A separate process reads the outbox and publishes to the message broker.
import json
import time
from uuid import uuid4

def place_order(order):
    with db.transaction() as tx:
        # Business logic and outbox message in ONE transaction
        tx.execute(
            "INSERT INTO orders (id, status, amount) VALUES (%s, %s, %s)",
            [order.id, 'pending', order.amount]
        )
        tx.execute(
            """INSERT INTO outbox (id, topic, key, payload, created_at)
               VALUES (%s, %s, %s, %s, NOW())""",
            [uuid4(), 'order-events', order.id,
             json.dumps({'type': 'OrderCreated', 'orderId': order.id})]
        )
    # Both succeed or both fail - no inconsistency

# Separate worker: polls outbox and publishes
def outbox_relay():
    while True:
        messages = db.execute(
            "SELECT * FROM outbox WHERE published = FALSE ORDER BY created_at LIMIT 100"
        )
        for msg in messages:
            kafka.send(msg.topic, key=msg.key, value=msg.payload)
            db.execute("UPDATE outbox SET published = TRUE WHERE id = %s", [msg.id])
        time.sleep(1)  # Poll every second
Why this works: The database transaction guarantees that if the order is created, the outbox message also exists. Even if the relay crashes, it'll pick up unpublished messages on restart. Even if it publishes twice, the consumer uses idempotency keys. End-to-end, the order is processed exactly once.
State, Coupling & Data Flow
The biggest difference between "works on my laptop" and "works in production at scale" is how you handle state. The key question: "If your server crashes mid-operation, can you resume?" If yes, you have durable state. If no, you have ephemeral state. For anything involving money, user data, or business-critical flows — you need durable.
Durable state adds latency per step (typically 10-100ms for a database write, depending on setup). But that small overhead saves you hours of debugging "what state was this order in?" after a crash.
Durable State in Practice: The State Machine Pattern
The most reliable way to track workflow progress is a state machine — a database table that records every transition, not just the current state.
-- The workflow state table
CREATE TABLE workflow_executions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    workflow_type VARCHAR(50) NOT NULL,    -- 'checkout', 'onboarding', etc.
    entity_id VARCHAR(100) NOT NULL,       -- 'order-123'
    current_state VARCHAR(50) NOT NULL DEFAULT 'created',
    context JSONB NOT NULL DEFAULT '{}',   -- Accumulated data from each step
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (workflow_type, entity_id)
);

-- Every state transition is recorded (audit trail)
CREATE TABLE workflow_transitions (
    id BIGSERIAL PRIMARY KEY,
    workflow_id UUID REFERENCES workflow_executions(id),
    from_state VARCHAR(50) NOT NULL,
    to_state VARCHAR(50) NOT NULL,
    triggered_by VARCHAR(100),   -- 'charge_card', 'reserve_inventory'
    metadata JSONB DEFAULT '{}',
    transitioned_at TIMESTAMPTZ DEFAULT NOW()
);

-- Find stuck workflows (the recovery query)
SELECT * FROM workflow_executions
WHERE current_state NOT IN ('completed', 'failed', 'compensated')
  AND updated_at < NOW() - INTERVAL '5 minutes';
-- ^ Any workflow stuck for >5 min needs investigation
class WorkflowEngine:
    """Simple durable workflow engine using PostgreSQL."""

    TRANSITIONS = {
        'checkout': {
            'created': {'next': 'charging', 'action': 'charge_card'},
            'charging': {'next': 'reserving', 'action': 'reserve_inventory'},
            'reserving': {'next': 'shipping', 'action': 'create_shipment'},
            'shipping': {'next': 'confirming', 'action': 'send_confirmation'},
            'confirming': {'next': 'completed', 'action': None},
        }
    }

    def advance(self, workflow_id):
        """Move workflow to the next state."""
        wf = db.execute(
            "SELECT * FROM workflow_executions WHERE id = %s FOR UPDATE",
            [workflow_id]
        )  # FOR UPDATE = row-level lock
        transition = self.TRANSITIONS[wf.workflow_type].get(wf.current_state)
        if not transition:
            return  # Terminal state
        try:
            # Execute the action for this transition
            if transition['action']:
                result = getattr(self, transition['action'])(wf)
                context = {**wf.context, transition['action']: result}
            else:
                context = wf.context
            # Atomic: update state + record transition
            with db.transaction():
                db.execute(
                    """UPDATE workflow_executions
                       SET current_state = %s, context = %s, updated_at = NOW()
                       WHERE id = %s""",
                    [transition['next'], json.dumps(context), workflow_id]
                )
                db.execute(
                    """INSERT INTO workflow_transitions
                       (workflow_id, from_state, to_state, triggered_by)
                       VALUES (%s, %s, %s, %s)""",
                    [workflow_id, wf.current_state,
                     transition['next'], transition['action']]
                )
        except Exception as e:
            db.execute(
                """UPDATE workflow_executions
                   SET current_state = 'failed', updated_at = NOW()
                   WHERE id = %s""", [workflow_id]
            )
            self.start_compensation(workflow_id)
            raise

    def recover_stuck(self):
        """Run every minute via cron. Finds and resumes stuck workflows."""
        stuck = db.execute(
            """SELECT id FROM workflow_executions
               WHERE current_state NOT IN ('completed', 'failed', 'compensated')
                 AND updated_at < NOW() - INTERVAL '5 minutes'"""
        )
        for wf in stuck:
            logger.warning(f"Recovering stuck workflow: {wf.id}")
            self.advance(wf.id)
Why this matters: When your on-call engineer gets paged at 3 AM, they can run SELECT * FROM workflow_transitions WHERE workflow_id = 'abc' ORDER BY transitioned_at and immediately see: "It went created → charging → reserving → stuck." No guessing, no log diving.
Coupling: Why Your Tests Are So Hard
Ever tried to unit test a function that depends on 5 other services? That's tight coupling in action.
checkoutService knows about:
├── StripeAPI
├── InventoryDB
├── EmailService
├── AnalyticsQueue
├── UserPreferences
└── ShippingRates
Change any of these → change checkoutService.
Test checkoutService → mock all 6 dependencies.
ChargeTask:
Input: { orderId, amount }
Output: { transactionId }
ReserveTask:
Input: { orderId, items }
Output: { reserved: true }
Each task knows ONLY its input/output.
Test task → mock 1 input, check 1 output.
| Aspect | Tight Coupling | Loose Coupling |
|---|---|---|
| Test Setup | 50 lines of mocks | 5 lines - one input object |
| Change Impact | Change ripples everywhere | Change contained to one task |
| Team Work | Everyone steps on each other | Teams own their tasks independently |
| Debugging | "Which of 6 services broke?" | "This task failed, here's input/output" |
The insight: In distributed systems, loose coupling isn't a nice-to-have. It's survival. When tasks only know their contracts (input/output), you can test, deploy, and scale them independently.
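A loosely coupled task is trivially testable: one stubbed dependency, one input object, one assertion. A sketch of the ChargeTask contract from above (the implementation and stub are hypothetical):

```python
# A task that knows ONLY its input/output contract.
def charge_task(inp, charge_fn):
    """Input: {orderId, amount}. Output: {transactionId}.
    charge_fn is injected, so tests never touch a real payment API."""
    txn_id = charge_fn(inp["orderId"], inp["amount"])
    return {"transactionId": txn_id}

# The entire test: one stub, one input, one assertion.
def test_charge_task():
    stub = lambda order_id, amount: f"txn-{order_id}"
    out = charge_task({"orderId": "123", "amount": 99.99}, stub)
    assert out == {"transactionId": "txn-123"}

test_charge_task()
```

Compare that with mocking six services to test the tightly coupled checkoutService: the contract boundary is what makes the five-line test possible.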
Dead Letter Queues: The Safety Net for Poison Messages
A dead letter queue (DLQ) is where messages go to die — gracefully. When a message fails processing N times, instead of retrying forever and blocking everything behind it, you move it to a separate queue for inspection.
When a letter can't be delivered (wrong address, recipient moved), the postal service doesn't keep trying forever. It puts the letter in a "return to sender" bin. Someone inspects it, figures out what went wrong, and either fixes the address or notifies the sender.
Your DLQ is that bin. Messages that fail 3 times go there. An alert fires. An engineer inspects the message, fixes the bug or bad data, and either re-processes or discards it.
AWS SQS — built-in DLQ support:
{
  "OrderQueue": {
    "Type": "AWS::SQS::Queue",
    "Properties": {
      "QueueName": "checkout-orders",
      "VisibilityTimeout": 30,
      "RedrivePolicy": {
        "deadLetterTargetArn": { "Fn::GetAtt": ["OrderDLQ", "Arn"] },
        "maxReceiveCount": 3  // After 3 failures → DLQ
      }
    }
  },
  "OrderDLQ": {
    "Type": "AWS::SQS::Queue",
    "Properties": {
      "QueueName": "checkout-orders-dlq",
      "MessageRetentionPeriod": 1209600  // 14 days
    }
  }
}
RabbitMQ — using exchange routing:
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

channel.queue_declare(
    queue='checkout-orders',
    arguments={
        'x-dead-letter-exchange': 'dlx',
        'x-dead-letter-routing-key': 'checkout-orders-dlq',
        'x-message-ttl': 30000,  # Messages older than 30s are dead-lettered
    }
)
channel.queue_declare(queue='checkout-orders-dlq')
Kafka — manual DLQ with a separate topic:
import time

def process_message(message, retry_count=0):
    try:
        handle_order(message)
        consumer.commit()
    except Exception as e:
        if retry_count >= 3:
            # Move to DLQ topic (kafka-python expects headers as
            # a list of (str, bytes) tuples)
            dlq_producer.send(
                'checkout-orders-dlq',
                key=message.key,
                value=message.value,
                headers=[
                    ('original-topic', b'checkout-orders'),
                    ('failure-reason', str(e).encode()),
                    ('retry-count', str(retry_count).encode()),
                ]
            )
            consumer.commit()  # Don't retry main queue
            alert("Message moved to DLQ", message.key)
        else:
            # Retry with backoff
            time.sleep(2 ** retry_count)
            process_message(message, retry_count + 1)
DLQ monitoring rule: Alert if DLQ depth > 0. Every message in the DLQ represents a customer problem — a stuck order, a failed payment, a missing confirmation. Review DLQ messages daily.
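The redrive behavior all three brokers implement can be modeled in a few lines (a toy in-memory version, with hypothetical names):

```python
from collections import deque

def consume(queue, dlq, handler, max_receive=3):
    """Deliver each message to handler; after max_receive failed
    attempts, park the message in the DLQ instead of retrying forever."""
    while queue:
        msg, attempts = queue.popleft()
        try:
            handler(msg)
        except Exception:
            if attempts + 1 >= max_receive:
                dlq.append(msg)                    # poison message: park it
            else:
                queue.append((msg, attempts + 1))  # redeliver later

def handler(msg):
    if msg == "poison":
        raise ValueError("cannot process")

queue = deque([("good", 0), ("poison", 0)])
dlq = []
consume(queue, dlq, handler)
print(dlq)  # ['poison']
```

The good message is processed, the poison message fails three times and lands in the DLQ, and the queue drains instead of blocking behind the bad message forever.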
Putting It All Together: Complete Checkout Configuration
Here's the complete checkout system with every pattern from this post wired together. This is what the "after" looks like.
import time
from enum import Enum

# Prometheus metrics (`workflow_duration`, `step_failures`) and the `queue`
# client are assumed to be defined elsewhere in the service
class OrderState(Enum):
CREATED = "created"
CHARGING = "charging"
CHARGED = "charged"
RESERVING = "reserving"
RESERVED = "reserved"
SHIPPING = "shipping"
COMPLETED = "completed"
COMPENSATING = "compensating"
COMPENSATED = "compensated"
FAILED = "failed"
class ProductionCheckout:
"""
Full checkout with:
- Durable state (PostgreSQL)
- Saga compensation (reverse-order undo)
- Idempotency (advisory locks + keys)
- DLQ (after 3 failures)
- Monitoring (Prometheus metrics)
"""
SAGA_STEPS = [
{
"name": "charge",
"forward_state": OrderState.CHARGING,
"success_state": OrderState.CHARGED,
"action": "_charge_card",
"compensate": "_refund_card",
"timeout_seconds": 30,
},
{
"name": "reserve",
"forward_state": OrderState.RESERVING,
"success_state": OrderState.RESERVED,
"action": "_reserve_inventory",
"compensate": "_release_inventory",
"timeout_seconds": 10,
},
{
"name": "ship",
"forward_state": OrderState.SHIPPING,
"success_state": OrderState.COMPLETED,
"action": "_create_shipment",
"compensate": "_cancel_shipment",
"timeout_seconds": 15,
},
]
def execute(self, order_id: str):
# Record start time for metrics
start = time.time()
for step in self.SAGA_STEPS:
self._transition(order_id, step["forward_state"])
try:
# Idempotent execution with timeout
result = self._idempotent_execute(
key=f"{step['name']}-{order_id}",
action=getattr(self, step["action"]),
timeout=step["timeout_seconds"],
order_id=order_id,
)
self._transition(order_id, step["success_state"],
context={step["name"]: result})
except Exception as e:
step_failures.labels(step=step["name"]).inc()
self._compensate(order_id)
raise
# Success metrics
workflow_duration.labels(
status="completed"
).observe(time.time() - start)
# Fire-and-forget: email (non-critical)
queue.send("send-confirmation", {"order_id": order_id})
What this gives you:
- Crash at any point? Recovery cron finds the order stuck in CHARGING/RESERVING/SHIPPING within 5 minutes
- Double request? Idempotency key prevents double charge/reserve/ship
- Step fails? Compensation runs in reverse, all tracked in the transitions table
- Poison message? After 3 failures, order moves to DLQ with full context
- "Where is order #123?"
SELECT * FROM workflow_transitions WHERE entity_id = 'order-123'
Workflow Shapes
Multi-step operations have different shapes. Knowing them helps you design better.
Pipeline (Sequential)
Each step depends on the previous one. Use for: ordered operations like the checkout flow.
DAG (Parallel + Dependencies)
B and C run in parallel after A. D waits for both. Use for: parallel processing with merge.
Fan-Out (Split)
One input spawns many parallel tasks. Use for: batch processing, sending to multiple targets.
Fan-In (Merge)
Multiple results merge into one. Use for: aggregation, collecting parallel results.
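The four shapes can be sketched with asyncio (the step names here are illustrative): the pipeline runs A first, `asyncio.gather` fans out B and C in parallel, and D fans their results back in.

```python
import asyncio

async def step(name: str) -> str:
    await asyncio.sleep(0)  # stand-in for real I/O (API call, DB query)
    return name

async def run_shapes():
    a = await step("A")                                 # pipeline: A runs first
    b, c = await asyncio.gather(step("B"), step("C"))   # fan-out: B and C in parallel
    d = await step(f"D({b},{c})")                       # fan-in: D waits for both
    return [a, b, c, d]

results = asyncio.run(run_shapes())  # → ['A', 'B', 'C', 'D(B,C)']
```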
If you haven't worked with message queues before, here's the 2-minute version. A message queue is a buffer between producers (things that create work) and consumers (things that do work).
When you order food, the waiter writes a ticket and puts it on the counter. The kitchen picks it up when ready. If the kitchen is busy, tickets stack up. If a cook calls in sick, other cooks pick up the slack. The waiter doesn't wait for the food — they take the next order.
Replace "ticket" with "message," "counter" with "queue," "kitchen" with "worker." That's a message queue.
Three key properties:
- Decoupling: Producer doesn't know (or care) which consumer handles the message. This is how you make email sending not block checkout.
- Buffering: If consumers are slow, messages wait in the queue instead of crashing the producer. This handles traffic spikes.
- Persistence: Messages survive consumer crashes. If a worker dies mid-processing, the message goes back to the queue for another worker.
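The restaurant analogy fits in a few lines of Python, using the standard-library `queue.Queue` as the counter and a thread as the kitchen. Note the waiter (producer) never waits for the food:

```python
import queue
import threading

tickets = queue.Queue()   # the "counter" holding tickets
plates = []               # finished work

def kitchen():            # the consumer/worker
    while True:
        order = tickets.get()
        if order is None:  # shutdown signal
            break
        plates.append(f"cooked:{order}")
        tickets.task_done()

worker = threading.Thread(target=kitchen)
worker.start()
for order in ["burger", "salad"]:  # the waiter drops tickets and moves on
    tickets.put(order)
tickets.join()                      # block only here, until all tickets are done
tickets.put(None)
worker.join()
```

Decoupling: the producer never touches the worker. Buffering: a slow kitchen just means tickets stack up in `tickets` rather than the waiter blocking per order.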
| Queue | Delivery | Ordering | Best For |
|---|---|---|---|
| SQS | At-least-once | Best-effort (FIFO available) | AWS-native, simple async jobs |
| RabbitMQ | At-least-once | FIFO per queue | Complex routing, multi-consumer patterns |
| Kafka | At-least-once (EOS available) | FIFO per partition | High throughput, event streaming, replay |
| Redis Streams | At-least-once | FIFO | Simple, fast, already have Redis |
For a complete deep-dive on message queues and workers, see Async Processing.
Workers: The Execution Layer
Tasks don't run themselves. Something has to execute them. That's where workers come in.
Stateless workers are like warehouse workers with no assigned stations. Any worker can pick up any task from the queue. When one goes home, others continue. Easy to add more during busy season.
Stateful workers are like workers who keep their tools at their station. Faster (no fetching), but if they're absent, their specific tasks wait.
- No memory between tasks - fetch what you need each time
- Easy to scale - just add more workers
- Easy to replace - worker crashes? Another picks up
- Easy to load balance - round-robin works
Use for: Most workloads, anything distributed
- Faster - no fetch overhead (ML model in memory)
- Harder to scale - sticky sessions needed
- Harder to replace - state lost on crash
- Complex load balancing - need routing rules
Use for: ML models, cached connections, session affinity
Auto-Scaling: Matching Capacity to Demand
The beauty of stateless workers: you can add and remove them based on load.
At scale, this means spinning up more workers during peak hours (evenings, promotions) and scaling back during quiet periods — paying only for what you use.
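A toy sizing function for that policy. The per-worker throughput and the bounds are illustrative assumptions, not recommendations:

```python
def desired_workers(queue_depth: int, per_worker_rate: int = 10,
                    min_workers: int = 1, max_workers: int = 50) -> int:
    """Target worker count so the backlog drains in roughly one scaling interval.
    per_worker_rate = messages one worker clears per interval (assumed)."""
    needed = -(-queue_depth // per_worker_rate)  # ceiling division
    return max(min_workers, min(max_workers, needed))
```

An autoscaler would poll queue depth every interval and scale the (stateless) worker pool toward this target; the clamp keeps a runaway queue from spawning unbounded workers.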
Failure Taxonomy: Named Patterns
Giving failures memorable names makes them recognizable. When your team says "we have a Zombie Order," everyone knows what it means and where to look. Here are the five most common workflow failures — with the full story of what happens and how to fix it.
The Zombie Order
The story: It's 2 PM. A customer completes checkout. Stripe charges $149. Your server starts reserving inventory — then the container gets restarted during a rolling deployment. The card is charged. Inventory is not reserved. No confirmation email. The order exists in Stripe but nowhere else. The customer checks "My Orders" — nothing. They check their bank — charge is there. They contact support. Support can't find the order. The customer disputes the charge. You lose the $149 plus a $15 dispute fee.
What the logs show: Nothing. The process crashed. There's a Stripe webhook for the successful payment, but if your webhook handler has the same bug (no durable state), it doesn't help.
Detection query: SELECT * FROM stripe_charges WHERE charge_id NOT IN (SELECT charge_id FROM orders) AND created_at > NOW() - INTERVAL '24 hours'
Fix: Use a saga with durable state. Record the order's state machine in a database before calling Stripe. After the charge, update the state. If the process crashes, a recovery worker picks up orders stuck in "charging" for more than 5 minutes and either completes or compensates them.
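A sketch of that recovery query, using SQLite in place of a production database; the table and column names mirror this post's examples but are assumptions:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT, state TEXT, updated_at TEXT)")
stale = (datetime.utcnow() - timedelta(minutes=10)).isoformat()
fresh = datetime.utcnow().isoformat()
conn.execute("INSERT INTO orders VALUES ('order-1', 'charging', ?)", (stale,))
conn.execute("INSERT INTO orders VALUES ('order-2', 'completed', ?)", (fresh,))

def find_stuck_orders(conn, older_than_minutes: int = 5):
    """Orders sitting in a non-terminal state longer than the threshold."""
    cutoff = (datetime.utcnow() - timedelta(minutes=older_than_minutes)).isoformat()
    rows = conn.execute(
        "SELECT id FROM orders "
        "WHERE state NOT IN ('completed', 'failed', 'compensated') "
        "AND updated_at < ?", (cutoff,),
    ).fetchall()
    return [r[0] for r in rows]

stuck = find_stuck_orders(conn)  # the recovery cron completes or compensates these
```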
The Double Charge
The story: Your checkout function calls Stripe. The request reaches Stripe and succeeds, but the response takes 31 seconds. Your timeout is 30 seconds. Your code sees "timeout" and retries. Stripe processes the second request as a new charge. Customer is charged twice. Your retry logic — the thing you added to make things more reliable — caused the problem.
What the logs show: First call: TimeoutError after 30s. Second call: 200 OK, charge_id: ch_xyz. But there's also ch_abc from the first call that actually succeeded. You see one success in your logs, two charges in Stripe.
The math: At 10,000 orders/day with 1% timeout rate and retry, you could double-charge up to 100 customers daily. At $100 average order, that's $10,000/day in refunds plus the customer trust damage.
Fix: Idempotency keys on every payment call. Use charge-{order_id} as the key. Stripe sees the same key and returns the original result. See the Idempotency section above for full implementation.
The Infinite Retry
The story: A customer enters their phone number as +1 (555) 123-4567 ext. 890 in the checkout form. Your email template function tries to format it and throws a ValueError. The message goes back to the queue. Gets retried. Throws the same error. Retried. Same error. This message is now consuming all your retry budget. Meanwhile, 500 legitimate orders are waiting behind it. Your queue depth alarm fires. You think you're under heavy load. You add more workers. They all fail on the same message.
What the logs show: ValueError: invalid phone format repeated thousands of times. Same message_id. Retry count climbing: 50, 100, 500...
Fix: Dead letter queue (DLQ). After 3-5 retries, move the message to a separate queue. Alert the team. Inspect manually. The key insight: a message that fails 3 times will almost certainly fail forever. Don't let it block everything else. See Failure Handling for detailed retry and DLQ patterns.
"Where Is My Order?"
The story: A customer contacts support: "I ordered 3 days ago. No email. No tracking. Money is gone." The support agent checks the orders database — order exists, status: "processing." What does "processing" mean? Is it stuck? Is it running? Did it fail silently? Nobody knows because the workflow state is a single enum column with no history. There's no timestamp for when it entered "processing." There's no log of which steps completed. The support agent says "I'll escalate this" and the order sits for another 2 days until an engineer manually checks Stripe, the inventory DB, the email logs, and the shipping API to reconstruct what happened.
What the logs show: Nothing useful. Maybe a log entry for the order creation. No structured state transitions.
Fix: Store workflow state as a state machine with history. Every state transition gets a timestamp: created → charging (14:01:03) → charged (14:01:04) → reserving (14:01:04) → ???. Now support can see: "It got stuck at reserving 3 days ago." An engineer can check the inventory service logs for that timestamp. Consider tools like Temporal or Step Functions that give you workflow visibility out of the box.
The Cascade Failure
The story: Your email service goes down for maintenance. Your checkout workflow calls email, waits 30 seconds, times out, and retries. Three retries × 30 seconds = 90 seconds per checkout. Meanwhile, your checkout workers are all stuck waiting for email. New checkout requests pile up. Queue depth hits 10,000. Your payment service starts timing out because the checkout workers (who also call payment) are all occupied. Now customers can't even start checkout. The email service — which has nothing to do with payments — took down your entire business.
What the dashboard shows: Email service: 100% errors. Checkout service: response time 90s (normally 2s). Payment service: response time 45s. Queue depth: growing linearly. Revenue: $0.
Fix: Separate critical from non-critical steps. Email is non-critical — the customer can get the email 5 minutes later. Make it async (fire and forget to a queue). Use circuit breakers on external calls. Use separate worker pools for payment vs email. See Failure Handling for circuit breaker and bulkhead patterns.
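A minimal circuit-breaker sketch for those external calls. The threshold and cooldown are illustrative; a hardened library adds proper half-open probing and metrics:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures, fail fast during `cooldown`,
    then let one trial call through (half-open)."""
    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

With the breaker open, checkout workers stop burning 90 seconds per call on a dead email service; they fail in microseconds and move on.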
Quick Diagnosis Reference
| Symptom | Root Cause | Missing Pattern | Fix |
|---|---|---|---|
| Customer charged, no product | Crash between steps | Saga + durable state | Persist workflow state before each step |
| Double charge on retry | Non-idempotent payment | Idempotency keys | Add idempotency_key to every payment call |
| Queue blocked, nothing processing | Poison message (bad data) | Dead letter queue | Move to DLQ after 3 failed attempts |
| "Where is my order?" — nobody knows | State only in memory | State machine + transitions table | Record every state change with timestamp |
| Email down → payments stop | No isolation between steps | Bulkhead + async for non-critical | Make email async, separate worker pools |
| Refund never issued after failure | Compensation step failed silently | Compensation retry + alerting | Retry compensation 3x, then alert human |
| Order stuck in "processing" forever | No timeout on workflow steps | Step timeouts + recovery cron | Timeout per step, cron job finds stuck workflows |
| Same order processed by 2 workers | No distributed lock | Idempotency + advisory locks | Atomic check-and-set with SELECT FOR UPDATE |
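The atomic check-and-set in the last row can be sketched with a unique-key insert. This uses SQLite's INSERT OR IGNORE for portability; in PostgreSQL you'd reach for INSERT ... ON CONFLICT DO NOTHING or SELECT ... FOR UPDATE. The table name is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, status TEXT)")

def try_acquire(conn, key: str) -> bool:
    """True if this caller claimed the key; False if another worker already did.
    The PRIMARY KEY constraint makes the check-and-set atomic."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO idempotency_keys VALUES (?, 'processing')", (key,))
    conn.commit()
    return cur.rowcount == 1  # 1 row inserted means we own the key

first = try_acquire(conn, "charge-order-123")   # this worker proceeds
second = try_acquire(conn, "charge-order-123")  # duplicate delivery backs off
```

Two workers racing on the same order both call `try_acquire`; the database guarantees exactly one gets `True`, so only one processes the order.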
Monitoring Your Workflows
You've built the patterns. Now you need to know when they break. Workflow monitoring is different from regular application monitoring — you're not just tracking request latency, you're tracking multi-step processes that span minutes, hours, or days.
The Four Metrics That Matter
- Stuck workflows: how many executions are sitting in a non-terminal state for too long?
SELECT COUNT(*) FROM workflow_executions WHERE current_state NOT IN ('completed','failed','compensated') AND updated_at < NOW() - INTERVAL '5 min'
- Per-step failure rate: if failures cluster in one step, say reserve_inventory, the problem isn't your workflow — it's the inventory service. Break down failure rate per step to find the weak link.
from prometheus_client import Counter, Histogram, Gauge
# How long do workflows take?
workflow_duration = Histogram(
'workflow_duration_seconds',
'Time from start to completion',
['workflow_type', 'final_state'],
buckets=[1, 2, 5, 10, 30, 60, 300, 600]
)
# Which steps fail?
step_failures = Counter(
'workflow_step_failures_total',
'Step-level failure count',
['workflow_type', 'step_name', 'error_type']
)
# How many are stuck right now?
stuck_workflows = Gauge(
'workflow_stuck_count',
'Workflows stuck in non-terminal state > 5 min',
['workflow_type']
)
# Compensation tracking
compensations = Counter(
'workflow_compensations_total',
'Compensation executions',
['workflow_type', 'result'] # result: 'success' or 'failed'
)
# Usage in the workflow engine:
def advance(self, workflow_id):
    start = time.time()
    try:
        # ... execute step ...
        workflow_duration.labels(
            workflow_type=wf.type, final_state='completed'
        ).observe(time.time() - start)
    except Exception as e:
        step_failures.labels(
            workflow_type=wf.type, step_name=step_name,
            error_type=type(e).__name__
        ).inc()
Recommended alerts:
- stuck_workflows > 0 for 5 min → Page on-call (something is stuck)
- rate(compensations{result="failed"}) > 0 → Page immediately (money at risk)
- rate(step_failures) / rate(step_total) > 0.05 → Warning (step failure rate above 5%)
- histogram_quantile(0.99, workflow_duration) > 30 → Warning (p99 latency above 30s)
When a workflow spans multiple services, you need distributed tracing to follow the request through the entire flow. Here's the minimum viable tracing setup:
from opentelemetry import trace
from opentelemetry.trace import StatusCode
tracer = trace.get_tracer("workflow-engine")
def execute_saga(saga):
with tracer.start_as_current_span(
"saga.execute",
attributes={
"saga.type": saga.workflow_type,
"saga.entity_id": saga.entity_id,
}
) as span:
for step in saga.steps:
with tracer.start_as_current_span(
f"saga.step.{step.name}",
attributes={"step.name": step.name}
) as step_span:
try:
result = step.execute()
step_span.set_attribute("step.result", str(result))
except Exception as e:
step_span.set_status(StatusCode.ERROR, str(e))
span.set_attribute("saga.failed_at", step.name)
# Compensation spans are children of the saga span
with tracer.start_as_current_span("saga.compensate"):
saga.compensate()
raise
What this gives you: In Jaeger or Zipkin, you can see the full checkout saga as a single trace with child spans for each step. When step 3 fails, you see the compensation spans too. You can measure exactly how long each step took and where the bottleneck is.
For choreographed workflows, propagate the trace context through message headers. Every service that processes an event creates a child span using the parent context from the message. This gives you end-to-end visibility even without a central coordinator.
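A hand-rolled sketch of that propagation; real services would call opentelemetry.propagate.inject on publish and extract on consume, but the mechanics are the same: the trace id rides in the message headers.

```python
import uuid

def publish(queue_buf: list, payload: dict, trace_id=None):
    """Producer attaches its trace context to the message headers."""
    headers = {"trace_id": trace_id or uuid.uuid4().hex}
    queue_buf.append({"headers": headers, "payload": payload})

def consume(queue_buf: list, spans: list):
    """Consumer creates a child span that joins the producer's trace."""
    msg = queue_buf.pop(0)
    spans.append({"trace_id": msg["headers"]["trace_id"],
                  "name": "handle-order-event"})
    return msg["payload"]

buf, spans = [], []
publish(buf, {"order_id": "123"}, trace_id="abc123")
payload = consume(buf, spans)
```

Every hop repeats this: read the parent context from headers, open a child span, and re-inject the context when emitting the next event, so the whole choreography shows up as one trace.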
Choosing Your Tools
You understand the patterns. Now, which tool actually implements them? Here's an honest comparison.
| Tool | Approach | Learning Curve | Best For | Watch Out For |
|---|---|---|---|---|
| AWS Step Functions | JSON state machine | Low-Medium | AWS-native, serverless workflows | Vendor lock-in, JSON gets verbose |
| Temporal | Code-first (Go, Python, Java, TS) | Medium-High | Complex business logic, long-running workflows | Self-hosted complexity, steep initial learning |
| Apache Airflow | Python DAGs | Medium | Data pipelines, batch processing, ETL | Not designed for real-time, heavy scheduler |
| Celery + Redis/RabbitMQ | Task queues (Python) | Low | Async tasks, simple workflows | No built-in saga support, visibility limited |
| Bull/BullMQ | Task queues (Node.js) | Low | Node.js async jobs, simple pipelines | Redis single-point, no orchestration |
| Custom (DB + queues) | DIY state machine | Low (to start) | Small teams, specific requirements | You'll rebuild half of Temporal eventually |
Use AWS Step Functions if: You're already on AWS, your workflows stay under the 25,000-event execution-history limit of Standard workflows, and your team is small. The visual debugger is excellent. The JSON definition language gets painful for complex branching but handles 80% of use cases.
Use Temporal if: You have complex business logic that doesn't fit a state machine, you need long-running workflows (days/weeks), or you need to test workflow logic with unit tests. Temporal lets you write workflows as code, which means your IDE, debugger, and test framework all work. The trade-off: you need to run the Temporal server (or use Temporal Cloud).
Use Airflow if: You're building data pipelines, not business workflows. Airflow is designed for "run this ETL job at 2 AM daily," not "process this order in real-time." If your workflows are triggered by user actions and need sub-second response, Airflow is the wrong tool.
Use Celery/Bull if: You have simple async tasks (send email, generate report, resize image) that don't need saga compensation. Add a retry decorator, point at Redis, and you're done. When you start needing "if task A fails, undo task B," you've outgrown task queues.
Build custom if: Your workflow is simple enough to fit in a database table with a status column and a cron job. Be honest about when you've outgrown this — the moment you're writing your own retry logic, dead letter queue, and state machine transitions, you're rebuilding Temporal but worse.
Workflow Anti-Patterns
These are the mistakes that look reasonable in code review but cause production incidents. Every one of these comes from real systems.
The problem: Your checkout takes 500ms (Stripe) + 200ms (inventory) + 300ms (email) + 100ms (analytics) = 1,100ms total. The user stares at a spinner for over a second. When email is slow (2s), checkout feels broken. When analytics is down, checkout fails entirely.
Fix: Separate critical from non-critical steps. Only charge card and reserve inventory synchronously (700ms). Fire email and analytics to a queue (fire-and-forget). Checkout responds in 700ms. Email arrives 2 seconds later. User doesn't notice. Analytics can retry on its own schedule.
The problem: The orchestrator becomes a bottleneck and single point of failure. A bug in the returns workflow takes down checkout. A deployment to add a notification type requires testing all workflows. The team that owns the orchestrator becomes a bottleneck for every team.
Fix: One orchestrator per business domain. The checkout workflow orchestrates checkout steps. The returns workflow is a separate orchestrator owned by the returns team. They share patterns (saga, idempotency) but not code. This is the "bounded context" principle from domain-driven design applied to workflows.
The problem: The queue consumer tries to charge the card. It fails (expired card, insufficient funds). The user already got a confirmation email. Now you need to send a "sorry, your order didn't go through" email. The user already told their friends they bought it. They're angry.
Fix: Never fire-and-forget on critical steps. The card charge and inventory reservation MUST be synchronous — the user needs to know immediately if they failed. Only non-critical steps (email, analytics, recommendations) should be async. The rule: if failure changes what you tell the user, it must be synchronous.
The problem: You split the checkout function into 5 services: charge-service, inventory-service, email-service, analytics-service, and orchestrator-service. But they all share the same PostgreSQL database. They can't be deployed independently because the database schema is shared. You have all the complexity of distributed systems (network failures, message ordering, eventual consistency) with none of the benefits (independent scaling, independent deployment, fault isolation).
Fix: Either commit to true microservices (each service owns its data, communicates via APIs/events) or keep it as a well-structured monolith. A monolith with good module boundaries is 10x better than a distributed monolith. Only split when you have a concrete reason: different scaling needs, different team ownership, or different deployment cadences.
Testing Workflow Patterns
Workflows are notoriously hard to test because they cross service boundaries and have complex failure modes. Here's how to test each pattern without needing a full production-like environment.
Testing Sagas: The Compensation Matrix
A saga with 4 steps has 4 possible failure points. Each needs a test proving compensation works correctly.
import pytest
from unittest.mock import MagicMock, patch
class TestCheckoutSaga:
"""Test every failure point triggers correct compensation."""
def test_happy_path(self):
"""All steps succeed → no compensation called."""
saga = CheckoutSaga(order=mock_order)
result = saga.execute()
assert result.status == "completed"
assert saga.compensate_called == False
def test_failure_at_step_1_charge(self):
"""Charge fails → nothing to compensate."""
with patch('stripe.charge', side_effect=StripeError):
saga = CheckoutSaga(order=mock_order)
with pytest.raises(SagaFailed):
saga.execute()
assert len(saga.completed_steps) == 0 # Nothing to undo
def test_failure_at_step_2_inventory(self):
"""Inventory fails → refund card."""
with patch('inventory.reserve', side_effect=OutOfStockError):
saga = CheckoutSaga(order=mock_order)
with pytest.raises(SagaFailed):
saga.execute()
# Verify: card was charged then refunded
assert stripe_mock.charge.called
assert stripe_mock.refund.called
assert stripe_mock.refund.call_args.kwargs["amount"] == order.total
def test_failure_at_step_3_shipping(self):
"""Shipping fails → release inventory, refund card."""
with patch('shipping.create', side_effect=ShippingAPIError):
saga = CheckoutSaga(order=mock_order)
with pytest.raises(SagaFailed):
saga.execute()
# Compensation in REVERSE order
assert inventory_mock.release.called # Step 2 undone
assert stripe_mock.refund.called # Step 1 undone
def test_compensation_failure_alerts_human(self):
"""If refund fails during compensation → alert ops team."""
with patch('shipping.create', side_effect=ShippingAPIError):
with patch('stripe.refund', side_effect=StripeError):
saga = CheckoutSaga(order=mock_order)
with pytest.raises(SagaFailed):
saga.execute()
assert alert_mock.ops_team.called
assert "COMPENSATION FAILED" in alert_mock.ops_team.call_args[0]
Testing Idempotency
def test_idempotent_charge_only_charges_once():
"""Calling charge_card with same key twice charges once."""
result1 = charge_card(idempotency_key="order-123", amount=99.99)
result2 = charge_card(idempotency_key="order-123", amount=99.99)
assert result1 == result2 # Same response
assert stripe_mock.charge.call_count == 1 # Only charged once
def test_different_keys_charge_separately():
"""Different keys = different charges (not a cache)."""
charge_card(idempotency_key="order-123", amount=99.99)
charge_card(idempotency_key="order-456", amount=49.99)
assert stripe_mock.charge.call_count == 2 # Two different charges
def test_concurrent_same_key_charges_once():
"""Two threads with same key → only one charge (race condition test)."""
import threading
results = []
def charge():
r = charge_card(idempotency_key="order-789", amount=99.99)
results.append(r)
t1 = threading.Thread(target=charge)
t2 = threading.Thread(target=charge)
t1.start(); t2.start()
t1.join(); t2.join()
assert results[0] == results[1] # Same response
assert stripe_mock.charge.call_count == 1 # One charge
Unit tests verify logic. Integration tests verify the workflow actually works with real infrastructure. Here's a minimal Docker Compose setup for testing workflows:
version: '3.8'
services:
postgres:
image: postgres:16
environment:
POSTGRES_DB: workflow_test
POSTGRES_PASSWORD: test
ports: ["5432:5432"]
redis:
image: redis:7
ports: ["6379:6379"]
# Fake Stripe API that records calls
stripe-mock:
image: stripe/stripe-mock:latest
ports: ["12111:12111"]
# Your workflow service
workflow:
build: .
environment:
DATABASE_URL: postgresql://postgres:test@postgres/workflow_test
REDIS_URL: redis://redis:6379
STRIPE_API_BASE: http://stripe-mock:12111
depends_on: [postgres, redis, stripe-mock]
import pytest, time
def test_full_checkout_with_crash_recovery(workflow_engine, db):
"""Simulate crash mid-workflow, verify recovery."""
# Start checkout
wf_id = workflow_engine.start("checkout", order_data)
# Advance to "charging" state
workflow_engine.advance(wf_id)
assert db.get_state(wf_id) == "charging"
# Simulate crash: don't call advance() again
# Wait for recovery cron to find it
time.sleep(6) # Recovery runs every 5s in test
# Verify workflow was recovered and completed
state = db.get_state(wf_id)
assert state in ("completed", "compensated")
# Verify idempotency: charge happened only once
charges = stripe_mock.get_charges(order_data["id"])
assert len(charges) == 1
This test proves three things: (1) the workflow engine persists state, (2) the recovery cron finds and resumes stuck workflows, and (3) idempotency prevents double-charging during recovery.
When To Use What
The Evolution Path
Most systems evolve through these stages. Each transition is triggered by a specific pain point.
Stage 1: A simple function with try/catch. Trigger to move on: First production incident where you don't know if the customer was charged.
Stage 2: Retries with backoff. Trigger to move on: First double-charge incident from a retry without idempotency.
Stage 3: Idempotent retries. Trigger to move on: "Where is order #123?" — support can't answer because state is in memory.
Stage 4: Durable state in a database. Trigger to move on: Crash leaves orders in limbo. Manual reconciliation takes hours.
Stage 5: Custom orchestrator with sagas and recovery. Trigger to move on: You're writing retry logic, DLQ handling, timeout management, and it's 3,000+ lines.
Stage 6: A dedicated workflow engine (Temporal, Step Functions). You've reached the end of the evolution. Now focus on business logic, not infrastructure.
Decision Framework
Tradeoffs: Approach Comparison
What To Do Monday
- If you're just starting: Keep it simple. Use a function. Add retry with backoff. Don't over-engineer.
- If you have multi-step operations > 5 seconds: Add in-process orchestration. Define the workflow shape explicitly. Add timeouts.
- If failures are causing real pain (lost orders, double charges): Move to distributed orchestration. Persist state. Add idempotency to every payment operation.
- If you need to trace "where is this order?": You need durable state and workflow visibility. Consider tools like Temporal, Step Functions, or Airflow.
Practice Mode: Test Your Understanding
Before moving on, try answering these. Click to reveal the answer.
Scenario 1: Given this checkout flow: ChargeCard → ReserveInventory → SendEmail. Which step should have compensation?
ChargeCard needs compensation (RefundCard). If ReserveInventory fails after a successful charge, you must refund the customer. SendEmail doesn't need compensation — a failed email doesn't require undoing the charge. Put irreversible steps (emails, shipped packages) LAST.
Scenario 2: What idempotency key would you use for "charge customer $50 for order #ABC123"?
charge-order-ABC123 or payment-{order_id}. The key should be unique per business operation, not per request. If the same charge is retried, the idempotency key ensures it only happens once. Never use random UUIDs — they defeat the purpose.
Scenario 3: Diagnose this failure: "Customer was charged twice for the same order"
Missing idempotency. The payment request was retried (network timeout, user clicked twice, queue redelivery) without an idempotency key. The payment provider processed both as separate charges. Fix: Always include an idempotency key with payment operations. Stripe, for example, accepts Idempotency-Key header and will return the same result for duplicate requests.
Workflow Orchestration Cheat Sheet
- Every forward step has a compensation step
- Compensate in reverse order
- Put irreversible steps LAST
- If compensation fails: retry → queue → alert human
- Key = operation-{entity_id}
- TTL > longest retry window (24-72h)
- Use atomic check-and-set (advisory locks or SET NX)
- Store key BEFORE executing, or use the external API's built-in support
- Persist state to DB before each step
- Record every transition with timestamp
- Recovery cron: find stuck > 5 min
- Use FOR UPDATE for row-level locks
- Zombie Order: charged, not shipped → saga
- Double Charge: retry without key → idempotency
- Infinite Retry: poison message → DLQ
- Cascade Failure: no isolation → bulkhead + async
- Exactly-once delivery is impossible
- Use at-least-once + idempotent consumers
- Outbox pattern: DB write + message in one tx
- Debezium for production CDC relay
- Step Functions: AWS-native, serverless, visual
- Temporal: complex logic, long-running, code-first
- Airflow: data pipelines, batch ETL only
- Custom: fine until you need retry + DLQ + state
- Step Functions: ~$25/month
- Temporal Cloud: ~$100-200/month starter
- Self-hosted Temporal: ~$50-100/month (EC2/ECS)
- Custom + PostgreSQL: ~$20/month
Key Takeaways
Where To Go Deep
وَاللَّهُ أَعْلَمُ
And Allah knows best.