LLM Engineering

Memory for LLMs:
From Stateless to Production-Ready

Why your AI assistant forgets everything—and the engineering patterns to fix it. A visual guide to conversation history, trimming, summarization, and RAG.

Bahgat Bahgat Ahmed
January 2025 · 45 min read

بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

In the name of Allah, the Most Gracious, the Most Merciful

You're building a coding assistant. Users love it in testing. Then you ship.

A developer asks: "Refactor the auth module." Your assistant nails it. Same developer, 10 minutes later: "Now add rate limiting to what we just built." Your assistant responds: "I'd be happy to help! Could you share the code you'd like me to add rate limiting to?"

It forgot. Everything. The entire refactoring session—gone.

This isn't a bug. It's how LLMs work. And it's the first problem every production AI system has to solve.

Quick Summary
  • LLMs are stateless — every API call starts fresh, no memory of previous calls
  • "Memory" = sending conversation history — you literally paste everything each time
  • Context windows have limits — when you exceed them, you need strategies: trimming, summarization, or RAG

Want the full story? Keep reading.

This post is for you if:

  • You're building an LLM-powered app and conversations feel "broken"
  • You're confused why ChatGPT "remembers" but your API calls don't
  • You're hitting context limits and don't know how to handle them
The Problem
Why LLMs Forget Everything

The Uncomfortable Truth: LLMs Have Amnesia

Every time you call an LLM API, you're talking to a stranger. The model doesn't know you messaged it 5 seconds ago. It doesn't know you messaged it yesterday. Each API call exists in complete isolation.

Each API Call is Isolated

First call:
User: I'm building a REST API in Python
LLM: Great! I can help with Flask, FastAPI, or Django...

(no connection between the two calls)

Second call:
User: Which framework should I use?
LLM: For what project? I'd need more context...

The LLM has no idea these two calls are related. To it, they're from two different users on two different planets.

When you use ChatGPT or Claude, it seems like they remember. But that's not the LLM—that's an entire memory system built on top. The LLM is the engine; memory is a separate module that most people never see.

Real World Impact

If you run Llama locally and chat with it, you'll see this firsthand. No memory system = every message is message #1.

The Obvious Fix: Just Send Everything

The simplest solution? Send the entire conversation every time. The LLM doesn't remember, so you remind it — over and over.

Python - OpenAI SDK
from openai import OpenAI
client = OpenAI()

# This is "memory" — a list you manage yourself
messages = [
    {"role": "system", "content": "You are a coding assistant."},
]

def chat(user_message):
    messages.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages  # Send EVERYTHING every time
    )

    assistant_msg = response.choices[0].message.content
    messages.append({"role": "assistant", "content": assistant_msg})
    return assistant_msg

# Each call sends the full history:
chat("I'm building a REST API in Python")  # sends 2 messages (system + user)
chat("Let's use FastAPI")                  # sends 4 messages
chat("Add authentication to it")           # sends 6 messages
chat("Now add rate limiting")              # sends 8 messages... and growing

This works beautifully for short conversations. The LLM sees the full context, so its responses are coherent and relevant. But it can't last.

How "Memory" Actually Works

Conversation with LLM:
System: You are a helpful coding assistant
User: I'm building a REST API in Python
Assistant: Great! I recommend FastAPI for its speed and automatic docs.
User: Let's use FastAPI then. Show me a basic setup.
Assistant: Here's a minimal FastAPI app with one endpoint...
User: Now add authentication to it
Assistant: I'll add JWT auth to the FastAPI app we just created...

The LLM doesn't "remember"—you're literally pasting the entire conversation into every API call.

This works beautifully... until it doesn't.

The Wall: Context Windows Have Limits

LLMs can only process so much text at once. This limit is called the context window—and when you hit it, everything breaks.

Same Conversation, Different Limits

The same conversation can overflow a small context window, fill roughly 40% of a medium one, and only about 15% of a large one. Check your model's documentation for exact limits.

But bigger context windows aren't free. More tokens means:

  • Higher costs — you pay per token, input and output
  • Slower responses — more to process, more latency
  • Worse focus — models struggle to attend to everything equally
The Math That Matters

A 50-message conversation averages ~15,000 tokens. At GPT-4o pricing ($2.50/M input tokens), that single conversation costs ~$0.04 per message toward the end. With 10,000 active users sending 20 messages/day at that rate, input tokens alone run ~$8,000/day. And it gets worse with every message — message #50 resends all 49 previous messages.
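The quadratic growth is worth seeing in numbers. A quick sketch (the 300-tokens-per-message average and the GPT-4o input price are illustrative assumptions, not measurements):

```python
def full_history_cost(n_messages, tokens_per_message=300, price_per_m=2.50):
    """Input-token cost of a conversation where every call resends the
    whole history (illustrative averages and GPT-4o input pricing)."""
    # Call k resends k messages of history: 1 + 2 + ... + n = n(n + 1) / 2 total
    messages_sent = n_messages * (n_messages + 1) // 2
    return messages_sent * tokens_per_message * price_per_m / 1_000_000

short_cost = full_history_cost(10)   # a 10-message chat
long_cost = full_history_cost(50)    # 5x the messages, roughly 23x the cost
```

Doubling the conversation length roughly quadruples the total tokens sent, which is why "just send everything" stops being viable long before the context window itself runs out.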

Context window sizes by model (2025)
  • GPT-4o: 128K tokens (~300 pages of text)
  • Claude Sonnet/Opus: 200K tokens (~500 pages)
  • Gemini 1.5 Pro: 2M tokens (~5,000 pages)
  • Llama 3: 8K-128K tokens depending on variant

Bigger sounds better, but attention quality degrades with length. A model with 200K context doesn't attend equally to token #1 and token #199,999. Research shows "lost in the middle" — models focus on the beginning and end, often missing information in the middle of long prompts.
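One common mitigation, when you control how the prompt is assembled, is to reorder chunks so the strongest material sits at the edges of the prompt (where attention is best) and the weakest lands in the middle. LangChain ships a version of this idea as LongContextReorder; a minimal sketch:

```python
def reorder_for_long_context(chunks_by_relevance):
    """Given chunks sorted best-first, place the strongest at the start
    and end of the context and the weakest in the middle, where
    long-context attention is weakest ("lost in the middle")."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)      # ranks 1, 3, 5, ... fill from the start
        else:
            back.insert(0, chunk)    # ranks 2, 4, 6, ... fill from the end
    return front + back

ranked = ["best", "second", "third", "fourth", "fifth"]
reordered = reorder_for_long_context(ranked)
```

The best chunk ends up first, the second-best ends up last, and the weakest chunks sink to the middle of the prompt.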

Strategy 1
Context Window Management

Strategy 1: Trimming (Drop the Old Stuff)

The bluntest solution: when the conversation gets too long, cut the oldest messages. First in, first out.

FIFO Trimming: Simple but Dangerous

[M1] [M2] [M3]  → dropped forever      [M4] [M5] [M6] [M7]  → kept in context

The user mentioned they're allergic to shellfish in M2. Your food recommendation bot just suggested shrimp.

Trimming is fast and cheap. But critical information often lives in those early messages — user preferences, project context, key decisions. Once it's gone, it's gone.

Python - Token-based trimming
import tiktoken

def trim_messages(messages, max_tokens=4000, model="gpt-4o"):
    """Keep the system prompt + most recent messages that fit."""
    enc = tiktoken.encoding_for_model(model)

    # Always keep the system prompt
    system_msg = messages[0]
    system_tokens = len(enc.encode(system_msg["content"]))

    # Add messages from newest to oldest until we hit the limit
    trimmed = []
    token_count = system_tokens

    for msg in reversed(messages[1:]):
        msg_tokens = len(enc.encode(msg["content"]))
        if token_count + msg_tokens > max_tokens:
            break
        trimmed.insert(0, msg)
        token_count += msg_tokens

    return [system_msg] + trimmed

# Usage: trim before every API call
trimmed = trim_messages(messages, max_tokens=4000)
response = client.chat.completions.create(
    model="gpt-4o", messages=trimmed
)
Smarter trimming strategies

Basic FIFO trimming is naive. Better approaches:

  • Keep system prompt + first user message + recent messages. The first message often contains crucial context ("I'm building a food delivery app").
  • Keep pairs together. Never drop a user message while keeping its assistant response — the model gets confused by orphaned messages.
  • Importance-weighted trimming. Mark certain messages as "pinned" (e.g., messages where the user stated preferences or made decisions). Trim unpinned messages first.
  • Sliding window with overlap. Keep the last N messages but also the first 2-3 messages. Simple and usually good enough.
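The "sliding window + first messages" and "keep pairs together" ideas combine into a few lines. A sketch (it assumes messages[0] is the system prompt and the rest strictly alternate user/assistant, which real histories may not):

```python
def sliding_window_trim(messages, keep_first_pairs=1, keep_recent=10):
    """Keep the system prompt, the first user/assistant exchange(s), and
    the most recent messages, trimming everything in between.

    Assumes messages[0] is the system prompt and the rest alternate
    user/assistant (real histories need real pairing logic)."""
    system = messages[:1]
    head = messages[1:1 + 2 * keep_first_pairs]    # first exchange(s)
    rest = messages[1 + 2 * keep_first_pairs:]

    if len(rest) <= keep_recent:
        return system + head + rest

    tail = rest[-keep_recent:]
    # Never start the tail with an orphaned assistant reply
    if tail and tail[0]["role"] == "assistant":
        tail = tail[1:]
    return system + head + tail

# Example: a system prompt plus 10 user/assistant exchanges
msgs = [{"role": "system", "content": "You are a coding assistant."}]
for i in range(10):
    msgs.append({"role": "user", "content": f"user message {i}"})
    msgs.append({"role": "assistant", "content": f"reply {i}"})

trimmed = sliding_window_trim(msgs, keep_first_pairs=1, keep_recent=10)
```

The first exchange survives ("I'm building a food delivery app"), the middle is dropped, and the window never opens on a dangling assistant reply.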

The Moment Trimming Betrays You

Trimming works until it silently deletes the one message that mattered. Here is a real conversation where trimming creates a dangerous failure. Watch what happens when message 3 gets dropped:

Conversation History (before trimming)
# Message 1 (user):
"I'm building a medical triage chatbot for an ER."

# Message 2 (assistant):
"I can help. What symptoms should it handle?"

# Message 3 (user):  ← THIS IS CRITICAL
"Important constraint: NEVER suggest the patient is fine
 to go home. Always recommend they wait for a doctor."

# Message 4 (assistant):
"Understood. I'll always err on the side of caution."

# Messages 5-20: Building out triage logic, testing scenarios...

# Message 21 (user):
"Patient reports mild headache, no other symptoms. What should
 the bot say?"
What trimming does

With a window of 10 messages, the trimmer drops messages 1-11. Message 3 — the safety constraint — is gone. The assistant now responds: "Based on the symptoms, this sounds like a tension headache. The patient could likely go home and take ibuprofen." The one constraint the user explicitly set has been silently deleted.

This is why naive trimming is dangerous for any use case where early messages contain constraints, preferences, or safety rules. You have two options: pin critical messages so they never get trimmed, or graduate to summarization.
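A pinning mechanism fits in a short function. In this sketch the "pinned": True flag is a convention invented here (strip it before the API call), and the word-count token estimate stands in for a real tokenizer:

```python
def trim_with_pins(messages, max_tokens=4000, count_tokens=None):
    """Token-budgeted trimming that never drops pinned messages.

    The "pinned": True flag is a convention invented for this sketch
    (strip it before the API call). count_tokens defaults to a rough
    word-count estimate; use a real tokenizer in production."""
    count = count_tokens or (lambda m: int(len(m["content"].split()) * 1.3) + 4)

    system, rest = messages[0], messages[1:]
    pinned = [m for m in rest if m.get("pinned")]
    budget = max_tokens - count(system) - sum(count(m) for m in pinned)

    # Spend the remaining budget on the most recent unpinned messages
    kept = []
    for msg in reversed([m for m in rest if not m.get("pinned")]):
        if budget - count(msg) < 0:
            break
        kept.append(msg)
        budget -= count(msg)

    keep_ids = {id(m) for m in pinned} | {id(m) for m in kept}
    # Reassemble in original order so the dialogue still reads correctly
    return [system] + [m for m in rest if id(m) in keep_ids]

# The pinned constraint survives no matter how much filler comes after it:
history = [
    {"role": "system", "content": "You are a food assistant."},
    {"role": "user", "content": "I'm severely allergic to shellfish.", "pinned": True},
]
for i in range(40):
    history.append({"role": "user", "content": f"long filler message number {i} " * 30})

trimmed = trim_with_pins(history, max_tokens=2000)
```

Unpinned filler gets cut to fit the budget; the allergy message is exempt from trimming entirely.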

How token counting actually works (with real numbers)

Tokens are not words. They are sub-word units that the model's tokenizer produces. On average, 1 token is about 0.75 English words, or 4 characters. But this varies by language and content type:

  • English prose: ~1.3 tokens per word
  • Python code: ~1.5 tokens per word (more special characters)
  • Arabic text: ~2-3x more tokens than English for the same meaning
  • JSON/structured data: ~2x tokens vs. plain text (brackets, quotes, colons)
Python - Real token counting
import tiktoken

# Different models use different tokenizers
enc_gpt4 = tiktoken.encoding_for_model("gpt-4o")
enc_gpt35 = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "Refactor the authentication module to use JWT tokens"

# Same text, different token counts (gpt-4o uses the o200k_base
# encoding, gpt-3.5-turbo uses cl100k_base)
gpt4_tokens = len(enc_gpt4.encode(text))
gpt35_tokens = len(enc_gpt35.encode(text))

# For a full messages array (OpenAI format), count overhead too
def count_messages_tokens(messages, model="gpt-4o"):
    """Count tokens including per-message overhead."""
    enc = tiktoken.encoding_for_model(model)
    tokens = 0
    for msg in messages:
        tokens += 4  # approximate per-message overhead (role, separators; varies by model)
        for key, value in msg.items():
            tokens += len(enc.encode(value))
    tokens += 2  # priming overhead for assistant reply
    return tokens

# A 20-message conversation: ~3,200 tokens
# A 50-message conversation: ~15,000 tokens
# A 100-message coding session: ~45,000 tokens

Why this matters for trimming: You cannot trim by message count alone. A message containing a code block might be 2,000 tokens. A message saying "yes" is 1 token. Token-based trimming is the only reliable approach.

For Anthropic models: Claude uses a different tokenizer. The anthropic Python SDK provides a token-counting endpoint (client.messages.count_tokens()) for accurate counts. Do not use tiktoken for Claude — the counts can be off by 10-20%.

Use Trimming When
  • Sessions are short (under 20 messages)
  • Cost is your primary concern
  • Early messages are truly disposable (casual chat)
  • You have a "sliding window + first message" strategy
Don't Use When
  • User preferences are stated early (allergies, constraints)
  • Safety-critical rules are in early messages
  • You need to reference past decisions ("why did we choose X?")
  • Users expect the assistant to "remember" everything
Decision Point
Your food delivery chatbot has 100 messages in the current session. Message 5 contains: "I'm severely allergic to shellfish." Your trimming window keeps only the last 20 messages. What do you do?
A Increase the window to 100 messages so nothing gets lost
B Summarize message 5 into a compact format before trimming
C Pin safety-critical messages (like allergies) so they're never trimmed
D Trust the user to re-state allergies if it becomes relevant
C is correct. For safety-critical information like allergies, dietary restrictions, or medical constraints, you need a "pinning" mechanism that marks certain messages as immune to trimming. This is the only approach that guarantees the information survives. Option A is expensive and doesn't scale. Option B might lose nuance ("severely allergic" becoming "dislikes shellfish"). Option D is dangerous — users assume the bot remembers what they told it.

Summarization: Compress, Don't Delete

Instead of throwing away old messages, compress them. Use an LLM to create a summary of what happened.

Compression Without Total Loss

Full conversation history (12,000 tokens) → Summarize → Key facts preserved (400 tokens)

30x compression. But summaries are lossy—nuance gets lost. And generating them adds 300-800ms latency.

Python - Conversation summarization
def summarize_and_compress(messages, max_tokens=4000):
    """Summarize old messages, keep recent ones verbatim."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    total = sum(len(enc.encode(m["content"])) for m in messages)

    if total <= max_tokens:
        return messages  # No need to summarize yet

    # Keep system prompt + last 6 messages verbatim
    system = messages[0]
    recent = messages[-6:]
    old = messages[1:-6]

    # Summarize the old messages
    summary = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for summarization
        messages=[{
            "role": "system",
            "content": """Summarize this conversation. Preserve:
- User's name, preferences, and stated requirements
- Key decisions made and their reasoning
- Technical context (language, framework, architecture)
- Any constraints or deadlines mentioned
Be concise but don't lose critical facts."""
        }] + old
    ).choices[0].message.content

    # Return: system + summary + recent messages
    return [
        system,
        {"role": "system", "content": f"Previous conversation summary: {summary}"},
        *recent
    ]
A production summarization prompt (before and after)

The prompt you use for summarization determines what survives. A vague prompt produces vague summaries. Here is a production-grade prompt and what it does to a real conversation:

Summarization System Prompt
"""You are a conversation summarizer for a coding assistant.
Produce a structured summary preserving EXACTLY these fields:

1. USER IDENTITY: Name, role, timezone (if mentioned)
2. PROJECT CONTEXT: Language, framework, architecture decisions
3. CONSTRAINTS: Any "must", "never", "always" rules the user stated
4. DECISIONS MADE: What was decided, with reasoning
5. CURRENT STATE: What has been built so far
6. OPEN QUESTIONS: What is unresolved

Format as a flat list. Be specific. Use the user's exact words
for constraints. Do not editorialize or add your own opinions.
Maximum 400 tokens."""

Before summarization (12,400 tokens):

20 messages about building a FastAPI app with JWT auth, including code snippets, debugging a CORS issue, deciding on PostgreSQL over MongoDB, and a constraint that all endpoints must return JSON:API format.

After summarization (380 tokens):

Generated Summary
USER IDENTITY: Backend developer, building for a startup
PROJECT CONTEXT: Python 3.12, FastAPI, PostgreSQL, deployed on Railway
CONSTRAINTS:
  - All endpoints MUST return JSON:API format
  - No ORM - raw SQL with asyncpg
  - Auth tokens expire after 15 minutes
DECISIONS MADE:
  - PostgreSQL over MongoDB (need relational joins for reporting)
  - JWT with refresh tokens (not session-based)
  - CORS allows only app.example.com
CURRENT STATE:
  - /auth/login and /auth/refresh endpoints working
  - User model with bcrypt password hashing
  - CORS middleware configured
OPEN QUESTIONS:
  - Rate limiting strategy not yet decided
  - File upload endpoint design pending

32x compression. The key constraint about JSON:API format — which trimming would have silently deleted — is explicitly preserved. The cost of this summarization call with GPT-4o-mini: $0.0019.

Summarization strategies: rolling vs. hierarchical

Rolling summarization: After every N messages, summarize the oldest batch and prepend the summary. Simple but summaries stack up and drift over time. The summary of a summary of a summary loses nuance.

Hierarchical summarization: Maintain multiple levels. Level 1: raw messages (last 10). Level 2: summary of messages 11-50. Level 3: summary of the entire session. Each level compresses more aggressively. More complex but preserves both recent detail and long-term context.

Single evolving summary: Maintain one summary that gets updated after each conversation turn instead of re-summarized from scratch. Prompt: "Given this existing summary and the new exchange, update the summary." Most token-efficient but can drift from reality over many updates.

Cost comparison: Rolling adds ~$0.001 per summarization call with GPT-4o-mini. For a 100-message conversation, that's ~$0.02 total. Compare to sending the full history at message #100 which costs ~$0.08 per message. Summarization pays for itself after ~4 messages.
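The single-evolving-summary pattern can be sketched as a small class. The summarize callback stands in for the LLM call (e.g. gpt-4o-mini with an "update this summary" prompt); the stub in the example exists only to make the sketch runnable:

```python
class EvolvingSummaryMemory:
    """Single evolving summary: one summary updated as the conversation
    grows, plus a verbatim tail of recent messages. The summarize callback
    stands in for an LLM call in this sketch."""

    def __init__(self, summarize, keep_recent=6):
        self.summarize = summarize      # fn(old_summary, overflow_msgs) -> str
        self.summary = ""
        self.recent = []                # verbatim tail of the conversation
        self.keep_recent = keep_recent

    def add(self, message):
        self.recent.append(message)
        if len(self.recent) > self.keep_recent:
            # Fold the overflow into the summary, keep the tail verbatim
            overflow = self.recent[: -self.keep_recent]
            self.summary = self.summarize(self.summary, overflow)
            self.recent = self.recent[-self.keep_recent:]

    def build_prompt(self, system_prompt):
        out = [{"role": "system", "content": system_prompt}]
        if self.summary:
            out.append({"role": "system",
                        "content": f"Previous conversation summary: {self.summary}"})
        return out + self.recent

# Stub summarizer: concatenates contents (an LLM call in production)
stub = lambda summary, msgs: (summary + " " + " ".join(m["content"] for m in msgs)).strip()

mem = EvolvingSummaryMemory(summarize=stub, keep_recent=2)
for i in range(5):
    mem.add({"role": "user", "content": f"m{i}"})
```

Each call to add() folds at most a few messages into the summary, which is what keeps the update cheap compared to re-summarizing the whole history from scratch.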

The Progression So Far

Full history is perfect until the context window fills up. Trimming is fast but loses critical info. Summarization preserves intent but costs an extra API call and adds 300-800ms latency. Each strategy is the right answer for a different scenario — and most production systems combine them.
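Combining the strategies can be as simple as a dispatcher: full history while it fits, summarization once it does not. A sketch with injected stand-ins for the tokenizer and the summarization call (both would be real in production):

```python
def prepare_context(messages, count_tokens, summarize, max_tokens=8000):
    """Hybrid memory: full history while it fits, summarization once it
    does not. count_tokens and summarize are injected; in production they
    would be a real tokenizer and a cheap-model LLM call."""
    total = sum(count_tokens(m) for m in messages)
    if total <= max_tokens:
        return messages                       # cheap case: send everything

    system, old, recent = messages[0], messages[1:-6], messages[-6:]
    summary = summarize(old)                  # compress only the middle
    return [system,
            {"role": "system", "content": f"Previous conversation summary: {summary}"},
            *recent]

# Stand-ins, purely for illustration:
count_tokens = lambda m: len(m["content"].split())
summarize = lambda msgs: f"{len(msgs)} earlier messages condensed"

msgs = [{"role": "system", "content": "You are a coding assistant."}]
msgs += [{"role": "user", "content": "word " * 100} for _ in range(20)]

ctx = prepare_context(msgs, count_tokens, summarize, max_tokens=1000)
```

Short conversations pay no summarization cost or latency at all; long ones degrade gracefully instead of overflowing.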

Use Summarization When
  • Conversations often exceed 30+ messages
  • You need to preserve intent and decisions
  • Cost savings matter (cheaper than full history)
  • You can tolerate 300-800ms extra latency
Don't Use When
  • Exact quotes matter (legal, compliance, support tickets)
  • Nuance is critical ("slightly prefer X" vs "want X")
  • Latency is ultra-critical (under 100ms requirement)
  • Audit trails require verbatim history
Decision Point
You're building a medical assistant that helps doctors review patient history. Conversations can reach 100+ messages. Which memory strategy should you use?
A Summarization — it compresses efficiently and preserves key facts
B Trimming — just keep the last 20 messages for speed
C Full history with pinned critical messages, accept higher cost
D RAG over past sessions — retrieve relevant history on demand
C is the safest choice for medical contexts. Summarization risks losing critical details ("patient mentioned chest pain once" might become "no cardiac symptoms"). Trimming is dangerous — early symptoms matter. RAG is useful for cross-session memory but doesn't solve within-session recall. For safety-critical domains, accept higher costs and keep full verbatim history with pinned critical messages. You can still add summarization for non-critical context, but core medical details should never be compressed.

External Memory: Store It Somewhere Else

Here is the key insight: memory does not have to live in the prompt. You can store information externally and pull it in only when relevant.

This is the shift from "remember everything" to "remember what matters right now."

A Framework for Thinking About Memory

Memory in LLMs maps to 5 categories. The first two (parametric and procedural) are baked into the model — you can't change them at runtime. The three that matter for developers are: working memory (context window), episodic memory (conversation history), and semantic memory (knowledge bases/RAG).

In Practice

Cursor's knowledge of your codebase = semantic memory. Its ability to write TypeScript = parametric memory. Your current chat = working memory.

The full memory taxonomy (academic framework)

For completeness, here is the full 5-type memory taxonomy. The visual map and detailed cards below show how each type relates to LLM systems:

The Memory Taxonomy

  • Working — current chat in the prompt
  • Episodic — past sessions & outcomes
  • Semantic — docs & knowledge bases
  • Procedural — system prompts & rules
  • Parametric — baked in during training
Short-Term

Working Memory — the current conversation. Lives in the prompt. Limited by the context window.
Example: Your chat history with Claude right now

Long-Term

Episodic — past sessions and their outcomes.
Example: "Last week you tried Redis and it crashed"

Semantic — external knowledge: docs, codebases, wikis.
Example: Cursor reading your codebase

Procedural — how to do things. System prompts.
Example: "Always respond in JSON format"

Permanent

Parametric — baked into model weights during training. Cannot be changed at runtime.
Example: The model "knows" Python syntax and that Paris is in France
Parametric memory: what the model "just knows" (and why you cannot change it)

Parametric memory is information baked into the model's weights during training. It is why GPT-4 "knows" that Python uses indentation for blocks, or that Paris is the capital of France. You cannot add to it, remove from it, or update it at runtime.

This matters practically because:

  • Knowledge cutoff: Parametric memory stops at the training date. If your company launched a new product after the cutoff, the model does not know about it. This is the primary reason RAG exists — to inject current knowledge.
  • Hallucination risk: When the model's parametric memory is wrong or outdated, it confidently states incorrect facts. Fine-tuning can partially address this, but it is expensive ($5-50+ per training run) and introduces its own failure modes.
  • Domain-specific gaps: General-purpose models have shallow parametric memory for niche domains. A medical coding standard, your company's internal API, a proprietary protocol — the model has no parametric memory of these. You must supply them via RAG or context.
Procedural memory: system prompts as behavioral rules

Procedural memory in LLM systems is typically implemented as system prompts — instructions that tell the model how to behave rather than what to know. Examples:

  • "Always respond in JSON format"
  • "You are a customer support agent for Acme Corp. Never discuss competitor products."
  • "When the user asks about pricing, redirect to the sales team."

Procedural memory is the most stable form of memory because it lives at the start of the messages array and never gets trimmed (a well-designed system always preserves the system prompt). It is also a recurring cost: a detailed 1,000-token system prompt is billed on every single API call.

Practical tip: Keep system prompts under 500 tokens for high-volume applications. Move detailed behavioral rules into a separate document and inject them via RAG only when relevant. A system prompt that says "Follow the rules in the provided context" plus a RAG-retrieved rulebook is cheaper than a 2,000-token system prompt on every call.
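The arithmetic behind that tip is worth making explicit. A quick sketch (the $2.50/M input price and the 100,000 calls/day volume are illustrative assumptions):

```python
def system_prompt_cost_per_day(prompt_tokens, calls_per_day, price_per_m=2.50):
    """Daily input-token cost attributable to the system prompt alone
    (illustrative GPT-4o input pricing)."""
    return prompt_tokens * calls_per_day * price_per_m / 1_000_000

fat_prompt = system_prompt_cost_per_day(2000, 100_000)   # detailed rulebook inline
lean_prompt = system_prompt_cost_per_day(500, 100_000)   # rules moved behind RAG
savings = fat_prompt - lean_prompt                        # per day, input tokens only
```

At this volume, trimming the system prompt from 2,000 to 500 tokens saves $375/day before counting the conversation itself.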

The Industry Standard
RAG: Retrieval-Augmented Generation

RAG: The Industry Standard

Retrieval-Augmented Generation is how most production systems handle long-term memory. Instead of stuffing everything in the prompt, you store documents externally and retrieve only what's relevant.

If context window management is "remembering what happened in this conversation," RAG is "knowing things that were never part of this conversation." Your company's documentation, your codebase, your customer's order history — these all live outside the conversation and need to be retrieved on demand.

Why RAG Is Everywhere

Cursor indexes your codebase with RAG. Notion AI uses RAG over your workspace. ChatGPT's "memory" feature stores facts in a retrieval layer. Every "chat with your docs" product is RAG under the hood. Understanding how it works is not optional if you are building with LLMs.

RAG: Retrieve → Augment → Generate

User Query (1) → Embed (2) → Vector DB (3) → LLM (4) → Answer (5)

This is what powers Notion AI, Cursor, and every "chat with your docs" product.

Python - Minimal RAG pipeline
from openai import OpenAI
import numpy as np

client = OpenAI()

# Step 1: INGEST - Chunk your documents and embed them
def embed_text(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def chunk_document(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# Step 2: RETRIEVE - Find the most relevant chunks
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query, knowledge_base, top_k=3):
    query_embedding = embed_text(query)
    scored = [
        (chunk, cosine_similarity(query_embedding, emb))
        for chunk, emb in knowledge_base
    ]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, score in scored[:top_k]]

# Step 3: GENERATE - Ask the LLM with retrieved context
def rag_query(question, knowledge_base):
    relevant_chunks = retrieve(question, knowledge_base)
    context = "\n\n".join(relevant_chunks)

    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""Answer based on this context:
{context}

If the context doesn't contain the answer, say so."""},
            {"role": "user", "content": question}
        ]
    ).choices[0].message.content

Step 1: Chunking — How You Split Documents Changes Everything

Before you can search your documents, you need to split them into chunks small enough for embedding. This sounds trivial — just split every 500 words, right? But chunking strategy is the single biggest lever you have for RAG quality. Bad chunking produces bad retrieval, and no amount of fancy reranking fixes it.

Chunking strategies compared: fixed, semantic, and recursive (with code)

Fixed-size chunking

Split every N tokens, with an overlap to avoid losing context at boundaries. Simple, fast, and often good enough for a first pass.

Python - Fixed-size chunking
def fixed_chunk(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks with overlap."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start = end - overlap  # overlap prevents cutting mid-thought
    return chunks

# A 10,000-word doc with 500-word chunks + 50 overlap = ~22 chunks
# Pro: Fast, predictable chunk sizes
# Con: Splits mid-sentence, mid-paragraph, mid-thought

Semantic chunking

Split at natural boundaries — paragraph breaks, section headers, or even sentence boundaries. Preserves meaning within each chunk, but chunk sizes vary wildly (some paragraphs are 2 sentences, others are 20).

Python - Semantic chunking
import re

def semantic_chunk(text, max_tokens=800):
    """Split at paragraph boundaries, merge small paragraphs."""
    paragraphs = re.split(r'\n\n+', text)
    chunks = []
    current = []
    current_size = 0

    for para in paragraphs:
        para_tokens = len(para.split()) * 1.3  # rough token estimate
        if current_size + para_tokens > max_tokens and current:
            chunks.append("\n\n".join(current))
            current = [para]
            current_size = para_tokens
        else:
            current.append(para)
            current_size += para_tokens

    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Pro: Each chunk is a complete thought
# Con: Chunk sizes vary (100-800 tokens), harder to predict costs

Recursive character splitting (LangChain default)

Try to split at paragraphs first. If a paragraph is too big, split at sentences. If a sentence is too big, split at words. This gives you consistent sizes while preserving as much semantic structure as possible.

Python - Using LangChain's recursive splitter
# Newer LangChain versions ship this in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,         # characters, not tokens
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]  # try these in order
)

chunks = splitter.split_text(document_text)
# Result: chunks that are ~1000 chars each, split at natural boundaries
# This is the best default for most use cases

Which to use when:

  • Fixed-size: When you need predictable cost/latency per chunk (API-based embeddings with rate limits)
  • Semantic: When each paragraph or section is a self-contained concept (FAQ pages, structured docs)
  • Recursive: When you do not know the document structure ahead of time (best general-purpose default)

Rule of thumb: Start with recursive splitting at 500-1000 characters with 100-200 overlap. Measure retrieval quality. Adjust from there. Most teams spend too long picking chunking strategies and too little time measuring whether their retrieval actually returns the right chunks.

Step 2: Embedding — Turning Text into Searchable Vectors

Once you have chunks, each one gets converted into a vector — a list of numbers that captures its meaning. Two pieces of text about the same topic will have vectors that are close together in this space. Two unrelated texts will be far apart. This is how "search by meaning" works instead of "search by keyword."

Embedding model comparison: dimensions, cost, and performance
| Model | Dimensions | Cost per 1M Tokens | Best For |
|---|---|---|---|
| text-embedding-3-small | 1,536 | $0.02 | Most use cases, prototyping, cost-sensitive production |
| text-embedding-3-large | 3,072 | $0.13 | Nuanced similarity, legal/medical where precision matters |
| Voyage AI (voyage-3) | 1,024 | $0.06 | Code search, domain-specific retrieval |
| Cohere embed-v3 | 1,024 | $0.10 | Multilingual, search-optimized use cases |
| BGE / E5 (open source) | 768-1,024 | Free (self-hosted) | Data privacy, on-premise, no API dependency |

Higher dimensions do not always mean better quality. The 1,536-dimension text-embedding-3-small outperforms older 1,536-dimension models by a wide margin. Architecture matters more than dimension count. For most teams, text-embedding-3-small is the right starting point: it costs 6.5x less than the large variant and the quality difference only matters for specialized retrieval tasks.

Practical tip: Embed a test set of 50-100 queries against your actual documents. Manually label the "correct" chunks for each query. Measure recall@5 (does the correct chunk appear in the top 5 results?). If recall@5 is above 85%, your embedding model is not your bottleneck — focus on chunking and reranking instead.
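That recall@5 measurement is a few lines of code once you have labeled queries. A sketch (retrieve is your own search function; the toy retriever below exists only to show the interface shape):

```python
def recall_at_k(retrieve, labeled_queries, k=5):
    """Fraction of queries whose labeled correct chunk appears in the
    top-k results. retrieve(query, top_k) returns a ranked list of chunk
    ids; labeled_queries is a list of (query, correct_chunk_id) pairs."""
    hits = sum(
        1 for query, correct_id in labeled_queries
        if correct_id in retrieve(query, top_k=k)
    )
    return hits / len(labeled_queries)

# Toy retriever that always returns the same ranking, for illustration:
toy_retrieve = lambda query, top_k: ["chunk-a", "chunk-b", "chunk-c"][:top_k]
labeled = [("q1", "chunk-a"), ("q2", "chunk-b"), ("q3", "chunk-z")]
score = recall_at_k(toy_retrieve, labeled, k=2)   # 2 of 3 queries hit
```

Run this against your real retriever before and after every chunking or embedding change; it turns "retrieval feels better" into a number.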

Step 3: Vector Storage — Where Your Embeddings Live

Once you have vectors, you need somewhere to store and search them. This is where vector databases come in. The choice matters less than you think for getting started — but matters a lot at scale.

Vector database comparison: Pinecone vs Chroma vs pgvector vs Qdrant
| Database | Type | Best For | Latency (1M vectors) | Cost |
|---|---|---|---|---|
| Pinecone | Managed SaaS | Production, scale without ops | ~30-50ms | $70+/month (Starter) |
| Chroma | Embedded / local | Prototyping, small datasets | ~5-15ms (in-process) | Free (runs locally) |
| pgvector | PostgreSQL extension | Already using Postgres, simpler ops | ~50-200ms | Your existing Postgres cost |
| Qdrant | Open source / cloud | Production, need filtering + speed | ~10-30ms | Free (self-host) or $25+/month (cloud) |

Decision shortcut:

  • Prototyping? Use Chroma. It runs in your Python process, zero setup, and you can switch later.
  • Already have PostgreSQL? Use pgvector. One less database to manage. It scales to ~1M vectors before you need to optimize.
  • Production with >1M vectors? Use Pinecone (if you want managed) or Qdrant (if you want control).
  • Need advanced filtering? Qdrant has the best metadata filtering. Pinecone is close behind.

Connection pooling note: If you use pgvector, you inherit all the connection management challenges from regular PostgreSQL. Connection pools, idle timeouts, and connection limits all apply. See the Database Connections deep dive for how to handle this properly.

Step 4: Reranking — The Quality Multiplier

Here is the thing most teams miss: vector similarity is not the same as relevance. Two texts can be "semantically similar" without one being a useful answer to the other. Reranking is the fix — and it is the single biggest quality improvement you can make after basic RAG is working.

Reranking: how it works and why it matters so much

The problem: Embedding models encode the query and each document separately. They measure similarity but not relevance. A chunk about "Python deployment strategies" and a chunk about "Python deployment errors" might score almost identically for the query "how do I deploy my Python app?" — but only the first is useful.

The fix: A reranker (cross-encoder model) takes the query and each candidate chunk as a pair, processes them together, and produces a true relevance score. It is slower (because it compares each pair individually) but dramatically more accurate.

Python - Reranking with Cohere
import cohere

co = cohere.Client("your-api-key")

def retrieve_and_rerank(query, knowledge_base, initial_k=20, final_k=3):
    # Step 1: Fast vector search for initial candidates
    candidates = vector_search(query, knowledge_base, top_k=initial_k)

    # Step 2: Rerank with cross-encoder for true relevance
    results = co.rerank(
        query=query,
        documents=[c["text"] for c in candidates],
        top_n=final_k,
        model="rerank-english-v3.0"
    )

    # Return only the truly relevant chunks
    return [candidates[r.index] for r in results.results]

# Typical improvement: recall@3 goes from 65% to 88%
# Cost: ~$0.001 per rerank call (20 documents)
# Latency: adds ~150-250ms

The pattern: Retrieve 20 candidates with fast vector search (cheap, ~30ms). Rerank to find the true top 3 (accurate, ~200ms). This two-stage approach gives you both speed and quality. Cohere Rerank and Voyage Rerank are the most popular options, both costing fractions of a cent per call.

When to add reranking: If your retrieval accuracy is below 85% and you have tried improving chunking first. Reranking is the highest-ROI improvement for most RAG systems after chunking, but it does add 150-250ms of latency per query.

Measuring Retrieval Quality

You cannot improve what you do not measure. Most teams build RAG, test it manually with 5 queries, say "looks good," and ship. Then users report wrong answers and nobody knows if it is a retrieval problem, a chunking problem, or an LLM problem.

The Three Numbers That Matter

  • Recall@K: "For each query in my test set, does the correct chunk appear in the top K results?" Target: 85%+ at K=5.
  • Precision@K: "Of the top K results returned, how many are actually relevant?" Target: 70%+.
  • MRR (Mean Reciprocal Rank): "When the correct chunk appears, how high is it ranked?" Target: 0.7+.

If recall is low, fix chunking. If precision is low, add reranking. If MRR is low, upgrade your embedding model.
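All three metrics take only a few lines of pure Python to compute. A minimal sketch, assuming a labeled test set that maps each query to the set of chunk IDs a human marked as relevant:

```python
def retrieval_metrics(results_by_query, relevant_by_query, k=5):
    """Compute recall@k, precision@k, and MRR over a labeled test set.

    results_by_query:  {query: [chunk ids, ranked by the retriever]}
    relevant_by_query: {query: set of human-labeled relevant chunk ids}
    """
    recall_hits = 0
    precisions = []
    reciprocal_ranks = []

    for query, ranked in results_by_query.items():
        relevant = relevant_by_query[query]
        top_k = ranked[:k]

        # Recall@k: did at least one relevant chunk make the top k?
        recall_hits += any(cid in relevant for cid in top_k)

        # Precision@k: what fraction of the top k is relevant?
        precisions.append(sum(cid in relevant for cid in top_k) / k)

        # MRR: reciprocal rank of the first relevant result (0 if none)
        rr = 0.0
        for rank, cid in enumerate(top_k, start=1):
            if cid in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)

    n = len(results_by_query)
    return {
        "recall@k": recall_hits / n,
        "precision@k": sum(precisions) / n,
        "mrr": sum(reciprocal_ranks) / n,
    }

# Toy example with k=2: q1 hits at rank 1, q2 misses entirely
results = {"q1": ["a", "b"], "q2": ["x", "y"]}
relevant = {"q1": {"a"}, "q2": {"z"}}
metrics = retrieval_metrics(results, relevant, k=2)
```

The same function works for any retriever, which makes it easy to compare chunking or embedding changes on a fixed test set.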

The hybrid approach: conversation memory + RAG

Most production systems combine multiple strategies:

  • Short-term: Full conversation history (last 10-20 messages) — keeps the current conversation coherent
  • Medium-term: Summarization of older messages — preserves session context without token bloat
  • Long-term: RAG over documents and past sessions — gives the LLM access to knowledge beyond the current conversation
def build_context(user_query, conversation, knowledge_base):
    # Layer 1: System prompt (procedural memory)
    system = {"role": "system", "content": SYSTEM_PROMPT}

    # Layer 2: Summary of old conversation (compressed short-term)
    summary = summarize_if_needed(conversation[:-10])

    # Layer 3: RAG context (long-term/semantic memory)
    relevant_docs = retrieve(user_query, knowledge_base, top_k=3)
    rag_context = "\n".join(relevant_docs)

    # Layer 4: Recent conversation (verbatim short-term)
    recent = conversation[-10:]

    return [
        system,
        {"role": "system", "content": f"Session summary: {summary}"},
        {"role": "system", "content": f"Relevant docs:\n{rag_context}"},
        *recent
    ]

This is essentially what Cursor, ChatGPT, and other production AI tools do under the hood. The specific implementation varies, but the layered approach is universal.

Use RAG When
  • You have a large knowledge base (docs, codebases, wikis)
  • Information changes less frequently than daily
  • You need to cite sources for answers
  • Context needs exceed any model's context window
Don't Use When
  • Data changes in real-time (live dashboards, trading)
  • The corpus is tiny (<10 documents) — just put it in prompt
  • You need guaranteed retrieval of specific docs
  • Latency requirements are under 50ms
Decision Point
Your company documentation is updated weekly. Engineers complain the RAG system "knows old things." What's the most important metadata to track on each chunk?
A Document title and author — helps with source attribution
B Last updated timestamp — allows recency-weighted ranking
C Chunk length and position — helps with context reconstruction
D Embedding model version — ensures consistency
B is the key fix. When docs update weekly, timestamp metadata lets you boost recent documents in ranking. A 2024 doc about your API should outrank a 2022 doc even if semantic similarity is slightly lower. You can use hybrid scoring: final_score = 0.7 * semantic_score + 0.3 * recency_score. Title/author (A) helps attribution but doesn't solve staleness. Chunk metadata (C) helps reconstruction but not freshness. Embedding version (D) is for maintenance, not retrieval quality.
When Memory Systems Fail

Memory bugs are subtle. The system doesn't crash — it just gives confident wrong answers.

The Stale Memory Problem

[Diagram: data from six months ago flows out of memory into the LLM, which gives a confidently wrong answer]
Symptoms:
  • Users report "the bot is wrong" or "it said the old thing"
  • Answers reference deprecated features, old policies, or discontinued products
  • Correct information exists in newer documents but isn't retrieved
Root Cause: Embedding similarity doesn't consider recency. A 2019 document about "parental leave policy" scores almost identically to a 2024 document about the same topic.

How to Detect: Log retrieved chunks with their timestamps. If old documents appear in top-K results when newer versions exist, you have this problem. Create a test case: ask about something that changed recently.

How to Fix: Add last_updated metadata to every chunk. Use hybrid scoring: final_score = 0.7 * semantic + 0.3 * recency. Re-index on a schedule. For critical docs, implement "supersedes" relationships.
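The hybrid scoring formula above can be sketched in a few lines. The 0.7/0.3 weights and the 180-day half-life here are illustrative starting points, not tuned values:

```python
import time

def hybrid_score(semantic_score, last_updated_ts, now_ts=None,
                 w_semantic=0.7, w_recency=0.3, half_life_days=180):
    """Blend semantic similarity with document recency.

    Recency decays exponentially: a doc that is half_life_days old
    contributes half the recency weight of a brand-new doc.
    """
    now_ts = now_ts if now_ts is not None else time.time()
    age_days = max(0.0, (now_ts - last_updated_ts) / 86400)
    recency = 0.5 ** (age_days / half_life_days)  # 1.0 when new, -> 0 with age
    return w_semantic * semantic_score + w_recency * recency

# A slightly less similar but fresh doc outranks a stale near-duplicate
now = time.time()
fresh = hybrid_score(0.82, now - 7 * 86400, now_ts=now)        # 1 week old
stale = hybrid_score(0.85, now - 2 * 365 * 86400, now_ts=now)  # 2 years old
```

In practice you would apply this to the candidates returned by vector search and re-sort before building the prompt.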

The Leaked Context Problem

[Diagram: User A and User B share one memory store with no isolation, so User A sees User B's data]
Symptoms:
  • Users see information they didn't provide ("How do you know my address?")
  • Responses reference other users' orders, conversations, or personal data
  • Security audit reveals cross-user data in responses
Root Cause: Memory storage isn't namespaced. All users' data lives in the same vector space. Semantic search finds similar content regardless of ownership.

How to Detect: Create two test users. User A mentions "my secret project is called Phoenix." User B asks "what projects exist?" If B's response mentions Phoenix, you have a leak. This is a mandatory pre-launch test.

How to Fix: Namespace all stored data by user ID. Filter by user_id as a metadata filter before similarity search (not post-filter). Use separate collections per tenant for enterprise. This is non-negotiable for production.
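To make the "filter before search, not after" point concrete, here is a toy in-memory sketch. Real systems push the filter into the vector database's metadata query (e.g. a where filter in Chroma or a payload filter in Qdrant) so isolation is enforced at query time; the store and cosine scoring here are illustrative only:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(store, query_vec, user_id, top_k=3):
    """Pre-filter by user_id, THEN rank by similarity.

    Post-filtering (rank everyone's chunks, then drop other users')
    can silently return fewer than top_k results, and leaks data the
    moment the filter is forgotten anywhere in the pipeline.
    """
    candidates = [item for item in store if item["user_id"] == user_id]
    ranked = sorted(candidates,
                    key=lambda item: cosine(item["vec"], query_vec),
                    reverse=True)
    return ranked[:top_k]

store = [
    {"user_id": "user-a", "text": "Project Phoenix kickoff notes", "vec": [1.0, 0.1]},
    {"user_id": "user-b", "text": "Quarterly budget draft", "vec": [0.9, 0.2]},
]

# User B's query never sees User A's chunks, however similar they are
results = search(store, [1.0, 0.0], user_id="user-b")
```

The two-test-users check described above ("Phoenix" should never appear for User B) is the behavioral test that this structure makes trivially true.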

The Polluted Retrieval Problem

[Diagram: one relevant chunk and two noise chunks all reach the LLM, which gets confused by the noise]
Symptoms:
  • LLM gives vague or contradictory answers
  • Responses blend unrelated topics ("Deploy by going to the product meeting...")
  • Hallucinations increase despite having correct docs in the corpus
Root Cause: Embedding similarity is imprecise. "Deploy to production" matches "production line," "product management," and "productive meetings." High top-K amplifies noise.

How to Detect: Log retrieved chunks for each query. Manually review 20 random queries. If more than 30% of retrieved chunks are irrelevant, you have polluted retrieval.

How to Fix: (1) Reduce top-K from 10 to 3-5. (2) Add reranking: retrieve 20, use a cross-encoder to pick best 3. (3) Improve chunking — each chunk should be about ONE topic. (4) Add metadata filters (doc_type, section_header). Quality over quantity — 2 perfect chunks beat 10 mediocre ones.
The Tradeoffs (No Free Lunch)
  • Full History: fast, expensive, best accuracy. Best for short sessions.
  • Trimming: fast, cheap, low accuracy. Best for casual chat.
  • Summarization: medium speed, medium cost, medium accuracy. Best for long conversations.
  • RAG: medium speed, low cost, high accuracy. Best for knowledge bases.
  • Hybrid: slow, expensive, best accuracy. Best for production.
Decision Framework

One way to think about choosing a memory strategy:

  • How long are your sessions? Short → try full history. Medium → consider summarization. Long → explore hybrid/RAG.
  • Do you have external docs? No → conversation memory only. Yes → add a RAG layer.
What To Do Monday

Starting a new project?

Begin with full conversation history. Measure when it breaks. Add complexity only when data shows you need it.

Step 1: Just append messages

Users complaining about "forgetting"?

Add logging first. Track conversation lengths and failure points. Data tells you what to fix.

Step 1: Log before optimizing

Building RAG?

Start simple: basic chunking + vector search. Measure retrieval quality before adding complexity.

Most RAG failures = retrieval failures

Practice Mode: Apply What You Learned

Test your understanding with real scenarios you might face in production

1
You're building a customer support bot for an e-commerce site. Sessions average 15 messages. Users often ask about order status, returns, and product questions. What memory strategy should you start with?
Answer:
Start with full conversation history. At 15 messages, you're well under context limits (~5K tokens). Full history gives perfect accuracy with no complexity. Monitor conversation lengths in production. If they start exceeding 30 messages regularly, add summarization. Don't optimize prematurely — data will tell you when to evolve.
2
Your RAG system answers questions about company documentation. Users report it "knows the wrong things" — giving outdated information or answering about the wrong department's policies. How do you diagnose and fix this?
Answer:
This is a retrieval problem, not a generation problem. Three fixes: (1) Add metadata to chunks — department, document type, last_updated timestamp. Filter by department when querying. (2) Check your chunking — if chunks mix content from different sections, they'll match incorrectly. Use semantic chunking at section boundaries. (3) Add a reranking step — retrieve 10, rerank to 3. The cross-encoder catches semantic mismatches that embeddings miss.
3
Your LLM-powered coding assistant is costing $400/day in API calls. You're sending full conversation history on every request, and conversations often reach 80+ messages. What do you try first?
Answer:
Implement rolling summarization. Keep the last 15 messages verbatim, summarize everything older. Use a cheap model (GPT-4o-mini at $0.15/M tokens) for summarization. At 80 messages, you're paying for ~40K tokens per request. Summarization cuts this to ~8K tokens (last 15 messages + 500-token summary). That's an 80% cost reduction. You'll add ~300ms latency for the summarization call, but save ~$320/day. The ROI is immediate.
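The rolling-summarization pattern from this answer, sketched with a placeholder summarize_fn (in production that argument would wrap a call to a cheap model such as GPT-4o-mini):

```python
def build_rolling_context(messages, summarize_fn, keep_last=15):
    """Keep the newest messages verbatim; compress everything older.

    summarize_fn takes a list of messages and returns a short string --
    a stand-in here for a cheap-model summarization call.
    """
    if len(messages) <= keep_last:
        return list(messages)

    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = summarize_fn(older)
    return [
        {"role": "system",
         "content": f"Summary of earlier conversation: {summary}"},
        *recent,
    ]

# Stub summarizer for illustration only
def stub_summarize(msgs):
    return f"{len(msgs)} earlier messages about the user's project."

messages = [{"role": "user", "content": f"msg {i}"} for i in range(80)]
context = build_rolling_context(messages, stub_summarize)
# 80 messages collapse to 1 summary message + 15 verbatim messages
```

Caching the summary and only re-summarizing every N turns keeps the added latency off most requests.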
Cheat Sheet: LLM Memory

The Problem

  • LLMs are stateless — each call is independent
  • No built-in memory between API calls
  • Context windows have token limits

Memory Strategies

  • Full History: Send everything (expensive, simple)
  • Trimming: Drop old messages (lossy)
  • Summarization: Compress history (balanced)
  • RAG: Retrieve relevant context (scalable)

Memory Types

  • Working: Current conversation
  • Episodic: Past sessions
  • Semantic: External knowledge
  • Procedural: How-to instructions

Common Failures

  • Stale Memory: Uses outdated info
  • Leaked Context: User A gets User B's data
  • Polluted Retrieval: Too much noise in chunks

Starting Point

  • Short chats → Try full history first
  • Long chats → Consider summarization
  • Very long → Explore hybrid/RAG
  • External docs? → Add RAG

Coming in Part 2: Advanced Memory Patterns

Forgetting curves, multi-agent memory sharing, graph-based RAG, and memory that learns from itself.

Testing Memory Systems

Memory bugs are invisible at demo time. They emerge after 50+ messages, after documents change, or after users push edge cases you never imagined. Here is how to catch them before production does.

RAG retrieval quality test suite

The minimum viable test suite for any RAG system. Run this on every deployment.


def test_retrieval_quality(retriever, test_cases):
    """
    test_cases = [
        {"query": "How do I reset my password?",
         "expected_doc_id": "auth-guide-v3",
         "expected_in_top_k": 3},
        ...
    ]
    """
    hits = 0
    reciprocal_ranks = []

    for case in test_cases:
        results = retriever.search(case["query"], top_k=5)
        result_ids = [r.doc_id for r in results]

        if case["expected_doc_id"] in result_ids[:case["expected_in_top_k"]]:
            hits += 1
            rank = result_ids.index(case["expected_doc_id"]) + 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)

    recall = hits / len(test_cases)
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)

    assert recall >= 0.85, f"Recall dropped to {recall:.2f} (threshold: 0.85)"
    assert mrr >= 0.70, f"MRR dropped to {mrr:.2f} (threshold: 0.70)"

    return {"recall": recall, "mrr": mrr, "total": len(test_cases)}

Building the test set: Take your 50 most common user queries from production logs. For each, manually identify which document chunk should be retrieved. Store as JSON. Run this on every deployment. If recall drops below 85%, the deployment broke something — block it.
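For illustration, a stored test set might look like the JSON below. The file name and the second entry's IDs are hypothetical; the field names simply match the test function in this section:

```python
import json

# Hypothetical contents of a retrieval_test_set.json file -- in
# practice you would load it with json.load(open(...))
TEST_SET_JSON = """
[
  {"query": "How do I reset my password?",
   "expected_doc_id": "auth-guide-v3",
   "expected_in_top_k": 3},
  {"query": "What is the refund window?",
   "expected_doc_id": "returns-policy-2024",
   "expected_in_top_k": 5}
]
"""

test_cases = json.loads(TEST_SET_JSON)

# Sanity-check the schema before running the retrieval suite
required = {"query", "expected_doc_id", "expected_in_top_k"}
assert all(required <= case.keys() for case in test_cases)
```

Keeping the test set in version control alongside the code means retrieval regressions show up in the same CI run that introduced them.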

Context window edge case tests

These tests catch the bugs that only appear after long conversations.

def test_context_window_overflow():
    """Simulate a 200-message conversation and verify
    the system handles it gracefully (summarize/trim)."""
    messages = []
    for i in range(200):
        messages.append({"role": "user", "content": f"Message {i}: " + "x" * 500})
        messages.append({"role": "assistant", "content": f"Reply {i}"})

    # This should NOT throw a token limit error
    context = build_context(messages)
    total_tokens = count_tokens(context)
    assert total_tokens <= MODEL_MAX_TOKENS * 0.9

def test_early_info_retained():
    """User mentions their name in message 1.
    After 50 messages, the system should still know it."""
    messages = [
        {"role": "user", "content": "My name is Ahmed and I work at Acme Corp"},
        # ... 50 filler messages ...
        {"role": "user", "content": "What's my name?"}
    ]
    response = get_completion(build_context(messages))
    assert "Ahmed" in response

Why this matters: Trimming strategies silently drop the first messages. Summarization might omit "unimportant" details like user names. These tests catch the failure before users do.

وَاللَّهُ أَعْلَمُ

And Allah knows best

وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ

May Allah's peace and blessings be upon our master Muhammad and his family

Was this helpful?

Comments

Loading comments...

Leave a comment