LLM Engineering

Memory for LLMs:
From Stateless to Production-Ready

Why your AI assistant forgets everything—and the engineering patterns to fix it. A visual guide to conversation history, trimming, summarization, and RAG.

Bahgat Bahgat Ahmed
January 2025 · 15 min read

بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

In the name of Allah, the Most Gracious, the Most Merciful

You're building a coding assistant. Users love it in testing. Then you ship.

A developer asks: "Refactor the auth module." Your assistant nails it. Same developer, 10 minutes later: "Now add rate limiting to what we just built." Your assistant responds: "I'd be happy to help! Could you share the code you'd like me to add rate limiting to?"

It forgot. Everything. The entire refactoring session—gone.

This isn't a bug. It's how LLMs work. And it's the first problem every production AI system has to solve.

Quick Summary
  • LLMs are stateless — every API call starts fresh, no memory of previous calls
  • "Memory" = sending conversation history — you literally paste everything each time
  • Context windows have limits — when you exceed them, you need strategies: trimming, summarization, or RAG

Want the full story? Keep reading.

This post is for you if:

  • You're building an LLM-powered app and conversations feel "broken"
  • You're confused why ChatGPT "remembers" but your API calls don't
  • You're hitting context limits and don't know how to handle them

The Uncomfortable Truth: LLMs Have Amnesia

Every time you call an LLM API, you're talking to a stranger. The model doesn't know you messaged it 5 seconds ago. It doesn't know you messaged it yesterday. Each API call exists in complete isolation.

Each API Call is Isolated

First call:
  User: "I'm building a REST API in Python"
  LLM: "Great! I can help with Flask, FastAPI, or Django..."

(no connection between the calls)

Second call:
  User: "Which framework should I use?"
  LLM: "For what project? I'd need more context..."

The LLM has no idea these two calls are related. To it, they're from two different users on two different planets.

When you use ChatGPT or Claude, it seems like they remember. But that's not the LLM—that's an entire memory system built on top. The LLM is the engine; memory is a separate module that most people never see.

Real World Impact

If you run Llama locally and chat with it, you'll see this firsthand. No memory system = every message is message #1.
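
To make the statelessness concrete, here's a minimal sketch. call_llm() is a hypothetical stand-in for whatever chat endpoint you're using, not a real library function:

    # Each call sends only the current message; no history at all.
    def call_llm(messages: list[dict]) -> str:
        ...  # send `messages` to your model provider and return the reply text

    # First call: the model sees exactly one message.
    reply_1 = call_llm([{"role": "user", "content": "I'm building a REST API in Python"}])

    # Second call: a brand-new request. The model has no idea the first call happened.
    reply_2 = call_llm([{"role": "user", "content": "Which framework should I use?"}])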

The Obvious Fix: Just Send Everything

The simplest solution? Send the entire conversation every time. The LLM doesn't remember, so you remind it—over and over.

How "Memory" Actually Works

System: You are a helpful coding assistant
User: I'm building a REST API in Python
Assistant: Great! I recommend FastAPI for its speed and automatic docs.
User: Let's use FastAPI then. Show me a basic setup.
Assistant: Here's a minimal FastAPI app with one endpoint...
User: Now add authentication to it
Assistant: I'll add JWT auth to the FastAPI app we just created...

The LLM doesn't "remember"—you're literally pasting the entire conversation into every API call.
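
In code, "memory" is nothing more than a list you keep appending to and resend in full. A minimal sketch, reusing the hypothetical call_llm() from above:

    # "Memory" = append every turn and resend the whole list on every call.
    history = [{"role": "system", "content": "You are a helpful coding assistant"}]

    def chat(user_message: str) -> str:
        history.append({"role": "user", "content": user_message})
        reply = call_llm(history)          # the full conversation goes out every time
        history.append({"role": "assistant", "content": reply})
        return reply

    chat("I'm building a REST API in Python")
    chat("Let's use FastAPI then. Show me a basic setup.")
    chat("Now add authentication to it")   # works, because the earlier turns are in `history`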

This works beautifully... until it doesn't.

The Wall: Context Windows Have Limits

LLMs can only process so much text at once. This limit is called the context window—and when you hit it, everything breaks.

Same Conversation, Different Limits

  • Small context window: overflow
  • Medium context window: ~40% full
  • Large context window: ~15% full

The same conversation fills a different share of the window depending on your model's context window size. Check your model's documentation for exact limits.

But bigger context windows aren't free. More tokens means:

  • Higher costs — you pay per token, input and output
  • Slower responses — more to process, more latency
  • Worse focus — models struggle to attend to everything equally

The Math

Token costs add up fast. Long conversations with many users can quickly become expensive—especially when you're resending the entire history with every message.
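
You can measure this yourself by counting the tokens you resend on each call. A rough sketch using the tiktoken tokenizer; the per-token price below is a placeholder, so check your provider's current pricing:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def history_tokens(history: list[dict]) -> int:
        # Rough count: tokenize each message's text (ignores per-message overhead).
        return sum(len(enc.encode(m["content"])) for m in history)

    PRICE_PER_1K_INPUT_TOKENS = 0.003  # placeholder rate, not a real price

    tokens = history_tokens(history)
    cost_per_call = tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
    print(f"{tokens} input tokens, roughly ${cost_per_call:.4f} per call, and it grows every turn")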

Strategy 1: Trimming (Drop the Old Stuff)

The bluntest solution: when the conversation gets too long, cut the oldest messages. First in, first out.

FIFO Trimming: Simple but Dangerous

M1, M2, M3 → dropped forever
M4, M5, M6, M7 → kept in context

The user mentioned they're allergic to shellfish in M2. Your food recommendation bot just suggested shrimp.

Trimming is fast and cheap. But critical information often lives in those early messages—user preferences, project context, key decisions. Once it's gone, it's gone.
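
A minimal FIFO trimmer: keep the system prompt, then drop the oldest turns until the history fits a token budget. This is only a sketch, reusing the hypothetical history_tokens() helper from earlier:

    def trim_history(history: list[dict], max_tokens: int) -> list[dict]:
        # Always keep the system prompt; drop the oldest turns after it (FIFO).
        system, rest = history[0], history[1:]
        while rest and history_tokens([system] + rest) > max_tokens:
            rest.pop(0)  # M1 goes first, then M2, then M3...
        return [system] + rest

    history = trim_history(history, max_tokens=3000)
    # Cheap and fast, but whatever lived in the dropped messages
    # (like that shellfish allergy) is gone for good.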

Strategy 2: Summarization (Compress, Don't Delete)

Instead of throwing away old messages, compress them. Use an LLM to create a summary of what happened.

Compression Without Total Loss

Full conversation history (~12,000 tokens) → Summarize → Key facts preserved (~400 tokens)

30x compression. But summaries are lossy—nuance gets lost. And generating them adds 300-800ms latency.
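
In code, summarization is a second LLM call that compresses the older turns into a single message. A sketch under the same assumptions as before (call_llm() and history_tokens() are the hypothetical helpers from earlier):

    def compact_history(history: list[dict], max_tokens: int, keep_recent: int = 6) -> list[dict]:
        if history_tokens(history) <= max_tokens:
            return history  # still fits, nothing to do

        system, old, recent = history[0], history[1:-keep_recent], history[-keep_recent:]
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)

        summary = call_llm([
            {"role": "system", "content": "Summarize this conversation. Preserve decisions, "
                                          "constraints, and user preferences."},
            {"role": "user", "content": transcript},
        ])

        # Older turns are replaced by one compact summary; recent turns stay verbatim.
        return [system,
                {"role": "system", "content": "Summary of earlier conversation: " + summary}] + recent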

Strategy 3: External Memory (Store It Somewhere Else)

Here's the key insight: memory doesn't have to live in the prompt. You can store information externally and pull it in only when relevant.

This is the shift from "remember everything" to "remember what matters right now."

A Framework for Thinking About Memory

Different types of information need different memory strategies:

The Memory Taxonomy

Short-term

  • Working: the current conversation. Lives in the prompt. Limited by the context window.
    Example: your chat history with Claude right now.

Long-term

  • Episodic: past sessions and their outcomes.
    Example: "Last week you tried Redis and it crashed."
  • Semantic: external knowledge such as docs, codebases, and wikis.
    Example: Cursor reading your codebase.
  • Procedural: how to do things; system prompts and rules.
    Example: "Always respond in JSON format."

Permanent

  • Parametric: baked into the model weights during training. Cannot be changed at runtime.
    Example: the model "knows" Python syntax and that Paris is in France.

In Practice

Cursor's knowledge of your codebase = semantic memory. Its ability to write TypeScript = parametric memory. Your current chat = working memory.
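
One way the taxonomy shows up in code: the final prompt is assembled from different stores, each playing a different memory role. A hypothetical sketch; retrieve_docs() and load_session_notes() are illustrative placeholders, not a real framework:

    def build_prompt(user_id: str, query: str, working_memory: list[dict]) -> list[dict]:
        return [
            # Procedural: how the assistant should behave (system prompt, rules).
            {"role": "system", "content": "You are a helpful coding assistant. Always cite file paths."},
            # Semantic: external knowledge retrieved for this query (docs, code, wikis).
            {"role": "system", "content": "Relevant docs:\n" + retrieve_docs(query)},
            # Episodic: notes saved from this user's past sessions.
            {"role": "system", "content": "Past sessions:\n" + load_session_notes(user_id)},
            # Working: the current conversation plus the new message.
            *working_memory,
            {"role": "user", "content": query},
        ]

    # Parametric memory never appears in the prompt: it's whatever the model
    # already learned during training.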

RAG: The Industry Standard

Retrieval-Augmented Generation is how most production systems handle long-term memory. Instead of stuffing everything in the prompt, you store documents externally and retrieve only what's relevant.

RAG: Retrieve → Augment → Generate

  1. User query comes in
  2. Embed the query
  3. Search the vector DB for the most relevant chunks
  4. Augment the prompt with those chunks and send it to the LLM
  5. Generate the answer

This is what powers Notion AI, Cursor, and every "chat with your docs" product.
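
A bare-bones version of that loop, with cosine similarity over stored embeddings. embed() stands in for whatever embedding model you use; numpy does the math, and call_llm() is the same hypothetical helper as before:

    import numpy as np

    def embed(text: str) -> np.ndarray:
        ...  # call your embedding model; returns a 1-D vector

    # Offline: embed and store your documents.
    docs = ["FastAPI supports OAuth2 with JWT...", "Rate limiting can be added via middleware..."]
    doc_vectors = np.array([embed(d) for d in docs])

    def answer(query: str, k: int = 3) -> str:
        q = embed(query)                                   # retrieve: embed the query...
        scores = doc_vectors @ q / (
            np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q)
        )                                                  # ...and rank documents by cosine similarity
        top_docs = [docs[i] for i in np.argsort(scores)[::-1][:k]]

        return call_llm([                                  # augment + generate
            {"role": "system", "content": "Answer using only this context:\n" + "\n---\n".join(top_docs)},
            {"role": "user", "content": query},
        ])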

When Memory Systems Fail

Memory bugs are subtle. The system doesn't crash—it just gives confident wrong answers.

The Stale Memory Problem

Data stored in memory six months ago gets retrieved today, and the LLM confidently answers with outdated information.

Fix: Add recency scoring. Decay old memories. Re-index regularly.
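
Recency scoring can be as simple as multiplying similarity by an exponential decay on the memory's age. A sketch; the half-life is an arbitrary example value:

    def recency_score(similarity: float, age_days: float, half_life_days: float = 30.0) -> float:
        # The score halves every `half_life_days`, so old facts fade
        # instead of outranking fresh ones forever.
        return similarity * 0.5 ** (age_days / half_life_days)

    recency_score(0.90, age_days=180)  # six-month-old memory: heavily discounted (~0.014)
    recency_score(0.75, age_days=2)    # fresher, slightly less similar memory now wins (~0.72)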

The Leaked Context Problem

User A and User B share one memory store with no isolation, so User A's retrieval can surface User B's data.

Fix: Namespace by user ID. Filter at retrieval time, not after generation.
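
In code, the fix is to make the user ID part of every read, not an afterthought. A sketch against a hypothetical vector-store client (store.add() and store.search() are illustrative, not a specific product's API), reusing the embed() helper from the RAG example:

    def save_memory(store, user_id: str, text: str) -> None:
        # Every record carries its owner.
        store.add(vector=embed(text), metadata={"user_id": user_id, "text": text})

    def recall(store, user_id: str, query: str, k: int = 5):
        # Filter at retrieval time: User A's query can never touch User B's records.
        return store.search(vector=embed(query), filter={"user_id": user_id}, top_k=k)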

The Polluted Retrieval Problem

A query pulls back one relevant chunk along with several noisy ones, and the LLM gets confused by the noise.

Fix: Better chunking. Add reranking. Reduce K. Quality over quantity.

The Tradeoffs (No Free Lunch)

Strategy        Speed    Cost    Accuracy  Best for
Full History    Fast     High    Best      Short sessions
Trimming        Fast     Low     Low       Casual chat
Summarization   Medium   Medium  Medium    Long conversations
RAG             Medium   Low     High      Knowledge bases
Hybrid          Slow     High    Best      Production

Decision Framework

One way to think about choosing a memory strategy:

How long are your sessions?

  • Short → try full history
  • Medium → consider summarization
  • Long → explore hybrid/RAG

Do you have external docs?

  • No → conversation memory only
  • Yes → add a RAG layer
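
If you want a starting default to argue with, the same framework fits in one (deliberately simplistic) function; the token thresholds are arbitrary examples, not recommendations:

    def pick_memory_strategy(avg_session_tokens: int, has_external_docs: bool) -> str:
        if avg_session_tokens < 4_000:        # short sessions
            strategy = "full history"
        elif avg_session_tokens < 20_000:     # medium sessions
            strategy = "summarization"
        else:                                 # long sessions
            strategy = "hybrid (summarization + retrieval)"
        return f"{strategy} + RAG layer" if has_external_docs else strategy
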
What To Do Monday

Starting a new project?

Begin with full conversation history. Measure when it breaks. Add complexity only when data shows you need it.

Step 1: Just append messages

Users complaining about "forgetting"?

Add logging first. Track conversation lengths and failure points. Data tells you what to fix.

Step 1: Log before optimizing

Building RAG?

Start simple: basic chunking + vector search. Measure retrieval quality before adding complexity.

Most RAG failures = retrieval failures

Cheat Sheet: LLM Memory

The Problem

  • LLMs are stateless — each call is independent
  • No built-in memory between API calls
  • Context windows have token limits

Memory Strategies

  • Full History: Send everything (expensive, simple)
  • Trimming: Drop old messages (lossy)
  • Summarization: Compress history (balanced)
  • RAG: Retrieve relevant context (scalable)

Memory Types

  • Working: Current conversation
  • Episodic: Past sessions
  • Semantic: External knowledge
  • Procedural: How-to instructions

Common Failures

  • Stale Memory: Uses outdated info
  • Leaked Context: User A gets User B's data
  • Polluted Retrieval: Too much noise in chunks

Starting Point

  • Short chats → Try full history first
  • Long chats → Consider summarization
  • Very long → Explore hybrid/RAG
  • External docs? → Add RAG

Coming in Part 2: Advanced Memory Patterns

Forgetting curves, multi-agent memory sharing, graph-based RAG, and memory that learns from itself.

وَاللَّهُ أَعْلَمُ

And Allah knows best

وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ

May Allah's peace and blessings be upon our master Muhammad and his family
