بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ
In the name of Allah, the Most Gracious, the Most Merciful
You're building a coding assistant. Users love it in testing. Then you ship.
A developer asks: "Refactor the auth module." Your assistant nails it. Same developer, 10 minutes later: "Now add rate limiting to what we just built." Your assistant responds: "I'd be happy to help! Could you share the code you'd like me to add rate limiting to?"
It forgot. Everything. The entire refactoring session—gone.
This isn't a bug. It's how LLMs work. And it's the first problem every production AI system has to solve.
- LLMs are stateless — every API call starts fresh, no memory of previous calls
- "Memory" = sending conversation history — you literally paste everything each time
- Context windows have limits — when you exceed them, you need strategies: trimming, summarization, or RAG
Want the full story? Keep reading.
This post is for you if:
- You're building an LLM-powered app and conversations feel "broken"
- You're confused why ChatGPT "remembers" but your API calls don't
- You're hitting context limits and don't know how to handle them
The Uncomfortable Truth: LLMs Have Amnesia
Every time you call an LLM API, you're talking to a stranger. The model doesn't know you messaged it 5 seconds ago. It doesn't know you messaged it yesterday. Each API call exists in complete isolation.
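A minimal sketch using the OpenAI Python SDK (the model name is a placeholder; any chat API behaves the same way):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Call 1: tell the model something.
client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any chat model behaves the same way
    messages=[{"role": "user", "content": "My name is Sara. Please remember it."}],
)

# Call 2: a completely fresh request. Call 1 might as well never have happened.
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's my name?"}],
)
print(reply.choices[0].message.content)  # the model can only guess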
The LLM has no idea these two calls are related. To it, they're from two different users on two different planets.
When you use ChatGPT or Claude, it seems like they remember. But that's not the LLM—that's an entire memory system built on top. The LLM is the engine; memory is a separate module that most people never see.
If you run Llama locally and chat with it, you'll see this firsthand. No memory system = every message is message #1.
The Obvious Fix: Just Send Everything
The simplest solution? Send the entire conversation every time. The LLM doesn't remember, so you remind it—over and over.
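In code, "memory" is just a list you keep appending to and replaying. A minimal sketch, with the same placeholder client and model as above:

```python
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a helpful coding assistant."}]

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    # Every call resends the ENTIRE history so far.
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    answer = reply.choices[0].message.content
    # The assistant's replies must be appended too, or it will
    # forget its own answers on the next turn.
    history.append({"role": "assistant", "content": answer})
    return answer
```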
The LLM doesn't "remember"—you're literally pasting the entire conversation into every API call.
This works beautifully... until it doesn't.
The Wall: Context Windows Have Limits
LLMs can only process so much text at once. This limit is called the context window—and when you hit it, everything breaks.
The same conversation fills a very different fraction of the window depending on the model's context window size. Check your model's documentation for exact limits.
But bigger context windows aren't free. More tokens means:
- Higher costs — you pay per token, input and output
- Slower responses — more to process, more latency
- Worse focus — models struggle to attend to everything equally
Token costs add up fast. Long conversations with many users can quickly become expensive—especially when you're resending the entire history with every message.
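Back-of-envelope arithmetic makes the problem concrete: because each turn resends everything said so far, total input tokens grow quadratically with conversation length. The numbers below are placeholders, not real prices; check your provider's pricing.

```python
TOKENS_PER_MESSAGE = 200         # assumption: average message size
PRICE_PER_1M_INPUT_TOKENS = 1.0  # placeholder USD price, not a real quote

def total_input_tokens(turns: int) -> int:
    # Simplification: turn k resends ~k messages (the whole history so far),
    # so a conversation of N turns sends 1 + 2 + ... + N messages in total.
    return (turns * (turns + 1) // 2) * TOKENS_PER_MESSAGE

for turns in (10, 50, 100):
    tokens = total_input_tokens(turns)
    print(f"{turns} turns -> {tokens:,} input tokens "
          f"(~${tokens / 1e6 * PRICE_PER_1M_INPUT_TOKENS:.2f} per conversation)")
```

At these assumed numbers, a single 100-turn conversation sends about a million input tokens. Multiply by thousands of users and the quadratic growth is hard to ignore.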
Strategy 1: Trimming (Drop the Old Stuff)
The bluntest solution: when the conversation gets too long, cut the oldest messages. First in, first out.
The failure mode: the user mentioned a shellfish allergy early in the conversation, trimming eventually dropped that message, and your food recommendation bot just suggested shrimp.
Trimming is fast and cheap. But critical information often lives in those early messages—user preferences, project context, key decisions. Once it's gone, it's gone.
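A minimal FIFO trimmer; count_tokens() is a hypothetical stand-in (use your model's real tokenizer, e.g. tiktoken for OpenAI models):

```python
def count_tokens(message: dict) -> int:
    # Hypothetical stand-in: ~4 characters per token is a rough heuristic.
    return len(message["content"]) // 4

def trim_history(history: list[dict], max_tokens: int) -> list[dict]:
    system, rest = history[0], history[1:]
    # Drop the oldest non-system messages until the budget fits.
    while rest and count_tokens(system) + sum(map(count_tokens, rest)) > max_tokens:
        rest.pop(0)  # first in, first out; early context is gone for good
    return [system, *rest]
```

Pinning the system prompt is the one refinement almost everyone adds. The shellfish example shows why pinning alone isn't enough.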
Strategy 2: Summarization (Compress, Don't Delete)
Instead of throwing away old messages, compress them. Use an LLM to create a summary of what happened.
Summarization can buy large compression (30x is achievable). But summaries are lossy: nuance gets lost. And generating them adds an extra LLM call, typically 300-800ms of latency.
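One common shape, sketched with the same placeholder client and model: summarize everything except the last few messages, then splice the summary back in as a single message.

```python
def compact_history(history: list[dict], keep_recent: int = 6) -> list[dict]:
    system, old, recent = history[0], history[1:-keep_recent], history[-keep_recent:]
    if not old:
        return history  # nothing worth compressing yet
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user",
                   "content": "Summarize this conversation. Keep decisions, "
                              "user preferences, and constraints:\n" + transcript}],
    ).choices[0].message.content
    # Old turns collapse into one message; recent turns stay verbatim.
    return [system,
            {"role": "system", "content": "Summary of earlier conversation: " + summary},
            *recent]
```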
Strategy 3: External Memory (Store It Somewhere Else)
Here's the key insight: memory doesn't have to live in the prompt. You can store information externally and pull it in only when relevant.
This is the shift from "remember everything" to "remember what matters right now."
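The simplest external memory isn't even RAG: a plain store of durable facts per user, injected only when present. A toy sketch (how facts get extracted is hand-waved here; real systems use an LLM or rules for that step):

```python
user_facts: dict[str, list[str]] = {}  # user_id -> durable facts

def remember(user_id: str, fact: str) -> None:
    user_facts.setdefault(user_id, []).append(fact)

def build_messages(user_id: str, recent: list[dict]) -> list[dict]:
    facts = user_facts.get(user_id)
    if not facts:
        return list(recent)
    note = {"role": "system", "content": "Known user facts: " + "; ".join(facts)}
    return [note, *recent]

# The shellfish allergy now survives any amount of trimming:
remember("user_42", "Allergic to shellfish")
```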
A Framework for Thinking About Memory
Different types of information need different memory strategies:
- Working memory: the current conversation. Lives in the prompt. Limited by the context window.
- Episodic memory: past sessions and their outcomes.
- Semantic memory: external knowledge such as docs, codebases, and wikis.
- Procedural memory: how to do things. System prompts.
- Parametric memory: baked into the model weights during training. Cannot be changed at runtime.
Cursor's knowledge of your codebase = semantic memory. Its ability to write TypeScript = parametric memory. Your current chat = working memory.
RAG: The Industry Standard
Retrieval-Augmented Generation is how most production systems handle long-term memory. Instead of stuffing everything in the prompt, you store documents externally and retrieve only what's relevant.
This is what powers Notion AI, Cursor, and every "chat with your docs" product.
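A minimal retrieval loop, assuming OpenAI's embeddings endpoint and numpy. The in-memory "vector store" and the example chunks are deliberately naive stand-ins for a real store (pgvector, Pinecone, and similar):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

chunks = [
    "Rate limiting: each API key is allowed 100 requests per minute.",
    "Authentication uses JWTs that expire after 15 minutes.",
]
vectors = [embed(c) for c in chunks]  # naive in-memory vector store

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    # Cosine similarity between the query and every stored chunk.
    sims = [float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v)) for v in vectors]
    best = sorted(range(len(chunks)), key=lambda i: sims[i], reverse=True)[:k]
    return [chunks[i] for i in best]

# Only the retrieved chunk enters the prompt, however large the corpus is:
context = "\n".join(retrieve("How long do auth tokens last?"))
```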
Memory bugs are subtle. The system doesn't crash—it just gives confident wrong answers.
The Stale Memory Problem
Stored information goes out of date: the user changes a preference, a fact stops being true, but memory still serves the old version and the model answers confidently from it.

The Leaked Context Problem
Retrieval isn't scoped per user, so User A's stored context surfaces in User B's answers. That's a privacy failure, not just a quality bug.

The Polluted Retrieval Problem
Retrieval returns chunks that are loosely related but irrelevant, and the noise crowds out the one chunk that mattered. The model anchors on the wrong context.
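Leaked context in particular is usually a missing filter rather than a model problem. A toy sketch of the fix: tag every chunk with its tenant at write time and hard-filter before ranking (the dataclass store is illustrative; real vector databases expose metadata filters for exactly this):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    user_id: str  # tenant tag attached when the chunk is written

store: list[Chunk] = []

def retrieve_for_user(query: str, user_id: str) -> list[str]:
    # Hard tenant filter BEFORE any similarity ranking; scoring alone
    # will happily surface another user's chunks if you let it.
    candidates = [c for c in store if c.user_id == user_id]
    return [c.text for c in candidates]  # ranking omitted for brevity
```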
One way to think about choosing a memory strategy:
Starting a new project?
Begin with full conversation history. Measure when it breaks. Add complexity only when data shows you need it.
Users complaining about "forgetting"?
Add logging first. Track conversation lengths and failure points; the data tells you what to fix (see the sketch after this list).
Building RAG?
Start simple: basic chunking + vector search. Measure retrieval quality before adding complexity.
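For the logging suggestion above, a minimal sketch; the fields are illustrative, not a standard schema:

```python
import json
import logging
import time

log = logging.getLogger("memory")

def log_turn(conversation_id: str, history: list[dict], context_limit: int) -> None:
    approx_tokens = sum(len(m["content"]) // 4 for m in history)  # rough estimate
    log.info(json.dumps({
        "ts": time.time(),
        "conversation_id": conversation_id,
        "turns": len(history),
        "approx_tokens": approx_tokens,
        "pct_of_context": round(100 * approx_tokens / context_limit, 1),
    }))
```

Plotting pct_of_context against the turns where users report "forgetting" usually makes the failure threshold obvious.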
The Problem
- LLMs are stateless — each call is independent
- No built-in memory between API calls
- Context windows have token limits
Memory Strategies
- Full History: Send everything (expensive, simple)
- Trimming: Drop old messages (lossy)
- Summarization: Compress history (balanced)
- RAG: Retrieve relevant context (scalable)
Memory Types
- Working: Current conversation
- Episodic: Past sessions
- Semantic: External knowledge
- Procedural: How-to instructions
Common Failures
- Stale Memory: Uses outdated info
- Leaked Context: User A gets User B's data
- Polluted Retrieval: Too much noise in chunks
Starting Point
- Short chats → Try full history first
- Long chats → Consider summarization
- Very long → Explore hybrid/RAG
- External docs? → Add RAG
وَاللَّهُ أَعْلَمُ
And Allah knows best
وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ
May Allah's peace and blessings be upon our master Muhammad and his family