بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ
In the name of Allah, the Most Gracious, the Most Merciful
You're building a coding assistant. Users love it in testing. Then you ship.
A developer asks: "Refactor the auth module." Your assistant nails it. Same developer, 10 minutes later: "Now add rate limiting to what we just built." Your assistant responds: "I'd be happy to help! Could you share the code you'd like me to add rate limiting to?"
It forgot. Everything. The entire refactoring session—gone.
This isn't a bug. It's how LLMs work. And it's the first problem every production AI system has to solve.
- LLMs are stateless — every API call starts fresh, no memory of previous calls
- "Memory" = sending conversation history — you literally paste everything each time
- Context windows have limits — when you exceed them, you need strategies: trimming, summarization, or RAG
Want the full story? Keep reading.
This post is for you if:
- You're building an LLM-powered app and conversations feel "broken"
- You're confused why ChatGPT "remembers" but your API calls don't
- You're hitting context limits and don't know how to handle them
The Uncomfortable Truth: LLMs Have Amnesia
Every time you call an LLM API, you're talking to a stranger. The model doesn't know you messaged it 5 seconds ago. It doesn't know you messaged it yesterday. Each API call exists in complete isolation.
The LLM has no idea these two calls are related. To it, they're from two different users on two different planets.
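You can verify this in a dozen lines. A minimal sketch using the OpenAI SDK (the model's exact reply will vary):

```python
from openai import OpenAI

client = OpenAI()

# Call 1: tell the model a fact
client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "My name is Sara. Remember it."}],
)

# Call 2: a brand-new request. The model has no record of call 1.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's my name?"}],
)
print(response.choices[0].message.content)
# Typically something like: "I don't have access to your name..."
```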
When you use ChatGPT or Claude, it seems like they remember. But that's not the LLM—that's an entire memory system built on top. The LLM is the engine; memory is a separate module that most people never see.
If you run Llama locally and chat with it, you'll see this firsthand. No memory system = every message is message #1.
The Obvious Fix: Just Send Everything
The simplest solution? Send the entire conversation every time. The LLM doesn't remember, so you remind it — over and over.
```python
from openai import OpenAI

client = OpenAI()

# This is "memory" — a list you manage yourself
messages = [
    {"role": "system", "content": "You are a coding assistant."},
]

def chat(user_message):
    messages.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages  # Send EVERYTHING every time
    )
    assistant_msg = response.choices[0].message.content
    messages.append({"role": "assistant", "content": assistant_msg})
    return assistant_msg

# Each call sends the full history:
chat("I'm building a REST API in Python")  # 3 messages sent
chat("Let's use FastAPI")                  # 5 messages sent
chat("Add authentication to it")           # 7 messages sent
chat("Now add rate limiting")              # 9 messages sent... growing!
```
This works beautifully for short conversations. The LLM sees the full context, so its responses are coherent and relevant. But notice what's actually happening: the model doesn't "remember" anything — you're literally pasting the entire conversation into every API call. And that works... until it doesn't.
The Wall: Context Windows Have Limits
LLMs can only process so much text at once. This limit is called the context window—and when you hit it, everything breaks.
A given conversation fills a different fraction of the window depending on the model. Check your model's documentation for exact limits.
But bigger context windows aren't free. More tokens means:
- Higher costs — you pay per token, input and output
- Slower responses — more to process, more latency
- Worse focus — models struggle to attend to everything equally
A 50-message conversation accumulates ~15,000 tokens of history. At GPT-4o pricing ($2.50/M input tokens), the messages near the end of that conversation cost ~$0.04 each in input tokens. Even lighter usage adds up: 10,000 active users at 20 messages/day, each call resending ~1,500-2,000 tokens of history, lands around $800/day in input tokens alone. And it gets worse with every message — message #50 resends all 49 previous messages. For scale, here are current context window limits:
- GPT-4o: 128K tokens (~300 pages of text)
- Claude Sonnet/Opus: 200K tokens (~500 pages)
- Gemini 1.5 Pro: 2M tokens (~5,000 pages)
- Llama 3: 8K-128K tokens depending on variant
Bigger sounds better, but attention quality degrades with length. A model with 200K context doesn't attend equally to token #1 and token #199,999. Research shows "lost in the middle" — models focus on the beginning and end, often missing information in the middle of long prompts.
Strategy 1: Trimming (Drop the Old Stuff)
The bluntest solution: when the conversation gets too long, cut the oldest messages. First in, first out.
The user mentioned they're allergic to shellfish in message #2. Your food recommendation bot just suggested shrimp.
Trimming is fast and cheap. But critical information often lives in those early messages — user preferences, project context, key decisions. Once it's gone, it's gone.
```python
import tiktoken

def trim_messages(messages, max_tokens=4000, model="gpt-4o"):
    """Keep the system prompt + most recent messages that fit."""
    enc = tiktoken.encoding_for_model(model)

    # Always keep the system prompt
    system_msg = messages[0]
    system_tokens = len(enc.encode(system_msg["content"]))

    # Add messages from newest to oldest until we hit the limit
    trimmed = []
    token_count = system_tokens
    for msg in reversed(messages[1:]):
        msg_tokens = len(enc.encode(msg["content"]))
        if token_count + msg_tokens > max_tokens:
            break
        trimmed.insert(0, msg)
        token_count += msg_tokens

    return [system_msg] + trimmed

# Usage: trim before every API call
trimmed = trim_messages(messages, max_tokens=4000)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=trimmed
)
```
Basic FIFO trimming is naive. Better approaches:
- Keep system prompt + first user message + recent messages. The first message often contains crucial context ("I'm building a food delivery app").
- Keep pairs together. Never drop a user message while keeping its assistant response — the model gets confused by orphaned messages.
- Importance-weighted trimming. Mark certain messages as "pinned" (e.g., messages where the user stated preferences or made decisions). Trim unpinned messages first (see the sketch after this list).
- Sliding window with overlap. Keep the last N messages but also the first 2-3 messages. Simple and usually good enough.
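Here is a minimal sketch combining ideas 1, 3, and 4: a sliding window that also keeps the first messages and anything pinned. It assumes you tag important messages with an extra "pinned" key in your own bookkeeping (strip that key before the actual API call):

```python
import tiktoken

def trim_with_pins(messages, max_tokens=4000, keep_first=2, model="gpt-4o"):
    """Sliding window that always keeps the system prompt, the first
    few messages, and any message marked as pinned."""
    enc = tiktoken.encoding_for_model(model)

    def tokens(msg):
        return len(enc.encode(msg["content"]))

    system = messages[0]
    head = messages[1:1 + keep_first]  # early project context
    tail = messages[1 + keep_first:]
    pinned = [m for m in tail if m.get("pinned")]
    rest = [m for m in tail if not m.get("pinned")]

    # Fill the remaining budget with the newest unpinned messages
    budget = max_tokens - sum(tokens(m) for m in [system, *head, *pinned])
    recent, used = [], 0
    for msg in reversed(rest):
        if used + tokens(msg) > budget:
            break
        recent.insert(0, msg)
        used += tokens(msg)

    # Note: pinned messages may land slightly out of chronological order
    # relative to the recent window; acceptable for a sketch
    return [system, *head, *pinned, *recent]
```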
The Moment Trimming Betrays You
Trimming works until it silently deletes the one message that mattered. Here is a real conversation where trimming creates a dangerous failure. Watch what happens when message 3 gets dropped:
```python
# Message 1 (user):
#   "I'm building a medical triage chatbot for an ER."
# Message 2 (assistant):
#   "I can help. What symptoms should it handle?"
# Message 3 (user):  ← THIS IS CRITICAL
#   "Important constraint: NEVER suggest the patient is fine to go home.
#    Always recommend they wait for a doctor."
# Message 4 (assistant):
#   "Understood. I'll always err on the side of caution."
# Messages 5-20: Building out triage logic, testing scenarios...
# Message 21 (user):
#   "Patient reports mild headache, no other symptoms. What should the bot say?"
```
With a window of 10 messages, the trimmer drops messages 1-11. Message 3 — the safety constraint — is gone. The assistant now responds: "Based on the symptoms, this sounds like a tension headache. The patient could likely go home and take ibuprofen." The one constraint the user explicitly set has been silently deleted.
This is why naive trimming is dangerous for any use case where early messages contain constraints, preferences, or safety rules. You have two options: pin critical messages so they never get trimmed, or graduate to summarization.
Tokens are not words. They are sub-word units that the model's tokenizer produces. On average, 1 token is about 0.75 English words, or 4 characters. But this varies by language and content type:
- English prose: ~1.3 tokens per word
- Python code: ~1.5 tokens per word (more special characters)
- Arabic text: ~2-3x more tokens than English for the same meaning
- JSON/structured data: ~2x tokens vs. plain text (brackets, quotes, colons)
```python
import tiktoken

# Different models use different tokenizers
enc_gpt4 = tiktoken.encoding_for_model("gpt-4o")
enc_gpt35 = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "Refactor the authentication module to use JWT tokens"

# Same text, different token counts
gpt4_tokens = len(enc_gpt4.encode(text))    # 9 tokens
gpt35_tokens = len(enc_gpt35.encode(text))  # 10 tokens

# For a full messages array (OpenAI format), count overhead too
def count_messages_tokens(messages, model="gpt-4o"):
    """Count tokens including per-message overhead."""
    enc = tiktoken.encoding_for_model(model)
    tokens = 0
    for msg in messages:
        tokens += 4  # every message has overhead: role, content markers
        for key, value in msg.items():
            tokens += len(enc.encode(value))
    tokens += 2  # priming overhead for assistant reply
    return tokens

# A 20-message conversation:    ~3,200 tokens
# A 50-message conversation:    ~15,000 tokens
# A 100-message coding session: ~45,000 tokens
```
Why this matters for trimming: You cannot trim by message count alone. A message containing a code block might be 2,000 tokens. A message saying "yes" is 1 token. Token-based trimming is the only reliable approach.
For Anthropic models: Claude uses a different tokenizer. The anthropic Python SDK provides client.count_tokens() to get accurate counts. Do not use tiktoken for Claude — the counts will be wrong by 10-20%.
Use trimming when:
- Sessions are short (under 20 messages)
- Cost is your primary concern
- Early messages are truly disposable (casual chat)
- You have a "sliding window + first message" strategy

Avoid trimming when:
- User preferences are stated early (allergies, constraints)
- Safety-critical rules are in early messages
- You need to reference past decisions ("why did we choose X?")
- Users expect the assistant to "remember" everything
Strategy 2: Summarization (Compress, Don't Delete)
Instead of throwing away old messages, compress them. Use an LLM to create a summary of what happened.
Typical result: around 30x compression. But summaries are lossy—nuance gets lost. And generating them adds 300-800ms of latency.
```python
import tiktoken

# reuses the OpenAI client defined earlier

def summarize_and_compress(messages, max_tokens=4000):
    """Summarize old messages, keep recent ones verbatim."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    total = sum(len(enc.encode(m["content"])) for m in messages)
    if total <= max_tokens:
        return messages  # No need to summarize yet

    # Keep system prompt + last 6 messages verbatim
    system = messages[0]
    recent = messages[-6:]
    old = messages[1:-6]

    # Summarize the old messages
    summary = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for summarization
        messages=[{
            "role": "system",
            "content": """Summarize this conversation. Preserve:
- User's name, preferences, and stated requirements
- Key decisions made and their reasoning
- Technical context (language, framework, architecture)
- Any constraints or deadlines mentioned
Be concise but don't lose critical facts."""
        }] + old
    ).choices[0].message.content

    # Return: system + summary + recent messages
    return [
        system,
        {"role": "system", "content": f"Previous conversation summary: {summary}"},
        *recent,
    ]
```
The prompt you use for summarization determines what survives. A vague prompt produces vague summaries. Here is a production-grade prompt and what it does to a real conversation:
"""You are a conversation summarizer for a coding assistant.
Produce a structured summary preserving EXACTLY these fields:
1. USER IDENTITY: Name, role, timezone (if mentioned)
2. PROJECT CONTEXT: Language, framework, architecture decisions
3. CONSTRAINTS: Any "must", "never", "always" rules the user stated
4. DECISIONS MADE: What was decided, with reasoning
5. CURRENT STATE: What has been built so far
6. OPEN QUESTIONS: What is unresolved
Format as a flat list. Be specific. Use the user's exact words
for constraints. Do not editorialize or add your own opinions.
Maximum 400 tokens."""
Before summarization (12,400 tokens):
20 messages about building a FastAPI app with JWT auth, including code snippets, debugging a CORS issue, deciding on PostgreSQL over MongoDB, and a constraint that all endpoints must return JSON:API format.
After summarization (380 tokens):
USER IDENTITY: Backend developer, building for a startup
PROJECT CONTEXT: Python 3.12, FastAPI, PostgreSQL, deployed on Railway
CONSTRAINTS:
- All endpoints MUST return JSON:API format
- No ORM - raw SQL with asyncpg
- Auth tokens expire after 15 minutes
DECISIONS MADE:
- PostgreSQL over MongoDB (need relational joins for reporting)
- JWT with refresh tokens (not session-based)
- CORS allows only app.example.com
CURRENT STATE:
- /auth/login and /auth/refresh endpoints working
- User model with bcrypt password hashing
- CORS middleware configured
OPEN QUESTIONS:
- Rate limiting strategy not yet decided
- File upload endpoint design pending
32x compression. The key constraint about JSON:API format — which trimming would have silently deleted — is explicitly preserved. The cost of this summarization call with GPT-4o-mini: $0.0019.
Rolling summarization: After every N messages, summarize the oldest batch and prepend the summary. Simple but summaries stack up and drift over time. The summary of a summary of a summary loses nuance.
Hierarchical summarization: Maintain multiple levels. Level 1: raw messages (last 10). Level 2: summary of messages 11-50. Level 3: summary of the entire session. Each level compresses more aggressively. More complex but preserves both recent detail and long-term context.
Single evolving summary: Maintain one summary that gets updated after each conversation turn instead of re-summarized from scratch. Prompt: "Given this existing summary and the new exchange, update the summary." Most token-efficient but can drift from reality over many updates.
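A sketch of the update step for the single evolving summary, reusing the client from earlier (the prompt wording is illustrative):

```python
def update_summary(existing_summary, new_exchange):
    """Fold the latest user/assistant exchange into one running summary."""
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "You maintain a running summary of a conversation. "
                "Given the existing summary and the newest exchange, return "
                "an updated summary. Keep all constraints and decisions. "
                "Maximum 400 tokens."
            )},
            {"role": "user", "content": (
                f"EXISTING SUMMARY:\n{existing_summary}\n\n"
                f"NEW EXCHANGE:\n{new_exchange}"
            )},
        ],
    ).choices[0].message.content
```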
Cost comparison: Rolling adds ~$0.001 per summarization call with GPT-4o-mini. For a 100-message conversation, that's ~$0.02 total. Compare to sending the full history at message #100 which costs ~$0.08 per message. Summarization pays for itself after ~4 messages.
Full history is perfect until the context window fills up. Trimming is fast but loses critical info. Summarization preserves intent but costs an extra API call and adds 300-800ms latency. Each strategy is the right answer for a different scenario — and most production systems combine them.
Use summarization when:
- Conversations often exceed 30+ messages
- You need to preserve intent and decisions
- Cost savings matter (cheaper than full history)
- You can tolerate 300-800ms extra latency

Avoid summarization when:
- Exact quotes matter (legal, compliance, support tickets)
- Nuance is critical ("slightly prefer X" vs "want X")
- Latency is ultra-critical (under 100ms requirement)
- Audit trails require verbatim history
External Memory: Store It Somewhere Else
Here is the key insight: memory does not have to live in the prompt. You can store information externally and pull it in only when relevant.
This is the shift from "remember everything" to "remember what matters right now."
A Framework for Thinking About Memory
Memory in LLMs maps to 5 categories. The first two (parametric and procedural) are baked into the model — you can't change them at runtime. The three that matter for developers are: working memory (context window), episodic memory (conversation history), and semantic memory (knowledge bases/RAG).
Cursor's knowledge of your codebase = semantic memory. Its ability to write TypeScript = parametric memory. Your current chat = working memory.
For completeness, here is the full 5-type memory taxonomy:

- Working memory: The current conversation. Lives in the prompt. Limited by the context window.
- Episodic memory: Past sessions and their outcomes.
- Semantic memory: External knowledge—docs, codebases, wikis.
- Procedural memory: How to do things. System prompts.
- Parametric memory: Baked into model weights during training. Cannot be changed at runtime.
Parametric memory is information baked into the model's weights during training. It is why GPT-4 "knows" that Python uses indentation for blocks, or that Paris is the capital of France. You cannot add to it, remove from it, or update it at runtime.
This matters practically because:
- Knowledge cutoff: Parametric memory stops at the training date. If your company launched a new product after the cutoff, the model does not know about it. This is the primary reason RAG exists — to inject current knowledge.
- Hallucination risk: When the model's parametric memory is wrong or outdated, it confidently states incorrect facts. Fine-tuning can partially address this, but it is expensive ($5-50+ per training run) and introduces its own failure modes.
- Domain-specific gaps: General-purpose models have shallow parametric memory for niche domains. A medical coding standard, your company's internal API, a proprietary protocol — the model has no parametric memory of these. You must supply them via RAG or context.
Procedural memory in LLM systems is typically implemented as system prompts — instructions that tell the model how to behave rather than what to know. Examples:
- "Always respond in JSON format"
- "You are a customer support agent for Acme Corp. Never discuss competitor products."
- "When the user asks about pricing, redirect to the sales team."
Procedural memory is the most stable form of memory because it lives at the start of the messages array and never gets trimmed (a well-designed system always preserves the system prompt). However, system prompts are also the most token-expensive form of memory: a detailed 1,000-token system prompt is resent, and billed, on every single API call.
Practical tip: Keep system prompts under 500 tokens for high-volume applications. Move detailed behavioral rules into a separate document and inject them via RAG only when relevant. A system prompt that says "Follow the rules in the provided context" plus a RAG-retrieved rulebook is cheaper than a 2,000-token system prompt on every call.
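A sketch of that pattern, where retrieve_rules is a hypothetical helper standing in for the RAG retrieval covered in the next section:

```python
SLIM_SYSTEM_PROMPT = (
    "You are a support assistant for Acme Corp. "
    "Follow the behavioral rules provided in the context block."
)

def build_messages(user_query):
    # retrieve_rules() is a hypothetical RAG lookup over your rulebook
    # document; only the rules relevant to this query get injected
    rules = retrieve_rules(user_query, top_k=3)
    return [
        {"role": "system", "content": SLIM_SYSTEM_PROMPT},
        {"role": "system", "content": "Rules for this query:\n" + "\n".join(rules)},
        {"role": "user", "content": user_query},
    ]
```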
Strategy 3: RAG (The Industry Standard)
Retrieval-Augmented Generation is how most production systems handle long-term memory. Instead of stuffing everything in the prompt, you store documents externally and retrieve only what's relevant.
If context window management is "remembering what happened in this conversation," RAG is "knowing things that were never part of this conversation." Your company's documentation, your codebase, your customer's order history — these all live outside the conversation and need to be retrieved on demand.
Cursor indexes your codebase with RAG. Notion AI uses RAG over your workspace. ChatGPT's "memory" feature stores facts in a retrieval layer. Every "chat with your docs" product is RAG under the hood. Understanding how it works is not optional if you are building with LLMs.
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

# Step 1: INGEST - Chunk your documents and embed them
def embed_text(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def chunk_document(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# Step 2: RETRIEVE - Find the most relevant chunks
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query, knowledge_base, top_k=3):
    query_embedding = embed_text(query)
    scored = [
        (chunk, cosine_similarity(query_embedding, emb))
        for chunk, emb in knowledge_base
    ]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, score in scored[:top_k]]

# Step 3: GENERATE - Ask the LLM with retrieved context
def rag_query(question, knowledge_base):
    relevant_chunks = retrieve(question, knowledge_base)
    context = "\n\n".join(relevant_chunks)
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""Answer based on this context:

{context}

If the context doesn't contain the answer, say so."""},
            {"role": "user", "content": question}
        ]
    ).choices[0].message.content
```
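Tying the three steps together looks something like this (the file path and query are illustrative):

```python
# Build the knowledge base once: chunk the doc, embed each chunk
document = open("docs/deployment_guide.md").read()  # illustrative path
knowledge_base = [
    (chunk, embed_text(chunk)) for chunk in chunk_document(document)
]

# Then query it as often as you like
answer = rag_query("How do I configure the staging environment?", knowledge_base)
print(answer)
```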
Step 1: Chunking — How You Split Documents Changes Everything
Before you can search your documents, you need to split them into chunks small enough for embedding. This sounds trivial — just split every 500 words, right? But chunking strategy is the single biggest lever you have for RAG quality. Bad chunking produces bad retrieval, and no amount of fancy reranking fixes it.
Fixed-size chunking
Split every N tokens, with an overlap to avoid losing context at boundaries. Simple, fast, and often good enough for a first pass.
```python
def fixed_chunk(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks with overlap."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start = end - overlap  # overlap prevents cutting mid-thought
    return chunks

# A 10,000-word doc with 500-word chunks + 50 overlap = ~22 chunks
# Pro: Fast, predictable chunk sizes
# Con: Splits mid-sentence, mid-paragraph, mid-thought
```
Semantic chunking
Split at natural boundaries — paragraph breaks, section headers, or even sentence boundaries. Preserves meaning within each chunk, but chunk sizes vary wildly (some paragraphs are 2 sentences, others are 20).
```python
import re

def semantic_chunk(text, max_tokens=800):
    """Split at paragraph boundaries, merge small paragraphs."""
    paragraphs = re.split(r'\n\n+', text)
    chunks = []
    current = []
    current_size = 0
    for para in paragraphs:
        para_tokens = len(para.split()) * 1.3  # rough token estimate
        if current_size + para_tokens > max_tokens and current:
            chunks.append("\n\n".join(current))
            current = [para]
            current_size = para_tokens
        else:
            current.append(para)
            current_size += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Pro: Each chunk is a complete thought
# Con: Chunk sizes vary (100-800 tokens), harder to predict costs
```
Recursive character splitting (LangChain default)
Try to split at paragraphs first. If a paragraph is too big, split at sentences. If a sentence is too big, split at words. This gives you consistent sizes while preserving as much semantic structure as possible.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters, not tokens
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]  # try these in order
)
chunks = splitter.split_text(document_text)

# Result: chunks that are ~1000 chars each, split at natural boundaries
# This is the best default for most use cases
```
Which to use when:
- Fixed-size: When you need predictable cost/latency per chunk (API-based embeddings with rate limits)
- Semantic: When each paragraph or section is a self-contained concept (FAQ pages, structured docs)
- Recursive: When you do not know the document structure ahead of time (best general-purpose default)
Rule of thumb: Start with recursive splitting at 500-1000 characters with 100-200 overlap. Measure retrieval quality. Adjust from there. Most teams spend too long picking chunking strategies and too little time measuring whether their retrieval actually returns the right chunks.
Step 2: Embedding — Turning Text into Searchable Vectors
Once you have chunks, each one gets converted into a vector — a list of numbers that captures its meaning. Two pieces of text about the same topic will have vectors that are close together in this space. Two unrelated texts will be far apart. This is how "search by meaning" works instead of "search by keyword."
| Model | Dimensions | Cost per 1M Tokens | Best For |
|---|---|---|---|
| text-embedding-3-small | 1,536 | $0.02 | Most use cases, prototyping, cost-sensitive production |
| text-embedding-3-large | 3,072 | $0.13 | Nuanced similarity, legal/medical where precision matters |
| Voyage AI (voyage-3) | 1,024 | $0.06 | Code search, domain-specific retrieval |
| Cohere embed-v3 | 1,024 | $0.10 | Multilingual, search-optimized use cases |
| BGE / E5 (open source) | 768-1,024 | Free (self-hosted) | Data privacy, on-premise, no API dependency |
Higher dimensions do not always mean better quality. The 1,536-dimension text-embedding-3-small outperforms older 1,536-dimension models by a wide margin. Architecture matters more than dimension count. For most teams, text-embedding-3-small is the right starting point: it costs 6.5x less than the large variant and the quality difference only matters for specialized retrieval tasks.
Practical tip: Embed a test set of 50-100 queries against your actual documents. Manually label the "correct" chunks for each query. Measure recall@5 (does the correct chunk appear in the top 5 results?). If recall@5 is above 85%, your embedding model is not your bottleneck — focus on chunking and reranking instead.
Step 3: Vector Storage — Where Your Embeddings Live
Once you have vectors, you need somewhere to store and search them. This is where vector databases come in. The choice matters less than you think for getting started — but matters a lot at scale.
| Database | Type | Best For | Latency (1M vectors) | Cost |
|---|---|---|---|---|
| Pinecone | Managed SaaS | Production, scale without ops | ~30-50ms | $70+/month (Starter) |
| Chroma | Embedded / local | Prototyping, small datasets | ~5-15ms (in-process) | Free (runs locally) |
| pgvector | PostgreSQL extension | Already using Postgres, simpler ops | ~50-200ms | Your existing Postgres cost |
| Qdrant | Open source / cloud | Production, need filtering + speed | ~10-30ms | Free (self-host) or $25+/month (cloud) |
Decision shortcut:
- Prototyping? Use Chroma. It runs in your Python process, zero setup, and you can switch later.
- Already have PostgreSQL? Use pgvector. One less database to manage. It scales to ~1M vectors before you need to optimize.
- Production with >1M vectors? Use Pinecone (if you want managed) or Qdrant (if you want control).
- Need advanced filtering? Qdrant has the best metadata filtering. Pinecone is close behind.
Connection pooling note: If you use pgvector, you inherit all the connection management challenges from regular PostgreSQL. Connection pools, idle timeouts, and connection limits all apply. See the Database Connections deep dive for how to handle this properly.
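For a feel of how little setup the prototyping path needs, here is a minimal Chroma sketch (the document strings and ids are illustrative; Chroma applies a default embedding model unless you configure one):

```python
import chromadb

# In-process vector store: no server, no setup
client = chromadb.Client()
collection = client.get_or_create_collection(name="docs")

# Chroma embeds the documents automatically on add
collection.add(
    documents=[
        "Deploy with `railway up` after setting env vars.",
        "JWT tokens expire after 15 minutes; refresh via /auth/refresh.",
    ],
    ids=["deploy-1", "auth-1"],
)

results = collection.query(query_texts=["how do tokens expire?"], n_results=1)
print(results["documents"][0])  # the closest chunk(s)
```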
Step 4: Reranking — The Quality Multiplier
Here is the thing most teams miss: vector similarity is not the same as relevance. Two texts can be "semantically similar" without one being a useful answer to the other. Reranking is the fix — and it is the single biggest quality improvement you can make after basic RAG is working.
The problem: Embedding models encode the query and each document separately. They measure similarity but not relevance. A chunk about "Python deployment strategies" and a chunk about "Python deployment errors" might score almost identically for the query "how do I deploy my Python app?" — but only the first is useful.
The fix: A reranker (cross-encoder model) takes the query and each candidate chunk as a pair, processes them together, and produces a true relevance score. It is slower (because it compares each pair individually) but dramatically more accurate.
```python
import cohere

co = cohere.Client("your-api-key")

def retrieve_and_rerank(query, knowledge_base, initial_k=20, final_k=3):
    # Step 1: Fast vector search for initial candidates
    # (vector_search stands in for the retrieve() function from the RAG
    # section, here returning dicts with a "text" field)
    candidates = vector_search(query, knowledge_base, top_k=initial_k)

    # Step 2: Rerank with cross-encoder for true relevance
    results = co.rerank(
        query=query,
        documents=[c["text"] for c in candidates],
        top_n=final_k,
        model="rerank-english-v3.0"
    )

    # Return only the truly relevant chunks
    return [candidates[r.index] for r in results.results]

# Typical improvement: recall@3 goes from 65% to 88%
# Cost: ~$0.001 per rerank call (20 documents)
# Latency: adds ~150-250ms
```
The pattern: Retrieve 20 candidates with fast vector search (cheap, ~30ms). Rerank to find the true top 3 (accurate, ~200ms). This two-stage approach gives you both speed and quality. Cohere Rerank and Voyage Rerank are the most popular options, both costing fractions of a cent per call.
When to add reranking: If your retrieval accuracy is below 85% and you have tried improving chunking first. Reranking is the highest-ROI improvement for most RAG systems after chunking, but it does add 150-250ms of latency per query.
Measuring Retrieval Quality
You cannot improve what you do not measure. Most teams build RAG, test it manually with 5 queries, say "looks good," and ship. Then users report wrong answers and nobody knows if it is a retrieval problem, a chunking problem, or an LLM problem.
- Recall@K: "For each query in my test set, does the correct chunk appear in the top K results?" Target: 85%+ at K=5.
- Precision@K: "Of the top K results returned, how many are actually relevant?" Target: 70%+.
- MRR (Mean Reciprocal Rank): "When the correct chunk appears, how high is it ranked?" Target: 0.7+.

If recall is low, fix chunking. If precision is low, add reranking. If MRR is low, your embedding model needs upgrading.
Most production systems combine multiple strategies:
- Short-term: Full conversation history (last 10-20 messages) — keeps the current conversation coherent
- Medium-term: Summarization of older messages — preserves session context without token bloat
- Long-term: RAG over documents and past sessions — gives the LLM access to knowledge beyond the current conversation
```python
# summarize_if_needed and retrieve are the helpers defined in the
# summarization and RAG sections above

def build_context(user_query, conversation, knowledge_base):
    # Layer 1: System prompt (procedural memory)
    system = {"role": "system", "content": SYSTEM_PROMPT}

    # Layer 2: Summary of old conversation (compressed short-term)
    summary = summarize_if_needed(conversation[:-10])

    # Layer 3: RAG context (long-term/semantic memory)
    relevant_docs = retrieve(user_query, knowledge_base, top_k=3)
    rag_context = "\n".join(relevant_docs)

    # Layer 4: Recent conversation (verbatim short-term)
    recent = conversation[-10:]

    return [
        system,
        {"role": "system", "content": f"Session summary: {summary}"},
        {"role": "system", "content": f"Relevant docs:\n{rag_context}"},
        *recent,
    ]
```
This is essentially what Cursor, ChatGPT, and other production AI tools do under the hood. The specific implementation varies, but the layered approach is universal.
Use RAG when:
- You have a large knowledge base (docs, codebases, wikis)
- Information changes less frequently than daily
- You need to cite sources for answers
- Context needs exceed any model's context window

Avoid RAG when:
- Data changes in real-time (live dashboards, trading)
- The corpus is tiny (<10 documents) — just put it in the prompt
- You need guaranteed retrieval of specific docs
- Latency requirements are under 50ms
Memory bugs are subtle. The system doesn't crash — it just gives confident wrong answers.
The Stale Memory Problem
- Users report "the bot is wrong" or "it said the old thing"
- Answers reference deprecated features, old policies, or discontinued products
- Correct information exists in newer documents but isn't retrieved
How to Detect: Log retrieved chunks with their timestamps. If old documents appear in top-K results when newer versions exist, you have this problem. Create a test case: ask about something that changed recently.
How to Fix: Add last_updated metadata to every chunk. Use hybrid scoring: final_score = 0.7 * semantic + 0.3 * recency. Re-index on a schedule. For critical docs, implement "supersedes" relationships.
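A sketch of that hybrid score, assuming each chunk carries a last_updated timestamp. The exponential decay with a 90-day half-life is one reasonable choice, not a standard:

```python
from datetime import datetime, timezone

def hybrid_score(semantic_score, last_updated, half_life_days=90):
    """Blend semantic similarity with document freshness.
    Recency decays exponentially: 1.0 for a fresh doc, 0.5 at one
    half-life, approaching 0 for very old docs."""
    age_days = (datetime.now(timezone.utc) - last_updated).days
    recency_score = 0.5 ** (age_days / half_life_days)
    return 0.7 * semantic_score + 0.3 * recency_score
```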
The Leaked Context Problem
- Users see information they didn't provide ("How do you know my address?")
- Responses reference other users' orders, conversations, or personal data
- Security audit reveals cross-user data in responses
How to Detect: Create two test users. User A mentions "my secret project is called Phoenix." User B asks "what projects exist?" If B's response mentions Phoenix, you have a leak. This is a mandatory pre-launch test.
How to Fix: Namespace all stored data by user ID. Filter by user_id as a metadata filter before similarity search (not post-filter). Use separate collections per tenant for enterprise. This is non-negotiable for production.
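Using Chroma again as the example store, namespacing looks like this (collection name and ids are illustrative; the where filter is applied before similarity search):

```python
import chromadb

client = chromadb.Client()
notes = client.get_or_create_collection(name="user_notes")

# Store every chunk under the owning user's ID...
notes.add(
    documents=["My secret project is called Phoenix."],
    metadatas=[{"user_id": "user_a"}],
    ids=["user_a-note-1"],
)

# ...and filter BEFORE similarity search, not after
results = notes.query(
    query_texts=["what projects exist?"],
    n_results=3,
    where={"user_id": "user_b"},  # user B never sees user A's chunks
)
```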
The Polluted Retrieval Problem
- LLM gives vague or contradictory answers
- Responses blend unrelated topics ("Deploy by going to the product meeting...")
- Hallucinations increase despite having correct docs in the corpus
How to Detect: Log retrieved chunks for each query. Manually review 20 random queries. If more than 30% of retrieved chunks are irrelevant, you have polluted retrieval.
How to Fix: (1) Reduce top-K from 10 to 3-5. (2) Add reranking: retrieve 20, use a cross-encoder to pick best 3. (3) Improve chunking — each chunk should be about ONE topic. (4) Add metadata filters (doc_type, section_header). Quality over quantity — 2 perfect chunks beat 10 mediocre ones.
One way to think about choosing a memory strategy:
Starting a new project?
Begin with full conversation history. Measure when it breaks. Add complexity only when data shows you need it.
Users complaining about "forgetting"?
Add logging first. Track conversation lengths and failure points. Data tells you what to fix.
Building RAG?
Start simple: basic chunking + vector search. Measure retrieval quality before adding complexity.
The Problem
- LLMs are stateless — each call is independent
- No built-in memory between API calls
- Context windows have token limits
Memory Strategies
- Full History: Send everything (expensive, simple)
- Trimming: Drop old messages (lossy)
- Summarization: Compress history (balanced)
- RAG: Retrieve relevant context (scalable)
Memory Types
- Working: Current conversation
- Episodic: Past sessions
- Semantic: External knowledge
- Procedural: How-to instructions
Common Failures
- Stale Memory: Uses outdated info
- Leaked Context: User A gets User B's data
- Polluted Retrieval: Too much noise in chunks
Starting Point
- Short chats → Try full history first
- Long chats → Consider summarization
- Very long → Explore hybrid/RAG
- External docs? → Add RAG
Testing Memory Systems
Memory bugs are invisible at demo time. They emerge after 50+ messages, after documents change, or after users push edge cases you never imagined. Here is how to catch them before production does.
The minimum viable test suite for any RAG system. Run this on every deployment.
```python
def test_retrieval_quality(retriever, test_cases):
    """
    test_cases = [
        {"query": "How do I reset my password?",
         "expected_doc_id": "auth-guide-v3",
         "expected_in_top_k": 3},
        ...
    ]
    """
    hits = 0
    reciprocal_ranks = []
    for case in test_cases:
        results = retriever.search(case["query"], top_k=5)
        result_ids = [r.doc_id for r in results]
        if case["expected_doc_id"] in result_ids[:case["expected_in_top_k"]]:
            hits += 1
            rank = result_ids.index(case["expected_doc_id"]) + 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)

    recall = hits / len(test_cases)
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)

    assert recall >= 0.85, f"Recall dropped to {recall:.2f} (threshold: 0.85)"
    assert mrr >= 0.70, f"MRR dropped to {mrr:.2f} (threshold: 0.70)"
    return {"recall": recall, "mrr": mrr, "total": len(test_cases)}
```
Building the test set: Take your 50 most common user queries from production logs. For each, manually identify which document chunk should be retrieved. Store as JSON. Run this on every deployment. If recall drops below 85%, the deployment broke something — block it.
These tests catch the bugs that only appear after long conversations.
```python
# build_context, count_tokens, and get_completion stand in for your
# app's context assembly, token counting, and LLM call helpers

def test_context_window_overflow():
    """Simulate a 200-message conversation and verify the system
    handles it gracefully (summarize/trim)."""
    messages = []
    for i in range(200):
        messages.append({"role": "user", "content": f"Message {i}: " + "x" * 500})
        messages.append({"role": "assistant", "content": f"Reply {i}"})

    # This should NOT throw a token limit error
    context = build_context(messages)
    total_tokens = count_tokens(context)
    assert total_tokens <= MODEL_MAX_TOKENS * 0.9

def test_early_info_retained():
    """User mentions their name in message 1. After 50 messages,
    the system should still know it."""
    messages = [
        {"role": "user", "content": "My name is Ahmed and I work at Acme Corp"},
        # ... 50 filler messages ...
        {"role": "user", "content": "What's my name?"},
    ]
    response = get_completion(build_context(messages))
    assert "Ahmed" in response
```
Why this matters: Trimming strategies silently drop the first messages. Summarization might omit "unimportant" details like user names. These tests catch the failure before users do.
وَاللَّهُ أَعْلَمُ
And Allah knows best
وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ
May Allah's peace and blessings be upon our master Muhammad and his family