LLM Engineering

Memory for LLMs:
From Stateless to Production-Ready

Why your AI assistant forgets everything—and the engineering patterns to fix it. A visual guide to conversation history, trimming, summarization, and RAG.

Bahgat Bahgat Ahmed
January 2025 · 45 min read

بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

In the name of Allah, the Most Gracious, the Most Merciful

You're building a coding assistant. Users love it in testing. Then you ship.

A developer asks: "Refactor the auth module." Your assistant nails it. Same developer, 10 minutes later: "Now add rate limiting to what we just built." Your assistant responds: "I'd be happy to help! Could you share the code you'd like me to add rate limiting to?"

It forgot. Everything. The entire refactoring session—gone.

This isn't a bug. It's how LLMs work. And it's the first problem every production AI system has to solve.

Quick Summary
  • LLMs are stateless — every API call starts fresh, no memory of previous calls
  • "Memory" = sending conversation history — you literally paste everything each time
  • Context windows have limits — when you exceed them, you need strategies: trimming, summarization, or RAG

Want the full story? Keep reading.

This post is for you if:

  • You're building an LLM-powered app and conversations feel "broken"
  • You're confused why ChatGPT "remembers" but your API calls don't
  • You're hitting context limits and don't know how to handle them
The Problem
Why LLMs Forget Everything

The Uncomfortable Truth: LLMs Have Amnesia

Every time you call an LLM API, you're talking to a stranger. The model doesn't know you messaged it 5 seconds ago. It doesn't know you messaged it yesterday. Each API call exists in complete isolation.

Each API Call is Isolated

First call:
User: I'm building a REST API in Python
LLM: Great! I can help with Flask, FastAPI, or Django...

(no connection between the two calls)

Second call:
User: Which framework should I use?
LLM: For what project? I'd need more context...

The LLM has no idea these two calls are related. To it, they're from two different users on two different planets.

When you use ChatGPT or Claude, it seems like they remember. But that's not the LLM—that's an entire memory system built on top. The LLM is the engine; memory is a separate module that most people never see.

Real World Impact

If you run Llama locally and chat with it, you'll see this firsthand. No memory system = every message is message #1.

The Obvious Fix: Just Send Everything

The simplest solution? Send the entire conversation every time. The LLM doesn't remember, so you remind it — over and over.

Python - OpenAI SDK
from openai import OpenAI
client = OpenAI()

# This is "memory" — a list you manage yourself
messages = [
    {"role": "system", "content": "You are a coding assistant."},
]

def chat(user_message):
    messages.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages  # Send EVERYTHING every time
    )

    assistant_msg = response.choices[0].message.content
    messages.append({"role": "assistant", "content": assistant_msg})
    return assistant_msg

# Each call sends the full history:
chat("I'm building a REST API in Python")  # sends 2 messages (system + user)
chat("Let's use FastAPI")                  # sends 4 messages
chat("Add authentication to it")           # sends 6 messages
chat("Now add rate limiting")              # sends 8 messages... and growing

This works beautifully for short conversations. The LLM sees the full context, so its responses are coherent and relevant. But it can't last.

How "Memory" Actually Works

Conversation with LLM:
System: You are a helpful coding assistant
User: I'm building a REST API in Python
Assistant: Great! I recommend FastAPI for its speed and automatic docs.
User: Let's use FastAPI then. Show me a basic setup.
Assistant: Here's a minimal FastAPI app with one endpoint...
User: Now add authentication to it
Assistant: I'll add JWT auth to the FastAPI app we just created...

The LLM doesn't "remember"—you're literally pasting the entire conversation into every API call.

This works beautifully... until it doesn't.

The Wall: Context Windows Have Limits

LLMs can only process so much text at once. This limit is called the context window—and when you hit it, everything breaks.

Same Conversation, Different Limits

The same conversation can overflow a small context window, fill roughly 40% of a medium one, and only about 15% of a large one. Check your model's documentation for exact limits.

But bigger context windows aren't free. More tokens means:

  • Higher costs — you pay per token, input and output
  • Slower responses — more to process, more latency
  • Worse focus — models struggle to attend to everything equally
The Math That Matters

A 50-message conversation averages ~15,000 tokens. At GPT-4o pricing ($2.50/M input tokens), that single conversation costs ~$0.04 per message toward the end. With 10,000 active users sending 20 messages/day at that rate, input tokens alone run ~$8,000/day. And it gets worse with every message — message #50 resends all 49 previous messages.
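The quadratic growth is worth seeing in numbers. A quick sketch (the 300-tokens-per-message average and the GPT-4o input price are illustrative assumptions, not measurements):

```python
def full_history_cost(n_messages, tokens_per_message=300, price_per_m=2.50):
    """Input-token cost of a conversation where every call resends the
    whole history (illustrative averages and GPT-4o input pricing)."""
    # Call k resends k messages of history: 1 + 2 + ... + n = n(n + 1) / 2 total
    messages_sent = n_messages * (n_messages + 1) // 2
    return messages_sent * tokens_per_message * price_per_m / 1_000_000

short_cost = full_history_cost(10)   # a 10-message chat
long_cost = full_history_cost(50)    # 5x the messages, roughly 23x the cost
```

Doubling the conversation length roughly quadruples the total tokens sent, which is why "just send everything" stops being viable long before the context window itself runs out.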

Context window sizes by model (2025)
  • GPT-4o: 128K tokens (~300 pages of text)
  • Claude Sonnet/Opus: 200K tokens (~500 pages)
  • Gemini 1.5 Pro: 2M tokens (~5,000 pages)
  • Llama 3: 8K-128K tokens depending on variant

Bigger sounds better, but attention quality degrades with length. A model with 200K context doesn't attend equally to token #1 and token #199,999. Research shows "lost in the middle" — models focus on the beginning and end, often missing information in the middle of long prompts.
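One common mitigation, when you control how the prompt is assembled, is to reorder chunks so the strongest material sits at the edges of the prompt (where attention is best) and the weakest lands in the middle. LangChain ships a version of this idea as LongContextReorder; a minimal sketch:

```python
def reorder_for_long_context(chunks_by_relevance):
    """Given chunks sorted best-first, place the strongest at the start
    and end of the context and the weakest in the middle, where
    long-context attention is weakest ("lost in the middle")."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)      # ranks 1, 3, 5, ... fill from the start
        else:
            back.insert(0, chunk)    # ranks 2, 4, 6, ... fill from the end
    return front + back

ranked = ["best", "second", "third", "fourth", "fifth"]
reordered = reorder_for_long_context(ranked)
```

The best chunk ends up first, the second-best ends up last, and the weakest chunks sink to the middle of the prompt.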

Strategy 1
Context Window Management

Strategy 1: Trimming (Drop the Old Stuff)

The bluntest solution: when the conversation gets too long, cut the oldest messages. First in, first out.

FIFO Trimming: Simple but Dangerous

[M1] [M2] [M3]  → dropped forever      [M4] [M5] [M6] [M7]  → kept in context

The user mentioned they're allergic to shellfish in M2. Your food recommendation bot just suggested shrimp.

Trimming is fast and cheap. But critical information often lives in those early messages — user preferences, project context, key decisions. Once it's gone, it's gone.

Python - Token-based trimming
import tiktoken

def trim_messages(messages, max_tokens=4000, model="gpt-4o"):
    """Keep the system prompt + most recent messages that fit."""
    enc = tiktoken.encoding_for_model(model)

    # Always keep the system prompt
    system_msg = messages[0]
    system_tokens = len(enc.encode(system_msg["content"]))

    # Add messages from newest to oldest until we hit the limit
    trimmed = []
    token_count = system_tokens

    for msg in reversed(messages[1:]):
        msg_tokens = len(enc.encode(msg["content"]))
        if token_count + msg_tokens > max_tokens:
            break
        trimmed.insert(0, msg)
        token_count += msg_tokens

    return [system_msg] + trimmed

# Usage: trim before every API call
trimmed = trim_messages(messages, max_tokens=4000)
response = client.chat.completions.create(
    model="gpt-4o", messages=trimmed
)
Smarter trimming strategies

Basic FIFO trimming is naive. Better approaches:

  • Keep system prompt + first user message + recent messages. The first message often contains crucial context ("I'm building a food delivery app").
  • Keep pairs together. Never drop a user message while keeping its assistant response — the model gets confused by orphaned messages.
  • Importance-weighted trimming. Mark certain messages as "pinned" (e.g., messages where the user stated preferences or made decisions). Trim unpinned messages first.
  • Sliding window with overlap. Keep the last N messages but also the first 2-3 messages. Simple and usually good enough.
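The "sliding window + first messages" and "keep pairs together" ideas combine into a few lines. A sketch (it assumes messages[0] is the system prompt and the rest strictly alternate user/assistant, which real histories may not):

```python
def sliding_window_trim(messages, keep_first_pairs=1, keep_recent=10):
    """Keep the system prompt, the first user/assistant exchange(s), and
    the most recent messages, trimming everything in between.

    Assumes messages[0] is the system prompt and the rest alternate
    user/assistant (real histories need real pairing logic)."""
    system = messages[:1]
    head = messages[1:1 + 2 * keep_first_pairs]    # first exchange(s)
    rest = messages[1 + 2 * keep_first_pairs:]

    if len(rest) <= keep_recent:
        return system + head + rest

    tail = rest[-keep_recent:]
    # Never start the tail with an orphaned assistant reply
    if tail and tail[0]["role"] == "assistant":
        tail = tail[1:]
    return system + head + tail

# Example: a system prompt plus 10 user/assistant exchanges
msgs = [{"role": "system", "content": "You are a coding assistant."}]
for i in range(10):
    msgs.append({"role": "user", "content": f"user message {i}"})
    msgs.append({"role": "assistant", "content": f"reply {i}"})

trimmed = sliding_window_trim(msgs, keep_first_pairs=1, keep_recent=10)
```

The first exchange survives ("I'm building a food delivery app"), the middle is dropped, and the window never opens on a dangling assistant reply.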

The Moment Trimming Betrays You

Trimming works until it silently deletes the one message that mattered. Here is a real conversation where trimming creates a dangerous failure. Watch what happens when message 3 gets dropped:

Conversation History (before trimming)
# Message 1 (user):
"I'm building a medical triage chatbot for an ER."

# Message 2 (assistant):
"I can help. What symptoms should it handle?"

# Message 3 (user):  ← THIS IS CRITICAL
"Important constraint: NEVER suggest the patient is fine
 to go home. Always recommend they wait for a doctor."

# Message 4 (assistant):
"Understood. I'll always err on the side of caution."

# Messages 5-20: Building out triage logic, testing scenarios...

# Message 21 (user):
"Patient reports mild headache, no other symptoms. What should
 the bot say?"
What trimming does

With a window of 10 messages, the trimmer drops messages 1-11. Message 3 — the safety constraint — is gone. The assistant now responds: "Based on the symptoms, this sounds like a tension headache. The patient could likely go home and take ibuprofen." The one constraint the user explicitly set has been silently deleted.

This is why naive trimming is dangerous for any use case where early messages contain constraints, preferences, or safety rules. You have two options: pin critical messages so they never get trimmed, or graduate to summarization.
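A pinning mechanism fits in a short function. In this sketch the "pinned": True flag is a convention invented here (strip it before the API call), and the word-count token estimate stands in for a real tokenizer:

```python
def trim_with_pins(messages, max_tokens=4000, count_tokens=None):
    """Token-budgeted trimming that never drops pinned messages.

    The "pinned": True flag is a convention invented for this sketch
    (strip it before the API call). count_tokens defaults to a rough
    word-count estimate; use a real tokenizer in production."""
    count = count_tokens or (lambda m: int(len(m["content"].split()) * 1.3) + 4)

    system, rest = messages[0], messages[1:]
    pinned = [m for m in rest if m.get("pinned")]
    budget = max_tokens - count(system) - sum(count(m) for m in pinned)

    # Spend the remaining budget on the most recent unpinned messages
    kept = []
    for msg in reversed([m for m in rest if not m.get("pinned")]):
        if budget - count(msg) < 0:
            break
        kept.append(msg)
        budget -= count(msg)

    keep_ids = {id(m) for m in pinned} | {id(m) for m in kept}
    # Reassemble in original order so the dialogue still reads correctly
    return [system] + [m for m in rest if id(m) in keep_ids]

# The pinned constraint survives no matter how much filler comes after it:
history = [
    {"role": "system", "content": "You are a food assistant."},
    {"role": "user", "content": "I'm severely allergic to shellfish.", "pinned": True},
]
for i in range(40):
    history.append({"role": "user", "content": f"long filler message number {i} " * 30})

trimmed = trim_with_pins(history, max_tokens=2000)
```

Unpinned filler gets cut to fit the budget; the allergy message is exempt from trimming entirely.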

How token counting actually works (with real numbers)

Tokens are not words. They are sub-word units that the model's tokenizer produces. On average, 1 token is about 0.75 English words, or 4 characters. But this varies by language and content type:

  • English prose: ~1.3 tokens per word
  • Python code: ~1.5 tokens per word (more special characters)
  • Arabic text: ~2-3x more tokens than English for the same meaning
  • JSON/structured data: ~2x tokens vs. plain text (brackets, quotes, colons)
Python - Real token counting
import tiktoken

# Different models use different tokenizers
enc_gpt4 = tiktoken.encoding_for_model("gpt-4o")
enc_gpt35 = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "Refactor the authentication module to use JWT tokens"

# Same text, different token counts (gpt-4o uses the o200k_base
# encoding, gpt-3.5-turbo uses cl100k_base)
gpt4_tokens = len(enc_gpt4.encode(text))
gpt35_tokens = len(enc_gpt35.encode(text))

# For a full messages array (OpenAI format), count overhead too
def count_messages_tokens(messages, model="gpt-4o"):
    """Count tokens including per-message overhead."""
    enc = tiktoken.encoding_for_model(model)
    tokens = 0
    for msg in messages:
        tokens += 4  # approximate per-message overhead (role, separators; varies by model)
        for key, value in msg.items():
            tokens += len(enc.encode(value))
    tokens += 2  # priming overhead for assistant reply
    return tokens

# A 20-message conversation: ~3,200 tokens
# A 50-message conversation: ~15,000 tokens
# A 100-message coding session: ~45,000 tokens

Why this matters for trimming: You cannot trim by message count alone. A message containing a code block might be 2,000 tokens. A message saying "yes" is 1 token. Token-based trimming is the only reliable approach.

For Anthropic models: Claude uses a different tokenizer. The anthropic Python SDK provides a token-counting endpoint (client.messages.count_tokens()) for accurate counts. Do not use tiktoken for Claude — the counts can be off by 10-20%.

Use Trimming When
  • Sessions are short (under 20 messages)
  • Cost is your primary concern
  • Early messages are truly disposable (casual chat)
  • You have a "sliding window + first message" strategy
Don't Use When
  • User preferences are stated early (allergies, constraints)
  • Safety-critical rules are in early messages
  • You need to reference past decisions ("why did we choose X?")
  • Users expect the assistant to "remember" everything
Decision Point
Your food delivery chatbot has 100 messages in the current session. Message 5 contains: "I'm severely allergic to shellfish." Your trimming window keeps only the last 20 messages. What do you do?
A Increase the window to 100 messages so nothing gets lost
B Summarize message 5 into a compact format before trimming
C Pin safety-critical messages (like allergies) so they're never trimmed
D Trust the user to re-state allergies if it becomes relevant
C is correct. For safety-critical information like allergies, dietary restrictions, or medical constraints, you need a "pinning" mechanism that marks certain messages as immune to trimming. This is the only approach that guarantees the information survives. Option A is expensive and doesn't scale. Option B might lose nuance ("severely allergic" becoming "dislikes shellfish"). Option D is dangerous — users assume the bot remembers what they told it.

Summarization: Compress, Don't Delete

Instead of throwing away old messages, compress them. Use an LLM to create a summary of what happened.

Compression Without Total Loss

Full conversation history (12,000 tokens) → Summarize → Key facts preserved (400 tokens)

30x compression. But summaries are lossy—nuance gets lost. And generating them adds 300-800ms latency.

Python - Conversation summarization
def summarize_and_compress(messages, max_tokens=4000):
    """Summarize old messages, keep recent ones verbatim."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    total = sum(len(enc.encode(m["content"])) for m in messages)

    if total <= max_tokens:
        return messages  # No need to summarize yet

    # Keep system prompt + last 6 messages verbatim
    system = messages[0]
    recent = messages[-6:]
    old = messages[1:-6]

    # Summarize the old messages
    summary = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for summarization
        messages=[{
            "role": "system",
            "content": """Summarize this conversation. Preserve:
- User's name, preferences, and stated requirements
- Key decisions made and their reasoning
- Technical context (language, framework, architecture)
- Any constraints or deadlines mentioned
Be concise but don't lose critical facts."""
        }] + old
    ).choices[0].message.content

    # Return: system + summary + recent messages
    return [
        system,
        {"role": "system", "content": f"Previous conversation summary: {summary}"},
        *recent
    ]
A production summarization prompt (before and after)

The prompt you use for summarization determines what survives. A vague prompt produces vague summaries. Here is a production-grade prompt and what it does to a real conversation:

Summarization System Prompt
"""You are a conversation summarizer for a coding assistant.
Produce a structured summary preserving EXACTLY these fields:

1. USER IDENTITY: Name, role, timezone (if mentioned)
2. PROJECT CONTEXT: Language, framework, architecture decisions
3. CONSTRAINTS: Any "must", "never", "always" rules the user stated
4. DECISIONS MADE: What was decided, with reasoning
5. CURRENT STATE: What has been built so far
6. OPEN QUESTIONS: What is unresolved

Format as a flat list. Be specific. Use the user's exact words
for constraints. Do not editorialize or add your own opinions.
Maximum 400 tokens."""

Before summarization (12,400 tokens):

20 messages about building a FastAPI app with JWT auth, including code snippets, debugging a CORS issue, deciding on PostgreSQL over MongoDB, and a constraint that all endpoints must return JSON:API format.

After summarization (380 tokens):

Generated Summary
USER IDENTITY: Backend developer, building for a startup
PROJECT CONTEXT: Python 3.12, FastAPI, PostgreSQL, deployed on Railway
CONSTRAINTS:
  - All endpoints MUST return JSON:API format
  - No ORM - raw SQL with asyncpg
  - Auth tokens expire after 15 minutes
DECISIONS MADE:
  - PostgreSQL over MongoDB (need relational joins for reporting)
  - JWT with refresh tokens (not session-based)
  - CORS allows only app.example.com
CURRENT STATE:
  - /auth/login and /auth/refresh endpoints working
  - User model with bcrypt password hashing
  - CORS middleware configured
OPEN QUESTIONS:
  - Rate limiting strategy not yet decided
  - File upload endpoint design pending

32x compression. The key constraint about JSON:API format — which trimming would have silently deleted — is explicitly preserved. The cost of this summarization call with GPT-4o-mini: $0.0019.

Summarization strategies: rolling vs. hierarchical

Rolling summarization: After every N messages, summarize the oldest batch and prepend the summary. Simple but summaries stack up and drift over time. The summary of a summary of a summary loses nuance.

Hierarchical summarization: Maintain multiple levels. Level 1: raw messages (last 10). Level 2: summary of messages 11-50. Level 3: summary of the entire session. Each level compresses more aggressively. More complex but preserves both recent detail and long-term context.

Single evolving summary: Maintain one summary that gets updated after each conversation turn instead of re-summarized from scratch. Prompt: "Given this existing summary and the new exchange, update the summary." Most token-efficient but can drift from reality over many updates.

Cost comparison: Rolling adds ~$0.001 per summarization call with GPT-4o-mini. For a 100-message conversation, that's ~$0.02 total. Compare to sending the full history at message #100 which costs ~$0.08 per message. Summarization pays for itself after ~4 messages.
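The single-evolving-summary pattern can be sketched as a small class. The summarize callback stands in for the LLM call (e.g. gpt-4o-mini with an "update this summary" prompt); the stub in the example exists only to make the sketch runnable:

```python
class EvolvingSummaryMemory:
    """Single evolving summary: one summary updated as the conversation
    grows, plus a verbatim tail of recent messages. The summarize callback
    stands in for an LLM call in this sketch."""

    def __init__(self, summarize, keep_recent=6):
        self.summarize = summarize      # fn(old_summary, overflow_msgs) -> str
        self.summary = ""
        self.recent = []                # verbatim tail of the conversation
        self.keep_recent = keep_recent

    def add(self, message):
        self.recent.append(message)
        if len(self.recent) > self.keep_recent:
            # Fold the overflow into the summary, keep the tail verbatim
            overflow = self.recent[: -self.keep_recent]
            self.summary = self.summarize(self.summary, overflow)
            self.recent = self.recent[-self.keep_recent:]

    def build_prompt(self, system_prompt):
        out = [{"role": "system", "content": system_prompt}]
        if self.summary:
            out.append({"role": "system",
                        "content": f"Previous conversation summary: {self.summary}"})
        return out + self.recent

# Stub summarizer: concatenates contents (an LLM call in production)
stub = lambda summary, msgs: (summary + " " + " ".join(m["content"] for m in msgs)).strip()

mem = EvolvingSummaryMemory(summarize=stub, keep_recent=2)
for i in range(5):
    mem.add({"role": "user", "content": f"m{i}"})
```

Each call to add() folds at most a few messages into the summary, which is what keeps the update cheap compared to re-summarizing the whole history from scratch.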

The Progression So Far

Full history is perfect until the context window fills up. Trimming is fast but loses critical info. Summarization preserves intent but costs an extra API call and adds 300-800ms latency. Each strategy is the right answer for a different scenario — and most production systems combine them.
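Combining the strategies can be as simple as a dispatcher: full history while it fits, summarization once it does not. A sketch with injected stand-ins for the tokenizer and the summarization call (both would be real in production):

```python
def prepare_context(messages, count_tokens, summarize, max_tokens=8000):
    """Hybrid memory: full history while it fits, summarization once it
    does not. count_tokens and summarize are injected; in production they
    would be a real tokenizer and a cheap-model LLM call."""
    total = sum(count_tokens(m) for m in messages)
    if total <= max_tokens:
        return messages                       # cheap case: send everything

    system, old, recent = messages[0], messages[1:-6], messages[-6:]
    summary = summarize(old)                  # compress only the middle
    return [system,
            {"role": "system", "content": f"Previous conversation summary: {summary}"},
            *recent]

# Stand-ins, purely for illustration:
count_tokens = lambda m: len(m["content"].split())
summarize = lambda msgs: f"{len(msgs)} earlier messages condensed"

msgs = [{"role": "system", "content": "You are a coding assistant."}]
msgs += [{"role": "user", "content": "word " * 100} for _ in range(20)]

ctx = prepare_context(msgs, count_tokens, summarize, max_tokens=1000)
```

Short conversations pay no summarization cost or latency at all; long ones degrade gracefully instead of overflowing.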

Use Summarization When
  • Conversations often exceed 30+ messages
  • You need to preserve intent and decisions
  • Cost savings matter (cheaper than full history)
  • You can tolerate 300-800ms extra latency
Don't Use When
  • Exact quotes matter (legal, compliance, support tickets)
  • Nuance is critical ("slightly prefer X" vs "want X")
  • Latency is ultra-critical (under 100ms requirement)
  • Audit trails require verbatim history
Decision Point
You're building a medical assistant that helps doctors review patient history. Conversations can reach 100+ messages. Which memory strategy should you use?
A Summarization — it compresses efficiently and preserves key facts
B Trimming — just keep the last 20 messages for speed
C Full history with pinned critical messages, accept higher cost
D RAG over past sessions — retrieve relevant history on demand
C is the safest choice for medical contexts. Summarization risks losing critical details ("patient mentioned chest pain once" might become "no cardiac symptoms"). Trimming is dangerous — early symptoms matter. RAG is useful for cross-session memory but doesn't solve within-session recall. For safety-critical domains, accept higher costs and keep full verbatim history with pinned critical messages. You can still add summarization for non-critical context, but core medical details should never be compressed.

External Memory: Store It Somewhere Else

Here is the key insight: memory does not have to live in the prompt. You can store information externally and pull it in only when relevant.

This is the shift from "remember everything" to "remember what matters right now."

A Framework for Thinking About Memory

Memory in LLMs maps to 5 categories. The first two (parametric and procedural) are baked into the model — you can't change them at runtime. The three that matter for developers are: working memory (context window), episodic memory (conversation history), and semantic memory (knowledge bases/RAG).

In Practice

Cursor's knowledge of your codebase = semantic memory. Its ability to write TypeScript = parametric memory. Your current chat = working memory.

The full memory taxonomy (academic framework)

For completeness, here is the full 5-type memory taxonomy. The visual map and detailed cards below show how each type relates to LLM systems:

The Memory Taxonomy

  • Working — current chat in the prompt
  • Episodic — past sessions & outcomes
  • Semantic — docs & knowledge bases
  • Procedural — system prompts & rules
  • Parametric — baked in during training
Short-Term

Working Memory — the current conversation. Lives in the prompt. Limited by the context window.
Example: Your chat history with Claude right now

Long-Term

Episodic — past sessions and their outcomes.
Example: "Last week you tried Redis and it crashed"

Semantic — external knowledge: docs, codebases, wikis.
Example: Cursor reading your codebase

Procedural — how to do things. System prompts.
Example: "Always respond in JSON format"

Permanent

Parametric — baked into model weights during training. Cannot be changed at runtime.
Example: The model "knows" Python syntax and that Paris is in France
Parametric memory: what the model "just knows" (and why you cannot change it)

Parametric memory is information baked into the model's weights during training. It is why GPT-4 "knows" that Python uses indentation for blocks, or that Paris is the capital of France. You cannot add to it, remove from it, or update it at runtime.

This matters practically because:

  • Knowledge cutoff: Parametric memory stops at the training date. If your company launched a new product after the cutoff, the model does not know about it. This is the primary reason RAG exists — to inject current knowledge.
  • Hallucination risk: When the model's parametric memory is wrong or outdated, it confidently states incorrect facts. Fine-tuning can partially address this, but it is expensive ($5-50+ per training run) and introduces its own failure modes.
  • Domain-specific gaps: General-purpose models have shallow parametric memory for niche domains. A medical coding standard, your company's internal API, a proprietary protocol — the model has no parametric memory of these. You must supply them via RAG or context.
Procedural memory: system prompts as behavioral rules

Procedural memory in LLM systems is typically implemented as system prompts — instructions that tell the model how to behave rather than what to know. Examples:

  • "Always respond in JSON format"
  • "You are a customer support agent for Acme Corp. Never discuss competitor products."
  • "When the user asks about pricing, redirect to the sales team."

Procedural memory is the most stable form of memory because it lives at the start of the messages array and never gets trimmed (a well-designed system always preserves the system prompt). It is also a recurring cost: a detailed 1,000-token system prompt is billed on every single API call.

Practical tip: Keep system prompts under 500 tokens for high-volume applications. Move detailed behavioral rules into a separate document and inject them via RAG only when relevant. A system prompt that says "Follow the rules in the provided context" plus a RAG-retrieved rulebook is cheaper than a 2,000-token system prompt on every call.
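The arithmetic behind that tip is worth making explicit. A quick sketch (the $2.50/M input price and the 100,000 calls/day volume are illustrative assumptions):

```python
def system_prompt_cost_per_day(prompt_tokens, calls_per_day, price_per_m=2.50):
    """Daily input-token cost attributable to the system prompt alone
    (illustrative GPT-4o input pricing)."""
    return prompt_tokens * calls_per_day * price_per_m / 1_000_000

fat_prompt = system_prompt_cost_per_day(2000, 100_000)   # detailed rulebook inline
lean_prompt = system_prompt_cost_per_day(500, 100_000)   # rules moved behind RAG
savings = fat_prompt - lean_prompt                        # per day, input tokens only
```

At this volume, trimming the system prompt from 2,000 to 500 tokens saves $375/day before counting the conversation itself.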

The Industry Standard
RAG: Retrieval-Augmented Generation

RAG: The Industry Standard

Retrieval-Augmented Generation is how most production systems handle long-term memory. Instead of stuffing everything in the prompt, you store documents externally and retrieve only what's relevant.

If context window management is "remembering what happened in this conversation," RAG is "knowing things that were never part of this conversation." Your company's documentation, your codebase, your customer's order history — these all live outside the conversation and need to be retrieved on demand.

Why RAG Is Everywhere

Cursor indexes your codebase with RAG. Notion AI uses RAG over your workspace. ChatGPT's "memory" feature stores facts in a retrieval layer. Every "chat with your docs" product is RAG under the hood. Understanding how it works is not optional if you are building with LLMs.

RAG: Retrieve → Augment → Generate

User Query (1) → Embed (2) → Vector DB (3) → LLM (4) → Answer (5)

This is what powers Notion AI, Cursor, and every "chat with your docs" product.

Python - Minimal RAG pipeline
from openai import OpenAI
import numpy as np

client = OpenAI()

# Step 1: INGEST - Chunk your documents and embed them
def embed_text(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def chunk_document(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# Step 2: RETRIEVE - Find the most relevant chunks
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query, knowledge_base, top_k=3):
    query_embedding = embed_text(query)
    scored = [
        (chunk, cosine_similarity(query_embedding, emb))
        for chunk, emb in knowledge_base
    ]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, score in scored[:top_k]]

# Step 3: GENERATE - Ask the LLM with retrieved context
def rag_query(question, knowledge_base):
    relevant_chunks = retrieve(question, knowledge_base)
    context = "\n\n".join(relevant_chunks)

    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""Answer based on this context:
{context}

If the context doesn't contain the answer, say so."""},
            {"role": "user", "content": question}
        ]
    ).choices[0].message.content

Step 1: Chunking — How You Split Documents Changes Everything

Before you can search your documents, you need to split them into chunks small enough for embedding. This sounds trivial — just split every 500 words, right? But chunking strategy is the single biggest lever you have for RAG quality. Bad chunking produces bad retrieval, and no amount of fancy reranking fixes it.

Chunking strategies compared: fixed, semantic, and recursive (with code)

Fixed-size chunking

Split every N tokens, with an overlap to avoid losing context at boundaries. Simple, fast, and often good enough for a first pass.

Python - Fixed-size chunking
def fixed_chunk(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks with overlap."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start = end - overlap  # overlap prevents cutting mid-thought
    return chunks

# A 10,000-word doc with 500-word chunks + 50 overlap = ~22 chunks
# Pro: Fast, predictable chunk sizes
# Con: Splits mid-sentence, mid-paragraph, mid-thought

Semantic chunking

Split at natural boundaries — paragraph breaks, section headers, or even sentence boundaries. Preserves meaning within each chunk, but chunk sizes vary wildly (some paragraphs are 2 sentences, others are 20).

Python - Semantic chunking
import re

def semantic_chunk(text, max_tokens=800):
    """Split at paragraph boundaries, merge small paragraphs."""
    paragraphs = re.split(r'\n\n+', text)
    chunks = []
    current = []
    current_size = 0

    for para in paragraphs:
        para_tokens = len(para.split()) * 1.3  # rough token estimate
        if current_size + para_tokens > max_tokens and current:
            chunks.append("\n\n".join(current))
            current = [para]
            current_size = para_tokens
        else:
            current.append(para)
            current_size += para_tokens

    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Pro: Each chunk is a complete thought
# Con: Chunk sizes vary (100-800 tokens), harder to predict costs

Recursive character splitting (LangChain default)

Try to split at paragraphs first. If a paragraph is too big, split at sentences. If a sentence is too big, split at words. This gives you consistent sizes while preserving as much semantic structure as possible.

Python - Using LangChain's recursive splitter
# Newer LangChain versions ship this in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,         # characters, not tokens
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]  # try these in order
)

chunks = splitter.split_text(document_text)
# Result: chunks that are ~1000 chars each, split at natural boundaries
# This is the best default for most use cases

Which to use when:

  • Fixed-size: When you need predictable cost/latency per chunk (API-based embeddings with rate limits)
  • Semantic: When each paragraph or section is a self-contained concept (FAQ pages, structured docs)
  • Recursive: When you do not know the document structure ahead of time (best general-purpose default)

Rule of thumb: Start with recursive splitting at 500-1000 characters with 100-200 overlap. Measure retrieval quality. Adjust from there. Most teams spend too long picking chunking strategies and too little time measuring whether their retrieval actually returns the right chunks.

Step 2: Embedding — Turning Text into Searchable Vectors

Once you have chunks, each one gets converted into a vector — a list of numbers that captures its meaning. Two pieces of text about the same topic will have vectors that are close together in this space. Two unrelated texts will be far apart. This is how "search by meaning" works instead of "search by keyword."

Embedding model comparison: dimensions, cost, and performance
| Model | Dimensions | Cost per 1M Tokens | Best For |
|---|---|---|---|
| text-embedding-3-small | 1,536 | $0.02 | Most use cases, prototyping, cost-sensitive production |
| text-embedding-3-large | 3,072 | $0.13 | Nuanced similarity, legal/medical where precision matters |
| Voyage AI (voyage-3) | 1,024 | $0.06 | Code search, domain-specific retrieval |
| Cohere embed-v3 | 1,024 | $0.10 | Multilingual, search-optimized use cases |
| BGE / E5 (open source) | 768-1,024 | Free (self-hosted) | Data privacy, on-premise, no API dependency |

Higher dimensions do not always mean better quality. The 1,536-dimension text-embedding-3-small outperforms older 1,536-dimension models by a wide margin. Architecture matters more than dimension count. For most teams, text-embedding-3-small is the right starting point: it costs 6.5x less than the large variant and the quality difference only matters for specialized retrieval tasks.

Practical tip: Embed a test set of 50-100 queries against your actual documents. Manually label the "correct" chunks for each query. Measure recall@5 (does the correct chunk appear in the top 5 results?). If recall@5 is above 85%, your embedding model is not your bottleneck — focus on chunking and reranking instead.
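That recall@5 measurement is a few lines of code once you have labeled queries. A sketch (retrieve is your own search function; the toy retriever below exists only to show the interface shape):

```python
def recall_at_k(retrieve, labeled_queries, k=5):
    """Fraction of queries whose labeled correct chunk appears in the
    top-k results. retrieve(query, top_k) returns a ranked list of chunk
    ids; labeled_queries is a list of (query, correct_chunk_id) pairs."""
    hits = sum(
        1 for query, correct_id in labeled_queries
        if correct_id in retrieve(query, top_k=k)
    )
    return hits / len(labeled_queries)

# Toy retriever that always returns the same ranking, for illustration:
toy_retrieve = lambda query, top_k: ["chunk-a", "chunk-b", "chunk-c"][:top_k]
labeled = [("q1", "chunk-a"), ("q2", "chunk-b"), ("q3", "chunk-z")]
score = recall_at_k(toy_retrieve, labeled, k=2)   # 2 of 3 queries hit
```

Run this against your real retriever before and after every chunking or embedding change; it turns "retrieval feels better" into a number.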

Step 3: Vector Storage — Where Your Embeddings Live

Once you have vectors, you need somewhere to store and search them. This is where vector databases come in. The choice matters less than you think for getting started — but matters a lot at scale.

Vector database comparison: Pinecone vs Chroma vs pgvector vs Qdrant
| Database | Type | Best For | Latency (1M vectors) | Cost |
|---|---|---|---|---|
| Pinecone | Managed SaaS | Production, scale without ops | ~30-50ms | $70+/month (Starter) |
| Chroma | Embedded / local | Prototyping, small datasets | ~5-15ms (in-process) | Free (runs locally) |
| pgvector | PostgreSQL extension | Already using Postgres, simpler ops | ~50-200ms | Your existing Postgres cost |
| Qdrant | Open source / cloud | Production, need filtering + speed | ~10-30ms | Free (self-host) or $25+/month (cloud) |

Decision shortcut:

  • Prototyping? Use Chroma. It runs in your Python process, zero setup, and you can switch later.
  • Already have PostgreSQL? Use pgvector. One less database to manage. It scales to ~1M vectors before you need to optimize.
  • Production with >1M vectors? Use Pinecone (if you want managed) or Qdrant (if you want control).
  • Need advanced filtering? Qdrant has the best metadata filtering. Pinecone is close behind.

Connection pooling note: If you use pgvector, you inherit all the connection management challenges from regular PostgreSQL. Connection pools, idle timeouts, and connection limits all apply. See the Database Connections deep dive for how to handle this properly.

Step 4: Reranking — The Quality Multiplier

Here is the thing most teams miss: vector similarity is not the same as relevance. Two texts can be "semantically similar" without one being a useful answer to the other. Reranking is the fix — and it is the single biggest quality improvement you can make after basic RAG is working.

Reranking: how it works and why it matters so much

The problem: Embedding models encode the query and each document separately. They measure similarity but not relevance. A chunk about "Python deployment strategies" and a chunk about "Python deployment errors" might score almost identically for the query "how do I deploy my Python app?" — but only the first is useful.

The fix: A reranker (cross-encoder model) takes the query and each candidate chunk as a pair, processes them together, and produces a true relevance score. It is slower (because it compares each pair individually) but dramatically more accurate.

Python - Reranking with Cohere
import cohere

co = cohere.Client("your-api-key")

def retrieve_and_rerank(query, knowledge_base, initial_k=20, final_k=3):
    # Step 1: Fast vector search for initial candidates
    candidates = vector_search(query, knowledge_base, top_k=initial_k)

    # Step 2: Rerank with cross-encoder for true relevance
    results = co.rerank(
        query=query,
        documents=[c["text"] for c in candidates],
        top_n=final_k,
        model="rerank-english-v3.0"
    )

    # Return only the truly relevant chunks
    return [candidates[r.index] for r in results.results]

# Typical improvement: recall@3 goes from 65% to 88%
# Cost: ~$0.001 per rerank call (20 documents)
# Latency: adds ~150-250ms

The pattern: Retrieve 20 candidates with fast vector search (cheap, ~30ms). Rerank to find the true top 3 (accurate, ~200ms). This two-stage approach gives you both speed and quality. Cohere Rerank and Voyage Rerank are the most popular options, both costing fractions of a cent per call.

When to add reranking: If your retrieval accuracy is below 85% and you have tried improving chunking first. Reranking is the highest-ROI improvement for most RAG systems after chunking, but it does add 150-250ms of latency per query.

Measuring Retrieval Quality

You cannot improve what you do not measure. Most teams build RAG, test it manually with 5 queries, say "looks good," and ship. Then users report wrong answers and nobody knows if it is a retrieval problem, a chunking problem, or an LLM problem.

The Three Numbers That Matter

  • Recall@K: "For each query in my test set, does the correct chunk appear in the top K results?" Target: 85%+ at K=5.
  • Precision@K: "Of the top K results returned, how many are actually relevant?" Target: 70%+.
  • MRR (Mean Reciprocal Rank): "When the correct chunk appears, how high is it ranked?" Target: 0.7+.

If recall is low, fix chunking. If precision is low, add reranking. If MRR is low, upgrade your embedding model.
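All three metrics take only a few lines of pure Python to compute. A minimal sketch, assuming a labeled test set that maps each query to the set of chunk IDs a human marked as relevant:

```python
def retrieval_metrics(results_by_query, relevant_by_query, k=5):
    """Compute recall@k, precision@k, and MRR over a labeled test set.

    results_by_query:  {query: [chunk ids, ranked by the retriever]}
    relevant_by_query: {query: set of human-labeled relevant chunk ids}
    """
    recall_hits = 0
    precisions = []
    reciprocal_ranks = []

    for query, ranked in results_by_query.items():
        relevant = relevant_by_query[query]
        top_k = ranked[:k]

        # Recall@k: did at least one relevant chunk make the top k?
        recall_hits += any(cid in relevant for cid in top_k)

        # Precision@k: what fraction of the top k is relevant?
        precisions.append(sum(cid in relevant for cid in top_k) / k)

        # MRR: reciprocal rank of the first relevant result (0 if none)
        rr = 0.0
        for rank, cid in enumerate(top_k, start=1):
            if cid in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)

    n = len(results_by_query)
    return {
        "recall@k": recall_hits / n,
        "precision@k": sum(precisions) / n,
        "mrr": sum(reciprocal_ranks) / n,
    }

# Toy example with k=2: q1 hits at rank 1, q2 misses entirely
results = {"q1": ["a", "b"], "q2": ["x", "y"]}
relevant = {"q1": {"a"}, "q2": {"z"}}
metrics = retrieval_metrics(results, relevant, k=2)
```

The same function works for any retriever, which makes it easy to compare chunking or embedding changes on a fixed test set.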

The hybrid approach: conversation memory + RAG

Most production systems combine multiple strategies:

  • Short-term: Full conversation history (last 10-20 messages) — keeps the current conversation coherent
  • Medium-term: Summarization of older messages — preserves session context without token bloat
  • Long-term: RAG over documents and past sessions — gives the LLM access to knowledge beyond the current conversation
def build_context(user_query, conversation, knowledge_base):
    # Layer 1: System prompt (procedural memory)
    system = {"role": "system", "content": SYSTEM_PROMPT}

    # Layer 2: Summary of old conversation (compressed short-term)
    summary = summarize_if_needed(conversation[:-10])

    # Layer 3: RAG context (long-term/semantic memory)
    relevant_docs = retrieve(user_query, knowledge_base, top_k=3)
    rag_context = "\n".join(relevant_docs)

    # Layer 4: Recent conversation (verbatim short-term)
    recent = conversation[-10:]

    return [
        system,
        {"role": "system", "content": f"Session summary: {summary}"},
        {"role": "system", "content": f"Relevant docs:\n{rag_context}"},
        *recent
    ]

This is essentially what Cursor, ChatGPT, and other production AI tools do under the hood. The specific implementation varies, but the layered approach is universal.

Use RAG When
  • You have a large knowledge base (docs, codebases, wikis)
  • Information changes less frequently than daily
  • You need to cite sources for answers
  • Context needs exceed any model's context window
Don't Use When
  • Data changes in real-time (live dashboards, trading)
  • The corpus is tiny (<10 documents) — just put it in prompt
  • You need guaranteed retrieval of specific docs
  • Latency requirements are under 50ms
Decision Point
Your company documentation is updated weekly. Engineers complain the RAG system "knows old things." What's the most important metadata to track on each chunk?
A Document title and author — helps with source attribution
B Last updated timestamp — allows recency-weighted ranking
C Chunk length and position — helps with context reconstruction
D Embedding model version — ensures consistency
B is the key fix. When docs update weekly, timestamp metadata lets you boost recent documents in ranking. A 2024 doc about your API should outrank a 2022 doc even if semantic similarity is slightly lower. You can use hybrid scoring: final_score = 0.7 * semantic_score + 0.3 * recency_score. Title/author (A) helps attribution but doesn't solve staleness. Chunk metadata (C) helps reconstruction but not freshness. Embedding version (D) is for maintenance, not retrieval quality.
When Memory Systems Fail

Memory bugs are subtle. The system doesn't crash — it just gives confident wrong answers.

The Stale Memory Problem

[Diagram: data from six months ago flows out of memory into the LLM, which gives a confidently wrong answer]
Symptoms:
  • Users report "the bot is wrong" or "it said the old thing"
  • Answers reference deprecated features, old policies, or discontinued products
  • Correct information exists in newer documents but isn't retrieved
Root Cause: Embedding similarity doesn't consider recency. A 2019 document about "parental leave policy" scores almost identically to a 2024 document about the same topic.

How to Detect: Log retrieved chunks with their timestamps. If old documents appear in top-K results when newer versions exist, you have this problem. Create a test case: ask about something that changed recently.

How to Fix: Add last_updated metadata to every chunk. Use hybrid scoring: final_score = 0.7 * semantic + 0.3 * recency. Re-index on a schedule. For critical docs, implement "supersedes" relationships.
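The hybrid scoring formula above can be sketched in a few lines. The 0.7/0.3 weights and the 180-day half-life here are illustrative starting points, not tuned values:

```python
import time

def hybrid_score(semantic_score, last_updated_ts, now_ts=None,
                 w_semantic=0.7, w_recency=0.3, half_life_days=180):
    """Blend semantic similarity with document recency.

    Recency decays exponentially: a doc that is half_life_days old
    contributes half the recency weight of a brand-new doc.
    """
    now_ts = now_ts if now_ts is not None else time.time()
    age_days = max(0.0, (now_ts - last_updated_ts) / 86400)
    recency = 0.5 ** (age_days / half_life_days)  # 1.0 when new, -> 0 with age
    return w_semantic * semantic_score + w_recency * recency

# A slightly less similar but fresh doc outranks a stale near-duplicate
now = time.time()
fresh = hybrid_score(0.82, now - 7 * 86400, now_ts=now)        # 1 week old
stale = hybrid_score(0.85, now - 2 * 365 * 86400, now_ts=now)  # 2 years old
```

In practice you would apply this to the candidates returned by vector search and re-sort before building the prompt.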

The Leaked Context Problem

[Diagram: User A and User B share one memory store with no isolation, so User A sees User B's data]
Symptoms:
  • Users see information they didn't provide ("How do you know my address?")
  • Responses reference other users' orders, conversations, or personal data
  • Security audit reveals cross-user data in responses
Root Cause: Memory storage isn't namespaced. All users' data lives in the same vector space. Semantic search finds similar content regardless of ownership.

How to Detect: Create two test users. User A mentions "my secret project is called Phoenix." User B asks "what projects exist?" If B's response mentions Phoenix, you have a leak. This is a mandatory pre-launch test.

How to Fix: Namespace all stored data by user ID. Filter by user_id as a metadata filter before similarity search (not post-filter). Use separate collections per tenant for enterprise. This is non-negotiable for production.
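To make the "filter before search, not after" point concrete, here is a toy in-memory sketch. Real systems push the filter into the vector database's metadata query (e.g. a where filter in Chroma or a payload filter in Qdrant) so isolation is enforced at query time; the store and cosine scoring here are illustrative only:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(store, query_vec, user_id, top_k=3):
    """Pre-filter by user_id, THEN rank by similarity.

    Post-filtering (rank everyone's chunks, then drop other users')
    can silently return fewer than top_k results, and leaks data the
    moment the filter is forgotten anywhere in the pipeline.
    """
    candidates = [item for item in store if item["user_id"] == user_id]
    ranked = sorted(candidates,
                    key=lambda item: cosine(item["vec"], query_vec),
                    reverse=True)
    return ranked[:top_k]

store = [
    {"user_id": "user-a", "text": "Project Phoenix kickoff notes", "vec": [1.0, 0.1]},
    {"user_id": "user-b", "text": "Quarterly budget draft", "vec": [0.9, 0.2]},
]

# User B's query never sees User A's chunks, however similar they are
results = search(store, [1.0, 0.0], user_id="user-b")
```

The two-test-users check described above ("Phoenix" should never appear for User B) is the behavioral test that this structure makes trivially true.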

The Polluted Retrieval Problem

[Diagram: one relevant chunk and two noise chunks all reach the LLM, which gets confused by the noise]
Symptoms:
  • LLM gives vague or contradictory answers
  • Responses blend unrelated topics ("Deploy by going to the product meeting...")
  • Hallucinations increase despite having correct docs in the corpus
Root Cause: Embedding similarity is imprecise. "Deploy to production" matches "production line," "product management," and "productive meetings." High top-K amplifies noise.

How to Detect: Log retrieved chunks for each query. Manually review 20 random queries. If more than 30% of retrieved chunks are irrelevant, you have polluted retrieval.

How to Fix: (1) Reduce top-K from 10 to 3-5. (2) Add reranking: retrieve 20, use a cross-encoder to pick best 3. (3) Improve chunking — each chunk should be about ONE topic. (4) Add metadata filters (doc_type, section_header). Quality over quantity — 2 perfect chunks beat 10 mediocre ones.
The Tradeoffs (No Free Lunch)
  • Full History: fast, expensive, best accuracy. Best for short sessions.
  • Trimming: fast, cheap, low accuracy. Best for casual chat.
  • Summarization: medium speed, medium cost, medium accuracy. Best for long conversations.
  • RAG: medium speed, low cost, high accuracy. Best for knowledge bases.
  • Hybrid: slow, expensive, best accuracy. Best for production.
Decision Framework

One way to think about choosing a memory strategy:

  • How long are your sessions? Short → try full history. Medium → consider summarization. Long → explore hybrid/RAG.
  • Do you have external docs? No → conversation memory only. Yes → add a RAG layer.
What To Do Monday

Starting a new project?

Begin with full conversation history. Measure when it breaks. Add complexity only when data shows you need it.

Step 1: Just append messages

Users complaining about "forgetting"?

Add logging first. Track conversation lengths and failure points. Data tells you what to fix.

Step 1: Log before optimizing

Building RAG?

Start simple: basic chunking + vector search. Measure retrieval quality before adding complexity.

Most RAG failures = retrieval failures

Practice Mode: Apply What You Learned

Test your understanding with real scenarios you might face in production

1
You're building a customer support bot for an e-commerce site. Sessions average 15 messages. Users often ask about order status, returns, and product questions. What memory strategy should you start with?
Answer:
Start with full conversation history. At 15 messages, you're well under context limits (~5K tokens). Full history gives perfect accuracy with no complexity. Monitor conversation lengths in production. If they start exceeding 30 messages regularly, add summarization. Don't optimize prematurely — data will tell you when to evolve.
2
Your RAG system answers questions about company documentation. Users report it "knows the wrong things" — giving outdated information or answering about the wrong department's policies. How do you diagnose and fix this?
Answer:
This is a retrieval problem, not a generation problem. Three fixes: (1) Add metadata to chunks — department, document type, last_updated timestamp. Filter by department when querying. (2) Check your chunking — if chunks mix content from different sections, they'll match incorrectly. Use semantic chunking at section boundaries. (3) Add a reranking step — retrieve 10, rerank to 3. The cross-encoder catches semantic mismatches that embeddings miss.
3
Your LLM-powered coding assistant is costing $400/day in API calls. You're sending full conversation history on every request, and conversations often reach 80+ messages. What do you try first?
Answer:
Implement rolling summarization. Keep the last 15 messages verbatim, summarize everything older. Use a cheap model (GPT-4o-mini at $0.15/M tokens) for summarization. At 80 messages, you're paying for ~40K tokens per request. Summarization cuts this to ~8K tokens (last 15 messages + 500-token summary). That's an 80% cost reduction. You'll add ~300ms latency for the summarization call, but save ~$320/day. The ROI is immediate.
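The rolling-summarization pattern from this answer, sketched with a placeholder summarize_fn (in production that argument would wrap a call to a cheap model such as GPT-4o-mini):

```python
def build_rolling_context(messages, summarize_fn, keep_last=15):
    """Keep the newest messages verbatim; compress everything older.

    summarize_fn takes a list of messages and returns a short string --
    a stand-in here for a cheap-model summarization call.
    """
    if len(messages) <= keep_last:
        return list(messages)

    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = summarize_fn(older)
    return [
        {"role": "system",
         "content": f"Summary of earlier conversation: {summary}"},
        *recent,
    ]

# Stub summarizer for illustration only
def stub_summarize(msgs):
    return f"{len(msgs)} earlier messages about the user's project."

messages = [{"role": "user", "content": f"msg {i}"} for i in range(80)]
context = build_rolling_context(messages, stub_summarize)
# 80 messages collapse to 1 summary message + 15 verbatim messages
```

Caching the summary and only re-summarizing every N turns keeps the added latency off most requests.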
Cheat Sheet: LLM Memory

The Problem

  • LLMs are stateless — each call is independent
  • No built-in memory between API calls
  • Context windows have token limits

Memory Strategies

  • Full History: Send everything (expensive, simple)
  • Trimming: Drop old messages (lossy)
  • Summarization: Compress history (balanced)
  • RAG: Retrieve relevant context (scalable)

Memory Types

  • Working: Current conversation
  • Episodic: Past sessions
  • Semantic: External knowledge
  • Procedural: How-to instructions

Common Failures

  • Stale Memory: Uses outdated info
  • Leaked Context: User A gets User B's data
  • Polluted Retrieval: Too much noise in chunks

Starting Point

  • Short chats → Try full history first
  • Long chats → Consider summarization
  • Very long → Explore hybrid/RAG
  • External docs? → Add RAG

Coming in Part 2: Advanced Memory Patterns

Forgetting curves, multi-agent memory sharing, graph-based RAG, and memory that learns from itself.

Testing Memory Systems

Memory bugs are invisible at demo time. They emerge after 50+ messages, after documents change, or after users push edge cases you never imagined. Here is how to catch them before production does.

RAG retrieval quality test suite

The minimum viable test suite for any RAG system. Run this on every deployment.


def test_retrieval_quality(retriever, test_cases):
    """
    test_cases = [
        {"query": "How do I reset my password?",
         "expected_doc_id": "auth-guide-v3",
         "expected_in_top_k": 3},
        ...
    ]
    """
    hits = 0
    reciprocal_ranks = []

    for case in test_cases:
        results = retriever.search(case["query"], top_k=5)
        result_ids = [r.doc_id for r in results]

        if case["expected_doc_id"] in result_ids[:case["expected_in_top_k"]]:
            hits += 1
            rank = result_ids.index(case["expected_doc_id"]) + 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)

    recall = hits / len(test_cases)
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)

    assert recall >= 0.85, f"Recall dropped to {recall:.2f} (threshold: 0.85)"
    assert mrr >= 0.70, f"MRR dropped to {mrr:.2f} (threshold: 0.70)"

    return {"recall": recall, "mrr": mrr, "total": len(test_cases)}

Building the test set: Take your 50 most common user queries from production logs. For each, manually identify which document chunk should be retrieved. Store as JSON. Run this on every deployment. If recall drops below 85%, the deployment broke something — block it.
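For illustration, a stored test set might look like the JSON below. The file name and the second entry's IDs are hypothetical; the field names simply match the test function in this section:

```python
import json

# Hypothetical contents of a retrieval_test_set.json file -- in
# practice you would load it with json.load(open(...))
TEST_SET_JSON = """
[
  {"query": "How do I reset my password?",
   "expected_doc_id": "auth-guide-v3",
   "expected_in_top_k": 3},
  {"query": "What is the refund window?",
   "expected_doc_id": "returns-policy-2024",
   "expected_in_top_k": 5}
]
"""

test_cases = json.loads(TEST_SET_JSON)

# Sanity-check the schema before running the retrieval suite
required = {"query", "expected_doc_id", "expected_in_top_k"}
assert all(required <= case.keys() for case in test_cases)
```

Keeping the test set in version control alongside the code means retrieval regressions show up in the same CI run that introduced them.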

Context window edge case tests

These tests catch the bugs that only appear after long conversations.

def test_context_window_overflow():
    """Simulate a 200-message conversation and verify
    the system handles it gracefully (summarize/trim)."""
    messages = []
    for i in range(200):
        messages.append({"role": "user", "content": f"Message {i}: " + "x" * 500})
        messages.append({"role": "assistant", "content": f"Reply {i}"})

    # This should NOT throw a token limit error
    context = build_context(messages)
    total_tokens = count_tokens(context)
    assert total_tokens <= MODEL_MAX_TOKENS * 0.9

def test_early_info_retained():
    """User mentions their name in message 1.
    After 50 messages, the system should still know it."""
    messages = [
        {"role": "user", "content": "My name is Ahmed and I work at Acme Corp"},
        # ... 50 filler messages ...
        {"role": "user", "content": "What's my name?"}
    ]
    response = get_completion(build_context(messages))
    assert "Ahmed" in response

Why this matters: Trimming strategies silently drop the first messages. Summarization might omit "unimportant" details like user names. These tests catch the failure before users do.

وَاللَّهُ أَعْلَمُ

And Allah knows best

وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ

May Allah's peace and blessings be upon our master Muhammad and his family

Was this helpful?

Comments

Loading comments...

Leave a comment