بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ
In the name of Allah, the Most Gracious, the Most Merciful
Your chatbot remembers the user's name for exactly 47 messages. Message 48: "I'm sorry, what was your name again?"
Meanwhile, a medical triage bot silently drops "Patient is allergic to penicillin" from context on message 12. The allergy was mentioned on message 3. Nobody notices until the prescription.
These aren't edge cases — they're what happens when you treat memory as an afterthought. Every production AI system eventually hits this wall: the model forgets, the context overflows, the wrong information gets retrieved, or worse — critical information gets silently dropped.
This post is the complete guide to AI memory — from the fundamental statelessness problem to the paradigm shift of context engineering. No prerequisites beyond curiosity.
- Part 1: The memory problem — why LLMs forget everything between calls, and two strategies (trimming & summarization) to manage the context window
- Part 2: How agents remember — five memory types, agentic retrieval, Zettelkasten-inspired memory, forgetting curves, and RAG during reasoning
- Part 3: When memory fails — ten failure modes from stale memories to scale collapse, plus the "lost in the middle" problem and context rot
- Part 4: Context engineering — the paradigm shift from "fill the window" to "engineer the context," with five principles and multi-agent strategies
- You've built a chatbot that works great for 5 messages but forgets everything by message 20
- You've heard "context window," "RAG," or "context engineering" and want to understand what they actually mean for building AI systems
- You're wondering why your AI agent gives perfect answers on Monday but completely wrong ones by Thursday — despite having the right documents
- You read the RAG & Agents post and want the deep dive into memory that it introduced but couldn't fully explore
The Goldfish Problem
Imagine you're at a coffee shop talking to a brilliant consultant. You explain your entire business problem — the architecture, the constraints, the deadline. They give perfect advice. You come back the next morning, and they greet you like they've never seen you before. Every conversation starts from absolute zero.
That's how LLMs work. Not because they're poorly designed, but because of a fundamental architectural choice: every API call is completely isolated. When you send a message to GPT-4o, Claude, or any other LLM, the model processes your input, generates a response, and then — poof — forgets the entire interaction ever happened.
There's no internal notebook. No session state. No "hey, we talked about this yesterday." The model is stateless in the strictest sense: it has zero memory between calls.
This isn't a bug — it's a design choice. LLMs are token-prediction machines: they take a sequence of input tokens, process them through billions of parameters, and output the next most likely tokens. Once the response is generated, all that processing is discarded. The model doesn't "learn" from your conversation or store it anywhere.
If you've used ChatGPT or Claude and felt like they "remember" your conversation, that's because the application layer (the chat interface) is doing the remembering, not the model. Behind the scenes, the app stores your conversation history and sends the entire thing back to the model with every new message. The model isn't remembering — it's being reminded.
"Just Send Everything"
The simplest solution to statelessness? Keep a growing list of every message in the conversation, and send all of them to the model every single time.
Think of it like recording every meeting on your team. When you need context for a new discussion, you play back all previous recordings first. For the first few meetings, this works perfectly. By meeting 50, you're spending more time watching recordings than having actual discussions.
This approach — appending messages and sending everything — is how most chatbots start. You maintain an array of messages (system prompt + conversation turns), and each API call sends the entire array.
For short conversations — say 5 to 15 messages — this works fine. The model sees the full context and gives coherent, contextual responses. But there's a hard ceiling on how much you can send, and it's called the context window.
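The append-and-send pattern can be sketched in a few lines. This is an illustrative skeleton, not any particular SDK's API: `client` stands in for whatever chat-completion client you use (e.g. `openai.OpenAI()`), and only the bookkeeping around it is the point.

```python
# Sketch of the "send everything" pattern. The application, not the
# model, carries the memory: every call resends the full history.

def build_request(history: list[dict], user_input: str) -> list[dict]:
    """Everything the model will see this turn: stored history
    plus the new user message."""
    return history + [{"role": "user", "content": user_input}]

def record_turn(history: list[dict], user_input: str, reply: str) -> list[dict]:
    """Append both sides of the exchange so the next call can resend them."""
    return history + [
        {"role": "user", "content": user_input},
        {"role": "assistant", "content": reply},
    ]

history = [{"role": "system", "content": "You are a helpful assistant."}]
history = record_turn(history, "Hi, I'm Dana.", "Nice to meet you, Dana!")

# Next turn: the model isn't remembering — it's being reminded.
request = build_request(history, "What's my name?")
# request now holds 4 messages: system, user, assistant, user
```

Each turn, you would pass `request` to the model (e.g. `client.chat.completions.create(model=..., messages=request)`) and feed the reply back through `record_turn`. The list grows without bound — which is exactly the problem the next section quantifies.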
The Math That Matters: Context Windows
The context window is the total amount of information a model can process in a single call — both the input you send and the output it generates. It's measured in tokens, where one token is roughly three-quarters of a word (or about four characters). So 1,000 tokens is approximately 750 English words.
Think of the context window as a desk. Everything the model can "see" has to fit on this desk — your system prompt, the entire conversation history, any retrieved documents, and the response it's generating. If your papers overflow the desk, they fall off the edge and disappear.
You might look at Gemini 2.5's million-token window and think: "Problem solved. Just dump everything in." But three forces conspire against you:
- Cost scales linearly with tokens. Sending 100K tokens every call is 25x more expensive than sending 4K. For high-volume applications, this adds up fast.
- Latency increases with context size. The model has to process every token you send. More tokens means slower responses — sometimes painfully so.
- Performance degrades with length. This is the counterintuitive one. Research consistently shows that models lose track of information in long contexts, especially information placed in the middle. We'll explore this deeply in Part 3.
So even with massive context windows, you need strategies to manage what goes in. The two most fundamental strategies are trimming and summarization.
Context windows have grown dramatically over just a few years. Here's the current landscape:
| Model | Context Length | Approx. Pages |
|---|---|---|
| GPT-3.5 (2023) | 4,096 tokens | ~12 pages |
| GPT-4 (2023) | 8,192 / 32,768 | ~24–100 pages |
| GPT-4o (2024) | 128,000 tokens | ~300 pages |
| Claude 3.5 Sonnet (2024) | 200,000 tokens | ~500 pages |
| Gemini 1.5 Pro (2024) | 2,000,000 tokens | ~5,000 pages |
| Claude 3.7 Sonnet (2025) | 200,000 tokens | ~500 pages |
| Gemini 2.5 (2025) | 1,000,000 tokens | ~2,500 pages |
| Llama 3.x (2025) | 8,000–128,000 | Varies by config |
Important: "Context length" is the maximum the model can process, not what it should. Just because Gemini can handle a million tokens doesn't mean it will handle them well. Models often struggle with information placed in the middle of very long contexts (the "lost in the middle" phenomenon we cover in Part 3).
Also note the difference between context length (maximum input) and output length (maximum response). Most models have output limits of 4K–16K tokens, even with massive input windows.
Strategy 1: Trimming — First In, First Out
The simplest strategy: when the conversation gets too long, drop the oldest messages. Keep the system prompt (which defines the AI's behavior) and the most recent messages. Everything else falls off the edge.
Think of a conveyor belt at an airport baggage claim. New bags arrive from one end, and once the belt is full, the oldest bags drop off the other end. Your system prompt is the one bag that's bolted to the belt — it never moves. Everything else is first in, first out.
Trimming is fast, cheap, and easy to implement. You count tokens from the newest message backward until you hit the limit, and everything older gets cut. No extra API calls, no processing overhead.
But trimming has a critical weakness: it has no concept of importance. A user mentioning their name on message 1 is just as likely to be trimmed as a casual "sounds good" on message 3. A patient's drug allergy mentioned early in a conversation can be silently dropped when newer, less important messages push it out.
For some applications — casual chat, quick Q&A sessions, interactions under 20 messages — trimming works perfectly. For anything involving safety-critical information, long-running tasks, or users who expect continuity, you need something smarter.
Different models tokenize text differently, which means the same sentence can use different numbers of tokens across models. This matters for trimming because you need an accurate count.
General rules of thumb:
- 1 token ≈ 4 characters in English (roughly ¾ of a word)
- 1,000 tokens ≈ 750 words ≈ 3 pages of text
- Code is typically more token-dense than prose (more special characters)
- Non-English languages often use more tokens per word (Chinese, Arabic, etc.)
For OpenAI models, use the tiktoken library for exact counts. For Claude, Anthropic provides a token counting API. For open-source models, each has its own tokenizer (usually a SentencePiece or BPE variant).
Why exact counts matter: If your trimming function estimates "this message is about 100 tokens" but it's actually 150, you might exceed the context window and get an error. Always use the model's actual tokenizer for production systems.
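To make the estimate-versus-exact gap concrete, here is a sketch with both: a character-based heuristic and an exact count via the model's tokenizer (the exact version assumes `tiktoken` is installed; the model name is just an example).

```python
def estimate_tokens(text: str) -> int:
    """Crude rule of thumb: ~4 characters per token in English."""
    return max(1, len(text) // 4)

def exact_tokens(text: str, model: str = "gpt-4o") -> int:
    """Exact count using the model's own tokenizer (requires tiktoken)."""
    import tiktoken
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

# The heuristic is fine for dashboards and rough budgeting, but
# punctuation- and code-heavy text can diverge sharply from ~4 chars
# per token — near the context limit, always use exact_tokens().
```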
Don't forget the output. The context window includes both input AND output tokens. If you fill 95% of the window with input, the model only has 5% left to generate a response, which often leads to truncated or incoherent answers.
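A simple way to respect the shared input/output budget is to subtract a reserved output allowance (plus a safety margin) before filling the window. The numbers below are illustrative defaults, not model requirements:

```python
def input_budget(context_window: int, reserved_output: int = 1024,
                 safety_margin: int = 256) -> int:
    """Tokens available for system prompt + history + retrieved docs,
    after reserving space for the model's response."""
    return context_window - reserved_output - safety_margin

budget = input_budget(128_000)  # a GPT-4o-sized window
# budget == 126_720 — trim/summarize the input down to this, and pass
# the same reservation to the API (e.g. max_tokens=1024) so the model
# can't be cut off mid-answer.
```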
Basic FIFO trimming — count tokens from newest to oldest, cut when full:
```python
def trim_messages(messages, max_tokens=4000, model="gpt-4o"):
    """Keep the system prompt + most recent messages that fit."""
    import tiktoken
    enc = tiktoken.encoding_for_model(model)

    # Always keep the system prompt
    system_msg = messages[0]
    system_tokens = len(enc.encode(system_msg["content"]))

    # Add messages from newest to oldest until we hit the limit
    trimmed = []
    token_count = system_tokens
    for msg in reversed(messages[1:]):
        msg_tokens = len(enc.encode(msg["content"]))
        if token_count + msg_tokens > max_tokens:
            break
        trimmed.insert(0, msg)
        token_count += msg_tokens

    return [system_msg] + trimmed
```
Rolling window with LangGraph — for agentic systems, you can implement a rolling context window where the full interaction is passed until the window fills up, then the oldest parts are ejected first-in, first-out. This is easy to implement, low in complexity, and works for many use cases.
Priority-aware trimming — the critical improvement over basic FIFO:
```python
def smart_trim(messages, max_tokens=4000):
    """Trim with priority: pinned > recent > old."""
    system = messages[0]
    pinned = [m for m in messages[1:] if m.get("pinned")]
    regular = [m for m in messages[1:] if not m.get("pinned")]

    # Start with system + pinned (always kept).
    # count_tokens wraps the model's tokenizer (e.g. tiktoken).
    result = [system] + pinned
    token_count = count_tokens(result)

    # Add regular messages from newest to oldest, inserted just after
    # the pinned block so chronological order is preserved
    head = len(result)
    for msg in reversed(regular):
        msg_tokens = count_tokens([msg])
        if token_count + msg_tokens > max_tokens:
            break
        result.insert(head, msg)
        token_count += msg_tokens

    return result
```
The pinned flag lets you mark safety-critical messages (allergies, constraints, user identity) that should never be trimmed, regardless of age.
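Here is what that looks like in practice. The `pinned` key is an application-level convention (the shape `smart_trim` expects), not a field any model API understands:

```python
# Messages for a medical triage session; the allergy is flagged as
# pinned so no trimming strategy can ever drop it.
messages = [
    {"role": "system", "content": "You are a medical triage assistant."},
    {"role": "user", "content": "I'm severely allergic to penicillin.",
     "pinned": True},
    {"role": "assistant", "content": "Noted — no penicillin-class drugs."},
    {"role": "user", "content": "What should I take for a sinus infection?"},
]

pinned = [m for m in messages if m.get("pinned")]
# Exactly one message survives any amount of trimming: the allergy.
```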
Strategy 2: Summarization — Compress, Don't Delete
What if instead of throwing away old messages, you condensed them into a summary? Rather than losing that the user mentioned their name, their project requirements, and their deadline, you compress 20 messages into a single paragraph that captures the essential facts.
Think of it as the difference between keeping every receipt from a year of grocery shopping versus keeping a monthly budget spreadsheet. The receipts have all the detail, but the spreadsheet tells you everything you actually need to know.
The approach: when the conversation exceeds a threshold, take the oldest messages, send them to a (usually cheaper) model with a summarization prompt, and replace them with the resulting summary. The model now sees: system prompt → summary of old messages → last 6 recent messages.
Summarization preserves intent where trimming destroys it. The summary captures "the user is Bahgat, building a FastAPI service with JWT authentication, deadline is March" even though those details were scattered across messages 1 through 15. With trimming, all of that would be gone once message 20 arrives.
The tradeoff? Summarization costs extra. Each summarization step requires an additional API call (typically using a cheaper model like GPT-4o-mini). It adds 300–800ms of latency. And summarization is lossy — you inevitably lose some nuance and detail. A summary might capture "the user has a food allergy" but drop the specific allergen. For casual conversations that's fine. For medical applications, it's dangerous.
But there's not just one way to summarize. In practice, there are three distinct strategies — stacking summaries (summarize each old batch and keep every summary), batch summarization (periodically re-summarize older material in batches), and the single rolling summary (one summary rewritten to fold in each new batch) — and the right choice depends on your use case:
For short conversations (under 20 turns), stacking summaries works fine — the stack stays small. For longer interactions, batch or rolling summaries are necessary. The single rolling summary is the most aggressive: it never grows, but it risks overwriting important early details if the conversation shifts topics. In practice, many production systems use a hybrid: a rolling summary for the bulk of history, plus a separate "critical facts" list that's never summarized away.
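The hybrid just described can be sketched with a small class. The summarizer is stubbed with a plain function to keep the sketch self-contained; in a real system it would be an LLM call:

```python
# Hybrid memory sketch: one rolling summary that is rewritten each
# compression round, plus a critical-facts list that is only appended
# to and never summarized away.
class HybridMemory:
    def __init__(self, summarize):
        self.summarize = summarize   # callable: (old_summary, new_msgs) -> str
        self.rolling_summary = ""
        self.critical_facts = []     # pinned facts, never compressed

    def pin_fact(self, fact: str):
        self.critical_facts.append(fact)

    def compress(self, old_messages: list[str]):
        # Fold the old messages into the single rolling summary
        self.rolling_summary = self.summarize(self.rolling_summary, old_messages)

    def context_prefix(self) -> str:
        facts = "\n".join(f"- {f}" for f in self.critical_facts)
        return f"CRITICAL FACTS:\n{facts}\n\nSUMMARY:\n{self.rolling_summary}"

# Stub summarizer (concatenation) for illustration only:
mem = HybridMemory(lambda old, msgs: (old + " " + " ".join(msgs)).strip())
mem.pin_fact("Severe penicillin allergy")
mem.compress(["User is Bahgat.", "Deadline is March."])
# context_prefix() now carries both the pinned fact and the summary
```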
Basic summarization with preservation priorities:
```python
def summarize_and_compress(messages, max_tokens=4000):
    """Summarize old messages, keep recent ones verbatim."""
    import tiktoken
    enc = tiktoken.encoding_for_model("gpt-4o")
    total = sum(len(enc.encode(m["content"])) for m in messages)
    if total <= max_tokens:
        return messages  # No need to summarize yet

    # Keep system prompt + last 6 messages verbatim
    system = messages[0]
    recent = messages[-6:]
    old = messages[1:-6]
    if not old:
        return messages  # Nothing to summarize

    # Summarize the old messages (client is an OpenAI() instance)
    summary = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for summarization
        messages=[{
            "role": "system",
            "content": """Summarize this conversation. Preserve:
- User's name, preferences, and stated requirements
- Key decisions made and their reasoning
- Technical context (language, framework, architecture)
- Any constraints or deadlines mentioned
- Safety-critical information (allergies, access levels, etc.)
Be concise but don't lose critical facts.""",
        }] + old,
    ).choices[0].message.content

    return [
        system,
        {"role": "system",
         "content": f"Previous conversation summary: {summary}"},
        *recent,
    ]
```
Production tip: Use a cheaper model (GPT-4o-mini, Claude Haiku) for summarization. You're compressing, not reasoning — smaller models handle this well at a fraction of the cost.
Structured summarization prompt (for applications where specific categories of information matter):
```python
SUMMARY_PROMPT = """Summarize the following conversation into these categories:

USER IDENTITY: Name, role, preferences
PROJECT CONTEXT: What they're building, tech stack, architecture
CONSTRAINTS: Deadlines, budgets, requirements, limitations
DECISIONS MADE: What was decided and why
CURRENT STATE: Where the conversation left off
OPEN QUESTIONS: Anything unresolved

Be concise. Use bullet points. Preserve exact numbers, dates, and names."""
```
This structured approach makes it easier to verify that critical information survived the summarization step.
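One cheap way to do that verification, sketched here as an illustration: check that strings flagged as critical appear verbatim in the summary, and re-inject or pin whatever was dropped.

```python
def verify_summary(summary: str, must_preserve: list[str]) -> list[str]:
    """Return the critical strings the summary dropped (case-insensitive)."""
    return [item for item in must_preserve
            if item.lower() not in summary.lower()]

summary = "USER IDENTITY: Bahgat, backend dev.\nCONSTRAINTS: deadline March."
lost = verify_summary(summary, ["Bahgat", "March", "penicillin allergy"])
# lost == ["penicillin allergy"] — this fact must be re-injected or
# pinned before the next turn, not silently forgotten.
```

Exact-substring matching is deliberately strict: it produces false alarms when the summarizer paraphrases, but for safety-critical facts a false alarm is the cheaper failure.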
The code above shows the raw approach. In production, LangGraph handles summarization automatically with its SummarizationNode:
```python
# pip install langmem langgraph langchain-openai
from typing import Any
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, START, MessagesState
from langgraph.checkpoint.memory import InMemorySaver
from langmem.short_term import SummarizationNode

model = ChatOpenAI(model="gpt-4o")

# SummarizationNode keeps its running summary in a "context" channel,
# so the state schema needs that key alongside messages
class State(MessagesState):
    context: dict[str, Any]

# Configure automatic summarization
summarization_node = SummarizationNode(
    token_counter=model.get_num_tokens_from_messages,
    model=model.bind(max_tokens=128),
    max_tokens=256,                  # max tokens passed to the LLM
    max_tokens_before_summary=256,   # trigger summarization here
    max_summary_tokens=128,          # max summary size
)

def call_model(state):
    return {"messages": [model.invoke(state["summarized_messages"])]}

# Build graph: summarize → call model
builder = StateGraph(State)
builder.add_node("summarize", summarization_node)
builder.add_node("call_model", call_model)
builder.add_edge(START, "summarize")
builder.add_edge("summarize", "call_model")
graph = builder.compile(checkpointer=InMemorySaver())

# Use it — summarization happens automatically when context grows
config = {"configurable": {"thread_id": "1"}}
graph.invoke({"messages": "Hi, I'm Bob"}, config)
graph.invoke({"messages": "I'm building a RAG system"}, config)
response = graph.invoke({"messages": "What's my name?"}, config)
# Bob ✓ — the summary preserved it
```
Summarize every N turns (alternative pattern): Instead of token-based triggering, use a conditional edge that counts messages:
```python
from langgraph.graph import END

# After the model responds, check the message count
def should_summarize(state):
    if len(state["messages"]) > 6:
        return "summarize_conversation"
    return END

workflow.add_conditional_edges("conversation", should_summarize)
```
Note: LangChain's older ConversationSummaryMemory and ConversationSummaryBufferMemory are deprecated since v0.3.1. The LangGraph + langmem approach above is the current recommended path.
If the allergy information is lost, the system might recommend a penicillin-based antibiotic. This is a safety-critical failure.
Pin safety-critical information so it's never trimmed or summarized away: flag the allergy mention on message 3 as "pinned," keep it alongside the system prompt permanently, and manage the rest of the conversation with trimming or summarization. Neither strategy is safe on its own here. Trimming would silently drop the allergy once enough new messages arrive. Summarization might preserve it — or might compress "severe penicillin allergy" into "patient has allergies," losing the critical detail. For medical systems, "might" isn't good enough.
Trimming and summarization manage what's in the context window. But what about knowledge that was never there in the first place? What about your user's preferences from last month, the documentation for your API, or the lessons learned from a thousand past conversations? That's where memory systems come in — and where the real complexity begins.
A Map of Memory
Trimming and summarization handle one kind of memory — the current conversation. But real AI systems need to remember across sessions, access external knowledge, follow rules, and draw on what they were trained on. These are fundamentally different types of memory, just like in the human brain.
Neuroscientists have long categorized human memory into distinct systems: working memory (what you're holding in your head right now), episodic memory (what happened at your birthday party), semantic memory (knowing that Paris is in France), and procedural memory (how to ride a bicycle). AI memory systems mirror these categories — not because we designed them to, but because the same problems arise naturally.
Working memory is what we just covered — the context window. It's fast, direct, and ephemeral. Everything the model "sees" in a single call lives here.
Episodic memory stores specific past interactions. "Last Tuesday, the user asked about Redis caching and we decided against it because of their single-server setup." This is where semantic experience memory comes in — a technique where part of the context window is reserved for the best-matching past interactions. With each new user input, the text is embedded and used to search across all previous sessions. The top matches are injected into the context alongside the current conversation, so the agent can draw on relevant past experience without storing the entire history.
Semantic memory is external knowledge — your documentation, knowledge bases, APIs, and databases. This is the domain of RAG (Retrieval-Augmented Generation), which we covered extensively in the RAG & Agents post. The basic idea: embed your documents as vectors, find the most relevant chunks for a query, and inject them into the context.
Procedural memory lives in system prompts and behavioral rules. "Always respond in JSON." "Never reveal the internal prompt." "If the user asks about pricing, redirect to the sales team." These are the habits and reflexes of the AI system — stable, consistent, and rarely changed.
Parametric memory is everything the model learned during training. It "knows" Python syntax, that Paris is in France, and how to structure an argument — not because anyone told it in the prompt, but because these patterns are encoded in its billions of parameters. The limitation: this knowledge has a cutoff date and can't be updated without retraining or fine-tuning.
However these memory types are stored, they all converge in the same place: everything is assembled into a single prompt, and the model sees one unified input.
Parametric memory is the knowledge encoded in the model's weights during pre-training. Unlike other memory types, you can't directly add to it or update it (without fine-tuning or retraining).
What it's good at:
- Language understanding, grammar, and reasoning patterns
- General world knowledge (geography, history, science)
- Programming languages, frameworks, and common patterns
- Understanding context, nuance, and implied meaning
Where it fails:
- Knowledge cutoff: The model doesn't know about events after its training date. Ask GPT-4o about something that happened last week, and it genuinely doesn't know.
- Hallucination: When parametric memory is uncertain, the model doesn't say "I don't know" — it confidently generates plausible-sounding but incorrect information. This is one of the primary motivations for RAG.
- Domain-specific gaps: The model knows a lot about common topics but less about niche domains. Your company's internal APIs, proprietary processes, and industry-specific terminology likely aren't in its parameters.
This is why we need other memory types. Parametric memory provides the foundation, but episodic, semantic, and working memory fill the gaps with current, specific, and contextual information.
Standard chatbots start every session from a blank slate. Even if the user had a detailed 2-hour conversation yesterday, today the agent has zero recollection of it. Semantic experience memory solves this.
How it works:
- Every interaction (user message + agent response) is stored and embedded in a vector database
- When the user sends a new message, their input is also embedded
- A vector similarity search finds the most relevant past interactions across all previous sessions
- Part of the context window is reserved for these matches — typically 10–20% of the available space
- The rest of the context window holds the system prompt, current conversation, and any RAG results
The key insight: You're not retrieving all of last Tuesday's conversation. You're retrieving the 3–5 past interactions that are most semantically relevant to what the user is asking right now. If they asked about Redis caching last week and are now asking about caching again, those specific exchanges surface automatically.
This enables agents to not just draw upon a broad base of knowledge but also tailor their responses based on accumulated experience, leading to more adaptive and personalized behavior.
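The retrieval loop above can be sketched end to end. To keep it runnable without any API, this uses a toy bag-of-words "embedding" with cosine similarity; in production you would use a real embedding model and a vector store:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts. Stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Store every past interaction with its embedding (step 1)
past_interactions = [
    "User asked about Redis caching; decided against it (single server).",
    "User asked about JWT auth in FastAPI.",
    "User discussed deployment on a single VPS.",
]
store = [(text, embed(text)) for text in past_interactions]

def recall(query: str, k: int = 2) -> list[str]:
    """Steps 2-4: embed the new input, rank all past interactions by
    similarity, return the top-k to inject into the context."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# A new question about caching surfaces the Redis exchange from a past session
relevant = recall("Should we add caching to the API?")
```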
The foundation for semantic memory is embeddings — vector representations that capture meaning — and the field has evolved rapidly.
Where embeddings live — vector stores:
FAISS (Facebook AI Similarity Search) and Annoy (Approximate Nearest Neighbors Oh Yeah) are optimized for fast similarity searches over high-dimensional vectors. Managed services like Pinecone, Weaviate, and Chroma handle the infrastructure and scale concerns for you.
Option A: Chroma (easiest to start, runs locally, no server needed):

```python
# pip install langchain-chroma langchain-openai
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_texts(
    texts=["your", "documents", "here"],
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db",  # saves to disk
)

# Search
results = vectorstore.similarity_search("your query", k=3)
```
Option B: FAISS (faster for large datasets, Facebook's battle-tested library):
```python
# pip install langchain-community faiss-cpu langchain-openai
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

vectorstore = FAISS.from_texts(
    texts=["your", "documents", "here"],
    embedding=OpenAIEmbeddings(),
)

# Save / load
vectorstore.save_local("./faiss_index")
loaded = FAISS.load_local(
    "./faiss_index", OpenAIEmbeddings(),
    allow_dangerous_deserialization=True,
)
```
Which to choose: Chroma for prototyping and small datasets (built-in persistence, metadata filtering). FAISS for production workloads with millions of vectors (optimized C++ core). Pinecone, Weaviate, or Qdrant for managed cloud with zero ops.
As context windows grow to millions of tokens, a new concept is emerging: index-free RAG. Instead of maintaining external vector stores, you load entire knowledge bases directly into the model's context and let its attention mechanisms do the retrieval internally. Models like GPT-4.1 and Gemini 2.5 can process 1M+ tokens, potentially making external retrieval unnecessary for smaller datasets. The trade-off: massive compute cost and no guarantee the model finds the relevant passage in a sea of tokens. For now, hybrid approaches (selective retrieval + long context) remain the practical choice.
Google Gemini's 1M-token context window + caching makes index-free RAG practical for small-to-medium knowledge bases:
```python
# pip install google-genai
from google import genai
from google.genai import types

client = genai.Client()

# Upload your entire knowledge base as files
file1 = client.files.upload(file="company_handbook.pdf")
file2 = client.files.upload(file="product_docs.pdf")

# Cache the knowledge base (pay once for ingestion)
cache = client.caches.create(
    model="models/gemini-2.5-pro",
    config=types.CreateCachedContentConfig(
        display_name="company_kb",
        system_instruction="Answer based on the knowledge base.",
        contents=[file1, file2],
        ttl="3600s",  # 1 hour cache
    ),
)

# Each query now costs 75-90% less for the cached portion
response = client.models.generate_content(
    model="models/gemini-2.5-pro",
    contents="What is our refund policy for enterprise?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```
When this beats traditional RAG:
- Knowledge base under ~750 pages (fits in 1M tokens)
- Repeated queries over the same corpus (caching amortizes cost)
- You need zero infrastructure (no vector store, no embeddings, no chunking)
When traditional RAG still wins:
- Millions of documents (won't fit in any context window)
- Need for metadata filtering ("show me only Q4 reports")
- Latency-sensitive applications (~45s for full context vs ~1s for RAG retrieval)
Anthropic alternative: Claude supports 200K tokens standard (1M in beta). Use XML tags to organize the knowledge base: <knowledge_base>...</knowledge_base> for cleaner context structure.
Beyond Basic RAG: Three Advances
In the RAG & Agents post, we covered the standard retrieval pipeline: embed your documents, store them in a vector database, find the most similar chunks for a query, and inject them into the prompt. That pipeline treats retrieval as a static, one-shot step that runs before the model generates output.
But what if the agent could decide what, when, and how to retrieve? What if it could build interconnected notes like a researcher? What if retrieval could happen during reasoning, not just before it? These three advances push memory beyond basic RAG.
1. Agentic RAG — The Agent Takes Control
In standard RAG, the vector database is the long-term memory of the LLM — but the LLM has no say in what gets retrieved. It's like a student who gets handed a stack of textbook pages by a librarian and has to make do with whatever was selected. Agentic RAG gives the student direct access to the library card catalog.
In agentic RAG, retrieval becomes a tool that the agent can invoke on its own terms. Instead of a fixed pipeline that always runs the same way, the agent decides:
- Whether to search at all — sometimes the answer is already in the conversation
- Which source to search — documentation? Slack messages? The user's past sessions? A web search?
- What query to use — the agent can reformulate the user's question for better retrieval
- Whether to search again — if the first results aren't sufficient, it can refine and search again
In practice, agentic RAG often uses a router pattern: the agent has access to multiple knowledge sources (documentation, Slack, email, web search, past sessions) as individual tools, and it picks the right one(s) based on the query. It might search your API docs first, find insufficient information, then run a web search for a more recent answer.
This also extends to multi-agent RAG, where specialized retrieval agents each handle a specific knowledge source. A coordinator agent routes the query to the right specialist, collects results, and synthesizes a response. Think of it as having a team of research assistants instead of one generalist.
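The router pattern reduces to a simple shape: knowledge sources as tools, plus a routing decision. In this sketch the routing is stubbed with keyword rules so it runs standalone; in a real system the LLM makes that choice via tool calling, and the tool bodies hit actual backends:

```python
# Stub tools — each stands in for a real retrieval backend
def search_docs(query: str) -> str:
    return f"[docs] results for {query!r}"

def search_slack(query: str) -> str:
    return f"[slack] results for {query!r}"

def search_web(query: str) -> str:
    return f"[web] results for {query!r}"

TOOLS = {"docs": search_docs, "slack": search_slack, "web": search_web}

def route(query: str) -> str:
    """Crude stand-in for the agent's routing decision."""
    q = query.lower()
    if "api" in q:
        return "docs"       # technical questions → documentation
    if "discussed" in q or "said" in q:
        return "slack"      # "who said what" → chat history
    return "web"            # everything else → web search

source = route("How do I paginate the API?")
answer = TOOLS[source]("How do I paginate the API?")
# The agent could inspect `answer`, judge it insufficient, and route
# the same query to a second source — that's the "search again" step.
```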
2. A-MEM — The Zettelkasten for AI
The Zettelkasten method is a note-taking system used by prolific writers and researchers. The core idea: each note contains exactly one unit of knowledge (atomicity), and notes are heavily linked to each other (hypertextual). Over time, you build a web of interconnected ideas where any note can lead you to related concepts you might not have found otherwise.
A-MEM (Agentic Memory) borrows this idea for AI agents. Instead of storing raw conversations or document chunks, each "memory" is an enriched note containing:
- The original interaction (one conversation turn)
- A timestamp
- LLM-generated keywords that capture key concepts
- LLM-generated tags to categorize the interaction
- An LLM-generated contextual description that summarizes the significance
All of this is concatenated and embedded into a single vector. The clever part: when a new memory is added, the system searches for existing memories with similar embeddings. The LLM is then asked to evaluate each candidate and decide which should be linked to the new memory. After linking, connected memories are updated — their tags, keywords, and descriptions evolve to reflect the new relationship.
This creates a living knowledge graph that doesn't just accumulate memories but actively maintains an evolving worldview. When the agent retrieves a memory, it can also traverse links to discover related memories it might not have found through embedding similarity alone.
Step 1: Create the note. A single interaction is processed by the LLM to extract keywords, tags, and a contextual description. These are concatenated with the original text and embedded into a single vector.
Step 2: Find related memories. The new note's embedding is compared against all existing memories via similarity search. The top-K candidates are retrieved.
Step 3: LLM-based linking. The LLM evaluates each candidate and decides: should this existing memory be linked to the new one? If yes, what kind of relationship is it? (expansion, refinement, contradiction, new branch)
Step 4: Evolution. After linking, the LLM is prompted to update the tags, keywords, and descriptions of connected memories. A note about "Redis caching" that gets linked to a new note about "caching failures" might have its description updated to mention failure modes.
Why this matters: Traditional vector stores just accumulate chunks. A-MEM creates a living network where adding new knowledge refines existing knowledge. It mirrors how human experts develop deeper understanding over time — new experiences don't just stack up; they reshape how we understand earlier experiences.
The downside: every new memory requires multiple LLM calls (extraction, similarity search, linking, evolution), making it significantly more expensive than simple vector storage. This is a tradeoff between memory quality and operational cost.
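The four steps can be captured in a data-shape sketch. The LLM calls (keyword extraction, link evaluation, description evolution) are stubbed — here linking happens on shared keywords — so only the structure and flow follow A-MEM, not its actual prompts:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class MemoryNote:
    content: str                 # the original interaction
    timestamp: str
    keywords: list[str]          # LLM-extracted in the real system
    tags: list[str]              # LLM-generated categories
    description: str             # LLM-generated contextual summary
    links: list[int] = field(default_factory=list)  # related note ids

notes: list[MemoryNote] = []

def add_note(content, keywords, tags, description) -> int:
    note = MemoryNote(content, datetime.now().isoformat(),
                      keywords, tags, description)
    new_id = len(notes)
    # Steps 2-4 (stubbed): link notes that share a keyword. The real
    # system retrieves candidates by embedding similarity, asks an LLM
    # to judge each link, then rewrites the linked notes' metadata.
    for i, other in enumerate(notes):
        if set(keywords) & set(other.keywords):
            note.links.append(i)
            other.links.append(new_id)  # evolution: old notes change too
    notes.append(note)
    return new_id

a = add_note("Chose Redis for caching", ["redis", "caching"],
             ["infra"], "Decision to use Redis")
b = add_note("Redis cache kept evicting hot keys", ["redis", "eviction"],
             ["incident"], "Caching failure mode")
# The shared "redis" keyword links the failure back to the decision
```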
LangGraph makes the agentic retrieval loop explicit as a state graph. The key pattern: after retrieval, a grader node decides if results are relevant. If not, a rewriter node reformulates the query and loops back:
```python
# pip install langgraph langchain-openai langchain-community faiss-cpu
from langgraph.graph import StateGraph, START, END, MessagesState
from langgraph.prebuilt import ToolNode, tools_condition
from langchain.chat_models import init_chat_model
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool

model = init_chat_model("gpt-4o", temperature=0)

# Set up your retriever (any LangChain retriever works here)
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

vectorstore = FAISS.from_texts(["your", "documents", "here"], OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

@tool
def search_knowledge_base(query: str) -> str:
    """Search the knowledge base for relevant information."""
    docs = retriever.invoke(query)
    return "\n\n".join(d.page_content for d in docs)

# Node 1: LLM decides to search or respond directly
def generate_or_search(state: MessagesState):
    response = model.bind_tools([search_knowledge_base]).invoke(state["messages"])
    return {"messages": [response]}

# Grader: routes after retrieval (used as a conditional edge, not a node)
def grade_documents(state: MessagesState):
    question = state["messages"][0].content
    context = state["messages"][-1].content
    grade = model.invoke(
        f"Answer 'yes' or 'no': Is this context relevant?\n"
        f"Question: {question}\nContext: {context}"
    )
    return "generate_or_search" if "yes" in grade.content.lower() else "rewrite_question"

# Node 2: Rewrite the query for better retrieval
def rewrite_question(state: MessagesState):
    question = state["messages"][0].content
    better = model.invoke(f"Reformulate this question for better search results: {question}")
    return {"messages": [HumanMessage(content=better.content)]}

# Build the graph
workflow = StateGraph(MessagesState)
workflow.add_node(generate_or_search)
workflow.add_node("retrieve", ToolNode([search_knowledge_base]))
workflow.add_node(rewrite_question)
workflow.add_edge(START, "generate_or_search")
workflow.add_conditional_edges("generate_or_search", tools_condition,
                               {"tools": "retrieve", END: END})  # END = model answered directly
workflow.add_conditional_edges("retrieve", grade_documents)
workflow.add_edge("rewrite_question", "generate_or_search")  # loop back!
graph = workflow.compile()
```
The key insight: The rewrite_question → generate_or_search edge creates a loop. The agent keeps refining its query until the grader is satisfied — or until a max iteration limit is reached. This is what makes it "agentic" instead of "static." When the model decides it can answer directly (no tool call), tools_condition routes to END.
LlamaIndex alternative: LlamaIndex takes a different approach with ReActAgent — each data source becomes a tool, and a meta-agent routes queries across them. Use LangGraph when you want explicit control over the retrieval loop; use LlamaIndex when you have multiple data sources and want automatic routing.
3. Search-o1 — RAG During Reasoning
Standard RAG retrieves information before the model starts generating. Agentic RAG lets the agent decide when to retrieve. Search-o1 goes further: it enables retrieval during the reasoning process itself.
Think of it this way. In standard RAG, you hand a student a stack of textbook pages and say "now answer the question." In agentic RAG, the student can ask the librarian for specific books. In Search-o1, the student is working through a problem, realizes mid-thought "I need to check something," looks it up, processes what they found, and continues reasoning. The information arrives exactly when the reasoning process needs it.
Technically, the model is instructed to use special tokens — <|begin_search_query|> and <|end_search_query|> — to trigger a search mid-reasoning. The retrieved documents are then processed by a Reason-in-Documents module: using the search query, the retrieved content, and the current reasoning trace, this module condenses everything into focused reasoning steps that align with the model's thought process.
Why is the Reason-in-Documents step important? Because raw retrieved documents are often long, contain irrelevant sections, and can disrupt the flow of reasoning. By having the model's own reasoning LLM process the retrieved information, the documents are compressed and formatted to fit naturally within the ongoing chain of thought.
How it works in practice:
Consider the query "Why are flamingos pink?" The model begins reasoning:
"Flamingos are pink... this is related to their diet, but I need to know the specific mechanism..."
<|begin_search_query|> flamingo pigmentation diet mechanism <|end_search_query|>
<|begin_search_result|> [Wikipedia excerpt about carotenoid pigments in brine shrimp...] <|end_search_result|>
"The pigments are carotenoids found in brine shrimp. But how exactly are these pigments metabolized into the feather coloration?"
The model then triggers a second search to ArXiv for the metabolic pathway, processes that result, and continues reasoning with the full picture.
Key innovation: Instead of just injecting raw documents into the context (which can be noisy and disruptive), the Reason-in-Documents module uses the same reasoning LLM to process retrieved content such that it fits naturally within the reasoning trace. The information is aligned with how the model is thinking, not just dumped in.
Limitation: This approach is primarily used for long-term semantic memory (external knowledge), not for working memory or episodic memory. It's most powerful for tasks that require synthesizing information from multiple sources mid-thought — like research queries, complex analysis, and multi-step reasoning.
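The controller around the token protocol can be sketched as a loop: generate until a search-query span appears, run the search, append the result tokens, and resume. This is an illustration, not the paper's code; `generate` stands in for the reasoning model and `search` for retrieval plus the Reason-in-Documents condensation.

```python
import re

SEARCH_SPAN = re.compile(
    r"<\|begin_search_query\|>(.*?)<\|end_search_query\|>", re.DOTALL
)

def run_with_search(generate, search, prompt, max_searches=5):
    """Generate until the model emits a search-query span, run the search,
    inject the result into the trace, and let the model continue."""
    trace = prompt
    for _ in range(max_searches):
        output = generate(trace)              # model extends the reasoning trace
        trace += output
        match = SEARCH_SPAN.search(output)
        if match is None:
            break                             # no search requested: done reasoning
        query = match.group(1).strip()
        result = search(query)                # retrieve + condense into the trace
        trace += f"<|begin_search_result|>{result}<|end_search_result|>"
    return trace
```

The `max_searches` cap matters in practice: without it, a model that keeps emitting search queries loops forever.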
Forgetting as a Feature: MemoryBank
Here's a counterintuitive idea: not everything deserves to be remembered. Your brain doesn't retain every conversation you've ever had — it selectively strengthens important memories and lets unimportant ones fade. MemoryBank applies this principle to AI.
Inspired by the Ebbinghaus Forgetting Curve — the psychological model showing that we forget roughly half of newly learned information within a day unless we actively reinforce it — MemoryBank dynamically adjusts the "strength" of each memory based on usage.
When a memory is retrieved and used during a conversation, its strength is reinforced — making it persist longer in the system. But if a memory hasn't been accessed for a while, it gradually loses strength and may eventually be removed entirely. This is the AI equivalent of spaced repetition — the same technique students use to prepare for exams.
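A minimal sketch of this reinforce-or-fade mechanism, using the exponential form of the forgetting curve, retention = e^(−t/S), where t is time since last use and S is the memory's strength. The pruning threshold and the +1 strength boost are illustrative assumptions, not MemoryBank's published parameters.

```python
import math

class DecayingMemory:
    """A memory whose retention follows e^(-t/S); each retrieval raises S."""
    def __init__(self, content, strength=1.0):
        self.content = content
        self.strength = strength      # S: grows every time the memory is used
        self.age = 0.0                # t: days since last reinforcement

    def retention(self):
        return math.exp(-self.age / self.strength)

    def tick(self, days=1.0):
        self.age += days              # time passes, retention decays

    def reinforce(self):
        self.strength += 1.0          # spaced-repetition style boost
        self.age = 0.0                # reset the clock

def prune(memories, threshold=0.05):
    """Drop memories whose retention has faded below the threshold."""
    return [m for m in memories if m.retention() >= threshold]
```

A memory that is retrieved regularly keeps resetting its clock and growing its strength; one that is never touched decays below the threshold within days and gets pruned.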
MemoryBank stores three types of information for each user: raw conversation turns (embedded for retrieval), LLM-generated summaries of past events, and a dynamically updated "user portrait" that captures personality traits and preferences. The user portrait is always included in the context; the turns and summaries are retrieved on demand.
This approach aligns with a powerful insight from the mem0 research: only about 10% of information in a conversation deserves permanent storage. The user saying "sounds good" or "let me think about it" doesn't need to be immortalized. But "I'm allergic to penicillin" does. MemoryBank's forgetting mechanism naturally separates the signal from the noise — important information gets retrieved and reinforced, while trivial exchanges fade away.
Don't Forget Keyword Search
With all this talk about embeddings and semantic search, it's tempting to think keyword-based search is obsolete. It isn't. In fact, for certain types of memory retrieval, BM25 keyword search outperforms semantic search.
Consider this: a user asks "What was the configuration for server PROD-DB-07?" Semantic search looks for meaning — it might find documents about database configurations in general. But the user needs the exact string "PROD-DB-07." Keyword search with an inverted index finds it instantly because it's matching exact terms, not meanings.
BM25 (Best Matching 25) is the scoring function behind most keyword search systems. It ranks results by three factors: how often the search term appears in the document (term frequency), how rare the term is across all documents (inverse document frequency), and document length normalization. This makes it excellent for retrieving specific identifiers, error codes, configuration values, and proper nouns — exactly the things semantic search often misses.
In practice, the best memory systems combine both. Semantic search finds conceptually related information; keyword search finds exact matches. This is the hybrid search approach we covered in the RAG & Agents post, and it's just as important for memory retrieval as it is for document retrieval.
When you run BM25 and semantic search in parallel, you get two ranked lists. How do you merge them? The most common technique is Reciprocal Rank Fusion (RRF).
For each result, RRF computes a score by summing the reciprocal of its rank in each list:

`RRF_score(d) = Σ 1 / (k + rank_i(d))`

where k is a smoothing constant (typically 60) and rank_i(d) is the result's position in list i.
Why RRF over score-based merging?
- No calibration needed: BM25 scores (e.g., 12.7) and cosine similarity scores (e.g., 0.83) aren't comparable. RRF uses ranks, not raw scores.
- Robust to missing items: If a result only appears in one list, it still contributes — absent items simply add nothing.
- Naturally surfaces consensus: Results ranked high in both lists score highest, so items that match by keyword AND meaning rise to the top.
In production systems like HINDSIGHT, RRF merges results from multiple channels (keyword, semantic, temporal, entity-based) and then a neural cross-encoder reranker refines precision on the top candidates before fitting them into the context window.
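The fusion step is small enough to implement directly. A minimal sketch of RRF over two ranked lists, with the customary k = 60; the document IDs are made up for illustration.

```python
def rrf_merge(ranked_lists, k=60):
    """Merge best-first ranked lists with Reciprocal Rank Fusion.
    Items absent from a list simply contribute nothing for that list."""
    scores = {}
    for results in ranked_lists:
        for rank, item in enumerate(results, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc_prod_db_07", "doc_logs", "doc_redis"]
semantic_results = ["doc_redis", "doc_prod_db_07", "doc_auth"]
# Items ranked high in BOTH lists rise to the top
merged = rrf_merge([bm25_results, semantic_results])
```

Note that no score calibration happens anywhere: only ranks enter the sum, which is why BM25's unbounded scores and cosine's [0, 1] scores can be merged at all.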
A minimal BM25 implementation using rank_bm25 in Python:
```python
# pip install rank_bm25
from rank_bm25 import BM25Okapi
from typing import List

# Your stored memories / conversation turns
memories: List[str] = [
    "User deployed PROD-DB-07 with 16GB RAM and PostgreSQL 15",
    "Discussed Redis caching strategy for the product catalog",
    "User prefers FastAPI over Flask for new microservices",
    "PROD-DB-07 experienced connection timeout at 2:30 AM",
    "Decided to use JWT tokens for API authentication",
]

# Tokenize and build the BM25 index
tokenized = [doc.split() for doc in memories]
bm25 = BM25Okapi(tokenized)

# Retrieve relevant memories for a query
query = "PROD-DB-07 configuration"
scores = bm25.get_scores(query.split())
top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
for i in top_indices:
    print(f"Score {scores[i]:.2f}: {memories[i]}")
```
When to use BM25 vs semantic search:
- BM25 excels at: Exact identifiers (server names, error codes, SKUs), proper nouns, configuration values, log entries
- Semantic search excels at: Conceptual queries ("how do we handle authentication?"), paraphrased content, thematic connections
- Best practice: Run both in parallel and merge results using Reciprocal Rank Fusion (RRF), which we covered in the RAG & Agents post
LangChain's EnsembleRetriever combines BM25 + semantic search with RRF fusion in 10 lines:
```python
# pip install langchain langchain-community langchain-openai rank_bm25 faiss-cpu
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Your documents (same docs go to both retrievers)
docs = [...]  # list of Document objects

# BM25 retriever (keyword matching)
bm25_retriever = BM25Retriever.from_documents(docs, k=3)

# FAISS retriever (semantic matching)
vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Combine with weighted RRF
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.4, 0.6],  # 40% keyword, 60% semantic
)

results = hybrid_retriever.invoke("PROD-DB-07 connection issues")
```
How weights work: The weights parameter controls how much each retriever contributes to the final RRF score. Higher weight = more influence. Start with [0.4, 0.6] (slight semantic bias), increase keyword weight for codebases with lots of identifiers.
Native vector store hybrid search: Some vector stores support hybrid search natively without needing EnsembleRetriever:
- Weaviate — `collection.query.hybrid(query, alpha=0.75)` blends BM25 + vector natively
- Qdrant — Prefetch API combines sparse + dense vectors with RRF fusion in a single request
- Pinecone — stores sparse (BM25) + dense vectors in the same index for hybrid queries
If your vector store supports native hybrid search, use it — it's faster than running two separate retrievers. If not, EnsembleRetriever works with any combination of LangChain retrievers.
Worked scenario — a support bot that must search both product documentation and past conversation logs: Agentic RAG with episodic memory is the right fit. Static RAG only searches documents — it can't find the past session where the user was told to "reinstall the driver," and it can't handle cross-session context at all. Search-o1's mid-reasoning retrieval is overkill for support queries where the needed information is straightforward to identify. With agentic RAG plus episodic memory, the agent searches documentation when it needs product info and troubleshooting steps AND searches past session logs when the user references a previous interaction ("what did we tell this customer before?").
Worked scenario — a product-search assistant that must handle both exact SKUs and conceptual queries: hybrid search handles both. BM25 instantly matches the exact SKU "WH-2847-BLK" via keyword matching — the string has no semantic meaning, it's just an identifier, so semantic search alone might miss it. Semantic search understands that "wireless headphones under $100" is about Bluetooth audio devices in a price range — a conceptual query BM25 would fail at, since it can't tell that "Bluetooth earbuds" are related. Combining the two with Reciprocal Rank Fusion covers both query types; neither approach alone does.
Two More Retrieval Techniques Worth Knowing
Before we move on to failure patterns, there are two more memory retrieval approaches that solve common production problems. Both have been around longer than "agentic RAG" became a buzzword, and both remain surprisingly useful.
Semantic Experience Memory — Remembering Past Sessions
Here's a frustration every user has felt: you had a detailed conversation with an AI assistant last week about your project architecture. Today you come back, and it starts from scratch. Semantic experience memory solves this by reserving part of the context window for search results from past interactions.
The mechanism: with each user input, the text is embedded and used as a query against all previous interactions stored in a memory vector store. Part of the context window is reserved for the best matches. The rest of the space goes to the system message, latest input, and most recent turns. This means the agent can recall "you mentioned last Tuesday that you're allergic to shellfish" without the entire previous conversation being in context.
This is different from generic RAG (which retrieves from external documents) and different from conversation history (which only holds the current session). Semantic experience memory specifically searches the agent's own past interactions — making it a form of episodic memory that works across sessions.
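A sketch of the context-budgeting idea, with `embed` and `similarity` injected so the example stays self-contained, and a rough 4-characters-per-token estimate; the budget split is illustrative, not a standard.

```python
def build_context(system_msg, past_memories, recent_turns, user_input,
                  embed, similarity, memory_budget_tokens=500):
    """Reserve part of the context window for matches from past sessions;
    the rest goes to the system message and the current conversation."""
    est_tokens = lambda text: len(text) // 4          # rough token heuristic
    query_vec = embed(user_input)
    # Rank every stored past interaction against the new input
    ranked = sorted(past_memories,
                    key=lambda m: similarity(embed(m), query_vec),
                    reverse=True)
    recalled, used = [], 0
    for memory in ranked:                             # fill the reserved slot
        cost = est_tokens(memory)
        if used + cost > memory_budget_tokens:
            break
        recalled.append(memory)
        used += cost
    return [system_msg, *recalled, *recent_turns, user_input]
```

In production, `embed` would be a real embedding model and `past_memories` a vector store query, but the budgeting logic is the same: recalled memories compete for a fixed slice of the window rather than displacing the current conversation.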
Note-Taking — Reading Before Answering
Here's a technique borrowed from how humans handle dense material: before answering a question, first annotate the context. When you're reading a dense research paper and someone asks you a question about it, you don't just skim and answer. You take notes in the margins first, then use those notes to form your answer.
With the self-note approach, the model generates notes on multiple parts of the context before the question is presented. These annotations are then interleaved with the original context when attempting to answer. Research shows good results on multiple reasoning and evaluation tasks — the model essentially "reads the document twice" rather than trying to comprehend and answer simultaneously.
This is especially valuable for long context windows where the model might miss critical details buried in the middle. The note-taking pass forces the model to attend to the entire context, and the resulting notes create an additional "index" that the answer-generation pass can reference.
Note-taking is not a framework feature — it's a two-pass prompting pattern you implement yourself; the only dependency is whatever LLM client you already use:

```python
from openai import OpenAI

client = OpenAI()
long_context = "..."  # the long document the model should read
question = "..."      # the user's question about it

# Pass 1: Generate margin notes on the context
note_prompt = """Read the following context carefully.
For each important fact, relationship, or detail, write a brief
margin note summarizing it. Number your notes.

CONTEXT:
{long_context}

MARGIN NOTES:"""

notes = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": note_prompt.format(long_context=long_context)}],
).choices[0].message.content

# Pass 2: Answer the question with context + notes
answer_prompt = """Using the original context AND your notes,
answer the following question.

CONTEXT:
{long_context}

YOUR NOTES:
{notes}

QUESTION: {question}"""

answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": answer_prompt.format(long_context=long_context,
                                               notes=notes,
                                               question=question)}],
).choices[0].message.content
```
Why two passes? The first pass forces the model to attend to the entire context (not just the beginning and end). The notes become a compressed index that the second pass can reference, reducing the "lost in the middle" effect.
Cost trade-off: You're making two LLM calls instead of one. Use this when accuracy matters more than latency — legal documents, medical records, compliance reviews. For casual chat, it's overkill.
Related work: Meta AI's "Self-Notes" paper (2023) trains models to interleave notes during reading rather than in a separate pass. Google's "Scratchpad" paper (2021) was an early version of the same idea, which later evolved into chain-of-thought prompting. Neither has a production library — this is the practical approximation.
You now know how to store and retrieve memories — from basic trimming to Zettelkasten-inspired networks to mid-reasoning retrieval and note-taking. But even the best memory systems fail in predictable ways. Understanding these failures leads to a bigger insight about how to think about the entire information flow.
Three Fundamental Failure Patterns
Before we catalog everything that can go wrong, let's start with the three failure patterns that every production memory system eventually encounters. These are the ones that show up in your support tickets, your error logs, and your 2 AM Slack alerts.
Failure 1: Stale Memory
Your company updated its parental leave policy in January. A user asks your HR bot about it in March. The bot confidently cites the old policy from 2023 because that document's embedding is semantically identical to the new one, and nobody re-indexed.
Stale memory happens because embedding similarity doesn't consider recency. A 2019 document about "parental leave policy" and a 2024 document about the same topic look nearly identical in vector space. Without explicit recency signals, the retrieval system has no way to prefer the newer version.
How to detect it
Log retrieved chunks with their timestamps. If old documents appear in top-K results when newer versions exist, you have stale memory. Create a test: ask about something you know changed recently.
How to fix it
Add last_updated metadata to every chunk. Use hybrid scoring: final_score = 0.7 * semantic_similarity + 0.3 * recency_score. Re-index on a schedule. For critical documents, implement "supersedes" relationships so the old version is explicitly replaced.
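The hybrid score above is a one-liner once you define a recency function. A sketch assuming exponential decay with a 180-day half-life; the half-life is a tunable assumption, not a standard.

```python
from datetime import datetime, timedelta, timezone

def recency_score(last_updated, half_life_days=180.0):
    """Exponential decay: a chunk loses half its recency score per half-life."""
    age_days = (datetime.now(timezone.utc) - last_updated).days
    return 0.5 ** (age_days / half_life_days)

def hybrid_score(semantic_similarity, last_updated):
    # final_score = 0.7 * semantic_similarity + 0.3 * recency_score
    return 0.7 * semantic_similarity + 0.3 * recency_score(last_updated)

# A stale policy doc with slightly higher similarity loses to the fresh one
now = datetime.now(timezone.utc)
old_doc = hybrid_score(0.91, now - timedelta(days=450))
new_doc = hybrid_score(0.90, now - timedelta(days=30))
```

Tune the half-life to your corpus: HR policies might warrant 90 days, while reference documentation that rarely changes could use 2–3 years.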
Failure 2: Leaked Context
User A tells your bot about their secret project codenamed "Phoenix." User B asks "what projects exist?" User B's response mentions Phoenix. You've just leaked one user's data to another.
Leaked context happens because memory storage isn't namespaced. All users' data lives in the same vector space, and semantic search finds similar content regardless of who wrote it. The retrieval system matched User B's query to User A's data because they were semantically similar.
How to detect it
Create two test users. User A mentions "my secret project Phoenix." User B asks "what projects exist?" If B sees Phoenix, you have a leak. This is a mandatory pre-launch test.
How to fix it
Namespace all stored data by user_id. Filter by user ID as a metadata filter before similarity search (not post-filter). For enterprise, use separate collections per tenant. This is non-negotiable for production.
Failure 3: Polluted Retrieval
A developer asks your coding assistant "How do I deploy to production?" The retrieval system returns chunks about production deployments, production line manufacturing, product management meetings, and productive work habits. The model, now confused by four different meanings of "production," gives a vague, contradictory answer.
Polluted retrieval happens because embedding similarity is imprecise. "Deploy to production" is semantically close to "production line," "product management," and "productive meetings." With a high top-K (retrieving many chunks), noise overwhelms the signal.
How to detect it
Log retrieved chunks for 20 random queries. If more than 30% of retrieved chunks are irrelevant, you have polluted retrieval.
How to fix it
(1) Reduce top-K from 10 to 3–5. (2) Add reranking: retrieve 20, use a cross-encoder to pick best 3. (3) Improve chunking so each chunk covers one topic. (4) Add metadata filters (doc_type, section_header). Quality over quantity — 2 perfect chunks beat 10 mediocre ones.
Eight Deeper Failure Modes
The three patterns above are the most common, but the Letta Leaderboard — a benchmark designed specifically to test AI memory systems — identified a broader taxonomy of memory failures. These extend beyond the basics and reveal how memory systems break at scale.
4. Unnecessary Searches: The agent searches for information that's already in the context window. A user says "My name is Sarah, I work at Acme Corp." Two messages later, the agent queries the memory system for the user's name. This wastes latency, costs money, and sometimes retrieves conflicting information that overwrites what was already known.
5. Hierarchy Breakdown: Trivial information occupies prime memory while critical facts are buried. The system stores "user likes dark mode" in the same tier as "user is allergic to penicillin." When the context window gets tight, there's no mechanism to prioritize the allergy over the UI preference.
6. Missed Information: The right information is in the context but the model doesn't use it. This often happens with long contexts where important facts are buried in the middle of many paragraphs. The information was successfully retrieved and placed in the prompt, but the model's attention mechanism failed to weight it properly.
7. Silent Overwrites: New information replaces old information without versioning. Sarah's job title changes from "Marketing Manager" to "VP of Marketing." The memory system updates the fact — but now there's no record that she was ever a Marketing Manager. When someone asks "who managed marketing in Q1?" the system has no answer.
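One standard fix for silent overwrites is to version facts instead of replacing them: an update closes the old record's validity window rather than erasing it. A minimal sketch with illustrative field names.

```python
from datetime import date

class FactStore:
    """Append-only facts: an update closes the old version, never erases it."""
    def __init__(self):
        self.facts = []   # (subject, attribute, value, valid_from, valid_to)

    def set_fact(self, subject, attribute, value, as_of):
        for i, (s, a, v, start, end) in enumerate(self.facts):
            if s == subject and a == attribute and end is None:
                self.facts[i] = (s, a, v, start, as_of)   # close the old version
        self.facts.append((subject, attribute, value, as_of, None))

    def get_fact(self, subject, attribute, at=None):
        for s, a, v, start, end in self.facts:
            if s != subject or a != attribute:
                continue
            if at is None and end is None:
                return v                                  # current value
            if at is not None and start <= at and (end is None or at < end):
                return v                                  # value as of `at`
        return None
```

With this shape, "who managed marketing in Q1?" becomes a query with `at=` a Q1 date, and the old answer is still there.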
8. Isolated Silos: Related facts exist in the memory system but are never connected. The system knows "Server X went down Tuesday" and "Network switch Y was replaced Monday" but never links these events. A human would immediately see the correlation; the flat memory system treats them as unrelated facts.
9. Temporal Confusion: The system can't distinguish between "what happened first" and "what happened recently." It might tell you Sarah's current job title when you ask about her role three years ago, or confuse the order of events in an incident timeline. Without explicit temporal metadata, all facts exist in an eternal present.
10. Scale Collapse: The system works beautifully with 100 stored facts but degrades catastrophically at 10,000. Retrieval precision drops, latency spikes, and the signal-to-noise ratio becomes unmanageable. This is often invisible during development and testing, only revealing itself in production.
A monitoring checklist for production memory systems:
| Failure | Detection Method | Key Metric |
|---|---|---|
| Stale Memory | Log chunk timestamps, compare to latest version | % of retrievals using outdated docs |
| Leaked Context | Cross-user test: does User B see User A's data? | Cross-tenant leakage rate (must be 0%) |
| Polluted Retrieval | Manual review 20 random queries' retrieved chunks | % of irrelevant chunks in top-K |
| Unnecessary Searches | Track when search returns info already in context | Redundant search rate |
| Hierarchy Breakdown | Compare importance ratings of stored vs surfaced facts | Critical fact retrieval rate |
| Missed Information | Check if answer contradicts information in context | In-context miss rate |
| Silent Overwrites | Query for historical facts ("who was X in Q1?") | Historical accuracy on versioned facts |
| Isolated Silos | Ask cross-referencing questions ("what events overlap?") | Cross-reference success rate |
| Temporal Confusion | Ask about event ordering or "state at time T" | Temporal ordering accuracy |
| Scale Collapse | Benchmark at 10x, 100x your current data size | Precision@K at different scales |
Priority: Start with leaked context (security), stale memory (accuracy), and polluted retrieval (quality). These three account for the majority of user-facing issues.
The "Lost in the Middle" Problem
Here's something deeply counterintuitive: information placed in the middle of a long context is more likely to be ignored than information at the beginning or end.
The "Lost in the Middle" paper by Liu et al. demonstrated this empirically. They placed a target fact at different positions in a long context and measured how often the model used it correctly. The results followed a U-shaped curve: high accuracy when the fact was near the beginning, high accuracy near the end, and a significant drop in the middle.
This is the serial position effect — a phenomenon well-documented in human psychology. We remember the first things we encounter (primacy effect) and the last things (recency effect), but the middle blurs together. LLMs, despite being fundamentally different from human brains, exhibit the same pattern because of how their attention mechanisms work.
The practical implication is massive: where you place information in the context matters as much as what information you include. If you're building a RAG system, the most critical retrieved chunks should be placed at the very beginning or very end of the retrieved context — never buried in the middle.
The experiment by Liu et al. (2023) placed a relevant document among 9–19 irrelevant documents and varied its position. Key findings:
- Models performed best when the relevant document was the first or last in the sequence
- Performance dropped by up to 20% when the relevant document was in positions 5–15 out of 20
- The effect was consistent across multiple model families (GPT, Claude, Llama)
- Larger context windows made the problem worse, not better — more space for information to get lost in
Why it happens: Transformer attention mechanisms don't distribute attention uniformly. They tend to attend strongly to the beginning of the sequence (due to positional encoding patterns) and the end (due to recency in the attention window). Middle positions get less attention weight, making information there more likely to be overlooked.
Practical implications for memory systems:
- Place your system prompt first (it already is) and the user's latest message last (it already is)
- Put the most critical retrieved context immediately after the system prompt or immediately before the user message
- If you're including a conversation summary, place it at the beginning of the conversation turns, not in the middle
- For RAG, consider reversing the order of retrieved chunks so the most relevant one is last (just before the question)
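The last two bullets amount to a small reordering function. A sketch with two strategies, assuming chunks arrive ranked best-first from the retriever; the strategy names are made up for illustration.

```python
def order_for_context(chunks_best_first, strategy="best_last"):
    """Counter the U-shaped attention curve: keep strong chunks
    out of the middle of the prompt."""
    if strategy == "best_last":
        # Best chunk ends up last, adjacent to the user's question
        return list(reversed(chunks_best_first))
    if strategy == "edges":
        # Alternate top chunks between front and back; weak ones sink to the middle
        front, back = [], []
        for i, chunk in enumerate(chunks_best_first):
            (front if i % 2 == 0 else back).append(chunk)
        return front + list(reversed(back))
    raise ValueError(f"unknown strategy: {strategy}")
```

With `"edges"`, the two strongest chunks occupy the first and last positions and the weakest land in the middle, matching the U-shaped accuracy curve from the Liu et al. experiments.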
The RULER benchmark extended these findings by testing not just retrieval but reasoning over long contexts. Models that performed well on simple needle-in-a-haystack tests showed significant drops on multi-hop tracing, aggregation, and counting tasks — demonstrating that finding information and reasoning about it are very different challenges.
Context Rot — When More Is Less
Closely related to "lost in the middle" is a broader phenomenon: context rot. Research consistently shows that model performance degrades as context length increases — even when the context window isn't anywhere near its maximum capacity.
The logic seems simple: more context should mean more information, which should mean better answers. But the empirical data tells a different story.
As you add more tokens to the context, three things happen:
- Attention dilution: The model's attention mechanism has to spread across more tokens, reducing the weight given to any individual piece of information
- Noise accumulation: More context means more opportunities for irrelevant information to interfere with the relevant signal
- Reasoning degradation: Multi-step reasoning becomes harder when the model has to track relationships across a longer sequence
This is why "just stuff everything into a million-token context" is a recipe for failure. The Needle In A Haystack (NIAH) benchmark — the most common test for long-context models — merely tests retrieval: can the model find a specific fact hidden in long text? Most modern models ace this test. But retrieval is the easy part. The RULER benchmark demonstrated that models performing well on NIAH showed significant performance drops when tested on multi-hop tracing, aggregation, and counting — tasks that require actual reasoning over the context, not just finding a needle.
The conclusion is clear: context quality matters more than context quantity. This is the foundation of the paradigm shift we'll explore in Part 4.
These failures aren't random. They all stem from the same root cause: treating the context window as a dumping ground rather than as a carefully engineered input. There's a better way to think about it — and it's called context engineering.
The LLM as a Function
Let's step back and think about what an LLM really is. Strip away the chat interfaces, the API wrappers, the prompt templates. At its core, an LLM is a function: it takes a sequence of input tokens, processes them through billions of parameters, and outputs a sequence of tokens.
That's it. The entire "intelligence" of the system comes from two things: the quality of the model (its parameters) and the quality of the input (the context). You can improve the model through training or fine-tuning — expensive, slow, and requires expertise. Or you can improve the input — the context you send to the model.
Context engineering is the art and science of optimizing those input tokens so they produce the best possible output for a given task. It's not about filling the context window. It's about strategically choosing and placing information so the model can do its best work.
Think of it like writing a brief for a brilliant but literal-minded consultant. This consultant will do exactly what you tell them, using exactly the information you provide.
Five Principles of Context Engineering
Based on research from Anthropic, the RULER benchmark findings, and production experience with large-scale AI systems, five principles emerge for engineering effective context.
Principle 1: Relevance — Every Token Earns Its Place
The most common mistake in context engineering is including too much. Every token you add to the context has a cost: financial (you pay per token), computational (the model processes every token), and attentional (each token dilutes the attention available for other tokens).
Relevance means asking a simple question for every piece of information: "Does this serve the current task?" If the user is asking about database configuration, their chat preferences from last month probably don't need to be in the context. MemoryBank's forgetting mechanism is one implementation of this principle — it naturally surfaces frequently-used (relevant) information and lets rarely-used information fade.
Principle 2: Diversity — Avoid Redundancy
Retrieving five chunks that all say "use connection pooling for database performance" wastes four context slots that could contain new information. Maximal Marginal Relevance (MMR) is the key technique here: it balances relevance to the query against redundancy with already-selected chunks.
Maximal Marginal Relevance selects documents that are both relevant to the query AND different from documents already selected:
Two competing forces:

- **Relevance:** `λ · Sim(doc, query)` — how similar is this document to the query?
- **Redundancy penalty:** `(1 − λ) · max[Sim(doc, selected)]` — how similar is it to documents we've already chosen?
The λ parameter controls the balance:
- λ = 1.0: Pure relevance, no diversity consideration (standard top-K)
- λ = 0.5: Equal weight to relevance and diversity (typical starting point)
- λ = 0.0: Pure diversity, ignoring relevance (rarely useful)
Practical example: Query is "database connection pooling." Top-K returns five chunks about connection pooling basics. With MMR at λ=0.5, you might get: one chunk about pooling basics, one about pool sizing, one about connection lifecycle, one about monitoring, and one about failure handling. Same topic, five different angles — much more useful than five variations of "pooling improves performance."
The iterative process, step by step:

1. Select the most relevant document first (Doc 1).
2. Score each remaining candidate as `λ · relevance − (1−λ) · similarity_to_Doc1`. Doc 2 drops (0.97 similarity penalty), Doc 3 wins.
3. Repeat, penalizing similarity to everything selected so far, until you have enough documents.

The result: from 5 near-identical chunks, you get 3 chunks that are each relevant but cover different angles of the topic. This is far more useful than 5 repetitions of the same information.
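The selection loop described above can be sketched in a few lines of Python. This is an illustrative implementation only (in practice you'd use a library), and it assumes the similarity scores are precomputed:

```python
# Minimal MMR sketch (illustrative; not a production implementation).
# sim_query[i] = Sim(doc_i, query); sim_docs[i][j] = Sim(doc_i, doc_j).
def mmr_select(sim_query, sim_docs, k, lam=0.5):
    selected = []
    candidates = list(range(len(sim_query)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy = similarity to the closest already-selected doc
            redundancy = max((sim_docs[i][j] for j in selected), default=0.0)
            return lam * sim_query[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` this reduces to plain top-K; lowering `lam` trades a little relevance for coverage.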
MMR is built into LangChain's vector store abstraction. You don't implement the algorithm — you just set `search_type="mmr"`:

```python
# pip install langchain-chroma langchain-openai
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Build your vector store (works with FAISS, Pinecone, Qdrant too)
vectorstore = Chroma.from_texts(
    texts=your_documents,
    embedding=OpenAIEmbeddings(),
)

# One line: switch from similarity to MMR
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,              # return 5 documents
        "fetch_k": 50,       # consider 50 candidates for diversity
        "lambda_mult": 0.3,  # 0.0 = max diversity, 1.0 = max relevance
    },
)

docs = retriever.invoke("your query here")
```
How it works under the hood: LangChain fetches fetch_k candidates via similarity search from the vector store, then applies the MMR algorithm client-side to select the final k diverse results. This means MMR works with any vector store LangChain supports — Chroma, FAISS, Pinecone, Qdrant, Weaviate.
Key tuning parameter: lambda_mult controls the relevance-diversity balance. Start at 0.3 (biased toward diversity), increase toward 0.7 if results feel too scattered. Set fetch_k to at least 5–10x your k value so MMR has enough candidates to choose from.
Principle 3: Ordering — Position Is Power
We covered the "lost in the middle" phenomenon in Part 3. The practical takeaway: put the most critical information at the beginning and end of the context. System prompts naturally go first. The user's latest message naturally goes last. Everything in between should be ordered by importance, with the most critical items nearest to these anchors.
For RAG systems, this means the highest-ranked retrieved chunks should be placed either immediately after the system prompt or immediately before the user's message — not sandwiched in the middle of a long conversation history.
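One way to sketch this placement rule in code — a minimal, hypothetical assembly function, not a library API — is to split the ranked chunks between the two anchors and leave the conversation history in the middle:

```python
# Hypothetical context assembly honoring the "lost in the middle" fix:
# top-ranked chunks go near the anchors (system prompt and latest user
# message); the conversation history sits in between.
def assemble_context(system_prompt, ranked_chunks, history, user_message):
    """ranked_chunks is best-first. The top half goes right after the
    system prompt; the rest goes just before the user's message."""
    mid = len(ranked_chunks) // 2
    head, tail = ranked_chunks[:mid], ranked_chunks[mid:]
    messages = [{"role": "system", "content": system_prompt}]
    messages += [{"role": "system", "content": f"Context:\n{c}"} for c in head]
    messages += history
    messages += [{"role": "system", "content": f"Context:\n{c}"} for c in tail]
    messages.append({"role": "user", "content": user_message})
    return messages
```

The exact split (half and half, all-after-system, all-before-user) is a tuning choice; the invariant is that nothing critical lands deep in the middle.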
Principle 4: Freshness — Recency-Weighted Scoring
Stale memory (Failure 1) happens because retrieval systems treat all information as equally timely. The fix: blend recency into your scoring function.
A simple approach: final_score = α · semantic_similarity + (1 - α) · recency_score, where recency decays exponentially from the document's last-updated timestamp. For most applications, α = 0.7 (favoring semantic relevance) is a good starting point, with adjustments based on how frequently your domain's information changes.
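That blended score is straightforward to implement. The sketch below uses a half-life decay (a common choice; the 30-day default is an assumption you'd tune to your domain):

```python
import math
import time

def recency_weighted_score(similarity, last_updated_ts, alpha=0.7,
                           half_life_days=30.0, now=None):
    """final_score = alpha * similarity + (1 - alpha) * recency,
    where recency halves every half_life_days since last update."""
    now = time.time() if now is None else now
    age_days = max(0.0, (now - last_updated_ts) / 86400.0)
    recency = 0.5 ** (age_days / half_life_days)  # exponential decay
    return alpha * similarity + (1 - alpha) * recency
```

A document updated today with similarity 0.8 scores 0.86; the same document left stale for two half-lives drops to about 0.64, letting a fresher, slightly-less-similar document win.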
Principle 5: Specification — Context as Code
Here's the paradigm shift: treat your context like you treat your code. The context window isn't just an input — it's a specification for the model's behavior. Just as you wouldn't deploy code without testing, you shouldn't deploy a context configuration without testing.
This means: version your system prompts. Track what context you're sending. Monitor retrieval quality. Set up automated tests that verify the right information surfaces for known queries. Create regression tests for the failure patterns from Part 3. Context engineering is an engineering discipline, not a creative exercise.
How strange it is that we tend to throw away the input to our function (the LLM) and only keep track of the output. Think about that: you'd never deploy a service without logging the requests it receives. Yet most AI systems discard the exact context that produced each response. Tracking inputs isn't just for reproducibility — it's how you debug why the agent chose certain tools, produced certain outputs, and made certain mistakes. The context is the specification.
This insight transforms how you think about AI operations. The context you send isn't just an input — it's a persistent artifact that explains the agent's decisions. Sophisticated agents already use this pattern: they maintain a PLAN.md file that persists across context cropping. When the orchestrator trims old messages, the plan survives, providing continuity and an audit trail of what the agent intended to do and why.
What to Track: The Context Taxonomy
Before you can optimize context, you need to know what types of context exist. Most developers only think about the conversation history. In reality, the full context landscape includes four categories:
**Agent behavior**
- Tool usage & outputs
- Sub-agent interactions
- Internal reasoning steps
- Successes & failures

**User behavior**
- Explicit intent & goals
- Feedback (approvals, edits)
- Preference signals
- Conversation patterns

**Knowledge sources**
- DB snapshots (for auditing)
- External documents (RAG)
- Artifacts: PLAN.md, REQS
- API responses

**System-level**
- LLM hyperparameters
- Available tools config
- Guardrails & policies
- Model version
In practice, not everything is useful and many other things should be tracked. The key is deciding beforehand what to capture. This taxonomy gives you a starting framework — especially useful when debugging why an agent took an unexpected action. Was the issue in what tools it used (agent behavior), what the user asked for (user behavior), what knowledge was retrieved (knowledge sources), or how the system was configured (system-level)?
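A per-request log record mirroring these four categories might look like the sketch below. The field names and JSONL format are assumptions for illustration, not a standard schema:

```python
# Hypothetical per-request context log covering the four taxonomy
# categories; field names and the JSONL format are illustrative choices.
import json
import time

def log_context_record(agent_events, user_turns, knowledge, system_cfg,
                       path="context_log.jsonl"):
    record = {
        "ts": time.time(),
        "agent_behavior": agent_events,    # tool calls, reasoning, outcomes
        "user_behavior": user_turns,       # intent, feedback, preferences
        "knowledge_sources": knowledge,    # RAG docs, artifacts, API responses
        "system_level": system_cfg,        # model version, tools, guardrails
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Append-only JSONL keeps each request's context reproducible and greppable when you need to answer "what exactly did the model see?"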
Multi-Agent Context
In systems with multiple AI agents working together — an orchestrator routing tasks to specialists — context engineering becomes even more critical. Each agent needs to see different information, and sharing everything between all agents creates unnecessary cost and confusion.
The orchestrator sees the big picture: the user's query, the task plan, and summaries from each specialist. But each specialist sees only what it needs. The research agent gets the search query and retrieved documents. The coding agent gets the code context and requirements. The summary agent gets the draft and user preferences. No agent is overwhelmed with irrelevant context from other agents' work.
This is context engineering at the architectural level — designing not just what goes into a single context window, but how context flows across an entire multi-agent system.
Three patterns for managing context across agents:
1. Shared blackboard: All agents read from and write to a shared context store. Simple but noisy — every agent sees everything, including irrelevant data from other agents. Best for small, tightly-coupled agent teams.
2. Orchestrator-mediated: The orchestrator selectively passes context to each specialist. More control but the orchestrator becomes a bottleneck. Best for complex workflows where different agents need very different contexts.
3. Hierarchical: Context is organized in layers — global context (visible to all), team context (shared within a specialist team), and local context (agent-specific). Most flexible but most complex to implement. Best for large-scale systems with many agents.
Context cropping for delegation: When the orchestrator delegates a task to a specialist, it shouldn't just forward the entire conversation. Instead, it extracts only the relevant portions: the specific sub-task, relevant constraints, and any context the specialist needs. This is "context cropping" — cutting out the noise before passing context downstream.
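A minimal cropping helper might look like the sketch below — the function name, the artifact whitelist, and the message shape are all illustrative assumptions:

```python
# Hypothetical context-cropping helper for delegation: the specialist
# receives only the sub-task, constraints, and whitelisted artifacts.
def crop_for_specialist(sub_task, constraints, artifacts,
                        allowed=("PLAN.md",)):
    # Drop everything except explicitly whitelisted artifacts
    kept = {name: body for name, body in artifacts.items() if name in allowed}
    system = "You are a specialist. Follow the constraints exactly."
    user = "\n\n".join(
        [f"Task: {sub_task}", f"Constraints: {constraints}"]
        + [f"--- {name} ---\n{body}" for name, body in kept.items()])
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]
```

The whitelist is the key design choice: the specialist can only be confused by what you pass it, so default to passing nothing and opt artifacts in deliberately.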
Practical example — a deep research agent:
- Plan creation: User asks a research question. Orchestrator creates a multi-step plan.
- Search delegation: Orchestrator sends only the search query to the research agent (not the full conversation).
- Summarization delegation: Research results are sent to a summary agent with instructions to extract key findings (not the raw search results AND the conversation).
- Context cropping: The orchestrator takes the summary, drops intermediate reasoning, and presents the user with a clean answer.
At each step, context is deliberately narrowed. The research agent never sees the conversation history. The summary agent never sees the search queries. The user never sees the intermediate context management. Each handoff is a deliberate act of context engineering.
In practice, system prompts use XML tags to clearly delineate different types of context. This helps the model distinguish between instructions, available tools, temporal context, and the actual query:
```python
SYSTEM_PROMPT = """
You are a helpful research assistant.

<INSTRUCTIONS>
* Create a plan for the user's query
* If you lack information, use tools to search
* Prioritize clarity and use the current date for SOTA
</INSTRUCTIONS>

<TOOLS>
{information_about_tools}
</TOOLS>

<DATE>
{current_date}
</DATE>

<USER_QUERY>
{user_query}
</USER_QUERY>
"""
```
The messages array evolves at each step. Here's what a deep research agent's context looks like as it progresses:
```python
# Step 1: Create a plan
messages = [{"role": "system", "content": SYSTEM_PROMPT}]

# Step 2: User provides feedback on the plan
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "assistant", "content": "Here is my plan..."},
    {"role": "user", "content": "Search for surveys instead."},
]

# Step 3: CROP — only keep the updated plan
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "assistant", "content": PLAN_MD},  # Persistent artifact
]

# Step 4: Tool call for search
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "assistant", "content": PLAN_MD},
    {"role": "assistant", "content": "<think>Search on ArXiv</think>",
     "tool_call": {"name": "search_arxiv", ...}},
    {"role": "tool", "content": ABSTRACTS},
]

# Step 5: Sub-agent gets MINIMAL context
messages_summary_agent = [
    {"role": "system", "content": "Summarize these papers."},
    {"role": "user", "content": PAPERS},
]

# Step 6: Final synthesis — CROPPED again
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "assistant", "content": UPDATED_PLAN},
    {"role": "assistant", "content": "Summaries: ..."},
]
```
Key patterns to notice:
- PLAN.md persists — The plan is a persistent artifact that survives context cropping. Even when old messages are trimmed, the plan stays, providing continuity.
- Sub-agents get minimal context — The summary agent only sees papers, not the full conversation or search queries.
- Context shrinks at handoffs — Each step deliberately narrows what the next model sees. This is the core of context engineering.
Before deployment, verify each principle:
Relevance:
- Are you logging what context is sent with each request?
- Can you identify and remove context that doesn't contribute to task performance?
- Do you have a mechanism to avoid retrieving information already in the conversation?
Diversity:
- Are you using MMR or similar de-duplication in your retrieval?
- Have you tested for cases where multiple similar chunks are returned?
- Is your retrieval top-K appropriately sized (3–5 is often better than 10)?
Ordering:
- Is critical context placed at the beginning or end of the prompt, not the middle?
- Are retrieved chunks ordered by relevance, with the best near the anchors?
- Is the conversation summary (if used) placed before the conversation turns?
Freshness:
- Do your stored documents have `last_updated` timestamps?
- Is your retrieval scoring recency-aware?
- Do you have a re-indexing schedule for frequently-changing content?
Specification:
- Is your system prompt versioned (in git or a config store)?
- Do you have automated tests that verify correct retrieval for known queries?
- Do you monitor retrieval quality metrics (precision@K, latency, staleness)?
- Can you reproduce the exact context that was sent for any given request?
Different memory strategies have very different cost and latency profiles. Use this as a rough guide when choosing your approach:
| Strategy | Latency | Token Cost | Best For |
|---|---|---|---|
| Full conversation history | Scales with conversation | High — pays for every past message | Short sessions (< 20 turns) |
| Summary + recent turns | Low (fixed overhead) | Moderate — summary generation cost | Long sessions, chat support |
| Vector RAG (semantic) | +50–200ms per retrieval | Low — only top-K chunks injected | Large knowledge bases |
| Hybrid (BM25 + semantic) | +100–300ms (parallel search) | Low — same top-K after RRF | Mixed query types (IDs + concepts) |
| Graph-based (GraphRAG) | +200–500ms (traversal) | Moderate — higher setup cost | Relationship-heavy domains |
| Million-token context | High latency, long generation | Very high — processing all tokens | One-off analysis, not production |
The takeaway: Even with million-token context windows available, RAG and hybrid approaches remain the most cost-effective for production systems. Processing millions of tokens in one shot demands substantial compute and can introduce latency that negates the simplicity gains of removing external retrieval. Context quality beats context quantity — both for accuracy and for your cloud bill.
Context rot is the primary culprit: the pattern of "works at 10 messages, degrades at 30" is its classic signature. As the conversation grows past 30 messages, the context window dilutes the attention the model gives to any individual piece of information. The system prompt, early important decisions, and key context get pushed into the "lost in the middle" zone, where they're overlooked. While relevance and freshness can contribute, the fix here is: summarization (compress old messages), ordering (keep critical facts near the anchors), and possibly relevance filtering (drop low-value conversational turns).
Context cropping is the key principle. The summary agent doesn't need all 15 chunks — many will be redundant or only marginally relevant, and passing the full reasoning trace adds even more noise, cost, and risk of vague output. Apply MMR or reranking to select the top 5 most informative, non-overlapping chunks. The summary agent produces better output with 5 focused chunks than with 15 noisy ones. Less context, better results — this is the core insight of context engineering.
Context as Specification
Let's close with the biggest mental shift in context engineering: the context window IS the product.
When you're building a traditional application, you write code that processes inputs and produces outputs. The code is the product. When you're building an AI application, the LLM processes context and produces responses. The context — how you assemble it, what you include, how you order it, how you test it — is the product.
This is why the techniques we've covered in this post matter so much. Memory management (Part 1) determines what historical information is available. Memory architecture (Part 2) determines how efficiently information can be stored and retrieved. Failure awareness (Part 3) tells you what can go wrong. And context engineering (this section) provides the framework for putting it all together.
The systems that get context engineering right — that treat their context window as a first-class engineering artifact — consistently outperform systems with bigger models, more parameters, and larger context windows. The context is the specification. Engineer it accordingly.
This post covered memory within a single context window and across sessions. But what happens when flat memory can't represent the relationships between facts? When you need to know not just what happened but when, why, and how it connects to other events? That's the domain of graph memory systems — the subject of the next post in this series.
وَاللَّهُ أَعْلَمُ
And Allah knows best
وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ
May Allah's peace and blessings be upon our master Muhammad and his family
Practice Mode
Test your understanding with real-world AI memory scenarios.
Your medical assistant chatbot handles patient intake conversations. On message 3, a patient mentions they're allergic to penicillin. By message 15, the conversation has filled up and your system is using trimming. On message 18, the patient asks about treatment for an infection.
The chatbot recommends amoxicillin (a penicillin-type antibiotic) without mentioning the allergy.
Cheat Sheet
The essential reference for AI memory systems.
5 Memory Types
Working: Context window (current session)
Episodic: Past sessions & outcomes
Semantic: External knowledge via RAG
Procedural: System prompts & rules
Parametric: Baked into model weights
Trimming vs Summarization
Trimming: Fast, cheap, drops oldest. Use for <20 messages, casual chat.
Summarization: Preserves intent, 300-800ms extra latency. Use for 30+ messages.
Pin + Trim: Pin safety-critical info, trim the rest. Use for medical/legal.
3 Retrieval Paradigms
Static RAG: Fixed pipeline, always retrieves
Agentic RAG: Agent decides what/when/where to search
RAG During Reasoning: Search-o1 retrieves mid-thought with Reason-in-Documents
Forgetting Curves
MemoryBank: Ebbinghaus-inspired. Used memories persist, unused ones fade.
Key insight: ~10% of information deserves permanent storage.
Spaced repetition: Frequent retrieval = stronger memory.
10 Failure Modes
Basic: Stale memory, leaked context, polluted retrieval
Intermediate: Unnecessary searches, hierarchy breakdown, missed info
Advanced: Silent overwrites, isolated silos, temporal confusion
Catastrophic: Scale collapse
5 Principles
Relevance: Every token earns its place
Diversity: MMR — avoid redundancy
Ordering: Critical info at start & end
Freshness: Recency-weighted scoring
Specification: Treat context like code
Context Ordering
Lost in the Middle: Models miss info in positions 5-15 of 20
U-shaped curve: High attention at start and end, low in middle
Fix: System prompt first, critical context near anchors, user message last
Which Strategy When
<20 messages: Full history (send everything)
20-50 messages: Trimming with pinned critical info
50+ messages: Summarization + trimming
Cross-session: Episodic memory + semantic experience
External docs: Agentic RAG + hybrid search
- Graph Memory Systems — the next post in this series: when flat memory breaks, knowledge graphs, temporal awareness, and production systems
- RAG & Agents — the complete guide to retrieval pipelines, embeddings, chunking, and agent architectures (prerequisite for this post)
- From Prompt to GPU — tokenization, GPU memory, inference phases, and serving infrastructure
- Fine-Tuning — when retrieval and memory aren't enough: LoRA, QLoRA, RAFT