بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ
In the name of Allah, the Most Gracious, the Most Merciful
Your chatbot remembers the user's name for exactly 47 messages. Message 48: "I'm sorry, what was your name again?"
Meanwhile, a medical triage bot silently drops "Patient is allergic to penicillin" from context on message 12. The allergy was mentioned on message 3. Nobody notices until the prescription.
These aren't edge cases — they're what happens when you treat memory as an afterthought. Every production AI system eventually hits this wall: the model forgets, the context overflows, the wrong information gets retrieved, or worse — critical information gets silently dropped.
This post is the complete guide to AI memory — from the fundamental statelessness problem to the paradigm shift of context engineering. No prerequisites beyond curiosity.
- Part 1: The memory problem — why LLMs forget everything between calls, and two strategies (trimming & summarization) to manage the context window
- Part 2: How agents remember — five memory types, agentic retrieval, Zettelkasten-inspired memory, forgetting curves, and RAG during reasoning
- Part 3: When memory fails — ten failure modes from stale memories to scale collapse, plus the "lost in the middle" problem and context rot
- Part 4: Context engineering — the paradigm shift from "fill the window" to "engineer the context," with five principles and multi-agent strategies
- You've built a chatbot that works great for 5 messages but forgets everything by message 20
- You've heard "context window," "RAG," or "context engineering" and want to understand what they actually mean for building AI systems
- You're wondering why your AI agent gives perfect answers on Monday but completely wrong ones by Thursday — despite having the right documents
- You read the RAG & Agents post and want the deep dive into memory that it introduced but couldn't fully explore
The Goldfish Problem
Imagine you're at a coffee shop talking to a brilliant consultant. You explain your entire business problem — the architecture, the constraints, the deadline. They give perfect advice. You come back the next morning, and they greet you like they've never seen you before. Every conversation starts from absolute zero.
That's how LLMs work. Not because they're poorly designed, but because of a fundamental architectural choice: every API call is completely isolated. When you send a message to GPT-4o, Claude, or any other LLM, the model processes your input, generates a response, and then — poof — forgets the entire interaction ever happened.
There's no internal notebook. No session state. No "hey, we talked about this yesterday." The model is stateless in the strictest sense: it has zero memory between calls.
This isn't a bug — it's a design choice. LLMs are token-prediction machines: they take a sequence of input tokens, process them through billions of parameters, and output the next most likely tokens. Once the response is generated, all that processing is discarded. The model doesn't "learn" from your conversation or store it anywhere.
If you've used ChatGPT or Claude and felt like they "remember" your conversation, that's because the application layer (the chat interface) is doing the remembering, not the model. Behind the scenes, the app stores your conversation history and sends the entire thing back to the model with every new message. The model isn't remembering — it's being reminded.
"Just Send Everything"
The simplest solution to statelessness? Keep a growing list of every message in the conversation, and send all of them to the model every single time.
Think of it like recording every meeting on your team. When you need context for a new discussion, you play back all previous recordings first. For the first few meetings, this works perfectly. By meeting 50, you're spending more time watching recordings than having actual discussions.
This approach — appending messages and sending everything — is how most chatbots start. You maintain an array of messages (system prompt + conversation turns), and each API call sends the entire array.
For short conversations — say 5 to 15 messages — this works fine. The model sees the full context and gives coherent, contextual responses. But there's a hard ceiling on how much you can send, and it's called the context window.
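The append-and-send pattern can be sketched in a few lines. This is an illustrative skeleton, not any particular SDK's API: `client` stands in for whatever chat-completion client you use (e.g. `openai.OpenAI()`), and only the bookkeeping around it is the point.

```python
# Sketch of the "send everything" pattern. The application, not the
# model, carries the memory: every call resends the full history.

def build_request(history: list[dict], user_input: str) -> list[dict]:
    """Everything the model will see this turn: stored history
    plus the new user message."""
    return history + [{"role": "user", "content": user_input}]

def record_turn(history: list[dict], user_input: str, reply: str) -> list[dict]:
    """Append both sides of the exchange so the next call can resend them."""
    return history + [
        {"role": "user", "content": user_input},
        {"role": "assistant", "content": reply},
    ]

history = [{"role": "system", "content": "You are a helpful assistant."}]
history = record_turn(history, "Hi, I'm Dana.", "Nice to meet you, Dana!")

# Next turn: the model isn't remembering — it's being reminded.
request = build_request(history, "What's my name?")
# request now holds 4 messages: system, user, assistant, user
```

Each turn, you would pass `request` to the model (e.g. `client.chat.completions.create(model=..., messages=request)`) and feed the reply back through `record_turn`. The list grows without bound — which is exactly the problem the next section quantifies.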
The Math That Matters: Context Windows
The context window is the total amount of information a model can process in a single call — both the input you send and the output it generates. It's measured in tokens, where one token is roughly three-quarters of a word (or about four characters). So 1,000 tokens is approximately 750 English words.
Think of the context window as a desk. Everything the model can "see" has to fit on this desk — your system prompt, the entire conversation history, any retrieved documents, and the response it's generating. If your papers overflow the desk, they fall off the edge and disappear.
You might look at Gemini 2.5's million-token window and think: "Problem solved. Just dump everything in." But three forces conspire against you:
- Cost scales linearly with tokens. Sending 100K tokens every call is 25x more expensive than sending 4K. For high-volume applications, this adds up fast.
- Latency increases with context size. The model has to process every token you send. More tokens means slower responses — sometimes painfully so.
- Performance degrades with length. This is the counterintuitive one. Research consistently shows that models lose track of information in long contexts, especially information placed in the middle. We'll explore this deeply in Part 3.
So even with massive context windows, you need strategies to manage what goes in. The two most fundamental strategies are trimming and summarization.
Context windows have grown dramatically over just a few years. Here's the current landscape:
| Model | Context Length | Approx. Pages |
|---|---|---|
| GPT-3.5 (2023) | 4,096 tokens | ~12 pages |
| GPT-4 (2023) | 8,192 / 32,768 | ~24–100 pages |
| GPT-4o (2024) | 128,000 tokens | ~300 pages |
| Claude 3.5 Sonnet (2024) | 200,000 tokens | ~500 pages |
| Gemini 1.5 Pro (2024) | 2,000,000 tokens | ~5,000 pages |
| Claude 3.7 Sonnet (2025) | 200,000 tokens | ~500 pages |
| Gemini 2.5 (2025) | 1,000,000 tokens | ~2,500 pages |
| Llama 3.x (2025) | 8,000–128,000 | Varies by config |
Important: "Context length" is the maximum the model can process, not what it should. Just because Gemini can handle a million tokens doesn't mean it will handle them well. Models often struggle with information placed in the middle of very long contexts (the "lost in the middle" phenomenon we cover in Part 3).
Also note the difference between context length (maximum input) and output length (maximum response). Most models have output limits of 4K–16K tokens, even with massive input windows.
Strategy 1: Trimming — First In, First Out
The simplest strategy: when the conversation gets too long, drop the oldest messages. Keep the system prompt (which defines the AI's behavior) and the most recent messages. Everything else falls off the edge.
Think of a conveyor belt at an airport baggage claim. New bags arrive from one end, and once the belt is full, the oldest bags drop off the other end. Your system prompt is the one bag that's bolted to the belt — it never moves. Everything else is first in, first out.
Trimming is fast, cheap, and easy to implement. You count tokens from the newest message backward until you hit the limit, and everything older gets cut. No extra API calls, no processing overhead.
But trimming has a critical weakness: it has no concept of importance. A user mentioning their name on message 1 is just as likely to be trimmed as a casual "sounds good" on message 3. A patient's drug allergy mentioned early in a conversation can be silently dropped when newer, less important messages push it out.
For some applications — casual chat, quick Q&A sessions, interactions under 20 messages — trimming works perfectly. For anything involving safety-critical information, long-running tasks, or users who expect continuity, you need something smarter.
Different models tokenize text differently, which means the same sentence can use different numbers of tokens across models. This matters for trimming because you need an accurate count.
General rules of thumb:
- 1 token ≈ 4 characters in English (roughly ¾ of a word)
- 1,000 tokens ≈ 750 words ≈ 3 pages of text
- Code is typically more token-dense than prose (more special characters)
- Non-English languages often use more tokens per word (Chinese, Arabic, etc.)
For OpenAI models, use the tiktoken library for exact counts. For Claude, Anthropic provides a token counting API. For open-source models, each has its own tokenizer (usually a SentencePiece or BPE variant).
Why exact counts matter: If your trimming function estimates "this message is about 100 tokens" but it's actually 150, you might exceed the context window and get an error. Always use the model's actual tokenizer for production systems.
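To make the estimate-versus-exact gap concrete, here is a sketch with both: a character-based heuristic and an exact count via the model's tokenizer (the exact version assumes `tiktoken` is installed; the model name is just an example).

```python
def estimate_tokens(text: str) -> int:
    """Crude rule of thumb: ~4 characters per token in English."""
    return max(1, len(text) // 4)

def exact_tokens(text: str, model: str = "gpt-4o") -> int:
    """Exact count using the model's own tokenizer (requires tiktoken)."""
    import tiktoken
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

# The heuristic is fine for dashboards and rough budgeting, but
# punctuation- and code-heavy text can diverge sharply from ~4 chars
# per token — near the context limit, always use exact_tokens().
```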
Don't forget the output. The context window includes both input AND output tokens. If you fill 95% of the window with input, the model only has 5% left to generate a response, which often leads to truncated or incoherent answers.
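A simple way to respect the shared input/output budget is to subtract a reserved output allowance (plus a safety margin) before filling the window. The numbers below are illustrative defaults, not model requirements:

```python
def input_budget(context_window: int, reserved_output: int = 1024,
                 safety_margin: int = 256) -> int:
    """Tokens available for system prompt + history + retrieved docs,
    after reserving space for the model's response."""
    return context_window - reserved_output - safety_margin

budget = input_budget(128_000)  # a GPT-4o-sized window
# budget == 126_720 — trim/summarize the input down to this, and pass
# the same reservation to the API (e.g. max_tokens=1024) so the model
# can't be cut off mid-answer.
```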
Basic FIFO trimming — count tokens from newest to oldest, cut when full:
```python
def trim_messages(messages, max_tokens=4000, model="gpt-4o"):
    """Keep the system prompt + most recent messages that fit."""
    import tiktoken
    enc = tiktoken.encoding_for_model(model)

    # Always keep the system prompt
    system_msg = messages[0]
    system_tokens = len(enc.encode(system_msg["content"]))

    # Add messages from newest to oldest until we hit the limit
    trimmed = []
    token_count = system_tokens
    for msg in reversed(messages[1:]):
        msg_tokens = len(enc.encode(msg["content"]))
        if token_count + msg_tokens > max_tokens:
            break
        trimmed.insert(0, msg)
        token_count += msg_tokens

    return [system_msg] + trimmed
```
Rolling window with LangGraph — for agentic systems, you can implement a rolling context window where the full interaction is passed until the window fills up, then the oldest parts are ejected first-in, first-out. This is easy to implement, low in complexity, and works for many use cases.
Priority-aware trimming — the critical improvement over basic FIFO:
```python
def smart_trim(messages, max_tokens=4000):
    """Trim with priority: pinned > recent > old."""
    system = messages[0]
    pinned = [m for m in messages[1:] if m.get("pinned")]
    regular = [m for m in messages[1:] if not m.get("pinned")]

    # Start with system + pinned (always kept).
    # count_tokens wraps the model's tokenizer (e.g. tiktoken).
    result = [system] + pinned
    token_count = count_tokens(result)

    # Add regular messages from newest to oldest, inserted just after
    # the pinned block so chronological order is preserved
    head = len(result)
    for msg in reversed(regular):
        msg_tokens = count_tokens([msg])
        if token_count + msg_tokens > max_tokens:
            break
        result.insert(head, msg)
        token_count += msg_tokens

    return result
```
The pinned flag lets you mark safety-critical messages (allergies, constraints, user identity) that should never be trimmed, regardless of age.
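Here is what that looks like in practice. The `pinned` key is an application-level convention (the shape `smart_trim` expects), not a field any model API understands:

```python
# Messages for a medical triage session; the allergy is flagged as
# pinned so no trimming strategy can ever drop it.
messages = [
    {"role": "system", "content": "You are a medical triage assistant."},
    {"role": "user", "content": "I'm severely allergic to penicillin.",
     "pinned": True},
    {"role": "assistant", "content": "Noted — no penicillin-class drugs."},
    {"role": "user", "content": "What should I take for a sinus infection?"},
]

pinned = [m for m in messages if m.get("pinned")]
# Exactly one message survives any amount of trimming: the allergy.
```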
Strategy 2: Summarization — Compress, Don't Delete
What if instead of throwing away old messages, you condensed them into a summary? Rather than losing that the user mentioned their name, their project requirements, and their deadline, you compress 20 messages into a single paragraph that captures the essential facts.
Think of it as the difference between keeping every receipt from a year of grocery shopping versus keeping a monthly budget spreadsheet. The receipts have all the detail, but the spreadsheet tells you everything you actually need to know.
The approach: when the conversation exceeds a threshold, take the oldest messages, send them to a (usually cheaper) model with a summarization prompt, and replace them with the resulting summary. The model now sees: system prompt → summary of old messages → last 6 recent messages.
Summarization preserves intent where trimming destroys it. The summary captures "the user is Bahgat, building a FastAPI service with JWT authentication, deadline is March" even though those details were scattered across messages 1 through 15. With trimming, all of that would be gone once message 20 arrives.
The tradeoff? Summarization costs extra. Each summarization step requires an additional API call (typically using a cheaper model like GPT-4o-mini). It adds 300–800ms of latency. And summarization is lossy — you inevitably lose some nuance and detail. A summary might capture "the user has a food allergy" but drop the specific allergen. For casual conversations that's fine. For medical applications, it's dangerous.
But there's not just one way to summarize. In practice, there are three distinct strategies — stacking summaries (summarize each old batch and keep every summary), batch summarization (periodically re-summarize older material in batches), and the single rolling summary (one summary rewritten to fold in each new batch) — and the right choice depends on your use case:
For short conversations (under 20 turns), stacking summaries works fine — the stack stays small. For longer interactions, batch or rolling summaries are necessary. The single rolling summary is the most aggressive: it never grows, but it risks overwriting important early details if the conversation shifts topics. In practice, many production systems use a hybrid: a rolling summary for the bulk of history, plus a separate "critical facts" list that's never summarized away.
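The hybrid just described can be sketched with a small class. The summarizer is stubbed with a plain function to keep the sketch self-contained; in a real system it would be an LLM call:

```python
# Hybrid memory sketch: one rolling summary that is rewritten each
# compression round, plus a critical-facts list that is only appended
# to and never summarized away.
class HybridMemory:
    def __init__(self, summarize):
        self.summarize = summarize   # callable: (old_summary, new_msgs) -> str
        self.rolling_summary = ""
        self.critical_facts = []     # pinned facts, never compressed

    def pin_fact(self, fact: str):
        self.critical_facts.append(fact)

    def compress(self, old_messages: list[str]):
        # Fold the old messages into the single rolling summary
        self.rolling_summary = self.summarize(self.rolling_summary, old_messages)

    def context_prefix(self) -> str:
        facts = "\n".join(f"- {f}" for f in self.critical_facts)
        return f"CRITICAL FACTS:\n{facts}\n\nSUMMARY:\n{self.rolling_summary}"

# Stub summarizer (concatenation) for illustration only:
mem = HybridMemory(lambda old, msgs: (old + " " + " ".join(msgs)).strip())
mem.pin_fact("Severe penicillin allergy")
mem.compress(["User is Bahgat.", "Deadline is March."])
# context_prefix() now carries both the pinned fact and the summary
```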
Basic summarization with preservation priorities:
```python
def summarize_and_compress(messages, max_tokens=4000):
    """Summarize old messages, keep recent ones verbatim."""
    import tiktoken
    enc = tiktoken.encoding_for_model("gpt-4o")
    total = sum(len(enc.encode(m["content"])) for m in messages)
    if total <= max_tokens:
        return messages  # No need to summarize yet

    # Keep system prompt + last 6 messages verbatim
    system = messages[0]
    recent = messages[-6:]
    old = messages[1:-6]
    if not old:
        return messages  # Nothing to summarize

    # Summarize the old messages (client is an OpenAI() instance)
    summary = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for summarization
        messages=[{
            "role": "system",
            "content": """Summarize this conversation. Preserve:
- User's name, preferences, and stated requirements
- Key decisions made and their reasoning
- Technical context (language, framework, architecture)
- Any constraints or deadlines mentioned
- Safety-critical information (allergies, access levels, etc.)
Be concise but don't lose critical facts.""",
        }] + old,
    ).choices[0].message.content

    return [
        system,
        {"role": "system",
         "content": f"Previous conversation summary: {summary}"},
        *recent,
    ]
```
Production tip: Use a cheaper model (GPT-4o-mini, Claude Haiku) for summarization. You're compressing, not reasoning — smaller models handle this well at a fraction of the cost.
Structured summarization prompt (for applications where specific categories of information matter):
```python
SUMMARY_PROMPT = """Summarize the following conversation into these categories:

USER IDENTITY: Name, role, preferences
PROJECT CONTEXT: What they're building, tech stack, architecture
CONSTRAINTS: Deadlines, budgets, requirements, limitations
DECISIONS MADE: What was decided and why
CURRENT STATE: Where the conversation left off
OPEN QUESTIONS: Anything unresolved

Be concise. Use bullet points. Preserve exact numbers, dates, and names."""
```
This structured approach makes it easier to verify that critical information survived the summarization step.
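One cheap way to do that verification, sketched here as an illustration: check that strings flagged as critical appear verbatim in the summary, and re-inject or pin whatever was dropped.

```python
def verify_summary(summary: str, must_preserve: list[str]) -> list[str]:
    """Return the critical strings the summary dropped (case-insensitive)."""
    return [item for item in must_preserve
            if item.lower() not in summary.lower()]

summary = "USER IDENTITY: Bahgat, backend dev.\nCONSTRAINTS: deadline March."
lost = verify_summary(summary, ["Bahgat", "March", "penicillin allergy"])
# lost == ["penicillin allergy"] — this fact must be re-injected or
# pinned before the next turn, not silently forgotten.
```

Exact-substring matching is deliberately strict: it produces false alarms when the summarizer paraphrases, but for safety-critical facts a false alarm is the cheaper failure.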
The code above shows the raw approach. In production, LangGraph handles summarization automatically with its SummarizationNode:
```python
# pip install langmem langgraph langchain-openai
from typing import Any
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, START, MessagesState
from langgraph.checkpoint.memory import InMemorySaver
from langmem.short_term import SummarizationNode

model = ChatOpenAI(model="gpt-4o")

# SummarizationNode keeps its running summary in a "context" channel,
# so the state schema needs that key alongside messages
class State(MessagesState):
    context: dict[str, Any]

# Configure automatic summarization
summarization_node = SummarizationNode(
    token_counter=model.get_num_tokens_from_messages,
    model=model.bind(max_tokens=128),
    max_tokens=256,                  # max tokens passed to the LLM
    max_tokens_before_summary=256,   # trigger summarization here
    max_summary_tokens=128,          # max summary size
)

def call_model(state):
    return {"messages": [model.invoke(state["summarized_messages"])]}

# Build graph: summarize → call model
builder = StateGraph(State)
builder.add_node("summarize", summarization_node)
builder.add_node("call_model", call_model)
builder.add_edge(START, "summarize")
builder.add_edge("summarize", "call_model")
graph = builder.compile(checkpointer=InMemorySaver())

# Use it — summarization happens automatically when context grows
config = {"configurable": {"thread_id": "1"}}
graph.invoke({"messages": "Hi, I'm Bob"}, config)
graph.invoke({"messages": "I'm building a RAG system"}, config)
response = graph.invoke({"messages": "What's my name?"}, config)
# Bob ✓ — the summary preserved it
```
Summarize every N turns (alternative pattern): Instead of token-based triggering, use a conditional edge that counts messages:
```python
from langgraph.graph import END

# After the model responds, check the message count
def should_summarize(state):
    if len(state["messages"]) > 6:
        return "summarize_conversation"
    return END

workflow.add_conditional_edges("conversation", should_summarize)
```
Note: LangChain's older ConversationSummaryMemory and ConversationSummaryBufferMemory are deprecated since v0.3.1. The LangGraph + langmem approach above is the current recommended path.
If the allergy information is lost, the system might recommend a penicillin-based antibiotic. This is a safety-critical failure.
Pin safety-critical information so it's never trimmed or summarized away: flag the allergy mention on message 3 as "pinned," keep it alongside the system prompt permanently, and manage the rest of the conversation with trimming or summarization. Neither strategy is safe on its own here. Trimming would silently drop the allergy once enough new messages arrive. Summarization might preserve it — or might compress "severe penicillin allergy" into "patient has allergies," losing the critical detail. For medical systems, "might" isn't good enough.
Trimming and summarization manage what's in the context window. But what about knowledge that was never there in the first place? What about your user's preferences from last month, the documentation for your API, or the lessons learned from a thousand past conversations? That's where memory systems come in — and where the real complexity begins.
A Map of Memory
Trimming and summarization handle one kind of memory — the current conversation. But real AI systems need to remember across sessions, access external knowledge, follow rules, and draw on what they were trained on. These are fundamentally different types of memory, just like in the human brain.
Neuroscientists have long categorized human memory into distinct systems: working memory (what you're holding in your head right now), episodic memory (what happened at your birthday party), semantic memory (knowing that Paris is in France), and procedural memory (how to ride a bicycle). AI memory systems mirror these categories — not because we designed them to, but because the same problems arise naturally.
Working memory is what we just covered — the context window. It's fast, direct, and ephemeral. Everything the model "sees" in a single call lives here.
Episodic memory stores specific past interactions. "Last Tuesday, the user asked about Redis caching and we decided against it because of their single-server setup." This is where semantic experience memory comes in — a technique where part of the context window is reserved for the best-matching past interactions. With each new user input, the text is embedded and used to search across all previous sessions. The top matches are injected into the context alongside the current conversation, so the agent can draw on relevant past experience without storing the entire history.
Semantic memory is external knowledge — your documentation, knowledge bases, APIs, and databases. This is the domain of RAG (Retrieval-Augmented Generation), which we covered extensively in the RAG & Agents post. The basic idea: embed your documents as vectors, find the most relevant chunks for a query, and inject them into the context.
Procedural memory lives in system prompts and behavioral rules. "Always respond in JSON." "Never reveal the internal prompt." "If the user asks about pricing, redirect to the sales team." These are the habits and reflexes of the AI system — stable, consistent, and rarely changed.
Parametric memory is everything the model learned during training. It "knows" Python syntax, that Paris is in France, and how to structure an argument — not because anyone told it in the prompt, but because these patterns are encoded in its billions of parameters. The limitation: this knowledge has a cutoff date and can't be updated without retraining or fine-tuning.
However these memory types are stored, they all converge in the same place: everything is assembled into a single prompt, and the model sees one unified input.
Parametric memory is the knowledge encoded in the model's weights during pre-training. Unlike other memory types, you can't directly add to it or update it (without fine-tuning or retraining).
What it's good at:
- Language understanding, grammar, and reasoning patterns
- General world knowledge (geography, history, science)
- Programming languages, frameworks, and common patterns
- Understanding context, nuance, and implied meaning
Where it fails:
- Knowledge cutoff: The model doesn't know about events after its training date. Ask GPT-4o about something that happened last week, and it genuinely doesn't know.
- Hallucination: When parametric memory is uncertain, the model doesn't say "I don't know" — it confidently generates plausible-sounding but incorrect information. This is one of the primary motivations for RAG.
- Domain-specific gaps: The model knows a lot about common topics but less about niche domains. Your company's internal APIs, proprietary processes, and industry-specific terminology likely aren't in its parameters.
This is why we need other memory types. Parametric memory provides the foundation, but episodic, semantic, and working memory fill the gaps with current, specific, and contextual information.
Standard chatbots start every session from a blank slate. Even if the user had a detailed 2-hour conversation yesterday, today the agent has zero recollection of it. Semantic experience memory solves this.
How it works:
- Every interaction (user message + agent response) is stored and embedded in a vector database
- When the user sends a new message, their input is also embedded
- A vector similarity search finds the most relevant past interactions across all previous sessions
- Part of the context window is reserved for these matches — typically 10–20% of the available space
- The rest of the context window holds the system prompt, current conversation, and any RAG results
The key insight: You're not retrieving all of last Tuesday's conversation. You're retrieving the 3–5 past interactions that are most semantically relevant to what the user is asking right now. If they asked about Redis caching last week and are now asking about caching again, those specific exchanges surface automatically.
This enables agents to not just draw upon a broad base of knowledge but also tailor their responses based on accumulated experience, leading to more adaptive and personalized behavior.
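The retrieval loop above can be sketched end to end. To keep it runnable without any API, this uses a toy bag-of-words "embedding" with cosine similarity; in production you would use a real embedding model and a vector store:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts. Stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Store every past interaction with its embedding (step 1)
past_interactions = [
    "User asked about Redis caching; decided against it (single server).",
    "User asked about JWT auth in FastAPI.",
    "User discussed deployment on a single VPS.",
]
store = [(text, embed(text)) for text in past_interactions]

def recall(query: str, k: int = 2) -> list[str]:
    """Steps 2-4: embed the new input, rank all past interactions by
    similarity, return the top-k to inject into the context."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# A new question about caching surfaces the Redis exchange from a past session
relevant = recall("Should we add caching to the API?")
```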
The foundation for semantic memory is embeddings — vector representations that capture meaning — and the field has evolved rapidly.
Where embeddings live — vector stores:
FAISS (Facebook AI Similarity Search) and Annoy (Approximate Nearest Neighbors Oh Yeah) are optimized for fast similarity searches over high-dimensional vectors. Managed services like Pinecone, Weaviate, and Chroma handle the infrastructure and scale concerns for you.
Option A: Chroma (easiest to start, runs locally, no server needed):

```python
# pip install langchain-chroma langchain-openai
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_texts(
    texts=["your", "documents", "here"],
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db",  # saves to disk
)

# Search
results = vectorstore.similarity_search("your query", k=3)
```
Option B: FAISS (faster for large datasets, Facebook's battle-tested library):
```python
# pip install langchain-community faiss-cpu langchain-openai
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

vectorstore = FAISS.from_texts(
    texts=["your", "documents", "here"],
    embedding=OpenAIEmbeddings(),
)

# Save / load
vectorstore.save_local("./faiss_index")
loaded = FAISS.load_local(
    "./faiss_index", OpenAIEmbeddings(),
    allow_dangerous_deserialization=True,
)
```
Which to choose: Chroma for prototyping and small datasets (built-in persistence, metadata filtering). FAISS for production workloads with millions of vectors (optimized C++ core). Pinecone, Weaviate, or Qdrant for managed cloud with zero ops.
As context windows grow to millions of tokens, a new concept is emerging: index-free RAG. Instead of maintaining external vector stores, you load entire knowledge bases directly into the model's context and let its attention mechanisms do the retrieval internally. Models like GPT-4.1 and Gemini 2.5 can process 1M+ tokens, potentially making external retrieval unnecessary for smaller datasets. The trade-off: massive compute cost and no guarantee the model finds the relevant passage in a sea of tokens. For now, hybrid approaches (selective retrieval + long context) remain the practical choice.
Google Gemini's 1M-token context window + caching makes index-free RAG practical for small-to-medium knowledge bases:
```python
# pip install google-genai
from google import genai
from google.genai import types

client = genai.Client()

# Upload your entire knowledge base as files
file1 = client.files.upload(file="company_handbook.pdf")
file2 = client.files.upload(file="product_docs.pdf")

# Cache the knowledge base (pay once for ingestion)
cache = client.caches.create(
    model="models/gemini-2.5-pro",
    config=types.CreateCachedContentConfig(
        display_name="company_kb",
        system_instruction="Answer based on the knowledge base.",
        contents=[file1, file2],
        ttl="3600s",  # 1 hour cache
    ),
)

# Each query now costs 75-90% less for the cached portion
response = client.models.generate_content(
    model="models/gemini-2.5-pro",
    contents="What is our refund policy for enterprise?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```
When this beats traditional RAG:
- Knowledge base under ~750 pages (fits in 1M tokens)
- Repeated queries over the same corpus (caching amortizes cost)
- You need zero infrastructure (no vector store, no embeddings, no chunking)
When traditional RAG still wins:
- Millions of documents (won't fit in any context window)
- Need for metadata filtering ("show me only Q4 reports")
- Latency-sensitive applications (~45s for full context vs ~1s for RAG retrieval)
Anthropic alternative: Claude supports 200K tokens standard (1M in beta). Use XML tags to organize the knowledge base: <knowledge_base>...</knowledge_base> for cleaner context structure.
Beyond Basic RAG: Three Advances
In the RAG & Agents post, we covered the standard retrieval pipeline: embed your documents, store them in a vector database, find the most similar chunks for a query, and inject them into the prompt. That pipeline treats retrieval as a static, one-shot step that runs before the model generates output.
But what if the agent could decide what, when, and how to retrieve? What if it could build interconnected notes like a researcher? What if retrieval could happen during reasoning, not just before it? These three advances push memory beyond basic RAG.
1. Agentic RAG — The Agent Takes Control
In standard RAG, the vector database is the long-term memory of the LLM — but the LLM has no say in what gets retrieved. It's like a student who gets handed a stack of textbook pages by a librarian and has to make do with whatever was selected. Agentic RAG gives the student direct access to the library card catalog.
In agentic RAG, retrieval becomes a tool that the agent can invoke on its own terms. Instead of a fixed pipeline that always runs the same way, the agent decides:
- Whether to search at all — sometimes the answer is already in the conversation
- Which source to search — documentation? Slack messages? The user's past sessions? A web search?
- What query to use — the agent can reformulate the user's question for better retrieval
- Whether to search again — if the first results aren't sufficient, it can refine and search again
In practice, agentic RAG often uses a router pattern: the agent has access to multiple knowledge sources (documentation, Slack, email, web search, past sessions) as individual tools, and it picks the right one(s) based on the query. It might search your API docs first, find insufficient information, then run a web search for a more recent answer.
This also extends to multi-agent RAG, where specialized retrieval agents each handle a specific knowledge source. A coordinator agent routes the query to the right specialist, collects results, and synthesizes a response. Think of it as having a team of research assistants instead of one generalist.
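The router pattern reduces to a simple shape: knowledge sources as tools, plus a routing decision. In this sketch the routing is stubbed with keyword rules so it runs standalone; in a real system the LLM makes that choice via tool calling, and the tool bodies hit actual backends:

```python
# Stub tools — each stands in for a real retrieval backend
def search_docs(query: str) -> str:
    return f"[docs] results for {query!r}"

def search_slack(query: str) -> str:
    return f"[slack] results for {query!r}"

def search_web(query: str) -> str:
    return f"[web] results for {query!r}"

TOOLS = {"docs": search_docs, "slack": search_slack, "web": search_web}

def route(query: str) -> str:
    """Crude stand-in for the agent's routing decision."""
    q = query.lower()
    if "api" in q:
        return "docs"       # technical questions → documentation
    if "discussed" in q or "said" in q:
        return "slack"      # "who said what" → chat history
    return "web"            # everything else → web search

source = route("How do I paginate the API?")
answer = TOOLS[source]("How do I paginate the API?")
# The agent could inspect `answer`, judge it insufficient, and route
# the same query to a second source — that's the "search again" step.
```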
2. A-MEM — The Zettelkasten for AI
The Zettelkasten method is a note-taking system used by prolific writers and researchers. The core idea: each note contains exactly one unit of knowledge (atomicity), and notes are heavily linked to each other (hypertextual). Over time, you build a web of interconnected ideas where any note can lead you to related concepts you might not have found otherwise.
A-MEM (Agentic Memory) borrows this idea for AI agents. Instead of storing raw conversations or document chunks, each "memory" is an enriched note containing:
- The original interaction (one conversation turn)
- A timestamp
- LLM-generated keywords that capture key concepts
- LLM-generated tags to categorize the interaction
- An LLM-generated contextual description that summarizes the significance
All of this is concatenated and embedded into a single vector. The clever part: when a new memory is added, the system searches for existing memories with similar embeddings. The LLM is then asked to evaluate each candidate and decide which should be linked to the new memory. After linking, connected memories are updated — their tags, keywords, and descriptions evolve to reflect the new relationship.
This creates a living knowledge graph that doesn't just accumulate memories but actively maintains an evolving worldview. When the agent retrieves a memory, it can also traverse links to discover related memories it might not have found through embedding similarity alone.
Step 1: Create the note. A single interaction is processed by the LLM to extract keywords, tags, and a contextual description. These are concatenated with the original text and embedded into a single vector.
Step 2: Find related memories. The new note's embedding is compared against all existing memories via similarity search. The top-K candidates are retrieved.
Step 3: LLM-based linking. The LLM evaluates each candidate and decides: should this existing memory be linked to the new one? If yes, what kind of relationship is it? (expansion, refinement, contradiction, new branch)
Step 4: Evolution. After linking, the LLM is prompted to update the tags, keywords, and descriptions of connected memories. A note about "Redis caching" that gets linked to a new note about "caching failures" might have its description updated to mention failure modes.
Why this matters: Traditional vector stores just accumulate chunks. A-MEM creates a living network where adding new knowledge refines existing knowledge. It mirrors how human experts develop deeper understanding over time — new experiences don't just stack up; they reshape how we understand earlier experiences.
The downside: every new memory requires multiple LLM calls (extraction, similarity search, linking, evolution), making it significantly more expensive than simple vector storage. This is a tradeoff between memory quality and operational cost.
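The four steps can be captured in a data-shape sketch. The LLM calls (keyword extraction, link evaluation, description evolution) are stubbed — here linking happens on shared keywords — so only the structure and flow follow A-MEM, not its actual prompts:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class MemoryNote:
    content: str                 # the original interaction
    timestamp: str
    keywords: list[str]          # LLM-extracted in the real system
    tags: list[str]              # LLM-generated categories
    description: str             # LLM-generated contextual summary
    links: list[int] = field(default_factory=list)  # related note ids

notes: list[MemoryNote] = []

def add_note(content, keywords, tags, description) -> int:
    note = MemoryNote(content, datetime.now().isoformat(),
                      keywords, tags, description)
    new_id = len(notes)
    # Steps 2-4 (stubbed): link notes that share a keyword. The real
    # system retrieves candidates by embedding similarity, asks an LLM
    # to judge each link, then rewrites the linked notes' metadata.
    for i, other in enumerate(notes):
        if set(keywords) & set(other.keywords):
            note.links.append(i)
            other.links.append(new_id)  # evolution: old notes change too
    notes.append(note)
    return new_id

a = add_note("Chose Redis for caching", ["redis", "caching"],
             ["infra"], "Decision to use Redis")
b = add_note("Redis cache kept evicting hot keys", ["redis", "eviction"],
             ["incident"], "Caching failure mode")
# The shared "redis" keyword links the failure back to the decision
```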
LangGraph makes the agentic retrieval loop explicit as a state graph. The key pattern: after retrieval, a grader node decides if results are relevant. If not, a rewriter node reformulates the query and loops back:
```python
# pip install langgraph langchain-openai langchain-community faiss-cpu
from langgraph.graph import StateGraph, START, END, MessagesState
from langgraph.prebuilt import ToolNode, tools_condition
from langchain.chat_models import init_chat_model
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool

model = init_chat_model("gpt-4o", temperature=0)

# Set up your retriever (any LangChain retriever works here)
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

vectorstore = FAISS.from_texts(["your", "documents", "here"], OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

@tool
def search_knowledge_base(query: str) -> str:
    """Search the knowledge base for relevant information."""
    docs = retriever.invoke(query)
    return "\n\n".join(d.page_content for d in docs)

# Node 1: LLM decides to search or respond directly
def generate_or_search(state: MessagesState):
    response = model.bind_tools([search_knowledge_base]).invoke(state["messages"])
    return {"messages": [response]}

# Grader: routes after retrieval (used as a conditional edge, not a node)
def grade_documents(state: MessagesState):
    question = state["messages"][0].content
    context = state["messages"][-1].content
    grade = model.invoke(
        f"Answer 'yes' or 'no': Is this context relevant?\n"
        f"Question: {question}\nContext: {context}"
    )
    return "generate_or_search" if "yes" in grade.content.lower() else "rewrite_question"

# Node 2: Rewrite the query for better retrieval
def rewrite_question(state: MessagesState):
    question = state["messages"][0].content
    better = model.invoke(f"Reformulate this question for better search results: {question}")
    return {"messages": [HumanMessage(content=better.content)]}

# Build the graph
workflow = StateGraph(MessagesState)
workflow.add_node(generate_or_search)
workflow.add_node("retrieve", ToolNode([search_knowledge_base]))
workflow.add_node(rewrite_question)
workflow.add_edge(START, "generate_or_search")
workflow.add_conditional_edges("generate_or_search", tools_condition,
                               {"tools": "retrieve", END: END})  # END = model answered directly
workflow.add_conditional_edges("retrieve", grade_documents)
workflow.add_edge("rewrite_question", "generate_or_search")  # loop back!
graph = workflow.compile()
```
The key insight: The rewrite_question → generate_or_search edge creates a loop. The agent keeps refining its query until the grader is satisfied — or until a max iteration limit is reached. This is what makes it "agentic" instead of "static." When the model decides it can answer directly (no tool call), tools_condition routes to END.
LlamaIndex alternative: LlamaIndex takes a different approach with ReActAgent — each data source becomes a tool, and a meta-agent routes queries across them. Use LangGraph when you want explicit control over the retrieval loop; use LlamaIndex when you have multiple data sources and want automatic routing.
3. Search-o1 — RAG During Reasoning
Standard RAG retrieves information before the model starts generating. Agentic RAG lets the agent decide when to retrieve. Search-o1 goes further: it enables retrieval during the reasoning process itself.
Think of it this way. In standard RAG, you hand a student a stack of textbook pages and say "now answer the question." In agentic RAG, the student can ask the librarian for specific books. In Search-o1, the student is working through a problem, realizes mid-thought "I need to check something," looks it up, processes what they found, and continues reasoning. The information arrives exactly when the reasoning process needs it.
Technically, the model is instructed to use special tokens — <|begin_search_query|> and <|end_search_query|> — to trigger a search mid-reasoning. The retrieved documents are then processed by a Reason-in-Documents module: using the search query, the retrieved content, and the current reasoning trace, this module condenses everything into focused reasoning steps that align with the model's thought process.
Why is the Reason-in-Documents step important? Because raw retrieved documents are often long, contain irrelevant sections, and can disrupt the flow of reasoning. By having the model's own reasoning LLM process the retrieved information, the documents are compressed and formatted to fit naturally within the ongoing chain of thought.
How it works in practice:
Consider the query "Why are flamingos pink?" The model begins reasoning:
"Flamingos are pink... this is related to their diet, but I need to know the specific mechanism..."
<|begin_search_query|> flamingo pigmentation diet mechanism <|end_search_query|>
<|begin_search_result|> [Wikipedia excerpt about carotenoid pigments in brine shrimp...] <|end_search_result|>
"The pigments are carotenoids found in brine shrimp. But how exactly are these pigments metabolized into the feather coloration?"
The model then triggers a second search to ArXiv for the metabolic pathway, processes that result, and continues reasoning with the full picture.
Key innovation: Instead of just injecting raw documents into the context (which can be noisy and disruptive), the Reason-in-Documents module uses the same reasoning LLM to process retrieved content such that it fits naturally within the reasoning trace. The information is aligned with how the model is thinking, not just dumped in.
Limitation: This approach is primarily used for long-term semantic memory (external knowledge), not for working memory or episodic memory. It's most powerful for tasks that require synthesizing information from multiple sources mid-thought — like research queries, complex analysis, and multi-step reasoning.
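The controller around the token protocol can be sketched as a loop: generate until a search-query span appears, run the search, append the result tokens, and resume. This is an illustration, not the paper's code; `generate` stands in for the reasoning model and `search` for retrieval plus the Reason-in-Documents condensation.

```python
import re

SEARCH_SPAN = re.compile(
    r"<\|begin_search_query\|>(.*?)<\|end_search_query\|>", re.DOTALL
)

def run_with_search(generate, search, prompt, max_searches=5):
    """Generate until the model emits a search-query span, run the search,
    inject the result into the trace, and let the model continue."""
    trace = prompt
    for _ in range(max_searches):
        output = generate(trace)              # model extends the reasoning trace
        trace += output
        match = SEARCH_SPAN.search(output)
        if match is None:
            break                             # no search requested: done reasoning
        query = match.group(1).strip()
        result = search(query)                # retrieve + condense into the trace
        trace += f"<|begin_search_result|>{result}<|end_search_result|>"
    return trace
```

The `max_searches` cap matters in practice: without it, a model that keeps emitting search queries loops forever.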
Forgetting as a Feature: MemoryBank
Here's a counterintuitive idea: not everything deserves to be remembered. Your brain doesn't retain every conversation you've ever had — it selectively strengthens important memories and lets unimportant ones fade. MemoryBank applies this principle to AI.
Inspired by the Ebbinghaus Forgetting Curve — the psychological model showing that we forget roughly half of newly learned information within a day unless we actively reinforce it — MemoryBank dynamically adjusts the "strength" of each memory based on usage.
When a memory is retrieved and used during a conversation, its strength is reinforced — making it persist longer in the system. But if a memory hasn't been accessed for a while, it gradually loses strength and may eventually be removed entirely. This is the AI equivalent of spaced repetition — the same technique students use to prepare for exams.
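A minimal sketch of this reinforce-or-fade mechanism, using the exponential form of the forgetting curve, retention = e^(−t/S), where t is time since last use and S is the memory's strength. The pruning threshold and the +1 strength boost are illustrative assumptions, not MemoryBank's published parameters.

```python
import math

class DecayingMemory:
    """A memory whose retention follows e^(-t/S); each retrieval raises S."""
    def __init__(self, content, strength=1.0):
        self.content = content
        self.strength = strength      # S: grows every time the memory is used
        self.age = 0.0                # t: days since last reinforcement

    def retention(self):
        return math.exp(-self.age / self.strength)

    def tick(self, days=1.0):
        self.age += days              # time passes, retention decays

    def reinforce(self):
        self.strength += 1.0          # spaced-repetition style boost
        self.age = 0.0                # reset the clock

def prune(memories, threshold=0.05):
    """Drop memories whose retention has faded below the threshold."""
    return [m for m in memories if m.retention() >= threshold]
```

A memory that is retrieved regularly keeps resetting its clock and growing its strength; one that is never touched decays below the threshold within days and gets pruned.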
MemoryBank stores three types of information for each user: raw conversation turns (embedded for retrieval), LLM-generated summaries of past events, and a dynamically updated "user portrait" that captures personality traits and preferences. The user portrait is always included in the context; the turns and summaries are retrieved on demand.
This approach aligns with a powerful insight from the mem0 research: only about 10% of information in a conversation deserves permanent storage. The user saying "sounds good" or "let me think about it" doesn't need to be immortalized. But "I'm allergic to penicillin" does. MemoryBank's forgetting mechanism naturally separates the signal from the noise — important information gets retrieved and reinforced, while trivial exchanges fade away.
Don't Forget Keyword Search
With all this talk about embeddings and semantic search, it's tempting to think keyword-based search is obsolete. It isn't. In fact, for certain types of memory retrieval, BM25 keyword search outperforms semantic search.
Consider this: a user asks "What was the configuration for server PROD-DB-07?" Semantic search looks for meaning — it might find documents about database configurations in general. But the user needs the exact string "PROD-DB-07." Keyword search with an inverted index finds it instantly because it's matching exact terms, not meanings.
BM25 (Best Matching 25) is the scoring function behind most keyword search systems. It ranks results by three factors: how often the search term appears in the document (term frequency), how rare the term is across all documents (inverse document frequency), and document length normalization. This makes it excellent for retrieving specific identifiers, error codes, configuration values, and proper nouns — exactly the things semantic search often misses.
In practice, the best memory systems combine both. Semantic search finds conceptually related information; keyword search finds exact matches. This is the hybrid search approach we covered in the RAG & Agents post, and it's just as important for memory retrieval as it is for document retrieval.
When you run BM25 and semantic search in parallel, you get two ranked lists. How do you merge them? The most common technique is Reciprocal Rank Fusion (RRF).
For each result, RRF computes a score by summing the reciprocal of its rank in each list:

`RRF_score(d) = Σ 1 / (k + rank_i(d))`

where k is a smoothing constant (typically 60) and rank_i(d) is the result's position in list i.
Why RRF over score-based merging?
- No calibration needed: BM25 scores (e.g., 12.7) and cosine similarity scores (e.g., 0.83) aren't comparable. RRF uses ranks, not raw scores.
- Robust to missing items: If a result only appears in one list, it still contributes — absent items simply add nothing.
- Naturally surfaces consensus: Results ranked high in both lists score highest, so items that match by keyword AND meaning rise to the top.
In production systems like HINDSIGHT, RRF merges results from multiple channels (keyword, semantic, temporal, entity-based) and then a neural cross-encoder reranker refines precision on the top candidates before fitting them into the context window.
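The fusion step is small enough to implement directly. A minimal sketch of RRF over two ranked lists, with the customary k = 60; the document IDs are made up for illustration.

```python
def rrf_merge(ranked_lists, k=60):
    """Merge best-first ranked lists with Reciprocal Rank Fusion.
    Items absent from a list simply contribute nothing for that list."""
    scores = {}
    for results in ranked_lists:
        for rank, item in enumerate(results, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc_prod_db_07", "doc_logs", "doc_redis"]
semantic_results = ["doc_redis", "doc_prod_db_07", "doc_auth"]
# Items ranked high in BOTH lists rise to the top
merged = rrf_merge([bm25_results, semantic_results])
```

Note that no score calibration happens anywhere: only ranks enter the sum, which is why BM25's unbounded scores and cosine's [0, 1] scores can be merged at all.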
A minimal BM25 implementation using rank_bm25 in Python:
```python
# pip install rank_bm25
from rank_bm25 import BM25Okapi
from typing import List

# Your stored memories / conversation turns
memories: List[str] = [
    "User deployed PROD-DB-07 with 16GB RAM and PostgreSQL 15",
    "Discussed Redis caching strategy for the product catalog",
    "User prefers FastAPI over Flask for new microservices",
    "PROD-DB-07 experienced connection timeout at 2:30 AM",
    "Decided to use JWT tokens for API authentication",
]

# Tokenize and build the BM25 index
tokenized = [doc.split() for doc in memories]
bm25 = BM25Okapi(tokenized)

# Retrieve relevant memories for a query
query = "PROD-DB-07 configuration"
scores = bm25.get_scores(query.split())
top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
for i in top_indices:
    print(f"Score {scores[i]:.2f}: {memories[i]}")
```
When to use BM25 vs semantic search:
- BM25 excels at: Exact identifiers (server names, error codes, SKUs), proper nouns, configuration values, log entries
- Semantic search excels at: Conceptual queries ("how do we handle authentication?"), paraphrased content, thematic connections
- Best practice: Run both in parallel and merge results using Reciprocal Rank Fusion (RRF), which we covered in the RAG & Agents post
LangChain's EnsembleRetriever combines BM25 + semantic search with RRF fusion in 10 lines:
```python
# pip install langchain langchain-community langchain-openai rank_bm25 faiss-cpu
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Your documents (same docs go to both retrievers)
docs = [...]  # list of Document objects

# BM25 retriever (keyword matching)
bm25_retriever = BM25Retriever.from_documents(docs, k=3)

# FAISS retriever (semantic matching)
vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Combine with weighted RRF
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.4, 0.6],  # 40% keyword, 60% semantic
)

results = hybrid_retriever.invoke("PROD-DB-07 connection issues")
```
How weights work: The weights parameter controls how much each retriever contributes to the final RRF score. Higher weight = more influence. Start with [0.4, 0.6] (slight semantic bias), increase keyword weight for codebases with lots of identifiers.
Native vector store hybrid search: Some vector stores support hybrid search natively without needing EnsembleRetriever:
- Weaviate — `collection.query.hybrid(query, alpha=0.75)` blends BM25 + vector natively
- Qdrant — Prefetch API combines sparse + dense vectors with RRF fusion in a single request
- Pinecone — stores sparse (BM25) + dense vectors in the same index for hybrid queries
If your vector store supports native hybrid search, use it — it's faster than running two separate retrievers. If not, EnsembleRetriever works with any combination of LangChain retrievers.
Worked scenario — a support bot that must search both product documentation and past conversation logs: Agentic RAG with episodic memory is the right fit. Static RAG only searches documents — it can't find the past session where the user was told to "reinstall the driver," and it can't handle cross-session context at all. Search-o1's mid-reasoning retrieval is overkill for support queries where the needed information is straightforward to identify. With agentic RAG plus episodic memory, the agent searches documentation when it needs product info and troubleshooting steps AND searches past session logs when the user references a previous interaction ("what did we tell this customer before?").
Worked scenario — a product-search assistant that must handle both exact SKUs and conceptual queries: hybrid search handles both. BM25 instantly matches the exact SKU "WH-2847-BLK" via keyword matching — the string has no semantic meaning, it's just an identifier, so semantic search alone might miss it. Semantic search understands that "wireless headphones under $100" is about Bluetooth audio devices in a price range — a conceptual query BM25 would fail at, since it can't tell that "Bluetooth earbuds" are related. Combining the two with Reciprocal Rank Fusion covers both query types; neither approach alone does.
Two More Retrieval Techniques Worth Knowing
Before we move on to failure patterns, there are two more memory retrieval approaches that solve common production problems. Both have been around longer than "agentic RAG" became a buzzword, and both remain surprisingly useful.
Semantic Experience Memory — Remembering Past Sessions
Here's a frustration every user has felt: you had a detailed conversation with an AI assistant last week about your project architecture. Today you come back, and it starts from scratch. Semantic experience memory solves this by reserving part of the context window for search results from past interactions.
The mechanism: with each user input, the text is embedded and used as a query against all previous interactions stored in a memory vector store. Part of the context window is reserved for the best matches. The rest of the space goes to the system message, latest input, and most recent turns. This means the agent can recall "you mentioned last Tuesday that you're allergic to shellfish" without the entire previous conversation being in context.
This is different from generic RAG (which retrieves from external documents) and different from conversation history (which only holds the current session). Semantic experience memory specifically searches the agent's own past interactions — making it a form of episodic memory that works across sessions.
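A sketch of the context-budgeting idea, with `embed` and `similarity` injected so the example stays self-contained, and a rough 4-characters-per-token estimate; the budget split is illustrative, not a standard.

```python
def build_context(system_msg, past_memories, recent_turns, user_input,
                  embed, similarity, memory_budget_tokens=500):
    """Reserve part of the context window for matches from past sessions;
    the rest goes to the system message and the current conversation."""
    est_tokens = lambda text: len(text) // 4          # rough token heuristic
    query_vec = embed(user_input)
    # Rank every stored past interaction against the new input
    ranked = sorted(past_memories,
                    key=lambda m: similarity(embed(m), query_vec),
                    reverse=True)
    recalled, used = [], 0
    for memory in ranked:                             # fill the reserved slot
        cost = est_tokens(memory)
        if used + cost > memory_budget_tokens:
            break
        recalled.append(memory)
        used += cost
    return [system_msg, *recalled, *recent_turns, user_input]
```

In production, `embed` would be a real embedding model and `past_memories` a vector store query, but the budgeting logic is the same: recalled memories compete for a fixed slice of the window rather than displacing the current conversation.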
Note-Taking — Reading Before Answering
Here's a technique borrowed from how humans handle dense material: before answering a question, first annotate the context. When you're reading a dense research paper and someone asks you a question about it, you don't just skim and answer. You take notes in the margins first, then use those notes to form your answer.
With the self-note approach, the model generates notes on multiple parts of the context before the question is presented. These annotations are then interleaved with the original context when attempting to answer. Research shows good results on multiple reasoning and evaluation tasks — the model essentially "reads the document twice" rather than trying to comprehend and answer simultaneously.
This is especially valuable for long context windows where the model might miss critical details buried in the middle. The note-taking pass forces the model to attend to the entire context, and the resulting notes create an additional "index" that the answer-generation pass can reference.
Note-taking is not a framework feature — it's a two-pass prompting pattern you implement yourself; the only dependency is whatever LLM client you already use:

```python
from openai import OpenAI

client = OpenAI()
long_context = "..."  # the long document the model should read
question = "..."      # the user's question about it

# Pass 1: Generate margin notes on the context
note_prompt = """Read the following context carefully.
For each important fact, relationship, or detail, write a brief
margin note summarizing it. Number your notes.

CONTEXT:
{long_context}

MARGIN NOTES:"""

notes = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": note_prompt.format(long_context=long_context)}],
).choices[0].message.content

# Pass 2: Answer the question with context + notes
answer_prompt = """Using the original context AND your notes,
answer the following question.

CONTEXT:
{long_context}

YOUR NOTES:
{notes}

QUESTION: {question}"""

answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": answer_prompt.format(long_context=long_context,
                                               notes=notes,
                                               question=question)}],
).choices[0].message.content
```
Why two passes? The first pass forces the model to attend to the entire context (not just the beginning and end). The notes become a compressed index that the second pass can reference, reducing the "lost in the middle" effect.
Cost trade-off: You're making two LLM calls instead of one. Use this when accuracy matters more than latency — legal documents, medical records, compliance reviews. For casual chat, it's overkill.
Related work: Meta AI's "Self-Notes" paper (2023) trains models to interleave notes during reading rather than in a separate pass. Google's "Scratchpad" paper (2021) was an early version of the same idea, which later evolved into chain-of-thought prompting. Neither has a production library — this is the practical approximation.
You now know how to store and retrieve memories — from basic trimming to Zettelkasten-inspired networks to mid-reasoning retrieval and note-taking. But even the best memory systems fail in predictable ways. Understanding these failures leads to a bigger insight about how to think about the entire information flow.
Three Fundamental Failure Patterns
Before we catalog everything that can go wrong, let's start with the three failure patterns that every production memory system eventually encounters. These are the ones that show up in your support tickets, your error logs, and your 2 AM Slack alerts.
Failure 1: Stale Memory
Your company updated its parental leave policy in January. A user asks your HR bot about it in March. The bot confidently cites the old policy from 2023 because that document's embedding is semantically identical to the new one, and nobody re-indexed.
Stale memory happens because embedding similarity doesn't consider recency. A 2019 document about "parental leave policy" and a 2024 document about the same topic look nearly identical in vector space. Without explicit recency signals, the retrieval system has no way to prefer the newer version.
How to detect it
Log retrieved chunks with their timestamps. If old documents appear in top-K results when newer versions exist, you have stale memory. Create a test: ask about something you know changed recently.
How to fix it
Add last_updated metadata to every chunk. Use hybrid scoring: final_score = 0.7 * semantic_similarity + 0.3 * recency_score. Re-index on a schedule. For critical documents, implement "supersedes" relationships so the old version is explicitly replaced.
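The hybrid score above is a one-liner once you define a recency function. A sketch assuming exponential decay with a 180-day half-life; the half-life is a tunable assumption, not a standard.

```python
from datetime import datetime, timedelta, timezone

def recency_score(last_updated, half_life_days=180.0):
    """Exponential decay: a chunk loses half its recency score per half-life."""
    age_days = (datetime.now(timezone.utc) - last_updated).days
    return 0.5 ** (age_days / half_life_days)

def hybrid_score(semantic_similarity, last_updated):
    # final_score = 0.7 * semantic_similarity + 0.3 * recency_score
    return 0.7 * semantic_similarity + 0.3 * recency_score(last_updated)

# A stale policy doc with slightly higher similarity loses to the fresh one
now = datetime.now(timezone.utc)
old_doc = hybrid_score(0.91, now - timedelta(days=450))
new_doc = hybrid_score(0.90, now - timedelta(days=30))
```

Tune the half-life to your corpus: HR policies might warrant 90 days, while reference documentation that rarely changes could use 2–3 years.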
Failure 2: Leaked Context
User A tells your bot about their secret project codenamed "Phoenix." User B asks "what projects exist?" User B's response mentions Phoenix. You've just leaked one user's data to another.
Leaked context happens because memory storage isn't namespaced. All users' data lives in the same vector space, and semantic search finds similar content regardless of who wrote it. The retrieval system matched User B's query to User A's data because they were semantically similar.
How to detect it
Create two test users. User A mentions "my secret project Phoenix." User B asks "what projects exist?" If B sees Phoenix, you have a leak. This is a mandatory pre-launch test.
How to fix it
Namespace all stored data by user_id. Filter by user ID as a metadata filter before similarity search (not post-filter). For enterprise, use separate collections per tenant. This is non-negotiable for production.
Failure 3: Polluted Retrieval
A developer asks your coding assistant "How do I deploy to production?" The retrieval system returns chunks about production deployments, production line manufacturing, product management meetings, and productive work habits. The model, now confused by four different meanings of "production," gives a vague, contradictory answer.
Polluted retrieval happens because embedding similarity is imprecise. "Deploy to production" is semantically close to "production line," "product management," and "productive meetings." With a high top-K (retrieving many chunks), noise overwhelms the signal.
How to detect it
Log retrieved chunks for 20 random queries. If more than 30% of retrieved chunks are irrelevant, you have polluted retrieval.
How to fix it
(1) Reduce top-K from 10 to 3–5. (2) Add reranking: retrieve 20, use a cross-encoder to pick best 3. (3) Improve chunking so each chunk covers one topic. (4) Add metadata filters (doc_type, section_header). Quality over quantity — 2 perfect chunks beat 10 mediocre ones.
Eight Deeper Failure Modes
The three patterns above are the most common, but the Letta Leaderboard — a benchmark designed specifically to test AI memory systems — identified a broader taxonomy of memory failures. These extend beyond the basics and reveal how memory systems break at scale.
4. Unnecessary Searches: The agent searches for information that's already in the context window. A user says "My name is Sarah, I work at Acme Corp." Two messages later, the agent queries the memory system for the user's name. This wastes latency, costs money, and sometimes retrieves conflicting information that overwrites what was already known.
5. Hierarchy Breakdown: Trivial information occupies prime memory while critical facts are buried. The system stores "user likes dark mode" in the same tier as "user is allergic to penicillin." When the context window gets tight, there's no mechanism to prioritize the allergy over the UI preference.
6. Missed Information: The right information is in the context but the model doesn't use it. This often happens with long contexts where important facts are buried in the middle of many paragraphs. The information was successfully retrieved and placed in the prompt, but the model's attention mechanism failed to weight it properly.
7. Silent Overwrites: New information replaces old information without versioning. Sarah's job title changes from "Marketing Manager" to "VP of Marketing." The memory system updates the fact — but now there's no record that she was ever a Marketing Manager. When someone asks "who managed marketing in Q1?" the system has no answer.
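One standard fix for silent overwrites is to version facts instead of replacing them: an update closes the old record's validity window rather than erasing it. A minimal sketch with illustrative field names.

```python
from datetime import date

class FactStore:
    """Append-only facts: an update closes the old version, never erases it."""
    def __init__(self):
        self.facts = []   # (subject, attribute, value, valid_from, valid_to)

    def set_fact(self, subject, attribute, value, as_of):
        for i, (s, a, v, start, end) in enumerate(self.facts):
            if s == subject and a == attribute and end is None:
                self.facts[i] = (s, a, v, start, as_of)   # close the old version
        self.facts.append((subject, attribute, value, as_of, None))

    def get_fact(self, subject, attribute, at=None):
        for s, a, v, start, end in self.facts:
            if s != subject or a != attribute:
                continue
            if at is None and end is None:
                return v                                  # current value
            if at is not None and start <= at and (end is None or at < end):
                return v                                  # value as of `at`
        return None
```

With this shape, "who managed marketing in Q1?" becomes a query with `at=` a Q1 date, and the old answer is still there.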
8. Isolated Silos: Related facts exist in the memory system but are never connected. The system knows "Server X went down Tuesday" and "Network switch Y was replaced Monday" but never links these events. A human would immediately see the correlation; the flat memory system treats them as unrelated facts.
9. Temporal Confusion: The system can't distinguish between "what happened first" and "what happened recently." It might tell you Sarah's current job title when you ask about her role three years ago, or confuse the order of events in an incident timeline. Without explicit temporal metadata, all facts exist in an eternal present.
10. Scale Collapse: The system works beautifully with 100 stored facts but degrades catastrophically at 10,000. Retrieval precision drops, latency spikes, and the signal-to-noise ratio becomes unmanageable. This is often invisible during development and testing, only revealing itself in production.
A monitoring checklist for production memory systems:
| Failure | Detection Method | Key Metric |
|---|---|---|
| Stale Memory | Log chunk timestamps, compare to latest version | % of retrievals using outdated docs |
| Leaked Context | Cross-user test: does User B see User A's data? | Cross-tenant leakage rate (must be 0%) |
| Polluted Retrieval | Manual review 20 random queries' retrieved chunks | % of irrelevant chunks in top-K |
| Unnecessary Searches | Track when search returns info already in context | Redundant search rate |
| Hierarchy Breakdown | Compare importance ratings of stored vs surfaced facts | Critical fact retrieval rate |
| Missed Information | Check if answer contradicts information in context | In-context miss rate |
| Silent Overwrites | Query for historical facts ("who was X in Q1?") | Historical accuracy on versioned facts |
| Isolated Silos | Ask cross-referencing questions ("what events overlap?") | Cross-reference success rate |
| Temporal Confusion | Ask about event ordering or "state at time T" | Temporal ordering accuracy |
| Scale Collapse | Benchmark at 10x, 100x your current data size | Precision@K at different scales |
Priority: Start with leaked context (security), stale memory (accuracy), and polluted retrieval (quality). These three account for the majority of user-facing issues.
The "Lost in the Middle" Problem
Here's something deeply counterintuitive: information placed in the middle of a long context is more likely to be ignored than information at the beginning or end.
The "Lost in the Middle" paper by Liu et al. demonstrated this empirically. They placed a target fact at different positions in a long context and measured how often the model used it correctly. The results followed a U-shaped curve: high accuracy when the fact was near the beginning, high accuracy near the end, and a significant drop in the middle.
This is the serial position effect — a phenomenon well-documented in human psychology. We remember the first things we encounter (primacy effect) and the last things (recency effect), but the middle blurs together. LLMs, despite being fundamentally different from human brains, exhibit the same pattern because of how their attention mechanisms work.
The practical implication is massive: where you place information in the context matters as much as what information you include. If you're building a RAG system, the most critical retrieved chunks should be placed at the very beginning or very end of the retrieved context — never buried in the middle.
The experiment by Liu et al. (2023) placed a relevant document among 9–19 irrelevant documents and varied its position. Key findings:
- Models performed best when the relevant document was the first or last in the sequence
- Performance dropped by up to 20% when the relevant document was in positions 5–15 out of 20
- The effect was consistent across multiple model families (GPT, Claude, Llama)
- Larger context windows made the problem worse, not better — more space for information to get lost in
Why it happens: Transformer attention mechanisms don't distribute attention uniformly. They tend to attend strongly to the beginning of the sequence (due to positional encoding patterns) and the end (due to recency in the attention window). Middle positions get less attention weight, making information there more likely to be overlooked.
Practical implications for memory systems:
- Place your system prompt first (it already is) and the user's latest message last (it already is)
- Put the most critical retrieved context immediately after the system prompt or immediately before the user message
- If you're including a conversation summary, place it at the beginning of the conversation turns, not in the middle
- For RAG, consider reversing the order of retrieved chunks so the most relevant one is last (just before the question)
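The last two bullets amount to a small reordering function. A sketch with two strategies, assuming chunks arrive ranked best-first from the retriever; the strategy names are made up for illustration.

```python
def order_for_context(chunks_best_first, strategy="best_last"):
    """Counter the U-shaped attention curve: keep strong chunks
    out of the middle of the prompt."""
    if strategy == "best_last":
        # Best chunk ends up last, adjacent to the user's question
        return list(reversed(chunks_best_first))
    if strategy == "edges":
        # Alternate top chunks between front and back; weak ones sink to the middle
        front, back = [], []
        for i, chunk in enumerate(chunks_best_first):
            (front if i % 2 == 0 else back).append(chunk)
        return front + list(reversed(back))
    raise ValueError(f"unknown strategy: {strategy}")
```

With `"edges"`, the two strongest chunks occupy the first and last positions and the weakest land in the middle, matching the U-shaped accuracy curve from the Liu et al. experiments.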
The RULER benchmark extended these findings by testing not just retrieval but reasoning over long contexts. Models that performed well on simple needle-in-a-haystack tests showed significant drops on multi-hop tracing, aggregation, and counting tasks — demonstrating that finding information and reasoning about it are very different challenges.
Context Rot — When More Is Less
Closely related to "lost in the middle" is a broader phenomenon: context rot. Research consistently shows that model performance degrades as context length increases — even when the context window isn't anywhere near its maximum capacity.
The logic seems simple: more context should mean more information, which should mean better answers. But the empirical data tells a different story.
As you add more tokens to the context, three things happen:
- Attention dilution: The model's attention mechanism has to spread across more tokens, reducing the weight given to any individual piece of information
- Noise accumulation: More context means more opportunities for irrelevant information to interfere with the relevant signal
- Reasoning degradation: Multi-step reasoning becomes harder when the model has to track relationships across a longer sequence
This is why "just stuff everything into a million-token context" is a recipe for failure. The Needle In A Haystack (NIAH) benchmark — the most common test for long-context models — merely tests retrieval: can the model find a specific fact hidden in long text? Most modern models ace this test. But retrieval is the easy part. The RULER benchmark demonstrated that models performing well on NIAH showed significant performance drops when tested on multi-hop tracing, aggregation, and counting — tasks that require actual reasoning over the context, not just finding a needle.
The conclusion is clear: context quality matters more than context quantity. This is the foundation of the paradigm shift we'll explore in Part 4.
These failures aren't random. They all stem from the same root cause: treating the context window as a dumping ground rather than as a carefully engineered input. There's a better way to think about it — and it's called context engineering.
The LLM as a Function
Let's step back and think about what an LLM really is. Strip away the chat interfaces, the API wrappers, the prompt templates. At its core, an LLM is a function: it takes a sequence of input tokens, processes them through billions of parameters, and outputs a sequence of tokens.
That's it. The entire "intelligence" of the system comes from two things: the quality of the model (its parameters) and the quality of the input (the context). You can improve the model through training or fine-tuning — expensive, slow, and requires expertise. Or you can improve the input — the context you send to the model.
Context engineering is the art and science of optimizing those input tokens so they produce the best possible output for a given task. It's not about filling the context window. It's about strategically choosing and placing information so the model can do its best work.
Think of it like writing a brief for a brilliant but literal-minded consultant. This consultant will do exactly what you tell them, using exactly the information you provide.
Five Principles of Context Engineering
Based on research from Anthropic, the RULER benchmark findings, and production experience with large-scale AI systems, five principles emerge for engineering effective context.
Principle 1: Relevance — Every Token Earns Its Place
The most common mistake in context engineering is including too much. Every token you add to the context has a cost: financial (you pay per token), computational (the model processes every token), and attentional (each token dilutes the attention available for other tokens).
Relevance means asking a simple question for every piece of information: "Does this serve the current task?" If the user is asking about database configuration, their chat preferences from last month probably don't need to be in the context. MemoryBank's forgetting mechanism is one implementation of this principle — it naturally surfaces frequently-used (relevant) information and lets rarely-used information fade.
Principle 2: Diversity — Avoid Redundancy
Retrieving five chunks that all say "use connection pooling for database performance" wastes four context slots that could contain new information. Maximal Marginal Relevance (MMR) is the key technique here: it balances relevance to the query against redundancy with already-selected chunks.
Maximal Marginal Relevance selects documents that are both relevant to the query AND different from documents already selected:
Two competing forces:

- **Relevance:** `λ · Sim(doc, query)` — how similar is this document to the query?
- **Redundancy penalty:** `(1 − λ) · max[Sim(doc, selected)]` — how similar is it to documents we've already chosen?
The λ parameter controls the balance:
- λ = 1.0: Pure relevance, no diversity consideration (standard top-K)
- λ = 0.5: Equal weight to relevance and diversity (typical starting point)
- λ = 0.0: Pure diversity, ignoring relevance (rarely useful)
Practical example: Query is "database connection pooling." Top-K returns five chunks about connection pooling basics. With MMR at λ=0.5, you might get: one chunk about pooling basics, one about pool sizing, one about connection lifecycle, one about monitoring, and one about failure handling. Same topic, five different angles — much more useful than five variations of "pooling improves performance."
The iterative process, step by step:

1. Select the most relevant document first (Doc 1).
2. Score each remaining candidate as `λ · relevance − (1−λ) · similarity_to_Doc1`. Doc 2 drops (0.97 similarity penalty), Doc 3 wins.
3. Repeat, penalizing similarity to everything selected so far, until you have enough documents.

The result: from 5 near-identical chunks, you get 3 chunks that are each relevant but cover different angles of the topic. This is far more useful than 5 repetitions of the same information.
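The selection loop described above can be sketched in a few lines of Python. This is an illustrative implementation only (in practice you'd use a library), and it assumes the similarity scores are precomputed:

```python
# Minimal MMR sketch (illustrative; not a production implementation).
# sim_query[i] = Sim(doc_i, query); sim_docs[i][j] = Sim(doc_i, doc_j).
def mmr_select(sim_query, sim_docs, k, lam=0.5):
    selected = []
    candidates = list(range(len(sim_query)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy = similarity to the closest already-selected doc
            redundancy = max((sim_docs[i][j] for j in selected), default=0.0)
            return lam * sim_query[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` this reduces to plain top-K; lowering `lam` trades a little relevance for coverage.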
MMR is built into LangChain's vector store abstraction. You don't implement the algorithm — you just set `search_type="mmr"`:

```python
# pip install langchain-chroma langchain-openai
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Build your vector store (works with FAISS, Pinecone, Qdrant too)
vectorstore = Chroma.from_texts(
    texts=your_documents,
    embedding=OpenAIEmbeddings(),
)

# One line: switch from similarity to MMR
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,              # return 5 documents
        "fetch_k": 50,       # consider 50 candidates for diversity
        "lambda_mult": 0.3,  # 0.0 = max diversity, 1.0 = max relevance
    },
)

docs = retriever.invoke("your query here")
```
How it works under the hood: LangChain fetches fetch_k candidates via similarity search from the vector store, then applies the MMR algorithm client-side to select the final k diverse results. This means MMR works with any vector store LangChain supports — Chroma, FAISS, Pinecone, Qdrant, Weaviate.
Key tuning parameter: lambda_mult controls the relevance-diversity balance. Start at 0.3 (biased toward diversity), increase toward 0.7 if results feel too scattered. Set fetch_k to at least 5–10x your k value so MMR has enough candidates to choose from.
Principle 3: Ordering — Position Is Power
We covered the "lost in the middle" phenomenon in Part 3. The practical takeaway: put the most critical information at the beginning and end of the context. System prompts naturally go first. The user's latest message naturally goes last. Everything in between should be ordered by importance, with the most critical items nearest to these anchors.
For RAG systems, this means the highest-ranked retrieved chunks should be placed either immediately after the system prompt or immediately before the user's message — not sandwiched in the middle of a long conversation history.
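One way to sketch this placement rule in code — a minimal, hypothetical assembly function, not a library API — is to split the ranked chunks between the two anchors and leave the conversation history in the middle:

```python
# Hypothetical context assembly honoring the "lost in the middle" fix:
# top-ranked chunks go near the anchors (system prompt and latest user
# message); the conversation history sits in between.
def assemble_context(system_prompt, ranked_chunks, history, user_message):
    """ranked_chunks is best-first. The top half goes right after the
    system prompt; the rest goes just before the user's message."""
    mid = len(ranked_chunks) // 2
    head, tail = ranked_chunks[:mid], ranked_chunks[mid:]
    messages = [{"role": "system", "content": system_prompt}]
    messages += [{"role": "system", "content": f"Context:\n{c}"} for c in head]
    messages += history
    messages += [{"role": "system", "content": f"Context:\n{c}"} for c in tail]
    messages.append({"role": "user", "content": user_message})
    return messages
```

The exact split (half and half, all-after-system, all-before-user) is a tuning choice; the invariant is that nothing critical lands deep in the middle.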
Principle 4: Freshness — Recency-Weighted Scoring
Stale memory (Failure 1) happens because retrieval systems treat all information as equally timely. The fix: blend recency into your scoring function.
A simple approach: final_score = α · semantic_similarity + (1 - α) · recency_score, where recency decays exponentially from the document's last-updated timestamp. For most applications, α = 0.7 (favoring semantic relevance) is a good starting point, with adjustments based on how frequently your domain's information changes.
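That blended score is straightforward to implement. The sketch below uses a half-life decay (a common choice; the 30-day default is an assumption you'd tune to your domain):

```python
import math
import time

def recency_weighted_score(similarity, last_updated_ts, alpha=0.7,
                           half_life_days=30.0, now=None):
    """final_score = alpha * similarity + (1 - alpha) * recency,
    where recency halves every half_life_days since last update."""
    now = time.time() if now is None else now
    age_days = max(0.0, (now - last_updated_ts) / 86400.0)
    recency = 0.5 ** (age_days / half_life_days)  # exponential decay
    return alpha * similarity + (1 - alpha) * recency
```

A document updated today with similarity 0.8 scores 0.86; the same document left stale for two half-lives drops to about 0.64, letting a fresher, slightly-less-similar document win.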
Principle 5: Specification — Context as Code
Here's the paradigm shift: treat your context like you treat your code. The context window isn't just an input — it's a specification for the model's behavior. Just as you wouldn't deploy code without testing, you shouldn't deploy a context configuration without testing.
This means: version your system prompts. Track what context you're sending. Monitor retrieval quality. Set up automated tests that verify the right information surfaces for known queries. Create regression tests for the failure patterns from Part 3. Context engineering is an engineering discipline, not a creative exercise.
How strange it is that we tend to throw away the input to our function (the LLM) and only keep track of the output. Think about that: you'd never deploy a service without logging the requests it receives. Yet most AI systems discard the exact context that produced each response. Tracking inputs isn't just for reproducibility — it's how you debug why the agent chose certain tools, produced certain outputs, and made certain mistakes. The context is the specification.
This insight transforms how you think about AI operations. The context you send isn't just an input — it's a persistent artifact that explains the agent's decisions. Sophisticated agents already use this pattern: they maintain a PLAN.md file that persists across context cropping. When the orchestrator trims old messages, the plan survives, providing continuity and an audit trail of what the agent intended to do and why.
What to Track: The Context Taxonomy
Before you can optimize context, you need to know what types of context exist. Most developers only think about the conversation history. In reality, the full context landscape includes four categories:
**Agent behavior**
- Tool usage & outputs
- Sub-agent interactions
- Internal reasoning steps
- Successes & failures

**User behavior**
- Explicit intent & goals
- Feedback (approvals, edits)
- Preference signals
- Conversation patterns

**Knowledge sources**
- DB snapshots (for auditing)
- External documents (RAG)
- Artifacts: PLAN.md, REQS
- API responses

**System-level**
- LLM hyperparameters
- Available tools config
- Guardrails & policies
- Model version
In practice, not everything is useful and many other things should be tracked. The key is deciding beforehand what to capture. This taxonomy gives you a starting framework — especially useful when debugging why an agent took an unexpected action. Was the issue in what tools it used (agent behavior), what the user asked for (user behavior), what knowledge was retrieved (knowledge sources), or how the system was configured (system-level)?
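A per-request log record mirroring these four categories might look like the sketch below. The field names and JSONL format are assumptions for illustration, not a standard schema:

```python
# Hypothetical per-request context log covering the four taxonomy
# categories; field names and the JSONL format are illustrative choices.
import json
import time

def log_context_record(agent_events, user_turns, knowledge, system_cfg,
                       path="context_log.jsonl"):
    record = {
        "ts": time.time(),
        "agent_behavior": agent_events,    # tool calls, reasoning, outcomes
        "user_behavior": user_turns,       # intent, feedback, preferences
        "knowledge_sources": knowledge,    # RAG docs, artifacts, API responses
        "system_level": system_cfg,        # model version, tools, guardrails
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Append-only JSONL keeps each request's context reproducible and greppable when you need to answer "what exactly did the model see?"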
Multi-Agent Context
In systems with multiple AI agents working together — an orchestrator routing tasks to specialists — context engineering becomes even more critical. Each agent needs to see different information, and sharing everything between all agents creates unnecessary cost and confusion.
The orchestrator sees the big picture: the user's query, the task plan, and summaries from each specialist. But each specialist sees only what it needs. The research agent gets the search query and retrieved documents. The coding agent gets the code context and requirements. The summary agent gets the draft and user preferences. No agent is overwhelmed with irrelevant context from other agents' work.
This is context engineering at the architectural level — designing not just what goes into a single context window, but how context flows across an entire multi-agent system.
Three patterns for managing context across agents:
1. Shared blackboard: All agents read from and write to a shared context store. Simple but noisy — every agent sees everything, including irrelevant data from other agents. Best for small, tightly-coupled agent teams.
2. Orchestrator-mediated: The orchestrator selectively passes context to each specialist. More control but the orchestrator becomes a bottleneck. Best for complex workflows where different agents need very different contexts.
3. Hierarchical: Context is organized in layers — global context (visible to all), team context (shared within a specialist team), and local context (agent-specific). Most flexible but most complex to implement. Best for large-scale systems with many agents.
Context cropping for delegation: When the orchestrator delegates a task to a specialist, it shouldn't just forward the entire conversation. Instead, it extracts only the relevant portions: the specific sub-task, relevant constraints, and any context the specialist needs. This is "context cropping" — cutting out the noise before passing context downstream.
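A minimal cropping helper might look like the sketch below — the function name, the artifact whitelist, and the message shape are all illustrative assumptions:

```python
# Hypothetical context-cropping helper for delegation: the specialist
# receives only the sub-task, constraints, and whitelisted artifacts.
def crop_for_specialist(sub_task, constraints, artifacts,
                        allowed=("PLAN.md",)):
    # Drop everything except explicitly whitelisted artifacts
    kept = {name: body for name, body in artifacts.items() if name in allowed}
    system = "You are a specialist. Follow the constraints exactly."
    user = "\n\n".join(
        [f"Task: {sub_task}", f"Constraints: {constraints}"]
        + [f"--- {name} ---\n{body}" for name, body in kept.items()])
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]
```

The whitelist is the key design choice: the specialist can only be confused by what you pass it, so default to passing nothing and opt artifacts in deliberately.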
Practical example — a deep research agent:
- Plan creation: User asks a research question. Orchestrator creates a multi-step plan.
- Search delegation: Orchestrator sends only the search query to the research agent (not the full conversation).
- Summarization delegation: Research results are sent to a summary agent with instructions to extract key findings (not the raw search results AND the conversation).
- Context cropping: The orchestrator takes the summary, drops intermediate reasoning, and presents the user with a clean answer.
At each step, context is deliberately narrowed. The research agent never sees the conversation history. The summary agent never sees the search queries. The user never sees the intermediate context management. Each handoff is a deliberate act of context engineering.
In practice, system prompts use XML tags to clearly delineate different types of context. This helps the model distinguish between instructions, available tools, temporal context, and the actual query:
```python
SYSTEM_PROMPT = """
You are a helpful research assistant.

<INSTRUCTIONS>
* Create a plan for the user's query
* If you lack information, use tools to search
* Prioritize clarity and use the current date for SOTA
</INSTRUCTIONS>

<TOOLS>
{information_about_tools}
</TOOLS>

<DATE>
{current_date}
</DATE>

<USER_QUERY>
{user_query}
</USER_QUERY>
"""
```
The messages array evolves at each step. Here's what a deep research agent's context looks like as it progresses:
```python
# Step 1: Create a plan
messages = [{"role": "system", "content": SYSTEM_PROMPT}]

# Step 2: User provides feedback on the plan
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "assistant", "content": "Here is my plan..."},
    {"role": "user", "content": "Search for surveys instead."},
]

# Step 3: CROP — only keep the updated plan
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "assistant", "content": PLAN_MD},  # Persistent artifact
]

# Step 4: Tool call for search
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "assistant", "content": PLAN_MD},
    {"role": "assistant", "content": "<think>Search on ArXiv</think>",
     "tool_call": {"name": "search_arxiv", ...}},
    {"role": "tool", "content": ABSTRACTS},
]

# Step 5: Sub-agent gets MINIMAL context
messages_summary_agent = [
    {"role": "system", "content": "Summarize these papers."},
    {"role": "user", "content": PAPERS},
]

# Step 6: Final synthesis — CROPPED again
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "assistant", "content": UPDATED_PLAN},
    {"role": "assistant", "content": "Summaries: ..."},
]
```
Key patterns to notice:
- PLAN.md persists — The plan is a persistent artifact that survives context cropping. Even when old messages are trimmed, the plan stays, providing continuity.
- Sub-agents get minimal context — The summary agent only sees papers, not the full conversation or search queries.
- Context shrinks at handoffs — Each step deliberately narrows what the next model sees. This is the core of context engineering.
Before deployment, verify each principle:
Relevance:
- Are you logging what context is sent with each request?
- Can you identify and remove context that doesn't contribute to task performance?
- Do you have a mechanism to avoid retrieving information already in the conversation?
Diversity:
- Are you using MMR or similar de-duplication in your retrieval?
- Have you tested for cases where multiple similar chunks are returned?
- Is your retrieval top-K appropriately sized (3–5 is often better than 10)?
Ordering:
- Is critical context placed at the beginning or end of the prompt, not the middle?
- Are retrieved chunks ordered by relevance, with the best near the anchors?
- Is the conversation summary (if used) placed before the conversation turns?
Freshness:
- Do your stored documents have `last_updated` timestamps?
- Is your retrieval scoring recency-aware?
- Do you have a re-indexing schedule for frequently-changing content?
Specification:
- Is your system prompt versioned (in git or a config store)?
- Do you have automated tests that verify correct retrieval for known queries?
- Do you monitor retrieval quality metrics (precision@K, latency, staleness)?
- Can you reproduce the exact context that was sent for any given request?
Different memory strategies have very different cost and latency profiles. Use this as a rough guide when choosing your approach:
| Strategy | Latency | Token Cost | Best For |
|---|---|---|---|
| Full conversation history | Scales with conversation | High — pays for every past message | Short sessions (< 20 turns) |
| Summary + recent turns | Low (fixed overhead) | Moderate — summary generation cost | Long sessions, chat support |
| Vector RAG (semantic) | +50–200ms per retrieval | Low — only top-K chunks injected | Large knowledge bases |
| Hybrid (BM25 + semantic) | +100–300ms (parallel search) | Low — same top-K after RRF | Mixed query types (IDs + concepts) |
| Graph-based (GraphRAG) | +200–500ms (traversal) | Moderate — higher setup cost | Relationship-heavy domains |
| Million-token context | High latency, long generation | Very high — processing all tokens | One-off analysis, not production |
The takeaway: Even with million-token context windows available, RAG and hybrid approaches remain the most cost-effective for production systems. Processing millions of tokens in one shot demands substantial compute and can introduce latency that negates the simplicity gains of removing external retrieval. Context quality beats context quantity — both for accuracy and for your cloud bill.
Context rot is the primary culprit: the pattern of "works at 10 messages, degrades at 30" is its classic signature. As the conversation grows past 30 messages, the context window dilutes the attention the model gives to any individual piece of information. The system prompt, early important decisions, and key context get pushed into the "lost in the middle" zone, where they're overlooked. While relevance and freshness can contribute, the fix here is: summarization (compress old messages), ordering (keep critical facts near the anchors), and possibly relevance filtering (drop low-value conversational turns).
Context cropping is the key principle. The summary agent doesn't need all 15 chunks — many will be redundant or only marginally relevant, and passing the full reasoning trace adds even more noise, cost, and risk of vague output. Apply MMR or reranking to select the top 5 most informative, non-overlapping chunks. The summary agent produces better output with 5 focused chunks than with 15 noisy ones. Less context, better results — this is the core insight of context engineering.
Context as Specification
Let's close with the biggest mental shift in context engineering: the context window IS the product.
When you're building a traditional application, you write code that processes inputs and produces outputs. The code is the product. When you're building an AI application, the LLM processes context and produces responses. The context — how you assemble it, what you include, how you order it, how you test it — is the product.
This is why the techniques we've covered in this post matter so much. Memory management (Part 1) determines what historical information is available. Memory architecture (Part 2) determines how efficiently information can be stored and retrieved. Failure awareness (Part 3) tells you what can go wrong. And context engineering (this section) provides the framework for putting it all together.
The systems that get context engineering right — that treat their context window as a first-class engineering artifact — consistently outperform systems with bigger models, more parameters, and larger context windows. The context is the specification. Engineer it accordingly.
This post covered memory within a single context window and across sessions. But what happens when flat memory can't represent the relationships between facts? When you need to know not just what happened but when, why, and how it connects to other events? That's the domain of graph memory systems — the subject of the next post in this series.
وَاللَّهُ أَعْلَمُ
And Allah knows best
وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ
May Allah's peace and blessings be upon our master Muhammad and his family
Practice Mode
Test your understanding with real-world AI memory scenarios.
Your medical assistant chatbot handles patient intake conversations. On message 3, a patient mentions they're allergic to penicillin. By message 15, the conversation has filled up and your system is using trimming. On message 18, the patient asks about treatment for an infection.
The chatbot recommends amoxicillin (a penicillin-type antibiotic) without mentioning the allergy.
Cheat Sheet
The essential reference for AI memory systems.
5 Memory Types
Working: Context window (current session)
Episodic: Past sessions & outcomes
Semantic: External knowledge via RAG
Procedural: System prompts & rules
Parametric: Baked into model weights
Trimming vs Summarization
Trimming: Fast, cheap, drops oldest. Use for <20 messages, casual chat.
Summarization: Preserves intent, 300-800ms extra latency. Use for 30+ messages.
Pin + Trim: Pin safety-critical info, trim the rest. Use for medical/legal.
3 Retrieval Paradigms
Static RAG: Fixed pipeline, always retrieves
Agentic RAG: Agent decides what/when/where to search
RAG During Reasoning: Search-o1 retrieves mid-thought with Reason-in-Documents
Forgetting Curves
MemoryBank: Ebbinghaus-inspired. Used memories persist, unused ones fade.
Key insight: ~10% of information deserves permanent storage.
Spaced repetition: Frequent retrieval = stronger memory.
10 Failure Modes
Basic: Stale memory, leaked context, polluted retrieval
Intermediate: Unnecessary searches, hierarchy breakdown, missed info
Advanced: Silent overwrites, isolated silos, temporal confusion
Catastrophic: Scale collapse
5 Principles
Relevance: Every token earns its place
Diversity: MMR — avoid redundancy
Ordering: Critical info at start & end
Freshness: Recency-weighted scoring
Specification: Treat context like code
Context Ordering
Lost in the Middle: Models miss info in positions 5-15 of 20
U-shaped curve: High attention at start and end, low in middle
Fix: System prompt first, critical context near anchors, user message last
Which Strategy When
<20 messages: Full history (send everything)
20-50 messages: Trimming with pinned critical info
50+ messages: Summarization + trimming
Cross-session: Episodic memory + semantic experience
External docs: Agentic RAG + hybrid search
- Graph Memory Systems — the next post in this series: when flat memory breaks, knowledge graphs, temporal awareness, and production systems
- RAG & Agents — the complete guide to retrieval pipelines, embeddings, chunking, and agent architectures (prerequisite for this post)
- From Prompt to GPU — tokenization, GPU memory, inference phases, and serving infrastructure
- Fine-Tuning — when retrieval and memory aren't enough: LoRA, QLoRA, RAFT