RAG & Agents

How LLMs Learn to Find Answers and Take Action

From naive retrieval to production-grade RAG pipelines, and from passive Q&A to autonomous agents — the complete guide to making LLMs useful in the real world.

Bahgat Ahmed
October 2025
The Journey
  • Finding Information: keyword, semantic & hybrid search
  • The RAG Pipeline: rewrite, retrieve, rerank, generate
  • Optimizing the Machine: chunking, embeddings & vector DBs
  • When LLMs Start Acting: agents, tools, memory & safety
What's Inside (35 min read)
  1. Finding the Right Information
  2. The Complete RAG Pipeline
  3. Optimizing the Machine
  4. When LLMs Start Acting
  Plus: Practice Mode and Cheat Sheet

بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

In the name of Allah, the Most Gracious, the Most Merciful

A medical Q&A bot goes live at a hospital. A doctor asks about drug interactions for a patient on warfarin. The system retrieves 10 text chunks from its knowledge base — 7 are relevant, but 3 slipped in from the veterinary section. The LLM, unable to tell the difference, confidently cites a dosage recommendation for a horse dewormer. The doctor catches it. This time.

Meanwhile, a customer service agent is told: "Refund all my orders." It processes $12,000 in refunds across 47 orders in 8 seconds. No confirmation dialog, no human review, no "are you sure?" — just a tool with the authority to act and no guardrails to stop it.

These aren't hypothetical horror stories. They're the two fundamental challenges of modern AI systems: how do you find the right information, and how do you let AI take action safely?

This post is the complete guide to both — from basic keyword search to production-grade RAG pipelines to autonomous agents with guardrails. Everything explained visually, with no prerequisites beyond curiosity.

Quick Summary
  • Part 1: How search actually works — from keywords to meaning-based retrieval to hybrid approaches that combine both
  • Part 2: The full RAG pipeline — query rewriting, multi-hop retrieval, reranking, and how to evaluate whether it's actually working
  • Part 3: Optimizing every layer — chunking strategies, embedding compression, vector databases, and when RAG isn't the answer
  • Part 4: When LLMs start acting — agents, tools, planning, memory, and the safety systems that keep them from going rogue
This post is for you if...
  • You've used ChatGPT or Claude but wondered how they could access your data — not just what they were trained on
  • You've heard "RAG," "vector database," "embeddings," or "agents" and want to understand what they actually mean
  • You're building (or planning to build) AI applications and need to know how retrieval, generation, and autonomous action fit together
  • You read How LLMs Actually Work and want the next chapter of the story
Part 1
Finding the Right Information

The Brilliant but Amnesiac Research Assistant

Imagine you hire a brilliant research assistant. They've read millions of books, can write beautifully, and can synthesize complex ideas. There's just one problem: they have no memory of anything that happened after their training. Ask them about your company's latest policy change? Blank stare. Ask about yesterday's meeting notes? Nothing. Ask about any document you haven't explicitly shown them? They'll either admit they don't know — or worse, confidently make something up.

This is the fundamental limitation of every LLM. GPT-4, Claude, Gemini — they're all frozen in time. Their knowledge is locked to whatever they saw during training. They can't access your database, your documents, or today's news unless you give it to them.

The solution is straightforward in concept: before the LLM generates an answer, search for relevant information and paste it into the prompt. This is called Retrieval-Augmented Generation, or RAG — and it's the most widely used pattern for building AI applications that work with real data.

But the critical question becomes: how do you search? Because the quality of what you retrieve determines the quality of what the LLM generates. Garbage in, garbage out — except now the garbage sounds eloquent and confident.

Two Ways to Search: Keywords vs. Meaning

There are fundamentally two approaches to finding relevant information, and they work in completely different ways. To understand them, think about two kinds of librarians.

The first librarian is a precise catalog clerk. You say "transformer architecture" and they find every document that literally contains those exact words. Fast, reliable, cheap — but if the document says "the attention-based model" instead, they miss it entirely. They match words, not meaning.

The second librarian understands what you're really asking for. You say "transformer architecture" and they also bring you documents about "self-attention mechanisms" and "the model from the 'Attention Is All You Need' paper" — even if those documents never use the word "transformer." They match meaning. But they're slower, more expensive, and sometimes bring back documents about electrical transformers because the meaning was close enough to confuse them.

In the real world, these two approaches are called term-based retrieval and embedding-based retrieval. Let's look at each one.

[Figure: Two Ways to Search. The query "transformer architecture" runs through both systems. Keyword search (term-based retrieval) matches exact words, hitting only "The transformer architecture uses..." Semantic search (embedding-based retrieval) matches meaning, also hitting "The attention-based model from 2017..." and "Self-attention revolutionized NLP..."]
Keyword search finds documents containing the exact words you typed. Semantic search finds documents with the same meaning — even when different words are used. Each has strengths: keywords are fast and precise for exact terms; semantics understand intent.

Approach 1: Term-Based Retrieval (The Catalog Clerk)

The simplest way to search is by keywords. Given a query, find every document that contains those words. This is exactly how search worked for decades — and it still powers a huge portion of search systems today.

But there's a problem: if you search for "AI engineering," you'll match every document containing those words. Some might be highly relevant (a textbook chapter). Some might mention it once in passing (an unrelated blog post). You need a way to rank which matches are actually important.

The insight comes from two observations:

Observation 1: If a word appears many times in a document, that document is probably about that topic. A document mentioning "transformer" 47 times is likely more relevant to a "transformer" query than one mentioning it once. This count is called term frequency (TF).

Observation 2: If a word appears in almost every document, it's not very informative. Words like "the," "and," "is" appear everywhere — they tell you nothing about what a document is actually about. But a rare word like "transformer" or "warfarin"? If a document contains it, that's meaningful. This rarity measure is called inverse document frequency (IDF).

Combine these two ideas and you get TF-IDF: the most influential scoring formula in the history of search. It ranks documents by multiplying term frequency (how often the word appears in this document) by inverse document frequency (how rare the word is across all documents). High TF-IDF = the word appears a lot here, but not everywhere.

How TF-IDF actually scores a document

Let's walk through a concrete example. Suppose you have 10,000 documents and you search for "transformer architecture."

For the word "transformer":

  • It appears 15 times in Document A → TF = 15
  • 200 out of 10,000 documents contain it → IDF = log(10,000 / 200) = 3.91
  • TF-IDF for "transformer" in Doc A = 15 × 3.91 = 58.7

For the word "architecture":

  • It appears 8 times in Document A → TF = 8
  • 1,500 out of 10,000 documents contain it → IDF = log(10,000 / 1,500) = 1.90
  • TF-IDF for "architecture" in Doc A = 8 × 1.90 = 15.2

Document A's total score = 58.7 + 15.2 = 73.9

Notice how "transformer" contributes much more to the score than "architecture" — because it's a rarer, more informative word. This is exactly the behavior we want.
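The arithmetic above takes only a few lines of Python to check. A toy sketch, using the counts and document totals from the example:

```python
import math

def tf_idf(tf, docs_with_term, total_docs):
    """Score = term frequency x log(N / number of docs containing the term)."""
    return tf * math.log(total_docs / docs_with_term)

# "transformer": appears 15 times in Doc A, in 200 of 10,000 docs
score_transformer = tf_idf(15, 200, 10_000)
# "architecture": appears 8 times in Doc A, in 1,500 of 10,000 docs
score_architecture = tf_idf(8, 1_500, 10_000)
total = score_transformer + score_architecture

print(round(score_transformer, 1))   # 58.7
print(round(score_architecture, 1))  # 15.2
print(round(total, 1))               # 73.9
```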

How does the data get organized for fast retrieval? Through a structure called an inverted index. Instead of storing "Document → list of words," it stores "Word → list of documents." Think of it as the index at the back of a textbook: you look up a term and it tells you which pages mention it.

Term          | Doc Count | Documents (doc_id, frequency)
transformer   |       200 | (42, 15), (108, 9), (2001, 23), ...
architecture  |     1,500 | (42, 8), (55, 12), (789, 3), ...
the           |     9,950 | (1, 47), (2, 31), (3, 52), ...
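In code, an inverted index is just a mapping from term to per-document frequencies. A minimal sketch (the document IDs and text are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: frequency} for fast term lookups."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

docs = {
    42: "the transformer architecture uses the transformer block",
    55: "software architecture patterns",
}
index = build_inverted_index(docs)
print(index["transformer"])   # {42: 2}
print(index["architecture"])  # {42: 1, 55: 1}
```

A real engine would also store term positions for phrase queries and strip stop words, but the lookup structure is the same.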

A production evolution of TF-IDF is BM25 (Best Matching 25) — it adds document length normalization so that longer documents don't automatically score higher. BM25 and its variants are still the backbone of systems like Elasticsearch (built on Lucene's inverted index) and remain formidable baselines that surprisingly often beat more sophisticated approaches.

One subtlety worth noting: before any scoring happens, the query must be broken into individual terms — a process called tokenization. The simplest method is splitting by words, but this can destroy multi-word meanings ("hot dog" becomes "hot" and "dog," neither retaining the original meaning). Production systems handle this by treating common n-grams as terms, converting to lowercase, removing punctuation, and eliminating stop words like "the" and "is." Classical NLP packages like NLTK, spaCy, and CoreNLP provide these tokenization features.

The weakness of keyword search is that it can't understand meaning. Search for "how to fix a slow application" and you'll miss a document titled "Performance optimization techniques" — even though it's exactly what you need. The words don't overlap, so the catalog clerk walks right past it.

Approach 2: Embedding-Based Retrieval (The Librarian Who Understands)

If you read the previous post, you already know the key idea: words can be converted into vectors (lists of numbers) where similar meanings end up close together in mathematical space. "King" and "queen" are close, "cat" and "kitten" are close, "fast" and "rapid" are close.

Embedding-based retrieval uses this same idea for search. Here's how it works:

Step 1 — Index your documents: Take every document (or chunk of a document) and convert it into a vector using an embedding model. Store all these vectors in a vector database.

Step 2 — Search: When a query comes in, convert it into a vector using the same embedding model. Then find the document vectors that are closest to the query vector. Those are your search results.

The beauty is that "how to fix a slow application" and "performance optimization techniques" will have similar vectors even though they share zero words — because the embedding model understands they mean the same thing.

But this approach has its own weakness: it can obscure exact keywords. If you search for a specific error code like EADDRNOTAVAIL (99) or a product name like fancy-printer-A300, the semantic meaning might not help. The embedding model might map it close to "network error" or "printer" — losing the specificity you need.

How vector databases find nearest neighbors fast

The naive approach to finding the closest vectors is simple: compare the query vector against every single vector in the database. This guarantees perfect results but is painfully slow. For a database of 100 million vectors, each with 1,024 dimensions? That's 100 billion floating-point operations per query.
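Exact search is just a scored scan of every stored vector. A NumPy sketch (the corpus size and dimensions here are arbitrary, chosen small enough to run instantly):

```python
import numpy as np

def brute_force_search(query, vectors, k=5):
    """Exact nearest neighbors: score the query against every vector."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                  # cosine similarity, one score per vector
    top = np.argsort(-sims)[:k]   # indices of the k best matches
    return top, sims[top]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 64)).astype("float32")
query = vectors[123] + 0.01 * rng.normal(size=64)  # near-duplicate of doc 123
idx, scores = brute_force_search(query, vectors, k=3)
print(idx[0])  # 123
```

This is exactly what FAISS's `IndexFlatL2` does under the hood; ANN indexes exist to avoid this full scan.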

Production systems use approximate nearest neighbor (ANN) algorithms that trade a tiny bit of accuracy for massive speed gains. Here are the major approaches:

LSH (Locality-Sensitive Hashing)

Hashes similar vectors into the same "bucket" using random projections. Like sorting mail by zip code — items in the same bucket are likely neighbors. Fast to build, but lower accuracy.

HNSW (Hierarchical Navigable Small World)

Builds a multi-layer graph where similar vectors are connected by edges. Searching means "walking" the graph from coarse to fine layers. High accuracy, fast queries, but expensive to build.

IVF (Inverted File Index)

Clusters vectors using K-means (100–10,000 vectors per cluster). During search, only the closest clusters are checked. Combined with Product Quantization, this forms the backbone of FAISS.

Product Quantization

Compresses each vector by splitting it into subvectors and replacing each with a code from a small codebook. Reduces memory by 10–100x. Distances are approximated from the compressed form.

Annoy (Approximate Nearest Neighbors Oh Yeah)

A tree-based approach by Spotify. Builds multiple binary trees, each splitting vectors using random hyperplanes. During search, traverses these trees to gather candidate neighbors. Lightweight, uses static files as indexes, and is open source.

SPLADE (Sparse Lexical and Expansion model)

Strictly a retrieval model rather than an ANN index, but it belongs in this list because it bridges keyword and semantic retrieval. It uses BERT embeddings but applies regularization to push most values to zero, creating sparse embeddings. You get the semantic understanding of neural models with the efficiency of sparse operations — the best of both worlds in a single model.

The key trade-off is between indexing (build time + memory) and querying (speed + accuracy). HNSW gives the best query accuracy but needs significant memory and build time. LSH is cheap to build but less accurate. IVF + Product Quantization offers a good balance for very large datasets. Annoy is lightweight and uses static files, great for read-heavy workloads.
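Product Quantization is the least intuitive of these, so here is a mechanical sketch. It uses random, untrained codebooks; real systems learn the codebooks with k-means, but the encode/decode mechanics are the same:

```python
import numpy as np

def pq_encode(vec, codebooks):
    """Split vec into m subvectors; store only the index of the nearest
    centroid in each sub-codebook."""
    subs = np.split(vec, len(codebooks))
    return [int(np.argmin(np.linalg.norm(cb - s, axis=1)))
            for cb, s in zip(codebooks, subs)]

def pq_decode(codes, codebooks):
    """Approximate reconstruction from the stored centroid indices."""
    return np.concatenate([cb[c] for cb, c in zip(codebooks, codes)])

rng = np.random.default_rng(0)
# 4 sub-codebooks of 256 centroids each, covering 16-dim slices of 64-dim vectors
codebooks = [rng.normal(size=(256, 16)) for _ in range(4)]
vec = rng.normal(size=64)

# 4 one-byte codes instead of 64 float32 values: a 64x memory reduction
codes = pq_encode(vec, codebooks)
approx = pq_decode(codes, codebooks)
```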

For benchmarking, three resources are essential: the ANN-Benchmarks website compares vector search algorithms across datasets on recall, queries per second, build time, and index size. BEIR (Benchmarking IR) evaluates retrieval systems across 14 common retrieval benchmarks. And the MTEB (Massive Text Embedding Benchmark) evaluates embeddings across retrieval, classification, and clustering tasks — it's the go-to leaderboard for choosing an embedding model.

When dense retrieval goes wrong (and what to do about it)

Dense retrieval (embedding-based search) is powerful, but has specific failure modes you should know about:

No relevant results exist, but you get results anyway. If you ask "What is the mass of the moon?" against a movie database, the system still returns the nearest vectors — even though none are relevant. One heuristic is to set a maximum distance threshold for relevance. Any result beyond that threshold is treated as "no match found." In practice, many search systems present the best results and let the user decide, tracking whether users actually click on them to improve future versions.

Exact phrase matching fails. When a user wants to find an exact match for a specific phrase, embeddings can fuzz over the exact words. This is why hybrid search (combining semantic with keyword search) is recommended over relying solely on dense retrieval.

Domain transfer problems. Dense retrieval models trained on internet and Wikipedia data often perform poorly on specialized domains like legal, medical, or financial text — without sufficient domain-specific training data. The embedding model doesn't understand your domain's jargon.

Multi-sentence answers get lost. If the answer to a question spans multiple sentences or chunks, individual chunks may not capture enough context for a good embedding match. This is where chunking strategy becomes critical — more on that in Part 3.

One useful rule of thumb: Anthropic suggests that if your knowledge base is smaller than 200,000 tokens (about 500 pages), you can include the entire knowledge base directly in the prompt — no RAG needed. It would be great if other model providers gave similar guidance for when RAG is and isn't necessary.

Fine-tuning embedding models to improve search quality

Just like classification models, embedding models can be fine-tuned to work better for your specific search task. The key insight: a query and its best result aren't always "similar" in the general sense — they just need to be close in embedding space.

The fine-tuning process uses contrastive learning. For each document in your dataset, you create training pairs:

  • Positive pairs: Queries that the document correctly answers (e.g., "Interstellar release date" paired with "Interstellar premiered on October 26, 2014, in Los Angeles.")
  • Negative pairs: Queries that the document does NOT answer (e.g., "Interstellar cast" paired with the same premiere sentence)

Training pushes relevant query-document pairs closer together in embedding space and irrelevant pairs further apart. Before fine-tuning, all queries about "Interstellar" might be equally close to the document. After fine-tuning, only the truly relevant queries are close.

The Sentence Transformers library is the standard tool for this. It supports training with various loss functions, including contrastive loss, and integrates with popular model hubs. See the "Retrieve & Re-Rank" section of their documentation for step-by-step instructions.

The Best of Both Worlds: Hybrid Search

So keyword search is fast and precise but misses synonyms. Semantic search understands meaning but can lose exact terms. The obvious question: why not use both?

This is called hybrid search, and it's what most production systems use. You run both a keyword search and a semantic search in parallel, then combine their results.

But combining results from two different systems creates a new problem: the scores aren't comparable. A keyword search might score a document 12.7, while a semantic search scores it 0.89. You can't just add them. You need a way to merge the rankings.

The standard solution is Reciprocal Rank Fusion (RRF). The idea is elegant: instead of comparing scores, compare ranks. If a document is ranked 1st by one system, it gets a score of 1/1 = 1.0. If ranked 3rd by another system, it gets 1/3 = 0.33. Its combined score is 1.0 + 0.33 = 1.33. Documents that rank highly in both systems bubble to the top. (In practice the formula adds a smoothing constant k, typically 60, so each system contributes 1/(k + rank); this keeps a single first-place ranking from dominating the fusion.)

Why does this work so well? Because keyword search catches the exact matches that semantic search might fuzz over, while semantic search catches the conceptual matches that keyword search misses entirely. Together, they cover each other's blind spots.
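Reciprocal Rank Fusion is only a few lines of code. This sketch uses the smoothed form found in most implementations, 1/(k + rank) with k = 60; the document IDs are made up:

```python
def rrf(rankings, k=60):
    """Fuse ranked lists: each list contributes 1/(k + rank) per document."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits  = ["doc_a", "doc_c", "doc_b"]   # BM25 ranking
semantic_hits = ["doc_b", "doc_a", "doc_d"]   # embedding ranking
print(rrf([keyword_hits, semantic_hits]))
# ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```

doc_a wins because both systems rank it highly, even though neither ranked it first in both lists.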

The Reranking Trick: A Second, Smarter Pass

There's a problem with both keyword and semantic search: they score each document independently. The keyword system checks if the document contains the right terms. The embedding system checks if the document's vector is close to the query's vector. But neither one looks at the query and document together — understanding the relationship between what you asked and what the document actually says.

This is where reranking comes in. The idea is simple but powerful: use a smarter (and slower) model to re-score the top results from your initial search.

Think of it like a college admissions process. The first round is a quick filter — GPA above 3.5, test scores above a threshold. Fast, but crude. The second round is a committee that reads each application in full, understanding the nuance of how the student's story connects to the program. The committee is much more accurate, but they can only review 50 applications, not 50,000. That's why you need both rounds.

[Figure: Two-Stage Search, Retrieve Then Rerank. Query: "drug interactions with warfarin." Stage 1, bi-encoder (fast retriever): encodes query and docs separately and compares vectors, narrowing 100,000 docs to the top 100 candidates in ~10 ms. Stage 2, cross-encoder (precision reranker): reads query and doc together and assigns a relevance score, narrowing 100 candidates to the 10 best results. Slower, but reads full context for high precision.]
The bi-encoder processes query and documents separately (fast, can scan millions). The cross-encoder processes query and each document together (slow, but understands the relationship). Using both in sequence gives you speed and accuracy.

In technical terms, the first-stage retriever is called a bi-encoder — it encodes the query and each document separately into their own vectors, then compares them by distance. It's fast because the document vectors are precomputed once and stored.

The reranker is called a cross-encoder — it feeds the query and a candidate document into the model together, allowing the model to see both texts at once and understand how they relate. This approach is described in the paper "Multi-stage document ranking with BERT" and is sometimes referred to as monoBERT. It's dramatically more accurate but can't scale to millions of documents because it needs to process each query-document pair fresh every time.

The numbers speak for themselves: on multilingual benchmarks like MIRACL, adding a reranker after keyword search can boost relevance scores from 36.5 to 62.8 (measured as nDCG@10) — nearly doubling the quality of results. If you want to set up retrieval and reranking locally on your own machine, the Sentence Transformers library provides ready-to-use cross-encoder models and a complete "Retrieve & Re-Rank" pipeline.

Building a search system with FAISS and Cohere (code)

Here's a minimal but complete example of building a dense retrieval system with search and reranking. We'll use Cohere for embeddings and reranking, and FAISS for the vector index.

Step 1: Embed your documents

import cohere
import numpy as np
import faiss

co = cohere.Client('your-api-key')

# Your documents (chunks of text)
texts = [
    "The transformer architecture uses self-attention...",
    "Performance optimization techniques for web apps...",
    "Warfarin interacts with vitamin K...",
    # ... more documents
]

# Embed all documents
response = co.embed(
    texts=texts,
    input_type="search_document"
).embeddings
embeds = np.array(response, dtype='float32')
# Result: (num_docs, 4096) matrix

Step 2: Build the FAISS index

# Create a flat L2 index (exact search)
dim = embeds.shape[1]  # 4096
index = faiss.IndexFlatL2(dim)
index.add(embeds)
print(f"Index contains {index.ntotal} vectors")

Step 3: Search

def search(query, k=10):
    # Embed the query
    query_embed = co.embed(
        texts=[query],
        input_type="search_query"
    ).embeddings
    query_embed = np.array(query_embed, dtype='float32')

    # Find k nearest neighbors
    distances, indices = index.search(query_embed, k)
    return [(texts[i], distances[0][j])
            for j, i in enumerate(indices[0])]

results = search("drug interactions with warfarin")
for text, dist in results[:3]:
    print(f"  {dist:.1f}  {text[:80]}...")

Step 4: Rerank for precision

# Take top 10 from search, rerank to find best 3
query = "drug interactions with warfarin"
search_results = search(query, k=10)
docs = [text for text, _ in search_results]

reranked = co.rerank(
    query=query,
    documents=docs,
    top_n=3,
    return_documents=True
)

for result in reranked.results:
    print(f"  {result.relevance_score:.3f}  "
          f"{result.document.text[:80]}...")

For production: Replace IndexFlatL2 with IndexIVFPQ for datasets over 100K vectors. Use a managed vector database (Pinecone, Weaviate, or Qdrant) if you need to add/remove vectors without rebuilding the index.

BM25 keyword search vs. dense retrieval — side-by-side code comparison

Let's see how keyword search (BM25) compares to the dense retrieval system we built above — using the same query on the same documents.

Set up BM25 keyword search:

from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string

def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)
        if len(token) > 0 and token not in \
            _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc

# Tokenize and index documents
tokenized_corpus = [bm25_tokenizer(t) for t in texts]
bm25 = BM25Okapi(tokenized_corpus)

def keyword_search(query, top_k=3):
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -top_k)[-top_k:]
    hits = sorted(
        [{'id': idx, 'score': bm25_scores[idx]}
         for idx in top_n],
        key=lambda x: x['score'], reverse=True
    )
    for hit in hits[:top_k]:
        print(f"  {hit['score']:.3f}  "
              f"{texts[hit['id']][:80]}...")
    return hits

Compare the same query (this example assumes the indexed documents are passages about the film Interstellar, as the expected results below show):

query = "how precise was the science"

print("=== Dense Retrieval (Semantic) ===")
search(query)  # From our FAISS search above

print("\n=== BM25 (Keyword) ===")
keyword_search(query)

# Dense retrieval correctly finds:
#   "received praise for scientific accuracy..."
# BM25 matches on the word "science" instead:
#   "science fiction film co-written by Nolan..."

The dense retrieval system finds the document about scientific accuracy — the semantically correct answer — even though it doesn't contain the word "precise." BM25 matches on the keyword "science" and returns the wrong document. This is exactly why semantic search exists.

But BM25 wins for exact terms. Search for a specific name like "Kip Thorne" and BM25 nails it instantly while semantic search might return vaguely related physics documents. This is why production systems combine both approaches.

Building a RAG system with local models (Phi-3 + FAISS + LangChain)

You can build a complete RAG system without any API calls — running everything locally on your machine. Here's how using a quantized model (Phi-3), a local embedding model, and FAISS.

Step 1: Load a local generation model

# Download a quantized GGUF model, e.g.:
# !wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-fp16.gguf

# Newer LangChain versions; older ones used `from langchain import LlamaCpp`
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="Phi-3-mini-4k-instruct-fp16.gguf",
    n_gpu_layers=-1,
    max_tokens=500,
    n_ctx=2048,
    seed=42,
    verbose=False
)

Step 2: Load a local embedding model

# Newer LangChain versions; older ones used langchain.embeddings.huggingface
from langchain_community.embeddings import (
    HuggingFaceEmbeddings
)

# A small, high-quality embedding model
embedding_model = HuggingFaceEmbeddings(
    model_name='thenlper/gte-small'
)

Step 3: Build the vector database

from langchain_community.vectorstores import FAISS

# Create a local FAISS index from your documents
db = FAISS.from_texts(texts, embedding_model)

Step 4: Create the RAG pipeline

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

template = """<|user|>
Relevant information:
{context}

Answer concisely using the information above:
{question}<|end|>
<|assistant|>"""

prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

rag = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=db.as_retriever(),
    chain_type_kwargs={"prompt": prompt}
)

# Ask a question!
rag.invoke('What does warfarin interact with?')

The local model won't match the quality of larger managed models (no inline citations, for instance), but this gives you a complete, private RAG system running entirely on your hardware — no API costs, no data leaving your machine.

Choosing Your Retrieval Strategy

Let's put it all together. You have three approaches to retrieval, and the right choice depends on your specific situation:

How the three approaches compare:

  • Speed: keyword search is fastest; semantic search is slower (embedding + vector search); hybrid + reranking is slowest (runs both, then reranks).
  • Cost: keyword is very cheap; semantic adds embedding API + vector DB costs; hybrid is highest (all components).
  • Handles synonyms? Keyword: no. Semantic: yes. Hybrid: yes.
  • Exact term matching? Keyword: excellent. Semantic: can miss specific codes/names. Hybrid: excellent (via the keyword path).
  • Accuracy: keyword is a good baseline; semantic is better (especially with fine-tuning); hybrid is best.
  • Best for: keyword suits log search, exact codes, and low-latency needs; semantic suits natural language queries and Q&A; hybrid + reranking suits production RAG systems.
Decision Card #1

You're building a legal research tool. Lawyers need to find relevant case law based on questions like "precedent for employer liability in remote work injuries." The database has 2 million case documents. Many cases use different terminology for the same legal concepts (e.g., "respondeat superior" vs. "employer liability" vs. "vicarious liability"). Lawyers also frequently search for specific case citations like "Smith v. Jones, 2019."

Which retrieval approach should you use?

Now you understand how to find information. But retrieval alone isn't RAG — it's just search. The magic happens when you connect the retriever to an LLM, building a complete pipeline that rewrites queries, retrieves intelligently, reranks results, and generates grounded answers. That pipeline is what makes or breaks a production AI system — and it's far more complex than "just paste the results into the prompt."

Part 2
The Complete RAG Pipeline

When Simple Search Isn't Enough

So you've built a basic RAG system. You retrieve some documents, paste them into a prompt, and ask the LLM to answer. It works great for straightforward questions. Then a user types:

"We have an essay due tomorrow. We have to write about some animal. I love penguins. I could write about them. But I could also write about dolphins. Are they animals? Maybe. Let's do dolphins. Where do they live for example?"

Your retriever searches for that entire rambling paragraph. The results? A mess. Documents about essays, documents about penguins, documents about zoos — because the system doesn't understand that the actual question is simply: "Where do dolphins live?"

Or consider a user in a chatbot conversation:

User: When was the last time John Doe bought something from us?
AI: John last bought a Fruity Fedora hat on January 3, 2030.
User: How about Emily Doe?

The query "How about Emily Doe?" is meaningless without context. If you search for it verbatim, you'll get random documents about someone named Emily. The system needs to understand that the query really means: "When was the last time Emily Doe bought something from us?"

These are the kinds of failures that send you from naive RAG to a production-grade pipeline. The full pipeline has six stages — some optional, all powerful — and understanding them is the difference between a demo that impresses and a system that actually works.

The Six-Stage RAG Pipeline

A production RAG system follows what practitioners call the rewrite-retrieve-rerank-refine-insert-generate workflow. Let's break down what each stage does and why it exists.

[Figure: The Full RAG Pipeline. 1 Rewrite: fix vague or context-dependent queries (failure mode: topic drift). 2 Retrieve: fetch candidates via hybrid search (failure: irrelevant docs). 3 Rerank: cross-encoder scores relevance (failure: latency spike). 4 Refine: summarize or filter retrieved docs (failure: info loss). 5 Insert: order docs in the prompt (failure: lost in the middle). 6 Generate: LLM produces a grounded answer (failure: hallucination). A Verify stage can run after any step to check quality, catch errors, and take corrective measures.]
The full rewrite → retrieve → rerank → refine → insert → generate pipeline. Not every application needs every stage — but knowing what each one does lets you diagnose exactly where your system is failing. Each stage has its own failure mode.
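The six stages reduce to a short skeleton once each stage is a swappable component. A sketch with stubbed callables (all names are illustrative, not a real framework API):

```python
def answer(query, rewrite, retrieve, rerank, generate, k=100, top_n=10):
    """Rewrite -> retrieve -> rerank -> insert -> generate, with injected stages."""
    q = rewrite(query)                    # 1. make the query self-contained
    candidates = retrieve(q, k)           # 2. cast a wide net (hybrid search)
    best = rerank(q, candidates)[:top_n]  # 3. precision pass (cross-encoder)
    context = "\n\n".join(best)           # 4-5. refine docs, insert into prompt
    return generate(q, context)           # 6. grounded generation

# Stubs that show the data flow end to end
out = answer(
    "How about Emily Doe?",
    rewrite=lambda q: "When did Emily Doe last buy something from us?",
    retrieve=lambda q, k: ["Emily Doe bought a scarf on May 4.", "Unrelated doc."],
    rerank=lambda q, docs: sorted(docs, key=lambda d: "Emily" in d, reverse=True),
    generate=lambda q, ctx: f"Based on the records: {ctx.splitlines()[0]}",
)
print(out)  # Based on the records: Emily Doe bought a scarf on May 4.
```

In a real system each lambda becomes an LLM or search call, but the control flow stays this simple.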

Stage 1: Rewrite — Fixing the Query Before It Breaks Everything

The rewrite stage transforms the user's raw query into something the retriever can actually work with. This is the most underrated stage in the pipeline — and the one that prevents the most failures.

There are three flavors of query rewriting:

Simple cleanup: That rambling dolphin question becomes "Where do dolphins live?" The conversational "How about Emily Doe?" becomes "When was the last time Emily Doe bought something from us?" An LLM does this with a simple prompt: "Given the conversation so far, rewrite the last user query to be self-contained and clear."
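That instruction can be wrapped in a reusable prompt template. A sketch (the function name and formatting are made up, and the actual LLM call is omitted):

```python
def build_rewrite_prompt(conversation, last_query):
    """Ask an LLM to make the latest user query self-contained."""
    history = "\n".join(f"{role}: {msg}" for role, msg in conversation)
    return (
        "Given the conversation so far, rewrite the last user query "
        "to be self-contained and clear.\n\n"
        f"Conversation:\n{history}\n\n"
        f"Last user query: {last_query}\n"
        "Rewritten query:"
    )

conversation = [
    ("User", "When was the last time John Doe bought something from us?"),
    ("AI", "John last bought a Fruity Fedora hat on January 3, 2030."),
]
prompt = build_rewrite_prompt(conversation, "How about Emily Doe?")
# Sent to an LLM, this should come back as something like:
# "When was the last time Emily Doe bought something from us?"
```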

Multi-query (parallel): Some questions need multiple searches run at the same time. "Compare Nvidia's financial results in 2020 vs. 2023" should become two parallel queries: "Nvidia 2020 financial results" and "Nvidia 2023 financial results." The results are merged before being fed to the LLM.

Multi-hop (sequential): Some questions need the answer to one search before you can make the next. "Who are the largest car manufacturers in 2023? Do they each make EVs?" First, you search for the largest manufacturers. Once you learn it's Toyota, Volkswagen, and Hyundai, you search for each one's EV offerings separately. Each step depends on the previous step's results.

Three Retrieval Strategies
Single Query
One question, one search
Where do dolphins live?
Search
Answer
Multi-Query
Multiple searches in parallel
Nvidia 2020 vs 2023?

2020

2023
MERGE
Answer
Multi-Hop
Each search depends on the last
Top car makers + EV?
Largest car makers 2023
Toyota, VW, Hyundai
Toyota EV
VW EV
Hyundai EV
Answer
Single query works for simple questions. Multi-query breaks a comparison into parallel searches. Multi-hop chains searches sequentially — each one informed by the results of the last. The complexity grows, but so does the system's ability to answer sophisticated questions.
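The difference between these strategies is mostly orchestration. Here's a minimal sketch, assuming a hypothetical `search` function backed by a toy corpus (a real system would call a hybrid retriever, and an LLM would produce the sub-queries):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy corpus; a real system would call a hybrid retriever here instead.
CORPUS = {
    "Nvidia 2020 financial results": ["2020 results chunk"],
    "Nvidia 2023 financial results": ["2023 results chunk"],
    "largest car manufacturers 2023": ["Toyota", "Volkswagen", "Hyundai"],
}

def search(query: str) -> list[str]:
    # Unknown queries return a placeholder hit so the sketch stays simple.
    return CORPUS.get(query, [f"{query} chunk"])

def multi_query(sub_queries: list[str]) -> list[str]:
    """Multi-query: independent searches run in parallel, results merged."""
    with ThreadPoolExecutor() as pool:
        return [doc for docs in pool.map(search, sub_queries) for doc in docs]

def multi_hop(first_query: str) -> list[str]:
    """Multi-hop: the second search depends on the first one's results."""
    entities = search(first_query)  # hop 1: who are the largest makers?
    return [doc for e in entities for doc in search(f"{e} EV models")]  # hop 2

merged = multi_query(["Nvidia 2020 financial results",
                      "Nvidia 2023 financial results"])
hops = multi_hop("largest car manufacturers 2023")
```

Note the structural difference: `multi_query` can fan out immediately, while `multi_hop` cannot form its second round of queries until the first search returns.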
HyDE: The clever trick of searching with a fake answer

Hypothetical Document Embeddings (HyDE) is one of the most creative query rewriting techniques. Instead of searching with the question, you search with a fake answer.

Here's the idea: ask the LLM to generate a hypothetical document that would answer the question — even though it might be completely wrong. Then use that document as the search query instead of the original question.

Why would searching with a wrong answer work?

Because the vocabulary mismatch between a question and its answer is often huge. Consider:

  • Question: "Which politicians complained about the budget not being balanced?"
  • Actual document: 'Senator Paxton: "I can't stand the sight of our enormous deficit!"'

The question uses words like "complained" and "balanced." The actual document uses "can't stand" and "deficit." There's barely any word overlap. But a hypothetical answer generated by the LLM might say:

"Senator Mark Wellington stated, 'This government's failure to balance the budget is unacceptable.' MP Emily Fraser remarked, 'We cannot continue this path of reckless spending without addressing the deficit.'"

This hypothetical answer is completely made up — those politicians don't exist. But it uses the same kind of language that real politicians use. Its embedding will be much closer to the actual Senator Paxton quote than the original question's embedding would be.

The key insight: A fake answer is closer in embedding space to real answers than the question itself is. The factual accuracy doesn't matter — the linguistic style is what makes the retrieval work.

Practical tips: You can use a smaller, cheaper LLM for HyDE since factual accuracy doesn't matter. Keep the generated document roughly the same length as your chunks. Watch for topic drift — the generated text may wander off-topic after the first few sentences.
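Here is a minimal HyDE sketch. Everything in it is a stand-in: `generate_hypothetical` would be a call to a small LLM, and `embed` is a toy word-count vectorizer so the example runs on its own. The point is only that the fake answer's vector, not the question's, is what gets searched:

```python
def generate_hypothetical(question: str) -> str:
    # Stand-in for a cheap LLM call; factual accuracy doesn't matter,
    # only the linguistic style of a plausible answer.
    return ("Senator Mark Wellington stated, 'This government's failure "
            "to balance the budget is unacceptable.'")

def embed(text: str) -> list[float]:
    # Toy vectorizer: counts over a tiny vocabulary (a real system
    # would use a dense embedding model here).
    vocab = ["budget", "deficit", "balance", "unacceptable", "spending"]
    tokens = [w.strip(".,!'\"").lower() for w in text.split()]
    return [float(tokens.count(v)) for v in vocab]

def hyde_vector(question: str) -> list[float]:
    """Search with the embedding of a fake answer, not the question."""
    return embed(generate_hypothetical(question))
```

Even with this toy vectorizer, the hypothetical answer lights up answer-style terms ("budget", "balance") that the question itself barely touches.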

More query expansion techniques: PRF, Query2Doc, query routing, and the road to agentic RAG

HyDE is just one of several query expansion techniques. Here are the others you should know about:

Pseudo-Relevance Feedback (PRF)

PRF is a tried-and-tested technique from the information retrieval field. The idea: use your original query to retrieve a first batch of documents, then extract salient terms from those documents and add them back to your query. For example, searching "Which politicians complained about the budget not being balanced?" might return documents containing terms like "fiscal policy," "deficit," "financial mismanagement," and "budgetary reforms." Adding these terms to the original query dramatically improves recall on the second search pass. PRF is simple, effective, and doesn't require an LLM — but it's susceptible to topic drift if the initial results are off-target.
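PRF needs no LLM at all. A sketch with a crude term counter (the stopword list and substring check are illustrative stand-ins for proper IR term weighting):

```python
from collections import Counter
import re

STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "about", "not"}

def expand_with_prf(query: str, first_pass_docs: list[str], k: int = 4) -> str:
    """Append the k most frequent content terms from the first-pass
    results to the original query (classic pseudo-relevance feedback)."""
    counts = Counter()
    for doc in first_pass_docs:
        for term in re.findall(r"[a-z]+", doc.lower()):
            # Crude substring check to skip terms already in the query.
            if term not in STOPWORDS and term not in query.lower():
                counts[term] += 1
    expansion = [term for term, _ in counts.most_common(k)]
    return query + " " + " ".join(expansion)
```

If the first pass returns deficit-related documents, the expanded query now carries "deficit" into the second pass, which is exactly where the recall improvement comes from (and where topic drift sneaks in if the first pass was off-target).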

Query2Doc

Similar to HyDE, Query2Doc uses an LLM to generate a hypothetical document, but the key difference is how the generated text is used. Rather than replacing the query entirely, Query2Doc combines the generated document with the original query to create a richer search input.

Query2Doc2Keyword

This technique (Li et al.) chains three steps: generate a hypothetical document from the query, extract salient keywords from that document, then add those keywords to the original query. The authors improve keyword quality using the self-consistency method — repeating the keyword generation multiple times and selecting only keywords that appear across multiple generations. This filters out noise and keeps only the most reliable expansion terms.

Document-side rewriting

All the techniques above modify the query to bring it closer to the document space. An alternative approach flips this: modify the documents to bring them closer to the query space. Techniques like doc2query (generate potential questions each document could answer) and contextual retrieval (add context headers to chunks) do exactly this. The upfront cost is higher — you process every document once — but it eliminates query-rewriting latency at inference time.

Query Routing

Once you have multiple data sources, the LLM can learn to route queries to the right one. For example: HR questions go to Notion, customer data goes to Salesforce, financial reports go to the document store. The model examines the query and decides which data source(s) to search — or whether multiple sources are needed.

The road to Agentic RAG

Notice the progression: simple query rewriting → multi-query → multi-hop → query routing. Each step delegates more responsibility to the LLM. At the end of this progression, the LLM isn't just rewriting queries — it's deciding what to search, where to search, and when to stop. The data sources become tools, and the LLM becomes an agent. This is exactly the transition we'll explore in Part 4.

Decision Card #2

Your company's HR chatbot keeps returning outdated information. When employees ask "What's our parental leave policy?", it retrieves the 2022 policy (12 weeks) instead of the 2024 policy (16 weeks). Both documents exist in the knowledge base and both contain the words "parental leave policy." The embedding similarity scores are nearly identical since the documents are about the same topic.

What's the most effective fix?

Stages 2-3: Retrieve and Rerank — The Heart of the Pipeline

The retrieve stage is the most critical part of the entire pipeline. If retrieval fails, nothing downstream can save you — even the most powerful LLM can't generate correct answers from irrelevant context.

We covered retrieval methods in Part 1 (keyword, semantic, hybrid search). In a production pipeline, retrieval typically uses hybrid search optimized for recall — casting a wide net to make sure the right documents are somewhere in the results, even if some noise comes along.

The rerank stage then optimizes for precision — using a cross-encoder or similar model to re-score the candidates and push the most relevant documents to the top. Remember: the retriever says "here are 100 documents that might be relevant." The reranker says "here are the 10 that actually are."
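In miniature, the two stages look like this. Both scorers are toy stand-ins: `recall_retrieve` for a hybrid retriever tuned for recall, `cross_encoder_score` for a real cross-encoder model:

```python
DOCS = [
    "dolphins live in oceans worldwide",
    "dolphin anatomy and physiology",
    "office furniture catalog",
]

def recall_retrieve(query: str, k: int = 100) -> list[str]:
    # Recall stage: cast a wide net; any keyword overlap qualifies
    # (stand-in for hybrid BM25 + embedding search).
    terms = set(query.lower().split())
    return [d for d in DOCS if terms & set(d.split())][:k]

def cross_encoder_score(query: str, doc: str) -> float:
    # Precision stage stand-in: a real cross-encoder reads query and
    # document jointly and outputs a relevance score.
    overlap = set(query.lower().split()) & set(doc.split())
    return len(overlap) / len(doc.split())

def retrieve_and_rerank(query: str, top_k: int = 10) -> list[str]:
    candidates = recall_retrieve(query)  # wide net, optimized for recall
    candidates.sort(key=lambda d: cross_encoder_score(query, d), reverse=True)
    return candidates[:top_k]            # narrow cut, optimized for precision
```

The shape is what matters: a cheap, broad first pass followed by an expensive, precise second pass over a much smaller candidate set.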

Generative retrieval, tightly-coupled retrievers, and query likelihood models

Beyond the standard retrieve-then-rerank pattern, researchers have developed alternative retrieval paradigms that rethink how documents are found:

Generative Retrieval

What if the LLM could directly name the right document instead of searching for it? Generative retrieval assigns each document a unique identifier (called a docid) and teaches the model the association between documents and their IDs. During inference, the model simply generates the docid(s) relevant to the query — no vector search needed.

Types of docids:

  • Single tokens: Each document gets a new vocabulary token — the model outputs one token per document. Only works with small corpora.
  • Prefix tokens: Use the first 64 tokens of a document as its identifier (Tay et al.).
  • Cluster tokens: Perform hierarchical clustering on the corpus; the docid is the concatenation of cluster IDs at each level.
  • Keyword tokens: Salient keywords representing the document's topics (e.g., "transformer_self-attention_architecture").

The model can learn these associations via fine-tuning (training-based indexing) or few-shot learning (Askari et al.) — generating pseudo-queries for each document, then teaching the model to map queries to docids. Constrained beam search ensures only valid docids are generated.

Best for: Small document corpora with well-defined categories (e.g., annual reports of public companies). Not suitable for frequently-updated collections where you'd need constant retraining.

Tightly-Coupled Retrievers

In standard RAG, the embedding model and the LLM are independent — the retriever doesn't know what the generator needs. A tightly-coupled retriever is trained using feedback from the LLM itself: it learns to retrieve text that best positions the LLM to generate the correct answer.

LLM-Embedder (Zhang et al.) is a unified embedding model trained from two signals: (1) standard contrastive learning, and (2) LLM feedback — a retrieval candidate receives a higher reward if it improves the LLM's answer quality. This single model supports multiple retrieval tasks: knowledge retrieval, few-shot example selection, tool documentation lookup, and more.

Trade-off: Requires training infrastructure and the retriever may need retraining when you swap the generator LLM. Use when standard retrieval hits a plateau and you need the extra lift.

Query Likelihood Models (QLM)

A QLM flips the ranking question: instead of asking "how relevant is this document to the query?", it asks "how likely is the query to have been generated from this document?" For each candidate document, the model is prompted: "Generate a question most relevant to this document." Then the probability of the actual query tokens is calculated from the model's logits. Documents are ranked by this probability.

Important caveat: Zhuang et al. showed that instruction-tuned models that weren't trained on query-generation tasks lose their zero-shot QLM capability. This means fine-tuning a smaller LLM specifically for query generation is often the most practical approach.

Limitation: Requires access to model logits, which most proprietary APIs don't provide. Best implemented with open-source models.

Stage 4: Refine — Don't Overwhelm the LLM

You've retrieved and reranked your documents. Now you need to feed them to the LLM. But there's a problem: the retrieved documents might be too long for the context window, or they might contain irrelevant sections mixed with relevant ones. The refine stage handles this through summarization or note generation.

One particularly powerful technique is Chain-of-Note. Instead of just summarizing each document, the system generates a note that evaluates whether each document actually answers the question, provides useful context, or is irrelevant and should be ignored.

For example, given the query "Did the Green Party support the 2023 Transit bill?" and two retrieved documents — one about the Green Party's environmental platform and one about the bill's popularity — a Chain-of-Note would generate: "While the first passage mentions the party's emphasis on sustainable transportation and the second mentions the bill's popularity, neither confirms the party's specific support or opposition to the 2023 bill." This helps the LLM correctly say "I don't know" instead of hallucinating an answer.

Extractive vs abstractive summarization — and training tightly-coupled summarizers

When the refine stage summarizes retrieved documents, there are two fundamentally different approaches:

Extractive Summarization

Selects key sentences directly from the original text without modifying them. Almost always faithful to the source because it preserves original wording.

How it works: Generate embeddings for the query and each sentence in the retrieved document. Select the top-K sentences most similar to the query. The query-sentence similarity serves as a proxy for how useful each sentence is for enabling the LLM to generate the correct answer.
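A sketch of this recipe, with a toy bag-of-words `embed` standing in for a dense embedding model. Note the selected sentences are re-emitted in their original order, which is what keeps the summary faithful to the source:

```python
import math

def embed(text: str) -> dict[str, int]:
    # Toy bag-of-words vectorizer (a dense embedding model in practice).
    counts: dict[str, int] = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def extractive_summary(query: str, document: str, k: int = 2) -> str:
    """Keep the top-k sentences most similar to the query, re-emitted
    in original order so the wording stays faithful to the source."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    q = embed(query)
    top = sorted(sentences, key=lambda s: cosine(q, embed(s)), reverse=True)[:k]
    keep = set(top)
    return ". ".join(s for s in sentences if s in keep) + "."
```

(The naive period-split here is exactly the kind of shortcut that breaks on real text; a production system would use a proper sentence tokenizer.)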

Abstractive Summarization

Generates new text that captures the meaning of the original. Can combine information from multiple locations within and across documents. Risk: hallucination.

Key insight: These summaries aren't for human readers — they're for the LLM. The objective is to produce a summary that maximizes the LLM's ability to generate the correct answer, not readability.

Tightly-coupled summarizers

Both extractive and abstractive summarizers can be tightly coupled with the generator LLM — meaning they're trained using feedback from the LLM itself (Xu et al.).

For extractive summarizers: each sentence is prefixed to the input query, and the likelihood of the LLM generating the correct output is measured. Sentences with high likelihood become positive training examples; sentences below a threshold become negatives. The model is trained with contrastive learning on these triplets.

For abstractive summarizers: a larger LLM generates candidate summaries using prompt templates. Each summary is evaluated by measuring output token likelihood when it's used as context. The best summary (highest likelihood) becomes the training target. During inference, if no summary improves the LLM's output likelihood compared to no summary at all, an empty summary is returned — effectively filtering out irrelevant documents.

Practical note: Tightly-coupled summarizers are expensive to train initially but effective at removing noise. If you change your target LLM, you may need to retrain the summarizers — there's a slight performance degradation when transferring across models.

Stage 5: Insert — Order Matters More Than You Think

Here's a surprising research finding that changes how you build RAG systems: LLMs are better at recalling information at the beginning and end of their context window, and worse at recalling information in the middle. This is called the "lost in the middle" problem.

If you have 10 documents ranked by relevance (Doc 1 = most relevant, Doc 10 = least), the naive approach is to list them 1 through 10. But research by Liu et al. shows that a better ordering is:

Optimal Document Ordering
Doc 1 Doc 3 Doc 5 Doc 7 Doc 9 Doc 10 Doc 8 Doc 6 Doc 4 Doc 2
Most relevant at beginning & end — least relevant buried in middle where LLMs pay less attention

The most relevant documents go at the beginning and end of the context window. The least relevant go in the middle, where they're more likely to be ignored. This simple reordering can measurably improve answer quality.
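The reordering itself takes a few lines. A sketch, assuming the input list is already sorted by descending relevance:

```python
def reorder_for_context(docs_by_relevance: list[str]) -> list[str]:
    """Most relevant docs at the edges of the context window,
    least relevant in the middle, where attention is weakest."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

ranked = [f"Doc {i}" for i in range(1, 11)]
print(reorder_for_context(ranked))
# ['Doc 1', 'Doc 3', 'Doc 5', 'Doc 7', 'Doc 9',
#  'Doc 10', 'Doc 8', 'Doc 6', 'Doc 4', 'Doc 2']
```

Run on a 10-document ranking, this reproduces the ordering in the figure: Doc 1 and Doc 2 at the edges, Doc 9 and Doc 10 buried in the middle.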

Stage 6: Generate — The Final Answer

Finally, the LLM generates a response based on the query and the carefully prepared context. The standard approach is straightforward: stuff everything into the prompt and let the model answer.

But there's an advanced technique called active retrieval where generation and retrieval are interleaved. The LLM generates a few sentences, realizes it needs more information, triggers a new retrieval, incorporates the results, and continues generating. This is especially useful for long-form answers where different parts require different context.

Advanced: GraphRAG, ColBERT, FLARE, and Chain-of-Note

Here are four advanced techniques that push RAG beyond the basic pipeline:

GraphRAG (Microsoft)

Standard retrieval struggles with questions that span across documents — "What are the key themes in this dataset?" can't be answered by any single chunk. GraphRAG solves this by building a knowledge graph from the corpus: extracting entities and relationships, performing hierarchical clustering, and generating summaries for each cluster. The knowledge graph captures connections that embeddings miss.

Trade-off: Requires significant upfront compute to build the graph. Works best for datasets with rich entity relationships.

ColBERT (Late Interaction)

Standard cross-encoders are accurate but slow because they process query+document together. ColBERT (Contextualized Late Interaction over BERT) gets the best of both worlds: it encodes queries and documents separately (like a bi-encoder, so documents can be pre-computed), but then does a fine-grained token-level comparison between them. For each query token, it finds the maximum similarity with any document token, then sums these scores.

Trade-off: More storage than standard bi-encoders (stores per-token embeddings). But much faster than cross-encoders at inference.
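The late-interaction score itself is tiny. A sketch using plain lists of floats for per-token embeddings, assumed L2-normalized so a dot product is a cosine similarity:

```python
def maxsim_score(query_tokens: list[list[float]],
                 doc_tokens: list[list[float]]) -> float:
    """ColBERT-style MaxSim: for each query token, take its best match
    among all document tokens, then sum those maxima."""
    score = 0.0
    for q in query_tokens:
        score += max(sum(qi * di for qi, di in zip(q, d)) for d in doc_tokens)
    return score
```

Because document token embeddings are pre-computed offline, only this cheap max-and-sum runs at query time, which is where the speedup over cross-encoders comes from.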

FLARE (Active Retrieval)

FLARE (Forward-Looking Active REtrieval) interleaves generation with retrieval. It comes in two flavors:

FLARE-Instruct: The LLM is prompted to generate search queries in a special syntax whenever it needs more information. For example, generating an article about an athlete: "Peruth Chemutai [Search(birthdate of Peruth Chemutai)] is a Ugandan runner who [Search(what medals did Peruth Chemutai win)]..." — the model explicitly requests retrieval at uncertain points.

FLARE-Direct: The LLM generates a candidate sentence. If any token has probability below a threshold, retrieval is triggered automatically. The low-confidence tokens are masked in the generated sentence, and the remaining text is used as the search query. No special prompting needed — the model's own uncertainty drives retrieval.

Trade-off: Higher latency due to multiple retrieval calls. Best for long-form generation where different sections need different context.

Chain-of-Note

Before the LLM answers, it generates a note for each retrieved document that classifies it into one of three categories: (1) contains the answer, (2) provides useful context but not the answer, or (3) is irrelevant. This explicit reasoning step dramatically improves the LLM's ability to say "I don't know" when the retrieved documents don't actually contain the answer — instead of hallucinating one.

Trade-off: Adds an extra LLM call. Harder for smaller models. But very effective when retrieval results are noisy.

LLM distillation for ranking: teaching smaller models to rank documents

Cross-encoders and ColBERT are encoder-based models. But decoder LLMs can also be trained to rank documents directly. There are three approaches:

Pointwise
Each document scored independently

Feed each candidate document separately to the LLM. The model provides a binary relevance judgment (yes/no) or a numerical score.

Easiest to implement. May not be the most effective.

Pairwise
Compare documents head-to-head

For each pair of candidates, the LLM decides which is more relevant. Requires O(N²) comparisons for a complete ranking.

Most effective (direct comparison). Slowest due to quadratic comparisons.

Listwise
Rank all documents at once

Tag all candidates with identifiers, feed them all to the LLM, and ask it to output a ranked list of identifiers by relevance.

One LLM call. Needs large context window.

Training these rankers via distillation: Rankers like RankVicuna and RankZephyr are trained by distilling from larger LLM rankers such as RankGPT (which prompts a GPT model to rank listwise). The process for RankVicuna: (1) feed training queries through BM25 to get candidate documents, (2) pass these to the large LLM, which generates a rank-ordered list, (3) use these query + ranked-list pairs to fine-tune a smaller LLM.

The RankVicuna authors found two interesting results: as the quality of the first-level retrieval increases, the gains from reranking decrease (diminishing returns). And augmenting training data by shuffling the input order of candidate documents improved model performance — it prevents the model from learning positional biases.

Practical tip: You can combine retrieval scores with reranking scores for the final relevance ranking. This is useful when you need to enforce keyword weighting or factor in metadata like publication date (prefer newer documents).

How Do You Know If It's Working? RAG Evaluation

Building a RAG pipeline without evaluation is like flying blind. You need metrics for both the retrieval quality and the generation quality. Here are the key metrics:

Retrieval Metrics
Context Precision
Of all docs retrieved, what % is actually relevant? Catches "noisy retrieval."
Context Recall
Of all relevant docs in the database, what % was retrieved? Catches "missing information."
MAP (Mean Average Precision)
Rewards relevant docs that appear higher in the ranking. Punishes relevant docs buried at position 10.
Generation Metrics
Faithfulness
Is the answer consistent with the provided context? Catches hallucination.
Answer Relevance
Does the answer actually address the question? Catches off-topic responses.
Citation Precision & Recall
Are the citations accurate? Do all claims have supporting citations?
How MAP and nDCG actually work — with a worked example

Evaluating a search system requires three things: a text archive, a set of test queries, and relevance judgments — human labels indicating which documents are relevant for each query.

Average Precision (AP) for a single query:

Imagine your search system returns 5 results for a query, and results at positions 1, 3, and 5 are relevant (positions 2 and 4 are irrelevant). The calculation focuses only on the relevant positions:

Position 1 (relevant): Precision@1 = 1/1 = 1.0
Position 3 (relevant): Precision@3 = 2/3 = 0.67
Position 5 (relevant): Precision@5 = 3/5 = 0.60

AP = (1.0 + 0.67 + 0.60) / 3 = 0.76

Now compare: if the same 3 relevant documents were at positions 3, 4, and 5 instead:

Position 3 (relevant): Precision@3 = 1/3 = 0.33
Position 4 (relevant): Precision@4 = 2/4 = 0.50
Position 5 (relevant): Precision@5 = 3/5 = 0.60

AP = (0.33 + 0.50 + 0.60) / 3 = 0.48

The first system scores 0.76, the second 0.48 — AP naturally rewards systems that put relevant results higher in the ranking.

Mean Average Precision (MAP) is simply the mean of AP scores across all queries in your test suite. This gives you a single number to compare entire search systems.
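The worked example above translates directly into code:

```python
def average_precision(relevance: list[int]) -> float:
    """AP for one query. relevance[i] is 1 if the result at rank i+1
    is relevant, else 0. Only relevant positions contribute."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision@rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(runs: list[list[int]]) -> float:
    # MAP: mean of per-query AP over the whole test suite.
    return sum(average_precision(r) for r in runs) / len(runs)

print(round(average_precision([1, 0, 1, 0, 1]), 2))  # 0.76
print(round(average_precision([0, 0, 1, 1, 1]), 2))  # 0.48
```

The two calls reproduce the worked example: the same three relevant documents score 0.76 when ranked high and 0.48 when pushed down.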

Normalized Discounted Cumulative Gain (nDCG)

MAP treats relevance as binary — a document is either relevant or not. But sometimes one document is more relevant than another. nDCG handles this with graded relevance (e.g., 0 = irrelevant, 1 = somewhat relevant, 2 = highly relevant). It also applies a logarithmic discount — relevant documents at lower positions contribute less to the score. nDCG is normalized to [0, 1] by dividing by the ideal ranking's score.
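A sketch of nDCG under the common log2 discount convention:

```python
import math

def dcg(gains: list[float]) -> float:
    # Logarithmic position discount: rank 1 counts fully,
    # lower ranks contribute progressively less.
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains: list[float]) -> float:
    """gains are graded relevance labels (e.g., 0/1/2) in ranked order;
    normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0
```

A perfectly ordered ranking scores exactly 1.0; putting the highly relevant document last drags the score down, which is the graded, position-aware behavior MAP can't express.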

LLM-as-a-Judge: While human evaluation is always preferred, it doesn't scale. The RAGAS library automates evaluation by having a capable LLM (like GPT-4) score your system's outputs across multiple axes — faithfulness, answer relevance, context precision, and more. Human evaluation along four axes (fluency, perceived utility, citation recall, citation precision) from the paper "Evaluating Verifiability in Generative Search Engines" remains the gold standard for production systems.

Setting up automated RAG evaluation with RAGAS

RAGAS is a library that automates RAG evaluation by using an LLM as a judge. Instead of manually labeling thousands of query-answer pairs, you let a capable model (like GPT-4) evaluate your system's outputs.

What you need:

  • A set of test queries
  • For each query: the retrieved context, the generated answer, and (ideally) a ground-truth answer
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)
from datasets import Dataset

# Your test data
data = {
    "question": [
        "What is our parental leave policy?",
        "How do I request time off?",
    ],
    "answer": [
        "Our parental leave policy provides 16 weeks...",
        "To request time off, submit a request in HR Portal...",
    ],
    "contexts": [
        ["Policy doc chunk 1...", "Policy doc chunk 2..."],
        ["HR Portal guide chunk 1...", "Time-off FAQ chunk..."],
    ],
    "ground_truth": [
        "The 2024 policy provides 16 weeks of paid leave.",
        "Employees request time off through the HR Portal.",
    ],
}

dataset = Dataset.from_dict(data)
results = evaluate(
    dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)
print(results)
# {'context_precision': 0.83, 'context_recall': 0.91,
#  'faithfulness': 0.95, 'answer_relevancy': 0.88}

How to interpret the results:

  • Low context precision → Your retriever is returning too many irrelevant documents. Fix: better reranking, metadata filters, or tighter similarity thresholds.
  • Low context recall → Your retriever is missing relevant documents. Fix: better chunking, add synonyms, try hybrid search.
  • Low faithfulness → The LLM is making claims not supported by the context. Fix: add Chain-of-Note, improve prompt instructions, use a more instruction-following model.
  • Low answer relevancy → The answer doesn't address the question. Fix: improve query rewriting, check that the right documents are being retrieved.
Decision Card #3

You're debugging a customer support chatbot. Users ask follow-up questions that reference earlier conversation turns. When a user asks "What's the status of my order?" then follows up with "What about the other one?" — the system retrieves documents about return policies instead of order status. The initial question works fine; only follow-ups break.

Which pipeline stage is most likely the root cause?

Now you understand the full pipeline — how each stage transforms the query, the context, and finally the answer. But there's an entire layer of optimization underneath that determines how well each stage performs. How should you chunk your documents? Can you compress your embeddings without losing quality? When should you skip RAG entirely and use a different approach? That's what Part 3 unpacks.

Part 3
Optimizing the Machine

The Assembly Line Analogy

You've built the complete RAG pipeline — rewrite, retrieve, rerank, refine, insert, generate. It works. But "works" and "works well" are very different things. Now it's time to optimize every layer of the system.

Think of your RAG system as an assembly line. The pipeline is the sequence of stations. But the quality of the raw materials — how your documents are chunked, how your embeddings are stored, which vector database you use — determines whether the factory produces luxury goods or rejects. This part is about the raw materials.

Chunking: How You Split Documents Changes Everything

When you put documents into a RAG system, you don't feed entire documents to the retriever — you split them into smaller chunks. This seems like a minor detail, but it's one of the most impactful decisions you'll make. The chunking strategy determines what the retriever can find, how much context the LLM gets, and whether important information is preserved or destroyed.

Here's the fundamental tension: smaller chunks are more specific and let you fit more diverse information in the context window — but they can lose important context. Larger chunks preserve context but may include irrelevant information and leave less room for other chunks.

Consider this real-world example from a financial document:

Page 5: "All numbers in this document are in millions."
Page 84: "The related party transaction amounts to $213.45."
The transaction is actually $213.45 million — but if these pages are in different chunks, the LLM will report $213.45. A million-fold error.

No chunking strategy fully solves this — it's an active area of research. But understanding the options helps you choose the least-bad approach for your data.

Chunking Strategies Compared
Fixed-Size Chunking

Split every N characters/words. Simple, predictable. Risk: can cut mid-sentence, losing context. Add overlap (e.g., 10-20% of chunk size) to mitigate.

Recursive Chunking

Split by sections, then paragraphs, then sentences until each chunk fits. Respects document structure. Better: reduces mid-thought splits.

Semantic Chunking

Groups related content by topic using embeddings or clustering. Chunks can be non-contiguous. Best coherence, but expensive and complex.

Late Chunking

Feed the entire document to a long-context model first (capturing cross-chunk dependencies), then chunk the output. Each embedding knows about the whole document. Newest, requires long-context embedding model.

Each strategy trades off between simplicity, context preservation, and cost. Fixed-size is your starting point. Recursive is the practical default. Semantic gives best coherence. Late chunking preserves long-range dependencies. Add metadata (title, date, section headers) to any strategy for better retrieval.
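Fixed-size chunking with overlap, the usual starting point, can be sketched in a few lines (character-based here for simplicity; production systems usually count tokens):

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Fixed-size chunks with overlap, so a thought that straddles a
    boundary survives intact in at least one chunk. Character-based
    here; production code usually counts tokens instead."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Each chunk shares its last `overlap` characters with the start of the next one, which is the mitigation for mid-sentence cuts mentioned above.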
Chunk size tradeoffs: the numbers that matter

Here are the concrete tradeoffs when choosing chunk size:

Smaller Chunks (100-300 tokens)
  • More diverse information per context window
  • Vectors better represent individual concepts
  • Higher precision in retrieval
  • 2x chunks = 2x embeddings = 2x storage
  • Can lose important cross-sentence context
  • Slower vector search (larger index)
Larger Chunks (500-2000 tokens)
  • Better context preservation
  • Fewer embeddings to generate and store
  • Faster vector search
  • Less diverse info per context window
  • Vectors become diluted (represent too many topics)
  • More irrelevant text mixed with relevant text

Practical guidance:

  • Always add overlap between chunks (10-20% of chunk size) to avoid cutting mid-thought
  • Chunk size must not exceed your embedding model's context limit
  • Consider metadata-aware chunking: use section headers, paragraph boundaries, and document structure to determine chunk boundaries
  • For code: use language-specific splitters that respect function/class boundaries
  • For Q&A: each question-answer pair can be its own chunk
  • Augment chunks with parent document title and summary — this is Anthropic's "contextual retrieval" approach that dramatically improves retrieval quality

There's no universal best size. The only reliable approach: try multiple sizes, measure retrieval quality on your actual queries, and pick the winner.

More chunking strategies: token-based, sliding window, metadata-aware, layout-aware, and multi-level embeddings

Beyond the four strategies shown above, here are additional approaches used in production:

Token-based chunking — Instead of splitting by characters or sentences, use the generative model's tokenizer as the unit. Split documents into chunks of N tokens using the same tokenizer your downstream LLM uses (e.g., Llama 3's tokenizer). This makes it easier to control exactly how many tokens each chunk consumes in the context window. The downside: if you switch to a different LLM with a different tokenizer, you'd need to re-index everything.

Sliding window chunking — When chunk boundaries split related content, the neighboring chunk's context is lost. Sliding window fixes this: each piece of text can appear in multiple overlapping chunks, so neighboring context is always preserved in at least one chunk. Think of it as moving a window across the document rather than cutting it into discrete pieces.

Metadata-aware chunking — Use paragraph boundaries, section headers, and subsection structure to determine where to split. If metadata isn't readily available, document parsing tools like Unstructured can extract this information automatically. This preserves the document's logical structure.

Layout-aware chunking — A more sophisticated version of metadata-aware chunking that uses computer vision to extract layout information: text placement, font sizes, titles, subtitles, table boundaries, and image regions. Tools include AWS Textractor, Unstructured, and layout-aware language models like LayoutLMv3. For example, knowing a subsection's scope lets you prepend the subsection title to every chunk within it.

Contextual retrieval — Anthropic's approach: augment each chunk with context from its parent document. An LLM generates a short context (50-100 tokens) explaining the chunk and its relationship to the whole document. This context is prepended to the chunk before indexing. You can also augment chunks with: the questions they could answer, extracted entities (so error code EADDRNOTAVAIL becomes searchable), or product descriptions and reviews for e-commerce data.

Multi-level embeddings — Use multiple levels of granularity: sentence embeddings, paragraph embeddings, section embeddings, and even document-level summary embeddings. You can use different embedding models at each level — more expensive models at higher granularity levels where there are fewer items to embed. During retrieval, you can start from the top level and drill down like a tree, or target a specific level directly.

A warning about document parsing: Effective document parsing — extracting section boundaries, detecting tables, handling heterogeneous formats — is the unglamorous foundation of any RAG system. If your data includes domain-specific text, even sentence tokenization isn't trivial (abbreviations confuse naive period-splitting). NLTK's Punkt tokenizer can be trained unsupervised on your domain text, and spaCy, Stanza, and ClarityNLP offer additional options. A large proportion of RAG failure modes trace back to poor document parsing.

Making Embeddings Smaller (Without Losing Quality)

Modern embedding models produce vectors with 768 to 4,096 dimensions. Each dimension is a 32-bit floating-point number. Do the math: 100 million documents × 768 dimensions × 4 bytes = ~300 GB just for embeddings. And that's before the index overhead.

Three techniques can dramatically reduce this cost:

1. Matryoshka Embeddings — Truncate the dimensions. Named after Russian nesting dolls, these embeddings are trained so that the first N dimensions contain the most important information. You can truncate a 1024-dimension vector to 128 dimensions and retain 98% of the performance — while using 8x less storage. The trick is in the training: during learning, the model optimizes for multiple truncation points simultaneously, so earlier dimensions learn richer representations.
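At query time, using a Matryoshka embedding is just slicing and re-normalizing. A minimal numpy sketch with a random stand-in vector:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep only the first `dims` dimensions and re-normalize to unit
    length, so cosine similarity on truncated vectors stays meaningful."""
    head = vec[:dims]
    return head / np.linalg.norm(head)

rng = np.random.default_rng(0)
full = rng.normal(size=1024)            # stand-in for a 1024-dim embedding
small = truncate_embedding(full, 128)   # 8x less storage per vector
```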

2. Binary Quantization — 32x compression. Instead of storing each dimension as a 32-bit float, store just one bit: 1 if the value is positive, 0 if negative. A 768-dimension vector goes from 3,072 bytes to just 96 bytes. Distance calculations become simple bitwise operations — blazingly fast. The performance hit is often smaller than you'd expect, especially when you use this for the initial retrieval pass and re-score the top candidates with full-precision vectors.
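A minimal numpy sketch of the encode and distance steps; the embeddings here are random stand-ins:

```python
import numpy as np

def binarize(embs: np.ndarray) -> np.ndarray:
    """One bit per dimension (sign of the value), packed 8 dims per byte."""
    return np.packbits(embs > 0, axis=-1)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Distance between packed binary codes: XOR, then count set bits."""
    return int(np.unpackbits(a ^ b).sum())

rng = np.random.default_rng(1)
embs = rng.normal(size=(2, 768)).astype(np.float32)
codes = binarize(embs)                # (2, 96): 3,072 bytes -> 96 bytes each
dist = hamming(codes[0], codes[1])    # cheap first-pass ranking signal
```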

3. Product Quantization — Compress by clustering. Split each vector into subvectors, cluster the subvectors across your dataset, and replace each subvector with the ID of its nearest cluster centroid. A 768-dimension vector split into 5 subvectors becomes just 5 numbers (cluster IDs). This is the backbone of FAISS for large-scale vector search.
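A toy numpy sketch of the encoding step. Real systems (e.g., FAISS's PQ indexes) train the per-subspace codebooks with k-means; fixed random centroids stand in for them here:

```python
import numpy as np

def pq_encode(vec: np.ndarray, codebooks: list[np.ndarray]) -> np.ndarray:
    """Split vec into len(codebooks) subvectors and keep only the ID of
    the nearest centroid in each subspace."""
    subvecs = np.split(vec, len(codebooks))
    return np.array(
        [int(np.argmin(np.linalg.norm(cb - sv, axis=1)))
         for sv, cb in zip(subvecs, codebooks)],
        dtype=np.uint8,
    )

rng = np.random.default_rng(2)
dim, m, k = 768, 8, 256   # 8 subvectors, 256 centroids per subspace
codebooks = [rng.normal(size=(k, dim // m)) for _ in range(m)]
code = pq_encode(rng.normal(size=dim), codebooks)
# 768 float32 values (3,072 bytes) -> 8 one-byte centroid IDs
```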

Peering inside embeddings with Sparse Autoencoders

Embeddings are famously opaque — you know that two vectors are close or far, but you can't see why. What features does each dimension encode? Why are two sentences sometimes closer than you'd expect?

Sparse Autoencoders (SAEs) offer a window into this black box. The idea: a language model learns millions of features, but only a few are active for any given input. SAEs decompose the embedding into independent, interpretable features.

When researchers applied SAEs to T5 embeddings, they discovered features like:

  • Presence of negation — fires when text contains "not," "never," "can't"
  • Expression of possibility — fires on "might," "could," "perhaps"
  • Employment and labor concepts — fires on job-related text
  • Possessive syntax at sentence start — fires on "My," "Their," "Our"

Features range from lexical (specific words) to syntactic (grammar patterns) to topical to pragmatic (tone, intent). Each feature activates on specific text samples, making it possible to understand — and even steer — what the embedding encodes.

Practical implication: If you know what features your embedding represents, you can perform "semantic edits" — adjusting specific features to change the retrieval behavior without retraining the model.

Espresso layer reduction and int8 quantization — faster embeddings with code

Espresso Sentence Embeddings (Li et al.) takes a different approach to compression: instead of reducing dimensions, it reduces layers. Embeddings are typically extracted from the last layer of the model, but Espresso trains the lower layers to produce high-quality embeddings too, using KL-divergence loss between the final layer and each lower layer. Removing half the layers gives a 2x speed improvement while preserving 85% of the original performance.

The Sentence Transformers library lets you combine Matryoshka dimension reduction with Espresso layer reduction using Matryoshka2dLoss — compressing both width and depth simultaneously.

Training Matryoshka embeddings:

from sentence_transformers import (
    SentenceTransformer, SentenceTransformerTrainer, losses
)
from datasets import load_dataset

model = SentenceTransformer("all-mpnet-base-v2")
train_dataset = load_dataset("csv", data_files="finetune_dataset.csv")["train"]

# Standard contrastive loss
loss = losses.MultipleNegativesRankingLoss(model)

# Wrap with Matryoshka — optimize for multiple truncation points
loss = losses.MatryoshkaLoss(model, loss, [768, 512, 256, 128])

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
# Result: truncate to 128 dims (8.3% of original) and
# retain 98.37% of performance

Int8 quantization: Between full float32 and extreme binary (1-bit), there's int8 quantization — converting each dimension to an integer between -127 and 127 (or 0-255 unsigned). This gives 4x compression instead of 32x, but with better precision. A calibration dataset is used to determine min/max values per dimension for the conversion.

from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode([
    "I heard the horses are excited for Halloween.",
    "Dalmatians are the most patriotic of dogs.",
])

# Binary: 32x compression, uses bitwise operations
binary = quantize_embeddings(embeddings, precision="binary")

# Int8: 4x compression, better precision than binary
int8 = quantize_embeddings(embeddings, precision="int8")

Surprising finding: For some embedding models, binary quantization actually outperforms int8 despite having far less precision. This is largely due to the challenge of mapping float values to int8 buckets using the calibration dataset. When in doubt, benchmark both on your actual queries.

Choosing a Vector Database

Once you've decided on your embedding model and chunking strategy, you need somewhere to store and search those vectors. The vector database landscape has exploded — here's how to navigate it.

  • FAISS (library, Meta): best for research and prototyping; IVF + PQ for billion-scale search
  • Chroma (open-source): best for getting started fast; simplest API, runs locally in-memory
  • Pinecone (managed, closed-source): best for production and enterprise; fully managed, many enterprise features
  • Weaviate (open-source): best for cloud-native deployments; known for speed, hybrid search built-in
  • Qdrant (open-source): best for distributed systems; distributed deployment, product quantization
  • Milvus (open-source): best for very large scale; cloud-native, managed via Zilliz Cloud
  • PGVector (Postgres extension): best when you already use Postgres; no new infrastructure, SQL integration

Key features all production vector databases offer: approximate nearest neighbor search, metadata filtering (like a WHERE clause in SQL), hybrid search (keyword + semantic), real-time updates (add/delete vectors without rebuilding the index), and boolean search operations.

Getting started with Chroma — vector database in 10 lines

Chroma is the simplest vector database to get started with — it's open-source and runs in-memory locally:

import chromadb

# Create client and collection
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="mango_science")

# Add documents with metadata
chunks = [
    "353 varieties of mangoes are now extinct",
    "Mangoes are grown in the tropics"
]
metadata = [
    {"topic": "extinction", "chapter": "2"},
    {"topic": "regions", "chapter": "5"}
]
collection.add(
    documents=chunks,
    metadatas=metadata,
    ids=[str(i) for i in range(len(chunks))]
)

# Query with metadata filtering
results = collection.query(
    query_texts=["Where are mangoes grown?"],
    n_results=2,
    where={"chapter": {"$ne": "2"}},          # metadata filter
    where_document={"$contains": "grown"}      # text filter
)

Notice how Chroma handles embedding generation automatically — you don't need to manage embedding models separately. The where clause works like SQL filtering on metadata, and where_document filters on the actual text content.

Evaluating which vector database to use:

  • What retrieval mechanisms does it support? Hybrid search?
  • What embedding models and vector search algorithms are supported?
  • How scalable is it for both data storage and query traffic?
  • How long does indexing take? Can you bulk add/delete?
  • What's the query latency for different retrieval algorithms?
  • For managed solutions: is pricing based on document volume or query volume?

The Big Decision: RAG vs. Long Context vs. Fine-Tuning

RAG isn't always the right answer. Sometimes you should just expand the context window. Sometimes you should fine-tune. Here's a framework for deciding.

RAG vs. Long Context vs. Fine-Tuning: When to Use Each

RAG

Use when:

  • Data changes frequently
  • Knowledge base is large (>500 pages)
  • Need citations and source tracking
  • Different users need different data

Avoid when:

  • Knowledge base is tiny (<200K tokens)
  • Query patterns are very predictable

Long Context

Use when:

  • Knowledge base fits in context (<200K tokens)
  • Need to reason across the entire dataset
  • Simplicity is paramount
  • Analyzing long single documents

Avoid when:

  • Data exceeds context limits
  • Cost per query is a concern

Fine-Tuning

Use when:

  • Need specialized behavior/format/style
  • Domain-specific reasoning required
  • Latency is critical (no retrieval step)
  • Knowledge is stable and won't change

Avoid when:

  • Data changes often (need retraining)
  • Need source attribution

RAG excels at dynamic, large-scale knowledge. Long context wins for small, stable datasets. Fine-tuning is best for teaching specialized behavior. In practice, combine them: fine-tune a model for your domain's style, then use RAG for up-to-date knowledge retrieval.
RAG as memory management: the operating system analogy

Here's a powerful way to think about RAG that clarifies when to use it: the LLM's context window is like RAM, and RAG is like virtual memory.

In an operating system, memory is organized in a hierarchy:

  • CPU registers — tiny, blazingly fast (the LLM's internal weights)
  • RAM — limited but directly accessible (the context window)
  • Disk/SSD — massive but slow (your vector database)

When a program needs data that isn't in RAM, the OS swaps it in from disk and swaps out data that isn't currently needed. RAG does exactly the same thing for LLMs — it swaps relevant information into the context window on demand.

This analogy reveals three key insights:

  1. Just like RAM, the context window has a fixed size. You need to be strategic about what goes in. Stuffing everything in (the "long context" approach) is like buying a computer with 1TB of RAM — theoretically possible, but expensive and often wasteful.
  2. Page replacement policies matter. In OS terms, deciding which documents to retrieve is like choosing which pages to swap in. A good retriever is like a good page replacement algorithm — it predicts what the "processor" (LLM) will need next.
  3. Foregoing retrieval entirely is like storing all files in RAM instead of disk. Cost-wise, it makes no sense at scale — even as RAM gets cheaper, disk will always be cheaper still.

Libraries like MemGPT (now Letta) and mem0 implement this memory management pattern explicitly, enabling LLMs to maintain long-term conversational memory across sessions by swapping relevant history in and out of context.

RAG beyond knowledge retrieval: few-shot selection (LLM-R), model training (REALM), and RAG limitations

RAG isn't just for knowledge retrieval. Here are two powerful alternative applications — plus important limitations to be aware of:

RAG for Few-Shot Example Selection (LLM-R)

Instead of hand-picking few-shot examples, use RAG to dynamically select the best examples for each query. LLM-R (Wang et al.) trains a retrieval model using LLM feedback: for each training query, retrieve top-K candidate examples via BM25, then measure which examples — when prefixed to the input — maximize the probability of the correct output. Rank examples by log-probability, train a reward model on these rankings, then distill to a final retriever.

Key insight: The "best" few-shot example isn't the most similar one — it's the one that most helps the LLM generate the correct answer. These can be very different.

RAG During Training: REALM

REALM (Retrieval-Augmented Language Model) integrates retrieval directly into the pre-training process. The architecture has two jointly-trained components: a knowledge retriever and a knowledge-augmented encoder (BERT-like). During masked-language-model pre-training, the retriever learns to fetch text that helps predict masked tokens — the model learns what to retrieve and how to use retrieved context simultaneously.

Clever training strategies: (1) Named entities and dates are preferentially masked to force the model to learn knowledge retrieval. (2) An empty document is always included so the model can learn when retrieval isn't needed. (3) Trivial retrievals containing the masked token are excluded — the model must learn to retrieve context, not just find the answer verbatim.

Why it matters: REALM was one of the pioneering works showing that retrieval and generation can be trained end-to-end as a single differentiable system.

Important RAG Limitations

Surface-level dependence: RAG retrieves text snippets, not deep understanding. The LLM reasons from surface-level information rather than a comprehensive understanding of the problem space.

Retrieval is the bottleneck: If retrieval fails, even the most powerful LLM can't produce correct answers. The entire pipeline is only as good as its weakest link.

Contradictory information: Sometimes retrieved documents contradict the LLM's internal knowledge. Without access to ground truth, the LLM struggles to resolve these contradictions. Liu et al. introduced the RECALL (Robustness against External CounterfactuAL knowLedge) benchmark for testing this. Interesting finding: LLMs tend to rely on internal knowledge when information is logically inconsistent, but prefer prompt content when the inconsistency is more factual — and their confidence drops notably when dealing with contradictions.

Needle-in-a-haystack limitations: Many evaluation tests for long-context models only test simple information recall. Real-world data has related text that acts as distractors — a much harder problem than finding a random fact in unrelated text. This is precisely why the Rerank and Refine stages exist in the pipeline.

RAG with images (multimodal) and database tables (text-to-SQL)

RAG isn't limited to text documents. Two important extensions:

Multimodal RAG

If your generator is multimodal, you can augment its context with images, videos, and audio — not just text. For example, asking "What's the color of the house in the Pixar movie Up?" — the retriever can fetch an actual image of the house to help the model answer.

For metadata-based retrieval, images are found via their titles, tags, and captions. For content-based retrieval, you need a multimodal embedding model like CLIP (Radford et al., 2021) that produces embeddings for both images and text in the same vector space. The workflow: (1) generate CLIP embeddings for all images and texts, (2) for a query, generate its CLIP embedding, (3) find nearest neighbors across both modalities.
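The cross-modal nearest-neighbor step of that workflow can be sketched with tiny stand-in vectors (a real pipeline would get these from a CLIP encoder, which maps images and text into the same space):

```python
import numpy as np

def nearest(query: np.ndarray, index: np.ndarray) -> int:
    """Cosine nearest neighbor in a shared text/image embedding space."""
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    return int(np.argmax(m @ q))

# Stand-ins for CLIP image embeddings in the index.
index = np.array([
    [0.9, 0.1, 0.0],   # image: the house from Up
    [0.0, 0.8, 0.6],   # image: an unrelated diagram
])
# Stand-in for the CLIP embedding of the text query.
query = np.array([0.95, 0.05, 0.0])
best = nearest(query, index)   # -> 0: the text query lands nearest the house image
```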

Tabular RAG (Text-to-SQL)

When answers live in database tables rather than documents, the RAG workflow changes significantly. Imagine an e-commerce site with a Sales table. The query "How many Fruity Fedora hats were sold in the last 7 days?" requires:

  1. Text-to-SQL: The LLM translates the natural language query into SQL based on the provided table schemas
  2. SQL execution: The generated query is run against the database
  3. Generation: The LLM produces a natural language response from the SQL results

-- Generated SQL from: "How many Fruity Fedoras sold this week?"
SELECT SUM(units) AS total_units_sold
FROM Sales
WHERE product_name = 'Fruity Fedora'
AND timestamp >= DATE_SUB(CURDATE(), INTERVAL 7 DAY);

If there are many tables whose schemas don't all fit in the context window, you may need an intermediate step to predict which tables are relevant for each query. The text-to-SQL step can use the same LLM as the final generator or a specialized model.
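Step 2 of that workflow can be sketched with sqlite3, stubbing out the text-to-SQL step with a hard-coded query (sqlite's date() replaces MySQL's DATE_SUB, and the table contents and "today" date are made up for determinism):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales (product_name TEXT, units INTEGER, timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [("Fruity Fedora", 3, "2025-10-01"),
     ("Fruity Fedora", 5, "2025-10-03"),
     ("Plain Cap", 9, "2025-10-02")],
)

# Step 1 (LLM text-to-SQL) is stubbed with a hard-coded query.
sql = """
    SELECT SUM(units) AS total_units_sold
    FROM Sales
    WHERE product_name = 'Fruity Fedora'
      AND timestamp >= date('2025-10-05', '-7 days')
"""
(total,) = conn.execute(sql).fetchone()   # step 2: run against the database
# Step 3: the LLM would phrase `total` as a natural-language answer.
```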

Decision Card #4

You're building an internal support bot for a SaaS company. The knowledge base includes 5,000 help articles, troubleshooting guides, and product docs (about 2 million tokens total). Articles are updated weekly. Users ask questions in natural language, and the bot needs to cite which article the answer came from. The company also wants the bot to respond in a consistent, friendly tone that matches their brand voice.

Which approach should you use?

You now understand how to find information, build a complete pipeline, and optimize every layer. But there's a whole other dimension to AI systems we haven't touched yet. So far, everything has been passive — the LLM reads what you give it and generates text. What happens when you give it the ability to take action? That changes everything — and introduces entirely new categories of risk.

Part 4
When LLMs Start Acting

From Reader to Actor

Everything we've covered so far follows the same pattern: find information, feed it to the LLM, get text back. The LLM is a passive reader — it processes what you give it and generates a response. It never reaches out on its own, never takes action, never changes anything in the real world.

Now imagine you tell an LLM: "Book me a flight to Tokyo next Tuesday, under $800, window seat." To do this, it would need to search flight APIs, compare prices, select the best option, and execute a booking. That's not reading — that's acting. And it introduces entirely new categories of capability, complexity, and risk.

This is the world of agents — AI systems that don't just generate text but take autonomous actions to accomplish goals. The field of artificial intelligence has always been, at its core, "the study and design of rational agents." Foundation models have made this vision finally practical.

Three Ways LLMs Interact with the World

Not every AI application needs full autonomy. There are actually three distinct paradigms for how an LLM can interact with external systems, each with different levels of control and risk. Think of them as training wheels for AI:

1. Passive (RAG)

The LLM receives pre-fetched context in its prompt and generates a response. It doesn't know where the information came from or how it was retrieved. This is everything we covered in Parts 1–3 — the retrieval system works behind the scenes, the LLM just reads and responds.

Risk level: Low — the LLM can't do anything except generate text.

2. Explicit (Fixed Pipeline)

The LLM follows pre-programmed instructions to interact with tools in a fixed sequence. "When the user asks about sales data, generate a SQL query, execute it, then summarize results." The steps are determined by the developer, not the LLM — it has no choice in what to do, only how to do each step.

Risk level: Medium — the LLM can interact with tools but only in pre-approved ways.

3. Autonomous (Agent)

The LLM decides which tools to use, when to use them, and in what order. Given a task, it breaks it down into subtasks, selects appropriate tools, executes them, evaluates results, and adjusts its plan. This is a true agent — and it's both the most powerful and the most dangerous paradigm.

Risk level: High — the LLM has real agency and can take unexpected actions.

Most production AI applications today use the passive or explicit paradigm. Fully autonomous agents are powerful but fragile — they tend to get stuck in loops, take incorrect actions, and struggle to self-correct reliably. The practical sweet spot is often a partially autonomous agent with human oversight at critical decision points.

Three Paradigms of LLM Interaction
Passive (RAG)
User Query
Retriever Fetches Context
LLM Reads & Responds
LLM is a passive participant — doesn't know the source of its context
Explicit (Pipeline)
User Query
LLM Generates SQL
Execute Query
LLM Summarizes
Steps are pre-defined by developer — LLM follows instructions
Autonomous (Agent)
User Task
LLM Plans Steps
Picks & Uses Tools
Evaluates & Adjusts
LLM decides what, when, and how — full autonomy with full risk
As you move from left to right, the LLM gains more autonomy — and more potential for both value and catastrophic failure

What Is an Agent?

An agent is anything that can perceive its environment and act upon it. That's the formal definition, and it's deceptively simple. An agent is characterized by two things:

  • The environment it operates in — a chess board, the internet, a customer database, your email inbox
  • The set of actions it can perform — moving a chess piece, searching the web, running a SQL query, sending an email

These two are deeply connected. The environment determines what tools an agent could use, while the agent's tools restrict what environments it can operate in. A robot that can only swim is confined to water. An AI with only a calculator can't browse the web.

Here's a key insight that makes agents different from everything we've covered so far: compound errors. When an agent performs 10 steps to accomplish a task, and each step has 95% accuracy, the overall accuracy isn't 95% — it's 0.95^10 ≈ 60%. Over 100 steps? Just 0.6%. This is why agents require more powerful models than simple RAG — each step needs to be very reliable because mistakes compound.
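The compounding arithmetic is easy to verify:

```python
per_step = 0.95
ten_steps = per_step ** 10       # ~0.60: ten 95%-reliable steps finish ~60% of the time
hundred_steps = per_step ** 100  # ~0.006: a 100-step task almost never succeeds end to end
```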

Tools: Extending What LLMs Can Do

By itself, an LLM can do exactly one thing: generate text. That's it. No math, no web browsing, no database queries, no sending emails. Tools are what transform an LLM from a sophisticated text generator into a capable agent. There are three categories of tools:

Knowledge Augmentation

Read-only tools that help the agent perceive its environment. Text retrievers, web search, database queries, email readers, API calls for weather or stock prices. These are the tools RAG systems use.

READ

Capability Extension

Tools that fix an LLM's inherent weaknesses. Calculators (LLMs are bad at math), code interpreters, image generators, translators, timezone converters. Instead of training the model to be good at arithmetic, just give it a calculator.

COMPUTE

Write Actions

Tools that change the world. Sending emails, updating databases, initiating bank transfers, booking flights, posting to social media. This is where the real power — and real danger — lives. A SQL executor can retrieve data (read) but can also delete a table (write).

WRITE

The difference between read and write tools is the difference between an assistant that looks up your bank balance versus one that can transfer money. Just as you wouldn't give an intern the authority to delete your production database, you shouldn't give an unreliable AI the authority to initiate bank transfers. Trust must be earned through demonstrated reliability.

Tool use can dramatically boost performance. Research shows that a GPT-4-powered agent augmented with just 13 tools can outperform GPT-4 alone by 11–17% on science and math benchmarks. The right tools don't just add capabilities — they multiply them.

When you give tools to an LLM, the model doesn't literally execute functions — it generates text that describes which function to call and with what parameters. The actual execution is handled by an orchestration framework.

Here's how function calling works in practice:

  1. Create a tool inventory: Define each tool with its name, parameters, and description (usually in JSON format)
  2. Specify available tools: Pass the tool definitions to the model with each query. You can set mode to required (must use a tool), none (don't use tools), or auto (model decides)
  3. Model generates a tool call: Instead of generating regular text, the model outputs a structured response indicating which function to call and with what arguments
  4. Framework executes the tool: The orchestration layer runs the actual function and feeds the result back to the model
  5. Model generates final response: Using the tool's output, the model produces its answer

For example, given "How many kilograms are 40 pounds?", the model might generate:

{
  "tool_calls": [{
    "function": {
      "name": "lbs_to_kg",
      "arguments": "{\"lbs\": 40}"
    }
  }]
}
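A minimal sketch of the dispatch step (step 4 above): parse the structured output, validate the function name against the inventory, and execute. `lbs_to_kg` and the registry are illustrative, not part of any real framework's API:

```python
import json

def lbs_to_kg(lbs: float) -> float:       # illustrative tool implementation
    return lbs * 0.45359237

TOOLS = {"lbs_to_kg": lbs_to_kg}          # the tool inventory

def execute_tool_calls(model_output: str) -> list:
    """Parse the model's structured output, validate each function name
    against the inventory, and run the real implementations."""
    payload = json.loads(model_output)
    results = []
    for call in payload["tool_calls"]:
        name = call["function"]["name"]
        if name not in TOOLS:             # hallucinated tool names happen
            raise ValueError(f"unknown tool: {name}")
        kwargs = json.loads(call["function"]["arguments"])
        results.append(TOOLS[name](**kwargs))
    return results

model_output = (
    '{"tool_calls": [{"function": '
    '{"name": "lbs_to_kg", "arguments": "{\\"lbs\\": 40}"}}]}'
)
results = execute_tool_calls(model_output)   # ~[18.14]
```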

Models with native tool calling: Some models have tool calling baked in. Llama 3.1, for example, has native support for Brave web search, Wolfram Alpha, and a code interpreter. When it wants to use a tool, it generates a special <|python_tag|> token to enter tool-calling mode, followed by the function call, then an <|eom_id|> token to signal it's waiting for results.

You can also define custom tools in JSON and include them in the prompt. The model learns the tool descriptions and generates appropriate calls — though it can "hallucinate" invalid function names or incorrect parameters, which is why validation is critical.

Here's an example of a custom tool definition in JSON that you'd include in the system prompt:

{
  "type": "local_function",
  "function": {
    "name": "find_citations",
    "description": "Find citations for claims made",
    "parameters": {
      "type": "object",
      "properties": {
        "claim_sentence": {
          "type": "string",
          "description": "A sentence representing a claim"
        },
        "model": {
          "type": "string",
          "enum": ["weak", "strong"],
          "description": "Citation model to use. Weak for claims with entities/numbers."
        }
      },
      "required": ["claim_sentence", "model"]
    }
  }
}

Practical LangChain tool examples: Libraries like LangChain provide connectors for common tools:

# Web search with DuckDuckGo
from langchain_community.tools import DuckDuckGoSearchRun
search = DuckDuckGoSearchRun()
output = search.run("What's the weather in Toronto?")

# Wikipedia API
from langchain.tools import WikipediaQueryRun
from langchain_community.utilities import WikipediaAPIWrapper
wiki = WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper())
output = wiki.run("Winter Olympics")

# Code interpreter
from langchain_experimental.utilities import PythonREPL
python = PythonREPL()
python.run("print(456 * 345)")

# Database connector
from langchain_community.utilities import SQLDatabase
db = SQLDatabase.from_uri(DATABASE_URI)
output = db.run(
    "SELECT * FROM COMPANIES WHERE Name LIKE :comp;",
    parameters={"comp": "Apple%"}, fetch="all"
)

How many tools should you give an agent? There's no single answer, but research has explored a wide range:

  • Toolformer (5 tools): Fine-tuned GPT-J to learn 5 tools (calculator, search, calendar, translator, Q&A). The model learned when and how to call each tool by itself.
  • Chameleon (13 tools): GPT-4 with 13 tools (knowledge retrieval, query generator, image captioner, text detector, Bing search, etc.). +11.37% on ScienceQA, +17% on TabMWP benchmarks.
  • Gorilla (1,645 APIs): Prompted agents to select the right API call among 1,645 APIs. Demonstrates that agents can navigate massive tool inventories — with the right descriptions.

More tools = more capabilities, but also harder to master. Chameleon's experiments revealed two key findings:

  1. Different tasks need different tools. ScienceQA relied heavily on knowledge retrieval, while TabMWP (tabular math) barely used it.
  2. Different models have different tool preferences. GPT-4 selected a wider variety of tools, while ChatGPT favored image captioning.

Tool transition trees: Chameleon also studied tool transitions — after using tool X, how likely is the agent to call tool Y? If two tools are frequently used together, they can be combined into a bigger tool. An agent aware of these patterns can even compose new tools from existing ones.

Voyager (Wang et al., 2023) took this further with a skill manager that keeps track of new skills (tools) the agent creates. Each skill is a coding program. When a new skill successfully helps accomplish a task, it's added to the skill library for later reuse. Built for Minecraft, Voyager's agent continuously expanded its capabilities by creating and storing progressively more complex skills — a glimpse of self-improving agents.

Practical advice for tool selection:

  • Do ablation studies — remove each tool and measure performance drop. If removing a tool doesn't hurt, remove it.
  • Plot the distribution of tool calls to see which tools are most/least used.
  • If a tool is too hard for the model to use correctly (even after better prompts or fine-tuning), swap it for a simpler alternative.
  • Evaluate how easy it is to extend your tool inventory over time — your needs will change.

Planning: How Agents Think Before They Act

Giving an LLM tools is like giving someone a toolbox. Having a hammer doesn't mean you know how to build a house. The critical missing piece is planning — the ability to look at a task, break it down into steps, and execute them in the right order.

Consider this question: "Who was the CFO of Apple when its stock price was at its lowest point in the last 10 years?"

A human would naturally decompose this into steps: first find today's date, then query historical stock prices to find the minimum, then look up who was CFO on that date. An agent needs to do the same thing — except it needs to figure out this plan on its own, then execute each step using the right tool.

Effective planning has three principles:

  1. Separate planning from execution. Don't let the model just "think step by step" and execute simultaneously. Have it generate a plan first, validate that the plan makes sense, then execute. Otherwise you might waste hours on a 1,000-step plan that goes nowhere.
  2. Validate before running. A plan can be checked with simple heuristics (does every tool in the plan exist in the agent's inventory? Are there too many steps?) or with AI judges ("Does this plan seem reasonable?").
  3. Start with humans in the loop. For risky operations — updating a database, merging code changes, initiating payments — require explicit human approval before execution.

The ReAct Pattern: Think, Act, Observe

The most influential framework for agent planning is ReAct — Reasoning + Acting. Instead of generating all steps at once, the agent alternates between thinking and doing in a loop:

  1. Thought: "I need to find the current price of a MacBook Pro. I should search the web for this."
  2. Action: web_search("current MacBook Pro price USD")
  3. Observation: "The MacBook Pro starts at $2,249 from apple.com."
  4. Thought: "Good. Now I need to convert $2,249 to EUR at an exchange rate of 0.85."
  5. Action: calculator(2249 * 0.85)
  6. Observation: 1911.65

Final Answer: "The MacBook Pro costs $2,249. At 0.85 EUR/USD, that's approximately €1,911.65."

The power of ReAct is that each observation informs the next thought. The agent adapts its plan based on what it actually discovers, rather than committing to a rigid sequence upfront. But it comes with trade-offs: each thought and observation consumes tokens, increasing cost and latency — especially for tasks with many intermediate steps.
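The loop itself is simple to sketch. Here the LLM is replaced by a scripted stub, and `web_search`/`calculator` are illustrative stand-ins for real tools:

```python
import re

def web_search(q: str) -> str:
    return "The MacBook Pro starts at $2,249."    # canned result for the sketch

def calculator(expr: str) -> str:
    return str(eval(expr, {"__builtins__": {}}))  # toy; sandbox properly in real use

TOOLS = {"web_search": web_search, "calculator": calculator}

# A scripted "LLM": each entry is what the model would generate next turn.
SCRIPT = [
    'Thought: I need the price first.\nAction: web_search("MacBook Pro price")',
    'Thought: Now convert to EUR.\nAction: calculator("2249 * 0.85")',
    "Final Answer: approximately EUR 1911.65",
]

def react_loop() -> tuple:
    trace = []
    for turn in SCRIPT:                 # a real agent calls the LLM here instead
        if turn.startswith("Final Answer"):
            return turn, trace
        name, arg = re.search(r'Action: (\w+)\("(.+)"\)', turn).groups()
        observation = TOOLS[name](arg)  # act, then feed the observation back
        trace.append(observation)
    return "", trace

answer, trace = react_loop()
```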

Control Flows: Beyond Sequential Steps

Simple plans execute one step after another. But real-world tasks often require more complex patterns:

Sequential

Step B waits for Step A. The SQL query can only execute after it's been generated.

Parallel

Steps A and B run simultaneously. Fetch prices for 100 products at once instead of one by one.

If Statement

Choose B or C based on the result of A. Check the earnings report, then decide to buy or sell.

For Loop

Repeat A until a condition is met. Keep generating random numbers until you find a prime.

When evaluating agent frameworks, check which control flows they support. If your task needs to browse 10 websites, can the agent do it simultaneously? Parallel execution can dramatically reduce the latency your users experience.
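The parallel pattern is worth a concrete sketch. Here `fetch_price` is a stand-in for a real network call (the name and the simulated latency are assumptions); `asyncio.gather` fans the calls out concurrently, so total latency is roughly one call's latency rather than the sum of all of them:

```python
import asyncio

# Illustrative parallel fan-out: fetch "prices" for many products at
# once. fetch_price is a stub standing in for a real network call.
async def fetch_price(product_id: int) -> float:
    await asyncio.sleep(0.01)   # simulated I/O latency
    return 100.0 + product_id

async def fetch_all(ids):
    # gather() runs every coroutine concurrently
    return await asyncio.gather(*(fetch_price(i) for i in ids))

prices = asyncio.run(fetch_all(range(5)))
assert prices == [100.0, 101.0, 102.0, 103.0, 104.0]
```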

A plan is a roadmap — but how detailed should that roadmap be? There's a fundamental trade-off:

Exact function names

Plan: [get_time(), fetch_top_products(), fetch_product_info(), generate_response()]. Very granular — easier to execute, but harder to generate. If a function gets renamed (e.g., get_time() to get_current_time()), you need to update your prompt, examples, and any fine-tuned models.

Natural language

Plan: "1. get current date, 2. retrieve best-selling product, 3. get product info, 4. generate response." Higher-level — easier to generate and more robust to API changes. But needs a translator to convert each step into executable code.

The practical solution is hierarchical planning: generate a high-level plan first (quarter-by-quarter), then expand each step into detailed sub-plans (month-by-month). Natural language plans are more robust because the model was trained primarily on natural language, making it less likely to hallucinate. The translator that converts natural language → function calls is a much simpler task that can use a weaker (cheaper) model.
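That translator can be very simple. This sketch uses keyword matching; the step-to-function mapping and function names are illustrative, and in practice a small, cheap model performs the translation:

```python
# Toy translator from natural-language plan steps to executable calls.
# The keyword -> function mapping is an assumption for illustration;
# in practice a weaker (cheaper) model does this translation.
FUNCTIONS = {
    "date": "get_time",
    "best-selling": "fetch_top_products",
    "product info": "fetch_product_info",
    "response": "generate_response",
}

def translate(step: str) -> str:
    for keyword, fn in FUNCTIONS.items():
        if keyword in step.lower():
            return fn
    raise ValueError(f"no function matches step: {step!r}")

plan = [
    "1. get current date",
    "2. retrieve best-selling product",
    "3. get product info",
    "4. generate response",
]
assert [translate(s) for s in plan] == [
    "get_time", "fetch_top_products",
    "fetch_product_info", "generate_response",
]
```

Note the robustness win: if `get_time` is renamed, only this mapping changes; the natural-language plans (and any prompts or fine-tuned models that produce them) stay untouched.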

Can LLMs actually plan?

This is a genuinely open debate. Meta's Chief AI Scientist Yann LeCun states unequivocally that autoregressive LLMs cannot plan. Kambhampati (2023) argues that LLMs extract planning knowledge but don't produce executable plans — "the plans that come out of LLMs may look reasonable to the lay user, and yet lead to execution time interactions and errors."

The counterargument: planning is fundamentally a search problem — search among paths, predict outcomes, pick the best path. Some argue LLMs can't backtrack (only generate forward). But in practice, an LLM can evaluate a path, determine it's wrong, and generate an alternative — effectively backtracking. It's also possible that LLMs just need better tooling for planning: action outcome prediction, state tracking, and search tools.

Foundation Model (FM) vs Reinforcement Learning (RL) planners: The concept of an "agent" comes from RL — both are characterized by environments and actions. The difference is in the planner: RL planners are trained through trial-and-error (expensive), while FM planners use the model itself (prompted or fine-tuned, generally cheaper). In the long run, these approaches will likely merge — FM agents incorporating RL algorithms for continuous improvement.

Even good plans need evaluation and adjustment. Reflection is the agent's ability to step back, assess its progress, and course-correct.

Reflection can happen at multiple points:

  • After receiving a query — is this request even feasible?
  • After plan generation — does this plan make sense?
  • After each execution step — am I on the right track?
  • After the full plan is executed — did I actually accomplish the goal?

Reflexion (Shinn et al., 2023) formalized this into a framework with three modules:

  1. Actor: Generates and executes plans
  2. Evaluator: Scores the outcome (e.g., "generated code passes 2 out of 3 test cases")
  3. Self-reflection: Analyzes what went wrong ("failed because I didn't handle arrays where all numbers are negative")

The agent then generates a new plan incorporating the reflection, effectively learning from its mistakes within a single session.
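The three modules can be sketched on the text's own toy example: code that fails the all-negative test case. The actor's two attempts are hand-written here rather than LLM-generated, and the test suite is an assumption:

```python
# Sketch of Reflexion's Actor / Evaluator / Self-reflection loop on a
# max-subarray task. Both "attempts" are hand-written for illustration.
def evaluator(solution, tests):
    """Score the outcome: how many test cases pass?"""
    return sum(solution(xs) == want for xs, want in tests)

def buggy_max_subarray(xs):       # Actor, attempt 1
    best = cur = 0                # bug: assumes a non-negative answer
    for x in xs:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best

def fixed_max_subarray(xs):       # Actor, attempt 2 (after reflection)
    best = cur = xs[0]            # fix: seed with the first element
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

tests = [([1, -2, 3, 4], 7), ([2, -1, 2], 3), ([-3, -1, -2], -1)]

assert evaluator(buggy_max_subarray, tests) == 2   # passes 2 of 3
# Self-reflection (would be fed into the next plan's prompt):
reflection = ("failed because I didn't handle arrays "
              "where all numbers are negative")
assert evaluator(fixed_max_subarray, tests) == 3   # new plan passes 3/3
```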

The downsides:

  • Cost and latency: Thoughts, observations, and reflections consume a lot of tokens. For tasks with many intermediate steps, this adds up fast.
  • Overconfidence in self-assessment: An interesting failure mode is when the agent is convinced it's accomplished a task when it hasn't. "You asked me to assign 50 people to 30 hotel rooms. Done!" — but only 40 people were actually assigned.
  • Over-reflection: Invoking reflection too often can cause the model to second-guess correct solutions, making things worse.

The practical advice: reflection brings surprisingly good performance improvement for relatively little implementation effort, but calibrate how often it's invoked. A good rule of thumb is to reflect when the agent performs the same action more than three times in a row (which usually means it's stuck in a loop).
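That rule of thumb is only a few lines of code. The log format (a flat list of action names) is an assumption for the sketch:

```python
# Trigger reflection when the agent repeats the same action more than
# `limit` times in a row (the stuck-in-a-loop heuristic above).
def should_reflect(action_log: list[str], limit: int = 3) -> bool:
    if len(action_log) <= limit:
        return False
    tail = action_log[-(limit + 1):]
    return len(set(tail)) == 1    # same action limit+1 times in a row

assert not should_reflect(["search", "search", "search"])
assert should_reflect(["fetch", "search", "search", "search", "search"])
```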

The more complex a task, the more ways an agent can fail. Beyond the hallucination problems common to all LLM applications, agents have unique failure modes tied to planning and tool execution:

1. Invalid tool

The agent generates a plan containing bing_search, but bing_search isn't in its tool inventory. This is the most obvious failure — the agent hallucinated a tool that doesn't exist.

2. Valid tool, invalid parameters

The agent calls lbs_to_kg with two parameters, but the function only accepts one (lbs). The tool exists, but the agent misunderstands its signature.

3. Valid tool, incorrect parameter values

The agent calls lbs_to_kg(lbs=100) when the correct value should be 120. Right tool, right parameters, wrong values — the hardest failure to catch automatically.

4. Goal failure

The plan doesn't achieve the task, or achieves it while violating constraints. "Plan a two-week trip from SF to Hanoi under $5,000" → agent plans a trip to Ho Chi Minh City, or plans the right trip but way over budget.

5. False completion (reflection failure)

The agent is convinced it's accomplished the task when it hasn't. "Assign 50 people to 30 hotel rooms" → only 40 people were assigned, but the agent insists it's done. This is an error in the reflection module, not the planning module.

How to evaluate agents systematically: Create a benchmark dataset of (task, tool inventory) pairs. For each task, have the agent generate K plans. Then measure:

  1. What percentage of generated plans are valid?
  2. How many plans does the agent need to generate, on average, to get a valid one?
  3. Out of all tool calls, how many are valid? (Break down by: invalid tool, invalid params, incorrect values)
  4. What types of tasks does the agent fail on most? Are certain tools consistently misused?

Efficiency matters too. An agent might generate valid plans but be wasteful. Track: how many steps per task on average? What's the cost per task? Are there actions that are disproportionately slow or expensive? Compare these to a baseline (another agent or a human).
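The first two measurements are straightforward to compute from logs. The log format below (each entry holds a task and its K generated plans as lists of tool names) is an assumption for illustration:

```python
# Sketch of plan-validity metrics computed from logged plans.
# The log structure and tool names are illustrative assumptions.
logs = [
    {"task": "refund", "plans": [["check_order", "issue_refund"],
                                 ["bing_search"]]},  # 2nd: invalid tool
    {"task": "lookup", "plans": [["search_docs"]]},
]
INVENTORY = {"check_order", "issue_refund", "search_docs"}

def plan_is_valid(plan):
    return all(tool in INVENTORY for tool in plan)

all_plans = [p for entry in logs for p in entry["plans"]]
valid = [p for p in all_plans if plan_is_valid(p)]

pct_valid = len(valid) / len(all_plans)            # metric 1
avg_plans_per_valid = len(all_plans) / len(valid)  # metric 2 (rough)

assert pct_valid == 2 / 3
assert avg_plans_per_valid == 1.5
```

The same loop extends naturally to the per-call breakdown (invalid tool vs invalid params vs wrong values) once each tool call is logged with its arguments.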

Available benchmarks:

  • Berkeley Function Calling Leaderboard — ranks models on function calling accuracy across different scenarios
  • AgentOps — not a benchmark itself, but observability tooling for tracking and evaluating agent runs in production
  • TravelPlanner — complex multi-constraint travel planning benchmark (budget, dates, preferences)

Key insight for debugging: Always log every tool call and its output so you can inspect the full execution trace. If you use natural language plans with a translation module, create benchmarks for the translator separately. And work with domain experts to understand what tools should be used — if your agent consistently fails in a domain, it might be missing a critical tool.

Decision Time #5
You're building a coding assistant that helps developers write and debug code. The agent has access to a code interpreter, documentation search, and a file editor. A user asks: "Refactor the authentication module to use JWT tokens instead of sessions." What planning approach should the agent use?

Memory: Giving Agents the Ability to Remember

Think about what happens when you ask an LLM "What is my name?" in a new chat. Blank stare. It has no memory of previous conversations. Now imagine an agent that needs to execute a 15-step plan — it needs to remember what it's already done, what results it got, and what's left to do. Without memory, agents are useless.

An AI model has three types of memory, and they mirror how human memory works:

Agent Memory Architecture

  • Internal knowledge: baked into model weights during training. Available for all queries with no retrieval cost, but can't be updated without retraining. Human equivalent: how to breathe, how language works.
  • Short-term memory: the context window (the current conversation). Fast to access and instantly relevant, but limited by context length. Human equivalent: the name of someone you just met.
  • Long-term memory: external stores (vector DBs, files, APIs). Persists across sessions and is effectively unlimited, but requires retrieval, which is slower. Human equivalent: books, notes, computers.

Memory management: when short-term memory fills up, overflow moves to long-term storage. The challenge is deciding what to keep nearby and what to archive. Which memory to use depends on access frequency: essential knowledge → internal, current task → short-term, everything else → long-term.

The hardest problem in agent memory isn't storage — it's management. When the context window fills up (and it will), what do you keep and what do you remove? This is the same problem operating systems have been solving for decades.

When an LLM has a conversation, it needs a strategy for managing the growing history. Here are the three most common approaches:

1. Conversation Buffer (Keep Everything)

The simplest approach: paste the entire conversation history into every prompt. Works perfectly for short conversations. But as the conversation grows, you'll hit the token limit, generation slows down (more tokens to process), and the model struggles to find relevant information buried in a massive history.

2. Windowed Buffer (Keep the Last K Messages)

Only keep the last K conversation turns. This prevents the context from growing without bound. The trade-off is brutal: if the user mentioned their name in the first message and you only keep the last 3 turns, the agent forgets their name. Early messages are often the most important — they set the purpose of the entire conversation.

3. Conversation Summary (Compress the History)

Use another LLM to summarize the conversation history and keep only the summary. This lets you retain the gist of a long conversation in far fewer tokens. The trade-off: summarization loses specific details (exact numbers, names mentioned in passing), requires an extra LLM call per turn (adding cost and latency), and the quality depends on the summarizer's ability.

| Strategy | Pros | Cons |
| --- | --- | --- |
| Buffer | No information loss, simplest to implement | Hits token limits fast, slower as it grows |
| Window | Fixed memory footprint, fast | Forgets early context that may be critical |
| Summary | Captures full history compactly | Loses specifics, extra LLM call per turn |

Advanced strategies go beyond these basics. Some systems detect redundancy and remove it. Others use reflection — after each action, the agent decides whether new information should be inserted, merged with existing memory, or should replace outdated information. The most sophisticated systems combine approaches: keep the last few turns as a buffer, summarize everything before that, and store specific facts (names, dates, preferences) in structured long-term memory.

Here's how each memory strategy looks in LangChain. First, set up a prompt template with a chat_history variable:

from langchain import PromptTemplate, LLMChain

template = """Current conversation:
{chat_history}

{input_prompt}"""

prompt = PromptTemplate(
    template=template,
    input_variables=["input_prompt", "chat_history"]
)

1. Conversation Buffer (keep everything)

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history")
llm_chain = LLMChain(prompt=prompt, llm=llm, memory=memory)

llm_chain.invoke({"input_prompt": "Hi! My name is Maarten."})
# chat_history: "" (empty — first message)

llm_chain.invoke({"input_prompt": "What is my name?"})
# chat_history: "Human: Hi! My name is Maarten.\nAI: Hello!"
# Result: "Your name is Maarten." ✓

2. Windowed Buffer (keep last K turns)

from langchain.memory import ConversationBufferWindowMemory

# Only retain the last 2 conversations
memory = ConversationBufferWindowMemory(
    k=2, memory_key="chat_history"
)
llm_chain = LLMChain(prompt=prompt, llm=llm, memory=memory)

llm_chain.invoke({"input_prompt": "Hi! My name is Maarten and I am 33 years old."})
llm_chain.invoke({"input_prompt": "What is 3 + 3?"})
llm_chain.invoke({"input_prompt": "What is my name?"})
# ✓ Still remembers (the introduction is within the last 2 turns)

llm_chain.invoke({"input_prompt": "What is my age?"})
# ✗ Forgotten! The introduction has now slid outside the window

3. Conversation Summary (compress history)

from langchain.memory import ConversationSummaryMemory

# Uses another LLM call to summarize at each step
memory = ConversationSummaryMemory(
    llm=llm,
    memory_key="chat_history",
    prompt=summary_prompt  # template for summarization
)
llm_chain = LLMChain(prompt=prompt, llm=llm, memory=memory)

llm_chain.invoke({"input_prompt": "Hi! My name is Maarten."})
llm_chain.invoke({"input_prompt": "What is my name?"})
# chat_history: "Summary: The human introduced themselves as
#   Maarten and asked the AI to recall their name."

# Check current summary at any time:
memory.load_memory_variables({})

Choosing the right strategy: Use Buffer for short conversations (demos, single-task agents). Use Window for chat applications with fixed context budgets. Use Summary for long-running agents where you need the full history compressed. In production, many systems combine approaches — window for recent context + summary for older history + structured storage for specific facts (names, preferences, IDs).
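That combined strategy can be sketched in a small class. The shape of `HybridMemory` is entirely illustrative, and `summarize` is a string-concatenation stub standing in for an LLM call:

```python
from collections import deque

def summarize(summary: str, turn: str) -> str:
    # Stub: in production this is an LLM call that compresses the turn.
    return (summary + " | " + turn).strip(" |")

# Illustrative hybrid memory: a small window for recent turns, a running
# summary for evicted ones, and a dict of structured facts.
class HybridMemory:
    def __init__(self, k: int = 2):
        self.window = deque(maxlen=k)   # last K turns, verbatim
        self.summary = ""               # compressed older history
        self.facts = {}                 # names, dates, preferences

    def add_turn(self, turn: str):
        if len(self.window) == self.window.maxlen:
            evicted = self.window[0]    # about to fall out of the window
            self.summary = summarize(self.summary, evicted)
        self.window.append(turn)

    def context(self) -> str:
        parts = [f"Facts: {self.facts}"] if self.facts else []
        if self.summary:
            parts.append(f"Summary: {self.summary}")
        parts.extend(self.window)
        return "\n".join(parts)

mem = HybridMemory(k=2)
mem.facts["name"] = "Maarten"
for t in ["Hi! My name is Maarten.", "I am 33 years old.", "What is 3+3?"]:
    mem.add_turn(t)

assert "Maarten" in mem.context()                # fact survives eviction
assert mem.summary == "Hi! My name is Maarten."  # oldest turn compressed
```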

Beyond the three memory types we just covered, a production agent typically interacts with three categories of data stores:

1. Prompt Repository

A collection of detailed prompts instructing the model how to perform specific tasks. Instead of cramming every instruction into the system prompt (which wastes tokens and overwhelms the model), the agent retrieves the relevant prompt on demand.

For example, many LLMs struggle with number comparison ("Is 9.11 greater than 9.9?"). If you know this limitation exists, you store a detailed prompt in the repository that says: "To compare two numbers, ensure they have the same number of decimal places, then subtract one from the other." When the agent encounters a comparison task, it retrieves this prompt first.

Prompt repositories can also include few-shot examples that are retrieved on demand — instead of including 20 examples in every prompt, retrieve only the 2-3 most relevant ones.

2. Session Memory (Workflow Logs)

Logs of every step the agent takes during a session: tool selections, tool outputs, verifier feedback, user feedback, and intermediate LLM outputs. These logs serve two purposes:

  • Context for the current session: The agent can review what it's already tried, what worked, and what failed.
  • Guidance for future sessions: Past successful workflows can guide the agent toward correct trajectories on similar tasks.

Advanced implementations define multiple logging levels — so during retrieval, you can fetch either the full trace or just the important steps.

3. Tools Data

Detailed documentation needed to invoke tools correctly: database schemas, API documentation, sample API calls, parameter descriptions. When the agent decides to invoke a tool, it retrieves the relevant documentation from this store.

For example, to generate a SQL query, the agent retrieves the database schema (tables, columns, data types, keys). If there are too many tables, schemas are retrieved on demand rather than all at once.

Why not put everything in the prompt? Three reasons: (1) too many instructions may not fit in the context window, (2) tokens are expensive and irrelevant prompts waste them, and (3) LLMs can only follow a limited set of concurrent instructions — retrieving on demand is more effective than front-loading everything.
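A toy prompt repository illustrates the retrieve-on-demand pattern. The trigger keywords and stored prompts are assumptions; a real system would use embedding search rather than keyword matching:

```python
# Toy prompt repository: fetch a task-specific instruction on demand
# instead of cramming everything into the system prompt. Triggers and
# stored prompts are illustrative.
PROMPT_REPO = {
    "compare numbers": (
        "To compare two numbers, ensure they have the same number of "
        "decimal places, then subtract one from the other."
    ),
    "sql": "Retrieve the schema for every table you reference.",
}

def retrieve_prompt(task: str):
    # In production this is an embedding search; keyword matching
    # keeps the sketch self-contained.
    for trigger, prompt in PROMPT_REPO.items():
        if trigger in task.lower():
            return prompt
    return None

assert retrieve_prompt("Compare numbers 9.11 and 9.9").startswith("To compare")
assert retrieve_prompt("Summarize this email") is None
```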

Safety: The Guardrails That Keep Agents in Check

Here's the terrifying reality of autonomous agents: the more powerful they are, the more dangerous their failures become. A RAG system that hallucinates gives you a wrong answer — annoying but fixable. An agent that hallucinates might execute a wrong action — deleting data, sending emails, processing refunds — and that damage can be irreversible.

Safety for agents works in three layers, like the defense systems of a car:

Input Guardrails

Filter what goes into the system. Detect PII (social security numbers, credit cards) and strip it. Detect prompt injection attacks and block them. Filter NSFW content and out-of-scope requests. This is the seatbelt — always on, preventing basic harm.

Output Guardrails

Verify what comes out of the system. Check factual accuracy against source documents, verify the output matches the requested format, scan for toxic or biased language, validate that code won't introduce security vulnerabilities. This is the airbag — catches failures before they reach the user.

Action Guardrails

Control what the agent does in the real world. Classify actions as safe (read-only) or risky (write). Require human approval before executing write actions. Set spending limits, rate limits, and scope restrictions. This is the emergency brake — prevents catastrophic real-world consequences.

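A minimal sketch of an action guardrail: classify tools as read or write, cap spend, and require human approval for anything risky. The tool names, the dollar cap, and the response labels are all assumptions for the example:

```python
# Illustrative action guardrail. Tool names, the spend cap, and the
# response labels are assumptions, not a real framework's API.
READ_TOOLS = {"search_orders", "lookup_product"}
WRITE_TOOLS = {"process_return", "issue_refund"}
SPEND_LIMIT = 500.00   # per-action dollar cap

def gate(tool: str, amount: float = 0.0) -> str:
    if tool in READ_TOOLS:
        return "allow"              # read-only: safe to auto-run
    if tool in WRITE_TOOLS:
        if amount > SPEND_LIMIT:
            return "escalate"       # over the cap: a human must decide
        return "require_approval"   # write action: human confirms first
    return "block"                  # unknown tool: never execute

assert gate("search_orders") == "allow"
assert gate("issue_refund", amount=120.00) == "require_approval"
assert gate("issue_refund", amount=12000.00) == "escalate"
assert gate("delete_all_users") == "block"
```

A gate like this is exactly what was missing from the $12,000 refund story in the introduction: the refund tool existed, but nothing classified it as a write action requiring confirmation.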

When a guardrail catches a problem, there are several possible responses. The choice depends on the severity and the use case — a chatbot might just regenerate, while a financial system should escalate to a human.

Libraries like Guardrails and NVIDIA's NeMo-Guardrails make it straightforward to add safety checks. Here's the Guardrails library in action:

from guardrails import Guard
from guardrails.hub import DetectPII

# Set up PII detection — reask the model if PII is found
guard = Guard().use(
    DetectPII,
    ["EMAIL_ADDRESS", "PHONE_NUMBER"],
    "reask"  # response strategy when PII detected
)

# This will trigger the guardrail:
guard.validate(
    "The Nobel prize was won by Geoff Hinton, "
    "who can be reached at +1 234 567 8900"
)
# → Detects phone number, asks LLM to regenerate without PII

Key validators available:

  • DetectPII — Detects personally identifiable information (emails, phone numbers, SSNs) using Microsoft Presidio under the hood
  • Prompt Injection — Detects adversarial prompts using the Rebuff library
  • NSFW Text — Flags profanity, violence, and sexual content
  • Politeness Check — Ensures outputs are sufficiently polite (related: Toxic Language validator)
  • Web Sanitization — Checks for code that could execute in a browser (uses the bleach library)

What happens when validation fails? Seven response strategies:

Re-ask

Ask the LLM to regenerate, with instructions to specifically fix the failed criterion.

Fix

Automatically repair the output (deletion/replacement) without re-calling the LLM.

Filter

For structured outputs, remove only the field that failed validation. Return the rest.

Refrain

Simply refuse to return the output. User gets a refusal message.

Noop

Take no action, but log the failure for later inspection. Useful for monitoring.

Exception

Raise a software exception. Custom handlers can trigger alerts or fallback workflows.

Fix-reask (hybrid)

Try to fix automatically first. If the fix still fails validation, then re-ask the LLM. Best of both worlds — fast when the fix works, thorough when it doesn't.
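The fix-reask hybrid can be sketched with a PII check as the validator. The phone-number regex, the `<PHONE>` placeholder, and the stubbed `reask` are all assumptions for illustration:

```python
import re

# Sketch of the fix-reask hybrid: try a cheap automatic fix first,
# and only re-ask the LLM if the fixed output still fails validation.
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def validate(text: str) -> bool:
    return PHONE.search(text) is None   # fail if a phone number leaks

def fix(text: str) -> str:
    return PHONE.sub("<PHONE>", text)   # cheap automatic repair

def reask(text: str) -> str:
    # Stub standing in for an LLM regeneration call.
    return "Geoff Hinton won the Nobel prize."

def fix_reask(text: str) -> str:
    if validate(text):
        return text
    repaired = fix(text)
    return repaired if validate(repaired) else reask(repaired)

out = fix_reask("Reach Geoff Hinton at +1 234 567 8900")
assert out == "Reach Geoff Hinton at <PHONE>"   # fix sufficed, no re-ask
```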

Models like Llama Guard can also be used as dedicated safety classifiers that run alongside your main model, checking both inputs and outputs against a predefined safety taxonomy.

In production, "check the output" means verifying against a specific list of criteria. Here's what a well-designed verification system checks for, using a financial document summarization agent as an example:

Factuality

Is each claim in the output supported by the source? Frame this as Natural Language Inference (NLI): "Given the source text (premise), can we logically conclude this statement (hypothesis)?" For numbers and named entities, simple string matching can catch many errors.

Specificity

Does the output include concrete details? Check if relevant numbers and named entities from the source appear in the output. A summary that says "revenue increased" when the source says "revenue increased 23% to $4.2B" has failed specificity.

Relevance (Precision)

What percentage of sentences in the output are actually relevant? A classifier model can flag irrelevant padding sentences that the agent added to make the response look comprehensive.

Completeness (Recall)

What percentage of important information from the source made it into the output? A ranking model can identify the most important sentences in the source and check if they're represented.

Coherence

Does the output read as a clear, unified piece? A prerequisite detection model can check whether each sentence has the context it needs from earlier sentences.

Structure & Format

Does the output follow the specified format? Check for required sections, proper bullet formatting, valid JSON, chronological ordering — these are the easiest criteria to verify with simple rules.

Important principle: Don't expect your verification process to be strictly better than your generation model. If it were, you could just use the verifier to generate the output directly! The goal is to catch a useful subset of errors at low cost.

Adding verifiers increases latency, so balance them with your accuracy and speed requirements. A good strategy is to run cheap verifiers (format, structure, string matching) on every output, and expensive ones (NLI, coherence models) only on high-stakes outputs.

Natural Language Inference (NLI) is the backbone of factuality verification. The setup: you have a premise (the source text) and a hypothesis (a sentence from the agent's output). The question: can the hypothesis be logically concluded from the premise? This is a well-studied NLP task with many pre-trained models available.

For numbers and named entities, factuality can be approximated with simpler string matching — check if all numbers and named entities in the output actually appear in the source. Libraries like SpaCy can extract and tag these automatically.
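The number-matching half of that check fits in a regex (a real system would also extract named entities, e.g. with SpaCy; numbers alone keep this sketch self-contained):

```python
import re

# Approximate factuality check: every number in the output must also
# appear in the source. Named-entity matching is omitted for brevity.
NUM = re.compile(r"\d+(?:\.\d+)?%?")

def unsupported_numbers(source: str, output: str) -> list[str]:
    source_nums = set(NUM.findall(source))
    return [n for n in NUM.findall(output) if n not in source_nums]

source = "Revenue increased 23% to $4.2B in Q3."
good = "Q3 revenue rose 23% to $4.2B."
bad = "Q3 revenue rose 32% to $4.2B."   # transposed digits

assert unsupported_numbers(source, good) == []
assert unsupported_numbers(source, bad) == ["32%"]
```

Checks like this are crude, but they catch exactly the class of errors (transposed digits, wrong units, invented figures) that NLI models sometimes miss and that matter most in financial documents.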

For cases where NLI models aren't available or practical, there are single-number automatic metrics for evaluating text quality:

| Metric | How It Works | Limitation |
| --- | --- | --- |
| BLEU | Counts n-gram overlaps between generated text and a reference. Originally designed for machine translation. | Pure token overlap — ignores meaning. "The cat sat on the mat" vs "The mat was sat on by the cat" would score poorly despite identical meaning. |
| ROUGE | Similar to BLEU but focused on recall — what percentage of reference n-grams appear in the generated text? Commonly used for summarization. | Also token-based. Has been shown to be woefully inadequate for measuring summary quality in practice. |
| BERTScore | Uses BERT embeddings to compute semantic similarity between generated and reference text. Captures meaning, not just word overlap. | More promising than BLEU/ROUGE, but summaries have subjective quality notions that no single metric captures fully. |

The reality: BLEU and ROUGE rely on token overlap heuristics and have been shown to correlate poorly with human judgment. BERTScore is better because it uses semantic similarity, but ultimately text quality — especially for summaries and agent outputs — requires a more holistic verification approach (the multi-criteria system described above). This is why modern systems increasingly use LLM-as-a-Judge approaches, where another model evaluates the output against specific criteria.

Orchestration frameworks are the glue that connects all the pieces — they manage state, invoke tools, initiate retrieval, buffer intermediate outputs, and log everything. Here's how the major frameworks compare:

| Framework | Best For | Key Strength | Watch Out |
| --- | --- | --- | --- |
| LangChain | General-purpose, prototyping | Huge ecosystem of connectors (search, DBs, APIs). Chains + memory abstractions. Earliest and most popular. | Can be over-abstracted. Rapid API changes between versions. |
| LlamaIndex | Data-heavy RAG applications | Purpose-built for connecting LLMs to data sources. Excellent indexing, chunking, and retrieval primitives. | Narrower scope — less general than LangChain for pure agent tasks. |
| crewAI | Multi-agent collaboration | Role-based agents that work together. Good for complex workflows with specialized sub-agents. | Newer framework, smaller community. Multi-agent adds complexity. |
| AutoGen | Research, multi-agent chat | Microsoft-backed. Strong multi-agent conversation framework. Good for conversational agents. | More research-oriented, less production-hardened. |

Practical advice: For prototyping, pick LangChain or LlamaIndex for their ease of use. For production, you may want to build your own framework or extend an open-source one — these frameworks are opinionated and change rapidly. The KISS principle (Keep It Simple, Stupid) applies to agents more than any other paradigm. Don't complicate your architecture unless there's a compelling reason.

Building a ReAct agent with LangChain:

from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent, load_tools
from langchain_community.tools import DuckDuckGoSearchResults
from langchain_core.tools import Tool
from langchain_openai import ChatOpenAI

# Load the LLM
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Define tools
search = DuckDuckGoSearchResults()
search_tool = Tool(
    name="web_search",
    description="Search engine for factual queries",
    func=search.run,
)
math_tools = load_tools(["llm-math"], llm=llm)
tools = math_tools + [search_tool]

# Pull the standard ReAct prompt template from the LangChain hub
react_prompt = hub.pull("hwchase17/react")

# Create and run the agent
agent = create_react_agent(llm, tools, react_prompt)
executor = AgentExecutor(
    agent=agent, tools=tools,
    verbose=True, handle_parsing_errors=True
)
result = executor.invoke({
    "input": "What is the current price of a MacBook Pro?"
})

The verbose=True flag prints each Thought/Action/Observation step, letting you debug the agent's reasoning process. In production, replace this with proper logging and observability tools like LangSmith.

LangChain chains are the building blocks of these frameworks. A chain connects an LLM with a prompt template, memory, or another chain. The simplest chain is prompt | llm. But the real power is sequential chains — where the output of one chain feeds into the next:

from langchain import LLMChain, PromptTemplate

# Chain 1: Generate a title
title_prompt = PromptTemplate(
    template="Create a title for a story about {summary}.",
    input_variables=["summary"]
)
title_chain = LLMChain(
    llm=llm, prompt=title_prompt, output_key="title"
)

# Chain 2: Generate a character using the title
char_prompt = PromptTemplate(
    template="Describe the main character of '{title}'.",
    input_variables=["summary", "title"]
)
char_chain = LLMChain(
    llm=llm, prompt=char_prompt, output_key="character"
)

# Link them: output of chain 1 feeds into chain 2
full_chain = title_chain | char_chain
result = full_chain.invoke({"summary": "a girl that lost her mother"})
# result has both 'title' and 'character' keys

This decomposition pattern — breaking complex prompts into smaller sequential chains — is fundamental. Instead of asking one prompt to do everything, you get better results by having each chain focus on one subtask.

Throughout this post, agents have interacted with the world through APIs and function calls. But many human tasks involve actions on a computer that don't have a clean API — filling in spreadsheets, navigating web forms, copy-pasting between systems.

Web Agents

Web agents use the DOM (Document Object Model) or a screenshot of a web page to understand its current state and perform actions: entering text, clicking elements, navigating links. A working web agent could book a flight by navigating to a travel website, entering your preferences, comparing options, and completing payment.

As of today, this is still a fledgling technology with poor results on benchmark tasks. The challenge is that web pages are complex, dynamic, and often adversarial (CAPTCHAs, cookie banners, variable layouts).

Computer Use

Anthropic has launched initial versions of Computer Use, enabling agents to control a full desktop environment — moving the mouse, clicking, typing, reading the screen. This extends agents beyond the browser to any software. While promising, the common failure modes include misinterpreting UI elements, getting stuck on unexpected dialogs, and performing irreversible actions (closing without saving).

Running Models Locally: Quantization and GGUF

Many agent frameworks need to run models locally — for privacy, cost, or latency reasons. Quantization reduces the bits needed to represent model parameters (from 16-bit to 8-bit or even 4-bit), cutting memory requirements dramatically while maintaining most accuracy.

Think of it like telling time: "14:16" vs "14:16:12" — dropping the seconds (lower precision) is rarely harmful for practical purposes. GGUF is a popular file format for quantized models, supported by tools like llama-cpp-python. As a rule of thumb, use at least 4-bit quantization — below that, quality degrades noticeably and you're better off choosing a smaller model at higher precision.

Beyond LangChain: DSPy and Haystack

While LangChain is the most popular orchestration framework, newer alternatives are emerging:

  • DSPy — takes a more "programming" approach to prompt optimization. Instead of manually crafting prompts, you define the task declaratively and DSPy automatically optimizes the prompts through compilation.
  • Haystack — another end-to-end framework with strong RAG pipeline support, useful for building search-powered applications.

The landscape is evolving rapidly. For prototyping, pick the framework with the best documentation for your use case. For production, consider building your own thin orchestration layer — these frameworks are opinionated and change fast.

Decision Time #6
You're deploying a customer service agent for an e-commerce platform. The agent can search order history, look up product info, process returns, and issue refunds. A customer writes: "I ordered a laptop but received a broken screen. I want a full refund and a replacement sent immediately." What safety pattern should you implement?

You now have the complete picture — from finding information (Part 1), to building production-ready retrieval pipelines (Part 2), to optimizing every layer (Part 3), to letting AI take autonomous action with proper safety guardrails (Part 4). Time to put it all together.

Practice Mode

Apply what you've learned to real-world decisions

Scenario 1 of 4
You're building a legal research assistant. Lawyers need to find relevant case law, statutes, and legal opinions across a corpus of 2 million documents. Queries use highly specific legal terminology like "estoppel," "prima facie," and "habeas corpus" — but lawyers also ask broader conceptual questions like "cases where a company was held liable for employee negligence."
What retrieval architecture should you use?
A
Keyword search only (BM25) — Legal terms are precise, so exact matching will find the right documents.
B
Semantic search only — Embedding-based retrieval understands meaning, which handles conceptual queries.
C
Hybrid search + cross-encoder reranking — Combine BM25 for exact terms with semantic search for concepts, then rerank the merged results.
Cheat Sheet: RAG & Agents at a Glance

Retrieval Types

Term-based (BM25): matches exact words, fast & cheap. Embedding-based: matches meaning, handles synonyms. Hybrid: combine both with Reciprocal Rank Fusion for the best of both worlds.

Reranking

Two-stage pipeline: bi-encoder retrieves top 100 fast (independent encoding), then cross-encoder reranks top 10 precisely (query + doc together). Near cross-encoder precision at a fraction of the cost of scoring the whole corpus.
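The two-stage shape is just "cheap score over everything, expensive score over survivors." A minimal sketch, with `bi_score` and `cross_score` as stand-ins for real model calls (here a toy word-overlap scorer — in production these would be a bi-encoder and a cross-encoder):

```python
def two_stage_retrieve(query, corpus, bi_score, cross_score, k1=100, k2=10):
    # Stage 1: cheap, independent scoring over the whole corpus.
    candidates = sorted(corpus, key=lambda d: bi_score(query, d), reverse=True)[:k1]
    # Stage 2: expensive joint scoring, but only over k1 candidates.
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)[:k2]

corpus = ["warfarin dosage guide", "horse dewormer dosage", "warfarin drug interactions"]
overlap = lambda q, d: len(set(q.split()) & set(d.split()))  # toy scorer
top = two_stage_retrieve("warfarin interactions", corpus,
                         bi_score=overlap, cross_score=overlap, k1=2, k2=1)
```

The economics drive the design: the cross-encoder runs k1 times per query instead of once per document in the corpus.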

RAG Pipeline

Full pipeline: Rewrite query → Retrieve documents → Rerank by relevance → Refine chunks → Insert into prompt → Generate answer. Each stage catches failures from the previous one.
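The six stages compose cleanly as a function chain. A sketch under the assumption that each stage is injected as a callable (in a real system these would wrap an LLM, a vector store, and a reranker; here the stubs in the usage example are hypothetical):

```python
def rag_answer(query, rewrite, retrieve, rerank, refine, generate):
    q = rewrite(query)        # 1. Rewrite: clean up / expand the query
    docs = retrieve(q)        # 2. Retrieve: fetch candidate chunks
    docs = rerank(q, docs)    # 3. Rerank: order (and filter) by relevance
    context = refine(docs)    # 4. Refine: trim chunks to what fits
    prompt = f"Context:\n{context}\n\nQuestion: {q}\nAnswer:"  # 5. Insert
    return generate(prompt)   # 6. Generate: LLM answers from context

answer = rag_answer(
    "Warfarin Interactions?",
    rewrite=lambda q: q.lower(),
    retrieve=lambda q: ["doc about warfarin", "doc about horses"],
    rerank=lambda q, d: [x for x in d if "warfarin" in x],  # drops the horse doc
    refine=lambda d: "\n".join(d),
    generate=lambda p: f"Answered using {p.count('doc')} doc(s)",
)
```

Keeping each stage behind its own interface is what lets you swap a reranker or retriever without touching the rest of the pipeline.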

Chunking

Small chunks (~128 tokens): precise retrieval, loses context. Large chunks (~512 tokens): retains context, dilutes relevance. Best strategies: recursive (split by structure) or semantic (split by meaning changes).
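Recursive chunking can be sketched without any library: try the coarsest separator first, pack pieces up to the size limit, and recurse into anything still too long. This is a simplified illustration that measures length in characters rather than tokens (the separator hierarchy and limits here are assumptions, not a specific library's defaults):

```python
def recursive_chunk(text, max_len=512, separators=("\n\n", "\n", ". ", " ")):
    """Split by the coarsest structural separator first; recurse with
    finer separators into any chunk still over the limit."""
    if len(text) <= max_len:
        return [text]
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            piece = part + sep  # may append a trailing sep; fine for a sketch
            if current and len(current) + len(piece) > max_len:
                chunks.append(current)
                current = ""
            current += piece
        if current:
            chunks.append(current)
        # Recurse with only the finer separators, so progress is guaranteed.
        return [c for ch in chunks
                for c in recursive_chunk(ch, max_len, separators[i + 1:])]
    # No separator left: hard character split as a last resort.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

paragraphs = ("alpha beta gamma. " * 20 + "\n\n") * 3
chunks = recursive_chunk(paragraphs, max_len=100)
```

The key property: paragraph and sentence boundaries are respected whenever possible, and the hard character split only fires when no structure remains.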

RAG vs Alternatives

RAG: dynamic knowledge, citations, no retraining. Long context: small, stable documents, no infrastructure. Fine-tuning: behavior/style changes, not knowledge updates. Best results often combine approaches.

Agent Loop (ReAct)

Thought (what should I do?) → Action (use a tool) → Observation (what happened?) → repeat until done. Decouple planning from execution. Validate plans before running.
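The ReAct loop itself is a small control structure around the model. A sketch assuming `llm_step` is a placeholder for a model call that returns a (thought, action, argument) triple, and `tools` maps action names to functions:

```python
def react_loop(task, llm_step, tools, max_steps=8):
    """Thought -> Action -> Observation, until 'finish' or step budget runs out."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        thought, action, arg = llm_step(history)  # model decides the next move
        history.append(f"Thought: {thought}")
        if action == "finish":
            return arg                            # final answer, loop ends
        observation = tools[action](arg)          # execute the chosen tool
        history.append(f"Action: {action}({arg})")
        history.append(f"Observation: {observation}")
    return None                                   # step budget exhausted

# Scripted model for illustration: one search, then finish.
steps = iter([("need order info", "search", "warfarin"),
              ("found it", "finish", "no interactions on file")])
result = react_loop("check interactions",
                    llm_step=lambda history: next(steps),
                    tools={"search": lambda q: f"3 results for {q}"})
```

The `max_steps` cap matters: without it, a model that never emits `finish` loops (and bills you) forever.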

Agent Memory

Internal: baked into weights, can't update. Short-term: context window, fast but limited. Long-term: external stores, persistent but requires retrieval. Management: buffer (keep all), window (keep last K), or summary (compress).
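The window and summary strategies can be sketched in a few lines. Here `summarize` stands in for an LLM summarization call (the toy lambda in the usage is an assumption):

```python
def window_memory(messages, k=4):
    """Window strategy: keep only the last k messages (bounded, lossy)."""
    return messages[-k:]

def summary_memory(messages, summarize, k=4):
    """Summary strategy: compress everything older than the window
    into a single synthetic message, keep the recent turns verbatim."""
    old, recent = messages[:-k], messages[-k:]
    if not old:
        return recent
    return [("system", "Summary so far: " + summarize(old))] + recent

history = [("user", f"message {i}") for i in range(6)]
compressed = summary_memory(history,
                            summarize=lambda old: f"{len(old)} earlier messages",
                            k=2)
```

The trade-off is the same one the table above describes: the window is cheap but forgets; the summary remembers but pays an extra model call and can distort details.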

Safety Guardrails

Three layers: Input (PII detection, injection blocking, NSFW filter) → Output (factuality, toxicity, format checks) → Action (human-in-the-loop for write operations, scope limits). Classify every tool as read or write.
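The action layer — classify every tool as read or write, gate writes behind a human — is the pattern that would have stopped the $12,000 refund in the opening story. A minimal sketch; the tool names, `execute`, and `request_approval` callbacks are illustrative placeholders:

```python
READ_TOOLS = {"search_orders", "lookup_product"}    # safe to auto-execute
WRITE_TOOLS = {"process_return", "issue_refund"}    # require human approval

def guarded_call(tool_name, args, execute, request_approval):
    """Action-layer guardrail: reads pass through, writes need a human,
    anything unclassified is denied by default."""
    if tool_name in READ_TOOLS:
        return execute(tool_name, args)
    if tool_name in WRITE_TOOLS:
        if request_approval(tool_name, args):       # human-in-the-loop
            return execute(tool_name, args)
        return {"status": "rejected", "reason": "approval denied"}
    raise ValueError(f"unclassified tool: {tool_name}")  # default-deny

executed = []
run = lambda t, a: executed.append((t, a)) or {"status": "done"}
ok = guarded_call("search_orders", {"id": 1}, run, lambda t, a: False)
blocked = guarded_call("issue_refund", {"amount": 12000}, run, lambda t, a: False)
```

Default-deny on unclassified tools is deliberate: a new tool someone forgot to categorize should fail loudly, not silently gain write authority.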

Where to Go Deep
  • AI Memory — the next post in this series: how to give LLMs persistent memory across conversations
  • Fine-Tuning — when RAG isn't enough and you need to teach the model your domain (LoRA, QLoRA, RAFT)
  • How LLMs Work — understand the transformer architecture that makes all of this possible

وَاللَّهُ أَعْلَمُ

And Allah knows best

وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ

May Allah's peace and blessings be upon our master Muhammad and his family
