بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ
In the name of Allah, the Most Gracious, the Most Merciful
We built a RAG system for a medical client. It retrieved the right documents 92% of the time. But the model kept confusing drug interactions — not because it couldn't find the information, but because it didn't understand pharmacology.
More context didn't help. Better embeddings didn't help. We tried reranking, hybrid search, prompt engineering. The model would read "Warfarin interacts with NSAIDs" and still suggest ibuprofen alongside blood thinners.
The model needed to actually learn the domain. That's when we discovered fine-tuning isn't a luxury — it's sometimes the only path forward.
- Part 1: When RAG hits its limits — the decision matrix for RAG vs fine-tuning vs both
- Part 2: LoRA & QLoRA — how to modify 0.1% of a model's parameters and match full fine-tuning performance
- Part 3: The training recipe — data quality, hyperparameters, and detecting overfitting
- Part 4: Catastrophic forgetting — why your model might learn the domain but forget how to speak English
- Part 5: Synthetic data — using strong models to generate training data when you don't have enough
- Part 6: RAFT & embedding fine-tuning — combining RAG with fine-tuning, and teaching the search itself
This post is for you if:
- Your RAG system retrieves the right documents but the model still generates wrong or inconsistent answers
- You need consistent output formats (JSON, medical codes, legal citations) and prompt engineering isn't reliable enough
- You've heard "LoRA," "QLoRA," or "fine-tuning" and want to understand what they actually mean and when to use them
- You want to run a domain-specific model on your own hardware instead of paying per-token to a closed API
The Reference Book vs. The Trained Specialist
Imagine you need surgery. You have two options. The first is a general practitioner holding an anatomy textbook. They can look up any procedure, find the right page, and read the steps aloud. The second is a surgeon who spent years in residency — they've internalized the patterns, the edge cases, the muscle memory.
The GP with the textbook is RAG. They can retrieve the right information, but they don't understand it at a deep level. The surgeon is a fine-tuned model. They've absorbed the domain into their parameters.
RAG works brilliantly for many problems. But it has a fundamental limitation: it retrieves text snippets, not deep understanding. The model reads the retrieved context and does its best, but it's still reasoning with general-purpose knowledge. When the domain requires specialized vocabulary, consistent formatting, or nuanced reasoning that general models struggle with — retrieval alone isn't enough.
Signs You Need Fine-Tuning
- Format control — You need consistent JSON, XML, or domain-specific output structures. Prompt engineering gets you 80% consistency; fine-tuning gets you 99%.
- Domain vocabulary — Medical codes, legal terminology, financial instruments. The model misuses domain terms even when the right definitions are in context.
- Consistent style — Brand voice, technical writing style, or communication tone. Every response should feel like it came from the same expert.
- Speed and cost — A fine-tuned 8B model can replace a 70B+ model with RAG, running 5–10x faster and costing 80% less per token.
Five Signs You Should Stick with RAG
- Rapidly changing data — your knowledge base updates daily (news, stock prices, product catalogs). Fine-tuned knowledge becomes stale the moment training ends.
- Broad knowledge needs — you need answers across thousands of topics, not deep expertise in one domain.
- Limited training data — you have fewer than 200 high-quality examples. Start with few-shot prompting and RAG instead.
- Traceability required — you need to cite sources and show where the answer came from. RAG provides natural citations; fine-tuned knowledge is a black box.
- It already works — your RAG pipeline gives satisfactory answers. Don't fine-tune because it sounds impressive.
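The two checklists above can be condensed into a rough triage helper. This is a heuristic sketch, not a standard API — the function and parameter names are mine:

```python
def recommend(needs_format_control=False, domain_vocab_issues=False,
              data_changes_daily=False, needs_citations=False,
              num_examples=0, rag_already_works=False):
    """Rough RAG-vs-fine-tuning triage based on the signs above."""
    if rag_already_works:
        return "keep RAG"
    # Hard blockers for fine-tuning: stale knowledge, no citations, too little data
    if data_changes_daily or needs_citations or num_examples < 200:
        return "RAG (fine-tuning not advised yet)"
    # Strong fine-tuning signals: format control, domain vocabulary
    if needs_format_control or domain_vocab_issues:
        return "fine-tune (optionally on top of RAG)"
    return "start with RAG"
```

It encodes the post's rule of thumb: start with RAG, and reach for fine-tuning only when you hit a format, vocabulary, or cost wall with enough data in hand.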
Format consistency is where fine-tuning shines most. Prompt engineering can get you to roughly 80% format consistency, but the model will still interpret formatting instructions differently in edge cases, and better retrieval doesn't help when the problem is output format rather than input quality. Fine-tuning on 1,000+ examples of the exact output format you want pushes consistency past 99%: the model internalizes the structure as a pattern, not as a set of instructions it might reinterpret each time.
The Chef Analogy
Imagine a classically trained French chef who wants to learn Thai cooking. They don't start over from culinary school — they already know how to handle heat, balance flavors, time dishes. Instead, they add a small "adapter" of new techniques on top of their existing skills: how to use a wok, how to balance fish sauce with lime, how to make curry paste from scratch.
That's exactly how LoRA (Low-Rank Adaptation) works. Instead of retraining every parameter in a model (the equivalent of sending the chef back to culinary school), LoRA attaches small "adapter" matrices to specific layers. The original model weights stay frozen. Only the adapters learn the new domain.
Full Fine-Tuning vs. LoRA
A model like Llama 3 8B has about 8 billion parameters. Full fine-tuning means updating all 8 billion parameters during training. That requires massive GPU memory (at least 80GB VRAM for the model, gradients, and optimizer states) and risks destroying the model's general knowledge.
LoRA typically trains 0.1% to 1% of the parameters. For an 8B model, that's 8–80 million parameters. Research consistently shows that LoRA matches or comes within 1–2% of full fine-tuning on most tasks, while using 3–10x less GPU memory.
In a transformer model, the heavy lifting happens in the attention layers. Each attention layer has weight matrices (typically called Q, K, V, and O) that determine how the model processes relationships between tokens.
During full fine-tuning, you'd update these massive matrices directly. A single attention weight matrix in an 8B model might be 4096 x 4096 — that's 16.7 million parameters in just one matrix.
LoRA's insight: instead of updating the full matrix W, learn a low-rank update ΔW = A·B, where A is 4096 x 16 and B is 16 x 4096 — two much smaller matrices whose product has the same shape as W.
The original matrix W has 16.7M parameters. The two adapter matrices A and B together have only 131,072 parameters (4096 x 16 x 2). That's a 128x reduction.
The number 16 in this example is the LoRA rank (often written as r). It controls how much "capacity" the adapter has to learn new patterns. Higher rank = more capacity but more parameters and memory.
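The arithmetic behind that reduction is worth doing once by hand:

```python
d = 4096   # side length of one attention weight matrix (e.g., W_q in an 8B model)
r = 16     # LoRA rank

full_update = d * d          # updating W directly: all 16,777,216 parameters
lora_update = d * r + r * d  # adapter A (4096 x 16) plus adapter B (16 x 4096)

print(full_update, lora_update, full_update // lora_update)
# 16777216 131072 128  -> the 128x reduction mentioned above
```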
LoRA Rank — Finding the Sweet Spot
Think of LoRA rank like the width of a highway. A rank of 4 is a single-lane road — cheap and fast, but limited capacity. A rank of 64 is a six-lane highway — lots of capacity, but expensive and often unnecessary.
In practice, rank 16 is the sweet spot for most tasks. Here's why:
| LoRA Rank | Trainable Params (8B model) | Performance vs Full FT | Best For |
|---|---|---|---|
| r = 4 | ~2M (0.025%) | 90–95% | Simple style transfer, format control |
| r = 16 | ~8M (0.1%) | 97–99% | Most domain adaptation tasks |
| r = 64 | ~33M (0.4%) | 99–100% | Complex reasoning, new language |
| r = 256 | ~131M (1.6%) | ~100% | Rarely needed, diminishing returns |
Start with rank 16. If your validation loss plateaus early, drop to 8. If it's still improving when training ends, try 32. Going above 64 rarely helps and significantly increases memory usage.
QLoRA — Fine-Tuning on Consumer Hardware
QLoRA combines LoRA with quantization. Think of quantization like compressing a high-resolution photo into a JPEG — you lose a tiny bit of detail, but the file is 4x smaller.
In QLoRA, the base model is compressed from 16-bit to 4-bit precision (that's the quantization part), and then LoRA adapters are trained on top at full precision. The result: you can fine-tune a 13B parameter model on a single GPU with 24GB VRAM, or an 8B model on as little as 6GB.
Without QLoRA (16-bit LoRA): an 8B model needs ~32GB VRAM for training; a 13B model needs ~52GB. You need an A100 ($2+/hour).
With QLoRA (4-bit base): an 8B model needs ~6GB; a 13B model needs ~10GB. A single RTX 4090 ($0.40/hour) works fine.
This is exactly the situation QLoRA was designed for. If a 13B model won't fit on your GPU for 16-bit LoRA training (~52GB), quantizing the base model to 4-bit shrinks it to ~10GB, leaving plenty of room on a 24GB card for the LoRA adapters and training overhead. The quality cost is minimal — research shows QLoRA comes within 1% of full-precision fine-tuning on most benchmarks — so you don't have to fall back to a 7B model that would underperform on specialized domain tasks.
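As a sanity check, here's a back-of-envelope VRAM estimator. The bytes-per-parameter figures (2 bytes for 16-bit, 0.5 bytes for 4-bit) are standard; the overhead multipliers for adapters, activations, and optimizer state are rough assumptions I chose to reproduce the numbers quoted above, not measured values:

```python
def lora_vram_gb(params_billion, four_bit=False):
    """Very rough VRAM estimate for (Q)LoRA training.

    Assumption: base weights dominate memory; a flat multiplier covers
    adapters, activations, and optimizer state (2.0x for 16-bit LoRA,
    1.5x for 4-bit QLoRA -- back-of-envelope numbers, not benchmarks).
    """
    bytes_per_param = 0.5 if four_bit else 2.0
    overhead = 1.5 if four_bit else 2.0
    return params_billion * bytes_per_param * overhead

print(lora_vram_gb(8))        # 32.0  -> 16-bit LoRA on an 8B model
print(lora_vram_gb(8, True))  # 6.0   -> QLoRA on the same model
print(lora_vram_gb(13, True)) # 9.75  -> QLoRA fits a 13B model in ~10GB
```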
LIMA — Less Is More for Alignment
One of the most counterintuitive findings in fine-tuning comes from a 2023 paper called LIMA: "Less Is More for Alignment." The researchers fine-tuned a 65B parameter model on just 1,000 carefully curated examples and found it rivaled models trained on 52,000 examples.
The insight: data quality beats data quantity by a massive margin. One thousand perfect examples teach the model a pattern. One hundred thousand noisy examples teach the model to be mediocre.
- Correct — The answer must be factually accurate. One wrong example can embed a permanent error in the model's behavior.
- Complete — The response should be as thorough as you'd want in production. Don't use abbreviated training examples.
- Consistent — Every example should follow the exact format and style you want in production. If you want JSON, every response should be valid JSON.
- Representative — Your training set should cover the range of inputs you'll see in production — including edge cases, ambiguous inputs, and "I don't know" scenarios.
How Much Data Do You Need?
As a rule of thumb: around 200 high-quality examples is the practical floor (below that, stick with few-shot prompting and RAG), 1,000–3,000 is the sweet spot for production quality, and beyond that quality beats quantity — as LIMA shows, piling on noisy examples actively hurts.
The Hyperparameters That Actually Matter
Fine-tuning has dozens of knobs you can turn. But in practice, four matter most:
Epochs — how many times the model sees your entire dataset.
Think of it like studying for an exam. Reading your notes once (1 epoch) gives a rough understanding. Reading them 2–3 times (2–3 epochs) solidifies the knowledge. Reading them 20 times (20 epochs) means you've memorized specific sentences without understanding the material — that's overfitting.
Optimal: 2–3 epochs for most tasks. More than 4 almost always leads to overfitting, especially with small datasets.
Learning rate — how aggressively the model updates its weights on each step.
Think of it like adjusting a satellite dish. Too high a learning rate (big adjustments) and you keep overshooting the signal. Too low (tiny adjustments) and you'll take forever to find it. Just right and you converge on the best reception quickly.
Baselines: Full fine-tuning: 2e-5. LoRA: 5e-5 (higher because you're training fewer parameters). QLoRA: 1e-5 to 2e-5 (lower because quantization adds noise).
Batch size — how many examples the model processes before updating weights.
Think of it like grading student essays. Grading one essay at a time (batch size 1) gives detailed but noisy feedback. Grading 8 at once (batch size 8) gives smoother, more balanced feedback because individual quirks average out.
Typical: 4–8 for most fine-tuning. Smaller batches (1–2) work when VRAM is limited. Larger batches (16–32) can speed up training but may need learning rate adjustments.
Early stopping — automatically stopping training when the model starts overfitting.
During training, you track performance on a held-out validation set (typically 10–20% of your data). When validation loss starts rising while training loss keeps dropping, the model is memorizing your training data instead of learning patterns. Stop there.
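A minimal patience-based version of that rule might look like this (the eval-loss curve below is hypothetical):

```python
def should_stop(val_losses, patience=2):
    """Stop when validation loss has not improved for `patience` evals."""
    best_idx = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_idx >= patience

history = []
for val_loss in [0.92, 0.71, 0.64, 0.66, 0.69]:  # hypothetical per-epoch eval losses
    history.append(val_loss)
    if should_stop(history):
        break  # loss bottomed out at 0.64, then rose twice -> stop here
```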
Before fine-tuning, measure your base model's performance on your task. If GPT-4 with good prompts scores 85% on your evaluation set, you know the ceiling. If your fine-tuned Llama 8B scores 82%, that's a huge win at 10x lower cost. Without a baseline, you can't tell if fine-tuning helped or hurt.
The Bilingual Speaker Problem
Imagine someone who speaks fluent English and French. They move to Japan and study Japanese intensively for a year, practicing 8 hours a day. After a year, their Japanese is excellent — but they discover they've started struggling with French. They haven't spoken it in months, and the Japanese patterns keep interfering.
That's catastrophic forgetting. When you fine-tune a model on domain-specific data, the new gradients (the weight updates from training) can overwrite the general knowledge the model learned during pre-training. The model gets great at your domain but loses basic capabilities.
What It Looks Like in Practice
A team fine-tuned an LLM on 10,000 medical case studies. The model became excellent at diagnosing conditions from symptoms. But when they tested basic conversational abilities, they found:
- "What's 15 x 23?" — Model responded with a diagnosis instead of 345
- "Write me a haiku" — Model responded with medical terminology
- "Hello, how are you?" — Model responded: "Patient presents with..."
The model had overfit to the medical domain so heavily that it forgot how to be a general-purpose language model.
How to Prevent It
- Use LoRA — LoRA keeps the original weights frozen. The adapter learns the new domain while the base model retains all general knowledge. This is the single most effective prevention.
- Mix in general data — Add 10–20% general instruction-following data to your training mix. This reminds the model "you're still a general assistant that also knows this domain."
- Monitor both — Track performance on both domain tasks AND general benchmarks (MMLU, HellaSwag). If general scores drop more than 5%, you're forgetting.
- Limit epochs — Forgetting accelerates with more training. Limiting to 2–3 epochs reduces the chance of overwriting general knowledge.
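The 10–20% mixing rule is only a few lines in practice. A sketch, with the function name and ratio handling my own:

```python
import random

def build_training_mix(domain_examples, general_examples,
                       general_frac=0.15, seed=0):
    """Blend general instruction data into a domain set so that
    general_frac of the FINAL mix is general data."""
    # g / (d + g) = f  =>  g = d * f / (1 - f); round to dodge float error
    n_general = round(len(domain_examples) * general_frac / (1 - general_frac))
    rng = random.Random(seed)
    mix = domain_examples + rng.sample(general_examples, n_general)
    rng.shuffle(mix)  # interleave so batches see both kinds of data
    return mix
```

With 850 domain examples and a 15% target, this pulls in 150 general examples for a 1,000-example mix.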
When a fine-tuned model starts answering every prompt in domain jargon, that's textbook catastrophic forgetting — not underfitting and not bad prompting. The model isn't failing at its new domain; it's failing at everything else, because the gradients from domain-specific training overwrote the weights responsible for general knowledge. The fix has two parts: (1) switch to LoRA so the original weights stay frozen, and (2) mix 10–20% general instruction-following data into the training set so the model retains broad capabilities alongside its new expertise.
The Teacher-Student Pipeline
Imagine you're starting a new restaurant and need to train a sous chef. You can't afford to have them study under a master for years, but you can have a renowned chef create a book of 3,000 signature recipes with detailed notes on technique, timing, and plating. Your sous chef studies that book and becomes highly skilled — not by learning directly from the master, but by learning from the master's output.
That's synthetic data generation for fine-tuning. You use a powerful model (GPT-4, Claude, DeepSeek-R1) as the "teacher" to generate training examples, then fine-tune a smaller, cheaper model (the "student") on those examples. The student learns to mimic the teacher's capabilities at a fraction of the cost.
The Pipeline
1. Teacher — a strong model (GPT-4, Claude, DeepSeek-R1) generates candidate training examples.
2. Critic — a separate model (or a separate prompt) scores each candidate and filters out low-quality ones.
3. Student — the smaller target model is fine-tuned on the filtered output.
Five Rules for Quality Synthetic Data
1. Use persona variation in prompts. Don't ask the teacher model the same question the same way every time. Vary the persona: "You are a senior cardiologist," "You are a first-year medical resident," "You are a patient asking about their condition." This creates diversity in your training data and prevents the student from only learning one communication style.
2. Ground with source passages. When generating training data, provide the teacher model with real source documents. This prevents hallucinated training data — you don't want the teacher making up facts that the student then learns as truth.
3. Use a critic model for quality control. After the teacher generates examples, run them through a separate model (or the same model with a different prompt) that scores quality, correctness, and relevance. Only keep examples above a quality threshold (typically top 70–80%).
4. Enforce schema consistency. If your training examples need to follow a specific format (like JSON with required fields), validate every generated example against the schema before including it. One malformed example can teach the model bad habits.
5. Self-validation loops. Ask the teacher to generate an answer, then ask it to verify its own answer. Discard any examples where the verification step identifies errors. This catches the ~10–15% of outputs where even GPT-4 makes mistakes.
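Rules 3 and 4 together amount to a quality gate that every generated example must pass before it reaches the training set. A minimal sketch, assuming each example arrives as a JSON string alongside a critic score — the schema fields and threshold here are hypothetical:

```python
import json

# Hypothetical schema: every training example must carry these fields
REQUIRED_FIELDS = {"question", "answer", "source_passage"}

def passes_quality_gate(raw_example: str, critic_score: float,
                        threshold: float = 0.75) -> bool:
    """Keep a generated example only if it parses as JSON, matches the
    schema, and the critic model scored it above the threshold."""
    try:
        example = json.loads(raw_example)
    except json.JSONDecodeError:
        return False  # malformed output would teach the student bad habits
    if not REQUIRED_FIELDS.issubset(example):
        return False  # missing fields break schema consistency
    return critic_score >= threshold
```

In a real pipeline, `critic_score` would come from a second model call; here it's just a number handed in.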
Most closed-source APIs (OpenAI, Anthropic) have terms of service that restrict using their output to train competing models. DeepSeek-R1 and other open models can be used freely for distillation. Always check the license before building a teacher-student pipeline.
RAFT — The Best of Both Worlds
Remember the chef analogy? RAG is the chef reading a recipe book. Fine-tuning is the chef who learned from residency. But what about a chef who went through residency AND keeps a recipe book nearby for reference?
That's RAFT (Retrieval-Augmented Fine-Tuning). You fine-tune the model to be better at using retrieved context. Instead of just teaching domain knowledge, you teach the model how to reason with retrieved documents — which parts to trust, which to ignore, and how to combine multiple sources.
How RAFT Works
During training, RAFT deliberately mixes two types of examples:
- Oracle examples (~70%) — The correct document is included alongside 3–4 "distractor" documents. The model learns to identify and extract the right information from noisy context.
- Distractor-only examples (~30%) — No relevant document is provided. The model learns to rely on its own parametric knowledge when retrieval fails — and to say "I don't know" when appropriate.
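Here's a sketch of how such a training set could be assembled, assuming each QA pair comes with its known "oracle" document. The 70/30 split and all helper names are illustrative:

```python
import random

def build_raft_examples(qa_pairs, corpus, oracle_frac=0.7,
                        n_distractors=4, seed=0):
    """qa_pairs: list of (question, answer, oracle_doc) tuples.
    corpus: list of candidate documents to draw distractors from."""
    rng = random.Random(seed)
    examples = []
    for question, answer, oracle_doc in qa_pairs:
        distractors = rng.sample(
            [d for d in corpus if d != oracle_doc], n_distractors)
        if rng.random() < oracle_frac:
            # Oracle example: answer is findable in the noisy context
            context = distractors + [oracle_doc]
            rng.shuffle(context)
        else:
            # Distractor-only example: model must fall back or abstain
            context = distractors
        examples.append(
            {"question": question, "context": context, "answer": answer})
    return examples
```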
Real-World RAFT Results
| Use Case | Before RAFT | After RAFT | Cost Impact |
|---|---|---|---|
| Invoice extraction | GPT-4 at $10/1K invoices | Mistral 7B at $1.20/1K | 88% cost reduction |
| Medical Q&A | 72% accuracy (RAG alone) | 92% accuracy (RAFT) | +20 points accuracy |
| Podcast summarization | Generic summaries | Domain-aware highlights | Custom model, no API cost |
Embedding Fine-Tuning — Teach the Search Itself
Everything we've covered so far fine-tunes the answer model — the LLM that generates responses. But what about the search model — the embedding model that decides which documents are relevant to a query?
Think of it this way: if your search keeps returning the wrong documents, no amount of fine-tuning the answer model will help. You need to teach the search itself to understand your domain.
Matryoshka Representation Learning
Think of Russian nesting dolls (matryoshka). Each doll contains a smaller version of itself. Matryoshka Representation Learning creates embeddings that work the same way: the full 1024-dimension embedding is the most detailed representation, but the first 256 dimensions are a perfectly usable smaller version, and even the first 64 dimensions capture the core meaning.
This matters for production because you can use smaller embeddings for fast initial filtering, then full-size embeddings for precise reranking — without maintaining separate models.
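A toy illustration of the truncate-and-renormalize step, using short pure-Python vectors in place of real 1024-dimension embeddings (the numbers are made up):

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` dimensions of a Matryoshka embedding and
    re-normalize to unit length so cosine similarity stays meaningful."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))  # assumes unit-normalized inputs

# Hypothetical 8-dim embeddings standing in for 1024-dim ones
doc   = truncate_embedding([0.40, 0.30, 0.20, 0.10, 0.05, 0.05, 0.02, 0.01], 4)
query = truncate_embedding([0.38, 0.31, 0.19, 0.12, 0.20, 0.00, 0.00, 0.00], 4)
print(round(cosine(doc, query), 3))  # -> 0.998: the head dims carry the core meaning
```

The same pattern scales to production: filter candidates on the cheap truncated prefix, then rerank the survivors on the full-size embedding.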
When to Fine-Tune Embeddings vs. the LLM
Fine-tune the embedding model when:
- Search returns irrelevant results despite good data
- Domain vocabulary confuses the embedding model
- Queries use different terms than documents
- You need multilingual search in a specialized domain

Fine-tune the LLM when:
- Search works but the model misinterprets results
- You need specific output format or style
- Domain reasoning requires specialized knowledge
- You want to replace a larger model with a smaller one
Consider an e-commerce assistant whose search keeps surfacing hiking boots for trail-runner queries. The LLM is already working fine — it answers product questions perfectly when handed the right products — so no amount of LLM fine-tuning will fix the results. The problem is in the search: the embedding model needs to learn that trail runners and hiking boots are different categories in your product domain, even though they share terms like "outdoor," "rugged," and "grip." Fine-tuning the embedding model on product pairs (similar vs. different) teaches it exactly those domain-specific distinctions.
Fine-tuning teaches the model new skills. But how do you measure whether those skills actually work? How do you know if your fine-tuned model is better, worse, or just different? That's the domain of evaluation — the subject of the next post in this series.
وَاللَّهُ أَعْلَمُ
And Allah knows best
وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ
May Allah's peace and blessings be upon our master Muhammad and his family
Cheat Sheet
The essential reference for LLM fine-tuning.
RAG vs Fine-Tuning
RAG: Changing data, broad knowledge, traceability needed
Fine-Tune: Format control, domain vocabulary, consistent style, speed
RAFT: Need both retrieval AND domain expertise
Start with: RAG first, add FT when you hit a wall
LoRA Essentials
What: Small adapter matrices, original weights frozen
Rank 16: Sweet spot for most tasks (0.1% params)
QLoRA: 4-bit base + full-precision adapters
Memory: 8B model: ~6GB (QLoRA) vs ~32GB (LoRA) vs 80GB+ (full)
Training Recipe
Data: 1,000–3,000 quality examples for production
Epochs: 2–3 optimal (more = overfitting)
LR: 2e-5 (full), 5e-5 (LoRA), 1e-5 (QLoRA)
Batch: 4–8 typical
LIMA: Quality > quantity, always
Forgetting Prevention
Symptom: Domain tasks improve, general tasks degrade
Fix 1: Use LoRA (original weights frozen)
Fix 2: Mix 10–20% general data in training
Fix 3: Monitor general benchmarks (MMLU, HellaSwag)
Fix 4: Limit to 2–3 epochs
Synthetic Data Pipeline
Teacher: GPT-4 / Claude / DeepSeek-R1 generates
Critic: Separate model validates quality
Student: Smaller model trains on filtered output
Key: Persona variation, source grounding, schema validation
RAFT & Embeddings
RAFT: 70% oracle docs + 30% distractor-only
Result: Mistral 7B matching GPT-4 on invoice tasks at 88% lower cost
Embedding FT: When search returns wrong results
Matryoshka: Multi-resolution embeddings (64–1024 dims)
What to Fine-Tune
Wrong answers: Fine-tune the LLM
Wrong documents: Fine-tune embeddings
Both problems: RAFT
Not enough data: Synthetic pipeline first
<200 examples: Don't fine-tune yet
Cost Impact
GPT-4 API: $30–60 / 1M tokens
Fine-tuned 8B: $0.20 / 1M tokens (self-hosted)
Training cost: $5–50 for LoRA on cloud GPU
Break-even: ~1M tokens/month makes FT worthwhile
- RAG & Agents — understand the RAG pipeline limitations that fine-tuning addresses (prerequisite for this post)
- From Prompt to GPU — understand the inference pipeline you're modifying when you fine-tune
- Evaluation — the next post: how to measure whether fine-tuning actually worked