Fine-Tuning LLMs

When RAG Isn't Enough

Your RAG retrieves the right documents, but the model still gets the answer wrong. More context doesn't help. Better embeddings don't help. Sometimes the model needs to actually learn your domain — and that changes everything.

Bahgat Ahmed
December 2025 · ~20 min read
What's Inside
  • 1. When RAG Hits the Wall
  • 2. LoRA — The Efficient Adapter
  • 3. The Training Recipe
  • 4. Catastrophic Forgetting
  • 5. Synthetic Data Pipelines
  • 6. RAFT & Embedding Tuning
  • Practice Mode
  • Cheat Sheet

بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

In the name of Allah, the Most Gracious, the Most Merciful

We built a RAG system for a medical client. It retrieved the right documents 92% of the time. But the model kept confusing drug interactions — not because it couldn't find the information, but because it didn't understand pharmacology.

More context didn't help. Better embeddings didn't help. We tried reranking, hybrid search, prompt engineering. The model would read "Warfarin interacts with NSAIDs" and still suggest ibuprofen alongside blood thinners.

The model needed to actually learn the domain. That's when we discovered fine-tuning isn't a luxury — it's sometimes the only path forward.

Quick Summary
  • Part 1: When RAG hits its limits — the decision matrix for RAG vs fine-tuning vs both
  • Part 2: LoRA & QLoRA — how to modify 0.1% of a model's parameters and match full fine-tuning performance
  • Part 3: The training recipe — data quality, hyperparameters, and detecting overfitting
  • Part 4: Catastrophic forgetting — why your model might learn the domain but forget how to speak English
  • Part 5: Synthetic data — using strong models to generate training data when you don't have enough
  • Part 6: RAFT & embedding fine-tuning — combining RAG with fine-tuning, and teaching the search itself
This post is for you if...
  • Your RAG system retrieves the right documents but the model still generates wrong or inconsistent answers
  • You need consistent output formats (JSON, medical codes, legal citations) and prompt engineering isn't reliable enough
  • You've heard "LoRA," "QLoRA," or "fine-tuning" and want to understand what they actually mean and when to use them
  • You want to run a domain-specific model on your own hardware instead of paying per-token to a closed API
Part 1
When RAG Hits the Wall

The Reference Book vs. The Trained Specialist

Imagine you need surgery. You have two options. The first is a general practitioner holding an anatomy textbook. They can look up any procedure, find the right page, and read the steps aloud. The second is a surgeon who spent years in residency — they've internalized the patterns, the edge cases, the muscle memory.

The GP with the textbook is RAG. They can retrieve the right information, but they don't understand it at a deep level. The surgeon is a fine-tuned model. They've absorbed the domain into their parameters.

RAG works brilliantly for many problems. But it has a fundamental limitation: it retrieves text snippets, not deep understanding. The model reads the retrieved context and does its best, but it's still reasoning with general-purpose knowledge. When the domain requires specialized vocabulary, consistent formatting, or nuanced reasoning that general models struggle with — retrieval alone isn't enough.

RAG vs Fine-Tuning: Two Ways to Know
RAG
The GP with a Textbook
Retrieves and reads at query time. Broad knowledge, always up-to-date, but reasoning stays general-purpose. Great when the answer is in the retrieved text.
Fine-Tuned
The Trained Surgeon
Domain expertise internalized in weights. Consistent format, specialized vocabulary, nuanced reasoning. Great when the model needs to think like a domain expert.

Four Signs You Need Fine-Tuning

Format Control

You need consistent JSON, XML, or domain-specific output structures. Prompt engineering gets you 80% consistency; fine-tuning gets you 99%.

Domain Vocabulary

Medical codes, legal terminology, financial instruments. The model misuses domain terms even when the right definitions are in context.

Consistent Style

Brand voice, technical writing style, or communication tone. Every response should feel like it came from the same expert.

Latency & Cost

A fine-tuned 8B model can replace a 70B+ model with RAG, running 5–10x faster and costing 80% less per token.

Five Signs You Should Stick with RAG

When RAG is the Right Answer
  • Rapidly changing data — your knowledge base updates daily (news, stock prices, product catalogs). Fine-tuned knowledge becomes stale the moment training ends.
  • Broad knowledge needs — you need answers across thousands of topics, not deep expertise in one domain.
  • Limited training data — you have fewer than 200 high-quality examples. Start with few-shot prompting and RAG instead.
  • Traceability required — you need to cite sources and show where the answer came from. RAG provides natural citations; fine-tuned knowledge is a black box.
  • It already works — your RAG pipeline gives satisfactory answers. Don't fine-tune because it sounds impressive.
RAG vs Fine-Tuning Decision Flowchart

1. Does your data change frequently? YES → Use RAG (knowledge stays current). NO → continue.
2. Need consistent format/style? YES → Fine-Tune (format control). NO → continue.
3. Domain terms misused? YES → Fine-Tune (vocabulary). NO → continue.
4. Need retrieval + expertise? YES → RAFT (both combined). NO → Stick with RAG.

Most production systems eventually use a combination. Start with RAG, add fine-tuning when you hit a wall.
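The decision flowchart above can be sketched as a small helper function. This is a toy illustration with hypothetical flag names, not a production rule engine:

```python
def choose_approach(data_changes_daily: bool,
                    needs_consistent_format: bool,
                    misuses_domain_terms: bool,
                    needs_retrieval_and_expertise: bool) -> str:
    """Walk the flowchart's questions top to bottom, first YES wins."""
    if data_changes_daily:
        return "RAG"          # fine-tuned knowledge would go stale
    if needs_consistent_format:
        return "fine-tune"    # format control
    if misuses_domain_terms:
        return "fine-tune"    # domain vocabulary
    if needs_retrieval_and_expertise:
        return "RAFT"         # retrieval + internalized expertise
    return "RAG"              # default: don't fine-tune without a reason

# The legal-chatbot scenario: retrieval works, but output format varies.
print(choose_approach(False, True, False, False))  # fine-tune
```

The function mirrors the ordering of the questions: freshness of data dominates everything else, because no amount of fine-tuning fixes stale knowledge.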
Decision Point
Your legal chatbot retrieves the correct contract clauses 95% of the time. But responses vary wildly in format — sometimes bullet points, sometimes paragraphs, sometimes missing key elements like the jurisdiction reference. What's the fix?
Correct!

Format consistency is where fine-tuning shines most. Prompt engineering can get you to ~80% format consistency, but fine-tuning on examples of the exact output format you want pushes that to 99%+. The model internalizes the structure as a pattern, not as a set of instructions it might interpret differently each time.

Not the best approach.

Better prompts help, but they can't guarantee consistency at scale — the model will still interpret formatting instructions differently in edge cases. Better retrieval doesn't address the core issue either: the retrieval already works (95% accuracy). The problem is output format, not input quality. Fine-tuning for format is the right answer: train on 1,000+ examples of perfectly formatted legal responses so the model internalizes the structure.

Part 2
LoRA — Teaching a Chef a New Cuisine

The Chef Analogy

Imagine a classically trained French chef who wants to learn Thai cooking. They don't start over from culinary school — they already know how to handle heat, balance flavors, time dishes. Instead, they add a small "adapter" of new techniques on top of their existing skills: how to use a wok, how to balance fish sauce with lime, how to make curry paste from scratch.

That's exactly how LoRA (Low-Rank Adaptation) works. Instead of retraining every parameter in a model (the equivalent of sending the chef back to culinary school), LoRA attaches small "adapter" matrices to specific layers. The original model weights stay frozen. Only the adapters learn the new domain.

Full Fine-Tuning vs. LoRA

A model like Llama 3 8B has about 8 billion parameters. Full fine-tuning means updating all 8 billion parameters during training. That requires massive GPU memory (at least 80GB VRAM for the model, gradients, and optimizer states) and risks destroying the model's general knowledge.

LoRA typically trains 0.1% to 1% of the parameters. For an 8B model, that's 8–80 million parameters. Research consistently shows that LoRA matches or comes within 1–2% of full fine-tuning on most tasks, while using 3–10x less GPU memory.

How LoRA Works — Adapter Weights

Full Fine-Tuning: all 8B parameters updated. GPU: 80+ GB VRAM. Risk: catastrophic forgetting. Cost: $$$
LoRA Fine-Tuning: 8B parameters frozen (unchanged), +8M trained in adapters. GPU: 6–24 GB VRAM. Risk: minimal forgetting. Cost: $

LoRA adds tiny adapter matrices while keeping the original model frozen. This is why you can fine-tune a 7B model on a single consumer GPU.
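The memory figures above can be reproduced with rough back-of-envelope arithmetic. The per-parameter byte counts below are common approximations (fp16 weights and gradients, Adam optimizer states in fp32) and exclude activations, which add several more GB:

```python
# Back-of-envelope training-memory math for an 8B model (approximate).
params = 8e9

# Full fine-tuning: weights (2 B) + gradients (2 B) + Adam states (~8 B)
# per parameter, all 8B parameters trained.
full_ft_gb = params * (2 + 2 + 8) / 1e9
print(f"full fine-tuning: ~{full_ft_gb:.0f} GB")  # ~96 GB

# LoRA: base weights frozen in fp16 (no gradients, no optimizer states),
# only the ~8M adapter parameters carry the full training overhead.
adapter = 8e6
lora_gb = (params * 2 + adapter * (2 + 2 + 8)) / 1e9
print(f"LoRA: ~{lora_gb:.0f} GB")                 # ~16 GB + activations

# QLoRA: base weights quantized to 4-bit (0.5 B/param) + same adapter.
qlora_gb = (params * 0.5 + adapter * (2 + 2 + 8)) / 1e9
print(f"QLoRA: ~{qlora_gb:.0f} GB")               # ~4 GB + activations
```

The exact totals depend on sequence length, batch size, and gradient checkpointing, which is why real-world numbers land in ranges like "6–24 GB" rather than a single value.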
How LoRA Actually Works — Low-Rank Matrices

In a transformer model, the heavy lifting happens in the attention layers. Each attention layer has weight matrices (typically called Q, K, V, and O) that determine how the model processes relationships between tokens.

During full fine-tuning, you'd update these massive matrices directly. A single attention weight matrix in an 8B model might be 4096 x 4096 — that's 16.7 million parameters in just one matrix.

LoRA's insight: instead of updating the full matrix W, decompose the update into two much smaller matrices:

W' = W + A x B
where A is (4096 x 16) and B is (16 x 4096)

The original matrix W has 16.7M parameters. The two adapter matrices A and B together have only 131,072 parameters (4096 x 16 x 2). That's a 128x reduction.

The number 16 in this example is the LoRA rank (often written as r). It controls how much "capacity" the adapter has to learn new patterns. Higher rank = more capacity but more parameters and memory.
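The decomposition is easy to verify numerically. A minimal NumPy sketch with the exact dimensions from the text (following the convention of initializing one adapter matrix to zero, so training starts from the unmodified base model):

```python
import numpy as np

d, r = 4096, 16                    # layer width and LoRA rank

W = np.random.randn(d, d) * 0.01   # frozen pretrained weight, 16.7M params
A = np.random.randn(d, r) * 0.01   # trainable adapter, 4096 x 16
B = np.zeros((r, d))               # trainable adapter, 16 x 4096
                                   # (zero init: W_eff == W before training)

W_eff = W + A @ B                  # the effective weight actually applied

full_params = W.size               # 16,777,216
lora_params = A.size + B.size      # 131,072
print(f"reduction: {full_params // lora_params}x")  # reduction: 128x
```

During training only A and B receive gradients; W never changes. At inference you can either keep the adapters separate or merge A @ B into W once, which makes LoRA free at serving time.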

LoRA Rank — Finding the Sweet Spot

Think of LoRA rank like the width of a highway. A rank of 4 is a single-lane road — cheap and fast, but limited capacity. A rank of 64 is a six-lane highway — lots of capacity, but expensive and often unnecessary.

In practice, rank 16 is the sweet spot for most tasks. Here's why:

LoRA Rank   Trainable Params (8B model)   Performance vs Full FT   Best For
r = 4       ~2M (0.025%)                  90–95%                   Simple style transfer, format control
r = 16      ~8M (0.1%)                    97–99%                   Most domain adaptation tasks
r = 64      ~33M (0.4%)                   99–100%                  Complex reasoning, new language
r = 256     ~131M (1.6%)                  ~100%                    Rarely needed, diminishing returns
Practical Anchor

Start with rank 16. If your validation loss plateaus early, drop to 8. If it's still improving when training ends, try 32. Going above 64 rarely helps and significantly increases memory usage.

QLoRA — Fine-Tuning on Consumer Hardware

QLoRA combines LoRA with quantization. Think of quantization like compressing a high-resolution photo into a JPEG — you lose a tiny bit of detail, but the file is 4x smaller.

In QLoRA, the base model is compressed from 16-bit to 4-bit precision (that's the quantization part), and then LoRA adapters are trained on top at full precision. The result: you can fine-tune a 13B parameter model on a single GPU with 24GB VRAM, or an 8B model on as little as 6GB.

Without QLoRA

8B model needs ~32GB VRAM for LoRA training. 13B model needs ~52GB. You need an A100 ($2+/hour).

With QLoRA

8B model needs ~6GB VRAM. 13B model needs ~10GB. A single RTX 4090 ($0.40/hour) works fine.
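In practice this setup is a few lines with the Hugging Face transformers, peft, and bitsandbytes libraries. A minimal sketch, with an illustrative model name and the hyperparameters discussed in this post; treat it as a starting point, not a tuned recipe:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",          # illustrative; any causal LM works
    quantization_config=bnb_config,
    device_map="auto",
)

# Full-precision LoRA adapters on top of the quantized base
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # on the order of 0.1% of total
```

From here the model plugs into any standard training loop or trainer; the quantized base stays frozen and only the adapters learn.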

Decision Point
You have a single GPU with 24GB VRAM. You need to fine-tune a 13B parameter model for domain-specific medical terminology. Can you do it, and how?
Exactly right!

QLoRA is designed for exactly this situation. By quantizing the base model to 4-bit precision, a 13B model fits in ~10GB VRAM, leaving plenty of room on your 24GB GPU for the LoRA adapters and training overhead. Research shows QLoRA loses less than 1% accuracy compared to full-precision fine-tuning on most benchmarks.

There's a better option.

You don't need to give up on the 13B model. QLoRA quantizes the base model to 4-bit precision, reducing memory from ~52GB to ~10GB. With LoRA adapters on top, you can fine-tune the full 13B model on your 24GB GPU. The quality loss from quantization is minimal (less than 1% on most benchmarks), and 13B models significantly outperform 7B on medical terminology tasks.

Part 3
The Training Recipe

LIMA — Less Is More for Alignment

One of the most counterintuitive findings in fine-tuning comes from a 2023 paper called LIMA: "Less Is More for Alignment." The researchers fine-tuned a 65B parameter model on just 1,000 carefully curated examples and found it competed with models trained on 52,000 examples from the same domain.

The insight: data quality beats data quantity by a massive margin. One thousand perfect examples teach the model a pattern. One hundred thousand noisy examples teach the model to be mediocre.

What makes a "perfect" training example?
The Recipe for Quality Data

Correct: The answer must be factually accurate. One wrong example can embed a permanent error in the model's behavior.

Complete: The response should be as thorough as you'd want in production. Don't use abbreviated training examples.

Consistent: Every example should follow the exact format and style you want in production. If you want JSON, every response should be valid JSON.

Representative: Your training set should cover the range of inputs you'll see in production — including edge cases, ambiguous inputs, and "I don't know" scenarios.

How Much Data Do You Need?

1. Minimum Viable — 200–500 examples

Good enough to see if fine-tuning helps. Useful for proof of concept and format control.

2. Production Quality — 1,000–3,000 examples

The sweet spot for most domain tasks. Enough to learn vocabulary, style, and reasoning patterns.

3. Enterprise Grade — 5,000–50,000 examples

For complex reasoning tasks, multi-step workflows, or when you need the model to generalize across diverse inputs.

The Hyperparameters That Actually Matter

Fine-tuning has dozens of knobs you can turn. But in practice, four matter most:

Epochs, Learning Rate, Batch Size & Early Stopping

Epochs — how many times the model sees your entire dataset.

Think of it like studying for an exam. Reading your notes once (1 epoch) gives a rough understanding. Reading them 2–3 times (2–3 epochs) solidifies the knowledge. Reading them 20 times (20 epochs) means you've memorized specific sentences without understanding the material — that's overfitting.

Optimal: 2–3 epochs for most tasks. More than 4 almost always leads to overfitting, especially with small datasets.

Learning rate — how aggressively the model updates its weights on each step.

Think of it like adjusting a satellite dish. Too high a learning rate (big adjustments) and you keep overshooting the signal. Too low (tiny adjustments) and you'll take forever to find it. Just right and you converge on the best reception quickly.

Baselines: Full fine-tuning: 2e-5. LoRA: 5e-5 (higher because you're training fewer parameters). QLoRA: 1e-5 to 2e-5 (lower because quantization adds noise).

Batch size — how many examples the model processes before updating weights.

Think of it like grading student essays. Grading one essay at a time (batch size 1) gives detailed but noisy feedback. Grading 8 at once (batch size 8) gives smoother, more balanced feedback because individual quirks average out.

Typical: 4–8 for most fine-tuning. Smaller batches (1–2) work when VRAM is limited. Larger batches (16–32) can speed up training but may need learning rate adjustments.

Early stopping — automatically stopping training when the model starts overfitting.

During training, you track performance on a held-out validation set (typically 10–20% of your data). When validation loss starts rising while training loss keeps dropping, the model is memorizing your training data instead of learning patterns. Stop there.
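The early-stopping logic itself is simple. A minimal patience-based sketch of the idea (training frameworks implement the same thing via callbacks):

```python
def should_stop(val_losses: list[float], patience: int = 2) -> bool:
    """Stop when validation loss has not improved for `patience` evals."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses)
    # How many evaluations ago did the best value occur?
    evals_since_best = len(val_losses) - 1 - val_losses.index(best)
    return evals_since_best >= patience

# Validation loss per epoch: improves, bottoms out, then climbs (overfitting).
history = [1.90, 1.42, 1.31, 1.35, 1.44]
stops = [should_stop(history[:i + 1]) for i in range(len(history))]
print(stops)  # [False, False, False, False, True]
```

With patience 2, training stops two evaluations after the epoch-3 minimum, and you keep the checkpoint from the best epoch, not the last one.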

Training Loss vs Validation Loss — When to Stop

[Chart: training loss falls steadily across epochs; validation loss bottoms out around epoch 2–3 (the "stop here" point), then climbs into the overfitting zone]

Training loss always decreases. When validation loss starts climbing, the model is memorizing instead of learning. That inflection point is where you stop.
Baseline First, Always

Before fine-tuning, measure your base model's performance on your task. If GPT-4 with good prompts scores 85% on your evaluation set, you know the ceiling. If your fine-tuned Llama 8B scores 82%, that's a huge win at 10x lower cost. Without a baseline, you can't tell if fine-tuning helped or hurt.

Part 4
Catastrophic Forgetting — The Hidden Danger

The Bilingual Speaker Problem

Imagine someone who speaks fluent English and French. They move to Japan and study Japanese intensively for a year, practicing 8 hours a day. After a year, their Japanese is excellent — but they discover they've started struggling with French. They haven't spoken it in months, and the Japanese patterns keep interfering.

That's catastrophic forgetting. When you fine-tune a model on domain-specific data, the new gradients (the weight updates from training) can overwrite the general knowledge the model learned during pre-training. The model gets great at your domain but loses basic capabilities.

What It Looks Like in Practice

Real Case: Medical Fine-Tuning Gone Wrong

A team fine-tuned an LLM on 10,000 medical case studies. The model became excellent at diagnosing conditions from symptoms. But when they tested basic conversational abilities, they found:

  • "What's 15 x 23?" — Model responded with a diagnosis instead of 345
  • "Write me a haiku" — Model responded with medical terminology
  • "Hello, how are you?" — Model responded: "Patient presents with..."

The model had overfit to the medical domain so heavily that it forgot how to be a general-purpose language model.

How to Prevent It

Use LoRA (Best Defense)

LoRA keeps the original weights frozen. The adapter learns the new domain while the base model retains all general knowledge. This is the single most effective prevention.

Mix General Data (10–20%)

Add 10–20% general instruction-following data to your training mix. This reminds the model "you're still a general assistant that also knows this domain."

Use Validation Benchmarks

Track performance on both domain tasks AND general benchmarks (MMLU, HellaSwag). If general scores drop more than 5%, you're forgetting.

Train Less (2–3 Epochs)

Forgetting accelerates with more training. Limiting to 2–3 epochs reduces the chance of overwriting general knowledge.
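Mixing general data into the training set is a one-liner in principle. A small sketch of the blending step, with hypothetical example records (in practice these would be chat-formatted instruction pairs):

```python
import random

def mix_datasets(domain: list, general: list, general_frac: float = 0.15,
                 seed: int = 0) -> list:
    """Blend general instruction data into a domain set at `general_frac`."""
    # Solve n_general / (len(domain) + n_general) == general_frac
    n_general = round(len(domain) * general_frac / (1 - general_frac))
    rng = random.Random(seed)
    mixed = domain + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)  # interleave so batches see both distributions
    return mixed

domain = [{"q": f"finance question {i}"} for i in range(850)]
general = [{"q": f"general question {i}"} for i in range(5000)]

mixed = mix_datasets(domain, general, general_frac=0.15)
n_gen = sum("general" in ex["q"] for ex in mixed)
print(f"{len(mixed)} examples, {n_gen / len(mixed):.0%} general")
# 1000 examples, 15% general
```

Shuffling matters: if the general examples were appended at the end, the final training steps would pull the model back toward generic behavior and away from the domain.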

Decision Point
After fine-tuning your model on financial data, it aces questions about bond yields and portfolio analysis. But when a user asks "What's the capital of France?" the model responds with a stock market analysis. What happened, and what's the fix?
Correct!

This is textbook catastrophic forgetting. The model's financial training overwrote its general world knowledge. The fix has two parts: (1) switch to LoRA so original weights stay frozen, and (2) mix 10–20% general instruction-following data into your training set so the model retains basic capabilities alongside its new financial expertise.

Not quite.

This is catastrophic forgetting, not underfitting or bad prompting. The model isn't failing at finance — it's failing at everything else. During fine-tuning, the gradients from financial data overwrote the neurons responsible for general knowledge. The fix: use LoRA (freezes original weights) and mix in 10–20% general data to maintain broad capabilities.

Part 5
Synthetic Data — When You Don't Have Enough Examples

The Teacher-Student Pipeline

Imagine you're starting a new restaurant and need to train a sous chef. You can't afford to have them study under a master for years, but you can have a renowned chef create a book of 3,000 signature recipes with detailed notes on technique, timing, and plating. Your sous chef studies that book and becomes highly skilled — not by learning directly from the master, but by learning from the master's output.

That's synthetic data generation for fine-tuning. You use a powerful model (GPT-4, Claude, DeepSeek-R1) as the "teacher" to generate training examples, then fine-tune a smaller, cheaper model (the "student") on those examples. The student learns to mimic the teacher's capabilities at a fraction of the cost.

The Pipeline

Synthetic Data Pipeline

Teacher Model (GPT-4 / Claude / DeepSeek) → Quality Filter (critic model validates) → Clean Dataset (1,000–5,000 examples) → Student Model (Llama / Mistral, fine-tuned)

The teacher generates, the critic filters, the student learns. This is how companies build domain-specific open-source models that rival closed APIs.

Five Rules for Quality Synthetic Data

Practical Guide: Building a Synthetic Data Pipeline

1. Use persona variation in prompts. Don't ask the teacher model the same question the same way every time. Vary the persona: "You are a senior cardiologist," "You are a first-year medical resident," "You are a patient asking about their condition." This creates diversity in your training data and prevents the student from only learning one communication style.

2. Ground with source passages. When generating training data, provide the teacher model with real source documents. This prevents hallucinated training data — you don't want the teacher making up facts that the student then learns as truth.

3. Use a critic model for quality control. After the teacher generates examples, run them through a separate model (or the same model with a different prompt) that scores quality, correctness, and relevance. Only keep examples above a quality threshold (typically top 70–80%).

4. Enforce schema consistency. If your training examples need to follow a specific format (like JSON with required fields), validate every generated example against the schema before including it. One malformed example can teach the model bad habits.

5. Self-validation loops. Ask the teacher to generate an answer, then ask it to verify its own answer. Discard any examples where the verification step identifies errors. This catches the ~10–15% of outputs where even GPT-4 makes mistakes.
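Rules 3 and 4 can be sketched in a few lines of pure Python. The critic here is a stand-in scoring function; in a real pipeline it would be another LLM call returning a quality score, and the required fields are illustrative:

```python
import json

REQUIRED_FIELDS = {"question", "answer", "source"}

def valid_schema(raw: str) -> bool:
    """Rule 4: keep only examples that parse and carry required fields."""
    try:
        ex = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(ex, dict) and REQUIRED_FIELDS <= ex.keys()

def filter_examples(raw_examples: list[str], critic, keep_top: float = 0.75):
    """Rule 3: schema-check, score with a critic, keep the top fraction."""
    parsed = [json.loads(r) for r in raw_examples if valid_schema(r)]
    scored = sorted(parsed, key=critic, reverse=True)
    return scored[: max(1, int(len(scored) * keep_top))]

# Stand-in critic: a real pipeline would call a judge LLM here.
critic = lambda ex: len(ex["answer"])

raw = [
    '{"question": "Q1", "answer": "long detailed answer", "source": "doc1"}',
    '{"question": "Q2", "answer": "ok", "source": "doc2"}',
    '{"question": "Q3", "answer": "short"}',  # missing "source": dropped
    'not even JSON',                          # malformed: dropped
]
kept = filter_examples(raw, critic)
print(len(kept))  # 1
```

The order matters: schema validation is cheap, so it runs first; the expensive critic call only sees examples that are already structurally sound.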

Distillation Ethics Note

Most closed-source APIs (OpenAI, Anthropic) have terms of service that restrict using their output to train competing models. DeepSeek-R1 and other open models can be used freely for distillation. Always check the license before building a teacher-student pipeline.

Part 6
RAFT & Embedding Fine-Tuning

RAFT — The Best of Both Worlds

Remember the chef analogy? RAG is the chef reading a recipe book. Fine-tuning is the chef who learned from residency. But what about a chef who went through residency AND keeps a recipe book nearby for reference?

That's RAFT (Retrieval-Augmented Fine-Tuning). You fine-tune the model to be better at using retrieved context. Instead of just teaching domain knowledge, you teach the model how to reason with retrieved documents — which parts to trust, which to ignore, and how to combine multiple sources.

How RAFT Works

During training, RAFT deliberately mixes two types of examples:

Oracle Examples (70%)

The correct document is included alongside 3–4 "distractor" documents. The model learns to identify and extract the right information from noisy context.

Distractor-Only Examples (30%)

No relevant document is provided. The model learns to rely on its own parametric knowledge when retrieval fails — and to say "I don't know" when appropriate.
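Assembling that 70/30 mix is straightforward. A sketch of the example-construction step, with hypothetical record fields (a real pipeline would also attach the target answer with chain-of-thought citations):

```python
import random

def build_raft_example(question: str, oracle: str, corpus: list[str],
                       rng: random.Random, oracle_frac: float = 0.7,
                       n_distractors: int = 3) -> dict:
    """Compose one RAFT training example: noisy context + question."""
    distractors = rng.sample([d for d in corpus if d != oracle], n_distractors)
    if rng.random() < oracle_frac:
        context = distractors + [oracle]  # oracle hidden among distractors
        rng.shuffle(context)
    else:
        context = distractors             # distractor-only: no right answer
    return {"question": question, "context": context,
            "has_oracle": oracle in context}

rng = random.Random(42)
corpus = [f"doc {i}" for i in range(50)]
examples = [build_raft_example(f"q{i}", "doc 0", corpus, rng)
            for i in range(1000)]
frac = sum(e["has_oracle"] for e in examples) / len(examples)
print(f"{frac:.0%} oracle examples")  # ~70%
```

The distractor-only 30% is what teaches the model to fall back on parametric knowledge, or to decline to answer, when retrieval comes up empty.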

RAFT Training Process

During training: Question + Oracle Doc (correct) + 3 Distractor Docs → the model learns to find signal in noise.
At inference: User Query → RAG Retrieval (real documents) → RAFT Model (trained for noisy context) → Better Answer (+20% accuracy).

RAFT trains the model with a mix of correct and irrelevant documents, teaching it to extract signal from noise. At inference, it applies this skill to real RAG retrieval — getting 20%+ accuracy improvements.

Real-World RAFT Results

Use Case                Before RAFT                 After RAFT                After RAFT: Impact
Invoice extraction      GPT-4 at $10/1K invoices    Mistral 7B at $1.20/1K    88% cost reduction
Medical Q&A             72% accuracy (RAG alone)    92% accuracy (RAFT)       20-point accuracy boost
Podcast summarization   Generic summaries           Domain-aware highlights   Custom model, no API cost

Embedding Fine-Tuning — Teach the Search Itself

Everything we've covered so far fine-tunes the answer model — the LLM that generates responses. But what about the search model — the embedding model that decides which documents are relevant to a query?

Think of it this way: if your search keeps returning the wrong documents, no amount of fine-tuning the answer model will help. You need to teach the search itself to understand your domain.

Matryoshka Representation Learning

Think of Russian nesting dolls (matryoshka). Each doll contains a smaller version of itself. Matryoshka Representation Learning creates embeddings that work the same way: the full 1024-dimension embedding is the most detailed representation, but the first 256 dimensions are a perfectly usable smaller version, and even the first 64 dimensions capture the core meaning.

This matters for production because you can use smaller embeddings for fast initial filtering, then full-size embeddings for precise reranking — without maintaining separate models.

Matryoshka Embeddings: Nested Precision

1024 dims — Full detail: precise reranking, final scoring
256 dims — Good for filtering: fast candidate retrieval, 4x less storage
64 dims — Core meaning: quick clustering, 16x less storage

Like Russian nesting dolls, the first N dimensions form a complete (smaller) embedding. Use 64 dims for fast filtering, 256 for retrieval, 1024 for precise reranking — all from one model.
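The truncation mechanic itself is trivial, as the NumPy sketch below shows. Note the caveat: these are random stand-in vectors. A real Matryoshka model is trained with a loss at each prefix length so the truncated prefixes stay semantically meaningful; truncating an ordinary embedding model's output degrades quality much faster:

```python
import numpy as np

def truncate_embedding(v: np.ndarray, dims: int) -> np.ndarray:
    """Keep only the first `dims` dimensions, then re-normalize."""
    t = v[:dims]
    return t / np.linalg.norm(t)

rng = np.random.default_rng(0)
doc = rng.standard_normal(1024)                 # stand-in 1024-dim embedding
query = doc + 0.1 * rng.standard_normal(1024)   # a "similar" vector

# Cosine similarity can be computed at any prefix length: cheap 64-dim
# filtering, precise 1024-dim reranking, all from one stored vector.
for dims in (1024, 256, 64):
    d, q = truncate_embedding(doc, dims), truncate_embedding(query, dims)
    print(f"{dims:>4} dims: cosine = {float(d @ q):.3f}")
```

In a retrieval pipeline you embed every document once at full size, filter candidates with the short prefixes, and rerank the survivors with the full vectors, with no second model to maintain.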

When to Fine-Tune Embeddings vs. the LLM

Fine-Tune Embeddings When...
  • Search returns irrelevant results despite good data
  • Domain vocabulary confuses the embedding model
  • Queries use different terms than documents
  • You need multilingual search in a specialized domain
Fine-Tune the LLM When...
  • Search works but the model misinterprets results
  • You need specific output format or style
  • Domain reasoning requires specialized knowledge
  • You want to replace a larger model with a smaller one
Decision Point
Your e-commerce search returns hiking boots when users search for "waterproof trail runners." The product descriptions are accurate, but the embedding model treats "hiking boots" and "trail runners" as nearly identical. Meanwhile, the LLM answers questions about the returned products perfectly. What should you fine-tune?
Exactly right!

The problem is in the search, not the answer. The LLM already handles products well. The embedding model needs to learn that trail runners and hiking boots are different categories in your product domain, even though they share many terms. Fine-tuning the embedding model on product pairs (similar vs. different) will teach it these domain-specific distinctions.

Wrong layer.

The LLM is already working fine — it answers questions about products perfectly. The problem is that the wrong products are being retrieved. No amount of LLM fine-tuning will fix search results. You need to fine-tune the embedding model to understand that trail runners and hiking boots are different product categories, even though they share terms like "outdoor," "rugged," and "grip."

Fine-tuning teaches the model new skills. But how do you measure whether those skills actually work? How do you know if your fine-tuned model is better, worse, or just different? That's the domain of evaluation — the subject of the next post in this series.

وَاللَّهُ أَعْلَمُ

And Allah knows best

وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ

May Allah's peace and blessings be upon our master Muhammad and his family

Practice Mode

Test your understanding with real-world fine-tuning scenarios.

Scenario 1 of 4

Your medical chatbot uses RAG to retrieve clinical guidelines. It retrieves the right documents 90% of the time, but frequently misuses domain terms — confusing "contraindicated" with "not recommended," using "hypertension" and "high blood pressure" inconsistently, and formatting drug dosages differently each time.

What's the best approach?
A
Fine-tune the LLM on 2,000 correctly formatted medical responses to teach domain vocabulary and consistent formatting.
B
Improve the RAG pipeline — add more context about terminology standards to the retrieved documents.
C
Better prompts — add a comprehensive style guide to the system prompt with all terminology rules.

Cheat Sheet

The essential reference for LLM fine-tuning.

RAG vs Fine-Tuning

RAG: Changing data, broad knowledge, traceability needed
Fine-Tune: Format control, domain vocabulary, consistent style, speed
RAFT: Need both retrieval AND domain expertise
Start with: RAG first, add FT when you hit a wall

LoRA Essentials

What: Small adapter matrices, original weights frozen
Rank 16: Sweet spot for most tasks (0.1% params)
QLoRA: 4-bit base + full-precision adapters
Memory: 8B model: ~6GB (QLoRA) vs ~32GB (LoRA) vs 80GB+ (full)

Training Recipe

Data: 1,000–3,000 quality examples for production
Epochs: 2–3 optimal (more = overfitting)
LR: 2e-5 (full), 5e-5 (LoRA), 1e-5 (QLoRA)
Batch: 4–8 typical
LIMA: Quality > quantity, always

Forgetting Prevention

Symptom: Domain tasks improve, general tasks degrade
Fix 1: Use LoRA (original weights frozen)
Fix 2: Mix 10–20% general data in training
Fix 3: Monitor general benchmarks (MMLU, HellaSwag)
Fix 4: Limit to 2–3 epochs

Synthetic Data Pipeline

Teacher: GPT-4 / Claude / DeepSeek-R1 generates
Critic: Separate model validates quality
Student: Smaller model trains on filtered output
Key: Persona variation, source grounding, schema validation

RAFT & Embeddings

RAFT: 70% oracle docs + 30% distractor-only
Result: Mistral 7B matching GPT-4 on invoice tasks at ~12% of the cost
Embedding FT: When search returns wrong results
Matryoshka: Multi-resolution embeddings (64–1024 dims)

What to Fine-Tune

Wrong answers: Fine-tune the LLM
Wrong documents: Fine-tune embeddings
Both problems: RAFT
Not enough data: Synthetic pipeline first
<200 examples: Don't fine-tune yet

Cost Impact

GPT-4 API: $30–60 / 1M tokens
Fine-tuned 8B: $0.20 / 1M tokens (self-hosted)
Training cost: $5–50 for LoRA on cloud GPU
Break-even: ~1M tokens/month makes FT worthwhile

Where to Go Deep
  • RAG & Agents — understand the RAG pipeline limitations that fine-tuning addresses (prerequisite for this post)
  • From Prompt to GPU — understand the inference pipeline you're modifying when you fine-tune
  • Evaluation — the next post: how to measure whether fine-tuning actually worked