Deep Dive

The Math That Makes LLMs Work

You already use math every day — you just don't call it math. This post shows you that the concepts behind language models are things you already understand, dressed up in notation.

35 min read
5 Parts, 18 Concepts
Zero prerequisites
Concepts covered: probability P(x) · softmax · gradients · entropy H(p) · attention (QKᵀ) · KL divergence (D_KL) · clipping (ε) · expected value E[x] · log probabilities (log p)

Pick up your phone. Open a text message. Type "I'm on my".

Your keyboard suggests: "way". Maybe "own". Maybe "phone".

That autocomplete? It's a language model. A tiny one, but it's doing the same fundamental thing that ChatGPT, Claude, and every LLM does: looking at the words you've typed and predicting a probability distribution over what comes next. "Way" gets 60%. "Own" gets 15%. "Phone" gets 8%. Everything else shares the remaining 17%.

That's it. That's the core idea. Every language model is a next-token probability machine.

The math that makes this work — probability, softmax, attention, gradients, entropy — sounds intimidating in a textbook. But you already use these concepts every single day. You just never had someone show you the connection. This post is that connection. By the end, you'll look at every formula you'll encounter in LLM research and think: "Oh, that's just the math for what I already understood."

What you'll learn: the foundation (probability, expected value, the law of large numbers), the LLM machinery (softmax, log probabilities, temperature, sampling, entropy), how models learn (loss, gradients, regularization), the transformer (matrix multiplication, attention, embeddings), and the RL math behind reasoning models.

This is for you if: you want to understand reasoning models deeply but math notation makes you anxious. Every formula here gets a real-life analogy first.


بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

In the name of Allah, the Most Gracious, the Most Merciful

Part 1
The Foundation
Three concepts everything else builds on — and you already understand all of them

1. Probability & Distributions

You already understand probability. You use it every time you check the weather forecast, decide whether to bring an umbrella, or pick a restaurant based on reviews. "There's a 70% chance of rain" — that's a probability. It tells you: out of many days with conditions like today, about 70 out of 100 would have rain.

A probability distribution is just the complete picture — all the possible outcomes and how likely each one is. When your phone keyboard suggests the next word, it's computing a probability distribution: a list of every possible next word, each with a probability, and all of them adding up to 100%.

Your Phone Keyboard is a Language Model
After typing "I'm on my", the keyboard computes a probability distribution over possible next words
Input
I'm on my ___
Probability Distribution
way
60%
own
15%
phone
8%
mind
5%
break
3%
...rest
9%
All probabilities sum to 100% — the model must pick something

This is exactly what GPT-4, Claude, and every LLM does — just with a vocabulary of 100,000+ tokens instead of a few words, and context spanning thousands of tokens instead of three. The model looks at everything you've typed so far and produces a probability distribution over what token comes next. Then it picks one (we'll cover how it picks in the sampling section). Then it looks at everything including that new token, produces another distribution, picks again. Repeat until done.

This is what "next-token prediction" means. Every time you interact with ChatGPT, Claude, or any LLM, the model is generating a probability distribution over the entire vocabulary for the next token — then sampling from it. This happens hundreds of times per response. Everything else in this post is about how that distribution gets computed, shaped, and improved.

Two key properties of any probability distribution: (1) every probability is between 0 and 1, and (2) they all sum to 1. Nothing can have negative probability, and the model must predict something. These constraints might seem obvious, but they'll matter a lot when we get to softmax — the function that enforces them.
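To make those two properties concrete, here's a tiny Python sketch. The probabilities are the illustrative keyboard numbers from above, not from any real model:

```python
import random

# Toy next-token distribution after "I'm on my" (illustrative numbers).
next_token_probs = {"way": 0.60, "own": 0.15, "phone": 0.08, "mind": 0.05,
                    "break": 0.03, "<other>": 0.09}

# Property 1: every probability is between 0 and 1.
assert all(0.0 <= p <= 1.0 for p in next_token_probs.values())
# Property 2: they sum to 1 -- the model must predict something.
assert abs(sum(next_token_probs.values()) - 1.0) < 1e-9

# Sampling: pick a token in proportion to its probability.
token = random.choices(list(next_token_probs), weights=next_token_probs.values())[0]
print(token)  # "way" about 60% of the time
```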

2. Expected Value & Variance

Imagine you're comparing two job offers. Company A pays $100,000, guaranteed. Company B pays $50,000 base with a 50% chance of a $120,000 bonus — so either $50K or $170K, each with equal odds. Which offer is "better"?

You need a single number that captures the overall worth of Company B. But how do you boil two possible outcomes into one number? You can't just average $50K and $170K and call it $110K — that only works because both outcomes are equally likely (50/50). What if the bonus had a 90% chance? Then the $170K outcome should count way more than the $50K outcome. The "average" has to care about how likely each outcome is.

This is the key insight: each outcome should contribute to the average in proportion to how likely it is. An outcome that happens 90% of the time should have 9x the influence of one that happens 10% of the time. The natural way to do this? Multiply each outcome by its probability, then add them up. That's the expected value — and once you see why it has to work this way, the formula feels inevitable rather than arbitrary.

But expected value only tells half the story. Company B's expected value is $110K — higher than Company A's $100K. But Company B is also riskier. Half the time you get $170K, half the time you get $50K. That spread — how far outcomes scatter from the average — is the variance. Two distributions can have the same expected value but wildly different variance. One feels safe. The other feels like gambling.

Expected Value vs Variance: Two Job Offers
Same question, two very different distributions — expected value alone doesn't tell the full story
Company A
Guaranteed salary
100%
$100K
Expected value: $100K
Variance: $0
Company B
Base + possible bonus
50%
$50K
50%
$170K
Expected value: $110K
Variance: High
Building the formulas: why expected value is a weighted sum
Let's discover this step by step

Imagine Company B's deal plays out across 1,000 parallel universes. In 500 of them (50%), you earn $50K. In the other 500 (50%), you earn $170K. What's your average across all universes?

Simple arithmetic: (500 × $50K + 500 × $170K) / 1000 = $110K.

Now watch what happens when you simplify
1
Start: (500 × $50K + 500 × $170K) / 1000
2
Distribute the division: (500/1000) × $50K + (500/1000) × $170K
3
But 500/1000 = 0.5 — that's just the probability!
4
So: 0.5 × $50K + 0.5 × $170K = $110K — outcome × probability, summed up

That's the entire trick. The "expected value formula" isn't some arbitrary definition mathematicians invented. It falls out naturally from asking "what's the average across many repetitions?" When you simplify "count of each outcome / total" into probabilities, you get: each outcome weighted by how often it happens.

The formula:

E[X] = Σ_i x_i · P(x_i)

This isn't a definition to memorize — it's the only formula that gives you the long-run average. Each outcome (x_i) contributes to the sum in exact proportion to how likely it is (P(x_i)). A 90% outcome dominates. A 1% outcome barely registers. That's what "weighted" means — and it's why this formula works.

Now: variance. You know the average is $110K. But how spread out are the outcomes? You need to measure the "typical distance from the average."

First attempt: just average the distances. $50K is $60K below the mean. $170K is $60K above. Average distance = (60K + 60K) / 2 = $60K. But there's a problem: if you have outcomes both above and below, the positive and negative distances cancel out. A distribution with outcomes at $109K and $111K (tiny spread) would get the same "average signed distance" as one with $50K and $170K (huge spread) — both sum to zero.

Fix: square the distances first. Squaring does two things: (1) makes everything positive so nothing cancels, and (2) makes big deviations count way more than small ones. A $60K gap becomes 60,000² = 3,600,000,000. A $1K gap becomes just 1,000,000. The big gap counts 3,600x more. This is exactly what you want — variance should scream when outcomes are wildly spread out.

Var(X) = Σ_i (x_i − μ)² · P(x_i)

Same idea as expected value: weight each squared distance by probability, sum them up. It's an expected value — the expected value of "how far am I from the average, squared."
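Here's the Company B scenario run through both formulas — a quick sketch with the job-offer numbers from above:

```python
# Outcomes and probabilities for Company B: $50K or $170K, 50/50.
outcomes = [50_000, 170_000]
probs    = [0.5, 0.5]

# Expected value: each outcome weighted by its probability.
expected = sum(x * p for x, p in zip(outcomes, probs))          # 110000.0

# Variance: expected squared distance from the mean.
variance = sum((x - expected) ** 2 * p for x, p in zip(outcomes, probs))
print(expected, variance)  # 110000.0  3600000000.0  (standard deviation = $60K)
```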

Why this matters for LLMs: When a model generates multiple responses to the same question (self-consistency), each response is a sample from a distribution. The expected value tells us the "average quality." Variance tells us how much quality jumps around between samples. High variance = unreliable. A model that gives you a brilliant answer, then garbage, then brilliant again has high variance even if the expected quality is good. Techniques like baselines in RL training exist specifically to reduce variance — making the training signal more stable — while keeping the expected value the same. The math guarantees you can do this: subtract a constant from all rewards, and the expected gradient direction stays identical, but the noise drops dramatically.

To make that concrete: when you ask a model the same question 10 times, you might get 7 correct answers and 3 wrong ones. The expected value of "correctness" is 70%. But the variance — how much the quality jumps around — determines whether the model is reliably useful or frustratingly inconsistent. A huge part of making LLMs better is reducing variance while maintaining or increasing expected value.

3. The Law of Large Numbers

Flip a coin once. You get heads. Is the coin fair? You have no idea — one sample tells you nothing. Flip it 10 times: 7 heads, 3 tails. Maybe it's biased, maybe you got lucky. Flip it 10,000 times: 5,023 heads, 4,977 tails. Now you're confident — this coin is fair. The ratio gets closer and closer to 50/50 as you flip more.

That's the law of large numbers: the more samples you take, the closer your observed average gets to the true expected value. It sounds obvious, but it's the mathematical foundation for one of the most powerful techniques in reasoning model research.

More Samples = More Reliable Results
As sample count increases, the observed average converges to the true value
1 sample
100%
Heads
Way off
10 samples
70%
Heads
Noisy
100 samples
54%
Heads
Getting close
10,000
50.2%
Heads
Converged
True probability: 50% — more samples = closer to truth

Self-consistency uses this directly. Instead of generating one answer and hoping it's right, you generate 10 answers and take a majority vote. If the model gets the answer right 70% of the time, a single sample might be wrong — but with 10 samples, the majority vote is almost certainly right. The law of large numbers is why 10 samples beats 1.

Quick Check
A model answers a math question correctly 60% of the time. You generate 10 responses and take a majority vote. Will the majority vote be correct more or less than 60% of the time?
About the same — 60%, since that's the model's accuracy
More than 60% — the majority vote filters out noise
Less than 60% — averaging dilutes the best answer
Exactly right

If a model is correct 60% of the time, a majority vote across 10 independent samples is correct about 83% of the time. That's the law of large numbers at work — errors are random and tend to cancel out, while correct answers cluster. This is why self-consistency is such a powerful technique.

Not quite

The majority vote actually amplifies accuracy. If the model is right 60% of the time, with 10 samples you'd expect about 6 correct and 4 incorrect — so the majority vote lands on the right answer. For the vote to go wrong you'd need at least 6 wrong answers out of 10, which is unlikely when each answer is independently right 60% of the time (and even a 5–5 split usually resolves correctly, because wrong answers scatter across different values while correct answers agree). Result: ~83% accuracy from a 60% model.
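If you want to convince yourself, here's a small Monte Carlo sketch. It assumes each sample is independently correct 60% of the time and that wrong answers scatter across a handful of distinct values (a rough stand-in for how wrong math answers behave); the exact vote accuracy depends on those assumptions.

```python
import random
from collections import Counter

def plurality_vote_accuracy(p_correct=0.6, n_samples=10, n_trials=50_000):
    wins = 0
    for _ in range(n_trials):
        answers = ["correct" if random.random() < p_correct
                   # wrong answers scatter across several distinct values
                   else f"wrong_{random.randint(1, 5)}"
                   for _ in range(n_samples)]
        if Counter(answers).most_common(1)[0][0] == "correct":
            wins += 1
    return wins / n_trials

print(plurality_vote_accuracy())  # well above 0.6 -- scattered errors rarely outvote the clustered correct answer
```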

With that, you have the three foundation concepts: probability distributions (how models represent predictions), expected value and variance (how we measure those predictions), and the law of large numbers (why more samples give better results). Everything from here builds on these.

Now let's see these ideas in action. When you type a message to ChatGPT and hit send, a chain of mathematical operations fires — and every link in that chain uses the concepts we just covered. Let's follow a token through the machine.

Part 2
The LLM Machinery
Follow a token through the system — from raw scores to a chosen word

The model has processed your input and produced a raw score for every word in its vocabulary — 100,000+ numbers, one per possible next token. These scores are called logits. They're not probabilities yet. The next five concepts are the pipeline that turns them into an actual word on your screen.

4. Softmax: Turning Numbers into Probabilities

Inside an LLM, the model computes a raw score for every token in its vocabulary — "how much do I like this token as the next word?" These raw scores are called logits. A token might get a logit of 5.2, another gets 2.1, another gets -0.8. These numbers can be anything — positive, negative, huge, tiny. They're not probabilities yet. You can't sample from raw numbers — you need a probability distribution where everything is positive and sums to 100%.

Softmax is the function that transforms these raw logits into a proper probability distribution. It does three things: (1) makes everything positive (by exponentiating), (2) preserves the ranking (bigger logit = bigger probability), and (3) makes them sum to 1.

Softmax: Watching the Transformation Happen
Follow one set of logits through exponentiation, summing, and division
1
Raw logits (any number)
5.2
the
2.1
a
-0.8
cat
Problem: negative numbers, don't sum to 1
raise e to the power of each logit...
2
After e^x — everything positive, gaps amplified
181.3
the
e^5.2
8.2
a
e^2.1
0.45
cat
e^-0.8
All positive! Gap: 5.2 vs 2.1 became 181 vs 8 (22x amplification)
sum = 189.95, divide each by sum...
3
Probabilities — sum to 100%
95.4%
the
4.3%
a
0.2%
cat
Valid probability distribution: all positive, sums to 1, ranking preserved
The key insight: a small logit gap (5.2 vs 2.1) becomes a huge probability gap (95.4% vs 4.3%). Softmax doesn't just convert — it amplifies the model's conviction.

Why does softmax use e^x specifically? Because the exponential function has a unique shape that makes it perfect for this job. Look at the curve:

The Shape of e^x — Why Small Gaps Become Huge
The exponential curve is nearly flat for negative inputs, then explodes upward — this is what amplifies the model's top choice
[Chart: the exponential curve y = e^x. For negative inputs the outputs stay below 1 (the "safe zone"); for positive inputs they explode: e⁰ = 1, e¹ = 2.7, e² = 7.4, e⁴ = 54.6, e⁵ = 148. A logit gap from 2 to 4 becomes a 7x gap in output.]
Why this shape matters: Negative logits get crushed near zero. Positive logits explode. A small difference in logits (like 2 vs 4) becomes a 7x difference in output. This is why softmax makes the model's top choice dominate.
Why softmax uses exponentiation (and not something simpler)
Why not just divide each score by the total?

Your first instinct might be: just divide each logit by their sum. Logits are 5.2, 2.1, -0.8. Sum is 6.5. So: 5.2/6.5 = 80%, 2.1/6.5 = 32%, -0.8/6.5 = -12%.

Problem: you get a negative probability (-12%). That's meaningless — you can't have a -12% chance of something. Logits can be negative, so simple division breaks.

What you need: a function that (1) turns any number — positive, negative, huge, tiny — into a positive number, (2) preserves the ordering (bigger input = bigger output), and (3) is smooth (for gradients to work during training). The exponential function e^x does all three.

The shape of e^x — why it amplifies
[Chart: e⁰ = 1, e¹ = 2.7, e² = 7.4, e⁵ = 148 — small logit differences become huge after exponentiation.]

The formula:

softmax(z_i) = e^{z_i} / Σ_j e^{z_j}

Step 1: exponentiate every logit (now everything is positive). Step 2: divide by the total (now they sum to 1). That's it.

Walking through our example:

1
Exponentiate: e^5.2 = 181.3,   e^2.1 = 8.2,   e^-0.8 = 0.45 — all positive now!
2
Sum: 181.3 + 8.2 + 0.45 = 189.95
3
Divide: 181.3/189.95 = 95.4%,   8.2/189.95 = 4.3%,   0.45/189.95 = 0.2%

Why exponentiation has a second superpower: it amplifies differences. The raw logit gap between "the" (5.2) and "a" (2.1) is 3.1 — seems modest. But e^5.2 / e^2.1 = e^3.1 ≈ 22x. A 3-point logit gap becomes a 22x probability gap. This means softmax naturally makes the model's top choice dominate. The model doesn't "slightly prefer" — it strongly commits. And if you want to soften this amplification? That's exactly what temperature does (dividing logits before softmax reduces the gaps before exponentiation).
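Here's that walkthrough as a short numpy sketch (with the standard subtract-the-max trick, which changes nothing mathematically but avoids overflow):

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # subtract the max: same result, avoids overflow
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([5.2, 2.1, -0.8])   # "the", "a", "cat"
print(softmax(logits))                # ~[0.954, 0.043, 0.002]
print(softmax(logits).sum())          # 1.0 -- a valid probability distribution
```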

Why this matters: Softmax runs at the very end of the model, after all the attention layers and neural network computations. It's the final step that converts the model's internal "opinions" into a probability distribution you can sample from. Every single token generated by every LLM passes through softmax. And because it uses e^x, it has nice derivatives — softmax's gradient has a clean closed form, which is why backpropagation through it is efficient.

Good — softmax gave us a probability distribution. The model is 95.4% confident the next token is "the." But what if we need to score an entire sequence of tokens, not just one? That's where we hit a problem that forces us into a different mathematical space.

5. Log Probabilities: Why We Work in Log-Space

Earthquakes use the Richter scale. Sound uses decibels. Star brightness uses magnitude. Why? Because these phenomena span enormous ranges, and our brains can't process raw numbers that differ by factors of billions. A magnitude 8 earthquake produces 10x the ground shaking of a magnitude 7 — and releases about 31.6x the energy. The log scale compresses huge ranges into manageable numbers.

LLMs have the same problem. When a model evaluates a 100-word response, it computes the probability of the entire sequence by multiplying the probability of each token. Token 1: 0.02. Token 2: 0.15. Token 3: 0.08. Multiply 100 numbers like this together, and you get something astronomically small — like 0.000000000000000000000000000001. Computers can't reliably store or compare numbers this tiny. They lose precision. This is called underflow.

Why Log Probabilities: Multiplication vs Addition
Multiplying tiny probabilities causes underflow — log turns multiplication into addition
Raw Probabilities
0.02 × 0.15 × 0.08
× 0.12 × 0.03 × 0.22
× ... (100 tokens)
= 0.0000...0001
Computer: "Is this zero? I can't tell anymore."
Log Probabilities
-3.91 + (-1.90) + (-2.53)
+ (-2.12) + (-3.51) + (-1.51)
+ ... (100 tokens)
= -247.3
Computer: "Clear, precise, no problem."

The trick: log(a × b) = log(a) + log(b). By working in log-space, we turn multiplication into addition. Instead of multiplying 100 tiny probabilities and getting underflow, we add 100 negative log-probabilities and get a perfectly manageable number like -247.3.

But why does log work this way? Because of the shape of the curve. The log function does something powerful: it compresses huge ranges into manageable ones, while stretching tiny differences apart.

The Shape of log(x) — Why It Compresses Perfectly
Large numbers get squeezed together. Tiny numbers get spread apart. Exactly what we need.
[Chart: y = log(p) for p between 0 and 1. Large probabilities get compressed together (0.5 → −0.69, 1.0 → 0 — a span of just 0.69), while tiny probabilities get spread apart (0.01 → −4.6, 0.1 → −2.3 — a span of 2.3).]
The key: log is the inverse of exponential. Where e^x explodes upward, log(x) flattens out. This is why log-space turns unmanageable products into clean sums — it tames the explosion.
Why sequence probability is a product — and why that forces us into log-space
Why multiply, not add?

Imagine a 3-word sentence. The model is 90% confident about word 1, 80% about word 2, and 70% about word 3. What's the probability of the whole sentence?

It can't be 90% + 80% + 70% = 240% — that's more than 100%, which is nonsensical. The right operation is multiplication: 0.9 × 0.8 × 0.7 = 0.504. Why? Because each word's probability is conditional — "given that word 1 was correct, what's the probability of word 2?" These conditional probabilities chain through multiplication, not addition. It's like asking: "What are the odds of passing 3 checkpoints in a row?" You multiply the pass rates.

This is called the chain rule of probability — and it's why sequence probability must be a product.

Sequence Probability in Action
1
P("I") = 0.02
2
P("love"|"I") = 0.15
3
P("cats"|"I love") = 0.08
4
Multiply: 0.02 × 0.15 × 0.08 = 0.00024
5
In log-space: -3.91 + (-1.90) + (-2.53) = -8.34
P(w_1, w_2, ..., w_n) = Π_i P(w_i | w_1, ..., w_{i-1})

Now the problem: for a 100-token sentence, you're multiplying 100 numbers that are each less than 1. As we saw above, this quickly becomes astronomically small — too small for computers to handle. The log transform saves us:

The shape of log(x) — why it compresses
[Chart: log(p) — 1.0 → 0, 0.5 → −0.69, 0.1 → −2.3, 0.01 → −4.6. Tiny probabilities → very negative log-probs.]
log P(sequence) = Σ_i log P(w_i | w_1, ..., w_{i-1})

Since probabilities are between 0 and 1, their logs are always negative. A log-prob of -0.1 means the model was very confident (probability ~90%). A log-prob of -4.6 means only ~1% confidence. When comparing two responses, you compare their total log-probabilities: the response with the higher (less negative) sum is the one the model "likes" better.
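A quick sketch of the underflow problem and the log-space fix, using made-up per-token probabilities:

```python
import math

# Toy per-token probabilities (made up), repeated to build a ~100-token sequence.
token_probs = [0.02, 0.15, 0.08, 0.12, 0.03, 0.22] * 17

# Naive product: shrinks toward zero fast.
product = 1.0
for p in token_probs:
    product *= p
print(product)      # ~1e-114 here; a few times longer and it underflows to exactly 0.0

# Log-space: multiplication becomes addition, no underflow.
log_prob = sum(math.log(p) for p in token_probs)
print(log_prob)     # ~ -263 -- a perfectly manageable number
```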

Why this matters: In practice, you'll see log_softmax everywhere. Models don't compute softmax and then take the log — they compute log-softmax directly in a single fused operation, which avoids numerical issues. The REINFORCE algorithm (Part 5) uses log-probabilities in its gradient formula. Cross-entropy loss is a log-probability. Log-space is the native language of LLM training.

So the pipeline so far: logits → softmax → probabilities, and we score sequences in log-space. But what if softmax is too confident? What if we want the model to be more creative, or more cautious? There's one number that controls this — and it's inserted right before softmax.

6. Temperature: The Confidence Dial

Temperature is a single number that controls how sharp the model's probability distribution is. It sits between the logits and softmax — dividing the logits before they get exponentiated. Low temperature (T=0.1) makes the model very confident — almost all probability goes to the top choice. High temperature (T=2.0) flattens the distribution — all tokens get more equal probability, making outputs creative but unpredictable.

Technically, temperature divides the logits before softmax. Lower T makes the logits more extreme before exponentiation; higher T compresses them together.

Temperature: How One Number Changes Everything
The same set of logits at three different temperatures — probabilities illustrative
T = 0.1
Focused / Deterministic
the
99.9%
a
0.1%
an
~0%
big
~0%
cat
~0%
Always picks the top choice
T = 1.0
Default / Balanced
the
72%
a
15%
an
8%
big
4%
cat
1%
Mostly top choice, some variety
T = 2.0
Creative / Chaotic
the
35%
a
22%
an
18%
big
14%
cat
11%
Anything could happen
Why dividing logits controls confidence
The key: softmax amplifies gaps

We just learned that softmax uses exponentiation, which amplifies differences between logits. A logit gap of 3 becomes a 22x probability gap. So what if you could control how big those gaps appear to softmax?

If you divide all logits by 2 before softmax, a gap of 3.0 becomes a gap of 1.5 — softmax sees a smaller difference and produces a flatter distribution. If you divide by 0.5, the gap of 3.0 becomes 6.0 — softmax sees a bigger difference and produces a sharper spike.

Temperature is that divisor. It scales how extreme the gaps look to softmax, without changing which token is ranked highest.

Temperature: Amplify or Compress the Gap
1
Logits: [5.2, 2.1] — gap = 3.1
2
Divide by T=0.5: [10.4, 4.2] — gap = 6.2 (amplified!)
3
Softmax sees huge gap → 99.8% vs 0.2%
1
Divide by T=2.0: [2.6, 1.05] — gap = 1.55 (compressed!)
2
Softmax sees small gap → 82% vs 18%
softmax(z_i / T) = e^{z_i/T} / Σ_j e^{z_j/T}

Now the edge cases make sense intuitively:

  • T → 0: Dividing by near-zero makes every gap look infinite. The top token's e^(logit/T) dwarfs every other token's, so it gets essentially 100%. Result: greedy/deterministic.
  • T = 1: No scaling. Standard softmax, gaps unchanged.
  • T → ∞: Dividing by a huge number makes all logits ≈ 0. Every token becomes e⁰ = 1, so they all get equal probability. Result: pure random.

Why this matters: Temperature is the slider you see in ChatGPT's settings. For math problems, you want low temperature — confident, deterministic answers. For creative writing, higher temperature — diverse, surprising outputs. There's a non-obvious insight here: the optimal temperature for single-answer generation is different from the optimal temperature for self-consistency (where you want diverse answers). That insight only makes sense once you understand that temperature controls the gap-amplification before softmax.
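Temperature is literally one extra division on top of the softmax sketch from earlier — a minimal version:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T   # temperature: scale the gaps before softmax
    z = z - z.max()                           # stability trick, doesn't change the result
    e = np.exp(z)
    return e / e.sum()

logits = [5.2, 2.1, -0.8]
for T in (0.1, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T).round(3))
# T=0.1: the top token gets essentially all the probability
# T=2.0: the distribution flattens out
```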

Now we have a probability distribution shaped by temperature. The final step in generating a token: actually pick one. But there's more than one way to choose from a distribution.

7. Sampling Strategies: How to Pick from a Distribution

You're at a restaurant with a massive menu. You've ranked every dish by how much you want it. Now — how do you actually choose?

Option 1 (Greedy/Argmax): Always pick the dish you want most. Reliable, but boring. You'll order the same thing every time.

Option 2 (Full sampling): Put every dish in a hat, weighted by preference, and draw randomly. Exciting, but you might end up with something terrible.

Option 3 (Top-k): Only consider your top 5 favorites, then pick randomly among those. Safe variety.

Option 4 (Top-p / Nucleus): Keep adding dishes from your ranking until you've covered 90% of your preference weight. Then pick from that set. Adapts to the situation — sometimes your top 2 dishes cover 90%, sometimes your top 20 do.

Sampling: Where Do You Draw the Line?
Same distribution, different selection rules. Tokens: "the" 45%, "a" 20%, "an" 15%, "big" 10%, "old" 5%, "my" 3%, "red" 2%
Top-k (k = 3)
"Keep exactly the top 3 tokens, discard the rest"
[Bar chart: the top 3 tokens (45%, 20%, 15%) sit above the k=3 cutoff; everything below it is discarded.]
Problem: always 3 tokens, whether the model is 95% sure or completely unsure.
Top-p (p = 0.9)
"Keep adding tokens until cumulative probability hits 90%"
[Bar chart: 45% + 20% + 15% + 10% = 90% — the first four tokens are kept; the rest discarded.]
Adapts: if "the" was 95%, only 1 token kept. If flat, many tokens kept.
Greedy (argmax) — always pick #1. Deterministic. Same output every time.
Full sampling — every token has a chance. Maximum variety, risky.

Why top-p won. Most production LLMs default to top-p (nucleus) sampling because it adapts. When the model is very confident (one token has 95% probability), top-p=0.9 only considers that one token — acting like greedy. When the model is uncertain (many tokens around 10-15%), top-p=0.9 considers many options — allowing diversity. It automatically adjusts based on the model's confidence, which is why it's the default choice in most production sampling setups.
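Here's a minimal top-p sampler sketch, assuming you already have the candidate tokens and their probabilities (the toy numbers from the diagram above):

```python
import numpy as np

def top_p_sample(tokens, probs, p=0.9, rng=np.random.default_rng()):
    """Keep the smallest set of top tokens whose probabilities reach p, then sample."""
    order = np.argsort(probs)[::-1]                   # sort tokens by probability, descending
    sorted_probs = np.array(probs)[order]
    cutoff = np.searchsorted(np.cumsum(sorted_probs), p) + 1   # how many tokens to reach p
    kept = order[:cutoff]
    kept_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize to sum to 1
    return tokens[rng.choice(kept, p=kept_probs)]

tokens = ["the", "a", "an", "big", "old", "my", "red"]
probs  = [0.45, 0.20, 0.15, 0.10, 0.05, 0.03, 0.02]
print(top_p_sample(tokens, probs))   # drawn from roughly the top 90% of the probability mass
```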

We now have the full generation pipeline: logits → temperature scaling → softmax → sampling → output token. But how do we measure how confident or uncertain a distribution is? We need a number that captures the shape of the distribution itself.

8. Entropy: Measuring Uncertainty

Check two weather forecasts. Forecast A says: "100% sunny." Forecast B says: "30% sun, 25% rain, 25% clouds, 20% snow." Which forecast carries more surprise? Which one leaves you more uncertain about what to wear?

Forecast A has zero entropy — there's no uncertainty, no surprise. You know exactly what will happen. Forecast B has high entropy — lots of possibilities, each reasonably likely. Anything could happen.

Entropy measures how much "surprise" or "randomness" is in a probability distribution. It's a single number that captures the shape of the distribution: is it a sharp spike (low entropy, model very confident) or a flat spread (high entropy, model uncertain)?

Entropy: Measuring How Uncertain a Distribution Is
Low entropy = confident. High entropy = uncertain. Both have their place.
Low Entropy
H = 0.47 bits
Sharp spike — one dominant choice
"I know the answer"
High Entropy
H = 1.92 bits
Flat spread — many plausible choices
"I'm really not sure"

But how do you measure surprise mathematically? Entropy is built on a single idea: surprise is inversely related to probability. If something was almost certain (p = 0.99), no surprise. If it was nearly impossible (p = 0.01), massive surprise. The function -log(p) captures this perfectly:

The Surprise Function: -log(p)
Rare events are surprising. Common events are boring. This curve captures that exactly.
[Chart: surprise = −log(p). p = 0.01 → 4.6 (shock), p = 0.1 → 2.3 (notable), p = 0.5 → 0.69, p = 1.0 → 0 (boring).]
Entropy = average surprise. Weight each point on this curve by how often it happens (its probability), then sum. That's the formula: H = Σ p · (-log p). A sharp distribution (one big spike) averages low surprise. A flat distribution (many equal options) averages high surprise.
Why the entropy formula looks that way
What does "surprise" mean mathematically?

Think about it: how surprised are you when something happens? It depends entirely on how likely it was. If the weather forecast says 99% sunny and it's sunny — zero surprise. If it says 1% snow and it snows — massive surprise.

Surprise is inversely related to probability. The less likely something is, the more surprised you are when it happens. And the relationship isn't linear — an event with 1% probability is not just 2x more surprising than 2%. It's way more surprising. This suggests we want a logarithmic relationship:

Surprise of outcome x = -log(P(x))

Surprise Levels by Probability
1
Surprise of P=0.99 event: -log(0.99) = 0.01 (boring)
2
Surprise of P=0.25 event: -log(0.25) = 1.39 (interesting)
3
Surprise of P=0.01 event: -log(0.01) = 4.61 (shocking!)
4
Entropy = weighted average of all surprises
The shape of surprise: -log(p) explodes as probability drops
[Chart: surprise = −log(p) shoots upward as p drops: 0.01 → 4.6 (shock), 0.25 → 1.4 (notable), 1.0 → 0 (boring / expected).]
The curve shoots toward infinity at p→0 (impossible event happening = infinite surprise). At p=1 (certain), surprise = 0. Entropy is the weighted average height of this curve over your whole distribution.

Does this make sense? Let's check:

  • P = 1.0 (certain): surprise = -log(1) = 0. No surprise at all. Correct.
  • P = 0.5 (coin flip): surprise = -log(0.5) = 0.69. Some surprise. Correct.
  • P = 0.01 (rare): surprise = -log(0.01) = 4.61. Very surprised. Correct.
  • P → 0 (almost impossible): surprise → ∞. Infinitely surprised. Correct!

Now, entropy is the average surprise — how surprised you'd be on average across many draws from this distribution. And we already know how to compute an average over a distribution (we just learned expected value!): weight each outcome by its probability and sum.

Average surprise = expected value of surprise = Σ P(x) · surprise(x) = Σ P(x) · (-log P(x)).

H(P) = −Σ_i P(x_i) · log P(x_i)

This isn't arbitrary — it's literally "the expected surprise." Each term P(x) · (-log P(x)) asks: "how much surprise does this outcome contribute, weighted by how often it occurs?" Rare events contribute high surprise but happen infrequently. Common events contribute little surprise but happen all the time. Entropy balances these two forces.
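The formula in miniature — a sketch computing entropy for a sharp versus a flat distribution (log base 2, so the answer is in bits):

```python
import math

def entropy(probs):
    """Average surprise: H = -sum(p * log2 p), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

sharp = [0.97, 0.01, 0.01, 0.01]   # model is almost certain
flat  = [0.25, 0.25, 0.25, 0.25]   # model has no idea

print(entropy(sharp))   # ~0.24 bits -- low entropy, confident
print(entropy(flat))    # 2.0 bits  -- maximum entropy for 4 outcomes (log2 of 4)
```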

Edge cases now make perfect sense:

  • H = 0: One outcome has 100% probability. Its surprise is zero. Everything else has 0% probability and contributes nothing. Average surprise = zero. The model is fully certain.
  • H = log(n): All n outcomes are equally likely (1/n each). Every outcome is equally surprising. This is maximum uncertainty — you have no idea what will happen.

Why this matters for LLMs: Entropy is a health metric for training. During GRPO training, if the model's entropy drops to zero, the model has become 100% certain of its outputs — it stopped exploring alternatives. This is called entropy collapse, and it kills training. The model gets stuck repeating the same response, can't discover better strategies, and the reward signal becomes useless. DAPO and other improved RL algorithms add explicit entropy bonuses to prevent this — essentially telling the model "stay surprised, keep exploring."

Notice how entropy connects back to temperature: low temperature pushes the distribution toward low entropy (confident), high temperature pushes toward high entropy (uncertain). Temperature is the knob; entropy is the measurement of where the knob is pointing.

Quick Check
During RL training, you notice the model's entropy has dropped from 2.5 to 0.01 over the last 100 steps. What's happening?
The model is getting better — it's becoming more confident in correct answers
Entropy collapse — the model stopped exploring and is stuck in a rut
This is normal — entropy should decrease during training
Spot on

Entropy dropping to near-zero is a red flag. The model has collapsed to producing essentially the same output for every input. Some entropy decrease during training is normal (the model learns which tokens are better), but dropping from 2.5 to 0.01 means exploration has died. Training cannot improve from here because the model never tries anything different.

Watch out

While some entropy decrease is normal, a drop from 2.5 to 0.01 is entropy collapse — the model has become pathologically confident. It outputs essentially the same tokens for every input, regardless of the question. Training stalls because the model never explores alternative strategies. This is a critical failure mode in GRPO training.

Part 3
How Models Learn
The model can generate text — now, how does it get better at it?

We've followed a token through the generation pipeline: logits → temperature → softmax → sampling. The model can produce text. But the first time it tries, it's terrible — random weights produce random outputs. It needs to learn. Learning requires three things: a way to measure mistakes (loss), a way to figure out which weights caused the mistake (gradients), and a way to keep the adjustments reasonable (regularization).

9. Cross-Entropy Loss: "How Wrong Was I?"

The model just predicted a probability distribution over next tokens. The actual next token in the training data was "Paris." The model assigned 40% probability to "Paris" and 35% to "London." It got the right answer as its top pick — but how do we turn that into a single number that says "how wrong were you"?

Cross-entropy loss captures this difference. It measures not just whether the model got the right answer, but how confident it was in the right answer. A model that assigns 90% probability to the correct next token gets a low loss (good). A model that assigns 10% probability to the correct token, even if it was still the model's top choice, gets a high loss (bad).

Cross-Entropy Loss: Measuring Confidence in the Right Answer
Both models "got it right" — but the loss reveals one is much better than the other
Confident & Correct
Paris
90%
London
5%
Berlin
5%
Cross-entropy loss
0.105
Low loss = good prediction
Uncertain but Correct
Paris
40%
London
35%
Berlin
25%
Cross-entropy loss
0.916
High loss = weak prediction

But why does cross-entropy use -log specifically? And why does it work as a training signal? The answer is in the shape of the curve — and a concept from calculus called differentiability: the curve is smooth everywhere, so at every point we can draw a tangent line whose slope tells the model exactly how urgently to improve.

Why -log(p) Is the Perfect Loss Function
The slope (tangent line) at each point tells the model how urgently to fix the prediction
[Chart: loss = −log(p) against the probability assigned to the correct token, with tangent lines drawn at p = 0.1, 0.5, and 0.9.]
p = 0.1
Slope = -10
Steep tangent = "FIX THIS NOW!"
Model was 90% wrong → massive correction signal
p = 0.5
Slope = -2
Moderate tangent = "Could be better"
Coin flip → moderate push to improve
p = 0.9
Slope = -1.1
Gentle tangent = "Almost there"
Already confident → small nudge
Smooth = Differentiable
The -log curve is smooth everywhere between 0 and 1. At every point you can draw a tangent line. That tangent's slope IS the gradient — it tells training which direction to adjust weights and how much.
Why Not a Step Function?
A step function (right/wrong, 0 or 1) has no slope — it's flat then jumps. No slope = no gradient = training can't figure out which direction to improve. The smooth log curve gives a signal everywhere.
Why negative log? Building the loss function from scratch
Let's invent a loss function

You need a single number that says "how bad was this prediction?" What properties should it have?

  • If the model assigned 100% to the correct answer, the loss should be zero — perfection, nothing to improve
  • If the model assigned 50% to the correct answer, the loss should be moderate — it was hedging
  • If the model assigned 1% to the correct answer, the loss should be enormous — it was confidently wrong
  • Going from 90% → 91% should barely matter. Going from 5% → 1% should be a catastrophe.

You want a function that's zero at 1.0, gently increases as probability drops, and explodes as probability approaches zero. What function does that?

Negative log does exactly this
-log(1.0) = 0.0 — perfect prediction, zero loss
-log(0.5) = 0.69 — coin flip, moderate loss
-log(0.1) = 2.30 — only 10% on the right answer, ouch
-log(0.01) = 4.61 — 1% on the right answer, catastrophic
-log(0.001) = 6.91 — almost zero confidence in the truth → massive penalty
Notice the asymmetry: improving from 90% → 100% saves 0.105 loss. Dropping from 10% → 1% costs 2.31 loss. The function naturally punishes confident mistakes far more than it rewards minor improvements. This is exactly the behavior you want from a training signal.
The loss function shape: why it punishes wrong answers so hard
[Chart: loss = −log(p). p = 0.1 → loss 2.30 (catastrophic), p = 0.5 → 0.69 (struggling), p = 1.0 → 0 (good / excellent). The curve explodes as p → 0.]
The curve is smooth everywhere — you can always compute a gradient and tell the model "reduce this loss by adjusting your weights."

That's why negative log is the loss function, and not something simpler like (1 - p). With (1 - p), going from 1% to 0.1% confidence in the correct answer would only add 0.009 to the loss — a rounding error. With -log, the same drop adds 2.3 to the loss — a screaming alarm. Negative log says: "being confidently wrong is exponentially worse than being slightly unsure."

The formula:

L = −log(p_correct)

That's it. The loss is just the negative log of whatever probability the model assigned to the correct next token. The full "cross-entropy" formula with the summation (−Σ_i y_i · log(p_i)) handles the general case where the target is a full distribution rather than a single correct token, but for language modeling with one correct token, it collapses to this simple form.
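The two "Paris" predictions from the diagram, run through that loss (a quick sketch):

```python
import math

def cross_entropy_loss(p_correct):
    """Loss for a single token: negative log of the probability on the right answer."""
    return -math.log(p_correct)

print(cross_entropy_loss(0.90))   # 0.105 -- confident and correct: low loss
print(cross_entropy_loss(0.40))   # 0.916 -- right answer, weak confidence: high loss
print(cross_entropy_loss(0.01))   # 4.605 -- confidently wrong: the loss screams
```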

Connection to entropy: If the model's predictions perfectly match the true distribution, cross-entropy equals entropy — you literally can't do better. The gap between cross-entropy and entropy measures how much the model's predictions deviate from reality. That gap has a name: KL divergence (Part 5). So cross-entropy = entropy + KL divergence. Training to minimize cross-entropy is the same as minimizing KL divergence — making the model's distribution match reality as closely as possible.

Why this matters: This is the training signal for every LLM during pre-training and supervised fine-tuning. Every single training step says: "you assigned X probability to the correct token — here's -log(X), that's how wrong you were — now adjust your weights to make X larger next time." In distillation, instead of training against the correct answer, you train against a teacher model's full probability distribution — but the underlying math is the same cross-entropy formula, just with a softer target.

Cross-entropy loss gives us a number: "the model was this wrong." But a number alone doesn't tell the model how to improve. Which of the model's billions of weights should change? And by how much?

10. Gradients & Backpropagation: "What Should I Adjust?"

You're cooking a new recipe and the soup comes out too salty. What do you adjust? The salt, obviously. Not the oven temperature, not the cooking time, not the size of the pot. You identify which ingredient caused the problem and adjust that specific ingredient by the right amount.

That's exactly what gradients do for neural networks. After the model makes a prediction and we compute the loss (cross-entropy), we need to answer two questions: (1) which weights were responsible for the mistake? and (2) how much should each one change?

The gradient of the loss with respect to a weight tells you: "if you increase this weight a tiny bit, how much does the loss change?" But to understand gradients, we need to start with a more fundamental idea from calculus: the derivative.

A derivative measures the rate of change at a specific point. Visually, it's the slope of the tangent line — the line that just barely touches the curve at one point. If the curve is steep at that point, the derivative is large. If the curve is flat, the derivative is near zero.

What Is a Derivative? The Slope of the Tangent Line
At every point on a smooth curve, there's a line that just barely touches it. The slope of that line IS the derivative.
[Chart: a bowl-shaped loss curve over one weight, with tangent lines — steep negative slopes (−7, −3) on the left, slope 0 at the minimum, a steep positive slope (+5) on the right.]
Negative slope
"Loss decreases if you increase this weight" → move right
Zero slope
"You're at the bottom" → stop! This is the optimal weight
Positive slope
"Loss increases if you increase this weight" → move left
The rule:
Move in the opposite direction of the gradient. Negative slope → go right. Positive slope → go left. Always walk downhill.

This is the fundamental idea of gradient descent: compute the slope at your current position, then take a step in the opposite direction (downhill). Repeat. Each step brings you closer to the minimum loss — the best weights.

Gradient Descent: Walking Downhill to Better Weights
Start at a random weight, compute the gradient, take a step downhill. Repeat until you reach the bottom.
[Chart: gradient descent steps walking down the loss curve toward the minimum — the steps get smaller as the slope flattens near the bottom.]
This is ALL of training. Every LLM training step — pre-training, fine-tuning, GRPO — does exactly this: compute the loss, find the gradient (slope) for each weight, take a step opposite to the gradient. Billions of weights, billions of steps, same principle.
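Here's the loop in miniature — gradient descent on a single weight over a made-up parabolic loss with its minimum at w = 3. A real model has billions of weights and a loss computed from data, but the update rule is the same:

```python
def loss(w):           # toy loss landscape: minimum at w = 3
    return (w - 3) ** 2

def gradient(w):       # slope of the toy loss at w
    return 2 * (w - 3)

w, learning_rate = 10.0, 0.1
for step in range(25):
    w = w - learning_rate * gradient(w)   # step opposite to the slope: walk downhill
print(w)   # ~3.0 -- converged to the bottom of the bowl
```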

But a neural network has millions of weights arranged in layers, not just one. How does the gradient flow from the output loss all the way back to the first layer? That's what backpropagation solves — using the chain rule from calculus to efficiently compute every gradient in a single backward sweep.

Backpropagation: Numbers Flowing Forward, Gradients Flowing Back
Watch actual values move through a tiny network, then see how gradients trace responsibility backward
Forward Pass: Input → Prediction
INPUT
0.5
w1=2.0
HIDDEN
0.5 × 2.0
= 1.0
w2=3.0
OUTPUT
1.0 × 3.0
= 3.0
LOSS
target=2.0
(3-2)²
= 1.0
Now trace backward: who caused the error?
Backward Pass: Gradients Flow Back (Chain Rule)
1
dLoss/dOutput = 2 × (3.0 - 2.0) = 2.0 "Output was 1.0 too high"
2
dLoss/dw2 = 2.0 × hidden = 2.0 × 1.0 = 2.0 "w2 contributed 2.0 to the error"
3
dLoss/dHidden = 2.0 × w2 = 2.0 × 3.0 = 6.0 "Hidden node's responsibility"
4
dLoss/dw1 = 6.0 × input = 6.0 × 0.5 = 3.0 "w1 contributed 3.0"
The Chain Rule = Multiply Through Each Link
dL/dw1 = (dL/dOut) × (dOut/dHidden) × (dHidden/dw1) = 2.0 × 3.0 × 0.5 = 3.0
Each link in the chain either amplifies or dampens the gradient signal. One forward pass + one backward pass = gradients for every weight. This scales to billions of parameters.

Backpropagation is just this chain rule applied systematically to every layer. A modern LLM has billions of weights, but the principle is identical: one forward pass computes the prediction, one backward pass traces the gradient from loss through every layer back to every weight. This single pair of passes gives every weight its gradient — the insight that made training deep networks practical.
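The same tiny network, done by hand in code — forward pass, chain-rule backward pass, and one weight update (a sketch, not a real training loop):

```python
# Forward pass (numbers from the diagram above)
x, w1, w2, target = 0.5, 2.0, 3.0, 2.0
hidden = x * w1                  # 1.0
output = hidden * w2             # 3.0
loss = (output - target) ** 2    # 1.0

# Backward pass: chain rule, link by link
dloss_doutput = 2 * (output - target)   # 2.0  "output was 1.0 too high"
dloss_dw2     = dloss_doutput * hidden  # 2.0
dloss_dhidden = dloss_doutput * w2      # 6.0
dloss_dw1     = dloss_dhidden * x       # 3.0

# One gradient-descent step: nudge each weight against its gradient
lr = 0.01
w1 -= lr * dloss_dw1
w2 -= lr * dloss_dw2
print(w1, w2)   # 1.97, 2.98 -- both nudged in the direction that shrinks the loss
```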

The chain rule: why you go backward, and why you multiply
The Telephone Game

Imagine a recipe with three steps: garlic → sauce → final dish. You taste the dish and it's too garlicky. How much should you reduce the garlic?

You need to trace backward: "How much does garlic affect the sauce?" (a lot — garlic is dominant). "How much does the sauce affect the final dish?" (a lot — it's the main component). The total effect is the product of these two links. If garlic has a 3x effect on sauce, and sauce has a 2x effect on the dish, then garlic has a 3 × 2 = 6x effect on the dish.

But if the sauce barely affects the final dish (say, 0.1x — it's just a garnish), then garlic's total effect is only 3 × 0.1 = 0.3x. The weakest link in the chain limits the total effect. This is why you multiply — each link in the chain either amplifies or dampens the signal.

dL/dw = (dL/dy) · (dy/dw)

"How much does the loss change when weight w changes?" = "how much does the loss respond to the output?" × "how much does the output respond to the weight?"

Chain Rule in Action: Numbers Flowing Backward
[Diagram — forward pass: input 0.5 × w1=2 → hidden 1.0, × w2=3 → output 3.0, loss = 3²/2 = 4.5. Backward pass: dL/dOut = 3.0, dL/dHidden = 3 × 3 = 9, dL/dw1 = 9 × 0.5 = 4.5, dL/dw2 = 3 × 1 = 3 — multiply the links backward!]
One forward pass computes values. One backward pass traces gradients to every weight. dL/dw1 = 4.5: "increasing w1 by 1 increases loss by 4.5 — so nudge w1 down."

In a deep network with many layers, the chain gets longer: dL/dw = (dL/dy_n) · (dy_n/dy_{n-1}) · ... · (dy_1/dw). Each layer is a link. You multiply all the links together. And you compute from the end backward — because you start with the loss (the only thing you can directly measure) and trace responsibility back through each layer to the weights. This is why it's called backpropagation.

Why backward is efficient: Estimating gradients naively — wiggle each weight, rerun the forward pass, and see what happens to the loss — would require one pass per weight, billions of passes for a modern LLM. The backward pass computes all gradients in a single sweep by reusing intermediate results. Layer 10's gradient feeds into layer 9's computation, which feeds into layer 8's, and so on. One forward pass + one backward pass = gradients for every weight in the entire model. This is the insight that made training deep networks practical.

Why this matters: Every training step of every LLM — pre-training, fine-tuning, GRPO, distillation — uses backpropagation. "The model learned" = the loss was computed, backprop traced responsibility to each weight, each weight was nudged to reduce the loss. That loop, repeated billions of times, is the entirety of deep learning.

11. Norms & Regularization: Keeping the Model on a Leash

We have the full training loop: loss measures the error, gradients trace blame to each weight, and we update accordingly. But there's a risk: if we follow the gradients blindly, the weights can grow enormous and the model memorizes the training data instead of learning general patterns. We need a way to say "learn, but don't go crazy."

Without constraints, weights can become huge, small input changes produce wildly different outputs, and the model overfits — performing brilliantly on training data but terribly on anything new.

Norms measure how "big" the weights have gotten — how far they've drifted from zero. Regularization is the technique that penalizes large weights, keeping them small and the model well-behaved.

Norms: Measuring How Far Weights Have Drifted
L1 and L2 norms measure distance differently — L2 penalizes big outliers more
L1 Norm (Manhattan)
Sum of absolute values
||w||₁ = |w_1| + |w_2| + ... + |w_n|

Like walking on a city grid — total distance is the sum of all blocks walked. Encourages sparse weights (many zeros).

L2 Norm (Euclidean)
Square root of sum of squares
||w||₂ = √(w_1² + w_2² + ... + w_n²)

Like flying in a straight line — the direct distance. Penalizes large outlier weights more than many small ones.

But what does regularization actually do to the weights during training? L2 regularization adds a penalty proportional to the square of each weight. This means large weights get pulled much harder toward zero than small ones:

L2 Regularization: Bigger Weights Get Pulled Harder
Each weight feels a "rubber band" pulling it toward zero — the further it drifts, the stronger the pull
w1
0
5.0
Pull = 10.0 (STRONG)
w2
0
2.0
Pull = 4.0 (moderate)
w3
0
0.3
Pull = 0.6 (gentle)
w4
0
-3.0
Pull = 6.0 (toward 0)
L2 penalty = weight²
A weight of 5.0 gets penalty 25. A weight of 0.3 gets penalty 0.09. The square function makes large weights disproportionately expensive — keeping the model from going extreme on any single weight.
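A short sketch of both norms and the L2 "pull" from the diagram (toy weights):

```python
import numpy as np

w = np.array([5.0, 2.0, 0.3, -3.0])

l1 = np.abs(w).sum()            # 10.3 -- Manhattan distance from zero
l2 = np.sqrt((w ** 2).sum())    # ~6.2 -- Euclidean distance from zero

# L2 regularization: the penalty is w^2, so its gradient (the "pull" toward zero) is 2w.
pull = 2 * w
print(l1, l2)
print(pull)    # [10.   4.   0.6 -6. ] -- the biggest weights get pulled hardest toward zero
```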

Where you'll see this: In RLHF and GRPO, there's a KL penalty that measures how far the trained model has drifted from the original model. This is the same idea applied to probability distributions — a penalty on how far one distribution has drifted from another (KL divergence, covered later in this post). Weight decay — standard in all LLM training — adds a small L2 penalty to every weight update, gently pulling weights back toward zero. The principle is always the same: don't drift too far.

Part 4
The Transformer
We know how to generate and train — but what's the architecture that does the actual thinking?

We've seen the output side (softmax, sampling) and the training side (loss, gradients). But what happens between the input text and those logits? How does the model decide that after "The cat sat on the" the next word should probably be "mat"? The answer is a specific architecture called the transformer, and its key operation is surprisingly simple: measuring similarity between words.

12. Matrix Multiplication as Similarity

Here's a surprisingly intuitive idea: two things are similar if they "point in the same direction."

Imagine you're rating movies on two dimensions: action (0-10) and romance (0-10). The movie Die Hard might be [9, 1] — lots of action, little romance. The Notebook might be [1, 9] — the opposite. Mr. & Mrs. Smith might be [7, 7] — both.

The dot product between two vectors measures how similar they are — how much they "point in the same direction." Die Hard · Notebook = 9×1 + 1×9 = 18 (low similarity). Die Hard · Mr. & Mrs. Smith = 9×7 + 1×7 = 70 (higher similarity). And Die Hard · Die Hard = 9×9 + 1×1 = 82 — a vector is maximally aligned with itself.
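The movie ratings as code — the dot product as a similarity score (toy numbers, of course):

```python
import numpy as np

die_hard = np.array([9, 1])   # [action, romance]
notebook = np.array([1, 9])
mr_smith = np.array([7, 7])

print(die_hard @ notebook)   # 18 -- these point in very different directions
print(die_hard @ mr_smith)   # 70 -- much more aligned
print(die_hard @ die_hard)   # 82 -- a vector is maximally aligned with itself
```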

Dot Product: Measuring Similarity Between Vectors
Vectors are arrows in space. The angle between them determines the dot product.
Same direction (angle ≈ 15°): dot = +0.97 — small angle = high similarity
Perpendicular (90°): dot = 0 — right angle = completely unrelated
Opposite (180°): dot = −1.0 — opposite meaning
This is the core operation of attention. In a transformer, each word is a vector (768+ dimensions). The model computes dot products between word vectors to decide which words are relevant to each other. Same direction = related meaning.

Matrix multiplication is just doing many dot products at once. When we multiply two matrices, each element in the result is a dot product between a row of the first matrix and a column of the second. Let's see it with actual numbers:

Matrix Multiplication: Many Dot Products at Once
Each cell in the result is one dot product between a row of A and a column of B. Example traced below: row 1 · column 2.
Matrix A (2×3):
[ 1  2  3 ]
[ 4  5  6 ]
×
Matrix B (3×2):
[  7   8 ]
[  9  10 ]
[ 11  12 ]
=
Result (2×2):
[  58   64 ]
[ 139  154 ]
How the example cell (row 1 · col 2) = 64:
(1×8) + (2×10) + (3×12) = 8 + 20 + 36 = 64
That's a dot product! Row · Column = one number. A 2×3 matrix times a 3×2 matrix = 4 dot products = a 2×2 result.
In attention, one matrix holds queries (what each word is looking for) and the other holds keys (what each word advertises). The matrix multiply computes every query-key similarity at once — that's why transformers are so parallelizable.

This is why GPUs are so important for LLMs — they're built to do thousands of dot products simultaneously. In the context of transformers, matrix multiplication computes the similarity between every pair of words in a sequence, all at once. This is the foundation of the attention mechanism.

13. The Attention Mechanism: "Who Should I Listen To?"

You're in a meeting. Someone asks you: "What was the revenue last quarter?" To answer, you don't pay equal attention to everything said in the meeting. You focus on the finance report from 20 minutes ago, and maybe a comment from the CFO. You attend to the relevant parts and ignore the rest.

The attention mechanism works the same way. For each token it's generating, the model asks: "Which of the previous tokens are most relevant to deciding what comes next?" It then focuses more on those relevant tokens and less on irrelevant ones.

Attention has three components, called Query (Q), Key (K), and Value (V). Think of it like a search engine:

The model computes the dot product between the Query and every Key to get attention scores — how relevant each stored piece is. Then it runs softmax to turn these scores into weights (summing to 1). Finally, it takes a weighted sum of the Values — the more relevant a piece, the more its content contributes to the output.

Attention: The Process, Step by Step
Predicting the next word after "The cat sat on the mat" — which words matter?
1
Query: "What am I looking for?"
Q
The current position (mat) asks: "Which earlier words are relevant to predicting what comes next?"
compute dot product Q · K for each word...
2
Keys + Scores: "How relevant is each word?"
K
The
0.8
K
cat
4.2
K
sat
2.8
K
on
1.1
K
the
1.5
softmax to turn scores into weights...
3
Attention Weights: "How much to listen to each word"
The
5%
cat
42%
sat
28%
on
10%
the
15%
weighted sum of Value vectors using these weights...
4
Output: Weighted blend of Values
Output = 42% of cat's info + 28% of sat's info + 15% of the's + 10% of on's + 5% of The's
The output vector is mostly about the cat and what it did (sat) — exactly the context needed to predict the next word.

The diagram above shows the concept — but what does the actual computation look like? The formula Attention(Q, K, V) = softmax(Q·Kᵀ/√d_k) · V is a sequence of matrix operations. Here's what each step produces:

Inside the Attention Formula: The Matrix Operations
For 3 words, each represented as a 4-dimensional vector. Watch the matrices transform step by step.
1. Q · Kᵀ — each query asks: "how similar am I to each key?" The raw attention scores (rows = queries; columns = keys cat, sat, mat):
cat: [ 4.2  1.1  0.5 ]
sat: [ 2.8  3.5  1.8 ]
mat: [ 3.1  2.4  0.9 ]
2. softmax(scores), applied to each row — turn the scores into attention weights (each row sums to 1):
cat: [ .80  .12  .08 ]
sat: [ .27  .55  .18 ]
mat: [ .40  .32  .28 ]
3. weights · V — a weighted blend of information:
cat's output
80% cat + 12% sat + 8% mat
sat's output
27% cat + 55% sat + 18% mat
mat's output
40% cat + 32% sat + 28% mat
Read row 1: when "cat" looks at the sequence, it attends 80% to itself, 12% to "sat", 8% to "mat." Its output vector is mostly its own information, slightly mixed with context from the verb and location. Each word gets its own personalized mix of the whole sequence.
Why Q, K, V — and why the formula has that √d scaling
Why three separate roles?

Think about a library. Each book has two different things: a label on its spine (what the book is about) and its actual content (what's inside). When you search for a book, you match your search query against the labels — not against every page of every book. Once you find matching labels, you pull those books' content.

Q, K, V work the same way. A single word can't play all three roles simultaneously. "Cat" as a query asks "what should I attend to?". "Cat" as a key advertises "I'm an animal, a subject." "Cat" as a value carries "here's my actual information content." The model learns different projections for each role — three different views of the same word, each optimized for its purpose.

If you used the same representation for matching and content, the model would be forced to choose: make vectors good for similarity search, OR good for carrying information. Splitting into Q/K/V lets it do both.

One Word, Three Roles
1
Same word "bank" → needs different roles
2
As Query: "What info do I need?"
3
As Key: "I'm about finance" (or "I'm about rivers")
4
As Value: "Here's my actual meaning in context"
Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V

Reading the formula left to right:

  • Q · Kᵀ — compute the dot product (similarity) between the query and every key. This produces a matrix of raw attention scores: "how relevant is word j to word i?"
  • / √d_k — the scaling factor. Why? Dot products in high dimensions tend to become very large numbers (if your vectors have 768 dimensions, the dot product is the sum of 768 terms). Large numbers push softmax into its extreme zones where it outputs near-0 or near-1, crushing the gradients. Dividing by √768 ≈ 27.7 rescales the dot products back to a range where softmax behaves well. Without this, training becomes unstable.
  • softmax(...) — turns raw attention scores into weights that sum to 1. Now we know which words matter most.
  • · V — weighted sum of the value vectors. The words that got high attention weights contribute more of their information to the output. Words with low attention are effectively ignored.
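
To make the four steps concrete, here is a minimal NumPy sketch of the formula: a toy single-head attention with tiny random matrices. The sizes and values are illustrative only, and the helper names are mine rather than from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating (standard numerical-stability trick)
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q · Kᵀ / √dₖ) · V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # raw similarity of every query to every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted blend of the value vectors

# Toy setup: 3 tokens, 4-dimensional vectors (random, purely illustrative)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
output, weights = attention(Q, K, V)
print(weights.round(2))   # 3×3 attention weights; every row sums to 1
print(output.shape)       # (3, 4): one context-blended vector per token
```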

Why this changed everything: Before transformers, models like RNNs and LSTMs processed tokens one at a time, passing information through a fixed-size "memory" that got compressed at every step. Token 1's information had to survive being squeezed through 500 compression steps to reach token 500. Attention eliminates this bottleneck entirely: every token can directly look at every other token, regardless of distance. Token 500 can attend to token 1 just as easily as token 499. No compression, no forgetting, and the entire operation is a matrix multiplication — perfectly parallelizable on GPUs.

14. Embeddings & Positional Encoding: Text as Numbers

We've been talking about dot products and attention between word vectors. But wait — how did words become vectors in the first place? Before any attention can happen, text needs to be converted into numbers. This is the very first step — the input side of the pipeline.

These vectors are called embeddings, and they're not random. Words with similar meanings end up close together in this number space. "Cat" and "dog" would be near each other. "Cat" and "democracy" would be far apart. The model learns these embeddings during training — starting random, then adjusting until similar words cluster together.

Embeddings: Words as Points in Space
Similar words cluster together — the model learns this during training.
  • Animals: cat, dog, bird
  • Actions: run, walk
  • Abstract: freedom, justice
In reality, embeddings have 768–4096 dimensions — but the clustering principle is the same.

But embeddings have a problem: they don't know word order. The embeddings for "dog bites man" and "man bites dog" would be the same set of vectors — just in different positions. The model needs to know that position 1 is different from position 5.

Positional encoding solves this by adding a unique position signal to each embedding. Think of it as a timestamp: "this word is in position 1," "this word is in position 2." Modern transformers learn these position encodings during training (learned positional embeddings) or use clever mathematical patterns (rotary position embeddings / RoPE) that can generalize to lengths the model hasn't seen before.
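
Here is a minimal sketch of that input side in NumPy, assuming learned positional embeddings. Both lookup tables are random placeholders (in a real model they are learned during training), and the sizes are deliberately tiny.

```python
import numpy as np

vocab_size, d_model, max_len = 1000, 8, 16   # toy sizes; real models use ~100k vocab and 768–4096 dims
rng = np.random.default_rng(0)

token_embedding = rng.normal(size=(vocab_size, d_model))    # one vector per token id
position_embedding = rng.normal(size=(max_len, d_model))    # one vector per position

token_ids = np.array([42, 7, 42])       # pretend tokenizer output (note the repeated token id)
positions = np.arange(len(token_ids))

# Same token id gives the same embedding; adding the position signal makes each occurrence distinct
x = token_embedding[token_ids] + position_embedding[positions]
print(x.shape)   # (3, 8): one vector per token, ready to flow into the attention layers
```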

The full picture: text → tokenizer splits it into subword tokens → each token looks up its embedding vector → positional encoding gets added → these vectors flow through layers of attention and feedforward networks → the final layer produces logits → softmax turns them into probabilities → sample a token. That's the complete journey from text in to text out.

Part 5
The RL Math
The model can generate and learn from labels — now, how do we teach it to reason on its own?

Everything up to this point is about how LLMs work out of the box — how they generate tokens, compute probabilities, and learn from labeled data. But reasoning models take a different path. Instead of showing the model what to say (supervised learning), we give it a reward signal and let it figure out reasoning strategies on its own. This is reinforcement learning (RL), and it needs its own set of mathematical tools.

15. KL Divergence: How Different Are Two Distributions?

You're training a puppy. At first, the puppy walks close to you. As training progresses, the puppy might wander further and further away — chasing squirrels, exploring new paths. If you put the puppy on a leash, you let it explore but limit how far it can drift from you.

KL divergence is the mathematical leash for LLMs. When you fine-tune a model with reinforcement learning, the model's behavior (its probability distribution over tokens) starts changing. Some change is good — that's learning! But too much change is dangerous: the model might "forget" basic language, start producing gibberish, or exploit the reward function in unintended ways.

KL divergence measures how different two probability distributions are. In RL training, it measures how far the model being trained has drifted from the original pre-trained model. If the KL divergence gets too large, a penalty kicks in: "you've strayed too far, come back closer to the original."

KL Divergence: Measuring How Far the Model Has Drifted
The original model is the anchor — KL divergence measures the gap.
  • After light training, the trained model Q's next-token distribution sits close to the original model P: KL = 0.3 — healthy drift.
  • After excessive training, Q has moved far from P: KL = 4.8 — the penalty kicks in.
KL measures how different the two probability shapes have become. A large gap means the model has "forgotten" its original language understanding.
Where the KL formula comes from (it's built on entropy)
Building on what we already know

In the entropy section, we defined surprise of an event as -log(P(x)). If the true distribution is P, entropy H(P) measures your average surprise when things are as expected.

Now imagine you think the distribution is Q, but reality is actually P. When event x happens (with true probability P(x)), your surprise is -log(Q(x)) — because you're judging surprise based on your (wrong) belief Q. Your average surprise using the wrong model Q is: Σ P(x) · (-log Q(x)). That's the cross-entropy H(P, Q).

KL divergence is the extra surprise from using the wrong model: D_KL(P ‖ Q) = H(P, Q) - H(P). Cross-entropy minus entropy. The part of your surprise that comes from Q being wrong, not from P being inherently uncertain.

KL Divergence = The Extra Surprise
1. True distribution P: average surprise = H(P) = 1.5 nats
2. Wrong model Q: average surprise = H(P, Q) = 2.3 nats
3. Extra surprise from using Q: D_KL = 2.3 - 1.5 = 0.8 nats
4. If Q = P exactly: D_KL = 0 (no extra surprise)
D_KL(P ‖ Q) = Σᵢ P(xᵢ) · log(P(xᵢ) / Q(xᵢ))

This is always ≥ 0 (you can never be less surprised with the wrong model than the right one), and equals 0 only when P and Q are identical.
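
Here is the formula as a few lines of NumPy, applied to made-up next-token distributions over a 4-token vocabulary. It also checks the cross-entropy-minus-entropy identity from the box above; all numbers are purely illustrative.

```python
import numpy as np

def entropy(p):
    """H(P) = -Σ p·log(p): average surprise under the true distribution."""
    p = np.asarray(p, float)
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    """H(P, Q) = -Σ p·log(q): average surprise when you believe Q but reality is P."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q):
    """D_KL(P ‖ Q) = Σ p·log(p/q): the extra surprise from using the wrong model."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Toy next-token distributions over a 4-token vocabulary (illustrative numbers only)
p = [0.6, 0.2, 0.1, 0.1]   # original model
q = [0.5, 0.3, 0.1, 0.1]   # lightly trained model: small drift
r = [0.1, 0.1, 0.1, 0.7]   # heavily trained model: large drift

print(round(kl_divergence(p, q), 4))                  # small value: healthy drift
print(round(kl_divergence(p, r), 4))                  # much larger: the leash should pull back
print(round(cross_entropy(p, q) - entropy(p), 4))     # equals D_KL(P ‖ Q) exactly
print(round(kl_divergence(p, q), 4),
      round(kl_divergence(q, p), 4))                  # the two directions differ: not symmetric
```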

Critical subtlety: KL divergence is NOT symmetric. D_KL(P ‖ Q) ≠ D_KL(Q ‖ P). This matters enormously because the two directions optimize for different things. Here's what it looks like on actual distribution curves:

Forward KL, D(P ‖ Q): Q is penalized for missing any part of P, so it spreads out to cover both modes of a complex P — broad, mode-covering, used in distillation. Reverse KL, D(Q ‖ P): Q is penalized for putting mass where P has none, so it locks onto P's sharpest peak and ignores the rest — sharp, mode-seeking, used in RLHF policy updates.
Forward KL: D_KL(P ‖ Q)

Mode-covering — Q gets heavily penalized for assigning zero probability wherever P has mass. So Q spreads out to cover everything. Result: Q is broader, safer, avoids missing anything. Used in distillation.

Reverse KL: D_KL(Q ‖ P)

Mode-seeking — Q gets penalized based on its own mass, so it concentrates on P's highest peak and ignores the rest. Result: Q is sharper, picks the best mode. Used in RLHF.

Why this matters: In RLHF and GRPO, a KL penalty is added to the reward: reward_adjusted = reward - β · D_KL. β controls the leash length. Too small: the model drifts dangerously far from its pre-training. Too large: the model can't learn anything new — every change gets penalized. Finding the right β is a key challenge. In distillation, choosing forward vs reverse KL determines whether the student copies all the teacher's behaviors (broad but diluted) or just the teacher's best behavior (sharp but potentially missing edge cases).
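
As a tiny illustration of the leash itself, here is that penalized reward in code, with β = 0.1 chosen purely for illustration (the drift values echo the 0.3 and 4.8 from the diagram earlier).

```python
def kl_penalized_reward(reward, kl, beta=0.1):
    """reward_adjusted = reward - β · D_KL: the KL leash applied to a raw reward."""
    return reward - beta * kl

print(kl_penalized_reward(1.0, 0.3))   # 0.97: light drift barely dents the reward
print(kl_penalized_reward(1.0, 4.8))   # 0.52: heavy drift eats nearly half of it
```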

KL divergence is the leash. Now we need the actual training algorithm — how does the model use rewards to update its weights?

16. Policy Gradients / REINFORCE: "Do More of What Worked"

Imagine a model generates 5 different answers to a math problem. Two answers are correct. Three are wrong. How should the model learn from this experience?

The answer is remarkably intuitive: make the correct answers more likely, and the incorrect answers less likely. Specifically, increase the probability of the tokens that led to correct answers, and decrease the probability of tokens that led to wrong answers. And do this in proportion to how good or bad the outcome was — a brilliantly correct answer gets reinforced more than a barely-correct one.

This is the REINFORCE algorithm — the foundation of all RL for LLMs. The name "policy gradient" comes from the jargon: the model's behavior (which tokens it tends to generate) is called its "policy," and we're computing the gradient (direction of improvement) for that policy.

REINFORCE: Learn from Your Own Attempts
Generate multiple answers, reward the good ones, penalize the bad ones
Problem: What is 15% of 240?
  • R = +1: "15% of 240 = 0.15 × 240 = 36" → more likely
  • R = -1: "15% of 240 = 240/15 = 16" → less likely
  • R = +1: "10% is 24, half of that is 12, so 24+12 = 36" → more likely
  • R = -1: "15 + 240 = 255" → less likely
  • R = -1: "I'm not sure, maybe 30?" → less likely
Next time the model sees a similar problem, it's more likely to use multiplication (correct) and less likely to use division or addition (wrong)
Why "reward × log-probability gradient" — and not something simpler?
The problem: you can't compute a gradient of the reward

In supervised learning, the loss function is differentiable — you can compute gradients directly through it. But in RL, the "loss" is the reward, and reward is not differentiable. It's a black box: the model generates an answer, a verifier says "correct" or "wrong." You can't compute dReward/dWeights because there's no smooth mathematical path from weights to reward — it goes through discrete token sampling.

So you need a trick. The trick is: instead of differentiating the reward, differentiate the probability of the actions that led to that reward. The model's probabilities are differentiable (they come from softmax, which is smooth). The reward just acts as a multiplier — a scalar that says "how much should I push in this direction?"

This gives you: reward × gradient of log-probability. The gradient of log-prob tells you "which direction makes this output more likely." The reward tells you "should I go that way (positive) or the opposite way (negative)?"

The REINFORCE Trick
1. The model generates a response → the black-box reward says "+1".
2. We can't differentiate through the black box ✗
3. BUT we CAN differentiate the probability of that response ✓
4. Gradient direction = ∇ log π (differentiable!)
5. Scale by the reward: R × ∇ log π
Why we use log-probability: one is differentiable, one isn't.
The reward R(response) is discrete — it jumps from 0 (wrong) to 1 (correct), with no slope anywhere along the model's output, so it cannot be differentiated. The log-probability log π(response) varies smoothly with the model weights θ, so its gradient exists everywhere.
The reward is a black box — no gradient flows through it. But log-probability is smooth and differentiable. So: use reward as a multiplier, differentiate the log-prob. That's the trick.
∇J(θ) = E[ R · ∇_θ log π_θ(a|s) ]

Why log-probability specifically? Why not just the probability gradient? Two reasons. First, log turns products into sums — and a sequence's probability is a product of token probabilities (we learned this in the log-probs section). So log-probability gradients decompose cleanly across tokens. Second, the gradient of log π has a nice mathematical property: ∇ log π = ∇π / π. This "divides out" the current probability, so rare tokens that happen to be correct get appropriately large gradient signals, while common tokens that happen to be correct don't get disproportionately reinforced.

Reading the formula:

  • R — the reward. Positive = good output, reinforce it. Negative = bad output, push away.
  • ∇_θ log π_θ(a|s) — "which direction in weight-space makes this output more likely?" This is differentiable because it goes through softmax.
  • R × ∇ log π — combine them: push toward this output proportionally to how good it was.
  • E[...] — average over many samples. Law of large numbers makes this estimate reliable.
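
Here is a toy REINFORCE loop in NumPy, heavily hedged: the "policy" is just a softmax over three canned answers, the rewards are hard-coded, and the gradient of the log-probability is written out by hand (for a softmax policy, ∇_θ log π(a) = one_hot(a) − π) instead of coming from autograd.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
theta = np.zeros(3)                        # logits over 3 canned answers; answer 0 is "correct"
reward = np.array([1.0, -1.0, -1.0])       # black-box reward: +1 for correct, -1 for wrong
lr = 0.5

for _ in range(200):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)             # sample an answer from the current policy
    grad_log_pi = np.eye(3)[a] - probs     # ∇_θ log π(a) for a softmax policy
    theta += lr * reward[a] * grad_log_pi  # REINFORCE update: reward × ∇ log π

print(softmax(theta).round(3))             # probability mass has shifted toward the rewarded answer
```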

Why this matters: GRPO is literally a variant of this formula. The core idea is identical: generate multiple responses, score them, use reward × ∇ log π to update. GRPO adds three refinements: (1) the group mean reward as a baseline (section 17), (2) clipping to prevent overcorrection (section 18), and (3) a KL penalty to prevent drift (section 15). But the engine underneath is REINFORCE.

REINFORCE works, but it has a problem: the raw rewards are noisy. When all rewards are positive (say, all responses score between +0.5 and +1.0), the gradient says "reinforce everything" — even the worst response. We need a way to separate "better than average" from "worse than average."

17. Baselines & Variance Reduction: Grading on a Curve

A professor grades an exam. Every student scores between 75 and 82. Using absolute scores, the gradient would say "everyone did well — reinforce everything a lot." But that's useless! The professor needs to know who did relatively well.

Grading on a curve — subtracting the average score from each individual score — reveals the real signal. A student who scored 82 when the average was 78 did well (+4). A student who scored 75 did poorly (-3). Now the gradient has useful information: push the model toward the 82-score strategy and away from the 75-score strategy.

In RL, this is called a baseline. Instead of using raw rewards, you subtract a baseline (often the average reward): advantage = reward - baseline. This doesn't change the expected direction of the update (the math guarantees this), but it dramatically reduces variance — the updates become much more stable and training converges faster.

Baselines: From Noisy Rewards to Clear Signals
Subtracting the mean turns absolute rewards into relative advantages.
Raw rewards: Response A +1, Response B +1, Response C 0, Response D 0 (mean = 0.5) — "all positives get reinforced equally."
Subtract the mean to get advantages (reward - baseline): A +0.5, B +0.5, C -0.5, D -0.5.
Now we see the signal: "A and B were better than average; C and D were worse."

This is GRPO's key innovation. Traditional RL algorithms like PPO use a separate neural network (the "critic") to estimate the baseline. GRPO skips the critic entirely — it generates a group of responses and uses the group's average reward as the baseline. The group IS the curve. No extra model needed. This is why GRPO is simpler and cheaper to run than PPO, and why it's become the go-to algorithm for training reasoning models.
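
A minimal sketch of the group-mean baseline with made-up verifier scores. Note that full GRPO also divides by the group's standard deviation; this sketch keeps only the mean subtraction described here.

```python
import numpy as np

# One GRPO-style group: 8 responses to the same problem, scored 1 (correct) or 0 (wrong)
rewards = np.array([1, 0, 1, 0, 0, 1, 0, 0], dtype=float)

baseline = rewards.mean()            # the group IS the curve: no separate critic network
advantages = rewards - baseline      # above average → positive, below average → negative
print(baseline)                      # 0.375
print(advantages)                    # correct responses: +0.625, wrong responses: -0.375
```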

18. Clipping & Surrogate Objectives: "Don't Overcorrect"

We now have the complete RL update: advantage (reward minus baseline) times the policy gradient, with a KL leash to prevent drift. One last problem: even with all these safeguards, a single training step can make too large a change. The model might overshoot, forget what it knew, or collapse into repeating the same response for every input. We need a speed limit on each update.

Clipping limits how much the policy can change in a single step. It caps the ratio between the new policy's probabilities and the old policy's probabilities. If the ratio gets too large (the model is trying to change too much), the gradient is clipped — preventing the overcorrection.

Clipping: Limiting How Much the Model Changes Per Step
The probability ratio π_new(token) / π_old(token) is capped to prevent catastrophic policy shifts. The allowed range is [1-ε, 1+ε], centered on 1.0.

Within the clip range: the gradient flows normally. The model updates as planned; the token becomes somewhat more or less likely.

Outside the clip range (too much increase or too much decrease): the gradient is zeroed out and the update is blocked. "You've changed enough for this step. Wait until next time."

Why clip a ratio (and not just limit the gradient)?
The problem with vanilla REINFORCE

REINFORCE says "make good actions more likely." But how much more likely? If a response was good, REINFORCE pushes toward it. If you take a big step, the probability might jump from 5% to 80%. Is that okay?

Usually no. You computed the gradient based on the old model's behavior. If the model changes dramatically in one step, the gradient you computed is no longer accurate — you're steering with a stale map. The bigger the change, the more your map lies.

The fix: measure how much the policy actually changed using the ratio r = πnew / πold. If r = 1, nothing changed. If r = 2, this action is now 2x more likely. If r = 0.1, it's now 10x less likely. Clip this ratio to stay near 1, and you guarantee the policy doesn't change too much in any single step.

Clipping in Action
1. Old policy: P("correct approach") = 10%
2. After update: P("correct approach") = 85%
3. Ratio = 85% / 10% = 8.5× — a massive change!
4. With an ε = 0.2 clip: the ratio is capped at 1.2×
5. Actual change per step: 10% → 12% (safe)
L^CLIP(θ) = min(r(θ) · A, clip(r(θ), 1-ε, 1+ε) · A)

Why a ratio is the right thing to clip: The ratio r = π_new(a|s) / π_old(a|s) directly measures "how much did behavior change?" It's model-agnostic, action-agnostic, and dimensionless. A ratio of 1.2 means "20% more likely" regardless of whether the original probability was 5% or 50%. Clipping gradients directly would be less meaningful — a gradient of 0.5 might be huge for one weight and tiny for another.

Reading the formula:

  • r(θ) — the probability ratio. r=1: no change. r > 1: action became more likely. r < 1: less likely.
  • A — the advantage (reward - baseline). Positive: this action was above average.
  • clip(r, 1-ε, 1+ε) — cap the ratio to the band [0.8, 1.2] (for ε=0.2). Even if the model wants to change more, it's forced to stay within 20% of the old policy.
  • min(...) — take the more conservative estimate. If the unclipped version would give a bigger update than the clipped version, use the clipped one. This is a one-way gate: clipping can only reduce the update, never increase it.
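
Here is the clipped surrogate as a small NumPy function, reusing the 10% → 85% jump from the "Clipping in Action" box. The advantage of 0.625 is just an example value; real implementations compute this per token and average it before taking a gradient step.

```python
import numpy as np

def clipped_objective(new_logp, old_logp, advantage, eps=0.2):
    """PPO-style clipped surrogate: min(r·A, clip(r, 1-ε, 1+ε)·A), with r = π_new / π_old."""
    ratio = np.exp(new_logp - old_logp)           # π_new / π_old, computed safely in log space
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return float(np.minimum(ratio * advantage, clipped * advantage))

A = 0.625  # illustrative advantage
# Huge jump: probability went from 10% to 85% (ratio 8.5); the clip caps the update at 1.2 × A
print(clipped_objective(np.log(0.85), np.log(0.10), A))   # 0.75
# Modest change: 10% → 11% (ratio 1.1) stays inside the band, unclipped
print(clipped_objective(np.log(0.11), np.log(0.10), A))   # ≈ 0.6875
```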

Why this matters: PPO (the algorithm behind ChatGPT's RLHF training) invented this clipping mechanism. GRPO uses the same idea. DAPO (arXiv:2503.14476) introduces asymmetric clipping — a wider clip range for reinforcing good actions than for suppressing bad ones — because symmetric clipping was sometimes too restrictive for positive learning signals. The principle is always the same: allow learning, prevent catastrophe.

Quick Check
In GRPO training, the model generates 8 responses to a math problem. 3 are correct (reward = 1) and 5 are wrong (reward = 0). The mean reward is 0.375. What is the advantage for a correct response?
  • +1.0 — the reward itself is the advantage
  • +0.625 — the reward (1.0) minus the baseline (0.375)
  • +0.375 — the mean reward is the advantage
Perfect

Advantage = reward - baseline = 1.0 - 0.375 = +0.625. The correct response was above average by 0.625, so the model should make it more likely. A wrong response has advantage = 0 - 0.375 = -0.375, so the model should make it less likely. This is GRPO's mechanism: the group mean reward IS the baseline.

Not quite

The advantage is the reward minus the baseline. GRPO uses the group mean as the baseline: mean = (3×1 + 5×0) / 8 = 0.375. So a correct response has advantage = 1.0 - 0.375 = +0.625. This tells the model: "this response was 0.625 units better than your average — make it more likely." That's the power of baselines — they convert absolute rewards into relative quality signals.

The Complete Picture

You just covered 18 mathematical concepts in 5 parts. Here's how they all connect — the complete chain from text in to better model out.

Text Input → Embeddings (#14) → Attention (#13, uses dot products #12) → Logits → Softmax (#4) → Probabilities (#1, shaped by temperature #6) → Sampling (#7)

Training loop: Loss (#9) → Gradients (#10) → Update Weights (with regularization #11)

RL loop (GRPO): Reward → Advantage (#17) → Policy Gradient (#16, with clipping #18) → Update (with KL leash #15)

Quick Reference: 18 Concepts in 60 Seconds

Part 1: Foundation

  • Probability distribution — all possible outcomes with their likelihoods, summing to 1
  • Expected value — the average outcome if you replayed many times
  • Variance — how spread out outcomes are around the expected value
  • Law of large numbers — more samples → closer to the true expected value

Part 2: LLM Machinery

  • Softmax — turns raw logits into probabilities: e^zᵢ / Σⱼ e^zⱼ
  • Log probability — log-space prevents underflow; turns multiplication into addition
  • Temperature — divides logits before softmax; low T = focused, high T = creative
  • Sampling — greedy (always the top token), top-k (restrict to the top k tokens), top-p (smallest set covering probability mass p)
  • Entropy — measures distribution uncertainty; collapse = model stopped exploring

Part 3: How Models Learn

  • Cross-entropy loss — -log(probability of correct token); lower = better
  • Gradient — "which direction should each weight move to reduce the loss?"
  • Backpropagation — chain rule to compute all gradients in one backward pass
  • Norms & regularization — L1/L2 penalties keep weights small and model stable

Part 4: The Transformer

  • Dot product — measures similarity between vectors; core of attention
  • Attention — Q·Kᵀ/√d → softmax → ·V; "who should I listen to?"
  • Embeddings — words as vectors; similar words = nearby vectors
  • Positional encoding — adds position information so the model knows word order

Part 5: The RL Math

  • KL divergence — measures distribution distance; acts as a leash during training
  • REINFORCE — reward × ∇ log π; "do more of what worked"
  • Baseline — subtract average reward for cleaner signal; GRPO uses group mean
  • Clipping — cap the policy ratio to [1-ε, 1+ε]; prevents overcorrection

You just learned the math behind LLMs.

Every concept from here builds on what you now know. When you see a formula, you'll recognize it. When someone mentions KL divergence or policy gradients, you'll know what they mean. The math isn't scary anymore — it's just the notation for ideas you already understand.

What's next?

Now that you understand the math, you're ready to explore how these concepts come together in real reasoning models — from inference-time scaling to GRPO training to distillation.

وَاللهُ أَعْلَم

And Allah knows best

وَصَلَّى اللهُ وَسَلَّمَ وَبَارَكَ عَلَى سَيِّدِنَا مُحَمَّدٍ وَعَلَى آلِهِ

May Allah's peace and blessings be upon our master Muhammad and his family
