بِسْمِ اللَّهِ الرَّحْمَـٰنِ الرَّحِيمِ
In the name of Allah, the Most Gracious, the Most Merciful
The ancient Greeks had a thought experiment: if you replace every plank of a ship, one at a time, is it still the same ship? They called it the Ship of Theseus.
The original transformer from "Attention Is All You Need" (2017) has had the same thing happen to it. The positional encoding? Replaced. The normalization? Replaced. The attention heads? Reorganized. The activation function? Swapped. The feed-forward network? Split into dozens of parallel experts with a router.
Q/K/V attention is still there. Residual connections are still there. But almost everything else has been upgraded. At what point is it a different architecture?
Answer: it isn't. Every upgrade is defined by what it replaced. Understanding the original is exactly how you understand the modern version.
- Part 1: What stayed the same from the original transformer (and why that matters)
- Part 2: Six component upgrades: RoPE, RMSNorm, GQA/MLA, SwiGLU, Flash Attention, Sliding Window
- Part 3: Mixture of Experts — how you get 671B parameters at the cost of 37B
- Part 4: Beyond transformers — Mamba, hybrids, and where architecture is heading
- You understand the basics of the transformer (Q/K/V attention, encoder/decoder) and want to know what changed since 2017
- You see model cards with "RoPE, GQA, SwiGLU, MoE" and want to know what each actually means
- You read How LLMs Actually Work and want the natural sequel
Before we look at what changed, let's establish what didn't. This matters more than it sounds.
If you learned about the transformer — from the original paper, a course, or the previous post — you might worry that your knowledge is outdated. You read "Attention Is All You Need" from 2017, and now it's 2026. Surely the architecture has changed beyond recognition?
It hasn't. The core of what you learned is still running inside GPT-4, Claude, Llama 4, DeepSeek-R1, and every other major LLM today. Here's specifically what stayed the same — and why it matters that you know this before looking at the upgrades.
The Unchanged Core
Q/K/V Dot-Product Attention — the exact same formula from the 2017 paper. Every token creates a Query ("what am I looking for?"), a Key ("what do I contain?"), and a Value ("what meaning do I carry?"). The dot product of Q and K produces attention scores. Softmax normalizes them. The result weights the Values.
This is identical in every modern model. If you open the DeepSeek-V3 technical report (December 2024, 671 billion parameters), the attention computation is softmax(QKᵀ/√d) × V — the same formula Vaswani et al. wrote in 2017. The variants we'll cover in Part 2 change how the heads are organized and how the computation is scheduled on hardware — but the underlying math hasn't changed in 9 years.
Residual Connections — every transformer block still adds the output of each sub-layer back to its input. This skip connection was in the original paper and has not been removed in any major model. In fact, it became more important as models grew deeper. At 80-120+ layers, gradients would vanish without residual connections. They're the reason modern models can be so deep.
The Transformer Block Pattern — the repeating unit: Attention sub-layer, then Feed-Forward sub-layer, stacked N times. Llama 4 Scout stacks this 48 times. DeepSeek-V3 stacks it 61 times. The skeleton is the same — what goes inside each sub-layer is what changed.
Softmax and Learned Embeddings — unchanged. Every model converts tokens to dense vectors through a learned embedding table. Every attention layer normalizes scores through softmax. These haven't been touched.
Decoder-Only Simplification — the original transformer had an encoder and a decoder. Since GPT (2018), modern LLMs almost universally use decoder-only. This was the single biggest structural change from the original paper — but the components inside each block stayed the same.
Why this matters: every upgrade in this post is defined by what it replaced from this foundation. If you understand the original components, you already understand the "before" picture for every upgrade. That's why the original transformer isn't outdated knowledge — it's the reference frame for everything that came after.
GPT-4, Claude, Llama 4, DeepSeek-R1, Qwen3, Gemini — every major production LLM still uses transformer blocks with Q/K/V attention. Mamba and SSMs show promise (especially in hybrids), but they haven't replaced transformers. We'll cover what's actually happening in Part 4.
Now that we know the foundation is solid, let's look at what got upgraded — and why.
Each upgrade that follows is a targeted improvement to one specific component of the transformer. The pattern is always the same: there was an original component from 2017, it had a specific limitation, and someone found a better replacement. Understanding the original tells you exactly what changed and why.
Six upgrades. Each one defined by what it replaced. Six planks of the ship, swapped for something better.
Positional Encoding: Sinusoidal → RoPE
The original transformer stamped a fixed "seat number" on each token using sinusoidal functions — sine and cosine waves of different frequencies, added to the token's embedding at the very first layer. Token 1 gets one pattern, token 2 gets a slightly different pattern, and so on.
This had three problems:
- Hard ceiling at training length. If you trained on 512 tokens, token 513 had no meaningful encoding. The model simply hadn't seen a position that far.
- Position info degrades through layers. The positional signal was added once at the input. By layer 32, it had been diluted through dozens of matrix multiplications.
- Absolute positions only. The model knew "this token is at position 7" but had no natural way to encode "these two tokens are 5 apart" — which is what actually matters for understanding relationships.
How RoPE works conceptually: instead of adding position info to embeddings at the input, RoPE rotates the Q and K vectors by an angle proportional to the token's position. Each pair of dimensions gets rotated by a different frequency — like hands on a clock spinning at different speeds.
When you compute the dot product Q·K, the result naturally encodes the relative distance between two tokens. Rotating by angle A then dotting with something rotated by angle B gives you cos(A-B) — the angle between them. The model doesn't need to know "this is position 7 and that is position 12." It directly sees "these are 5 apart."
Imagine each pair of dimensions in Q and K as a point on a clock face. Token at position 0 has the hands pointing at 12 o'clock. Token at position 1 is rotated by a small angle θ. Token at position 100 is rotated by 100θ.
When you compute the dot product of two rotated vectors, the result depends only on the difference in their angles — not the absolute angles themselves. Token at position 7 dotted with token at position 12 gives the same result as token at position 107 dotted with token at position 112, because both pairs are 5 apart.
Different dimension pairs use different rotation speeds (frequencies). Low-frequency pairs capture long-range relationships ("these sentences are far apart"). High-frequency pairs capture local patterns ("these words are adjacent").
The math: for position m and dimension pair i, RoPE applies a 2D rotation matrix with angle mθᵢ, where θᵢ = 10000^(−2i/d). The dot product Qₘ·Kₙ then naturally contains cos((m−n)θᵢ) terms — pure relative position.
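To make the clock analogy concrete, here is a minimal NumPy sketch of the rotation (an illustrative loop, not an optimized implementation — real libraries vectorize this and apply it per attention head):

```python
import numpy as np

def rope_rotate(vec, position, theta=10000.0):
    """Rotate each 2D pair of `vec` by position * frequency (RoPE sketch)."""
    d = len(vec)
    out = vec.astype(float).copy()
    for i in range(0, d, 2):
        freq = theta ** (-i / d)          # lower frequency for later pairs
        angle = position * freq
        c, s = np.cos(angle), np.sin(angle)
        x, y = vec[i], vec[i + 1]
        out[i]     = x * c - y * s        # standard 2D rotation
        out[i + 1] = x * s + y * c
    return out

q = np.array([1.0, 0.0, 0.5, 0.5])
k = np.array([0.2, 0.9, 0.1, 0.4])

# The dot product depends only on the relative offset:
# positions (7, 12) and (107, 112) are both 5 apart.
a = rope_rotate(q, 7)   @ rope_rotate(k, 12)
b = rope_rotate(q, 107) @ rope_rotate(k, 112)
assert np.isclose(a, b)
```

The assertion is the whole point: rotating both vectors by 100 extra positions changes nothing about their dot product, because only the angle difference survives.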
Because RoPE uses rotation, you can extend the context window beyond training length by scaling the rotation frequencies. Key techniques:
Position Interpolation (Meta, 2023): Scale all RoPE frequencies down so that positions 0-128K map to the same angle range as 0-8K during training. Simple but effective.
YaRN (2023): "NTK-aware" scaling. Instead of uniform scaling, it adjusts high-frequency and low-frequency rotation speeds differently — preserving local position resolution while extending the range.
Llama 4 Scout (Meta, 2025): Trained at 256K tokens, then fine-tuned with RoPE extensions to handle 10 million tokens. This is possible because RoPE encodes relative position — the model can generalize to longer sequences even if it hasn't seen them during initial training.
RoPE fixed how the model knows where tokens are. The next upgrade fixes something more subtle — how the model keeps its activations stable as signals pass through dozens of layers.
Normalization: LayerNorm → RMSNorm
The original transformer used LayerNorm (2016), which normalizes activations by computing two statistics: the mean and the variance. It subtracts the mean (re-centering), divides by the standard deviation (re-scaling), then applies two learned parameters (scale and shift).
RMSNorm (Zhang & Sennrich, 2019) drops the mean-centering step entirely. It only computes one statistic — the Root Mean Square — and applies one learned parameter (scale only). That's it.
Why removing the mean doesn't matter: Think of normalization as having two jobs — (1) centering the data around zero, and (2) controlling the scale so values don't explode or vanish. It turns out that job #2 is the one that actually matters for training stability. When activations blow up to huge numbers or shrink to near-zero, gradients break. Controlling the scale prevents that. But centering? The very next linear layer in the network can learn its own bias term, which effectively re-centers the data however it wants. So LayerNorm was doing redundant work — computing a mean, subtracting it, then the next layer immediately learns to shift things again anyway.
The analogy: imagine calibrating a scale before weighing something. LayerNorm both zeros out the scale (re-centering) AND adjusts the units (re-scaling). RMSNorm only adjusts the units — because the next person in line will zero it out however they need to. Half the work, same result.
Practical impact: 7-64% faster normalization depending on hardware, with zero quality loss. At scale, this adds up. When you're training on thousands of GPUs for weeks, shaving even 10% off a per-layer operation that runs billions of times translates to real money and time saved.
The switch happened in two stages. GPT-2 (2019) moved from post-norm to pre-norm (keeping LayerNorm). Llama 1 (2023) switched to pre-RMSNorm. Since then, every major open-weight model — Mistral, DeepSeek, Qwen, Gemma — uses pre-RMSNorm.
LayerNorm:
1. Compute mean: μ = mean(x)
2. Compute variance: σ² = var(x)
3. Normalize: (x - μ) / √(σ² + ε)
4. Scale and shift: γ × normalized + β
Two statistics (mean, variance). Two learned parameters (γ, β).
RMSNorm:
1. Compute RMS: rms = √(mean(x²))
2. Normalize: x / rms
3. Scale only: γ × normalized
One statistic (RMS). One learned parameter (γ). Roughly half the operations, no quality loss.
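The two procedures side by side, as a minimal NumPy sketch (per-vector normalization, with the learned parameters reduced to scalars for brevity):

```python
import numpy as np

def layernorm(x, gamma=1.0, beta=0.0, eps=1e-6):
    mu = x.mean()                               # statistic 1: mean
    var = x.var()                               # statistic 2: variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rmsnorm(x, gamma=1.0, eps=1e-6):
    rms = np.sqrt(np.mean(x ** 2) + eps)        # single statistic
    return gamma * x / rms                      # scale only, no shift

x = np.array([2.0, -1.0, 3.0, 0.5])
assert abs(layernorm(x).mean()) < 1e-6          # LayerNorm re-centers around zero
assert abs(np.mean(rmsnorm(x) ** 2) - 1) < 1e-3 # RMSNorm only controls the scale
```

Note that `rmsnorm` never touches the mean: the next linear layer's bias can re-center the data if it needs to, which is exactly why the extra work was redundant.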
So far we've upgraded where (RoPE) and how stable (RMSNorm). The next three upgrades tackle the biggest practical bottleneck: memory and compute cost during inference. They're all different angles on the same problem — making long sequences affordable.
Attention Heads: MHA → GQA → MLA
This is the evolution of how Q, K, V heads are organized. Remember: the attention math itself (QKᵀ/√d → softmax → ×V) is identical. What changes is how many separate K and V projections exist.
The problem this solves: during text generation, the model stores all previous K and V vectors in a "KV cache" to avoid recomputing them for every new token. For long sequences, this cache dominates GPU memory.
How bad can it get? Take a model with 128 full attention heads of dimension 128, using standard MHA in FP16. KV cache per token per layer = 2 (K+V) × 128 heads × 128 dims × 2 bytes = 65 KB. At 100K tokens: 100,000 × 65 KB = 6.5 GB just for one layer's KV cache. At 1M tokens, that's 65 GB per layer — and a deep model has dozens of layers, so the total dwarfs an entire A100 GPU.
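The arithmetic above generalizes to a one-line sizing formula. This sketch (illustrative helper, decimal GB) compares the full-MHA configuration with a GQA variant that caches only 8 KV heads:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_val=2):
    """FP16 KV cache size: 2 tensors (K and V) per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_val

# Full MHA: 128 KV heads of dim 128, one layer, 100K tokens
per_layer = kv_cache_bytes(1, 128, 128, 100_000)
print(per_layer / 1e9, "GB per layer")   # 6.5536 GB per layer

# Same layer with GQA (8 KV heads): 16x smaller cache
gqa = kv_cache_bytes(1, 8, 128, 100_000)
assert per_layer // gqa == 16
```

Multiply by layer count and sequence length and it becomes obvious why every post-2022 architecture attacks the KV cache first.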
The evolution happened in stages:
Multi-Query Attention (MQA, Shazeer 2019) — all query heads share a single K and V projection. Extreme memory reduction (from N KV heads to 1), with a small quality trade-off (~2%) and 1.8-2.4x faster decoding. Used by PaLM and Falcon. Invented by the same Noam Shazeer who co-authored "Attention Is All You Need."
Grouped Query Attention (GQA, Ainslie et al. 2023) — the middle ground. G groups of KV heads, where each group of query heads shares one KV pair. Near-MHA quality with near-MQA speed. Specific configurations:
- Llama 2 70B: 64 query heads, 8 KV heads
- Llama 3 70B: 64 query heads, 8 KV heads
- Mistral 7B: 32 query heads, 8 KV heads
- Qwen3-235B: 64 query heads, 4 KV heads
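A minimal NumPy sketch of the GQA idea (causal masking omitted for brevity; `gqa_attention` and its shapes are illustrative, not any library's API):

```python
import numpy as np

def gqa_attention(Q, K, V, n_groups):
    """Grouped-query attention sketch: n_q query heads share n_groups KV heads."""
    n_q, T, d = Q.shape
    heads_per_group = n_q // n_groups
    # Each query head reuses the KV head of its group
    K_full = np.repeat(K, heads_per_group, axis=0)   # (n_q, T, d)
    V_full = np.repeat(V, heads_per_group, axis=0)
    scores = Q @ K_full.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax per query
    return weights @ V_full

rng = np.random.default_rng(0)
T, d = 5, 8
Q = rng.normal(size=(32, T, d))   # 32 query heads (Mistral-7B-like ratio)
K = rng.normal(size=(8, T, d))    # only 8 KV heads ever need caching
V = rng.normal(size=(8, T, d))
out = gqa_attention(Q, K, V, n_groups=8)
assert out.shape == (32, T, d)    # full output quality path, 4x smaller KV cache
```

The `np.repeat` is the whole trick: the expensive-to-cache K and V exist only 8 times, and are broadcast back out to all 32 query heads at compute time.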
Multi-Head Latent Attention (MLA, DeepSeek-V2 2024) — instead of caching full-dimensional K and V, MLA compresses them into a much smaller "latent" vector using a down-projection. At inference, it projects back up. DeepSeek-V3 achieves 28x KV cache compression — from 213.5 GB down to 7.6 GB.
In standard MHA, each token at each layer caches a full K vector and V vector across all heads. For DeepSeek-V3 with 128 heads and 128 dim per head, that's 128 × 128 = 16,384 values for K alone, times 2 for K+V = ~32K values per token per layer.
MLA adds a down-projection that compresses K and V into a joint latent vector of just 512 dimensions. Only this 512-dim latent is stored in the KV cache. When the model needs to compute attention, it applies an up-projection to expand the latent back to full K and V dimensions.
The compression ratio: the ~32K cached values per token per layer shrink to the 512-dim latent (plus a small 64-dim decoupled key that carries RoPE information). End to end, DeepSeek-V3 reports ~28x KV cache reduction. This is how it handles long contexts on practical hardware.
The quality cost? Negligible — because the down-projection is learned during training, the model learns what information to preserve and what to discard.
At 1M tokens, KV cache memory is the bottleneck. MHA would need hundreds of GB, and even GQA's 8x reduction leaves a massive cache. MLA's 28x compression — DeepSeek-V3 goes from 213.5 GB to 7.6 GB — makes million-token contexts practical on real hardware. This is why DeepSeek-V2 invented it.
GQA and MLA optimized the attention side of the transformer block. Now let's look at the other half — the feed-forward network (FFN). Two things changed inside it: the activation function, and (in Part 3) the entire structure.
Activation Function: ReLU → SwiGLU
The original transformer used ReLU inside the feed-forward network: if the input is positive, pass it through; if negative, output zero. Simple — but it has a fatal flaw called the "dead neuron" problem. Once a neuron outputs 0, its gradient is 0, and it never recovers. It's permanently dead.
ReLU's deeper problem is that it's amplitude-based — it only looks at whether a number is positive or negative. It can't look at the content (what the number represents in context) and make an intelligent decision about what to pass through.
GELU (used by GPT-2, GPT-3, BERT) softened this — instead of a sharp cutoff at zero, it uses a smooth curve that gives small negative values a tiny chance of passing through. Better, but still a single path making a single decision.
SwiGLU (Shazeer, 2020) fundamentally changed the structure. Instead of one path through one activation function, SwiGLU splits the FFN into two parallel paths:
- One path goes through Swish activation (a smooth, non-monotonic function) — this is the "value" path, computing what the output could be
- The other path stays linear — this is the "gate" path, computing how much of that output to allow through
- The two paths are multiplied together element-wise — the gate controls the value
The analogy: think of a recording studio. ReLU is a simple noise gate — any signal below a threshold gets cut to silence, regardless of what it is. SwiGLU is an experienced sound engineer with two hands on the mixing board: one hand holds the audio signal (the Swish path), the other hand controls the volume fader (the gate path). The engineer listens to the content and decides how much to let through — quiet passages get boosted, noise gets suppressed, and the decision is based on what the signal means, not just how loud it is.
This is the key insight: the gate path sees the same input but through different learned weights. It learns a content-dependent filter — "this pattern of activations represents something important, let it through" vs "this pattern is noise, suppress it." ReLU can't do this because it has no second path to learn the gating.
In a standard FFN, the input goes through one linear layer, then an activation function, then another linear layer: output = W2 * ReLU(W1 * x)
In SwiGLU, the first linear layer is split into two:
output = W2 * (Swish(W1 * x) ⊙ (W3 * x))
Where ⊙ means element-wise multiplication. The W3 * x path is the "gate" — it controls how much of the Swish(W1 * x) path gets through. This is why SwiGLU requires 3 weight matrices instead of 2, adding ~50% more parameters to the FFN layer. Models compensate by slightly reducing the FFN hidden dimension.
The result: measurably better downstream performance across all benchmarks, consistently. The extra parameters pay for themselves.
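The formula translates directly to code. A minimal NumPy sketch (weight shapes are illustrative; real models also typically omit biases in these layers):

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))        # x * sigmoid(x)

def swiglu_ffn(x, W1, W3, W2):
    """SwiGLU FFN: gated value path, three weight matrices instead of two."""
    value = swish(x @ W1)                # "value" path through Swish
    gate = x @ W3                        # linear "gate" path
    return (value * gate) @ W2           # element-wise gating, then project back

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.normal(size=(d_model,))
W1 = rng.normal(size=(d_model, d_ff))
W3 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))
y = swiglu_ffn(x, W1, W3, W2)
assert y.shape == (d_model,)
```

W1 and W3 both see the raw input x but learn different projections — that is the "two hands on the mixing board" from the analogy, one producing the signal and one deciding how much of it passes.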
SwiGLU improved the quality of the FFN. The next upgrade doesn't change any math at all — instead, it rewrites how the existing math runs on GPU hardware, and that single change made everything from 100K to 10M token context windows possible.
Flash Attention: Same Math, Different Hardware Strategy
This one is different from all the others. Flash Attention changes zero math. The output is bit-for-bit identical to standard attention. What it changes is entirely about how the computation is organized on GPU hardware — and that change enabled everything from 100K to 10M token context windows.
The problem: standard attention materializes the full N × N attention matrix in GPU memory. For 100K tokens, that's 100,000 × 100,000 = 10 billion entries. This matrix is computed, used once for the softmax-weighted sum, then thrown away. The waste is staggering.
Standard attention writes the full N×N matrix to slow HBM. Flash Attention processes attention in small tiles that fit entirely in fast SRAM — loads a tile of Q, K, V, computes partial attention, accumulates the result, and writes only the final output back to HBM. The N×N matrix is never materialized.
Flash Attention (Dao et al., 2022) is now the default everywhere: it's built into PyTorch as torch.nn.functional.scaled_dot_product_attention and used by every serious LLM training pipeline. The analogy: standard attention is like printing two huge spreadsheets on paper, multiplying them, then throwing the paper away. Flash Attention works with small sticky notes on your fast desk (SRAM), computing sections and keeping a running total, without ever printing the full intermediate result.
The impact:
- Memory: O(N²) → O(N) — this single change enabled 100K+ context windows
- Speed: 2-4x faster for long sequences, 3x speedup on GPT-2 training end-to-end
- FlashAttention-2 (2023): 2x faster than v1, up to 72% GPU utilization on A100, up to 225 TFLOPs/s
- FlashAttention-3 (2024): 75% utilization on H100, FP8 support, up to 740 TFLOPs/s
The trickiest part of Flash Attention is this: softmax needs to see all attention scores to compute the normalization factor (the sum of exponentials). But if you're processing in tiles, you only see a subset at a time.
Online softmax (Milakov & Gimelshein, 2018) solves this by maintaining a running maximum and running sum of exponentials. When you process a new tile:
1. Compute the new local maximum
2. Rescale the previous running sum to account for the new maximum
3. Add the new tile's exponentials to the running sum
4. The final softmax is exact — not approximate
This is why Flash Attention produces bit-for-bit identical results to standard attention. It's not an approximation — it's the same computation, reorganized for hardware efficiency.
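The four steps can be verified numerically. A minimal NumPy sketch of the running-max/running-sum trick (the real kernel also rescales a running output accumulator alongside the normalizer, omitted here for clarity):

```python
import numpy as np

def online_softmax(scores, tile=4):
    """Compute exact softmax over `scores` one tile at a time (Flash-style)."""
    m = -np.inf          # running maximum
    s = 0.0              # running sum of exp(x - m)
    for start in range(0, len(scores), tile):
        chunk = scores[start:start + tile]
        m_new = max(m, chunk.max())       # 1. new local maximum
        s = s * np.exp(m - m_new)         # 2. rescale the previous running sum
        s += np.exp(chunk - m_new).sum()  # 3. add this tile's exponentials
        m = m_new
    return np.exp(scores - m) / s         # 4. exact, not approximate

x = np.random.default_rng(0).normal(size=10)
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), reference)
```

No tile ever sees the full score vector, yet the final result matches the all-at-once softmax to floating-point precision — which is exactly why Flash Attention is bit-for-bit equivalent rather than an approximation.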
Flash Attention made each attention operation faster and lighter in memory. Sliding Window goes further — it reduces how many tokens each layer needs to attend to in the first place. Where Flash Attention optimizes the hardware, Sliding Window optimizes the algorithm. They're complementary, and many models use both together.
Sliding Window Attention
Instead of each token attending to all previous tokens (full attention), sliding window attention restricts each layer to attending only to the previous W tokens. Mistral 7B (September 2023) used a window of W = 4,096.
The analogy: imagine you're in a long hallway with 100,000 people standing in a line. Full attention means every person can whisper to every other person directly — that's 100,000 × 100,000 possible conversations. Sliding window means each person can only whisper to the 4,096 people closest to them. Far cheaper, and you'd be surprised how far messages still travel.
Why? Because the model is stacked. The clever insight is that stacking layers extends the effective reach. In layer 1, each token sees W tokens back. But in layer 2, each token — which now contains information from W tokens — sees W more tokens back. It's like a game of telephone where each relay point has a 4,096-person reach. With 32 layers and W=4,096, the theoretical reach is 32 × 4,096 = 131,072 tokens — the message has traveled across the entire sequence, just indirectly.
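The restriction is easy to see as a mask. A minimal NumPy sketch (function name and shapes are illustrative):

```python
import numpy as np

def sliding_window_mask(n_tokens, window):
    """Causal mask where token i attends only to tokens [i-window+1, i]."""
    i = np.arange(n_tokens)[:, None]
    j = np.arange(n_tokens)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(8, window=3)
# Token 5 directly sees tokens 3, 4, 5 — not token 0. Information from
# token 0 can still reach it indirectly through the stacked layers.
assert mask[5, 3] and mask[5, 5] and not mask[5, 0]
assert int(mask.sum()) < 8 * 9 // 2   # far fewer entries than full causal
```

With a fixed window, the number of attended positions per token stops growing with sequence length — that is where the fixed KV cache and linear compute come from.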
Full attention when: you need precise retrieval from anywhere in the context ("find the name mentioned 50,000 tokens ago"). Every token can directly see every other token.
Sliding window when: most important context is local (adjacent sentences, nearby code). Huge throughput and memory gains for sequences where long-range precision isn't critical.
Hybrid (best of both): Gemma's approach — alternate sliding window layers (for local patterns) with full attention layers (for long-range retrieval). Most tokens get fast local attention; critical long-range connections use the full attention layers.
Trade-off: Pure sliding window can miss dependencies that span more than layers × W tokens. The hybrid approach costs more compute than pure sliding window, but less than full attention everywhere.
The Upgrade Summary
Six upgrades, each a targeted replacement of a specific 2017 component — plus a few broader shifts that rode along with them. Here they are side by side:
| Component | Original (2017) | Modern (2023+) | Key Benefit |
|---|---|---|---|
| Position | Sinusoidal | RoPE | Relative position, extendable context |
| Normalization | LayerNorm (post) | RMSNorm (pre) | 7-64% cheaper, better gradient flow |
| KV Heads | MHA (N heads) | GQA / MLA | 8-28x KV cache reduction |
| Activation | ReLU | SwiGLU | Gated control, no dead neurons |
| Attention compute | Standard matmul | Flash Attention | O(N²) → O(N) memory, 2-4x speed |
| Attention scope | Full (all tokens) | Sliding window / hybrid | Fixed KV cache, linear compute |
| Context window | 512 tokens | 128K – 10M tokens | RoPE scaling (YaRN, NTK) |
| Training data | ~300B tokens | 15T+ tokens | Chinchilla-optimal and beyond |
| Architecture | Encoder-Decoder | Decoder-only | Simpler, scales better for generation |
Six planks replaced. The ship still sails the same way — attention + FFN + residual, stacked N times. But every plank is better than the original. Now let's look at the biggest structural change of all — one that doesn't replace a plank, but adds an entire new deck to the ship.
What if you could build a model with the knowledge of 671 billion parameters, but only pay the compute cost of 37 billion?
That's not a hypothetical. That's DeepSeek-V3, and it's the core promise of Mixture of Experts (MoE). The concept isn't new — Jacobs et al. proposed "Adaptive Mixtures of Local Experts" back in 1991. What's new is applying it at transformer scale.
The Hospital Analogy
In a dense model (like Llama 3 70B), every token passes through the same feed-forward network. All 70 billion parameters activate for every single token. It's like a hospital where every patient sees the same general doctor — whether they have a broken arm, a heart condition, or a math homework question.
In an MoE model, the single FFN is replaced with many parallel "expert" FFNs and a learned router (gating network) that selects which experts each token goes to. It's like a hospital with 256 specialist doctors and a triage nurse who sends each patient to the right 8 specialists. The other 248 sit idle for that patient.
How the Router Works
The router is a small linear layer followed by softmax. It takes the token's hidden state and produces a probability distribution over all experts — near-zero scores for most (0.002, 0.001, 0.008) and higher scores for a few (0.15, 0.12, 0.10). The top-k experts are selected, and their outputs are combined using those scores as mixing weights (w=0.15, w=0.12, w=0.10 in this example).
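A minimal NumPy sketch of such a router (top-k selection with renormalized weights; names and shapes are illustrative, not any model's actual code):

```python
import numpy as np

def route(hidden, W_router, top_k=8):
    """MoE router sketch: softmax over experts, keep top-k, renormalize."""
    logits = hidden @ W_router                     # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax over all experts
    chosen = np.argsort(probs)[-top_k:]            # indices of the top-k experts
    weights = probs[chosen] / probs[chosen].sum()  # renormalize selected weights
    return chosen, weights

rng = np.random.default_rng(0)
n_experts, d = 256, 16
hidden = rng.normal(size=(d,))
W_router = rng.normal(size=(d, n_experts))
experts, weights = route(hidden, W_router, top_k=8)
assert len(experts) == 8                # only 8 of 256 expert FFNs ever run
assert np.isclose(weights.sum(), 1.0)   # their outputs are mixed by these weights
```

The token's output is then the weighted sum of the 8 selected experts' FFN outputs; the other 248 experts cost nothing for this token.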
The Models
MoE isn't theoretical — it's how the most capable models in the world are built. The key comparison: total parameters ≠ active parameters. Llama 3 70B activates ALL 70B parameters for every token. DeepSeek-V3 only activates 37B of its 671B per token (the router selects 8 of 256 experts). So Llama 3 70B actually uses more compute per token, despite having far fewer total parameters. Massive knowledge capacity (671B) at practical compute cost (37B) — that's the entire point of MoE.
Load Balancing: The Critical Challenge
Without intervention, the router tends to "collapse" — sending all tokens to 1-2 favorite experts while the rest sit idle. This wastes the capacity you paid for. Three generations of fixes:
1. Auxiliary loss (Shazeer 2017, GShard 2020): Add a loss term that penalizes uneven expert utilization. Effective but degrades model quality because the auxiliary loss fights against the primary training objective.
2. Bias terms (DeepSeek-V3, 2024): Add a learned bias to the router scores, but only for routing decisions — NOT included in the training loss. The bias steers tokens to underused experts without polluting the learning signal. This is called "auxiliary-loss-free" load balancing.
3. Global-batch balancing (Qwen3, 2025): Compute load balance across the entire training batch rather than per-sequence, allowing more natural token distribution. Each sequence can be "unbalanced" as long as the batch averages out.
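A toy simulation of the bias-term idea from fix #2 (the update rule here is a deliberately simplified sketch of the mechanism, not DeepSeek-V3's exact procedure):

```python
import numpy as np

def balanced_top_k(scores, bias, k):
    """Select experts by score + bias; outputs would still mix by raw score."""
    return np.argsort(scores + bias)[-k:]    # bias only steers the routing choice

def update_bias(bias, counts, step=0.01):
    """Nudge underused experts up, overused experts down — no loss term."""
    target = counts.mean()
    return bias + step * np.sign(target - counts)

rng = np.random.default_rng(0)
n_experts, k = 8, 2
bias = np.zeros(n_experts)
counts = np.zeros(n_experts)
for _ in range(1000):                        # simulate a stream of tokens
    # Systematic offset: without the bias, experts 6 and 7 would dominate
    scores = rng.normal(size=n_experts) + np.linspace(0, 2, n_experts)
    chosen = balanced_top_k(scores, bias, k)
    counts[chosen] += 1
    bias = update_bias(bias, counts)

assert counts.min() > 0   # even the "unpopular" experts end up receiving tokens
```

Because the bias enters only the routing decision and never the loss, it steers traffic without fighting the primary training objective — the core of the "auxiliary-loss-free" approach.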
Advantages and Disadvantages
The compute cost is actually lower than a dense model of the same capacity (only active parameters compute per token), and the architecture doesn't break during fine-tuning. The real challenge is that MoE routing is sparse — not all experts see all examples. During fine-tuning on your domain data, some experts may barely encounter any of it, while the router keeps sending domain tokens to the same 2-3 "favorite" experts. The result is uneven adaptation: some experts specialize in your domain while others stay generic.
Parts 1-3 covered upgrades within the transformer framework — better position encoding, cheaper normalization, smarter attention heads, faster computation, and MoE. The transformer block pattern (attention + FFN + residual) stayed the same throughout.
Part 4 asks a different question: what if the attention mechanism itself — the O(N²) core of the transformer — isn't always the best tool? Not because it's bad, but because there are tasks where a fundamentally different approach is more efficient.
The answer isn't "replace transformers." It's "augment them." The future is hybrid.
Mamba: A Fundamentally Different Way to Process Sequences
Every upgrade so far has worked within the attention framework. Flash Attention made it faster. GQA made it cheaper. Sliding Window made it narrower. But all of them still compute attention — they still ask "how does every token relate to every other token?"
Mamba asks a different question entirely: "what if you don't need to look at all tokens at once?"
Transformer attention is O(N²) in sequence length. Even with Flash Attention (which reduces the memory footprint), the quadratic scaling makes very long sequences expensive. Double the sequence length, and you quadruple the compute. For 1M+ token contexts, this becomes prohibitive.
Previous state space models (S4, by Albert Gu) offered O(N) linear scaling, but with a catch: they used fixed state transitions — the same rules applied regardless of the input. The model couldn't distinguish between important and unimportant tokens. Mamba (Gu & Dao, December 2023) solved this by making the transitions input-dependent: the model learns to selectively "remember" important tokens and "forget" irrelevant ones.
The analogy: transformer attention is a group discussion where everyone talks to everyone simultaneously — powerful for finding connections, but the cost grows with the square of the group size. Mamba is reading a book with a notepad. You process one page at a time, write down what seems important, cross out what doesn't matter anymore, and update your running summary as you go. By the end, your notepad captures the key ideas — and the process was O(N) linear. The notepad is the "state vector."
The critical difference from older approaches: Mamba's notepad is smart. It doesn't just mechanically record everything with the same rules. It looks at each new token and decides how much to remember and how much to forget, based on the content. A proper noun gets written down carefully. A filler word gets barely noted. This input-dependent selectivity is what makes Mamba competitive with transformers on quality, not just speed.
Mamba-3B matches transformers of the same size and beats transformers 2x its size on language modeling. It achieves 5x higher inference throughput than transformers with linear time and constant memory — no KV cache needed. This speed comes from hardware-aware design: kernel fusion (fusing multiple operations into one GPU kernel), parallel scan (processing the recurrence in parallel, not truly sequential), and recomputation instead of materialization (saving memory by recomputing during the backward pass).
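A toy recurrence capturing the "smart notepad" idea (this is a deliberately simplified gated update, not the actual Mamba SSM, which uses structured state matrices, discretization, and a hardware-parallel scan):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_scan(tokens, W_gate, W_in):
    """Toy input-dependent recurrence: the 'notepad' state h is updated
    token by token, with a forget gate computed from each token's content."""
    d_state = W_in.shape[1]
    h = np.zeros(d_state)                 # the notepad: constant-size state
    for x in tokens:                      # O(N): one pass, no pairwise scores
        forget = sigmoid(x @ W_gate)      # content decides how much to keep
        h = forget * h + (1 - forget) * (x @ W_in)
    return h                              # fixed-size summary of the sequence

rng = np.random.default_rng(0)
T, d_model, d_state = 20, 8, 4
tokens = rng.normal(size=(T, d_model))
W_gate = rng.normal(size=(d_model, d_state))
W_in = rng.normal(size=(d_model, d_state))
summary = selective_scan(tokens, W_gate, W_in)
assert summary.shape == (d_state,)        # memory is O(1) in sequence length
```

The gate depending on x is the "selectivity" that separates Mamba from earlier SSMs like S4: the update rules change per token, so a proper noun and a filler word are written to the notepad differently.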
In May 2024, Tri Dao and Albert Gu published Mamba-2 with a remarkable title: "Transformers are SSMs."
They proved a mathematical equivalence: a specific form of structured state space model is exactly equivalent to a specific form of linear attention. The two paradigms aren't fundamentally different — they're different views of the same mathematical operation, with different computational trade-offs.
Beyond the theoretical insight, Mamba-2 is 2-8x faster than Mamba-1 thanks to a reformulated algorithm that maps better onto GPU matrix multiply units.
This is philosophically important: it means the field isn't "transformers vs SSMs." It's one unified framework. Some computations are more efficient as attention (precise retrieval), others as recurrence (sequential processing). The best architectures will use both.
Jamba: The Hybrid in Production
If Mamba is O(N) and transformers are O(N²), why not just use Mamba for everything? Because attention has one ability that Mamba doesn't match: precise retrieval. If you need to find a specific name mentioned 50,000 tokens ago, attention can look directly at that token. Mamba's compressed state vector may have lost that detail — it depends on whether the model deemed it important enough to "remember."
The question becomes: can you get the best of both? Use Mamba for the bulk of processing (fast, efficient, handles narrative and context flow well), but inject a few attention layers where precision matters?
Jamba (AI21 Labs, March 2024) answers yes. It interleaves transformer attention and Mamba layers in a 1:7 ratio — for every 8 layers, 1 uses transformer attention and 7 use Mamba. It also applies MoE to every other layer (16 experts, top-2).
Why 1:7 specifically? The 7 Mamba layers handle the sequential "reading with a notepad" work — building up context, processing tokens efficiently, maintaining the narrative flow. The 1 attention layer acts as a "checkpoint" — it can look at any token in the full context with precise attention, catching anything the Mamba layers might have compressed too aggressively. Think of it as reading a long report: you skim most pages efficiently (Mamba), but every 8th page you stop and carefully cross-reference specific details against everything you've read (attention).
Why does MoE appear on alternating layers? The MoE layers add knowledge capacity without proportional compute — the same benefit from Part 3. But not every layer needs it. The non-MoE layers use a single dense FFN (cheaper, simpler). Alternating gives the model expert specialization where it helps, while keeping the overall parameter count manageable.
The result is a triple hybrid: Mamba for efficient sequential processing + attention for precise retrieval + MoE for knowledge capacity. Each component does what it's best at.
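As a sketch, the interleaving can be written as a simple layer schedule. The exact offsets below (attention as every 8th layer, MoE on odd layers) are assumptions for illustration; Jamba's actual block structure is specified in its paper:

```python
def jamba_style_layout(n_layers: int = 32):
    """Assign each layer a token mixer (1 attention per 7 Mamba) and an
    FFN type (MoE on every other layer). Illustrative offsets only."""
    layout = []
    for i in range(n_layers):
        mixer = "attention" if i % 8 == 7 else "mamba"
        ffn = "moe" if i % 2 == 1 else "dense"
        layout.append((mixer, ffn))
    return layout

layout = jamba_style_layout()
assert sum(m == "attention" for m, _ in layout) == 4   # 1 in 8 of 32 layers
assert sum(f == "moe" for _, f in layout) == 16        # every other layer
```

The point of writing it out: the hybrid is not a new kind of layer, just a schedule over layer types you already know.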
Results: Jamba fits on a single 80GB GPU, handles 256K token context, and achieves 3x throughput vs a comparable transformer — with competitive language modeling quality. 52B total params, only 12B active. For context: a pure transformer with the same quality would need multiple GPUs and significantly more memory. Jamba showed that hybrid architectures aren't just theoretically interesting — they're practically superior for the memory-constrained real world.
RWKV: RNNs Strike Back
Jamba interleaves attention with Mamba. RWKV (Receptance Weighted Key Value, Peng et al., EMNLP 2023) goes further — it eliminates attention entirely.
RWKV solves a different problem than Mamba. Mamba's innovation was selectivity — making state transitions input-dependent. RWKV's innovation is dual-mode operation: it can run as a transformer during training (processing all tokens in parallel for fast training) and as an RNN during inference (processing token-by-token with constant O(1) memory per step). Same model, two operating modes.
This is a big deal for deployment. During training, you want parallelism — process the entire sequence at once across your GPUs. During inference, you want efficiency — generate one token at a time with minimal memory. Transformers train in parallel, but inference is still token-by-token, and the KV cache grows with every token generated. RNNs have fast, constant-memory inference, but their training can't be parallelized across time steps. RWKV gets parallel training AND constant-memory sequential inference — the best of both worlds.
How different from Mamba? Mamba uses a continuous state space model with learned selectivity. RWKV uses a reformulated attention mechanism where the attention weights decay exponentially with distance — so recent tokens get full weight, distant tokens get fading weight. The decay rates are learned. At training time, this can be computed in parallel (like attention). At inference time, it can be computed recurrently (like an RNN). The two computations are mathematically equivalent, just organized differently.
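The two modes can be checked against each other in a few lines. This is a simplified single-channel version of the idea — an exponentially decaying, key-weighted average of past values — omitting RWKV's real details (per-channel learned decay, the bonus term for the current token, receptance gating):

```python
import numpy as np

def parallel_mode(k, v, w):
    """Training view: each position weights all past values at once
    (the per-position work here is one row of a T x T decay matrix,
    so the whole thing is one parallel matrix computation)."""
    out = np.empty_like(v)
    for t in range(len(k)):
        weights = np.exp(k[: t + 1] - w * (t - np.arange(t + 1)))
        out[t] = weights @ v[: t + 1] / weights.sum()
    return out

def recurrent_mode(k, v, w):
    """Inference view: two scalars of running state, O(1) memory per step."""
    out = np.empty_like(v)
    num = den = 0.0
    for t in range(len(k)):
        num = np.exp(-w) * num + np.exp(k[t]) * v[t]
        den = np.exp(-w) * den + np.exp(k[t])
        out[t] = num / den
    return out

rng = np.random.default_rng(1)
k, v, w = rng.normal(size=32), rng.normal(size=32), 0.3
assert np.allclose(parallel_mode(k, v, w), recurrent_mode(k, v, w))
```

The assert is the whole point: the parallel form and the recurrent form are the same function, so you can train in one mode and deploy in the other.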
The real-world impact proves this isn't just academic: Microsoft deployed RWKV v5 "Eagle" to 1.5 billion Windows 10/11 devices for on-device Copilot — the largest deployment of an attention-free architecture in production. Why RWKV for this? Because on-device means tight memory constraints and no cloud GPUs. RWKV's constant-memory inference makes it ideal for running on laptops and phones, where a transformer's KV cache would be prohibitively expensive. RWKV-7 "Goose" (2025) describes itself as the "strongest attention-free, 100% RNN architecture."
Are These Replacing Transformers?
No. And the reason goes back to the question we started with.
Remember the Ship of Theseus? We asked: if you replace every plank, is it still the same ship? We spent Parts 2 and 3 watching components get swapped out — RoPE for sinusoidal encoding, RMSNorm for LayerNorm, GQA for MHA, SwiGLU for ReLU, MoE for dense FFN. And yet the answer was clear: yes, it's still a transformer. The core — Q/K/V attention, residual connections, the alternating attention-then-FFN pattern — never changed.
Now Part 4 asks a different question: what if you build a different ship entirely? Mamba doesn't replace planks — it reimagines the hull. Instead of every token talking to every other token (attention), it processes tokens one at a time through a learned state. RWKV takes a different approach, using exponentially decaying weights that can run as parallel attention during training and sequential state during inference. These aren't upgraded transformers. They're alternative architectures.
So are they winning? The honest answer: not yet, and maybe not alone.
Pure transformers still produce the highest quality on most benchmarks — especially tasks that require precise retrieval. Here's a concrete example: imagine you're processing a 200-page legal contract and you ask "what was the liability cap mentioned in clause 47?" With transformer attention, every token can directly compare against every other token — the model can look at the question and attend specifically to clause 47, wherever it sits in the 100,000-token context. It's a direct lookup.
With Mamba, all those 100,000 tokens have been compressed into a single fixed-size state vector — a summary. If the model deemed that liability cap important enough to "remember" during sequential processing, it's there. If not, it's lost. For most tasks (summarization, general Q&A, narrative understanding), the compressed state is sufficient. But for needle-in-a-haystack retrieval — finding one specific detail in a vast context — attention's direct access wins.
But SSMs offer something transformers can't: truly linear scaling. Double the context length and a transformer's attention cost quadruples. Double it for Mamba and the cost merely doubles. For processing millions of tokens — genomics, legal corpora, codebases — that difference isn't incremental, it's the difference between possible and impossible.
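A back-of-the-envelope comparison makes the gap concrete. The constants below are assumed round numbers; only the growth rates matter:

```python
def attention_cost(n, d=4096):
    """Pairwise score matrix: grows as n^2."""
    return n * n * d

def ssm_cost(n, d=4096, state=16):
    """One fixed-size state update per token: grows as n."""
    return n * d * state

base = 8_000
for n in (8_000, 16_000, 32_000):
    print(n, attention_cost(n) / attention_cost(base), ssm_cost(n) / ssm_cost(base))
# attention: 1x, 4x, 16x  |  SSM: 1x, 2x, 4x
```

Two doublings of context cost attention 16x and the SSM 4x; extend that to millions of tokens and the quadratic term is what makes the workload infeasible.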
This is exactly why the industry is converging on hybrid architectures. Jamba's 1:7 ratio wasn't arbitrary — it was an engineering answer to a practical question: how do you get the precision of attention where it matters most, while using Mamba's efficiency for the long stretches in between? The result: a model with a 256K context window that fits in the memory budget of a 70K-context pure transformer.
Even the researchers building these alternatives see convergence, not competition. The Mamba-2 paper is titled "Transformers are SSMs" — arguing that attention and state space models are different views of the same underlying mathematics. They're not rivals. They're complementary tools.
The emerging recipe looks like this: transformer attention as the backbone for precision, SSM layers for efficiency on long sequences, and MoE for capacity without proportional compute cost. Not one architecture winning, but three ideas combining. The ship isn't being replaced by a different ship — it's being extended with new kinds of planks that the original builders never imagined.
The Modern Recipe: 2017 vs 2025
Here's what a state-of-the-art transformer block looks like today, compared to the original:
| Component | 2017 original | 2025 modern |
|---|---|---|
| Positional encoding | Sinusoidal PE, added once at the input | RoPE, applied to Q and K at every layer |
| Attention | Multi-head, full N×N scores materialized | GQA / MLA heads + Flash Attention tiling |
| Normalization | Post-LayerNorm | Pre-RMSNorm |
| FFN activation | ReLU | SwiGLU |
| Year | Innovation | Paper / Model | Impact |
|---|---|---|---|
| 2017 | Original Transformer | Vaswani et al., "Attention Is All You Need" | Foundation of everything |
| 2018 | GPT-1, BERT | OpenAI, Google | Decoder-only and encoder-only paradigms |
| 2019 | Multi-Query Attention | Shazeer | First KV cache reduction technique |
| 2019 | RMSNorm | Zhang & Sennrich, NeurIPS | Cheaper normalization, zero quality loss |
| 2020 | GPT-3 (175B) | OpenAI | Proved scale = capability (Kaplan scaling laws) |
| 2020 | SwiGLU | Shazeer | Gated activations replaced ReLU |
| 2021 | RoPE | Su et al., "RoFormer" | Relative position via rotation |
| 2022 | Chinchilla | Hoffmann et al., DeepMind | Proved optimal data-to-param ratio. Changed industry from "bigger model" to "more data" |
| 2022 | Flash Attention | Tri Dao, Stanford | O(N) memory, 2-4x speed, enabled 100K+ context |
| 2023 | Llama 1 | Meta | First major open Pre-RMSNorm + RoPE + SwiGLU model |
| 2023 | GQA | Ainslie et al., EMNLP | Middle ground: shared KV groups |
| 2023 | Mistral 7B | Mistral AI | Sliding window attention + GQA |
| 2023 | Mixtral 8x7B | Mistral AI | First widely-used open MoE |
| 2023 | Mamba | Gu & Dao | Selective SSM, linear-time alternative to attention |
| 2023 | RWKV | Peng et al., EMNLP | Parallel training + O(1) inference RNN |
| 2024 | Mamba-2 | Dao & Gu, "Transformers are SSMs" | Unified framework, 2-8x faster |
| 2024 | MLA | DeepSeek-V2 | 28x KV cache compression |
| 2024 | Jamba | AI21 Labs | First production hybrid (Transformer + Mamba + MoE) |
| 2024 | DeepSeek-V3 | DeepSeek | 671B total, 37B active, trained for ~$5.5M |
| 2024 | Flash Attention 3 | Tri Dao | FP8 support, 740 TFLOPs/s on H100 |
| 2025 | DeepSeek-R1 | DeepSeek | Spontaneous CoT from pure RL |
| 2025 | Llama 4 | Meta | Scout/Maverick/Behemoth MoE family, 10M context |
| 2025 | Qwen3 | Alibaba | Global-batch load balancing, thinking/non-thinking modes |
Noam Shazeer — Co-authored the original transformer (2017), invented MQA (2019), SwiGLU (2020), and pioneer of MoE load balancing. Co-founded Character.AI, then returned to Google.
Tri Dao — Created Flash Attention (2022-2024), co-created Mamba (2023-2024). Stanford PhD, co-founder/Chief Scientist at Together AI.
Albert Gu — Co-created S4 and Mamba SSM architectures. Carnegie Mellon professor. His work on structured state spaces opened the SSM paradigm.
Ashish Vaswani — Lead author of "Attention Is All You Need." Co-founded Adept AI, then Essential AI.
Core Papers:
- Vaswani et al., "Attention Is All You Need" (2017) — arXiv:1706.03762
- Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021) — arXiv:2104.09864
- Zhang & Sennrich, "Root Mean Square Layer Normalization" (2019, NeurIPS) — arXiv:1910.07467
- Shazeer, "Fast Transformer Decoding: One Write-Head is All You Need" (2019) — arXiv:1911.02150
- Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (2023, EMNLP) — arXiv:2305.13245
- Shazeer, "GLU Variants Improve Transformer" (2020) — arXiv:2002.05202
- Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022) — arXiv:2205.14135
- Dao, "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (2023) — arXiv:2307.08691
- Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla, 2022) — arXiv:2203.15556
Architecture Papers:
- Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023) — arXiv:2312.00752
- Dao & Gu, "Transformers are SSMs" (Mamba-2, 2024) — arXiv:2405.21060
- Peng et al., "RWKV: Reinventing RNNs for the Transformer Era" (2023, EMNLP) — arXiv:2305.13048
- Lieber et al., "Jamba: A Hybrid Transformer-Mamba Language Model" (2024) — arXiv:2403.19887
- DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (2024) — arXiv:2405.04434
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024) — arXiv:2412.19437
- DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025) — arXiv:2501.12948
- Jiang et al., "Mixtral of Experts" (2024) — arXiv:2401.04088
Educational Resources:
- Jay Alammar — "The Illustrated Transformer" (jalammar.github.io)
- Lilian Weng — "The Transformer Family Version 2.0" (lilianweng.github.io)
- Sebastian Raschka — "Understanding Large Language Models" (magazine.sebastianraschka.com)
Q/K/V Attention
Same formula since 2017. Every model computes softmax(QKᵀ/√d)V. The variants optimize how heads are organized and how the computation runs on hardware — the math hasn't changed.
RoPE
Rotation encodes relative position. Applied to Q and K at every layer (not just input). Enables context extension beyond training length. Universal from 2023+.
RMSNorm
Cheaper normalization, pre-norm placement. Drops mean-centering, keeps only RMS scaling. Pre-norm gives gradients a clean residual path. 7-64% faster, zero quality loss.
GQA / MLA
Share or compress KV heads. GQA: 8 shared KV heads instead of 64 (8x savings). MLA: compress to tiny latent vector (28x savings). Essential for long context windows.
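The savings are simple arithmetic. A sketch using assumed Llama-2-70B-like shapes (80 layers, 128-dim heads, fp16) for illustration:

```python
def kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV cache cost per token: a K and a V vector per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

mha = kv_bytes_per_token(n_kv_heads=64)   # one KV head per query head
gqa = kv_bytes_per_token(n_kv_heads=8)    # 8 shared KV heads (GQA)
print(mha // gqa)                          # 8x smaller cache
print(gqa * 100_000 / 2**30)               # GB of cache for a 100K context
```

Even after the 8x reduction, a 100K context still costs tens of gigabytes of cache at these shapes — which is exactly the pressure that pushed DeepSeek to compress further with MLA.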
SwiGLU
Gated activation function. Two parallel paths multiplied together — the network learns which information to pass. Replaced ReLU's dead neuron problem with learned control.
Flash Attention
Same math, GPU-optimized tiling. Processes attention in SRAM tiles instead of materializing N×N in HBM. O(N²) → O(N) memory. Enabled 100K+ context windows.
Mixture of Experts
Many experts, router selects few. DeepSeek-V3: 671B total, 37B active. Router picks 8 of 256 experts per token. Massive capacity at fraction of compute cost.
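The routing step itself is tiny — score the experts, keep the top k, normalize their weights. A minimal sketch (expert count and k taken from DeepSeek-V3's figures; its actual scoring and load-balancing details differ):

```python
import numpy as np

def route(token_scores, k=8):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = np.argpartition(token_scores, -k)[-k:]
    w = np.exp(token_scores[top] - token_scores[top].max())
    return top, w / w.sum()

scores = np.random.default_rng(2).normal(size=256)   # router logits, 256 experts
experts, weights = route(scores)
assert len(experts) == 8 and np.isclose(weights.sum(), 1.0)
```

The token's output is then the weighted sum of just those 8 expert FFNs — the other 248 never run, which is where the 671B-total / 37B-active gap comes from.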
Hybrid Future
Attention backbone + SSM efficiency + MoE capacity. Jamba: 1 attention layer per 7 Mamba layers. Transformers aren't being replaced — they're being augmented.