بِسْمِ اللَّهِ الرَّحْمَـٰنِ الرَّحِيمِ
In the name of Allah, the Most Gracious, the Most Merciful
The ancient Greeks had a thought experiment: if you replace every plank of a ship, one at a time, is it still the same ship? They called it the Ship of Theseus.
The original transformer from "Attention Is All You Need" (2017) has had the same thing happen to it. The positional encoding? Replaced. The normalization? Replaced. The attention heads? Reorganized. The activation function? Swapped. The feed-forward network? Split into dozens of parallel experts with a router.
Q/K/V attention is still there. Residual connections are still there. But almost everything else has been upgraded. At what point is it a different architecture?
Answer: it isn't. Every upgrade is defined by what it replaced. Understanding the original is exactly how you understand the modern version.
- Part 1: What stayed the same from the original transformer (and why that matters)
- Part 2: Six component upgrades: RoPE, RMSNorm, GQA/MLA, SwiGLU, Flash Attention, Sliding Window
- Part 3: Mixture of Experts — how you get 671B parameters at the cost of 37B
- Part 4: Beyond transformers — Mamba, hybrids, and where architecture is heading
- You understand the basics of the transformer (Q/K/V attention, encoder/decoder) and want to know what changed since 2017
- You see model cards with "RoPE, GQA, SwiGLU, MoE" and want to know what each actually means
- You read How LLMs Actually Work and want the natural sequel
Before we look at what changed, let's establish what didn't. This matters more than it sounds.
If you learned about the transformer — from the original paper, a course, or the previous post — you might worry that your knowledge is outdated. You read "Attention Is All You Need" from 2017, and now it's 2026. Surely the architecture has changed beyond recognition?
It hasn't. The core of what you learned is still running inside GPT-4, Claude, Llama 4, DeepSeek-R1, and every other major LLM today. Here's specifically what stayed the same — and why it matters that you know this before looking at the upgrades.
The Unchanged Core
Q/K/V Dot-Product Attention — the exact same formula from the 2017 paper. Every token creates a Query ("what am I looking for?"), a Key ("what do I contain?"), and a Value ("what meaning do I carry?"). The dot product of Q and K produces attention scores. Softmax normalizes them. The result weights the Values.
This is identical in every modern model. If you open the DeepSeek-V3 technical report (December 2024, 671 billion parameters), the attention computation is softmax(QKᵀ/√d) × V — the same formula Vaswani et al. wrote in 2017. The variants we'll cover in Part 2 change how the heads are organized and how the computation is scheduled on hardware — but the underlying math hasn't changed in 9 years.
Residual Connections — every transformer block still adds the output of each sub-layer back to its input. This skip connection was in the original paper and has not been removed in any major model. In fact, it became more important as models grew deeper. At 80-120+ layers, gradients would vanish without residual connections. They're the reason modern models can be so deep.
The Transformer Block Pattern — the repeating unit: Attention sub-layer, then Feed-Forward sub-layer, stacked N times. Llama 4 Scout stacks this 48 times. DeepSeek-V3 stacks it 61 times. The skeleton is the same — what goes inside each sub-layer is what changed.
Softmax and Learned Embeddings — unchanged. Every model converts tokens to dense vectors through a learned embedding table. Every attention layer normalizes scores through softmax. These haven't been touched.
Decoder-Only Simplification — the original transformer had an encoder and a decoder. Since GPT (2018), modern LLMs almost universally use decoder-only. This was the single biggest structural change from the original paper — but the components inside each block stayed the same.
Why this matters: every upgrade in this post is defined by what it replaced from this foundation. If you understand the original components, you already understand the "before" picture for every upgrade. That's why the original transformer isn't outdated knowledge — it's the reference frame for everything that came after.
GPT-4, Claude, Llama 4, DeepSeek-R1, Qwen3, Gemini — every major production LLM still uses transformer blocks with Q/K/V attention. Mamba and SSMs show promise (especially in hybrids), but they haven't replaced transformers. We'll cover what's actually happening in Part 4.
Now that we know the foundation is solid, let's look at what got upgraded — and why.
Each upgrade that follows is a targeted improvement to one specific component of the transformer. The pattern is always the same: there was an original component from 2017, it had a specific limitation, and someone found a better replacement. Understanding the original tells you exactly what changed and why.
Six upgrades. Each one defined by what it replaced. Six planks of the ship, swapped for something better.
Positional Encoding: Sinusoidal → RoPE
The original transformer stamped a fixed "seat number" on each token using sinusoidal functions — sine and cosine waves of different frequencies, added to the token's embedding at the very first layer. Token 1 gets one pattern, token 2 gets a slightly different pattern, and so on.
This had three problems:
- Hard ceiling at training length. If you trained on 512 tokens, token 513 had no meaningful encoding. The model simply hadn't seen a position that far.
- Position info degrades through layers. The positional signal was added once at the input. By layer 32, it had been diluted through dozens of matrix multiplications.
- Absolute positions only. The model knew "this token is at position 7" but had no natural way to encode "these two tokens are 5 apart" — which is what actually matters for understanding relationships.
How RoPE works conceptually: instead of adding position info to embeddings at the input, RoPE rotates the Q and K vectors by an angle proportional to the token's position. Each pair of dimensions gets rotated by a different frequency — like hands on a clock spinning at different speeds.
When you compute the dot product Q·K, the result naturally encodes the relative distance between two tokens. Rotating by angle A then dotting with something rotated by angle B gives you cos(A-B) — the angle between them. The model doesn't need to know "this is position 7 and that is position 12." It directly sees "these are 5 apart."
Imagine each pair of dimensions in Q and K as a point on a clock face. Token at position 0 has the hands pointing at 12 o'clock. Token at position 1 is rotated by a small angle θ. Token at position 100 is rotated by 100θ.
When you compute the dot product of two rotated vectors, the result depends only on the difference in their angles — not the absolute angles themselves. Token at position 7 dotted with token at position 12 gives the same result as token at position 107 dotted with token at position 112, because both pairs are 5 apart.
Different dimension pairs use different rotation speeds (frequencies). Low-frequency pairs capture long-range relationships ("these sentences are far apart"). High-frequency pairs capture local patterns ("these words are adjacent").
The math: for position m and dimension pair i, RoPE applies a 2D rotation matrix with angle mθᵢ, where θᵢ = 10000^(−2i/d). The dot product Qₘ·Kₙ then naturally contains cos((m−n)θᵢ) terms — pure relative position.
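To make the clock analogy concrete, here is a minimal NumPy sketch of the rotation (an illustrative loop, not an optimized implementation — real libraries vectorize this and apply it per attention head):

```python
import numpy as np

def rope_rotate(vec, position, theta=10000.0):
    """Rotate each 2D pair of `vec` by position * frequency (RoPE sketch)."""
    d = len(vec)
    out = vec.astype(float).copy()
    for i in range(0, d, 2):
        freq = theta ** (-i / d)          # lower frequency for later pairs
        angle = position * freq
        c, s = np.cos(angle), np.sin(angle)
        x, y = vec[i], vec[i + 1]
        out[i]     = x * c - y * s        # standard 2D rotation
        out[i + 1] = x * s + y * c
    return out

q = np.array([1.0, 0.0, 0.5, 0.5])
k = np.array([0.2, 0.9, 0.1, 0.4])

# The dot product depends only on the relative offset:
# positions (7, 12) and (107, 112) are both 5 apart.
a = rope_rotate(q, 7)   @ rope_rotate(k, 12)
b = rope_rotate(q, 107) @ rope_rotate(k, 112)
assert np.isclose(a, b)
```

The assertion is the whole point: rotating both vectors by 100 extra positions changes nothing about their dot product, because only the angle difference survives.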
Because RoPE uses rotation, you can extend the context window beyond training length by scaling the rotation frequencies. Key techniques:
Position Interpolation (Meta, 2023): Scale all RoPE frequencies down so that positions 0-128K map to the same angle range as 0-8K during training. Simple but effective.
YaRN (2023): "NTK-aware" scaling. Instead of uniform scaling, it adjusts high-frequency and low-frequency rotation speeds differently — preserving local position resolution while extending the range.
Llama 4 Scout (Meta, 2025): Trained at 256K tokens, then fine-tuned with RoPE extensions to handle 10 million tokens. This is possible because RoPE encodes relative position — the model can generalize to longer sequences even if it hasn't seen them during initial training.
RoPE fixed how the model knows where tokens are. The next upgrade fixes something more subtle — how the model keeps its activations stable as signals pass through dozens of layers.
Normalization: LayerNorm → RMSNorm
The original transformer used LayerNorm (2016), which normalizes activations by computing two statistics: the mean and the variance. It subtracts the mean (re-centering), divides by the standard deviation (re-scaling), then applies two learned parameters (scale and shift).
RMSNorm (Zhang & Sennrich, 2019) drops the mean-centering step entirely. It only computes one statistic — the Root Mean Square — and applies one learned parameter (scale only). That's it.
Why removing the mean doesn't matter: Think of normalization as having two jobs — (1) centering the data around zero, and (2) controlling the scale so values don't explode or vanish. It turns out that job #2 is the one that actually matters for training stability. When activations blow up to huge numbers or shrink to near-zero, gradients break. Controlling the scale prevents that. But centering? The very next linear layer in the network can learn its own bias term, which effectively re-centers the data however it wants. So LayerNorm was doing redundant work — computing a mean, subtracting it, then the next layer immediately learns to shift things again anyway.
The analogy: imagine calibrating a scale before weighing something. LayerNorm both zeros out the scale (re-centering) AND adjusts the units (re-scaling). RMSNorm only adjusts the units — because the next person in line will zero it out however they need to. Half the work, same result.
Practical impact: 7-64% faster normalization depending on hardware, with zero quality loss. At scale, this adds up. When you're training on thousands of GPUs for weeks, shaving even 10% off a per-layer operation that runs billions of times translates to real money and time saved.
The switch happened in two stages. GPT-2 (2019) moved from post-norm to pre-norm (keeping LayerNorm). Llama 1 (2023) switched to pre-RMSNorm. Since then, every major open-weight model — Mistral, DeepSeek, Qwen, Gemma — uses pre-RMSNorm.
LayerNorm:
1. Compute mean: μ = mean(x)
2. Compute variance: σ² = var(x)
3. Normalize: (x - μ) / √(σ² + ε)
4. Scale and shift: γ × normalized + β
Two statistics (mean, variance). Two learned parameters (γ, β).
RMSNorm:
1. Compute RMS: rms = √(mean(x²))
2. Normalize: x / rms
3. Scale only: γ × normalized
One statistic (RMS). One learned parameter (γ). Roughly half the operations, no quality loss.
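The two procedures side by side, as a minimal NumPy sketch (per-vector normalization, with the learned parameters reduced to scalars for brevity):

```python
import numpy as np

def layernorm(x, gamma=1.0, beta=0.0, eps=1e-6):
    mu = x.mean()                               # statistic 1: mean
    var = x.var()                               # statistic 2: variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rmsnorm(x, gamma=1.0, eps=1e-6):
    rms = np.sqrt(np.mean(x ** 2) + eps)        # single statistic
    return gamma * x / rms                      # scale only, no shift

x = np.array([2.0, -1.0, 3.0, 0.5])
assert abs(layernorm(x).mean()) < 1e-6          # LayerNorm re-centers around zero
assert abs(np.mean(rmsnorm(x) ** 2) - 1) < 1e-3 # RMSNorm only controls the scale
```

Note that `rmsnorm` never touches the mean: the next linear layer's bias can re-center the data if it needs to, which is exactly why the extra work was redundant.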
So far we've upgraded where (RoPE) and how stable (RMSNorm). The next three upgrades tackle the biggest practical bottleneck: memory and compute cost during inference. They're all different angles on the same problem — making long sequences affordable.
Attention Heads: MHA → GQA → MLA
This is the evolution of how Q, K, V heads are organized. Remember: the attention math itself (QKᵀ/√d → softmax → ×V) is identical. What changes is how many separate K and V projections exist.
The problem this solves: during text generation, the model stores all previous K and V vectors in a "KV cache" to avoid recomputing them for every new token. For long sequences, this cache dominates GPU memory.
How bad can it get? Take a model with 128 full attention heads of dimension 128, using standard MHA in FP16. KV cache per token per layer = 2 (K+V) × 128 heads × 128 dims × 2 bytes = 65 KB. At 100K tokens: 100,000 × 65 KB = 6.5 GB just for one layer's KV cache. At 1M tokens, that's 65 GB per layer — and a deep model has dozens of layers, so the total dwarfs an entire A100 GPU.
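The arithmetic above generalizes to a one-line sizing formula. This sketch (illustrative helper, decimal GB) compares the full-MHA configuration with a GQA variant that caches only 8 KV heads:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_val=2):
    """FP16 KV cache size: 2 tensors (K and V) per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_val

# Full MHA: 128 KV heads of dim 128, one layer, 100K tokens
per_layer = kv_cache_bytes(1, 128, 128, 100_000)
print(per_layer / 1e9, "GB per layer")   # 6.5536 GB per layer

# Same layer with GQA (8 KV heads): 16x smaller cache
gqa = kv_cache_bytes(1, 8, 128, 100_000)
assert per_layer // gqa == 16
```

Multiply by layer count and sequence length and it becomes obvious why every post-2022 architecture attacks the KV cache first.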
The evolution happened in stages:
Multi-Query Attention (MQA, Shazeer 2019) — all query heads share a single K and V projection. Extreme memory reduction (from N KV heads to 1), with a small quality trade-off (~2%) and 1.8-2.4x faster decoding. Used by PaLM and Falcon. Invented by the same Noam Shazeer who co-authored "Attention Is All You Need."
Grouped Query Attention (GQA, Ainslie et al. 2023) — the middle ground. G groups of KV heads, where each group of query heads shares one KV pair. Near-MHA quality with near-MQA speed. Specific configurations:
- Llama 2 70B: 64 query heads, 8 KV heads
- Llama 3 70B: 64 query heads, 8 KV heads
- Mistral 7B: 32 query heads, 8 KV heads
- Qwen3-235B: 64 query heads, 4 KV heads
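A minimal NumPy sketch of the GQA idea (causal masking omitted for brevity; `gqa_attention` and its shapes are illustrative, not any library's API):

```python
import numpy as np

def gqa_attention(Q, K, V, n_groups):
    """Grouped-query attention sketch: n_q query heads share n_groups KV heads."""
    n_q, T, d = Q.shape
    heads_per_group = n_q // n_groups
    # Each query head reuses the KV head of its group
    K_full = np.repeat(K, heads_per_group, axis=0)   # (n_q, T, d)
    V_full = np.repeat(V, heads_per_group, axis=0)
    scores = Q @ K_full.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax per query
    return weights @ V_full

rng = np.random.default_rng(0)
T, d = 5, 8
Q = rng.normal(size=(32, T, d))   # 32 query heads (Mistral-7B-like ratio)
K = rng.normal(size=(8, T, d))    # only 8 KV heads ever need caching
V = rng.normal(size=(8, T, d))
out = gqa_attention(Q, K, V, n_groups=8)
assert out.shape == (32, T, d)    # full output quality path, 4x smaller KV cache
```

The `np.repeat` is the whole trick: the expensive-to-cache K and V exist only 8 times, and are broadcast back out to all 32 query heads at compute time.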
Multi-Head Latent Attention (MLA, DeepSeek-V2 2024) — instead of caching full-dimensional K and V, MLA compresses them into a much smaller "latent" vector using a down-projection. At inference, it projects back up. DeepSeek-V3 achieves 28x KV cache compression — from 213.5 GB down to 7.6 GB.
In standard MHA, each token at each layer caches a full K vector and V vector across all heads. For DeepSeek-V3 with 128 heads and 128 dim per head, that's 128 × 128 = 16,384 values for K alone, times 2 for K+V = ~32K values per token per layer.
MLA adds a down-projection that compresses K and V into a joint latent vector of just 512 dimensions. Only this 512-dim latent is stored in the KV cache. When the model needs to compute attention, it applies an up-projection to expand the latent back to full K and V dimensions.
The compression ratio: the ~32K cached values per token per layer shrink to the 512-dim latent (plus a small 64-dim decoupled key that carries RoPE information). End to end, DeepSeek-V3 reports ~28x KV cache reduction. This is how it handles long contexts on practical hardware.
The quality cost? Negligible — because the down-projection is learned during training, the model learns what information to preserve and what to discard.
At 1M tokens, KV cache memory is the bottleneck. MHA would need hundreds of GB, and even GQA's 8x reduction leaves a massive cache. MLA's 28x compression — DeepSeek-V3 goes from 213.5 GB to 7.6 GB — makes million-token contexts practical on real hardware. This is why DeepSeek-V2 invented it.
GQA and MLA optimized the attention side of the transformer block. Now let's look at the other half — the feed-forward network (FFN). Two things changed inside it: the activation function, and (in Part 3) the entire structure.
Activation Function: ReLU → SwiGLU
The original transformer used ReLU inside the feed-forward network: if the input is positive, pass it through; if negative, output zero. Simple — but it has a fatal flaw called the "dead neuron" problem. Once a neuron outputs 0, its gradient is 0, and it never recovers. It's permanently dead.
ReLU's deeper problem is that it's amplitude-based — it only looks at whether a number is positive or negative. It can't look at the content (what the number represents in context) and make an intelligent decision about what to pass through.
GELU (used by GPT-2, GPT-3, BERT) softened this — instead of a sharp cutoff at zero, it uses a smooth curve that gives small negative values a tiny chance of passing through. Better, but still a single path making a single decision.
SwiGLU (Shazeer, 2020) fundamentally changed the structure. Instead of one path through one activation function, SwiGLU splits the FFN into two parallel paths:
- One path goes through Swish activation (a smooth, non-monotonic function) — this is the "value" path, computing what the output could be
- The other path stays linear — this is the "gate" path, computing how much of that output to allow through
- The two paths are multiplied together element-wise — the gate controls the value
The analogy: think of a recording studio. ReLU is a simple noise gate — any signal below a threshold gets cut to silence, regardless of what it is. SwiGLU is an experienced sound engineer with two hands on the mixing board: one hand holds the audio signal (the Swish path), the other hand controls the volume fader (the gate path). The engineer listens to the content and decides how much to let through — quiet passages get boosted, noise gets suppressed, and the decision is based on what the signal means, not just how loud it is.
This is the key insight: the gate path sees the same input but through different learned weights. It learns a content-dependent filter — "this pattern of activations represents something important, let it through" vs "this pattern is noise, suppress it." ReLU can't do this because it has no second path to learn the gating.
In a standard FFN, the input goes through one linear layer, then an activation function, then another linear layer: output = W2 * ReLU(W1 * x)
In SwiGLU, the first linear layer is split into two:
output = W2 * (Swish(W1 * x) ⊙ (W3 * x))
Where ⊙ means element-wise multiplication. The W3 * x path is the "gate" — it controls how much of the Swish(W1 * x) path gets through. This is why SwiGLU requires 3 weight matrices instead of 2, adding ~50% more parameters to the FFN layer. Models compensate by slightly reducing the FFN hidden dimension.
The result: measurably better downstream performance across all benchmarks, consistently. The extra parameters pay for themselves.
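The formula translates directly to code. A minimal NumPy sketch (weight shapes are illustrative; real models also typically omit biases in these layers):

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))        # x * sigmoid(x)

def swiglu_ffn(x, W1, W3, W2):
    """SwiGLU FFN: gated value path, three weight matrices instead of two."""
    value = swish(x @ W1)                # "value" path through Swish
    gate = x @ W3                        # linear "gate" path
    return (value * gate) @ W2           # element-wise gating, then project back

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.normal(size=(d_model,))
W1 = rng.normal(size=(d_model, d_ff))
W3 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))
y = swiglu_ffn(x, W1, W3, W2)
assert y.shape == (d_model,)
```

W1 and W3 both see the raw input x but learn different projections — that is the "two hands on the mixing board" from the analogy, one producing the signal and one deciding how much of it passes.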
SwiGLU improved the quality of the FFN. The next upgrade doesn't change any math at all — instead, it rewrites how the existing math runs on GPU hardware, and that single change made everything from 100K to 10M token context windows possible.
Flash Attention: Same Math, Different Hardware Strategy
This one is different from all the others. Flash Attention changes zero math. The output is bit-for-bit identical to standard attention. What it changes is entirely about how the computation is organized on GPU hardware — and that change enabled everything from 100K to 10M token context windows.
The problem: standard attention materializes the full N × N attention matrix in GPU memory. For 100K tokens, that's 100,000 × 100,000 = 10 billion entries. This matrix is computed, used once for the softmax-weighted sum, then thrown away. The waste is staggering.
Standard attention writes the full N×N matrix to slow HBM. Flash Attention processes attention in small tiles that fit entirely in fast SRAM — loads a tile of Q, K, V, computes partial attention, accumulates the result, and writes only the final output back to HBM. The N×N matrix is never materialized.
Flash Attention (Dao et al., 2022) is now the default everywhere: it's built into PyTorch as torch.nn.functional.scaled_dot_product_attention and used by every serious LLM training pipeline. The analogy: standard attention is like printing two huge spreadsheets on paper, multiplying them, then throwing the paper away. Flash Attention works with small sticky notes on your fast desk (SRAM), computing sections and keeping a running total, without ever printing the full intermediate result.
The impact:
- Memory: O(N²) → O(N) — this single change enabled 100K+ context windows
- Speed: 2-4x faster for long sequences, 3x speedup on GPT-2 training end-to-end
- FlashAttention-2 (2023): 2x faster than v1, up to 72% GPU utilization on A100, up to 225 TFLOPs/s
- FlashAttention-3 (2024): 75% utilization on H100, FP8 support, up to 740 TFLOPs/s
The trickiest part of Flash Attention is this: softmax needs to see all attention scores to compute the normalization factor (the sum of exponentials). But if you're processing in tiles, you only see a subset at a time.
Online softmax (Milakov & Gimelshein, 2018) solves this by maintaining a running maximum and running sum of exponentials. When you process a new tile:
1. Compute the new local maximum
2. Rescale the previous running sum to account for the new maximum
3. Add the new tile's exponentials to the running sum
4. The final softmax is exact — not approximate
This is why Flash Attention produces bit-for-bit identical results to standard attention. It's not an approximation — it's the same computation, reorganized for hardware efficiency.
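The four steps can be verified numerically. A minimal NumPy sketch of the running-max/running-sum trick (the real kernel also rescales a running output accumulator alongside the normalizer, omitted here for clarity):

```python
import numpy as np

def online_softmax(scores, tile=4):
    """Compute exact softmax over `scores` one tile at a time (Flash-style)."""
    m = -np.inf          # running maximum
    s = 0.0              # running sum of exp(x - m)
    for start in range(0, len(scores), tile):
        chunk = scores[start:start + tile]
        m_new = max(m, chunk.max())       # 1. new local maximum
        s = s * np.exp(m - m_new)         # 2. rescale the previous running sum
        s += np.exp(chunk - m_new).sum()  # 3. add this tile's exponentials
        m = m_new
    return np.exp(scores - m) / s         # 4. exact, not approximate

x = np.random.default_rng(0).normal(size=10)
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), reference)
```

No tile ever sees the full score vector, yet the final result matches the all-at-once softmax to floating-point precision — which is exactly why Flash Attention is bit-for-bit equivalent rather than an approximation.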
Flash Attention made each attention operation faster and lighter in memory. Sliding Window goes further — it reduces how many tokens each layer needs to attend to in the first place. Where Flash Attention optimizes the hardware, Sliding Window optimizes the algorithm. They're complementary, and many models use both together.
Sliding Window Attention
Instead of each token attending to all previous tokens (full attention), sliding window attention restricts each layer to attending only to the previous W tokens. Mistral 7B (September 2023) used a window of W = 4,096.
The analogy: imagine you're in a long hallway with 100,000 people standing in a line. Full attention means every person can whisper to every other person directly — that's 100,000 × 100,000 possible conversations. Sliding window means each person can only whisper to the 4,096 people closest to them. Far cheaper, and you'd be surprised how far messages still travel.
Why? Because the model is stacked. The clever insight is that stacking layers extends the effective reach. In layer 1, each token sees W tokens back. But in layer 2, each token — which now contains information from W tokens — sees W more tokens back. It's like a game of telephone where each relay point has a 4,096-person reach. With 32 layers and W=4,096, the theoretical reach is 32 × 4,096 = 131,072 tokens — the message has traveled across the entire sequence, just indirectly.
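The restriction is easy to see as a mask. A minimal NumPy sketch (function name and shapes are illustrative):

```python
import numpy as np

def sliding_window_mask(n_tokens, window):
    """Causal mask where token i attends only to tokens [i-window+1, i]."""
    i = np.arange(n_tokens)[:, None]
    j = np.arange(n_tokens)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(8, window=3)
# Token 5 directly sees tokens 3, 4, 5 — not token 0. Information from
# token 0 can still reach it indirectly through the stacked layers.
assert mask[5, 3] and mask[5, 5] and not mask[5, 0]
assert int(mask.sum()) < 8 * 9 // 2   # far fewer entries than full causal
```

With a fixed window, the number of attended positions per token stops growing with sequence length — that is where the fixed KV cache and linear compute come from.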
Full attention when: you need precise retrieval from anywhere in the context ("find the name mentioned 50,000 tokens ago"). Every token can directly see every other token.
Sliding window when: most important context is local (adjacent sentences, nearby code). Huge throughput and memory gains for sequences where long-range precision isn't critical.
Hybrid (best of both): Gemma's approach — alternate sliding window layers (for local patterns) with full attention layers (for long-range retrieval). Most tokens get fast local attention; critical long-range connections use the full attention layers.
Trade-off: Pure sliding window can miss dependencies that span more than layers × W tokens. The hybrid approach costs more compute than pure sliding window, but less than full attention everywhere.
The Upgrade Summary
Six upgrades, each a targeted replacement of a specific 2017 component — plus a few broader shifts that rode along with them. Here they are side by side:
| Component | Original (2017) | Modern (2023+) | Key Benefit |
|---|---|---|---|
| Position | Sinusoidal | RoPE | Relative position, extendable context |
| Normalization | LayerNorm (post) | RMSNorm (pre) | 7-64% cheaper, better gradient flow |
| KV Heads | MHA (N heads) | GQA / MLA | 8-28x KV cache reduction |
| Activation | ReLU | SwiGLU | Gated control, no dead neurons |
| Attention compute | Standard matmul | Flash Attention | O(N²) → O(N) memory, 2-4x speed |
| Attention scope | Full (all tokens) | Sliding window / hybrid | Fixed KV cache, linear compute |
| Context window | 512 tokens | 128K – 10M tokens | RoPE scaling (YaRN, NTK) |
| Training data | ~300B tokens | 15T+ tokens | Chinchilla-optimal and beyond |
| Architecture | Encoder-Decoder | Decoder-only | Simpler, scales better for generation |
Six planks replaced. The ship still sails the same way — attention + FFN + residual, stacked N times. But every plank is better than the original. Now let's look at the biggest structural change of all — one that doesn't replace a plank, but adds an entire new deck to the ship.
What if you could build a model with the knowledge of 671 billion parameters, but only pay the compute cost of 37 billion?
That's not a hypothetical. That's DeepSeek-V3, and it's the core promise of Mixture of Experts (MoE). The concept isn't new — Jacobs et al. proposed "Adaptive Mixtures of Local Experts" back in 1991. What's new is applying it at transformer scale.
The Hospital Analogy
In a dense model (like Llama 3 70B), every token passes through the same feed-forward network. All 70 billion parameters activate for every single token. It's like a hospital where every patient sees the same general doctor — whether they have a broken arm, a heart condition, or a math homework question.
In an MoE model, the single FFN is replaced with many parallel "expert" FFNs and a learned router (gating network) that selects which experts each token goes to. It's like a hospital with 256 specialist doctors and a triage nurse who sends each patient to the right 8 specialists. The other 248 sit idle for that patient.
How the Router Works
The router is a small linear layer followed by softmax. It takes the token's hidden state and produces a probability distribution over all experts — near-zero scores for most (0.002, 0.001, 0.008) and higher scores for a few (0.15, 0.12, 0.10). The top-k experts are selected, and their outputs are combined using those scores as mixing weights (w=0.15, w=0.12, w=0.10 in this example).
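A minimal NumPy sketch of such a router (top-k selection with renormalized weights; names and shapes are illustrative, not any model's actual code):

```python
import numpy as np

def route(hidden, W_router, top_k=8):
    """MoE router sketch: softmax over experts, keep top-k, renormalize."""
    logits = hidden @ W_router                     # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax over all experts
    chosen = np.argsort(probs)[-top_k:]            # indices of the top-k experts
    weights = probs[chosen] / probs[chosen].sum()  # renormalize selected weights
    return chosen, weights

rng = np.random.default_rng(0)
n_experts, d = 256, 16
hidden = rng.normal(size=(d,))
W_router = rng.normal(size=(d, n_experts))
experts, weights = route(hidden, W_router, top_k=8)
assert len(experts) == 8                # only 8 of 256 expert FFNs ever run
assert np.isclose(weights.sum(), 1.0)   # their outputs are mixed by these weights
```

The token's output is then the weighted sum of the 8 selected experts' FFN outputs; the other 248 experts cost nothing for this token.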
The Models
MoE isn't theoretical — it's how the most capable models in the world are built. The key comparison: total parameters ≠ active parameters. Llama 3 70B activates ALL 70B parameters for every token. DeepSeek-V3 only activates 37B of its 671B per token (the router selects 8 of 256 experts). So Llama 3 70B actually uses more compute per token, despite having far fewer total parameters. Massive knowledge capacity (671B) at practical compute cost (37B) — that's the entire point of MoE.
Load Balancing: The Critical Challenge
Without intervention, the router tends to "collapse" — sending all tokens to 1-2 favorite experts while the rest sit idle. This wastes the capacity you paid for. Three generations of fixes:
1. Auxiliary loss (Shazeer 2017, GShard 2020): Add a loss term that penalizes uneven expert utilization. Effective but degrades model quality because the auxiliary loss fights against the primary training objective.
2. Bias terms (DeepSeek-V3, 2024): Add a learned bias to the router scores, but only for routing decisions — NOT included in the training loss. The bias steers tokens to underused experts without polluting the learning signal. This is called "auxiliary-loss-free" load balancing.
3. Global-batch balancing (Qwen3, 2025): Compute load balance across the entire training batch rather than per-sequence, allowing more natural token distribution. Each sequence can be "unbalanced" as long as the batch averages out.
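A toy simulation of the bias-term idea from fix #2 (the update rule here is a deliberately simplified sketch of the mechanism, not DeepSeek-V3's exact procedure):

```python
import numpy as np

def balanced_top_k(scores, bias, k):
    """Select experts by score + bias; outputs would still mix by raw score."""
    return np.argsort(scores + bias)[-k:]    # bias only steers the routing choice

def update_bias(bias, counts, step=0.01):
    """Nudge underused experts up, overused experts down — no loss term."""
    target = counts.mean()
    return bias + step * np.sign(target - counts)

rng = np.random.default_rng(0)
n_experts, k = 8, 2
bias = np.zeros(n_experts)
counts = np.zeros(n_experts)
for _ in range(1000):                        # simulate a stream of tokens
    # Systematic offset: without the bias, experts 6 and 7 would dominate
    scores = rng.normal(size=n_experts) + np.linspace(0, 2, n_experts)
    chosen = balanced_top_k(scores, bias, k)
    counts[chosen] += 1
    bias = update_bias(bias, counts)

assert counts.min() > 0   # even the "unpopular" experts end up receiving tokens
```

Because the bias enters only the routing decision and never the loss, it steers traffic without fighting the primary training objective — the core of the "auxiliary-loss-free" approach.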
Advantages and Disadvantages
The compute cost is actually lower than a dense model of the same capacity (only active parameters compute per token), and the architecture doesn't break during fine-tuning. The real challenge is that MoE routing is sparse — not all experts see all examples. During fine-tuning on your domain data, some experts may barely encounter any of it, while the router keeps sending domain tokens to the same 2-3 "favorite" experts. The result is uneven adaptation: some experts specialize in your domain while others stay generic.
Parts 1-3 covered upgrades within the transformer framework — better position encoding, cheaper normalization, smarter attention heads, faster computation, and MoE. The transformer block pattern (attention + FFN + residual) stayed the same throughout.
Part 4 asks a different question: what if the attention mechanism itself — the O(N²) core of the transformer — isn't always the best tool? Not because it's bad, but because there are tasks where a fundamentally different approach is more efficient.
The answer isn't "replace transformers." It's "augment them." The future is hybrid.
Mamba: A Fundamentally Different Way to Process Sequences
Every upgrade so far has worked within the attention framework. Flash Attention made it faster. GQA made it cheaper. Sliding Window made it narrower. But all of them still compute attention — they still ask "how does every token relate to every other token?"
Mamba asks a different question entirely: "what if you don't need to look at all tokens at once?"
Transformer attention is O(N²) in sequence length. Even with Flash Attention (which reduces the memory footprint), the quadratic scaling makes very long sequences expensive. Double the sequence length, and you quadruple the compute. For 1M+ token contexts, this becomes prohibitive.
Previous state space models (S4, by Albert Gu) offered O(N) linear scaling, but with a catch: they used fixed state transitions — the same rules applied regardless of the input. The model couldn't distinguish between important and unimportant tokens. Mamba (Gu & Dao, December 2023) solved this by making the transitions input-dependent: the model learns to selectively "remember" important tokens and "forget" irrelevant ones.
The analogy: transformer attention is a group discussion where everyone talks to everyone simultaneously — powerful for finding connections, but the cost grows with the square of the group size. Mamba is reading a book with a notepad. You process one page at a time, write down what seems important, cross out what doesn't matter anymore, and update your running summary as you go. By the end, your notepad captures the key ideas — and the process was O(N) linear. The notepad is the "state vector."
The critical difference from older approaches: Mamba's notepad is smart. It doesn't just mechanically record everything with the same rules. It looks at each new token and decides how much to remember and how much to forget, based on the content. A proper noun gets written down carefully. A filler word gets barely noted. This input-dependent selectivity is what makes Mamba competitive with transformers on quality, not just speed.
Mamba-3B matches transformers of the same size and beats transformers 2x its size on language modeling. It achieves 5x higher inference throughput than transformers with linear time and constant memory — no KV cache needed. This speed comes from hardware-aware design: kernel fusion (fusing multiple operations into one GPU kernel), parallel scan (processing the recurrence in parallel, not truly sequential), and recomputation instead of materialization (saving memory by recomputing during the backward pass).
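A toy recurrence capturing the "smart notepad" idea (this is a deliberately simplified gated update, not the actual Mamba SSM, which uses structured state matrices, discretization, and a hardware-parallel scan):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_scan(tokens, W_gate, W_in):
    """Toy input-dependent recurrence: the 'notepad' state h is updated
    token by token, with a forget gate computed from each token's content."""
    d_state = W_in.shape[1]
    h = np.zeros(d_state)                 # the notepad: constant-size state
    for x in tokens:                      # O(N): one pass, no pairwise scores
        forget = sigmoid(x @ W_gate)      # content decides how much to keep
        h = forget * h + (1 - forget) * (x @ W_in)
    return h                              # fixed-size summary of the sequence

rng = np.random.default_rng(0)
T, d_model, d_state = 20, 8, 4
tokens = rng.normal(size=(T, d_model))
W_gate = rng.normal(size=(d_model, d_state))
W_in = rng.normal(size=(d_model, d_state))
summary = selective_scan(tokens, W_gate, W_in)
assert summary.shape == (d_state,)        # memory is O(1) in sequence length
```

The gate depending on x is the "selectivity" that separates Mamba from earlier SSMs like S4: the update rules change per token, so a proper noun and a filler word are written to the notepad differently.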
In May 2024, Tri Dao and Albert Gu published Mamba-2 with a remarkable title: "Transformers are SSMs."
They proved a mathematical equivalence: a specific form of structured state space model is exactly equivalent to a specific form of linear attention. The two paradigms aren't fundamentally different — they're different views of the same mathematical operation, with different computational trade-offs.
Beyond the theoretical insight, Mamba-2 is 2-8x faster than Mamba-1 thanks to a reformulated algorithm that maps better onto GPU matrix multiply units.
This is philosophically important: it means the field isn't "transformers vs SSMs." It's one unified framework. Some computations are more efficient as attention (precise retrieval), others as recurrence (sequential processing). The best architectures will use both.
Jamba: The Hybrid in Production
If Mamba is O(N) and transformers are O(N²), why not just use Mamba for everything? Because attention has one ability that Mamba doesn't match: precise retrieval. If you need to find a specific name mentioned 50,000 tokens ago, attention can look directly at that token. Mamba's compressed state vector may have lost that detail — it depends on whether the model deemed it important enough to "remember."
The question becomes: can you get the best of both? Use Mamba for the bulk of processing (fast, efficient, handles narrative and context flow well), but inject a few attention layers where precision matters?
Jamba (AI21 Labs, March 2024) answers yes. It interleaves transformer attention and Mamba layers in a 1:7 ratio — for every 8 layers, 1 uses transformer attention and 7 use Mamba. It also applies MoE to every other layer (16 experts, top-2).
Why 1:7 specifically? The 7 Mamba layers handle the sequential "reading with a notepad" work — building up context, processing tokens efficiently, maintaining the narrative flow. The 1 attention layer acts as a "checkpoint" — it can look at any token in the full context with precise attention, catching anything the Mamba layers might have compressed too aggressively. Think of it as reading a long report: you skim most pages efficiently (Mamba), but every 8th page you stop and carefully cross-reference specific details against everything you've read (attention).
Why does MoE appear on alternating layers? The MoE layers add knowledge capacity without proportional compute — the same benefit from Part 3. But not every layer needs it. The non-MoE layers use a single dense FFN (cheaper, simpler). Alternating gives the model expert specialization where it helps, while keeping the overall parameter count manageable.
The result is a triple hybrid: Mamba for efficient sequential processing + attention for precise retrieval + MoE for knowledge capacity. Each component does what it's best at.
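As a sketch, the interleaving can be written as a simple layer schedule. The exact offsets below (attention as every 8th layer, MoE on odd layers) are assumptions for illustration; Jamba's actual block structure is specified in its paper:

```python
def jamba_style_layout(n_layers: int = 32):
    """Assign each layer a token mixer (1 attention per 7 Mamba) and an
    FFN type (MoE on every other layer). Illustrative offsets only."""
    layout = []
    for i in range(n_layers):
        mixer = "attention" if i % 8 == 7 else "mamba"
        ffn = "moe" if i % 2 == 1 else "dense"
        layout.append((mixer, ffn))
    return layout

layout = jamba_style_layout()
assert sum(m == "attention" for m, _ in layout) == 4   # 1 in 8 of 32 layers
assert sum(f == "moe" for _, f in layout) == 16        # every other layer
```

The point of writing it out: the hybrid is not a new kind of layer, just a schedule over layer types you already know.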
Results: Jamba fits on a single 80GB GPU, handles 256K token context, and achieves 3x throughput vs a comparable transformer — with competitive language modeling quality. 52B total params, only 12B active. For context: a pure transformer with the same quality would need multiple GPUs and significantly more memory. Jamba showed that hybrid architectures aren't just theoretically interesting — they're practically superior for the memory-constrained real world.
RWKV: RNNs Strike Back
Jamba interleaves attention with Mamba. RWKV (Receptance Weighted Key Value, Peng et al., EMNLP 2023) goes further — it eliminates attention entirely.
RWKV solves a different problem than Mamba. Mamba's innovation was selectivity — making state transitions input-dependent. RWKV's innovation is dual-mode operation: it can run as a transformer during training (processing all tokens in parallel for fast training) and as an RNN during inference (processing token-by-token with constant O(1) memory per step). Same model, two operating modes.
This is a big deal for deployment. During training, you want parallelism — process the entire sequence at once across your GPUs. During inference, you want efficiency — generate one token at a time with minimal memory. Transformers train in parallel, but inference is still token-by-token, and the KV cache grows with every token generated. RNNs have fast, constant-memory inference, but their training can't be parallelized across time steps. RWKV gets parallel training AND constant-memory sequential inference — the best of both worlds.
How different from Mamba? Mamba uses a continuous state space model with learned selectivity. RWKV uses a reformulated attention mechanism where the attention weights decay exponentially with distance — so recent tokens get full weight, distant tokens get fading weight. The decay rates are learned. At training time, this can be computed in parallel (like attention). At inference time, it can be computed recurrently (like an RNN). The two computations are mathematically equivalent, just organized differently.
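The two modes can be checked against each other in a few lines. This is a simplified single-channel version of the idea — an exponentially decaying, key-weighted average of past values — omitting RWKV's real details (per-channel learned decay, the bonus term for the current token, receptance gating):

```python
import numpy as np

def parallel_mode(k, v, w):
    """Training view: each position weights all past values at once
    (the per-position work here is one row of a T x T decay matrix,
    so the whole thing is one parallel matrix computation)."""
    out = np.empty_like(v)
    for t in range(len(k)):
        weights = np.exp(k[: t + 1] - w * (t - np.arange(t + 1)))
        out[t] = weights @ v[: t + 1] / weights.sum()
    return out

def recurrent_mode(k, v, w):
    """Inference view: two scalars of running state, O(1) memory per step."""
    out = np.empty_like(v)
    num = den = 0.0
    for t in range(len(k)):
        num = np.exp(-w) * num + np.exp(k[t]) * v[t]
        den = np.exp(-w) * den + np.exp(k[t])
        out[t] = num / den
    return out

rng = np.random.default_rng(1)
k, v, w = rng.normal(size=32), rng.normal(size=32), 0.3
assert np.allclose(parallel_mode(k, v, w), recurrent_mode(k, v, w))
```

The assert is the whole point: the parallel form and the recurrent form are the same function, so you can train in one mode and deploy in the other.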
The real-world impact proves this isn't just academic: Microsoft deployed RWKV v5 "Eagle" to 1.5 billion Windows 10/11 devices for on-device Copilot — the largest deployment of an attention-free architecture in production. Why RWKV for this? Because on-device means tight memory constraints and no cloud GPUs. RWKV's constant-memory inference makes it ideal for running on laptops and phones, where a transformer's KV cache would be prohibitively expensive. RWKV-7 "Goose" (2025) describes itself as the "strongest attention-free, 100% RNN architecture."
Are These Replacing Transformers?
No. And the reason goes back to the question we started with.
Remember the Ship of Theseus? We asked: if you replace every plank, is it still the same ship? We spent Parts 2 and 3 watching components get swapped out — RoPE for sinusoidal encoding, RMSNorm for LayerNorm, GQA for MHA, SwiGLU for ReLU, MoE for dense FFN. And yet the answer was clear: yes, it's still a transformer. The core — Q/K/V attention, residual connections, the alternating attention-then-FFN pattern — never changed.
Now Part 4 asks a different question: what if you build a different ship entirely? Mamba doesn't replace planks — it reimagines the hull. Instead of every token talking to every other token (attention), it processes tokens one at a time through a learned state. RWKV takes a different approach, using exponentially decaying weights that can run as parallel attention during training and sequential state during inference. These aren't upgraded transformers. They're alternative architectures.
So are they winning? The honest answer: not yet, and maybe not alone.
Pure transformers still produce the highest quality on most benchmarks — especially tasks that require precise retrieval. Here's a concrete example: imagine you're processing a 200-page legal contract and you ask "what was the liability cap mentioned in clause 47?" With transformer attention, every token can directly compare against every other token — the model can look at the question and attend specifically to clause 47, wherever it sits in the 100,000-token context. It's a direct lookup.
With Mamba, all those 100,000 tokens have been compressed into a single fixed-size state vector — a summary. If the model deemed that liability cap important enough to "remember" during sequential processing, it's there. If not, it's lost. For most tasks (summarization, general Q&A, narrative understanding), the compressed state is sufficient. But for needle-in-a-haystack retrieval — finding one specific detail in a vast context — attention's direct access wins.
But SSMs offer something transformers can't: truly linear scaling. Double the context length and a transformer's attention cost quadruples. Double it for Mamba and the cost merely doubles. For processing millions of tokens — genomics, legal corpora, codebases — that difference isn't incremental, it's the difference between possible and impossible.
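A back-of-the-envelope comparison makes the gap concrete. The constants below are assumed round numbers; only the growth rates matter:

```python
def attention_cost(n, d=4096):
    """Pairwise score matrix: grows as n^2."""
    return n * n * d

def ssm_cost(n, d=4096, state=16):
    """One fixed-size state update per token: grows as n."""
    return n * d * state

base = 8_000
for n in (8_000, 16_000, 32_000):
    print(n, attention_cost(n) / attention_cost(base), ssm_cost(n) / ssm_cost(base))
# attention: 1x, 4x, 16x  |  SSM: 1x, 2x, 4x
```

Two doublings of context cost attention 16x and the SSM 4x; extend that to millions of tokens and the quadratic term is what makes the workload infeasible.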
This is exactly why the industry is converging on hybrid architectures. Jamba's 1:7 ratio wasn't arbitrary — it was an engineering answer to a practical question: how do you get the precision of attention where it matters most, while using Mamba's efficiency for the long stretches in between? The result: a model with a 256K context window that fits in the memory budget of a 70K-context pure transformer.
Even the researchers building these alternatives see convergence, not competition. The Mamba-2 paper is titled "Transformers are SSMs" — arguing that attention and state space models are different views of the same underlying mathematics. They're not rivals. They're complementary tools.
The emerging recipe looks like this: transformer attention as the backbone for precision, SSM layers for efficiency on long sequences, and MoE for capacity without proportional compute cost. Not one architecture winning, but three ideas combining. The ship isn't being replaced by a different ship — it's being extended with new kinds of planks that the original builders never imagined.
The Modern Recipe: 2017 vs 2025
Here's what a state-of-the-art transformer block looks like today, compared to the original:
| Component | 2017 original | 2025 modern |
|---|---|---|
| Positional encoding | Sinusoidal PE, added once at the input | RoPE, applied to Q and K at every layer |
| Attention | Multi-head, full N×N scores materialized | GQA / MLA heads + Flash Attention tiling |
| Normalization | Post-LayerNorm | Pre-RMSNorm |
| FFN activation | ReLU | SwiGLU |
| Year | Innovation | Paper / Model | Impact |
|---|---|---|---|
| 2017 | Original Transformer | Vaswani et al., "Attention Is All You Need" | Foundation of everything |
| 2018 | GPT-1, BERT | OpenAI, Google | Decoder-only and encoder-only paradigms |
| 2019 | Multi-Query Attention | Shazeer | First KV cache reduction technique |
| 2019 | RMSNorm | Zhang & Sennrich, NeurIPS | Cheaper normalization, zero quality loss |
| 2020 | GPT-3 (175B) | OpenAI | Proved scale = capability (Kaplan scaling laws) |
| 2020 | SwiGLU | Shazeer | Gated activations replaced ReLU |
| 2021 | RoPE | Su et al., "RoFormer" | Relative position via rotation |
| 2022 | Chinchilla | Hoffmann et al., DeepMind | Proved optimal data-to-param ratio. Changed industry from "bigger model" to "more data" |
| 2022 | Flash Attention | Tri Dao, Stanford | O(N) memory, 2-4x speed, enabled 100K+ context |
| 2023 | Llama 1 | Meta | First major open Pre-RMSNorm + RoPE + SwiGLU model |
| 2023 | GQA | Ainslie et al., EMNLP | Middle ground: shared KV groups |
| 2023 | Mistral 7B | Mistral AI | Sliding window attention + GQA |
| 2023 | Mixtral 8x7B | Mistral AI | First widely-used open MoE |
| 2023 | Mamba | Gu & Dao | Selective SSM, linear-time alternative to attention |
| 2023 | RWKV | Peng et al., EMNLP | Parallel training + O(1) inference RNN |
| 2024 | Mamba-2 | Dao & Gu, "Transformers are SSMs" | Unified framework, 2-8x faster |
| 2024 | MLA | DeepSeek-V2 | 28x KV cache compression |
| 2024 | Jamba | AI21 Labs | First production hybrid (Transformer + Mamba + MoE) |
| 2024 | DeepSeek-V3 | DeepSeek | 671B total, 37B active, trained for ~$5.5M |
| 2024 | Flash Attention 3 | Tri Dao | FP8 support, 740 TFLOPs/s on H100 |
| 2025 | DeepSeek-R1 | DeepSeek | Spontaneous CoT from pure RL |
| 2025 | Llama 4 | Meta | Scout/Maverick/Behemoth MoE family, 10M context |
| 2025 | Qwen3 | Alibaba | Global-batch load balancing, thinking/non-thinking modes |
Noam Shazeer — Co-authored the original transformer (2017), invented MQA (2019), SwiGLU (2020), and pioneer of MoE load balancing. Co-founded Character.AI, then returned to Google.
Tri Dao — Created Flash Attention (2022-2024), co-created Mamba (2023-2024). Stanford PhD, co-founder/Chief Scientist at Together AI.
Albert Gu — Co-created S4 and Mamba SSM architectures. Carnegie Mellon professor. His work on structured state spaces opened the SSM paradigm.
Ashish Vaswani — Lead author of "Attention Is All You Need." Co-founded Adept AI, then Essential AI.
Core Papers:
- Vaswani et al., "Attention Is All You Need" (2017) — arXiv:1706.03762
- Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021) — arXiv:2104.09864
- Zhang & Sennrich, "Root Mean Square Layer Normalization" (2019, NeurIPS) — arXiv:1910.07467
- Shazeer, "Fast Transformer Decoding: One Write-Head is All You Need" (2019) — arXiv:1911.02150
- Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (2023, EMNLP) — arXiv:2305.13245
- Shazeer, "GLU Variants Improve Transformer" (2020) — arXiv:2002.05202
- Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022) — arXiv:2205.14135
- Dao, "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (2023) — arXiv:2307.08691
- Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla, 2022) — arXiv:2203.15556
Architecture Papers:
- Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023) — arXiv:2312.00752
- Dao & Gu, "Transformers are SSMs" (Mamba-2, 2024) — arXiv:2405.21060
- Peng et al., "RWKV: Reinventing RNNs for the Transformer Era" (2023, EMNLP) — arXiv:2305.13048
- Lieber et al., "Jamba: A Hybrid Transformer-Mamba Language Model" (2024) — arXiv:2403.19887
- DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (2024) — arXiv:2405.04434
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024) — arXiv:2412.19437
- DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025) — arXiv:2501.12948
- Jiang et al., "Mixtral of Experts" (2024) — arXiv:2401.04088
Educational Resources:
- Jay Alammar — "The Illustrated Transformer" (jalammar.github.io)
- Lilian Weng — "The Transformer Family Version 2.0" (lilianweng.github.io)
- Sebastian Raschka — "Understanding Large Language Models" (magazine.sebastianraschka.com)
Q/K/V Attention
Same formula since 2017. Every model computes softmax(QKᵀ/√d)V. The variants optimize how heads are organized and how the computation runs on hardware — the math hasn't changed.
RoPE
Rotation encodes relative position. Applied to Q and K at every layer (not just input). Enables context extension beyond training length. Universal from 2023+.
RMSNorm
Cheaper normalization, pre-norm placement. Drops mean-centering, keeps only RMS scaling. Pre-norm gives gradients a clean residual path. 7-64% faster, zero quality loss.
GQA / MLA
Share or compress KV heads. GQA: 8 shared KV heads instead of 64 (8x savings). MLA: compress to tiny latent vector (28x savings). Essential for long context windows.
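The savings are simple arithmetic. A sketch using assumed Llama-2-70B-like shapes (80 layers, 128-dim heads, fp16) for illustration:

```python
def kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV cache cost per token: a K and a V vector per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

mha = kv_bytes_per_token(n_kv_heads=64)   # one KV head per query head
gqa = kv_bytes_per_token(n_kv_heads=8)    # 8 shared KV heads (GQA)
print(mha // gqa)                          # 8x smaller cache
print(gqa * 100_000 / 2**30)               # GB of cache for a 100K context
```

Even after the 8x reduction, a 100K context still costs tens of gigabytes of cache at these shapes — which is exactly the pressure that pushed DeepSeek to compress further with MLA.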
SwiGLU
Gated activation function. Two parallel paths multiplied together — the network learns which information to pass. Replaced ReLU's dead neuron problem with learned control.
Flash Attention
Same math, GPU-optimized tiling. Processes attention in SRAM tiles instead of materializing N×N in HBM. O(N²) → O(N) memory. Enabled 100K+ context windows.
Mixture of Experts
Many experts, router selects few. DeepSeek-V3: 671B total, 37B active. Router picks 8 of 256 experts per token. Massive capacity at fraction of compute cost.
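The routing step itself is tiny — score the experts, keep the top k, normalize their weights. A minimal sketch (expert count and k taken from DeepSeek-V3's figures; its actual scoring and load-balancing details differ):

```python
import numpy as np

def route(token_scores, k=8):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = np.argpartition(token_scores, -k)[-k:]
    w = np.exp(token_scores[top] - token_scores[top].max())
    return top, w / w.sum()

scores = np.random.default_rng(2).normal(size=256)   # router logits, 256 experts
experts, weights = route(scores)
assert len(experts) == 8 and np.isclose(weights.sum(), 1.0)
```

The token's output is then the weighted sum of just those 8 expert FFNs — the other 248 never run, which is where the 671B-total / 37B-active gap comes from.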
Hybrid Future
Attention backbone + SSM efficiency + MoE capacity. Jamba: 1 attention layer per 7 Mamba layers. Transformers aren't being replaced — they're being augmented.