LLM Architecture

What Changed? How Modern LLMs Evolved Beyond the Original Transformer

You read "Attention Is All You Need" from 2017. Then you see a 2025 model card: RoPE, RMSNorm, GQA, SwiGLU, MoE, Flash Attention. Same building? Or a completely different one?

Bahgat Ahmed
March 2026
What's Inside — 30 min read
  1. The Foundation That Never Changed — what's still the same
  2. The Component Upgrades — RoPE, RMSNorm, GQA, SwiGLU, Flash Attention
  3. Mixture of Experts — the biggest structural shift
  4. Beyond Transformers — Mamba, hybrids & where it's going
Plus: Practice Mode · Cheat Sheet

بِسْمِ اللَّهِ الرَّحْمَـٰنِ الرَّحِيمِ

In the name of Allah, the Most Gracious, the Most Merciful

The ancient Greeks had a thought experiment: if you replace every plank of a ship, one at a time, is it still the same ship? They called it the Ship of Theseus.

The original transformer from "Attention Is All You Need" (2017) has had the same thing happen to it. The positional encoding? Replaced. The normalization? Replaced. The attention heads? Reorganized. The activation function? Swapped. The feed-forward network? Split into dozens of parallel experts with a router.

Q/K/V attention is still there. Residual connections are still there. But almost everything else has been upgraded. At what point is it a different architecture?

Answer: it isn't. Every upgrade is defined by what it replaced. Understanding the original is exactly how you understand the modern version.

Quick Summary
  • Part 1: What stayed the same from the original transformer (and why that matters)
  • Part 2: Six component upgrades: RoPE, RMSNorm, GQA/MLA, SwiGLU, Flash Attention, Sliding Window
  • Part 3: Mixture of Experts — how you get 671B parameters at the cost of 37B
  • Part 4: Beyond transformers — Mamba, hybrids, and where architecture is heading
This post is for you if...
  • You understand the basics of the transformer (Q/K/V attention, encoder/decoder) and want to know what changed since 2017
  • You see model cards with "RoPE, GQA, SwiGLU, MoE" and want to know what each actually means
  • You read How LLMs Actually Work and want the natural sequel
Part 1
The Foundation That Never Changed

Before we look at what changed, let's establish what didn't. This matters more than it sounds.

If you learned about the transformer — from the original paper, a course, or the previous post — you might worry that your knowledge is outdated. You read "Attention Is All You Need" from 2017, and now it's 2026. Surely the architecture has changed beyond recognition?

It hasn't. The core of what you learned is still running inside GPT-4, Claude, Llama 4, DeepSeek-R1, and every other major LLM today. Here's specifically what stayed the same — and why it matters that you know this before looking at the upgrades.

The Unchanged Core

Q/K/V Dot-Product Attention — the exact same formula from the 2017 paper. Every token creates a Query ("what am I looking for?"), a Key ("what do I contain?"), and a Value ("what meaning do I carry?"). The dot product of Q and K produces attention scores. Softmax normalizes them. The result weights the Values.

This is identical in every modern model. If you open the DeepSeek-V3 technical report (December 2024, 671 billion parameters), the attention computation is softmax(QKᵀ/√d)·V — the same formula Vaswani et al. wrote in 2017. The variants we'll cover in Part 2 change how the heads are organized and how the computation is scheduled on hardware — but the underlying math hasn't changed in 9 years.

Residual Connections — every transformer block still adds the output of each sub-layer back to its input. This skip connection was in the original paper and has not been removed in any major model. In fact, it became more important as models grew deeper. At 80-120+ layers, gradients would vanish without residual connections. They're the reason modern models can be so deep.

The Transformer Block Pattern — the repeating unit: Attention sub-layer, then Feed-Forward sub-layer, stacked N times. Llama 4 Scout stacks this 48 times. DeepSeek-V3 stacks it 61 times. The skeleton is the same — what goes inside each sub-layer is what changed.

Softmax and Learned Embeddings — unchanged. Every model converts tokens to dense vectors through a learned embedding table. Every attention layer normalizes scores through softmax. These haven't been touched.

Decoder-Only Simplification — the original transformer had an encoder and a decoder. Since GPT (2018), modern LLMs almost universally use decoder-only. This was the single biggest structural change from the original paper — but the components inside each block stayed the same.

Why this matters: every upgrade in this post is defined by what it replaced from this foundation. If you understand the original components, you already understand the "before" picture for every upgrade. That's why the original transformer isn't outdated knowledge — it's the reference frame for everything that came after.

The Unchanged Core vs. What Got Upgraded
Still the Same
Q/K/V Dot-Product Attention
Residual Connections
Attention + FFN Block Pattern
Softmax & Learned Embeddings
Decoder-Only Architecture
Got Upgraded
Position encoding: Sinusoidal → RoPE
Normalization: LayerNorm → RMSNorm
Attention heads: MHA → GQA/MLA
Activation: ReLU → SwiGLU
FFN structure: Single FFN → MoE
Everything in green is identical in GPT-4, Claude, Llama 4, DeepSeek-R1, and Qwen3. Everything in amber is what we'll cover in Parts 2 and 3.
Check Your Understanding
Someone tells you "transformers are obsolete — everything uses Mamba now." True or false?
Answer: False. Every major production LLM — GPT-4, Claude, Llama 4, DeepSeek-R1, Qwen3, Gemini — still uses transformer blocks with Q/K/V attention. Mamba and SSMs show promise, especially in hybrids, but they haven't replaced transformers. We'll cover what's actually happening in Part 4.

Now that we know the foundation is solid, let's look at what got upgraded — and why.

Part 2
The Component Upgrades

Each upgrade that follows is a targeted improvement to one specific component of the transformer. The pattern is always the same: there was an original component from 2017, it had a specific limitation, and someone found a better replacement. Understanding the original tells you exactly what changed and why.

Six upgrades. Each one defined by what it replaced. Six planks of the ship, swapped for something better.

Positional Encoding: Sinusoidal → RoPE

The original transformer stamped a fixed "seat number" on each token using sinusoidal functions — sine and cosine waves of different frequencies, added to the token's embedding at the very first layer. Token 1 gets one pattern, token 2 gets a slightly different pattern, and so on.

This had three problems:

  1. Hard ceiling at training length. If you trained on 512 tokens, token 513 had no meaningful encoding. The model simply hadn't seen a position that far.
  2. Position info degrades through layers. The positional signal was added once at the input. By layer 32, it had been diluted through dozens of matrix multiplications.
  3. Absolute positions only. The model knew "this token is at position 7" but had no natural way to encode "these two tokens are 5 apart" — which is what actually matters for understanding relationships.
Sinusoidal vs. RoPE Position Encoding

Original: Sinusoidal (2017)
  • Fixed sin/cos added at the input only
  • Encodes absolute position (seat #7)
  • Signal degrades through layers
  • Hard ceiling at training length — train on 512 tokens? Token 513 = lost

Modern: RoPE (2021+)
  • Rotates Q and K at every layer
  • Encodes relative distance (5 apart)
  • Position info refreshed every layer
  • Extends beyond training length — Llama 4 Scout: train 256K → extend to 10M
RoPE (Rotary Position Embedding, Su et al. 2021) replaced sinusoidal encoding in essentially every model from 2023 onward — Llama, Mistral, DeepSeek, Qwen, Gemma, Falcon, and PaLM 2. It's also compatible with KV caching — no need to store positional info separately.

How RoPE works conceptually: instead of adding position info to embeddings at the input, RoPE rotates the Q and K vectors by an angle proportional to the token's position. Each pair of dimensions gets rotated by a different frequency — like hands on a clock spinning at different speeds.

When you compute the dot product Q·K, the result naturally encodes the relative distance between two tokens. Rotating by angle A, then dotting with something rotated by angle B, gives you terms that depend on cos(A−B) — the difference between the two angles. The model doesn't need to know "this is position 7 and that is position 12." It directly sees "these are 5 apart."

How rotation encodes relative distance (the clock analogy)

Imagine each pair of dimensions in Q and K as a point on a clock face. Token at position 0 has the hands pointing at 12 o'clock. Token at position 1 is rotated by a small angle θ. Token at position 100 is rotated by 100θ.

When you compute the dot product of two rotated vectors, the result depends only on the difference in their angles — not the absolute angles themselves. Token at position 7 dotted with token at position 12 gives the same result as token at position 107 dotted with token at position 112, because both pairs are 5 apart.

Different dimension pairs use different rotation speeds (frequencies). Low-frequency pairs capture long-range relationships ("these sentences are far apart"). High-frequency pairs capture local patterns ("these words are adjacent").

The math: for position m and dimension pair i, RoPE applies a 2D rotation by angle mθᵢ, where θᵢ = 10000^(−2i/d). The dot product Qₘ·Kₙ then naturally contains cos((m−n)θᵢ) terms — pure relative position.
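To make the clock analogy concrete, here is a minimal NumPy sketch of the rotation. It uses the half-split dimension pairing; real implementations differ in how dimensions are paired, and `rope_rotate` is an illustrative name, not a library function. The final check demonstrates the key property: the dot product depends only on relative distance.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate each 2D pair of dims in x by pos * theta_i (minimal RoPE sketch)."""
    d = x.shape[-1]
    half = d // 2
    i = np.arange(half)
    theta = base ** (-2.0 * i / d)            # per-pair rotation frequencies
    angles = pos * theta                      # position m rotates by m * theta_i
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]     # each (x1[j], x2[j]) is one "clock"
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(64)
k = np.random.randn(64)
# Positions (7, 12) and (107, 112) are both 5 apart, so the scores match:
a = rope_rotate(q, 7) @ rope_rotate(k, 12)
b = rope_rotate(q, 107) @ rope_rotate(k, 112)
assert np.allclose(a, b)
```

The assertion passing is exactly the "5 apart" property from the text: absolute positions shift, the attention score does not.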

Context extension: how Llama 4 goes to 10M tokens

Because RoPE uses rotation, you can extend the context window beyond training length by scaling the rotation frequencies. Key techniques:

Position Interpolation (Meta, 2023): Scale all RoPE frequencies down so that positions 0-128K map to the same angle range as 0-8K during training. Simple but effective.

YaRN (2023): "NTK-aware" scaling. Instead of uniform scaling, it adjusts high-frequency and low-frequency rotation speeds differently — preserving local position resolution while extending the range.

Llama 4 Scout (Meta, 2025): Trained at 256K tokens, then fine-tuned with RoPE extensions to handle 10 million tokens. This is possible because RoPE encodes relative position — the model can generalize to longer sequences even if it hasn't seen them during initial training.

RoPE fixed how the model knows where tokens are. The next upgrade fixes something more subtle — how the model keeps its activations stable as signals pass through dozens of layers.

Normalization: LayerNorm → RMSNorm

The original transformer used LayerNorm (2016), which normalizes activations by computing two statistics: the mean and the variance. It subtracts the mean (re-centering), divides by the standard deviation (re-scaling), then applies two learned parameters (scale and shift).

RMSNorm (Zhang & Sennrich, 2019) drops the mean-centering step entirely. It only computes one statistic — the Root Mean Square — and applies one learned parameter (scale only). That's it.

Why removing the mean doesn't matter: Think of normalization as having two jobs — (1) centering the data around zero, and (2) controlling the scale so values don't explode or vanish. It turns out that job #2 is the one that actually matters for training stability. When activations blow up to huge numbers or shrink to near-zero, gradients break. Controlling the scale prevents that. But centering? The very next linear layer in the network can learn its own bias term, which effectively re-centers the data however it wants. So LayerNorm was doing redundant work — computing a mean, subtracting it, then the next layer immediately learns to shift things again anyway.

The analogy: imagine calibrating a scale before weighing something. LayerNorm both zeros out the scale (re-centering) AND adjusts the units (re-scaling). RMSNorm only adjusts the units — because the next person in line will zero it out however they need to. Half the work, same result.

Practical impact: 7-64% faster normalization depending on hardware, with zero quality loss. At scale, this adds up. When you're training on thousands of GPUs for weeks, shaving even 10% off a per-layer operation that runs billions of times translates to real money and time saved.

Pre-Norm vs. Post-Norm: Where Normalization Goes

Original: Post-Norm (2017)
  Input → Sublayer (Attention/FFN) → Add (residual) → LayerNorm
  Gradients must flow through the norm

Modern: Pre-RMSNorm (2023+)
  Input → RMSNorm → Sublayer (Attention/FFN) → Add (residual)
  Clean residual path for gradients
Pre-norm (normalize BEFORE the sublayer) gives gradients a clean residual path — critical when stacking 80-120+ layers. Llama 1 was the first major model to combine pre-norm placement with RMSNorm, and every subsequent open model followed.

The switch happened in two stages. GPT-2 (2019) moved from post-norm to pre-norm (keeping LayerNorm). Llama 1 (2023) switched to pre-RMSNorm. Since then, every major open-weight model — Mistral, DeepSeek, Qwen, Gemma — uses pre-RMSNorm.

The math: LayerNorm vs RMSNorm

LayerNorm:

1. Compute mean: μ = mean(x)
2. Compute variance: σ² = var(x)
3. Normalize: (x - μ) / √(σ² + ε)
4. Scale and shift: γ × normalized + β

Two statistics (mean, variance). Two learned parameters (γ, β).

RMSNorm:

1. Compute RMS: rms = √(mean(x²))
2. Normalize: x / rms
3. Scale only: γ × normalized

One statistic (RMS). One learned parameter (γ). Roughly half the operations, no quality loss.
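The two recipes above translate almost line-for-line into code. A minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    mu = x.mean(-1, keepdims=True)            # statistic 1: mean
    var = x.var(-1, keepdims=True)            # statistic 2: variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta  # scale AND shift

def rms_norm(x, gamma, eps=1e-6):
    rms = np.sqrt((x * x).mean(-1, keepdims=True) + eps)  # one statistic
    return gamma * x / rms                    # scale only, no re-centering

x = np.random.randn(4, 512)
out = rms_norm(x, np.ones(512))
# Each row of the output has RMS ≈ 1 — the scale is controlled, job #2 done
assert np.allclose(np.sqrt((out * out).mean(-1)), 1.0, atol=1e-3)
```

Counting operations makes the "half the work" claim visible: `rms_norm` skips the mean, the subtraction, and the shift parameter β entirely.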

So far we've upgraded where (RoPE) and how stable (RMSNorm). The next three upgrades tackle the biggest practical bottleneck: memory and compute cost during inference. They're all different angles on the same problem — making long sequences affordable.

Attention Heads: MHA → GQA → MLA

This is the evolution of how Q, K, V heads are organized. Remember: the attention math itself (Q·Kᵀ/√d → softmax → ×V) is identical. What changes is how many separate K and V projections exist.

The problem this solves: during text generation, the model stores all previous K and V vectors in a "KV cache" to avoid recomputing them for every new token. For long sequences, this cache dominates GPU memory.

How bad can it get? A simplified illustration at Llama 2 70B scale, counting a single layer: KV cache per token = 2 (K+V) × 128 heads × 128 dim × 2 bytes (FP16) ≈ 65 KB per token. At 100K tokens: 100,000 × 65 KB = 6.5 GB just for KV cache. At 1M tokens, that's 65 GB — more than an entire A100 GPU. (A real model multiplies this again by its layer count.)
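The arithmetic above, as a few lines you can check yourself (same simplified single-layer assumptions as in the text):

```python
# Back-of-envelope KV cache size, using the numbers from the text:
# 128 heads, head dim 128, FP16 (2 bytes), single layer.
heads, head_dim, bytes_fp16 = 128, 128, 2

per_token = 2 * heads * head_dim * bytes_fp16   # K and V for one token
assert per_token == 65_536                      # ≈ 65 KB per token

total_100k = per_token * 100_000                # 100K-token context
assert total_100k == 6_553_600_000              # ≈ 6.5 GB of cache
```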

The KV Cache Memory Problem — The Textbook Analogy
  • MHA: 64 students, each with their own textbook — 64 KV heads = 6.5 GB
  • GQA: 64 students, 8 shared textbooks — 8 KV heads = 0.8 GB
  • MQA: 64 students, 1 shared textbook — 1 KV head = 0.1 GB
  • MLA: compressed summary cards that expand when needed — compressed latent = 0.23 GB (28x reduction)
Approximate KV cache for Llama 2 70B at 100K tokens

The evolution happened in stages:

Multi-Query Attention (MQA, Shazeer 2019) — all query heads share a single K and V projection. Extreme memory reduction (from N KV heads to 1), with a small quality trade-off (~2%) and 1.8-2.4x faster decoding. Used by PaLM and Falcon. Invented by the same Noam Shazeer who co-authored "Attention Is All You Need."

Grouped Query Attention (GQA, Ainslie et al. 2023) — the middle ground. G groups of KV heads, where each group of query heads shares one KV pair. Near-MHA quality with near-MQA speed. Specific configurations:

  • Llama 2 70B: 64 query heads, 8 KV heads
  • Llama 3 70B: 64 query heads, 8 KV heads
  • Mistral 7B: 32 query heads, 8 KV heads
  • Qwen3-235B: 64 query heads, 4 KV heads
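At inference, each group of query heads reads the same cached KV head. A minimal sketch of that expansion (the `expand_kv` helper is illustrative; real implementations usually broadcast rather than copy):

```python
import numpy as np

def expand_kv(kv, n_query_heads):
    """GQA sketch: each group of query heads shares one cached KV head.
    kv: (n_kv_heads, seq, head_dim) -> (n_query_heads, seq, head_dim)."""
    n_kv_heads = kv.shape[0]
    group = n_query_heads // n_kv_heads       # query heads per KV head
    return np.repeat(kv, group, axis=0)

k = np.random.randn(8, 16, 128)               # cache holds only 8 KV heads
k_expanded = expand_kv(k, 64)                 # shared across 64 query heads
assert k_expanded.shape == (64, 16, 128)
# Query heads 0-7 all read the same cached head — an 8x cache reduction vs MHA
assert np.allclose(k_expanded[0], k_expanded[7])
```

The Llama 2/3 70B configuration from the list above (64 Q heads, 8 KV heads) is exactly this 8-to-1 grouping.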

Multi-Head Latent Attention (MLA, DeepSeek-V2 2024) — instead of caching full-dimensional K and V, MLA compresses them into a much smaller "latent" vector using a down-projection. At inference, it projects back up. DeepSeek-V3 achieves 28x KV cache compression — from 213.5 GB down to 7.6 GB.

How MLA compression works

In standard MHA, each token at each layer caches a full K vector and V vector across all heads. For DeepSeek-V3 with 128 heads and 128 dim per head, that's 128 × 128 = 16,384 values for K alone, times 2 for K+V = ~32K values per token per layer.

MLA adds a down-projection that compresses K and V into a joint latent vector of just 512 dimensions. Only this 512-dim latent is stored in the KV cache. When the model needs to compute attention, it applies an up-projection to expand the latent back to full K and V dimensions.

The compression: ~32K cached values shrink to a 512-dim latent per token per layer. (The headline ~28x figure is the end-to-end cache reduction reported for DeepSeek-V3, which accounts for the full cache layout, including a small decoupled positional key that MLA also stores.) This is how DeepSeek-V3 handles long contexts on practical hardware.

The quality cost? Negligible — because the down-projection is learned during training, the model learns what information to preserve and what to discard.
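A toy sketch of the down-/up-projection idea, with DeepSeek-V3-like sizes and random (untrained) weights. Real MLA has more structure — per-head layouts and a decoupled positional key — so treat this purely as the shape of the trick:

```python
import numpy as np

d_model, n_heads, head_dim, d_latent = 7168, 128, 128, 512

# Learned projections (random here; trained jointly in the real model)
W_down = np.random.randn(d_model, d_latent) * 0.02             # compress
W_up_k = np.random.randn(d_latent, n_heads * head_dim) * 0.02  # expand to K
W_up_v = np.random.randn(d_latent, n_heads * head_dim) * 0.02  # expand to V

h = np.random.randn(d_model)        # one token's hidden state
latent = h @ W_down                 # ONLY this 512-dim vector is cached
k = latent @ W_up_k                 # full K reconstructed at attention time
v = latent @ W_up_v
assert latent.shape == (512,)
assert k.shape == (128 * 128,)      # 16,384 values, rebuilt from 512
```

Because `W_down` is learned, the model itself decides which information survives the squeeze — which is why the quality cost is negligible.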

Check Your Understanding
Your model needs to handle 1M token context on a single GPU. Which attention variant matters most?
Answer: MLA. At 1M tokens, KV cache memory is the bottleneck. MHA would need hundreds of GB, and even GQA's 8x reduction leaves a massive cache. MLA's 28x compression — DeepSeek-V3 goes from 213.5 GB to 7.6 GB — makes million-token contexts practical on real hardware. This is why DeepSeek-V2 invented it.

GQA and MLA optimized the attention side of the transformer block. Now let's look at the other half — the feed-forward network (FFN). Two things changed inside it: the activation function, and (in Part 3) the entire structure.

Activation Function: ReLU → SwiGLU

The original transformer used ReLU inside the feed-forward network: if the input is positive, pass it through; if negative, output zero. Simple — but it has a fatal flaw called the "dead neuron" problem. Once a neuron outputs 0, its gradient is 0, and it never recovers. It's permanently dead.

ReLU's deeper problem is that it's amplitude-based — it only looks at whether a number is positive or negative. It can't look at the content (what the number represents in context) and make an intelligent decision about what to pass through.

GELU (used by GPT-2, GPT-3, BERT) softened this — instead of a sharp cutoff at zero, it uses a smooth curve that gives small negative values a tiny chance of passing through. Better, but still a single path making a single decision.

SwiGLU (Shazeer, 2020) fundamentally changed the structure. Instead of one path through one activation function, SwiGLU splits the FFN into two parallel paths:

  1. One path goes through Swish activation (a smooth, non-monotonic function) — this is the "value" path, computing what the output could be
  2. The other path stays linear — this is the "gate" path, computing how much of that output to allow through
  3. The two paths are multiplied together element-wise — the gate controls the value

The analogy: think of a recording studio. ReLU is a simple noise gate — any signal below a threshold gets cut to silence, regardless of what it is. SwiGLU is an experienced sound engineer with two hands on the mixing board: one hand holds the audio signal (the Swish path), the other hand controls the volume fader (the gate path). The engineer listens to the content and decides how much to let through — quiet passages get boosted, noise gets suppressed, and the decision is based on what the signal means, not just how loud it is.

This is the key insight: the gate path sees the same input but through different learned weights. It learns a content-dependent filter — "this pattern of activations represents something important, let it through" vs "this pattern is noise, suppress it." ReLU can't do this because it has no second path to learn the gating.

Activation Functions: From On/Off Switch to Learned Gate
  • ReLU (2017): f(x) = max(0, x) — dead neurons: once 0, always 0
  • GELU (2018): f(x) = x × Φ(x) — smooth curve, no sharp cutoff
  • SwiGLU (2020+): Swish(xW₁) ⊙ xW₂ — two paths × gating = learned control
SwiGLU is used by essentially every model from 2023+ — Llama, DeepSeek, Mistral, Qwen, Gemma, PaLM 2, and Hunyuan (Tencent). Same Noam Shazeer who co-authored the original transformer.
The gating mechanism: two parallel paths multiplied together

In a standard FFN, the input goes through one linear layer, then an activation function, then another linear layer: output = W2 * ReLU(W1 * x)

In SwiGLU, the first linear layer is split into two:

output = W2 * (Swish(W1 * x) ⊙ (W3 * x))

Where ⊙ means element-wise multiplication. The W3 * x path is the "gate" — it controls how much of the Swish(W1 * x) path gets through. This is why SwiGLU requires 3 weight matrices instead of 2, adding ~50% more parameters to the FFN layer. Models compensate by slightly reducing the FFN hidden dimension.

The result: measurably better downstream performance across all benchmarks, consistently. The extra parameters pay for themselves.
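Here is the two-path structure as a minimal NumPy sketch. Names and the exact `d_ff` are illustrative; models typically pick d_ff ≈ (2/3)·4d to offset the third matrix:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))     # Swish / SiLU: x * sigmoid(x)

def swiglu_ffn(x, W1, W2, W3):
    """SwiGLU FFN: the gate path (W3) modulates the value path (Swish(W1))."""
    value = swish(x @ W1)             # "what the output could be"
    gate = x @ W3                     # "how much of it to let through"
    return (value * gate) @ W2        # element-wise gating, then project back

d, d_ff = 512, 1376                   # hidden dim reduced to pay for W3
x = np.random.randn(4, d)
W1, W3 = np.random.randn(d, d_ff), np.random.randn(d, d_ff)
W2 = np.random.randn(d_ff, d)
out = swiglu_ffn(x, W1, W2, W3)
assert out.shape == (4, d)
```

Note that `gate` sees the same input `x` as `value` but through different learned weights — that second path is the content-dependent filter ReLU cannot express.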

SwiGLU improved the quality of the FFN. The next upgrade doesn't change any math at all — instead, it rewrites how the existing math runs on GPU hardware, and that single change made everything from 100K to 10M token context windows possible.

Flash Attention: Same Math, Different Hardware Strategy

This one is different from all the others. Flash Attention changes zero math. The output is bit-for-bit identical to standard attention. What it changes is entirely about how the computation is organized on GPU hardware — and that change enabled everything from 100K to 10M token context windows.

The problem: standard attention materializes the full N × N attention matrix in GPU memory. For 100K tokens, that's 100,000 × 100,000 = 10 billion entries. This matrix is computed, used once for the softmax-weighted sum, then thrown away. The waste is staggering.

GPU Memory Hierarchy: Why Flash Attention Works
  • SRAM (on-chip) — where Flash Attention works: ~20 MB at ~19 TB/s. Tiny but blazing fast.
  • HBM (off-chip) — where standard attention writes the N×N matrix: 80 GB (A100) at ~2 TB/s. Large but ~10x slower.
The Flash Attention Insight

Standard attention writes the full N×N matrix to slow HBM. Flash Attention processes attention in small tiles that fit entirely in fast SRAM — loads a tile of Q, K, V, computes partial attention, accumulates the result, and writes only the final output back to HBM. The N×N matrix is never materialized.

Created by Tri Dao (Stanford PhD, co-founder/Chief Scientist at Together AI). Flash Attention is now integrated into PyTorch 2.0+ via torch.nn.functional.scaled_dot_product_attention and used by every serious LLM training pipeline.
Flash Attention: Tile-by-Tile Processing
  1. Load small tiles of Q, K, V from HBM into fast SRAM
  2. Compute partial attention for this tile entirely in SRAM (QKᵀ, softmax, ×V)
  3. Accumulate into a running output using online softmax (track running max + running sum)
  4. Repeat for all tiles; write only the final output back to HBM
The N×N attention matrix is never materialized. The result is bit-for-bit identical to standard attention.
Standard attention writes the entire N×N matrix to slow HBM. Flash Attention keeps everything in fast SRAM by processing in tiles and accumulating incrementally.

The analogy: standard attention is like printing two huge spreadsheets on paper, multiplying them, then throwing the paper away. Flash Attention works with small sticky notes on your fast desk (SRAM), computing sections and keeping a running total, without ever printing the full intermediate result.

The impact:

  • Memory: O(N²) → O(N) — this single change enabled 100K+ context windows
  • Speed: 2-4x faster for long sequences, 3x speedup on GPT-2 training end-to-end
  • FlashAttention-2 (2023): 2x faster than v1, up to 72% GPU utilization and up to 225 TFLOP/s on A100
  • FlashAttention-3 (2024): 75% utilization on H100, FP8 support, up to 740 TFLOP/s
Online softmax: computing softmax without seeing all values first

The trickiest part of Flash Attention is this: softmax needs to see all attention scores to compute the normalization factor (the sum of exponentials). But if you're processing in tiles, you only see a subset at a time.

Online softmax (Milakov & Gimelshein, 2018) solves this by maintaining a running maximum and running sum of exponentials. When you process a new tile:

1. Compute the new local maximum
2. Rescale the previous running sum to account for the new maximum
3. Add the new tile's exponentials to the running sum
4. The final softmax is exact — not approximate

This is why Flash Attention produces bit-for-bit identical results to standard attention. It's not an approximation — it's the same computation, reorganized for hardware efficiency.
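A minimal sketch of the running-max/running-sum update. For demonstration it takes one final pass over all scores at the end; real Flash Attention instead rescales the accumulated output tile by tile and never revisits old scores:

```python
import numpy as np

def online_softmax_weights(scores, tile=4):
    """Softmax over `scores`, seeing only one tile at a time."""
    m, s = -np.inf, 0.0                       # running max, running exp-sum
    for t in range(0, len(scores), tile):
        chunk = scores[t:t + tile]
        m_new = max(m, chunk.max())           # step 1: new local maximum
        s = s * np.exp(m - m_new)             # step 2: rescale old sum
        s += np.exp(chunk - m_new).sum()      # step 3: add new exponentials
        m = m_new
    return np.exp(scores - m) / s             # step 4: exact, not approximate

scores = np.random.randn(10)
full = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()
assert np.allclose(online_softmax_weights(scores), full)
```

The assertion is the whole point: tile-by-tile accumulation reproduces the full softmax exactly, which is why Flash Attention is bit-for-bit identical to standard attention.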

Flash Attention made each attention operation faster and lighter in memory. Sliding Window goes further — it reduces how many tokens each layer needs to attend to in the first place. Where Flash Attention optimizes the hardware, Sliding Window optimizes the algorithm. They're complementary, and many models use both together.

Sliding Window Attention

Instead of each token attending to all previous tokens (full attention), sliding window attention restricts each layer to attending only to the previous W tokens. Mistral 7B (September 2023) used a window of W = 4,096.

The analogy: imagine you're in a long hallway with 100,000 people standing in a line. Full attention means every person can whisper to every other person directly — that's 100,000 × 100,000 possible conversations. Sliding window means each person can only whisper to the 4,096 people closest to them. Far cheaper, and you'd be surprised how far messages still travel.

Why? Because the model is stacked. The clever insight is that stacking layers extends the effective reach. In layer 1, each token sees W tokens back. But in layer 2, each token — which now contains information from W tokens — sees W more tokens back. It's like a game of telephone where each relay point has a 4,096-person reach. With 32 layers and W=4,096, the theoretical reach is 32 × 4,096 = 131,072 tokens — the message has traveled across the entire sequence, just indirectly.
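The attention pattern itself is just a banded causal mask — a minimal sketch (`sliding_window_mask` is an illustrative name):

```python
import numpy as np

def sliding_window_mask(n, w):
    """Causal mask where token i attends only to tokens [i-w+1 .. i]."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)   # causal AND within the window

mask = sliding_window_mask(8, 3)    # tiny example: 8 tokens, window of 3
assert mask[5].sum() == 3           # token 5 sees tokens 3, 4, 5
assert not mask[5, 1]               # ...but not token 1 directly
# Token 1's information still reaches token 5 — indirectly, via deeper layers
```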

Sliding Window: Local Attention, Global Reach
  • Layer 1: each token sees W = 4,096 tokens back — reach: 4K
  • Layer 2: each token (already carrying info from 4K back) sees 4K more — reach: 8K
  • Layer 32: effective receptive field = 32 × 4,096 — reach: 131K
KV cache bonus: only W entries are ever stored, using a rotating buffer — fixed memory regardless of sequence length. Halved cache memory for 8K sequences; 2x improvement for 16K sequences with W=4K. Integrates with Flash Attention for hardware efficiency.
Used by Mistral 7B and Mixtral. Gemma uses a hybrid approach: alternating sliding window and full attention layers for the best of both worlds.
When to use full attention vs. sliding window

Full attention when: you need precise retrieval from anywhere in the context ("find the name mentioned 50,000 tokens ago"). Every token can directly see every other token.

Sliding window when: most important context is local (adjacent sentences, nearby code). Huge throughput and memory gains for sequences where long-range precision isn't critical.

Hybrid (best of both): Gemma's approach — alternate sliding window layers (for local patterns) with full attention layers (for long-range retrieval). Most tokens get fast local attention; critical long-range connections use the full attention layers.

Trade-off: Pure sliding window can miss dependencies that span more than layers × W tokens. The hybrid approach costs more compute than pure sliding window, but less than full attention everywhere.

The Upgrade Summary

Six upgrades. Each one a targeted replacement of a specific 2017 component. Here they are side by side:

Component | Original (2017) | Modern (2023+) | Key Benefit
Position | Sinusoidal | RoPE | Relative position, extendable context
Normalization | LayerNorm (post) | RMSNorm (pre) | 7-64% cheaper, better gradient flow
KV heads | MHA (N heads) | GQA / MLA | 8-28x KV cache reduction
Activation | ReLU | SwiGLU | Gated control, no dead neurons
Attention compute | Standard matmul | Flash Attention | O(N²) → O(N) memory, 2-4x speed
Attention scope | Full (all tokens) | Sliding window / hybrid | Fixed KV cache, linear compute
Context window | 512 tokens | 128K – 10M tokens | RoPE scaling (YaRN, NTK)
Training data | ~300B tokens | 15T+ tokens | Chinchilla-optimal and beyond
Architecture | Encoder-Decoder | Decoder-only | Simpler, scales better for generation

Six planks replaced. The ship still sails the same way — attention + FFN + residual, stacked N times. But every plank is better than the original. Now let's look at the biggest structural change of all — one that doesn't replace a plank, but adds an entire new deck to the ship.

Part 3
Mixture of Experts — The Biggest Structural Shift

What if you could build a model with the knowledge of 671 billion parameters, but only pay the compute cost of 37 billion?

That's not a hypothetical. That's DeepSeek-V3, and it's the core promise of Mixture of Experts (MoE). The concept isn't new — Jacobs et al. proposed "Adaptive Mixtures of Local Experts" back in 1991. What's new is applying it at transformer scale.

The Hospital Analogy

In a dense model (like Llama 3 70B), every token passes through the same feed-forward network. All 70 billion parameters activate for every single token. It's like a hospital where every patient sees the same general doctor — whether they have a broken arm, a heart condition, or a math homework question.

In an MoE model, the single FFN is replaced with many parallel "expert" FFNs and a learned router (gating network) that selects which experts each token goes to. It's like a hospital with 256 specialist doctors and a triage nurse who sends each patient to the right 8 specialists. The other 248 sit idle for that patient.

Dense vs. Mixture of Experts Architecture
Dense Model (e.g., Llama 3 70B):
  Token → Single FFN (all 70B params active) → Output

MoE Model (e.g., DeepSeek-V3):
  Token → Router (gating) selects 8 of 256 experts → the chosen experts (e.g., E3, E47, E91) compute while the rest sit idle → weighted sum → Output
  Only 37B of 671B params active

How the Router Works

The router is a small linear layer followed by softmax. It takes the token's hidden state and produces a probability distribution over all experts.

How the Router Selects Experts
  1. Token hidden state [d_model]
  2. Linear layer [d_model × n_experts]
  3. Softmax → probability over 256 experts (e.g., E3: 0.15, E47: 0.12, E91: 0.10, most others near zero)
  4. Select top-K (e.g., Expert 3 w=0.15, Expert 47 w=0.12, Expert 91 w=0.10, ...)
  5. Weighted sum → combined expert output
K=1-2 for Llama 4 (lightweight routing), K=8 for DeepSeek (more expert diversity). The router is just a single linear layer — tiny overhead compared to the experts themselves.
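The router really is this small. A minimal sketch with DeepSeek-like expert counts and random untrained weights — real routers add load-balancing terms, and DeepSeek also keeps a shared expert that always fires:

```python
import numpy as np

def route(h, W_router, k=8):
    """MoE router sketch: softmax over experts, keep top-k, renormalize."""
    logits = h @ W_router                       # one linear layer: (n_experts,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax over ALL experts
    top = np.argsort(probs)[-k:]                # indices of the k best experts
    weights = probs[top] / probs[top].sum()     # gating weights for the sum
    return top, weights

d_model, n_experts = 1024, 256
h = np.random.randn(d_model)                    # one token's hidden state
W_router = np.random.randn(d_model, n_experts) * 0.02
experts, weights = route(h, W_router)
assert len(experts) == 8                        # 8 of 256 experts fire
assert np.isclose(weights.sum(), 1.0)           # weighted sum of their outputs
```

Compare the cost: the router is a d_model × n_experts matrix, while each expert is a full FFN — the routing overhead is a rounding error next to the experts themselves.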

The Models

MoE isn't theoretical — it's how the most capable models in the world are built. Here are the specific architectures:

MoE Models: Total vs. Active Parameters

Mixtral 8x7B (Mistral AI, Dec 2023) — 46.7B total / 12.9B active
  8 experts, top-2 per token. Same base as Mistral 7B. First widely-used open MoE; beat Llama 2 70B with 5.4x fewer active params.

DeepSeek-V3 (DeepSeek, Dec 2024) — 671B total / 37B active
  256 routed + 1 shared expert, top-8. MLA attention. 14.8T tokens, 2.788M H800 GPU hours (~$5.5M). Multi-Token Prediction. Rivals GPT-4o and Claude 3.5 Sonnet.

DeepSeek-R1 (DeepSeek, Jan 2025) — 671B total / 37B active
  Same architecture as V3; 2 RL + 2 SFT training stages. R1-Zero spontaneously developed chain-of-thought, self-verification, and reflection from pure RL — no supervised fine-tuning. Rivals OpenAI o1.

Llama 4 Scout (Meta, April 2025) — 109B total / 17B active
  16 routed + 1 shared expert, top-1. Native multimodal. 40T+ tokens, 200 languages. Train 256K → extend to 10M context.

Llama 4 Maverick (Meta, April 2025) — 400B total / 17B active
  128 routed + 1 shared expert, top-1. Alternating dense + MoE layers. Train 256K → fine-tune to 1M context.

Qwen3-235B (Alibaba, May 2025) — 235B total / 22B active
  128 experts, top-8. 94 layers. 64 Q heads, 4 KV heads (GQA). No shared experts (unlike DeepSeek). 32K native context, 131K with YaRN. Also: Qwen3-30B (30.5B total, 3.3B active).

GPT-4 (OpenAI, Mar 2023, rumored) — ~1.76T total (est.) / ~220B active (est.)
  Rumored 8 experts of ~220B each, trained on ~13T tokens. George Hotz & Soumith Chintala publicly confirmed MoE (June 2023). Unverified details.

Llama 4 Behemoth (Meta, 2025, teacher model) — ~2T total / 288B active
  16 experts. FP8 training on 32K GPUs, 390 TFLOPs/GPU. Used as teacher to distill Scout & Maverick (co-distillation).

Gemini 1.5 / 2.0 / 2.5 (Google, 2024-2025) — parameter counts not disclosed
  Google confirmed MoE: "divided into expert neural networks, selectively activates relevant pathways." Sparse MoE across all Gemini versions.
As of mid-2025, 12 of the top 16 open-weight models are MoE. The dense-only approach is increasingly limited to smaller models (<70B).
Check Your Understanding
DeepSeek-V3 (671B total, 37B active) vs Llama 3 70B (70B dense). Which uses more compute per token?
Correct

Llama 3 70B uses MORE compute per token despite having fewer total parameters. All 70B activate for every token. DeepSeek-V3 only activates 37B of its 671B per token. This is the entire point of MoE — massive knowledge capacity (671B) with practical compute cost (37B).

Not quite

Total parameters ≠ active parameters. Llama 3 70B activates ALL 70B parameters for every token. DeepSeek-V3 only activates 37B of its 671B per token (the router selects 8 of 256 experts). So Llama 3 70B actually uses more compute per token, despite having fewer total parameters. That's the MoE advantage.
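The arithmetic behind that answer is worth doing once. Using the common rule of thumb of roughly 2 FLOPs per active parameter per generated token (an approximation, not an exact cost model):

```python
# Rough rule of thumb: a forward pass costs ~2 FLOPs per ACTIVE parameter per token.
FLOPS_PER_PARAM = 2

llama3_70b_active = 70e9    # dense: all 70B params active for every token
deepseek_v3_active = 37e9   # MoE: only the routed 37B of 671B total are active

llama_flops = FLOPS_PER_PARAM * llama3_70b_active
deepseek_flops = FLOPS_PER_PARAM * deepseek_v3_active

print(f"Llama 3 70B: {llama_flops:.1e} FLOPs/token")
print(f"DeepSeek-V3: {deepseek_flops:.1e} FLOPs/token")
print(f"Dense model costs {llama_flops / deepseek_flops:.2f}x more per token")
```

The 70B dense model does almost twice the per-token work of the 671B MoE, which is the counterintuitive point of the quiz.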

Load Balancing: The Critical Challenge

Without intervention, the router tends to "collapse" — sending all tokens to 1-2 favorite experts while the rest sit idle. This wastes the capacity you paid for.

Three approaches to load balancing

1. Auxiliary loss (Shazeer 2017, GShard 2020): Add a loss term that penalizes uneven expert utilization. Effective but degrades model quality because the auxiliary loss fights against the primary training objective.

2. Bias terms (DeepSeek-V3, 2024): Add a learned bias to the router scores, but only for routing decisions — NOT included in the training loss. The bias steers tokens to underused experts without polluting the learning signal. This is called "auxiliary-loss-free" load balancing.

3. Global-batch balancing (Qwen3, 2025): Compute load balance across the entire training batch rather than per-sequence, allowing more natural token distribution. Each sequence can be "unbalanced" as long as the batch averages out.
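A toy sketch of the bias idea from approach 2 (heavily simplified: the sign-based update rule and the `gamma` step size here are illustrative, not DeepSeek's actual procedure):

```python
import numpy as np

rng = np.random.default_rng(1)
n_experts, top_k, n_tokens = 8, 2, 1000

# Expert 0 gets inflated router scores: router collapse without intervention.
scores = rng.normal(size=(n_tokens, n_experts)) + np.array([2.0, 0, 0, 0, 0, 0, 0, 0])

bias = np.zeros(n_experts)
gamma = 0.1  # bias update step (illustrative value)

for step in range(100):
    # Top-K is chosen from score + bias; the bias never enters the loss,
    # so the learning signal stays clean ("auxiliary-loss-free").
    sel = np.argsort(scores + bias, axis=1)[:, -top_k:]
    load = np.bincount(sel.ravel(), minlength=n_experts) / (n_tokens * top_k)
    # Nudge the bias down for overloaded experts, up for underloaded ones.
    bias -= gamma * np.sign(load - 1.0 / n_experts)

print(np.round(load, 3))  # far more even than the collapsed initial routing
```

After a few dozen updates the routing load hovers near the ideal 1/8 per expert, even though expert 0's raw scores are still inflated.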

Advantages and Disadvantages

MoE: What You Gain vs. What You Pay
Advantages
Compute efficiency — 671B model runs at 37B cost
Knowledge capacity — more total params = more stored knowledge
Expert specialization — different experts for code, math, languages
Training efficiency — higher quality per compute dollar
Scalable to trillions — scale to trillions of params without proportional compute increase
Disadvantages
Memory — ALL 671B params in memory, even though most idle
Load balancing — router collapse wastes capacity
Communication — routing tokens to experts on different GPUs
Fine-tuning — not all experts see all training data
Batch sensitivity — small batches mean uneven expert utilization
Serving complexity — need to shard experts across GPUs carefully
Check Your Understanding
You want to fine-tune a MoE model on your domain. What's the biggest challenge?
Correct

In MoE, only a few experts activate per token. During fine-tuning on your domain data, some experts may barely see any examples — they sit idle while the router keeps sending domain tokens to the same 2-3 "favorite" experts. This can lead to uneven adaptation, where some experts specialize in your domain while others stay generic.

Not quite

The compute cost is actually lower than expected (only active params compute per token), and the architecture doesn't break during fine-tuning. The real challenge is that MoE routing is sparse — not all experts see all examples. Some experts may barely encounter your domain data, leading to uneven adaptation.

Part 4
Beyond Transformers — Hybrid Architectures

Parts 1-3 covered upgrades within the transformer framework — better position encoding, cheaper normalization, smarter attention heads, faster computation, and MoE. The transformer block pattern (attention + FFN + residual) stayed the same throughout.

Part 4 asks a different question: what if the attention mechanism itself — the O(N²) core of the transformer — isn't always the best tool? Not because it's bad, but because there are tasks where a fundamentally different approach is more efficient.

The answer isn't "replace transformers." It's "augment them." The future is hybrid.

Mamba: A Fundamentally Different Way to Process Sequences

Every upgrade so far has worked within the attention framework. Flash Attention made it faster. GQA made it cheaper. Sliding Window made it narrower. But all of them still compute attention — they still ask "how does every token relate to every other token?"

Mamba asks a different question entirely: "what if you don't need to look at all tokens at once?"

Transformer attention is O(N²) in sequence length. Even with Flash Attention (which reduces the memory footprint), the quadratic scaling makes very long sequences expensive. Double the sequence length, and you quadruple the compute. For 1M+ token contexts, this becomes prohibitive.

Previous state space models (S4, by Albert Gu) offered O(N) linear scaling, but with a catch: they used fixed state transitions — the same rules applied regardless of the input. The model couldn't distinguish between important and unimportant tokens. Mamba (Gu & Dao, December 2023) solved this by making the transitions input-dependent: the model learns to selectively "remember" important tokens and "forget" irrelevant ones.

The analogy: transformer attention is a group discussion where everyone talks to everyone simultaneously — powerful for finding connections, but the cost grows with the square of the group size. Mamba is reading a book with a notepad. You process one page at a time, write down what seems important, cross out what doesn't matter anymore, and update your running summary as you go. By the end, your notepad captures the key ideas — and the process was O(N) linear. The notepad is the "state vector."

The critical difference from older approaches: Mamba's notepad is smart. It doesn't just mechanically record everything with the same rules. It looks at each new token and decides how much to remember and how much to forget, based on the content. A proper noun gets written down carefully. A filler word gets barely noted. This input-dependent selectivity is what makes Mamba competitive with transformers on quality, not just speed.
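The "smart notepad" can be sketched as a toy recurrence. This is a drastic simplification of Mamba's actual selective-SSM parameterization; the only point it illustrates is that the keep/write gates depend on the current token:

```python
import numpy as np

rng = np.random.default_rng(2)
d_state, d_in, seq_len = 8, 4, 32

# Projections that make the transition INPUT-DEPENDENT (the "selectivity").
W_forget = rng.normal(size=(d_in, d_state)) * 0.5
W_write = rng.normal(size=(d_in, d_state)) * 0.5
W_in = rng.normal(size=(d_in, d_state)) * 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

state = np.zeros(d_state)                  # the fixed-size "notepad"
for x in rng.normal(size=(seq_len, d_in)):  # one token at a time: O(N) total
    forget = sigmoid(x @ W_forget)          # how much old state to keep
    write = sigmoid(x @ W_write)            # how much of this token to store
    state = forget * state + write * (x @ W_in)

print(state.shape)  # (8,): constant-size state, regardless of seq_len
```

An important token can push `write` toward 1 and get recorded; a filler token can leave the state nearly untouched. Older fixed-transition SSMs applied the same `forget`/`write` to every token.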

[Figure: Transformer Attention vs. Mamba Processing. Transformer (all-to-all): every token attends to every other token through an N × N attention matrix, so O(N²) compute and memory. Mamba (sequential state): each token updates a running state in turn, ending in a final state, so O(N) compute and constant memory.]
Mamba's key innovation: selectivity. The state transitions are input-dependent — the model learns to "remember" important tokens and "forget" irrelevant ones. Created by Tri Dao (Flash Attention) and Albert Gu.

Mamba-3B outperforms transformers of the same size and matches transformers twice its size on language modeling. It achieves 5x higher inference throughput than transformers, with linear time and constant memory — no KV cache needed. This speed comes from hardware-aware design: kernel fusion (fusing multiple operations into one GPU kernel), parallel scan (processing the recurrence in parallel rather than strictly token-by-token), and recomputation instead of materialization (saving memory by recomputing activations during the backward pass).

Mamba-2 (May 2024) is 2-8x faster than Mamba-1, and its paper — "Transformers are SSMs" — proved that attention and SSMs are mathematically equivalent under certain conditions.

"Transformers are SSMs" — the unification (Mamba-2)

In May 2024, Tri Dao and Albert Gu published Mamba-2 with a remarkable title: "Transformers are SSMs."

They proved a mathematical equivalence: a specific form of structured state space model is exactly equivalent to a specific form of linear attention. The two paradigms aren't fundamentally different — they're different views of the same mathematical operation, with different computational trade-offs.

Beyond the theoretical insight, Mamba-2 is 2-8x faster than Mamba-1 thanks to a reformulated algorithm that maps better onto GPU matrix multiply units.

This is philosophically important: it means the field isn't "transformers vs SSMs." It's one unified framework. Some computations are more efficient as attention (precise retrieval), others as recurrence (sequential processing). The best architectures will use both.

Jamba: The Hybrid in Production

If Mamba is O(N) and transformers are O(N²), why not just use Mamba for everything? Because attention has one ability that Mamba doesn't match: precise retrieval. If you need to find a specific name mentioned 50,000 tokens ago, attention can look directly at that token. Mamba's compressed state vector may have lost that detail — it depends on whether the model deemed it important enough to "remember."

The question becomes: can you get the best of both? Use Mamba for the bulk of processing (fast, efficient, handles narrative and context flow well), but inject a few attention layers where precision matters?

Jamba (AI21 Labs, March 2024) answers yes. It interleaves transformer attention and Mamba layers in a 1:7 ratio — for every 8 layers, 1 uses transformer attention and 7 use Mamba. It also applies MoE to every other layer (16 experts, top-2).

Why 1:7 specifically? The 7 Mamba layers handle the sequential "reading with a notepad" work — building up context, processing tokens efficiently, maintaining the narrative flow. The 1 attention layer acts as a "checkpoint" — it can look at any token in the full context with precise attention, catching anything the Mamba layers might have compressed too aggressively. Think of it as reading a long report: you skim most pages efficiently (Mamba), but every 8th page you stop and carefully cross-reference specific details against everything you've read (attention).

[Figure: Jamba: Interleaved Transformer + Mamba + MoE. Layers 1-7 of each 8-layer block are Mamba, layer 8 is transformer attention; MoE replaces the dense FFN on every other layer, so the block reads Mamba, Mamba + MoE, Mamba, Mamba + MoE, ..., ending with Attention + MoE.]
Jamba's 1:7 ratio: 7 efficient Mamba layers for sequential processing, 1 attention layer for precise long-range retrieval. MoE on alternating layers (16 experts, top-2). Result: 52B total, 12B active, fits on a single 80GB GPU.

Why does MoE appear on alternating layers? The MoE layers add knowledge capacity without proportional compute — the same benefit from Part 3. But not every layer needs it. The non-MoE layers use a single dense FFN (cheaper, simpler). Alternating gives the model expert specialization where it helps, while keeping the overall parameter count manageable.

The result is a triple hybrid: Mamba for efficient sequential processing + attention for precise retrieval + MoE for knowledge capacity. Each component does what it's best at.
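The interleaving pattern is simple enough to write down directly (a sketch of the schedule only; the function name and layer counts are illustrative, not AI21's actual config):

```python
def jamba_layer_schedule(n_layers=32, attn_every=8, moe_every=2):
    """Jamba-style schedule: attention every 8th layer, MoE every other layer."""
    layers = []
    for i in range(1, n_layers + 1):
        kind = "attention" if i % attn_every == 0 else "mamba"
        ffn = "moe" if i % moe_every == 0 else "dense"
        layers.append((kind, ffn))
    return layers

schedule = jamba_layer_schedule(8)  # one 8-layer block
for i, (kind, ffn) in enumerate(schedule, 1):
    print(f"Layer {i}: {kind} + {ffn} FFN")
```

One block comes out as 7 Mamba layers and 1 attention layer, with MoE on the even layers: the 1:7 and alternating-MoE ratios from the figure.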

Results: Jamba fits on a single 80GB GPU, handles 256K token context, and achieves 3x throughput vs a comparable transformer — with competitive language modeling quality. 52B total params, only 12B active. For context: a pure transformer with the same quality would need multiple GPUs and significantly more memory. Jamba showed that hybrid architectures aren't just theoretically interesting — they're practically superior for the memory-constrained real world.

RWKV: RNNs Strike Back

Jamba interleaves attention with Mamba. RWKV (Receptance Weighted Key Value, Peng et al., EMNLP 2023) goes further — it eliminates attention entirely.

RWKV solves a different problem than Mamba. Mamba's innovation was selectivity — making state transitions input-dependent. RWKV's innovation is dual-mode operation: it can run as a transformer during training (processing all tokens in parallel for fast training) and as an RNN during inference (processing token-by-token with constant O(1) memory per step). Same model, two operating modes.

This is a big deal for deployment. During training, you want parallelism — process the entire sequence at once across your GPUs. During inference, you want efficiency — generate one token at a time with minimal memory. Transformers are parallel for both (but expensive during inference because of the KV cache). RNNs are sequential for both (fast inference, but training can't be parallelized). RWKV gets parallel training AND sequential inference — best of both worlds.

How different from Mamba? Mamba uses a continuous state space model with learned selectivity. RWKV uses a reformulated attention mechanism where the attention weights decay exponentially with distance — so recent tokens get full weight, distant tokens get fading weight. The decay rates are learned. At training time, this can be computed in parallel (like attention). At inference time, it can be computed recurrently (like an RNN). The two computations are mathematically equivalent, just organized differently.
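The equivalence is easy to verify on a toy single-channel version (real RWKV channels use learned decay rates and key/value weighting; here the decay is a constant):

```python
import numpy as np

rng = np.random.default_rng(3)
seq_len = 16
values = rng.normal(size=seq_len)
decay = 0.9  # learned per-channel in real RWKV; a constant here

# Parallel ("training") mode: a full decay-weighted matrix, like attention.
t = np.arange(seq_len)
weights = np.where(t[None] <= t[:, None], decay ** (t[:, None] - t[None]), 0.0)
parallel_out = weights @ values  # out[i] = sum over j<=i of decay^(i-j) * v[j]

# Recurrent ("inference") mode: one token at a time, O(1) state.
state = 0.0
recurrent_out = []
for v in values:
    state = decay * state + v
    recurrent_out.append(state)

print(np.allclose(parallel_out, recurrent_out))  # True: same math, two layouts
```

Training uses the matrix form (one big parallel op across the GPU); inference uses the loop (one scalar of state per channel, no KV cache). The outputs are identical.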

[Figure: RWKV: One Architecture, Two Operating Modes. Training mode: all tokens processed at once, in parallel like attention, for fast training on GPUs. Inference mode: tokens processed one at a time through a recurrent state, with O(1) memory per step.]
Same model, same weights — the math is equivalent, just computed differently. Transformers can't do this: they're parallel for both modes (but pay the KV cache cost at inference). RWKV's key insight: exponentially decaying attention weights can be computed as parallel attention (fast training) OR as recurrent state updates (constant-memory inference). Best of both worlds.

The real-world impact proves this isn't just academic: Microsoft deployed RWKV v5 "Eagle" to 1.5 billion Windows 10/11 devices for on-device Copilot — the largest deployment of an attention-free architecture in production. Why RWKV for this? Because on-device means tight memory constraints and no cloud GPUs. RWKV's constant-memory inference makes it ideal for running on laptops and phones, where a transformer's KV cache would be prohibitively expensive. RWKV-7 "Goose" (2025) describes itself as the "strongest attention-free, 100% RNN architecture."

Are These Replacing Transformers?

No. And the reason goes back to the question we started with.

Remember the Ship of Theseus? We asked: if you replace every plank, is it still the same ship? We spent Parts 2 and 3 watching components get swapped out — RoPE for sinusoidal encoding, RMSNorm for LayerNorm, GQA for MHA, SwiGLU for ReLU, MoE for dense FFN. And yet the answer was clear: yes, it's still a transformer. The core — Q/K/V attention, residual connections, the alternating attention-then-FFN pattern — never changed.

Now Part 4 asks a different question: what if you build a different ship entirely? Mamba doesn't replace planks — it reimagines the hull. Instead of every token talking to every other token (attention), it processes tokens one at a time through a learned state. RWKV takes a different approach, using exponentially decaying weights that can run as parallel attention during training and sequential state during inference. These aren't upgraded transformers. They're alternative architectures.

So are they winning? The honest answer: not yet, and maybe not alone.

Pure transformers still produce the highest quality on most benchmarks — especially tasks that require precise retrieval. Here's a concrete example: imagine you're processing a 200-page legal contract and you ask "what was the liability cap mentioned in clause 47?" With transformer attention, every token can directly compare against every other token — the model can look at the question and attend specifically to clause 47, wherever it sits in the 100,000-token context. It's a direct lookup.

With Mamba, all those 100,000 tokens have been compressed into a single fixed-size state vector — a summary. If the model deemed that liability cap important enough to "remember" during sequential processing, it's there. If not, it's lost. For most tasks (summarization, general Q&A, narrative understanding), the compressed state is sufficient. But for needle-in-a-haystack retrieval — finding one specific detail in a vast context — attention's direct access wins.

But SSMs offer something transformers can't: truly linear scaling. Double the context length and a transformer's attention cost quadruples. Double it for Mamba and the cost merely doubles. For processing millions of tokens — genomics, legal corpora, codebases — that difference isn't incremental, it's the difference between possible and impossible.

The Architecture Spectrum: Precision vs. Efficiency

  • Pure Transformer: best quality and precise retrieval, but O(N²) scaling and a KV cache that grows linearly. Examples: GPT-4, Claude, Llama 3, DeepSeek-V3.
  • Hybrid: near-best quality with selective precision and much longer context, at the cost of design complexity. Examples: Jamba, Llama 4 (rumored), Zamba.
  • Pure SSM / RNN: competitive quality but weaker at retrieval; O(N) scaling and constant memory. Examples: Mamba, RWKV, Griffin.

The industry is converging toward the middle. Today, most major production models are pure transformers with component upgrades (Parts 2-3); hybrids are the emerging frontier, gaining traction for efficiency; pure SSMs excel on-device and for very long contexts.

This is exactly why the industry is converging on hybrid architectures. Jamba's 1:7 ratio wasn't arbitrary — it was an engineering answer to a practical question: how do you get the precision of attention where it matters most, while using Mamba's efficiency for the long stretches in between? The result: a model with a 256K context window that fits in the memory budget of a 70K-context pure transformer.

Even the researchers building these alternatives see convergence, not competition. The Mamba-2 paper is titled "Transformers are SSMs" — arguing that attention and state space models are different views of the same underlying mathematics. They're not rivals. They're complementary tools.

The emerging recipe looks like this: transformer attention as the backbone for precision, SSM layers for efficiency on long sequences, and MoE for capacity without proportional compute cost. Not one architecture winning, but three ideas combining. The ship isn't being replaced by a different ship — it's being extended with new kinds of planks that the original builders never imagined.

The Modern Recipe: 2017 vs 2025

Here's what a state-of-the-art transformer block looks like today, compared to the original:

The 2017 Block vs. The 2025 Block

Original Transformer (2017): Input → Multi-Head Attention (sinusoidal PE) → Add (residual) → LayerNorm → FFN (ReLU) → Add (residual) → LayerNorm.

Modern Block (2025): Input → RMSNorm → GQA Self-Attention (RoPE + Flash Attention) → Add (residual) → RMSNorm → MoE Router → K Expert FFNs (SwiGLU activation) → Add (residual).

Same skeleton (attention + FFN + residual). Every component inside has been upgraded. The ship has new planks, but it's still the same ship.
Timeline: Key innovations from 2017 to 2025

Year | Innovation | Paper / Model | Impact
2017 | Original Transformer | Vaswani et al., "Attention Is All You Need" | Foundation of everything
2018 | GPT-1, BERT | OpenAI, Google | Decoder-only and encoder-only paradigms
2019 | Multi-Query Attention | Shazeer | First KV cache reduction technique
2019 | RMSNorm | Zhang & Sennrich, NeurIPS | Cheaper normalization, zero quality loss
2020 | GPT-3 (175B) | OpenAI | Proved scale = capability (Kaplan scaling laws)
2020 | SwiGLU | Shazeer | Gated activations replaced ReLU
2021 | RoPE | Su et al., "RoFormer" | Relative position via rotation
2022 | Chinchilla | Hoffmann et al., DeepMind | Proved optimal data-to-param ratio; shifted the industry from "bigger model" to "more data"
2022 | Flash Attention | Tri Dao, Stanford | O(N) memory, 2-4x speed, enabled 100K+ context
2023 | Llama 1 | Meta | First major model combining pre-RMSNorm + RoPE + SwiGLU (GQA arrived with Llama 2 70B)
2023 | GQA | Ainslie et al., EMNLP | Middle ground: shared KV groups
2023 | Mistral 7B | Mistral AI | Sliding window attention + GQA
2023 | Mixtral 8x7B | Mistral AI | First widely-used open MoE
2023 | Mamba | Gu & Dao | Selective SSM, linear-time alternative to attention
2023 | RWKV | Peng et al., EMNLP | Parallel training + O(1) inference RNN
2024 | Mamba-2 | Dao & Gu, "Transformers are SSMs" | Unified framework, 2-8x faster
2024 | MLA | DeepSeek-V2 | 28x KV cache compression
2024 | Jamba | AI21 Labs | First production hybrid (Transformer + Mamba + MoE)
2024 | DeepSeek-V3 | DeepSeek | 671B total, 37B active, trained for ~$5.5M
2024 | Flash Attention 3 | Tri Dao | FP8 support, 740 TFLOPs/s on H100
2025 | DeepSeek-R1 | DeepSeek | Spontaneous CoT from pure RL
2025 | Llama 4 | Meta | Scout/Maverick/Behemoth MoE family, 10M context
2025 | Qwen3 | Alibaba | Global-batch load balancing, thinking/non-thinking modes
Key people to know

Noam Shazeer — Co-authored the original transformer (2017), invented MQA (2019), SwiGLU (2020), and pioneer of MoE load balancing. Co-founded Character.AI, then returned to Google.

Tri Dao — Created Flash Attention (2022-2024), co-created Mamba (2023-2024). Stanford PhD, co-founder/Chief Scientist at Together AI.

Albert Gu — Co-created S4 and Mamba SSM architectures. Carnegie Mellon professor. His work on structured state spaces opened the SSM paradigm.

Ashish Vaswani — Lead author of "Attention Is All You Need." Co-founded Adept AI, then Essential AI.

References and further reading

Core Papers:

  • Vaswani et al., "Attention Is All You Need" (2017) — arXiv:1706.03762
  • Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021) — arXiv:2104.09864
  • Zhang & Sennrich, "Root Mean Square Layer Normalization" (2019, NeurIPS) — arXiv:1910.07467
  • Shazeer, "Fast Transformer Decoding: One Write-Head is All You Need" (2019) — arXiv:1911.02150
  • Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (2023, EMNLP) — arXiv:2305.13245
  • Shazeer, "GLU Variants Improve Transformer" (2020) — arXiv:2002.05202
  • Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022) — arXiv:2205.14135
  • Dao, "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (2023) — arXiv:2307.08691
  • Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla, 2022) — arXiv:2203.15556

Architecture Papers:

  • Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023) — arXiv:2312.00752
  • Dao & Gu, "Transformers are SSMs" (Mamba-2, 2024) — arXiv:2405.21060
  • Peng et al., "RWKV: Reinventing RNNs for the Transformer Era" (2023, EMNLP) — arXiv:2305.13048
  • Lieber et al., "Jamba: A Hybrid Transformer-Mamba Language Model" (2024) — arXiv:2403.19887
  • DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (2024) — arXiv:2405.04434
  • DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024) — arXiv:2412.19437
  • DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025) — arXiv:2501.12948
  • Jiang et al., "Mixtral of Experts" (2024) — arXiv:2401.04088

Educational Resources:

  • Jay Alammar — "The Illustrated Transformer" (jalammar.github.io)
  • Lilian Weng — "The Transformer Family Version 2.0" (lilianweng.github.io)
  • Sebastian Raschka — "Understanding Large Language Models" (magazine.sebastianraschka.com)
Reference
Cheat Sheet

Q/K/V Attention

Same formula since 2017. Every model computes softmax(QKᵀ/√d)V. The variants optimize how heads are organized and how it runs on hardware — the math hasn't changed.
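That formula, directly in code (a minimal single-head sketch; no causal mask or batching):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V -- unchanged since 2017."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # [N, N] similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # weighted mix of values

rng = np.random.default_rng(4)
N, d = 6, 8
Q, K, V = rng.normal(size=(3, N, d))
print(attention(Q, K, V).shape)  # (6, 8)
```

Everything in the cheat-sheet entries below (GQA, Flash Attention, MLA) changes how and where this computation runs, not what it computes.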

RoPE

Rotation encodes relative position. Applied to Q and K at every layer (not just input). Enables context extension beyond training length. Universal from 2023+.
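A minimal sketch of the rotation (using the common half-split pairing; `base=10000` is the standard choice). The key property: the q·k dot product after rotation depends only on the relative distance between positions:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)  # one frequency per dim pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation applied to each (x1[i], x2[i]) pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(5)
q, k = rng.normal(size=(2, 8))

# Same relative distance (4), different absolute positions:
a = rope(q, pos=3) @ rope(k, pos=7)
b = rope(q, pos=100) @ rope(k, pos=104)
print(np.isclose(a, b))  # True: attention scores see only relative position
```

Because rotations compose, ⟨R(θp)q, R(θm)k⟩ = ⟨q, R(θ(m−p))k⟩, which is why the score depends only on m−p.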

RMSNorm

Cheaper normalization, pre-norm placement. Drops mean-centering, keeps only RMS scaling. Pre-norm gives gradients a clean residual path. 7-64% faster, zero quality loss.
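The difference is one removed statistic. A minimal sketch (the `eps` value and its placement vary across implementations):

```python
import numpy as np

def layernorm(x, gain, bias, eps=1e-6):
    """LayerNorm: subtract the mean, divide by std, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(var + eps) + bias

def rmsnorm(x, gain, eps=1e-6):
    """RMSNorm: skip mean-centering entirely; divide by the RMS, then scale."""
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return gain * x / rms

x = np.array([1.0, 2.0, 3.0, 4.0])
print(rmsnorm(x, gain=1.0))  # no mean statistic computed at all
```

Dropping the mean computation (and the bias term) is where the speedup comes from; empirically, quality is unaffected.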

GQA / MLA

Share or compress KV heads. GQA: 8 shared KV heads instead of 64 (8x savings). MLA: compress to tiny latent vector (28x savings). Essential for long context windows.
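A minimal GQA sketch (toy sizes: 8 query heads sharing 2 cached KV heads, giving a 4x KV cache saving; real models use the 64-to-8 ratio mentioned above):

```python
import numpy as np

n_q_heads, n_kv_heads, d_head, N = 8, 2, 4, 6
group = n_q_heads // n_kv_heads  # 4 query heads share each KV head

rng = np.random.default_rng(8)
Q = rng.normal(size=(n_q_heads, N, d_head))
K = rng.normal(size=(n_kv_heads, N, d_head))  # only 2 KV heads in the cache
V = rng.normal(size=(n_kv_heads, N, d_head))

outs = []
for h in range(n_q_heads):
    kv = h // group  # map each query head onto its shared KV head
    scores = Q[h] @ K[kv].T / np.sqrt(d_head)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    outs.append(w @ V[kv])

print(np.stack(outs).shape)  # (8, 6, 4): 8 query heads, only 2 KV heads stored
```

All 8 query heads still compute full attention; only the K/V projections (and hence the KV cache) are shared.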

SwiGLU

Gated activation function. Two parallel paths multiplied together — the network learns which information to pass. Replaced ReLU's dead neuron problem with learned control.
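A minimal sketch of the gated FFN (toy dimensions; real models typically shrink d_ff to keep the parameter count comparable to the ReLU FFN, and drop biases):

```python
import numpy as np

def silu(z):
    """SiLU / Swish: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Two parallel projections; the SiLU path gates the other, elementwise."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(7)
d_model, d_ff = 8, 16
W_gate, W_up = rng.normal(size=(2, d_model, d_ff))
W_down = rng.normal(size=(d_ff, d_model))

x = rng.normal(size=d_model)
print(swiglu_ffn(x, W_gate, W_up, W_down).shape)  # (8,)
```

The gate can smoothly scale any channel toward zero without the hard cutoff that causes ReLU's dead neurons.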

Flash Attention

Same math, GPU-optimized tiling. Processes attention in SRAM tiles instead of materializing N×N in HBM. O(N²) → O(N) memory. Enabled 100K+ context windows.
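The tiling relies on the "online softmax" trick: a row of scores can be processed in chunks, keeping only a running max and running sum, and still produce the exact answer. A sketch of one attention row (tile size and shapes are illustrative; the real kernel does this per-tile in SRAM):

```python
import numpy as np

rng = np.random.default_rng(6)
scores = rng.normal(size=16)  # one attention row's raw scores
v = rng.normal(size=(16, 4))  # the corresponding value rows

# Reference: full softmax, materializing the whole row at once.
w = np.exp(scores - scores.max())
full = (w / w.sum()) @ v

# Tiled: 4 scores at a time, with running max / sum / accumulator.
m, s, acc = -np.inf, 0.0, np.zeros(4)
for i in range(0, 16, 4):
    sc, vv = scores[i:i + 4], v[i:i + 4]
    m_new = max(m, sc.max())
    scale = np.exp(m - m_new)   # rescale earlier accumulators to the new max
    p = np.exp(sc - m_new)
    s = s * scale + p.sum()
    acc = acc * scale + p @ vv
    m = m_new
tiled = acc / s

print(np.allclose(full, tiled))  # True: exact attention, never the full row
```

This is why Flash Attention is exact, not approximate: the tiled result matches the materialized one to floating-point precision, while only one tile lives in fast memory at a time.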

Mixture of Experts

Many experts, router selects few. DeepSeek-V3: 671B total, 37B active. Router picks 8 of 256 experts per token. Massive capacity at fraction of compute cost.

Hybrid Future

Attention backbone + SSM efficiency + MoE capacity. Jamba: 1 attention layer per 7 Mamba layers. Transformers aren't being replaced — they're being augmented.

Practice Mode

Four real-world architecture decisions. Can you make the right call?

Scenario 1 of 4
Your team needs a 100B+ parameter model that can run on 2 GPUs (160 GB total memory). You need frontier-class performance for a multi-language customer support system.
Dense or MoE?
A
Dense 100B — simpler architecture, easier to fine-tune, proven reliability
B
MoE with ~200B+ total, 30B active — more knowledge capacity with less active compute per token
C
Dense 70B — keep it simple, 70B fits easily and is "good enough"
