LLM Inference

From Prompt to GPU

What happens in the 0.8 seconds between hitting Enter and seeing the first token — tokenization, GPU memory, the two-phase pipeline, KV caches, batching, and the self-host vs API decision.

Bahgat Ahmed
February 2026 · ~25 min read
LLM Engineering Series
Series: How LLMs Work · RAG & Agents · From Prompt to GPU (you are here) · AI Memory · Graph Memory · Fine-Tuning · Evaluation
What's Inside (~25 min read)
1. Tokenization
2. Loading the Model
3. The Two Phases
4. The KV Cache
5. Serving Multiple Users
6. Context Extension
7. Generation Control
8. Self-Host or API?
Plus: Practice Mode · Cheat Sheet

بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

In the name of Allah, the Most Gracious, the Most Merciful

You send a 50-word prompt. 0.8 seconds later, the first token appears. In that 0.8 seconds, your text was shredded into tokens, multiplied through 96 layers of 175 billion parameters, and the GPU moved 2 terabytes of data through memory.

Then the model generates 100 tokens per second — smooth, streaming, almost conversational. But the engineering behind that smoothness involves two completely different computational phases, a cache that grows with every single token, and a memory bandwidth bottleneck that determines whether your $30,000 GPU sits idle or runs at full capacity.

This post is the complete map of what happens between "Enter" and the response — from text to tokens to tensors to output.

Quick Summary
  • Part 1: Tokenization — how text becomes numbers via BPE and SentencePiece, and why Arabic costs 2-5x more than English
  • Part 2: GPU memory — VRAM, FP16, TFLOPS, memory bandwidth, and why bandwidth is almost always the bottleneck
  • Part 3: Prefill vs decode — the two-phase pipeline that explains why the first token is slow and streaming is fast
  • Part 4: KV cache — the memory structure that grows with every token, and PagedAttention's solution
  • Parts 5-8: Batching, context extension, generation control, and the self-host vs API decision framework
This post is for you if...
  • You've used LLM APIs and wondered what actually happens between your request and the response
  • You've seen terms like "KV cache," "VRAM," "FP16," or "TTFT" and want to understand what they actually mean
  • You're evaluating whether to self-host a model or use an API, and need to understand the real tradeoffs
  • You read How LLMs Work and want to understand the engineering that makes inference fast at scale
Part 1
Tokenization — How Text Becomes Numbers

The Phrasebook Problem

Imagine you're traveling to Japan with a phrasebook. You could look up every single letter individually — "h," "e," "l," "l," "o" — and spell everything character by character. That would work, but it would take forever. A better phrasebook groups common sequences: "hello" is one entry, "thank you" is another. Common phrases get a single lookup; rare words get spelled out letter by letter.

That's exactly how a tokenizer works. It's the very first thing that processes your prompt — before any neural network, any attention mechanism, any GPU computation. The tokenizer converts your text into a sequence of numbers (called tokens) that the model can process. And the quality of this conversion has a direct impact on cost, speed, and even the model's ability to understand your input.

A token is roughly three-quarters of a word in English, or about 4 characters. "Hello, how are you?" becomes about 5 tokens. But this ratio changes dramatically across languages, and that has real financial consequences.

How Tokenization Works
Input text: "The database connection pool is leaking"

The → 464 · database → 6831 · connection → 3717 · pool → 7924 · is → 374 · leak → 19452 · ing → 287

Token IDs sent to the model: [464, 6831, 3717, 7924, 374, 19452, 287]
Common words like "The" and "is" get their own token. Less common words like "leaking" get split into "leak" + "ing." Each token maps to a unique number.

BPE: How the Phrasebook Is Built

Byte Pair Encoding (BPE) is the algorithm that builds this phrasebook. Think of it like a language teacher who watches millions of sentences and notices which character pairs appear together most frequently. The teacher then creates a shorthand for the most common pairs, then the most common pairs of those pairs, and so on — building up from individual characters to common words and even multi-word fragments.

The process starts with every character as its own token. Then it repeatedly merges the most frequent adjacent pair into a single new token. After thousands of merge operations, you get a vocabulary of 32,000 to 128,000 tokens that efficiently encodes common text.

SentencePiece is the tool that actually builds these phrasebooks. Developed by Google, it's the standard tokenizer builder used by Llama, Mistral, and most modern open-source models. SentencePiece takes a corpus of text and runs BPE (or a variant called Unigram) to produce the final vocabulary file that the model uses forever after.

BPE Step by Step: Watching Merges Happen

Here's how BPE builds a vocabulary from scratch. Imagine training on just one phrase: "low lower lowest"

Step 0 (character level): l o w | l o w e r | l o w e s t

Step 1: Most common pair is "l" + "o" → merge into "lo": lo w | lo w e r | lo w e s t

Step 2: Most common pair is "lo" + "w" → merge into "low": low | low e r | low e s t

Step 3: Now the most common pair is "low" + "e" (it appears in both "lower" and "lowest") → merge into "lowe": low | lowe r | lowe s t

Step 4: The remaining pairs are tied at one occurrence each. Say we merge "lowe" + "s" → "lowes": low | lowe r | lowes t

After just 4 merges, "lowest" went from 6 tokens to 2 tokens. In real models, this process runs for 32,000 to 128,000 merge steps across billions of words of training data.

Key insight: The merge order is learned from data, not hand-coded. Common English words like "the" and "and" get merged early and become single tokens. Rare words or unusual character sequences stay split into smaller pieces.
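The merge loop above is small enough to implement directly. Here's a toy sketch in plain Python (the function name and structure are mine, not from any library) that reproduces the merges on "low lower lowest":

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    corpus = [tuple(w) for w in words]   # start with every character as a token
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs across the whole corpus.
        pairs = Counter()
        for word in corpus:
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Most frequent pair wins; ties break by first-seen order,
        # so merge 4 here happens to be ("lowe", "r") rather than ("lowe", "s").
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere in the corpus.
        new_corpus = []
        for word in corpus:
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus.append(tuple(out))
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_train(["low", "lower", "lowest"], 4)
# merges begins [('l','o'), ('lo','w'), ('low','e')], matching the steps above
```

Real tokenizers add frequency-weighted corpora, byte-level fallback, and special tokens, but the core loop is exactly this.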

The Arabic Tokenizer Tax

Here's where tokenization has real business impact. The fertility score measures how many tokens a language needs per word. English has a fertility of about 1.0 — one word, one token (roughly). Arabic has a fertility of 2.19. That means every Arabic word uses more than twice as many tokens as an English word.

Why? Because BPE vocabularies are trained primarily on English text. Common English words like "the," "is," and "database" each get their own token. But Arabic words — even extremely common ones — often get split into 3, 4, or even 5 subword pieces because the tokenizer never learned to merge those character sequences.

The financial impact is direct: if every API call costs per token, Arabic users are paying 2-5x more for the same conversation. A chatbot that costs $500/month in English costs $1,000-$2,500/month in Arabic — for the same number of users, the same conversations, the same functionality.

The Tokenizer Tax: English vs Arabic
English: "The database connection timed out" → 5 tokens (fertility ≈ 1.0)
Arabic: "ات صال الا قاع دة البي انات انت هت مهل ة" → 11 tokens (fertility ≈ 2.19)
Cost multiplier: ~2.19x
The same meaning requires 2x+ more tokens in Arabic because BPE vocabularies were trained primarily on English text. Arabic words get split into many subword pieces.
The Cost Math

At GPT-4o pricing ($2.50 per million input tokens): a 500-word English prompt uses ~375 tokens ($0.0009). The same content in Arabic uses ~820 tokens ($0.002). Multiply by 100,000 daily queries and that's the difference between $90/day and $200/day — an extra $3,300/month.
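The arithmetic above is worth sanity-checking in code. A quick sketch (the price and token counts are the figures quoted in the text, not live API pricing):

```python
PRICE_PER_M_INPUT = 2.50  # USD per million input tokens (GPT-4o figure from the text)

def daily_cost(tokens_per_query, queries_per_day, price_per_m=PRICE_PER_M_INPUT):
    """Daily input-token spend in USD."""
    return tokens_per_query * queries_per_day * price_per_m / 1_000_000

english = daily_cost(375, 100_000)          # $93.75/day
arabic = daily_cost(820, 100_000)           # $205.00/day
extra_per_month = (arabic - english) * 30   # $3,337.50 -- the "extra $3,300/month"
```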

Vocabulary Extension: Retraining the Phrasebook

If your application primarily serves a non-English language, you can retrain the tokenizer's vocabulary to include common words from that language. This is called vocabulary extension.

The process works like this:

  1. Train a new SentencePiece model on a large corpus of your target language
  2. Merge the new vocabulary with the original model's vocabulary, adding new tokens for common words in the target language
  3. Resize the model's embedding layer to accommodate the new vocabulary size
  4. Fine-tune the model so it learns the new token embeddings
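Steps 2 and 3 can be sketched with plain dictionaries (a toy sketch only — real pipelines use SentencePiece for training and the model framework's embedding-resize utilities):

```python
def extend_vocab(base_vocab, new_tokens):
    """Append tokens not already present; return the merged vocab and the
    new embedding-matrix row count (step 3 of the process above)."""
    vocab = dict(base_vocab)
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)   # new tokens get the next free IDs
    return vocab, len(vocab)

# Toy base vocab plus two common Arabic words; "the" is already present.
vocab, n_rows = extend_vocab({"the": 0, "is": 1}, ["قاعدة", "البيانات", "the"])
# n_rows == 4: the embedding layer grows from 2 rows to 4
```

After the merge, each new embedding row starts untrained — which is exactly why step 4 (fine-tuning) is mandatory.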

For Arabic, this can reduce the fertility score from 2.19 to close to 1.2 — nearly halving the cost per query. The Jais family of Arabic-centric models (from the UAE) took this approach, building a vocabulary with Arabic-specific merges to achieve near-English token efficiency.

The tradeoff: Vocabulary extension requires fine-tuning, which costs compute time and money upfront. It's worth it when your Arabic API costs justify the one-time investment — typically when monthly API costs exceed $2,000.

Decision Card
A client's Arabic chatbot costs 3x the English version for the same conversations. What's the first optimization to investigate?

Answer: vocabulary extension. The 3x cost comes directly from Arabic's high fertility score (2.19 tokens per word vs ~1.0 for English). Vocabulary extension attacks the root cause by teaching the tokenizer common Arabic words, reducing the fertility score from 2.19 toward 1.2 — nearly halving cost without sacrificing quality. A cheaper model still pays the same per-token tax, and shorter prompts sacrifice quality.

Part 2
Loading the Model — What Lives on the GPU

The Desk Size Problem

Imagine your GPU is a desk. The model's parameters — all 70 billion of them — are textbooks that need to be open on that desk. The bigger the model, the more textbooks. The desk has a fixed size, and if the books don't fit, you either need a bigger desk or you need to squeeze the books.

That desk is VRAM — Video RAM, the memory physically located on the GPU chip. Unlike regular system RAM (which sits on your motherboard and communicates with the CPU over a relatively slow bus), VRAM sits right next to the GPU's compute cores with a massively wide memory bus. An NVIDIA A100 GPU has 80GB of VRAM. An H100 has 80GB. A consumer RTX 4090 has 24GB.

FP16: Squeezing the Books

Every parameter in a model is a number — a weight that was learned during training. In full precision (FP32), each number takes 4 bytes of storage. A 70-billion parameter model in FP32 needs 70B x 4 bytes = 280GB. That doesn't fit on any single GPU.

The solution: FP16 (half precision). Instead of using 32 bits per number, you use 16 bits — cutting storage exactly in half. A 70B model in FP16 needs 70B x 2 bytes = 140GB. Still doesn't fit on one 80GB GPU, but it fits on two.

The remarkable thing is that half precision barely affects output quality. The difference between 3.14159265 and 3.14160 matters in scientific computing but is irrelevant for language model weights. Most modern inference runs in FP16 or BF16 (brain float 16, a variant that handles extreme values better) by default.

GPU Memory Layout — What Lives in VRAM (NVIDIA A100, 80GB, running Llama-70B in FP16)
  • Model weights (~70%): the learned parameters. At 140GB in FP16, Llama-70B does not fit on one GPU — it needs 2 GPUs via tensor parallelism.
  • KV cache (~20%): grows with each generated token.
  • Activations (~10%): temporary computation.
Model weights dominate VRAM. The KV cache (explained in Part 4) is the second-largest consumer and grows with every token generated.

TFLOPS vs Memory Bandwidth: The Real Bottleneck

GPUs have two speed metrics, and understanding which one matters is the key to understanding inference performance.

TFLOPS (Tera Floating-point Operations Per Second) measures how fast the GPU can do math. The A100 does 312 TFLOPS in FP16. That's 312 trillion multiply-add operations every second. It sounds like the GPU should be blindingly fast.

Memory bandwidth measures how fast the GPU can move data from VRAM to the compute cores. The A100 has 2 TB/s of memory bandwidth. That's 2 terabytes per second — fast, but not fast enough.

Think of it as a factory. TFLOPS is how fast the workers can assemble products. Memory bandwidth is how fast the conveyor belt delivers raw materials. Even if you have 1,000 workers, if the conveyor belt can only deliver enough materials for 100 workers at a time, 900 workers sit idle. That's exactly what happens during most of LLM inference: the GPU compute cores are waiting for data to arrive from memory.

The GPU Bottleneck: Compute vs Memory
  • Compute speed: 312 TFLOPS — the 1,000 workers
  • Memory bandwidth: 2 TB/s — the narrow conveyor belt
  • Result: during decode, less than 2% actual compute utilization
The A100 can compute 312 trillion operations/sec but can only feed data at 2 TB/s. During decode, compute cores are starved for data — running at less than 2% capacity. This is why batching and memory optimizations matter more than raw compute.
Why This Matters

The A100 has 312 TFLOPS but only 2 TB/s bandwidth. To keep those compute cores fully busy, you'd need to do 156 floating-point operations for every byte loaded from memory. During token generation, the model does about 1-2 operations per byte. The GPU is running at less than 2% of its theoretical compute capability. The bottleneck is memory bandwidth, not compute.
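A two-line calculation makes the imbalance concrete (the numbers are the A100 figures from the text):

```python
tflops = 312e12    # A100 FP16 compute, operations/sec
bandwidth = 2e12   # A100 memory bandwidth, bytes/sec

# Ops per byte needed to keep the compute cores fully fed ("ridge point"):
ridge = tflops / bandwidth               # 156.0 ops/byte
decode_intensity = 2                     # ~ops per byte during token generation
utilization = decode_intensity / ridge   # ≈ 0.013 → about 1.3% of peak compute
```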

VRAM Calculation Cheat Sheet

Quick formula: VRAM (GB) = Parameters (B) x Bytes per parameter

Precision   | Bytes/param | 7B Model | 13B Model | 70B Model
FP32        | 4           | 28 GB    | 52 GB     | 280 GB
FP16/BF16   | 2           | 14 GB    | 26 GB     | 140 GB
INT8        | 1           | 7 GB     | 13 GB     | 70 GB
INT4        | 0.5         | 3.5 GB   | 6.5 GB    | 35 GB

Add 10-20% overhead for KV cache and activations on top of these base numbers. A 70B model in INT4 quantization (35 GB) can fit on a single 48 GB GPU (like the A6000) — or across two 24 GB RTX 4090s — with room for the KV cache.

GPU VRAM sizes: RTX 4090 = 24 GB, A6000 = 48 GB, A100 = 80 GB, H100 = 80 GB.
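The cheat-sheet formula translates directly to code. A minimal sketch (the 15% `overhead` default is my assumption within the 10-20% range given above):

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def vram_gb(params_billion, precision, overhead=0.15):
    """Weight memory plus an allowance for KV cache and activations."""
    base = params_billion * BYTES_PER_PARAM[precision]
    return base * (1 + overhead)

vram_gb(70, "fp16", overhead=0)   # 140.0 GB -- needs two 80 GB GPUs
vram_gb(70, "int4")               # ~40 GB -- fits a single 48 GB A6000
```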

Part 3
The Two Phases — Why Streaming Exists

Why the First Token Takes So Long

You've probably noticed this when using ChatGPT or Claude: you hit send, there's a noticeable pause, and then tokens start appearing rapidly one after another. That initial pause and the subsequent fast streaming aren't just an interface choice — they reflect two fundamentally different computational phases happening on the GPU.

Phase 1: Prefill — Reading Your Prompt

Imagine a speed reader who can read an entire book in parallel — all pages at once. That's the prefill phase. The GPU takes your entire prompt (all tokens at once) and processes them simultaneously through the model. This is matrix-matrix multiplication: large blocks of data being multiplied together.

During prefill, the GPU's compute cores are actually busy. The operation has high arithmetic intensity — the ratio of computation to memory access. For every byte loaded from memory, the GPU performs many operations. This is compute-bound: the speed limit is how fast the GPU can do math, not how fast it can fetch data.

Phase 2: Decode — Generating One Token at a Time

Now imagine the same speed reader has to write a novel, but they can only write one word at a time, and before writing each word, they need to re-read everything they've written so far. That's the decode phase. The GPU generates tokens one by one, and each token requires accessing the entire model's weights.

During decode, the operation is matrix-vector multiplication: one tiny vector (the current token) multiplied against the enormous weight matrices. The arithmetic intensity drops dramatically — you load a huge amount of data for a tiny amount of computation. This is memory-bound: the speed limit is how fast data can move from VRAM to the compute cores.

Prefill vs Decode — The Two-Phase Pipeline
Phase 1: Prefill
Reading Your Prompt
Operation: Matrix-matrix multiplication
Bottleneck: Compute-bound (GPU math speed)
Arithmetic Intensity: High (~100+ ops/byte)
Speed: Processes all prompt tokens at once
Metric: TTFT (Time to First Token)
Phase 2: Decode
Generating Tokens
Operation: Matrix-vector multiplication
Bottleneck: Memory-bound (bandwidth)
Arithmetic Intensity: Low (~1-2 ops/byte)
Speed: One token at a time, sequentially
Metric: TPS (Tokens Per Second)
The first token is slow because prefill processes your entire prompt. Subsequent tokens stream fast because decode only needs to generate one token at a time. This is why you see that initial pause followed by rapid streaming.

Arithmetic intensity is the ratio that determines which phase you're in. High arithmetic intensity (many operations per byte of data moved) means compute-bound. Low arithmetic intensity (few operations per byte) means memory-bound. Prefill has high intensity because you're multiplying large matrices together. Decode has low intensity because you're multiplying one tiny vector against large matrices.

Real Numbers

A typical GPT-4-class model on an A100 cluster: TTFT = 500-800ms for a 200-token prompt, then 50-100 tokens/second during decode. The prefill feels slow because the model is reading and processing your entire prompt. The decode feels fast because each new token only requires a fraction of the computation.
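Those decode numbers follow almost directly from the bandwidth bottleneck: each generated token has to stream essentially all the weights from VRAM once, so bandwidth divided by model size gives a back-of-envelope upper bound on single-stream decode speed. A sketch using the figures from this post (a cluster parallelizes across GPUs, which is how served models exceed this single-GPU bound):

```python
def decode_tps_bound(model_bytes, bandwidth_bytes_per_s):
    """Rough single-request decode ceiling: one full weight read per token."""
    return bandwidth_bytes_per_s / model_bytes

decode_tps_bound(140e9, 2e12)   # ≈ 14 tokens/sec: 70B FP16 weights, one A100
```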

Decision Card
Your API shows 800ms TTFT but 50 tokens/sec decode speed. Is something broken?

Answer: no — this is completely normal. The 800ms TTFT reflects the prefill phase processing your entire prompt (compute-bound, matrix-matrix multiplication). The 50 TPS decode speed reflects token-by-token generation (memory-bound, matrix-vector multiplication). This asymmetry is a fundamental property of transformer inference, not a bug — and it's why streaming UIs exist: the first token waits, then the rest flow quickly.

Part 4
The KV Cache — Memory That Grows With Every Token

The Re-Reading Problem

Imagine you're writing an essay, and every time you write a new sentence, you need to re-read the entire essay from the beginning to remember what you've said. Sentence 1 takes 1 second. Sentence 2 takes 2 seconds (re-read sentence 1, then write). Sentence 100 takes 100 seconds. The time to write each new sentence grows linearly with the length of what you've already written.

This is exactly what would happen during the decode phase without a cache. For each new token, the model needs to "attend to" (look at) every previous token. The attention mechanism computes Key and Value vectors for each token — these are the compressed representations that allow the model to "remember" what each token means in context.

The KV cache stores these Key and Value vectors so the model doesn't have to recompute them for every new token. When generating token 100, instead of recomputing Keys and Values for tokens 1-99, the model just looks up the cached values and only computes the Key/Value for the new token.

KV Cache Growth Over a Conversation
Token 1: ~1 MB · Token 10: ~10 MB · Token 100: ~100 MB · Token 1K: ~1 GB · Token 10K: ~10 GB
The KV cache grows linearly with sequence length. For a 70B model, 10,000 tokens of KV cache consumes ~10 GB of VRAM — that's VRAM that can't be used for serving other requests.
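The per-token cost is easy to compute from the architecture. A hedged sketch — the formula is standard, but the example config is Llama-2-7B-style, and exact figures vary with layer count, KV-head count, and precision (models using grouped-query attention cache far fewer heads, so the chart above is order-of-magnitude):

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and Values are each cached per layer, per KV head (FP16 default),
    hence the leading factor of 2."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Llama-2-7B-style config: 32 layers, 32 KV heads, head dim 128, FP16.
per_tok = kv_bytes_per_token(32, 32, 128)   # 524,288 bytes ≈ 0.5 MB per token
cache_gb_10k = per_tok * 10_000 / 1e9       # ≈ 5.2 GB for a 10K-token context
```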

PagedAttention: The Filing Cabinet Fix

Without PagedAttention, the KV cache works like a bookshelf with fixed-size shelves. You allocate a contiguous block of VRAM for each request based on the maximum possible sequence length. A request that might use 4K tokens gets a 4K-token allocation, even if it only ends up using 200 tokens. The wasted space fragments memory and limits how many requests can run concurrently.

PagedAttention (introduced by vLLM) works like a filing cabinet with movable folders. Instead of allocating one big continuous block per request, it breaks the KV cache into small, fixed-size pages (typically 16 tokens each). Pages are allocated on demand as the sequence grows, and freed pages can be reused by other requests immediately.

Think of the difference between renting an entire floor of an office building (even if you only use 3 rooms) versus renting individual rooms as you need them. PagedAttention reduces VRAM waste from 60-80% down to less than 4%, which means you can serve 2-4x more concurrent requests with the same GPU.
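The accounting behind those numbers is simple ceiling division. A sketch (the 16-token page size is the typical vLLM default mentioned above):

```python
PAGE_TOKENS = 16   # typical vLLM block size

def pages_needed(seq_len, page_tokens=PAGE_TOKENS):
    """Pages allocated on demand: ceiling of tokens / page size."""
    return -(-seq_len // page_tokens)   # ceiling division via floor of negation

# The scenario from the text: a request reserved for 4K tokens uses only 200.
reserved = pages_needed(4096)   # 256 pages a contiguous allocator grabs up front
used = pages_needed(200)        # 13 pages a paged allocator actually touches
waste = 1 - used / reserved     # ≈ 0.95 of the contiguous reservation wasted
```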

PagedAttention: Before vs After
Without PagedAttention: each request (Req A, Req B, Req C) reserves one contiguous block, most of it empty — reserved but unused → 60-80% waste.
With PagedAttention: small pages allocated on demand, scattered but tightly packed → less than 4% waste and 2-4x more concurrent requests.
Without PagedAttention, each request reserves a contiguous memory block (mostly empty). With it, memory is split into small pages allocated on demand — dramatically reducing waste.
Practical Impact

Before PagedAttention, a single A100 serving a 70B model could handle ~8 concurrent requests. With PagedAttention, the same GPU handles ~20-30 concurrent requests. vLLM made PagedAttention the default, and it's now the standard for all serious LLM serving.

KV Cache Compression (SnapKV): Shrinking the Cache

Just like model weights can be quantized from FP16 to INT8 or INT4, the KV cache can be compressed — either by quantizing its entries or by pruning them. SnapKV and similar techniques take the pruning route: they selectively keep only the most "important" key-value pairs — the ones that receive the most attention from the model.

How it works: SnapKV observes which token positions the model's attention focuses on during inference. Tokens that rarely receive attention (typically in the middle of long contexts) have their KV cache entries compressed or dropped entirely. This can reduce KV cache size by 3-5x with minimal quality loss.

When to use: Long-context applications (10K+ tokens) where KV cache memory is the limiting factor for concurrent requests. Not needed for short conversations.

When NOT to use: Applications requiring perfect recall of all context (medical, legal, financial). Any compression risks dropping a fact the model might need.

Part 5
Serving Multiple Users — Batching and Beyond

The Empty Factory Problem

Imagine a factory with 1,000 workers. One customer walks in with a single widget to assemble. All 1,000 workers crowd around that one widget. 999 of them stand idle while one does the work. That's what happens when a GPU with thousands of compute cores processes a single request during the decode phase — the memory bandwidth bottleneck means most cores have nothing to do.

The fix? Batching — processing multiple requests at the same time. Instead of one customer's widget, you have 32 customers' widgets on the assembly line simultaneously. Now more workers have something to do, and the factory is actually efficient.

Continuous Batching: No Waiting in Line

Static batching is like a bus: it waits until all seats are full, drives the route, and nobody can get on or off until the trip is complete. If one request finishes early (a short response), its "seat" sits empty until the longest request in the batch completes.

Continuous batching (also called "inflight batching") is like a conveyor belt at a buffet. Requests enter and exit freely. When one request finishes, a new one immediately takes its place. There's no wasted time waiting for the entire batch to complete.

This seemingly simple change has an enormous impact: continuous batching improves GPU utilization by 10-20x compared to static batching in real workloads with mixed-length requests.

Static vs Continuous Batching
Static batching: Req A and Req C finish early but their slots sit idle until Req B (long) completes — wasted GPU time.
Continuous batching: finished slots are immediately filled with new requests (D, E, F) — no wasted GPU time.
Static batching wastes GPU cycles waiting for the longest request. Continuous batching fills empty slots immediately — 10-20x better utilization.

The Batch Size Trap

"If batching helps, why not make the batch huge?" Because of a trap: throughput scales sub-linearly with batch size. Doubling the batch from 8 to 16 might increase throughput by 70%. Doubling from 32 to 64 might only increase it by 20%. And at some point, you run out of VRAM for the KV caches of all those concurrent requests.

The metric that captures this is MBU (Model Bandwidth Utilization) — what percentage of the GPU's memory bandwidth is actually being used for useful computation. A well-optimized serving system targets 80-90% MBU. Below 50% means you're leaving money on the table. Above 95% means you're probably trading latency for throughput.

MBU (Model Bandwidth Utilization) Zones
  • <50%: underutilized — wasting GPU money
  • 80-90%: target zone
  • >95%: overloaded
Semantic Caching: The 5-10x Shortcut

Research shows 31% of LLM API calls are semantically redundant — different users ask effectively the same question with different wording. Semantic caching stores responses indexed by meaning similarity (not exact string match). When a new query is semantically close to a cached one, return the cached response instantly. This provides a 5-10x speedup for those cached queries with zero GPU cost. Tools like GPTCache and Redis with vector similarity search implement this pattern.
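A minimal sketch of the pattern in pure Python — real systems like GPTCache use a proper embedding model and a vector index; the tiny vectors and the 0.95 threshold here are illustrative placeholders:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    """Return a cached response when a query's embedding is close enough."""
    def __init__(self, threshold=0.95):
        self.entries = []            # list of (embedding, response) pairs
        self.threshold = threshold

    def get(self, embedding):
        for cached_emb, response in self.entries:
            if cosine(cached_emb, embedding) >= self.threshold:
                return response      # cache hit: zero GPU cost
        return None                  # cache miss: call the model, then put()

    def put(self, embedding, response):
        self.entries.append((embedding, response))

cache = SemanticCache()
cache.put([0.9, 0.1, 0.0], "Restart the connection pool.")
cache.get([0.88, 0.12, 0.01])   # near-duplicate wording → cached answer
cache.get([0.0, 0.1, 0.9])      # different meaning → None, go to the model
```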

Speculative Decoding: Draft-then-Verify for 2-3x Speedup

Think of a junior architect who drafts a building plan, and then a senior architect reviews it. The junior is fast but makes mistakes. The senior is slow but catches everything. If the junior gets most of it right, the senior only needs to fix a few things — much faster than the senior doing the whole thing alone.

Speculative decoding works the same way. A small, fast "draft" model generates several tokens quickly. Then the large, slow "target" model verifies all those tokens in parallel (this is a single forward pass, so it's fast). Tokens the target model agrees with are accepted. Tokens it rejects get regenerated by the target model.

The speedup: If the draft model has an 80% acceptance rate (common for well-matched pairs), you generate ~5 tokens for the compute cost of ~2 tokens. That's a 2-3x speedup with mathematically identical output quality.

Use when: Latency is critical and you can deploy two models (one small, one large). Common pairing: Llama-7B draft + Llama-70B target.

Don't use when: VRAM is tight (you need memory for both models) or your workload is already batch-throughput-optimized rather than latency-optimized.
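The expected gain can be computed from the acceptance rate. A sketch of the standard expected-value analysis (the draft length and acceptance rate below are illustrative):

```python
def expected_tokens_per_target_pass(accept_rate, draft_len):
    """Expected tokens emitted per draft-then-verify cycle: a geometric
    series over runs of consecutively accepted draft tokens, plus the
    one token the target model always contributes itself."""
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

expected_tokens_per_target_pass(0.8, 4)   # ≈ 3.36 tokens per large-model pass
```

At an 80% per-token acceptance rate with a 4-token draft, each expensive target-model forward pass yields over 3 tokens instead of 1 — the source of the 2-3x speedup.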

Kernel Fusion: Reducing GPU Overhead

Every GPU operation (called a "kernel") has startup overhead: launch the kernel, allocate registers, synchronize. If a model layer requires 10 separate operations (multiply, add, normalize, activate, etc.), that's 10 kernel launches with 10 overheads.

Kernel fusion combines multiple operations into a single GPU kernel. Instead of 10 launches, you get 1. The data stays in the GPU's fast on-chip cache (SRAM) between operations instead of being written back to slow VRAM after each step.

FlashAttention is the most famous example: it fuses the entire attention computation (Q*K, softmax, *V) into a single kernel that never writes intermediate results to VRAM. This provides 2-4x speedup for the attention operation specifically.

Practical impact: You don't implement kernel fusion yourself. Libraries like vLLM, TensorRT-LLM, and FlashAttention handle it. But understanding why these libraries are faster than naive PyTorch helps you make informed deployment choices.

Part 6
When Context Isn't Long Enough

Stretching the Window

A model trained with a 4K context window can't magically handle 32K tokens. The model's position encoding — how it understands "this token is in position 500" vs "this token is in position 5000" — breaks when you exceed the training length. The model has never seen positions beyond 4K during training, so those positions produce garbage.

But what if you need longer context? Retraining the entire model on longer sequences is expensive (millions of dollars). Researchers found shortcuts.

RoPE: How Models Know Position

Think of a clock. The hour hand rotates at one speed, the minute hand at another. By combining these two rotations, you can encode any time of day. RoPE (Rotary Position Embedding) works similarly — it encodes each token's position by rotating its vector representation at specific frequencies. Position 1 gets a small rotation. Position 1000 gets a large rotation. The model learns to interpret these rotations during training.

The problem: the model was trained with rotations up to, say, position 4096. Position 5000 requires a rotation the model has never seen. It's like a clock trying to display 25 o'clock — the numbers don't wrap around properly.

NTK-Aware Scaling: Changing the Base Frequency

NTK-Aware Scaling modifies the base frequency of RoPE's rotational encoding, effectively "stretching" the position space. Instead of positions 0-4096, the same rotation angles now span 0-32768. It's like switching from a 12-hour clock to a 24-hour clock — you cover more positions without changing the model's ability to distinguish nearby positions.

The beauty is that NTK-Aware Scaling requires zero additional training. You modify a single hyperparameter (the RoPE base frequency) at inference time. The tradeoff is that the model's ability to distinguish very close positions decreases slightly — position 100 and position 101 become slightly harder to tell apart — but for most applications, this is negligible.
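In practice this is a one-line change to the RoPE base. A sketch using the commonly cited NTK-aware formula (exponent conventions vary slightly between implementations, so treat this as illustrative):

```python
def ntk_scaled_base(base, scale, head_dim):
    """Raise RoPE's base frequency so the rotation spectrum spans
    `scale`x more positions (commonly used NTK-aware scaling formula)."""
    return base * scale ** (head_dim / (head_dim - 2))

ntk_scaled_base(10000, 8, 128)   # ≈ 82,685 -- one hyperparameter, 4K → 32K context
```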

YaRN: The Full Solution

YaRN (Yet another RoPE extensioN) takes NTK scaling further. Instead of uniformly stretching all frequencies, YaRN applies different scaling factors to different frequency bands. High-frequency components (which encode local relationships like "these two words are adjacent") get less stretching. Low-frequency components (which encode global relationships like "this paragraph is about the same topic") get more stretching.

This preserves local precision while extending global range. YaRN can extend context from 4K to 128K with only a small amount of fine-tuning data (a few hundred examples), compared to the full retraining that would otherwise be needed.

Context Extension Methods Compared (original training range: 0-4,096 positions)
  • RoPE — the baseline encoding (0-4,096): rotates vectors at fixed frequencies; breaks beyond the training length, like a clock that can't show 25 o'clock.
  • NTK Scaling — uniform stretching (0-32,768): stretches ALL frequencies equally; zero extra training, like switching from a 12-hour to a 24-hour clock; slight loss of local precision.
  • YaRN — smart stretching (0-128,000): different stretch per frequency band — high-freq (local) stretched less, low-freq (global) stretched more; preserves local precision; small fine-tune needed.
RoPE is the baseline position encoding. NTK uniformly stretches it (free, ~8x extension). YaRN smartly stretches different bands (small fine-tune, ~32x extension with better quality).
Attention Sinks (StreamingLLM): Infinite-Length Processing

Researchers discovered something curious: the first few tokens in any sequence receive disproportionately high attention, regardless of their content. These are called attention sinks. Even if the first token is meaningless punctuation, the model's attention mechanism uses it as an "anchor."

StreamingLLM exploits this by keeping the first 4 tokens (attention sinks) plus a sliding window of the most recent tokens. Everything in between is discarded. This allows processing of theoretically infinite-length streams with fixed memory usage.

Use when: You need to process continuous streams (real-time transcription, log monitoring) where you don't need to recall the entire history.

Don't use when: You need the model to reference information from the beginning of a long document. StreamingLLM deliberately discards the middle and only keeps the anchors plus recent context.
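The cache policy is easy to state in code. A sketch — the 4-sink default follows the text; the window size is an illustrative choice:

```python
def streaming_kept_positions(seq_len, n_sinks=4, window=1024):
    """Token positions StreamingLLM keeps in the KV cache: the first
    n_sinks attention sinks plus a sliding window of recent tokens."""
    if seq_len <= n_sinks + window:
        return list(range(seq_len))          # everything still fits
    return list(range(n_sinks)) + list(range(seq_len - window, seq_len))

kept = streaming_kept_positions(1_000_000)
# len(kept) == 1028: fixed memory no matter how long the stream grows
```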

MemGPT: Virtual Memory for LLMs

Your computer has 16 GB of RAM but can run programs that use 100 GB of memory. How? Virtual memory — the operating system pages data in and out of RAM as needed, using the hard drive as overflow. Only the actively-needed data sits in RAM at any time.

MemGPT applies this exact concept to LLMs. The context window is "RAM" — limited and expensive. External storage (databases, files) is the "hard drive" — huge and cheap. An LLM controller manages what's in the context window, paging information in when the model needs it and paging it out when it's not immediately relevant.

For applications that need both long-term memory and long-context understanding, MemGPT provides a compelling middle ground between RAG (which retrieves from external storage) and simply using a larger context window (which is expensive and suffers from "lost in the middle" degradation).
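As a rough sketch of the paging idea (all names here are illustrative, not the real MemGPT API):

```python
class PagedContext:
    """Toy MemGPT-style paging: a small "RAM" context window plus a large
    external archive, with the controller paging notes in and out by key."""

    def __init__(self, max_items: int = 3):
        self.max_items = max_items          # "RAM": what fits in the window
        self.context: dict = {}             # currently in the prompt
        self.archive: dict = {}             # "disk": external storage

    def remember(self, key: str, note: str) -> None:
        self.archive[key] = note

    def page_in(self, key: str) -> None:
        if len(self.context) >= self.max_items:
            evicted, note = next(iter(self.context.items()))
            self.archive[evicted] = note    # page the oldest entry back out
            del self.context[evicted]
        self.context[key] = self.archive[key]

    def prompt_window(self) -> str:
        return "\n".join(self.context.values())

mem = PagedContext(max_items=2)
mem.remember("user_name", "User's name is Sara")
mem.remember("project", "Project: RAG pipeline for legal docs")
mem.remember("deadline", "Deadline: March 15")
mem.page_in("user_name")
mem.page_in("project")
mem.page_in("deadline")        # evicts the oldest entry back to the archive
print(mem.prompt_window())     # project and deadline remain in "RAM"
```

The real system makes the LLM itself decide when to page, via function calls; the data structure underneath is this simple.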

Decision Card
Your app needs 200K context but the model supports 32K. What approach do you take?
Answer: combine RAG with YaRN. The 200K of content rarely needs to be in context simultaneously: RAG retrieves the most relevant portions, and YaRN extends the window enough to hold the retrieved chunks with good quality. StreamingLLM would discard the middle, making it useless for recall, and stretching NTK to ~6x degrades quality significantly. Retrieve what matters, extend moderately.

Part 7
Generation Control

The Creativity Dial

After the model processes your prompt through all its layers, the final step isn't just "pick the best word." The model outputs a probability distribution over its entire vocabulary — a score for every possible next token. "The" might have a 15% probability, "A" might have 8%, "In" might have 5%, and so on for all 128,000 tokens. How you sample from this distribution determines the character of the output.

Temperature: How Much Randomness

Think of a radio dial. On the left (temperature = 0), you get crystal-clear reception of a single station — the model always picks the highest-probability token. The output is deterministic and repeatable. On the right (temperature = 1.0+), you get static mixed with signal — lower-probability tokens get a real chance of being selected, creating more diverse, creative, and sometimes surprising output.

Temperature = 0: "The cat sat on the mat." Every time, identical output. Perfect for factual tasks, code generation, data extraction.

Temperature = 0.7: "The cat settled on the mat." Slight variation, still coherent. Good for conversational AI, writing assistance.

Temperature = 1.2: "The feline lounged across the weathered doormat." More creative, sometimes surprising. Good for brainstorming, creative writing.

Temperature = 2.0: "The quantum velvet whispered pancake orbits." Chaotic. Rarely useful except for generating random inspiration seeds.

Temperature Spectrum
0
Deterministic
"The cat sat on the mat."
Code, data extraction, facts
0.7
Balanced
"The cat settled on the mat."
Chat, writing assistance
1.2
Creative
"The feline lounged across the weathered doormat."
Brainstorming, stories
2.0
Chaotic
"The quantum velvet whispered pancake orbits."
Random seeds only
Temperature controls randomness in token selection. Lower = more predictable and factual. Higher = more creative but less reliable. Most production systems use 0–0.7.
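The dial maps to one line of math: divide the logits by the temperature before the softmax. A minimal sketch (the three-token vocabulary is a toy assumption):

```python
import math
import random

def sample(logits: dict, temperature: float, seed: int = 0) -> str:
    """Temperature sampling: scale logits by 1/T, softmax, then draw.
    T = 0 degenerates to greedy argmax; higher T flattens the distribution."""
    if temperature == 0:
        return max(logits, key=logits.get)          # greedy decoding
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())                        # stabilize the exponentials
    exp = {tok: math.exp(l - m) for tok, l in scaled.items()}
    z = sum(exp.values())
    probs = {tok: e / z for tok, e in exp.items()}
    rng = random.Random(seed)
    return rng.choices(list(probs), weights=list(probs.values()))[0]

logits = {"The": 2.0, "A": 1.4, "In": 0.9}
print(sample(logits, temperature=0))                # always "The"
```

At T = 0.7 the top token still dominates but the others get a real share of the probability mass; crank T past 1 and the distribution flattens toward the chaos at the right end of the dial.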

Logprobs: The Confidence Score

Logprobs (log-probabilities) reveal the model's confidence in each token it generates. A logprob of 0 means 100% confident. A logprob of -1 means ~37% confident. A logprob of -5 means ~0.7% confident.

This is incredibly useful for detecting hallucination. When the model confidently states a fact, the logprob for that token should be high (close to 0). When the logprob drops below -2.0, the model is essentially guessing. In production systems, you can flag low-confidence claims for human review.

Logprobs Confidence Scale
Logprobs use the natural logarithm: e^logprob = probability (e.g., e^-1 ≈ 0.37)
0 → 100% · -1 → 37% · -2 → 14% · -5 → 0.7% (below -2: flag for review)
When the model generates a factual claim with logprob below -2.0, it's essentially guessing. Flag these for human review.
Production Tip

If logprob < -2.0 on a factual claim, flag it for human review. This simple threshold catches a significant percentage of hallucinations. APIs like OpenAI and Anthropic can return logprobs per token — parse the response, identify factual claims, and check their logprobs.
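A minimal sketch of that threshold check, assuming the per-token (text, logprob) pairs have already been parsed out of the API response:

```python
import math

def flag_low_confidence(tokens, threshold: float = -2.0):
    """Return (token, probability) pairs whose logprob falls below the
    threshold (-2.0 corresponds to ~14% probability)."""
    flagged = []
    for text, logprob in tokens:
        prob = math.exp(logprob)     # natural log back to probability
        if logprob < threshold:
            flagged.append((text, round(prob, 3)))
    return flagged

# Toy example: the model is confident about everything except the last token.
answer = [("The", -0.1), ("capital", -0.3), ("is", -0.2), ("Quito", -3.1)]
print(flag_low_confidence(answer))   # -> [('Quito', 0.045)]
```

In practice you would run the flagged spans through an entity or claim detector first, since low logprobs on function words are harmless.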

Stop Sequences: Knowing When to Stop

Stop sequences tell the model when to stop generating. Without them, the model might keep generating until it hits the maximum token limit — producing rambling, repetitive, or off-topic content. Common stop sequences include newlines (for single-line answers), closing brackets (for JSON), or specific delimiter strings.

For structured output (JSON, XML, SQL), stop sequences are essential. You tell the model to stop after the closing } of a JSON object, preventing it from generating garbage after the valid output.
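Serving frameworks apply this check on every decode step so generation halts immediately; client-side, the same truncation looks like this (a sketch matching the common API behavior of excluding the stop string from the output):

```python
def truncate_at_stop(text: str, stop_sequences: list) -> str:
    """Cut generated text at the earliest stop sequence; the stop string
    itself is excluded from the returned output."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# Without the stop sequence, the model keeps role-playing the next turn.
raw = "Answer: 42\n\nUser: what about"
print(truncate_at_stop(raw, ["\n\nUser:"]))   # -> 'Answer: 42'
```

For JSON output, pair a stop sequence with a validator: stop generation at the delimiter, then parse, rather than trusting the model to end cleanly on its own.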

Part 8
Self-Host or API? The Decision

The Build vs Buy Question

Every team deploying LLMs faces this decision: use a hosted API (OpenAI, Anthropic, Google) or run your own model on your own GPUs. The answer depends on your volume, your latency requirements, your data privacy needs, and — critically — your team's operational capacity.

The Serving Frameworks

If you self-host, you need a serving framework. Here are the four that matter:

vLLM
The Production Workhorse

PagedAttention, continuous batching, speculative decoding, tensor parallelism. Powers Amazon Rufus and LinkedIn's AI features. Best overall choice for most production deployments. Open-source, active community.

TGI (Text Generation Inference)
Hugging Face's Option

Built-in OpenTelemetry for observability, first-class Hugging Face model hub integration, and production-ready defaults. Slightly lower raw throughput than vLLM but easier to set up with the Hugging Face ecosystem.

TensorRT-LLM
Maximum NVIDIA Performance

NVIDIA's own framework. Squeezes every last drop of performance from NVIDIA GPUs with custom kernels and hardware-specific optimizations. 20-40% faster than vLLM on raw throughput, but harder to set up and NVIDIA-only.

Ollama
Local & Edge Deployment

One-command model download and serving. Perfect for local development, edge deployment, and situations where simplicity matters more than raw performance. Runs on consumer hardware (MacBooks, gaming PCs).

The Cost Comparison

The math changes dramatically based on your scale:

Metric | API (GPT-4o) | Self-Host (vLLM + Llama-70B)
Setup time | Minutes | Days to weeks
Cost at 10K requests/day | ~$300/month | ~$3,000/month (GPU lease)
Cost at 1M requests/day | ~$30,000/month | ~$8,000/month (GPU cluster)
Latency (P50) | 500-1500ms TTFT | 100-300ms TTFT
Data privacy | Data leaves your network | Data stays in your infrastructure
Ops burden | Zero (managed) | High (GPU monitoring, scaling, updates)
Model flexibility | Vendor's models only | Any open-source model, fine-tuned
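A back-of-the-envelope break-even check using the table's numbers (the $0.001/request API rate and the flat $3,000/month GPU lease are illustrative simplifications, not vendor pricing):

```python
def monthly_cost(requests_per_day: int,
                 api_cost_per_request: float = 0.001,
                 selfhost_fixed: float = 3000.0):
    """API cost scales linearly with volume; self-hosting is modeled as a
    fixed GPU lease. Returns (api_monthly, selfhost_monthly)."""
    api = requests_per_day * 30 * api_cost_per_request
    return api, selfhost_fixed

for rpd in (10_000, 100_000, 1_000_000):
    api, selfhost = monthly_cost(rpd)
    cheaper = "API" if api < selfhost else "self-host"
    print(f"{rpd:>9,} req/day: API ${api:>9,.0f} vs self-host ${selfhost:,.0f} -> {cheaper}")
```

Under these assumptions the crossover sits around 100K requests/day; in reality self-host costs also step up with scale (more GPUs, more engineers), which pushes the true break-even higher.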
The Decision Framework: When to Self-Host

Use an API when:

  • Your volume is under 500K requests/month (API is cheaper when accounting for engineering time)
  • You don't have a dedicated ML infrastructure team
  • You need the best frontier model quality (GPT-4o, Claude 3.7 are still ahead of open-source)
  • Speed to market matters more than cost optimization

Self-host when:

  • Your API costs exceed $10K/month and are growing
  • Data privacy/compliance requires data to stay in your infrastructure (HIPAA, GDPR, etc.)
  • You need latency under 200ms TTFT consistently
  • You have (or can hire) at least one engineer dedicated to ML infrastructure
  • You need fine-tuned models for your specific domain

The hybrid approach (most common in production): Use an API for frontier quality tasks (complex reasoning, creative writing) and self-host for high-volume, simpler tasks (classification, extraction, summarization). Route requests to the right backend based on task complexity.
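A toy router for the hybrid setup (the task labels and backend names are illustrative):

```python
def route(task: str) -> str:
    """Route high-volume, simple task types to the self-hosted model and
    everything else to the frontier API."""
    SELF_HOSTED = {"classification", "extraction", "summarization"}
    return "vllm-llama-70b" if task in SELF_HOSTED else "frontier-api"

print(route("extraction"))             # -> vllm-llama-70b
print(route("multi-step-reasoning"))   # -> frontier-api
```

Production routers usually classify the request itself (often with a small, cheap model) rather than relying on a pre-labeled task type, but the dispatch logic is the same.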

وَاللَّهُ أَعْلَمُ

And Allah knows best

وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ

May Allah's peace and blessings be upon our master Muhammad and his family


Practice Mode

Test your understanding with real-world LLM inference scenarios.

Scenario 1 of 4

Your team deployed Llama-70B on a single GPU with 48GB of VRAM. The model needs 140GB in FP16. The deployment fails with an out-of-memory error before a single request is served.

What's wrong and what's the quickest fix?
A
Use tensor parallelism across 4 GPUs — split the model across 4x 48GB GPUs to get 192GB total VRAM.
B
Quantize to INT4 — reduces from 140GB to ~35GB, fitting on the single 48GB GPU with room for KV cache.
C
Increase system RAM to 256GB and offload model weights to CPU memory.

Cheat Sheet

The essential reference for LLM inference.

Tokenization

BPE: Byte Pair Encoding — merges frequent character pairs into tokens
SentencePiece: Library for training subword vocabularies (BPE and unigram)
Fertility: Tokens per word (English ~1.0, Arabic ~2.19)
Fix: Vocabulary extension for non-English languages

GPU Math

VRAM: Parameters(B) x bytes/param (FP16 = 2, INT4 = 0.5)
TFLOPS: How fast the GPU does math
Bandwidth: How fast data moves (A100: 2 TB/s)
Bottleneck: Almost always memory bandwidth, not compute

Two Phases

Prefill: Matrix-matrix, compute-bound, processes all tokens at once
Decode: Matrix-vector, memory-bound, one token at a time
TTFT: Time to first token (prefill speed)
TPS: Tokens per second (decode speed)

KV Cache

What: Stores Key/Value vectors from previous tokens
Growth: Linear with sequence length (~1MB per token for large models)
PagedAttention: Allocates pages on-demand, 60-80% less waste
SnapKV: Compresses cache by keeping only high-attention entries

Serving

Continuous batching: Requests enter/exit freely (10-20x over static)
MBU: Target 80-90% Model Bandwidth Utilization
Speculative decoding: Small draft + large verify = 2-3x speedup
Semantic caching: ~31% of queries are redundant → 5-10x speedup on cache hits

Context Extension

RoPE: Rotary Position Embedding — encodes position via rotation
NTK scaling: Stretch positions by changing base frequency (no retraining)
YaRN: Non-uniform scaling, best quality for 4K→128K
StreamingLLM: Keep first 4 + recent tokens for infinite streams

Generation Control

Temperature 0: Deterministic (facts, code, extraction)
Temperature 0.7: Balanced (conversation, writing)
Logprobs < -2.0: Flag for human review (hallucination risk)
Stop sequences: Essential for structured output (JSON, SQL)

Self-Host vs API

API: <500K req/mo, no ML team, frontier quality needed
Self-host: >$10K/mo API cost, data privacy, <200ms TTFT needed
vLLM: Production workhorse (most teams start here)
Hybrid: API for complex, self-host for high-volume simple tasks

Where to Go Deep
  • How LLMs Work — understand the transformer architecture, attention mechanism, and training process that underpin everything in this post (prerequisite)
  • RAG & Agents — retrieval-augmented generation and agent architectures (context for when you need external knowledge)
  • Fine-Tuning — the next post in this series: when you need to adapt the model itself rather than optimizing inference