بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ
In the name of Allah, the Most Gracious, the Most Merciful
You send a 50-word prompt. 0.8 seconds later, the first token appears. In that 0.8 seconds, your text was shredded into tokens, multiplied through 96 layers of 175 billion parameters, and the GPU moved 2 terabytes of data through memory.
Then the model generates 100 tokens per second — smooth, streaming, almost conversational. But the engineering behind that smoothness involves two completely different computational phases, a cache that grows with every single token, and a memory bandwidth bottleneck that determines whether your $30,000 GPU sits idle or runs at full capacity.
This post is the complete map of what happens between "Enter" and the response — from text to tokens to tensors to output.
- Part 1: Tokenization — how text becomes numbers via BPE and SentencePiece, and why Arabic costs 2-5x more than English
- Part 2: GPU memory — VRAM, FP16, TFLOPS, memory bandwidth, and why bandwidth is almost always the bottleneck
- Part 3: Prefill vs decode — the two-phase pipeline that explains why the first token is slow and streaming is fast
- Part 4: KV cache — the memory structure that grows with every token, and PagedAttention's solution
- Parts 5-8: Batching, context extension, generation control, and the self-host vs API decision framework
Read this post if:
- You've used LLM APIs and wondered what actually happens between your request and the response
- You've seen terms like "KV cache," "VRAM," "FP16," or "TTFT" and want to understand what they actually mean
- You're evaluating whether to self-host a model or use an API, and need to understand the real tradeoffs
- You read How LLMs Work and want to understand the engineering that makes inference fast at scale
The Phrasebook Problem
Imagine you're traveling to Japan with a phrasebook. You could look up every single letter individually — "h," "e," "l," "l," "o" — and spell everything character by character. That would work, but it would take forever. A better phrasebook groups common sequences: "hello" is one entry, "thank you" is another. Common phrases get a single lookup; rare words get spelled out letter by letter.
That's exactly how a tokenizer works. It's the very first thing that processes your prompt — before any neural network, any attention mechanism, any GPU computation. The tokenizer converts your text into a sequence of numbers (called tokens) that the model can process. And the quality of this conversion has a direct impact on cost, speed, and even the model's ability to understand your input.
A token is roughly three-quarters of a word in English, or about 4 characters. "Hello, how are you?" becomes about 5 tokens. But this ratio changes dramatically across languages, and that has real financial consequences.
The output of tokenization is nothing more than a list of integer IDs — something like:
[464, 6831, 3717, 7924, 374, 19452, 287]
BPE: How the Phrasebook Is Built
Byte Pair Encoding (BPE) is the algorithm that builds this phrasebook. Think of it like a language teacher who watches millions of sentences and notices which character pairs appear together most frequently. The teacher then creates a shorthand for the most common pairs, then the most common pairs of those pairs, and so on — building up from individual characters to common words and even multi-word fragments.
The process starts with every character as its own token. Then it repeatedly merges the most frequent adjacent pair into a single new token. After thousands of merge operations, you get a vocabulary of 32,000 to 128,000 tokens that efficiently encodes common text.
SentencePiece is the tool that actually builds these phrasebooks. Developed by Google, it's the standard tokenizer builder used by Llama, Mistral, and most modern open-source models. SentencePiece takes a corpus of text and runs BPE (or a variant called Unigram) to produce the final vocabulary file that the model uses forever after.
Here's how BPE builds a vocabulary from scratch. Imagine training on just one phrase: "low lower lowest"
Step 0 (character level): l o w | l o w e r | l o w e s t
Step 1: Most common pair is "l" + "o" → merge into "lo": lo w | lo w e r | lo w e s t
Step 2: Most common pair is "lo" + "w" → merge into "low": low | low e r | low e s t
Step 3: Suppose the trainer next merges "e" + "s" → "es" (the exact order depends on pair counts and tie-breaking): low | low e r | low es t
Step 4: Now "es" + "t" → "est": low | low e r | low est
After just 4 merges, "lowest" went from 6 tokens to 2 tokens. In real models, this process runs for 32,000 to 128,000 merge steps across billions of words of training data.
Key insight: The merge order is learned from data, not hand-coded. Common English words like "the" and "and" get merged early and become single tokens. Rare words or unusual character sequences stay split into smaller pieces.
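The merge loop above fits in a few lines of Python. This is a toy trainer, not SentencePiece's optimized implementation, and its tie-breaking (and pair counting across a real weighted corpus) means later merges can differ from the hand-worked steps above:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    corpus = [list(w) for w in words]  # start with one token per character
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            for a, b in zip(tokens, tokens[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Apply the new merge everywhere in the corpus.
        for tokens in corpus:
            i = 0
            while i < len(tokens) - 1:
                if tokens[i] == a and tokens[i + 1] == b:
                    tokens[i:i + 2] = [a + b]
                else:
                    i += 1
    return corpus, merges

corpus, merges = bpe_train(["low", "lower", "lowest"], 4)
# The first two merges are "lo" then "low", exactly as in the walkthrough.
```

Real tokenizers run this loop tens of thousands of times over billions of words, and store the merge list so encoding can replay it deterministically.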
The Arabic Tokenizer Tax
Here's where tokenization has real business impact. The fertility score measures how many tokens a language needs per word. English has a fertility of about 1.0 — one word, one token (roughly). Arabic has a fertility of 2.19. That means every Arabic word uses more than twice as many tokens as an English word.
Why? Because BPE vocabularies are trained primarily on English text. Common English words like "the," "is," and "database" each get their own token. But Arabic words — even extremely common ones — often get split into 3, 4, or even 5 subword pieces because the tokenizer never learned to merge those character sequences.
The financial impact is direct: if every API call costs per token, Arabic users are paying 2-5x more for the same conversation. A chatbot that costs $500/month in English costs $1,000-$2,500/month in Arabic — for the same number of users, the same conversations, the same functionality.
At GPT-4o pricing ($2.50 per million input tokens): a 375-word English prompt uses ~375 tokens ($0.0009) at a fertility of ~1.0. The same content in Arabic uses ~820 tokens ($0.002). Multiply by 100,000 daily queries and that's the difference between roughly $90/day and $200/day — an extra ~$3,300/month.
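The arithmetic is worth reproducing, since it's the core of the business case (prices and token counts as assumed above, purely illustrative):

```python
# Back-of-envelope token cost comparison.
# Assumes GPT-4o-style input pricing of $2.50 per million tokens and the
# fertility-based token counts discussed above (~375 English vs ~820 Arabic).
PRICE_PER_TOKEN = 2.50 / 1_000_000

def daily_cost_usd(tokens_per_query, queries_per_day):
    return tokens_per_query * queries_per_day * PRICE_PER_TOKEN

english = daily_cost_usd(375, 100_000)          # ~$93.75/day
arabic = daily_cost_usd(820, 100_000)           # ~$205.00/day
extra_per_month = (arabic - english) * 30       # ~$3,337.50/month
```

The per-query cost looks negligible; the tokenizer tax only becomes visible at volume.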
If your application primarily serves a non-English language, you can retrain the tokenizer's vocabulary to include common words from that language. This is called vocabulary extension.
The process works like this:
- Train a new SentencePiece model on a large corpus of your target language
- Merge the new vocabulary with the original model's vocabulary, adding new tokens for common words in the target language
- Resize the model's embedding layer to accommodate the new vocabulary size
- Fine-tune the model so it learns the new token embeddings
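The vocabulary-merge step (step 2) can be sketched in plain Python. This is a simplified illustration: a real pipeline trains the new tokens with SentencePiece and then, in Hugging Face transformers, calls `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(...)`:

```python
def extend_vocabulary(base_vocab, new_tokens):
    """Merge language-specific tokens into a base vocabulary.

    New tokens are appended at the end so every existing token keeps its ID —
    essential, because the model's embedding rows are indexed by token ID.
    """
    merged = list(base_vocab)
    seen = set(base_vocab)
    for tok in new_tokens:
        if tok not in seen:      # skip tokens the base vocab already covers
            merged.append(tok)
            seen.add(tok)
    return merged

base = ["<s>", "the", "is", "data"]            # illustrative base vocabulary
arabic = ["قاعدة", "بيانات", "the"]            # "the" already exists, not re-added
vocab = extend_vocabulary(base, arabic)
# The embedding layer must then be resized to len(vocab) rows and fine-tuned
# so the new rows learn meaningful vectors.
```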
For Arabic, this can reduce the fertility score from 2.19 to close to 1.2 — nearly halving the cost per query. The Jais family of models (from the UAE) did exactly this: they extended the Llama tokenizer with Arabic-specific merges and achieved near-English efficiency.
The tradeoff: Vocabulary extension requires fine-tuning, which costs compute time and money upfront. It's worth it when your Arabic API costs justify the one-time investment — typically when monthly API costs exceed $2,000.
The root cause of the cost gap is tokenization: Arabic's fertility score of 2.19 means the same content uses 2-3x more tokens than English. A cheaper model still pays the same per-token tax, and shorter prompts sacrifice quality. Vocabulary extension attacks the root cause — retrain the tokenizer to recognize common Arabic words as single tokens, pulling the fertility score from 2.19 toward 1.2 — which roughly halves the cost per query without sacrificing model quality or context.
The Desk Size Problem
Imagine your GPU is a desk. The model's parameters — all 70 billion of them — are textbooks that need to be open on that desk. The bigger the model, the more textbooks. The desk has a fixed size, and if the books don't fit, you either need a bigger desk or you need to squeeze the books.
That desk is VRAM — Video RAM, the memory physically located on the GPU chip. Unlike regular system RAM (which sits on your motherboard and communicates with the CPU over a relatively slow bus), VRAM sits right next to the GPU's compute cores with a massively wide memory bus. An NVIDIA A100 GPU has 80GB of VRAM. An H100 has 80GB. A consumer RTX 4090 has 24GB.
FP16: Squeezing the Books
Every parameter in a model is a number — a weight that was learned during training. In full precision (FP32), each number takes 4 bytes of storage. A 70-billion parameter model in FP32 needs 70B x 4 bytes = 280GB. That doesn't fit on any single GPU.
The solution: FP16 (half precision). Instead of using 32 bits per number, you use 16 bits — cutting storage exactly in half. A 70B model in FP16 needs 70B x 2 bytes = 140GB. Still doesn't fit on one 80GB GPU, but it fits on two.
The remarkable thing is that half precision barely affects output quality. The difference between 3.14159265 and 3.1416 matters in scientific computing but is irrelevant for language model weights. Most modern inference runs in FP16 or BF16 (brain float 16, a variant that handles extreme values better) by default.
TFLOPS vs Memory Bandwidth: The Real Bottleneck
GPUs have two speed metrics, and understanding which one matters is the key to understanding inference performance.
TFLOPS (Tera Floating-point Operations Per Second) measures how fast the GPU can do math. The A100 does 312 TFLOPS in FP16. That's 312 trillion multiply-add operations every second. It sounds like the GPU should be blindingly fast.
Memory bandwidth measures how fast the GPU can move data from VRAM to the compute cores. The A100 has 2 TB/s of memory bandwidth. That's 2 terabytes per second — fast, but not fast enough.
Think of it as a factory. TFLOPS is how fast the workers can assemble products. Memory bandwidth is how fast the conveyor belt delivers raw materials. Even if you have 1,000 workers, if the conveyor belt can only deliver enough materials for 100 workers at a time, 900 workers sit idle. That's exactly what happens during most of LLM inference: the GPU compute cores are waiting for data to arrive from memory.
The A100 has 312 TFLOPS but only 2 TB/s bandwidth. To keep those compute cores fully busy, you'd need to do 156 floating-point operations for every byte loaded from memory. During token generation, the model does about 1-2 operations per byte. The GPU is running at less than 2% of its theoretical compute capability. The bottleneck is memory bandwidth, not compute.
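This comparison is a one-line roofline check: divide peak compute by peak bandwidth to get the hardware's balance point, then compare it to the workload's arithmetic intensity (A100 figures as above):

```python
def bottleneck(ops_per_byte, peak_tflops=312.0, bandwidth_tbs=2.0):
    """Roofline check: is a workload compute-bound or memory-bound?

    The balance point is how many operations the GPU can do per byte it can
    load; for an A100 that's 312e12 / 2e12 = 156 ops/byte.
    """
    balance = (peak_tflops * 1e12) / (bandwidth_tbs * 1e12)
    return "compute-bound" if ops_per_byte >= balance else "memory-bound"

# Decode does ~1-2 ops per byte -> memory-bound, compute cores mostly idle.
assert bottleneck(2) == "memory-bound"
# Prefill over a long prompt has high arithmetic intensity -> compute-bound.
assert bottleneck(300) == "compute-bound"
```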
Quick formula: VRAM (GB) = Parameters (B) x Bytes per parameter
| Precision | Bytes/param | 7B Model | 13B Model | 70B Model |
|---|---|---|---|---|
| FP32 | 4 | 28 GB | 52 GB | 280 GB |
| FP16/BF16 | 2 | 14 GB | 26 GB | 140 GB |
| INT8 | 1 | 7 GB | 13 GB | 70 GB |
| INT4 | 0.5 | 3.5 GB | 6.5 GB | 35 GB |
Add 10-20% overhead for KV cache and activations on top of these base numbers. A 70B model in INT4 quantization (35 GB) can fit on a single 48 GB GPU like the A6000 — or split across two 24 GB RTX 4090s — with room for the KV cache.
GPU VRAM sizes: RTX 4090 = 24 GB, A6000 = 48 GB, A100 = 80 GB, H100 = 80 GB.
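The VRAM formula and the table above reduce to a short calculator (the 15% overhead is an assumption in the middle of the 10-20% range stated above):

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def vram_gb(params_billion, precision, overhead=0.15):
    """Weight memory plus ~10-20% overhead for KV cache and activations."""
    base = params_billion * BYTES_PER_PARAM[precision]
    return base * (1 + overhead)

def fits(params_billion, precision, gpu_gb):
    return vram_gb(params_billion, precision) <= gpu_gb

# A 70B model in INT4: 35 GB of weights, ~40 GB with overhead -> fits in 48 GB.
assert fits(70, "int4", 48)
# The same model in FP16 (140 GB of weights alone) cannot fit on one 80 GB GPU.
assert not fits(70, "fp16", 80)
```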
Why the First Token Takes So Long
You've probably noticed this when using ChatGPT or Claude: you hit send, there's a noticeable pause, and then tokens start appearing rapidly one after another. That initial pause and the subsequent fast streaming aren't just an interface choice — they reflect two fundamentally different computational phases happening on the GPU.
Phase 1: Prefill — Reading Your Prompt
Imagine a speed reader who can read an entire book in parallel — all pages at once. That's the prefill phase. The GPU takes your entire prompt (all tokens at once) and processes them simultaneously through the model. This is matrix-matrix multiplication: large blocks of data being multiplied together.
During prefill, the GPU's compute cores are actually busy. The operation has high arithmetic intensity — the ratio of computation to memory access. For every byte loaded from memory, the GPU performs many operations. This is compute-bound: the speed limit is how fast the GPU can do math, not how fast it can fetch data.
Phase 2: Decode — Generating One Token at a Time
Now imagine the same speed reader has to write a novel, but they can only write one word at a time, and before writing each word, they need to re-read everything they've written so far. That's the decode phase. The GPU generates tokens one by one, and each token requires accessing the entire model's weights.
During decode, the operation is matrix-vector multiplication: one tiny vector (the current token) multiplied against the enormous weight matrices. The arithmetic intensity drops dramatically — you load a huge amount of data for a tiny amount of computation. This is memory-bound: the speed limit is how fast data can move from VRAM to the compute cores.
Arithmetic intensity is the ratio that determines which phase you're in. High arithmetic intensity (many operations per byte of data moved) means compute-bound. Low arithmetic intensity (few operations per byte) means memory-bound. Prefill has high intensity because you're multiplying large matrices together. Decode has low intensity because you're multiplying one tiny vector against large matrices.
A typical GPT-4-class model on an A100 cluster: TTFT = 500-800ms for a 200-token prompt, then 50-100 tokens/second during decode. The prefill feels slow because the model is reading and processing your entire prompt. The decode feels fast because each new token only requires a fraction of the computation.
This is completely normal. The 800ms TTFT reflects the prefill phase processing your entire prompt (compute-bound, matrix-matrix multiplication); the 50 TPS decode speed reflects token-by-token generation (memory-bound, matrix-vector multiplication). The asymmetry is a fundamental property of transformer inference, not a bug — and it's why streaming UIs exist: the first token waits, then the rest flow quickly.
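The two phases give a simple end-to-end latency model (numbers below are the illustrative figures from this section):

```python
def total_latency_s(ttft_s, output_tokens, decode_tps):
    """End-to-end latency = prefill time (TTFT) + one decode step per output token."""
    return ttft_s + output_tokens / decode_tps

# 800 ms TTFT, then 300 output tokens at 50 tokens/sec -> 6.8 s total.
# The user sees the first token after 0.8 s thanks to streaming.
latency = total_latency_s(0.8, 300, 50)
```

Notice that for long responses, decode dominates total latency even though prefill dominates perceived responsiveness.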
The Re-Reading Problem
Imagine you're writing an essay, and every time you write a new sentence, you need to re-read the entire essay from the beginning to remember what you've said. Sentence 1 takes 1 second. Sentence 2 takes 2 seconds (re-read sentence 1, then write). Sentence 100 takes 100 seconds. The time to write each new sentence grows linearly with the length of what you've already written.
This is exactly what would happen during the decode phase without a cache. For each new token, the model needs to "attend to" (look at) every previous token. The attention mechanism computes Key and Value vectors for each token — these are the compressed representations that allow the model to "remember" what each token means in context.
The KV cache stores these Key and Value vectors so the model doesn't have to recompute them for every new token. When generating token 100, instead of recomputing Keys and Values for tokens 1-99, the model just looks up the cached values and only computes the Key/Value for the new token.
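The cache's size is easy to estimate. Per token, the model stores one Key and one Value vector per layer. The sketch below assumes a Llama-70B-like shape (80 layers, hidden dimension 8192, FP16) with full multi-head attention; grouped-query attention, used by many modern 70B models, divides the result by the query-to-KV-head ratio:

```python
def kv_cache_gb(seq_len, n_layers=80, hidden_dim=8192, bytes_per_val=2):
    """KV cache size: 2 vectors (K and V) x layers x hidden dim x precision, per token."""
    per_token_bytes = 2 * n_layers * hidden_dim * bytes_per_val
    return seq_len * per_token_bytes / 1e9

# Under these assumptions, each token costs ~2.6 MB of cache, so a single
# 4K-token sequence holds ~10.7 GB of KV cache on top of the model weights.
cache_4k = kv_cache_gb(4096)
```

This linear growth is exactly why long contexts and high concurrency compete for the same VRAM.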
PagedAttention: The Filing Cabinet Fix
Without PagedAttention, the KV cache works like a bookshelf with fixed-size shelves. You allocate a contiguous block of VRAM for each request based on the maximum possible sequence length. A request that might use 4K tokens gets a 4K-token allocation, even if it only ends up using 200 tokens. The wasted space fragments memory and limits how many requests can run concurrently.
PagedAttention (introduced by vLLM) works like a filing cabinet with movable folders. Instead of allocating one big continuous block per request, it breaks the KV cache into small, fixed-size pages (typically 16 tokens each). Pages are allocated on demand as the sequence grows, and freed pages can be reused by other requests immediately.
Think of the difference between renting an entire floor of an office building (even if you only use 3 rooms) versus renting individual rooms as you need them. PagedAttention reduces VRAM waste from 60-80% down to less than 4%, which means you can serve 2-4x more concurrent requests with the same GPU.
Before PagedAttention, a single A100 serving a 70B model could handle ~8 concurrent requests. With PagedAttention, the same GPU handles ~20-30 concurrent requests. vLLM made PagedAttention the default, and it's now the standard for all serious LLM serving.
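The core idea is small enough to sketch. This toy allocator captures the spirit of PagedAttention — on-demand pages, immediate reuse on release — without any of vLLM's real machinery (block tables on GPU, copy-on-write for shared prefixes):

```python
class PagedKVCache:
    """Toy page allocator in the spirit of PagedAttention."""

    def __init__(self, total_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(total_pages))
        self.tables = {}   # request id -> list of physical page ids
        self.lengths = {}  # request id -> tokens written so far

    def append_token(self, req_id):
        n = self.lengths.get(req_id, 0)
        if n % self.page_size == 0:          # current page full (or first token)
            if not self.free:
                raise MemoryError("out of KV cache pages")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        # Freed pages are immediately reusable by other requests.
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

cache = PagedKVCache(total_pages=4, page_size=16)
for _ in range(20):
    cache.append_token("req-a")
# A 20-token sequence holds exactly 2 pages, not a worst-case contiguous block.
```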
Just like model weights can be quantized from FP16 to INT8 or INT4, the KV cache can also be compressed. SnapKV and similar techniques selectively keep only the most "important" key-value pairs — the ones that receive the most attention from the model.
How it works: SnapKV observes which token positions the model's attention focuses on during inference. Tokens that rarely receive attention (typically in the middle of long contexts) have their KV cache entries compressed or dropped entirely. This can reduce KV cache size by 3-5x with minimal quality loss.
When to use: Long-context applications (10K+ tokens) where KV cache memory is the limiting factor for concurrent requests. Not needed for short conversations.
When NOT to use: Applications requiring perfect recall of all context (medical, legal, financial). Any compression risks dropping a fact the model might need.
The Empty Factory Problem
Imagine a factory with 1,000 workers. One customer walks in with a single widget to assemble. All 1,000 workers crowd around that one widget. 999 of them stand idle while one does the work. That's what happens when a GPU with thousands of compute cores processes a single request during the decode phase — the memory bandwidth bottleneck means most cores have nothing to do.
The fix? Batching — processing multiple requests at the same time. Instead of one customer's widget, you have 32 customers' widgets on the assembly line simultaneously. Now more workers have something to do, and the factory is actually efficient.
Continuous Batching: No Waiting in Line
Static batching is like a bus: it waits until all seats are full, drives the route, and nobody can get on or off until the trip is complete. If one request finishes early (a short response), its "seat" sits empty until the longest request in the batch completes.
Continuous batching (also called "inflight batching") is like a conveyor belt at a buffet. Requests enter and exit freely. When one request finishes, a new one immediately takes its place. There's no wasted time waiting for the entire batch to complete.
This seemingly simple change has an enormous impact: continuous batching improves GPU utilization by 10-20x compared to static batching in real workloads with mixed-length requests.
The Batch Size Trap
"If batching helps, why not make the batch huge?" Because of a trap: throughput scales sub-linearly with batch size. Doubling the batch from 8 to 16 might increase throughput by 70%. Doubling from 32 to 64 might only increase it by 20%. And at some point, you run out of VRAM for the KV caches of all those concurrent requests.
The metric that captures this is MBU (Model Bandwidth Utilization) — what percentage of the GPU's memory bandwidth is actually being used for useful computation. A well-optimized serving system targets 80-90% MBU. Below 50% means you're leaving money on the table. Above 95% means you're probably trading latency for throughput.
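MBU has a simple first-order estimate: during decode, each step streams roughly the entire model through memory, so achieved bandwidth is about model size times decode steps per second. The numbers below are illustrative, and this ignores KV cache reads and batching effects:

```python
def mbu(model_gb, decode_steps_per_sec, peak_bandwidth_gbs=2000.0):
    """Model Bandwidth Utilization (simplified).

    Each decode step reads ~all model weights from VRAM, so achieved
    bandwidth ~ model_size x steps/sec. Peak defaults to an A100's 2 TB/s.
    """
    return model_gb * decode_steps_per_sec / peak_bandwidth_gbs

# A 140 GB model decoding at 12 steps/sec on a 2 TB/s GPU -> 84% MBU,
# inside the 80-90% target band for a well-optimized server.
utilization = mbu(140, 12)
```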
Research shows 31% of LLM API calls are semantically redundant — different users ask effectively the same question with different wording. Semantic caching stores responses indexed by meaning similarity (not exact string match). When a new query is semantically close to a cached one, return the cached response instantly. This provides a 5-10x speedup for those cached queries with zero GPU cost. Tools like GPTCache and Redis with vector similarity search implement this pattern.
Think of a junior architect who drafts a building plan, and then a senior architect reviews it. The junior is fast but makes mistakes. The senior is slow but catches everything. If the junior gets most of it right, the senior only needs to fix a few things — much faster than the senior doing the whole thing alone.
Speculative decoding works the same way. A small, fast "draft" model generates several tokens quickly. Then the large, slow "target" model verifies all those tokens in parallel (this is a single forward pass, so it's fast). Tokens the target model agrees with are accepted. Tokens it rejects get regenerated by the target model.
The speedup: If the draft model has an 80% acceptance rate (common for well-matched pairs), you generate ~5 tokens for the compute cost of ~2 tokens. That's a 2-3x speedup with mathematically identical output quality.
Use when: Latency is critical and you can deploy two models (one small, one large). Common pairing: Llama-7B draft + Llama-70B target.
Don't use when: VRAM is tight (you need memory for both models) or your workload is already batch-throughput-optimized rather than latency-optimized.
Every GPU operation (called a "kernel") has startup overhead: launch the kernel, allocate registers, synchronize. If a model layer requires 10 separate operations (multiply, add, normalize, activate, etc.), that's 10 kernel launches with 10 overheads.
Kernel fusion combines multiple operations into a single GPU kernel. Instead of 10 launches, you get 1. The data stays in the GPU's fast on-chip cache (SRAM) between operations instead of being written back to slow VRAM after each step.
FlashAttention is the most famous example: it fuses the entire attention computation (QK^T, softmax, multiply by V) into a single kernel that never writes intermediate results to VRAM. This provides 2-4x speedup for the attention operation specifically.
Practical impact: You don't implement kernel fusion yourself. Libraries like vLLM, TensorRT-LLM, and FlashAttention handle it. But understanding why these libraries are faster than naive PyTorch helps you make informed deployment choices.
Stretching the Window
A model trained with a 4K context window can't magically handle 32K tokens. The model's position encoding — how it understands "this token is in position 500" vs "this token is in position 5000" — breaks when you exceed the training length. The model has never seen positions beyond 4K during training, so those positions produce garbage.
But what if you need longer context? Retraining the entire model on longer sequences is expensive (millions of dollars). Researchers found shortcuts.
RoPE: How Models Know Position
Think of a clock. The hour hand rotates at one speed, the minute hand at another. By combining these two rotations, you can encode any time of day. RoPE (Rotary Position Embedding) works similarly — it encodes each token's position by rotating its vector representation at specific frequencies. Position 1 gets a small rotation. Position 1000 gets a large rotation. The model learns to interpret these rotations during training.
The problem: the model was trained with rotations up to, say, position 4096. Position 5000 requires a rotation the model has never seen. It's like a clock trying to display 25 o'clock — the numbers don't wrap around properly.
NTK-Aware Scaling: Changing the Base Frequency
NTK-Aware Scaling modifies the base frequency of RoPE's rotational encoding, effectively "stretching" the position space. Instead of positions 0-4096, the same rotation angles now span 0-32768. It's like switching from a 12-hour clock to a 24-hour clock — you cover more positions without changing the model's ability to distinguish nearby positions.
The beauty is that NTK-Aware Scaling requires zero additional training. You modify a single hyperparameter (the RoPE base frequency) at inference time. The tradeoff is that the model's ability to distinguish very close positions decreases slightly — position 100 and position 101 become slightly harder to tell apart — but for most applications, this is negligible.
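In code, "changing the base frequency" is literally one number. The sketch below uses standard RoPE angles; the base-adjustment formula shown is one common form of NTK-aware scaling (implementations vary in the exact exponent and in dynamic variants), so treat it as illustrative:

```python
import math

def rope_angles(position, head_dim=128, base=10000.0):
    """Per-frequency rotation angle for one position (standard RoPE):
    theta_i = position / base**(2i/d) for each of the d/2 frequency pairs."""
    return [position / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

def ntk_base(base=10000.0, scale=8, head_dim=128):
    """One common NTK-aware adjustment: raise the base so the lowest
    frequencies stretch by ~scale while the highest barely change."""
    return base * scale ** (head_dim / (head_dim - 2))

orig = rope_angles(4096)
stretched = rope_angles(4096, base=ntk_base())
# The highest frequency (local word order, i=0) is untouched; the lowest
# frequency (global structure) now rotates ~8x slower, extending the window.
```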
YaRN: The Full Solution
YaRN (Yet another RoPE extensioN) takes NTK scaling further. Instead of uniformly stretching all frequencies, YaRN applies different scaling factors to different frequency bands. High-frequency components (which encode local relationships like "these two words are adjacent") get less stretching. Low-frequency components (which encode global relationships like "this paragraph is about the same topic") get more stretching.
This preserves local precision while extending global range. YaRN can extend context from 4K to 128K with only a small amount of fine-tuning data (a few hundred examples), compared to the full retraining that would otherwise be needed.
Researchers discovered something curious: the first few tokens in any sequence receive disproportionately high attention, regardless of their content. These are called attention sinks. Even if the first token is meaningless punctuation, the model's attention mechanism uses it as an "anchor."
StreamingLLM exploits this by keeping the first 4 tokens (attention sinks) plus a sliding window of the most recent tokens. Everything in between is discarded. This allows processing of theoretically infinite-length streams with fixed memory usage.
Use when: You need to process continuous streams (real-time transcription, log monitoring) where you don't need to recall the entire history.
Don't use when: You need the model to reference information from the beginning of a long document. StreamingLLM deliberately discards the middle and only keeps the anchors plus recent context.
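The eviction policy itself is tiny — keep the attention-sink tokens plus a recent window, drop the middle (sink count and window size below are illustrative defaults):

```python
def streaming_window(kv_entries, num_sinks=4, window=1024):
    """StreamingLLM-style eviction: keep the first few tokens (attention
    sinks) plus a sliding window of the most recent tokens."""
    if len(kv_entries) <= num_sinks + window:
        return kv_entries
    return kv_entries[:num_sinks] + kv_entries[-window:]

kept = streaming_window(list(range(100_000)), num_sinks=4, window=1024)
# Memory stays fixed at 1,028 entries no matter how long the stream runs,
# and the 4 anchor tokens are always retained.
```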
Your computer has 16 GB of RAM but can run programs that use 100 GB of memory. How? Virtual memory — the operating system pages data in and out of RAM as needed, using the hard drive as overflow. Only the actively-needed data sits in RAM at any time.
MemGPT applies this exact concept to LLMs. The context window is "RAM" — limited and expensive. External storage (databases, files) is the "hard drive" — huge and cheap. An LLM controller manages what's in the context window, paging information in when the model needs it and paging it out when it's not immediately relevant.
For applications that need both long-term memory and long-context understanding, MemGPT provides a compelling middle ground between RAG (which retrieves from external storage) and simply using a larger context window (which is expensive and suffers from "lost in the middle" degradation).
The 200K of content likely doesn't all need to be in context simultaneously. The best strategy combines RAG (to retrieve only the relevant portions) with YaRN (to moderately extend the window so it holds the retrieved chunks at good quality). StreamingLLM deliberately discards the middle content, making it useless for recall, and stretching NTK scaling 6x degrades quality significantly. The hybrid approach — reduce what needs to be in context, then extend moderately — gives the best quality-cost tradeoff.
The Creativity Dial
After the model processes your prompt through all its layers, the final step isn't just "pick the best word." The model outputs a probability distribution over its entire vocabulary — a score for every possible next token. "The" might have a 15% probability, "A" might have 8%, "In" might have 5%, and so on for all 128,000 tokens. How you sample from this distribution determines the character of the output.
Temperature: How Much Randomness
Think of a radio dial. On the left (temperature = 0), you get crystal-clear reception of a single station — the model always picks the highest-probability token. The output is deterministic and repeatable. On the right (temperature = 1.0+), you get static mixed with signal — lower-probability tokens get a real chance of being selected, creating more diverse, creative, and sometimes surprising output.
Temperature = 0: "The cat sat on the mat." Every time, identical output. Perfect for factual tasks, code generation, data extraction.
Temperature = 0.7: "The cat settled on the mat." Slight variation, still coherent. Good for conversational AI, writing assistance.
Temperature = 1.2: "The feline lounged across the weathered doormat." More creative, sometimes surprising. Good for brainstorming, creative writing.
Temperature = 2.0: "The quantum velvet whispered pancake orbits." Chaotic. Rarely useful except for generating random inspiration seeds.
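Mechanically, temperature is just a divisor applied to the model's raw scores (logits) before the softmax. A minimal sampler, with toy logits:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=None):
    """Temperature sampling: T < 1 sharpens the distribution, T > 1 flattens
    it, and T = 0 degenerates to greedy argmax decoding."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    m = max(scaled)                         # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(logits) - 1

logits = [2.0, 1.0, 0.1]                    # toy scores for three candidate tokens
assert sample_with_temperature(logits, 0) == 0   # T = 0: always the top token
```

At very low temperatures the top token's probability approaches 1; at high temperatures the three candidates become nearly equally likely.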
Logprobs: The Confidence Score
Logprobs (log-probabilities) reveal the model's confidence in each token it generates. A logprob of 0 means 100% confident. A logprob of -1 means ~37% confident. A logprob of -5 means ~0.7% confident.
This is incredibly useful for detecting hallucination. When the model confidently states a fact, the logprob for that token should be high (close to 0). When the logprob drops below -2.0, the model is essentially guessing. In production systems, you can flag low-confidence claims for human review.
If logprob < -2.0 on a factual claim, flag it for human review. This simple threshold catches a significant percentage of hallucinations. APIs like OpenAI and Anthropic can return logprobs per token — parse the response, identify factual claims, and check their logprobs.
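The conversion and the flagging rule are two one-liners (the tokens below are hypothetical examples, and real systems must first identify which tokens carry factual claims):

```python
import math

def confidence(logprob):
    """Convert a token logprob back to a probability: p = e^logprob."""
    return math.exp(logprob)

def flag_low_confidence(token_logprobs, threshold=-2.0):
    """Return tokens below the review threshold. Below -2.0 the model
    assigned the token less than ~13.5% probability - it was guessing."""
    return [tok for tok, lp in token_logprobs if lp < threshold]

assert round(confidence(-1), 2) == 0.37     # logprob -1 -> ~37% confident
assert round(confidence(-5), 3) == 0.007    # logprob -5 -> ~0.7% confident
flagged = flag_low_confidence([("Paris", -0.1), ("1889", -3.2)])
# Only the low-confidence token ("1889") is routed to human review.
```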
Stop Sequences: Knowing When to Stop
Stop sequences tell the model when to stop generating. Without them, the model might keep generating until it hits the maximum token limit — producing rambling, repetitive, or off-topic content. Common stop sequences include newlines (for single-line answers), closing brackets (for JSON), or specific delimiter strings.
For structured output (JSON, XML, SQL), stop sequences are essential. You tell the model to stop after the closing } of a JSON object, preventing it from generating garbage after the valid output.
The Build vs Buy Question
Every team deploying LLMs faces this decision: use a hosted API (OpenAI, Anthropic, Google) or run your own model on your own GPUs. The answer depends on your volume, your latency requirements, your data privacy needs, and — critically — your team's operational capacity.
The Serving Frameworks
If you self-host, you need a serving framework. Here are the four that matter:
vLLM — PagedAttention, continuous batching, speculative decoding, tensor parallelism. Powers Amazon Rufus and LinkedIn's AI features. Best overall choice for most production deployments. Open-source, active community.
TGI (Text Generation Inference) — built-in OpenTelemetry for observability, first-class Hugging Face model hub integration, and production-ready defaults. Slightly lower raw throughput than vLLM but easier to set up within the Hugging Face ecosystem.
TensorRT-LLM — NVIDIA's own framework. Squeezes every last drop of performance from NVIDIA GPUs with custom kernels and hardware-specific optimizations. 20-40% faster than vLLM on raw throughput, but harder to set up and NVIDIA-only.
Ollama — one-command model download and serving. Perfect for local development, edge deployment, and situations where simplicity matters more than raw performance. Runs on consumer hardware (MacBooks, gaming PCs).
The Cost Comparison
The math changes dramatically based on your scale:
| Metric | API (GPT-4o) | Self-Host (vLLM + Llama-70B) |
|---|---|---|
| Setup time | Minutes | Days to weeks |
| Cost at 10K requests/day | ~$300/month | ~$3,000/month (GPU lease) |
| Cost at 1M requests/day | ~$30,000/month | ~$8,000/month (GPU cluster) |
| Latency (P50) | 500-1500ms TTFT | 100-300ms TTFT |
| Data privacy | Data leaves your network | Data stays in your infrastructure |
| Ops burden | Zero (managed) | High (GPU monitoring, scaling, updates) |
| Model flexibility | Vendor's models only | Any open-source model, fine-tuned |
Use an API when:
- Your volume is under 500K requests/month (API is cheaper when accounting for engineering time)
- You don't have a dedicated ML infrastructure team
- You need the best frontier model quality (GPT-4o, Claude 3.7 are still ahead of open-source)
- Speed to market matters more than cost optimization
Self-host when:
- Your API costs exceed $10K/month and are growing
- Data privacy/compliance requires data to stay in your infrastructure (HIPAA, GDPR, etc.)
- You need latency under 200ms TTFT consistently
- You have (or can hire) at least one engineer dedicated to ML infrastructure
- You need fine-tuned models for your specific domain
The hybrid approach (most common in production): Use an API for frontier quality tasks (complex reasoning, creative writing) and self-host for high-volume, simpler tasks (classification, extraction, summarization). Route requests to the right backend based on task complexity.
وَاللَّهُ أَعْلَمُ
And Allah knows best
وَصَلَّى اللَّهُ وَسَلَّمَ وَبَارَكَ عَلَىٰ سَيِّدِنَا مُحَمَّدٍ وَعَلَىٰ آلِهِ
May Allah's peace and blessings be upon our master Muhammad and his family
Practice Mode
Test your understanding with real-world LLM inference scenarios.
Your team deployed Llama-70B on a single GPU with 48GB of VRAM. The model needs 140GB in FP16. The deployment fails with an out-of-memory error before a single request is served. (From the VRAM table above: 140 GB cannot fit in 48 GB. Quantizing to INT4 brings the weights to 35 GB, which fits with room left for the KV cache; the alternative is sharding across multiple GPUs.)
Cheat Sheet
The essential reference for LLM inference.
Tokenization
BPE: Byte Pair Encoding — merges frequent character pairs into tokens
SentencePiece: The tool that builds BPE vocabularies
Fertility: Tokens per word (English ~1.0, Arabic ~2.19)
Fix: Vocabulary extension for non-English languages
GPU Math
VRAM: Parameters(B) x bytes/param (FP16 = 2, INT4 = 0.5)
TFLOPS: How fast the GPU does math
Bandwidth: How fast data moves (A100: 2 TB/s)
Bottleneck: Almost always memory bandwidth, not compute
Two Phases
Prefill: Matrix-matrix, compute-bound, processes all tokens at once
Decode: Matrix-vector, memory-bound, one token at a time
TTFT: Time to first token (prefill speed)
TPS: Tokens per second (decode speed)
KV Cache
What: Stores Key/Value vectors from previous tokens
Growth: Linear with sequence length (~1MB per token for large models)
PagedAttention: Allocates pages on-demand, cutting VRAM waste from 60-80% to under 4%
SnapKV: Compresses cache by keeping only high-attention entries
Serving
Continuous batching: Requests enter/exit freely (10-20x over static)
MBU: Target 80-90% Model Bandwidth Utilization
Speculative decoding: Small draft + large verify = 2-3x speedup
Semantic caching: 31% queries are redundant → 5-10x cache hit speedup
Context Extension
RoPE: Rotary Position Embedding — encodes position via rotation
NTK scaling: Stretch positions by changing base frequency (no retraining)
YaRN: Non-uniform scaling, best quality for 4K→128K
StreamingLLM: Keep first 4 + recent tokens for infinite streams
Generation Control
Temperature 0: Deterministic (facts, code, extraction)
Temperature 0.7: Balanced (conversation, writing)
Logprobs < -2.0: Flag for human review (hallucination risk)
Stop sequences: Essential for structured output (JSON, SQL)
Self-Host vs API
API: <500K req/mo, no ML team, frontier quality needed
Self-host: >$10K/mo API cost, data privacy, <200ms TTFT needed
vLLM: Production workhorse (most teams start here)
Hybrid: API for complex, self-host for high-volume simple tasks
- How LLMs Work — understand the transformer architecture, attention mechanism, and training process that underpin everything in this post (prerequisite)
- RAG & Agents — retrieval-augmented generation and agent architectures (context for when you need external knowledge)
- Fine-Tuning — the next post in this series: when you need to adapt the model itself rather than optimizing inference