If you're deploying LLMs in production in 2026, inference is almost certainly your biggest line item. Training gets the headlines, but inference is where the money actually goes — running models 24/7, at scale, for every user request. A single unoptimized 70B model can cost $100+/hour on A100s. Optimize that same model properly, and you're looking at $15-20/hour for equivalent throughput.
This isn't about squeezing out marginal gains. The difference between naive and optimized inference is often 5-10x in cost and 3-5x in latency. For ML and AI engineers, understanding these techniques isn't optional anymore — it's the skill that determines whether your model ships or stays in a notebook.
The Inference Tax: Why This Matters Now
Training a frontier model costs $50-200M. Serving it costs more — within months. OpenAI processes billions of tokens per day across ChatGPT, the API, and enterprise deployments. Anthropic, Google DeepMind, and every company building on LLMs faces the same math: inference compute dominates total cost of ownership.
The bottleneck is counterintuitive. LLM inference during token generation is memory-bandwidth bound, not compute-bound. Your GPU sits there, largely idle, waiting for weights and KV cache data to stream from HBM. An H100 has 3.35 TB/s of memory bandwidth but 989 TFLOPS of FP16 compute — during autoregressive decoding, you're barely using 10-20% of that compute capacity.
Every technique in this guide attacks the same fundamental problem: reduce the amount of data that needs to move through memory, and make better use of the data that does move.
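A rough roofline calculation makes the 10-20% figure concrete. The sketch below uses only the H100 numbers quoted above. The cost model is a simplifying assumption: each FP16 weight is read once per decode step and used for one multiply-add per sequence in the batch, and KV cache traffic is ignored. The function names are illustrative.

```python
# Back-of-the-envelope roofline estimate for the decode phase.
# Hardware numbers are the H100 figures quoted in the text; the cost model is assumed.

PEAK_FLOPS = 989e12       # FP16 FLOP/s
PEAK_BANDWIDTH = 3.35e12  # bytes/s from HBM


def ridge_point(peak_flops: float = PEAK_FLOPS, peak_bw: float = PEAK_BANDWIDTH) -> float:
    """FLOPs the GPU must perform per byte loaded before compute becomes the limit."""
    return peak_flops / peak_bw


def decode_compute_utilization(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """Rough upper bound on compute utilization during one decode step.

    Assumes every weight is read once per step and contributes one
    multiply-add (2 FLOPs) per sequence in the batch.
    """
    intensity = 2.0 * batch_size / bytes_per_weight  # FLOPs per byte moved
    return min(1.0, intensity / ridge_point())


for batch in (1, 8, 32, 64):
    print(f"batch={batch:3d}  utilization <= {decode_compute_utilization(batch):.1%}")
```

Under these assumptions the GPU needs roughly 295 FLOPs per byte loaded to saturate compute, while single-sequence decoding delivers about 1. Batch sizes between 32 and 64 land near the 10-20% band.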
KV Cache: The Hidden Memory Hog
The Key-Value cache is the single most important concept in inference optimization. During autoregressive generation, every new token attends to all previous tokens. Without caching, you'd recompute attention for the entire sequence at every step — quadratic cost that makes long-context generation impossible.
The KV cache stores the key and value projections from every layer for every token generated so far. Its size is 2 (keys and values) × layers × KV heads × head dimension × sequence length × batch size × bytes per element. For a Llama 3 70B model with 80 layers and 64 attention heads, 4096-token sequences add up to about 41 GB of KV cache at a modest batch size, often more than the quantized model weights themselves. This is why KV cache optimization matters so much.
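The arithmetic behind a figure like this can be sketched in a few lines. The text does not state the batch size or head dimension it used, so the values below are assumptions: a head dimension of 128, FP16 storage (2 bytes per element), the 8 KV heads mentioned later for Llama 3, and a batch of 32 sequences.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes; the leading 2 accounts for keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Assumed configuration (head_dim and batch size are not given in the text).
LAYERS, HEAD_DIM, SEQ_LEN = 80, 128, 4096

per_seq_gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN, batch_size=1)
per_seq_mha = kv_cache_bytes(LAYERS, 64, HEAD_DIM, SEQ_LEN, batch_size=1)
batch_32_gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN, batch_size=32)

print(f"per sequence, 8 KV heads:  {per_seq_gqa / 1e9:.2f} GB")
print(f"per sequence, 64 KV heads: {per_seq_mha / 1e9:.2f} GB")
print(f"batch of 32, 8 KV heads:   {batch_32_gqa / 1e9:.1f} GB")
```

With these assumed values the batch of 32 comes to about 43 GB (40 GiB), in the same range as the figure quoted above. The cache grows linearly with both batch size and sequence length.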
PagedAttention (vLLM)
Traditional KV cache allocation reserves contiguous memory for the maximum sequence length, even if most requests are short. PagedAttention, introduced by vLLM, borrows the idea of virtual memory paging from operating systems: KV cache is allocated in fixed-size blocks ("pages") and mapped via a page table. Memory waste drops from 60-80% to near zero.
In practice, this means you can serve 2-3x more concurrent requests on the same GPU. PagedAttention is now the default in vLLM, and variants have been adopted by most serving frameworks.
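The block-table idea can be illustrated with a toy allocator. This is a minimal sketch of the concept, not vLLM's implementation: the class name, method names, and the block size of 16 tokens are all assumptions made for illustration.

```python
class PagedKVAllocator:
    """Toy allocator: each sequence owns a list of physical block ids (its page table)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.seq_lens: dict[str, int] = {}

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a slot for one more token; returns (physical_block, offset_in_block)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8)
for _ in range(20):
    alloc.append_token("long-request")   # 20 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("short-request")  # 5 tokens need 1 block
print(len(alloc.free_blocks), "blocks still free")
```

Memory is claimed one block at a time as a sequence grows, so the only waste is the unfilled tail of each sequence's last block. A contiguous allocator would instead reserve the maximum sequence length up front.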
Prefix Caching
If many requests share a common prefix (system prompt, few-shot examples, shared context), prefix caching computes the KV cache for that prefix once and reuses it across requests. For a 2000-token system prompt served to 1000 users, you save roughly 2 million tokens of redundant prefill computation. Every major serving framework now supports this, and it's particularly impactful for RAG pipelines where the same retrieved documents appear in multiple requests.
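A toy version of the lookup shows the mechanics. This is a sketch under assumptions, not any framework's actual data structure: the cache is keyed by a running hash over fixed-size token blocks, and the class name, method name, and block size of 16 are hypothetical.

```python
class PrefixCache:
    """Toy block-level prefix cache keyed by a running hash of token blocks."""

    def __init__(self, block_size: int = 16):
        self.block_size = block_size
        self.cached_blocks: set[int] = set()

    def match_and_insert(self, tokens: list[int]) -> int:
        """Return how many leading tokens already have cached KV, then cache the rest."""
        hits = 0
        matching = True
        parent = 0
        for start in range(0, len(tokens) - self.block_size + 1, self.block_size):
            block = tuple(tokens[start:start + self.block_size])
            key = hash((parent, block))  # chains each block to everything before it
            if matching and key in self.cached_blocks:
                hits += self.block_size
            else:
                matching = False
                self.cached_blocks.add(key)
            parent = key
        return hits


def redundant_prefill_tokens(prefix_len: int, num_requests: int) -> int:
    """Prefix tokens that no longer need recomputing once the first request has run."""
    return prefix_len * (num_requests - 1)


system_prompt = list(range(2000))           # stand-in token ids
cache = PrefixCache()
first = cache.match_and_insert(system_prompt + [9001, 9002])
second = cache.match_and_insert(system_prompt + [7001])
print(first, second, redundant_prefill_tokens(2000, 1000))
```

The first request misses everywhere and populates the cache. The second reuses every full block of the shared prompt, so only its own suffix needs prefill.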
Grouped-Query Attention (GQA)
GQA reduces KV cache size by sharing key-value heads across multiple query heads. Llama 3 uses 8 KV heads shared across 64 query heads — an 8x reduction in KV cache size compared to standard multi-head attention. This is a model architecture decision, not a serving optimization, but understanding it is crucial for capacity planning.
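The head sharing is just an index mapping. The sketch below assumes consecutive query heads are grouped onto one KV head, which is the common layout. The function name is illustrative.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int = 64, num_kv_heads: int = 8) -> int:
    """Index of the KV head that a given query head reads from."""
    group_size = num_q_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


# Query heads 0-7 share KV head 0, heads 8-15 share KV head 1, and so on.
mapping = [kv_head_for_query_head(q) for q in range(64)]
print(mapping[:10], "...", mapping[-1])
print("KV cache reduction vs. one KV head per query head:", 64 // 8, "x")
```

Only the 8 KV heads are written to the cache. The 64 query heads still run their own attention computations against them.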
Quantization: Doing More With Less
Quantization reduces model precision from FP16/BF16 (16 bits per parameter) to INT8 (8 bits), INT4 (4 bits), or even lower. It's the single highest-impact optimization you can apply to any model — smaller weights mean less memory, less bandwidth, and faster inference.
| Format | Bits | Memory | Quality Loss | Best For |
|---|---|---|---|---|
| FP8 (E4M3) | 8 | 2x reduction | Negligible | H100/H200 production serving |
| INT8 (W8A8) | 8 | 2x reduction | <1% on benchmarks | General production serving |
| INT4 (GPTQ) | 4 | 4x reduction | 1-3% on benchmarks | Memory-constrained GPUs |
| INT4 (AWQ) | 4 | 4x reduction | 1-2% on benchmarks | Better quality than GPTQ |
| GGUF (llama.cpp) | 2-8 | 2-8x reduction | Varies by quant level | Local/CPU inference |
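To make the memory column concrete, here is a minimal sketch of the simplest scheme, symmetric per-tensor absmax quantization to INT8, together with the weight-memory arithmetic for a 70B-parameter model. It is illustrative only: methods such as GPTQ and AWQ use calibration data and finer-grained scales, and the function names here are assumptions.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric absmax quantization: map the largest magnitude to +/-127."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


def weight_memory_gb(num_params: float, bits: int) -> float:
    """Memory for the weights alone, ignoring scales and other metadata."""
    return num_params * bits / 8 / 1e9


weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, [round(w, 4) for w in restored])

for bits in (16, 8, 4):
    print(f"70B params at {bits:2d} bits: {weight_memory_gb(70e9, bits):.0f} GB")
```

The round trip loses at most half a quantization step per weight. Calibration-based methods try to spend that error budget where it hurts the output least.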
The practical rule of thumb
Start with FP8 on H100/H200 or INT8 (W8A8) on A100. You'll barely notice quality degradation but cut memory in half. Only go to INT4 when you must fit a model on a smaller GPU. Below INT4, expect meaningful quality loss on complex reasoning tasks.
GPTQ vs AWQ: Both are post-training quantization methods for INT4. AWQ (Activation-Aware Weight Quantization) generally produces better quality by protecting the 1% of "salient" weights that disproportionately affect model output. GPTQ is faster to calibrate but slightly less accurate. For new deployments in 2026, AWQ is the default choice.
Speculative Decoding: Getting Multiple Tokens Per Step
Standard autoregressive decoding generates one token per forward pass through the full model. Speculative decoding breaks this bottleneck using a clever asymmetry: because decoding is memory-bandwidth bound, verifying N tokens in one parallel forward pass costs roughly the same as generating a single token.
The process works like this:
- Draft phase: A small, fast model (1-7B parameters) generates K candidate tokens autoregressively (K is typically 3-8).
- Verify phase: The full model processes all K candidates in a single forward pass, computing the probability of each candidate token given the previous ones.
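- Accept/reject: Starting from the first candidate, accept each token with probability min(1, p/q), where p is the full model's probability for that token and q is the draft model's. At the first rejection, discard the remaining candidates and resample that position from the normalized residual distribution max(0, p - q).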
If the draft model is good (and for predictable sequences like code, it usually is), you accept 70-90% of candidates. That means 3-6 tokens per forward pass of the main model, with mathematically identical output distribution to standard decoding.
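The accept/reject step can be sketched with toy distributions. This is a simplified illustration of the standard rejection-sampling rule, not any framework's implementation. The probabilities are passed in as plain dicts (in a real system the target probabilities for all positions come from one forward pass), and the bonus token that the full model samples when every candidate is accepted is omitted. All names are hypothetical.

```python
import random


def speculative_step(draft_tokens, draft_probs, target_probs, rng):
    """Verify drafted tokens left to right and return the tokens emitted this step.

    draft_tokens[i] is the token proposed at position i; draft_probs[i] and
    target_probs[i] map token -> probability at that position.
    """
    emitted = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        accept_prob = min(1.0, p.get(tok, 0.0) / q[tok])
        if rng.random() < accept_prob:
            emitted.append(tok)
            continue
        # Rejected: resample from the normalized residual max(0, p - q), then stop.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        tokens, weights = zip(*residual.items())
        emitted.append(rng.choices(tokens, weights=weights)[0])
        break
    return emitted


rng = random.Random(0)
agree = {"def": 0.9, "class": 0.1}
print(speculative_step(["def", "def"], [agree, agree], [agree, agree], rng))

draft = {"foo": 1.0, "bar": 0.0}
target = {"foo": 0.0, "bar": 1.0}
print(speculative_step(["foo"], [draft], [target], rng))
```

When the two models agree, every candidate is accepted. When the full model assigns zero probability to a drafted token, it is always rejected and replaced by a sample from the residual. A rejection can only happen when p < q for the drafted token, which guarantees the residual has positive mass.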
When not to use speculative decoding
Speculative decoding improves latency (time to complete a single request) but can reduce throughput (total tokens per second across all users). If you're serving many concurrent users and GPU utilization is already high, the draft model's compute competes with the main model. Use it for latency-sensitive, low-concurrency scenarios — interactive coding, real-time agents, streaming responses.
Flash Attention: Memory-Efficient Attention
Flash Attention rewrites the attention computation to avoid materializing the full N×N attention matrix in GPU HBM. Instead of computing attention as three separate operations (QKᵀ, softmax, multiply by V), it fuses them into a single kernel that works on blocks of the attention matrix in fast SRAM.
The result: attention computation goes from O(N²) memory to O(N) — critical for long-context models. Flash Attention 2 and its successors are now the default in every major framework. You don't need to implement it — but you need to know it's there, because if you're seeing OOM errors on long sequences, missing Flash Attention is often the cause.
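The trick that makes block-wise processing possible is the online softmax: keep a running maximum and running sum, and rescale the partial output whenever the maximum changes. The sketch below shows this for a single query vector in pure Python. It illustrates the numerical idea only, not the fused GPU kernel, and the function names and block size are assumptions.

```python
import math


def _scores(q, keys):
    """Scaled dot-product scores of one query against a list of keys."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def naive_attention(q, keys, values):
    """Reference: materialize all scores, softmax, then weight the values."""
    scores = _scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(len(values[0]))]


def blockwise_attention(q, keys, values, block_size=2):
    """Same result, but only one block of scores exists at any time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = _scores(q, k_blk)
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # 0.0 for the first block
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.1, 0.7]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.3, 0.9]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(blockwise_attention(q, keys, values))
```

Both functions return the same vector up to floating-point rounding, but the block-wise version never holds more than `block_size` scores at once. That is where the O(N) memory behavior comes from.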
Continuous Batching: Maximizing GPU Utilization
Static batching groups requests together and processes them as a unit. The problem: if one request in a batch generates 500 tokens and another generates 10, the GPU sits idle waiting for the long request to finish before processing new ones.
Continuous (or dynamic) batching solves this by inserting new requests into the batch as soon as any request completes. GPU utilization jumps from 30-40% to 80-90%, and average latency drops because short requests aren't held hostage by long ones.
vLLM, TGI, and TensorRT-LLM all support continuous batching out of the box. If you're serving a model without it, you're leaving 2-3x throughput on the table.
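A toy simulation shows where the gain comes from. It counts decode steps only, assumes every request produces at least one token, ignores prefill, and treats one step as one token for every active request. The function names and the workload are made up for illustration.

```python
def static_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """Decode steps when a finished request's slot is refilled immediately."""
    pending = list(output_lens)
    active: list[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1  # one decode step produces one token per active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


workload = [500, 10, 500, 10]  # output lengths in tokens, arrival order
print("static:    ", static_batching_steps(workload, batch_size=2))
print("continuous:", continuous_batching_steps(workload, batch_size=2))
```

For this workload with two slots, static batching needs 1000 steps and continuous batching needs 510, because the short requests no longer wait for the long ones to finish.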
Multi-GPU Strategies: Tensor vs Pipeline Parallelism
When a model doesn't fit on a single GPU, you have two main options:
- Tensor parallelism (TP): Splits individual layers across GPUs. Each GPU holds a slice of every layer and they communicate via all-reduce at every layer. Requires high-bandwidth interconnect (NVLink). Best for 2-8 GPUs on the same node. Latency stays roughly constant as you add GPUs.
- Pipeline parallelism (PP): Assigns different layers to different GPUs. GPU 1 does layers 0-19, GPU 2 does 20-39, etc. Requires less interconnect bandwidth but introduces "pipeline bubbles" (idle time between stages). Best for cross-node setups.
For most production setups in 2026: use TP=4 or TP=8 within a single node, and PP only when scaling across nodes. vLLM makes this straightforward with --tensor-parallel-size.
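The two splitting strategies can be sketched in plain Python. This is a conceptual illustration, not how any framework shards weights. The function names are hypothetical, the tensor-parallel example uses a column split whose shards are concatenated (an all-gather), and it assumes the output dimension divides evenly by the number of GPUs.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[tuple[int, int]]:
    """Pipeline parallelism: contiguous (first_layer, last_layer) ranges per GPU."""
    base, extra = divmod(num_layers, num_stages)
    ranges, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def matmul(x: list[float], w: list[list[float]]) -> list[float]:
    """Vector-matrix product; w has shape (len(x), n_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def column_parallel_matmul(x, w, tp_size: int) -> list[float]:
    """Tensor parallelism: each rank holds a slice of the columns of w."""
    shard = len(w[0]) // tp_size
    out = []
    for rank in range(tp_size):
        cols = range(rank * shard, (rank + 1) * shard)
        w_shard = [[row[j] for j in cols] for row in w]  # this rank's weights
        out.extend(matmul(x, w_shard))                   # gather partial outputs
    return out


print(partition_layers(80, 4))
x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
print(matmul(x, w), column_parallel_matmul(x, w, tp_size=2))
```

With 80 layers and 4 stages, the first stage gets layers 0-19 and the second gets 20-39, matching the example above. In the tensor-parallel case every rank takes part in every layer, which is why that strategy depends on a fast interconnect.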
The Inference Framework Landscape
| Framework | Strengths | Weaknesses | When to Use |
|---|---|---|---|
| vLLM | PagedAttention, broad model support, active community, simple API | Slightly lower throughput than TRT-LLM on NVIDIA | Default choice for production serving |
| TensorRT-LLM | Maximum throughput on NVIDIA GPUs, deep H100/H200 optimization | Complex setup, NVIDIA-only, narrower model support | Maximum throughput on NVIDIA hardware |
| TGI | HuggingFace ecosystem, good documentation, easy deployment | Performance gap vs vLLM/TRT-LLM for large models | HuggingFace-centric workflows |
| llama.cpp / Ollama | CPU inference, GGUF quantization, runs anywhere | Not designed for high-throughput serving | Local development, edge deployment |
| SGLang | RadixAttention, structured output optimization, fast prefix caching | Younger project, smaller community | Structured generation, agentic workloads |
A Real-World Optimization Workflow
When you inherit a slow, expensive LLM deployment, here's the sequence that works:
- Profile first. Use nvidia-smi, PyTorch Profiler, or Nsight Systems to identify the actual bottleneck. Is it memory (OOM at batch size 4)? Bandwidth (low GPU utilization during decoding)? Compute (slow prefill)?
- Enable continuous batching. Switch from static batching to vLLM or TGI. This alone often doubles throughput with negligible quality risk.
- Quantize. FP8 or INT8 first. Measure quality on your eval suite. If it holds, you've cut memory in half and can double your batch size.
- Enable prefix caching. If you have a shared system prompt (most production apps do), this is free throughput.
- Tune batch size and concurrency. Increase max batch size until you hit the throughput/latency tradeoff your SLA requires. More concurrent requests = better GPU utilization but higher per-request latency.
- Consider speculative decoding. If your latency SLA is tight and you're serving individual users (not batch), this can cut per-request generation time by 2-3x. It speeds up decoding, not prefill, so time-to-first-token is largely unaffected.
- Scale out. If one GPU (or node) isn't enough, add tensor parallelism within nodes and pipeline parallelism across nodes.
The 80/20 rule of inference optimization
Steps 1-4 deliver 80% of the gains with minimal risk. Steps 5-7 require careful benchmarking and may introduce regressions. Don't skip to "advanced" techniques before nailing the fundamentals.
Who's Hiring Inference Engineers
Inference optimization has gone from a niche specialization to one of the most in-demand skills in AI engineering. Companies building inference infrastructure are hiring aggressively:
- Anthropic — 391 open roles including inference and systems engineering positions
- OpenAI — 698 open roles across research and engineering
- Fireworks AI — dedicated inference platform, building the fastest LLM serving stack
- Together AI — 55 open roles focused on efficient inference and open-source models
- Modal — serverless GPU infrastructure for inference workloads
- Cerebras — 90 open roles, wafer-scale inference with record-breaking token/s
- Mistral AI — 163 open roles at the leading European AI lab
Beyond pure-play inference companies, every organization deploying LLMs needs someone who can optimize their serving stack. Startups building AI agents and RAG systems are particularly hungry for this skill set.
Skills to Build
If you want to work in this space, here's what matters most:
- Transformer architecture internals. You need to understand multi-head attention, KV caching, and the prefill/decode distinction at a mechanical level — not just the paper-reading level.
- GPU memory hierarchy. Know the difference between registers, shared memory (SRAM), L2 cache, and HBM. Understand why a kernel that's compute-bound behaves differently from one that's memory-bound. You don't need to write CUDA kernels, but you need to understand why they matter.
- At least one inference framework. vLLM is the best starting point. Deploy a model, benchmark it, tune it. Read the source code — PagedAttention is remarkably elegant.
- Quantization hands-on. Quantize a model with AWQ. Compare quality on a benchmark suite. Understand the calibration process and why some layers are more sensitive than others.
- Benchmarking and profiling. Learn to measure time-to-first-token (TTFT), inter-token latency (ITL), and throughput (tokens/s) under realistic load patterns. Use tools like genai-perf from NVIDIA's Triton project.
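The three latency metrics named in the list fall out of per-token timestamps. The sketch below assumes you have recorded when the request was sent and when each streamed token arrived. The function name and dictionary keys are illustrative, not part of any benchmarking tool.

```python
def latency_metrics(request_sent_at: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, mean ITL, and per-request throughput from arrival timestamps (seconds)."""
    ttft = token_times[0] - request_sent_at
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    total = token_times[-1] - request_sent_at
    return {
        "ttft_s": ttft,                            # time to first token
        "itl_s": itl,                              # mean gap between tokens
        "tokens_per_s": len(token_times) / total,  # end-to-end throughput
    }


# Made-up timestamps: first token after 0.5 s, then one token every 0.1 s.
metrics = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

TTFT is dominated by queueing and prefill, while ITL reflects decode speed. Reporting them separately tells you which phase a given optimization actually improved.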