If you're deploying LLMs in production in 2026, inference is almost certainly your biggest line item. Training gets the headlines, but inference is where the money actually goes — running models 24/7, at scale, for every user request. A single unoptimized 70B model can cost $100+/hour on A100s. Optimize that same model properly, and you're looking at $15-20/hour for equivalent throughput.

This isn't about squeezing out marginal gains. The difference between naive and optimized inference is often 5-10x in cost and 3-5x in latency. For ML and AI engineers, understanding these techniques isn't optional anymore — it's the skill that determines whether your model ships or stays in a notebook.

5-10x: cost reduction with optimization
80-90%: GPU utilization with batching

The Inference Tax: Why This Matters Now

Training a frontier model costs $50-200M. Serving it costs more — within months. OpenAI processes billions of tokens per day across ChatGPT, the API, and enterprise deployments. Anthropic, Google DeepMind, and every company building on LLMs faces the same math: inference compute dominates total cost of ownership.

The bottleneck is counterintuitive. LLM inference during token generation is memory-bandwidth bound, not compute-bound. Your GPU sits there, largely idle, waiting for weights and KV cache data to stream from HBM. An H100 has 3.35 TB/s of memory bandwidth but 989 TFLOPS of FP16 compute — during autoregressive decoding, you're barely using 10-20% of that compute capacity.
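A back-of-the-envelope roofline estimate makes the imbalance concrete. The sketch below uses the H100 figures quoted above and some loud simplifications: every weight is read once per decode step, each parameter costs about 2 FLOPs per sequence in the batch, and KV cache traffic is ignored. The function name and the 8B-parameter example model are illustrative assumptions, not measurements.

```python
# Rough roofline estimate for a single decode step (illustrative numbers only).
H100_BANDWIDTH_BYTES_PER_S = 3.35e12  # HBM bandwidth quoted in the text
H100_FP16_FLOPS = 989e12              # FP16 compute quoted in the text


def decode_step_estimate(n_params, bytes_per_param, batch_size):
    """Compare time spent streaming weights with time spent on math.

    Assumes weights are read once per step and each parameter costs
    ~2 FLOPs per sequence; KV cache reads are ignored.
    """
    weight_bytes = n_params * bytes_per_param
    flops = 2.0 * n_params * batch_size
    memory_time = weight_bytes / H100_BANDWIDTH_BYTES_PER_S
    compute_time = flops / H100_FP16_FLOPS
    return {
        "memory_ms": memory_time * 1e3,
        "compute_ms": compute_time * 1e3,
        "bound": "memory" if memory_time > compute_time else "compute",
        "compute_utilization": compute_time / max(memory_time, compute_time),
    }


# Hypothetical 8B-parameter model in FP16 at a few batch sizes.
for batch in (1, 32, 512):
    est = decode_step_estimate(8e9, 2, batch)
    print(batch, est["bound"], f"{est['compute_utilization']:.1%}")
```

Under these assumptions the crossover batch size is simply compute divided by bandwidth (about 295 for FP16 weights), which is why batching shows up again and again in the techniques that follow.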

Every technique in this guide attacks the same fundamental problem: reduce the amount of data that needs to move through memory, and make better use of the data that does move.

KV Cache: The Hidden Memory Hog

The Key-Value cache is the single most important concept in inference optimization. During autoregressive generation, every new token attends to all previous tokens. Without caching, you'd recompute the key and value projections for the entire sequence at every step, so total work grows quadratically with sequence length and long-context generation becomes impractical.

The KV cache stores the key and value projections from every layer for every token generated so far. For a Llama 3 70B model with 80 layers and 64 attention heads, generating a 4096-token sequence requires:

    // KV cache size calculation
    layers   = 80
    heads    = 64 (8 KV heads with GQA)
    head_dim = 128
    seq_len  = 4096
    dtype    = float16 (2 bytes)

    // Per-token KV size: 2 * layers * kv_heads * head_dim * dtype
    per_token = 2 * 80 * 8 * 128 * 2 = 327,680 bytes ~ 320 KB

    // Full sequence KV cache
    total = 320 KB * 4096 = ~1.3 GB per sequence

    // With batch size 32:
    batch_kv = 1.3 GB * 32 = ~41 GB   // often exceeds model weights!

That's 41 GB just for the KV cache with a modest batch size — often more than the quantized model weights themselves. This is why KV cache optimization matters so much.
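For capacity planning it helps to have the same arithmetic as a small function. This is a minimal sketch using the configuration from the calculation above; the function name `kv_cache_bytes` is made up for illustration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size=1, dtype_bytes=2):
    """Bytes of KV cache for `batch_size` sequences of `seq_len` tokens."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return per_token * seq_len * batch_size


GIB = 1024 ** 3

# Llama 3 70B-style configuration from the text: 80 layers, 8 KV heads, head_dim 128.
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)
per_sequence = kv_cache_bytes(80, 8, 128, seq_len=4096)
per_batch = kv_cache_bytes(80, 8, 128, seq_len=4096, batch_size=32)

print(f"per token:    {per_token / 1024:.0f} KiB")
print(f"per sequence: {per_sequence / GIB:.2f} GiB")
print(f"batch of 32:  {per_batch / GIB:.1f} GiB")
```

The exact batch figure is 40 GiB (about 43 GB in decimal units), which agrees with the rounded estimate above.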

PagedAttention (vLLM)

Traditional KV cache allocation reserves contiguous memory for the maximum sequence length, even if most requests are short. PagedAttention, introduced by vLLM, borrows the idea of virtual memory paging from operating systems: KV cache is allocated in fixed-size blocks ("pages") and mapped via a page table. Memory waste drops from 60-80% to near zero.

In practice, this means you can serve 2-3x more concurrent requests on the same GPU. PagedAttention is now the default in vLLM, and variants have been adopted by most serving frameworks.
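The paging idea can be illustrated with a toy allocator. This is not vLLM's implementation; the class and method names are hypothetical, and real systems also handle eviction, copy-on-write sharing, and the GPU-side attention kernel that reads through the block table.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence maps to a list of physical block ids."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # all current blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def slot(self, seq_id, position):
        """Translate a logical token position to (physical block, offset)."""
        block = self.block_tables[seq_id][position // self.block_size]
        return block, position % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):  # a 20-token sequence needs only 2 blocks, not a max-length slab
    alloc.append_token("request-1")
print(alloc.block_tables["request-1"], alloc.slot("request-1", 17))
```

Because blocks are handed out on demand, the only wasted space is the unused tail of each sequence's last block.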

Prefix Caching

If many requests share a common prefix (system prompt, few-shot examples, shared context), prefix caching computes the KV cache for that prefix once and reuses it across requests. For a 2000-token system prompt served to 1000 users, you save roughly 2 million tokens of redundant prefill computation. Every major serving framework now supports this, and it's particularly impactful for RAG pipelines where the same retrieved documents appear in multiple requests.
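One common way to detect shared prefixes is to hash the prompt in fixed-size blocks, chaining each hash to the one before it so that a match implies the whole preceding prefix matches too. The sketch below shows only that bookkeeping, under the assumption of a 16-token block and no eviction; the names are hypothetical and no real framework's API is being reproduced.

```python
import hashlib

BLOCK_SIZE = 16


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash every full block together with the hash of everything before it."""
    hashes, prev = [], b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    """Tracks which prefix blocks already have KV entries (no eviction)."""

    def __init__(self):
        self.cached = set()

    def lookup_and_insert(self, token_ids):
        """Return how many leading prompt tokens could reuse cached KV."""
        reusable_blocks = 0
        still_matching = True
        for h in block_hashes(token_ids):
            if still_matching and h in self.cached:
                reusable_blocks += 1
            else:
                still_matching = False
                self.cached.add(h)
        return reusable_blocks * BLOCK_SIZE


cache = PrefixCache()
system_prompt = list(range(2000))  # stand-in for a 2000-token system prompt
print(cache.lookup_and_insert(system_prompt + [9001, 9002]))  # first request: nothing cached
print(cache.lookup_and_insert(system_prompt + [9003]))        # second request reuses the prefix
```

Only full blocks are cached here, so a trailing partial block is always recomputed.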

Grouped-Query Attention (GQA)

GQA reduces KV cache size by sharing key-value heads across multiple query heads. Llama 3 70B uses 8 KV heads shared across 64 query heads, an 8x reduction in KV cache size compared to standard multi-head attention. This is a model architecture decision, not a serving optimization, but understanding it is crucial for capacity planning.
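A small NumPy reference shows how the head sharing works. This is a sketch for intuition, assuming consecutive query heads share a KV head and using a plain causal mask; production kernels avoid physically repeating K and V.

```python
import numpy as np


def grouped_query_attention(q, k, v):
    """Causal attention with q: (n_q_heads, seq, d) and k, v: (n_kv_heads, seq, d)."""
    n_q_heads, seq_len, head_dim = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into KV groups"
    group = n_q_heads // n_kv_heads

    # Each KV head serves `group` consecutive query heads.
    k_rep = np.repeat(k, group, axis=0)
    v_rep = np.repeat(v, group, axis=0)

    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # tokens cannot attend to the future

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_rep


rng = np.random.default_rng(0)
q = rng.normal(size=(64, 10, 128))  # 64 query heads
k = rng.normal(size=(8, 10, 128))   # only 8 KV heads need to be cached
v = rng.normal(size=(8, 10, 128))
print(grouped_query_attention(q, k, v).shape, q.shape[0] // k.shape[0])
```

Only `k` and `v` are stored in the KV cache, so their leading dimension (8 instead of 64) is where the memory saving comes from.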

Quantization: Doing More With Less

Quantization reduces model precision from FP16/BF16 (16 bits per parameter) to INT8 (8 bits), INT4 (4 bits), or even lower. It's the single highest-impact optimization you can apply to any model — smaller weights mean less memory, less bandwidth, and faster inference.

| Format | Bits | Memory | Quality Loss | Best For |
|---|---|---|---|---|
| FP8 (E4M3) | 8 | 2x reduction | Negligible | H100/H200 production serving |
| INT8 (W8A8) | 8 | 2x reduction | <1% on benchmarks | General production serving |
| INT4 (GPTQ) | 4 | 4x reduction | 1-3% on benchmarks | Memory-constrained GPUs |
| INT4 (AWQ) | 4 | 4x reduction | 1-2% on benchmarks | Better quality than GPTQ |
| GGUF (llama.cpp) | 2-8 | 2-8x reduction | Varies by quant level | Local/CPU inference |

The practical rule of thumb

Start with FP8 on H100/H200 or INT8 (W8A8) on A100: you'll barely notice quality degradation but cut memory in half. Only go to INT4 when you must fit a model on a smaller GPU. Below INT4, expect meaningful quality loss on complex reasoning tasks.

GPTQ vs AWQ: Both are post-training quantization methods for INT4. AWQ (Activation-Aware Weight Quantization) generally produces better quality by protecting the 1% of "salient" weights that disproportionately affect model output. GPTQ is faster to calibrate but slightly less accurate. For new deployments in 2026, AWQ is the default choice.
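To see why lower bit widths cost accuracy, it helps to look at the simplest possible scheme: symmetric round-to-nearest quantization with one scale per group of weights. This is deliberately neither GPTQ nor AWQ (both add calibration on top of this basic idea), and the group size of 128 and the function names are assumptions for illustration.

```python
import numpy as np


def quantize_groupwise(weights, bits=4, group_size=128):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1                   # 7 for INT4, 127 for INT8
    groups = weights.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)  # guard against all-zero groups
    q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
    return q, scales


def dequantize_groupwise(q, scales, shape):
    """Recover an approximate float tensor from integer codes and scales."""
    return (q.astype(np.float32) * scales).reshape(shape)


rng = np.random.default_rng(0)
w = rng.normal(size=(256, 512)).astype(np.float32)  # stand-in weight matrix
for bits in (8, 4):
    q, scales = quantize_groupwise(w, bits=bits)
    w_hat = dequantize_groupwise(q, scales, w.shape)
    rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
    print(f"INT{bits}: relative error {rel_err:.4f}")
```

The INT4 codes are stored in `int8` here for simplicity; real kernels pack two 4-bit values per byte to realize the 4x memory reduction.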

Speculative Decoding: Getting Multiple Tokens Per Step

Standard autoregressive decoding generates one token per forward pass through the full model. Speculative decoding breaks this bottleneck using a clever asymmetry: because decoding is memory-bandwidth bound, verifying K tokens in one parallel forward pass costs nearly the same as generating a single token.

The process works like this:

  1. Draft phase: A small, fast model (1-7B parameters) generates K candidate tokens autoregressively (K is typically 3-8).
  2. Verify phase: The full model processes all K candidates in a single forward pass, computing the probability of each candidate token given the previous ones.
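  3. Accept/reject: Starting from the first candidate, accept each token with probability min(1, p/q), where p is its probability under the full model and q its probability under the draft model. At the first rejection, discard the remaining candidates and resample that position from the normalized residual distribution max(0, p - q).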

If the draft model is good (and for predictable sequences like code, it usually is), you accept 70-90% of candidates. That means 3-6 tokens per forward pass of the main model, with mathematically identical output distribution to standard decoding.
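The accept/reject step is the part that preserves the output distribution, so it is worth seeing in code. The sketch below implements the standard rejection-sampling rule (accept with probability min(1, p/q), otherwise resample from the normalized positive part of p - q). It assumes the draft and target probabilities have already been computed; the function name and array layout are illustrative.

```python
import numpy as np


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Accept or reject K drafted tokens and return the tokens emitted this step.

    draft_tokens: K token ids sampled from the draft model
    draft_probs:  (K, vocab) draft distributions q_i used to sample them
    target_probs: (K + 1, vocab) full-model distributions p_i; the extra row
                  provides a bonus token when every draft token is accepted
    """
    vocab = target_probs.shape[1]
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(int(tok))  # accepted: keep the draft token
            continue
        # Rejected: resample from the residual and drop the remaining drafts.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        emitted.append(int(rng.choice(vocab, p=residual)))
        return emitted
    # All K drafts accepted: sample one extra token from the full model.
    emitted.append(int(rng.choice(vocab, p=target_probs[-1])))
    return emitted


rng = np.random.default_rng(0)
q = np.array([[0.7, 0.2, 0.1]])                    # draft distribution, K = 1
p = np.array([[0.2, 0.5, 0.3], [0.2, 0.5, 0.3]])   # full-model distributions
draft = [int(rng.choice(3, p=q[0]))]
print(verify_draft(draft, q, p, rng))
```

Every call emits at least one token, so in the worst case the main model still makes the same progress as ordinary decoding, minus the cost of running the draft model.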

When not to use speculative decoding

Speculative decoding improves latency (time to complete a single request) but can reduce throughput (total tokens per second across all users). If you're serving many concurrent users and GPU utilization is already high, the draft model's compute competes with the main model. Use it for latency-sensitive, low-concurrency scenarios — interactive coding, real-time agents, streaming responses.

Flash Attention: Memory-Efficient Attention

Flash Attention rewrites the attention computation to avoid materializing the full N×N attention matrix in GPU HBM. Instead of computing attention as three separate operations (QK^T, softmax, multiply by V), it fuses them into a single kernel that works on blocks of the attention matrix in fast SRAM.

The result: attention computation goes from O(N²) memory to O(N) — critical for long-context models. Flash Attention 2 and its successors are now the default in every major framework. You don't need to implement it — but you need to know it's there, because if you're seeing OOM errors on long sequences, missing Flash Attention is often the cause.
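The trick that makes block-wise processing possible is an "online" softmax: keep a running row maximum and running sum, and rescale the partial output whenever the maximum changes. The NumPy sketch below demonstrates only that numerical idea. It is not the fused GPU kernel, it tiles over keys but not queries, and it omits causal masking; the block size of 64 is an arbitrary assumption.

```python
import numpy as np


def naive_attention(q, k, v):
    """Reference implementation that materializes the full (N, N) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


def tiled_attention(q, k, v, block=64):
    """Same result, but only an (N, block) slice of scores exists at any time."""
    n, d = q.shape
    out = np.zeros((n, v.shape[-1]))
    row_max = np.full((n, 1), -np.inf)  # running max of scores per query row
    row_sum = np.zeros((n, 1))          # running softmax denominator
    for start in range(0, k.shape[0], block):
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        scores = q @ k_blk.T / np.sqrt(d)
        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)  # rescales earlier partial results
        p = np.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    return out / row_sum


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(200, 32)) for _ in range(3))
print(np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))
```

The extra state per query row is just two scalars, which is how the memory footprint drops from quadratic to linear in sequence length.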

Continuous Batching: Maximizing GPU Utilization

Static batching groups requests together and processes them as a unit. The problem: if one request in a batch generates 500 tokens and another generates 10, the GPU sits idle waiting for the long request to finish before processing new ones.

Continuous (or dynamic) batching solves this by inserting new requests into the batch as soon as any request completes. GPU utilization jumps from 30-40% to 80-90%, and average latency drops because short requests aren't held hostage by long ones.

vLLM, TGI, and TensorRT-LLM all support continuous batching out of the box. If you're serving a model without it, you're leaving 2-3x throughput on the table.
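A toy simulation shows where the wasted decode steps go. It counts forward passes only and assumes each request's output length is known in advance and that prefill is free, which real schedulers cannot assume; the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request has finished."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Free slots are refilled from the queue after every decode step."""
    waiting = deque(output_lens)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one forward pass produces one token per active request
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: alternating long (500-token) and short (10-token) outputs.
lens = [500, 10] * 8
total_tokens = sum(lens)
for name, fn in (("static", static_batching_steps), ("continuous", continuous_batching_steps)):
    steps = fn(lens, batch_size=4)
    print(f"{name}: {steps} steps, slot utilization {total_tokens / (steps * 4):.0%}")
```

In the static case every batch is pinned to its 500-token request while the short requests' slots sit empty, which is the idle time continuous batching reclaims.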

Multi-GPU Strategies: Tensor vs Pipeline Parallelism

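When a model doesn't fit on a single GPU, you have two main options: tensor parallelism (TP), which splits each layer's weight matrices across GPUs and needs a fast interconnect because the GPUs exchange partial results inside every layer, and pipeline parallelism (PP), which places consecutive groups of layers on different GPUs and only passes activations between stages.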

For most production setups in 2026: use TP=4 or TP=8 within a single node, and PP only when scaling across nodes. vLLM makes this straightforward with --tensor-parallel-size.
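The core of tensor parallelism fits in a few lines of NumPy. The sketch below splits a two-layer feed-forward block the usual way (first weight matrix by columns, second by rows) and sums the partial outputs, with the sum standing in for the all-reduce that real GPUs perform over the interconnect. It illustrates the math only and is not how any particular framework implements it.

```python
import numpy as np


def mlp_reference(x, w1, w2):
    """Unsharded feed-forward block: ReLU(x @ w1) @ w2."""
    return np.maximum(x @ w1, 0.0) @ w2


def mlp_tensor_parallel(x, w1, w2, tp_size):
    """Same block with weights sharded across `tp_size` simulated devices."""
    w1_shards = np.split(w1, tp_size, axis=1)  # column-parallel first layer
    w2_shards = np.split(w2, tp_size, axis=0)  # row-parallel second layer
    # Each "device" computes its slice independently; ReLU is elementwise,
    # so no communication is needed between the two matmuls.
    partials = [np.maximum(x @ a, 0.0) @ b for a, b in zip(w1_shards, w2_shards)]
    return sum(partials)  # stands in for the all-reduce across devices


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))
w1 = rng.normal(size=(64, 256))
w2 = rng.normal(size=(256, 64))
print(np.allclose(mlp_reference(x, w1, w2), mlp_tensor_parallel(x, w1, w2, tp_size=4)))
```

Each simulated device holds only a quarter of the weights, but every layer ends with a collective sum, which is why TP is usually kept within a node with fast GPU-to-GPU links.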

The Inference Framework Landscape

| Framework | Strengths | Weaknesses | When to Use |
|---|---|---|---|
| vLLM | PagedAttention, broad model support, active community, simple API | Slightly lower throughput than TRT-LLM on NVIDIA | Default choice for production serving |
| TensorRT-LLM | Maximum throughput on NVIDIA GPUs, deep H100/H200 optimization | Complex setup, NVIDIA-only, narrower model support | Maximum throughput on NVIDIA hardware |
| TGI | HuggingFace ecosystem, good documentation, easy deployment | Performance gap vs vLLM/TRT-LLM for large models | HuggingFace-centric workflows |
| llama.cpp / Ollama | CPU inference, GGUF quantization, runs anywhere | Not designed for high-throughput serving | Local development, edge deployment |
| SGLang | RadixAttention, structured output optimization, fast prefix caching | Younger project, smaller community | Structured generation, agentic workloads |

A Real-World Optimization Workflow

When you inherit a slow, expensive LLM deployment, here's the sequence that works:

  1. Profile first. Use nvidia-smi, PyTorch Profiler, or Nsight Systems to identify the actual bottleneck. Is it memory (OOM at batch size 4)? Bandwidth (low GPU utilization during decoding)? Compute (slow prefill)?
  2. Enable continuous batching. Switch from static batching to vLLM or TGI. This alone often doubles throughput with negligible quality risk.
  3. Quantize. FP8 or INT8 first. Measure quality on your eval suite. If it holds, you've cut memory in half and can double your batch size.
  4. Enable prefix caching. If you have a shared system prompt (most production apps do), this is free throughput.
  5. Tune batch size and concurrency. Increase max batch size until you hit the throughput/latency tradeoff your SLA requires. More concurrent requests = better GPU utilization but higher per-request latency.
  6. Consider speculative decoding. If your latency SLA is tight and you're serving individual users (not batch), this can cut per-token decode latency by 2-3x. It does not improve time-to-first-token, which is determined by prefill.
  7. Scale out. If one GPU (or node) isn't enough, add tensor parallelism within nodes and pipeline parallelism across nodes.

The 80/20 rule of inference optimization

Steps 1-4 deliver 80% of the gains with minimal risk. Steps 5-7 require careful benchmarking and may introduce regressions. Don't skip to "advanced" techniques before nailing the fundamentals.

Who's Hiring Inference Engineers

Inference optimization has gone from a niche specialization to one of the most in-demand skills in AI engineering. Companies building inference infrastructure are hiring aggressively.

Beyond pure-play inference companies, every organization deploying LLMs needs someone who can optimize their serving stack. Startups building AI agents and RAG systems are particularly hungry for this skill set.

Skills to Build

If you want to work in this space, here's what matters most:

vLLM, TensorRT-LLM, Flash Attention, CUDA, PyTorch, AWQ, GPTQ, llama.cpp, Triton, SGLang

Frequently Asked Questions

What's the biggest bottleneck in LLM inference?
Memory bandwidth, not compute. During the decode phase, LLM inference is memory-bound — the GPU spends most of its time waiting for model weights and KV cache data to load from HBM. This is why quantization and KV cache optimization have outsized impact on performance.
Should I use vLLM or TensorRT-LLM for production?
vLLM is the safer default — broader model support, simpler setup, active community, and performance within 10-15% of TensorRT-LLM for most workloads. Choose TensorRT-LLM only when you need absolute maximum throughput on NVIDIA hardware and can invest the engineering time in its more complex build and deployment pipeline.
How much quality do I lose with INT4 quantization?
With AWQ, typically 1-2% on standard benchmarks (MMLU, HumanEval, GSM8K). However, quality loss is task-dependent — creative writing and complex reasoning are more sensitive than classification or extraction. Always measure on your specific use case with your own eval suite, not just published benchmarks.
Can I run a 70B model on a single GPU?
Yes, with INT4 quantization. A 70B model at INT4 requires ~35 GB of memory for weights alone, which fits on an A100 80GB or H100 80GB with room for KV cache. At FP16, you need at least 2x A100 80GB with tensor parallelism. For local development, llama.cpp can run INT4 70B models on a 64GB Mac with Apple Silicon.
What salary can inference engineers expect?
Based on our research across AI companies, ML/inference engineers command $180k-$350k+ total compensation (base + equity) at well-funded AI companies. Senior inference engineers at frontier labs can exceed $400k. The premium reflects the direct revenue impact: a 2x improvement in serving efficiency saves millions per year in GPU costs.
What's the difference between prefill and decode in LLM inference?
Prefill processes the entire input prompt in parallel (compute-bound, like training). Decode generates output tokens one at a time (memory-bandwidth-bound). They have completely different performance characteristics and bottlenecks, which is why disaggregated serving — running prefill and decode on separate GPU pools — is becoming a popular architecture for high-throughput deployments.