What is the biggest bottleneck in LLM inference?

Memory bandwidth, not compute. LLM inference during the decode phase is memory-bound — the GPU spends most of its time waiting for model weights and KV cache data to be loaded from HBM, not performing matrix multiplications. This is why techniques like quantization (reducing data size) and KV cache optimization (reducing memory reads) have such outsized impact on throughput and latency.

What is KV cache and why does it matter for inference?

The KV (Key-Value) cache stores the attention keys and values from all previous tokens so they don't need to be recomputed at each generation step. Without it, generating each token would require reprocessing the entire sequence. The KV cache grows linearly with sequence length and batch size: for a 70B-parameter model with a 4096-token context at batch size 32, it can consume 40+ GB of GPU memory, often more than the quantized model weights themselves.

What quantization format should I use for production LLM serving?

For production serving, INT8 (W8A8) offers a good balance of quality and speed: typically less than 1% quality degradation with a 2x memory reduction relative to FP16. GPTQ and AWQ are weight-only methods used mainly for INT4 (W4A16) when memory is tight. For edge devices or local inference, INT4 (W4A16) in GGUF format works well with llama.cpp. FP8 is becoming the standard for high-throughput serving on H100/H200 GPUs, offering near-FP16 quality with significant memory and compute savings.

What is speculative decoding and when should I use it?

Speculative decoding uses a small, fast 'draft' model to generate several candidate tokens one after another, then the main model verifies all of them in parallel in a single forward pass. If the draft model predicts correctly (which happens 70-90% of the time for common sequences), you get multiple tokens for the cost of one verification pass. Use it when latency matters more than throughput. It's especially effective for code generation and structured outputs, where sequences are predictable.

Which inference framework should I use: vLLM, TGI, or TensorRT-LLM?

vLLM is the best default for most production workloads — it has excellent PagedAttention support, broad model compatibility, and a simple API. TensorRT-LLM offers the highest throughput on NVIDIA GPUs but requires more setup and has a narrower model support range. TGI (Text Generation Inference by Hugging Face) is a good middle ground with a strong ecosystem. For local development, use llama.cpp or Ollama.

How much can inference optimization reduce costs?

A well-optimized inference stack can reduce serving costs by 5-10x compared to naive deployment. Quantization alone typically provides 2-4x memory savings (letting you serve on fewer or smaller GPUs). Continuous batching improves GPU utilization from 30-40% to 80-90%. Prefix caching eliminates redundant computation for common prompts. Combined, these techniques can bring the cost of serving a 70B model from $50-100/hour down to $10-20/hour on equivalent hardware.

What skills do I need for an LLM inference engineering role?

Core skills include: understanding transformer architecture internals, GPU memory hierarchy (registers, shared memory, L2, HBM), proficiency with PyTorch and at least one inference framework (vLLM, TensorRT-LLM), basic CUDA knowledge (you don't need to write kernels, but you need to understand why they matter), quantization techniques, and load testing/benchmarking. Companies like Anthropic, Fireworks AI, Together AI, and Cerebras actively hire for these roles.

LLM Inference Optimization: A Practical Guide for AI Engineers (2026)

If you're deploying LLMs in production in 2026, inference is almost certainly your biggest line item. Training gets the headlines, but inference is where the money actually goes — running models 24/7, at scale, for every user request. A single unoptimized 70B model can cost $100+/hour on A100s. Optimize that same model properly, and you're looking at $15-20/hour for equivalent throughput.

This isn't about squeezing out marginal gains. The difference between naive and optimized inference is often 5-10x in cost and 3-5x in latency. For ML and AI engineers, understanding these techniques isn't optional anymore — it's the skill that determines whether your model ships or stays in a notebook.

5-10x: cost reduction with optimization
80-90%: GPU utilization with batching
282+: ML/AI roles hiring now

The Inference Tax: Why This Matters Now

Training a frontier model costs $50-200M. Serving it costs more — within months. OpenAI processes billions of tokens per day across ChatGPT, the API, and enterprise deployments. Anthropic, Google DeepMind, and every company building on LLMs faces the same math: inference compute dominates total cost of ownership.

The bottleneck is counterintuitive. LLM inference during token generation is memory-bandwidth bound, not compute-bound. Your GPU sits there, largely idle, waiting for weights and KV cache data to stream from HBM. An H100 has 3.35 TB/s of memory bandwidth but 989 TFLOPS of FP16 compute — during autoregressive decoding, you're barely using 10-20% of that compute capacity.

Every technique in this guide attacks the same fundamental problem: reduce the amount of data that needs to move through memory, and make better use of the data that does move.
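To make the memory-bound claim concrete, here is a rough back-of-the-envelope sketch using the H100 figures quoted above. The 70B-parameter FP8 model (one byte per weight) is an assumed example, the function names are illustrative, and the estimate ignores KV cache reads and kernel overheads, so treat it as a lower bound rather than a benchmark.

```python
# Rough roofline estimate for decoding.
# Hardware numbers are the H100 figures quoted in the text; the model is an assumed example.

HBM_BANDWIDTH = 3.35e12    # bytes/s
PEAK_FP16_FLOPS = 989e12   # FLOP/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """FLOPs per byte a kernel needs before it stops being memory-bound."""
    return peak_flops / bandwidth


def decode_intensity(batch_size: int, bytes_per_param: float) -> float:
    """Approximate FLOPs per weight byte: 2 FLOPs (multiply + add) per parameter per sequence."""
    return 2 * batch_size / bytes_per_param


def decode_step_floor_ms(weight_bytes: float, bandwidth: float) -> float:
    """Lower bound on one decode step: every weight is read from HBM once."""
    return weight_bytes / bandwidth * 1e3


weight_bytes = 70e9 * 1  # assumption: 70B parameters stored in FP8 (1 byte each)
print(f"ridge point:       {ridge_point(PEAK_FP16_FLOPS, HBM_BANDWIDTH):.0f} FLOPs/byte")
print(f"batch-1 intensity: {decode_intensity(1, 1):.0f} FLOPs/byte")
print(f"decode floor:      {decode_step_floor_ms(weight_bytes, HBM_BANDWIDTH):.1f} ms/token")
```

At batch size 1 the workload performs a couple of FLOPs per byte read, while the hardware needs a few hundred FLOPs per byte to keep its compute units busy. Larger batches raise the intensity because the same weights are reused across more sequences.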

KV Cache: The Hidden Memory Hog

The Key-Value cache is the single most important concept in inference optimization. During autoregressive generation, every new token attends to all previous tokens. Without caching, you'd recompute the keys and values for the entire sequence at every step, a quadratic cost that makes long-context generation impractical.

The KV cache stores the key and value projections from every layer for every token processed so far (prompt and generated). For a Llama 3 70B model with 80 layers and 64 attention heads (8 of them KV heads), a 4096-token sequence requires:

# KV cache size calculation (Llama 3 70B, FP16 cache)
layers = 80
query_heads = 64
kv_heads = 8        # grouped-query attention: 8 KV heads serve 64 query heads
head_dim = 128
seq_len = 4096
dtype_bytes = 2     # float16

# Per-token KV size: 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # 327,680 bytes = 320 KiB

# Full-sequence KV cache
per_sequence = per_token * seq_len   # 1.25 GiB, ~1.3 GB per sequence

# With batch size 32
batch_kv = per_sequence * 32         # 40 GiB (~41 GB using the rounded 1.3 GB), often exceeds quantized weights

That's 41 GB just for the KV cache with a modest batch size — often more than the quantized model weights themselves. This is why KV cache optimization matters so much.

PagedAttention (vLLM)

Traditional KV cache allocation reserves contiguous memory for the maximum sequence length, even if most requests are short. PagedAttention, introduced by vLLM, borrows the idea of virtual memory paging from operating systems: KV cache is allocated in fixed-size blocks ("pages") and mapped via a page table. Memory waste drops from 60-80% to near zero.

In practice, this means you can serve 2-3x more concurrent requests on the same GPU. PagedAttention is now the default in vLLM, and variants have been adopted by most serving frameworks.
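The block-table idea can be sketched in a few lines. This is a toy allocator in the spirit of PagedAttention, not vLLM's actual implementation: the class name, method names, and the 16-token block size are all assumptions made for illustration.

```python
# Toy block-table allocator in the spirit of PagedAttention.
# BLOCK_SIZE and all class/method names are illustrative, not vLLM's API.

BLOCK_SIZE = 16  # tokens per KV block (assumed)


class PagedKVAllocator:
    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))     # physical block ids
        self.block_tables: dict[str, list[int]] = {}   # request id -> physical blocks
        self.lengths: dict[str, int] = {}              # request id -> tokens stored

    def append_token(self, request_id: str) -> int:
        """Reserve room for one more token; return the physical block it lands in."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % BLOCK_SIZE == 0:                   # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real server would queue or preempt")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[length // BLOCK_SIZE]

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


allocator = PagedKVAllocator(num_blocks=8)
for _ in range(20):
    allocator.append_token("req-a")   # 20 tokens occupy 2 blocks
for _ in range(5):
    allocator.append_token("req-b")   # 5 tokens occupy 1 block
print(allocator.block_tables, len(allocator.free_blocks))
```

Because blocks are handed out only when a request actually needs them, the only waste is the unused tail of each request's last block, and a finished request's blocks become available to others immediately.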

Prefix Caching

If many requests share a common prefix (system prompt, few-shot examples, shared context), prefix caching computes the KV cache for that prefix once and reuses it across requests. For a 2000-token system prompt served to 1000 users, you save roughly 2 million tokens of redundant prefill computation. Every major serving framework now supports this. It's particularly impactful for RAG pipelines where the same retrieved documents appear in multiple requests.
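A minimal sketch of the mechanism, assuming the cache is organized in fixed-size blocks keyed by a hash of the whole prefix up to the end of each block. The `PrefixCache` class, the 4-token block size, and `compute_kv_block` (a stand-in for a real prefill pass) are hypothetical names, not any framework's API.

```python
# Toy prefix cache: each full block is keyed by the hash of every token up to its end,
# so a block is only reused when the entire preceding prefix matches.

BLOCK = 4  # tokens per cached block (assumed; real servers use larger blocks)


class PrefixCache:
    def __init__(self) -> None:
        self.blocks: dict[int, str] = {}   # prefix hash -> cached KV block (placeholder)
        self.computed_tokens = 0           # tokens that actually went through prefill

    def compute_kv_block(self, tokens: tuple) -> str:
        """Stand-in for running the model over these tokens."""
        self.computed_tokens += len(tokens)
        return f"kv{tokens}"

    def prefill(self, prompt: list[int]) -> int:
        """Prefill a prompt, reusing cached full blocks; return tokens served from cache."""
        reused = 0
        for start in range(0, len(prompt), BLOCK):
            block = tuple(prompt[start:start + BLOCK])
            key = hash(tuple(prompt[:start + len(block)]))  # identifies the whole prefix
            if len(block) == BLOCK and key in self.blocks:
                reused += BLOCK
                continue
            kv = self.compute_kv_block(block)
            if len(block) == BLOCK:        # only full blocks are cached
                self.blocks[key] = kv
        return reused


cache = PrefixCache()
system_prompt = list(range(8))                          # shared 8-token prefix
first = cache.prefill(system_prompt + [100, 101])       # cold: nothing reused
second = cache.prefill(system_prompt + [200, 201, 202]) # warm: prefix reused
print(first, second, cache.computed_tokens)
```

The second request only pays for its own three trailing tokens; the shared prefix is looked up rather than recomputed.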

Grouped-Query Attention (GQA)

GQA reduces KV cache size by sharing key-value heads across multiple query heads. Llama 3 70B uses 8 KV heads shared across 64 query heads, an 8x reduction in KV cache size compared to standard multi-head attention. This is a model architecture decision, not a serving optimization, but understanding it is crucial for capacity planning.
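For capacity planning, the two things worth being able to compute are which KV head a query head reads from and how much cache the grouping saves. The sketch below uses the 80-layer, 64-query-head, 8-KV-head, 128-dim figures from this guide; the function names are illustrative and the multi-head variant is hypothetical, included only for comparison.

```python
# Grouped-query attention bookkeeping.

def kv_head_for(query_head: int, num_query_heads: int, num_kv_heads: int) -> int:
    """Index of the KV head that a given query head reads from."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of K and V stored per token across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


mha = kv_bytes_per_token(layers=80, kv_heads=64, head_dim=128)  # hypothetical MHA variant
gqa = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)   # GQA with 8 KV heads
print(mha // 1024, "KiB vs", gqa // 1024, "KiB per token, ratio", mha // gqa)
print([kv_head_for(h, 64, 8) for h in (0, 7, 8, 63)])
```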

Quantization: Doing More With Less

Quantization reduces model precision from FP16/BF16 (16 bits per parameter) to INT8 (8 bits), INT4 (4 bits), or even lower. It's the single highest-impact optimization you can apply to any model — smaller weights mean less memory, less bandwidth, and faster inference.

Format	Bits	Memory (vs FP16)	Quality Loss	Best For
FP8 (E4M3)	8	2x reduction	Negligible	H100/H200 production serving
INT8 (W8A8)	8	2x reduction	<1% on benchmarks	General production serving
INT4 (GPTQ)	4	4x reduction	1-3% on benchmarks	Memory-constrained GPUs
INT4 (AWQ)	4	4x reduction	1-2% on benchmarks	Better quality than GPTQ
GGUF (llama.cpp)	2-8	2-8x reduction	Varies by quant level	Local/CPU inference

The practical rule of thumb

Start with FP8 on H100/H200 or INT8 (W8A8) on A100. You'll barely notice quality degradation but cut memory in half. Only go to INT4 when you must fit a model on a smaller GPU. Below INT4, expect meaningful quality loss on complex reasoning tasks.

GPTQ vs AWQ: Both are post-training quantization methods for INT4. AWQ (Activation-Aware Weight Quantization) generally produces better quality by protecting the 1% of "salient" weights that disproportionately affect model output. GPTQ is faster to calibrate but slightly less accurate. For new deployments in 2026, AWQ is the default choice.
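For intuition about what any of these formats does to a weight tensor, here is the simplest possible scheme: symmetric per-tensor absmax quantization to INT8. This is deliberately not GPTQ or AWQ, which use calibration data and per-group scales. The function names and sample weights are made up for illustration.

```python
# Symmetric per-tensor absmax quantization to INT8 (illustrative only).

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] plus a single floating-point scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in quantized]


weights = [0.02, -0.51, 0.33, 1.27, -0.08]   # made-up sample values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per weight is bounded by half the scale, and the scale is set by the largest weight. That is why a few outlier weights can hurt the precision of everything else, and why methods that treat salient weights specially tend to hold quality better at low bit widths.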

Speculative Decoding: Getting Multiple Tokens Per Step

Standard autoregressive decoding generates one token per forward pass through the full model. Speculative decoding breaks this bottleneck using a clever asymmetry: because decoding is memory-bound, verifying N tokens in parallel costs nearly the same as generating one.

The process works like this:

Draft phase: A small, fast model (1-7B parameters) generates K candidate tokens autoregressively (K is typically 3-8).
Verify phase: The full model processes all K candidates in a single forward pass, computing the probability of each candidate token given the previous ones.
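Accept/reject: Starting from the first candidate, accept each token with probability min(1, p/q), where p and q are the probabilities that the full model and the draft model assign to it. At the first rejection, discard the remaining candidates and resample that position from the normalized positive part of (p - q), the full model's distribution corrected for what the draft already proposed.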

If the draft model is good (and for predictable sequences like code, it usually is), you accept 70-90% of candidates. That means 3-6 tokens per forward pass of the main model, with mathematically identical output distribution to standard decoding.
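The accept/reject rule that keeps the output distribution identical to the full model's can be sketched as follows. This is a simplified illustration: distributions are plain Python lists over a tiny vocabulary, the function names are made up, and the extra "bonus" token that is usually sampled from the full model when every candidate is accepted is omitted.

```python
import random


def residual_distribution(p: list[float], q: list[float]) -> list[float]:
    """Normalized max(0, p - q): the distribution to resample from after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff] if total > 0 else list(p)


def verify(draft_tokens, draft_probs, target_probs, rng=random):
    """Return the tokens kept from one draft of K candidates.

    draft_probs[i] and target_probs[i] are full next-token distributions
    (lists over the vocabulary) at draft position i.
    """
    kept = []
    for token, q, p in zip(draft_tokens, draft_probs, target_probs):
        # Accept with probability min(1, p(token) / q(token)).
        if rng.random() < min(1.0, p[token] / q[token]):
            kept.append(token)
            continue
        # First rejection: replace the token with a sample from the residual, then stop.
        residual = residual_distribution(p, q)
        kept.append(rng.choices(range(len(residual)), weights=residual)[0])
        break
    return kept


uniform = [0.5, 0.5]
print(verify([0, 1, 0], [uniform] * 3, [uniform] * 3))   # draft matches target: all kept
print(verify([1], [uniform], [[1.0, 0.0]]))              # target never emits token 1
```

When the draft and target distributions agree, every candidate is accepted. When the draft proposes something the target would never produce, the candidate is rejected and replaced, so each verification pass still yields at least one valid token.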

When not to use speculative decoding

Speculative decoding improves latency (time to complete a single request) but can reduce throughput (total tokens per second across all users). If you're serving many concurrent users and GPU utilization is already high, the draft model's compute competes with the main model. Use it for latency-sensitive, low-concurrency scenarios — interactive coding, real-time agents, streaming responses.

Flash Attention: Memory-Efficient Attention

Flash Attention rewrites the attention computation to avoid materializing the full N×N attention matrix in GPU HBM. Instead of computing attention as three separate operations (QK^T, softmax, multiply by V), it fuses them into a single kernel that works on blocks of the attention matrix in fast SRAM.

The result: attention computation goes from O(N²) memory to O(N) — critical for long-context models. Flash Attention 2 and its successors are now the default in every major framework. You don't need to implement it — but you need to know it's there, because if you're seeing OOM errors on long sequences, missing Flash Attention is often the cause.
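The trick that makes block-wise processing possible is the "online softmax": keep a running maximum and running sum, and rescale earlier partial results whenever a new block raises the maximum. Below is a minimal single-query sketch in pure Python. It is not the actual Flash Attention kernel (no tiling over queries, no SRAM management), the 1/sqrt(d) scaling is left out for brevity, and the function names are illustrative.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialize every score, then softmax, then weight the values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def blockwise_attention(q, keys, values, block_size=2):
    """Same result, but only one block of scores exists at a time."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim                                # unnormalized output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)    # shrink earlier partial results
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [0.9, -0.3], [0.4, 0.4], [-0.2, 0.7], [1.5, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
print(naive_attention(q, keys, values))
print(blockwise_attention(q, keys, values))
```

Both functions produce the same output up to floating-point rounding, but the block-wise version never holds more than `block_size` scores at once, which is what lets the real kernel keep its working set in on-chip SRAM.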

Continuous Batching: Maximizing GPU Utilization

Static batching groups requests together and processes them as a unit. The problem: if one request in a batch generates 500 tokens and another generates 10, the GPU sits idle waiting for the long request to finish before processing new ones.

Continuous (or dynamic) batching solves this by inserting new requests into the batch as soon as any request completes. GPU utilization jumps from 30-40% to 80-90%, and average latency drops because short requests aren't held hostage by long ones.

vLLM, TGI, and TensorRT-LLM all support continuous batching out of the box. If you're serving a model without it, you're leaving 2-3x throughput on the table.
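A small simulation makes the difference visible. The model below is intentionally crude: every request needs a fixed number of decode steps, each step advances every in-flight request by one token, and prefill cost is ignored. The function names and the workload are assumptions chosen for illustration.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(output_lengths[i:i + batch_size])
               for i in range(0, len(output_lengths), batch_size))


def continuous_batching_steps(output_lengths, batch_size):
    """Refill free slots after every decode step (iteration-level scheduling)."""
    waiting = deque(output_lengths)
    active = []                      # tokens still to generate for each in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One decode step: requests with a single token left finish and free their slot.
        active = [remaining - 1 for remaining in active if remaining > 1]
        steps += 1
    return steps


workload = [500, 10, 10, 10, 500, 10, 10, 10]   # output lengths in tokens (made up)
print("static:    ", static_batching_steps(workload, batch_size=4))
print("continuous:", continuous_batching_steps(workload, batch_size=4))
```

With static batching, the second long request cannot start until the first one finishes. With continuous batching it slips into a freed slot after ten steps, so the two long requests overlap almost entirely.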

Multi-GPU Strategies: Tensor vs Pipeline Parallelism

When a model doesn't fit on a single GPU, you have two main options:

Tensor parallelism (TP): Splits individual layers across GPUs. Each GPU holds a slice of every layer, and the GPUs synchronize via all-reduce within every layer. Requires a high-bandwidth interconnect (NVLink). Best for 2-8 GPUs on the same node. Per-token latency generally improves as you add GPUs, until communication overhead starts to dominate.
Pipeline parallelism (PP): Assigns different layers to different GPUs. GPU 1 does layers 0-19, GPU 2 does 20-39, etc. Requires less interconnect bandwidth but introduces "pipeline bubbles" (idle time between stages). Best for cross-node setups.

For most production setups in 2026: use TP=4 or TP=8 within a single node, and PP only when scaling across nodes. vLLM makes this straightforward with --tensor-parallel-size.
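To see what "a slice of every layer" means, the sketch below splits a single matrix multiply two ways, using plain Python lists in place of GPU tensors. Column-parallel sharding concatenates the partial outputs, and row-parallel sharding sums them, which is where the all-reduce comes from. The function names are illustrative, and the dimensions are assumed to divide evenly by the number of simulated GPUs.

```python
def matmul(x, w):
    """x: [m][k], w: [k][n] -> [m][n], using plain Python lists."""
    return [[sum(x_row[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]
            for x_row in x]


def column_parallel(x, w, num_gpus):
    """Each 'GPU' holds a slice of W's columns; outputs are concatenated (all-gather)."""
    shard = len(w[0]) // num_gpus
    partials = [matmul(x, [row[g * shard:(g + 1) * shard] for row in w])
                for g in range(num_gpus)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def row_parallel(x, w, num_gpus):
    """Each 'GPU' holds a slice of W's rows; partial outputs are summed (all-reduce)."""
    shard = len(w) // num_gpus
    partials = [matmul([row[g * shard:(g + 1) * shard] for row in x],
                       w[g * shard:(g + 1) * shard])
                for g in range(num_gpus)]
    return [[sum(p[r][c] for p in partials) for c in range(len(w[0]))]
            for r in range(len(x))]


x = [[1, 2, 3, 4], [0, 1, 0, 1]]
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 2, 0, 1], [1, 0, 1, 0]]
print(matmul(x, w))
print(column_parallel(x, w, num_gpus=2))
print(row_parallel(x, w, num_gpus=2))
```

All three calls print the same matrix. In a real transformer layer the two styles are typically paired so that only one synchronization is needed per block, and the communication step is what makes NVLink-class bandwidth matter.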

The Inference Framework Landscape

Framework	Strengths	Weaknesses	When to Use
vLLM	PagedAttention, broad model support, active community, simple API	Slightly lower throughput than TRT-LLM on NVIDIA	Default choice for production serving
TensorRT-LLM	Maximum throughput on NVIDIA GPUs, deep H100/H200 optimization	Complex setup, NVIDIA-only, narrower model support	Maximum throughput on NVIDIA hardware
TGI	HuggingFace ecosystem, good documentation, easy deployment	Performance gap vs vLLM/TRT-LLM for large models	HuggingFace-centric workflows
llama.cpp / Ollama	CPU inference, GGUF quantization, runs anywhere	Not designed for high-throughput serving	Local development, edge deployment
SGLang	RadixAttention, structured output optimization, fast prefix caching	Younger project, smaller community	Structured generation, agentic workloads

A Real-World Optimization Workflow

When you inherit a slow, expensive LLM deployment, here's the sequence that works:

1. Profile first. Use nvidia-smi, PyTorch Profiler, or Nsight Systems to identify the actual bottleneck. Is it memory (OOM at batch size 4)? Bandwidth (low GPU utilization during decoding)? Compute (slow prefill)?
2. Enable continuous batching. Switch from static batching to a server that supports it, such as vLLM or TGI. This alone often doubles throughput with negligible quality risk.
3. Quantize. FP8 or INT8 first. Measure quality on your eval suite. If it holds, you've cut weight memory in half and can roughly double your batch size.
4. Enable prefix caching. If you have a shared system prompt (most production apps do), this is free throughput.
5. Tune batch size and concurrency. Increase max batch size until you hit the throughput/latency tradeoff your SLA requires. More concurrent requests = better GPU utilization but higher per-request latency.
6. Consider speculative decoding. If your latency SLA is tight and you're serving individual users (not batch), this can cut per-request generation latency by 2-3x. It speeds up decoding, not time-to-first-token.
7. Scale out. If one GPU (or node) isn't enough, add tensor parallelism within nodes and pipeline parallelism across nodes.

The 80/20 rule of inference optimization

Steps 1-4 deliver 80% of the gains with minimal risk. Steps 5-7 require careful benchmarking and may introduce regressions. Don't skip to "advanced" techniques before nailing the fundamentals.

Who's Hiring Inference Engineers

Inference optimization has gone from a niche specialization to one of the most in-demand skills in AI engineering. Companies building inference infrastructure are hiring aggressively:

Anthropic — 391 open roles including inference and systems engineering positions
OpenAI — 698 open roles across research and engineering
Fireworks AI — dedicated inference platform, building the fastest LLM serving stack
Together AI — 55 open roles focused on efficient inference and open-source models
Modal — serverless GPU infrastructure for inference workloads
Cerebras — 90 open roles, wafer-scale inference with record-breaking token/s
Mistral AI — 163 open roles at the leading European AI lab

Beyond pure-play inference companies, every organization deploying LLMs needs someone who can optimize their serving stack. Startups building AI agents and RAG systems are particularly hungry for this skill set.

Skills to Build

If you want to work in this space, here's what matters most:

Transformer architecture internals. You need to understand multi-head attention, KV caching, and the prefill/decode distinction at a mechanical level — not just the paper-reading level.
GPU memory hierarchy. Know the difference between registers, shared memory (SRAM), L2 cache, and HBM. Understand why a kernel that's compute-bound behaves differently from one that's memory-bound. You don't need to write CUDA kernels, but you need to understand why they matter.
At least one inference framework. vLLM is the best starting point. Deploy a model, benchmark it, tune it. Read the source code — PagedAttention is remarkably elegant.
Quantization hands-on. Quantize a model with AWQ. Compare quality on a benchmark suite. Understand the calibration process and why some layers are more sensitive than others.
Benchmarking and profiling. Learn to measure time-to-first-token (TTFT), inter-token latency (ITL), and throughput (tokens/s) under realistic load patterns. Use tools like genai-perf from NVIDIA's Triton project.
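The latency metrics named in the last item are simple to compute once you record a timestamp for each streamed token. Here is a minimal sketch for a single request. The function name, the dictionary keys, and the sample timestamps are made up, and a real benchmark would aggregate percentiles over many requests under load.

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """TTFT, mean inter-token latency, and end-to-end tokens/s for one request.

    token_times are wall-clock timestamps (in seconds) at which each output token arrived.
    """
    ttft = token_times[0] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    total = token_times[-1] - request_start
    return {
        "ttft_s": ttft,
        "itl_s": itl,
        "tokens_per_s": len(token_times) / total,
    }


# Example: first token after 0.5 s, then one token every 50 ms.
metrics = latency_metrics(0.0, [0.5, 0.55, 0.6, 0.65, 0.7])
print(metrics)
```

TTFT is dominated by queueing and prefill, while inter-token latency reflects decode speed, so the two usually respond to different optimizations and are worth tracking separately.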

vLLM, TensorRT-LLM, Flash Attention, CUDA, PyTorch, AWQ, GPTQ, llama.cpp, Triton, SGLang

Find inference & ML engineering roles

282+ ML/AI roles from companies building the next generation of LLM infrastructure.


Frequently Asked Questions

What's the biggest bottleneck in LLM inference?

Memory bandwidth, not compute. During the decode phase, LLM inference is memory-bound — the GPU spends most of its time waiting for model weights and KV cache data to load from HBM. This is why quantization and KV cache optimization have outsized impact on performance.

Should I use vLLM or TensorRT-LLM for production?

vLLM is the safer default — broader model support, simpler setup, active community, and performance within 10-15% of TensorRT-LLM for most workloads. Choose TensorRT-LLM only when you need absolute maximum throughput on NVIDIA hardware and can invest the engineering time in its more complex build and deployment pipeline.

How much quality do I lose with INT4 quantization?

With AWQ, typically 1-2% on standard benchmarks (MMLU, HumanEval, GSM8K). However, quality loss is task-dependent — creative writing and complex reasoning are more sensitive than classification or extraction. Always measure on your specific use case with your own eval suite, not just published benchmarks.

Can I run a 70B model on a single GPU?

Yes, with INT4 quantization. A 70B model at INT4 requires ~35 GB of memory for weights alone, which fits on an A100 80GB or H100 80GB with room for KV cache. At FP16, you need at least 2x A100 80GB with tensor parallelism. For local development, llama.cpp can run INT4 70B models on a 64GB Mac with Apple Silicon.
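The arithmetic behind these figures is just parameters times bits per weight. A quick sketch, with an illustrative function name, counting weights only and ignoring KV cache, activations, and the small overhead of quantization scales:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight storage in decimal GB; excludes KV cache, activations, and quantization scales."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


for label, bits in [("FP16", 16), ("INT8/FP8", 8), ("INT4", 4)]:
    print(f"70B @ {label}: {weight_memory_gb(70, bits):.0f} GB")
```

Whatever is left on the GPU after the weights is the budget for KV cache, which in turn caps batch size and context length.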

What salary can inference engineers expect?

Based on our research across AI companies, ML/inference engineers command $180k-$350k+ total compensation (base + equity) at well-funded AI companies. Senior inference engineers at frontier labs can exceed $400k. The premium reflects the direct revenue impact: a 2x improvement in serving efficiency can save millions per year in GPU costs.

What's the difference between prefill and decode in LLM inference?

Prefill processes the entire input prompt in parallel (compute-bound, like training). Decode generates output tokens one at a time (memory-bandwidth-bound). They have completely different performance characteristics and bottlenecks, which is why disaggregated serving — running prefill and decode on separate GPU pools — is becoming a popular architecture for high-throughput deployments.