The short answer

An LLM gateway is the unified API between your application and the model providers. It owns six load-bearing concerns: provider abstraction, routing, retries / fallback, caching, observability, and rate limits.

For almost every team in 2026, the right move is to use a gateway rather than build one. Pick OpenRouter for hosted simplicity, Vercel AI Gateway for Next.js shops, LiteLLM for self-hosted with team-level budgets, or Portkey for the heaviest LLMOps surface. Build from scratch only for unusual compliance or latency requirements.

If you are running LLMs in production and your codebase has more than one service calling a model provider, the question of whether to put a gateway in front of them is no longer hypothetical. The patterns — retries, caching, fallback chains, cost telemetry, model selection — show up in every service, the implementations diverge over time, and the day eventually arrives when "what is our actual monthly LLM cost broken down by feature" becomes an answer nobody can produce.

This guide is the practical, opinionated take on LLM gateways in 2026: what they do, what the architectural choices are, how to choose between the major options, and the failure modes that show up once you have one in production.

What an LLM Gateway Actually Is

A gateway is a thin routing layer between your application and one or more LLM providers. The application calls the gateway with a request that looks like an OpenAI chat completion. The gateway figures out which provider and which model to use, applies caching, handles retries, and returns the response. Everything else — observability, cost attribution, rate limiting — is built on top of that core path.

The contract that makes this work is the OpenAI-compatible API. Almost every gateway in 2026 speaks it, almost every framework expects it, and the providers themselves have converged on it for their own SDKs. If you're designing a gateway, this is the input/output schema you support; if you're picking one, this is the surface you check first.

The Six Load-Bearing Features

A useful gateway has six things, in roughly this order of importance:

Feature 1 — The Whole Point

Unified provider abstraction

Application code calls one API. Behind the gateway, the request goes to Anthropic, OpenAI, Google, Bedrock, Azure, or your own self-hosted endpoint. Swap providers without touching app code. Add a new provider in one place. This is the load-bearing payoff — everything else is leverage built on top of it.

Feature 2 — The Highest-Leverage Architectural Choice

Model routing

Rules-based ("use Haiku for classification, Sonnet for chat"), dynamic ("use the cheapest available model that hits the quality bar for this route"), or hybrid. The gateway should make routing a config change, not a code change, so you can roll out a routing rule, observe metrics, and roll back without a deploy. See our companion piece on LLM cost optimization for why routing is the single biggest cost lever.

Feature 3 — Cheap Reliability

Retries and fallback

Exponential backoff on transient errors. Provider fallback chains when one provider is degraded. A circuit breaker per provider so you stop hammering a dead endpoint. None of this is novel; all of it is what application teams reinvent badly when there's no gateway. Doing it once, centrally, with good defaults, is a 10x leverage win on reliability incidents.

Feature 4 — Cost Multiplier

Caching

Three layers, each with different trade-offs. Prompt caching at the provider level — you set a cache marker on the stable prefix, the provider serves cached input tokens at a small fraction of the standard rate. Semantic caching at the gateway level — for FAQ-style or deterministic workloads, hits can be high but quality risk is real. Exact-match caching — cheap, low-risk, useful for evaluation runs and deterministic dataset processing.

Feature 5 — The Thing You Always Wish You Had Sooner

Per-request observability

Input tokens, cached tokens, output tokens, model, route, latency, cost, attribution to feature and user. Without this, every conversation about cost is a forensic investigation. With it, "the LLM bill spiked yesterday, what changed" becomes a five-minute dashboard query. Build the cost telemetry first — almost every other lever depends on being able to measure it.

Feature 6 — The Quiet Saver

Rate limits and virtual keys

Per-team or per-feature quotas. Virtual API keys for internal users so an experiment can't burn through the production budget. Hard limits that prevent a runaway agent loop from costing $5K overnight. The gateway is the natural place to enforce these because it sees every request.

The Build-vs-Buy Decision

In 2024 the right call was usually "build a thin one." By 2026 the calculus has flipped. The open-source and hosted options have matured to the point that building from scratch is justified only by specific constraints: extreme latency requirements, data-sovereignty rules that prohibit any third-party hop, or proprietary routing logic that doesn't fit standard frameworks.

ConstraintLikely answer
You're a solo dev or hobby projectOpenRouter or Vercel AI Gateway. Hosted, fast, minimal setup.
Small startup, <$5K/month spendOpenRouter for prototyping; revisit at scale.
Growing team, $5K-$50K/month, want controlLiteLLM (self-hosted) with Helicone for observability.
LLMOps-heavy organizationPortkey (open-sourced under Apache 2.0 in early 2026) or a hosted alternative.
Enterprise w/ data sovereignty needsSelf-hosted open-source on your own infrastructure.
Already deep in Next.js / VercelVercel AI Gateway via the AI SDK — minimal additional setup.
Unusual latency / compliance / IP requirementsOnly here does "build your own" make sense.

The Major Options, In One Line Each

OpenRouter

Hosted, OpenAI-compatible API, large model catalog, consolidated billing. The lowest-friction option — sign up, swap your base URL, ship. Free tier on open-weight models works for prototyping. Pay-per-request scales without surprise. Trade-off: you're routing through their infrastructure, which is a latency hop and an additional vendor on your compliance review.

LiteLLM

Open-source, self-hosted, MIT-licensed, massive provider catalog. The default choice for teams that want OpenAI-compatible routing without giving up infrastructure control. Strong on virtual keys, team-level budgets, and integrating with your own observability stack. Trade-off: you run it, which means you patch it and you're on call when it falls over.

Portkey

Open-sourced under Apache 2.0 in early 2026, which made it the most feature-complete OSS option overnight. Best for teams that need the full LLMOps surface — guardrails, prompt management, evals, compliance reporting — without writing it themselves. Trade-off: more surface area to learn than LiteLLM; better fit when you're using more of what it offers.

Helicone

Low-latency open-source gateway with strong built-in observability. A good fit when observability is the primary driver and you don't need the heavier LLMOps features. Often used alongside LiteLLM in production stacks — one for routing, the other for observability.

Vercel AI Gateway

Hosted gateway integrated with the Vercel AI SDK. If you're already in the Next.js + Vercel ecosystem, the integration cost is essentially zero — a few config lines and you have routing, fallback, observability, and unified billing. Trade-off: works best when you're shipping on Vercel; less compelling as a standalone product.

Architectural Decisions That Bite You Later

Sync vs streaming

Streaming has to be first-class. If you bolt it on after the fact, you'll discover every observability hook, retry path, and cache layer needs to be reworked. Pick a gateway that supports streaming from day one and design your application code around it.

Per-request vs per-session

For chat applications, decide whether the gateway is per-request (stateless, each call independent) or per-session (the gateway holds conversation state). Per-request is simpler and more scalable; per-session enables better caching and lower latency for multi-turn conversations. Most production gateways are per-request; per-session shows up in higher-latency RAG and agent systems where the state is large.

Where the eval lives

The gateway is the natural place to sample requests for offline eval scoring. Build this in from the start — a sampling rule per route ("score 1% of these requests against the eval set") gives you ongoing quality telemetry without the cost of evaluating every request. Without it, regressions show up in user complaints rather than dashboards.

How keys are managed

Provider API keys belong in the gateway, not in application code. The gateway issues virtual keys to each service or team, the application calls with the virtual key, the gateway swaps in the real provider key. This is how you contain key rotation, enforce team-level budgets, and avoid accidentally committing an OpenAI key to a public repo.

Production Failure Modes

A gateway introduces a single point of failure. Plan for it from day one.

The gateway itself going down

If the gateway is down, every LLM-touching feature is down. Mitigations: keep a fallback path in application code that talks directly to a primary provider if the gateway is unavailable; use a managed gateway with SLA guarantees rather than self-hosting on a single VPS; run multiple gateway instances behind a load balancer with health checks.

Silent quality regressions from routing

You roll out a routing rule that sends "easy" requests to a smaller model. The eval looked good. Six weeks later, edge cases start producing subtly wrong outputs and nobody notices until a customer complaint. Mitigation: continuous offline eval sampling per route, with regression alerts.

Cache poisoning

Semantic cache hits return a "similar enough" response that is actually wrong for the current request. Easy to introduce, hard to detect because the response is plausible. Mitigation: high similarity threshold (cosine ≥ 0.95), per-route cache-on/cache-off gating, log all cache hits for sampled offline review.

Runaway agent loops

An agent retries, re-plans, re-calls tools without a step ceiling and burns $5K overnight. The gateway is the right place to enforce per-request and per-session token budgets that hard-stop the loop. Never trust the application code to enforce these alone.

Provider key leaks

The gateway holds provider keys for the whole organization. If the gateway is compromised, every provider key is compromised. Treat the gateway as a high-value secrets store: rotate keys regularly, restrict admin access, log all key reads, and run security review on every gateway upgrade.

The 30-Day Rollout Plan

If you're convinced you need a gateway and want to act on it this month, here's the sequence that works.

Week 1 — Pick and stand up

Week 2 — Migrate one service

Week 3 — Add fallback and routing

Week 4 — Migrate the rest

Done well, the rollout cuts cost 30–60% (mostly from prompt caching and consistent routing) and reduces LLM-related production incidents to roughly zero. The compounding part: every future change — new provider, new model, new caching strategy — goes through one config change instead of N codebase patches. That's the leverage worth paying for.

FAQ

What is an LLM gateway?+
A routing layer between your application and one or more LLM providers. Exposes a unified API (usually OpenAI-compatible) and centralizes model routing, retries, fallbacks, caching, observability, rate limiting, and cost tracking. Instead of every service calling providers directly, they call the gateway.
When should I add an LLM gateway?+
Typically around the second LLM-touching service or once you have meaningful monthly production spend. Earlier than that, the indirection adds complexity without payoff. The signal: you've copy-pasted retry-with-backoff code into three services, or nobody can answer "what's our monthly LLM cost broken down by feature."
Should I build or buy an LLM gateway?+
Buy or use open-source for almost everyone in 2026. Hosted options handle the infrastructure; self-hosted open source (LiteLLM, Portkey, Helicone) gives you control without writing the gateway. Building from scratch is justified only for unusual latency, compliance, or routing requirements.
What features should an LLM gateway have?+
Six load-bearing features: unified provider API, model routing, retries with fallback, caching (prompt + optional semantic), per-request observability with cost attribution, and rate limits with virtual keys. Optional but valuable: guardrails, prompt management, structured-output validation.
Does an LLM gateway add latency?+
Yes, but usually negligible — tens of milliseconds compared to the multi-second model call itself. Caching, retries, and fallback wins typically dominate the latency penalty. For sub-100ms latency budgets, co-locate the gateway with the application or use a thin client library.
How do you handle failover between providers?+
Three patterns: static fallback chain (try A, then B, then C); tiered fallback (same model class across providers); latency-based routing (pick fastest available). Latency-based is useful only after you've validated quality equivalence with evals.
Should I cache LLM responses?+
Yes, at multiple levels. Prompt caching for stable prefixes is essentially free. Semantic caching is powerful but introduces quality risk — limit to workloads where similar inputs have similar correct outputs. Exact-match caching is low-risk for FAQ-style or deterministic workloads. Never cache for user-context-dependent or time-sensitive routes without explicit gating.

Looking for AI engineering roles?

Browse jobs at companies building production LLM systems — with verified culture profiles, comp data, and engineering blogs to evaluate before applying.

Browse AI/ML Jobs → Explore AI Tools →