An LLM gateway is the unified API between your application and the model providers. It owns six load-bearing concerns: provider abstraction, routing, retries / fallback, caching, observability, and rate limits.
For almost every team in 2026, the right move is to use a gateway rather than build one. Pick OpenRouter for hosted simplicity, Vercel AI Gateway for Next.js shops, LiteLLM for self-hosted with team-level budgets, or Portkey for the heaviest LLMOps surface. Build from scratch only for unusual compliance or latency requirements.
If you are running LLMs in production and your codebase has more than one service calling a model provider, the question of whether to put a gateway in front of them is no longer hypothetical. The patterns — retries, caching, fallback chains, cost telemetry, model selection — show up in every service, the implementations diverge over time, and the day eventually arrives when "what is our actual monthly LLM cost broken down by feature" becomes an answer nobody can produce.
This guide is the practical, opinionated take on LLM gateways in 2026: what they do, what the architectural choices are, how to choose between the major options, and the failure modes that show up once you have one in production.
What an LLM Gateway Actually Is
A gateway is a thin routing layer between your application and one or more LLM providers. The application calls the gateway with a request that looks like an OpenAI chat completion. The gateway figures out which provider and which model to use, applies caching, handles retries, and returns the response. Everything else — observability, cost attribution, rate limiting — is built on top of that core path.
The contract that makes this work is the OpenAI-compatible API. Almost every gateway in 2026 speaks it, almost every framework expects it, and the providers themselves have converged on it for their own SDKs. If you're designing a gateway, this is the input/output schema you support; if you're picking one, this is the surface you check first.
The Six Load-Bearing Features
A useful gateway has six things, in roughly this order of importance:
Unified provider abstraction
Application code calls one API. Behind the gateway, the request goes to Anthropic, OpenAI, Google, Bedrock, Azure, or your own self-hosted endpoint. Swap providers without touching app code. Add a new provider in one place. This is the load-bearing payoff — everything else is leverage built on top of it.
Model routing
Rules-based ("use Haiku for classification, Sonnet for chat"), dynamic ("use the cheapest available model that hits the quality bar for this route"), or hybrid. The gateway should make routing a config change, not a code change, so you can roll out a routing rule, observe metrics, and roll back without a deploy. See our companion piece on LLM cost optimization for why routing is the single biggest cost lever.
Retries and fallback
Exponential backoff on transient errors. Provider fallback chains when one provider is degraded. A circuit breaker per provider so you stop hammering a dead endpoint. None of this is novel; all of it is what application teams reinvent badly when there's no gateway. Doing it once, centrally, with good defaults, is a 10x leverage win on reliability incidents.
Caching
Three layers, each with different trade-offs. Prompt caching at the provider level — you set a cache marker on the stable prefix, the provider serves cached input tokens at a small fraction of the standard rate. Semantic caching at the gateway level — for FAQ-style or deterministic workloads, hits can be high but quality risk is real. Exact-match caching — cheap, low-risk, useful for evaluation runs and deterministic dataset processing.
Per-request observability
Input tokens, cached tokens, output tokens, model, route, latency, cost, attribution to feature and user. Without this, every conversation about cost is a forensic investigation. With it, "the LLM bill spiked yesterday, what changed" becomes a five-minute dashboard query. Build the cost telemetry first — almost every other lever depends on being able to measure it.
Rate limits and virtual keys
Per-team or per-feature quotas. Virtual API keys for internal users so an experiment can't burn through the production budget. Hard limits that prevent a runaway agent loop from costing $5K overnight. The gateway is the natural place to enforce these because it sees every request.
The Build-vs-Buy Decision
In 2024 the right call was usually "build a thin one." By 2026 the calculus has flipped. The open-source and hosted options have matured to the point that building from scratch is justified only by specific constraints: extreme latency requirements, data-sovereignty rules that prohibit any third-party hop, or proprietary routing logic that doesn't fit standard frameworks.
| Constraint | Likely answer |
| You're a solo dev or hobby project | OpenRouter or Vercel AI Gateway. Hosted, fast, minimal setup. |
| Small startup, <$5K/month spend | OpenRouter for prototyping; revisit at scale. |
| Growing team, $5K-$50K/month, want control | LiteLLM (self-hosted) with Helicone for observability. |
| LLMOps-heavy organization | Portkey (open-sourced under Apache 2.0 in early 2026) or a hosted alternative. |
| Enterprise w/ data sovereignty needs | Self-hosted open-source on your own infrastructure. |
| Already deep in Next.js / Vercel | Vercel AI Gateway via the AI SDK — minimal additional setup. |
| Unusual latency / compliance / IP requirements | Only here does "build your own" make sense. |
The Major Options, In One Line Each
OpenRouter
Hosted, OpenAI-compatible API, large model catalog, consolidated billing. The lowest-friction option — sign up, swap your base URL, ship. Free tier on open-weight models works for prototyping. Pay-per-request scales without surprise. Trade-off: you're routing through their infrastructure, which is a latency hop and an additional vendor on your compliance review.
LiteLLM
Open-source, self-hosted, MIT-licensed, massive provider catalog. The default choice for teams that want OpenAI-compatible routing without giving up infrastructure control. Strong on virtual keys, team-level budgets, and integrating with your own observability stack. Trade-off: you run it, which means you patch it and you're on call when it falls over.
Portkey
Open-sourced under Apache 2.0 in early 2026, which made it the most feature-complete OSS option overnight. Best for teams that need the full LLMOps surface — guardrails, prompt management, evals, compliance reporting — without writing it themselves. Trade-off: more surface area to learn than LiteLLM; better fit when you're using more of what it offers.
Helicone
Low-latency open-source gateway with strong built-in observability. A good fit when observability is the primary driver and you don't need the heavier LLMOps features. Often used alongside LiteLLM in production stacks — one for routing, the other for observability.
Vercel AI Gateway
Hosted gateway integrated with the Vercel AI SDK. If you're already in the Next.js + Vercel ecosystem, the integration cost is essentially zero — a few config lines and you have routing, fallback, observability, and unified billing. Trade-off: works best when you're shipping on Vercel; less compelling as a standalone product.
Architectural Decisions That Bite You Later
Sync vs streaming
Streaming has to be first-class. If you bolt it on after the fact, you'll discover every observability hook, retry path, and cache layer needs to be reworked. Pick a gateway that supports streaming from day one and design your application code around it.
Per-request vs per-session
For chat applications, decide whether the gateway is per-request (stateless, each call independent) or per-session (the gateway holds conversation state). Per-request is simpler and more scalable; per-session enables better caching and lower latency for multi-turn conversations. Most production gateways are per-request; per-session shows up in higher-latency RAG and agent systems where the state is large.
Where the eval lives
The gateway is the natural place to sample requests for offline eval scoring. Build this in from the start — a sampling rule per route ("score 1% of these requests against the eval set") gives you ongoing quality telemetry without the cost of evaluating every request. Without it, regressions show up in user complaints rather than dashboards.
How keys are managed
Provider API keys belong in the gateway, not in application code. The gateway issues virtual keys to each service or team, the application calls with the virtual key, the gateway swaps in the real provider key. This is how you contain key rotation, enforce team-level budgets, and avoid accidentally committing an OpenAI key to a public repo.
Production Failure Modes
A gateway introduces a single point of failure. Plan for it from day one.
The gateway itself going down
If the gateway is down, every LLM-touching feature is down. Mitigations: keep a fallback path in application code that talks directly to a primary provider if the gateway is unavailable; use a managed gateway with SLA guarantees rather than self-hosting on a single VPS; run multiple gateway instances behind a load balancer with health checks.
Silent quality regressions from routing
You roll out a routing rule that sends "easy" requests to a smaller model. The eval looked good. Six weeks later, edge cases start producing subtly wrong outputs and nobody notices until a customer complaint. Mitigation: continuous offline eval sampling per route, with regression alerts.
Cache poisoning
Semantic cache hits return a "similar enough" response that is actually wrong for the current request. Easy to introduce, hard to detect because the response is plausible. Mitigation: high similarity threshold (cosine ≥ 0.95), per-route cache-on/cache-off gating, log all cache hits for sampled offline review.
Runaway agent loops
An agent retries, re-plans, re-calls tools without a step ceiling and burns $5K overnight. The gateway is the right place to enforce per-request and per-session token budgets that hard-stop the loop. Never trust the application code to enforce these alone.
Provider key leaks
The gateway holds provider keys for the whole organization. If the gateway is compromised, every provider key is compromised. Treat the gateway as a high-value secrets store: rotate keys regularly, restrict admin access, log all key reads, and run security review on every gateway upgrade.
The 30-Day Rollout Plan
If you're convinced you need a gateway and want to act on it this month, here's the sequence that works.
Week 1 — Pick and stand up
- Pick one option from the table above based on your constraints.
- Stand up the gateway with one model and one provider; verify the OpenAI-compatible API works end-to-end.
- Build per-request cost telemetry from day one — don't ship without it.
Week 2 — Migrate one service
- Migrate your highest-volume LLM-touching service to the gateway.
- Verify telemetry matches your previous direct-provider bill within a tolerance.
- Add prompt caching for any stable prefix >1k tokens.
Week 3 — Add fallback and routing
- Add a fallback chain — primary provider, secondary provider, hard fail.
- Identify a candidate route for model downgrade (e.g. routine classification) and roll out routing behind a feature flag.
- Build a small eval set for that route; sample 1–5% of production traffic into it.
Week 4 — Migrate the rest
- Migrate remaining LLM-touching services.
- Turn on team-level virtual keys and budgets.
- Set up regression alerts on cost per request, latency, and eval scores.
Done well, the rollout cuts cost 30–60% (mostly from prompt caching and consistent routing) and reduces LLM-related production incidents to roughly zero. The compounding part: every future change — new provider, new model, new caching strategy — goes through one config change instead of N codebase patches. That's the leverage worth paying for.
FAQ
Looking for AI engineering roles?
Browse jobs at companies building production LLM systems — with verified culture profiles, comp data, and engineering blogs to evaluate before applying.
Browse AI/ML Jobs → Explore AI Tools →