LLM Gateway Design Guide (2026): Architecture, Trade-offs, and When to Build vs Buy

Q: What is an LLM gateway?

An LLM gateway is a routing layer between your application and one or more LLM providers. It exposes a unified API (usually OpenAI-compatible) and centralizes model routing, retries, fallbacks, caching, observability, rate limiting, and cost tracking. Instead of every service in your codebase calling OpenAI/Anthropic/Google directly, they call the gateway, which calls the right provider with the right configuration. The result: a single place to change provider, swap models, roll out caching, or A/B test routing — without touching application code.

Q: When should I add an LLM gateway?

The threshold is usually around the second LLM-touching service or the first month you have meaningful production spend (often quoted as around $1K-$10K per month, depending on team size). Earlier than that, the indirection adds complexity without payoff. Later than that, you're patching the same caching and retry logic in multiple codebases, which becomes painful fast. The signal: you've copy-pasted retry-with-exponential-backoff code into three services, or someone asks 'what's our actual monthly LLM cost' and the answer requires a forensic investigation.

Q: Should I build or buy an LLM gateway?

Buy (or use open-source) for almost everyone in 2026. The hosted options (OpenRouter, Vercel AI Gateway) require minimal setup and handle the boring infrastructure. Self-hosted open source (LiteLLM, Portkey, Helicone) gives you control without writing the gateway yourself. Building from scratch is justified only if you have unusual requirements: extreme low-latency demands, specific compliance constraints (data sovereignty, on-prem only), or proprietary routing logic that nobody else implements. The build-vs-buy default in 2026 has flipped to 'don't build.'

Q: What features should an LLM gateway have?

Six load-bearing features. (1) Unified API across providers — OpenAI-compatible is the de facto standard. (2) Model routing with rules-based and dynamic strategies. (3) Retries with exponential backoff and provider fallback chains. (4) Caching — at minimum prompt-level, ideally semantic-level. (5) Per-request observability: input/output tokens, cached tokens, model, latency, cost, attribution to feature or user. (6) Rate limiting and quota management, ideally with virtual keys for team-level budgets. Optional but valuable: guardrails, prompt management, and structured-output validation.

Q: Does an LLM gateway add latency?

Yes, but typically less than 10–50ms when designed well — small compared to the multi-second LLM inference itself. The latency comes from a network hop, the routing decision, and (sometimes) a cache lookup. The wins from caching, retries, and fallback usually dominate the latency penalty. For genuinely latency-critical paths (sub-100ms total budget), co-locate the gateway with the application or use a thin client-side library that calls providers directly while still reporting telemetry to your gateway.

Q: How do you handle failover between providers?

Three patterns, in increasing sophistication. (1) Static fallback chain: 'try Anthropic, then OpenAI, then Google.' Easy, predictable, but doesn't account for model quality differences. (2) Tiered fallback: 'try Sonnet on Anthropic, fall back to Sonnet via Bedrock, then to GPT-5 on OpenAI.' Better — keeps quality consistent across providers. (3) Latency-based routing: pick the fastest available provider for each request based on rolling latency stats. Useful only if you've validated quality equivalence with evals; otherwise you're routing for speed and degrading silently.

Q: Should I cache LLM responses?

Yes, at multiple levels. Prompt caching (provider-side) for stable prefixes is essentially free wins — ~10% rate on cached input tokens at most providers. Semantic caching (gateway-side) is more powerful but introduces quality risk and should be limited to workloads where semantic-similar inputs have semantic-similar correct outputs. Exact-match response caching is great for FAQ-style or deterministic workloads. Never cache responses for personalized, user-context-dependent, or time-sensitive workloads without explicit per-route gating.

The short answer

An LLM gateway is the unified API between your application and the model providers. It owns six load-bearing concerns: provider abstraction, routing, retries / fallback, caching, observability, and rate limits.

For almost every team in 2026, the right move is to use a gateway rather than build one. Pick OpenRouter for hosted simplicity, Vercel AI Gateway for Next.js shops, LiteLLM for self-hosted with team-level budgets, or Portkey for the heaviest LLMOps surface. Build from scratch only for unusual compliance or latency requirements.

If you are running LLMs in production and your codebase has more than one service calling a model provider, the question of whether to put a gateway in front of them is no longer hypothetical. The patterns — retries, caching, fallback chains, cost telemetry, model selection — show up in every service, the implementations diverge over time, and the day eventually arrives when "what is our actual monthly LLM cost broken down by feature" becomes an answer nobody can produce.

This guide is the practical, opinionated take on LLM gateways in 2026: what they do, what the architectural choices are, how to choose between the major options, and the failure modes that show up once you have one in production.

What an LLM Gateway Actually Is

A gateway is a thin routing layer between your application and one or more LLM providers. The application calls the gateway with a request that looks like an OpenAI chat completion. The gateway figures out which provider and which model to use, applies caching, handles retries, and returns the response. Everything else — observability, cost attribution, rate limiting — is built on top of that core path.

The contract that makes this work is the OpenAI-compatible API. Almost every gateway in 2026 speaks it, almost every framework expects it, and the providers themselves have converged on it for their own SDKs. If you're designing a gateway, this is the input/output schema you support; if you're picking one, this is the surface you check first.

The Six Load-Bearing Features

A useful gateway has six things, in roughly this order of importance:

Feature 1 — The Whole Point

Unified provider abstraction

Application code calls one API. Behind the gateway, the request goes to Anthropic, OpenAI, Google, Bedrock, Azure, or your own self-hosted endpoint. Swap providers without touching app code. Add a new provider in one place. This is the load-bearing payoff — everything else is leverage built on top of it.

Feature 2 — The Highest-Leverage Architectural Choice

Model routing

Rules-based ("use Haiku for classification, Sonnet for chat"), dynamic ("use the cheapest available model that hits the quality bar for this route"), or hybrid. The gateway should make routing a config change, not a code change, so you can roll out a routing rule, observe metrics, and roll back without a deploy. See our companion piece on LLM cost optimization for why routing is the single biggest cost lever.

Feature 3 — Cheap Reliability

Retries and fallback

Exponential backoff on transient errors. Provider fallback chains when one provider is degraded. A circuit breaker per provider so you stop hammering a dead endpoint. None of this is novel; all of it is what application teams reinvent badly when there's no gateway. Doing it once, centrally, with good defaults, is a 10x leverage win on reliability incidents.

Feature 4 — Cost Multiplier

Caching

Three layers, each with different trade-offs. Prompt caching at the provider level — you set a cache marker on the stable prefix, the provider serves cached input tokens at a small fraction of the standard rate. Semantic caching at the gateway level — for FAQ-style or deterministic workloads, hits can be high but quality risk is real. Exact-match caching — cheap, low-risk, useful for evaluation runs and deterministic dataset processing.

Feature 5 — The Thing You Always Wish You Had Sooner

Per-request observability

Input tokens, cached tokens, output tokens, model, route, latency, cost, attribution to feature and user. Without this, every conversation about cost is a forensic investigation. With it, "the LLM bill spiked yesterday, what changed" becomes a five-minute dashboard query. Build the cost telemetry first — almost every other lever depends on being able to measure it.

Feature 6 — The Quiet Saver

Rate limits and virtual keys

Per-team or per-feature quotas. Virtual API keys for internal users so an experiment can't burn through the production budget. Hard limits that prevent a runaway agent loop from costing $5K overnight. The gateway is the natural place to enforce these because it sees every request.

The Build-vs-Buy Decision

In 2024 the right call was usually "build a thin one." By 2026 the calculus has flipped. The open-source and hosted options have matured to the point that building from scratch is justified only by specific constraints: extreme latency requirements, data-sovereignty rules that prohibit any third-party hop, or proprietary routing logic that doesn't fit standard frameworks.

Constraint	Likely answer
You're a solo dev or hobby project	OpenRouter or Vercel AI Gateway. Hosted, fast, minimal setup.
Small startup, <$5K/month spend	OpenRouter for prototyping; revisit at scale.
Growing team, $5K-$50K/month, want control	LiteLLM (self-hosted) with Helicone for observability.
LLMOps-heavy organization	Portkey (open-sourced under Apache 2.0 in early 2026) or a hosted alternative.
Enterprise w/ data sovereignty needs	Self-hosted open-source on your own infrastructure.
Already deep in Next.js / Vercel	Vercel AI Gateway via the AI SDK — minimal additional setup.
Unusual latency / compliance / IP requirements	Only here does "build your own" make sense.

The Major Options, In One Line Each

OpenRouter

Hosted, OpenAI-compatible API, large model catalog, consolidated billing. The lowest-friction option — sign up, swap your base URL, ship. Free tier on open-weight models works for prototyping. Pay-per-request scales without surprise. Trade-off: you're routing through their infrastructure, which is a latency hop and an additional vendor on your compliance review.

LiteLLM

Open-source, self-hosted, MIT-licensed, massive provider catalog. The default choice for teams that want OpenAI-compatible routing without giving up infrastructure control. Strong on virtual keys, team-level budgets, and integrating with your own observability stack. Trade-off: you run it, which means you patch it and you're on call when it falls over.

Portkey

Open-sourced under Apache 2.0 in early 2026, which made it the most feature-complete OSS option overnight. Best for teams that need the full LLMOps surface — guardrails, prompt management, evals, compliance reporting — without writing it themselves. Trade-off: more surface area to learn than LiteLLM; better fit when you're using more of what it offers.

Helicone

Low-latency open-source gateway with strong built-in observability. A good fit when observability is the primary driver and you don't need the heavier LLMOps features. Often used alongside LiteLLM in production stacks — one for routing, the other for observability.

Vercel AI Gateway

Hosted gateway integrated with the Vercel AI SDK. If you're already in the Next.js + Vercel ecosystem, the integration cost is essentially zero — a few config lines and you have routing, fallback, observability, and unified billing. Trade-off: works best when you're shipping on Vercel; less compelling as a standalone product.

Architectural Decisions That Bite You Later

Sync vs streaming

Streaming has to be first-class. If you bolt it on after the fact, you'll discover every observability hook, retry path, and cache layer needs to be reworked. Pick a gateway that supports streaming from day one and design your application code around it.

Per-request vs per-session

For chat applications, decide whether the gateway is per-request (stateless, each call independent) or per-session (the gateway holds conversation state). Per-request is simpler and more scalable; per-session enables better caching and lower latency for multi-turn conversations. Most production gateways are per-request; per-session shows up in higher-latency RAG and agent systems where the state is large.

Where the eval lives

The gateway is the natural place to sample requests for offline eval scoring. Build this in from the start — a sampling rule per route ("score 1% of these requests against the eval set") gives you ongoing quality telemetry without the cost of evaluating every request. Without it, regressions show up in user complaints rather than dashboards.

How keys are managed

Provider API keys belong in the gateway, not in application code. The gateway issues virtual keys to each service or team, the application calls with the virtual key, the gateway swaps in the real provider key. This is how you contain key rotation, enforce team-level budgets, and avoid accidentally committing an OpenAI key to a public repo.

Production Failure Modes

A gateway introduces a single point of failure. Plan for it from day one.

The gateway itself going down

If the gateway is down, every LLM-touching feature is down. Mitigations: keep a fallback path in application code that talks directly to a primary provider if the gateway is unavailable; use a managed gateway with SLA guarantees rather than self-hosting on a single VPS; run multiple gateway instances behind a load balancer with health checks.

Silent quality regressions from routing

You roll out a routing rule that sends "easy" requests to a smaller model. The eval looked good. Six weeks later, edge cases start producing subtly wrong outputs and nobody notices until a customer complaint. Mitigation: continuous offline eval sampling per route, with regression alerts.

Cache poisoning

Semantic cache hits return a "similar enough" response that is actually wrong for the current request. Easy to introduce, hard to detect because the response is plausible. Mitigation: high similarity threshold (cosine ≥ 0.95), per-route cache-on/cache-off gating, log all cache hits for sampled offline review.

Runaway agent loops

An agent retries, re-plans, re-calls tools without a step ceiling and burns $5K overnight. The gateway is the right place to enforce per-request and per-session token budgets that hard-stop the loop. Never trust the application code to enforce these alone.

Provider key leaks

The gateway holds provider keys for the whole organization. If the gateway is compromised, every provider key is compromised. Treat the gateway as a high-value secrets store: rotate keys regularly, restrict admin access, log all key reads, and run security review on every gateway upgrade.

The 30-Day Rollout Plan

If you're convinced you need a gateway and want to act on it this month, here's the sequence that works.

Week 1 — Pick and stand up

Pick one option from the table above based on your constraints.
Stand up the gateway with one model and one provider; verify the OpenAI-compatible API works end-to-end.
Build per-request cost telemetry from day one — don't ship without it.

Week 2 — Migrate one service

Migrate your highest-volume LLM-touching service to the gateway.
Verify telemetry matches your previous direct-provider bill within a tolerance.
Add prompt caching for any stable prefix >1k tokens.

Week 3 — Add fallback and routing

Add a fallback chain — primary provider, secondary provider, hard fail.
Identify a candidate route for model downgrade (e.g. routine classification) and roll out routing behind a feature flag.
Build a small eval set for that route; sample 1–5% of production traffic into it.

Week 4 — Migrate the rest

Migrate remaining LLM-touching services.
Turn on team-level virtual keys and budgets.
Set up regression alerts on cost per request, latency, and eval scores.

Done well, the rollout cuts cost 30–60% (mostly from prompt caching and consistent routing) and reduces LLM-related production incidents to roughly zero. The compounding part: every future change — new provider, new model, new caching strategy — goes through one config change instead of N codebase patches. That's the leverage worth paying for.

FAQ

What is an LLM gateway?+

A routing layer between your application and one or more LLM providers. Exposes a unified API (usually OpenAI-compatible) and centralizes model routing, retries, fallbacks, caching, observability, rate limiting, and cost tracking. Instead of every service calling providers directly, they call the gateway.

When should I add an LLM gateway?+

Typically around the second LLM-touching service or once you have meaningful monthly production spend. Earlier than that, the indirection adds complexity without payoff. The signal: you've copy-pasted retry-with-backoff code into three services, or nobody can answer "what's our monthly LLM cost broken down by feature."

Should I build or buy an LLM gateway?+

Buy or use open-source for almost everyone in 2026. Hosted options handle the infrastructure; self-hosted open source (LiteLLM, Portkey, Helicone) gives you control without writing the gateway. Building from scratch is justified only for unusual latency, compliance, or routing requirements.

What features should an LLM gateway have?+

Six load-bearing features: unified provider API, model routing, retries with fallback, caching (prompt + optional semantic), per-request observability with cost attribution, and rate limits with virtual keys. Optional but valuable: guardrails, prompt management, structured-output validation.

Does an LLM gateway add latency?+

Yes, but usually negligible — tens of milliseconds compared to the multi-second model call itself. Caching, retries, and fallback wins typically dominate the latency penalty. For sub-100ms latency budgets, co-locate the gateway with the application or use a thin client library.

How do you handle failover between providers?+

Three patterns: static fallback chain (try A, then B, then C); tiered fallback (same model class across providers); latency-based routing (pick fastest available). Latency-based is useful only after you've validated quality equivalence with evals.

Should I cache LLM responses?+

Yes, at multiple levels. Prompt caching for stable prefixes is essentially free. Semantic caching is powerful but introduces quality risk — limit to workloads where similar inputs have similar correct outputs. Exact-match caching is low-risk for FAQ-style or deterministic workloads. Never cache for user-context-dependent or time-sensitive routes without explicit gating.

Looking for AI engineering roles?

Browse jobs at companies building production LLM systems — with verified culture profiles, comp data, and engineering blogs to evaluate before applying.

Browse AI/ML Jobs → Explore AI Tools →