The short answer. A good LLM routing layer sends easy requests to small fast models and only escalates to a frontier model when the request actually warrants it. The five strategies that matter in production are cost-aware, semantic, cascading, intent-based, and load-balancing. Most production systems combine two or more of them. The result, when tuned, is dramatic cost reduction with most of the quality preserved: public benchmarks have shown order-of-magnitude savings while keeping the strongest model's quality on the requests where quality matters.
What follows is the long answer: how each strategy works, what it costs you (in latency, in complexity, in eval debt), and how to think about the routing layer as a system rather than a feature.
Why routing matters now
The conversation about LLMs in 2024 was which model is best. The conversation in 2026 is how do we serve millions of requests at unit economics that work. That shift turned routing from an interesting research topic into a first-class engineering concern, and it shows up in the job market: descriptions for AI engineering, ML platform, and AI infra roles increasingly list "LLM gateways," "model routing," and "inference optimization" alongside the usual MLOps stack. Routing is no longer a "nice to have" line in an architecture doc. It's the thing your CFO will ask about by name in your second budget cycle.
The reason is simple. Frontier models are extraordinary, but for most production traffic they are also extraordinarily overqualified. A reformat-this-JSON request does not need the same model that solves a multi-step reasoning problem. If every request takes the expensive path, you pay frontier prices on traffic that a small model would have handled fine — and your latency suffers for no reason. Routing fixes both problems at the same time.
The three tradeoffs every router negotiates
Before getting into specific strategies, it's worth being precise about what a router is actually optimizing. Every routing policy is a negotiation among three goals that pull against each other:
- Cost — dollars per request, dominated by tokens-in × input-price + tokens-out × output-price.
- Latency — time-to-first-token and total response time, which matters enormously for streaming UIs and agent loops.
- Quality — measured against an eval suite that mirrors your real traffic; this is where most teams underinvest and pay for it later.
You cannot maximize all three at once. A router that always picks the cheapest model will degrade quality on hard requests. A router that always picks the strongest model will degrade your unit economics. A router that always picks the lowest-latency model will degrade either cost or quality, often both. The art of routing is being explicit about which tradeoff you're making for which class of request, and being able to measure it.
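To make the cost term concrete, here is a minimal sketch of the per-request formula from the list above. The prices are illustrative placeholders, not real provider quotes, and `ModelPrice` and `request_cost` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Per-million-token prices in USD (illustrative numbers, not real quotes)."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(price: ModelPrice, tokens_in: int, tokens_out: int) -> float:
    """tokens-in x input-price + tokens-out x output-price, in USD."""
    total = tokens_in * price.input_per_mtok + tokens_out * price.output_per_mtok
    return total / 1_000_000


# Hypothetical price points for a small model and a frontier model.
SMALL = ModelPrice(input_per_mtok=0.25, output_per_mtok=1.00)
FRONTIER = ModelPrice(input_per_mtok=5.00, output_per_mtok=20.00)

# Same request shape, two very different bills.
small_cost = request_cost(SMALL, tokens_in=2_000, tokens_out=500)
frontier_cost = request_cost(FRONTIER, tokens_in=2_000, tokens_out=500)
print(f"small: ${small_cost:.4f}  frontier: ${frontier_cost:.4f}")
```

Under these assumed prices the same request costs twenty times more on the frontier path, which is the gap a router is trying to avoid paying on easy traffic.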
The five routing strategies worth knowing
1. Cost-aware routing
The simplest useful strategy. You define a quality threshold, estimate each model's expected quality on the request (often with a cheap classifier or a small "judge" model), and pick the cheapest model whose estimate exceeds the threshold. This sounds trivial; in practice the entire value is in the threshold. Set it too high and you over-route to the frontier model. Set it too low and quality silently degrades for weeks before someone notices.
Cost-aware routing is also where most teams start their routing journey, because it produces the biggest immediate dollar savings and the smallest architectural change. The router lives in front of your model provider client, looks at the request, and rewrites the model parameter. Done.
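A minimal sketch of that selection rule might look like the following. The model names, per-request costs, and the `toy_quality` predictor are all assumptions made for illustration; in a real system the predictor would be a trained classifier or judge model, not a prompt-length heuristic.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str                # hypothetical model identifier
    cost_per_request: float  # estimated USD for a typical request


def toy_quality(model: str, prompt: str) -> float:
    """Stand-in for a trained classifier or judge model.

    Treats longer prompts as harder and assumes a fixed per-model strength.
    """
    strength = {"small": 0.6, "mid": 0.8, "frontier": 0.95}[model]
    difficulty = min(len(prompt.split()) / 50, 1.0)
    return 1.0 - difficulty * (1.0 - strength)


def pick_model(
    prompt: str,
    candidates: Sequence[Candidate],
    predict_quality: Callable[[str, str], float],
    threshold: float,
) -> Candidate:
    """Return the cheapest candidate whose predicted quality meets the threshold."""
    scored = [(c, predict_quality(c.name, prompt)) for c in candidates]
    eligible = [c for c, quality in scored if quality >= threshold]
    if eligible:
        return min(eligible, key=lambda c: c.cost_per_request)
    # Nothing clears the bar: fall back to the highest predicted quality.
    return max(scored, key=lambda pair: pair[1])[0]


CANDIDATES = [
    Candidate("small", 0.001),
    Candidate("mid", 0.005),
    Candidate("frontier", 0.02),
]
easy = pick_model("Reformat this JSON", CANDIDATES, toy_quality, threshold=0.9)
hard = pick_model("word " * 60, CANDIDATES, toy_quality, threshold=0.9)
print(easy.name, hard.name)
```

The fallback branch is a design choice: when no model clears the threshold, this sketch prefers the best predicted quality over the cheapest option.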
2. Semantic routing
Instead of a heuristic, embed the request into a vector and pick the model whose "semantic profile" best matches that vector. A semantic router typically encodes both the user query and a set of candidate routing utterances (or category seeds) into the same embedding space, then routes by cosine similarity to whichever route's seed cluster is nearest. Open-source projects like the vLLM Semantic Router and Red Hat's LLM Semantic Router have made this pattern increasingly accessible.
The strength of semantic routing is that it generalizes — you don't have to enumerate every keyword that means "this is a code request" or "this is a summarization." The weakness is that you need to think carefully about the seed routes, and you pay an extra embedding call per request (typically a few milliseconds against a small embedding model).
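The nearest-seed lookup can be sketched in a few lines. To stay self-contained, this example uses a toy bag-of-words vector in place of a real embedding model, so it only captures word overlap rather than meaning; the route names, seed utterances, and the `min_score` cutoff are assumptions for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'. A real router would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical routes, each described by a few seed utterances.
ROUTES = {
    "code-model": [
        "write a python function",
        "fix this bug in my code",
        "refactor this class",
    ],
    "summary-model": [
        "summarize this article",
        "give me a short summary of the document",
    ],
}


def route(query: str, default: str = "general-model", min_score: float = 0.2) -> str:
    """Pick the route whose nearest seed is most similar to the query."""
    query_vec = embed(query)
    best_route, best_score = default, min_score
    for name, seeds in ROUTES.items():
        score = max(cosine(query_vec, embed(seed)) for seed in seeds)
        if score > best_score:
            best_route, best_score = name, score
    return best_route


print(route("please fix the bug in this python code"))
print(route("summarize the attached document"))
print(route("what is the weather today"))
```

The `min_score` floor matters in practice: without it, every query is forced into some route even when nothing is a good match.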
3. Cascading (escalation) routing
Try the cheap model first. If its output passes a quality check (low perplexity on the answer, judge-model approval, schema validation, a confidence score), return it. If not, escalate to the next tier — and only if that fails, to the frontier model. Cascading is conceptually elegant: pay for what you actually need.
The catch is the quality check. A bad check defeats the whole strategy: a false negative (rejecting a good answer) means you over-escalate and pay for both calls, while a false positive (accepting a bad answer) means you ship a low-quality answer. Cascading works best when there's a cheap, reliable signal for "did this answer actually solve the task". Structured output validation is the textbook example, and it's why cascading is so popular in tool-use and JSON-extraction workflows.
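Here is a minimal sketch of a cascade that uses structured output validation as its quality check. The two model functions are stubs standing in for real provider calls, and the required keys are a hypothetical schema; a production check would likely validate types and values as well.

```python
import json
from typing import Callable, Optional, Sequence

ModelFn = Callable[[str], str]  # prompt -> raw model output


def valid_json_with_keys(raw: str, required: set[str]) -> Optional[dict]:
    """Quality check: output must be a JSON object containing every required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and required.issubset(data):
        return data
    return None


def cascade(
    prompt: str,
    tiers: Sequence[tuple[str, ModelFn]],
    required: set[str],
) -> tuple[dict, list[str]]:
    """Try tiers from cheapest to most expensive; return the first valid output."""
    attempts: list[str] = []
    for name, call in tiers:
        attempts.append(name)
        parsed = valid_json_with_keys(call(prompt), required)
        if parsed is not None:
            return parsed, attempts
    raise ValueError(f"no tier produced valid output; tried {attempts}")


# Stub models for illustration only.
def small_model(prompt: str) -> str:
    return "Sure! The name is Ada."  # not JSON, so the check fails


def frontier_model(prompt: str) -> str:
    return '{"name": "Ada", "age": 36}'


result, attempts = cascade(
    "Extract name and age as JSON.",
    [("small", small_model), ("frontier", frontier_model)],
    required={"name", "age"},
)
print(result, attempts)
```

Returning the list of attempted tiers alongside the answer makes the escalation rate easy to measure, which is the number that tells you whether the cascade is paying for itself.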
4. Intent-based routing
Classify the request by intent or domain (code, summarization, math, conversational, retrieval-augmented generation, vision-language, agent step) and route to a model that's specifically good at that intent. A small classifier or even a regex on system-prompt tags can do this. The router then picks among a pool of specialists: a code-tuned model for code, a long-context model for summarization of long documents, a reasoning model for math.
Intent-based routing shines when your workload has clearly separable intents and you have access to specialists that meaningfully outperform a general-purpose frontier model on a specific intent at a fraction of the cost. It struggles when intents blur — a "summarize this Python file" request is partly code, partly summarization, partly retrieval. In practice, intent routing is most useful as a coarse first pass that hands off to one of the other strategies.
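As a rough illustration of the coarse first pass, the sketch below classifies with regular expressions and flags requests that match more than one intent. The patterns, intent names, and pool names are hypothetical and deliberately simplistic; a small trained classifier would usually replace them.

```python
import re

# Hypothetical keyword patterns, checked in priority order.
INTENT_PATTERNS = [
    ("code", re.compile(r"```|\b(?:def|class|function|traceback|bug)\b", re.IGNORECASE)),
    ("math", re.compile(r"\b(?:prove|integral|equation|derivative)\b", re.IGNORECASE)),
    ("summarization", re.compile(r"\b(?:summari[sz]e|tl;dr|key points)\b", re.IGNORECASE)),
]

# Hypothetical specialist pools.
POOLS = {
    "code": "code-tuned-model",
    "math": "reasoning-model",
    "summarization": "long-context-model",
}
DEFAULT_POOL = "general-model"


def matched_intents(prompt: str) -> list[str]:
    """Return every intent whose pattern matches, in priority order."""
    return [intent for intent, pattern in INTENT_PATTERNS if pattern.search(prompt)]


def route_by_intent(prompt: str) -> tuple[str, bool]:
    """Return (pool, ambiguous). Ambiguous requests matched more than one intent."""
    intents = matched_intents(prompt)
    pool = POOLS[intents[0]] if intents else DEFAULT_POOL
    return pool, len(intents) > 1


print(route_by_intent("Fix the bug in this function"))
print(route_by_intent("Summarize this function for me"))
print(route_by_intent("Hello there"))
```

The `ambiguous` flag is one way to handle blurred intents: rather than trusting the first match, the caller can hand those requests to a semantic or cascading strategy.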
5. Load balancing and reliability routing
Routing for reliability is its own discipline. Round-robin, lowest-latency, weighted by quota, consistent-hashing on user ID for cache locality — these are infrastructure patterns borrowed from the broader microservices world, applied to LLM providers. The point of reliability routing is not cost savings; it's throughput, fallback to another provider when one is rate-limited, and resilience to provider outages. Most production systems run a reliability-routing layer underneath their cost/quality routing layer.
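The fallback part of this layer can be sketched as an ordered provider list with retries and exponential backoff. `ProviderError` and the two provider functions are hypothetical stand-ins for real client exceptions and calls, and the `sleep` parameter is injectable so the example runs without actually waiting.

```python
import time
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a provider client on rate limits, timeouts, or outages."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
    retries_per_provider: int = 2,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Try providers in order; retry each with exponential backoff before moving on."""
    errors: list[str] = []
    for name, call in providers:
        for attempt in range(retries_per_provider):
            try:
                return name, call(prompt)
            except ProviderError as exc:
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
                # Back off between retries, but not before switching provider.
                if attempt < retries_per_provider - 1:
                    sleep(base_delay * 2 ** attempt)
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stub providers for illustration only.
def rate_limited(prompt: str) -> str:
    raise ProviderError("429 rate limited")


def healthy(prompt: str) -> str:
    return f"ok: {prompt}"


delays: list[float] = []  # record backoff delays instead of sleeping
used, reply = call_with_fallback(
    "hi",
    [("primary", rate_limited), ("secondary", healthy)],
    sleep=delays.append,
)
print(used, reply, delays)
```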
How the strategies compare
| Strategy | Best for | Typical overhead | Eval burden |
|---|---|---|---|
| Cost-aware | Mixed-difficulty traffic with cheap quality signal | <1 ms (rules) to ~5 ms (classifier) | Medium — need a good threshold proxy |
| Semantic | Many intents, hard-to-enumerate categories | ~5 ms (embedding call) | Medium — need good seed utterances |
| Cascading | Validated outputs (JSON, tool calls) | +1 model call when escalating | High — quality check is critical |
| Intent-based | Workloads with clear intent separation | ~5 ms (classifier) | Medium — need calibrated classifier |
| Load balancing | Reliability, throughput, fallback | <1 ms | Low — measure provider health |
What a real routing layer looks like in 2026
Most teams that take routing seriously end up with a layered design rather than a single strategy. The shape that has emerged looks something like this:
- Reliability layer (bottom). Multi-provider fallback, retries with backoff, rate-limit aware key rotation. This is just good infra.
- Intent / category layer (middle). Coarse classification — is this a code request, a chat turn, an agent step, a vision request. Routes to a pool.
- Cost/quality layer (top). Inside the pool, either cost-aware routing with a threshold, semantic routing among specialists, or cascading from cheap to expensive.
- Observability everywhere. Every routing decision is logged with: chosen model, alternatives considered, decision signal, latency, cost, and the user-visible outcome. Without this you cannot tune anything.
The observability point is the one teams underweight. A router is only as good as your ability to measure what it did. Production routing layers ship with structured logs, per-model dashboards, and ideally a shadow-traffic mode where a small percentage of requests are also sent to alternative models so you can compare quality offline. Routers that can't tell you why they made a specific routing decision become impossible to debug the moment quality dips.
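One possible shape for such a log record, plus a deterministic way to pick shadow traffic, is sketched below. The field names and the `should_shadow` helper are assumptions for illustration, not a standard schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class RoutingDecision:
    """One structured log record per routing decision (hypothetical schema)."""
    request_id: str
    chosen_model: str
    alternatives: list[str]
    decision_signal: str   # e.g. "predicted_quality" or "intent"
    signal_value: float
    latency_ms: float
    cost_usd: float
    outcome: str           # user-visible result, e.g. "accepted" or "retried"


def to_log_line(decision: RoutingDecision) -> str:
    """Serialize to one JSON line so logs stay grep-able and machine-readable."""
    return json.dumps(asdict(decision), sort_keys=True)


def should_shadow(request_id: str, rate: float) -> bool:
    """Deterministically sample a fraction of requests for shadow comparison."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    return bucket < int(rate * 10_000)


decision = RoutingDecision(
    request_id="req-123",
    chosen_model="small",
    alternatives=["mid", "frontier"],
    decision_signal="predicted_quality",
    signal_value=0.97,
    latency_ms=212.0,
    cost_usd=0.001,
    outcome="accepted",
)
line = to_log_line(decision)
print(line)
print(should_shadow("req-123", rate=0.05))
```

Hashing the request ID rather than calling a random number generator means the same request always lands in the same bucket, which keeps shadow comparisons reproducible.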
Libraries vs. managed gateways
In 2026 the routing ecosystem splits cleanly into self-hosted libraries and managed gateways.
Self-hosted libraries give you full policy control and the lowest possible latency. LiteLLM has become the de facto Python adapter layer for routing across providers. RouteLLM ships a training framework specifically for strong-vs-weak router classifiers and has driven a lot of the academic work on cost-quality tradeoffs. The vLLM Semantic Router pushes routing down to the inference layer itself, treating routing as part of the serving stack rather than the application stack.
Managed gateways trade off a small amount of control for a lot of operational simplicity. OpenRouter aggregates dozens of providers behind one API. Cloudflare AI Gateway and Kong AI Gateway bring routing into the edge / API-gateway layer alongside rate limiting, caching, and observability. Martian is explicitly built around routing as the primary product — every request goes through a routing decision and gets traced with the why.
The right answer for most teams: start with a managed gateway to ship something this quarter, then move policy into a self-hosted library once routing logic becomes business-critical. The handoff is straightforward — both worlds are converging on similar APIs.
What's new in 2026
A few patterns are emerging that didn't exist in earlier generations of routing:
- Online-learning routers. Newer research (BaRP and PILOT, among others) treats routing as a bandit problem — the router updates its policy from feedback signals (user thumbs, judge model scores, downstream tool success) rather than relying on a static classifier. This is still early but starting to ship in production at teams with heavy eval infrastructure.
- Inference-level semantic routing. Projects like the vLLM Semantic Router push routing below the application layer, so the same routing logic governs both first-party traffic and third-party API calls hitting a shared inference cluster.
- MCP gateways. As Model Context Protocol adoption grows, MCP gateways are emerging as a unified control plane for tool routing as well as model routing — the same layer decides which tool to call and which model to call it with.
- Reasoning-vs-non-reasoning routing. The split between reasoning models (multi-step thinking, longer time-to-answer) and non-reasoning models is now a first-class routing decision. Many production systems use a small classifier specifically to decide "does this request need a reasoning model" before they touch any other routing logic.
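To illustrate the bandit framing behind online-learning routers, here is a plain epsilon-greedy sketch. It is not the BaRP or PILOT algorithm, and it ignores request context entirely; the class name, model names, and the idea of a single scalar reward (for example, a judge score minus a cost penalty) are assumptions for this example.

```python
import random
from typing import Optional, Sequence


class EpsilonGreedyRouter:
    """Context-free epsilon-greedy bandit over a fixed set of models."""

    def __init__(self, models: Sequence[str], epsilon: float = 0.1,
                 seed: Optional[int] = None) -> None:
        self.models = list(models)
        self.epsilon = epsilon
        self.counts = {m: 0 for m in self.models}
        self.mean_reward = {m: 0.0 for m in self.models}
        self._rng = random.Random(seed)

    def choose(self) -> str:
        """Explore with probability epsilon, otherwise exploit the best mean."""
        if self._rng.random() < self.epsilon:
            return self._rng.choice(self.models)
        return max(self.models, key=lambda m: self.mean_reward[m])

    def update(self, model: str, reward: float) -> None:
        """Fold one feedback signal into the running mean for that model."""
        self.counts[model] += 1
        n = self.counts[model]
        self.mean_reward[model] += (reward - self.mean_reward[model]) / n


router = EpsilonGreedyRouter(["small", "large"], epsilon=0.0)
router.update("small", 1.0)   # e.g. thumbs up
router.update("small", 0.0)   # e.g. thumbs down
router.update("large", 0.9)   # e.g. judge score
print(router.choose(), router.mean_reward)
```

A production router would condition on features of the request (a contextual bandit) and would keep epsilon above zero so it continues to explore as models and traffic change.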
Common mistakes
Three patterns show up over and over in teams that have routing turned on but not working.
Routing without evals. The single biggest failure mode. You can't tune cost-quality if you don't have an automated suite that reflects your real traffic. Routing decisions made without evals are decisions you can't verify, can't reproduce, and can't improve. Build the eval suite before you build the router. If you only have time for one of the two, build the eval suite.
Confusing routing with caching. Caching reuses answers; routing picks models. They're complementary, not the same. A common pattern is to put a semantic cache in front of the router (return the cached answer if the request is similar to one you've seen) and then route only on cache misses. This is great, but make sure both layers are observable separately — debugging a quality dip when you can't tell whether it came from the cache or the router is its own kind of pain.
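A minimal sketch of the cache-in-front-of-router pattern, with separate counters for each layer, could look like this. It uses `difflib.SequenceMatcher` as a lexical stand-in for embedding similarity, so it is not truly semantic; `CachedRouter`, the threshold, and the stub backend are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable


class CachedRouter:
    """Similarity cache in front of a router, with per-layer counters."""

    def __init__(self, route_and_call: Callable[[str], str], threshold: float = 0.9) -> None:
        self.route_and_call = route_and_call
        self.threshold = threshold
        self.entries: list[tuple[str, str]] = []
        self.stats = {"cache_hit": 0, "routed": 0}

    @staticmethod
    def _similarity(a: str, b: str) -> float:
        # Lexical stand-in; a semantic cache would compare embeddings instead.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def answer(self, prompt: str) -> tuple[str, str]:
        """Return (answer, source) so every response is attributable to a layer."""
        for cached_prompt, cached_answer in self.entries:
            if self._similarity(prompt, cached_prompt) >= self.threshold:
                self.stats["cache_hit"] += 1
                return cached_answer, "cache"
        result = self.route_and_call(prompt)
        self.entries.append((prompt, result))
        self.stats["routed"] += 1
        return result, "router"


def fake_backend(prompt: str) -> str:
    """Stub for 'route the request and call the chosen model'."""
    return f"answer to: {prompt}"


cached = CachedRouter(fake_backend)
first = cached.answer("What is the capital of France?")
second = cached.answer("what is the capital of france?")
third = cached.answer("Write a haiku about rain")
print(first[1], second[1], third[1], cached.stats)
```

Tagging each response with its source is what makes the two layers debuggable separately when quality dips.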
Routing on the request without considering the conversation. Multi-turn applications often route per-turn, but the right model for turn N depends on what happened in turns 1 through N-1. Long agent loops in particular benefit from "sticky" routing: once you've committed to a stronger model for a hard task, switching mid-flight rarely helps and often confuses the trace.
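Sticky routing can be sketched as a thin wrapper that remembers the strongest tier a conversation has reached and never downgrades from it. The tier names and the keyword-based per-turn router are hypothetical; a real implementation would also need a policy for expiring or resetting the commitment.

```python
from typing import Callable

TIER_ORDER = ["small", "mid", "frontier"]  # hypothetical tiers, weakest first


class StickyRouter:
    """Per-conversation wrapper that escalates but never downgrades mid-flight."""

    def __init__(self, per_turn_router: Callable[[str], str]) -> None:
        self.per_turn_router = per_turn_router
        self.committed: dict[str, str] = {}

    def route(self, conversation_id: str, turn_text: str) -> str:
        proposed = self.per_turn_router(turn_text)
        current = self.committed.get(conversation_id)
        if current is None or TIER_ORDER.index(proposed) > TIER_ORDER.index(current):
            self.committed[conversation_id] = proposed
        return self.committed[conversation_id]


def keyword_router(turn_text: str) -> str:
    """Stub per-turn router: escalate only when the turn looks hard."""
    return "frontier" if "prove" in turn_text.lower() else "small"


sticky = StickyRouter(keyword_router)
turns = [
    sticky.route("c1", "hi"),
    sticky.route("c1", "Prove this lemma"),
    sticky.route("c1", "thanks, now tidy it up"),
]
other = sticky.route("c2", "thanks")
print(turns, other)
```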
Routing as an AI engineering skill
The teams that put routing on a roadmap a year ago are now hiring for it explicitly. Job listings for AI engineering and AI platform roles in 2026 routinely call out LLM gateways, model routing, semantic routing, inference optimization, and prompt caching as required or strongly preferred skills. Engineers who can design a routing policy, instrument it with the right evals, and tune the cost-quality curve are the ones taking those roles. If you're building toward an AI engineering career, routing is one of the more concrete, demonstrable things you can put on your portfolio: take a small workload, ship a router, measure the cost/quality delta, and write up what you learned.
It's also one of the more underrated skills in terms of leverage. The compounding savings of a well-tuned routing layer outlast any single model upgrade — you'll get to keep them when next year's frontier model arrives and you start the whole tradeoff conversation over again.
The short version
If you're starting from zero, pick cost-aware routing as your first strategy — it's the lowest-overhead, highest-impact starting point. Build an eval suite that mirrors your real traffic before you tune. Add a reliability layer for multi-provider fallback. Add semantic or cascading routing only when the simple version stops giving you marginal improvement. Instrument everything. And revisit your routing policy whenever a new model lands — the cost-quality curve shifts every time, and yesterday's optimal router is rarely today's.
Routing is one of the rare AI engineering disciplines where the right call today still pays out a year from now. Worth the investment.