LLM Routers Are Not Enough
Budget ceilings, tiered escalation, and production-grade control for LLM systems
Modern LLM applications are no longer just prompt → response tools.
They are production systems.
They call external APIs.
They run multi-step workflows.
They depend on multiple providers.
They incur real cost per token.
And yet, most of the infrastructure around them is still designed like a demo.
The Illusion of “Model Selection”
Many teams start with something like this:
    if complexity > threshold:
        model = "gpt-4"
    else:
        model = "gpt-3.5"
Or slightly more advanced:
Use embeddings to classify intent
Route “hard” questions to bigger models
Route “easy” ones to cheaper models
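A minimal sketch of that pattern. Here classify_difficulty is a stub standing in for a real embedding-based classifier, and the model names are illustrative, not recommendations:

```python
def classify_difficulty(prompt: str) -> str:
    """Stub: a real system would embed the prompt and run a trained classifier."""
    hard_markers = ("prove", "derive", "refactor", "multi-step")
    return "hard" if any(m in prompt.lower() for m in hard_markers) else "easy"

def route(prompt: str) -> str:
    # "Hard" questions go to the bigger model, "easy" ones to the cheaper one.
    return "big-model" if classify_difficulty(prompt) == "hard" else "cheap-model"
```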
This is fine for toy systems.
But once you deploy at scale, three problems appear.
Problem 1: Budget Is Not a Hint — It’s a Constraint
In production, cost is not an afterthought.
It is:
A per-request ceiling
A per-workflow ceiling
A monthly burn constraint
Sometimes a per-customer contract limit
Most routers treat cost as an optimization target.
Very few treat it as a hard boundary.
If no model fits within the budget, the correct behavior is not “pick something anyway.”
The correct behavior is:
Fail fast.
Silent overspending is worse than a clean error.
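One way to sketch that behavior: treat the ceiling as a filter, and raise when nothing passes. The prices and model names here are illustrative placeholders:

```python
# Illustrative per-1k-token prices; a real table would come from provider pricing.
PRICE_PER_1K_TOKENS = {
    "budget-model": 0.0005,
    "mid-model": 0.003,
    "flagship-model": 0.01,
}

class BudgetExceededError(Exception):
    """Raised when no model fits the ceiling: fail fast, never overspend silently."""

def pick_model(estimated_tokens: int, ceiling_usd: float) -> str:
    # Keep only models whose estimated cost fits under the per-request ceiling.
    affordable = [
        (price, name)
        for name, price in PRICE_PER_1K_TOKENS.items()
        if price * estimated_tokens / 1000 <= ceiling_usd
    ]
    if not affordable:
        # The budget is a hard boundary, not an optimization target.
        raise BudgetExceededError(f"no model fits ceiling ${ceiling_usd}")
    # Among the models that fit, take the cheapest.
    return min(affordable)[1]
```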
Problem 2: Failover Is Not the Same as Escalation
Let’s say you try a budget-tier model and it fails.
What should happen?
Many systems:
Retry the same tier
Randomly pick another model
Escalate without preserving capability
But escalation should follow rules:
Never downgrade.
Preserve required capabilities.
Move strictly upward in defined tiers.
Respect remaining budget.
Failover is about availability.
Escalation is about capability and correctness.
They are not the same.
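The escalation rules above can be sketched as a strictly ordered tier ladder. The tier names are illustrative; the point is that movement is one-directional and the top of the ladder is an error, not a downgrade:

```python
TIERS = ["budget", "mid", "flagship"]  # strictly ordered, lowest to highest

def escalate(current_tier: str) -> str:
    """Return the next tier up. Never downgrade; at the top, surface the error."""
    idx = TIERS.index(current_tier)
    if idx + 1 >= len(TIERS):
        # Failover might retry another provider here; escalation must not
        # quietly fall back to a weaker tier.
        raise RuntimeError("flagship tier failed: raise, do not downgrade")
    return TIERS[idx + 1]
```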
Problem 3: Capability Matters
Not all models are interchangeable.
Some are better at:
Code
Reasoning
Math
General text
If a step requires reasoning and you fall back to a model that is optimized for general chat, you haven’t solved the failure.
You’ve introduced a hidden degradation.
Routing should be capability-aware, not just price-aware.
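A sketch of what capability-aware routing looks like in practice: filter by required capabilities first, then order by price. The capability tags and model entries are made up for illustration:

```python
# Illustrative model registry: each model advertises its capabilities and price.
MODELS = {
    "chat-small": {"capabilities": {"general"}, "price": 0.0005},
    "code-mid": {"capabilities": {"general", "code"}, "price": 0.003},
    "reasoner": {"capabilities": {"general", "reasoning"}, "price": 0.01},
}

def candidates(required: set[str]) -> list[str]:
    """Models that preserve every required capability, cheapest first."""
    # Capability filter comes first: price never overrides correctness.
    fits = [m for m, spec in MODELS.items() if required <= spec["capabilities"]]
    return sorted(fits, key=lambda m: MODELS[m]["price"])
```

A fallback chosen from this list can never silently drop a required capability.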
LLM Systems Are Distributed Systems
If you step back, this looks familiar.
We already know how to reason about:
Tiered infrastructure
Resource ceilings
Strict failure semantics
Deterministic escalation
Observability
We just haven’t consistently applied those principles to LLM systems yet.
But we should.
Because LLM applications are infrastructure now.
What Production-Grade Routing Requires
At minimum:
A strict budget ceiling
Tiered model classification
Capability-aware filtering
Deterministic escalation
Multi-provider failover
A clear error contract
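Those requirements can be made explicit as configuration rather than scattered heuristics. This is one possible shape, not any particular library's API; the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoutingPolicy:
    ceiling_usd_per_request: float         # strict budget ceiling
    tiers: tuple[str, ...]                 # ordered tier classification
    required_capabilities: frozenset[str]  # capability-aware filtering
    providers: tuple[str, ...]             # multi-provider failover order
    fail_fast: bool = True                 # clear error contract: no silent picks
```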
Anything less is convenience routing.
A Minimal Control Layer
I built a small open-source project called TokenWise to explore this properly.
It is not an agent framework.
It is not a prompt library.
It is a control layer for LLM routing.
It enforces:
Budget ceilings
Tier-based escalation (Budget → Mid → Flagship)
Capability-aware fallback
Multi-provider support
An OpenAI-compatible proxy interface
The goal is simple:
Treat LLM routing like infrastructure, not heuristics.
If no model fits within the ceiling, it fails.
If a model fails, it escalates strictly upward.
If a capability is required, fallback preserves it.
Nothing silent.
Nothing implicit.
Why This Matters
As more teams deploy LLM systems:
Cost predictability becomes critical.
Provider reliability becomes variable.
Multi-model strategies become normal.
Escalation logic becomes business logic.
The next wave of LLM tooling will not be about bigger prompts.
It will be about:
Control planes.
Budget guardrails.
Deterministic behavior.
Systems thinking.
The Real Question
The real question is not:
Which model should we use?
It is:
What are the invariants of our LLM system?
What is the maximum allowed cost?
What is the escalation policy?
What are the capability requirements?
When should we fail instead of fallback?
Answer those first.
Then implement routing.
Closing Thought
LLMs are probabilistic.
Your infrastructure should not be.
If you want to explore the implementation details, the project is here: