How DeepHarness's Q-Learning Router Saves 73% on AI Costs
A deep dive into the reinforcement-learning router that picks a model tier for each query, saving 73% on queries that single-tier routing would send to a more expensive model.
Most AI platforms route every query to the same model. A question about today’s weather and a request to analyze a complex multi-source dataset both hit Opus at $15 per million tokens. This is wasteful, obvious, and surprisingly hard to fix.
The naive solution — rules that pattern-match queries to model tiers — works until it does not. Language is ambiguous. A short query can require deep reasoning. A long query can be a simple lookup. Static rules are brittle and wrong often enough to erode trust.
We built a different system: a reinforcement-learning router that observes outcomes and learns which model tier handles which query type. It gets better automatically. In production, it saves 73% on queries that would otherwise be over-served by expensive models.
Here is how it works.
The Problem: Uniform Routing Is Expensive
Three model tiers exist in practice:
| Tier | Model | Cost | Capability |
|---|---|---|---|
| Fast | Haiku | $0.80/M tokens | Simple queries, lookups, classification |
| Strong | Sonnet | $3.00/M tokens | Analysis, multi-step reasoning, generation |
| Orchestrator | Opus | $15.00/M tokens | Complex orchestration, novel problems, long-context |
The price spread between the cheapest and most expensive tier is about 19x ($15.00 vs $0.80). If 60% of your queries can be handled by Haiku but you route everything to Sonnet, you are paying 3.75x the necessary price ($3.00 vs $0.80) on the majority of your traffic.
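The ratios above follow directly from the tier table. A minimal sketch of the arithmetic, where the dictionary and function names are illustrative and not part of DeepHarness:

```python
# Per-million-token list prices copied from the tier table above.
PRICE_PER_M = {"haiku": 0.80, "sonnet": 3.00, "opus": 15.00}


def price_ratio(expensive: str, cheap: str) -> float:
    """Return how many times the price of `cheap` the tier `expensive` costs."""
    return PRICE_PER_M[expensive] / PRICE_PER_M[cheap]


# Opus vs Haiku: roughly 18.75, which the text rounds to 19x.
spread = price_ratio("opus", "haiku")

# Sonnet vs Haiku: roughly 3.75, the overpayment on Haiku-capable queries.
overpay = price_ratio("sonnet", "haiku")
```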
The question is: how do you know which queries can be handled by the cheapest tier without degrading quality?
State Encoding: Representing the Query
The Q-learning router encodes each query as a composite state vector with four components:
Intent hash. The query’s classified intent mapped to one of 11 discrete values — data lookup, analysis, generation, monitoring, configuration, advisory, and so on. This captures what the user is trying to accomplish.
Complexity bucket. A numeric complexity score (0-100) from the complexity classifier, grouped into three buckets: simple (0-33), moderate (34-66), complex (67-100). The classifier evaluates query length, entity count, reasoning depth, temporal scope, and ambiguity.
Entity bitmask. A 6-bit mask encoding which entity types are present in the query: data sources, metrics, time ranges, comparisons, thresholds, geographic references. This captures the structural complexity that intent alone misses.
Recent agent hash. A hash of the last agent used in the conversation. Conversations have momentum — a follow-up question after a data discovery agent is likely data-related, even if the query itself is ambiguous.
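The full state is a tuple: (intentHash, complexityBucket, entityBitmask, recentAgentHash). The first three components alone give 11 × 3 × 64 = 2,112 combinations, and the recent agent hash multiplies that by the number of distinct hash values. That is large enough to capture meaningful distinctions but small enough for a Q-table to converge in reasonable time.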
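One way to picture the four-part state is the sketch below. It is an illustration under stated assumptions, not the DeepHarness implementation: the entity names, the 16-value agent hash range, and the use of CRC32 as a stable hash are all hypothetical choices, and the intent is assumed to arrive as an integer ID from the classifier.

```python
import zlib
from typing import NamedTuple

NUM_INTENTS = 11          # from the article
NUM_AGENT_BUCKETS = 16    # hypothetical: the article does not give a range
# Hypothetical labels for the six entity types listed in the article.
ENTITY_TYPES = ["data_source", "metric", "time_range",
                "comparison", "threshold", "geo"]


class RouterState(NamedTuple):
    intent_hash: int
    complexity_bucket: int
    entity_bitmask: int
    recent_agent_hash: int


def complexity_bucket(score: int) -> int:
    """Map a 0-100 score to 0 (simple), 1 (moderate), or 2 (complex)."""
    if not 0 <= score <= 100:
        raise ValueError("score must be in 0..100")
    if score <= 33:
        return 0
    if score <= 66:
        return 1
    return 2


def entity_bitmask(present: set[str]) -> int:
    """Set one bit per entity type found in the query."""
    mask = 0
    for bit, name in enumerate(ENTITY_TYPES):
        if name in present:
            mask |= 1 << bit
    return mask


def agent_hash(agent_name: str) -> int:
    # CRC32 is deterministic across processes, unlike built-in hash() on str.
    return zlib.crc32(agent_name.encode("utf-8")) % NUM_AGENT_BUCKETS


def encode_state(intent_id: int, score: int,
                 entities: set[str], last_agent: str) -> RouterState:
    if not 0 <= intent_id < NUM_INTENTS:
        raise ValueError("intent_id out of range")
    return RouterState(intent_id, complexity_bucket(score),
                       entity_bitmask(entities), agent_hash(last_agent))


state = encode_state(1, 72, {"metric", "time_range"}, "data_discovery")
```

Because the result is a hashable tuple, it can be used directly as a dictionary key for Q-table lookups.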
Action Selection: Epsilon-Greedy with Decay
The action space is the set of available agents, each with a default model tier. The router selects an agent (and by extension, a model tier) for each query.
Selection uses epsilon-greedy: with probability epsilon, the router explores a random agent. With probability 1-epsilon, it exploits the highest Q-value for the current state.
Epsilon decays from 0.15 to 0.01 over time. Early in deployment, the router explores 15% of queries — deliberately trying suboptimal routes to learn from the outcomes. As the Q-table fills with experience, exploration decreases to 1%. This balances learning against immediate cost efficiency.
The decay rate is calibrated so that a deployment processing 500 queries/day reaches near-optimal routing within two weeks.
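A minimal sketch of epsilon-greedy selection with decay, using the 0.15 and 0.01 endpoints from above. The exponential shape of the decay and the `DECAY_QUERIES` constant are assumptions made for illustration; the article does not specify the actual schedule.

```python
import math
import random

EPS_START = 0.15        # from the article
EPS_END = 0.01          # from the article
DECAY_QUERIES = 1500.0  # hypothetical time constant, in queries


def epsilon_at(query_count: int) -> float:
    """Exponentially decay epsilon from EPS_START toward EPS_END."""
    decay = math.exp(-query_count / DECAY_QUERIES)
    return EPS_END + (EPS_START - EPS_END) * decay


def select_action(q_row: dict[str, float], epsilon: float,
                  rng: random.Random) -> str:
    """Pick an agent: random with probability epsilon, else the best Q-value."""
    agents = list(q_row)
    if rng.random() < epsilon:
        return rng.choice(agents)                   # explore
    return max(agents, key=lambda a: q_row[a])      # exploit


rng = random.Random(42)
q_row = {"lookup_agent": 0.9, "analysis_agent": 0.4}  # hypothetical agents
choice = select_action(q_row, epsilon_at(7000), rng)
```

With this assumed constant, epsilon is close to its 0.01 floor after about 7,000 queries, which is two weeks at 500 queries/day.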
Reward Signal: Outcome Quality
After each query completes, the system generates a reward signal based on:
- Outcome quality. Did the agent produce a result the user accepted? Did the user refine the request (signal of partial failure) or move on (signal of success)?
- Cost efficiency. Lower cost for equivalent quality increases reward. A Haiku response that satisfies the user scores higher than a Sonnet response with the same outcome.
- Latency. Faster responses score higher. Haiku is 3-5x faster than Opus, so correctly routing simple queries to Haiku improves both cost and user experience.
The reward function is: R = quality_score * (1 + cost_savings_ratio) * latency_bonus. The router is therefore not just minimizing cost; it is maximizing the product of quality and efficiency. A cheap response that fails scores low because quality_score multiplies every other term. An expensive response that succeeds where a cheap one would also have succeeded earns less than the cheap one because its cost_savings_ratio is smaller. The optimal outcome is the cheapest response that fully satisfies the query.
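The formula leaves the three inputs undefined, so the sketch below fills them in with assumptions: `cost_savings_ratio` is measured against the most expensive tier, and `latency_bonus` is a ratio against a hypothetical 1,000 ms target clamped to the range 0.5 to 1.0. Only the final product comes from the article.

```python
# Per-million-token list prices from the tier table.
PRICE_PER_M = {"haiku": 0.80, "sonnet": 3.00, "opus": 15.00}


def cost_savings_ratio(tier: str) -> float:
    """Assumed definition: fraction saved relative to the priciest tier."""
    return 1.0 - PRICE_PER_M[tier] / max(PRICE_PER_M.values())


def latency_bonus(latency_ms: float, target_ms: float = 1000.0) -> float:
    """Assumed definition: 1.0 at or under target, never below 0.5."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return min(1.0, max(0.5, target_ms / latency_ms))


def reward(quality_score: float, tier: str, latency_ms: float) -> float:
    """R = quality_score * (1 + cost_savings_ratio) * latency_bonus."""
    return (quality_score
            * (1.0 + cost_savings_ratio(tier))
            * latency_bonus(latency_ms))


# Same accepted outcome (quality 1.0) served by three tiers.
r_haiku = reward(1.0, "haiku", 800)
r_sonnet = reward(1.0, "sonnet", 1500)
r_opus = reward(1.0, "opus", 3000)

# A fast, cheap answer the user rejected earns nothing.
r_failed = reward(0.0, "haiku", 800)
```

Under these assumptions the Haiku response earns the highest reward of the three successful ones, and the failed response earns zero regardless of how cheap it was.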
The 5-Layer Cost Cascade
The Q-learning router is layer 4 of a 5-layer cascade. Each layer can override the layers below it:
Layer 1: Confidentiality override. If the query involves confidential or restricted data (classified by the orchestrator), it routes to a local model regardless of cost considerations. Security overrides economics.
Layer 2: Budget degradation. If the organization is approaching its spend limit, force the fast tier. Better to serve a slightly lower-quality response than to exceed budget and halt all AI operations.
Layer 3: User model override. If the user explicitly selects a model via the model switcher, respect that choice. Human judgment overrides automated routing.
Layer 4: Cost-aware routing. The Q-learning router selects the cheapest adequate model tier based on learned experience. This is where the 73% savings happen.
Layer 5: Agent definition tier. If no higher layer has made a decision, fall back to the agent’s default model tier as defined in the agent registry.
The cascade means cost optimization never compromises security, budget controls, or explicit user intent. The Q-learning router only controls the cases where automated routing is appropriate — which is the majority of queries.
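The cascade reads naturally as a chain of early returns, checked from layer 1 down. The sketch below assumes a hypothetical `RoutingContext` record and a hypothetical 90% budget threshold; neither name nor value comes from DeepHarness.

```python
from dataclasses import dataclass
from typing import Optional

BUDGET_THRESHOLD = 0.9  # hypothetical: fraction of spend limit that triggers layer 2


@dataclass
class RoutingContext:
    is_confidential: bool
    budget_used_fraction: float
    user_selected_model: Optional[str]  # None if the user made no choice
    learned_tier: Optional[str]         # None if the router has no estimate yet
    agent_default_tier: str


def resolve_tier(ctx: RoutingContext) -> tuple[str, str]:
    """Return (tier, deciding_layer), checking layers in priority order."""
    if ctx.is_confidential:
        return "local", "confidentiality_override"   # layer 1
    if ctx.budget_used_fraction >= BUDGET_THRESHOLD:
        return "fast", "budget_degradation"          # layer 2
    if ctx.user_selected_model is not None:
        return ctx.user_selected_model, "user_override"  # layer 3
    if ctx.learned_tier is not None:
        return ctx.learned_tier, "cost_aware_routing"    # layer 4
    return ctx.agent_default_tier, "agent_default"       # layer 5


ctx = RoutingContext(False, 0.4, None, "fast", "strong")
tier, layer = resolve_tier(ctx)
```

Returning the deciding layer alongside the tier makes it easy to log why a query was routed the way it was.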
Experience Replay and Batch Updates
The router does not update after every single query. It uses experience replay — a standard reinforcement-learning technique that improves learning stability.
Each completed query generates an experience tuple: (state, action, reward, next_state). These are stored in a circular buffer of 10,000 entries. Every 50 queries, the router samples a batch of 32 experiences and performs Bellman Q-value updates.
This matters for three reasons:
- Stability. Batch updates prevent the Q-table from oscillating based on individual noisy outcomes.
- Noise tolerance. A single unusual query does not distort routing. Sampling across many experiences smooths out variance.
- Recency. The circular buffer means old experiences are gradually replaced by new ones. The router adapts to changing query patterns over time.
The Q-table persists to disk every 50 updates. On restart, the router resumes from its learned state instead of starting from scratch.
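A minimal sketch of the replay loop, using the buffer size, update interval, and batch size given above. The learning rate `ALPHA` and discount `GAMMA` are hypothetical values, the class name is illustrative, and disk persistence is left out to keep the example short.

```python
import random
from collections import defaultdict, deque

BUFFER_SIZE = 10_000  # from the article
UPDATE_EVERY = 50     # from the article
BATCH_SIZE = 32       # from the article
ALPHA = 0.1           # hypothetical learning rate
GAMMA = 0.9           # hypothetical discount factor


class ReplayRouter:
    def __init__(self, actions, seed=0):
        self.actions = list(actions)
        self.q = defaultdict(float)              # (state, action) -> Q-value
        # deque with maxlen acts as a circular buffer: oldest entries drop off.
        self.buffer = deque(maxlen=BUFFER_SIZE)
        self.seen = 0
        self.rng = random.Random(seed)

    def record(self, state, action, reward, next_state):
        """Store one experience; run a batch update every UPDATE_EVERY queries."""
        self.buffer.append((state, action, reward, next_state))
        self.seen += 1
        if self.seen % UPDATE_EVERY == 0:
            self._batch_update()

    def _batch_update(self):
        k = min(BATCH_SIZE, len(self.buffer))
        for s, a, r, s2 in self.rng.sample(list(self.buffer), k):
            # Bellman target: immediate reward plus discounted best next value.
            best_next = max(self.q[(s2, a2)] for a2 in self.actions)
            target = r + GAMMA * best_next
            self.q[(s, a)] += ALPHA * (target - self.q[(s, a)])


router = ReplayRouter(["fast_agent", "strong_agent"])
for _ in range(UPDATE_EVERY):
    router.record("s0", "fast_agent", 1.0, "s0")
learned = router.q[("s0", "fast_agent")]
```

No Q-values change until the 50th experience arrives, at which point a sampled batch of 32 is applied in one pass.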
Production Distribution
In a representative production deployment, the tier distribution stabilizes at approximately:
| Tier | Percentage | Query Types |
|---|---|---|
| Haiku (fast) | 58-65% | Status checks, simple lookups, classification, formatting |
| Sonnet (strong) | 25-30% | Multi-step analysis, content generation, data synthesis |
| Opus (orchestrator) | 8-12% | Complex orchestration, novel problems, multi-agent coordination |
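Compare this to uniform Sonnet routing (a common default): 100% of queries at $3/M. With learned routing, the reported weighted average cost drops to approximately $1.30/M, a 57% reduction. That average depends on how many tokens each tier actually processes, which the table does not show: weighting the list prices by query share alone gives roughly $2.50-3.20/M, because the 8-12% of queries sent to Opus cost $15/M. For the 60%+ of queries routed to Haiku, the per-query saving is 73% ($0.80/M instead of $3.00/M).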
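The per-query figure can be checked from the list prices alone. The sketch below also computes a simple share-weighted price for one illustrative split inside the table's ranges (62/28/10, chosen here as an assumption). That calculation ignores token volume per tier, so it does not reproduce the $1.30/M figure quoted above, which would need per-tier token counts that are not given.

```python
# Per-million-token list prices from the tier table.
PRICE_PER_M = {"haiku": 0.80, "sonnet": 3.00, "opus": 15.00}


def savings_vs(baseline: str, tier: str) -> float:
    """Fractional saving from using `tier` instead of `baseline`."""
    return 1.0 - PRICE_PER_M[tier] / PRICE_PER_M[baseline]


def share_weighted_price(shares: dict[str, float]) -> float:
    """Average list price weighted by query share (not by token volume)."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(share * PRICE_PER_M[tier] for tier, share in shares.items())


# Haiku instead of Sonnet: about 0.733, the 73% in the text.
haiku_saving = savings_vs("sonnet", "haiku")

# Illustrative split within the table's ranges (an assumption, not reported data).
example_shares = {"haiku": 0.62, "sonnet": 0.28, "opus": 0.10}
blended = share_weighted_price(example_shares)
```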
These numbers improve over time as the Q-table accumulates more experience.
The Self-Improvement Loop
This is the property that makes the system fundamentally different from rule-based routing: it gets better automatically.
A new deployment starts with the agent definition tiers (layer 5) handling most routing. The Q-learning router explores, gathering experience. After a few hundred queries, the Q-table starts capturing reliable patterns. After a few thousand, the router’s learned policy dominates and the agent defaults become fallbacks for rare query types.
No engineering intervention required. No rule updates. No manual tuning. The router observes outcomes and adjusts.
If query patterns change — a new data source, a seasonal shift in user behavior, a new team onboarding — the circular buffer naturally phases out stale experience and the router adapts.
Beyond Cost: Quality Improves Too
An unexpected benefit: routing simple queries to Haiku often produces a better user experience than routing them to Opus.
Haiku responds 3-5x faster. For a simple status check or data lookup, the user gets their answer in 800ms instead of 3 seconds. The response quality is identical — Haiku handles simple queries perfectly — but the experience is noticeably better.
Opus is not better at everything. It is better at hard things. For easy things, it is slower and more expensive with no quality advantage. Correct routing means every model tier operates in its optimal range.
The system does not just save money. It puts the right tool on the right problem. That is what intelligent routing actually means.