We put 22 model configurations through 4 workflow types and 440 benchmark runs. The cheapest model aced structured extraction. The most expensive model failed 25% of the time. And a $0.001 budget model, after a one-line prompt fix, landed within 7% of a $0.046 frontier model on output quality. At enterprise scale, the routing decisions in this article are worth six figures a year.

Every production AI system has more than one kind of job to do. Some requests are fast, frequent, and boring. Others are slow, thorny, and high stakes. Most teams still route everything to one model, usually the expensive one, because nobody has stopped long enough to test what each workflow actually needs.

This article is what fell out of that test. We run a production knowledge agent with four distinct workflows: high-frequency response generation, deep advisory analysis, structured data extraction, and text summarization. Every workflow calls tools, pulls context from a knowledge base, and produces output a human will actually read. We benchmarked major model providers across every workflow, scored the outputs for quality, and built a routing table from the evidence.

The per-user cost difference between naive routing and data-driven routing was 64x. Multiply that across hundreds or thousands of users, and the routing table becomes one of the highest-ROI engineering decisions you can make.

TL;DR

  • 440 benchmark runs across 22 models: the $0.001/call budget model matched the $0.046/call frontier model within 7% on quality after a one-line prompt fix
  • The most expensive model failed 25% of the time; reliability varied by workflow, not just by provider
  • Tool-calling behavior was the strongest predictor of output quality, outweighing model size and price
  • Reasoning mode added zero value across 160 dedicated runs on tool-calling workflows
  • Optimized routing saved $1.96M/year at 1,000 users vs. frontier-only ($2.60/user/month vs. $166/user/month)
  • 64x: per-user cost spread between cheapest and most expensive model
  • 0-100%: failure rate range across providers
  • $1.9M/yr: savings at 1,000 users, optimized routing vs. frontier-only

The Benchmark Setup

We tested 14 base models and 8 reasoning variants across 5 providers: OpenAI (gpt-5.4 family), Google (Gemini 3 family), xAI (Grok 4 family), Anthropic (Claude family), and self-hosted (Ollama with Qwen 3.5 on an NVIDIA 5090 and a consumer GPU). Each model went through 4 workflows and 5 test scenarios, producing 440 output files. Five scenarios per configuration is a modest sample, but the quality spreads were wide enough that more runs would not have changed the routing decisions.

The 4 workflow types, abstracted from our production system:

  • Quick response generation (high frequency, ~50 calls/user/day): Given context about a topic and recent activity, generate 3 response options. Output must be valid JSON. Tests personalization, tone matching, and context awareness.
  • Deep advisory analysis (moderate frequency, ~10 calls/user/day): Given a complex question and access to background data, produce a structured multi-section analysis with specific recommendations. Tests reasoning depth, framework application, and strategic nuance.
  • Structured data extraction (low frequency, ~5 calls/user/day): Given raw activity data, extract structured profiles with classifications, key attributes, and status determinations. Tests accuracy, schema compliance, and resistance to hallucination.
  • Text summarization (low frequency, ~5 calls/user/day): Given source records, produce a coherent summary paragraph. Tests conciseness and factual accuracy.

Every workflow involved tool calling: models had access to 3 tools (record lookup, recent history, and knowledge base search) and were expected to gather context before generating output. That matters because tool-calling behavior turned out to be one of the strongest predictors of output quality.

We scored outputs on domain-specific rubrics (specificity, context threading, usability, framework usage, strategic depth) rated 0-3 per dimension, then computed a composite value score:

value_score = quality_score / (cost_per_call × frequency_weight)

This formula drags the tradeoff into daylight. A model that scores 15/15 at $0.046/call has a lower value score than a model that scores 15/15 at $0.004/call. And a model that scores 12/15 at $0.002/call might beat both for budget-sensitive workflows.
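
As a sketch, the value score is just that division. The model names and the flat frequency weight of 1.0 here are illustrative, not the article's actual weighting:

```python
def value_score(quality_score, cost_per_call, frequency_weight=1.0):
    # Higher quality raises value; higher cost or call frequency lowers it
    return quality_score / (cost_per_call * frequency_weight)

frontier = value_score(15, 0.046)  # ~326
mid_tier = value_score(15, 0.004)  # 3750
budget   = value_score(12, 0.002)  # 6000
```

Even with a three-point quality deficit, the budget model's value score dominates, which is exactly the tradeoff the formula is designed to surface.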

Finding 1: Your High-Frequency Workflow Drives the Bill

This should be obvious. It still gets ignored all the time. If one workflow runs 10x more often than the others, its per-call cost will bully the rest of your monthly bill. In our system, response generation accounts for ~71% of daily call volume per user (50 out of 70 calls). The model choice for this single workflow determines whether you spend $3 or $100 per user per month. At 1,000 users, that is the difference between $3,000 and $100,000 monthly.
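
The arithmetic is simple enough to sketch. A minimal projection, assuming a flat 30-day month (the table below uses actual billing periods, so its figures differ slightly):

```python
def monthly_cost(cost_per_call, calls_per_day, days=30):
    # Per-user monthly spend for one workflow, flat 30-day month assumed
    return cost_per_call * calls_per_day * days

budget = monthly_cost(0.001, 50)    # $1.50/user/month
frontier = monthly_cost(0.046, 50)  # $69.00/user/month
```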

Model                        Cost/Call   Monthly/User (50/day)   Quality
Budget A (with prompt fix)   $0.001      $1.46                   13-14/15
Mid-tier A                   $0.004      $6.14                   15/15
Budget B                     $0.002      $2.47                   12/15
Premium A                    $0.014      $20.30                  15/15
Frontier A                   $0.046      $69.23                  14/15

The budget model (at $0.001/call) scored 13-14 out of 15, within 7% of the frontier model that costs 46x more. Here is the twist: it only reached that score after a one-line prompt fix. In the initial benchmark, the same model scored 10/15 because it skipped a critical tool call. The model had the horsepower. The prompt had a hole in it. (More on that in Finding 6.)

The mid-tier model (at $0.004/call) still wears the quality crown at 15/15, with sharper detail than the budget model can quite match. But at 4x the cost and 50 calls per user per day, that final 7% of quality costs $56 per user per year. At 500 users, you are paying $28,000 annually for the last 7% of quality on your most frequent workflow. Whether that is worth it depends on your domain.

The lesson: do not assume expensive means better, and do not assume cheap means worse. Benchmark it, and if a model underperforms on tool calling, inspect the prompt before you replace the model.

Finding 2: Reliability Varies Wildly by Task Complexity

This was the real curveball. We expected models to either work or not work. Instead, reliability turned out to be workflow-dependent. A model can be 100% reliable on one workflow and fail 80% of the time on another.

Provider                   Total Runs   Failures   Rate   Notes
Provider A (budget tier)   20           20         100%   API connectivity issue; every run failed
Provider A (frontier)      20           5          25%    4/5 extraction failures, 1 summarization
Self-hosted (large)        20           2          10%    Timeout on complex extraction
Provider A (mid-tier)      20           0          0%     Perfect reliability
Provider B (all tiers)     120          0          0%     60 base + 60 reasoning, zero failures
Provider C (all tiers)     60           0          0%     Including budget models
Provider D (all tiers)     60           0          0%     Fastest average response time

The most expensive model in our benchmark (Provider A's frontier tier at $5.00/$25.00 per million tokens) produced the best advisory analysis output. It was also the model we could not route to in production, because it failed 25% of all runs and 80% of extraction runs specifically.

This creates what I call the Frontier Paradox: the model with the highest peak quality can still be the worst production choice because reliability matters more than brilliance in flashes. A model that delivers 92% of the quality with 100% reliability beats one that delivers 100% quality but fails one in four calls.

The practical implication is simple: your routing table needs a reliability column, not just a quality column. If a model's failure rate blows past your error budget, its best-case output stops mattering.
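
One way to encode that column is a hard gate ahead of the quality ranking. This is a sketch; the field names and the 95% threshold are illustrative:

```python
def eligible_models(results, min_reliability=0.95):
    # Disqualify any model whose success rate misses the error budget,
    # then rank the survivors by quality
    survivors = [r for r in results if r["success_rate"] >= min_reliability]
    return sorted(survivors, key=lambda r: r["quality"], reverse=True)

advisory = [
    {"model": "frontier", "quality": 12, "success_rate": 0.75},  # best output, fails 1 in 4
    {"model": "premium",  "quality": 11, "success_rate": 1.00},
    {"model": "budget",   "quality": 9,  "success_rate": 1.00},
]
ranked = eligible_models(advisory)
# frontier never makes the list; premium leads among the reliable models
```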

Finding 3: The Most Expensive Model Is Not the Best

We scored the deep advisory analysis outputs on four dimensions: situation accuracy, framework application, recommendation quality, and whether the model challenged shaky assumptions when the data contradicted them. The frontier model ($0.14/call) scored 12/12. The premium model ($0.031/call) scored 11/12.

That one-point quality delta costs $0.109 per call. At 10 advisory calls per user per day, that is $32.70 per user per month for a barely noticeable improvement, plus a 25% chance the call fails outright. For a team of 200 users, that adds up to $78,480 per year for one point of quality on one workflow.

Meanwhile, the budget-fast model ($0.002/call for advisory) scored 9/12 on the same rubric. It gave up some nuance and pushback depth, but it still found the key dynamics and produced useful recommendations. At 15x cheaper than the premium model, that is a legitimate primary choice, not just a panic button.

The value score calculation makes this concrete:

Model             Quality   Cost/Call   Value Score   Verdict
Premium           11/12     $0.031      355           Best value for advisory
Budget (strong)   10/12     $0.004      2,500         Best $/quality ratio
Frontier          12/12     $0.140     86            Peak quality, lowest value

For the advisory workflow specifically, we chose the premium model at $0.031/call. It is the one route where we pay up, because people read the full output and the quality difference between $0.031 and $0.008 is noticeable. But even here, the frontier model at $0.14 does not earn its keep.

Finding 4: Tool-Calling Behavior Is the Real Quality Signal

This was the most actionable finding in the entire benchmark. Models that called all available tools produced better output than models that skipped tools, regardless of size or price.

Our workflows give models access to 3 tools: record lookup, recent history, and knowledge base search. Models that called all 3 produced sharply grounded, context-specific output. Models that skipped one or more tools drifted toward generic mush, even when they were expensive frontier models.

Model                  Tools Called                   Quality   Cost
Mid-tier (3 tools)     Record + History + KB Search   15/15     $0.004
Premium (3 tools)      Record + History + KB Search   15/15     $0.014
Frontier A (2 tools)   Record + History               14/15     $0.046
Budget X (2 tools)     Record + History               12/15     $0.002
Budget Y (1 tool)      History only                   7/15      $0.001
Budget Z (1 tool)      Record only                    6/15      $0.001

The budget model that skipped recent history produced responses anchored to stale context from 3 days earlier. The one that skipped record lookup produced generic output with no useful specificity. Both failures were invisible from the model's perspective. The output looked clean, grammatical, and well structured. In practice, it was dead weight.

Practical takeaway: when evaluating models for tool-calling workflows, do not stop at whether the model can call tools. Check whether it calls the right tools, in the right order, with sane parameters. A model that skips a critical tool call will produce confidently wrong output, and metrics alone will not save you. You need to read the outputs.
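
A log audit for this can be a few lines. A sketch, with hypothetical tool names standing in for the real ones:

```python
REQUIRED_TOOLS = {"record_lookup", "recent_history", "kb_search"}  # hypothetical names

def missing_tools(call_log):
    # Which required tools did this run never call?
    # An empty set means the model gathered full context before answering.
    called = {call["tool"] for call in call_log}
    return REQUIRED_TOOLS - called

run_log = [{"tool": "record_lookup"}, {"tool": "recent_history"}]
flagged = missing_tools(run_log)  # {"kb_search"}: flag this run for manual review
```

Running this over every benchmark run turns "did it call the right tools" from a vibe check into a countable metric, though reading the outputs is still the final word.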

Finding 5: Reasoning Mode Added Zero Value

We tested 8 reasoning variants against their base models across 160 dedicated runs. Reasoning mode delivered no measurable quality improvement on any of our four tool-calling workflows.

The highlights: OpenAI's reasoning_effort parameter is silently ignored when tools are present in the Chat Completions API. xAI's reasoning mode doubled latency (from 18s to 37s) for identical output. Google showed faint structural improvements at a 9 to 16% latency cost. None were worth keeping on.

The core insight: tool-calling workflows already have reasoning baked into the architecture. Each tool call is a deliberation step grounded in real data. Adding internal thinking tokens on top is paying twice for the same work. At 500 users, leaving reasoning on with a provider that doubles latency burns 2,500+ hours of wasted compute per year.

We wrote a full deep dive on this finding with provider-by-provider data, a decision framework, and scale economics.

Finding 6: A One-Line Prompt Fix Worth $186K/Year at Scale

This was the finding that changed our routing table. The budget-fast model initially scored 10/15 on our highest-frequency workflow because it was skipping a critical tool call. It would fetch record data and recent history, then skip knowledge base search. Without that last step, its output lacked the grounded tone that made the more expensive models feel polished.

Here is the key detail: the same model called the knowledge base extensively on the advisory workflow (4-5 tool calls including 2 KB searches per run). This was not a capability limit. It was a prompt problem.

The original prompt said:

Use these to gather information before suggesting: [tool list]. Gather what you need, then return ONLY the JSON array.

"Gather what you need" gave the model permission to decide KB search was optional. It made a plausible optimization (for a quick response, why bother searching the knowledge base?) and quietly traded away quality.

The fix was one line:

You MUST call all three tools before responding. Call all three in your first turn.

Result: the model now calls all three tools on every run. Quality jumped from 10/15 to 13-14/15. Latency rose by roughly 2 seconds (from 18s to 20s). Cost per call stayed at $0.001.

That one-line change made this budget model viable as the primary for all four workflows, dropping per-user monthly cost from $18.13 to $2.60. Engineering effort: about five minutes. At 1,000 users, that prompt fix is worth $186,000 per year. (We wrote a full article on this finding.)

The broader lesson: when a model underperforms on tool calling, check the prompt before replacing the model. A permissive prompt ("gather what you need") lets models make optimization decisions that trade quality for speed. A directive prompt ("you MUST call all three tools") removes that wiggle room. The model was always capable; it just needed sharper instructions.

This changes how you should evaluate models in benchmarks. If you test a model with a permissive prompt and it scores poorly, you may be measuring prompt quality rather than model quality. Test with directive prompts first, then relax them only if the extra tool calls add latency without improving output.
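
That advice can also be enforced at runtime: verify the tool calls after each run and re-prompt with the directive when one was skipped. This is a sketch, not the article's implementation; `generate` stands in for whatever client wrapper you use:

```python
def call_with_tool_guard(generate, required_tools, max_retries=1):
    """Run the model, check that every required tool was actually called,
    and retry with the directive prompt if any were skipped.
    `generate` is a stand-in for a model client: it returns
    (output, tools_called) and accepts a `directive` flag that switches
    the permissive prompt for the "you MUST call all three tools" version."""
    output, tools_called = generate(directive=False)
    for _ in range(max_retries):
        if required_tools <= set(tools_called):
            break
        output, tools_called = generate(directive=True)
    return output

# Toy client: skips kb_search unless the directive prompt is used
def toy_generate(directive=False):
    if directive:
        return "grounded answer", ["record_lookup", "recent_history", "kb_search"]
    return "generic answer", ["record_lookup", "recent_history"]

result = call_with_tool_guard(
    toy_generate, {"record_lookup", "recent_history", "kb_search"}
)
# result == "grounded answer": the guard re-prompted with the directive
```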

The Routing Table

Based on the benchmark, here is the routing configuration we deployed:

High-Frequency Response Generation (Primary)

50 calls/day: context-aware response options
Model: Budget-fast ($0.20/$0.50 per 1M) · Avg cost: $0.001/call · Avg latency: 20s · Monthly: $1.46

Deep Advisory Analysis (Primary)

10 calls/day: strategic multi-section analysis
Model: Budget-fast ($0.20/$0.50 per 1M) · Avg cost: $0.002/call · Avg latency: 29s · Monthly: $0.56

Structured Data Extraction (Primary)

5 calls/day: record profiling, classification, status
Model: Budget-fast ($0.20/$0.50 per 1M) · Avg cost: $0.003/call · Avg latency: 34s · Monthly: $0.49

Text Summarization (Primary)

5 calls/day: synthesis paragraphs from source records
Model: Budget-fast ($0.20/$0.50 per 1M) · Avg cost: $0.001/call · Avg latency: 11s · Monthly: $0.09

Total per-user monthly cost with routing: $2.60

Yes, a single budget model handles all four workflows. That might look like it contradicts the whole premise of routing. It does not. The point of routing is not that you will always use different models. The point is that the methodology tells you when you can get away with one and when you cannot. In our case, the prompt fix in Finding 6 made the budget model viable everywhere. Without that fix, we would need a mid-tier model for response generation and the routing table would look different.

At frontier-only pricing, the same workload costs $166 per user per month. At 1,000 users, that is $166,000/month vs $2,600/month. The routing table saves $1.96 million per year at that scale.
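
The savings math checks out from the routing table's own figures. A quick sanity check, using the per-workflow monthly costs listed above:

```python
# Per-user monthly cost by workflow, from the deployed routing table
routed_monthly = {
    "response_generation": 1.46,
    "advisory_analysis":   0.56,
    "data_extraction":     0.49,
    "summarization":       0.09,
}
routed = sum(routed_monthly.values())                  # $2.60/user/month
frontier_only = 166.00                                 # same workload, frontier everywhere
annual_savings = (frontier_only - routed) * 1000 * 12  # ~$1.96M/year at 1,000 users
```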

If you want the absolute best output quality on your highest-frequency workflow, you can swap in a mid-tier model ($0.004/call) for that one route. That raises the per-user monthly total to $7.28 (still 23x cheaper than frontier) while getting a perfect 15/15 on the workflow that runs 50 times per user per day.

Self-Hosted as a Tier

We tested self-hosted models (Qwen 3.5 on NVIDIA GPUs) as a potential near-zero-marginal-cost tier. The results were mixed.

The larger self-hosted model (27B parameters on an RTX 5090) produced surprisingly competitive output. On response generation, it scored 13/15, better than several API models that cost $0.01+ per call. On advisory analysis, it scored 10/12, competitive with premium API models. Raw quality was not the problem.

The problem was speed. The 27B model averaged 112s for response generation and 111s for advisory analysis. That is 6-7x slower than equivalent API models. For a real-time workflow, that is a nonstarter.

The smaller self-hosted model (9B parameters on a consumer GPU) was faster but produced significantly lower quality: 7/15 on response generation with generic, poorly personalized output.

Break-even calculation

At cloud GPU pricing (~$0.80/hr for a 5090-class instance), self-hosting costs ~$576/month for 24/7 availability. Our per-user API bill with optimized routing is $2.60/month. For a single user, self-hosting would need to handle 220x current volume to break even. But at 100 users ($260/month API total), break-even arrives at roughly 2x current volume per user. At spot pricing ($0.30/hr = $216/month), break-even arrives even sooner.
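
The break-even figures above follow from one ratio. A sketch, using the article's numbers:

```python
def breakeven_multiple(gpu_monthly, api_monthly):
    # How many multiples of current volume self-hosting must absorb
    # before a dedicated GPU beats the API bill
    return gpu_monthly / api_monthly

one_user   = breakeven_multiple(576, 2.60)   # ~221x current volume
hundred    = breakeven_multiple(576, 260)    # ~2.2x
spot_price = breakeven_multiple(216, 260)    # <1x: spot already breaks even
```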

Self-hosting makes sense for three specific scenarios:

  • Privacy requirements that prohibit sending data to third-party APIs
  • Batch processing where latency is irrelevant and you can saturate the GPU
  • Development/testing where you want unlimited calls without API costs

For production real-time workloads at our volume, API wins on both economics and speed.

How to Build Your Own Routing Table

The specific models and prices in our routing table will age quickly. The methodology will not. Here is the framework:

1. Inventory your workflows.

List every distinct AI workflow in your system. For each one, record daily call volume, latency requirements, output format, and required tools or context. The volume numbers matter most because they tell you which workflows dominate the bill.

2. Build scenario tests.

For each workflow, create 3-5 representative test scenarios that cover your range of difficulty. Include at least one easy case, one hard case, and one edge case. Use real production data if you can. Synthetic benchmarks will lie to you with a straight face.

3. Run every model through every scenario.

Yes, every model. The results will surprise you. Budget models will beat premium ones on specific tasks. Premium models will stumble on tasks you expected to be easy. You will not know any of that until you test it.

4. Score outputs on domain-specific rubrics, not generic benchmarks.

Public benchmarks measure broad capability. Your routing decisions need performance on your specific tasks with your specific data. Build rubrics that reflect what "quality" means for each workflow. Score actual outputs, not token counts or latency.

5. Compute value scores, not just quality scores.

Use the formula: value = quality / (cost × frequency_weight). A model that scores 12/15 at $0.002 has a higher value than one that scores 15/15 at $0.046, especially for a workflow that runs 50 times a day. Let the math drive the routing, not your hunches about which model "should" win.

6. Track reliability separately from quality.

A model with a 25% failure rate is not a production model, regardless of its quality scores. Set a reliability threshold (we use 95%) and disqualify any model that misses it for that workflow. Reliability data needs larger sample sizes than quality scoring, so plan for at least 20 runs per model per workflow.

7. Check tool-calling behavior.

If your workflows involve tool calling, read the actual tool call logs. Models that skip tools will produce confidently wrong output that looks fine on the surface. The rubric will catch this only if you are reading the outputs yourself, not just checking whether the format is valid.

8. Re-benchmark quarterly.

Models improve. Pricing changes. New models launch. Your routing table is a snapshot, not a permanent decision. Schedule quarterly re-benchmarks with the same scenarios and rubrics. The investment is one afternoon of compute. The payoff is confidence that your routing still makes sense.
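
The whole selection step fits in a few lines once the benchmark data exists. A sketch combining steps 5 and 6, with illustrative field names rather than the article's actual harness schema:

```python
def build_routing_table(benchmarks, min_reliability=0.95):
    """Pick the highest-value eligible model for each workflow.
    `benchmarks` maps workflow -> list of per-model result dicts."""
    table = {}
    for workflow, results in benchmarks.items():
        # Step 6: reliability gate first; quality never rescues a flaky model
        eligible = [r for r in results if r["success_rate"] >= min_reliability]
        # Step 5: rank survivors by value = quality / (cost x frequency)
        best = max(
            eligible,
            key=lambda r: r["quality"] / (r["cost_per_call"] * r["frequency_weight"]),
        )
        table[workflow] = best["model"]
    return table

benchmarks = {
    "response_generation": [
        {"model": "frontier", "quality": 14, "cost_per_call": 0.046,
         "frequency_weight": 50, "success_rate": 0.75},
        {"model": "budget-fast", "quality": 13, "cost_per_call": 0.001,
         "frequency_weight": 50, "success_rate": 1.00},
    ],
}
# build_routing_table(benchmarks) picks "budget-fast": the frontier model
# is disqualified by the reliability gate before value is even compared
```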


The Bottom Line

Model routing is not premature optimization. It is the difference between $166 and $2.60 per user per month for the same or better output quality. At 1,000 users, that is $1.96 million per year. At any meaningful scale, naive routing is one of the most expensive technical debts an AI team can carry.

The work is not glamorous. It is 440 benchmark tests, a pile of output reviews, a scoring rubric, and some arithmetic. But it gives you a routing table backed by data instead of vibes, and that routing table pays for itself in the first week.

Four things we did not expect going in:

  1. A $0.001 budget model coming within 7% of the $0.046 frontier model after a one-line prompt fix
  2. The most expensive model being the least reliable (25% failure rate)
  3. Reasoning mode providing zero benefit for tool-calling workflows
  4. Prompt engineering mattering more than model selection for tool-calling quality

The last point bears repeating. We spent the initial analysis comparing 22 models and building a routing table. Then we spent five minutes tweaking one prompt and cut per-user cost by 86% with comparable quality. The prompt fix moved the needle more than the model swap. At enterprise scale, those five minutes of prompt work are worth more than most quarterly optimization projects.

We would not have known any of this without running the benchmark. Neither will you. The routing table is not a guess. It is an engineering artifact built from evidence. The methodology ages slowly even as models and pricing change fast.