We put 22 model configurations through 4 workflow types and 440 benchmark runs. The cheapest model aced structured extraction. The most expensive model failed 25% of the time. And a $0.001 budget model, after a one-line prompt fix, landed within 7% of a $0.046 frontier model on output quality. At enterprise scale, the routing decisions in this article are worth six figures a year.

Every production AI system has more than one kind of job to do. Some requests are fast, frequent, and boring. Others are slow, thorny, and high stakes. Most teams still route everything to one model, usually the expensive one, because nobody has stopped long enough to test what each workflow actually needs.

This article is what fell out of that test. We run a production knowledge agent with four distinct workflows: high-frequency response generation, deep advisory analysis, structured data extraction, and text summarization. Every workflow calls tools, pulls context from a knowledge base, and produces output a human will actually read. We benchmarked major model providers across every workflow, scored the outputs for quality, and built a routing table from the evidence.

The per-user cost difference between naive routing and data-driven routing was 64x. Multiply that across hundreds or thousands of users, and the routing table becomes one of the highest-ROI engineering decisions you can make.

TL;DR

  • 440 benchmark runs across 22 models: the $0.001/call budget model matched the $0.046/call frontier model within 7% on quality after a one-line prompt fix
  • The most expensive model failed 25% of the time; reliability varied by workflow, not just by provider
  • Tool-calling behavior was the strongest predictor of output quality, outweighing model size and price
  • Reasoning mode added zero value across 160 dedicated runs on tool-calling workflows
  • Optimized routing saved $1.96M/year at 1,000 users vs. frontier-only ($2.60/user/month vs. $166/user/month)
  • 64x: per-user cost spread between cheapest and most expensive model
  • 0-100%: failure rate range across providers
  • $1.9M/yr: savings at 1,000 users, optimized routing vs. frontier-only

The Benchmark Setup

We tested 14 base models and 8 reasoning variants across 5 providers: OpenAI (gpt-5.4 family), Google (Gemini 3 family), xAI (Grok 4 family), Anthropic (Claude family), and self-hosted (Ollama with Qwen 3.5 on an NVIDIA 5090 and a consumer GPU). Each model went through 4 workflows and 5 test scenarios, producing 440 output files. Five scenarios per configuration is a modest sample, but the quality spreads were wide enough that more runs would not have changed the routing decisions.

The 4 workflow types, abstracted from our production system:

  • Quick response generation (high frequency, ~50 calls/user/day): Given context about a topic and recent activity, generate 3 response options. Output must be valid JSON. Tests personalization, tone matching, and context awareness.
  • Deep advisory analysis (moderate frequency, ~10 calls/user/day): Given a complex question and access to background data, produce a structured multi-section analysis with specific recommendations. Tests reasoning depth, framework application, and strategic nuance.
  • Structured data extraction (low frequency, ~5 calls/user/day): Given raw activity data, extract structured profiles with classifications, key attributes, and status determinations. Tests accuracy, schema compliance, and resistance to hallucination.
  • Text summarization (low frequency, ~5 calls/user/day): Given source records, produce a coherent summary paragraph. Tests conciseness and factual accuracy.

Every workflow involved tool calling: models had access to 3 tools (record lookup, recent history, and knowledge base search) and were expected to gather context before generating output. That matters because tool-calling behavior turned out to be one of the strongest predictors of output quality.

We scored outputs on domain-specific rubrics (specificity, context threading, usability, framework usage, strategic depth) rated 0-3 per dimension, then computed a composite value score:

value_score = quality_score / (cost_per_call × frequency_weight)

This formula drags the tradeoff into daylight. A model that scores 15/15 at $0.046/call has a lower value score than a model that scores 15/15 at $0.004/call. And a model that scores 12/15 at $0.002/call might beat both for budget-sensitive workflows.
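
As a sketch, the value score is just that division. The model names and the flat frequency weight of 1.0 here are illustrative, not the article's actual weighting:

```python
def value_score(quality_score, cost_per_call, frequency_weight=1.0):
    # Higher quality raises value; higher cost or call frequency lowers it
    return quality_score / (cost_per_call * frequency_weight)

frontier = value_score(15, 0.046)  # ~326
mid_tier = value_score(15, 0.004)  # 3750
budget   = value_score(12, 0.002)  # 6000
```

Even with a three-point quality deficit, the budget model's value score dominates, which is exactly the tradeoff the formula is designed to surface.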

Finding 1: Your High-Frequency Workflow Drives the Bill

This should be obvious. It still gets ignored all the time. If one workflow runs 10x more often than the others, its per-call cost will bully the rest of your monthly bill. In our system, response generation accounts for ~71% of daily call volume per user (50 out of 70 calls). The model choice for this single workflow determines whether you spend $3 or $100 per user per month. At 1,000 users, that is the difference between $3,000 and $100,000 monthly.
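
The arithmetic is simple enough to sketch. A minimal projection, assuming a flat 30-day month (the table below uses actual billing periods, so its figures differ slightly):

```python
def monthly_cost(cost_per_call, calls_per_day, days=30):
    # Per-user monthly spend for one workflow, flat 30-day month assumed
    return cost_per_call * calls_per_day * days

budget = monthly_cost(0.001, 50)    # $1.50/user/month
frontier = monthly_cost(0.046, 50)  # $69.00/user/month
```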

Model                        Cost/Call   Monthly/User (50/day)   Quality
Budget A (with prompt fix)   $0.001      $1.46                   13-14/15
Mid-tier A                   $0.004      $6.14                   15/15
Budget B                     $0.002      $2.47                   12/15
Premium A                    $0.014      $20.30                  15/15
Frontier A                   $0.046      $69.23                  14/15

The budget model (at $0.001/call) scored 13-14 out of 15, within 7% of the frontier model that costs 46x more. Here is the twist: it only reached that score after a one-line prompt fix. In the initial benchmark, the same model scored 10/15 because it skipped a critical tool call. The model had the horsepower. The prompt had a hole in it. (More on that in Finding 6.)

The mid-tier model (at $0.004/call) still wears the quality crown at 15/15, with sharper detail than the budget model can quite match. But at 4x the cost and 50 calls per user per day, that final 7% of quality costs $56 per user per year. At 500 users, you are paying $28,000 annually for the last 7% of quality on your most frequent workflow. Whether that is worth it depends on your domain.

The lesson: do not assume expensive means better, and do not assume cheap means worse. Benchmark it, and if a model underperforms on tool calling, inspect the prompt before you replace the model.

Finding 2: Reliability Varies Wildly by Task Complexity

This was the real curveball. We expected models to either work or not work. Instead, reliability turned out to be workflow-dependent. A model can be 100% reliable on one workflow and fail 80% of the time on another.

Provider                   Total Runs   Failures   Rate   Notes
Provider A (budget tier)   20           20         100%   API connectivity issue; every run failed
Provider A (frontier)      20           5          25%    4/5 extraction failures, 1 summarization
Self-hosted (large)        20           2          10%    Timeout on complex extraction
Provider A (mid-tier)      20           0          0%     Perfect reliability
Provider B (all tiers)     120          0          0%     60 base + 60 reasoning, zero failures
Provider C (all tiers)     60           0          0%     Including budget models
Provider D (all tiers)     60           0          0%     Fastest average response time

The most expensive model in our benchmark (Provider A's frontier tier at $5.00/$25.00 per million tokens) produced the best advisory analysis output. It was also the model we could not route to in production, because it failed 25% of all runs and 80% of extraction runs specifically.

This creates what I call the Frontier Paradox: the model with the highest peak quality can still be the worst production choice because reliability matters more than brilliance in flashes. A model that delivers 92% of the quality with 100% reliability beats one that delivers 100% quality but fails one in four calls.

The practical implication is simple: your routing table needs a reliability column, not just a quality column. If a model's failure rate blows past your error budget, its best-case output stops mattering.
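
One way to encode that column is a hard gate ahead of the quality ranking. This is a sketch; the field names and the 95% threshold are illustrative:

```python
def eligible_models(results, min_reliability=0.95):
    # Disqualify any model whose success rate misses the error budget,
    # then rank the survivors by quality
    survivors = [r for r in results if r["success_rate"] >= min_reliability]
    return sorted(survivors, key=lambda r: r["quality"], reverse=True)

advisory = [
    {"model": "frontier", "quality": 12, "success_rate": 0.75},  # best output, fails 1 in 4
    {"model": "premium",  "quality": 11, "success_rate": 1.00},
    {"model": "budget",   "quality": 9,  "success_rate": 1.00},
]
ranked = eligible_models(advisory)
# frontier never makes the list; premium leads among the reliable models
```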

Finding 3: The Most Expensive Model Is Not the Best

We scored the deep advisory analysis outputs on four dimensions: situation accuracy, framework application, recommendation quality, and whether the model challenged shaky assumptions when the data contradicted them. The frontier model ($0.14/call) scored 12/12. The premium model ($0.031/call) scored 11/12.

That one-point quality delta costs $0.109 per call. At 10 advisory calls per user per day, that is $32.70 per user per month for a barely noticeable improvement, plus a 25% chance the call fails outright. For a team of 200 users, that adds up to $78,480 per year for one point of quality on one workflow.

Meanwhile, the budget-fast model ($0.002/call for advisory) scored 9/12 on the same rubric. It gave up some nuance and pushback depth, but it still found the key dynamics and produced useful recommendations. At 15x cheaper than the premium model, that is a legitimate primary choice, not just a panic button.

The value score calculation makes this concrete:

Model             Quality   Cost/Call   Value Score   Verdict
Premium           11/12     $0.031      355           Best value for advisory
Budget (strong)   10/12     $0.004      2,500         Best $/quality ratio
Frontier          12/12     $0.140     86            Peak quality, lowest value

For the advisory workflow specifically, we chose the premium model at $0.031/call. It is the one route where we pay up, because people read the full output and the quality difference between $0.031 and $0.008 is noticeable. But even here, the frontier model at $0.14 does not earn its keep.

Finding 4: Tool-Calling Behavior Is the Real Quality Signal

This was the most actionable finding in the entire benchmark. Models that called all available tools produced better output than models that skipped tools, regardless of size or price.

Our workflows give models access to 3 tools: record lookup, recent history, and knowledge base search. Models that called all 3 produced sharply grounded, context-specific output. Models that skipped one or more tools drifted toward generic mush, even when they were expensive frontier models.

Model                  Tools Called                   Quality   Cost
Mid-tier (3 tools)     Record + History + KB Search   15/15     $0.004
Premium (3 tools)      Record + History + KB Search   15/15     $0.014
Frontier A (2 tools)   Record + History               14/15     $0.046
Budget X (2 tools)     Record + History               12/15     $0.002
Budget Y (1 tool)      History only                   7/15      $0.001
Budget Z (1 tool)      Record only                    6/15      $0.001

The budget model that skipped recent history produced responses anchored to stale context from 3 days earlier. The one that skipped record lookup produced generic output with no useful specificity. Both failures were invisible from the model's perspective. The output looked clean, grammatical, and well structured. In practice, it was dead weight.

Practical takeaway: when evaluating models for tool-calling workflows, do not stop at whether the model can call tools. Check whether it calls the right tools, in the right order, with sane parameters. A model that skips a critical tool call will produce confidently wrong output, and metrics alone will not save you. You need to read the outputs.
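
A log audit for this can be a few lines. A sketch, with hypothetical tool names standing in for the real ones:

```python
REQUIRED_TOOLS = {"record_lookup", "recent_history", "kb_search"}  # hypothetical names

def missing_tools(call_log):
    # Which required tools did this run never call?
    # An empty set means the model gathered full context before answering.
    called = {call["tool"] for call in call_log}
    return REQUIRED_TOOLS - called

run_log = [{"tool": "record_lookup"}, {"tool": "recent_history"}]
flagged = missing_tools(run_log)  # {"kb_search"}: flag this run for manual review
```

Running this over every benchmark run turns "did it call the right tools" from a vibe check into a countable metric, though reading the outputs is still the final word.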

Finding 5: Reasoning Mode Added Zero Value

We tested 8 reasoning variants against their base models across 160 dedicated runs. Reasoning mode delivered no measurable quality improvement on any of our four tool-calling workflows.

The highlights: OpenAI's reasoning_effort parameter is silently ignored when tools are present in the Chat Completions API. xAI's reasoning mode doubled latency (from 18s to 37s) for identical output. Google showed faint structural improvements at a 9 to 16% latency cost. None were worth keeping on.

The core insight: tool-calling workflows already have reasoning baked into the architecture. Each tool call is a deliberation step grounded in real data. Adding internal thinking tokens on top is paying twice for the same work. At 500 users, leaving reasoning on with a provider that doubles latency burns 2,500+ hours of wasted compute per year.

We wrote a full deep dive on this finding with provider-by-provider data, a decision framework, and scale economics.

Finding 6: A One-Line Prompt Fix Worth $186K/Year at Scale

This was the finding that changed our routing table. The budget-fast model initially scored 10/15 on our highest-frequency workflow because it was skipping a critical tool call. It would fetch record data and recent history, then skip knowledge base search. Without that last step, its output lacked the grounded tone that made the more expensive models feel polished.

Here is the key detail: the same model called the knowledge base extensively on the advisory workflow (4-5 tool calls including 2 KB searches per run). This was not a capability limit. It was a prompt problem.

The original prompt said:

Use these to gather information before suggesting: [tool list]. Gather what you need, then return ONLY the JSON array.

"Gather what you need" gave the model permission to decide KB search was optional. It made a plausible optimization (for a quick response, why bother searching the knowledge base?) and quietly traded away quality.

The fix was one line:

You MUST call all three tools before responding. Call all three in your first turn.

Result: the model now calls all three tools on every run. Quality jumped from 10/15 to 13-14/15. Latency rose by roughly 2 seconds (from 18s to 20s). Cost per call stayed at $0.001.

That one-line change made this budget model viable as the primary for all four workflows, dropping per-user monthly cost from $18.13 to $2.60. Engineering effort: about five minutes. At 1,000 users, that prompt fix is worth $186,000 per year. (We wrote a full article on this finding.)

The broader lesson: when a model underperforms on tool calling, check the prompt before replacing the model. A permissive prompt ("gather what you need") lets models make optimization decisions that trade quality for speed. A directive prompt ("you MUST call all three tools") removes that wiggle room. The model was always capable; it just needed sharper instructions.

This changes how you should evaluate models in benchmarks. If you test a model with a permissive prompt and it scores poorly, you may be measuring prompt quality rather than model quality. Test with directive prompts first, then relax them only if the extra tool calls add latency without improving output.
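
That advice can also be enforced at runtime: verify the tool calls after each run and re-prompt with the directive when one was skipped. This is a sketch, not the article's implementation; `generate` stands in for whatever client wrapper you use:

```python
def call_with_tool_guard(generate, required_tools, max_retries=1):
    """Run the model, check that every required tool was actually called,
    and retry with the directive prompt if any were skipped.
    `generate` is a stand-in for a model client: it returns
    (output, tools_called) and accepts a `directive` flag that switches
    the permissive prompt for the "you MUST call all three tools" version."""
    output, tools_called = generate(directive=False)
    for _ in range(max_retries):
        if required_tools <= set(tools_called):
            break
        output, tools_called = generate(directive=True)
    return output

# Toy client: skips kb_search unless the directive prompt is used
def toy_generate(directive=False):
    if directive:
        return "grounded answer", ["record_lookup", "recent_history", "kb_search"]
    return "generic answer", ["record_lookup", "recent_history"]

result = call_with_tool_guard(
    toy_generate, {"record_lookup", "recent_history", "kb_search"}
)
# result == "grounded answer": the guard re-prompted with the directive
```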

The Routing Table

Based on the benchmark, here is the routing configuration we deployed:

High-Frequency Response Generation (Primary)

50 calls/day: context-aware response options
Model: Budget-fast ($0.20/$0.50 per 1M) · Avg cost: $0.001/call · Avg latency: 20s · Monthly: $1.46

Deep Advisory Analysis (Primary)

10 calls/day: strategic multi-section analysis
Model: Budget-fast ($0.20/$0.50 per 1M) · Avg cost: $0.002/call · Avg latency: 29s · Monthly: $0.56

Structured Data Extraction (Primary)

5 calls/day: record profiling, classification, status
Model: Budget-fast ($0.20/$0.50 per 1M) · Avg cost: $0.003/call · Avg latency: 34s · Monthly: $0.49

Text Summarization (Primary)

5 calls/day: synthesis paragraphs from source records
Model: Budget-fast ($0.20/$0.50 per 1M) · Avg cost: $0.001/call · Avg latency: 11s · Monthly: $0.09

Total per-user monthly cost with routing: $2.60

Yes, a single budget model handles all four workflows. That might look like it contradicts the whole premise of routing. It does not. The point of routing is not that you will always use different models. The point is that the methodology tells you when you can get away with one and when you cannot. In our case, the prompt fix in Finding 6 made the budget model viable everywhere. Without that fix, we would need a mid-tier model for response generation and the routing table would look different.

At frontier-only pricing, the same workload costs $166 per user per month. At 1,000 users, that is $166,000/month vs $2,600/month. The routing table saves $1.96 million per year at that scale.
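
The savings math checks out from the routing table's own figures. A quick sanity check, using the per-workflow monthly costs listed above:

```python
# Per-user monthly cost by workflow, from the deployed routing table
routed_monthly = {
    "response_generation": 1.46,
    "advisory_analysis":   0.56,
    "data_extraction":     0.49,
    "summarization":       0.09,
}
routed = sum(routed_monthly.values())                  # $2.60/user/month
frontier_only = 166.00                                 # same workload, frontier everywhere
annual_savings = (frontier_only - routed) * 1000 * 12  # ~$1.96M/year at 1,000 users
```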

If you want the absolute best output quality on your highest-frequency workflow, you can swap in a mid-tier model ($0.004/call) for that one route. That raises the per-user monthly total to $7.28 (still 23x cheaper than frontier) while getting a perfect 15/15 on the workflow that runs 50 times per user per day.

Self-Hosted as a Tier

We tested self-hosted models (Qwen 3.5 on NVIDIA GPUs) as a potential near-zero-marginal-cost tier. The results were mixed.

The larger self-hosted model (27B parameters on an RTX 5090) produced surprisingly competitive output. On response generation, it scored 13/15, better than several API models that cost $0.01+ per call. On advisory analysis, it scored 10/12, competitive with premium API models. Raw quality was not the problem.

The problem was speed. The 27B model averaged 112s for response generation and 111s for advisory analysis. That is 6-7x slower than equivalent API models. For a real-time workflow, that is a nonstarter.

The smaller self-hosted model (9B parameters on a consumer GPU) was faster but produced significantly lower quality: 7/15 on response generation with generic, poorly personalized output.

Break-even calculation

At cloud GPU pricing (~$0.80/hr for a 5090-class instance), self-hosting costs ~$576/month for 24/7 availability. Our per-user API bill with optimized routing is $2.60/month. For a single user, self-hosting would need to handle 220x current volume to break even. But at 100 users ($260/month API total), break-even arrives at roughly 2x current volume per user. At spot pricing ($0.30/hr = $216/month), break-even arrives even sooner.
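
The break-even figures above follow from one ratio. A sketch, using the article's numbers:

```python
def breakeven_multiple(gpu_monthly, api_monthly):
    # How many multiples of current volume self-hosting must absorb
    # before a dedicated GPU beats the API bill
    return gpu_monthly / api_monthly

one_user   = breakeven_multiple(576, 2.60)   # ~221x current volume
hundred    = breakeven_multiple(576, 260)    # ~2.2x
spot_price = breakeven_multiple(216, 260)    # <1x: spot already breaks even
```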

Self-hosting makes sense for three specific scenarios:

  • Privacy requirements that prohibit sending data to third-party APIs
  • Batch processing where latency is irrelevant and you can saturate the GPU
  • Development/testing where you want unlimited calls without API costs

For production real-time workloads at our volume, API wins on both economics and speed.

How to Build Your Own Routing Table

The specific models and prices in our routing table will age quickly. The methodology will not. Here is the framework:

1. Inventory your workflows.

List every distinct AI workflow in your system. For each one, record daily call volume, latency requirements, output format, and required tools or context. The volume numbers matter most because they tell you which workflows dominate the bill.

2. Build scenario tests.

For each workflow, create 3-5 representative test scenarios that cover your range of difficulty. Include at least one easy case, one hard case, and one edge case. Use real production data if you can. Synthetic benchmarks will lie to you with a straight face.

3. Run every model through every scenario.

Yes, every model. The results will surprise you. Budget models will beat premium ones on specific tasks. Premium models will stumble on tasks you expected to be easy. You will not know any of that until you test it.

4. Score outputs on domain-specific rubrics, not generic benchmarks.

Public benchmarks measure broad capability. Your routing decisions need performance on your specific tasks with your specific data. Build rubrics that reflect what "quality" means for each workflow. Score actual outputs, not token counts or latency.

5. Compute value scores, not just quality scores.

Use the formula: value = quality / (cost × frequency_weight). A model that scores 12/15 at $0.002 has a higher value than one that scores 15/15 at $0.046, especially for a workflow that runs 50 times a day. Let the math drive the routing, not your hunches about which model "should" win.

6. Track reliability separately from quality.

A model with a 25% failure rate is not a production model, regardless of its quality scores. Set a reliability threshold (we use 95%) and disqualify any model that misses it for that workflow. Reliability data needs larger sample sizes than quality scoring, so plan for at least 20 runs per model per workflow.

7. Check tool-calling behavior.

If your workflows involve tool calling, read the actual tool call logs. Models that skip tools will produce confidently wrong output that looks fine on the surface. The rubric will catch this only if you are reading the outputs yourself, not just checking whether the format is valid.

8. Re-benchmark quarterly.

Models improve. Pricing changes. New models launch. Your routing table is a snapshot, not a permanent decision. Schedule quarterly re-benchmarks with the same scenarios and rubrics. The investment is one afternoon of compute. The payoff is confidence that your routing still makes sense.
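
The whole selection step fits in a few lines once the benchmark data exists. A sketch combining steps 5 and 6, with illustrative field names rather than the article's actual harness schema:

```python
def build_routing_table(benchmarks, min_reliability=0.95):
    """Pick the highest-value eligible model for each workflow.
    `benchmarks` maps workflow -> list of per-model result dicts."""
    table = {}
    for workflow, results in benchmarks.items():
        # Step 6: reliability gate first; quality never rescues a flaky model
        eligible = [r for r in results if r["success_rate"] >= min_reliability]
        # Step 5: rank survivors by value = quality / (cost x frequency)
        best = max(
            eligible,
            key=lambda r: r["quality"] / (r["cost_per_call"] * r["frequency_weight"]),
        )
        table[workflow] = best["model"]
    return table

benchmarks = {
    "response_generation": [
        {"model": "frontier", "quality": 14, "cost_per_call": 0.046,
         "frequency_weight": 50, "success_rate": 0.75},
        {"model": "budget-fast", "quality": 13, "cost_per_call": 0.001,
         "frequency_weight": 50, "success_rate": 1.00},
    ],
}
# build_routing_table(benchmarks) picks "budget-fast": the frontier model
# is disqualified by the reliability gate before value is even compared
```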


The Bottom Line

Model routing is not premature optimization. It is the difference between $166 and $2.60 per user per month for the same or better output quality. At 1,000 users, that is $1.96 million per year. At any meaningful scale, naive routing is one of the most expensive technical debts an AI team can carry.

The work is not glamorous. It is 440 benchmark tests, a pile of output reviews, a scoring rubric, and some arithmetic. But it gives you a routing table backed by data instead of vibes, and that routing table pays for itself in the first week.

Four things we did not expect going in:

  1. A $0.001 budget model coming within 7% of the $0.046 frontier model after a one-line prompt fix
  2. The most expensive model being the least reliable (25% failure rate)
  3. Reasoning mode providing zero benefit for tool-calling workflows
  4. Prompt engineering mattering more than model selection for tool-calling quality

The last point bears repeating. We spent the initial analysis comparing 22 models and building a routing table. Then we spent five minutes tweaking one prompt and cut per-user cost by 86% with comparable quality. The prompt fix moved the needle more than the model swap. At enterprise scale, those five minutes of prompt work are worth more than most quarterly optimization projects.

We would not have known any of this without running the benchmark. Neither will you. The routing table is not a guess. It is an engineering artifact built from evidence. The methodology ages slowly even as models and pricing change fast.