We tested reasoning mode across 8 model configurations and 160 dedicated runs. It improved zero of our four workflows. One provider silently disables it when tools are present. Another doubled response time for no quality gain. At enterprise scale, leaving reasoning on by default burns real money on nothing.

Reasoning mode is the shiny add-on in the AI provider menu. Flip it on and the model is supposed to pause, think harder, and come back with something sharper. More internal chain-of-thought. More compute. More cost. The pitch is irresistible: buy a little extra thinking, get a little extra quality.

Then we ran the benchmark. Within a larger 440-run study across 22 model configurations, we carved out 160 runs specifically comparing reasoning variants against their base models on the same inputs. Reasoning mode failed to move the quality needle on any of our four tool-calling workflows. Best case, it changed nothing. Worst case, it torched response time. And one provider's reasoning parameter is silently ignored when tool definitions are present, meaning you pay for a feature that never activates.

For a team of 500 users running tool-calling workflows, leaving reasoning enabled on a provider that doubles latency means burning an extra 18 seconds per call across 25,000 daily requests. That is 125 hours of wasted compute per day for identical output. This article breaks down what we found, why it happens, and how to decide whether reasoning mode deserves a place in your stack.

TL;DR

  • 160 benchmark runs across 8 reasoning variants: zero quality improvement on any of four tool-calling workflows
  • OpenAI silently ignores the reasoning_effort parameter when tools are present in the Chat Completions API, with no error or warning
  • xAI's reasoning mode doubled latency (18s to 37s) while producing identical output
  • Tool-calling workflows already have reasoning built in; each tool call is a grounded deliberation step, making extra thinking tokens redundant
  • Default reasoning to off for any workflow that uses tools, and only enable it if a structured benchmark proves a real quality gain

What Reasoning Mode Is Supposed to Do

Reasoning mode, sometimes labeled "thinking mode" or "extended thinking," spends extra compute at inference time. Instead of answering immediately, the model burns internal reasoning tokens to work through the task before it commits to an output. You never see those tokens. You still pay for the time and, depending on the provider, the usage.

The premise is not absurd. Step-by-step reasoning helps on the kinds of problems that punish sloppy jumps: math, logic, dense constraint satisfaction, code that needs planning before execution. The real question is narrower and much more practical: does reasoning help on your workload, inside your workflow design?

In our benchmark, the answer was a flat no.

What We Tested

This test was part of a larger 440-run benchmark across 22 model configurations from four providers: OpenAI, Google (Gemini), xAI (Grok), and Anthropic (Claude). Of those, 160 runs were dedicated reasoning-variant comparisons. Anthropic's Claude does not expose a reasoning toggle on its standard API, so it served as the baseline and evaluator model rather than a reasoning test subject. The three providers below all offer explicit reasoning controls. We tested four workflow types:

  • Response generation: the highest-frequency workflow, generating contextual responses using retrieved data and tool calls
  • Advisory analysis: a complex analytical workflow requiring multi-source synthesis, strategic reasoning, and nuanced recommendations
  • Structured extraction: pulling structured data from unstructured input with strict schema compliance
  • Text summarization: synthesizing information into coherent, well-organized prose

All four workflows use tool calling. The models can retrieve profile data, inspect interaction history, and search a knowledge base. That detail matters because the entire story turns on the collision between reasoning mode and tool use.

For each reasoning variant, we ran the same inputs as the base model, measured cost and latency, and scored quality with a structured rubric. Each configuration ran five identical scenarios, which is a modest sample size. But the results were not close calls. The deltas were either zero (OpenAI), negligible (Google), or large penalties with no upside (xAI). When effects are this consistent across every scenario, more runs would not change the conclusion.
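
For illustration, the per-configuration deltas can be computed from paired run records like this. The record fields (cost_usd, latency_s, quality) are hypothetical names, not our benchmark's actual schema:

```python
from statistics import mean

def deltas(base_runs, variant_runs):
    """Multiplicative cost/speed deltas and an additive quality delta
    between a base model and its reasoning variant on paired runs.
    Each run is a dict with cost_usd, latency_s, and quality keys."""
    return {
        "cost_delta": mean(r["cost_usd"] for r in variant_runs)
                      / mean(r["cost_usd"] for r in base_runs),
        "speed_delta": mean(r["latency_s"] for r in variant_runs)
                       / mean(r["latency_s"] for r in base_runs),
        "quality_delta": mean(r["quality"] for r in variant_runs)
                         - mean(r["quality"] for r in base_runs),
    }

# Example shaped like the premium-tier xAI result: slightly cheaper,
# roughly twice as slow, identical rubric quality.
base = [{"cost_usd": 0.010, "latency_s": 18.0, "quality": 4.0}] * 5
variant = [{"cost_usd": 0.009, "latency_s": 37.0, "quality": 4.0}] * 5
d = deltas(base, variant)
```

A speed delta near 2.0x with a quality delta of exactly zero is the pattern that recurs throughout the findings below.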

Finding 1: OpenAI's Reasoning Is Silently Disabled with Tools

This was the headline finding, and the one that should make any team using OpenAI with tools stop and check their assumptions.

OpenAI's Chat Completions API accepts a reasoning_effort parameter (low, medium, high) that is supposed to control how deeply the model reasons before answering. We tested all three levels across our workflows. The outputs came back looking like photocopies of the base model: same cost, same speed, same quality.

Workflow                Cost Delta   Speed Delta   Quality Impact
Response generation     0.98x        0.96x         Identical
Advisory analysis       1.09x        1.11x         Identical
Structured extraction   0.96x        0.95x         Identical

Those deltas are noise. Cost wiggles by a few percent. Speed wiggles by a few percent. Quality does not budge. That is not what a feature looks like when it is working. That is what a parameter looks like when it disappears into the floorboards.

When tool definitions are present in the Chat Completions API request, the reasoning_effort parameter is silently ignored. No error. No warning. No "reduced capability" flag in the response. The extra reasoning tokens never show up in the usage data. The request accepts your setting and then behaves like the plain base model. We confirmed this across 60 runs at all three reasoning levels (low, medium, high), and the outputs were statistically indistinguishable from the base model every time.

An important caveat: we tested through OpenAI's Chat Completions API, which is the endpoint most production systems use. OpenAI's newer Responses API may handle reasoning differently with tools. If your system uses the Responses API, run your own comparison before drawing conclusions. But if you are on Chat Completions with tools, the evidence is clear: reasoning effort is accepted and ignored.

If you are paying for reasoning on tool-calling workflows through OpenAI's Chat Completions API, you are paying for a feature that does not activate.

This is a silent failure mode, which makes it dangerous. Monitoring will not catch it because nothing throws. Cost dashboards will not catch it because spend stays flat, which is itself evidence that the reasoning path never kicked in. The only way to detect it is the boring way: run a controlled comparison and check whether the outputs differ at all. In our 60 runs, they did not.
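
One low-tech way to run that check is to diff the usage accounting between paired runs. The sketch below assumes the shape OpenAI reports for reasoning-capable models (usage.completion_tokens_details.reasoning_tokens); verify the exact field name against your SDK version before relying on it:

```python
def reasoning_activated(usage: dict) -> bool:
    """True if a response's usage block shows any reasoning tokens.

    Assumes OpenAI's reported shape for reasoning-capable models:
    usage.completion_tokens_details.reasoning_tokens. Treat the field
    name as an assumption and verify against your SDK version."""
    details = usage.get("completion_tokens_details") or {}
    return details.get("reasoning_tokens", 0) > 0

# Silent-disable signature: reasoning_effort was set, tools were present,
# and the response still reports zero reasoning tokens.
silent = {"completion_tokens": 212,
          "completion_tokens_details": {"reasoning_tokens": 0}}
active = {"completion_tokens": 212,
          "completion_tokens_details": {"reasoning_tokens": 640}}
```

Run this over a handful of paired requests with and without tool definitions; if the reasoning-token count is zero whenever tools are present, you are seeing the behavior described above.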

Finding 2: xAI's Reasoning Doubles Latency for Nothing

If OpenAI's failure mode is silent, xAI's is impossible to miss. Grok really does spend extra time "thinking." You can watch the latency stack up. What you cannot find is the payoff.

Model                 Workflow              Cost Delta   Speed Delta   Quality Impact
Grok (fast tier)      Response generation   1.05x        1.62x         None
Grok (fast tier)      Advisory analysis     1.05x        1.35x         None
Grok (premium tier)   Response generation   0.88x        2.04x         None
Grok (premium tier)   Advisory analysis     1.23x        1.43x         None

The ugliest result was premium-tier response generation with reasoning enabled: a 2.04x latency penalty. Wall-clock time ballooned from 18 seconds to 37 seconds. Twice the wait. Same answer quality. The model took the scenic route and still arrived at the same place.

On the fast tier, reasoning added 35 to 62% more latency. On the premium tier, it added 43 to 104%. Cost moved around noisily enough that the bigger issue is obvious: time. If your application is user-facing, this is how you make it feel sluggish for no measurable gain.

At enterprise scale, the waste compounds fast. A team of 200 users running 50 response-generation calls per day makes 10,000 calls daily; at a 2x penalty on an 18-second call, each one adds roughly 18 extra seconds of waiting. That is 180,000 extra wait-seconds, or 50 hours of user time, wasted every day on reasoning tokens that produce identical output. Over a year, that is more than 18,000 hours of productivity lost to a toggle that should have been off.

Finding 3: Google's Reasoning, the Closest Thing to a Win, Is Marginal

Google's Gemini models were the only place where reasoning mode showed a pulse at all. Even then, the pulse was faint.

Model          Workflow              Cost Delta   Speed Delta   Quality Impact
Gemini Flash   Response generation   0.99x        1.09x         Negligible
Gemini Flash   Advisory analysis     1.12x        1.16x         Negligible
Gemini Pro     Advisory analysis     0.82x        1.11x         Slight

On the cheapest tier, Flash Lite (not shown in the table above), reasoning slightly improved structural compliance: outputs were a bit more likely to follow the expected format and include the required sections. On Flash and Pro, the effect ranged from negligible to slight. Cost barely moved. Latency rose 9-16%.

If any provider came closest to a case for reasoning on tool-calling workflows, it was Google. But "slightly cleaner structure for a noticeable speed hit" is not much of a sales pitch. If a model keeps drifting from your schema, the better fix is tighter prompting or harder schema enforcement, not paid contemplation.

Why Reasoning Doesn't Help Tool-Calling Workflows

Once you stare at the numbers long enough, the explanation starts to feel obvious.

Tool-calling workflows already come with reasoning baked into the architecture. A model with tool access has to analyze the request, decide which tools to invoke, form the tool-call arguments, process the results, and then synthesize a final answer. That is not a single leap. It is a chain of explicit decisions. Each tool call is already a reasoning step with real data attached.

Now compare that with what reasoning mode promises: step-by-step decomposition before the answer. Tool-calling workflows already do that, except the steps are grounded. The model does not need to imagine what a profile might contain if it can fetch the profile. It does not need to speculate about the knowledge base if it can search it. The architecture has already given the model the scaffolding that reasoning mode is trying to sell.

This is the core insight: for tool-calling workflows, the tools are the reasoning structure. Adding explicit reasoning tokens on top of that is usually duplication, not improvement. You get a second layer of deliberation stacked on top of a system that already deliberates in a structured, grounded way. More tokens. More waiting. Same outcome.

Reasoning mode solves a problem tool calling already solved. It is extra ceremony around an architecture that already knows how to think in steps.
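
To make the duplication concrete, here is a toy sketch of a tool loop as a sequence of grounded deliberation steps. The stub tools and the hardcoded plan stand in for a real model and provider SDK; the point is the shape of the loop, not the calls:

```python
# Toy sketch: stub tools and a hardcoded plan stand in for a real model
# and provider SDK. Each iteration is one grounded deliberation step.

def fetch_profile(user_id):
    # Grounded step: the model reads real data instead of imagining it.
    return {"user_id": user_id, "plan": "enterprise"}

def search_kb(query):
    # Grounded step: the model searches instead of speculating.
    return ["Reasoning toggles can silently no-op when tools are present."]

TOOLS = {"fetch_profile": fetch_profile, "search_kb": search_kb}

def run_workflow(request):
    """Pick a tool, call it, fold the real result back into context.
    The loop itself is the step-by-step decomposition that reasoning
    mode promises, except every step is grounded in fetched data."""
    context = [{"request": request}]
    plan = [("fetch_profile", {"user_id": "u1"}),
            ("search_kb", {"query": request})]  # a real model chooses these
    for name, args in plan:
        context.append({name: TOOLS[name](**args)})
    return context  # the final answer is synthesized from this context

trace = run_workflow("should we enable reasoning mode?")
```

Stacking internal reasoning tokens on top of this loop adds a second, ungrounded layer of deliberation to a system that already deliberates with real data at every step.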

The Narrow Window Where Reasoning Earns Its Keep

This is not a blanket indictment. Reasoning mode has a legitimate use case, but it is much narrower than the marketing suggests. The pattern is specific: reasoning helps when the model has no other structure for deliberation and must solve the entire problem in a single internal pass.

That means pure mathematical reasoning, complex code generation where the model needs to plan before committing, or dense text analysis with no tools to ground the work. In all of those cases, the model cannot fetch data, call tools, or break the problem into grounded steps. Reasoning tokens buy it time to think before it answers, and that thinking has nowhere else to happen.

But most production AI systems do not look like that. They use tools, retrieval, schemas, and multi-step pipelines. Those architectural elements already provide the deliberation structure that reasoning mode is trying to sell. If your system has tools, your tools are the reasoning. Adding a second layer of internal deliberation on top is paying twice for the same work.

The Decision Framework: Should You Enable Reasoning?

Based on what we found, here is the practical filter for deciding whether reasoning mode belongs on your workloads.

1. Identify whether your workflow uses tool calling.

If your requests include tool definitions, reasoning mode is already on shaky ground. The tools are providing the step-by-step structure that reasoning tokens are supposed to add. Start with reasoning off. Turn it on only if you can prove a real quality gap that better prompting cannot close.

2. Check whether reasoning actually runs on your provider.

If you are using OpenAI's Chat Completions API with tools, verify that reasoning actually activates. Compare token usage and behavior between reasoning and non-reasoning runs. If the counts are basically identical, the feature is not doing anything worth paying for.

3. Measure the latency impact before committing.

Run the same inputs through the same model with reasoning on and off. If latency climbs more than 20% and quality stays flat, cut it. In our tests, xAI added 35-104% more latency. Google added 9-16%. OpenAI with tools looked like 0%, which is consistent with reasoning not running at all.
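
A minimal harness for that on/off comparison might look like the following. The 20% ceiling and the sample latencies are illustrative (shaped like our premium-tier result), not the raw benchmark measurements:

```python
from statistics import median

def latency_verdict(base_s, reasoning_s, quality_gain, max_penalty=1.20):
    """Step 3 as code: keep reasoning only if quality measurably improves
    or the median latency ratio stays under the penalty ceiling."""
    ratio = median(reasoning_s) / median(base_s)
    keep = quality_gain > 0 or ratio <= max_penalty
    return ratio, keep

# Illustrative samples shaped like the 18s -> 37s premium-tier result
# with flat quality; substitute your own measured runs.
ratio, keep = latency_verdict([18.1, 17.8, 18.3],
                              [36.9, 37.2, 37.0],
                              quality_gain=0.0)
```

With a ratio over 2x and zero quality gain, the verdict is to cut it, which is exactly the decision the xAI numbers forced.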

4. Evaluate quality with a structured rubric, not vibes.

Do not skim three outputs and decide they "feel smarter." Use a rubric with concrete dimensions like accuracy, completeness, formatting, and relevance. In our benchmark, some reasoning outputs looked more serious because they arrived with extra verbal padding. The scores stayed the same.
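
A rubric can be as simple as a weighted score over fixed dimensions. The weights below are illustrative, not the ones from our benchmark:

```python
# Illustrative weights; our benchmark's actual rubric is not reproduced here.
RUBRIC = {"accuracy": 0.35, "completeness": 0.25,
          "formatting": 0.20, "relevance": 0.20}

def rubric_score(ratings: dict) -> float:
    """Weighted score on a 1-5 scale; every dimension must be rated."""
    assert set(ratings) == set(RUBRIC), "rate every dimension, no vibes"
    return sum(RUBRIC[d] * ratings[d] for d in RUBRIC)

base = rubric_score({"accuracy": 4, "completeness": 4,
                     "formatting": 3, "relevance": 4})
# A wordier reasoning-mode output that "feels smarter" but rates the same:
reasoning = rubric_score({"accuracy": 4, "completeness": 4,
                          "formatting": 3, "relevance": 4})
```

Extra verbal padding does not move any dimension, so identical ratings produce identical scores, which is what our reasoning-variant outputs looked like under the rubric.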

5. Consider the alternative: spend the budget on better prompts.

If reasoning mode makes your system slower without clearly raising quality, spend the effort elsewhere. Better prompts, cleaner tool design, and stricter schema enforcement usually produce improvements that are easier to measure and harder for a provider toggle to take away.


The Bottom Line

Reasoning mode is not useless. It is just narrower than the marketing suggests. When a model needs room to work through a hard problem and has no better structure available, it can help.

But on the workloads that dominate production AI systems (tool calling, retrieval-heavy response generation, structured extraction, agent pipelines), reasoning mode mostly looks like overhead. The workflow already supplies the structure. Adding more internal deliberation on top ranges from redundant to actively harmful.

The ugliest part is not mediocre performance. It is silent failure. If a major provider can accept a reasoning parameter, give you no warning, and then quietly skip the feature when tools are present, you do not have a tuning knob. You have a trap.

Do not enable reasoning mode by default. Treat it like any other expensive performance setting: benchmark it on your actual workload, compare the quality delta against the latency and cost hit, and keep it off unless the evidence is unambiguous. For tool-calling workflows specifically, the default should always be off.

The 160 runs we spent on reasoning variants yielded one clear operational decision: turn it off for all of our tool-calling workflows and keep the speed. For an organization of 500 users making 25,000 tool-calling requests a day, that decision reclaims the 125 hours of daily wasted compute from the opening example, more than 45,000 hours per year. That is not an optimization. That is stopping a waste stream that was invisible until someone ran the comparison.