// field note 80

The 93% Problem: Why Your AI Agent Is Wasting 9 Out of 10 Thinking Steps and Nobody Can Fix It

Uber's AI budget burned out in 4 months. A new arXiv paper finds that 61% to 93% of LLM reasoning steps are redundant and argues the waste is structural. Here's why you may be overpaying by 10x or more.

[Figure: AI reasoning waste visualization. Caption: Nine out of ten reasoning steps produce nothing useful. The math is provable.]

Uber spent $3.4 billion on research and development in 2025, 9% more than the prior year, and by April 2026, the company had already burned through its annual AI budget.

When your ride-hailing app exhausts its AI allocation four months into the fiscal year, someone at the top starts asking hard questions.

That someone was Andrew Macdonald, Uber president and chief operating officer, who told the Rapid Response podcast what few Fortune 500 executives have been willing to say out loud: the connection between token consumption and useful features shipped is broken.

"That link is not there yet," Macdonald said. "I think maybe implicitly there is more that is getting shipped, but it is very hard to draw a line between one of those stats and, okay, now we are actually producing 25 percent more useful consumer features."

He went further: "We are going to have to start talking about token consumption and the associated cost versus headcount."

The translation is blunt: Uber is paying for AI reasoning by the token, hiring fewer humans to fund it, and cannot prove the trade delivers value.

CEO Dara Khosrowshahi confirmed the headcount side of the equation earlier in May, stating the company is making up for AI investments by hiring fewer employees. Macdonald's admission is the first major public acknowledgment from a COO-level executive that AI return on investment may be structurally broken, not merely slow to materialize.

The mystery everyone at Uber is now facing is simple: where do the tokens go?

The answer, published the same day Macdonald gave that interview, is that most of them go nowhere.

THE PAPER: 93% WASTE, PROVED STRUCTURAL

On May 26, 2026, a paper dropped on arXiv that explains exactly why the link Macdonald is searching for does not exist.

The paper is arXiv 2605.23926, titled "How Much Thinking Is Enough? Quantifying and Understanding Redundancy in LLM Reasoning," authored by Zhiyuan Zhai, Xinkai You, Wenjing Yan, and Xin Wang.

It is not an opinion piece. It pairs benchmark measurements with a formal proof.

The authors measured redundancy as the largest fraction of chain-of-thought steps that can be truncated from a correct reasoning trace while the model, forced to stop thinking and emit a final answer, still produces the correct result.
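The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: `answer_from_prefix` is a hypothetical callable standing in for "truncate the trace after k steps and force the model to answer," and the sketch assumes the critical prefix is the shortest prefix (of at least one step) that yields the correct answer. The toy trace and toy model at the bottom are invented for the example.

```python
from typing import Callable, Sequence


def critical_prefix_length(
    steps: Sequence[str],
    answer_from_prefix: Callable[[Sequence[str]], str],
    correct_answer: str,
) -> int:
    """Return the smallest k >= 1 such that answering from steps[:k] is correct.

    `answer_from_prefix` is a hypothetical stand-in for a forced-answer call
    to the model; it is not an API from the paper or from any library.
    """
    for k in range(1, len(steps) + 1):
        if answer_from_prefix(steps[:k]) == correct_answer:
            return k
    raise ValueError("no prefix of the trace yields the correct answer")


def redundancy(
    steps: Sequence[str],
    answer_from_prefix: Callable[[Sequence[str]], str],
    correct_answer: str,
) -> float:
    """Fraction of steps that can be cut from the end of a correct trace."""
    k = critical_prefix_length(steps, answer_from_prefix, correct_answer)
    return 1.0 - k / len(steps)


# Toy stand-in for the model: it "knows" the answer once a step contains "x = 4".
def toy_answer(prefix: Sequence[str]) -> str:
    return "4" if any("x = 4" in step for step in prefix) else "unknown"


trace = [
    "2x = 8, so x = 4",
    "check: 2 * 4 = 8",
    "re-check the division",
    "wait, verify once more",
    "final answer: 4",
]

print(critical_prefix_length(trace, toy_answer, "4"))
print(redundancy(trace, toy_answer, "4"))
```

In the toy trace only the first of five steps is needed, so four fifths of the trace count as redundant under this definition.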

Across four frontier reasoning models and two mathematical benchmarks, they found step-level redundancy consistently between 61% and 93%. In six of the eight model-and-benchmark conditions they studied, the median critical prefix was a single segmented step.

One step mattered. The rest was computational noise.

Even on the hardest Level-5 problems from the MATH-500 benchmark, redundancy remained between 46% and 85%.

The models are not occasionally verbose. They are structurally, mathematically, provably wasteful. Here is the finding that changes how we should think about reasoning models.

The authors prove this redundancy is a structural consequence of length-agnostic outcome rewards. When a reasoning model is trained with reinforcement learning that only rewards getting the right answer, with no penalty for how many tokens it consumes to get there, there is no finite expected stopping time that is optimal.

The model has no incentive to stop thinking. It will keep emitting reasoning tokens until the probability distribution over answers stabilizes, which is almost always long after the actual inferential work is done.

According to the authors, the result holds regardless of the RL algorithm used, the base model architecture, the data distribution, or whether the policy is obtained via RL or distillation. Overthinking is not a bug to patch in the next version. It is baked into the training paradigm itself.

THE COST MODEL: WHAT 93% WASTE MEANS AT SCALE

Math is clean. Infrastructure bills are not.

Let's translate the findings into money. If your production AI agent burns $1,000 per day in reasoning tokens, the paper's 61% to 93% redundancy range implies that somewhere between $610 and $930 of that spend is structurally unnecessary.
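The arithmetic behind those dollar figures is short enough to check by hand or in code. The sketch below assumes, as this article does, that spend scales in proportion to reasoning steps and that the step-level redundancy measured on math benchmarks carries over to a production workload; neither assumption comes from the paper itself. The function name `wasted_spend` is made up for the illustration.

```python
def wasted_spend(spend: float, redundancy: float) -> float:
    """Portion of `spend` attributable to redundant reasoning.

    Assumes cost is proportional to reasoning steps, which is an
    assumption of this sketch rather than a result from the paper.
    """
    if not 0.0 <= redundancy <= 1.0:
        raise ValueError("redundancy must be between 0 and 1")
    return spend * redundancy


DAILY_SPEND = 1_000.0  # dollars per day, the figure used in the text

for rate in (0.61, 0.93):  # low and high ends of the reported range
    wasted = wasted_spend(DAILY_SPEND, rate)
    useful = DAILY_SPEND - wasted
    print(f"{rate:.0%} redundant: ${wasted:,.0f} wasted, ${useful:,.0f} useful per day")
```

Swapping in your own daily spend and a redundancy rate measured on your own traces would give a more defensible number than the benchmark range.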

You are not overpaying because the vendor prices are high. You are overpaying because the underlying model has no concept of brevity.

Multiply that across every autonomous coding agent deployed by a mid-size software team, every customer service bot handling thousands of sessions per hour, and every research agent crawling and summarizing enterprise documents.

At the scale a company like Uber operates, a structural 61% to 93% waste rate is not a rounding error.

It is a budget crater.

[Figure: LLM reasoning redundancy across benchmark conditions. Caption: Across every frontier model tested, redundancy holds steady between 61% and 93%.]

The energy implications are equally concrete. Every redundant reasoning token passes through a GPU cluster, consuming electricity and generating heat that requires cooling. Data center water usage for AI cooling already draws scrutiny from regulators in drought-prone regions.

A model that thinks roughly 2.6 to 14 times longer than necessary (the multiplier implied by 61% to 93% redundancy) is not just expensive. It carries a carbon and water footprint that could fairly be called reckless, given that the excess is mathematically provable and not functionally useful.

A companion paper on arXiv, 2605.23929, offers a partial mathematical remedy. Yang and Zhu introduce a water-filling token allocation policy that optimally distributes reasoning tokens across workflow stages to minimize latency-reliability-cost tradeoffs.

It is a genuine optimization framework with formal proofs. The catch: it assumes you are designing the workflow from scratch, not retrofitting an existing agent stack.
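For readers who have not met water-filling before, here is the textbook version of the idea in Python. It is not the Yang and Zhu policy, whose objective and constraints are not reproduced in this article. It is the classic form: pour a fixed budget over several stages so that every stage that receives anything ends at the same "water level." The `floors` values are hypothetical and stand for how little a stage benefits from extra tokens.

```python
from typing import List


def water_fill(floors: List[float], budget: float) -> List[float]:
    """Textbook water-filling over stages with the given floor levels.

    Stage i receives max(0, level - floors[i]), where the common water
    `level` is chosen so that the allocations sum to `budget`.
    """
    if not floors or budget < 0:
        raise ValueError("need at least one stage and a non-negative budget")
    ordered = sorted(floors)
    level = 0.0
    # Try filling the k lowest stages; shrink k until the level clears the k-th floor.
    for k in range(len(ordered), 0, -1):
        level = (budget + sum(ordered[:k])) / k
        if level >= ordered[k - 1]:
            break
    return [max(0.0, level - floor) for floor in floors]


# Hypothetical three-stage workflow: plan, execute, verify.
allocation = water_fill([1.0, 2.0, 5.0], budget=4.0)
print(allocation)  # [2.5, 1.5, 0.0]
```

The stage with the highest floor gets nothing because the budget runs out before the water level reaches it, which is the behavior that makes water-filling attractive for rationing a scarce resource.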

Most enterprises are in retrofit mode, having already purchased API access and deployed pipelines that the vendor trained with outcome-only rewards. So the optimization layer exists on paper and remains out of reach for production systems already deployed.

WHY IT IS UNPATCHABLE

This is the part that should worry every engineering manager who just got a mandate to deploy reasoning models in production: you cannot fix this by tweaking inference temperature or adding a system prompt that says "be concise."

Those adjustments touch the surface. The problem is in the reward function, three training layers down.

Current frontier reasoning models are trained with outcome-only reinforcement learning. The reward signal is binary: correct answer or incorrect answer. There is no gradient that pushes the model toward shorter reasoning. Shortening the chain of thought does not improve the reward. Lengthening it does not hurt the reward.

So the model rambles, because rambling is free.
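A toy version of that reward makes the point concrete. This is a minimal sketch of a binary outcome reward as described above, not any lab's actual training code; the function name and the token counts are invented for the example.

```python
def outcome_only_reward(is_correct: bool, num_reasoning_tokens: int) -> float:
    """Binary outcome reward: the token count is accepted but never used."""
    return 1.0 if is_correct else 0.0


short_trace = outcome_only_reward(True, 200)
long_trace = outcome_only_reward(True, 20_000)

# A hundredfold difference in length produces no difference in reward.
print(short_trace == long_trace)  # True
```

Because the length argument never enters the return value, nothing in this signal can tell the optimizer to prefer the 200-token trace.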

The proposed fixes are technically sound and commercially absent. Process-based rewards would assign partial credit to intermediate reasoning steps, creating an incentive to reach the answer efficiently. Step-level verification would require human or automated judges to score each reasoning step for relevance and necessity, feeding that signal back into the training loop. Token-budget-aware training would bake a hard computation limit into the optimization objective, much as an embedded toolchain fails a build that exceeds its memory budget.
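As a rough illustration of the last idea, here is one hypothetical shape a token-budget-aware reward could take: a hard cap plus a linear length penalty. The function, the cap, and the penalty rate are all assumptions made for this sketch; they are not taken from the paper or from any vendor's training recipe.

```python
def budget_aware_reward(
    is_correct: bool,
    num_reasoning_tokens: int,
    token_budget: int,
    penalty_per_token: float = 1e-4,
) -> float:
    """Hypothetical reward that makes length matter.

    Over-budget traces earn nothing; correct traces within budget earn
    1.0 minus a linear penalty on tokens used, floored at zero.
    """
    if num_reasoning_tokens > token_budget or not is_correct:
        return 0.0
    return max(0.0, 1.0 - penalty_per_token * num_reasoning_tokens)


concise = budget_aware_reward(True, 200, token_budget=4_000)
verbose = budget_aware_reward(True, 3_000, token_budget=4_000)
over_budget = budget_aware_reward(True, 20_000, token_budget=4_000)

print(concise > verbose > over_budget)  # True
```

Unlike a binary outcome reward, this one ranks a concise correct trace above a verbose correct trace, which is the gradient the current paradigm lacks.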

None of these exist in production frontier models from OpenAI, Anthropic, Google DeepMind, or Meta today. The structural waste is universal because the training paradigm is universal.

Until a major lab ships a reasoning model trained with process-level or budget-aware rewards, the 61% to 93% waste rate will remain.

THE ENTERPRISE RECKONING

The enterprise gap is about to become a chasm.

A report from MIT Technology Review, published May 26, 2026, found that 85% of organizations want to be agentic within three years, but 76% say their current operations and infrastructure cannot support that ambition.

Prasun Shah, global CTO for workforce consulting and chief AI officer at PwC UK Consulting, described what is happening as "embedding AI employees into what is a human operating model," which he compared to "adding sticky tapes to parts of an operating model that is breaking."

That is generous. My view is that organizations are writing budgets for agentic AI based on current token pricing, and current token pricing assumes a level of efficiency that arXiv 2605.23926 just proved does not exist.

The finance team assumes $1 buys one unit of useful reasoning. The math says $1 buys somewhere between $0.07 and $0.39 of useful reasoning, with the rest structurally discarded.
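Those figures follow from simple arithmetic. The working below assumes, as this article does, that spend is proportional to reasoning steps and that a step-level redundancy rate r measured on benchmarks carries over to production workloads:

```latex
\[
\text{useful value per dollar} = 1 - r,
\qquad
\text{overpayment factor} = \frac{1}{1 - r}
\]
\[
r = 0.61:\quad 1 - r = 0.39,\quad \frac{1}{0.39} \approx 2.6
\qquad\qquad
r = 0.93:\quad 1 - r = 0.07,\quad \frac{1}{0.07} \approx 14.3
\]
```

So a dollar buys between 7 and 39 cents of useful reasoning, and the same useful output costs roughly 2.6 to 14 times what it would with no redundancy.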

Uber is the canary. Macdonald's public inability to connect token spend to feature output is not a failure of Uber's analytics team. It is a preview of what every chief technology officer will face within six months when their boards start asking why the AI budget is triple what was modeled.

The answer will be the same everywhere: the model was not designed to stop thinking, and nobody told the finance team.

TAKEAWAY

The AI industry is running on a training paradigm that structurally inflates compute cost by a factor of roughly 2.6 to 14.

That is not rhetorical exaggeration. It is the arithmetic of the 61% to 93% redundancy range measured in an arXiv preprint posted this week, which has not yet been peer reviewed.

Until reward structures change, a "reasoning" model is more accurately a "rambling" model: it emits syllables at your expense because the training algorithm never taught it that brevity has value.

Every CTO currently pricing an agentic rollout should build the 61% to 93% waste rate into their models. It is not a temporary inefficiency. It is the baseline.

And the vendors selling you these tokens are selling syllables by the pound, not insight by the word.
