97.8 percent.
That is the KV cache hit rate Inferoa AI measured during a real agentic coding task. For every 100 prompt tokens the agent processed, only about 2.2 had to be computed from scratch. The rest were served from cache.
Now look at what most engineering teams are doing right now. They are dumping entire conversation histories into context windows and praying. Every tool call, every observation, every previous turn gets stuffed back into the prompt. Token counts explode. Latency climbs. Bills stack up. The naive assumption is that more context equals smarter agents. The data says the opposite.
This is not a minor optimization. This is the single biggest lever on inference economics that nobody is talking about.
The Context Bomb
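Multi-turn agent workflows are destroying inference economics. Every time your agent makes a tool call, the result gets fed back into the next prompt, and every later turn resends it. Prompt size grows with each turn, so the total tokens processed grow roughly with the square of the turn count. Do that twenty times and you can be pushing a million tokens through the model. Do it a hundred times and you are lighting money on fire.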
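A rough sketch of that growth, assuming every turn resends the full history. The sizes used here (a 2,000-token fixed prefix and 5,000 tokens added per tool call) are made-up illustrative numbers, not measurements from any system discussed in this article.

```python
def cumulative_prompt_tokens(turns: int, prefix_tokens: int, tokens_per_turn: int) -> int:
    """Total prompt tokens sent over a run when every turn resends the full history."""
    total = 0
    for turn in range(1, turns + 1):
        # Turn k resends the fixed prefix plus the output of the k-1 earlier turns.
        total += prefix_tokens + (turn - 1) * tokens_per_turn
    return total


# Hypothetical sizes, chosen only for illustration.
PREFIX_TOKENS = 2_000
TOKENS_PER_TURN = 5_000

for n in (20, 100):
    print(n, "turns:", cumulative_prompt_tokens(n, PREFIX_TOKENS, TOKENS_PER_TURN), "prompt tokens")
```

Under these assumptions, five times as many turns costs about twenty-five times as many prompt tokens, which is why the bill surprises people.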
The industry has been running on a bad assumption: keep everything, because context is knowledge. A new paper on arXiv demolishes that idea.
arXiv 2606.10209 ("Less Context, Better Agents") tested GPT-5 configurations on a 50-task enterprise expense itemization benchmark. The results are brutal.
Full conversation history: 71.0% task completion, 1,480,996 tokens consumed, 14.56 hours per run.
Pruning to the last 5 tool call/response pairs: 79.0% completion, 535,274 tokens, 5.39 hours.
Pruning plus automated summarization: 91.6% completion, 553,374 tokens, 5.79 hours.
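The win is not marginal. Compared with full history, pruning plus summarization delivers 20.6 percentage points higher task completion with about 63% fewer tokens and roughly 60% less wall-clock time. Less context produced a better agent. Not just a cheaper one. A better one.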
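The comparison can be reproduced from the three reported configurations. This is plain arithmetic on the figures quoted above; the dictionary keys and the helper name are my own labels, not the paper's.

```python
# Figures as quoted above for the three configurations.
runs = {
    "full_history": {"completion": 71.0, "tokens": 1_480_996, "hours": 14.56},
    "last_5_pairs": {"completion": 79.0, "tokens": 535_274, "hours": 5.39},
    "pruned_plus_summary": {"completion": 91.6, "tokens": 553_374, "hours": 5.79},
}


def compare(baseline: dict, candidate: dict) -> dict:
    """Accuracy gain in points, plus percentage reductions in tokens and time."""
    return {
        "points_gained": round(candidate["completion"] - baseline["completion"], 1),
        "token_reduction_pct": round(100 * (1 - candidate["tokens"] / baseline["tokens"]), 1),
        "time_reduction_pct": round(100 * (1 - candidate["hours"] / baseline["hours"]), 1),
    }


baseline = runs["full_history"]
for name in ("last_5_pairs", "pruned_plus_summary"):
    print(name, compare(baseline, runs[name]))
```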
Stale context poisons agent decisions. The model gets distracted by old observations that no longer matter. It fixates on expired tool outputs. It burns compute on irrelevant history while missing the signal in the current turn. The cost of full retention is not just tokens. It is accuracy.
How the Cache Actually Works
Inferoa did not invent a new model architecture. They used what was already sitting in the serving stack and applied discipline.
The technique relies on vLLM's prefix caching feature combined with a "byte-stable prefix" approach. System prompts, tool schemas, and conversation prefixes are kept byte-identical across agent turns. Identical bytes tokenize to identical tokens, so vLLM finds the same leading token blocks at the start of every prompt and reuses their KV entries instead of recomputing them.
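A minimal sketch of what a byte-stable prefix can look like in practice. This is not Inferoa's code: the system prompt, the tool schema, and every function name here are hypothetical. The idea is to serialize the static parts deterministically and only ever append dynamic content after them.

```python
import hashlib
import json

# Hypothetical static content, for illustration only.
SYSTEM_PROMPT = "You are a coding agent. Use the provided tools."
TOOLS = [
    {"name": "read_file", "parameters": {"path": "string"}},
    {"name": "run_tests", "parameters": {"target": "string"}},
]


def build_prefix(system_prompt: str, tools: list) -> str:
    """Serialize the static part of the prompt deterministically."""
    # sort_keys and fixed separators keep the JSON byte-identical
    # regardless of dict insertion order.
    tools_json = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return f"{system_prompt}\n\nTOOLS:\n{tools_json}\n\n"


def build_prompt(prefix: str, history: list, new_turn: str) -> str:
    """Append-only: anything dynamic goes after the stable prefix."""
    return prefix + "".join(history) + new_turn


def prefix_fingerprint(prefix: str) -> str:
    """Hash of the prefix bytes; log it per turn to catch accidental drift."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


prefix = build_prefix(SYSTEM_PROMPT, TOOLS)
turn_1 = build_prompt(prefix, [], "List the failing tests.\n")
turn_2 = build_prompt(
    prefix,
    ["List the failing tests.\n", "tool: 3 failures\n"],
    "Fix the first one.\n",
)

# Each prompt extends the previous one, so the shared leading part is reusable.
assert turn_2.startswith(turn_1)
print(prefix_fingerprint(prefix))
```

Logging the fingerprint on every turn is a cheap way to notice when a timestamp or a reordered schema has crept into the prefix. In vLLM the feature itself is controlled by the enable_prefix_caching engine argument; check the documentation for your version to see whether it is on by default.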
KV caching is not a new concept. Caching keys and values during autoregressive decoding has been standard practice in transformer inference for years, and prefix caching extends that reuse across requests that share the same leading tokens. The innovation here is not the mechanism. It is the measurement and the discipline.
Inferoa ran two sandboxes on islo.dev. One ran plain vLLM with Qwen2.5-0.5B. The other ran Inferoa's full stack, which layers CodeGraph (80.8% context reduction), RTK (61.4% tool-output reduction), and a semantic router on top of the byte-stable prefix approach. They read vLLM's Prometheus metrics directly. The counters recorded 1,611,008 prefix cache hits out of 1,647,574 total prompt tokens. That is 97.8%.
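A hit rate like this can be derived from two counters in the Prometheus text that vLLM exposes. The sketch below parses that text by hand. The metric names are an assumption on my part (they vary across vLLM versions, so check your own /metrics output), and the sample text is constructed from the two totals quoted above rather than copied from Inferoa's logs.

```python
# Assumed metric names; verify against your vLLM version's /metrics endpoint.
HITS_METRIC = "vllm:prefix_cache_hits_total"
QUERIES_METRIC = "vllm:prefix_cache_queries_total"


def read_counter(metrics_text: str, metric_name: str) -> float:
    """Sum all samples of one counter in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Sample lines look like: name{label="x"} 123.0
        # Assumes no trailing timestamp after the value.
        name_part, _, value = line.rpartition(" ")
        if name_part.split("{", 1)[0] == metric_name:
            total += float(value)
    return total


def prefix_cache_hit_rate(metrics_text: str) -> float:
    """Fraction of queried prompt tokens that were served from the prefix cache."""
    queries = read_counter(metrics_text, QUERIES_METRIC)
    if queries == 0:
        return 0.0
    return read_counter(metrics_text, HITS_METRIC) / queries


# Sample text built from the totals reported above (model label is illustrative).
SAMPLE = """\
# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{model_name="Qwen2.5-0.5B"} 1647574.0
# TYPE vllm:prefix_cache_hits_total counter
vllm:prefix_cache_hits_total{model_name="Qwen2.5-0.5B"} 1611008.0
"""

print(f"{100 * prefix_cache_hit_rate(SAMPLE):.1f}%")
```

In a live setup the text would come from an HTTP GET against the server's /metrics endpoint instead of a string constant.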
The economics are stark. Providers that discount cached input tokens often bill them at roughly 10% of the normal price. At that rate, a 97.8% hit rate puts the blended input cost at about 12% of the uncached price (2.2% at full price plus 97.8% at one tenth). In a multi-turn agent loop where the same system prompt, tool definitions, and conversation prefix get repeated dozens or hundreds of times, the savings grow with every turn.
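A rough way to put a number on this, assuming cached input tokens cost 10% of the normal input price. That discount is an assumption for illustration; check your provider's actual pricing. Output tokens are not affected by prefix caching and are left out.

```python
def blended_input_cost_fraction(hit_rate: float, cached_price_fraction: float) -> float:
    """Input cost per token as a fraction of the uncached price."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    uncached_share = 1.0 - hit_rate
    return uncached_share + hit_rate * cached_price_fraction


# Totals reported above: prefix cache hits over total prompt tokens.
hit_rate = 1_611_008 / 1_647_574

# Assumed discount: cached tokens billed at 10% of the normal input price.
fraction = blended_input_cost_fraction(hit_rate, cached_price_fraction=0.10)
print(f"hit rate {hit_rate:.1%}, blended input cost {fraction:.1%} of uncached")
```

The same function shows how fast the benefit fades: at a 50% hit rate and the same discount, the blended input cost is still 55% of the uncached price.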
This is not about using a cheaper model. This is about serving the model you already chose for a fraction of the cost.
The Economics Nobody Is Talking About
The infrastructure story here connects to a broader market shift that is already underway.
Coinbase co-founder Brian Armstrong predicts 80% of AI workloads will run on 99% cheaper models within 12 to 18 months. (Reported via TechCrunch, June 10.) That is not a niche prediction. That is a founding figure in tech saying the market is about to abandon frontier pricing for routine work.
Anthropic just shipped Claude Fable 5 at $10 per million input tokens and $50 per million output tokens. That is double the price of Opus 4.8. The labs are heading toward IPOs with premium pricing models, while buyers are already voting with their wallets for cheaper, smarter serving.
Harvey, the legal AI tool, partnered with Fireworks AI on a hybrid routing experiment. They reserved frontier models for only the hardest tasks and routed everything else to smaller, cheaper alternatives. The result: 3x cost reduction without sacrificing quality. (Reported via TensorFeed daily data, June 10.)
Cohere shipped North Mini Code, a specialized smaller model for developers. AWS Cloud posted publicly that "more AI-generated code does not make your team faster." (Reported via Hacker News, June 10.)
The pattern is consistent across every signal. The assumption that everyone will just buy frontier models is breaking down. Enterprise buyers are doing the math. The economically rational play is to run smaller, cached, optimized models 80% of the time and save frontier access for the hard 20%.
Your caching strategy matters more than your model choice.
What to Do Today
Here are five actionable steps for builders and deployers.
1. Audit your agent loop for redundant context. If you are passing full conversation history on every turn, you are bleeding money. Count your tokens per turn. If the number grows linearly with conversation length, you have a problem.
2. Enable prefix caching if you are on vLLM. Byte-stable prefixes are not rocket science. They are discipline. Keep your system prompt identical. Keep your tool schema identical. Do not inject dynamic content such as timestamps or request IDs into the prefix. The cache hit rate is a direct function of your own consistency.
3. Prune tool call history to the last N interactions and summarize the rest. The arXiv paper gives you the exact numbers. The last five tool call/response pairs plus automated summarization of older context beat full retention by about 20 points while using roughly 63% fewer tokens.
4. Route by task complexity. Not every subtask needs a frontier model. Hybrid routing is now a table-stakes architectural decision. The Harvey/Fireworks experiment reported a 3x cost reduction without quality loss from sending simple tasks to smaller models.
5. Measure before you optimize. Inferoa published their Prometheus metrics. You should be measuring cache hit rates too. If you do not know your hit rate, you do not know whether your infrastructure is efficient or whether you are subsidizing lazy engineering.
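Step 3 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the ToolExchange type, the function names, and the placeholder summarizer are all hypothetical, and a real summarizer would most likely call a model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ToolExchange:
    """One tool call and the response it produced."""
    call: str
    response: str


def naive_summary(older: List[ToolExchange]) -> str:
    """Placeholder summarizer (hypothetical): just lists the earlier calls."""
    names = ", ".join(exchange.call for exchange in older)
    return f"[Summary of {len(older)} earlier tool calls: {names}]"


def prune_history(
    history: List[ToolExchange],
    keep_last: int = 5,
    summarize: Callable[[List[ToolExchange]], str] = naive_summary,
) -> List[str]:
    """Keep the last `keep_last` exchanges verbatim and summarize the rest."""
    if keep_last <= 0:
        raise ValueError("keep_last must be positive")
    older = history[:-keep_last]
    recent = history[-keep_last:]
    context: List[str] = []
    if older:
        context.append(summarize(older))
    for exchange in recent:
        context.append(f"CALL: {exchange.call}")
        context.append(f"RESULT: {exchange.response}")
    return context


history = [ToolExchange(f"tool_{i}", f"output_{i}") for i in range(8)]
for line in prune_history(history):
    print(line)
```

One design note: pruning rewrites the part of the prompt that follows the static prefix, so only the system prompt and tool schema stay cacheable across a prune. Pruning in batches rather than on every turn should keep more of the conversation prefix stable between prunes.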
The Uncomfortable Question
If you are paying premium token prices and your cache hit rate is under 50%, you are not buying better intelligence. You are subsidizing lazy infrastructure.
The labs are heading toward IPOs with premium pricing models. The market is already voting with its wallet for cheaper, smarter serving. Which side of that trade are you on?
The cache is not a bolt-on optimization. In multi-turn agent workflows, it is the model. Every cached token is a token you pay only a fraction of the price for. Every prefix hit is compute you do not burn. Every disciplined byte-stable prompt is engineering maturity that your competitors lack.
97.8 percent means only 2.2 percent of the prompt tokens had to be computed fresh. The rest came out of the cache at a fraction of the cost. The question is whether your stack is built to capture it.