I still remember the Slack message from my CTO: "We need to talk about OpenAI costs."

It was 6 PM on a Friday. The customer support pilot had been live for three weeks. Response times were down 60%. CSAT scores were up. We were ready to expand to all users.

Then I saw the bill.

One customer had discovered they could paste War and Peace into the chat box "to see what happens." They did. In five minutes, a single curious user burned through nearly a million tokens on one conversation.

That was not an anomaly. It was a pattern waiting to happen.

The Token Cost Reality Nobody Talks About

Traditional SaaS has reasonably predictable costs. Servers scale roughly linearly. Databases grow with usage. Bandwidth is metered by the gigabyte. Your CFO can model next quarter's infrastructure spend on a napkin.

LLM costs are different. They scale with tokens processed, and tokens are driven by user behavior in ways that are hard to predict up front.

Here is what a single user session can cost with a modern GPT-4-class model such as GPT-4o, assuming public pricing around March 2026 (roughly $2.50 per million input tokens and $10 per million output tokens).

Scenario                     | Input Tokens | Output Tokens | Approx. Cost (GPT-4o)
Simple question              | 100          | 150           | ≈ $0.0017
Context-rich support ticket  | 2,000        | 500           | ≈ $0.010
Document analysis            | 8,000        | 2,000         | ≈ $0.040
User pastes 50-page PDF      | 32,000       | 4,000         | ≈ $0.12
Recursive agent loop         | 50,000       | 10,000        | ≈ $0.23

These look small in isolation. But multiply them across thousands of users, long chat histories, and agent loops, and you can easily end up with thousands of dollars per day in unexpected spend.
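The arithmetic behind that table is simple enough to sanity-check in a few lines. The prices below are the illustrative figures quoted above, not live rates:

```python
# Illustrative GPT-4o pricing from the table above (USD per token).
INPUT_PRICE = 2.50 / 1_000_000
OUTPUT_PRICE = 10.00 / 1_000_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of a single LLM call."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# One curious user pasting a book-sized text into a single conversation:
print(round(request_cost(900_000, 20_000), 2))  # ≈ 2.45 dollars
```

Multiply that by a few thousand users and a few such sessions each, and the shape of the Friday-evening bill becomes obvious.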

In our case, a few "let's see what happens" experiments added up to a multi-thousand dollar surprise on a single month's bill. No DDoS. No botnet. Just curious humans and an invisible cost structure.

Traditional request rate limiting is not enough. A single large request can cost more than a hundred small ones. Token-based economics require token-based controls.

The Five Token Bomb Vectors

After that incident, I started mapping every way LLM costs can explode in production. These five patterns show up over and over.

1. Context Window Abuse

Users discover they can paste arbitrarily large content into the chat box. It might be a PDF, a 10,000-line CSV, or an entire JavaScript bundle from a broken build. Modern models happily accept 128K+ tokens per request, but "technically possible" is not the same as "economically sane."

Without limits, one user can easily consume the equivalent of thousands of normal interactions in a single paste.
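A simple hedge is a hard cap at the door: estimate the token count before the request ever reaches the model, and reject or truncate anything oversized. A minimal sketch using a crude chars-per-token heuristic (the 8,000-token cap is an arbitrary example, and in production you would swap the heuristic for a real tokenizer like tiktoken):

```python
MAX_INPUT_TOKENS = 8_000   # arbitrary example cap; tune per feature
CHARS_PER_TOKEN = 4        # rough heuristic for English text

def clamp_user_input(text: str) -> str:
    """Truncate pasted input before it reaches the model.

    Uses a crude chars-per-token estimate; use a real tokenizer in production.
    """
    max_chars = MAX_INPUT_TOKENS * CHARS_PER_TOKEN
    if len(text) <= max_chars:
        return text
    # Keep the head of the paste; alternatively, reject with an error.
    return text[:max_chars]
```

Whether you truncate or reject is a product decision; the point is that the decision happens before you pay for the tokens.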

2. Agent Loops

Autonomous agents are powerful, but they are also a cost black hole. A poorly constrained agent that calls tools or other models recursively can burn through millions of tokens while you're getting coffee.

The MAST study ("Why Do Multi-Agent LLM Systems Fail?") analyzed 1,600+ multi-agent traces and found failure rates between 41% and 86.7% across state-of-the-art systems, with many failures involving repeated or unnecessary actions rather than clean crashes. In other words, agents often fail by doing too much rather than stopping.

If you're using orchestration frameworks like OpenClaw or similar systems, your "reasoning budget" is effectively a cost dial. Too small, and the agent hallucinates or gives up early. Too large, and you often pay a lot more for marginal gains, or, in the worst case, for loops.
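One defense that generalizes across frameworks is a per-run budget on the agent loop itself, checked on every iteration. A sketch under illustrative numbers, where `step` stands in for whatever hypothetical callable advances your agent and reports its token usage:

```python
class AgentBudgetExceeded(Exception):
    pass

def run_agent(step, max_steps: int = 10, max_tokens: int = 50_000):
    """Drive an agent loop with hard ceilings on both iterations and tokens.

    `step` is a hypothetical callable returning (result, tokens_used, done).
    """
    used = 0
    for i in range(max_steps):
        result, tokens_used, done = step()
        used += tokens_used
        if done:
            return result, used
        if used >= max_tokens:
            raise AgentBudgetExceeded(f"spent {used} tokens in {i + 1} steps")
    raise AgentBudgetExceeded(f"no answer after {max_steps} steps ({used} tokens)")
```

The exception is the feature: a loop that fails loudly after 50,000 tokens is vastly cheaper than one that fails silently after 5 million.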

3. Verbose Defaults

Most teams start with a generous max_tokens "just in case." The result is 500-token answers to yes/no questions and sprawling explanations where a single paragraph would do.

Each response has a tiny cost, but at scale, token waste becomes death by a thousand cuts. The more you normalize verbosity, the harder it becomes to roll back later without users feeling like the product got worse.
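A low-effort fix is to drop the single global max_tokens and set ceilings per intent instead. The categories and numbers below are illustrative, not recommendations:

```python
# Illustrative per-intent output ceilings instead of one generous global default.
MAX_TOKENS_BY_INTENT = {
    "yes_no": 50,
    "short_answer": 150,
    "explanation": 500,
    "document_summary": 1_000,
}
DEFAULT_MAX_TOKENS = 150  # conservative fallback, not "just in case" generous

def max_tokens_for(intent: str) -> int:
    return MAX_TOKENS_BY_INTENT.get(intent, DEFAULT_MAX_TOKENS)
```

Note that the fallback is the conservative value; anything that genuinely needs a long answer has to opt in.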

4. Feature Sprawl

Every new "smart" feature usually means another API call. Summarize this. Analyze that. Extract structured data here. Generate recommendations there. Individually, each call seems cheap. Collectively, they compound into bill shock.

Before you add an LLM call, ask whether you can do the same thing with cached data, a cheap local model, or simple rules. There is also a lot of free or near-free data in the world (RSS feeds, public APIs, log streams) that can provide the context you're currently paying tokens to recreate on every request.
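That triage can be made explicit in code: check the cache first, cheap rules second, and spend tokens last. A sketch, where `call_llm`, `FAQ_RULES`, and the in-memory cache are all placeholder assumptions:

```python
FAQ_RULES = {  # trivial rule layer: exact-match answers that need no model at all
    "what are your hours?": "We're available 24/7 via chat.",
}
_cache: dict[str, str] = {}

def answer(question: str, call_llm=lambda q: f"[LLM answer to: {q}]") -> str:
    """Cache first, rules second, paid tokens last."""
    key = question.strip().lower()
    if key in _cache:
        return _cache[key]
    if key in FAQ_RULES:
        return FAQ_RULES[key]
    result = call_llm(question)  # the only branch that costs money
    _cache[key] = result
    return result
```

Even a crude version of this gate tends to shave off the most repetitive, least valuable calls first.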

[Diagram: the five token cost vectors that turn AI features into budget nightmares]

5. Chat History Bloat

By default, many chat architectures send the entire conversation history on every new turn. A 20-turn chat that started at 500 tokens can easily end up at 10,000 tokens per request once the system prompt, user profile, and all prior messages are included.

The user sees "one more message." You see a 20x increase in per-request cost and a direct hit to your margin.

A simple rule of thumb: anything you intend to do more than once, like "remember the user's company," "remember the project context," or "remember their tools," should become a skill or a retrieved fact, not raw chat history.

The Token Budgeting Architecture

Token budgeting is not about being cheap. It is about being predictable.

Unpredictable costs kill AI features. A CFO who sees a 10x spike cannot budget for scale. A PM who cannot estimate per-user cost cannot price the product.

Token budgeting architecture treats tokens as a first-class resource: allocated, monitored, and consumed deliberately.

Here is, conceptually, the FastAPI middleware pattern that saved us:

from fastapi import FastAPI, HTTPException, Request
import tiktoken
import redis
from datetime import datetime

app = FastAPI()
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)

class TokenBudgetMiddleware:
    def __init__(self, default_budget: int = 100_000):
        self.default_budget = default_budget
        self.encoder = tiktoken.encoding_for_model("gpt-4o")

    async def estimate_request_tokens(self, request: Request) -> int:
        body = await request.body()
        text = body.decode("utf-8")
        return len(self.encoder.encode(text))

    async def check_budget(self, user_id: str, estimated_tokens: int) -> bool:
        today = datetime.utcnow().strftime("%Y-%m-%d")
        key = f"token_budget:{user_id}:{today}"

        current_usage = int(redis_client.get(key) or 0)
        budget = int(redis_client.get(f"user_budget:{user_id}") or self.default_budget)

        if current_usage + estimated_tokens > budget:
            return False

        # Atomically reserve the tokens; the TTL keeps stale day-keys from piling up.
        pipe = redis_client.pipeline()
        pipe.incrby(key, estimated_tokens)
        pipe.expire(key, 86400)
        pipe.execute()

        return True

    async def __call__(self, request: Request, call_next):
        if request.url.path.startswith("/api/ai"):
            user_id = request.headers.get("x-user-id", "anonymous")
            estimated = await self.estimate_request_tokens(request)

            allowed = await self.check_budget(user_id, estimated)
            if not allowed:
                raise HTTPException(
                    status_code=429,
                    detail="Token budget exceeded. Daily limit applied.",
                )

        response = await call_next(request)
        return response

# Register the instance as an HTTP middleware. Note: app.add_middleware(TokenBudgetMiddleware)
# would be a bug here, because Starlette would instantiate the class with `app` as its
# first constructor argument (our default_budget).
budget_middleware = TokenBudgetMiddleware()
app.middleware("http")(budget_middleware)

The code is simple on purpose. The hard part is deciding, from day one, that tokens are budgeted resources, not an afterthought you worry about after the invoice arrives.

Production Lessons

We learned four hard lessons putting this into production.

1. Budget Granularity Matters

Global budgets hide abuse. If all you track is "total tokens per day," you will not know which users or features are burning your money.

We ended up with three levels:

  • Global: application-level caps and alerts.
  • Per-user: daily allocations to catch outliers quickly.
  • Per-feature: higher budgets for expensive operations like document analysis, lower budgets for routine chat.

This lets you choose what to optimize and what to subsidize.

2. Token Estimation Is Approximate

Libraries like tiktoken can estimate token counts, but they are never perfect, and different providers count tokens differently.

We budget 10-20% below the actual ceiling to absorb estimation variance and model changes. When prices or models change, we adjust the internal conversion from tokens to dollars in one place, not everywhere.

3. Graceful Degradation Is Harder Than It Looks

If you just cut users off when they hit a limit, they get angry. We had to design gentle fallbacks:

  • Switch to cheaper models for summary-style tasks.
  • Shorten responses or omit non-essential context when budgets run low.
  • Use cached or precomputed answers for common questions.
  • Escalate to a human when automation is too expensive or risky.

A hard cutoff creates support tickets. A soft landing creates trust.

budget = TokenBudgetMiddleware()  # module-level instance; not a valid FastAPI dependency

@app.post("/api/ai/chat")
async def chat_endpoint(request: Request):
    user_id = request.headers.get("x-user-id", "anonymous")

    # Try full model first
    if await budget.check_budget(user_id, estimated_tokens=4000):
        return await call_gpt4o(request)

    # Fall back to a cheaper model with a summarized prompt
    if await budget.check_budget(user_id, estimated_tokens=2000):
        body_text = (await request.body()).decode("utf-8")
        summary_prompt = "Summarize the key points and answer briefly:\n\n" + body_text
        return await call_gpt4o_mini(summary_prompt)

    # Final fallback: cached response or human handoff
    return await get_cached_or_escalate(request)

The exact models and numbers will change, but the pattern (try the full experience, then a cheaper one, then escalation) holds up.

4. Monitoring Is Non-Negotiable

You cannot manage what you cannot see.

We built dashboards showing real-time token consumption by:

  • User
  • Feature or endpoint
  • Model
  • Time window (minute, hour, day)

We set alerts at 50%, 80%, and 95% of daily budgets for key tenants and for the platform as a whole. The 50% alert gives us time to investigate. The 95% alert usually means something is wrong: an agent loop, a misconfigured feature, or an unexpected usage pattern.

The CFO Conversation That Actually Works

Once token budgeting was in place, conversations with finance changed.

Instead of, "Explain this spike," we could say, "Here is the model for per-customer cost at scale."

The math is simple:

  • Token budget per user per day, for example, $0.50 worth of tokens.
  • Average tokens per interaction, say 500 tokens (input and output).
  • Implied maximum interactions per user per day, derived from actual prices and usage, not guesses.
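Under the illustrative pricing from earlier ($2.50 per million input tokens, $10 per million output tokens), that model is a few lines of arithmetic; the 300/200 input/output split is an assumption for the example:

```python
INPUT_PRICE = 2.50 / 1_000_000    # illustrative USD per input token
OUTPUT_PRICE = 10.00 / 1_000_000  # illustrative USD per output token

daily_budget = 0.50               # dollars of tokens per user per day
avg_input, avg_output = 300, 200  # a 500-token interaction, split illustratively

cost_per_interaction = avg_input * INPUT_PRICE + avg_output * OUTPUT_PRICE
max_interactions = int(daily_budget / cost_per_interaction)

print(f"${cost_per_interaction:.5f} per interaction, "
      f"~{max_interactions} interactions/user/day")  # ~181 at these assumptions
```

That single derived number, interactions per user per day, is what finance can actually plan around.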

Traditional SaaS: "We need more servers."

Token-based AI: "We need better prompting and better budgets."

The infrastructure decision becomes a product decision. You optimize by:

  • Reducing unnecessary tokens in prompts and responses.
  • Choosing cheaper models where quality allows.
  • Avoiding loops, redundant calls, and overlong histories.

All of that matters even more when you remember that LLM compute is energy-intensive at scale. The best time to fix waste is before you have a million users.

What We Built (And What You Can Too)

Our Token Budget Manager in production now handles:

  • Multi-model token counting (OpenAI, other providers, and local models).
  • Per-user, per-feature, and per-session budgets.
  • Real-time cost and usage monitoring (Prometheus and Grafana).
  • Automatic graceful degradation when budgets approach limits.
  • Alerting and cost attribution dashboards tied to tenants and features.

Every AI API call goes through this layer. Every token is counted. Every user has a budget.

The alternative is hoping your users are sensible with a resource they cannot see and you are paying for. They are not, and it is not their job to be.

The Bottom Line

That large bill taught us something about AI infrastructure that no whitepaper could: token economics are different. They demand different architecture, different monitoring, and different budgeting logic than traditional SaaS.

The real question is not whether you can afford to add AI features.

The real question is whether you can afford to ship them without cost controls.

My answer: no.
