The Token Budget Problem Nobody Designed For
Two years ago, the expensive part of AI was training. The big line item was the upfront compute to build the foundation model. That’s flipped. Inference — the ongoing cost of actually running models — now eats roughly 85% of enterprise AI budgets. And most of that spend isn’t going where you’d think.
It’s not chatbots. It’s agentic loops and RAG context taxes.
Where the tokens actually go
An agentic loop is any workflow where the model gets called multiple times in sequence: plan, act, check, re-plan. Every iteration costs tokens. A customer support agent that pulls up account history, reasons through the issue, drafts a response, checks it against policy, then sends it — that’s four or five model calls per ticket. At scale, that stops being a chat feature and starts being a compute workload.
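That cost structure is easy to see in code. Here's a minimal sketch of that support-agent loop, assuming a stand-in `call_model` function that just counts prompt tokens by whitespace (a real provider SDK reports exact usage, and real prompts are far larger):

```python
# Hypothetical sketch of a support-agent loop: each step is a separate
# model call, so per-ticket cost is the sum of every step's tokens.
# call_model is a stand-in for any provider SDK; it just records usage.

def call_model(step, prompt):
    # Stand-in: pretend every call costs len(prompt.split()) tokens.
    return {"step": step, "tokens": len(prompt.split())}

def handle_ticket(ticket, history):
    calls = []
    calls.append(call_model("plan", f"Plan a response to: {ticket}"))
    calls.append(call_model("draft", f"Draft using history: {history}"))
    calls.append(call_model("policy_check", "Check draft against policy"))
    calls.append(call_model("finalize", "Finalize and send"))
    return sum(c["tokens"] for c in calls), len(calls)

tokens, n_calls = handle_ticket("card declined", "5 years, premium tier")
print(n_calls)  # 4 model calls for one ticket, before any re-planning
```

One ticket, four billed calls. Every re-plan or retry adds another, which is why ticket volume turns this into a compute workload rather than a chat feature.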
RAG makes it worse. Retrieval-Augmented Generation means jamming retrieved documents into the prompt so the model can answer grounded questions. That context isn’t free. Every document you pull in is tokens on the meter, every single request. Retrieve 20 documents when 3 would do and you’ve got a roughly 7x context tax on every call.
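The back-of-envelope math is worth doing explicitly. Using assumed numbers (500 tokens per retrieved document, a 50-token question, 100,000 requests per day; your figures will differ):

```python
# Back-of-envelope context tax. All constants here are assumptions
# for illustration, not measurements.
DOC_TOKENS = 500          # average tokens per retrieved document
QUESTION_TOKENS = 50      # the actual user question
REQUESTS_PER_DAY = 100_000

def daily_prompt_tokens(docs_retrieved):
    return (docs_retrieved * DOC_TOKENS + QUESTION_TOKENS) * REQUESTS_PER_DAY

greedy = daily_prompt_tokens(20)  # grab everything "potentially relevant"
lean = daily_prompt_tokens(3)     # only what the answer needs
print(greedy, lean, round(greedy / lean, 1))
```

With these assumptions the greedy retriever burns about a billion prompt tokens a day against the lean version's 155 million, a roughly 6.5x difference, and the gap is pure tax: the question being answered is identical.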
Now multiply those two patterns: an agent that loops multiple times and retrieves fat context on each loop. You get inference bills that scale in ways nobody modeled when they greenlit the project.
The mobile data cap thing
The best analogy I’ve heard is early smartphones. When mobile apps first shipped, developers built with desktop assumptions. Nobody thought about data budgets because they’d never had to. Users ended up with apps that torched their data plans doing background syncs and pulling full-resolution images for thumbnail previews.
Same thing is happening with AI agents. Engineers built for capability (can it do the task?) but not for cost (how many tokens does it burn per task, and what happens when you multiply that by Tuesday’s volume?).
You end up with agents that work fine but bleed money. And like those early mobile apps, the bleeding doesn’t announce itself. It just shows up on the invoice.
Why nobody’s panicking (yet)
The insurance company that replaced 200 claims processors with an AI agent is still saving money. The SaaS company that bolted on an AI assistant is still delivering value. The ROI math works, so the bill gets paid.
But zoom in. Two insurance companies both deploy AI claims processing. Company A built it fast — fat context windows, redundant retrieval, eight model calls per claim. Company B took the time to build lean — same accuracy, two model calls per claim. Both are cheaper than humans. But Company B has a 4x cost advantage on every claim processed. That compounds into pricing power, margin, reinvestment capacity.
Once AI is table stakes in your industry, the sloppy builders get squeezed by the efficient ones. Not because their agents don’t work, but because their agents cost four times as much to do the same job.
There’s a ceiling problem too. Some use cases that almost pencil out economically never get built because the token math is too expensive. The inefficiency isn’t just running up bills — it’s killing applications that would be viable if anyone had designed them lean.
What this looks like in practice
I’m not talking about this from the outside. I run an AI assistant on a local server — cron jobs, email monitoring, news digestion, daily reports. It hits multiple models across multiple providers dozens of times a day. If I hadn’t thought about token costs at the architecture level, I’d have built myself a denial-of-wallet machine.
Three things that actually move the needle:
Model routing. Not every task needs your most capable model. I run three tiers: a free-tier model for low-stakes stuff like social media digests, a cheap model for routine monitoring and scheduled jobs, and the expensive model only for interactive work that genuinely needs strong reasoning. The default is always the cheapest thing that doesn’t screw up. Most teams do the opposite — they default to the biggest model for everything, which is like overnighting every package when ground gets there on time.
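A routing layer can be almost embarrassingly simple. Here's a minimal sketch; the model names and task categories are placeholders, not real products or my actual config:

```python
# Minimal model-routing sketch. Tier names and task categories are
# illustrative placeholders.
ROUTES = {
    "social_digest": "free-tier-model",          # low-stakes digests
    "scheduled_monitoring": "cheap-model",       # routine cron jobs
    "interactive_reasoning": "expensive-model",  # work that needs it
}
# Default is the cheapest thing that doesn't screw up, not the biggest model.
DEFAULT = "cheap-model"

def pick_model(task_type):
    return ROUTES.get(task_type, DEFAULT)

print(pick_model("social_digest"))          # free-tier-model
print(pick_model("interactive_reasoning"))  # expensive-model
print(pick_model("some_new_job"))           # cheap-model (the default)
```

The design choice that matters is the default: unknown work falls to the cheap tier, and tasks have to earn their way up to the expensive model.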
Context discipline. The single biggest waste I’ve run into is retrieval systems that grab everything potentially relevant and cram it into the prompt. A prefetch step — retrieve, filter, pre-process before the model ever sees it — can cut consumption by orders of magnitude. I wrote about this before: one of my monitoring jobs dropped from 1.45 million tokens per cycle to under 15,000 by pulling data processing out of the model call and into a Python script upstream. Same results. 99% fewer tokens.
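A prefetch step doesn't have to be clever to pay for itself. Here's a sketch of the filter stage, assuming the retriever returns scored documents; the threshold, cap, and field names are illustrative:

```python
# Prefetch/filter sketch: score-threshold and cap retrieved documents
# before any of them reach the prompt. Numbers are assumptions.
MAX_DOCS = 3
MIN_SCORE = 0.75

def prefetch(retrieved):
    """retrieved: list of {"text": str, "score": float} from any retriever."""
    relevant = [d for d in retrieved if d["score"] >= MIN_SCORE]
    relevant.sort(key=lambda d: d["score"], reverse=True)
    return relevant[:MAX_DOCS]  # hard cap on what the model ever sees

docs = [
    {"text": "billing policy", "score": 0.92},
    {"text": "refund workflow", "score": 0.88},
    {"text": "old changelog", "score": 0.41},
    {"text": "unrelated FAQ", "score": 0.30},
    {"text": "account tiers", "score": 0.79},
]
kept = prefetch(docs)
print([d["text"] for d in kept])  # top 3 above the threshold
```

Everything below the threshold or past the cap never costs a token. The bigger wins come from moving whole processing steps upstream of the model call, as in the monitoring-job example above, but the filter is the cheapest place to start.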
Loop awareness. When an agent calls the model multiple times, those steps need to be intentional. Not “reason freely until you’re satisfied” but “here are the three things you do, in order.” Every unplanned loop iteration is tokens you didn’t budget for. In my experience, the gap between open-ended reasoning and structured steps is 3-5x on cost with no difference in output quality.
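The "three things you do, in order" structure can be sketched as an explicit pipeline. The step names and the `call_model` stub below are illustrative, not a real agent framework:

```python
# Structured-steps sketch: a fixed, ordered pipeline instead of
# "reason until satisfied". The stub just wraps its input so the
# call chain is visible.
STEPS = ["extract_fields", "classify_claim", "draft_decision"]

def call_model(step, payload):
    # Stub standing in for a provider call.
    return f"{step}({payload})"

def run_pipeline(claim):
    result = claim
    for step in STEPS:  # exactly len(STEPS) calls, budgeted up front
        result = call_model(step, result)
    return result, len(STEPS)

output, n_calls = run_pipeline("claim#123")
print(n_calls)  # always 3, never an unplanned iteration
```

The point isn't the pipeline itself, it's that the call count is a constant you can put in a budget, instead of a distribution you discover on the invoice.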
Where this is headed
Token efficiency is going to become a real engineering discipline. The same way we got serious about memory management, query optimization, and mobile data usage once the constraints bit hard enough that sloppy work had real consequences.
Most teams right now are learning this the expensive way — build the agent, ship it, then scramble when the inference bill lands. The ones who bake token budgets into the architecture from day one, before writing any agent code, are going to carry a structural advantage that gets wider over time.
The constraint in 2026 isn’t compute. Compute is abundant. The constraint is that nobody built their agents with a meter running. The teams that start thinking about it now won’t be the ones refactoring later.