The AI Inference Cost Wars: Where the Economics Actually Land for Enterprise Workloads in 2026

The CFO leans across the table and says it with the tone of someone who has been told three contradictory things in a week. "Tokens are basically free now, aren't they?" The engineer sitting opposite has the bill open on her laptop. Last quarter the AI line was $340,000. This quarter, with prices down sharply on every model the team uses, the line reads $410,000. The CFO is not wrong about the price list. The engineer is not wrong about the bill. Both numbers are real, and the gap between them is the whole story.

Headline pricing has genuinely collapsed. A million input tokens at roughly Claude Sonnet 4.5's capability level cost ten to twenty times more in 2024 than the $3 they cost today. Gemini 2.5 Pro charges $1.25 per million input tokens at the standard tier. GPT-5 lists at the same $1.25 for input. Bedrock Nova undercuts all three at the low end. If the unit you bill in is the token, the curve has bent decisively in your favour. The trouble is that the unit you actually consume is the workload, and the workload has grown like a teenager.

Three forces have moved against the headline. The first is that the unit of work itself is bigger now: long context windows are standard, agentic loops multiply token counts the way compound interest multiplies a debt, and reasoning models bill for thinking they do internally as well as the output they emit. The second is that output prices fell more slowly than input prices — Claude Sonnet 4.5 is $3 input and $15 output, Gemini 2.5 Pro is $1.25 input and $10 output, ratios between five and eight, and workloads that emit long structured responses feel the output side more than the input side. The third is that reasoning tokens are charged the same way as output tokens. A query that returns 500 visible tokens may have produced 5,000 reasoning tokens behind the scenes, and the bill reflects all 5,500. Picture an iceberg priced by total mass; only the tip pokes above the waterline of what the user sees.
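The iceberg arithmetic can be sketched in a few lines. This is a minimal illustration using the Sonnet 4.5 output price quoted above; the function names are invented for the example and the token counts are the ones from the paragraph.

```python
# Output price in dollars per million tokens (Claude Sonnet 4.5, as quoted above).
OUTPUT_PRICE_PER_M = 15.00


def billed_output_tokens(visible_tokens: int, reasoning_tokens: int = 0) -> int:
    """Reasoning tokens bill at the output rate, so the two counts simply add."""
    return visible_tokens + reasoning_tokens


def output_cost(visible_tokens: int, reasoning_tokens: int = 0) -> float:
    """Dollar cost of the output side of one query."""
    billed = billed_output_tokens(visible_tokens, reasoning_tokens)
    return billed * OUTPUT_PRICE_PER_M / 1_000_000


without_reasoning = output_cost(500)      # the 500 tokens the user sees
with_reasoning = output_cost(500, 5_000)  # same answer plus hidden thinking

print(f"billed output tokens: {billed_output_tokens(500, 5_000)}")
print(f"output cost without reasoning: ${without_reasoning:.4f}")
print(f"output cost with reasoning:    ${with_reasoning:.4f}")
```

On these numbers the output side of the bill goes from $0.0075 to $0.0825, an elevenfold increase the user never sees on screen.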

What follows is the cost-economics map for enterprise workloads on the four frontier providers, the patterns that decide whether your bill scales gracefully or quadratically, and the four levers that actually move the number.

Key takeaways

Frontier per-token pricing dropped roughly 10x since 2024 — Claude Sonnet 4.5 at $3 input / $15 output, Gemini 2.5 Pro at $1.25 / $10, GPT-5 at $1.25 / $10, Bedrock Nova undercutting all three on the low end — but the per-workload bill has barely moved because output ratios run 5x to 8x input.
The same Claude Sonnet 4.5 workload that costs $0.0375 per Q&A query costs $0.285 per agentic loop with five tool calls: a 7.6x bill increase for the same buyer-perceived "one query."
Reasoning tokens bill at the output rate, so a query that emits 500 visible tokens but 5,000 reasoning tokens charges for all 5,500 — extended-thinking workloads cost 3-5x the same query without reasoning.
Four levers move the bill in practice: model routing by query class (40-60% saving on mixed workloads), aggressive caching of stable context at 0.1x read cost, separating hot and cold paths to use batch APIs at 50%, and capping reasoning budgets with structured-output constraints.
Production AI cost is determined more by architecture decisions — router, cache, batch path, reasoning budget, multi-agent topology — than by which provider's price list you read.

Figure 1. Same model, same provider price: per-workload bills span $1.9K to $85K per month. Five production workload shapes priced on Claude Sonnet 4.5 ($3 input / $15 output per 1M tokens), with per-query cost and monthly bill: single-turn Q&A, $0.0375 per query, $11K per month; long-context document review, $0.63 per query, $1.9K per month at low volume; extended-thinking reasoning, $0.179 per query, $54K per month; multi-agent workflow, $0.48 per workflow, $14K per month; agentic loop with five tool calls, $0.285 per query, $85K per month, about 7.6 times the Q&A bill for the same buyer-perceived "one query". The figure also lists the four levers that move the bill in practice: model routing (40-60% saving), prompt caching at 0.1x read cost, batch APIs at 50% off, and reasoning budgets with structured output.

The Map of the Price List

The published prices, normalised to dollars per million tokens, paint a tidy picture if you only read the headline column. Anthropic charges $5 input and $25 output on Claude Opus 4.5, drops to $3 input and $15 output on Sonnet 4.5, and lands at $1 input and $5 output on Haiku 4.5; long-context pricing on Sonnet and above stays flat across the 1M-token window, prompt-cache writes run at 1.25x or 2x input depending on TTL, and cache reads cost a tenth of input. OpenAI's GPT-5 family lists at $1.25 input and $10 output, with mini and nano tiers below, cached input at a tenth, reasoning tokens billed as output. Google's Gemini 2.5 Pro lists at $1.25 input below 200K tokens and $2.50 above, $10 output below 200K and $15 above, with Flash and Flash-Lite below and context caching at a tenth plus a storage premium. AWS Bedrock prices Nova Pro at $0.80 input and $3.20 output, Nova Lite at $0.06 and $0.24, Nova Micro at $0.035 and $0.14, and runs Anthropic, Meta, Mistral, Cohere, and Stability families on the same wire with batch and provisioned-throughput options layered on top.

All four offer batch APIs at roughly half the standard rate for workloads that can wait, and all four offer caching at roughly a tenth of input rates for context that repeats. This is where the optimisation lives. The headline rates are like the price on the petrol station signboard: real, but only one variable in a tank-to-tank cost that depends on traffic, route, and load.
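For readers who prefer the price list as data, here is a small sketch that encodes the figures quoted above and derives the output-to-input ratio plus the cached and batch input rates. The dictionary keys are illustrative labels, not official API model identifiers, and the two multipliers are the "roughly a tenth" and "roughly half" from the text rather than exact contract terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """List price in dollars per million tokens."""
    input_per_m: float
    output_per_m: float

    @property
    def output_ratio(self) -> float:
        """How many times more an output token costs than an input token."""
        return self.output_per_m / self.input_per_m


# Keys are illustrative labels, not official API model identifiers.
PRICES = {
    "claude-opus-4.5": Price(5.00, 25.00),
    "claude-sonnet-4.5": Price(3.00, 15.00),
    "claude-haiku-4.5": Price(1.00, 5.00),
    "gpt-5": Price(1.25, 10.00),
    "gemini-2.5-pro": Price(1.25, 10.00),  # tier below 200K tokens
    "nova-pro": Price(0.80, 3.20),
    "nova-lite": Price(0.06, 0.24),
    "nova-micro": Price(0.035, 0.14),
}

CACHE_READ_MULTIPLIER = 0.10  # "roughly a tenth of input"
BATCH_MULTIPLIER = 0.50       # "roughly half the standard rate"


def discounted_input(price: Price, multiplier: float) -> float:
    """Input price per million tokens after applying one discount."""
    return price.input_per_m * multiplier


for name, price in PRICES.items():
    print(
        f"{name:18s} ratio {price.output_ratio:.1f}x  "
        f"cached input ${discounted_input(price, CACHE_READ_MULTIPLIER):.4f}  "
        f"batch input ${discounted_input(price, BATCH_MULTIPLIER):.4f}"
    )
```

The sketch applies each discount on its own; whether and how caching and batch discounts combine is provider-specific and is not modelled here.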

What Production Actually Spends Money On

Take the deployment shapes that show up most often and cost them out at production volume, with Claude Sonnet 4.5 as the reference (the modal enterprise choice for medium-complexity work). The numbers change at different price points; the shape does not.

A single-turn Q&A over a small corpus is the workload the headline pricing was built for. Ten thousand input tokens of retrieved chunks plus system prompt, five hundred output tokens, and the per-query cost is $0.0375. Ten thousand queries a day works out to $375 a day, $11,000 a month. Cheap, predictable, the kind of workload that lets a CFO sleep.

An agentic loop with five tool calls is a different beast. Each iteration carries 8,000 tokens of context, 800 tokens of agent output, and 1,200 tokens of tool response that get added to the context for the next iteration. Five iterations later, the context has grown by roughly 10,000 tokens across the loop. Add it up and you reach 75,000 input tokens and 4,000 output tokens per query, costing $0.285. At the same ten thousand queries a day, that workload bills $2,850 a day and $85,000 a month: 7.6 times the Q&A bill, for what the buyer perceives as "one query." The fan-out is invisible to the user and visible to the finance team.

Extended-thinking workloads are the iceberg in practice. A research query that triggers deep reasoning mode might return 1,500 visible tokens of output while having produced 8,000 reasoning tokens on the way there. Input is 12,000 tokens. The bill is $0.179 per query, or $54,000 a month at ten thousand queries a day. The same query without reasoning enabled would cost $0.0585, so the thinking roughly triples the bill, the low end of the three-to-five-times range these workloads typically show.

Long-context document review costs more per query but runs at lower volumes. A 200,000-token document with a 2,000-token synthesis is $0.63 per query. At a hundred queries a day, the right order of magnitude for that workload, the bill is $1,900 a month. Modest on the face of it. Alarming once you start asking what happens at higher volumes.

Multi-agent workflows fan out in a different direction. A three-subagent workflow with an orchestrator might consume 30,000 input tokens and 2,000 output tokens at the orchestrator plus another 25,000 input and 3,000 output at each subagent — 105,000 input and 11,000 output overall, $0.48 per workflow. A thousand workflows a day costs $14,000 a month. The unit to count is the workflow, not the query, because the workflow has already fanned out before the user sees a response.

The pattern is consistent across the five shapes. Headline pricing applies cleanly to short prompts and short responses. Long context, agentic loops, reasoning tokens, and multi-agent fan-out each multiply the bill by a factor that does not appear in any price list. Most production workloads carry at least two of those multipliers at once.
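The five shapes above can be reproduced with a short calculation. The sketch below uses the per-query token totals and daily volumes stated in the text and the Sonnet 4.5 list price; the 30-day month is an assumption, and the names are invented for the example.

```python
INPUT_PRICE_PER_M = 3.00    # Claude Sonnet 4.5, dollars per million input tokens
OUTPUT_PRICE_PER_M = 15.00  # dollars per million output tokens
DAYS_PER_MONTH = 30         # assumption for the monthly figures

# name: (input tokens, billed output tokens, units per day), as stated in the text
SHAPES = {
    "single-turn Q&A": (10_000, 500, 10_000),
    "agentic loop, five tool calls": (75_000, 4_000, 10_000),
    "extended thinking": (12_000, 1_500 + 8_000, 10_000),  # visible + reasoning
    "long-context review": (200_000, 2_000, 100),
    "multi-agent workflow": (30_000 + 3 * 25_000, 2_000 + 3 * 3_000, 1_000),
}


def unit_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one query or workflow."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


def monthly_bill(input_tokens: int, output_tokens: int, units_per_day: int) -> float:
    """Dollar cost of a month of traffic at a steady daily volume."""
    return unit_cost(input_tokens, output_tokens) * units_per_day * DAYS_PER_MONTH


for name, (inp, out, per_day) in SHAPES.items():
    print(
        f"{name:30s} ${unit_cost(inp, out):.4f} per unit  "
        f"${monthly_bill(inp, out, per_day):>10,.0f} per month"
    )

qa = unit_cost(*SHAPES["single-turn Q&A"][:2])
loop = unit_cost(*SHAPES["agentic loop, five tool calls"][:2])
print(f"agentic loop vs Q&A: {loop / qa:.1f}x")
```

Swapping in a different price pair changes every number in the table but leaves the ordering of the shapes intact, which is the point the section makes.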

The Four Levers That Actually Move the Number

We have shipped enough production AI to know which optimisations pay off and which look impressive in a slide deck without moving the bill. Four levers do the work.

The first is model routing by query class. Think of it as triage: the urgent care nurse who decides whether your case needs a GP, a specialist, or a surgical team, rather than sending everyone to the surgeon by default. Most production workloads split cleanly into three tiers — simple (extraction, classification, short Q&A, served by Haiku 4.5 or Nova Lite), medium (synthesis, multi-step reasoning, most tool-using agents, served by Sonnet 4.5), and complex (deep research, code generation, long-document reasoning, served by Opus 4.5 or Gemini 2.5 Pro Thinking). A router classifies the incoming query and picks the model — sometimes a rules engine on prompt features, sometimes a small classifier model. On workloads that mix tiers, this typically takes 40 to 60 percent off the bill. The classifier is cheap; the tier upgrades you avoid are not. The Claude-first multi-model routing pattern from the Bedrock cost piece remains the single best return.
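A rules-engine router of the kind described can be sketched as follows. Everything here is hypothetical: the keyword lists, the traffic mix, and the per-query token counts are invented for illustration, and a production router would be tuned on logged traffic or replaced by a small classifier model. The tier prices are the Haiku, Sonnet, and Opus list prices quoted earlier.

```python
# Dollars per million tokens (input, output) for each tier's reference model.
TIER_PRICES = {
    "simple": (1.00, 5.00),    # Haiku 4.5
    "medium": (3.00, 15.00),   # Sonnet 4.5
    "complex": (5.00, 25.00),  # Opus 4.5
}

# Hypothetical keyword rules, standing in for real prompt features.
COMPLEX_HINTS = ("deep research", "refactor", "write code")
MEDIUM_HINTS = ("summarise", "summarize", "compare", "plan")


def classify(prompt: str, uses_tools: bool = False) -> str:
    """Pick the cheapest tier whose rules match the prompt."""
    text = prompt.lower()
    if any(hint in text for hint in COMPLEX_HINTS):
        return "complex"
    if uses_tools or any(hint in text for hint in MEDIUM_HINTS):
        return "medium"
    return "simple"


def query_cost(tier: str, input_tokens: int = 10_000, output_tokens: int = 500) -> float:
    """Dollar cost of one query on the given tier (token counts are illustrative)."""
    input_price, output_price = TIER_PRICES[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def blended_cost(mix: dict[str, float]) -> float:
    """Average cost per query when traffic is routed according to `mix`."""
    return sum(share * query_cost(tier) for tier, share in mix.items())


# Assumed traffic mix: half simple, 40% medium, 10% complex.
MIX = {"simple": 0.5, "medium": 0.4, "complex": 0.1}

routed = blended_cost(MIX)
everything_on_complex = query_cost("complex")
saving = 1 - routed / everything_on_complex

print(classify("Classify this ticket as billing or technical."))
print(f"routed ${routed:.4f} vs unrouted ${everything_on_complex:.4f}: {saving:.0%} saving")
```

With this invented mix the saving is 56% against sending every query to the complex tier and about 27% against sending every query to Sonnet, so the size of the win depends heavily on the baseline and on how much of the traffic is genuinely simple.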

The second is aggressive caching of stable context. System prompts, retrieved corpora, tool definitions, persona blocks, few-shot examples — anything that does not change per query is a caching candidate. Anthropic, Gemini, and OpenAI all price cache reads at roughly a tenth of input, which makes any workload with repeated context a caching candidate. There is a trap: caching only pays when the TTL covers the query interval. A workload of one query an hour running against a five-minute cache pays the full rate every time. Pick the TTL the way you pick a fridge setting — for what you are actually storing — and amortise across query batches where you can.
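The TTL trap is easy to see in numbers. The sketch below prices only the stable prefix of a prompt, using the 1.25x write and 0.1x read multipliers quoted earlier. It assumes that a cache hit keeps the entry alive, that a miss re-writes the cache at the write multiplier, and that queries arrive at a perfectly regular interval; the function name and the 8,000-token prefix are invented for the example.

```python
def prefix_cost_per_query(
    prefix_tokens: int,
    input_price_per_m: float,
    interval_s: float,
    ttl_s: float,
    n_queries: int,
    write_multiplier: float = 1.25,
    read_multiplier: float = 0.10,
) -> float:
    """Average dollar cost of the cached prefix per query.

    Assumes evenly spaced queries, that each hit keeps the cache entry alive,
    and that each miss pays the write multiplier again.
    """
    base = prefix_tokens * input_price_per_m / 1_000_000
    if interval_s <= ttl_s:
        # First query writes the cache, every later query reads it.
        total = base * write_multiplier + base * read_multiplier * (n_queries - 1)
    else:
        # The entry expires between queries, so every query is a fresh write.
        total = base * write_multiplier * n_queries
    return total / n_queries


PREFIX = 8_000  # tokens of system prompt, tools, and few-shot examples
PRICE = 3.00    # Claude Sonnet 4.5 input, dollars per million tokens

uncached = PREFIX * PRICE / 1_000_000
busy = prefix_cost_per_query(PREFIX, PRICE, interval_s=60, ttl_s=300, n_queries=100)
sparse = prefix_cost_per_query(PREFIX, PRICE, interval_s=3600, ttl_s=300, n_queries=100)

print(f"uncached prefix:             ${uncached:.6f} per query")
print(f"one query a minute, 5m TTL:  ${busy:.6f} per query")
print(f"one query an hour, 5m TTL:   ${sparse:.6f} per query")
```

Under these assumptions the busy workload pays about a ninth of the uncached prefix cost, while the hourly workload pays at least the full rate on every call, which is why the TTL has to be chosen against the real query interval.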

The third is the separation of hot and cold paths. Interactive UX needs sub-second latency. Batch processing does not. Anthropic, OpenAI, and Bedrock all offer batch APIs at half the standard rate in exchange for a turnaround window of up to 24 hours. Overnight document processing, analytics enrichment, classification backfills, async report generation: all of these belong on the batch tier, and most teams forget the batch tier exists because the marketing materials are loud about the realtime API. The realtime API is for realtime. The batch API is for everything else, and it halves the bill on the workloads that fit.
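One way to picture the hot and cold split is as a routing rule on how long each job can wait. The sketch below is illustrative only: the job names, costs, and deadlines are invented, and the rule simply sends anything that can tolerate a 24-hour wait to the half-price path.

```python
from dataclasses import dataclass

BATCH_WINDOW_S = 24 * 3600  # batch results may take up to a day
BATCH_MULTIPLIER = 0.50     # batch tier at half the standard rate


@dataclass(frozen=True)
class Job:
    name: str
    standard_cost: float  # monthly dollars if run on the realtime API
    deadline_s: int       # how long the caller can wait for a result


def path_for(job: Job) -> str:
    """Send a job to batch only if its deadline covers the batch window."""
    return "batch" if job.deadline_s >= BATCH_WINDOW_S else "realtime"


def bill(jobs: list[Job]) -> float:
    """Total monthly cost after routing each job to its path."""
    return sum(
        job.standard_cost * (BATCH_MULTIPLIER if path_for(job) == "batch" else 1.0)
        for job in jobs
    )


# Hypothetical monthly workload.
JOBS = [
    Job("chat assistant", standard_cost=1_000.0, deadline_s=2),
    Job("overnight document processing", standard_cost=600.0, deadline_s=36 * 3600),
    Job("classification backfill", standard_cost=400.0, deadline_s=7 * 24 * 3600),
]

all_realtime = sum(job.standard_cost for job in JOBS)
split = bill(JOBS)

for job in JOBS:
    print(f"{job.name:32s} -> {path_for(job)}")
print(f"all realtime ${all_realtime:,.0f}, hot/cold split ${split:,.0f}")
```

In this made-up mix half the spend can wait, so the total falls from $2,000 to $1,500; the interactive path is untouched.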

The fourth is the discipline of output structure and reasoning budgets. Reasoning models charge for the thinking, and unbounded thinking is a tap left running. If your workload does not need 8,000 reasoning tokens to answer a multiple-choice classification, cap the budget. Claude's extended-thinking budget parameter, GPT-5's reasoning effort tier, Gemini's thinking budget — all three give you a knob. Pair the cap with structured-output constraints, JSON schemas or function-call formats, that bound the response length. A model told it has 200 reasoning tokens and must emit a specific JSON schema produces a tighter, cheaper answer than the same model given freedom to ramble. The same way a writer with a 400-word limit produces sharper copy than one given a thousand.
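The effect of a reasoning cap on a small classification query can be estimated with the same per-token arithmetic. This is a sketch under stated assumptions: the cap is treated as a hard limit, the visible answer is assumed unchanged by it, and the token counts (2,000 input, 50 visible output) are invented to match the classification example in the text. Providers expose the cap under different parameter names and may enforce minimum budgets, none of which is modelled here.

```python
from typing import Optional

INPUT_PRICE_PER_M = 3.00    # Claude Sonnet 4.5, dollars per million input tokens
OUTPUT_PRICE_PER_M = 15.00  # reasoning tokens bill at this output rate


def query_cost(
    input_tokens: int,
    visible_tokens: int,
    reasoning_tokens: int,
    reasoning_budget: Optional[int] = None,
) -> float:
    """Dollar cost of one query, with reasoning optionally capped at a budget."""
    if reasoning_budget is None:
        billed_reasoning = reasoning_tokens
    else:
        # Treat the budget as a hard ceiling on billed reasoning tokens.
        billed_reasoning = min(reasoning_tokens, reasoning_budget)
    billed_output = visible_tokens + billed_reasoning
    return (
        input_tokens * INPUT_PRICE_PER_M + billed_output * OUTPUT_PRICE_PER_M
    ) / 1_000_000


# Illustrative classification query: short prompt, short structured answer.
uncapped = query_cost(2_000, 50, reasoning_tokens=8_000)
capped = query_cost(2_000, 50, reasoning_tokens=8_000, reasoning_budget=200)

print(f"uncapped: ${uncapped:.5f} per query")
print(f"capped:   ${capped:.5f} per query")
print(f"ratio:    {uncapped / capped:.0f}x")
```

On these numbers the uncapped query costs $0.12675 and the capped one $0.00975, a thirteenfold difference for the same label, provided the shorter reasoning does not hurt accuracy on the task.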

Why the Curve Bends But the Bill Stays

Two things are simultaneously true, and both deserve saying plainly.

The price-per-token curve really is down by a factor of 10 to 100 at the same capability level since 2024. That is one of the steepest declines any compute market has ever shown. The deflation is real.

The price-per-useful-workload curve has bent much less. Better models reason longer, hold larger contexts, fan out into agentic loops, and emit more structured intermediate work along the way. The total number of tokens consumed per "one useful answer" has grown at roughly the same rate as the per-token price has fallen. On commodity workloads (single-turn Q&A, simple classification, short summarisation) the deflation is fully felt; the bill drops. On the workloads that justify having an AI strategy in the first place (agentic automation, multi-step synthesis, long-document reasoning, complex tool use) the deflation gets neutralised by complexity growth.

The strategic implication is uncomfortable for anyone budgeting on the assumption that next year's model releases will lower the bill on this year's workloads. They will lower the per-token price. They will not lower your bill unless the workload mix and architecture explicitly extract the saving. Teams that expect their bill to drop are extrapolating the headline curve onto a substrate where it does not apply. Teams that expect their bill to stay flat while the workloads grow more capable are reading the substrate correctly.

Underneath all of it is a single fact that the price list cannot capture: production AI cost is determined more by architecture decisions than by which model you chose. The price list is one input. The router, the cache, the batch path, the reasoning budget, the multi-agent topology, the long-context decision (covered separately in the long-context vs RAG piece) — these are where the bill actually lands.

The buyers who believe AI is cheap have not yet deployed production agents at scale. The teams that have built the routing and the caching and the batching and the reasoning-budget discipline are the ones running production AI inside their finance team's expectations, and the ones being trusted with next year's workloads. The bill is engineering, not procurement. The next time the CFO asks why the price list says one thing and the invoice says another, what answer will the engineer with the laptop give?

FAQs

If per-token prices have dropped 10x, why hasn't my AI bill dropped 10x?

Because the unit of work has grown to absorb the saving. Long context windows are now standard, agentic loops multiply token counts through tool calls and reasoning, reasoning models bill internal thinking tokens at the output rate, and output prices fell less than input prices. On commodity workloads — single-turn Q&A, classification, short summarisation — you feel the full deflation. On the workloads that justify having an AI strategy in the first place, complexity growth approximately offsets the per-token deflation.

Which optimisation actually moves the bill the most?

Model routing by query class. Most production workloads split into three tiers — simple (Haiku 4.5 or Nova Lite), medium (Sonnet 4.5), complex (Opus 4.5 or Gemini 2.5 Pro Thinking). A router that classifies the query and picks the model typically reduces the bill 40-60% on workloads that mix tiers. The classifier model is cheap; the saved tier upgrades are not.

When does prompt caching actually pay off?

When the cache TTL covers the query interval. Anthropic, Gemini, and OpenAI all price cache reads at roughly 0.1x input cost, but a workload of one query per hour against a 5-minute cache pays full price every time. Pick the TTL deliberately, amortise across query batches when you can, and cache the things that don't change per query — system prompts, retrieved corpora, tool definitions, persona blocks, few-shot examples.

How much can the batch API save on the right workload?

Roughly 50%. Anthropic, OpenAI, and Bedrock all offer batch APIs at half standard pricing in exchange for a turnaround window of up to 24 hours. Workloads like overnight document processing, analytics enrichment, classification backfills, and async report generation belong on the batch tier. Most teams forget this exists because the marketing pushes the realtime API.

Should we expect AI bills to keep dropping as new model releases land?

Only on commodity workloads. Per-token prices will keep dropping because the unit economics let providers keep dropping them. But better models also reason for longer, hold larger contexts, and fan out into more elaborate agentic loops — so the saving is approximately neutralised on frontier workloads. Teams that expect their bill to stay flat while their workloads grow more capable are reading the substrate correctly.

How to engage

We size, model, and optimise enterprise AI inference economics for regulated buyers — with the cost projections that survive a finance review and a board read. Talk to us at creativeminds.dev/contact.