The headline is correct as far as it goes. A million input tokens to Claude Sonnet 4.5 costs $3 today; the equivalent capability in 2024 cost ten to twenty times that. Gemini 2.5 Pro charges $1.25 per million input tokens on the standard tier, GPT-5 lists at the same $1.25 for the flagship model, and Bedrock Nova undercuts all three on the low end. The per-token curve has bent decisively in the buyer's favour.
The headline misses three things that matter to anyone costing out a production deployment.
One. Per-token costs have collapsed; per-workload costs have not, because the unit of work has grown. Long context windows are now standard, agentic loops multiply token counts, and reasoning models charge for the thinking they do internally as well as the output they produce.
Two. Output tokens did not move as fast as input. Claude Sonnet 4.5 charges $3 input but $15 output. Gemini 2.5 Pro charges $1.25 input but $10 output. The ratio is 5x to 8x. Workloads that emit long structured responses — agent reasoning, code generation, synthesis tasks — feel the output-side pricing more than the input-side.
Three. Reasoning tokens are charged. The reasoning models (Claude with extended thinking, GPT-5 in reasoning mode, Gemini 2.5 Pro thinking) bill the internal thinking tokens at the output rate. A query that emits 500 visible tokens may have produced 5,000 reasoning tokens behind the scenes. The bill reflects all 5,500.
This piece is the cost-economics map for enterprise workloads on the four frontier providers, the patterns that determine whether your bill scales gracefully or quadratically, and the levers that actually work.
Key takeaways
- Frontier per-token pricing dropped roughly 10x since 2024 (Claude Sonnet 4.5 at $3 input / $15 output, Gemini 2.5 Pro at $1.25 / $10, GPT-5 at $1.25 / $10, Bedrock Nova undercutting all three on the low end), but the per-workload bill has barely moved: the unit of work has grown, and output is priced at 5x to 8x input.
- The same Claude Sonnet 4.5 workload that costs $0.0375 per Q&A query costs $0.285 per agentic loop with five tool calls: a 7.6x bill increase for the same buyer-perceived "one query."
- Reasoning tokens bill at the output rate, so a query that emits 500 visible tokens but 5,000 reasoning tokens charges for all 5,500 — extended-thinking workloads cost 3-5x the same query without reasoning.
- Four levers move the bill in practice: model routing by query class (40-60% saving on mixed workloads), aggressive caching of stable context at 0.1x read cost, separating hot and cold paths to use batch APIs at 50%, and capping reasoning budgets with structured-output constraints.
- Production AI cost is determined more by architecture decisions — router, cache, batch path, reasoning budget, multi-agent topology — than by which provider's price list you read.
The pricing landscape as it stands
Current published pricing on the four providers, normalised to dollars per million tokens, for the model tiers most commonly chosen for production:
Anthropic. Claude Opus 4.5 — $5 input, $25 output. Claude Sonnet 4.5 — $3 input, $15 output. Claude Haiku 4.5 — $1 input, $5 output. Long-context pricing on Sonnet and above is flat across the 1M-token window. Prompt caching: cache writes at 1.25x or 2x input (5-minute or 1-hour TTL), cache reads at 0.1x input.
OpenAI. GPT-5 family — $1.25 input, $10 output (gpt-5), with gpt-5-mini and gpt-5-nano below. Cached input at 0.1x. Reasoning tokens billed as output.
Google. Gemini 2.5 Pro — $1.25 input below 200K tokens, $2.50 above; $10 output below 200K, $15 above. Gemini 2.5 Flash and Flash-Lite for cheaper tiers. Context caching at 0.1x input plus storage premium.
AWS Bedrock. Nova Pro — $0.80 input, $3.20 output. Nova Lite — $0.06 input, $0.24 output. Nova Micro — $0.035 input, $0.14 output. Plus Anthropic, Meta, Mistral, Cohere, Stability and other model families on the same wire, with Bedrock-specific batch and provisioned-throughput pricing.
All four providers offer batch APIs at roughly 50% of standard rates for non-latency-sensitive workloads, and all four offer caching at roughly 10% of input rates for repeated context. These are the levers the cost optimisation lives on; the headline rates are the tip.
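The list prices above can be held in one small lookup table so that every later estimate draws on the same numbers. This is a minimal sketch using only the per-million rates quoted in this section; the `Price` type, the dictionary keys and the `cost_usd` helper are illustrative names, not any provider's API. It also assumes that Gemini's higher tier applies to the whole request once the prompt exceeds 200K tokens, which should be checked against the current price sheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# Rates as quoted in this section; keys are illustrative labels, not API model IDs.
PRICES = {
    "claude-opus-4.5": Price(5.00, 25.00),
    "claude-sonnet-4.5": Price(3.00, 15.00),
    "claude-haiku-4.5": Price(1.00, 5.00),
    "gpt-5": Price(1.25, 10.00),
    "gemini-2.5-pro": Price(1.25, 10.00),
    "gemini-2.5-pro-long": Price(2.50, 15.00),
    "nova-pro": Price(0.80, 3.20),
    "nova-lite": Price(0.06, 0.24),
    "nova-micro": Price(0.035, 0.14),
}

GEMINI_LONG_THRESHOLD = 200_000  # prompt size at which the higher tier starts


def price_for(model: str, input_tokens: int) -> Price:
    """Pick the applicable rate card, including the Gemini long-prompt tier."""
    if model == "gemini-2.5-pro" and input_tokens > GEMINI_LONG_THRESHOLD:
        return PRICES["gemini-2.5-pro-long"]  # assumption: tier covers the whole request
    return PRICES[model]


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD at list price (no caching, no batch discount)."""
    p = price_for(model, input_tokens)
    return (p.input_per_m * input_tokens + p.output_per_m * output_tokens) / 1_000_000


print(cost_usd("claude-sonnet-4.5", 10_000, 500))
print(cost_usd("gemini-2.5-pro", 250_000, 2_000))
```

Keeping the tier switch inside `price_for` means a long prompt cannot be silently costed at the short-prompt rate.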
Where enterprise workloads actually spend
Take the patterns most commonly deployed and cost them out at production volume. Numbers below assume Claude Sonnet 4.5 as the reference (the modal enterprise choice for medium-complexity workloads); the same shape holds across providers at different absolute prices.
Single-turn Q&A over a small corpus. Input: 10K tokens (retrieved chunks + system prompt). Output: 500 tokens. Cost per query: ($3 × 10,000 + $15 × 500) / 1M = $0.0375. At 10,000 queries per day: $375/day, $11K/month. Cheap. This is the workload that responds best to the headline pricing.
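The Q&A arithmetic can be reproduced in a few lines, which also shows how the bill splits between input and output. A minimal sketch at the Sonnet 4.5 rates used in this section; the function names and the 30-day month are assumptions made for illustration.

```python
def cost_breakdown(input_tokens, output_tokens, in_rate=3.0, out_rate=15.0):
    """Return (input_usd, output_usd, total_usd); rates are USD per million tokens."""
    input_usd = in_rate * input_tokens / 1_000_000
    output_usd = out_rate * output_tokens / 1_000_000
    return input_usd, output_usd, input_usd + output_usd


def project(per_query_usd, queries_per_day, days_per_month=30):
    """Scale a per-query cost to daily and monthly totals."""
    per_day = per_query_usd * queries_per_day
    return per_day, per_day * days_per_month


input_usd, output_usd, total = cost_breakdown(10_000, 500)
per_day, per_month = project(total, 10_000)
print(f"per query ${total:.4f} (input share {input_usd / total:.0%})")
print(f"per day ${per_day:,.0f}, per month ${per_month:,.0f}")
```

On these numbers about 80% of the Q&A bill is input, which is why this workload tracks the headline input price so closely.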
Agentic loop with five tool calls. Each loop iteration: 8K tokens of context, 800 tokens of agent output (reasoning + tool call), 1.2K tokens of tool response added to context. Five iterations means context grows by ~10K tokens across the loop. Total per query: ~75K input tokens, ~4K output tokens. Cost: ($3 × 75,000 + $15 × 4,000) / 1M = $0.285. At 10,000 queries per day: $2,850/day, $85K/month. That is 7.6 times the cost of the Q&A workload at the same query rate, for the same buyer-perceived "one query."
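The input total in an agent loop is driven by re-sending the growing context on every model call. The sketch below simulates that under one simplified assumption: each call sends the full context so far, and the agent's output plus the tool response are appended afterwards. The function names are illustrative. Under this model five calls sum to 60K input tokens and six calls (five tool calls plus a final answer) sum to 78K, which brackets the rounded ~75K figure above.

```python
def agent_loop_tokens(base_context, output_per_call, tool_result_tokens, model_calls):
    """Total (input, output) tokens billed across a loop that re-sends its context."""
    context = base_context
    total_in = total_out = 0
    for _ in range(model_calls):
        total_in += context            # the whole context is billed as input again
        total_out += output_per_call
        context += output_per_call + tool_result_tokens  # history grows each step
    return total_in, total_out


def usd(input_tokens, output_tokens, in_rate=3.0, out_rate=15.0):
    """Cost in USD at per-million-token rates (Sonnet 4.5 rates from this section)."""
    return (in_rate * input_tokens + out_rate * output_tokens) / 1_000_000


for calls in (5, 6):
    tokens_in, tokens_out = agent_loop_tokens(8_000, 800, 1_200, calls)
    print(calls, tokens_in, tokens_out, round(usd(tokens_in, tokens_out), 3))
```

Because each call pays for everything before it, billed input grows with the square of the number of iterations, not linearly. Doubling the loop length more than doubles the bill.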
Extended-thinking reasoning task. A research query that triggers deep reasoning mode. Visible output: 1,500 tokens. Reasoning tokens: 8,000. Input: 12,000 tokens. Cost: ($3 × 12,000 + $15 × 9,500) / 1M = $0.179. At 10,000 queries per day: $1,790/day, $54K/month. Three to five times the bill of the same query without reasoning, depending on how much thinking the model does.
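The effect of reasoning tokens is easy to isolate by costing the same query with and without them. A minimal sketch, assuming reasoning tokens are billed at the output rate as described above; the function name is illustrative.

```python
def reasoning_query_cost(input_tokens, visible_output, reasoning_tokens,
                         in_rate=3.0, out_rate=15.0):
    """Cost in USD when reasoning tokens are billed like output tokens."""
    billed_output = visible_output + reasoning_tokens
    return (in_rate * input_tokens + out_rate * billed_output) / 1_000_000


with_thinking = reasoning_query_cost(12_000, 1_500, 8_000)
without_thinking = reasoning_query_cost(12_000, 1_500, 0)
multiplier = with_thinking / without_thinking

print(f"with reasoning:    ${with_thinking:.4f}")
print(f"without reasoning: ${without_thinking:.4f}")
print(f"multiplier:        {multiplier:.2f}x")
```

For this example the multiplier comes out at a little over 3x, the low end of the three-to-five range. A query that thinks for longer moves it up.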
Long-context document review. A 200K-token document submitted for synthesis. Output: 2,000 tokens. Per query: ($3 × 200,000 + $15 × 2,000) / 1M = $0.630. At 100 queries per day (which is the right order of magnitude for this workload): $63/day, $1,900/month. Modest in absolute terms; alarming when extrapolated to higher volumes.
Multi-agent workflow with three subagents. Each subagent runs its own context. Orchestrator: 30K input, 2K output. Three subagents: 25K input, 3K output each. Total: 30K + 75K = 105K input, 2K + 9K = 11K output. Cost per workflow: ($3 × 105,000 + $15 × 11,000) / 1M = $0.480. At 1,000 workflows per day: $480/day, $14K/month. Per-workflow cost is the right unit to think in; per-query is the wrong unit because the workflow has fanned out.
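Costing per workflow instead of per call amounts to summing every agent's tokens before pricing them. A minimal sketch of that bookkeeping; the `AgentCall` type and agent names are illustrative, and the token counts are the ones from the example above.

```python
from dataclasses import dataclass


@dataclass
class AgentCall:
    name: str
    input_tokens: int
    output_tokens: int


def workflow_cost(calls, in_rate=3.0, out_rate=15.0):
    """Return (total_input, total_output, usd) for one fanned-out workflow."""
    total_in = sum(c.input_tokens for c in calls)
    total_out = sum(c.output_tokens for c in calls)
    usd = (in_rate * total_in + out_rate * total_out) / 1_000_000
    return total_in, total_out, usd


calls = [AgentCall("orchestrator", 30_000, 2_000)]
calls += [AgentCall(f"subagent-{i}", 25_000, 3_000) for i in range(1, 4)]

total_in, total_out, usd = workflow_cost(calls)
print(total_in, total_out, round(usd, 3))
print(f"monthly at 1,000 workflows/day: ${usd * 1_000 * 30:,.0f}")
```

Logging usage per workflow ID, not per API call, is what makes this number visible in practice.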
The pattern is consistent. The headline pricing is real, and applies cleanly to short-prompt-short-response workloads. Long context, agentic loops, reasoning tokens, and multi-agent fan-out each multiply the bill by a factor that does not appear in the headline. Production workloads usually have at least two of these multipliers in play.
The four levers that actually move the bill
We have shipped enough production AI to have a strong opinion on which optimisations pay off and which look impressive but do not move the bill. Four levers.
One — model routing by query class. The single highest-leverage optimisation. Most production workloads can be split into three tiers: simple (extraction, classification, short Q&A — Haiku 4.5 or Nova Lite is sufficient), medium (synthesis, multi-step reasoning, most tool-using agents — Sonnet 4.5 is the modal choice), complex (deep research, code generation, long-document reasoning — Opus 4.5 or Gemini 2.5 Pro Thinking). A router that classifies the query and picks the model — sometimes a simple rules engine on prompt features, sometimes a small classifier model — typically reduces the bill 40-60% on workloads that mix tiers. The classifier model is cheap; the saved tier upgrades are not. This is the Claude-first multi-model routing pattern we wrote about in the Bedrock cost optimisation piece, and it remains the single best ROI.
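A rules-based router of the kind described here can be very small. The sketch below is an illustration, not a production classifier: the keyword lists, the length threshold and the traffic mix are invented, and the model labels are shorthand for the tiers named above. It also estimates the blended per-query cost for a given mix against an all-Sonnet baseline.

```python
import re

TIER_MODEL = {
    "simple": "claude-haiku-4.5",
    "medium": "claude-sonnet-4.5",
    "complex": "claude-opus-4.5",
}
# (input, output) USD per million tokens, as quoted earlier in this piece.
RATES = {
    "claude-haiku-4.5": (1.0, 5.0),
    "claude-sonnet-4.5": (3.0, 15.0),
    "claude-opus-4.5": (5.0, 25.0),
}
# Hypothetical prompt features; a real router would be tuned on labelled traffic.
COMPLEX_WORDS = {"refactor", "research", "prove"}
MEDIUM_WORDS = {"summarise", "summarize", "compare", "plan"}


def classify(prompt: str, has_tools: bool = False) -> str:
    """Assign a tier from whole-word keyword matches, prompt length and tool use."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    if len(prompt) > 20_000 or words & COMPLEX_WORDS:
        return "complex"
    if has_tools or words & MEDIUM_WORDS:
        return "medium"
    return "simple"


def route(prompt: str, has_tools: bool = False) -> str:
    return TIER_MODEL[classify(prompt, has_tools)]


def blended_cost(mix, input_tokens, output_tokens):
    """Average per-query USD for a traffic mix given as {tier: share}."""
    total = 0.0
    for tier, share in mix.items():
        in_rate, out_rate = RATES[TIER_MODEL[tier]]
        total += share * (in_rate * input_tokens + out_rate * output_tokens) / 1_000_000
    return total


mix = {"simple": 0.70, "medium": 0.25, "complex": 0.05}  # invented mix
routed = blended_cost(mix, 10_000, 500)
baseline = blended_cost({"medium": 1.0}, 10_000, 500)
print(route("Summarise the attached contract"))
print(f"saving vs all-Sonnet: {1 - routed / baseline:.0%}")
```

With this mix the saving lands in the low 40s of percent. The result depends entirely on how much traffic really is simple, so the mix should be measured on real queries before the saving is promised to anyone.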
Two — aggressive caching of stable context. System prompts, retrieved corpora, tool definitions, persona blocks, few-shot examples — anything that does not change per query is a caching candidate. Anthropic's prompt caching at 0.1x read cost, Gemini's context caching at 0.1x read plus storage, OpenAI's prompt caching at 0.1x — all three providers price caching steeply enough that any workload with repeated context should be caching. The trap: caching only pays when the cache TTL covers the query interval. A workload of one query per hour against a 5-minute cache pays full price every time. Pick the cache TTL deliberately, and amortise across query batches when you can.
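The TTL trap can be made concrete by costing the stable prefix under both arrival patterns. A minimal sketch using the Anthropic multipliers quoted earlier (1.25x write on the 5-minute TTL, 0.1x read); it assumes each cache hit refreshes the TTL and that a query arriving after expiry pays a fresh write. The function names are illustrative.

```python
def cached_prefix_cost(prefix_tokens, n_queries, interval_s, ttl_s=300,
                       in_rate=3.0, write_mult=1.25, read_mult=0.10):
    """USD spent on the stable prefix across n_queries evenly spaced by interval_s."""
    per_token = in_rate / 1_000_000
    if interval_s <= ttl_s:
        writes, reads = 1, n_queries - 1   # cache stays warm after the first query
    else:
        writes, reads = n_queries, 0       # every query arrives after expiry
    return prefix_tokens * per_token * (writes * write_mult + reads * read_mult)


def uncached_prefix_cost(prefix_tokens, n_queries, in_rate=3.0):
    """USD spent on the same prefix with caching switched off."""
    return prefix_tokens * n_queries * in_rate / 1_000_000


busy = cached_prefix_cost(8_000, 100, interval_s=30)
hourly = cached_prefix_cost(8_000, 100, interval_s=3_600)
plain = uncached_prefix_cost(8_000, 100)
print(f"every 30s: ${busy:.4f}  hourly: ${hourly:.4f}  no cache: ${plain:.4f}")
```

On these assumptions the busy workload pays roughly a ninth of the uncached prefix cost, while the hourly workload pays 25% more than it would with caching off, because every request is a cache write.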
Three — separate hot and cold paths. Interactive UX needs sub-second latency, but batch processing does not. Anthropic, OpenAI, and Bedrock all offer batch APIs at 50% of standard pricing with 24-hour SLAs. Workloads like overnight document processing, analytics enrichment, classification backfills, and async report generation belong on the batch tier. Most teams forget this exists because the marketing pushes the realtime API. The realtime API is for realtime. The batch API is for everything else, and it halves the bill on those workloads.
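Splitting hot and cold paths starts with an inventory of jobs and their deadlines. The sketch below moves anything that can wait a full batch window onto the discounted tier and reports the bill before and after. The job names, monthly spend figures and the `Job` type are invented for illustration; the 50% discount and 24-hour window are the figures quoted above.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    monthly_usd: float      # current spend on the realtime API
    deadline_hours: float   # how long a result can wait


def plan_batch_split(jobs, batch_window_hours=24, batch_discount=0.5):
    """Return (names moved to batch, bill before, bill after)."""
    cold = [j for j in jobs if j.deadline_hours >= batch_window_hours]
    hot = [j for j in jobs if j.deadline_hours < batch_window_hours]
    before = sum(j.monthly_usd for j in jobs)
    after = sum(j.monthly_usd for j in hot)
    after += sum(j.monthly_usd * (1 - batch_discount) for j in cold)
    return [j.name for j in cold], before, after


jobs = [
    Job("chat assistant", 40_000, deadline_hours=0.001),
    Job("overnight document processing", 20_000, deadline_hours=24),
    Job("classification backfill", 10_000, deadline_hours=72),
]
moved, before, after = plan_batch_split(jobs)
print(moved)
print(f"before ${before:,.0f}, after ${after:,.0f}")
```

The deadline field is the important design choice: it forces each workload owner to state how long a result can wait, which is usually longer than the realtime default implies.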
Four — output structure and reasoning budgets. Reasoning models charge for the thinking. If the workload doesn't need 8,000 reasoning tokens to answer a multiple-choice classification, cap the budget. Claude's extended-thinking budget parameter, GPT-5's reasoning effort tier, Gemini's thinking budget — all three have explicit knobs that limit how much the model spends thinking. Use them. Pair with structured-output constraints (JSON schema, function-call format) that bound the output length. The model that knows it has 200 reasoning tokens and must emit a specific JSON schema will produce a tighter, cheaper answer than the same model given unlimited reasoning and prose freedom.
What the "AI is cheap" framing gets wrong
Two things worth saying directly.
The price-per-token curve really is down by an order of magnitude or more at the same capability level since 2024. That is real. The deflation in raw inference cost is one of the most rapid in any compute market we have on record.
The price-per-useful-workload curve has bent much less. Better models reason for longer, hold larger contexts, fan out into agentic loops, and emit more structured intermediate work. On workloads at the frontier of what is possible, the total tokens consumed per "one useful answer" has grown by a factor that roughly offsets the per-token deflation. On commodity workloads (single-turn Q&A, simple classification, short summarisation) the deflation is fully felt. On the workloads that justify having an AI strategy in the first place (agentic automation, multi-step synthesis, long-document reasoning, complex tool use) it is approximately neutralised by complexity growth.
This has a strategic implication. Teams that expect the next year of model releases to "make their AI bill drop further" are extrapolating the headline curve onto workloads where it does not apply. Teams that expect their bill to stay flat while their workloads grow more capable are reading the substrate correctly. The model providers will continue dropping per-token prices because the unit economics let them. Your bill will not drop unless your workload mix and architecture explicitly extract the saving.
What this teaches us about enterprise scaling
The pattern underneath all of this is that production AI cost is determined more by architecture decisions than by model selection. The model price list is one input. The router, the cache, the batch path, the reasoning budget, the multi-agent topology, the long-context decision (covered separately in the long-context vs RAG piece) — these are where the bill actually lands.
The buyers who think AI is cheap have not deployed production agents at scale. The teams that have built the routing, the caching, the batching, and the reasoning-budget discipline are the ones running production AI at scale within a finance team's expectations. The discipline is engineering, not procurement, and it compounds across every new workload added to the platform.
FAQs
If per-token prices have dropped 10x, why hasn't my AI bill dropped 10x?
Because the unit of work has grown to absorb the saving. Long context windows are now standard, agentic loops multiply token counts through tool calls and reasoning, reasoning models bill internal thinking tokens at the output rate, and output prices fell less than input prices. On commodity workloads — single-turn Q&A, classification, short summarisation — you feel the full deflation. On the workloads that justify having an AI strategy in the first place, complexity growth approximately offsets the per-token deflation.
Which optimisation actually moves the bill the most?
Model routing by query class. Most production workloads split into three tiers — simple (Haiku 4.5 or Nova Lite), medium (Sonnet 4.5), complex (Opus 4.5 or Gemini 2.5 Pro Thinking). A router that classifies the query and picks the model typically reduces the bill 40-60% on workloads that mix tiers. The classifier model is cheap; the saved tier upgrades are not.
When does prompt caching actually pay off?
When the cache TTL covers the query interval. Anthropic, Gemini, and OpenAI all price cache reads at roughly 0.1x input cost, but a workload of one query per hour against a 5-minute cache never gets a cache hit: it pays full price every time, plus the write premium where the provider charges one. Pick the TTL deliberately, amortise across query batches when you can, and cache the things that don't change per query: system prompts, retrieved corpora, tool definitions, persona blocks, few-shot examples.
How much can the batch API save on the right workload?
Roughly 50%. Anthropic, OpenAI, and Bedrock all offer batch APIs at half standard pricing with 24-hour SLAs. Workloads like overnight document processing, analytics enrichment, classification backfills, and async report generation belong on the batch tier. Most teams forget this exists because the marketing pushes the realtime API.
Should we expect AI bills to keep dropping as new model releases land?
Only on commodity workloads. Per-token prices will keep dropping because the unit economics let providers keep dropping them. But better models also reason for longer, hold larger contexts, and fan out into more elaborate agentic loops — so the saving is approximately neutralised on frontier workloads. Teams that expect their bill to stay flat while their workloads grow more capable are reading the substrate correctly.
Companion content
- Cost Optimisation on Amazon Bedrock
- Long-Context vs RAG: When 2M Tokens Beats Retrieval
- Federated AI: On-Prem vs Frontier
- Cold-Start Latency and Cost of Multimodal RAG Pipelines
- Multi-Agent Orchestration: CrewAI vs LangGraph vs Custom
How to engage
We size, model, and optimise enterprise AI inference economics for regulated buyers — with the cost projections that survive a finance review and a board read. Talk to us at creativeminds.dev/contact.
