Series · Amazon Bedrock for Production AI · Part 7 of 8 ← Part 6: Security Guardrails and Observability · Cost Optimization · Part 8: Case Study: SRE AI Agent for CloudWatch →
Key takeaways
- Routing a query to Claude Haiku versus Claude Opus is a 12–15× cost difference; "Sonnet for everything" overspends by roughly 3–4× on every query that Haiku could have handled.
- A cascade router (Haiku classifier + tiered routing + escalation on low confidence) typically drops average query cost 30–50% with no measurable quality regression.
- Prompt caching converts a 5,000-token system prompt invoked 100k times/day from full input rate to ~10% of input rate — four-figure-USD-per-day savings at high volume.
- Batch inference prices the same calls at ~50% of on-demand for non-real-time workloads (embedding ingestion, document classification, eval runs).
- The combined effect of cascade routing, caching, top-K discipline, batch, and provisioned throughput on a workload that fits all of them is roughly 25–35% of the baseline cost.
The bill that arrives without a name on it
The first Bedrock bill at the end of a team's first month carries one of two flavours of surprise. Either it is smaller than expected — the agent has not yet scaled to real production volume, and the line item reflects pilot traffic. Or it is larger than expected — the agent went into production, a small percentage of queries hit the longest context windows, and the bill is dominated by the tail rather than the median.
Either way, nobody on the team can tell you which feature, which agent, which prompt, which workflow is driving the spend. The cost surface is opaque by default. It is like opening an electricity bill and seeing one number for the whole house — no breakdown for the kettle, no breakdown for the boiler. Making it transparent is the first move in cost discipline; every optimisation that follows only matters if the team can attribute the spend.
This piece is the deepest treatment of the Claude-first multi-model routing thread that runs through Parts 1-6. Tag-driven attribution. Cascade routing in operational depth. Prompt and response caching. Batch inference for non-real-time work. The provisioned-throughput versus on-demand decision tree.
Token economics, in ratios that stick
Bedrock pricing is per-token. Input tokens and output tokens carry different rates — output is typically 4-5 times more expensive than input — and each model carries its own scale. The rough relationships across the catalogue, mid-2026:
| Model | Relative input cost | Relative output cost | When it's the right tier |
|---|---|---|---|
| Claude Haiku 4.5 | 1× (baseline) | 1× (baseline) | High-volume / low-complexity: routing, simple classification, factual lookup, summarisation, structured extraction |
| Claude Sonnet 4.6 | ~3-4× Haiku | ~3-4× Haiku | Default production reasoning tier: agent loops, RAG synthesis, code generation, multi-step planning |
| Claude Opus 4.7 | ~12-15× Haiku | ~12-15× Haiku | Hardest tasks only: complex multi-step analysis, novel synthesis, legal / financial / scientific where stakes justify the cost |
| Llama 4 Maverick | ~0.6× Haiku | ~0.6× Haiku | Cost-sensitive workloads where Haiku's quality isn't required |
| Llama 4 Scout | ~0.3× Haiku | ~0.3× Haiku | Very high volume, simple classification |
| Titan Embeddings v2 | ~0.05× Haiku per 1M tokens | (no output) | Embeddings (no output cost — embeddings are vectors, not generated text) |
| Cohere Embed v3 | ~0.05× Haiku per 1M tokens | (no output) | Embeddings — production default per Part 2 |
| Cohere Rerank v3 | ~0.5× Haiku per call | (per-document scored) | Re-ranking the top-N RAG candidates |
The numbers are illustrative ratios — pull the current Bedrock pricing page before quoting them in a proposal.
The relationship that matters: routing a query to Haiku versus Opus is roughly a 12-15 times cost difference. A workload that runs everything through Opus is paying an order of magnitude more than one that routes intelligently. It is the difference between sending every letter by air freight and using the post office for the postcards. The architectural skill is matching tier to task.
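To make the ratios concrete, here is a minimal sketch that prices a single call in Haiku-relative units. The tier multipliers are midpoints of the illustrative ranges in the table above, the 5× output-to-input ratio is an assumption taken from the upper end of the 4-5 times figure, and the function name `relative_cost` is hypothetical.

```python
# Illustrative multipliers relative to Haiku (midpoints of the table's ranges).
TIER_MULTIPLIER = {"haiku": 1.0, "sonnet": 3.5, "opus": 13.5}
OUTPUT_TO_INPUT_RATIO = 5.0  # assumption: output tokens cost ~5x input tokens


def relative_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call, in units of 'one Haiku input token'."""
    multiplier = TIER_MULTIPLIER[tier]
    return multiplier * (input_tokens + OUTPUT_TO_INPUT_RATIO * output_tokens)


if __name__ == "__main__":
    # Same request shape, three tiers: only the multiplier changes.
    for tier in TIER_MULTIPLIER:
        print(tier, relative_cost(tier, input_tokens=2_000, output_tokens=500))
```

Because the request shape is held constant, the ratio between any two tiers is just the ratio of their multipliers, which is why tier selection dominates every other dial.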
The cascade, plumbing first
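The cascade pattern (introduced in Part 2 and threaded through every subsequent piece) is the single highest-leverage cost optimisation available. The mechanics: a small classifier labels each incoming query, a routing table maps that label to a model tier, and a low-confidence response escalates to the next tier up.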
Three operational decisions sit inside the cascade.
The router itself
The classifier is a small model: Claude Haiku, by default. Cost per routing decision lands around a tenth of a cent. The classifier's task is narrow: take the incoming query and return one of three labels (simple, standard, or complex). A 50-word system prompt and a structured output format are enough.
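```python
# `Agent` is the agent-framework class used earlier in the series; the import
# was not shown in the original, so add the one your framework requires.
VALID_LABELS = {"simple", "standard", "complex"}

def classify_complexity(query: str) -> str:
    """Label a query as 'simple', 'standard', or 'complex' using Haiku."""
    classifier = Agent(
        model="anthropic.claude-haiku-4-5-20251001",
        instruction=(
            "Classify the query complexity. Reply with exactly one word: "
            "'simple' (factual lookup), 'standard' (single-domain reasoning), "
            "or 'complex' (multi-domain synthesis, comparative analysis)."
        ),
    )
    label = classifier(query).output.strip().lower()
    # Fall back to the middle tier if the model replies with anything unexpected.
    return label if label in VALID_LABELS else "standard"
```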
The classifier does not need to be perfect. Even a 70 percent accurate router on a workload with 60 percent simple, 30 percent standard, 10 percent complex queries still moves most of the simple traffic off Sonnet, which costs roughly four times Haiku per query, and escalation recovers quality on the misses. Good enough is the engineering target, not perfect.
The routing table
Mapping classifier output to model and parameters:
| Classifier label | Model | Other parameters |
|---|---|---|
| simple | Haiku 4.5 | max_tokens=500, no extended thinking |
| standard | Sonnet 4.6 | max_tokens=2000, re-ranking on for RAG |
| complex | Sonnet 4.6 with extended thinking, or Opus 4.7 if complexity_score > threshold | max_tokens=4000, extended thinking, top-5 RAG chunks |
Most production routing tables have three tiers. Some workloads benefit from a fourth: a trivial tier handled by Llama 4 Scout or a rules-based response that does not invoke a model at all (FAQ matching, intent detection from a small set of options). For workloads with a very heavy simple-query tail, the trivial tier cuts further, since those queries then cost a fraction of a Haiku call or nothing at all.
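The routing table translates naturally into a lookup. The sketch below is one possible shape, not the series' implementation: the `Route` dataclass, the `select_route` function, the short tier names, and the 0.8 threshold are all hypothetical, and the tier names stand in for whatever Bedrock model IDs you deploy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    max_tokens: int
    extended_thinking: bool


# Tier names are placeholders; substitute the Bedrock model IDs you deploy.
ROUTING_TABLE = {
    "simple": Route("haiku-4.5", 500, False),
    "standard": Route("sonnet-4.6", 2000, False),
    "complex": Route("sonnet-4.6", 4000, True),
}
OPUS_ROUTE = Route("opus-4.7", 4000, True)


def select_route(label: str, complexity_score: float = 0.0,
                 threshold: float = 0.8) -> Route:
    """Map a classifier label to a route; unknown labels fall back to 'standard'."""
    if label == "complex" and complexity_score > threshold:
        return OPUS_ROUTE  # only the hardest queries reach the top tier
    return ROUTING_TABLE.get(label, ROUTING_TABLE["standard"])
```

Falling back to the middle tier on an unrecognised label keeps a misbehaving classifier from silently sending traffic to either the cheapest or the most expensive model.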
Escalation on low confidence
The router can be wrong. The defensive pattern: if the lower-tier model's response carries a confidence signal below threshold, escalate to the next tier up. Confidence signals come in three flavours: the model itself returning "I don't know" or "I'm not confident"; a Guardrails grounding score below threshold (the response is not well-supported by retrieved context); or a cheap Haiku-based self-evaluation step against a quality rubric. Escalation touches a small percentage of total queries but recovers the quality on the cases where the router missed. Net effect: router accuracy does not need to be high, because escalation catches the misses.
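The escalation loop can be sketched independently of how confidence is measured. In the example below, `invoke` and `confidence` are caller-supplied callables (hypothetical names) so that any of the three signals described above can be plugged in; the 0.7 threshold and the tier ordering are assumptions.

```python
from typing import Callable, Sequence

TIERS: Sequence[str] = ("haiku", "sonnet", "opus")  # cheapest first


def answer_with_escalation(
    query: str,
    start_tier: str,
    invoke: Callable[[str, str], str],
    confidence: Callable[[str, str], float],
    threshold: float = 0.7,
) -> tuple[str, str]:
    """Try tiers from start_tier upward; return (tier_used, response)."""
    start = TIERS.index(start_tier)
    tier, response = start_tier, ""
    for tier in TIERS[start:]:
        response = invoke(tier, query)
        if confidence(query, response) >= threshold:
            break  # confident enough: stop escalating
    # If no tier clears the threshold, the top tier's answer is returned.
    return tier, response
```

Each escalation pays for the failed lower-tier call as well as the retry, so the pattern only stays cheap while escalations remain a small share of traffic.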
What the cascade actually delivers
For a typical agent workload with a 60-30-10 distribution: a Sonnet-everywhere baseline costs four times Haiku per query, while cascade routing — including the classifier overhead — brings the average down to about 2.8 times Haiku-equivalent. That is roughly a 30 percent saving on the average query, with no measurable quality regression and often a latency improvement on the simple-tier queries.
For a workload with a heavier trivial-query tail (80 percent simple, 15 percent standard, 5 percent complex), the cascade lands around 2 times Haiku-equivalent against the same 4 times baseline — roughly a 50 percent saving. These are conservative numbers. Many production workloads see steeper distributions and correspondingly larger savings. The router pays for itself within hours of deployment on any workload with a long tail of simple queries.
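The two figures above can be reproduced with a weighted average. This is a sketch under stated assumptions: the complex tier is priced at 8× Haiku (a blend of Sonnet with extended thinking and occasional Opus) and the classifier overhead at 0.2× Haiku per query. Neither number appears in the text; they are chosen because they are consistent with both the 2.8× and the 2× results.

```python
# Per-query cost of each tier in Haiku-equivalents (complex tier is an assumption).
TIER_COST = {"simple": 1.0, "standard": 4.0, "complex": 8.0}
CLASSIFIER_OVERHEAD = 0.2   # assumption: routing call costs ~0.2x a Haiku query
SONNET_BASELINE = 4.0       # Sonnet-everywhere cost per query, from the text


def cascade_cost(mix: dict[str, float]) -> float:
    """Average per-query cost (Haiku-equivalents) for a given traffic mix."""
    routed = sum(share * TIER_COST[label] for label, share in mix.items())
    return routed + CLASSIFIER_OVERHEAD


def saving_vs_baseline(mix: dict[str, float]) -> float:
    """Fractional saving against the Sonnet-everywhere baseline."""
    return 1.0 - cascade_cost(mix) / SONNET_BASELINE


MIX_TYPICAL = {"simple": 0.60, "standard": 0.30, "complex": 0.10}
MIX_HEAVY_TAIL = {"simple": 0.80, "standard": 0.15, "complex": 0.05}

if __name__ == "__main__":
    for name, mix in (("60-30-10", MIX_TYPICAL), ("80-15-5", MIX_HEAVY_TAIL)):
        print(name, round(cascade_cost(mix), 2), round(saving_vs_baseline(mix), 2))
```

Swapping in a workload's measured mix and tier costs shows quickly whether the router is worth building before any code is deployed.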
The cache that almost pays for itself
Bedrock prompt caching, introduced in 2024 and expanded through 2025-26, caches portions of the prompt across invocations. On subsequent calls the cached prefix is read from the cache and charged at a fraction of the standard input rate, typically about 10 percent.
Three use cases drive most of the savings. Large system prompts are the most common — agent system prompts with detailed instructions, long tool descriptions, embedded examples, all identical across every invocation. Long context documents come next: a RAG workflow where the same retrieved chunks are reused across multiple turns, or an agent holding a conversation about a single document. Few-shot example sets are the third: the same training examples included in every prompt for in-context learning.
Configure caching on the Bedrock invocation:
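```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# LARGE_SYSTEM_PROMPT and user_query are defined elsewhere in the application.
response = bedrock_runtime.converse(
    modelId="anthropic.claude-sonnet-4-6-20251022",
    system=[
        {"text": LARGE_SYSTEM_PROMPT},
        # cachePoint is its own content block: cache everything before this point
        {"cachePoint": {"type": "default"}},
    ],
    messages=[
        {
            "role": "user",
            "content": [{"text": user_query}],  # uncached: varies per call
        }
    ],
)
```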
The arithmetic: for an agent with a 5,000-token system prompt invoked 100,000 times a day, caching converts a recurring 500 million tokens of daily input from full rate to roughly a tenth. At Sonnet pricing, that is a four-figure daily saving on a high-volume workload.
Two operational notes. The cache TTL is bounded — five minutes by default, with longer-lived options configurable per region. High-volume workloads keep the cache warm naturally; low-volume workloads see frequent misses and lower hit rates. And cache invalidation is by content hash — any change to the cached prefix invalidates the cache. That is why agent system prompts that change frequently see lower hit rates. Version-control the system prompt and treat changes as deliberate events, not casual edits.
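The caching arithmetic, including the effect of a less-than-perfect hit rate, can be worked through as follows. The $3 per million input tokens is a placeholder rate chosen for illustration, not a quoted price, and the sketch ignores any premium charged on cache writes; `daily_cache_saving_usd` is a hypothetical helper.

```python
def daily_cache_saving_usd(
    prefix_tokens: int,
    calls_per_day: int,
    input_usd_per_million: float,
    cached_rate_fraction: float = 0.10,  # cached reads billed at ~10% of input rate
    hit_rate: float = 1.0,               # share of calls that actually hit the cache
) -> float:
    """Estimated daily saving from caching a fixed prompt prefix."""
    daily_prefix_tokens = prefix_tokens * calls_per_day
    full_cost = daily_prefix_tokens / 1_000_000 * input_usd_per_million
    # Only cache hits are discounted; misses pay the full input rate.
    return full_cost * hit_rate * (1.0 - cached_rate_fraction)


if __name__ == "__main__":
    # 5,000-token system prompt, 100,000 calls/day, placeholder $3 per 1M tokens.
    print(daily_cache_saving_usd(5_000, 100_000, 3.0))
    print(daily_cache_saving_usd(5_000, 100_000, 3.0, hit_rate=0.6))
```

The second call shows why hit rate belongs on the dashboard: at a 60 percent hit rate the same workload keeps only 60 percent of the saving.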
The discount for patience
For workloads where latency does not matter — overnight batch processing, large-corpus enrichment, scheduled analytics — Bedrock Batch Inference prices the same calls at roughly half the on-demand rate. The trade is patience: response latency is hours-to-days instead of seconds.
Five use cases fit batch cleanly. Embedding ingestion for Knowledge Bases — embedding a 10 GB corpus of documents is a batch job, not a real-time call. Document classification at scale — overnight processing of a day's worth of inbound documents. Synthetic data generation for fine-tuning, per Part 4. Periodic re-summarisation of long-running content. And eval-harness runs against test sets.
Batch inference is dispatched via S3 — a JSONL file of prompts in, a JSONL file of completions out. The integration with Step Functions (Part 5) is clean: a batch job is a single state with a poll-and-wait pattern. The cost saving is roughly 50 percent on the model calls themselves. For batch-eligible workloads, it is immediate and structural — the kind of discount that arrives just for using the right pipe.
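A minimal sketch of the S3-in, S3-out flow might look like this. It assumes an Anthropic model (whose batch `modelInput` follows the Messages request body) and uses hypothetical helper names; the bucket URIs, role ARN, and model ID passed to `submit_batch_job` are the caller's own. `boto3` is imported inside the submit function so the JSONL builder can be used and tested without AWS dependencies.

```python
import json


def build_batch_jsonl(prompts: list[str], max_tokens: int = 500) -> str:
    """Serialise prompts into the JSONL body uploaded to the input S3 prefix."""
    lines = []
    for i, prompt in enumerate(prompts):
        record = {
            "recordId": f"REC{i:08d}",  # 11-character alphanumeric identifier
            "modelInput": {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "messages": [
                    {"role": "user", "content": [{"type": "text", "text": prompt}]}
                ],
            },
        }
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


def submit_batch_job(job_name: str, role_arn: str, model_id: str,
                     input_s3_uri: str, output_s3_uri: str) -> str:
    """Start a Bedrock batch inference job and return its ARN."""
    import boto3  # local import keeps build_batch_jsonl free of AWS dependencies

    bedrock = boto3.client("bedrock")
    response = bedrock.create_model_invocation_job(
        jobName=job_name,
        roleArn=role_arn,
        modelId=model_id,
        inputDataConfig={"s3InputDataConfig": {"s3Uri": input_s3_uri}},
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": output_s3_uri}},
    )
    return response["jobArn"]
```

Keeping `max_tokens` explicit in every record matters here too: a batch job multiplies any over-generous output cap by the size of the corpus.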
Provisioned throughput, on-demand, and the break-even
On-demand pricing is the default — pay per token, no commitment, scales with demand. Provisioned throughput reserves capacity for a model: you pay a fixed hourly rate for guaranteed throughput, regardless of usage. It is the difference between hailing a taxi each time and hiring a driver for the month.
The break-even decision tree:
| Workload characteristic | Right pricing model |
|---|---|
| Unpredictable / bursty traffic | On-demand |
| Sustained high-volume traffic with predictable load | Provisioned throughput if the utilisation breakeven is achievable |
| Latency-sensitive (need to bypass on-demand throttling) | Provisioned throughput |
| Custom models (fine-tuned or imported via Custom Model Import) | Provisioned throughput required for some |
| Steady embedding ingestion or batch processing | On-demand (use Batch Inference rates) |
| Real-time agent workloads at predictable volume | Calculate breakeven; often Provisioned throughput above ~50% utilisation |
The arithmetic: divide the hourly provisioned-throughput cost by the per-token on-demand rate to get the break-even token volume per hour, then divide that volume by the token throughput the model units can sustain per hour to get the break-even utilisation. If sustained traffic exceeds that threshold for the contract duration, provisioned is cheaper. Below it, on-demand wins. Provisioned throughput is sold in one-month and six-month commitments, with a discount on the longer term. For production workloads with predictable load, six-month provisioned on Sonnet often makes sense for the cost-stability and latency-guarantee reasons alone, before the savings come in.
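The break-even calculation is short enough to write down. The sketch below uses a single blended on-demand rate (input and output combined) for simplicity, and every number in the example is hypothetical; the function names are illustrative only.

```python
def breakeven_tokens_per_hour(provisioned_usd_per_hour: float,
                              on_demand_usd_per_million: float) -> float:
    """Hourly token volume at which provisioned and on-demand cost the same."""
    return provisioned_usd_per_hour / on_demand_usd_per_million * 1_000_000


def breakeven_utilisation(provisioned_usd_per_hour: float,
                          on_demand_usd_per_million: float,
                          unit_tokens_per_hour: float) -> float:
    """Break-even volume as a fraction of what the model units can sustain."""
    volume = breakeven_tokens_per_hour(provisioned_usd_per_hour,
                                       on_demand_usd_per_million)
    return volume / unit_tokens_per_hour


if __name__ == "__main__":
    # Hypothetical: $30/hour commitment, $6 per 1M blended tokens on-demand,
    # model units able to sustain 10M tokens per hour.
    print(breakeven_tokens_per_hour(30.0, 6.0))
    print(breakeven_utilisation(30.0, 6.0, 10_000_000))
```

A result above 1.0 means the commitment can never pay for itself on cost alone, which leaves latency and throttling guarantees as the only reasons to buy it.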
Tags, or: the bill with names on it
You cannot optimise what you cannot measure. Every Bedrock-consuming resource should carry at least five tags: Workload for the application or product the spend belongs to; Team for the team responsible; Environment for dev, staging, or prod; CostCenter for chargeback; and Model for the tier.
The wrinkle is that Bedrock model invocations themselves do not carry tags directly. Tags apply to the consuming resource: the Lambda function, the Step Functions state machine, the API Gateway in front of the agent. The pattern that works is to tag the Lambda calling Sonnet with Workload=customer-support and Model=sonnet, and to tag a separate Lambda calling Haiku with the same Workload and Model=haiku. Cost Explorer then attributes Bedrock spend per tag combination across the consumer surface. Where the Bedrock line item itself has to carry the tags, Bedrock's application inference profiles accept cost-allocation tags and can be invoked in place of the raw model ID.
For workloads using direct Bedrock tasks in Step Functions (per Part 5), tag the state machine and use distinct state-machine names per workload — the per-state-machine Bedrock charge is attributable that way. For workloads on AgentCore, the runtime emits per-agent cost metrics natively, so per-agent attribution works without manual tagging.
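One way to keep the five-tag convention enforceable is to build and validate the tag set in code before applying it. This is a sketch with hypothetical helper names; the allowed `Environment` values come from the text, and `boto3` is imported only inside the function that actually calls AWS.

```python
REQUIRED_TAGS = ("Workload", "Team", "Environment", "CostCenter", "Model")
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}


def build_tags(workload: str, team: str, environment: str,
               cost_center: str, model: str) -> dict[str, str]:
    """Assemble the five cost-allocation tags, rejecting unknown environments."""
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"Environment must be one of {sorted(ALLOWED_ENVIRONMENTS)}")
    return {
        "Workload": workload,
        "Team": team,
        "Environment": environment,
        "CostCenter": cost_center,
        "Model": model,
    }


def missing_tags(tags: dict[str, str]) -> list[str]:
    """Return the required tag keys that are absent or empty."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]


def tag_lambda(function_arn: str, tags: dict[str, str]) -> None:
    """Apply the tags to the Lambda function that calls Bedrock."""
    import boto3  # local import so the helpers above run without AWS access

    boto3.client("lambda").tag_resource(Resource=function_arn, Tags=tags)
```

Running `missing_tags` in CI against every Bedrock-calling resource turns the tagging policy from a convention into a check.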
The dashboard that matters
Six fields keep the cost discipline alive. Per-workload spend for the current month and trended over three months. Per-model-tier spend, with the cascade router's effectiveness measurable as the Haiku-to-Sonnet ratio. Cache hit rate per workload — below 60 percent is usually a system-prompt-stability problem worth investigating. Token volume broken out by input and output and by model — output volume surprises teams when an agent generates verbose responses. Cost per query per workload, the unit-economics metric that ties spend to business value. And anomaly alerts on per-workload spend, where a three-times day-over-day jump pages the on-call.
CloudWatch dashboards built from the metrics in Part 6 cover most of this. Cost Explorer covers the rest with daily granularity and tag filters. For finance-grade attribution, AWS Cost and Usage Reports deliver per-resource per-hour spend with tags into S3, queryable from Athena or QuickSight. This is the detail a CFO will eventually ask for; standing it up early is cheaper than rebuilding it under a deadline.
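Three of the dashboard fields reduce to one-line calculations. The sketch below assumes the token counts come from the `usage` block of a Converse response (`inputTokens`, `cacheReadInputTokens`, `cacheWriteInputTokens`) and that `inputTokens` excludes cached tokens; the function names and the token-weighted definition of hit rate are illustrative choices, not the series' definitions.

```python
def cache_hit_rate(usage: dict) -> float:
    """Share of input tokens served from cache (token-weighted)."""
    read = usage.get("cacheReadInputTokens", 0)
    written = usage.get("cacheWriteInputTokens", 0)
    fresh = usage.get("inputTokens", 0)  # assumed to exclude cached tokens
    total = read + written + fresh
    return read / total if total else 0.0


def cost_per_query(spend_usd: float, query_count: int) -> float:
    """Unit-economics metric: spend divided by queries served."""
    return spend_usd / query_count if query_count else 0.0


def is_spend_anomaly(today_usd: float, yesterday_usd: float,
                     factor: float = 3.0) -> bool:
    """True when spend jumped by at least `factor` day over day."""
    return yesterday_usd > 0 and today_usd >= factor * yesterday_usd


if __name__ == "__main__":
    usage = {"inputTokens": 500, "cacheReadInputTokens": 4500,
             "cacheWriteInputTokens": 0}
    print(cache_hit_rate(usage), cost_per_query(120.0, 40_000),
          is_spend_anomaly(300.0, 100.0))
```

Aggregating these per workload tag gives the same view as the dashboard without waiting for the daily Cost Explorer refresh.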
What bites teams in production
Five cost patterns show up in real Bedrock deployments, each one a quiet leak that compounds.
The verbose-response tax is the most common. Output tokens cost 4-5 times input tokens. An agent generating 3,000-token responses when 500 would do is paying six times more on the output side than necessary. max_tokens is the easiest single dial to turn.
The runaway agent loop is the most expensive when it happens. An agent stuck in a tool-call loop that does not terminate cleanly can eat 20-50 times a normal turn's tokens. Hard turn limits, plus monitoring for outliers, catches it before it makes the day's news.
The dev-traffic-in-prod problem inflates the prod cost dashboard with non-production traffic. Strict environment tagging is the fix; the cost of slipping is months of confusing CFO conversations.
The forgotten provisioned throughput is the most embarrassing. A team provisions throughput for a workload that never reaches utilisation, and the commitment ticks over for months. A quarterly review of all provisioned commitments against actual usage catches it.
The unbounded context window is the most architectural. Claude's 200K-token context window is a feature, not a target. A RAG workflow that stuffs 50 chunks into context when 5 would do is paying 10 times on input cost for diluted attention. Top-K discipline (Part 2) is cost discipline by another name.
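The hard turn limit mentioned for runaway agent loops can be as simple as a budget object that the agent loop calls after every model turn. This is a sketch with hypothetical names and default limits; real limits should come from the workload's observed normal turn count and token usage.

```python
class TurnBudgetExceeded(RuntimeError):
    """Raised when an agent run exceeds its turn or token budget."""


class TurnBudget:
    def __init__(self, max_turns: int = 10, max_tokens: int = 50_000) -> None:
        self.max_turns = max_turns
        self.max_tokens = max_tokens
        self.turns = 0
        self.tokens = 0

    def record(self, tokens_used: int) -> None:
        """Call once per model turn; raises as soon as either limit is crossed."""
        self.turns += 1
        self.tokens += tokens_used
        if self.turns > self.max_turns:
            raise TurnBudgetExceeded(f"turn limit {self.max_turns} exceeded")
        if self.tokens > self.max_tokens:
            raise TurnBudgetExceeded(f"token limit {self.max_tokens} exceeded")
```

In use, the agent loop creates one `TurnBudget` per request and calls `budget.record(...)` with the turn's total tokens; catching `TurnBudgetExceeded` is the natural place to emit the outlier metric that monitoring alerts on.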
The composite picture
When the patterns from the prior pieces compose, the cost structure for a representative production workload becomes:
| Optimisation | Multiplicative effect |
|---|---|
| Baseline (Sonnet for everything, no caching, on-demand) | 1.0× |
| Cascade routing applied | ~0.5–0.7× |
| Prompt caching on system prompt | ~0.85× of the input-token portion (overall ~0.95×) |
| RAG top-K discipline (5 vs 50 chunks) | ~0.6× of the input-token portion (overall ~0.8×) |
| Batch inference where applicable | ~0.5× of the batched portion |
| Provisioned throughput where utilisation justifies | ~0.7× of the steady-state portion |
The combined effect on a workload that fits all of them is roughly 25 to 35 percent of the baseline cost. Three-to-four-times cheaper, with no measurable quality regression and often improved latency. The Claude-first multi-model routing rule that runs through this series is, at the cost layer, the difference between Bedrock workloads that scale economically and ones that do not.
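To see how the rows of the table compose, the sketch below multiplies the overall effects together. The cascade, caching, and top-K figures come from the table; the shares of spend that are batch-eligible (20 percent) and provisioned-eligible (50 percent) are assumptions made for illustration, and treating the optimisations as independent multipliers is a simplification.

```python
def portion_effect(multiplier: float, share: float) -> float:
    """Overall multiplier when `multiplier` applies to only `share` of spend."""
    return 1.0 - share * (1.0 - multiplier)


def combined_multiplier(cascade: float, batch_share: float,
                        provisioned_share: float) -> float:
    """Compose the table's effects into one fraction of baseline cost."""
    caching_overall = 0.95   # from the table
    top_k_overall = 0.80     # from the table
    batch = portion_effect(0.5, batch_share)
    provisioned = portion_effect(0.7, provisioned_share)
    return cascade * caching_overall * top_k_overall * batch * provisioned


if __name__ == "__main__":
    for cascade in (0.5, 0.6):
        result = combined_multiplier(cascade, batch_share=0.2, provisioned_share=0.5)
        print(cascade, round(result, 3))
```

With these assumed shares, a cascade multiplier of 0.5 to 0.6 lands at roughly 0.29 to 0.35 of baseline; a workload with less batch-eligible or provisioned-eligible traffic will land higher.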
If your next Bedrock bill arrived tomorrow, would you know which features paid for it?
FAQs
How accurate does the cascade router actually need to be?
Not very. Even a 70% accurate router on a workload with 60% simple, 30% standard, 10% complex queries still delivers a material cost reduction over a Sonnet-for-everything baseline; the worked example in this piece puts the 60-30-10 cascade at roughly 30% cheaper. Escalation on low confidence catches the misclassifications, so router accuracy is not the gating constraint: the routing table and escalation policy do most of the work.
When does provisioned throughput beat on-demand pricing?
Calculate the breakeven: divide the hourly provisioned-throughput cost by the per-token on-demand cost to get the break-even token volume per hour, and check that it sits within the model units' token throughput per hour. Sustained traffic above that threshold for the contract duration makes provisioned cheaper. Custom models that require dedicated capacity and latency-sensitive workloads needing to bypass on-demand throttling also tilt the decision toward provisioned; bursty, unpredictable traffic tilts it back toward on-demand.
Why is our cache hit rate so low?
Below 60% is usually a system-prompt-stability problem. Cache invalidation is by content hash — any change to the cached prefix invalidates the cache. Frequently-edited agent system prompts, dynamic timestamps, or rotating examples in the prefix all break caching. Version-control the system prompt and treat changes as deliberate events; low-volume workloads also see misses because the default 5-minute TTL expires between invocations.
What single dial gives the biggest savings on day one?
`max_tokens`. Output tokens cost 4–5× input tokens, and agents that generate 3,000-token responses when 500 would do are paying 6× more on the output side than necessary. Capping output, combined with cost-allocation tags so the team can see which workloads are doing it, recovers material spend before the cascade router is even built.
Do tags propagate to the per-Bedrock-invocation level?
No — Bedrock model invocations themselves don't carry tags directly. Tags are applied to the consuming resource (the Lambda function, the Step Functions state machine, the API Gateway), and Cost Explorer attributes Bedrock spend per tag combination on the consumer. For finance-grade attribution, CUR (Cost and Usage Reports) delivers per-resource per-hour spend with tags into S3, queryable from Athena.
What's next
Part 8 closes the series with the case study — an SRE AI Agent on Bedrock that observes CloudWatch logs, diagnoses incidents, and executes remediation actions. Every architectural decision documented across Parts 1–7 lands in one worked implementation. The agent is built on Strands + AgentCore, uses Claude Sonnet for the reasoning step and Haiku for the routing layer, reaches into Cohere embeddings for log retrieval, executes through Lambda action tools, runs under Guardrails with event.interrupt() gates for destructive actions, and is fully observable end-to-end.
The full series:
- Part 1 — Foundations: Building AI Agents on Amazon Bedrock
- Part 2 — RAG with Bedrock Knowledge Bases
- Part 3 — Open-source Agent Frameworks on Bedrock
- Part 4 — Model Customization on Amazon Bedrock
- Part 5 — Multi-step AI Workflows with Step Functions and Bedrock
- Part 6 — Security Guardrails and Observability for Bedrock
- Part 7 — Cost Optimization on Bedrock (this piece)
- Part 8 — Case Study: An SRE AI Agent on Bedrock for CloudWatch Log Triage
The cost discipline documented here applies across every prior piece in the series. Without it, the architecture from Parts 1–6 is technically sound but economically vulnerable.
