Series · Amazon Bedrock for Production AI · Part 1 of 8 Foundations · Part 2: RAG with Bedrock Knowledge Bases →
The SRE on shift gets a page at 03:14. The agent has restarted the wrong service. Not in production — the team had the sense to keep it in staging — but the agent took the action it was given access to, called the tool the way the schema described, and the schema described the wrong service. The model did its job. The plumbing around the model is what failed.
That is the shape of every agent failure we see in production. The model behaved. The pipework behind it did not. An AI agent is not a chatbot with extra steps; it is a system that observes, reasons, decides, acts against real infrastructure, and lives or dies on the layers around the model rather than on the model itself.
Key takeaways
- An AI agent is a four-layer system: model (the reasoning), tools (the actions), memory (what it remembers), and orchestration (the loop). The model is one component, not the whole thing.
- Bedrock is two things: a unified inference API across foundation models and a set of managed services (Agents, Knowledge Bases, Guardrails, AgentCore) sitting on top. Code model-agnostic against the Converse API; externalise model IDs as configuration.
- AgentCore is the runtime layer to reach for when the agent has to do real work over time: longer execution windows than Lambda, managed memory, built-in OAuth identity, MCP gateway, and X-Ray observability.
- Seven decisions define every Bedrock agent deployment — orchestration substrate, foundation model, version pinning, tool design (many narrow tools), KB coupling (referenced), guardrails posture (default-on, referenced), and memory configuration (bounded). Get them right by default; revisit on a schedule.
- Production-hard problems bite in order: fluent wrong answers, opaque cost surface, dangerous action surface, memory as a PII leak, and model changes underneath you. Build the evaluation harness before traffic.
The Four Layers Of A Live Agent
An agent in 2026 is a closed-loop system. It observes a state, reasons about what to do, decides on an action, executes that action against a real system, observes the result, and either declares done or loops again. The "agent" word covers the whole loop. The model is one component of it.
Think of it the way a hospital thinks of a surgical team. The surgeon is the model. The instruments are the tools. The patient chart is the memory. The protocol that drives the procedure end-to-end is the orchestration. Replace any one of those four and the operation changes. Skip any one and someone dies.
The model produces the reasoning. Claude, Llama, Mistral, Titan, Cohere — Amazon Bedrock unifies access to them under a single API. The choice is a real engineering decision (cost, latency, context window, tool-use quality), but it is one decision among several rather than the whole thing.
The tools are the actions the agent can take. Lambda functions, REST APIs, MCP servers, AWS API calls. The agent reasons in tokens but acts through tools.
The memory is what the agent remembers across turns. Short-term holds the current conversation. Long-term holds what the agent learned in previous sessions. Semantic memory is the documents and facts the agent has been given access to. Memory is what separates a useful agent from one that hallucinates the same answer every Tuesday morning.
The orchestration is the loop itself — how the agent decides whether to call a tool, call another model, return to the user, or stop. AWS's managed answer is Bedrock Agents and AgentCore. The open-source answers (LangChain, LlamaIndex, Strands) get their own treatment in Part 3.
This series treats each of those layers — and the operational concerns wrapped around them — as its own piece. This first piece is the vocabulary.
What Bedrock Actually Is
Amazon Bedrock is two things, usefully distinguished. The first is a unified inference API over a catalogue of foundation models. Code against InvokeModel or Converse and Bedrock routes the call to Claude, Llama, Mistral, Titan, Cohere, or whatever else is in the catalogue. No model-vendor SDKs, no separate billing per provider, no juggling API keys. One IAM role, one API surface, one bill.
The second is a set of managed services that sit on top of that API: Bedrock Agents for tool-using agents, Knowledge Bases for retrieval-augmented generation, Guardrails for content safety, AgentCore for production-grade agent runtime.
The first part is the foundation. The second part is what most teams will actually consume.
Foundation models worth knowing in 2026
The Bedrock catalogue is the working list. The production-relevant models, in the rough shape the current landscape suggests:
Claude (Anthropic) is the strongest tool-use and long-context model in the catalogue. The 4.x family. Default choice for production agents where reasoning quality matters more than per-token cost. Native support for "extended thinking" lets the model spend more compute before responding.
Llama 4 (Meta) is the strong general-purpose model with an attractive cost profile. Llama 4 Maverick and Scout are the production-relevant variants.
Mistral Large 3 and Mixtral are the strong European options, useful when data-residency or model-provenance arguments matter for regulated EU customers.
Amazon Titan and Nova are AWS's own families. Titan Embeddings and the multimodal Titan models cover embedding and image-to-text work.
Cohere Command R+ is RAG-tuned and useful as the reasoning model in Knowledge-Base-heavy pipelines.
The right architectural posture is model-agnostic by default. Code your agent against the Converse API, externalise the model ID as configuration, swap models per environment or per task. The model the agent uses today is not the model it should use in twelve months — pricing and capability move fast.
Bedrock Agents — The Managed Loop
Bedrock Agents is AWS's managed implementation of the agent loop. You define a foundation model, a set of instructions, action groups (collections of tools defined by OpenAPI schema or Lambda function spec), knowledge bases (vector stores the agent can query), and guardrails.
Bedrock runs the loop. When you invoke the agent, it sends the user's query plus the agent's instructions to the chosen foundation model, parses the response to see whether the model wants to call a tool or query a knowledge base or return to the user, invokes the Lambda function if a tool was called, feeds the response back to the model, and repeats until the model declares done or hits a turn limit.
The advantage is that the orchestration is handled. You do not write the loop yourself. The trade-off is reduced control: you cannot intervene mid-loop, model selection is constrained to Bedrock-catalogued models, and the customisation surface is smaller than what LangChain or Strands give you. Whether that trade-off fits the use case is the subject of Part 3.
Action groups — the tools the agent can call
An action group is a named collection of actions. Each action is either a Lambda function with an OpenAPI schema describing its inputs, outputs, and purpose, or a function declared directly in the agent's configuration with the same schema.
The OpenAPI schema is what the model reads to decide which tool to call. The description fields matter enormously — the model decides based on description text. A clear description means correct tool selection; a vague description means the agent hallucinates which tool to call. This is the most important piece of prompt engineering in any agent deployment, and it lives in OpenAPI YAML rather than in a system prompt.
The Lambda function the action group points to does the work — queries CloudWatch Logs, calls a payment API, updates a DynamoDB record. The agent does not know it is calling Lambda; it knows it is calling "the query_logs tool" or "the restart_service tool" with structured inputs.
Knowledge Bases — managed RAG
A knowledge base is a managed retrieval-augmented-generation pipeline. Point it at an S3 bucket of documents; Bedrock orchestrates the chunking, embedding, and indexing into a vector store of your choice (OpenSearch Serverless, pgvector on Aurora, Pinecone, MongoDB Atlas, Redis Enterprise Cloud). At query time, the agent or your application queries the knowledge base, gets the top-k most relevant chunks, and includes them in the prompt context.
Knowledge Bases are the right answer for most production RAG. They handle plumbing most teams should not be reinventing. They support hybrid search, filtered retrieval, multiple chunking strategies, and re-ranking. The configurable depth per knowledge base is the subject of Part 2. Wire a KB into an agent and the agent gets retrieval implicitly — when the model decides it needs additional context, it queries the KB through the agent loop with no separate retrieval code on your side.
Guardrails — content safety as configuration
Bedrock Guardrails is a separate service that filters input and output for content safety. You configure denied topics, content filters with thresholds per category (hate, violence, sexual content), sensitive-information filters for PII detection and redaction, word filters for blocked terms, contextual grounding that checks the model's output is supported by the retrieved KB context, and automated reasoning checks that validate output against declared policies.
Guardrails apply both before the prompt reaches the model (filtering user input) and after the model responds (filtering output). For regulated workloads, Guardrails is not optional. Part 6 goes deep on policy design.
AgentCore — The Production Runtime
In 2025 AWS introduced Bedrock AgentCore, a higher-level platform layer that addresses what was missing from Bedrock Agents when teams pushed to production scale. Five capabilities.
Runtime is a serverless agent runtime with longer execution windows than Lambda — multi-hour tasks become viable — plus session isolation and stateful execution. Memory is managed short-term and long-term memory with semantic search across past sessions. Identity is built-in OAuth flows so the agent can act on behalf of the user against external services, with token management handled. Tools include a browser the agent can drive (autonomous web actions), a code interpreter (Python execution sandbox), and a Model Context Protocol gateway that exposes external tool stacks. Observability is built-in tracing, metrics, and audit logging mapped to AWS X-Ray and CloudWatch.
AgentCore is the layer to reach for when the agent has to do real work over time — research a customer's account history across multiple systems, drive a browser through a multi-step process, hold context across hour-long sessions, audit-log every action for compliance. Bedrock Agents alone is sufficient for shorter-running, single-task agents; AgentCore covers long-running, multi-task agents that need production-grade observability.
The MCP gateway in particular is interesting. Model Context Protocol — originated by Anthropic, broadly adopted in 2024-2025 — is the open standard for exposing tools to LLMs in a structured, model-agnostic way. AgentCore's MCP gateway means an agent can use any MCP server in the ecosystem (internal tools, public servers for filesystem, GitHub, databases, third-party SaaS servers) without rewriting tool integrations per vendor.
A Minimal Agent, Concretely
A minimal production-grade Bedrock Agent has four configuration concerns:
agent:
name: customer-support-agent
foundationModel: anthropic.claude-sonnet-4-6-20251022
instruction: |
You are a customer support agent for [Company]. You have access to
tools that let you query a customer's account, recent transactions,
and the knowledge base of FAQs and policies. Always verify the
customer's identity before discussing account-specific information.
If a request involves a refund or account change, do not act —
return a structured handoff to a human agent.
actionGroups:
- name: account_query
lambdaArn: arn:aws:lambda:us-east-1:123:function:account-query
apiSchema: s3://bucket/account-query.openapi.yaml
- name: transaction_history
lambdaArn: arn:aws:lambda:us-east-1:123:function:transaction-history
apiSchema: s3://bucket/transaction-history.openapi.yaml
- name: human_handoff
lambdaArn: arn:aws:lambda:us-east-1:123:function:human-handoff
apiSchema: s3://bucket/human-handoff.openapi.yaml
knowledgeBases:
- knowledgeBaseId: KB-faqs-policies-001
description: Company FAQs and policy documents
guardrailConfiguration:
guardrailId: GR-customer-support-001
guardrailVersion: "DRAFT"
memoryConfiguration:
enabled: true
sessionSummaryConfiguration:
maxRecentSessions: 5
That is roughly the smallest defensible production configuration. The model is named with a specific version (not "the latest Claude") so behaviour is reproducible. Each action group has its own Lambda and OpenAPI schema (no one big tool — each action is its own narrow tool). The knowledge base is referenced by ID, decoupled so the KB can be updated independently. Guardrails are referenced, not inlined. Memory is enabled with a finite session-history window.
The agent definition itself is small. The work is in the OpenAPI schemas, the Lambda functions behind them, the knowledge-base content, and the guardrail configuration.
Where The Agent Actually Lives
Every line in that diagram is deliberate. Five are worth naming.
Identity sits at the perimeter, not just at the API. IAM Identity Center brokers human and CI/CD access. Workload identities are IRSA or IAM Roles, never long-lived access keys. The agent's IAM role is scoped per action group, not granted at the agent level.
No public IPs on workloads. Action-group Lambdas, Knowledge Base components, and the agent itself live in private subnets. Traffic to AWS APIs (bedrock-runtime, S3, KMS, Secrets) goes through VPC interface endpoints, not the public internet.
Guardrails wrap every invocation, input and output. There is no path where a user prompt reaches the model without passing the guardrail, and no path where a model response reaches the user without passing it back through.
Model invocation logs are first-class. Every call is logged to a dedicated S3 bucket with KMS encryption and Object Lock. This is what makes audit, debugging, and cost attribution possible, and what Part 6 builds on.
Cost tags live everywhere. Each component is tagged with workload, team, and environment so the dashboards in Part 7 can attribute spend by feature rather than by service.
Seven Decisions, And The Reasons
Every Bedrock agent deployment makes the same seven decisions, knowingly or by drift. The defensible default for each, and the reason, sits in one table for ease of reference:
| # | Decision | Default we recommend | Reasoning | When to revisit |
|---|---|---|---|---|
| 1 | Orchestration substrate — managed loop or open-source | Bedrock Agents for single-task, short-running; AgentCore + Strands for long-running production; LangChain only when cross-cloud or model-vendor portability is a hard requirement | Bedrock Agents has the lowest start cost; AgentCore + Strands has the strongest AWS-native production story; LangChain has the broadest ecosystem but the most plumbing | When the agent has to run for hours, hold state across sessions, or drive a browser → graduate to AgentCore. When the same agent has to run on Azure or GCP → consider LangChain. |
| 2 | Foundation model — which model powers reasoning | Claude 4.x as default for production agents; Llama 4 when cost matters more than reasoning depth; Mistral when EU provenance argument is load-bearing | Claude has the strongest tool-use and long-context behaviour in the catalogue; the other choices are cost or compliance moves | Quarterly. Bedrock's catalogue moves fast; what was state-of-the-art in February may be eclipsed in August. Re-evaluate with the same eval harness against the new model. |
| 3 | Model versioning — pinned or rolling | Always pinned to a specific dated model ID — never "the latest Claude" | A rolling model ID changes production behaviour on AWS's schedule, not yours. Pinned IDs make rollback meaningful and audit defensible. | When a pinned model is deprecated. Bedrock gives 90–120 days notice; use that window to re-evaluate against the successor before rolling forward. |
| 4 | Tool design — broad tools or narrow tools | Many narrow tools, each with its own OpenAPI schema and Lambda, not one tool with branching parameters | The model decides which tool to call based on the schema's description field. Narrow tools with sharp descriptions get correct routing; broad tools with branched parameters get hallucinated routing. |
Never. This pattern holds across every agent we ship. |
| 5 | Knowledge base coupling — inlined or referenced | Referenced by ID, with the KB defined and versioned independently of the agent | Documents change weekly; agent configuration changes monthly. Coupling them means every KB refresh requires an agent redeploy. Decoupling lets each evolve at its own cadence. | When the KB is single-purpose and tightly bound to a single agent (rare). |
| 6 | Guardrails posture — opt-in or default-on | Default-on, with the guardrail referenced (not inlined) so it can be tuned without redeploying the agent | Inline guardrails mean every policy update redeploys the agent and breaks the audit trail. Referenced guardrails have their own versioned lifecycle. | Never reduce below default-on. Tune the policies; don't disable. |
| 7 | Memory configuration — enabled or off | Enabled with a finite session window and explicit cross-session policy | Memory is the biggest PII leak surface in agent deployments. An unbounded memory accumulates personal data across sessions without explicit consent. A finite window plus explicit retention policy is the defensible default. | When the use case explicitly requires unlimited cross-session recall (e.g., a personal assistant). Even then, with explicit retention and erasure controls. |
Each of these decisions has a piece in this series that goes deeper. The table is the map; the parts that follow are the territory.
What This Series Will Cover
This is a foundations piece. The seven subsequent parts go deeper on what production agent deployment actually requires:
- Part 2 — RAG with Bedrock Knowledge Bases: chunking strategies, embedding models, vector store selection (OpenSearch Serverless vs pgvector vs Pinecone), hybrid search, re-ranking, evaluation.
- Part 3 — Open-source Agent Frameworks on Bedrock: when LangChain, LlamaIndex, or Strands beat the managed Bedrock Agents path. Deployment patterns for self-managed agent loops on EC2 or EKS.
- Part 4 — Model Customization on Bedrock: continued pre-training, fine-tuning, distillation, custom model imports, evaluation. When customization beats prompt engineering.
- Part 5 — Multi-step AI Workflows with Step Functions and Bedrock: chaining model calls, integrating Bedrock with the 9,000+ AWS APIs Step Functions covers, the choice between Step Functions workflows and Bedrock Agents.
- Part 6 — Security Guardrails and Observability for Bedrock: Guardrails policy design, IAM patterns for least-privilege Bedrock access, VPC endpoints / PrivateLink, CloudTrail audit, model-invocation logging.
- Part 7 — Cost Optimization on Bedrock: token economics per model, cost allocation tags, prompt and response caching, batch inference, model-tier routing, provisioned throughput vs on-demand.
- Part 8 — Case Study: An SRE AI Agent on Bedrock for CloudWatch Log Triage: the reference implementation that ties everything together — an agent that observes CloudWatch logs, diagnoses incidents, and executes remediation actions through Lambda and Step Functions, under Guardrails, with cost and observability instrumented from the start.
What Bites You In Production
The honest summary of what makes agent deployment hard, in the order it usually bites.
The agent does the wrong thing fluently. A misconfigured agent looks like it is working — plausible text, tool calls, completed turns. Whether it is actually doing the right thing is something only evaluation can tell you. Build the evaluation harness before the agent goes near production traffic.
The cost surface is opaque. Token-by-token pricing across calls, retrieval rounds, retries, and tool invocations adds up quickly. Without instrumentation you do not see the bill coming. Cost tracking is a Phase-1 concern, not a Phase-3 concern.
The action surface is dangerous. An agent that can call "restart service" can call "restart the wrong service" or "restart the right service at the wrong time" — which brings us back to the SRE at 03:14. Approval gates, dry-run modes, and human-in-the-loop for destructive actions are not optional. Part 8 unpacks this.
The memory becomes a leak. Agents accumulate context. Without explicit policies on what is stored, for how long, and across which sessions, you have built a PII-leaking system without realising it. Guardrails address part of this; memory configuration addresses the rest.
The model changes underneath you. Bedrock's catalogue evolves; what worked in February breaks in August because a model version was deprecated. Version pinning, regression testing, and rolling-evaluation pipelines are operational requirements, not nice-to-haves.
Three Steps Before Your First Production Agent
Stand up a sandbox account. Enable Bedrock model access in the AWS console — this is a model-by-model approval flow that takes a day or two for some models. Pin one specific Claude or Llama version. Provision a Bedrock-enabled IAM role. You are now ready to call the Converse API.
Build a no-tool agent first. Define a Bedrock Agent with a system prompt and a knowledge base but no action groups. Test that the model answers questions from the KB correctly before giving it the ability to take actions. Most agent failures come from rushing to actions before the reasoning is reliable.
Add one tool. The first tool is a read-only tool — query a database, query CloudWatch, fetch a record. Validate that the agent uses the tool correctly and integrates the response into its reasoning. Only then add a tool that can mutate state.
Part 2 picks up at the retrieval layer — what makes Knowledge Bases work in production, and what to choose when the managed path is not the right fit. Whose agent — and whose tools — are running in your account right now?
FAQs
When should we use Bedrock Agents and when should we use AgentCore?
Use Bedrock Agents for single-task, short-running agents where the managed loop is enough. Use AgentCore + Strands when the agent runs for hours, holds state across sessions, drives a browser, or needs production-grade observability via X-Ray and CloudWatch. AgentCore is the production runtime layer; Bedrock Agents is the lowest-start-cost entry point.
Should we use one broad tool with branching parameters or many narrow tools?
Many narrow tools, each with its own OpenAPI schema and Lambda. The model decides which tool to call based on the schema's description field. Narrow tools with sharp descriptions get correctly routed; broad tools with branched parameters get hallucinated routing. This pattern holds across every agent we ship.
Do we pin the model ID or follow rolling updates?
Pin to a specific dated model ID — never "the latest Claude." A rolling ID changes production behaviour on AWS's schedule, not yours, and breaks both rollback and audit. When a pinned model is deprecated, Bedrock gives 90–120 days notice; use that window to re-evaluate against the successor before rolling forward.
Why couple Knowledge Bases by reference rather than inlining them in the agent?
Documents change weekly; agent configuration changes monthly. Inlining the KB means every refresh requires an agent redeploy. Referencing the KB by ID lets each evolve at its own cadence and keeps the agent definition stable for audit.
What is the first thing to build before adding tools to a new agent?
The evaluation harness. A misconfigured agent looks like it is working — plausible text, tool calls, completed turns — so only evaluation tells you whether it is doing the right thing. Build the eval harness, then stand up a no-tool agent against a knowledge base, then add a single read-only tool. Mutation-capable tools come last.
This is Part 1 of an eight-part series on Amazon Bedrock for production AI. The series accompanies the Hardening-before-AWS series and the AWS-for-banks architecture series.
