Engineering

Open-Source Agent Frameworks on Bedrock: Strands, LangChain, LlamaIndex and the Managed-vs-Open-Source Decision

cmdev14 min read
Open-Source Agent Frameworks on Bedrock: Strands, LangChain, LlamaIndex and the Managed-vs-Open-Source Decision
Share
~21 min

Series · Amazon Bedrock for Production AI · Part 3 of 8 ← Part 2: RAG with Bedrock Knowledge Bases · Open-source Agent Frameworks · Part 4: Model Customization on Bedrock →

Key takeaways

  • Bedrock Agents wins for ~60–70% of production cases — well-defined surface, modest customization, AWS-native. The other 30–40% need the agent loop in your own code.
  • AgentCore Runtime + Strands SDK is the defensible default for AWS-native production work that needs loop control: hooks, steering handlers, multi-agent patterns, and human-in-the-loop are built in.
  • Steering handlers — "no, do it this way instead" — are the genuine differentiator over raw LangChain; Strands' published benchmark shows steering-equipped agents recovered from every induced error in the test set vs ~82.5% for prompt-only.
  • Multi-model routing applies inside the agent: SummarizingConversationManager runs summaries on Haiku, reasoning runs on Sonnet, classifier-routers send simple queries to Haiku — cascade savings pay for the classifier many times over.
  • LangChain / LlamaIndex remain right for cross-cloud portability, existing team investment, non-Bedrock models, or heavy plugin ecosystem use — but the hooks and steering Strands ships out of the box become hand-rolled callback code.

The wall the managed agent runs into

The first time you reach for Bedrock Agents — from Part 1 — the managed loop solves a real problem. Orchestration is handled. Action-group plumbing is wired. The audit trail comes free. Then you try to do something the framework did not anticipate, and you meet the wall. You cannot intervene mid-turn to redirect the model. You cannot insert custom validation between a tool call's response and the next reasoning step. You cannot run an evaluation harness against the agent's reasoning trace because you do not own the reasoning trace. The agent is opaque by design. It is a sealed car — fine if the route is the one the manufacturer mapped, useless when you need to take a different road.

For an estimated 60 to 70 percent of production agent use cases — well-defined task surface, modest customisation, AWS-native deployment — Bedrock Agents is the right answer. For the other 30 to 40 percent, the work has shape that needs the agent loop in your own code. This piece is about that other 30 to 40 percent.

Three credible paths lead into that space. AgentCore Runtime plus the Strands SDK is AWS's own open-source SDK paired with the AgentCore managed runtime, with production-grade hooks, steering handlers, multi-agent patterns, and observability built in. LangChain or LlamaIndex is the framework-agnostic option — broadest ecosystem, oldest documentation, most cross-cloud portability, most plumbing the team has to operate. And a custom agent loop in Python or TypeScript that calls Bedrock's Converse API directly — maximum control, maximum work, the right answer only when the use case is genuinely novel.

The defensible default for AWS-native production work is AgentCore plus Strands. The reasoning is the rest of this piece.

The frameworks side by side

Agent framework selection on Bedrock — short-running single-task AWS-native routes to Bedrock Agents; long-running multi-task AWS-native that wants control over the loop routes to AgentCore Runtime + Strands SDK; cross-cloud portability or existing team investment in a framework routes to LangChain or LlamaIndex; a novel architecture that none of the above fit routes to a custom Python or TypeScript loop on the Converse API directly.
Figure 1 — Four paths — pick by control needs, deployment target, and portability requirements.

The matrix that matters in practice:

Dimension Bedrock Agents AgentCore + Strands LangChain / LlamaIndex
Control over the loop None (managed) Full (open-source) Full (open-source)
Time to first working agent Minutes Hours Hours to days
AWS-native observability Built-in (CloudWatch / X-Ray) First-class (AgentCore traces, OpenTelemetry) Manual wiring
Multi-agent orchestration Limited Built-in: Agent-as-Tool, Swarm Manual or via add-on libraries
Steering / runtime correction Not available Steering handlers built-in Manual via callbacks
Human-in-the-loop Limited event.interrupt() primitive Manual implementation
Cross-cloud / cross-vendor portability None Bedrock-coupled by default; portable with effort Full portability across providers
Ecosystem breadth (integrations, plugins) AWS-only Growing, AWS-led Largest in the space
Production maturity (2026) Mature Mature, AWS-backed Mature, community-led
The right answer when… Single-task, well-defined, fast-start Production AWS-native with control Cross-cloud, framework-portable, or existing investment

Why Strands earns the centre column

Strands is, narrowly, the AWS-backed open-source agent harness SDK. The marketing line is control end-to-end. The engineering reality is four pieces that earn the claim and matter in production.

The loop lives in your code

The minimal Strands agent is short:

from strands import Agent, tool
from pathlib import Path

@tool
def save_report(title: str, content: str) -> str:
    """Save a research report to disk."""
    path = f"reports/{title}.md"
    Path(path).write_text(content)
    return f"Saved {path}"

agent = Agent(
    model="anthropic.claude-sonnet-4-6-20251022",  # pinned Claude
    tools=[save_report],
)
agent("Research AI agent frameworks and save the report.")

That is the whole thing. The Agent class runs the loop. The @tool decorator registers a function the model can call. The loop is in Python you can inspect, debug, and modify. The model is pinned (per the architectural rule from Part 1). Claude is the production default because tool-use quality matters; the model parameter is one line to swap.

Hooks intercept lifecycle events

The hooks system is what most LangChain users build by hand. Strands ships it.

from strands import Agent, tool
from strands.hooks import BeforeToolCallEvent, AfterToolCallEvent

def audit_tool_call(event: BeforeToolCallEvent):
    # Validate, log, or redirect — before the tool actually runs
    log_to_cloudtrail({
        "agent": event.agent_id,
        "tool": event.tool_name,
        "args": event.tool_args,
        "session": event.session_id,
    })
    if event.tool_name == "delete_record" and not event.tool_args.get("approved_by"):
        raise PermissionError("delete_record requires approved_by")

agent = Agent(
    model="anthropic.claude-sonnet-4-6-20251022",
    tools=[save_report, delete_record],
    hooks=[audit_tool_call],
)

The hook fires before every tool call. You can validate, log, redirect, or block. The agent's audit trail in CloudTrail is built from these events rather than reverse-engineered from output logs after the fact. For regulated workloads — banking, healthcare, anything with NDPA, GDPR, or NIS2 obligations — the hook-based audit is the difference between we think we have an audit trail and we do.

Steering handlers say no, do it this way

A blocking guardrail says no. A steering handler says no, do it this way instead. It is the difference between a parent shouting stop and a parent saying you forgot to add the WHERE clause, try again with the row scoped to one record. The second produces a corrected attempt; the first leaves the child stuck at the door.

from strands.steering import steer

@steer("sql_query_safety")
def require_where_clause(event):
    """If the model emits a DELETE or UPDATE without WHERE, redirect it."""
    if event.tool_name == "execute_sql":
        sql = event.tool_args.get("query", "").upper()
        if ("DELETE" in sql or "UPDATE" in sql) and "WHERE" not in sql:
            return {
                "redirect": (
                    "That DELETE/UPDATE has no WHERE clause and would affect "
                    "every row. Add a WHERE clause that scopes the change to "
                    "the specific record(s) intended."
                )
            }

The agent reads the steering message back into its reasoning, corrects itself, and proceeds. A LangChain implementation either lets the bad call through, blocks it (and leaves the agent stuck without guidance), or requires a custom retry-with-correction loop the developer writes from scratch.

Strands' published numbers on this are stark: prompt-only agents recovered from roughly 82.5 percent of induced errors, hard-coded workflows from 80.8 percent, agents with steering handlers from every one in the test set. The numbers are the vendor's; the mechanism is real and operationally distinct from what raw frameworks provide.

Multi-agent patterns out of the box

Strands ships two production-relevant multi-agent patterns out of the box.

Agent-as-Tool is a specialist agent registered as a tool a generalist agent can call.

specialist = Agent(
    model="anthropic.claude-sonnet-4-6-20251022",
    tools=[query_database, generate_chart],
    instruction="You are a data analyst. Answer with charts where useful.",
)

@tool
def ask_data_analyst(question: str) -> str:
    """Delegate a data-analysis question to the specialist agent."""
    return specialist(question).output

orchestrator = Agent(
    model="anthropic.claude-sonnet-4-6-20251022",
    tools=[ask_data_analyst, send_email, ...],
    instruction="You are a research orchestrator.",
)

Swarm is multiple peer agents coordinating on a shared task — useful where parallel exploration produces better outcomes than sequential reasoning. Imagine three architects sketching the same brief and reconciling at the end, versus one architect working alone.

The Agent-as-Tool pattern is the one most production deployments end up using. The specialist agent typically runs on a smaller model (Haiku or Sonnet); the orchestrator runs on Sonnet or Opus. Model-tier routing is per-agent, and the cost discipline pays for itself immediately on real workloads.

Memory at the right granularity

Conversation managers are the bridge between the short-term-memory layer of Part 1's four-layer architecture and the actual production behaviour. Strands ships two.

The SlidingWindowConversationManager keeps the last N turns verbatim. Predictable, fast, low cost. Right for transactional agents — customer support, single-task assistants — where the conversation is short. The SummarizingConversationManager keeps the last N turns verbatim plus a running summary of older turns. Higher per-turn cost, because summary updates are an extra model call, but it maintains coherence across longer conversations. Right for research agents, copilots, long-running session work.

from strands.memory import SummarizingConversationManager

agent = Agent(
    model="anthropic.claude-sonnet-4-6-20251022",
    tools=[...],
    conversation_manager=SummarizingConversationManager(
        recent_turns=10,
        summary_model="anthropic.claude-haiku-4-5-20251001",  # cheaper for summaries
    ),
)

Note the multi-model routing inside one agent: summary updates run on Haiku, not Sonnet, because summarisation is a Haiku-class task. This is the cascade pattern from Part 2 applied at the conversation-management layer. The savings on a long-running agent are material — every turn would otherwise pay Sonnet rates just to maintain context.

The gate for irreversible actions

The hardest agent-deployment problem is the destructive-action gate: how does the agent stop and ask a human before doing something irreversible? Strands' event.interrupt() is the primitive.

from strands.hooks import BeforeToolCallEvent

def gate_destructive(event: BeforeToolCallEvent):
    if event.tool_name in ("delete_record", "send_payment", "deploy_to_prod"):
        approval = event.interrupt({
            "type": "approval_required",
            "tool": event.tool_name,
            "args": event.tool_args,
            "reason": "Destructive action — requires human approval before execution.",
        })
        if not approval.get("approved"):
            return {"redirect": "Human declined the action. Stop and report to the user."}

agent = Agent(
    model="anthropic.claude-sonnet-4-6-20251022",
    tools=[..., delete_record, send_payment, deploy_to_prod],
    hooks=[gate_destructive],
)

The interrupt produces a callable hold-point. The agent pauses, the operator gets a notification (Slack, PagerDuty, email, in-app modal — your choice), the operator approves or rejects, the agent resumes. For the case study in Part 8, this is the gate that turns an SRE AI agent from interesting demo into ship in production. Currently shipping in Python; the TypeScript SDK has it on the published roadmap.

Where Strands actually runs

The deployment story orders from most AWS-native to least. AgentCore Runtime is AWS's managed long-running agent runtime — multi-hour task execution, session isolation, built-in memory, MCP gateway, browser tool, code interpreter. Right for production agents that need to do real work over time. Pricing is execution-time-based, not request-based. AWS Lambda suits short-lived stateless agent invocations — right for request-response agents where a single turn is the whole job, with the 15-minute Lambda limit constraining multi-turn agents that should go to AgentCore instead. AWS Fargate sits in the middle for sustained agent processes that need a longer runtime than Lambda but do not need AgentCore's specific feature set — good for scheduled agents, batch processors, agent-as-microservice. Amazon EKS is for teams already on Kubernetes who want the agent under the same control plane as everything else, with the usual kubectl, helm, argocd flow. And Docker anywhere works because Strands ships as a portable Python or TypeScript package the host platform does not need to know about. Terraform modules cover all of the above so deployment is infrastructure-as-code from day zero.

For most cmdev work, AgentCore for long-running production, Lambda for stateless request-response, Fargate for sustained scheduled work is the deployment triad that covers the cases. The choice is per-agent, not per-organisation.

When LangChain or LlamaIndex still wins

Strands sitting in the centre column does not put LangChain and LlamaIndex off the table. Four cases make them the right answer.

Cross-cloud portability as a hard requirement: a workload that has to run on AWS, Azure, and GCP with the same agent code wants LangChain's model-provider abstraction, because Strands is Bedrock-coupled by default and portable only with effort. Existing LangChain or LlamaIndex investment: migration across an existing codebase is non-trivial, and the pragmatic move is often to keep the framework and use Bedrock as the model provider behind it. Non-Bedrock model access: direct connection to a specific OpenAI, xAI, or self-hosted model the Bedrock catalogue does not include. Heavy ecosystem use: specific LangChain plugins, document loaders, or retrievers that Strands' newer, smaller ecosystem has not yet matched — LangChain's age is sometimes its advantage.

A working LangChain-on-Bedrock pattern:

from langchain_aws import ChatBedrockConverse
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate

llm = ChatBedrockConverse(
    model="anthropic.claude-sonnet-4-6-20251022",
    region_name="us-east-1",
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a customer support agent..."),
    ("placeholder", "{chat_history}"),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools)
executor.invoke({"input": "Where's my order #12345?"})

That works. It runs Claude on Bedrock under the LangChain orchestrator. It is the right move when LangChain is the existing investment. The trade-off is that the hooks, steering, and multi-agent patterns Strands ships out of the box become hand-rolled LangChain callbacks and chain compositions — more code, more maintenance, more bug surface.

Routing across model tiers, regardless of framework

Whichever framework you pick, the multi-model routing rule applies: cheap models for narrow tasks, Claude for reasoning, route by query complexity. In Strands:

def classify_complexity(query: str) -> str:
    """Use Haiku to decide whether a query needs Sonnet or Haiku."""
    classifier = Agent(
        model="anthropic.claude-haiku-4-5-20251001",
        instruction="Classify the query complexity: 'simple' or 'complex'. Reply with one word.",
    )
    return classifier(query).output.strip().lower()

def route_agent(query: str):
    complexity = classify_complexity(query)
    model_id = {
        "simple": "anthropic.claude-haiku-4-5-20251001",
        "complex": "anthropic.claude-sonnet-4-6-20251022",
    }.get(complexity, "anthropic.claude-sonnet-4-6-20251022")

    agent = Agent(model=model_id, tools=[...])
    return agent(query)

Average per-query cost drops materially on workloads with a heavy long tail of simple queries. The classifier itself costs a Haiku call; the savings on routed-to-Haiku queries pay for it many times over. This is the cascade pattern that Part 7 goes deeper into.

What production actually looks like

Five things show up in production agent deployments regardless of framework, and each one is a lesson cheaper to learn before launch than after.

Cold starts on Lambda hurt. A 2-3 second cold start on a Lambda-hosted agent is unacceptable for conversational latency. Provisioned concurrency or moving to AgentCore Runtime or Fargate are the answers. Retries need to be intelligent — a failed tool call retried naively often fails the same way. Strands' hooks system lets you implement smart retries with backoff and context modification; LangChain requires custom callback work. Token usage is the hidden cost: hooks, steering, multi-agent calls, summarisation each add tokens, and instrumenting every model call with usage tracking from day zero is materially cheaper than debugging cost spikes after the fact. The eval harness is not optional — whatever the framework, you need a golden set of inputs with expected outputs (or behaviours) and a regression test that runs on every change, because framework upgrades break production silently otherwise. And production agents drift: the model changes through catalogue updates, the corpus changes through KB ingestion, the tools change through Lambda updates. Drift detection — comparing agent behaviour today against the harness baseline — is what catches the change-induced regression before users do.

If your next agent has to defend a destructive action to a regulator, which path lets you show the hook that paused it?

FAQs

How do I decide between Bedrock Agents and Strands?

If the task surface is well-defined, customization is modest, and AWS-native deployment is fine, Bedrock Agents is the right answer — managed loop, automatic audit trail, fast to ship. If you need to intervene mid-turn, insert custom validation between tool call and reasoning, run evaluation harnesses against the reasoning trace, or implement steering and human-in-the-loop, Strands is the centre column.

What's the difference between a guardrail and a steering handler?

A guardrail says "no" — it blocks the action. A steering handler says "no, do it this way instead" — it redirects the model with a corrective message that becomes part of the reasoning trace. The agent reads the steering message, corrects, and proceeds. Strands' numbers show steering-equipped agents recover from induced errors at a meaningfully higher rate than prompt-only or hard-coded-workflow approaches.

Where should I deploy a Strands agent in production?

AgentCore Runtime for long-running production agents that need multi-hour execution, session isolation, and built-in memory. Lambda for stateless request/response agents that fit in 15 minutes. Fargate for sustained scheduled work. EKS if you're already on Kubernetes. The decision is per-agent, not per-organization.

Is LangChain dead on Bedrock?

No. LangChain remains the right choice for cross-cloud portability, existing team investment, non-Bedrock model access, and heavy ecosystem use. The trade-off is that Strands' built-in hooks, steering, and multi-agent patterns become hand-rolled callback and chain composition code — more maintenance, more bug surface. The pragmatic move is often to keep LangChain and use Bedrock as the model provider behind it.

How do I gate destructive actions like deletes or payments?

Strands' event.interrupt() primitive — invoked from a hook before the tool call — pauses the agent, sends a notification to the operator (Slack, PagerDuty, in-app), waits for approval or rejection, then either resumes or feeds a redirect message back. This is the gate that turns an interesting demo into a production-shippable agent for regulated, irreversible operations.

What's next

Part 4 picks up the model layer: when prompt engineering and RAG hit their limit, when fine-tuning makes sense, and the combined-model pattern that pairs pinned Claude for the hard reasoning step with a custom-tuned smaller model for the narrow recurring task.

The full series:


Reference: strandsagents.com for the canonical Strands documentation. The Hardening-before-AWS and AWS-for-banks series provide the security and identity substrate; this AI series builds the workload on top.

amazon-bedrockstrands-agentsagentcorelangchainllamaindexai-agentsopen-sourceawsclaudeagent-frameworks

Ready to strengthen your security posture?

We help organizations across Africa build resilient infrastructure, deploy AI at scale, and navigate complex regulatory environments.

Start a conversation