Multi-step AI Workflows with Step Functions and Bedrock

Series · Amazon Bedrock for Production AI · Part 5 of 8

Key takeaways

A Bedrock Agent handles one conversational turn well; Step Functions handles known multi-step shapes — document pipelines, parallel multi-document synthesis, conditional branching, long-running research — where the structure benefits from being declared, not discovered each turn.
The Bedrock task type, available since late 2023, lets Step Functions invoke models directly without Lambda wrappers: no cold start, no Lambda concurrency cap, native per-step cost-tag attribution.
The multi-model routing rule applies inside the state machine: Haiku for extraction and per-item Map steps, Sonnet for reasoning and synthesis, Opus only for high-stakes final outputs. A pipeline run entirely on Sonnet costs roughly 10× the routed version.
Use Standard workflows for long-running or audit-relevant work (NDPA/GDPR/NIS2); use Express for high-frequency short workloads. The per-execution history that Standard workflows record is what auditors actually want to see.
Retry policy must be tuned per Bedrock error class — exponential backoff on throttling, no retry on validation errors, catch quota-exceeded into a fallback-model branch so the workflow still completes.

Where the conversation stops being the right shape

A Bedrock Agent — from Part 1 — handles a single conversational turn with tool calls. That covers a large fraction of production AI work. It does not cover the cases where the work has a known shape that benefits from being expressed as shape, rather than rediscovered by the agent at every turn.

An invoice arrives. The team extracts fields, enriches vendor data from a master record, classifies the invoice as capex, opex, travel, or COGS, and routes it to the right approver. Each step is well-defined; the variation lives in the document, not in the workflow. Or imagine pulling ten regulatory PDFs at once, summarising each in parallel, comparing the summaries, and producing a consolidated view: the parallel-then-merge shape does not fit cleanly inside a single agent loop. Or a long-running research task that decomposes a question, spawns sub-tasks, runs web searches, database queries, and model summarisations in parallel, then reassembles, quality-checks, and re-runs failed branches. Or a conditional pipeline: an English document routes to model A, a German one to model B, and low confidence escalates to human review. Conditional branching is awkward inside an agent's reasoning trace; it is native to a state machine.

For these shapes, AWS Step Functions is the orchestration layer. The Bedrock task type (introduced in late 2023) gives Step Functions direct integration with model invocations, with no Lambda wrapping required. Combined with Step Functions' coverage of more than 9,000 AWS API actions, the result is a workflow language where AI calls are first-class citizens alongside every other AWS service. The state machine is the score; the agent and the model and the database call are the instruments.

Which orchestration shape fits the work

The framing is not which is better. Both are right for different workload shapes. The framework:

Decision tree for Step Functions vs Bedrock Agents. Q1: is the workflow shape known in advance? If no (dynamic or conversational), use a Bedrock Agent (managed loop) or AgentCore + Strands. If yes, Q2: does the work span multiple AWS services and conditional branches? If yes, use Step Functions with a Bedrock task; if no, and the shape is a simple chain or single call, use Lambda calling Bedrock Converse directly. Within Step Functions, long-running, high-volume parallel fan-out goes to Standard, and short-running event-driven work goes to Express.
Figure 1: Match orchestration shape to workload shape. They compose, they don't compete.

The matrix in one table:

Workload shape	Right answer
Single-turn conversational reasoning with tool use	Bedrock Agent (or AgentCore + Strands)
Long-running agent with multi-hour task, session memory, autonomous browsing	AgentCore Runtime + Strands
Known multi-step pipeline with conditional branches	Step Functions Standard + Bedrock task
High-frequency short pipelines (sub-5-min, event-driven)	Step Functions Express + Bedrock task
Single Bedrock invocation, no orchestration needed	Lambda → Bedrock Converse
Combination: agent inside a larger pipeline	Step Functions calling Bedrock Agent as one step

The last row matters. Step Functions and Bedrock Agents compose. A Step Functions workflow can invoke a Bedrock Agent as one task in a larger state machine. The Agent handles the conversational complexity of one turn; the state machine handles the cross-step orchestration around it. It is the difference between hiring a translator for a meeting and asking the translator to also run the agenda.
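The matrix can also be read as a small decision function. The sketch below is illustrative only: the function name, its parameters, and the returned labels are hypothetical shorthand for the rows of the table, not AWS identifiers.

```python
def pick_orchestrator(shape_known: bool,
                      multi_service_or_branching: bool,
                      long_running_agent: bool = False) -> str:
    """Map a workload's shape to an orchestration layer from the matrix."""
    if not shape_known:
        # Dynamic or conversational work: let an agent discover the steps.
        if long_running_agent:
            return "AgentCore Runtime + Strands"
        return "Bedrock Agent"
    if multi_service_or_branching:
        # Known shape with branches, or several AWS services involved.
        return "Step Functions + Bedrock task"
    # Known shape, but only a simple chain or a single call.
    return "Lambda -> Bedrock Converse"


print(pick_orchestrator(shape_known=True, multi_service_or_branching=True))
```

The composition row does not appear as a branch because it is not a choice between layers: a state machine chosen by the second branch can still call an agent chosen by the first.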

A document workflow, in five primitives

A reference document-processing workflow recurs across cmdev engagements:

Step Functions document processing workflow: an S3 PutObject event triggers an EventBridge rule that starts a Step Functions Express workflow. State 1 (Extract) uses a Bedrock task with Claude Haiku for structured extraction. State 2 (Choice) branches on document type; invoices fan out to vendor lookup (DynamoDB), category classification (Sonnet), and risk scoring (Haiku). State 3 (Map) enriches and validates each line item. State 4 (Choice) routes low-confidence items to an SQS human-review queue and high-confidence items to auto-approve. State 5 (Write) persists to DynamoDB and archives to S3. CloudWatch metrics are emitted per state, X-Ray traces span the workflow, and each Bedrock task is cost-tagged.
Figure 2: Multi-model routing per state. Haiku where structure is clear, Sonnet where reasoning depth matters.

Five state-machine primitives recur in AI workflows. The Task with Bedrock integration makes a direct call to bedrock:InvokeModel or bedrock:Converse without Lambda — the state-machine definition holds the model ID, the prompt template with state-substituted variables, and the inference parameters. The Choice state does conditional branching on the model's output, usually a structured JSON the workflow inspects with a JSONPath expression — far cleaner than asking the agent to decide implicitly. The Parallel state fans out to multiple branches at once, each branch its own Bedrock invocation, SQL query, or API call, re-converging when all complete. The Map state applies the same workflow to each item in a list — process ten line items by running the same enrichment pipeline ten times in parallel, with configurable concurrency, tolerance, and per-item retries. And Catch and Retry handles errors with Bedrock-specific awareness — rate limits, model unavailable, context-too-long all behave differently, and the retry policy reflects that.
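To make the Choice primitive concrete, here is a minimal sketch that builds a Choice state as a Python dict and prints it as ASL. The helper name `choice_state`, the `$.extraction.documentType` field, and the target state names are assumptions made up for illustration; only the `Type`, `Choices`, `Variable`, `StringEquals`, `Next`, and `Default` keys come from the Amazon States Language.

```python
import json


def choice_state(variable: str, routes: dict[str, str], default: str) -> dict:
    """Build an ASL Choice state that branches on one string field."""
    return {
        "Type": "Choice",
        "Choices": [
            {"Variable": variable, "StringEquals": value, "Next": target}
            for value, target in routes.items()
        ],
        # Anything the model returns that matches no rule lands here.
        "Default": default,
    }


state = choice_state(
    variable="$.extraction.documentType",  # hypothetical field in the model output
    routes={"invoice": "EnrichInvoice", "contract": "ReviewContract"},
    default="HumanReview",
)
print(json.dumps(state, indent=2))
```

Keeping a `Default` branch matters with model output in particular, since an unexpected label would otherwise fail the execution rather than route it to review.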

The Bedrock task, without the Lambda wrapper

The Bedrock integration with Step Functions, launched in late 2023, removed the need for Lambda wrappers around every model call. A Bedrock task in ASL:

{
  "ExtractInvoiceFields": {
    "Type": "Task",
    "Resource": "arn:aws:states:::bedrock:invokeModel",
    "Parameters": {
      "ModelId": "anthropic.claude-haiku-4-5-20251001",
      "Body": {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2000,
        "messages": [
          {
            "role": "user",
            "content.$": "States.Format('Extract invoice fields from: {}', $.documentText)"
          }
        ]
      }
    },
    "ResultPath": "$.extraction",
    "Next": "ClassifyByType"
  }
}

Three things this gives over the Lambda-wrapper pattern. No Lambda cold start on the model call — the Bedrock task is a synchronous state-machine action. No Lambda concurrency limits — the Bedrock task scales with the underlying model's throughput, not with the account's Lambda concurrency cap. And native cost-tag attribution — every Bedrock task in the workflow can be tagged independently, so the cost dashboard from Part 7 attributes spend per workflow step rather than per Lambda.

The trade-off: the state-machine definition holds the prompt template, which keeps prompts in version control but makes A/B testing slightly harder than a Lambda-managed prompt. For most production workflows, the version-control benefit outweighs the testability cost.

Routing model tiers across states

The Claude-first multi-model rule from earlier parts applies cleanly inside a state machine. The state machine itself becomes the routing mechanism — each step picks the model tier its specific task deserves.

A typical document-processing workflow:

State	Model	Why
Extract	Claude Haiku	Structured extraction from a known document type. Haiku is fast, cheap, and accurate enough for well-formed extraction.
Route by type	Choice state (no model)	Branching on the structured output of extraction. Free.
Enrich each line item	Map state with Claude Haiku per item	Per-item enrichment is high-volume and narrow — Haiku is the right tier. Parallelism keeps wall-clock low.
Score risk	Claude Sonnet	Risk scoring requires reasoning over patterns, anomalies, vendor history. Sonnet's reasoning depth justifies the cost.
Generate audit narrative	Claude Sonnet	Natural language synthesis for the audit trail.
Final human-readable summary	Claude Opus only if the document is high-value or contested	Opus is overkill for routine documents; right for legal or regulatory documents where the synthesis stakes are high.

A workflow running ten thousand invoices a day through this pipeline at full Sonnet rates costs roughly ten times what the same workflow costs with the routing above. The savings come from putting Haiku on the easy steps; the quality is preserved by putting Sonnet on the steps where reasoning depth matters. It is the same principle as hiring a junior to read forms and a senior to interpret contracts — both at their right billable rate.
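The comparison is simple arithmetic once per-step token counts are known. The sketch below shows the calculation only: the tier names, the prices, and the token counts are placeholders invented for illustration, not published Bedrock prices, so the ratio it prints says nothing about any real bill.

```python
# Placeholder prices in USD per million tokens: (input, output). Not real prices.
PRICES = {"small": (1.0, 4.0), "large": (4.0, 16.0)}

# (step name, routed tier, input tokens, output tokens) per document. Illustrative.
STEPS = [
    ("extract", "small", 3000, 500),
    ("enrich_items", "small", 4000, 800),
    ("score_risk", "large", 2000, 400),
    ("audit_narrative", "large", 1500, 600),
]


def pipeline_cost(steps, force_tier=None) -> float:
    """Cost of one document; force_tier prices every step at a single tier."""
    total = 0.0
    for _name, tier, tokens_in, tokens_out in steps:
        price_in, price_out = PRICES[force_tier or tier]
        total += (tokens_in * price_in + tokens_out * price_out) / 1_000_000
    return total


routed = pipeline_cost(STEPS)
all_large = pipeline_cost(STEPS, force_tier="large")
print(f"routed={routed:.6f} all_large={all_large:.6f} ratio={all_large / routed:.2f}")
```

The ratio depends on two inputs: the price gap between tiers and the share of tokens that flow through the cheap steps. Plugging in current prices and measured token counts is the only way to get a number worth quoting.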

The router pattern from Part 3 generalises here: a Haiku-tier state at the start of the workflow can decide which subsequent path the document takes, and the state machine branches on the answer.

Standard or Express, picked by duration

Step Functions has two execution modes. Standard Workflows run up to one year and record a full execution history, which Step Functions retains for 90 days after an execution ends (export it, or enable logging, when an audit needs longer retention). They are priced per state transition at about $0.025 per thousand, which suits long-running workflows, anything needing durable retry semantics, and anything with a regulatory audit requirement. Express Workflows run up to five minutes, are priced by request count, execution duration, and memory (much cheaper at scale), and expose history through CloudWatch Logs only, which suits high-volume short workflows like per-event processing or real-time enrichment.

The choice tracks the workload's time profile. Real-time document classification on file upload (30-90 seconds, high volume) is Express. Multi-document research synthesis (several minutes, lower volume, audit-relevant) is Standard. Asynchronous corpus processing (hours, parallel, fully audited) is Standard. Per-message content moderation (sub-second, very high volume) is Express. For anything that touches personal data, evidence-grade audit, or regulated workloads, Standard is the right answer regardless of duration: the step-by-step execution history it records is what NDPA, GDPR, and NIS2 auditors actually want to see.
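The selection rule and the Standard pricing quoted above fit in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, the five-minute cap and the per-transition price are taken from the text, and the transition count per execution is something each team has to measure for its own state machine.

```python
EXPRESS_MAX_SECONDS = 5 * 60                # Express duration cap from the text
STANDARD_USD_PER_TRANSITION = 0.025 / 1000  # "about $0.025 per thousand"


def pick_workflow_type(max_duration_s: float, audit_relevant: bool) -> str:
    """Audit-relevant or long-running work goes to Standard; the rest to Express."""
    if audit_relevant or max_duration_s > EXPRESS_MAX_SECONDS:
        return "STANDARD"
    return "EXPRESS"


def standard_daily_cost(executions_per_day: int, transitions_per_execution: int) -> float:
    """State-transition charge only; model invocation cost is billed separately."""
    return executions_per_day * transitions_per_execution * STANDARD_USD_PER_TRANSITION


print(pick_workflow_type(max_duration_s=90, audit_relevant=False))
print(standard_daily_cost(executions_per_day=10_000, transitions_per_execution=12))
```

Retries and Map iterations add transitions, so the per-execution count is usually higher than the number of states drawn on the diagram.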

Errors that need different treatment

Step Functions' Retry and Catch blocks need tuning for the failure modes Bedrock workloads actually produce:

{
  "Retry": [
    {
      "ErrorEquals": ["Bedrock.ThrottlingException"],
      "IntervalSeconds": 2,
      "BackoffRate": 2.0,
      "MaxAttempts": 4,
      "JitterStrategy": "FULL"
    },
    {
      "ErrorEquals": ["Bedrock.ModelTimeoutException"],
      "IntervalSeconds": 5,
      "BackoffRate": 1.5,
      "MaxAttempts": 2
    },
    {
      "ErrorEquals": ["Bedrock.ValidationException"],
      "MaxAttempts": 0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["Bedrock.ServiceQuotaExceededException"],
      "ResultPath": "$.error",
      "Next": "RouteToFallbackModel"
    },
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.error",
      "Next": "DeadLetterQueue"
    }
  ]
}

Two operational rules sit underneath the policy. Do not retry validation errors. A bad prompt template or malformed input will not fix itself; the retry just burns tokens. And catch quota-exceeded into a fallback path. When the primary model (Sonnet) hits a quota, route to a fallback — Haiku, or a different region's Sonnet. The workflow completes. The audit trail records the fallback for cost and quality review.
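It helps to see what a retrier's numbers mean in wall-clock terms before committing to them. The sketch below computes the wait before each retry from `IntervalSeconds`, `BackoffRate`, and `MaxAttempts`. The function name is hypothetical, and the jitter branch models full jitter as a uniform draw between zero and the un-jittered delay, which is the usual meaning of the term; the exact distribution Step Functions applies is an assumption here.

```python
import random


def retry_delays(interval_seconds: float, backoff_rate: float, max_attempts: int,
                 jitter: bool = False, rng=random.random) -> list[float]:
    """Seconds waited before each retry attempt, in order."""
    delays = []
    for attempt in range(max_attempts):
        ceiling = interval_seconds * (backoff_rate ** attempt)
        # With jitter, pick a random point below the computed delay.
        delays.append(rng() * ceiling if jitter else ceiling)
    return delays


# Throttling retrier from the policy above: waits of 2, 4, 8, then 16 seconds.
print(retry_delays(2, 2.0, 4))
# Timeout retrier: waits of 5, then 7.5 seconds.
print(retry_delays(5, 1.5, 2))
```

Without jitter the throttling retrier can add up to 30 seconds to a state, which matters when the workflow runs in Express mode with its five-minute ceiling.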

Where Lambda still earns its keep

The Bedrock task removed Lambda as a wrapper for most Bedrock calls. Lambda still has a role in three places. Custom tool implementations for Bedrock Agents — Lambda is the action-group execution surface, covered in Part 1. Pre-processing or post-processing that does not map to a Step Functions intrinsic — complex data shaping, cryptographic operations, external API calls with custom auth. And OpenAPI schema generation for action groups — Powertools for AWS Lambda includes a Bedrock Agent helper that generates the OpenAPI schema directly from typed Python function signatures.

A working pattern with Powertools:

from aws_lambda_powertools.event_handler import BedrockAgentResolver
from aws_lambda_powertools.event_handler.openapi.params import Query

app = BedrockAgentResolver()

@app.get("/cloudwatch_logs", description="Query CloudWatch Logs Insights for a log group")
def query_cloudwatch(
    log_group: str = Query(description="Log group name, e.g. /aws/lambda/api-handler"),
    query: str = Query(description="CloudWatch Insights query string"),
    minutes_back: int = Query(default=60, description="Time window in minutes"),
) -> dict:
    """Returns the parsed query results."""
    # actual implementation calls boto3 CloudWatch Logs
    return {"results": [...]}

def lambda_handler(event, context):
    return app.resolve(event, context)

Powertools generates the OpenAPI schema the Bedrock Agent reads, handles input validation, emits structured logs and X-Ray traces, and provides typed responses. It eliminates roughly half the boilerplate of writing an action-group Lambda by hand — the kind of saving that becomes visible the moment a team writes its third action group.
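The handler above leaves the CloudWatch call as a comment. One way to fill it in, sketched under assumptions: the helper name `run_insights_query` and its polling parameters are hypothetical, and the client is passed in so the function can be exercised without AWS credentials. The `start_query` and `get_query_results` calls are the boto3 CloudWatch Logs operations.

```python
import time


def run_insights_query(logs_client, log_group: str, query: str,
                       minutes_back: int = 60,
                       poll_seconds: float = 1.0, max_polls: int = 30) -> list[dict]:
    """Run a Logs Insights query and return each row as a plain dict."""
    end = int(time.time())
    start = end - minutes_back * 60
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=query,
    )["queryId"]

    # Insights queries are asynchronous, so poll until a terminal status.
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row arrives as a list of {"field": ..., "value": ...} pairs.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not complete within the polling budget")
```

Inside the Lambda, the client would be `boto3.client("logs")`, created once at module scope. The polling budget should stay below the function timeout so the agent receives an error it can reason about rather than a dropped invocation.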

A research synthesis pipeline, end to end

A real workflow that combines all of the above. A research-synthesis agent takes a topic, gathers source documents, summarises each in parallel, identifies contradictions across them, and produces a consolidated briefing.

The state machine shape (ASL excerpts):

{
  "Comment": "Research synthesis workflow",
  "StartAt": "DecomposeTopic",
  "States": {
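    "DecomposeTopic": {
      "Type": "Task",
      "Comment": "ResultSelector parses the model's JSON array reply so the Map state has a list to iterate",
      "Resource": "arn:aws:states:::bedrock:invokeModel",
      "Parameters": {
        "ModelId": "anthropic.claude-sonnet-4-6-20251022",
        "Body": {
          "anthropic_version": "bedrock-2023-05-31",
          "max_tokens": 1500,
          "messages": [{
            "role": "user",
            "content.$": "States.Format('Decompose this research question into 5-8 sub-queries. Reply with only a JSON array of objects, each with a single subQuery string field. Question: {}', $.topic)"
          }]
        }
      },
      "ResultSelector": {
        "items.$": "States.StringToJson($.Body.content[0].text)"
      },
      "ResultPath": "$.subQueries",
      "Next": "FanOutResearch"
    },
    "FanOutResearch": {
      "Type": "Map",
      "ItemsPath": "$.subQueries.items",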
      "MaxConcurrency": 8,
      "Iterator": {
        "StartAt": "QueryKnowledgeBase",
        "States": {
          "QueryKnowledgeBase": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:bedrockagentruntime:retrieveAndGenerate",
            "Parameters": {
              "Input": { "Text.$": "$.subQuery" },
              "RetrieveAndGenerateConfiguration": {
                "Type": "KNOWLEDGE_BASE",
                "KnowledgeBaseConfiguration": {
                  "KnowledgeBaseId": "KB-research-corpus-001",
                  "ModelArn": "anthropic.claude-haiku-4-5-20251001"
                }
              }
            },
            "End": true
          }
        }
      },
      "ResultPath": "$.researchResults",
      "Next": "IdentifyContradictions"
    },
    "IdentifyContradictions": {
      "Type": "Task",
      "Resource": "arn:aws:states:::bedrock:invokeModel",
      "Parameters": {
        "ModelId": "anthropic.claude-sonnet-4-6-20251022",
        "Body": {
          "anthropic_version": "bedrock-2023-05-31",
          "max_tokens": 3000,
          "messages": [{
            "role": "user",
            "content.$": "States.Format('Identify contradictions across these summaries: {}', States.JsonToString($.researchResults))"
          }]
        }
      },
      "ResultPath": "$.contradictions",
      "Next": "SynthesiseBriefing"
    },
    "SynthesiseBriefing": {
      "Type": "Task",
      "Resource": "arn:aws:states:::bedrock:invokeModel",
      "Parameters": {
        "ModelId": "anthropic.claude-sonnet-4-6-20251022",
        "Body": {
          "anthropic_version": "bedrock-2023-05-31",
          "max_tokens": 4000,
          "messages": [{
            "role": "user",
            "content.$": "..."
          }]
        }
      },
      "ResultPath": "$.briefing",
      "Next": "DeliverBriefing"
    },
    "DeliverBriefing": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:s3:putObject",
      "Parameters": {
        "Bucket": "research-briefings-prod",
        "Key.$": "States.Format('briefings/{}.md', $.briefingId)",
        "Body.$": "$.briefing.Body.content[0].text"
      },
      "End": true
    }
  }
}

What the workflow demonstrates is the routing rule made literal. Decomposition runs on Sonnet because it is a reasoning step. Parallel research uses a Map state with MaxConcurrency: 8 — eight sub-queries run simultaneously against the Knowledge Base, each through Haiku, because per-query reasoning is light and parallelism handles the volume. Contradiction analysis runs on Sonnet, where genuine reasoning across the parallel outputs lives. Final synthesis runs on Sonnet too, with Opus available as a single-line escalation for high-stakes briefings. Delivery writes directly to S3 — no Lambda required.

End-to-end latency for a typical research run lands at 90 to 180 seconds. Cost is dominated by the synthesis step, with the parallel Map branches contributing modestly because Haiku is cheap.
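The state machine reads two fields from its input: `$.topic` in the first state and `$.briefingId` in the last. A minimal sketch of starting a run with that input follows. The helper name and the `briefing-` execution-name prefix are assumptions; `start_execution` is the boto3 Step Functions call, and the client is injected so the function can be tried against a stub.

```python
import json
import uuid


def start_research_run(sfn_client, state_machine_arn: str, topic: str) -> str:
    """Start one synthesis run and return the briefing ID used as the S3 key."""
    briefing_id = str(uuid.uuid4())
    sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        # A unique name per run makes the execution easy to find in the console.
        name=f"briefing-{briefing_id}",
        input=json.dumps({"topic": topic, "briefingId": briefing_id}),
    )
    return briefing_id
```

In real use the client would be `boto3.client("stepfunctions")`. Returning the ID gives the caller the key under which the finished briefing will appear.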

Five things that bite in production

Five frictions show up in Step Functions and Bedrock workflows once the team scales beyond the pilot.

State input/output size limits come first. Step Functions imposes a 256 KB payload limit per state. Long retrieval-augmented outputs blow through it fast. The fix is to store intermediate results in S3 and pass S3 keys through the workflow rather than the contents themselves.
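A sketch of that pattern, often called a claim check, as it might appear in a Lambda step that produces a large result. The helper name, the `intermediate/` key prefix, and the half-limit threshold are assumptions chosen for illustration; `put_object` is the boto3 S3 call, and the client is passed in so the logic can be tested with a stub.

```python
import json
import uuid

MAX_STATE_PAYLOAD_BYTES = 256 * 1024  # per-state limit cited above


def offload_if_large(s3_client, bucket: str, payload,
                     threshold: int = MAX_STATE_PAYLOAD_BYTES // 2) -> dict:
    """Return the payload inline when small, otherwise an S3 pointer to it."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) <= threshold:
        return {"inline": payload}
    key = f"intermediate/{uuid.uuid4()}.json"
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    # Downstream states receive only the pointer and fetch the content on demand.
    return {"s3": {"bucket": bucket, "key": key, "bytes": len(body)}}
```

The threshold sits well below the hard limit because the state's output is merged with whatever else is already in the execution's state, and the limit applies to the combined document.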

Map state concurrency tuning is the next. MaxConcurrency too high overwhelms Bedrock quotas; too low drags the workflow. Tune per region and per model — there is no global answer.
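A starting value can be estimated from the model's request quota and observed latency before tuning by measurement. The sketch below is a rule of thumb, not an AWS formula: it applies Little's law (requests in flight equal arrival rate times latency) and leaves headroom for other callers. The function name, the 0.7 headroom factor, and the example numbers are all assumptions.

```python
import math


def suggested_max_concurrency(requests_per_minute: int,
                              avg_latency_seconds: float,
                              headroom: float = 0.7) -> int:
    """Estimate a Map MaxConcurrency that stays under a per-minute request quota."""
    # Little's law: sustainable in-flight requests = rate (per second) x latency.
    in_flight = (requests_per_minute / 60.0) * avg_latency_seconds
    return max(1, math.floor(in_flight * headroom))


# Hypothetical quota of 120 requests/minute with 4-second average latency.
print(suggested_max_concurrency(120, 4.0))
```

The estimate ignores token-per-minute quotas, which can bind earlier than request quotas for long prompts, so throttling metrics remain the real guide.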

Prompt template hygiene is the easiest to neglect. Prompts in ASL are harder to lint than prompts in Python. Treat them like SQL: put them in version-controlled files, render with States.Format, never inline complex prompts directly in JSON. The future-you maintaining the workflow will be grateful.
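One way to keep prompts in files yet still end up with valid ASL is a small build step that escapes the prompt text and wraps it in `States.Format`. The sketch below assumes each prompt file holds an instruction that is followed by a single piece of state data; the helper name is hypothetical. The escaping follows the States Language rule that a single quote, a brace, or a backslash inside an intrinsic-function string must be preceded by a backslash.

```python
def to_states_format(prompt_text: str, input_path: str) -> str:
    """Wrap prompt text in a States.Format call that appends one state value."""
    escaped = prompt_text
    # Escape the backslash first so later escapes are not doubled.
    for ch in ("\\", "'", "{", "}"):
        escaped = escaped.replace(ch, "\\" + ch)
    return f"States.Format('{escaped} {{}}', {input_path})"


print(to_states_format("Extract invoice fields from:", "$.documentText"))
# A prompt containing quotes and braces is escaped rather than breaking the template.
print(to_states_format("Don't use {braces}", "$.x"))
```

The returned string becomes the value of a `content.$` field, and serialising the whole definition with `json.dumps` takes care of JSON-level escaping such as newlines.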

Workflow versioning bites unexpectedly. A state machine change can break in-flight executions. Use the state-machine versioning and aliases pattern, and cut over with traffic shifting rather than a hard swap.
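Traffic shifting on an alias comes down to a routing configuration that splits weight between two published versions. The sketch below builds that configuration; the helper name is hypothetical, while the `stateMachineVersionArn` and `weight` keys are the fields the alias routing configuration expects.

```python
def canary_routing(current_version_arn: str, new_version_arn: str,
                   new_weight: int) -> list[dict]:
    """Routing configuration sending new_weight percent of starts to the new version."""
    if not 0 <= new_weight <= 100:
        raise ValueError("new_weight must be between 0 and 100")
    if new_weight == 0:
        return [{"stateMachineVersionArn": current_version_arn, "weight": 100}]
    if new_weight == 100:
        return [{"stateMachineVersionArn": new_version_arn, "weight": 100}]
    return [
        {"stateMachineVersionArn": current_version_arn, "weight": 100 - new_weight},
        {"stateMachineVersionArn": new_version_arn, "weight": new_weight},
    ]


# Hypothetical ARNs: send 10% of new executions to version 8.
print(canary_routing("arn:...:stateMachine:docs:7", "arn:...:stateMachine:docs:8", 10))
```

The list is what gets passed as `routingConfiguration` to boto3's `update_state_machine_alias`. Executions already in flight keep running on the version they started with, which is the property that makes the cut-over safe.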

Cost visibility per workflow is the last and most expensive omission. Tag each Bedrock task independently. Without tags, the bill shows Bedrock — $X and the team has no idea which workflow generated it.

When to step back from Step Functions

The counterpart to the matrix above is when an Agent or a single Lambda is the right answer instead. If the workflow is genuinely conversational (multi-turn dialogue where the next step depends on what the user says next), stay with a Bedrock Agent or AgentCore plus Strands. If it is a single model call with light pre- or post-processing, use Lambda calling Bedrock Converse directly; Step Functions adds overhead the workload does not need. If it is short, high-frequency, and latency-sensitive at sub-100ms, Lambda or direct API integration wins because Step Functions adds tens of milliseconds even in Express mode. And if the workflow is exploratory, with a shape that is not yet clear, build it as an agent first, let the structure emerge, and refactor into Step Functions when the per-step shape stabilises.

If your next AI workload has a known structure, will you write it as a conversation, or as a score?

FAQs

When should I pick a Bedrock Agent over a Step Functions workflow?

If the work is genuinely conversational — multi-turn dialogue where the next step depends on what the user says next — stay with a Bedrock Agent or AgentCore + Strands. If the workflow shape is known in advance, spans multiple AWS services, and benefits from conditional branches or parallel fan-out, Step Functions is the right layer. They also compose: Step Functions can call a Bedrock Agent as one step in a larger pipeline.

Why use the Bedrock task instead of a Lambda calling Bedrock?

Three reasons: no Lambda cold start on the model call, no Lambda concurrency limits in the way of model throughput, and native cost-tag attribution per state. The trade-off is that prompt templates live in the state-machine definition, which gives you version control by default but makes A/B testing slightly harder.

Standard or Express workflows for AI pipelines?

Standard for anything long-running, audit-relevant, or regulated: the step-by-step execution history it records is what NDPA, GDPR, and NIS2 auditors expect. Express for high-frequency short pipelines like per-event classification or real-time enrichment. Real-time document classification on upload is typically Express; multi-document research synthesis is Standard.

How do I handle Bedrock throttling and quota errors?

Retry throttling with exponential backoff and jitter (4 attempts is a reasonable default). Don't retry validation errors — they won't fix themselves and waste tokens. Catch quota-exceeded into a fallback-model branch (Sonnet quota exhausted → route to Haiku, or to a different region's Sonnet). The audit trail records the fallback for cost and quality review.

What's the limit on data flowing through a workflow?

Step Functions imposes a 256 KB payload limit per state. Long retrieval-augmented outputs blow through it fast. The fix is to store intermediate results in S3 and pass S3 keys through the workflow instead of the contents. This bites hard in production AI pipelines and is worth designing around from the start.

What's next

Part 6 picks up the security and observability layer that wraps every workflow, every agent, and every direct Bedrock invocation: Guardrails policy design, IAM patterns, VPC endpoints, CloudTrail audit, model invocation logging, X-Ray tracing across the multi-step orchestration documented above.

The full series:

Part 1 — Foundations: Building AI Agents on Amazon Bedrock
Part 2 — RAG with Bedrock Knowledge Bases
Part 3 — Open-source Agent Frameworks on Bedrock
Part 4 — Model Customization on Amazon Bedrock
Part 5 — Multi-step AI Workflows with Step Functions and Bedrock (this piece)
Part 6 — Security Guardrails and Observability for Bedrock
Part 7 — Cost Optimization on Bedrock (deepest multi-model routing)
Part 8 — Case Study: An SRE AI Agent on Bedrock for CloudWatch Log Triage

The Amazon Bedrock series. Step Functions composes with everything in the prior pieces — Agents become workflow steps, Knowledge Bases become state-machine retrievals, custom models become routing branches.