
Case Study: An SRE AI Agent on Bedrock for CloudWatch Log Triage

cmdev · 16 min read

Series · Amazon Bedrock for Production AI · Part 8 of 8 (previous: Part 7, Cost Optimization)

Key takeaways

  • The agent triages CloudWatch alarms end-to-end on Strands + AgentCore — Haiku classifies the alarm, Sonnet reasons over logs, Cohere embeddings retrieve runbook context, Lambda action tools execute under hooks and `event.interrupt()` gates for destructive actions.
  • Every destructive action — restart, rollback — passes through human approval via `event.interrupt()`. The agent does the cognitive work; the human stays in the loop for the irreversible decision.
  • Per-incident cost is roughly $0.075 in the investigation case; cascade routing pushes the average well below that because ~80% of alarms resolve as "transient probable" at Haiku. Even if every one of 200 incidents a day needed the full investigation, annual inference cost would be about $5,500.
  • The scenario walked through compresses a 15-25 minute human triage to 6 minutes wall time and 90 seconds of on-call attention — preserving judgement on the destructive call, removing the cognitive tax of reading logs at 2 AM.
  • The tool list is the entire action surface — no production database mutations, no IAM changes, no public-internet calls, no cross-incident coordination. The boundary is intentional; an agent that can do more would do the wrong thing more.

The phone buzzes at 2:04 AM

The on-call engineer reaches for the phone in the dark. The screen shows an alarm name — checkout-api error rate over 5% for 3 minutes — and a link into CloudWatch. They sit up. They open the laptop. They start typing the same Logs Insights query they have typed forty times this year. They scroll. They squint. Twelve minutes later they have either rolled back a deployment or paged a colleague to look with them.

Parts 1-7 of this series documented the architectural surfaces of building production agents on Amazon Bedrock — foundations, RAG, frameworks, customisation, workflows, security, cost. This piece runs the opposite direction. It takes one specific use case and walks the full architecture end-to-end. The premise is the page itself: that 2 AM alarm, and a working SRE AI agent that triages it, identifies the failing component, and either takes a remediation action or escalates to the human. The agent runs on Strands and AgentCore. Claude Sonnet does the reasoning. Haiku handles the routing. Cohere embeddings retrieve runbook context. Lambda action tools execute. Every destructive action passes through event.interrupt().

Every architectural decision traces back to a prior piece in the series. Every cost optimisation from Part 7 is applied. Every Guardrail from Part 6 is wired in. The case study is the integration test.

The tax that erodes the rotation

An SRE team operating a moderately complex microservices workload — say a 50-service e-commerce stack — fields between 5 and 30 pages a week. Most pages are not actual incidents. They are transient errors. Retries that succeeded. Latency spikes that resolved themselves. The on-call human opens CloudWatch, searches the affected service's logs, recognises the pattern, decides.

The triage takes 5-15 minutes of focused attention even on a routine page. At 2 AM, focused attention is the scarcest commodity in the room. The real cost is not the human time — it is the cognitive tax that erodes the rotation's morale and slows the mean-time-to-resolution on the small subset of pages that are real incidents, because the attention was already spent on the false positives.

The right shape of automation here is tier-1 triage that gathers the logs, classifies the page as transient or real, takes a safe automatic action where one applies, and only pages the human for genuine incidents. Not autonomous IT operations — that is the marketing pitch and it is decades from defensible. Tier-1 triage that reliably collapses the human surface to what actually needs judgement. The agent absorbs the reading; the human owns the destruction.

The architecture, end to end

SRE AI agent architecture — EventBridge catches a CloudWatch Alarm; a Step Functions Express workflow runs a Claude Haiku classifier that triages as 'transient probable', 'investigation required', or 'page on-call'; investigation cases invoke a Strands agent on AgentCore Runtime with four action groups (Logs query, deployment history, service restart, on-call escalate) and a Knowledge Base of runbooks; Claude Sonnet reasons, Guardrails block production-destructive-without-approval, event.interrupt() gates restart and rollback; full observability via CloudTrail data events, model invocation logs, CloudWatch metrics, X-Ray traces.
Figure 1 — Alarm to triage to gated action — destructive moves always pause for a human.

An EventBridge rule listens for CloudWatch Alarms entering the ALARM state. The alarm event drops into a Step Functions Express workflow (per Part 5) that completes in under five minutes for the typical case. That matters because five minutes is the hard limit on an Express execution, so a run that waits on a slow human approval may need a Standard workflow or an asynchronous hand-off instead. The first state runs a Claude Haiku classifier that decides whether to close the alarm as transient, investigate further, or page on-call immediately. If the page is worth investigating, the workflow invokes a Strands agent (per Part 3) hosted on AgentCore Runtime, the long-running surface where the agent can hold context across multiple tool calls without burning Lambda concurrency.
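As a rough illustration of the first hop, the sketch below shows the kind of event pattern an EventBridge rule could use to catch alarms entering ALARM. The `source` and `detail-type` strings are the ones CloudWatch emits for alarm state changes; the alarm name and the small local `matches` helper are assumptions added only so the pattern can be exercised without AWS, and the helper covers a fraction of real EventBridge matching semantics.

```python
import json

# Pattern for CloudWatch alarm state changes that land in ALARM.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified local matcher: nested dicts and exact-value lists only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Trimmed sample event; the alarm name is hypothetical.
sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "checkout-api-error-rate",
        "state": {"value": "ALARM"},
        "previousState": {"value": "OK"},
    },
}

print(json.dumps(ALARM_PATTERN, indent=2))
print(matches(ALARM_PATTERN, sample_event))
```

Filtering on the state value at the rule keeps OK and INSUFFICIENT_DATA transitions from ever starting a workflow execution.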

Underneath, Claude Sonnet 4.6 pinned to a specific version does the reasoning, while Haiku handles the routing and classification subtasks. Four Lambda action groups carry their OpenAPI schemas generated by Powertools. A Knowledge Base of company runbooks sits on Cohere Embed v3 over OpenSearch Serverless (per Part 2). Guardrails wrap every model call with denied-topic policy preventing destructive production actions without event.interrupt(), a PII filter, and output grounding. The full observability stack from Part 6 sits underneath, and the cascade routing pattern from Part 7 handles cost per state.
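The routing step can be sketched as a small pure function. The three labels come from the architecture figure; the state names and the fail-safe choice (anything the classifier returns that is not a known label pages on-call) are assumptions for illustration, not the article's implementation.

```python
VALID_LABELS = ("transient probable", "investigation required", "page on-call")
FAIL_SAFE_LABEL = "page on-call"  # assumption: unknown output goes to a human

# Hypothetical names for the next workflow state.
NEXT_STATE = {
    "transient probable": "CloseAsTransient",
    "investigation required": "InvokeSreAgent",
    "page on-call": "PageOnCall",
}


def normalize_label(raw: str) -> str:
    """Tidy the classifier's text output and fall back to the fail-safe label."""
    cleaned = raw.strip().strip("\"'.").strip().lower()
    return cleaned if cleaned in VALID_LABELS else FAIL_SAFE_LABEL


def next_state(raw_classifier_output: str) -> str:
    """Map raw classifier output to the workflow state that should run next."""
    return NEXT_STATE[normalize_label(raw_classifier_output)]


print(next_state('"Investigation required."'))
print(next_state("not sure, maybe transient?"))
```

Defaulting to the human on unparseable output keeps a classifier glitch from silently closing a real incident.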

The agent in code

The Strands agent at the heart of the system:

import time
from datetime import datetime, timedelta

from strands import Agent, tool
from strands.hooks import BeforeToolCallEvent, AfterToolCallEvent
from strands.memory import SlidingWindowConversationManager
import boto3

logs = boto3.client("logs")
ecs = boto3.client("ecs")
codedeploy = boto3.client("codedeploy")

# ---- Action tools ----------------------------------------------------------

@tool
def query_cloudwatch_logs(
    log_group: str,
    query: str,
    minutes_back: int = 60,
) -> dict:
    """Run a CloudWatch Logs Insights query against the named log group.

    Use this to retrieve log events matching a pattern, count errors,
    or identify the time-window of an anomaly. The query syntax is
    standard CloudWatch Logs Insights — `fields @timestamp, @message |
    filter @message like /ERROR/ | sort @timestamp desc | limit 100`.

    Args:
        log_group: full log group name (e.g. /aws/lambda/checkout-api)
        query: CloudWatch Logs Insights query string
        minutes_back: time window from now (default 60 min)
    """
    end_time = int(time.time())
    start_time = end_time - (minutes_back * 60)

    response = logs.start_query(
        logGroupName=log_group,
        startTime=start_time,
        endTime=end_time,
        queryString=query,
    )
    # poll for results — abbreviated here
    return wait_for_query_results(response["queryId"])


@tool
def get_recent_deployments(service_name: str, hours_back: int = 24) -> list[dict]:
    """Return recent deployments for the named service.

    Useful when investigating whether a recent deployment correlates
    with the observed errors — common root cause for sudden degradation.
    """
    response = codedeploy.list_deployments(
        applicationName=service_name,
        createTimeRange={
            "start": datetime.utcnow() - timedelta(hours=hours_back),
            "end": datetime.utcnow(),
        },
    )
    return [_format_deployment(d) for d in response["deployments"]]


@tool
def restart_ecs_service(cluster: str, service: str) -> dict:
    """Restart the named ECS service (forces a new deployment with no task definition change).

    DESTRUCTIVE ACTION: this will terminate running tasks. Requires
    human approval via event.interrupt() before execution.

    Use only when investigation indicates the service is in a degraded
    state recoverable by restart (memory leak, stuck connections, etc.)
    and not when the root cause is a recent deployment (use rollback
    instead).
    """
    return ecs.update_service(
        cluster=cluster,
        service=service,
        forceNewDeployment=True,
    )


@tool
def rollback_deployment(application: str, deployment_id: str) -> dict:
    """Roll back the named application to the previous successful deployment.

    DESTRUCTIVE ACTION: this will revert production code. Requires
    human approval via event.interrupt() before execution.

    Use when the observed errors correlate strongly with the most
    recent deployment.
    """
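    # Note: stop_deployment only affects a deployment that is still in progress.
    # For one that has already succeeded, CodeDeploy rolls back by redeploying
    # the previous revision (create_deployment), which is not shown here.
    return codedeploy.stop_deployment(deploymentId=deployment_id, autoRollbackEnabled=True)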


@tool
def escalate_to_on_call(
    summary: str,
    severity: str,
    suggested_actions: list[str],
) -> dict:
    """Page the on-call engineer with a triage summary.

    Use when investigation indicates a real incident requiring human
    judgement, or when the agent's confidence in autonomous action is low.
    """
    return pagerduty_create_incident(
        summary=summary,
        severity=severity,
        details={"agent_analysis": suggested_actions},
    )


# ---- Hooks for audit -------------------------------------------------------

def audit_tool_call(event: BeforeToolCallEvent):
    """Every tool call lands in CloudTrail-equivalent custom audit log."""
    audit_log({
        "agent": event.agent_id,
        "session": event.session_id,
        "tool": event.tool_name,
        "args": event.tool_args,
        "timestamp": datetime.utcnow().isoformat(),
    })


# ---- Steering: gate destructive actions ------------------------------------

def gate_destructive_actions(event: BeforeToolCallEvent):
    """No destructive action runs without human approval."""
    DESTRUCTIVE = {"restart_ecs_service", "rollback_deployment"}

    if event.tool_name in DESTRUCTIVE:
        approval = event.interrupt({
            "type": "approval_required",
            "tool": event.tool_name,
            "args": event.tool_args,
            "agent_reasoning": event.context.get("reasoning_trace", ""),
            "incident_id": event.context.get("incident_id"),
            "channel": "pagerduty",
            "timeout_seconds": 300,
        })

        if not approval.get("approved"):
            return {"redirect": (
                f"On-call declined the {event.tool_name} action. "
                "Continue investigation and escalate with detailed analysis instead."
            )}


# ---- Agent definition ------------------------------------------------------

SRE_AGENT_INSTRUCTION = """
You are an SRE triage agent. You receive CloudWatch Alarm events and your
job is to determine whether the alarm represents a real incident requiring
remediation or a transient issue that has self-resolved.

Your workflow:
1. Query the affected service's logs over the relevant time window.
2. Check recent deployments — sudden errors after a deployment correlate
   strongly with bad code.
3. Consult the runbook knowledge base for the specific error pattern.
4. Decide on an action:
   - If errors are clearly transient and have stopped: report and close.
   - If errors correlate with a recent deployment: recommend rollback
     (requires approval).
   - If the service is in a stuck state recoverable by restart: recommend
     restart (requires approval).
   - If the cause is unclear or the impact is high: escalate to on-call.

For every action recommendation, cite the specific log events that
support your conclusion. Do not act on assumptions; act on evidence.
Be conservative: when in doubt, escalate.
"""

sre_agent = Agent(
    model="anthropic.claude-sonnet-4-6-20251022",
    instruction=SRE_AGENT_INSTRUCTION,
    tools=[
        query_cloudwatch_logs,
        get_recent_deployments,
        restart_ecs_service,
        rollback_deployment,
        escalate_to_on_call,
    ],
    hooks=[
        audit_tool_call,
        gate_destructive_actions,
    ],
    conversation_manager=SlidingWindowConversationManager(recent_turns=20),
    knowledge_base_id="KB-runbooks-prod-001",
    guardrail_id="GR-sre-agent-prod-001",
)

What the code shows in production terms is a small set of disciplined choices. Claude Sonnet is pinned to a dated version, per the rule from Part 1. The tool docstrings read like OpenAPI descriptions because the model reads them like that — docstring quality is prompt engineering. Hooks log every tool call before execution and gate every destructive action through event.interrupt(). Declined actions return as a steering redirect rather than a blocking error, so the agent gets a corrective message and continues, instead of stalling on the gate. The Knowledge Base and Guardrails are referenced by ID, not inlined, so policy updates propagate without redeploying the agent.

A real incident, walked through

Names changed. Times preserved.

2:04 AM — CloudWatch alarm fires on the checkout-api service: error rate over 5% for 3 minutes.

2:04 AM — EventBridge delivers the alarm to the Step Functions workflow.

2:04 AM — State 1: Haiku classifier reads the alarm metadata, including the service name and the 5% error rate over 3 minutes. Output: "investigation required". Cost: ~$0.0008.

2:05 AM — State 2: SRE Agent invoked on AgentCore Runtime. The agent's first action: call query_cloudwatch_logs with /aws/ecs/checkout-api log group and the query fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 100, time window 30 minutes.

2:05 AM — query_cloudwatch_logs returns 73 ERROR events, all with the message ConnectionTimeoutException: PostgreSQL connection pool exhausted. The errors started at 2:01 AM.

2:05 AM — The agent calls get_recent_deployments for checkout-api, 24-hour window. Returns: deployment dep-9af3 at 2:00 AM, status Succeeded. The deployment correlates exactly with the error onset.

2:05 AM — The agent retrieves from the runbook Knowledge Base on the query "PostgreSQL connection pool exhausted after deployment." Top retrieved chunk: a runbook noting that recent migrations to the new connection-pool library require explicit pool.close() calls in the request handler; a previous incident with the same signature was caused by a missing close call.

2:06 AM — The agent's reasoning trace: "Errors began precisely at the time of deployment dep-9af3. The error signature matches a known runbook scenario for missing connection close calls. Recommendation: rollback dep-9af3. Confidence: high."

2:06 AM — The agent calls rollback_deployment(application="checkout-api", deployment_id="dep-9af3"). The hook fires; event.interrupt() opens a PagerDuty incident with the agent's reasoning and waits for approval.

2:07 AM — On-call engineer's phone buzzes. The PagerDuty alert shows: "SRE agent recommends rollback of dep-9af3 due to connection-pool exhaustion correlating exactly with deployment time. Confidence: high. Approve / Decline." The engineer reads the agent's reasoning, sees the log evidence cited, sees the runbook hit, and taps Approve.

2:07 AM — The rollback executes. CodeDeploy initiates the reversion.

2:09 AM — The rollback completes. CloudWatch alarm transitions to OK within 90 seconds of rollback completion.

2:10 AM — The agent writes a brief postmortem to the incident channel and closes the workflow execution. Total wall time: 6 minutes. Total on-call human time: 90 seconds (reading the alert and approving).

By comparison: a human triaging the same incident from scratch takes 15 to 25 minutes. They open the logs. They search. They notice the deployment correlation. They check the runbook. They decide to roll back. They execute. The agent compressed the cognitive work and preserved the safety. The human stayed in the loop for the destructive action — the one decision that mattered.

The walls deliberately left up

The marketing pitch for autonomous AI operations oversells what is defensible in 2026. The walls we leave standing are the architecture, not a limitation.

There are no production database mutations. The agent has no tools that write to production databases directly — restart and rollback are recoverable, but arbitrary writes are not. There are no security-policy changes. IAM modifications, security-group updates, KMS-key changes — all out of scope. These need human deliberation, not agent autonomy. There are no external API calls beyond the strict tool list. The agent cannot reach the public internet, cannot call third-party APIs, cannot do unbounded actions. The tool list is the action surface. And there is no multi-incident coordination. Each incident workflow is independent. Cross-incident pattern detection happens in a separate analytics pipeline.

The boundary is the agent's safety harness. An agent that could do more would do more — including the wrong thing more, and at scale. Think of it as the difference between a junior engineer with root and a junior engineer with read-only credentials: the second one cannot make the disaster bigger.

What it actually costs to run

Per-incident cost breakdown for the scenario above (illustrative; numbers vary by region and pricing):

| Component | Detail | Cost |
| --- | --- | --- |
| Haiku classifier | ~400 input tokens, ~10 output tokens | ~$0.0008 |
| Sonnet reasoning (agent loop) | ~4,000 input tokens (system prompt + tool descriptions + cached), ~1,200 output tokens, across 4 turns | ~$0.06 |
| Cohere Rerank on Knowledge Base | ~25 documents scored | ~$0.005 |
| Cohere embeddings (query) | ~30 tokens | ~$0.00005 |
| Step Functions Express | 6 state transitions over 6-minute execution | ~$0.0001 |
| Lambda action tools | 4 invocations, ~500ms each | ~$0.0002 |
| CloudWatch Logs Insights | 1 query over 30-min window | ~$0.005 |
| Guardrails | Input + output filter on each model call | ~$0.0015 |
| Per-incident total | | ~$0.075 |

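Seven and a half cents per triaged incident. A workload that ran the full investigation on 200 incidents a day would pay $15 a day in inference cost, roughly $5,500 a year, and that is the ceiling rather than the expected bill, because most alarms never get past the classifier. A human on-call engineer's cognitive tax on the same workload, measured in attention rather than dollars, is materially higher.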
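The arithmetic behind those figures can be checked directly. The component costs below are copied from the breakdown above; the only assumption is a 365-day year.

```python
# Per-incident component costs in USD, from the breakdown table.
components = {
    "haiku_classifier": 0.0008,
    "sonnet_reasoning": 0.06,
    "cohere_rerank": 0.005,
    "cohere_embeddings": 0.00005,
    "step_functions_express": 0.0001,
    "lambda_action_tools": 0.0002,
    "logs_insights_query": 0.005,
    "guardrails": 0.0015,
}

per_incident = sum(components.values())   # about 0.0727, rounded up to ~0.075 in the text
rounded_per_incident = 0.075
incidents_per_day = 200

daily_cost = incidents_per_day * rounded_per_incident   # about 15 USD
annual_cost = daily_cost * 365                          # about 5,475 USD

print(f"sum of components: ${per_incident:.5f}")
print(f"daily: ${daily_cost:.2f}  annual: ${annual_cost:,.0f}")
```

Sonnet reasoning is about four fifths of the per-incident total, which is why the routing decision in front of it dominates the bill.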

Four of the optimisations from Part 7 compound here. Cascade routing means about 80 percent of alarms classify as transient probable at Haiku and never reach Sonnet, pulling the average per-incident cost well below the $0.075 of the investigation case. Prompt caching covers the agent's 3,500-token system prompt (including tool descriptions), dropping the recurring input cost to about 10 percent of nominal. Top-K discipline returns three Knowledge Base chunks after re-ranking, not twenty. And Express Step Functions are cheaper than Standard for sub-five-minute executions. The numbers, stacked, are why this architecture pays for itself by the second week.
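To put a number on "well below", here is a back-of-envelope blend of the two cost tiers. It assumes exactly 80 percent of alarms stop at Haiku and that every remaining alarm costs the full investigation figure; both are simplifications for illustration, since the real split and the cost of a direct page are not given.

```python
HAIKU_ONLY_COST = 0.0008      # USD, alarm closed at the classifier
INVESTIGATION_COST = 0.075    # USD, full agent investigation
TRANSIENT_SHARE = 0.80        # assumed share resolved at Haiku
INCIDENTS_PER_DAY = 200

blended = (
    TRANSIENT_SHARE * HAIKU_ONLY_COST
    + (1 - TRANSIENT_SHARE) * INVESTIGATION_COST
)
annual_blended = blended * INCIDENTS_PER_DAY * 365
ratio_vs_sonnet_everywhere = INVESTIGATION_COST / blended

print(f"blended per alarm: ${blended:.4f}")
print(f"annual at 200/day: ${annual_blended:,.0f}")
print(f"Sonnet-everywhere multiple: {ratio_vs_sonnet_everywhere:.1f}x")
```

Under these assumptions the blended cost is about $0.016 per alarm, a bit over $1,100 a year, and roughly a fifth of what running every alarm through the full investigation would cost.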

What we would change the second time

Five things the deployment surfaces, each worth fixing on the next iteration. The agent runs in one region today; a regional outage takes it down with the workload it is meant to triage. Multi-region failover, applied from the AWS architecture series, fixes that. The agent forgets between incidents; adding AgentCore Memory with cross-session recall would let it recognise "this deployment caused issues last week too" without retraining. Calling rollback_deployment works, but reaching into the team's actual CI/CD system for the commit SHA, author, and PR link would speed the on-call's approval. The Haiku classifier mis-routes about 5 percent of alarms on team-specific edge cases; a small fine-tuned classifier per Part 4, trained on the team's historical alarms, would tighten that further. And a steering handler that catches inefficient CloudWatch Logs Insights queries and redirects the agent toward narrower scans would cap the per-incident CloudWatch cost.
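The last of those ideas, a steering check on Logs Insights queries, could start as something as small as the function below. It is a sketch of the idea only: the thresholds, the function name, and the redirect wording are hypothetical, and wiring it into a Strands hook is left out.

```python
import re
from typing import Optional

MAX_MINUTES_BACK = 60   # hypothetical ceiling on the scan window
MAX_ROW_LIMIT = 1000    # hypothetical ceiling on returned rows


def review_query(query: str, minutes_back: int) -> Optional[str]:
    """Return a redirect message if the query looks too broad, else None."""
    if minutes_back > MAX_MINUTES_BACK:
        return f"Narrow the time window to {MAX_MINUTES_BACK} minutes or less."
    if not re.search(r"\|\s*filter\b", query, flags=re.IGNORECASE):
        return "Add a filter clause so the scan is not the whole log group."
    limit = re.search(r"\|\s*limit\s+(\d+)", query, flags=re.IGNORECASE)
    if limit is None:
        return "Add a limit clause to bound the result size."
    if int(limit.group(1)) > MAX_ROW_LIMIT:
        return f"Lower the limit to {MAX_ROW_LIMIT} rows or fewer."
    return None


ok_query = (
    "fields @timestamp, @message | filter @message like /ERROR/ "
    "| sort @timestamp desc | limit 100"
)
print(review_query(ok_query, minutes_back=30))
print(review_query("fields @timestamp, @message", minutes_back=30))
```

Returning a message rather than raising mirrors the redirect style of the approval gate: the agent gets a correction and tries again with a narrower query.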

What the case study demonstrates

The series's central thesis was that production AI on Bedrock is a layered architecture, not a single model call. The case study makes the layering literal — every architectural surface from Parts 1-7 appears in the running system. The result is an agent that does useful work at production volume, under safety constraints, with economic discipline.

The same shape composes to other use cases. The SRE triage agent generalises to compliance-evidence agents, customer-support copilots, document-review agents, research synthesisers. Change the tool set, the system prompt, the Knowledge Base, the routing thresholds. The architecture stays.

What the case study is not

It is not a demonstration that AI replaces SREs. It is a demonstration that AI absorbs the tier-1 triage tax that erodes SRE attention. The on-call humans still own the destructive decisions, the post-incident reviews, the escalation paths, the architectural ownership of the workload. The agent absorbs the cognitive cost of reading logs at 2 AM. It is the cleaner that arrives before the doctor — taking the room from chaos to triable, leaving the diagnosis to the person trained for it.

So when the phone buzzes at 2:04 AM tonight, whose attention does the first twelve minutes spend?

FAQs

Why does the agent need `event.interrupt()` for restart and rollback if the tools themselves are reversible?

Restart and rollback are recoverable but disruptive. A spurious restart at peak traffic, or a rollback of a change that was actually fixing a worse incident, would compound the problem. The human approval gate is the calibration that ensures the agent's confidence interval lines up with operational reality before a customer-facing action runs.

What stops the agent from taking actions outside the tool list?

Nothing in the agent code — the architecture itself. The Lambda action tools are the entire action surface; the agent's IAM role grants no other permissions. No database write tool exists, no IAM modification tool exists, no public-internet call is reachable. The boundary is structural, not behavioural.
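One way to keep that boundary honest is to check the agent role's policy against the tool surface in CI. The sketch below is an assumption about how such a check might look: the five IAM action names correspond to the boto3 calls in the agent listing, while the policy document, the `Resource` placeholder, and the helper name are illustrative and would need scoping to real ARNs.

```python
# IAM actions behind the boto3 calls the tools make.
ALLOWED_ACTIONS = {
    "logs:StartQuery",
    "logs:GetQueryResults",
    "codedeploy:ListDeployments",
    "codedeploy:StopDeployment",
    "ecs:UpdateService",
}

# Illustrative policy; "Resource": "*" is a placeholder, not a recommendation.
agent_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SreAgentToolSurface",
            "Effect": "Allow",
            "Action": sorted(ALLOWED_ACTIONS),
            "Resource": "*",
        }
    ],
}


def actions_outside_surface(policy: dict, allowed: set) -> set:
    """Return every allowed-by-policy action that is not on the tool surface."""
    extra = set()
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        # Wildcards such as "iam:*" are never in the allowed set, so they are flagged.
        extra.update(a for a in actions if a not in allowed)
    return extra


print(actions_outside_surface(agent_policy, ALLOWED_ACTIONS))
```

An empty set means the role grants nothing beyond what the tools call; anything else is a widening of the action surface that deserves a review.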

How does the cost compare to running everything through Sonnet?

The cascade pattern from Part 7 routes ~80% of alarms to Haiku-only resolution at roughly $0.0008 each. Only the investigation cases reach Sonnet at ~$0.075. With those figures the blended average is about $0.016 per alarm, so a Sonnet-everywhere architecture at ~$0.075 per alarm would cost roughly five times as much for the same coverage, and would still need the same Guardrails, observability, and approval gates around it.

What happens when the agent's confidence is low?

It escalates. The `escalate_to_on_call` tool pages the human with the agent's analysis, the log evidence it gathered, and the actions it considered but did not recommend. The human gets the agent's work product as a triage starting point rather than starting from the alarm alone. The agent is conservative by instruction: when in doubt, escalate.

Does the agent learn from prior incidents?

Not in this iteration. Each incident workflow is independent — the agent forgets between runs. The next iteration would add AgentCore Memory with cross-session recall so the agent can recognise "this deployment caused issues last week too" without retraining. The intentional limitation in v1 is to keep the audit trail per-incident and the failure modes contained.

Closing the series

This piece closes the eight-part Amazon Bedrock series.


Production AI on Bedrock is no longer experimental in 2026 — the architectural patterns documented across this series are what defensible enterprise deployments look like in practice. The agents that ship are the ones built with the discipline these patterns encode.
