AWS Bedrock Agents for DevOps: AI That Actually Helps in 2026

Written by Bits Lovers

I built three DevOps agents on Bedrock last quarter. One was genuinely useful, one was okay, and one I threw away. That experience taught me more about what AI agents are actually good for in infrastructure than any blog post or conference talk I’ve seen. So let me tell you what I learned.

The useful one was an incident response bot that correlates CloudWatch alarms with recent deployments and posts a structured runbook to Slack. The okay one analyzed infrastructure costs from Terraform state. The thrown-away one was supposed to review PRs for security issues. It would hallucinate IAM permission problems in code that had no IAM at all. I couldn’t trust it. It went in the trash.

That’s the honest frame for everything else in this post.

What Bedrock Agents Actually Are

Bedrock Agents is AWS’s managed runtime for running LLMs that can take actions. It’s not just a chat interface. It’s a loop: the model reasons, decides it needs to call a tool, calls the tool, gets the result, reasons again, calls another tool, and eventually responds.

The three pieces that matter for DevOps work:

Action Groups are the tools your agent can call. You define them as Lambda functions and describe them with OpenAPI schemas. The agent decides when and how to call them based on the description you write. Descriptions matter more than code here. A badly described action group gets called wrong or ignored. A well-described one gets used appropriately.
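To make the "descriptions matter" point concrete, here is a minimal sketch of the OpenAPI fragment an action group might use, expressed as a Python dict. The operation, parameters, and endpoint names are illustrative, not from any real API:

```python
# Minimal OpenAPI fragment for an action group (names are illustrative).
# The "description" fields are what the model reads when deciding whether
# and how to call the action, so they carry the most weight.
get_deployments_schema = {
    "openapi": "3.0.0",
    "info": {"title": "deployment-history", "version": "1.0.0"},
    "paths": {
        "/deployments": {
            "get": {
                "operationId": "getRecentDeployments",
                "description": (
                    "Return deployments for a service in the last N hours. "
                    "Use this whenever an alarm may correlate with a recent release."
                ),
                "parameters": [
                    {
                        "name": "service",
                        "in": "query",
                        "required": True,
                        "schema": {"type": "string"},
                        "description": "Service name as it appears in the catalog.",
                    },
                    {
                        "name": "hours",
                        "in": "query",
                        "required": False,
                        "schema": {"type": "integer", "default": 4},
                        "description": "Lookback window in hours.",
                    },
                ],
            }
        }
    },
}
```

Note how the operation description tells the agent *when* to call the tool, not just what it returns. That single sentence does more work than anything in the Lambda behind it.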

Knowledge Bases are your agent’s long-term memory. You upload documents—runbooks, architecture docs, incident post-mortems—and Bedrock indexes them with vector embeddings in OpenSearch Serverless or a compatible store. When the agent needs to answer a question, it retrieves relevant chunks automatically. This is where Bedrock gets genuinely useful for DevOps: you can embed years of operational knowledge and the agent retrieves it at runtime.

Guardrails are the safety rails. You configure them to block topics, filter responses, prevent the agent from leaking data. For infrastructure work this is critical. You don’t want an agent that confidently recommends terraform destroy without confirmation. Guardrails let you enforce that some actions require human approval in the loop before they execute.
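A sketch of what that might look like via boto3's create_guardrail call: a denied topic covering destructive commands. The topic name, definition wording, and messaging are assumptions for illustration:

```python
# Sketch: a guardrail denying destructive infrastructure recommendations.
# The config shape follows boto3's bedrock create_guardrail API; the topic
# name and messaging strings are illustrative assumptions.
guardrail_config = {
    "name": "devops-agent-safety",
    "description": "Block destructive infrastructure actions without human approval.",
    "topicPolicyConfig": {
        "topicsConfig": [
            {
                "name": "destructive-commands",
                "definition": (
                    "Recommending or executing commands that destroy or modify "
                    "infrastructure, such as terraform destroy or resource deletion, "
                    "without an explicit human confirmation step."
                ),
                "type": "DENY",
            }
        ]
    },
    "blockedInputMessaging": "This request requires human approval.",
    "blockedOutputsMessaging": "Suggested action withheld: requires human approval.",
}

def create_safety_guardrail(client=None):
    """Create the guardrail and return its ID."""
    import boto3  # local import keeps the config inspectable without AWS deps
    client = client or boto3.client("bedrock")
    response = client.create_guardrail(**guardrail_config)
    return response["guardrailId"]
```

You attach the resulting guardrail ID to the agent, and it applies to every invocation, not just the ones you remember to wrap.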

The pricing model is important to understand upfront. You pay the model's per-token rates for input and output, plus whatever your Lambda actions and Knowledge Base storage cost. For Claude 3.5 Sonnet on Bedrock, you're looking at around $3 per million input tokens and $15 per million output tokens. An incident response bot that wakes up once or twice a night costs almost nothing. A deployment assistant reviewing fifty PRs a day starts to add up. Know your workload before you commit.

MCP Integration: Tools Without Lambda

In early 2026, Bedrock added native support for the Model Context Protocol. This changed what’s practical to build.

Before MCP, every tool your agent needed required a Lambda function. You’d write a Lambda that wraps the AWS SDK call, write an OpenAPI schema for it, attach it to an action group. It worked but it was tedious. A serious agent might need fifteen Lambda functions, each with its own IAM role, its own deployment pipeline, its own CloudWatch logs to monitor.

MCP changes this. You can expose a set of tools through an MCP server and register that server with Bedrock directly. The agent discovers available tools at runtime, reads their descriptions, and decides which to call. Your MCP server can expose raw AWS SDK calls, internal APIs, database queries—whatever your agent needs.

For DevOps tooling this is practical. You can run an MCP server on ECS Fargate that wraps:

  • aws cloudwatch get-metric-statistics for fetching metrics
  • aws logs filter-log-events for searching logs
  • aws ec2 describe-instances for infrastructure state
  • terraform show -json piped from your state bucket
  • Your internal deployment API

The agent connects, discovers these tools, and uses them. You didn’t write fifteen Lambda functions. You wrote one MCP server with fifteen handlers. The operational overhead is one ECS service instead of fifteen separate Lambdas to monitor and maintain.

The tradeoff is latency and complexity. Each tool call goes from Bedrock to your MCP server to the upstream service and back. If your MCP server is in the same region as Bedrock, this is fine. If you’re calling cross-region APIs through a server in us-east-1 from Bedrock running in eu-west-1, you’ll feel it.

Use Case 1: The Incident Response Bot That Actually Works

Here’s the agent that earned its keep.

Our on-call rotation was burning people out. Not because incidents were frequent, but because triage was painful. An alarm fires at 2 AM. You get paged. You open your laptop. You check CloudWatch. You check recent deployments. You check which service owns this alarm. You look for similar incidents in Confluence. By the time you have context, you’re wide awake and it’s been twenty minutes.

The incident response bot compresses that to two minutes.

The architecture: a CloudWatch alarm triggers an EventBridge rule, which invokes a Lambda that creates a Bedrock Agent session. The agent has four action groups: one for fetching CloudWatch metrics and alarm context, one for querying our deployment history API, one for looking up service ownership from our Backstage catalog, and one for posting to Slack.
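The trigger Lambda is small. Here is a hedged sketch, assuming the standard EventBridge CloudWatch alarm state change event; the agent IDs are placeholders and the prompt wording is illustrative:

```python
# Sketch of the trigger Lambda: EventBridge delivers the alarm state change,
# the Lambda opens a Bedrock Agent session keyed to the alarm.
# AGENT_ID / ALIAS_ID are placeholders; prompt wording is illustrative.
def build_prompt(alarm_name: str, region: str) -> str:
    """Turn the raw alarm into the agent's opening instruction."""
    return (
        f"Alarm '{alarm_name}' fired in {region}. Gather metric context, "
        "correlate with deployments from the last 4 hours, retrieve similar "
        "past incidents, and post a structured summary to the on-call Slack channel."
    )

def handler(event, context):
    detail = event["detail"]  # CloudWatch Alarm State Change event payload
    alarm_name = detail["alarmName"]

    import boto3  # local import keeps the module testable without AWS deps
    agent = boto3.client("bedrock-agent-runtime")
    response = agent.invoke_agent(
        agentId="AGENT_ID",
        agentAliasId="ALIAS_ID",
        sessionId=f"incident-{alarm_name}",  # one session per alarm
        inputText=build_prompt(alarm_name, event["region"]),
    )
    # invoke_agent streams its output; drain the completion so the Lambda
    # stays alive until the agent finishes posting to Slack
    for _chunk in response["completion"]:
        pass
    return {"status": "dispatched", "alarm": alarm_name}
```

Keying the session ID to the alarm means repeated firings of the same alarm land in the same conversation, which matters once stateful sessions enter the picture later in this post.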

The Knowledge Base contains every incident post-mortem from the last three years. When an alarm fires, the agent retrieves similar past incidents by vector similarity—"this looks like our database connection pool exhaustion from March 2025"—and includes that context in the Slack message.

The Slack message looks like this:

ALARM: RDS connection pool exhaustion - api-service (us-east-1)
Severity: HIGH | Started: 2026-05-12 02:14 UTC

CONTEXT
- api-service deployed 47 minutes ago (commit abc123: "Increase thread pool")
- Connection count: 487/500 (97% of max)
- Similar incident: March 2025 - same pattern, resolved by rolling back thread pool increase

RECENT DEPLOYMENTS (last 4 hours)
- api-service: 01:27 UTC (abc123) - increased thread pool from 100 → 250
- No other services deployed

SUGGESTED ACTIONS
1. Check api-service logs for connection leak signals
2. Consider rolling back abc123 if connections don't stabilize
3. If rollback needed: /deploy rollback api-service abc123

RUNBOOK: https://runbooks.internal/rds-connection-pool
On-call: @sarah-kim

The deployment history correlation is what makes this genuinely useful. Humans would figure this out eventually, but the agent does it in ten seconds at 2 AM. The on-call engineer wakes up with a clear hypothesis instead of starting from zero.

What it doesn’t do: it doesn’t decide to roll back automatically. Guardrails prevent any action that modifies infrastructure without a human clicking a button. I can’t stress this enough. An agent that autonomously makes infrastructure changes based on incomplete information is not a useful tool. It’s a liability.

Use Case 2: The Infrastructure Cost Analyzer

This one I’d call “okay.” Useful enough to keep, not transformative.

The setup: an agent that reads your Terraform state files from S3, understands your resource inventory, and answers natural language questions about cost and optimization.

# MCP tool handler example
import json

import boto3

async def handle_get_terraform_resources(params):
    bucket = params["state_bucket"]
    prefix = params.get("prefix", "")
    
    # Fetch state files
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    
    resources = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".tfstate"):
                response = s3.get_object(Bucket=bucket, Key=obj["Key"])
                state = json.loads(response["Body"].read())
                resources.extend(extract_resources(state))
    
    return {"resources": resources, "count": len(resources)}

def extract_resources(state):
    resources = []
    for resource in state.get("resources", []):
        for instance in resource.get("instances", []):
            resources.append({
                "type": resource["type"],
                "name": resource["name"],
                "module": resource.get("module", "root"),
                "attributes": {
                    k: v for k, v in instance.get("attributes", {}).items()
                    if k in ["instance_type", "engine", "size", "region", "tags"]
                }
            })
    return resources

With this tool in place, you can ask the agent: “What EC2 instances are running in our dev environment that haven’t had a deployment in thirty days?” or “Which RDS instances are using instance types that cost more than the equivalent Aurora Serverless configuration?” It fetches the state, reasons over it, and gives you a list.

The limitation is that it only knows what’s in Terraform state. Shadow resources created through the console don’t exist for this agent. And its cost estimations aren’t precise—it’s reasoning from instance types and general AWS pricing knowledge, not pulling live pricing data. Close enough for optimization conversations, not close enough for budget forecasting.

I also noticed it occasionally suggests migration paths that are architecturally wrong for our setup. It would recommend moving an RDS instance to Aurora Serverless without knowing that the service using it has latency requirements that make Aurora Serverless cold starts unacceptable. Context the agent doesn’t have. That’s the Terraform state problem: it tells you what resources exist, not why they exist that way.

Good for: generating a first pass at cost optimization candidates. Bad for: making firm recommendations without human review. See the AWS FinOps and Well-Architected 2026 guide for the full cost governance picture that this kind of agent fits into.

Use Case 3: The Deployment Assistant (Also: What I Threw Away)

I built two versions of this. The first I threw away. The second is running in staging.

The thrown-away version tried to review PRs for security issues. The problem wasn’t the architecture. The problem was reliability. Claude on Bedrock is good at many things. Identifying subtle IAM privilege escalation paths in Terraform? Hit or miss. It would flag real issues. It would also invent issues. A PR that created an S3 bucket with the correct policy would get flagged for “potential public access” that didn’t exist. A developer who sees enough false positives stops reading the comments. You’ve shipped a tool that actively reduces trust.

The version that works is narrower. It doesn’t do security review. It does deployment readiness checks.

Before a deployment can proceed to production, the agent:

  1. Checks that all tests passed in the CI pipeline (calls our CI API)
  2. Verifies the feature flags for new code are configured in our flag management system
  3. Confirms there’s no active incident in the service’s dependency tree
  4. Checks that the deployment window doesn’t overlap with scheduled maintenance

These are deterministic checks, not judgment calls. The agent is orchestrating API calls and making a binary decision: all checks pass, or they don’t. When they all pass, it comments on the PR with a green light and updates the deployment queue. When something fails, it explains specifically what failed and links to the relevant system.

This works because the agent isn’t being asked to reason about ambiguous information. Each check has a clear pass/fail signal. The agent’s job is coordination, not inference.
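The gate reduces to a list of binary checks aggregated into one verdict. A minimal sketch, with stub check functions standing in for the real CI, flag, and incident APIs:

```python
# The readiness gate reduced to its essence: each check returns pass/fail
# plus a reason; all must pass. The check bodies are stubs standing in for
# the real CI, feature-flag, and incident APIs.
from typing import Callable, NamedTuple

class CheckResult(NamedTuple):
    name: str
    passed: bool
    reason: str

def run_checks(checks: list) -> tuple:
    """All checks must pass; failures carry the reason the agent reports."""
    results = [check() for check in checks]
    return all(r.passed for r in results), results

# Stub checks for illustration
def ci_green() -> CheckResult:
    return CheckResult("ci", True, "all tests passed")

def no_active_incident() -> CheckResult:
    return CheckResult("incidents", False, "open incident on a dependency")

ready, results = run_checks([ci_green, no_active_incident])
# ready is False here: the incident check failed, and the agent can report
# exactly which check blocked the deployment and why
```

The agent's contribution is calling the real APIs behind each stub and writing the PR comment; the decision itself never involves model judgment.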

The lesson from building both: Bedrock Agents work best when the agent is orchestrating structured data and deterministic checks, not when it’s making nuanced judgments about code quality or security posture. Use it for the former. Use a human for the latter.

Deployment pipelines using EventBridge and Step Functions are a natural fit for this kind of agent integration. See EventBridge and Step Functions patterns for how to wire that coordination layer together.

Stateful Runtime Environment

One of the 2026 additions that changes what’s practical: stateful runtime environments.

Bedrock Agents now supports persistent sessions with memory that survives between invocations. Previously, every time your agent was invoked, it started with a blank slate. It had Knowledge Bases, but no memory of previous conversations or previous decisions.

With stateful runtime, the agent can maintain context across a multi-step incident response. You can ask it: “What did you find in the last three incidents for this alarm?” and it has an answer. You can ask it to maintain a running hypothesis about a service degradation across multiple data collection steps without re-explaining context each time.

For an incident response bot, this means the agent handling a two-hour incident maintains continuity. New on-call engineer takes over? The agent briefs them with what it knows so far. The previous data collection doesn’t need to repeat. This is a genuine operational improvement.

The implementation uses Bedrock’s session context API. You assign a session ID when you create the agent invocation, and pass the same session ID to continue it:

import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Start or continue a session
response = client.invoke_agent(
    agentId="AGENT_ID",
    agentAliasId="ALIAS_ID",
    sessionId="incident-api-service-20260512",  # persistent session ID
    inputText="What is the current status of the RDS connection issue?",
    sessionState={
        "sessionAttributes": {
            "incident_id": "INC-2024",
            "service": "api-service",
            "started_at": "2026-05-12T02:14:00Z"
        }
    }
)

The session persists for up to 24 hours by default. You can extend it. Useful for incidents that span a war room.

Bedrock vs. DIY LangChain or LangGraph

The question I keep getting from platform teams: why use Bedrock Agents instead of building your own orchestration with LangChain or LangGraph?

Honest answer: it depends on your team’s strengths and your operational requirements.

Bedrock Agents gives you managed infrastructure. No agent server to deploy, scale, or patch. No vector database to operate. No LLM API keys to rotate. IAM controls access. CloudTrail logs every invocation. For a team that lives in AWS and doesn’t want to own ML infrastructure, this matters a lot.

LangGraph gives you more control. You can implement complex reasoning patterns—parallel tool calls, conditional branching, custom retry logic, multi-agent coordination—that Bedrock’s sequential reasoning loop doesn’t support natively. If your use case requires an agent that spawns sub-agents or has non-linear reasoning paths, LangGraph is more flexible.

The operational argument for Bedrock is strong. Every tool call is logged. You can audit what the agent did, what data it accessed, which Lambda it called. For regulated environments, that audit trail matters. You can configure guardrails centrally and enforce them across all agents in your account. You can use IAM to control which principals can invoke which agents. None of this is free with a self-hosted LangChain setup.

The cost argument for DIY is also real. Bedrock charges per token and per request. If you’re building high-volume tooling, a self-hosted open-source model can be cheaper at scale. Bedrock AgentCore is worth reading alongside this if you want to understand where AWS is pushing the managed agent surface.

For Terraform-based infrastructure tooling, the MCP integration makes Bedrock more practical than it was a year ago. The ability to expose Terraform state operations as MCP tools and wire them directly to Bedrock without the Lambda overhead is a legitimate improvement.

Cost Breakdown

Let me give you real numbers from production.

Incident response bot (fires 1-3 times per night, average 8k tokens per invocation):

  • Claude 3.5 Sonnet: ~$0.18 per incident
  • Knowledge Base retrieval: ~$0.01 per retrieval
  • Lambda action groups: <$0.01
  • Monthly total: $15-25 depending on incident volume

Deployment assistant (50 PRs per day, ~6k tokens each):

  • Claude 3.5 Sonnet: ~$0.13 per review
  • Monthly total: ~$200

Cost analyzer (ad-hoc use, maybe twice a week):

  • Claude 3.5 Sonnet: ~$0.40 per query (more tokens, larger context)
  • Monthly total: $5-10

The incident bot and cost analyzer are trivially cheap. The deployment assistant at scale starts to matter. If you scale that to 500 PRs per day, you’re at $2,000/month just for Bedrock tokens. At that point, comparing against a self-hosted model becomes a reasonable conversation.
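The arithmetic behind these numbers is a simple per-token model. A sketch, using the Claude 3.5 Sonnet rates quoted earlier ($3/M input, $15/M output):

```python
# Back-of-envelope token cost model behind the numbers above, using the
# Claude 3.5 Sonnet on Bedrock rates quoted earlier in the post.
IN_RATE = 3.00 / 1_000_000    # dollars per input token
OUT_RATE = 15.00 / 1_000_000  # dollars per output token

def invocation_cost(input_tokens: int, output_tokens: int) -> float:
    """Token cost of a single agent invocation."""
    return input_tokens * IN_RATE + output_tokens * OUT_RATE

def monthly_cost(per_invocation: float, invocations_per_day: int, days: int = 30) -> float:
    """Scale a per-invocation cost to a monthly figure."""
    return per_invocation * invocations_per_day * days

# 500 PRs/day at ~$0.13 per review lands around $1,950/month
scale_up = monthly_cost(0.13, 500)
```

Run the numbers for your own token volumes before committing; the crossover point where self-hosting wins depends entirely on invocation count and context size.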

Limitations You Need to Plan For

Hallucination in infrastructure context is dangerous. An agent that confidently recommends a wrong IAM policy or an incorrect Terraform change can cause real outages. Every agent output that touches infrastructure should go through a human confirmation step. This is not optional. Use Bedrock Guardrails to enforce it.

Context windows fill fast with large infrastructure states. A Terraform state file for a mature AWS account can be hundreds of thousands of lines. You can’t just dump that into a prompt. You need your MCP tools or action groups to return filtered, relevant subsets. The agent asks for specific resources; you return those resources. Don’t try to give it everything.

Latency is not negligible. A single Bedrock Agent invocation with three tool calls takes five to fifteen seconds. For a Slack bot responding to an on-call engineer at 2 AM, that’s fine. For a deployment gate that needs to respond in under a second, that’s not fine. Know where agents fit in your latency budget before you build.

Model updates change behavior. When AWS updates the underlying Claude model, your agent’s behavior can drift. Prompts that worked reliably start producing different outputs. Test your agents after every model update. Build evals for the decision paths that matter. This is operational overhead that DIY setups with pinned model versions don’t have.

What’s Actually Worth Building

Build Bedrock Agents for DevOps when:

  • The task involves correlating multiple data sources that humans would manually check anyway
  • The output is a recommendation or a structured report, not an automated action
  • The checks are deterministic enough that false positives are rare
  • The value of automation is in speed or consistency, not in replacing human judgment

Don’t build Bedrock Agents when:

  • The task requires nuanced security reasoning about code (models aren’t reliable enough)
  • You need sub-second responses
  • You need to autonomously modify infrastructure without human approval
  • Your team doesn’t want to maintain prompt engineering as operational work

The incident response bot earned its place because it does something concrete and valuable: it correlates data faster than a human can at 2 AM and presents a structured hypothesis. The deployment readiness checks work because every check is binary. The cost analyzer is useful as a conversation starter, not as an automated decision maker.

Bedrock Agents are real infrastructure tooling in 2026. They’re not magic. But when you match them to the right problems—data correlation, runbook retrieval, deterministic orchestration—they genuinely reduce operational burden. That’s worth something.

Start with the incident response bot. It’s the most forgiving use case, the most clearly valuable, and the most representative of what these agents are actually good at.
