AWS DevOps Agent: Autonomous Incident Investigation on AWS

Written by Bits Lovers

On March 31, 2026, AWS made the DevOps Agent generally available. The announcement tweet from @awscloud got 3.3 million views in a week. The reaction from the DevOps community ranged from “this will replace on-call engineers” to “this is just a better runbook executor.” The reality sits between those extremes, closer to the second.

AWS DevOps Agent is an AI-powered operations assistant that runs autonomously when an alarm fires. It correlates CloudWatch metrics, deployment history, application logs, and service health to surface a root cause analysis before a human even opens their laptop. It doesn’t fix the problem — it tells you what’s wrong, what changed recently, and which service is the source. Whether that saves 20 minutes of digging at 2am or two hours depends on how well you’ve instrumented your systems.

This guide covers how the agent works, how to set it up, what data sources it uses, how to interpret its findings, and what it genuinely can and can’t do.

How the Agent Works

AWS DevOps Agent integrates with CloudWatch Alarms as its primary trigger. When an alarm enters the ALARM state, the agent starts an investigation automatically. It pulls data from the connected sources — CloudWatch Logs, CloudWatch Metrics, AWS X-Ray traces, AWS CodeDeploy deployment history, and AWS Systems Manager — and runs a structured analysis:

  1. Timeline construction: What changed in the 30 minutes before the alarm fired? Recent deployments, config changes, autoscaling events.
  2. Metric correlation: Which metrics degraded? Are they correlated with a specific service, instance, or Availability Zone?
  3. Log analysis: What errors appear in application logs around the alarm timestamp? What’s the frequency and pattern?
  4. Trace analysis: If X-Ray is instrumented, which downstream calls are slow or failing?
  5. Root cause hypothesis: Based on the above, what is the most likely cause and which component is responsible?

The output is a structured finding delivered to Slack, PagerDuty, or an SNS topic — wherever your team routes alerts. The finding includes a summary, the supporting evidence, and the specific resource (Lambda function, ECS service, RDS instance) the agent believes is the root cause.
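The five-step analysis can be sketched in miniature. Everything below is illustrative (the real pipeline is internal to AWS), but it shows the shape of steps 1 and 5: build a change timeline for the window before the alarm, then treat the most recent deployment in that window as the prime suspect.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical sketch of the investigation pipeline. All names here
# are illustrative, not part of any AWS API.

@dataclass
class Event:
    timestamp: datetime
    source: str        # e.g. "codedeploy", "autoscaling", "config"
    description: str

@dataclass
class Finding:
    summary: str
    evidence: list = field(default_factory=list)
    suspect_resource: str = ""

def build_timeline(events, alarm_time, window_minutes=30):
    """Step 1: keep only changes in the window before the alarm fired."""
    start = alarm_time - timedelta(minutes=window_minutes)
    return sorted(
        (e for e in events if start <= e.timestamp <= alarm_time),
        key=lambda e: e.timestamp,
    )

def investigate(events, alarm_time, alarm_name):
    """Steps 1 and 5 only: timeline construction plus a naive root-cause
    hypothesis (the most recent deployment before the alarm, if any)."""
    timeline = build_timeline(events, alarm_time)
    deploys = [e for e in timeline if e.source == "codedeploy"]
    finding = Finding(summary=f"Investigation for {alarm_name}")
    if deploys:
        last = deploys[-1]
        finding.suspect_resource = last.description
        finding.evidence.append(
            f"Deployment at {last.timestamp:%H:%M:%S} preceded the alarm"
        )
    return finding
```

The real agent layers metric, log, and trace correlation on top of this timeline; the deployment-proximity heuristic alone is what makes post-deployment failures its easiest case.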

Setting Up the DevOps Agent

The DevOps Agent is configured through the AWS Systems Manager console under the “DevOps Agent” section, or via the API:

# Create a DevOps Agent configuration
aws devops-agent create-agent \
  --name production-ops-agent \
  --description "Production incident investigation agent" \
  --region us-east-1

# Get the agent ARN
AGENT_ARN=$(aws devops-agent list-agents \
  --query 'agents[?name==`production-ops-agent`].agentArn' \
  --output text)

# Configure data sources the agent can access
aws devops-agent update-agent \
  --agent-arn $AGENT_ARN \
  --data-sources '{
    "cloudwatchLogs": {
      "logGroups": [
        "/aws/lambda/payment-processor",
        "/aws/lambda/order-service",
        "/ecs/production/my-api"
      ]
    },
    "xray": {
      "enabled": true
    },
    "codedeploy": {
      "enabled": true,
      "applications": ["my-api-deploy", "payment-processor-deploy"]
    }
  }'

# IAM role for the agent (needs read access to all data sources)
aws iam create-role \
  --role-name DevOpsAgentRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "devops-agent.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

# ReadOnlyAccess is the quick start; for production, scope down to a
# least-privilege policy covering only the configured log groups,
# X-Ray, and CodeDeploy read APIs
aws iam attach-role-policy \
  --role-name DevOpsAgentRole \
  --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess

The agent only needs read access. It observes and analyzes — it never modifies resources. If you see examples suggesting the agent needs write permissions, be cautious about the scope.

That read-only boundary is why the agent is only as good as the telemetry you feed it. Noisy alarms, missing traces, and sloppy log structure all turn into noisy investigations. Teams that already have disciplined monitoring will see the fastest payoff here. If your baseline is still shaky, fix that first with the alarm and dashboard patterns from the AWS CloudWatch deep dive and the service trace setup in the AWS X-Ray distributed tracing guide.

Connecting to CloudWatch Alarms

The agent triggers on CloudWatch Alarms. Configure which alarms should invoke it:

# Create an alarm that triggers the DevOps Agent
aws cloudwatch put-metric-alarm \
  --alarm-name "payment-service-error-rate" \
  --metric-name "Errors" \
  --namespace "AWS/Lambda" \
  --statistic Sum \
  --period 60 \
  --threshold 10 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 2 \
  --dimensions Name=FunctionName,Value=payment-processor \
  --alarm-actions \
    arn:aws:sns:us-east-1:123456789012:pagerduty-critical \
    arn:aws:devops-agent:us-east-1:123456789012:agent/production-ops-agent \
  --ok-actions \
    arn:aws:sns:us-east-1:123456789012:pagerduty-critical

# Alternatively, configure the agent to watch all alarms matching a pattern
aws devops-agent update-alarm-config \
  --agent-arn $AGENT_ARN \
  --alarm-filter '{"namePrefix": "production-", "states": ["ALARM"]}'

The agent ARN acts as an alarm action just like an SNS topic. When the alarm fires, both your PagerDuty notification and the agent investigation trigger simultaneously — by the time a human opens the PagerDuty ticket, the agent’s finding may already be in the Slack channel.

Configuring Notification Output

Tell the agent where to send its findings:

# Send findings to Slack via a Lambda webhook forwarder
aws devops-agent update-notification-config \
  --agent-arn $AGENT_ARN \
  --notifications '[
    {
      "type": "SNS",
      "targetArn": "arn:aws:sns:us-east-1:123456789012:devops-agent-findings",
      "severity": ["HIGH", "CRITICAL"]
    }
  ]'

# The SNS topic can fan out to Slack, PagerDuty, email, etc.
# Subscribe a Slack webhook Lambda to the SNS topic
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:devops-agent-findings \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:us-east-1:123456789012:function:slack-notifier
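The slack-notifier Lambda subscribed above is left as an exercise; a minimal sketch might look like the following. The finding payload shape (alarmName, severity, summary, evidence) is an assumption, as is the placeholder webhook URL.

```python
import json
import urllib.request

# Placeholder -- substitute your real Slack incoming-webhook URL.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def format_finding(finding):
    """Render an agent finding (assumed payload shape) as Slack text."""
    lines = [
        f":mag: DevOps Agent Finding: {finding['alarmName']} [{finding['severity']}]",
        f"Summary: {finding['summary']}",
    ]
    lines += [f"• {item}" for item in finding.get("evidence", [])]
    return "\n".join(lines)

def handler(event, context):
    """SNS-triggered Lambda: forward each agent finding to Slack."""
    for record in event["Records"]:
        finding = json.loads(record["Sns"]["Message"])
        body = json.dumps({"text": format_finding(finding)}).encode()
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=body,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```

Keeping the formatting in a separate function makes it easy to unit-test the message layout without calling Slack.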

A typical finding notification looks like this in Slack:

🔍 DevOps Agent Finding — payment-service-error-rate [CRITICAL]

Summary: Lambda function payment-processor experiencing elevated error rate.
Root cause: Dependency failure on downstream RDS instance db-payments-prod.

Evidence:
• Error rate: 18.4% (threshold: 1%)
• Errors started: 14:32:07 UTC — 3 minutes after CodeDeploy deployment d-XXXXX
• Log pattern: "connection timeout" appears in 847 of 923 error log lines
• RDS db-payments-prod: FreeableMemory dropped to 42MB at 14:31:45 UTC
• CloudWatch metric DatabaseConnections spiked from 45 to 892 at 14:32:00 UTC

Deployment: payment-processor v2.4.1 deployed at 14:29:23 UTC
Suspect: Version 2.4.1 likely introduced a connection pool configuration that
exhausts RDS connections under load. Previous version v2.4.0 ran without error.

Suggested action: Roll back payment-processor to v2.4.0 or increase
max_connections on db-payments-prod.

That finding — linking a Lambda error spike to a connection exhaustion pattern that started roughly 3 minutes after a deployment — would typically take 15-30 minutes of manual log analysis to piece together. The agent surfaces it in under 90 seconds.

What the Agent Does Well

The agent genuinely excels at three scenarios:

Post-deployment failures: The correlation between a new deployment and a metrics change is exactly the kind of temporal pattern the agent finds reliably. When v2.4.1 deployed at 14:29 and errors started at 14:32, that’s a clear signal a human might miss while staring at a metrics graph.

Cascading failures: In microservice architectures, the service generating errors is often not the root cause — it’s a victim of a failing dependency. The agent uses X-Ray traces to walk the dependency chain and identify which upstream or downstream service is actually failing.
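A toy version of that dependency walk: given a call graph and the set of services currently reporting faults, a likely root cause is an erroring service with no erroring dependencies of its own. The graph shape and heuristic here are illustrative, not the agent's actual algorithm.

```python
# Hypothetical sketch: walk a service call graph (the kind X-Ray builds
# from traces) to find the deepest failing node -- the likely root
# cause, rather than the first service that happened to alarm.

def find_root_cause(graph, errors, start):
    """graph: service -> list of downstream services it calls.
    errors: set of services currently reporting faults.
    Returns an erroring service reachable from `start` whose own
    dependencies are all healthy."""
    root, stack, seen = None, [start], set()
    while stack:
        svc = stack.pop()
        if svc in seen:
            continue
        seen.add(svc)
        failing_deps = [d for d in graph.get(svc, []) if d in errors]
        if svc in errors and not failing_deps:
            root = svc   # erroring with healthy dependencies: likely source
        stack.extend(graph.get(svc, []))
    return root
```

With `api -> orders -> db`, all three erroring, the walk blames `db` even though the alarm fired on `api`.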

3am investigations: The agent’s value compounds at odd hours when context-switching overhead is highest. Getting a structured finding that says “database connections exhausted, likely caused by deployment 14 minutes ago” lets an on-call engineer make a rollback decision in 2 minutes rather than spending 20 minutes reading logs.

What the Agent Can’t Do

There are several important limitations to understand before relying on it:

It can’t investigate what it can’t see. If your application logs aren’t structured (plain-text logs instead of JSON), the agent’s log analysis is limited. If you’re not using X-Ray, it has no trace data. If your Lambda functions don’t emit custom metrics, it falls back to standard invocation metrics. The agent’s quality is directly proportional to your observability investment.
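Structured logs are the cheapest of those investments. A minimal JSON formatter for Python's standard logging module, for example, turns every log line into an object that automated analysis can group by field rather than by regex:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line. Field names here are
    illustrative; the point is structure, not this exact schema."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            # fields attached via `extra=` become queryable attributes
            **{k: v for k, v in record.__dict__.items()
               if k in ("order_id", "error_type")},
        })

stream = logging.StreamHandler()
stream.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-processor")
logger.addHandler(stream)
logger.setLevel(logging.INFO)

logger.error("connection timeout",
             extra={"order_id": "ord-123", "error_type": "DBTimeout"})
```

With logs in this shape, "connection timeout appears in 847 of 923 error lines" is a field aggregation instead of a text search, for the agent and for CloudWatch Logs Insights alike.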

It doesn’t know your business logic. The agent can tell you that a Lambda function is throwing NullPointerExceptions, but it can’t tell you whether that’s because a required field in a new API version is missing or because a customer sent unexpected input. It recognizes error patterns, not semantic meaning.

False root causes on complex incidents. For incidents involving multiple simultaneous failures, the agent may fixate on the first signal it finds and miss the actual trigger. Multi-region failures with complex cascading behavior exceed what a single correlation pass can reliably untangle.

It can’t fix things. Despite some marketing language about “autonomous resolution,” the GA version of DevOps Agent is investigation-only. It recommends actions; it doesn’t take them. AWS has indicated that remediation actions (rollback, scaling) are on the roadmap behind an explicit approval workflow.

In practice, that’s the safer operating model anyway. Let the agent produce the evidence and the likely root cause, then hand the actual rollback, failover, or cache flush to a deterministic workflow. If you want that second step automated, wire the finding into EventBridge + Step Functions so the AI stays in the analyst role and the state machine stays in the execution role.
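A sketch of that split, with hypothetical field names for the finding payload: the deterministic gate automates exactly one action (rollback of a specifically named deployment) and pages a human for everything else.

```python
# Illustrative "analyst vs. executor" gate. In a real setup this logic
# would live in a Step Functions Choice state fed by an EventBridge
# rule; the finding field names below are assumptions.

ALLOWED_ACTIONS = {"rollback"}   # the only action we ever automate

def remediation_decision(finding):
    """Return ("rollback", deployment_id) only when the finding is
    high-severity AND names a specific deployment AND suggests an
    action we have whitelisted; otherwise page a human."""
    if (finding.get("severity") in ("HIGH", "CRITICAL")
            and finding.get("suspectDeploymentId")
            and finding.get("suggestedAction") in ALLOWED_ACTIONS):
        return ("rollback", finding["suspectDeploymentId"])
    return ("page-human", None)
```

The AI's output is treated as untrusted input: anything ambiguous falls through to a person, and the execution path itself contains no model calls.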

Pricing

AWS DevOps Agent charges per investigation. As of GA:

  • $0.10 per investigation triggered
  • $0.02 per GB of logs analyzed per investigation
  • X-Ray trace analysis included at no additional charge

A team with 50 alarm triggers per month and an average of 500 MB of logs analyzed per investigation pays roughly 50 × $0.10 + 50 × 0.5 GB × $0.02/GB = $5.50/month. The per-investigation fee is negligible; log volume is the variable that scales, so teams feeding the agent tens of GB per incident will pay meaningfully more. For a team running 24/7 on-call with frequent incidents, the ROI calculates easily. For a team with 5 alarms per month, the setup and tuning effort is harder to justify over a well-tuned runbook.
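Given the posted rates, the cost model is a one-liner, which makes it easy to sanity-check against your own alarm and log volumes:

```python
def monthly_cost(investigations, avg_log_gb,
                 per_investigation=0.10, per_gb=0.02):
    """GA pricing: a flat fee per investigation plus a per-GB fee
    for logs analyzed (X-Ray analysis is included)."""
    return investigations * (per_investigation + avg_log_gb * per_gb)

# 50 investigations/month, 500 MB of logs each:
print(f"${monthly_cost(50, 0.5):.2f}/month")   # prints $5.50/month
```

Doubling log volume per investigation adds only $0.50/month at this scale; the fee only becomes interesting once investigations routinely chew through multi-GB log groups.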

The SRE Perspective

The community pushback captured in that viral tweet — “what AWS built is actually what good SREs already do” — is accurate. The agent automates the first 20 minutes of an incident investigation: gather context, correlate signals, form a hypothesis. Good SREs already do this faster, with more context, and more accurately. What the agent provides is consistency. It doesn’t have bad nights. It doesn’t skip steps because it’s tired. It always checks deployment history. For teams that struggle with consistent incident response quality, that consistency has real value.

Teams evaluating whether to build custom investigation agents on top of Bedrock — rather than adopting the managed DevOps Agent — should read the Bedrock Agents vs direct Nova Pro API comparison. The token overhead, latency, and orchestration tradeoffs are directly relevant when deciding how much of the agent logic you want to own versus delegate to a managed service.

For the observability infrastructure that makes the DevOps Agent most effective, the AWS X-Ray distributed tracing guide covers the trace instrumentation that feeds the agent’s dependency analysis. The Prometheus and Grafana on EKS guide covers the metrics layer for EKS workloads where the agent can be integrated alongside cluster-level alerting.

Bits Lovers

Professional writer and blogger. Focus on Cloud Computing.
