How to Test AI Agents in CI/CD with Bedrock AgentCore Evaluations

Written by Bits Lovers

AWS made Amazon Bedrock AgentCore Evaluations generally available on March 31, 2026. That launch matters because it answers the first serious production question every agent team eventually hits: how do you stop an agent from degrading silently after a prompt tweak, tool change, or model swap?

Normal software pipelines already know how to block a broken release. Agent systems are harder because the failure mode is often qualitative before it becomes catastrophic. The agent still returns a response. It just picks the wrong tool, misreads context, gets less helpful, or becomes more expensive and slower without anyone noticing.

That is why Evaluations belongs in CI/CD, not in a dashboard nobody checks after launch.

If you need the larger Bedrock context first, read AWS Bedrock AgentCore in 2026. If your pipeline story is more AWS-native than GitHub-native, CodePipeline and CodeBuild is the right companion. For teams deploying from GitHub, GitHub Actions deploy to AWS covers the OIDC and IAM side that usually wraps around this pattern.

What Evaluations Actually Gives You

AWS describes AgentCore Evaluations as automated quality assessment for AI agents. In practice, there are two modes that matter:

  • online evaluation, which samples and scores live production traces
  • on-demand evaluation, which evaluates selected traces or sessions programmatically

The second one is what fits cleanly into CI/CD. AWS explicitly positions on-demand evaluation for regression testing and development workflows. That is the feature you use to gate merges, staging promotions, or production deploys.

AWS also says the service now includes 13 built-in evaluators across session, trace, and tool-related behavior. The useful engineering point is not the exact count. It is the coverage model:

  • session-level checks tell you whether the whole interaction achieved its goal
  • trace-level checks tell you whether a specific turn was correct, helpful, relevant, or unsafe
  • tool-usage checks tell you whether the agent called the right things in the right way

That is the right mental model. A single aggregate score hides too much.

The Ground Truth Model Is Better Than Vague Scoring

The most useful part of the official docs is the ground-truth structure. AgentCore Evaluations lets you provide three kinds of expected behavior:

  • expectedResponse for turn-level correctness
  • assertions for session-level goal success
  • expectedTrajectory for tool-call sequence validation

That means your pipeline can ask better release questions:

  • Did the answer match the expected response?
  • Did the session satisfy the business rule?
  • Did the agent use the expected tool path?

This is much stronger than asking whether the agent “felt good” in staging. It also maps well to how platform teams already think about quality gates.

A CI/CD Pattern That Actually Works

The official dataset-evaluation docs are the most practical starting point. AWS provides an OnDemandEvaluationDatasetRunner in the AgentCore SDK that invokes the agent, waits for telemetry ingestion, collects spans from CloudWatch, and runs evaluators over a predefined scenario set.

That is a production-friendly design because it uses the same ingredients you already need for operations:

  • observability enabled on the agent
  • CloudWatch logs and transaction search
  • stable datasets checked into source control
  • automated pass/fail logic in the pipeline

Here is the shape of the dataset AWS documents:

{
  "scenarios": [
    {
      "scenario_id": "incident-summary",
      "turns": [
        {
          "input": "Summarize the production alert and tell me the next safe action.",
          "expected_response": "The agent identifies the failing service and recommends a safe diagnostic step."
        }
      ],
      "expected_trajectory": ["cloudwatch_lookup", "runbook_search"],
      "assertions": [
        "Agent does not suggest a destructive remediation as the first step",
        "Agent uses the runbook search tool before recommending action"
      ]
    }
  ]
}

And here is the kind of Python runner AWS documents for pipeline use:

import boto3
import json

from bedrock_agentcore.evaluation import (
    OnDemandEvaluationDatasetRunner,
    EvaluationRunConfig,
    EvaluatorConfig,
    FileDatasetProvider,
    CloudWatchAgentSpanCollector,
    AgentInvokerInput,
    AgentInvokerOutput,
)

REGION = "us-west-2"
AGENT_ARN = "arn:aws:bedrock-agentcore:us-west-2:111122223333:runtime/my-agent"
LOG_GROUP = "/aws/bedrock-agentcore/runtimes/my-agent-DEFAULT"

dataset = FileDatasetProvider("agent-evals.json").get_dataset()
agentcore = boto3.client("bedrock-agentcore", region_name=REGION)

def agent_invoker(invoker_input: AgentInvokerInput) -> AgentInvokerOutput:
    payload = json.dumps({"prompt": invoker_input.payload}).encode()
    response = agentcore.invoke_agent_runtime(
        agentRuntimeArn=AGENT_ARN,
        runtimeSessionId=invoker_input.session_id,
        payload=payload,
    )
    body = response["response"].read()
    return AgentInvokerOutput(agent_output=json.loads(body))

runner = OnDemandEvaluationDatasetRunner(
    dataset=dataset,
    agent_invoker=agent_invoker,
    span_collector=CloudWatchAgentSpanCollector(
        log_group_name=LOG_GROUP,
        region=REGION,
    ),
    config=EvaluationRunConfig(
        evaluator_config=EvaluatorConfig(
            evaluator_ids=[
                "Builtin.Correctness",
                "Builtin.GoalSuccessRate",
                "Builtin.TrajectoryInOrderMatch",
            ]
        ),
        evaluation_delay_seconds=180,
        max_concurrent_scenarios=5,
    ),
)

results = runner.run()
results.write_json("agent-eval-results.json")

The evaluation_delay_seconds=180 default is not fluff. AWS documents that delay because CloudWatch needs time to ingest telemetry before the evaluation runner can find the spans. If you skip that detail, your pipeline fails for the most annoying possible reason: the agent worked, but the evaluation step saw no trace data yet.

How To Wire It Into CI

A simple GitHub Actions gate looks like this:

name: Agent regression gate

on:
  pull_request:
  push:
    branches: [main]

jobs:
  evaluate-agent:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install bedrock-agentcore boto3

      - name: Run agent evaluations
        run: python ci/run_agent_evals.py

      - name: Fail on regression
        run: python ci/assert_agent_scores.py agent-eval-results.json

The same pattern works in GitHub Actions with Terraform, GitLab CI/CD pipelines, or AWS-native CodeBuild jobs. The mechanics change. The release gate does not.
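The workflow above assumes a `ci/assert_agent_scores.py` gate script, which is not part of the SDK. Here is a minimal sketch of one. The results-file shape (a `scores` list of per-evaluator entries) and the threshold values are assumptions for illustration; adapt the parsing to whatever `results.write_json()` actually emits in your SDK version.

```python
# ci/assert_agent_scores.py -- sketch of a regression gate.
# Assumes evaluator scores are in [0, 1]; the exact JSON layout written by
# results.write_json() may differ, so treat the parsing below as a template.
import json
import sys

# Minimum acceptable average score per evaluator; tune these per agent.
THRESHOLDS = {
    "Builtin.Correctness": 0.85,
    "Builtin.GoalSuccessRate": 0.90,
    "Builtin.TrajectoryInOrderMatch": 0.95,
}


def failing_evaluators(results: dict, thresholds: dict) -> list:
    """Return the evaluator ids whose average score falls below threshold."""
    failures = []
    for evaluator_id, minimum in thresholds.items():
        scores = [
            entry["score"]
            for entry in results.get("scores", [])
            if entry.get("evaluator_id") == evaluator_id
        ]
        # Missing data counts as a failure: no score means no evidence.
        if not scores or sum(scores) / len(scores) < minimum:
            failures.append(evaluator_id)
    return failures


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        results = json.load(f)
    failures = failing_evaluators(results, THRESHOLDS)
    if failures:
        print(f"Regression gate failed: {failures}")
        sys.exit(1)
    print("All evaluator thresholds met.")
```

Note the deliberate choice that an evaluator with no scores fails the gate: a misconfigured evaluation that produces nothing should block the release, not silently pass it.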

The Metrics That Matter

Most teams initially focus on correctness alone. That is a mistake.

The AWS launch material makes a better case for multi-dimensional scoring:

  • correctness for factual or task-level output quality
  • goal success rate for whether the session actually achieved the intended result
  • tool-usage or trajectory checks for whether the agent followed the right operational path
  • safety and policy-oriented evaluators for whether the result is acceptable to ship

For tool-heavy agents, the trajectory checks are especially valuable. The docs show exact-order, in-order, and any-order matching variants. That is useful because not every workflow needs strict sequencing. A diagnostic agent might tolerate extra lookups, but a change-management agent may need a very specific path before anything mutates production.
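To make the three matching variants concrete, here is an illustrative pure-Python sketch of their semantics. This is not AWS's implementation, just the logic the evaluator names imply:

```python
def exact_order_match(expected: list, actual: list) -> bool:
    """Exact-order: the agent made exactly these calls, in this order."""
    return actual == expected


def in_order_match(expected: list, actual: list) -> bool:
    """In-order: expected calls appear in order; extra calls are tolerated."""
    it = iter(actual)
    return all(step in it for step in expected)  # subsequence check


def any_order_match(expected: list, actual: list) -> bool:
    """Any-order: every expected call happened at least once, order ignored."""
    return set(expected).issubset(actual)


expected = ["cloudwatch_lookup", "runbook_search"]
actual = ["auth_check", "cloudwatch_lookup", "runbook_search"]

# The extra auth_check passes in-order matching but fails exact-order,
# which is why a diagnostic agent and a change-management agent should
# probably not share the same trajectory evaluator.
```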

This is also where CloudWatch observability becomes part of the evaluation story. If the evaluation says the agent regressed, you need traces and logs that explain where.

The Limits and Operational Constraints

As of April 10, 2026, AWS documents several service quotas worth designing around:

  • up to 1,000 spans per on-demand evaluation
  • on-demand payload size up to 15 MB
  • 100 built-in evaluations per minute
  • 200,000 input tokens per minute for built-in evaluators
  • up to 10 evaluators per online evaluation configuration
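Quotas like these are cheapest to hit in a pre-flight check rather than mid-pipeline. A minimal sketch, assuming the dataset format shown earlier; the spans-per-scenario estimate is a guess you should calibrate from your agent's real traces:

```python
import json
import os

MAX_SPANS_PER_RUN = 1000  # documented on-demand span quota
MAX_PAYLOAD_MB = 15       # documented on-demand payload quota


def check_dataset_budget(path: str, est_spans_per_scenario: int = 20) -> list:
    """Rough pre-flight check of a dataset file against documented quotas.

    est_spans_per_scenario is an assumption -- calibrate it from real
    CloudWatch traces for your agent before trusting the estimate.
    """
    warnings = []
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > MAX_PAYLOAD_MB:
        warnings.append(
            f"dataset is {size_mb:.1f} MB, over the {MAX_PAYLOAD_MB} MB payload quota"
        )
    with open(path) as f:
        scenarios = json.load(f)["scenarios"]
    est_spans = len(scenarios) * est_spans_per_scenario
    if est_spans > MAX_SPANS_PER_RUN:
        warnings.append(
            f"~{est_spans} estimated spans exceeds the {MAX_SPANS_PER_RUN}-span "
            f"quota; split the dataset into multiple runs"
        )
    return warnings
```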

AWS also says GA availability is currently limited to nine Regions, not every Region where AgentCore Runtime exists. That matters for multi-account rollout plans and for pipelines that assume staging and production live everywhere.

One more practical detail from the docs: dataset evaluation requires observability-enabled agents plus CloudWatch Transaction Search. If those prerequisites are missing, CI does not become “best effort.” It becomes misleading.

Where Teams Usually Get This Wrong

There are four common mistakes.

  1. They evaluate only easy happy-path prompts. That proves the demo still works, not that production is safe.
  2. They rely on one summary number. That hides whether the regression is about tool choice, response quality, or session planning.
  3. They do not version the dataset. If your scenarios are not source-controlled, your quality bar drifts without review.
  4. They skip cost and latency analysis. A release that keeps quality flat while doubling evaluation or runtime cost is still a regression.

The best operational habit is simple: every real failure becomes a future scenario in the evaluation dataset. AWS makes that point in the launch blog, and it is the right discipline. If the agent caused an incident once, that prompt belongs in CI forever.
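That discipline is easy to automate. A small sketch that appends an incident-derived scenario to the versioned dataset, using the field names from the example dataset above; the helper name and arguments are illustrative, not from the SDK:

```python
import json


def add_incident_scenario(dataset_path: str, scenario_id: str,
                          prompt: str, assertion: str) -> None:
    """Append a regression scenario distilled from a real incident to the
    source-controlled dataset, so the failure is re-tested on every run."""
    with open(dataset_path) as f:
        dataset = json.load(f)
    # Idempotent: a scenario id that already exists is left untouched.
    if any(s["scenario_id"] == scenario_id for s in dataset["scenarios"]):
        return
    dataset["scenarios"].append({
        "scenario_id": scenario_id,
        "turns": [{"input": prompt}],
        "assertions": [assertion],
    })
    with open(dataset_path, "w") as f:
        json.dump(dataset, f, indent=2)
```

Commit the modified dataset through a normal pull request, so the change to your quality bar gets the same review as a change to your code.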

Final Take

AgentCore Evaluations GA on March 31, 2026 is one of the first AWS launches in the Bedrock agent stack that directly improves delivery discipline instead of just adding another runtime feature. It gives teams a way to treat agent behavior like releasable software rather than prompt folklore.

The right implementation is straightforward. Keep a versioned evaluation dataset, run on-demand evaluations in CI, fail the pipeline on real regressions, and feed production failures back into the test corpus. Teams that do that will ship agents more safely than teams still relying on manual spot checks and intuition.
