AWS X-Ray: Distributed Tracing for Debugging Microservices
X-Ray answers the question that CloudWatch logs and metrics can’t: why is this specific request slow? Logs tell you something happened. Metrics tell you how often. X-Ray tells you exactly how long each component of a request took — which Lambda function, which DynamoDB call, which downstream service, which SQL query. When a user reports that checkout takes 8 seconds, X-Ray shows you it’s the payment service DynamoDB GetItem taking 7.2 seconds, not the checkout service itself.
This guide covers instrumenting Lambda, ECS, and EKS services with the X-Ray SDK and OpenTelemetry, configuring sampling to control costs, reading the Service Map and Trace List, and integrating with CloudWatch ServiceLens.
How X-Ray Works
Each request generates a trace — a collection of segments and subsegments representing the work done. A segment is the top-level record for a single service (the Lambda function, the ECS task). Subsegments record calls made from within that service: HTTP requests to downstream services, SDK calls to DynamoDB or S3, SQL queries, custom annotations.
The X-Ray daemon (or AWS Distro for OpenTelemetry collector) receives segment data from your instrumented code and batches it to the X-Ray API. Lambda includes the daemon automatically; for ECS and EKS you run it as a sidecar.
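Segments belonging to one request are stitched together by the trace ID, which travels between services in the X-Amzn-Trace-Id HTTP header. A quick sketch of pulling that header apart in Python (`parse_trace_header` is an illustrative helper, not an SDK function):

```python
# Parse an X-Amzn-Trace-Id header into its components.
# Format: Root=1-<8 hex epoch>-<24 hex random>;Parent=<16 hex segment id>;Sampled=<0|1>
def parse_trace_header(header: str) -> dict:
    fields = {}
    for part in header.split(";"):
        key, _, value = part.partition("=")
        fields[key.strip()] = value
    return fields

example = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
trace = parse_trace_header(example)
# trace["Root"] is the trace ID shared by every segment in the trace;
# trace["Sampled"] == "1" tells downstream services to record segments too.
```

Instrumented SDKs inject and read this header automatically; you only ever handle it by hand when bridging an uninstrumented hop.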
Lambda: Active Tracing
The simplest X-Ray setup is Lambda with active tracing enabled:
```bash
# Enable active tracing on a Lambda function
aws lambda update-function-configuration \
  --function-name my-api \
  --tracing-config '{"Mode": "Active"}'

# The Lambda execution role needs X-Ray permissions
aws iam attach-role-policy \
  --role-name MyLambdaRole \
  --policy-arn arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess
```
With active tracing, Lambda automatically creates a segment for each invocation capturing duration, fault/error status, and the invocation metadata. To capture subsegments for your database calls and HTTP requests, add the SDK:
```python
# Python Lambda with X-Ray SDK
import json

import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

# Patch all supported AWS SDK clients automatically
patch_all()

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Orders')

@xray_recorder.capture('process_order')
def process_order(order_id):
    # This creates a subsegment named 'process_order'.
    # The DynamoDB call is automatically traced because we called patch_all().
    response = table.get_item(Key={'orderId': order_id})

    # Add custom annotations (indexed, searchable in X-Ray)
    xray_recorder.current_subsegment().put_annotation('order_id', order_id)

    # Add metadata (not indexed, but visible in the trace)
    xray_recorder.current_subsegment().put_metadata('order_data', response.get('Item'))

    return response.get('Item')

def handler(event, context):
    order_id = event['pathParameters']['orderId']
    order = process_order(order_id)
    return {
        'statusCode': 200,
        'body': json.dumps(order)
    }
```
patch_all() instruments boto3, requests, httplib, and other common libraries. Every AWS SDK call (DynamoDB, S3, SQS, SNS) becomes a traced subsegment automatically. Every outbound HTTP call via requests gets a subsegment showing the URL, status code, and duration.
put_annotation adds indexed key-value pairs. Annotations are searchable — you can filter traces by annotation.order_id = "12345". Use annotations for identifiers you’ll want to look up (user IDs, order IDs, request IDs).
put_metadata adds non-indexed data. Metadata isn’t searchable but is visible when you view a specific trace. Use it for response payloads or debugging context you want alongside the trace.
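A practical consequence of the indexing split: annotation values must be simple scalars (strings, numbers, booleans), while metadata accepts any JSON-serializable object. A small illustrative helper (not part of the SDK; `FakeSubsegment` is a test stand-in for the object `xray_recorder` gives you) that routes a value to the right call:

```python
import json

ANNOTATION_TYPES = (str, int, float, bool)

def record_trace_data(subsegment, key, value):
    """Put scalars as searchable annotations, everything else as metadata."""
    if isinstance(value, ANNOTATION_TYPES):
        subsegment.put_annotation(key, value)
    else:
        # Metadata must still be JSON-serializable to show up in the trace
        subsegment.put_metadata(key, json.loads(json.dumps(value)))

class FakeSubsegment:
    """Stand-in for testing; real subsegments come from xray_recorder."""
    def __init__(self):
        self.annotations, self.metadata = {}, {}
    def put_annotation(self, k, v):
        self.annotations[k] = v
    def put_metadata(self, k, v):
        self.metadata[k] = v

seg = FakeSubsegment()
record_trace_data(seg, 'order_id', '12345')          # scalar -> annotation
record_trace_data(seg, 'order_data', {'total': 42})  # dict -> metadata
```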
ECS: X-Ray Sidecar
For ECS tasks, run the X-Ray daemon as a sidecar container:
```json
{
  "family": "my-api-task",
  "taskRoleArn": "arn:aws:iam::123456789012:role/MyTaskRole",
  "containerDefinitions": [
    {
      "name": "my-api",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-api:latest",
      "environment": [
        {"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "127.0.0.1:2000"}
      ]
    },
    {
      "name": "xray-daemon",
      "image": "amazon/aws-xray-daemon",
      "portMappings": [
        {"containerPort": 2000, "protocol": "udp"}
      ],
      "cpu": 32,
      "memory": 256,
      "essential": false
    }
  ]
}
```
The sidecar listens on UDP port 2000 and forwards segments to the X-Ray API. The AWS_XRAY_DAEMON_ADDRESS environment variable tells the SDK where to send segments. The task role must have xray:PutTraceSegments and xray:PutTelemetryRecords permissions.
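For context on what the sidecar actually receives: each UDP datagram is a one-line JSON header followed by a segment document. A minimal sketch of framing such a datagram by hand (field values are illustrative; in real code the SDK does this for you):

```python
import json
import os
import time

def build_daemon_payload(segment: dict) -> bytes:
    """Frame a segment document in the X-Ray daemon's UDP wire format."""
    header = b'{"format": "json", "version": 1}\n'
    return header + json.dumps(segment).encode("utf-8")

start = time.time()
segment = {
    "name": "my-api",
    "id": os.urandom(8).hex(),  # 16-hex-char segment ID
    "trace_id": "1-%08x-%s" % (int(start), os.urandom(12).hex()),
    "start_time": start,
    "end_time": time.time(),
}
payload = build_daemon_payload(segment)
# A real sender would now do:
#   socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(payload, ("127.0.0.1", 2000))
```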
For a Node.js service in ECS:
```javascript
// app.js
const AWSXRay = require('aws-xray-sdk');
const AWS = AWSXRay.captureAWS(require('aws-sdk'));
const express = require('express');

// Patch the https module in place so outbound HTTPS calls are traced
AWSXRay.captureHTTPsGlobal(require('https'));

const app = express();

// Add X-Ray middleware (creates a segment for each request)
app.use(AWSXRay.express.openSegment('my-api'));

app.get('/orders/:id', async (req, res) => {
  const segment = AWSXRay.getSegment();
  // Create a subsegment for custom work
  const subsegment = segment.addNewSubsegment('fetch-order');
  try {
    const dynamodb = new AWS.DynamoDB.DocumentClient();
    const result = await dynamodb.get({
      TableName: 'Orders',
      Key: { orderId: req.params.id }
    }).promise();
    subsegment.addAnnotation('orderId', req.params.id);
    subsegment.close();
    res.json(result.Item);
  } catch (err) {
    subsegment.addError(err);
    subsegment.close();
    throw err;
  }
});

app.use(AWSXRay.express.closeSegment());
app.listen(8080);
```
EKS: OpenTelemetry Collector
For EKS, the recommended approach is OpenTelemetry with the AWS Distro for OpenTelemetry (ADOT) collector sending traces to X-Ray. This avoids the X-Ray SDK dependency and works with any OpenTelemetry-compatible tracing:
```yaml
# adot-collector-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: adot-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: adot-collector
  template:
    metadata:
      labels:
        app: adot-collector
    spec:
      serviceAccountName: adot-collector
      containers:
        - name: adot-collector
          image: public.ecr.aws/aws-observability/aws-otel-collector:latest
          args: ["--config=/etc/adot-config.yaml"]
          env:
            - name: AWS_REGION
              value: us-east-1
          volumeMounts:
            - name: adot-config
              mountPath: /etc/adot-config.yaml
              subPath: adot-config.yaml
          ports:
            - containerPort: 4317  # OTLP gRPC
            - containerPort: 4318  # OTLP HTTP
      volumes:
        - name: adot-config
          configMap:
            name: adot-config
---
# adot-config ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: adot-config
  namespace: observability
data:
  adot-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      awsxray:
        region: us-east-1
      awsemf:
        region: us-east-1
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [awsxray]
        metrics:
          receivers: [otlp]
          exporters: [awsemf]
```
Your application code uses the OpenTelemetry SDK (not the X-Ray SDK), and the ADOT collector handles conversion to X-Ray format. This is the forward-looking approach — the X-Ray SDK is AWS-specific; OpenTelemetry works with any backend (Jaeger, Zipkin, Datadog) without code changes.
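As a sketch of the application side (assuming the opentelemetry-sdk, opentelemetry-exporter-otlp-proto-grpc, and opentelemetry-sdk-extension-aws packages are installed, and a collector reachable at localhost:4317), the wiring might look like:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.extension.aws.trace import AwsXRayIdGenerator

# X-Ray trace IDs embed a timestamp, so use the X-Ray-compatible ID generator
provider = TracerProvider(id_generator=AwsXRayIdGenerator())
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process-order") as span:
    # Span attributes play the role of X-Ray annotations
    span.set_attribute("order.id", "12345")
```

For propagation across services you would also configure the X-Ray propagator so the trace context travels in the X-Amzn-Trace-Id header rather than the default W3C traceparent header.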
Sampling Rules
Every traced request incurs X-Ray cost ($5 per million traces recorded). Sampling rules control what fraction of requests get traced.
```bash
# Create a sampling rule
aws xray create-sampling-rule \
  --sampling-rule '{
    "RuleName": "api-sampling",
    "ResourceARN": "*",
    "Priority": 100,
    "FixedRate": 0.05,
    "ReservoirSize": 5,
    "ServiceName": "my-api",
    "ServiceType": "AWS::Lambda::Function",
    "Host": "*",
    "HTTPMethod": "*",
    "URLPath": "*",
    "Version": 1
  }'
```
```bash
# Higher-priority rule: trace every request on a critical path
aws xray create-sampling-rule \
  --sampling-rule '{
    "RuleName": "checkout-sampling",
    "ResourceARN": "*",
    "Priority": 1,
    "FixedRate": 1.0,
    "ReservoirSize": 100,
    "ServiceName": "my-api",
    "ServiceType": "*",
    "Host": "*",
    "HTTPMethod": "*",
    "URLPath": "/checkout/*",
    "Version": 1
  }'
```

ReservoirSize: 5 means trace at least 5 requests per second regardless of FixedRate. After the reservoir fills, FixedRate: 0.05 traces 5% of remaining requests. This guarantees a minimum volume for low-traffic periods while capping cost at high traffic.

The checkout rule with FixedRate: 1.0 traces every request on a critical path; X-Ray evaluates rules in ascending priority order (valid priorities are 1 to 9999), so give specific rules a lower priority number than your broad default. Note that sampling rules cannot match on response status: the sampling decision is head-based, made when the request starts, before the outcome is known. To guarantee a trace for every failure, sample the paths where failures matter at 100%, or adopt tail-based sampling in an OpenTelemetry collector pipeline, which buffers spans and can keep complete traces for requests that ended in errors.
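The reservoir-plus-rate behavior is easy to sanity-check with arithmetic. A rough model (it ignores that instrumented services actually borrow reservoir quota from the X-Ray service in one-second windows):

```python
def traces_per_second(request_rate: float, reservoir: int, fixed_rate: float) -> float:
    """Approximate sampled traces/sec for a single sampling rule."""
    if request_rate <= reservoir:
        return request_rate  # the reservoir covers everything
    # Reservoir takes the first N requests, fixed rate applies to the rest
    return reservoir + (request_rate - reservoir) * fixed_rate

# Low traffic: 2 req/s is fully traced by a reservoir of 5
low = traces_per_second(2, 5, 0.05)
# High traffic: 1000 req/s -> 5 + 995 * 0.05, roughly 55 traces/s
high = traces_per_second(1000, 5, 0.05)
```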
Reading Traces in the Console
The X-Ray console shows three main views:
Service Map: A visual graph of your services with average latency and error rates on each connection. A thick red line between two nodes means errors on that call path. The diameter of each node indicates call volume. This is the first place to look when diagnosing issues — find the node with the highest latency or error rate.
Trace List: Individual traces filterable by time range, service, URL, annotation, response code, and duration. Sort by duration descending to find the slowest requests. The filter syntax is SQL-like:
```
# Slow requests (over 5 seconds) to the Orders service, tagged as bulk orders
service("Orders") AND responsetime > 5 AND annotation.order_type = "bulk"

# Find all traces with errors from a specific service
service("payment-service") AND fault = true

# Traces for a specific user
annotation.user_id = "user-12345"
```
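Filter expressions are plain strings, so they are easy to assemble programmatically, for example before passing them as the FilterExpression parameter of the GetTraceSummaries API via boto3. A small illustrative builder (the function name is made up for this sketch):

```python
def annotation_filter(**annotations) -> str:
    """Build an X-Ray filter expression from annotation equality checks."""
    clauses = []
    for key, value in annotations.items():
        if isinstance(value, str):
            clauses.append(f'annotation.{key} = "{value}"')
        else:
            clauses.append(f'annotation.{key} = {value}')
    return " AND ".join(clauses)

expr = annotation_filter(user_id="user-12345", order_type="bulk")
# expr is now a filter string usable in the console or the API
```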
Analytics: Run aggregate queries over traces. Build a histogram of response times, compare latency percentiles across time windows, break down error rates by annotation values.
CloudWatch ServiceLens
ServiceLens integrates X-Ray traces with CloudWatch metrics and logs in a single view:
```bash
# Enable Container Insights (required for ECS/EKS ServiceLens integration)
aws ecs put-account-setting \
  --name containerInsights \
  --value enabled
```

ServiceLens automatically correlates X-Ray traces, CloudWatch metrics (request count, latency, error rate), and CloudWatch Logs lines from the same request.
In ServiceLens, clicking a node on the service map shows the metrics, logs, and traces for that service in one panel. Seeing a spike in a CloudWatch metric, clicking through to traces, and seeing the specific slow call in the log stream — without switching between tabs — is where the operational value of X-Ray becomes clear.
Pricing and Cost Control
X-Ray pricing: first 100,000 traces/month free, then $5 per million traces recorded, $0.50 per million traces retrieved, $0.50 per million traces scanned.
At 5% sampling on a service handling 10 million requests/month: 500,000 traced requests, 400,000 above the free tier = $2/month. Entirely reasonable. At 100% sampling on the same service: $50/month. Control costs with sampling rules before enabling across the board.
The most cost-effective approach: 1-5% sampling for normal requests, 100% for errors, 100% for requests matching specific annotations (debugging a specific user or order ID). This gives complete coverage where it matters without sampling-away rare edge cases.
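The arithmetic above generalizes to a quick cost model (recording cost only; retrieval and scanning are billed separately, and these are the published rates at the time of writing):

```python
FREE_TRACES = 100_000        # free tier: traces recorded per month
PRICE_PER_MILLION = 5.00     # USD per million traces recorded

def monthly_recording_cost(requests_per_month: int, sampling_rate: float) -> float:
    """Estimated monthly X-Ray recording cost in USD."""
    traced = requests_per_month * sampling_rate
    billable = max(0.0, traced - FREE_TRACES)
    return billable / 1_000_000 * PRICE_PER_MILLION

# The article's example: 10M requests/month at 5% sampling -> $2/month
cost_sampled = monthly_recording_cost(10_000_000, 0.05)
# Same service at 100% sampling -> about $50/month
cost_full = monthly_recording_cost(10_000_000, 1.0)
```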
For services where tracing data is used beyond debugging — feeding latency data to cost models or SLO calculations — pairing this guide with the CloudWatch deep dive gives you the complete observability picture. The Prometheus and Grafana on EKS guide covers the metrics side of the observability stack for EKS workloads that use ADOT for tracing.