AWS X-Ray: Distributed Tracing for Debugging Microservices

Written by Bits Lovers

X-Ray answers the question that CloudWatch logs and metrics can’t: why is this specific request slow? Logs tell you something happened. Metrics tell you how often. X-Ray tells you exactly how long each component of a request took — which Lambda function, which DynamoDB call, which downstream service, which SQL query. When a user reports that checkout takes 8 seconds, X-Ray shows you it’s the payment service DynamoDB GetItem taking 7.2 seconds, not the checkout service itself.

This guide covers instrumenting Lambda, ECS, and EKS services with the X-Ray SDK and OpenTelemetry, configuring sampling to control costs, reading the Service Map and Trace List, and integrating with CloudWatch ServiceLens.

How X-Ray Works

Each request generates a trace — a collection of segments and subsegments representing the work done. A segment is the top-level record for a single service (the Lambda function, the ECS task). Subsegments record calls made from within that service: HTTP requests to downstream services, SDK calls to DynamoDB or S3, SQL queries, custom annotations.

The X-Ray daemon (or AWS Distro for OpenTelemetry collector) receives segment data from your instrumented code and batches it to the X-Ray API. Lambda includes the daemon automatically; for ECS and EKS you run it as a sidecar.
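The pieces are tied together by the trace header (X-Amzn-Trace-Id) that each instrumented service propagates to the next hop. The format is documented by AWS; the helper functions below are our own sketch of constructing one, not part of any SDK:

```python
import os
import time

def new_trace_id():
    # X-Ray trace ID format: 1-{epoch seconds as 8 hex digits}-{96 random bits as 24 hex digits}
    return "1-{:08x}-{}".format(int(time.time()), os.urandom(12).hex())

def trace_header(trace_id, parent_id=None, sampled=True):
    # Value for the X-Amzn-Trace-Id HTTP header propagated downstream
    parts = ["Root=" + trace_id]
    if parent_id:
        parts.append("Parent=" + parent_id)
    parts.append("Sampled={}".format(1 if sampled else 0))
    return ";".join(parts)

print(trace_header(new_trace_id(), parent_id=os.urandom(8).hex()))
```

Downstream services parse this header to attach their segments to the same trace, and the Sampled flag is how a sampling decision made at the edge is honored all the way down the call chain.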

Lambda: Active Tracing

The simplest X-Ray setup is Lambda with active tracing enabled:

# Enable active tracing on a Lambda function
aws lambda update-function-configuration \
  --function-name my-api \
  --tracing-config '{"Mode": "Active"}'

# The Lambda execution role needs X-Ray permissions
aws iam attach-role-policy \
  --role-name MyLambdaRole \
  --policy-arn arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess

With active tracing, Lambda automatically creates a segment for each invocation capturing duration, fault/error status, and the invocation metadata. To capture subsegments for your database calls and HTTP requests, add the SDK:

# Python Lambda with X-Ray SDK
import json

import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

# Patch all supported AWS SDK clients automatically
patch_all()

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Orders')

@xray_recorder.capture('process_order')
def process_order(order_id):
    # This creates a subsegment named 'process_order'
    
    # DynamoDB call is automatically traced because we called patch_all()
    response = table.get_item(Key={'orderId': order_id})
    
    # Add custom annotations (indexed, searchable in X-Ray)
    xray_recorder.current_subsegment().put_annotation('order_id', order_id)
    
    # Add metadata (not indexed, but visible in trace)
    xray_recorder.current_subsegment().put_metadata('order_data', response.get('Item'))
    
    return response.get('Item')

def handler(event, context):
    order_id = event['pathParameters']['orderId']
    order = process_order(order_id)
    return {
        'statusCode': 200,
        'body': json.dumps(order)
    }

patch_all() instruments boto3, requests, httplib, and other common libraries. Every AWS SDK call (DynamoDB, S3, SQS, SNS) becomes a traced subsegment automatically. Every outbound HTTP call via requests gets a subsegment showing the URL, status code, and duration.

put_annotation adds indexed key-value pairs. Annotations are searchable — you can filter traces by annotation.order_id = "12345". Use annotations for identifiers you’ll want to look up (user IDs, order IDs, request IDs).

put_metadata adds non-indexed data. Metadata isn’t searchable but is visible when you view a specific trace. Use it for response payloads or debugging context you want alongside the trace.
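Because annotations are indexed, you can also look traces up programmatically. A small sketch using the X-Ray GetTraceSummaries API via a boto3 client — the build_filter and find_traces helpers are ours, and the API call only runs with valid AWS credentials:

```python
from datetime import datetime, timedelta, timezone

def build_filter(**annotations):
    # Each annotation is indexed, so it can appear in a filter expression
    return " AND ".join(
        'annotation.{} = "{}"'.format(key, value)
        for key, value in sorted(annotations.items())
    )

def find_traces(xray_client, **annotations):
    # Pages through GetTraceSummaries for the last hour of matching traces
    end = datetime.now(timezone.utc)
    kwargs = {
        "StartTime": end - timedelta(hours=1),
        "EndTime": end,
        "FilterExpression": build_filter(**annotations),
    }
    while True:
        page = xray_client.get_trace_summaries(**kwargs)
        yield from page["TraceSummaries"]
        if "NextToken" not in page:
            break
        kwargs["NextToken"] = page["NextToken"]

print(build_filter(order_id="12345"))  # annotation.order_id = "12345"
```

Usage would be something like `find_traces(boto3.client("xray"), order_id="12345")` — the same expression syntax works in the console's Trace List filter box.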

ECS: X-Ray Sidecar

For ECS tasks, run the X-Ray daemon as a sidecar container:

{
  "family": "my-api-task",
  "taskRoleArn": "arn:aws:iam::123456789012:role/MyTaskRole",
  "containerDefinitions": [
    {
      "name": "my-api",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-api:latest",
      "environment": [
        {"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "127.0.0.1:2000"}
      ]
    },
    {
      "name": "xray-daemon",
      "image": "amazon/aws-xray-daemon",
      "portMappings": [
        {"containerPort": 2000, "protocol": "udp"}
      ],
      "cpu": 32,
      "memory": 256,
      "essential": false
    }
  ]
}

The sidecar listens on UDP port 2000 and forwards segments to the X-Ray API. The AWS_XRAY_DAEMON_ADDRESS environment variable tells the SDK where to send segments. The task role must have xray:PutTraceSegments and xray:PutTelemetryRecords permissions.
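What the SDK actually sends to that address is a small UDP datagram: a one-line JSON header followed by a segment document. A rough sketch of the wire format, with the segment trimmed to the minimum required fields (the emit_segment helper is ours, not part of the SDK):

```python
import json
import os
import socket
import time

DAEMON_HEADER = b'{"format": "json", "version": 1}\n'

def emit_segment(name, start, end, address="127.0.0.1:2000"):
    # Build a minimal X-Ray segment document
    segment = {
        "name": name,
        "id": os.urandom(8).hex(),
        "trace_id": "1-{:08x}-{}".format(int(start), os.urandom(12).hex()),
        "start_time": start,
        "end_time": end,
    }
    payload = DAEMON_HEADER + json.dumps(segment).encode()
    host, port = address.split(":")
    # UDP is fire-and-forget: sendto succeeds even if no daemon is listening
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, int(port)))
    sock.close()
    return payload

now = time.time()
payload = emit_segment("my-api", now - 0.25, now)
print(payload)
```

This is why a misconfigured daemon address fails silently: the send succeeds whether or not anything is on the other end, and segments simply never reach the X-Ray API.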

For a Node.js service in ECS:

// app.js
const AWSXRay = require('aws-xray-sdk');
const AWS = AWSXRay.captureAWS(require('aws-sdk'));
const express = require('express');
AWSXRay.captureHTTPsGlobal(require('https')); // patches https in place; outbound calls get subsegments

const app = express();

// Add X-Ray middleware (creates a segment for each request)
app.use(AWSXRay.express.openSegment('my-api'));

app.get('/orders/:id', async (req, res) => {
    const segment = AWSXRay.getSegment();
    
    // Create a subsegment for custom work
    const subsegment = segment.addNewSubsegment('fetch-order');
    
    try {
        const dynamodb = new AWS.DynamoDB.DocumentClient();
        const result = await dynamodb.get({
            TableName: 'Orders',
            Key: { orderId: req.params.id }
        }).promise();
        
        subsegment.addAnnotation('orderId', req.params.id);
        subsegment.close();
        
        res.json(result.Item);
    } catch (err) {
        subsegment.addError(err);
        subsegment.close();
        throw err;
    }
});

app.use(AWSXRay.express.closeSegment());

app.listen(8080);

EKS: OpenTelemetry Collector

For EKS, the recommended approach is OpenTelemetry with the AWS Distro for OpenTelemetry (ADOT) collector sending traces to X-Ray. This avoids the X-Ray SDK dependency and works with any OpenTelemetry-compatible tracing:

# adot-collector-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: adot-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: adot-collector
  template:
    metadata:
      labels:
        app: adot-collector
    spec:
      serviceAccountName: adot-collector
      containers:
        - name: adot-collector
          image: public.ecr.aws/aws-observability/aws-otel-collector:latest
          env:
            - name: AWS_REGION
              value: us-east-1
          volumeMounts:
            - name: adot-config
              mountPath: /etc/adot-config.yaml
              subPath: adot-config.yaml
          ports:
            - containerPort: 4317  # OTLP gRPC
            - containerPort: 4318  # OTLP HTTP
      volumes:
        - name: adot-config
          configMap:
            name: adot-config
---
# adot-config ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: adot-config
  namespace: observability
data:
  adot-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    
    exporters:
      awsxray:
        region: us-east-1
      awscloudwatchmetrics:
        region: us-east-1
    
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [awsxray]
        metrics:
          receivers: [otlp]
          exporters: [awscloudwatchmetrics]

Your application code uses the OpenTelemetry SDK (not the X-Ray SDK), and the ADOT collector handles conversion to X-Ray format. This is the forward-looking approach — the X-Ray SDK is AWS-specific; OpenTelemetry works with any backend (Jaeger, Zipkin, Datadog) without code changes.

Sampling Rules

Every request traced means every request incurs X-Ray cost ($5 per million traces recorded). Sampling rules control what fraction of requests get traced.

# Create a sampling rule
aws xray create-sampling-rule \
  --sampling-rule '{
    "RuleName": "api-sampling",
    "ResourceARN": "*",
    "Priority": 1,
    "FixedRate": 0.05,
    "ReservoirSize": 5,
    "ServiceName": "my-api",
    "ServiceType": "AWS::Lambda::Function",
    "Host": "*",
    "HTTPMethod": "*",
    "URLPath": "*",
    "Version": 1
  }'

# High-priority sampling rule: trace every request on the checkout path
aws xray create-sampling-rule \
  --sampling-rule '{
    "RuleName": "checkout-sampling",
    "ResourceARN": "*",
    "Priority": 1,
    "FixedRate": 1.0,
    "ReservoirSize": 100,
    "ServiceName": "my-api",
    "ServiceType": "*",
    "Host": "*",
    "HTTPMethod": "*",
    "URLPath": "/checkout*",
    "Version": 1
  }'

ReservoirSize: 5 means trace at least 5 requests per second regardless of FixedRate. After the reservoir fills, FixedRate: 0.05 traces 5% of remaining requests. This guarantees a minimum trace volume during low-traffic periods while capping cost at high traffic.

The checkout rule with FixedRate: 1.0 and the lower priority number (rules are evaluated in ascending priority order; the valid range is 1-9999) traces every request on the critical path. Note that the sampling decision is made when a request starts, before the response status is known, so rules cannot match on response codes like 5xx. When failures on a path matter, trace 100% of that path rather than hoping a 5% sample catches them.
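The expected trace volume under a reservoir-plus-rate rule is easy to sketch — our helper below mirrors how X-Ray applies the reservoir first and the fixed rate only to the remainder:

```python
def expected_traces_per_second(request_rate, reservoir, fixed_rate):
    # The reservoir takes the first N requests each second; the fixed
    # rate applies only to whatever traffic is left over.
    if request_rate <= reservoir:
        return request_rate
    return reservoir + (request_rate - reservoir) * fixed_rate

# Low traffic: 3 req/s fits entirely inside a reservoir of 5
print(expected_traces_per_second(3, 5, 0.05))     # 3
# High traffic: 1000 req/s -> 5 + 995 * 0.05 = 54.75 traces/s
print(expected_traces_per_second(1000, 5, 0.05))  # 54.75
```

The takeaway: trace volume grows sublinearly with traffic, so a rule tuned for quiet periods won't blow up the bill during a spike.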

Reading Traces in the Console

The X-Ray console shows three main views:

Service Map: A visual graph of your services with average latency and error rates on each connection. A thick red line between two nodes means errors on that call path. The diameter of each node indicates call volume. This is the first place to look when diagnosing issues — find the node with the highest latency or error rate.

Trace List: Individual traces filterable by time range, service, URL, annotation, response code, and duration. Sort by duration descending to find the slowest requests. Filtering uses X-Ray's expression syntax:

# Filter traces to find slow bulk requests in the Orders service
service("Orders") AND responsetime > 5 AND annotation.order_type = "bulk"

# Find all traces with errors from a specific service
service("payment-service") AND fault = true

# Traces for a specific user
annotation.user_id = "user-12345"

Analytics: Run aggregate queries over traces. Build a histogram of response times, compare latency percentiles across time windows, break down error rates by annotation values.

CloudWatch ServiceLens

ServiceLens integrates X-Ray traces with CloudWatch metrics and logs in a single view:

# Enable Container Insights (required for ECS/EKS ServiceLens integration)
aws ecs put-account-setting \
  --name containerInsights \
  --value enabled

# ServiceLens automatically correlates:
# - X-Ray traces
# - CloudWatch metrics (request count, latency, error rate)
# - CloudWatch Logs (log lines from the same request)

In ServiceLens, clicking a node on the service map shows the metrics, logs, and traces for that service in one panel. Seeing a spike in a CloudWatch metric, clicking through to traces, and seeing the specific slow call in the log stream — without switching between tabs — is where the operational value of X-Ray becomes clear.

Pricing and Cost Control

X-Ray pricing: first 100,000 traces/month free, then $5 per million traces recorded, $0.50 per million traces retrieved, $0.50 per million traces scanned.

At 5% sampling on a service handling 10 million requests/month: 500,000 traced requests, 400,000 above the free tier = $2/month. Entirely reasonable. At 100% sampling on the same service: $50/month. Control costs with sampling rules before enabling across the board.
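That arithmetic generalizes into a quick estimator. The helper below uses the recording price and free tier quoted above (check current AWS pricing before relying on the numbers):

```python
FREE_TRACES = 100_000
PRICE_PER_MILLION_RECORDED = 5.00  # USD, figure quoted above

def monthly_recording_cost(requests_per_month, sampling_rate):
    # Traces recorded, minus the free tier, billed at $5 per million
    recorded = requests_per_month * sampling_rate
    billable = max(0, recorded - FREE_TRACES)
    return billable / 1_000_000 * PRICE_PER_MILLION_RECORDED

print(monthly_recording_cost(10_000_000, 0.05))  # 2.0
print(monthly_recording_cost(10_000_000, 1.0))   # 49.5
```

Note this covers recording only; retrieval and scanning are billed separately at $0.50 per million, though those charges only accrue when you actually query traces.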

The most cost-effective approach: 1-5% sampling for normal requests, 100% for critical paths, and 100% for requests matching specific attributes available at request time (a particular endpoint or customer tier you're debugging). Because sampling is decided when a request starts, you can't retroactively sample on outcomes — give full coverage up front to the paths where rare edge cases matter.

For services where tracing data is used beyond debugging — feeding latency data into cost models or SLO calculations — pairing X-Ray with the CloudWatch deep dive guide gives you the complete observability picture. The Prometheus and Grafana on EKS guide covers the metrics side of the stack for EKS workloads that use ADOT for tracing.

Bits Lovers

Professional writer and blogger. Focus on Cloud Computing.