Amazon ECS Service Connect: Service-to-Service Networking on ECS

Written by Bits Lovers

AWS App Mesh reaches end of support on September 30, 2026. If you run ECS services that communicate via App Mesh, migration is required. The AWS-recommended replacement for ECS workloads is ECS Service Connect, which launched at re:Invent 2022. Service Connect injects an Envoy-based sidecar automatically — you don’t manage the proxy, App Mesh configuration, or virtual node definitions. You configure ports and aliases in your ECS task definition and service, and ECS handles the rest.

This guide covers how Service Connect works, the task and service configuration, cross-service communication patterns, and the CloudWatch metrics that replace whatever App Mesh observability you had.

How Service Connect Works

Service Connect creates a private namespace for your ECS cluster (backed by AWS Cloud Map) and injects an Envoy proxy sidecar into every task that enables it. The sidecar intercepts traffic between services and handles:

  • Service discovery: other services call http://orders:8080/ and the hostname resolves to the local Envoy proxy, which picks a healthy destination within the namespace — no conventional DNS lookup or TTL involved
  • Load balancing: Envoy distributes requests across all healthy task instances
  • Observability: request counts, error rates, and latency metrics flow automatically to CloudWatch
  • Retries and timeouts: configurable at the service level, not in application code

The key difference from traditional ECS Service Discovery (Cloud Map only): with pure Cloud Map, your application calls DNS to resolve a service address and connects directly. With Service Connect, traffic goes through the local Envoy proxy, which has a live view of all healthy endpoints. When a task fails health checks, Envoy stops routing to it within seconds — no DNS TTL delay.
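The TTL delay is worth making concrete. The toy resolver below (illustrative only, not the actual Cloud Map resolver) keeps serving a cached answer until the TTL expires, which is why a replaced task can keep receiving traffic for up to the full TTL window under pure Cloud Map service discovery:

```python
class TTLResolver:
    """Toy client-side DNS cache: answers are reused until the TTL expires."""

    def __init__(self, ttl_seconds, clock):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable time source (e.g. time.monotonic)
        self.cache = {}             # name -> (ips, expiry)

    def resolve(self, name, lookup):
        ips, expiry = self.cache.get(name, (None, 0.0))
        if ips is None or self.clock() >= expiry:
            ips = lookup(name)      # fresh authoritative answer
            self.cache[name] = (ips, self.clock() + self.ttl)
        return ips


# Simulated timeline with a 30-second TTL:
now = [0.0]
resolver = TTLResolver(30, lambda: now[0])

healthy = ["10.0.1.5"]
resolver.resolve("orders", lambda _: list(healthy))   # caches 10.0.1.5

healthy[:] = ["10.0.1.9"]                             # task replaced
stale = resolver.resolve("orders", lambda _: list(healthy))
# still ["10.0.1.5"] — the dead IP keeps being served from cache

now[0] = 31.0
fresh = resolver.resolve("orders", lambda _: list(healthy))
# ["10.0.1.9"] — the cache refreshes only after the TTL elapses
```

With Service Connect there is no equivalent window: Envoy's endpoint view is pushed to it, so the dead task is dropped within seconds.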

Cluster Namespace Setup

Every ECS cluster that uses Service Connect needs a default namespace. Set it when creating the cluster or update an existing one:

# Create a cluster with a Service Connect namespace
aws ecs create-cluster \
  --cluster-name production \
  --service-connect-defaults namespace=production.internal

# Update an existing cluster
aws ecs update-cluster \
  --cluster production \
  --service-connect-defaults namespace=production.internal

# Verify namespace is active
aws ecs describe-clusters \
  --cluster production \
  --query 'clusters[0].serviceConnectDefaults'
# {"namespace": "arn:aws:servicediscovery:us-east-1:123456789012:namespace/ns-XXXXX"}

The namespace name is the DNS suffix services use to reach each other. production.internal means the orders service is reachable at http://orders.production.internal or (within the same namespace) just http://orders.
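As a quick illustration of that naming rule (service and namespace names taken from the examples in this guide), both address forms point at the same Envoy-managed endpoint:

```python
def service_urls(discovery_name, namespace, port):
    """Build the short and fully qualified forms of a Service Connect address."""
    return (
        f"http://{discovery_name}:{port}",              # short form, same namespace
        f"http://{discovery_name}.{namespace}:{port}",  # fully qualified form
    )

short, qualified = service_urls("orders", "production.internal", 8080)
# short     -> "http://orders:8080"
# qualified -> "http://orders.production.internal:8080"
```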

Task Definition Configuration

The task definition itself is mostly unchanged — you’re just adding two fields to your existing portMappings: a name (which becomes the identifier other services use to find this one) and appProtocol (which tells Envoy what to expect so it can do protocol-aware load balancing):

{
  "family": "orders-service",
  "networkMode": "awsvpc",
  "containerDefinitions": [
    {
      "name": "orders",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/orders:v2.1.0",
      "portMappings": [
        {
          "name": "orders-http",
          "containerPort": 8080,
          "protocol": "tcp",
          "appProtocol": "http"
        }
      ],
      "environment": [
        {"name": "PAYMENTS_URL", "value": "http://payments:8080"}
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/orders",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "orders"
        }
      }
    }
  ],
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024"
}

The name on the port mapping ("orders-http") is what other services will use to refer to this port. The appProtocol field tells Envoy how to interpret the traffic: http for HTTP/1.1, http2 for HTTP/2 and gRPC, or grpc explicitly. Protocol-aware load balancing only works when this is set correctly.

ECS Service Configuration

One thing that trips people up: Service Connect is configured on the ECS service, not the task definition. The task definition says “I listen on port 8080, call it orders-http.” The service says “register orders-http in the namespace as orders so other services can reach me.” That split is intentional — the same task definition can be used by services with different Service Connect configurations:

# Create an ECS service with Service Connect enabled
aws ecs create-service \
  --cluster production \
  --service-name orders \
  --task-definition orders-service:5 \
  --desired-count 3 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-aaa,subnet-bbb],securityGroups=[sg-orders],assignPublicIp=DISABLED}" \
  --service-connect-configuration '{
    "enabled": true,
    "namespace": "production.internal",
    "services": [
      {
        "portName": "orders-http",
        "discoveryName": "orders",
        "clientAliases": [
          {
            "port": 8080,
            "dnsName": "orders"
          }
        ]
      }
    ]
  }'

The services array configures this service as a server: it registers orders in the namespace so other services can call it. A service that only calls other services (but doesn’t receive traffic) can enable Service Connect with "services": [] — it gets the Envoy proxy for outbound traffic but doesn’t register itself.

This distinction matters for background workers and batch processors. A consumer that reads from SQS and calls a payment API doesn’t need to register itself, but it still needs Envoy to get reliable outbound connections and CloudWatch metrics on those calls.
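A client-only configuration looks like this — a sketch of the boto3 create_service arguments for such a worker (the sqs-consumer names are hypothetical). The empty services list is the whole trick: the task gets the Envoy proxy for outbound calls but registers nothing in the namespace:

```python
# Sketch: an SQS-consuming worker that only makes outbound calls.
# "services": [] enables the Envoy proxy for outbound traffic without
# registering a discovery name that other services could call.
client_only_kwargs = {
    "cluster": "production",
    "serviceName": "sqs-consumer",          # hypothetical worker service
    "taskDefinition": "sqs-consumer:3",
    "desiredCount": 2,
    "launchType": "FARGATE",
    "serviceConnectConfiguration": {
        "enabled": True,
        "namespace": "production.internal",
        "services": [],                     # client-only: no inbound registration
    },
}

# import boto3
# boto3.client("ecs").create_service(**client_only_kwargs)
```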

Cross-Service Communication

Once both services have Service Connect enabled in the same namespace, they communicate by service name:

# In the orders service — calls payments using the short name
import requests

response = requests.post(
    "http://payments:8080/v1/charge",  # 'payments' resolves via Envoy
    json={"order_id": order_id, "amount": total},
    timeout=5.0
)

The request goes to the local Envoy proxy (the payments hostname resolves to a proxy listener address that ECS injects into the task), which forwards it to a healthy payments task instance. The calling service code has no knowledge of task IPs, availability zones, or health state — Envoy handles all of that.

For services in different clusters but the same VPC, cross-cluster Service Connect works when both clusters use the same Cloud Map namespace. Register the source namespace in the target cluster configuration:

# Allow cluster-b services to call cluster-a services
aws ecs update-cluster \
  --cluster cluster-b \
  --service-connect-defaults namespace=production.internal
  # Same namespace as cluster-a

Envoy Proxy Resource Allocation

ECS automatically adds the Envoy container to every task with Service Connect enabled. You need to account for this in your task CPU and memory:

{
  "family": "orders-service",
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [
    {
      "name": "orders",
      "cpu": 412,
      "memory": 900
    }
    // ECS adds the Envoy container automatically:
    // "name": "ecs-service-connect-agent"
    // cpu: ~100, memory: ~128MB (approximate; varies by configuration)
  ]
}

A task with 512 CPU units and 1024MB that uses all of it for the app container will OOM or get CPU-throttled because Envoy also needs resources. Budget approximately 100 CPU and 128MB memory for the proxy and size the task accordingly. For small tasks (256 CPU / 512MB), this is a significant proportion — consider 512/1024 as the minimum for Service Connect-enabled tasks.
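The sizing arithmetic above can be captured in a small helper (the ~100 CPU units / ~128MB overhead figures are the approximations already noted, not exact values):

```python
ENVOY_CPU_UNITS = 100   # approximate Service Connect proxy CPU overhead
ENVOY_MEMORY_MB = 128   # approximate Service Connect proxy memory overhead

def app_container_budget(task_cpu, task_memory_mb):
    """CPU units and MB left for the app container after the proxy's share."""
    app_cpu = task_cpu - ENVOY_CPU_UNITS
    app_mem = task_memory_mb - ENVOY_MEMORY_MB
    if app_cpu <= 0 or app_mem <= 0:
        raise ValueError("task size too small for a Service Connect proxy")
    return app_cpu, app_mem

app_container_budget(512, 1024)   # -> (412, 896)
```

The task-definition example above sets the app container to 412/900, leaving slightly less than the full 128MB for the proxy; the helper reserves the full approximation.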

CloudWatch Metrics

Here’s the part that actually sells Service Connect to teams coming from vanilla ECS: you get request counts, error rates, and p99 latency per service pair — for free, with zero code changes. The Envoy sidecar is generating these numbers whether you look at them or not, and they flow to CloudWatch automatically. The two queries below are what you’d check during an incident to see whether a downstream service is causing elevated error rates:

# Get request count for the orders service over the last hour
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name RequestCount \
  --dimensions \
    Name=ClusterName,Value=production \
    Name=ServiceName,Value=orders \
    Name=DiscoveryName,Value=orders \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Sum

# Get error rate (5xx responses)
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name ServerError \
  --dimensions \
    Name=ClusterName,Value=production \
    Name=ServiceName,Value=orders \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Sum

Available metrics per service:

  • RequestCount — total requests handled
  • ServerError — 5xx responses
  • ClientError — 4xx responses
  • TargetProcessingTime — p50/p90/p99 response time from the target service
  • ActiveConnections — current open connections through the proxy
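During the incident triage described above, you would fold the two query results together into an error rate — a sketch over the per-period Sum datapoints the CLI returns:

```python
def error_rate_percent(server_error_sums, request_count_sums):
    """5xx percentage across matching CloudWatch periods (Sum statistics)."""
    errors = sum(server_error_sums)
    requests = sum(request_count_sums)
    if requests == 0:
        return 0.0
    return 100.0 * errors / requests

# Example: per-5-minute Sums from the ServerError and RequestCount queries
error_rate_percent([3, 5, 2], [1000, 1200, 800])   # -> ~0.33%
```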

These replace the X-Ray traces and App Mesh access logs you might have been using for observability. For distributed tracing across services, Service Connect passes through trace context headers (X-Amzn-Trace-Id) so AWS X-Ray continues to work without changes to your application.

Terraform Configuration

resource "aws_ecs_cluster" "production" {
  name = "production"

  service_connect_defaults {
    namespace = aws_service_discovery_http_namespace.production.arn
  }
}

resource "aws_service_discovery_http_namespace" "production" {
  name = "production.internal"
}

resource "aws_ecs_service" "orders" {
  name            = "orders"
  cluster         = aws_ecs_cluster.production.id
  task_definition = aws_ecs_task_definition.orders.arn
  desired_count   = 3

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.orders.id]
    assign_public_ip = false
  }

  service_connect_configuration {
    enabled   = true
    namespace = aws_service_discovery_http_namespace.production.arn

    service {
      port_name      = "orders-http"
      discovery_name = "orders"

      client_alias {
        port     = 8080
        dns_name = "orders"
      }
    }

    log_configuration {
      log_driver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/service-connect-proxy"
        "awslogs-region"        = "us-east-1"
        "awslogs-stream-prefix" = "orders"
      }
    }
  }
}

The log_configuration on the service_connect_configuration block sets log routing for the Envoy proxy container’s access logs — separate from your application container logs. These logs show the full request details Envoy proxied, useful for debugging connection issues between services.

Service Connect vs Service Discovery

Both use Cloud Map underneath. The difference is where the routing logic lives:

Feature         | Service Discovery (Cloud Map only) | Service Connect
Load balancing  | Client-side, DNS TTL               | Server-side, Envoy
Health updates  | DNS TTL delay (30s default)        | Near-real-time (seconds)
Observability   | None built-in                      | CloudWatch metrics automatic
Retries         | Application code                   | Configurable in service
Protocols       | Any (DNS resolution only)          | HTTP/1.1, HTTP/2, gRPC
Cost            | Free                               | Free (pay for Envoy CPU/memory)

For existing ECS workloads moving off App Mesh, AWS provides migration guidance. The ECS Fargate guide covers the Fargate-specific network configuration that Service Connect builds on. For the IAM permissions the Envoy proxy needs to register with Cloud Map, the IAM Permission Boundaries guide covers how to scope task execution roles correctly.
