CloudWatch Container Insights for EKS: Metrics, Logs, and Dashboards

Written by Bits Lovers

Running Kubernetes on EKS without Container Insights is like flying without instruments. You can see your pods are running, but when a node is memory-pressured and pods start getting OOMKilled, you won’t know until users report the impact. Container Insights gives you per-container CPU, memory, network, and disk metrics — the visibility layer that makes the difference between reacting to outages and preventing them.

The setup is more involved than most AWS features, but the payoff is a complete view of your cluster’s health in CloudWatch without running or managing your own Prometheus + Grafana stack.

What Container Insights Collects

Container Insights works at three levels simultaneously. Cluster-level metrics show aggregate resource utilization across everything. Node-level data breaks it down per EC2 instance — CPU, memory, network traffic, and disk I/O. The most actionable layer is per-pod and per-container: CPU and memory as a percentage of limits, memory working set (the non-swappable portion), and restart counts. Each layer gives you a different angle on the same underlying resource problem.

The restart count metric is particularly useful. A pod restarting repeatedly shows up as a rising counter in Container Insights before it enters CrashLoopBackOff and becomes obviously broken. An alarm on restart count lets you catch degraded pods during off-hours, before they fail completely.
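As a sketch, that alarm might look like the following. The cluster, namespace, and pod names and the SNS topic ARN are placeholders for your environment:

```shell
# Alarm when a pod's containers restart 3+ times within 10 minutes.
# ClusterName/Namespace/PodName values and the SNS topic ARN are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name "my-cluster-pod-restarts" \
  --namespace ContainerInsights \
  --metric-name pod_number_of_container_restarts \
  --dimensions Name=ClusterName,Value=my-cluster Name=Namespace,Value=default Name=PodName,Value=my-app \
  --statistic Sum \
  --period 600 \
  --evaluation-periods 1 \
  --threshold 3 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
```

`--treat-missing-data notBreaching` keeps the alarm quiet for pods that simply stop emitting the metric, e.g. after a scale-down.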

AWS introduced Container Insights with enhanced observability (often shortened to Enhanced Container Insights) in late 2023. It collects a broader, more granular set of metrics — container-level disk I/O, network packet drops, and other node-level signals that the standard Container Insights configuration doesn't surface — and is delivered for EKS through the amazon-cloudwatch-observability add-on.

Architecture Overview

Container Insights on EKS uses two components: the CloudWatch Agent running as a DaemonSet for metrics, and Fluent Bit running as a DaemonSet for logs. Both need IAM permissions to write to CloudWatch.

The CloudWatch Agent runs on every node and collects performance metrics from the kubelet and cAdvisor endpoints. It sends metrics to CloudWatch as custom metrics under the ContainerInsights namespace. Fluent Bit tails container log files from /var/log/containers/ and ships them to CloudWatch Logs.

Both components use IRSA (IAM Roles for Service Accounts) to get AWS credentials without needing static access keys. The setup involves creating an IAM role, attaching the appropriate policy, creating a Kubernetes service account annotated with the role ARN, and deploying the agent using that service account.

Setting Up the CloudWatch Agent

The AWS-provided quick install uses a CloudFormation template, but doing it via Helm gives you more control and integrates better with GitOps workflows.

First, create the IAM policy and IRSA role. (AWS also ships a managed policy with the same name and permissions, arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy — you can attach that ARN directly and skip the create-policy step.)

# Create the IAM policy for CloudWatch Agent
cat > /tmp/cwagent-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData",
        "ec2:DescribeVolumes",
        "ec2:DescribeTags",
        "logs:PutLogEvents",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:DescribeLogStreams",
        "logs:DescribeLogGroups"
      ],
      "Resource": "*"
    }
  ]
}
EOF

aws iam create-policy \
  --policy-name CloudWatchAgentServerPolicy \
  --policy-document file:///tmp/cwagent-policy.json

# Create IRSA service account
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace amazon-cloudwatch \
  --name cloudwatch-agent \
  --attach-policy-arn arn:aws:iam::123456789012:policy/CloudWatchAgentServerPolicy \
  --approve

Then deploy the agent via Helm:

helm repo add aws-observability https://aws.github.io/eks-charts
helm repo update

helm upgrade --install aws-cloudwatch-metrics aws-observability/aws-cloudwatch-metrics \
  --namespace amazon-cloudwatch \
  --create-namespace \
  --set clusterName=my-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=cloudwatch-agent

The agent starts collecting metrics within 2-3 minutes. Check the ContainerInsights namespace in CloudWatch to confirm data is flowing:

aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=my-cluster \
  --query 'Metrics[*].MetricName' \
  --output text | tr '\t' '\n' | sort -u | head -20

Setting Up Fluent Bit for Logs

Fluent Bit handles log collection and shipping. It runs as a DaemonSet and reads container logs from each node:

# Create the namespace if it doesn't already exist
kubectl create namespace amazon-cloudwatch 2>/dev/null || true

kubectl create configmap fluent-bit-cluster-info \
  --from-literal=cluster.name=my-cluster \
  --from-literal=http.server=On \
  --from-literal=http.port=2020 \
  --from-literal=read.head=Off \
  --from-literal=read.tail=On \
  --from-literal=logs.region=us-east-1 \
  -n amazon-cloudwatch

# Deploy Fluent Bit using the AWS-provided manifest
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/fluent-bit/fluent-bit.yaml

Fluent Bit creates separate CloudWatch Log Groups for each log source:

  • /aws/containerinsights/<cluster>/application — all container stdout/stderr
  • /aws/containerinsights/<cluster>/host — node-level system logs
  • /aws/containerinsights/<cluster>/dataplane — Kubernetes API server, kubelet, kube-proxy logs

The application log group is what you’ll query most often. Each log stream corresponds to one container, named <pod-name>_<namespace>_<container-name>.
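Because the stream names embed the pod and namespace, you can follow one workload's logs straight from the terminal. A sketch using `aws logs tail` (cluster and pod names are placeholders):

```shell
# Live-tail error lines from the application log group,
# scoped to the log streams for one pod ("my-app" is a placeholder)
aws logs tail /aws/containerinsights/my-cluster/application \
  --follow \
  --filter-pattern ERROR \
  --log-stream-name-prefix "my-app"
```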

Key Metrics to Monitor

Container Insights publishes ~30 metrics to the ContainerInsights namespace. The ones worth creating alarms for:

pod_cpu_utilization — pod CPU usage as a percentage of node capacity; the companion metric pod_cpu_utilization_over_pod_limit measures usage against the pod's own CPU limit. Sustained values over 80% of the limit suggest under-provisioned CPU or a runaway process.

pod_memory_utilization_over_pod_limit — memory as a percentage of the pod's limit (pod_memory_utilization measures against node capacity). A pod at 90%+ of its memory limit is a candidate for OOMKill. Alert before it happens.

pod_number_of_container_restarts — container restart counter. More than 3 restarts in 10 minutes almost always indicates a real problem.

node_cpu_utilization and node_memory_utilization — node-level resource pressure. When a node is memory-pressured, Kubernetes starts evicting pods. Alert at 85% so you have time to act.

cluster_failed_node_count — any value above 0 means you have a node that’s not joining the cluster. Worth an immediate alert.

The CloudWatch deep dive covers how to set up composite alarms that combine multiple Container Insights metrics — for example, alerting only when both CPU and memory are high simultaneously on the same node, which is a stronger signal than either alone.
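A composite alarm is built from existing alarms rather than raw metrics. A minimal sketch, assuming two metric alarms named node-cpu-high and node-memory-high already exist (those names and the SNS topic ARN are placeholders):

```shell
# Fire only when both underlying alarms are in ALARM at the same time
aws cloudwatch put-composite-alarm \
  --alarm-name "node-cpu-and-memory-high" \
  --alarm-rule "ALARM(node-cpu-high) AND ALARM(node-memory-high)" \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
```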

CloudWatch Logs Insights Queries

Container Insights logs are structured JSON when applications use structured logging. Logs Insights can parse them directly. Useful queries:

Find all application errors in the last hour:

fields @timestamp, kubernetes.pod_name, kubernetes.namespace_name, log
| filter @message like /ERROR|Exception|error/
| sort @timestamp desc
| limit 100

Find the noisiest pods (most log events) by namespace:

fields kubernetes.pod_name, kubernetes.namespace_name, kubernetes.container_name
| stats count() as restarts by kubernetes.pod_name, kubernetes.namespace_name
| sort restarts desc
| limit 20

Find slow HTTP responses (if your application logs include response time):

fields @timestamp, kubernetes.pod_name, responseTime
| filter ispresent(responseTime) and responseTime > 1000
| stats avg(responseTime) as avgLatency, count() as count by kubernetes.pod_name
| sort avgLatency desc

Logs Insights charges per GB of data scanned ($0.005 per GB). For a busy cluster with large log volume, the cost of ad-hoc queries adds up. Use specific time windows and filter by log stream when possible to reduce scan scope.
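Running a query with an explicitly bounded time window also works from the CLI. A sketch, with the cluster name as a placeholder and GNU `date` assumed for the timestamp arithmetic:

```shell
# Run a Logs Insights query over a tight one-hour window to limit scanned data
START=$(date -u -d '1 hour ago' +%s)   # GNU date; on macOS use: date -u -v-1H +%s
END=$(date -u +%s)

QUERY_ID=$(aws logs start-query \
  --log-group-name /aws/containerinsights/my-cluster/application \
  --start-time "$START" \
  --end-time "$END" \
  --query-string 'filter @message like /ERROR/ | sort @timestamp desc | limit 50' \
  --query 'queryId' --output text)

# Queries run asynchronously; poll for results once complete
aws logs get-query-results --query-id "$QUERY_ID"
```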

Performance Dashboards

Container Insights includes pre-built CloudWatch Dashboards. In the console, navigate to CloudWatch → Container Insights → Performance Monitoring and select your cluster. The built-in dashboards show:

  • Cluster-level CPU and memory trend
  • Top 10 CPU and memory consumers by pod
  • Node-level resource heat map
  • Pod restart timeline

Nothing to configure — the dashboards pull from your ContainerInsights metrics automatically. AWS owns the layout, so you can’t edit these dashboards directly. If you want a different arrangement or need to combine Container Insights metrics with application-level data, create a custom dashboard from scratch and add the metrics you want side by side.

For custom dashboards using the AWS CDK, Container Insights metrics can be referenced by the standard CloudWatch metric pattern with namespace ContainerInsights and the relevant dimensions.
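CDK aside, the same idea works through the CLI. A minimal sketch of a custom dashboard with one Container Insights widget (dashboard name, region, and cluster name are placeholders):

```shell
# Create a one-widget dashboard plotting node CPU and memory side by side
aws cloudwatch put-dashboard \
  --dashboard-name eks-cluster-overview \
  --dashboard-body '{
    "widgets": [
      {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
          "title": "Node CPU and memory",
          "region": "us-east-1",
          "metrics": [
            ["ContainerInsights", "node_cpu_utilization", "ClusterName", "my-cluster"],
            ["ContainerInsights", "node_memory_utilization", "ClusterName", "my-cluster"]
          ],
          "period": 300,
          "stat": "Average"
        }
      }
    ]
  }'
```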

Enhanced Container Insights

Enhanced Container Insights (ECI) collects additional, more granular metrics that the standard CloudWatch Agent configuration doesn't surface:

  • Container-level disk I/O (read/write bytes per container, not just per node)
  • Network packet drops and errors
  • GPU utilization (for GPU nodes)
  • Host-level CPU steal and iowait

ECI requires CloudWatch Observability add-on version 1.7.0+ and EKS nodes running Amazon Linux 2023 or Bottlerocket. Enable it via the EKS add-on:

aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability \
  --addon-version v1.7.0-eksbuild.1 \
  --service-account-role-arn arn:aws:iam::123456789012:role/CloudWatchAgentRole

ECI metrics land in the ContainerInsights namespace alongside the standard metrics, with additional metric names and dimensions not available in standard Container Insights.

Cost Breakdown

Container Insights isn’t free. Understanding the cost before you enable it on a large cluster avoids bill shock.

CloudWatch custom metrics cost $0.30 per metric per month for the first 10,000 metrics (volume discounts apply above that; the free tier covers only 10 custom metrics, which any cluster will exceed immediately). A cluster with 50 nodes and 200 pods generates roughly 2,000-5,000 Container Insights metrics, costing $600-$1,500 per month at standard pricing. Using metric streams to export metrics to a third-party tool can reduce costs if you’re already paying for Grafana Cloud or Datadog.
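To estimate the metric bill before committing, you can count how many Container Insights metrics your cluster actually publishes (cluster name is a placeholder; the AWS CLI auto-paginates, so the count covers all pages):

```shell
# Count Container Insights metrics for one cluster; multiply by ~$0.30/month
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=my-cluster \
  --query 'length(Metrics)' \
  --output text
```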

CloudWatch Logs charges $0.50 per GB ingested. Container log verbosity varies wildly — some applications log a line per request, others log several lines. A high-traffic cluster might ingest 10-50GB of logs per day. Configure log retention (30-90 days is typical) and consider shipping only ERROR-level logs to CloudWatch while routing DEBUG logs to lower-cost storage.
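Setting retention is one command per log group. A sketch capping each Container Insights log group at 30 days (cluster name is a placeholder):

```shell
# Cap retention on the three Container Insights log groups at 30 days
for grp in application host dataplane; do
  aws logs put-retention-policy \
    --log-group-name "/aws/containerinsights/my-cluster/$grp" \
    --retention-in-days 30
done
```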

The OpenTelemetry + CloudWatch guide covers using ADOT (AWS Distro for OpenTelemetry) as an alternative collection mechanism that can route metrics and traces to multiple backends, which is useful if you want CloudWatch for alerting but Grafana for visualization.

Container Insights vs Self-Managed Prometheus

The honest comparison: Prometheus + Grafana gives you more flexibility and lower cost at scale. The kube-prometheus-stack chart deploys Prometheus, Grafana, and alert rules in under 10 minutes. Grafana dashboards are more sophisticated than CloudWatch’s. Prometheus Alertmanager has more routing options than CloudWatch Alarms.

Container Insights wins on operational simplicity. No Prometheus instances to manage, no Grafana to maintain, no persistent storage for metrics. Everything integrates with existing AWS tooling — CloudWatch Alarms trigger SNS or Lambda, metrics appear alongside your other AWS service metrics. For teams already invested in CloudWatch for EKS cluster monitoring and application metrics, Container Insights keeps everything in one place.

The practical answer for most EKS clusters: use Container Insights for the out-of-the-box dashboards and critical alarms on the metrics that matter (OOMKills, node pressure, restart counts), and consider Prometheus if you need custom metrics at high cardinality or more sophisticated alerting logic.

Bits Lovers

Professional writer and blogger. Focus on Cloud Computing.