AWS CloudWatch Deep Dive: Metrics, Alarms, and Logs Insights

Written by Bits Lovers

A tweet that reached 17,105 people last January listed the seven AWS services you need to know to get hired. CloudWatch was on it alongside EC2, S3, IAM, Lambda, RDS, and VPC. That tells you something about how central monitoring is to AWS work — not optional, not advanced, not something you bolt on later. Every Lambda you invoke, every ECS task that fails, every RDS connection that spikes: without CloudWatch, you’re finding out about it from your users.

Most CloudWatch guides cover the happy path. This one covers the gaps — what you’re not getting by default, why Logs Insights costs spike, and which alarm configurations actually work at 2am.

Metrics: What You Get and What You Don’t

Spin up a Lambda function and check CloudWatch two minutes later. You’ll find invocation counts, error counts, duration percentiles, and throttle counts already there — no configuration, no agent, no SDK calls. This is how AWS built-in metrics work: every managed service pushes data to CloudWatch automatically, organized into namespaces. AWS/Lambda for functions, AWS/EC2 for instances, AWS/RDS for databases. Each metric carries dimensions (FunctionName, InstanceId, DBInstanceIdentifier) that identify the specific resource. Most engineers discover this and wonder why they were using third-party monitoring tools for basic cloud metrics.

EC2 feeds CloudWatch CPU utilization, network throughput, disk I/O operations, and instance status checks. Lambda reports invocation counts, durations, error counts, throttles, and concurrent executions. RDS sends CPU, connection count, free storage, and read/write IOPS. You don’t configure any of it. It just shows up.

What doesn’t show up: memory and disk space for EC2 instances. AWS can’t look inside the guest OS without an agent installed. This gap catches people the first time they set up monitoring — the CPU graph looks fine, memory is at 98%, and nobody knows because that metric isn’t there. The fix is the CloudWatch Agent, which I’ll cover in its own section.

Your application can push its own metrics using the PutMetricData API. The first 10 custom metrics cost nothing; after that it’s $0.30 per metric per month. Most teams publishing request counts, error rates, or queue depths hit that paid tier almost immediately. For granularity, the default is 1-minute data, but high-resolution metrics can push at 1-second intervals for the same per-metric price. Sub-minute granularity is only genuinely useful for things that spike and recover faster than a minute — rare outside very specific rate-limiting or latency monitoring scenarios.
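As a sketch of what pushing a custom metric looks like from the CLI (the MyApp namespace, metric names, and dimension values here are placeholders, not anything from a real setup):

```shell
# Publish a standard-resolution (1-minute) custom metric
aws cloudwatch put-metric-data \
  --namespace "MyApp" \
  --metric-name QueueDepth \
  --dimensions Environment=prod \
  --value 42 \
  --unit Count

# Same idea at 1-second resolution: add --storage-resolution 1
aws cloudwatch put-metric-data \
  --namespace "MyApp" \
  --metric-name RequestLatency \
  --dimensions Environment=prod \
  --value 87 \
  --unit Milliseconds \
  --storage-resolution 1
```

Each distinct namespace/name/dimension combination counts as one metric for billing, which matters once you pass the free tier.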

CloudWatch doesn’t keep fine-grained data forever. 1-second metrics disappear after 3 hours, 1-minute data holds for 15 days, and 5-minute aggregates survive 63 days. After that, you get hourly data, and even that only lasts 15 months. Once the rollup happens the original resolution is gone — no way to recover it. If you need raw metrics for capacity planning reports or compliance mandates, set up Metric Streams to Kinesis Data Firehose before the data ages out. The rollup is silent and permanent.

CloudWatch Logs: Groups, Streams, and Insights

Lambda functions, ECS containers, EC2 instances running the CloudWatch Agent — they all write into log groups. Each log group is a named collection, and within it each individual source (a specific Lambda execution environment, a specific container ID) writes to its own log stream. You define the log group name; CloudWatch creates streams automatically as new sources come online.

Pricing: $0.50 per GB ingested, $0.03 per GB per month for storage. The number that surprises people is the storage one, because log groups have no expiration by default. Zero. If you create a log group and never set a retention policy, AWS stores those logs forever. For a Lambda that processes millions of invocations per day, that adds up. I’ve seen accounts with $200/month in CloudWatch log storage from functions nobody looked at in two years. Set a retention policy on every log group — 30 days for high-frequency function logs is a reasonable default, longer for audit trails you might need during an incident review.

# Set retention on a log group — always do this
aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30

# List all log groups with their retention settings
aws logs describe-log-groups \
  --query 'logGroups[*].{name:logGroupName, retention:retentionInDays}' \
  --output table

Any group in that table without a retention value set is costing you money every month.

Logs Insights: Powerful but Priced Per Scan

Logs Insights is the query engine for CloudWatch Logs. You write queries in a purpose-built language that mixes SQL-ish field selection with pipeline operators. The pricing model is $0.005 per GB scanned per query. That’s cheap for narrow time ranges on small log groups. It adds up fast if you habitually query 90-day windows on high-volume groups.

Here are the queries you’ll actually use:

# Show recent errors with full message
fields @timestamp, @message, @requestId
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Count invocations per 5-minute window
fields @timestamp
| stats count() as requests by bin(5m)
| sort bin(5m) asc

# Find slow Lambda invocations (>3000ms)
filter @type = "REPORT"
| fields @timestamp, @requestId, @duration, @billedDuration, @maxMemoryUsed
| filter @duration > 3000
| sort @duration desc
| limit 50

# Group errors by type
fields @message
| filter @message like /ERROR/
| parse @message /ERROR (?<errorType>[A-Za-z]+Exception)/
| stats count(*) as errorCount by errorType
| sort errorCount desc

The parse command extracts fields from unstructured log text using regex. The stats command aggregates. Avoid querying more time range than you need — if you’re investigating an incident that started 2 hours ago, scope the query to 3 hours, not 7 days.

Alarms: Getting Paged for the Right Things

Every CloudWatch alarm sits in one of three states: OK, ALARM, or INSUFFICIENT_DATA. New alarms land in INSUFFICIENT_DATA immediately — no history to evaluate yet. They also return there whenever the metric goes quiet: your Lambda hasn’t been invoked, the EC2 instance got terminated, the batch job is between runs. Whether that silence is a problem depends entirely on what you’re monitoring. A function that legitimately goes quiet overnight should not wake your on-call engineer.

Standard alarms evaluate on 1-minute intervals at $0.10/month each. High-resolution alarms can check every 10 seconds but cost $0.30/month — only worth it for metrics where a 50-second problem genuinely needs to be caught within the minute.

The alarm setting most people misconfigure is evaluation-periods. Every alarm has two numbers: period (how much data goes into each evaluation point, in seconds) and evaluation-periods (how many consecutive breaching points before the alarm fires). Leave evaluation-periods at 1 and you’ll get woken up at 2am because one Lambda errored once. Set it to 3 on a 60-second period and you only get paged when errors persist for 3 minutes straight. Most teams should default to 2-3 evaluation periods and tune down from there if they’re missing real incidents.

# Create a Lambda error rate alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "lambda-errors-my-function" \
  --metric-name Errors \
  --namespace AWS/Lambda \
  --dimensions Name=FunctionName,Value=my-function \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts

This alarm fires only after three consecutive 1-minute periods with more than 5 errors — enough to filter out transient blips.

Composite alarms combine multiple alarms with AND/OR logic, and they solve one of the most common alerting problems: false positives from metrics that only mean something when correlated. Lambda errors spiking while the SQS queue depth is also rising is a real incident. Lambda errors spiking while the queue is draining normally might just be a bad batch. A composite alarm with AND logic catches the first scenario without paging for the second. They run $0.50/month each.
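A minimal sketch of that AND pattern, assuming the two child alarms already exist (both alarm names below are hypothetical):

```shell
# Composite alarm: page only when BOTH child alarms are in ALARM state
aws cloudwatch put-composite-alarm \
  --alarm-name "lambda-errors-and-queue-backlog" \
  --alarm-rule 'ALARM("lambda-errors-my-function") AND ALARM("sqs-queue-depth-rising")' \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
```

A useful side effect: you can leave the child alarms with no actions at all and attach notifications only to the composite, so the individual signals stay visible without paging anyone.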

Anomaly detection alarms take a different approach — instead of a fixed threshold, CloudWatch builds an ML model of the metric’s normal behavior and alerts when the actual value deviates from the expected band. For metrics with daily patterns (request rate that’s always lower at 4am, always higher at noon), anomaly detection beats static thresholds that are either too tight to be useful at night or too loose to catch problems at peak.
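A sketch of an anomaly detection alarm from the CLI: instead of --threshold, you pass a metric math array where one entry is the real metric and another is an ANOMALY_DETECTION_BAND expression (the band width of 2 standard deviations and the function name are illustrative choices):

```shell
# Fire when Invocations rise above the model's expected upper band
aws cloudwatch put-metric-alarm \
  --alarm-name "invocations-anomaly-my-function" \
  --comparison-operator GreaterThanUpperThreshold \
  --evaluation-periods 2 \
  --threshold-metric-id ad1 \
  --metrics '[
    {"Id": "m1",
     "MetricStat": {
       "Metric": {"Namespace": "AWS/Lambda", "MetricName": "Invocations",
                  "Dimensions": [{"Name": "FunctionName", "Value": "my-function"}]},
       "Period": 300, "Stat": "Sum"},
     "ReturnData": true},
    {"Id": "ad1", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"}
  ]' \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
```

The second argument to ANOMALY_DETECTION_BAND controls band width: larger values tolerate more deviation before alarming.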

The CloudWatch Agent (The Part Tutorials Skip)

The CloudWatch Agent is a daemon that runs inside EC2 instances and ECS containers. It does two things: publishes OS-level metrics that AWS can’t see from outside the instance, and ships application logs from arbitrary files on disk.

Without the agent on an EC2 instance, you’re blind to memory and disk. mem_used_percent and disk_used_percent are the two metrics that will actually tell you when an instance is about to fall over — and neither exists in CloudWatch until you install the agent. I’ve seen production outages caused by instances quietly filling their root volume while the CPU dashboard looked perfectly healthy.

{
  "agent": {
    "metrics_collection_interval": 60
  },
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"]
      },
      "disk": {
        "measurement": ["disk_used_percent"],
        "resources": ["/"]
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/app/application.log",
            "log_group_name": "/ec2/my-app",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}

Store this config in SSM Parameter Store and the agent can pull it automatically on startup. That way, new instances in an Auto Scaling group pick up the monitoring config without any manual steps. The IAM roles and policies guide covers what permissions the agent’s EC2 role needs — specifically CloudWatchAgentServerPolicy.
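On the instance itself, pulling that config looks roughly like this (the parameter name AmazonCloudWatch-config is an example, substitute whatever name you stored the config under):

```shell
# Fetch agent config from SSM Parameter Store and (re)start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 \
  -c ssm:AmazonCloudWatch-config -s
```

Bake this line into user data or a launch template and every new instance comes up monitored.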

Dashboards and Metric Streams

Dashboards are the visualization layer. You can add any CloudWatch metric as a line graph, stacked area, number widget, or alarm status widget. The first three dashboards per account are free; additional dashboards cost $3 per dashboard per month.

Cross-account dashboards are supported. If you’re running multi-account setups, you can create a central observability account that aggregates CloudWatch data from all member accounts into shared dashboards. The OpenTelemetry + CloudWatch observability setup covers how to layer OTel instrumentation on top of this for distributed tracing.

Metric Streams let you forward CloudWatch metrics in near-real-time to Amazon Kinesis Data Firehose — from there to S3, Splunk, Datadog, New Relic, or any custom destination. This is the right approach when you need metrics in a third-party tool without polling delays. Cost is $0.003 per 1,000 metric updates streamed.
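Creating a stream is a single call once the Firehose delivery stream and its IAM role exist (both ARNs below are placeholders):

```shell
# Stream only AWS/Lambda metrics to an existing Firehose delivery stream
aws cloudwatch put-metric-stream \
  --name "lambda-metrics-stream" \
  --firehose-arn arn:aws:firehose:us-east-1:123456789012:deliverystream/metrics \
  --role-arn arn:aws:iam::123456789012:role/MetricStreamRole \
  --output-format json \
  --include-filters Namespace=AWS/Lambda
```

The include filter matters for cost: without it, every metric update in the account flows through the stream.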

Six Gotchas Worth Knowing

EC2 memory and disk are the most commonly missing metrics in AWS setups. They require the CloudWatch Agent, which requires an instance profile with the right IAM policy. If your dashboard shows CPU at 5% and users are reporting slowness, check whether you even have memory metrics before assuming the CPU graph tells the whole story.

Logs Insights scans data even for queries that return zero results. A query with a 30-day range on a 100GB log group costs $0.50 per run. Scope your queries to the minimum time range needed. Create log metric filters for patterns you check repeatedly — metric filters evaluate logs at ingestion time, so they add no per-query scan charges.

INSUFFICIENT_DATA looks alarming but often isn’t. Alarms land in this state when no data points arrive during the evaluation window — Lambda not invoked, EC2 instance terminated, scheduled batch job between runs. If you wire up alarm actions for INSUFFICIENT_DATA state on every alarm, you’ll receive pages at 2am whenever a function goes quiet overnight. Reserve INSUFFICIENT_DATA actions for metrics where silence is itself a failure — a health check that stops reporting, a heartbeat metric that should fire every minute.

High-cardinality dimensions are expensive. If you publish a custom metric with userId as a dimension, you create one distinct metric time series per user. Ten thousand users means ten thousand metrics at $0.30 each per month. Use dimensions that have bounded cardinality (environment, region, service-name) rather than unbounded identifiers.

Detailed EC2 monitoring isn’t free. By default, EC2 metrics are published at 5-minute resolution. Enabling detailed monitoring switches to 1-minute resolution at $2.10 per instance per month. For production instances where you need fast alarm response, it’s worth it. For dev environments, the default is fine.
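Toggling detailed monitoring is one call per direction (the instance ID is a placeholder):

```shell
# Switch an instance to 1-minute detailed monitoring (billed per instance)
aws ec2 monitor-instances --instance-ids i-1234567890abcdef0

# Revert to the free 5-minute default
aws ec2 unmonitor-instances --instance-ids i-1234567890abcdef0
```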

Cross-account alarms require setup. You can’t create a CloudWatch alarm in Account A that watches a metric in Account B without configuring cross-account observability first. This is a common oversight in multi-account architectures — centralized alerting needs deliberate configuration, not just IAM permissions.

Practical Setup: Monitoring a Lambda Function End-to-End

This gives you full visibility on a Lambda function: errors, duration p99, throttles, and a log alarm for uncaught exceptions.

FUNCTION="my-function"
ACCOUNT="123456789012"
REGION="us-east-1"
SNS_ARN="arn:aws:sns:$REGION:$ACCOUNT:alerts"

# Error rate alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "$FUNCTION-errors" \
  --namespace AWS/Lambda \
  --metric-name Errors \
  --dimensions Name=FunctionName,Value=$FUNCTION \
  --statistic Sum --period 60 \
  --evaluation-periods 2 --threshold 3 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions $SNS_ARN

# Throttle alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "$FUNCTION-throttles" \
  --namespace AWS/Lambda \
  --metric-name Throttles \
  --dimensions Name=FunctionName,Value=$FUNCTION \
  --statistic Sum --period 60 \
  --evaluation-periods 1 --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions $SNS_ARN

# Log metric filter for uncaught exceptions
aws logs put-metric-filter \
  --log-group-name "/aws/lambda/$FUNCTION" \
  --filter-name "UnhandledExceptions" \
  --filter-pattern "?\"Unhandled\" ?\"ERROR\" ?\"exception\"" \
  --metric-transformations \
    metricName=UnhandledException,metricNamespace=AppMetrics/$FUNCTION,metricValue=1

# Alarm on that custom metric
aws cloudwatch put-metric-alarm \
  --alarm-name "$FUNCTION-unhandled" \
  --namespace "AppMetrics/$FUNCTION" \
  --metric-name UnhandledException \
  --statistic Sum --period 60 \
  --evaluation-periods 1 --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions $SNS_ARN

For the SQS-triggered Lambda pattern from the SQS Lambda event source mapping post, add an alarm on the ApproximateNumberOfMessagesNotVisible metric in the AWS/SQS namespace — a rising count there means your Lambda is processing slowly or backing up.

When to Use What

Start with CloudWatch and assume it’s your primary observability layer unless you have a specific reason to reach for something else. Built-in metrics are already flowing. Log groups are being created by Lambda and ECS automatically. You have dashboards and alarms from day one with zero setup.

Deploy the CloudWatch Agent on every EC2 instance. This is non-negotiable for production. The 10-minute install pays for itself the first time memory pressure causes an outage that CPU graphs couldn’t predict. For EKS workloads, the equivalent is Container Insights — the same agent running as a DaemonSet, collecting per-pod and per-container metrics alongside node-level data.

For logs: use Insights for investigations, use metric filters for patterns you check more than twice a week, use log group subscriptions to forward to Elasticsearch or S3 when you need longer retention or more powerful search than Insights provides.
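A subscription filter forwarding everything in a log group to Firehose (and from there to S3) might look like this, with the delivery stream and role ARNs as placeholders:

```shell
# Forward an entire log group to a Firehose delivery stream for archival
aws logs put-subscription-filter \
  --log-group-name /aws/lambda/my-function \
  --filter-name "archive-to-s3" \
  --filter-pattern "" \
  --destination-arn arn:aws:firehose:us-east-1:123456789012:deliverystream/log-archive \
  --role-arn arn:aws:iam::123456789012:role/CWLtoFirehoseRole
```

An empty filter pattern matches every event; tighten it if you only want a subset archived.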

Don’t over-engineer the alerting setup. Four well-tuned alarms per critical service — errors, latency, throttles, queue depth — beat 40 poorly-tuned alarms that nobody acts on. The goal is getting paged when something real is breaking, not covering every metric because it feels thorough.

Bits Lovers

Professional writer and blogger. Focus on Cloud Computing.