EKS Networking Deep Dive: VPC CNI, IP Exhaustion, and Pod Networking
Running out of IP addresses in production at 2 AM is a specific kind of bad. It happens in EKS clusters when the VPC CNI plugin has allocated every available IP in your subnets and new pods can’t start because there’s nowhere to assign them. The error message is unhelpful — “0/3 nodes are available: 3 Too many pods” — and the fix requires subnet redesign you should have done six months ago. This guide covers how EKS networking actually works, why IP exhaustion happens, and the options for avoiding it before it becomes a production incident.
How VPC CNI Works
EKS uses the Amazon VPC CNI plugin by default. Unlike most Kubernetes networking plugins that use an overlay network, VPC CNI assigns real VPC IP addresses directly to pods. Every pod is a first-class VPC citizen — pods have real IP addresses, they’re routable from anywhere in the VPC, and there’s no network address translation between pods and other VPC resources.
The mechanism: each EC2 node has a primary Elastic Network Interface (ENI). The VPC CNI attaches additional ENIs to the node and allocates secondary IP addresses on those ENIs. Each secondary IP gets assigned to a pod. There’s no tunnel, no overlay, no VXLAN — a packet from a pod goes directly over the VPC network fabric.
This architecture has performance implications. Pod-to-pod latency on EKS is essentially VPC network latency — the same as EC2 instance-to-instance. It’s also why EKS networking troubleshooting looks like VPC networking troubleshooting: security groups, NACLs, route tables, and subnet CIDRs all directly affect pod connectivity.
The IP Address Math
Every instance type has a limit on how many ENIs can be attached and how many IP addresses each ENI can hold. An m5.large supports 3 ENIs with 10 IP addresses each. The first IP on each ENI is reserved as that ENI's primary address, leaving 3 × (10 − 1) = 27 secondary IPs for pods. AWS then adds 2 to the max-pods count for pods that use host networking (aws-node and kube-proxy don't consume secondary IPs), giving 29.
The formula AWS uses for max pods: (max-ENIs × (IPs-per-ENI - 1)) + 2
For common instance types:
| Instance | Max ENIs | IPs per ENI | Max Pods |
|---|---|---|---|
| t3.medium | 3 | 6 | 17 |
| m5.large | 3 | 10 | 29 |
| m5.xlarge | 4 | 15 | 58 |
| m5.2xlarge | 4 | 15 | 58 |
| m5.4xlarge | 8 | 30 | 234 |
| c5.xlarge | 4 | 15 | 58 |
| c5.4xlarge | 8 | 30 | 234 |
The ENI and IP limits are set by AWS per instance type and can't be changed. If you try to run 60 pods on an m5.xlarge (max 58), pods 59 and 60 will stay Pending indefinitely. The scheduler is enforcing the kubelet's max-pods value, which EKS derives from these ENI limits — there is simply no free IP for the VPC CNI to hand out.
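The formula can be sketched as a small shell helper — max_pods is a hypothetical name here, and the ENI/IP limits come from AWS's published per-instance-type table:

```shell
#!/usr/bin/env bash
# Sketch of the EKS max-pods formula (default VPC CNI mode).
# "max_pods" is a hypothetical helper; ENI and IP limits vary per instance type.
max_pods() {
  local enis=$1 ips_per_eni=$2
  # Each ENI's first IP is its own primary address; +2 covers host-network pods.
  echo $(( enis * (ips_per_eni - 1) + 2 ))
}

max_pods 3 10   # m5.large
max_pods 8 30   # m5.4xlarge
```

Running it reproduces the table above: 29 for an m5.large, 234 for an m5.4xlarge.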
Check available IP addresses on a node:
kubectl describe node <node-name> | grep -A5 "Allocatable"
# Look for: pods, followed by the limit

# Compare allocatable vs capacity pod counts across nodes
kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,ALLOCATABLE:.status.allocatable.pods,CAPACITY:.status.capacity.pods'
WARM_IP_TARGET and How It Can Accelerate Exhaustion
The VPC CNI pre-allocates IP addresses before pods need them, to avoid startup delays. Three environment variables control this behavior on the aws-node DaemonSet.
WARM_IP_TARGET is the number of IP addresses the CNI keeps available at all times. If your node has 10 pods running and WARM_IP_TARGET is 10, the CNI allocates 20 IPs total — 10 in use, 10 warm. As you scale down, it holds those warm IPs. This is comfortable for pod startup speed but burns through IPs on every node, even nodes running only a handful of pods.
MINIMUM_IP_TARGET is the floor — the CNI won’t drop below this count. Set this to your expected steady-state pod count per node; the CNI ensures that many IPs are always available.
WARM_ENI_TARGET keeps a full ENI warm (with all its IPs pre-allocated). Default is 1. Fine for large nodes, excessive for small ones.
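To make the interaction between the two IP targets concrete, here's a hedged sketch of the pool size the CNI aims for — warm_pool_size is a hypothetical name, and the real ipamd additionally rounds allocations up to whole-ENI boundaries:

```shell
#!/usr/bin/env bash
# Hedged sketch: approximate IPs the CNI keeps allocated on a node.
# Roughly max(running pods + WARM_IP_TARGET, MINIMUM_IP_TARGET);
# the real daemon also rounds up to whole-ENI boundaries.
warm_pool_size() {
  local pods=$1 warm_ip_target=$2 minimum_ip_target=$3
  local want=$(( pods + warm_ip_target ))
  if (( want < minimum_ip_target )); then
    want=$minimum_ip_target   # the floor wins on quiet nodes
  fi
  echo "$want"
}

warm_pool_size 10 10 0   # WARM_IP_TARGET=10 with 10 pods running
warm_pool_size 3 2 10    # WARM_IP_TARGET=2, MINIMUM_IP_TARGET=10, quiet node
```

The first call shows the doubling effect from L29's example (10 in use plus 10 warm); the second shows the floor winning when the node runs only a few pods.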
The recommended approach for clusters concerned about IP exhaustion:
kubectl set env daemonset aws-node \
WARM_IP_TARGET=2 \
MINIMUM_IP_TARGET=10 \
-n kube-system
This keeps 2 warm IPs for burst absorption and ensures at least 10 IPs are always ready, without pre-allocating 30+ IPs on every node for padding. The trade-off is a slight latency during rapid pod scale-up while the CNI fetches more IPs. For clusters that don’t experience bursty pod creation, this is the right balance.
Prefix Delegation: 16x More IPs
Prefix delegation changes the allocation unit from individual IP addresses to /28 prefixes. Each /28 contains 16 IP addresses. Instead of assigning one secondary IP per ENI slot, the CNI assigns one /28 prefix per slot — giving you 16 times the IP capacity from the same number of ENI slots.
An m5.large with 3 ENIs and 10 IPs per ENI normally supports up to 29 pods. With prefix delegation, each of the 9 secondary slots per ENI holds a /28 prefix — 16 IPs — giving 3 × 9 × 16 = 432 potential pod IPs from the same instance. In practice EKS caps max pods at 110 for instances this size (a Kubernetes scalability recommendation), but the node is no longer ENI-limited.
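The capacity math can be sketched the same way as the standard formula — prefix_ips is a hypothetical helper, and 16 is the number of addresses in a /28:

```shell
#!/usr/bin/env bash
# Sketch: raw pod-IP capacity with prefix delegation.
# Each of the (ips_per_eni - 1) secondary slots holds a /28 prefix (16 IPs).
# Note: EKS still caps max pods (110 for smaller instance types).
prefix_ips() {
  local enis=$1 ips_per_eni=$2
  echo $(( enis * (ips_per_eni - 1) * 16 ))
}

prefix_ips 3 10   # m5.large: far more raw IPs than the max-pods cap
```

The point of the calculation: after prefix delegation, the binding constraint is the max-pods cap (or subnet capacity), not ENI slots.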
Enable it on the aws-node DaemonSet (requires Nitro-based instances):
kubectl set env daemonset aws-node \
ENABLE_PREFIX_DELEGATION=true \
-n kube-system
After enabling, replace existing nodes — cordon and drain them, then terminate — so replacement nodes come up in the new allocation mode. The change doesn't apply to ENIs already attached to running nodes.
Prefix delegation is the recommended first solution for clusters hitting IP exhaustion. It’s simpler than custom networking and doesn’t require subnet redesign.
Custom Networking: Pods on a Different Subnet
Custom networking decouples node IP addresses from pod IP addresses. Nodes keep IPs from your primary subnet; pods get IPs from a secondary subnet you designate. The secondary subnet can use a different CIDR — typically from 100.64.0.0/10 (the carrier-grade NAT range from RFC 6598, often used here because it doesn't collide with RFC 1918 corporate networks).
The setup requires a CRD called ENIConfig — one per availability zone:
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-1a          # Must match the AZ name
spec:
  subnet: subnet-xxxxxxxxx  # Your pod subnet in us-east-1a
  securityGroups:
    - sg-xxxxxxxxx          # Security group for pods
Enable custom networking on the aws-node DaemonSet:
kubectl set env daemonset aws-node \
AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true \
ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone \
-n kube-system
With ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone, the CNI automatically selects the ENIConfig matching the node’s AZ label. Nodes in us-east-1a use the us-east-1a ENIConfig, nodes in us-east-1b use us-east-1b, and so on.
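Since you need one ENIConfig per AZ, a small loop can stamp the manifests out. The AZ-to-subnet map and IDs below are placeholders; in real use you'd pipe the output to kubectl apply -f -:

```shell
#!/usr/bin/env bash
# Illustrative: generate one ENIConfig manifest per AZ (requires bash 4+).
# Subnet and security-group IDs are placeholders for your pod subnets.
declare -A pod_subnets=(
  [us-east-1a]="subnet-aaaaaaaaa"
  [us-east-1b]="subnet-bbbbbbbbb"
)
pod_sg="sg-xxxxxxxxx"

for az in "${!pod_subnets[@]}"; do
  cat <<EOF
---
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: ${az}
spec:
  subnet: ${pod_subnets[$az]}
  securityGroups:
    - ${pod_sg}
EOF
done
# In real use: pipe this script's output to `kubectl apply -f -`
```

Because the ENIConfig name must equal the AZ name, the map keys double as the manifest names the CNI looks up via the zone label.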
Custom networking works with Karpenter autoscaling — nodes Karpenter provisions carry the standard topology.kubernetes.io/zone label, so the CNI selects the matching ENIConfig for them automatically.
The main downside: with custom networking the node's primary ENI is no longer used for pod IPs, reducing max pods by one ENI's worth compared to the standard configuration. If that capacity matters, combine custom networking with prefix delegation to recover it.
Security Groups for Pods
By default, all pods on a node share the node’s security groups. Security Groups for Pods lets you assign specific security groups to individual pods — useful when pods need to access RDS databases, ElastiCache clusters, or other resources that allow access based on security group membership rather than IP CIDR.
The feature uses a “trunk” ENI model. The CNI attaches an additional ENI designated as the trunk interface; each pod that needs its own security groups gets a “branch” network interface associated with that trunk. The trunk consumes one of the node's ENI slots, reducing max pods.
Enable on the aws-node DaemonSet and annotate pods:
kubectl set env daemonset aws-node \
ENABLE_POD_ENI=true \
-n kube-system
Then reference the security group in a SecurityGroupPolicy CRD:
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: rds-access-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-server
  securityGroups:
    groupIds:
      - sg-xxxxxxxxx  # Security group with RDS access
All pods with the app: api-server label in the production namespace get this security group assigned. When those pods connect to RDS, the RDS security group can reference this security group instead of an IP range — cleaner than maintaining CIDR allow lists that change as pods reschedule.
Not all instance types support the trunk ENI model. Check the EKS documentation for supported instances before planning a cluster architecture around this feature.
SNAT and Outbound Traffic
When a pod in a private subnet initiates outbound traffic to the internet, it goes through SNAT. By default, the VPC CNI performs SNAT at the node level — the pod’s IP is translated to the node’s IP before leaving the instance, and the node’s IP is then NATted by the VPC NAT Gateway (or not, if there isn’t one).
Setting AWS_VPC_K8S_CNI_EXTERNALSNAT=true disables the node-level SNAT, letting traffic leave the node with the pod’s original IP. This means your VPC routing handles all NAT — the pod IP reaches the NAT Gateway directly, and NAT Gateway translates it to the public IP.
Why you’d want EXTERNALSNAT: if you’re using custom networking with pod IPs from a secondary CIDR, the node-level SNAT only knows about the node’s primary IP space. External SNAT lets the VPC routing handle the secondary CIDR properly.
kubectl set env daemonset aws-node \
AWS_VPC_K8S_CNI_EXTERNALSNAT=true \
-n kube-system
For pods that only communicate within the VPC — internal services, databases in private subnets, other pods — SNAT configuration doesn’t matter. It only affects traffic that leaves the VPC.
Subnet Design Matters More Than You Think
The most common source of EKS IP exhaustion is subnet design decided before anyone knew how many pods would run. A /24 subnet has 256 addresses (251 usable after AWS reserves 5). If each m5.large node consumes up to 29 pod IPs plus 1 node IP, you can fit about 8 nodes worth of IPs in that /24. Karpenter autoscaling that provisions a 20-node cluster will exhaust that subnet in under an hour.
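The back-of-envelope subnet math can be sketched as a helper — nodes_per_subnet is a hypothetical name, 5 is the number of addresses AWS reserves per subnet, and ~30 IPs per node approximates an m5.large (node IP plus ENI primaries plus pod IPs):

```shell
#!/usr/bin/env bash
# Back-of-envelope subnet capacity check (default CNI mode).
# 5 = addresses AWS reserves in every subnet.
nodes_per_subnet() {
  local cidr_size=$1 ips_per_node=$2
  local usable=$(( (1 << (32 - cidr_size)) - 5 ))
  echo $(( usable / ips_per_node ))
}

nodes_per_subnet 24 30   # a /24 with m5.large-class nodes
nodes_per_subnet 19 30   # a /19 for the same node profile
```

A /24 supports roughly 8 such nodes; a /19 supports a couple hundred — which is the gap between "exhausted in an hour of autoscaling" and "room to grow".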
Recommendations for new clusters:
Use /19 or larger subnets for EKS node groups — that’s 8,192 addresses. Size for 3-5x your expected maximum pod count to leave room for growth and IP warm pools.
Keep EKS subnets private. Use separate subnets for load balancers (which need public IPs) and your EC2 instances. The VPC design patterns guide covers subnet sizing for multi-tier architectures.
Use separate subnets per AZ, all the same size. EKS and Karpenter need to provision nodes in any AZ, and if one AZ’s subnet is exhausted, you may be unable to scale there.
Diagnosing IP Allocation Problems
When pods get stuck in Pending state with IP-related errors:
# Check events on the stuck pod
kubectl describe pod <pod-name> | grep -A5 Events
# Check CNI logs on the node running the pod
kubectl logs -n kube-system -l k8s-app=aws-node --tail=50
# Check subnet IP availability (requires ec2 permissions)
aws ec2 describe-subnets \
--subnet-ids subnet-xxxxxxxxx \
--query 'Subnets[*].{Subnet:SubnetId,Available:AvailableIpAddressCount,CIDR:CidrBlock}'
# Check current ENI attachments on a node
aws ec2 describe-network-interfaces \
--filters "Name=attachment.instance-id,Values=<instance-id>" \
--query 'NetworkInterfaces[*].{ENI:NetworkInterfaceId,IPs:length(PrivateIpAddresses)}'
If AvailableIpAddressCount is 0 or close to 0 in a subnet, that’s your problem. The fix is either enabling prefix delegation (if you haven’t), migrating to larger subnets, or adding secondary CIDRs with custom networking.
The CloudWatch Container Insights guide covers pod-level metrics including restart counts and resource usage — those dashboards are where you’ll first notice pods stuck in pending state when IP exhaustion hits.
When to Use Each Feature
Start with: properly sized subnets and default VPC CNI configuration. No feature flags needed if your subnets are big enough.
Add prefix delegation when your subnets are right-sized but you’re hitting pod limits on individual nodes. It’s the simplest fix and gives 16x capacity from the same ENI slots.
Add custom networking when your subnets are too small and redesigning them isn’t practical. Puts pod IPs in a secondary CIDR you can size generously without touching the existing node subnet.
Add security groups for pods when specific pods need access to resources gated by security group membership. Don’t use it as a default — the trunk ENI overhead reduces node capacity and the feature requires careful instance type planning.
The EKS getting started guide walks through initial cluster setup including VPC configuration — the subnet choices you make there determine whether you’ll hit IP exhaustion problems later.