EKS Cluster Upgrade: Zero-Downtime Playbook
AWS EKS standard support ends 14 months after a Kubernetes version’s upstream release. Extended support adds another 12 months but raises the price to $0.60 per cluster per hour, up from the standard $0.10. For a cluster running 24/7, that’s about $438/month, roughly $365 more than standard support, just to avoid upgrading. The math isn’t complicated: upgrade on your schedule or pay to defer it.
EKS upgrades one minor version at a time. You can’t jump from 1.29 to 1.32. Each step is a separate upgrade operation: control plane, then data plane, then add-ons. If you’re three versions behind, that’s three rounds of this playbook. This guide covers each phase in order, the pre-upgrade checks that prevent incidents, and the PodDisruptionBudget configurations that actually make upgrades zero-downtime.
EKS Upgrade Architecture
EKS upgrades have two independent phases:
Phase 1: Control plane — AWS upgrades the API server, etcd, scheduler, and controller manager. Your nodes keep running the old Kubernetes version throughout. Workloads keep serving traffic. This takes 10–25 minutes and requires no manual intervention.
Phase 2: Data plane — You upgrade node groups. Managed node groups do a rolling replacement (launch new nodes, drain old ones). Self-managed node groups require you to drain and terminate nodes manually. Fargate pods pick up the new Kubernetes version when they’re recycled: redeploy the workload and the replacement pods start on the upgraded platform.
Add-ons (CoreDNS, kube-proxy, VPC CNI, EBS CSI) are upgraded separately after the control plane but should be done before data plane upgrades when possible.
The rule: the control plane version must always be >= the node version. Nodes may lag the control plane (upstream Kubernetes permits up to three minor versions of kubelet skew since 1.28, two before that), though keeping them within one minor version is the conservative default. Nodes can never run ahead of the control plane.
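The skew rule is easy to check mechanically before you start. A minimal sketch (the version strings are illustrative; in practice feed it the cluster version from aws eks describe-cluster and the kubelet versions from kubectl get nodes):

```python
def skew_ok(control_plane: str, node: str, max_skew: int = 1) -> bool:
    """True if the node satisfies the skew rule: never ahead of the
    control plane, and at most max_skew minor versions behind it.
    max_skew=1 is the conservative default; upstream Kubernetes 1.28+
    permits up to 3."""
    cp_minor = int(control_plane.split(".")[1])
    node_minor = int(node.split(".")[1])
    return 0 <= cp_minor - node_minor <= max_skew

print(skew_ok("1.32", "1.31"))  # True: one minor version behind
print(skew_ok("1.31", "1.32"))  # False: node ahead of control plane
```

Run it against every node before each round of the playbook; a single node ahead of the target control plane version means you upgrade in the wrong order.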
Pre-Upgrade Checklist
Run these checks before touching the cluster. Surprises during an upgrade are much worse than surprises discovered beforehand.
1. Check API Deprecations
Each Kubernetes release removes deprecated API versions. The most common failure mode: a Helm chart or CI pipeline applies a manifest using a removed apiVersion, and the API server rejects it after the upgrade.
# Install kubent (Kubernetes deprecation checker)
curl -s https://raw.githubusercontent.com/doitintl/kube-no-trouble/master/scripts/install.sh | bash
# Scan your cluster for deprecated APIs
kubent
# Example output:
# NAMESPACE    NAME         GVK                                 REPLACEMENT
# monitoring   prometheus   networking.k8s.io/v1beta1 Ingress   networking.k8s.io/v1 Ingress
# default      my-cronjob   batch/v1beta1 CronJob               batch/v1 CronJob
Fix everything kubent flags before proceeding. Existing objects are migrated automatically, but any manifest that still uses a removed apiVersion will be rejected on the next apply, breaking Helm upgrades and CI deploys after the cluster upgrade.
2. Check Add-on Compatibility
Each EKS add-on version has a Kubernetes version compatibility matrix. Check which versions are compatible with your target:
TARGET_VERSION="1.32"
# CoreDNS
aws eks describe-addon-versions \
--kubernetes-version $TARGET_VERSION \
--addon-name coredns \
--query 'addons[0].addonVersions[0:3].addonVersion' \
--output table
# kube-proxy
aws eks describe-addon-versions \
--kubernetes-version $TARGET_VERSION \
--addon-name kube-proxy \
--query 'addons[0].addonVersions[0:3].addonVersion' \
--output table
# VPC CNI
aws eks describe-addon-versions \
--kubernetes-version $TARGET_VERSION \
--addon-name vpc-cni \
--query 'addons[0].addonVersions[0:3].addonVersion' \
--output table
# EBS CSI driver
aws eks describe-addon-versions \
--kubernetes-version $TARGET_VERSION \
--addon-name aws-ebs-csi-driver \
--query 'addons[0].addonVersions[0:3].addonVersion' \
--output table
3. Verify PodDisruptionBudgets
PodDisruptionBudgets (PDBs) are what make node drains zero-downtime — or what makes them hang indefinitely if misconfigured. Check your PDBs now:
# List all PDBs
kubectl get pdb -A
# Check for PDBs that might block drains
kubectl get pdb -A -o json | python3 -c "
import json, sys
data = json.load(sys.stdin)
for pdb in data['items']:
    ns = pdb['metadata']['namespace']
    name = pdb['metadata']['name']
    allowed = pdb['status'].get('disruptionsAllowed', 0)
    if allowed == 0:
        print(f'WARN: {ns}/{name} — disruptionsAllowed=0 (will block drain)')
"
A PDB with maxUnavailable: 0 or minAvailable equal to the current replica count will block kubectl drain. The drain will hang waiting for permission that never comes. Either temporarily scale up the deployment or adjust the PDB before draining.
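The eviction arithmetic behind a blocked drain can be sketched directly. This is a simplification of the disruption controller’s calculation (absolute values only, no percentages); the numbers are illustrative:

```python
def disruptions_allowed(healthy: int, desired: int,
                        min_available: int = None,
                        max_unavailable: int = None) -> int:
    """Evictions the PDB permits right now (simplified model)."""
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = desired - max_unavailable
    return max(0, healthy - desired_healthy)

# 3 replicas, all healthy, minAvailable: 2 -> one eviction allowed
print(disruptions_allowed(3, 3, min_available=2))    # 1
# 3 replicas, minAvailable: 3 -> drain hangs forever
print(disruptions_allowed(3, 3, min_available=3))    # 0
# 3 replicas, one already unhealthy, maxUnavailable: 1 -> also blocked
print(disruptions_allowed(2, 3, max_unavailable=1))  # 0
```

The last case is the one that surprises people mid-upgrade: an already-unhealthy replica consumes the disruption budget, so a drain that would normally proceed blocks until the pod recovers.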
4. Check Node Group AMI Support
# Current node group versions
aws eks list-nodegroups --cluster-name my-cluster --query 'nodegroups' --output table
# Check current version per node group
aws eks describe-nodegroup \
--cluster-name my-cluster \
--nodegroup-name general-workers \
--query 'nodegroup.{K8sVersion:releaseVersion,AMI:amiType,DesiredSize:scalingConfig.desiredSize}' \
--output table
Phase 1: Control Plane Upgrade
The control plane upgrade is the safest step — AWS handles it, your nodes keep running, workloads don’t restart.
CLUSTER_NAME="my-cluster"
TARGET_VERSION="1.32"
# Initiate control plane upgrade
aws eks update-cluster-version \
--name $CLUSTER_NAME \
--kubernetes-version $TARGET_VERSION
# Get the update ID
UPDATE_ID=$(aws eks list-updates --name $CLUSTER_NAME \
--query 'updateIds[0]' --output text)
# Watch upgrade status (takes 10–25 minutes)
watch -n 30 "aws eks describe-update \
--name $CLUSTER_NAME \
--update-id $UPDATE_ID \
--query 'update.{Status:status,Type:type,CreatedAt:createdAt}' \
--output table"
# Verify completion
aws eks describe-cluster --name $CLUSTER_NAME \
--query 'cluster.{Version:version,Status:status}' --output table
During the upgrade, the API server is briefly unavailable (typically under 60 seconds total across all API server replacements). Running workloads aren’t affected — they’re on the nodes, not the control plane. kubectl commands will fail during that window; retry them.
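If the upgrade runs inside automation, wrap control-plane calls in a small retry. A generic sketch, assuming nothing about your tooling (flaky below is a stand-in that simulates an API-server blip; in practice you’d wrap your kubectl or aws invocation):

```python
import time

def retry(fn, attempts: int = 5, delay: float = 2.0):
    """Call fn, retrying on any exception up to `attempts` times."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)

calls = {"count": 0}
def flaky():  # stands in for a kubectl call during the upgrade window
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection refused")
    return "ok"

print(retry(flaky, delay=0.01))  # "ok" on the third attempt
```

Five attempts two seconds apart comfortably outlasts the sub-60-second API server replacement window.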
Phase 2: Update EKS Add-ons
Update add-ons after the control plane and before node groups. The order matters: some add-on versions require the new control plane API, and running updated add-ons on old nodes is a supported combination, while running old add-ons against new nodes can cause compatibility issues.
CLUSTER_NAME="my-cluster"
# Update CoreDNS
aws eks update-addon \
--cluster-name $CLUSTER_NAME \
--addon-name coredns \
--resolve-conflicts OVERWRITE
# Update kube-proxy
aws eks update-addon \
--cluster-name $CLUSTER_NAME \
--addon-name kube-proxy \
--resolve-conflicts OVERWRITE
# Update VPC CNI
aws eks update-addon \
--cluster-name $CLUSTER_NAME \
--addon-name vpc-cni \
--resolve-conflicts OVERWRITE
# Update EBS CSI driver
aws eks update-addon \
--cluster-name $CLUSTER_NAME \
--addon-name aws-ebs-csi-driver \
--resolve-conflicts OVERWRITE
# Check that every add-on reports ACTIVE (re-run until they all do)
aws eks list-addons --cluster-name $CLUSTER_NAME --output json | \
python3 -c "
import json, subprocess, sys
data = json.load(sys.stdin)
for addon in data['addons']:
    result = subprocess.run([
        'aws', 'eks', 'describe-addon',
        '--cluster-name', 'my-cluster',  # keep in sync with CLUSTER_NAME above
        '--addon-name', addon,
        '--query', 'addon.{Name:addonName,Status:status,Version:addonVersion}',
        '--output', 'text'
    ], capture_output=True, text=True)
    print(result.stdout.strip())
"
The --resolve-conflicts OVERWRITE flag tells EKS to overwrite any manual modifications to the add-on’s managed resources. If your team has edited the CoreDNS ConfigMap or the VPC CNI DaemonSet directly, those changes will be lost. Use PRESERVE to keep your configuration values, or NONE to make the update fail when conflicts exist so you can resolve them manually.
Phase 3: Upgrade Managed Node Groups
Managed node groups support in-place rolling upgrades. EKS launches new nodes with the updated AMI, cordons and drains the old nodes in batches, then terminates them.
CLUSTER_NAME="my-cluster"
NODEGROUP="general-workers"
# Initiate node group upgrade (recommended: respects PDBs)
aws eks update-nodegroup-version \
--cluster-name $CLUSTER_NAME \
--nodegroup-name $NODEGROUP \
--kubernetes-version 1.32
# Last resort: --force continues even when a PDB would block eviction
aws eks update-nodegroup-version \
--cluster-name $CLUSTER_NAME \
--nodegroup-name $NODEGROUP \
--kubernetes-version 1.32 \
--force
# Watch progress
aws eks describe-update \
--name $CLUSTER_NAME \
--nodegroup-name $NODEGROUP \
--update-id $(aws eks list-updates \
--name $CLUSTER_NAME \
--nodegroup-name $NODEGROUP \
--query 'updateIds[0]' --output text) \
--query 'update.{Status:status,Params:params}' \
--output table
By default, EKS replaces one node at a time (the node group’s updateConfig maxUnavailable defaults to 1). For large node groups you can widen the batch with updateConfig:
# Allow up to 25% of the group to be replaced in parallel
aws eks update-nodegroup-config \
--cluster-name $CLUSTER_NAME \
--nodegroup-name $NODEGROUP \
--update-config maxUnavailablePercentage=25
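The batch size translates directly into rollout length; a quick sketch of the trade-off (node counts are illustrative):

```python
import math

def rollout_batches(nodes: int, max_unavailable: int) -> int:
    """Replacement batches for a rolling node group update:
    each batch replaces up to max_unavailable nodes."""
    return math.ceil(nodes / max_unavailable)

print(rollout_batches(12, 1))  # 12 batches: slowest, one node at risk at a time
print(rollout_batches(12, 3))  # 4 batches: faster, a quarter of capacity in flight
```

Each batch also pays the full drain time of its slowest node, so small batches multiply any PDB-induced drain delays.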
For large clusters with Karpenter managing nodes, the process differs — Karpenter handles its own node lifecycle. Drift detection in Karpenter automatically replaces nodes that don’t match the current NodePool spec, which includes the Kubernetes version. Check the EKS Karpenter Autoscaling guide for the Karpenter-specific upgrade path.
Handling Self-Managed Node Groups
Self-managed node groups require manual drain and terminate. The zero-downtime pattern is: launch new nodes first, then drain old ones.
# Step 1: Scale up by launching new nodes with the target AMI
# (Create a new launch template version with the new AMI ID, update the ASG)
NEW_AMI=$(aws ssm get-parameter \
--name /aws/service/eks/optimized-ami/1.32/amazon-linux-2/recommended/image_id \
--query 'Parameter.Value' --output text)
# AL2023 equivalent: /aws/service/eks/optimized-ami/1.32/amazon-linux-2023/x86_64/standard/recommended/image_id
aws ec2 create-launch-template-version \
--launch-template-id lt-0123456789abcdef0 \
--source-version '$Latest' \
--launch-template-data "{\"ImageId\":\"$NEW_AMI\"}"
# Update ASG to use new launch template version
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name my-asg \
--launch-template LaunchTemplateId=lt-0123456789abcdef0,Version='$Latest'
# Step 2: Temporarily double the desired capacity (old nodes + new nodes)
aws autoscaling set-desired-capacity \
--auto-scaling-group-name my-asg \
--desired-capacity 10 # Was 5; now 10 to have both old and new
# Wait for the new v1.32 nodes to appear and become Ready
kubectl get nodes --watch | grep "v1.32"
# Step 3: Drain each old node (VERSION is column 5 of `kubectl get nodes`)
for node in $(kubectl get nodes --no-headers | awk '$5 !~ /v1\.32/ {print $1}'); do
kubectl drain $node \
--ignore-daemonsets \
--delete-emptydir-data \
--timeout=300s
echo "Drained $node"
done
# Step 4: Terminate each drained instance, decrementing capacity back toward 5
aws autoscaling terminate-instance-in-auto-scaling-group \
--instance-id i-0123456789abcdef0 \
--should-decrement-desired-capacity
PodDisruptionBudgets for Zero-Downtime Drains
A deployment without a PDB has no protection during node drain. All pods can be evicted simultaneously, causing brief outages. Add PDBs to every production deployment:
# Good PDB: always keep at least 2 pods available
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: my-api-pdb
namespace: my-api
spec:
minAvailable: 2
selector:
matchLabels:
app: my-api
---
# Alternative: allow at most 1 pod unavailable at a time
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: my-api-pdb-max
namespace: my-api
spec:
maxUnavailable: 1
selector:
matchLabels:
app: my-api
For a fixed replica count the two forms are equivalent: with 3 replicas, minAvailable: 2 and maxUnavailable: 1 both require 2 healthy pods before an eviction is allowed. The difference appears when you scale. maxUnavailable grows the disruption budget with the replica count, while minAvailable is absolute: scale a minAvailable: 2 deployment to 10 replicas and a drain may evict 8 pods at once. Use minAvailable when you need a hard availability floor, maxUnavailable when you want disruption proportional to scale.
One gotcha: a Deployment with replicas: 1 and minAvailable: 1 will permanently block drain. The node can never be drained because there’s always exactly one pod and it can never be evicted while still satisfying the PDB. Either increase replicas to 2+ or remove the PDB for singleton workloads.
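That conflict is easy to detect mechanically before a drain. A sketch against sample data (in practice you’d feed it the JSON from kubectl get deploy,pdb; the names and the flat dict shapes here are illustrative stand-ins):

```python
# Sample inputs standing in for live cluster state
deployments = {"my-api": 1, "worker": 3}  # name -> replicas
pdbs = {"my-api": 1, "worker": 2}         # name -> minAvailable

for app, replicas in deployments.items():
    min_avail = pdbs.get(app)
    if min_avail is not None and replicas <= min_avail:
        # zero evictions are ever allowed: drain will hang on this app's pods
        print(f"WARN: {app} — replicas ({replicas}) <= minAvailable ({min_avail})")
```

Running a check like this as a CI gate on PDB changes catches the singleton trap long before upgrade day.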
Post-Upgrade Verification
CLUSTER_NAME="my-cluster"
# 1. Verify control plane version
aws eks describe-cluster --name $CLUSTER_NAME \
--query 'cluster.version' --output text
# 2. Verify all nodes are on the new version
kubectl get nodes -o wide | awk '{print $1, $5}' | sort -k2
# 3. Check all system pods are healthy
kubectl get pods -n kube-system
# 4. Verify add-on versions
for addon in $(aws eks list-addons --cluster-name $CLUSTER_NAME \
--query 'addons[]' --output text); do
aws eks describe-addon --cluster-name $CLUSTER_NAME \
--addon-name $addon \
--query 'addon.{Name:addonName,Version:addonVersion,Status:status}' \
--output table
done
# 5. Spot check workload health: deployments whose READY count is short
kubectl get deployments -A --no-headers | awk '{split($3, r, "/"); if (r[1] != r[2]) print}'
# 6. Check for any pending PDB violations
kubectl get pdb -A
# 7. Run a quick smoke test on your application endpoints
curl -f https://api.example.com/health && echo "API healthy"
Common Failure Modes
Drain hangs indefinitely. A PDB with disruptionsAllowed: 0 is blocking. Find it with kubectl get pdb -A and either scale up the deployment or use --disable-eviction as a last resort (which bypasses PDBs but accepts the risk of service disruption).
Add-on update stuck in DEGRADED. The add-on pod is probably crashing. Check with kubectl describe pod -n kube-system <addon-pod>. Common causes: IAM role missing permissions (IRSA annotation on the service account is wrong), or incompatible configuration in the add-on’s ConfigMap.
API version errors after upgrade. A manifest in your CI pipeline uses a removed API. Find it with kubent and update the apiVersion field in your Helm chart or raw manifest. Old Helm releases stored in the cluster may also contain deprecated API objects; helm get manifest <release> shows what’s stored.
Node group upgrade creates nodes that fail to join. Usually a bootstrap argument or user data issue in the launch template. Check the EC2 instance’s system log via the console or aws ec2 get-console-output. Common EKS causes are incorrect --apiserver-endpoint or --b64-cluster-ca bootstrap arguments, or security group rules that block the node from reaching the API server.
For the full cluster security hardening that should accompany upgrades, the EKS RBAC and security guide covers IRSA, security groups, and pod security standards. The Kubernetes release notes for your target version list the API changes that this playbook’s pre-upgrade checks are designed to catch.