Cluster API v1.12 for Platform Teams: In-Place Updates, Chained Upgrades, and Day-2 Operations
Cluster lifecycle work is usually where platform engineering gets less glamorous and more expensive. Creating a cluster is the easy part. The real time sink is upgrading across minor versions, changing rollout behavior without unnecessary machine replacement, coordinating add-ons, and keeping GitOps honest during long-running lifecycle operations. Cluster API v1.12 matters because it improves that day-2 layer instead of pretending the hard part is bootstrap.
The two release themes worth paying attention to are chained upgrades and more useful in-place propagation. Those sound incremental. They are not. They change how much orchestration logic your team needs to invent around version transitions and low-risk topology changes.
If you are coming from the workload side of the platform first, the ArgoCD on EKS guide and the Prometheus and Grafana on EKS guide are the right companions. Those posts assume the cluster lifecycle underneath them is already stable. This post is about making that assumption more believable.
Why Cluster API Still Matters In 2026
Managed Kubernetes services solve control-plane hosting. They do not eliminate cluster lifecycle concerns. Platform teams still need repeatable cluster definitions, consistent worker policies, upgrade sequencing, add-on coordination, and a management story that works across environments.
That is where Cluster API keeps earning its place. It gives you Kubernetes-style declarative objects for cluster lifecycle, and it keeps getting more opinionated about how topology and upgrade workflows should behave. The value is not abstraction for its own sake. The value is that cluster creation, upgrade, and migration stop being collections of custom scripts tied to one provider team’s memory.
What Chained Upgrades Actually Fix
Before the newer chained-upgrade work, upgrades across more than one minor version could become awkward quickly. Teams either performed each step manually, or wrapped Cluster API in external orchestration that knew which control-plane and worker transitions were allowed and in what order.
The newer upgrade-plan model moves that logic closer to Cluster API itself. The Cluster API runtime hook documentation describes an explicit upgrade plan where control plane and workers step through intermediate versions in a valid sequence until they reach the target version. It also documents lifecycle hooks that fire before and after control plane and worker upgrades for each intermediate step.
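The entry point stays the same declarative shape: you set the target version on the Cluster topology and let the controller derive the intermediate steps. A minimal sketch, assuming a ClusterClass-based cluster; the names prod-cluster, prod-class, and default-worker are placeholders, and the API version may differ in your installation:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-cluster
spec:
  topology:
    class: prod-class
    # Jumping more than one minor version: with chained upgrades, the
    # controller steps the control plane and workers through valid
    # intermediate versions instead of requiring one manual hop at a time.
    version: v1.33.0
    controlPlane:
      replicas: 3
    workers:
      machineDeployments:
        - class: default-worker
          name: md-0
          replicas: 3
```

The point is that the multi-step sequencing lives in the controller, not in a wrapper script that knows which version jumps are legal.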
That is operationally important for two reasons.
First, the control plane and worker upgrade sequence becomes inspectable and programmable. You can reason about what will happen before starting the change.
Second, add-on coordination gets a real place to live. If you have CNI, CSI, policy, or observability components that must be prepared before workers move to the next step, the runtime hooks give you a cleaner place to block and validate than an external pile of bash glue.
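Wiring a hook in means registering a runtime extension server with the management cluster. A hedged sketch of the registration object, assuming an extension served from a Service named addon-coordinator; all names and namespaces here are placeholders:

```yaml
apiVersion: runtime.cluster.x-k8s.io/v1alpha1
kind: ExtensionConfig
metadata:
  name: addon-coordinator
spec:
  clientConfig:
    service:
      # Placeholder Service fronting your extension server
      name: addon-coordinator
      namespace: capi-extensions
      port: 443
  # Limit which workload cluster namespaces this extension applies to
  namespaceSelector:
    matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: In
        values: [prod]
```

Cluster API calls the extension's discovery endpoint to learn which hooks it implements, so one registration can serve both before- and after-upgrade handlers.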
This is one of those features that matters more to mature platforms than greenfield ones. A lab cluster can tolerate ad hoc sequencing. A fleet of business-critical clusters should not depend on improvised upgrade memory.
In-Place Propagation Is More Valuable Than It Sounds
The phrase “in-place updates” sounds like minor controller plumbing. In reality it is about avoiding needless rollouts for changes that do not justify replacement.
Cluster API already documents several fields that can propagate in place without triggering a full rollout. The MachineDeployment and MachineSet controller docs call out labels, annotations, minReadySeconds, and the node deletion-related timeouts such as nodeDrainTimeout, nodeDeletionTimeout, and nodeVolumeDetachTimeout. The KubeadmControlPlane docs describe similar in-place propagation for machine template metadata and the same drain and deletion timeouts.
That matters because day-2 operations are full of these changes. You tune nodeDrainTimeout. You add labels that drive policy or observability. You adjust annotations used by add-ons. None of those should trigger a broad, expensive machine replacement if the provider can honor them in place.
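As a sketch of what those fields look like on a v1beta1 MachineDeployment, trimmed to the relevant parts (label and annotation keys are invented for illustration, and field names can differ in newer API versions):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: md-0
spec:
  clusterName: prod-cluster
  template:
    metadata:
      labels:
        observability.example.com/scrape: "true"  # propagates in place
      annotations:
        policy.example.com/tier: "gold"           # propagates in place
    spec:
      nodeDrainTimeout: 10m      # tunable without a machine rollout
      nodeDeletionTimeout: 10s   # same
```

Editing only these fields should update existing Machines in place; changing the machine template's spec proper still triggers a rollout, which is the correct boundary.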
This is where platform teams save real operational pain. A cleaner in-place path reduces double rollouts, shortens maintenance windows, and makes it less risky to refine cluster policy after the cluster is already in service.
The Practical Upgrade Controls To Pay Attention To
Cluster API now exposes better control points for upgrade sequencing on ClusterClass-based ("classy") clusters. The annotations reference documents two that are especially useful.
The first is topology.cluster.x-k8s.io/upgrade-concurrency, which lets you control how many MachineDeployments upgrade at once.
The second is topology.cluster.x-k8s.io/hold-upgrade-sequence, which lets you defer a MachineDeployment topology and all subsequent ones in the sequence.
Those are not just nice-to-have knobs. They are the difference between “one more opaque cluster upgrade” and a controlled, operator-readable plan.
A minimal example looks like this:
```shell
# Limit how many MachineDeployments upgrade concurrently on this cluster
kubectl annotate cluster prod-cluster \
  topology.cluster.x-k8s.io/upgrade-concurrency="2" \
  --overwrite

# Separately, inspect the provider upgrade plan for the management cluster
clusterctl upgrade plan
```
That is still not a full change-management process, but it gives the platform team a clearer way to shape the blast radius before the upgrade starts.
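If one pool should wait, the hold annotation goes on the MachineDeployment topology inside the Cluster spec rather than on the Cluster itself. A trimmed sketch, with placeholder names:

```yaml
spec:
  topology:
    workers:
      machineDeployments:
        - class: default-worker
          name: md-gpu
          metadata:
            annotations:
              # Holds this topology and all subsequent ones in the sequence
              topology.cluster.x-k8s.io/hold-upgrade-sequence: "true"
```

Removing the annotation releases the hold and lets the rest of the sequence proceed.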
Runtime Hooks Are Powerful, And Dangerous If You Treat Them Casually
The runtime hook model is one of the more interesting parts of the newer Cluster API story. The docs define hooks like BeforeClusterUpgrade, BeforeControlPlaneUpgrade, AfterControlPlaneUpgrade, BeforeWorkersUpgrade, and AfterWorkersUpgrade.
That gives platform teams a place to coordinate version-sensitive add-ons, admission changes, policy checks, or external maintenance steps. It is the right direction.
It is also very easy to misuse.
The Cluster API docs are blunt about runtime extensions being advanced and potentially dangerous if implemented poorly. A failing extension can block upgrades. A non-deterministic hook can turn upgrade planning into guesswork. A hook that depends on fragile external systems can leave the cluster waiting on infrastructure that has nothing to do with the cluster’s actual health.
That means the design rule should be strict: runtime hooks must be deterministic, idempotent, and boring. If the extension is clever, it is probably too clever.
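The blocking contract itself is simple, which is part of why it stays safe when used conservatively: a hook blocks by returning a retry interval, and unblocks by returning success with no retry. A hedged sketch of the response body shape, following the documented hooks API (verify against your installed version):

```yaml
# Response from a blocking before-upgrade hook (sketch)
apiVersion: hooks.runtime.cluster.x-k8s.io/v1alpha1
kind: BeforeClusterUpgradeResponse
status: Success
# A non-zero retryAfterSeconds tells Cluster API to block the upgrade
# and call the hook again after this many seconds
retryAfterSeconds: 60
message: "waiting for CNI to be pinned to a compatible version"
```

A boring handler computes that response from cluster state it can actually observe, rather than making inline calls to fragile external systems.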
Day-2 Operations Are More Than Version Numbers
This is the point many teams miss. Day-2 is not only Kubernetes version upgrades. It is also topology evolution, add-on sequencing, node-deletion behavior, policy propagation, and management-cluster hygiene.
For example, Cluster API’s docs note that while an upgrade is blocked, topology changes can be delayed from propagating to underlying objects until the system is ready to pick them up safely. That is the right consistency behavior, but it means operators need to understand that scale changes or new MachineDeployment topology changes may not apply immediately during a held upgrade.
That is good engineering. It is also a source of confusion if nobody on the team understands why a seemingly valid change is waiting.
This is exactly why platform documentation and alerting still matter even with a stronger controller model. You are not removing complexity. You are moving it into safer machinery.
The Gotchas I Would Plan For First
The first is provider support. Cluster API can define a better lifecycle contract, but your infrastructure and control-plane providers still determine how fully those contracts are honored. Do not assume every provider supports every in-place or topology behavior equally.
The second is management-cluster fragility. The clusterctl move documentation still reminds operators that the management cluster needs schedulable worker capacity for Cluster API workloads. If your management cluster is too thin, your elegant lifecycle plan dies on contact with simple scheduling reality.
The third is over-automation through hooks. A hook that blocks every upgrade because an external dependency is flaky is not a safety control. It is an outage factory.
The fourth is forgetting the workload layer. A technically correct cluster upgrade can still be an operational failure if your observability, policy, and GitOps layers are not version-aware. That is why Kyverno policy-as-code on EKS and the Amazon EKS capabilities guide are still useful companion reading even if your cluster lifecycle is driven through Cluster API.
When I Would Use Cluster API v1.12 Aggressively
I would use it aggressively when the platform team owns multiple clusters, multiple environments, or multiple providers and wants one declarative lifecycle model with controlled upgrades.
I would especially lean into the newer upgrade features if the team has already been forced to build external orchestration just to handle multi-step cluster upgrades safely. That is a signal that the lifecycle engine needs to do more for you.
I would be more conservative if the team has one small environment, no strong need for topology abstraction, and no operational appetite to own runtime extensions or management-cluster discipline. Cluster API is excellent for platforms. It is not mandatory for every small cluster footprint.
The Practical Recommendation
Use Cluster API v1.12 to reduce custom lifecycle glue, not to add another abstraction layer because the project sounds mature.
Adopt chained upgrades if your current process for multi-minor transitions lives in scripts and tribal memory. Use in-place propagation deliberately to avoid needless rollouts for low-risk topology changes. Treat runtime hooks like production control code: deterministic, reviewable, and minimal.
That is the real improvement in Cluster API v1.12. Not a shinier bootstrap story. A better chance that day-2 cluster operations stop depending on improvised heroics.