ECS Canary and Linear Deployments with Network Load Balancers

On February 4, 2026, Amazon ECS added native support for linear and canary deployment strategies for services using Network Load Balancers. That is a small announcement with a large operational consequence. Workloads that need TCP, UDP, low latency, long-lived connections, or static IP addresses can now use managed incremental traffic shifting without leaving the ECS deployment model.

Before this launch, many NLB-backed ECS services were stuck with a blunt choice. Use rolling updates and accept limited traffic-shift control, or build a custom blue/green pattern with extra target groups, listeners, automation, and rollback logic. Application Load Balancer users had more obvious progressive-delivery paths. NLB users often had to do more work.

[Figure: ECS NLB canary and linear deployment timeline]

The new support does not make every ECS deployment safe by default. It gives you a better release lever. You still need health checks, CloudWatch alarms, service metrics, connection-draining expectations, rollback criteria, and a deployment shape that matches your protocol.

What AWS Added

AWS says ECS now supports linear and canary deployment strategies for ECS services using Network Load Balancers. The announcement calls out applications that commonly use NLB: TCP and UDP services, low-latency systems, long-lived connections, and workloads that require static IP addresses. It also says deployments can integrate with Amazon CloudWatch alarms to automatically stop or roll back if issues are detected.

The ECS documentation describes four managed strategy families: rolling, blue/green, linear, and canary.

Strategy | Traffic behavior | Good fit | Main tradeoff
Rolling | Replace tasks gradually | Normal stateless services, cost-sensitive services | Less precise traffic control
Blue/green | Create a new environment, then switch traffic | Fast rollback and pre-traffic validation | More duplicate capacity and target group complexity
Linear | Shift equal traffic increments over time | Gradual validation and performance monitoring | Slower rollout
Canary | Shift a small percentage first, then the rest after a bake period | Feature validation and blast-radius control | Requires strong alarms during bake time

The difference is not just vocabulary. Linear and canary deployments let you observe the new task set under real production traffic before completing the rollout. That is especially important for NLB workloads, where problems can hide behind long-lived connections, binary protocols, or clients that retry aggressively.

BitsLovers has covered ECS operational patterns in the Amazon ECS managed daemons guide and the ECS Express Mode guide. This post focuses on progressive delivery for the NLB services that usually carry the sharpest operational edges.

Why NLB Workloads Needed This

Network Load Balancers are used when L7 routing is not the point. You choose NLB for L4 behavior: TCP, UDP, TLS pass-through, static IP addresses, very high performance, or long-lived connections. That often means the application protocol owns behavior that an ALB would otherwise expose through HTTP metrics, target response codes, and path-based routing.

That makes deployment safety harder.

NLB workload type | Why rolling updates can be risky | Canary or linear benefit
Financial transaction service | Small regression can affect high-value traffic | Limit initial exposure and watch business metrics
Real-time messaging | Long-lived connections hide bad behavior | Observe connection churn before full rollout
Online gaming backend | Latency and packet behavior matter more than HTTP status | Shift gradually and watch protocol metrics
gRPC over TCP/TLS | Client channel behavior can mask target changes | Canary new tasks with connection-level telemetry
Static IP partner integration | Partner clients may retry slowly or cache paths | Avoid all-at-once exposure

The AWS announcement explicitly mentions online gaming backends, financial transaction systems, and real-time messaging services. Those are exactly the systems where “the service is healthy” can be an incomplete signal. A target can pass an NLB health check while still breaking a subset of protocol behavior.

This is why progressive delivery needs application metrics. NLB health checks tell you whether targets are reachable. They do not prove the new version handles a payment reversal, game session rejoin, WebSocket reconnect, or binary message negotiation correctly.

Canary Versus Linear

Use canary when you want to expose a small slice of traffic, wait, and then finish. Use linear when you want to increase exposure in equal increments and observe each step.

Decision factor | Canary | Linear
Best for | Unknown feature risk | Performance or capacity validation
Traffic shape | Small first step, then large final step | Equal increments over time
Operator attention | Intense during bake period | Repeated at each increment
Failure visibility | Early if alarms are strong | Easier to spot gradual degradation
Rollout time | Usually shorter | Usually longer
Capacity pressure | New and old task sets overlap during bake | Overlap lasts through gradual shift

My default for high-risk behavior changes is canary. Put 5% or 10% of traffic on the new revision, wait long enough to see real behavior, then complete. My default for performance-sensitive changes is linear. Move 10% or 20% at a time and watch latency, connection count, CPU, memory, queue depth, and business metrics at each step.

The wrong choice is using progressive delivery without enough bake time. A canary that waits 30 seconds for a service with 15-minute client sessions is theater. A linear rollout with no alarms is a slower rolling update, not a safety mechanism.

The Metrics That Actually Matter

For NLB deployments, split signals into four groups: load balancer health, ECS service health, application protocol health, and business health.

Signal group | Metrics or checks | Why it matters
NLB | Healthy host count, unhealthy host count, target reset count, active flow count | Confirms target reachability and connection behavior
ECS | Running task count, deployment state, CPU, memory, task restarts | Confirms scheduler and capacity health
Application | p95/p99 latency, protocol error rate, retry rate, connection churn | Confirms the new version handles real protocol traffic
Business | Payment failures, match disconnects, message delivery failures, job completion rate | Confirms users are not harmed

CloudWatch alarms should not be decorative. If the deployment can roll back automatically, alarms define the rollback contract.

Good alarms are:

  • Fast enough to catch a bad canary before the final shift.
  • Specific enough to avoid rolling back for unrelated background noise.
  • Sensitive to the new version’s behavior, not only global service health.
  • Tested with synthetic failures before you rely on them.

This is where the OpenTelemetry and CloudWatch observability guide becomes relevant. If you cannot break down metrics by task set, version, deployment ID, or target group, a canary can look fine because the stable version still carries most traffic.

Tag or label every metric with version information where the stack allows it. At minimum, log deployment ID and task definition revision in structured logs.
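
As a minimal sketch of that minimum, assuming Python and the ECS task metadata endpoint v4 (ECS injects the ECS_CONTAINER_METADATA_URI_V4 environment variable into containers), a JSON log formatter can stamp every line with the task definition family and revision. A deployment ID is not part of task metadata, so have the pipeline inject it as an environment variable if you want it in logs.

import json
import logging
import os
import urllib.request

def fetch_task_metadata() -> dict:
    """Read task metadata from the ECS task metadata endpoint (v4)."""
    base = os.environ.get("ECS_CONTAINER_METADATA_URI_V4")
    if not base:
        return {}  # not running on ECS (local dev, unit tests)
    with urllib.request.urlopen(f"{base}/task", timeout=2) as resp:
        return json.load(resp)

class VersionedJsonFormatter(logging.Formatter):
    """Emit JSON logs stamped with task definition family and revision."""
    def __init__(self, metadata: dict):
        super().__init__()
        self.family = metadata.get("Family", "unknown")
        self.revision = metadata.get("Revision", "unknown")

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "taskDefinitionFamily": self.family,
            "taskDefinitionRevision": self.revision,
            # assumption: your deploy pipeline sets this variable
            "deploymentId": os.environ.get("DEPLOYMENT_ID", "unknown"),
        })

handler = logging.StreamHandler()
handler.setFormatter(VersionedJsonFormatter(fetch_task_metadata()))
logging.basicConfig(level=logging.INFO, handlers=[handler])
logging.info("connection accepted")  # now carries version fields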

Designing The Alarm Contract

The alarm contract is the most important part of an automated canary. It says what failure means.

Do not start with “rollback on any alarm.” Start with the service promise. A payments service may care about authorization failure rate and p99 latency. A messaging service may care about delivery acknowledgement delay and reconnect rate. A game backend may care about session join failures and packet processing latency. The NLB only sees targets and flows. Your users experience protocol behavior.

I like two alarm tiers:

Alarm tier | Example | Deployment action
Hard rollback | New-version error rate above 2% for 3 datapoints | Stop and roll back
Human review | p99 latency 25% above baseline for 10 minutes | Pause or require approval

Hard rollback alarms must be precise. If they fire constantly for unrelated noise, engineers will disable them. Human-review alarms can be more exploratory. They should slow the rollout when the signal is suspicious but not conclusive.

Missing data needs an explicit setting. For a small canary, some metrics may have low traffic. Treating missing data as breaching can roll back a healthy deployment. Treating missing data as not breaching can hide a canary that receives no traffic. Pick intentionally and test it with a dry run.
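
Here is roughly what that decision looks like with boto3. The namespace, metric name, and dimension are assumptions standing in for whatever version-scoped metric your tasks emit; TreatMissingData is the real CloudWatch setting being chosen.

import boto3

cloudwatch = boto3.client("cloudwatch")

# "breaching" is the strict choice: no data from the canary counts as
# failure, so a canary that receives zero traffic cannot silently pass.
# Swap in "notBreaching" if sparse canary traffic would cause false alarms.
cloudwatch.put_metric_alarm(
    AlarmName="payments-canary-error-rate",
    Namespace="Payments/Canary",              # assumption: custom namespace
    MetricName="ErrorRate",                   # assumption: emitted per version
    Dimensions=[{"Name": "TaskDefinitionRevision", "Value": "42"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    DatapointsToAlarm=3,
    Threshold=2.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",
)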

Target Group And Health Check Design

NLB health checks are simple compared with application health. That simplicity is useful, but it can mislead.

For TCP services, a successful health check may only prove that the port accepts connections. For TLS services, it may prove the listener is reachable. It may not prove the service can process a real message. When the application protocol allows it, expose a dedicated health endpoint or lightweight protocol command that checks dependencies without mutating state.

Health check recommendations:

Setting | Recommendation | Reason
Health path or command | Test a real readiness condition | Avoid routing to initialized-but-broken tasks
Interval and threshold | Balance speed with noise | Fast rollback needs fast detection, but noisy health hurts availability
Deregistration delay | Match connection behavior | Long-lived connections need graceful drain time
Grace period | Give new tasks time to warm up | Cold starts and cache warm-up can trigger false failures
Per-AZ target health | Watch each zone separately | One unhealthy zone can hide under global averages

Keep liveness and readiness separate inside the container if you can. Liveness answers “should the orchestrator restart me?” Readiness answers “should the load balancer send me traffic?” During deployment, readiness matters most.
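
A minimal sketch of that split, assuming the container can expose a small HTTP health port (NLB target groups can run HTTP health checks even when the listener is TCP): /live says the process is up, /ready says traffic should arrive, and deployment events flip readiness rather than liveness.

import http.server
import threading

ready = threading.Event()  # set once caches are warm and dependencies answer

class HealthHandler(http.server.BaseHTTPRequestHandler):
    """Liveness answers "restart me?"; readiness answers "send traffic?"."""
    def do_GET(self):
        if self.path == "/live":
            self.send_response(200)  # process is up; do not restart
        elif self.path == "/ready":
            # Point the target group health check here.
            self.send_response(200 if ready.is_set() else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep health-check noise out of application logs

def start_health_server(port: int = 8081) -> http.server.ThreadingHTTPServer:
    server = http.server.ThreadingHTTPServer(("", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

start_health_server()
ready.set()  # only after real warm-up and dependency checks succeed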

Connection Draining And Long-Lived Clients

NLB workloads often have long-lived connections. That can make traffic shifting less clean than the diagram suggests.

When you move new connection traffic to a new task set, existing client connections may remain on old tasks until they disconnect or the target deregistration process completes. That is usually good. It prevents abrupt drops. It also means the new version may not receive the expected share of traffic immediately if your clients keep connections open for a long time.

Design reviews should answer these questions:

Question | Why it matters
How long do clients keep connections open? | Bake time must exceed a meaningful portion of session behavior
Do clients reconnect automatically? | A deployment can trigger reconnect storms
What happens to in-flight requests during deregistration? | Some protocols need graceful shutdown logic
Can old and new versions coexist? | Long-lived connections create mixed-version windows
Are protocol changes backward compatible? | Canary is not a substitute for compatibility

For connection-heavy services, add a pre-stop behavior inside the container. Stop accepting new work, keep health status honest, finish in-flight work where possible, and exit within the deregistration and ECS stop timeout expectations. Test it. Do not assume SIGTERM handling is correct.
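
Here is a sketch of that shape for an imagined echo-style TCP service. handle_client stands in for real protocol work, and the drain grace period is an assumption you should derive from your deregistration delay and ECS stopTimeout.

import signal
import socket
import sys
import threading

draining = threading.Event()

def handle_sigterm(signum, frame):
    # ECS sends SIGTERM at task stop, waits stopTimeout, then SIGKILLs.
    # The NLB deregistration delay runs on its own clock; plan for both.
    draining.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def handle_client(conn: socket.socket) -> None:
    """Placeholder for real protocol handling (here: echo)."""
    with conn:
        while data := conn.recv(1024):
            conn.sendall(data)

def serve(listener: socket.socket, drain_grace_seconds: float = 90.0) -> None:
    """Accept loop that stops taking new work once draining starts."""
    listener.settimeout(1.0)  # wake up regularly to notice draining
    workers = []
    while not draining.is_set():
        try:
            conn, _ = listener.accept()
        except socket.timeout:
            continue
        t = threading.Thread(target=handle_client, args=(conn,))
        t.start()
        workers.append(t)
    listener.close()  # stop accepting; in-flight work continues
    for t in workers:  # finish inside the stopTimeout budget, then exit
        t.join(timeout=drain_grace_seconds)
    sys.exit(0)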

A Minimal ECS Service Shape

The exact API fields can vary by deployment controller and current tooling support, so treat this as a conceptual checklist rather than a copy-paste template.

An ECS service using progressive delivery with NLB needs:

service:
  name: payments-tcp-api
  launchType: FARGATE
  loadBalancer:
    type: network
    listener: payments-nlb-listener
    targetGroups:
      production: payments-blue
      test: payments-green
  deploymentConfiguration:
    strategy: CANARY
    canary:
      percentage: 10
      bakeTimeMinutes: 15
    alarms:
      enable: true
      rollback: true
      names:
        - payments-canary-p99-latency
        - payments-canary-error-rate
        - payments-canary-business-failures

In Terraform, CDK, CloudFormation, or CLI, use the ECS service deployment fields supported by your current provider and AWS SDK. The important design is not the syntax. It is the coupling between traffic strategy and alarm strategy. A canary without meaningful alarms is a manual observation exercise.
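
One piece you can build today regardless of strategy is deployment observation. The DescribeServices response lists each active deployment with a rollout state and task counts; the cluster and service names below are illustrative, and strategy-specific fields may differ as tooling catches up.

import boto3

ecs = boto3.client("ecs")

def deployment_status(cluster: str, service: str) -> None:
    """Print each active deployment's rollout state and task counts."""
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    for d in svc["deployments"]:
        print(
            d["id"],
            d["status"],            # PRIMARY = newest, ACTIVE = still draining
            d.get("rolloutState"),  # IN_PROGRESS, COMPLETED, or FAILED
            f"{d['runningCount']}/{d['desiredCount']} tasks",
            d["taskDefinition"],
        )

deployment_status("prod", "payments-tcp-api")  # illustrative names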

If you deploy MCP servers or other stateful-ish container workloads on ECS, the MCP on ECS deployment guide has useful context around service boundaries and operational expectations. Progressive delivery helps, but it does not replace application-level compatibility.

Terraform And CDK Review Points

Because this feature is new, provider support may lag the ECS API in some environments. Before standardizing a module, check your Terraform AWS provider version, CloudFormation resource support, CDK version, and AWS SDK model. The announcement is service capability. Your IaC toolchain still has to expose it.

For an IaC module, I would make these inputs explicit:

Module input | Example | Why expose it
deployment_strategy | ROLLING, CANARY, LINEAR | Strategy should be a conscious choice
canary_percent | 10 | Small first exposure controls blast radius
bake_time_minutes | 15 | Must match client/session behavior
linear_percent | 10 | Controls rollout pace
alarm_names | ["payments-p99", "payments-errors"] | Rollback depends on real signals
rollback_enabled | true | Some teams require manual approval
min_healthy_percent | 100 | Capacity and availability tradeoff
max_percent | 200 | Extra task capacity during deployment

Do not hide these under a generic “deployment config” object with weak defaults. Defaults become production behavior. For high-risk services, require the caller to choose strategy, bake time, and alarms explicitly.

Also include outputs for the deployment ID, target group ARNs, service ARN, and current task definition revision. Your dashboards and runbooks will need them.

Pre-Production Load Test

Before the first production canary, run a load test that mimics connection behavior, not just request rate.

For HTTP services, request-per-second load tests are often enough to find obvious regressions. For NLB services, the test must understand connection lifetime, reconnect behavior, message size, keepalive settings, TLS negotiation, and client retry policy. A service that handles 10,000 short connections may still fail with 2,000 long-lived clients.

Test matrix:

Scenario | What it reveals
Fresh connections during canary start | Whether new tasks accept traffic cleanly
Long-lived sessions before traffic shift | Whether old tasks drain safely
Client reconnect storm | Whether rollout causes cascading retries
Mixed old/new protocol versions | Whether compatibility is real
Downstream dependency slowdown | Whether alarms catch user impact
Forced rollback during traffic | Whether rollback preserves active sessions

The rollback test is the one teams skip. Do not skip it. A rollback path that has never been exercised is a story, not a control.
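
To make the long-lived-session rows concrete, here is a client sketch in Python against an invented PING/response protocol. The host, port, session length, and client count are placeholders to replace with your real protocol driver.

import socket
import threading
import time

def long_lived_client(host: str, port: int, session_seconds: int, results: list) -> None:
    """Hold one connection open and send periodic messages, like a real client."""
    try:
        with socket.create_connection((host, port), timeout=5) as conn:
            deadline = time.monotonic() + session_seconds
            while time.monotonic() < deadline:
                conn.sendall(b"PING\n")  # assumption: trivial request/response protocol
                if not conn.recv(64):
                    raise ConnectionError("server closed the connection mid-session")
                time.sleep(1.0)
        results.append("ok")
    except OSError as exc:
        results.append(f"failed: {exc}")

results: list = []
threads = [
    threading.Thread(target=long_lived_client,
                     args=("nlb-dns-name.example", 9000, 900, results))
    for _ in range(200)  # 200 clients holding 15-minute sessions through a rollout
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{results.count('ok')}/{len(results)} sessions survived the deployment")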

Rollback Is Not A Time Machine

Automatic rollback is useful. It does not erase every failure.

If a bad version mutates data, sends messages, or changes protocol state, rolling traffic back to the previous task definition may stop new damage but not repair the old damage. This is true for any deployment strategy, but canaries make it more visible because only part of traffic experiences the new behavior first.

Before enabling canary or linear deployments for a stateful workload, define the rollback and repair plan.

Failure type | Rollback enough? | Additional work
High latency from code path | Usually yes | Verify old tasks recover from saturation
Container crash loop | Usually yes | Inspect logs and image differences
Bad response format | Maybe | Confirm clients recover and caches expire
Duplicate transaction | No | Reconcile and compensate
Bad database migration | No | Restore, forward-fix, or roll the data migration back
Protocol incompatibility | Maybe | Ensure old and new clients can coexist

This is why progressive delivery should pair with backward-compatible database and protocol changes. Use expand-and-contract migrations. Deploy additive changes first. Keep old fields long enough. Avoid introducing a server version that only works with a client version not yet deployed.
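
A compressed sketch of expand-and-contract, with invented table and column names; each phase ships as its own deploy so old and new task sets always see a schema they both understand.

# Phase 1 (before any code change): additive, nullable, safe for old code.
EXPAND = "ALTER TABLE payments ADD COLUMN settlement_ref TEXT"

# Phase 2 (canary the new version): new code writes both columns,
# old code keeps working because nothing it reads has changed.

# Phase 3 (after full rollout): backfill so reads can switch over.
BACKFILL = (
    "UPDATE payments SET settlement_ref = legacy_ref "
    "WHERE settlement_ref IS NULL"
)

# Phase 4 (only after the rollback window closes): remove the old column.
CONTRACT = "ALTER TABLE payments DROP COLUMN legacy_ref"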

Capacity And Cost

Blue/green, canary, and linear deployments can require extra capacity because old and new task sets overlap. Rolling updates are often cheaper because they replace tasks gradually inside the same desired capacity envelope, depending on minimum and maximum healthy percentages. Progressive traffic shifting gives you better validation, but it is not free.
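
The overlap envelope is simple arithmetic on the service's minimum healthy percent and maximum percent (ECS rounds the lower bound up and the upper bound down), and it is worth computing before assuming you have quota headroom.

import math

def deployment_task_bounds(desired: int, min_healthy_pct: int, max_pct: int) -> tuple:
    """Task-count envelope ECS maintains during a deployment."""
    floor = math.ceil(desired * min_healthy_pct / 100)  # healthy tasks never below
    ceiling = math.floor(desired * max_pct / 100)       # running tasks never above
    return floor, ceiling

# desired=10, minimumHealthyPercent=100, maximumPercent=200:
# at least 10 healthy tasks, and up to 20 running while task sets overlap.
print(deployment_task_bounds(10, 100, 200))  # (10, 20)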

Capacity questions:

Question | Why it matters
Can the cluster or Fargate quota run both task sets? | Deployment can stall if capacity is unavailable
Do target groups have enough healthy targets per AZ? | Traffic shift is only safe when each AZ is healthy
Are downstream dependencies sized for duplicate warm-up? | New tasks may create connection pools and caches
Does the canary create cold-cache latency? | Early p99 spikes can trigger noisy rollback
Is autoscaling tied to service-level or task-set metrics? | Scaling can mask or amplify canary behavior

For ECS Express Mode and simpler services, progressive delivery may feel heavier than needed. The ECS Express Mode guide is useful for that tradeoff. If the service is low risk, stateless, and easy to roll back, a rolling update may be enough. Save canary and linear deployments for services where the blast radius matters.

This is also relevant for teams leaving App Runner. The App Runner availability change guide explains why many container teams are being pushed toward ECS-native options. If that migration lands on an NLB-backed service rather than a simple public HTTP service, plan progressive delivery from the beginning instead of adding it after the first painful deploy.

Version Compatibility Rules

Progressive delivery creates a mixed-version window. That is the point. Some traffic goes to old tasks and some goes to new tasks. Any shared dependency must tolerate that window.

Use these rules:

  1. Database changes are additive first. Add columns, indexes, or tables before code requires them.
  2. Message schemas are backward compatible during rollout. New producers should not break old consumers.
  3. Protocol changes are negotiated or optional during the mixed window.
  4. Feature flags can disable risky behavior without redeploying (see the flag sketch after this list).
  5. Cache keys include version only when the old and new formats cannot share data.
  6. Background workers and front-end services are rolled in an order that preserves compatibility.

This may sound like normal deployment hygiene, but canary makes the requirement explicit. If old and new versions cannot coexist, a canary will surface that weakness by design.
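
For the feature-flag rule, one common pattern is reading the flag from SSM Parameter Store with a short cache, so behavior can flip without a redeploy. The parameter name is an assumption; the get_parameter call is the real API.

import time
import boto3

ssm = boto3.client("ssm")
_cache = {"value": False, "expires": 0.0}

def risky_behavior_enabled(ttl_seconds: float = 30.0) -> bool:
    """Check a flag in SSM, cached so the lookup stays off the hot path."""
    now = time.monotonic()
    if now >= _cache["expires"]:
        param = ssm.get_parameter(Name="/payments/flags/new-settlement-path")
        _cache["value"] = param["Parameter"]["Value"] == "true"
        _cache["expires"] = now + ttl_seconds
    return _cache["value"]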

Not every release deserves the same rollout. Use the deployment strategy as a risk control, not as a ritual.

Change type | Recommended strategy | Extra check
Text, config, or small bug fix | Rolling or short canary | Normal service health
New protocol behavior | Canary | Client compatibility and reconnect behavior
Latency-sensitive optimization | Linear | p95/p99 latency by version
Dependency upgrade | Canary | Error rate and downstream timeout alarms
Database-read change | Canary or linear | Query latency and connection pool usage
Database-write change | Canary plus feature flag | Rollback and data repair plan
Security patch | Fast canary or rolling | Confirm vulnerability window and rollback risk

The table is deliberately opinionated. A release that changes write behavior deserves slower exposure than a release that changes a log line. A security patch may need speed, but speed is not the same as recklessness. You can run a short canary with strong alarms and still reduce blast radius.

Operational Ownership

Progressive delivery crosses team boundaries. The application team owns code behavior. The platform team owns ECS service patterns, target groups, and deployment configuration. The SRE or operations team owns alarms and incident response. If nobody owns the whole deployment contract, the rollout becomes a collection of good intentions.

Write down:

  1. Who can approve a canary completion.
  2. Who can override an alarm.
  3. Who watches the first production rollout.
  4. Who owns rollback after hours.
  5. Who repairs data if rollback is not enough.
  6. Who updates the module defaults after lessons learned.

Those decisions are not bureaucracy. They are what lets a team trust automation when the deployment is half complete and the alarm state turns red.

A Runbook For The First Deployment

Do the first canary during a low-risk window with someone watching the metrics. Automation should exist, but humans should learn the shape of the signals.

  1. Confirm both target groups are healthy before shifting.
  2. Confirm alarms are in OK state and not missing data (automatable; see the sketch after this list).
  3. Deploy with a small canary percentage, such as 5% or 10%.
  4. Watch p95 and p99 latency, error rate, connection churn, and business metrics.
  5. Compare new task set metrics against stable task set metrics.
  6. Wait at least one meaningful client session or transaction window.
  7. Complete the rollout only if metrics stay inside thresholds.
  8. After completion, keep watching for delayed failures.
  9. Record the deployment ID, task definition revision, alarm states, and rollback decision.
  10. Adjust bake time and thresholds based on what you learned.
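
The second check is easy to automate. Here is a pre-flight sketch with boto3, using the alarm names from the earlier service sketch: it refuses to proceed if any rollback alarm is firing, lacks data, or does not exist.

import boto3

cloudwatch = boto3.client("cloudwatch")

def alarms_ready(alarm_names: list) -> bool:
    """Refuse to deploy unless every rollback alarm exists and is in OK state."""
    resp = cloudwatch.describe_alarms(AlarmNames=alarm_names)
    found = {a["AlarmName"]: a["StateValue"] for a in resp["MetricAlarms"]}
    ok = True
    for name in alarm_names:
        state = found.get(name, "MISSING")  # OK, ALARM, INSUFFICIENT_DATA, or absent
        if state != "OK":
            print(f"{name}: {state} -- do not start the rollout")
            ok = False
    return ok

alarms_ready([
    "payments-canary-p99-latency",  # names match the earlier service sketch
    "payments-canary-error-rate",
    "payments-canary-business-failures",
])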

For linear deployment, repeat the same review at each increment. Do not approve the next shift just because time elapsed. Approve it because the metrics support it.

When Not To Use This

Canary and linear deployment are not always the right answer.

Do not use them as a substitute for compatibility. If the new version cannot coexist with the old version, traffic shifting may create a mixed-world failure. Fix compatibility first.

Do not use them without metrics. A canary that cannot detect failure is just a partial outage with better branding.

Do not use them for every tiny change if the operational overhead slows the team without reducing real risk. Progressive delivery is most valuable when failure impact is high, detection is possible, and rollback is meaningful.

Use them when:

  • The service handles money, messaging, sessions, or real-time user traffic.
  • NLB is required because of protocol, latency, static IP, or long-lived connections.
  • You have CloudWatch alarms tied to user-impacting metrics.
  • Old and new versions can coexist.
  • Rollback stops the failure quickly enough to matter.

Stick with rolling updates when:

  • The service is low risk and stateless.
  • Health checks are enough to catch most failures.
  • Duplicate capacity is not available.
  • Your team has not built the observability needed for progressive delivery.

Final Thoughts

NLB-backed services are often the ones where deployment mistakes hurt quickly and visibly. ECS canary and linear strategies give those teams a native traffic-shift mechanism. The value comes from pairing it with real alarms, version-aware telemetry, connection behavior tests, and rollback plans that are honest about data side effects.
