ECS Canary and Linear Deployments with Network Load Balancers

On February 4, 2026, Amazon ECS added native support for linear and canary deployment strategies for services using Network Load Balancers. That is a small announcement with a large operational consequence. Workloads that need TCP, UDP, low latency, long-lived connections, or static IP addresses can now use managed incremental traffic shifting without leaving the ECS deployment model.

Before this launch, many NLB-backed ECS services were stuck with a blunt choice. Use rolling updates and accept limited traffic-shift control, or build a custom blue/green pattern with extra target groups, listeners, automation, and rollback logic. Application Load Balancer users had more obvious progressive-delivery paths. NLB users often had to do more work.

[Figure: ECS NLB canary and linear deployment timeline]

The new support does not make every ECS deployment safe by default. It gives you a better release lever. You still need health checks, CloudWatch alarms, service metrics, connection-draining expectations, rollback criteria, and a deployment shape that matches your protocol.

What AWS Added

AWS says ECS now supports linear and canary deployment strategies for ECS services using Network Load Balancers. The announcement calls out applications that commonly use NLB: TCP and UDP services, low-latency systems, long-lived connections, and workloads that require static IP addresses. It also says deployments can integrate with Amazon CloudWatch alarms to automatically stop or roll back if issues are detected.

The ECS documentation describes four managed strategy families: rolling, blue/green, linear, and canary.

Strategy | Traffic behavior | Good fit | Main tradeoff
Rolling | Replace tasks gradually | Normal stateless services, cost-sensitive services | Less precise traffic control
Blue/green | Create a new environment, then switch traffic | Fast rollback and pre-traffic validation | More duplicate capacity and target group complexity
Linear | Shift equal traffic increments over time | Gradual validation and performance monitoring | Slower rollout
Canary | Shift a small percentage first, then the rest after a bake period | Feature validation and blast-radius control | Requires strong alarms during bake time

The difference is not just vocabulary. Linear and canary deployments let you observe the new task set under real production traffic before completing the rollout. That is especially important for NLB workloads, where problems can hide behind long-lived connections, binary protocols, or clients that retry aggressively.

BitsLovers has covered ECS operational patterns in the Amazon ECS managed daemons guide and the ECS Express Mode guide. This post focuses on progressive delivery for the NLB services that usually carry the sharpest operational edges.

Why NLB Workloads Needed This

Network Load Balancers are used when L7 routing is not the point. You choose NLB for L4 behavior: TCP, UDP, TLS pass-through, static IP addresses, very high performance, or long-lived connections. That often means the application protocol owns behavior that an ALB would otherwise expose through HTTP metrics, target response codes, and path-based routing.

That makes deployment safety harder.

NLB workload type | Why rolling updates can be risky | Canary or linear benefit
Financial transaction service | Small regression can affect high-value traffic | Limit initial exposure and watch business metrics
Real-time messaging | Long-lived connections hide bad behavior | Observe connection churn before full rollout
Online gaming backend | Latency and packet behavior matter more than HTTP status | Shift gradually and watch protocol metrics
gRPC over TCP/TLS | Client channel behavior can mask target changes | Canary new tasks with connection-level telemetry
Static IP partner integration | Partner clients may retry slowly or cache paths | Avoid all-at-once exposure

The AWS announcement explicitly mentions online gaming backends, financial transaction systems, and real-time messaging services. Those are exactly the systems where “the service is healthy” can be an incomplete signal. A target can pass an NLB health check while still breaking a subset of protocol behavior.

This is why progressive delivery needs application metrics. NLB health checks tell you whether targets are reachable. They do not prove the new version handles a payment reversal, game session rejoin, WebSocket reconnect, or binary message negotiation correctly.

Canary Versus Linear

Use canary when you want to expose a small slice of traffic, wait, and then finish. Use linear when you want to increase exposure in equal increments and observe each step.

Decision factor | Canary | Linear
Best for | Unknown feature risk | Performance or capacity validation
Traffic shape | Small first step, then large final step | Equal increments over time
Operator attention | Intense during bake period | Repeated at each increment
Failure visibility | Early if alarms are strong | Easier to spot gradual degradation
Rollout time | Usually shorter | Usually longer
Capacity pressure | New and old task sets overlap during bake | Overlap lasts through gradual shift

My default for high-risk behavior changes is canary. Put 5% or 10% of traffic on the new revision, wait long enough to see real behavior, then complete. My default for performance-sensitive changes is linear. Move 10% or 20% at a time and watch latency, connection count, CPU, memory, queue depth, and business metrics at each step.

The wrong choice is using progressive delivery without enough bake time. A canary that waits 30 seconds for a service with 15-minute client sessions is theater. A linear rollout with no alarms is a slower rolling update, not a safety mechanism.

The Metrics That Actually Matter

For NLB deployments, split signals into four groups: load balancer health, ECS service health, application protocol health, and business health.

Signal group | Metrics or checks | Why it matters
NLB | Healthy host count, unhealthy host count, target reset count, active flow count | Confirms target reachability and connection behavior
ECS | Running task count, deployment state, CPU, memory, task restarts | Confirms scheduler and capacity health
Application | p95/p99 latency, protocol error rate, retry rate, connection churn | Confirms the new version handles real protocol traffic
Business | Payment failures, match disconnects, message delivery failures, job completion rate | Confirms users are not harmed

CloudWatch alarms should not be decorative. If the deployment can roll back automatically, alarms define the rollback contract.

Good alarms are:

  • Fast enough to catch a bad canary before the final shift.
  • Specific enough to avoid rolling back for unrelated background noise.
  • Sensitive to the new version’s behavior, not only global service health.
  • Tested with synthetic failures before you rely on them.

This is where the OpenTelemetry and CloudWatch observability guide becomes relevant. If you cannot break down metrics by task set, version, deployment ID, or target group, a canary can look fine because the stable version still carries most traffic.

Tag or label every metric with version information where the stack allows it. At minimum, log deployment ID and task definition revision in structured logs.
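
As a minimal sketch of that minimum, assuming Python and the ECS task metadata endpoint v4 (ECS injects the ECS_CONTAINER_METADATA_URI_V4 environment variable into containers), a JSON log formatter can stamp every line with the task definition family and revision. A deployment ID is not part of task metadata, so have the pipeline inject it as an environment variable if you want it in logs.

import json
import logging
import os
import urllib.request

def fetch_task_metadata() -> dict:
    """Read task metadata from the ECS task metadata endpoint (v4)."""
    base = os.environ.get("ECS_CONTAINER_METADATA_URI_V4")
    if not base:
        return {}  # not running on ECS (local dev, unit tests)
    with urllib.request.urlopen(f"{base}/task", timeout=2) as resp:
        return json.load(resp)

class VersionedJsonFormatter(logging.Formatter):
    """Emit JSON logs stamped with task definition family and revision."""
    def __init__(self, metadata: dict):
        super().__init__()
        self.family = metadata.get("Family", "unknown")
        self.revision = metadata.get("Revision", "unknown")

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "taskDefinitionFamily": self.family,
            "taskDefinitionRevision": self.revision,
            # assumption: your deploy pipeline sets this variable
            "deploymentId": os.environ.get("DEPLOYMENT_ID", "unknown"),
        })

handler = logging.StreamHandler()
handler.setFormatter(VersionedJsonFormatter(fetch_task_metadata()))
logging.basicConfig(level=logging.INFO, handlers=[handler])
logging.info("connection accepted")  # now carries version fields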

Designing The Alarm Contract

The alarm contract is the most important part of an automated canary. It says what failure means.

Do not start with “rollback on any alarm.” Start with the service promise. A payments service may care about authorization failure rate and p99 latency. A messaging service may care about delivery acknowledgement delay and reconnect rate. A game backend may care about session join failures and packet processing latency. The NLB only sees targets and flows. Your users experience protocol behavior.

I like two alarm tiers:

Alarm tier | Example | Deployment action
Hard rollback | New-version error rate above 2% for 3 datapoints | Stop and roll back
Human review | p99 latency 25% above baseline for 10 minutes | Pause or require approval

Hard rollback alarms must be precise. If they fire constantly for unrelated noise, engineers will disable them. Human-review alarms can be more exploratory. They should slow the rollout when the signal is suspicious but not conclusive.

Missing data needs an explicit setting. For a small canary, some metrics may have low traffic. Treating missing data as breaching can roll back a healthy deployment. Treating missing data as not breaching can hide a canary that receives no traffic. Pick intentionally and test it with a dry run.
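
Here is roughly what that decision looks like with boto3. The namespace, metric name, and dimension are assumptions standing in for whatever version-scoped metric your tasks emit; TreatMissingData is the real CloudWatch setting being chosen.

import boto3

cloudwatch = boto3.client("cloudwatch")

# "breaching" is the strict choice: no data from the canary counts as
# failure, so a canary that receives zero traffic cannot silently pass.
# Swap in "notBreaching" if sparse canary traffic would cause false alarms.
cloudwatch.put_metric_alarm(
    AlarmName="payments-canary-error-rate",
    Namespace="Payments/Canary",              # assumption: custom namespace
    MetricName="ErrorRate",                   # assumption: emitted per version
    Dimensions=[{"Name": "TaskDefinitionRevision", "Value": "42"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    DatapointsToAlarm=3,
    Threshold=2.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",
)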

Target Group And Health Check Design

NLB health checks are simple compared with application health. That simplicity is useful, but it can mislead.

For TCP services, a successful health check may only prove that the port accepts connections. For TLS services, it may prove the listener is reachable. It may not prove the service can process a real message. When the application protocol allows it, expose a dedicated health endpoint or lightweight protocol command that checks dependencies without mutating state.

Health check recommendations:

Setting | Recommendation | Reason
Health path or command | Test a real readiness condition | Avoid routing to initialized-but-broken tasks
Interval and threshold | Balance speed with noise | Fast rollback needs fast detection, but noisy health hurts availability
Deregistration delay | Match connection behavior | Long-lived connections need graceful drain time
Grace period | Give new tasks time to warm up | Cold starts and cache warm-up can trigger false failures
Per-AZ target health | Watch each zone separately | One unhealthy zone can hide under global averages

Keep liveness and readiness separate inside the container if you can. Liveness answers “should the orchestrator restart me?” Readiness answers “should the load balancer send me traffic?” During deployment, readiness matters most.
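
A minimal sketch of that split, assuming the container can expose a small HTTP health port (NLB target groups can run HTTP health checks even when the listener is TCP): /live says the process is up, /ready says traffic should arrive, and deployment events flip readiness rather than liveness.

import http.server
import threading

ready = threading.Event()  # set once caches are warm and dependencies answer

class HealthHandler(http.server.BaseHTTPRequestHandler):
    """Liveness answers "restart me?"; readiness answers "send traffic?"."""
    def do_GET(self):
        if self.path == "/live":
            self.send_response(200)  # process is up; do not restart
        elif self.path == "/ready":
            # Point the target group health check here.
            self.send_response(200 if ready.is_set() else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep health-check noise out of application logs

def start_health_server(port: int = 8081) -> http.server.ThreadingHTTPServer:
    server = http.server.ThreadingHTTPServer(("", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

start_health_server()
ready.set()  # only after real warm-up and dependency checks succeed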

Connection Draining And Long-Lived Clients

NLB workloads often have long-lived connections. That can make traffic shifting less clean than the diagram suggests.

When you move new connection traffic to a new task set, existing client connections may remain on old tasks until they disconnect or the target deregistration process completes. That is usually good. It prevents abrupt drops. It also means the new version may not receive the expected share of traffic immediately if your clients keep connections open for a long time.

Design reviews should answer these questions:

Question | Why it matters
How long do clients keep connections open? | Bake time must exceed a meaningful portion of session behavior
Do clients reconnect automatically? | A deployment can trigger reconnect storms
What happens to in-flight requests during deregistration? | Some protocols need graceful shutdown logic
Can old and new versions coexist? | Long-lived connections create mixed-version windows
Are protocol changes backward compatible? | Canary is not a substitute for compatibility

For connection-heavy services, add a pre-stop behavior inside the container. Stop accepting new work, keep health status honest, finish in-flight work where possible, and exit within the deregistration and ECS stop timeout expectations. Test it. Do not assume SIGTERM handling is correct.
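
Here is a sketch of that shape for an imagined echo-style TCP service. handle_client stands in for real protocol work, and the drain grace period is an assumption you should derive from your deregistration delay and ECS stopTimeout.

import signal
import socket
import sys
import threading

draining = threading.Event()

def handle_sigterm(signum, frame):
    # ECS sends SIGTERM at task stop, waits stopTimeout, then SIGKILLs.
    # The NLB deregistration delay runs on its own clock; plan for both.
    draining.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def handle_client(conn: socket.socket) -> None:
    """Placeholder for real protocol handling (here: echo)."""
    with conn:
        while data := conn.recv(1024):
            conn.sendall(data)

def serve(listener: socket.socket, drain_grace_seconds: float = 90.0) -> None:
    """Accept loop that stops taking new work once draining starts."""
    listener.settimeout(1.0)  # wake up regularly to notice draining
    workers = []
    while not draining.is_set():
        try:
            conn, _ = listener.accept()
        except socket.timeout:
            continue
        t = threading.Thread(target=handle_client, args=(conn,))
        t.start()
        workers.append(t)
    listener.close()  # stop accepting; in-flight work continues
    for t in workers:  # finish inside the stopTimeout budget, then exit
        t.join(timeout=drain_grace_seconds)
    sys.exit(0)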

A Minimal ECS Service Shape

The exact API fields can vary by deployment controller and current tooling support, so treat this as a conceptual checklist rather than a copy-paste template.

An ECS service using progressive delivery with NLB needs:

service:
  name: payments-tcp-api
  launchType: FARGATE
  loadBalancer:
    type: network
    listener: payments-nlb-listener
    targetGroups:
      production: payments-blue
      test: payments-green
  deploymentConfiguration:
    strategy: CANARY
    canary:
      percentage: 10
      bakeTimeMinutes: 15
    alarms:
      enable: true
      rollback: true
      names:
        - payments-canary-p99-latency
        - payments-canary-error-rate
        - payments-canary-business-failures

In Terraform, CDK, CloudFormation, or CLI, use the ECS service deployment fields supported by your current provider and AWS SDK. The important design is not the syntax. It is the coupling between traffic strategy and alarm strategy. A canary without meaningful alarms is a manual observation exercise.
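
One piece you can build today regardless of strategy is deployment observation. The DescribeServices response lists each active deployment with a rollout state and task counts; the cluster and service names below are illustrative, and strategy-specific fields may differ as tooling catches up.

import boto3

ecs = boto3.client("ecs")

def deployment_status(cluster: str, service: str) -> None:
    """Print each active deployment's rollout state and task counts."""
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    for d in svc["deployments"]:
        print(
            d["id"],
            d["status"],            # PRIMARY = newest, ACTIVE = still draining
            d.get("rolloutState"),  # IN_PROGRESS, COMPLETED, or FAILED
            f"{d['runningCount']}/{d['desiredCount']} tasks",
            d["taskDefinition"],
        )

deployment_status("prod", "payments-tcp-api")  # illustrative names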

If you deploy MCP servers or other stateful-ish container workloads on ECS, the MCP on ECS deployment guide has useful context around service boundaries and operational expectations. Progressive delivery helps, but it does not replace application-level compatibility.

Terraform And CDK Review Points

Because this feature is new, provider support may lag the ECS API in some environments. Before standardizing a module, check your Terraform AWS provider version, CloudFormation resource support, CDK version, and AWS SDK model. The announcement is service capability. Your IaC toolchain still has to expose it.

For an IaC module, I would make these inputs explicit:

Module input | Example | Why expose it
deployment_strategy | ROLLING, CANARY, LINEAR | Strategy should be a conscious choice
canary_percent | 10 | Small first exposure controls blast radius
bake_time_minutes | 15 | Must match client/session behavior
linear_percent | 10 | Controls rollout pace
alarm_names | ["payments-p99", "payments-errors"] | Rollback depends on real signals
rollback_enabled | true | Some teams require manual approval
min_healthy_percent | 100 | Capacity and availability tradeoff
max_percent | 200 | Extra task capacity during deployment

Do not hide these under a generic “deployment config” object with weak defaults. Defaults become production behavior. For high-risk services, require the caller to choose strategy, bake time, and alarms explicitly.

Also include outputs for the deployment ID, target group ARNs, service ARN, and current task definition revision. Your dashboards and runbooks will need them.

Pre-Production Load Test

Before the first production canary, run a load test that mimics connection behavior, not just request rate.

For HTTP services, request-per-second load tests are often enough to find obvious regressions. For NLB services, the test must understand connection lifetime, reconnect behavior, message size, keepalive settings, TLS negotiation, and client retry policy. A service that handles 10,000 short connections may still fail with 2,000 long-lived clients.

Test matrix:

Scenario | What it reveals
Fresh connections during canary start | Whether new tasks accept traffic cleanly
Long-lived sessions before traffic shift | Whether old tasks drain safely
Client reconnect storm | Whether rollout causes cascading retries
Mixed old/new protocol versions | Whether compatibility is real
Downstream dependency slowdown | Whether alarms catch user impact
Forced rollback during traffic | Whether rollback preserves active sessions

The rollback test is the one teams skip. Do not skip it. A rollback path that has never been exercised is a story, not a control.
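
To make the long-lived-session rows concrete, here is a client sketch in Python against an invented PING/response protocol. The host, port, session length, and client count are placeholders to replace with your real protocol driver.

import socket
import threading
import time

def long_lived_client(host: str, port: int, session_seconds: int, results: list) -> None:
    """Hold one connection open and send periodic messages, like a real client."""
    try:
        with socket.create_connection((host, port), timeout=5) as conn:
            deadline = time.monotonic() + session_seconds
            while time.monotonic() < deadline:
                conn.sendall(b"PING\n")  # assumption: trivial request/response protocol
                if not conn.recv(64):
                    raise ConnectionError("server closed the connection mid-session")
                time.sleep(1.0)
        results.append("ok")
    except OSError as exc:
        results.append(f"failed: {exc}")

results: list = []
threads = [
    threading.Thread(target=long_lived_client,
                     args=("nlb-dns-name.example", 9000, 900, results))
    for _ in range(200)  # 200 clients holding 15-minute sessions through a rollout
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{results.count('ok')}/{len(results)} sessions survived the deployment")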

Rollback Is Not A Time Machine

Automatic rollback is useful. It does not erase every failure.

If a bad version mutates data, sends messages, or changes protocol state, rolling traffic back to the previous task definition may stop new damage but not repair the old damage. This is true for any deployment strategy, but canaries make it more visible because only part of traffic experiences the new behavior first.

Before enabling canary or linear deployments for a stateful workload, define the rollback and repair plan.

Failure type | Rollback enough? | Additional work
High latency from code path | Usually yes | Verify old tasks recover from saturation
Container crash loop | Usually yes | Inspect logs and image differences
Bad response format | Maybe | Confirm clients recover and caches expire
Duplicate transaction | No | Reconcile and compensate
Bad database migration | No | Restore, forward-fix, or roll the data migration back
Protocol incompatibility | Maybe | Ensure old and new clients can coexist

This is why progressive delivery should pair with backward-compatible database and protocol changes. Use expand-and-contract migrations. Deploy additive changes first. Keep old fields long enough. Avoid introducing a server version that only works with a client version not yet deployed.
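
A compressed sketch of expand-and-contract, with invented table and column names; each phase ships as its own deploy so old and new task sets always see a schema they both understand.

# Phase 1 (before any code change): additive, nullable, safe for old code.
EXPAND = "ALTER TABLE payments ADD COLUMN settlement_ref TEXT"

# Phase 2 (canary the new version): new code writes both columns,
# old code keeps working because nothing it reads has changed.

# Phase 3 (after full rollout): backfill so reads can switch over.
BACKFILL = (
    "UPDATE payments SET settlement_ref = legacy_ref "
    "WHERE settlement_ref IS NULL"
)

# Phase 4 (only after the rollback window closes): remove the old column.
CONTRACT = "ALTER TABLE payments DROP COLUMN legacy_ref"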

Capacity And Cost

Blue/green, canary, and linear deployments can require extra capacity because old and new task sets overlap. Rolling updates are often cheaper because they replace tasks gradually inside the same desired capacity envelope, depending on minimum and maximum healthy percentages. Progressive traffic shifting gives you better validation, but it is not free.
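
The overlap envelope is simple arithmetic on the service's minimum healthy percent and maximum percent (ECS rounds the lower bound up and the upper bound down), and it is worth computing before assuming you have quota headroom.

import math

def deployment_task_bounds(desired: int, min_healthy_pct: int, max_pct: int) -> tuple:
    """Task-count envelope ECS maintains during a deployment."""
    floor = math.ceil(desired * min_healthy_pct / 100)  # healthy tasks never below
    ceiling = math.floor(desired * max_pct / 100)       # running tasks never above
    return floor, ceiling

# desired=10, minimumHealthyPercent=100, maximumPercent=200:
# at least 10 healthy tasks, and up to 20 running while task sets overlap.
print(deployment_task_bounds(10, 100, 200))  # (10, 20)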

Capacity questions:

Question | Why it matters
Can the cluster or Fargate quota run both task sets? | Deployment can stall if capacity is unavailable
Do target groups have enough healthy targets per AZ? | Traffic shift is only safe when each AZ is healthy
Are downstream dependencies sized for duplicate warm-up? | New tasks may create connection pools and caches
Does the canary create cold-cache latency? | Early p99 spikes can trigger noisy rollback
Is autoscaling tied to service-level or task-set metrics? | Scaling can mask or amplify canary behavior

For ECS Express Mode and simpler services, progressive delivery may feel heavier than needed. The ECS Express Mode guide is useful for that tradeoff. If the service is low risk, stateless, and easy to roll back, a rolling update may be enough. Save canary and linear deployments for services where the blast radius matters.

This is also relevant for teams leaving App Runner. The App Runner availability change guide explains why many container teams are being pushed toward ECS-native options. If that migration lands on an NLB-backed service rather than a simple public HTTP service, plan progressive delivery from the beginning instead of adding it after the first painful deploy.

Version Compatibility Rules

Progressive delivery creates a mixed-version window. That is the point. Some traffic goes to old tasks and some goes to new tasks. Any shared dependency must tolerate that window.

Use these rules:

  1. Database changes are additive first. Add columns, indexes, or tables before code requires them.
  2. Message schemas are backward compatible during rollout. New producers should not break old consumers.
  3. Protocol changes are negotiated or optional during the mixed window.
  4. Feature flags can disable risky behavior without redeploying (see the flag sketch after this list).
  5. Cache keys include version only when the old and new formats cannot share data.
  6. Background workers and front-end services are rolled in an order that preserves compatibility.

This may sound like normal deployment hygiene, but canary makes the requirement explicit. If old and new versions cannot coexist, a canary will surface that weakness by design.
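
For the feature-flag rule, one common pattern is reading the flag from SSM Parameter Store with a short cache, so behavior can flip without a redeploy. The parameter name is an assumption; the get_parameter call is the real API.

import time
import boto3

ssm = boto3.client("ssm")
_cache = {"value": False, "expires": 0.0}

def risky_behavior_enabled(ttl_seconds: float = 30.0) -> bool:
    """Check a flag in SSM, cached so the lookup stays off the hot path."""
    now = time.monotonic()
    if now >= _cache["expires"]:
        param = ssm.get_parameter(Name="/payments/flags/new-settlement-path")
        _cache["value"] = param["Parameter"]["Value"] == "true"
        _cache["expires"] = now + ttl_seconds
    return _cache["value"]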

Not every release deserves the same rollout. Use the deployment strategy as a risk control, not as a ritual.

Change type | Recommended strategy | Extra check
Text, config, or small bug fix | Rolling or short canary | Normal service health
New protocol behavior | Canary | Client compatibility and reconnect behavior
Latency-sensitive optimization | Linear | p95/p99 latency by version
Dependency upgrade | Canary | Error rate and downstream timeout alarms
Database-read change | Canary or linear | Query latency and connection pool usage
Database-write change | Canary plus feature flag | Rollback and data repair plan
Security patch | Fast canary or rolling | Confirm vulnerability window and rollback risk

The table is deliberately opinionated. A release that changes write behavior deserves slower exposure than a release that changes a log line. A security patch may need speed, but speed is not the same as recklessness. You can run a short canary with strong alarms and still reduce blast radius.

Operational Ownership

Progressive delivery crosses team boundaries. The application team owns code behavior. The platform team owns ECS service patterns, target groups, and deployment configuration. The SRE or operations team owns alarms and incident response. If nobody owns the whole deployment contract, the rollout becomes a collection of good intentions.

Write down:

  1. Who can approve a canary completion.
  2. Who can override an alarm.
  3. Who watches the first production rollout.
  4. Who owns rollback after hours.
  5. Who repairs data if rollback is not enough.
  6. Who updates the module defaults after lessons learned.

Those decisions are not bureaucracy. They are what lets a team trust automation when the deployment is half complete and the alarm state turns red.

A Runbook For The First Deployment

Do the first canary during a low-risk window with someone watching the metrics. Automation should exist, but humans should learn the shape of the signals.

  1. Confirm both target groups are healthy before shifting.
  2. Confirm alarms are in OK state and not missing data (automatable; see the sketch after this list).
  3. Deploy with a small canary percentage, such as 5% or 10%.
  4. Watch p95 and p99 latency, error rate, connection churn, and business metrics.
  5. Compare new task set metrics against stable task set metrics.
  6. Wait at least one meaningful client session or transaction window.
  7. Complete the rollout only if metrics stay inside thresholds.
  8. After completion, keep watching for delayed failures.
  9. Record the deployment ID, task definition revision, alarm states, and rollback decision.
  10. Adjust bake time and thresholds based on what you learned.
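
The second check is easy to automate. Here is a pre-flight sketch with boto3, using the alarm names from the earlier service sketch: it refuses to proceed if any rollback alarm is firing, lacks data, or does not exist.

import boto3

cloudwatch = boto3.client("cloudwatch")

def alarms_ready(alarm_names: list) -> bool:
    """Refuse to deploy unless every rollback alarm exists and is in OK state."""
    resp = cloudwatch.describe_alarms(AlarmNames=alarm_names)
    found = {a["AlarmName"]: a["StateValue"] for a in resp["MetricAlarms"]}
    ok = True
    for name in alarm_names:
        state = found.get(name, "MISSING")  # OK, ALARM, INSUFFICIENT_DATA, or absent
        if state != "OK":
            print(f"{name}: {state} -- do not start the rollout")
            ok = False
    return ok

alarms_ready([
    "payments-canary-p99-latency",  # names match the earlier service sketch
    "payments-canary-error-rate",
    "payments-canary-business-failures",
])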

For linear deployment, repeat the same review at each increment. Do not approve the next shift just because time elapsed. Approve it because the metrics support it.

When Not To Use This

Canary and linear deployment are not always the right answer.

Do not use them as a substitute for compatibility. If the new version cannot coexist with the old version, traffic shifting may create a mixed-world failure. Fix compatibility first.

Do not use them without metrics. A canary that cannot detect failure is just a partial outage with better branding.

Do not use them for every tiny change if the operational overhead slows the team without reducing real risk. Progressive delivery is most valuable when failure impact is high, detection is possible, and rollback is meaningful.

Use them when:

  • The service handles money, messaging, sessions, or real-time user traffic.
  • NLB is required because of protocol, latency, static IP, or long-lived connections.
  • You have CloudWatch alarms tied to user-impacting metrics.
  • Old and new versions can coexist.
  • Rollback stops the failure quickly enough to matter.

Stick with rolling updates when:

  • The service is low risk and stateless.
  • Health checks are enough to catch most failures.
  • Duplicate capacity is not available.
  • Your team has not built the observability needed for progressive delivery.

Final Thoughts

NLB-backed services are often the ones where deployment mistakes hurt quickly and visibly. ECS canary and linear strategies give those teams a native traffic-shift mechanism. The value comes from pairing it with real alarms, version-aware telemetry, connection behavior tests, and rollback plans that are honest about data side effects.
