AWS Lambda Durable Functions vs Step Functions: The Serverless Workflow Decision Framework
AWS Lambda Durable Functions change a decision that used to be almost automatic. For years, if a serverless workflow needed to wait, retry, branch, call several services, or survive failure, the usual AWS answer was Step Functions. Now Lambda can run durable executions for up to one year, checkpoint progress, replay completed work, pause without compute charges during waits, and expose workflow primitives directly inside normal application code.
That does not make Step Functions obsolete. It makes the choice more precise.
The old question was, “Do I need orchestration?” The new question is, “Where should orchestration live?” If the workflow is part of application logic and the team wants to write it in Python, TypeScript, or Java, durable functions are attractive. If the workflow coordinates many AWS services, needs visual state inspection, uses native integrations, or belongs to a platform team rather than one application team, Step Functions is still the cleaner boundary.

This distinction matters because both services can now run long-lived processes. Lambda Durable Functions can execute for up to one year. Step Functions Standard workflows also have a one-year maximum execution time. Both can preserve progress. Both can retry. Both can wait. The difference is not duration. The difference is programming model, operational surface, integration depth, and how much workflow state you want outside your code.
The Short Version
Use Lambda Durable Functions when the workflow is code-first, tightly coupled to one Lambda application, and easier to express as sequential logic with durable steps, waits, callbacks, maps, and parallel operations.
Use Step Functions when the workflow is architecture-first, crosses service boundaries, needs a visual state machine, uses many AWS service integrations, or must be owned and audited as a standalone orchestration layer.
| Requirement | Better default | Why |
|---|---|---|
| Business process is mostly one application | Lambda Durable Functions | The orchestration stays close to application code |
| Many AWS services are coordinated directly | Step Functions | Native integrations and visual state machines reduce glue code |
| Developers want Python/TypeScript/Java control flow | Lambda Durable Functions | SDK primitives fit normal programming constructs |
| Platform team owns the workflow | Step Functions | Clear standalone resource, IAM boundary, and execution history |
| Human approval waits inside app logic | Lambda Durable Functions | Callback and wait primitives are direct |
| Long cross-team process with audit needs | Step Functions | State history, graph, and operations model are easier to review |
| Huge fan-out data processing | Step Functions Distributed Map | Built for orchestration-level fan-out |
| Deterministic replay is hard to guarantee | Step Functions | State machine definition avoids replay hazards in arbitrary code |
If you remember one thing, remember this: durable functions are application durability; Step Functions are workflow orchestration.
What Lambda Durable Functions Actually Do
A durable function is still a Lambda function. The difference is that the handler is wrapped by a durable execution SDK. Instead of running once and forgetting everything after timeout or failure, the function creates checkpoints as it performs durable operations.
AWS describes the model as checkpoint and replay. When your function resumes after a wait, interruption, or retry, the code starts again from the beginning. Completed durable operations are not re-executed. The SDK returns stored results from the checkpoint log and continues from the next unfinished operation.
That sounds strange the first time you read it. It is also the core of the feature.
Consider a payment workflow:
```python
# durable_execution, DurableContext, and Duration come from the Lambda
# durable execution SDK; imports are omitted here.
@durable_execution
def handler(event: dict, context: DurableContext):
    order = context.step(
        lambda _: validate_order(event["orderId"]),
        name="validate-order",
    )
    payment = context.step(
        lambda _: authorize_payment(order),
        name="authorize-payment",
    )
    context.wait(Duration.from_seconds(300))  # suspend without compute charges
    context.step(
        lambda _: capture_payment(payment["authorizationId"]),
        name="capture-payment",
    )
```
If the function pauses at the wait, Lambda does not keep compute running for five minutes. When it resumes, the handler starts from the top. The SDK sees that validate-order and authorize-payment already completed, returns their stored results, skips those calls, reaches the wait checkpoint, and then continues to capture-payment.
That model creates a new responsibility: deterministic code. During replay, the sequence of durable operations must match the checkpoint log. Random values, wall-clock calls, external API calls, and non-deterministic branching belong inside durable steps, not in the plain replay path.
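As a concrete sketch of that rule, the pattern below moves two classic non-deterministic values, a fresh UUID and the current wall-clock time, inside durable steps. It reuses the decorator and context shape from the example above; exact SDK imports are omitted because they depend on your runtime.

```python
import uuid
from datetime import datetime, timezone

@durable_execution
def handler(event: dict, context: DurableContext):
    # Unsafe in the plain replay path: uuid.uuid4() would return a new value
    # on every replay, so the code would diverge from the checkpoint log.
    # request_id = str(uuid.uuid4())

    # Safe: generated inside a durable step, so replay returns the stored
    # result instead of re-executing the lambda.
    request_id = context.step(
        lambda _: str(uuid.uuid4()),
        name="generate-request-id",
    )
    started_at = context.step(
        lambda _: datetime.now(timezone.utc).isoformat(),
        name="capture-start-time",
    )
    return {"requestId": request_id, "startedAt": started_at}
```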
The Primitives That Matter
The durable execution SDK gives you a small set of primitives. These are enough to build serious workflows, but each one has operational consequences.
| Primitive | What it does | Common use |
|---|---|---|
| `step` | Runs business logic with checkpointing and retry | Payment call, inventory update, validation |
| `wait` | Pauses without compute charges | SLA delay, polling interval, backoff |
| `createCallback` | Creates a callback ID and waits for external completion | Human approval, webhook response |
| `waitForCallback` | Combines callback creation and submission | External approval systems |
| `parallel` | Runs multiple durable branches concurrently | Independent service calls |
| `map` | Processes array items with durable tracking | Batch item processing |
| child context | Groups operations with isolated checkpoint logs | Sub-workflows and retry scopes |
| `invoke` | Invokes another Lambda and waits for result | Function composition |
| `waitForCondition` | Polls until a condition succeeds | External job status checks |
The strongest part is readability. A developer can express the workflow as normal code. That is useful when the orchestration is part of the business logic. A checkout flow, onboarding process, AI agent loop, or document approval chain often reads better as code than as a JSON state machine.
The tradeoff is that replay is now part of your programming model. With Step Functions, state transitions are explicit in the state machine. With durable functions, state transitions are implied by durable operation calls in your code. That gives developers more flexibility and more rope.
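To make the table concrete, here is a minimal fan-out sketch. The `parallel` and `map` method names and signatures below are assumptions based on the primitive table, not a confirmed Python API; check the SDK reference before copying these shapes.

```python
@durable_execution
def handler(event: dict, context: DurableContext):
    # Run two independent durable branches concurrently (assumed API shape).
    inventory, pricing = context.parallel([
        lambda ctx: ctx.step(lambda _: check_inventory(event["sku"]),
                             name="check-inventory"),
        lambda ctx: ctx.step(lambda _: fetch_pricing(event["sku"]),
                             name="fetch-pricing"),
    ])

    # Process each line item with durable per-item tracking (assumed API shape).
    reserved = context.map(
        event["lineItems"],
        lambda ctx, item: ctx.step(lambda _: reserve(item), name="reserve-item"),
    )
    return {"inventory": inventory, "pricing": pricing, "reserved": reserved}
```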
Limits You Need To Design Around
The numbers matter.
Lambda Durable Functions and Step Functions are both long-running, but their guardrails are different.
| Limit or quota | Lambda Durable Functions | Step Functions Standard | Why it matters |
|---|---|---|---|
| Maximum execution time | Up to 1 year | 1 year | Duration alone no longer decides the service |
| Running executions | 1,000,000 default for Lambda durable executions | 1,000,000 open executions per account per Region | Both can support large fleets, but quotas still need monitoring |
| Operation/history ceiling | 3,000 durable operations per durable execution | 25,000 events in one execution history | Long loops must be designed carefully |
| Persisted data | 100 MB durable execution storage written | 1 MB request size; execution history event limit applies | Payload discipline still matters |
| Direct checkpoint result | Results under 256 KB are checkpointed directly in certain contexts | State input/output quotas apply | Large payloads still need external storage |
| Timeout of one compute invocation | Lambda function timeout remains 15 minutes | State task timeout depends on state config | Durable waits do not mean one Lambda invocation runs forever |
| Definition model | Code plus SDK | Amazon States Language and service resource | Ownership and review differ |
The 3,000 durable-operation quota is the one I would highlight in design reviews. It is easy to accidentally build a workflow that polls every minute for days: at one wait-and-check cycle per minute, three days of polling alone produces 3 × 24 × 60 = 4,320 cycles, already past the quota. That may look simple in code, but it becomes many durable operations. Use longer polling intervals, external events, callbacks, or Step Functions when the workflow is naturally a large state graph.
The 100 MB durable execution storage-written quota also matters. Do not treat durable functions as a database. Persist business records in DynamoDB, Aurora, S3, or the correct system of record. The checkpoint log is for workflow recovery, not analytics or archival.
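A minimal sketch of that discipline, assuming a hypothetical artifacts bucket: write the large payload to S3 inside a step and let the checkpoint store only the object key.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-workflow-artifacts"  # hypothetical bucket name

def store_large_result(execution_id: str, name: str, payload: dict) -> str:
    """Persist a large payload in S3 and return only its key, so the
    durable checkpoint stays small."""
    key = f"executions/{execution_id}/{name}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(payload).encode())
    return key

# Inside the handler: checkpoint the S3 key, not the payload itself.
# report_key = context.step(
#     lambda _: store_large_result(execution_id, "report", big_report),
#     name="store-report",
# )
```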
Step Functions has its own sharp edges. The 25,000-event execution history limit can bite long loops in Standard workflows. Distributed Map, child executions, or splitting workflows can help, but you need to design for it. The service is visual and operationally mature, but it still has quotas.
Cost Model: Where The Bills Come From
Durable functions bill for Lambda execution plus durable operations. Step Functions Standard charges per state transition, while Express charges per request and duration. That pricing difference shapes architecture.
For a workflow that waits 24 hours and does three pieces of work, durable functions can be very clean: run a step, checkpoint, wait without compute charges, resume later, run the next step. Step Functions Standard can also wait for long periods without paying for compute, but you pay for state transitions rather than the wait duration. Both can be economical.
For a workflow with hundreds of tiny state transitions, Step Functions Standard state-transition cost becomes visible. For a workflow with many small durable operations, durable-operation charges and checkpoint data writes become visible. There is no universal cheaper option.
The right cost review asks:
| Cost driver | Durable Functions question | Step Functions question |
|---|---|---|
| Steps | How many durable operations are created per execution? | How many state transitions happen, including retries? |
| Waiting | Are waits suspending compute properly? | Is Standard or Express the right workflow type? |
| Payloads | How much data is written into checkpoints? | How large is state input/output and execution history? |
| Retries | Are retries bounded and idempotent? | Do retries multiply transitions unexpectedly? |
| Fan-out | Are maps creating too many checkpoints? | Should Distributed Map or batching be used? |
| Logging | Are full payloads being logged? | Are execution histories and CloudWatch logs retained appropriately? |
If you need a simple rule: model cost per execution before choosing. A workflow diagram without an execution-count and retry-count estimate is not enough.
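A back-of-envelope model can live in a few lines. The prices below are placeholders, not published AWS rates; plug in current Region pricing before trusting the output.

```python
# Placeholder prices in USD, NOT actual AWS pricing. Substitute current
# Region rates from the pricing pages before using this for a decision.
PRICE_PER_STATE_TRANSITION = 0.000025
PRICE_PER_DURABLE_OPERATION = 0.000020

def step_functions_cost(transitions: int, retries: int, executions: int) -> float:
    # Retries create extra state transitions, so count them explicitly.
    return (transitions + retries) * executions * PRICE_PER_STATE_TRANSITION

def durable_function_cost(operations: int, retries: int, executions: int) -> float:
    # Lambda compute and checkpoint storage charges would be added on top.
    return (operations + retries) * executions * PRICE_PER_DURABLE_OPERATION

# Example: 40 transitions plus 5 retries across 1M monthly executions.
print(step_functions_cost(40, 5, 1_000_000))
```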
Replay Is Powerful, But It Is Not Free
Replay is the durable-functions superpower and the main source of bugs.
During replay, your handler runs again. That means any non-durable side effect outside a step can happen again unless you prevent it. Printing a log line twice is annoying. Charging a card twice is unacceptable.
Bad replay code looks like this:
```python
@durable_execution
def handler(event, context):
    charge_card(event["payment"])   # unsafe during replay
    context.wait(Duration.from_minutes(5))
    send_receipt(event["email"])    # unsafe during replay
```
Better code wraps side effects in durable steps:
```python
@durable_execution
def handler(event, context):
    payment = context.step(
        lambda _: charge_card(event["payment"]),
        name="charge-card",
    )
    context.wait(Duration.from_minutes(5))
    context.step(
        lambda _: send_receipt(event["email"], payment["id"]),
        name="send-receipt",
    )
```
Also avoid changing operation names casually. Names become part of the checkpoint story. If you rename charge-card to authorize-payment during a running execution, the replay path may no longer match what the execution history expects. Treat durable operation names like state names in Step Functions: stable, reviewed, and migrated carefully.
This is where Step Functions can be safer for some teams. A state machine definition makes state names and transitions explicit. It is harder to accidentally put a side effect “between states” because the definition is the workflow. With durable functions, the code is the workflow. That is more ergonomic for developers and more demanding for reviewers.
Where Step Functions Still Wins
Step Functions remains the better tool when orchestration is a platform concern.
First, it has native service integrations. AWS documentation describes Step Functions as able to orchestrate more than 14,000 API actions across over 220 AWS services, a figure cited in the enhanced TestState local testing announcement. That breadth matters. If your workflow calls Lambda, DynamoDB, ECS, EventBridge, Glue, Bedrock, SNS, SQS, and human approval tasks, a state machine may remove a lot of glue code.
Second, it has a visual execution model. Operations teams can open an execution and see where it is stuck. That is hard to beat during incidents. Developers can do similar observability for durable functions, but they have to build discipline around logs, metrics, execution IDs, and tracing.
Third, Step Functions is a clean resource boundary. A platform team can own the state machine, version it, attach IAM policies, test it, and expose it to application teams. Durable functions are owned with the Lambda function code. That may be perfect for one product team and wrong for a shared business process.
Fourth, Step Functions has mature testing improvements. The enhanced TestState API now supports local testing of workflow behavior, mocked integrations, Map and Parallel states, context object simulation, retry and catch testing, and CI/CD integration. That closes one of the old complaints: state machines were hard to test before deployment. They are much easier now.
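For reference, a basic TestState call through boto3 looks like the sketch below. The role ARN is a placeholder, and the newer mocked-integration options have their own parameters, so consult the API reference for those.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Test a single state definition without deploying the state machine.
response = sfn.test_state(
    definition=json.dumps({
        "Type": "Pass",
        "Parameters": {"status": "validated"},
        "End": True,
    }),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsTestRole",  # placeholder
    input=json.dumps({"orderId": "o-123"}),
    inspectionLevel="DEBUG",
)
print(response["status"], response.get("output"))
```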
If you are already using Step Functions well, do not migrate just because durable functions exist. Revisit workflows that are awkward as state machines but natural as code.
Where Durable Functions Win
Durable functions shine when the workflow is part of application code.
Think of an AI agent that needs to call a model, wait for a human review, call a tool, retry an external vendor, and then write a final answer. The logic may be easier to express in TypeScript than in Amazon States Language. You can use normal language features, normal unit testing, normal package management, and normal application abstractions.
They also fit teams that already think in Lambda. If you have a codebase with validators, domain services, adapters, and tests, durable functions let you keep that shape. Step Functions would force you to split orchestration out into a separate state machine, which may or may not be cleaner.
Durable waits are also valuable. Waiting without compute charges makes human-in-the-loop workflows practical without keeping a function warm. That is useful for approvals, external job polling, vendor callbacks, and long-running remediation.
Finally, durable functions can reduce the number of moving pieces. A small workflow that used to require a state machine plus several Lambda functions may become one durable Lambda. Fewer resources can mean faster development and simpler deployment. But do not confuse fewer resources with lower risk. The risk moves into code review.
A Production Design Pattern
For durable functions, I like a three-layer structure:
- Handler and durable orchestration.
- Domain services with idempotent operations.
- Adapters for AWS services and external APIs.
That split is not cosmetic. It keeps replay-aware code small. The handler should decide durable order, call named durable operations, and pass stable inputs. Domain services should own business rules and idempotency keys. Adapters should own boto3 clients, HTTP clients, retries, timeouts, and serialization. When those concerns mix, replay bugs get hard to spot in review.
For Step Functions, I use a similar split with different artifacts:
| Layer | Step Functions artifact | Review owner |
|---|---|---|
| Orchestration | ASL definition, state names, retries, catches, timeouts | Platform and application engineers |
| Business work | Lambda functions, ECS tasks, service integrations | Application team |
| Contracts | Event schemas, input/output examples, failure envelopes | API or platform team |
| Operations | CloudWatch alarms, dashboards, execution retention, DLQs | SRE or operations team |
That table is why some teams should prefer Step Functions even when durable functions look more ergonomic. If the workflow is a cross-team contract, the state machine gives reviewers something concrete to inspect. If the workflow is private application behavior, the code-first durable model may reduce ceremony.
For durable functions, add one more review rule: every external side effect must have an idempotency key visible in code. A reviewer should be able to point at the payment call, email send, ticket update, or database write and explain why replay cannot duplicate harm. For Step Functions, require the same evidence in state input, task parameters, and downstream Lambda code. Different orchestration models do not remove distributed-systems rules. They only move where those rules are expressed.
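As a sketch of what that review rule looks like in code, assuming a hypothetical `payments_api` client:

```python
import hashlib

def charge_card(payment: dict, execution_id: str) -> dict:
    # Derive a stable key from the execution and business identifiers so a
    # replayed or retried call cannot create a second charge. payments_api
    # is a hypothetical external client shown for illustration.
    idempotency_key = hashlib.sha256(
        f"{execution_id}:{payment['orderId']}:charge".encode()
    ).hexdigest()
    return payments_api.charge(
        amount=payment["amount"],
        source=payment["source"],
        idempotency_key=idempotency_key,
    )
```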
I would also tag long-lived executions with a business key. A one-year workflow without a searchable customer ID, order ID, incident ID, or approval ID becomes hard to operate. The execution ID is useful for machines. The business key is what an on-call engineer has when a user reports a problem.
The handler should be thin. It should call durable operations with stable names. Domain services should be deterministic where possible and idempotent where side effects exist. Adapters should handle retries, timeouts, and request IDs.
```python
@durable_execution
def handler(event: dict, context: DurableContext):
    order_id = event["orderId"]
    order = context.step(
        lambda _: order_service.validate(order_id),
        name="validate-order",
    )
    approval = context.wait_for_callback(
        lambda callback_id: approval_service.request(order, callback_id),
        name="request-manager-approval",
        config=WaitForCallbackConfig(timeout_seconds=86400),  # 24-hour window
    )
    if not approval["approved"]:
        return {"status": "rejected", "orderId": order_id}
    return context.step(
        lambda _: fulfillment_service.submit(order),
        name="submit-fulfillment",
    )
```
In tests, verify replay behavior. Test that operation names stay stable. Test that external side effects happen inside steps. Test that callbacks time out. Test that duplicate invocations do not duplicate business actions.
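One way to make those tests concrete is a hand-rolled context double, sketched below. This assumes the decorated handler can be invoked directly with a test double; if the real SDK ships its own test harness, prefer that instead.

```python
# A fake context that records step names and stored results, so a test can
# assert that a replay pass re-executes nothing. The shape is an assumption
# for illustration, not the SDK's official testing API.
class FakeDurableContext:
    def __init__(self, recorded=None):
        self.recorded = recorded or {}   # name -> checkpointed result
        self.executed = []               # names actually run in this pass

    def step(self, fn, name):
        if name in self.recorded:        # replay: return the stored result
            return self.recorded[name]
        result = fn(None)
        self.recorded[name] = result
        self.executed.append(name)
        return result

    def wait(self, duration):
        pass  # waits are a no-op in unit tests

def test_replay_does_not_repeat_side_effects():
    first = FakeDurableContext()
    handler({"orderId": "o-1"}, first)
    replay = FakeDurableContext(recorded=dict(first.recorded))
    handler({"orderId": "o-1"}, replay)
    assert replay.executed == []  # nothing re-executed on replay
```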
If this feels like too much application discipline for your team, use Step Functions.
Observability and Operations
Step Functions gives you a native execution graph. Durable functions require more intentional telemetry.
At minimum, every durable execution should log the following (a structured-logging sketch follows the list):
- Durable execution ID
- Business correlation ID
- Current durable operation name
- Attempt number
- Replay indicator if available
- External request ID for every service call
- Callback ID for human or external waits
- Final status and reason
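A minimal structured-logging helper covering those fields might look like this; the field names are conventions of this article, not an SDK requirement.

```python
import json
import logging

logger = logging.getLogger("durable-workflow")

def log_operation(execution_id: str, business_key: str, operation: str,
                  attempt: int, is_replay: bool, request_id: str = None):
    # Emit one structured line per durable operation so CloudWatch Logs
    # Insights can filter on any field.
    logger.info(json.dumps({
        "durableExecutionId": execution_id,
        "businessKey": business_key,       # e.g. order ID or incident ID
        "operation": operation,
        "attempt": attempt,
        "replay": is_replay,
        "externalRequestId": request_id,
    }))
```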
For CloudWatch, create alarms on durable execution failures, callback timeouts, checkpoint API errors, and unusual execution age. For Step Functions, alarm on failed executions, timed-out executions, throttling, and history-limit risk if your workflows loop heavily.
If you already use OpenTelemetry with CloudWatch, add durable execution IDs as span attributes. If you use AWS X-Ray for distributed tracing, make sure replay does not create misleading traces for already-completed steps. Trace design gets trickier when code can replay without redoing work.
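A small wrapper can attach those attributes consistently. The attribute names below are my own convention, and note the caveat in the comment: a replayed step still opens a span even though no work is redone.

```python
from opentelemetry import trace

tracer = trace.get_tracer("durable-workflow")

def traced_step(context, name, fn, execution_id, business_key):
    # Wrap a durable step in a span carrying the IDs needed to correlate
    # traces with logs and checkpoint history. Caveat: on replay this span
    # is still created even though context.step returns a stored result,
    # so mark or filter replay spans in your trace pipeline.
    with tracer.start_as_current_span(name) as span:
        span.set_attribute("durable.execution_id", execution_id)
        span.set_attribute("business.key", business_key)
        return context.step(fn, name=name)
```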
For incident response, document how to answer these questions:
| Question | Durable Functions evidence | Step Functions evidence |
|---|---|---|
| Where is the workflow stuck? | Durable execution state, logs, current operation | Execution graph and event history |
| Did a side effect run twice? | Idempotency keys and adapter logs | Task state history and downstream logs |
| Why did it wait? | Wait or callback operation log | Wait state or task token state |
| Can it be stopped? | Durable execution stop API and app policy | StopExecution and state machine policy |
| Can it be replayed safely? | Determinism tests and operation naming | State machine version and execution input |
Migration: From Step Functions to Durable Functions
Do not migrate a working Step Functions workflow just to reduce resource count. Migrate when the state machine has become a poor fit.
Good candidates:
- The workflow is owned by one application team.
- Most states are Lambda tasks that call back into the same codebase.
- Business logic is hard to review because it is split across ASL and code.
- The team wants normal unit tests around orchestration logic.
- The execution has modest operation count and predictable payload size.
Bad candidates:
- The workflow coordinates many AWS services without Lambda glue.
- Operations teams rely on the visual execution graph.
- The state machine is shared across teams.
- Execution history and audit are more important than code ergonomics.
- The workflow uses large fan-out or complex service integrations.
Use a strangler migration. Wrap one Step Functions branch in a durable function. Keep the outer workflow intact. If that works well, move another branch. This avoids a risky all-at-once rewrite.
Migration: From Lambda Glue to Durable Functions
The better migration is often from fragile Lambda glue to durable functions.
If you have a Lambda that writes intermediate state to DynamoDB, schedules EventBridge callbacks, polls an external API, and manually deduplicates work, durable functions may remove custom orchestration code. You still need a system of record. But you may not need as much homemade retry and resume logic.
Look for these smells:
- A table named `workflow_state` that only exists to resume Lambda logic.
- Manual `current_step` fields.
- EventBridge Scheduler jobs created for one-off waits.
- SQS messages that represent workflow progress instead of business events.
- Custom retry counters in DynamoDB.
- Support tickets about stuck intermediate states.
Durable functions were built for that pain.
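To see the difference, compare a sketch of the glue pattern with its durable equivalent. Table, field, and function names here are hypothetical.

```python
# Before: hand-rolled resume logic (hypothetical table and field names).
def glue_handler(event):
    state = workflow_table.get_item(Key={"pk": event["orderId"]})["Item"]
    if state["current_step"] == "validated":
        authorize_payment(event)
        workflow_table.update_item(
            Key={"pk": event["orderId"]},
            UpdateExpression="SET current_step = :s",
            ExpressionAttributeValues={":s": "authorized"},
        )
    # ...one branch per step, plus retry counters and scheduler jobs

# After: the checkpoint log replaces current_step entirely.
@durable_execution
def handler(event, context: DurableContext):
    context.step(lambda _: validate_order(event["orderId"]), name="validate-order")
    context.step(lambda _: authorize_payment(event), name="authorize-payment")
```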
How This Fits With Other AWS Serverless Patterns
Durable functions do not replace every serverless pattern. They sit next to them.
Use SQS plus Lambda event source mapping when you need queue-driven work distribution and backpressure. Use EventBridge plus Step Functions when you need event routing and standalone orchestration. Use Lambda S3 Files when the workflow needs a shared workspace or large file context. Use Lambda layers and custom runtimes when packaging and runtime control are the main problem.
The architecture should not be “durable functions everywhere.” It should be “durable functions where code-first durability reduces complexity.”
Final Decision Framework
Ask these questions in order:
- Is the workflow mostly one application or a cross-service business process?
- Would a developer understand it faster as code or as a visual state machine?
- How many operations or transitions will one execution create?
- Is replay determinism easy to guarantee?
- Does the workflow need native integrations to many AWS services?
- Who owns production support: the application team or platform team?
- What evidence will an auditor or incident commander need?
- What is the rollback path if the workflow definition changes?
If most answers point to one application team, code-first logic, modest operation counts, and durable waits, choose Lambda Durable Functions. If most answers point to cross-service orchestration, visual auditability, broad AWS integrations, and platform ownership, choose Step Functions.
The best AWS architectures usually do not pick one tool forever. They use the right boundary for the job. Durable functions make Lambda a stronger workflow runtime. Step Functions remains the orchestration backbone when the workflow needs to stand outside the application.