AWS Step Functions Deep Dive: States, Integrations, and Workflows
Step Functions launched in 2016 as a way to sequence Lambda functions without writing glue code. Seven years later, it has grown into something considerably more powerful: 220+ AWS service integrations, optimized integrations that don’t require a Lambda function in the middle, and two execution modes with completely different pricing and durability characteristics. Most tutorials still show the basic Lambda-chaining use case. This guide covers the parts that actually matter at production scale.
Standard vs Express Workflows
Every state machine you create is either Standard or Express. The choice matters because the two modes have different execution models, price points, and durability guarantees — and you can’t change the type after creation.
Standard workflows support executions up to one year. Each execution is logged and queryable through the Step Functions console, with history retained for 90 days after completion. You get exactly-once execution semantics: if a state starts, it runs exactly once. You pay $0.025 per 1,000 state transitions, so a state machine that runs 100 steps costs $0.0025 per execution. This is the right choice for long-running business processes — order fulfillment, multi-step approvals, ML training pipelines.
Express workflows cap out at 5 minutes and are priced on request volume and duration rather than transitions: $1.00 per million requests, plus a tiered duration charge (roughly $0.0000167 per GB-second at the first tier). At-least-once semantics mean a state might execute more than once if there’s a system failure. Express execution history isn’t stored — you get visibility only if you configure CloudWatch Logs. This mode is designed for high-volume, short-duration flows: IoT event processing, API Gateway backend orchestration, data transformation pipelines where you’re running thousands of executions per second.
A quick rule of thumb: if your workflow touches financial data or needs an audit trail, use Standard. If you’re running 10,000+ executions per day and cost is a concern, Express with CloudWatch logging is usually cheaper.
The Eight State Types
State machines are built from eight state types. Each has a specific job, and picking the wrong one is usually how over-engineered state machines happen.
Task states do the actual work. They invoke a resource — a Lambda function, an ECS task, a DynamoDB operation, an API call via SDK integrations. This is where your business logic runs.
Choice states branch execution based on conditions. You define rules against the input JSON, and execution routes to a different state based on which rule matches first. No Lambda required. Choice doesn’t support Retry or Catch — handle errors before or after it.
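A sketch of a Choice state under those rules — the state names and the input fields ($.orderTotal, $.customerTier) are illustrative:

```json
"RouteOrder": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.orderTotal",
      "NumericGreaterThanEquals": 1000,
      "Next": "ManualReview"
    },
    {
      "And": [
        { "Variable": "$.customerTier", "StringEquals": "premium" },
        { "Variable": "$.orderTotal", "NumericLessThan": 1000 }
      ],
      "Next": "FastTrack"
    }
  ],
  "Default": "StandardProcessing"
}
```

Rules are evaluated top to bottom and the first match wins. If nothing matches and no Default is set, the execution fails with a States.NoChoiceMatched error, so always provide a Default.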
Wait states pause execution for a fixed duration or until a specific timestamp. Entering the state costs one transition like any other, but nothing accrues while the wait is in progress on Standard workflows (Express bills for duration, so long waits there do cost money). Useful for polling with backoff or implementing time-based gates.
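Both forms of Wait, sketched — the state names are illustrative, and each field also has a path variant (SecondsPath, TimestampPath) that reads the value from the input:

```json
"CoolDown": {
  "Type": "Wait",
  "Seconds": 300,
  "Next": "CheckStatus"
},
"HoldUntilLaunch": {
  "Type": "Wait",
  "TimestampPath": "$.launchAt",
  "Next": "Publish"
}
```

The second form is how you implement "do this at the time specified in the event" without any scheduler infrastructure.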
Parallel states fan out. You define several branches and all of them start simultaneously. The state machine waits until every branch is done, then combines their results into an array (one entry per branch, in the order you defined them). One failed branch poisons the whole Parallel state — the others get cancelled, and execution takes the Catch path.
Map states iterate over an array and process each element with the same sub-state-machine. Use MaxConcurrency to control how many iterations run in parallel — 0 means unlimited, which can overwhelm downstream services if you’re not careful.
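A sketch of a Map state — the field names are ASL (newer definitions use ItemProcessor; the older Iterator keyword is still accepted), while the state and function names are illustrative:

```json
"ProcessLineItems": {
  "Type": "Map",
  "ItemsPath": "$.lineItems",
  "MaxConcurrency": 5,
  "ItemProcessor": {
    "StartAt": "ReserveStock",
    "States": {
      "ReserveStock": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {
          "FunctionName": "reserve-stock",
          "Payload.$": "$"
        },
        "End": true
      }
    }
  },
  "ResultPath": "$.reservations",
  "Next": "Summarize"
}
```

The output is an array of per-item results in input order, placed at $.reservations here so the original input survives.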
Pass states transform data without calling any service. They pass their input through to output, optionally injecting a Result field or reshaping the JSON via Parameters. Useful for injecting constants or restructuring data between states.
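A Pass state that injects configuration without touching the rest of the input (the names are illustrative):

```json
"InjectDefaults": {
  "Type": "Pass",
  "Result": {
    "region": "us-east-1",
    "retryLimit": 3
  },
  "ResultPath": "$.config",
  "Next": "FirstTask"
}
```

Because ResultPath is $.config, the original input passes through with a config field added rather than being replaced.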
Succeed and Fail states terminate execution. Succeed ends with a success status; Fail ends with an error and cause string that shows up in the execution history. Use Fail explicitly rather than letting an unhandled error propagate — it gives you cleaner error messages.
Data Flow: InputPath, Parameters, ResultPath, OutputPath
Step Functions’ data model is one of the most confusing parts for new users. Each Task state receives input, calls a resource, gets a result, and passes output to the next state. Four fields control what gets passed where.
InputPath filters the state’s input before it’s sent to the resource. $.user means only the user field from the input JSON goes to the Lambda function.
Parameters reshapes or augments the input before the resource call. You can mix static values with dynamic references to the input:
{
  "Parameters": {
    "userId.$": "$.user.id",
    "source": "payment-service",
    "timestamp.$": "$$.Execution.StartTime"
  }
}
ResultPath controls where the resource’s output lands in the state’s data. $.taskResult means the Lambda return value is injected at taskResult in the existing input JSON, rather than replacing it entirely. null discards the result completely — useful when you’re calling something for its side effects. (A related field, ResultSelector, reshapes the raw result before ResultPath places it.)
OutputPath filters what gets passed to the next state. Set this to $.userData and only that field flows forward.
Getting these four fields right is the difference between a state machine that passes clean data between states and one where you write transformation Lambdas just to reformat JSON.
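Putting the four fields together in one Task state — the function name and input fields here are illustrative:

```json
"EnrichUser": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "InputPath": "$.user",
  "Parameters": {
    "FunctionName": "user-enricher",
    "Payload": {
      "userId.$": "$.id",
      "requestedAt.$": "$$.Execution.StartTime"
    }
  },
  "ResultSelector": {
    "profile.$": "$.Payload"
  },
  "ResultPath": "$.user.enrichment",
  "OutputPath": "$.user",
  "Next": "SaveProfile"
}
```

Processing order is InputPath → Parameters → task → ResultSelector → ResultPath → OutputPath. Note that ResultPath grafts the result back onto the raw state input, not onto the InputPath-filtered view, which is why $.user.enrichment works here even though the Lambda only saw the user object.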
Task State Integrations
Step Functions can call AWS services three different ways, and the choice has significant implications for cost and reliability.
RequestResponse (the default) calls the service and continues as soon as the API call returns. For Lambda this is a synchronous invoke, so you do get the function’s return value. For job-style APIs like ECS RunTask, though, the call returns as soon as the job is started — the state machine moves on without waiting for the job to finish.
.sync (the “Run a Job” integration pattern) calls the service and waits for the underlying job to complete before the state machine continues. Step Functions polls the service internally — you don’t pay for a polling Lambda. For ECS RunTask, execution waits until the task stops. (A few integrations, such as nested Step Functions executions, also offer a .sync:2 variant that returns the job output as parsed JSON instead of an escaped string.) This is the correct pattern for job-style services — ECS, AWS Batch, Glue, SageMaker; Lambda doesn’t need it, since lambda:invoke already waits for the function’s response.
.waitForTaskToken is for long-running external work. The Task state sends a task token to the resource, then pauses until your code calls SendTaskSuccess or SendTaskFailure with that token. The execution sits idle (no transitions consumed, no cost during wait for Standard workflows) until the callback arrives. Useful for human approval steps, third-party API callbacks, or any process that completes asynchronously.
{
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
  "Parameters": {
    "FunctionName": "approval-notifier",
    "Payload": {
      "taskToken.$": "$$.Task.Token",
      "approvalUrl.$": "$.approvalUrl"
    }
  },
  "TimeoutSeconds": 86400,
  "Next": "ProcessApproval"
}
The Lambda function receives the task token and stores it. When the reviewer approves or rejects (via your UI calling an API), you call SendTaskSuccess with the token, and the state machine resumes from where it paused. No polling required.
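The callback itself is a single API call — SendTaskSuccess or SendTaskFailure from any SDK, or the CLI. A sketch (the token shown is a placeholder):

```
aws stepfunctions send-task-success \
  --task-token "AAAAKgAAAAIAAAAA...example-token" \
  --task-output '{"approved": true, "reviewer": "jane@example.com"}'

# Or, on rejection:
aws stepfunctions send-task-failure \
  --task-token "AAAAKgAAAAIAAAAA...example-token" \
  --error "ApprovalRejected" \
  --cause "Reviewer declined the request"
```

The --task-output JSON becomes the Task state’s result; the --error string is what your Catch rules match against.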
Optimized Service Integrations
Step Functions has two tiers of service integrations. SDK integrations let you call any AWS API operation — useful but you’re responsible for handling async behavior. Optimized integrations are built specifically for Step Functions and handle the async polling internally.
Optimized integrations exist for Lambda, ECS/Fargate, SQS, SNS, EventBridge, DynamoDB, Glue, SageMaker, and several others. The practical difference: to run an ECS task and wait for completion via SDK integration, you’d need a Lambda that starts the task, another Lambda polling for completion, and state machine states to coordinate them. With the optimized ECS integration, one Task state does all of that.
{
  "Type": "Task",
  "Resource": "arn:aws:states:::ecs:runTask.sync",
  "Parameters": {
    "LaunchType": "FARGATE",
    "Cluster": "arn:aws:ecs:us-east-1:123456789012:cluster/my-cluster",
    "TaskDefinition": "arn:aws:ecs:us-east-1:123456789012:task-definition/my-task:5",
    "NetworkConfiguration": {
      "AwsvpcConfiguration": {
        "Subnets": ["subnet-abc123"],
        "AssignPublicIp": "DISABLED"
      }
    }
  },
  "Next": "ProcessResults"
}
Step Functions manages the polling. Your execution history shows the ECS task ARN and exit code. No additional infrastructure needed.
Error Handling: Retry and Catch
Every Task state can define Retry and Catch blocks. This is where Step Functions handles transient failures without requiring your code to implement retry logic.
Retry specifies which error types trigger retries, how many to attempt, and how the backoff interval grows:
{
  "Retry": [
    {
      "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
      "IntervalSeconds": 2,
      "MaxAttempts": 3,
      "BackoffRate": 2,
      "JitterStrategy": "FULL"
    },
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 5,
      "MaxAttempts": 2,
      "BackoffRate": 1.5
    }
  ]
}
JitterStrategy: FULL randomizes each retry delay between zero and the computed interval, which helps prevent thundering-herd problems when many executions retry simultaneously. With BackoffRate: 2 and IntervalSeconds: 2 (and no jitter), the three retries would be scheduled 2s, 4s, and 8s after each successive failure.
Catch handles errors that exhaust retries or hit a non-retryable error:
{
  "Catch": [
    {
      "ErrorEquals": ["PaymentDeclined"],
      "Next": "NotifyCustomer",
      "ResultPath": "$.error"
    },
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "HandleUnexpectedError",
      "ResultPath": "$.error"
    }
  ]
}
States.ALL is a catch-all. Put it last — Step Functions evaluates Catch rules in order and uses the first match. ResultPath: "$.error" preserves the error details in the state data so the error-handling state can log or act on them.
IAM Execution Roles
Step Functions assumes an IAM execution role to call services on your behalf. This role needs explicit permissions for every service your state machine calls. Common gotchas:
For Lambda invocations, the role needs lambda:InvokeFunction on the specific function ARNs. For ECS RunTask, it needs ecs:RunTask and iam:PassRole to pass the task execution role. For DynamoDB, dynamodb:PutItem or whichever operations you’re using. The IAM roles and policies guide covers the permission model and how to scope execution roles tightly.
Step Functions also needs logs:CreateLogGroup, logs:CreateLogDelivery, and related permissions to write execution logs to CloudWatch. The CloudWatch deep dive shows how to set up log groups and metric filters to alert on execution failures.
Missing permissions produce States.TaskFailed errors with an AccessDeniedException cause. The execution history shows which state failed and the full error message — check there before digging into IAM policies.
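For the order-processing example later in this guide, a tightly scoped execution role policy might look like this (the account ID and resource names are placeholders matching that example):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "lambda:InvokeFunction",
      "Resource": [
        "arn:aws:lambda:us-east-1:123456789012:function:order-validator",
        "arn:aws:lambda:us-east-1:123456789012:function:payment-processor"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:UpdateItem"],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/Inventory"
    },
    {
      "Effect": "Allow",
      "Action": "sns:Publish",
      "Resource": "arn:aws:sns:us-east-1:123456789012:order-notifications"
    },
    {
      "Effect": "Allow",
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:us-east-1:123456789012:backorder-queue"
    }
  ]
}
```

Every Task state in the definition maps to a statement here; nothing is granted with a wildcard resource.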
A Complete State Machine Example
An order processing workflow that validates payment, fulfills inventory, and sends confirmation:
{
  "Comment": "Order processing pipeline",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "order-validator",
        "Payload.$": "$"
      },
      "ResultPath": "$.validation",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["ValidationFailed"],
          "Next": "RejectOrder",
          "ResultPath": "$.error"
        }
      ],
      "Next": "CheckInventory"
    },
    "CheckInventory": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:getItem",
      "Parameters": {
        "TableName": "Inventory",
        "Key": {
          "productId": {
            "S.$": "$.productId"
          }
        }
      },
      "ResultSelector": {
        "quantity.$": "States.StringToJson($.Item.quantity.N)"
      },
      "ResultPath": "$.inventory",
      "Next": "InventoryAvailable"
    },
    "InventoryAvailable": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.inventory.quantity",
          "NumericGreaterThan": 0,
          "Next": "ProcessPayment"
        }
      ],
      "Default": "BackorderItem"
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "payment-processor",
        "Payload": {
          "taskToken.$": "$$.Task.Token",
          "amount.$": "$.amount",
          "customerId.$": "$.customerId"
        }
      },
      "ResultPath": "$.payment",
      "TimeoutSeconds": 300,
      "Catch": [
        {
          "ErrorEquals": ["PaymentDeclined"],
          "Next": "RejectOrder",
          "ResultPath": "$.error"
        }
      ],
      "Next": "FulfillOrder"
    },
    "FulfillOrder": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "UpdateInventory",
          "States": {
            "UpdateInventory": {
              "Type": "Task",
              "Resource": "arn:aws:states:::dynamodb:updateItem",
              "Parameters": {
                "TableName": "Inventory",
                "Key": {
                  "productId": {"S.$": "$.productId"}
                },
                "UpdateExpression": "SET quantity = quantity - :one",
                "ExpressionAttributeValues": {":one": {"N": "1"}}
              },
              "End": true
            }
          }
        },
        {
          "StartAt": "SendConfirmation",
          "States": {
            "SendConfirmation": {
              "Type": "Task",
              "Resource": "arn:aws:states:::sns:publish",
              "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:order-notifications",
                "Message.$": "States.Format('Order {} confirmed', $.orderId)"
              },
              "End": true
            }
          }
        }
      ],
      "Next": "OrderComplete"
    },
    "OrderComplete": {
      "Type": "Succeed"
    },
    "RejectOrder": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:order-notifications",
        "Message.$": "States.Format('Order {} rejected: {}', $.orderId, $.error.Cause)"
      },
      "Next": "OrderFailed"
    },
    "BackorderItem": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sqs:sendMessage",
      "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/backorder-queue",
        "MessageBody.$": "$"
      },
      "End": true
    },
    "OrderFailed": {
      "Type": "Fail",
      "Error": "OrderProcessingFailed",
      "Cause": "Order could not be completed"
    }
  }
}
This state machine handles the complete order lifecycle without a coordination Lambda. The ResultSelector on CheckInventory converts DynamoDB’s string-typed N attribute into a real number so the Choice state’s NumericGreaterThan comparison works, and the ResultPath on ProcessPayment keeps the callback output from replacing the order data that downstream states still need. The Parallel state runs the inventory update and confirmation notification concurrently, and the .waitForTaskToken integration on payment processing lets an external payment gateway call back when processing completes.
Step Functions vs SQS for Orchestration
The comparison that comes up most often: when do you use Step Functions versus SQS with Lambda event source mapping for chaining work?
SQS is the right choice when you need high throughput, loose coupling, and the queue consumers are independent services. If service A produces work and service B consumes it, they don’t need to know about each other. SQS handles back-pressure naturally through its queue depth. The downside is that visibility into the overall flow requires CloudWatch metrics and custom logging — there’s no execution history showing you where a specific item is in the pipeline.
Step Functions is the right choice when the steps are tightly coupled, you need to see the state of each execution, or you need conditional branching and error handling across the flow. The execution history shows every state transition for every execution. You can see exactly where an order failed and why. Retries with configurable backoff are built in — no retry logic in your Lambda code.
The common mistake is using Step Functions for pure fan-out scenarios. If you’re processing 100,000 SQS messages per minute with a single Lambda, Step Functions adds cost and complexity with no benefit. The EventBridge + Step Functions patterns guide covers how to combine both approaches: EventBridge for routing events, Step Functions for the stateful orchestration of each event’s processing flow.
Monitoring and Debugging
Step Functions emits CloudWatch metrics for execution counts, duration, and throttles. The useful ones to alarm on:
ExecutionsFailed — obvious. Set an alarm for any value above 0 in production.
ExecutionThrottled — state transitions (StateEntered events and retries) are being throttled against the account’s StateTransition quota. The quota is region-dependent and a soft limit that can be increased. Alarm on this before throttling starts inflating execution latency.
ExecutionTime — p99 duration. Alert if it exceeds your SLA.
For debugging individual executions, the Events tab in the console shows every state transition with inputs, outputs, timestamps, and error details. For Express workflows, you need CloudWatch Logs configured to see this — Express executions aren’t stored by default. Set logging level to ALL during development; drop to ERROR in production to keep log costs down.
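Logging is configured on the state machine itself. A sketch of the loggingConfiguration block as passed to CreateStateMachine or UpdateStateMachine (the log group ARN is a placeholder; note the required :* suffix):

```json
{
  "loggingConfiguration": {
    "level": "ERROR",
    "includeExecutionData": true,
    "destinations": [
      {
        "cloudWatchLogsLogGroup": {
          "logGroupArn": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/states/order-pipeline:*"
        }
      }
    ]
  }
}
```

includeExecutionData controls whether state inputs and outputs appear in the logs — turn it off if executions carry sensitive data.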
X-Ray tracing works with Step Functions — enable it on the state machine and X-Ray shows a service map connecting Step Functions to each Lambda, ECS task, and DynamoDB call. Latency breakdown per state is visible in the X-Ray trace.
Step Functions Local runs a local simulation of the service for development. It doesn’t execute real AWS resources, but it validates your state machine definition, data flow, and error handling logic without incurring costs or requiring AWS connectivity. Run it in Docker:
docker run -p 8083:8083 amazon/aws-stepfunctions-local
Then point the CLI at the local endpoint:
aws stepfunctions create-state-machine \
--endpoint-url http://localhost:8083 \
--name test-machine \
--definition file://state-machine.json \
--role-arn arn:aws:iam::123456789012:role/DummyRole
It’s useful for iterating on state machine definitions quickly before deploying to AWS.
When Not to Use Step Functions
Step Functions is not a good fit for every orchestration problem. Three situations where it’s the wrong tool:
High-frequency micro-tasks are where the pricing bites you. Run 50,000 executions per minute with 10 states each — that’s 500,000 transitions per minute, which at $0.025 per 1,000 transitions is $12.50 per minute on Standard workflows. Sustained for a month, that’s over half a million dollars. Express cuts the cost dramatically, but the 5-minute cap and at-least-once semantics rule it out for anything stateful or financial. A Lambda that orchestrates internally, or an SQS queue driving parallel workers, handles that kind of volume at a fraction of the cost.
Simple two-step flows. If you’re just calling Lambda A and then Lambda B, a Step Functions state machine adds operational overhead with minimal benefit. A single Lambda that calls another Lambda (or sends an SQS message) is simpler and easier to debug.
Real-time latency requirements. Step Functions has overhead — typically 100-200ms per state transition for Standard workflows. For flows where latency must stay under 100ms end-to-end, that overhead is unacceptable. Express workflows have lower latency but still add overhead compared to direct service calls.
For the use cases in between — multi-step business processes, workflows that run for minutes to days, flows that need human intervention, pipelines that require conditional branching with error handling — Step Functions removes a significant amount of coordination code and makes the execution flow visible and debuggable in a way that pure Lambda chains never are.