Overview
AWS Lambda durable functions enable the implementation of multi-step workflows as a single Python handler, while AWS Lambda manages checkpoints, long waits, and recovery behind the scenes. The mechanism behind this is durable execution, where a workflow can run for up to a year in wall-clock time, even though each individual AWS Lambda invocation still adheres to the usual runtime limit.
Basic Concepts
A durable execution represents the entire lifetime of a single long-running workflow instance. The developer writes regular, top-to-bottom Python code. Certain operations, such as steps, waits, callbacks, and invocations of other AWS Lambdas, are treated as durable operations. For each such operation, AWS Lambda records inputs and outputs in an execution history.
When the function stops, times out, or is paused (for example, while waiting for a payment callback), AWS Lambda later reinvokes the handler with the same event and replays the history. During replay, completed durable operations are not executed again; instead, Lambda injects the stored results and resumes from the next unfinished operation.
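The replay behaviour described above can be modelled in a few lines of plain Python. This is only a toy sketch: ToyDurableContext, the in-memory history dictionary, and the simulated crash are all hypothetical stand-ins, not the SDK or the service's actual storage. It illustrates the idea that a completed operation returns its stored result instead of running again.

```python
from typing import Any, Callable, Dict, List


class ToyDurableContext:
    """Toy stand-in for a durable context: replays results by step name."""

    def __init__(self, history: Dict[str, Any]) -> None:
        # history maps step name -> checkpointed result from earlier invocations
        self.history = history

    def step(self, fn: Callable[[], Any], name: str) -> Any:
        if name in self.history:
            # Replay: the step already completed, so return the stored result.
            return self.history[name]
        result = fn()                # first run: execute the side effect
        self.history[name] = result  # checkpoint before moving on
        return result


side_effects: List[str] = []  # records every real execution of a step body


def reserve() -> str:
    side_effects.append("reserve")
    return "RES-1"


def dispense() -> str:
    side_effects.append("dispense")
    return "DISPENSED"


def workflow(ctx: ToyDurableContext, crash_after_reserve: bool = False) -> str:
    ctx.step(reserve, name="reserve-slot")
    if crash_after_reserve:
        raise RuntimeError("simulated crash")
    return ctx.step(dispense, name="dispense")


history: Dict[str, Any] = {}
try:
    workflow(ToyDurableContext(history), crash_after_reserve=True)
except RuntimeError:
    pass  # the invocation died, but the history survives

# Reinvoke with the same history: "reserve-slot" is replayed, not re-run.
final_status = workflow(ToyDurableContext(history))
```

After the second call, side_effects contains one "reserve" and one "dispense" entry, even though the workflow function ran twice from the top.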
Two time dimensions matter: the per-invocation AWS Lambda timeout (up to 15 minutes) and the durable execution timeout, which can be configured up to one year for the entire workflow instance, including all waits and replays.
Durable Execution SDK and Runtimes
For Python, the Durable Execution SDK exposes decorators and a special context object.
The handler is annotated with @durable_execution and receives a DurableContext instead of the usual AWS Lambda context. This context provides methods such as:
- step(fn, name=…) to run a unit of work whose result is checkpointed,
- wait_for_callback(callback_starter, …) to suspend until an external system responds,
- wait(duration) to pause for a period without keeping a container running, and
- invoke(arn, payload, name=…) to call another AWS Lambda function with checkpointed results.
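The pairing of a step decorator with step(fn, name=…) is easier to follow with a model of the calling convention. The sketch below is a toy written for illustration: toy_durable_step, ToyStepContext, ToyContext, and quote_price are hypothetical names, and the real SDK's internals may differ. It shows only the general shape: calling the decorated function with its business arguments yields a deferred callable, which the context later runs with a step context.

```python
import functools
from typing import Any, Callable, Dict


class ToyStepContext:
    """Hypothetical object passed as the first argument of every step."""

    def __init__(self, name: str) -> None:
        self.name = name


def toy_durable_step(fn: Callable[..., Any]) -> Callable[..., Callable[[ToyStepContext], Any]]:
    """Turn fn(step_context, *args) into fn(*args) -> deferred callable."""

    @functools.wraps(fn)
    def bind(*args: Any, **kwargs: Any) -> Callable[[ToyStepContext], Any]:
        def run(step_context: ToyStepContext) -> Any:
            # The business arguments were captured earlier; only the
            # step context is supplied at execution time.
            return fn(step_context, *args, **kwargs)

        return run

    return bind


class ToyContext:
    """Hypothetical context whose step() runs a deferred callable."""

    def step(self, deferred: Callable[[ToyStepContext], Any], name: str) -> Any:
        return deferred(ToyStepContext(name))


@toy_durable_step
def quote_price(step_context: ToyStepContext, price_cents: int, quantity: int) -> Dict[str, Any]:
    return {"step": step_context.name, "total_cents": price_cents * quantity}


# quote_price(150, 2) does not run the body; ToyContext.step() does.
result = ToyContext().step(quote_price(150, 2), name="quote-price")
```

Deferring the call this way gives the context a chance to decide whether to execute the body or return a stored result, which is why the step is passed as a value rather than called directly.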
Durable execution is currently supported on selected Python and Node.js managed runtimes and container image functions.
Creating a Durable Function in the Console
Durable execution is enabled when the function is created.
In the AWS Lambda console, a developer chooses Create function → Author from scratch, selects a supported runtime such as Python 3.14, and expands the Durable execution section. That section allows enabling durable execution, setting the Execution timeout in seconds (for example, 86,400 seconds for one day), and defining the Retention period in days, which controls how long the execution history is stored after completion (between 1 and 90 days).
The console can create an execution role that already includes the necessary durable-state permissions. Once saved, the function page displays a Durable executions tab that lists each execution and its timeline.
With Infrastructure as Code (for example, AWS SAM or AWS CloudFormation), the same configuration is expressed via a DurableConfig block on the function resource, containing ExecutionTimeout and RetentionPeriodInDays.
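As a rough illustration of the two settings, the helper below builds the DurableConfig values as a Python dictionary and checks them against the limits stated above. The function name build_durable_config is hypothetical, the surrounding resource definition is not shown, and treating one year as 365 days is an assumption; the exact service limit should be confirmed in the AWS documentation.

```python
from typing import Dict

# Assumption: "one year" is taken as 365 days; check the service's exact limit.
MAX_EXECUTION_TIMEOUT_SECONDS = 365 * 24 * 60 * 60
MIN_RETENTION_DAYS = 1
MAX_RETENTION_DAYS = 90


def build_durable_config(execution_timeout_seconds: int, retention_days: int) -> Dict[str, int]:
    """Return a DurableConfig-shaped dict after validating both values."""
    if not 1 <= execution_timeout_seconds <= MAX_EXECUTION_TIMEOUT_SECONDS:
        raise ValueError("ExecutionTimeout must be between 1 second and one year.")
    if not MIN_RETENTION_DAYS <= retention_days <= MAX_RETENTION_DAYS:
        raise ValueError("RetentionPeriodInDays must be between 1 and 90.")
    return {
        "ExecutionTimeout": execution_timeout_seconds,
        "RetentionPeriodInDays": retention_days,
    }


# One-day workflow whose history is kept for a week after completion.
one_day_config = build_durable_config(24 * 60 * 60, 7)
```

Validating the values before deployment surfaces an out-of-range timeout or retention period early rather than at stack creation time.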
Example: Vending Machine Durable Workflow
The following handler illustrates a vending workflow: validate a request, reserve a slot, process a payment, pause on a short durable wait that stands in for an external confirmation, then dispense the item and write an audit record:
    from typing import Any, Dict
    import uuid
    import json

    from aws_durable_execution_sdk_python import (
        DurableContext,
        durable_execution,
        durable_step,
    )
    from aws_durable_execution_sdk_python.config import Duration


    # -------- Durable steps for the vending workflow -------- #

    @durable_step
    def validate_request(step_context, request: Dict[str, Any]) -> Dict[str, Any]:
        """Validate the incoming vending request and assign an order ID."""
        price = int(request.get("price_cents", 0))
        machine_id = request.get("machine_id")
        slot_id = request.get("slot_id")
        user_id = request.get("user_id")

        if price <= 0:
            raise ValueError("Price must be a positive integer.")
        if not machine_id:
            raise ValueError("Missing machine_id.")
        if not slot_id:
            raise ValueError("Missing slot_id.")
        if not user_id:
            raise ValueError("Missing user_id.")

        order_id = f"ORD-{uuid.uuid4()}"
        step_context.logger.info(f"[vending] validated request → order_id={order_id}")

        return {
            "order_id": order_id,
            "machine_id": machine_id,
            "slot_id": slot_id,
            "user_id": user_id,
            "price_cents": price,
        }


    @durable_step
    def reserve_vending_slot(step_context, order: Dict[str, Any]) -> Dict[str, Any]:
        """Reserve the selected slot on the vending machine."""
        reservation_id = f"RES-{order['order_id']}"
        step_context.logger.info(
            f"[vending] reserving slot={order['slot_id']} "
            f"on machine={order['machine_id']} "
            f"reservation={reservation_id}"
        )
        return {
            "reservation_id": reservation_id,
            "machine_id": order["machine_id"],
            "slot_id": order["slot_id"],
            "status": "RESERVED",
        }


    @durable_step
    def process_vending_payment(step_context, order: Dict[str, Any]) -> Dict[str, Any]:
        """
        Simulate payment processing for the order.
        In a real system this step would call an external payment gateway.
        """
        step_context.logger.info(
            f"[vending] processing payment for order={order['order_id']} "
            f"amount={order['price_cents']} cents"
        )
        return {
            "order_id": order["order_id"],
            "status": "PAID",
            "amount_cents": order["price_cents"],
            "reference": f"PAY-{uuid.uuid4()}",
        }


    @durable_step
    def confirm_and_dispense(step_context, order: Dict[str, Any]) -> Dict[str, Any]:
        """Confirm the order and dispense the item from the machine."""
        step_context.logger.info(
            f"[vending] dispensing from machine={order['machine_id']} "
            f"slot={order['slot_id']} "
            f"order_id={order['order_id']}"
        )
        return {
            "order_id": order["order_id"],
            "status": "DISPENSED",
            "machine_id": order["machine_id"],
            "slot_id": order["slot_id"],
            "dispensed_at": "now",  # placeholder timestamp
        }


    @durable_step
    def write_vending_audit(
        step_context,
        order: Dict[str, Any],
        reservation: Dict[str, Any],
        payment: Dict[str, Any],
        dispense: Dict[str, Any],
    ) -> Dict[str, Any]:
        """Record an audit entry for the full vending transaction."""
        record = {
            "order": order,
            "reservation": reservation,
            "payment": payment,
            "dispense": dispense,
        }
        step_context.logger.info("[vending] audit record: %s", json.dumps(record))
        # In production this might write to DynamoDB, S3, or another store.
        return {"audit_status": "RECORDED"}


    # -------- Durable execution handler -------- #

    @durable_execution
    def lambda_handler(event: Dict[str, Any], context: DurableContext) -> Dict[str, Any]:
        """
        Durable vending workflow.

        Expected event:
        {
            "request": {
                "machine_id": "VM-101",
                "slot_id": "A3",
                "user_id": "user-42",
                "price_cents": 150
            }
        }
        """
        request = event["request"]

        # Step 1: validate and enrich input
        order = context.step(
            validate_request(request),
            name="validate-request",
        )

        # Step 2: reserve a slot on the machine
        reservation = context.step(
            reserve_vending_slot(order),
            name="reserve-slot",
        )

        # Step 3: process payment
        payment = context.step(
            process_vending_payment(order),
            name="process-payment",
        )

        # Step 4: simulate external confirmation (e.g., user approving payment)
        context.wait(Duration.from_seconds(10))

        # Step 5: confirm and dispense
        dispense = context.step(
            confirm_and_dispense(order),
            name="confirm-and-dispense",
        )

        # Step 6: write audit record
        audit = context.step(
            write_vending_audit(order, reservation, payment, dispense),
            name="write-audit",
        )

        return {
            "order_id": order["order_id"],
            "status": dispense["status"],
            "machine_id": order["machine_id"],
            "slot_id": order["slot_id"],
            "reservation": reservation,
            "payment": payment,
            "dispense": dispense,
            "audit": audit,
        }
Every side effect (reserving the slot, processing the payment, dispensing the item, and writing the audit entry) is enclosed in a step. The 10-second context.wait stands in for an external confirmation; a production workflow would more likely suspend with wait_for_callback and add a step that releases the slot when payment fails. If the function crashes after the dispense step has been checkpointed, the subsequent invocation replays the workflow, skips the already completed steps, and produces the same outcome without triggering a second dispense.
Invocation, Event Sources, Retries, and Idempotency
Durable functions are invoked like any other AWS Lambda function: from the console test feature, Amazon API Gateway, AWS Lambda function URLs, or other Lambda functions, using a qualified ARN (version or alias). Each run appears in the Durable executions tab with a full step history.
For event source mappings such as Amazon SQS or Amazon Kinesis, the usual per-invocation timeout still applies: if a batch is processed directly by a durable function, all work (including waits) must stay within that limit. For long-running workflows, an intermediate non-durable AWS Lambda function often starts the durable execution asynchronously and returns immediately.
Retries occur at both the durable-operation level (step retry policies) and the AWS Lambda infrastructure level, so steps must be idempotent, typically by using stable identifiers such as order_id or explicit idempotency keys.
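One common way to make a step safe under retries is to derive an idempotency key from a stable identifier and let the downstream system deduplicate on it. The sketch below assumes a hypothetical ToyVendingMachine that keeps receipts in memory; a real system would rely on a persistent store or the payment provider's own idempotency support.

```python
from typing import Dict


class ToyVendingMachine:
    """Hypothetical downstream system that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.receipts: Dict[str, Dict[str, str]] = {}
        self.physical_dispenses = 0

    def dispense(self, idempotency_key: str, slot_id: str) -> Dict[str, str]:
        if idempotency_key in self.receipts:
            # A retry of an operation that already succeeded: return the
            # original receipt without touching the hardware again.
            return self.receipts[idempotency_key]
        self.physical_dispenses += 1
        receipt = {"slot_id": slot_id, "status": "DISPENSED"}
        self.receipts[idempotency_key] = receipt
        return receipt


machine = ToyVendingMachine()
order = {"order_id": "ORD-123", "slot_id": "A3"}

# The key is built from order_id, so every retry of this order reuses it.
key = f"dispense:{order['order_id']}"
first = machine.dispense(key, order["slot_id"])
retry = machine.dispense(key, order["slot_id"])
```

Both calls return the same receipt, and physical_dispenses stays at 1, which is the property a retried step needs.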
Security, Testing, Monitoring, and Best Practices
The execution role of a durable function requires standard logging and service permissions, plus durable-state actions such as lambda:CheckpointDurableExecution and related APIs; AWS managed policies like AWSLambdaBasicDurableExecutionRolePolicy bundle these permissions. Durable state is encrypted both at rest and in transit, and durable APIs are logged in AWS CloudTrail for audit purposes.
Testing can be performed in the AWS Lambda console using sample events and by inspecting durable executions. Callbacks can be completed via a dedicated callback AWS Lambda function or directly in the console.
Monitoring relies on Amazon CloudWatch Logs, Amazon CloudWatch metrics and alarms, and the durable executions view for step-by-step inspection.
Recommended practices include deterministic orchestration (keeping non-deterministic logic inside steps), descriptive step names, and least-privilege AWS IAM permissions for both durable state and downstream services.
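The advice on deterministic orchestration can be illustrated with a small model. In the sketch below, toy_step and both handlers are hypothetical, and the dictionary stands in for the execution history. The fragile handler generates a random ID in the orchestration code, so a replay sees a different value than the one baked into an already checkpointed step; the deterministic handler generates the ID inside a step, so the replay gets the stored value back.

```python
import uuid
from typing import Any, Callable, Dict, Tuple


def toy_step(history: Dict[str, Any], name: str, fn: Callable[[], Any]) -> Any:
    """Run fn once per step name; later calls return the stored result."""
    if name not in history:
        history[name] = fn()
    return history[name]


def fragile_handler(history: Dict[str, Any]) -> Tuple[str, str]:
    # Non-deterministic value created outside a step: differs on each replay.
    order_id = f"ORD-{uuid.uuid4()}"
    reservation = toy_step(history, "reserve", lambda: f"RES-{order_id}")
    return reservation, order_id


def deterministic_handler(history: Dict[str, Any]) -> Tuple[str, str]:
    # The random ID is produced inside a step, so it is checkpointed.
    order_id = toy_step(history, "new-order-id", lambda: f"ORD-{uuid.uuid4()}")
    reservation = toy_step(history, "reserve", lambda: f"RES-{order_id}")
    return reservation, order_id


# Run each handler twice against the same history to mimic a replay.
fragile_history: Dict[str, Any] = {}
fragile_handler(fragile_history)
reservation, order_id = fragile_handler(fragile_history)
fragile_consistent = reservation == f"RES-{order_id}"

stable_history: Dict[str, Any] = {}
deterministic_handler(stable_history)
reservation, order_id = deterministic_handler(stable_history)
deterministic_consistent = reservation == f"RES-{order_id}"
```

In the fragile version, the replayed reservation still refers to the first run's order ID, so the two values no longer agree; in the deterministic version they always match.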
Conclusion
AWS Lambda durable functions allow complex, multi-step workflows, such as the vending machine example, to be modeled as simple Python code while AWS transparently handles checkpoints, waits, and recovery.
About CloudThat
CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications winning global recognition for its training excellence including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.
FAQs
1. When is a durable function appropriate?
ANS: – A durable function is well-suited to workflows with multiple stages, waits, or callbacks, such as payments, approvals, or long-running document processing, where checkpointing and automated recovery are valuable, but a code-centric model inside AWS Lambda is preferred over an external state machine.
2. How long can a single workflow run?
ANS: – A single durable execution can run as long as the configured ExecutionTimeout allows (up to one year), while each individual invocation between waits must stay under the normal Lambda runtime limit.
3. What happens if an external system never calls back?
ANS: – A callback wait can include a timeout via WaitForCallbackConfig. When that timeout expires, the durable operation fails in a controlled way, and the function can respond accordingly.
WRITTEN BY Rishi Raj Saikia
Rishi works as an Associate Architect. He is a dynamic professional with a strong background in data and IoT solutions, helping businesses transform raw information into meaningful insights. He has experience in designing smart systems that seamlessly connect devices and streamline data flow. Skilled in addressing real-world challenges by combining technology with practical thinking, Rishi is passionate about creating efficient, impactful solutions that drive measurable results.
December 12, 2025