Implementing AWS Lambda Durable Functions in Python

Overview

AWS Lambda durable functions enable the implementation of multi-step workflows as a single Python handler, while AWS Lambda manages checkpoints, long waits, and recovery behind the scenes. The mechanism behind this is durable execution, where a workflow can run for up to a year in wall-clock time, even though each individual AWS Lambda invocation still adheres to the usual runtime limit.

Basic Concepts

A durable execution represents the entire lifetime of a single long-running workflow instance. The developer writes regular, top-to-bottom Python code. Certain operations, such as steps, waits, callbacks, and invocations of other AWS Lambda functions, are treated as durable operations. For each such operation, AWS Lambda records inputs and outputs in an execution history.

When the function stops, times out, or is paused (for example, while waiting for a payment callback), AWS Lambda later reinvokes the handler with the same event and replays the history. During replay, completed durable operations are not executed again; instead, Lambda injects the stored results and resumes from the next unfinished operation.
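The replay behavior can be illustrated with a toy model. The sketch below is not the Durable Execution SDK: ReplayContext, flaky_double, and the in-memory history dictionary are hypothetical stand-ins, used only to show how a stored result is injected on replay instead of re-running the step.

```python
from typing import Any, Callable


class ReplayContext:
    """Toy stand-in for a durable context: checkpoints step results by name."""

    def __init__(self, history: dict[str, Any]) -> None:
        self.history = history          # survives across invocations
        self.executed: list[str] = []   # steps that really ran in this invocation

    def step(self, fn: Callable[[], Any], name: str) -> Any:
        if name in self.history:        # replay: inject the stored result
            return self.history[name]
        result = fn()                   # first run: execute the work
        self.history[name] = result     # checkpoint only after success
        self.executed.append(name)
        return result


attempts = {"count": 0}


def flaky_double(value: int) -> int:
    """Fails on its first call to simulate an invocation that dies mid-workflow."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated crash")
    return value * 2


def handler(event: dict, ctx: ReplayContext) -> int:
    a = ctx.step(lambda: event["x"] + 1, name="add-one")
    return ctx.step(lambda: flaky_double(a), name="double")


history: dict[str, Any] = {}

first = ReplayContext(history)
try:
    handler({"x": 4}, first)
except RuntimeError:
    pass  # the first invocation died after checkpointing "add-one"

# Reinvoke with the same event and the same history, as the text describes.
second = ReplayContext(history)
result = handler({"x": 4}, second)

print(first.executed, second.executed, result)
```

In the second invocation only "double" actually executes; "add-one" is served from the history.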

Two time dimensions matter: the per-invocation AWS Lambda timeout (up to 15 minutes for managed runtimes) and the durable execution timeout, which can be configured up to one year for the entire workflow instance, including all waits and replays.

Durable Execution SDK and Runtimes

For Python, the Durable Execution SDK exposes decorators and a special context object.

The handler is annotated with @durable_execution and receives a DurableContext instead of the usual AWS Lambda context. This context provides methods such as:

  • step(fn, name=…) to run a unit of work whose result is checkpointed,
  • wait_for_callback(callback_starter, …) to suspend until an external system responds,
  • wait(duration) to pause for a period without keeping a container running, and
  • invoke(arn, payload, name=…) to call another AWS Lambda function with checkpointed results.

Durable execution is currently supported on selected Python and Node.js managed runtimes and container image functions.

Creating a Durable Function in the Console

Durable execution is enabled when the function is created.

In the AWS Lambda console, a developer chooses Create function → Author from scratch, selects a supported runtime such as Python 3.14, and expands the Durable execution section. That section allows enabling durable execution, setting the Execution timeout in seconds (for example, 86,400 seconds for one day), and defining the Retention period in days, which controls how long the execution history is stored after completion (between 1 and 90 days).

The console can create an execution role that already includes the necessary durable-state permissions. Once saved, the function page displays a Durable executions tab that lists each execution and its timeline.

With Infrastructure as Code (for example, AWS SAM or AWS CloudFormation), the same configuration is expressed via a DurableConfig block on the function resource, containing ExecutionTimeout and RetentionPeriodInDays.
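As a rough sketch of that configuration, the helper below builds a DurableConfig fragment as a plain Python dictionary and checks the bounds mentioned above. The function name durable_config is hypothetical, and treating "one year" as 365 days is an assumption; the exact upper limit should be taken from the AWS documentation.

```python
SECONDS_PER_DAY = 24 * 60 * 60
# Assumption: "one year" is taken as 365 days; verify the exact service limit.
MAX_EXECUTION_SECONDS = 365 * SECONDS_PER_DAY


def durable_config(execution_days: float, retention_days: int) -> dict:
    """Build a DurableConfig fragment and validate it against the stated bounds."""
    timeout = int(execution_days * SECONDS_PER_DAY)
    if not 0 < timeout <= MAX_EXECUTION_SECONDS:
        raise ValueError("ExecutionTimeout must be positive and at most one year")
    if not 1 <= retention_days <= 90:
        raise ValueError("RetentionPeriodInDays must be between 1 and 90")
    return {
        "DurableConfig": {
            "ExecutionTimeout": timeout,               # seconds
            "RetentionPeriodInDays": retention_days,   # days
        }
    }


config = durable_config(execution_days=1, retention_days=30)
print(config)
```

The resulting dictionary mirrors the two properties named above and could be serialized into a template's function resource.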

Example: Vending Machine Durable Workflow

The example is a vending workflow implemented as a single durable handler: validate a request, reserve a slot, start a payment, wait for a callback, then either release the slot or dispense the item and write an audit entry.

Every side effect (reserving the slot, releasing it, kicking off payment, dispensing the item, and writing the audit entry) is enclosed in a step or wait_for_callback. If the function crashes after the dispense operation, the subsequent invocation replays the workflow, skips the already completed steps, and produces the same outcome without triggering a second dispense.
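The workflow described above can be approximated with a self-contained simulation. This is a sketch under stated assumptions, not the author's handler and not the real SDK: FakeDurableContext, Suspended, and the helper functions are hypothetical, and the callback reply is delivered through a plain dictionary.

```python
from typing import Any, Callable


class Suspended(Exception):
    """Raised by the stand-in context when the workflow has to pause."""


class FakeDurableContext:
    """In-memory stand-in for a durable context (not the real SDK)."""

    def __init__(self, history: dict[str, Any], callbacks: dict[str, Any]) -> None:
        self.history = history      # checkpointed results, shared across invocations
        self.callbacks = callbacks  # replies delivered by the external system

    def step(self, fn: Callable[[], Any], name: str) -> Any:
        if name not in self.history:    # first run: execute, then checkpoint
            self.history[name] = fn()
        return self.history[name]       # replay: return the stored result

    def wait_for_callback(self, callback_starter: Callable[[str], None], name: str) -> Any:
        started_key = f"{name}:started"
        if started_key not in self.history:
            callback_starter(name)      # hand a callback id to the external system
            self.history[started_key] = True
        if name not in self.callbacks:
            raise Suspended(name)       # no reply yet: end this invocation
        return self.callbacks[name]


dispensed: list[str] = []         # stands in for the physical machine
payments_started: list[tuple] = []
audit_attempts: list[str] = []


def dispense(slot: str) -> str:
    dispensed.append(slot)
    return "dispensed"


def write_audit(order_id: str) -> str:
    audit_attempts.append(order_id)
    if len(audit_attempts) == 1:
        raise RuntimeError("simulated crash right after dispensing")
    return "audited"


def vending_handler(event: dict, ctx: FakeDurableContext) -> dict:
    order_id, slot = event["order_id"], event["slot"]
    ctx.step(lambda: bool(slot), name="validate")
    ctx.step(lambda: f"reserved:{slot}", name="reserve-slot")
    payment = ctx.wait_for_callback(
        lambda callback_id: payments_started.append((order_id, callback_id)),
        name="payment",
    )
    if payment != "PAID":
        ctx.step(lambda: f"released:{slot}", name="release-slot")
        return {"order_id": order_id, "status": "CANCELLED"}
    ctx.step(lambda: dispense(slot), name="dispense")
    ctx.step(lambda: write_audit(order_id), name="audit")
    return {"order_id": order_id, "status": "DISPENSED"}


history: dict[str, Any] = {}
callbacks: dict[str, Any] = {}
event = {"order_id": "o-1", "slot": "A3"}

try:
    vending_handler(event, FakeDurableContext(history, callbacks))
except Suspended:
    pass                           # invocation 1: paused at the payment callback

callbacks["payment"] = "PAID"      # the external payment system replies

try:
    vending_handler(event, FakeDurableContext(history, callbacks))
except RuntimeError:
    pass                           # invocation 2: crashed after dispensing

result = vending_handler(event, FakeDurableContext(history, callbacks))  # invocation 3
print(result, dispensed, payments_started)
```

Across three invocations the item is dispensed once and the payment is started once, because both side effects are checkpointed before the simulated crash.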

Invocation, Event Sources, Retries, and Idempotency

Durable functions are invoked like any other AWS Lambda function: from the console test feature, Amazon API Gateway, AWS Lambda function URLs, or other AWS Lambda functions using a qualified ARN (version or alias). Each run appears in the Durable executions tab with a full step history.

For event source mappings such as Amazon SQS or Amazon Kinesis, the usual per-invocation timeout still applies: if a durable function processes a batch directly, all work (including waits) must stay within that limit. For long-running workflows, an intermediate non-durable AWS Lambda function often starts the durable execution asynchronously and returns immediately.

Retries occur at both the durable-operation level (step retry policies) and the AWS Lambda infrastructure level, so steps must be idempotent, typically by using stable identifiers such as order_id or explicit idempotency keys.
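One common way to make a step safe under retries is to derive an idempotency key from a stable identifier and let the downstream service deduplicate on it. The sketch below assumes a hypothetical in-memory PaymentGateway; a real payment provider would expose its own idempotency mechanism.

```python
class PaymentGateway:
    """Hypothetical downstream service that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self.charges: dict[str, dict] = {}

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self.charges:
            # A retry with the same key returns the original receipt.
            return self.charges[idempotency_key]
        receipt = {
            "key": idempotency_key,
            "amount_cents": amount_cents,
            "charge_no": len(self.charges) + 1,
        }
        self.charges[idempotency_key] = receipt
        return receipt


def charge_step(gateway: PaymentGateway, order_id: str, amount_cents: int) -> dict:
    """Step body: the key is derived from order_id, so every retry reuses it."""
    return gateway.charge(f"charge:{order_id}", amount_cents)


gateway = PaymentGateway()
first = charge_step(gateway, "order-42", 250)
retry = charge_step(gateway, "order-42", 250)   # simulated retry of the same step

print(first == retry, len(gateway.charges))
```

Because the key depends only on order_id, a retry at either the step level or the infrastructure level cannot create a second charge.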

Security, Testing, Monitoring, and Best Practices

The execution role of a durable function requires standard logging and service permissions, plus durable-state actions such as lambda:CheckpointDurableExecution and related APIs; AWS managed policies such as AWSLambdaBasicDurableExecutionRolePolicy bundle these permissions. Durable state is encrypted both at rest and in transit, and durable API calls are logged in AWS CloudTrail for audit purposes.

Testing can be performed in the AWS Lambda console by using sample events and inspecting the resulting durable executions. Callbacks can be completed via a dedicated callback AWS Lambda function or directly in the console. Monitoring relies on Amazon CloudWatch Logs, Amazon CloudWatch metrics and alarms, and the durable executions view for step-by-step inspection.

Recommended practices include deterministic orchestration (keeping non-deterministic logic inside steps), descriptive step names, and least-privilege AWS IAM permissions for both durable state and downstream services.
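The determinism recommendation can be seen in a small simulation. The ReplayContext class below is a hypothetical stand-in rather than the SDK; it only shows why a random value generated outside a step changes on every replay, while the same value generated inside a step stays stable.

```python
import uuid
from typing import Any, Callable


class ReplayContext:
    """Toy stand-in for a durable context: returns stored results on replay."""

    def __init__(self, history: dict[str, Any]) -> None:
        self.history = history

    def step(self, fn: Callable[[], Any], name: str) -> Any:
        if name not in self.history:
            self.history[name] = fn()   # executed and checkpointed once
        return self.history[name]


def handler(ctx: ReplayContext) -> tuple[str, str]:
    outside = uuid.uuid4().hex          # orchestration code: regenerated on each replay
    inside = ctx.step(lambda: uuid.uuid4().hex, name="make-request-id")
    return outside, inside


history: dict[str, Any] = {}
first_outside, first_inside = handler(ReplayContext(history))
replay_outside, replay_inside = handler(ReplayContext(history))

print(first_inside == replay_inside, first_outside == replay_outside)
```

Any branch that depended on the value generated outside the step could take a different path after a replay, which is why such logic belongs inside a step.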

Conclusion

AWS Lambda durable functions allow complex, multi-step workflows, such as the vending machine example, to be modeled as simple Python code while AWS transparently handles checkpoints, waits, and recovery.

By combining deterministic orchestration, idempotent steps, and proper use of callbacks, durable functions provide a cloud-native way to run long-lived business processes entirely within the AWS Lambda environment.

FAQs

1. When is a durable function appropriate?

ANS: – A durable function is well-suited to workflows with multiple stages, waits, or callbacks, such as payments, approvals, or long-running document processing, where checkpointing and automated recovery are valuable, but a code-centric model inside AWS Lambda is preferred over an external state machine.

2. How long can a single workflow run?

ANS: – A single durable execution can run as long as the configured ExecutionTimeout allows (up to one year), while each individual invocation between waits must stay under the normal Lambda runtime limit.

3. What happens if an external system never calls back?

ANS: – A callback wait can include a timeout via WaitForCallbackConfig. When that timeout expires, the durable operation fails in a controlled way, and the function can respond accordingly.

WRITTEN BY Rishi Raj Saikia

Rishi works as an Associate Architect. He is a dynamic professional with a strong background in data and IoT solutions, helping businesses transform raw information into meaningful insights. He has experience in designing smart systems that seamlessly connect devices and streamline data flow. Skilled in addressing real-world challenges by combining technology with practical thinking, Rishi is passionate about creating efficient, impactful solutions that drive measurable results.
