Introduction
In dynamic AWS environments, updating an AMI is routine. The challenge begins after the AMI is ready.
If you manage many Auto Scaling Groups (ASGs), triggering instance refreshes for all of them simultaneously can overwhelm downstream systems such as databases, caches, or third-party integrations. In our case, database connection limits made parallel refreshes risky.
To solve this, we designed a controlled, batch-driven refresh mechanism using AWS Step Functions, AWS Lambda, and Amazon DynamoDB. The system refreshes ASGs gradually, monitors progress, and moves to the next batch only when it’s safe.
This approach gives us automation without sacrificing stability.
Problem Statement
When a new AMI becomes available, we want every ASG to adopt it.
However:
- Refreshing all ASGs at once can spike DB connections.
- Manual sequencing is slow and error-prone.
- Teams need visibility into which groups are in progress or completed.
- Failures must not impact unrelated workloads.
We need something automated, controlled, observable, and repeatable.
Overview of the Solution
The architecture uses orchestration rather than brute force.
Instead of triggering a refresh on every ASG at once, we:
- Divide ASGs into smaller batches.
- Refresh one batch at a time.
- Track progress in DynamoDB.
- Move forward only after the current batch is healthy.
AWS Step Functions acts as the conductor, while Lambda performs the actions.
AWS Services Used
Amazon EC2 Auto Scaling – Executes instance refresh.
AWS Lambda –
- AWS Lambda 1: prepares batches.
- AWS Lambda 2: triggers and monitors refresh.
AWS Step Functions – Orchestrates the workflow and sequencing.
Amazon DynamoDB – Stores refresh state and progress.
AWS IAM – Grants permissions between services.
Solution Architecture & Flow
Here’s how the system works end-to-end:
Step 1 – Input
We start with a list of ASGs that must adopt the new AMI.
Step 2 – Batch Creation (AWS Lambda 1)
The first AWS Lambda divides ASGs into manageable groups (for example, 2–3 at a time, depending on DB capacity).
These batches are passed to the AWS Step Functions state machine.
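As a rough illustration of the batching step, here is a minimal sketch of what the first AWS Lambda could look like. It is not the code from the repository: the event keys `asg_names` and `batch_size` and the default of 2 are assumptions made for this example.

```python
from typing import Any


def chunk_asgs(asg_names: list[str], batch_size: int) -> list[list[str]]:
    """Split ASG names into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        asg_names[start:start + batch_size]
        for start in range(0, len(asg_names), batch_size)
    ]


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    # "asg_names" and "batch_size" are hypothetical input keys.
    asg_names = event["asg_names"]
    batch_size = int(event.get("batch_size", 2))
    # The returned list of lists is what the state machine would iterate over.
    return {"batches": chunk_asgs(asg_names, batch_size)}
```

Keeping the split in a small pure function makes the batch size easy to tune and easy to test without touching AWS.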
Step 3 – Refresh Execution (AWS Lambda 2)
For each batch, the second AWS Lambda:
- Starts the instance refresh.
- Updates tracking information in Amazon DynamoDB.
- Checks refresh status.
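The three actions above could be sketched roughly as follows. This is an assumption-laden outline rather than the published implementation: the attribute names (`asg_name`, `refresh_id`, `status`, `started_at`) are hypothetical, and the AWS clients are passed in as arguments so the logic can be exercised with stand-ins. The calls used, `start_instance_refresh`, `describe_instance_refreshes`, and `put_item`, are the standard boto3 operations.

```python
from datetime import datetime, timezone
from typing import Any


def start_batch_refresh(autoscaling: Any, table: Any, batch: list[str]) -> list[dict[str, str]]:
    """Start an instance refresh for each ASG in the batch and record it.

    `autoscaling` should behave like boto3.client("autoscaling") and
    `table` like a boto3 DynamoDB Table resource.
    """
    started = []
    for asg_name in batch:
        response = autoscaling.start_instance_refresh(AutoScalingGroupName=asg_name)
        record = {
            "asg_name": asg_name,  # hypothetical partition key
            "refresh_id": response["InstanceRefreshId"],
            "status": "Pending",
            "started_at": datetime.now(timezone.utc).isoformat(),
        }
        table.put_item(Item=record)
        started.append(record)
    return started


def get_refresh_status(autoscaling: Any, asg_name: str, refresh_id: str) -> str:
    """Return the current status string of one instance refresh."""
    response = autoscaling.describe_instance_refreshes(
        AutoScalingGroupName=asg_name,
        InstanceRefreshIds=[refresh_id],
    )
    return response["InstanceRefreshes"][0]["Status"]
```

Inside a real AWS Lambda, the two arguments would typically come from `boto3.client("autoscaling")` and `boto3.resource("dynamodb").Table(...)` with whatever table name you choose.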
Step 4 – Wait & Recheck
The state machine pauses between checks and re-evaluates refresh status until the current batch completes.
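One common way to express a wait-and-recheck loop in Amazon States Language is a Wait state followed by a Task and a Choice that loops back. The sketch below builds such a fragment as a Python dictionary. The state names, the `$.batch_status` field, and the 60-second default are assumptions for illustration; the actual definition in the repository may differ.

```python
import json


def build_polling_loop(check_lambda_arn: str, wait_seconds: int = 60) -> dict:
    """Return a state machine fragment that polls until a batch finishes."""
    return {
        "StartAt": "WaitBeforeCheck",
        "States": {
            # Pause so the refresh has time to make progress.
            "WaitBeforeCheck": {
                "Type": "Wait",
                "Seconds": wait_seconds,
                "Next": "CheckRefreshStatus",
            },
            # Invoke the status-checking Lambda (ARN supplied by the caller).
            "CheckRefreshStatus": {
                "Type": "Task",
                "Resource": check_lambda_arn,
                "Next": "IsBatchDone",
            },
            # Branch on the hypothetical "batch_status" field of the output.
            "IsBatchDone": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.batch_status", "StringEquals": "Successful", "Next": "BatchSucceeded"},
                    {"Variable": "$.batch_status", "StringEquals": "Failed", "Next": "BatchFailed"},
                ],
                # Anything else means "still running": loop back and wait.
                "Default": "WaitBeforeCheck",
            },
            "BatchSucceeded": {"Type": "Succeed"},
            "BatchFailed": {
                "Type": "Fail",
                "Error": "BatchRefreshFailed",
                "Cause": "An instance refresh in the batch did not succeed",
            },
        },
    }


if __name__ == "__main__":
    # Placeholder ARN; substitute a real function ARN before deploying.
    arn = "arn:aws:lambda:REGION:ACCOUNT_ID:function:check-refresh"
    print(json.dumps(build_polling_loop(arn), indent=2))
```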
Step 5 – Move to Next Batch
Only after the current batch finishes successfully does the workflow proceed.
This keeps the rollout controlled and avoids a sudden load spike on shared dependencies.
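The "finished successfully" decision can be reduced to a small function over the per-ASG refresh statuses. The sketch below is one possible policy, not necessarily the one used in the repository: it treats `Failed`, `Cancelled`, `RollbackFailed`, and `RollbackSuccessful` as failures for the batch, and any other non-successful status as still in progress.

```python
# Instance refresh statuses treated as a failed rollout (an assumption;
# adjust to your own policy, for example if a clean rollback is acceptable).
FAILED_STATUSES = {"Failed", "Cancelled", "RollbackFailed", "RollbackSuccessful"}


def summarize_batch(statuses: list[str]) -> str:
    """Collapse per-ASG refresh statuses into one batch-level status."""
    if any(status in FAILED_STATUSES for status in statuses):
        return "Failed"
    if statuses and all(status == "Successful" for status in statuses):
        return "Successful"
    # Pending, InProgress, and similar states (or an empty list) land here.
    return "InProgress"
```

Only a result of `"Successful"` would let the workflow advance to the next batch; `"InProgress"` would send it back to waiting.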
Why AWS Step Functions?
Before this, the options were:
- Run scripts manually
- Trigger all ASGs together
- Maintain complicated custom schedulers
AWS Step Functions gives us:
- visual workflow
- built-in retries
- controlled parallelism
- easier auditing
- simpler failure handling
It converts operational stress into predictable automation.
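To make "built-in retries" and "controlled parallelism" concrete, here is a hedged sketch of a state machine definition built as a Python dictionary. A Map state with `MaxConcurrency` set to 1 processes the batches one at a time and in order, and the Task carries `Retry` and `Catch` rules. The state names, retry numbers, and the `$.batches` input path are assumptions for illustration only.

```python
import json


def build_state_machine(refresh_lambda_arn: str) -> dict:
    """Return a definition that processes batches sequentially with retries."""
    return {
        "StartAt": "RefreshBatches",
        "States": {
            "RefreshBatches": {
                "Type": "Map",
                "ItemsPath": "$.batches",  # hypothetical input field
                "MaxConcurrency": 1,       # one batch at a time, in order
                "ItemProcessor": {
                    "ProcessorConfig": {"Mode": "INLINE"},
                    "StartAt": "RefreshBatch",
                    "States": {
                        "RefreshBatch": {
                            "Type": "Task",
                            "Resource": refresh_lambda_arn,
                            # Retry failed task attempts with backoff.
                            "Retry": [
                                {
                                    "ErrorEquals": ["States.TaskFailed"],
                                    "IntervalSeconds": 30,
                                    "MaxAttempts": 3,
                                    "BackoffRate": 2.0,
                                }
                            ],
                            # After retries are exhausted, fail the batch.
                            "Catch": [
                                {"ErrorEquals": ["States.ALL"], "Next": "BatchFailed"}
                            ],
                            "End": True,
                        },
                        "BatchFailed": {"Type": "Fail", "Error": "BatchRefreshFailed"},
                    },
                },
                "End": True,
            }
        },
    }


if __name__ == "__main__":
    # Placeholder ARN; substitute a real function ARN before deploying.
    arn = "arn:aws:lambda:REGION:ACCOUNT_ID:function:refresh-batch"
    print(json.dumps(build_state_machine(arn), indent=2))
```

With an inline Map state, a failed iteration fails the Map state itself, so the remaining batches are not started.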
Role of Amazon DynamoDB in the Design
Amazon DynamoDB acts as the source of truth during execution.
We store:
- Which ASGs are in which batch
- Refresh start time
- Current status
- Completion markers
This helps in:
- Observability
- Restart capability
- Troubleshooting
- Avoiding duplicate actions
Without a state store, long-running orchestrations become messy.
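As an illustration of how status and completion markers might be written, the sketch below updates one tracking item per ASG. The key and attribute names (`asg_name`, `status`, `updated_at`, `completed_at`) are assumptions, not the repository's schema. Because `status` is a DynamoDB reserved word, it is referenced through an expression attribute name.

```python
from datetime import datetime, timezone
from typing import Any

# Statuses after which no further updates are expected (an assumption).
TERMINAL_STATUSES = {"Successful", "Failed", "Cancelled"}


def update_refresh_status(table: Any, asg_name: str, status: str) -> None:
    """Write the latest status for an ASG, adding a completion marker if done.

    `table` should behave like a boto3 DynamoDB Table resource.
    """
    now = datetime.now(timezone.utc).isoformat()
    expression = "SET #status = :status, updated_at = :now"
    if status in TERMINAL_STATUSES:
        # Completion marker: only set once the refresh has finished.
        expression += ", completed_at = :now"
    table.update_item(
        Key={"asg_name": asg_name},  # hypothetical partition key
        UpdateExpression=expression,
        ExpressionAttributeNames={"#status": "status"},
        ExpressionAttributeValues={":status": status, ":now": now},
    )
```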
Implementation Strategy
To keep this post concise, we have published the complete implementation in a public GitHub repository:
GitHub Repo:
https://github.com/DeepakRao121/POC/tree/main/ec2InstanceRefresh
You’ll find:
- AWS Step Functions state machine definition
- Batch creation AWS Lambda
- Instance refresh AWS Lambda
- Amazon DynamoDB interaction logic
- AWS IAM examples
This lets readers deploy or adapt the solution directly.
Key Design Decisions
Controlled Concurrency
We intentionally limit the number of ASGs that refresh at once to protect database and application dependencies.
Idempotency
The workflow can be re-run without starting duplicate refreshes for ASGs that are already in progress or complete.
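One way to get this behavior is to look for an active refresh before starting a new one, since EC2 Auto Scaling rejects a second refresh on a group that already has one running. The following is a minimal sketch under that assumption, not the repository's logic; it inspects only the first page of results returned by `describe_instance_refreshes`.

```python
from typing import Any

# Statuses that indicate a refresh is still active on the group.
ACTIVE_STATUSES = {"Pending", "InProgress", "Cancelling", "RollbackInProgress", "Baking"}


def ensure_refresh_started(autoscaling: Any, asg_name: str) -> str:
    """Return the ID of an active refresh, starting one only if none exists.

    `autoscaling` should behave like boto3.client("autoscaling").
    """
    existing = autoscaling.describe_instance_refreshes(AutoScalingGroupName=asg_name)
    for refresh in existing.get("InstanceRefreshes", []):
        if refresh["Status"] in ACTIVE_STATUSES:
            # Reuse the running refresh instead of requesting a duplicate.
            return refresh["InstanceRefreshId"]
    response = autoscaling.start_instance_refresh(AutoScalingGroupName=asg_name)
    return response["InstanceRefreshId"]
```

Calling this twice for the same ASG while a refresh is running returns the same refresh ID both times, which makes a retried or re-run step harmless.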
Separation of Responsibilities
One AWS Lambda plans, another executes. This keeps logic clean and maintainable.
Observable State
Everything important is queryable via Amazon DynamoDB.
Cost Perspective
The architecture is almost entirely serverless.
You pay for:
- AWS Lambda invocations
- AWS Step Functions state transitions
- Small Amazon DynamoDB reads/writes
Compared to downtime risk or engineering hours, the cost is minimal.
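For a back-of-the-envelope feel for the orchestration cost, the sketch below counts state transitions for one rollout. Every number in it is an assumption: it supposes each polling cycle passes through three states (a wait, a status check, and a choice), adds two fixed states for start-up, and takes the per-1,000-transition price as an input that you should read from the current AWS Step Functions pricing page for your Region.

```python
def estimate_state_transitions(
    num_batches: int,
    polls_per_batch: int,
    states_per_poll: int = 3,
    fixed_states: int = 2,
) -> int:
    """Estimate the state transitions consumed by one full rollout."""
    return fixed_states + num_batches * polls_per_batch * states_per_poll


def step_functions_cost(transitions: int, price_per_1000: float) -> float:
    """Convert a transition count into a cost at the given unit price."""
    return transitions / 1000 * price_per_1000


if __name__ == "__main__":
    # Example: 50 batches, each polled 20 times before completing.
    transitions = estimate_state_transitions(num_batches=50, polls_per_batch=20)
    # 0.025 USD per 1,000 transitions is an assumed Standard workflow rate.
    print(transitions, step_functions_cost(transitions, price_per_1000=0.025))
```

Under these assumptions the example comes to 3,002 transitions, which is well under one US dollar of orchestration cost per rollout. AWS Lambda invocations and Amazon DynamoDB requests would be added on top.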
Extending the Solution
This model can be enhanced further to support:
- environment-based prioritization
- approval workflows before the next batch
- canary ASG refresh
- automatic rollback triggers
- Slack / Teams notifications
It provides a strong foundation for mature release engineering.
Conclusion
AMI updates are inevitable in any AWS environment. Whether driven by security patches, application upgrades, or compliance requirements, new images must be rolled out regularly to keep infrastructure healthy.
The real challenge is not how to refresh, but how to refresh safely at scale.
By combining AWS Step Functions for orchestration, AWS Lambda for execution logic, and Amazon DynamoDB for state tracking, we introduce structure and intelligence into what would otherwise be a disruptive operation. Batching ensures that only a limited number of fleet changes are made at any given time. Centralized tracking provides visibility into progress and simplifies troubleshooting. Automated transitions between stages remove the need for manual supervision while still maintaining strict control.
This approach converts instance refresh from a high-risk maintenance activity into a predictable, repeatable deployment pipeline. Teams gain confidence, operational noise is reduced, and updates can occur more frequently without fear of cascading failures.
Most importantly, the design embraces core DevOps principles: automation over manual effort, visibility over guesswork, and resilience over speed. Instead of reacting to problems, the system prevents them.
FAQs
1. Can the batch size be changed?
ANS: – Yes. Modify the batch AWS Lambda logic to match your dependency capacity.
2. Does this work for hundreds of ASGs?
ANS: – Absolutely. AWS Step Functions can orchestrate at scale while maintaining order.
3. What happens if a refresh fails?
ANS: – You can configure retries or stop the pipeline. The state remains visible in Amazon DynamoDB.
WRITTEN BY Deepak S
Deepak S is a Senior Research Associate at CloudThat, specializing in AWS services. He is passionate about exploring new technologies in cloud and is also an automobile enthusiast.
February 12, 2026