Automating Controlled EC2 Instance Refresh Across Multiple Auto Scaling Groups

Introduction

In dynamic AWS environments, updating an AMI is routine. The challenge begins after the AMI is ready.

If you manage many Auto Scaling Groups (ASGs), triggering instance refreshes for all of them simultaneously can overwhelm downstream systems such as databases, caches, or third-party integrations. In our case, database connection limits made parallel refreshes risky.

To solve this, we designed a controlled, batch-driven refresh mechanism using AWS Step Functions, AWS Lambda, and Amazon DynamoDB. The system refreshes ASGs gradually, monitors progress, and moves to the next batch only when it is safe to do so.

This approach gives us automation without sacrificing stability.

Problem Statement

When a new AMI becomes available, we want every ASG to adopt it.

However:

Refreshing all ASGs at once can spike DB connections.
Manual sequencing is slow and error-prone.
Teams need visibility into which groups are in progress or completed.
Failures must not impact unrelated workloads.

We need something automated, controlled, observable, and repeatable.

Overview of the Solution

The architecture uses orchestration rather than brute force.

Instead of triggering refresh everywhere, we:

Divide ASGs into smaller batches.
Refresh one batch at a time.
Track progress in DynamoDB.
Move forward only after the current batch is healthy.

AWS Step Functions acts as the conductor, while Lambda performs the actions.

AWS Services Used

Amazon EC2 Auto Scaling – Executes instance refresh.

AWS Lambda – Runs the batching and refresh logic in two functions:

AWS Lambda 1: prepares batches.
AWS Lambda 2: triggers and monitors refresh.

AWS Step Functions – Orchestrates the workflow and sequencing.

Amazon DynamoDB – Stores refresh state and progress.

AWS IAM – Grants permissions between services.

Solution Architecture & Flow

Here’s how the system works end-to-end:

Step 1 – Input

We start with a list of ASGs that must adopt the new AMI.

Step 2 – Batch Creation (Lambda 1)

The first AWS Lambda divides ASGs into manageable groups (for example, 2–3 at a time, depending on DB capacity).

These batches are passed to the AWS Step Functions state machine.
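The batching step can be sketched as follows. This is a minimal illustration rather than the code from the repository: the function name `create_batches` and the event fields `asg_names` and `batch_size` are assumptions made for this example.

```python
from typing import Any, Dict, List


def create_batches(asg_names: List[str], batch_size: int = 2) -> List[List[str]]:
    """Split ASG names into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Drop duplicate names while preserving the original order.
    unique_names = list(dict.fromkeys(asg_names))
    return [
        unique_names[i:i + batch_size]
        for i in range(0, len(unique_names), batch_size)
    ]


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    # Hypothetical input: {"asg_names": ["web-asg", "api-asg"], "batch_size": 2}
    batches = create_batches(event["asg_names"], int(event.get("batch_size", 2)))
    # The state machine can then iterate over the "batches" list.
    return {"batches": batches, "batch_count": len(batches)}
```

Keeping the batch size as an input rather than a constant makes it easy to tune it to the capacity of the database behind each set of ASGs.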

Step 3 – Refresh Execution (AWS Lambda 2)

For each batch, the second AWS Lambda:

Starts the instance refresh.
Updates tracking information in Amazon DynamoDB.
Checks refresh status.
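The first two actions could look roughly like the sketch below. It assumes a boto3 Auto Scaling client and a boto3 DynamoDB `Table` resource are passed in; the function name, the item attributes (`asg_name`, `refresh_id`, `status`, `started_at`), and the 90% healthy threshold are illustrative assumptions, not values taken from the repository.

```python
import time
from typing import Any, Dict, List


def start_batch_refresh(
    batch: List[str],
    autoscaling: Any,  # e.g. boto3.client("autoscaling")
    table: Any,        # e.g. boto3.resource("dynamodb").Table("<your-table>")
    min_healthy_percentage: int = 90,
) -> Dict[str, str]:
    """Start a rolling instance refresh for each ASG in the batch.

    Returns a mapping of ASG name to instance refresh ID.
    """
    refresh_ids: Dict[str, str] = {}
    for asg_name in batch:
        response = autoscaling.start_instance_refresh(
            AutoScalingGroupName=asg_name,
            Strategy="Rolling",
            Preferences={"MinHealthyPercentage": min_healthy_percentage},
        )
        refresh_id = response["InstanceRefreshId"]
        refresh_ids[asg_name] = refresh_id
        # Hypothetical item layout; the real table schema may differ.
        table.put_item(
            Item={
                "asg_name": asg_name,
                "refresh_id": refresh_id,
                "status": "STARTED",
                "started_at": int(time.time()),
            }
        )
    return refresh_ids
```

Passing the clients in as arguments keeps the function easy to exercise with stand-in objects before it touches a real ASG.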

Step 4 – Wait & Recheck

The state machine waits for completion signals and repeatedly evaluates progress.
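A status check that the state machine can call on each loop might look like this. It is a sketch under assumptions: the function name and the `SUCCEEDED` / `FAILED` / `IN_PROGRESS` labels are invented for this example, and treating a completed rollback as a failure is a design choice made here, not something the original implementation is known to do.

```python
from typing import Any, Dict

# Instance refresh statuses that this sketch treats as a failed batch.
FAILED_STATUSES = {"Failed", "Cancelled", "RollbackFailed", "RollbackSuccessful"}


def check_batch_status(refresh_ids: Dict[str, str], autoscaling: Any) -> Dict[str, Any]:
    """Look up each refresh and reduce the results to one batch-level status."""
    statuses: Dict[str, str] = {}
    for asg_name, refresh_id in refresh_ids.items():
        response = autoscaling.describe_instance_refreshes(
            AutoScalingGroupName=asg_name,
            InstanceRefreshIds=[refresh_id],
        )
        statuses[asg_name] = response["InstanceRefreshes"][0]["Status"]

    if any(status in FAILED_STATUSES for status in statuses.values()):
        batch_status = "FAILED"
    elif all(status == "Successful" for status in statuses.values()):
        batch_status = "SUCCEEDED"
    else:
        # Pending, InProgress, Baking, Cancelling, RollbackInProgress, ...
        batch_status = "IN_PROGRESS"
    return {"batch_status": batch_status, "statuses": statuses}
```

Returning a single batch-level value keeps the decision in the state machine to a simple string comparison.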

Step 5 – Move to Next Batch

Only after the current batch finishes successfully does the workflow proceed.

This keeps the rollout controlled and prevents a sudden load spike on shared dependencies.
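The five steps above can be expressed as a state machine definition. The sketch below builds an Amazon States Language document as a Python dict; it is not the definition from the repository. It assumes the execution input contains a `batches` list, that one Lambda function handles both a hypothetical `start` and `check` action, and that the function returns `refresh_ids` and `batch_status` fields. All state names are illustrative.

```python
import json
from typing import Any, Dict


def build_state_machine(refresh_lambda_arn: str, wait_seconds: int = 60) -> Dict[str, Any]:
    """Return a state machine definition that refreshes one batch at a time."""
    per_batch = {
        "StartAt": "StartRefresh",
        "States": {
            "StartRefresh": {
                "Type": "Task",
                "Resource": refresh_lambda_arn,
                # Each Map iteration receives one batch (a list of ASG names).
                "Parameters": {"action": "start", "batch.$": "$"},
                "Next": "WaitForRefresh",
            },
            "WaitForRefresh": {"Type": "Wait", "Seconds": wait_seconds, "Next": "CheckStatus"},
            "CheckStatus": {
                "Type": "Task",
                "Resource": refresh_lambda_arn,
                "Parameters": {"action": "check", "refresh_ids.$": "$.refresh_ids"},
                "ResultPath": "$.check",
                "Next": "IsBatchDone",
            },
            "IsBatchDone": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.check.batch_status", "StringEquals": "SUCCEEDED", "Next": "BatchDone"},
                    {"Variable": "$.check.batch_status", "StringEquals": "FAILED", "Next": "BatchFailed"},
                ],
                # Anything else means the batch is still in progress: wait again.
                "Default": "WaitForRefresh",
            },
            "BatchDone": {"Type": "Succeed"},
            "BatchFailed": {
                "Type": "Fail",
                "Error": "BatchRefreshFailed",
                "Cause": "An instance refresh in this batch did not succeed.",
            },
        },
    }
    return {
        "Comment": "Refresh ASG batches sequentially",
        "StartAt": "ProcessBatches",
        "States": {
            "ProcessBatches": {
                "Type": "Map",
                "ItemsPath": "$.batches",
                # One batch at a time; the next starts only after this one ends.
                "MaxConcurrency": 1,
                "ItemProcessor": {"ProcessorConfig": {"Mode": "INLINE"}, **per_batch},
                "End": True,
            }
        },
    }


if __name__ == "__main__":
    # Placeholder ARN for illustration only.
    arn = "arn:aws:lambda:us-east-1:123456789012:function:example-refresh"
    print(json.dumps(build_state_machine(arn), indent=2))
```

Because a failed iteration fails the `Map` state, a failed batch stops the remaining batches from starting unless a catcher is added.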

Why AWS Step Functions?

Before this, the options were:

Run scripts manually
Trigger all ASGs together
Maintain complicated custom schedulers

AWS Step Functions provides:

visual workflow
built-in retries
controlled parallelism
easier auditing
simpler failure handling

It converts operational stress into predictable automation.
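As an illustration of the built-in retries and failure handling mentioned above, the helper below adds `Retry` and `Catch` fields to a Task state expressed as a Python dict. The helper name, the retry intervals, and the attempt count are assumptions chosen for this sketch; the error names are the ones Lambda task states commonly retry on.

```python
import copy
from typing import Any, Dict


def with_retry_and_catch(task_state: Dict[str, Any], failure_state: str) -> Dict[str, Any]:
    """Return a copy of a Task state with retry and catch behaviour attached."""
    state = copy.deepcopy(task_state)
    state["Retry"] = [
        {
            # Transient Lambda errors: retry with exponential backoff.
            "ErrorEquals": [
                "Lambda.ServiceException",
                "Lambda.AWSLambdaException",
                "Lambda.SdkClientException",
                "Lambda.TooManyRequestsException",
            ],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ]
    state["Catch"] = [
        {
            # Anything still failing after the retries goes to the failure state,
            # with the error details kept under $.error for later inspection.
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",
            "Next": failure_state,
        }
    ]
    return state
```

The `failure_state` argument must name a state that exists in the same state machine, for example one that records the failure and stops the rollout.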

Role of Amazon DynamoDB in the Design

Amazon DynamoDB acts as the source of truth during execution.

We store:

Which ASGs are in which batch
Refresh start time
Current status
Completion markers

This helps in:

Observability
Restart capability
Troubleshooting
Avoiding duplicate actions

Without a state store, long-running orchestrations become messy.
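One possible shape for the tracking record is sketched below. The attribute names and the use of `asg_name` as the table key are assumptions for illustration; the schema in the repository may differ. Note that `status` is a DynamoDB reserved word, so the update expression refers to it through a placeholder.

```python
import time
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class RefreshRecord:
    asg_name: str                       # which ASG
    batch_index: int                    # which batch it belongs to
    status: str = "PENDING"             # current status
    started_at: Optional[int] = None    # refresh start time (epoch seconds)
    completed_at: Optional[int] = None  # completion marker

    def to_item(self) -> Dict[str, Any]:
        # Leave out attributes that have not been set yet.
        return {k: v for k, v in asdict(self).items() if v is not None}


def mark_status(table: Any, asg_name: str, status: str, completed: bool = False) -> None:
    """Update the status of one ASG, optionally stamping a completion time.

    `table` is assumed to be a boto3 DynamoDB Table resource.
    """
    expression = "SET #s = :s"
    values: Dict[str, Any] = {":s": status}
    if completed:
        expression += ", completed_at = :c"
        values[":c"] = int(time.time())
    table.update_item(
        Key={"asg_name": asg_name},
        UpdateExpression=expression,
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues=values,
    )
```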

Implementation Strategy

To keep this post concise, we have published the complete implementation in a public GitHub repository:

GitHub Repo:
https://github.com/DeepakRao121/POC/tree/main/ec2InstanceRefresh

You’ll find:

AWS Step Functions state machine definition
Batch creation AWS Lambda
Instance refresh AWS Lambda
Amazon DynamoDB interaction logic
AWS IAM examples

This lets readers deploy or adapt the solution directly.

Key Design Decisions

Controlled Concurrency

We intentionally limit the number of ASGs that refresh at once to protect database and application dependencies.

Idempotency

The workflow can be re-run without starting duplicate refreshes for ASGs that are already in progress or completed.
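One common way to get this behaviour is a conditional write in DynamoDB, sketched below. This is an assumed technique, not necessarily the one used in the repository: the function name `claim_asg`, the `ami_id` attribute, and the `CLAIMED` status are hypothetical. The write succeeds only if the ASG has no record yet or its record is for a different AMI.

```python
from typing import Any

try:
    from botocore.exceptions import ClientError
except ImportError:
    # Stand-in so the sketch can run where botocore is not installed.
    class ClientError(Exception):
        def __init__(self, error_response, operation_name):
            super().__init__(operation_name)
            self.response = error_response


def claim_asg(table: Any, asg_name: str, ami_id: str) -> bool:
    """Return True if this run may refresh the ASG, False if it was already claimed.

    `table` is assumed to be a boto3 DynamoDB Table resource keyed on asg_name.
    """
    try:
        table.put_item(
            Item={"asg_name": asg_name, "ami_id": ami_id, "status": "CLAIMED"},
            # Allow the write only for a new ASG or a new AMI rollout.
            ConditionExpression="attribute_not_exists(asg_name) OR ami_id <> :ami",
            ExpressionAttributeValues={":ami": ami_id},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise
```

Calling `claim_asg` before starting a refresh means a re-run skips every ASG that an earlier run already picked up for the same AMI.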

Separation of Responsibilities

One AWS Lambda plans, another executes. This keeps logic clean and maintainable.

Observable State

Everything important is queryable via Amazon DynamoDB.
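For example, a small helper can summarize rollout progress by counting records per status. The sketch assumes a boto3 DynamoDB `Table` resource and a `status` attribute on each item, both of which are illustrative. A scan is acceptable here only because the tracking table is expected to stay small.

```python
from collections import Counter
from typing import Any, Dict


def summarize_progress(table: Any) -> Dict[str, int]:
    """Count tracking records by status, following scan pagination."""
    counts: Counter = Counter()
    scan_kwargs: Dict[str, Any] = {}
    while True:
        page = table.scan(**scan_kwargs)
        for item in page.get("Items", []):
            counts[item.get("status", "UNKNOWN")] += 1
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        # Continue from where the previous page stopped.
        scan_kwargs["ExclusiveStartKey"] = last_key
    return dict(counts)
```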

Cost Perspective

The architecture is almost entirely serverless.

You pay for:

AWS Lambda invocations
AWS Step Functions state transitions
Small Amazon DynamoDB reads/writes

Compared to downtime risk or engineering hours, the cost is minimal.

Extending the Solution

This model can be enhanced further to support:

environment-based prioritization
approval workflows before the next batch
canary ASG refresh
automatic rollback triggers
Slack / Teams notifications

It provides a strong foundation for mature release engineering.

Conclusion

AMI updates are inevitable in any AWS environment. Whether driven by security patches, application upgrades, or compliance requirements, new images must be rolled out regularly to keep infrastructure healthy.

However, performing these updates without a controlled strategy can quickly introduce instability. Simultaneous instance replacements across multiple Auto Scaling Groups may overload shared dependencies such as databases, caches, authentication systems, or external APIs.

The real challenge is not how to refresh, but how to refresh safely at scale.

By combining AWS Step Functions for orchestration, AWS Lambda for execution logic, and Amazon DynamoDB for state tracking, we introduce structure and intelligence into what would otherwise be a disruptive operation. Batching ensures that only a limited number of fleet changes are made at any given time. Centralized tracking provides visibility into progress and simplifies troubleshooting. Automated transitions between stages remove the need for manual supervision while still maintaining strict control.

This approach converts instance refresh from a high-risk maintenance activity into a predictable, repeatable deployment pipeline. Teams gain confidence, operational noise is reduced, and updates can occur more frequently without fear of cascading failures.

Most importantly, the design embraces core DevOps principles: automation over manual effort, visibility over guesswork, and resilience over speed. Instead of reacting to problems, the system prevents them.

About CloudThat

CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications, winning global recognition for its training excellence, including 20 MCT Trainers in Microsoft’s Global Top 100 and 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.

FAQs

1. Can the batch size be changed?

ANS: – Yes. Modify the batch AWS Lambda logic to match your dependency capacity.

2. Does this work for hundreds of ASGs?

ANS: – Absolutely. AWS Step Functions can orchestrate at scale while maintaining order.

3. What happens if a refresh fails?

ANS: – You can configure retries or stop the pipeline. The state remains visible in Amazon DynamoDB.

WRITTEN BY Deepak S

Deepak S is a Senior Research Associate at CloudThat, specializing in AWS services. He is passionate about exploring new technologies in cloud and is also an automobile enthusiast.
