Automating EBS Volume Scaling with Amazon CloudWatch, AWS Lambda, and AWS Systems Manager (SSM)

Introduction

In cloud environments, storage utilization is one of those things that quietly grow until they suddenly become a production issue.

On Linux-based EC2 instances, disk usage can increase due to logs, application data, or temporary files. While increasing an Amazon EBS volume is straightforward, doing it manually across multiple systems is not scalable.

The real challenge is not resizing a disk once.
It is ensuring that disk capacity adjusts automatically when needed, without manual intervention.

To solve this, we built a fully automated workflow using Amazon CloudWatch, AWS Lambda, and AWS Systems Manager that:

detects high disk usage
increases the EBS volume
expands the filesystem inside the OS

All without logging into the instance.

Problem Statement

In most environments:

Disk usage is monitored
Alerts are configured
But remediation is manual

This leads to common issues:

Engineers responding late to alerts
Downtime due to full disks
Manual SSH intervention
Inconsistent handling across systems

We needed a solution that:

automatically reacts to disk thresholds
works across instances
requires no manual login
handles both infrastructure and OS-level changes
is safe and repeatable

Overview of the Solution

The system connects monitoring with automated remediation.

High-level flow:

Amazon CloudWatch Agent publishes disk usage metrics
Amazon CloudWatch Alarm detects threshold breach (e.g., 80%)
AWS Lambda function is triggered
AWS Lambda increases the EBS volume
AWS Lambda waits for the volume modification to reach the optimizing or completed state
SSM runs a script on the instance
The filesystem is expanded automatically

At the end, disk space increases automatically.

AWS Services Used

Amazon EC2 – Hosts workloads
Amazon EBS – Provides scalable storage
Amazon CloudWatch – Collects metrics and triggers alarms
AWS Lambda – Executes scaling logic
AWS Systems Manager – Executes OS-level commands
AWS IAM – Controls access securely

Architecture & Workflow

Step 1 – Metric Collection

CloudWatch Agent runs on the Amazon EC2 instance and publishes:

disk_used_percent for /

This is critical because the default Amazon EC2 metrics do not include disk space utilization.
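
As a rough illustration, the agent configuration for this metric could be generated as shown here. This is a minimal sketch, not the configuration from the repository: the helper name build_agent_config and the 60-second interval are assumptions, while the keys follow the CloudWatch agent's JSON configuration format.

```python
import json


def build_agent_config(mount_point="/", interval_seconds=60):
    """Return a CloudWatch agent config that publishes disk_used_percent.

    The helper name and the default interval are illustrative assumptions.
    """
    return {
        "metrics": {
            # Attach the instance ID so each alarm can be tied to one instance.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "disk": {
                    # The agent prefixes the plugin name: disk + used_percent.
                    "measurement": ["used_percent"],
                    # Only report the mount point we care about.
                    "resources": [mount_point],
                    "metrics_collection_interval": interval_seconds,
                }
            },
        }
    }


if __name__ == "__main__":
    # Print JSON that could be saved as the agent's configuration file.
    print(json.dumps(build_agent_config(), indent=2))
```

By default the agent publishes to the CWAgent namespace and adds dimensions such as path, device, and fstype, so it is worth checking the published metric in the console before building an alarm on it.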

Step 2 – Alarm Trigger

Amazon CloudWatch Alarm is configured:

Metric: disk_used_percent
Threshold: e.g., >= 80%
Action: Trigger Lambda
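
One possible way to define such an alarm from code is sketched here. The function build_alarm_kwargs, the alarm naming scheme, and the Average/300-second settings are assumptions for illustration; the keys themselves are parameters of the CloudWatch PutMetricAlarm API.

```python
def build_alarm_kwargs(instance_id, lambda_arn, dimensions, threshold=80.0):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    `dimensions` must match exactly what the agent publishes for the metric
    (for example InstanceId, path, device, fstype), otherwise the alarm
    never receives data.
    """
    return {
        "AlarmName": f"disk-used-{instance_id}",  # hypothetical naming scheme
        "Namespace": "CWAgent",  # the agent's default namespace
        "MetricName": "disk_used_percent",
        "Dimensions": [
            {"Name": name, "Value": value}
            for name, value in sorted(dimensions.items())
        ],
        "Statistic": "Average",  # assumed; Maximum is also a reasonable choice
        "Period": 300,  # assumed evaluation window in seconds
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [lambda_arn],  # invoke the Lambda function directly
    }
```

With boto3, the result would be passed as boto3.client("cloudwatch").put_metric_alarm(**kwargs). The function's resource-based policy also has to allow CloudWatch alarms to invoke it.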

Step 3 – AWS Lambda Execution

The AWS Lambda function:

extracts the instance ID from the alarm event
identifies the root EBS volume
increases the volume size
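
The three actions above can be sketched as follows. This is an illustration rather than the repository's code: it assumes the alarm invokes Lambda directly (so the event carries alarmData.configuration.metrics), that ec2 is a boto3 EC2 client, and that a fixed 10 GiB increment is acceptable. The function names are hypothetical.

```python
def instance_id_from_alarm_event(event):
    """Pull the InstanceId dimension out of a CloudWatch alarm event."""
    metrics = event["alarmData"]["configuration"]["metrics"]
    for entry in metrics:
        dims = entry.get("metricStat", {}).get("metric", {}).get("dimensions", {})
        if "InstanceId" in dims:
            return dims["InstanceId"]
    raise ValueError("alarm event carries no InstanceId dimension")


def root_volume_id(ec2, instance_id):
    """Return the EBS volume ID mapped to the instance's root device."""
    resp = ec2.describe_instances(InstanceIds=[instance_id])
    instance = resp["Reservations"][0]["Instances"][0]
    root_device = instance["RootDeviceName"]
    for mapping in instance["BlockDeviceMappings"]:
        if mapping["DeviceName"] == root_device and "Ebs" in mapping:
            return mapping["Ebs"]["VolumeId"]
    raise LookupError(f"no EBS root volume found for {instance_id}")


def grow_volume(ec2, volume_id, increase_gib=10):
    """Request a larger size for the volume and return the new size in GiB."""
    current = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]["Size"]
    new_size = current + increase_gib  # assumed fixed increment
    ec2.modify_volume(VolumeId=volume_id, Size=new_size)
    return new_size
```

Inside a handler, ec2 would typically be boto3.client("ec2"), and the function's execution role needs ec2:DescribeInstances, ec2:DescribeVolumes, and ec2:ModifyVolume.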

Step 4 – Waiting for Volume Modification

EBS resizing is not instantaneous.

The AWS Lambda function polls until the volume's modification state becomes:

optimizing or completed

This step is critical: if the OS-level expansion runs while the modification is still in the modifying state, it fails.
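
A polling loop for this wait might look like the sketch below. It assumes ec2 is a boto3 EC2 client; the function name wait_for_resize, the 10-second poll interval, and the 10-minute budget are illustrative choices, not values taken from the repository.

```python
import time


def wait_for_resize(ec2, volume_id, timeout_s=600, poll_s=10, sleep=time.sleep):
    """Poll until the volume modification is usable by the OS.

    Returns the final state ("optimizing" or "completed"), raises on
    failure or when the time budget is exhausted.
    """
    waited = 0
    while True:
        resp = ec2.describe_volumes_modifications(VolumeIds=[volume_id])
        mods = resp["VolumesModifications"]
        # Treat a missing record as "not ready yet".
        state = mods[0]["ModificationState"] if mods else "modifying"
        if state in ("optimizing", "completed"):
            return state
        if state == "failed":
            raise RuntimeError(f"modification of {volume_id} failed")
        if waited >= timeout_s:
            raise TimeoutError(f"{volume_id} still '{state}' after {waited}s")
        sleep(poll_s)
        waited += poll_s
```

The sleep parameter is injectable only so the loop can be exercised in tests without real delays. The time budget should stay below the Lambda function's configured timeout.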

Step 5 – OS-Level Expansion via SSM

AWS Lambda uses SSM to run a script on the instance that:

detects root partition
expands partition (growpart)
resizes filesystem (resize2fs or xfs_growfs)
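
The sketch below shows one way to send such a script through SSM using the AWS-RunShellScript document. The shell commands are an assumption about what the script could contain, not the script from the repository: they expect the root filesystem to sit directly on a partition (no LVM) and cover only ext4 and xfs. The names EXPAND_ROOT_COMMANDS and expand_root_filesystem are hypothetical.

```python
# Shell commands executed on the instance (POSIX sh compatible).
EXPAND_ROOT_COMMANDS = [
    "set -eu",
    # Device and filesystem type backing the root mount, e.g. /dev/nvme0n1p1.
    "ROOT_PART=$(findmnt -n -o SOURCE /)",
    "FS_TYPE=$(findmnt -n -o FSTYPE /)",
    # Parent disk name and partition number of the root partition.
    'DISK=$(lsblk -no PKNAME "$ROOT_PART")',
    'PART_NUM=$(cat "/sys/class/block/$(basename "$ROOT_PART")/partition")',
    # Grow the partition, then the filesystem on top of it.
    'growpart "/dev/$DISK" "$PART_NUM"',
    'case "$FS_TYPE" in',
    '  ext4) resize2fs "$ROOT_PART" ;;',
    "  xfs) xfs_growfs / ;;",
    '  *) echo "unsupported filesystem: $FS_TYPE" >&2; exit 1 ;;',
    "esac",
]


def expand_root_filesystem(ssm, instance_id):
    """Send the expansion script to one instance and return the command ID."""
    resp = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": EXPAND_ROOT_COMMANDS},
        Comment="Expand root filesystem after EBS resize",
    )
    return resp["Command"]["CommandId"]
```

Here ssm would be boto3.client("ssm"). The returned command ID is what later lets you look up the script's status and output.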

Step 6 – Final State

Volume size increased
Filesystem expanded
Disk usage drops automatically

Implementation Details

To keep this article focused and practical, we’ve published the complete working implementation on GitHub.

Inside the repository, you’ll find:

AWS Lambda function with full automation logic
Dynamic EBS volume detection
Wait mechanism for volume modification
SSM-based filesystem expansion
Shell script used for disk extension

GitHub Repository:
https://github.com/DeepakRao121/AutoEBSExpansion/tree/main

You can directly reuse this code or adapt it to your environment.

Key Implementation Detail (Critical Learning)

One of the most important challenges in this setup:

Timing between EBS resize and OS expansion

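If the filesystem expansion runs too early, the OS has not yet seen the larger disk, and growpart reports:

NOCHANGE: partition cannot be grown

The fix is:

wait for the volume modification to reach the optimizing or completed state
add a short delay so the OS can detect the new disk size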

This ensures reliable automation.

Why SSM Instead of SSH?

Using AWS Systems Manager provides major advantages:

no SSH access required
works on private instances
secure and auditable
scalable across environments

This eliminates the need for bastion hosts or key management.

Operational Impact

After implementing this automation:

disk-related incidents have dropped significantly
no manual intervention required
consistent behavior across instances
faster recovery from high utilization
improved system reliability

Instead of reacting to alerts, the system now self-heals.

Cost Perspective

This solution is cost-efficient because:

AWS Lambda runs only on a trigger
Amazon CloudWatch metrics are lightweight
SSM usage is minimal

The primary cost driver is EBS storage growth.

Important Considerations

Avoid uncontrolled scaling

If alarms trigger repeatedly, storage can grow rapidly.

Recommended:

add cooldown logic
or define upper limits
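
Both guards can be expressed as small pure functions that the Lambda function consults before calling modify_volume. This is a sketch under assumptions: the 20% growth step, the 500 GiB cap, and the 6-hour cooldown are placeholder values, and the function names are hypothetical. EBS itself also restricts how often the same volume can be modified, so check the current service limits.

```python
from datetime import datetime, timedelta
from typing import Optional


def next_size_gib(current_gib: int, growth_pct: int = 20, max_gib: int = 500) -> Optional[int]:
    """Return the target size in GiB, or None if the volume is at the cap."""
    if current_gib >= max_gib:
        return None
    # Grow by a percentage, but always by at least 1 GiB.
    grown = current_gib + max(1, current_gib * growth_pct // 100)
    return min(grown, max_gib)


def cooldown_elapsed(
    last_resize: Optional[datetime],
    now: datetime,
    cooldown: timedelta = timedelta(hours=6),
) -> bool:
    """Return True if enough time has passed since the last resize."""
    return last_resize is None or now - last_resize >= cooldown
```

The timestamp of the last resize could come from the StartTime field returned by describe_volumes_modifications, or from a tag written on the volume after each resize.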

Volume modification delay

Large volumes take longer to optimize.

Lambda's maximum timeout (15 minutes) is usually sufficient, provided the function's timeout is raised from the 3-second default, but for very large disks:

AWS Step Functions can be considered

Filesystem compatibility

Ensure your script supports:

ext4 → resize2fs
xfs → xfs_growfs

Conclusion

Disk space issues are one of the most common causes of production incidents, yet they are often handled manually.

By combining Amazon CloudWatch, AWS Lambda, Amazon EBS, and SSM, we can build a system that:

detects problems early
reacts automatically
scales infrastructure safely
removes operational overhead

What used to be a reactive process becomes a proactive, automated control.

This is a practical example of how cloud-native services can be combined to build resilient and self-healing systems.

FAQs

1. Does this require SSH access to instances?

ANS: – No. All operations are performed using SSM.

2. Will this work on private EC2 instances?

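ANS: – Yes. As long as the instance runs the SSM Agent, has an instance profile that permits Systems Manager, and can reach the Systems Manager endpoints (for example through VPC endpoints or a NAT gateway), no public access is required.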

3. What happens if filesystem expansion fails?

ANS: – SSM command output is logged and can be reviewed for troubleshooting.
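
For example, the result of a command could be retrieved with the SSM GetCommandInvocation API. The sketch below assumes ssm is a boto3 SSM client and that the command ID was kept from the earlier send_command call; the function name summarize_invocation is hypothetical.

```python
def summarize_invocation(ssm, command_id, instance_id):
    """Fetch status and output of one SSM command run on one instance."""
    inv = ssm.get_command_invocation(CommandId=command_id, InstanceId=instance_id)
    return {
        "status": inv["Status"],  # e.g. Success, Failed, InProgress
        "exit_code": inv.get("ResponseCode"),
        "stdout": inv.get("StandardOutputContent", ""),
        "stderr": inv.get("StandardErrorContent", ""),
    }
```

A Failed status together with NOCHANGE in the output usually points back to the timing issue described earlier.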

WRITTEN BY Deepak S

Deepak S is a Senior Research Associate at CloudThat, specializing in AWS services. He is passionate about exploring new technologies in cloud and is also an automobile enthusiast.