Automating EBS Volume Scaling with Amazon CloudWatch, AWS Lambda and SSM

Introduction

In cloud environments, storage utilization is one of those things that quietly grow until they suddenly become a production issue.

On Linux-based EC2 instances, disk usage can increase due to logs, application data, or temporary files. While increasing an Amazon EBS volume is straightforward, doing it manually across multiple systems is not scalable.

The real challenge is not resizing a disk once.
It is ensuring that disk capacity adjusts automatically when needed, without manual intervention.

To solve this, we built a fully automated workflow using Amazon CloudWatch, AWS Lambda, and AWS Systems Manager that:

  • detects high disk usage
  • increases the EBS volume
  • expands the filesystem inside the OS

All without logging into the instance.

Problem Statement

In most environments:

  • Disk usage is monitored
  • Alerts are configured
  • But remediation is manual

This leads to common issues:

  • Engineers responding late to alerts
  • Downtime due to full disks
  • Manual SSH intervention
  • Inconsistent handling across systems

We needed a solution that:

  • automatically reacts to disk thresholds
  • works across instances
  • requires no manual login
  • handles both infrastructure and OS-level changes
  • is safe and repeatable

Overview of the Solution

The system connects monitoring with automated remediation.

High-level flow:

  1. Amazon CloudWatch Agent publishes disk usage metrics
  2. Amazon CloudWatch Alarm detects threshold breach (e.g., 80%)
  3. AWS Lambda function is triggered
  4. AWS Lambda increases the EBS volume
  5. AWS Lambda waits for the modification to complete
  6. SSM runs a script on the instance
  7. The filesystem is expanded automatically

At the end, disk space increases automatically.

AWS Services Used

  • Amazon EC2 – Hosts workloads
  • Amazon EBS – Provides scalable storage
  • Amazon CloudWatch – Collects metrics and triggers alarms
  • AWS Lambda – Executes scaling logic
  • AWS Systems Manager – Executes OS-level commands
  • AWS IAM – Controls access securely

Architecture & Workflow

Step 1 – Metric Collection

CloudWatch Agent runs on the Amazon EC2 instance and publishes:

  • disk_used_percent for /

This is critical because default Amazon EC2 metrics do not include disk utilization.
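
As a minimal sketch of the agent side, the snippet below generates a CloudWatch Agent configuration that collects used_percent for / (published as disk_used_percent). The function name build_agent_config and the 60-second interval are our own illustrative choices, not values taken from the repository.

```python
import json


def build_agent_config(mount_point="/", interval_seconds=60):
    """Return a CloudWatch Agent config dict that publishes disk_used_percent."""
    return {
        "metrics": {
            # Tag each data point with the instance ID so an alarm can target one instance.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "disk": {
                    # "used_percent" under "disk" is published as disk_used_percent.
                    "measurement": ["used_percent"],
                    "resources": [mount_point],
                    "metrics_collection_interval": interval_seconds,
                }
            },
        }
    }


if __name__ == "__main__":
    # Print the JSON so it can be saved as the agent's configuration file.
    print(json.dumps(build_agent_config(), indent=2))
```

By default the agent publishes to the CWAgent namespace and also attaches path, device, and fstype dimensions to disk metrics, which matters when defining the alarm.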

Step 2 – Alarm Trigger

Amazon CloudWatch Alarm is configured:

  • Metric: disk_used_percent
  • Threshold: e.g., >= 80%
  • Action: Trigger Lambda
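
The alarm definition above could be created roughly as follows. This is a sketch under assumptions: cloudwatch is expected to be a boto3 CloudWatch client, the alarm name, period, and statistic are illustrative, and the alarm is assumed to invoke the Lambda function directly through its ARN (the function also needs a resource policy that lets CloudWatch alarms invoke it).

```python
def build_alarm_kwargs(instance_id, lambda_arn, extra_dimensions, threshold=80.0):
    """Build put_metric_alarm arguments for the root disk usage alarm."""
    # The dimensions must match exactly what the agent publishes
    # (typically InstanceId plus path, device and fstype).
    dimensions = {"InstanceId": instance_id, **extra_dimensions}
    return {
        "AlarmName": f"disk-used-{instance_id}",  # hypothetical naming scheme
        "Namespace": "CWAgent",
        "MetricName": "disk_used_percent",
        "Dimensions": [
            {"Name": name, "Value": value} for name, value in sorted(dimensions.items())
        ],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [lambda_arn],
    }


def create_disk_alarm(cloudwatch, instance_id, lambda_arn, extra_dimensions, threshold=80.0):
    """Create or update the alarm using the given CloudWatch client."""
    kwargs = build_alarm_kwargs(instance_id, lambda_arn, extra_dimensions, threshold)
    cloudwatch.put_metric_alarm(**kwargs)
    return kwargs["AlarmName"]
```

Passing the client in as a parameter keeps the function easy to test; in a real deployment it would typically be boto3.client("cloudwatch").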

Step 3 – AWS Lambda Execution

The AWS Lambda function:

  • extracts the instance ID from the alarm event
  • identifies the root EBS volume
  • increases the volume size
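
The three actions above can be sketched as small functions. This is not the repository's code: the helper names and the 10 GiB increment are hypothetical, ec2 is assumed to be a boto3 EC2 client, and the event shape assumes the alarm invokes Lambda directly (an alarm routed through SNS delivers a different payload).

```python
def instance_id_from_alarm(event):
    """Pull the InstanceId dimension out of a CloudWatch alarm event."""
    metrics = event["alarmData"]["configuration"]["metrics"]
    for metric in metrics:
        dims = metric.get("metricStat", {}).get("metric", {}).get("dimensions", {})
        if "InstanceId" in dims:
            return dims["InstanceId"]
    raise ValueError("alarm event carries no InstanceId dimension")


def find_root_volume_id(ec2, instance_id):
    """Return the EBS volume ID attached at the instance's root device name."""
    response = ec2.describe_instances(InstanceIds=[instance_id])
    instance = response["Reservations"][0]["Instances"][0]
    root_device = instance["RootDeviceName"]
    for mapping in instance["BlockDeviceMappings"]:
        if mapping["DeviceName"] == root_device and "Ebs" in mapping:
            return mapping["Ebs"]["VolumeId"]
    raise LookupError(f"no EBS root volume found for {instance_id}")


def grow_volume(ec2, volume_id, increment_gib=10):
    """Request a larger size for the volume and return the new size in GiB."""
    volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    new_size = volume["Size"] + increment_gib
    ec2.modify_volume(VolumeId=volume_id, Size=new_size)
    return new_size
```

Inside the handler these would be called in order, with ec2 created once via boto3.client("ec2").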

Step 4 – Waiting for Volume Modification

EBS resizing is not instantaneous.

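The Lambda function polls the volume modification until its state becomes:

  • optimizing or completed

This step is critical: while the modification is still in the modifying state, the OS does not yet see the new size, so OS-level expansion fails.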
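
One way to implement this wait is a simple polling loop. The sketch below assumes ec2 is a boto3 EC2 client; the timeout and polling interval are illustrative defaults, and sleep and clock are injectable only to make the loop testable.

```python
import time


def wait_for_resize(ec2, volume_id, timeout_seconds=600, poll_seconds=10,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the volume modification reaches optimizing or completed."""
    deadline = clock() + timeout_seconds
    while True:
        response = ec2.describe_volumes_modifications(VolumeIds=[volume_id])
        modifications = response["VolumesModifications"]
        state = modifications[0]["ModificationState"] if modifications else None
        if state in ("optimizing", "completed"):
            # The new size is usable from the optimizing state onward.
            return state
        if state == "failed":
            raise RuntimeError(f"modification of {volume_id} failed")
        if clock() >= deadline:
            raise TimeoutError(f"{volume_id} still in state {state!r}")
        sleep(poll_seconds)
```

The timeout should stay below the Lambda function's own configured timeout so the function can log a clear error instead of being killed mid-poll.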

Step 5 – OS-Level Expansion via SSM

AWS Lambda uses SSM to run a script on the instance that:

  • detects root partition
  • expands partition (growpart)
  • resizes filesystem (resize2fs or xfs_growfs)
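
A rough sketch of this step is shown below. The shell script is our own illustration rather than the one in the repository, and it assumes the root filesystem sits directly on a partition of the EBS device (no LVM) and that growpart is installed. ssm is assumed to be a boto3 SSM client.

```python
# Illustrative script; stops at the first failing command (set -eu).
EXPAND_ROOT_SCRIPT = """\
set -eu
ROOT_PART=$(findmnt -n -o SOURCE /)
DISK=/dev/$(lsblk -no PKNAME "$ROOT_PART")
PART_NUM=$(cat "/sys/class/block/$(basename "$ROOT_PART")/partition")
FSTYPE=$(findmnt -n -o FSTYPE /)
growpart "$DISK" "$PART_NUM"
case "$FSTYPE" in
  ext4) resize2fs "$ROOT_PART" ;;
  xfs)  xfs_growfs / ;;
  *)    echo "unsupported filesystem: $FSTYPE" >&2; exit 1 ;;
esac
"""


def send_expand_command(ssm, instance_id, script=EXPAND_ROOT_SCRIPT):
    """Run the expansion script on the instance via SSM Run Command."""
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [script]},
        Comment="Expand root partition and filesystem",
    )
    # The command ID is needed later to fetch status and output.
    return response["Command"]["CommandId"]
```

Sending the script as a single multi-line command keeps the shell variables in one process, and returning the command ID lets the caller check the result afterwards.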

Step 6 – Final State

  • Volume size increased
  • Filesystem expanded
  • Disk usage drops automatically

Implementation Details

To keep this article focused and practical, we’ve published the complete working implementation on GitHub.

Inside the repository, you’ll find:

  • AWS Lambda function with full automation logic
  • Dynamic EBS volume detection
  • Wait mechanism for volume modification
  • SSM-based filesystem expansion
  • Shell script used for disk extension

GitHub Repository:
https://github.com/DeepakRao121/AutoEBSExpansion/tree/main

You can directly reuse this code or adapt it to your environment.

Key Implementation Detail (Critical Learning)

One of the most important challenges in this setup:

Timing between EBS resize and OS expansion

If the filesystem expansion runs too early, growpart finds no additional space and reports:

NOCHANGE: partition cannot be grown

The fix is:

  • wait for volume modification
  • add a small delay for OS detection

This ensures reliable automation.

Why SSM Instead of SSH?

Using AWS Systems Manager provides major advantages:

  • no SSH access required
  • works on private instances
  • secure and auditable
  • scalable across environments

This eliminates the need for bastion hosts or key management.

Operational Impact

After implementing this automation:

  • disk-related incidents dropped significantly
  • no manual intervention required
  • consistent behavior across instances
  • faster recovery from high utilization
  • improved system reliability

Instead of reacting to alerts, the system now self-heals.

Cost Perspective

This solution is cost-efficient because:

  • AWS Lambda runs only on a trigger
  • Amazon CloudWatch metrics are lightweight
  • SSM usage is minimal

The primary cost driver is EBS storage growth.

Important Considerations

  1. Avoid uncontrolled scaling

If alarms trigger repeatedly, storage can grow rapidly.

Recommended:

  • add cooldown logic
  • or define upper limits
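
Both guardrails fit in a few lines. The sketch below is illustrative only: the growth percentage, minimum step, and 500 GiB cap are made-up defaults, and the timestamp of the last resize is assumed to be stored somewhere the function can read it, for example in a volume tag. The six-hour default mirrors the minimum interval EBS itself enforces between modifications of the same volume.

```python
import math
from datetime import datetime, timedelta, timezone


def next_volume_size(current_gib, growth_percent=20, min_step_gib=5, max_gib=500):
    """Return the next size in GiB, or None if the upper limit is already reached."""
    if current_gib >= max_gib:
        return None
    step = max(min_step_gib, math.ceil(current_gib * growth_percent / 100))
    # Never grow past the configured cap.
    return min(current_gib + step, max_gib)


def cooldown_elapsed(last_resize_at, now=None, cooldown=timedelta(hours=6)):
    """Return True when enough time has passed since the previous resize."""
    if last_resize_at is None:
        return True  # the volume has never been resized by this automation
    now = now or datetime.now(timezone.utc)
    return now - last_resize_at >= cooldown
```

For example, next_volume_size(100) returns 120, while a volume already at the cap returns None so the function can raise an alert instead of resizing.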

  2. Volume modification delay

Large volumes take longer to optimize.

Lambda's maximum timeout (15 minutes) is usually sufficient, but for very large disks:

  • AWS Step Functions can be considered

  3. Filesystem compatibility

Ensure your script supports:

  • ext4 → resize2fs
  • xfs → xfs_growfs

Conclusion

Disk space issues are one of the most common causes of production incidents, yet they are often handled manually.

By combining Amazon CloudWatch, AWS Lambda, Amazon EBS, and SSM, we can build a system that:

  • detects problems early
  • reacts automatically
  • scales infrastructure safely
  • removes operational overhead

What used to be a reactive process becomes a proactive, automated control.

This is a practical example of how cloud-native services can be combined to build resilient and self-healing systems.

FAQs

1. Does this require SSH access to instances?

ANS: – No. All operations are performed using SSM.

2. Will this work on private EC2 instances?

ANS: – Yes. As long as SSM is configured, no public access is required.

3. What happens if filesystem expansion fails?

ANS: – SSM command output is logged and can be reviewed for troubleshooting.
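
As an illustration of reviewing that output programmatically, the sketch below fetches the result of a Run Command invocation and flags the NOCHANGE case described earlier. ssm is assumed to be a boto3 SSM client, and the function name and returned dictionary shape are our own.

```python
def summarize_invocation(ssm, command_id, instance_id):
    """Fetch the status and output of an SSM command for troubleshooting."""
    invocation = ssm.get_command_invocation(
        CommandId=command_id, InstanceId=instance_id
    )
    stdout = invocation.get("StandardOutputContent", "")
    stderr = invocation.get("StandardErrorContent", "")
    return {
        "status": invocation["Status"],
        "succeeded": invocation["Status"] == "Success",
        # growpart prints NOCHANGE when it finds no extra space to claim.
        "partition_unchanged": "NOCHANGE" in stdout or "NOCHANGE" in stderr,
        "stdout": stdout,
        "stderr": stderr,
    }
```

A status of Pending or InProgress means the command has not finished yet, so the caller should retry the lookup after a short delay.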

WRITTEN BY Deepak S

Deepak S is a Senior Research Associate at CloudThat, specializing in AWS services. He is passionate about exploring new technologies in cloud and is also an automobile enthusiast.
