Introduction
In cloud environments, storage utilization is one of those things that quietly grow until they suddenly become a production issue.
On Linux-based EC2 instances, disk usage can increase due to logs, application data, or temporary files. While increasing an Amazon EBS volume is straightforward, doing it manually across multiple systems is not scalable.
The real challenge is not resizing a disk once.
It is ensuring that disk capacity adjusts automatically when needed, without manual intervention.
To solve this, we built a fully automated workflow using Amazon CloudWatch, AWS Lambda, and AWS Systems Manager that:
- detects high disk usage
- increases the EBS volume
- expands the filesystem inside the OS
All without logging into the instance.
Problem Statement
In most environments:
- Disk usage is monitored
- Alerts are configured
- But remediation is manual
This leads to common issues:
- Engineers responding late to alerts
- Downtime due to full disks
- Manual SSH intervention
- Inconsistent handling across systems
We needed a solution that:
- automatically reacts to disk thresholds
- works across instances
- requires no manual login
- handles both infrastructure and OS-level changes
- is safe and repeatable
Overview of the Solution
The system connects monitoring with automated remediation.
High-level flow:
- Amazon CloudWatch Agent publishes disk usage metrics
- Amazon CloudWatch Alarm detects threshold breach (e.g., 80%)
- AWS Lambda function is triggered
- AWS Lambda increases the EBS volume
- AWS Lambda waits for the modification to complete
- SSM runs a script on the instance
- The filesystem is expanded automatically
At the end, disk space increases automatically.
AWS Services Used
- Amazon EC2 – Hosts workloads
- Amazon EBS – Provides scalable storage
- Amazon CloudWatch – Collects metrics and triggers alarms
- AWS Lambda – Executes scaling logic
- AWS Systems Manager – Executes OS-level commands
- AWS IAM – Controls access securely
Architecture & Workflow
Step 1 – Metric Collection
CloudWatch Agent runs on the Amazon EC2 instance and publishes:
- disk_used_percent for /
This is critical because default Amazon EC2 metrics do not include disk utilization.
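As a minimal sketch, the agent configuration that produces this metric could be generated as follows. The 60-second interval and the helper name build_agent_config are assumptions for illustration; the keys follow the CloudWatch Agent's metrics_collected.disk section, which publishes the measurement as disk_used_percent in the CWAgent namespace by default.

```python
import json


def build_agent_config(mount_point="/", interval_seconds=60):
    """Return a CloudWatch Agent config dict that reports disk used_percent.

    build_agent_config is a hypothetical helper; the interval is an assumed default.
    """
    return {
        "metrics": {
            # Attach the instance ID so alarms can be scoped to one instance.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "disk": {
                    "measurement": ["used_percent"],
                    "resources": [mount_point],
                    "metrics_collection_interval": interval_seconds,
                }
            },
        }
    }


if __name__ == "__main__":
    # Print the JSON so it can be saved as the agent's configuration file.
    print(json.dumps(build_agent_config(), indent=2))
```

Restricting resources to the root mount keeps the number of published metrics small, which matters because each path and device combination becomes its own metric.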
Step 2 – Alarm Trigger
Amazon CloudWatch Alarm is configured:
- Metric: disk_used_percent
- Threshold: e.g., >= 80%
- Action: Trigger Lambda
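The alarm settings above could be expressed as arguments to the CloudWatch put_metric_alarm API. This is a sketch under assumptions: the alarm name, period, statistic, and the helper build_disk_alarm are illustrative, and the Lambda ARN is supplied by the caller. The dimensions must match exactly what the agent publishes, which often includes device and fstype in addition to path.

```python
def build_disk_alarm(instance_id, lambda_arn, threshold=80.0, extra_dimensions=None):
    """Build keyword arguments for cloudwatch.put_metric_alarm (hypothetical helper)."""
    dimensions = {"InstanceId": instance_id, "path": "/"}
    # The agent may also publish device/fstype dimensions; pass them here if so.
    dimensions.update(extra_dimensions or {})
    return {
        "AlarmName": f"disk-used-{instance_id}",  # assumed naming scheme
        "Namespace": "CWAgent",
        "MetricName": "disk_used_percent",
        "Dimensions": [
            {"Name": name, "Value": value}
            for name, value in sorted(dimensions.items())
        ],
        "Statistic": "Average",
        "Period": 300,  # assumed 5-minute evaluation window
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # The Lambda function ARN is used directly as the alarm action.
        "AlarmActions": [lambda_arn],
    }
```

With boto3 installed, the result can be passed as boto3.client("cloudwatch").put_metric_alarm(**build_disk_alarm(...)). The function also needs a resource-based policy that allows CloudWatch alarms to invoke it.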
Step 3 – AWS Lambda Execution
The AWS Lambda function:
- extracts the instance ID from the alarm event
- identifies the root EBS volume
- increases the volume size
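The first two of these steps can be sketched as pure functions. The event layout below assumes the alarm invokes the function directly (rather than through SNS), in which case the metric dimensions appear under alarmData.configuration.metrics; the function names are illustrative and not taken from the repository.

```python
def extract_instance_id(event):
    """Return the InstanceId dimension from a CloudWatch alarm event.

    Assumes the alarm invokes Lambda directly, not via an SNS topic.
    """
    for metric in event["alarmData"]["configuration"]["metrics"]:
        dimensions = metric.get("metricStat", {}).get("metric", {}).get("dimensions", {})
        if "InstanceId" in dimensions:
            return dimensions["InstanceId"]
    raise ValueError("alarm event has no InstanceId dimension")


def find_root_volume_id(instance):
    """Return the EBS volume ID mapped to the instance's root device.

    `instance` is one entry from ec2.describe_instances()
    ["Reservations"][i]["Instances"].
    """
    root_device = instance["RootDeviceName"]
    for mapping in instance.get("BlockDeviceMappings", []):
        if mapping["DeviceName"] == root_device and "Ebs" in mapping:
            return mapping["Ebs"]["VolumeId"]
    raise ValueError(f"no EBS volume mapped to root device {root_device}")
```

Once the volume ID is known, the resize itself is a single call such as ec2.modify_volume(VolumeId=volume_id, Size=new_size_gib), where new_size_gib is whatever target size your policy chooses.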
Step 4 – Waiting for Volume Modification
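EBS resizing is not instantaneous.
The AWS Lambda function waits until the volume's modification state becomes:
- optimizing or completed
This step is critical: the new size is only visible to the instance once the modification reaches one of these states, so without the wait, OS-level expansion fails.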
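A minimal sketch of such a wait loop is shown below. The state lookup is injected as a callable so the logic can be exercised without AWS access; the timeout and polling interval are assumed values, and wait_for_modification is an illustrative name.

```python
import time


def wait_for_modification(get_state, timeout_seconds=600, poll_seconds=10,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll until the volume modification is usable, failed, or timed out.

    get_state is a zero-argument callable returning the current
    ModificationState string ("modifying", "optimizing", "completed", "failed").
    """
    deadline = clock() + timeout_seconds
    while True:
        state = get_state()
        if state in ("optimizing", "completed"):
            return state
        if state == "failed":
            raise RuntimeError("EBS volume modification failed")
        if clock() >= deadline:
            raise TimeoutError(f"volume still '{state}' after {timeout_seconds}s")
        sleep(poll_seconds)
```

With boto3, get_state could be a lambda that returns ec2.describe_volumes_modifications(VolumeIds=[volume_id])["VolumesModifications"][0]["ModificationState"]. Keep timeout_seconds below the function's configured Lambda timeout so the error is raised by your code rather than by the runtime.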
Step 5 – OS-Level Expansion via SSM
AWS Lambda uses SSM to run a script on the instance that:
- detects root partition
- expands partition (growpart)
- resizes filesystem (resize2fs or xfs_growfs)
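One way to sketch this step is to build the arguments for the SSM send_command API with the AWS-RunShellScript document. The shell lines below are an illustration, not the repository's script: they assume a non-LVM root on a standard partition, and the 10-second settle delay and the helper name build_expand_command are assumptions.

```python
def build_expand_command(instance_id, settle_seconds=10):
    """Build keyword arguments for ssm.send_command (hypothetical helper)."""
    commands = [
        # Stop at the first failing command so SSM reports the run as failed.
        "set -eu",
        # Give the kernel a moment to notice the larger block device.
        f"sleep {int(settle_seconds)}",
        # Detect the root partition, its filesystem, parent disk, and partition number.
        'ROOT_PART=$(findmnt -n -o SOURCE /)',
        'FSTYPE=$(findmnt -n -o FSTYPE /)',
        'DISK="/dev/$(lsblk -n -o PKNAME "$ROOT_PART")"',
        'PART_NUM=$(cat "/sys/class/block/$(basename "$ROOT_PART")/partition")',
        # Grow the partition to fill the disk, then grow the filesystem.
        'growpart "$DISK" "$PART_NUM"',
        'case "$FSTYPE" in',
        '  ext4) resize2fs "$ROOT_PART" ;;',
        '  xfs) xfs_growfs / ;;',
        '  *) echo "unsupported filesystem: $FSTYPE" >&2; exit 1 ;;',
        'esac',
    ]
    return {
        "InstanceIds": [instance_id],
        "DocumentName": "AWS-RunShellScript",
        "Parameters": {"commands": commands},
    }
```

With boto3, this can be sent as boto3.client("ssm").send_command(**build_expand_command(instance_id)). The instance needs the SSM Agent running and an instance profile that permits Systems Manager to manage it.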
Step 6 – Final State
- Volume size increased
- Filesystem expanded
- Disk usage drops automatically
Implementation Details
To keep this article focused and practical, we’ve published the complete working implementation on GitHub.
Inside the repository, you’ll find:
- AWS Lambda function with full automation logic
- Dynamic EBS volume detection
- Wait mechanism for volume modification
- SSM-based filesystem expansion
- Shell script used for disk extension
GitHub Repository:
https://github.com/DeepakRao121/AutoEBSExpansion/tree/main
You can directly reuse this code or adapt it to your environment.
Key Implementation Detail (Critical Learning)
One of the most important challenges in this setup:
Timing between EBS resize and OS expansion
If the filesystem expansion runs too early, growpart finds no extra space on the disk and reports:
NOCHANGE: partition cannot be grown
The fix is:
- wait for volume modification
- add a small delay for OS detection
This ensures reliable automation.
Why SSM Instead of SSH?
Using AWS Systems Manager provides major advantages:
- no SSH access required
- works on private instances
- secure and auditable
- scalable across environments
This eliminates the need for bastion hosts or key management.
Operational Impact
After implementing this automation:
- disk-related incidents have dropped significantly
- no manual intervention required
- consistent behavior across instances
- faster recovery from high utilization
- improved system reliability
Instead of reacting to alerts, the system now self-heals.
Cost Perspective
This solution is cost-efficient because:
- AWS Lambda runs only on a trigger
- Amazon CloudWatch metrics are lightweight
- SSM usage is minimal
The primary cost driver is EBS storage growth.
Important Considerations
- Avoid uncontrolled scaling: if alarms trigger repeatedly, storage can grow rapidly. Add cooldown logic or define upper limits.
- Volume modification delay: large volumes take longer to optimize. The maximum Lambda timeout (15 minutes) is usually sufficient, but for very large disks, AWS Step Functions can be considered.
- Filesystem compatibility: ensure your script supports ext4 (resize2fs) and xfs (xfs_growfs).
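The cooldown and upper-limit recommendations above could be sketched as a small planning function. The 20% growth step, 500 GiB cap, and six-hour cooldown are assumed defaults, and the caller is assumed to track the last resize time itself (for example in a volume tag). The six-hour figure mirrors the minimum interval EBS enforces between modifications of the same volume.

```python
def plan_resize(current_gib, last_resize_epoch, now_epoch,
                growth_percent=20, max_gib=500, cooldown_seconds=6 * 3600):
    """Return the next volume size in GiB, or None if no resize should happen.

    All defaults are illustrative assumptions, not values from the article.
    """
    # Cooldown: skip if the volume was resized too recently.
    if now_epoch - last_resize_epoch < cooldown_seconds:
        return None
    # Upper limit: never grow past the configured cap.
    if current_gib >= max_gib:
        return None
    # Integer arithmetic avoids floating-point rounding; always grow by >= 1 GiB.
    increment = max(1, (current_gib * growth_percent + 99) // 100)
    return min(current_gib + increment, max_gib)
```

Returning None instead of raising lets the Lambda handler log the skipped resize and exit cleanly, so a repeatedly firing alarm does not turn into repeated growth.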
Conclusion
Disk space issues are one of the most common causes of production incidents, yet they are often handled manually.
By combining Amazon CloudWatch, AWS Lambda, Amazon EBS, and SSM, we can build a system that:
- detects problems early
- reacts automatically
- scales infrastructure safely
- removes operational overhead
What used to be a reactive process becomes a proactive, automated control.
This is a practical example of how cloud-native services can be combined to build resilient and self-healing systems.
FAQs
1. Does this require SSH access to instances?
ANS: – No. All operations are performed using SSM.
2. Will this work on private EC2 instances?
ANS: – Yes. As long as the instance can reach Systems Manager (for example through VPC endpoints or a NAT gateway), no public access is required.
3. What happens if filesystem expansion fails?
ANS: – SSM command output is logged and can be reviewed for troubleshooting.
WRITTEN BY Deepak S
Deepak S is a Senior Research Associate at CloudThat, specializing in AWS services. He is passionate about exploring new technologies in cloud and is also an automobile enthusiast.
May 25, 2026