Best Practices for AWS Monitoring Solutions

In this blog, we will discuss some measures you can take to improve the monitoring and security of your AWS cloud infrastructure. Monitoring is built into the AWS Well-Architected Framework. You have probably heard of the five pillars the framework was originally built around (AWS has since added a sixth, Sustainability):

Operational Excellence
Security
Reliability
Performance Efficiency
Cost Optimization

Monitoring your AWS environment is a recurring best-practice theme across these pillars, because only by auditing and understanding your resources can you use them to their full potential. From a security standpoint, it is essential to have the tools and practices in place to know what is running in your environments: data and security breaches regularly make the headlines, and attackers often rely on techniques such as brute force.

We use tools like Prometheus and Grafana to keep watch over the health and security of our infrastructure, but in my experience, deploying these tools is not enough on its own. Your infrastructure may contain hundreds or thousands of instances, with many of them launched and terminated on a daily or weekly basis, and that makes it hard to keep every instance covered. If you have worked with Prometheus, you know that to collect system metrics you install node exporter on each target instance. In a large infrastructure, you cannot install node exporter by hand on every newly launched instance. That is why today we will discuss some standard practices you can follow to tackle such problems.

Create a standard custom AMI for your organization
Imagine you are setting up Prometheus (a monitoring and alerting tool) for your infrastructure. You have installed node exporter on all the servers and enabled service discovery for EC2 instances in your AWS infrastructure. All the nodes are up, you are happy that monitoring covers the whole infrastructure, and you have reported the work as done. The next day you come to the office and find ten nodes reported as down, because someone launched ten new instances that service discovery picked up but that have no exporter running. You quickly realize that installing node exporter manually on every newly launched instance is not sustainable. As a solution, launch one machine for each OS flavor you use (Red Hat, Ubuntu, and so on), install all the exporters you plan to use, enable their services so they start on boot, and create an Amazon Machine Image (AMI) from each machine. Then make these AMIs the standard for launching instances in your infrastructure. Every new instance comes up with its exporters already running, which makes your life easier and your monitoring more complete.
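
The image-creation step above can be scripted. The following is a minimal sketch using the Boto3 EC2 client, assuming the source machine already has its exporters installed and enabled. The naming scheme `org-standard-<flavor>-<date>` and the `standard-ami` tag are illustrative assumptions, not something prescribed by AWS or by this article.

```python
import datetime


def build_image_request(instance_id, os_flavor, today=None):
    """Build the keyword arguments for EC2 create_image for one prepared machine."""
    today = today or datetime.date.today()
    return {
        "InstanceId": instance_id,
        # Hypothetical naming scheme: one dated image per OS flavor.
        "Name": f"org-standard-{os_flavor}-{today:%Y%m%d}",
        "Description": f"Standard {os_flavor} image with monitoring exporters preinstalled",
        "TagSpecifications": [
            {
                "ResourceType": "image",
                # Hypothetical tag used to recognize approved images later.
                "Tags": [{"Key": "standard-ami", "Value": "true"}],
            }
        ],
    }


def create_standard_ami(ec2_client, instance_id, os_flavor, today=None):
    """Create an AMI from a prepared instance and return the new image ID."""
    request = build_image_request(instance_id, os_flavor, today)
    response = ec2_client.create_image(**request)
    return response["ImageId"]


# Example usage (requires boto3 and AWS credentials; the instance ID is a placeholder):
#   import boto3
#   ami_id = create_standard_ami(boto3.client("ec2"), "i-0123456789abcdef0", "ubuntu")
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real account.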

Create a default security group with the required inbound rules
When we install exporters on servers to feed metrics into a monitoring tool like Prometheus, we must open specific ports on the target machines so that the tool can scrape metrics from them. It is impractical to tell every person who launches an instance which ports to open and for which IPs or security groups. And if these ports are not open, the custom AMI with all its exporters is of no use: the exporters will be running, but the firewall will stop Prometheus from reaching them. To overcome this, create a default security group that contains all the required inbound rules. You can attach this security group to your instances with a simple AWS CLI command (for example, aws ec2 modify-instance-attribute with the --groups option). You can also write a Lambda function that attaches the security group as soon as an instance is created. This automates the manual work and lets you monitor your infrastructure with far less effort.
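
As a rough illustration of the Lambda approach, the sketch below adds a monitoring security group to an instance while keeping the groups it already has. It assumes the function is triggered by an EC2 instance state-change event and that the group ID is supplied through an environment variable named `MONITORING_SG_ID`; both the variable name and the trigger choice are assumptions made for this example. Instances with more than one network interface may need the change applied per interface instead.

```python
import os


def attach_security_group(ec2_client, instance_id, monitoring_sg_id):
    """Add monitoring_sg_id to an instance and return its resulting group IDs."""
    reservations = ec2_client.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    current = [group["GroupId"] for group in instance.get("SecurityGroups", [])]
    if monitoring_sg_id in current:
        return current  # Already attached, nothing to change.
    updated = current + [monitoring_sg_id]
    # Groups replaces the whole list, so the existing groups are passed along too.
    ec2_client.modify_instance_attribute(InstanceId=instance_id, Groups=updated)
    return updated


def lambda_handler(event, context):
    """Entry point for an 'EC2 Instance State-change Notification' event."""
    import boto3  # Imported here so the helper above can be tested without boto3.

    instance_id = event["detail"]["instance-id"]
    sg_id = os.environ["MONITORING_SG_ID"]  # Hypothetical configuration variable.
    return attach_security_group(boto3.client("ec2"), instance_id, sg_id)
```

Reading the current groups first matters because setting the groups attribute overwrites the instance's existing list rather than appending to it.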

Add owner tags to the instances
If you work in a company where many people are authorized to create instances, you can write a Lambda function that attaches an owner tag to every instance, with owner as the key and the IAM username of the instance's creator as the value. These tags help you quickly find the responsible person when an incident occurs. In a monitoring and alerting setup, we usually send alerts to notification channels like Slack and Microsoft Teams. You can use the same channels for other purposes as well, such as sending a notification when someone did not follow a predefined standard practice while creating a resource in your infrastructure. These notifications can contain information such as the resource name, resource ID, launch time, name tag, owner tag, and anything else you require.
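
One way to sketch the owner-tagging function is shown below. It assumes the Lambda is triggered by a rule that forwards the CloudTrail record of a RunInstances call, so that `event["detail"]` holds the record. The field names used here (`userIdentity`, `responseElements.instancesSet.items[].instanceId`) reflect how such records are commonly laid out and should be checked against a real event from your account.

```python
def owner_from_identity(user_identity):
    """Return the IAM user name, or the last ARN segment for role sessions."""
    if user_identity.get("userName"):
        return user_identity["userName"]
    arn = user_identity.get("arn", "")
    # For an assumed role the last segment is the session name.
    return arn.rsplit("/", 1)[-1] or "unknown"


def instance_ids_from_event(detail):
    """Collect the IDs of the instances created by a RunInstances call."""
    # responseElements is null when the API call failed.
    response = detail.get("responseElements") or {}
    items = response.get("instancesSet", {}).get("items", [])
    return [item["instanceId"] for item in items]


def tag_owner(ec2_client, detail):
    """Tag every launched instance with owner=<creator>; return the owner used."""
    instance_ids = instance_ids_from_event(detail)
    if not instance_ids:
        return None
    owner = owner_from_identity(detail.get("userIdentity", {}))
    ec2_client.create_tags(
        Resources=instance_ids,
        Tags=[{"Key": "owner", "Value": owner}],
    )
    return owner


def lambda_handler(event, context):
    import boto3  # Imported here so the helpers above can be tested without boto3.

    return tag_owner(boto3.client("ec2"), event["detail"])
```

The fallback to the ARN is a design choice for this sketch: when people launch instances through an assumed role there is no IAM user name, and the session name is often the closest thing to an owner.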

Write Lambda functions and send notifications related to infrastructure
In a monitoring setup, configuring alerting rules in your monitoring tools is very important, but it is not enough. I believe you must also monitor how instances are being created, which practices are being followed to ensure security and reliability, and whether everyone with permission to create resources is following the standards. For all of these notifications, create a separate notification channel and push them there. This gives the operations team better insight into the whole infrastructure.
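
Pushing such a notification can be as small as one HTTP POST. The sketch below uses only the Python standard library and assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. The message fields and the helper names are illustrative, and a Microsoft Teams webhook may expect a different payload shape.

```python
import json
import urllib.request


def format_instance_message(info):
    """Turn a dict of resource details into a plain-text notification."""
    lines = ["EC2 instance needs attention:"]
    for field in ("resource_name", "resource_id", "launch_time", "owner"):
        lines.append(f"- {field}: {info.get(field, 'n/a')}")
    return "\n".join(lines)


def build_webhook_request(webhook_url, text):
    """Build the POST request for an incoming webhook without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_notification(webhook_url, text, timeout=5):
    """Send the message and return the HTTP status code."""
    request = build_webhook_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example usage (the webhook URL is a placeholder you would store as a secret):
#   message = format_instance_message({"resource_id": "i-0123456789abcdef0", "owner": "alice"})
#   send_notification(webhook_url, message)
```

Keeping the formatting and the sending separate means every Lambda function described in this post can reuse the same sender and only supply its own message.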

Which notifications can you push? Anything related to your infrastructure that you can fetch through API requests. Here are a few examples:

Send a notification whenever an instance is launched from a non-standard AMI, along with its owner's name. This helps you resolve the node-down issue, because you can install node exporter as soon as the notification arrives, and you can remind the owner to use the standard AMI in the future. To do this, create a Lambda function that reads the RunInstances event recorded by CloudTrail, extracts the details of the newly created instance, stores the fields it needs in variables, and builds a custom message. Trigger the function from a CloudWatch Events (now Amazon EventBridge) rule that matches the RunInstances call, so that it runs as soon as an instance is created.
Send a daily notification listing the ports that are wide open in the inbound rules of your security groups. This removes the manual effort of checking security groups again and again. The function is easy to write with Boto3, the AWS SDK for Python: call the DescribeSecurityGroups API, pull the required information out of the JSON response, and push it as a notification to Slack or Microsoft Teams.
There is another case, where the DevOps team provisions infrastructure on behalf of others. Here the owner tag cannot simply be the IAM username of the resource's creator. Instead, it becomes the creator's responsibility to add the owner tag manually and set its value to the person for whom the resource is being created. To enforce this process and track whether it is being followed, you can again write a Lambda function. It is triggered by the same kind of rule whenever a RunInstances event occurs, and it checks whether an owner tag is present. If the tag is missing, it sends a notification with the resource details to your Slack or Microsoft Teams channel.
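
The checks in the examples above can be sketched as two small functions. The first scans the output of DescribeSecurityGroups for inbound rules open to the whole internet. The second inspects a RunInstances record for a non-standard AMI or a missing owner tag. The CloudTrail field names (`requestParameters.instancesSet`, `tagSpecificationSet`) are assumptions based on how such records are commonly structured, and the set of approved AMI IDs is something you would supply yourself.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def find_open_ingress(security_groups):
    """Return (group_id, protocol, from_port, to_port, cidr) for world-open rules."""
    findings = []
    for group in security_groups:
        for rule in group.get("IpPermissions", []):
            cidrs = [r["CidrIp"] for r in rule.get("IpRanges", [])]
            cidrs += [r["CidrIpv6"] for r in rule.get("Ipv6Ranges", [])]
            for cidr in cidrs:
                if cidr in OPEN_CIDRS:
                    # Port fields are absent when the rule covers all protocols ("-1").
                    findings.append((group["GroupId"], rule.get("IpProtocol"),
                                     rule.get("FromPort"), rule.get("ToPort"), cidr))
    return findings


def all_security_groups(ec2_client):
    """Fetch every security group in the region, following pagination."""
    groups = []
    paginator = ec2_client.get_paginator("describe_security_groups")
    for page in paginator.paginate():
        groups.extend(page["SecurityGroups"])
    return groups


def launch_violations(detail, standard_amis):
    """List the problems found in one RunInstances CloudTrail record."""
    problems = []
    params = detail.get("requestParameters") or {}
    for item in params.get("instancesSet", {}).get("items", []):
        if item.get("imageId") not in standard_amis:
            problems.append(f"non-standard AMI {item.get('imageId')}")
    tags = [
        tag
        for spec in params.get("tagSpecificationSet", {}).get("items", [])
        if spec.get("resourceType") == "instance"
        for tag in spec.get("tags", [])
    ]
    if not any(tag.get("key", "").lower() == "owner" for tag in tags):
        problems.append("missing owner tag")
    return problems


# Example usage for the daily audit (requires boto3 and AWS credentials):
#   import boto3
#   findings = find_open_ingress(all_security_groups(boto3.client("ec2")))
```

Both checks return plain data rather than sending anything themselves, so the result can be formatted into whatever message your notification channel expects.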

In this way, you can set up as many automated notifications for your AWS infrastructure as you need and stay informed about its health and security.

If you want to learn more about AWS infrastructure, check out our AWS Courses.

Please comment if you have any questions.

WRITTEN BY Saurabh Jain

Saurabh Kumar Jain is the CSA – Projects Head for DevOps and Kubernetes at CloudThat. An innovative Solutions Architect and technical leader, he is passionate about driving digital transformation across diverse industries. He specializes in designing enterprise-grade, cloud-native solutions, with deep expertise in multi-cloud platforms, Kubernetes orchestration, and AI-powered automation. Saurabh has extensive experience in architecting secure, scalable systems for sectors including oil & petroleum, financial services, e-commerce, and government organizations. He is recognized for his thought leadership in modernization strategies, GitOps workflows, and comprehensive observability implementations. In his free time, he explores emerging technologies in AI and GenAI, contributes to open-source projects, and shares knowledge through technical content and industry speaking engagements.

Sajid Akram

Jul 8, 2020

Its a nice work.
Keep it up bro…
Akshay Baishander

Jul 7, 2020

Very Informative blog saurabh.
Keep writing more like that
Kapil

Jul 7, 2020

Nice article
