As digitalization and the cloud reshape how new features are developed and deployed, error handling has become crucial to shipping software quickly. A mistake anywhere in the chain, from writing code and deploying it to monitoring performance, can degrade the customer experience, drive up costs, or interrupt critical services. Many teams have built pipelines to manage errors, only to run into the limitations of their tools and systems. Most IT teams still sift manually through terabytes of data to identify an issue; this is time-consuming, frequently delays the fix, and consumes valuable resources.
Is there a way to fix operational problems quickly?
Yes: Amazon DevOps Guru.
Amazon has introduced DevOps Guru, a fully managed operations service that helps developers and operators improve the performance and availability of their applications. DevOps Guru takes over the administrative work of identifying operational issues, so you can move quickly to implementing its recommendations.
How Amazon DevOps Guru works:
DevOps Guru applies machine learning to your operational data, application metrics, and events to identify behavior that deviates from normal operating patterns. It creates reactive insights that you can use to improve your application now, and proactive insights that help you avoid operational issues that might affect your application in the future. You are notified when DevOps Guru detects an operational issue or risk. For each issue, DevOps Guru presents recommendations to address current and predicted operational problems.
Proof Of Concept:
To understand this service better, we carried out a use case provided by AWS and used it as a proof of concept (POC) to take a deeper look at the service. If you would like to perform this POC yourself, you can follow each step in the AWS documentation. The details and the step-by-step procedure are also provided below.
We will deploy a CloudFormation stack and populate it with test data. This stack launches a serverless application that includes an API Gateway, a Lambda function, and a DynamoDB table. Then we will change the ReadCapacityUnits of the DynamoDB table from 5 to 1 and call the API Gateway endpoint from a script to increase the traffic.
Step 1: Choose the resource coverage and enable the Amazon DevOps Guru service. The resource coverage can also be changed later, after the service has been enabled.
Step 2: Choose the stacks you want to monitor with DevOps Guru.
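If you prefer to script Step 2 rather than use the console, the sketch below shows one possible way to do it with the AWS SDK for Python. It assumes `client` is a `boto3.client("devops-guru")` object with credentials and a region already configured; the function names and the stack name you pass in are hypothetical and not part of the AWS walkthrough.

```python
def add_stack_to_coverage(client, stack_name):
    """Ask DevOps Guru to start analyzing one CloudFormation stack.

    `client` is assumed to be boto3.client("devops-guru").
    """
    return client.update_resource_collection(
        Action="ADD",
        ResourceCollection={"CloudFormation": {"StackNames": [stack_name]}},
    )


def covered_stacks(client):
    """Return the stack names DevOps Guru reports as covered.

    Only the first page of results is read; a large account may need
    to follow NextToken as well.
    """
    response = client.get_resource_collection(
        ResourceCollectionType="AWS_CLOUD_FORMATION"
    )
    cloudformation = response.get("ResourceCollection", {}).get("CloudFormation", {})
    return cloudformation.get("StackNames", [])
```

Calling `add_stack_to_coverage(client, "<your-stack-name>")` and then `covered_stacks(client)` should confirm that the stack was added before you start the baselining wait.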
Now you have to wait for DevOps Guru to finish baselining the resources. This step is necessary to establish the expected behavior. For our serverless stack with three resources, we recommend waiting 2 hours before carrying out the next steps. In a production environment, baselining can take up to 24 hours, depending on the number of resources selected for monitoring.
Step 3: Modify the stack and change the ReadCapacityUnits of the DynamoDB table from 5 to 1.
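Step 3 changes the value through the CloudFormation stack. As an alternative for quick experiments, the same capacity change can be sketched directly against DynamoDB. This is not the method used in the walkthrough: it assumes `dynamodb` is a `boto3.client("dynamodb")` object and that the table uses provisioned capacity, the function name is hypothetical, and changing the table outside CloudFormation leaves the stack drifted from its template.

```python
def lower_read_capacity(dynamodb, table_name, read_units=1):
    """Set ReadCapacityUnits on a provisioned table, keeping the write capacity.

    `dynamodb` is assumed to be boto3.client("dynamodb").
    """
    table = dynamodb.describe_table(TableName=table_name)["Table"]
    current = table["ProvisionedThroughput"]
    # UpdateTable expects both values, so the existing write capacity is reused.
    return dynamodb.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": current["WriteCapacityUnits"],
        },
    )
```

Reading the current write capacity first keeps the change limited to the one setting the POC is meant to stress.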
Step 4: Run the script to trigger the API Gateway endpoint in a loop to increase the traffic.
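The script itself comes from the AWS walkthrough and is not reproduced here. As a rough illustration of what such a loop might look like, here is a minimal sketch using only the Python standard library. The function names and the `fetch` parameter are my own, and the endpoint URL is whatever your stack outputs.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Send one GET request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are what the POC is trying to provoke, so count them.
        return error.code


def generate_traffic(url, request_count, pause_seconds=0.0, fetch=fetch_status):
    """Call the endpoint repeatedly and return a {status_code: count} tally.

    Connection-level failures (urllib.error.URLError) are not caught and
    will stop the loop.
    """
    counts = {}
    for _ in range(request_count):
        status = fetch(url)
        counts[status] = counts.get(status, 0) + 1
        if pause_seconds:
            time.sleep(pause_seconds)
    return counts
```

For example, `generate_traffic(endpoint_url, 1000)` sends the requests one after another; the returned tally gives a quick view of how many calls started failing once the table was throttled.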
Step 5: View Insights in Amazon DevOps Guru Service and see the recommendations.
The affected resources will be the API Gateway, the Lambda function, and the DynamoDB table. DevOps Guru identifies the root cause of the issue (the DynamoDB table) and shows it in the insight.
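The insights from Step 5 can also be retrieved programmatically. The following is a sketch, assuming `client` is a `boto3.client("devops-guru")` object; the function name is hypothetical, and it lists only ongoing reactive insights, which is what this POC is expected to produce.

```python
def ongoing_reactive_insights(client):
    """Return (id, name, severity) for every ongoing reactive insight.

    `client` is assumed to be boto3.client("devops-guru").
    """
    kwargs = {"StatusFilter": {"Ongoing": {"Type": "REACTIVE"}}}
    results = []
    while True:
        page = client.list_insights(**kwargs)
        for insight in page.get("ReactiveInsights", []):
            results.append(
                (insight.get("Id"), insight.get("Name"), insight.get("Severity"))
            )
        token = page.get("NextToken")
        if not token:
            return results
        # Follow pagination until the service stops returning a token.
        kwargs["NextToken"] = token
```

The insight IDs returned here are the handles you would use to look up further details for a specific insight.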
Here are the screenshots of the POC performed.
Fig. 1: On the Insights dashboard, you will be able to view different errors on the affected resources.
Fig. 2: In the Aggregated metrics dashboard, you can view the timeline of various metrics and compare the origin time of specific errors or spikes in the metrics. It will help in determining the root cause of the issue.
Fig. 3: The Relevant events tab shows the events that occurred across the entire timeline, which helps in determining the cause of the issue. For example, if issues started in the application after a certain deployment, that deployment will be visible under this tab.
Fig. 4: The Recommendations dashboard provides suggestions to mitigate the issue.
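The recommendations shown in Fig. 4 are also available through the API. Here is a sketch of how they might be fetched for a single insight, again assuming `client` is a `boto3.client("devops-guru")` object. The function name and the shape of the returned dictionaries are my own choices.

```python
def recommendation_summaries(client, insight_id):
    """Return the name, reason, and documentation link of each recommendation.

    `client` is assumed to be boto3.client("devops-guru").
    """
    kwargs = {"InsightId": insight_id}
    summaries = []
    while True:
        page = client.list_recommendations(**kwargs)
        for recommendation in page.get("Recommendations", []):
            summaries.append(
                {
                    "name": recommendation.get("Name"),
                    "reason": recommendation.get("Reason"),
                    "link": recommendation.get("Link"),
                }
            )
        token = page.get("NextToken")
        if not token:
            return summaries
        kwargs["NextToken"] = token
```

Pass in the ID of an insight from your own account; the result can then be posted to a chat channel or attached to a ticket alongside the console screenshots.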
In short, Amazon DevOps Guru is a powerful service that not only detects an issue but also finds its root cause and recommends how to mitigate it.
Saurabh Kumar Jain is a Subject Matter Expert working with CloudThat. He has experience in AWS, Microsoft Azure, and DevOps technologies, and specializes in cloud security and architecture design. He has also worked on various cloud and DevOps projects based on the cloud maturity model. His certifications include AWS Certified Security - Specialty, AWS Certified Solutions Architect - Associate, Microsoft Certified Azure Administrator, and Terraform Associate.