
Anomaly Detection with AWS CloudWatch

CloudWatch has become an integral part of managing any AWS ecosystem. Whether you are running traditional EC2-type services, containers, or modern serverless workloads, you can use CloudWatch to provide insights into capacity planning and to help diagnose issues with your services and applications. It has come a long way from its humble beginnings[1] in 2009, when you could optionally monitor basic metrics of EC2 instances, to today with the latest anomaly detection support.

You are able to push custom metrics to CloudWatch either from your application or from the operating system using tools such as the CloudWatch Agent. You can also push logs for analysis, store them for compliance purposes, or use CloudWatch as an interim step into another log analysis system or SIEM. CloudWatch also provides events that can trigger a workflow. It has definitely become a very powerful tool in the AWS builder's toolbox.

Next Generation MSP

As an AWS certified next generation MSP, we have been providing next generation monitoring capabilities to customers for a few years. Sometimes this has been through the use of alternative third-party commercial or open-source solutions. These solutions have had varying levels of success over the years, but generally have never felt like they had the tight integration I was looking for. Anomaly and outlier detection have been a crucial part of this requirement, and to have this now available directly in CloudWatch is awesome.

With this announcement, we'll be able to bring this powerful monitoring approach to more customers than ever before, thanks to the pay-for-what-you-consume model. We can apply anomaly detection to just the metrics that make sense for each customer, for the utterly ridiculous price of USD 30c per month.

Example
Threshold Based

Let's take a look at an example.

I have a container-based CMS running for a customer. This frontend stack is a traditional three-tier approach: one Application Load Balancer, N+1 auto-scaling containers, and a multi-node PostgreSQL Aurora cluster. In this example we are interested in how fast the application containers are responding. For this, we measure the TargetResponseTime on the Application Load Balancer.

Here we can see that when I took this screenshot of the metric in CloudWatch, the application was generally responding in under 300ms, with a few peaks to about 450ms. If we go back through the history in CloudWatch, we can see that we have often reached 2 seconds.

We can monitor this with a traditional alarm using an upper threshold. In this example we set the threshold to 500ms.

This is represented in CloudFormation as follows:
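A minimal sketch of what this might look like; the load balancer value, alarm name, and SNS topic below are placeholders, and note that TargetResponseTime is reported in seconds, so 500ms becomes 0.5:

```yaml
# Sketch only: the load balancer value, alarm name and SNS topic are placeholders.
Resources:
  ResponseTimeAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: cms-frontend-target-response-time
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: LoadBalancer
          Value: app/cms-frontend/0123456789abcdef   # placeholder ALB identifier
      Statistic: Average
      Period: 60
      EvaluationPeriods: 5
      Threshold: 0.5                    # TargetResponseTime is in seconds, so 0.5 = 500ms
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - !Ref AlertTopic               # placeholder SNS topic for paging
```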

Switching to anomaly detection

Why would you want to switch to anomaly detection? When I took that screenshot, it was a relatively quiet period for this customer, and some of the responses were already hitting 450ms against our 500ms threshold. As a business you may decide that 500ms is a respectable response time and that, during peak, you are OK if a small percentage of responses are greater than that.

As a Site Reliability Engineer you observe that the initial value you have set is still acceptable because so few responses are greater than 500ms. But if we go back to our graph, we can see that we really hover around 200-250ms. Even with auto-scaled containers, our PostgreSQL database can get a little busy during peak periods, so we start to sit close to that 500ms mark quite often and have even gone over it a few times. Now you are getting paged a few too many times for what really is a false alarm.

Instead of guessing what the threshold value should be, why don't we let machine learning work it out for us? Machine learning will add a bounding band to our metric (shown in grey around the metric below), so that not only do we have a bit of headroom to cope with peak periods, but we will also be alerted when the application is busy outside of normal times, or when the application server is responding too quickly, which could mean it is serving broken content for some reason (possibly 500 errors because the database is no longer in service, for example).

Our metric graph in CloudWatch now looks like this:

We can now get an accurate picture of anomalies:

Our CloudWatch Alarm is now alerting on the second metric, which is calculated based on what machine learning has learnt over the past two weeks.

To set this up you have to change your CloudFormation a bit: add the anomaly detection, and set your alarm to alert on the new calculated metric.
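A minimal sketch of how this might look, again with placeholder names and dimensions; the alarm now references an ANOMALY_DETECTION_BAND metric math expression via ThresholdMetricId instead of a static Threshold:

```yaml
# Sketch only: names, dimensions and the SNS topic are placeholders.
Resources:
  ResponseTimeAnomalyDetector:
    Type: AWS::CloudWatch::AnomalyDetector
    Properties:
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Stat: Average
      Dimensions:
        - Name: LoadBalancer
          Value: app/cms-frontend/0123456789abcdef   # placeholder ALB identifier

  ResponseTimeAnomalyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: cms-frontend-target-response-time-anomaly
      # Alarm when the metric leaves the machine-learning band in either direction
      ComparisonOperator: LessThanLowerOrGreaterThanUpperThreshold
      EvaluationPeriods: 5
      ThresholdMetricId: ad1            # alarm on the calculated band, not a static number
      TreatMissingData: notBreaching
      Metrics:
        - Id: m1
          MetricStat:
            Metric:
              Namespace: AWS/ApplicationELB
              MetricName: TargetResponseTime
              Dimensions:
                - Name: LoadBalancer
                  Value: app/cms-frontend/0123456789abcdef
            Period: 60
            Stat: Average
        - Id: ad1
          Expression: ANOMALY_DETECTION_BAND(m1, 2)   # band width of 2 standard deviations
          Label: Expected response time band
      AlarmActions:
        - !Ref AlertTopic               # placeholder SNS topic for paging
```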

Gotchas

One of the issues we ran into was the change in the JSON format of the alarm notifications coming from AWS CloudWatch. Instead of a MetricName field in the payload, it now contains a Metrics object that resembles the Metrics section of the alarm configuration, so anything parsing those notifications needs updating.
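As a heavily abbreviated illustration (the shape is assumed from the alarm configuration above, not copied verbatim from a notification), the Trigger section of the payload now carries a Metrics array rather than a single MetricName:

```json
{
  "Trigger": {
    "Metrics": [
      { "Id": "m1", "MetricStat": { "Metric": { "Namespace": "AWS/ApplicationELB", "MetricName": "TargetResponseTime" }, "Period": 60, "Stat": "Average" } },
      { "Id": "ad1", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)" }
    ]
  }
}
```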

Of course, this is not going to take into account those big one-off events, and hopefully you know about them in advance and can plan accordingly (e.g. Black Friday, Cyber Monday or similar). But if you do get Slashdotted (or hit by the Reddit hug of death), you will be alerted and can act to mitigate as quickly as possible, as auto-scaling often won't scale fast enough in those scenarios.

Summary

This new anomaly detection capability in Amazon CloudWatch will help reduce false alarms and monitoring fatigue. It also means we can stop guessing at a good threshold starting point and let the machines take care of the thinking for us.

For such a low price, I'm excited to see how we'll use it to help improve our customers' and their customers' experiences.

Notes

  1. Technically CloudWatch was available a few months earlier, but this was the first time you could see graphs integrated into the console.

This article was written by Greg Cockburn, Principal Practice Lead @ AC3. Check out his profile on the AWS APN Ambassadors Asia Pacific page.