The Four Golden Signals. Should Traffic Be The First Alert?

Google has done a great job of giving us a framework to hitch ourselves to when it comes to observability. The ‘golden signals’ are now observability parlance, with wide adoption by open source and commercial software vendors. They are one of the foundations of site reliability engineering, or SRE as we know it. The framework has given a sometimes-misunderstood competency a name that makes sense, and the result is a defined domain that people can carve out a career in if they wish.

Different types of systems emit different signals, but this article looks at the Golden Signals in the context of application performance monitoring (APM). In a perfect world, you deploy an APM agent and possibly something to monitor the infrastructure.

What are the Golden Signals for applications?

There are four ‘Golden Signals’:

Latency - The time it takes for an application to service a request. Be sure to distinguish between successful and failed requests.

Traffic - The total demand placed on your system, in terms of transaction throughput. Measure this in operations or requests per unit of time.

Errors - The rate of requests that fail. Failures can be explicit (e.g., HTTP 500) or implicit (e.g., incorrect data). Your application should emit signals that indicate errors.

Saturation - This signal indicates how full your system is. Measure the consumption of critical resources like CPU, memory, and network of the system your application resides on.
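To make the four definitions concrete, here is a minimal Python sketch that summarises one time window of requests into the four signals. It is an illustration only: the `Request` record, the `golden_signals` function and the choice of the 95th percentile and CPU utilisation are assumptions made for this example, not the output of any particular APM agent.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """One handled request. Field names are illustrative, not from any APM product."""
    duration_ms: float
    ok: bool


def golden_signals(requests, window_seconds, cpu_utilisation):
    """Summarise one time window of requests into the four signals."""
    total = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    ok_durations = [r.duration_ms for r in requests if r.ok]

    # Latency: 95th percentile of successful requests only, so that fast
    # failures do not make the service look healthier than it is.
    if len(ok_durations) >= 2:
        p95 = quantiles(ok_durations, n=100)[94]
    else:
        p95 = ok_durations[0] if ok_durations else None

    return {
        "latency_p95_ms": p95,
        "traffic_rps": total / window_seconds,           # requests per second
        "error_rate": failed / total if total else 0.0,  # fraction that failed
        "saturation": cpu_utilisation,                   # 0.0 to 1.0, supplied by the host
    }
```

In practice your APM tooling calculates these for you; the point of the sketch is that latency, traffic and errors all come from the same stream of requests, while saturation comes from the system underneath.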

At first glance, all four signals are valuable, but they are not equally useful for alerting. Latency and errors are critical because they directly affect the customer experience, and as such are the best place to start. Saturation matters because CPU, memory, network and disk capacity on the systems supporting the application are finite, so if these resources are exhausted, latency can increase and errors can rise. A full disk is like running out of fuel while driving - everything stops. Fortunately, mature autoscaling solutions are widely available today to assist.

What about traffic?

When it comes to setting alarms based on traffic, tread lightly. A stakeholder may see low traffic as an issue because it’s the signal most easily related to business success. There are many different reasons why transactions can be low: elections, a sporting grand final, a significant weather event or the sudden emergence of a competitor may all impact your customers’ inclination to transact. At the other end of the scale, you may wish to know when traffic reaches a high threshold so you can add capacity. But if traffic has exceeded your expectations, is this a bad thing? If traffic is going up, it’s your autoscaling infrastructure that should mitigate this by allocating more resources.

The reality of seasonality and anomaly detection

Years back, observability vendors started introducing seasonality and anomaly detection to their products, enabling you to calculate baselines of what the expected behaviour should be. You then specify the acceptable deviation from these baselines. It sounded like a cure-all, and it can work well for latency and particularly for saturation. But when you leverage these features for traffic, the results aren’t great, and no matter how hard you try there will be rational edge cases where your thresholds are breached.
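A baseline with an acceptable deviation can be sketched in a few lines of Python. This is a deliberately simple stand-in (mean plus or minus a number of standard deviations) and not how any specific vendor implements anomaly detection; the function name, the tolerance of three standard deviations and the traffic figures are all hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, observed, tolerance=3.0):
    """Return True when `observed` sits more than `tolerance` standard
    deviations away from the mean of `history`.

    `history` holds the same metric sampled at the same point in the
    season, e.g. requests per minute at 19:00 on previous Saturdays.
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return observed != baseline
    return abs(observed - baseline) > tolerance * spread


# Hypothetical numbers: requests per minute at the same hour on prior Saturdays.
previous_saturdays = [980, 1010, 1000, 990, 1020]

print(is_anomalous(previous_saturdays, 1005))  # an ordinary Saturday
print(is_anomalous(previous_saturdays, 600))   # the night of a grand final
```

The first call prints False and the second prints True. The second result is the problem in miniature: the detector is working exactly as designed, yet the drop has a perfectly rational cause that nobody on call can do anything about.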

You can’t control traffic; it’s usually driven by external forces. If traffic is high but latency and errors remain steady then great, you have architected your platform well, and business is good. However, if latency, errors or both become a problem, then traffic will rightfully be the first thing you look at. Traffic may fit into a compound condition that observes latency/errors and throughput together.
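One way to picture such a compound condition is the Python sketch below, where latency and errors decide whether anyone gets paged and traffic simply travels along as context. The function name and both threshold values are placeholders chosen for illustration; they are not recommendations, and real alerting tools express this in their own rule syntax.

```python
def should_page(latency_p95_ms, error_rate, traffic_rps,
                latency_limit_ms=500.0, error_limit=0.02):
    """Decide whether to page, and attach traffic as context.

    Thresholds are placeholder values, not recommendations.
    """
    symptoms = []
    if latency_p95_ms > latency_limit_ms:
        symptoms.append("latency")
    if error_rate > error_limit:
        symptoms.append("errors")

    # Traffic never triggers the page by itself; it only helps explain it.
    return {
        "page": bool(symptoms),
        "symptoms": symptoms,
        "traffic_rps": traffic_rps,
    }
```

With these placeholder limits, `should_page(200, 0.001, 5000)` does not page however high the traffic is, while `should_page(900, 0.05, 5000)` pages for both latency and errors and hands the responder the throughput figure to start their investigation.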

Where to from here?

Build your business a dashboard so people can readily see the detail, but don’t get people out of bed unnecessarily when traffic is not behaving as expected. Traffic can be an issue because of a distributed denial-of-service (DDoS) attack, an issue with a content delivery network (CDN) or an issue with your own network. Your upstream monitoring should tell you this, not your application. By now you’re thinking that AI must have a role in here… that’s a topic that will be covered in an upcoming article, so stay tuned.

FAQs

What are the Four Golden Signals? The Four Golden Signals are Latency, Traffic, Errors, and Saturation. Traffic provides metrics on the volume of demand, saturation highlights resource constraints, and latency and errors reveal customer impact. Together, they offer a comprehensive view of system performance and user experience.

What does traffic mean in the Four Golden Signals? In the Four Golden Signals framework, traffic measures the demand placed on an application, service or infrastructure component. This is typically represented as requests, transactions, or operations per unit of time.

What’s the best way to monitor traffic? The best way to monitor traffic is to track both traffic volume and traffic patterns in real time using application performance monitoring (APM) and observability tools. Traffic is often best presented through dashboards and reporting rather than operational alerts. This allows stakeholders to track traffic metrics and trends without creating unnecessary notifications.

Why are the Four Golden Signals important for SRE? The Four Golden Signals are important for SRE because they focus on the metrics that most directly reflect the user experience and overall service health. SRE teams can use these metrics to quickly identify the issues that matter most to customers. When one or more of these signals deviate from expected levels, SRE teams can investigate further to determine the root cause.

Contact AC3 for more information.