Software Metrics Best Practices

Tom · Published in Level Up Coding · Sep 23, 2021

How to save yourself time and sanity when things don’t go according to plan.


For the sake of putting things in context, it’s worth taking a step back and considering monitoring as a whole before diving into metrics. There are three main practices for monitoring software in production: logging, metrics, and tracing.

Probably the simplest to put in place is logging, although some thought should go into how log messages are formatted, collected, and aggregated. This is the first tool a developer or an administrator can use to gather intel about incidents.
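To make that a bit more concrete, here is a minimal sketch of structured (JSON) log formatting using Python’s standard logging module; the field names are just one possible convention, not a prescription.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, a format most log
    aggregation pipelines can ingest without custom parsing."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("app").info("payment accepted")
```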

Logs can contain a lot of information, which is a good thing when debugging a specific problem. However, that same verbosity is precisely why they stop being useful in other circumstances.

A typical application produces a lot of logs and it’s not uncommon to see gigabytes of logs accumulating over short periods of time.

In that regard, metrics are the exact opposite of logs. They aggregate information into a few key numbers and provide a macroscopic view of the systems under monitoring.

Finally, we can add tracing to the mix in order to view log messages related to a specific request or event.

If we want a microscopic, fine-grained view of the system, we could filter all logs between two dates but there would be a lot of noise. Instead, we can annotate all logs with a request identifier and then filter by that.
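As an illustration, here is a minimal Python sketch of that idea using the standard logging module; the logger and field names are placeholders, and a real service would usually take the identifier from an incoming header or a tracing library.

```python
import logging
import uuid

# Tag every log line with a request identifier so that logs can later
# be filtered per request rather than by date range.
logging.basicConfig(
    format="%(asctime)s %(levelname)s request_id=%(request_id)s %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("app")

def handle_request(payload):
    request_id = str(uuid.uuid4())            # one identifier per incoming request
    extra = {"request_id": request_id}
    logger.info("request received", extra=extra)
    # ... actual request handling goes here ...
    logger.info("request processed", extra=extra)

handle_request({"user": "alice"})
```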

Tracing typically goes one step further by allowing spans to be nested. For example, we might have a span, with its own span identifier, for each incoming request and a subspan for a specific part of the request handler.
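As a rough sketch, nested spans could look like this with the OpenTelemetry Python API; this assumes the opentelemetry-api package is installed, the span names and attributes are made up, and without a configured SDK the spans are no-ops.

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def handle_request():
    # Parent span covering the whole incoming request.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("http.route", "/checkout")
        validate_payment()

def validate_payment():
    # Child span for one specific part of the request handler; it is
    # automatically nested under the current (parent) span.
    with tracer.start_as_current_span("validate_payment"):
        pass  # real work would go here

handle_request()
```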

The Four Golden Signals

In their excellent book Site Reliability Engineering, engineers from Google describe four so-called golden signals for monitoring services.

  • Latency is the time it takes to process a request.
  • Traffic is the amount of incoming requests in a period of time.
  • Errors are an indicator of how many issues occurred when processing these requests.
  • Saturation is a collection of metrics measuring the load on the system, e.g. the memory used, the CPU time, etc.

Of course, you can replace the word request with event or batch and adapt this list to other types of systems.
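To illustrate, here is a minimal sketch of how the four signals could be instrumented in Python with the prometheus_client library; the metric names and the toy request handler are mine, not a prescription.

```python
import random
import resource
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Hypothetical metric names; adapt them to your own conventions.
REQUEST_LATENCY = Histogram("request_duration_seconds", "Latency: time spent handling a request")
REQUEST_COUNT = Counter("requests_total", "Traffic: number of incoming requests")
REQUEST_ERRORS = Counter("request_errors_total", "Errors: number of failed requests")
MEMORY_USED = Gauge("process_max_rss_bytes", "Saturation: peak resident memory of the process")

def handle_request():
    REQUEST_COUNT.inc()
    with REQUEST_LATENCY.time():                   # records how long the block takes
        try:
            time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        except Exception:
            REQUEST_ERRORS.inc()
            raise

if __name__ == "__main__":
    start_http_server(8000)                        # exposes metrics on :8000/metrics
    while True:
        handle_request()
        # ru_maxrss is in kilobytes on Linux (bytes on macOS).
        MEMORY_USED.set(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024)
```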

Before going any further, it is vital to make sure you monitor the data you need to get an accurate picture of what’s going on in your system.

However, this is not enough. Data is just data and more work is required to turn all of it into valuable information.

Useful Dashboards

First of all, a useful dashboard doesn’t have a gazillion graphs with moving needles and flashing colours everywhere. This might look fancy to some but it is almost certainly a waste of time.

There is a lot more value in taking the time to think of a few key views from which we can quickly see 1) whether the system is operating properly and 2) what exactly is wrong with it when things don’t go according to plan.

Here are some questions to ask when adding a view to a dashboard.

  • Can someone tell at a glance whether something is wrong?
  • Does the view give an indication about how to fix the issue? Or is there another view which can give that information?
  • Would a problem shown on that view be visible on another one as well, i.e. is it redundant?

Once you have identified valuable information to show on a dashboard, the next question that naturally arises is how to organise it.

A good rule of thumb is to go from the generic to the specific.

Imagine yourself in the shoes of the person in charge of fixing production at 3am.

The first question you are probably going to ask is whether everything is okay. If, say, you can see an increased error rate, the next question is going to be: what kind of errors?

In that sense, a dashboard should be readable almost as a story.

This process can be dramatically eased by accompanying all views with:

  • What: What is shown exactly. A descriptive title and appropriate axis labels are a good start, but they are often not enough.
  • How: How to use/interpret the view, i.e. what is expected and what is abnormal. If it is above a certain threshold, what does it imply?
  • Why: During an incident, someone looking at this view wants to know the why, i.e. identify the root cause. Views can be accompanied by pointers to other views or dashboards to look at, or by hints at what might be wrong with the system.

Don’t forget you are not limited to a single dashboard either, and that different dashboards can be created for different purposes, audiences, and levels of granularity.

For example, there could be one dashboard showing generic metrics about the number of active users, the conversion rate, etc. for management.

Another generic one, designed for the people operating the system, could show the requests per second, the error rate, the CPU time, and the memory used.

Yet another might show more granular information about errors and might be used when the error rate is unusually high. A link to this dashboard should be placed next to the aforementioned view showing error rates.

Alerts

In a perfect world, customers contact their service provider immediately when something is not quite working while providing a detailed description of the incident together with error messages or screenshots.

Unfortunately, this is rarely the case. (And we can’t really blame them.)

More often than not, users just give up, cursing your company, only to give it another go later or, worse, start looking at competitors.

You, therefore, want to know about incidents as early as possible, maybe even before they start seriously impacting anyone. That is, you need alerting.

Just like dashboards, alerts should be designed with their consumers in mind, e.g. a software developer who has already spent the whole day working and has just been woken up in the middle of the night, feeling very groggy.

There are two main rules for alerts.

The first rule is to keep them to a minimum. A constant stream of alerts and notifications is just noise that quickly gets ignored. Not only does it fail to achieve its goal, but it’s a sure path to burning out.

It might be worth revisiting the alerts and incidents from the previous few months, classifying them into the four quadrants of true/false positives and negatives, and tidying things up.
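If it helps, the exercise can be as simple as this toy Python sketch; the field names and sample data are made up.

```python
# Pair each past alerting window with whether an alert fired and whether
# there was a real incident, then count the four quadrants.
past_events = [
    {"alert_fired": True,  "real_incident": True},   # true positive
    {"alert_fired": True,  "real_incident": False},  # false positive: candidate for removal
    {"alert_fired": False, "real_incident": True},   # false negative: missing alert
    {"alert_fired": False, "real_incident": False},  # true negative
]

quadrants = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
for event in past_events:
    if event["alert_fired"] and event["real_incident"]:
        quadrants["TP"] += 1
    elif event["alert_fired"]:
        quadrants["FP"] += 1
    elif event["real_incident"]:
        quadrants["FN"] += 1
    else:
        quadrants["TN"] += 1

print(quadrants)  # a high FP count is a strong hint that some alerts should go
```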

The second rule is to make alerts informative. If you are half awake, you don’t want to have to decrypt an alert message which was written in a hurry.

Careful thought should go into alert messages beforehand, while we have time and the building is not, figuratively speaking, on fire.

On top of not making any unreasonable assumptions about knowledge of the system and removing any ambiguity, messages should also come with pointers as to where to look next.

A good rule of thumb is to accompany all alerts with a link to the relevant metrics dashboard.
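To make this concrete, here is a hypothetical sketch of such an alert payload; the fields, URLs, and webhook endpoint are invented for illustration and should be adapted to your alerting tool.

```python
import json
import urllib.request

# An informative alert: what happened, how bad it is, and where to look next.
alert = {
    "title": "Checkout API error rate above 5% for 10 minutes",
    "description": (
        "The 5xx rate on the checkout service exceeded 5% of requests. "
        "Recent deployments and database saturation are the usual suspects."
    ),
    "severity": "page",
    "dashboard": "https://grafana.example.com/d/checkout-errors",        # where to look next
    "runbook": "https://wiki.example.com/runbooks/checkout-error-rate",
}

def send_alert(payload, webhook_url):
    """Post the alert to a (hypothetical) webhook endpoint."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status

print(json.dumps(alert, indent=2))
# send_alert(alert, "https://alerts.example.com/webhook")
```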

Conclusion

There is a lot more to say about monitoring of course but I think this is already a good start if you are unsure how to approach it from a high-level perspective.

There are two running themes to keep in mind:

  • Monitoring requires more thought than is usually allocated to it. Time spent carefully designing monitoring upfront is time saved during an incident.
  • Dashboards and alerts should always be designed with their consumers in mind.
