Immutable health check management

bitsofinfo · Published in Level Up Coding · Aug 25, 2020


If you’ve ever had to monitor an application, endpoint, or website, you’ve likely come across the hundreds of monitoring services that can execute simple HTTP-based checks from N global locations and then notify an operator when certain thresholds are met. One of the more widely known services that can do this is Pingdom.

On a past project, the team was tasked with monitoring an application composed of several underlying components, all sitting behind a single endpoint FQDN whose various paths were serviced by N underlying applications, each of which either exposed its own health check or simply needed to return a 200 indicating it was up. Unfortunately, the FQDN and the backing applications were not in any cloud platform or orchestrator that provided even the most basic internal facilities for health-checking endpoints, much less from N global locations. Because of this, we had to use a 3rd-party external service, which ended up being Pingdom. Out of the box, the process for creating and managing these health-check monitors with Pingdom looks something like this:

Clearly there is nothing groundbreaking here if you’ve ever had to utilize such a monitoring service:

  • User authenticates into the monitoring application
  • User manually configures one or more “checks” for N endpoints in the GUI or via API calls
  • The monitoring platform stores those configurations and distributes them to the desired global monitoring locations
  • Monitors around the world execute checks and report back to the platform
  • If thresholds are reached, alerts are sent out.

Manually managing potentially dozens of checks, many of which share similar boilerplate characteristics, was unappealing.

The monitoring for this application needed to adapt as new iterations of the apps were deployed to production and went through a graduated rollout to users (i.e. canary releases). The team didn’t want to manually re-configure (or re-add) N checks every time the monitoring needed a change or a new version needed to be monitored (in a different environment). Instead, the team wanted a way to manage a “set” or “group” of monitors, and to generate those monitors from a template. Secondly, these “sets” of monitor configurations should really be immutable: create a new “set” of checks, then later decommission the old “set” of check configurations wholesale.

Fortunately, Pingdom provides an API and the ability to “tag” named monitor configurations. With tagging, we could create a concrete representation of the “check sets” we were after: just generate new “sets” of monitors and manage them at that higher level.

Moving Forward

After some wrangling with Pingdom’s API to get things working in a prototype, I ended up creating a small utility that lets a DevOps engineer curate a YAML checks config file. The config defines some default check behaviors and then one or more named “sites”, each of which contains an FQDN and one or more “pathParts”; a “pathPart” is simply some portion of a URI that can contain one or more values, each of which can override the check behaviors defined in the defaults.

The operator then defines a “checks” section containing one or more named “checks”, each supporting a “forEach” directive that can be nested to combine a “site” with its “pathParts” and generate the full, unique URIs to be checked. Each generated check is tagged with a shared timestamp identifier plus individual tags for the check name, site, and pathPart, permitting management via tags.

An ultra-basic config file defines three paths to be checked for the “my.app.com” site; from this, three distinct Pingdom checks are generated, and each unique “monitor” gets its behaviors via the inherited default settings.
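A rough sketch of the shape of such a config is below. The key names here are illustrative only, not the tool’s authoritative schema (see the sample checkconfigs.yaml in the repo for that), but it follows the defaults → sites → pathParts → checks structure described above:

# illustrative sketch only; consult checkconfigs.yaml in the repo for the real schema
defaults:
  interval: 15m              # check every 15 minutes
  timeoutMs: 10000           # fail the check after 10 seconds
  notifyAfterFailures: 4     # alert after 4 consecutive failed checks
  priority: high
  users: ["23456"]
  teams: ["12345"]
  integrations: ["88723"]

sites:
  mysite:
    fqdn: my.app.com
    pathParts:
      paths: ["path1", "path2", "path3"]
    checks:
      v1:
        forEach: paths       # generate one check per value under pathParts.paths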

In dry-run mode, this generates a “set” of checks that would be pushed to Pingdom via its API, as follows:

loader.py --checks-config-file config.yaml --dump-generated-checks

2020-08-25 19:05:25,259 - root - DEBUG - generateChecks() initiating run w/ id: 20200825_19052525
2020-08-25 19:05:25,265 - root - DEBUG - Reading sites[mysite]
2020-08-25 19:05:25,266 - root - DEBUG - Reading sites[mysite].checks[v1]
2020-08-25 19:05:25,266 - root - DEBUG - sites[mysite].checks[v1] generated 3 checks.

------------------------------
v1
------------------------------
['NA'] -> https://my.app.com/path1
every:15m timeout:10000ms notifyAfter:4 fails,
priority:high users:['23456'] teams:['12345'] integrations:['88723']
again:30 intervals, whenBackUp:True
tags:['20200825_19052525', 'v1', 'my_app_com', 'priority-high', 'path1']

['NA'] -> https://my.app.com/path2
every:15m timeout:10000ms notifyAfter:4 fails,
priority:high users:['23456', 'abc'] teams:['12345'] integrations:['88723']
again:30 intervals, whenBackUp:True
tags:['20200825_19052525', 'v1', 'my_app_com', 'priority-high', 'path2']

['NA'] -> https://my.app.com/path3
every:15m timeout:10000ms notifyAfter:4 fails,
priority:high users:['23456'] teams:['12345'] integrations:['88723']
again:30 intervals, whenBackUp:True
tags:['20200825_19052525', 'v1', 'my_app_com', 'priority-high', 'path3']

…and these checks, once pushed to Pingdom, can be managed as a “set” via their tags, for example:

./loader.py \
--checks-config-file config.yaml \
--delete-tag-qualifiers 20200825_19052525 \
--delete-in-pingdom \
--pingdom-api-token-file trial.token

Once you have a config file defined, you can add new “sites” ad hoc and rerun the loader to regenerate new sets of checks to be loaded into Pingdom and managed distinctly as “sets” via the tagging mechanism; for example, a second site can simply be appended to the existing config, as sketched below.
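Again, the key names below follow the earlier illustrative sketch rather than the tool’s exact schema, and the second site is purely hypothetical:

sites:
  mysite:
    fqdn: my.app.com
    # ... existing pathParts / checks as before ...
  my_other_site:               # hypothetical additional site
    fqdn: my.other.app.com     # hypothetical FQDN
    pathParts:
      paths: ["status", "healthz"]
    checks:
      v2:
        forEach: paths         # one new check per path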

Now the scenario looks more like this: instead of hand-managing individual checks in the Pingdom GUI, the operator declares them in YAML, generates an immutable tagged “set” of checks, and retires superseded sets by tag.

pingdom-check-loader

The net-net of what was developed is available on GitHub as pingdom-check-loader, and it works pretty much as described.

The utility defines a simple CLI which lets you declare your desired check configuration state in YAML files; the CLI consumes that configuration, then generates one or more checks driven by the configuration. See the sample checkconfigs.yaml for more docs and details on the configuration format.

Once checks are generated, they can be created against a target Pingdom account. This CLI does not support mutating previously defined checks; check changes are additive in nature. You can generate new checks and delete old ones, and that’s it. Existing checks (while editable in the Pingdom GUI) are not mutable via this CLI, by design.

The checks generated and created by this utility are intended to be immutable; generated checks are tagged appropriately so they are easy to find via Pingdom’s GUI and APIs. Tags are automatically created based on a CLI invocation timestamp and the pathParts (see the YAML) so that all generated checks can be managed as a single set. You can then use these tags to delete checks (via this CLI), replacing them with newer generated iterations as your requirements change. You can do things in any order you desire; for example, create one iteration of checks, then iterate on the config and generate a 2nd iteration; once the 2nd iteration is functioning as desired, clean up the 1st iteration using the --delete-tag-qualifiers flag, passing the 1st iteration’s timestamp identifier.
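As a concrete sketch of that cycle, using only flags shown earlier in this post (the repo README documents the flag that actually pushes generated checks to Pingdom):

# 1) iterate on config.yaml, then dry-run to inspect the 2nd iteration's generated checks
./loader.py --checks-config-file config.yaml --dump-generated-checks

# 2) push the new set to Pingdom (create flag omitted here; see the repo README)

# 3) once the new set looks good, retire the 1st iteration by its timestamp tag
./loader.py \
  --checks-config-file config.yaml \
  --delete-tag-qualifiers 20200825_19052525 \
  --delete-in-pingdom \
  --pingdom-api-token-file trial.token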

https://github.com/bitsofinfo/pingdom-check-loader

There are some key annoyances with the Pingdom API that you should be aware of here: in particular, the way the GUI and the API treat regions with respect to multiple selections, and the lack of any easy way to get the user, team, and integration identifiers via the UI or the API (you have to dig them out of the HTML source of Pingdom’s GUI).

Originally published at http://bitsofinfo.wordpress.com on August 25, 2020.
