Test pyramid as a measurable code metric

Vadym Barylo · Published in Level Up Coding · Apr 24, 2022 · 6 min read

Geometry knows a single definition of a square pyramid. Software engineers, however, know many types of pyramid shapes, and all of them are called “test pyramids”.

Definition

The “Test Pyramid” is a metaphor that tells us to group software tests into buckets of different granularity. It also gives an idea of how many tests we should have in each of these groups. Although the concept of the Test Pyramid has been around for a while, teams still struggle to put it into practice properly…

The ideal test pyramid, as defined by Martin Fowler, should improve the efficiency and speed of covering real business scenarios by choosing the correct test type for each test case. Keeping the correct proportions means the majority of tests are easy to write, stable, and fast.

From my point of view, the shape of the resulting test pyramid is also a litmus test for the validity of the software design, and a good indicator of how well the team has adopted the TDD methodology.

A few examples?

Let’s review a few examples of test pyramids and possible reasons for such shapes.

Warning, software design complexity detected

Probably the architecture is so unclear and complicated that the system can be tested only as a black box. The single-responsibility principle is violated, so modules and services become tangled, and testing integration scenarios is a big challenge.

Given this complexity, in the long term the only option is to test either individual class behaviors or full user behavior in general.

Warning, lack of confidence detected

Probably the engineer doesn’t trust even themselves. Even with enough tests to prove that a behavior works as expected, each behavioral change is additionally re-checked on every next layer and in surrounding behaviors. This indicates that the architecture is still complicated enough: you can’t trust the expected flow for some reason and want to re-check the full path a particular change propagates through. As a result, more time is spent re-checking already checked behavior using more complicated and less stable test types.

Warning, lack of trust detected

The engineer came very close to grasping the test pyramid philosophy, but doesn’t pay as much attention to edge cases and the variety of scenarios for the designed feature as the QA team does.

This relatively simplistic reading of the business requirements alarmed the QA department, so many tests are duplicated and many others are added to cover missing scenarios. This type of test pyramid usually reveals a lack of communication between departments, so some work is duplicated.

Warning, TDD rules violation detected

There is evidence of testing individual class behaviors or exact feature behavior in general. This type of test pyramid is similar to the one above but has a different meaning: probably the team just didn’t follow TDD as a practice from the very beginning, so the layers, hierarchy, and communication between them are already designed in a way that is not TDD-friendly.

This type of pyramid can also indicate that the solution is strongly locked into a vendor framework, so there is no strong application design to follow.

A good example is when Spring DI is used as a de facto engine to magically access any service from any place in the application, with no clean hierarchy introduced. With this paradigm there is no dedicated layer where specific services are integrated and communicate; instead there is a mix of cyclically dependent, complex services with a dynamic set of dependencies, so the complexity of testing such services is much higher.

How to control the test pyramid you have?

This whole development paradigm with the test pyramid at its center sounds more like an abstract pattern than a concrete, measurable testing methodology.

As engineers, we want to control the design and test coverage continuously and fire some warnings when defined thresholds are crossed.

While code coverage control is no longer a challenge (there are plenty of third-party platforms, such as SonarQube, that give you a deep view of the code and its coverage), controlling the shape of the test pyramid still is.

Obviously, we can pay more attention to the code review process and react to pattern violations detected during review. This is definitely a “must-have” practice, but it is not error-free because of the human factor. So it is better to improve the CI process with shape recognition to ensure the pyramid’s validity.

The validation formula can be simple enough: the majority of tests must be of type “unit”, with only a small percentage of “integration” and “end-to-end” tests. We can state it as requirements:

  • 60 percent or more of all tests must be unit tests
  • less than 10 percent may be end-to-end tests
  • less than 30 percent may be integration tests

Keeping these proportions under control helps the pyramid look like a pyramid. As a result, the majority of tests will be fast and stable, a very small number of tests will check complete user scenarios, and the layer where many services integrate to produce complex flows is also covered by a large enough percentage of tests.
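These threshold rules can be sketched as a small bash check. The counts and variable names below are illustrative, not taken from any real pipeline:

```shell
#!/usr/bin/env bash
# Illustrative counts; in a real pipeline these come from the test runs.
unit_count=120
integration_count=30
e2e_count=10

total=$((unit_count + integration_count + e2e_count))

# Integer percentages are precise enough for a threshold check.
unit_pct=$((100 * unit_count / total))
integration_pct=$((100 * integration_count / total))
e2e_pct=$((100 * e2e_count / total))

echo "unit=${unit_pct}% integration=${integration_pct}% e2e=${e2e_pct}%"

# Thresholds from above: 60+% unit, <30% integration, <10% e2e.
if (( unit_pct >= 60 && integration_pct < 30 && e2e_pct < 10 )); then
  echo "Pyramid is valid"
else
  echo "Pyramid is invalid"
  exit 1
fi
```

The non-zero exit code in the invalid branch is what makes the check “gated”: the CI step fails and blocks the merge.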

So our CI can be as simple as:

  • run all tests of type “unit” and store total count executed
  • run all tests of type “integration” and store total count executed
  • run all tests of type “end-to-end” and store total count executed
  • check the expected percentage of each type is in a defined range

Demo

A JHipster-based project was created as a reference.

A new PR uses a GitHub Actions pipeline to check the test pyramid as a gated step.

Steps to introduce test pyramid check:

  • create a GitHub Action that runs the check on each PR event (create, commit)
  • create a Gradle task to calculate the pyramid (the percentage of each test type)

For demo purposes, the simplest strategy of classifying tests by type was chosen: all tests in a specific folder are of a specific type. Marker interfaces can also be used for more granular test classification.
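The folder-based classification can be sketched by counting test sources per directory. The source-root names below mirror a common Gradle source-set layout and are illustrative, not the demo project’s exact structure:

```shell
#!/usr/bin/env bash
# Folder-based classification: every *Test.java under a suite's source
# root counts toward that suite.
count_sources() {
  find "$1" -name '*Test.java' 2>/dev/null | wc -l
}

# Demonstration on a synthetic layout:
root=$(mktemp -d)
mkdir -p "$root/src/test/java" "$root/src/integrationTest/java"
touch "$root/src/test/java/FooTest.java" "$root/src/test/java/BarTest.java"
touch "$root/src/integrationTest/java/SyncTest.java"

# $(( )) normalizes any whitespace padding in wc output.
unit_total=$(( $(count_sources "$root/src/test/java") ))
integration_total=$(( $(count_sources "$root/src/integrationTest/java") ))
echo "unit=$unit_total integration=$integration_total"   # unit=2 integration=1
rm -rf "$root"
```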

  • check as part of CI verification that the percentage of tests of each test type is in the defined range, e.g.
if (( $(echo "$unit_percent >= 0.6" | bc -l) ))
then
  echo "Pyramid is valid"
else
  echo "Pyramid is invalid: unit tests are only $unit_percent of the total"
  false
fi

and can be visualized as:

CI with test pyramid check

So test pyramid validity can be considered one more metric we can rely on, failing the delivery of any code increment that makes the overall state worse.

Having metrics as threshold indicators makes your code delivery process less dependent on human control. As with static code analysis or code coverage, it can become a blocker when a team works near the border, so periodic design reviews and control over TDD practice are prerequisites for adopting one more gated verification. On the other hand, it visualizes well and can easily reveal existing product design issues.
