Harness the Power of Evolution to Improve Your Unit Tests

An introduction to mutation testing

Published in Level Up Coding · 7 min read · Dec 2, 2020

Cancerous cells under a microscope — Photo by National Cancer Institute on Unsplash

What’s the problem with unit tests? You can write as many of them as you want, and they may all pass, yet that still doesn’t prove your code works as expected. Or, as Dijkstra elegantly put it:

Testing shows the presence, not the absence of bugs. — Edsger W. Dijkstra

Even to show the presence of bugs, you need to have enough tests written, so you can change and release your code with a high degree of confidence.

Not having enough tests can manifest in two different ways. Firstly, the resulting software doesn’t meet the requirements and its users are not getting what they paid for. A careful review of the requirements and the code, side by side, can reduce this risk, as can exposing your software to users early and often. Secondly, not every line and branch of the code is exercised by the test suite.¹ This can be easily measured with the army of code coverage tools available, which are often built into the unit test runner or library you use.

Map showing Europe, parts of Asia and parts of Africa with red dots of various sizes over individual countries. — Photo by Clay Banks on Unsplash

Code coverage (or test coverage) is simply the degree to which the code is executed when a given test suite runs. There are various ways to measure this. Generally, if your project has 100% coverage, your unit tests ran every bit of code written. The discipline of Test Driven Development is definitely beneficial to achieving high coverage. If you have to write a test before the corresponding code, you’ll almost effortlessly reach 100%.

High code coverage might suggest there is a low chance of bugs being introduced. But go ahead and remove all assertions from your unit tests, and your test coverage will remain the same. Now you have a test suite covering 100%, but it’s as useful as not having tests at all.

A good enough test suite will ensure the semantic stability of your software: if you introduce any modification in the code altering its meaning, at least one test case should fail, alerting you to the change in semantics. Of course, this is what we all expect from all test suites, but often it’s quite a challenge to live up to.

How can you create a good enough test suite? How do you make sure you don’t just run every statement, but also test every statement? This is where mutation testing comes in.

Mutating a toy example

Toy rubber duck — Photo by Timothy Dykes on Unsplash

Take a look at the following Python snippet:

def double(number: float) -> float:
    return 2.0*number

def test_double() -> None:
    assert double(2.0) == 4.0

The functionality double provides is simple: multiply any input by 2 and return the result to the caller. Surely, testing it is just as simple: we just need to verify that what the function returns is in fact double the input. But is verifying that 4 is indeed the double of 2 good enough? After all, the code has 100% coverage.

What if we replace multiplication (*) with addition (+)? The code would become:

def double(number: float) -> float:
    return 2.0+number

def test_double() -> None:
    assert double(2.0) == 4.0

It’s easy to see that adding 2 to a number will not, in general, double it. But because of the numbers used in the test, it will pass (2+2 is indeed 4), even though the code is clearly broken. This small test suite doesn’t provide semantic stability, so bugs can go unnoticed. We could use different numbers in the test, 3 and 6 for example, making it resilient to the change of the operator (2+3 is not equal to 6).

However, the operator is not the only thing we can break in the double function. We could replace the constant 2.0 with some other number, as well as the usage of the number parameter. We could replace the whole function body to return a constant. With the “right” changes to the code, the test will still pass, but the code will be broken. Even in this one-line function, there are at least four things we could alter (break), and we need several test cases to catch them.
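Here is a rough sketch of those four breakages written out by hand. The mutant functions, their names, and the replacement values (3.0, 4.0) are my own picks for illustration; other choices would need other test cases to catch them.

```python
from typing import Callable, List, Tuple

Case = Tuple[float, float]  # (input, expected output)


def double(number: float) -> float:
    return 2.0 * number


# Four hand-made mutants, one for each thing we could break.
def mutant_operator(number: float) -> float:
    return 2.0 + number  # * replaced with +


def mutant_constant(number: float) -> float:
    return 3.0 * number  # constant 2.0 replaced with 3.0


def mutant_parameter(number: float) -> float:
    return 2.0 * 2.0  # parameter replaced with a constant


def mutant_body(number: float) -> float:
    return 4.0  # whole body replaced with a constant


MUTANTS = [mutant_operator, mutant_constant, mutant_parameter, mutant_body]


def passes(func: Callable[[float], float], cases: List[Case]) -> bool:
    """Return True when func gives the expected output for every case."""
    return all(func(value) == expected for value, expected in cases)


def survivors(cases: List[Case]) -> List[str]:
    """Names of the mutants that the given test cases fail to detect."""
    return [mutant.__name__ for mutant in MUTANTS if passes(mutant, cases)]


weak_cases = [(2.0, 4.0)]
strong_cases = [(2.0, 4.0), (3.0, 6.0)]

print(survivors(weak_cases))    # three of the four mutants slip through
print(survivors(strong_cases))  # no mutant survives
```

With the single original test case, only the changed constant is caught. Adding the 3-to-6 case is enough to catch these particular mutants, although a different set of mutants could well need more cases.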

Mutation testing in practice

Computer screens showing software source code — Photo by Fotis Fotopoulos on Unsplash

Introducing faults in the source code one-by-one, running the tests, and evaluating whether the error was caught by the test suite is what mutation testing is all about. Each fault added is called a mutant, and it is “killed” when one or more tests fail.

Creating these changes manually, to verify the stability of a test suite, is cumbersome and time-consuming. This is the burden mutation testing tools help you with. These tools generally have a set of mutation types they can introduce.

As part of a mutation test session, a tool would go through its predefined set of possible mutations, find all viable locations for each, perform the mutation and run the test suite. If a tool can handle 10 different kinds of mutations and, in the code, each of them can be introduced in 10 different locations, the result is 100 mutants and 100 test runs.

Kinds of mutants

The kinds of code alterations you or a mutation testing tool can introduce depend on the language you use and its paradigm.

The list of possible mutations is endless, but just to name a few:

Replace arithmetic operators (as seen in the above example with * and + being switched)
Replace boolean expressions with true or false
Replace a constant with another value (often adding or subtracting 1 for numbers, or appending "XX" to strings, etc.)
Replace boolean operators (and to or and vice versa)
Replace boolean relations (e.g. < to <=)
Replace break with continue
Invert if and while conditionals (e.g. if condition(): to if not condition():)
Suppress exceptions
Remove function calls and other statements
Remove super calls
Replace access modifiers (e.g. public to private)
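As a small illustration of one entry from the list, here is what a relation mutation (< to <=) might look like, and which kind of test input detects it. The `is_minor` function and the threshold of 18 are hypothetical.

```python
def is_minor(age: int) -> bool:
    return age < 18


def is_minor_mutant(age: int) -> bool:
    return age <= 18  # relation mutated from < to <=


# Inputs far from the boundary cannot tell the two versions apart...
assert is_minor(10) == is_minor_mutant(10)
assert is_minor(30) == is_minor_mutant(30)

# ...only the boundary value itself exposes the mutant.
assert is_minor(18) != is_minor_mutant(18)
```

A test suite that never checks the value 18 would let this mutant survive.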

Toy monster — Photo by Ashkan Forouzani on Unsplash

Evaluating the results

The best mutant is a dead mutant. If the test suite fails after introducing a fault, the mutant is detected by the test suite and killed. But if the tests pass and the mutant goes unnoticed, it survived. A surviving mutant is an indication of a missing test case or a test case that needs improving.

Rarely, a mutation renders the code uncompilable or causes an infinite loop to be executed. (Think of the possible implications of replacing break with continue.) Mutants that render the code unusable in this way are called incompetent. Most tools use timeouts to handle infinite loops and generally report these mutants separately in their statistics.

Sometimes a mutation doesn’t change the behavior of the code at all and survives. This is called an equivalent mutant and is one of the biggest obstacles to the practical use of mutation testing. Equivalent mutant detection is its own science and has produced many research papers, such as the nicely titled: Overcoming the Equivalent Mutant Problem: A Systematic Literature Review and a Comparative Experiment of Second Order Mutation.
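A small, hypothetical example of an equivalent mutant: changing > to >= in the function below alters the code but not its result for integers, because when both arguments are equal it doesn’t matter which one is returned.

```python
def largest(a: int, b: int) -> int:
    if a > b:
        return a
    return b


def largest_mutant(a: int, b: int) -> int:
    if a >= b:  # relation mutated from > to >=
        return a
    return b


# No integer input distinguishes the two, so no test can kill the mutant.
values = range(-3, 4)
assert all(largest(a, b) == largest_mutant(a, b) for a in values for b in values)
```

No matter how many test cases you add, this mutant survives, and it drags the mutation score down without pointing at a real gap in the tests.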

The success of a mutation test session is indicated by the mutation score, which is calculated as the ratio of killed mutants over generated mutants. If all the mutants are dead, the mutation score is 100%, which should give you plenty of confidence in your test suite.
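The score itself is a plain ratio. A minimal sketch, with a made-up function name and made-up example numbers:

```python
def mutation_score(killed: int, generated: int) -> float:
    """Ratio of killed mutants over generated mutants."""
    if generated <= 0:
        raise ValueError("at least one mutant must be generated")
    return killed / generated


# Hypothetical session: 100 mutants generated, 87 of them killed.
print(f"{mutation_score(87, 100):.0%}")  # prints 87%
```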

Complexity

You can see how this can quickly get out of hand though. Even a moderately complex software project would have hundreds and hundreds of possible mutations and every single mutation results in a new execution of your test suite.

If your unit tests take 30 seconds to run, a few hundred mutations can push the execution time to the scale of hours. And that is before accounting for compilation time in languages like C++, C#, or Java. Many tools use various forms of multiprocessing to moderate the issue of time, turning it into a matter of computing power. Mutation testing is expensive, one way or another.
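A back-of-the-envelope calculation shows the scale. The figure of 300 mutants and the 8 parallel workers are assumptions picked for illustration, and the estimate ignores compilation time and any per-worker overhead.

```python
def estimated_hours(mutants: int, suite_seconds: float, workers: int = 1) -> float:
    """Idealized wall-clock time: one full test run per mutant."""
    return mutants * suite_seconds / workers / 3600


print(estimated_hours(300, 30))             # 2.5 hours on a single process
print(estimated_hours(300, 30, workers=8))  # 0.3125 hours, about 19 minutes
```

Parallelism shortens the wait, but the total amount of computation stays the same.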

Tools

Easy-to-use tools are available for many popular programming languages. To name a few:

PITest for Java
Stryker for JavaScript, C# and Scala
MutMut for Python
Mull for C++

There are many others as well. These tools have varying degrees of maturity and capabilities, but it’s definitely worth trying some of them out. You can only win and make your unit test suite more robust and comprehensive.

Summary

Mutation testing is a form of white-box testing where faults (or mutants) are introduced into your code, then for each mutant, your test suite is run. If the tests fail, the mutant is “killed”. If the tests pass, the mutant “survived”. The more mutants killed, the more your test suite ensures the semantic stability of your code.

While this is a powerful technique, it is expensive. The sheer scale of the problem at hand often makes it an impractical tool to employ, even for small software projects. Unit tests should be able to provide quick and cheap feedback, on the scale of seconds.

Yet, even if you don’t introduce a mutation testing tool in your development workflow, you can certainly benefit from the mindset of keeping semantic stability in focus when building your tests. Asking myself “What if this part of the code changes, will the tests fail?” definitely gave me a new perspective when writing and reviewing code.

Additional sources and reading