How to Parallelize Your Python Tests Easily With CircleCI

Don’t let tests slow you down!

Albert Jimenez
Level Up Coding

Continuous Integration

Continuous Integration (CI) is a widespread practice in the software community, especially in companies where multiple developers work on the same codebase. CI allows developers to frequently merge code changes into a shared repository, where builds and tests are run automatically. Tools such as CircleCI are used to verify the new code’s correctness and keep code quality high before integration.

Companies working with a large codebase or with machine learning models often find that their tests take a long time to run. Typically, the whole test suite is configured to run on every new commit of every pull request. Long test builds slow down the deployment of features, trigger timeout errors, and can even hurt team morale (no one wants to wait for a 40-minute test run to pass just to merge a small typo fix).

Another 40 minute test build! — Image by Tim Gouw on Unsplash

Fortunately, cluster parallelization comes to the rescue and saves the lives of millions of developers!

The more tests your project has, the longer they take to complete on a single machine. To decrease the runtime, we can run tests in parallel by spreading them across multiple separate machines (executors, in CircleCI terms). CircleCI offers a built-in solution that supports this feature (https://circleci.com/docs/2.0/parallelism-faster-jobs/).

Parallelization split— Image from CircleCI

However, there are a few caveats to take into account in the implementation in order to make it work properly…

Parallelization is easy with CircleCI

CircleCI offers three ways to parallelize your tests:

  1. Using file names: will split tests alphabetically.
  2. Using file size: will split tests by the size of the file.
  3. Using timing data: will split tests based on the runtimes recorded in previous builds.

As you can imagine, the optimal way to reduce time is the third option, as neither file size nor file names take test runtime into account. Implementing a timing split will be the final objective so… let’s get started!

Your typical configuration file will probably look something like this:
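As a minimal sketch of such a starting point, a single-executor configuration might look like the following. The Docker image, the requirements.txt file, and the tests/ directory are assumptions for illustration, not details taken from the author's project.

```yaml
version: 2.1

jobs:
  test:
    docker:
      # Assumed image; use whichever Python image your project needs.
      - image: cimg/python:3.10
    steps:
      - checkout
      - run:
          name: Install dependencies
          # Assumes dependencies are listed in requirements.txt.
          command: pip install -r requirements.txt
      - run:
          name: Run tests
          # Runs the whole suite on one executor (assumed tests/ layout).
          command: pytest tests/

workflows:
  test:
    jobs:
      - test
```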

To begin, we specify the number of machines with the parallelism key. It is set at the job level and determines how many independent executors will be created to run the steps of that job.

Next, once we have selected the number of machines, we need to tell CircleCI which tests we want to split and how. CircleCI provides the circleci tests glob and circleci tests split commands to select and split the tests, respectively.
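The two commands are usually chained inside a run step. The fragment below is a sketch under the assumption that test files live under tests/ and are named test_*.py; the TESTFILES variable name is also just illustrative.

```yaml
steps:
  - run:
      name: Run this executor's share of the tests
      command: |
        # glob selects the candidate files; split keeps only the subset
        # assigned to the current executor.
        # --split-by=timings uses runtimes from previous builds.
        # Alternatives: --split-by=name (the default) or --split-by=filesize.
        TESTFILES=$(circleci tests glob "tests/**/test_*.py" | circleci tests split --split-by=timings)
        pytest $TESTFILES
```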

For a four-machine parallelization, if we update our config file following CircleCI’s instructions, it will look like this:
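A sketch of what that four-executor configuration could look like is shown below. The image, file layout, and variable name are assumptions; only the parallelism key and the two circleci tests commands come from the text above.

```yaml
version: 2.1

jobs:
  test:
    docker:
      - image: cimg/python:3.10   # assumed image
    parallelism: 4                # four independent executors for this job
    steps:
      - checkout
      - run:
          name: Install dependencies
          command: pip install -r requirements.txt   # assumed dependency file
      - run:
          name: Run tests
          command: |
            # Each executor receives a different subset of the files.
            TESTFILES=$(circleci tests glob "tests/**/test_*.py" | circleci tests split --split-by=timings)
            pytest $TESTFILES

workflows:
  test:
    jobs:
      - test
```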

But not so easy!

With only these modifications, the build will not work properly.

CircleCI requires that we store runtimes from previous builds. We must add an extra step, store_test_results, which uploads the test results found in the specified directory so that CircleCI can save the timing data for later builds. The timing data records the test filename (or classname) and how long each test took to complete in that particular build.
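In the configuration, this is one more entry in the job's steps, placed after the tests have run. The fragment below is a sketch that assumes the test reports are written to a test-results directory.

```yaml
steps:
  # ... checkout, install, and test steps go here ...
  - store_test_results:
      # Directory containing the JUnit XML reports produced by the test step.
      path: test-results
```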

That being said, this change alone was not enough for me to get a correctly parallelized build. To explore what was happening, I added the store_artifacts step to the config file. That allowed me to download and review the test result files being saved. It turned out that the files I was saving were not in the format CircleCI expects in order to read the timings and split the tests properly.

After some conversations with the CircleCI support team, as well as my own investigation, I made the following additions to the configuration file:

  • Created a file .circleci/resources/pytest_build_config.ini to change the format of the JUnit XML report that pytest writes:
      [pytest]
      junit_family=xunit1
  • Added a command to copy the file above as pytest.ini at test time: cp -f .circleci/resources/pytest_build_config.ini pytest.ini
  • Added the shopt -s globstar command to be able to get all the test paths from subfolders.
  • Saved my test results in a folder by adding the --junitxml=test-results/junit.xml option to the pytest command.

So my final configuration file ended up looking like this:
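Putting the additions listed above together, a configuration along these lines would be a reasonable sketch. It is not the author's exact file: the Docker image, requirements.txt, the tests/ layout, and the TESTFILES variable name are assumptions.

```yaml
version: 2.1

jobs:
  test:
    docker:
      - image: cimg/python:3.10   # assumed image
    parallelism: 4
    steps:
      - checkout
      - run:
          name: Install dependencies
          command: pip install -r requirements.txt   # assumed dependency file
      - run:
          name: Run tests
          command: |
            # Use the build-specific pytest settings (junit_family=xunit1).
            cp -f .circleci/resources/pytest_build_config.ini pytest.ini
            # Let ** patterns match test files in nested subfolders.
            shopt -s globstar
            TESTFILES=$(circleci tests glob "tests/**/test_*.py" | circleci tests split --split-by=timings)
            mkdir -p test-results
            pytest --junitxml=test-results/junit.xml $TESTFILES
      # Upload the JUnit XML so later builds can split by timings.
      - store_test_results:
          path: test-results
      # Keep the raw report downloadable for inspection.
      - store_artifacts:
          path: test-results

workflows:
  test:
    jobs:
      - test
```

On the first build there is no timing data yet, so the timing-based split only takes effect once a previous build has stored its results.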

After making all these changes, I finally got parallelization working for my Python tests.

Conclusion

CircleCI offers a simple yet powerful way of splitting tests across several machines. However, writing the configuration file is not always trivial. The examples above contain a working configuration file for you to start parallelizing your Python test suite.

In a highly competitive environment, test parallelization is an advantage: it cuts the time engineers spend waiting on builds and speeds up the shipping of new features.

Don’t let tests slow you down!

Thanks for reading.

Machine Learning Engineer & Researcher @scribd. I write about ML, Data Science and MLOps.