Java, Go, and Python: a multi-thread performance comparison

Jose Pablo
Level Up Coding
Published in
11 min readMay 27, 2022

--

From https://www.educations.es/study-abroad/fontys-school-of-fine-and-performing-arts/bachelor-of-circus-and-performance-art-1395880

Introduction

In computer a thread is a small sequence of instructions that can be performed independently by a processor. Multi-threading is possible within a process, wherein they share resources such as instructions and context.

New programming languages are being developed or old ones are being improved with features to allow software to better utilize threading and parallelism. As such it would be a good idea to compare a number of these languages to ascertain how well they handle multi-threaded processes.

Discovering the programming languages that have the best efficiency when running multi-threaded processes is significant as it could help software developers while picking the most advantageous languages for implementing their systems.

The purpose of this article is to analyze and compare the performance of Java, Go, and Python resolving several algorithms using their parallel tools, such as: threads in the case of Java and Python, and goroutines for Go. To evaluate the performance we write parallelized implementations of the classic matrix multiplication algorithm, quicksort algorithm, and Conway’s game of life.

Background

Java: it is an object-oriented programming language with built-in support for concurrency through the Thread class, the Runnable interface, and other functionality contained in the java.util.concurrent package.

Go: it is a programming language released by Google in 2012. It is a procedural language like C, and has garbage-collector and built-in support for concurrency. For concurrency and parallelism Go uses an implementation of threads called goroutines. It uses a built-in scheduler for handling threads behind the scenes and tries to hide much of the complexity of usual thread management from the user in favor of simply specifying what functions to run and prefacing them with the go command. Go uses operating system (OS) threads behind the scenes.

Python: the standard library threading, this module encapsulates threads, providing a clean interface to work with them. In this approach the OS actually knows about each thread and can interrupt it at any time to start running a different thread. It is possible to attain parallelism in Python using other of its standard libraries known as multiprocessing, with multiprocessing Python creates new processes. The threading module uses threads, the multiprocessing module uses processes. One of the differences is that threads run in the same memory space, while processes have separate memory.

Matrix multiplication: it is an important yet simple mathematical operation in linear algebra and the classic matrix multiplication algorithm (CMMA) can easily be implemented in any programming language. CMMA runs in O(N^3), where N is the dimension of the matrix.

Quicksort: it is a divide and conquer algorithm. Initially, it splits the array into two sub-arrays: the lower items and the higher items respectively. It then recursively sorts these sub-arrays. Its algorithm can be described as:

  1. Pick a pivot element from the array.
  2. Partition the array such that, all the elements to the left of the pivot have values less than the value of the pivot, and all elements to the right of the pivot have values greater than the pivot value.
  3. Recursively perform the above steps over the generated sub-arrays, until completely sorted sub-arrays are obtained.

It has an average sort complexity of O(n log(n)), but degrades to O(n^2) under rare circumstances.

Conway’s game of life: it was developed by John Horton Conway, a mathematician at Gonville and Caius College of the University of Cambridge, in the 60s. The rules are:

  1. Rule 1 Survival: if a live cell has two or three live neighbors, it survives.
  2. Rule 2 Death: if a live cell has less than two or more than three live neighbors, it dies.
  3. Rule 3 Birth: if a dead cell has exactly three live neighbors, it is born.

Next image shows several scenarios of Conway’s game of life from time T to time T + 2. A näıve implementation of the algorithm, which would check each and every cell on step, would have a time complexity of O(M × N ) where M represents the number of rows, and N the number of columns of the grid.

It shows several scenarios of Conway’s game of life from time T to time T + 2
Example of Conway’s game of life

Solution

All the benchmarks are executed in the same computer, and in the same environment. Table I shows the hardware specifications where we run our experiments, and table II shows the version of each language programming compiler

Hardware and software (programming languages) overview
Hardware and software (programming languages) overview

For each of the benchmarks we calculate its respective speedup and efficiency. The speedup is define as ratio of the best sequential time over the parallel time with p processors: S = T1/Tp. Then efficiency is defined as: overall utilization of each processor in the parallel system: E = S/p.

Matrix multiplication: in this experiment we use simple matrix multiplication that is calculated by C = AB. The size of the matrices is 512 × 512. We are running a sequential scenario, then several multi-threaded scenarios with a thread pool size of: 2, 4, 8, 16 and 32 threads. The same matrices A and B are used to run the code in Java, Go and Python. These matrices will have integer values from 0 to 1000 inclusive. The thread pool will received n works, representing the number of rows in matrix A, then each of the threads will perform the multiplication of that particular row, and updates the corresponding row in C with the obtained result. Each time a thread finish its work it will receive a new assignment from the thread pool until all work have been done. Next figure illustrates the calculation of matrix C.

Method of matrix multiplication
Method of matrix multiplication

Quicksort: we sort an array of 10^7 integer values from 0 to 10^7 inclusive. We use a ForkJoinPool approach to tackle this experiment. A sequential run is executed, then the multi-threaded tests go from 2 to 32 threads in powers of 2. The 3 implementations sort the same array. In the parallel implementation of Quicksort, post-partition the sorting of the new sub-sequences can be performed in parallel as there is no collision.

Conway’s game of life: there are several initial known patterns in Conway’s game of life that produce an expected result, there is one which grows infinite, left side of next image displays its initial setup. In this benchmark we define a 2D grid of 28 × 28 with 4 of the previous pattern next to each other, and execute the game during 20,000 generations, that grid looks like right side of next figure.

Infinite growth pattern

For this experiment we repeat the thread pool approach, it will handle 2, 4, 8, 16, and 32 threads. Also, we run a sequential implementation. We parallelized the application of the rules that decide if a cell keeps living, dies, or reborn in each generation. Also, in order to get the neighbours of a cell we are not taking the grid as an infinite grid, if a neighbour of a cell is outside the limits of the world it is considered dead.

Contributions

Matrix multiplication: table III shows the results obtained after running the matrix multiplication benchmark. Java was the best in the sequential execution, it ran the 512 × 512 matrix multiplication in 316 ms, while Go consumed 453 ms, that represents 43.35% more time in the sequential run. The worst performance is for Python which used 93,870 ms (93.87 seconds), a difference of around 29,600%.

Also, in table III is possible to see that Java performs better when the benchmark is executed using 2 threads, however Go outperforms Java and Python when the experiment is run in 4, 8, and 16 threads. We can detect a performance degradation at the 32 threads threshold, this might be due the overhead of creating too many threads that surpasses the physical cores in the CPU; even in this deterioration situation Go performs better than the other 2 languages.

Matrix multiplication time consumption (in milliseconds). Lower is better.
Matrix multiplication time consumption (in milliseconds). Lower is better.

Left side of next figure reflects that for Go the difference in performance from 2 to 4 threads is significant (around 92% of improvement), and from 8 to 16 threads it is marginal due the curve in the plot remains almost the same, which tell us that 8 threads is a good number for this particular scenario. On the other hand, Java’s performance steadily increases from 2 to 16 threads; from this plot we can pick 16 threads as a good amount of threads to have a decent performance in Java. From right side of next figure it can be assured that the worst efficiency in the utilization of the resources is to Python, and best is to Go.

Speedup and efficiency for matrix multiplication.
Speedup and efficiency for matrix multiplication.

Quicksort: in table IV we can see Go getting the best performance in the sequential run of the test, and Python the worst. The general behavior of Python implementation was very poor, in the scenarios with 16 and 32 threads the application was unable to finish the execution (DNF stands for did not finish) getting stuck and blocking the computer. Go presents interesting results, since the multi-threaded runs go from bad to worst performance. For this measure Java gets the best multi-threaded results, however in the runs with 16 and 32 threads we can find a degradation of the performance, being the one with 16 threads the worst.

Quicksort time consumption (in milliseconds). Lower is better.

The speedup from the left side of next figure shows that it does not make sense to run quicksort algorithm using more than 8 threads when it is implemented in Java. Also, we can see that the run with 32 threads was a bit better than the one with 16 threads, but worse than the run with 8 threads. The plot reflects that Go multi-thread implementations were not as good as the sequential. Even when the performance of Python was very poor, the runs with 2, 4, and 8 threads were better than the sequential.

The efficiency reflected in right side of next figure confirms what we can see in table IV in terms of the best, but not in terms of the worst, since this plot could make us think that Python was better than Go, however the consumption of time tell us a different history: Go outperformed Python.

Speedup and efficiency for quicksort.
Speedup and efficiency for quicksort.

Conway’s game of life: once again Python gets the worst results of all the 3 implementations, however as we can see in table V it was the only one that has some improvements from sequential implementation to multi-threaded implementation; nonetheless, according the number of threads increases so does the consumption of time. In the case of Java and Go in both of them the best performance was from the sequential implementation. In Java scenario the difference of time from the sequential version to the one with 32 threads is 11,455% which represents a huge gap between then. In the Java implementation after each generation the thread pool is disposed and a new one is created, this is creating a considerable overhead. Go behavior is unstable, time utilization increases from sequential to the version with 2 threads, then it decreases a little bit in the versions with 4, 8, and 16 threads, finally at 32 threads it raises again.

Conway’s game of life time consumption (in milliseconds). Lower is better.

The speedup plot from next figure confirms that the general performance for the multi-threaded implementation of Conway’s game of life is poor. The expectations around a speedup curve is that it should follow an increasing behavior, and that is not happening. The efficiency reported shows that the resources are not being properly exploited. Java’s efficiency is very near to zero, and there is just a marginal difference between Go and Java.

Speedup and efficiency for Conway’s game of life.
Speedup and efficiency for Conway’s game of life.

Conclusions

The overhead of create threads directly impacts the performance of an application. In table III is possible to see that it does not make sense to keep creating threads after certain threshold has been reached, because performance will degrade.

As table V shows, using multi-thread to tackle a problem does not guarantee that we are going to get a better performance, sometimes the sequential implementation is the best option.

Python standard libraries to handle concurrency and parallelism are not as good as the ones implemented in other programming languages as Java and Go.

It is hard to say which is better between Java and Go, due Java outperforms Go in the matrices multiplication benchmark, but Go exceeded Java in the quicksort experiment.

Future work

This work was limited to basic implementations of each of the algorithms, good next steps would be to replicate the scenarios that were introduced by this research but using the best known implementations of each benchmark.

There are several options to implement parallel code in Python, such as: MPI for Python, CharmPy, Numba, among others. A good experiment would be to replace the written code with more efficient implementations using this libraries.

A way to expand this research is to consider other technical aspects, like compilation time, file size, and number of written lines of code.

References

[1] S. Majumdar, I. Jain, and A. Gawade. “Parallel Quick Sort using Thread Pool Pattern.” In: International Journal of Computer Applications 136 (Feb. 2016), pp. 36–41.DOI: 10.5120/ijca2016908495.

[2] T. Andersson and C. Brenden, “Parallelism in Go and Java A Comparison of Performance Using Matrix Multiplication”, Dissertation, 2018.

[3] E. A. Lee. “The Problem with Threads”. In: Computer 39 (June 2006), pp. 33–42. DOI: 10.1109/MC.2006.180.

[4] J. Carl, “Parallel programming in Go and Scala : A performance comparison”, Dissertation, 2015.

[5] “Timeline of key Java milestone.” Oracle.com. https://www.oracle.com/java/moved-by-java/ (accessed: Jun. 19, 2021).

[6] “Oracle Java SE Support Roadmap.” Oracle.com. https://www.oracle.com/java/technologies/java-se-support-roadmap.html (accessed: Jun.19, 2021).

[7] R. Buyya, “Object oriented programming with Java”, Chapter 14, pp. 367–368, 2009. Tata McGraw Hill Education Private Limited.

[8] N. Togashi and V. Klyuev, “A novel approach for web development: A schedule management system using GAE/Go,” 2015 IEEE 7th International Conference on Awareness Science and Technology (iCAST), 2015, pp. 55–59, doi: 10.1109/ICAwST.2015.7314020.

[9] “Downloads.” Golang.org. https://golang.org/dl/ (accessed: May. 21,2021).

[10] “Frequently Asked Questions (FAQ).” Golang.org https://golang.org/ doc/faq#goroutines (accessed: May. 21, 2021).

[11] “Go scheduler source code.” Github.com. https://github.com/golang/ go/blob/master/src/runtime/proc.go (accessed: May. 21, 2021).

[12] A. A. A. Donovan and B. W. Kernighan, “Concurrency with Shared Variables,” in The Go Programming Language, 1st ed. New York, NY, USA: Addison-Wesley, 2015, pp. 280–283.

[13] J. J. Galvez, K. Senthil and L. Kale, ”CharmPy: A Python Parallel Programming Model,” 2018 IEEE International Conference on Cluster Computing (CLUSTER), 2018, pp. 423–433, doi: 10.1109/CLUS- TER.2018.00059.

[14] V. Sheromova, “A brief history of Python,” Exyte.com. https://exyte.com/blog/a-brief-history-of-python (accessed: Jun. 19, 2021).

[15] “Download the latest version for Mac OS X.” Python.org. https://www.python.org/downloads/ (accessed: Jun. 19, 2021).

[16] J. Anderson, “An Intro to Threading in Python.” RealPython.com. https://realpython.com/intro-to-python-threading/ (accessed: Jun. 19, 2021).

[17] J. Anderson, “Speed Up Your Python Program With Concurrency.” Re- alPython.com. https://realpython.com/python-concurrency/ (accessed: Jun. 19, 2021).

[18] K. Khankasikam, “Printed Lanna character recognition by using Conway’s game of life,” Seventh International Conference on Digital Information Management (ICDIM 2012), 2012, pp. 104–109, doi: 10.1109/ICDIM.2012.6360138.

--

--

Full time Sr. software engineer and part time MSc in Computer Science student with interest in AI, DL, HPC, and computer graphics. Love outdoors. Foodie.