Demystifying Golang Channels, Goroutines, and Optimal Concurrency

When does concurrency in Golang make sense, and at what point are there diminishing returns?

Matt Wiater
Level Up Coding

The Framework

In my exploration of Golang, I wanted to take a deeper dive into some areas of the language that I found interesting. Though Go’s approach is conceptually similar to the way other languages handle concurrency, I didn’t have much production-level knowledge of goroutines and concurrency.

As an engineer, I have the compulsion to pull things apart and see how they work — and hopefully, gain a fundamental understanding of best practices and the contexts in which certain patterns make the most sense.

For goroutines and channels, I created an application using a Dispatcher -> Worker -> Job pattern to benchmark and compare results in differing scenarios. These comparisons would allow me to see how the same pattern operated under different types of loads and hopefully uncover some situations where it was an optimal pattern to follow. Conversely, with the right array of scenarios, it’s just as important to know when this pattern does not provide the benefits I’m trying to achieve.

The code used in the following examples is available here.

The Question

After reading tutorials and documentation about channels and goroutines, I understood the theory and the concepts but didn’t know how I could predict when the usage of this type of pattern might be beneficial and when it might be harmful. Certainly, I knew that just jamming code into unlimited goroutines was not the answer. So, my question became:

When does concurrency in Golang make sense, and at what point are there diminishing returns?

Note: The examples below are based on my 8-core development environment. Your runtime.NumCPU() value may be different.
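
If you want to confirm what your own machine reports, a quick check looks like this:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Prints the number of logical CPUs available to this process.
    fmt.Println(runtime.NumCPU())
}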

An Idealized Scenario

While engineering always deals with the multivariate, eliminating as many variables as possible can bring you closer to the answer to a question. In trying to understand the basics of how Go executes goroutines and channels, I decided upon an idealized scenario in order to add as few variables to the equation as possible.

I decided to follow the Dispatcher -> Worker -> Job pattern using goroutines and channels. I started with this simple example and expanded on it.
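
For reference, here is a minimal sketch of that shape: a fixed pool of worker goroutines pulls jobs off a channel and pushes results onto another. This is the bare pattern the app builds on (expanded from the linked example), not the repo’s actual dispatcher code.

package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

// worker drains the jobs channel until it is closed, sending one
// result per job back on the results channel.
func worker(id int, jobs <-chan int, results chan<- string, wg *sync.WaitGroup) {
    defer wg.Done()
    for j := range jobs {
        time.Sleep(time.Second) // stand-in for real work
        results <- fmt.Sprintf("worker %d finished job %d", id, j)
    }
}

func main() {
    numWorkers := runtime.NumCPU() // one worker per available core
    numJobs := numWorkers * 2

    jobs := make(chan int, numJobs)
    results := make(chan string, numJobs)

    var wg sync.WaitGroup
    for w := 1; w <= numWorkers; w++ {
        wg.Add(1)
        go worker(w, jobs, results, &wg)
    }

    // The "dispatcher": queue every job, then close the channel so the
    // workers know there is nothing left to pull.
    for j := 1; j <= numJobs; j++ {
        jobs <- j
    }
    close(jobs)

    wg.Wait()
    close(results)

    for r := range results {
        fmt.Println(r)
    }
}

The benchmark app wraps this same shape with per-job timing and memory metrics so that runs with different worker counts can be compared.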

A Rough Estimate

From my exploration of the topic, it seemed that the most efficient way to execute this pattern is to have a maximum of runtime.NumCPU() workers executing jobs. The runtime.NumCPU() function returns the number of logical CPUs (cores) available in your environment, and that number should guide how to make the best use of your hardware.

As a baseline test, I decided to benchmark 1-n worker goroutines, based on runtime.NumCPU(), using a 1-second sleep job to standardize the length of time that the jobs executed by the workers ran. This would eliminate any variability in the jobs themselves. The single pertinent line in the EmptySleepJob example job below is the time.Sleep line. Everything else simply handles tracking the elapsed timing of the job: start the timer, run the job (sleep), end the timer, and save the timing data to jobResult. These timing metrics are stored for each job so that we can analyze the data and generate a report at the end of the tests.

func (job Job) EmptySleepJob() (string, float64) {
    jobStartTime := time.Now()

    // The only real "work": sleep for the configured number of milliseconds.
    time.Sleep(time.Duration(config.EmptySleepJobSleepTimeMs) * time.Millisecond)

    jobEndTime := time.Now()
    jobElapsed := jobEndTime.Sub(jobStartTime)

    // Record the timing metrics that feed the summary report.
    jobResult := structs.SleepJobResult{}
    jobResult.SleepTime = time.Duration(config.EmptySleepJobSleepTimeMs).String()
    jobResult.Elapsed = jobElapsed.String()
    jobResult.Status = strconv.FormatBool(true)

    jobResultString, err := json.Marshal(jobResult)
    if err != nil {
        fmt.Println(err)
    }

    return string(jobResultString), jobElapsed.Seconds()
}

In the real world, jobs are going to be dissimilar, with execution times that depend on a number of factors. If a job is querying or writing data, its execution time can change as, for example, the dataset grows or shrinks. Hardware, other running processes, and overall system resource utilization at the time the job executes can all affect it as well.

The app uses a .env file to define the tests, so this is the first test setup (comments added for clarity):

DEBUG=                 // Verbose console output, default: false
JOBNAME=EmptySleepJob  // Job: EmptySleepJob (default), PiJob, or IoJob
STARTINGWORKERCOUNT=1  // Workers to start the test with, default: 1
MAXWORKERCOUNT=8       // Workers to ramp up to, default: runtime.NumCPU()
TOTALJOBCOUNT=64       // Jobs to run, default: runtime.NumCPU() * 2
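
The repo loads these values through its config package; a simplified sketch of reading them with the documented defaults might look like the following (envInt and Load here are hypothetical helpers, not the repo’s API):

package config

import (
    "os"
    "runtime"
    "strconv"
)

// envInt is a hypothetical helper: read an integer environment variable,
// falling back to a default when it is unset or not a number.
func envInt(key string, fallback int) int {
    if v, err := strconv.Atoi(os.Getenv(key)); err == nil {
        return v
    }
    return fallback
}

// Load mirrors the defaults noted above: 1 starting worker,
// runtime.NumCPU() max workers, and runtime.NumCPU()*2 total jobs.
func Load() (startWorkers, maxWorkers, totalJobs int) {
    startWorkers = envInt("STARTINGWORKERCOUNT", 1)
    maxWorkers = envInt("MAXWORKERCOUNT", runtime.NumCPU())
    totalJobs = envInt("TOTALJOBCOUNT", runtime.NumCPU()*2)
    return startWorkers, maxWorkers, totalJobs
}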

This .env configuration starts the test with one worker running 64 jobs and captures the timing metrics. The worker count then increases by 1 for the next iteration, which processes the same jobs, and so on until the test ends with runtime.NumCPU() workers — in my case, 8. The metrics captured along the way are printed in a summary report, shown below: one row for each number of workers running the test.
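
Conceptually, the ramp-up is just a loop over worker counts. Here is a simplified sketch, with a one-second sleep standing in for the job and the values hard-coded from the .env above rather than loaded through the config package:

package main

import (
    "fmt"
    "sync"
    "time"
)

// runBatch runs totalJobs one-second sleep jobs across the given number of
// workers and returns the wall-clock time for the whole batch.
func runBatch(workers, totalJobs int) time.Duration {
    start := time.Now()
    jobs := make(chan struct{}, totalJobs)

    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for range jobs {
                time.Sleep(time.Second)
            }
        }()
    }

    for j := 0; j < totalJobs; j++ {
        jobs <- struct{}{}
    }
    close(jobs)
    wg.Wait()

    return time.Since(start)
}

func main() {
    const startWorkers, maxWorkers, totalJobs = 1, 8, 64 // from the .env above

    // One summary row per worker count, processing the same 64 jobs each time.
    for workers := startWorkers; workers <= maxWorkers; workers++ {
        fmt.Printf("workers=%d jobs=%d total=%s\n", workers, totalJobs, runBatch(workers, totalJobs))
    }
}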

make golang-run Result:

Summary Results: EmptySleepJob
+---------+------+--------------+-------------------+-------------+--------+
| WORKERS | JOBS | AVG JOB TIME | TOTAL WORKER TIME | AVG MEM USE | +/-    |
+---------+------+--------------+-------------------+-------------+--------+
|       1 |   64 |        1.00s |            64.35s |     0.004Mb | (1x)*  |
|       2 |   64 |        1.01s |            32.23s |     0.005Mb | +2x    |
|       3 |   64 |        1.00s |            22.12s |     0.006Mb | +2.91x |
|       4 |   64 |        1.00s |            16.03s |     0.007Mb | +4.01x |
|       5 |   64 |        1.00s |            13.04s |     0.008Mb | +4.93x |
|       6 |   64 |        1.00s |            11.05s |     0.008Mb | +5.82x |
|       7 |   64 |        1.00s |            10.03s |     0.009Mb | +6.42x |
|       8 |   64 |        1.00s |             8.04s |     0.008Mb | +8x    |
+---------+------+--------------+-------------------+-------------+--------+

* Baseline: All subsequent +/- tests are compared to this.

This first test confirmed my expectations:

  • ✅ Running 64 one-second sleep jobs with 1 worker should take about 64 seconds.
  • ✅ Running 64 one-second sleep jobs with 8 workers on an 8-core machine should take about 1/8th the time, or 8 seconds.

As you can see from the above, this works out almost to the exact second. Again, this is an overly idealized scenario, but it’s always good to start simple and verify expectations. It makes sense that the TOTAL WORKER TIME for all of the jobs gets shorter as workers are added. And since a time.Sleep job has so little overhead, it also makes sense that the AVG JOB TIME is consistent.

It’s also important to note that the AVG MEM USE per set of workers increases as we add more: 8 workers use 2x the memory that 1 worker uses. While memory is a cheap resource these days, this increase matters if the application starts to creep up toward the limit of the total available memory on the system. If that were to happen, we might see a large increase in overhead from the system’s need to use and manage swap space.
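
The repo computes its AVG MEM USE column its own way; a common Go building block for this kind of metric is runtime.ReadMemStats, as in this rough sketch (an illustration, not the repo’s implementation):

package main

import (
    "fmt"
    "runtime"
)

// heapAllocMb returns the bytes currently allocated on the heap, in Mb.
func heapAllocMb() float64 {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    return float64(m.Alloc) / 1024 / 1024
}

func main() {
    before := heapAllocMb()

    // Stand-in for a job: allocate a small buffer and "use" it.
    buf := make([]byte, 64*1024)
    buf[0] = 1

    after := heapAllocMb()
    // Note: the GC can run between samples, so a single delta is noisy;
    // averaging over many jobs (as the summary tables do) smooths it out.
    fmt.Printf("approx. memory delta: %.3fMb\n", after-before)
}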

Thinking ahead, if this application ran inside of a Docker container, the application would likely have access to far fewer resources than the total system resources. Best Docker practices dictate setting limits on container resources as a Docker environment would likely be juggling many containers, all vying to share the system resources.

A (More) Real-World Scenario

While validating my expectations in an idealized scenario is a great first step, benchmarking a sleeping application is far from a real-world test. My next assumption is that the results above could — and likely would — be different if the application jobs were more CPU and memory intensive than a low-consumption sleep job.

I still wanted to eliminate as much variability as possible but also wanted to stress the system in a way that was predictable. Enter PiJob. I decided to create a job that would calculate pi to n places. On my particular system, the function I used took about 1 second to calculate 10,000 places. Like the EmptySleepJob function, the PiJob function captures timing information. This function is much more resource-intensive — but still very consistent as far as the time it takes to run the job — and can be seen here.
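
The repo’s actual PiJob is linked above. As an illustration of what a CPU-bound job body can look like, a simple series approximation works; this is not the repo’s digit-by-digit calculation, just a comparable CPU-heavy loop:

// piLeibniz approximates pi with n terms of the Leibniz series:
// pi ~ 4 * (1 - 1/3 + 1/5 - 1/7 + ...). A few hundred million terms
// make a convenient, purely CPU-bound workload.
func piLeibniz(terms int) float64 {
    sum, sign := 0.0, 1.0
    for i := 0; i < terms; i++ {
        sum += sign / float64(2*i+1)
        sign = -sign
    }
    return 4 * sum
}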

.env File excerpt (only the JOBNAME changes; all other settings are the same as before):

JOBNAME=PiJob
STARTINGWORKERCOUNT=1
MAXWORKERCOUNT=8
TOTALJOBCOUNT=64

make golang-run Result:

Summary Results: PiJob
+---------+------+--------------+-------------------+-------------+--------+
| WORKERS | JOBS | AVG JOB TIME | TOTAL WORKER TIME | AVG MEM USE | +/-    |
+---------+------+--------------+-------------------+-------------+--------+
|       1 |   64 |        0.96s |            61.34s |     0.010Mb | (1x)*  |
|       2 |   64 |        1.10s |            35.37s |     0.011Mb | +1.73x |
|       3 |   64 |        1.29s |            28.10s |     0.012Mb | +2.18x |
|       4 |   64 |        1.47s |            23.72s |     0.014Mb | +2.59x |
|       5 |   64 |        1.66s |            21.72s |     0.016Mb | +2.82x |
|       6 |   64 |        1.93s |            21.12s |     0.018Mb | +2.9x  |
|       7 |   64 |        2.20s |            20.89s |     0.019Mb | +2.94x |
|       8 |   64 |        2.53s |            20.48s |     0.018Mb | +3x    |
+---------+------+--------------+-------------------+-------------+--------+

* Baseline: All subsequent +/- tests are compared to this.

Compared to the idealized scenario in the first test, 8 workers only ran the batch about 3x as fast, not 8x. Again, this makes sense, as the jobs the workers are executing are much more CPU- and memory-intensive. While the time savings aren’t nearly as large as in the idealized scenario, a 3x gain is still quite an optimization!

Another important piece of data above is that the AVG JOB TIME increases as we add more workers. While the total execution time continues to drop as concurrent workers are added, each individual job takes longer on average because more application overhead is created. While it’s trivial for Go to spawn and execute concurrent goroutines, those goroutines must also be scheduled and managed.

As in the first EmptySleepJob example, memory usage almost doubles when going from 1 worker to 8. The PiJob also uses over 2x the memory of the EmptySleepJob.

Keep in mind: not only is this application splitting up its duties between CPU cores and managing memory, but the OS itself (and any other processes outside of the application) is also time-sharing resources like CPU, memory, and disk I/O. While Go is nice enough to take care of most of this without intervention, it does have to manage this juggling act of context switching. This application is mostly a CPU-bound job. If the job were reading and writing large chunks of data, Go would also have to manage more memory and I/O operations. Each layer of management and coordination that comes with goroutines isn’t necessarily expensive, but it’s definitely noticeable — especially when compared to a baseline.

In the PiJob example, 1 worker (AVG JOB TIME: 0.96s) ran each job about 2.6x faster than 8 workers did (AVG JOB TIME: 2.53s). That difference is the overhead created by managing context switching and goroutine coordination. With 8 workers, this overhead is still much smaller than the performance gained by adding the additional workers.

Point of Diminishing Returns

At some point, the overhead required for managing the goroutines outweighs the benefits of more concurrent workers. What if we spin up more workers than runtime.NumCPU() dictates? This time, instead of a max of 8 workers, let’s push it to 32 workers.

.env File excerpt:

JOBNAME=PiJob
STARTINGWORKERCOUNT=1
MAXWORKERCOUNT=32
TOTALJOBCOUNT=64

make golang-run Result:

Summary Results: PiJob
+---------+------+--------------+-------------------+-------------+--------+
| WORKERS | JOBS | AVG JOB TIME | TOTAL WORKER TIME | AVG MEM USE | +/-    |
+---------+------+--------------+-------------------+-------------+--------+
|       1 |   64 |        0.97s |            62.14s |     0.009Mb | (1x)*  |
|       2 |   64 |        1.10s |            35.34s |     0.010Mb | +1.76x |
|       3 |   64 |        1.27s |            27.86s |     0.011Mb | +2.23x |
|       4 |   64 |        1.51s |            24.33s |     0.013Mb | +2.55x |
|       5 |   64 |        1.63s |            21.29s |     0.019Mb | +2.92x |
|       6 |   64 |        1.90s |            20.87s |     0.019Mb | +2.98x |
|       7 |   64 |        2.27s |            21.46s |     0.024Mb | +2.9x  |
|       8 |   64 |        2.51s |            20.41s |     0.026Mb | +3.04x |
|       9 |   64 |        3.01s |            22.32s |     0.021Mb | +2.78x |
|      10 |   64 |        3.01s |            20.14s |     0.023Mb | +3.09x |
|      11 |   64 |        3.38s |            20.28s |     0.021Mb | +3.07x |
|      12 |   64 |        3.92s |            21.96s |     0.020Mb | +2.83x |
|      13 |   64 |        3.96s |            19.95s |     0.025Mb | +3.12x |
|      14 |   64 |        4.15s |            20.04s |     0.023Mb | +3.1x  |
|      15 |   64 |        4.63s |            20.94s |     0.026Mb | +2.97x |
|      16 |   64 |        5.06s |            20.51s |     0.025Mb | +3.03x |
|      17 |   64 |        5.18s |            20.59s |     0.024Mb | +3.02x |
|      18 |   64 |        5.36s |            20.51s |     0.023Mb | +3.03x |
|      19 |   64 |        5.56s |            20.18s |     0.025Mb | +3.08x |
|      20 |   64 |        6.01s |            20.60s |     0.023Mb | +3.02x |
|      21 |   64 |        6.48s |            20.66s |     0.028Mb | +3.01x |
|      22 |   64 |        6.45s |            19.68s |     0.036Mb | +3.16x |
|      23 |   64 |        6.57s |            19.63s |     0.034Mb | +3.17x |
|      24 |   64 |        6.64s |            19.65s |     0.028Mb | +3.16x |
|      25 |   64 |        6.72s |            19.17s |     0.031Mb | +3.24x |
|      26 |   64 |        7.01s |            19.37s |     0.033Mb | +3.21x |
|      27 |   64 |        7.02s |            18.66s |     0.038Mb | +3.33x |
|      28 |   64 |        7.43s |            18.98s |     0.033Mb | +3.27x |
|      29 |   64 |        7.47s |            18.19s |     0.042Mb | +3.42x |
|      30 |   64 |        7.90s |            18.42s |     0.037Mb | +3.37x |
|      31 |   64 |        7.99s |            17.86s |     0.039Mb | +3.48x |
|      32 |   64 |        8.37s |            17.31s |     0.039Mb | +3.59x |
+---------+------+--------------+-------------------+-------------+--------+

* Baseline: All subsequent +/- tests are compared to this.

As expected, there are diminishing returns once more workers are added than there are CPU cores. Sure, there is a little bit of gain beyond 8 workers, but nowhere near the leaps seen on the way up to roughly 3x. In fact, the AVG JOB TIME for 32 workers increases by 3.3x (8.37s) over the optimal 8 workers (2.51s), with only about a 1.2x advantage (3.10s saved) in TOTAL WORKER TIME.

Conclusion

Concurrency in this pattern can improve the throughput of your application a great deal. As shown by the EmptySleepJob vs. the PiJob, there was a lot of performance variability between an idealized scenario and a more real-world one. Truthfully, the PiJob was still fairly idealized: it was skewed toward being CPU-intensive, and I tuned it to perform about 1 second of work so it could be compared directly to the EmptySleepJob.

The reality is that the performance gains from concurrency hinge on what tasks your jobs are actually performing. But rather than stay theoretical about it, as I was writing this conclusion I decided that, since I already had the framework, why not just create an I/O-intensive job to illustrate this? Introducing: IoJob.

Here is the meat:

...
iterations := 11
for n := 0; n < iterations; n++ {
    // Recreate the file each iteration, then write ~100,000 short lines to it.
    f, err := os.Create("/tmp/test.txt")
    if err != nil {
        panic(err)
    }
    for i := 0; i < 100000; i++ {
        f.WriteString("some text!\n")
    }
    f.Close()
}
...

Again, to minimize variability between the different job tests, the iterations value above was chosen so that this I/O-bound job takes about 1 second to complete on my system.

.env File excerpt:

JOBNAME=IoJob
STARTINGWORKERCOUNT=1
MAXWORKERCOUNT=8
TOTALJOBCOUNT=16

make golang-run Result:

Summary Results: IoJob
+---------+------+--------------+-------------------+-------------+--------+
| WORKERS | JOBS | AVG JOB TIME | TOTAL WORKER TIME | AVG MEM USE | +/-    |
+---------+------+--------------+-------------------+-------------+--------+
|       1 |   16 |        1.03s |            16.50s |     0.004Mb | (1x)*  |
|       2 |   16 |        1.53s |            12.26s |     0.004Mb | +1.35x |
|       3 |   16 |        2.27s |            12.74s |     0.004Mb | +1.29x |
|       4 |   16 |        2.94s |            11.80s |     0.004Mb | +1.4x  |
|       5 |   16 |        3.71s |            12.63s |     0.005Mb | +1.31x |
|       6 |   16 |        3.93s |            11.44s |     0.005Mb | +1.44x |
|       7 |   16 |        5.53s |            13.33s |     0.005Mb | +1.24x |
|       8 |   16 |        6.68s |            13.59s |     0.006Mb | +1.21x |
+---------+------+--------------+-------------------+-------------+--------+

* Baseline: All subsequent +/- tests are compared to this.

It’s plainly visible that I/O-bound tasks do not benefit from concurrency in this situation. While the speed increase is about 1.2x, the AVG JOB TIME grows by 6.5x due to concurrency overhead — coupled with the fact that every worker is I/O-bound to the same disk. The point of diminishing returns on an I/O-intensive job like this is immediate.

While there’s no silver bullet for every situation, running tests and benchmarks like this can help you locate an optimal balance. If you can’t find the right tool to uncover what you’re looking for, write your own!

