Counting Stars

Will Carter
Published in Level Up Coding
Jan 29, 2021 · 10 min read


Using Big O Analysis to write efficient code

In outer space, there are billions of stars

What is Big O Analysis?

A programmer’s main task is to write code that executes a series of actions designed to produce some desired output. As each new chunk of code is written, the programmer must often evaluate different ways to solve a particular problem.

So, if there are many different ways to write code that produces the desired result, how can we objectively determine whether one chunk of code, or block, is better than another? This is where Big O analysis comes in.

Big O analysis is about objectively comparing different code blocks that produce the same result by looking at the efficiency of their execution. Because of this, Big O analysis allows programmers to identify and alleviate bottlenecks and write more efficient code that is faster and consumes fewer resources.

Big O is concerned with describing code efficiency by looking at two factors: time and space. Time refers to how long a particular code block takes to execute. Space refers to the amount of memory the block uses while it executes. In this post, I will focus on time.

All computers are different; some are faster than others or have more memory. Because of this, Big O analysis is not concerned with concrete units (like milliseconds or bytes) when classifying code. What allows us to make comparisons across different environments with Big O is its focus on what happens when things scale: how a block performs when hit with a large number of inputs. Big O analysis forces programmers to think about the worst-case scenario and consider questions such as…

“What happens when this function receives a million inputs? How about a billion?” 😖

When considering code with respect to Big O, we are looking to optimize it for when things are at their worst. This way, we can ensure that the code will perform well under normal circumstances.

So, now that we know that Big O is about classifying code efficiency, how does it work? Let’s consider the simplified Big O Notation chart below. The three lines represent the time efficiency of three different code blocks. We can think of code blocks as algorithms. The number of inputs to the algorithm is tracked on the x-axis and the duration of time to complete the algorithm is tracked on the y-axis.

We can see that fast algorithms perform in the green area. Those that perform in the orange band take longer, and finally, those that perform in the red take the most time. Notice that the algorithm line in the red actually curves upward steeply as the number of inputs increases, labeled O(n²). Algorithms that perform like this can be especially problematic if the number of inputs increases greatly, and is the reason why this line is in the red.

The three lines are labeled O(1), O(n) and O(n²). Each is the Big O notation for the algorithm that is represented by the line.

O(1)
Algorithms that perform in this manner are said to have constant complexity because the time it takes for the algorithm to complete isn’t dependent on the size of the input. Algorithms classified as O(1) are the fastest: no matter how many inputs are provided, the algorithm still runs in a constant amount of time.

O(n)
Algorithms that perform in this manner are said to have linear complexity because the time it takes for the algorithm to complete is directly proportional to the number of inputs to the algorithm (n). Algorithms classified as O(n) take longer as the number of inputs increases, but the growth is steady and proportional rather than accelerating.

O(n²)
Algorithms that perform in this manner are said to have quadratic complexity because the time it takes for the algorithm to complete increases at the pace of n².

There are more Big O classifications than just these, but for simplicity, I will focus on algorithm examples that have O(1), O(n) and O(n²) notations.

Because Big O is concerned with how algorithms perform at scale, the stars in the universe make an appropriate model to experiment with. After all, there are billions of stars in the universe, which sounds like a potential scaling problem if we start working with many, many stars. I wonder how this many stars get named?

After some quick research, Space.com tells us that stars are named in the following manner:

Since there are so many stars in the universe, most star names consist of an abbreviation that stands for either the type of star or a catalog that lists information about the star, followed by a group of symbols. For instance, PSR J1302–6350 is a pulsar, thus the PSR. The J reveals that a coordinate system known as J2000 is being used, while the 1302 and 6350 are coordinates similar to the latitude and longitude codes used on Earth.

Using this logic, here is a simple function that returns a random star name.

starNameGenerator
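
The original gist isn’t embedded in this text, so here is a minimal sketch of what starNameGenerator might look like (the helper name and exact random-number logic are my assumptions):

```javascript
// Sketch of starNameGenerator, reconstructed from the description above.
// Returns a name like PSRJ18-93, built from two random 2-digit numbers.
const starNameGenerator = () => {
  const twoDigits = () => Math.floor(Math.random() * 90) + 10; // 10–99
  return `PSRJ${twoDigits()}-${twoDigits()}`;
};
```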

Examples of names returned from this function might be PSRJ18-93, PSRJ26-74 or PSRJ88-13, and so on, where the numbers are random. Also, in the naming scheme described above the numbers are 4 digits, but in the starNameGenerator function I have made them 2 digits so that there is a better chance the function produces a duplicate name (more on this in a bit).

Next, let’s make some stars, maybe start with a significant amount, but let’s not go crazy. I’ll have 100 stars, please.

Making stars
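
Something along these lines (a sketch; the original gist may differ):

```javascript
// Build an array of 100 random star names.
const stars = [];
for (let i = 0; i < 100; i++) {
  stars.push(starNameGenerator());
}
console.log(stars);
```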

After running, I had an array of 100 random star names.

So, now we have a way to generate some data (that could potentially grow to billions of stars if we wanted). Next, we can write functions against different array sizes (numbers of inputs) and analyze them with respect to Big O.

First, let’s consider O(1) constant complexity. Remember, O(1) means that no matter the number of inputs, the algorithm should return a result in about the same amount of time, each time.

Say we want a function that returns the last element of the array of stars passed in.
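
A one-liner like this would do it (a sketch of returnLast as described):

```javascript
// O(1): grabbing the last element takes the same amount of time
// no matter how long the array is.
const returnLast = (array) => array[array.length - 1];
```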

The returnLast function correctly returned PSRJ75-12 for the list of 100 above.

Next, we want to increase the number of inputs to see how the function performs with regard to Big O. I have added start and end timestamps to determine how many milliseconds the function takes to execute with arrays of random stars at increasing lengths (up to 1 million).
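
The harness might look something like this (a sketch; the exact array lengths and logging are assumptions):

```javascript
// Time returnLast against arrays of increasing length, up to 1 million.
[1, 10, 100, 1000, 10000, 100000, 1000000].forEach((length) => {
  const bigArray = [];
  for (let i = 0; i < length; i++) bigArray.push(starNameGenerator());

  const start = Date.now(); // start timestamp
  returnLast(bigArray);
  const end = Date.now();   // end timestamp
  console.log(`${length} stars: ${end - start} ms`);
});
```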

The iteration times that were recorded for each of those input array lengths were as follows…

That looks very, very fast: less than 1 millisecond for the returnLast(array) function to return a result, regardless of the size of the array passed in, even for 1 million stars. When the iteration times are plotted against the number of inputs, we get a flat line. This function is indeed O(1): green and fast.

O(n)

Next, let’s consider O(n) linear complexity. Remember, the time it takes for a linear complex algorithm to complete is directly related to the number of inputs to the algorithm (n).

Let’s imagine that we don’t have control over the starNameGenerator() function; maybe that list of names comes from an API call or somewhere else outside of our control. Further, let’s say that we are not fond of the dash (-) that the function inserts in each name and we want to remove it (turning PSRJ18-93 into PSRJ1893, for example) before saving the names to a database. So we come up with a removeDashesFrom(array) function on our end to run the names through before saving.
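
A sketch of removeDashesFrom, reconstructed from the description (the original gist may differ):

```javascript
// O(n): one pass through the array, stripping the single dash
// from each name (PSRJ18-93 becomes PSRJ1893).
const removeDashesFrom = (array) => {
  const cleaned = [];
  for (let i = 0; i < array.length; i++) {
    cleaned.push(array[i].replace('-', ''));
  }
  return cleaned;
};
```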

Looking at this function, we can see that we are looping through the array of stars once. So, if n is defined as the length of the array passed in, the loop will iterate n times. The number of iterations it takes to replace all the dashes is directly related to the length of the input array. Let’s graph the times again for array lengths from 1 to 1,000,000.

This line definitely has the shape of O(n), and this function is in the orange. With 1 million inputs, the function takes only about half a second.

Finally, let’s consider O(n²) quadratic complexity. Remember, the time it takes for a quadratic complexity algorithm to complete increases at the pace of n², if n is defined as the number of inputs.

Let’s say we have discovered that the starNameGenerator function might return duplicate names, and we need a function that removes the duplicates from our array of star names. Here is a function that produces the array of stars with duplicates removed. It is not the most efficient function, maybe it’s the worst, but it produces the desired result.
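
The original gist isn’t embedded here; this sketch, reconstructed from the description below, behaves the same way (details may differ from the original):

```javascript
// Deliberately inefficient O(n²) duplicate removal (a reconstruction).
const removeDuplicates = (array) => {
  const uniqueStars = [];
  for (let i = 0; i < array.length; i++) {
    let duplicateFound = false;
    // check the current name against every other name in the array
    for (let j = 0; j < array.length; j++) {
      // a match at an earlier index means this name was already kept
      if (i !== j && array[i] === array[j] && j < i) {
        duplicateFound = true; // noted, but the loop keeps scanning anyway
      }
    }
    if (!duplicateFound) uniqueStars.push(array[i]);
  }
  return uniqueStars;
};
```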

I have purposely written this function to be as inefficient as possible. It builds an array of unique star names to return by checking every star name against every other star name in the array (except the one it is currently checking) on each iteration of the loop through the array. If a duplicate is not found, it pushes the star name into the return array. To make things worse, when a duplicate is found, it is noted, but the loop does not stop looking at that point; it continues to check the remaining items in the array unnecessarily.

This function is classified as O(n²) because, due to the nested loop, it iterates through the input array of length n, n times. This makes the number of iterations n * n, or n², if n is the number of inputs.

Here are the times recorded for this inefficient function.

This line looks similar to the O(n²) line in the chart above, and we can see it curving upward, telling us that as the number of inputs grows, the time grows at the accelerated pace of n².

In fact, when the array length reaches 1 million, the function takes almost 10 million milliseconds to complete, a whopping 154 minutes! Indeed, this function is a potential bottleneck and a candidate for refactoring.

So, what can be done to improve the function’s performance? We know the issue is the nested loop where the iterations are tied to the length of the input array.

What if, instead of using an array to keep track of the elements to return, we use an object named arrayMap? Consider this version:
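
Sketched from the description that follows (the original gist may differ slightly):

```javascript
// O(n): one pass builds arrayMap, whose keys are the unique star names
// and whose values are arrays of the indices where each name appears.
const removeDuplicatesFast = (array) => {
  const arrayMap = {};
  for (let i = 0; i < array.length; i++) {
    if (!arrayMap[array[i]]) arrayMap[array[i]] = [];
    arrayMap[array[i]].push(i);
  }
  return Object.keys(arrayMap); // the unique star names
};
```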

Here, we loop through the input array only once, and the arrayMap object gets a key for each star name, with the value for each key being an array of all the indices where that name appears in the input array. If a name appears more than once, its value array grows with each duplicate.

For example, say our star input array is the following (notice the duplicate star, PSRJ33-50):

['PSRJ73-98', 'PSRJ33-50', 'PSRJ64-81', 'PSRJ33-50', 'PSRJ21-13']

If this is the input array, then in the removeDuplicatesFast(array) function, arrayMap is built up to the following:

{
'PSRJ73-98': [ 0 ],
'PSRJ33-50': [ 1, 3 ],
'PSRJ64-81': [ 2 ],
'PSRJ21-13': [ 4 ]
}

The PSRJ33-50 key has a value of [1, 3] because that name appears at index 1 and index 3 of the input array.

Conveniently, now that we have the populated arrayMap object, we can use Object.keys(arrayMap) to get an array of the object's own enumerable property names, in the same order a normal loop would visit them. This function removes duplicates without a nested loop.
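
For the example above, removeDuplicatesFast(array) would therefore return ['PSRJ73-98', 'PSRJ33-50', 'PSRJ64-81', 'PSRJ21-13'], with the duplicate PSRJ33-50 appearing only once.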

Here are the times recorded for the removeDuplicatesFast(array) function:

We can see that the removeDuplicatesFast(array) function performs at O(n), and that removing duplicates from an array of 1 million stars takes about half a second, much better than the 154 minutes the previous function took!

Using Big O analysis, we can identify bottlenecks and poorly performing code, refactor problematic algorithms, and produce robust, efficient code.

The code for this blog post can be found at: https://github.com/FergusDevelopmentLLC/countingStars

References:
The Complete Guide to Big O Notation & Complexity Analysis for Algorithm
Big O Notation — Time Complexity in Javascript
Star Facts: The Basics of Star Names and Stellar Evolution
