Random Forest Classifier — A Forest of Predictions

Nivan Gujral · Published in Level Up Coding · 7 min read · Oct 6, 2020

Imagine the last decision you made. It could be whether you should buy a new car or what you should eat for dinner. Making any type of decision on your own is generally hard because there are so many things you need to think about. For me, it takes an hour to decide what to eat (not really a good use of time 😅)! What if I told you that you don’t need to make those decisions on your own, and something else could guide you? That is what the Random Forest Classifier does, but what is it?

Decision Trees

In order to understand how the Random Forest Classifier works, we first need to understand a foundational concept: decision trees. A decision tree is a tool used for generating insights and mapping out possible consequences such as chance event outcomes, resource costs, and utility.

Let’s imagine we have a set of 6 shapes: 2 squares and 4 triangles. Both squares are red, 2 of the triangles are also red, and the remaining 2 triangles are blue. Now that we have our dataset, we want to split it among the different classes the shapes belong to. How do we do that?

To start, we can split the dataset by the color of the shapes, using the question “Is it red?” at our first node. A node is where the dataset splits in two: whatever meets the criterion goes into the “Yes” branch, and whatever does not goes into the “No” branch.

In this scenario, the 2 red squares and the 2 red triangles go into the “Yes” branch while the 2 blue triangles go into the “No” branch. The “Yes” branch still holds 2 red squares and 2 red triangles, so a question we can ask to split them up is “Is it a square?”. The 2 red squares go into the “Yes” branch and the 2 red triangles go into the “No” branch. Now our decision tree is finished!
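The two-question tree above can be hand-coded in a few lines. This is a minimal sketch of my own (the dict format for a shape is an illustrative choice, not anything from a library):

```python
# A hand-coded version of the two-question tree described above.
def classify(shape):
    if shape["color"] == "red":        # first node: "Is it red?"
        if shape["kind"] == "square":  # second node: "Is it a square?"
            return "red square"
        return "red triangle"
    return "blue triangle"

# The six shapes from the example: 2 red squares, 2 red triangles, 2 blue triangles.
dataset = (
    [{"color": "red", "kind": "square"}] * 2
    + [{"color": "red", "kind": "triangle"}] * 2
    + [{"color": "blue", "kind": "triangle"}] * 2
)

labels = [classify(s) for s in dataset]
print(labels)  # each shape lands in its own leaf of the tree
```

Every shape ends up in exactly one leaf, which is all a decision tree really does: route each example through a sequence of yes/no questions.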

What is the Random Forest Classifier?

The Random Forest Classifier is simply a collection of individual decision trees that work together. Each tree in the forest makes a class prediction, and the class with the most votes becomes the model’s overall prediction. The reason random forest works so well is that a large number of relatively uncorrelated models (trees) operating as a committee can outperform any of the individual models.

The uncorrelated models are the key to making random forest work so well. The reason for this amazing effect is that the trees protect each other from their individual errors. Some trees might be wrong, but many of the trees will be right, so together the trees can move in the correct direction.

In order for the Random Forest Classifier to work well, there needs to be an actual signal in the features so that the models don’t just guess. Also, the predictions made by the individual trees must have a low correlation with each other.
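The voting mechanism itself is simple. Here is a toy sketch of my own where each “tree” is just a function that votes; the trees here are hypothetical stand-ins, not real trained decision trees:

```python
from collections import Counter

# A toy "forest": each tree votes on a class; the majority wins.
def forest_predict(trees, x):
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

trees = [
    lambda x: "even" if x % 2 == 0 else "odd",  # always correct
    lambda x: "even" if x % 4 == 0 else "odd",  # wrong on 2, 6, 10, ...
    lambda x: "even" if x % 2 == 0 else "odd",  # always correct
]

print(forest_predict(trees, 2))  # -> "even": the committee outvotes the wrong tree
```

On the input 2, the second tree votes “odd”, but the other two vote “even”, so the committee still gets the right answer. This is the error-protection effect described above, and it only works because the trees don’t all make the same mistakes.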

Importance of Uncorrelated Trees

In order to understand why having uncorrelated models is so important, let’s use an example. Imagine we are playing a game where I roll a die, which lands on either an odd or an even number. If the die lands on an even number you win some points, and if it lands on an odd number you lose the same amount of points. Now imagine you have the following three options for the game:

  1. Play 1 turn and bet 100 points
  2. Play 10 turns and bet 10 points each time
  3. Play 100 turns and bet 1 point each time

Here are the expected end scores for each game (with a 50 percent probability of winning or losing points on each turn):

Game 1 = .50 * 100 + .50 * -100 = 0

Game 2 = (.50 * 10 + .50 * -10) * 10 = 0

Game 3 = (.50 * 1 + .50 * -1) * 100 = 0

In all three of the games above, the expected result is zero points. But this is not the full picture, because the expected value says nothing about the distribution of outcomes. Now let’s imagine we played each game 100 times. After playing game 1 for 100 times, I won 10% of the games. After playing game 2 for 100 times, I won 40% of the games. Finally, after playing game 3 for 100 times, I won 52% of the games.

At the start, all three games had the same expected value, but once we looked at the distribution of outcomes, the better option became much clearer. The Random Forest Classifier works the same way: the more independent “bets” (trees) it combines, the more reliable the overall decision becomes.
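The variance-reduction effect can be checked with a quick simulation. This is my own illustrative sketch, not code from the article: with fair 50/50 odds every game has the same expected score, but splitting the bet across more turns makes the spread of final scores much smaller, which is exactly why many small independent bets are safer than one big one:

```python
import random

def final_scores(turns, bet, games=10000, seed=0):
    # Play `games` independent runs; each turn is a fair 50/50 win/lose of `bet`.
    rng = random.Random(seed)
    scores = []
    for _ in range(games):
        score = sum(bet if rng.random() < 0.5 else -bet for _ in range(turns))
        scores.append(score)
    return scores

def spread(scores):
    # Standard deviation of the final scores.
    mean = sum(scores) / len(scores)
    return (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5

# Game 1: 1 turn x 100 points, Game 2: 10 x 10, Game 3: 100 x 1.
for turns, bet in [(1, 100), (10, 10), (100, 1)]:
    print(f"{turns:>3} turns, bet {bet:>3}: spread = {spread(final_scores(turns, bet)):.1f}")
```

The spread shrinks from roughly 100 points in game 1 down to roughly 10 points in game 3, even though all three games average out to zero.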

Keeping Trees Uncorrelated

In order to keep the decision trees uncorrelated from one another, the Random Forest Classifier uses Bagging and Feature Randomness.

Bagging

Decision trees are highly sensitive to the data they are trained on, so small changes to the training set can produce significantly different tree structures. The Random Forest Classifier uses this to its advantage for creating uncorrelated trees: each individual tree randomly samples from the dataset (with replacement), so each tree grows differently. Using the same dataset from above, if we changed 1 triangle to a square, the resulting tree would look a lot different from the original. This method is called Bagging (short for bootstrap aggregating).

Feature Randomness

Normally, when a decision tree splits a node, it considers every available feature and picks the one that best separates the dataset. In order to keep the trees uncorrelated, the Random Forest Classifier instead has each tree pick from a random subset of the features at each split. This forces the trees to be different and introduces more variation among them, which results in a low correlation between the trees. This method is called Feature Randomness.
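The feature-selection step can be sketched as follows. The feature names here are hypothetical, and taking the square root of the number of features is a common default rather than anything specified in this article:

```python
import math
import random

# Feature randomness: at each split, consider only a random subset of
# the features (sqrt of the feature count is a common choice of size).
def candidate_features(all_features, rng):
    k = max(1, int(math.sqrt(len(all_features))))
    return rng.sample(all_features, k)

rng = random.Random(0)
features = ["color", "num_sides", "area", "perimeter"]  # hypothetical features
for split in range(3):
    print(candidate_features(features, rng))  # a random pair each time
```

Since different splits (and different trees) see different candidate features, two trees rarely end up asking the same questions in the same order.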

Shortcomings of the Random Forest Classifier

Even though the Random Forest Classifier is an amazing classifier for making decisions, it has some limitations.

  • The Random Forest Classifier can be slow and inefficient for making real-time predictions on large datasets because of the large number of trees it uses. This type of classifier trains quickly, but once trained it can take a long time to make a prediction: a more accurate model needs many trees, and every prediction has to pass through all of them. For most real-world applications it is fast enough; however, there are situations where another model would be a better fit.

Use Cases of the Random Forest Classifier

Even though the Random Forest Classifier has limitations, it is widely used in real-world applications because of its simplicity.

  • Finance: In finance, it is currently being used to predict whether customers are likely to repay their debt on time or use a bank’s services more frequently. In trading, it is being used to forecast a stock’s future behavior.
  • Healthcare: In healthcare, it is being used to identify the correct combination of components in medicine. It is also being used to analyze a patient’s medical history to identify diseases.
  • E-commerce: In e-commerce, it is currently being used to determine whether a customer will actually like a product and to make recommendations.

This is only a fraction of Random Forest’s use cases, and the number of applications is growing each day.

Random Forest at the forefront of making predictions

The Random Forest Classifier is a powerful yet simple way to make predictions. It is a really good choice when you need a fast-to-train, accurate classifier for making decisions. The Random Forest Classifier is being leveraged in a variety of ways across sectors and is making a positive impact in those fields.

Let us review why you all probably came here. What is the Random Forest Classifier?

The Random Forest is a classifier that makes predictions using many individual decision trees. It uses Bagging and Feature Randomness to keep its trees uncorrelated from one another.

The Random Forest Classifier is currently a key prediction algorithm, and together with its companions, machine learning tools like it are having a deep impact on real-world scenarios.

Nivan is a 14-year-old Artificial Intelligence developer looking to use technology to help solve problems in the world. He is currently building a company called Lemonaid which is a platform that connects teens with organizations that are in need of help with volunteer/intern recruitment. Send me an email at nivangujral@gmail.com if you would like to further discuss this article or just talk.
