Mastering Regression Models: A Comprehensive Guide to Predictive Analysis

Afaque Umer
Python in Plain English
19 min read · Jun 16, 2023


Photo by Thomas T on Unsplash

Introduction

Autobots, roll out! In this glorious blog post, we shall embark on a daring expedition through the vast expanse of machine learning. Brace yourselves as we delve into the core essence of supervised regression, unraveling its secrets in the pursuit of solving predictive problems. Together, we shall scrutinize the strengths and weaknesses of renowned regression algorithms, like mighty warriors facing formidable foes on the battlefield. But fear not, for we possess the knowledge of regularization techniques, such as the noble L1 and L2 regularization. These techniques shall valiantly combat the treacherous overfitting, imposing their constraints upon the complexity of our models. And lo and behold, the time has come to harness the power of ensemble methods! By uniting multiple models, we shall forge an alliance that transcends limitations, unlocking unparalleled predictive performance and basking in the glory of optimal metrics. As our epic saga concludes, you shall emerge, fortified with a comprehensive understanding of popular regression algorithms, ready to face any challenge that lies ahead. Autobots, stand tall and embrace the wisdom bestowed upon you, for you are now equipped with the knowledge to lead the charge toward victory!

In this blog, we will cover the following topics:

I. Top 5 Regression Algorithms:

  • Simple Linear Regression
  • Multiple Linear Regression
  • Decision Tree Regressor
  • Random Forest Regressor
  • Support Vector Regressor

II. Model Evaluation

  • Performance Metrics
  • Cross Validation

III. Regularization Techniques:

  • Lasso Regression (L1 Regularization)
  • Ridge Regression (L2 Regularization)

Autobots, lock and load 🤛🤖🤜

Section 1: Regression Models in Machine Learning

Which subset of machine learning algorithms are we exploring? Machine learning algorithms can be broadly grouped into three main types:

  1. Supervised Learning: Algorithms that learn from labeled training data to make predictions or classify new instances based on patterns and relationships identified in the data.
  2. Unsupervised Learning: Algorithms that analyze unlabeled data to discover patterns, structures, or relationships without specific target values. They focus on finding hidden insights and grouping similar data points.
  3. Reinforcement Learning: Algorithms that learn through interaction with an environment, taking actions, and receiving feedback in the form of rewards or penalties. The goal is to maximize cumulative rewards by learning the optimal decision-making policy.

Here, our focus will be on supervised regression problems.

Syncing with the Data

Before embarking on algorithm testing, we need to obtain the dataset on which we will fit our models. To streamline the process and stay focused on our main objective, we will use a toy dataset instead of raw data that would require extensive exploratory data analysis (EDA) and cleaning. The diabetes dataset provided by sklearn saves us valuable time and suits our regression problem well, since it consists of continuous variables: the target is a continuous numerical measure of disease progression that we will predict from the input features. Let us now take a closer look at the dataset and prepare it for model training and testing through the essential step of train-test splitting.

Image By Author: Loading Diabetes Dataset
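
For readers following along in text rather than from the screenshot, here is a minimal sketch of that loading step (the variable names df and diabetes below are assumptions; your notebook may differ):

```
import pandas as pd
from sklearn.datasets import load_diabetes

# Load the toy diabetes dataset bundled with scikit-learn
diabetes = load_diabetes()
print(diabetes.feature_names)   # the 10 input feature names

# Assemble the features and the continuous target into a single DataFrame
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['target'] = diabetes.target

print(df.shape)   # (442, 11)
df.info()
df.head(20)       # first 20 rows
```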

The code imports the necessary libraries and loads the diabetes dataset as a pandas DataFrame. It displays the feature names, creates the DataFrame, and provides information about its shape and contents. Overall, the code prepares and examines the dataset for further analysis. Here are the first 20 rows of the DataFrame:

Image By Author: DataFrame

Now that our DataFrame is ready, we can proceed to split the dataset into training and testing datasets.

Train-Test Split: KDnuggets

This division will allow us to use the separated portions for training and evaluating our regression models.

Image By Author: Data Split
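
A sketch of the split, assuming an 80/20 division and the x_train, x_test, y_train, y_test names referenced later (the exact proportion used in the original notebook may differ):

```
from sklearn.model_selection import train_test_split

# Separate the input features (X) from the target column (y)
X = df.drop(columns='target')
y = df['target']

# Hold out a portion of the rows for testing; the split is shuffled by default,
# which is why the scores later in the post vary from run to run
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(x_train.shape, x_test.shape)
```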

With the data-gathering process complete, it’s time to move on to the next step: exploring the regressor models one by one to uncover their unique features and capabilities. So, without further ado, let’s roll!

1 & 2. Linear Regression

Linear regression is the most basic and widely used regression algorithm. It assumes a linear relationship between the independent variables and the dependent variable. The algorithm finds the best-fitting line through the data by minimizing the sum of squared differences between the observed and predicted values.

You can think of linear regression as the answer to the question “How can I use X to predict Y?”, where X is some information that you have and Y is some information that you want to know.

There are two types of Linear Regression: Simple Linear Regression and Multiple Linear Regression.

In simple linear regression, there is a single independent variable (predictor variable) used to predict a single dependent variable (response variable). It assumes a linear relationship between the independent variable and the dependent variable. The relationship is represented by a straight line and the equation is represented as y = β0 + β1*x + ε.

Image Source: Google Images

In multiple linear regression, there are multiple independent variables used to predict a single dependent variable. It allows for the consideration of multiple factors that may influence the dependent variable. The relationship is represented by a hyperplane in a higher-dimensional space and the equation is represented as y = β0 + β1*x1 + β2*x2 + … + βn*xn + ε.

where:

  • y represents the dependent variable
  • x / x1, x2, …, xn represent the independent variable/s
  • β0 represents the y-intercept (constant term)
  • β1 / β1, β2, …, βn represent the coefficients (slopes) for each independent variable
  • ε represents the error term or residual.

How does the model fit the data?

The regression line is the estimate that minimizes the sum of squared residual values, also called the residual sum of squares or RSS:

Residual Sum of Squares
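
For reference, the standard definition is RSS = ∑(yi - ŷi)², where ŷi is the value the fitted line predicts for observation i.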

The method of minimizing the sum of the squared residuals is termed Least Squares Regression, or Ordinary Least Squares (OLS) regression.

All the concepts in simple linear regression, such as fitting by least squares and the definition of fitted values and residuals, are applicable in the context of multiple linear regression.

In a previous blog, I delved into the concept of simple linear regression, discussing the underlying mathematical principles and also providing a comparison of the results obtained through scikit-learn.

I invite you to explore that blog to gain a deeper understanding of the mathematical intuition behind the scenes 👇

Now, let’s dive straight into multiple linear regression. The code starts by importing the LinearRegression class and initializing an instance of it as ‘lr’.

We will begin by fitting the training dataset (x_train and y_train) into the linear model. Then, we can utilize the model’s score function to calculate the value of the R-squared parameter. This parameter will indicate how well the model has fitted the training dataset, providing us with an assessment of its performance.

Image By Author: Linear Regression
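
A minimal sketch of that step, assuming the x_train and y_train created during the split above (the exact score in the screenshot will not reproduce without the same random split):

```
from sklearn.linear_model import LinearRegression

# Initialize and fit a multiple linear regression model on the training split
lr = LinearRegression()
lr.fit(x_train, y_train)

# R-squared on the data the model was trained on
print("Training R2:", lr.score(x_train, y_train))
```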

But wait!!! Is this the final score? Have you encountered any anomalies when rerunning the notebook? It’s important to note that each time the model is scored, the values may differ. One of the primary reasons for score variations is the randomization or shuffling that occurs during the data-splitting process, resulting in different train and test splits. To validate this theory, we can run a loop to split the data multiple times, say five, and observe whether the scores differ each time.

Image By Author: Variation in R2 Score
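
Something along these lines, reusing X, y, train_test_split, and LinearRegression from the earlier sketches:

```
# Re-split and re-fit five times to see how much the score moves between runs
for i in range(5):
    x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
    score = LinearRegression().fit(x_tr, y_tr).score(x_te, y_te)
    print(f"Split {i + 1}: test R2 = {score:.3f}")
```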

However, a question arises: how can we determine whether a model score is good? Do we need to iterate every time to find the best score a model can produce? Fortunately, there is a technique called cross-validation that addresses this concern. Cross-validation allows us to evaluate the model’s performance by assessing its consistency across different data splits. We will explore cross-validation in the latter half of the blog, once we have completed the training of our main models.

Pros and Cons of Linear Regression:

Pros:

Simplicity: Linear regression is straightforward to understand and implement.

Speed: Linear regression is computationally efficient, making it suitable for large datasets and real-time applications.

Feature Importance: By examining the magnitude of coefficients, we can identify which features have the most significant influence on the target variable.

Cons:

Linearity Assumption: Linear regression assumes a linear relationship between the features and the target; if the true relationship is nonlinear, the model may not capture the underlying patterns accurately.

Overfitting: Linear regression can be prone to overfitting if the model becomes too complex or if there is multicollinearity (high correlation) among the features.

Outliers: Outliers can distort the line of best fit and impact the accuracy of predictions.

3. Decision Tree Regressor

Decision tree regression is a powerful algorithm used in machine learning for solving regression problems. Unlike traditional regression models that rely on mathematical equations, decision tree regression employs a tree-like structure to make predictions and has the ability to discover hidden patterns corresponding to complex interactions in the data.

How does it work?

A decision tree begins with a root node representing the entire population or sample. Through a process called splitting, this node divides into two or more homogeneous groups. Nodes that undergo further divisions are known as decision nodes, while those that do not split are called terminal nodes or leaves. Each section of the complete tree is referred to as a branch. Each internal node represents a feature, each branch represents a possible outcome, and each leaf node represents a prediction.

Decision Tree Terminology: Google Images

In layman's terms, decision trees are nothing but a bunch of if-else statements. At each node, the tree checks whether a condition is true and, if it is, moves on to the next node attached to that decision. By traversing the tree based on the feature values of new instances, the algorithm can predict the target value.

How does the splitting happen?

When splitting a node in a decision tree, the algorithm evaluates the information gain for each feature. The feature with the highest information gain is chosen as the best feature for splitting.

The information gain is calculated using the concept of Entropy. Entropy measures the impurity or randomness or uncertainty of a node in the decision tree. It is higher when the samples in the set are more diverse. The entropy of a binary classification problem can be calculated using the following formula:

Entropy = -p * log2(p) - (1 - p) * log2(1 - p)

where ‘p’ represents the proportion of positive instances (class 1) in the dataset. The higher the entropy, the lower the purity (i.e., the higher the impurity).

In a classification problem, entropy helps in finding information gain by quantifying the uncertainty in a node, and information gain is used to evaluate the usefulness of attributes in reducing that uncertainty. This aids in determining the optimal attribute for splitting a node in the decision tree. Here, however, we are dealing with a regression problem where the target variable is continuous, so a different measure, the variance of the target (or the mean squared error, MSE), is commonly used.

Variance = (1/n) * ∑(yi - ȳ)², where ȳ is the mean of the target values in the node and n is the number of samples.

Here are the steps that are followed when building a regression tree:

  1. Arrange the data points in ascending order based on the feature you want to split on.
  2. Start with a root node that contains all the data points.
  3. Calculate the variance of the target variable over all data points in the node. This serves as the baseline variance.
  4. For each candidate split point, divide the data points into two branches and calculate the variance of the target within each branch.
  5. Compute the variance reduction for each candidate split by subtracting the weighted average of the branch variances from the baseline variance, and choose the split that maximizes this reduction.
  6. Repeat steps 4 and 5 for each branch, recursively splitting the data points until a stopping criterion is met (e.g., maximum depth, minimum number of samples per leaf).
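
To make the variance-reduction criterion concrete, here is a small illustrative helper, not the scikit-learn internals, just a sketch of the calculation in step 5:

```
import numpy as np

def variance_reduction(y_parent, y_left, y_right):
    """Drop in target variance achieved by splitting a node into two children."""
    n = len(y_parent)
    weighted_child_var = (len(y_left) / n) * np.var(y_left) \
                       + (len(y_right) / n) * np.var(y_right)
    return np.var(y_parent) - weighted_child_var

# Toy example: splitting [1, 2, 3, 10, 11, 12] between 3 and 10
parent = np.array([1, 2, 3, 10, 11, 12])
left, right = parent[:3], parent[3:]
print(variance_reduction(parent, left, right))   # large reduction -> good split
```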

To enhance your understanding of the algorithm and delve deeper into the mathematical aspects, exploring a practical example and solving it step-by-step would be beneficial. However, covering all the content and code in a single blog post might be overwhelming. To address this, I have provided a link to my Jupyter Notebook at the end of this post. The notebook contains the complete code and several mathematical examples for your reference.

Now, let’s proceed with fitting the decision tree on the same dataset we used for linear regression, and let's compare the scores.

Image By Author: Decision Tree Regression
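
A sketch of that fit, with the depth capped at 2 as described (scores will vary with the split):

```
from sklearn.tree import DecisionTreeRegressor

# Shallow tree: depth deliberately capped at 2
dt = DecisionTreeRegressor(max_depth=2)
dt.fit(x_train, y_train)

print("Training R2:", dt.score(x_train, y_train))
print("Test R2:    ", dt.score(x_test, y_test))
```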

As observed, the scores obtained from the decision tree regressor are lower than those of the linear regression model. This difference can be attributed to intentionally keeping the depth of the tree at 2. By increasing the depth, the tree becomes more complex and has the potential to capture more of the pattern present in the data. However, it is important to consider the performance of the model on the test dataset. While the decision tree may yield a perfect R2 score on the training data, its performance on unseen test data might differ. Further evaluation on the test dataset is required to assess the model’s generalization ability and determine whether increasing the depth truly leads to improved performance.

Image By Author: Model gone wrong
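
The “model gone wrong” above corresponds roughly to the following sketch, where the depth limit is simply removed (exact numbers depend on the split):

```
# Remove the depth limit and let the tree grow until every leaf is (nearly) pure
deep_dt = DecisionTreeRegressor()   # max_depth=None by default
deep_dt.fit(x_train, y_train)

print("Training R2:", deep_dt.score(x_train, y_train))   # close to 1.0: memorized
print("Test R2:    ", deep_dt.score(x_test, y_test))     # poor, can even be negative
```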

Although the model achieved a perfect score on the training set, it performed poorly on the test set, resulting in a negative score.

This indicates that the model failed to capture any meaningful patterns in the test dataset, rendering it unsuitable for making accurate predictions in general. This situation is commonly referred to as overfitting, which we will explore further in the latter part of this blog.

Image Source: Google Images

Pros and Cons of Decision Tree:

Pros:

Interpretability: Decision trees are easy to interpret and understand.

Nonlinear Relationships: Decision trees can capture nonlinear relationships between features and target variables, making them suitable for datasets with complex patterns.

Handling Missing Values: Decision trees can handle missing values in the dataset without requiring imputation or preprocessing techniques.

Robust to Outliers: Decision trees are robust to outliers in the data as they do not heavily rely on the mean and variance of the data.

Cons:

Overfitting: Decision trees are prone to overfitting, especially if the depth of the tree is not properly controlled.

Instability: Small changes in the data can lead to different splits and, consequently, different tree structures.

Difficulty with Continuous Variables: Decision trees may struggle to effectively handle continuous variables with a large number of unique values, as they primarily rely on binary splits.

4. Random Forest Regressor

Random forests are an excellent example of an ensemble learner constructed using decision trees. Ensemble learning is a machine learning technique that involves combining multiple individual models, called base learners or weak learners, to create a more accurate and robust predictive model.

Random forest regression addresses the high variance and instability of decision trees by creating an ensemble of decision trees and aggregating their predictions to make more accurate and robust predictions. The algorithm builds multiple decision trees using different subsets of the training data and random subsets of the features. Each tree is trained independently, and the final prediction is obtained by averaging the predictions of all the trees (for regression problems).

Image Source: Wikipedia

By utilizing multiple decision trees, random forest regression reduces the risk of overfitting and provides better generalization to unseen data. The randomness introduced in feature selection and data sampling further enhances the diversity and reduces the correlation among the individual trees, leading to improved overall performance.

Image Source: Google Images

The hyperparameter that defines the number of decision trees in a random forest is commonly known as n_estimators. It specifies the number of trees to be included in the ensemble. Let’s create and test out the random forest containing 20 decision trees and observe its superior performance compared to a single decision tree.

Image By Author: Random Forest Regressor
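
A sketch of that ensemble, assuming the same train/test split as before (n_estimators=20 as stated; other hyperparameters are left at their defaults):

```
from sklearn.ensemble import RandomForestRegressor

# An ensemble of 20 trees, each grown on a bootstrap sample of the training data
rf = RandomForestRegressor(n_estimators=20)
rf.fit(x_train, y_train)

print("Training R2:", rf.score(x_train, y_train))
print("Test R2:    ", rf.score(x_test, y_test))
```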

The model’s performance on the test dataset has demonstrated a significant improvement compared to the decision tree alone. This exemplifies the power and effectiveness of Random Forest.

Pros and Cons of Random Forest:

Pros:

High Accuracy: Random forests tend to provide high prediction accuracy due to the aggregation of multiple decision trees

Reduced Overfitting: By averaging the predictions of multiple trees and introducing randomness, random forests help mitigate overfitting and improve generalization to unseen data.

Handling of Large Datasets: Random forests can effectively handle large datasets with high-dimensional feature spaces, making them suitable for complex problems and big data applications.

Feature Importance: Random forests provide a measure of feature importance, which helps in identifying the most relevant features for making predictions and gaining insights from the data.

Cons:

Lack of Interpretability: The ensemble nature of random forests makes them less interpretable compared to individual decision trees

Computationally Expensive: Training a random forest can be computationally expensive, especially when dealing with a large number of trees or complex datasets.

Hyperparameter Tuning: Random forests have several hyperparameters, such as the number of trees and tree depth, which need to be optimized to achieve optimal performance.

Biased Towards Features with More Levels: Random forests may exhibit a bias towards features with more levels or categories, potentially leading to biased predictions or the overlooking of features with fewer levels.

5. Support Vector Regressor

Support Vector Regression (SVR) is a machine learning algorithm that extends the principles of Support Vector Machines (SVMs).

What is SVM then?

Support Vector Machines (SVM) is a powerful supervised machine learning algorithm used for both classification and regression tasks. It works by finding an optimal hyperplane that separates different classes or predicts a continuous target variable by maximizing the margin or distance between the hyperplane and the data points.

Hyperplane: Google Images

SVR combines the principles of support vector machines (SVMs) with regression tasks. By mapping the features into a higher-dimensional space and finding an optimal hyperplane, SVR can effectively model nonlinear relationships between the features and the target variable while also minimizing the prediction error.

Support Vector Regression: Google Images
Image By Author: SVR
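
A sketch of an SVR fit on the same split; the RBF kernel, C, and epsilon values below are illustrative defaults rather than the settings used in the original screenshot. Scaling is included because SVR is sensitive to feature scale:

```
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Standardize the features, then fit SVR; hyperparameters here are illustrative
svr = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1.0, epsilon=0.1))
svr.fit(x_train, y_train)

print("Training R2:", svr.score(x_train, y_train))
print("Test R2:    ", svr.score(x_test, y_test))
```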

Pros and Cons of SVR:

Pros:

Effective for Nonlinear Relationships: SVR is capable of capturing nonlinear relationships between input features and the target variable by using various kernel functions, such as the polynomial or radial basis function (RBF) kernels.

Robustness to Outliers: SVR penalizes errors only beyond the epsilon margin, and does so linearly rather than quadratically, so extreme points distort the fit less than they would in a squared-error model such as OLS.

Global Solution: SVR seeks to find a global optimal solution by formulating the problem as a convex optimization task.

Cons:

Sensitive to Hyperparameters: The performance of SVR heavily relies on the selection and tuning of hyperparameters, such as the choice of kernel function, regularization parameter, and epsilon value. Incorrect or suboptimal choices can lead to poor model performance.

Computational Complexity: Training an SVR model can be computationally expensive, especially for large datasets or when using complex kernel functions.

Extrapolation Challenges: SVR is primarily designed for interpolation, meaning it predicts well within the range of the training data but may perform poorly when extrapolating beyond it.

Section 2: Model Evaluation

Performance Metrics

Performance metrics in regression algorithms are used to evaluate the quality and accuracy of predictions made by the model. These metrics provide insights into how well the model is performing in terms of predicting continuous target variables. Here are some commonly used performance metrics in regression:

Mean Squared Error (MSE): MSE measures the average squared difference between the predicted and actual values. It gives higher weight to larger errors, making it sensitive to outliers. Lower values of MSE indicate better model performance.

Root Mean Squared Error (RMSE): RMSE is the square root of MSE and is often preferred because it is in the same unit as the target variable. It provides a more interpretable measure of the average prediction error. Lower RMSE values indicate better model performance.

Mean Absolute Error (MAE): MAE calculates the average absolute difference between the predicted and actual values. It provides a measure of the average magnitude of errors and is less sensitive to outliers compared to MSE. Lower MAE values indicate better model performance.

R-squared (R2) Score: The R2 score measures the proportion of the variance in the target variable that is explained by the model. It typically ranges from 0 to 1, with 1 indicating a perfect fit, though it can be negative when the model performs worse than simply predicting the mean of the target (as we saw with the overfit decision tree). Higher R2 values indicate better model performance in capturing the variability of the target variable.

Mean Absolute Percentage Error (MAPE): MAPE calculates the average percentage difference between the predicted and actual values, relative to the actual values. It provides a measure of relative accuracy and is commonly used when the scale of the target variable is significant. Lower MAPE values indicate better model performance.

Adjusted R-squared (Adjusted R2) Score: Adjusted R2 adjusts the R2 score by taking into account the number of features or predictors in the model. It penalizes overfitting by accounting for the degrees of freedom. Higher adjusted R2 values indicate better model performance.

The selection of an appropriate performance metric for a regression model depends on the specific features and requirements of the model.
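
As a quick reference, most of these metrics can be computed with scikit-learn as sketched below (assuming a fitted model such as the lr from earlier; adjusted R2 is computed manually since scikit-learn does not provide it directly):

```
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

y_pred = lr.predict(x_test)   # predictions from any fitted regressor

mse  = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae  = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
r2   = r2_score(y_test, y_pred)

# Adjusted R2 is not built into sklearn; the usual formula is
n, p = x_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}  "
      f"MAPE={mape:.3f}  R2={r2:.3f}  Adjusted R2={adj_r2:.3f}")
```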

I have previously covered the metrics used in classification and regression problems in one of my blogs. For more detailed information, please refer to the blog post available here 👇

Cross Validation

Through the previous analysis, it has been demonstrated that the accuracy of the model varies across different train-test splits. To overcome this variability, and to avoid repeatedly re-splitting and re-fitting the model just to see how the score fluctuates, cross-validation provides a solution. By utilizing cross-validation, we can obtain a more reliable estimate of the model’s performance without excessive manual iterations, ultimately saving time and effort.

Cross-validation involves dividing the dataset into multiple subsets or folds. The model is trained on a subset of the data called the training set and evaluated on the remaining subset called the validation set. This process is repeated multiple times, with each fold serving as the validation set in a different iteration.

Ten-fold cross-validation diagram: Research Gate

Given the observed variations in the regression model, let’s apply cross-validation specifically to linear regression to determine if there is potential for improvement in the score. The process remains consistent for all models: we provide the estimator (in this case, linear regression), features, target variables, the number of folds for cross-validation, and a chosen evaluation metric. The result is an array of scores corresponding to the specified number of folds. By taking the mean of these scores, we can obtain a general assessment. Let’s walk the talk 🦿

Image By Author: Cross Validation
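
A sketch of that call, assuming 5 folds and R2 as the scoring metric (the screenshot may use different settings):

```
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of linear regression, scored with R2
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print("Fold scores:", scores)
print("Mean R2:    ", scores.mean())
```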

Section 3: Regularization Techniques

Regularization techniques in regression problems are methods used to prevent overfitting and improve the performance of regression models. They involve adding a regularization term to the loss function during model training, which helps control the complexity of the model and reduce the impact of noisy or irrelevant features. This term imposes a penalty on large coefficients, encouraging the model to favor simpler solutions.

The two commonly used regularization techniques in regression are Ridge regression (L2 regularization) and Lasso regression (L1 regularization).

Ridge Regression (L2 Regularization): Ridge regression adds a penalty term/regularization term to the loss function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. Mathematically, the Ridge regression loss function can be written as:

Ridge Regression Cost Function
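
As it is commonly written, the Ridge cost function adds half the squared ℓ2 norm of the coefficients, scaled by a hyperparameter α, to the usual loss: J(β) = MSE(β) + α * (1/2) * ∑βj², where the sum runs over β1, …, βn (the intercept β0 is typically not penalized).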

The addition of the regularization term forces the model to shrink the coefficient values toward zero, reducing their impact on the predictions. This helps control multicollinearity and makes the model more robust against noise in the data.

Note: It is important to scale the data (e.g., using a StandardScaler) before performing Ridge Regression, as it is sensitive to the scale of the input features. This is true of most regularized models.

Lasso Regression (L1 Regularization): Lasso regression, similar to Ridge regression, adds a penalty/regularization term to the loss function. However, in Lasso regression, the penalty term is proportional to the absolute magnitudes of the coefficients i.e. it uses the ℓ1 norm of the weight vector instead of half the square of the ℓ2 norm.

This leads to some coefficients becoming exactly zero, effectively performing feature selection. Mathematically, the Lasso regression loss function can be written as:

Lasso Regression Cost Function
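
As it is commonly written: J(β) = MSE(β) + α * ∑|βj|, again summing over β1, …, βn, with α controlling the strength of the penalty.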

By setting some coefficients to zero, Lasso regression promotes sparsity and selects the most relevant features, which can improve interpretability and reduce model complexity.

Let’s apply both regularization techniques to the dataset.

Image By Author: Ridge and Lasso Regression
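
A sketch of both fits on the same split; the alpha values below are illustrative rather than tuned, and scaling is included per the note above:

```
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Scale the features first, since both penalties are sensitive to feature scale;
# the alpha values are illustrative, not tuned
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))

for name, model in [("Ridge", ridge), ("Lasso", lasso)]:
    model.fit(x_train, y_train)
    print(f"{name}: train R2 = {model.score(x_train, y_train):.3f}, "
          f"test R2 = {model.score(x_test, y_test):.3f}")
```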

Here alpha (α) is a hyperparameter that controls the strength of regularization. The choice of alpha is crucial in balancing the trade-off between model complexity and performance. Higher values of alpha provide more regularization, which can help mitigate overfitting but might lead to underfitting if set too high. On the other hand, lower values of alpha reduce regularization and allow the model to be more flexible, which can increase the risk of overfitting.

To determine the optimal value for alpha, techniques such as cross-validation or grid search can be employed. These methods help find the alpha value that yields the best model performance on unseen data, considering both bias and variance trade-offs.

Finally, we have come to the conclusion of this extensive blog!!!

I trust that it has served its purpose in providing a comprehensive understanding of the fundamental concepts. However, we must not forget one crucial aspect: ensemble techniques. These techniques offer a way to amplify prediction accuracy and generalization by harnessing the collective wisdom of multiple individual models. Random Forest, for instance, stands as an example of an ensemble technique. While I have not delved into the intricacies of ensembles in this article, I assure you that I will explore them in a separate piece, where their inner workings and benefits will be expounded upon. Until then, keep exploring and expanding your knowledge in the fascinating world of machine learning.

I hope you enjoyed this article! You can follow me, Afaque Umer, for more such articles.

I will try to bring up more machine learning and data science concepts, breaking down fancy-sounding terms and ideas into simpler ones.

Thanks for reading 🙏Keep learning 🧠 Keep Sharing 🤝 Stay Awesome 🤘

