A Primer on PCA and Dimensionality Reduction Simplified!

Uncover the math behind PCA, understand principal components, and learn to code PCA with a real-world example.

Prashant Kalepu
Level Up Coding


Source: Image By Author

Greetings, AI folks! Recently, while watching the “Dr. Romantic S3” K-drama, I found the life of a doctor fascinating; however, it’s too late for me to become one. Nonetheless, I wanted to experience that feeling, so I picked up a medical dataset (the Wisconsin Breast Cancer Dataset) and sat down to predict whether a patient had cancer. As I opened the dataset, I had 32 columns in my hands. When we are dealing with tons of features, things can get messy. But in machine learning, we have a knight in shining armour: “Dimensionality Reduction”.

Today we will be learning about:

  • Dimensionality Reduction
  • Mathematics Prerequisite (Very important!!!)
  • Principal Component Analysis
  • Math Behind PCA
  • Coding PCA
  • Tips When Performing PCA
  • Conclusion

Dimensionality Reduction

Let’s say I have a huge table of data. I have this habit of imagining and visualizing data in my mind; it gives me a better understanding and makes me feel comfortable with the data. However, when the data is very big, it’s hard to visualize the features in your head. So I would like to create a smaller version of the data while still preserving as much information as possible. This is called dimensionality reduction.

So the next question is: how can we do this? The simple answer is that we map the higher-dimensional space onto a lower-dimensional one. Let’s say we scatter data points in a high-dimensional space; the goal is to project all those points onto a line while keeping them spaced out as much as possible.

Source: Image By Author
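To make the projection idea concrete, here is a minimal sketch with made-up 2-D points and a hypothetical direction vector (both chosen purely for illustration) that collapses the points onto a single line using NumPy:

import numpy as np

# A few made-up 2-D points (rows are samples, columns are features)
points = np.array([[2.5, 2.4],
                   [0.5, 0.7],
                   [2.2, 2.9],
                   [1.9, 2.2],
                   [3.1, 3.0]])

# A hypothetical unit vector defining the line we project onto
direction = np.array([1.0, 1.0])
direction = direction / np.linalg.norm(direction)

# Each point collapses to a single coordinate along that line
projections = points @ direction
print(projections)  # 5 numbers instead of 5 (x, y) pairs

PCA does exactly this kind of projection, except it chooses the direction for us so that the projected points stay as spread out as possible.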

Mathematics Prerequisite

In this section, we will mainly focus on 5 core concepts that will help us understand “Principal Components” and “PCA”. We are not mathematicians, so we will not go deep into the math; we will only learn what each concept is and how it is useful for us.

The 5 Core Concepts we are going to learn are:

  • Variance
  • Covariance Matrix
  • Linear Transformation
  • Eigenvalues
  • Eigenvectors

Variance

To start with variance, let’s first understand what the mean is. The mean is the point around which all the data points gather; in other words, it is the point of equilibrium that balances the data. Variance tells us how spread out the data is from the mean, or from that point of equilibrium.

It measures how much a single variable deviates from its mean. For instance, let’s spread some data points over a 2-D space. In a 2-D space, we consider horizontal variance and vertical variance separately. Variance quantifies the spread of the data along each individual direction.

Source: Image By Author
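As a quick illustration, here is a small sketch with made-up numbers showing how variance quantifies spread in each direction separately:

import numpy as np

# Made-up 2-D data: x spreads widely, y barely moves
x = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
y = np.array([2.0, 2.1, 1.9, 2.0, 2.2])

print(np.var(x))  # large horizontal variance
print(np.var(y))  # small vertical variance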

Covariance

Covariance takes us a step further, beyond just a single variable. It tells us about the relationship between two variables: it measures how the two variables vary together. In a 2-D space, while variance explains the spread of the data from the mean in each individual direction, covariance takes the whole spread into consideration and describes how changes in one variable relate to changes in the other. Principal components are constructed to be orthogonal to each other, meaning they are uncorrelated.

Covariance Matrix

To efficiently capture these relationships in a dataset with multiple features, we use the covariance matrix. It summarizes the variance of each variable along the diagonal and the covariances between pairs of variables off the diagonal. The covariance matrix is the building block of principal components: principal components are derived from the eigenvectors of the covariance matrix.
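For example, a covariance matrix for two made-up variables can be computed with NumPy; the diagonal holds the variances and the off-diagonal entries hold the covariance between the two variables:

import numpy as np

# Made-up measurements of two variables
height = np.array([150., 160., 170., 180., 190.])
weight = np.array([50., 58., 65., 74., 82.])

cov_matrix = np.cov(height, weight)
print(cov_matrix)
# cov_matrix[0, 0] -> variance of height
# cov_matrix[1, 1] -> variance of weight
# cov_matrix[0, 1] == cov_matrix[1, 0] -> covariance between them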

Linear Transformation

A linear transformation is really just a function, or a map, that transforms vectors from one space to another. When finding principal components, these transformations help us identify the directions of maximum variance in the dataset and reduce the dimensionality of the data while preserving as much information as possible.
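As a tiny illustration, a linear transformation in 2-D is just a matrix multiplying a vector; the matrix below is made up purely for demonstration:

import numpy as np

# A made-up 2x2 transformation matrix
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

v = np.array([1.0, 0.0])
print(A @ v)  # the transformed vector generally points in a new direction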

Eigenvalues and Eigenvectors

Eigenvalues and eigenvectors are properties of a linear transformation. The eigenvalues represent the amount of variance captured by each principal component, and the eigenvectors represent the directions in which that variance occurs. Principal components are constructed from the eigenvectors of the covariance matrix, with the eigenvalues indicating the importance of each component.
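Here is a minimal sketch of extracting eigenvalues and eigenvectors from a made-up covariance matrix with NumPy; since a covariance matrix is symmetric, np.linalg.eigh is a suitable choice:

import numpy as np

# A made-up covariance matrix (symmetric)
cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(cov)
print(eigenvalues)   # variance captured along each eigenvector
print(eigenvectors)  # columns are the (orthogonal) eigenvectors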

Principal Component Analysis

Principal component analysis summarizes the information content of large datasets into a smaller set of uncorrelated variables. It generates a set of variables called principal components. Principal components are new variables that are linear combinations of the initial variables. These combinations are done in such a way that the new variables are uncorrelated and most of the information within the initial variables is compressed into the first components.

The first principal component is the line that best represents the shape of the projected points. The larger the variability captured by the first component, the more information it retains from the original dataset. No other component can capture more variability than the first principal component.

Source: Image By Author

Principal components are orthogonal (perpendicular) projections of the data onto a lower-dimensional space. They have a direction and a magnitude. The principal components are arranged in decreasing order of importance: the first component explains most of the variance, followed by the second, the third, and so on.

Math Behind PCA

Now it’s time to combine everything we have learnt and put it into a flow to understand how these principal components are actually derived. Grab your pen and paper and write down the intuitions and calculations as we discuss.

First, let’s again scatter the data over the space. We center the data by shifting it to the origin, which lets us calculate the covariance matrix. Since these data points are made up, I’ll use a made-up covariance matrix:

Source: Image By Author

Principal components are constructed from the eigenvectors of the covariance matrix, so now let’s find the eigenvectors of our matrix. There are online tools for finding eigenvectors, and the eigenvectors for our covariance matrix are:

Source: Image By Author

Generally, when we apply the linear transformation defined by the covariance matrix to random vectors, the direction of each vector changes in the resulting plane. Only the eigenvectors keep their direction under this transformation, and they are also orthogonal, which means they are uncorrelated with each other.

For those eigenvectors, the associated eigenvalues are:

Source: Image By Author

Fact: The number of independent eigenvectors for each eigenvalue is at least one, and at most equal to the multiplicity of the associated eigenvalue.

Source: Image By Author

So now we have summarized the data into two components: the red vector and the green vector. The importance of these vectors is ranked by their eigenvalues: the higher the eigenvalue, the more important the vector, a.k.a. the principal component. Now that we have the vector in our hands, we project the data points onto it, forming the first principal component.

Source: Image By Author
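To tie the whole flow together, here is a sketch (with made-up 2-D points, chosen only for illustration) of the steps described above: center the data, compute the covariance matrix, take its eigenvectors, rank them by eigenvalue, and project the points onto the top component:

import numpy as np

# Made-up 2-D data points
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# 1. Center the data at the origin
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix (rowvar=False: columns are variables)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen decomposition of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort components by eigenvalue, largest first
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Project the centered data onto the first principal component
first_pc = eigenvectors[:, 0]
projected = X_centered @ first_pc
print(eigenvalues)  # importance of each component
print(projected)    # data summarized along the first component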

Coding PCA

Remember the Wisconsin Breast Cancer Dataset I mentioned at the start? Now it’s time to code the things we have learnt so far. Since we now know what PCA is, what principal components are, and how they are derived, we will have a much better understanding of the results the code generates.

The WBCD includes measurements such as radius, texture, perimeter, and so on: 30 numeric features in total (the 32 columns I mentioned earlier include an ID and the diagnosis label).

Step 1: Importing Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer

Step 2: Load the Dataset

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = data.data
y = data.target

Step 3: Standardize the Data

Since PCA is sensitive to the variances of the initial variables, we’ll standardize the features:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Step 4: Applying PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

Step 5: Visualizing the Results

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='plasma')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA of Breast Cancer Dataset')
plt.colorbar()
plt.show()

As you can see from this plot, in spite of reducing the features from 30 to 2, we still get two fairly separate clusters of benign vs. malignant samples. This means we did not need all the detail in the original 30 features: we reduced them to two principal components and still kept enough information to separate the classes. In other words, we save storage space and training time, and the data becomes much easier to visualize.

Step 6: Explaining the Variance

The principal components do not have a direct meaning in terms of the original data features. However, we can check how much variance is captured by these components:

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")

PC1 explains about 44% of the variance and PC2 explains about 19%.

np.sum(pca.explained_variance_ratio_)

The explained variance ratio tells us the proportion of the dataset’s total variance that is captured by each principal component. In total, we explain about 63% of the variance with just two components instead of the original 30 features.
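If you want to see how the explained variance accumulates as you add components, a quick sketch (assuming the same X_scaled and imports from the earlier steps) is:

# Fit PCA with all components and inspect the cumulative explained variance
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative[:5])  # how much variance the first 1..5 components cover together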

How Can We Cover More Variance than 63%?

In the previous example, we selected two components and found that they cover only 63.24% of the variance. You can ask PCA to select the number of components that gives you more variance coverage as follows:

pca = PCA(n_components=0.8, random_state=1)
pca.fit(X_scaled)

If you set n_components to an integer, it sets the number of components directly. However, if you pass a float between 0 and 1, you ask PCA to select the number of components that gives you that much variance coverage (here, 80%).

pca.n_components_

So to cover 80% of information from the data, we need 5 principal components.

Tips for Performing PCA

  1. Always make sure the features are on the same scale (standardize them); see the pipeline sketch after this list.
  2. Always make sure the data is centered at the origin, as this affects the covariance matrix.
  3. Use a float value for n_components to let PCA determine the number of principal components needed to cover the desired amount of variance.
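A convenient way to honour the first two tips is to chain scaling and PCA into a single scikit-learn pipeline. This is a minimal sketch, assuming the same X loaded earlier from load_breast_cancer:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Scaling (which also centers the data) followed by PCA in one estimator
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.8))
X_reduced = pipeline.fit_transform(X)

print(pipeline.named_steps["pca"].n_components_)  # components needed for 80% variance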

Done And Dusted

And that’s it! That is all about Principal Component Analysis. I’ve tried my best to convey the conceptual and mathematical foundations of PCA. I appreciate you taking the time to read this, and I hope I was able to clear up some confusion for those who are new to machine learning! In the future, I plan to post more about machine learning and computer vision. If you are especially interested in learning about deep learning architectures, check out my latest blogs on the Transformer architecture and vision-language model architectures.

✍🏽 Check out my profile for more content like this.

🥇Sign up for my email newsletter to receive updates on new posts!

Have fun reading!
