K-Nearest Neighbours (KNN)

Deepak Jain
Published in Level Up Coding · Jul 14, 2020


In this series of articles, we will understand the K-Nearest Neighbours (KNN) algorithm in great detail.

To start with, K-NN is classified as a supervised machine learning algorithm. For those who want to understand supervised and unsupervised machine learning algorithms, here is a link for the same. In a nutshell, in supervised learning the class label is defined.

Before diving into the details of KNN, let’s first understand what we mean by a classification problem vs. a regression problem.

Consider the example of the Titanic dataset. The Titanic dataset is considered the “hello world!” kind of project in the field of machine learning. The objective of this problem is to classify whether a person survived the disaster based on certain features. So, given a bunch of features like the age of the person, the gender, which class of ticket he/she purchased, etc., we are supposed to predict whether the person survived or not. For example, if a person is female or a child, the probability of them surviving is higher. So, we are supposed to classify between one of the two class labels, i.e. survived or not. Since there are two possible outcomes, it is also called a 2-class classification problem or binary classification problem.

Now, suppose you just took a cab ride using one of the online cab aggregators. When the ride is completed, you are given an option to rate your experience of the ride. This rating system helps the company understand the performance of the driver responsible for the ride. Now, there are (let’s say) 5 possible choices given to you, i.e. very bad, bad, average, good, very good. Based on your experience you select one of the 5 given choices. Here, you are trying to classify your ride experience into one of the 5 possible categories. Since the number of possible categories is more than 2, this is called a multi-class classification problem.

Let’s move on to understand what a regression problem is. In the above two examples, the class label had a finite set of values: survived or not, or one of the 5 possible ride ratings. But in the case of regression, we do not have this finite set of class labels. Let me explain this with an example. You are given features like the weight, age and gender of some students, and the objective is to predict the height of the students. In this case, the height of a student does not belong to a finite set of class labels. It is a real-valued number which can take any value within a specific range. Such problems, where the class label is a real number, are called regression problems.
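To make the distinction concrete, here is a tiny illustrative sketch in Python (the values below are made up for illustration only):

# Classification: the target is one of a finite set of labels.
y_titanic = ["survived", "not survived", "survived"]              # binary (2-class)
y_ride    = ["very bad", "bad", "average", "good", "very good"]   # multi-class

# Regression: the target is a real number that can take any value in a range.
y_height_cm = [152.4, 171.0, 165.3, 180.2]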

KNN can be used for both classification as well as regression problems. We will first understand KNN from a classification standpoint.

Consider the below example.


We have a set of red and green points. We observe that the red points belong to class label A and the green points to class label B. Given these points, we are now provided with another point, shown in yellow. The task is to classify this yellow point, also called the query point, into one of the class labels.

Definition of KNN in layman’s terms:

KNN stands for K Nearest Neighbours, where K is an integer indicating the number of nearest neighbours to consider. K can take any integer value between 1 and n, where n is the number of points in the dataset. So, we consider the K nearest neighbours of the query point and perform a majority vote among those neighbours. The class label with the most votes is assigned to the query point.

Coming back to the above example, suppose K=3, so we are supposed to identify the 3 nearest neighbours of the query point (how we determine those neighbours, we’ll see in a later section of this article). When we consider the 3 nearest neighbours, 2 of them belong to class B and 1 belongs to class A. Going by the majority, we classify the query point as class B. Now, let's consider K=7. That means, instead of considering 3 nearest neighbours, we will now consider 7 nearest neighbours. When we consider the 7 nearest neighbours, we get 4 belonging to class A and 3 belonging to class B. Now, going by the majority class, we classify the query point as class A.
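To make the voting procedure concrete, here is a minimal sketch of a KNN classifier written with NumPy. The data points and the function name are illustrative assumptions, not values from the figure above:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # 1. Compute the Euclidean distance from the query point to every training point.
    distances = np.linalg.norm(X_train - query, axis=1)
    # 2. Take the indices of the k nearest neighbours.
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote among the class labels of those neighbours.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Illustrative data: two class 'A' points and three class 'B' points.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.0], [4.2, 3.8], [3.9, 4.1]])
y_train = np.array(['A', 'A', 'B', 'B', 'B'])

print(knn_predict(X_train, y_train, query=np.array([3.5, 3.5]), k=3))   # -> 'B'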

You might wonder what just happened. When we chose K=3, we classified the query point as class B, and when we chose K=7 the class label changed to A. This is confusing. How do I determine what value of “K” to pick?

Don’t worry, all your queries shall be answered soon. For now, I just want to make sure that you understand the intuition behind KNN. We’ll soon understand how to determine the nearest neighbours and also how to pick the best value of ‘K’.

How to find the nearest neighbours?

One simple and straightforward answer to the above question is to calculate the distance between the points. We’ll look at the 4 main distance measures used to calculate the distance between points.

1. Euclidean distance

Consider the below example:


Point A = (x1, y1)

Point B = (x2, y2)

d = length of the shortest line from point A to point B = Euclidean distance between points A and B.

Euclidean distance is the square root of the sum of the squared differences between the coordinates of the two points: d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2). It is also known as the L2 norm. We’ll see in later articles what the L1 and L2 norms are.

2. Manhattan distance

Manhattan distance is the sum of the absolute differences between the two points along each dimension: d(A, B) = |x2 - x1| + |y2 - y1|.

3. Hamming distance

Mostly used in text processing, when we have a binary or Boolean vector. In simple terms, it counts the number of positions at which two vectors differ, i.e. it tells us, position by position, whether two categorical values are the same or not (a small worked example is shown after this list).

4. Minkowski distance

Minkowski distance is a generalised distance measure between two points, controlled by a parameter p: when p=1 it becomes the Manhattan distance, and when p=2 it becomes the Euclidean distance.

These are the distance measures most commonly used to calculate the distance between points and identify the nearest neighbours.
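As a quick illustration, all four measures are available in SciPy’s scipy.spatial.distance module (a small sketch, assuming SciPy is installed; note that SciPy’s hamming returns the fraction of positions that differ rather than the raw count):

from scipy.spatial.distance import euclidean, cityblock, hamming, minkowski

a = [1, 2, 3]
b = [4, 0, 3]

print(euclidean(a, b))         # sqrt((4-1)^2 + (0-2)^2 + 0^2) = sqrt(13) ~ 3.61
print(cityblock(a, b))         # |4-1| + |0-2| + |3-3| = 5   (Manhattan distance)
print(minkowski(a, b, p=1))    # p=1 reduces to the Manhattan distance -> 5
print(minkowski(a, b, p=2))    # p=2 reduces to the Euclidean distance -> sqrt(13)

# Hamming distance on binary/Boolean vectors:
u = [1, 0, 1, 1]
v = [1, 1, 0, 1]
print(hamming(u, v))           # 0.5 -> two of the four positions differ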

Now, you might wonder, is there anything else? I mean, I understand the distance measures part, but is there some other way as well to find the nearest neighbours? And the answer is YES!

So far, we measured the nearest neighbours in terms of distance; how about we measure them in terms of “angle”?

Let me introduce you to the concept of “Cosine-similarity and Cosine-distance”

Cosine-similarity considers the angle (θ) between the two points and is equal to cos(θ). Consider the below example:

A and B are two data points and θ is the angle between them (viewed as vectors from the origin). Roughly speaking, as the points move apart the angle between them grows, and as the angle between the two increases, cos(θ) decreases, which means their cosine-similarity decreases.

Consider the below example:

We have 3 vectors — x1, x2 and x3.

Cosine-similarity (x1, x2) = cos(θ)
Cosine-similarity (x1, x3) = 1 (since θ = 0 and cos(0) = 1)
Cosine-distance (x1, x2) = 1 - cos(θ)
Cosine-distance (x1, x3) = 1 - 1 = 0 (since cos(θ) = 1)

So the observations are: in terms of Euclidean distance, it’s clear that d13 > d12.

But when comparing cosine-distance, cos-dist(x1, x3) < cos-dist(x1, x2).

From the above observations, it is pretty clear that cosine-similarity and cosine-distance use the angle between the vectors and not the geometric distance.
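A small numerical sketch of this behaviour, using illustrative vectors (not the actual x1, x2, x3 from the figure):

import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

x1 = np.array([1.0, 0.0])
x2 = np.array([1.0, 1.0])
x3 = np.array([3.0, 0.0])   # same direction as x1, but much farther away

print(np.linalg.norm(x1 - x2))         # Euclidean distance d12 = 1.0
print(np.linalg.norm(x1 - x3))         # Euclidean distance d13 = 2.0, so d13 > d12

print(1 - cosine_similarity(x1, x2))   # cosine-distance(x1, x2) ~ 0.29
print(1 - cosine_similarity(x1, x3))   # cosine-distance(x1, x3) = 0.0, smaller despite the larger Euclidean distance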

Is there a relation between Euclidean and cosine-similarity?

Yes. There is, and it's given as follows:

Squared Euclidean distance between x1 and x2 = 2 * (1 - cos(θ)), when x1 and x2 are unit vectors (i.e. normalised to length 1).
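A quick sketch of why this holds, assuming x1 and x2 are unit vectors (||x1|| = ||x2|| = 1):

||x1 - x2||^2 = ||x1||^2 + ||x2||^2 - 2 (x1 . x2)
             = 1 + 1 - 2 cos(θ)        (since x1 . x2 = ||x1|| ||x2|| cos(θ))
             = 2 (1 - cos(θ))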

To understand the detailed mathematical proof of the same please refer to this link

To conclude, we understood K-NN in good detail. We also covered various distance measures and cosine-similarity to understand how the nearest neighbours are calculated and how they are related.

Additional reading material:

While doing my research and study on this topic, I came across a brilliant article by Chris Emmery. I would recommend you visit this link.

In the next article, we will look at the limitations of KNN, its failure cases, and how to pick the right value of “K”.

Until then, happy learning!
