Love thy neighbor — KNN from Scratch

Aditya Chandhoke · Published in Level Up Coding · 4 min read · May 3, 2024


KNN classifier

Back again with another implementation from scratch, and this time we’re looking at the K Nearest Neighbor algorithm. This is a pretty fun one: when I started my journey with ML, KNN was the first algorithm I studied and used for predictive modelling on some random dataset.

Photo by Alex Padurariu on Unsplash

So, let’s just jump in!

Let’s get familiar with the theory first. KNN is one of the basic supervised algorithms we use for classification and regression tasks. The working is even simpler: the training data consists of points belonging to groups (classes) with similar characteristics. When the algorithm is shown a new data point, it calculates the distance between that point and the training points, and the new point is then assigned to the group that is most common among its closest neighbors. The K in the algorithm stands for the number of nearest neighbors that get to vote on the new point’s class.
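
To make that concrete, here’s a tiny toy illustration (the numbers are made up just for this sketch): compute the distance from a new point to every training point, keep the K closest, and let them vote.

import numpy as np

# Toy training data: 2D points labelled 0 or 1
train_X = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
train_y = np.array([0, 0, 1, 1])

new_point = np.array([1.2, 1.5])
k = 3

# Euclidean distance from the new point to every training point
distances = np.sqrt(((train_X - new_point) ** 2).sum(axis=1))

# Indices of the k nearest neighbours, then a majority vote on their labels
nearest = np.argsort(distances)[:k]
print(np.bincount(train_y[nearest]).argmax())  # 0, since two of the three closest points are class 0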

This is a non-parametric algorithm, i.e. it does not make any assumptions about the data it is trained on. That contrast matters: in the last blog we saw how the Bayes algorithm assumed the data followed a normal distribution.

This algorithm is also called a lazy learner, as it waits until classification (or regression) time to act on the dataset; during the training phase it just stores the data.

Enough with the words, let’s dive into the code!

As always, we start with the class definition and the initialization code:

import numpy as np

class KNN:
    def __init__(self, n_neighbours, choice='Euclidean'):
        self.n_neighbours = n_neighbours
        self.choice = choice

Next, we write the training and prediction functions:

    def train(self, X, y):
        # KNN is a lazy learner: "training" is just storing the data
        self.m, self.n = X.shape  # number of samples and features, kept for later use
        self.trainX = X
        self.trainy = y

As we discussed above, in the training phase we just store the dataset with its features and labels. We also keep the dimensions of the data for later use.
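
As a quick sanity check (a hypothetical usage sketch, not from the original post), note that the prediction code below indexes the labels as self.trainy[ind][0], so y is expected as a column vector of shape (m, 1):

X = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y = np.array([[0], [0], [1], [1]])  # one label per row, shape (4, 1)

model = KNN(n_neighbours=3)
model.train(X, y)  # just stores X, y and their dimensions

Now, the prediction function: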

    def predict(self, testX):
        # One predicted label per test row
        y_pred = np.empty((testX.shape[0], 1))

        for index, test_x in enumerate(testX):
            # Distance from this test point to every training point
            distance = [distance_measure(test_x, train_x, self.choice) for train_x in self.trainX]

            # Indices of the K closest training points
            n_neighbours_index = np.argsort(distance)[:self.n_neighbours]

            # Labels of those neighbours (y is stored as a column vector)
            n_neighbours_y = np.array([self.trainy[ind][0] for ind in n_neighbours_index])

            # Majority vote decides the predicted class
            y_pred[index] = self.most_frequent_class(n_neighbours_y)

        return y_pred

The function takes our test data as a 2D array. Next, we initialize an empty array to store the predicted labels; it has the same number of rows as the input.

The prediction loop runs over the input data, unpacking each row into its index and the feature vector being considered. The first step in the loop is to calculate the distance between the current test feature vector and every training feature vector, storing the results in a list.

distance = [distance_measure(test_x, train_x, self.choice) for train_x in self.trainX]

The distance is calculated with basic geometry, and here I have implemented three distance measures:

def distance_measure(x1, x2, choice="Euclidean", p=1):
    '''
    Choices : Euclidean, Manhattan, Minkowski
    '''
    d = 0
    if choice == 'Euclidean':
        for i in range(len(x1)):
            d += np.square(x1[i] - x2[i])
        return np.sqrt(d)  # square root of the summed squared differences
    if choice == 'Manhattan':
        for i in range(len(x1)):
            d += np.abs(x1[i] - x2[i])
        return d
    if choice == 'Minkowski':
        for i in range(len(x1)):
            d += np.power(np.abs(x1[i] - x2[i]), p)
        return np.power(d, 1 / p)
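
A quick spot check of the three measures on a small pair of vectors (my own sanity-check snippet):

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(distance_measure(a, b, 'Euclidean'))       # 5.0
print(distance_measure(a, b, 'Manhattan'))       # 7.0
print(distance_measure(a, b, 'Minkowski', p=2))  # 5.0, since p=2 reduces to Euclidean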

The second step finds the indices of the nearest neighbors based on their distances: we argsort the distance list and keep the top K indices.

n_neighbours_index = np.argsort(distance)[:self.n_neighbours]

In the third step, we retrieve the labels of the nearest neighbors from the training data, using the indices from the last step to look them up.

n_neighbours_y = np.array([self.trainy[ind][0] for ind in n_neighbours_index])

Lastly, we use a function to calculate the most frequent class among the labels of the nearest neighbors.

    def most_frequent_class(self, neighbours_y):
        # bincount expects non-negative integers, so cast the labels first
        counts = np.bincount(neighbours_y.astype(int))
        return counts.argmax()

This function calculates the frequency of each unique label in the array and returns the label that occurs most frequently. Here’s an example:

Suppose we have the array [0, 1, 1, 2, 2, 2, 3].
The resulting array from bincount will be [1, 2, 3, 1],
i.e. the indices are the labels and the values are the frequencies of those labels:
0 occurs 1 time, 1 occurs 2 times, and so on.
argmax then returns the index with the highest count, which is the label 2, as it occurred 3 times.
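
The same walk-through as a snippet you can run in a REPL:

labels = np.array([0, 1, 1, 2, 2, 2, 3])
print(np.bincount(labels))           # [1 2 3 1]
print(np.bincount(labels).argmax())  # 2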

Then at the end, we return the predicted classes for all the test points.

That’s it! This is the KNN algorithm simplified and written from scratch.
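
To see it all working together, here’s a minimal end-to-end sketch on a synthetic two-blob dataset (toy data I made up purely for illustration; the class and helpers are the ones defined above):

import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian blobs: class 0 around (0, 0), class 1 around (5, 5)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(5.0, 1.0, size=(50, 2))])
y = np.vstack([np.zeros((50, 1), dtype=int),
               np.ones((50, 1), dtype=int)])

# Shuffle and split into train / test
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:80], idx[80:]

model = KNN(n_neighbours=5)
model.train(X[train_idx], y[train_idx])

y_pred = model.predict(X[test_idx])
accuracy = (y_pred == y[test_idx]).mean()
print(f"Accuracy: {accuracy:.2f}")  # close to 1.0 on such well-separated blobs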

Hope you had fun reading this and trying to implement your own too! I’ll catch you next time with an even better algorithm!

Happy coding! See ya next time!
