Clustering GPS Coordinates and Forming Regions with Python

Joseph Magiya
Published in Level Up Coding
5 min read · Jun 19, 2019


Clustering GPS Locations

I recently ran into a challenge while crunching some data that contained GPS latitudes and longitudes. In an effort to squeeze as much information as I could out of the data, I had an idea. It’s nothing new, but it’s definitely exciting.

Heat maps and clustered maps are nice, but what if we could do more with the GPS coordinates? Let’s dream a little: what if there were relationships between the regions and the other data points? For example, is customer churn influenced by region?

Here’s a simple, yet powerful, way to cluster GPS locations with Python.

For this I’ve used data from Kaggle’s Zillow Prize: Zillow’s Home Value Prediction (Zestimate) competition. I used ‘properties_2016.csv’. It’s large!

Import Prerequisites.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns; sns.set()

Load the File & Read the First Few Rows

df = pd.read_csv('properties_2016.csv')
df.head(10)
All Data
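A side note: the CSV is big. If memory is tight, you could load only the columns this walkthrough needs; a minimal sketch (the merge step near the end uses the full frame, so skip this if you want to keep every Zillow column):

df = pd.read_csv('properties_2016.csv', usecols=['parcelid', 'latitude', 'longitude'])  # load only the id and the coordinates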

Remove rows where the Longitude and/or Latitude are null values

df.dropna(axis=0, how='any', subset=['latitude', 'longitude'], inplace=True)

Create a variable that holds only what we need: the ‘parcelid’ (so we can join back to the original data later), the latitude, and the longitude.

# Variable with the parcelid, latitude and longitude
X = df.loc[:, ['parcelid', 'latitude', 'longitude']]
X.head(10)
Sliced Data for KMeans

Elbow Curve

Whoa, wait up! What is this, Joseph?

K-means is somewhat naive: it splits the data into k clusters even if k is not the right number of clusters to use. And when it comes to clustering, it’s hard to know how many clusters are optimal for a dataset, i.e. how many actually make sense. We don’t want to guess, now do we? So when using k-means, we need a way to determine whether we’re using the right number of clusters.

Let’s make this fun: guess the optimal number of clusters and write it down somewhere. Now let’s see if you can actually win the lottery.

One method to validate the number of clusters is the elbow method. The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k (say, k from 1 to 10), and for each value of k calculate the Sum of Squared Errors (SSE).
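For reference, if cluster j has centroid μ_j, the SSE is the sum over every point x of its squared distance to its assigned centroid: SSE = Σ_j Σ_{x ∈ cluster j} ‖x − μ_j‖². scikit-learn calls this quantity inertia_, and KMeans.score() returns its negative, which is why the curve below rises toward zero as K grows.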

As K increases, the centroids move closer to the points in their clusters and the SSE keeps falling. At some point the improvements drop off sharply, creating the elbow shape in the plot. That point is the optimal value for K.

This might take a while… stretch a little.

K_clusters = range(1, 10)
kmeans = [KMeans(n_clusters=i) for i in K_clusters]

# Score on both coordinates (fitting on latitude alone would ignore longitude)
coords = df[['latitude', 'longitude']]
score = [kmeans[i].fit(coords).score(coords) for i in range(len(kmeans))]

# Visualize
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
Elbow Curve showing optimal number of clusters

When we look at the plot, we see that the curve levels off after 3 clusters, which implies that adding more clusters will not help us much.
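Eyeballing the bend works, but you can also sanity-check it numerically. Here’s a minimal sketch that measures how much each extra cluster reduces the SSE (it reuses kmeans, K_clusters and score from the snippet above; the 10% cutoff is just a rule of thumb I’m assuming, not part of the method):

sse = [-s for s in score]  # score is the negative SSE, so flip the sign
drops = -np.diff(sse) / np.array(sse[:-1])  # fractional SSE reduction per extra cluster
for k, drop in zip(K_clusters[1:], drops):
    print(f'k={k}: SSE drops by {drop:.1%}')
# One heuristic: pick the first k where the drop falls below ~10%.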

Clustering using K-Means and Assigning Clusters to our Data

Let’s look at some parameters of the KMeans function first.

KMeans Parameters

  • n_clusters: int, optional, default 8. The number of clusters to form, as well as the number of centroids to generate.
  • init: {‘k-means++’, ‘random’ or an ndarray}. ‘k-means++’ selects initial cluster centers for k-means clustering in a smart way to speed up convergence. ‘random’ chooses k observations (rows) at random from the data for the initial centroids. If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
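One more knob worth knowing about: k-means starts from random centroids, so the labels (and sometimes the clusters themselves) can change between runs. If you need reproducible results, a minimal sketch (random_state and n_init are standard scikit-learn parameters; the values here are arbitrary):

kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)  # fixed seed for repeatable clusters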

Let’s get into it

kmeans = KMeans(n_clusters=3, init='k-means++')
X['cluster_label'] = kmeans.fit_predict(X[X.columns[1:3]])  # Fit on latitude/longitude and assign each row a cluster
centers = kmeans.cluster_centers_  # Coordinates of cluster centers
labels = kmeans.labels_  # Label of each point (same values as cluster_label)
X.head(10)
Clustered Data

Visualize the Results

Let’s visualize the results by plotting the data colored by these labels. We will also plot the cluster centers as determined by the k-means estimator:

X.plot.scatter(x='latitude', y='longitude', c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.show()
Visualized Results

You can try running k-means with the number you picked earlier and visualizing it; a sketch follows below. I picked 5 and it looked like this. Not that pretty, huh?

Visualized Results from a guessed cluster
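If you want to reproduce that comparison, it’s just a parameter change; a minimal sketch (kmeans_5 and labels_5 are names I made up):

kmeans_5 = KMeans(n_clusters=5, init='k-means++')
labels_5 = kmeans_5.fit_predict(X[['latitude', 'longitude']])  # re-fit with your guessed k
X.plot.scatter(x='latitude', y='longitude', c=labels_5, s=50, cmap='viridis')
plt.scatter(kmeans_5.cluster_centers_[:, 0], kmeans_5.cluster_centers_[:, 1], c='black', s=200, alpha=0.5)
plt.show()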

Merge the Results with the Rest of Your Data

We have to merge the clusters back into our existing data so we can do more analysis. We have two variables containing our data, df and X. Let’s see what they look like before merging.

df.head(5)
Main Data Frame
X.head(5)
Subset of data which has the clusters

Let’s remove the longitude and latitude columns from X, since they already exist in df. If we don’t remove them, the merge will create two duplicate columns for longitude and latitude in our data frame. We don’t want that.

X = X[['parcelid','cluster_label']]
X.head(5)
Trimmed Subset of Clustered Data

Let’s merge the data now. After merging, the new column will be added to the right of your data set.

clustered_data = df.merge(X, left_on='parcelid', right_on='parcelid')
clustered_data.head(5)
Data with clusters
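A quick sanity check after the merge: confirm no rows were lost and see how many properties landed in each region.

print(len(df), len(clustered_data))  # the two row counts should match
print(clustered_data['cluster_label'].value_counts())  # properties per region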

Export the Data Frame to a CSV

Fortunately, there’s a pandas.DataFrame.to_csv function and it’s as simple as:

clustered_data.to_csv('clustered_data.csv', index=False, header=True)

Bonus — How to get the Centers.

centers = kmeans.cluster_centers_
print(centers)
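Since we fit on latitude then longitude, each row of centers is a (latitude, longitude) pair, and the row index matches cluster_label. A small sketch to make that explicit (centers_df is a name I made up):

centers_df = pd.DataFrame(centers, columns=['latitude', 'longitude'])
centers_df.index.name = 'cluster_label'
print(centers_df)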

What’s Next?

Having clustered the locations, we can now look at the data we have from a different perspective.
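For example, since the fitted estimator is still around, you can assign any new GPS coordinate to one of the regions. A minimal sketch (the coordinate below is made up, and must be in the same units and scale as the latitude/longitude columns the model was fitted on):

new_point = [[34144442.0, -118654084.0]]  # hypothetical [latitude, longitude] pair
print(kmeans.predict(new_point))  # the region this point would fall in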

GitHub Link

Thank you for reading and … Good Luck
