Evaluating socioeconomic indicators with exploratory analysis and hierarchical clustering

Grouping similar neighborhoods of Barcelona using agglomerative hierarchical clustering

Amanda Iglesias Moreno
Level Up Coding

Image by Miquel Migg in Unsplash

Hierarchical clustering is an unsupervised learning technique that builds nested clusters by merging or splitting them successively. The method produces a complete hierarchy of clusters, which makes it easier to choose the number of clusters and to see how similar the data points are to one another. In this article, we will apply agglomerative hierarchical clustering to classify the neighborhoods of Barcelona according to their social class. To do so, we use three attributes: (1) income, (2) unemployment rate, and (3) educational level. After clustering the neighborhoods, we will be able to identify 5 social classes. From a business perspective, classifying neighborhoods by social class helps adapt the business strategy to the preferences and habits of the customers.

Steps of the project

The project consists of the following sections:

  1. Exploratory data analysis and data cleaning
  2. Theoretical introduction: Hierarchical clustering
  3. Hierarchical clustering with SciPy
  4. Visualization of the hierarchy with a dendrogram
  5. Visualization of the clusters on a choropleth map
  6. Mean of each socioeconomic indicator by cluster
  7. Why is identifying social classes so important from a business perspective?
  8. Project summary

1. Exploratory Data Analysis and Data Cleaning

The objective of this section is to create a data frame that contains all the data necessary for the subsequent analysis. The final data frame contains three columns: (1) average income per household, (2) percentage of population with university education, and (3) unemployment level. The information is provided for each neighborhood of the city of Barcelona. All the necessary steps to clean the data are detailed in the Jupyter notebook available on GitHub.

Final data frame after cleaning (df_all)

2. Theoretical introduction: Hierarchical clustering

Hierarchical clustering is an unsupervised learning technique that groups data with a sequence of nested partitions. There are two types of hierarchical clustering algorithms: agglomerative and divisive.

Agglomerative vs divisive hierarchical clustering

Hierarchical clusters are generated either using a top-down or a bottom-up approach.

  • Agglomerative hierarchical clustering (bottom-up): The procedure starts with n clusters of size 1 (each observation has its own cluster) and ends up with 1 cluster of size n (one cluster that contains all observations).
  • Divisive hierarchical clustering (top-down): All observations are initially part of one single cluster. This initial cluster is recursively split into smaller and smaller clusters until each observation forms a cluster on its own.

In this article, we use the bottom-up approach (agglomerative hierarchical clustering) to group similar neighborhoods into clusters according to socioeconomic indicators.

Agglomerative vs divisive hierarchical clustering (image created by the author)

Hierarchical clustering: The linkage method

The goal of agglomerative hierarchical clustering is to put together the elements which are similar to each other. At any stage of the process, the two clusters that have the smallest linkage distance are fused to form a single cluster.
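The merge loop described above can be sketched from scratch. This is a deliberately naive illustration, not the implementation used in the article: it takes the smallest distance between any two members as the linkage distance (one of the options discussed in this section), and the function name agglomerate and the four 1-D points are made up for the example.

```python
import numpy as np

def agglomerate(points):
    """Naive agglomerative clustering; returns the list of merges performed."""
    # Start with n clusters of size 1, each holding one observation index.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Linkage distance: smallest distance between any two members.
                d = min(
                    np.linalg.norm(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        # Fuse the two closest clusters into a single one.
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Made-up observations, for illustration only.
points = np.array([[0.0], [1.0], [5.0], [6.5]])
history = agglomerate(points)
for left, right, dist in history:
    print(left, right, dist)
```

With n observations the loop always performs n - 1 merges, which is why the linkage matrix discussed later has n - 1 rows.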

There are many linkage methods from which to choose. The most commonly used methods are (1) single, (2) complete, (3) average, and (4) centroid linkage.

  • Single Linkage: this method uses the minimum distance between two clusters to determine the linkage.
  • Complete Linkage: this method considers the maximum of all pairwise distances between elements of the two clusters.
  • Average Linkage: the distance between two clusters is defined as the average distance of all pairwise distances between observations of the two clusters.
  • Centroid Linkage: the distance between two clusters is the distance between their centroids.

Linkage methods (image created by the author)
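To make the four definitions concrete, here is a small sketch that computes each linkage distance between two tiny clusters. The points are made up, and Euclidean distance is assumed as the underlying metric.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small, made-up clusters of 2-D points (illustrative values only).
cluster_a = np.array([[0.0, 0.0], [0.0, 2.0]])
cluster_b = np.array([[3.0, 0.0], [3.0, 2.0]])

# All pairwise Euclidean distances between members of A and members of B.
pairwise = cdist(cluster_a, cluster_b)

single = pairwise.min()     # single linkage: closest pair
complete = pairwise.max()   # complete linkage: farthest pair
average = pairwise.mean()   # average linkage: mean over all pairs
# Centroid linkage: distance between the two cluster means.
centroid = np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))

print(single, complete, average, centroid)
```

Single linkage can never exceed average linkage, and average linkage can never exceed complete linkage, since they are the minimum, mean, and maximum of the same set of pairwise distances.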

3. Hierarchical clustering with SciPy

We use the linkage function to perform a hierarchical clustering of the rows of the NumPy array X_standard, using 4 linkage methods: (1) single, (2) complete, (3) average, and (4) centroid. The function returns an (n-1) by 4 matrix in which each row describes one merge. The first two columns contain the indices of the two clusters that are combined to form a new cluster. The third column is the distance between those two clusters. The fourth column contains the number of observations in the newly created cluster. As expected, the last cluster formed contains a total of 73 observations (the number of neighborhoods in our data set).
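A minimal sketch of this step is shown below. Because the cleaned data is not reproduced here, a small random array (X_demo, a hypothetical stand-in for X_standard) is used instead, so the numbers are not the article's results.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Made-up stand-in for X_standard: 6 observations, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 3))

# One linkage matrix per method (Euclidean distance is the default metric).
methods = ["single", "complete", "average", "centroid"]
linkages = {m: linkage(X_demo, method=m) for m in methods}

Z = linkages["average"]
print(Z.shape)    # (n - 1, 4): one row per merge
print(Z[-1, 3])   # size of the last cluster formed, i.e. n
```

In the first two columns, an index smaller than n refers to an original observation, while an index n + i refers to the cluster created in row i of the matrix.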

As shown below, the results depend on the linkage method used. It is important to bear in mind that it is necessary to standardize all the columns before calculating the distances between data points since the columns have very different ranges.
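One common way to standardize is the z-score: subtract each column's mean and divide by its standard deviation. The sketch below assumes this approach; the notebook may use a different scaler, and the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows with very different column ranges (hypothetical column names).
df_demo = pd.DataFrame(
    {
        "income": [20000.0, 35000.0, 50000.0, 75000.0],
        "university_pct": [10.0, 25.0, 40.0, 50.0],
        "unemployment_pct": [12.0, 8.0, 5.0, 3.0],
    }
)

# z-score per column: zero mean and unit (population) standard deviation.
X_standard_demo = ((df_demo - df_demo.mean()) / df_demo.std(ddof=0)).to_numpy()

print(np.round(X_standard_demo, 2))
```

Without this step, income (in the tens of thousands) would dominate the Euclidean distances, and the two percentage columns would have almost no influence on the clustering.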

Clustering with different linkage methods (image created by the author)

After we determine the linkage matrix, we can pass the results to the dendrogram function.

As shown above, average and centroid linkage produce very similar partitions, in which 3 or 5 clusters can be distinguished. We proceed with average linkage and examine the resulting clusters in more detail in the following sections.

4. Visualization of the hierarchy with a dendrogram

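Hierarchical clustering produces a tree-like structure called a dendrogram. The height of each node (its vertical position in the usual orientation) corresponds to the linkage distance at which two clusters were merged, so clusters that join lower in the tree are more similar to each other.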

Dendrogram (image created by the author)
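As a rough sketch of how such a tree can be obtained with SciPy, the snippet below passes a linkage matrix to the dendrogram function. The data is a made-up random array rather than the article's X_standard, and no_plot=True is used so that only the layout is computed; dropping that argument draws the figure, provided matplotlib is installed.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Made-up stand-in for the standardized data: 8 observations, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(8, 3))
Z = linkage(X_demo, method="average")

# no_plot=True returns the tree layout as a dictionary without drawing it.
tree = dendrogram(Z, no_plot=True)

# Leaf labels in the order they appear along the axis.
print(tree["ivl"])

# Each link is described by four y-coordinates; the middle two are the
# height of the merge, which equals a distance in column 3 of Z.
heights = sorted(link[1] for link in tree["dcoord"])
print(np.round(heights, 3))
```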

Cutting the dendrogram at a particular height splits the neighborhoods into disjoint clusters. In this case, we decided to cut the dendrogram to form 5 groups. Those clusters represent groups of neighborhoods that are similar to each other according to socioeconomic indicators.

5. Visualization of the clusters on a choropleth map

After determining the linkage matrix and visualizing the results, we use the cut_tree function to form 5 disjoint clusters (n_clusters=5).
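A minimal sketch of this call, again on a made-up random array instead of the real data:

```python
import numpy as np
from scipy.cluster.hierarchy import cut_tree, linkage

# Made-up stand-in for the standardized data: 20 observations, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
Z = linkage(X_demo, method="average")

# cut_tree returns a column vector of shape (n, 1); flatten it to 1-D labels.
labels = cut_tree(Z, n_clusters=5).ravel()

# Cluster ids and how many observations fall in each.
ids, counts = np.unique(labels, return_counts=True)
print(ids, counts)
```

The labels are integers starting at 0, in the same row order as the input array, so they can be placed next to the neighborhood names in a data frame.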

The results obtained are stored in a DataFrame called clusters_districs, where the column clusters contains the cluster labels and the column NOM contains the names of the neighborhoods. As shown above, we have renamed the district el Poble Sec to match the name used in the JSON file.

To build the map, we used the Geopandas library. Building a choropleth map requires two main elements:

  • A JSON file that contains the boundaries (polygons) of every district of Barcelona. The JSON file used in this article can be found in the following GitHub repository:

https://raw.githubusercontent.com/martgnz/bcn-geodata/master/barris/barris.json

  • A data frame that contains the value we want to represent (in this case the cluster).

We read the JSON file as a GeoDataFrame using the geopandas.read_file function. After selecting the columns of interest (NOM and geometry), we add the cluster labels to the GeoDataFrame. Finally, we visualize the results using the plot method.
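These steps could look roughly like the sketch below. The helper names attach_clusters and plot_cluster_map are hypothetical, and a left join on NOM is an assumption about how the labels are added. The join is demonstrated with plain pandas and made-up rows; the mapping function needs geopandas and matplotlib and is only defined here, not run.

```python
import pandas as pd

def attach_clusters(areas, clusters):
    """Left-join cluster labels onto the area records by the shared NOM column."""
    return areas.merge(clusters, on="NOM", how="left")

def plot_cluster_map(geojson_path, clusters):
    """Read the boundaries, attach the labels, and draw a categorical choropleth."""
    import geopandas as gpd  # imported lazily; requires geopandas and matplotlib

    areas = gpd.read_file(geojson_path)[["NOM", "geometry"]]
    merged = attach_clusters(areas, clusters)
    return merged.plot(column="clusters", categorical=True, legend=True)

# The join itself can be checked with plain pandas and made-up rows.
areas_demo = pd.DataFrame({"NOM": ["Pedralbes", "Sarrià", "Hypothetical name"]})
clusters_demo = pd.DataFrame({"NOM": ["Pedralbes", "Sarrià"], "clusters": [3, 3]})
merged_demo = attach_clusters(areas_demo, clusters_demo)

# Rows without a label reveal names that differ between the two sources.
print(merged_demo[merged_demo["clusters"].isna()])
```

Checking for missing labels after the join is a quick way to catch spelling differences between the two sources, which is the kind of mismatch that made renaming el Poble Sec necessary.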

The following choropleth map shows the location of each cluster (5 in total). The clusters represent the social classes in Barcelona, which are calculated according to the following socioeconomic indicators: (1) the rate of unemployment, (2) the income per household, and (3) the rate of citizens with university studies.

Choropleth map (image created by the author)

6. Mean of each socioeconomic indicator by cluster

The following table shows the mean of each socioeconomic indicator for the resulting clusters (five in total). By observing the values, we can make the following associations.

Image created by the author
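A table like this can be produced with a pandas groupby. The sketch below uses made-up values and hypothetical column names, so its output does not match the figures reported in this article.

```python
import pandas as pd

# Made-up indicator values with a cluster label per row (illustration only).
df_demo = pd.DataFrame(
    {
        "income": [20000.0, 24000.0, 40000.0, 44000.0, 75000.0],
        "university_pct": [10.0, 14.0, 30.0, 34.0, 52.0],
        "unemployment_pct": [12.0, 14.0, 7.0, 5.0, 3.0],
        "clusters": [0, 0, 1, 1, 2],
    }
)

# Mean of every indicator within each cluster, rounded for display.
cluster_means = df_demo.groupby("clusters").mean().round(2)
print(cluster_means)
```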

High-class neighborhoods (cluster 3 — purple)

The upper-class neighborhoods in Barcelona are: (1) Pedralbes, (2) Sant Gervasi - Galvany, (3) Sant Gervasi - la Bonanova, (4) Sarrià, and (5) les Tres Torres. They are located in the inner part of the city. They present very low unemployment rates (3.35%), high education rates (51.52%), and high income per household (74793).

High-class neighborhoods (Image created by the author)

This income level is far above the average for Spain and Catalonia, and it is even higher than the median income of countries such as Norway, Sweden, or Denmark (See https://en.wikipedia.org/wiki/Median_income).

High middle-class neighborhoods (cluster 4 — yellow)

The high middle-class districts are located in the inner part of the city and in the city center. There are 5 neighborhoods that belong to this cluster: (1) Vallvidrera, el Tibidabo i les Planes, (2) el Putxet i el Farró, (3) l’Antiga Esquerra de l’Eixample, (4) la Dreta de l’Eixample, (5) la Vila Olímpica del Poblenou.

High middle-class neighborhoods (Image created by the author)

Those neighborhoods show high education rates (all of them more than 40%), low unemployment rates, and relatively high income.

Middle-class neighborhoods (cluster 1 — salmon)

As we expected, most neighborhoods belong to the middle-class cluster.

Middle-class neighborhoods (Image created by the author)

As shown below, the income of these neighborhoods ranges from 25485 to 50563. Unemployment is below 9% and the university education rate is above 17% in every neighborhood of the cluster.

Low middle-class (cluster 0 — blue)

The low middle-class neighborhoods are located in the north and south parts of the city. The education levels are less than 20% and the levels of unemployment are high (8.67% on average).

Low middle-class neighborhoods (Image created by the author)

Low-class neighborhoods (cluster 2 — green)

The low-class neighborhoods are located near the port and on the inner northside of the city.

Low-class neighborhoods (Image created by the author)

They present very high unemployment rates (above 12%), low education levels, and low incomes, as shown in the table below.

7. Why is identifying social classes so important from a business perspective?

From a business perspective, social classes are often measured as a combination of multiple attributes such as income, education, occupation, wealth, and other variables. Social classes can have a strong effect on spending habits. Upper-class individuals, for example, buy expensive jewelry and luxury cars, while those belonging to the lower class tend to live on a day-to-day basis, spending most of their income on food and shelter. Marketers should be aware of the social classes of the target market to develop a marketing strategy that suits the tastes and needs of the individuals.

8. Project summary

In this project, we have used unsupervised learning to create a hierarchical clustering of Barcelona neighborhoods according to three socioeconomic indicators: (1) income per household, (2) unemployment level, and (3) rate of university education. We have been able to distinguish 5 social levels, which have been visualized using a choropleth map. This division of neighborhoods according to social aspects can be useful from a business point of view to adapt to the needs and aspirations of customers.

Thanks for reading 💜

Amanda Iglesias
