Evaluating socioeconomic indicators with exploratory analysis and hierarchical clustering
Grouping similar neighborhoods of Barcelona using agglomerative hierarchical clustering
Hierarchical clustering is an unsupervised learning technique that builds nested clusters by successively merging or splitting them. The method produces a complete hierarchy of clusters, which helps us choose the number of clusters and see the similarities between data points. In this article, we apply agglomerative hierarchical clustering to classify the neighborhoods of Barcelona according to their social class. To do so, we use three attributes: (1) income, (2) unemployment rate, and (3) educational level. After clustering the neighborhoods, we identify 5 social classes. From a business perspective, classifying neighborhoods by social class is useful for adapting a business strategy to the preferences and habits of customers.
Data sets
The data sets used in this article are available from Open Data Barcelona.
Unemployment
This data set contains the unemployment rate among the population aged 16 to 64 in the city of Barcelona, by month and district.
Income
This data set contains the average income per household of the city of Barcelona.
Education
This data set contains the number of inhabitants of Barcelona by sex and academic level.
Note: The information in all data sets refers to the year 2018.
GitHub
The code for this project is available as a Jupyter Notebook on GitHub.
Steps of the project
The project consists of the following sections:
- Exploratory data analysis and data cleaning
- Theoretical introduction: Hierarchical clustering
- Hierarchical clustering with Scipy
- Visualization of the hierarchy with a dendrogram
- Visualization of the clusters on a choropleth map
- Mean of each socioeconomic indicator by cluster
- Why is identifying social classes so important from a business perspective?
- Project summary
1. Exploratory Data Analysis and Data Cleaning
The objective of this section is to create a data frame that contains all the data necessary for the subsequent analysis. The final data frame contains three columns: (1) average income per household, (2) percentage of population with university education, and (3) unemployment level. The information is provided for each district of the city of Barcelona. All the necessary steps to clean the data are detailed in the Jupyter notebook available on GitHub.
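The cleaning steps themselves are in the notebook, but the final assembly can be sketched roughly as follows. This is a minimal illustration rather than the notebook's code: the three toy tables, their values, and the column names `income`, `unemployment`, and `university` are assumptions made for the example. Only the key column `NOM` is a name that appears later in this article.

```python
import pandas as pd

# Hypothetical toy tables, one row per neighborhood (values are made up).
income = pd.DataFrame({"NOM": ["A", "B", "C"], "income": [30000.0, 52000.0, 21000.0]})
unemployment = pd.DataFrame({"NOM": ["A", "B", "C"], "unemployment": [7.5, 4.0, 12.1]})
education = pd.DataFrame({"NOM": ["A", "B", "C"], "university": [25.0, 48.0, 12.0]})

# Join the three indicators on the neighborhood name.
# validate="one_to_one" raises an error if a name is duplicated in either table.
indicators = (
    income.merge(unemployment, on="NOM", validate="one_to_one")
    .merge(education, on="NOM", validate="one_to_one")
    .set_index("NOM")
)

print(indicators)
```

Merging on the name column means that any spelling difference between sources silently drops rows with the default inner join, so it is worth comparing the row count before and after.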
2. Theoretical introduction: Hierarchical clustering
Hierarchical clustering is an unsupervised learning technique that groups data with a sequence of nested partitions. There are two types of hierarchical clustering algorithms: agglomerative and divisive.
Agglomerative vs divisive hierarchical clustering
Hierarchical clusters are generated using either a top-down or a bottom-up approach.
- Agglomerative hierarchical clustering (bottom-up): The procedure starts with n clusters of size 1 (each observation has its own cluster) and ends up with 1 cluster of size n (one cluster that contains all observations).
- Divisive hierarchical clustering (top-down): All observations are initially part of one single cluster. This initial cluster is recursively split into smaller and smaller clusters until each observation forms a cluster on its own.
In this article, we use the bottom-up approach (agglomerative hierarchical clustering) to group similar neighborhoods into clusters according to socioeconomic indicators.
Hierarchical clustering: The linkage method
The goal of agglomerative hierarchical clustering is to put together the elements which are similar to each other. At any stage of the process, the two clusters that have the smallest linkage distance are fused to form a single cluster.
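To make the merging procedure concrete, here is a deliberately naive sketch of the bottom-up loop in plain NumPy. It is not how SciPy implements it and it is far too slow for real data. The function name `naive_agglomerative` is hypothetical, and the sketch assumes the single linkage distance described below (the smallest pairwise distance between members of two clusters).

```python
import numpy as np


def naive_agglomerative(X, n_clusters):
    """Fuse the two closest clusters repeatedly until n_clusters remain."""
    # Start with one cluster per observation, stored as lists of row indices.
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        # Search every pair of clusters for the smallest linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    np.linalg.norm(X[i] - X[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Fuse cluster b into cluster a (b > a, so index a stays valid).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


points = np.array([[0.0], [0.1], [5.0], [5.2], [10.0]])
result = naive_agglomerative(points, 3)
print(result)  # [[0, 1], [2, 3], [4]]
```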
There are many linkage methods from which to choose. The most commonly used methods are (1) single, (2) complete, (3) average, and (4) centroid linkage.
- Single Linkage: this method uses the minimum distance between two clusters to determine the linkage.
- Complete Linkage: this method considers the maximum of all pairwise distances between elements of the two clusters.
- Average Linkage: the distance between two clusters is defined as the average distance of all pairwise distances between observations of the two clusters.
- Centroid Linkage: the distance between two clusters is the distance between their centroids.
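The four definitions above can be checked by hand on two tiny clusters. The points below are made up for illustration, and the sketch assumes Euclidean distance.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters with two observations each.
A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[3.0, 0.0], [3.0, 2.0]])

# Euclidean distance between every observation in A and every observation in B.
pairwise = cdist(A, B)

single = pairwise.min()     # closest pair
complete = pairwise.max()   # farthest pair
average = pairwise.mean()   # mean over all pairs
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # distance between means

print(single, complete, average, centroid)
```

Here single and centroid linkage both give 3, complete linkage gives the diagonal distance (the square root of 13), and average linkage falls in between, which shows that the methods can disagree even on very simple data.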
3. Hierarchical clustering with Scipy
We use the linkage function to perform a hierarchical clustering of the rows of the NumPy array X_standard, using 4 linkage methods: (1) single, (2) complete, (3) average, and (4) centroid. The function returns an (n-1) by 4 matrix, where each row describes one merge. The first two columns contain the indices of the two clusters that are combined to form a new cluster. The third column is the distance between those two clusters. The fourth column contains the number of observations in the newly created cluster. As expected, the last cluster formed contains a total of 73 observations (the number of neighborhoods in our data set).
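A minimal sketch of this step, assuming SciPy's `scipy.cluster.hierarchy.linkage` with its default Euclidean metric. The array below is random data with the same shape as ours (73 rows, 3 columns) and only stands in for the real `X_standard`. The dictionary name `linkage_matrices` is an assumption for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Synthetic stand-in for the standardized indicators (not the real data).
rng = np.random.default_rng(0)
X_standard = rng.normal(size=(73, 3))

# One linkage matrix per method.
methods = ["single", "complete", "average", "centroid"]
linkage_matrices = {m: linkage(X_standard, method=m) for m in methods}

Z = linkage_matrices["average"]
print(Z.shape)         # (72, 4): n-1 merges, 4 columns
print(int(Z[-1, 3]))   # 73: the last merge contains every observation
```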
As shown below, the results depend on the linkage method used. Because the columns have very different ranges, all of them must be standardized before calculating the distances between data points. Otherwise the column with the largest scale (here, income) would dominate the distance.
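One common way to standardize is the z-score: subtract the column mean and divide by the column standard deviation. The sketch below assumes this approach via `scipy.stats.zscore`. The article does not state which scaler the notebook uses, and the four rows of values are invented for the example.

```python
import numpy as np
from scipy.stats import zscore

# Hypothetical rows: income, % with university education, unemployment rate.
X = np.array([
    [70000.0, 50.0, 3.5],
    [45000.0, 35.0, 6.0],
    [30000.0, 20.0, 8.5],
    [20000.0, 10.0, 13.0],
])

# Standardize each column independently (axis=0).
X_standard = zscore(X, axis=0)

print(X_standard.mean(axis=0).round(6))  # approximately 0 in every column
print(X_standard.std(axis=0).round(6))   # 1 in every column
```

After this step a difference of one standard deviation counts the same in every column, whatever its original unit.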
After we determine the linkage matrix, we can pass the results to the dendrogram function.
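A small sketch of the call, assuming `scipy.cluster.hierarchy.dendrogram` and a synthetic array in place of the real data. It passes `no_plot=True` so that the function only returns the computed layout, which is convenient for inspecting the leaf order without a plotting backend.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Synthetic stand-in for the standardized indicators (not the real data).
rng = np.random.default_rng(0)
X_standard = rng.normal(size=(20, 3))
Z = linkage(X_standard, method="average")

# With no_plot=True, dendrogram returns a dictionary describing the tree
# instead of drawing it.
tree = dendrogram(Z, no_plot=True)
leaf_order = tree["leaves"]  # row indices in left-to-right leaf order

print(leaf_order)
```

Dropping `no_plot=True` draws the tree on the current Matplotlib axes, which is how figures like the ones in this article are typically produced.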
As shown above, average and centroid linkage produce very similar partitions, in which 3 or 5 different clusters can be distinguished. We proceed with average linkage and examine the resulting clusters in more detail in the following sections.
4. Visualization of the hierarchy with a dendrogram
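Hierarchical clustering produces a tree-like structure called a dendrogram. The height of each node (its vertical position in the usual orientation, horizontal if the tree is drawn sideways) corresponds to the distance at which two clusters were merged, so nodes closer to the leaves were merged earlier.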
Cutting the dendrogram at a particular height splits the neighborhoods into disjoint clusters. In this case, we cut the dendrogram to form 5 groups. These clusters represent groups of neighborhoods that are similar to each other according to the socioeconomic indicators.
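The link between a cut height and the number of clusters can be read off the linkage matrix. The sketch below uses synthetic data and average linkage, and assumes SciPy's `cut_tree` with its `height` argument. Any height between the merge that leaves 5 clusters and the merge that leaves 4 gives 5 groups.

```python
import numpy as np
from scipy.cluster.hierarchy import cut_tree, linkage

# Synthetic stand-in for the standardized indicators (not the real data).
rng = np.random.default_rng(0)
X_standard = rng.normal(size=(73, 3))
Z = linkage(X_standard, method="average")

# Rows of Z are ordered by merge distance for average linkage.
# Z[-5] is the merge that leaves 5 clusters, Z[-4] the one that leaves 4.
height = (Z[-5, 2] + Z[-4, 2]) / 2

# cut_tree returns a column vector; ravel() flattens it to one label per row.
labels = cut_tree(Z, height=height).ravel()

print(len(np.unique(labels)))  # 5
```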
5. Visualization of the clusters on a choropleth map
After determining the linkage matrix and visualizing the results, we use the cut_tree function to form 5 disjoint clusters (n_clusters=5).
The results obtained are stored in a DataFrame called clusters_districs, where the column clusters holds the cluster labels and the column NOM holds the neighborhood names. As shown above, we have renamed the neighborhood el Poble Sec to match the name used in the JSON file.
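As a rough sketch of this step on synthetic data: the names below are placeholders, and the renaming uses a made-up spelling difference rather than the actual one in the JSON file. The DataFrame and column names follow the article.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import cut_tree, linkage

# Placeholder names and synthetic indicators (not the real data).
names = [f"neighborhood {i}" for i in range(10)]
rng = np.random.default_rng(0)
X_standard = rng.normal(size=(len(names), 3))
Z = linkage(X_standard, method="average")

# One cluster label (0 to 4) per neighborhood.
clusters_districs = pd.DataFrame({
    "NOM": names,
    "clusters": cut_tree(Z, n_clusters=5).ravel(),
})

# Hypothetical fix: align one name with the spelling used in the boundary file.
clusters_districs["NOM"] = clusters_districs["NOM"].replace(
    {"neighborhood 3": "neighborhood-3"}
)

print(clusters_districs.head())
```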
To build the map, we use the GeoPandas library. A choropleth map requires two main elements:
- A JSON file that contains the boundaries (polygons) of every district of Barcelona. The JSON file used in this article can be found in the following GitHub repository:
https://raw.githubusercontent.com/martgnz/bcn-geodata/master/barris/barris.json
- A data frame that contains the value we want to represent (in this case the cluster).
We read the JSON file as a GeoDataFrame using the geopandas.read_file function. After selecting the columns of interest (NOM and geometry), we add the cluster labels to the GeoDataFrame. Finally, we visualize the results using the plot method.
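These three steps might look roughly like the sketch below. The helper names `attach_clusters` and `plot_cluster_map` are hypothetical, as are the plotting options, and the notebook may do this differently. GeoPandas is imported inside the plotting function so that the merge helper can be tried on a plain DataFrame without GeoPandas installed.

```python
import pandas as pd

GEOJSON_URL = "https://raw.githubusercontent.com/martgnz/bcn-geodata/master/barris/barris.json"


def attach_clusters(geo, clusters_districs):
    """Add the cluster label to each polygon by matching on the NOM column."""
    merged = geo[["NOM", "geometry"]].merge(clusters_districs, on="NOM", how="left")
    # A name that differs between the two sources would get no label.
    unmatched = merged.loc[merged["clusters"].isna(), "NOM"].tolist()
    if unmatched:
        raise ValueError(f"No cluster label for: {unmatched}")
    return merged


def plot_cluster_map(clusters_districs, url=GEOJSON_URL):
    """Read the boundaries and draw one color per cluster."""
    import geopandas as gpd  # lazy import: requires GeoPandas and Matplotlib

    geo = gpd.read_file(url)
    merged = attach_clusters(geo, clusters_districs)
    return merged.plot(column="clusters", categorical=True, legend=True, figsize=(8, 8))


# Offline check of the merge logic with a plain DataFrame (no real geometry).
toy_geo = pd.DataFrame({"NOM": ["a", "b"], "geometry": [None, None], "extra": [1, 2]})
toy_clusters = pd.DataFrame({"NOM": ["a", "b"], "clusters": [0, 3]})
toy_merged = attach_clusters(toy_geo, toy_clusters)
print(toy_merged)
```

Keeping the GeoDataFrame on the left side of the merge preserves the geometry column, and the explicit check for unmatched names catches spelling differences that would otherwise leave blank areas on the map.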
The following choropleth map shows the location of each cluster (5 in total). The clusters represent the social classes in Barcelona, which are calculated according to the following socioeconomic indicators: (1) the rate of unemployment, (2) the income per household, and (3) the rate of citizens with university studies.
6. Mean of each socioeconomic indicator by cluster
The following table shows the mean of each socioeconomic indicator for the resulting clusters (five in total). By observing the values, we can make the following associations.
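A table like this can be computed with a pandas group-by. The sketch below uses invented values and assumed column names (`income`, `university`, `unemployment`). Only the `clusters` column name comes from the article.

```python
import pandas as pd

# Hypothetical indicators with a cluster label per neighborhood.
df = pd.DataFrame({
    "income": [60000.0, 70000.0, 30000.0, 34000.0, 20000.0],
    "university": [50.0, 54.0, 25.0, 27.0, 10.0],
    "unemployment": [3.0, 4.0, 7.0, 8.0, 13.0],
    "clusters": [0, 0, 1, 1, 2],
})

# Mean of every indicator within each cluster, on the original (unscaled) values.
cluster_means = df.groupby("clusters").mean().round(2)

print(cluster_means)
```

The means are taken on the original values rather than the standardized ones so that the table can be read in real units.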
High-class neighborhoods (cluster 3 — purple)
The upper-class neighborhoods in Barcelona are: (1) Pedralbes, (2) Sant Gervasi - Galvany, (3) Sant Gervasi - la Bonanova, (4) Sarrià, and (5) les Tres Torres. These neighborhoods are located in the inner part of the city. They present very low unemployment rates (3.35%), high education rates (51.52%), and high income per household (74793).
This income is far above the average for Spain and Catalonia, and it is even higher than the median income of countries such as Norway, Sweden, or Denmark (see https://en.wikipedia.org/wiki/Median_income).
High middle-class neighborhoods (cluster 4 — yellow)
The high middle-class neighborhoods are located in the inner part of the city and in the city center. Five neighborhoods belong to this cluster: (1) Vallvidrera, el Tibidabo i les Planes, (2) el Putxet i el Farró, (3) l’Antiga Esquerra de l’Eixample, (4) la Dreta de l’Eixample, and (5) la Vila Olímpica del Poblenou.
These neighborhoods show high education rates (all of them above 40%), low unemployment rates, and relatively high income.
Middle-class neighborhoods (cluster 1 — salmon)
As we expected, most neighborhoods belong to the middle-class cluster.
As shown below, the income of these neighborhoods ranges from 25485 to 50563. Unemployment is below 9% and the education rate is above 17% in every neighborhood of the cluster.
Low middle-class (cluster 0 — blue)
The low middle-class neighborhoods are located in the north and south parts of the city. Education levels are below 20% and unemployment is high (8.67% on average).
Low-class neighborhoods (cluster 2 — green)
The low-class neighborhoods are located near the port and on the inner north side of the city.
They present very high unemployment rates (above 12%), low education levels, and low incomes, as shown in the table below.
7. Why is identifying social classes so important from a business perspective?
From a business perspective, social classes are often measured as a combination of multiple attributes such as income, education, occupation, wealth, and other variables. Social classes can have a strong effect on spending habits. Upper-class individuals, for example, buy expensive jewelry and luxury cars, while those belonging to the lower class tend to live on a day-to-day basis, spending most of their income on food and shelter. Marketers should be aware of the social classes of their target market in order to develop a marketing strategy that suits the tastes and needs of those customers.
8. Project summary
In this project, we used unsupervised learning to build a hierarchical clustering of Barcelona neighborhoods according to three socioeconomic indicators: (1) income per household, (2) unemployment level, and (3) share of residents with university education. We distinguished 5 social levels and visualized them on a choropleth map. This division of neighborhoods can be useful from a business point of view for adapting to the needs and aspirations of customers.
Thanks for reading 💜
Amanda Iglesias