Compare & Contrast: Exploration of Datasets with Many Features
I just did a Kaggle competition with Gary Lu. We studied the Ames housing price dataset and built prediction models; our result landed in the top 9%.
Soon after getting started, I noticed that the dataset has many features (80 in total: 37 continuous and 43 categorical). Making sense of all of them paves a concrete path to feature engineering and model building. However, there are also too many ways to do that, and I was confused about the ‘right’ way for a while…
After some trial and error, I feel that though there’s no absolutely right way to do it, a ‘compare & contrast’ style of analysis and visualization serves the final goal well: quickly identify potential improvements to the prediction model.
Here I will elaborate on my practices for weaving the idea of ‘compare & contrast’ into the exploration:
1. Compare Data Type and Nulls between Train & Test Data
For datasets with many features, summarizing data types and null values is important. Instead of only exploring the training and test datasets separately, a comparative analysis can tell us more: Are the data types consistent across datasets? Are there more missing values in one dataset than the other? Below are the examples and code (Python):
# Data Type Comparison
import pandas as pd

# train_data / test_data: DataFrames loaded from the competition CSVs
s1 = train_data.dtypes
s2 = test_data.dtypes
s1_train = s1.drop('SalePrice')  # the test set has no SalePrice column
s1_train.compare(s2)
From the output above, we can easily see that 9 features have different data types across the training and test data. Does it matter? It needs more exploration. In our case it doesn’t, since both are numeric types, but it could be an issue if one were an object type.
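To make the mechanics concrete, here is a minimal, self-contained sketch of what `Series.compare` reports (toy frames with made-up column names, not the competition data):

```python
import pandas as pd

# Toy stand-ins for train/test; column 'b' is int in one and float in the other
train = pd.DataFrame({'a': [1, 2], 'b': [1, 2], 'target': [10.0, 20.0]})
test = pd.DataFrame({'a': [3, 4], 'b': [1.5, 2.5]})

# Only the mismatching rows survive: 'self' = train dtype, 'other' = test dtype
diff = train.dtypes.drop('target').compare(test.dtypes)
print(diff)
```

Matching columns are dropped from the result, so with 80 features we only have to read the handful of rows that actually differ.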
# Null Value Comparison
null_train = train_data.isnull().sum()
null_test = test_data.isnull().sum()
null_train = null_train.drop('SalePrice')
# Sort so the features with the most nulls come first
null_train.compare(null_test).sort_values(['self', 'other'], ascending=[False, False])
Here we can see that null values are distributed fairly evenly across the two datasets, which assures us that we can handle them in similar ways for both.
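One common way to handle them in the same way for both (a sketch, under the assumption that fill statistics should come from the training set only, with toy values):

```python
import pandas as pd

# Toy data; real feature names, made-up values
train = pd.DataFrame({'LotFrontage': [60.0, None, 80.0], 'MSZoning': ['RL', 'RL', None]})
test = pd.DataFrame({'LotFrontage': [None, 70.0], 'MSZoning': ['RM', None]})

# Derive fill values from train only, then apply them to both datasets
fills = {'LotFrontage': train['LotFrontage'].median(),  # median for numeric
         'MSZoning': train['MSZoning'].mode()[0]}       # mode for categorical
train_filled = train.fillna(fills)
test_filled = test.fillna(fills)
```

Computing the fill values on the training set alone avoids leaking test-set information into preprocessing.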
2. Contrast Distributions of Train & Test Data
The distribution of features is a key consideration for feature engineering; we always want to answer questions such as:
Are the continuous features normally distributed? Is any categorical feature dominated by a single category that overshadows the others? Again, instead of studying these questions for the train and test data separately, we can contrast them.
For continuous features, we can apply Seaborn histograms as below:
# Distribution Comparison - Continuous Variables
import matplotlib.pyplot as plt
import seaborn as sns

# combined_data: train & test stacked, with a 'Label' column marking each row's source
combined_data = pd.concat([train_data.assign(Label='train'), test_data.assign(Label='test')])
con_var = s1[s1.values != 'object'].index
f, axes = plt.subplots(7, 6, figsize=(30, 30), sharex=False)
for i, feature in enumerate(con_var):
    sns.histplot(data=combined_data, x=feature, hue="Label", ax=axes[i % 7, i // 7])
The distributions above show that:
- The distributions of train and test data are similar for most features;
- Some features should be reclassified as categorical, such as ‘MSSubClass’;
- Some features are dominated by 0/null (e.g. ‘PoolArea’), so we can consider dropping them.
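The two follow-ups above might look like this in code (a sketch with toy rows; the 95% dominance threshold is my own arbitrary cutoff, not from the original analysis):

```python
import pandas as pd

# Toy rows with the real column names
df = pd.DataFrame({'MSSubClass': [20, 60, 20, 180],
                   'PoolArea': [0, 0, 0, 0],
                   'LotArea': [8450, 9600, 11250, 4000]})

# Reclassify: treat the numeric class code as a category, not a quantity
df['MSSubClass'] = df['MSSubClass'].astype(str)

# Drop features dominated by a single value (here: all zeros)
dominance = (df['PoolArea'] == 0).mean()
if dominance > 0.95:
    df = df.drop(columns=['PoolArea'])
```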
For categorical features, we can apply Seaborn countplot as below:
# Distribution Comparison - Categorical Variables
cat_var = s1[s1.values == 'object'].index
f, axes = plt.subplots(7, 7, figsize=(30, 30), sharex=False)
for i, feature in enumerate(cat_var):
    sns.countplot(data=combined_data, x=feature, hue="Label", ax=axes[i % 7, i // 7])
The comparison of the categorical variables shows that:
- Train and test data distributions are similar for most features;
- Some features have dominant items; we can consider combining the minor items into one group, such as ‘Fa’ & ‘Po’ in ‘HeatingQC’, ‘FireplaceQu’, ‘GarageQual’ and ‘GarageCond’.
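A quick way to surface such minor items programmatically (a sketch with toy ratings; the 15% frequency cutoff is a hypothetical choice):

```python
import pandas as pd

# Toy quality ratings echoing the levels mentioned above
s = pd.Series(['Ex', 'Ex', 'Gd', 'Ex', 'Fa', 'Po', 'Ex', 'Gd'])

freq = s.value_counts(normalize=True)
minor = freq[freq < 0.15].index.tolist()  # candidates to merge into one group
```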
3. Compare Linearity between Response Variable and Continuous Features
Linearity affects the quality of our model. Some features may have a positive/negative relationship with the response variable that is not linear; in such cases, a transformation may help improve the model.
f, axes = plt.subplots(7, 6, figsize=(30, 30), sharex=False)
for i, feature in enumerate(con_var):
    sns.scatterplot(data=train_data, x=feature, y="SalePrice", ax=axes[i % 7, i // 7])
Here, we can see that some relations seem positive but not quite linear: ‘SalePrice’ vs. ‘BsmtUnfSF’, ‘SalePrice’ vs. ‘LotFrontage’, etc. So we will consider transforming such features into log form to improve linearity.
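The transformation itself can be sketched as below (using `np.log1p`, i.e. log(1 + x), which maps 0 to 0 and so avoids the undefined log(0); values here are toy data):

```python
import numpy as np
import pandas as pd

# Toy values for two of the features flagged above
df = pd.DataFrame({'BsmtUnfSF': [0, 150, 434], 'LotFrontage': [60.0, 68.0, 80.0]})

for col in ['BsmtUnfSF', 'LotFrontage']:
    df[col] = np.log1p(df[col])  # log(1 + x), safe for zero values
```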
4. Contrast Items of Categorical Features by the Response Variable
In step 2, we identified the categorical features whose items we want to combine. Here we confirm that those items have similar prices (the response variable), so that regrouping does not merge items with very different response behavior.
f, axes = plt.subplots(7, 7, figsize=(30, 30), sharex=False)
for i, feature in enumerate(cat_var):
    # Order the boxes by descending median SalePrice
    sort_list = sorted(train_data.groupby(feature)['SalePrice'].median().items(), key=lambda x: x[1], reverse=True)
    order_list = [x[0] for x in sort_list]
    sns.boxplot(data=train_data, x=feature, y='SalePrice', order=order_list, ax=axes[i % 7, i // 7])
plt.show()
Here, we can see that the sale prices for ‘Fa’ & ‘Po’ in ‘HeatingQC’, ‘FireplaceQu’, ‘GarageQual’ and ‘GarageCond’ are similar, so we may go ahead and combine those items.
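The merge step can then be sketched as below (only ‘HeatingQC’ shown, on toy rows; in the real data the same mapping would apply to ‘FireplaceQu’, ‘GarageQual’ and ‘GarageCond’, and the merged label name is my own choice):

```python
import pandas as pd

df = pd.DataFrame({'HeatingQC': ['Ex', 'Fa', 'Po', 'Gd', 'TA']})

# Merge the two minor levels with similar median prices into one group
df['HeatingQC'] = df['HeatingQC'].replace({'Fa': 'Fa/Po', 'Po': 'Fa/Po'})
```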
Summary
With the four major ‘compare & contrast’ procedures above, I quickly spotted areas to fix in feature engineering. When conducting exploratory data analysis on a dataset with many features, it is important to think critically: instead of exhibiting all the details, we should always focus on quickly identifying potential improvements to the prediction model. In this sense, the ‘compare & contrast’ method helps us stay on the right track.
Relevant links from Gary Lu:
https://www.kaggle.com/garylucn/house-price
https://github.com/glucn/kaggle/blob/main/House_Prices/notebook/house-price.ipynb