Compare & Contrast: Exploration of Datasets with Many Features
I just did a Kaggle competition with Gary Lu. We studied the Ames housing price dataset and built prediction models; our result landed in the top 9%.
Soon after getting started, I noticed that the dataset has many features (80 in total: 37 continuous and 43 categorical). Making sense of all of them paves a concrete path to feature engineering and model building. However, there are also too many ways to do that, and I was confused about the ‘right’ way for a while…
After some trial and error, I feel that though there’s no absolutely right way to do it, a ‘compare & contrast’ style of analysis and visualization serves the final goal well: quickly identify potential improvements to the prediction model.
Here I will elaborate on my practices for weaving the idea of ‘compare & contrast’ into the exploration:
1. Compare Data Type and Nulls between Train & Test Data
For datasets with many features, summarizing data types and null values is important. Instead of only exploring the training and test datasets separately, a comparative analysis can tell us more: Are the data types consistent across datasets? Are there more missing values in one dataset than the other? Below are the examples and code (Python):
# Data Type Comparison
import pandas as pd

# train_data / test_data: DataFrames loaded from the competition CSVs
s1 = train_data.dtypes
s2 = test_data.dtypes
s1_train = s1.drop('SalePrice')  # the test set has no SalePrice column
s1_train.compare(s2)
From the output above, we can easily see that 9 features have different data types across the training and test data. Does it matter? It needs more exploration. In our case it doesn’t, since both are numeric types, but it could be an issue if one were an object type.
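To make the mechanics concrete, here is a minimal, self-contained sketch of what `Series.compare` reports (toy frames with made-up column names, not the competition data):

```python
import pandas as pd

# Toy stand-ins for train/test; column 'b' is int in one and float in the other
train = pd.DataFrame({'a': [1, 2], 'b': [1, 2], 'target': [10.0, 20.0]})
test = pd.DataFrame({'a': [3, 4], 'b': [1.5, 2.5]})

# Only the mismatching rows survive: 'self' = train dtype, 'other' = test dtype
diff = train.dtypes.drop('target').compare(test.dtypes)
print(diff)
```

Matching columns are dropped from the result, so with 80 features we only have to read the handful of rows that actually differ.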
# Null Value Comparison
null_train = train_data.isnull().sum()
null_test = test_data.isnull().sum()
null_train = null_train.drop('SalePrice')
# Sort so the features with the most nulls come first
null_train.compare(null_test).sort_values(['self', 'other'], ascending=[False, False])
Here we can see that null values are distributed fairly evenly across the two datasets, which assures us that we can handle them in similar ways for both.
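One common way to handle them in the same way for both (a sketch, under the assumption that fill statistics should come from the training set only, with toy values):

```python
import pandas as pd

# Toy data; real feature names, made-up values
train = pd.DataFrame({'LotFrontage': [60.0, None, 80.0], 'MSZoning': ['RL', 'RL', None]})
test = pd.DataFrame({'LotFrontage': [None, 70.0], 'MSZoning': ['RM', None]})

# Derive fill values from train only, then apply them to both datasets
fills = {'LotFrontage': train['LotFrontage'].median(),  # median for numeric
         'MSZoning': train['MSZoning'].mode()[0]}       # mode for categorical
train_filled = train.fillna(fills)
test_filled = test.fillna(fills)
```

Computing the fill values on the training set alone avoids leaking test-set information into preprocessing.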
2. Contrast Distributions of Train & Test Data
The distribution of features is a key consideration for feature engineering; we always want to answer questions such as:
Are the continuous features normally distributed? Is any categorical feature dominated by a single category that overshadows the others? Again, instead of studying these questions for the train and test data separately, we can contrast them.
For continuous features, we can apply Seaborn histograms as below:
# Distribution Comparison - Continuous Variables
import matplotlib.pyplot as plt
import seaborn as sns

# combined_data: train & test stacked, with a 'Label' column marking each row's source
combined_data = pd.concat([train_data.assign(Label='train'), test_data.assign(Label='test')])
con_var = s1[s1.values != 'object'].index
f, axes = plt.subplots(7, 6, figsize=(30, 30), sharex=False)
for i, feature in enumerate(con_var):
    sns.histplot(data=combined_data, x=feature, hue="Label", ax=axes[i % 7, i // 7])
The distributions above show that:
- The distributions of train and test data are similar for most features;
- Some features should be reclassified as categorical, such as ‘MSSubClass’;
- Some features are dominated by 0/null (e.g. ‘PoolArea’), so we can consider dropping them.
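The two follow-ups above might look like this in code (a sketch with toy rows; the 95% dominance threshold is my own arbitrary cutoff, not from the original analysis):

```python
import pandas as pd

# Toy rows with the real column names
df = pd.DataFrame({'MSSubClass': [20, 60, 20, 180],
                   'PoolArea': [0, 0, 0, 0],
                   'LotArea': [8450, 9600, 11250, 4000]})

# Reclassify: treat the numeric class code as a category, not a quantity
df['MSSubClass'] = df['MSSubClass'].astype(str)

# Drop features dominated by a single value (here: all zeros)
dominance = (df['PoolArea'] == 0).mean()
if dominance > 0.95:
    df = df.drop(columns=['PoolArea'])
```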
For categorical features, we can apply Seaborn countplot as below:
# Distribution Comparison - Categorical Variables
cat_var = s1[s1.values == 'object'].index
f, axes = plt.subplots(7, 7, figsize=(30, 30), sharex=False)
for i, feature in enumerate(cat_var):
    sns.countplot(data=combined_data, x=feature, hue="Label", ax=axes[i % 7, i // 7])
The comparison of the categorical variables shows that:
- Train and test data distributions are similar for most features;
- Some features have dominant items; we can consider combining the minor items into one group, such as ‘Fa’ & ‘Po’ in ‘HeatingQC’, ‘FireplaceQu’, ‘GarageQual’ and ‘GarageCond’.
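A quick way to surface such minor items programmatically (a sketch with toy ratings; the 15% frequency cutoff is a hypothetical choice):

```python
import pandas as pd

# Toy quality ratings echoing the levels mentioned above
s = pd.Series(['Ex', 'Ex', 'Gd', 'Ex', 'Fa', 'Po', 'Ex', 'Gd'])

freq = s.value_counts(normalize=True)
minor = freq[freq < 0.15].index.tolist()  # candidates to merge into one group
```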
3. Compare Linearity between Response Variable and Continuous Features
Linearity affects the quality of our model. Some features may have a positive/negative relationship with the response variable that is not linear; in such cases, a transformation may help improve the model.
f, axes = plt.subplots(7, 6, figsize=(30, 30), sharex=False)
for i, feature in enumerate(con_var):
    sns.scatterplot(data=train_data, x=feature, y="SalePrice", ax=axes[i % 7, i // 7])
Here, we can see that some relations seem positive but not quite linear: ‘SalePrice’ vs. ‘BsmtUnfSF’, ‘SalePrice’ vs. ‘LotFrontage’, etc. So we will consider transforming such features into log form to improve linearity.
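The transformation itself can be sketched as below (using `np.log1p`, i.e. log(1 + x), which maps 0 to 0 and so avoids the undefined log(0); values here are toy data):

```python
import numpy as np
import pandas as pd

# Toy values for two of the features flagged above
df = pd.DataFrame({'BsmtUnfSF': [0, 150, 434], 'LotFrontage': [60.0, 68.0, 80.0]})

for col in ['BsmtUnfSF', 'LotFrontage']:
    df[col] = np.log1p(df[col])  # log(1 + x), safe for zero values
```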
4. Contrast Items of Categorical Features by the Response Variable
In step 2, we identified the categorical features whose items we want to combine. Here we confirm that those items have similar prices (the response variable), so that regrouping does not merge items with very different response behavior.
f, axes = plt.subplots(7, 7, figsize=(30, 30), sharex=False)
for i, feature in enumerate(cat_var):
    # Order the boxes by descending median SalePrice
    sort_list = sorted(train_data.groupby(feature)['SalePrice'].median().items(), key=lambda x: x[1], reverse=True)
    order_list = [x[0] for x in sort_list]
    sns.boxplot(data=train_data, x=feature, y='SalePrice', order=order_list, ax=axes[i % 7, i // 7])
plt.show()
Here, we can see that the sale prices for ‘Fa’ & ‘Po’ in ‘HeatingQC’, ‘FireplaceQu’, ‘GarageQual’ and ‘GarageCond’ are similar, so we may go ahead and combine those items.
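The merge step can then be sketched as below (only ‘HeatingQC’ shown, on toy rows; in the real data the same mapping would apply to ‘FireplaceQu’, ‘GarageQual’ and ‘GarageCond’, and the merged label name is my own choice):

```python
import pandas as pd

df = pd.DataFrame({'HeatingQC': ['Ex', 'Fa', 'Po', 'Gd', 'TA']})

# Merge the two minor levels with similar median prices into one group
df['HeatingQC'] = df['HeatingQC'].replace({'Fa': 'Fa/Po', 'Po': 'Fa/Po'})
```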
Summary
With the four major ‘compare & contrast’ procedures above, I quickly spotted areas to fix in feature engineering. When conducting exploratory data analysis on a dataset with many features, it is important to think critically: instead of exhibiting all the details, we should always focus on quickly identifying potential improvements to the prediction model. In this sense, the ‘compare & contrast’ method helps us stay on the right track.
Relevant links from Gary Lu:
https://www.kaggle.com/garylucn/house-price
https://github.com/glucn/kaggle/blob/main/House_Prices/notebook/house-price.ipynb