Data science stylistics

Stylistic differences between R and Python in setting up the dataframe pre-modelling

Nicola Giordano · Level Up Coding · Mar 16, 2020 · 8 min read

Having prepared the dataframe and explored some key relationships between variables in visual form, the next phase entails a number of critical steps that are necessary before modelling the data: 1) partitioning the data, 2) validating the data partition, 3) balancing the data and 4) establishing baseline model performance. Together, these tasks prepare us to model the data in an optimal and more accurate way.

1. Meaning of partitioning in data science

The proposed methodology illustrated in this series of blogs does not rely on statistical inference as a means to generalise from a sample to a population. The reasons to avoid this practice in the proposed approach are twofold: 1) statistical significance might not correspond to practical significance, nor be corroborated by theoretical underpinnings; 2) setting out with an a priori hypothesis might lead to confirmation bias instead of searching through the data for actionable results.

In the absence of an a priori hypothesis, partitioning the data reduces the risk of mistaking random variation for real effects through data dredging. This term describes a misuse of data analysis that finds patterns by performing many statistical tests and reporting only those that come back significant, which inflates the number of false positives while understating their risk. This risk is real when data mining is used to uncover patterns without first devising a specific hypothesis intended to explore an underlying causality.

Cross-validation is a concrete way to ensure results generalise to an independent, unseen dataframe. In a twofold cross-validation, the data is randomly partitioned into two separate spaces: a training and a test dataset. The target variable is temporarily removed from the test dataset while the model learns patterns and trends from the training dataset. The model is then applied to the test set, and the predicted values are evaluated against the original target values through the overall error rate. The size of each partition varies with the complexity of the dataframe; typically 50% to 75% of records go to the training dataset.
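
As a minimal illustration of this loop — split, fit, predict, score — the sketch below uses scikit-learn on synthetic data; the columns, model choice and seed are illustrative assumptions rather than part of the blog's running example.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data: two numeric predictors, one categorical target
rng = np.random.RandomState(8)
X = pd.DataFrame(rng.randint(0, 1000, size=(1000, 2)), columns=['A', 'B'])
y = rng.choice(['Credit', 'Debit', 'Zero'], size=len(X))

# Twofold cross-validation: learn on the training partition only...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=8)
model = DecisionTreeClassifier(random_state=8).fit(X_train, y_train)

# ...then evaluate the overall error rate on the unseen test partition
error_rate = (model.predict(X_test) != y_test).mean()
print(error_rate)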

2. Partitioning the data

In Python, creating a random dataset as illustrated in the snippet below, in line with the previous examples, lets the reader apply the code more easily. Unlike previous examples, df is now larger.

import numpy as np
import pandas as pd

# Synthetic dataframe: three numeric columns plus two categorical ones
df = pd.DataFrame(np.random.randint(0, 1000, size=(1000, 3)), columns=['A', 'B', 'C'])
Place = ['London', 'Delhi', 'Rome']
df["Place"] = np.random.choice(Place, size=len(df))
Balance = ["Credit", "Debit", "Zero"]
df["Balance"] = np.random.choice(Balance, size=len(df))

In Python, the starting point for partitioning the data is to import the relevant libraries and use the train_test_split command, specifying the size of the test dataset and a random_state input that seeds the random generator so that the train-test split is deterministic. The same seed must be reused to replicate the results on subsequent run-throughs.

from sklearn.model_selection import train_test_split

# 75/25 split; random_state seeds the generator so the split is reproducible
df_train, df_test = train_test_split(df, test_size=0.25, random_state=8)
df.shape, df_train.shape, df_test.shape
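
Worth noting: train_test_split can also preserve the class proportions of a categorical column across both partitions via its stratify parameter, which is relevant to the balancing step discussed later. A brief sketch, shown as an alternative rather than a replacement for the split above:

# Stratified variant: both partitions keep the relative frequencies of Balance
df_train, df_test = train_test_split(df, test_size=0.25, random_state=8, stratify=df['Balance'])
df_train['Balance'].value_counts(normalize=True)   # proportions mirror the full df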

In R, the process is quite similar. After defining a larger dataset and establishing the key variables as suggested in previous blogs, the partition starts from the same idea.

df <- data.frame(replicate(3, sample(0:1000, 1000, rep = TRUE)))
colnames(df) <- c("A", "B", "C")
df$Place <- sample(c("London", "Delhi", "Rome"), size = nrow(df), prob = c(0.76, 0.14, 0.1), replace = TRUE)
df$Balance <- sample(c("Credit", "Debit", "Zero"), size = nrow(df), prob = c(0.70, 0.1, 0.45), replace = TRUE)

The starting point of the partition in R is to set the seed, which again must be kept constant across run-throughs to get the same results. The partition is then carried out by taking the length of the dataframe and using runif() to draw one number per row uniformly at random between zero and one. Comparing each draw against the threshold yields TRUE or FALSE for every row, and subsetting the dataframe on this boolean condition gathers the records into the two partitions.

set.seed(8)
n <- dim(df)[1]               # number of rows
train_df <- runif(n) < 0.75   # TRUE for roughly 75% of rows
df_train <- df[train_df, ]
df_test <- df[!train_df, ]
dim(df_test); dim(df_train)
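
For comparison, the same boolean-mask idiom translates almost line for line into Python with NumPy. A sketch reusing the Python df defined above:

import numpy as np

np.random.seed(8)
mask = np.random.rand(len(df)) < 0.75   # True for roughly 75% of rows, as with runif()
df_train = df[mask]
df_test = df[~mask]
df_test.shape, df_train.shape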

3. Validating the partition

Validating the partition requires an additional set of checks to ensure the two datasets do not deviate too much from one another. A sample of variables should undergo further testing for this purpose. Depending on the variable type, different statistical tests are appropriate: a two-sample t-test for a difference in means for numerical variables, or a chi-square test (stats.chi2_contingency) to verify whether the distributions of categorical variables are the same.

In Python, the t-test on the generated data samples prints the statistic and p-value for a numerical variable (B) drawn from each sample. Here the p-value is above 0.05, so at the 95% confidence level we fail to reject H0: there is no evidence that the means of the two distributions differ.

from scipy.stats import ttest_ind

# Welch's two-sample t-test on column B across the two partitions
ttest_ind(df_train['B'], df_test['B'], equal_var=False)

The chi-square goodness-of-fit test is an analogue of the two-sample t-test for categorical variables in Python: it tests whether the distribution of sample categorical data matches an expected distribution. The command stats.chi2_contingency computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table TrainPlace_array. If the p-value is above the 5% threshold (as in this case), we cannot reject the null hypothesis of independence: the distribution of Place does not depend on which partition a record landed in, i.e. the proportions in Place_train and Place_test are statistically indistinguishable. This is precisely what we want from a partition; a significant association between Place and the train/test split would signal an unbalanced partition.

from scipy import stats

Place_train = pd.crosstab(index=df_train['Place'], columns="count")
Place_test = pd.crosstab(index=df_test['Place'], columns="count")
Train_array = np.asarray(Place_train)
Place_array = np.asarray(Place_test)
# Stack the two frequency vectors into a 2 x 3 contingency table
TrainPlace_array = np.array([Train_array.ravel(), Place_array.ravel()])
stats.chi2_contingency(TrainPlace_array)
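
An equivalent and slightly more compact route, for what it's worth, is to tag each row with its partition and let pd.crosstab assemble the contingency table in a single call. A sketch:

# Same test, building the contingency table in one step
combined = pd.concat([df_train.assign(split='train'), df_test.assign(split='test')])
stats.chi2_contingency(pd.crosstab(combined['split'], combined['Place']))
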
In R, the same two-sample t-test takes a single line (note the use of $ to extract the numeric vector):

t.test(df_train$A, df_test$A, var.equal = TRUE)

To measure how close the proportions are between the two partitions, the R command prop.test can be used to test the null hypothesis that the proportions (probabilities of success) in several groups are the same, or that they equal certain given values. In the output, prop 1, prop 2 and prop 3 are the estimated proportions across the 3 groups of the categorical attribute within the two samples. The p-value is above 5%, which indicates that the proportions of the characteristic studied (Place) are not significantly different between the two datasets (train, test).

t <- table(df_train$Place)
t1 <- table(df_test$Place)
matrix_frames <- merge(t, t1, by = "Var1")   # align counts by category
matrix_frames <- matrix_frames[, -1]         # drop the category labels
matrix_frames <- data.matrix(matrix_frames, rownames.force = NA)
prop.test(matrix_frames)

4. Balancing the training dataset

In some classification models, one of the target variable's classes has a much lower relative frequency than the others. As explored before, checking the frequency distribution of relevant variables helps identify the ones that might need rebalancing. Only the training dataset should be rebalanced; the test dataset must be left untouched, as it needs to remain a faithful representation of real-world evidence.

The first step to rebalance the dataframe is to observe the value counts of the dataset to determine which categorical attribute to adjust. In Python this is done through the value_counts function, whose output can also be divided by the length of the dataframe to obtain the relative frequency of each class.

df_train['Balance'].value_counts(), df_train['Balance'].value_counts()/len(df_train)

From this initial overview, we need to establish to what extent to resample the selected variable in the dataframe. In this case, I decided to increase the proportion of Credit records from 32% to 35%. The number of records x to add follows from requiring (current class count + x) / (training-set size + x) = target proportion, which solves to x = (target % × training-set size − current class count) / (1 − target %). Here x = (0.35 × 750 − 241) / 0.65 ≈ 33 credit records to add to the training dataset, and the commands to_resample.sample and pd.concat then lead to a rebalanced training dataset with 35% of credit records.

x = (0.35*750 - 241)/0.65            # records to add: ~33
to_resample = df_train.loc[df_train['Balance'] == 'Credit']
our_resample = to_resample.sample(n=33, replace=True)
df_train_rebalance = pd.concat([df_train, our_resample])
df_train_rebalance['Balance'].value_counts(), df_train_rebalance['Balance'].value_counts()/len(df_train_rebalance)
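
The arithmetic generalises into a small helper, so the target proportion can be changed without redoing the sums by hand. The function below is a hypothetical convenience of mine, not part of the original workflow:

def records_to_add(target_prop, n_train, class_count):
    # Solve (class_count + x) / (n_train + x) = target_prop for x
    return round((target_prop * n_train - class_count) / (1 - target_prop))

records_to_add(0.35, 750, 241)   # ~33 extra Credit records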

In R, the starting point for rebalancing is similar to the previous example: we explore, in tabular form, the frequencies in absolute and relative terms.

table(df_train$Balance); table(df_train$Balance)/nrow(df_train)

Rebalancing the dataframe in R likewise requires selecting a categorical attribute (Debit) and a size value that reflects the target proportion in that distribution (in this case from roughly 8% to 30%). The rebalanced dataframe is created through the commands sample and rbind, whilst the table to show the results is more code-heavy and entails the specification of each label.

to.resample <- which(df_train$Balance == 'Debit')
x <- (0.3*750 - 63)/0.7                 # records to add: ~231
our.resample <- sample(x = to.resample, size = 231, replace = TRUE)
our.resample <- df_train[our.resample, ]
df_train_rebalanced <- rbind(df_train, our.resample)
t.v1 <- table(df_train_rebalanced$Balance)
t.v2 <- rbind(t.v1, round(prop.table(t.v1), 4))
colnames(t.v2) <- c('Credit', 'Debit', 'Zero')
rownames(t.v2) <- c('Count', 'Proportion')
t.v2

Having reached this stage, before evaluating model performance we should calibrate the results against some baseline model, chosen according to the accuracy level necessary for the specific purpose at hand. For a regression, comparing the estimates against the mean response is not enough in many cases. It becomes necessary to acknowledge the wealth of information residing in the predictors and what a subject-matter expert considers an optimal prediction error to be. This is why a strong data science methodology also requires relevant domain expertise.
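
As a concrete point of reference, scikit-learn ships naive baseline estimators — a majority-class classifier and a mean-response regressor — that can serve as this calibration floor. A minimal sketch, assuming the rebalanced training set from above:

from sklearn.dummy import DummyClassifier

# Naive baseline: always predict the most frequent Balance class;
# any real model should beat this accuracy to justify its complexity
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(df_train_rebalance[['A', 'B', 'C']], df_train_rebalance['Balance'])
baseline.score(df_test[['A', 'B', 'C']], df_test['Balance'])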

After this, the next phase of the proposed data science methodology is the modelling phase. This is explained in the next blog!

Cultivating an interest in applying computational analysis to international humanitarian work and social sciences.