(Week 4) - Balance Dataset
CONTENTS
3.1 Balancing Datasets
    3.1.1 Undersampling
        3.1.1.1 NearMiss
    3.1.2 Oversampling
        3.1.2.1 Random oversampling
        3.1.2.2 Resampling Methods, SMOTE
3.2 Train-test split and cross-validation
    3.2.1 K-Fold cross-validation
    3.2.2 Stratified K-Fold cross-validation
3.3 Feature Engineering
    3.3.1 Handling Missing data in datasets
        3.3.1.1 Removing Missing values
        3.3.1.2 Univariate imputation
        3.3.1.3 Multivariate imputation
    3.3.2 Handling Categorical data
    3.3.3 Outliers and Anomalies
        3.3.3.1 Techniques for outlier detection
        3.3.3.2 Methods for treating outliers
    3.3.4 Feature Scaling
        3.3.4.1 Standardization
        3.3.4.2 Normalization
Splitting the data into training and testing sets is discussed in section 3.2. Once the dataset is balanced and split for training and testing purposes, the next stage is to fix discrepancies such as missing data and differing scales across columns. The entire process of making data suitable for an ML algorithm is known as feature engineering and is presented in section 3.3. First, feature engineering deals with missing data, i.e., blank entries where values are expected; these missing entries can have different sources, as discussed in section 3.3.1. Second, another major problem in data is outliers: samples that deviate from the overall trend of the remaining data. Outliers impede the correct learning of ML algorithms, since even a few of them can shift the entire learning curve of the model; outliers are discussed in further detail in section 3.3.3. Third, the features in the data may have different scales: for example, the age of an employee may lie between 20 and 100, whereas the salary is typically in the range of thousands. Such a problem is fixed by scaling the features, as discussed further in section 3.3.4.
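As a brief preview of feature scaling (treated fully in section 3.3.4), the sketch below standardizes a toy age/salary table with scikit-learn's StandardScaler; the values are made up purely for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy table: column 0 is age in years, column 1 is yearly salary (made-up values)
X = np.array([[25.0, 30000.0],
              [40.0, 52000.0],
              [58.0, 87000.0],
              [63.0, 120000.0]])

# StandardScaler rescales each column to zero mean and unit variance,
# so age and salary end up on comparable scales
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)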
3.1.1 Undersampling
Undersampling methods reduce the number of majority class samples so that the minority and majority classes end up with an equal number of samples. This only works if the dataset is very large and throwing away majority class samples does not affect the learning process or the patterns that can be learned from the data. The simplest technique is random undersampling, in which samples from the majority class are discarded at random (a minimal sketch follows this paragraph). However, this does not take into account whether the discarded samples affect the class distributions or not. A more appropriate algorithm is to discard samples that are either redundant or have the least possible effect on the learning process.
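As a sketch of random undersampling, the snippet below uses imblearn's RandomUnderSampler on a small synthetic imbalanced dataset; the dataset construction and parameter values are assumptions made purely for illustration.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Build a small synthetic dataset with a roughly 9:1 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

# Randomly discard majority class samples until both classes have the same size
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print("After:", Counter(y_res))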
3.1.1.1 NearMiss
The NearMiss method is an undersampling technique that measures the distance of majority class samples to minority class samples [4]. There are three variants of NearMiss. In the first, NearMiss-1, the average distance of each majority class sample to its three closest minority class samples is computed; the majority class samples with the smallest average distance are kept and the rest are discarded. In the second, NearMiss-2, the majority class samples with the smallest average distance to the farthest samples of the minority class are chosen. Finally, the third variant, NearMiss-3, computes the distance of each majority class sample to the minority class samples and discards the majority samples with the smallest average distance, keeping those farthest from the minority class. Different distance metrics can be used in all three cases; here we employ the well-known Euclidean distance. For the sake of this text, we discuss NearMiss-3 only and refer to it as the NearMiss algorithm. The NearMiss algorithm calculates the distance of each majority class sample to each sample in the minority class and throws away the majority samples with the closest average distance to the minority class. The premise of this algorithm is that the most eligible points to discard are those closest to the minority class, since they convey the least unique information. A summary of the algorithm is given in Algorithm 1.
Below is the code for the NearMiss algorithm: lines 6 to 9 load the data, line 12 splits the data into training and testing sets, line 15 creates an instance of NearMiss, and line 18 applies the NearMiss algorithm to the training data.
1 from imblearn.under_sampling import NearMiss
2 from imblearn.datasets import fetch_datasets
3 from sklearn.model_selection import train_test_split
4
5 # Load an imbalanced dataset as an example
6 data = fetch_datasets()['satimage']
7
8 X = data.data
9 y = data.target
10
11 # Split the data into training and testing sets
12 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
13
14 # Initialize the NearMiss algorithm (version=3 selects the NearMiss-3 variant)
15 near_miss = NearMiss(version=3)
16
17 # Undersample the majority class
18 X_resampled, y_resampled = near_miss.fit_resample(X_train, y_train)
Code Listing 3.1
NearMiss algorithm
3.1.2 Oversampling
Oversampling increases the number of minority class samples so that it matches the majority class. This can be done in two ways. In the first technique, random oversampling (section 3.1.2.1), minority samples are randomly picked from the existing samples and copied multiple times until the minority class reaches the size of the majority class. In SMOTE (Synthetic Minority Over-sampling Technique), discussed in section 3.1.2.2, new samples are generated by interpolating between existing minority samples and their nearest neighbors; a sketch of both approaches follows.
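As a minimal sketch of both oversampling approaches, the snippet below applies imblearn's RandomOverSampler and SMOTE to a small synthetic imbalanced dataset; the dataset construction and parameter values are assumptions made purely for illustration.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Small synthetic dataset with a roughly 9:1 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Original:", Counter(y))

# Random oversampling: duplicate existing minority samples at random
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X, y)
print("Random oversampling:", Counter(y_ros))

# SMOTE: synthesize new minority samples by interpolating between a minority
# sample and one of its k nearest minority class neighbors
smote = SMOTE(k_neighbors=5, random_state=42)
X_sm, y_sm = smote.fit_resample(X, y)
print("SMOTE:", Counter(y_sm))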
FIGURE 3.1
SMOTE (Synthetic Minority Over-sampling Technique) [figure: two scatter-plot panels with X and Y axes]