
3

Data Exploration and Preprocessing

CONTENTS
3.1 Balancing Datasets
    3.1.1 Undersampling
        3.1.1.1 NearMiss
    3.1.2 Oversampling
        3.1.2.1 Random oversampling
        3.1.2.2 Resampling Methods, SMOTE
3.2 Train-test split and cross-validation
    3.2.1 K-Fold cross-validation
    3.2.2 Stratified K-Fold cross-validation
3.3 Feature Engineering
    3.3.1 Handling missing data in datasets
        3.3.1.1 Removing missing values
        3.3.1.2 Univariate imputation
        3.3.1.3 Multivariate imputation
    3.3.2 Handling categorical data
    3.3.3 Outliers and Anomalies
        3.3.3.1 Techniques for outlier detection
        3.3.3.2 Methods for treating outliers
    3.3.4 Feature Scaling
        3.3.4.1 Standardization
        3.3.4.2 Normalization

Data, whether labeled or unlabeled, is of primary importance for ML, since all algorithms learn from data. This means that the data has a significant impact on the output of the ML algorithm: the better the data, the better we can train the algorithm. Conversely, no matter how good an ML algorithm is, when fed with low-quality data it will perform poorly. Unfortunately, the way data is produced is not controllable, since it comes from the real world, so in most cases we need to preprocess the data to best suit our ML algorithm. A dataset can suffer from several problems. It may contain more samples for certain classes than for others, known as the dataset balancing problem, which we explore in section 3.1. It is also fundamental to divide the dataset correctly between the training and testing stages, known as the train-test split, discussed in section 3.2. Once the dataset is balanced and split for training and testing purposes, the next stage is to fix discrepancies such as missing data, different scales across columns, etc. The entire process of making data suitable for an ML algorithm is known as feature engineering, presented in section 3.3. Firstly, feature engineering deals with missing data, meaning that there are blank entries where we are supposed to have data; the missing entries may have different sources, as discussed in section 3.3.1. Secondly, another major problem in data is outliers: samples that deviate from the overall trend of the remaining data. Outliers impede the correct learning of ML algorithms, since a few outliers can shift the entire learning curve of the model; we discuss them in further detail in section 3.3.3. Thirdly, features may have different scales. For example, the age of an employee may lie between 20 and 100, while the salary is in the range of thousands; such a problem is fixed by scaling the features, as discussed in section 3.3.4.

3.1 Balancing Datasets


In supervised learning, the dataset is labeled or annotated. Just by looking at the label column, we can determine the number of classes in the dataset and how many samples each class contains. An imbalanced dataset occurs when the distribution of classes in the data is significantly skewed in favor of some classes. In other words, one class (the minority class) contains significantly fewer examples than another class (the majority class). No matter which ML algorithm you use or how much training you do, the resulting model remains biased toward the majority class. Fixing an imbalanced dataset is called balancing the dataset. It can be done in two ways: reduce the majority class so that it becomes equal in size to the minority class, as discussed in section 3.1.1, or increase the minority class so that it becomes equal in size to the majority class, as discussed in section 3.1.2. A quick way to inspect the class distribution is sketched below.
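As a first step, it is useful to quantify the imbalance by counting the samples per class. The following is a minimal sketch using Python's collections.Counter; the label list y here is a hypothetical example rather than a dataset from this chapter.

from collections import Counter

# Hypothetical label list: 9 samples of class 0, 3 samples of class 1
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

counts = Counter(y)
majority = max(counts, key=counts.get)
minority = min(counts, key=counts.get)

print(counts)                                # Counter({0: 9, 1: 3})
print(counts[majority] / counts[minority])   # imbalance ratio: 3.0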

3.1.1 Undersampling
Undersampling methods reduce the number of majority class samples so that the minority and majority classes become equal in the number of samples. This only works if we have a very large dataset and discarding majority class samples does not affect the patterns that can be learned from the data. The most simplistic technique is random undersampling, where random samples from the majority class are discarded. However, this does not take into account whether the discarded samples affect the class distribution or carry useful information. A more appropriate approach is to discard samples that are either redundant or have the least possible effect on the dataset, for example by identifying redundant samples or by choosing an equal representation of samples from all classes. A sketch of random undersampling with the imblearn library is given below.
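As a minimal sketch of random undersampling, the snippet below uses RandomUnderSampler from the imblearn library; the synthetic dataset built with make_classification is only for illustration and is not one of this chapter's examples.

from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Build a small synthetic imbalanced dataset (roughly 90% / 10%)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Randomly discard majority class samples until both classes are equal in size
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

print(Counter(y), '->', Counter(y_resampled))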

3.1.1.1 NearMiss
The Near Miss method is an undersampling technique that measures the distance of majority class samples to minority class samples [4]. There are three variants of Near Miss. In the first, NearMiss-1, the distance of each majority class sample to its three closest samples from the minority class is calculated; the majority class samples having the smallest average distance are kept and the remaining are discarded. In the second, NearMiss-2, the majority class samples having the smallest average distance to the furthest samples of the minority class are chosen. Finally, the third variant, NearMiss-3, calculates the distance of each majority class sample to each minority class sample and discards the majority samples having the least average distance to the minority class. Although different distance metrics can be used in all three cases, we employ the most well-known, the Euclidean distance. For the sake of this text, we will discuss NearMiss-3 only and refer to it as the NearMiss algorithm. It calculates the distance of each majority class sample to each sample in the minority class and throws away the majority samples that have the closest average distance to the minority class. The premise of this algorithm is that the most eligible discardable points are those closest to the minority class, which therefore convey the least unique information. The algorithm is summarized in Algorithm 1.

Algorithm 1 Near Miss Algorithm

Require: The input data sets X and Y with the samples of the majority and minority class respectively, where the size of the majority class is N and the size of the minority class is M (N > M).
1. Find the distance $\sqrt{(X - Y)^2}$ between all points of the majority class and all points of the minority class.
2. Find the k = N − M points of the majority class that have the minimum average distance to the minority class.
3. Remove these k nearest points, whose average distance to the minority class is least.
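Before turning to the library implementation in Code Listing 3.1, the following is a from-scratch sketch of Algorithm 1 in NumPy, written only to make the steps concrete; the function name and the arrays X_maj and X_min are hypothetical.

import numpy as np

def near_miss_undersample(X_maj, X_min):
    # Step 1: Euclidean distance between every majority and every minority point
    dists = np.sqrt(((X_maj[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2))
    # Step 2: k majority points must be removed so both classes end up with M samples
    k = X_maj.shape[0] - X_min.shape[0]
    # Average distance of each majority point to the minority class
    avg_dist = dists.mean(axis=1)
    # Step 3: drop the k majority points with the smallest average distance
    keep = np.argsort(avg_dist)[k:]
    return X_maj[keep]

Applying near_miss_undersample to the majority class rows leaves both classes with M samples each.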

Below is the code for the NearMiss algorithm using the imblearn library. Lines 6 to 9 load the data, line 12 splits the data into training and testing sets, line 15 creates an instance of NearMiss, and line 18 applies the algorithm to the training set.
1 from imblearn.under_sampling import NearMiss
2 from imblearn.datasets import fetch_datasets
3 from sklearn.model_selection import train_test_split
4
5 # Load an imbalanced dataset as an example
6 data = fetch_datasets()['satimage']
7
8 X = data.data
9 y = data.target
10
11 # Split the data into training and testing sets
12 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
13
14 # Initialize the NearMiss algorithm (version 3, as discussed above)
15 near_miss = NearMiss(version=3)
16
17 # Undersample the majority class of the training set
18 X_resampled, y_resampled = near_miss.fit_resample(X_train, y_train)
Code Listing 3.1
NearMiss algorithm

3.1.2 Oversampling
Oversampling increases the minority class so that it matches the majority class. It can be done in two ways. In the first technique, random oversampling (section 3.1.2.1), minority samples are randomly picked from the existing samples and copied multiple times until they reach the size of the majority class. In SMOTE (Synthetic Minority Over-sampling Technique), discussed in section 3.1.2.2, the samples are increased by interpolating within the neighborhood of existing samples.

3.1.2.1 Random oversampling


In random oversampling, the primary idea is to augment the number of instances in the minority class by randomly duplicating a subset of its existing data points. This corrects the skewed class distribution and yields a more balanced dataset for training the ML model. The objective is to mitigate the inherent bias toward the majority class, thereby enhancing the model's capacity to learn and to generate precise predictions for the minority class. Nevertheless, while random oversampling is a straightforward method of addressing class imbalance, caution is required: indiscriminate oversampling introduces the risk of overfitting, wherein the model tailors itself too closely to the training data and compromises generalization. To counteract this risk, more advanced techniques like SMOTE create synthetic instances that expand the minority class in a more controlled manner, preserving diversity and reducing the risk of overfitting. A careful evaluation of the model's performance on independent validation or test datasets is also imperative, to ascertain that the oversampling strategy enhances predictive accuracy without introducing adverse side effects.
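As a minimal sketch of random oversampling, the snippet below uses RandomOverSampler from the imblearn library, which duplicates randomly chosen minority samples until the classes are balanced; the synthetic dataset is only for illustration.

from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

# Synthetic imbalanced dataset (roughly 90% / 10%)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Duplicate randomly chosen minority samples until the classes match
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

print(Counter(y), '->', Counter(y_resampled))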

[FIGURE 3.1: SMOTE (Synthetic Minority Over-sampling Technique); two panels with X and Y axes.]

3.1.2.2 Resampling Methods, SMOTE


SMOTE, or Synthetic Minority Over-sampling Technique, is an algorithm for
the correction of imbalanced data. Imbalanced datasets occur when one class
significantly outnumbers another, which can lead to biased model performance
and a propensity to misclassify the minority class. SMOTE was introduced to
address this challenge by generating synthetic examples of the minority class,
thereby balancing the dataset. The core idea behind SMOTE is to create
synthetic instances by interpolating between existing minority class samples.
It selects a minority class data point and its k-nearest neighbors and then
generates new samples by randomly selecting a neighbor and creating a linear
combination between the two. This process continues until the desired level of
oversampling is achieved. Figure 3.1 illustrates how the minority (cross) class is augmented with synthetic samples in this way. SMOTE
offers several advantages, such as improving classifier performance on minority
class instances, reducing the risk of overfitting, and making the model more
robust. However, it’s essential to apply SMOTE judiciously, as oversampling
can lead to overgeneralization of the minority class, depending on the specific
problem. In summary, SMOTE is a valuable tool in addressing imbalanced
datasets, allowing ML models to better capture the nuances of minority classes
and ultimately improving the overall performance and fairness of classification
models. It has found applications in various domains, including healthcare,
fraud detection, and text classification, where class imbalance is a common
challenge.
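Concretely, for a minority sample $x_i$ and a randomly chosen neighbor $x_{zi}$, SMOTE generates a new sample $x_{new} = x_i + \lambda (x_{zi} - x_i)$ with $\lambda$ drawn uniformly from $[0, 1]$. The snippet below is a minimal sketch using the SMOTE class from the imblearn library; the synthetic dataset is only for illustration.

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced dataset (roughly 90% / 10%)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Interpolate between each minority sample and its k nearest minority neighbors
smote = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print(Counter(y), '->', Counter(y_resampled))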
