Reading 5 - Data Preparation
Data cleaning plays an important role in data management as well as in analytics and machine learning. In this article, I will try to build intuition about the importance of data cleaning and the different data cleaning processes.
Data is the most valuable asset for analytics and machine learning; in computing and in business, data is needed everywhere. When it comes to real-world data, it is not improbable that it contains incomplete, inconsistent or missing values. If the data is corrupted, it may hinder the process or produce inaccurate results. Let's see some examples of why data cleaning matters.
Suppose you are the general manager of a company that collects data about the customers who buy its products. You want to know which products people are most interested in, so that you can increase production of those products. But if the data is corrupted or contains missing values, you will be misled and will struggle to make the correct decision.
Ultimately, machine learning is data-driven AI: if the data is irrelevant or error-prone, it leads to an incorrect model.
Figure 1: Impact of data on Machine Learning Modeling.
The cleaner your data, the better the model you can build. So we need to process or clean the data before using it; without quality data, it would be foolish to expect any good outcome.
● Inconsistent column
If your DataFrame (a DataFrame is a two-dimensional data structure, i.e., data aligned in a tabular fashion in rows and columns) contains columns that are irrelevant or that you are never going to use, you can drop them to focus on the columns you will work with. Let's see an example of how to deal with such a dataset by creating an example students dataset with a pandas DataFrame.
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O

data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
        'Height': [5.2, 5.7, 5.6, 5.5, 5.3, 5.8, 5.6, 5.5],
        'Roll': [55, 99, 15, 80, 1, 12, 47, 104],
        'Department': ['CSE', 'EEE', 'BME', 'CSE', 'ME', 'ME', 'CE', 'CSE'],
        'Address': ['polashi', 'banani', 'farmgate', 'mirpur', 'dhanmondi',
                    'ishwardi', 'khulna', 'uttara']}
df = pd.DataFrame(data)
print(df)
Let us drop the Height column. To do this, pass the column name to the columns keyword of drop.
df=df.drop(columns='Height')
print(df.head())
● Missing data
It is rare to find a real-world dataset without any missing values. When you start to work with real-world data, you will find that most datasets contain missing values. Handling them is very important, because leaving missing values as they are may affect your analysis and machine learning models. So you need to check whether your dataset contains missing values, and if it does, you can perform any of these three tasks:
1. Leave them as they are
2. Fill in the missing values
3. Drop them
There are different methods for filling in missing values. For example, Figure 4 shows that the airquality dataset has missing values.
You can use different statistical methods to fill in the missing values according to your needs. For example, in Figure 5 we use the statistical mean to fill the missing values.
airquality['Ozone'] = airquality['Ozone'].fillna(airquality.Ozone.mean())
airquality.head()
You can see that the missing values in the Ozone column have been filled with the mean value of that column. You can also drop the rows or columns where missing values are found, with the help of pandas.DataFrame.dropna. Here, in Figure 6, the rows that have missing values in the Solar.R column are dropped. You can check how many missing values remain in each column with:
airquality.isnull().sum(axis=0)
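A minimal sketch of the drop described above, assuming the same airquality DataFrame; the subset argument restricts the check to the Solar.R column:
# Drop rows where Solar.R is missing
airquality = airquality.dropna(subset=['Solar.R'])
airquality.isnull().sum(axis=0)  # verify that no missing values remain in Solar.R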
● Outliers
If you are new to data science, the first question that will arise in your head is: what do these outliers mean? Let's talk about outliers first, and then we will discuss how to detect them in a dataset and what to do after detecting them.
According to Wikipedia,
“In statistics, an outlier is a data point that differs significantly from other
observations.”
In Figure 4, all the values in the math column are in the range 90–95, except 20, which is significantly different from the others. It could be an input error in the dataset, so we can call it an outlier. One thing should be added here: not all outliers are bad data points. Some can be errors, but others are valid values.
So now the question is: how can we detect outliers in the dataset? For detecting outliers we can use:
1. Box Plot
2. Scatter plot
3. Z-score etc.
We will see the Scatter Plot method here. Let’s draw a scatter plot of a
dataset.
import matplotlib.pyplot as plt  # plotting

dataset.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()
Here in Figure 9 there is an outlier with a red outline. After detecting it, we can remove it from the dataset.
df_removed_outliers = dataset[dataset.total_est_fee < 17500]
df_removed_outliers.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()
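The Z-score approach listed above can be sketched in a similar spirit; this is only an illustration, reusing the same hypothetical total_est_fee column:
# Keep rows whose total_est_fee lies within 3 standard deviations of the mean
col = dataset['total_est_fee']
z = (col - col.mean()) / col.std()
df_removed_outliers = dataset[z.abs() < 3]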
● Duplicate rows
Datasets may contain duplicate entries, and deleting duplicate rows is one of the easiest cleaning tasks. To delete duplicate rows you can use dataset_name.drop_duplicates(). Figure 12 shows a sample of a dataset having duplicate rows.
dataset = dataset.drop_duplicates()  # this will remove the duplicate rows
print(dataset)
● Tidy data
Another common step is reshaping a dataset into tidy form, where each variable is a column and each observation is a row. For example, when column headers such as treatment a and treatment b hold values rather than variable names, pandas.melt can reshape them:
import pandas as pd
pd.melt(frame=df, id_vars='name', value_vars=['treatment a', 'treatment b'])
● String manipulation
One of the most important and interesting parts of data cleaning is string manipulation. In the real world, most data is unstructured. String manipulation means the process of changing, parsing, matching or analyzing strings. For string manipulation, you should have some knowledge of regular expressions. Sometimes you need to extract some value from a large sentence, and this is where string manipulation gives us a strong benefit. Let's say,
"This umbrella costs $12 and he took this money from his mother."
If you want to extract the "$12" information from the sentence, you have to build a regular expression that matches that pattern. After that you can use the Python libraries; there are many built-in and external libraries in Python for string manipulation.
import re

pattern = re.compile(r'\$\d*')   # a dollar sign followed by digits
result = pattern.match("$12312312")
print(bool(result))              # True
● Data Concatenation
In this modern era of data science, the volume of data is increasing day by day, so data may be stored in separate files. If you work with multiple files, you can concatenate them for simplicity using pandas:
concatenated_data = pd.concat([dataset1, dataset2])
print(concatenated_data)
Figure 15: Concatenated dataset.
● Feature engineering
Now consider a dataset of property prices collected from the internet. This data might have some errors or might be incorrect; not all sources on the internet are correct. To begin, we'll add a new column to display the cost per square foot.
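As a minimal sketch, with stand-in values and illustrative column names (price, sqft):
import pandas as pd

houses = pd.DataFrame({'price': [340000, 450000, 120000],
                       'sqft':  [1000, 1300, 950]})       # stand-in values
houses['cost_per_sqft'] = houses['price'] / houses['sqft']  # the new feature
print(houses)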
This new feature will help us understand a lot about our data. So, we have a new column that shows the cost per square foot. There are three main ways you can find errors in it. The first is domain knowledge: contact a property advisor or real estate agent and show them the per-square-foot rates; if they state that the price per square foot cannot be less than 3400, you may have a problem. The second is to visualise the data.
When you plot the data, you’ll notice that one price is significantly different from
the rest. With the visualisation method, you can readily notice the problem. The third way is to use statistics to analyze your data and find any problems. Feature engineering consists of various processes:
● Feature Creation: Creating features involves creating new variables which
will be most helpful for our model. This can be adding or removing some
features. As we saw above, the cost per sq. ft column was a feature
creation.
● Transformations: Feature transformation is simply a function that transforms features from one representation to another. The goal here is to plot and visualise the data; if something is not adding up with the new features, we can reduce the number of features used, speed up training, or increase the accuracy of a certain model.
● Feature Extraction: Feature extraction is the process of extracting features
from a data set to identify useful information. Without distorting the
original relationships or significant information, this compresses the
amount of data into manageable quantities for algorithms to process.
● Exploratory Data Analysis : Exploratory data analysis (EDA) is a powerful
and simple tool that can be used to improve your understanding of your
data, by exploring its properties. The technique is often applied when the
goal is to create new hypotheses or find patterns in the data. It’s often
used on large amounts of qualitative or quantitative data that haven’t
been analyzed before.
● Benchmark : A Benchmark Model is the most user-friendly, dependable,
transparent, and interpretable model against which you can measure
your own. It’s a good idea to run test datasets to see if your new machine
learning model outperforms a recognised benchmark. These
benchmarks are often used as measures for comparing the performance
between different machine learning models like neural networks and
support vector machines, linear and non-linear classifiers, or different
approaches like bagging and boosting. To learn more about feature engineering steps and processes, check the links provided at the end of this article. Now, let's look at why we need feature engineering in machine learning.
When feature engineering activities are done correctly, the resulting dataset is
optimal and contains all of the important factors that affect the business
problem. As a result of these datasets, the most accurate predictive models and
the most useful insights are produced.
Some of the techniques listed below may work better with certain algorithms or datasets, while others may be useful in all situations.
1. Imputation
When it comes to preparing your data for machine learning, missing values are one of the most typical issues. Human errors, data flow interruptions, privacy concerns, and other factors can all contribute to missing values. Whatever the cause, missing values affect the performance of machine learning models. The main goal of imputation is to handle these missing values. There are two types of imputation:
● Numerical Imputation: To figure out what numbers should be
assigned to people currently in the population, we usually use data
from completed surveys or censuses. These data sets can include
information about how many people eat different types of food,
whether they live in a city or country with a cold climate, and how
much they earn every year. That is why numerical imputation is
used to fill gaps in surveys or censuses when certain pieces of
information are missing.
#Filling all missing values with 0
data = data.fillna(0)
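The second type referred to above is commonly categorical imputation, where missing categories are filled with the most frequent value; only the numerical case is described here, so treat that as an assumption. A minimal sketch of both, with illustrative column names on the same data DataFrame:
# Numerical imputation: fill missing ages with the column mean (column name is illustrative)
data['age'] = data['age'].fillna(data['age'].mean())
# Categorical imputation: fill missing cities with the most frequent value (mode)
data['city'] = data['city'].fillna(data['city'].mode()[0])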
2. Handling Outliers
Outlier handling is a technique for removing outliers from a dataset. This
method can be used on a variety of scales to produce a more accurate
data representation. This has an impact on the model’s performance.
Depending on the model, the effect could be large or minimal; for
example, linear regression is particularly susceptible to outliers. This
procedure should be completed prior to model training. The various
methods of handling outliers include:
a. Removal: Outlier-containing entries are deleted from the distribution. However, if there are outliers across numerous variables, this strategy may result in a big chunk of the dataset being lost.
b. Replacing values: Alternatively, the outliers could be handled as
missing values and replaced with suitable imputation.
c. Capping: Using an arbitrary value or a value from a variable distribution to replace the maximum and minimum values (a sketch follows this list).
d. Discretization : Discretization is the process of converting
continuous variables, models, and functions into discrete ones. This
is accomplished by constructing a series of continuous intervals (or
bins) that span the range of our desired variable/model/function.
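As an illustration of the capping approach, here is a sketch that uses the interquartile range to choose the caps; the price column, the stand-in values, and the 1.5 x IQR fences are assumptions, not part of the original text:
import pandas as pd

df = pd.DataFrame({'price': [95, 100, 102, 98, 104, 400]})   # 400 is an obvious outlier
q1, q3 = df['price'].quantile(0.25), df['price'].quantile(0.75)
iqr = q3 - q1
# Cap values outside the 1.5 * IQR fences
df['price'] = df['price'].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
print(df)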
3. Log Transform
Log Transform is one of the most used techniques among data scientists. It's mostly used to turn a skewed distribution into a normal or less-skewed distribution. In this transform, we take the log of the values in a column and use those values as the column. It helps to handle skewed data, and after the transform the distribution becomes closer to normal.
# Log example
df['log_price'] = np.log(df['Price'])
4. One-hot encoding
A one-hot encoding represents each element of a finite set of n values with a vector of length n in which only the position for that element is set to "1" and all other positions are set to "0". In contrast to binary encoding schemes, where each bit can represent two values (i.e. 0 and 1), this scheme assigns a unique position to each possible case.
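One way to produce such an encoding with pandas is get_dummies; a minimal sketch, reusing the Department column from the earlier students DataFrame:
# Each department becomes its own 0/1 column
df = pd.get_dummies(df, columns=['Department'])
print(df.head())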
5. Scaling
Feature scaling is one of the most pervasive and difficult problems in machine learning, yet it's one of the most important things to get right. In order to train a predictive model, we need data with a known set of features that may need to be scaled up or down as appropriate. This section explains how feature scaling works and why it's important, as well as some tips for getting started with it.
After a scaling operation, the continuous features become similar in terms of range. Although this step isn't required for many algorithms, it's still a good idea to do it. Distance-based algorithms like k-NN and k-Means, on the other hand, require scaled continuous features as model input. There are two common ways of scaling:
a. Normalization : All values are scaled in a specified range between 0
and 1 via normalisation (or min-max normalisation). This
modification has no influence on the feature’s distribution,
however it does exacerbate the effects of outliers due to lower
standard deviations. As a result, it is advised that outliers be dealt
with prior to normalisation.
b. Standardization: Standardization (also known as z-score normalisation) is the process of scaling values while accounting for standard deviation. If the standard deviation of features differs, the range of those features will likewise differ. The effect of outliers in the features is reduced as a result. To arrive at a distribution with mean 0 and variance 1, each data point has the mean subtracted from it and the result is divided by the distribution's standard deviation (a sketch of both approaches follows).
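A minimal sketch of both approaches in plain pandas, using a stand-in numeric column x:
import pandas as pd

df = pd.DataFrame({'x': [1.0, 4.0, 10.0, 25.0]})   # stand-in values
# Min-max normalisation: rescale x into the [0, 1] range
df['x_norm'] = (df['x'] - df['x'].min()) / (df['x'].max() - df['x'].min())
# Standardization (z-score): subtract the mean, divide by the standard deviation
df['x_std'] = (df['x'] - df['x'].mean()) / df['x'].std()
print(df)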
● Curse of dimensionality
For any machine learning model to be valid, we need a minimum number of data points for every combination of features. For example, let's say that for a model to perform well, we need at least 10 data points for each combination of feature values. If we assume that we have one binary feature, then for its 2¹ unique values (0 and 1) we would need 2¹ x 10 = 20 data points. For 2 binary features, we would have 2² unique values and need 2² x 10 = 40 data points. Thus, for k binary features we would need 2ᵏ x 10 data points.
Hughes (1968) concluded in his study that, with a fixed number of training samples, the predictive power of a classifier first increases as the number of dimensions increases, but beyond a certain number of dimensions the performance deteriorates. Thus, the curse of dimensionality is also known as the Hughes phenomenon.
In high dimensions, the distance from a point to its nearest neighbor approaches the distance to its farthest neighbor. That is, for a d-dimensional space, given n random points, distₘᵢₙ(A) ≈ distₘₐₓ(A), meaning any given pair of points is roughly equidistant. Therefore, machine learning algorithms that are based on distance measures, including KNN (k-Nearest Neighbors), tend to fail when the number of dimensions in the data is very high. Thus, dimensionality can be considered a "curse" for such algorithms.
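A small numeric illustration of this effect; the point counts and dimensions are arbitrary, and the exact numbers will vary between runs:
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 1000):
    points = rng.random((500, d))                            # 500 random points in d dimensions
    dists = np.linalg.norm(points - points[0], axis=1)[1:]   # distances from the first point
    print(d, dists.min() / dists.max())                      # the ratio approaches 1 as d grows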
Other methods involve reducing the number of dimensions. Some of the techniques that can be used are:
1. Forward-feature selection: This method involves picking the most useful
subset of features from all given features.
2. PCA/t-SNE: Though these methods help reduce the number of features, they do not necessarily preserve the class labels and can therefore make interpreting the results a tough task (see the sketch after this list).
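A minimal sketch of PCA-based reduction with scikit-learn; the stand-in feature matrix X and the choice of 2 components are assumptions for illustration:
from sklearn.decomposition import PCA
import numpy as np

X = np.random.rand(100, 10)          # stand-in feature matrix: 100 samples, 10 features
pca = PCA(n_components=2)            # project onto the 2 directions of highest variance
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_) # share of variance kept by each component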
Material Sources:
https://towardsdatascience.com/what-is-data-cleaning-how-to-process-data-for-analytics-and-machine-learning-modeling-c2afcf4fbf45
https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10#:~:text=Feature%20engineering%20is%20the%20process,design%20and%20train%20better%20features
https://towardsdatascience.com/curse-of-dimensionality-a-curse-to-machine-learning-c122ee33bfeb
https://www.r-bloggers.com/2021/04/how-to-clean-the-datasets-in-r/
https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114
https://towardsdatascience.com/normalization-vs-standardization-quantitative-analysis-a91e8a79cebf