Reading 5 - Data Preparation
Data cleaning plays an important role in data management as well as in analytics and machine learning. In this article, I will try to build intuition about the importance of data cleaning and the different data cleaning processes.
Data is the most valuable asset for analytics and machine learning; in computing and in business, data is needed everywhere. When it comes to real-world data, it is not improbable that it contains incomplete, inconsistent or missing values. If the data is corrupted, it may hinder the process or produce inaccurate results. Let's see some examples of why data cleaning matters.
Suppose you are the general manager of a company that collects data about the customers who buy its products. You want to know which products people are most interested in, so that you can increase production of those products. But if the data is corrupted or contains missing values, you will be misled and will struggle to make the correct decision.
Ultimately, machine learning is data-driven AI: if the data is irrelevant or error-prone, it leads to an incorrect model.
Figure 1: Impact of data on Machine Learning Modeling.
The cleaner your data, the better the model you can build. So we need to process or clean the data before using it; without quality data, it would be foolish to expect any good outcome.
● Inconsistent column
If your DataFrame (a DataFrame is a two-dimensional data structure, i.e., data aligned in a tabular fashion in rows and columns) contains columns that are irrelevant or that you are never going to use, you can drop them to focus on the columns you will work with. Let's see an example of how to deal with such a dataset by creating an example students dataset with a pandas DataFrame.
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O

data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
        'Height': [5.2, 5.7, 5.6, 5.5, 5.3, 5.8, 5.6, 5.5],
        'Roll': [55, 99, 15, 80, 1, 12, 47, 104],
        'Department': ['CSE', 'EEE', 'BME', 'CSE', 'ME', 'ME', 'CE', 'CSE'],
        'Address': ['polashi', 'banani', 'farmgate', 'mirpur', 'dhanmondi',
                    'ishwardi', 'khulna', 'uttara']}
df = pd.DataFrame(data)
print(df)
Let us drop the Height column. To do this, pass the column name to the columns keyword of drop.
df=df.drop(columns='Height')
print(df.head())
● Missing data
It is rare to find a real-world dataset without any missing values. When you start to work with real-world data, you will find that most datasets contain missing values. Handling them is very important, because leaving missing values as they are may affect your analysis and machine learning models. So you need to check whether your dataset contains missing values, and if it does, you can perform any of these three tasks:
1. Leave them as they are
2. Fill in the missing values
3. Drop them
There are different methods for filling in missing values. For example, Figure 4 shows that the airquality dataset has missing values.
You can use different statistical methods to fill in the missing values according to your needs. For example, in Figure 5 we use the statistical mean to fill the missing values.
airquality['Ozone'] = airquality['Ozone'].fillna(airquality.Ozone.mean())
airquality.head()
You can see that the missing values in the Ozone column have been filled with the mean value of that column. You can also drop the rows or columns where missing values are found, with the help of pandas.DataFrame.dropna. Here, in Figure 6, the rows that have missing values in the Solar.R column are dropped. You can check how many missing values remain in each column with:
airquality.isnull().sum(axis=0)
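A minimal sketch of the drop described above, assuming the same airquality DataFrame; the subset argument restricts the check to the Solar.R column:
# Drop rows where Solar.R is missing
airquality = airquality.dropna(subset=['Solar.R'])
airquality.isnull().sum(axis=0)  # verify that no missing values remain in Solar.R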
● Outliers
If you are new to data science, the first question that will arise in your head is: what do these outliers mean? Let's talk about outliers first, and then we will discuss how to detect them in a dataset and what to do after detecting them.
According to Wikipedia,
“In statistics, an outlier is a data point that differs significantly from other
observations.”
In Figure 4, all the values in the math column are in the range 90–95, except 20, which is significantly different from the others. It could be an input error in the dataset, so we can call it an outlier. One thing should be added here: not all outliers are bad data points. Some can be errors, but others are valid values.
So now the question is: how can we detect outliers in the dataset? For detecting outliers we can use:
1. Box Plot
2. Scatter plot
3. Z-score etc.
We will see the Scatter Plot method here. Let’s draw a scatter plot of a
dataset.
import matplotlib.pyplot as plt  # plotting

dataset.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()
Here in Figure 9 there is an outlier with a red outline. After detecting it, we can remove it from the dataset.
df_removed_outliers = dataset[dataset.total_est_fee < 17500]
df_removed_outliers.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()
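The Z-score approach listed above can be sketched in a similar spirit; this is only an illustration, reusing the same hypothetical total_est_fee column:
# Keep rows whose total_est_fee lies within 3 standard deviations of the mean
col = dataset['total_est_fee']
z = (col - col.mean()) / col.std()
df_removed_outliers = dataset[z.abs() < 3]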
● Duplicate rows
Datasets may contain duplicate entries, and deleting duplicate rows is one of the easiest cleaning tasks. To delete duplicate rows you can use dataset_name.drop_duplicates(). Figure 12 shows a sample of a dataset having duplicate rows.
dataset = dataset.drop_duplicates()  # this will remove the duplicate rows
print(dataset)
● Tidy data
Another common step is reshaping a dataset into tidy form, where each variable is a column and each observation is a row. For example, when column headers such as treatment a and treatment b hold values rather than variable names, pandas.melt can reshape them:
import pandas as pd
pd.melt(frame=df, id_vars='name', value_vars=['treatment a', 'treatment b'])
● String manipulation
One of the most important and interesting parts of data cleaning is string manipulation. In the real world, most data is unstructured. String manipulation means the process of changing, parsing, matching or analyzing strings. For string manipulation, you should have some knowledge of regular expressions. Sometimes you need to extract some value from a large sentence, and this is where string manipulation gives us a strong benefit. Let's say,
"This umbrella costs $12 and he took this money from his mother."
If you want to extract the "$12" information from the sentence, you have to build a regular expression that matches that pattern. After that you can use the Python libraries; there are many built-in and external libraries in Python for string manipulation.
import re

pattern = re.compile(r'\$\d*')   # a dollar sign followed by digits
result = pattern.match("$12312312")
print(bool(result))              # True
● Data Concatenation
In this modern era of data science, the volume of data is increasing day by day, so data may be stored in separate files. If you work with multiple files, you can concatenate them for simplicity using pandas:
concatenated_data = pd.concat([dataset1, dataset2])
print(concatenated_data)
Figure 15: Concatenated dataset.
● Feature engineering
Now consider a dataset of property prices collected from the internet. This data might have some errors or might be incorrect; not all sources on the internet are correct. To begin, we'll add a new column to display the cost per square foot.
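As a minimal sketch, with stand-in values and illustrative column names (price, sqft):
import pandas as pd

houses = pd.DataFrame({'price': [340000, 450000, 120000],
                       'sqft':  [1000, 1300, 950]})       # stand-in values
houses['cost_per_sqft'] = houses['price'] / houses['sqft']  # the new feature
print(houses)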
This new feature will help us understand a lot about our data. So, we have a new column that shows the cost per square foot. There are three main ways you can find errors in it. The first is domain knowledge: contact a property advisor or real estate agent and show them the per-square-foot rates; if they state that the price per square foot cannot be less than 3400, you may have a problem. The second is to visualise the data.
When you plot the data, you’ll notice that one price is significantly different from
the rest. With the visualisation method, you can readily notice the problem. The third way is to use statistics to analyze your data and find any problems. Feature engineering consists of various processes:
● Feature Creation: Creating features involves creating new variables which
will be most helpful for our model. This can be adding or removing some
features. As we saw above, the cost per sq. ft column was a feature
creation.
● Transformations: Feature transformation is simply a function that transforms features from one representation to another. The goal here is to plot and visualise the data; if something is not adding up with the new features, we can reduce the number of features used, speed up training, or increase the accuracy of a certain model.
● Feature Extraction: Feature extraction is the process of extracting features
from a data set to identify useful information. Without distorting the
original relationships or significant information, this compresses the
amount of data into manageable quantities for algorithms to process.
● Exploratory Data Analysis : Exploratory data analysis (EDA) is a powerful
and simple tool that can be used to improve your understanding of your
data, by exploring its properties. The technique is often applied when the
goal is to create new hypotheses or find patterns in the data. It’s often
used on large amounts of qualitative or quantitative data that haven’t
been analyzed before.
● Benchmark : A Benchmark Model is the most user-friendly, dependable,
transparent, and interpretable model against which you can measure
your own. It’s a good idea to run test datasets to see if your new machine
learning model outperforms a recognised benchmark. These
benchmarks are often used as measures for comparing the performance
between different machine learning models like neural networks and
support vector machines, linear and non-linear classifiers, or different
approaches like bagging and boosting. To learn more about feature engineering steps and processes, check the links provided at the end of this article. Now, let's look at why we need feature engineering in machine learning.
When feature engineering activities are done correctly, the resulting dataset is
optimal and contains all of the important factors that affect the business
problem. As a result of these datasets, the most accurate predictive models and
the most useful insights are produced.
Some of the techniques listed below may work better with certain algorithms or datasets, while others may be useful in all situations.
1. Imputation
When it comes to preparing your data for machine learning, missing values are one of the most typical issues. Human errors, data flow interruptions, privacy concerns, and other factors can all contribute to missing values. Whatever the cause, missing values affect the performance of machine learning models. The main goal of imputation is to handle these missing values. There are two types of imputation:
● Numerical Imputation: To figure out what numbers should be
assigned to people currently in the population, we usually use data
from completed surveys or censuses. These data sets can include
information about how many people eat different types of food,
whether they live in a city or country with a cold climate, and how
much they earn every year. That is why numerical imputation is
used to fill gaps in surveys or censuses when certain pieces of
information are missing.
#Filling all missing values with 0
data = data.fillna(0)
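The second type referred to above is commonly categorical imputation, where missing categories are filled with the most frequent value; only the numerical case is described here, so treat that as an assumption. A minimal sketch of both, with illustrative column names on the same data DataFrame:
# Numerical imputation: fill missing ages with the column mean (column name is illustrative)
data['age'] = data['age'].fillna(data['age'].mean())
# Categorical imputation: fill missing cities with the most frequent value (mode)
data['city'] = data['city'].fillna(data['city'].mode()[0])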
2. Handling Outliers
Outlier handling is a technique for removing outliers from a dataset. This
method can be used on a variety of scales to produce a more accurate
data representation. This has an impact on the model’s performance.
Depending on the model, the effect could be large or minimal; for
example, linear regression is particularly susceptible to outliers. This
procedure should be completed prior to model training. The various
methods of handling outliers include:
a. Removal: Outlier-containing entries are deleted from the distribution. However, if there are outliers across numerous variables, this strategy may result in a big chunk of the dataset being lost.
b. Replacing values: Alternatively, the outliers could be handled as
missing values and replaced with suitable imputation.
c. Capping: Using an arbitrary value or a value from a variable distribution to replace the maximum and minimum values (a sketch follows this list).
d. Discretization : Discretization is the process of converting
continuous variables, models, and functions into discrete ones. This
is accomplished by constructing a series of continuous intervals (or
bins) that span the range of our desired variable/model/function.
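As an illustration of the capping approach, here is a sketch that uses the interquartile range to choose the caps; the price column, the stand-in values, and the 1.5 x IQR fences are assumptions, not part of the original text:
import pandas as pd

df = pd.DataFrame({'price': [95, 100, 102, 98, 104, 400]})   # 400 is an obvious outlier
q1, q3 = df['price'].quantile(0.25), df['price'].quantile(0.75)
iqr = q3 - q1
# Cap values outside the 1.5 * IQR fences
df['price'] = df['price'].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
print(df)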
3. Log Transform
Log Transform is one of the most used techniques among data scientists. It's mostly used to turn a skewed distribution into a normal or less-skewed distribution. In this transform, we take the log of the values in a column and use those values as the column. It helps to handle skewed data, and after the transform the distribution becomes closer to normal.
# Log example
df['log_price'] = np.log(df['Price'])
4. One-hot encoding
A one-hot encoding represents each element of a finite set of n values with a vector of length n in which only the position for that element is set to "1" and all other positions are set to "0". In contrast to binary encoding schemes, where each bit can represent two values (i.e. 0 and 1), this scheme assigns a unique position to each possible case.
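One way to produce such an encoding with pandas is get_dummies; a minimal sketch, reusing the Department column from the earlier students DataFrame:
# Each department becomes its own 0/1 column
df = pd.get_dummies(df, columns=['Department'])
print(df.head())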
5. Scaling
Feature scaling is one of the most pervasive and difficult problems in machine learning, yet it's one of the most important things to get right. In order to train a predictive model, we need data with a known set of features that may need to be scaled up or down as appropriate. This section explains how feature scaling works and why it's important, as well as some tips for getting started with it.
After a scaling operation, the continuous features become similar in terms of range. Although this step isn't required for many algorithms, it's still a good idea to do it. Distance-based algorithms like k-NN and k-Means, on the other hand, require scaled continuous features as model input. There are two common ways of scaling:
a. Normalization : All values are scaled in a specified range between 0
and 1 via normalisation (or min-max normalisation). This
modification has no influence on the feature’s distribution,
however it does exacerbate the effects of outliers due to lower
standard deviations. As a result, it is advised that outliers be dealt
with prior to normalisation.
b. Standardization: Standardization (also known as z-score normalisation) is the process of scaling values while accounting for standard deviation. If the standard deviation of features differs, the range of those features will likewise differ. The effect of outliers in the features is reduced as a result. To arrive at a distribution with mean 0 and variance 1, each data point has the mean subtracted from it and the result is divided by the distribution's standard deviation (a sketch of both approaches follows).
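A minimal sketch of both approaches in plain pandas, using a stand-in numeric column x:
import pandas as pd

df = pd.DataFrame({'x': [1.0, 4.0, 10.0, 25.0]})   # stand-in values
# Min-max normalisation: rescale x into the [0, 1] range
df['x_norm'] = (df['x'] - df['x'].min()) / (df['x'].max() - df['x'].min())
# Standardization (z-score): subtract the mean, divide by the standard deviation
df['x_std'] = (df['x'] - df['x'].mean()) / df['x'].std()
print(df)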
● Curse of dimensionality
For any machine learning model to be valid, we need a minimum number of data points for every combination of features. For example, let's say that for a model to perform well, we need at least 10 data points for each combination of feature values. If we assume that we have one binary feature, then for its 2¹ unique values (0 and 1) we would need 2¹ x 10 = 20 data points. For 2 binary features, we would have 2² unique values and need 2² x 10 = 40 data points. Thus, for k binary features we would need 2ᵏ x 10 data points.
Hughes (1968) concluded in his study that, with a fixed number of training samples, the predictive power of a classifier first increases as the number of dimensions increases, but beyond a certain number of dimensions the performance deteriorates. Thus, the curse of dimensionality is also known as the Hughes phenomenon.
In high dimensions, the distance from a point to its nearest neighbor approaches the distance to its farthest neighbor. That is, for a d-dimensional space, given n random points, distₘᵢₙ(A) ≈ distₘₐₓ(A), meaning any given pair of points is roughly equidistant. Therefore, machine learning algorithms that are based on distance measures, including KNN (k-Nearest Neighbors), tend to fail when the number of dimensions in the data is very high. Thus, dimensionality can be considered a "curse" for such algorithms.
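A small numeric illustration of this effect; the point counts and dimensions are arbitrary, and the exact numbers will vary between runs:
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 1000):
    points = rng.random((500, d))                            # 500 random points in d dimensions
    dists = np.linalg.norm(points - points[0], axis=1)[1:]   # distances from the first point
    print(d, dists.min() / dists.max())                      # the ratio approaches 1 as d grows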
Other methods involve reducing the number of dimensions. Some of the techniques that can be used are:
1. Forward-feature selection: This method involves picking the most useful
subset of features from all given features.
2. PCA/t-SNE: Though these methods help reduce the number of features, they do not necessarily preserve the class labels and can therefore make interpreting the results a tough task (see the sketch after this list).
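A minimal sketch of PCA-based reduction with scikit-learn; the stand-in feature matrix X and the choice of 2 components are assumptions for illustration:
from sklearn.decomposition import PCA
import numpy as np

X = np.random.rand(100, 10)          # stand-in feature matrix: 100 samples, 10 features
pca = PCA(n_components=2)            # project onto the 2 directions of highest variance
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_) # share of variance kept by each component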
Material Sources:
https://towardsdatascience.com/what-is-data-cleaning-how-to-process-data-for-analytics-and-machine-learning-modeling-c2afcf4fbf45
https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10#:~:text=Feature%20engineering%20is%20the%20process,design%20and%20train%20better%20features
https://towardsdatascience.com/curse-of-dimensionality-a-curse-to-machine-learning-c122ee33bfeb
https://www.r-bloggers.com/2021/04/how-to-clean-the-datasets-in-r/
https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114
https://towardsdatascience.com/normalization-vs-standardization-quantitative-analysis-a91e8a79cebf