Module 3: Data Preparation
Contents:
► Introduction to data exploration process for data preparation
► Data discovery, issues related to data access
► Characterization of data, consistency and pollution of data
► Duplicate or redundant variables
► Outliers and leverage data, noisy data
► Missing values, imputation of missing and empty places with different
techniques
► Missing pattern and its importance
► Handling non-numerical data in missing places
Introduction to data exploration process for data
preparation
► Data preparation is the process of transforming raw data so that data scientists and analysts
can run it through machine learning algorithms to uncover insights or make predictions; or, equivalently:
► Data preprocessing is a technique used to convert raw data into a clean data set.
► First, identify the predictor (input) and target (output) variables. Next, identify
the data type and category of each variable.
Suppose we want to predict whether students will play cricket or not, given a small student data set.
Here you need to identify the predictor variables, the target variable, the data type of each variable and the
category of each variable; a minimal sketch of this step follows.
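A minimal sketch of this identification step in pandas, using a made-up student data frame (all column names and values below are assumptions, not part of any file shipped with this module):

```python
import pandas as pd

# Hypothetical student data set (names and values are illustrative only)
students = pd.DataFrame({
    "StudentID":   [1, 2, 3, 4, 5],
    "Gender":      ["M", "F", "F", "M", "F"],
    "Height_cm":   [150, 140, 145, 155, 148],
    "PlayCricket": ["Y", "N", "Y", "N", "Y"],
})

target = "PlayCricket"                                   # output variable
predictors = [c for c in students.columns if c not in (target, "StudentID")]

print(predictors)        # ['Gender', 'Height_cm']  -> input variables
print(students.dtypes)   # object columns are categorical, numeric ones continuous
```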
Univariate Analysis
► At this stage, we explore variables one by one.
► The method used to perform univariate analysis depends on whether the variable
is categorical or continuous.
Let's look at these methods and statistical measures for categorical and
continuous variables individually:
► Continuous variables: in the case of continuous variables, we need to
understand the central tendency and spread of the variable.
► Categorical variables: for categorical variables, we use a frequency table to
understand the distribution of each category. We can also read it as the percentage of
values under each category (see the sketch after this list).
► Univariate analysis is also used to highlight missing and outlier values.
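Continuing the same hypothetical student frame, a univariate pass might look like this (describe() for the continuous variable, value_counts() for the categorical one):

```python
# Continuous variable: central tendency and spread
print(students["Height_cm"].describe())        # count, mean, std, min, quartiles, max

# Categorical variable: frequency table and percentages
print(students["Gender"].value_counts())
print(students["Gender"].value_counts(normalize=True))

# Univariate checks also surface missing values
print(students.isna().sum())
```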
Bivariate Analysis
► Bi-variate Analysis finds out the relationship between two variables.
► Here, we look for association and disassociation between variables at a
pre-defined significance level.
► We can perform bivariate analysis for any combination of categorical and
continuous variables.
► The combination can be: Categorical & Categorical, Categorical & Continuous
and Continuous & Continuous.
► Different methods are used to tackle these combinations during the analysis
process.
Continuous & Continuous: a scatter plot shows the pattern of the relationship between the two variables, and the correlation coefficient quantifies its strength and direction; a short sketch follows.
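A small self-contained sketch of the continuous-continuous case (the data here is simulated, purely for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Two simulated continuous variables with a roughly linear relationship
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=200)

print(df["x"].corr(df["y"]))     # Pearson correlation, close to +1 here

df.plot.scatter(x="x", y="y")    # visual check of the association
plt.show()
```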
Missing Values
• Consider the cricket data set with missing values in the gender column. If the missing values are not
treated, the inference from the data is that the chance of playing cricket is the same for males and females.
• After treating the missing values (imputing them based on gender), the same data shows that females
have a higher chance of playing cricket than males. Treating missing values can therefore change the
conclusions drawn from a data set.
Methods to treat missing values
► Deletion
► Mean / Mode / Median imputation (a method to "fill in" missing values):
► This is one of the most frequently used methods.
► It consists of replacing the missing data for a given attribute by the mean or
median (quantitative attribute) or mode (qualitative attribute) of all known values
of that variable.
► Backward and forward fill: propagate the next or the previous observed value into the gap (see the sketch after this list).
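A minimal pandas sketch of these options on a toy numeric column (the values are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, np.nan, 5.0, 5.0, np.nan, 7.0])

s.dropna()            # deletion: drop the rows that contain missing values
s.fillna(s.mean())    # mean imputation
s.fillna(s.median())  # median imputation
s.fillna(s.mode()[0]) # mode imputation (the usual choice for qualitative attributes)
s.ffill()             # forward fill: carry the previous observed value forward
s.bfill()             # backward fill: carry the next observed value backward
```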
Outliers
► An outlier is an observation that appears far away and diverges from the overall pattern in a sample.
► In statistics, an outlier is an observation point that is distant from other observations.
[Wikipedia]
Most common causes of outliers in a
data set
► Data entry errors (human errors)
► Measurement errors (instrument errors)
► Experimental errors (data extraction or experiment planning/executing
errors)
► Intentional (dummy outliers made to test detection methods)
► Data processing errors (data manipulation or data set unintended mutations)
► Sampling errors (extracting or mixing data from wrong or various sources)
► Natural (not an error, novelties in data)
Types of Outliers
► Univariate
► Multivariate
Example: find the mean, median, mode and standard deviation of two samples that differ in a single extreme value:
Sample A (no outlier): 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7 → mean = 5.45, median = 5, mode = 5, std = 1.04
Sample B (with outlier): 4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 300 → mean = 32.18, median = 5, mode = 5, std = 88.83
A single extreme value shifts the mean and the standard deviation dramatically, while the median and mode are unaffected.
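These figures can be reproduced with a few lines (pandas reports the sample standard deviation, which matches the values above):

```python
import pandas as pd

a = pd.Series([4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7])    # sample A, no outlier
b = pd.Series([4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 300])  # sample B, with outlier

for name, s in [("A", a), ("B", b)]:
    print(name, round(s.mean(), 2), s.median(), s.mode()[0], round(s.std(), 2))
# A  5.45  5.0  5  1.04
# B 32.18  5.0  5 88.83
```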
► Outliers can drastically change the results of the data analysis and statistical modeling.
There are numerous unfavorable impacts of outliers in the data set:
► They increase the error variance and reduce the power of statistical tests.
► If the outliers are non-randomly distributed, they can decrease normality.
► They can bias or influence estimates that may be of substantive interest.
► They can also violate the assumptions of regression, ANOVA and other statistical models.
How to detect Outliers?
► The most commonly used method to detect outliers is visualization.
► We use various visualization methods, such as box plots, histograms and scatter plots.
► Some analysts also use various rules of thumb to detect outliers (see the sketch after this list):
► Any value outside the range Q1 − 1.5 × IQR to Q3 + 1.5 × IQR, where IQR = Q3 − Q1, can be considered an outlier.
► Capping methods: any value outside the range of the 5th and 95th percentiles can be considered an outlier.
► Data points three or more standard deviations away from the mean are considered outliers.
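A sketch of the three rules of thumb applied to the outlier sample from the earlier example:

```python
import pandas as pd

s = pd.Series([4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 300])

# IQR rule
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Percentile (capping) rule: outside the 5th-95th percentile range
pct_outliers = s[(s < s.quantile(0.05)) | (s > s.quantile(0.95))]

# Z-score rule: more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

print(iqr_outliers.tolist(), pct_outliers.tolist(), z_outliers.tolist())
# the value 300 is flagged by all three rules
```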
How to remove Outliers?
► Most of the ways to deal with outliers are similar to the methods for missing values: deleting
observations, transforming them, binning them, treating them as a separate group, imputing
values and other statistical methods.
► Deleting observations: we delete outlier values if they are due to data entry or data processing
errors, or if the outlier observations are very small in number.
► Transforming and binning values: transforming variables can also eliminate outliers. Taking the
natural log of a value reduces the variation caused by extreme values. Binning is also a form of
variable transformation. Decision tree algorithms handle outliers well because they bin variables.
We can also assign different weights to different observations (see the sketch after this list).
► Imputing: Like imputation of missing values, we can also impute outliers. We can use
mean, median, mode imputation methods.
► Before imputing values, we should analyze whether the outlier is natural or artificial. If it is
artificial, we can go ahead and impute values.
► We can also use a statistical model to predict the values of outlier observations and then
impute them with the predicted values.
► Treat separately: if there is a significant number of outliers, we should treat them
separately in the statistical model. One approach is to treat them as two different groups,
build an individual model for each group and then combine the outputs.
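A minimal sketch of some of these treatments on the same toy sample: deletion, capping at the 5th/95th percentiles, a log transformation and median imputation of the flagged value (one possible way to implement the ideas above, not a prescribed recipe):

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 300])

is_outlier = (s - s.mean()).abs() > 3 * s.std()

deleted = s[~is_outlier]                                          # deleting observations
capped  = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))  # capping / winsorizing
logged  = np.log(s)                                               # valid here: all values positive
imputed = s.mask(is_outlier, s[~is_outlier].median())             # impute with the median

print(deleted.max(), capped.max(), round(logged.max(), 2), imputed.max())
```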
► Outliers can also come in different flavours, depending on the
environment: point outliers, contextual outliers, or collective outliers.
► Point outliers are single data points that lie far from the rest of the
distribution (also called global outliers).
► Contextual outliers can be noise in the data, such as punctuation symbols in text
analysis or a background noise signal in speech recognition (example: app crashes,
where more users simply means more crashes).
► Collective outliers can be subsets of novelties in data such as a signal that
may indicate the discovery of new phenomena.
Boston House pricing Dataset
► This dataset contains information collected by the U.S. Census Service concerning housing in the area
of Boston, Mass. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston)
and has been used extensively throughout the literature to benchmark algorithms. The dataset is small,
with only 506 cases.
► The name for this dataset is boston. It has two prototasks:
► nox: in which the nitrous oxide level is to be predicted
► price: in which the median value of a home is to be predicted
Variables
There are 13 attributes in each case of the dataset. They are:
1. CRIM - per capita crime rate by town
2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS - proportion of non-retail business acres per town.
4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5. NOX - nitric oxides concentration (parts per 10 million)
6. RM - average number of rooms per dwelling
7. AGE - proportion of owner-occupied units built prior to 1940
8. DIS - weighted distances to five Boston employment centres
9. RAD - index of accessibility to radial highways
10. TAX - full-value property-tax rate per $10,000
11. PTRATIO - pupil-teacher ratio by town
12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT - % lower status of the population
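The load_boston helper has been removed from recent scikit-learn releases, so a common way to rebuild the data frame is to read the StatLib file directly (assuming the URL above is still reachable; the file stores each record across two physical lines, hence the re-assembly):

```python
import numpy as np
import pandas as pd

url = "http://lib.stat.cmu.edu/datasets/boston"
raw = pd.read_csv(url, sep=r"\s+", skiprows=22, header=None)

# Each case spans two lines: 11 values on the first, 3 on the second
data = np.hstack([raw.values[::2, :], raw.values[1::2, :2]])
target = raw.values[1::2, 2]

columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
boston = pd.DataFrame(data, columns=columns)
boston["MEDV"] = target          # median home value, the "price" prototask target

print(boston.shape)              # (506, 14)
```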
What is Feature Engineering?
► Feature engineering is the science (and art) of extracting more information from
existing data.
Note: You are not adding any new data here, but you are actually making the data you
already have more useful.
► You perform feature engineering once you have completed the first 5 steps in data
exploration — Variable Identification, Univariate, Bivariate Analysis, Missing Values
Imputation and Outliers Treatment.
Feature engineering can be divided into 2 steps:
► Variable transformation.
► Variable / Feature creation.
► These two techniques are vital in data exploration and have a remarkable impact on
the power of prediction.
What is Variable Transformation?
► In data modelling, transformation refers to the replacement of a variable by a
function of it. For instance, replacing a variable x by its square / cube root or
logarithm is a transformation. In other words, transformation is a process that
changes the distribution of a variable or its relationship with other variables.
When should we use Variable Transformation?
► When we want to change the scale of a variable or standardize its values for
better understanding. This transformation is a must if you have data on different
scales, and it does not change the shape of the variable's distribution.
► When we can transform complex non-linear relationships into linear
relationships. A linear relationship between variables is easier to comprehend
than a non-linear or curved relation, and transformation helps us convert a
non-linear relation into a linear one. A scatter plot can be used to find the
relationship between two continuous variables. These transformations also improve
prediction. Log transformation is one of the transformations commonly used in these situations.
► A symmetric distribution is preferred over a skewed distribution, as it is easier to
interpret and generate inferences from. Some modeling techniques require a normal
distribution of variables, so whenever we have a skewed distribution we can use
transformations that reduce skewness. For a right-skewed distribution, we take the
square / cube root or logarithm of the variable; for a left-skewed distribution, we take
the square / cube or exponential of the variable.
What are the common methods of Variable Transformation?
There are various methods used to transform variables. As discussed, some of them
include square root, cube root, logarithm, binning, reciprocal and many others; a short sketch follows this list.
► Logarithm: Log of a variable is a common transformation method used to change
the shape of distribution of the variable on a distribution plot. It is generally used
for reducing right skewness of variables.
Note: It can’t be applied to zero or negative values.
► Square / Cube root: the square and cube roots of a variable have a sound effect on its
distribution, although not as significant as a logarithmic transformation. The cube root
has its own advantage: it can be applied to negative values and zero. The square root can
be applied to positive values and zero.
► Binning: binning is used to categorize variables. It is performed on original values,
percentiles or frequencies. The choice of categorization technique is based on business
understanding. For example, we can categorize income into three categories, namely
High, Average and Low. We can also perform co-variate binning, which depends on the
values of more than one variable.
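A short sketch of these transformations on a hypothetical right-skewed income column (the values and band edges are illustrative assumptions):

```python
import numpy as np
import pandas as pd

income = pd.Series([12_000, 25_000, 40_000, 55_000, 90_000, 250_000])

log_income  = np.log(income)    # logarithm: positive values only
sqrt_income = np.sqrt(income)   # square root: non-negative values
cbrt_income = np.cbrt(income)   # cube root: also works for zero and negatives

# Binning on original values into business-defined categories
income_band = pd.cut(income, bins=[0, 30_000, 80_000, np.inf],
                     labels=["Low", "Average", "High"])

# Equal-frequency (percentile) binning
income_quartile = pd.qcut(income, q=4, labels=False)

print(income_band.tolist(), income_quartile.tolist())
```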
What is Feature / Variable Creation
► Feature / variable creation is the process of generating new variables /
features based on existing variable(s). For example, say we have a
date (dd-mm-yy) as an input variable in a data set. We can generate new
variables like day, month, year, week and weekday that may have a better
relationship with the target variable. This step is used to highlight the hidden
relationships in a variable:
Methods:
► Creating derived variables: this refers to creating new variables from existing variable(s)
using a set of functions or different methods. Let's look at it through the Titanic Kaggle
competition. In this data set, the variable age has missing values; to predict the missing values,
we used the salutation (Master, Mr, Miss, Mrs) from the name as a new variable. How do we
decide which variable to create? Honestly, this depends on the analyst's business understanding,
their curiosity and the set of hypotheses they have about the problem. Methods such as taking the
log of variables, binning variables and other variable transformation methods can also be used to
create new variables (see the sketch after this list).
► Creating dummy variables: one of the most common applications of dummy variables is to
convert a categorical variable into numerical variables. Dummy variables are also called
indicator variables. They are useful for taking a categorical variable as a predictor in statistical
models. A dummy variable can take the values 0 and 1. Take the variable 'gender': we can
produce two variables, namely "Var_Male" with values 1 (male) and 0 (not male) and
"Var_Female" with values 1 (female) and 0 (not female). We can also create dummy
variables for a categorical variable with more than two classes, using n or n−1 dummy
variables.
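A small sketch of both methods: derived variables (salutation from a name, date parts from a dd-mm-yyyy date) and dummy variables via pandas. The column names mimic the Titanic data set but the rows are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "Name":   ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina"],
    "Gender": ["Male", "Female", "Female"],
    "Date":   ["01-05-1912", "02-05-1912", "03-05-1912"],
})

# Derived variables
df["Salutation"] = df["Name"].str.extract(r",\s*([A-Za-z]+)\.")   # Mr / Mrs / Miss / Master
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")
df["Day"] = df["Date"].dt.day
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
df["Weekday"] = df["Date"].dt.day_name()

# Dummy / indicator variables (n columns; pass drop_first=True for n-1)
dummies = pd.get_dummies(df["Gender"], prefix="Var")   # Var_Female, Var_Male
print(pd.concat([df, dummies], axis=1))
```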