Module 3: Data Preparation
Contents:
► Introduction to data exploration process for data preparation
► Data discovery, issues related to data access
► Characterization of data, consistency and pollution of data
► Duplicate or redundant variables
► Outliers and leverage data, noisy data
► Missing values, imputation of missing and empty places with different
techniques
► Missing pattern and its importance
► Handling non-numerical data in missing places
Introduction to data exploration process for data
preparation
► Data preparation is the process of transforming raw data so that data scientists and analysts
can run it through machine learning algorithms to uncover insights or make predictions; or, equivalently:
► Data preprocessing is a technique used to convert raw data into a clean data set.
► First, identify the predictor (input) and target (output) variables. Next, identify
the data type and category of each variable.
Suppose we want to predict whether students will play cricket or not, given a small student data set.
Here you need to identify the predictor variables, the target variable, the data type of each variable and the
category of each variable; a minimal sketch of this step follows.
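A minimal sketch of this identification step in pandas, using a made-up student data frame (all column names and values below are assumptions, not part of any file shipped with this module):

```python
import pandas as pd

# Hypothetical student data set (names and values are illustrative only)
students = pd.DataFrame({
    "StudentID":   [1, 2, 3, 4, 5],
    "Gender":      ["M", "F", "F", "M", "F"],
    "Height_cm":   [150, 140, 145, 155, 148],
    "PlayCricket": ["Y", "N", "Y", "N", "Y"],
})

target = "PlayCricket"                                   # output variable
predictors = [c for c in students.columns if c not in (target, "StudentID")]

print(predictors)        # ['Gender', 'Height_cm']  -> input variables
print(students.dtypes)   # object columns are categorical, numeric ones continuous
```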
Univariate Analysis
► At this stage, we explore variables one by one.
► The method used to perform univariate analysis depends on whether the variable
is categorical or continuous.
Let's look at these methods and statistical measures for categorical and
continuous variables individually:
► Continuous variables: in the case of continuous variables, we need to
understand the central tendency and spread of the variable.
► Categorical variables: for categorical variables, we use a frequency table to
understand the distribution of each category. We can also read it as the percentage of
values under each category (see the sketch after this list).
► Univariate analysis is also used to highlight missing and outlier values.
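Continuing the same hypothetical student frame, a univariate pass might look like this (describe() for the continuous variable, value_counts() for the categorical one):

```python
# Continuous variable: central tendency and spread
print(students["Height_cm"].describe())        # count, mean, std, min, quartiles, max

# Categorical variable: frequency table and percentages
print(students["Gender"].value_counts())
print(students["Gender"].value_counts(normalize=True))

# Univariate checks also surface missing values
print(students.isna().sum())
```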
Bivariate Analysis
► Bi-variate Analysis finds out the relationship between two variables.
► Here, we look for association and disassociation between variables at a
pre-defined significance level.
► We can perform bivariate analysis for any combination of categorical and
continuous variables.
► The combination can be: Categorical & Categorical, Categorical & Continuous
and Continuous & Continuous.
► Different methods are used to tackle these combinations during the analysis
process.
Continuous & Continuous: a scatter plot shows the pattern of the relationship between the two variables, and the correlation coefficient quantifies its strength and direction; a short sketch follows.
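A small self-contained sketch of the continuous-continuous case (the data here is simulated, purely for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Two simulated continuous variables with a roughly linear relationship
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=200)

print(df["x"].corr(df["y"]))     # Pearson correlation, close to +1 here

df.plot.scatter(x="x", y="y")    # visual check of the association
plt.show()
```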
Missing Values
• Consider the cricket data set with missing values in the gender column. If the missing values are not
treated, the inference from the data is that the chance of playing cricket is the same for males and females.
• After treating the missing values (imputing them based on gender), the same data shows that females
have a higher chance of playing cricket than males. Treating missing values can therefore change the
conclusions drawn from a data set.
Methods to treat missing values
► Deletion
► Mean / Mode / Median imputation (a method to "fill in" missing values):
► This is one of the most frequently used methods.
► It consists of replacing the missing data for a given attribute by the mean or
median (quantitative attribute) or mode (qualitative attribute) of all known values
of that variable.
► Backward and forward fill: propagate the next or the previous observed value into the gap (see the sketch after this list).
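A minimal pandas sketch of these options on a toy numeric column (the values are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, np.nan, 5.0, 5.0, np.nan, 7.0])

s.dropna()            # deletion: drop the rows that contain missing values
s.fillna(s.mean())    # mean imputation
s.fillna(s.median())  # median imputation
s.fillna(s.mode()[0]) # mode imputation (the usual choice for qualitative attributes)
s.ffill()             # forward fill: carry the previous observed value forward
s.bfill()             # backward fill: carry the next observed value backward
```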
Outliers
► An outlier is an observation that appears far away and diverges from the overall pattern in a sample.
► In statistics, an outlier is an observation point that is distant from other observations.
[Wikipedia]
Most common causes of outliers in a
data set
► Data entry errors (human errors)
► Measurement errors (instrument errors)
► Experimental errors (data extraction or experiment planning/executing
errors)
► Intentional (dummy outliers made to test detection methods)
► Data processing errors (data manipulation or data set unintended mutations)
► Sampling errors (extracting or mixing data from wrong or various sources)
► Natural (not an error, novelties in data)
Types of Outliers
► Univariate
► Multivariate
Example: find the mean, median, mode and standard deviation of two samples that differ in a single extreme value:
Sample A (no outlier): 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7 → mean = 5.45, median = 5, mode = 5, std = 1.04
Sample B (with outlier): 4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 300 → mean = 32.18, median = 5, mode = 5, std = 88.83
A single extreme value shifts the mean and the standard deviation dramatically, while the median and mode are unaffected.
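These figures can be reproduced with a few lines (pandas reports the sample standard deviation, which matches the values above):

```python
import pandas as pd

a = pd.Series([4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7])    # sample A, no outlier
b = pd.Series([4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 300])  # sample B, with outlier

for name, s in [("A", a), ("B", b)]:
    print(name, round(s.mean(), 2), s.median(), s.mode()[0], round(s.std(), 2))
# A  5.45  5.0  5  1.04
# B 32.18  5.0  5 88.83
```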
► Outliers can drastically change the results of the data analysis and statistical modeling.
There are numerous unfavorable impacts of outliers in the data set:
► They increase the error variance and reduce the power of statistical tests.
► If the outliers are non-randomly distributed, they can decrease normality.
► They can bias or influence estimates that may be of substantive interest.
► They can also violate the assumptions of regression, ANOVA and other statistical models.
How to detect Outliers?
► The most commonly used method to detect outliers is visualization.
► We use various visualization methods, such as box plots, histograms and scatter plots.
► Some analysts also use various rules of thumb to detect outliers (see the sketch after this list):
► Any value outside the range Q1 − 1.5 × IQR to Q3 + 1.5 × IQR, where IQR = Q3 − Q1, can be considered an outlier.
► Capping methods: any value outside the range of the 5th and 95th percentiles can be considered an outlier.
► Data points three or more standard deviations away from the mean are considered outliers.
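A sketch of the three rules of thumb applied to the outlier sample from the earlier example:

```python
import pandas as pd

s = pd.Series([4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 300])

# IQR rule
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Percentile (capping) rule: outside the 5th-95th percentile range
pct_outliers = s[(s < s.quantile(0.05)) | (s > s.quantile(0.95))]

# Z-score rule: more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

print(iqr_outliers.tolist(), pct_outliers.tolist(), z_outliers.tolist())
# the value 300 is flagged by all three rules
```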
How to remove Outliers?
► Most of the ways to deal with outliers are similar to the methods for missing values: deleting
observations, transforming them, binning them, treating them as a separate group, imputing
values and other statistical methods.
► Deleting observations: we delete outlier values if they are due to data entry or data processing
errors, or if the outlier observations are very small in number.
► Transforming and binning values: transforming variables can also eliminate outliers. Taking the
natural log of a value reduces the variation caused by extreme values. Binning is also a form of
variable transformation. Decision tree algorithms handle outliers well because they bin variables.
We can also assign different weights to different observations (see the sketch after this list).
► Imputing: Like imputation of missing values, we can also impute outliers. We can use
mean, median, mode imputation methods.
► Before imputing values, we should analyze whether the outlier is natural or artificial. If it is
artificial, we can go ahead and impute values.
► We can also use a statistical model to predict the values of outlier observations and then
impute them with the predicted values.
► Treat separately: if there is a significant number of outliers, we should treat them
separately in the statistical model. One approach is to treat them as two different groups,
build an individual model for each group and then combine the outputs.
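A minimal sketch of some of these treatments on the same toy sample: deletion, capping at the 5th/95th percentiles, a log transformation and median imputation of the flagged value (one possible way to implement the ideas above, not a prescribed recipe):

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 300])

is_outlier = (s - s.mean()).abs() > 3 * s.std()

deleted = s[~is_outlier]                                          # deleting observations
capped  = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))  # capping / winsorizing
logged  = np.log(s)                                               # valid here: all values positive
imputed = s.mask(is_outlier, s[~is_outlier].median())             # impute with the median

print(deleted.max(), capped.max(), round(logged.max(), 2), imputed.max())
```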
► Outliers can also come in different flavours, depending on the
environment: point outliers, contextual outliers, or collective outliers.
► Point outliers are single data points that lie far from the rest of the
distribution (also called global outliers).
► Contextual outliers can be noise in the data, such as punctuation symbols in text
analysis or a background noise signal in speech recognition (example: app crashes,
where more users simply means more crashes).
► Collective outliers can be subsets of novelties in data such as a signal that
may indicate the discovery of new phenomena.
Boston House pricing Dataset
► This dataset contains information collected by the U.S. Census Service concerning housing in the area
of Boston, Mass. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston)
and has been used extensively throughout the literature to benchmark algorithms. The dataset is small,
with only 506 cases.
► The name for this dataset is boston. It has two prototasks:
► nox: in which the nitrous oxide level is to be predicted
► price: in which the median value of a home is to be predicted
Variables
There are 13 attributes in each case of the dataset. They are:
1. CRIM - per capita crime rate by town
2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS - proportion of non-retail business acres per town.
4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5. NOX - nitric oxides concentration (parts per 10 million)
6. RM - average number of rooms per dwelling
7. AGE - proportion of owner-occupied units built prior to 1940
8. DIS - weighted distances to five Boston employment centres
9. RAD - index of accessibility to radial highways
10. TAX - full-value property-tax rate per $10,000
11. PTRATIO - pupil-teacher ratio by town
12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT - % lower status of the population
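The load_boston helper has been removed from recent scikit-learn releases, so a common way to rebuild the data frame is to read the StatLib file directly (assuming the URL above is still reachable; the file stores each record across two physical lines, hence the re-assembly):

```python
import numpy as np
import pandas as pd

url = "http://lib.stat.cmu.edu/datasets/boston"
raw = pd.read_csv(url, sep=r"\s+", skiprows=22, header=None)

# Each case spans two lines: 11 values on the first, 3 on the second
data = np.hstack([raw.values[::2, :], raw.values[1::2, :2]])
target = raw.values[1::2, 2]

columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
boston = pd.DataFrame(data, columns=columns)
boston["MEDV"] = target          # median home value, the "price" prototask target

print(boston.shape)              # (506, 14)
```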
What is Feature Engineering?
► Feature engineering is the science (and art) of extracting more information from
existing data.
Note: You are not adding any new data here, but you are actually making the data you
already have more useful.
► You perform feature engineering once you have completed the first 5 steps in data
exploration — Variable Identification, Univariate, Bivariate Analysis, Missing Values
Imputation and Outliers Treatment.
Feature engineering can be divided into 2 steps:
► Variable transformation.
► Variable / Feature creation.
► These two techniques are vital in data exploration and have a remarkable impact on
the power of prediction.
What is Variable Transformation?
► In data modelling, transformation refers to the replacement of a variable by a
function of it. For instance, replacing a variable x by its square / cube root or
logarithm is a transformation. In other words, transformation is a process that
changes the distribution of a variable or its relationship with other variables.
When should we use Variable Transformation?
► When we want to change the scale of a variable or standardize its values for
better understanding. This transformation is a must if you have data on different
scales, and it does not change the shape of the variable's distribution.
► When we can transform complex non-linear relationships into linear
relationships. A linear relationship between variables is easier to comprehend
than a non-linear or curved relation, and transformation helps us convert a
non-linear relation into a linear one. A scatter plot can be used to find the
relationship between two continuous variables. These transformations also improve
prediction. Log transformation is one of the transformations commonly used in these situations.
► A symmetric distribution is preferred over a skewed distribution, as it is easier to
interpret and generate inferences from. Some modeling techniques require a normal
distribution of variables, so whenever we have a skewed distribution we can use
transformations that reduce skewness. For a right-skewed distribution, we take the
square / cube root or logarithm of the variable; for a left-skewed distribution, we take
the square / cube or exponential of the variable.
What are the common methods of Variable Transformation?
There are various methods used to transform variables. As discussed, some of them
include square root, cube root, logarithm, binning, reciprocal and many others; a short sketch follows this list.
► Logarithm: Log of a variable is a common transformation method used to change
the shape of distribution of the variable on a distribution plot. It is generally used
for reducing right skewness of variables.
Note: It can’t be applied to zero or negative values.
► Square / Cube root: the square and cube roots of a variable have a sound effect on its
distribution, although not as significant as a logarithmic transformation. The cube root
has its own advantage: it can be applied to negative values and zero. The square root can
be applied to positive values and zero.
► Binning: binning is used to categorize variables. It is performed on original values,
percentiles or frequencies. The choice of categorization technique is based on business
understanding. For example, we can categorize income into three categories, namely
High, Average and Low. We can also perform co-variate binning, which depends on the
values of more than one variable.
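A short sketch of these transformations on a hypothetical right-skewed income column (the values and band edges are illustrative assumptions):

```python
import numpy as np
import pandas as pd

income = pd.Series([12_000, 25_000, 40_000, 55_000, 90_000, 250_000])

log_income  = np.log(income)    # logarithm: positive values only
sqrt_income = np.sqrt(income)   # square root: non-negative values
cbrt_income = np.cbrt(income)   # cube root: also works for zero and negatives

# Binning on original values into business-defined categories
income_band = pd.cut(income, bins=[0, 30_000, 80_000, np.inf],
                     labels=["Low", "Average", "High"])

# Equal-frequency (percentile) binning
income_quartile = pd.qcut(income, q=4, labels=False)

print(income_band.tolist(), income_quartile.tolist())
```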
What is Feature / Variable Creation
► Feature / variable creation is the process of generating new variables /
features based on existing variable(s). For example, say we have a
date (dd-mm-yy) as an input variable in a data set. We can generate new
variables like day, month, year, week and weekday that may have a better
relationship with the target variable. This step is used to highlight the hidden
relationships in a variable:
Methods:
► Creating derived variables: this refers to creating new variables from existing variable(s)
using a set of functions or different methods. Let's look at it through the Titanic Kaggle
competition. In this data set, the variable age has missing values; to predict the missing values,
we used the salutation (Master, Mr, Miss, Mrs) from the name as a new variable. How do we
decide which variable to create? Honestly, this depends on the analyst's business understanding,
their curiosity and the set of hypotheses they have about the problem. Methods such as taking the
log of variables, binning variables and other variable transformation methods can also be used to
create new variables (see the sketch after this list).
► Creating dummy variables: one of the most common applications of dummy variables is to
convert a categorical variable into numerical variables. Dummy variables are also called
indicator variables. They are useful for taking a categorical variable as a predictor in statistical
models. A dummy variable can take the values 0 and 1. Take the variable 'gender': we can
produce two variables, namely "Var_Male" with values 1 (male) and 0 (not male) and
"Var_Female" with values 1 (female) and 0 (not female). We can also create dummy
variables for a categorical variable with more than two classes, using n or n−1 dummy
variables.
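A small sketch of both methods: derived variables (salutation from a name, date parts from a dd-mm-yyyy date) and dummy variables via pandas. The column names mimic the Titanic data set but the rows are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "Name":   ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina"],
    "Gender": ["Male", "Female", "Female"],
    "Date":   ["01-05-1912", "02-05-1912", "03-05-1912"],
})

# Derived variables
df["Salutation"] = df["Name"].str.extract(r",\s*([A-Za-z]+)\.")   # Mr / Mrs / Miss / Master
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")
df["Day"] = df["Date"].dt.day
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
df["Weekday"] = df["Date"].dt.day_name()

# Dummy / indicator variables (n columns; pass drop_first=True for n-1)
dummies = pd.get_dummies(df["Gender"], prefix="Var")   # Var_Female, Var_Male
print(pd.concat([df, dummies], axis=1))
```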