AIDS C04-Session-20
Data Preprocessing
• Data preprocessing is the process of transforming raw data into an understandable format. The quality of the data should be checked before applying machine learning or data mining algorithms.
• Data comes in many different forms: structured tables, images, audio files, videos, etc.
• Data preprocessing is the step in which the data is transformed, or encoded, into a state that the machine can easily parse. In other words, the features of the data can then be easily interpreted by the algorithm.
Steps of Data Preprocessing
• Remember, not all the steps apply to every problem; which ones are needed depends heavily on the data we are working with, so only a few steps may be required for your dataset.
Generally, they are:
• Data Quality Assessment
• Feature Aggregation
• Feature Sampling
• Dimensionality Reduction
• Feature Encoding
Data Quality Assessment
• Data is often collected from multiple sources that are not always reliable, and it comes in different formats, so considerable time is spent dealing with data quality issues. Problems may arise due to human error, limitations of measuring devices, or flaws in the data collection process.
Methods to deal with data quality issues:
1. Missing values:
It is very common for a dataset to have missing values. They may have arisen during data collection or because of some data validation rule, but regardless, missing values must be taken into consideration.
• Eliminate rows with missing data:
A simple and sometimes effective strategy, which fails if many objects have missing values. If a feature has mostly missing values, that feature itself can also be eliminated.
• Estimate missing values:
If only a reasonable percentage of values are missing, we can run simple interpolation methods to fill them in. However, the most common approach is to fill missing values with the mean, median or mode of the respective feature, as in the sketch below.
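A minimal sketch of both strategies in pandas (the column name "age" and the values are illustrative assumptions, not from the original):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 31, np.nan, 40]})

# Strategy 1: eliminate rows with missing data
dropped = df.dropna()

# Strategy 2: estimate missing values, here with the column mean
filled = df.fillna({"age": df["age"].mean()})
```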
2. Inconsistent values:
Data can contain inconsistent values. For instance, an 'Address' field may contain a phone number. This may be due to human error, or the information may have been misread while being scanned from a handwritten form.
• It is therefore always advisable to assess the data, for example by checking what the data type of each feature should be and whether it is the same for all the data objects, as in the sketch below.
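A minimal sketch of such a data type check in pandas (the column names and the stray phone number are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ravi"],
                   "address": ["12 Park Lane", "99812-34567"]})

# Inspect the inferred data type of every feature
print(df.dtypes)

# Flag address entries that look like phone numbers rather than addresses
print(df[df["address"].str.fullmatch(r"[\d\- ]+")])
```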
3. Duplicate values:
A dataset may include data objects which are duplicates of one another. This may happen when, say, the same person submits a form more than once. The term deduplication is often used to refer to the process of dealing with duplicates.
• In most cases the duplicates are removed so as not to give that particular data object an advantage or bias when running machine learning algorithms; see the sketch below.
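A minimal deduplication sketch in pandas (the form submissions are illustrative assumptions):

```python
import pandas as pd

forms = pd.DataFrame({"name": ["Asha", "Asha", "Ravi"],
                      "email": ["asha@x.com", "asha@x.com", "ravi@x.com"]})

# Keep only the first occurrence of each duplicated submission
deduplicated = forms.drop_duplicates()
print(deduplicated)
```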
Feature Aggregation
• Feature aggregation combines individual values into aggregated ones in order to put the data in a better perspective.
• Example: consider the day-to-day transactions of a product, recorded as the daily sales of that product in various store locations over the year. Aggregating these into single store-wide monthly or yearly transactions reduces the hundreds or potentially thousands of transactions that occur daily at a specific store, thereby reducing the number of data objects.
• This results in reduced memory consumption and processing time.
• Aggregation also provides a high-level view of the data, since the behaviour of groups or aggregates is more stable than that of individual data objects.
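A minimal sketch of this kind of aggregation in pandas (the transactions table and its column names are illustrative assumptions):

```python
import pandas as pd

transactions = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2023-01-05", "2023-01-20",
                            "2023-01-07", "2023-02-11"]),
    "sales": [120, 80, 200, 150],
})

# Aggregate daily transactions into store-wide monthly totals
monthly = (transactions
           .groupby(["store", transactions["date"].dt.to_period("M")])["sales"]
           .sum())
print(monthly)
```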
Feature Sampling
• Sampling is a very common method for selecting a subset of the dataset for analysis. Using a sampling algorithm can substantially reduce the size of the dataset.
• The key principle here is that sampling should be done in such a manner that the generated sample has approximately the same properties as the original dataset, i.e. the sample is representative. This involves choosing the correct sample size and sampling strategy.
Simple Random Sampling dictates that there is an equal probability of selecting any particular entity. It has two main variations, both sketched after this list:
• Sampling without Replacement: as each item is selected, it is removed from the set of all the objects that form the total dataset.
• Sampling with Replacement: items are not removed from the total dataset after being selected, which means they can be selected more than once.
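A minimal sketch of both variations with pandas (the toy dataset is an illustrative assumption):

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# Simple random sampling without replacement: each row appears at most once
without_repl = df.sample(n=5, replace=False, random_state=0)

# Simple random sampling with replacement: rows may be selected repeatedly
with_repl = df.sample(n=5, replace=True, random_state=0)
```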
Dimensionality Reduction
• Most real-world datasets have a large number of features, also called dimensions. As the name suggests, dimensionality reduction aims to reduce the number of features.
The Curse of Dimensionality
Data analysis tasks become significantly harder as the dimensionality of the data increases. As the number of dimensions grows, the volume of the space the data occupies grows with it, making the data increasingly sparse and therefore difficult to model and visualize.
• The basic objective of techniques used for this purpose is to reduce the dimensionality of a
dataset by creating new features which are a combination of the old features.
Principal Component Analysis and Singular Value Decomposition are two widely accepted
techniques.
A few major benefits of dimensionality reduction are:
• Data analysis algorithms work better if the dimensionality of the dataset is lower, mainly because irrelevant features and noise have been eliminated.
• Models built on top of lower-dimensional data are more understandable and explainable.
• The data may also become easier to visualize: features can be taken in pairs or triplets for visualization purposes, which works better when the feature set is not too big.
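A minimal PCA sketch with scikit-learn (the synthetic data and the choice of two components are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 samples with 10 features each
X = np.random.default_rng(0).normal(size=(100, 10))

# Project onto the 2 directions of highest variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured per component
```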
Feature Encoding
• The whole purpose of data preprocessing is to encode the data in order to bring it to such a state that the
machine now understands it.
• Feature encoding is basically performing transformations on the data such that it can be easily accepted as
input for learning algorithms while still retaining its original meaning.
• Some general norms or rules are followed when performing feature encoding.
For Categorical variables:
• Nominal: any one-to-one mapping that retains the meaning can be used. For instance, One-Hot Encoding.
• Ordinal: an order-preserving change of values. The notion of small, medium and large can be represented equally well with the help of a new function, that is, new_value = f(old_value). For example, {0, 1, 2} or maybe {1, 2, 3}.
For Numeric variables:
• Interval: a simple mathematical transformation such as new_value = a*old_value + b, with a and b being constants. For example, the Fahrenheit and Celsius scales, which differ in their zero values and the size of a unit, can be encoded in this manner.
• Ratio: these variables can be scaled to any particular unit while still maintaining the meaning and ratio of their values. Simple mathematical transformations of the form new_value = a*old_value work in this case as well. For example, length can be measured in meters or feet, and money can be taken in different currencies.
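A minimal sketch of nominal (one-hot) and ordinal encoding in pandas (the column names and categories are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size": ["small", "large", "medium"]})

# Nominal: one-hot encode, since colors have no inherent order
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal: an order-preserving mapping new_value = f(old_value)
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)
```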
Train / Validation / Test Split
• After feature encoding is done, our dataset is ready for the learning algorithms! But before deciding which algorithm to use, it is always advisable to split the dataset into 2, or sometimes 3, parts. Learning algorithms have to be trained on the available data distribution first and then validated and tested before they can be deployed to deal with real-world data.
• Training data : This is the part on which your learning algorithms are actually trained to build a
model. The model tries to learn the dataset and its various characteristics and intricacies.
• Validation data : This is the part of the dataset which is used to validate our various model fits. In
simpler words, we use validation data to choose and improve our model hyperparameters.
• Test data: the part of the dataset used to test our model hypothesis. It is left untouched and unseen until the model and hyperparameters are decided; only after that is the model applied to the test data to get an accurate measure of how it would perform when deployed on real-world data.
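A minimal sketch of a three-way split using scikit-learn's train_test_split twice (the 60/20/20 proportions and the toy data are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First split off the test set (20% of the data)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remainder into train (60%) and validation (20%)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 of 80% = 20%
```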
String Manipulation
• String manipulation is the process of changing, parsing, splicing, pasting, or analyzing strings. Sometimes the string data in a dataset is not in a form suitable for analysis or for producing a description of the data, and must be transformed first.
String manipulation operations:
lower(): Converts all uppercase characters in strings in the DataFrame to lowercase and returns the lowercase strings in the result.
upper(): Converts all lowercase characters in strings in the DataFrame to uppercase and returns the uppercase strings in the result.
strip(): If there are spaces at the beginning or end of a string, we should trim the strings to eliminate those spaces using strip(), which removes the extra spaces contained in strings in the DataFrame.
• split(' '): Splits each string with the given pattern. Strings are split, and the new elements produced by the split operation are stored in a list.
• len(): Computes the length of each string in the DataFrame.
• cat(sep=' '): Concatenates the DataFrame index elements or each string in the DataFrame with the given separator; if there is empty data (NaN) in the DataFrame, it returns NaN.
• get_dummies(): Returns a DataFrame with one-hot encoded values: 1 if the value occurs at the corresponding index and 0 if it does not.
• startswith(pattern): It returns true if the element or string in the
DataFrame Index starts with the pattern.
• endswith(pattern): It returns true if the element or string in the
DataFrame Index ends with the pattern.
• replace(a,b): Replaces occurrences of the value a with the value b in each string.
• repeat(value): Repeats each element the given number of times; with a value of 2, for example, each string in the DataFrame appears twice.
• find(pattern): Returns the position of the first occurrence of the pattern in each string, for example the index at which a given character first appears in each string throughout the DataFrame.
• findall(pattern): Returns a list of all occurrences of the pattern in each string.
• islower(): Checks whether all characters in each string in the index of the DataFrame are lowercase or not, and returns a Boolean value.
• isupper(): Checks whether all characters in each string in the index of the DataFrame are uppercase or not, and returns a Boolean value.
• isnumeric(): Checks whether all characters in each string in the index of the DataFrame are numeric or not, and returns a Boolean value.
• swapcase(): Swaps case from lower to upper and vice versa, converting all uppercase characters in each string to lowercase and all lowercase characters to uppercase.
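A minimal sketch of several of these operations using pandas' .str accessor (the sample strings are illustrative assumptions):

```python
import pandas as pd

s = pd.Series(["  Apple ", "banana", "CHERRY"])

print(s.str.strip().str.lower())  # trim spaces, then lowercase
print(s.str.len())                # length of each string
print(s.str.startswith("b"))      # True only for "banana"
print(s.str.find("an"))           # first position of "an", -1 if absent
print(s.str.swapcase())           # swap upper and lower case
```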
Regular Expressions
• Regular expressions are very popular among programmers and can be applied in many programming languages such as Java, JavaScript, PHP, C++, etc. They are useful for numerous practical day-to-day tasks that a data scientist encounters, and they are one of the key concepts of Natural Language Processing that every NLP expert should be proficient in.
• Regular Expressions are used in various tasks such as data pre-
processing, rule-based information mining systems, pattern matching,
text feature engineering, web scraping, data extraction, etc.
• What are Regular Expressions?
• Regular expressions, or RegEx, are sequences of characters mainly used to find or replace patterns embedded in text. Let's consider this example: suppose we have a list of friends - Sunil, Ankit, Sumit, Surjeet and Surabhi - and we want to select only those names on the list which match a certain pattern: names having the first two letters S and U, followed by only three positions that can be taken up by any letter.
• The names Sunil and Sumit fit this criterion, as they have S and U at the beginning and exactly three more letters after that. The other three names do not follow the criterion: Ankit starts with the letter A, whereas Surjeet and Surabhi have more than three characters after S and U. A sketch of this selection follows.
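A minimal sketch of this selection with Python's re module (the pattern Su\w{3} encodes "S and U followed by exactly three letters"; fullmatch rejects longer names):

```python
import re

friends = ["Sunil", "Ankit", "Sumit", "Surjeet", "Surabhi"]

# "Su" followed by exactly three word characters, and nothing else
pattern = re.compile(r"Su\w{3}")

matches = [name for name in friends if pattern.fullmatch(name)]
print(matches)  # ['Sunil', 'Sumit']
```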