Class3-9 DataPreprocessing 22Aug-06Sept2019

Data preprocessing techniques are used to clean raw data by handling issues like incomplete, noisy, and inconsistent data. This helps improve data quality and the accuracy of analytics results. Key techniques include data cleaning to identify and address missing values and errors, data integration and transformation to standardize formats, and data reduction to reduce size while maintaining analytical capability. Descriptive analytics is also important for preprocessing, such as calculating central tendency and dispersion to profile data characteristics.



Data Preprocessing

Data Science
• Multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract
knowledge and insight from structured and
unstructured data
• Central concept is gaining insight from data
• Machine learning uses data to extract knowledge

(Figure: data science pipeline – Data Collection → Data Preprocessing (Database, Cleaning and Cleansing, Feature Representation) → Data Modeling (Machine Learning) → Inference)


Data Preprocessing
• Real-world data tend to be incomplete, noisy and inconsistent due to their huge size and their likely origin from multiple heterogeneous sources
• Preprocessing is important to clean the data
• Low-quality data leads to low-quality analysis results
• If users believe the data is of low quality (dirty), they are unlikely to trust the results of any data analytics applied to it
• Low-quality data can also confuse analytic procedures that use machine learning techniques, resulting in unreliable output
• Incomplete, noisy and inconsistent data are common properties of large real-world databases


Tuple (Record)
• A tuple (record) is a finite ordered list (sequence) of elements, where each element belongs to an attribute
• Each row of a data table is a tuple (record)

Incomplete Data
• Many tuples (records) have no recorded value for several attributes
• Reasons for incomplete data:
– User forgot to fill in a field
– User chose not to fill out the field because it was not considered important at the time of entry
– Relevant data may not be recorded due to malfunctioning equipment
– Data might have been lost while being transferred from where it was recorded
– Data may not be recorded due to a programming error
– Data might not be recorded due to technology limitations such as limited memory


Noisy Data
• Many tuples (records) have incorrect values for several attributes
• Reasons for noisy data:
– There may be human or computer errors occurring during data entry
– The data collection instruments used may be faulty
– Errors in data transmission
– There may be technology limitations such as a limited buffer size for coordinating synchronised data transfer and consumption

Inconsistent Data
• Data containing discrepancies in the stored values of some attributes
• Reasons for inconsistent data:
– Inconsistencies in naming conventions or data codes, or inconsistent formats of input fields such as dates
– Inconsistency in naming conventions or input-field formats across sources while integrating data
– Human or computer errors occurring during data entry


Data Preprocessing Techniques

• Data cleaning:
– Applied to identify missing values, fill them in, remove noise and correct inconsistencies in the data
• Data integration:
– Merges data from multiple sources into a coherent data store
• Data transformation:
– Transforms the entries of the data into a common format
– Techniques like normalization and standardization are applied to transform the data to another form, improving the accuracy and efficiency of machine learning (ML) algorithms that involve distance measures

Data Preprocessing Techniques

• Data reduction:
– Applied to obtain a reduced representation that is much smaller in volume, yet produces almost the same analytical results
– It can reduce the data size by
• Aggregation
• Eliminating irrelevant and redundant features (attributes) through correlation analysis
• Reducing the dimension
• These techniques are not mutually exclusive; they may work together


Descriptive Data Summarization
(Descriptive Analytics)
• It serves as a foundation for data preprocessing
• It helps us study the general characteristics of the data and identify the presence of noise or outliers
• Data characteristics:
– Central tendency of data
• Centre of the data
• Measured by the mean, median and mode
– Dispersion of data
• The degree to which numerical data tend to spread
• Measured by the range, quartiles, interquartile range (IQR), the five-number summary and the standard deviation

Descriptive Analytics:
Measuring Central Tendency
• Mean:
– Let x1, x2, …, xN be a set of N values of an attribute. The mean of this set of values is given by
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
– The mean is a better measure of central tendency for symmetric (symmetrically distributed) data
• Median:
– For asymmetrically distributed (skewed) data, a better measure of the centre of the data is the median
– For a given set of N values in sorted order
• If N is odd, the median is the middle value of the ordered list
• If N is even, the median is the average of the middle two values


Descriptive Analytics:
Measuring Central Tendency
• Mode: the most frequent value of an attribute in the data
(Figures: example distributions – positively skewed data, symmetric data, negatively skewed data)

Descriptive Analytics:
Measuring Dispersion of Data
• The degree to which numerical data tend to spread
• Also referred to as the spread or variance of the data
• Common measures:
– Range
– The five-number summary (based on quartiles)
– The interquartile range (IQR)
– Standard deviation
• Range: the range of a set is the difference between the maximum and minimum values


Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles:
– The kth percentile:
• Let x1, x2, …, xN be a set of N values of an attribute
• The kth percentile of a set of data in numerical order is the value xi having the property that k percent of the data entries lie at or below xi
• The median is the 50th percentile
• The first quartile (Q1) is the 25th percentile
• The third quartile (Q3) is the 75th percentile
– The quartiles, including the median, give some indication of the centre, spread and shape of the distribution
• Interquartile range (IQR): the distance between the first and third quartiles
IQR = Q3 – Q1

Descriptive Analytics:
Measuring Dispersion of Data
• The five-number summary of a distribution:
– It consists of the minimum value, Q1, the median, Q3 and the maximum value
– Box plots are a popular way of visualising a distribution
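A minimal Pandas sketch of these measures (the attribute values reuse the price data from the binning example later in these slides; the code itself is illustrative and not part of the original slides):

import pandas as pd

# Illustrative attribute values (same numbers as the later binning example)
price = pd.Series([8, 15, 34, 24, 4, 21, 28, 21, 25])

# Central tendency
mean = price.mean()             # arithmetic mean
median = price.median()         # 50th percentile
mode = price.mode().iloc[0]     # most frequent value

# Dispersion
q1, q3 = price.quantile(0.25), price.quantile(0.75)
iqr = q3 - q1                                              # interquartile range
five_number = (price.min(), q1, median, q3, price.max())   # five-number summary
std = price.std()                                          # standard deviation

print(mean, median, mode, iqr, five_number, std)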


Data Cleaning: Handling Missing Values and Noisy Data

Data Cleaning (Data Cleansing)

• Real-world data tend to be incomplete, noisy and inconsistent
• Data cleaning routines attempt to identify missing values, fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data

• One of the biggest data cleaning tasks is handling missing values


Data Cleaning: Missing Values

• Many tuples (records) have no recorded value for several attributes
• Identifying missing values:
– When the Pandas library for Python is used, it detects missing values as “NaN” [1]
– It automatically treats a blank entry, “NaN”/“nan”/“NAN”, “NA”, “n/a” and “NULL”/“null” in an attribute value as NaN

[1] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
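A minimal sketch of how Pandas flags these entries (the file name data.csv is an assumption for illustration):

import pandas as pd

# read_csv treats blank fields, "NaN", "nan", "NAN", "NA", "n/a", "NULL", "null", etc. as NaN by default
df = pd.read_csv("data.csv")

print(df.isna().sum())             # number of missing values per attribute (column)
print(df[df.isna().any(axis=1)])   # tuples (rows) with at least one missing value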

Methods to Handle Missing Values

• Ignore the tuples:
– This method is effective only when a tuple contains several attributes (more than 50% of its attributes) with missing values
(Example: tuples with more than 50% of their attribute values missing)
– This method is also used when the target variable (class label) is missing
(Example: tuples whose target attribute (StationID) is missing)

Methods to Handle Missing Values

• Fill in the missing values (impute values) manually:
– Time consuming
– Not feasible for a large data set with many missing values

• Use a global constant to fill in missing values (impute a global constant):
– Replace all missing attribute values by the same constant
– The imputed value may not be correct


Methods to Handle Missing Values

• Use the attribute mean/median/mode to fill in the missing value (mean/median/mode imputation):
– Applicable to numeric data
– The centre of the data won’t change
– However, it does not preserve the relationship with other variables
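A minimal sketch of mean/median imputation with Pandas (the file and column names are assumptions for illustration):

import pandas as pd

df = pd.read_csv("data.csv")  # assumed input

# Mean imputation for a numeric attribute; median imputation works the same way
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())
# df["temperature"] = df["temperature"].fillna(df["temperature"].median())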


Methods to Handle Missing Values

• Filling with the local (group-wise) mean/median/mode:
– Use the attribute mean/median/mode of all samples belonging to a group (class) to fill in the missing value
• Applicable to numeric data
• The centre of the data within a group won’t change
• However, it does not preserve the relationship with other variables
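A minimal sketch of group-wise (local) imputation with Pandas (the group column StationID and the value column are assumptions for illustration):

import pandas as pd

df = pd.read_csv("data.csv")  # assumed input with a group/class attribute "StationID"

# Fill missing values with the mean of the attribute computed within each group (class)
df["temperature"] = df.groupby("StationID")["temperature"] \
                      .transform(lambda s: s.fillna(s.mean()))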



Methods to Handle Missing Values

• Use the values from the previous/next record (within a group) to fill in the missing value (padding)

• If the data is categorical or text, one can replace the missing values by the most frequent observation
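A minimal sketch of padding and most-frequent-value replacement with Pandas (file and column names are assumptions for illustration):

import pandas as pd

df = pd.read_csv("data.csv")  # assumed input, ordered so that neighbouring records are related

# Padding: propagate the previous value forward, or the next value backward
df["temperature"] = df["temperature"].ffill()                  # value from the previous record
df["humidity"] = df["humidity"].bfill()                        # value from the next record
df["pressure"] = df.groupby("StationID")["pressure"].ffill()   # padding within a group

# Categorical/text attribute: replace missing values with the most frequent observation
df["weather"] = df["weather"].fillna(df["weather"].mode()[0])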

Methods to Handle Missing Values

• Use the most probable value to fill in the missing value:
– Use an interpolation technique to predict the missing value
• Linear interpolation geometrically fits a straight line between two adjacent points on a graph or plane
• Interpolation happens column-wise
• Popular strategy
• It does not preserve the relationship with other variables
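A minimal sketch of interpolation with Pandas (file and column names are assumptions for illustration):

import pandas as pd

df = pd.read_csv("data.csv")  # assumed input

# Linear interpolation, applied column-wise: a missing value is estimated from the
# straight line joining the nearest non-missing values before and after it
df["temperature"] = df["temperature"].interpolate(method="linear")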


Methods to Handle Missing Values

• Use the most probable value to fill in the missing value:
– Use regression techniques to predict the missing value (regression imputation)
• Let y1, y2, …, yd be a set of d attributes
• Regression (multivariate): the nth value is predicted as x_n = f(y_{n1}, y_{n2}, …, y_{nd})
• Linear regression (multivariate): x_n = w_1 y_{n1} + w_2 y_{n2} + … + w_d y_{nd}
• Popular strategy
• It uses the most information from the available data to predict the missing values
• It preserves the relationship with other variables
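A minimal sketch of regression imputation, assuming scikit-learn is available; the attribute names are illustrative and the predictor attributes are assumed to have no missing values:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("data.csv")                   # assumed input
target, predictors = "x", ["y1", "y2", "y3"]   # attribute with gaps and the attributes used to predict it

known = df[df[target].notna()]                 # complete tuples used to fit the model
missing_rows = df[target].isna()

# Fit x = f(y1, ..., yd) on the complete tuples, then predict the missing entries
model = LinearRegression().fit(known[predictors], known[target])
df.loc[missing_rows, target] = model.predict(df.loc[missing_rows, predictors])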

Data Cleaning: Smoothing the Noisy Data

• Noise is a random error or variance in a measured variable
• Due to noise, many tuples (records) have incorrect values for several attributes
• Most real-world data contain noise
• Smooth out the data to remove noise
• Data smoothing allows important patterns to stand out
• The idea is to sharpen the patterns (values) in the data and highlight the trends the data points to

• Methods for data smoothing:
– Binning
– Regression


Binning Methods for Data Smoothing

• The binning method smooths the sorted values of a noisy attribute by consulting their neighbourhood, i.e., the values around them
• It performs local smoothing, as the method consults the neighbourhood of values
• The sorted values are partitioned into (almost) equal-frequency bins
• Smoothing by bin means:
– Each value in a bin is replaced by the mean value of the bin
• Smoothing by bin medians:
– Each value in a bin is replaced by the median value of the bin

Binning Methods for Data Smoothing

• Smoothing by bin boundaries:
– The minimum and maximum values in a given bin are identified as the bin boundaries
– Each bin value is then replaced by the closest boundary value
• The larger the bin width, the greater the smoothing effect
• Example:
• Noisy data for price (in Rs): 8, 15, 34, 24, 4, 21, 28, 21, 25
• Sorted data for price (in Rs): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into bins:          Bin1: 4, 8, 15    Bin2: 21, 21, 24    Bin3: 25, 28, 34
Smoothing by bin means:       Bin1: 9, 9, 9     Bin2: 22, 22, 22    Bin3: 29, 29, 29
Smoothing by bin boundaries:  Bin1: 4, 4, 15    Bin2: 21, 21, 24    Bin3: 25, 25, 34
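A minimal NumPy sketch that reproduces the example above (equal-frequency bins of the sorted prices, then smoothing by bin means and by bin boundaries):

import numpy as np

price = np.sort(np.array([8, 15, 34, 24, 4, 21, 28, 21, 25]))
bins = price.reshape(3, 3)                 # three equal-frequency bins of the sorted values

# Smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = np.repeat(bins.mean(axis=1), 3).reshape(3, 3)

# Smoothing by bin boundaries: every value is replaced by the closer of the bin min/max
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)    # rows: 9 9 9 / 22 22 22 / 29 29 29
print(by_bounds)   # rows: 4 4 15 / 21 21 24 / 25 25 34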


Data Integration

Data Integration
• Data integration is the process of combining the data
from multiple sources into a coherent data store
• These sources may include multiple databases or flat
files
• Issues to consider during data integration:
– Schema integration (entity matching)
– Data value conflict
– Redundancy


Schema Integration (Entity Matching)

• Entity: each entity in a real-world problem corresponds to an attribute in the database
• Addresses the questions
– “How can equivalent real-world entities from multiple sources be matched up?”
– “How can data analysts be sure that they are the same?”
• Attribute names may conflict across the multiple sources of data
– Example: customer_id, customer_num, cust_num
• Entity identification problem:
– Metadata is associated with each attribute
– Metadata includes:
• Name
• Meaning
• Data type
• Range of permitted values

Data Value Conflict

• Issue: detection and resolution of data value conflicts
• For the same real-world entity, attribute values from different sources may differ
• This may be due to differences in representation, scaling or encoding
• Examples:
– A “weight” attribute may be stored in metric units (gram, kilogram, etc.) in one system and in British imperial units (pound, ounce, etc.) in another system
– In a database for a hotel chain operating in different countries, a “price of room” attribute may store the price in different currencies
– Categorical data: “gender” may be stored as male and female, or as M and F


Redundancy
• A major issue to be addressed
• Sources of redundancy:
– An attribute may be redundant if it can be derived from another attribute or set of attributes
• Example: the attribute “Total Marks”
– Inconsistency in attribute naming can also cause redundancy in the resulting data sets
• Two types of redundancy:
– Redundancy between attributes
– Redundancy at the tuple level
• Duplication of tuples

Redundancy Between Attributes

• Two attributes may be related or dependent
• Detected by correlation analysis
• Correlation analysis measures how strongly one attribute implies (is related to) the other, based on the available data
• Correlation analysis for numerical attributes:
– Compute the correlation coefficient between two attributes A and B (known as Pearson’s product moment coefficient or Pearson’s correlation coefficient)


Redundancy Between
Numerical Attributes
• Pearson’s correlation coefficient (ρA,B):
\rho_{A,B} = \frac{\sum_{i=1}^{N} (a_i - \mu_A)(b_i - \mu_B)}{N \sigma_A \sigma_B}
– N: number of tuples
– a_i and b_i: respective values of A and B in tuple i
– μ_A and μ_B: respective mean values of A and B
– σ_A and σ_B: respective standard deviations of A and B

• Note: -1 \le \rho_{A,B} \le +1
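A minimal Pandas sketch of this correlation check (file and column names are assumptions; numeric_only requires a reasonably recent Pandas version):

import pandas as pd

df = pd.read_csv("data.csv")   # assumed input with numeric attributes "A" and "B"

# Pearson's correlation coefficient between two attributes
rho = df["A"].corr(df["B"], method="pearson")
print(rho)

# Full correlation matrix over all numeric attributes (useful for spotting redundant pairs)
print(df.corr(numeric_only=True))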

Redundancy Between Numerical Attributes:
Pearson’s Correlation Coefficient
• If ρA,B is greater than 0, then attributes A and B are positively correlated
– The values of A increase as the values of B increase, and vice versa
– The higher the value, the stronger the correlation
– A high correlation value may indicate that A (or B) can be removed as redundant
• If ρA,B is equal to 0, then attributes A and B have no correlation between them (they may be independent)
• If ρA,B is less than 0, then attributes A and B are negatively correlated
– The values of A increase as the values of B decrease, and vice versa
– Each attribute discourages the other


Redundancy Between Numerical Attributes:
Pearson’s Correlation Coefficient
• Scatter plots can also be used to view the correlation between numerical attributes

Redundancy Between
Categorical (Discrete) Attributes
• A correlation relationship between two categorical attributes A and B can be discovered by the χ2 (chi-square) test
• Steps in the χ2 (chi-square) test:
– Identify the two categorical attributes
– Null hypothesis: the two attributes are independent (not related)
– Complete the contingency matrix (table) with observed and expected frequencies (counts)
– Calculate the observed χ2 value from the contingency matrix
– Use the standard χ2 table to compare the observed χ2 value with the critical χ2 value for the problem’s degrees of freedom and significance level (p-value)
• If the observed χ2 value < the critical χ2 value, then the attributes are not related (the null hypothesis cannot be rejected)


Redundancy Between
Categorical (Discrete) Attributes
• A correlation relationship between two categorical attributes A and B can be discovered by the χ2 (chi-square) test
• Suppose attribute A has nA distinct values (a1, a2, …, ai, …, anA)
• Suppose attribute B has nB distinct values (b1, b2, …, bj, …, bnB)
• The data tuples described by attributes A and B can be shown as a contingency table:
– the nA distinct values of A make up the rows
– the nB distinct values of B make up the columns
• (Ai, Bj) denotes the joint event that A takes its ith distinct value and B takes its jth distinct value

Redundancy Between
Categorical (Discrete) Attributes
• The observed χ2 (chi-square) value (Pearson χ2 statistic) is computed as
\chi^2 = \sum_{i=1}^{n_A} \sum_{j=1}^{n_B} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}
– o_ij: observed frequency (actual count) of the joint event (Ai, Bj)
– e_ij: expected frequency (count) of the joint event (Ai, Bj)
• The expected frequency e_ij is computed as
e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}
– N: number of tuples
– count(A = ai): the number of tuples having value ai for A
– count(B = bj): the number of tuples having value bj for B


Redundancy Between
Categorical (Discrete) Attributes
• The χ2 statistic tests the hypothesis that A and B are independent (the null hypothesis)
• The test is based on a significance level (p-value), with (nA − 1) × (nB − 1) degrees of freedom
• If the hypothesis can be rejected, then we say that A and B are statistically related or associated for the given data set

Redundancy Between
Categorical Attributes: Illustration
• A group of 1500 people is surveyed
• The gender of each person is noted
• Each person is polled as to whether their preferred type of reading material is fiction or non-fiction
• This leads to two attributes, gender and preferred_reading
– gender takes two distinct values, male and female
– preferred_reading takes two distinct values, fiction and non-fiction
• The size of the contingency matrix is 2 x 2

                   male (b1)   female (b2)   Total
fiction (a1)       250 (o11)    200 (o12)     450
non-fiction (a2)    50 (o21)   1000 (o22)    1050
Total               300         1200         1500


Redundancy Between
Categorical Attributes: Illustration

                   male (b1)               female (b2)               Total
fiction (a1)       250 (o11), 90 (e11)     200 (o12), 360 (e12)       450
non-fiction (a2)    50 (o21), 210 (e21)   1000 (o22), 840 (e22)      1050
Total               300                    1200                      1500

• The expected frequencies (counts) e_ij are shown next to the observed counts o_ij
• The χ2 value is computed as
\chi^2 = \frac{(o_{11}-e_{11})^2}{e_{11}} + \frac{(o_{12}-e_{12})^2}{e_{12}} + \frac{(o_{21}-e_{21})^2}{e_{21}} + \frac{(o_{22}-e_{22})^2}{e_{22}}
       = \frac{(250-90)^2}{90} + \frac{(200-360)^2}{360} + \frac{(50-210)^2}{210} + \frac{(1000-840)^2}{840}
       = 507.93

Redundancy Between
Categorical Attributes: Illustration
• For a 2 x 2 contingency table, the degrees of freedom are (2 − 1) × (2 − 1) = 1
• Obtain the critical χ2 value at the 0.05 significance level, i.e. p = 0.05 (95% confidence), with 1 degree of freedom
– The critical χ2 value is 3.841 (taken from the table of the χ2 distribution)
• The computed χ2 value for the given population is 507.93
• The computed value is far above 3.841
– We reject the hypothesis that gender and preferred_reading are independent
• Conclusion: the two attributes (gender and preferred_reading) are strongly correlated for the given group of people
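A minimal sketch of the same test with SciPy (assuming SciPy is available; correction=False gives the plain Pearson χ2 statistic used above, without the Yates continuity correction):

import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the survey: rows = fiction / non-fiction, columns = male / female
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(chi2, dof)        # about 507.93 with 1 degree of freedom
print(expected)         # [[ 90. 360.] [210. 840.]]
print(p_value < 0.05)   # True -> reject the hypothesis that the attributes are independent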


Data Transformation

Data Transformation
• The data are transformed or consolidated into forms appropriate for data modelling
• Data transformation involves
– Smoothing:
• Used for removing noise
• Techniques: binning, regression, clustering
– Aggregation:
• Summary or aggregation operations are applied to the data
• Enables analysis of the data at multiple granularities
• Example: daily sales data, monthly sales data (aggregated from the daily data)
– Attribute construction (feature construction):
• New attributes are constructed from the raw data to help the mining process
– Normalization and standardization


Attribute Normalization
• In the context of machine learning, it is termed feature normalization
• An attribute is normalised by scaling its values so that they fall within a small specified range (for example 0.0 to 1.0)
• Normalization is particularly useful for classification algorithms involving distance measurements and for clustering
• For distance-based approaches, normalization helps prevent attributes with large ranges from outweighing attributes with smaller ranges


Illustration (before normalization)
• Two attributes with very different ranges: x1 with minimum 2300.00 and maximum 73567.00, and x2 with minimum 2 and maximum 8
• The reference record is (y1, y2) = (23000.00, 6.5)
• Euclidean distance (here computed without the square root): ED = \sum_{i=1}^{d} (x_i - y_i)^2
• ED1 = (23500.00 − 23000.00)^2 + (8 − 6.5)^2 = 250002.25
• ED2 = (23500.00 − 23000.00)^2 + (6 − 6.5)^2 = 250000.25
• ED3 = (22879.00 − 23000.00)^2 + (2 − 6.5)^2 = 14661.25
• The attribute with the large range (x1) dominates the distance; the contribution of x2 is negligible


Attribute Normalization: Min-Max Normalization

• It performs a linear transformation on the original data
• The transformed data is a scaled version of the original data, so that the values fall within a small specified range
• Each numeric attribute in the data is normalised separately
• Steps:
– Compute the minimum (mnA) and maximum (mxA) values of an attribute A
– Specify the new minimum (new_mnA) and new maximum (new_mxA) of the range
x' = \frac{x - mn_A}{mx_A - mn_A} (new\_mx_A - new\_mn_A) + new\_mn_A

Attribute Normalization: Min-Max Normalization

• When the new minimum (new_mnA) and new maximum (new_mxA) are 0 and 1 respectively, the data is scaled to the 0.0 to 1.0 range:
x' = \frac{x - mn_A}{mx_A - mn_A}


Attribute Normalization: Min-Max Normalization

• Min-max normalization preserves the relationships among the original data values
• It is useful when the data has varying ranges among attributes
• It is useful when the machine learning (ML) algorithm being used makes no assumption about the distribution of the data
• It is useful when the actual minimum and maximum values of the attribute are known
• Disadvantage: an “out-of-bounds” error occurs if a future input case falls outside the original range of attribute A
– This situation arises when the actual minimum and maximum of attribute A are unknown
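A minimal Pandas sketch of min-max normalization (the file name is an assumption; each numeric attribute is rescaled separately):

import pandas as pd

df = pd.read_csv("data.csv")              # assumed input
new_min, new_max = 0.0, 1.0               # target range

num = df.select_dtypes(include="number")  # normalise each numeric attribute separately
df[num.columns] = (num - num.min()) / (num.max() - num.min()) * (new_max - new_min) + new_min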


Illustration of Min-Max Normalization
(Example tables: each numeric attribute is rescaled so that its minimum maps to 0.000 and its maximum maps to 1.000; e.g. for the attribute with min 2300.00 and max 73567.00 the value 23000.00 maps to 0.2905, and for the attribute with min 2 and max 8 the value 6.5 maps to 0.75)

Illustration (after Min-Max normalization)
• After normalization, both attributes lie in the range 0.00 to 1.00
• The reference record becomes (y1, y2) = (0.2905, 0.75)
• ED1 = (0.2975 − 0.2905)^2 + (1.0 − 0.75)^2 = 0.06255
• ED2 = (0.2975 − 0.2905)^2 + (0.6667 − 0.75)^2 = 0.00699
• ED3 = (0.2888 − 0.2905)^2 + (0.0 − 0.75)^2 = 0.56250
• Both attributes now contribute comparably to the distance; neither dominates merely because of its range

Data Standardization
(z-score Normalization)
• The process of rescaling one or more attributes so that the transformed data have zero mean and unit variance, i.e. a standard deviation of 1
• Standardization assumes that the data has a Gaussian distribution
– This assumption does not strictly have to be true, but the technique is more effective if the attribute distribution is Gaussian
• In this process, the values of an attribute A are normalised based on the mean and standard deviation of A:
x' = \frac{x - \mu_A}{\sigma_A}
– μ_A: mean of attribute A
– σ_A: standard deviation of attribute A


Data Standardization
(z-score Normalization)
• This method of normalization is useful
– when the actual minimum and maximum of attribute A are unknown
– when there are outliers that dominate a min-max normalization
– when the data has a Gaussian (symmetric) distribution
• This method of normalization is also useful when the ML algorithm makes an assumption of Gaussian-distributed data
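A minimal Pandas sketch of z-score standardization (the file name is an assumption; each numeric attribute is standardized separately):

import pandas as pd

df = pd.read_csv("data.csv")              # assumed input

num = df.select_dtypes(include="number")
df[num.columns] = (num - num.mean()) / num.std()
# After this, each standardized attribute has mean ~0 and standard deviation ~1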

Illustration of Data Standardization
(z-score Normalization)
(Example table: each attribute is rescaled using its own mean μ and standard deviation σ, e.g. μ = 23.80035 and σ = 1.58225 for the first attribute; after standardization every attribute has mean 0 and standard deviation 1)


Data Reduction

Data Reduction
• Data reduction techniques are applied to obtain a reduced representation of the dataset that is much smaller in volume, yet closely maintains the integrity of the original data
• Mining on the reduced dataset should produce the same, or almost the same, analytical results
• Different strategies:
– Attribute subset selection (feature selection):
• Irrelevant, weakly relevant or redundant attributes (dimensions) are detected and removed
– Dimensionality reduction:
• Encoding mechanisms are used to reduce the dataset size


Attribute (Feature) Subset Selection

• In the context of machine learning, it is termed feature subset selection
• Irrelevant or redundant features are detected using correlation analysis
• Two strategies (see the sketch below):
– First strategy:
• Perform correlation analysis between every pair of attributes
• Drop one of the two attributes when they are highly correlated
– Second strategy:
• Perform correlation analysis between each attribute and the target attribute
• Drop the attributes that are weakly correlated with the target attribute
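A minimal Pandas sketch of both strategies (the file name, the "target" attribute and the thresholds are illustrative assumptions; the target attribute is assumed to be numeric):

import pandas as pd

df = pd.read_csv("data.csv")                # assumed input
pair_threshold, target_threshold = 0.9, 0.1

corr = df.corr(numeric_only=True).abs()
cols = corr.columns.drop("target")

# Strategy 1: drop one attribute from every highly correlated pair
to_drop = set()
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if corr.loc[a, b] > pair_threshold and b not in to_drop:
            to_drop.add(b)

# Strategy 2: also drop attributes that are only weakly correlated with the target
weak = [c for c in cols if corr.loc[c, "target"] < target_threshold]

reduced = df.drop(columns=list(to_drop.union(weak)))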

Dimensionality Reduction
• Data encoding or transformations are applied so as to obtain a reduced or compressed representation of the original data
(Figure: Data → Feature Extraction → representation x of dimension d → Dimension Reduction → reduced representation a of dimension l → Pattern Analysis Task)

• If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless
• If only an approximation of the original data can be reconstructed from the compressed data, then the data reduction is called lossy
• One of the popular and effective methods of lossy dimensionality reduction is principal component analysis (PCA)


Tuple (Data Vector) – Attribute (Dimension)

• A tuple (one row) is referred to as a vector
• An attribute is referred to as a dimension
• In this example (a 20 x 5 data table):
– Number of vectors = number of rows = 20
– Dimension of a vector = number of attributes = 5
– Size of the data matrix is 20 x 5

Principal Component Analysis (PCA)

• Suppose the data to be reduced consists of N tuples (data vectors) described by d attributes (d dimensions):
D = \{x_n\}_{n=1}^{N}, \quad x_n \in R^d
• Let q_i, i = 1, 2, …, d, be d orthonormal vectors in the d-dimensional space, q_i \in R^d
– These are unit vectors that each point in a direction perpendicular to the others:
q_i^T q_j = 0 \; (i \ne j), \quad q_i^T q_i = 1
• PCA searches for l orthonormal vectors that can best be used to represent the data, where l < d


Principal Component Analysis (PCA)

• These orthonormal vectors are also called the directions of projection
• The original data (each tuple or data vector x_n) is projected onto each of the l orthonormal vectors to obtain the principal components:
a_{ni} = q_i^T x_n, \quad i = 1, 2, …, l
– a_{ni} is the ith principal component of x_n
• This transforms each d-dimensional vector (tuple) into an l-dimensional vector:
x_n = [x_{n1}, x_{n2}, …, x_{nd}]^T \;\rightarrow\; a_n = [a_{n1}, a_{n2}, …, a_{nl}]^T

Principal Component Analysis (PCA)

• Thus the original data is projected onto a much smaller space, resulting in dimensionality reduction
• It combines the essence of the attributes by creating an alternative, smaller set of variables (attributes)
• It is possible to reconstruct a good approximation of the original data x_n as a linear combination of the directions of projection q_i and the principal components a_{ni}:
\hat{x}_n = \sum_{i=1}^{l} a_{ni} q_i
• The Euclidean distance between the original and approximated tuples gives the reconstruction error:
Error = \| x_n - \hat{x}_n \| = \sqrt{\sum_{i=1}^{d} (x_{ni} - \hat{x}_{ni})^2}


PCA: Basic Procedure

• Given: data with N samples, D = \{x_n\}_{n=1}^{N}, x_n \in R^d
• Remove the mean of each attribute (dimension) from the data samples (tuples)
• Construct a data matrix X using the mean-subtracted samples, X \in R^{N \times d}
– Each row of the matrix X corresponds to one sample (tuple or data vector)
• Compute the correlation matrix C = X^T X
• Perform the eigen analysis of the correlation matrix C:
C q_i = \lambda_i q_i, \quad i = 1, 2, …, d
– As the correlation matrix is a symmetric matrix,
• the eigenvalues λ_i are distinct and non-negative
• the eigenvectors q_i corresponding to the eigenvalues are orthonormal vectors
• the eigenvalues indicate the variance or strength along the corresponding eigenvectors

PCA: Basic Procedure

• Project x_n onto each of the directions (eigenvectors) to obtain the principal components:
a_{ni} = q_i^T x_n, \quad i = 1, 2, …, d
– a_{ni} is the ith principal component of x_n
• Thus, each training example x_n is transformed to a new representation a_n by projecting it onto the d orthonormal basis vectors (eigenvectors):
x_n = [x_{n1}, x_{n2}, …, x_{nd}]^T \;\rightarrow\; a_n = [a_{n1}, a_{n2}, …, a_{nd}]^T
• It is possible to reconstruct the original data x_n without error as a linear combination of the directions of projection q_i and the principal components a_{ni}:
x_n = \sum_{i=1}^{d} a_{ni} q_i


PCA for Dimension Reduction

• In general, we are interested in representing the data using fewer dimensions such that the data has high variance along these dimensions
• Idea: select l out of the d orthonormal basis vectors (eigenvectors) that capture a high variance of the data (i.e. more information content)
• Rank order the eigenvalues (λ_i’s) such that
\lambda_1 \ge \lambda_2 \ge … \ge \lambda_d
• Based on Definition 1, consider the l (l << d) eigenvectors corresponding to the l significant eigenvalues
– Definition 1: let λ_1, λ_2, …, λ_d be the eigenvalues of a d x d matrix A. λ_1 is called the dominant (significant) eigenvalue of A if |λ_1| ≥ |λ_i|, i = 1, 2, …, d

PCA for Dimension Reduction

• Project x_n onto each of the l directions (eigenvectors) to obtain the reduced-dimensional representation:
a_{ni} = q_i^T x_n, \quad i = 1, 2, …, l
• Thus, each training example x_n is transformed to a new reduced-dimensional representation a_n by projecting it onto the l orthonormal basis vectors (eigenvectors):
x_n = [x_{n1}, x_{n2}, …, x_{nd}]^T \;\rightarrow\; a_n = [a_{n1}, a_{n2}, …, a_{nl}]^T
• The eigenvalue λ_i corresponds to the variance of the projected data


PCA for Dimension Reduction

• Since the strongest l directions are used for obtaining the reduced-dimensional representation, it should be possible to reconstruct a good approximation of the original data
• An approximation of the original data x_n is obtained as a linear combination of the directions of projection (the strongest eigenvectors) q_i and the principal components a_{ni}:
\hat{x}_n = \sum_{i=1}^{l} a_{ni} q_i

Illustration: PCA
• Atmospheric data:
– N = number of tuples (data vectors) = 20
– d = number of attributes (dimensions) = 5
• Mean of each dimension: 23.42, 93.63, 1003.55, 448.88, 14.4


Illustration: PCA
• Step 1: Subtract the mean from each attribute
• Step 2: Compute the correlation matrix from the data matrix
• Step 4: Perform eigen analysis on the correlation matrix to get the eigenvalues and eigenvectors
• Step 5: Sort the eigenvalues in descending order
• Step 6: Arrange the eigenvectors in the descending order of their corresponding eigenvalues
• Step 7: Consider the two leading (significant) eigenvalues and their corresponding eigenvectors
• Step 8: Project the mean-subtracted data matrix onto the two eigenvectors corresponding to the leading eigenvalues


Eigenvalues and Eigenvectors

• What happens when a vector is multiplied by a matrix?
• The vector gets transformed into a new vector: its direction changes
• The vector may also get scaled (elongated or shortened) in the process
• Example: for A = [[1, 2], [2, 1]] and q = [1, 3]^T, the product Aq = [7, 5]^T points in a different direction from q

Eigenvalues and Eigenvectors

• For a given square symmetric matrix A, there exist special vectors which do not change direction when multiplied by A
• These vectors are called eigenvectors
• More formally, A q = \lambda q, where λ is the eigenvalue
– Example: for A = [[1, 2], [2, 1]] and q = [1, 1]^T, Aq = [3, 3]^T = 3q, so q is an eigenvector with eigenvalue λ = 3
• The eigenvalue indicates the magnitude (scaling) of the eigenvector
• The vector only gets scaled; it does not change its direction
• So what is so special about eigenvalues and eigenvectors?


Linear Algebra: Basic Definitions

• Basis: a set of vectors in R^d is called a basis if
– the vectors are linearly independent, and
– every vector in R^d can be expressed as a linear combination of these basis vectors
• Linearly independent vectors:
– A set of d vectors q_1, q_2, …, q_d is linearly independent if no vector in the set can be expressed as a linear combination of the remaining d − 1 vectors
– In other words, the only solution to c_1 q_1 + c_2 q_2 + … + c_d q_d = 0 is c_1 = c_2 = … = c_d = 0
• Here the c_i are scalars

Linear Algebra: Basic Definitions

• For example, consider the space R^2 and the vectors q_1 = [1, 0]^T and q_2 = [0, 1]^T
• Any vector z = [z_1, z_2]^T can be expressed as a linear combination of these two vectors:
[z_1, z_2]^T = z_1 [1, 0]^T + z_2 [0, 1]^T
• Further, q_1 and q_2 are linearly independent
– The only solution to c_1 q_1 + c_2 q_2 = 0 is c_1 = c_2 = 0


Linear Algebra: Basic Definitions

• It turns out that q_1 and q_2 are the unit vectors in the directions of the coordinate axes
• And indeed we are used to representing all vectors in R^2 as a linear combination of these two vectors

Linear Algebra: Basic Definitions

• We could have chosen any 2 linearly independent vectors in R^2 as the basis vectors
• For example, consider the linearly independent vectors q_1 = [4, 2]^T and q_2 = [5, 7]^T
• Any vector z = [z_1, z_2]^T can be expressed as a linear combination of these two vectors:
z = \lambda_1 q_1 + \lambda_2 q_2, \quad i.e. \; z_1 = 4\lambda_1 + 5\lambda_2, \;\; z_2 = 2\lambda_1 + 7\lambda_2
• We can find λ_1 and λ_2 by solving this system of linear equations


Linear Algebra: Basic Definitions

• In general, given a set of linearly independent vectors q_1, q_2, …, q_d ∈ R^d, we can express any vector z ∈ R^d as a linear combination of these vectors:
z = \lambda_1 q_1 + \lambda_2 q_2 + … + \lambda_d q_d
• In matrix form, z = Q \lambda, where Q is the d x d matrix whose columns are the basis vectors q_1, q_2, …, q_d and \lambda = [\lambda_1, \lambda_2, …, \lambda_d]^T

Linear Algebra: Basic Definitions

• Now suppose we have an orthonormal basis: q_i^T q_i = 1 and q_i^T q_j = 0 for i ≠ j
• We can express any vector z ∈ R^d as a linear combination of these vectors:
z = \lambda_1 q_1 + \lambda_2 q_2 + … + \lambda_d q_d
• Multiplying both sides by q_1^T:
q_1^T z = \lambda_1 q_1^T q_1 + \lambda_2 q_1^T q_2 + … + \lambda_d q_1^T q_d = \lambda_1
• Similarly, \lambda_2 = q_2^T z, …, \lambda_d = q_d^T z
• An orthogonal basis is the most convenient basis that one can hope for: each coefficient is obtained by a simple projection


Eigenvalues and Eigenvectors

• What does any of this have to do with eigenvectors?
• Eigenvectors can form a basis
• Theorem 1: the eigenvectors of a matrix A ∈ R^{d x d} having distinct eigenvalues are linearly independent
• Theorem 2: the eigenvectors of a square symmetric matrix are orthogonal
• Definition 1: let λ_1, λ_2, …, λ_d be the eigenvalues of a d x d matrix A. λ_1 is called the dominant (significant) eigenvalue of A if |λ_1| ≥ |λ_i|, i = 1, 2, …, d
• We will put all of this to use for principal component analysis

Principal Component Analysis (PCA)

• Each point (vector) here is represented using a linear combination of the x1 and x2 axes
• In other words, we are using p1 and p2 as the basis
(Figure: a 2-D scatter of data points plotted with respect to the p1 and p2 axes)


Principal Component Analysis (PCA)

• Let us consider orthonormal vectors q1 and q2 as the basis instead of p1 and p2
• We observe that all the points have a very small component in the direction of q2 (almost noise)
(Figure: the same scatter with a rotated basis – q1 along the direction of maximum spread, q2 perpendicular to it)

Principal Component Analysis (PCA)

• Now the same data can be represented in one dimension, in the direction of q1, by making a smarter choice for the basis
• Why do we not care about q2?
– The variance of the data in this direction is very small
– All data points have almost the same value in the q2 direction


Principal Component Analysis (PCA)

• If we were to build a classifier on top of this data, then q2 would not contribute to the classifier
– The points are not distinguishable along this direction
• In general, we are interested in representing the data using fewer dimensions such that
– the data has high variance along these dimensions
– the dimensions are linearly independent (uncorrelated)

PCA: Basic Procedure

• Given: data with N samples, D = \{x_n\}_{n=1}^{N}, x_n \in R^d
1. Remove the mean of each attribute (dimension) from the data samples (tuples)
2. Construct a data matrix X using the mean-subtracted samples, X \in R^{N \times d}
– Each row of the matrix X corresponds to one sample (tuple)
3. Compute the correlation matrix C = X^T X
4. Perform the eigen analysis of the correlation matrix C:
C q_i = \lambda_i q_i, \quad i = 1, 2, …, d
– As the correlation matrix is a symmetric matrix,
• the eigenvalues λ_i are distinct and non-negative
• the eigenvectors q_i corresponding to the eigenvalues are orthonormal vectors
• the eigenvalues indicate the variance or strength along the corresponding eigenvectors


PCA for Dimension Reduction

• In general, we are interested in representing the data using fewer dimensions such that the data has high variance along these dimensions
5. Rank order the eigenvalues (λ_i’s) such that
\lambda_1 \ge \lambda_2 \ge … \ge \lambda_d
6. Consider the l (l << d) eigenvectors corresponding to the l significant eigenvalues
7. Project x_n onto each of the l directions (eigenvectors) to obtain the reduced-dimensional representation:
a_{ni} = q_i^T x_n, \quad i = 1, 2, …, l

PCA for Dimension Reduction

7. Thus, each training example x_n is transformed to a new reduced-dimensional representation a_n by projecting it onto the l orthonormal basis vectors:
x_n = [x_{n1}, x_{n2}, …, x_{nd}]^T \;\rightarrow\; a_n = [a_{n1}, a_{n2}, …, a_{nl}]^T
• The new reduced representation a_n is uncorrelated across its components
• The eigenvalue λ_i corresponds to the variance of the projected data


Illustration: PCA
• Handwritten Digit Image [1]:
– Size of each image: 28 x 28
– Dimension after linearizing: 784
– Total number of training examples: 5000 (500 per class)

[1] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-Based Learning Applied to Document Recognition,” Intelligent Signal Processing, 306-351, IEEE Press, 2001

Illustration: PCA
• Handwritten Digit Image:
– All 784 Eigenvalues



Illustration: PCA
• Handwritten Digit Image:
– Leading 100 Eigenvalues


Illustration: PCA-Reconstructed Images
(Figure: an original digit image alongside reconstructions using l = 1, l = 20 and l = 100 principal components)
