Class3-9 DataPreprocessing 22Aug-06Sept2019
Data Preprocessing
Data Science
• Multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data
• The central concept is gaining insight from data
• Machine learning uses data to extract knowledge
Data Preprocessing
[Figure: data preprocessing pipeline: Database → Data Cleaning and Cleansing → Feature Representation]
Data Preprocessing
• Real-world data tend to be incomplete, noisy and inconsistent due to their huge size and their likely origin from multiple heterogeneous sources
• Preprocessing is important to clean the data
• Low-quality data will lead to low-quality analysis results
• If users believe the data are of low quality (dirty), they are unlikely to trust the results of any data analytics applied to them
• Low-quality data can confuse analytic procedures that use machine learning techniques, resulting in unreliable output
• Incomplete, noisy and inconsistent data are common properties of large real-world databases
Tuple (Record)
• A tuple (record) is a finite ordered list (sequence) of elements, where each element belongs to an attribute
[Figure: a database table with one row highlighted as a tuple (record)]
Incomplete Data
• Many tuples (records) have no recorded value for several attributes
• Reasons for incomplete data:
– The user forgot to fill in a field
– The user chose not to fill out the field because it was not considered important at the time of entry
– Relevant data may not have been recorded due to equipment malfunction
– Data might have been lost while being transferred from the place of recording
– Data may not have been recorded due to a programming error
– Data might not have been recorded due to technology limitations such as limited memory
Noisy Data
• Many tuples (records) have incorrect values for several attributes
• Reasons for noisy data:
– Human or computer errors occurring during data entry
– Faulty data collection instruments
– Errors in data transmission
– Technology limitations, such as a limited buffer size for coordinating synchronised data transfer and consumption
Inconsistent Data
• Data containing discrepancies in the stored values of some attributes
• Reasons for inconsistent data:
– Inconsistencies in naming conventions or data codes, or inconsistent formats of input fields such as dates
– Inconsistent naming conventions or input-field formats across the sources being integrated
– Human or computer errors occurring during data entry
Descriptive Analytics:
Measuring Central Tendency
• Mean:
– Let x1, x2, …, xN be a set of N values of an attribute. The mean of this set of values is given by

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
Descriptive Analytics:
Measuring Central Tendency
• Mode: Most frequent value in an attribute in the data
Descriptive Analytics:
Measuring Dispersion of Data
• The degree to which numerical data tend to spread
• Also referred to as the variability of the data
• Common measures:
– Range
– The five-number summary (based on quartiles)
– The interquartile range (IQR)
– Standard deviation
• Range: The range of a set is the difference between the maximum and minimum values
Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles:
– The kth percentile:
• Let x1, x2, …, xN be a set of N values of an attribute
• The kth percentile of a set of data in numerical order is the value xi having the property that k percent of the data entries lie at or below xi
• The median is the 50th percentile
• The first quartile (Q1) is the 25th percentile
• The third quartile (Q3) is the 75th percentile
– The quartiles, including the median, give some indication of the centre, spread and shape of the distribution
• Interquartile range (IQR): the distance between the first and third quartiles
IQR = Q3 − Q1
Descriptive Analytics:
Measuring Dispersion of Data
• The five-number summary of a distribution:
– It consists of the minimum value, Q1, the median, Q3 and the maximum value
– Box plots are a popular way of visualising a distribution
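A minimal sketch of these descriptive measures in pandas; the attribute name "marks" and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical attribute values for illustration
marks = pd.Series([45, 52, 58, 61, 64, 64, 73, 79, 85, 92], name="marks")

print(marks.mean())                # mean
print(marks.mode())                # mode: most frequent value(s)
print(marks.max() - marks.min())   # range
q1, median, q3 = marks.quantile([0.25, 0.5, 0.75])
print(q1, median, q3, q3 - q1)     # quartiles and IQR
print(marks.describe())            # includes the five-number summary
```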
[1] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
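The reference above points to pandas' read_csv. A minimal sketch of loading a dataset with it and inspecting it for the problems discussed above; the file name students.csv is an assumption for illustration:

```python
import pandas as pd

# Hypothetical file name, for illustration only
df = pd.read_csv("students.csv")

print(df.head())        # first few tuples (records)
print(df.shape)         # number of tuples and attributes
print(df.isna().sum())  # missing values per attribute (incomplete data)
```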
[Figure: a model f(·) maps a d-dimensional input x to an output y]
Data Integration
• Data integration is the process of combining data from multiple sources into a coherent data store
• These sources may include multiple databases or flat files
• Issues to consider during data integration:
– Schema integration (entity matching)
– Data value conflicts
– Redundancy
Redundancy
• A major issue to be addressed during integration
• Sources of redundancy:
– An attribute may be redundant if it can be derived from another attribute or set of attributes
• Example: an attribute “Total Marks” that can be derived from the individual mark attributes
– Inconsistency in attribute naming can also cause redundancy in the resulting data sets
• Two types of redundancy:
– Redundancy between attributes
– Redundancy at the tuple level
• Duplication of tuples
Redundancy Between
Numerical Attributes
• Pearson’s correlation coefficient (ρA,B):

$$\rho_{A,B} = \frac{1}{N\,\sigma_A \sigma_B}\sum_{i=1}^{N}(a_i - \mu_A)(b_i - \mu_B)$$

– N: number of tuples
– ai and bi: respective values of A and B in tuple i
– μA and μB: respective mean values of A and B
– σA and σB: respective standard deviations of A and B
• Note: −1 ≤ ρA,B ≤ 1
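A minimal sketch of checking for redundancy between two numerical attributes with NumPy; the attribute values below are made up for illustration:

```python
import numpy as np

# Hypothetical attribute values: B is roughly twice A, so they are highly correlated
A = np.array([10.0, 12.0, 15.0, 18.0, 22.0, 25.0])
B = np.array([21.0, 23.5, 30.0, 35.5, 44.5, 50.0])

rho = np.corrcoef(A, B)[0, 1]  # Pearson's correlation coefficient
print(rho)  # close to +1: one attribute is largely redundant given the other
```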
Redundancy Between
Categorical (Discrete) Attributes
• The correlation relationship between two categorical attributes A and B can be discovered by the χ2 (chi-square) test
• Steps in the χ2 (chi-square) test:
– Identify the two categorical attributes
– Null hypothesis: the two attributes are independent (not related)
– Complete the contingency matrix (table) with the observed and expected frequencies (counts)
– Calculate the observed χ2 value from the contingency matrix
– Use the standard χ2 table to compare the observed χ2 value to the critical χ2 value for the problem’s degrees of freedom and significance level (p-value)
• If the observed χ2 value < critical χ2 value, then the attributes are not related (the null hypothesis holds)
Redundancy Between
Categorical (Discrete) Attributes
• The correlation relationship between two categorical attributes A and B can be discovered by the χ2 (chi-square) test
• Suppose attribute A has nA distinct values (a1, a2, …, ai, …, anA)
• Suppose attribute B has nB distinct values (b1, b2, …, bj, …, bnB)
• The data tuples described by attributes A and B can be shown as a contingency table:
– The nA distinct values of A make up the rows
– The nB distinct values of B make up the columns

        b1   b2   …   bnB
  a1
  a2
  …
  anA

• (Ai, Bj) denotes the joint event that A takes its ith distinct value and B takes its jth distinct value
Redundancy Between
Categorical (Discrete) Attributes
• The observed χ2 (chi-square) value (the Pearson χ2 statistic) is computed as

$$\chi^2 = \sum_{i=1}^{n_A}\sum_{j=1}^{n_B}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \qquad e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}$$

– N: number of tuples
– oij: observed frequency (count) of the joint event (Ai, Bj)
– count(A = ai): the number of tuples having value ai for A
– count(B = bj): the number of tuples having value bj for B
Redundancy Between
Categorical (Discrete) Attributes
• The χ2 statistic tests the hypothesis that A and B are independent (the null hypothesis)
• The test is based on a significance level (p-value), with (nA − 1) × (nB − 1) degrees of freedom
• If the hypothesis can be rejected, then we say that A and B are statistically related or associated for the given data set
Redundancy Between
Categorical Attributes: Illustration
• A group of 1500 people is surveyed
• The gender of each person is noted
• Each person is polled as to whether their preferred type of reading material is fiction or non-fiction
• This leads to two attributes, gender and preferred_reading:
– gender takes two distinct values, male and female
– preferred_reading takes two distinct values, fiction and non-fiction
• The size of the contingency matrix is 2 × 2
[Table: 2 × 2 contingency matrix of observed and expected counts for gender vs. preferred_reading]
Redundancy Between
Categorical Attributes: Illustration
• For a 2 × 2 contingency table, the degrees of freedom are (2 − 1) × (2 − 1) = 1
• Obtain the critical χ2 value at the 0.005 significance level, i.e. p = 0.005 (99.5% confidence), with 1 degree of freedom
– The critical χ2 value is 7.879 (taken from the table of the χ2 distribution)
• The computed χ2 value for the given population is 507.93
• The computed value is above 7.879
– We reject the hypothesis that gender and preferred_reading are independent
• Conclusion: the two attributes (gender and preferred_reading) are strongly correlated for the given group of people
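A sketch of this test in Python using scipy. The contingency table itself is not reproduced on the slides; the observed counts below are the standard textbook figures consistent with the quoted χ2 of 507.93, so treat them as an assumption:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Assumed observed counts (rows: fiction, non-fiction; columns: male, female),
# chosen to be consistent with the quoted chi-square value of 507.93
observed = np.array([[250, 200],
                     [50, 1000]])

# correction=False gives the plain Pearson chi-square statistic (no Yates correction)
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ~507.93
print(dof)       # 1 degree of freedom for a 2 x 2 table
print(p)         # far below 0.005: reject independence
print(expected)  # expected frequencies e_ij
```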
Data Transformation
• The data are transformed or consolidated into forms appropriate for data modelling
• Data transformation involves:
– Smoothing:
• Used for removing noise
• Techniques: binning, regression, clustering
– Aggregation:
• Summary or aggregation operations are applied to the data
• Enables analysis of the data at multiple granularities
• Example: daily sales data vs. monthly sales data (aggregated from the daily data), as in the sketch below
– Attribute construction (feature construction):
• New attributes are constructed from the raw data to help the mining process
– Normalization and standardization
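A minimal sketch of the aggregation example with pandas; the dates, column name and sales figures are made up for illustration:

```python
import pandas as pd

# Hypothetical daily sales data
daily = pd.DataFrame(
    {"sales": [120.0, 95.5, 130.25, 110.0, 150.75, 99.0]},
    index=pd.to_datetime(
        ["2019-08-30", "2019-08-31", "2019-09-01",
         "2019-09-02", "2019-09-03", "2019-09-04"]
    ),
)

# Aggregate the daily data to monthly granularity
monthly = daily.resample("M").sum()
print(monthly)
```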
Attribute Normalization
• In the context of machine learning, this is termed feature normalization
• An attribute is normalised by scaling its values so that they fall within a small specified range (for example, 0.0 to 1.0)
• Normalization is particularly useful for classification algorithms involving distance measurements, and for clustering
• For distance-based approaches, normalization helps prevent attributes with large ranges from outweighing attributes with smaller ranges
Illustration
• Consider data vectors x = (x1, x2) and y = (y1, y2); one example record is x = (23000.00, 6.5)
• Euclidean distance: $ED(x, y) = \sqrt{\sum_{i=1}^{d}(x_i - y_i)^2}$
• Attribute ranges: min = (2300.00, 2), max = (73567.00, 8)
• The first attribute’s range is vastly larger, so it dominates the distance
Min-Max Normalization
• Mapping attribute A to a specified range [new_mnA, new_mxA]:

$$x' = \frac{x - mn_A}{mx_A - mn_A}\,(new\_mx_A - new\_mn_A) + new\_mn_A$$

• Mapping to [0, 1] (new_mnA = 0, new_mxA = 1):

$$x' = \frac{x - mn_A}{mx_A - mn_A}$$

where mnA and mxA are the minimum and maximum values of attribute A
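A minimal sketch of min-max normalization in Python, applied to the running illustration; the values come from the slides, while the function name is mine:

```python
import numpy as np

def min_max_normalize(x, mn, mx, new_mn=0.0, new_mx=1.0):
    """Scale x from [mn, mx] into [new_mn, new_mx]."""
    return (x - mn) / (mx - mn) * (new_mx - new_mn) + new_mn

# Values from the running illustration
record = np.array([23000.00, 6.5])
mn = np.array([2300.00, 2.0])
mx = np.array([73567.00, 8.0])

print(min_max_normalize(record, mn, mx))  # -> [0.2905, 0.75]
```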
• After min-max normalization to [0, 1], the record (23000.00, 6.5), with min = (2300.00, 2) and max = (73567.00, 8), becomes (0.2905, 0.75)
Illustration
• After normalization, the data vectors are x = (x1, x2) and y = (y1, y2), with the example record now (0.2905, 0.75)
• Euclidean distance: $ED(x, y) = \sqrt{\sum_{i=1}^{d}(x_i - y_i)^2}$
• Both attributes now contribute on a comparable scale to the distance, as the sketch below shows
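A sketch contrasting Euclidean distance before and after normalization; x, the minimum and the maximum come from the slides, while the second record y is made up for illustration:

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

mn = np.array([2300.00, 2.0])
mx = np.array([73567.00, 8.0])

x = np.array([23000.00, 6.5])  # record from the slides
y = np.array([25000.00, 2.5])  # hypothetical second record

print(euclidean(x, y))  # ~2000: dominated entirely by the first attribute

xn = (x - mn) / (mx - mn)
yn = (y - mn) / (mx - mn)
print(euclidean(xn, yn))  # both attributes now influence the distance
```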
Data Standardization
(z-score Normalization)
• The process of rescaling one or more attributes so that the transformed data have zero mean and unit variance, i.e. a standard deviation of 1
• Standardization assumes that the data have a Gaussian distribution
– This assumption does not strictly have to be true, but the technique is more effective if the attribute’s distribution is Gaussian
• In this process, the values of an attribute A are normalised based on the mean and standard deviation of A:

$$x' = \frac{x - \mu_A}{\sigma_A}$$
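A minimal sketch of z-score standardization with NumPy; the attribute values are made up for illustration:

```python
import numpy as np

# Hypothetical attribute values
A = np.array([23000.0, 41000.0, 2300.0, 73567.0, 55000.0])

mu = A.mean()
sigma = A.std()           # population standard deviation
z = (A - mu) / sigma      # z-score (standardized) values

print(z)
print(z.mean(), z.std())  # ~0 and 1 after standardization
```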
Data Standardization
(z-score Normalization)
• This method of normalization is useful:
– when the actual minimum and maximum of attribute A are unknown
– when there are outliers that dominate the min-max normalization
– when the data have a Gaussian (symmetric) distribution
• It is also useful when the ML algorithm makes any assumption of a Gaussian distribution
Data Reduction
• Data reduction techniques are applied to obtain a reduced representation of the dataset that is much smaller in volume, yet closely maintains the integrity of the original data
• Mining on the reduced dataset should produce the same, or almost the same, analytical results
• Different strategies:
– Attribute subset selection (feature selection):
• Irrelevant, weakly relevant or redundant attributes (dimensions) are detected and removed, as in the sketch below
– Dimensionality reduction:
• Encoding mechanisms are used to reduce the dataset size
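A minimal sketch of one simple form of attribute subset selection, removing near-constant (weakly relevant) attributes with scikit-learn; the data and the threshold are made up for illustration:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data: the middle attribute is almost constant (weakly relevant)
X = np.array([[1.0, 5.0, 10.0],
              [2.0, 5.0, 20.0],
              [3.0, 5.1, 30.0],
              [4.0, 5.0, 40.0]])

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)  # drops the near-constant attribute

print(selector.get_support())  # [True, False, True]
print(X_reduced.shape)         # (4, 2)
```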
Dimensionality Reduction
• Data encoding or transformations are applied so as to
obtain a reduced or compressed representation of the
original data
[Figure: Data → Feature Extraction → representation x (dimension d) → Dimension Reduction → reduced representation a (dimension l) → Pattern Analysis Task]
Principal Component Analysis (PCA)
• Let q1, q2, …, qd be orthonormal vectors:

$$\mathbf{q}_i^T \mathbf{q}_j = 0 \ (i \neq j), \qquad \mathbf{q}_i^T \mathbf{q}_i = 1$$

• PCA searches for l orthonormal vectors that can best be used to represent the data, where l < d
• Each data vector xn (dimension d) is represented by a reduced vector an (dimension l):

$$\mathbf{x}_n = \begin{bmatrix} x_{n1} \\ x_{n2} \\ \vdots \\ x_{nd} \end{bmatrix} \;\longrightarrow\; \mathbf{a}_n = \begin{bmatrix} a_{n1} \\ a_{n2} \\ \vdots \\ a_{nl} \end{bmatrix}$$
• The components of the reduced representation are obtained by projecting xn onto the selected vectors:

$$a_{ni} = \mathbf{q}_i^T \mathbf{x}_n, \qquad i = 1, 2, \ldots, l$$
Illustration: PCA
• Atmospheric data:
– N = number of tuples (data vectors) = 20
– d = number of attributes (dimension) = 5
• Mean of each dimension:
[Table: the 20 × 5 data matrix and the mean of each attribute]
Illustration: PCA
• Step 1: Subtract the mean of each attribute from that attribute’s values
Illustration: PCA
• Step 2: Compute the correlation matrix from the data matrix
Illustration: PCA
• Step 4: Perform eigen analysis on the correlation matrix
– Get the eigenvalues and eigenvectors
• Step 5: Sort the eigenvalues in descending order
• Step 6: Arrange the eigenvectors in the descending order of their corresponding eigenvalues
[Figure: the eigenvalues and eigenvectors of the correlation matrix]
Illustration: PCA
• Step 7: Consider the two leading (significant) eigenvalues and their corresponding eigenvectors
• Step 8: Project the mean-subtracted data matrix onto the two eigenvectors corresponding to the leading eigenvalues (a sketch follows)
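A sketch of these PCA steps with NumPy; the random matrix stands in for the 20 × 5 atmospheric data, which is not reproduced on the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))  # hypothetical stand-in for the 20 x 5 data matrix

# Step 1: subtract the mean of each attribute
Xc = X - X.mean(axis=0)

# Step 2: compute the correlation matrix
R = np.corrcoef(Xc, rowvar=False)     # 5 x 5

# Steps 4-6: eigen analysis, sorted by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(R)  # eigh: R is symmetric
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Steps 7-8: keep the l = 2 leading eigenvectors and project the data
Q = eigvecs[:, :2]                    # d x l
A = Xc @ Q                            # N x l reduced representation
print(A.shape)                        # (20, 2)
```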
Illustration
• Consider two vectors q1 = (4, 2) and q2 = (5, 7), and a vector z in the plane
• z can be written as a linear combination of these vectors: z = λ1 q1 + λ2 q2, i.e.
z1 = 4λ1 + 5λ2
z2 = 2λ1 + 7λ2
• We can find λ1 and λ2 by solving this system of linear equations (a sketch follows)
[Figure: q1, q2 and z plotted in the plane]
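A minimal sketch of solving for λ1 and λ2 with NumPy; q1 and q2 come from the slide, while the value of z is made up for illustration:

```python
import numpy as np

# Columns are q1 = (4, 2) and q2 = (5, 7)
Q = np.array([[4.0, 5.0],
              [2.0, 7.0]])
z = np.array([14.0, 16.0])   # hypothetical target vector

lam = np.linalg.solve(Q, z)  # solves Q @ lam = z
print(lam)                   # lambda_1 = 1, lambda_2 = 2
print(Q @ lam)               # reconstructs z
```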
• In general, any d-dimensional vector z can be expressed as a linear combination of d such basis vectors:

$$\mathbf{z} = \lambda_1\mathbf{q}_1 + \lambda_2\mathbf{q}_2 + \cdots + \lambda_d\mathbf{q}_d$$

$$\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_d \end{bmatrix} = \begin{bmatrix} q_{11} & q_{21} & \cdots & q_{d1} \\ q_{12} & q_{22} & \cdots & q_{d2} \\ \vdots & \vdots & \ddots & \vdots \\ q_{1d} & q_{2d} & \cdots & q_{dd} \end{bmatrix} \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_d \end{bmatrix}$$

$$\mathbf{z} = \mathbf{Q}\,\boldsymbol{\lambda}$$
Illustration: PCA
• Handwritten digit images [1]:
– Size of each image: 28 × 28
– Dimension after linearizing: 784
– Total number of training examples: 5000 (500 per class)
Illustration: PCA
• Handwritten digit images:
– All 784 eigenvalues
– Leading 100 eigenvalues
[Figures: plots of the full eigenvalue spectrum and of the leading 100 eigenvalues]