Unit 1: Data Preprocessing
Outline
Types of Data
Data Quality
Data Preprocessing
What is Data?
A collection of data objects and their attributes.
– An attribute is a property or characteristic of an object; it is also known as a variable, field, characteristic, dimension, or feature.
– A collection of attributes describes an object.
– An object is also known as a record, point, case, sample, entity, or instance.
Attribute Values
An attribute value is a number or symbol assigned to an attribute for a particular object.
Properties of Attribute Values

Attribute Type   Description                        Examples                          Operations
Nominal          Nominal attribute values only      zip codes, employee ID numbers,   mode, entropy, contingency
(Categorical,    distinguish (=, ≠)                 eye color, sex: {male, female}    correlation, χ² test
Qualitative)
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
Asymmetric Attributes
Only presence (a non-zero attribute value) is regarded as important.
For example, consider a data set where each object is a student and each attribute
records whether or not a student took a particular course at a university. For a specific
student, an attribute has a value of 1 if the student took the course associated with that
attribute and a value of 0 otherwise. Because students take only a small fraction of all
available courses, most of the values in such a data set would be 0. Therefore, it is
more meaningful and more efficient to focus on the non-zero values.
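The space saving from focusing on non-zero values can be sketched as follows. This is a minimal illustration; the course names and the enrollment flags are made up.

```python
# Sketch: storing an asymmetric binary attribute sparsely.
# Catalog and enrollments below are hypothetical.

# Dense representation: one 0/1 flag per course in the catalog.
catalog = ["CS101", "CS102", "MA101", "PH101", "BI101", "CH101"]
dense = [1, 0, 0, 1, 0, 0]  # this student took CS101 and PH101

# Sparse representation: keep only the non-zero (taken) courses.
sparse = {c for c, flag in zip(catalog, dense) if flag == 1}

print(sorted(sparse))  # ['CS101', 'PH101']
```

With thousands of courses and only a handful taken per student, the sparse set is far smaller than the dense vector.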
Critiques of the attribute categorization
Incomplete
– Asymmetric binary: Binary attribute where only non-zero value is
important.
– Cyclical: a cyclic attribute has values that repeat in a period of time. Examples: hour, week, year.
– Multivariate: multivalued attribute
– Resolution
Patterns depend on the scale
– Size
Type of analysis may depend on size of data
Types of data sets
Record
– Data Matrix
– Document Data
– Transaction Data
Graph
– World Wide Web
– Molecular Structures
Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data: Document Data
Each document becomes a term vector: each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0     5     0      2     6     0     2      0       2
Document 2     0     7     0     2      1     0     0     3      0       0
Document 3     0     1     0     0      1     2     2     0      3       0
Transaction Data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
(Figures: a generic graph with weighted edges, and sequences of transactions.)
Ordered Data: Genomic Sequence Data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data: Spatio-Temporal Data
(Figure: average monthly temperature of land and ocean.)
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Data Quality: Missing Values
Reasons for missing values:
– Information is not collected
– Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
Handling missing values:
– Eliminate the data objects or tuples with missing values
– Fill in the missing value manually
– Estimate the missing value, e.g., with the attribute mean, median, or mode
Example: suppose the observed values of an attribute are 48, 56, 58, 67, 67, 74, 89.
Mean = (48 + 56 + 58 + 67 + 67 + 74 + 89) / 7 ≈ 65.6
Median = middle of the sorted values 48, 56, 58, 67, 67, 74, 89 = 67
Mode (most frequently occurring value) = 67
Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of one another; this is a major issue when merging data from heterogeneous sources.
Binning
– First sort the data and partition it into (equal-frequency) bins
– Then smooth by bin means, bin medians, bin boundaries, etc.
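The binning-and-smoothing steps above can be sketched as follows; the data values are a common textbook-style example chosen for illustration.

```python
# Sketch: equal-frequency binning, then smoothing by bin means.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

n_bins = 3
size = len(data) // n_bins  # equal-frequency: same count per bin
bins = [data[i * size:(i + 1) * size] for i in range(n_bins)]

# Smooth by bin means: replace every value by its bin's mean.
smoothed = []
for b in bins:
    mean = sum(b) / len(b)
    smoothed.extend([round(mean, 2)] * len(b))

print(bins)      # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smoothed)  # bin means 9.0, 22.75, 29.25 repeated per bin
```

Smoothing by bin medians or bin boundaries follows the same pattern, with only the per-bin replacement rule changed.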
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with
possible outliers)
Similarity and Dissimilarity Measures
Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
Dissimilarity measure
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
Euclidean Distance

d(x, y) = ( Σ_k (x_k − y_k)² )^(1/2), where the sum runs over all n attributes (dimensions).

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

(Figure: the four points plotted in the plane.)

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance Matrix
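The distance matrix for the four points above can be reproduced with a short sketch using the standard library:

```python
# Sketch: the 4x4 Euclidean distance matrix for the points above.
from math import dist  # available since Python 3.8

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

matrix = {a: {b: round(dist(pa, pb), 3) for b, pb in points.items()}
          for a, pa in points.items()}

print(matrix["p1"]["p2"])  # 2.828
print(matrix["p3"]["p4"])  # 2.0
```

Note that the matrix is symmetric with zeros on the diagonal, as expected for a distance.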
Minkowski Distance
A generalization of Euclidean distance: d(x, y) = ( Σ_k |x_k − y_k|^r )^(1/r)
– r = 1: city-block (Manhattan, L1) distance
– r = 2: Euclidean (L2) distance
– r → ∞: supremum (L∞, L_max) distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0

Distance Matrix
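A minimal sketch of the Minkowski distance for the three values of r used above, checked on the same four points:

```python
# Sketch: Minkowski distance for r = 1 (L1), r = 2 (L2), and the
# r -> infinity supremum (L_inf) distance.
def minkowski(x, y, r):
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def supremum(x, y):
    # Limit of the Minkowski distance as r -> infinity.
    return max(abs(a - b) for a, b in zip(x, y))

p1, p2 = (0, 2), (2, 0)
print(minkowski(p1, p2, 1))            # 4.0   (L1 / Manhattan)
print(round(minkowski(p1, p2, 2), 3))  # 2.828 (L2 / Euclidean)
print(supremum(p1, p2))                # 2     (L_infinity)
```

These values match the p1 row of the L1, L2, and L∞ matrices above.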
Common Properties of a Distance
Distances, such as the Euclidean distance, have the following well-known properties, where d(x, y) is the distance between points x and y:
1. Positivity: d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y
2. Symmetry: d(x, y) = d(y, x) for all x and y
3. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z
A distance that satisfies these properties is called a metric.
Mahalanobis Distance
mahalanobis(x, μ) = ( (x − μ)ᵀ Σ⁻¹ (x − μ) )^(1/2), where μ is the mean vector and Σ is the covariance matrix of the data.

Example: three attributes A, B, C observed on five objects, and a query point x = (4, 500, 40).

A:   1    2    4    2    5
B: 100  300  200  600  100
C:  10   15   20   10   30

Step 1. Mean vector: μ = (2.8, 260, 17)
Step 2. x − μ = (1.2, 240, 23)
Step 3. Sample covariance matrix:
Σ = [   2.7   -110    13 ]
    [ -110   43000  -900 ]
    [   13    -900    70 ]
Step 4. (x − μ)ᵀ Σ⁻¹ (x − μ) = 106.7
MD = (106.7)^(1/2) ≈ 10.33
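The worked example above can be reproduced with NumPy; note that np.cov uses the sample covariance (division by n − 1), matching the hand calculation.

```python
# Sketch: Mahalanobis distance for the example above, with NumPy.
import numpy as np

A = np.array([1, 2, 4, 2, 5])
B = np.array([100, 300, 200, 600, 100])
C = np.array([10, 15, 20, 10, 30])
x = np.array([4, 500, 40])

data = np.vstack([A, B, C])
mean = data.mean(axis=1)          # [2.8, 260, 17]
cov = np.cov(data)                # 3x3 sample covariance matrix
d = x - mean                      # [1.2, 240, 23]
md2 = d @ np.linalg.inv(cov) @ d  # squared Mahalanobis distance, ~106.7
md = np.sqrt(md2)
print(round(md, 2))  # 10.33
```

Because the covariance matrix rescales and decorrelates the attributes, the hugely different ranges of A, B, and C do not dominate the distance, which is the point of using Mahalanobis rather than Euclidean distance here.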
Similarity Between Binary Vectors
A common situation is that objects x and y have only binary attributes.
Compute similarities using the following quantities:
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f00 = the number of attributes where x is 0 and y is 0
f11 = the number of attributes where x is 1 and y is 1

Simple Matching Coefficient: SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
Jaccard Coefficient: J = f11 / (f01 + f10 + f11)

Example:
x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1
f01 = 2 (the number of attributes where x is 0 and y is 1)
f10 = 1 (the number of attributes where x is 1 and y is 0)
f00 = 7 (the number of attributes where x is 0 and y is 0)
f11 = 0 (the number of attributes where x is 1 and y is 1)
SMC = (0 + 7) / (2 + 1 + 7 + 0) = 0.7
J = 0 / (2 + 1 + 0) = 0
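Counting the f quantities and computing SMC and Jaccard for the binary vectors above can be sketched directly:

```python
# Sketch: SMC and Jaccard for the binary vectors above.
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)

smc = (f11 + f00) / (f01 + f10 + f11 + f00)  # counts 0-0 matches
jaccard = f11 / (f01 + f10 + f11)            # ignores 0-0 matches

print(smc, jaccard)  # 0.7 0.0
```

The gap between the two scores (0.7 vs 0) shows why Jaccard is preferred for asymmetric binary attributes: SMC is inflated by the many shared zeros.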
Cosine Similarity
cos(x, y) = (x · y) / (||x|| ||y||)

Example:
x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
x · y = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
||x|| = (3² + 2² + 0² + 5² + 0² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 ≈ 6.481
||y|| = (1² + 0² + 0² + 0² + 0² + 0² + 0² + 1² + 0² + 2²)^0.5 = 6^0.5 ≈ 2.449
cos(x, y) = 5 / (6.481 × 2.449) ≈ 0.3150
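The cosine computation for the vectors above can be sketched with the standard library alone:

```python
# Sketch: cosine similarity for the vectors above.
from math import sqrt

x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(x, y))  # 5
nx = sqrt(sum(a * a for a in x))        # sqrt(42), about 6.481
ny = sqrt(sum(b * b for b in y))        # sqrt(6),  about 2.449
cos = dot / (nx * ny)
print(round(cos, 3))  # 0.315
```

Cosine similarity depends only on the angle between the vectors, not their lengths, which is why it suits document-term vectors of very different sizes.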
Extended Jaccard Coefficient (Tanimoto Coefficient)
EJ(x, y) = (x · y) / (||x||² + ||y||² − x · y)

Example:
x = (1, 0, 1, 0, 1)
y = (1, 1, 1, 0, 1)
x · y = 1·1 + 0·1 + 1·1 + 0·0 + 1·1 = 3
||x||² = 1 + 0 + 1 + 0 + 1 = 3
||y||² = 1 + 1 + 1 + 0 + 1 = 4
EJ(x, y) = 3 / (3 + 4 − 3) = 3/4 = 0.75
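The same Tanimoto calculation as a short sketch:

```python
# Sketch: extended Jaccard (Tanimoto) coefficient for the vectors above.
x = [1, 0, 1, 0, 1]
y = [1, 1, 1, 0, 1]

dot = sum(a * b for a, b in zip(x, y))  # 3
nx2 = sum(a * a for a in x)             # ||x||^2 = 3
ny2 = sum(b * b for b in y)             # ||y||^2 = 4
ej = dot / (nx2 + ny2 - dot)
print(ej)  # 0.75
```

For 0/1 vectors this reduces exactly to the Jaccard coefficient, since the dot product counts f11 and ||x||² + ||y||² − x·y counts f01 + f10 + f11.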
Correlation
Correlation measures the linear relationship between objects:
corr(x, y) = covariance(x, y) / (s_x · s_y)

Example:
x = (−3, 6, 0, 3, −6), y = (1, −2, 0, −1, 2), n = 5
Mean of x = 0, mean of y = 0
covariance(x, y) = Σ (x_k − x̄)(y_k − ȳ) / (n − 1) = (−3 − 12 + 0 − 3 − 12) / 4 = −7.5
s_x = [ (9 + 36 + 0 + 9 + 36) / 4 ]^½ ≈ 4.743
s_y = [ (1 + 4 + 0 + 1 + 4) / 4 ]^½ ≈ 1.581
corr(x, y) = −7.5 / (4.743 × 1.581) = −1, since y = −x/3 is a perfect negative linear relationship
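The correlation computation above can be sketched end to end:

```python
# Sketch: sample correlation for the vectors above.
# Note y = -x/3, so the correlation is exactly -1.
from math import sqrt

x = [-3, 6, 0, 3, -6]
y = [1, -2, 0, -1, 2]
n = len(x)
mx = sum(x) / n
my = sum(y) / n

cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)  # -7.5
sx = sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))              # ~4.743
sy = sqrt(sum((b - my) ** 2 for b in y) / (n - 1))              # ~1.581
corr = cov / (sx * sy)
print(round(corr, 4))  # -1.0
```

A correlation of −1 means the objects lie exactly on a line with negative slope; values near 0 indicate no linear relationship.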
Entropy
For
– a variable (event) X,
– with n possible values (outcomes) x1, x2, …, xn,
– each outcome having probability p1, p2, …, pn,
the entropy of X, H(X), is given by
H(X) = − Σ_{i=1}^{n} p_i log2(p_i)
where log2(x) = ln(x) / ln(2)
Entropy Examples
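A few worked entropy values, following the formula H(X) = −Σ p_i log2(p_i) above; the distributions are simple illustrative choices.

```python
# Sketch: entropy of a few discrete distributions.
from math import log2

def entropy(probs):
    # Terms with p = 0 contribute 0 by convention, so skip them.
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))            # 1.0  (fair coin: maximum for 2 outcomes)
print(entropy([0.25] * 4))            # 2.0  (fair four-sided die)
print(round(entropy([0.9, 0.1]), 4))  # 0.469 (skewed coin: less uncertainty)
```

Entropy is largest for a uniform distribution and shrinks toward 0 as one outcome becomes near-certain.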
Mutual Information
Mutual information measures the information one variable provides about another.
Formally, I(X, Y) = H(X) + H(Y) − H(X, Y), where H(X, Y) is the joint entropy of X and Y.
Mutual information of Student Status and Grade = 0.9928 + 1.4406 − 2.2710 = 0.1624
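The formula I(X, Y) = H(X) + H(Y) − H(X, Y) can be sketched from a joint distribution. The joint table below is made up for illustration; it is not the Student Status / Grade table from the slides.

```python
# Sketch: mutual information from a (hypothetical) joint distribution.
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical joint probabilities P(X, Y).
joint = {("u", "a"): 0.2, ("u", "b"): 0.3,
         ("g", "a"): 0.4, ("g", "b"): 0.1}

# Marginal distributions P(X) and P(Y).
px, py = {}, {}
for (xv, yv), p in joint.items():
    px[xv] = px.get(xv, 0) + p
    py[yv] = py.get(yv, 0) + p

mi = H(px.values()) + H(py.values()) - H(joint.values())
print(round(mi, 4))  # 0.1245
```

Mutual information is always non-negative and is 0 exactly when X and Y are independent.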
Using Weights to Combine Similarities
Aggregation
Sampling
Discretization and Binarization
Attribute Transformation
Dimensionality Reduction
Feature subset selection
Feature creation
Aggregation
Sampling: With or without Replacement
– SRS: simple random sampling without replacement; as each item is selected, it is removed from the population
– SRSWR: simple random sampling with replacement; objects are not removed as they are selected, so the same object can be picked more than once
(Figure: raw data with the two kinds of samples.)
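Both sampling schemes can be sketched with Python's random module; the seed and data are arbitrary, chosen only to make the sketch reproducible.

```python
# Sketch: simple random sampling without (SRS) and with (SRSWR)
# replacement.
import random

random.seed(42)  # fixed seed so the sketch is reproducible
data = list(range(100))

srs = random.sample(data, 10)                     # without replacement
srswr = [random.choice(data) for _ in range(10)]  # with replacement

print(len(set(srs)))  # 10 -- duplicates are impossible in SRS
```

In SRSWR the same element can appear more than once, which matters when the sample is a sizable fraction of the population.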
Sampling: Cluster or Stratified Sampling
Sample Size

Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse in the space that it occupies.

Dimensionality Reduction
Purpose:
– Avoid the curse of dimensionality
– Reduce the amount of time and memory required by data mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise
What Is Wavelet Transform?
Wavelet-transformed data can be truncated, which makes the transform useful for data reduction: if we store only a small fraction of the strongest wavelet coefficients, we obtain a compressed approximation of the original data. For example, all wavelet coefficients larger than some chosen threshold can be retained and the rest set to zero.
Wavelet Transformation
Haar2 Daubechie4
Discrete wavelet transform (DWT) for linear signal processing,
multi-resolution analysis
Compressed approximation: store only a small fraction of the
strongest of the wavelet coefficients
Similar to discrete Fourier transform (DFT), but better lossy
compression, localized in space
Method:
– The length, L, of the input data vector must be an integer power of 2 (pad with 0s when necessary)
– Each transform applies two functions: smoothing (e.g., averaging) and difference
– The two functions are applied to pairs of data, producing two sets of data of length L/2
– The two functions are applied recursively until the desired output length is reached
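The recursive smoothing/difference procedure above can be sketched with the Haar wavelet, where smoothing is the pairwise average and the detail coefficients are pairwise half-differences. The signal values are a common textbook-style example.

```python
# Sketch: Haar discrete wavelet transform via recursive
# pairwise averaging (smoothing) and differencing (detail).
def haar_step(signal):
    avgs = [(a + b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    diffs = [(a - b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    return avgs, diffs

def haar_dwt(signal):
    # Assumes len(signal) is a power of 2, as the method requires.
    details = []
    while len(signal) > 1:
        signal, d = haar_step(signal)
        details = d + details  # coarser details go in front
    return signal + details    # [overall average, detail coefficients]

signal = [2, 2, 0, 2, 3, 5, 4, 4]
print(haar_dwt(signal))  # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Data reduction then amounts to zeroing the small coefficients (here, the 0.0 entries and perhaps 0.5) and keeping only the strongest ones.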
Wavelet Decomposition
Wavelet Decomposition & regeneration of Signal
Dimensionality Reduction: PCA
https://www.youtube.com/watch?v=ZtS6sQUAh0c
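A minimal sketch of PCA via eigendecomposition of the covariance matrix, using NumPy; the 2-D data set is synthetic, generated only for illustration.

```python
# Sketch: PCA by eigendecomposition of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, correlated 2-D data (100 objects, 2 attributes).
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues

order = np.argsort(eigvals)[::-1]       # sort components by variance
components = eigvecs[:, order]
projected = Xc @ components[:, :1]      # keep only the first component

print(projected.shape)  # (100, 1)
```

Keeping the components with the largest eigenvalues retains the directions of greatest variance, which is the dimensionality-reduction idea behind PCA.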
Feature Subset Selection
Discretization in Supervised Settings

Normalization
Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
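The decimal-scaling rule above, together with min-max and z-score normalization (the other standard normalization methods), can be sketched as follows; the sample values are arbitrary.

```python
# Sketch: three common normalization methods.
def min_max(values, lo=0.0, hi=1.0):
    # Rescale linearly so min -> lo and max -> hi.
    mn, mx = min(values), max(values)
    return [lo + (v - mn) * (hi - lo) / (mx - mn) for v in values]

def z_score(values):
    # Center to mean 0 and scale to unit standard deviation.
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

def decimal_scaling(values):
    # v' = v / 10^j with the smallest j such that max(|v'|) < 1.
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scaling([-991, 250, 48]))  # [-0.991, 0.25, 0.048]
```

For the sample data, the largest absolute value is 991, so j = 3 and every value is divided by 1000.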