Chapter 3: Data Preprocessing
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Summary
Data Quality: Why Preprocess the Data?
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data Cleaning
Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., due to faulty instruments, human or computer error, or transmission errors
incomplete: lacking attribute values or certain attributes of interest
inconsistent: containing discrepancies, e.g.,
Age = “42”, Birthday = “03/07/2010”
Incomplete (Missing) Data
Missing values can arise from, e.g.:
technology limitations
incomplete data
inconsistent data
How to Handle Noisy Data? Data smoothing methods
Binning
first sort data and partition into (equal-frequency) bins
then smooth by bin means, bin medians, or bin boundaries
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales, e.g.,
metric vs. British units
Handling Redundancy in Data Integration
Correlation Analysis (Numeric Data)
Correlation coefficient (Pearson's product-moment coefficient):

r(A,B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / [(n − 1) σ_A σ_B] = [Σᵢ aᵢbᵢ − n Ā B̄] / [(n − 1) σ_A σ_B]

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ aᵢbᵢ is the sum of the AB cross-product.
If r(A,B) > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
r(A,B) = 0: independent; r(A,B) < 0: negatively correlated
If two attributes are strongly correlated, either A or B can be removed as a redundant attribute
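The cross-product form of the correlation coefficient can be sketched in a few lines of Python (a minimal illustration; the function name and the sample data are mine, not from the slides):

```python
import math

def pearson_r(a, b):
    """Pearson correlation via the cross-product form:
    r = (sum(a_i * b_i) - n * mean_A * mean_B) / ((n - 1) * sd_A * sd_B),
    using sample standard deviations (n - 1 in the denominator)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum(x * y for x, y in zip(a, b))
    return (cross - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)

# r close to +1: strong positive correlation, candidate for redundancy removal
r = pearson_r([2, 3, 5, 4, 6], [5, 8, 10, 11, 14])
```

A value near +1 or −1 signals that one of the two attributes carries little extra information.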
Visually Evaluating Correlation
Scatter plots illustrating correlation coefficients ranging from –1 to 1.
Covariance (Numeric Data)
Covariance is similar to correlation:
Cov(A, B) = E[(A − Ā)(B − B̄)] = E(A·B) − Ā·B̄
Correlation coefficient: r(A,B) = Cov(A, B) / (σ_A σ_B)
Suppose two stocks A and B have the following values in one week: (2, 5), (3,
8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will their
prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4×9.6 = 42.4 − 38.4 = 4
Thus, A and B rise together since Cov(A, B) > 0.
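As a quick check of the arithmetic above, the computational form Cov(A,B) = E(A·B) − E(A)·E(B) can be coded directly (a minimal sketch; the function name is illustrative):

```python
def covariance(a, b):
    """Population covariance via the computational form used on the slide:
    Cov(A, B) = E(A*B) - E(A) * E(B), dividing by n."""
    n = len(a)
    return sum(x * y for x, y in zip(a, b)) / n - (sum(a) / n) * (sum(b) / n)

# Stock values from the example: 212/5 - 4*9.6 = 42.4 - 38.4 = 4 > 0,
# so the two stocks tend to rise together.
cov = covariance([2, 3, 5, 4, 6], [5, 8, 10, 11, 14])
```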
Data Reduction Strategies
Wavelet transforms
Data compression
Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
Duplicate much or all of the information contained in one
or more other attributes
E.g., purchase price of a product and the amount of sales
tax paid
Irrelevant attributes
Contain no information that is useful for the data mining
task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Using methods such as Information Gain and Decision Trees
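Information gain, mentioned above, is one concrete relevance score; here is a minimal sketch (function names and the toy relation are mine, not from the chapter):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, labels):
    """Gain(A) = H(labels) - sum_v (|S_v| / |S|) * H(labels where A = v).
    High gain: the attribute is relevant to the class; near-zero gain:
    the attribute is a candidate for removal."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    n = len(rows)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

Decision-tree induction applies this score recursively, keeping only the attributes that ever get chosen for a split.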
Data Reduction: Numerosity Reduction
Reduce data volume by choosing alternative, smaller
forms of data representation
Parametric methods (e.g., regression)
Assume the data fits some model, estimate the model parameters, and store only the parameters instead of the actual data
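A minimal sketch of the parametric idea, assuming a simple linear model y = w·x + b fit by ordinary least squares (names are illustrative): after fitting, only the two parameters are stored and the raw points can be discarded.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w * x + b.
    The reduced representation of the data is just the pair (w, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - w * mean_x
    return w, b

w, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # keep only (w, b)
```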
Histogram Analysis
Divide data into buckets and store the average (or sum) for each bucket
Instead of storing the entire data, it is enough to store each value and its frequency, which achieves data reduction
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth): equal number of values per bucket
[Figure: equal-width histogram over prices from 10,000 to 100,000]
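The equal-width rule can be sketched as follows (a minimal illustration; the function name and data are mine): only (bucket start, count) pairs are kept.

```python
def equal_width_histogram(values, num_buckets):
    """Partition [min, max] into equal-width buckets and keep only each
    bucket's lower bound and count -- the reduced representation."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # clamp so that v == max(values) falls into the last bucket
        i = min(int((v - lo) / width), num_buckets - 1)
        counts[i] += 1
    return [(lo + i * width, c) for i, c in enumerate(counts)]
```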
Clustering
Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
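Storing only a centroid and diameter per cluster might look like this (a minimal sketch; the function name is mine):

```python
import math

def cluster_summary(points):
    """Reduced representation of one cluster: the centroid plus the
    diameter (maximum pairwise distance between members)."""
    n = len(points)
    dims = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / n for d in range(dims))
    diameter = max(
        (math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]),
        default=0.0,
    )
    return centroid, diameter
```

Only these summaries, not the member points, need to be stored; effectiveness depends on how naturally the data clusters.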
Sampling
Types of Sampling
Simple random sampling: equal probability of selecting any particular item
Sampling without replacement
Once an object is selected, it is removed from the population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set into strata, and draw samples from each partition
[Figure: Raw Data sampled via SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]
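The three schemes can be sketched with the standard library's random module (function names are mine):

```python
import random

def srswor(data, n, seed=None):
    """Simple random sample WITHOUT replacement: an object, once drawn,
    leaves the population, so it appears at most once."""
    return random.Random(seed).sample(data, n)

def srswr(data, n, seed=None):
    """Simple random sample WITH replacement: objects stay in the
    population and may be drawn repeatedly."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(n)]

def stratified_sample(data, key, frac, seed=None):
    """Partition the data by key (the strata), then draw roughly the
    same fraction from every partition."""
    rng = random.Random(seed)
    strata = {}
    for item in data:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, max(1, round(len(group) * frac))))
    return sample
```

Stratified sampling is useful for skewed data: every stratum is represented even when it is rare in the raw data.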
Data Cube Aggregation
Data cube can represent reduced data after operations such as
aggregation
The lowest level of a data cube (base cuboid)
The aggregated data for an individual entity of interest
E.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes
Further reduce the size of data to deal with
Reference appropriate levels
Use the smallest representation which is enough to solve the task
Queries regarding aggregated information should be answered using
data cube, when possible
Data Compression
[Figure: Original Data reduced by lossless compression, or approximated by lossy compression]
Normalization
Z-score normalization (μ: mean, σ: standard deviation of the attribute): v' = (v − μ) / σ
Ex. Let μ = 54,000, σ = 16,000. Then v' = (73,600 − 54,000) / 16,000 = 1.225
Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
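Z-score normalization and decimal scaling can be coded directly (a minimal sketch; function names are illustrative):

```python
def z_score(v, mu, sigma):
    """Z-score normalization: v' = (v - mu) / sigma."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """v' = v / 10**j for the smallest integer j with max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Slide example: (73,600 - 54,000) / 16,000 = 1.225
z = z_score(73_600, 54_000, 16_000)
```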
Data Discretization Methods
Typical methods: All the methods can be applied recursively
Binning
Top-down split, unsupervised
Histogram analysis
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or bottom-
up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
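The two smoothing rules above can be reproduced with a short sketch (function names are mine), giving the bin values shown on the slide:

```python
def smooth_by_bin_means(bins):
    """Replace each value in a bin by the bin's mean (rounded to int here)."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_bin_boundaries(bins):
    """Replace each value by the nearer of the bin's min/max boundary
    (ties go to the lower boundary)."""
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

price_bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```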
Discretization by Classification &
Correlation Analysis
Classification (e.g., decision tree analysis)
Supervised: Given class labels, e.g., cancerous vs. benign
Top-down, recursive split
Correlation analysis (e.g., Chi-merge: χ2-based discretization)
Supervised: use class information
Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ² values) to merge
Merge performed recursively, until a predefined stopping
condition
Concept Hierarchy Generation
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the
analysis of the number of distinct values per attribute in the
data set
Steps:
sort the attributes in ascending order based on the number of distinct values each contains
generate the hierarchy following that sorted order, with the first attribute at the top level and the last attribute at the bottom level
the user can then modify it to reflect desired semantic relationships among the attributes
Automatic Concept Hierarchy Generation
Example: Consider a set of location-oriented attributes (street, country, province or state, and city) from a database
A concept hierarchy for location can be generated automatically from the distinct-value counts
The attribute with the most distinct values is placed at the lowest level of the hierarchy
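The heuristic amounts to a sort by distinct-value counts (a minimal illustration; the function name and the toy table are mine):

```python
def hierarchy_by_distinct_counts(table, attributes):
    """Order attributes top-to-bottom by ascending number of distinct
    values: fewest distinct values (e.g., country) at the top, most
    (e.g., street) at the bottom."""
    counts = {a: len({row[a] for row in table}) for a in attributes}
    return sorted(attributes, key=lambda a: counts[a])
```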