Chapter 2
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
2006 Jiawei Han and Micheline Kamber, All rights reserved
Data Mining: Concepts and Techniques (2/13/17)
Chapter 2: Data Preprocessing
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories:
Intrinsic, contextual, representational, and
accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces
the same or similar analytical results
Data discretization
Part of data reduction but with particular importance,
especially for numerical data
Forms of Data Preprocessing
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chopping extreme values
Median: A holistic measure
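The three measures of central tendency can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names, the `trim_count` parameter, and the sample values are all assumptions chosen for the example.

```python
from statistics import median


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


def trimmed_mean(values, trim_count):
    """Mean after chopping trim_count values from each end (hypothetical helper)."""
    ordered = sorted(values)
    kept = ordered[trim_count:len(ordered) - trim_count]
    if not kept:
        raise ValueError("trim_count removes every value")
    return sum(kept) / len(kept)


# Illustrative values only; 100 plays the role of an extreme value.
sample = [1, 2, 3, 4, 100]
plain_mean = sum(sample) / len(sample)    # pulled upward by the extreme value
robust_mean = trimmed_mean(sample, 1)     # drops 1 and 100 before averaging
middle = median(sample)                   # needs the whole sorted list (holistic)
```

The trimmed mean and the median both stay near the bulk of the data, while the plain mean is dominated by the single extreme value.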
Importance
Data cleaning is one of the three biggest
technology limitation
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-
frequency) bins
then one can smooth by bin means, smooth by
functions
Clustering
detect and remove outliers
Good data scaling
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24,
25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
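The worked example above can be reproduced with a short Python sketch. The function names are hypothetical, and two details are assumptions not stated on the slide: bin means are rounded to the nearest integer, and a value equally close to both boundaries goes to the lower one.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size."""
    if len(sorted_values) % n_bins != 0:
        raise ValueError("this sketch assumes the values divide evenly into bins")
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        # Assumption: ties are resolved toward the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With the price data from the slide, `by_means` and `by_boundaries` match the smoothed bins listed above.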
Regression
[Figure: regression line y = x + 1 on x and y axes, with the points X1 and Y1 marked]
distribution)
Check field overloading
specified
ETL (Extraction/Transformation/Loading) tools: allow users
coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
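A minimal Python sketch of the correlation coefficient follows. It computes both forms of the numerator so they can be compared. The function name and the sample data are hypothetical, and it assumes the standard deviations are sample standard deviations (divisor n - 1), which keeps the result between -1 and 1 given the (n - 1) factor in the denominator.

```python
from math import sqrt


def correlation(a, b):
    """Correlation coefficient of two equal-length numeric sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two sequences of the same length, n >= 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Assumption: sample standard deviations with divisor (n - 1).
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    if sd_a == 0 or sd_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    # First form: sum of products of deviations from the means.
    numerator_1 = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    # Second form: sum of products minus n times the product of the means.
    numerator_2 = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
    denominator = (n - 1) * sd_a * sd_b
    return numerator_1 / denominator, numerator_2 / denominator


# Illustrative data: b rises with a, c falls as a rises.
a = [1, 2, 3, 4]
b = [2, 4, 6, 8]
c = [8, 6, 4, 2]
r_ab, r_ab_alt = correlation(a, b)
r_ac, _ = correlation(a, c)
```

A value near 1 indicates positive correlation, a value near -1 negative correlation, and a value near 0 no linear relationship.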
$\chi^2$ (chi-square) test
$\chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$
The larger the $\chi^2$ value, the more likely the
variables are related
The cells that contribute the most to the $\chi^2$ value
are those whose actual count is very different from
the expected count
Correlation does not imply causality
# of hospitals and # of car-theft in a city are correlated
Both are causally linked to the third variable: population
Chi-Square Calculation: An Example
$\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93$
It shows that like_science_fiction and play_chess
are correlated in the group
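The calculation can be checked with a short Python sketch. The 2×2 arrangement of the observed counts below is an assumption (the table itself is not reproduced here), chosen because it yields exactly the expected counts 90, 360, 210, and 840 used in the slide. The function name is hypothetical.

```python
def chi_square(observed):
    """Return (statistic, expected) for a contingency table given as rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    expected = [
        [r * c / grand_total for c in col_totals]
        for r in row_totals
    ]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return statistic, expected


# Assumed layout: rows = likes / does not like science fiction,
# columns = plays / does not play chess.
observed = [
    [250, 200],
    [50, 1000],
]
statistic, expected = chi_square(observed)
```

Under this layout the statistic evaluates to roughly 507.94 before rounding, which agrees with the 507.93 reported in the slide up to the last digit.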
Data Transformation
Ex. Let $\mu = 54{,}000$, $\sigma = 16{,}000$. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
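The example above subtracts the mean from a value and divides by the standard deviation. A minimal Python sketch of that step, with a hypothetical function name, might look like this:

```python
def standardize(value, mean, std_dev):
    """Normalize a value as (value - mean) / standard deviation."""
    if std_dev == 0:
        raise ValueError("standard deviation must be non-zero")
    return (value - mean) / std_dev


# Numbers taken from the example: mean 54,000, standard deviation 16,000.
normalized = standardize(73_600, 54_000, 16_000)
```

A value equal to the mean maps to 0, and values below the mean map to negative numbers.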
attributes
Data Compression
understand
Heuristic methods (due to exponential # of choices):
Step-wise forward selection
[Figure: decision tree testing A4? at the root, then A1? and A6?]
first, ...
Step-wise feature elimination:
algorithms
Typically lossless
expansion
Audio/video compression
Typically lossy compression, with progressive
refinement
Sometimes small fragments of signal can be
[Figure: lossy compression of the original data into an approximated version]
range
Compute k orthonormal (unit) vectors, i.e., principal
components
Each input data (vector) is a linear combination of the k
significance or strength
Since the components are sorted, the size of the data can
[Figure: principal component axes Y1 and Y2 drawn against the original axis X1]
[Figure: histogram with value-axis ticks from 10,000 to 100,000]
bucket represents)
MaxDiff: set bucket
Data Reduction Method (3):
Clustering
Stratified sampling:
[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]
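The sampling schemes named here can be sketched with Python's standard `random` module. The function names, the `seed` parameter, and the records are hypothetical. The stratified version also assumes each stratum is sampled at the same fraction, with at least one record kept per stratum.

```python
import random


def srswor(data, k, seed=None):
    """Simple random sample WITHOUT replacement: no item is drawn twice."""
    return random.Random(seed).sample(data, k)


def srswr(data, k, seed=None):
    """Simple random sample WITH replacement: an item may be drawn again."""
    return random.Random(seed).choices(data, k=k)


def stratified_sample(records, key, fraction, seed=None):
    """Draw the same fraction from every stratum, without replacement."""
    rng = random.Random(seed)
    strata = {}
    for record in records:
        strata.setdefault(key(record), []).append(record)
    sample = []
    for members in strata.values():
        # Assumption: keep at least one record from each stratum.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Illustrative records: (age group, id). The "senior" stratum is small.
records = [("young", i) for i in range(6)] + [("senior", i) for i in range(2)]
without_repl = srswor(records, 4, seed=0)
with_repl = srswr(records, 4, seed=0)
stratified = stratified_sample(records, key=lambda r: r[0], fraction=0.5, seed=0)
```

Unlike a simple random sample, the stratified sample keeps small groups such as "senior" represented.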
Sampling: Cluster or Stratified Sampling
interval, max inconsistency, etc.)
Segmentation by Natural
Partitioning
(-$1,000 - $2,000)
Step 3:
(-$400 -$5,000)
Step 4:
year
country 15 distinct values