DM Day3 Preprocessing A F24
Fall 2024
Section A
Chapter 3: Data
Preprocessing
Data quality
Garbage in, garbage out!
Data Quality
Accuracy
Completeness
Consistency
Timeliness
Believability
Interpretability
Inaccurate Data
Data having incorrect attribute values
Inconsistent formats for input fields
e.g., a date entered as April 15, 2021; 15 April 2021; 15-04-2021; or 15/04/21
Timeliness Issues
Monthly sales bonuses
Failure to submit sales records on time at the
end of the month
Corrections and adjustments that flow in
after the month’s end
Merit Award
Delayed submissions of grades
Believability Issues
Believability reflects how much the data are trusted by users
For example, a database that at one point had several errors, all of which have since been corrected, may still be distrusted by its users
Data Preprocessing: Major Tasks
Data Cleaning
Data Integration
Data Reduction
Data Transformation
Data Cleaning
Noisy data: what is a remedy?
Regression
Linear regression
Multiple linear regression
Outlier Analysis
Values that fall outside of the set of clusters may
be considered outliers
Binning
Smoothing by bin means
Each value in a bin is replaced by the mean value of
the bin
Smoothing by bin medians
Each bin value is replaced by the bin median
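The two smoothing rules above can be sketched in a few lines of Python; the sorted values below are the textbook's nine price observations.

```python
# Smoothing noisy data by binning; the sorted values are the
# textbook's nine price observations.
sorted_data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3  # equal-frequency bins of three values each

bins = [sorted_data[i:i + bin_size] for i in range(0, len(sorted_data), bin_size)]

# Smoothing by bin means: each value is replaced by its bin's mean
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin medians: each value is replaced by its bin's median
by_medians = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]

print(by_means)    # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_medians)  # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
```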
Example multiple linear regression model: Y = 2 − 5X1 + 6X2 − 3X3
Data Cleaning
A Two-Step Process
Discrepancy Detection
Poorly designed forms with optional fields, human error,
deliberate error, data decay, inconsistencies, outliers, missing
values, noise, etc.
Metadata, attribute type, attribute range, outlier analysis,
format check, unique rule, consecutive rule
Data Transformation
to correct discrepancies
Data Scrubbing
use simple domain knowledge (e.g., knowledge of postal
addresses and spell-checking)
Data Auditing
discover rules and relationships, and detect data that violate such conditions (e.g., via correlation analysis, cluster analysis)
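A minimal sketch of the discrepancy-detection checks listed above (attribute-range check, format check, unique rule), run over a few hypothetical customer records; the field names and limits are made up for illustration.

```python
# Simple discrepancy detection: range check, format check, unique rule.
# Records, field names, and the age limit are hypothetical.
import re

records = [
    {"id": 1, "age": 34,  "date": "2021-04-15"},
    {"id": 2, "age": 213, "date": "15/04/21"},    # age out of range, bad format
    {"id": 2, "age": 28,  "date": "2021-05-02"},  # duplicate id
]

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")      # expected YYYY-MM-DD format

def find_discrepancies(rows):
    problems = []
    seen_ids = set()
    for r in rows:
        if not (0 <= r["age"] <= 120):            # attribute-range check
            problems.append((r["id"], "age out of range"))
        if not DATE_RE.match(r["date"]):          # format check
            problems.append((r["id"], "bad date format"))
        if r["id"] in seen_ids:                   # unique rule
            problems.append((r["id"], "duplicate id"))
        seen_ids.add(r["id"])
    return problems

print(find_discrepancies(records))
# [(2, 'age out of range'), (2, 'bad date format'), (2, 'duplicate id')]
```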
Activity
Read, watch, explore
Read
Get rid of the dirt from your data — Data Cleaning techniques
Watch the video
Google Refine
Explore the tool
OpenRefine is a powerful free, open-source tool for working with messy data: cleaning it and transforming it from one format into another
Download and Install Weka
Machine learning software for solving data mining problems
Explore Weka and the datasets that come with it,
e.g., Iris.
Data integration
Tuple Duplication
Data Value Conflict Detection and Resolution
For the same real-world entity, attribute values from different sources may differ
Numeric attributes
Correlation Coefficient
Covariance
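For two numeric attributes, redundancy in integrated data can be checked with the covariance and the Pearson correlation coefficient; a hand-computed sketch on toy values (a and b are hypothetical attribute columns):

```python
# Redundancy check for two numeric attributes during integration:
# covariance and Pearson correlation, hand-rolled on toy values.
import math

a = [2.0, 4.0, 6.0, 8.0]
b = [1.0, 3.0, 5.0, 9.0]

n = len(a)
mean_a, mean_b = sum(a) / n, sum(b) / n

cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
r = cov / (std_a * std_b)      # r near +1 or -1 hints that one attribute is redundant

print(cov, round(r, 3))  # 6.5 0.983
```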
χ2(chi-square) Test
Given two nominal attributes, A and B
Domain of A = {a1,a2, …,ac }
Domain of B = {b1,b2, …,br }
Then the χ² (Pearson) statistic is computed as
χ² = Σ_i Σ_j (o_ij − e_ij)² / e_ij
where o_ij is the observed frequency of the joint event (A = a_i, B = b_j) and e_ij = count(A = a_i) × count(B = b_j) / n is the expected frequency (n = number of data tuples)
Example 3.1
The observed frequency (or count) of each possible joint event is summarized in a contingency table
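The χ² computation can be carried out by hand; the counts below are those of the textbook's Example 3.1 (a 1500-person survey of gender vs. preferred reading).

```python
# Chi-square test of independence for two nominal attributes, computed
# by hand with the 1500-person counts of the textbook's Example 3.1.
observed = {
    ("male", "fiction"): 250, ("female", "fiction"): 200,
    ("male", "non_fiction"): 50, ("female", "non_fiction"): 1000,
}
n = sum(observed.values())                 # 1500 people in total

genders, readings = ["male", "female"], ["fiction", "non_fiction"]
row = {g: sum(observed[g, r] for r in readings) for g in genders}
col = {r: sum(observed[g, r] for g in genders) for r in readings}

chi2 = 0.0
for g in genders:
    for r in readings:
        e = row[g] * col[r] / n            # expected frequency e_ij
        chi2 += (observed[g, r] - e) ** 2 / e

print(chi2)  # about 507.9, far above the 10.828 needed at 1 degree of
             # freedom and the 0.001 level: the attributes are correlated
```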
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Data Reduction
Obtain a reduced representation of the data set
that is much smaller in volume yet produces the
same (or almost the same) analytical results
Data Reduction Strategies
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long
time to run on the complete data set.
Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Principal Component Analysis
Wavelet transforms
Supervised and nonlinear techniques (e.g., feature selection)
Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in data
The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space
(Figure: data points in the x1, x2 plane with the principal component directions overlaid)
Principal Component Analysis (Steps)
Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
Normalize input data: Each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data (vector) is a linear combination of the k principal
component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e.,
using the strongest principal components, it is possible to reconstruct a
good approximation of the original data)
Works for numeric data only
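The steps above can be sketched with NumPy (an assumption of this example); the ten 2-D points are a commonly used worked example, not data from these slides.

```python
# PCA sketch: center the data, take eigenvectors of the covariance
# matrix, sort by variance, and keep only the strongest component.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
              [1.5, 1.6], [1.1, 0.9]])

Xc = X - X.mean(axis=0)                  # 1. normalize: center each attribute
cov = np.cov(Xc, rowvar=False)           # 2. covariance matrix of the attributes
eigvals, eigvecs = np.linalg.eigh(cov)   # 3. orthonormal eigenvectors (the PCs)

order = np.argsort(eigvals)[::-1]        # 4. sort PCs by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = Xc @ eigvecs[:, :1]                  # 5. project onto the strongest PC
print(eigvals[0] / eigvals.sum())        # fraction of variance retained
```

For this data the first component retains over 96% of the variance, so dropping the second loses very little.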
Domain Expert
To pick out some of the useful attributes
Attribute Subset Selection
Classify customers based on whether or not
they are likely to purchase a popular new
CD
Customer’s Telephone Number
Customer’s Age
Customer’s Music Taste
Relying on a domain expert to pick out the useful attributes is difficult & time-consuming
Heuristic Search in Attribute Selection
There are 2^d possible attribute combinations of d attributes
Typical heuristic attribute selection methods:
Best single attribute under the attribute
independence assumption: choose by significance
tests
Best step-wise feature selection:
The best single attribute is picked first
Then the next best attribute conditioned on the first, ...
Step-wise attribute elimination:
Repeatedly eliminate the worst attribute
Best combined attribute selection and elimination
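Step-wise forward selection can be sketched as a greedy loop; the toy customer data echoes the CD-purchase example above, with made-up attribute names and a simple majority-rule accuracy as the (hypothetical) scoring function.

```python
# Greedy step-wise forward attribute selection on toy customer data.
# score(): accuracy of predicting the majority class within each
# attribute-value group (a simple stand-in for a significance test).
from collections import Counter, defaultdict

rows = [  # ({attributes}, class)
    ({"age": "young", "taste": "pop"},       "buy"),
    ({"age": "old",   "taste": "classical"}, "skip"),
    ({"age": "young", "taste": "classical"}, "skip"),
    ({"age": "old",   "taste": "pop"},       "buy"),
]

def score(attrs):
    groups = defaultdict(list)
    for features, label in rows:
        groups[tuple(features[a] for a in attrs)].append(label)
    correct = sum(Counter(g).most_common(1)[0][1] for g in groups.values())
    return correct / len(rows)

selected, remaining = [], {"age", "taste"}
while remaining:
    best = max(remaining, key=lambda a: score(selected + [a]))
    if score(selected + [best]) <= score(selected):
        break                       # stop: no remaining attribute improves the score
    selected.append(best)
    remaining.remove(best)

print(selected)  # ['taste']  (age alone predicts nothing in this toy data)
```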
Attribute Subset Selection
Decision Tree Induction
Constructs a flowchart-like structure
At each node, the algorithm chooses the
“best” attribute to partition the data into
individual classes
The set of attributes appearing in the tree
form the reduced subset of attributes
Training data:
Match #   HomeGround   Weather   Result
1         Yes          Cloudy    Win
2         No           Sunny     Lose
3         Yes          Sunny     Win
4         No           Cloudy    Lose

Matches to classify:
Match #   HomeGround   Weather   Prediction
5         Yes          Cloudy    ?
6         No           Cloudy    ?
7         Yes          Sunny     ?

Induced tree: the root tests homeground (yes → win, no → lose); a split on weather (cloudy vs. sunny) separates nothing. Since only HomeGround appears in the tree, the reduced attribute subset is {HomeGround}.
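Why does the tree pick homeground rather than weather? A short information-gain calculation over the four training matches makes the choice explicit.

```python
# Information gain of each attribute over the four training matches:
# the tree splits on the attribute with the higher gain.
import math
from collections import Counter

matches = [  # (homeground, weather, result)
    ("yes", "cloudy", "win"),
    ("no",  "sunny",  "lose"),
    ("yes", "sunny",  "win"),
    ("no",  "cloudy", "lose"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attr):  # attr: 0 = homeground, 1 = weather
    gain = entropy([m[2] for m in matches])
    for value in {m[attr] for m in matches}:
        subset = [m[2] for m in matches if m[attr] == value]
        gain -= len(subset) / len(matches) * entropy(subset)
    return gain

print(info_gain(0))  # 1.0 -- homeground separates win from lose perfectly
print(info_gain(1))  # 0.0 -- weather carries no information about the result
```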
Attribute Subset Selection
Stopping Criteria
A threshold, on the measure used, may be
employed to determine when to stop the
attribute selection process
Attribute Creation (Feature
Generation)
Create new attributes (features) that can capture the
important information in a data set more effectively than
the original ones
Three general methodologies
Attribute extraction
Domain-specific
Mapping data to new space (see: data reduction)
E.g., Fourier transformation, wavelet transformation, PCA
Attribute construction
Combining features, e.g., constructing area from height and width
Data discretization
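Attribute construction can be as simple as deriving one new column from two existing ones; the records and field names below are hypothetical.

```python
# Attribute construction sketch: combine existing attributes into a new
# one (area = height * width) that captures information the originals
# carry only jointly. Records and field names are hypothetical.
plots = [{"height": 10.0, "width": 4.0},
         {"height": 6.0,  "width": 7.0}]

for p in plots:
    p["area"] = p["height"] * p["width"]   # the constructed attribute

print([p["area"] for p in plots])  # [40.0, 42.0]
```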
Data Reduction via Numerosity Reduction