
CS 5162 - Data Mining (DM)

Fall 2024
Section A

Chapter 3: Data
Preprocessing

Dr. Malik Tahir Hassan, University of Management and Technology

Cosine Similarity
Traditional distance measures do not work
well for sparse numeric data such as
term-frequency vectors


[Table: term-frequency vectors for Document 1 – Document 4]
Similarity for Sparse Data (e.g.,
Text)
Comparing Documents
E.g. Computing Similarity/Plagiarism reports

Write an essay on “My favorite sport”


[Table: term frequencies of the four essays, Document 1 – Document 4]
Cosine Similarity
Traditional distance measures do not work
well for sparse numeric data such as
term-frequency vectors

Cosine Similarity is a Solution


Cosine Similarity

cos(x, y) = (x . y) / (||x|| ||y||)

where
x . y = x1*y1 + x2*y2 + … + xp*yp
||x|| = sqrt(x1^2 + x2^2 + … + xp^2), the length of vector x
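
Illustration (not from the original slides): a minimal Python sketch of cosine similarity on term-frequency vectors; the two documents below are made-up word counts.

```python
import math
from collections import Counter

def cosine_similarity(x, y):
    """Cosine similarity between two term-frequency dictionaries."""
    # Dot product over the terms the two vectors share
    dot = sum(x[t] * y[t] for t in x if t in y)
    # Euclidean lengths ||x|| and ||y||
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical documents turned into term-frequency vectors
doc1 = Counter("pakistan pakistan country country friend muslim".split())
doc2 = Counter("pakistan pakistan country muslim friend friend".split())
print(cosine_similarity(doc1, doc2))  # a value near 1 means the documents are similar
```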


Activity
Given the two SMS below, convert these to document term frequency vectors and find their
similarity. Are the two documents similar? Please comment. Use underlined terms only.
a. I was born in Pakistan. Pakistan is my country. I love my country. Pakistan Zindabad.
China is a friend country. I am a Muslim.
b. Pakistan is a Muslim country. Forces of Pakistan are very strong. They are always ready.
A friend in need is a friend indeed.
Data Mining

Ch. 3: Data
Preprocessing
Data quality
Garbage in, garbage out!
Data Quality
 Accuracy

 Completeness

 Consistency

 Timeliness

 Believability

 Interpretability
Inaccurate Data
Data having incorrect attribute values

Data collection instruments used may be faulty

Human or computer errors occurring at data entry

Errors in data transmission

There may be technology limitations such as
limited buffer size for coordinating synchronized
data transfer and consumption
Inaccurate Data
Users may purposely submit incorrect data
values for mandatory fields when they do
not wish to submit personal information
e.g., by choosing the default value “January
1” displayed for birthday
This is known as disguised missing data
Incomplete Data
Attributes of interest may not always be
available
 e.g., customer information for sales
 transaction data

Data may not be included simply because
they were not considered important at the
time of data entry

Data may not be recorded due to
equipment malfunctions
Inconsistent Data
Incorrect/Inconsistent data may also result
from
Discrepancies in the codes used to
categorize items
 University of Management and Technology
 University of Management & Technology
 UMT
 BS(CS), BSCS, BS-CS, BS Computer Science

Inconsistent formats for input fields
 e.g., date: April 15, 2021
 15 April 2021
 15-04-2021
 15/04/21
Timeliness Issues
Monthly sales bonuses
Failure to submit sales records on time at the
end of the month
Corrections and adjustments that flow in
after the month’s end

Merit Award
Delayed submissions of grades
Believability Issues
For example, the database, at one point,
had several errors, all of which have since
been corrected

The past errors, however, had caused many
problems for sales department users, and
so they no longer trust the data
Interpretability Issues
The data uses many accounting codes,
which the sales department does not know
how to interpret
Major Tasks in Data Preprocessing
Data cleaning, data integration, data reduction,
and data transformation
Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Transformation
Data Preprocessing
Data Cleaning

Real-world data tend to be incomplete, noisy,
and inconsistent

Data Cleaning
Filling in missing values, smoothing out
noise while identifying outliers, and
correcting inconsistencies in the data
Data Cleaning
Handling Missing Values

Smoothing Noisy Data


Handling Missing Values

ID, age, gender, income, loan
11, 35, M, 30, N
12, 45, M, 60, Y
13, 40, F, 40, N
14, 32, M, 30, N
15, 30, M, ?, N

Ignore the tuple

Fill in the missing value manually

Use a global constant to fill in the missing value
 e.g., “Unknown” or −∞
Missing Values
Use a measure of central tendency for the
attribute
 e.g., the mean or median

Use the attribute mean or median for all
samples belonging to the same class as the
given tuple

ID, age, gender, income, loan
11, 35, M, 30, N
12, 45, M, 60, Y
13, 40, F, 40, N
14, 32, M, 30, N
15, 30, M, ?, N
Missing Values
Use the most probable value to fill in the
missing value
 regression, decision tree induction, etc.

ID, age, gender, income, loan
11, 35, M, 30, N
12, 45, M, 60, Y
13, 40, F, 40, N
14, 32, M, 30, N
15, 30, M, ?, N
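
A minimal pandas sketch of these options (not from the slides); the small loan table above is re-typed here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [11, 12, 13, 14, 15],
    "age": [35, 45, 40, 32, 30],
    "gender": ["M", "M", "F", "M", "M"],
    "income": [30, 60, 40, 30, None],   # tuple 15 has a missing income
    "loan": ["N", "Y", "N", "N", "N"],
})

dropped = df.dropna()                                  # ignore the tuple
constant = df.fillna({"income": "Unknown"})            # global constant (column becomes object dtype)
mean_all = df.fillna({"income": df["income"].mean()})  # overall mean
# mean of the same class (loan = N) as the tuple with the missing value
class_mean = df.groupby("loan")["income"].transform("mean")
same_class = df.assign(income=df["income"].fillna(class_mean))
print(same_class)
```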
Noisy Data
What is noise?

Noise is a random error or variance in a
measured variable

What is a remedy?

 Data Smoothing Techniques
Data Smoothing Techniques
 Binning
Smoothing by bin means
Smoothing by bin medians
Smoothing by bin boundaries

 Regression
Linear regression
Multiple linear regression

 Outlier Analysis
Values that fall outside of the set of clusters may
be considered outliers
Binning
Smoothing by bin means
 Each value in a bin is replaced by the mean value of
the bin
Smoothing by bin medians
 Each bin value is replaced by the bin median

Smoothing by bin boundaries


 The minimum and maximum values in a given bin
are identified as the bin boundaries
 Each bin value is then replaced by the closest
boundary value
Binning
Example output of smoothing by bin means:
 Bin 1: 9, 9, 9   Bin 2: 22, 22, 22   Bin 3: 29, 29, 29
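
A short Python sketch of equal-frequency binning with smoothing by bin means (not from the slides); the input values are an assumption, taken from the classic textbook price example that reproduces the bin means above.

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning, then replace each value by its bin mean."""
    data = sorted(values)
    size = len(data) // n_bins          # values per bin (assumes an even split)
    smoothed = []
    for i in range(n_bins):
        bin_vals = data[i * size:(i + 1) * size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

# Data assumed from the classic textbook example (not shown on the slide)
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```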
Regression
Linear regression
 Involves finding the “best” line to fit two
 attributes (or variables) so that one attribute
 can be used to predict the other
 e.g., Y = 5 – 10X

Multiple linear regression
 Extension of linear regression, where more
 than two attributes are involved
 The data are fit to a multidimensional surface
 e.g., Y = 2 – 5X1 + 6X2 – 3X3
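
For illustration (not from the slides), a minimal NumPy sketch of fitting a least-squares line so one attribute can predict, and thereby smooth, another; the x/y values are made up.

```python
import numpy as np

# Hypothetical noisy observations of attribute y against attribute x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 3.2, 1.0, -1.1, -2.8])

# Fit y ≈ w1*x + w0 by least squares
w1, w0 = np.polyfit(x, y, deg=1)
y_smoothed = w1 * x + w0              # values predicted by the fitted line
print(f"Y = {w0:.2f} + {w1:.2f}X")
print(y_smoothed)
```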
Outlier analysis
Values that fall outside of the set of clusters
may be considered outliers
Data Cleaning
 A Two-Step Process
Discrepancy Detection
 Poorly designed forms with optional fields, human error,
deliberate error, data decay, inconsistencies, outliers, missing
values, noise, etc.
 Metadata, attribute type, attribute range, outlier analysis,
format check, unique rule, consecutive rule
Data Transformation
 to correct discrepancies
 Data Scrubbing
 use simple domain knowledge (e.g., knowledge of postal
addresses and spell-checking)
 Data auditing
 discover rules and relationships, and detect data that violate
such conditions, correlation analysis, cluster analysis, etc.
Activity
Read, watch, explore
Read
 Get rid of the dirt from your data — Data Cleaning techniques
Watch the video
 Google Refine
Explore the tool
 OpenRefine is a powerful, free, open-source tool for
 working with messy data: cleaning it and transforming
 it from one format into another.
Download and Install Weka
 Machine learning software for solving data mining
 problems
 Explore Weka and the datasets that come with it,
 e.g., Iris.
Data Integration

The merging of data from multiple data stores

Challenges in Data Integration
 Entity Identification Problem
  How can we match schema and objects from different sources?
 Redundancy and Correlation Analysis
  Are any attributes correlated?
 Tuple Duplication
 Data Value Conflict Detection and Resolution
  For the same real-world entity, attribute values from different sources are different
  Possible reasons: different representations, different scales (e.g., metric vs. British units), different grading systems, etc.
Entity Identification Problem
Do customer id in one database and cust
number in another refer to the same
attribute?
Entity Identification Problem
Special attention must be paid to the
structure of the data
In one system, a discount may be applied to
the order, whereas in another system it is
applied to each individual line item within the
order
Entity Identification Problem
Metadata
 e.g., the name, meaning, data type, and
 range of values permitted for an attribute

Metadata can be used to help avoid
errors in schema integration
Redundancy and Correlation
Analysis

An attribute (such as annual revenue, for
instance) may be redundant if it can be
“derived” from another attribute or set of
attributes

Some redundancies can be detected by
correlation analysis
Correlation Analysis
Given two attributes, a correlation analysis
can measure how strongly one attribute
implies the other, based on the available
data
Correlation Analysis
Nominal data
χ2(chi-square) test

Numeric attributes
Correlation Coefficient
Covariance
χ2(chi-square) Test
Given two nominal attributes, A and B
Domain of A = {a1,a2, …,ac }
Domain of B = {b1,b2, …,br }

Construct a contingency table as follows:
 The c values of A making up the columns
 The r values of B making up the rows
χ² (chi-square) Test
Let (Ai, Bj) be the joint event representing
A = ai, B = bj

Then

oij is the observed frequency (i.e., actual
count) of the joint event (Ai, Bj)

eij is the expected frequency of the joint
event (Ai, Bj), computed as
 eij = count(A = ai) × count(B = bj) / n
where n is the total number of data tuples
χ² (chi-square) Test
The χ² statistic tests the hypothesis that A
and B are independent
 that is, there is no correlation between them

 χ² = Σi Σj (oij − eij)² / eij

The test is based on a significance level,
with (r−1)×(c−1) degrees of freedom

If the hypothesis can be rejected, then we
say that A and B are statistically correlated
Example 3.1
Suppose that a group of 1500 people was
surveyed. The gender of each person was
noted. Each person was polled as to
whether his or her preferred type of reading
material was fiction or nonfiction.

Thus, we have two attributes,
GENDER and PREFERRED READING

Sample records (1 … 1500):
1, M, Fiction
2, M, NonFiction
3, F, Fiction
4, F, Fiction
5, M, NonFiction
…
Example 3.1
The observed frequency (or count) of each
possible joint event is summarized in the
contingency table shown below:

            Male  Female  Total
Fiction     250   200     450
Non-fiction 50    1000    1050
Total       300   1200    1500
Example 3.1
What is the expected frequency of each
possible joint event ???

            Male       Female      Total
Fiction     250 (???)  200 (???)   450
Non-fiction 50 (???)   1000 (???)  1050
Total       300        1200        1500

Expected Male Fiction = 300*450/1500 = 90


Example 3.1
What is the expected frequency of each
possible joint event ???

            Male      Female      Total
Fiction     250 (90)  200 (360)   450
Non-fiction 50 (210)  1000 (840)  1050
Total       300       1200        1500

Expected Male Non-fiction = 300*1050/1500 = 210


Example 3.1
            Male      Female      Total
Fiction     250 (90)  200 (360)   450
Non-fiction 50 (210)  1000 (840)  1050
Total       300       1200        1500
Example 3.1

            Male      Female      Total
Fiction     250 (90)  200 (360)   450
Non-fiction 50 (210)  1000 (840)  1050
Total       300       1200        1500

χ² = (250−90)²/90 + (50−210)²/210 + (200−360)²/360 + (1000−840)²/840
   = 284.44 + 121.90 + 71.11 + 30.48 = 507.93
Example 3.1

For this 2×2 table, the degrees of freedom
are (2−1)(2−1) = 1

For 1 degree of freedom, the χ² value
needed to reject the hypothesis at the
0.001 significance level is 10.828
Example 3.1

 For 1 degree of freedom, the χ² value needed to
 reject the hypothesis at the 0.001 significance level is
 10.828

 Since our computed value (507.93) is above this, we
 can reject the hypothesis that gender and preferred
 reading are independent

 Hence, the two attributes are (strongly)
 correlated for the given group of people

Hypothesis H0: The two attributes are independent
H1: The two attributes are correlated
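
A quick check of Example 3.1 with SciPy (not part of the slides); chi2_contingency applies the same formula to the observed counts.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the contingency table (rows: fiction/non-fiction, cols: male/female)
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ≈ 507.93
print(dof)       # 1
print(expected)  # [[ 90. 360.] [210. 840.]]
print(p_value)   # far below 0.001, so reject independence
```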
Correlation Coefficient
For numeric attributes, we can evaluate the
correlation between two attributes, A and B,
by computing the correlation coefficient
A.k.a. Pearson’s product moment coefficient
Correlation Coefficient

rA,B = Σ (ai − Ā)(bi − B̄) / (n σA σB)
 where Ā, B̄ are the means and σA, σB the standard deviations of A and B over n tuples

−1 ≤ rA,B ≤ +1

If rA,B is greater than 0, then A and B are
positively correlated, meaning that the
values of A increase as the values of B
increase
Correlation Coefficient

−1 ≤ rA,B ≤ +1

If rA,B is greater than 0, then A and B are
positively correlated, meaning that the values
of A increase as the values of B increase.
 The higher the value, the stronger the
 correlation (i.e., the more each attribute
 implies the other).
 Hence, a higher value may indicate that A (or B)
 may be removed as a redundancy
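
A minimal NumPy sketch (not from the slides; the two attributes are made-up numbers).

```python
import numpy as np

# Hypothetical attributes: annual revenue roughly tracks number of sales
sales = np.array([10, 20, 30, 40, 50], dtype=float)
revenue = np.array([12, 19, 33, 38, 52], dtype=float)

r = np.corrcoef(sales, revenue)[0, 1]   # Pearson correlation coefficient
print(r)  # close to +1: strong positive correlation, one attribute is nearly redundant
```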
Covariance
Used for assessing how much two
attributes change together
 Cov(A, B) = E[(A − Ā)(B − B̄)] = (Σ ai bi)/n − Ā·B̄

Variance is a special case of covariance,
where the two attributes are identical
 i.e., the covariance of an attribute with itself
 Var(X) = E[(X − x̄)²] = E[(X − x̄)(X − x̄)] = (Σ x²)/n − (x̄)²
Covariance Matrix
Cov   A                  B                  C
A     Cov(A,A) = Var(A)
B     Cov(A,B)           Cov(B,B) = Var(B)
C     Cov(A,C)           Cov(B,C)           Cov(C,C) = Var(C)
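
For illustration (not from the slides), NumPy builds this matrix directly; the three attributes below are invented.

```python
import numpy as np

# Rows are attributes A, B, C; columns are observations
A = [2.0, 4.0, 6.0, 8.0]
B = [1.0, 3.0, 2.0, 5.0]
C = [9.0, 7.0, 4.0, 2.0]

cov_matrix = np.cov([A, B, C], bias=True)  # population covariance (divide by n)
print(cov_matrix)          # 3x3 matrix; diagonal entries are the variances
print(cov_matrix[0, 0])    # Var(A) = Cov(A, A)
```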
Covariance
Correlation and Covariance are two similar
measures
Both are used for assessing how much two
attributes change together
Covariance analysis
Share your findings based on the
covariance analysis of the following data
Chapter 3: Data Preprocessing

 Data Preprocessing: An Overview

Data Quality

Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
Data Reduction
Obtain a reduced representation of the data set
that is much smaller in volume yet produces the
same (or almost the same) analytical results
Data Reduction Strategies
 Data reduction: Obtain a reduced representation of the data
set that is much smaller in volume yet produces the same (or
almost the same) analytical results
 Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long
time to run on the complete data set.

 Data reduction strategies


 Dimensionality reduction, e.g., remove unimportant
attributes
 Numerosity reduction (some simply call it: Data Reduction)
 Data compression

Data Reduction 1: Dimensionality Reduction
 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which are critical to clustering and outlier
analysis, become less meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Principal Component Analysis
 Wavelet transforms
 Supervised and nonlinear techniques (e.g., feature selection)
Principal Component Analysis (PCA)
 Find a projection that captures the largest amount of variation in data
 The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space

[Figure: data points plotted on axes x1 and x2, with the principal components shown as the directions of greatest variation]
Principal Component Analysis
(Steps)
 Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
 Normalize input data: Each attribute falls within the same range
 Compute k orthonormal (unit) vectors, i.e., principal components
 Each input data (vector) is a linear combination of the k principal
component vectors
 The principal components are sorted in order of decreasing
“significance” or strength
 Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e.,
using the strongest principal components, it is possible to reconstruct a
good approximation of the original data)
 Works for numeric data only

 Principal Component Analysis in Python


 A step by step tutorial to PCA
 https://plot.ly/ipython-notebooks/principal-component-analysis/
 http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Iris Dataset and PCA
150×4 data matrix (sepal length, petal length, sepal width, petal width)
Covariance matrix: 4×4
Four eigenvectors: 4×4
Four eigenvalues
Select the top 2 eigenvectors (4×2) based on the highest eigenvalues
Project the data: (2×4) × (4×150) = (2×150), i.e., 150 samples in 2 dimensions (150×2)
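
A small scikit-learn sketch of this pipeline (not from the slides): reduce the 150×4 Iris matrix to 150×2 using the top two principal components.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                           # 150 x 4 matrix
X_scaled = StandardScaler().fit_transform(X)   # normalize each attribute

pca = PCA(n_components=2)                      # keep the top 2 eigenvectors
X_reduced = pca.fit_transform(X_scaled)        # 150 x 2 matrix

print(X_reduced.shape)                         # (150, 2)
print(pca.explained_variance_ratio_)           # variance captured by each component
```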
Attribute Subset Selection
 Another way to reduce dimensionality of data
 Remove Redundant attributes
Duplicate much or all of the information
contained in one or more other attributes
E.g., purchase price of a product and the
amount of sales tax paid
 Remove Irrelevant attributes
Contain no information that is useful for the
data mining task at hand
E.g., students' ID is often irrelevant to the task
of predicting students' GPA
Attribute Subset Selection
Classify customers based on whether or not
they are likely to purchase a popular new
CD
Customer’s Telephone Number
Customer’s Age
Customer’s Music Taste

Domain Expert
To pick out some of the useful attributes
Attribute Subset Selection
Classify customers based on whether or not
they are likely to purchase a popular new
CD
Customer’s Telephone Number
Customer’s Age
Customer’s Music Taste

Domain Expert
To pick out some of the useful attributes
(difficult & time consuming)
Heuristic Search in Attribute
Selection
 There are 2^d possible attribute combinations of d
attributes
 Typical heuristic attribute selection methods:
 Best single attribute under the attribute
 independence assumption: choose by significance
 tests
 Best step-wise feature selection (see the sketch below):
  The best single attribute is picked first
  Then the next best attribute conditioned on the first, ...
 Step-wise attribute elimination:
  Repeatedly eliminate the worst attribute
 Best combined attribute selection and elimination
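
A minimal sketch of forward step-wise selection (an assumed implementation, not the slides' algorithm), greedily adding the attribute that most improves a cross-validated score.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))   # candidate attribute indices
selected = []

for _ in range(2):  # pick the 2 best attributes step-wise
    scores = {
        f: cross_val_score(LogisticRegression(max_iter=500),
                           X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    best = max(scores, key=scores.get)   # attribute that helps most given those already chosen
    selected.append(best)
    remaining.remove(best)

print(selected)  # indices of the chosen attributes
```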
Attribute Subset Selection
Decision Tree Induction
Constructs a flowchart-like structure
At each node, the algorithm chooses the
“best” attribute to partition the data into
individual classes
The set of attributes appearing in the tree
form the reduced subset of attributes
Decision Tree Induction
 Constructs a flowchart-like structure
 At each node, the
algorithm chooses the
“best” attribute to
partition the data into
individual classes
 The set of attributes
appearing in the tree
form the reduced
subset of attributes
Match #  HomeGround  Weather  Result  Prediction
1        Yes         Cloudy   Win
2        No          Sunny    Lose
3        Yes         Sunny    Win
4        No          Cloudy   Lose
5        Yes         Cloudy   ?
6        No          Cloudy   ?
7        Yes         Sunny    ?
Candidate split on HomeGround:
 yes → matches 1 (Win), 3 (Win)
 no  → matches 2 (Lose), 4 (Lose)

Candidate split on Weather:
 cloudy → matches 1 (Win), 4 (Lose)
 sunny  → matches 2 (Lose), 3 (Win)
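
A small scikit-learn sketch of the same idea (not from the slides): fit a tree on matches 1–4, then read off which attributes it actually used.

```python
from sklearn.tree import DecisionTreeClassifier

# Matches 1-4 encoded as [HomeGround (1=yes), Weather (1=cloudy)]
X_train = [[1, 1], [0, 0], [1, 0], [0, 1]]
y_train = ["Win", "Lose", "Win", "Lose"]

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)

# Attributes with non-zero importance form the reduced subset (here only HomeGround)
print(tree.feature_importances_)               # e.g., [1.0, 0.0]
print(tree.predict([[1, 1], [0, 1], [1, 0]]))  # predictions for matches 5-7
```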
Attribute Subset Selection

Stopping Criteria
A threshold, on the measure used, may be
employed to determine when to stop the
attribute selection process
Attribute Creation (Feature
Generation)
 Create new attributes (features) that can capture the
important information in a data set more effectively than
the original ones
 Three general methodologies
Attribute extraction
 Domain-specific
Mapping data to new space (see: data reduction)
 E.g., Fourier transformation, wavelet transformation, PCA
Attribute construction
 Combining features, e.g., a new area attribute combining height and width
 Data discretization

Data Reduction via Numerosity Reduction

 Reduce data volume by choosing alternative,
smaller forms of data representation
 Parametric methods (e.g., regression)
  Assume the data fits some model, estimate
  model parameters, store only the parameters,
  and discard the data (except possible outliers)
 Non-parametric methods
  Do not assume models
  Major families: histograms, clustering, sampling (see the sampling sketch below)

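As one last illustration (not from the slides), simple random sampling without replacement with pandas; the DataFrame contents and the sampling fraction are arbitrary.

```python
import pandas as pd

# Hypothetical large table of transactions
transactions = pd.DataFrame({"amount": range(1, 10001)})

# Simple random sample without replacement: keep 1% of the tuples
sample = transactions.sample(frac=0.01, replace=False, random_state=42)
print(len(sample))   # about 100 rows represent the full data set
```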
