
CS 5162 - Data Mining (DM)

Fall 2024
Section A

Chapter 3: Data
Preprocessing

Dr. Malik Tahir Hassan, University of Management and Technology

Cosine Similarity
Traditional distance measures do not work
well for sparse numeric data such as
term-frequency vectors


[Table: term-frequency vectors for Document 1 – Document 4]
Similarity for Sparse Data (e.g.,
Text)
Comparing Documents
E.g. Computing Similarity/Plagiarism reports

Write an essay on “My favorite sport”


[Table: term frequencies of the four essays, Document 1 – Document 4]
Cosine Similarity
Traditional distance measures do not work
well for sparse numeric data such as
term-frequency vectors

Cosine Similarity is a Solution


Cosine Similarity

cos(x, y) = (x . y) / (||x|| ||y||)

where
x . y = x1*y1 + x2*y2 + … + xp*yp
||x|| = sqrt(x1^2 + x2^2 + … + xp^2), the length of vector x
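
Illustration (not from the original slides): a minimal Python sketch of cosine similarity on term-frequency vectors; the two documents below are made-up word counts.

```python
import math
from collections import Counter

def cosine_similarity(x, y):
    """Cosine similarity between two term-frequency dictionaries."""
    # Dot product over the terms the two vectors share
    dot = sum(x[t] * y[t] for t in x if t in y)
    # Euclidean lengths ||x|| and ||y||
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical documents turned into term-frequency vectors
doc1 = Counter("pakistan pakistan country country friend muslim".split())
doc2 = Counter("pakistan pakistan country muslim friend friend".split())
print(cosine_similarity(doc1, doc2))  # a value near 1 means the documents are similar
```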


Activity
Given the two SMS below, convert these to document term frequency vectors and find their
similarity. Are the two documents similar? Please comment. Use underlined terms only.
a. I was born in Pakistan. Pakistan is my country. I love my country. Pakistan Zindabad.
China is a friend country. I am a Muslim.
b. Pakistan is a Muslim country. Forces of Pakistan are very strong. They are always ready.
A friend in need is a friend indeed.
Data Mining

Ch. 3: Data
Preprocessing
Data quality
Garbage in, garbage out!
Data Quality
 Accuracy

 Completeness

 Consistency

 Timeliness

 Believability

 Interpretability
Inaccurate Data
Data having incorrect attribute values

Data collection instruments used may be faulty

Human or computer errors occurring at data entry

Errors in data transmission

There may be technology limitations such as
limited buffer size for coordinating synchronized
data transfer and consumption
Inaccurate Data
Users may purposely submit incorrect data
values for mandatory fields when they do
not wish to submit personal information
e.g., by choosing the default value “January
1” displayed for birthday
This is known as disguised missing data
Incomplete Data
Attributes of interest may not always be
available
 e.g., customer information for sales
 transaction data

Data may not be included simply because
they were not considered important at the
time of data entry

Data may not be recorded due to
equipment malfunctions
Inconsistent Data
Incorrect/Inconsistent data may also result
from
Discrepancies in the codes used to
categorize items
 University of Management and Technology
 University of Management & Technology
 UMT
 BS(CS), BSCS, BS-CS, BS Computer Science

Inconsistent formats for input fields
 e.g., date: April 15, 2021
 15 April 2021
 15-04-2021
 15/04/21
Timeliness Issues
Monthly sales bonuses
Failure to submit sales records on time at the
end of the month
Corrections and adjustments that flow in
after the month’s end

Merit Award
Delayed submissions of grades
Believability Issues
For example, the database, at one point,
had several errors, all of which have since
been corrected

The past errors, however, had caused many
problems for sales department users, and
so they no longer trust the data
Interpretability Issues
The data uses many accounting codes,
which the sales department does not know
how to interpret
Major Tasks in Data Preprocessing
Data cleaning, data integration, data reduction,
and data transformation
Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Transformation
Data Preprocessing
Data Cleaning

Real-world data tend to be incomplete, noisy,
and inconsistent

Data Cleaning
Filling in missing values, smoothing out
noise while identifying outliers, and
correcting inconsistencies in the data
Data Cleaning
Handling Missing Values

Smoothing Noisy Data


Handling Missing Values

ID, age, gender, income, loan
11, 35, M, 30, N
12, 45, M, 60, Y
13, 40, F, 40, N
14, 32, M, 30, N
15, 30, M, ?, N

Ignore the tuple

Fill in the missing value manually

Use a global constant to fill in the missing value
 e.g., “Unknown” or −∞
Missing Values
Use a measure of central tendency for the
attribute
 e.g., the mean or median

Use the attribute mean or median for all
samples belonging to the same class as the
given tuple

ID, age, gender, income, loan
11, 35, M, 30, N
12, 45, M, 60, Y
13, 40, F, 40, N
14, 32, M, 30, N
15, 30, M, ?, N
Missing Values
Use the most probable value to fill in the
missing value
 regression, decision tree induction, etc.

ID, age, gender, income, loan
11, 35, M, 30, N
12, 45, M, 60, Y
13, 40, F, 40, N
14, 32, M, 30, N
15, 30, M, ?, N
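
A minimal pandas sketch of these options (not from the slides); the small loan table above is re-typed here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [11, 12, 13, 14, 15],
    "age": [35, 45, 40, 32, 30],
    "gender": ["M", "M", "F", "M", "M"],
    "income": [30, 60, 40, 30, None],   # tuple 15 has a missing income
    "loan": ["N", "Y", "N", "N", "N"],
})

dropped = df.dropna()                                  # ignore the tuple
constant = df.fillna({"income": "Unknown"})            # global constant (column becomes object dtype)
mean_all = df.fillna({"income": df["income"].mean()})  # overall mean
# mean of the same class (loan = N) as the tuple with the missing value
class_mean = df.groupby("loan")["income"].transform("mean")
same_class = df.assign(income=df["income"].fillna(class_mean))
print(same_class)
```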
Noisy Data
What is noise?

Noise is a random error or variance in a
measured variable

What is a remedy?

 Data Smoothing Techniques
Data Smoothing Techniques
 Binning
Smoothing by bin means
Smoothing by bin medians
Smoothing by bin boundaries

 Regression
Linear regression
Multiple linear regression

 Outlier Analysis
Values that fall outside of the set of clusters may
be considered outliers
Binning
Smoothing by bin means
 Each value in a bin is replaced by the mean value of
the bin
Smoothing by bin medians
 Each bin value is replaced by the bin median

Smoothing by bin boundaries


 The minimum and maximum values in a given bin
are identified as the bin boundaries
 Each bin value is then replaced by the closest
boundary value
Binning
Example output of smoothing by bin means:
 Bin 1: 9, 9, 9   Bin 2: 22, 22, 22   Bin 3: 29, 29, 29
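
A short Python sketch of equal-frequency binning with smoothing by bin means (not from the slides); the input values are an assumption, taken from the classic textbook price example that reproduces the bin means above.

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning, then replace each value by its bin mean."""
    data = sorted(values)
    size = len(data) // n_bins          # values per bin (assumes an even split)
    smoothed = []
    for i in range(n_bins):
        bin_vals = data[i * size:(i + 1) * size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

# Data assumed from the classic textbook example (not shown on the slide)
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```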
Regression
Linear regression
 Involves finding the “best” line to fit two
 attributes (or variables) so that one attribute
 can be used to predict the other
 e.g., Y = 5 – 10X

Multiple linear regression
 Extension of linear regression, where more
 than two attributes are involved
 The data are fit to a multidimensional surface
 e.g., Y = 2 – 5X1 + 6X2 – 3X3
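
For illustration (not from the slides), a minimal NumPy sketch of fitting a least-squares line so one attribute can predict, and thereby smooth, another; the x/y values are made up.

```python
import numpy as np

# Hypothetical noisy observations of attribute y against attribute x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 3.2, 1.0, -1.1, -2.8])

# Fit y ≈ w1*x + w0 by least squares
w1, w0 = np.polyfit(x, y, deg=1)
y_smoothed = w1 * x + w0              # values predicted by the fitted line
print(f"Y = {w0:.2f} + {w1:.2f}X")
print(y_smoothed)
```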
Outlier analysis
Values that fall outside of the set of clusters
may be considered outliers
Data Cleaning
 A Two-Step Process
Discrepancy Detection
 Poorly designed forms with optional fields, human error,
deliberate error, data decay, inconsistencies, outliers, missing
values, noise, etc.
 Metadata, attribute type, attribute range, outlier analysis,
format check, unique rule, consecutive rule
Data Transformation
 to correct discrepancies
 Data Scrubbing
 use simple domain knowledge (e.g., knowledge of postal
addresses and spell-checking)
 Data auditing
 discover rules and relationships, and detect data that violate
such conditions, correlation analysis, cluster analysis, etc.
Activity
Read, watch, explore
Read
 Get rid of the dirt from your data — Data Cleaning techniques
Watch the video
 Google Refine
Explore the tool
 OpenRefine is a powerful, free, open-source tool for
 working with messy data: cleaning it and transforming
 it from one format into another.
Download and Install Weka
 Machine learning software for solving data mining
 problems
 Explore Weka and the datasets that come with it,
 e.g., Iris.
Data Integration

The merging of data from multiple data stores

Challenges in Data Integration
 Entity Identification Problem
  How can we match schema and objects from different sources?
 Redundancy and Correlation Analysis
  Are any attributes correlated?
 Tuple Duplication
 Data Value Conflict Detection and Resolution
  For the same real-world entity, attribute values from different sources are different
  Possible reasons: different representations, different scales (e.g., metric vs. British units), different grading systems, etc.
Entity Identification Problem
Do customer id in one database and cust
number in another refer to the same
attribute?
Entity Identification Problem
Special attention must be paid to the
structure of the data
In one system, a discount may be applied to
the order, whereas in another system it is
applied to each individual line item within the
order
Entity Identification Problem
Metadata
 e.g., the name, meaning, data type, and
 range of values permitted for an attribute

Metadata can be used to help avoid
errors in schema integration
Redundancy and Correlation
Analysis

An attribute (such as annual revenue, for
instance) may be redundant if it can be
“derived” from another attribute or set of
attributes

Some redundancies can be detected by
correlation analysis
Correlation Analysis
Given two attributes, a correlation analysis
can measure how strongly one attribute
implies the other, based on the available
data
Correlation Analysis
Nominal data
χ2(chi-square) test

Numeric attributes
Correlation Coefficient
Covariance
χ2(chi-square) Test
Given two nominal attributes, A and B
Domain of A = {a1,a2, …,ac }
Domain of B = {b1,b2, …,br }

Construct a contingency table as follows:
 The c values of A making up the columns
 The r values of B making up the rows
χ² (chi-square) Test
Let (Ai, Bj) be the joint event representing
A = ai, B = bj

Then

oij is the observed frequency (i.e., actual
count) of the joint event (Ai, Bj)

eij is the expected frequency of the joint
event (Ai, Bj), computed as
 eij = count(A = ai) × count(B = bj) / n
where n is the total number of data tuples
χ² (chi-square) Test
The χ² statistic tests the hypothesis that A
and B are independent
 that is, there is no correlation between them

 χ² = Σi Σj (oij − eij)² / eij

The test is based on a significance level,
with (r−1)×(c−1) degrees of freedom

If the hypothesis can be rejected, then we
say that A and B are statistically correlated
Example 3.1
Suppose that a group of 1500 people was
surveyed. The gender of each person was
noted. Each person was polled as to
whether his or her preferred type of reading
material was fiction or nonfiction.

Thus, we have two attributes,
GENDER and PREFERRED READING

Sample records (1 … 1500):
1, M, Fiction
2, M, NonFiction
3, F, Fiction
4, F, Fiction
5, M, NonFiction
…
Example 3.1
The observed frequency (or count) of each
possible joint event is summarized in the
contingency table shown below:

            Male  Female  Total
Fiction     250   200     450
Non-fiction 50    1000    1050
Total       300   1200    1500
Example 3.1
What is the expected frequency of each
possible joint event ???

            Male       Female      Total
Fiction     250 (???)  200 (???)   450
Non-fiction 50 (???)   1000 (???)  1050
Total       300        1200        1500

Expected Male Fiction = 300*450/1500 = 90


Example 3.1
What is the expected frequency of each
possible joint event ???

            Male      Female      Total
Fiction     250 (90)  200 (360)   450
Non-fiction 50 (210)  1000 (840)  1050
Total       300       1200        1500

Expected Male Non-fiction = 300*1050/1500 = 210


Example 3.1
            Male      Female      Total
Fiction     250 (90)  200 (360)   450
Non-fiction 50 (210)  1000 (840)  1050
Total       300       1200        1500
Example 3.1

            Male      Female      Total
Fiction     250 (90)  200 (360)   450
Non-fiction 50 (210)  1000 (840)  1050
Total       300       1200        1500

χ² = (250−90)²/90 + (50−210)²/210 + (200−360)²/360 + (1000−840)²/840
   = 284.44 + 121.90 + 71.11 + 30.48 = 507.93
Example 3.1

For this 2×2 table, the degrees of freedom
are (2−1)(2−1) = 1

For 1 degree of freedom, the χ² value
needed to reject the hypothesis at the
0.001 significance level is 10.828
Example 3.1

 For 1 degree of freedom, the χ² value needed to
 reject the hypothesis at the 0.001 significance level is
 10.828

 Since our computed value (507.93) is above this, we
 can reject the hypothesis that gender and preferred
 reading are independent

 Hence, the two attributes are (strongly)
 correlated for the given group of people

Hypothesis H0: The two attributes are independent
H1: The two attributes are correlated
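
A quick check of Example 3.1 with SciPy (not part of the slides); chi2_contingency applies the same formula to the observed counts.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the contingency table (rows: fiction/non-fiction, cols: male/female)
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ≈ 507.93
print(dof)       # 1
print(expected)  # [[ 90. 360.] [210. 840.]]
print(p_value)   # far below 0.001, so reject independence
```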
Correlation Coefficient
For numeric attributes, we can evaluate the
correlation between two attributes, A and B,
by computing the correlation coefficient
A.k.a. Pearson’s product moment coefficient
Correlation Coefficient

rA,B = Σ (ai − Ā)(bi − B̄) / (n σA σB)
 where Ā, B̄ are the means and σA, σB the standard deviations of A and B over n tuples

−1 ≤ rA,B ≤ +1

If rA,B is greater than 0, then A and B are
positively correlated, meaning that the
values of A increase as the values of B
increase
Correlation Coefficient

−1 ≤ rA,B ≤ +1

If rA,B is greater than 0, then A and B are
positively correlated, meaning that the values
of A increase as the values of B increase.
 The higher the value, the stronger the
 correlation (i.e., the more each attribute
 implies the other).
 Hence, a higher value may indicate that A (or B)
 may be removed as a redundancy
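
A minimal NumPy sketch (not from the slides; the two attributes are made-up numbers).

```python
import numpy as np

# Hypothetical attributes: annual revenue roughly tracks number of sales
sales = np.array([10, 20, 30, 40, 50], dtype=float)
revenue = np.array([12, 19, 33, 38, 52], dtype=float)

r = np.corrcoef(sales, revenue)[0, 1]   # Pearson correlation coefficient
print(r)  # close to +1: strong positive correlation, one attribute is nearly redundant
```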
Covariance
Used for assessing how much two
attributes change together
 Cov(A, B) = E[(A − Ā)(B − B̄)] = (Σ ai bi)/n − Ā·B̄

Variance is a special case of covariance,
where the two attributes are identical
 i.e., the covariance of an attribute with itself
 Var(X) = E[(X − x̄)²] = E[(X − x̄)(X − x̄)] = (Σ x²)/n − (x̄)²
Covariance Matrix
Cov   A                  B                  C
A     Cov(A,A) = Var(A)
B     Cov(A,B)           Cov(B,B) = Var(B)
C     Cov(A,C)           Cov(B,C)           Cov(C,C) = Var(C)
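
For illustration (not from the slides), NumPy builds this matrix directly; the three attributes below are invented.

```python
import numpy as np

# Rows are attributes A, B, C; columns are observations
A = [2.0, 4.0, 6.0, 8.0]
B = [1.0, 3.0, 2.0, 5.0]
C = [9.0, 7.0, 4.0, 2.0]

cov_matrix = np.cov([A, B, C], bias=True)  # population covariance (divide by n)
print(cov_matrix)          # 3x3 matrix; diagonal entries are the variances
print(cov_matrix[0, 0])    # Var(A) = Cov(A, A)
```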
Covariance
Correlation and Covariance are two similar
measures
Both are used for assessing how much two
attributes change together
Covariance analysis
Share your findings based on the
covariance analysis of the following data
Chapter 3: Data Preprocessing

 Data Preprocessing: An Overview

Data Quality

Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
Data Reduction
Obtain a reduced representation of the data set
that is much smaller in volume yet produces the
same (or almost the same) analytical results
Data Reduction Strategies
 Data reduction: Obtain a reduced representation of the data
set that is much smaller in volume yet produces the same (or
almost the same) analytical results
 Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long
time to run on the complete data set.

 Data reduction strategies


 Dimensionality reduction, e.g., remove unimportant
attributes
 Numerosity reduction (some simply call it: Data Reduction)
 Data compression

Data Reduction 1: Dimensionality Reduction
 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which are critical to clustering and outlier
analysis, become less meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Principal Component Analysis
 Wavelet transforms
 Supervised and nonlinear techniques (e.g., feature selection)
Principal Component Analysis (PCA)
 Find a projection that captures the largest amount of variation in data
 The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space

[Figure: data points plotted on axes x1 and x2, with the principal components shown as the directions of greatest variation]
Principal Component Analysis
(Steps)
 Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
 Normalize input data: Each attribute falls within the same range
 Compute k orthonormal (unit) vectors, i.e., principal components
 Each input data (vector) is a linear combination of the k principal
component vectors
 The principal components are sorted in order of decreasing
“significance” or strength
 Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e.,
using the strongest principal components, it is possible to reconstruct a
good approximation of the original data)
 Works for numeric data only

 Principal Component Analysis in Python


 A step by step tutorial to PCA
 https://plot.ly/ipython-notebooks/principal-component-analysis/
 http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Iris Dataset and PCA
150×4 data matrix (sepal length, petal length, sepal width, petal width)
Covariance matrix: 4×4
Four eigenvectors: 4×4
Four eigenvalues
Select the top 2 eigenvectors (4×2) based on the highest eigenvalues
Project the data: (2×4) × (4×150) = (2×150), i.e., 150 samples in 2 dimensions (150×2)
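
A small scikit-learn sketch of this pipeline (not from the slides): reduce the 150×4 Iris matrix to 150×2 using the top two principal components.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                           # 150 x 4 matrix
X_scaled = StandardScaler().fit_transform(X)   # normalize each attribute

pca = PCA(n_components=2)                      # keep the top 2 eigenvectors
X_reduced = pca.fit_transform(X_scaled)        # 150 x 2 matrix

print(X_reduced.shape)                         # (150, 2)
print(pca.explained_variance_ratio_)           # variance captured by each component
```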
Attribute Subset Selection
 Another way to reduce dimensionality of data
 Remove Redundant attributes
Duplicate much or all of the information
contained in one or more other attributes
E.g., purchase price of a product and the
amount of sales tax paid
 Remove Irrelevant attributes
Contain no information that is useful for the
data mining task at hand
E.g., students' ID is often irrelevant to the task
of predicting students' GPA
Attribute Subset Selection
Classify customers based on whether or not
they are likely to purchase a popular new
CD
Customer’s Telephone Number
Customer’s Age
Customer’s Music Taste

Domain Expert
To pick out some of the useful attributes
Attribute Subset Selection
Classify customers based on whether or not
they are likely to purchase a popular new
CD
Customer’s Telephone Number
Customer’s Age
Customer’s Music Taste

Domain Expert
To pick out some of the useful attributes
(difficult & time consuming)
Heuristic Search in Attribute
Selection
 There are 2^d possible attribute combinations of d
attributes
 Typical heuristic attribute selection methods:
 Best single attribute under the attribute
 independence assumption: choose by significance
 tests
 Best step-wise feature selection (see the sketch below):
  The best single attribute is picked first
  Then the next best attribute conditioned on the first, ...
 Step-wise attribute elimination:
  Repeatedly eliminate the worst attribute
 Best combined attribute selection and elimination
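
A minimal sketch of forward step-wise selection (an assumed implementation, not the slides' algorithm), greedily adding the attribute that most improves a cross-validated score.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))   # candidate attribute indices
selected = []

for _ in range(2):  # pick the 2 best attributes step-wise
    scores = {
        f: cross_val_score(LogisticRegression(max_iter=500),
                           X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    best = max(scores, key=scores.get)   # attribute that helps most given those already chosen
    selected.append(best)
    remaining.remove(best)

print(selected)  # indices of the chosen attributes
```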
Attribute Subset Selection
Decision Tree Induction
Constructs a flowchart-like structure
At each node, the algorithm chooses the
“best” attribute to partition the data into
individual classes
The set of attributes appearing in the tree
form the reduced subset of attributes
Decision Tree Induction
 Constructs a flowchart-like structure
 At each node, the
algorithm chooses the
“best” attribute to
partition the data into
individual classes
 The set of attributes
appearing in the tree
form the reduced
subset of attributes
Match #  HomeGround  Weather  Result  Prediction
1        Yes         Cloudy   Win
2        No          Sunny    Lose
3        Yes         Sunny    Win
4        No          Cloudy   Lose
5        Yes         Cloudy   ?
6        No          Cloudy   ?
7        Yes         Sunny    ?
Candidate split on HomeGround:
 yes → matches 1 (Win), 3 (Win)
 no  → matches 2 (Lose), 4 (Lose)

Candidate split on Weather:
 cloudy → matches 1 (Win), 4 (Lose)
 sunny  → matches 2 (Lose), 3 (Win)
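
A small scikit-learn sketch of the same idea (not from the slides): fit a tree on matches 1–4, then read off which attributes it actually used.

```python
from sklearn.tree import DecisionTreeClassifier

# Matches 1-4 encoded as [HomeGround (1=yes), Weather (1=cloudy)]
X_train = [[1, 1], [0, 0], [1, 0], [0, 1]]
y_train = ["Win", "Lose", "Win", "Lose"]

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)

# Attributes with non-zero importance form the reduced subset (here only HomeGround)
print(tree.feature_importances_)               # e.g., [1.0, 0.0]
print(tree.predict([[1, 1], [0, 1], [1, 0]]))  # predictions for matches 5-7
```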
Attribute Subset Selection

Stopping Criteria
A threshold, on the measure used, may be
employed to determine when to stop the
attribute selection process
Attribute Creation (Feature
Generation)
 Create new attributes (features) that can capture the
important information in a data set more effectively than
the original ones
 Three general methodologies
Attribute extraction
 Domain-specific
Mapping data to new space (see: data reduction)
 E.g., Fourier transformation, wavelet transformation, PCA
Attribute construction
 Combining features, e.g., a new area attribute combining height and width
 Data discretization

Data Reduction via Numerosity Reduction

 Reduce data volume by choosing alternative,
smaller forms of data representation
 Parametric methods (e.g., regression)
  Assume the data fits some model, estimate
  model parameters, store only the parameters,
  and discard the data (except possible outliers)
 Non-parametric methods
  Do not assume models
  Major families: histograms, clustering, sampling (see the sampling sketch below)

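As one last illustration (not from the slides), simple random sampling without replacement with pandas; the DataFrame contents and the sampling fraction are arbitrary.

```python
import pandas as pd

# Hypothetical large table of transactions
transactions = pd.DataFrame({"amount": range(1, 10001)})

# Simple random sample without replacement: keep 1% of the tuples
sample = transactions.sample(frac=0.01, replace=False, random_state=42)
print(len(sample))   # about 100 rows represent the full data set
```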
