
SET 393: Data Mining and Business Intelligence

3rd Year

Spring 2025

Lec. 5

Chapter 2. Data, Measurements, and Data Preprocessing


Assistant Professor: Dr. Rasha Saleh
Cosine Similarity
◼ A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.

◼ Applications: information retrieval, biologic taxonomy, gene feature mapping, ...


◼ Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then

cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),

i.e., the dot product of d1 and d2 divided by the product of their lengths,
where • indicates the vector dot product and ||d|| is the length of vector d

Cosine Similarity

◼ Reminder: dot product
◼ Assume d1 = (1, 2) and d2 = (0, 3). Then

d1 • d2 = 1*0 + 2*3 = 6
||d1|| = (1^2 + 2^2)^0.5 = (5)^0.5
||d2|| = (0^2 + 3^2)^0.5 = (9)^0.5 = 3

cos(d1, d2) = 6 / ((5)^0.5 * 3) = 2 / (5)^0.5 ≈ 0.894

Example: Cosine Similarity

◼ cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,


where • indicates the vector dot product and ||d|| is the length of vector d

◼ Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5^2 + 0^2 + 3^2 + 0^2 + 2^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2)^0.5 = (42)^0.5 ≈ 6.481
||d2|| = (3^2 + 0^2 + 2^2 + 0^2 + 1^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2)^0.5 = (17)^0.5 ≈ 4.123
cos(d1, d2) = 25 / (6.481 * 4.123) ≈ 0.94
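
As a quick check, here is a minimal Python sketch of the same computation (the function name is illustrative, not part of the lecture):

    import math

    def cosine_similarity(a, b):
        # dot product of the two term-frequency vectors
        dot = sum(x * y for x, y in zip(a, b))
        # Euclidean lengths ||a|| and ||b||
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
    d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
    print(round(cosine_similarity(d1, d2), 2))  # 0.94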
Exercise

Exercise- Answer

(a) What is the mean of the data? What is the median?

• The (arithmetic) mean of the data is 809/27 ≈ 30.
• The median (middle value of the ordered set, as the number of values in the set is odd) of the data is 25.

(b) What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
• This data set has two values that occur with the same highest frequency and is, therefore, bimodal. The modes (values occurring with the greatest frequency) of the data are 25 and 35.

(c) What is the midrange of the data?
• The midrange (average of the largest and smallest values in the data set) of the data is (70 + 13)/2 = 41.5.

(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
• The first quartile (corresponding to the 25th percentile) of the data is 20.
Exercise - Answer (continued)

• The third quartile (corresponding to the 75th percentile) of the data is 35.

(e) Give the five-number summary of the data.
• The five-number summary of a distribution consists of the minimum value, first quartile, median, third quartile, and maximum value. It provides a good summary of the shape of the distribution and for this data set is: 13, 20, 25, 35, 70.
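
A short Python sketch of these computations. Note: the exercise's raw data are not reproduced on these slides, so the age list below is an assumption, reconstructed to be consistent with the answers above (n = 27, sum = 809, modes 25 and 35):

    from statistics import mean, median, multimode

    # assumed age values (not shown on the slides); chosen to match the answers above
    data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
            30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
    data.sort()

    print(mean(data))                    # 29.96... ~ 30
    print(median(data))                  # 25 (middle of 27 ordered values)
    print(multimode(data))               # [25, 35] -> bimodal
    print((min(data) + max(data)) / 2)   # midrange: 41.5
    # rough quartiles via the 25th/75th percentile positions in the ordered data
    print(data[len(data) // 4], data[3 * len(data) // 4])   # 20 35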

Exercise- Answer

Using the formula for the approximate median of grouped (interval) data,

median = L1 + ((N/2 − sum_freq_l) / freq_median) * width,

where L1 is the lower boundary of the median interval, N is the total number of values, sum_freq_l is the sum of the frequencies of the intervals below the median interval, freq_median is the frequency of the median interval, and width is the interval width, we have L1 = 20, N = 3194, sum_freq_l = 950, freq_median = 1500, width = 30, so

median = 20 + ((3194/2 − 950) / 1500) * 30 ≈ 32.94 years
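
The same calculation as a short Python sketch, with variable names following the formula above:

    # approximate median of grouped data:
    # median = L1 + ((N/2 - sum_freq_l) / freq_median) * width
    L1 = 20               # lower boundary of the median interval
    N = 3194              # total number of values
    sum_freq_l = 950      # sum of frequencies of intervals below the median interval
    freq_median = 1500    # frequency of the median interval
    width = 30            # width of the median interval

    median = L1 + ((N / 2 - sum_freq_l) / freq_median) * width
    print(round(median, 2))   # 32.94 years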

Exercise-Answer

(a) For the variable age, the mean is 46.44, the median is 51, and the standard deviation is 12.85.
For the variable %fat, the mean is 28.78, the median is 30.7, and the standard deviation is 8.99.
(b) Draw the box plots for age and %fat.

Exercise-Answer
Try it yourself

Data Quality, Data Cleaning and Data Integration
◼ Chapter 2: Data, Measurements, and Data Preprocessing
◼ Definition and types
◼ Basic statistical description, measuring similarity and dissimilarity
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation
Major Tasks in Data Preprocessing

Data Cleaning
◼ Data in the real world needs cleaning: lots of potentially incorrect data, e.g., instrument faults, human or computer error, and transmission errors
◼ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
◼ e.g., Occupation = " " (missing data)
◼ Noisy: containing noise, errors, or outliers
◼ e.g., Salary = "−10" (an error)
◼ Inconsistent: containing discrepancies in codes or names, e.g.,
◼ Age = "42", Birthday = "03/07/2010"
◼ Was rating "1, 2, 3", now rating "A, B, C"
◼ Discrepancy between duplicate records
◼ Intentional (e.g., disguised missing data)
◼ Jan. 1 as everyone's birthday?


Incomplete (Missing) Data
◼ Data is not always available
◼ E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data of a company
◼ Missing data may be due to
◼ Equipment malfunction
◼ Inconsistent with other recorded data and thus deleted
◼ Data were not entered due to misunderstanding
◼ Certain data may not be considered important at the time of entry
◼ Did not register history or changes of the data
◼ Missing data may need to be inferred

How to Handle Missing Data?
◼ Ignore the tuple: usually done when the class label is missing (when doing classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple; such data could have been useful to the task at hand.
◼ Fill in the missing value manually: tedious + time consuming + infeasible given a large data set with many missing values.

How to Handle Missing Data?
◼ Fill it in automatically with:
◼ A global constant: replace all missing attribute values with the same constant, such as a label like "unknown"
◼ Drawback: if missing values are replaced by, say, "unknown", then the mining program may mistakenly think that they form a new class
◼ Although this method is simple, it is not foolproof
◼ A measure of central tendency for the attribute (e.g., the mean or median) for all samples belonging to the same class: smarter
◼ For example, suppose that the data distribution of customer incomes is symmetric and that the mean income is $56,000. Use this value to replace the missing value for income.
◼ The most probable value: use regression, or inference-based methods such as the Bayesian formula or decision tree induction, to predict the missing values (a sketch of the first two strategies follows below)
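
As a sketch of the first two strategies, a minimal pandas example (the table and column names are illustrative):

    import pandas as pd

    df = pd.DataFrame({
        "class":  ["A", "A", "B", "B", "B"],
        "income": [56000, None, 40000, None, 42000],
    })

    # 1) global constant: every missing value becomes the same label/sentinel
    filled_const = df["income"].fillna(-1)

    # 2) central tendency per class: fill with the mean income of the same class
    # (the model-based "most probable value" strategy is not shown here)
    filled_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
    print(filled_mean.tolist())   # [56000.0, 56000.0, 40000.0, 41000.0, 42000.0]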

Noisy Data
◼ Noise: a random error or variance in a measured variable
◼ Incorrect attribute values may be due to
◼ Faulty data collection instruments

◼ Data entry problems

◼ Data transmission problems

◼ Technology limitation

◼ Inconsistency in naming convention

How to Handle Noisy Data?
◼ Given a numeric attribute such as, say, price, how can we “smooth” out the data
to remove the noise? Use the following data smoothing techniques.
◼ 1- Binning
◼ First sort the data and partition it into (equal-frequency) bins
◼ The sorted values are distributed into a number of "buckets," or bins. Because binning methods consult the neighborhood of values, they perform local smoothing
◼ Then one can smooth by bin means, by bin medians, by bin boundaries, etc.

Example: Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
* Partition into equal-frequency (equal-depth) bins (of size 3 in this case):
- Bin 1: 4, 8, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 28, 34
* Smoothing by bin means: each value in a bin is replaced by the mean value of the bin (e.g., Bin 1: (4+8+15)/3 = 27/3 = 9)
- Bin 1: 9, 9, 9
- Bin 2: 22, 22, 22
- Bin 3: 29, 29, 29
* Smoothing by bin medians: each bin value is replaced by the bin median
- Bin 1: 8, 8, 8
- Bin 2: 21, 21, 21
- Bin 3: 28, 28, 28
Example: Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
* Partition into equal-frequency (equal-depth) bins (of size 3 in this case):
- Bin 1: 4, 8, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 28, 34
• Smoothing by bin boundaries: the minimum and maximum values in a given bin
are identified as the bin boundaries. Each bin value is then replaced by the closest
boundary value.
• - Bin 1: 4, 4, 15
• - Bin 2: 21, 21, 24
• - Bin 3: 25, 25, 34
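
The three smoothing variants as a small Python sketch over the same price data:

    data = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
    depth = 3
    bins = [data[i:i + depth] for i in range(0, len(data), depth)]

    by_means  = [[round(sum(b) / len(b))] * len(b) for b in bins]
    by_median = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]
    # boundaries: replace each value with the closer of the bin's min/max
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

    print(by_means)    # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
    print(by_median)   # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
    print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]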
Exercise-Answer

Answer:

The following steps are required to smooth the above data using smoothing by bin means with a
bin depth of 3.
Step 1: Sort the data. (This step is not required here as the data are already sorted.)
Step 2: Partition the data into equal-frequency bins of size 3

Exercise-Answer

Answer:
Step 3: Calculate the arithmetic mean of each bin.
Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin

How to Handle Noisy Data?
◼ 2- Regression
◼ Smooth by fitting the data into regression functions
◼ Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other (a small numeric sketch follows at the end of this slide)
◼ 3- Clustering
◼ Detect and remove outliers
◼ For example, similar values are organized into groups or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers
[Figure: a 2-D customer data plot with respect to customer locations in a city, showing three data clusters; outliers may be detected as values that fall outside of the cluster sets]
◼ Semi-supervised: combined computer and human inspection
◼ Detect suspicious values and check by human (e.g., deal with possible outliers)
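
A minimal numpy sketch of the regression idea on made-up numbers: fit the "best" line between two attributes and flag points with unusually large residuals (the data and the 2-standard-deviation threshold are illustrative):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    y = np.array([2.1, 3.9, 6.2, 8.0, 9.8, 30.0, 14.1, 16.2])   # 30.0 is a planted outlier

    slope, intercept = np.polyfit(x, y, deg=1)    # least-squares line fit
    residuals = y - (slope * x + intercept)

    # flag points lying more than 2 standard deviations from the fitted line
    suspects = np.abs(residuals) > 2 * residuals.std()
    print(x[suspects], y[suspects])   # [6.] [30.]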


Data Integration
◼ Data integration
◼ Combining data from multiple sources into a coherent store. Data mining often requires data integration, e.g., the merging of data from multiple data stores
◼ Why data integration?
◼ Helps reduce/avoid noise (redundancies and inconsistencies) between the different datasets that are to be integrated
◼ Gives a more complete picture
◼ Improves mining speed, quality, and accuracy
◼ Entity identification problem: how can equivalent real-world entities from multiple data sources be matched up?
◼ Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
◼ Schema integration problem:
◼ How can a data analyst or a computer be sure that customer_id in one database and cust_number in another refer to the same attribute? e.g., A.cust-id ≡ B.cust-#
◼ Mismatches result in noise and redundancy in the integrated dataset, which requires applying preprocessing techniques
Handling Noise in Data Integration
◼ Detecting data value conflicts
◼ For the same real-world entity, attribute values from different sources may differ
◼ Possible reasons: different representations, different encodings, different scales, e.g., metric vs. British units
◼ For instance, a weight attribute may be stored in metric units in one system and British imperial units in another (see the sketch at the end of this slide)
◼ When exchanging information between schools, for example, each school may have its own curriculum and grading scheme. One university may adopt a quarter system, offer three courses on database systems, and assign grades from A+ to F, whereas another may adopt a semester system, offer two courses on databases, and assign grades from 1 to 10. It is difficult to work out precise course-to-grade transformation rules between the two universities, making information exchange difficult.
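
A minimal sketch of resolving one such conflict before merging, assuming a hypothetical weight attribute stored in kilograms in one source and pounds in another:

    import pandas as pd

    # hypothetical source tables describing the same kind of entity in different units
    src_a = pd.DataFrame({"id": [1, 2], "weight_kg": [70.0, 82.5]})
    src_b = pd.DataFrame({"id": [3, 4], "weight_lb": [154.0, 200.0]})

    LB_PER_KG = 2.20462
    src_b["weight_kg"] = src_b["weight_lb"] / LB_PER_KG   # convert to a common unit

    merged = pd.concat([src_a[["id", "weight_kg"]],
                        src_b[["id", "weight_kg"]]], ignore_index=True)
    print(merged)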

Handling Noise in Data Integration
◼ Resolving conflicting information
◼ Use central tendency: take the mean/median/mode
◼ Use max/min values
◼ Take the most recent value (see the sketch below)
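
A short pandas sketch of the three strategies, for one attribute reported differently by two sources (table and column names are illustrative):

    import pandas as pd

    # two sources report different salary values for the same entities
    obs = pd.DataFrame({
        "entity": ["e1", "e1", "e2", "e2"],
        "salary": [50000, 54000, 61000, 60000],
        "date":   pd.to_datetime(["2024-01-01", "2024-06-01",
                                  "2024-03-01", "2024-05-01"]),
    })

    by_mean = obs.groupby("entity")["salary"].mean()    # central tendency
    by_max  = obs.groupby("entity")["salary"].max()     # max value
    latest  = obs.sort_values("date").groupby("entity")["salary"].last()  # most recent
    print(by_mean.tolist(), by_max.tolist(), latest.tolist())
    # [52000.0, 60500.0] [54000, 61000] [54000, 60000]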

Thank You
