Lecture 5
3rd Year
Spring 2025
Cosine Similarity
◼ Example: d1 = (1, 2), d2 = (0, 3)
◼ d1 · d2 = (1*0) + (2*3) = 6
◼ ||d1|| = √(1² + 2²) = √5, ||d2|| = √(0² + 3²) = 3
◼ cos(d1, d2) = 6 / (3√5) = 2/√5 ≈ 0.894
Example: Cosine Similarity
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1•d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = sim(d1, d2) = 25 / (6.481 * 4.12) ≈ 0.94
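The arithmetic above can be checked with a few lines of Python (plain lists and the standard library only):

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity: dot(x, y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))  # → 0.94
```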
Exercise
Exercise - Answer
• The third quartile (corresponding to the 75th percentile) of the data is: 35.
• (e) Give the five-number summary of the data.
• The five number summary of a distribution consists of the minimum value, first quartile, median value, third
quartile, and maximum value. It provides a good summary of the shape of the distribution and for this data
is: 13, 20, 25, 35, 70.
Exercise - Answer
Try it yourself
Data Quality, Data Cleaning and Data
Integration
◼ Chapter 2: Data, measurements, and data preprocessing
◼ Definitions and types; basic statistical descriptions; measuring similarity and dissimilarity
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation
Major Tasks in Data Preprocessing
Data Cleaning
◼ Data in the real world needs cleaning: there is lots of potentially incorrect data, e.g., due to
faulty instruments, human or computer error, and transmission errors
◼ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
How to Handle Missing Data?
◼ Ignore the tuple: usually done when the class label is missing (when doing
classification). This method is not very effective unless the tuple contains
several attributes with missing values. It is especially poor when the
percentage of missing values per attribute varies considerably. By ignoring
the tuple, we do not make use of the remaining attributes' values in the
tuple; such data could have been useful to the task at hand.
◼ Fill in the missing value manually: tedious + time-consuming + infeasible
given a large data set with many missing values.
How to Handle Missing Data?
◼ Fill it in automatically with:
◼ A global constant: replace all missing attribute values with the same constant, such as a
label like "unknown"
◼ Drawback: if missing values are replaced by, say, "unknown," then the mining program may
mistakenly think that they form a new class
◼ Although this method is simple, it is not foolproof.
◼ A measure of central tendency for the attribute (e.g., the mean or median) for all samples
belonging to the same class: smarter
◼ For example, suppose that the distribution of customer incomes is symmetric and that the
mean income is $56,000. Use this value to replace the missing values for income.
◼ The most probable value: use regression, or inference-based methods such as Bayesian
formulas or decision tree induction, to predict the missing values.
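The automatic fill-in strategies above can be sketched with pandas; the table and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical customer table with missing income values
df = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "income": [50000.0, None, 60000.0, None],
})

# Strategy 1: global constant (risk: the constant may look like a new class)
const_filled = df["income"].fillna(-1)

# Strategy 2: mean of the attribute over all tuples
mean_filled = df["income"].fillna(df["income"].mean())

# Strategy 3 (smarter): mean per class, via groupby + transform
class_filled = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(mean_filled.tolist())   # → [50000.0, 55000.0, 60000.0, 55000.0]
print(class_filled.tolist())  # → [50000.0, 50000.0, 60000.0, 60000.0]
```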
Noisy Data
◼ Noise is a random error or variance in a measured variable
◼ Incorrect attribute values may be due to
◼ Faulty data collection instruments
◼ Technology limitation
How to Handle Noisy Data?
◼ Given a numeric attribute such as, say, price, how can we “smooth” out the data
to remove the noise? Use the following data smoothing techniques.
◼ 1- Binning
◼ First sort data and partition into (equal-frequency) bins
Example: Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
* Partition into equal-frequency (equal-depth) bins (of size 3 in this case):
- Bin 1: 4, 8, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 28, 34
* Smoothing by bin means: each value in a bin is replaced by the mean value of the
bin (e.g., Bin 1: (4+8+15)/3 = 27/3 = 9)
- Bin 1: 9, 9, 9
- Bin 2: 22, 22, 22
- Bin 3: 29, 29, 29
* Smoothing by bin medians: each bin value is replaced by the bin median
- Bin 1: 8, 8, 8
- Bin 2: 21, 21, 21
- Bin 3: 28, 28, 28
Example: Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
* Partition into equal-frequency (equal-depth) bins (of size 3 in this case):
- Bin 1: 4, 8, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 28, 34
* Smoothing by bin boundaries: the minimum and maximum values in a given bin
are identified as the bin boundaries. Each bin value is then replaced by the closest
boundary value.
- Bin 1: 4, 4, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 25, 34
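All three binning variants can be implemented in a few lines of Python; the sketch below assumes the data is already sorted and uses equal-frequency bins of depth 3, reproducing the results above:

```python
def make_bins(sorted_data, depth):
    """Partition sorted data into equal-frequency bins of the given depth."""
    return [sorted_data[i:i + depth] for i in range(0, len(sorted_data), depth)]

def smooth_by_means(bins):
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    # middle element; assumes odd-sized bins as in the slide example
    return [[sorted(b)[len(b) // 2]] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # replace each value with the closer of the bin's min/max boundary
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = make_bins(prices, 3)

print(smooth_by_means(bins))       # → [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_medians(bins))     # → [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundaries(bins))  # → [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```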
Exercise - Answer
Answer:
The following steps are required to smooth the above data using smoothing by bin means with a
bin depth of 3.
Step 1: Sort the data. (This step is not required here as the data are already sorted.)
Step 2: Partition the data into equal-frequency bins of size 3
Exercise - Answer
Answer:
Step 3: Calculate the arithmetic mean of each bin.
Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin
How to Handle Noisy Data?
◼ 2- Regression
◼ Smooth by fitting the data to regression functions
◼ Linear regression involves finding the "best" line to fit two attributes, so
that one attribute can be used to predict the other.
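A minimal sketch of regression smoothing, using NumPy's `polyfit` on hypothetical data (the attribute values below are made up for illustration):

```python
import numpy as np

# Hypothetical pairs: attribute x is used to predict a noisy attribute y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # roughly y = 2x, plus noise

slope, intercept = np.polyfit(x, y, deg=1)  # fit the "best" (least-squares) line
y_smoothed = slope * x + intercept          # replace noisy values with fitted ones

print(round(slope, 1))  # → 2.0
```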
Data Integration
◼ Data integration: data from multiple sources is combined into a coherent store
◼ Get a more complete picture
◼ Entity identification problem: How can equivalent real-world entities from multiple data
sources be matched up?
◼ Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
◼ Data value conflicts: for the same real-world entity, attribute values from different sources may be
different
◼ Possible reasons: different representations, different encodings, different scales
Handling Noise in Data Integration
◼ Resolving conflicting information
◼ Use Central tendency: Take the mean/median/mode
◼ Use: max/min values
◼ Take the most recent
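The resolution rules above can be sketched in plain Python; the conflicting reports and their timestamps below are hypothetical:

```python
from statistics import mean

# Hypothetical conflicting values for one attribute, reported by three sources
reports = [
    {"source": "A", "value": 100, "timestamp": "2025-01-10"},
    {"source": "B", "value": 110, "timestamp": "2025-03-02"},
    {"source": "C", "value": 90,  "timestamp": "2025-02-15"},
]
values = [r["value"] for r in reports]

by_mean = mean(values)                                  # central tendency rule
by_max = max(values)                                    # max rule
by_recent = max(reports, key=lambda r: r["timestamp"])  # most-recent rule (ISO dates sort lexicographically)

print(by_mean, by_max, by_recent["value"])  # → 100 110 110
```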
Thank You