
SET 393: Data Mining and Business Intelligence

3rd Year

Spring 2025

Lec. 5

Chapter 2. Data, Measurements, and Data Preprocessing


Assistant Professor: Dr. Rasha Saleh
Cosine Similarity
◼ A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.

◼ Applications: information retrieval, biologic taxonomy, gene feature mapping, ...


◼ Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then

cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),

i.e., the dot product of d1 and d2 divided by the product of their lengths,
where • indicates the vector dot product and ||d|| is the length of vector d

Cosine Similarity

◼ Reminder: dot product
◼ Assume d1 = (1, 2) and d2 = (0, 3). Then

d1 • d2 = 1*0 + 2*3 = 6
||d1|| = (1^2 + 2^2)^0.5 = (5)^0.5
||d2|| = (0^2 + 3^2)^0.5 = (9)^0.5 = 3

cos(d1, d2) = 6 / ((5)^0.5 * 3) = 2 / (5)^0.5 ≈ 0.894

Example: Cosine Similarity

◼ cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,


where • indicates the vector dot product and ||d|| is the length of vector d

◼ Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5^2 + 0^2 + 3^2 + 0^2 + 2^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2)^0.5 = (42)^0.5 ≈ 6.481
||d2|| = (3^2 + 0^2 + 2^2 + 0^2 + 1^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2)^0.5 = (17)^0.5 ≈ 4.123
cos(d1, d2) = 25 / (6.481 * 4.123) ≈ 0.94
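
As a quick check, here is a minimal Python sketch of the same computation (the function name is illustrative, not part of the lecture):

    import math

    def cosine_similarity(a, b):
        # dot product of the two term-frequency vectors
        dot = sum(x * y for x, y in zip(a, b))
        # Euclidean lengths ||a|| and ||b||
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
    d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
    print(round(cosine_similarity(d1, d2), 2))  # 0.94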
Exercise

Exercise- Answer

(a) What is the mean of the data? What is the median?

• The (arithmetic) mean of the data is 809/27 ≈ 30.
• The median (middle value of the ordered set, as the number of values in the set is odd) of the data is 25.

(b) What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
• This data set has two values that occur with the same highest frequency and is, therefore, bimodal. The modes (values occurring with the greatest frequency) of the data are 25 and 35.

(c) What is the midrange of the data?
• The midrange (average of the largest and smallest values in the data set) of the data is (70 + 13)/2 = 41.5.

(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
• The first quartile (corresponding to the 25th percentile) of the data is 20.
Exercise - Answer (continued)

• The third quartile (corresponding to the 75th percentile) of the data is 35.

(e) Give the five-number summary of the data.
• The five-number summary of a distribution consists of the minimum value, first quartile, median, third quartile, and maximum value. It provides a good summary of the shape of the distribution and for this data set is: 13, 20, 25, 35, 70.
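
A short Python sketch of these computations. Note: the exercise's raw data are not reproduced on these slides, so the age list below is an assumption, reconstructed to be consistent with the answers above (n = 27, sum = 809, modes 25 and 35):

    from statistics import mean, median, multimode

    # assumed age values (not shown on the slides); chosen to match the answers above
    data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
            30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
    data.sort()

    print(mean(data))                    # 29.96... ~ 30
    print(median(data))                  # 25 (middle of 27 ordered values)
    print(multimode(data))               # [25, 35] -> bimodal
    print((min(data) + max(data)) / 2)   # midrange: 41.5
    # rough quartiles via the 25th/75th percentile positions in the ordered data
    print(data[len(data) // 4], data[3 * len(data) // 4])   # 20 35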

Exercise- Answer

Using the formula for the approximate median of grouped (interval) data,

median = L1 + ((N/2 − sum_freq_l) / freq_median) * width,

where L1 is the lower boundary of the median interval, N is the total number of values, sum_freq_l is the sum of the frequencies of the intervals below the median interval, freq_median is the frequency of the median interval, and width is the interval width, we have L1 = 20, N = 3194, sum_freq_l = 950, freq_median = 1500, width = 30, so

median = 20 + ((3194/2 − 950) / 1500) * 30 ≈ 32.94 years
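
The same calculation as a short Python sketch, with variable names following the formula above:

    # approximate median of grouped data:
    # median = L1 + ((N/2 - sum_freq_l) / freq_median) * width
    L1 = 20               # lower boundary of the median interval
    N = 3194              # total number of values
    sum_freq_l = 950      # sum of frequencies of intervals below the median interval
    freq_median = 1500    # frequency of the median interval
    width = 30            # width of the median interval

    median = L1 + ((N / 2 - sum_freq_l) / freq_median) * width
    print(round(median, 2))   # 32.94 years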

Exercise-Answer

(a) For the variable age, the mean is 46.44, the median is 51, and the standard deviation is 12.85.
For the variable %fat, the mean is 28.78, the median is 30.7, and the standard deviation is 8.99.
(b) Draw the box plots for age and %fat.

Exercise-Answer
Try it yourself

Data Quality, Data Cleaning and Data Integration
◼ Chapter 2: Data, Measurements, and Data Preprocessing
◼ Definition and types
◼ Basic statistical description, measuring similarity and dissimilarity
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation
Major Tasks in Data Preprocessing

Data Cleaning
◼ Data in the real world needs cleaning: lots of potentially incorrect data, e.g., instrument faults, human or computer error, and transmission errors
◼ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
◼ e.g., Occupation = " " (missing data)
◼ Noisy: containing noise, errors, or outliers
◼ e.g., Salary = "−10" (an error)
◼ Inconsistent: containing discrepancies in codes or names, e.g.,
◼ Age = "42", Birthday = "03/07/2010"
◼ Was rating "1, 2, 3", now rating "A, B, C"
◼ Discrepancy between duplicate records
◼ Intentional (e.g., disguised missing data)
◼ Jan. 1 as everyone's birthday?


Incomplete (Missing) Data
◼ Data is not always available
◼ E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data of a company
◼ Missing data may be due to
◼ Equipment malfunction
◼ Inconsistent with other recorded data and thus deleted
◼ Data were not entered due to misunderstanding
◼ Certain data may not be considered important at the time of entry
◼ Did not register history or changes of the data
◼ Missing data may need to be inferred

How to Handle Missing Data?
◼ Ignore the tuple: usually done when the class label is missing (when doing classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple; such data could have been useful to the task at hand.
◼ Fill in the missing value manually: tedious + time consuming + infeasible given a large data set with many missing values.

How to Handle Missing Data?
◼ Fill it in automatically with:
◼ A global constant: replace all missing attribute values with the same constant, such as a label like "unknown"
◼ Drawback: if missing values are replaced by, say, "unknown", then the mining program may mistakenly think that they form a new class
◼ Although this method is simple, it is not foolproof
◼ A measure of central tendency for the attribute (e.g., the mean or median) for all samples belonging to the same class: smarter
◼ For example, suppose that the data distribution of customer incomes is symmetric and that the mean income is $56,000. Use this value to replace the missing value for income.
◼ The most probable value: use regression, or inference-based methods such as the Bayesian formula or decision tree induction, to predict the missing values (a sketch of the first two strategies follows below)
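
As a sketch of the first two strategies, a minimal pandas example (the table and column names are illustrative):

    import pandas as pd

    df = pd.DataFrame({
        "class":  ["A", "A", "B", "B", "B"],
        "income": [56000, None, 40000, None, 42000],
    })

    # 1) global constant: every missing value becomes the same label/sentinel
    filled_const = df["income"].fillna(-1)

    # 2) central tendency per class: fill with the mean income of the same class
    # (the model-based "most probable value" strategy is not shown here)
    filled_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
    print(filled_mean.tolist())   # [56000.0, 56000.0, 40000.0, 41000.0, 42000.0]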

Noisy Data
◼ Noise: a random error or variance in a measured variable
◼ Incorrect attribute values may be due to
◼ Faulty data collection instruments

◼ Data entry problems

◼ Data transmission problems

◼ Technology limitation

◼ Inconsistency in naming convention

How to Handle Noisy Data?
◼ Given a numeric attribute such as, say, price, how can we “smooth” out the data
to remove the noise? Use the following data smoothing techniques.
◼ 1- Binning
◼ First sort the data and partition it into (equal-frequency) bins
◼ The sorted values are distributed into a number of "buckets," or bins. Because binning methods consult the neighborhood of values, they perform local smoothing
◼ Then one can smooth by bin means, by bin medians, by bin boundaries, etc.

Example: Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
* Partition into equal-frequency (equal-depth) bins (of size 3 in this case):
- Bin 1: 4, 8, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 28, 34
* Smoothing by bin means: each value in a bin is replaced by the mean value of the bin (e.g., Bin 1: (4+8+15)/3 = 27/3 = 9)
- Bin 1: 9, 9, 9
- Bin 2: 22, 22, 22
- Bin 3: 29, 29, 29
* Smoothing by bin medians: each bin value is replaced by the bin median
- Bin 1: 8, 8, 8
- Bin 2: 21, 21, 21
- Bin 3: 28, 28, 28
Example: Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
* Partition into equal-frequency (equal-depth) bins (of size 3 in this case):
- Bin 1: 4, 8, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 28, 34
• Smoothing by bin boundaries: the minimum and maximum values in a given bin
are identified as the bin boundaries. Each bin value is then replaced by the closest
boundary value.
• - Bin 1: 4, 4, 15
• - Bin 2: 21, 21, 24
• - Bin 3: 25, 25, 34
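
The three smoothing variants as a small Python sketch over the same price data:

    data = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
    depth = 3
    bins = [data[i:i + depth] for i in range(0, len(data), depth)]

    by_means  = [[round(sum(b) / len(b))] * len(b) for b in bins]
    by_median = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]
    # boundaries: replace each value with the closer of the bin's min/max
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

    print(by_means)    # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
    print(by_median)   # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
    print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]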
Exercise-Answer

Answer:

The following steps are required to smooth the above data using smoothing by bin means with a
bin depth of 3.
Step 1: Sort the data. (This step is not required here as the data are already sorted.)
Step 2: Partition the data into equal-frequency bins of size 3

Exercise-Answer

Answer:
Step 3: Calculate the arithmetic mean of each bin.
Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin

How to Handle Noisy Data?
◼ 2- Regression
◼ Smooth by fitting the data into regression functions
◼ Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other (a small numeric sketch follows at the end of this slide)
◼ 3- Clustering
◼ Detect and remove outliers
◼ For example, similar values are organized into groups or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers
[Figure: a 2-D customer data plot with respect to customer locations in a city, showing three data clusters; outliers may be detected as values that fall outside of the cluster sets]
◼ Semi-supervised: combined computer and human inspection
◼ Detect suspicious values and check by human (e.g., deal with possible outliers)
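
A minimal numpy sketch of the regression idea on made-up numbers: fit the "best" line between two attributes and flag points with unusually large residuals (the data and the 2-standard-deviation threshold are illustrative):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    y = np.array([2.1, 3.9, 6.2, 8.0, 9.8, 30.0, 14.1, 16.2])   # 30.0 is a planted outlier

    slope, intercept = np.polyfit(x, y, deg=1)    # least-squares line fit
    residuals = y - (slope * x + intercept)

    # flag points lying more than 2 standard deviations from the fitted line
    suspects = np.abs(residuals) > 2 * residuals.std()
    print(x[suspects], y[suspects])   # [6.] [30.]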


Data Integration
◼ Data integration
◼ Combining data from multiple sources into a coherent store. Data mining often requires data integration, e.g., the merging of data from multiple data stores
◼ Why data integration?
◼ Helps reduce/avoid noise (redundancies and inconsistencies) between the different datasets that are to be integrated
◼ Gives a more complete picture
◼ Improves mining speed, quality, and accuracy
◼ Entity identification problem: how can equivalent real-world entities from multiple data sources be matched up?
◼ Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
◼ Schema integration problem:
◼ How can a data analyst or a computer be sure that customer_id in one database and cust_number in another refer to the same attribute? e.g., A.cust-id ≡ B.cust-#
◼ Mismatches result in noise and redundancy in the integrated dataset, which requires applying preprocessing techniques
Handling Noise in Data Integration
◼ Detecting data value conflicts
◼ For the same real-world entity, attribute values from different sources may differ
◼ Possible reasons: different representations, different encodings, different scales, e.g., metric vs. British units
◼ For instance, a weight attribute may be stored in metric units in one system and British imperial units in another (see the sketch at the end of this slide)
◼ When exchanging information between schools, for example, each school may have its own curriculum and grading scheme. One university may adopt a quarter system, offer three courses on database systems, and assign grades from A+ to F, whereas another may adopt a semester system, offer two courses on databases, and assign grades from 1 to 10. It is difficult to work out precise course-to-grade transformation rules between the two universities, making information exchange difficult.
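
A minimal sketch of resolving one such conflict before merging, assuming a hypothetical weight attribute stored in kilograms in one source and pounds in another:

    import pandas as pd

    # hypothetical source tables describing the same kind of entity in different units
    src_a = pd.DataFrame({"id": [1, 2], "weight_kg": [70.0, 82.5]})
    src_b = pd.DataFrame({"id": [3, 4], "weight_lb": [154.0, 200.0]})

    LB_PER_KG = 2.20462
    src_b["weight_kg"] = src_b["weight_lb"] / LB_PER_KG   # convert to a common unit

    merged = pd.concat([src_a[["id", "weight_kg"]],
                        src_b[["id", "weight_kg"]]], ignore_index=True)
    print(merged)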

Handling Noise in Data Integration
◼ Resolving conflicting information
◼ Use central tendency: take the mean/median/mode
◼ Use max/min values
◼ Take the most recent value (see the sketch below)
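
A short pandas sketch of the three strategies, for one attribute reported differently by two sources (table and column names are illustrative):

    import pandas as pd

    # two sources report different salary values for the same entities
    obs = pd.DataFrame({
        "entity": ["e1", "e1", "e2", "e2"],
        "salary": [50000, 54000, 61000, 60000],
        "date":   pd.to_datetime(["2024-01-01", "2024-06-01",
                                  "2024-03-01", "2024-05-01"]),
    })

    by_mean = obs.groupby("entity")["salary"].mean()    # central tendency
    by_max  = obs.groupby("entity")["salary"].max()     # max value
    latest  = obs.sort_values("date").groupby("entity")["salary"].last()  # most recent
    print(by_mean.tolist(), by_max.tolist(), latest.tolist())
    # [52000.0, 60500.0] [54000, 61000] [54000, 60000]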

Thank You
