Data Mining All Slides

The document discusses different classification algorithms including 1R and Naive Bayes. 1R learns simple 1-level decision trees that classify examples based on the value of a single attribute. It chooses the attribute with the lowest error rate. Naive Bayes makes independence assumptions between attributes and uses probabilities to classify examples based on the values of all attributes. The document provides examples of applying these methods to classify examples of weather data based on attributes like outlook, temperature, humidity, and windy conditions.

Algorithms for

Classification:
The Basic Methods
Outline

 Simplicity first: 1R

 Naïve Bayes

2
Classification

 Task: Given a set of pre-classified examples,


build a model or classifier to classify new cases.
 Supervised learning: classes are known for the
examples used to build the classifier.
 A classifier can be a set of rules, a decision tree,
a neural network, etc.
 Typical applications: credit approval, direct
marketing, fraud detection, medical diagnosis,
…..
3
Simplicity first

 Simple algorithms often work very well!


 There are many kinds of simple structure, e.g.:
 One attribute does all the work
 All attributes contribute equally & independently
 A weighted linear combination might do
 Instance-based: use a few prototypes
 Use simple logical rules

 Success of method depends on the domain

4
witten&eibe
Inferring rudimentary rules

 1R: learns a 1-level decision tree


 I.e., rules that all test one particular attribute

 Basic version
 One branch for each value
 Each branch assigns most frequent class
 Error rate: proportion of instances that don’t belong to the
majority class of their corresponding branch
 Choose attribute with lowest error rate

(assumes nominal attributes)

5
witten&eibe
Pseudo-code for 1R

For each attribute,


For each value of the attribute, make a rule as follows:
count how often each class appears
find the most frequent class
make the rule assign that class to this attribute-value
Calculate the error rate of the rules
Choose the rules with the smallest error rate

 Note: “missing” is treated as a separate attribute value

6
witten&eibe
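As a rough illustration of the pseudo-code above, here is a minimal Python sketch of 1R for nominal attributes. The function name one_r and the dictionary-based dataset layout are my own choices, not part of the original slides; missing values can be encoded as a distinct value such as "missing", as the note suggests.

from collections import Counter, defaultdict

def one_r(records, attributes, class_attr):
    """Return (best_attribute, rules, error_count) for the 1R method."""
    best = None
    for attr in attributes:
        # Count classes for each value of this attribute.
        counts = defaultdict(Counter)
        for r in records:
            counts[r[attr]][r[class_attr]] += 1
        # Each branch predicts its most frequent class.
        rules = {value: cls.most_common(1)[0][0] for value, cls in counts.items()}
        # Errors: instances that are not in the majority class of their branch.
        errors = sum(total - cls[rules[value]]
                     for value, cls in counts.items()
                     for total in [sum(cls.values())])
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Tiny subset of the weather data, just to show the call:
data = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
]
print(one_r(data, ["outlook", "windy"], "play"))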
Evaluating the weather attributes
Weather data:

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

1R rules and error rates:

Attribute  Rules            Errors  Total errors
Outlook    Sunny → No       2/5     4/14
           Overcast → Yes   0/4
           Rainy → Yes      2/5
Temp       Hot → No*        2/4     5/14
           Mild → Yes       2/6
           Cool → Yes       1/4
Humidity   High → No        3/7     4/14
           Normal → Yes     1/7
Windy      False → Yes      2/8     5/14
           True → No*       3/6

* indicates a tie

7
witten&eibe
Dealing with
numeric attributes
 Discretize numeric attributes
 Divide each attribute’s range into intervals
 Sort instances according to attribute’s values
 Place breakpoints where the class changes
(the majority class)
 This minimizes the total error

 Example: temperature from weather data

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

8
witten&eibe
The problem of overfitting

 This procedure is very sensitive to noise


 One instance with an incorrect class label will probably
produce a separate interval
 Also: time stamp attribute will have zero errors
 Simple solution:
enforce minimum number of instances in majority class
per interval

9
witten&eibe
Discretization example

 Example (with min = 3):


64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

 Final result for temperature attribute


64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No

10
witten&eibe
With overfitting avoidance

 Resulting rule set:


Attribute Rules Errors Total errors
Outlook Sunny → No 2/5 4/14
Overcast → Yes 0/4
Rainy → Yes 2/5
Temperature ≤ 77.5 → Yes 3/10 5/14
> 77.5 → No* 2/4
Humidity ≤ 82.5 → Yes 1/7 3/14
> 82.5 and ≤ 95.5 → No 2/6
> 95.5 → Yes 0/1
Windy False → Yes 2/8 5/14
True → No* 3/6

11
witten&eibe
Bayesian (Statistical) modeling

 “Opposite” of 1R: use all the attributes


 Two assumptions: Attributes are
 equally important
 statistically independent (given the class value)
 I.e., knowing the value of one attribute says nothing about
the value of another
(if the class is known)

 Independence assumption is almost never correct!


 But … this scheme works well in practice

13
witten&eibe
Probabilities for weather data
Outlook Temperature Humidity Windy Play
Yes No Yes No Yes No Yes No Yes No
Sunny 2 3 Hot 2 2 High 3 4 False 6 2 9 5
Overcast 4 0 Mild 4 2 Normal 6 1 True 3 3
Rainy 3 2 Cool 3 1
Sunny 2/9 3/5 Hot 2/9 2/5 High 3/9 4/5 False 6/9 2/5 9/14 5/14
Overcast 4/9 0/5 Mild 4/9 2/5 Normal 6/9 1/5 True 3/9 3/5
Rainy 3/9 2/5 Cool 3/9 1/5

Outlook Temp Humidity Windy Play

Sunny Hot High False No

Sunny Hot High True No

Overcast Hot High False Yes

Rainy Mild High False Yes

Rainy Cool Normal False Yes

Rainy Cool Normal True No

Overcast Cool Normal True Yes

Sunny Mild High False No

Sunny Cool Normal False Yes

Rainy Mild Normal False Yes

Sunny Mild Normal True Yes

Overcast Mild High True Yes

Overcast Hot Normal False Yes

Rainy Mild High True No


14
witten&eibe
Probabilities for weather data
Outlook Temperature Humidity Windy Play
Yes No Yes No Yes No Yes No Yes No
Sunny 2 3 Hot 2 2 High 3 4 False 6 2 9 5
Overcast 4 0 Mild 4 2 Normal 6 1 True 3 3
Rainy 3 2 Cool 3 1
Sunny 2/9 3/5 Hot 2/9 2/5 High 3/9 4/5 False 6/9 2/5 9/14 5/14
Overcast 4/9 0/5 Mild 4/9 2/5 Normal 6/9 1/5 True 3/9 3/5
Rainy 3/9 2/5 Cool 3/9 1/5

Outlook Temp. Humidity Windy Play

Sunny Cool High True ?

 A new day: Likelihood of the two classes


For “yes” = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For “no” = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(“no”) = 0.0206 / (0.0053 + 0.0206) = 0.795

15
witten&eibe
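The likelihood computation on this slide can be reproduced with a few lines of Python; this is only a sketch of the arithmetic, with the conditional probabilities hard-coded from the table above for the new day (Outlook=Sunny, Temp=Cool, Humidity=High, Windy=True).

# Likelihoods read off the probability table.
like_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
like_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206

# Normalize the two likelihoods so they sum to 1.
p_yes = like_yes / (like_yes + like_no)
p_no  = like_no / (like_yes + like_no)
print(round(p_yes, 3), round(p_no, 3))   # 0.205 0.795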
Bayes’s rule
 Probability of event H given evidence E :

Pr[H | E] = Pr[E | H] Pr[H] / Pr[E]
 A priori probability of H : Pr[H ]
 Probability of event before evidence is seen

 A posteriori probability of H : Pr[ H | E ]


 Probability of event after evidence is seen

from Bayes “Essay towards solving a problem in the


doctrine of chances” (1763)
Thomas Bayes
Born: 1702 in London, England
Died: 1761 in Tunbridge Wells, Kent, England
16
witten&eibe
Naïve Bayes for classification

 Classification learning: what’s the probability of the class


given an instance?
 Evidence E = instance
 Event H = class value for instance

 Naïve assumption: evidence splits into parts (i.e.


attributes) that are independent

Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × … × Pr[En | H] × Pr[H] / Pr[E]

17
witten&eibe
Weather data example

Outlook Temp. Humidity Windy Play


Sunny Cool High True ?
Evidence E

Probability of class "yes":

Pr[yes | E] = Pr[Outlook = Sunny | yes]
              × Pr[Temperature = Cool | yes]
              × Pr[Humidity = High | yes]
              × Pr[Windy = True | yes]
              × Pr[yes] / Pr[E]
            = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
18
witten&eibe
The “zero-frequency problem”

 What if an attribute value doesn’t occur with every class


value?
(e.g. “Humidity = high” for class “yes”)
 Probability will be zero! Pr[ Humidity = High | yes] = 0
 A posteriori probability will also be zero! Pr[ yes | E ] = 0
(No matter how likely the other values are!)
 Remedy: add 1 to the count for every attribute value-class
combination (Laplace estimator)
 Result: probabilities will never be zero!
(also: stabilizes probability estimates)

19
witten&eibe
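A minimal sketch of the Laplace estimator described above, assuming nominal attributes stored in a list of dicts; the function and argument names are illustrative, not from the slides.

from collections import Counter

def laplace_prob(records, attr, value, class_attr, cls, k=1):
    """P(attr=value | class=cls) with add-k (Laplace) smoothing."""
    in_class = [r for r in records if r[class_attr] == cls]
    value_counts = Counter(r[attr] for r in in_class)
    n_values = len(set(r[attr] for r in records))  # distinct values of attr
    # Adding k to every count keeps the estimate strictly above zero.
    return (value_counts[value] + k) / (len(in_class) + k * n_values)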
*Modified probability estimates

 In some cases adding a constant different from 1 might


be more appropriate
 Example: attribute outlook for class yes

Sunny: (2 + µ/3) / (9 + µ)    Overcast: (4 + µ/3) / (9 + µ)    Rainy: (3 + µ/3) / (9 + µ)

 Weights don't need to be equal
   (but they must sum to 1)

Sunny: (2 + µp1) / (9 + µ)    Overcast: (4 + µp2) / (9 + µ)    Rainy: (3 + µp3) / (9 + µ)
20
witten&eibe
Missing values
 Training: instance is not included in
frequency count for attribute value-class
combination
 Classification: attribute will be omitted from
calculation
 Example: Outlook Temp. Humidity Windy Play
? Cool High True ?

Likelihood of “yes” = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238


Likelihood of “no” = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P(“yes”) = 0.0238 / (0.0238 + 0.0343) = 41%
P(“no”) = 0.0343 / (0.0238 + 0.0343) = 59%

21
witten&eibe
Numeric attributes
 Usual assumption: attributes have a normal or
Gaussian probability distribution (given the class)
 The probability density function for the normal
distribution is defined by two parameters:
 Sample mean:  µ = (1/n) Σᵢ xᵢ

 Standard deviation:  σ = sqrt( (1/(n−1)) Σᵢ (xᵢ − µ)² )

 Then the density function f(x) is

   f(x) = (1 / (√(2π) σ)) · e^( −(x−µ)² / (2σ²) )

 (Karl Gauss, 1777–1855, great German mathematician)
22
witten&eibe
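The density function above translates directly into code. A small sketch, with the sample statistics for temperature given class "yes" (µ = 73, σ = 6.2) taken from the next slide.

import math

def gaussian_density(x, mu, sigma):
    """Normal probability density f(x) with mean mu and standard deviation sigma."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Temperature given class "yes": mu = 73, sigma = 6.2 (weather data)
print(round(gaussian_density(66, 73, 6.2), 4))   # 0.034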
Statistics for
weather data
Outlook Temperature Humidity Windy Play
Yes No Yes No Yes No Yes No Yes No
Sunny 2 3 64, 68, 65, 71, 65, 70, 70, 85, False 6 2 9 5
Overcast 4 0 69, 70, 72, 80, 70, 75, 90, 91, True 3 3
Rainy 3 2 72, … 85, … 80, … 95, …
Sunny 2/9 3/5 µ =73 µ =75 µ =79 µ =86 False 6/9 2/5 9/14 5/14
Overcast 4/9 0/5 σ =6.2 σ =7.9 σ =10.2 σ =9.7 True 3/9 3/5
Rainy 3/9 2/5

 Example density value:


f(temperature = 66 | yes) = (1 / (√(2π) · 6.2)) · e^( −(66−73)² / (2 · 6.2²) ) = 0.0340

23
witten&eibe
Classifying a new day

 A new day: Outlook Temp. Humidity Windy Play

Sunny 66 90 true ?

Likelihood of “yes” = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036


Likelihood of “no” = 3/5 × 0.0291 × 0.0380 × 3/5 × 5/14 = 0.000136
P(“yes”) = 0.000036 / (0.000036 + 0.000136) = 20.9%
P(“no”) = 0.000136 / (0.000036 + 0.000136) = 79.1%

 Missing values during training are not included in


calculation of mean and standard deviation

24
witten&eibe
Naïve Bayes: discussion

 Naïve Bayes works surprisingly well (even if


independence assumption is clearly violated)
 Why? Because classification doesn’t require
accurate probability estimates as long as
maximum probability is assigned to correct class
 However: adding too many redundant attributes
will cause problems (e.g. identical attributes)
 Note also: many numeric attributes are not
normally distributed (→ kernel density
estimators)
26
witten&eibe
Naïve Bayes Extensions

 Improvements:
 select best attributes (e.g. with greedy search)
 often works as well or better with just a fraction
of all attributes

 Bayesian Networks

27
witten&eibe
Summary

 OneR – uses rules based on just one attribute

 Naïve Bayes – uses all attributes and Bayes' rule
to estimate the probability of the class given an
instance.
 Simple methods frequently work well, but …
 Complex methods can be better (as we will see)

28
Data Mining
Preprocessing
• Data quality
• Missing values imputation using Mean,
Median and k-Nearest Neighbor approach
• Distance Measure
Data Quality: Why Preprocess the Data?

• Measures for data quality: A multidimensional view


– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how trustable are the data?
– Interpretability: how easily the data can be
understood?

2
Major Tasks in Data Preprocessing
• Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
Integration of multiple databases, data cubes, or files
• Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
• Data transformation and data discretization
Normalization
Concept hierarchy generation

3
Data Quality
• Data quality is a major concern in Data Mining and
Knowledge Discovery tasks.
• Why: Almost all data mining algorithms induce knowledge
strictly from data.
• The quality of knowledge extracted highly depends on the
quality of data.
• There are two main problems in data quality:-
– Missing data: The data not present.
– Noisy data: The data present but not correct.
• Missing/Noisy data sources:-
– Hardware failure.
– Data transmission error.
– Data entry problem.
– Refusal of respondents to answer certain questions.
Effect of Noisy Data on Results Accuracy

Discover only those rules which have support (frequency) >= 2.

Training data:
age    income  student  buys_computer
<=30   high    yes      yes
<=30   high    no       yes
>40    medium  yes      no
>40    medium  no       no
>40    low     yes      yes
31…40          no       yes
31…40  medium  yes      yes

Rules discovered by data mining:
• If age <= 30 and income = 'high' then buys_computer = 'yes'
• If age > 40 and income = 'medium' then buys_computer = 'no'

Testing data (or actual data):
age    income  student  buys_computer
<=30   high    no       ?
>40    medium  yes      ?
31…40  medium  yes      ?

Due to the missing value in the training dataset, the accuracy of prediction decreases and becomes 66.7%.
Imputation of Missing Data (Basic)
• Imputation is a term that denotes a procedure that
replaces the missing values in a dataset by some plausible
values
– i.e. by considering relationship among correlated values
among the attributes of the dataset.

Attribute 1  Attribute 2  Attribute 3  Attribute 4
20           cool         high         false
             cool         high         true
20           cool         high         true
20           mild         low          false
30           cool         normal       false
10           mild         high         true

If we consider only {attribute#2}, the value "cool" appears in 4 records.
Probability of imputing value (20) = 66.67%
Probability of imputing value (30) = 33.33%
Imputation of Missing Data (Basic)
Attribute 1  Attribute 2  Attribute 3  Attribute 4
20           cool         high         false
             cool         high         true
20           cool         high         true
20           mild         low          false
30           cool         normal       false
10           mild         high         true

For {attribute#4}, the value "true" appears in 3 records.
Probability of imputing value (20) = 50%
Probability of imputing value (10) = 50%

For {attribute#2, attribute#3}, the value pair {"cool", "high"} appears in only 2 records.
Probability of imputing value (20) = 100%
Randomness of Missing Data
• Missing data randomness is divided into three classes.

1. Missing completely at random (MCAR):- It occurs


when the probability of instance (case) having missing
value for an attribute does not depend on either the
known attribute values or missing data attribute.
2. Missing at random (MAR):- It occurs when the
probability of instance (case) having missing value for an
attribute depends on the known attribute values, but not
on the missing data attribute.
3. Not missing at random (NMAR):- When the
probability of an instance having a missing value for an
attribute could depend on the value of that attribute.
Methods of Treating Missing Data
• Ignoring and discarding data:- There are two main
ways to discard data with missing values.
– Discard all those records which have missing data also
called as discard case analysis.
– Discarding only those attributes which have high level
of missing data.
• Imputation using Mean/Median or Mode:- One of the
most frequently used methods (statistical technique).
– Replace (numeric continuous) type attribute missing
values using the mean/median. (The median is robust
against noise).
– Replace (discrete) type attribute missing values using
the mode (MOD).
Methods of Treating Missing Data
• Replace missing values using
prediction/classification model:-
– Advantage:- it considers relationship among the known
attribute values and the missing values, so the
imputation accuracy is very high.
– Disadvantage:- If no correlation exists between some
missing attribute values and the known attribute values,
the imputation can't be performed.
– (Alternative approach):- Use hybrid combination of
Prediction/Classification model and Mean/MOD.
• First try to impute missing value using prediction/classification
model, and then Median/MOD.
– We will study more about this topic in Association Rules
Mining.
Methods of Treating Missing Data
• K-Nearest Neighbor (k-NN) approach (Best
approach):-
– k-NN imputes the missing attribute values on the basis
of nearest K neighbor. Neighbors are determined on
the basis of distance measure.
– Once K neighbors are determined, missing value are
imputed by taking mean/median or MOD of known
attribute values of missing attribute.
– Pseudo-code/analysis after studying distance measure.

(Figure: a record with a missing value compared against the other dataset records to find its nearest neighbors.)

Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity
– Numerical measure of how different are two data
objects
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity
Distance Measures
• Remember K-Nearest Neighbor are determined on the
bases of some kind of “distance” between points.

• Two major classes of distance measure:


1. Euclidean: based on the position of points in some k-dimensional space.
2. Non-Euclidean: not related to position or space.
Scales of Measurement
• Applying a distance measure largely depends on the type
of input data
• Major scales of measurement:
1. Nominal Data (aka Nominal Scale Variables)

• Typically classification data, e.g. m/f


• no ordering, e.g. it makes no sense to state that M > F
• Binary variables are a special case of Nominal scale variables.

2. Ordinal Data (aka Ordinal Scale)


• ordered but differences between values are not important
• e.g., political parties on left to right spectrum given labels 0, 1, 2
• e.g., Likert scales, rank your degree of satisfaction on a scale of 1..5
• e.g., restaurant ratings
Scales of Measurement
• Applying a distance function largely depends on the type
of input data
• Major scales of measurement:
3. Numeric type Data (aka interval scaled)
• Ordered and equal intervals. Measured on a linear scale.
• Differences make sense
• e.g., temperature (C,F), height, weight, age, date
Scales of Measurement
• Only certain operations can be performed on
certain scales of measurement.

Nominal Scale:  1. Equality   2. Count
Ordinal Scale:  3. Rank  (cannot quantify the difference)
Interval Scale: 4. Quantify the difference
Axioms of a Distance Measure
• d is a distance measure if it is a function
from pairs of points to reals such that:
1. d(x,x) = 0.
2. d(x,y) = d(y,x).
3. d(x,y) > 0 for x ≠ y.
Some Euclidean Distances
• L2 norm (also common or Euclidean distance):

d(i, j) = √( |xi1 − xj1|² + |xi2 − xj2|² + … + |xip − xjp|² )

– The most common notion of “distance.”

• L1 norm (also Manhattan distance)

d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|
– distance if you had to travel along coordinates only.
Examples L1 and L2 norms

x = (5,5), y = (9,8)

L2-norm: dist(x,y) = √(4² + 3²) = 5
L1-norm: dist(x,y) = 4 + 3 = 7
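The three norms can be written compactly in Python; a small sketch using the points from the example above (function names are my own).

import math

def l1(x, y):
    # Manhattan distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def l2(x, y):
    # Euclidean distance.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linf(x, y):
    # L-infinity norm: maximum coordinate difference.
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (5, 5), (9, 8)
print(l1(x, y), l2(x, y), linf(x, y))   # 7 5.0 4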
Another Euclidean Distance
• L∞ norm : d(x,y) = the maximum of the
differences between x and y in any dimension.
Example: for x = (5,5) and y = (9,8), dist(x,y) = max(4, 3) = 4.
Data Matrix and Dissimilarity Matrix
Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Dissimilarity Matrix (with Euclidean Distance)
     x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1   0
x4   4.24  1     5.39  0

21
Example: Minkowski Distance
Dissimilarity Matrices
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

Manhattan (L1)
L1   x1   x2   x3   x4
x1   0
x2   5    0
x3   3    6    0
x4   6    1    7    0

Euclidean (L2)
L2   x1    x2   x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1  0
x4   4.24  1    5.39  0

Supremum (L∞)
L∞   x1   x2   x3   x4
x1   0
x2   3    0
x3   2    5    0
x4   3    1    5    0
22
Proximity Measure for Nominal
Attributes

• Can take 2 or more states, e.g., red, yellow, blue,


green (generalization of a binary attribute)
• Method 1: Simple matching
– m: # of matches, p: total # of variables

d(i, j) = (p − m) / p
• Method 2: Use a large number of binary attributes
– creating a new binary attribute for each of the M
nominal states

23
Non-Euclidean Distances

• Jaccard measure for binary vectors

• Cosine measure = angle between vectors


from the origin to the points in question.

• Edit distance = number of inserts and deletes


to change one string into another.
Jaccard Measure
• A note about Binary variables first
– Symmetric binary variable
• If both states are equally valuable and carry the same weight,
that is, there is no preference on which outcome should be
coded as 0 or 1.
• Like “gender” having the states male and female
– Asymmetric binary variable:
• If the outcomes of the states are not equally important, such as
the positive and negative outcomes of a disease test.
• We should code the rarest one by 1 (e.g., HIV positive), and the
other by 0 (HIV negative).
– Given two asymmetric binary variables, the agreement
of two 1s (a positive match) is then considered more
important than that of two 0s (a negative match).
Jaccard Measure
• A contingency table for binary data:

                Object j
                1      0      sum
Object i   1    a      b      a + b
           0    c      d      c + d
          sum   a + c  b + d  p

• Simple matching coefficient:
    d(i, j) = (b + c) / (a + b + c + d)

• Jaccard coefficient:
    d(i, j) = (b + c) / (a + b + c)
Jaccard Measure Example
• Example

Name Fever Cough Test-1 Test-2 Test-3 Test-4


Jack Y N P N N N
Mary Y N P N P N
Jim Y P N N N N
– All attributes are asymmetric binary
– let the values Y and P be set to 1, and the value N be set to 0

Using d(i, j) = (b + c) / (a + b + c):

d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
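A sketch of the asymmetric-binary (Jaccard) dissimilarity for the example above, with Y and P coded as 1 and N as 0; the function name and list layout are my own.

def jaccard_dissimilarity(u, v):
    """d(i,j) = (b + c) / (a + b + c), ignoring negative (0,0) matches."""
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

jack = [1, 0, 1, 0, 0, 0]   # Y N P N N N
mary = [1, 0, 1, 0, 1, 0]   # Y N P N P N
jim  = [1, 1, 0, 0, 0, 0]   # Y P N N N N
print(round(jaccard_dissimilarity(jack, mary), 2))   # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))    # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))    # 0.75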
Cosine Measure
• Think of a point as a vector from the origin
(0,0,…,0) to its location.

• Two points’ vectors make an angle, whose cosine


is the normalized dot-product of the vectors.
– Example: p1·p2 = 2; |p1| = |p2| = √3
– cos(θ) = 2/3; θ is about 48 degrees
– dist(p1, p2) = θ = arccos( p1·p2 / (|p1| |p2|) )
Cosine Similarity
• A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.

• Other vector objects: gene features in micro-arrays, …


• Applications: information retrieval, biologic taxonomy, gene feature
mapping, ...
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency
vectors), then
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d

29
Example: Cosine Similarity

• cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),

where • indicates the vector dot product and ||d|| is the length of vector d

• Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1•d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 0.94

30
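A sketch of the cosine computation for the two document vectors above; the helper name is my own.

import math

def cosine(d1, d2):
    # Normalized dot product of the two term-frequency vectors.
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine(d1, d2), 2))   # 0.94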
Edit Distance
• The edit distance of two strings is the number
of inserts and deletes of characters needed to
turn one into the other.

• Equivalently, d(x,y) = |x| + |y| -2|LCS(x,y)|.


– LCS = longest common subsequence = longest
string obtained both by deleting from x and
deleting from y.
Example

• x = abcde ; y = bcduve.

• LCS(x,y) = bcde.
• D(x,y) = |x| + |y| - 2|LCS(x,y)| = 5 + 6 –2*4
= 3.

• What left?
• Normalize it in the range [0-1]. We will study
normalization formulas later.
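A sketch of the insert/delete edit distance via the LCS formula above; the LCS here is a standard dynamic-programming routine, not taken from the slides.

def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (dynamic programming)."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, cx in enumerate(x, 1):
        for j, cy in enumerate(y, 1):
            if cx == cy:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]

def edit_distance(x, y):
    """Number of inserts and deletes: |x| + |y| - 2*|LCS(x, y)|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(edit_distance("abcde", "bcduve"))   # 3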
Back to k-Nearest Neighbor (Pseudo-code)
• Missing values Imputation using k-NN.
• Input: Dataset (D ), size of K

• for each record (x ) with at least one missing value


in D .
– for each data object (y ) in D .
• Compute the distance between x and y.
• Save the distance and the object y in a distance array S.

– Sort the array S in ascending order of distance (nearest first)


– Pick the top K data objects from S
• Impute the missing attribute value (s) of x on the basic of
known values of S (use Mean/Median or MOD).
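A sketch of the k-NN imputation loop above for a single numeric target attribute, assuming the other records are complete; all names (knn_impute, a1/a2/a3) and the tiny dataset are hypothetical.

import math
from statistics import mean

def knn_impute(record, complete_records, target, k=3):
    """Impute record[target] from the k nearest complete records,
    using Euclidean distance over the attributes known in `record`."""
    known = [a for a, v in record.items() if a != target and v is not None]
    def dist(other):
        return math.sqrt(sum((record[a] - other[a]) ** 2 for a in known))
    # Sort by distance, smallest first, and keep the k nearest neighbours.
    neighbours = sorted(complete_records, key=dist)[:k]
    # Impute with the mean of the neighbours' known target values.
    return mean(n[target] for n in neighbours)

data = [{"a1": 20, "a2": 1.0, "a3": 5.0},
        {"a1": 21, "a2": 0.9, "a3": 6.0},
        {"a1": 40, "a2": 3.0, "a3": 9.0}]
query = {"a1": 22, "a2": 1.1, "a3": None}
print(knn_impute(query, data, "a3", k=2))   # 5.5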
K-Nearest Neighbor Drawbacks
• The major drawbacks of this approach are the
– Choice of selecting exact distance functions.
– Considering all attributes when attempting to retrieve
the similar type of examples.
– Searching through all the dataset for finding the same
type of instances.
– Algorithm Cost: ?
Noisy Data
• Noise: Random error, Data Present but not correct.
– Data Transmission error
– Data Entry problem

• Removing noise
– Data Smoothing (rounding, averaging within a window).
– Clustering/merging and Detecting outliers.

• Data Smoothing
– First sort the data and partition it into (equi-depth) bins.
– Then smooth the values in each bin using smoothing by bin means,
smoothing by bin medians, smoothing by bin boundaries, etc.
Noisy Data (Binning Methods)
Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
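A sketch of equi-depth binning with smoothing by bin means, reproducing the price example above; the function names are my own.

def equi_depth_bins(values, depth):
    # Sort and split into consecutive bins of equal depth.
    values = sorted(values)
    return [values[i:i + depth] for i in range(0, len(values), depth)]

def smooth_by_means(bins):
    # Replace every value in a bin by the (rounded) bin mean.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 4)
print(smooth_by_means(bins))   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]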
Noisy Data (Clustering)
• Outliers may be detected by clustering, where similar
values are organized into groups or “clusters”.

• Values which fall outside of the set of clusters may be
considered outliers.
CSC479
Data Mining
Lecture # 8

Mining Frequent Patterns


Associations and Correlations
(Ch # 6)
Introduction
Motivation

{butter, bread, milk, sugar}


{butter, flour, milk, sugar}
{butter, eggs, milk, salt} DB of Sales Transactions
{eggs}
{butter, flour, milk, salt, sugar}

Market basket analysis


• Which products are frequently purchased together?
• Applications
• Improvement of store layouts
• Cross marketing
• Attached mailings/add-on sales
2
Market-Basket Data
 A large set of items, e.g., things sold in
a supermarket.
 A large set of baskets, each of which is
a small set of the items, e.g., the things
one customer buys on one day.

3
Market-Baskets – (2)
 Really, a general many-to-many
mapping (association) between two
kinds of things, where the one (the
baskets) is a set of the other (the
items)
 But we ask about connections among
“items,” not “baskets.”
 The technology focuses on common
events, not rare events (“long tail”). 4
Frequent Itemsets
• Given a set of transactions, find combinations of
items (itemsets) that occur frequently

Market-Basket transactions
Items: {Bread, Milk, Diaper, Beer, Eggs,
Coke}
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
{Bread}: 4
3 Milk, Diaper, Beer, Coke {Milk} : 4
4 Bread, Milk, Diaper, Beer {Diaper} : 4
{Beer}: 3
5 Bread, Milk, Diaper, Coke
{Diaper, Beer} : 3
{Milk, Bread} : 3
Applications – (1)
 Items = products; baskets = sets of
products someone bought in one trip to
the store.

 Example application: given that many


people buy beer and diapers together:
 Run a sale on diapers; raise price of beer.
 Only useful if many buy diapers & beer.
6
Applications – (2)
 Baskets = Web pages; items = words.

 Example application: Unusual words


appearing together in a large number of
documents, e.g., “Brad” and “Angelina,”
may indicate an interesting relationship.

7
Applications – (3)
 Baskets = sentences; items =
documents containing those sentences.

 Example application: Items that appear


together too often could represent
plagiarism.
 Notice items do not have to be “in”
baskets.
8
Definition: Frequent Itemset
 Itemset
   A collection of one or more items
   • Example: {Milk, Bread, Diaper}
 k-itemset
   • An itemset that contains k items
 Support (σ)
   Count: frequency of occurrence of an itemset
     E.g. σ({Milk, Bread, Diaper}) = 2
   Fraction: fraction of transactions that contain an itemset
     E.g. s({Milk, Bread, Diaper}) = 40%
 Frequent Itemset
   An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Some more definitions
 Closed itemset: An itemset X is closed in a data set
D if there exists no proper super-itemset Y such that
Y has the same support count as X in D.
 closed frequent itemset: An itemset X is a closed
frequent itemset in set D if X is both closed and
frequent in D.
 Maximal frequent itemset : An itemset X is a
maximal frequent itemset (or max-itemset) in
a data set D if X is frequent, and there exists no
super-itemset Y such that X⊂ Y and Y is frequent in
D.
Mining Frequent Itemsets task
 Input: A set of transactions T, over a set of items I
 Output: All itemsets with items in I having
 support ≥ minsup threshold
 Problem parameters:
 N = |T|: number of transactions
 d = |I|: number of (distinct) items
 w: max width of a transaction
 Number of possible itemsets? M = 2^d

 Scale of the problem:


 WalMart sells 100,000 items and can store billions of
baskets.
 The Web has billions of words and many billions of pages.
Initial Definition of Association
Rules (ARs) Mining
 Association rules define relationship of the
form:
A→B
 Read as A implies B, where A and B are sets
of binary valued attributes represented in a
data set.
 Association Rule Mining (ARM) is then the
process of finding all the ARs in a given DB.
Association Rule: Basic Concepts
 Given: (1) database of transactions, (2) each
transaction is a list of items (purchased by a
customer in a visit)
 Find: all rules that correlate the presence of one
set of items with that of another set of items
 E.g., 98% of students who study Databases and C++
also study Algorithms
 Applications
 Home Electronics ⇒ * (What other products should
the store stocks up?)
 Attached mailing in direct marketing
 Web page navigation in Search Engines (first page a->
page b)
 Text mining if IT companies -> Microsoft
Some Notation
D = A data set comprising n records and m
binary valued attributes.

I = The set of m attributes, {i1,i2, … ,im},


represented in D.

Itemset = Some subset of I. Each record


in D is an itemset.
In depth Definition of ARs Mining
 Association rules define relationship of the
form:
A→B
 Read as A implies B
 Such that A⊂I, B⊂I, A∩B=∅ (A and B are
disjoint) and A∪B⊆I.
 In other words an AR is made up of an
itemset of cardinality 2 or more.
Association Rules Measurement
The most commonly used “interestingness”
measures are:
1. Support
2. Confidence
Itemset Support
 Support: A measure of the frequency
with which an itemset occurs in a DB.
supp(A) = (# records that contain A) / n
 If an itemset has support higher than
some specified threshold we say that
the itemset is supported or frequent
(some authors use the term large).
 Support threshold is normally set
reasonably low (say) 1%.
Confidence
 Confidence: A measure, expressed as
a ratio, of the support for an AR
compared to the support of its
antecedent.
conf(A→B) = supp(A∪B) / supp(A)
 We say that we are confident in a rule if
its confidence exceeds some threshold
(normally set reasonably high, say,
80%).
Rule Measures: Support and Confidence
Customer Customer  Find all the rules X & Y ⇒ Z with
buys both buys Bread
minimum confidence and support
 support, s, probability that a
transaction contains {X & Y & Z}
 confidence, c, conditional probability
that a transaction having {X & Y} also
contains Z
Customer
buys Butter

Let minimum support 50%, and


Transaction ID Items Bought
2000 A,B,C
minimum confidence 50%,
1000 A,C we have
4000 A,D – A ⇒ C (50%, 66.6%)
5000 B,E,F – C ⇒ A (50%, 100%)
BRUTE FORCE
List all possible combinations in an array.
For each record:
1. Find all combinations.
2. For each combination, index into the array and increment its support by 1.
Then generate rules.

Support counts:
a 6      b 6      ab 3     c 6      ac 3     bc 3     abc 1
d 6      ad 6     bd 3     abd 1    cd 3     acd 1    bcd 1    abcd 0
e 6      ae 3     be 3     abe 1    ce 3     ace 1    bce 1    abce 0
de 3     ade 1    bde 1    abde 0   cde 1    acde 0   bcde 0   abcde 0

Support threshold = 5% (count of 1.5), where total # of trans = 30

Frequent Sets (F):
ab(3)  ac(3)  bc(3)
ad(6)  bd(3)  cd(3)
ae(3)  be(3)  ce(3)
de(3)

Rules: from ab we can develop two rules:
a→b  conf = 3/6 = 50%
b→a  conf = 3/6 = 50%
Etc.
BRUTE FORCE
Advantages:
1) Very efficient for data sets with small numbers of
attributes (<20).

Disadvantages:
1) Given 20 attributes, the number of combinations is 2^20 − 1 =
1,048,575. Therefore array storage requirements will be
4.2MB.
2) Given a data sets with (say) 100 attributes it is likely that
many combinations will not be present in the data set ---
therefore store only those combinations present in the
dataset!
Mining Association Rules—An Example
Transaction ID  Items Bought
2000            A,B,C
1000            A,C
4000            A,D
5000            B,E,F

Min. support 50%
Min. confidence 50%

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A,C}             50%

For rule A ⇒ C:
support = support({A ∪ C}) = 50%
confidence = support({A ∪ C}) / support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent
Mining Frequent Itemsets: the Key Step
 Find the frequent itemsets: the sets of items
that have minimum support
 A subset of a frequent itemset must also be a
frequent itemset
• i.e., if {AB} is a frequent itemset, both {A} and {B}
should be a frequent itemset
 Iteratively find frequent itemsets with
cardinality from 1 to k (k-itemset)
 Use the frequent itemsets to generate
association rules.
The Apriori Algorithm — Example
Database D:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1:          L1:
itemset  sup          itemset  sup
{1}      2            {1}      2
{2}      3            {2}      3
{3}      3            {3}      3
{4}      1            {5}      3
{5}      3

C2 (from L1), scan D:    L2:
itemset  sup             itemset  sup
{1 2}    1               {1 3}    2
{1 3}    2               {2 3}    2
{1 5}    1               {2 5}    3
{2 3}    2               {3 5}    2
{2 5}    3
{3 5}    2

C3: {2 3 5} → scan D → L3: {2 3 5} with sup 2
The Apriori Algorithm
 Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk !=∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1
that are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
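A compact Python sketch of the Apriori loop in the pseudo-code above, intended only for small transaction databases; min_support is a support count, and the candidate generation uses set unions plus subset pruning rather than a literal L_k self-join.

from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [set(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count support of the candidates in this pass.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Generate (k+1)-candidates from frequent k-itemsets,
        # pruning any candidate with an infrequent k-subset.
        k += 1
        keys = list(level)
        current = {a | b for a in keys for b in keys if len(a | b) == k}
        current = {c for c in current
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# The example database from the previous slide, min support count = 2.
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(db, min_support=2))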
Important Details of Apriori
 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 How to count supports of candidates?
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
 Pruning:
• acde is removed because ade is not in L3
 C4={abcd}
28
29
CSC479
Data Mining
Lecture # 11

Classification
Basic Concepts
Decision Trees

(Ch # 8: Data Mining-Concepts and Techniques


by Han and Kamber)
Catching tax-evasion
Tax-return data for year 2011:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

A new tax return for 2012. Is this a cheating tax return?

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

An instance of the classification problem: learn a method for discriminating


between records of different classes (cheaters vs non-cheaters)
2
What is classification?
 Classification is the task of learning a target function f
that maps attribute set x to one of the predefined class labels y

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

One of the attributes is the class attribute; in this case: Cheat
Two class labels (or classes): Yes (1), No (0)
3
What is classification (cont…)
 The target function f is known as a
classification model

 Descriptive modeling: Explanatory tool to


distinguish between objects of different
classes (e.g., understand why people
cheat on their taxes)

 Predictive modeling: Predict a class of a


previously unseen record 4
Examples of Classification Tasks
 Predicting tumor cells as benign or malignant

 Classifying credit card transactions as


legitimate or fraudulent

 Categorizing news stories as finance,


weather, entertainment, sports, etc

 Identifying spam email, spam web pages,


adult content

 Understanding if a web query has commercial


intent or not 5
General approach to classification
 Training set consists of records with known class
labels

 Training set is used to build a classification model

 A labeled test set of previously unseen data


records is used to evaluate the quality of the
model.

 The classification model is applied to new records


with unknown class labels
6
Illustrating Classification Task
Tid Attrib1 Attrib2 Attrib3 Class Learning
1 Yes Large 125K No
algorithm
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
Induction
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No Learn
8 No Small 85K Yes Model
9 No Medium 75K No
10 No Small 90K Yes
Model
10

Training Set
Apply
Tid Attrib1 Attrib2 Attrib3 Class Model
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ? Deduction
14 No Small 95K ?
15 No Large 67K ?
10

Test Set 7
Evaluation of classification models
 Counts of test records that are correctly
(or incorrectly) predicted by the
classification model Predicted Class
 Confusion matrix Class = 1 Class = 0

Actual Class
Class = 1 f11 f10
Class = 0 f01 f00

Accuracy = (# correct predictions) / (total # of predictions) = (f11 + f00) / (f11 + f10 + f01 + f00)

Error rate = (# wrong predictions) / (total # of predictions) = (f10 + f01) / (f11 + f10 + f01 + f00)
8
Classification Techniques
 Decision Tree based Methods
 Rule-based Methods
 Memory based reasoning
 Neural Networks
 Naïve Bayes and Bayesian Belief Networks
 Support Vector Machines

9
Decision Trees
 Decision tree
 A flow-chart-like tree structure
 Internal node denotes a test on an attribute
 Branch represents an outcome of the test
 Leaf nodes represent class labels or class
distribution

11
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes, test outcomes, and class labels at the leaves)

Refund?
  Yes → NO
  No  → MarSt?
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
          Married → NO
12
Another Example of Decision Tree

MarSt?
  Married → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                                < 80K → NO
                                > 80K → YES

(Same training data as before.)

There could be more than one tree that fits the same data!
13
Decision Tree Classification Task

Tid Attrib1 Attrib2 Attrib3 Class


Tree
1 Yes Large 125K No Induction
2 No Medium 100K No algorithm
3 No Small 70K No
4 Yes Medium 120K No
Induction
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No Learn
8 No Small 85K Yes Model
9 No Medium 75K No
10 No Small 90K Yes
Model
10

Training Set
Apply Decision
Model
Tid Attrib1 Attrib2 Attrib3 Class Tree
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
Deduction
14 No Small 95K ?
15 No Large 67K ?
10

Test Set 14
Apply Model to Test Data

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch matching each attribute value of the test record:

Refund?
  Yes → NO
  No  → MarSt?
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
          Married → NO

Refund = No, then MarSt = Married → assign Cheat to "No"
20
Decision Tree Classification Task

Tid Attrib1 Attrib2 Attrib3 Class


Tree
1 Yes Large 125K No Induction
2 No Medium 100K No algorithm
3 No Small 70K No
4 Yes Medium 120K No
Induction
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No Learn
8 No Small 85K Yes Model
9 No Medium 75K No
10 No Small 90K Yes
Model
10

Training Set
Apply
Decision
Model
Tid Attrib1 Attrib2 Attrib3 Class
Tree
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
Deduction
14 No Small 95K ?
15 No Large 67K ?
10

21
Test Set
Tree Induction
 Finding the best decision tree is NP-hard

 Greedy strategy.
 Split the records based on an attribute test that
optimizes certain criterion.

 Many Algorithms:
 Hunt’s Algorithm (one of the earliest)
 CART
 ID3, C4.5
 SLIQ, SPRINT

22
Which Attribute is the Best Classifier?
The central choice in the ID3 algorithm is selecting
which attribute to test at each node in the tree

We would like to select the attribute which is most


useful for classifying examples

For this we need a good quantitative measure

For this purpose a statistical property, called


information gain is used
23
Which Attribute is the Best Classifier?
Definition of Entropy

- In order to define information gain precisely, we


begin by defining entropy

- Entropy is a measure commonly used in


information theory.

- Entropy characterizes the impurity of an


arbitrary collection of examples

24
Entropy
 Entropy (D)
 Entropy of data set D is denoted by H(D)
 Cis are the possible classes
 pi = fraction of records from D that have class C

H(D) = − Σᵢ pᵢ log₂ pᵢ

25
Entropy Examples
 Example:
 10 records have class A
 20 records have class B
 30 records have class C
 40 records have class D
 Entropy = −[(.1 log₂ .1) + (.2 log₂ .2) + (.3 log₂ .3) + (.4 log₂ .4)]
 Entropy = 1.846

26
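A sketch verifying the entropy calculation above (logarithms are base 2); the function name is my own.

import math

def entropy(class_counts):
    """H(D) = -sum p_i * log2(p_i) over the class proportions."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

print(round(entropy([10, 20, 30, 40]), 3))   # 1.846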
Splitting Criterion
 Example:
Two classes, +/-
100 records overall (50 +s and 50 -s)
A and B are two binary attributes
• Records with A=0: 48+, 2-
Records with A=1: 2+, 48-
• Records with B=0: 26+, 24-
Records with B=1: 24+, 26-
Splitting on A is better than splitting on B
• A does a good job of separating +s and -s
• B does a poor job of separating +s and -s
27
Which Attribute is the Best Classifier?
Information Gain
The expected information needed to classify a tuple
in D is
= Entropy
How much more information would we still need
(after partitioning at attribute A) to arrive at an exact
classification? This amount is measured by H(D, A), the
expected (weighted average) entropy of the partitions of D induced by A.

Info Gain (D, A) = H(D) − H(D, A)


In general, we write Gain (D, A), where D is the
collection of examples & A is an attribute 28
Information Gain
 Gain of an attribute split: compare the
impurity of the parent node with the average
impurity of the child nodes
InfoGain = H(parent) − Σⱼ₌₁..ᵥ ( |Dⱼ| / |D| ) · H(Dⱼ)

 Maximizing the gain ⇔ Minimizing the weighted


average impurity measure of children nodes

29
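A sketch of the information-gain formula above, applied to the two-class splitting example from the earlier slide (50 + and 50 − records, binary attributes A and B); the function names are my own.

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, child_counts_list):
    """InfoGain = H(parent) - sum_j (|D_j|/|D|) * H(D_j)."""
    total = sum(parent_counts)
    weighted = sum((sum(child) / total) * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

print(round(info_gain([50, 50], [[48, 2], [2, 48]]), 3))    # split on A ≈ 0.758
print(round(info_gain([50, 50], [[26, 24], [24, 26]]), 3))  # split on B ≈ 0.001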
Examples Constructing Decision Tree

30
Examples Constructing Decision Tree

 So the attribute Age will be placed at root


level.
 For placement at second level we find
InfoGain for all the remaining attributes
under every branch of the parent node.

31
DECISION TREES
Which Attribute is the Best Classifier?: Use
Information Gain to develop the complete tree

32
CSC479
Data Mining
Lecture # 15

Rule Based Classification

(Ch # 8.4)
Rule Generation from Decision Tree

 Decision tree classifiers are a popular method of
classification because they are easy to understand
 However, decision tree can become large and difficult


to interpret

 In comparison with decision tree, the IF-THEN rules


may be easier for humans to understand, particularly
if the decision tree is very large
Rule Generation from Decision Tree
 Rules are easier to understand than large trees
 One rule is created for each path from the root to a
leaf
 Each attribute-value pair along a path forms a
conjunction: the leaf holds the class prediction
Rule Generation from Decision Tree
 Example: Rule extraction from our buys_computer
decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN
buys_computer = yes
IF age = young AND credit_rating = fair THEN
buys_computer = no
Rule-Based Classifier
 Classify records by using a collection of
“if…then…” rules
 Rule: (Condition) → y
 where
• Condition is a conjunction of attribute tests
• y is the class label
 LHS: rule antecedent or condition
 RHS: rule consequent
 Examples of classification rules:
• (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
• (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds


R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) →
Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier

 A rule r covers an instance x if the attributes of


the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
Rule Coverage and Accuracy
 Coverage of a rule:
   Fraction of records that satisfy the antecedent of a rule

 Accuracy of a rule:
   Fraction of records that satisfy both the antecedent and consequent of a rule

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Status=Single) → No
Coverage = 40%, Accuracy = 50%
How does Rule-based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?

A lemur triggers rule R3, so it is classified as a mammal


A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
Characteristics of Rule-Based Classifier
 Mutually exclusive rules
 Classifier contains mutually exclusive rules if the rules are
independent of each other
 Every record is covered by at most one rule
 we cannot have conflicting rules because no two rules will be
triggered by the same tuple

 Exhaustive rules
 Classifier has exhaustive coverage if it accounts for every
possible combination of attribute values
 Each record is covered by at least one rule
 There is one rule for each possible attribute-value
combination, so that the set of rules does not require a
default rule
From Decision Trees To Rules
Decision tree:

Refund?
  Yes → NO
  No  → Marital Status?
          Single, Divorced → Taxable Income?
                                < 80K → NO
                                > 80K → YES
          Married → NO

Classification Rules:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Rules are mutually exclusive and exhaustive
Rule set contains as much information as the
tree
Rules Can Be Simplified
(Using the same decision tree and ten-record training data shown earlier.)
Initial Rule: (Refund=No) ∧ (Status=Married) → No


Simplified Rule: (Status=Married) → No
Effect of Rule Simplification
 Rules are no longer mutually exclusive
 A record may trigger more than one rule
 Solution?
• Ordered rule set
• Unordered rule set – use voting schemes

 Rules are no longer exhaustive


 A record may not trigger any rules
 Solution?
• Use a default class
Ordered Rule Set
 Rules are rank ordered according to their priority
 An ordered rule set is known as a decision list
 When a test record is presented to the classifier
 It is assigned to the class label of the highest ranked rule it has
triggered
 If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds


R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
CSC479
Data Mining
Lecture # 18

Clustering

(Ch # 10)
The Problem of Clustering
 Given a set of points, with a notion of
distance between points, group the points
into some number of clusters, so that
members of a cluster are in some sense as
nearby as possible.
 Clustering is unsupervised classification: no
predefined classes.
 Formally, clustering is the process of
grouping data points such that intra-cluster
distance is minimized and inter-cluster
distance is maximized.
2
Types of Clustering
 A clustering is a set of clusters
 Important distinction between hierarchical
and partitional sets of clusters
 Partitional Clustering
• A division data objects into non-overlapping
subsets (clusters) such that each data object is in
exactly one subset

 Hierarchical clustering
• A set of nested clusters organized as a hierarchical
tree
 Other distinctions – coming slides
3
Partitional Clustering

Original Points A Partitional Clustering

4
Hierarchical Clustering

(Figure: points p1–p4 shown as a traditional hierarchical clustering and as the corresponding dendrogram.)

5
Other Distinctions Between Sets of Clusters
 Exclusive versus non-exclusive
 In non-exclusive clusterings, points may belong to multiple
clusters.
 Can represent multiple classes or ‘border’ points

 Fuzzy versus non-fuzzy


 In fuzzy clustering, a point belongs to every cluster with some
weight between 0 and 1
 Weights must sum to 1
 Probabilistic clustering has similar characteristics

 Partial versus complete


 In some cases, we only want to cluster some of the data

 Heterogeneous versus homogeneous


 Cluster of widely different sizes, shapes, and densities
6
Types of Clusters
 Well-separated clusters

 Center-based clusters

 Contiguous clusters

 Density-based clusters

 Property or Conceptual

 Described by an Objective Function


7
Types of Clusters: Well-Separated
 Well-Separated Clusters:
 A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than to
any point not in the cluster.

3 well-separated clusters
8
Types of Clusters: Center-Based

 Center-based
 A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
 The center of a cluster is often a centroid, the average of all the
points in the cluster, or a medoid, the most “representative”
point of a cluster

4 center-based clusters
9
Types of Clusters: Density-Based

 Density-based
 A cluster is a dense region of points, which is separated by low-
density regions, from other regions of high density.
 Used when the clusters are irregular or intertwined, and when
noise and outliers are present.

6 density-based clusters
10
Data Structures Used
 Data matrix (n objects × p attributes):

   [ x11  ...  x1f  ...  x1p ]
   [ ...  ...  ...  ...  ... ]
   [ xi1  ...  xif  ...  xip ]
   [ ...  ...  ...  ...  ... ]
   [ xn1  ...  xnf  ...  xnp ]

 Similarity (dissimilarity) matrix (n × n, lower triangular):

   [ 0                             ]
   [ d(2,1)  0                     ]
   [ d(3,1)  d(3,2)  0             ]
   [ :       :       :             ]
   [ d(n,1)  d(n,2)  ...  ...  0   ]
11
Partitioning (Centeroid-Based) Algorithms
 Construct a partition of a database D of n objects
into a set of k clusters
 Given a k, find a partition of k clusters that
optimizes the chosen partitioning criterion
 k-means (MacQueen’67)
• Each cluster is represented by the center of the cluster
• A Euclidean Distance based method, mostly used for
interval/ratio scaled data

 k-medoids
• Each cluster is represented by one of the objects in the
cluster
• For categorical data
K-means Clustering
 Partitional clustering approach
 Each cluster is associated with a centroid (center
point)
 Each point is assigned to the cluster with the
closest centroid
 Number of clusters, K, must be specified
 The basic algorithm is very simple

13
Clustering Example
(Plot: the data points before clustering, iteration 0.)
14
Clustering Example
(Plot: cluster assignments and centroids after iterations 1–6 of k-means.)

15
K-means Clustering – Details

 Initial centroids are often chosen randomly.


 Clusters produced vary from one run to another.
 The centroid is (typically) the mean of the points in the
cluster.
 ‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
 K-means will converge for common similarity measures
mentioned above.
 Most of the convergence happens in the first few
iterations.
 Often the stopping condition is changed to ‘Until relatively few
points change clusters’
 Complexity is O( n * K * I * d )
 n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes

16
A Simple example showing the
implementation of k-means algorithm
(using K=2)
Step 1:
Initialization: Randomly we choose following two
centroids (k=2) for two clusters.
In this case the 2 centroid are: m1=(1.0,1.0) and
m2=(5.0,7.0).
Step 2:
 Thus, we obtain two clusters containing: {1,2,3} and {4,5,6,7}.
 Their new centroids are recomputed as the means of the points in each cluster.
Step 3:
 Now using these centroids, we compute the Euclidean distance of each object again, as shown in the table.
 Therefore, the new clusters are: {1,2} and {3,4,5,6,7}
 Next centroids are: m1 = (1.25, 1.5) and m2 = (3.9, 5.1)
Step 4:
 The clusters obtained are: {1,2} and {3,4,5,6,7}
 Therefore, there is no change in the clusters.
 Thus, the algorithm comes to a halt here and the final result consists of 2 clusters: {1,2} and {3,4,5,6,7}.
[Plot: the final K=2 clusters from the example above]
[Plot: the same procedure run with K=3, showing Step 1 and Step 2]
CSC479
Data Mining
Lecture # 19

Clustering

(Ch # 10)
Partitioning (Centroid-Based) Algorithms
 Construct a partition of a database D of n objects
into a set of k clusters
 Given a k, find a partition of k clusters that
optimizes the chosen partitioning criterion
 k-means (MacQueen’67)
• Each cluster is represented by the center of the cluster
• A Euclidean Distance based method, mostly used for
interval/ratio scaled data

 k-medoids
• Each cluster is represented by one of the objects in the
cluster
• For categorical data 2
K-means Clustering
 Partitional clustering approach
 Each cluster is associated with a centroid (center
point)
 Each point is assigned to the cluster with the
closest centroid
 Number of clusters, K, must be specified
 The basic algorithm is very simple

3
Getting k Right
 Try different k, looking at the change in
the average distance to centroid, as k
increases.
 Average falls rapidly until right k, then
changes little.

[Figure: elbow curve of average distance to centroid versus k; the best value of k is where the curve stops falling rapidly]
4
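A minimal sketch of this procedure (assuming Python with numpy, the kmeans function sketched earlier, and X an (n, d) array of data points with n larger than the largest k tried):

import numpy as np

def avg_dist_to_centroid(X, centroids, labels):
    # Mean distance from each point to the centroid of its own cluster.
    return float(np.mean(np.linalg.norm(X - centroids[labels], axis=1)))

# Try increasing k and watch for the 'elbow' where the average stops falling rapidly.
for k in range(1, 6):
    centroids, labels = kmeans(X, k)       # kmeans() as sketched earlier
    print(k, round(avg_dist_to_centroid(X, centroids, labels), 3))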
Evaluating K-means Clusters
 Most common measure is Sum of Squared Error (SSE)
 For each point, the error is the distance to the nearest
cluster
 To get SSE, we square these errors and sum them.
$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$
 x is a data point in cluster Ci and mi is the representative
point for cluster Ci
• can show that mi corresponds to the center (mean) of the
cluster
 Given two clusters, we can choose the one with the smallest
error
 One easy way to reduce SSE is to increase K, the number of
clusters
• A good clustering with smaller K can have a lower SSE than a
poor clustering with higher K 5
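For reference, a direct translation of the SSE formula above into a small sketch (assuming Python with numpy; centroids[labels[i]] plays the role of the representative point m_i):

import numpy as np

def sse(X, centroids, labels):
    """Sum of squared Euclidean distances from each point to its cluster's centroid."""
    diffs = X - centroids[labels]          # vector from each point to its centroid m_i
    return float(np.sum(diffs ** 2))

Given two clusterings of the same data with the same K, the one with the smaller sse(...) would be preferred.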
Importance of Choosing Initial Centroids …

[Figure: k-means result at iteration 5 for one choice of initial centroids (x vs. y scatter plot)]
6
Importance of Choosing Initial Centroids …

[Figure: five panels (iterations 1–5) showing k-means progress for a different choice of initial centroids (x vs. y scatter plots)]

7
Limitations of K-means
K-means has problems when clusters are of differing
 Sizes
 Densities
 Non-globular shapes

K-means also has problems when the data contains outliers.

8
Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)

9
Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)

10
Weaknesses of K-Means Clustering
1. When the number of data points is small, the initial grouping will determine the clusters significantly.
2. The number of clusters, K, must be determined beforehand. A further disadvantage is that the method does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
3. We never know the real clusters: even with the same data, inputting the points in a different order may produce different clusters when the number of data points is small.
4. It is sensitive to the initial conditions. Different initial conditions may produce different clusterings, and the algorithm may be trapped in a local optimum.
11
Applications of K-Means Clustering
 It is relatively efficient and fast: it computes its result in O(tkn) time, where n is the number of objects or points, k is the number of clusters and t is the number of iterations.
 k-means clustering can be applied in machine learning and data mining.
 Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization), and for Image Segmentation.
 Also used for choosing color palettes on old-fashioned graphical display devices and for Image Quantization (see the sketch below). 12
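As one concrete illustration of this kind of use, the sketch below (illustrative only; it assumes numpy, the kmeans function from the earlier sketch, and random pixel values standing in for a real image) quantizes colors down to a k-color palette:

import numpy as np

# Stand-in for a real image: 1,000 RGB pixels with components in [0, 1].
pixels = np.random.default_rng(1).random((1000, 3))

k = 16                                    # size of the color palette
palette, labels = kmeans(pixels, k)       # cluster centers serve as palette colors

# Quantized image: every pixel is replaced by its nearest palette color.
quantized = palette[labels]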
CONCLUSION
 The K-means algorithm is useful for undirected knowledge discovery and is relatively simple. K-means has found widespread use in many fields, ranging from unsupervised learning for neural networks, pattern recognition, classification analysis, artificial intelligence, image processing, and machine vision, to many others.
13
Pre-processing and Post-processing K-means
 Pre-processing
 Normalize the data
 Eliminate outliers
 Post-processing
 Eliminate small clusters that may represent
outliers
 Split ‘loose’ clusters, i.e., clusters with
relatively high SSE
 Merge clusters that are ‘close’ and that have
relatively low SSE
 Can use these steps during the clustering process
  • ISODATA
14
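A minimal sketch of the normalization and outlier-elimination pre-processing steps listed above (assuming Python with numpy; the 3-standard-deviation cutoff is just one common heuristic, not something prescribed by the slides):

import numpy as np

def preprocess(X):
    # Normalize each attribute to zero mean and unit variance (z-scores).
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    # Eliminate outliers: drop points whose value on any attribute lies more
    # than 3 standard deviations from the mean (one simple heuristic).
    keep = np.all(np.abs(Z) <= 3.0, axis=1)
    return Z[keep]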
Variations of k-Means Method
 Aspects of variants of k-means
 Selection of initial k centroids
• E.g., choose k farthest points
 Dissimilarity calculations
• E.g., use Manhattan distance
 Strategies to calculate cluster means
• E.g., update the means incrementally

15
Strengths of k-Means Method
 Strength
 Relatively efficient for large datasets
• O(tkn) where n is # objects, k is # clusters, and t is
# iterations; normally, k, t <<n
 Often terminates at a local optimum
• global optimum may be found using techniques
such as deterministic annealing and genetic
algorithms

16
k-modes Algorithm
 Handling categorical data: k-modes (Huang’98)
 Replacing means of clusters with modes
  • Given n records in a cluster, the mode is the record made up of the most frequent attribute values
 Example cluster:

  age     income   student   credit_rating
  <=30    high     no        fair
  <=30    high     no        excellent
  31…40   high     no        fair
  >40     medium   no        fair
  >40     low      yes       fair
  >40     low      yes       excellent
  31…40   low      yes       excellent
  <=30    medium   no        fair
  <=30    low      yes       fair
  >40     medium   yes       fair
  <=30    medium   yes       excellent
  31…40   medium   no        excellent
  31…40   high     yes       fair

 In the example cluster, mode = (<=30, medium, yes, fair)
 Using new dissimilarity measures to deal with categorical objects
17
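A small sketch of the two ideas above (illustrative Python; records are tuples of categorical values, and the simple matching dissimilarity shown is one common choice for k-modes, counting the attributes on which two records disagree):

from collections import Counter

def mode_record(records):
    """Mode of a cluster: the most frequent value of each attribute, taken independently."""
    n_attrs = len(records[0])
    return tuple(Counter(rec[i] for rec in records).most_common(1)[0][0]
                 for i in range(n_attrs))

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

cluster = [("<=30", "high", "no", "fair"),
           ("<=30", "medium", "yes", "excellent"),
           (">40", "medium", "yes", "fair")]
print(mode_record(cluster))    # ('<=30', 'medium', 'yes', 'fair')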
A Problem of K-means
 Sensitive to outliers
 Outlier: objects with extremely large (or
small) values
• May substantially distort the distribution of the
data

[Figure: data points with two centroids marked "+" and a distant outlier]
18
k-Medoids Clustering Method
 k-medoids: Find k representative objects,
called medoids
 PAM (Partitioning Around Medoids, 1987)
 CLARA (Kaufmann & Rousseeuw, 1990)
 CLARANS (Ng & Han, 1994): Randomized sampling

[Figure: side-by-side scatter plots contrasting a k-means partition with a k-medoids partition of the same data]
19
PAM (Partitioning Around Medoids) (1987)
 PAM (Kaufman and Rousseeuw, 1987)
 Arbitrarily choose k objects as the initial medoids
 Until no change, do
 (Re)assign each object to the cluster with the nearest
medoid
 Improve the quality of the k-medoids
(Randomly select a nonmedoid object, Orandom,
compute the total cost of swapping a medoid with
Orandom)
 Works for small data sets (e.g., 100 objects in 5 clusters)
 Not efficient for medium and large data sets 20
Swapping Cost
 For each pair of a medoid o and a non-
medoid object h, measure whether h is
better than o as a medoid
 Use the squared-error criterion
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, o_i)^2$$

 Compute E_h − E_o, the criterion with h as medoid minus the criterion with o
  Negative: swapping brings benefit
 Choose the minimum swapping cost 21
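A minimal sketch of this evaluation (illustrative Python with numpy; medoids are given as row indices into the data matrix X, and d(p, o_i)² is squared Euclidean distance here):

import numpy as np

def total_cost(X, medoid_idx):
    """Squared-error criterion E: each point contributes its squared distance
    to the nearest medoid."""
    medoids = X[list(medoid_idx)]
    d2 = ((X[:, None, :] - medoids[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min(axis=1).sum())

def swap_cost(X, medoid_idx, o, h):
    """E_h - E_o for swapping medoid o with non-medoid h; a negative value
    means the swap improves the clustering."""
    swapped = [h if m == o else m for m in medoid_idx]
    return total_cost(X, swapped) - total_cost(X, medoid_idx)

PAM would evaluate swap_cost for every (medoid, non-medoid) pair, perform the swap with the most negative cost, and repeat until no swap helps.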
Four Swapping Cases
 When a medoid m is to be swapped with a non-
medoid object h, check each of other non-
medoid objects j
 j is in cluster of m⇒ reassign j
• Case 1: j is closer to some k than to h; after swapping m and
h, j relocates to cluster represented by k
• Case 2: j is closer to h than to k; after swapping m and h, j is
in cluster represented by h
 j is in cluster of some k, not m ⇒ compare k with h
• Case 3: j is closer to some k than to h; after swapping m and
h, j remains in cluster represented by k
• Case 4: j is closer to h than to k; after swapping m and h, j is
in cluster represented by h
22
PAM Clustering: Total swapping cost TC_mh = Σ_j C_jmh

[Figure: four scatter-plot panels illustrating the swapping cases for medoid m, candidate h, another medoid k, and a non-medoid object j]

 Case 1: C_jmh = d(j, k) − d(j, m) ≥ 0
 Case 2: C_jmh = d(j, h) − d(j, m)  (may be positive or negative)
 Case 3: C_jmh = d(j, k) − d(j, k) = 0
 Case 4: C_jmh = d(j, h) − d(j, k) < 0
23
Strength and Weakness of PAM
 PAM is more robust than k-means in the
presence of outliers because a medoid is
less influenced by outliers or other extreme
values than a mean
 PAM works efficiently for small data sets
but does not scale well for large data sets
 O(k(n−k)²) for each iteration, where n is the number of data objects and k is the number of clusters
35
CSC479
Data Mining
Lecture # 22

Hierarchical Clustering

(Ch # 10.3)
Hierarchical Clustering

 Use distance matrix as clustering criteria. This method does not


require the number of clusters k as an input, but needs a
termination condition

[Figure: agglomerative clustering (AGNES) proceeds bottom-up over steps 0–4, merging a and b into ab, d and e into de, then c with de into cde, and finally ab with cde into abcde; divisive clustering (DIANA) traverses the same hierarchy top-down over steps 4–0]
Hierarchical Clustering

 Clusters are created in levels, actually creating sets of clusters at each level.
 Agglomerative
  Initially each item is in its own cluster
  Iteratively, clusters are merged together
  Bottom Up
 Divisive
  Initially all items are in one cluster
  Large clusters are successively divided
  Top Down
Distance Between Clusters
 Single Link: smallest distance between points. The minimum of all pairwise distances between points in the two clusters
  Tends to produce long, “loose” clusters
 Complete Link: largest distance between points. The maximum of all pairwise distances between points in the two clusters
  Tends to produce very tight clusters
 Average Link: the average of all pairwise distances between points in the two clusters
 Centroid Linkage: distance between centroids
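To make the four definitions above concrete, here is a small sketch (illustrative Python with numpy; A and B are the points of the two clusters as (nA, d) and (nB, d) arrays):

import numpy as np

def linkage_distances(A, B):
    """Single, complete, average and centroid linkage distances between two clusters."""
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # all pairwise distances
    return {
        "single":   float(pairwise.min()),    # closest pair of points
        "complete": float(pairwise.max()),    # farthest pair of points
        "average":  float(pairwise.mean()),   # mean over all pairs
        "centroid": float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))),
    }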
Hierarchical Algorithms
 Single Link
 Complete Link
 Average Link
 MST Single Link
Single Link - Example

6
Clustering Algorithms
Hierarchical Clustering:
Example: Single-Link (Minimum) Method:

Resulting Tree, or Dendrogram:
Clustering Algorithms
Hierarchical Clustering:
Example: Complete-Link (Maximum) Method:

Resulting Tree, or Dendrogram:
Clustering Algorithms
Hierarchical Clustering:
In a dendrogram, the length of each tree branch represents the distance between the clusters it joins.

Different dendrograms may arise when different linkage methods are used.
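For completeness, a short sketch of producing such dendrograms programmatically (assuming Python with scipy and matplotlib available; the data points are made up):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0, 0.0], [0.5, 0.2], [3.0, 3.0], [3.2, 2.8], [6.0, 0.5]])

# Agglomerative clustering; changing method to 'complete' or 'average'
# can change the shape of the resulting dendrogram.
Z = linkage(X, method='single')
dendrogram(Z)
plt.show()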
