Data Mining All Slides

The document discusses different classification algorithms including 1R and Naive Bayes. 1R learns simple 1-level decision trees that classify examples based on the value of a single attribute. It chooses the attribute with the lowest error rate. Naive Bayes makes independence assumptions between attributes and uses probabilities to classify examples based on the values of all attributes. The document provides examples of applying these methods to classify examples of weather data based on attributes like outlook, temperature, humidity, and windy conditions.

Algorithms for

Classification:
The Basic Methods
Outline

 Simplicity first: 1R

 Naïve Bayes

2
Classification

 Task: Given a set of pre-classified examples,


build a model or classifier to classify new cases.
 Supervised learning: classes are known for the
examples used to build the classifier.
 A classifier can be a set of rules, a decision tree,
a neural network, etc.
 Typical applications: credit approval, direct
marketing, fraud detection, medical diagnosis,
…..
3
Simplicity first

 Simple algorithms often work very well!


 There are many kinds of simple structure, e.g.:
 One attribute does all the work
 All attributes contribute equally & independently
 A weighted linear combination might do
 Instance-based: use a few prototypes
 Use simple logical rules

 Success of method depends on the domain

4
witten&eibe
Inferring rudimentary rules

 1R: learns a 1-level decision tree


 I.e., rules that all test one particular attribute

 Basic version
 One branch for each value
 Each branch assigns most frequent class
 Error rate: proportion of instances that don’t belong to the
majority class of their corresponding branch
 Choose attribute with lowest error rate

(assumes nominal attributes)

5
witten&eibe
Pseudo-code for 1R

For each attribute,


For each value of the attribute, make a rule as follows:
count how often each class appears
find the most frequent class
make the rule assign that class to this attribute-value
Calculate the error rate of the rules
Choose the rules with the smallest error rate

 Note: “missing” is treated as a separate attribute value

6
witten&eibe
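As a rough illustration of the pseudo-code above, here is a minimal Python sketch of 1R for nominal attributes. The function name one_r and the dictionary-based dataset layout are my own choices, not part of the original slides; missing values can be encoded as a distinct value such as "missing", as the note suggests.

from collections import Counter, defaultdict

def one_r(records, attributes, class_attr):
    """Return (best_attribute, rules, error_count) for the 1R method."""
    best = None
    for attr in attributes:
        # Count classes for each value of this attribute.
        counts = defaultdict(Counter)
        for r in records:
            counts[r[attr]][r[class_attr]] += 1
        # Each branch predicts its most frequent class.
        rules = {value: cls.most_common(1)[0][0] for value, cls in counts.items()}
        # Errors: instances that are not in the majority class of their branch.
        errors = sum(total - cls[rules[value]]
                     for value, cls in counts.items()
                     for total in [sum(cls.values())])
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Tiny subset of the weather data, just to show the call:
data = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
]
print(one_r(data, ["outlook", "windy"], "play"))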
Evaluating the weather attributes
Weather data:

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

1R rules and error rates:

Attribute  Rules            Errors  Total errors
Outlook    Sunny → No       2/5     4/14
           Overcast → Yes   0/4
           Rainy → Yes      2/5
Temp       Hot → No*        2/4     5/14
           Mild → Yes       2/6
           Cool → Yes       1/4
Humidity   High → No        3/7     4/14
           Normal → Yes     1/7
Windy      False → Yes      2/8     5/14
           True → No*       3/6

* indicates a tie

7
witten&eibe
Dealing with
numeric attributes
 Discretize numeric attributes
 Divide each attribute’s range into intervals
 Sort instances according to attribute’s values
 Place breakpoints where the class changes
(the majority class)
 This minimizes the total error

 Example: temperature from weather data

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

8
witten&eibe
The problem of overfitting

 This procedure is very sensitive to noise


 One instance with an incorrect class label will probably
produce a separate interval
 Also: time stamp attribute will have zero errors
 Simple solution:
enforce minimum number of instances in majority class
per interval

9
witten&eibe
Discretization example

 Example (with min = 3):


64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

 Final result for temperature attribute


64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No

10
witten&eibe
With overfitting avoidance

 Resulting rule set:


Attribute Rules Errors Total errors
Outlook Sunny → No 2/5 4/14
Overcast → Yes 0/4
Rainy → Yes 2/5
Temperature ≤ 77.5 → Yes 3/10 5/14
> 77.5 → No* 2/4
Humidity ≤ 82.5 → Yes 1/7 3/14
> 82.5 and ≤ 95.5 → No 2/6
> 95.5 → Yes 0/1
Windy False → Yes 2/8 5/14
True → No* 3/6

11
witten&eibe
Bayesian (Statistical) modeling

 “Opposite” of 1R: use all the attributes


 Two assumptions: Attributes are
 equally important
 statistically independent (given the class value)
 I.e., knowing the value of one attribute says nothing about
the value of another
(if the class is known)

 Independence assumption is almost never correct!


 But … this scheme works well in practice

13
witten&eibe
Probabilities for weather data
Outlook Temperature Humidity Windy Play
Yes No Yes No Yes No Yes No Yes No
Sunny 2 3 Hot 2 2 High 3 4 False 6 2 9 5
Overcast 4 0 Mild 4 2 Normal 6 1 True 3 3
Rainy 3 2 Cool 3 1
Sunny 2/9 3/5 Hot 2/9 2/5 High 3/9 4/5 False 6/9 2/5 9/14 5/14
Overcast 4/9 0/5 Mild 4/9 2/5 Normal 6/9 1/5 True 3/9 3/5
Rainy 3/9 2/5 Cool 3/9 1/5

Outlook Temp Humidity Windy Play

Sunny Hot High False No

Sunny Hot High True No

Overcast Hot High False Yes

Rainy Mild High False Yes

Rainy Cool Normal False Yes

Rainy Cool Normal True No

Overcast Cool Normal True Yes

Sunny Mild High False No

Sunny Cool Normal False Yes

Rainy Mild Normal False Yes

Sunny Mild Normal True Yes

Overcast Mild High True Yes

Overcast Hot Normal False Yes

Rainy Mild High True No


14
witten&eibe
Probabilities for weather data
Outlook Temperature Humidity Windy Play
Yes No Yes No Yes No Yes No Yes No
Sunny 2 3 Hot 2 2 High 3 4 False 6 2 9 5
Overcast 4 0 Mild 4 2 Normal 6 1 True 3 3
Rainy 3 2 Cool 3 1
Sunny 2/9 3/5 Hot 2/9 2/5 High 3/9 4/5 False 6/9 2/5 9/14 5/14
Overcast 4/9 0/5 Mild 4/9 2/5 Normal 6/9 1/5 True 3/9 3/5
Rainy 3/9 2/5 Cool 3/9 1/5

Outlook Temp. Humidity Windy Play

Sunny Cool High True ?

 A new day: Likelihood of the two classes


For “yes” = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For “no” = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(“no”) = 0.0206 / (0.0053 + 0.0206) = 0.795

15
witten&eibe
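The likelihood computation on this slide can be reproduced with a few lines of Python; this is only a sketch of the arithmetic, with the conditional probabilities hard-coded from the table above for the new day (Outlook=Sunny, Temp=Cool, Humidity=High, Windy=True).

# Likelihoods read off the probability table.
like_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
like_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206

# Normalize the two likelihoods so they sum to 1.
p_yes = like_yes / (like_yes + like_no)
p_no  = like_no / (like_yes + like_no)
print(round(p_yes, 3), round(p_no, 3))   # 0.205 0.795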
Bayes’s rule
 Probability of event H given evidence E :

Pr[H | E] = Pr[E | H] Pr[H] / Pr[E]
 A priori probability of H : Pr[H ]
 Probability of event before evidence is seen

 A posteriori probability of H : Pr[ H | E ]


 Probability of event after evidence is seen

from Bayes “Essay towards solving a problem in the


doctrine of chances” (1763)
Thomas Bayes
Born: 1702 in London, England
Died: 1761 in Tunbridge Wells, Kent, England
16
witten&eibe
Naïve Bayes for classification

 Classification learning: what’s the probability of the class


given an instance?
 Evidence E = instance
 Event H = class value for instance

 Naïve assumption: evidence splits into parts (i.e.


attributes) that are independent

Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × … × Pr[En | H] × Pr[H] / Pr[E]

17
witten&eibe
Weather data example

Outlook Temp. Humidity Windy Play


Sunny Cool High True ?
Evidence E

Probability of class "yes":

Pr[yes | E] = Pr[Outlook = Sunny | yes]
              × Pr[Temperature = Cool | yes]
              × Pr[Humidity = High | yes]
              × Pr[Windy = True | yes]
              × Pr[yes] / Pr[E]
            = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
18
witten&eibe
The “zero-frequency problem”

 What if an attribute value doesn’t occur with every class


value?
(e.g. “Humidity = high” for class “yes”)
 Probability will be zero! Pr[ Humidity = High | yes] = 0
 A posteriori probability will also be zero! Pr[ yes | E ] = 0
(No matter how likely the other values are!)
 Remedy: add 1 to the count for every attribute value-class
combination (Laplace estimator)
 Result: probabilities will never be zero!
(also: stabilizes probability estimates)

19
witten&eibe
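A minimal sketch of the Laplace estimator described above, assuming nominal attributes stored in a list of dicts; the function and argument names are illustrative, not from the slides.

from collections import Counter

def laplace_prob(records, attr, value, class_attr, cls, k=1):
    """P(attr=value | class=cls) with add-k (Laplace) smoothing."""
    in_class = [r for r in records if r[class_attr] == cls]
    value_counts = Counter(r[attr] for r in in_class)
    n_values = len(set(r[attr] for r in records))  # distinct values of attr
    # Adding k to every count keeps the estimate strictly above zero.
    return (value_counts[value] + k) / (len(in_class) + k * n_values)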
*Modified probability estimates

 In some cases adding a constant different from 1 might


be more appropriate
 Example: attribute outlook for class yes

Sunny: (2 + µ/3) / (9 + µ)    Overcast: (4 + µ/3) / (9 + µ)    Rainy: (3 + µ/3) / (9 + µ)

 Weights don't need to be equal
   (but they must sum to 1)

Sunny: (2 + µp1) / (9 + µ)    Overcast: (4 + µp2) / (9 + µ)    Rainy: (3 + µp3) / (9 + µ)
20
witten&eibe
Missing values
 Training: instance is not included in
frequency count for attribute value-class
combination
 Classification: attribute will be omitted from
calculation
 Example: Outlook Temp. Humidity Windy Play
? Cool High True ?

Likelihood of “yes” = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238


Likelihood of “no” = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P(“yes”) = 0.0238 / (0.0238 + 0.0343) = 41%
P(“no”) = 0.0343 / (0.0238 + 0.0343) = 59%

21
witten&eibe
Numeric attributes
 Usual assumption: attributes have a normal or
Gaussian probability distribution (given the class)
 The probability density function for the normal
distribution is defined by two parameters:
 Sample mean:  µ = (1/n) Σᵢ xᵢ

 Standard deviation:  σ = sqrt( (1/(n−1)) Σᵢ (xᵢ − µ)² )

 Then the density function f(x) is

   f(x) = (1 / (√(2π) σ)) · e^( −(x−µ)² / (2σ²) )

 (Karl Gauss, 1777–1855, great German mathematician)
22
witten&eibe
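The density function above translates directly into code. A small sketch, with the sample statistics for temperature given class "yes" (µ = 73, σ = 6.2) taken from the next slide.

import math

def gaussian_density(x, mu, sigma):
    """Normal probability density f(x) with mean mu and standard deviation sigma."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Temperature given class "yes": mu = 73, sigma = 6.2 (weather data)
print(round(gaussian_density(66, 73, 6.2), 4))   # 0.034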
Statistics for
weather data
Outlook Temperature Humidity Windy Play
Yes No Yes No Yes No Yes No Yes No
Sunny 2 3 64, 68, 65, 71, 65, 70, 70, 85, False 6 2 9 5
Overcast 4 0 69, 70, 72, 80, 70, 75, 90, 91, True 3 3
Rainy 3 2 72, … 85, … 80, … 95, …
Sunny 2/9 3/5 µ =73 µ =75 µ =79 µ =86 False 6/9 2/5 9/14 5/14
Overcast 4/9 0/5 σ =6.2 σ =7.9 σ =10.2 σ =9.7 True 3/9 3/5
Rainy 3/9 2/5

 Example density value:


f(temperature = 66 | yes) = (1 / (√(2π) · 6.2)) · e^( −(66−73)² / (2 · 6.2²) ) = 0.0340

23
witten&eibe
Classifying a new day

 A new day: Outlook Temp. Humidity Windy Play

Sunny 66 90 true ?

Likelihood of “yes” = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036


Likelihood of “no” = 3/5 × 0.0291 × 0.0380 × 3/5 × 5/14 = 0.000136
P(“yes”) = 0.000036 / (0.000036 + 0.000136) = 20.9%
P(“no”) = 0.000136 / (0.000036 + 0.000136) = 79.1%

 Missing values during training are not included in


calculation of mean and standard deviation

24
witten&eibe
Naïve Bayes: discussion

 Naïve Bayes works surprisingly well (even if


independence assumption is clearly violated)
 Why? Because classification doesn’t require
accurate probability estimates as long as
maximum probability is assigned to correct class
 However: adding too many redundant attributes
will cause problems (e.g. identical attributes)
 Note also: many numeric attributes are not
normally distributed (→ kernel density
estimators)
26
witten&eibe
Naïve Bayes Extensions

 Improvements:
 select best attributes (e.g. with greedy search)
 often works as well or better with just a fraction
of all attributes

 Bayesian Networks

27
witten&eibe
Summary

 OneR – uses rules based on just one attribute

 Naïve Bayes – uses all attributes and Bayes' rule
to estimate the probability of the class given an
instance.
 Simple methods frequently work well, but …
 Complex methods can be better (as we will see)

28
Data Mining
Preprocessing
• Data quality
• Missing values imputation using Mean,
Median and k-Nearest Neighbor approach
• Distance Measure
Data Quality: Why Preprocess the Data?

• Measures for data quality: A multidimensional view


– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how trustable are the data?
– Interpretability: how easily the data can be
understood?

2
Major Tasks in Data Preprocessing
• Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
Integration of multiple databases, data cubes, or files
• Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
• Data transformation and data discretization
Normalization
Concept hierarchy generation

3
Data Quality
• Data quality is a major concern in Data Mining and
Knowledge Discovery tasks.
• Why: Almost all data mining algorithms induce knowledge
strictly from data.
• The quality of knowledge extracted highly depends on the
quality of data.
• There are two main problems in data quality:-
– Missing data: The data not present.
– Noisy data: The data present but not correct.
• Missing/Noisy data sources:-
– Hardware failure.
– Data transmission error.
– Data entry problem.
– Refusal of respondents to answer certain questions.
Effect of Noisy Data on Results Accuracy

Discover only those rules which have support (frequency) >= 2.

Training data:
age    income  student  buys_computer
<=30   high    yes      yes
<=30   high    no       yes
>40    medium  yes      no
>40    medium  no       no
>40    low     yes      yes
31…40          no       yes
31…40  medium  yes      yes

Rules discovered by data mining:
• If age <= 30 and income = 'high' then buys_computer = 'yes'
• If age > 40 and income = 'medium' then buys_computer = 'no'

Testing data (or actual data):
age    income  student  buys_computer
<=30   high    no       ?
>40    medium  yes      ?
31…40  medium  yes      ?

Due to the missing value in the training dataset, the accuracy of prediction decreases and becomes 66.7%.
Imputation of Missing Data (Basic)
• Imputation is a term that denotes a procedure that
replaces the missing values in a dataset by some plausible
values
– i.e. by considering relationship among correlated values
among the attributes of the dataset.

Attribute 1  Attribute 2  Attribute 3  Attribute 4
20           cool         high         false
             cool         high         true
20           cool         high         true
20           mild         low          false
30           cool         normal       false
10           mild         high         true

If we consider only {attribute#2}, the value "cool" appears in 4 records.
Probability of imputing value (20) = 66.67%
Probability of imputing value (30) = 33.33%
Imputation of Missing Data (Basic)
Attribute 1  Attribute 2  Attribute 3  Attribute 4
20           cool         high         false
             cool         high         true
20           cool         high         true
20           mild         low          false
30           cool         normal       false
10           mild         high         true

For {attribute#4}, the value "true" appears in 3 records.
Probability of imputing value (20) = 50%
Probability of imputing value (10) = 50%

For {attribute#2, attribute#3}, the value pair {"cool", "high"} appears in only 2 records.
Probability of imputing value (20) = 100%
Randomness of Missing Data
• Missing data randomness is divided into three classes.

1. Missing completely at random (MCAR):- It occurs


when the probability of instance (case) having missing
value for an attribute does not depend on either the
known attribute values or missing data attribute.
2. Missing at random (MAR):- It occurs when the
probability of instance (case) having missing value for an
attribute depends on the known attribute values, but not
on the missing data attribute.
3. Not missing at random (NMAR):- When the
probability of an instance having a missing value for an
attribute could depend on the value of that attribute.
Methods of Treating Missing Data
• Ignoring and discarding data:- There are two main
ways to discard data with missing values.
– Discard all those records which have missing data also
called as discard case analysis.
– Discarding only those attributes which have high level
of missing data.
• Imputation using Mean/Median or Mode:- One of the
most frequently used methods (statistical technique).
– Replace (numeric continuous) type attribute missing
values using the mean/median. (The median is robust
against noise).
– Replace (discrete) type attribute missing values using
the mode (MOD).
Methods of Treating Missing Data
• Replace missing values using
prediction/classification model:-
– Advantage:- it considers relationship among the known
attribute values and the missing values, so the
imputation accuracy is very high.
– Disadvantage:- If no correlation exists between some
missing attribute values and the known attribute values,
the imputation can't be performed.
– (Alternative approach):- Use hybrid combination of
Prediction/Classification model and Mean/MOD.
• First try to impute missing value using prediction/classification
model, and then Median/MOD.
– We will study more about this topic in Association Rules
Mining.
Methods of Treating Missing Data
• K-Nearest Neighbor (k-NN) approach (Best
approach):-
– k-NN imputes the missing attribute values on the basis
of nearest K neighbor. Neighbors are determined on
the basis of distance measure.
– Once K neighbors are determined, missing value are
imputed by taking mean/median or MOD of known
attribute values of missing attribute.
– Pseudo-code/analysis after studying distance measure.

(Figure: a record with a missing value compared against the other dataset records to find its nearest neighbors.)

Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity
– Numerical measure of how different are two data
objects
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity
Distance Measures
• Remember K-Nearest Neighbor are determined on the
bases of some kind of “distance” between points.

• Two major classes of distance measure:


1. Euclidean: based on the position of points in some k-dimensional space.
2. Non-Euclidean: not related to position or space.
Scales of Measurement
• Applying a distance measure largely depends on the type
of input data
• Major scales of measurement:
1. Nominal Data (aka Nominal Scale Variables)

• Typically classification data, e.g. m/f


• no ordering, e.g. it makes no sense to state that M > F
• Binary variables are a special case of Nominal scale variables.

2. Ordinal Data (aka Ordinal Scale)


• ordered but differences between values are not important
• e.g., political parties on left to right spectrum given labels 0, 1, 2
• e.g., Likert scales, rank your degree of satisfaction on a scale of 1..5
• e.g., restaurant ratings
Scales of Measurement
• Applying a distance function largely depends on the type
of input data
• Major scales of measurement:
3. Numeric type Data (aka interval scaled)
• Ordered and equal intervals. Measured on a linear scale.
• Differences make sense
• e.g., temperature (C,F), height, weight, age, date
Scales of Measurement
• Only certain operations can be performed on
certain scales of measurement.

Nominal Scale:  1. Equality   2. Count
Ordinal Scale:  3. Rank  (cannot quantify the difference)
Interval Scale: 4. Quantify the difference
Axioms of a Distance Measure
• d is a distance measure if it is a function
from pairs of points to reals such that:
1. d(x,x) = 0.
2. d(x,y) = d(y,x).
3. d(x,y) > 0 for x ≠ y.
Some Euclidean Distances
• L2 norm (also common or Euclidean distance):

d(i, j) = √( |xi1 − xj1|² + |xi2 − xj2|² + … + |xip − xjp|² )

– The most common notion of “distance.”

• L1 norm (also Manhattan distance)

d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|
– distance if you had to travel along coordinates only.
Examples L1 and L2 norms

x = (5,5), y = (9,8)

L2-norm: dist(x,y) = √(4² + 3²) = 5
L1-norm: dist(x,y) = 4 + 3 = 7
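The three norms can be written compactly in Python; a small sketch using the points from the example above (function names are my own).

import math

def l1(x, y):
    # Manhattan distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def l2(x, y):
    # Euclidean distance.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linf(x, y):
    # L-infinity norm: maximum coordinate difference.
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (5, 5), (9, 8)
print(l1(x, y), l2(x, y), linf(x, y))   # 7 5.0 4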
Another Euclidean Distance
• L∞ norm : d(x,y) = the maximum of the
differences between x and y in any dimension.
Example: for x = (5,5) and y = (9,8), dist(x,y) = max(4, 3) = 4.
Data Matrix and Dissimilarity Matrix
Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Dissimilarity Matrix (with Euclidean Distance)
     x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1   0
x4   4.24  1     5.39  0

21
Example: Minkowski Distance
Dissimilarity Matrices
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

Manhattan (L1)
L1   x1   x2   x3   x4
x1   0
x2   5    0
x3   3    6    0
x4   6    1    7    0

Euclidean (L2)
L2   x1    x2   x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1  0
x4   4.24  1    5.39  0

Supremum (L∞)
L∞   x1   x2   x3   x4
x1   0
x2   3    0
x3   2    5    0
x4   3    1    5    0
22
Proximity Measure for Nominal
Attributes

• Can take 2 or more states, e.g., red, yellow, blue,


green (generalization of a binary attribute)
• Method 1: Simple matching
– m: # of matches, p: total # of variables

d(i, j) = (p − m) / p
• Method 2: Use a large number of binary attributes
– creating a new binary attribute for each of the M
nominal states

23
Non-Euclidean Distances

• Jaccard measure for binary vectors

• Cosine measure = angle between vectors


from the origin to the points in question.

• Edit distance = number of inserts and deletes


to change one string into another.
Jaccard Measure
• A note about Binary variables first
– Symmetric binary variable
• If both states are equally valuable and carry the same weight,
that is, there is no preference on which outcome should be
coded as 0 or 1.
• Like “gender” having the states male and female
– Asymmetric binary variable:
• If the outcomes of the states are not equally important, such as
the positive and negative outcomes of a disease test.
• We should code the rarest one by 1 (e.g., HIV positive), and the
other by 0 (HIV negative).
– Given two asymmetric binary variables, the agreement
of two 1s (a positive match) is then considered more
important than that of two 0s (a negative match).
Jaccard Measure
• A contingency table for binary data:

                Object j
                1      0      sum
Object i   1    a      b      a + b
           0    c      d      c + d
          sum   a + c  b + d  p

• Simple matching coefficient:
    d(i, j) = (b + c) / (a + b + c + d)

• Jaccard coefficient:
    d(i, j) = (b + c) / (a + b + c)
Jaccard Measure Example
• Example

Name Fever Cough Test-1 Test-2 Test-3 Test-4


Jack Y N P N N N
Mary Y N P N P N
Jim Y P N N N N
– All attributes are asymmetric binary
– let the values Y and P be set to 1, and the value N be set to 0

Using d(i, j) = (b + c) / (a + b + c):

d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
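A sketch of the asymmetric-binary (Jaccard) dissimilarity for the example above, with Y and P coded as 1 and N as 0; the function name and list layout are my own.

def jaccard_dissimilarity(u, v):
    """d(i,j) = (b + c) / (a + b + c), ignoring negative (0,0) matches."""
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

jack = [1, 0, 1, 0, 0, 0]   # Y N P N N N
mary = [1, 0, 1, 0, 1, 0]   # Y N P N P N
jim  = [1, 1, 0, 0, 0, 0]   # Y P N N N N
print(round(jaccard_dissimilarity(jack, mary), 2))   # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))    # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))    # 0.75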
Cosine Measure
• Think of a point as a vector from the origin
(0,0,…,0) to its location.

• Two points’ vectors make an angle, whose cosine


is the normalized dot-product of the vectors.
– Example: p1·p2 = 2; |p1| = |p2| = √3
– cos(θ) = 2/3; θ is about 48 degrees
– dist(p1, p2) = θ = arccos( p1·p2 / (|p1| |p2|) )
Cosine Similarity
• A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.

• Other vector objects: gene features in micro-arrays, …


• Applications: information retrieval, biologic taxonomy, gene feature
mapping, ...
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency
vectors), then
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d

29
Example: Cosine Similarity

• cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),

where • indicates the vector dot product and ||d|| is the length of vector d

• Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1•d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 0.94

30
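A sketch of the cosine computation for the two document vectors above; the helper name is my own.

import math

def cosine(d1, d2):
    # Normalized dot product of the two term-frequency vectors.
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine(d1, d2), 2))   # 0.94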
Edit Distance
• The edit distance of two strings is the number
of inserts and deletes of characters needed to
turn one into the other.

• Equivalently, d(x,y) = |x| + |y| -2|LCS(x,y)|.


– LCS = longest common subsequence = longest
string obtained both by deleting from x and
deleting from y.
Example

• x = abcde ; y = bcduve.

• LCS(x,y) = bcde.
• D(x,y) = |x| + |y| - 2|LCS(x,y)| = 5 + 6 –2*4
= 3.

• What left?
• Normalize it in the range [0-1]. We will study
normalization formulas later.
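A sketch of the insert/delete edit distance via the LCS formula above; the LCS here is a standard dynamic-programming routine, not taken from the slides.

def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (dynamic programming)."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, cx in enumerate(x, 1):
        for j, cy in enumerate(y, 1):
            if cx == cy:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]

def edit_distance(x, y):
    """Number of inserts and deletes: |x| + |y| - 2*|LCS(x, y)|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(edit_distance("abcde", "bcduve"))   # 3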
Back to k-Nearest Neighbor (Pseudo-code)
• Missing values Imputation using k-NN.
• Input: Dataset (D ), size of K

• for each record (x ) with at least one missing value


in D .
– for each data object (y ) in D .
• Compute the distance between x and y.
• Save the distance and the object y in a distance array S.

– Sort the array S in ascending order of distance (nearest first)


– Pick the top K data objects from S
• Impute the missing attribute value (s) of x on the basic of
known values of S (use Mean/Median or MOD).
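A sketch of the k-NN imputation loop above for a single numeric target attribute, assuming the other records are complete; all names (knn_impute, a1/a2/a3) and the tiny dataset are hypothetical.

import math
from statistics import mean

def knn_impute(record, complete_records, target, k=3):
    """Impute record[target] from the k nearest complete records,
    using Euclidean distance over the attributes known in `record`."""
    known = [a for a, v in record.items() if a != target and v is not None]
    def dist(other):
        return math.sqrt(sum((record[a] - other[a]) ** 2 for a in known))
    # Sort by distance, smallest first, and keep the k nearest neighbours.
    neighbours = sorted(complete_records, key=dist)[:k]
    # Impute with the mean of the neighbours' known target values.
    return mean(n[target] for n in neighbours)

data = [{"a1": 20, "a2": 1.0, "a3": 5.0},
        {"a1": 21, "a2": 0.9, "a3": 6.0},
        {"a1": 40, "a2": 3.0, "a3": 9.0}]
query = {"a1": 22, "a2": 1.1, "a3": None}
print(knn_impute(query, data, "a3", k=2))   # 5.5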
K-Nearest Neighbor Drawbacks
• The major drawbacks of this approach are the
– Choice of selecting exact distance functions.
– Considering all attributes when attempting to retrieve
the similar type of examples.
– Searching through all the dataset for finding the same
type of instances.
– Algorithm Cost: ?
Noisy Data
• Noise: Random error, Data Present but not correct.
– Data Transmission error
– Data Entry problem

• Removing noise
– Data Smoothing (rounding, averaging within a window).
– Clustering/merging and Detecting outliers.

• Data Smoothing
– First sort the data and partition it into (equi-depth) bins.
– Then smooth the values in each bin using smoothing by bin means,
smoothing by bin medians, smoothing by bin boundaries, etc.
Noisy Data (Binning Methods)
Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
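A sketch of equi-depth binning with smoothing by bin means, reproducing the price example above; the function names are my own.

def equi_depth_bins(values, depth):
    # Sort and split into consecutive bins of equal depth.
    values = sorted(values)
    return [values[i:i + depth] for i in range(0, len(values), depth)]

def smooth_by_means(bins):
    # Replace every value in a bin by the (rounded) bin mean.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 4)
print(smooth_by_means(bins))   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]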
Noisy Data (Clustering)
• Outliers may be detected by clustering, where similar
values are organized into groups or “clusters”.

• Values which fall outside of the set of clusters may be
considered outliers.
CSC479
Data Mining
Lecture # 8

Mining Frequent Patterns


Associations and Correlations
(Ch # 6)
Introduction
Motivation

{butter, bread, milk, sugar}


{butter, flour, milk, sugar}
{butter, eggs, milk, salt} DB of Sales Transactions
{eggs}
{butter, flour, milk, salt, sugar}

Market basket analysis


• Which products are frequently purchased together?
• Applications
• Improvement of store layouts
• Cross marketing
• Attached mailings/add-on sales
2
Market-Basket Data
 A large set of items, e.g., things sold in
a supermarket.
 A large set of baskets, each of which is
a small set of the items, e.g., the things
one customer buys on one day.

3
Market-Baskets – (2)
 Really, a general many-to-many
mapping (association) between two
kinds of things, where the one (the
baskets) is a set of the other (the
items)
 But we ask about connections among
“items,” not “baskets.”
 The technology focuses on common
events, not rare events (“long tail”). 4
Frequent Itemsets
• Given a set of transactions, find combinations of
items (itemsets) that occur frequently

Market-Basket transactions
Items: {Bread, Milk, Diaper, Beer, Eggs,
Coke}
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
{Bread}: 4
3 Milk, Diaper, Beer, Coke {Milk} : 4
4 Bread, Milk, Diaper, Beer {Diaper} : 4
{Beer}: 3
5 Bread, Milk, Diaper, Coke
{Diaper, Beer} : 3
{Milk, Bread} : 3
Applications – (1)
 Items = products; baskets = sets of
products someone bought in one trip to
the store.

 Example application: given that many


people buy beer and diapers together:
 Run a sale on diapers; raise price of beer.
 Only useful if many buy diapers & beer.
6
Applications – (2)
 Baskets = Web pages; items = words.

 Example application: Unusual words


appearing together in a large number of
documents, e.g., “Brad” and “Angelina,”
may indicate an interesting relationship.

7
Applications – (3)
 Baskets = sentences; items =
documents containing those sentences.

 Example application: Items that appear


together too often could represent
plagiarism.
 Notice items do not have to be “in”
baskets.
8
Definition: Frequent Itemset
 Itemset
   A collection of one or more items
   • Example: {Milk, Bread, Diaper}
 k-itemset
   • An itemset that contains k items
 Support (σ)
   Count: frequency of occurrence of an itemset
     E.g. σ({Milk, Bread, Diaper}) = 2
   Fraction: fraction of transactions that contain an itemset
     E.g. s({Milk, Bread, Diaper}) = 40%
 Frequent Itemset
   An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Some more definitions
 Closed itemset: An itemset X is closed in a data set
D if there exists no proper super-itemset Y such that
Y has the same support count as X in D.
 closed frequent itemset: An itemset X is a closed
frequent itemset in set D if X is both closed and
frequent in D.
 Maximal frequent itemset : An itemset X is a
maximal frequent itemset (or max-itemset) in
a data set D if X is frequent, and there exists no
super-itemset Y such that X⊂ Y and Y is frequent in
D.
Mining Frequent Itemsets task
 Input: A set of transactions T, over a set of items I
 Output: All itemsets with items in I having
 support ≥ minsup threshold
 Problem parameters:
 N = |T|: number of transactions
 d = |I|: number of (distinct) items
 w: max width of a transaction
 Number of possible itemsets? M = 2^d

 Scale of the problem:


 WalMart sells 100,000 items and can store billions of
baskets.
 The Web has billions of words and many billions of pages.
Initial Definition of Association
Rules (ARs) Mining
 Association rules define relationship of the
form:
A→B
 Read as A implies B, where A and B are sets
of binary valued attributes represented in a
data set.
 Association Rule Mining (ARM) is then the
process of finding all the ARs in a given DB.
Association Rule: Basic Concepts
 Given: (1) database of transactions, (2) each
transaction is a list of items (purchased by a
customer in a visit)
 Find: all rules that correlate the presence of one
set of items with that of another set of items
 E.g., 98% of students who study Databases and C++
also study Algorithms
 Applications
 Home Electronics ⇒ * (What other products should
the store stocks up?)
 Attached mailing in direct marketing
 Web page navigation in Search Engines (first page a->
page b)
 Text mining if IT companies -> Microsoft
Some Notation
D = A data set comprising n records and m
binary valued attributes.

I = The set of m attributes, {i1,i2, … ,im},


represented in D.

Itemset = Some subset of I. Each record


in D is an itemset.
In depth Definition of ARs Mining
 Association rules define relationship of the
form:
A→B
 Read as A implies B
 Such that A⊂I, B⊂I, A∩B=∅ (A and B are
disjoint) and A∪B⊆I.
 In other words an AR is made up of an
itemset of cardinality 2 or more.
Association Rules Measurement
The most commonly used “interestingness”
measures are:
1. Support
2. Confidence
Itemset Support
 Support: A measure of the frequency
with which an itemset occurs in a DB.
supp(A) = (# records that contain A) / n
 If an itemset has support higher than
some specified threshold we say that
the itemset is supported or frequent
(some authors use the term large).
 Support threshold is normally set
reasonably low (say) 1%.
Confidence
 Confidence: A measure, expressed as
a ratio, of the support for an AR
compared to the support of its
antecedent.
conf(A→B) = supp(A∪B) / supp(A)
 We say that we are confident in a rule if
its confidence exceeds some threshold
(normally set reasonably high, say,
80%).
Rule Measures: Support and Confidence
Customer Customer  Find all the rules X & Y ⇒ Z with
buys both buys Bread
minimum confidence and support
 support, s, probability that a
transaction contains {X & Y & Z}
 confidence, c, conditional probability
that a transaction having {X & Y} also
contains Z
Customer
buys Butter

Let minimum support 50%, and


Transaction ID Items Bought
2000 A,B,C
minimum confidence 50%,
1000 A,C we have
4000 A,D – A ⇒ C (50%, 66.6%)
5000 B,E,F – C ⇒ A (50%, 100%)
BRUTE FORCE
List all possible combinations in an array.
For each record:
1. Find all combinations.
2. For each combination, index into the array and increment its support by 1.
Then generate rules.

Support counts:
a 6      b 6      ab 3     c 6      ac 3     bc 3     abc 1
d 6      ad 6     bd 3     abd 1    cd 3     acd 1    bcd 1    abcd 0
e 6      ae 3     be 3     abe 1    ce 3     ace 1    bce 1    abce 0
de 3     ade 1    bde 1    abde 0   cde 1    acde 0   bcde 0   abcde 0

Support threshold = 5% (count of 1.5), where total # of trans = 30

Frequent Sets (F):
ab(3)  ac(3)  bc(3)
ad(6)  bd(3)  cd(3)
ae(3)  be(3)  ce(3)
de(3)

Rules: from ab we can develop two rules:
a→b  conf = 3/6 = 50%
b→a  conf = 3/6 = 50%
Etc.
BRUTE FORCE
Advantages:
1) Very efficient for data sets with small numbers of
attributes (<20).

Disadvantages:
1) Given 20 attributes, the number of combinations is 2^20 − 1 =
1,048,575. Therefore array storage requirements will be
4.2MB.
2) Given a data sets with (say) 100 attributes it is likely that
many combinations will not be present in the data set ---
therefore store only those combinations present in the
dataset!
Mining Association Rules—An Example
Transaction ID  Items Bought
2000            A,B,C
1000            A,C
4000            A,D
5000            B,E,F

Min. support 50%
Min. confidence 50%

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A,C}             50%

For rule A ⇒ C:
support = support({A ∪ C}) = 50%
confidence = support({A ∪ C}) / support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent
Mining Frequent Itemsets: the Key Step
 Find the frequent itemsets: the sets of items
that have minimum support
 A subset of a frequent itemset must also be a
frequent itemset
• i.e., if {AB} is a frequent itemset, both {A} and {B}
should be a frequent itemset
 Iteratively find frequent itemsets with
cardinality from 1 to k (k-itemset)
 Use the frequent itemsets to generate
association rules.
The Apriori Algorithm — Example
Database D:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1:          L1:
itemset  sup          itemset  sup
{1}      2            {1}      2
{2}      3            {2}      3
{3}      3            {3}      3
{4}      1            {5}      3
{5}      3

C2 (from L1), scan D:    L2:
itemset  sup             itemset  sup
{1 2}    1               {1 3}    2
{1 3}    2               {2 3}    2
{1 5}    1               {2 5}    3
{2 3}    2               {3 5}    2
{2 5}    3
{3 5}    2

C3: {2 3 5} → scan D → L3: {2 3 5} with sup 2
The Apriori Algorithm
 Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk !=∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1
that are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
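A compact Python sketch of the Apriori loop in the pseudo-code above, intended only for small transaction databases; min_support is a support count, and the candidate generation uses set unions plus subset pruning rather than a literal L_k self-join.

from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [set(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count support of the candidates in this pass.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Generate (k+1)-candidates from frequent k-itemsets,
        # pruning any candidate with an infrequent k-subset.
        k += 1
        keys = list(level)
        current = {a | b for a in keys for b in keys if len(a | b) == k}
        current = {c for c in current
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# The example database from the previous slide, min support count = 2.
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(db, min_support=2))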
Important Details of Apriori
 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 How to count supports of candidates?
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
 Pruning:
• acde is removed because ade is not in L3
 C4={abcd}
28
29
CSC479
Data Mining
Lecture # 11

Classification
Basic Concepts
Decision Trees

(Ch # 8: Data Mining-Concepts and Techniques


by Han and Kamber)
Catching tax-evasion
Tax-return data for year 2011:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

A new tax return for 2012. Is this a cheating tax return?

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

An instance of the classification problem: learn a method for discriminating


between records of different classes (cheaters vs non-cheaters)
2
What is classification?
 Classification is the task of learning a target function f
that maps attribute set x to one of the predefined class labels y

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

One of the attributes is the class attribute; in this case: Cheat
Two class labels (or classes): Yes (1), No (0)
3
What is classification (cont…)
 The target function f is known as a
classification model

 Descriptive modeling: Explanatory tool to


distinguish between objects of different
classes (e.g., understand why people
cheat on their taxes)

 Predictive modeling: Predict a class of a


previously unseen record 4
Examples of Classification Tasks
 Predicting tumor cells as benign or malignant

 Classifying credit card transactions as


legitimate or fraudulent

 Categorizing news stories as finance,


weather, entertainment, sports, etc

 Identifying spam email, spam web pages,


adult content

 Understanding if a web query has commercial


intent or not 5
General approach to classification
 Training set consists of records with known class
labels

 Training set is used to build a classification model

 A labeled test set of previously unseen data


records is used to evaluate the quality of the
model.

 The classification model is applied to new records


with unknown class labels
6
Illustrating Classification Task
Tid Attrib1 Attrib2 Attrib3 Class Learning
1 Yes Large 125K No
algorithm
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
Induction
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No Learn
8 No Small 85K Yes Model
9 No Medium 75K No
10 No Small 90K Yes
Model
10

Training Set
Apply
Tid Attrib1 Attrib2 Attrib3 Class Model
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ? Deduction
14 No Small 95K ?
15 No Large 67K ?
10

Test Set 7
Evaluation of classification models
 Counts of test records that are correctly
(or incorrectly) predicted by the
classification model Predicted Class
 Confusion matrix Class = 1 Class = 0

Actual Class
Class = 1 f11 f10
Class = 0 f01 f00

Accuracy = (# correct predictions) / (total # of predictions) = (f11 + f00) / (f11 + f10 + f01 + f00)

Error rate = (# wrong predictions) / (total # of predictions) = (f10 + f01) / (f11 + f10 + f01 + f00)
8
Classification Techniques
 Decision Tree based Methods
 Rule-based Methods
 Memory based reasoning
 Neural Networks
 Naïve Bayes and Bayesian Belief Networks
 Support Vector Machines

9
Decision Trees
 Decision tree
 A flow-chart-like tree structure
 Internal node denotes a test on an attribute
 Branch represents an outcome of the test
 Leaf nodes represent class labels or class
distribution

11
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes, test outcomes, and class labels at the leaves)

Refund?
  Yes → NO
  No  → MarSt?
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
          Married → NO
12
Another Example of Decision Tree

MarSt?
  Married → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                                < 80K → NO
                                > 80K → YES

(Same training data as before.)

There could be more than one tree that fits the same data!
13
Decision Tree Classification Task

Tid Attrib1 Attrib2 Attrib3 Class


Tree
1 Yes Large 125K No Induction
2 No Medium 100K No algorithm
3 No Small 70K No
4 Yes Medium 120K No
Induction
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No Learn
8 No Small 85K Yes Model
9 No Medium 75K No
10 No Small 90K Yes
Model
10

Training Set
Apply Decision
Model
Tid Attrib1 Attrib2 Attrib3 Class Tree
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
Deduction
14 No Small 95K ?
15 No Large 67K ?
10

Test Set 14
Apply Model to Test Data

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch matching each attribute value of the test record:

Refund?
  Yes → NO
  No  → MarSt?
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
          Married → NO

Refund = No, then MarSt = Married → assign Cheat to "No"
20
Decision Tree Classification Task

Tid Attrib1 Attrib2 Attrib3 Class


Tree
1 Yes Large 125K No Induction
2 No Medium 100K No algorithm
3 No Small 70K No
4 Yes Medium 120K No
Induction
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No Learn
8 No Small 85K Yes Model
9 No Medium 75K No
10 No Small 90K Yes
Model
10

Training Set
Apply
Decision
Model
Tid Attrib1 Attrib2 Attrib3 Class
Tree
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
Deduction
14 No Small 95K ?
15 No Large 67K ?
10

21
Test Set
Tree Induction
 Finding the best decision tree is NP-hard

 Greedy strategy.
 Split the records based on an attribute test that
optimizes certain criterion.

 Many Algorithms:
 Hunt’s Algorithm (one of the earliest)
 CART
 ID3, C4.5
 SLIQ, SPRINT

22
Which Attribute is the Best Classifier?
The central choice in the ID3 algorithm is selecting
which attribute to test at each node in the tree

We would like to select the attribute which is most


useful for classifying examples

For this we need a good quantitative measure

For this purpose a statistical property, called


information gain is used
23
Which Attribute is the Best Classifier?
Definition of Entropy

- In order to define information gain precisely, we


begin by defining entropy

- Entropy is a measure commonly used in


information theory.

- Entropy characterizes the impurity of an


arbitrary collection of examples

24
Entropy
 Entropy (D)
 Entropy of data set D is denoted by H(D)
 Cis are the possible classes
 pi = fraction of records from D that have class C

H(D) = − Σᵢ pᵢ log₂ pᵢ

25
Entropy Examples
 Example:
 10 records have class A
 20 records have class B
 30 records have class C
 40 records have class D
 Entropy = −[(.1 log₂ .1) + (.2 log₂ .2) + (.3 log₂ .3) + (.4 log₂ .4)]
 Entropy = 1.846

26
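A sketch verifying the entropy calculation above (logarithms are base 2); the function name is my own.

import math

def entropy(class_counts):
    """H(D) = -sum p_i * log2(p_i) over the class proportions."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

print(round(entropy([10, 20, 30, 40]), 3))   # 1.846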
Splitting Criterion
 Example:
Two classes, +/-
100 records overall (50 +s and 50 -s)
A and B are two binary attributes
• Records with A=0: 48+, 2-
Records with A=1: 2+, 48-
• Records with B=0: 26+, 24-
Records with B=1: 24+, 26-
Splitting on A is better than splitting on B
• A does a good job of separating +s and -s
• B does a poor job of separating +s and -s
27
Which Attribute is the Best Classifier?
Information Gain
The expected information needed to classify a tuple
in D is
= Entropy
How much more information would we still need
(after partitioning at attribute A) to arrive at an exact
classification? This amount is measured by H(D, A), the
expected (weighted average) entropy of the partitions of D induced by A.

Info Gain (D, A) = H(D) − H(D, A)


In general, we write Gain (D, A), where D is the
collection of examples & A is an attribute 28
Information Gain
 Gain of an attribute split: compare the
impurity of the parent node with the average
impurity of the child nodes
InfoGain = H(parent) − Σⱼ₌₁..ᵥ ( |Dⱼ| / |D| ) · H(Dⱼ)

 Maximizing the gain ⇔ Minimizing the weighted


average impurity measure of children nodes

29
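A sketch of the information-gain formula above, applied to the two-class splitting example from the earlier slide (50 + and 50 − records, binary attributes A and B); the function names are my own.

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, child_counts_list):
    """InfoGain = H(parent) - sum_j (|D_j|/|D|) * H(D_j)."""
    total = sum(parent_counts)
    weighted = sum((sum(child) / total) * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

print(round(info_gain([50, 50], [[48, 2], [2, 48]]), 3))    # split on A ≈ 0.758
print(round(info_gain([50, 50], [[26, 24], [24, 26]]), 3))  # split on B ≈ 0.001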
Examples Constructing Decision Tree

30
Examples Constructing Decision Tree

 So the attribute Age will be placed at root


level.
 For placement at second level we find
InfoGain for all the remaining attributes
under every branch of the parent node.

31
DECISION TREES
Which Attribute is the Best Classifier?: Use
Information Gain to develop the complete tree

32
CSC479
Data Mining
Lecture # 15

Rule Based Classification

(Ch # 8.4)
Rule Generation from Decision Tree

 Decision tree classifiers are a popular method of
classification because they are easy to understand
 However, decision tree can become large and difficult


to interpret

 In comparison with decision tree, the IF-THEN rules


may be easier for humans to understand, particularly
if the decision tree is very large
Rule Generation from Decision Tree
 Rules are easier to understand than large trees
 One rule is created for each path from the root to a
leaf
 Each attribute-value pair along a path forms a
conjunction: the leaf holds the class prediction
Rule Generation from Decision Tree
 Example: Rule extraction from our buys_computer
decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN
buys_computer = yes
IF age = young AND credit_rating = fair THEN
buys_computer = no
Rule-Based Classifier
 Classify records by using a collection of
“if…then…” rules
 Rule: (Condition) → y
 where
• Condition is a conjunction of attribute tests
• y is the class label
 LHS: rule antecedent or condition
 RHS: rule consequent
 Examples of classification rules:
• (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
• (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds


R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) →
Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier

 A rule r covers an instance x if the attributes of


the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
Rule Coverage and Accuracy
 Coverage of a rule:
   Fraction of records that satisfy the antecedent of a rule

 Accuracy of a rule:
   Fraction of records that satisfy both the antecedent and consequent of a rule

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Status=Single) → No
Coverage = 40%, Accuracy = 50%
How does Rule-based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?

A lemur triggers rule R3, so it is classified as a mammal


A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
Characteristics of Rule-Based Classifier
 Mutually exclusive rules
 Classifier contains mutually exclusive rules if the rules are
independent of each other
 Every record is covered by at most one rule
 we cannot have conflicting rules because no two rules will be
triggered by the same tuple

 Exhaustive rules
 Classifier has exhaustive coverage if it accounts for every
possible combination of attribute values
 Each record is covered by at least one rule
 There is one rule for each possible attribute-value
combination, so that the set of rules does not require a
default rule
From Decision Trees To Rules
Decision tree:

Refund?
  Yes → NO
  No  → Marital Status?
          Single, Divorced → Taxable Income?
                                < 80K → NO
                                > 80K → YES
          Married → NO

Classification Rules:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Rules are mutually exclusive and exhaustive
Rule set contains as much information as the
tree
Rules Can Be Simplified
(Using the same decision tree and ten-record training data shown earlier.)
Initial Rule: (Refund=No) ∧ (Status=Married) → No


Simplified Rule: (Status=Married) → No
Effect of Rule Simplification
 Rules are no longer mutually exclusive
 A record may trigger more than one rule
 Solution?
• Ordered rule set
• Unordered rule set – use voting schemes

 Rules are no longer exhaustive


 A record may not trigger any rules
 Solution?
• Use a default class
Ordered Rule Set
 Rules are rank ordered according to their priority
 An ordered rule set is known as a decision list
 When a test record is presented to the classifier
 It is assigned to the class label of the highest ranked rule it has
triggered
 If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds


R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
CSC479
Data Mining
Lecture # 18

Clustering

(Ch # 10)
The Problem of Clustering
 Given a set of points, with a notion of
distance between points, group the points
into some number of clusters, so that
members of a cluster are in some sense as
nearby as possible.
 Clustering is unsupervised classification: no
predefined classes.
 Formally, clustering is the process of
grouping data points such that intra-cluster
distance is minimized and inter-cluster
distance is maximized.
2
Types of Clustering
 A clustering is a set of clusters
 Important distinction between hierarchical
and partitional sets of clusters
 Partitional Clustering
• A division data objects into non-overlapping
subsets (clusters) such that each data object is in
exactly one subset

 Hierarchical clustering
• A set of nested clusters organized as a hierarchical
tree
 Other distinctions – coming slides
3
Partitional Clustering

Original Points A Partitional Clustering

4
Hierarchical Clustering

(Figure: points p1–p4 shown as a traditional hierarchical clustering and as the corresponding dendrogram.)

5
Other Distinctions Between Sets of Clusters
 Exclusive versus non-exclusive
 In non-exclusive clusterings, points may belong to multiple
clusters.
 Can represent multiple classes or ‘border’ points

 Fuzzy versus non-fuzzy


 In fuzzy clustering, a point belongs to every cluster with some
weight between 0 and 1
 Weights must sum to 1
 Probabilistic clustering has similar characteristics

 Partial versus complete


 In some cases, we only want to cluster some of the data

 Heterogeneous versus homogeneous


 Cluster of widely different sizes, shapes, and densities
6
Types of Clusters
 Well-separated clusters

 Center-based clusters

 Contiguous clusters

 Density-based clusters

 Property or Conceptual

 Described by an Objective Function


7
Types of Clusters: Well-Separated
 Well-Separated Clusters:
 A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than to
any point not in the cluster.

3 well-separated clusters
8
Types of Clusters: Center-Based

 Center-based
 A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
 The center of a cluster is often a centroid, the average of all the
points in the cluster, or a medoid, the most “representative”
point of a cluster

4 center-based clusters
9
Types of Clusters: Density-Based

 Density-based
 A cluster is a dense region of points, which is separated by low-
density regions, from other regions of high density.
 Used when the clusters are irregular or intertwined, and when
noise and outliers are present.

6 density-based clusters
10
Data Structures Used
 Data matrix (n objects × p attributes):

   [ x11  ...  x1f  ...  x1p ]
   [ ...  ...  ...  ...  ... ]
   [ xi1  ...  xif  ...  xip ]
   [ ...  ...  ...  ...  ... ]
   [ xn1  ...  xnf  ...  xnp ]

 Similarity (dissimilarity) matrix (n × n, lower triangular):

   [ 0                             ]
   [ d(2,1)  0                     ]
   [ d(3,1)  d(3,2)  0             ]
   [ :       :       :             ]
   [ d(n,1)  d(n,2)  ...  ...  0   ]
11
Partitioning (Centeroid-Based) Algorithms
 Construct a partition of a database D of n objects
into a set of k clusters
 Given a k, find a partition of k clusters that
optimizes the chosen partitioning criterion
 k-means (MacQueen’67)
• Each cluster is represented by the center of the cluster
• A Euclidean Distance based method, mostly used for
interval/ratio scaled data

 k-medoids
• Each cluster is represented by one of the objects in the
cluster
• For categorical data
K-means Clustering
 Partitional clustering approach
 Each cluster is associated with a centroid (center
point)
 Each point is assigned to the cluster with the
closest centroid
 Number of clusters, K, must be specified
 The basic algorithm is very simple

13
Clustering Example
(Plot: the data points before clustering, iteration 0.)
14
Clustering Example
(Plot: cluster assignments and centroids after iterations 1–6 of k-means.)

15
K-means Clustering – Details

 Initial centroids are often chosen randomly.


 Clusters produced vary from one run to another.
 The centroid is (typically) the mean of the points in the
cluster.
 ‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
 K-means will converge for common similarity measures
mentioned above.
 Most of the convergence happens in the first few
iterations.
 Often the stopping condition is changed to ‘Until relatively few
points change clusters’
 Complexity is O( n * K * I * d )
 n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes

16
A Simple example showing the
implementation of k-means algorithm
(using K=2)
Step 1:
Initialization: Randomly we choose following two
centroids (k=2) for two clusters.
In this case the 2 centroid are: m1=(1.0,1.0) and
m2=(5.0,7.0).
Step 2:
 Thus, we obtain two clusters containing: {1,2,3} and {4,5,6,7}.
 Their new centroids are recomputed as the means of the points in each cluster.
Step 3:
 Now using these centroids, we compute the Euclidean distance of each object again, as shown in the table.
 Therefore, the new clusters are: {1,2} and {3,4,5,6,7}
 Next centroids are: m1 = (1.25, 1.5) and m2 = (3.9, 5.1)
Step 4:
 The clusters obtained are: {1,2} and {3,4,5,6,7}
 Therefore, there is no change in the clusters.
 Thus, the algorithm comes to a halt here and the final result consists of 2 clusters: {1,2} and {3,4,5,6,7}.
[Plot: the final K=2 clusters from the example above]
[Plot: the same procedure run with K=3, showing Step 1 and Step 2]
CSC479
Data Mining
Lecture # 19

Clustering

(Ch # 10)
Partitioning (Centroid-Based) Algorithms
 Construct a partition of a database D of n objects
into a set of k clusters
 Given a k, find a partition of k clusters that
optimizes the chosen partitioning criterion
 k-means (MacQueen’67)
• Each cluster is represented by the center of the cluster
• A Euclidean Distance based method, mostly used for
interval/ratio scaled data

 k-medoids
• Each cluster is represented by one of the objects in the
cluster
• For categorical data 2
K-means Clustering
 Partitional clustering approach
 Each cluster is associated with a centroid (center
point)
 Each point is assigned to the cluster with the
closest centroid
 Number of clusters, K, must be specified
 The basic algorithm is very simple

3
Getting k Right
 Try different k, looking at the change in
the average distance to centroid, as k
increases.
 Average falls rapidly until right k, then
changes little.

[Figure: elbow curve of average distance to centroid versus k; the best value of k is where the curve stops falling rapidly]
4
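A minimal sketch of this procedure (assuming Python with numpy, the kmeans function sketched earlier, and X an (n, d) array of data points with n larger than the largest k tried):

import numpy as np

def avg_dist_to_centroid(X, centroids, labels):
    # Mean distance from each point to the centroid of its own cluster.
    return float(np.mean(np.linalg.norm(X - centroids[labels], axis=1)))

# Try increasing k and watch for the 'elbow' where the average stops falling rapidly.
for k in range(1, 6):
    centroids, labels = kmeans(X, k)       # kmeans() as sketched earlier
    print(k, round(avg_dist_to_centroid(X, centroids, labels), 3))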
Evaluating K-means Clusters
 Most common measure is Sum of Squared Error (SSE)
 For each point, the error is the distance to the nearest
cluster
 To get SSE, we square these errors and sum them.
$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$
 x is a data point in cluster Ci and mi is the representative
point for cluster Ci
• can show that mi corresponds to the center (mean) of the
cluster
 Given two clusters, we can choose the one with the smallest
error
 One easy way to reduce SSE is to increase K, the number of
clusters
• A good clustering with smaller K can have a lower SSE than a
poor clustering with higher K 5
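For reference, a direct translation of the SSE formula above into a small sketch (assuming Python with numpy; centroids[labels[i]] plays the role of the representative point m_i):

import numpy as np

def sse(X, centroids, labels):
    """Sum of squared Euclidean distances from each point to its cluster's centroid."""
    diffs = X - centroids[labels]          # vector from each point to its centroid m_i
    return float(np.sum(diffs ** 2))

Given two clusterings of the same data with the same K, the one with the smaller sse(...) would be preferred.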
Importance of Choosing Initial Centroids …

[Figure: k-means result at iteration 5 for one choice of initial centroids (x vs. y scatter plot)]
6
Importance of Choosing Initial Centroids …

[Figure: five panels (iterations 1–5) showing k-means progress for a different choice of initial centroids (x vs. y scatter plots)]

7
Limitations of K-means
K-means has problems when clusters are of differing
 Sizes
 Densities
 Non-globular shapes

K-means also has problems when the data contains outliers.

8
Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)

9
Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)

10
Weaknesses of K-Means Clustering
1. When the number of data points is small, the initial grouping will determine the clusters significantly.
2. The number of clusters, K, must be determined beforehand. A further disadvantage is that the method does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
3. We never know the real clusters: even with the same data, inputting the points in a different order may produce different clusters when the number of data points is small.
4. It is sensitive to the initial conditions. Different initial conditions may produce different clusterings, and the algorithm may be trapped in a local optimum.
11
Applications of K-Means Clustering
 It is relatively efficient and fast: it computes its result in O(tkn) time, where n is the number of objects or points, k is the number of clusters and t is the number of iterations.
 k-means clustering can be applied in machine learning and data mining.
 Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization), and for Image Segmentation.
 Also used for choosing color palettes on old-fashioned graphical display devices and for Image Quantization (see the sketch below). 12
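As one concrete illustration of this kind of use, the sketch below (illustrative only; it assumes numpy, the kmeans function from the earlier sketch, and random pixel values standing in for a real image) quantizes colors down to a k-color palette:

import numpy as np

# Stand-in for a real image: 1,000 RGB pixels with components in [0, 1].
pixels = np.random.default_rng(1).random((1000, 3))

k = 16                                    # size of the color palette
palette, labels = kmeans(pixels, k)       # cluster centers serve as palette colors

# Quantized image: every pixel is replaced by its nearest palette color.
quantized = palette[labels]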
CONCLUSION
 The K-means algorithm is useful for undirected knowledge discovery and is relatively simple. K-means has found widespread use in many fields, ranging from unsupervised learning for neural networks, pattern recognition, classification analysis, artificial intelligence, image processing, and machine vision, to many others.
13
Pre-processing and Post-processing K-means
 Pre-processing
 Normalize the data
 Eliminate outliers
 Post-processing
 Eliminate small clusters that may represent
outliers
 Split ‘loose’ clusters, i.e., clusters with
relatively high SSE
 Merge clusters that are ‘close’ and that have
relatively low SSE
 Can use these steps during the clustering process
  • ISODATA
14
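A minimal sketch of the normalization and outlier-elimination pre-processing steps listed above (assuming Python with numpy; the 3-standard-deviation cutoff is just one common heuristic, not something prescribed by the slides):

import numpy as np

def preprocess(X):
    # Normalize each attribute to zero mean and unit variance (z-scores).
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    # Eliminate outliers: drop points whose value on any attribute lies more
    # than 3 standard deviations from the mean (one simple heuristic).
    keep = np.all(np.abs(Z) <= 3.0, axis=1)
    return Z[keep]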
Variations of k-Means Method
 Aspects of variants of k-means
 Selection of initial k centroids
• E.g., choose k farthest points
 Dissimilarity calculations
• E.g., use Manhattan distance
 Strategies to calculate cluster means
• E.g., update the means incrementally

15
Strengths of k-Means Method
 Strength
 Relatively efficient for large datasets
• O(tkn) where n is # objects, k is # clusters, and t is
# iterations; normally, k, t <<n
 Often terminates at a local optimum
• global optimum may be found using techniques
such as deterministic annealing and genetic
algorithms

16
k-modes Algorithm
 Handling categorical data: k-modes (Huang’98)
 Replacing means of clusters with modes
  • Given n records in a cluster, the mode is the record made up of the most frequent attribute values
 Example cluster:

  age     income   student   credit_rating
  <=30    high     no        fair
  <=30    high     no        excellent
  31…40   high     no        fair
  >40     medium   no        fair
  >40     low      yes       fair
  >40     low      yes       excellent
  31…40   low      yes       excellent
  <=30    medium   no        fair
  <=30    low      yes       fair
  >40     medium   yes       fair
  <=30    medium   yes       excellent
  31…40   medium   no        excellent
  31…40   high     yes       fair

 In the example cluster, mode = (<=30, medium, yes, fair)
 Using new dissimilarity measures to deal with categorical objects
17
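A small sketch of the two ideas above (illustrative Python; records are tuples of categorical values, and the simple matching dissimilarity shown is one common choice for k-modes, counting the attributes on which two records disagree):

from collections import Counter

def mode_record(records):
    """Mode of a cluster: the most frequent value of each attribute, taken independently."""
    n_attrs = len(records[0])
    return tuple(Counter(rec[i] for rec in records).most_common(1)[0][0]
                 for i in range(n_attrs))

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

cluster = [("<=30", "high", "no", "fair"),
           ("<=30", "medium", "yes", "excellent"),
           (">40", "medium", "yes", "fair")]
print(mode_record(cluster))    # ('<=30', 'medium', 'yes', 'fair')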
A Problem of K-means
 Sensitive to outliers
 Outlier: objects with extremely large (or
small) values
• May substantially distort the distribution of the
data

[Figure: data points with two centroids marked "+" and a distant outlier]
18
k-Medoids Clustering Method
 k-medoids: Find k representative objects,
called medoids
 PAM (Partitioning Around Medoids, 1987)
 CLARA (Kaufmann & Rousseeuw, 1990)
 CLARANS (Ng & Han, 1994): Randomized sampling

[Figure: side-by-side scatter plots contrasting a k-means partition with a k-medoids partition of the same data]
19
PAM (Partitioning Around Medoids) (1987)
 PAM (Kaufman and Rousseeuw, 1987)
 Arbitrarily choose k objects as the initial medoids
 Until no change, do
 (Re)assign each object to the cluster with the nearest
medoid
 Improve the quality of the k-medoids
(Randomly select a nonmedoid object, Orandom,
compute the total cost of swapping a medoid with
Orandom)
 Works for small data sets (e.g., 100 objects in 5 clusters)
 Not efficient for medium and large data sets 20
Swapping Cost
 For each pair of a medoid o and a non-
medoid object h, measure whether h is
better than o as a medoid
 Use the squared-error criterion
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, o_i)^2$$

 Compute E_h − E_o, the criterion with h as medoid minus the criterion with o
  Negative: swapping brings benefit
 Choose the minimum swapping cost 21
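A minimal sketch of this evaluation (illustrative Python with numpy; medoids are given as row indices into the data matrix X, and d(p, o_i)² is squared Euclidean distance here):

import numpy as np

def total_cost(X, medoid_idx):
    """Squared-error criterion E: each point contributes its squared distance
    to the nearest medoid."""
    medoids = X[list(medoid_idx)]
    d2 = ((X[:, None, :] - medoids[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min(axis=1).sum())

def swap_cost(X, medoid_idx, o, h):
    """E_h - E_o for swapping medoid o with non-medoid h; a negative value
    means the swap improves the clustering."""
    swapped = [h if m == o else m for m in medoid_idx]
    return total_cost(X, swapped) - total_cost(X, medoid_idx)

PAM would evaluate swap_cost for every (medoid, non-medoid) pair, perform the swap with the most negative cost, and repeat until no swap helps.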
Four Swapping Cases
 When a medoid m is to be swapped with a non-
medoid object h, check each of other non-
medoid objects j
 j is in cluster of m⇒ reassign j
• Case 1: j is closer to some k than to h; after swapping m and
h, j relocates to cluster represented by k
• Case 2: j is closer to h than to k; after swapping m and h, j is
in cluster represented by h
 j is in cluster of some k, not m ⇒ compare k with h
• Case 3: j is closer to some k than to h; after swapping m and
h, j remains in cluster represented by k
• Case 4: j is closer to h than to k; after swapping m and h, j is
in cluster represented by h
22
PAM Clustering: Total swapping cost TC_mh = Σ_j C_jmh

[Figure: four scatter-plot panels illustrating the swapping cases for medoid m, candidate h, another medoid k, and a non-medoid object j]

 Case 1: C_jmh = d(j, k) − d(j, m) ≥ 0
 Case 2: C_jmh = d(j, h) − d(j, m)  (may be positive or negative)
 Case 3: C_jmh = d(j, k) − d(j, k) = 0
 Case 4: C_jmh = d(j, h) − d(j, k) < 0
23
Strength and Weakness of PAM
 PAM is more robust than k-means in the
presence of outliers because a medoid is
less influenced by outliers or other extreme
values than a mean
 PAM works efficiently for small data sets
but does not scale well for large data sets
 O(k(n−k)²) for each iteration, where n is the number of data objects and k is the number of clusters
35
CSC479
Data Mining
Lecture # 22

Hierarchical Clustering

(Ch # 10.3)
Hierarchical Clustering

 Use distance matrix as clustering criteria. This method does not


require the number of clusters k as an input, but needs a
termination condition

[Figure: agglomerative clustering (AGNES) proceeds bottom-up over steps 0–4, merging a and b into ab, d and e into de, then c with de into cde, and finally ab with cde into abcde; divisive clustering (DIANA) traverses the same hierarchy top-down over steps 4–0]
Hierarchical Clustering

 Clusters are created in levels, actually creating sets of clusters at each level.
 Agglomerative
  Initially each item is in its own cluster
  Iteratively, clusters are merged together
  Bottom Up
 Divisive
  Initially all items are in one cluster
  Large clusters are successively divided
  Top Down
Distance Between Clusters
 Single Link: smallest distance between points. The minimum of all pairwise distances between points in the two clusters
  Tends to produce long, “loose” clusters
 Complete Link: largest distance between points. The maximum of all pairwise distances between points in the two clusters
  Tends to produce very tight clusters
 Average Link: the average of all pairwise distances between points in the two clusters
 Centroid Linkage: distance between centroids
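To make the four definitions above concrete, here is a small sketch (illustrative Python with numpy; A and B are the points of the two clusters as (nA, d) and (nB, d) arrays):

import numpy as np

def linkage_distances(A, B):
    """Single, complete, average and centroid linkage distances between two clusters."""
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # all pairwise distances
    return {
        "single":   float(pairwise.min()),    # closest pair of points
        "complete": float(pairwise.max()),    # farthest pair of points
        "average":  float(pairwise.mean()),   # mean over all pairs
        "centroid": float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))),
    }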
Hierarchical Algorithms
 Single Link
 Complete Link
 Average Link
 MST Single Link
Single Link - Example

6
Clustering Algorithms
Hierarchical Clustering:
Example: Single-Link (Minimum) Method:

Resulting Tree, or Dendrogram:
Clustering Algorithms
Hierarchical Clustering:
Example: Complete-Link (Maximum) Method:

Resulting Tree, or Dendrogram:
Clustering Algorithms
Hierarchical Clustering:
In a dendrogram, the length of each tree branch represents the distance between the clusters it joins.

Different dendrograms may arise when different linkage methods are used.
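For completeness, a short sketch of producing such dendrograms programmatically (assuming Python with scipy and matplotlib available; the data points are made up):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0, 0.0], [0.5, 0.2], [3.0, 3.0], [3.2, 2.8], [6.0, 0.5]])

# Agglomerative clustering; changing method to 'complete' or 'average'
# can change the shape of the resulting dendrogram.
Z = linkage(X, method='single')
dendrogram(Z)
plt.show()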
