Data Mining All Slides
Classification:
The Basic Methods
Outline
Simplicity first: 1R
Naïve Bayes
Classification
Inferring rudimentary rules
Basic version
One branch for each value
Each branch assigns most frequent class
Error rate: proportion of instances that don’t belong to the
majority class of their corresponding branch
Choose attribute with lowest error rate
Pseudo-code for 1R
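The 1R pseudo-code from the original slide is not reproduced here. As a rough stand-in, the following is a minimal Python sketch of the same idea, assuming instances are dictionaries with nominal attributes and a class label (the function and variable names are mine, not from the slide):

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr):
    """Return the best 1R rule set as (attribute, {value: predicted class}, error count)."""
    best = None
    for attr in attributes:
        # For each value of the attribute, count how often each class occurs with it.
        value_class_counts = defaultdict(Counter)
        for inst in instances:
            value_class_counts[inst[attr]][inst[class_attr]] += 1
        # Each branch predicts the most frequent class for its value ...
        rules = {v: counts.most_common(1)[0][0] for v, counts in value_class_counts.items()}
        # ... and the errors are the instances outside that majority class.
        errors = sum(sum(c.values()) - max(c.values()) for c in value_class_counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best
```

Applied to the weather data below, this selects Outlook (or Humidity, which ties at 4/14 errors).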
Evaluating the weather attributes
Weather data:

  Outlook    Temp   Humidity   Windy   Play
  Sunny      Hot    High       False   No
  Sunny      Hot    High       True    No
  Overcast   Hot    High       False   Yes
  Rainy      Mild   High       False   Yes
  Rainy      Cool   Normal     False   Yes
  Rainy      Cool   Normal     True    No
  Overcast   Cool   Normal     True    Yes
  Sunny      Mild   High       False   No
  Sunny      Cool   Normal     False   Yes
  Rainy      Mild   Normal     False   Yes
  Sunny      Mild   Normal     True    Yes
  Overcast   Mild   High       True    Yes
  Overcast   Hot    Normal     False   Yes
  Rainy      Mild   High       True    No

1R evaluation:

  Attribute   Rules             Errors   Total errors
  Outlook     Sunny → No        2/5      4/14
              Overcast → Yes    0/4
              Rainy → Yes       2/5
  Temp        Hot → No*         2/4      5/14
              Mild → Yes        2/6
              Cool → Yes        1/4
  Humidity    High → No         3/7      4/14
              Normal → Yes      1/7
  Windy       False → Yes       2/8      5/14
              True → No*        3/6

  * indicates a tie
Dealing with numeric attributes
Discretize numeric attributes
Divide each attribute’s range into intervals
Sort instances according to attribute’s values
Place breakpoints where the class changes
(the majority class)
This minimizes the total error.

Example: temperature from the weather data

  Outlook   Temperature   Humidity   Windy   Play
  Sunny     85            85         False   No
  Sunny     80            90         True    No
  …         …             …          …       …

Sorted temperature values and their classes:

  64   65   68   69   70   71   72   72   75   75   80   81   83   85
  Yes | No | Yes  Yes  Yes | No   No   Yes | Yes  Yes | No | Yes  Yes | No
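As a rough illustration of the naive breakpoint idea (not the exact procedure on the slide), the sketch below sorts (value, class) pairs and proposes a breakpoint wherever the class changes between two distinct consecutive values; the data is the temperature sequence above:

```python
def naive_breakpoints(values, classes):
    """Place a candidate breakpoint halfway between consecutive values whose classes differ."""
    pairs = sorted(zip(values, classes))
    breaks = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and v1 != v2:
            breaks.append((v1 + v2) / 2)
    return breaks

temps   = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
classes = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes",
           "Yes", "Yes", "No", "Yes", "Yes", "No"]
print(naive_breakpoints(temps, classes))
```

With one breakpoint per class change this overfits, which is exactly the problem the next slides address.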
The problem of overfitting
Discretization example
With overfitting avoidance
Bayesian (Statistical) modeling
Probabilities for weather data
  Counts                         Relative frequencies
                  Yes   No                  Yes    No
  Outlook
    Sunny          2     3                  2/9    3/5
    Overcast       4     0                  4/9    0/5
    Rainy          3     2                  3/9    2/5
  Temperature
    Hot            2     2                  2/9    2/5
    Mild           4     2                  4/9    2/5
    Cool           3     1                  3/9    1/5
  Humidity
    High           3     4                  3/9    4/5
    Normal         6     1                  6/9    1/5
  Windy
    False          6     2                  6/9    2/5
    True           3     3                  3/9    3/5
  Play             9     5                  9/14   5/14
Bayes’s rule
Probability of event H given evidence E:

  Pr[H | E] = Pr[E | H] · Pr[H] / Pr[E]
A priori probability of H : Pr[H ]
Probability of event before evidence is seen
Weather data example
Evidence E for a new day: Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True.

  Pr[yes | E] = Pr[Sunny | yes] × Pr[Cool | yes] × Pr[High | yes] × Pr[True | yes] × Pr[yes] / Pr[E]
              = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
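A minimal Python sketch of the same computation, using the likelihoods read off the probability table above; the corresponding score for "no" and the normalisation step (which makes Pr[E] cancel) are filled in here for completeness:

```python
# Likelihoods for evidence (Sunny, Cool, High, True), from the weather probability table.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # Pr[E | yes] * Pr[yes]
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # Pr[E | no]  * Pr[no]

# Dividing by Pr[E] is equivalent to normalising the two scores.
print(p_yes / (p_yes + p_no))   # ≈ 0.205, probability of play = yes
print(p_no  / (p_yes + p_no))   # ≈ 0.795, probability of play = no
```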
The “zero-frequency problem”
*Modified probability estimates
  Sunny:    (2 + μ/3) / (9 + μ)
  Overcast: (4 + μ/3) / (9 + μ)
  Rainy:    (3 + μ/3) / (9 + μ)
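A small sketch of this estimator in Python; μ is a tunable constant and the prior weight 1/3 reflects the three possible Outlook values (the function name is mine):

```python
def m_estimate(count, total, mu=3.0, prior=1/3):
    """Modified probability estimate: (count + mu * prior) / (total + mu)."""
    return (count + mu * prior) / (total + mu)

# Outlook given play = yes: counts 2 (Sunny), 4 (Overcast), 3 (Rainy) out of 9.
for name, count in [("Sunny", 2), ("Overcast", 4), ("Rainy", 3)]:
    print(name, m_estimate(count, 9))
```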
Numeric attributes
Usual assumption: attributes have a normal or
Gaussian probability distribution (given the class)
The probability density function for the normal
distribution is defined by two parameters:
  Sample mean:         μ = (1/n) Σ_{i=1..n} x_i
  Standard deviation:  σ = sqrt( (1/(n−1)) Σ_{i=1..n} (x_i − μ)² )
Statistics for weather data

  Outlook (counts):      Yes: Sunny 2, Overcast 4, Rainy 3       No: Sunny 3, Overcast 0, Rainy 2
                         → Sunny 2/9, 3/5;  Overcast 4/9, 0/5;  Rainy 3/9, 2/5
  Temperature (values):  Yes: 64, 68, 69, 70, 72, …  (μ = 73, σ = 6.2)     No: 65, 71, 72, 80, 85, …  (μ = 75, σ = 7.9)
  Humidity (values):     Yes: 65, 70, 70, 75, 80, …  (μ = 79, σ = 10.2)    No: 70, 85, 90, 91, 95, …  (μ = 86, σ = 9.7)
  Windy (counts):        Yes: False 6, True 3        No: False 2, True 3
                         → False 6/9, 2/5;  True 3/9, 3/5
  Play (prior):          Yes 9 (9/14),  No 5 (5/14)
Classifying a new day
  Outlook = Sunny, Temperature = 66, Humidity = 90, Windy = true, Play = ?
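A minimal sketch of how the numeric attributes enter the computation, using the means and standard deviations from the statistics table above and the normal density f(x) = exp(−(x − μ)² / (2σ²)) / (√(2π)·σ); the structure of the code is mine, the numbers come from the slides:

```python
import math

def normal_pdf(x, mu, sigma):
    """Gaussian density, used as the likelihood of a numeric attribute value."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# New day: Outlook = Sunny, Temperature = 66, Humidity = 90, Windy = true.
like_yes = (2/9) * normal_pdf(66, 73, 6.2) * normal_pdf(90, 79, 10.2) * (3/9) * (9/14)
like_no  = (3/5) * normal_pdf(66, 75, 7.9) * normal_pdf(90, 86, 9.7) * (3/5) * (5/14)

print(like_yes / (like_yes + like_no))   # probability of play = yes
print(like_no  / (like_yes + like_no))   # probability of play = no
```

As with the purely categorical example, the "no" class comes out as the more probable one for this day.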
Naïve Bayes: discussion
Improvements:
select best attributes (e.g. with greedy search)
often works as well or better with just a fraction
of all attributes
Bayesian Networks
Summary
Data Mining
Preprocessing
• Data quality
• Missing values imputation using Mean,
Median and k-Nearest Neighbor approach
• Distance Measure
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
• Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
Integration of multiple databases, data cubes, or files
• Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
• Data transformation and data discretization
Normalization
Concept hierarchy generation
Data Quality
• Data quality is a major concern in Data Mining and
Knowledge Discovery tasks.
• Why: Almost all Data Mining algorithms induce knowledge
strictly from data.
• The quality of knowledge extracted highly depends on the
quality of data.
• There are two main problems in data quality:-
– Missing data: The data not present.
– Noisy data: The data present but not correct.
• Missing/Noisy data sources:-
– Hardware failure.
– Data transmission error.
– Data entry problem.
– Refusal of respondents to answer certain questions.
Effect of Noisy Data on Results Accuracy
Nominal Scale
1. Equality
2. Count
Interval Scale
4. Quantify the difference
Axioms of a Distance Measure
• d is a distance measure if it is a function
from pairs of points to reals such that:
1. d(x,x) = 0.
2. d(x,y) = d(y,x).
3. d(x,y) > 0 if x ≠ y.
4. d(x,y) ≤ d(x,z) + d(z,y) (the triangle inequality).
Some Euclidean Distances
• L2 norm (the common or Euclidean distance):
    d(i, j) = sqrt( |x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|² )
• L1 norm (Manhattan distance):
    d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
  – the distance if you had to travel along coordinates only.
Examples L1 and L2 norms
x = (5,5), y = (9,8)

  L2-norm: dist(x, y) = √(4² + 3²) = 5
  L1-norm: dist(x, y) = 4 + 3 = 7
Another Euclidean Distance
• L∞ norm : d(x,y) = the maximum of the
differences between x and y in any dimension.
Example: for x = (5,5) and y = (9,8) above, dist(x, y) = max(4, 3) = 4.
Data Matrix and Dissimilarity Matrix
Data Matrix

  point   attribute1   attribute2
  x1      1            2
  x2      3            5
  x3      2            0
  x4      4            5

Dissimilarity Matrix (with Euclidean Distance)

        x1     x2     x3     x4
  x1    0
  x2    3.61   0
  x3    2.24   5.1    0
  x4    4.24   1      5.39   0
Example: Minkowski Distance
Dissimilarity Matrices

  point   attribute 1   attribute 2
  x1      1             2
  x2      3             5
  x3      2             0
  x4      4             5

Manhattan (L1)
  L1    x1   x2   x3   x4
  x1    0
  x2    5    0
  x3    3    6    0
  x4    6    1    7    0

Euclidean (L2)
  L2    x1     x2    x3     x4
  x1    0
  x2    3.61   0
  x3    2.24   5.1   0
  x4    4.24   1     5.39   0

Supremum (L∞)
  L∞    x1   x2   x3   x4
  x1    0
  x2    3    0
  x3    2    5    0
  x4    3    1    5    0
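These matrices can be reproduced with a few lines of NumPy; a minimal sketch, assuming NumPy is available:

```python
import numpy as np

points = np.array([[1, 2], [3, 5], [2, 0], [4, 5]], dtype=float)   # x1 .. x4

def pairwise(pts, dist):
    """Build the full dissimilarity matrix for a given distance function."""
    n = len(pts)
    return np.array([[dist(pts[i], pts[j]) for j in range(n)] for i in range(n)])

manhattan = pairwise(points, lambda a, b: np.abs(a - b).sum())            # L1
euclidean = pairwise(points, lambda a, b: np.sqrt(((a - b) ** 2).sum()))  # L2
supremum  = pairwise(points, lambda a, b: np.abs(a - b).max())            # L∞ (Chebyshev)

print(np.round(euclidean, 2))   # lower triangle matches the L2 matrix above
```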
Proximity Measure for Nominal Attributes

• Method 1: Simple matching
  – m: number of attributes on which i and j match, p: total number of attributes

      d(i, j) = (p − m) / p
• Method 2: Use a large number of binary attributes
– creating a new binary attribute for each of the M
nominal states
Non-Euclidean Distances

Dissimilarity between binary attributes: contingency table for objects i and j

            1      0     sum
    1       a      b     a+b
    0       c      d     c+d
    sum    a+c    b+d     p

      d(i, j) = (b + c) / (a + b + c)

Example:
  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
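A tiny sketch of the same computation; the 0/1 vectors below are illustrative encodings chosen so that the match counts (a, b, c) reproduce the example, not data from the slide:

```python
def binary_dissimilarity(x, y):
    """d(i, j) = (b + c) / (a + b + c) for binary attribute vectors (1 = present)."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (b + c) / (a + b + c)

jack = [1, 0, 1, 0, 0, 0]   # hypothetical encodings
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(binary_dissimilarity(jack, mary), 2))   # 0.33
print(round(binary_dissimilarity(jack, jim), 2))    # 0.67
print(round(binary_dissimilarity(jim, mary), 2))    # 0.75
```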
Cosine Measure
• Think of a point as a vector from the origin
(0,0,…,0) to its location.
      dist(p1, p2) = θ = arccos( p1 · p2 / (|p1| |p2|) )
Cosine Similarity
• A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
Example: Cosine Similarity
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 ≈ 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = 17^0.5 ≈ 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) ≈ 0.94
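The same numbers in a few lines of Python, as a sanity check (plain standard library, no assumptions beyond the two vectors above):

```python
import math

d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

dot   = sum(a * b for a, b in zip(d1, d2))    # 25
norm1 = math.sqrt(sum(a * a for a in d1))     # sqrt(42) ≈ 6.481
norm2 = math.sqrt(sum(b * b for b in d2))     # sqrt(17) ≈ 4.123
print(round(dot / (norm1 * norm2), 2))        # 0.94
```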
Edit Distance
• The edit distance of two strings is the number
of inserts and deletes of characters needed to
turn one into the other.
• x = abcde ; y = bcduve.
• LCS(x,y) = bcde.
• D(x,y) = |x| + |y| − 2|LCS(x,y)| = 5 + 6 − 2×4 = 3.
• What's left?
• Normalize it to the range [0, 1]. We will study
normalization formulas later.
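A small sketch of the LCS-based edit distance described above (classic dynamic programming; insert/delete operations only):

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def edit_distance(x, y):
    """Insert/delete-only edit distance via the LCS."""
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(edit_distance("abcde", "bcduve"))   # 3
```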
Back to k-Nearest Neighbor (Pseudo-code)
• Missing values Imputation using k-NN.
• Input: Dataset (D ), size of K
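The k-NN imputation pseudo-code itself is not reproduced in these notes. The following is a minimal sketch of the idea, assuming numeric attributes, Euclidean distance computed over the attributes the incomplete record does have, and mean imputation from the k nearest complete records (the names and data are mine):

```python
import math

def knn_impute(dataset, k):
    """Fill each missing value (None) with the mean of that attribute over the k nearest complete records."""
    complete = [r for r in dataset if None not in r]

    def distance(record, other):
        # Compare only on the attributes where the incomplete record has a value.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(record, other) if x is not None))

    for record in dataset:
        if None not in record:
            continue
        neighbours = sorted(complete, key=lambda c: distance(record, c))[:k]
        for i, value in enumerate(record):
            if value is None:
                record[i] = sum(n[i] for n in neighbours) / len(neighbours)
    return dataset

data = [[1.0, 2.0], [1.2, 2.1], [8.0, 9.0], [1.1, None]]
print(knn_impute(data, k=2))   # the None becomes the mean of its 2 nearest neighbours (≈ 2.05)
```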
• Removing noise
– Data Smoothing (rounding, averaging within a window).
– Clustering/merging and Detecting outliers.
• Data Smoothing
– First sort the data and partition it into (equi-depth) bins.
– Then smooth the values in each bin using bin means,
bin medians, bin boundaries, etc.
Noisy Data (Binning Methods)
Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 26, 26, 34
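The bin-means variant of this example can be written directly in Python; a minimal sketch (equi-depth bins of a fixed size, means rounded as on the slide):

```python
def smooth_by_bin_means(values, bin_size):
    """Equi-depth binning followed by replacing each bin with its (rounded) mean."""
    values = sorted(values)
    smoothed = []
    for start in range(0, len(values), bin_size):
        bin_ = values[start:start + bin_size]
        mean = round(sum(bin_) / len(bin_))
        smoothed.extend([mean] * len(bin_))
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(prices, 4))   # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]
```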
Noisy Data (Clustering)
• Outliers may be detected by clustering, where similar
values are organized into groups or “clusters”.
Market-Baskets – (2)
Really a general many-to-many mapping (association)
between two kinds of things, where one kind (the
baskets) is a set of the other (the items).
But we ask about connections among "items," not "baskets."
The technology focuses on common events, not rare events ("long tail").
Frequent Itemsets
• Given a set of transactions, find combinations of
items (itemsets) that occur frequently
Market-Basket transactions
Items: {Bread, Milk, Diaper, Beer, Eggs, Coke}

  TID   Items
  1     Bread, Milk
  2     Bread, Diaper, Beer, Eggs
  3     Milk, Diaper, Beer, Coke
  4     Bread, Milk, Diaper, Beer
  5     Bread, Milk, Diaper, Coke

Frequent itemsets (support counts):
  {Bread}: 4
  {Milk}: 4
  {Diaper}: 4
  {Beer}: 3
  {Diaper, Beer}: 3
  {Milk, Bread}: 3
Applications – (1)
Items = products; baskets = sets of
products someone bought in one trip to
the store.
Applications – (3)
Baskets = sentences; items =
documents containing those sentences.
Disadvantages:
1) Given 20 attributes, the number of combinations is 2^20 − 1 =
1,048,575. Therefore array storage requirements will be about
4.2 MB.
2) Given a data set with (say) 100 attributes, it is likely that
many combinations will not be present in the data set ---
therefore store only those combinations present in the
dataset!
Mining Association Rules—An Example
  Transaction ID   Items Bought        Min. support 50%
  2000             A, B, C             Min. confidence 50%
  1000             A, C
  4000             A, D
  5000             B, E, F

  Frequent Itemset   Support
  {A}                75%
  {B}                50%
  {C}                50%
  {A, C}             50%

For rule A ⇒ C:
  support = support({A ∪ C}) = 50%
  confidence = support({A ∪ C}) / support({A}) = 66.6%
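The same support and confidence figures can be checked with a few lines of Python (the transaction encoding is mine):

```python
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

print(support({"A", "C"}))        # 0.5       -> 50%
print(confidence({"A"}, {"C"}))   # 0.666...  -> 66.6%
```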
The Apriori principle:
Any subset of a frequent itemset must be frequent
Mining Frequent Itemsets: the Key Step
Find the frequent itemsets: the sets of items
that have minimum support
A subset of a frequent itemset must also be a
frequent itemset
• i.e., if {A, B} is a frequent itemset, both {A} and {B}
must also be frequent itemsets
Iteratively find frequent itemsets with
cardinality from 1 to k (k-itemset)
Use the frequent itemsets to generate
association rules.
The Apriori Algorithm — Example
Database D:
  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5

Scan D → C1:               L1:
  itemset   sup              itemset   sup
  {1}       2                {1}       2
  {2}       3                {2}       3
  {3}       3                {3}       3
  {4}       1                {5}       3
  {5}       3

C2 (from L1):    Scan D → C2 counts:    L2:
  {1 2}            {1 2}  1               itemset   sup
  {1 3}            {1 3}  2               {1 3}     2
  {1 5}            {1 5}  1               {2 3}     2
  {2 3}            {2 3}  2               {2 5}     3
  {2 5}            {2 5}  3               {3 5}     2
  {3 5}            {3 5}  2

C3: {2 3 5}    Scan D → L3: {2 3 5}, sup 2
The Apriori Algorithm
Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
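A compact, runnable Python rendering of the same loop, with candidate generation by joining frequent k-itemsets and pruning candidates that have an infrequent subset (the function and variable names are mine; run on the example database above it reproduces L1, L2 and L3):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return a dict {frozenset(itemset): support count} of all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}   # L1
    frequent = dict(current)
    k = 1
    while current:
        # Join step: combine frequent k-itemsets into (k+1)-item candidates ...
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # ... prune step: drop candidates with an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, count in sorted(apriori(db, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```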
Important Details of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
Pruning:
• acde is removed because ade is not in L3
C4={abcd}
CSC479
Data Mining
Lecture # 11
Classification
Basic Concepts
Decision Trees
(Figure: the general classification framework; a model is learned from the Training Set and then applied to the Test Set for deduction.)

Test Set:
  Tid   Attrib1   Attrib2   Attrib3   Class
  11    No        Small     55K       ?
  12    Yes       Medium    80K       ?
  13    Yes       Large     110K      ?
  14    No        Small     95K       ?
  15    No        Large     67K       ?
Evaluation of classification models
Counts of test records that are correctly (or incorrectly)
predicted by the classification model.

Confusion matrix:
                        Predicted Class = 1   Predicted Class = 0
  Actual Class = 1            f11                   f10
  Actual Class = 0            f01                   f00
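From these four counts the usual performance figures follow directly; a tiny sketch with made-up counts for illustration:

```python
def accuracy(f11, f10, f01, f00):
    """Fraction of test records that are correctly predicted."""
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def error_rate(f11, f10, f01, f00):
    """Fraction of test records that are wrongly predicted."""
    return (f10 + f01) / (f11 + f10 + f01 + f00)

print(accuracy(40, 10, 5, 45))     # 0.85
print(error_rate(40, 10, 5, 45))   # 0.15
```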
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Decision Trees
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class
distribution
Example of a Decision Tree
(Figure: a training data table with attributes Tid, Refund, Marital Status, Taxable Income and class label Cheat, together with the decision tree model built from it; the internal nodes test the splitting attributes and the leaves carry the class labels.)
Another Example of Decision Tree
Training data:
  Tid   Refund   Marital Status   Taxable Income   Cheat
  1     Yes      Single           125K             No
  2     No       Married          100K             No
  3     No       Single           70K              No
  4     Yes      Married          120K             No
  5     No       Divorced         95K              Yes
  6     No       Married          60K              No
  7     Yes      Divorced         220K             No
  8     No       Single           85K              Yes
  9     No       Married          75K              No
  10    No       Single           90K              Yes

An alternative tree for the same data (MarSt at the root):
  MarSt = Married                              → NO
  MarSt = Single, Divorced and Refund = Yes    → NO
  MarSt = Single, Divorced and Refund = No:
      TaxInc < 80K → NO,   TaxInc > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task
(Figure: the decision tree classification task; the tree induced from the Training Set is applied to the Test Set records Tid 11 to 15, whose class is unknown.)
Apply Model to Test Data

Test Data:
  Refund   Marital Status   Taxable Income   Cheat
  No       Married          80K              ?

Start from the root of the tree:
  Refund = Yes                             → NO
  Refund = No, MarSt = Single or Divorced  → TaxInc < 80K → NO,  TaxInc > 80K → YES
  Refund = No, MarSt = Married             → NO
Following the path Refund = No → MarSt = Married, the test record reaches a NO leaf, so assign Cheat = "No".
Tree Induction
Finding the best decision tree is NP-hard
Greedy strategy.
Split the records based on an attribute test that
optimizes a certain criterion.
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
Which Attribute is the Best Classifier?
The central choice in the ID3 algorithm is selecting
which attribute to test at each node in the tree
Entropy
Entropy of a data set D is denoted by H(D):

  H(D) = − Σ_i p_i log2 p_i

where the C_i are the possible classes and p_i is the fraction of
records from D that have class C_i.
Entropy Examples
Example:
10 records have class A
20 records have class B
30 records have class C
40 records have class D
Entropy = −[(.1 log2 .1) + (.2 log2 .2) + (.3 log2 .3) + (.4 log2 .4)] = 1.846
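The same number in a few lines of Python (log base 2, class counts given as a list):

```python
import math

def entropy(counts):
    """H(D) = -sum_i p_i log2 p_i, where p_i are the class proportions."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

print(round(entropy([10, 20, 30, 40]), 3))   # 1.846
```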
Splitting Criterion
Example:
Two classes, +/-
100 records overall (50 +s and 50 -s)
A and B are two binary attributes
• Records with A=0: 48+, 2-;  Records with A=1: 2+, 48-
• Records with B=0: 26+, 24-; Records with B=1: 24+, 26-
Splitting on A is better than splitting on B
• A does a good job of separating +s and -s
• B does a poor job of separating +s and -s
Which Attribute is the Best Classifier?
Information Gain
The expected information needed to classify a tuple in D is the entropy H(D).
How much more information would we still need (after partitioning on
attribute A) to arrive at an exact classification? This amount is measured by

  H(D, A) = Σ_j (|D_j| / |D|) · H(D_j)

where D_1, …, D_v are the partitions of D induced by the values of A.
The information gain of A is Gain(A) = H(D) − H(D, A).
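A short sketch that computes this measure and the resulting gain for the A/B splitting example a few slides back (the helper names are mine):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent_counts, partitions):
    """Gain(A) = H(D) - H(D, A), with H(D, A) the size-weighted entropy of the partitions."""
    total = sum(parent_counts)
    h_d_a = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - h_d_a

# 50 + and 50 - records overall; A splits 48+/2- vs 2+/48-, B splits 26+/24- vs 24+/26-.
print(round(information_gain([50, 50], [[48, 2], [2, 48]]), 3))   # ≈ 0.758 for A
print(round(information_gain([50, 50], [[26, 24], [24, 26]]), 3)) # ≈ 0.001 for B
```

As expected, A, which separates the classes well, has the much larger gain.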
Examples Constructing Decision Tree
DECISION TREES
Which Attribute is the Best Classifier?: Use
Information Gain to develop the complete tree
CSC479
Data Mining
Lecture # 15
(Ch # 8.4)
Rule Generation from Decision Tree
  Name           Blood Type   Give Birth   Can Fly   Live in Water   Class
  hawk           warm         no           yes       no              ?
  grizzly bear   warm         yes          no        no              ?

The rule R1 covers the hawk => Bird
The rule R3 covers the grizzly bear => Mammal
Rule Coverage and Accuracy
Coverage of a rule: the fraction of records that satisfy the antecedent (condition) of the rule.
Accuracy of a rule: of the records that satisfy the antecedent, the fraction that also satisfy the consequent.
  Name            Blood Type   Give Birth   Can Fly   Live in Water   Class
  lemur           warm         yes          no        no              ?
  turtle          cold         no           no        sometimes       ?
  dogfish shark   cold         yes          no        yes             ?
Exhaustive rules
Classifier has exhaustive coverage if it accounts for every
possible combination of attribute values
Each record is covered by at least one rule
There is one rule for each possible attribute-value
combination, so that the set of rules does not require a
default rule
From Decision Trees To Rules
Classification Rules:
  (Refund = Yes) ==> No
  (Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
  (Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes
  (Refund = No, Marital Status = {Married}) ==> No

Decision tree:
  Refund = Yes                                       → NO
  Refund = No, Marital Status = {Single, Divorced}:
      Taxable Income < 80K → NO,   Taxable Income > 80K → YES
  Refund = No, Marital Status = {Married}            → NO
Rules are mutually exclusive and exhaustive
Rule set contains as much information as the
tree
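The extracted rule set can be written directly as code; a small sketch with the record represented as a dictionary (the names are mine):

```python
def classify(record):
    """Apply the mutually exclusive and exhaustive rules extracted from the tree."""
    if record["Refund"] == "Yes":
        return "No"
    if record["Marital Status"] in {"Single", "Divorced"}:
        return "Yes" if record["Taxable Income"] > 80_000 else "No"
    return "No"   # Refund = No, Marital Status = Married

print(classify({"Refund": "No", "Marital Status": "Married", "Taxable Income": 80_000}))   # No
```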
Rules Can Be Simplified
(Figure: the same decision tree and training data as on the previous slide, used to illustrate how extracted rules can be simplified.)

  Name     Blood Type   Give Birth   Can Fly   Live in Water   Class
  turtle   cold         no           no        sometimes       ?
CSC479
Data Mining
Lecture # 18
Clustering
(Ch # 10)
The Problem of Clustering
Given a set of points, with a notion of
distance between points, group the points
into some number of clusters, so that
members of a cluster are in some sense as
nearby as possible.
Clustering is unsupervised classification: no
predefined classes.
Formally, clustering is the process of
grouping data points such that intra-cluster
distance is minimized and inter-cluster
distance is maximized.
Types of Clustering
A clustering is a set of clusters
Important distinction between hierarchical
and partitional sets of clusters
Partitional Clustering
• A division of data objects into non-overlapping
subsets (clusters) such that each data object is in
exactly one subset
Hierarchical clustering
• A set of nested clusters organized as a hierarchical
tree
Other distinctions – coming slides
Partitional Clustering
Hierarchical Clustering
(Figure: points p1 to p4 shown as a traditional hierarchical clustering and as the corresponding traditional dendrogram.)
Other Distinctions Between Sets of Clusters
Exclusive versus non-exclusive
In non-exclusive clusterings, points may belong to multiple
clusters.
Can represent multiple classes or ‘border’ points
Center-based clusters
Contiguous clusters
Density-based clusters
Property or Conceptual
(Figure: 3 well-separated clusters)
Types of Clusters: Center-Based
Center-based
A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
The center of a cluster is often a centroid, the average of all the
points in the cluster, or a medoid, the most “representative”
point of a cluster
(Figure: 4 center-based clusters)
Types of Clusters: Density-Based
Density-based
A cluster is a dense region of points, which is separated by low-
density regions, from other regions of high density.
Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
(Figure: 6 density-based clusters)
Data Structures Used
Data matrix (n objects × p attributes):

  [ x11  …  x1f  …  x1p ]
  [  …   …   …   …   …  ]
  [ xi1  …  xif  …  xip ]
  [  …   …   …   …   …  ]
  [ xn1  …  xnf  …  xnp ]

Dissimilarity matrix (n × n, lower triangular):

  [    0                            ]
  [  d(2,1)    0                    ]
  [  d(3,1)  d(3,2)    0            ]
  [    :        :      :            ]
  [  d(n,1)  d(n,2)    …    …   0   ]
Partitioning (Centroid-Based) Algorithms
Construct a partition of a database D of n objects
into a set of k clusters
Given a k, find a partition of k clusters that
optimizes the chosen partitioning criterion
k-means (MacQueen’67)
• Each cluster is represented by the center of the cluster
• A Euclidean Distance based method, mostly used for
interval/ratio scaled data
k-medoids
• Each cluster is represented by one of the objects in the
cluster
• For categorical data
K-means Clustering
Partitional clustering approach
Each cluster is associated with a centroid (center
point)
Each point is assigned to the cluster with the
closest centroid
Number of clusters, K, must be specified
The basic algorithm is very simple
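A plain-Python sketch of that basic algorithm on 2-D points (the data set and names are illustrative; real uses would rely on a library implementation):

```python
import random

def k_means(points, k, iterations=100):
    """Assign each point to the nearest centroid, recompute centroids, repeat until stable."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                                  + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        new_centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:   # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
print(k_means(pts, k=2))
```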
Clustering Example
(Figure: k-means on a 2-D data set, iteration 0; the initial centroids before any reassignment.)
Clustering Example
(Figure: the same data set over iterations 1 to 6 of k-means; the centroids move until the final clusters emerge.)
K-means Clustering – Details
A Simple example showing the
implementation of k-means algorithm
(using K=2)
Step 1:
Initialization: Randomly we choose following two
centroids (k=2) for two clusters.
In this case the 2 centroid are: m1=(1.0,1.0) and
m2=(5.0,7.0).
Step 2:
Thus, we obtain two clusters
containing:
{1,2,3} and {4,5,6,7}.
Their new centroids are:
Step 3:
Now using these centroids
we compute the Euclidean
distance of each object, as
shown in table.
Therefore, there is no change in the clusters.
Thus, the algorithm comes to a halt here, and the final
result consists of the 2 clusters {1,2} and {3,4,5,6,7}.
(Plots: the resulting clusters after each step, and a second run of the algorithm with K = 3.)
CSC479
Data Mining
Lecture # 19
Clustering
(Ch # 10)
Partitioning (Centroid-Based) Algorithms
Construct a partition of a database D of n objects
into a set of k clusters
Given a k, find a partition of k clusters that
optimizes the chosen partitioning criterion
k-means (MacQueen’67)
• Each cluster is represented by the center of the cluster
• A Euclidean Distance based method, mostly used for
interval/ratio scaled data
k-medoids
• Each cluster is represented by one of the objects in the
cluster
• For categorical data
K-means Clustering
Partitional clustering approach
Each cluster is associated with a centroid (center
point)
Each point is assigned to the cluster with the
closest centroid
Number of clusters, K, must be specified
The basic algorithm is very simple
Getting k Right
Try different k, looking at the change in
the average distance to centroid, as k
increases.
Average falls rapidly until right k, then
changes little.
(Figure: average distance to centroid plotted against k; the curve falls rapidly until the best value of k and then changes little.)
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
For each point, the error is the distance to the nearest
cluster
To get SSE, we square these errors and sum them.
  SSE = Σ_{i=1..K} Σ_{x ∈ C_i} dist²(m_i, x)

x is a data point in cluster C_i and m_i is the representative
point for cluster C_i
• it can be shown that m_i corresponds to the center (mean) of the
cluster
Given two clusters, we can choose the one with the smallest
error
One easy way to reduce SSE is to increase K, the number of
clusters
• A good clustering with smaller K can have a lower SSE than a
poor clustering with higher K
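A direct transcription of the SSE formula, with clusters given as lists of 2-D points and the m_i as the supplied centroids (a sketch, names mine):

```python
def sse(clusters, centroids):
    """Sum over clusters of the squared Euclidean distance from each point to its centroid."""
    total = 0.0
    for points, (cx, cy) in zip(clusters, centroids):
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)
    return total

print(sse([[(1, 1), (1, 2)], [(5, 7), (6, 7)]], [(1.0, 1.5), (5.5, 7.0)]))   # 1.0
```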
Importance of Choosing Initial Centroids …
(Figure: with one choice of initial centroids, k-means reaches a good clustering; iterations 1 to 5 are shown.)
Importance of Choosing Initial Centroids …
(Figure: the first iterations of k-means on the same data set for a different choice of initial centroids.)
Limitations of K-means
K-means has problems when clusters are of
differing sizes, differing densities, or
non-globular shapes.
Limitations of K-means: Differing Sizes
Limitations of K-means: Differing Density
Weaknesses of K-Means Clustering
1. When the number of data points is small, the initial
grouping determines the clusters significantly.
2. The number of clusters, K, must be determined beforehand.
A disadvantage is that it does not yield the same result
with each run, since the resulting clusters depend on the
initial random assignments.
3. We never know the real clusters: with the same data,
if the points are input in a different order the algorithm
may produce different clusters when the number of data
points is small.
4. It is sensitive to the initial conditions. Different initial
conditions may produce different clusterings, and the
algorithm may be trapped in a local optimum.
Applications of K-Mean Clustering
It is relatively efficient and fast: it computes its result
in O(tkn), where n is the number of objects or points,
k is the number of clusters, and t is the number of
iterations.
k-means clustering can be applied to machine
learning or data mining.
Used on acoustic data in speech understanding to
convert waveforms into one of k categories (known
as Vector Quantization or Image Segmentation).
Also used for choosing color palettes on old-fashioned
graphical display devices and for image quantization.
CONCLUSION
The K-means algorithm is useful for
undirected knowledge discovery and is
relatively simple. K-means has found
widespread usage in many fields,
including unsupervised learning of
neural networks, pattern recognition,
classification analysis, artificial
intelligence, image processing, machine
vision, and many others.
Pre-processing and Post-processing K-means
Pre-processing
Normalize the data
Eliminate outliers
Post-processing
Eliminate small clusters that may represent
outliers
Split ‘loose’ clusters, i.e., clusters with
relatively high SSE
Merge clusters that are ‘close’ and that have
relatively low SSE
Can use these steps during the clustering
process
• ISODATA
Variations of k-Means Method
Aspects of variants of k-means
Selection of initial k centroids
• E.g., choose k farthest points
Dissimilarity calculations
• E.g., use Manhattan distance
Strategies to calculate cluster means
• E.g., update the means incrementally
Strengths of k-Means Method
Strength
Relatively efficient for large datasets
• O(tkn) where n is # objects, k is # clusters, and t is
# iterations; normally, k, t <<n
Often terminates at a local optimum
• global optimum may be found using techniques
such as deterministic annealing and genetic
algorithms
k-modes Algorithm

Handling categorical data: k-modes (Huang'98)
Replacing means of clusters with modes
• Given n records in a cluster, the mode is the record made
  up of the most frequent attribute values (for categorical objects).

  age     income   student   credit_rating
  <=30    high     no        fair
  <=30    high     no        excellent
  31…40   high     no        fair
  >40     medium   no        fair
  >40     low      yes       fair
  >40     low      yes       excellent
  31…40   low      yes       excellent
  <=30    medium   no        fair
  <=30    low      yes       fair
  >40     medium   yes       fair
  <=30    medium   yes       excellent
  31…40   medium   no        excellent
  31…40   high     yes       fair
A Problem of K-means
Sensitive to outliers
Outlier: objects with extremely large (or
small) values
• May substantially distort the distribution of the
data
(Figure: two cluster centers marked with '+' and an outlier point.)
k-Medoids Clustering Method
k-medoids: Find k representative objects,
called medoids
PAM (Partitioning Around Medoids, 1987)
CLARA (Kaufmann & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): Randomized sampling

(Figure: the same data set clustered by k-means and by k-medoids.)
PAM (Partitioning Around Medoids) (1987)
PAM (Kaufman and Rousseeuw, 1987)
Arbitrarily choose k objects as the initial medoids
Until no change, do
(Re)assign each object to the cluster with the nearest
medoid
Improve the quality of the k-medoids
(Randomly select a nonmedoid object, Orandom,
compute the total cost of swapping a medoid with
Orandom)
Works for small data sets (e.g., 100 objects in 5
clusters)
Not efficient for medium and large data sets
Swapping Cost
For each pair of a medoid o and a non-
medoid object h, measure whether h is
better than o as a medoid
Use the squared-error criterion

  E = Σ_{i=1..k} Σ_{p ∈ C_i} d(p, o_i)²

Compute E_h − E_o
  Negative: swapping brings benefit
Choose the swap with the minimum swapping cost
Four Swapping Cases
When a medoid m is to be swapped with a non-
medoid object h, check each of other non-
medoid objects j
j is in cluster of m⇒ reassign j
• Case 1: j is closer to some k than to h; after swapping m and
h, j relocates to cluster represented by k
• Case 2: j is closer to h than to k; after swapping m and h, j is
in cluster represented by h
j is in cluster of some k, not m ⇒ compare k with h
• Case 3: j is closer to some k than to h; after swapping m and
h, j remains in cluster represented by k
• Case 4: j is closer to h than to k; after swapping m and h, j is
in cluster represented by h
PAM Clustering: Total swapping cost TC_mh = Σ_j C_jmh

(Figures: the four swapping cases, each showing a current medoid m, the candidate medoid h, another medoid k, and how object j is reassigned.)
Hierarchical Clustering
(Ch # 10.3)
Hierarchical Clustering
Clustering Algorithms
Hierarchical Clustering:
Example: Single-Link (Minimum) Method:
Resulting tree, or dendrogram:
Clustering Algorithms
Hierarchical Clustering:
Example: Complete-Link (Maximum) Method:
Resulting tree, or dendrogram:
Clustering Algorithms
Hierarchical Clustering:
In a dendrogram, the length of each tree branch represents
the distance between the clusters it joins.
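For reference, both linkages can be computed with SciPy; a minimal sketch, assuming SciPy is installed and reusing the four points from the distance examples earlier:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 2], [3, 5], [2, 0], [4, 5]], dtype=float)   # x1 .. x4

single   = linkage(points, method="single")     # merge on minimum (single-link) distance
complete = linkage(points, method="complete")   # merge on maximum (complete-link) distance

print(single)    # each row: the two clusters merged, their distance, and the new cluster size
print(fcluster(complete, t=2, criterion="maxclust"))   # cut the dendrogram into 2 clusters
```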