What Is Cluster Analysis?
Cluster analysis
Typical applications
Pattern Recognition
Image Processing
WWW
Document classification
Examples of Clustering Applications
Scalability
High dimensionality
Data Structures
Data matrix (two modes):

$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}$$

Dissimilarity matrix (one mode):

$$\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$
9
Interval-scaled variables
Binary variables
10
Interval-valued variables
Standardize data:
Calculate the mean absolute deviation
$$s_f = \tfrac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where
$$m_f = \tfrac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$
Calculate the standardized measurement (z-score)
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
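As an illustration, a minimal Python sketch of this standardization; the function name `standardize` and the sample values are ours, not from the slides.

```python
import numpy as np

def standardize(column):
    """Standardize one interval-scaled variable using the mean absolute
    deviation s_f, as in the formulas above."""
    x = np.asarray(column, dtype=float)
    m_f = x.mean()                      # m_f = (x_1f + ... + x_nf) / n
    s_f = np.abs(x - m_f).mean()        # mean absolute deviation
    return (x - m_f) / s_f              # z_if = (x_if - m_f) / s_f

print(standardize([2.0, 4.0, 6.0, 8.0]))   # -> [-1.5 -0.5  0.5  1.5]
```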
11
Some popular distance measures include the Minkowski distance:
$$d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$$
If q = 1, d is the Manhattan distance:
$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
12
If q = 2, d is the Euclidean distance:
$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
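A small Python sketch of these distances; the helper name `minkowski` and the example vectors are illustrative.

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance d(i, j) = (sum_f |x_f - y_f|^q)^(1/q).
    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

a, b = [1, 2, 3], [4, 6, 3]
print(minkowski(a, b, q=1))   # Manhattan: 3 + 4 + 0 = 7
print(minkowski(a, b, q=2))   # Euclidean: sqrt(9 + 16) = 5.0
```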
13
Binary Variables
A contingency table for binary data:

              Object j
               1    0   sum
Object i  1    a    b   a+b
          0    c    d   c+d
        sum   a+c  b+d   p

Distance measure for symmetric binary variables:
$$d(i,j) = \frac{b + c}{a + b + c + d}$$
Distance measure for asymmetric binary variables:
$$d(i,j) = \frac{b + c}{a + b + c}$$
Jaccard coefficient (similarity measure for asymmetric binary variables):
$$sim_{Jaccard}(i,j) = \frac{a}{a + b + c}$$
14
Example

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
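A short Python sketch of the usual reading of this example: gender is treated as a symmetric attribute and left out, the asymmetric attributes map Y and P to 1 and N to 0, and the Jaccard dissimilarity d(i, j) = (b + c)/(a + b + c) is computed for each pair of patients. The function name and encoding are ours.

```python
def jaccard_dissimilarity(x, y):
    """Jaccard dissimilarity d(i, j) = (b + c) / (a + b + c) for two asymmetric
    binary vectors (0/1 lists); the d cell (0-0 matches) is ignored."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (b + c) / (a + b + c)

# Y/P -> 1, N -> 0 for the six asymmetric attributes in the table above
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))   # 0.75
```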
15
Nominal Variables
16
Ordinal Variables
Replace x_{if} by its rank r_{if} ∈ {1, ..., M_f} and map the range of each variable onto [0, 1]:
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
17
Ratio-Scaled Variables
Methods:
18
Variables of Mixed Types
A database may contain several different types of variables; a weighted formula combines their effects:
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
If f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
If f is interval-based: use the normalized distance
If f is ordinal or ratio-scaled: compute the ranks r_{if}, set
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
and treat z_{if} as interval-scaled
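A hedged Python sketch of this weighted combination; the function signature, the per-type encoding, and the sample record are illustrative assumptions, not part of the original slides.

```python
def mixed_dissimilarity(x, y, types, ranges=None, m_levels=None):
    """Weighted mixed-type dissimilarity d(i, j) = sum_f delta_f * d_f / sum_f delta_f.
    types[f] is 'binary', 'nominal', 'interval', or 'ordinal'; ranges[f] is
    (max - min) for interval variables; m_levels[f] is the number of ordered
    states M_f for ordinal variables (values assumed to be ranks 1..M_f)."""
    num, den = 0.0, 0.0
    for f, t in enumerate(types):
        if x[f] is None or y[f] is None:
            continue                                    # delta = 0: skip missing values
        if t in ('binary', 'nominal'):
            d = 0.0 if x[f] == y[f] else 1.0
        elif t == 'interval':
            d = abs(x[f] - y[f]) / ranges[f]            # normalized distance
        else:  # ordinal: map rank to [0, 1] and treat as interval-scaled
            zx = (x[f] - 1) / (m_levels[f] - 1)
            zy = (y[f] - 1) / (m_levels[f] - 1)
            d = abs(zx - zy)
        num += d
        den += 1.0
    return num / den

# hypothetical record: (colour, weight in kg, quality rank out of 3)
print(mixed_dissimilarity(('red', 60.0, 1), ('blue', 75.0, 3),
                          types=('nominal', 'interval', 'ordinal'),
                          ranges={1: 50.0}, m_levels={2: 3}))   # -> ~0.77
```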
19
Vector Objects
Cosine measure
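For instance, a minimal Python sketch of the cosine measure on two hypothetical term-frequency vectors:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine measure: s(x, y) = (x . y) / (||x|| * ||y||)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(round(cosine_similarity(d1, d2), 3))   # ~0.94
```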
20
21
Partitioning approach:
Hierarchical approach:
Density-based approach:
22
Grid-based approach:
Model-based:
A model is hypothesized for each cluster, and the idea is to find the best fit of the data to the given model
Frequent pattern-based:
User-guided or constraint-based:
23
24
Centroid C_m, radius R_m, and diameter D_m of a cluster (for numerical data sets):
Centroid:
$$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$$
Radius:
$$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$$
Diameter (over all pairs of points in the cluster):
$$D_m = \sqrt{\frac{\sum_{i=1}^{N}\sum_{q=1}^{N} (t_{ip} - t_{iq})^2}{N(N-1)}}$$
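A small numpy sketch of the three quantities defined above; the sample points are illustrative.

```python
import numpy as np

def centroid_radius_diameter(points):
    """Compute the centroid Cm, radius Rm and diameter Dm of a cluster
    of numeric points (one row per point)."""
    t = np.asarray(points, dtype=float)
    n = len(t)
    cm = t.mean(axis=0)                               # centroid
    rm = np.sqrt(((t - cm) ** 2).sum() / n)           # radius
    diffs = t[:, None, :] - t[None, :, :]             # all pairwise differences
    dm = np.sqrt((diffs ** 2).sum() / (n * (n - 1)))  # diameter
    return cm, rm, dm

print(centroid_radius_diameter([[1, 1], [2, 2], [3, 3]]))
```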
25
26
27
28
Example (K = 2)
[Figure: the K-means iterations on a 2-D data set plotted on 0-10 axes.]
Arbitrarily choose K objects as initial cluster centers.
Assign each object to the most similar center.
Update the cluster means; reassign the objects.
Repeat updating the means and reassigning until no change.
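A compact Python sketch of the procedure illustrated above (plain Lloyd-style k-means); the initialization, iteration cap, and sample points are our own choices.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """k-means: pick k initial centers, assign each object to the most similar
    center, update the cluster means, and reassign until no change."""
    x = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()  # initial centers
    labels = np.zeros(len(x), dtype=int)
    for it in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)            # assign to nearest center
        if it > 0 and np.array_equal(new_labels, labels):
            break                                    # no reassignment: converged
        labels = new_labels
        for j in range(k):                           # update the cluster means
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels, centers

pts = [[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]]
print(kmeans(pts, k=2))
```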
29
Weakness
30
Dissimilarity calculations
31
32
PAM works effectively for small data sets, but does not
scale well for large data sets
33
[Figure: a typical K-medoids run (K = 2) on a 2-D data set plotted on 0-10 axes; total cost = 26 in this example.]
Arbitrarily choose k objects as initial medoids.
Assign each remaining object to the nearest medoid.
Randomly select a non-medoid object Orandom.
Compute the total cost of swapping; swap O and Orandom if the quality is improved.
Do loop until no change.
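A minimal Python sketch of the PAM swap idea shown above; for clarity it tries every medoid/non-medoid swap instead of sampling a random Orandom, and the sample points are illustrative.

```python
import numpy as np

def total_cost(x, medoid_idx):
    """Sum over all objects of the distance to the nearest medoid."""
    d = np.linalg.norm(x[:, None, :] - x[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(points, k, seed=0):
    """PAM-style k-medoids: start from arbitrary medoids and repeatedly swap a
    medoid with a non-medoid object whenever the swap reduces the total cost."""
    x = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(x), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for m in range(k):
            for o in range(len(x)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = o                 # try swapping medoid m with object o
                if total_cost(x, candidate) < total_cost(x, medoids):
                    medoids = candidate
                    improved = True
    return medoids

pts = [[2, 6], [3, 4], [3, 8], [4, 7], [6, 2], [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]]
print(pam(pts, k=2))
```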
34
35
[Figure: four 2-D plots (axes 0-10) illustrating the swapping cost C_jih for points i, h, j, t, including the cases C_jih = 0, C_jih = d(j, h) - d(j, t), and C_jih = d(j, t) - d(j, i).]
36
37
Weakness:
38
39
40
Hierarchical Clustering
[Figure: hierarchical clustering of objects a, b, c, d, e. Agglomerative clustering (AGNES) proceeds bottom-up, steps 1-4: a and b merge into ab, d and e into de, then cde, and finally abcde. Divisive clustering (DIANA) proceeds top-down through the same levels in reverse order.]
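A small Python sketch of bottom-up (AGNES-style) merging using single-link distances; the linkage choice and the example points are our own assumptions.

```python
import numpy as np

def agnes_single_link(points, k):
    """Agglomerative clustering: start with every object in its own cluster and
    repeatedly merge the two clusters whose closest members are nearest,
    until k clusters remain."""
    x = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(x))]
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])      # merge the two closest clusters
        del clusters[b]
    return clusters

pts = [[1, 1], [1.1, 1.2], [5, 5], [5.2, 5.1], [9, 9]]
print(agnes_single_link(pts, k=2))
```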
41
Go on in a non-descending fashion
42
43
44
45
BIRCH (1996)
46
Clustering feature example: CF = (5, (16, 30), (54, 190)) for the five points (3, 4), (2, 6), (4, 5), (4, 7), (3, 8).
[Figure: these five points plotted on 0-10 axes.]
47
CF-Tree in BIRCH
Clustering feature:
summary of the statistics for a given subcluster: the 0-th, 1st and
2nd moments of the subcluster from the statistical point of view.
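A tiny Python sketch that computes the clustering feature CF = (N, LS, SS) and reproduces the example above.

```python
import numpy as np

def clustering_feature(points):
    """CF = (N, LS, SS): number of points, linear sum, and square sum
    (the 0th, 1st and 2nd moments of the subcluster)."""
    x = np.asarray(points, dtype=float)
    return len(x), tuple(x.sum(axis=0)), tuple((x ** 2).sum(axis=0))

# the five points from the example above
print(clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
# -> (5, (16.0, 30.0), (54.0, 190.0))
```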
48
[Figure: a CF-tree. The root and non-leaf nodes hold entries CF1, CF2, CF3, ..., CF6 (here L = 6), each with a pointer child1, child2, ... to a child node; leaf nodes hold CF entries and are linked by prev/next pointers.]
49
50
C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
The Jaccard co-efficient may lead to a wrong clustering result:
within C1, similarity ranges from 0.2 ({a, b, c} vs. {b, d, e}) to 0.5 ({a, b, c} vs. {a, b, d});
across C1 and C2, it could be as high as 0.5 ({a, b, c} vs. {a, b, f}).
Jaccard co-efficient-based similarity function:
$$Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$$
Ex. Let T1 = {a, b, c} and T2 = {c, d, e}:
$$Sim(T_1, T_2) = \frac{|\{c\}|}{|\{a, b, c, d, e\}|} = \frac{1}{5} = 0.2$$
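The same computation in a few lines of Python, reproducing the numbers above:

```python
def jaccard_sim(t1, t2):
    """Sim(T1, T2) = |T1 intersect T2| / |T1 union T2| for two transactions (sets of items)."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

print(jaccard_sim({'a', 'b', 'c'}, {'c', 'd', 'e'}))   # 1/5 = 0.2
print(jaccard_sim({'a', 'b', 'c'}, {'a', 'b', 'f'}))   # 2/4 = 0.5
```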
51
C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a,
c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
52
A two-phase algorithm
1. Use a graph-partitioning algorithm to cluster objects into a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters
53
Overall Framework of CHAMELEON
[Figure: Data Set → Construct a Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters.]
54
55
56
57
Two parameters:
Eps: maximum radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of a point
NEps(p): {q belongs to D | dist(p, q) <= Eps}
Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if p belongs to NEps(q) and |NEps(q)| >= MinPts (core point condition)
[Figure: MinPts = 5, Eps = 1 cm.]
58
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi
Density-connected:
A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: points p, p1, q, o illustrating the two definitions.]
59
[Figure: a core point with Eps = 1 cm, MinPts = 5.]
60
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
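A minimal Python sketch of the DBSCAN loop described here: clusters are grown from core points, border points are absorbed but not expanded, and remaining points are labelled noise. The brute-force neighbourhood search and the sample data are illustrative.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: expand a cluster from every unvisited core point by
    repeatedly adding density-reachable points; noise keeps label -1."""
    x = np.asarray(points, dtype=float)
    n = len(x)
    labels = np.full(n, -1)             # -1 = noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster = 0

    def neighbours(i):
        return np.where(np.linalg.norm(x - x[i], axis=1) <= eps)[0]

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = list(neighbours(i))
        if len(seeds) < min_pts:
            continue                    # not a core point (border or noise)
        labels[i] = cluster
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                nj = neighbours(j)
                if len(nj) >= min_pts:  # j is a core point: expand further
                    seeds.extend(nj)
            if labels[j] == -1:
                labels[j] = cluster     # border points join but are not expanded
        cluster += 1
    return labels

pts = [[1, 1], [1.2, 1.1], [0.9, 1.0], [1.1, 0.9],
       [5, 5], [5.1, 5.2], [4.9, 5.1], [5.0, 4.9], [9, 1]]
print(dbscan(pts, eps=0.5, min_pts=3))   # two clusters plus one noise point
```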
61
DBSCAN: Sensitive to Parameters
62
63
64
Index-based:
k = number of dimensions, N = 20, p = 75%, M = N(1 - p) = 5
Complexity: O(kN^2)
Core distance and reachability distance:
[Figure: the core distance of an object o and the reachability distances of p1 and p2 from o (MinPts = 5).]
65
[Figure: OPTICS reachability plot, showing the reachability-distance (undefined for the first object) against the cluster order of the objects.]
66
67
Major features
Uses statistical density functions:
Influence function:
$$f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$$
Overall density function:
$$f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$
Gradient:
$$\nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$
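A short Python sketch of the Gaussian density function above; the sample points and sigma are illustrative.

```python
import numpy as np

def gaussian_density(x, data, sigma):
    """DENCLUE-style Gaussian density: f(x) = sum_i exp(-d(x, x_i)^2 / (2 sigma^2)),
    i.e. the summed Gaussian influence of every data point on x."""
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    d2 = ((data - x) ** 2).sum(axis=1)
    return float(np.exp(-d2 / (2 * sigma ** 2)).sum())

pts = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0]]
print(gaussian_density([1.1, 0.9], pts, sigma=1.0))   # high: near the first two points
print(gaussian_density([8.0, 8.0], pts, sigma=1.0))   # low: far from all points
```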
68
69
Density Attractor
70
71
72
73
74
75
Comments on STING
Advantages:
Disadvantages:
76
77
Wavelet Transform
78
Input parameters
# of grid cells for each dimension
the wavelet, and the # of applications of wavelet transform
Why is wavelet transformation useful for clustering?
Use hat-shaped filters to emphasize regions where points cluster and simultaneously suppress weaker information at their boundaries
Effective removal of outliers, multi-resolution, cost effective
Major features:
Complexity O(N)
Detect arbitrary shaped clusters at different scales
Not sensitive to noise, not sensitive to input order
Only applicable to low dimensional data
Both grid-based and density-based
79
Quantization
& Transformation
80
81
Model-Based Clustering
82
EM (Expectation Maximization)
An extension to k-means
General idea
83
Maximization step:
Estimation of model parameters
84
Conceptual Clustering
Conceptual clustering
A form of clustering in machine learning
Produces a classification scheme for a set of
unlabeled objects
Finds characteristic description for each concept
(class)
COBWEB (Fisher87)
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a
classification tree
Each node refers to a concept and contains a
probabilistic description of that concept
85
COBWEB Clustering
Method
A classification tree
86
Limitations of COBWEB
CLASSIT
Popular in industry
87
88
SOMs, also called topological ordered maps or Kohonen Self-Organizing Feature Maps (KSOMs)
The unit whose weight vector is closest to the current object wins
SOMs are believed to resemble processing that can occur in the brain
89
[Figure: the result of SOM clustering of 12,088 Web articles; the picture on the right drills down on the keyword "mining". Based on the websom.hut.fi Web page.]
90
91
Major challenges:
Methods
92
93
94
95
Identify clusters
[Figure: CLIQUE example. Dense units are identified in the (salary, age) and (vacation, age) subspaces and intersected to form candidate clusters; axes: age 20-60, salary (10,000) 0-7, vacation (week) 0-7, with a highlighted slice at Vacation = 3.]
97
Strength
Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
Insensitive to the order of records in input and does not presume any canonical data distribution
Scales linearly with the size of input and has good scalability as the number of dimensions in the data increases
Weakness
The accuracy of the clustering result may be degraded at the expense of simplicity of the method
98
Typical methods: CLIQUE, ProClus
99
100
Why p-Clustering?
Where
$$d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij}, \qquad
  d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij}, \qquad
  d_{IJ} = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} d_{ij}$$
A submatrix (I, J) is a δ-cluster if H(I, J) ≤ δ for some δ > 0.
101
p-Clustering: Clustering by Pattern Similarity
pScore of a 2 × 2 submatrix:
$$pScore\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = \left|(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})\right|$$
Properties of δ-pCluster:
Downward closure
For scaling patterns, one can observe that taking the logarithm on d_{xa}/d_{ya} ÷ d_{xb}/d_{yb} ≤ δ will lead to the pScore form
102
103
104
User-specified constraints
105
106
107
108
109
110
Outlier Discovery: Statistical Approaches
111
112
Density-Based Local Outlier Detection
Distance-based outlier detection is based on the global distance distribution
It encounters difficulties identifying outliers if the data is not uniformly distributed
Ex. C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, plus 2 outlier points o1, o2
A distance-based method cannot identify o2 as an outlier
Need the concept of a local outlier
113
114
115
Summary
116
117
References (1)
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
F. Beil, M. Ester, and X. Xu. Frequent Term-Based Text Clustering. KDD'02.
M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying Density-Based Local Outliers. SIGMOD'00.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
118
References (2)
S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. ICDE'99, pp. 512-521, Sydney, Australia, March 1999.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
119
References (3)
L. Parsons, E. Haque, and H. Liu. Subspace Clustering for High Dimensional Data: A Review. SIGKDD Explorations, 6(1), June 2004.
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition.
H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02.
120