Clustering
Pattern Classification by
Distance Functions
The method of pattern classification by
distance functions can be expected to yield
practical and satisfactory results only when
the pattern classes tend to have clustering
properties
Contd..
Unlike Fig. 2, in Fig. 1 the patterns of each
class tend to cluster tightly about a typical or
representative pattern for the class. This
occurs in cases where pattern variability and
other corruptive influences are well behaved.
A typical example is the problem of reading
bank checks by machine.
Contd..
Here the proximity of an unknown pattern to
the patterns of a class serves as the measure
for its classification, so the method is called
minimum-distance pattern classification, and
a system which carries it out is called a
minimum-distance classifier.
Single Prototypes
In this case, the patterns of each class tend to
cluster tightly about a typical or
representative pattern for that class. Under
these conditions, minimum-distance classifiers
can constitute a very effective approach to the
classification problem.
Contd..
Consider M pattern classes and assume that
these classes are representable by prototype
patterns Z_1, Z_2, ..., Z_M. The Euclidean distance
between an arbitrary pattern vector X and the
ith prototype is given by
D_i = ||X − Z_i|| = [(X − Z_i)^T (X − Z_i)]^{1/2}    ...(1)
Contd..
A minimum-distance classifier computes the
distance from a pattern x of unknown
classification to the prototype of each class,
and assigns the pattern to the class to which it
is closest.
Contd..
In other words, X is assigned to class w_i if
D_i < D_j for all j ≠ i. Ties are resolved arbitrarily.
Equation (1) may be expressed as
D_i² = ||X − Z_i||² = (X − Z_i)^T (X − Z_i)
     = X^T X − 2 X^T Z_i + Z_i^T Z_i
     = X^T X − 2 (X^T Z_i − (1/2) Z_i^T Z_i)
Contd..
Choosing the minimum D_i² is equivalent to
choosing the minimum D_i, since all distances
are positive. Because the term X^T X is independent
of i, choosing the minimum D_i² is in turn equivalent
to choosing the maximum of (X^T Z_i − (1/2) Z_i^T Z_i).
Consequently, we may define the decision
functions
Contd..
d_i(X) = X^T Z_i − (1/2) Z_i^T Z_i,   i = 1, 2, ..., M    ...(2)
and X is assigned to class w_i if d_i(X) > d_j(X) for all j ≠ i.
Contd..
Equation (2) is a linear decision function. Writing it as
d_i(X) = W_i^T X, the weights are
w_ij = z_ij,   j = 1, 2, ..., n,    w_{i,n+1} = −(1/2) Z_i^T Z_i
and X = (x_1, x_2, ..., x_n, 1)^T is the augmented pattern vector.
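As an illustration, the sketch below applies the decision functions of Eq. (2) to the single-prototype case. The prototype vectors and the test pattern are assumed values chosen only for demonstration, not data from the slides.

```python
import numpy as np

def min_distance_classify(x, prototypes):
    """Assign x to the class whose prototype maximizes
    d_i(x) = x^T z_i - 0.5 * z_i^T z_i (Eq. (2))."""
    scores = [x @ z - 0.5 * (z @ z) for z in prototypes]
    return int(np.argmax(scores))

# Illustrative prototypes for M = 2 classes and a test pattern
Z = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
x = np.array([1.0, 1.5])
print(min_distance_classify(x, Z))   # x is nearest to the first prototype -> class 0
```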
Contd..
The decision boundary of a two-class
example in which each class is characterized
by a single prototype is shown below.
Contd..
It can be shown that the linear decision
surface separating every pair of prototype
points Z_i and Z_j is the hyperplane which is the
perpendicular bisector of the line segment
joining the two points. Thus the minimum-distance
classifiers are a special case of linear
classifiers.
Multiprototypes
Here each class is characterized by several
prototypes, that is, each pattern of class wi
tends to cluster about one of the prototypes
Z_i^1, Z_i^2, ..., Z_i^{N_i}, where N_i is the number of
prototypes in the ith pattern class.
Contd..
Following the development for the single-prototype case,
the decision functions are
d_i(X) = max_l { X^T Z_i^l − (1/2) (Z_i^l)^T Z_i^l },   l = 1, 2, ..., N_i
where, as before, X is placed in class w_i if
d_i(X) > d_j(X) for all j ≠ i.
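A possible extension of the previous sketch to the multiprototype case is shown below; again the prototypes per class and the test pattern are illustrative assumptions.

```python
import numpy as np

def multiprototype_classify(x, class_prototypes):
    """class_prototypes[i] is the list of prototypes Z_i^1, ..., Z_i^{N_i}
    of class i; d_i(x) = max over l of (x^T Z_i^l - 0.5 * ||Z_i^l||^2)."""
    d = [max(x @ z - 0.5 * (z @ z) for z in protos)
         for protos in class_prototypes]
    return int(np.argmax(d))

# Illustrative two-class problem with two prototypes per class
protos = [[np.array([0.0, 0.0]), np.array([1.0, 5.0])],   # class 0
          [np.array([5.0, 1.0]), np.array([6.0, 6.0])]]   # class 1
print(multiprototype_classify(np.array([5.5, 5.0]), protos))  # nearest prototype is (6,6) -> 1
```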
Contd..
Note that the boundaries between classes w_i
and w_j are piecewise linear. Since we could
have defined this as a single-prototype, four-class
problem, the sections of the boundaries
are the perpendicular bisectors of the lines
joining the prototypes of different classes.
Contd..
Although general iterative algorithms exist
which can be used in the calculation of linear
decision function parameters, unfortunately,
no truly general algorithm is yet known for the
piecewise-linear classifier decision function.
Cluster Seeking
It is evident that the ability to determine
characteristic prototypes or cluster centers
in a given set of data plays a central role in
the design of pattern classifiers based on
the minimum-distance concept. So we need
to study various cluster-seeking methods.
Contd..
Note that the performance of a given
algorithm is not only dependent on the type of
data being analyzed, but is also strongly
influenced by the chosen measure of pattern
similarity.
Measures of Similarity
To define a data cluster, it is necessary to first
define a measure of similarity which will
establish a rule for assigning patterns to the
domain of a particular cluster center. Earlier
we considered the Euclidean distance between
two patterns x and z:
D = ||X − Z||
as a measure of their similarity----the smaller
the distance, the greater the similarity.
Contd..
There are, however, other meaningful
distance measures which are sometimes
useful. For example, the Mahalanobis distance
from x to m.
D = (X − m)^T C⁻¹ (X − m)    ...(1)
Contd..
It is a useful measure of similarity when
statistical properties are being explicitly
considered. Here C is the covariance matrix of
a pattern population, m is the mean vector,
and x represents a variable pattern.
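The sketch below computes the Mahalanobis distance of Eq. (1); the mean vector and covariance matrix used here are assumed values, not statistics of any particular data set.

```python
import numpy as np

def mahalanobis(x, m, C):
    """D = (x - m)^T C^{-1} (x - m), with C the covariance matrix
    and m the mean vector of the pattern population (Eq. (1))."""
    diff = x - m
    return float(diff @ np.linalg.solve(C, diff))   # solve avoids forming C^{-1}

# Illustrative population statistics (assumed values)
m = np.array([2.0, 3.0])
C = np.array([[4.0, 1.0],
              [1.0, 2.0]])
print(mahalanobis(np.array([3.0, 5.0]), m, C))
```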
Contd..
Measures of similarity need not be restricted
to distance measures. For example, the non-metric similarity function
S(X, Z) = X^T Z / (||X|| ||Z||)    ...(2)
Contd..
This measure of similarity is useful when
cluster regions tend to develop along
principal axes, as shown below.
s(X, Z_1) = X^T Z_1 / (||X|| ||Z_1||) = cos θ_1,    s(X, Z_2) = X^T Z_2 / (||X|| ||Z_2||) = cos θ_2
Figure 4. Illustration of a similarity measure
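A minimal sketch of the similarity function of Eq. (2), i.e. the cosine of the angle between the two pattern vectors, using assumed example vectors:

```python
import numpy as np

def cosine_similarity(x, z):
    """S(x, z) = x^T z / (||x|| ||z||), the cosine of the angle
    between the two pattern vectors (Eq. (2))."""
    return float(x @ z / (np.linalg.norm(x) * np.linalg.norm(z)))

x  = np.array([2.0, 2.0])
z1 = np.array([1.0, 1.0])   # same direction as x
z2 = np.array([1.0, 0.0])   # 45 degrees away from x
print(cosine_similarity(x, z1))   # 1.0
print(cosine_similarity(x, z2))   # about 0.707
```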
Contd..
Here the use of this similarity measure is
governed by certain qualifications, such as
sufficient separation of cluster regions with
respect to each other as well as with respect
to the coordinate system origin.
Contd..
When the patterns under consideration are
binary valued with 0, 1 elements, the
similarity function of Eq. (2) may be given an
interesting non-geometrical interpretation.
Contd..
We say that a binary pattern X possesses the
ith attribute if x_i = 1. Then the term X^T Z in Eq. (2) is
simply the number of attributes shared by X
and Z, while ||X|| ||Z|| = [(X^T X)(Z^T Z)]^{1/2} is the geometric
mean of the number of attributes possessed
by X and the number possessed by Z. In this
case the similarity function S(X, Z) is,
therefore, seen to be a measure of common
attributes possessed by the binary vectors X
and Z.
Contd..
A binary variation of Eq. (2), which has been
widely used in information retrieval, nosology
(classification of diseases), and taxonomy
(classification of plants and animals), is the so-called
Tanimoto measure, given by
S(X, Z) = X^T Z / (X^T X + Z^T Z − X^T Z)
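A small sketch of the Tanimoto measure, evaluated on assumed binary patterns:

```python
import numpy as np

def tanimoto(x, z):
    """S(x, z) = x^T z / (x^T x + z^T z - x^T z)."""
    xz = float(x @ z)
    return xz / (float(x @ x) + float(z @ z) - xz)

# For binary patterns this is the ratio of shared attributes
# to the number of distinct attributes possessed by x or z.
x = np.array([1, 0, 1, 1, 0])
z = np.array([1, 1, 1, 0, 0])
print(tanimoto(x, z))   # 2 shared / (3 + 3 - 2) = 0.5
```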
Clustering Criteria
After a measure of pattern similarity has been
adopted, we are still faced with the problem of
specifying a procedure for partitioning the given
data into cluster domains.
Contd..
The clustering criterion used may represent a
heuristic scheme, or it may be based on the
minimization (or maximization) of a certain
performance index. The heuristic approach is
guided by intuition and experience.
Contd..
The Euclidean distance is generally used as
a measure of proximity. A suitable threshold is
also necessary in order to define degrees of
acceptable similarity in the cluster-seeking
process.
Contd..
The performance-index (sometimes called
objective function) approach is guided by the
development of a procedure which will
minimize or maximize the chosen performance
index. One of the most often used indices is
the sum of the squared errors index, given
by..
Contd..
J = Σ_{j=1}^{N_C} Σ_{X ∈ S_j} ||X − m_j||²
where N_C is the number of cluster domains, S_j is the set of
samples belonging to the jth domain, and
m_j = (1/N_j) Σ_{X ∈ S_j} X
is the sample mean vector of S_j, N_j being the number of samples in S_j.
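A possible computation of the sum-of-squared-errors index J for given cluster domains is sketched below; the two sample domains are assumed values used only for illustration.

```python
import numpy as np

def sse_index(clusters):
    """J = sum_j sum_{x in S_j} ||x - m_j||^2, where m_j is the
    sample mean of cluster domain S_j."""
    J = 0.0
    for S in clusters:                 # S is an (N_j, n) array of samples
        m = S.mean(axis=0)             # cluster mean m_j
        J += float(((S - m) ** 2).sum())
    return J

# Two illustrative cluster domains
S1 = np.array([[0.0, 0.0], [1.0, 1.0]])
S2 = np.array([[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
print(sse_index([S1, S2]))
```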
Contd..
Other common performance indices are :
the average squared distances between
samples in a cluster domain
the average squared distances between
samples in different cluster domains
indices based on the scatter matrix concept,
minimum and maximum-variance indices
a score of other performance measures which
have been used throughout the years.
Contd..
It is not uncommon to find a cluster-seeking
algorithm that represents a combination of the
heuristic and performance index approaches.
The Isodata clustering algorithm is such a
combination.
A Simple Cluster-Seeking
Algorithm
Suppose that we take the first sample X_1 as the first
cluster center, Z_1 = X_1. If the distance D_21 from the
next sample X_2 to Z_1 is greater than a chosen
non-negative threshold T, a new cluster center Z_2 = X_2
is started; otherwise X_2 is assigned to the domain of
cluster center Z_1. Next, the distances D_31 and D_32 from
the sample X_3 to Z_1 and Z_2 are computed.
Contd..
If both D_31 and D_32 are greater than T, a new
cluster center Z_3 = X_3 is created. Otherwise,
we assign X_3 to the domain of the cluster
center to which it is closest. This process is
repeated for all patterns.
Contd..
This procedure may be expected to yield
useful results in situations where the data
exhibit characteristic "pockets" which are
reasonably well separated with respect to the
chosen values of the threshold.
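The sketch below implements the threshold-based procedure described above. The threshold T = 3 and the use of the eight two-dimensional points from the later Isodata example are choices made only for illustration; the result depends on both T and the order in which the samples are presented.

```python
import numpy as np

def simple_cluster_seeking(samples, T):
    """Threshold-based cluster seeking: the first sample becomes the
    first cluster center; each subsequent sample either joins the
    nearest existing center or, if farther than T from all of them,
    starts a new cluster center."""
    centers = [samples[0]]
    labels = [0]
    for x in samples[1:]:
        dists = [np.linalg.norm(x - z) for z in centers]
        j = int(np.argmin(dists))
        if dists[j] > T:
            centers.append(x)              # create a new cluster center
            labels.append(len(centers) - 1)
        else:
            labels.append(j)               # assign to the closest center
    return centers, labels

pts = np.array([[0, 0], [1, 1], [2, 2], [4, 3], [4, 4], [5, 3], [5, 4], [6, 5]], float)
centers, labels = simple_cluster_seeking(pts, T=3.0)
print(len(centers), labels)
```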
Maximin-Distance
Algorithm
The maximin-distance algorithm is another
simple heuristic procedure based on the
Euclidean distance concept. Consider the ten
two-dimensional samples shown in the figure
below.
Contd..
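Since the slides illustrate the maximin-distance procedure only graphically, the sketch below follows a commonly described version of the idea, with details assumed: start from one sample, take the farthest sample as the second center, then keep promoting the sample whose minimum distance to the existing centers is largest, as long as that maximin distance exceeds a chosen fraction (here one half) of the average inter-center distance.

```python
import numpy as np

def maximin_centers(samples, fraction=0.5):
    """Maximin-distance sketch (assumed details): grow the set of cluster
    centers while the largest 'minimum distance to a center' is still an
    appreciable fraction of the average distance between centers."""
    X = np.asarray(samples, float)
    centers = [X[0]]                                  # arbitrary first center
    d = np.linalg.norm(X - centers[0], axis=1)
    centers.append(X[int(np.argmax(d))])              # farthest sample becomes center 2
    while True:
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        k = int(np.argmax(dists))                     # candidate new center
        pairwise = [np.linalg.norm(a - b) for i, a in enumerate(centers)
                    for b in centers[i + 1:]]
        if dists[k] > fraction * np.mean(pairwise):
            centers.append(X[k])
        else:
            return np.array(centers)

pts = np.array([[0, 0], [1, 1], [2, 2], [4, 3], [4, 4], [5, 3], [5, 4], [6, 5]], float)
print(maximin_centers(pts))
```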
K-means Algorithm
Clustering is an unsupervised technique
used in discovering the inherent structure
present in a set of patterns. In fact,
clustering techniques aim to extract the
groups present in a given data set, and
each such group is termed a cluster.
Let the set of patterns be
S = {x_1, x_2, ..., x_n} ⊂ R^m, where x_i is the ith
pattern vector, n is the total number of
patterns, and m is the dimensionality of
the feature space.
Contd..
Let the number of clusters be K. If the
clusters are represented by C_1, C_2, ..., C_K,
then we assume that
C_i ≠ ∅ for i = 1, ..., K,   C_i ∩ C_j = ∅ for i ≠ j,   and   C_1 ∪ C_2 ∪ ... ∪ C_K = S.
Contd..
Clustering techniques may broadly be divided
into two categories: hierarchical and non-hierarchical.
The non-hierarchical or partitional
clustering problem deals with obtaining an
optimal partition of S into K subsets such that
some clustering criterion is satisfied.
Contd..
Among the partitional clustering techniques,
the K-means algorithm has been one of the
most widely used. Here the value
of K needs to be known a priori. The principle
used for clustering by the K-means algorithm is to
minimize the sum of intraclass distances to
obtain the optimal clusters. Mathematically, this
principle is stated below.
Contd..
1. Let C_1, C_2, ..., C_K be a set of K clusters of S.
2. Let
z_j = (1/#C_j) Σ_{x ∈ C_j} x
be the center of cluster C_j, where #C_j denotes the number of patterns in C_j.
Contd..
3. Let
f(C_1, C_2, ..., C_K) = Σ_{j=1}^{K} Σ_{x ∈ C_j} ||x − z_j||²
The clustering problem is to find the partition C_1, C_2, ..., C_K that minimizes f(C_1, C_2, ..., C_K).
Contd..
All possible clusterings of S are to be
considered to get the optimal C1 , C 2 ,......, C k .
So obtaining the exact solution of the problem
is theoretically possible, yet not feasible in
practice due to limitations of computer storage
and time. One requires the evaluation of
S (n, k ) partitions if exhaustive enumeration is
used to solve the problem, where
S(n, k) = (1/k!) Σ_{j=1}^{k} (−1)^{k−j} C(k, j) j^n
where C(k, j) is the binomial coefficient.
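A small sketch evaluating S(n, k) with the formula above, which shows how quickly the number of possible partitions grows:

```python
from math import comb, factorial

def num_partitions(n, k):
    """S(n, k) = (1/k!) * sum_{j=1}^{k} (-1)^(k-j) C(k, j) j^n,
    the number of ways to partition n patterns into k non-empty clusters."""
    total = sum((-1) ** (k - j) * comb(k, j) * j ** n for j in range(1, k + 1))
    return total // factorial(k)        # the sum is exactly divisible by k!

print(num_partitions(10, 3))     # 9330 partitions for only 10 patterns and 3 clusters
print(num_partitions(100, 5))    # astronomically many for 100 patterns
```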
Contd..
Thus, approximate heuristic techniques
seeking a compromise or looking for an
acceptable solution have usually been
adopted. One such method is Forgy's K-means algorithm.
Algorithm K-means
Step 1 : Select an initial cluster configuration.
Repeat
Step 2 : Calculate the cluster centers z_j, j = 1, 2, ..., K,
of the existing groups.
Step 3 : Redistribute the patterns among the clusters
utilizing the minimum squared Euclidean distance
classifier concept:
x_i ∈ C_j if ||x_i − z_j||² ≤ ||x_i − z_l||² for all l ∈ {1, 2, ..., K}, l ≠ j
Until (there is no change in cluster centers)
End
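A minimal sketch of the K-means iteration described above, assuming a Forgy-style random choice of K initial centers and squared Euclidean distances; the data and K = 2 in the usage lines are illustrative.

```python
import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    """Alternate between reassigning each pattern to its nearest center
    (Step 3) and recomputing the group centers (Step 2) until the
    centers stop changing."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]        # Step 1
    for _ in range(max_iter):
        # Step 3: minimum squared Euclidean distance assignment
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2: recompute the center of each non-empty group
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X = np.array([[0, 0], [1, 1], [2, 2], [4, 3], [4, 4], [5, 3], [5, 4], [6, 5]], float)
centers, labels = k_means(X, K=2)
print(centers)
print(labels)
```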
Contd..
Although no general proof of convergence
exists for this algorithm, it can be expected to
yield acceptable results when the data exhibit
characteristic pockets which are relatively far
from each other.
Isodata Algorithm
The Isodata (Iterative Self-Organizing Data
Analysis) is similar in principle to the K-means
procedure in the sense that cluster centers are
iteratively determined sample means. Unlike
the latter algorithm, however, Isodata
represents a fairly comprehensive set of
additional heuristic procedures which have
been incorporated into an interactive scheme.
Before executing the algorithm it is necessary
to specify a set of N_C initial cluster centers
Z_1, Z_2, ..., Z_{N_C}. For a set of N samples {X_1, X_2, ..., X_N},
Isodata consists of the following steps.
Contd..
Step 1. Specify the following process
parameters:
K = number of cluster centers desired;
θ_N = a parameter against which the number of
samples in a cluster domain is compared;
θ_S = standard deviation parameter;
θ_C = lumping parameter;
L = maximum number of pairs of cluster
centers which can be lumped;
I = number of iterations allowed.
Contd..
Step 2. Distribute the N samples among the
present cluster centers, using the relation
X ∈ S_j if ||X − Z_j|| < ||X − Z_i||,   i = 1, 2, ..., N_C; i ≠ j
where S_j denotes the subset of samples assigned to cluster center Z_j.
Contd..
Step 3. Discard sample subsets with fewer than θ_N members;
that is, if for any j we have N_j < θ_N, discard S_j and reduce N_C by 1.
Step 4. Update each cluster center Z_j, j = 1, 2, ..., N_C,
by setting it equal to the sample
mean of its corresponding set S_j; that is,
Z_j = (1/N_j) Σ_{X ∈ S_j} X,   j = 1, 2, ..., N_C
where N_j is the number of samples in S_j.
Step 5. Compute the average distance D_j of the samples in
cluster domain S_j from their corresponding cluster center,
using the relation
D_j = (1/N_j) Σ_{X ∈ S_j} ||X − Z_j||,   j = 1, 2, ..., N_C
Contd..
Step 6. Compute the overall average
distance of the samples from their
respective cluster centers, using the
relation
D = (1/N) Σ_{j=1}^{N_C} N_j D_j
Contd..
Step 7.
(a) If this is the last iteration, set θ_C = 0
and go to Step 11.
(b) If N_C ≤ K/2, go to Step 8.
(c) If this is an even-numbered iteration,
or if N_C ≥ 2K, go to Step 11; otherwise,
continue.
Contd..
Step 8. Find the standard deviation vector
σ_j = (σ_1j, σ_2j, ..., σ_nj)^T for each sample subset, using
the relation
σ_ij = [ (1/N_j) Σ_{X ∈ S_j} (x_i − z_ij)² ]^{1/2},   i = 1, 2, ..., n;  j = 1, 2, ..., N_C
where x_i is the ith component of sample X and z_ij is the ith component of Z_j.
Contd..
Step 9. Find the maximum component of each σ_j, j = 1, 2, ..., N_C,
and denote it by σ_jmax.
Contd..
Step 10. If for any σ_jmax we have σ_jmax > θ_S and either
(a) D_j > D and N_j > 2(θ_N + 1), or (b) N_C ≤ K/2,
then split Z_j into two new cluster centers Z_j+ and Z_j−,
delete Z_j, and increase N_C by 1.
Cluster center Z_j+ is formed by adding a given
quantity γ_j to the component of Z_j which
corresponds to the maximum component of σ_j;
Z_j− is formed by subtracting γ_j from the same component.
Contd..
The basic requirement in choosing γ_j is that it
be sufficient to provide a detectable difference
in the distance from an arbitrary sample to the
two new cluster centers, but not so large as to
change the overall cluster domain
arrangement appreciably.
If splitting took place in this step, go to Step
2; otherwise continue.
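The splitting operation of Steps 8–10 can be sketched as below. The choice γ_j = 0.5 σ_jmax is the one used later in the worked example, and the numbers in the usage line are taken from that example; both are illustrative rather than required.

```python
import numpy as np

def split_cluster(z, sigma, gamma_frac=0.5):
    """Isodata-style split: perturb the component of Z_j that has the
    largest standard deviation by +/- gamma_j, where gamma_j is taken
    here as gamma_frac * sigma_jmax (an assumed choice). The two
    returned centers replace the old Z_j."""
    i_max = int(np.argmax(sigma))          # coordinate with the largest spread
    gamma = gamma_frac * sigma[i_max]
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i_max] += gamma                 # Z_j+
    z_minus[i_max] -= gamma                # Z_j-
    return z_plus, z_minus

# Values from the worked example: Z_1 = (3.38, 2.75), sigma_1 = (1.99, 1.56)
zp, zm = split_cluster(np.array([3.38, 2.75]), np.array([1.99, 1.56]))
print(zp, zm)   # approximately (4.38, 2.75) and (2.38, 2.75)
```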
Contd..
Step 11. Compute the pairwise distances D ij
between all cluster centers :
D_ij = ||Z_i − Z_j||,   i = 1, 2, ..., N_C − 1;  j = i + 1, ..., N_C
Contd..
Step 12. Compare the distances D_ij against
the parameter θ_C. Arrange the L smallest
distances which are less than θ_C in
ascending order:
[D_{i1 j1}, D_{i2 j2}, ..., D_{iL jL}]
where D_{i1 j1} ≤ D_{i2 j2} ≤ ... ≤ D_{iL jL} and L is the maximum
number of pairs of cluster centers which can
be lumped together. The lumping process is
discussed in the next step.
Contd..
Step 13. With each distance D_{il jl} there is
associated a pair of cluster centers Z_{il} and Z_{jl}.
Starting with the smallest of these distances,
perform a pairwise lumping operation according
to the following rule:
For l = 1, 2, ..., L, if neither Z_{il} nor Z_{jl} has been used
in lumping in this iteration, merge these two
cluster centers using the relation
Z_l* = [N_{il} Z_{il} + N_{jl} Z_{jl}] / (N_{il} + N_{jl}),
delete Z_{il} and Z_{jl}, and reduce N_C by 1.
Contd..
It is noted that only pairwise lumping is
allowed and that a lumped cluster center is
obtained by weighting each old cluster center
by the number of samples in its domain.
Experimental evidence indicates that more
complex lumping can produce unsatisfactory
results. The above procedure makes the
lumped cluster centers representative of the
true average point of the combined subsets. It
is also important to note that, since a cluster
center can be lumped only once, this step will
not always result in L lumped centers.
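A one-line sketch of the lumping relation of Step 13, with assumed sample counts and center coordinates:

```python
import numpy as np

def lump_centers(z_i, n_i, z_j, n_j):
    """Isodata lumping (Step 13): the merged center is the mean of the
    two old centers weighted by the number of samples in their domains."""
    return (n_i * z_i + n_j * z_j) / (n_i + n_j)

# Illustrative values: two nearby centers with 5 and 3 samples
print(lump_centers(np.array([2.0, 2.0]), 5, np.array([2.4, 1.8]), 3))
```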
Contd..
Step 14. If this is the last iteration, the
algorithm terminates. Otherwise go to Step
1 if any of the process parameters requires
changing at the user's discretion, or go to
Step 2 if the parameters are to remain the
same for the next iteration. An iteration is
counted every time the procedure returns to
Step 1 or 2.
Example
Example: Let the patterns be {(0,0), (1,1),
(2,2), (4,3), (4,4), (5,3), (5,4), (6,5)}
In this case N = 8 and n = 2. Suppose that
we initially let N_C = 1 and Z_1 = (0, 0)^T, and specify the
following parameters:
Step 1. K = 2, θ_N = 1, θ_S = 1, θ_C = 4, L = 0, I = 4.
Contd..
If no a priori information on the data being
analyzed is available, these parameters are
arbitrarily chosen and then adjusted during
successive iterations through the algorithm.
Step 2. Since there is only one cluster
center,
S_1 = {X_1, X_2, ..., X_8} and N_1 = 8.
Contd..
Step 4. Update the cluster centers:
Z_1 = (1/N_1) Σ_{X ∈ S_1} X = (3.38, 2.75)^T
Step 5. Compute D_j:
D_1 = (1/N_1) Σ_{X ∈ S_1} ||X − Z_1|| = 2.26
Contd..
Step 6. Compute D; in this case, D = D_1 = 2.26.
Step 7. Since N_C ≤ K/2, go to Step 8.
Step 8. Compute the standard deviation vector of S_1:
σ_1 = (1.99, 1.56)^T
Contd..
Step 9. The maximum component of σ_1 is
1.99; hence, σ_1max = 1.99.
Step 10. Since σ_1max > θ_S and N_C ≤ K/2, we
split Z_1 into two new cluster centers. Following the
procedure described in Step 10, suppose
that we let γ_1 = 0.5 σ_1max ≈ 1.0. Then
Z_1+ = (4.38, 2.75)^T,    Z_1− = (2.38, 2.75)^T
Contd..
For convenience these two cluster centers
are renamed Z_1 and Z_2, respectively. Also, N_C
is increased by 1. Go to Step 2.
Step 2. The sample sets are now
S_1 = {X_4, X_5, X_6, X_7, X_8},    S_2 = {X_1, X_2, X_3}
and N_1 = 5, N_2 = 3.
Step 3. Since both N_1 and N_2 are greater
than θ_N, no subsets are discarded.
Contd..
Step 4. Update the cluster centers:
Z_1 = (1/N_1) Σ_{X ∈ S_1} X = (4.80, 3.80)^T,
Z_2 = (1/N_2) Σ_{X ∈ S_2} X = (1.00, 1.00)^T
Step 5. Compute D_j:
D_1 = (1/N_1) Σ_{X ∈ S_1} ||X − Z_1|| = 0.80,
D_2 = (1/N_2) Σ_{X ∈ S_2} ||X − Z_2|| = 0.94
Contd..
Step 6. Compute D:
D = (1/N) Σ_{j=1}^{N_C} N_j D_j = (1/8)[5(0.80) + 3(0.94)] = 0.85
Step 7. Since this is an even-numbered iteration, condition (c) of
Step 7 is satisfied; go to Step 11.
Contd..
Step 11. Compute the pairwise distances:
D_12 = ||Z_1 − Z_2|| = 4.72
Step 12. Compare D_12 to θ_C. In this case, D_12 > θ_C.
Step 13. From the results of Step 12, we see
that no lumping of cluster centers can take
place.
Step 14. Since this is not the last iteration,
we are faced with the decision of whether or
not to alter the process parameters. Since, in this
simple example, nothing observed so far suggests that
the parameters should be changed, we let them remain
the same and return to Step 2.
Contd..
Steps 2–6 yield the same results as in the
previous iteration.
Step 7. None of the conditions in this step is
satisfied. Therefore, we proceed to Step 8.
Step 8. Compute the standard deviation vectors of the sets
S_1 = {X_4, X_5, X_6, X_7, X_8} and S_2 = {X_1, X_2, X_3}:
σ_1 = (0.75, 0.75)^T,    σ_2 = (0.82, 0.82)^T
Contd..
Step 9. In this case σ_1max = 0.75 and σ_2max = 0.82.
Step 10. The conditions for splitting are not
satisfied. Therefore, we proceed to Step 11.
Step 11. We obtain the same result as in the
previous iteration:
D_12 = ||Z_1 − Z_2|| = 4.72
Contd..
Step 13. We obtain the same result as in
the previous iteration.
Step 14. Nothing new has been added in
this iteration, except the computation of the
standard deviation vectors. Therefore, we
return to Step 2.
Steps 2–6 yield the same results as in
the previous iteration.
Step 7. Since this is the last iteration, we
set θ_C = 0 and go to Step 11.
Contd..
Step 11. D_12 = ||Z_1 − Z_2|| = 4.72, as before.
Step 12. We obtain the same result as in
the previous iteration.
Step 13. From the results of Step 12, we
see that no lumping can take place.
Step 14. Since this is the last iteration, the
algorithm is terminated.
Contd..
It should be clear, even from the above
simple example, that the application of
Isodata to a set of moderately complex
data requires, in general, extensive
experimentation before one can arrive at
meaningful conclusions. However, by
properly organizing the information
obtained in each iteration, it is possible to
gain considerable insight into the structure
of the data.
Evaluation of Clustering
Results
The principal difficulty in evaluating the
results of clustering algorithms is the inability to
visualize the geometrical properties of a
high-dimensional space. Several
interpretation techniques exist which allow
at least partial insight into the geometrical
properties of the resulting cluster domains.
Contd..
A very useful interpretation tool is the distance
between cluster centers. This information is
best presented in the form of a table.
Contd..
Table of pairwise distances D_ij between the cluster centers Z_1, Z_2, ..., Z_5 (diagonal entries 0.0; the off-diagonal entries include 2.1, 4.8, 14.7, 49.3, and 50.6).
Contd..
Cluster center Z 5 is far removed from the
other cluster centers. If it is known that this
cluster center is associated with numerous
samples, we would accept it as a valid
description of the data. Otherwise we might
dismiss this cluster center as representative
of noise samples.
Contd..
If two cluster centers are relatively close,
and one of the centers is associated with a
much larger number of samples, it is often
possible to merge the two cluster domains.
The variances of a cluster domain about its
mean can be used to infer the relative
distribution of the samples in the domain.
Contd..
Cluster Domain    Variances
S1                1.2    0.9    0.7    1.0
S2                2.0    1.3    1.5    0.9
S3                3.7    4.8    7.3    10.4
S4                0.3    0.8    0.7    1.1
S5                4.2    5.4    18.3   3.3
Contd..
It is assumed that each variance
component is along the direction of one of
the coordinate axes.
Note :
Since domain S1 has very similar variances,
it can be expected to be roughly spherical in
nature.
Cluster domain S5 on the other hand, is
significantly elongated about the third
coordinate axis. A similar analysis can be
carried out for the other domains.
Contd..
This information, coupled with the distance
table and sample numbers, can be of
significant value in interpreting clustering
results.
Graph-Theoretic Approach
So far, the clusters have been determined in such a way
that the intraset distances within each cluster are kept
to a minimum, and the interset distances between
clusters are made as large as possible.
Contd..
An alternative approach to cluster seeking is to make
use of some basic notions of graph theory. In this
approach, a pattern graph is first constructed from
the given sample patterns, which form the nodes of
the graph. A node j is connected to a node k by an
edge if the patterns corresponding to these two
nodes are similar or related.
Contd..
Pattern X j and pattern X k are said to be similar if the
similarity measure S ( X j, X k) is greater than a prespecified threshold T. The similarity measure may be
used to generate a similarity matrix S, whose
elements are 0 or 1. The similarity matrix provides a
systematic way to construct the pattern graph. Since
cliques of a pattern graph form the clusters of the
patterns, cluster seeking can be accomplished by
detecting the cliques of the pattern graph.
Contd..
Several clique-detection algorithms and
programs have been introduced in the
literature. Related graph-based methods
utilize the minimum spanning tree (MST) of
the data points to obtain the clustering.
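A possible sketch of the graph-theoretic approach described above is given below: it thresholds a pairwise similarity measure to obtain the 0/1 similarity matrix, builds the pattern graph, and reports its maximal cliques as candidate clusters. The networkx library is assumed to be available for clique detection, and the distance-based similarity function and the threshold value are illustrative choices only.

```python
import numpy as np
import networkx as nx   # assumed available for maximal-clique detection

def pattern_graph_clusters(X, threshold, similarity):
    """Build the 0/1 similarity matrix S by thresholding a pairwise
    similarity measure, construct the pattern graph, and return its
    maximal cliques as candidate clusters."""
    n = len(X)
    S = np.zeros((n, n), dtype=int)
    for j in range(n):
        for k in range(j + 1, n):
            if similarity(X[j], X[k]) > threshold:
                S[j, k] = S[k, j] = 1            # nodes j and k are similar
    G = nx.from_numpy_array(S)                   # pattern graph
    return list(nx.find_cliques(G))              # maximal cliques

# Example with an assumed distance-based similarity measure
pts = np.array([[0, 0], [1, 1], [2, 2], [4, 3], [4, 4], [5, 3], [5, 4], [6, 5]], float)
sim = lambda a, b: 1.0 / (1.0 + np.linalg.norm(a - b))
print(pattern_graph_clusters(pts, threshold=0.4, similarity=sim))
```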
Thank You