Partitioning Methods & Hierarchical Methods
k-means and k-medoids
■ E is the sum of the squared error for all objects in the data set;
■ p is the point in space representing a given object; and
■ ci is the centroid of cluster Ci (both p and ci are multidimensional).
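■ For reference, the objective these symbols define is the within-cluster sum of squared errors; in LaTeX, using squared Euclidean distance as is standard for k-means:

    E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - c_i \rVert^{2}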
The K-Means Centroid-Based Technique
■ Algorithm: k-means.
The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster.
Input: k, the number of clusters; D, a data set containing n objects.
Output: A set of k clusters.
Method:
■ (1) Arbitrarily partition the objects into k nonempty subsets
■ (2) Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
■ (3) Assign each object to the cluster with the nearest seed point
■ (4) Go back to step (2); stop when the assignment no longer changes
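■ A minimal NumPy sketch of steps (1)–(4) is shown below; it is illustrative only, and the random seeding and the lack of empty-cluster handling are simplifications of my own, not part of the slide.

    import numpy as np

    def kmeans(D, k, max_iter=100, seed=0):
        # D: (n, d) array of objects; k: number of clusters.
        rng = np.random.default_rng(seed)
        # Step (1): arbitrarily pick k objects as the initial seed points
        centroids = D[rng.choice(len(D), size=k, replace=False)]
        for _ in range(max_iter):
            # Step (3): assign each object to the cluster with the nearest seed point
            dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step (2): recompute seed points as the mean point of each cluster
            # (note: empty clusters are not handled in this sketch)
            new_centroids = np.array([D[labels == j].mean(axis=0) for j in range(k)])
            # Step (4): stop when the assignment (and hence the centroids) no longer changes
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids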
An Example of K-Means Clustering
[Figure: with K = 2, the initial data set is arbitrarily partitioned into k groups; the cluster centroids are updated and objects are reassigned, looping as needed until the assignments stabilize.]
■ Partition objects into k nonempty subsets
■ Repeat
■ Compute centroid (i.e., mean point) for each partition
■ Assign each object to the cluster of its nearest centroid
■ Until no change
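■ As a usage sketch, an off-the-shelf implementation such as scikit-learn's KMeans runs the same loop; the array X below is invented for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0],
                  [9.0, 8.5], [1.0, 0.5], [8.5, 9.0]])
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)           # cluster index of each object
    print(km.cluster_centers_)  # the two centroids
    print(km.inertia_)          # E: within-cluster sum of squared errors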
Comments on the K-Means Method
PAM: A Typical K-Medoids Algorithm
[Figure: PAM with K = 2 on points plotted over 0–10 axes, illustrating the steps below.]
■ Arbitrarily choose k objects as the initial medoids (in the example, total cost = 20)
■ Assign each remaining object to the nearest medoid
■ Randomly select a nonmedoid object, O_random
■ Compute the total cost of swapping a current medoid O with O_random (in the example, total cost = 26)
■ Swap O and O_random if the clustering quality is improved
■ Repeat the loop until no change
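■ A compact NumPy sketch of this swap-based loop is given below; it follows the usual PAM formulation, and the function and variable names are my own rather than anything prescribed by the slide.

    import numpy as np

    def pam(D, k, max_iter=100, seed=0):
        # D: (n, d) array of objects; returns the medoid indices and object labels.
        rng = np.random.default_rng(seed)
        dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)  # pairwise distances
        medoids = list(rng.choice(len(D), size=k, replace=False))     # arbitrary initial medoids

        def total_cost(meds):
            # total cost = sum of distances from each object to its nearest medoid
            return dist[:, meds].min(axis=1).sum()

        best = total_cost(medoids)
        for _ in range(max_iter):
            improved = False
            for i in range(k):                      # current medoid O = medoids[i]
                for o in range(len(D)):             # candidate nonmedoid object O_random
                    if o in medoids:
                        continue
                    candidate = medoids[:i] + [o] + medoids[i + 1:]
                    cost = total_cost(candidate)    # total cost of swapping O and O_random
                    if cost < best:                 # swap only if the quality improves
                        medoids, best, improved = candidate, cost, True
            if not improved:                        # until no change
                break
        return medoids, dist[:, medoids].argmin(axis=1)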
Chapter 10. Cluster Analysis: Basic Concepts and Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary
Hierarchical Clustering
■ A hierarchical clustering method works by grouping data
objects into a hierarchy or “tree” of clusters.
■ Representing data objects in the form of a hierarchy is
useful for data summarization and visualization.
■ For example, as the manager of human resources at a company, you may organize the employees into major groups such as executives, managers, and staff.
■ These groups can be further partitioned into smaller subgroups. For instance, the general group of staff can be further divided into subgroups of senior officers, officers, and trainees. All these groups form a hierarchy.
■ Data organized into such a hierarchy can then be summarized or characterized, for example to find the average salary of managers and of officers.
Hierarchical Clustering
■ Agglomerative methods start with individual objects as clusters,
which are iteratively merged to form larger clusters.
■ Conversely, divisive methods initially let all the given objects form
one cluster, which they iteratively split into smaller clusters.
■ Hierarchical clustering methods can encounter difficulties regarding the selection of merge or split points. Such a decision is critical, because once a group of objects is merged or split, the process at the next step will operate on the newly generated clusters.
■ The method will neither undo what was done previously nor perform object swapping between clusters. Merge or split decisions, if not well chosen, may therefore lead to low-quality clusters. One way to improve the clustering quality of hierarchical methods is to integrate hierarchical clustering with other clustering techniques, resulting in multiple-phase (or multiphase) clustering.
■ Two such methods are BIRCH and Chameleon. BIRCH begins by partitioning objects hierarchically using tree structures, where the leaf or low-level nonleaf nodes can be viewed as “microclusters” depending on the resolution scale. It then applies other clustering algorithms to perform macroclustering on the microclusters.
■ Chameleon explores dynamic modeling in hierarchical clustering.
Hierarchical Clustering
■There are several orthogonal ways to categorize hierarchical
clustering methods.
■Agglomerative, divisive, and multiphase methods are
algorithmic, meaning they consider data objects as
deterministic and compute clusters according to the
deterministic distances between objects.
■Probabilistic methods use probabilistic models to capture
clusters and measure the quality of clusters by the fitness of
models. Bayesian methods compute a distribution of possible
clusterings.
■ That is, instead of outputting a single deterministic clustering over a data set, they return a group of clustering structures and their probabilities, conditional on the given data.
Agglomerative versus Divisive
Hierarchical Clustering
■An agglomerative hierarchical clustering method uses a bottom-up
strategy.
■It starts by letting each object form its own cluster and iteratively merges clusters into larger and larger clusters, until all the objects are in a single cluster or certain termination conditions are satisfied.
■The single cluster becomes the hierarchy’s root.
■For the merging step, it finds the two clusters that are closest to each other and combines them to form one cluster; since each merge reduces the number of clusters by one, an agglomerative method requires at most n iterations.
■A divisive hierarchical clustering method employs a top-down strategy.
It starts by placing all objects in one cluster, which is the hierarchy’s
root. It then divides the root cluster into several smaller subclusters,
and recursively partitions those clusters into smaller ones.
■ In either agglomerative or divisive hierarchical clustering, a user can
specify the desired number of clusters as a termination condition.
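■ For instance, a bottom-up run that terminates once the user-specified number of clusters is reached can be sketched with scikit-learn's AgglomerativeClustering; the toy array X is invented for illustration.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    X = np.array([[1, 1], [2, 1], [8, 8], [9, 9], [1, 2], [8, 9]], dtype=float)
    # Bottom-up (agglomerative): every object starts in its own cluster and the two
    # closest clusters are merged at each step; stop when n_clusters remain.
    agg = AgglomerativeClustering(n_clusters=2, linkage="single").fit(X)
    print(agg.labels_)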
Hierarchical Clustering
■ The diagram shows the application of AGNES (AGglomerative
NESting), an agglomerative hierarchical clustering method, and
DIANA (DIvisive ANAlysis), a divisive hierarchical clustering
method, on a data set of five objects, {a, b, c, d, e}.
■ Initially, AGNES, the agglomerative method, places each object into
a cluster of its own.
[Figure: AGNES proceeds agglomeratively from Step 0 to Step 4, merging a and b into {a, b}, d and e into {d, e}, then forming {c, d, e}, and finally the single cluster {a, b, c, d, e}; DIANA proceeds divisively through the same tree in the reverse direction, from Step 4 back to Step 0.]
Hierarchical Clustering
■ The clusters are then merged step-by-step according to some
criterion. For example, clusters C1 and C2 may be merged if an
object in C1 and an object in C2 form the minimum Euclidean
distance between any two objects from different clusters.
■ This is a single-linkage approach, in that each cluster is represented by all the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters (see the sketch after this list).
■ DIANA, the divisive method, proceeds in the contrasting way.
■ All the objects are used to form one initial cluster. The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster.
■ The cluster-splitting process repeats until, eventually, each new cluster contains only a single object.
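■ A small sketch of the single-linkage (closest-pair) criterion described above; the two cluster contents are made-up examples.

    import numpy as np

    def single_link(C1, C2):
        # Minimum Euclidean distance between any object in C1 and any object in C2.
        d = np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=2)
        return d.min()

    C1 = np.array([[1.0, 1.0], [2.0, 1.5]])
    C2 = np.array([[4.0, 4.0], [5.0, 5.0]])
    # Merge C1 and C2 if this is the smallest such distance over all cluster pairs.
    print(single_link(C1, C2))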
Dendrogram: Shows How Clusters are Merged
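■ Cutting the dendrogram at a desired level yields one flat clustering. A minimal sketch using SciPy's hierarchy module follows; the sample points and labels are invented.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
    import matplotlib.pyplot as plt

    X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5.5], [9, 9]], dtype=float)
    Z = linkage(X, method="single")       # records which clusters merge, and at what distance
    dendrogram(Z, labels=["a", "b", "c", "d", "e"])
    plt.show()
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(labels)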
AGNES (Agglomerative Nesting)
■ Introduced in Kaufmann and Rousseeuw (1990)
■ Implemented in statistical packages, e.g., Splus
■ Use the single-link method and the dissimilarity
matrix
■ Merge nodes that have the least dissimilarity
■ Go on in a non-descending fashion
■ Eventually all nodes belong to the same cluster
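■ A sketch of those steps with SciPy, using a precomputed dissimilarity matrix and the single-link method; the matrix values below are made up.

    import numpy as np
    from scipy.spatial.distance import squareform
    from scipy.cluster.hierarchy import linkage

    # Symmetric dissimilarity matrix for 4 objects (invented values).
    D = np.array([[ 0.0, 2.0, 6.0, 10.0],
                  [ 2.0, 0.0, 5.0,  9.0],
                  [ 6.0, 5.0, 0.0,  4.0],
                  [10.0, 9.0, 4.0,  0.0]])
    Z = linkage(squareform(D), method="single")  # merge least-dissimilar nodes first
    print(Z)  # each row: the two clusters merged, their dissimilarity, new cluster size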
DIANA (Divisive Analysis)
Distance between Clusters
■ Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
■ Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
■ Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
■ Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
■ Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj), where Mi and Mj are the medoids of Ki and Kj
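■ A hedged sketch of these measures for two small, made-up clusters Ki and Kj; the medoid here is taken as the object with the smallest total distance to the rest of its own cluster.

    import numpy as np

    Ki = np.array([[1.0, 1.0], [2.0, 2.0], [1.0, 2.0]])
    Kj = np.array([[6.0, 6.0], [7.0, 8.0]])

    pairwise = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)

    single   = pairwise.min()    # smallest element-to-element distance
    complete = pairwise.max()    # largest element-to-element distance
    average  = pairwise.mean()   # average element-to-element distance
    centroid = np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))

    def medoid(K):
        # object with the smallest total distance to the other objects in its cluster
        d = np.linalg.norm(K[:, None, :] - K[None, :, :], axis=2).sum(axis=1)
        return K[d.argmin()]

    medoid_dist = np.linalg.norm(medoid(Ki) - medoid(Kj))
    print(single, complete, average, centroid, medoid_dist)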
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
■ Centroid: the “middle” of a cluster
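■ The radius and diameter that usually accompany the centroid are commonly defined as the root-mean-square distance of the members to the centroid and the root-mean-square pairwise distance within the cluster, respectively; the sketch below uses those standard definitions on an invented cluster X.

    import numpy as np

    X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])  # one cluster
    N = len(X)

    centroid = X.mean(axis=0)                          # the "middle" of the cluster
    radius = np.sqrt(((X - centroid) ** 2).sum() / N)  # average spread around the centroid
    pair_sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    diameter = np.sqrt(pair_sq.sum() / (N * (N - 1)))  # average pairwise spread
    print(centroid, radius, diameter)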