Unit 5 - Cluster Analysis
Cluster analysis
A cluster is a collection of data objects that are similar to one another within the same cluster and
are dissimilar to the objects in other clusters. The process of grouping a set of physical or abstract
objects into classes of similar objects is called clustering.
Clustering techniques apply when there is no class to be predicted but rather when the instances
are to be divided into natural groups. These clusters presumably reflect some mechanism at work
in the domain from which instances are drawn, a mechanism that causes some instances to bear a
stronger resemblance to each other than they do to the remaining instances. Clustering naturally
requires techniques different from those used for classification and association learning.
Clustering differs from classification in that there is no target variable for clustering. The clustering
task does not try to classify, estimate, or predict the value of a target variable. Instead, clustering
algorithms seek to segment the entire data set into relatively homogeneous subgroups or clusters,
where the similarity of the records within the cluster is maximized, and the similarity to records
outside this cluster is minimized. By definition this is an example of unsupervised learning.
Applications
Business: Businesses collect large amounts of information on current and potential customers.
Clustering can be used to segment customers into a small number of groups for additional analysis
and marketing activities.
Information Retrieval: The World Wide Web consists of billions of Web pages, and the results
of a query to a search engine can return thousands of pages. Clustering can be used to group these
search results into a small number of clusters, each of which captures a particular aspect of the
query. For instance, a query of "movie" might return Web pages grouped into categories such as
reviews, trailers, stars, and theaters. Each category (cluster) can be broken into subcategories (sub-
clusters), producing a hierarchical structure that further assists a user's exploration of the query
results.
Climate: Understanding the Earth's climate requires finding patterns in the atmosphere and ocean.
Cluster analysis has been applied to find patterns in the atmospheric pressure of polar regions and
areas of the ocean that have significant impact on land climate.
Biology: Clustering is used to derive plant and animal taxonomies, categorize genes with similar
functionality, and gain insight into structures inherent in populations.
Arjun Lamichhane 1
Data Mining and Data Warehousing Unit 5: Cluster Analysis
1. Partitioning methods
Given a database of n objects or data tuples, a partitioning method constructs k partitions of the
data, where each partition represents a cluster and k ≤ n. Given the number of partitions (k) to
construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation
technique that attempts to improve the partitioning by moving objects from one group to another.
Algorithms of this kind include the k-means and k-medoids algorithms.
2. Hierarchical Methods
A hierarchical method creates a hierarchical decomposition of the given set of data objects. If we
permit clusters to have subclusters, then we obtain a hierarchical clustering, which is a set of nested
clusters that are organized as a tree. Each node (cluster) in the tree (except for the leaf nodes) is
the union of its children (subclusters), and the root of the tree is the cluster containing all the
objects. A hierarchical method can be classified as being either agglomerative or divisive, based
on how the hierarchical decomposition is formed.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be
undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry
about a combinatorial number of different choices. However, such techniques cannot correct
erroneous decisions.
4. Grid-based methods
Grid-based methods quantize the object space into a finite number of cells that form a grid
structure. All of the clustering operations are performed on the grid structure (i.e., on the quantized
space). The main advantage of this approach is its fast processing time, which is typically
independent of the number of data objects and dependent only on the number of cells in each
dimension in the quantized space.
5. Model-based methods
Model-based methods hypothesize a model for each of the clusters and find the best fit of the data
to the given model. A model-based algorithm may locate clusters by constructing a density
function that reflects the spatial distribution of the data points. It also leads to a way of
automatically determining the number of clusters based on standard statistics, taking “noise” or
outliers into account and thus yielding robust clustering methods.
Only the first three methods are of interest to us.
Partitioning algorithm
A partitioning algorithm divides the data into k groups, which together satisfy the following
requirements:
Each group must contain at least one object, and
Each object must belong to exactly one group.
These heuristic clustering methods work well for finding spherical-shaped clusters in small to
medium-sized databases. The clusters are formed to optimize an objective partitioning criterion,
such as a dissimilarity function based on distance, so that the objects within a cluster are “similar,”
whereas the objects of different clusters are “dissimilar” in terms of the data set attributes.
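One widely used criterion of this kind is the sum of squared error. The formula below is only a sketch of it: the symbols E, C_i, m_i and p are notation assumed here, not taken from the text above.

```latex
E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^{2}
```

Here C_i is the i-th of the k clusters, m_i is its mean, and p ranges over the objects assigned to C_i. A smaller E corresponds to more compact clusters.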
Algorithm
Input:
Method:
The operation of K-means is illustrated in Figure 1, which shows how, starting from three
centroids, the final clusters are found in three assignment-update steps. Each panel of the figure
shows the centroids at the start of the iteration and the assignment of the points to those centroids.
The centroids are indicated by the "+" symbol; all points belonging to the same cluster are enclosed
by the same dotted outline. In the first step, shown in Figure 1(a), points are assigned to
the initial centroids (chosen randomly), which are all in the larger group of points. For this
example, we use the mean as the centroid. After points are assigned to a centroid, the centroid is
then updated. In the second step, points are assigned to the updated centroids, and the centroids
are updated again; these steps are shown in Figures 1(b) and 1(c). When the K-means algorithm
terminates in Figure 1(c), because no more changes occur, the centroids have identified the natural
groupings of points.
Figure 1 Clustering of a set of objects based on the k-means method. (The mean of each cluster is marked by ‘+’)
Example 1.
D={2, 3, 4, 10, 11, 12, 20, 25, 30}
Divide into three clusters
---Refer class notes---
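As a rough check on Example 1, the assignment and update steps of k-means can be sketched in a few lines of Python. This is a minimal sketch, not the worked solution from class. The function name `kmeans_1d` and the starting centroids (the first three values of D) are assumptions made for illustration, and a different starting choice may take a different path.

```python
def kmeans_1d(data, centroids, max_iter=100):
    """Basic k-means on 1-D numbers; returns (centroids, clusters)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each value joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]  # an empty cluster keeps its centroid
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no change, so the algorithm has converged
            break
        centroids = new_centroids
    return centroids, clusters


D = [2, 3, 4, 10, 11, 12, 20, 25, 30]
# Assumed starting centroids: the first three values of D.
centroids, clusters = kmeans_1d(D, centroids=[2, 3, 4])
print(clusters)   # [[2, 3, 4], [10, 11, 12], [20, 25, 30]]
print(centroids)  # [3.0, 11.0, 25.0]
```

With these starting centroids the clusters stop changing after a handful of iterations and the three natural groups are recovered.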
Example 2.
Generate clusters from the following data set using the K-means algorithm (take k = 2 and consider
up to 2 iterations)
T1: Bread, Jelly, Butter
T2: Bread, Butter
T3: Bread, Milk, Butter
T4: Coke, Bread
T5: Coke, Milk
--- Refer Class Notes---
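K-means needs numeric vectors, so one possible way to approach Example 2 is to encode each transaction as a binary vector with one position per item and then run two assignment-update iterations. The sketch below assumes this encoding, squared Euclidean distance, and T1 and T5 as the starting centroids. All three are illustrative choices and may differ from the method used in class.

```python
transactions = {
    "T1": {"Bread", "Jelly", "Butter"},
    "T2": {"Bread", "Butter"},
    "T3": {"Bread", "Milk", "Butter"},
    "T4": {"Coke", "Bread"},
    "T5": {"Coke", "Milk"},
}

# One vector position per distinct item; 1.0 if the transaction contains the item.
items = sorted(set().union(*transactions.values()))
vectors = {
    name: [1.0 if item in basket else 0.0 for item in items]
    for name, basket in transactions.items()
}


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Assumed starting centroids (k = 2): the vectors of T1 and T5.
centroids = [vectors["T1"], vectors["T5"]]

for iteration in range(2):  # the exercise asks for up to 2 iterations
    # Assignment step: each transaction joins its nearest centroid.
    clusters = [[] for _ in centroids]
    for name, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))
        clusters[nearest].append(name)
    # Update step: each centroid becomes the component-wise mean of its members.
    centroids = [
        [sum(vectors[n][j] for n in members) / len(members) for j in range(len(items))]
        if members else centroids[i]
        for i, members in enumerate(clusters)
    ]
    print(iteration + 1, clusters)
```

Under these assumptions both iterations give the same grouping, {T1, T2, T3} and {T4, T5}.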
Strength
K-means is simple and can be used for a wide variety of data types. It works well when the clusters
are compact clouds that are rather well separated from one another. The method is relatively
scalable and efficient in processing large data sets, even though multiple runs are often performed.
Weakness
K-means has trouble clustering data that contain outliers. Outlier detection and removal can
help significantly in such situations. The k-means method is not suitable for discovering clusters
with nonconvex shapes or clusters of very different sizes. Finally, K-means is restricted to data for
which there is a notion of a center (centroid) so it might not be applicable in some applications,
such as when data with categorical attributes are involved.
Hierarchical Clustering
Hierarchical clustering techniques are a second important category of clustering methods. As with
K-means, these approaches are relatively old compared to many clustering algorithms, but they
still enjoy widespread use.
A hierarchical clustering method works by grouping data objects into a tree of clusters.
Hierarchical clustering methods can be further classified as either agglomerative or divisive,
depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-
down (splitting) fashion. The quality of a pure hierarchical clustering method suffers from its
inability to perform adjustment once a merge or split decision has been executed. That is, if a
particular merge or split decision later turns out to have been a poor choice, the method cannot
backtrack and correct it.
In general, there are two types of hierarchical clustering methods:
Agglomerative: Start with the points as individual clusters and, at each step, merge the closest
pair of clusters. This requires defining a notion of cluster proximity.
Divisive: Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton
clusters of individual points remain. In this case, we need to decide which cluster to split at each
step and how to do the splitting.
A hierarchical clustering is often displayed graphically using a tree-like diagram called a
dendrogram, which displays both the cluster-subcluster relationships and the order in which the
clusters were merged (agglomerative view) or split (divisive view).
Agglomerative Clustering
Many agglomerative hierarchical clustering techniques are variations on a single approach: starting
with individual points as clusters, successively merge the two closest clusters until only one cluster
remains. In general, agglomerative algorithms are more frequently used in real-world applications
than divisive methods.
Algorithm
1. Compute the proximity matrix, if necessary.
2. Repeat
   2.1. Merge the closest two clusters.
   2.2. Update the proximity matrix to reflect the proximity between the new cluster and the
        original clusters.
3. Until only one cluster remains.
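The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation. The function name `agglomerative`, the dictionary-based proximity matrix and the four sample values are assumptions. The `linkage` argument combines the two old proximities into the new one, so `min` and `max` work directly, whereas a group average would also need the cluster sizes.

```python
from itertools import combinations


def agglomerative(dist, labels, linkage=min):
    """Merge clusters bottom-up and return the list of merges performed.

    dist maps frozenset({a, b}) to the distance between items a and b.
    linkage combines two old proximities into the new one.
    """
    # Step 1: proximity matrix, keyed by unordered pairs of clusters (tuples of labels).
    clusters = [(label,) for label in labels]
    prox = {
        frozenset({(a,), (b,)}): dist[frozenset({a, b})]
        for a, b in combinations(labels, 2)
    }
    merges = []
    while len(clusters) > 1:  # Steps 2 and 3: repeat until one cluster remains
        # Step 2.1: find and merge the closest two clusters.
        pair = min(prox, key=prox.get)
        c1, c2 = sorted(pair)
        merged = c1 + c2
        merges.append((c1, c2, prox[pair]))
        clusters = [c for c in clusters if c not in pair]
        # Step 2.2: update proximities between the new cluster and the others.
        for other in clusters:
            d1 = prox.pop(frozenset({c1, other}))
            d2 = prox.pop(frozenset({c2, other}))
            prox[frozenset({merged, other})] = linkage(d1, d2)
        del prox[pair]
        clusters.append(merged)
    return merges


# Hypothetical 1-D points, used only to exercise the function.
values = {"p": 0, "q": 1, "r": 5, "s": 7}
dist = {frozenset({a, b}): abs(values[a] - values[b]) for a, b in combinations(values, 2)}
merges = agglomerative(dist, list(values))
for left, right, d in merges:
    print(left, right, d)
```

For these sample values p and q merge first (distance 1), then r and s (distance 2), and finally the two pairs merge at distance 4.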
The key operation of the agglomerative clustering algorithm is the computation of the proximity
(distance) between two clusters. For example, many agglomerative hierarchical clustering
techniques, such as MIN, MAX, and Group Average, come from a graph-based view of clusters.
MIN defines cluster proximity as the proximity between the closest two points that are in different
clusters, or using graph terms, the shortest edge between two nodes in different subsets of nodes.
Alternatively, MAX takes the proximity between the farthest two points in different clusters to be
the cluster proximity, or using graph terms, the longest edge between two nodes in different subsets
of nodes. The group average technique defines cluster proximity to be the average of the pairwise
proximities (average length of edges) of all pairs of points from different clusters.
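The three proximity definitions can be written directly as small Python functions. The sketch below assumes Euclidean distance between points. The function names and the two sample clusters are hypothetical and serve only to illustrate the definitions.

```python
from itertools import product
from math import dist


def min_link(c1, c2):
    """MIN: distance between the closest two points in different clusters."""
    return min(dist(p, q) for p, q in product(c1, c2))


def max_link(c1, c2):
    """MAX: distance between the farthest two points in different clusters."""
    return max(dist(p, q) for p, q in product(c1, c2))


def group_average(c1, c2):
    """Group average: mean distance over all pairs of points from different clusters."""
    return sum(dist(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))


# Hypothetical clusters used only to compare the three measures.
left = [(0.0, 0.0), (0.0, 1.0)]
right = [(3.0, 0.0), (4.0, 0.0)]
print(min_link(left, right), group_average(left, right), max_link(left, right))
```

For any two clusters the group average lies between the MIN and MAX values, which is one reason the three choices can lead to different merge orders.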
Example: Perform clustering using the agglomerative algorithm for the following points:
Points: A (1, 1), B (1.5, 1.5), C (5, 5), D (3, 4), E (4, 4), F (3, 3.5)
Step 1: Compute the distance between every pair of points, for example:
$d_{A,B} = \sqrt{(1 - 1.5)^2 + (1 - 1.5)^2} = \sqrt{0.5} = 0.71$
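The remaining pairwise distances follow the same formula. A short Python sketch, assuming Euclidean distance and rounding to two decimals as in the example, computes all fifteen of them; the variable names are illustrative.

```python
from itertools import combinations
from math import dist

points = {
    "A": (1, 1), "B": (1.5, 1.5), "C": (5, 5),
    "D": (3, 4), "E": (4, 4), "F": (3, 3.5),
}

# Euclidean distance for every unordered pair of points, rounded to 2 decimals.
distances = {
    (a, b): round(dist(points[a], points[b]), 2)
    for a, b in combinations(points, 2)
}

for (a, b), d in distances.items():
    print(f"d({a},{b}) = {d:.2f}")

# The pair with the smallest distance is the first one to be merged.
closest = min(distances, key=distances.get)
print(closest, distances[closest])
```

The smallest value is the distance between D and F (0.50), so these two points form the first cluster.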
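Step 2: The smallest distance is $d_{D,F} = 0.50$, so D and F are merged into the cluster DF. The
distances from DF to the remaining points are still to be computed:
      A      B      C      DF     E
A     0.00
B     0.71   0.00
C     5.66   4.95   0.00
DF    ?      ?      ?      0.00
E     4.24   3.54   1.41   ?      0.00
To compute the proximity of cluster DF to the other points, we use the MIN method. (Any other
method, such as MAX or group average, could be used, but the final clusters may differ in that case.)
$d_{D,F \to A} = \min(d_{DA}, d_{FA}) = \min(3.61, 3.20) = 3.20$
$d_{D,F \to B} = \min(d_{DB}, d_{FB}) = \min(2.92, 2.50) = 2.50$
$d_{D,F \to C} = \min(d_{DC}, d_{FC}) = \min(2.24, 2.50) = 2.24$
$d_{D,F \to E} = \min(d_{DE}, d_{FE}) = \min(1.00, 1.12) = 1.00$
So the table becomes:
      A      B      C      DF     E
A     0.00
B     0.71   0.00
C     5.66   4.95   0.00
DF    3.20   2.50   2.24   0.00
E     4.24   3.54   1.41   1.00   0.00
Step 3: The smallest entry is now $d_{A,B} = 0.71$, so A and B are merged into the cluster AB:
      AB     C      DF     E
AB    0.00
C     ?      0.00
DF    ?      2.24   0.00
E     ?      1.41   1.00   0.00
Now, compute the proximity of AB to the other points and clusters:
$d_{A,B \to C} = \min(d_{AC}, d_{BC}) = \min(5.66, 4.95) = 4.95$
$d_{A,B \to D,F} = \min(d_{AD}, d_{AF}, d_{BD}, d_{BF}) = \min(3.61, 3.20, 2.92, 2.50) = 2.50$
$d_{A,B \to E} = \min(d_{AE}, d_{BE}) = \min(4.24, 3.54) = 3.54$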
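Step 4: The smallest entry is now $d_{D,F \to E} = 1.00$, so DF and E are merged into the cluster DFE:
      AB     C      DFE
AB    0.00
C     4.95   0.00
DFE   ?      ?      0.00
On computing the distances we get
$d_{D,F,E \to A,B} = \min(d_{DA}, d_{DB}, d_{FA}, d_{FB}, d_{EA}, d_{EB}) = \min(3.61, 2.92, 3.20, 2.50, 4.24, 3.54) = 2.50$
$d_{D,F,E \to C} = \min(d_{DC}, d_{FC}, d_{EC}) = \min(2.24, 2.50, 1.41) = 1.41$
      AB     C      DFE
AB    0.00
C     4.95   0.00
DFE   2.50   1.41   0.00
Step 5: The smallest entry is now $d_{D,F,E \to C} = 1.41$, so DFE and C are merged into the cluster DFEC:
      AB     DFEC
AB    0.00
DFEC  ?      0.00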
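The one remaining distance is
$d_{D,F,E,C \to A,B} = \min(d_{D,F,E \to A,B}, d_{C \to A,B}) = \min(2.50, 4.95) = 2.50$
      AB     DFEC
AB    0.00
DFEC  2.50   0.00
Step 6: Finally, merge AB and DFEC to get all points in one single cluster.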
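The whole example can be replayed with a short Python sketch that applies the MIN (single-link) rule directly to the points. It assumes Euclidean distance, and the helper name `single_link` is illustrative. The order of letters in a merged cluster name is cosmetic, so DFE appears here as EDF.

```python
from math import dist

points = {
    "A": (1, 1), "B": (1.5, 1.5), "C": (5, 5),
    "D": (3, 4), "E": (4, 4), "F": (3, 3.5),
}


def single_link(c1, c2):
    """MIN proximity: smallest distance between a point of c1 and a point of c2."""
    return min(dist(points[a], points[b]) for a in c1 for b in c2)


clusters = [(name,) for name in points]  # start with every point as its own cluster
merges = []
while len(clusters) > 1:
    # Find the indices of the closest pair of clusters.
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
    )
    d = single_link(clusters[i], clusters[j])
    merged = clusters[i] + clusters[j]
    merges.append(("".join(clusters[i]), "".join(clusters[j]), round(d, 2)))
    # Replace the two merged clusters with their union.
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

for left, right, d in merges:
    print(f"merge {left} + {right} at distance {d:.2f}")
```

The merge distances come out as 0.50, 0.71, 1.00, 1.41 and 2.50, matching the sequence D+F, A+B, DF+E, DFE+C and AB+DFEC obtained by hand.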
Divisive Method
A divisive algorithm starts from the entire set of samples X and divides it into a partition of subsets,
then divides each subset into smaller sets, and so on until each object forms a cluster on its own or
until certain termination conditions are satisfied, such as a desired number of clusters being obtained
or the diameter of each cluster falling within a certain threshold. Thus, a divisive algorithm generates a
sequence of partitions that is ordered from a coarser one to a finer one. In this case, we need to
decide which cluster to split at each step and how to do the splitting. For a large data set, this
method is computationally expensive.
Strength
Such algorithms are typically useful for applications, such as the creation of a taxonomy, that
require a hierarchy of clusters. Also, some studies suggest that these
algorithms can produce better-quality clusters.
Weakness
Noise points: A noise point is any point that is neither a core point nor a border point. In Figure 3, C
is a noise point because it is neither a core point nor a border point.
Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is
within the ɛ-neighborhood of q, and q is a core object.
An object p is density-reachable from object q with respect to ɛ and MinPts in a set of objects, D,
if there is a chain of objects $p_1, \ldots, p_n$, where $p_1 = q$ and $p_n = p$, such that $p_{i+1}$ is
directly density-reachable from $p_i$ with respect to ɛ and MinPts.
For example:
Consider Figure 4 for a given ɛ represented by the radius of the circles. Let MinPts = 3. Based on
the above definitions:
Of the labeled points, m, p, o and r are core objects because each is in an ɛ-neighborhood containing
at least three points.
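The definitions of core objects, direct density-reachability and density-reachability can be sketched in Python. The sketch assumes Euclidean distance and that a point counts toward its own ɛ-neighborhood. The function names and the five sample points are hypothetical and are not the points of Figure 4.

```python
from math import dist


def neighborhood(points, q, eps):
    """Indices of all points within eps of points[q] (q itself included)."""
    return [i for i, p in enumerate(points) if dist(p, points[q]) <= eps]


def is_core(points, q, eps, min_pts):
    """A core object has at least min_pts points in its eps-neighborhood."""
    return len(neighborhood(points, q, eps)) >= min_pts


def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p lies in the eps-neighborhood of q and q is a core object."""
    return p in neighborhood(points, q, eps) and is_core(points, q, eps, min_pts)


def density_reachable(points, p, q, eps, min_pts):
    """True if a chain of directly density-reachable steps leads from q to p."""
    seen, frontier = {q}, [q]
    while frontier:
        current = frontier.pop()
        if not is_core(points, current, eps, min_pts):
            continue  # only core objects can extend the chain
        for nxt in neighborhood(points, current, eps):
            if nxt == p:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False


# Hypothetical points on a line; with eps = 1 and MinPts = 3 only indices 1 and 2 are core.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(directly_density_reachable(pts, 0, 1, eps=1.0, min_pts=3))  # True
print(density_reachable(pts, 3, 1, eps=1.0, min_pts=3))           # True
print(density_reachable(pts, 1, 3, eps=1.0, min_pts=3))           # False
```

The last two calls show that density-reachability is not symmetric: index 3 can be reached from the core object at index 1, but nothing can be reached from index 3 because it is not a core object.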
Algorithm
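A minimal Python sketch of DBSCAN, built from the notions of core, border and noise points, is given below. It assumes Euclidean distance and that a point counts toward its own ɛ-neighborhood. The function name `dbscan`, the label convention (-1 for noise) and the sample points are illustrative assumptions.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks noise points."""
    labels = [None] * len(points)

    def region(i):
        # Indices of all points in the eps-neighborhood of point i (i included).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # Point i is a core point: start a new cluster and grow it.
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:  # j is also core, so keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical data: two dense groups and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (5, 5)]
labels = dbscan(sample, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

Each region query here scans all points, so the sketch takes quadratic time; it is meant to show the logic rather than to be efficient.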
Strength
Because DBSCAN uses a density-based definition of a cluster, it is relatively resistant to noise and
can handle clusters of arbitrary shapes and sizes. Thus, DBSCAN can find many clusters that could
not be found using K-means.
Weakness
DBSCAN has trouble when the clusters have widely varying densities. It also has trouble with
high-dimensional data because density is more difficult to define for such data.