ppt7
• For optimal performance, clustering algorithms, just like algorithms for classification, require the data to be
normalized so that no particular variable or subset of variables dominates the analysis
• Analysts may use either min–max normalization or Z-score standardization (a brief sketch of both follows)
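• A minimal sketch of the two options using scikit-learn; the toy array X is purely illustrative:
```python
# Minimal sketch of min-max normalization and Z-score standardization with
# scikit-learn; the toy array X (e.g., age and income) is made up.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[20.0, 50000.0],
              [35.0, 72000.0],
              [50.0, 98000.0]])   # variables on very different scales

X_minmax = MinMaxScaler().fit_transform(X)    # min-max normalization: values in [0, 1]
X_zscore = StandardScaler().fit_transform(X)  # Z-score standardization: mean 0, std 1

print(X_minmax)
print(X_zscore)
```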
Goal of Clustering
• All clustering methods have as their goal the identification of groups of records such that similarity within a group
is very high while the similarity to records in other groups is very low
• Clustering algorithms seek to construct clusters of records such that the between-cluster variation is large compared
to the within-cluster variation, which is analogous to the objective of analysis of variance (ANOVA)
Hierarchical Clustering
• Clustering algorithms are either hierarchical or nonhierarchical
• In hierarchical clustering, a treelike cluster structure (dendrogram) is created through recursive partitioning
(divisive methods) or combining (agglomerative) of existing clusters
• Agglomerative clustering methods initialize each observation as a tiny cluster of its own. In each succeeding
step, the two closest clusters are merged into a new combined cluster; the number of clusters in the data set is
thereby reduced by one at each step, until finally a single cluster containing all the points remains
• Divisive clustering methods begin with all the records in one big cluster; the most dissimilar records are
recursively split off into separate clusters until each record represents its own cluster
• Our focus will be on agglomerative clustering because it is more widely applied than divisive clustering
• Nonhierarchical clustering works in a different fashion from hierarchical clustering; the most popular
nonhierarchical methods are k-means, k-medians, and related algorithms
Hierarchical Agglomerative Clustering
• Distance computation between records is rather straightforward once appropriate recoding and normalization have
taken place
• But how do we determine the distance between clusters of records? Should we consider two clusters to be close if
their nearest neighbors are close, or if their farthest neighbors are close? How about criteria that average out these
extremes? Three criteria for determining the distance between clusters are described below (a short code sketch
illustrating all three follows their descriptions)
• Single linkage, sometimes termed the nearest-neighbor approach, is based on the minimum distance between any
record in cluster A and any record in cluster B. In other words, cluster similarity is based on the similarity of the
most similar members from each cluster. Single linkage tends to form long, slender clusters, which may
sometimes lead to heterogeneous records being clustered together
• Complete linkage, sometimes termed the farthest-neighbor approach, is based on the maximum distance between
any record in cluster A and any record in cluster B. In other words, cluster similarity is based on the similarity of
the most dissimilar members from each cluster. Complete linkage tends to form more compact, sphere-like
clusters
• Average linkage is designed to reduce the dependence of the cluster-linkage criterion on extreme values, such
as the most similar or dissimilar records. In average linkage, the criterion is the average distance of all the
records in cluster A from all the records in cluster B. The resulting clusters tend to have approximately
equal within-cluster variability
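• As a quick illustration of the three criteria above, the following sketch (hypothetical coordinates) computes the single-, complete-, and average-linkage distances between two small clusters:
```python
# Minimal sketch of the three between-cluster distance criteria, applied to two
# small illustrative clusters; the coordinates of A and B are hypothetical.
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[1.0, 1.0], [1.5, 2.0]])               # records in cluster A
B = np.array([[5.0, 4.0], [6.0, 5.0], [5.5, 6.0]])   # records in cluster B

pairwise = cdist(A, B)          # all Euclidean distances between A and B

single   = pairwise.min()       # single linkage: most similar (nearest) members
complete = pairwise.max()       # complete linkage: most dissimilar (farthest) members
average  = pairwise.mean()      # average linkage: mean of all pairwise distances

print(single, complete, average)
```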
Hierarchical Agglomerative Clustering Algorithm
A Worked Example
• Consider the following n = 8 bivariate points
• Using Euclidean distance, the upper-triangular portion of the symmetric (8 × 8)-matrix D(1) is as follows
• From D(1), the smallest dissimilarity is d12 = d23 = d68 = 1.414; we choose to merge x2 and x3 to form the cluster
“23,” producing the (7 × 7)-matrix D(2)
• In D(2), the smallest dissimilarity is d1,23 = d68 = 1.414. We choose to merge x1 with the “23” cluster, producing a new
cluster “123.” We next compute new dissimilarities, d123,K = min{d1K, d23,K} for K = 4,5,6,7,8. The (6 × 6)-matrix
D(3) is as follows
A Worked Example – Single Linkage
• The smallest dissimilarity is d68 = 1.414, and so we merge x6 and x8 to form the new cluster “68.” We compute new
dissimilarities, d68,K = min{d6K, d8K} for K = 123,4,5,7. This gives us the (5 × 5)-matrix D(4)
• The smallest dissimilarity is d45 = 2.0, and so we merge x4 and x5 to form the new cluster “45.” We compute new
dissimilarities, d45,K = min{d4K, d5K} for K = 123,68,7. This gives the (4 × 4)-matrix D(5)
A Worked Example – Single Linkage
• The smallest dissimilarity is d45,68 = d68,7 = 2.236. We choose to merge the cluster “68” with x7 to produce the new
cluster “678.” The new dissimilarities, d678,K = min{d68,K,d7K} for K = 123,45, yield the matrix D(6)
• The smallest dissimilarity is d45,678 = 2.236, so the next merge is the cluster “45” with the cluster “678.” The matrix
D(7) is
• The last merge is cluster “123” with cluster “45678,” and the merging dissimilarity is d123,45678 = 3.162
A Worked Example – Single Linkage
• The corresponding dendrogram is shown below
A Worked Example – Complete Linkage
• From D(1) given previously, we merge x2 and x3 to form the “23” cluster at height 1.414, as before. So, the
upper-triangular portion of the (7 × 7)-matrix D(2) is as follows
• The smallest dissimilarity is d68 = 1.414. We merge x6 and x8 to form a new cluster “68.” We compute new
dissimilarities, d68,K = max{d6K,d8K} for K = 1,23,4,5,7. This gives us a (6 × 6)-matrix D(3)
A Worked Example – Complete Linkage
• The smallest dissimilarity is d1,23 = d45 = 2.0. We choose to merge the cluster “23” with x1 to form a new cluster
“123.” We compute new dissimilarities, d123,K = max{d1,K,d23,K} for K = 4,5,68,7. This gives us a new (5 ×
5)-matrix D(4)
• The smallest dissimilarity is d45 = 2.0. We merge x4 and x5 to form a new cluster “45.” We compute dissimilarities,
d45,K = max{d4K,d5K} for K = 123,68,7. This gives us a new (4 × 4)-matrix D(5)
• The smallest dissimilarity is d68,7 = 2.236. We merge cluster “68” with x7 to form the new cluster “678.” New
dissimilarities d678,K = max{d68,K,d7K} are computed for K = 123,45 to give the new (3 × 3)-matrix D(6)
A Worked Example – Complete Linkage
• The last steps merge the clusters “45” and “678” with a merging value of d45,678 = 5.385, and then the clusters
“123” and “45678” with a merging value of d123,45678 = 7.280
A Worked Example – Complete Linkage
• The dendrogram for this method is shown below
A Worked Example – Average Linkage
• We start with the matrix D(1). The smallest dissimilarity is d12 = √2 = 1.414, and so we merge x1 and x2 to form
cluster “12.”
• We compute dissimilarities between the cluster “12” and all other points using the average distance, d12,K = (d1K +
d2K)/2, for K = 3,4,5,6,7,8. For example, d12,3 = (d13 + d23)/2 = (√4 + √2)/2 = 1.707
• The matrix D(2) is given by
• The smallest dissimilarity is d68 = 1.414, and so we merge x6 and x8 to form the new cluster “68.” We compute
dissimilarities between the cluster “68” and all other points and clusters using the average distance, d68,12 = (d16+
d26 + d18 + d28)/4 = 6.364, and d68,K = (d6K + d8K)/2, for K = 3,4,5,7
• The matrix D(3) is
A Worked Example – Average Linkage
• The smallest dissimilarity is d12,3 = 1.707, and so we merge x3 and the cluster “12” to form the new cluster “123.”
We compute dissimilarities between the cluster “123” and all other points using the average distance, d123,68 = (d16
+ d18 + d26 + d28 + d36 + d38)/6 = 5.974 and d123,K = (d1K + d2K + d3K)/3, for K = 4,5,7.
• This gives the matrix D(4)
• The smallest dissimilarity is d45 = 2.0, and so we merge x4 and x5 to form the new cluster “45.” We compute
dissimilarities between the cluster “45” and the other clusters as before
• This gives the matrix D(5)
A Worked Example – Average Linkage
• The smallest dissimilarity is d68,7 = 2.236, and so we merge x7 and the cluster “68” to form the new cluster “678.”
This gives the matrix D(6)
• The smallest dissimilarity is d45,678 = 3.792, and so we merge the two clusters “45” and “678” to form a new cluster
“45678.” We then merge the last two clusters at dissimilarity d123,45678 = 4.940. The corresponding
dendrogram is shown below
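• In practice, the same agglomerative procedure can be run with SciPy. Since the eight points of the worked example are not reproduced here, the sketch below uses placeholder data; with the actual coordinates, linkage() should reproduce the merge heights computed by hand above:
```python
# Sketch of hierarchical agglomerative clustering with SciPy. X below is
# placeholder data standing in for the n = 8 bivariate points of the example.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                 # placeholder bivariate points

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)           # each row: clusters merged, merge height, cluster size
    print(method, "merge heights:", Z[:, 2])

dendrogram(linkage(X, method="single"))     # dendrogram for single linkage
plt.show()
```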
k-means Clustering
• The k-means clustering algorithm is a straightforward and effective algorithm for
finding clusters in data. The algorithm proceeds as follows:
• Step 1: Ask the user how many clusters k the data set should be partitioned into.
• Step 2: Randomly assign k records to be the initial cluster center locations.
• Step 3: For each record, find the nearest cluster center. Thus, in a sense, each cluster center “owns” a subset of
the records, thereby representing a partition of the data set. We therefore have k clusters, C1, C2, … , Ck
• Step 4: For each of the k clusters, find the cluster centroid, and update the location of each cluster center to the
new value of the centroid
• Step 5: Repeat steps 3 and 4 until convergence or termination
• The “nearest” criterion in step 3 is usually Euclidean distance, although other criteria may be applied as well.
• The cluster centroid in step 4 is found as follows. Suppose that we have n data points (a1, b1, c1), (a2, b2, c2), … ,
(an, bn, cn); the centroid of these points is their center of gravity, located at the point (∑ai⁄n, ∑bi⁄n, ∑ci⁄n)
• For example, the points (1,1,1), (1,2,1), (1,3,1), and (2,1,1) would have centroid
((1+1+1+2)/4, (1+2+3+1)/4, (1+1+1+1)/4) = (1.25, 1.75, 1)
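• A minimal from-scratch sketch of steps 1–5 (Euclidean distance assumed; names are illustrative, and empty clusters are not handled):
```python
# Minimal from-scratch sketch of the k-means steps listed above
# (Euclidean "nearest" criterion; variable names are illustrative).
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # Step 2: k random records as initial centers
    for _ in range(max_iter):                                # Step 5: repeat until convergence/termination
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                        # Step 3: assign each record to nearest center
        new_centers = np.array([X[labels == i].mean(axis=0)  # Step 4: update each center to its centroid
                                for i in range(k)])
        if np.allclose(new_centers, centers):                # terminate when centroids no longer change
            break
        centers = new_centers
    return labels, centers

# the centroid example above: points (1,1,1), (1,2,1), (1,3,1), (2,1,1)
pts = np.array([[1, 1, 1], [1, 2, 1], [1, 3, 1], [2, 1, 1]], dtype=float)
print(pts.mean(axis=0))             # -> [1.25 1.75 1.  ]

labels, centers = kmeans(pts, k=2)  # toy run of the sketch
print(labels, centers)
```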
k-means Clustering
• The algorithm terminates when the centroids no longer change. In other words, the algorithm terminates when for all
clusters C1, C2, … , Ck, all the records “owned” by each cluster center remain in that cluster
• Alternatively, the algorithm may terminate when some convergence criterion is met, such as no significant shrinkage in
the mean squared error, MSE = SSE⁄(N − k), where SSE = Σi Σp∈Ci d(p, mi)² is the sum of squares error, p ∈ Ci represents each data point in
cluster i, mi represents the centroid (cluster center) of cluster i, N is the total sample size, and k is the number of clusters
• Recall that clustering algorithms seek to construct clusters of records such that the between-cluster variation is large
compared to the within-cluster variation
• Because this concept is analogous to the analysis of variance, we may define a pseudo-F statistic as F = MSB⁄MSE
• MSB = SSB⁄(k − 1) is the mean square between, and SSB = Σi ni · d(mi, M)² is the sum of squares between clusters, where ni is
the number of records in cluster i, mi is the centroid (cluster center) for cluster i, and M is the grand mean of all the data
k-means Clustering
• MSB represents the between-cluster variation and MSE represents the within-cluster variation
• Thus, a “good” cluster would have a large value of the pseudo-F statistic, representing a situation where the
between-cluster variation is large compared to the within-cluster variation
• Hence, as the k-means algorithm proceeds, and the quality of the clusters increases, we would expect MSB to
increase, MSE to decrease, and F to increase
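• The following sketch (toy values only) computes SSE, MSE, SSB, MSB, and the pseudo-F statistic as defined above:
```python
# Sketch of the SSE/MSE/SSB/MSB and pseudo-F computations defined above, given a
# data matrix X, cluster labels, and cluster centers (the toy values are made up).
import numpy as np

def pseudo_F(X, labels, centers):
    N, k = len(X), len(centers)
    M = X.mean(axis=0)                                             # grand mean of all the data
    sse = sum(((X[labels == i] - centers[i]) ** 2).sum()           # within-cluster variation
              for i in range(k))
    ssb = sum((labels == i).sum() * ((centers[i] - M) ** 2).sum()  # between-cluster variation
              for i in range(k))
    mse = sse / (N - k)
    msb = ssb / (k - 1)
    return msb / mse

X = np.array([[0.0], [2.0], [4.0], [6.0], [10.0]])   # toy one-dimensional data
labels = np.array([0, 0, 0, 1, 1])
centers = np.array([[2.0], [8.0]])
print(pseudo_F(X, labels, centers))
```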
k-means Clustering– Worked Example
• Suppose that we have the eight data points in two-dimensional space shown in the accompanying table and scatter plot,
and that we are interested in uncovering k = 2 clusters
k-means Clustering– Worked Example
• Let us apply the k-means algorithm step by step.
• Step 1: Ask the user how many clusters k the data set should be partitioned into. We have already indicated that
we are interested in k = 2 clusters
• Step 2: Randomly assign k records to be the initial cluster center locations. For this example, we assign the
cluster centers to be m1 = (1,1) and m2 = (2,1)
• Step 3 (first pass): For each record, find the nearest cluster center. The accompanying table contains the (rounded)
Euclidean distances between each point and each cluster center m1 = (1,1) and m2 = (2,1), along with an indication of
which cluster center each point is nearest to. Therefore, cluster 1 contains points {a,e,g}, and cluster 2 contains points
{b,c,d,f,h}
• Step 4 (first pass): For each of the k clusters find the cluster centroid and update the location of each cluster
center to the new value of the centroid. The centroid for cluster 1 is [(1 + 1 + 1)/3, (3 + 2 + 1)/3] = (1,2). The
centroid for cluster 2 is [(3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5] = (3.6, 2.4)
k-means Clustering– Worked Example
• The clusters and centroids (triangles) at the end of the first pass are shown in following figure. Note that m1 has
moved up to the center of the three points in cluster 1, while m2 has moved up and to the right a considerable
distance, to the center of the five points in cluster 2
• Step 3 (second pass): For each record, find the nearest cluster center. With the updated centers m1 = (1, 2) and
m2 = (3.6, 2.4), point h = (2, 1) is now nearer to m1, so cluster 1 becomes {a,e,g,h} and cluster 2 becomes {b,c,d,f}
• Step 4 (second pass): For each of the k clusters, find the cluster centroid and update the location of each cluster
center to the new value of the centroid. The new centroid for cluster 1 is [(1 + 1 + 1 + 2)/4, (3 + 2 + 1 + 1)/4] =
(1.25, 1.75). The new centroid for cluster 2 is [(3 + 4 + 5 + 4)/4, (3 + 3 + 3 + 2)/4] = (4, 2.75). The clusters and
centroids at the end of the second pass are shown in the following figure. Centroids m1 and m2 have both moved
slightly
k-means Clustering– Worked Example
• Step 5: Repeat steps 3 and 4 until convergence or termination. As the centroids have moved, we once again
return to step 3 for our third (and as it turns out, final) pass through the algorithm
k-means Clustering– Worked Example
• Step 3 (third pass): For each record, find the nearest cluster center. Following table shows the distances
between each point and each newly updated cluster center m1 = (1.25, 1.75) and m2 = (4, 2.75), together with
the resulting cluster membership. Note that no records have shifted cluster membership from the preceding
pass
• Step 4 (third pass): For each of the k clusters, find the cluster centroid and
update the location of each cluster center to the new value of the centroid. As
no records have shifted cluster membership, the cluster centroids therefore also
remain unchanged
• Step 5: Repeat steps 3 and 4 until convergence or termination. As the centroids
remain unchanged, the algorithm terminates
• The cluster assignments and cluster centers after the third pass are shown in the following figure
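• As a check, the sketch below re-runs the example with scikit-learn. The coordinates are inferred from the centroid calculations above (the letter-to-point mapping is assumed), so treat it as illustrative rather than a verbatim copy of the original table:
```python
# Sketch re-running the worked example with scikit-learn; point coordinates are
# inferred from the centroid calculations above, letter labels are assumed.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.array([[1, 3], [3, 3], [4, 3], [5, 3],    # a, b, c, d
              [1, 2], [4, 2], [1, 1], [2, 1]],   # e, f, g, h
             dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # should be close to (1.25, 1.75) and (4, 2.75)
print(km.labels_)            # clusters {a,e,g,h} and {b,c,d,f} (label numbering may differ)

# calinski_harabasz_score is MSB/MSE, i.e., the pseudo-F statistic (about 16.44 here)
print(calinski_harabasz_score(X, km.labels_))
```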
k-means Clustering– Worked Example
• Let’s observe the behavior of the statistics MSB, MSE, and pseudo-F after step 4 of each pass
• First pass
k-means Clustering– Worked Example
• In general, we would expect MSB to increase, MSE to decrease, and F to increase
• Second pass: MSB = 17.125, MSE = 1.313333, F = 13.03934
• Third pass: MSB = 17.125, MSE = 1.041667, F = 16.44
• Note that the k-means algorithm cannot guarantee finding the global maximum of the pseudo-F statistic, and instead
often settles at a local maximum. To improve the probability of reaching the global maximum, the analyst may
try a variety of initial cluster centers. One suggestion is (i) to place the first cluster center on a
random data point, and (ii) to place the subsequent cluster centers on points as far away from previous centers as
possible
• One potential problem in applying the k-means algorithm is deciding how many clusters to search for, unless the
analyst has a priori knowledge of the number of underlying clusters. An “outer loop” can therefore be added to the
algorithm, cycling through various promising values of k. The clustering solutions for each value of k can then be
compared, with the value of k yielding the largest pseudo-F statistic being selected (a sketch of such a loop appears
after this list). Some algorithms, such as BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies),
can select the optimal number of clusters themselves
• What if some attributes are more relevant than others to the problem formulation? As cluster membership is
determined by distance, we may apply the same axis-stretching methods for quantifying attribute relevance
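• A sketch of such an outer loop over k, using scikit-learn's calinski_harabasz_score (which is the pseudo-F statistic) on hypothetical data:
```python
# Sketch of the "outer loop" over candidate values of k described above, scoring
# each solution with the pseudo-F (Calinski-Harabasz) statistic; X is hypothetical.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in (0, 3, 6)])  # three synthetic blobs

scores = {}
for k in range(2, 7):                                  # promising values of k
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)     # pseudo-F for this solution

best_k = max(scores, key=scores.get)                   # choose k with the largest pseudo-F
print(scores, "-> chosen k =", best_k)
```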
Rationale for measuring cluster goodness
• Every modeling technique requires an evaluation phase
• For example, we may work hard to develop a multiple regression model for predicting the amount of money to be spent
on a new car. But, if the standard error of the estimate s for this regression model is $100,000, then the usefulness of the
regression model is questionable
• In the classification realm, we would expect that a model predicting who will respond to our direct-mail marketing
operation will yield more profitable results than the baseline “send-a-coupon-to-everybody” or
“send-out-no-coupons-at-all” models
• In a similar way, clustering models need to be evaluated as well. Some of the questions of interest might be the following:
• Do my clusters actually correspond to reality, or are they simply artifacts of mathematical convenience?
• I am not sure how many clusters there are in the data. What is the optimal number of clusters to identify?
• How do I measure whether one set of clusters is preferable to another?
• Two methods for measuring cluster goodness, the silhouette method and the pseudo-F statistic, are introduced here
Rationale for measuring cluster goodness
• Any measure of cluster goodness, or cluster quality, should address the concepts of cluster separation as well as cluster
cohesion
• Cluster separation represents how distant the clusters are from each other; cluster cohesion refers to how tightly related
the records within the individual clusters are
• Good measures of cluster quality need to incorporate both criteria. For example, the sum of squares
error (SSE) might seem to be a reasonable measure of cluster quality
• However, by measuring only the distance between each record and its cluster center, SSE accounts for cluster cohesion
and does not account for cluster separation. Thus, SSE decreases monotonically as the number of clusters
increases, which is not a desirable property for a valid measure of cluster goodness
• Of course, both the silhouette method and the pseudo-F statistic account for both cluster cohesion and cluster separation
The Silhouette Method
• For each data value i, the silhouette value is si = (bi − ai) ⁄ max(ai, bi),
where ai is the distance between the data value and its cluster center, and bi is the distance between the data value and
the next closest cluster center
• The silhouette value is used to gauge how good the cluster assignment is for that particular point
• A positive value indicates that the assignment is good, with higher values being better than lower values
• A value that is close to zero is considered to be a weak assignment, as the observation could have been assigned to the
next closest cluster with limited negative consequence
• A negative silhouette value is considered to be misclassified, as assignment to the next closest cluster would have been
better
• Note how the definition of silhouette accounts for both separation and cohesion. The value of ai represents cohesion, as it
measures the distance between the data value and its cluster center, while bi represents separation, as it measures the distance
between the data value and a different cluster
The Silhouette Method
• In the accompanying figure, each of the data values in Cluster 1 has its values of ai and bi represented by a solid line
and a dotted line, respectively. Clearly, bi > ai for each data value, as represented by the longer dotted lines
• Thus, each data value’s silhouette value is positive, indicating that the data values have not been misclassified. The dotted
lines indicate separation, and the solid lines indicate cohesion
• Taking the average silhouette value over all records yields a useful measure of how well the cluster solution fits the
data. The following thumbnail interpretation of average silhouette is meant as a guideline only and should defer to
the expertise of the domain expert
• INTERPRETATION OF AVERAGE SILHOUETTE VALUE
• 0.5 or better. Good evidence of the reality of the clusters in the data
• 0.25–0.5. Some evidence of the reality of the clusters in the data. Hopefully, domain-specific knowledge can be brought
to bear to support the reality of the clusters
• Less than 0.25. Scant evidence of cluster reality
The Silhouette Method - Example
• Suppose we apply k-means clustering to the following little one-dimensional data set:
x1 = 0, x2 = 2, x3 = 4, x4 = 6, x5 = 10
• k-means assigns the first three data values to Cluster 1 and the last two to Cluster 2
• The cluster center for Cluster 1 is m1 = 2, and the cluster center for Cluster 2 is m2 = 8
• The values for ai represent the distance between the data value xi and the cluster center to which xi belongs. The values
for bi represent the distance between the data value and the other cluster center. Note that a2 = 0 because x2 = m1 = 2
The Silhouette Method - Example
• Following table contains the calculations for the individual data value silhouettes, along with the mean silhouette. Using
our rule of thumb, mean silhouette = 0.7 represents good evidence of the reality of the clusters in the data. Note that x2 is
perfectly classified as belonging to Cluster 1, as it sits right on the cluster center m1; thus, its silhouette value is a perfect
1.00. However, x3 is somewhat farther from its own cluster center, and somewhat closer to the other cluster center; hence,
its silhouette value is lower, 0.50
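• The following sketch reproduces this example using the center-based silhouette definition given above:
```python
# Sketch reproducing the one-dimensional silhouette example above, using the
# center-based definition in these notes: s_i = (b_i - a_i) / max(a_i, b_i).
import numpy as np

x = np.array([0.0, 2.0, 4.0, 6.0, 10.0])
centers = np.array([2.0, 8.0])          # m1 = 2 (Cluster 1), m2 = 8 (Cluster 2)
labels = np.array([0, 0, 0, 1, 1])      # first three points in Cluster 1, last two in Cluster 2

d = np.abs(x[:, None] - centers[None, :])   # distance from each point to each cluster center
a = d[np.arange(len(x)), labels]            # distance to own cluster center
b = d[np.arange(len(x)), 1 - labels]        # distance to the other cluster center
s = (b - a) / np.maximum(a, b)

print(s)          # [0.75, 1.0, 0.5, 0.5, 0.75]
print(s.mean())   # 0.7, matching the worked example
```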
Silhouette Analysis of the IRIS Data
• The data set consists of 150 observations of three species of Iris, along with measurements of their petal width, petal length,
sepal width, and sepal length
• The left figure shows a scatter plot of petal width versus petal length, with an overlay of Iris species. (Note that min–max
normalization is used.) It shows that one species is well separated, but the other two are not, at least in these two dimensions
• So, one question we could ask of these Irises: true, there are three species in the data set, but are there really three clusters in
the data set, or only two?
• It makes sense to begin with k = 3 clusters. k-means clustering was applied to the Iris data, asking for k = 3 clusters. A logical
question might be: Do the clusters match perfectly with the species? (Of course, the species type was not included as input to the
clustering algorithm.)
• The answer is, not quite: most of the Iris virginica belong to Cluster 2, but some belong to Cluster 3, and most of the Iris
versicolor belong to Cluster 3, but some belong to Cluster 2 (see the right figure)
Silhouette Analysis of the IRIS Data
• The silhouette values for each flower were calculated, and graphed in the silhouette plot in the following figure
• This silhouette plot shows the silhouette values, sorted from highest to lowest, for each cluster. Cluster 1 is the
best-defined cluster, as most of its silhouette values are rather high. Clusters 2 and 3, by contrast, have some records with
high silhouette and some with low silhouette. However, there are no records with negative silhouette, which
would indicate a wrong cluster assignment
• The mean silhouette values for each cluster, and the overall mean silhouette, are provided in the table. These values
support our suggestion that, although Cluster 1 is well-defined, Clusters 2 and 3 are not so well-defined. This makes
sense, in light of what we learned in the last figures
Silhouette Analysis of the IRIS Data
• Many of the low silhouette values for Clusters 2 and 3 come from the boundary area between their respective clusters.
Evidence for this is shown in this figure
• The silhouette values were binned (for illustrative purposes): a silhouette value below 0.5 is low; a silhouette value of at
least 0.5 is high. The lower silhouette values in this boundary area result from the proximity of the “other” cluster center,
which holds down the value of bi, and thus the silhouette value
• It is worth noting that the clusters were formed using four predictors, but we are examining scatter plots of only two
predictors. This represents a projection of the predictor space down to two dimensions, and so loses some of the
information available in four dimensions
Silhouette Analysis of the IRIS Data
• Next, k-means was applied, with k = 2 clusters. This clustering combines I. versicolor and I. virginica into a single cluster, as
shown in left figure. The silhouette plot for k = 2 clusters is shown in right figure. There seem to be fewer low silhouette values
than for k = 3 clusters. This is supported by the mean silhouette values reported in the table. The overall mean silhouette is 17%
higher than for k = 3, and the cluster mean silhouettes are higher as well
• So, it is clear that the silhouette method prefers the clustering model where k = 2. This is fine, but just be aware that the k = 2
solution recognizes no distinction between I. versicolor and I. virginica, whereas the k = 3 solution does recognize this
distinction. Such a distinction may be important to the client
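• For readers who want to replicate this comparison, the sketch below runs k = 2 and k = 3 on the Iris data with scikit-learn. Note that sklearn's silhouette_score uses the classical (average-distance) definition rather than the center-based one described above, so the numerical values will differ somewhat from those reported here:
```python
# Sketch of a silhouette comparison for k = 2 vs. k = 3 on the Iris data.
# sklearn's silhouette_score uses the classical (average-distance) definition,
# so its values differ from the center-based silhouette described in these notes.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = MinMaxScaler().fit_transform(load_iris().data)   # min-max normalization, as in the notes

for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, "clusters -> mean silhouette:", round(silhouette_score(X, labels), 3))
```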