K-Medoids Clustering Method


K-Medoids Clustering Algorithm

Having seen an overview of K-Medoids clustering, let us walk through the algorithm step by step.
1. First, we select K random data points from the dataset and use them as medoids.
2. Next, we calculate the distance of each data point from the medoids. You can use Euclidean,
Manhattan, or squared Euclidean distance as the distance measure.
3. Once we know the distance of each data point from the medoids, we assign each data point to the
cluster of its closest medoid.
4. After determining the clusters, we calculate the sum of the distances of all the non-medoid data
points to the medoid of their cluster. Let this cost be Ci.
5. Now, we select a random non-medoid data point Dj from the dataset and swap it with a medoid Mi.
Here, Dj becomes a temporary medoid. After swapping, we recalculate the sum of the distances of all
the non-medoid data points to the medoid of their cluster. Let this cost be Cj.
6. If Cj < Ci, the swap reduced the total cost, so the medoids with Dj as one of them are kept.
Otherwise, we undo the swap, and Mi is reinstated as the medoid.
7. Repeat steps 4 to 6 until no swap reduces the cost, i.e., the clusters no longer change.
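To make the steps concrete, here is a minimal Python sketch of the algorithm above. It assumes Manhattan distance; the helper names manhattan, total_cost, and k_medoids are introduced here purely for illustration and are not taken from any library.

    import random

    def manhattan(p, q):
        # Manhattan (city-block) distance between two points
        return sum(abs(a - b) for a, b in zip(p, q))

    def total_cost(points, medoids):
        # Sum of distances from every point to its nearest medoid
        return sum(min(manhattan(p, m) for m in medoids) for p in points)

    def k_medoids(points, k, seed=None):
        if seed is not None:
            random.seed(seed)
        medoids = random.sample(points, k)           # step 1: random initial medoids
        best = total_cost(points, medoids)           # cost Ci of the current medoids
        improved = True
        while improved:                              # step 7: repeat until no change
            improved = False
            for i in range(k):
                for d in points:
                    if d in medoids:
                        continue
                    candidate = medoids[:i] + [d] + medoids[i + 1:]  # step 5: swap
                    cost = total_cost(points, candidate)             # cost Cj
                    if cost < best:                  # step 6: keep only improving swaps
                        medoids, best = candidate, cost
                        improved = True
        return medoids, best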
K-Medoids Clustering Numerical Example With Solution
Now that we have discussed the algorithm, let us work through a numerical example of K-Medoids clustering.
The dataset for clustering is as follows.
Point   Coordinates
A1      (2, 6)
A2      (3, 8)
A3      (4, 7)
A4      (6, 2)
A5      (6, 4)
A6      (7, 3)
A7      (7, 4)
A8      (8, 5)
A9      (7, 6)
A10     (3, 4)

Dataset for K-Medoids clustering
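For the worked example below, the dataset can be written in Python as a list of (x, y) tuples (the variable name points is illustrative):

    points = [(2, 6), (3, 8), (4, 7), (6, 2), (6, 4),
              (7, 3), (7, 4), (8, 5), (7, 6), (3, 4)]  # A1 .. A10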
Iteration 1
Suppose that we want to group the above dataset into two clusters. So, we will randomly choose two medoids.
Here, the choice of initial medoids matters for efficient execution, so we have selected two points from the
dataset that are plausible medoids for the final clusters. The two points we have selected as medoids are:
• M1 = (3, 4)
• M2 = (7, 3)
Now, we will calculate the distance between each data point and the medoids using the Manhattan distance
measure, d((x1, y1), (x2, y2)) = |x1 - x2| + |y1 - y2|. The results are tabulated as follows.
Point   Coordinates   Distance from M1 (3, 4)   Distance from M2 (7, 3)   Assigned Cluster
A1      (2, 6)        3                         8                         Cluster 1
A2      (3, 8)        4                         9                         Cluster 1
A3      (4, 7)        4                         7                         Cluster 1
A4      (6, 2)        5                         2                         Cluster 2
A5      (6, 4)        3                         2                         Cluster 2
A6      (7, 3)        5                         0                         Cluster 2
A7      (7, 4)        4                         1                         Cluster 2
A8      (8, 5)        6                         3                         Cluster 2
A9      (7, 6)        6                         3                         Cluster 2
A10     (3, 4)        0                         5                         Cluster 1
Iteration 1
The clusters formed with medoids (3, 4) and (7, 3) are as follows.
• Points in cluster 1 = {(2, 6), (3, 8), (4, 7), (3, 4)}
• Points in cluster 2 = {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}
After assigning clusters, we calculate the cost for each cluster and find their sum. The cost is simply
the sum of the distances of all the data points from the medoid of the cluster they belong to.
Hence, the cost for the current clustering is 3 + 4 + 4 + 2 + 2 + 0 + 1 + 3 + 3 + 0 = 22.
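As a check, the following sketch reproduces the iteration-1 table and the total cost of 22, reusing the manhattan helper from the algorithm sketch above:

    M1, M2 = (3, 4), (7, 3)
    cost = 0
    for p in points:
        d1, d2 = manhattan(p, M1), manhattan(p, M2)
        cost += min(d1, d2)                      # distance to the closest medoid
        print(p, d1, d2, "Cluster 1" if d1 <= d2 else "Cluster 2")
    print("Total cost:", cost)                   # prints 22, matching the table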
Iteration 2
Now, we will select another non-medoid point (7, 4) and make it a temporary medoid for the second cluster.
Hence,
• M1 = (3, 4)
• M2 = (7, 4)
Now, let us calculate the distance between all the data points and the current medoids.
Point   Coordinates   Distance from M1 (3, 4)   Distance from M2 (7, 4)   Assigned Cluster
A1      (2, 6)        3                         7                         Cluster 1
A2      (3, 8)        4                         8                         Cluster 1
A3      (4, 7)        4                         6                         Cluster 1
A4      (6, 2)        5                         3                         Cluster 2
A5      (6, 4)        3                         1                         Cluster 2
A6      (7, 3)        5                         1                         Cluster 2
A7      (7, 4)        4                         0                         Cluster 2
A8      (8, 5)        6                         2                         Cluster 2
A9      (7, 6)        6                         2                         Cluster 2
A10     (3, 4)        0                         4                         Cluster 1

Iteration 2
The cluster assignments haven’t changed after changing the medoids. Hence, the clusters are:
• Points in cluster 1: {(2, 6), (3, 8), (4, 7), (3, 4)}
• Points in cluster 2: {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}
Now, let us again calculate the cost for each cluster and find their sum. The total cost this time is
3 + 4 + 4 + 3 + 1 + 1 + 0 + 2 + 2 + 0 = 20.
Here, the current cost is less than the cost calculated in the previous iteration. Hence, we make the swap
permanent and make (7, 4) the medoid for cluster 2. If the cost this time had been greater than the previous
cost, i.e., 22, we would have reverted the change. The new medoids after this iteration are (3, 4) and (7, 4),
with no change in the clusters.
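The swap test of step 6 can be written with the total_cost helper from the algorithm sketch; the two costs match the values 22 and 20 computed above:

    old_cost = total_cost(points, [(3, 4), (7, 3)])   # Ci = 22 (iteration 1)
    new_cost = total_cost(points, [(3, 4), (7, 4)])   # Cj = 20 (iteration 2)
    if new_cost < old_cost:
        print("Keep the swap: medoids are now (3, 4) and (7, 4)")
    else:
        print("Revert the swap")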
Iteration 3
Now, let us again change the medoid of cluster 2, this time to (6, 4). Hence, the new medoids for the clusters
are M1 = (3, 4) and M2 = (6, 4).
Let us calculate the distance between the data points and the above medoids to find the new cluster. The results
have been tabulated as follows.
Point   Coordinates   Distance from M1 (3, 4)   Distance from M2 (6, 4)   Assigned Cluster
A1      (2, 6)        3                         6                         Cluster 1
A2      (3, 8)        4                         7                         Cluster 1
A3      (4, 7)        4                         5                         Cluster 1
A4      (6, 2)        5                         2                         Cluster 2
A5      (6, 4)        3                         0                         Cluster 2
A6      (7, 3)        5                         2                         Cluster 2
A7      (7, 4)        4                         1                         Cluster 2
A8      (8, 5)        6                         3                         Cluster 2
A9      (7, 6)        6                         3                         Cluster 2
A10     (3, 4)        0                         3                         Cluster 1
Iteration 3
Again, the clusters haven’t changed. Hence, the clusters are:
• Points in cluster 1: {(2, 6), (3, 8), (4, 7), (3, 4)}
• Points in cluster 2: {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}
Now, let us again calculate the cost for each cluster and find their sum. The total cost this time is
3 + 4 + 4 + 2 + 0 + 2 + 1 + 3 + 3 + 0 = 22.
The current cost is 22, which is greater than the cost in the previous iteration, i.e., 20. Hence, we revert the
change, and the point (7, 4) is again made the medoid for cluster 2.
So, the clusters after this iteration are cluster 1 = {(2, 6), (3, 8), (4, 7), (3, 4)} and cluster 2 = {(7, 4),
(6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}. The medoids are (3, 4) and (7, 4).
We keep trying to replace medoids with non-medoid data points. The set of medoids for which the cost is least,
together with its associated clusters, is made permanent. So, after all the iterations, you get the final
clusters and their medoids.
The K-Medoids clustering algorithm is computation-intensive and requires many iterations. In each iteration,
we need to calculate the distance between the medoids and the data points, assign clusters, and compute the
cost; a single PAM iteration evaluates on the order of k(n - k) candidate swaps, each of which requires
recomputing distances over the dataset. Hence, K-Medoids clustering is not well suited for large datasets.
Different Implementations of K-Medoids Clustering
The K-Medoids clustering algorithm has several implementations, such as PAM (Partitioning Around Medoids),
CLARA (Clustering Large Applications), and CLARANS (Clustering Large Applications based on Randomized Search).
Let us discuss their properties one by one.
PAM Implementation
The PAM implementation of the K-Medoids clustering algorithm is the most basic one. The numerical example
above follows the PAM implementation. PAM performs iterative clustering on the entire dataset and is
computationally intensive. Hence, it is not very efficient for large datasets.
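If you prefer a library over a hand-rolled loop, the third-party scikit-learn-extra package provides a KMedoids estimator with a PAM option. This is a hedged sketch: the package must be installed separately (pip install scikit-learn-extra), and the exact API may differ between versions.

    import numpy as np
    from sklearn_extra.cluster import KMedoids  # third-party package, not scikit-learn itself

    X = np.array([(2, 6), (3, 8), (4, 7), (6, 2), (6, 4),
                  (7, 3), (7, 4), (8, 5), (7, 6), (3, 4)])
    km = KMedoids(n_clusters=2, metric="manhattan", method="pam", random_state=0).fit(X)
    print(km.cluster_centers_)   # coordinates of the two medoids
    print(km.labels_)            # cluster index of each point
    print(km.inertia_)           # total distance of points to their medoids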
CLARA Implementation
CLARA implementation of k-medoids clustering is an improvement over the PAM implementation.
• In the CLARA implementation, the algorithm first draws samples of data points from the dataset. Then, it
performs PAM clustering on each sample and outputs the best set of medoids found across the sampled runs.
• Because CLARA performs clustering on samples of the dataset, we can use it on large datasets too. As the
dataset size grows, the advantage of CLARA over plain PAM increases. For small datasets, CLARA offers little
benefit, so you should prefer it for large datasets.
• In the case of unbalanced datasets, if the samples are also biased, then the results of the clustering
algorithm won’t be good. Hence, you need to take measures during data preprocessing to make the data
as balanced as possible.
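A minimal sketch of the CLARA idea, reusing the k_medoids and total_cost helpers defined earlier; the sample size and the number of samples are illustrative choices, not prescribed values:

    import random

    def clara(points, k, sample_size=40, n_samples=5, seed=0):
        random.seed(seed)
        best_medoids, best_cost = None, float("inf")
        for _ in range(n_samples):
            sample = random.sample(points, min(sample_size, len(points)))
            medoids, _ = k_medoids(sample, k)      # run PAM on the sample only
            cost = total_cost(points, medoids)     # but evaluate on the full dataset
            if cost < best_cost:
                best_medoids, best_cost = medoids, cost
        return best_medoids, best_cost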
CLARANS Implementation
CLARANS implementation of the k-medoids clustering algorithm is the most efficient implementation.
• While clustering, CLARANS examines a random sample of neighboring solutions at each step. Because of this,
it does not restrict the search to one region of the solution space and can give good results based on many
samples.
• CLARANS examines neighboring solutions at every step. Hence, for a large dataset with a large number of
clusters, the execution time of the CLARANS implementation can still become high.
• We can limit the number of local optima explored while clustering with CLARANS. In PAM or CLARA, once the
algorithm finds a local optimum, it randomly picks another starting configuration and searches over all the
data points again; in CLARANS, we can cap how many times this process is repeated.
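A minimal sketch of the CLARANS idea, again reusing total_cost. The parameters num_local (number of local searches) and max_neighbor (random neighbors examined before accepting a local optimum) are the two classic CLARANS parameters; the default values here are illustrative:

    import random

    def clarans(points, k, num_local=2, max_neighbor=20):
        best_medoids, best_cost = None, float("inf")
        for _ in range(num_local):                 # limited number of local searches
            medoids = random.sample(points, k)     # random starting solution
            cost = total_cost(points, medoids)
            tries = 0
            while tries < max_neighbor:
                i = random.randrange(k)            # pick a medoid to replace ...
                d = random.choice([p for p in points if p not in medoids])
                candidate = medoids[:i] + [d] + medoids[i + 1:]
                c = total_cost(points, candidate)
                if c < cost:                       # move to the better neighbor
                    medoids, cost, tries = candidate, c, 0
                else:
                    tries += 1                     # count failed neighbors
            if cost < best_cost:                   # keep the best local optimum
                best_medoids, best_cost = medoids, cost
        return best_medoids, best_cost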
Applications of K-Medoids Clustering in Machine Learning
K-Medoids clustering is an alternative to the K-means clustering algorithm. Hence, you can use k-medoids
clustering wherever k-means is applied.
• You can use k-medoids clustering for document clustering. The documents are first converted to vectors
using word embeddings. Then, you can use k-medoids clustering to group the documents into clusters based on
these vectors.
• E-commerce websites, supermarkets, and malls use transaction data for customer segmentation based on
buying behavior. It helps them to increase sales by recommending products according to user behavior
using recommendation systems.
• K-Medoids clustering can also be used in cyber profiling. It involves grouping people based on their
connection to each other. By collecting usage data from individual users, you can cluster them into
groups for profiling.
• By grouping similar pixels into clusters, you can also perform image segmentation using k-medoids
clustering.
• You can also perform fraud detection on banking and insurance data using k-medoids clustering. By
using historical data on frauds, you can find probable fraudulent transactions from a given dataset.
• You can also use k-medoids clustering in various other tasks such as social media profiling, ride-share
data analysis, identification of localities with a high rate of crime, etc.
Advantages of K-Medoids Clustering
• K-Medoids clustering is a simple iterative algorithm and is very easy to implement.
• K-Medoids clustering is guaranteed to converge. Hence, we are guaranteed to get results when we
perform k-medoids clustering on any dataset.
• K-Medoids clustering is not tied to a specific domain. Owing to this generality, we can use k-medoids
clustering in machine learning applications ranging from text data to geospatial data, and from financial
data to e-commerce data.
• The initial medoids in k-medoids clustering are selected randomly, but we can choose them in a way that
minimizes the number of iterations. To improve the performance of the algorithm, you can warm-start the
choice of medoids by selecting specific data points as medoids after some preliminary data analysis.
• Compared to other partitioning clustering algorithms such as K-median and k-modes clustering, the k-
medoids clustering algorithm is faster in execution.
• K-Medoids clustering algorithm is very robust and it effectively deals with outliers and noise in the
dataset. Compared to k-means clustering, the k-medoids clustering algorithm is a better choice for
analyzing data with significant noise and outliers.
Disadvantages of K-Medoids Clustering
• K-Medoids clustering is not a scalable algorithm. For large datasets, the execution time of the algorithm
becomes very high. Due to this, you cannot use the K-Medoids algorithm for very large datasets.
• In K-Medoids clustering, we don’t know the optimal number of clusters in advance. Hence, we need to perform
clustering with different values of k to find the optimal number of clusters (see the elbow-style sketch
after this list).
• When the dimensionality of the dataset increases, the distances between data points start to become similar
(the curse of dimensionality). Due to this, the distance between a data point and the various medoids becomes
almost the same, which makes the cluster assignments less meaningful. To overcome this problem, you can use
advanced clustering algorithms like spectral clustering. Alternatively, you can reduce the dimensionality of
the dataset during data preprocessing.
• The K-Medoids clustering algorithm chooses the initial medoids randomly, and the output clusters depend
heavily on the initial medoids. Due to this, every time you run the k-medoids clustering algorithm, you may
get different clusters.
• The K-Medoids clustering algorithm assigns points by their distance from a central point (the medoid). Due
to this, it tends to produce roughly circular/spherical clusters and is not useful for clustering data into
arbitrarily shaped clusters.
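As referenced in the list above, a simple elbow-style check helps choose k: run k-medoids for several values of k and watch where the total cost stops dropping sharply. A minimal sketch, reusing the k_medoids helper from earlier:

    for k in range(2, 6):
        medoids, cost = k_medoids(points, k, seed=0)
        print("k =", k, "-> total cost:", cost)
    # Choose the k after which the cost curve flattens (the "elbow").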
