Clustering

K-Means Clustering is an unsupervised machine learning algorithm that partitions unlabeled data into K clusters based on distance from centroids. The algorithm iteratively assigns data points to the nearest centroid, updates centroids, and continues until convergence. Techniques like the Elbow Method and Silhouette Score are used to determine the optimal number of clusters and evaluate clustering quality.

K-Means Clustering Algorithm

K-Means Clustering is an Unsupervised Machine Learning algorithm, which groups the unlabeled
dataset into different clusters.

What is K-means Clustering?

Unsupervised Machine Learning is the process of teaching a computer to use unlabeled, unclassified data and enabling the algorithm to operate on that data without supervision. Without any prior training on labeled data, the machine's job in this case is to organize unsorted data according to similarities, patterns, and differences.

K-means clustering assigns data points to one of K clusters depending on their distance from the centers of those clusters. It starts by randomly placing the cluster centroids in the feature space. Each data point is then assigned to a cluster based on its distance from that cluster's centroid. After every point has been assigned, new cluster centroids are computed. This process runs iteratively until the clusters stabilize. In this analysis we assume that the number of clusters is given in advance, and the task is to place each point into one of the groups.

What is the objective of k-means clustering?

The goal of clustering is to divide the population or set of data points into a number of groups so that the data points within each group are more similar to one another than they are to the data points in the other groups. It is essentially a grouping of things based on how similar and different they are to one another.

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

 Step-1: Select the number K to decide the number of clusters.

 Step-2: Select K random points as centroids. (They need not be points from the input dataset.)

 Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

 Step-4: Compute the new centroid of each cluster (the mean of the data points assigned to it).

 Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid.

 Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

 Step-7: The model is ready.
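To make these steps concrete, here is a minimal sketch of the loop in Python with NumPy. It is an illustrative implementation under simplifying assumptions (for example, it does not handle the rare case of an empty cluster); the function name kmeans and the sample parameters are not from the original text.

Python

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # Step-1/2: choose K and pick K random data points as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step-3/5: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step-6/7: stop when the centroids no longer move (no reassignment occurs)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

Calling kmeans(X, 3) on an array X of shape (n_samples, n_features) returns the final centroids and the cluster label of each point.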

The Elbow Method:

The elbow method is a technique used to determine the optimal number of clusters (K) in a dataset.
It involves plotting the within-cluster sum of squares (WCSS) against the number of clusters and
identifying the "elbow" point, where the rate of decrease in WCSS slows down significantly. This
point indicates the optimal number of clusters where adding more clusters does not significantly
reduce WCSS.
Calculating Distances:

In K-Means clustering, distances between data points and cluster centroids play a crucial role in
determining cluster assignments. Two common distance metrics are Euclidean distance and
Manhattan distance.

1. Euclidean Distance:

Euclidean distance is the straight-line distance between two points in Euclidean space. For two points
(x1, y1) and (x2, y2) in a two-dimensional space, the Euclidean distance (d) is calculated as:

d = √((x2 − x1)² + (y2 − y1)²)

This formula can be generalized to higher-dimensional spaces.

2. Manhattan Distance:

Manhattan distance, also known as city block distance or taxicab distance, measures the sum of the
absolute differences between the coordinates of two points. For two points (x1, y1) and (x2, y2), the
Manhattan distance (d) is calculated as:

d = |x2 − x1| + |y2 − y1|

Like Euclidean distance, Manhattan distance can be extended to higher-dimensional spaces.
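As a quick illustration, both distances can be computed directly in Python; the two sample points below are arbitrary.

Python

import numpy as np

p1 = np.array([2, 10])
p2 = np.array([5, 8])

# Euclidean distance: straight-line distance between the two points
euclidean = np.sqrt(np.sum((p2 - p1) ** 2))   # equivalently np.linalg.norm(p2 - p1)

# Manhattan distance: sum of the absolute coordinate differences
manhattan = np.sum(np.abs(p2 - p1))

print(euclidean)  # ~3.61
print(manhattan)  # 5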

Example: HDFC Banking Data Analysis:

Let's consider a hypothetical dataset from HDFC Banking, containing information about customers'
spending score and age. The objective is to segment customers based on these two features using
the K-Means clustering algorithm.

1. Data Preparation:

 Retrieve the HDFC Banking dataset containing customer information.

 Extract the features of interest: spending score and age.

 Normalize the features to ensure uniform scale.

2. Applying K-Means Clustering:

 Choose the number of clusters (K) based on domain knowledge or through techniques like
the elbow method.

 Initialize K cluster centroids.

 Assign each data point to the nearest centroid.

 Update the centroids based on the mean of the data points in each cluster.

 Repeat the assignment and update steps until convergence.

3. Visualization and Interpretation:

 Plot the data points on a 2D graph with spending score on one axis and age on the other.

 Color code the points based on their assigned clusters.

 Analyze the clusters to glean insights into customer segmentation.
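A possible implementation of this workflow is sketched below. The file name hdfc_customers.csv and the column names Age and SpendingScore are hypothetical placeholders for the actual dataset, and K=4 is just an example choice.

Python

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 1. Data preparation (file and column names are placeholders)
df = pd.read_csv('hdfc_customers.csv')
X = df[['Age', 'SpendingScore']]
X_scaled = StandardScaler().fit_transform(X)          # normalize to a uniform scale

# 2. K-Means clustering (K chosen via domain knowledge or the elbow method)
km = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = km.fit_predict(X_scaled)

# 3. Visualization: color-code customers by assigned cluster
plt.scatter(X['Age'], X['SpendingScore'], c=labels)
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.title('Customer segments')
plt.show()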


Mathematics Behind Clustering

K-Means clustering is a fundamental technique in the field of data science, widely used for
partitioning data into distinct clusters based on similarity. Behind its simplicity lies a robust
mathematical framework that governs the clustering process. In this article, we will delve into the
mathematics behind K-Means clustering, elucidating the steps involved in the algorithm using a
simple example dataset consisting of 8 data points and 3 centroids.

Understanding K-Means Clustering:

At its core, K-Means clustering aims to minimize the within-cluster variance by iteratively assigning
data points to clusters and updating cluster centroids. The algorithm follows these key steps:

 Initialization: Select 'K' initial cluster centroids randomly from the dataset.

 Assignment: Assign each data point to the nearest centroid, forming 'K' clusters.

 Update: Recalculate the centroids of the clusters based on the mean of the data points
assigned to each cluster.

 Iteration: Repeat steps 2 and 3 until convergence, i.e., when the centroids no longer change
significantly, or a predefined number of iterations is reached.

Mathematics Behind K-Means Clustering:

Let's dive into the mathematical details of each step using an example dataset of 8 data points: (2,10), (2,5), (8,4), (5,8), (7,5), (6,4), (1,2), (4,9), and 3 centroids.

1. Initialization:

Choose 3 initial centroids:


 Centroid 1: (2,10)

 Centroid 2: (5,8)

 Centroid 3: (1,2)

2. Assignment:

Calculate the distance between each data point and each centroid.

Assign each data point to the nearest centroid based on Euclidean distance.

Data Point | Centroid 1 (2,10) | Centroid 2 (5,8) | Centroid 3 (1,2) | Assigned Cluster
(2,10)     | 0                 | 3.61             | 8.06             | Centroid 1
(2,5)      | 5.00              | 4.24             | 3.16             | Centroid 3
(8,4)      | 8.49              | 5.00              | 7.28             | Centroid 2
(5,8)      | 3.61              | 0                | 7.21             | Centroid 2
(7,5)      | 7.07              | 3.61             | 6.71             | Centroid 2
(6,4)      | 7.21              | 4.12             | 5.39             | Centroid 2
(1,2)      | 8.06              | 7.21             | 0                | Centroid 3
(4,9)      | 2.24              | 1.41             | 7.62             | Centroid 2

3. Update:

Calculate the mean of the data points in each cluster and update the centroids.

New Centroids:

 Centroid 1: (2,10)

 Centroid 2: (6,6)

 Centroid 3: (1.5,3.5)

4. Iteration:

Repeat steps 2 and 3 if necessary, until convergence.

Conclusion:
After this iteration, the data points are grouped into three clusters based on their proximity to the updated centroids (further iterations would repeat the assignment and update steps until the centroids no longer change). The clusters at this stage are as follows:

Cluster 1:

 Data Points: (2,10)

 Centroid: (2,10)

Cluster 2:

 Data Points: (8,4), (5,8), (7,5), (6,4), (4,9)

 Centroid: (6,6)

Cluster 3:

 Data Points: (2,5), (1,2)

 Centroid: (1.5,3.5)

These clusters represent the segmentation of the dataset based on the K-Means clustering
algorithm. Each cluster contains data points that are close to each other and share similarities in
their features. The centroids represent the mean position of the data points in each cluster.
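For readers who want to verify these numbers, the following small sketch runs one assignment-and-update pass on the eight points with the three initial centroids; it is purely illustrative.

Python

import numpy as np

X = np.array([(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)], dtype=float)
centroids = np.array([(2, 10), (5, 8), (1, 2)], dtype=float)

# Assignment: index of the nearest centroid for every point
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
print(labels)          # [0 2 1 1 1 1 2 1]

# Update: each centroid becomes the mean of its assigned points
new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(3)])
print(new_centroids)   # [[2.  10. ] [6.  6. ] [1.5 3.5]]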

K-Means++

K-Means++ is an improvement over the traditional K-Means clustering algorithm, designed to select better initial centroids, thereby leading to improved clustering results. The primary goal of K-Means++ is to avoid poor initializations that can result in suboptimal clustering.

Concept of K-Means++:

1. Initialization of Centroids:

o In standard K-Means, centroids are randomly initialized, which can lead to varying
results depending on the initial centroids' locations.

o K-Means++ introduces a smarter initialization process to select the initial centroids.

2. Selection of Initial Centroids:

o The first centroid is chosen randomly from the dataset.

o Subsequent centroids are selected iteratively based on a probability distribution that favors points farther away from existing centroids.

o The probability of selecting a data point as the next centroid is proportional to the
square of its distance from the nearest centroid already chosen.

3. Algorithm Workflow:

o Initialize the first centroid randomly from the dataset.

o For each data point, compute its distance to the nearest centroid already chosen.

o Assign probabilities to data points based on their squared distances to the nearest centroid.

o Select the next centroid by sampling from the data points according to the computed probabilities.

o Repeat the process until 'K' centroids have been selected.

Advantages of K-Means++:

1. Improved Initialization:

o By selecting initial centroids that are well-separated and representative of the data,
K-Means++ reduces the likelihood of converging to suboptimal solutions.

2. Robustness:

o K-Means++ is less sensitive to the choice of initial centroids compared to random initialization, leading to more consistent and stable clustering results.

3. Reduced Number of Iterations:

o Better initialization often leads to faster convergence of the K-Means algorithm, reducing the number of iterations required to reach a solution.

4. Better Clustering Quality:

o With more representative initial centroids, K-Means++ tends to produce clusters that
better capture the underlying structure of the data.

Implementation:

The implementation of K-Means++ involves modifying the initialization step of the traditional K-Means algorithm to incorporate the probabilistic centroid selection process described above. Once the initial centroids are chosen using K-Means++, the rest of the algorithm proceeds as usual, iteratively assigning data points to clusters and updating centroids until convergence.
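The selection procedure can be sketched as follows. This is a simplified illustration, not scikit-learn's actual implementation (in practice you would simply pass init='k-means++' to KMeans, which is its default).

Python

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen centroid
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2, axis=1)
        # Sample the next centroid with probability proportional to that squared distance
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)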

Finding the optimal value of K

Finding the optimal value of K, the number of clusters, is crucial in K-Means clustering as it directly
impacts the quality of the clustering results. One common approach to determining the optimal K is
by using the concept of Within-Cluster Sum of Squares (WCSS).

Concept of WCSS (Within-Cluster Sum of Squares):

WCSS measures the compactness of clusters in K-Means clustering. It represents the sum of squared
distances between each data point and its assigned centroid within each cluster. Mathematically,
WCSS is calculated as follows:

WCSS = ∑ (i = 1 to K) ∑ (j = 1 to ni) ‖xij − ci‖²

Where:

 K is the number of clusters.

 ni is the number of data points in cluster i.

 xij is the j-th data point in cluster i.

 ci is the centroid of cluster i.
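As a small sketch, WCSS can be computed directly from labeled data; the result corresponds to the inertia_ attribute of a fitted scikit-learn KMeans model.

Python

import numpy as np

def wcss(X, labels, centroids):
    # Sum of squared distances from every point to the centroid of its own cluster
    return sum(np.sum((X[labels == i] - c) ** 2) for i, c in enumerate(centroids))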


Determining Optimal K using Elbow Method:

The Elbow Method is a graphical technique used to find the optimal value of K by plotting the WCSS
against different values of K and identifying the "elbow" point where the rate of decrease in WCSS
slows down significantly. The idea is to choose the value of K where adding more clusters doesn't
lead to a significant improvement in WCSS.

Here are the steps to determine the optimal K using the Elbow Method:

1. Choose a Range of K Values: Start by selecting a range of potential values for K, typically
ranging from 1 to a maximum value based on domain knowledge or computational
resources.

2. Compute WCSS for Each K: For each value of K, run the K-Means algorithm and calculate the
corresponding WCSS.

3. Plot WCSS vs. K: Create a line plot where the x-axis represents the number of clusters (K) and
the y-axis represents the WCSS.

4. Identify the Elbow Point: Examine the plot and identify the point where the rate of decrease
in WCSS slows down significantly. This point is often referred to as the "elbow" point.

5. Choose Optimal K: Select the value of K at the elbow point as the optimal number of
clusters.
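These steps can be sketched with scikit-learn, using the inertia_ attribute as the WCSS of each fitted model; the synthetic data and the range of K values are illustrative.

Python

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

k_values = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in k_values]

plt.plot(k_values, wcss, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()   # for this synthetic data the elbow appears near K = 4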

Evaluation Metric: Silhouette Score

The Silhouette Score is a metric used to assess how well defined a dataset's clusters are. It quantifies both the cohesiveness within clusters and the separation between clusters. Scores range from -1 to 1, with higher scores indicating better-defined clusters. A score close to 1 means an object is well matched to its own cluster and poorly matched to neighbouring clusters, while a score close to -1 suggests that the object may be in the wrong cluster. The Silhouette Score is useful for judging how appropriate a clustering method is and how many clusters are best for a particular dataset.

Mathematical Formula:

Silhouette Score (S) for a data point i is calculated as:

S(i)= (b(i) - a(i)) / max(a(i),b(i))

Here,

a(i) is the average distance from i to other data points in the same cluster.

b(i) is the smallest average distance from i to data points in a different cluster.

Interpretation: It ranges from -1 (poor clustering) to +1 (perfect clustering). A score close to 1 suggests well-separated clusters.
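For example, if for some point a(i) = 0.5 (its average distance to the other points in its own cluster) and b(i) = 2.0 (its smallest average distance to another cluster), then S(i) = (2.0 − 0.5) / max(0.5, 2.0) = 0.75, indicating that the point is well placed in its cluster.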

Good and poor clustering

Good Clustering:
In a good clustering scenario, the clusters are well-separated and distinct from each other. Let's say
we have a dataset with two well-separated clusters: one cluster around the coordinates (2,2) and
another cluster around the coordinates (8,8). When we apply K-means clustering with k=2, ideally,
the algorithm should correctly identify these two clusters.

After clustering, the Silhouette Score would be close to 1 for most points in both clusters because the
points are closer to other points in the same cluster than to points in other clusters. The silhouette
score would be high overall, indicating good separation between clusters.

Poor Clustering:

In contrast, let's consider a scenario where the clusters overlap significantly. For instance, imagine a
dataset with two clusters that partially overlap each other. Perhaps one cluster is centered at (3,3),
and the other cluster is centered at (5,5), but there's a region where points from both clusters mix
together.

When we apply K-means clustering with k=2, the algorithm may struggle to distinguish between
these overlapping clusters. As a result, it might incorrectly assign some points from one cluster to the
other, leading to misclassification.

In this case, the Silhouette Score would be closer to 0 for many points because the distance to points
in other clusters would be comparable to the distance to points in the same cluster. The silhouette
score would be lower overall, indicating poorer separation between clusters.

In summary, good clustering results in well-defined, separated clusters with high Silhouette Scores,
while poor clustering results in overlapping or poorly separated clusters with lower Silhouette Scores.
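The contrast can be sketched with synthetic data; the cluster centers below mirror the (2,2)/(8,8) and (3,3)/(5,5) scenarios described above, and the exact scores will vary with the data and random seed.

Python

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def score_for(centers):
    X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=42)
    labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
    return silhouette_score(X, labels)

print(score_for([(2, 2), (8, 8)]))   # well-separated clusters -> high score
print(score_for([(3, 3), (5, 5)]))   # overlapping clusters    -> noticeably lower score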

Practical Implementation

In practical terms, implementing the Silhouette Score for evaluating clustering involves a few steps:

1. Perform Clustering: First, you need to apply a clustering algorithm to your dataset. Common
algorithms for clustering include K-means, hierarchical clustering, DBSCAN, etc. Choose an
appropriate algorithm based on your data and requirements.

2. Calculate Silhouette Score for Each Data Point: For each data point in your dataset, calculate its
silhouette score. To do this, you need to compute two distances:

a: The average distance from the data point to other points within the same cluster.

b: The smallest average distance from the data point to points in a different cluster, calculated over
all clusters except the one the data point belongs to.

3. Compute Silhouette Score: Once you have computed a and b for each data point, calculate the
Silhouette Score for that data point using the formula:

Silhouette score= (b−a)/max(a,b)

Average these scores across all data points to obtain the overall Silhouette Score for the clustering.

4. Interpretation: Higher Silhouette Scores indicate better clustering, with scores closer to 1
suggesting well-separated clusters. Scores around 0 indicate overlapping clusters, and negative
scores suggest that data points may have been assigned to the wrong clusters.
5. Parameter Tuning and Validation: Repeat the clustering process with different parameters (e.g.,
number of clusters for K-means) and compare the Silhouette Scores to find the optimal
configuration. Additionally, it's essential to validate the clustering results using domain knowledge or
other evaluation metrics, as the Silhouette Score alone may not always capture all aspects of the
clustering quality.

Here's a basic Python example using the popular scikit-learn library to compute the Silhouette Score:

Python

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data stands in for df_k, the feature matrix being clustered
df_k, _ = make_blobs(n_samples=300, centers=3, random_state=42)
# km is a KMeans model fitted on df_k
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(df_k)

score = silhouette_score(df_k, km.labels_, metric='euclidean')

print(score)

 df_k: the data points that were clustered. Here it is the feature matrix of the dataset (a DataFrame or NumPy array of features); you would normally pass your own feature matrix as this argument.

 km.labels_: the cluster labels assigned by the KMeans algorithm to each data point in df_k. After fitting the KMeans model (km in this example) to the data, the .labels_ attribute contains the cluster label assigned to each data point.

 metric='euclidean': the distance metric to use when calculating the Silhouette Score. The default metric is 'euclidean', which measures the distance between two points in Euclidean space. You can specify other distance metrics supported by scikit-learn, such as 'manhattan', 'cosine', etc., depending on your specific needs.

The silhouette_score function returns the average Silhouette Score of all samples. It calculates the
Silhouette Score for each sample individually and then computes the mean over all samples.
Hierarchical Clustering in Machine Learning
Hierarchical clustering is another unsupervised machine learning algorithm used to group unlabeled datasets into clusters. It is also known as hierarchical cluster analysis (HCA).

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped
structure is known as the dendrogram.

Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two differ in how they work. In particular, there is no requirement to predetermine the number of clusters as there is in the K-Means algorithm.

The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts with taking all data points as single clusters and merging them until one cluster is left.

2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach.

Why hierarchical clustering?

We already have other clustering algorithms such as K-Means Clustering, so why do we need hierarchical clustering? As we have seen, K-means clustering has some challenges: it requires a predetermined number of clusters, and it tends to produce clusters of similar size. To address these two challenges, we can opt for the hierarchical clustering algorithm, because in this algorithm we do not need prior knowledge of the number of clusters.

In this topic, we will discuss the Agglomerative Hierarchical clustering algorithm.

Agglomerative Hierarchical clustering

The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the data into clusters, it follows the bottom-up approach. This means the algorithm treats each data point as a single cluster at the beginning, and then starts merging the closest pairs of clusters. It does this until all the clusters are merged into a single cluster that contains all the data points.

This hierarchy of clusters is represented in the form of the dendrogram.

How Does Agglomerative Hierarchical Clustering Work?

The working of the AHC algorithm can be explained using the below steps:
o Step-1: Treat each data point as a single cluster. If there are N data points, there will be N clusters to start with.

o Step-2: Take the two closest data points or clusters and merge them to form one cluster. There will now be N-1 clusters.

o Step-3: Again, take the two closest clusters and merge them together to form one cluster. There will be N-2 clusters.

o Step-4: Repeat Step-3 until only one cluster is left.

o Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to divide the clusters as per the problem.

Note: To better understand hierarchical clustering, it is advised to have a look at k-means clustering.
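As a minimal sketch, scikit-learn's AgglomerativeClustering implements this bottom-up procedure; the synthetic data and the choice of two clusters below are illustrative.

Python

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=2, random_state=42)

# Bottom-up merging; linkage can be 'ward', 'complete', 'average' or 'single'
agg = AgglomerativeClustering(n_clusters=2, linkage='average')
labels = agg.fit_predict(X)
print(labels[:10])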

Measure for the distance between two clusters

As we have seen, the closest distance between the two clusters is crucial for the hierarchical
clustering. There are various ways to calculate the distance between two clusters, and these ways
decide the rule for clustering. These measures are called Linkage methods. Some of the popular
linkage methods are given below:
1. Single Linkage: It is the shortest distance between the closest points of the two clusters.

2. Complete Linkage: It is the farthest distance between two points in two different clusters. It is one of the popular linkage methods, as it forms tighter clusters than single linkage.

3. Average Linkage: It is the linkage method in which the distances between all pairs of points (one from each cluster) are added up and then divided by the number of pairs to calculate the average distance between two clusters. It is also one of the most popular linkage methods.

4. Centroid Linkage: It is the linkage method in which the distance between the centroids of the two clusters is calculated.

From the above-given approaches, we can apply any of them according to the type of problem or
business requirement.

Working of Dendrogram in Hierarchical Clustering

The dendrogram is a tree-like structure that is mainly used to record each merge step that the HC algorithm performs. In the dendrogram plot, the Y-axis shows the Euclidean distances between the data points (or clusters), and the X-axis shows all the data points of the given dataset.

The working of the dendrogram can be explained with a diagram in which the left part shows how clusters are created in agglomerative clustering and the right part shows the corresponding dendrogram.

o As discussed above, the data points P2 and P3 combine first and form a cluster; correspondingly, a dendrogram link is created that connects P2 and P3 with a rectangular shape. Its height is decided by the Euclidean distance between the data points.

o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram link is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is slightly greater than that between P2 and P3.

o Again, two new dendrogram links are created, one combining P1, P2, and P3, and another combining P4, P5, and P6.

o At last, the final dendrogram link is created that combines all the data points together.

We can cut the dendrogram tree structure at any level as per our requirement.
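A dendrogram like the one described can be produced with SciPy; the six sample points P1-P6 below are hypothetical placeholders chosen so that P2 and P3 merge first, not values taken from any particular figure.

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical coordinates for P1..P6
points = np.array([[1, 1], [1.5, 1.2], [1.4, 1.1], [6, 6], [6.5, 6.2], [6.3, 6.0]])

# 'average' linkage here; 'single', 'complete' or 'centroid' can be used instead
Z = linkage(points, method='average')

dendrogram(Z, labels=['P1', 'P2', 'P3', 'P4', 'P5', 'P6'])
plt.ylabel('Euclidean distance')
plt.show()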
