Clustering

K-Means Clustering is an unsupervised machine learning algorithm that partitions unlabeled data into K clusters based on distance from centroids. The algorithm iteratively assigns data points to the nearest centroid, updates centroids, and continues until convergence. Techniques like the Elbow Method and Silhouette Score are used to determine the optimal number of clusters and evaluate clustering quality.

K-Means Clustering Algorithm

K-Means Clustering is an Unsupervised Machine Learning algorithm, which groups the unlabeled
dataset into different clusters.

What is K-means Clustering?

Unsupervised Machine Learning is the process of teaching a computer to use unlabeled, unclassified data and enabling the algorithm to operate on that data without supervision. Without any prior training on labeled data, the machine's job in this case is to organize unsorted data according to similarities, patterns, and differences.

K-means clustering assigns data points to one of K clusters depending on their distance from the centers of those clusters. It starts by randomly placing the cluster centroids in the feature space. Each data point is then assigned to a cluster based on its distance from that cluster's centroid. After every point has been assigned, new cluster centroids are computed. This process runs iteratively until the clusters stabilize. In this analysis we assume that the number of clusters is given in advance, and the task is to place each point into one of the groups.

What is the objective of k-means clustering?

The goal of clustering is to divide the population or set of data points into a number of groups so that the data points within each group are more similar to one another than they are to the data points in the other groups. It is essentially a grouping of things based on how similar and different they are to one another.

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

 Step-1: Select the number K to decide the number of clusters.

 Step-2: Select K random points as centroids. (They need not be points from the input dataset.)

 Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

 Step-4: Compute the new centroid of each cluster (the mean of the data points assigned to it).

 Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid.

 Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

 Step-7: The model is ready.
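To make these steps concrete, here is a minimal sketch of the loop in Python with NumPy. It is an illustrative implementation under simplifying assumptions (for example, it does not handle the rare case of an empty cluster); the function name kmeans and the sample parameters are not from the original text.

Python

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # Step-1/2: choose K and pick K random data points as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step-3/5: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step-6/7: stop when the centroids no longer move (no reassignment occurs)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

Calling kmeans(X, 3) on an array X of shape (n_samples, n_features) returns the final centroids and the cluster label of each point.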

The Elbow Method:

The elbow method is a technique used to determine the optimal number of clusters (K) in a dataset.
It involves plotting the within-cluster sum of squares (WCSS) against the number of clusters and
identifying the "elbow" point, where the rate of decrease in WCSS slows down significantly. This
point indicates the optimal number of clusters where adding more clusters does not significantly
reduce WCSS.
Calculating Distances:

In K-Means clustering, distances between data points and cluster centroids play a crucial role in
determining cluster assignments. Two common distance metrics are Euclidean distance and
Manhattan distance.

1. Euclidean Distance:

Euclidean distance is the straight-line distance between two points in Euclidean space. For two points
(x1, y1) and (x2, y2) in a two-dimensional space, the Euclidean distance (d) is calculated as:

d = √((x2 − x1)² + (y2 − y1)²)

This formula can be generalized to higher-dimensional spaces.

2. Manhattan Distance:

Manhattan distance, also known as city block distance or taxicab distance, measures the sum of the
absolute differences between the coordinates of two points. For two points (x1, y1) and (x2, y2), the
Manhattan distance (d) is calculated as:

d = |x2 − x1| + |y2 − y1|

Like Euclidean distance, Manhattan distance can be extended to higher-dimensional spaces.
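As a quick illustration, both distances can be computed directly in Python; the two sample points below are arbitrary.

Python

import numpy as np

p1 = np.array([2, 10])
p2 = np.array([5, 8])

# Euclidean distance: straight-line distance between the two points
euclidean = np.sqrt(np.sum((p2 - p1) ** 2))   # equivalently np.linalg.norm(p2 - p1)

# Manhattan distance: sum of the absolute coordinate differences
manhattan = np.sum(np.abs(p2 - p1))

print(euclidean)  # ~3.61
print(manhattan)  # 5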

Example: HDFC Banking Data Analysis:

Let's consider a hypothetical dataset from HDFC Banking, containing information about customers'
spending score and age. The objective is to segment customers based on these two features using
the K-Means clustering algorithm.

1. Data Preparation:

 Retrieve the HDFC Banking dataset containing customer information.

 Extract the features of interest: spending score and age.

 Normalize the features to ensure uniform scale.

2. Applying K-Means Clustering:

 Choose the number of clusters (K) based on domain knowledge or through techniques like
the elbow method.

 Initialize K cluster centroids.

 Assign each data point to the nearest centroid.

 Update the centroids based on the mean of the data points in each cluster.

 Repeat the assignment and update steps until convergence.

3. Visualization and Interpretation:

 Plot the data points on a 2D graph with spending score on one axis and age on the other.

 Color code the points based on their assigned clusters.

 Analyze the clusters to glean insights into customer segmentation.
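A possible implementation of this workflow is sketched below. The file name hdfc_customers.csv and the column names Age and SpendingScore are hypothetical placeholders for the actual dataset, and K=4 is just an example choice.

Python

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 1. Data preparation (file and column names are placeholders)
df = pd.read_csv('hdfc_customers.csv')
X = df[['Age', 'SpendingScore']]
X_scaled = StandardScaler().fit_transform(X)          # normalize to a uniform scale

# 2. K-Means clustering (K chosen via domain knowledge or the elbow method)
km = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = km.fit_predict(X_scaled)

# 3. Visualization: color-code customers by assigned cluster
plt.scatter(X['Age'], X['SpendingScore'], c=labels)
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.title('Customer segments')
plt.show()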


Mathematics Behind Clustering

K-Means clustering is a fundamental technique in the field of data science, widely used for
partitioning data into distinct clusters based on similarity. Behind its simplicity lies a robust
mathematical framework that governs the clustering process. In this article, we will delve into the
mathematics behind K-Means clustering, elucidating the steps involved in the algorithm using a
simple example dataset consisting of 8 data points and 3 centroids.

Understanding K-Means Clustering:

At its core, K-Means clustering aims to minimize the within-cluster variance by iteratively assigning
data points to clusters and updating cluster centroids. The algorithm follows these key steps:

 Initialization: Select 'K' initial cluster centroids randomly from the dataset.

 Assignment: Assign each data point to the nearest centroid, forming 'K' clusters.

 Update: Recalculate the centroids of the clusters based on the mean of the data points
assigned to each cluster.

 Iteration: Repeat steps 2 and 3 until convergence, i.e., when the centroids no longer change
significantly, or a predefined number of iterations is reached.

Mathematics Behind K-Means Clustering:

Let's dive into the mathematical details of each step using an example dataset of 8 data points: (2,10), (2,5), (8,4), (5,8), (7,5), (6,4), (1,2), (4,9), and 3 centroids.

1. Initialization:

Choose 3 initial centroids:


 Centroid 1: (2,10)

 Centroid 2: (5,8)

 Centroid 3: (1,2)

2. Assignment:

Calculate the distance between each data point and each centroid.

Assign each data point to the nearest centroid based on Euclidean distance.

Data Point | Centroid 1 (2,10) | Centroid 2 (5,8) | Centroid 3 (1,2) | Assigned Cluster
(2,10)     | 0                 | 3.61             | 8.06             | Centroid 1
(2,5)      | 5.00              | 4.24             | 3.16             | Centroid 3
(8,4)      | 8.49              | 5.00              | 7.28             | Centroid 2
(5,8)      | 3.61              | 0                | 7.21             | Centroid 2
(7,5)      | 7.07              | 3.61             | 6.71             | Centroid 2
(6,4)      | 7.21              | 4.12             | 5.39             | Centroid 2
(1,2)      | 8.06              | 7.21             | 0                | Centroid 3
(4,9)      | 2.24              | 1.41             | 7.62             | Centroid 2

3. Update:

Calculate the mean of the data points in each cluster and update the centroids.

New Centroids:

 Centroid 1: (2,10)

 Centroid 2: (6,6)

 Centroid 3: (1.5,3.5)

4. Iteration:

Repeat steps 2 and 3 if necessary, until convergence.

Conclusion:
After this iteration, the data points are grouped into three clusters based on their proximity to the updated centroids (further iterations would repeat the assignment and update steps until the centroids no longer change). The clusters at this stage are as follows:

Cluster 1:

 Data Points: (2,10)

 Centroid: (2,10)

Cluster 2:

 Data Points: (8,4), (5,8), (7,5), (6,4), (4,9)

 Centroid: (6,6)

Cluster 3:

 Data Points: (2,5), (1,2)

 Centroid: (1.5,3.5)

These clusters represent the segmentation of the dataset based on the K-Means clustering
algorithm. Each cluster contains data points that are close to each other and share similarities in
their features. The centroids represent the mean position of the data points in each cluster.
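For readers who want to verify these numbers, the following small sketch runs one assignment-and-update pass on the eight points with the three initial centroids; it is purely illustrative.

Python

import numpy as np

X = np.array([(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)], dtype=float)
centroids = np.array([(2, 10), (5, 8), (1, 2)], dtype=float)

# Assignment: index of the nearest centroid for every point
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
print(labels)          # [0 2 1 1 1 1 2 1]

# Update: each centroid becomes the mean of its assigned points
new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(3)])
print(new_centroids)   # [[2.  10. ] [6.  6. ] [1.5 3.5]]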

K-Means++

K-Means++ is an improvement over the traditional K-Means clustering algorithm, designed to select better initial centroids, thereby leading to improved clustering results. The primary goal of K-Means++ is to avoid poor initializations that can result in suboptimal clustering.

Concept of K-Means++:

1. Initialization of Centroids:

o In standard K-Means, centroids are randomly initialized, which can lead to varying
results depending on the initial centroids' locations.

o K-Means++ introduces a smarter initialization process to select the initial centroids.

2. Selection of Initial Centroids:

o The first centroid is chosen randomly from the dataset.

o Subsequent centroids are selected iteratively based on a probability distribution that favors points farther away from existing centroids.

o The probability of selecting a data point as the next centroid is proportional to the
square of its distance from the nearest centroid already chosen.

3. Algorithm Workflow:

o Initialize the first centroid randomly from the dataset.

o For each data point, compute its distance to the nearest centroid already chosen.

o Assign probabilities to data points based on their squared distances to the nearest centroid.

o Select the next centroid by sampling from the data points according to the computed probabilities.

o Repeat the process until 'K' centroids have been selected.

Advantages of K-Means++:

1. Improved Initialization:

o By selecting initial centroids that are well-separated and representative of the data,
K-Means++ reduces the likelihood of converging to suboptimal solutions.

2. Robustness:

o K-Means++ is less sensitive to the choice of initial centroids compared to random initialization, leading to more consistent and stable clustering results.

3. Reduced Number of Iterations:

o Better initialization often leads to faster convergence of the K-Means algorithm, reducing the number of iterations required to reach a solution.

4. Better Clustering Quality:

o With more representative initial centroids, K-Means++ tends to produce clusters that
better capture the underlying structure of the data.

Implementation:

The implementation of K-Means++ involves modifying the initialization step of the traditional K-Means algorithm to incorporate the probabilistic centroid selection process described above. Once the initial centroids are chosen using K-Means++, the rest of the algorithm proceeds as usual, iteratively assigning data points to clusters and updating centroids until convergence.
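The selection procedure can be sketched as follows. This is a simplified illustration, not scikit-learn's actual implementation (in practice you would simply pass init='k-means++' to KMeans, which is its default).

Python

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen centroid
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2, axis=1)
        # Sample the next centroid with probability proportional to that squared distance
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)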

Finding the optimal value of K

Finding the optimal value of K, the number of clusters, is crucial in K-Means clustering as it directly
impacts the quality of the clustering results. One common approach to determining the optimal K is
by using the concept of Within-Cluster Sum of Squares (WCSS).

Concept of WCSS (Within-Cluster Sum of Squares):

WCSS measures the compactness of clusters in K-Means clustering. It represents the sum of squared
distances between each data point and its assigned centroid within each cluster. Mathematically,
WCSS is calculated as follows:

WCSS = ∑ (i = 1 to K) ∑ (j = 1 to ni) ‖xij − ci‖²

Where:

 K is the number of clusters.

 ni is the number of data points in cluster i.

 xij is the j-th data point in cluster i.

 ci is the centroid of cluster i.
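As a small sketch, WCSS can be computed directly from labeled data; the result corresponds to the inertia_ attribute of a fitted scikit-learn KMeans model.

Python

import numpy as np

def wcss(X, labels, centroids):
    # Sum of squared distances from every point to the centroid of its own cluster
    return sum(np.sum((X[labels == i] - c) ** 2) for i, c in enumerate(centroids))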


Determining Optimal K using Elbow Method:

The Elbow Method is a graphical technique used to find the optimal value of K by plotting the WCSS
against different values of K and identifying the "elbow" point where the rate of decrease in WCSS
slows down significantly. The idea is to choose the value of K where adding more clusters doesn't
lead to a significant improvement in WCSS.

Here are the steps to determine the optimal K using the Elbow Method:

1. Choose a Range of K Values: Start by selecting a range of potential values for K, typically
ranging from 1 to a maximum value based on domain knowledge or computational
resources.

2. Compute WCSS for Each K: For each value of K, run the K-Means algorithm and calculate the
corresponding WCSS.

3. Plot WCSS vs. K: Create a line plot where the x-axis represents the number of clusters (K) and
the y-axis represents the WCSS.

4. Identify the Elbow Point: Examine the plot and identify the point where the rate of decrease
in WCSS slows down significantly. This point is often referred to as the "elbow" point.

5. Choose Optimal K: Select the value of K at the elbow point as the optimal number of
clusters.
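These steps can be sketched with scikit-learn, using the inertia_ attribute as the WCSS of each fitted model; the synthetic data and the range of K values are illustrative.

Python

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

k_values = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in k_values]

plt.plot(k_values, wcss, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()   # for this synthetic data the elbow appears near K = 4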

Evaluation Metric: Silhouette Score

The Silhouette Score is a metric used to assess how well defined a dataset's clusters are. It quantifies both the cohesiveness within clusters and the separation between clusters. Scores range from -1 to 1, with higher scores indicating better-defined clusters. A score close to 1 means an object is well matched to its own cluster and poorly matched to neighbouring clusters, while a score close to -1 suggests that the object may be in the wrong cluster. The Silhouette Score is useful for judging how appropriate a clustering method is and how many clusters are best for a particular dataset.

Mathematical Formula:

Silhouette Score (S) for a data point i is calculated as:

S(i)= (b(i) - a(i)) / max(a(i),b(i))

Here,

a(i) is the average distance from i to other data points in the same cluster.

b(i) is the smallest average distance from i to data points in a different cluster.

Interpretation: It ranges from -1 (poor clustering) to +1 (perfect clustering). A score close to 1 suggests well-separated clusters.
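For example, if for some point a(i) = 0.5 (its average distance to the other points in its own cluster) and b(i) = 2.0 (its smallest average distance to another cluster), then S(i) = (2.0 − 0.5) / max(0.5, 2.0) = 0.75, indicating that the point is well placed in its cluster.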

Good and poor clustering

Good Clustering:
In a good clustering scenario, the clusters are well-separated and distinct from each other. Let's say
we have a dataset with two well-separated clusters: one cluster around the coordinates (2,2) and
another cluster around the coordinates (8,8). When we apply K-means clustering with k=2, ideally,
the algorithm should correctly identify these two clusters.

After clustering, the Silhouette Score would be close to 1 for most points in both clusters because the
points are closer to other points in the same cluster than to points in other clusters. The silhouette
score would be high overall, indicating good separation between clusters.

Poor Clustering:

In contrast, let's consider a scenario where the clusters overlap significantly. For instance, imagine a
dataset with two clusters that partially overlap each other. Perhaps one cluster is centered at (3,3),
and the other cluster is centered at (5,5), but there's a region where points from both clusters mix
together.

When we apply K-means clustering with k=2, the algorithm may struggle to distinguish between
these overlapping clusters. As a result, it might incorrectly assign some points from one cluster to the
other, leading to misclassification.

In this case, the Silhouette Score would be closer to 0 for many points because the distance to points
in other clusters would be comparable to the distance to points in the same cluster. The silhouette
score would be lower overall, indicating poorer separation between clusters.

In summary, good clustering results in well-defined, separated clusters with high Silhouette Scores,
while poor clustering results in overlapping or poorly separated clusters with lower Silhouette Scores.
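The contrast can be sketched with synthetic data; the cluster centers below mirror the (2,2)/(8,8) and (3,3)/(5,5) scenarios described above, and the exact scores will vary with the data and random seed.

Python

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def score_for(centers):
    X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=42)
    labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
    return silhouette_score(X, labels)

print(score_for([(2, 2), (8, 8)]))   # well-separated clusters -> high score
print(score_for([(3, 3), (5, 5)]))   # overlapping clusters    -> noticeably lower score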

Practical Implementation

In practical terms, implementing the Silhouette Score for evaluating clustering involves a few steps:

1. Perform Clustering: First, you need to apply a clustering algorithm to your dataset. Common
algorithms for clustering include K-means, hierarchical clustering, DBSCAN, etc. Choose an
appropriate algorithm based on your data and requirements.

2. Calculate Silhouette Score for Each Data Point: For each data point in your dataset, calculate its
silhouette score. To do this, you need to compute two distances:

a: The average distance from the data point to other points within the same cluster.

b: The smallest average distance from the data point to points in a different cluster, calculated over
all clusters except the one the data point belongs to.

3. Compute Silhouette Score: Once you have computed a and b for each data point, calculate the
Silhouette Score for that data point using the formula:

Silhouette score= (b−a)/max(a,b)

Average these scores across all data points to obtain the overall Silhouette Score for the clustering.

4. Interpretation: Higher Silhouette Scores indicate better clustering, with scores closer to 1
suggesting well-separated clusters. Scores around 0 indicate overlapping clusters, and negative
scores suggest that data points may have been assigned to the wrong clusters.
5. Parameter Tuning and Validation: Repeat the clustering process with different parameters (e.g.,
number of clusters for K-means) and compare the Silhouette Scores to find the optimal
configuration. Additionally, it's essential to validate the clustering results using domain knowledge or
other evaluation metrics, as the Silhouette Score alone may not always capture all aspects of the
clustering quality.

Here's a basic Python example using the popular scikit-learn library to compute the Silhouette Score:

Python

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data stands in for df_k, the feature matrix being clustered
df_k, _ = make_blobs(n_samples=300, centers=3, random_state=42)
# km is a KMeans model fitted on df_k
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(df_k)

score = silhouette_score(df_k, km.labels_, metric='euclidean')

print(score)

 df_k: the data points that were clustered. Here it is the feature matrix of the dataset (a DataFrame or NumPy array of features); you would normally pass your own feature matrix as this argument.

 km.labels_: the cluster labels assigned by the KMeans algorithm to each data point in df_k. After fitting the KMeans model (km in this example) to the data, the .labels_ attribute contains the cluster label assigned to each data point.

 metric='euclidean': the distance metric to use when calculating the Silhouette Score. The default metric is 'euclidean', which measures the distance between two points in Euclidean space. You can specify other distance metrics supported by scikit-learn, such as 'manhattan', 'cosine', etc., depending on your specific needs.

The silhouette_score function returns the average Silhouette Score of all samples. It calculates the
Silhouette Score for each sample individually and then computes the mean over all samples.
Hierarchical Clustering in Machine Learning
Hierarchical clustering is another unsupervised machine learning algorithm used to group unlabeled datasets into clusters. It is also known as hierarchical cluster analysis (HCA).

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped
structure is known as the dendrogram.

Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two differ in how they work. In particular, there is no requirement to predetermine the number of clusters as there is in the K-Means algorithm.

The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts with taking all data points as single clusters and merging them until one cluster is left.

2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach.

Why hierarchical clustering?

We already have other clustering algorithms such as K-Means Clustering, so why do we need hierarchical clustering? As we have seen, K-means clustering has some challenges: it requires a predetermined number of clusters, and it tends to produce clusters of similar size. To address these two challenges, we can opt for the hierarchical clustering algorithm, because in this algorithm we do not need prior knowledge of the number of clusters.

In this topic, we will discuss the Agglomerative Hierarchical clustering algorithm.

Agglomerative Hierarchical clustering

The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the data into clusters, it follows the bottom-up approach. This means the algorithm treats each data point as a single cluster at the beginning, and then starts merging the closest pairs of clusters. It does this until all the clusters are merged into a single cluster that contains all the data points.

This hierarchy of clusters is represented in the form of the dendrogram.

How Does Agglomerative Hierarchical Clustering Work?

The working of the AHC algorithm can be explained using the below steps:
o Step-1: Treat each data point as a single cluster. If there are N data points, there will be N clusters to start with.

o Step-2: Take the two closest data points or clusters and merge them to form one cluster. There will now be N-1 clusters.

o Step-3: Again, take the two closest clusters and merge them together to form one cluster. There will be N-2 clusters.

o Step-4: Repeat Step-3 until only one cluster is left.

o Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to divide the clusters as per the problem.

Note: To better understand hierarchical clustering, it is advised to have a look at k-means clustering.
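As a minimal sketch, scikit-learn's AgglomerativeClustering implements this bottom-up procedure; the synthetic data and the choice of two clusters below are illustrative.

Python

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=2, random_state=42)

# Bottom-up merging; linkage can be 'ward', 'complete', 'average' or 'single'
agg = AgglomerativeClustering(n_clusters=2, linkage='average')
labels = agg.fit_predict(X)
print(labels[:10])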

Measure for the distance between two clusters

As we have seen, the closest distance between the two clusters is crucial for the hierarchical
clustering. There are various ways to calculate the distance between two clusters, and these ways
decide the rule for clustering. These measures are called Linkage methods. Some of the popular
linkage methods are given below:
1. Single Linkage: It is the shortest distance between the closest points of the two clusters.

2. Complete Linkage: It is the farthest distance between two points in two different clusters. It is one of the popular linkage methods, as it forms tighter clusters than single linkage.

3. Average Linkage: It is the linkage method in which the distances between all pairs of points (one from each cluster) are added up and then divided by the number of pairs to calculate the average distance between two clusters. It is also one of the most popular linkage methods.

4. Centroid Linkage: It is the linkage method in which the distance between the centroids of the two clusters is calculated.

From the above-given approaches, we can apply any of them according to the type of problem or
business requirement.

Working of Dendrogram in Hierarchical Clustering

The dendrogram is a tree-like structure that is mainly used to record each merge step that the HC algorithm performs. In the dendrogram plot, the Y-axis shows the Euclidean distances between the data points (or clusters), and the X-axis shows all the data points of the given dataset.

The working of the dendrogram can be explained with a diagram in which the left part shows how clusters are created in agglomerative clustering and the right part shows the corresponding dendrogram.

o As discussed above, the data points P2 and P3 combine first and form a cluster; correspondingly, a dendrogram link is created that connects P2 and P3 with a rectangular shape. Its height is decided by the Euclidean distance between the data points.

o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram link is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is slightly greater than that between P2 and P3.

o Again, two new dendrogram links are created, one combining P1, P2, and P3, and another combining P4, P5, and P6.

o At last, the final dendrogram link is created that combines all the data points together.

We can cut the dendrogram tree structure at any level as per our requirement.
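A dendrogram like the one described can be produced with SciPy; the six sample points P1-P6 below are hypothetical placeholders chosen so that P2 and P3 merge first, not values taken from any particular figure.

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical coordinates for P1..P6
points = np.array([[1, 1], [1.5, 1.2], [1.4, 1.1], [6, 6], [6.5, 6.2], [6.3, 6.0]])

# 'average' linkage here; 'single', 'complete' or 'centroid' can be used instead
Z = linkage(points, method='average')

dendrogram(Z, labels=['P1', 'P2', 'P3', 'P4', 'P5', 'P6'])
plt.ylabel('Euclidean distance')
plt.show()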
