ML unit 4

The document discusses K-means clustering, an unsupervised machine learning algorithm that partitions data into K clusters based on similarity. It contrasts K-means with other clustering methods such as hierarchical and density-based clustering, highlighting their differences in approach and applications. Additionally, it outlines the algorithm's steps, advantages, disadvantages, and various applications in fields like customer segmentation and fraud detection.


K-means Clustering

Types of Clustering Algorithms (Comparison):

• Partitional Clustering (e.g., K-Means, K-Medoids)
• Hierarchical Clustering (e.g., Agglomerative, Divisive)
• Density-Based Clustering (e.g., DBSCAN)
• Model-Based Clustering (e.g., Gaussian Mixture Models)
K-Means is a partitional and centroid-based unsupervised clustering algorithm.

Unsupervised Learning is a type of machine learning where the model is trained on data without explicit labels or supervision. Unlike supervised learning, which uses labeled data to train a model, unsupervised learning identifies patterns, structures, and relationships within the data on its own.
Partitional Clustering vs Hierarchical Clustering
Both partitional clustering and hierarchical
clustering are unsupervised machine learning
techniques used for grouping similar data points
into clusters, but they differ in their approach,
algorithms, and applications.
Partitional clustering divides the data into a predefined number of clusters. The goal is to partition the data into a set of disjoint clusters such that data points within the same cluster are more similar to each other than to data points in other clusters.

Hierarchical clustering creates a tree-like structure of clusters, which can be visualized as a dendrogram. This method does not require the number of clusters to be predefined. Instead, it builds a hierarchy of clusters and allows you to choose the desired level of clustering.
Key Concepts of Unsupervised Learning
1. Clustering – Grouping similar data points together.
   Examples:
   • K-Means Clustering: Divides data into K clusters.
   • Hierarchical Clustering: Builds a tree-like hierarchy of clusters.
   • DBSCAN (Density-Based Spatial Clustering): Identifies clusters of varying density.
2. Dimensionality Reduction – Reducing the number of variables while preserving important information.
   Examples:
   • Principal Component Analysis (PCA): Projects data into a lower-dimensional space.
   • t-SNE (t-Distributed Stochastic Neighbor Embedding): Useful for visualizing high-dimensional data.
   • Autoencoders: Neural networks that learn efficient data representations.
Key Concepts of Unsupervised Learning
3. Association Rule Learning – Finding relationships between variables in large datasets.
   Examples:
   • Apriori Algorithm: Used in market basket analysis to identify items that are frequently bought together.
   • Eclat Algorithm: Another rule-mining approach for finding associations.
4. Anomaly Detection – Identifying unusual patterns or outliers in data.
   Examples:
   • Isolation Forest: Detects outliers by isolating data points.
   • One-Class SVM (Support Vector Machine): Learns the normal behavior of data and detects deviations.
Applications of Unsupervised Learning

• Customer Segmentation (e.g., e-commerce, marketing)
• Fraud Detection (e.g., banking, cybersecurity)
• Recommendation Systems (e.g., Netflix, Amazon)
• Medical Diagnosis (e.g., identifying diseases from medical images)
• Image and Speech Recognition
K-means:
• The K-means algorithm clusters n objects into k partitions (k < n) based on their attributes.
• K-Means clustering is an unsupervised clustering technique.
• It is a partition-based clustering algorithm.
• A cluster is defined as a group of objects that belong to the same class.
• Definition:
  K-Means is an unsupervised machine learning algorithm used for clustering data into K distinct groups based on similarity.
• User-defined Parameter (K):
  The number of clusters K is specified by the user before running the algorithm.
• Goal of the Algorithm:
  To minimize intra-cluster distance (homogeneity within clusters) and maximize inter-cluster distance (differences between clusters).
• Cluster Representation:
  Each cluster is represented by its centroid, which is the mean of the data points in that cluster.
• Algorithm Steps:
  1. Initialize K centroids randomly.
  2. Assign each data point to the nearest centroid.
  3. Recalculate the centroids by averaging the points in each cluster.
  4. Repeat the assign–recalculate steps until the centroids do not change (convergence).
• Distance Measure:
  Commonly uses Euclidean distance to compute similarity between data points and centroids.
• Convergence Criteria:
  The algorithm stops when data points no longer change clusters or after a predefined number of iterations.
• Applications:
  Used in market segmentation, image compression, document clustering, anomaly detection, etc.
K-Means Clustering Algorithm
The K-Means clustering algorithm involves the following steps-
Step-01:
• Choose the number of clusters K.
Step-02:
• Randomly select any K data points as cluster centers.
• Select cluster centers in such a way that they are as far apart from each other as possible.
Step-03:
• Calculate the distance between each data point and each cluster center.
• The distance may be calculated either by using a given distance function or by using the Euclidean distance formula.
K-Means Clustering Algorithm
Step-04:
• Assign each data point to a cluster.
• A data point is assigned to the cluster whose center is nearest to it.
Step-05:
• Re-compute the center of each newly formed cluster.
• The center of a cluster is computed by taking the mean of all the data points contained in that cluster.
Step-06:
• Keep repeating Step-03 to Step-05 until any of the following stopping criteria is met-
  • Centers of newly formed clusters do not change
  • Data points remain in the same cluster
  • The maximum number of iterations is reached
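As a rough illustration of Steps 01-06, here is a minimal NumPy sketch (our own, not taken from the slides). The function name kmeans and the stopping test are our choices, and it is run on the five points A-E used in the worked example that follows.

# Minimal sketch of the K-Means steps above (illustrative only; assumes no cluster becomes empty)
import numpy as np

def kmeans(X, k, initial_centers, max_iter=100):
    centers = np.asarray(initial_centers, dtype=float)
    for _ in range(max_iter):
        # Step-03/04: assign each point to the nearest center (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step-05: re-compute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step-06: stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Points A-E from the worked example below, with A and C as the initial centers
X = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])
labels, centers = kmeans(X, k=2, initial_centers=[[2, 2], [1, 1]])
print(labels)   # cluster index assigned to A, B, C, D, E
print(centers)  # final cluster centers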
Squared Error criteria
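The squared-error criterion referred to above is the objective that K-Means minimizes: the sum of squared distances between every point and the centroid of its cluster. In standard notation (reconstructed here, since the original slide content is not included):

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

where C_k is the k-th cluster, \mu_k is its centroid, and K is the number of clusters. A lower J means the clusters are more compact.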
Flowchart
Example
Use the K-Means algorithm to create two clusters from the points A(2, 2), B(3, 2), C(1, 1), D(3, 1) and E(1.5, 0.5).

Solution-
• We follow the K-Means clustering algorithm discussed above.
• Assume A(2, 2) and C(1, 1) are the centers of the two clusters.
Iteration-01:
• We calculate the distance of each point from each of the two cluster centers.
• The distance is calculated by using the Euclidean distance formula.
The following illustration shows the calculation of the distance between point A(2, 2) and each of the two cluster centers.
Calculating the distance between A(2, 2) and C1(2, 2)-
ρ(A, C1) = sqrt[(x2 – x1)² + (y2 – y1)²]
         = sqrt[(2 – 2)² + (2 – 2)²]
         = sqrt[0 + 0]
         = 0
Calculating the distance between A(2, 2) and C2(1, 1)-
ρ(A, C2) = sqrt[(x2 – x1)² + (y2 – y1)²]
         = sqrt[(1 – 2)² + (1 – 2)²]
         = sqrt[1 + 1]
         = sqrt[2]
         = 1.41
• In a similar manner, we calculate the distance of the other points from each of the two cluster centers.

From here, the new clusters are-

Cluster-01:
• The first cluster contains the points A(2, 2), B(3, 2), D(3, 1)
Cluster-02:
• The second cluster contains the points C(1, 1), E(1.5, 0.5)
Now,
• We re-compute the new cluster centers.
• The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
• Center of Cluster-01
• = ((2 + 3 + 3)/3, (2 + 2 + 1)/3)
• = (2.67, 1.67)

For Cluster-02:
• Center of Cluster-02
• = ((1 + 1.5)/2, (1 + 0.5)/2)
• = (1.25, 0.75)
This completes Iteration-01.
Next,
• We go to Iteration-02, Iteration-03, and so on until the centers do not change anymore.
Iteration-02:

Given point    Distance from C1 (2.67, 1.67)    Distance from C2 (1.25, 0.75)    Belongs to cluster
A(2, 2)        0.75                             1.46                             C1
B(3, 2)        0.47                             2.15                             C1
C(1, 1)        1.80                             0.35                             C2
D(3, 1)        0.75                             1.77                             C1
E(1.5, 0.5)    1.65                             0.35                             C2

From here, New clusters are-


Cluster-01:
First cluster contains points-A(2, 2), B(3, 2), D(3, 1)
Cluster-02:
Second cluster contains points-C(1, 1), E(1.5, 0.5)
Here,
Cluster elements are same as in the previous iteration then stop the process.
K-means Advantages
• Relatively simple to implement.
• Scales to large data sets.
• Guarantees convergence (to a local optimum).
• Easily adapts to new examples.
• Can be generalized to clusters of different shapes and sizes, such as elliptical clusters.
K-means Disadvantages
• It requires the number of clusters (k) to be specified in advance.
• It cannot handle noisy data and outliers.
• It is not suitable for identifying clusters with non-convex shapes.
Example-2
Suppose we have a dataset of customer spending habits, and we want to group the customers into K = 3 clusters based on their annual income and spending score.

Customer ID    Annual Income (k$)    Spending Score
1              15                    39
2              16                    81
3              17                    6
4              18                    77
5              19                    40
6              20                    76
7              21                    6
8              22                    94
9              23                    3
10             24                    99
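A minimal scikit-learn sketch for this exercise could look as follows; the random_state and n_init values are our choices for reproducibility and are not part of the original problem.

# Sketch: K-Means with K=3 on the customer data above
import numpy as np
from sklearn.cluster import KMeans

# [annual income (k$), spending score] for customers 1-10
X = np.array([
    [15, 39], [16, 81], [17, 6], [18, 77], [19, 40],
    [20, 76], [21, 6], [22, 94], [23, 3], [24, 99],
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(labels)                   # cluster index assigned to each customer
print(kmeans.cluster_centers_)  # centroid of each of the 3 clusters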
Exercise Problem
Challenges in Unsupervised Learning
• The number of clusters is normally not known a priori.
• For clustering algorithms such as K-means, different initial centers may lead to different clustering results; moreover, K is unknown.
• Time complexity - partitional clustering algorithms are O(N), whereas hierarchical algorithms are O(N²).
• The similarity criterion is not clear - should we use Euclidean, Manhattan, or Hamming distance? (See the sketch after this list.)
• In hierarchical clustering, at what stage should we stop?
• Evaluating clustering results is difficult because labels are not available at the beginning.
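As a small illustration of the similarity-criterion question above, the sketch below (ours, not from the slides) computes the distance between the same pair of points under three different measures; the sample vectors are arbitrary.

# Sketch: the same two points under three distance measures
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, hamming

a = np.array([2.0, 2.0, 1.0, 0.0])
b = np.array([1.0, 1.0, 1.0, 1.0])

print(euclidean(a, b))   # sqrt(1 + 1 + 0 + 1) = sqrt(3) ≈ 1.73
print(cityblock(a, b))   # |2-1| + |2-1| + 0 + |0-1| = 3  (Manhattan)
print(hamming(a, b))     # fraction of positions that differ = 3/4 = 0.75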
Hierarchical clustering
• The hierarchical clustering methods are used to group the data into a hierarchy or tree-like structure.
• For example, in a machine learning problem of organizing
employees of a university in different departments, first the
employees are grouped under the different departments in the
university, and then within each department, the employees
can be grouped according to their roles such as professors,
assistant professors, supervisors, lab assistants, etc. This
creates a hierarchical structure of the employee data and eases
visualization and analysis.
Types of Hierarchical Clustering
There are two types of hierarchical clustering:
1. Agglomerative clustering
2. Divisive clustering

Types of Hierarchical Clustering
• Agglomerative clustering is a type of hierarchical clustering algorithm. It is an unsupervised machine learning technique that divides the population into several clusters such that data points in the same cluster are more similar and data points in different clusters are dissimilar.
• Points in the same cluster are closer to each other.
• Points in different clusters are far apart.
• On the other hand, the divisive method starts with one cluster containing all the given objects and then splits it iteratively to form smaller clusters.
• The agglomerative hierarchical clustering method uses the bottom-up strategy. It starts with each object forming its own cluster and then iteratively merges the clusters according to their similarity to form larger clusters. It terminates either when a certain clustering condition imposed by the user is achieved or when all the clusters merge into a single cluster.
Some Pros and Cons of Hierarchical Clustering
Pros
• No assumption of a particular number of clusters (unlike k-means)
• May correspond to meaningful taxonomies
Cons
• Once a decision is made to combine two clusters, it cannot be undone
• Too slow for large data sets: O(n² log n)
Agglomerative Clustering: It uses a bottom-up approach. It starts with each object forming its own cluster and then iteratively merges the clusters according to their similarity to form larger clusters. It terminates either
• when a certain clustering condition imposed by the user is achieved, or
• when all clusters merge into a single cluster.
Variants of agglomerative methods:
1. Agglomerative Algorithm: Single Link
• Single-nearest distance, or single linkage, is the agglomerative method that uses the distance between the closest members of the two clusters.
Question. Find the clusters using the single link technique. Use Euclidean distance and draw the dendrogram.

Sample No.    X       Y
P1            0.40    0.53
P2            0.22    0.38
P3            0.35    0.32
P4            0.26    0.19
P5            0.08    0.41
P6            0.45    0.30
Step 2: Merge the two closest members of the two clusters and find the minimum element in the distance matrix. Here the minimum value is 0.10, and hence we combine P3 and P6 (as 0.10 appears in the P6 row and P3 column).
Now, form a cluster of the elements corresponding to the minimum value and update the distance matrix. To update the distance matrix:
min ((P3,P6), P1) = min ((P3,P1), (P6,P1)) = min (0.22, 0.24) = 0.22
min ((P3,P6), P2) = min ((P3,P2), (P6,P2)) = min (0.14, 0.24) = 0.14
min ((P3,P6), P4) = min ((P3,P4), (P6,P4)) = min (0.13, 0.22) = 0.13
min ((P3,P6), P5) = min ((P3,P5), (P6,P5)) = min (0.28, 0.39) = 0.28
Now we repeat the same process: merge the two closest members of the two clusters and find the minimum element in the distance matrix. The minimum value is 0.13, and hence we combine (P3, P6) and P4. Now, form a cluster of the elements corresponding to the minimum value and update the distance matrix:

min (((P3,P6), P4), P1) = min (((P3,P6), P1), (P4,P1)) = min (0.22, 0.37) = 0.22
min (((P3,P6), P4), P2) = min (((P3,P6), P2), (P4,P2)) = min (0.14, 0.19) = 0.14
min (((P3,P6), P4), P5) = min (((P3,P6), P5), (P4,P5)) = min (0.28, 0.23) = 0.23
Again repeating the same process: the minimum value is 0.14, and hence we combine P2 and P5. Now, form a cluster of the elements corresponding to the minimum value and update the distance matrix:
min ((P2,P5), P1) = min ((P2,P1), (P5,P1)) = min (0.23, 0.34) = 0.23
min ((P2,P5), (P3,P6,P4)) = min ((P2,(P3,P6,P4)), (P5,(P3,P6,P4))) = min (0.14, 0.23) = 0.14
Again repeating the same process: the minimum value is 0.14, and hence we combine (P2, P5) and (P3, P6, P4). Now, form a cluster of the elements corresponding to the minimum value and update the distance matrix:
min ((P2,P5,P3,P6,P4), P1) = min (((P2,P5), P1), ((P3,P6,P4), P1)) = min (0.23, 0.22) = 0.22
We have now reached the solution; the dendrogram for this question is as follows:
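For reference, a short SciPy sketch (ours) reproduces this single-link example and draws the dendrogram; the points P1-P6 are those given in the question, and Euclidean distance matches the exercise.

# Sketch: single-linkage agglomerative clustering of P1-P6 with a dendrogram
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# P1..P6 from the question
points = np.array([
    [0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
    [0.26, 0.19], [0.08, 0.41], [0.45, 0.30],
])

# Single-linkage (nearest-member) clustering with Euclidean distance
Z = linkage(points, method='single', metric='euclidean')
dendrogram(Z, labels=['P1', 'P2', 'P3', 'P4', 'P5', 'P6'])
plt.show()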
DBSCAN Clustering
• There are different approaches and algorithms to
perform clustering tasks which can be divided
into three sub-categories:
1. Partition-based clustering: E.g. k-means, k-
median
2. Hierarchical clustering: E.g. Agglomerative,
Divisive
3. Density-based clustering: E.g. DBSCAN
Density-based clustering
• Partition-based and hierarchical clustering techniques are highly efficient with normally shaped clusters. However, when it comes to arbitrarily shaped clusters or detecting outliers, density-based techniques are more efficient.
• For example, the dataset in the figure below can easily be divided into three clusters using the k-means algorithm.

(Figure: k-means clustering)

Consider the following figures: the data points in these figures are grouped in arbitrary shapes or include outliers. Density-based clustering algorithms are very efficient at finding high-density regions and outliers. It is very important to detect outliers for some tasks, e.g. anomaly detection.
DBSCAN Algorithm
• DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is able to find arbitrarily shaped clusters and clusters with noise (i.e. outliers).
• In DBSCAN, instead of guessing the number of clusters, we define two hyperparameters, epsilon and minPoints, to arrive at clusters.
• Epsilon (ε): The distance that specifies the neighborhoods. Two points are considered to be neighbors if the distance between them is less than or equal to epsilon.
• minPoints (n): The minimum number of data points required to define a cluster.
DBSCAN Algorithm
Based on the epsilon (ε) and minPoints (n) parameters, points are classified as core, border, and outlier (noise) points:
• Core point: A point is a core point if there are at least minPoints points (including the point itself) in its surrounding area with radius epsilon.
• Border point: A point is a border point if it is reachable from a core point and there are fewer than minPoints points within its surrounding area.
• Outlier or noise point: A point is an outlier if it is not a core point and is not reachable from any core point.
DBSCAN Algorithm
• These points may be better explained with visualizations.
Density connected
Three terms are necessary in order to understand DBSCAN:
• Direct density reachable: A point is called direct
density reachable if it has a core point in its
neighbourhood.
• Density Connected: Two points are called density
connected if there is a core point which is density
reachable from both the points.
• Density Reachable: A point is called density reachable
from another point if they are connected through a series
of core points.
Evaluation Metrics of DBSCAN
• We will use the Silhouette score and the Adjusted Rand score for evaluating clustering algorithms.
• The Silhouette score is in the range of -1 to 1. A score near 1 is best, meaning that a data point is very compact within the cluster to which it belongs and far away from the other clusters. Values near 0 denote overlapping clusters.
• The Adjusted Rand score is in the range of 0 to 1. More than 0.9 denotes excellent cluster recovery, and above 0.8 is a good recovery. Less than 0.5 is considered to be poor recovery.
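A small scikit-learn sketch (ours) shows how these two metrics are typically computed; the synthetic make_blobs dataset and the DBSCAN parameters are illustrative choices only.

# Sketch: silhouette score (internal) and adjusted Rand score (external) for DBSCAN
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print(silhouette_score(X, labels))          # needs only the data and predicted labels: -1 (bad) to 1 (good)
print(adjusted_rand_score(y_true, labels))  # compares predicted labels with ground truth, when available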
DBSCAN
Pros
• DBSCAN does not require a pre-set number of clusters, unlike many other clustering algorithms.
• It identifies outliers as noise, unlike the Mean-Shift method, which forces such points into a cluster in spite of their different characteristics.
• It finds arbitrarily shaped and sized clusters quite well.
Cons
• It is not very effective when you have clusters of varying densities.
• If you have high-dimensional data, determining the distance threshold ε becomes a challenging task.
DBSCAN Algorithm

Step 1: Label core points and noise points
▪ Select a random starting point, say x.
▪ Identify the neighborhood of point x using the radius ε.
▪ Count the number of points, say k, in this neighborhood, including point x.
▪ If k >= MinPts, then mark x as a core point; else mark x as a noise point.
▪ Select a new unvisited point and repeat the above steps.
Step 2: Check whether a noise point can become a border point
▪ If a noise point is directly density reachable (that is, within the ε-neighborhood of a core point), mark it as a border point; it will form part of the cluster.
▪ A point which is neither a core point nor a border point is marked as noise.
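As a rough illustration of the two labelling steps above, here is a minimal NumPy sketch (ours, not an optimized DBSCAN implementation; it only assigns core/border/noise labels). The function name label_points and the sample points are chosen purely for illustration.

# Sketch: label points as core, border, or noise given eps and MinPts
import numpy as np

def label_points(X, eps, min_pts):
    n = len(X)
    # pairwise Euclidean distances
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dist <= eps
    # Step 1: core if the eps-neighborhood (including the point itself) has at least min_pts points
    is_core = neighbors.sum(axis=1) >= min_pts
    labels = np.where(is_core, 'core', 'noise').astype(object)
    # Step 2: a non-core point within eps of some core point is directly density reachable -> border
    for i in range(n):
        if not is_core[i] and np.any(is_core & neighbors[i]):
            labels[i] = 'border'
    return labels

X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.0], [1.1, 0.9], [1.6, 1.0], [5, 5]])
print(label_points(X, eps=0.5, min_pts=3))  # ['core' 'core' 'core' 'core' 'border' 'noise']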
Problem
Dimensionality reduction

Dimensionality reduction is a technique used to reduce the number of features in a dataset while retaining as much of the important information as possible. In other words, it is a process of transforming high-dimensional data into a lower-dimensional space that still preserves the essence of the original data.
• In machine learning, high-dimensional data
refers to data with a large number of features or
variables.
• The curse of dimensionality is a common
problem in machine learning, where the
performance of the model deteriorates as the
number of features increases.
• This is because the complexity of the model
increases with the number of features, and it
becomes more difficult to find a good solution.
• In addition, high-dimensional data can also lead
to overfitting, where the model fits the training
data too closely and does not generalize well to
new data.
Dimensionality reduction can help to mitigate these
problems by reducing the complexity of the model
and improving its generalization performance.
There are two main approaches to dimensionality
reduction: feature selection and feature extraction.
Feature Selection
Feature selection involves selecting a subset of the original features that are most relevant to the problem at hand. The goal is to reduce the dimensionality of the dataset while retaining the most important features.

There are several methods for feature selection, including filter methods, wrapper methods, and embedded methods. Filter methods rank the features based on their relevance to the target variable, wrapper methods use the model performance as the criterion for selecting features, and embedded methods combine feature selection with the model training process. A sketch of a simple filter method follows.
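As an illustration of a filter-style method, the sketch below (ours) uses scikit-learn's SelectKBest; the Iris dataset and the choice of k are our own and serve only as an example.

# Sketch: filter-method feature selection that keeps the k most relevant features
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)         # 4 original features

selector = SelectKBest(score_func=f_classif, k=2)  # rank features by ANOVA F-score vs. the target
X_reduced = selector.fit_transform(X, y)           # keep the 2 most relevant features

print(selector.scores_)                   # relevance score of each original feature
print(X_reduced.shape)                    # (150, 2)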
Feature Extraction:
Feature extraction involves creating new features by combining or transforming the original features. The goal is to create a set of features that captures the essence of the original data in a lower-dimensional space.

There are several methods for feature extraction, including principal component analysis (PCA), linear discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE).

PCA is a popular technique that projects the original features onto a lower-dimensional space while preserving as much of the variance as possible.
A 3-D classification problem can be hard to visualize,
whereas a 2-D one can be mapped to a simple 2-dimensional
space, and a 1-D problem to a simple line. The below figure
illustrates this concept, where a 3-D feature space is split into
two 2-D feature spaces, and later, if found to be correlated,
the number of features can be reduced even further.
Components of dimensionality reduction:
• Feature selection: In this, we try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem. It usually involves three ways:
  • Filter
  • Wrapper
  • Embedded
• Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional space, i.e. a space with a lesser number of dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction
include:
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)
• Dimensionality reduction is the process of reducing the number of features in a dataset while retaining as much information as possible. This can be done to reduce the complexity of a model, improve the performance of a learning algorithm, or make it easier to visualize the data.
• Techniques for dimensionality reduction include principal component analysis (PCA), singular value decomposition (SVD), and linear discriminant analysis (LDA).
• Each technique projects the data onto a lower-dimensional space while preserving important information.
• Dimensionality reduction is performed during the pre-processing stage, before building a model, to improve performance.
Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction technique that aims to project data onto a lower-dimensional space while maximizing the separation between the different classes within the data, essentially making it easier to classify data points by emphasizing the features that best differentiate between the classes.

It achieves this by finding linear combinations of features that maximize class separability, making it particularly useful for classification tasks where class labels are available.
Linear Discriminant Analysis (LDA) is a technique used to reduce the
number of dimensions (or features) in data, while preserving the
information that helps in distinguishing between different classes
or categories.
Imagine you have data points representing different groups (e.g.,
"spam" and "not spam" emails). These data points have features
like words, email length, etc. Now, LDA tries to find a simpler
representation of the data, reducing the number of features while
keeping the key differences between "spam" and "not spam."
Here's how it works:
1. Maximizing Separation: It tries to make the difference between
classes as big as possible in the new, reduced space.
2. Combining Features: LDA doesn't just look at each individual
feature. It combines the features in such a way that the new
combination of features best separates the different classes.
3. Projection to Lower Dimensions: The result is that you end up
with fewer features, but the key information that helps distinguish
between classes is still there.
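A minimal scikit-learn sketch (ours) of LDA used in this way; the Iris dataset and n_components=2 are illustrative choices, not part of the original notes.

# Sketch: LDA as supervised dimensionality reduction (labels are required)
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)      # 4 features, 3 classes

lda = LinearDiscriminantAnalysis(n_components=2)  # with 3 classes, at most 2 discriminant axes
X_lda = lda.fit_transform(X, y)        # project onto the most class-separating directions

print(X_lda.shape)                     # (150, 2)
print(lda.explained_variance_ratio_)   # share of between-class variance per component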
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features (variables) in a dataset by transforming them into a new set of features called principal components. These components retain most of the original data's variance while being uncorrelated and independent.

Motivation for PCA
• Real-world datasets have many features that may be highly correlated. For example, height and weight are often related: greater height often implies more weight.
• Such correlations add redundancy, making ML models less efficient.
• PCA helps by removing correlation and retaining the essential information in fewer dimensions.

Goals of PCA
• Reduce the dimensionality of the data.
• Maximize the variance captured in the new features.
• Generate orthogonal (independent) features.
• Preserve the original data structure as much as possible in fewer dimensions.
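A minimal scikit-learn sketch (ours) along these lines; the Iris dataset, the standardization step, and n_components=2 are our illustrative choices.

# Sketch: PCA reducing 4 correlated features to 2 uncorrelated principal components
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)             # PCA is unsupervised: labels are not used

X_scaled = StandardScaler().fit_transform(X)  # standardize so no single feature dominates the variance

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)    # fraction of total variance kept by each component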
Comparing and Evaluating Clustering Algorithms

Clustering is an unsupervised learning technique used to group similar data points. Different clustering algorithms exist, each with its strengths and weaknesses. Comparing and evaluating them requires considering factors like scalability, cluster shapes, speed, and accuracy.

REFERENCE

https://www.tutorialspoint.com/scikit_learn/scikit_learn_clustering_performance_evaluation.htm
Thank you
