
UNIT – 5

Clustering
1.Introduction
2.Partitioning of Data
3.Matrix Factorization
4.Clustering of Patterns
5.Divisive clustering
6.Agglomerative clustering
7.Partitional clustering
8.K-Means Clustering
9.Soft Partitioning
10.Soft Clustering
11.Fuzzy C-Means Clustering
12.Rough Clustering
13.Rough K-Means Clustering Algorithm
14.Expectation Maximization-Based Clustering
15.Spectral clustering
1.Introduction
Clustering refers to the process of arranging or organizing
objects according to specific criteria. It plays a crucial role in
uncovering concealed knowledge in large data sets.
Clustering involves dividing or grouping the data into smaller
data sets based on similarities/dissimilarities. Depending on
the requirements, this grouping can lead to various
outcomes, such as partitioning of data, data re-organization,
compression of data and data summarization.
2.Partitioning of Data
Partitioning of data in machine learning refers to the process
of dividing a dataset into smaller subsets for efficient
processing, training, and evaluation. This helps improve
model performance, reduce overfitting, and optimize
computational efficiency.
Types of Data Partitioning
1. Train-Test Split
• The dataset is divided into training data (used to train
the model) and testing data (used to evaluate
performance).
• Common split ratios: 80-20, 70-30, or 90-10.
• Example: A spam detection model is trained on 80% of
emails and tested on the remaining 20%.
2. Train-Validation-Test Split
• The dataset is divided into three subsets:
o Training Set: Used for model learning.
o Validation Set: Used for hyperparameter tuning
and model selection.
o Test Set: Used to assess final model performance.
• Common split: 70% Train, 15% Validation, 15% Test.
3. Cross-Validation (K-Fold Partitioning)
• The dataset is divided into K equal-sized subsets (folds).
• The model is trained on K-1 folds and tested on the
remaining fold, repeating the process K times.
• Example: 5-Fold Cross-Validation splits the data into 5
subsets, trains the model on 4, and tests on 1, repeating
this process 5 times.
4. Clustering-Based Partitioning
• Used in unsupervised learning where data is partitioned
into clusters based on similarity.
• Example: K-Means clustering groups customers based on
purchase behavior.
5. Stratified Sampling
• Ensures that each class or category is proportionally
represented in the training and test sets.
• Useful for imbalanced datasets (e.g., medical diagnoses
where 90% of samples are non-disease and 10% are
disease cases).
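To make these partitioning schemes concrete, here is a minimal Python sketch (assuming scikit-learn is installed; the toy X and y arrays are made up for illustration) showing a plain 80-20 split, a stratified split, and 5-fold cross-validation:

import numpy as np
from sklearn.model_selection import train_test_split, KFold

# Toy data: 100 samples, 4 features, imbalanced binary labels (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (rng.random(100) < 0.1).astype(int)   # roughly 10% positive class

# 1. Plain 80-20 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5. Stratified split: preserves the ~90/10 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# 3. 5-fold cross-validation: every fold serves once as the held-out test set
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {fold}: {len(train_idx)} train / {len(test_idx)} test samples")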
Importance of Data Partitioning
• Prevents overfitting by ensuring the model generalizes
well to new data.
• Helps in fair model evaluation by keeping a separate
test set.
• Improves computational efficiency by working on
smaller subsets.
EXAMPLE :
To illustrate, let us take the example of an EMPLOYEE data
table that contains 30,000 records of fixed length. We assume
that there are 1000 distinct values of DEPT_CODE and that
the employee records are evenly distributed among these
values. If this data table is clustered by DEPT_CODE, accessing
all the employees of the department with DEPT_CODE = '15'
requires log2(1000) + 30 accesses.
The first term involves accessing the index table that is
constructed using DEPT_CODE, which is achieved through
binary search.
The second term involves fetching 30 (that is, 30000/1000)
records from the clustered (that is, grouped based on
DEPT_CODE) employee table, which is indicated by the
fetched entry in the index table. Without clustering, accessing
30 employee records from a department would require, on
average, 30 x 30000/2 accesses.
Answer :
The example explains the efficiency of data access when
using clustering in a database. Let's break it down step by
step:
Understanding the Problem
• There is an EMPLOYEE data table containing 30,000
records.
• The table includes a column DEPT_CODE, which has
1,000 distinct values.
• The records are evenly distributed among these 1,000
departments. So, each department has:
30,000 records / 1,000 departments = 30 records per department.
Accessing Employee Records Without Clustering
If the data is not clustered, then accessing all employees in a
department involves scanning the entire table, because the
records are scattered.
• On average, finding 30 employees from one department
requires searching through half of the table (assuming a
uniform distribution), which means:
30 × 30,000/2 = 450,000 accesses.
Accessing Employee Records With Clustering
When the data is clustered by DEPT_CODE, all employees
belonging to a department are stored together. The search
process works in two steps:
1. Index Search
o The database first searches for DEPT_CODE = '15'
using an index table.
o Since there are 1,000 departments, the search uses
binary search, which requires log₂(1000) accesses.
o Approximate value:
log₂(1000) ≈ 10 accesses.
2. Fetching Records
o After locating the department in the index table, we
directly retrieve its 30 records (since they are
stored together).
o This requires 30 accesses.
Comparison of Access Costs
• Without Clustering: ~450,000 accesses.
• With Clustering: log₂(1000) + 30 ≈ 10 + 30 = 40 accesses.
Conclusion
By clustering the data based on DEPT_CODE, access time
significantly reduces from 450,000 accesses to 40 accesses,
improving efficiency in database queries.
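The arithmetic can be checked in a few lines of Python (a trivial sketch; the record and department counts are the ones from the example):

import math

n_records, n_depts = 30_000, 1_000
per_dept = n_records // n_depts                     # 30 records per department

with_clustering = math.log2(n_depts) + per_dept     # ≈ 9.97 + 30 ≈ 40 accesses
without_clustering = per_dept * n_records / 2       # 30 × 15,000 = 450,000 accesses
print(round(with_clustering), int(without_clustering))   # 40 450000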
3.Matrix Factorization :
Matrix Factorization in Clustering
Matrix factorization is a technique that can be used to
represent clustering mathematically. It allows us to express a
dataset as the product of two matrices:
X ≈ BC
where:
• X is the original data matrix.
• B is the cluster assignment matrix (indicates which data
points belong to which clusters).
• C is the representative matrix (contains the cluster
centroids or leaders).
This approach helps in understanding clustering as a
factorization problem, where we break down a large data
matrix into meaningful components.

Example of Clustering as Matrix Factorization


Consider the dataset:
X = [6 6 6
     6 6 8
     2 4 2
     2 2 2]
Here, there are 4 data points in a 3-dimensional space. If we
use the Leader Algorithm with a threshold of 3 units, we
obtain two clusters:
• Cluster 1: (6,6,6) and (6,6,8) with (6,6,6) as the leader.
• Cluster 2: (2,4,2) and (2,2,2) with (2,4,2) as the leader.
Thus, the Cluster Assignment Matrix B is:
B = [1 0
     1 0
     0 1
     0 1]
Each row in B represents a data point, and the 1s indicate the
assigned cluster.
The Cluster Representative Matrix C (leaders of clusters) is:
C=[6 6 6
2 4 2]
Thus, the matrix factorization:
X≈B×C

Hard Clustering vs. Soft Clustering


• Hard Clustering (Leader Algorithm): Each data point belongs to only one cluster (values in B are either 0 or 1).
• Soft Clustering (Fuzzy Clustering): A data point can belong to multiple clusters with different probabilities (values in B are between 0 and 1, and each row sums to 1).
Example of soft clustering:
B = [0.8 0.2
     0.7 0.3
     0.1 0.9
     0.2 0.8]

Using Centroids Instead of Leaders


Instead of using a single leader for a cluster, we can compute the centroid (mean) of the points in each cluster. The updated C matrix is:
C=[6 6 7
2 3 2]
Here:
• (6,6,7) is the centroid of Cluster 1 (mean of (6,6,6) and
(6,6,8)).
• (2,3,2) is the centroid of Cluster 2 (mean of (2,4,2) and
(2,2,2)).
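A minimal NumPy sketch of this factorization, using the X, B and C matrices from the example above (first with the leaders as rows of C, then with the centroids):

import numpy as np

X = np.array([[6, 6, 6],
              [6, 6, 8],
              [2, 4, 2],
              [2, 2, 2]])

# Hard cluster-assignment matrix: row i has a 1 in the column of x_i's cluster
B = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])

# Leader-based representative matrix (cluster leaders as rows)
C_leader = np.array([[6, 6, 6],
                     [2, 4, 2]])
print(B @ C_leader)                              # approximation of X via the leaders

# Centroid-based representatives: mean of the points assigned to each cluster
C_centroid = (B.T @ X) / B.sum(axis=0)[:, None]
print(C_centroid)                                # [[6. 6. 7.], [2. 3. 2.]]
print(B @ C_centroid)                            # approximation of X via the centroids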
4.Clustering of Patterns
A good clustering of patterns satisfies two criteria:
• Intra-cluster distance (distance within a cluster) should be small, meaning data points in a cluster are close together.
• Inter-cluster distance (distance between clusters) should be large, ensuring well-separated clusters.
This ensures that the clustering structure is meaningful and
useful for applications like image processing, customer
segmentation, and recommendation systems.
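As a small numerical illustration (reusing the two clusters from the matrix-factorization example above; scipy is assumed to be available), the snippet below computes the mean intra-cluster distance of each cluster and the distance between the two centroids; the former should be small and the latter large:

import numpy as np
from scipy.spatial.distance import pdist

cluster1 = np.array([[6, 6, 6], [6, 6, 8]])
cluster2 = np.array([[2, 4, 2], [2, 2, 2]])

intra1 = pdist(cluster1).mean()     # mean pairwise distance within cluster 1 -> 2.0
intra2 = pdist(cluster2).mean()     # mean pairwise distance within cluster 2 -> 2.0
inter = np.linalg.norm(cluster1.mean(axis=0) - cluster2.mean(axis=0))   # centroid separation ≈ 7.07

print(intra1, intra2, inter)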
5.Divisive clustering
Divisive Clustering in Machine Learning
Divisive Clustering is a top-down hierarchical clustering
approach. It starts with all data points in one large cluster
and then recursively splits the clusters into smaller ones,
until each point is in its own cluster or a stopping condition is
met.

How Divisive Clustering Works:


1. Start with all data in one cluster.
2. At each step:
o Choose the cluster to split (often the one with the
highest dissimilarity or largest size).
o Split it into two clusters based on a certain
criterion (e.g., distance, variance).
3. Continue splitting until:
o A predefined number of clusters is reached, or
o A threshold distance or dissimilarity is met.
Example:
Imagine you have 10 data points:
• Initially:
Cluster_1 = {x1, x2, ..., x10}
• After 1st split:
Cluster_1 = {x1, x2, x3, x4, x5}
Cluster_2 = {x6, x7, x8, x9, x10}
• After next split:
Cluster_1a = {x1, x2}
Cluster_1b = {x3, x4, x5}
(and so on...)

Divisive vs Agglomerative Clustering:


Feature        Divisive                Agglomerative
Direction      Top-down                Bottom-up
Start with     One big cluster         Each point as its own cluster
Merge/Split    Split clusters          Merge clusters
Complexity     Often more expensive    More commonly used

Advantages of Divisive Clustering


• May produce better results if the global structure is
more important than local patterns.
• Good when a natural top-down structure exists in the
data.
Disadvantages
• Computationally expensive (needs to evaluate many
possible splits).
• Less commonly used in practice compared to
agglomerative clustering.

Algorithms
One popular algorithm for divisive clustering is:
• DIANA (DIvisive ANAlysis Clustering)
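DIANA itself is not available in scikit-learn; the sketch below is a simplified top-down illustration of the divisive idea (repeatedly bisecting the largest cluster with 2-means, i.e. a bisecting-k-means style split rather than DIANA proper). It assumes scikit-learn is installed and uses a helper name of my own, divisive_split:

import numpy as np
from sklearn.cluster import KMeans

def divisive_split(points, target_clusters):
    """Repeatedly bisect the largest remaining cluster with 2-means
    until target_clusters clusters remain (simplified divisive scheme)."""
    clusters = [points]
    while len(clusters) < target_clusters:
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))   # pick the largest cluster
        to_split = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(to_split)
        clusters.append(to_split[labels == 0])
        clusters.append(to_split[labels == 1])
    return clusters

X = np.array([[1, 2], [2, 2], [5, 5], [6, 5], [8, 6]], dtype=float)   # points A..E from the example below
for c in divisive_split(X, target_clusters=2):
    print(c)   # expected grouping: {A, B} and {C, D, E}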

Example :
Let's walk through examples of both divisive and agglomerative clustering using a small, simple dataset.
Dataset (2D points):
Let’s say we have the following 5 data points:
A (1, 2)
B (2, 2)
C (5, 5)
D (6, 5)
E (8, 6)

1. Agglomerative Clustering (Bottom-Up)


Steps:
1. Start with each point as its own cluster:
   {A}, {B}, {C}, {D}, {E}
2. Find the two closest clusters (using, say, Euclidean distance):
   o A and B are closest → merge them: {A, B}, {C}, {D}, {E}
3. Next closest: C and D → merge them: {A, B}, {C, D}, {E}
4. Next: {C, D} and E → merge: {A, B}, {C, D, E}
5. Final merge: {A, B, C, D, E}
You can visualize this process using a dendrogram, which
shows how clusters are formed step by step.

2. Divisive Clustering (Top-Down)


Steps:
1. Start with all points in one cluster:
   {A, B, C, D, E}
2. Find the "best" way to split into two clusters:
   o Based on distance, one possible split: {A, B} and {C, D, E}
3. Next, choose one of the clusters to split:
   o Split {C, D, E} → maybe {C, D} and {E}
   Now you have: {A, B}, {C, D}, {E}
4. Keep splitting until each point is in its own cluster or until your stopping condition is met.

Comparison Summary:
Feature             Agglomerative                     Divisive
Start with          Individual data points            One big cluster
Process             Merge closest clusters            Recursively split clusters
Ends with           One cluster (or target number)    Many small clusters (or target)
Visualization tool  Dendrogram                        Dendrogram
Here's a problem example that illustrates both agglomerative and divisive clustering, with step-by-step workings.

Problem:
You are given 4 data points in 1D space:
A: 1
B: 2
C: 5
D: 8
Use Euclidean distance and demonstrate agglomerative and
divisive clustering steps until 2 clusters remain.

AGGLOMERATIVE CLUSTERING (Bottom-Up)


Step 0: Initial clusters
{A}, {B}, {C}, {D}
Step 1: Find closest pair
• Distances:
o A-B = |1−2| = 1
o B-C = |2−5| = 3
o C-D = |5−8| = 3
Closest pair: A-B
→ Merge:
{A, B}, {C}, {D}
Step 2: Find next closest
• Distance between {A, B} and C = min(|1−5|, |2−5|) = 3
• C-D = 3
Tie: pick C-D
→ Merge:
{A, B}, {C, D}
Reached 2 clusters

DIVISIVE CLUSTERING (Top-Down)


Step 0: Start with all in one cluster
{A, B, C, D}
Step 1: Split into 2 groups
• Based on distance, a good split is:
• {A, B} and {C, D}
→ Why?
The gap between B (2) and C (5) is 3, which is at least as large as the within-group distances:
• A-B = 1
• C-D = 3
Reached 2 clusters

Final Result:
Both methods ended with the same two clusters:
{A, B} and {C, D}

Here's a more advanced version of the problem, using 2D data points and including both agglomerative and divisive clustering. At the end, it also shows how soft clustering would approach the same data.

Problem (Harder Version):


You’re given the following 5 data points in 2D:
A (1, 2)
B (2, 1)
C (5, 5)
D (6, 5)
E (8, 6)
1. Agglomerative Clustering (Bottom-Up)
Step 0: Start with each point in its own cluster
{A}, {B}, {C}, {D}, {E}
Step 1: Compute pairwise Euclidean distances
      A      B      C      D      E
A     0      1.41   5.00   5.83   8.06
B     1.41   0      5.00   5.66   7.81
C     5.00   5.00   0      1.00   3.16
D     5.83   5.66   1.00   0      2.24
E     8.06   7.81   3.16   2.24   0
Step 2: Merge closest pair → C and D (1.00)
{A}, {B}, {C, D}, {E}
Step 3: Next closest → A and B (1.41)
{A, B}, {C, D}, {E}
Step 4: Merge C-D and E (distance = 2.24)
{A, B}, {C, D, E}
Reached 2 clusters
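The distance table and the merge order above can be reproduced with SciPy (a sketch assuming scipy is installed; single linkage matches the minimum-distance rule used in these steps):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 2], [2, 1], [5, 5], [6, 5], [8, 6]], dtype=float)   # A, B, C, D, E

# Pairwise Euclidean distances (matches the table in Step 1)
print(np.round(squareform(pdist(points)), 2))

# Single-linkage merge order: C-D at 1.00, A-B at 1.41, {C,D}-E at 2.24, then the final merge
Z = linkage(points, method='single')
print(np.round(Z, 2))

# Cut the hierarchy at 2 clusters: one label for {A, B}, another for {C, D, E}
print(fcluster(Z, t=2, criterion='maxclust'))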

2. Divisive Clustering (Top-Down)


Step 0: Start with all points in one cluster
{A, B, C, D, E}
Step 1: Split into 2 clusters
• Based on distances, A and B are far from C, D, E.
• Good split:
• {A, B}, {C, D, E}
Step 2: Optional further split if required
• Could split {C, D, E} into {C, D} and {E}, or stop here.
Again, reached 2 clusters:
{A, B} and {C, D, E}

3. Soft Clustering
Instead of assigning each point to only one cluster, we give
degrees of membership. For example:
• Let’s assume 2 soft clusters: Cluster 1 (centered near A-
B) and Cluster 2 (near C-D-E).
• Membership matrix (sample):
Point   Cluster 1   Cluster 2
A       0.95        0.05
B       0.90        0.10
C       0.10        0.90
D       0.05        0.95
E       0.02        0.98
Each row sums to 1. This reflects soft clustering like Fuzzy C-
Means, where patterns are partially assigned to multiple
clusters.

6.Agglomerative clustering
Agglomerative Clustering in Machine Learning


Agglomerative Clustering is a bottom-up hierarchical
clustering method. It is one of the most commonly used
hierarchical clustering techniques.

How it works:
1. Start with each data point as its own cluster.
2. Compute distances (or similarity) between all pairs of
clusters.
3. Merge the two closest clusters.
4. Repeat steps 2 and 3 until:
o All points are in one single cluster, or
o A desired number of clusters is reached.
Distance Measures:
The closeness of clusters can be measured using different
linkage criteria:
• Single linkage: Minimum distance between two points
from different clusters.
• Complete linkage: Maximum distance between two
points.
• Average linkage: Average distance between all points
from both clusters.
• Ward’s method: Minimizes the total within-cluster
variance.

Example:
Let’s say we have points:
A (1), B (2), C (8)
• Step 1: Each point is its own cluster: {A}, {B}, {C}
• Step 2: Distances:
o A-B = 1, B-C = 6, A-C = 7
• Step 3: Merge A and B → {A, B}, {C}
• Step 4: Next closest is {A, B} and {C} → merge → {A, B, C}
You can stop after any number of clusters (e.g., stop at 2
clusters: {A, B} and {C})
Visualization: Dendrogram
A dendrogram is a tree-like diagram that shows the sequence
of merges. You can "cut" it at any level to choose how many
clusters you want.
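For the three-point example above, the dendrogram can be drawn with SciPy and matplotlib (a minimal sketch; both libraries are assumed to be installed):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[1], [2], [8]], dtype=float)      # A, B, C as 1-D points

# Single linkage: A and B merge at distance 1, then {A, B} and C merge at distance 6
Z = linkage(points, method='single')

dendrogram(Z, labels=['A', 'B', 'C'])
plt.ylabel('merge distance')
plt.show()   # cutting the tree at height 2, say, leaves the two clusters {A, B} and {C}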

Advantages:
• Simple and intuitive
• Doesn’t require the number of clusters in advance (but
you can set one if desired)
• Works well for small datasets

Disadvantages:
• Computationally expensive for large datasets (O(n² log
n))
• Sensitive to noise and outliers
• Once merged, clusters can’t be split (no backtracking)
Soft Partitioning vs. Soft Clustering
The terms soft partitioning and soft clustering are often used interchangeably, but there can be a subtle difference in context depending on how they're used. Here's a breakdown:
1. Soft Clustering
• Definition: A form of clustering where each data point
can belong to multiple clusters with varying degrees of
membership.
• Example: In fuzzy c-means, a point might belong to
Cluster A with 70% membership and Cluster B with 30%.
• Focus: More on the degree of belonging or fuzzy
membership of data points to clusters.
• Use cases: When data naturally overlaps or has
ambiguity in group membership.
2. Soft Partitioning
• Definition: A general term that refers to partitioning
data into overlapping subsets, where elements can
belong to multiple partitions.
• Example: In overlapping community detection in
networks, a node might belong to multiple communities.
• Focus: More on the partition structure—the fact that
the partitions are not mutually exclusive.
Key Difference
• Soft clustering is a specific technique (often using fuzzy
logic) to assign partial memberships to clusters.
• Soft partitioning is a broader concept that refers to any
partitioning scheme where overlap is allowed (not
necessarily tied to fuzzy logic or probabilities).
Analogy
Think of soft clustering as how you assign elements to groups
(e.g., based on probability or fuzzy logic), and soft
partitioning as what kind of grouping structure you allow
(e.g., overlaps allowed or not).
7.Partitional clustering :
Partitional clustering in machine learning is a type of
clustering method where the dataset is divided into distinct,
non-overlapping groups (clusters). Each data point belongs
to exactly one cluster.
Key Characteristics:
• The number of clusters (k) is usually defined beforehand.
• The goal is to minimize intra-cluster distance (points in
the same cluster are close) and maximize inter-cluster
distance (clusters are far apart).
• It’s a hard clustering method (no overlaps).

Popular Algorithms:
1. K-Means Clustering
o Most common partitional algorithm.
o Iteratively assigns points to the nearest centroid
and updates centroids.
2. K-Medoids (PAM)
o Similar to K-Means, but uses actual data points as
cluster centers (medoids).
3. CLARANS
o A scalable version of K-Medoids for large datasets.
Advantages:
• Simple and fast for large datasets.
• Easy to implement and interpret.
Limitations:
• Requires the number of clusters in advance.
• Sensitive to initial cluster centroids.
• Doesn’t handle non-spherical clusters or noise well.

Here’s a clear and simple explanation of the k-means


clustering algorithm, based on the text you shared:

What is K-Means Clustering?


K-means is a partitional clustering algorithm that divides a
set of data points into k clusters. Each cluster has a center
(centroid), and each data point is assigned to the cluster with
the nearest centroid.

Steps in K-Means Algorithm:


1. Initialize: Choose k data points randomly as the initial
cluster centers (centroids).
2. Assign: Assign each data point to the closest cluster
based on distance (usually Euclidean distance).
3. Update: Recalculate the centroids by computing the
mean of all data points in each cluster.
4. Repeat: Reassign data points to the nearest centroid and
update centroids again until:
o No changes in cluster assignments, or
o A maximum number of iterations is reached.
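A minimal from-scratch NumPy sketch of these four steps (an illustrative toy, not a production implementation; the function name kmeans and the sample points are my own, and empty-cluster handling is omitted for brevity):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize: choose k data points at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign: each point goes to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until the centroids (and hence the assignments) stop changing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [2, 1], [5, 5], [6, 5], [8, 6]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels, centroids)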

Key Considerations:
• Sensitive to Initialization: Results may vary based on
initial centroid selection. That’s why better initialization
methods (like k-means++) or multiple runs are used.
• Choosing k: It's often not known beforehand. A popular
method is the Elbow Method.
Elbow Method (to find best k):
1. Run k-means for several values of k.
2. Compute the total in-cluster sum of squared error (IC-
SSE) for each.
3. Plot IC-SSE vs. k.
4. Look for the “elbow” point (sharp bend) in the graph.
That k is a good choice.
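A sketch of the elbow method with scikit-learn (assumed to be installed; inertia_ is scikit-learn's name for the in-cluster sum of squared error, and the three-blob toy data is made up for illustration):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three well-separated 2-D blobs of 50 points each
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [10, 0])])

ks = range(1, 9)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker='o')
plt.xlabel('k')
plt.ylabel('IC-SSE (inertia)')
plt.show()   # the sharp bend ("elbow") should appear near k = 3 for this data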

Time & Space Complexity:


• Time Complexity:
O(n⋅l⋅k⋅p)
Where:
o n= number of data points
o l= number of features
o k= number of clusters
o p = number of iterations
• Space Complexity:
O(k⋅n)
(to store distances or cluster assignments)
Worked example (k = 3):
The source refers to 8 data points x₁ to x₈ with two features f1 and f2, taken from Table 7.15 (not reproduced here). Running k-means with k = 3 on these points computes the Euclidean distance from each point to the three initial cluster centers (C1, C2, C3), assigns each point to the nearest center, and updates the centers until convergence. The accompanying plot (also not reproduced) shows Cluster 1 in red, Cluster 2 in blue and Cluster 3 in green, with 'X' markers for the final cluster centers and each point labelled x₁ to x₈; the table of point-to-center distances and assignments is likewise not reproduced here.
Fuzzy C-Means Clustering
Fuzzy C-Means (FCM) Clustering is an advanced version of k-means clustering used in machine learning, especially when clusters overlap and a data point can belong to more than one cluster.

Key Concepts
• Unlike k-means (hard clustering), where each point
belongs to exactly one cluster, FCM uses soft clustering
— each point belongs to all clusters with varying
degrees of membership (between 0 and 1).
• It's widely used in image segmentation, pattern
recognition, and medical diagnosis.
How FCM Works
1. Initialize:
o Choose number of clusters c.
o Randomly initialize the membership matrix (values
indicating how much a point belongs to each
cluster).
2. Update cluster centers: Each cluster center is calculated
as a weighted average of all points, weighted by their
membership values.
3. Update membership values: Membership of a point to a
cluster depends on its distance to the cluster center.
Closer = higher membership.
4. Repeat steps 2 and 3 until convergence (small changes in
values).
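A minimal from-scratch sketch of these updates in NumPy (my own simplified illustration of the standard FCM update equations, not a library implementation; for real use, a dedicated library is usually preferred):

import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize a random membership matrix U whose rows sum to 1
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # 2. Update centers: weighted means, with weights = memberships raised to m
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # 3. Update memberships from distances to the new centers (closer = higher membership)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        U_new = 1.0 / (d ** (2 / (m - 1)) *
                       np.sum(d ** (-2 / (m - 1)), axis=1, keepdims=True))
        # 4. Stop when the membership matrix barely changes
        if np.max(np.abs(U_new - U)) < tol:
            return centers, U_new
        U = U_new
    return centers, U

X = np.array([[1, 2], [2, 1], [5, 5], [6, 5], [8, 6]], dtype=float)
centers, U = fuzzy_c_means(X, c=2)
print(np.round(centers, 2))
print(np.round(U, 2))   # each row: the point's degrees of membership in the 2 clusters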
Difference from K-means
Feature                K-Means                 Fuzzy C-Means
Cluster membership     Hard (0 or 1)           Soft (0 to 1)
Overlapping clusters   Not allowed             Allowed
Output                 One cluster per point   Membership degrees per cluster
When to Use
• When data is not clearly separable
• When soft assignments are more meaningful (e.g., in
medical diagnostics, a symptom may relate to multiple
diseases)

9.Soft Partitioning
10.Soft Clustering
11.Fuzzy C-Means Clustering
12.Rough Clustering
13.Rough K-Means clustering Algorithm
14.Expectation Maximization-Based Clustering
15.Spectral clustering
Refer your Note Book
