UNIT-5-ML
Clustering
1.Introduction
2.Partitioning of Data
3.Matrix Factorization
4.Clustering of Patterns
5.Divisive clustering
6.Agglomerative clustering
7.Partitional clustering
8.K- Means Clustering
9.Soft Partitioning
10.Soft Clustering
11.Fuzzy C-Means Clustering
12.Rough Clustering
13.Rough K-Means clustering Algorithm
14.Expectation Maximization-Based Clustering
15.Spectral clustering
1.Introduction
Clustering refers to the process of arranging or organizing
objects according to specific criteria. It plays a crucial role in
uncovering concealed knowledge in large data sets.
Clustering involves dividing or grouping the data into smaller
data sets based on similarities/dissimilarities. Depending on
the requirements, this grouping can lead to various
outcomes, such as partitioning of data, data re-organization,
compression of data and data summarization.
2.Partitioning of Data
Partitioning of data in machine learning refers to the process
of dividing a dataset into smaller subsets for efficient
processing, training, and evaluation. This helps improve
model performance, reduce overfitting, and optimize
computational efficiency.
Types of Data Partitioning
1. Train-Test Split
• The dataset is divided into training data (used to train
the model) and testing data (used to evaluate
performance).
• Common split ratios: 80-20, 70-30, or 90-10.
• Example: A spam detection model is trained on 80% of
emails and tested on the remaining 20%.
2. Train-Validation-Test Split
• The dataset is divided into three subsets:
o Training Set: Used for model learning.
o Validation Set: Used for hyperparameter tuning
and model selection.
o Test Set: Used to assess final model performance.
• Common split: 70% Train, 15% Validation, 15% Test.
3. Cross-Validation (K-Fold Partitioning)
• The dataset is divided into K equal-sized subsets (folds).
• The model is trained on K-1 folds and tested on the
remaining fold, repeating the process K times.
• Example: 5-Fold Cross-Validation splits the data into 5
subsets, trains the model on 4, and tests on 1, repeating
this process 5 times.
4. Clustering-Based Partitioning
• Used in unsupervised learning where data is partitioned
into clusters based on similarity.
• Example: K-Means clustering groups customers based on
purchase behavior.
5. Stratified Sampling
• Ensures that each class or category is proportionally
represented in the training and test sets.
• Useful for imbalanced datasets (e.g., medical diagnoses
where 90% of samples are non-disease and 10% are
disease cases).
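To make the first of these schemes concrete, here is a minimal Python sketch using scikit-learn; the toy data, split ratios, and variable names are assumptions for illustration only:

# Minimal sketch: train-test split, stratified split, and 5-fold cross-validation.
# Assumes scikit-learn is installed; X and y form a toy dataset made up for illustration.
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.arange(20).reshape(10, 2)                 # 10 samples, 2 features (toy data)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])     # two classes, 5 samples each

# Train-test split (80-20)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Stratified variant: keep class proportions the same in train and test sets
X_trs, X_tes, y_trs, y_tes = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold cross-validation: every sample is used for testing exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")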
Importance of Data Partitioning
• Prevents overfitting by ensuring the model generalizes
well to new data.
• Helps in fair model evaluation by keeping a separate
test set.
• Improves computational efficiency by working on
smaller subsets.
EXAMPLE:
To illustrate, let us take the example of an EMPLOYEE data
table that contains 30,000 records of fixed length. We assume
that there are 1000 distinct values of DEPT_CODE and that
the employee records are evenly distributed among these
values. If this data table is clustered by DEPT_CODE, accessing
all the employees of the department with DEPT_CODE = '15'
requires log2(1000) + 30 accesses.
The first term involves accessing the index table that is
constructed using DEPT_CODE, which is achieved through
binary search.
The second term involves fetching 30 (that is, 30000/1000)
records from the clustered (that is, grouped based on
DEPT_CODE) employee table, which is indicated by the
fetched entry in the index table. Without clustering, accessing
30 employee records from a department would require, on
average, 30 x 30000/2 accesses.
Answer:
The example explains the efficiency of data access when
using clustering in a database. Let's break it down step by
step:
Understanding the Problem
• There is an EMPLOYEE data table containing 30,000
records.
• The table includes a column DEPT_CODE, which has
1,000 distinct values.
• The records are evenly distributed among these 1,000
departments. So, each department has:
30,000 records / 1,000 departments = 30 records per department.
Accessing Employee Records Without Clustering
If the data is not clustered, then accessing all employees in a
department involves scanning the entire table, because the
records are scattered.
• On average, finding 30 employees from one department
requires searching through half of the table (assuming a
uniform distribution), which means:
30 × 30,000/2 = 450,000 accesses.
Accessing Employee Records With Clustering
When the data is clustered by DEPT_CODE, all employees
belonging to a department are stored together. The search
process works in two steps:
1. Index Search
o The database first searches for DEPT_CODE = '15'
using an index table.
o Since there are 1,000 departments, the search uses
binary search, which requires log₂(1000) accesses.
o Approximate value:
log₂(1000) ≈ 10 accesses.
2. Fetching Records
o After locating the department in the index table, we
directly retrieve its 30 records (since they are
stored together).
o This requires 30 accesses.
Comparison of Access Costs
• Without Clustering: ~450,000 accesses.
• With Clustering: log₂(1000) + 30 ≈ 10 + 30 = 40 accesses.
Conclusion
By clustering the data based on DEPT_CODE, access time
significantly reduces from 450,000 accesses to 40 accesses,
improving efficiency in database queries.
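A quick back-of-the-envelope check of these two costs, using only the numbers given in the example above:

# Re-computes the access counts from the example (no new assumptions beyond the given figures).
import math

n_records = 30_000
n_depts = 1_000
per_dept = n_records // n_depts                       # 30 records per department

with_clustering = math.log2(n_depts) + per_dept       # index search + fetching the 30 records
without_clustering = per_dept * n_records / 2         # the estimate used in the example

print(round(with_clustering))     # about 40 accesses
print(int(without_clustering))    # 450000 accesses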
3.Matrix Factorization
Matrix Factorization in Clustering
Matrix factorization is a technique that can be used to
represent clustering mathematically. It allows us to express a
dataset as the product of two matrices:
X ≈ BC
where:
• X is the original data matrix.
• B is the cluster assignment matrix (indicates which data
points belong to which clusters).
• C is the representative matrix (contains the cluster
centroids or leaders).
This approach helps in understanding clustering as a
factorization problem, where we break down a large data
matrix into meaningful components.
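A minimal numpy sketch of this view of clustering; the data values and the two-cluster assignment are made up for illustration:

# X is approximated as the product of a cluster assignment matrix B and a centroid matrix C.
import numpy as np

X = np.array([[1.0, 2.0],          # toy data: 4 points in 2-D, two obvious groups
              [2.0, 2.0],
              [5.0, 5.0],
              [6.0, 5.0]])

B = np.array([[1, 0],              # cluster assignment matrix: row i has a 1 in its cluster's column
              [1, 0],
              [0, 1],
              [0, 1]])

C = np.array([X[:2].mean(axis=0),  # representative matrix: centroid of the first group
              X[2:].mean(axis=0)]) # centroid of the second group

X_approx = B @ C                   # X ≈ BC
print(X_approx)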
Divisive and Agglomerative Clustering
Algorithms
One popular algorithm for divisive clustering is:
• DIANA (DIvisive ANAlysis Clustering)
Example:
Let's walk through both divisive and agglomerative clustering using a small dataset of five 2D points:
A (1, 2)
B (2, 2)
C (5, 5)
D (6, 5)
E (8, 6)
Comparison Summary:
Feature              Agglomerative                    Divisive
Start with           Individual data points           One big cluster
Process              Merge closest clusters           Recursively split clusters
Ends with            One cluster (or target number)   Many small clusters (or target)
Visualization tool   Dendrogram                       Dendrogram
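Here is a short sketch of agglomerative clustering on the five points above using SciPy's hierarchical-clustering routines; single linkage is an assumption, since the text does not fix a linkage criterion:

# Agglomerative clustering of the five 2-D points with SciPy (assumes scipy is installed).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 2],    # A
                   [2, 2],    # B
                   [5, 5],    # C
                   [6, 5],    # D
                   [8, 6]])   # E

Z = linkage(points, method='single')              # merge order, closest pair first
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the merge tree at 2 clusters
print(labels)    # A and B share one label; C, D and E share the other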
Here is a problem example that illustrates both agglomerative and divisive clustering step by step.
Problem:
You are given 4 data points in 1D space:
A: 1
B: 2
C: 5
D: 8
Use Euclidean distance and demonstrate agglomerative and divisive clustering steps until 2 clusters remain.
Agglomerative: merge the closest pair A and B (distance 1); then merge C and D (distance 3). Two clusters remain, so stop.
Divisive: start with {A, B, C, D} and split at the large gap between B = 2 and C = 5, giving {A, B} and {C, D}.
Final Result:
Both methods end with the same two clusters:
{A, B} and {C, D}
Soft Clustering
Instead of assigning each point to only one cluster, we give
degrees of membership. For example:
• Let’s assume 2 soft clusters: Cluster 1 (centered near A-
B) and Cluster 2 (near C-D-E).
• Membership matrix (sample):
Point   Cluster 1   Cluster 2
A       0.95        0.05
B       0.90        0.10
C       0.10        0.90
D       0.05        0.95
E       0.02        0.98
Each row sums to 1. This reflects soft clustering like Fuzzy C-
Means, where patterns are partially assigned to multiple
clusters.
Agglomerative Clustering
How it works:
1. Start with each data point as its own cluster.
2. Compute distances (or similarity) between all pairs of
clusters.
3. Merge the two closest clusters.
4. Repeat steps 2 and 3 until:
o All points are in one single cluster, or
o A desired number of clusters is reached.
Distance Measures:
The closeness of clusters can be measured using different
linkage criteria:
• Single linkage: Minimum distance between two points
from different clusters.
• Complete linkage: Maximum distance between two
points.
• Average linkage: Average distance between all points
from both clusters.
• Ward’s method: Minimizes the total within-cluster
variance.
Example:
Let’s say we have points:
A (1), B (2), C (8)
• Step 1: Each point is its own cluster: {A}, {B}, {C}
• Step 2: Distances:
o A-B = 1, B-C = 6, A-C = 7
• Step 3: Merge A and B → {A, B}, {C}
• Step 4: Next closest is {A, B} and {C} → merge → {A, B, C}
You can stop after any number of clusters (e.g., stop at 2
clusters: {A, B} and {C})
Visualization: Dendrogram
A dendrogram is a tree-like diagram that shows the sequence
of merges. You can "cut" it at any level to choose how many
clusters you want.
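A minimal sketch of how such a dendrogram could be produced for the three points A (1), B (2), C (8); SciPy and matplotlib are assumed to be available:

# Builds and plots a dendrogram for the three 1-D points (assumes scipy and matplotlib).
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = [[1], [2], [8]]                 # A, B, C as 1-D points
Z = linkage(points, method='single')     # records each merge and its distance

dendrogram(Z, labels=['A', 'B', 'C'])    # tree of merges; "cut" it to pick the cluster count
plt.ylabel('merge distance')
plt.show()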
Advantages:
• Simple and intuitive
• Doesn’t require the number of clusters in advance (but
you can set one if desired)
• Works well for small datasets
Disadvantages:
• Computationally expensive for large datasets (O(n² log
n))
• Sensitive to noise and outliers
• Once merged, clusters can’t be split (no backtracking)
The terms soft partitioning and soft clustering are often used
interchangeably, but there can be a subtle difference in
context depending on how they're used. Here's a breakdown:
1. Soft Clustering
• Definition: A form of clustering where each data point
can belong to multiple clusters with varying degrees of
membership.
• Example: In fuzzy c-means, a point might belong to
Cluster A with 70% membership and Cluster B with 30%.
• Focus: More on the degree of belonging or fuzzy
membership of data points to clusters.
• Use cases: When data naturally overlaps or has
ambiguity in group membership.
2. Soft Partitioning
• Definition: A general term that refers to partitioning
data into overlapping subsets, where elements can
belong to multiple partitions.
• Example: In overlapping community detection in
networks, a node might belong to multiple communities.
• Focus: More on the partition structure—the fact that
the partitions are not mutually exclusive.
Key Difference
• Soft clustering is a specific technique (often using fuzzy
logic) to assign partial memberships to clusters.
• Soft partitioning is a broader concept that refers to any
partitioning scheme where overlap is allowed (not
necessarily tied to fuzzy logic or probabilities).
Analogy
Think of soft clustering as how you assign elements to groups
(e.g., based on probability or fuzzy logic), and soft
partitioning as what kind of grouping structure you allow
(e.g., overlaps allowed or not).
Partitional clustering
Partitional clustering in machine learning is a type of
clustering method where the dataset is divided into distinct,
non-overlapping groups (clusters). Each data point belongs
to exactly one cluster.
Key Characteristics:
• The number of clusters (k) is usually defined beforehand.
• The goal is to minimize intra-cluster distance (points in
the same cluster are close) and maximize inter-cluster
distance (clusters are far apart).
• It’s a hard clustering method (no overlaps).
Popular Algorithms:
1. K-Means Clustering
o Most common partitional algorithm.
o Iteratively assigns points to the nearest centroid
and updates centroids.
2. K-Medoids (PAM)
o Similar to K-Means, but uses actual data points as
cluster centers (medoids).
3. CLARANS
o A scalable version of K-Medoids for large datasets.
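As an illustration of the first of these algorithms, here is a minimal K-Means sketch with scikit-learn; the toy points are assumptions for illustration:

# Hard partitional clustering with K-Means (assumes scikit-learn is installed).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [2, 2], [5, 5], [6, 5], [8, 6]])   # toy 2-D points

km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X)          # hard assignment: exactly one cluster per point

print(labels)                       # e.g. [0 0 1 1 1]
print(km.cluster_centers_)          # one centroid per cluster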
Advantages:
• Simple and fast for large datasets.
• Easy to implement and interpret.
Limitations:
• Requires the number of clusters in advance.
• Sensitive to initial cluster centroids.
• Doesn’t handle non-spherical clusters or noise well.
Key Considerations:
• Sensitive to Initialization: Results may vary based on
initial centroid selection. That’s why better initialization
methods (like k-means++) or multiple runs are used.
• Choosing k: It's often not known beforehand. A popular
method is the Elbow Method.
Elbow Method (to find best k):
1. Run k-means for several values of k.
2. Compute the total in-cluster sum of squared error (IC-
SSE) for each.
3. Plot IC-SSE vs. k.
4. Look for the “elbow” point (sharp bend) in the graph.
That k is a good choice.
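A short sketch of the Elbow Method; scikit-learn and matplotlib are assumed, and the within-cluster sum of squared errors is read from KMeans' inertia_ attribute:

# Elbow Method: plot within-cluster SSE against k and look for the sharp bend.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 2)        # toy data for illustration

ks = range(1, 9)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)                      # total within-cluster SSE for this k

plt.plot(list(ks), sse, marker='o')
plt.xlabel('k')
plt.ylabel('within-cluster SSE')
plt.show()                                       # choose k at the "elbow" of the curve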
Fuzzy C-Means (FCM) Clustering
Key Concepts
• Unlike k-means (hard clustering), where each point
belongs to exactly one cluster, FCM uses soft clustering
— each point belongs to all clusters with varying
degrees of membership (between 0 and 1).
• It's widely used in image segmentation, pattern
recognition, and medical diagnosis.
How FCM Works
1. Initialize:
o Choose number of clusters c.
o Randomly initialize the membership matrix (values
indicating how much a point belongs to each
cluster).
2. Update cluster centers: Each cluster center is calculated
as a weighted average of all points, weighted by their
membership values.
3. Update membership values: Membership of a point to a
cluster depends on its distance to the cluster center.
Closer = higher membership.
4. Repeat steps 2 and 3 until convergence (small changes in
values).
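A minimal numpy sketch of these update steps; the fuzzifier m = 2, the toy points, and the fixed number of iterations are assumptions for illustration (real implementations stop when the membership values barely change):

# Fuzzy C-Means sketch: alternate between updating centers and updating memberships.
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)              # each row of memberships sums to 1
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]        # weighted averages of all points
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        U = 1.0 / (d ** (2 / (m - 1)))             # closer center -> higher membership
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

X = np.array([[1.0, 2], [2, 2], [5, 5], [6, 5], [8, 6]])    # toy points A..E
centers, U = fuzzy_c_means(X, c=2)
print(np.round(U, 2))    # soft memberships, one row per point, rows sum to 1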
Difference from K-means
Feature                K-Means                 Fuzzy C-Means
Cluster membership     Hard (0 or 1)           Soft (0 to 1)
Overlapping clusters   Not allowed             Allowed
Output                 One cluster per point   Membership degrees per cluster
When to Use
• When data is not clearly separable
• When soft assignments are more meaningful (e.g., in
medical diagnostics, a symptom may relate to multiple
diseases)
9.Soft Partitioning
10.Soft Clustering
11.Fuzzy C-Means Clustering
12.Rough Clustering
13.Rough K-Means clustering Algorithm
14.Expectation Maximization-Based Clustering
15.Spectral clustering
Refer to your notebook for these topics.