Clustering

[Figure: training pipeline. All unlabeled training data is fed to a model, which outputs structured, labeled data (the clusters).]
Overview:
- Idea: Clustering is the process of grouping data into similarity groups known as clusters.
- Text Analysis: Grouping a collection of text documents by the similarity of their content.
- E.g., grouping of news items when you search for a story.
- Anomaly Detection: Given data from sensors, group the readings of a machine operating in different states and detect an anomaly as an outlier.
- Earthquake studies: Epicenters of earthquakes cluster around or along fault lines.
Aspects of Clustering:
- Given the data, what do we need to carry out clustering? A distance (or similarity) measure between data points, which should satisfy the metric properties:
- Non-negativity: d(x, y) >= 0
- Symmetry: d(x, y) = d(y, x)
- Self-similarity: d(x, x) = 0
- Triangle inequality: d(x, z) <= d(x, y) + d(y, z)
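As an illustrative sketch (function and variable names are ours, not from the slides), the four properties above can be checked for the familiar Euclidean distance:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

assert euclidean(x, y) >= 0                                   # non-negativity
assert euclidean(x, y) == euclidean(y, x)                     # symmetry
assert euclidean(x, x) == 0                                   # self-similarity
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)   # triangle inequality
print(euclidean(x, y))  # 5.0
```

Any function satisfying these four properties can serve as the distance for clustering, not just the Euclidean one.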
Evaluation criteria:
- Using Internal Data:
- Use the unlabeled data for evaluation of the clustering algorithm.
- Intra-cluster cohesion
- measure of the compactness of the cluster
- E.g., measured by the sum of squared errors (SSE), which quantifies the spread of the points around the centroid.
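A minimal pure-Python sketch of this cohesion measure (names are illustrative): the SSE of a cluster is the sum of squared distances of its points from their centroid, so tighter clusters score lower.

```python
def sse(cluster):
    """Sum of squared errors of a cluster's points around their centroid."""
    d = len(cluster[0])
    centroid = [sum(p[i] for p in cluster) / len(cluster) for i in range(d)]
    return sum(sum((p[i] - centroid[i]) ** 2 for i in range(d)) for p in cluster)

tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
loose = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
# A more compact cluster yields a smaller SSE.
print(sse(tight) < sse(loose))  # True
```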
Evaluation of Clustering using External (Labeled) Data:
- Cluster the data (ignoring the labels), then measure the extent to which the external class labels match the cluster labels.
- Idea: Evaluating clustering performance on labeled data gives us some confidence about the performance of the algorithm.
Entropy:
- Measures how mixed the class labels are within each cluster: for cluster j, H_j = - sum_i p_ij log2(p_ij), where p_ij is the proportion of class i in cluster j. Lower entropy (purer clusters) indicates better clustering.
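As a sketch (function name is ours), the entropy of a single cluster's label mix can be computed directly from the label counts:

```python
import math
from collections import Counter

def cluster_entropy(labels_in_cluster):
    """Entropy (base 2) of the class-label distribution inside one cluster."""
    n = len(labels_in_cluster)
    return sum(-(c / n) * math.log2(c / n)
               for c in Counter(labels_in_cluster).values())

print(cluster_entropy(["a", "a", "a", "a"]))  # 0.0 (pure cluster)
print(cluster_entropy(["a", "a", "b", "b"]))  # 1.0 (two classes, evenly mixed)
```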
Purity:
- The fraction of data points in a cluster that belong to the cluster's majority class; overall purity is the weighted average over all clusters. Higher purity indicates better clustering.
Remark:
- In an actual clustering problem the data have no labels; good performance on labeled benchmark data does not guarantee good performance on unlabeled data.
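To make purity concrete, here is a minimal sketch (names are illustrative) that takes each cluster as a list of true class labels:

```python
from collections import Counter

def purity(clusters):
    """clusters: list of clusters, each given as a list of true class labels.
    Purity = fraction of all points that fall in their cluster's majority class."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / total

# Two clusters over 6 labeled points: 4 of 6 fall in their cluster's majority class.
print(purity([["a", "a", "b"], ["b", "b", "a"]]))  # 0.666...
print(purity([["a", "a"], ["b"]]))                 # 1.0 (perfectly pure)
```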
Clustering Algorithms:
- Clustering algorithms usually optimize an objective function (e.g., the sum of squared errors) for a given number of clusters.
- Partitional Clustering
- Divides data points into non-overlapping subsets (clusters) such that each data point is in exactly one subset.
- E.g., K-means clustering
- Hierarchical Clustering
- Constructs a set of nested clusters by carrying out hierarchical
division of the data points.
- E.g., Agglomerative clustering, Divisive clustering
Computations and Complexity:
[Figure: three scatter plots of the data (x from -2 to 2, y from 0 to 3) illustrating successive iterations of the clustering.]
K-means Algorithm:
Possible solutions:
Number of Clusters:
Limitations/Weaknesses:
Summary:
- Despite these limitations, K-means is the most popular and fundamental unsupervised clustering algorithm:
- Simple: a two-step iterative algorithm; easy to understand and to implement.
- Computationally efficient: O(K n d) time per iteration, for K clusters, n data points, and d dimensions.
- Performance of clustering is often hard to evaluate; that is true for every clustering algorithm.
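The two-step iterative loop can be sketched in pure Python (a minimal illustration under our own naming, not a production implementation):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Two-step iterative K-means: assign each point to its nearest
    centroid, then recompute each centroid as the mean of its points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assignment step: O(K n d)
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        for i, c in enumerate(clusters):       # update step
            if c:
                centroids[i] = tuple(sum(v) / len(c) for v in zip(*c))
    return centroids, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [2, 2]
```

On these two well-separated pairs of points the algorithm recovers the two groups regardless of which points are sampled as initial centroids.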
Hierarchical Clustering
- We can cut the dendrogram at a desired level to carry out clustering; the connected data points below the cut form a cluster.
Overview:
- Agglomerative:
- Start by considering each data point as its own cluster
- Merge the closest clusters iteratively
- Keep merging until all clusters are fused into one cluster
- Also termed ‘Bottom-Up’
- Divisive:
- Start by considering all data points as a single cluster
- Divide (split) the clusters successively
- Also termed ‘Top-Down’
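The bottom-up procedure above can be sketched in pure Python with single linkage (names are ours; this is an O(n^3)-style illustration, not an optimized implementation):

```python
import math

def single_link(c1, c2):
    """Single linkage: distance between the closest cross-cluster pair."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points, linkage, k):
    """Bottom-up: every point starts as its own cluster; repeatedly merge
    the two closest clusters (per the linkage) until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = min(pairs,
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))    # merge the closest pair
    return clusters

pts = [(0.0, 0.0), (0.2, 0.0), (4.0, 0.0), (4.1, 0.0)]
result = agglomerative(pts, single_link, k=2)
print(sorted(len(c) for c in result))  # [2, 2]
```

Running it all the way to k = 1 and recording each merge distance would produce the dendrogram mentioned above.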
Complexity: A naive agglomerative implementation stores the O(n^2) pairwise-distance matrix and takes O(n^3) time for n data points.
Agglomerative Clustering
Agglomerative Clustering:
- Here, we are merging the two clusters that are nearest to each other.
- A group of points represents a cluster.
- We have studied a distance metric that computes the distance between points.
Question: How do we compute the distance between a point and a cluster or the distance between
two clusters?
Answer: We can define the closest pair of clusters in multiple ways, and this results in different
versions of hierarchical clustering.
- Single linkage: Distance between the two closest data points in the different clusters (nearest neighbor)
- Complete linkage: Distance between the two farthest data points in the different clusters (farthest neighbor)
- Group average linkage: Average distance between all pairs of points in the two different clusters
- Centroid linkage: Distance between the cluster centroids
- Ward's linkage: Merge the two clusters for which the increase in within-cluster variance after merging is minimized
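The first three linkage definitions can be sketched directly (a pure-Python illustration; names are ours):

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(c1, c2):
    """Distance between the closest pair of points across the two clusters."""
    return min(dist(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    """Distance between the farthest pair of points across the two clusters."""
    return max(dist(p, q) for p in c1 for q in c2)

def average_link(c1, c2):
    """Average distance over all cross-cluster pairs of points."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(a, b))    # 2.0, from (1,0) to (3,0)
print(complete_link(a, b))  # 5.0, from (0,0) to (5,0)
print(average_link(a, b))   # 3.5
```

Plugging a different linkage function into the same merge loop yields a different version of hierarchical clustering, as the slide notes.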
Agglomerative Single Linkage:
- Single linkage: Distance between the two clusters is the distance
between the closest data points (nearest neighbor).
- Can produce non-globular, elongated clusters (prone to the ‘chaining’ effect).
- More sensitive to noise and outliers, since a single noisy point can bridge two clusters.
- Single linkage vs Complete linkage:
- Single linkage can follow elongated, non-convex cluster shapes but suffers from chaining and noise sensitivity.
- Complete linkage favors compact, globular clusters and is less sensitive to noise, but it may break up large clusters.