
Machine Learning

Unsupervised Learning - Clustering


Outline

- Introduction to Unsupervised Learning, Clustering


- Clustering Overview
- Partitional Clustering
- K-means Clustering
- Hierarchical Clustering
- Agglomerative Clustering
Unsupervised Learning
Overview:
The learning algorithm receives unlabeled raw data, trains a model, and finds patterns in the data.

(Figure: all of the unlabeled training data is fed to a clustering model, which outputs structured, labeled data.)

Clustering, often used as a synonym for unsupervised learning for historical reasons, is the most widely used unsupervised technique.
Clustering
Overview:
Given the data, group ‘similar’ points into clusters.

Clustering
Clustering
Overview:
- Idea: the process of grouping data into similarity groups known as clusters.

- Formally, organize the unlabeled data into classes such that


- Inter-cluster similarity is minimized:
- low similarity between data points in different classes
- Intra-cluster similarity is maximized:
- High similarity between data points of each class

- In contrast to classification, we learn the number of classes and the class labels directly from the data.
Clustering
Applications of Clustering:
Marketing: Clustering is used for segmentation of the customers/markets to do
targeted marketing.
- Spatio-temporal demographic distribution of the sales of products
- Insurance companies cluster policy holders to identify groups of policy holders with high average claim costs

Text Analysis: Grouping of a collection of text documents with respect to similarity in their content.
- Grouping of news items when you search for an item

Anomaly Detection: Given data from the sensors, grouping of sensor readings for machine
operating in different states and detect anomaly as an outlier.

Finance: Allocation of diversified portfolios of stocks by using clustering.

Earthquake studies: Observed earthquake epicenters cluster around or along fault lines.
Clustering
Aspects of Clustering:
- Given the data, what do we need to carry out clustering?

- A measure to quantify or determine similarity

- A criterion to evaluate the quality of the clustering


- Low inter-class similarity, High intra-class similarity
- Ability to identify hidden patterns in the data

- Clustering techniques/algorithms for grouping similar data points


- Partitional Clustering
- Hierarchical Clustering
- Model Based
- Density Based
Clustering
(Dis)Similarity using distance metric:

- Non-negativity: d(x, y) ≥ 0
- Symmetry: d(x, y) = d(y, x)
- Self-similarity: d(x, x) = 0
- Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)

- We studied several of these earlier: the Minkowski, Euclidean, Manhattan, Chebyshev, and cosine distances

- For categorical variables, we use Hamming distance
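A minimal NumPy sketch of these distance measures (the function names are illustrative, not from any particular library):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def chebyshev(x, y):
    """Largest coordinate-wise difference."""
    return np.max(np.abs(x - y))

def cosine_distance(x, y):
    """1 minus the cosine of the angle between x and y."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def hamming(x, y):
    """Number of positions where two categorical vectors disagree."""
    return np.sum(x != y)

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(minkowski(a, b, 1), minkowski(a, b, 2), chebyshev(a, b), cosine_distance(a, b))
```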


Clustering
Evaluation of Clustering:
- Unlike supervised learning problems, evaluating the quality of a clustering is a hard problem and is mostly subjective, as information about the correct clusters is unknown.

Evaluation criteria:
- Using Internal Data:
- Use the unlabeled data itself to evaluate the clustering algorithm.

- Using External Data:
- Use labeled data (as in supervised learning) to evaluate the performance of different clustering algorithms.
Clustering
Evaluation of Clustering using Internal Data:
- Inter-cluster separability
- A measure of the isolation of a cluster
- E.g., measured as the distance between the centroids of the clusters

- Intra-cluster cohesion
- A measure of the compactness of a cluster
- E.g., measured by the sum of squared errors, which quantifies the spread of the points around the centroid.
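A small sketch of how these two internal measures can be computed, assuming the data X, an array of cluster labels, and the cluster centroids are already available (the function names are illustrative):

```python
import numpy as np

def cohesion_sse(X, labels, centroids):
    """Intra-cluster cohesion: sum of squared distances of points to their own centroid."""
    return sum(np.sum((X[labels == k] - c) ** 2) for k, c in enumerate(centroids))

def separation(centroids):
    """Inter-cluster separability: pairwise distances between all centroids."""
    K = len(centroids)
    return [(i, j, np.linalg.norm(centroids[i] - centroids[j]))
            for i in range(K) for j in range(i + 1, K)]
```

Lower SSE indicates more compact clusters; larger centroid distances indicate better-separated clusters.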
Clustering
Evaluation of Clustering using External (Labeled) Data:
- Use labeled data to carry out clustering and measure the extent to
which the external class labels match the cluster labels.

- Idea: Evaluation of clustering performance using the labeled data gives us some
confidence about the performance of the algorithm.

- This evaluation method is referred to as evaluation based on external data.

- Treating each class as a cluster, we use classification evaluation metrics after clustering:
- Confusion matrix
- Precision, recall, F1-score
- Purity and Entropy
Clustering
Evaluation of Clustering using External (Labeled) Data:

Entropy:
A measure of how mixed the class proportions are within each cluster; lower entropy indicates purer clusters.
Clustering
Evaluation of Clustering using External (Labeled) Data:
Purity:
The fraction of points in each cluster that belong to the cluster's majority class; higher purity indicates better agreement with the class labels.

Remark:
- Since the data in a real clustering problem has no labels, good performance on labeled data does not guarantee good performance on unlabeled data.
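A sketch of purity and cluster-size-weighted entropy computed from true class labels and cluster assignments (one common convention; the function names are illustrative):

```python
import numpy as np
from collections import Counter

def purity(class_labels, cluster_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    total, correct = len(class_labels), 0
    for c in set(cluster_labels):
        members = [cls for cls, cl in zip(class_labels, cluster_labels) if cl == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / total

def entropy(class_labels, cluster_labels):
    """Size-weighted average entropy of the class distribution within each cluster."""
    total, h = len(class_labels), 0.0
    for c in set(cluster_labels):
        members = [cls for cls, cl in zip(class_labels, cluster_labels) if cl == c]
        counts = np.array(list(Counter(members).values()), dtype=float)
        p = counts / counts.sum()
        h += (len(members) / total) * -np.sum(p * np.log2(p))
    return h

# toy example: two clusters, mostly pure
print(purity(['a', 'a', 'b', 'b', 'b'], [0, 0, 0, 1, 1]))   # 0.8
print(entropy(['a', 'a', 'b', 'b', 'b'], [0, 0, 0, 1, 1]))
```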
Clustering
Clustering Algorithms:
- In clustering algorithms, we usually optimize the following for a given
number of clusters.

- Tightness, spread, cohesion of clusters

- Separability of clusters, distance between the centers

- Ideally, we require clustering algorithms to be


- scalable (in terms of both time and space)
- able to deal with different data types and noise/outliers
- insensitive to order of input records
- interpretable and usable
Clustering
Clustering Algorithms:

- Partitional Clustering
- Divides data points into non-overlapping subsets (clusters) such that each data point is in exactly one subset.
- E.g., K-means clustering

- Hierarchical Clustering
- Constructs a set of nested clusters by carrying out hierarchical division of the data points.
- E.g., Agglomerative clustering, Divisive clustering

- Model Based Clustering


- Assumes that the data were generated by a model and tries to fit the data to a model that defines the clusters of the data

- Density Based Clustering


- Assumes that a cluster in the space is a region of high point density
separated from other clusters by regions of low point density.
Outline

- Introduction to Unsupervised Learning, Clustering


- Clustering Overview
- Partitional Clustering
- K-means Clustering
- Hierarchical Clustering
- Agglomerative Clustering
Partitional Clustering
Overview:
K-means Algorithms
Overview and Notation:
K-means Algorithms
Algorithm:
In the K-means algorithm, we carry out the following steps:
- Input: K and the data D

- Randomly choose K cluster centers (centroids)

- Repeat until convergence:
- Assign each data point to the cluster of its closest centroid
- Recompute the centroid of each cluster from its current members

Computations and complexity: in each iteration (repeated until convergence), assigning the n data points to the K centroids in d dimensions costs O(nKd), and recomputing the K centroids costs O(nd), so each iteration costs O(nKd) overall.
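A compact NumPy sketch of this two-step loop, assuming the data are stored as an (n, d) array; this is an illustrative implementation, not a reference one:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # randomly pick K data points as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # assignment step: each point joins the cluster of its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each centroid from its current members
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):  # convergence: centroids stop moving
            break
        centroids = new_centroids
    return labels, centroids

# usage on toy data
labels, centroids = kmeans(np.random.rand(100, 2), K=3)
```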


K-means Algorithms
Illustration:
(Figures: a sequence of snapshots showing the assignment and centroid-update steps over successive iterations until the clusters stabilize.)
K-means Algorithms
Stopping and Convergence Criterion:
Multiple convergence criteria can be used:
- no (or only a very small number of) data points change cluster membership
- the centroids stop moving (change by less than a small tolerance)
- the decrease in the sum of squared errors falls below a threshold

Interpretation: the clusters are no longer changing.

Remark: The algorithm may converge to a local optimum.


K-means Algorithms
Evaluation of K-means clustering:
Choice of Initial Centroids:
K-means Algorithms

(Figure: three panels of the same data, labeled Original Points, Optimal Clustering, and Sub-optimal Clustering, showing that different initial centroids can lead to an optimal or a sub-optimal result.)

Possible solutions: run K-means several times with different random initializations and keep the run with the lowest sum of squared errors, or use a smarter seeding scheme such as k-means++ that spreads the initial centroids apart.
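A sketch of k-means++ style seeding, which picks each new centroid with probability proportional to its squared distance from the nearest centroid already chosen (illustrative, not from any particular library):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """k-means++ seeding: spread the initial centroids out across the data."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]          # first centroid: uniform at random
    for _ in range(K - 1):
        # squared distance from each point to its nearest already-chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

The resulting centroids can then be passed to the standard K-means loop instead of a purely random initialization.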
K-means Algorithms
Number of Clusters:
K-means Algorithms
Limitations/Weaknesses:
K-means Algorithms
Summary:
- Despite these limitations, K-means is the most popular and fundamental
unsupervised clustering algorithm;
- Simple: two-step iterative algorithm; easy to understand and to implement.
- Computationally efficient: the time complexity is O(Knd) per iteration.

- It assumes that the number of clusters is known.

- Most of the convergence takes place in the first few iterations.

- Performance of the clustering is often hard to evaluate; this is true for every clustering algorithm.

- It is sensitive to the initial centroid values, to outliers, and to differences in the sizes, densities, and shapes of clusters.
Outline

- Introduction to Unsupervised Learning, Clustering


- Clustering Overview
- Partitional Clustering
- K-means Clustering
- Hierarchical Clustering
- Agglomerative Clustering
Hierarchical Clustering
Overview – Illustration: Dendrogram

- We take a union of clusters at level i+1 to obtain a parent cluster at level i.

(Figure: a dendrogram whose successive levels correspond to 6, 4, 3, and 2 clusters.)


Hierarchical Clustering
Overview:
- In hierarchical clustering, we carry out a hierarchical decomposition of
the data points using some criterion.
- Use distance or similarity metric to carry out hierarchical decomposition.
• We do not need to define the number of clusters as an input.
- A nested sequence of clusters is created in this decomposition process.
- This nested sequence of clusters, a tree, is also called Dendrogram.
• Dendrogram:
- A tree data structure, that records the sequences of splits or merges,
used for the visualization of hierarchical clustering techniques.

- We represent the similarity between two data points in the dendrogram as the height of the lowest internal node they share.

- The root corresponds to a single cluster containing all the data, and the leaves represent the individual data points (singleton clusters).

- Each level of the tree shows clusters for that level.


Hierarchical Clustering
Overview – Illustration:

(Figure: three panels showing the data, its dendrogram, and the corresponding nested clusters.)

- The dendrogram is the representation of the nested clusters.

- We can cut the dendrogram at a desired level to carry out clustering; the connected data points below the cut form a cluster.
Hierarchical Clustering
Overview:
- Agglomerative:
- Start by considering each data point as its own cluster
- Merge the clusters iteratively
- Keep merging until all clusters are fused into one cluster
- Also termed ‘Bottom-Up’

- Divisive:
- Start by considering all data points as a single cluster
- Divide (split) the clusters successively
- Also termed ‘Top-Down’

- In both approaches, we typically use a similarity (distance) metric and merge or split one cluster at a time.


Agglomerative Clustering
Algorithm:
In the agglomerative algorithm, we carry out the following steps:
- Compute the pairwise distance matrix and treat each data point as its own (singleton) cluster
- Repeat until only one cluster remains:
- Merge the two closest clusters
- Update the distances between the new cluster and all remaining clusters

Complexity: storing and repeatedly scanning the pairwise distance matrix costs at least O(n²) in time and space, so the algorithm does not scale well to large n.
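A short sketch using SciPy's hierarchical-clustering utilities, assuming SciPy is available; the toy data and the number of clusters used for cutting the dendrogram are arbitrary choices for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster  # assumes SciPy is installed

# toy data: 20 random points in 2-D
X = np.random.rand(20, 2)

# bottom-up merges; 'single', 'complete', 'average', 'centroid', and 'ward'
# select the linkage criterion discussed below
Z = linkage(X, method='single')

# cut the dendrogram so that 3 clusters remain; each point gets a label 1..3
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```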
Agglomerative Clustering
Agglomerative Clustering:
- Here, we are merging the two clusters that are nearest to each other.
- A group of points represents a cluster.
- We have studied a distance metric that computes the distance between points.

Question: How do we compute the distance between a point and a cluster or the distance between
two clusters?
Answer: We can define the closest pair of clusters in multiple ways, and this results in different
versions of hierarchical clustering.

- Single linkage: Distance of two closest data points in the different clusters (nearest neighbor)
- Complete linkage: Distance of the furthest points in the different clusters (furthest neighbor)
- Group average linkage: Average distance between all pairs of points in the two different clusters.
- Centroid linkage: Distance between centroids
- Ward's linkage: Merge the two clusters such that the increase in within-cluster variance after merging is minimized.
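A small NumPy sketch of how these linkage criteria measure the distance between two clusters A and B, each given as an array of points (the function names are illustrative):

```python
import numpy as np

def pairwise_dists(A, B):
    """All Euclidean distances between points of cluster A and cluster B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_linkage(A, B):   return pairwise_dists(A, B).min()    # nearest neighbour
def complete_linkage(A, B): return pairwise_dists(A, B).max()    # furthest neighbour
def average_linkage(A, B):  return pairwise_dists(A, B).mean()   # group average
def centroid_linkage(A, B): return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```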
Agglomerative Clustering
Agglomerative Single Linkage:
- Single linkage: Distance between the two clusters is the distance
between the closest data points (nearest neighbor).

- Results in (long and thin) clusters.


- Sensitive to noise and/or outliers

- Complete linkage: Distance between the two clusters is the distance between the furthest data points (furthest neighbor)

- Results in more compact spherical clusters (biased towards globular, blob clusters).
- Less sensitive to noise and/or outliers.
Agglomerative Clustering
Agglomerative Single Linkage:
- Single linkage vs Complete linkage:

(Figure: the same data set clustered using single linkage and using complete linkage.)


Hierarchical Clustering
Summary:
- We obtain a set of nested clusters arranged as a tree, aka dendrogram.
- We do not need to specify the number of clusters in advance.
- Agglomerative is bottom-up and divisive is top-down.
- We have different metrics to quantify the distance between the clusters; the clusters
are different for each metric.
- Hierarchical clustering is often used for analyzing text data or social network data.
- Unlike K-means, hierarchical clustering does not scale well because of its significant computational cost, at least O(n²).
- Like other heuristic search algorithms, it can get stuck in local optima.
- Interpretation of results is (very) subjective.
Clustering
References:
- CB: 9.1
- KM: 11.4.2.5
- Introduction to Information Retrieval (https://nlp.stanford.edu/IR-book/) (Ch: 16, 17)
- Jain, A. K., et al. Data clustering: a review. ACM Computing Surveys 31(3): 264-323, 1999
