Clustering

[Figure: training pipeline. All unlabeled training data is fed to a model, which outputs structured, labeled data (the clusters).]
Overview:
- Idea: Clustering is the process of grouping data into similarity groups known as clusters.
- Text Analysis: Grouping a collection of text documents by the similarity of their content.
- E.g., grouping of news items when you search for a story.
- Anomaly Detection: Given data from sensors, group the readings of a machine operating in different states and detect an anomaly as an outlier.
- Earthquake studies: Epicenters of earthquakes cluster around or along fault lines.
Aspects of Clustering:
- Given the data, what do we need to carry out clustering? A distance (or similarity) measure between data points, which should satisfy the metric properties:
- Non-negativity: d(x, y) >= 0
- Symmetry: d(x, y) = d(y, x)
- Self-similarity: d(x, x) = 0
- Triangle inequality: d(x, z) <= d(x, y) + d(y, z)
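As an illustrative sketch (function and variable names are ours, not from the slides), the four properties above can be checked for the familiar Euclidean distance:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

assert euclidean(x, y) >= 0                                   # non-negativity
assert euclidean(x, y) == euclidean(y, x)                     # symmetry
assert euclidean(x, x) == 0                                   # self-similarity
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)   # triangle inequality
print(euclidean(x, y))  # 5.0
```

Any function satisfying these four properties can serve as the distance for clustering, not just the Euclidean one.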
Evaluation criteria:
- Using Internal Data:
- Use the unlabeled data for evaluation of the clustering algorithm.
- Intra-cluster cohesion
- measure of the compactness of the cluster
- E.g., measured by the sum of squared errors (SSE), which quantifies the spread of the points around the centroid.
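A minimal pure-Python sketch of this cohesion measure (names are illustrative): the SSE of a cluster is the sum of squared distances of its points from their centroid, so tighter clusters score lower.

```python
def sse(cluster):
    """Sum of squared errors of a cluster's points around their centroid."""
    d = len(cluster[0])
    centroid = [sum(p[i] for p in cluster) / len(cluster) for i in range(d)]
    return sum(sum((p[i] - centroid[i]) ** 2 for i in range(d)) for p in cluster)

tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
loose = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
# A more compact cluster yields a smaller SSE.
print(sse(tight) < sse(loose))  # True
```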
Evaluation of Clustering using External (Labeled) Data:
- Cluster the data (ignoring the labels), then measure the extent to which the external class labels match the cluster labels.
- Idea: Evaluating clustering performance on labeled data gives us some confidence about the performance of the algorithm.
Entropy:
- Measures how mixed the class labels are within each cluster: for cluster j, H_j = - sum_i p_ij log2(p_ij), where p_ij is the proportion of class i in cluster j. Lower entropy (purer clusters) indicates better clustering.
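As a sketch (function name is ours), the entropy of a single cluster's label mix can be computed directly from the label counts:

```python
import math
from collections import Counter

def cluster_entropy(labels_in_cluster):
    """Entropy (base 2) of the class-label distribution inside one cluster."""
    n = len(labels_in_cluster)
    return sum(-(c / n) * math.log2(c / n)
               for c in Counter(labels_in_cluster).values())

print(cluster_entropy(["a", "a", "a", "a"]))  # 0.0 (pure cluster)
print(cluster_entropy(["a", "a", "b", "b"]))  # 1.0 (two classes, evenly mixed)
```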
Purity:
- The fraction of data points in a cluster that belong to the cluster's majority class; overall purity is the weighted average over all clusters. Higher purity indicates better clustering.
Remark:
- In an actual clustering problem the data have no labels; good performance on labeled benchmark data does not guarantee good performance on unlabeled data.
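To make purity concrete, here is a minimal sketch (names are illustrative) that takes each cluster as a list of true class labels:

```python
from collections import Counter

def purity(clusters):
    """clusters: list of clusters, each given as a list of true class labels.
    Purity = fraction of all points that fall in their cluster's majority class."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / total

# Two clusters over 6 labeled points: 4 of 6 fall in their cluster's majority class.
print(purity([["a", "a", "b"], ["b", "b", "a"]]))  # 0.666...
print(purity([["a", "a"], ["b"]]))                 # 1.0 (perfectly pure)
```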
Clustering Algorithms:
- Clustering algorithms usually optimize an objective function (e.g., the sum of squared errors) for a given number of clusters.
- Partitional Clustering
- Divides data points into non-overlapping subsets (clusters) such that each data point is in exactly one subset.
- E.g., K-means clustering
- Hierarchical Clustering
- Constructs a set of nested clusters by carrying out hierarchical
division of the data points.
- E.g., Agglomerative clustering, Divisive clustering
Computations and Complexity:
[Figure: three scatter plots of the data (x from -2 to 2, y from 0 to 3) illustrating successive iterations of the clustering.]
K-means Algorithm:
Possible solutions:
Number of Clusters:
Limitations/Weaknesses:
Summary:
- Despite these limitations, K-means is the most popular and fundamental unsupervised clustering algorithm:
- Simple: a two-step iterative algorithm; easy to understand and to implement.
- Computationally efficient: O(K n d) time per iteration, for K clusters, n data points, and d dimensions.
- Performance of clustering is often hard to evaluate; that is true for every clustering algorithm.
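The two-step iterative loop can be sketched in pure Python (a minimal illustration under our own naming, not a production implementation):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Two-step iterative K-means: assign each point to its nearest
    centroid, then recompute each centroid as the mean of its points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assignment step: O(K n d)
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        for i, c in enumerate(clusters):       # update step
            if c:
                centroids[i] = tuple(sum(v) / len(c) for v in zip(*c))
    return centroids, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [2, 2]
```

On these two well-separated pairs of points the algorithm recovers the two groups regardless of which points are sampled as initial centroids.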
Hierarchical Clustering
- We can cut the dendrogram at a desired level to carry out clustering; the connected data points below the cut form a cluster.
Overview:
- Agglomerative:
- Start by considering each data point as its own cluster
- Merge the closest clusters iteratively
- Keep merging until all clusters are fused into one cluster
- Also termed ‘Bottom-Up’
- Divisive:
- Start by considering all data points as a single cluster
- Divide (split) the clusters successively
- Also termed ‘Top-Down’
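The bottom-up procedure above can be sketched in pure Python with single linkage (names are ours; this is an O(n^3)-style illustration, not an optimized implementation):

```python
import math

def single_link(c1, c2):
    """Single linkage: distance between the closest cross-cluster pair."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points, linkage, k):
    """Bottom-up: every point starts as its own cluster; repeatedly merge
    the two closest clusters (per the linkage) until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = min(pairs,
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))    # merge the closest pair
    return clusters

pts = [(0.0, 0.0), (0.2, 0.0), (4.0, 0.0), (4.1, 0.0)]
result = agglomerative(pts, single_link, k=2)
print(sorted(len(c) for c in result))  # [2, 2]
```

Running it all the way to k = 1 and recording each merge distance would produce the dendrogram mentioned above.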
Complexity: A naive agglomerative implementation stores the O(n^2) pairwise-distance matrix and takes O(n^3) time for n data points.
Agglomerative Clustering
Agglomerative Clustering:
- Here, we are merging the two clusters that are nearest to each other.
- A group of points represents a cluster.
- We have studied a distance metric that computes the distance between points.
Question: How do we compute the distance between a point and a cluster or the distance between
two clusters?
Answer: We can define the closest pair of clusters in multiple ways, and this results in different
versions of hierarchical clustering.
- Single linkage: Distance between the two closest data points in the different clusters (nearest neighbor)
- Complete linkage: Distance between the two farthest data points in the different clusters (farthest neighbor)
- Group average linkage: Average distance between all pairs of points in the two different clusters
- Centroid linkage: Distance between the cluster centroids
- Ward's linkage: Merge the two clusters for which the increase in within-cluster variance after merging is minimized
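The first three linkage definitions can be sketched directly (a pure-Python illustration; names are ours):

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(c1, c2):
    """Distance between the closest pair of points across the two clusters."""
    return min(dist(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    """Distance between the farthest pair of points across the two clusters."""
    return max(dist(p, q) for p in c1 for q in c2)

def average_link(c1, c2):
    """Average distance over all cross-cluster pairs of points."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(a, b))    # 2.0, from (1,0) to (3,0)
print(complete_link(a, b))  # 5.0, from (0,0) to (5,0)
print(average_link(a, b))   # 3.5
```

Plugging a different linkage function into the same merge loop yields a different version of hierarchical clustering, as the slide notes.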
Agglomerative Single Linkage:
- Single linkage: Distance between the two clusters is the distance
between the closest data points (nearest neighbor).
- Can produce non-globular, elongated clusters (prone to the ‘chaining’ effect).
- More sensitive to noise and outliers, since a single noisy point can bridge two clusters.
- Single linkage vs Complete linkage:
- Single linkage can follow elongated, non-convex cluster shapes but suffers from chaining and noise sensitivity.
- Complete linkage favors compact, globular clusters and is less sensitive to noise, but it may break up large clusters.