Unit 4
Examples:
CURE (Clustering Using Representatives)
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
Clustering Methods
Hierarchical clustering can be performed in two ways:
Clustering by merging (Agglomerative Clustering): every point starts in its own cluster, and the two closest clusters are merged repeatedly until the desired number of clusters remains. With farthest-point (complete) linkage, the distance between two clusters is the distance between their farthest members. A minimal sketch follows.
Clustering by division (Divisive Clustering): all points start in a single cluster, which is split recursively into smaller clusters.
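To make the agglomerative approach concrete, here is a minimal sketch using scikit-learn's AgglomerativeClustering with complete (farthest-point) linkage; the toy data and parameter values are illustrative assumptions, not from the original notes.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy 2-D points forming two visually obvious groups (illustrative data).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])

# Agglomerative (bottom-up) clustering with complete linkage:
# the distance between two clusters is the distance between
# their farthest members.
model = AgglomerativeClustering(n_clusters=2, linkage="complete")
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1]
```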
K-Means Clustering
K-means clustering is a very popular clustering algorithm that is applied when we have a dataset with unknown labels. The goal is to find groups in the data based on some measure of similarity, with the number of groups represented by K. This algorithm is generally used in areas like market segmentation and customer segmentation, but it can also be used to segment different objects in images on the basis of their pixel values. A minimal sketch of applying K-means to image segmentation follows.
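As a hedged illustration of pixel-based segmentation, the sketch below clusters the pixel values of an image with scikit-learn's KMeans; the file name image.png, the Pillow library for image I/O, and the choice K=3 are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from PIL import Image

# Load an image and flatten it to a (num_pixels, 3) array of RGB values.
# "image.png" is a placeholder path for this sketch.
img = np.asarray(Image.open("image.png").convert("RGB"), dtype=np.float64)
h, w, _ = img.shape
pixels = img.reshape(-1, 3)

# Cluster the pixel values into K groups (K=3 is an assumed choice).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pixels)

# Replace every pixel with its cluster centroid to get the segmented image.
segmented = kmeans.cluster_centers_[kmeans.labels_].reshape(h, w, 3)
Image.fromarray(segmented.astype(np.uint8)).save("segmented.png")
```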
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
The steps below describe the DBSCAN algorithm, a density-based method that groups points in dense regions and labels isolated points as noise (a code sketch follows the steps):
1. Find all the neighbor points within eps of each point, and identify the core points as those with more than MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all of its density-connected points and assign them to the same cluster as the core point.
Two points a and b are said to be density connected if there exists a point c that has a sufficient number of points in its neighborhood and both a and b are within eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is density connected to a.
4. Iterate through the remaining unvisited points in the dataset. Points that do not belong to any cluster are noise.
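As a minimal sketch of these steps, scikit-learn's DBSCAN can be applied directly; the toy data and the eps and min_samples (MinPts) values below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two dense blobs plus one isolated point (illustrative).
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.0],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],
              [25.0, 25.0]])

# eps is the neighborhood radius; min_samples plays the role of MinPts.
db = DBSCAN(eps=0.5, min_samples=3).fit(X)

# Core points get cluster labels 0, 1, ...; noise points are labeled -1.
print(db.labels_)  # e.g. [ 0  0  0  1  1  1 -1]
```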
Gaussian Mixtures
A Gaussian mixture is a function composed of several Gaussians, each identified by k ∈ {1, …, K}, where K is the number of clusters in our dataset. Each Gaussian k in the mixture is described by the following parameters:
A mean μ that defines its centre.
A covariance Σ that defines its width.
A mixing probability π that defines how big or small the Gaussian function will be, with the π values summing to 1.
Now how do we determine the optimal values for these parameters? To achieve this, we must ensure that each Gaussian fits the data points belonging to its cluster. This is exactly what maximum likelihood estimation does.
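As a hedged sketch, scikit-learn's GaussianMixture fits exactly these parameters by maximum likelihood (via the EM algorithm); the synthetic data and the choice K=2 are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data drawn around two centres (illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(6.0, 1.5, size=(100, 2))])

# Fit a mixture of K=2 Gaussians by maximum likelihood (EM).
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.means_)        # the fitted means mu_k
print(gmm.covariances_)  # the fitted covariances Sigma_k
print(gmm.weights_)      # the fitted mixing probabilities pi_k
```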
In machine learning classification problems, there are often too many factors on the
basis of which the final classification is done. These factors are basically variables
called features. The higher the number of features, the harder it gets to visualize the
training set and then work on it. Sometimes, most of these features are correlated,
and hence redundant.
This can lead to overfitting of the model: even though it performs really well on the training data, it fails drastically on unseen real data.
Methods of Dimensionality Reduction
Kernel PCA:
Kernel PCA is an extension of PCA that uses a kernel, a mathematical technique by which we can implicitly map instances into a very high-dimensional space called the feature space, enabling non-linear classification and regression with Support Vector Machines (SVMs). It is usually employed in novelty detection and image de-noising. Scikit-Learn provides the class KernelPCA in sklearn.decomposition, which can be used to perform Kernel PCA; a minimal sketch follows.
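Here is a minimal sketch of KernelPCA on a non-linear dataset; the RBF kernel, the gamma value, and the make_moons toy data are illustrative assumptions, not prescribed by the notes.

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

# A toy non-linear dataset: two interleaving half-circles.
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# Kernel PCA with an RBF kernel implicitly maps the points into a
# high-dimensional feature space and extracts principal components there.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)

print(X_kpca.shape)  # (200, 2)
```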
Applications of PCA in Machine Learning
For the most part, users accomplish three primary tasks with scikit-learn, and PCA is commonly used as a preprocessing step for each of them:
1. Classification
Identifying which category an object belongs to.
Application: Spam detection
2. Regression
Predicting a continuous variable based on relevant independent
variables.
Application: Stock price predictions
3. Clustering
Automatic grouping of similar objects into different clusters.
Application: Customer segmentation
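As a hedged example of PCA feeding one of these tasks, the sketch below reduces the digits dataset with PCA before a classification step; the component count and the choice of LogisticRegression as the classifier are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# 64-dimensional digit images; PCA reduces them before classification.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline: project onto 16 principal components, then classify.
clf = make_pipeline(PCA(n_components=16),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on held-out data
```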