MLLecture-1

This document discusses unsupervised learning techniques, focusing on clustering methods for image segmentation, such as semantic and instance segmentation. It elaborates on the DBSCAN algorithm for density-based clustering and Gaussian Mixture Models (GMM) for probabilistic clustering, including the Expectation-Maximization algorithm for fitting GMMs. Additionally, it addresses the challenges of high-dimensional data and dimensionality reduction techniques like PCA, highlighting their advantages and disadvantages.

UNSUPERVISED LEARNING TECHNIQUES
UNIT 4
CLUSTERING FOR IMAGE SEGMENTATION
• Image segmentation is the task of partitioning an image into multiple
segments.
• In semantic segmentation, all pixels that are part of the same object
type get assigned to the same segment.
• For example, in a self-driving car’s vision system, all pixels that are
part of a pedestrian’s image might be assigned to the “pedestrian”
segment (there would just be one segment containing all the
pedestrians).
• In instance segmentation, all pixels that are part of the same
individual object are assigned to the same segment.
• In this case there would be a different segment for each pedestrian.
• In some applications, a simpler approach called color segmentation, which assigns pixels with similar colors to the same segment, may be sufficient. For example, if you want to analyze satellite images to measure how much total forest area there is in a region, color segmentation may be just fine.
Clustering for Preprocessing
• Clustering can be an efficient approach to dimensionality reduction, in particular
as a preprocessing step before a supervised learning algorithm.
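As an illustration, here is a minimal sketch (assuming scikit-learn and its digits dataset; the choice of 50 clusters and a logistic regression classifier is arbitrary) in which k-means is placed in a Pipeline so that each instance is replaced by its distances to the cluster centroids before the classifier is trained:

from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical example dataset: handwritten digits.
X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)

pipeline = Pipeline([
    # KMeans.transform() replaces each instance by its distances to the 50 centroids.
    ("kmeans", KMeans(n_clusters=50, n_init=10, random_state=42)),
    ("log_reg", LogisticRegression(max_iter=10_000)),
])
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))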
Using Clustering for Semi-Supervised Learning
• Another use case for clustering is in semi-supervised learning, when we
have plenty of unlabeled instances and very few labeled instances.
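One common recipe, sketched below under the assumption that we can afford to hand-label only k representative instances (the digits dataset and k = 50 are illustrative choices), is to cluster the data and label only the instance closest to each centroid:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Pretend X_train is unlabeled and only k instances can be hand-labeled.
X_train, y_train = load_digits(return_X_y=True)

k = 50
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
X_dist = kmeans.fit_transform(X_train)          # distance of every instance to every centroid

representative_idx = np.argmin(X_dist, axis=0)  # the instance closest to each centroid
X_representative = X_train[representative_idx]
# Hand-label only these k representative instances (here we peek at y_train to simulate that);
# a classifier can then be trained on them, or each label propagated to the rest of its cluster.
y_representative = y_train[representative_idx]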
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• This algorithm defines clusters as continuous regions of high density.
• Groups together closely packed data points. Marks the outlier points
as low-density regions.
• The algorithm can figure out any arbitrary shaped clusters.
• The algorithm works on two parameters :
• Epsilon (ε): The maximum distance between two samples for one
data point to be considered in the neighborhood of the other data
point.
• Minimum points (minPts): The minimum number of points required
to form a dense region.
Algorithm
• For each instance, the algorithm counts how many instances are
located within a small distance ε (epsilon) from it. This region is called
the instance’s ε neighborhood.
• If an instance has at least min_samples instances in its ε-
neighborhood (including itself), then it is considered a core instance.
In other words, core instances are those that are located in dense
regions.
• All instances in the neighborhood of a core instance belong to the
same cluster. This may include other core instances, therefore a long
sequence of neighboring core instances forms a single cluster.
• Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.
Example:
• Consider the dataset :
Point F1 F2
P1 4.5 8
P2 5 7
P3 6 6.5
P4 7 5
P5 9 4
P6 7 3
P7 8 3.5
P8 9 5
P9 4 4
P10 3 7.5
P11 4 6
P12 3.5 5
• ε = 1.9
• min_pts = 4
The pairwise Euclidean distances between the points are:
P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12
P1 0 1.12 2.12 3.91 6.02 5.59 5.70 5.41 4.03 1.58 2.06 3.16
P2 1.12 0 1.12 2.83 5.0 4.47 4.61 4.47 3.16 2.06 1.41 2.5
P3 2.12 1.12 0 1.80 3.91 3.64 3.61 3.35 3.20 3.16 2.06 2.92
P4 3.91 2.83 1.80 0 2.24 2.0 1.80 2.0 3.16 4.72 3.16 3.50
P5 6.02 5.0 3.91 2.24 0 2.24 1.12 1.0 5.0 6.95 5.39 5.59
P6 5.59 4.47 3.64 2.0 2.24 0 1.12 2.83 3.16 6.02 4.24 4.03
P7 5.70 4.61 3.61 1.80 1.12 1.12 0 1.80 4.03 6.40 4.72 4.74
P8 5.41 4.47 3.35 2.0 1.0 2.83 1.80 0 5.10 6.50 5.10 5.50
P9 4.03 3.16 3.20 3.16 5.0 3.16 4.03 5.10 0 3.64 2.00 1.12
P10 1.58 2.06 3.16 4.72 6.95 6.02 6.40 6.50 3.64 0 1.80 2.55
P11 2.06 1.41 2.06 3.16 5.39 4.24 4.72 5.10 2.00 1.80 0 1.12
P12 3.16 2.5 2.92 3.50 5.59 4.03 4.74 5.50 1.12 2.55 1.12 0
Point identification (ε = 1.9, min_pts = 4)
Point  Core point?  Final label
P1     No           BORDER
P2     Yes          CLUSTER (core)
P3     No           BORDER
P4     No           BORDER
P5     No           BORDER
P6     No           BORDER
P7     Yes          CLUSTER (core)
P8     No           BORDER
P9     No           NOISE (outlier)
P10    No           BORDER
P11    Yes          CLUSTER (core)
P12    No           BORDER
Only P2, P7 and P11 have at least four points (including themselves) within distance ε, so they are the core points. Every other point that lies within ε of a core point becomes a border point of that core point's cluster, giving two clusters: {P1, P2, P3, P10, P11, P12} and {P4, P5, P6, P7, P8}. P9's only ε-neighbor is P12, which is not a core point, so P9 is labeled as noise.
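A minimal sketch (assuming scikit-learn; the lecture works the example by hand) that reproduces these labels with sklearn.cluster.DBSCAN:

import numpy as np
from sklearn.cluster import DBSCAN

# The twelve points P1..P12 from the table above.
X = np.array([
    [4.5, 8], [5, 7], [6, 6.5], [7, 5], [9, 4], [7, 3],
    [8, 3.5], [9, 5], [4, 4], [3, 7.5], [4, 6], [3.5, 5],
])

db = DBSCAN(eps=1.9, min_samples=4).fit(X)
print(db.core_sample_indices_)  # indices 1, 6, 10 -> the core points P2, P7, P11
print(db.labels_)               # cluster label of each point; -1 marks noise (here P9)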
• Advantages:
• Is great at separating clusters of high density from clusters of low density within a given dataset.
• Handles outliers within the dataset well.
• Disadvantages:
• Does not work well when dealing with clusters of varying densities.
• Struggles to separate neighboring clusters of similar density.
• Struggles with high-dimensional data, where the notion of a dense ε-neighborhood becomes less meaningful.
Gaussian Mixtures
• A Gaussian mixture model (GMM) is a probabilistic model that
assumes that the instances were generated from a mixture of several
Gaussian distributions whose parameters are unknown.
• All the instances generated from a single Gaussian distribution form a
cluster that typically looks like an ellipsoid.
• Each cluster can have a different ellipsoidal shape, size, density and
orientation.
• K-means is a clustering algorithm that assigns each data point to one
cluster based on the closest centroid. It’s a hard clustering method,
meaning each point belongs to only one cluster with no uncertainty.
• On the other hand, Gaussian Mixture Models (GMM) use soft
clustering, where data points can belong to multiple clusters with a
certain probability.
• The Gaussian distributions in a mixture differ in their mean (μ) and variance (σ²). Remember that the higher the σ value, the greater the spread.
1. Multiple Gaussians (Clusters): Each cluster is represented by a Gaussian distribution, and the data points are assigned probabilities of belonging to different clusters based on their distance from each Gaussian.
2. Parameters of a Gaussian: The core of GMM is made up of three main parameters for each Gaussian:
   o Mean (μ): The center of the Gaussian distribution.
   o Covariance (Σ): Describes the spread or shape of the cluster.
   o Mixing Probability (π): Determines how dominant or likely each cluster is in the data.
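Written out for reference, each component k is a d-dimensional Gaussian density with its own mean and covariance:

\mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(x-\mu_k)^{\mathsf T}\Sigma_k^{-1}(x-\mu_k)\Big)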
• The Gaussian mixture model assigns to each data point x_n a probability of belonging to each cluster. The probability that a data point x_n comes from Gaussian cluster k is expressed as

\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)

• Next, we need to calculate the overall likelihood of observing a data point x_n under all Gaussians. This is achieved by summing over all possible clusters (Gaussians) for each point:

p(x_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
The Expectation-Maximization (EM) Algorithm
To fit a Gaussian Mixture Model to the data, we use the Expectation-
Maximization (EM) algorithm, which is an iterative method that optimizes
the parameters of the Gaussian distributions (mean, covariance, and
mixing coefficients). It works in two main steps:
1. Expectation step (E-step): The algorithm calculates the probability that each data point belongs to each cluster, based on the current parameter estimates (means, covariances, mixing coefficients).
2. Maximization step (M-step): After estimating these probabilities, the algorithm updates the parameters (means, covariances, and mixing coefficients) to better fit the data.

• These two steps are repeated until the model converges, meaning the
parameters no longer change significantly between iterations.
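Written out, the usual E-step and M-step updates (a standard sketch in the notation above, with N data points and K components) are:

E-step (responsibilities):
\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

M-step (parameter updates), with N_k = \sum_{n=1}^{N} \gamma_{nk}:
\mu_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma_{nk}\, x_n, \qquad
\Sigma_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma_{nk}\,(x_n-\mu_k)(x_n-\mu_k)^{\mathsf T}, \qquad
\pi_k = \frac{N_k}{N}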
GMM Algorithm
1. Initialization: Start with initial guesses for the means, covariances, and mixing coefficients of each Gaussian distribution.
2. E-step: For each data point, calculate the probability of it belonging to each Gaussian distribution (cluster).
3. M-step: Update the parameters (means, covariances, mixing coefficients) using the probabilities calculated in the E-step.
4. Repeat: Continue alternating between the E-step and M-step until the log-likelihood of the data (a measure of how well the model fits the data) converges.
• The E-step computes the probabilities that each data point
belongs to each Gaussian, while the M-step updates the
parameters μk, Σk , and πk based on these probabilities.
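A minimal sketch with scikit-learn's GaussianMixture, which fits a GMM with EM (the blob-shaped toy data and the choice of three components are illustrative assumptions):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Hypothetical toy data: three blob-shaped clusters.
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

gm = GaussianMixture(n_components=3, n_init=10, random_state=42)
gm.fit(X)                        # runs EM until the log-likelihood converges

print(gm.weights_)               # mixing coefficients pi_k
print(gm.means_)                 # component means mu_k
print(gm.covariances_)           # component covariances Sigma_k
print(gm.predict(X)[:5])         # hard cluster assignments
print(gm.predict_proba(X)[:5])   # soft (probabilistic) assignments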
Dimensionality Reduction
• The Curse of Dimensionality:
• Curse of Dimensionality refers to a set of problems that arise when
working with high-dimensional data.
• The dimension of a dataset corresponds to the number of
attributes/features that exist in a dataset.
• A dataset with a large number of attributes, generally of the order of
a hundred or more, is referred to as high dimensional data.
• Some of the difficulties that come with high dimensional data
manifest during analyzing or visualizing the data to identify patterns,
and some manifest while training machine learning models.
• The difficulties related to training machine learning models on high-dimensional data are referred to as the ‘Curse of Dimensionality’.
Solutions to Curse of Dimensionality:
• One of the ways to reduce the impact of high dimensions is to use a different measure of distance in the vector space.
• One could explore the use of cosine similarity to replace Euclidean distance.
• Cosine similarity is less affected by high dimensionality than Euclidean distance. However, whether such a method is appropriate depends on the problem being solved.
• Other methods:
• Other methods could involve the use of reduction in dimensions. Some of
the techniques that can be used are:
1. Forward feature selection: This method involves picking the most useful subset of features from all the given features (see the sketch after this list).
2. PCA/t-SNE: Though these methods help reduce the number of features, they do not necessarily preserve class-related structure, which can make the interpretation of results a tough task.
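A minimal sketch of forward feature selection, assuming scikit-learn's SequentialFeatureSelector (the iris dataset, the k-nearest-neighbors estimator and the target of 2 features are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Greedily add one feature at a time, keeping the one that helps the estimator most.
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",
)
sfs.fit(X, y)
print(sfs.get_support())      # boolean mask of the selected features
X_reduced = sfs.transform(X)  # dataset restricted to the selected features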
Main Approaches for Dimensionality Reduction
• The two main approaches to reducing dimensionality:
• Projection: project the training instances onto a lower-dimensional subspace of the high-dimensional space. However, projection is not always the best approach to dimensionality reduction.
• Manifold Learning:
• Manifold learning is a type of non-linear dimensionality reduction
process.
• It relies on the manifold assumption, also called the manifold hypothesis,
which holds that most real-world high-dimensional datasets lie close to a
much lower-dimensional manifold.
Principal Components Analysis
• This method was introduced by Karl Pearson. It works on the condition that when the data in a higher-dimensional space is mapped to data in a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximized.
• It involves the following steps:
• Construct the covariance matrix of the data.
• Compute the eigenvectors of this matrix.
• Eigenvectors corresponding to the largest eigenvalues are used to
reconstruct a large fraction of variance of the original data.
• Hence, we are left with a lesser number of eigenvectors, and there
might have been some data loss in the process. But, the most
important variances should be retained by the remaining
eigenvectors.
• Preserving the Variance:
• Before you can project the training set onto a lower-dimensional hyperplane, you first need to choose the right hyperplane.
• It seems reasonable to select the axis that preserves the maximum
amount of variance, as it will most likely lose less information than
the other projections. Another way to justify this choice is that it is
the axis that minimizes the mean squared distance between the
original dataset and its projection onto that axis.
Principal Components
• The unit vector that defines the ith axis is called the ith principal component (PC); the 1st PC is c1 and the 2nd PC is c2.
• Luckily, there is a standard matrix factorization technique called Singular Value Decomposition (SVD) that can decompose the training set matrix X into the matrix multiplication of three matrices, X = U Σ V^T, where V contains all the principal components.
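A minimal NumPy sketch of this factorization (the data matrix X here is a hypothetical random array, used only so the code runs; centring is done manually because np.linalg.svd does not do it for you):

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))    # hypothetical data, one instance per row

X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)  # rows of Vt are the principal components

c1 = Vt[0]                    # 1st principal component
c2 = Vt[1]                    # 2nd principal component

W2 = Vt[:2].T                 # projection matrix made of the first two PCs
X2D = X_centered @ W2         # project the data onto the first two PCs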
• Advantages of Dimensionality Reduction
• It helps in data compression, and hence reduced storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.
• Disadvantages of Dimensionality Reduction
• It may lead to some amount of data loss.
• PCA tends to find linear correlations between variables, which is
sometimes undesirable.
• PCA fails in cases where mean and covariance are not enough to
define datasets.
• We may not know how many principal components to keep; in practice, some rules of thumb are applied.
Steps in PCA:
• Step 1: Calculate the mean of each feature and centre the data by subtracting it.
• Step 2: Compute the covariance matrix S of the centred data.
• Step 3: PCA provides a mechanism to recognize the geometric structure of the data through algebraic means. The covariance matrix S is a symmetric matrix, and according to the Spectral Theorem (spectral decomposition) it can be written as

A \vec{v}_i = \lambda_i \vec{v}_i, \qquad A = \sum_i \lambda_i \, \vec{v}_i \vec{v}_i^{\mathsf T}

Here we call \vec{v}_i an eigenvector, λ_i the corresponding eigenvalue, and A the covariance matrix.
• Step 4: Inferring the principal components from the eigenvalues of the covariance matrix. From the Spectral Theorem we infer:
  o The most significant principal component is the eigenvector corresponding to the largest eigenvalue.
• Step 5: Projecting the data using the principal components.
  o The projection matrix is obtained from the selected eigenvectors (k < d of them). The original dataset is transformed via the projection matrix to obtain a reduced k-dimensional subspace of the original dataset (see the sketch below).
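A minimal NumPy sketch of these five steps (the small 2-D array is purely hypothetical data, included only so the code runs end to end):

import numpy as np

X = np.array([[4.0, 11.0], [8.0, 4.0], [13.0, 5.0], [7.0, 14.0]])  # hypothetical samples, one per row

# Step 1: centre the data (subtract the mean of each feature).
X_centered = X - X.mean(axis=0)

# Step 2: covariance matrix S of the centred data.
S = np.cov(X_centered, rowvar=False)

# Step 3: spectral decomposition of the symmetric matrix S.
eigvals, eigvecs = np.linalg.eigh(S)

# Step 4: sort by decreasing eigenvalue; the first eigenvector is the most
# significant principal component.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: project onto the top k eigenvectors (k < d).
k = 1
W = eigvecs[:, :k]            # projection matrix
X_reduced = X_centered @ W    # k-dimensional representation
print(X_reduced)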
Step 1: Calculate Mean
• The figure shows the scatter plot of the given data points.
• Calculate the mean of X1 and X2 as shown below.
Step 2: Calculation of the covariance matrix.
The covariances are calculated as follows:
• The covariance matrix is,
Step 3: Eigenvalues of the covariance matrix
• The eigenvalues are obtained by solving the characteristic equation of the covariance matrix, det(S − λI) = 0.
• Solving the characteristic equation gives the eigenvalues λ1 and λ2.
Step 4: Computation of the eigenvectors
To find the first principal component, we need only compute the eigenvector corresponding to the largest eigenvalue. In the present example, the largest eigenvalue is λ1, and so we compute the eigenvector corresponding to λ1.
• The eigenvector corresponding to λ = λ1 is a vector e1 satisfying the equation (S − λ1 I) e1 = 0.
• Using the theory of systems of linear equations, we note that these equations are not independent, so the solutions are determined only up to a scale factor; the unit-length solution is taken as e1.
• Step 5: Computation of the first principal components
• Let Xk be the kth sample in the above table (dataset). The first principal component of this sample is given by e1^T (Xk − X̄), where X̄ is the mean vector (here "T" denotes the transpose).
• For example, the first principal component corresponding to the first sample is obtained by substituting X1 into this expression.
• Step 6: Geometrical meaning of first principal components
• First, we shift the origin to the “center” and then change the
directions of coordinate axes to the directions of the eigenvectors
e1 and e2.
• Next, we drop perpendiculars from the given data points to the e1-
axis (see below Figure).
• The first principal components are the e1-coordinates of the feet of the perpendiculars, that is, the projections on the e1-axis. The projections of the data points on the e1-axis may be taken as approximations of the given data points; hence we may replace the given dataset with these points.
• Now, each of these approximations can be unambiguously specified
by a single number, namely, the e1-coordinate of approximation. Thus
the two-dimensional data set can be represented approximately by
the following one-dimensional data set.
USING SCIKIT-LEARN
• Scikit-Learn’s PCA class implements PCA using SVD decomposition
just like we did before. The following code applies PCA to reduce the
dimensionality of the dataset down to two dimensions (note that it
automatically takes care of centering the data):
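A minimal sketch of that call (the random 3-D array stands in for whatever dataset is being reduced):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(42).normal(size=(60, 3))  # hypothetical 3-D dataset

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)   # centering is handled internally by PCA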
RANDOMIZED PCA
• If you set the svd_solver hyperparameter to "randomized", Scikit-Learn uses a stochastic algorithm called Randomized PCA that quickly finds an approximation of the first d principal components. Its computational complexity is O(m × d²) + O(d³), instead of O(m × n²) + O(n³) for the full SVD approach, so it is dramatically faster than full SVD when d is much smaller than n.
• By default, svd_solver is actually set to "auto": Scikit-Learn automatically uses the randomized PCA algorithm if m or n is greater than 500 and d is less than 80% of m or n, or else it uses the full SVD approach. If you want to force Scikit-Learn to use full SVD, you can set the svd_solver hyperparameter to "full".
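A minimal sketch of both settings (the number of components and the random dataset are illustrative placeholders):

import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.default_rng(42).normal(size=(1000, 200))  # hypothetical high-dimensional data

rnd_pca = PCA(n_components=20, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X_train)

full_pca = PCA(n_components=20, svd_solver="full")  # force full SVD instead of the "auto" heuristic
X_reduced_full = full_pca.fit_transform(X_train)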

• Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower-dimensional space. The input data is centered but not scaled for each feature before applying the SVD.
KERNEL PCA
• Kernel PCA builds on the kernel trick, a mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), enabling nonlinear classification and regression with Support Vector Machines.
• A linear decision boundary in the high-dimensional feature space
corresponds to a complex nonlinear decision boundary in the original
space. It turns out that the same trick can be applied to PCA, making
it possible to perform complex nonlinear projections for
dimensionality reduction. This is called Kernel PCA (kPCA).
• It is often good at preserving clusters of instances after projection, or
sometimes even unrolling datasets that lie close to a twisted
manifold.
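A minimal sketch with scikit-learn's KernelPCA (the RBF kernel, the gamma value and the Swiss-roll toy dataset are illustrative choices):

from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

X, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=42)  # a twisted 3-D manifold

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)   # nonlinear 2-D projection that can "unroll" the manifold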
