MLLecture-1
TECHNIQUES
UNIT 4
CLUSTERING FOR IMAGE SEGMENTATION
• Image segmentation is the task of partitioning an image into multiple
segments.
• In semantic segmentation, all pixels that are part of the same object
type get assigned to the same segment.
• For example, in a self-driving car’s vision system, all pixels that are
part of a pedestrian’s image might be assigned to the “pedestrian”
segment (there would just be one segment containing all the
pedestrians).
• In instance segmentation, all pixels that are part of the same
individual object are assigned to the same segment.
• In this case there would be a different segment for each pedestrian.
• A simpler alternative is color segmentation, where pixels with similar colors are assigned to the same segment. In some applications this may be sufficient; for example, if you want to analyze satellite images to measure how much total forest area there is in a region, color segmentation may be just fine (a sketch follows this list).
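Below is a minimal sketch of color segmentation using k-means on pixel colors with scikit-learn. The image file name, the number of clusters, and the assumption of a 3-channel RGB image are illustrative, not from the lecture.

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Hypothetical input image; converted to RGB so the array is (height, width, 3).
image = np.asarray(Image.open("satellite.png").convert("RGB"))
X = image.reshape(-1, 3)                      # one row of RGB values per pixel

# Cluster the colors and replace each pixel by the color of its cluster center.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(X)
segmented = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape).astype(np.uint8)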
Clustering for Preprocessing
• Clustering can be an efficient approach to dimensionality reduction, in particular
as a preprocessing step before a supervised learning algorithm.
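As a rough illustration (not from the lecture), the sketch below uses k-means as a preprocessing step in a scikit-learn pipeline: each instance is replaced by its distances to k centroids before a classifier is trained. The digits dataset and k = 50 are arbitrary choices for the example.

from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# KMeans acts as a transformer here: transform() maps each 64-pixel image
# to its distances from the 50 centroids, a lower-dimensional representation.
pipeline = make_pipeline(
    KMeans(n_clusters=50, n_init=10, random_state=42),
    LogisticRegression(max_iter=5000),
)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))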
Using Clustering for Semi-Supervised Learning
• Another use case for clustering is in semi-supervised learning, when we
have plenty of unlabeled instances and very few labeled instances.
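One common recipe, sketched below under the assumption that only a handful of instances can be labeled by hand, is to cluster the unlabeled data, label just one representative instance per cluster (the one closest to each centroid), and train on those. The dataset and k = 50 are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X_train, y_train = load_digits(return_X_y=True)

k = 50
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
X_dist = kmeans.fit_transform(X_train)            # distance of each instance to each centroid
representative_idx = np.argmin(X_dist, axis=0)    # instance closest to each centroid

# In practice these k labels would come from a human annotator;
# here the known labels stand in for that manual step.
X_repr, y_repr = X_train[representative_idx], y_train[representative_idx]
clf = LogisticRegression(max_iter=5000).fit(X_repr, y_repr)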
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• This algorithm defines clusters as continuous regions of high density.
• It groups together closely packed data points and marks points that lie alone in low-density regions as outliers.
• The algorithm can discover clusters of arbitrary shape.
• The algorithm relies on two parameters:
• Epsilon (ε): The maximum distance between two samples for one
data point to be considered in the neighborhood of the other data
point.
• Minimum points (minPts): The minimum number of points required
to form a dense region.
Algorithm
• For each instance, the algorithm counts how many instances are
located within a small distance ε (epsilon) from it. This region is called
the instance’s ε neighborhood.
• If an instance has at least min_samples instances in its ε-
neighborhood (including itself), then it is considered a core instance.
In other words, core instances are those that are located in dense
regions.
• All instances in the neighborhood of a core instance belong to the
same cluster. This may include other core instances, therefore a long
sequence of neighboring core instances forms a single cluster.
• Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly (noise). A non-core instance that does lie in a core instance's neighborhood is called a border point.
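A minimal sketch of this algorithm using scikit-learn's DBSCAN on a synthetic two-moons dataset (the dataset and parameter values are illustrative, not from the lecture):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(dbscan.labels_[:10])               # cluster index per instance; -1 marks anomalies
print(len(dbscan.core_sample_indices_))  # number of core instances found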
Example:
• Consider the following dataset:
Point F1 F2
P1 4.5 8
P2 5 7
P3 6 6.5
P4 7 5
P5 9 4
P6 7 3
P7 8 3.5
P8 9 5
P9 4 4
P10 3 7.5
P11 4 6
P12 3.5 5
• ε = 1.9
• minPts = 4
• The pairwise Euclidean distances between the points are:
P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12
P1 0 1.12 2.12 3.91 6.02 5.59 5.70 5.41 4.03 1.58 2.06 3.16
P2 1.12 0 1.12 2.83 5.0 4.47 4.61 4.47 3.16 2.06 1.41 2.5
P3 2.12 1.12 0 1.80 3.91 3.64 3.61 3.35 3.20 3.16 2.06 2.92
P4 3.91 2.83 1.80 0 2.24 2.0 1.80 2.0 3.16 4.72 3.16 3.50
P5 6.02 5.0 3.91 2.24 0 2.24 1.12 1.0 5.0 6.95 5.39 5.59
P6 5.59 4.47 3.64 2.0 2.24 0 1.12 2.83 3.16 6.02 4.24 4.03
P7 5.70 4.61 3.61 1.80 1.12 1.12 0 1.80 4.03 6.40 4.72 4.74
P8 5.41 4.47 3.35 2.0 1.0 2.83 1.80 0 5.10 6.50 5.10 5.50
P9 4.03 3.16 3.20 3.16 5.0 3.16 4.03 5.10 0 3.64 2.00 1.12
P10 1.58 2.06 3.16 4.72 6.95 6.02 6.40 6.50 3.64 0 1.80 2.55
P11 2.06 1.41 2.06 3.16 5.39 4.24 4.72 5.10 2.00 1.80 0 1.12
P12 3.16 2.5 2.92 3.50 5.59 4.03 4.74 5.50 1.12 2.55 1.12 0
Point Core point? Final label
P1 No Border
P2 Yes (core) Cluster
P3 No Border
P4 No Border
P5 No Border
P6 No Border
P7 Yes (core) Cluster
P8 No Border
P9 No Noise
P10 No Border
P11 Yes (core) Cluster
P12 No Border
• Core points: P2, P7 and P11. The resulting clusters are {P1, P2, P3, P10, P11, P12} and {P4, P5, P6, P7, P8}; P9 has no core point within ε, so it is noise.
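The worked example can be checked with scikit-learn's DBSCAN (a short sketch; the comments summarize the result above):

import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[4.5, 8], [5, 7], [6, 6.5], [7, 5], [9, 4], [7, 3],
                   [8, 3.5], [9, 5], [4, 4], [3, 7.5], [4, 6], [3.5, 5]])

db = DBSCAN(eps=1.9, min_samples=4).fit(points)
print(db.core_sample_indices_)  # 0-based indices 1, 6, 10 -> core points P2, P7, P11
print(db.labels_)               # cluster label per point; -1 marks P9 as noise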
• Advantages:
• Works well at separating clusters of high density from regions of low density within a given dataset.
• Handles outliers within the dataset well.
• Disadvantages:
• Does not work well when clusters have widely varying densities, since a single ε cannot suit them all.
• Can merge nearby clusters of similar density when they are not separated by a low-density gap.
• Struggles with high-dimensional data, where distances become less informative.
Gaussian Mixtures
• A Gaussian mixture model (GMM) is a probabilistic model that
assumes that the instances were generated from a mixture of several
Gaussian distributions whose parameters are unknown.
• All the instances generated from a single Gaussian distribution form a
cluster that typically looks like an ellipsoid.
• Each cluster can have a different ellipsoidal shape, size, density and
orientation.
• K-means is a clustering algorithm that assigns each data point to one
cluster based on the closest centroid. It’s a hard clustering method,
meaning each point belongs to only one cluster with no uncertainty.
• On the other hand, Gaussian Mixture Models (GMM) use soft
clustering, where data points can belong to multiple clusters with a
certain probability.
• Gaussian distributions can differ in their mean (μ) and variance (σ²); the larger the σ value, the greater the spread of the distribution.
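A small illustration of this point (the parameter values are chosen arbitrarily): two univariate Gaussian densities, one narrow and one wide.

import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 500)
narrow = norm.pdf(x, loc=0, scale=1)   # μ = 0, σ = 1
wide = norm.pdf(x, loc=2, scale=3)     # μ = 2, σ = 3: larger σ, wider spread
print(narrow.max(), wide.max())        # the wider Gaussian has a much lower peak density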
1. Multiple Gaussians (clusters): each cluster is represented by a Gaussian distribution, and the data points are assigned probabilities of belonging to different clusters based on their distance from each Gaussian.
• The expectation and maximization steps of the algorithm described below are repeated until the model converges, meaning the parameters no longer change significantly between iterations.
GMM Algorithm
1. Initialization: start with initial guesses for the means, covariances, and mixing coefficients of each Gaussian distribution.
2. Expectation step (E-step): for every instance, estimate the probability (responsibility) that it was generated by each Gaussian, given the current parameter values.
3. Maximization step (M-step): re-estimate each Gaussian's mean, covariance, and mixing coefficient as responsibility-weighted averages over the instances; steps 2 and 3 are repeated until convergence.
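A minimal sketch of this EM procedure using scikit-learn's GaussianMixture (the synthetic blobs dataset and n_components = 3 are illustrative choices):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

gm = GaussianMixture(n_components=3, n_init=10, random_state=42).fit(X)
print(gm.weights_)              # mixing coefficients of the three Gaussians
print(gm.means_)                # one mean vector per Gaussian
print(gm.predict(X[:5]))        # hard assignments
print(gm.predict_proba(X[:5]))  # soft assignments: probability of each cluster per instance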
Principal Component Analysis (PCA)
• PCA provides a mechanism to recognize this geometric similarity through algebraic means.
• The covariance matrix S is a symmetric matrix, so by the Spectral Theorem (spectral decomposition) it satisfies A·vi = λi·vi for each i, where A denotes the covariance matrix.
• Here vi is an eigenvector and λi is the corresponding eigenvalue of A.
• Step 4: Inferring the principal components from the eigenvalues of the covariance matrix. From the Spectral Theorem we infer that the eigenvectors of the covariance matrix, ordered by decreasing eigenvalue, give the principal components; the first principal component is the eigenvector corresponding to the largest eigenvalue λ1.
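A small numpy sketch of this step (toy random data, purely illustrative): compute the covariance matrix, take its eigendecomposition, and read off the principal components in order of decreasing eigenvalue.

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))              # toy data: 200 samples, 3 features

X_centered = X - X.mean(axis=0)
S = np.cov(X_centered, rowvar=False)       # covariance matrix (symmetric)

# eigh is intended for symmetric matrices; it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]      # re-order by decreasing eigenvalue
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

first_pc = eigenvectors[:, 0]              # eigenvector for the largest eigenvalue λ1
X_projected = X_centered @ eigenvectors[:, :2]   # project onto the top-2 principal components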