
Unit IV

Unsupervised Learning Techniques: Clustering, K-Means, Limits of K-Means, Using Clustering for Image Segmentation, Using Clustering for Preprocessing, Using Clustering for Semi-Supervised Learning, DBSCAN, Gaussian Mixtures.

Dimensionality Reduction: The Curse of Dimensionality, Main Approaches for Dimensionality Reduction, PCA, Using Scikit-Learn, Randomized PCA, Kernel PCA.
Unsupervised Learning
Clustering
Introduction to Clustering
Clustering is a type of unsupervised learning method, i.e. a method in which we draw inferences from datasets consisting of input data without labeled responses.

Generally, it is used as a process to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other and dissimilar to the data points in other groups. In other words, it groups objects on the basis of the similarity and dissimilarity between them.

For example, the data points in the figure below that lie close together can be grouped into a single cluster; we can visually identify three such clusters in the picture.

Clusters need not be spherical, as the examples below show:
Why Clustering?

Clustering is important because it determines the intrinsic grouping among unlabelled data. There is no single criterion for a good clustering; it depends on the user and on whatever criteria satisfy their need.

For instance, we could be interested in finding:
 Representatives for homogeneous groups (data “reduction”)
 Natural clusters and descriptions of their unknown properties (“natural” data types)
 Useful and suitable groupings (“useful” data classes)
 Unusual data objects (“outlier” detection).
Clustering Methods - 4 Types

1) Density-Based Methods: Clusters are formed as dense regions of points, separated from other clusters by regions of lower density.
Example –
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
OPTICS (Ordering Points To Identify the Clustering Structure)

2) Hierarchical-Based Methods: The clusters formed in this method have a tree-type structure based on the hierarchy.
2 Types –
Agglomerative (bottom-up approach) & Divisive (top-down approach)
Example –
CURE (Clustering Using Representatives)
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
Clustering Methods - 4 Types

3) Partitioning Methods: These methods partition the objects into k clusters, and each partition forms one cluster.
Example –
K-means
CLARANS (Clustering Large Applications based upon Randomized Search)

4) Grid-Based Methods: In this method, the data space is divided into a finite number of cells that form a grid-like structure.
Example –
STING (Statistical Information Grid)
WaveCluster
CLIQUE (Clustering In QUEst)
Clustering Algorithm:
The K-means clustering algorithm is the simplest unsupervised learning algorithm that solves the clustering problem.

The K-means algorithm partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

Applications of Clustering in different fields:
Marketing, Biology, Insurance, City Planning, Earthquake Studies, Image Processing, Genetics, Finance, Sports Analysis, Crime Analysis, etc.
Types of Clustering
Clustering is a type of unsupervised learning wherein data points
are grouped into different sets based on their degree of similarity.

The various types of clustering are:


1. Hierarchical clustering
2. Partitioning clustering

Hierarchical clustering is further subdivided into:


• Agglomerative clustering
• Divisive clustering

Partitioning clustering is further subdivided into:


• K-Means clustering
• Fuzzy C-Means clustering
1) Hierarchical clustering

Hierarchical clustering is a connectivity-based clustering model that groups together data points that are close to each other based on a measure of similarity or distance.

- The dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram.
- The number of clusters can be selected by cutting the tree at the correct level.

Hierarchical clustering is further subdivided into:
• Agglomerative clustering (bottom-up approach)
• Divisive clustering (top-down approach)
Agglomerative Clustering

 Also known as the bottom-up approach or hierarchical agglomerative clustering (HAC).
 Produces a structure that is more informative than the unstructured set of clusters returned by flat clustering.
 Does not require the number of clusters to be prespecified.
 Treats each data point as a singleton cluster at the outset.
 Clusters are then merged successively until a single cluster containing all the data remains.
Fig: Hierarchical Agglomerative Clustering
Divisive clustering
 It is also known as the top-down approach or hierarchical divisive clustering.
 This algorithm also does not require the number of clusters to be prespecified.
 Top-down clustering requires a method for splitting a cluster that contains the whole data, and proceeds by splitting clusters recursively until individual data points end up in singleton clusters.
Fig: Hierarchical Divisive clustering
Computing the Distance Matrix – to compute the distance between every pair of clusters and merge the closest pair. Common linkage criteria (illustrated in the sketch after the figure) are:
1. Min Distance: the minimum distance between any two points of the clusters.
2. Max Distance: the maximum distance between any two points of the clusters.
3. Group Average: the average distance between every pair of points of the clusters.
4. Ward’s Method: the similarity of two clusters is based on the increase in squared error when the two clusters are merged.
Fig: Distance Matrix Comparison in Hierarchical Clustering
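The four criteria above map directly onto SciPy's linkage methods. The following is a minimal sketch (not from the slides); the toy data X and the choice of 3 clusters are only illustrative assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.RandomState(42).rand(20, 2)   # toy data: 20 points in 2-D

# 'single' = min distance, 'complete' = max distance,
# 'average' = group average, 'ward' = Ward's minimum-variance method
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                     # build the merge tree (dendrogram data)
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(method, labels)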
K-means
According to the formal definition, K-means clustering is
- An iterative algorithm that partitions a group of data containing n values into k subgroups.
- Each of the n values belongs to the cluster with the nearest mean.
- Here k defines the number of pre-defined clusters that need to be created.

This means that, given a group of objects, we partition that group into several sub-groups. These sub-groups are formed on the basis of their similarity and the distance of each data point in the sub-group from the mean (centroid) of that sub-group.

• It is a centroid-based algorithm, where each cluster is associated with a centroid. [A centroid is a data point that represents the center of the cluster (the mean).]

• It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training labels.
K-means clustering is the most popular unsupervised learning algorithm. It is easy to understand and implement.

The objective of K-means clustering is to minimize the total squared Euclidean distance of each point from the centroid of its cluster. This is known as the intra-cluster variance and is minimized by the following squared error function:

J = \sum_{j=1}^{k} \sum_{i=1}^{n} \| x_i^{(j)} - c_j \|^2

where J is the objective function, k is the number of clusters, n is the number of data points, c_j is the centroid of cluster j, and x_i^{(j)} is a data point assigned to cluster j, whose Euclidean distance to that centroid is measured. Let us now look at the algorithm for K-means clustering:

1. First, we randomly select k points; these k points are the initial means (centroids).
2. We use the Euclidean distance to assign each data point to its closest cluster centre.
3. Then we recalculate the mean of all the points in each cluster, which gives the new centroid.
We iteratively repeat steps 2 and 3 until the cluster assignments no longer change. A minimal code sketch follows.
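A minimal sketch of these steps using scikit-learn's KMeans class; the blob data and k = 3 are illustrative.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # toy unlabeled data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)     # k = 3 clusters
labels = kmeans.fit_predict(X)    # runs the initialize/assign/update loop to convergence

print(kmeans.cluster_centers_)    # the final centroids (cluster means)
print(kmeans.inertia_)            # the squared-error objective J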
Applications of K-Means Clustering Algorithm

The K-means algorithm is used in the business sector to identify segments of purchases made by users. It is also used to cluster activities on websites and applications.
It is used as a form of lossy image compression. In image compression, K-means is used to cluster the pixels of an image, which reduces its overall size.
It is also used in document clustering to gather relevant documents in one place.
K-means is used in the fields of insurance and fraud detection. Based on historical data, it is possible to cluster fraudulent practices and claims based on their closeness to clusters that indicate patterns of fraud.
It is also used to classify sounds based on their similar patterns and to isolate deformities in speech.
K-means clustering is used for Call Detail Record (CDR) analysis. It provides in-depth insight into customer requirements based on the call traffic during the time of day and the demographics of the place.

In every case, each data point is assigned to the cluster whose centroid is at the smallest distance.
Advantages
• It is very easy to understand and implement.
• If we have a large number of variables, K-means is faster than hierarchical clustering.
• On re-computation of the centroids, an instance can change its cluster.
• Tighter clusters are formed with K-means as compared to hierarchical clustering.
Disadvantages
• It is difficult to predict the number of clusters, i.e. the value of k.
• The output is strongly impacted by the initial inputs, such as the number of clusters (value of k) and the initial centroids.
• The order of the data can have a strong impact on the final output.
• It is very sensitive to rescaling: if we rescale the data by normalization or standardization, the final output can change completely (see the sketch below).
• It does not do a good clustering job if the clusters have a complicated geometric shape.
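Because of the sensitivity to rescaling noted above, a common remedy is to standardize the features before clustering. A hedged sketch (the toy data is illustrative):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 2) * [1, 1000]   # two features on very different scales

pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=0))
labels = pipeline.fit_predict(X)   # clusters the standardized data, so no single feature dominates the distances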
Image Segmentation By Clustering
Segmentation by Clustering

It is a method of performing pixel-wise image segmentation. In this type of segmentation, we try to cluster together the pixels that belong together. There are two approaches for performing segmentation by clustering:

• Clustering by merging (agglomerative)
• Clustering by division (divisive)
Clustering by merging or Agglomerative Clustering:

In this bottom-up approach, we repeatedly merge the pixels (clusters) that are closest to each other. The algorithm for performing agglomerative clustering is as follows:

• Take each point as a separate cluster.
• For a given number of epochs, or until the clustering is satisfactory:
  • Merge the two clusters with the smallest inter-cluster distance.
• Repeat the above step.

Agglomerative clustering is represented by a dendrogram. It can be performed with 3 methods:
 by selecting the closest pair for merging
 by selecting the farthest pair for merging
 by selecting the pair which is at an average distance (neither closest nor farthest).
The dendrograms representing these three types of clustering are shown below.
Fig: Nearest (closest-pair), farthest (farthest-pair), and average-distance dendrograms
Clustering by division or Divisive splitting

In this top-down approach, we start from one cluster containing all pixels and repeatedly split it. The algorithm for performing divisive clustering is as follows:

• Construct a single cluster containing all points.
• For a given number of epochs, or until the clustering is satisfactory:
  • Split the cluster into the two clusters with the largest inter-cluster distance.
• Repeat the above steps.

K-Means Clustering
K-means clustering is a very popular clustering algorithm which is applied when we have a dataset whose labels are unknown. The goal is to find groups based on some kind of similarity in the data, with the number of groups represented by K. This algorithm is generally used in areas like market segmentation, customer segmentation, etc., but it can also be used to segment different objects in images on the basis of pixel values.

The algorithm for image segmentation works as follows (a code sketch follows the steps):

1. First, we need to select the value of K in K-means clustering.
2. Select a feature vector for every pixel (colour values such as the RGB value, texture, etc.).
3. Define a similarity measure between feature vectors, such as the Euclidean distance, to measure the similarity between any two points/pixels.
4. Apply the K-means algorithm to the feature vectors to obtain cluster centers.
5. Apply a connected-components algorithm.
6. Combine any component of size less than a threshold with an adjacent component that is similar to it, until no more components can be combined.

The steps for applying the K-means clustering algorithm itself are:

• Select K points and assign them one cluster center each.
• Until the cluster centers no longer change, perform the following steps:
  • Allocate each point to the nearest cluster center (ensuring every cluster center gets at least one point).
  • Replace each cluster center with the mean of the points assigned to it.
• End
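A minimal sketch of steps 1–4 above (choosing K, using colour values as the feature vector, clustering with K-means). The file name is hypothetical, and the connected-components post-processing (steps 5–6) is omitted.

from matplotlib import image as mpimg
from sklearn.cluster import KMeans

img = mpimg.imread("example.png")            # hypothetical image, shape (H, W, channels)
pixels = img.reshape(-1, img.shape[-1])      # one colour feature vector per pixel

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(pixels)
segmented = kmeans.cluster_centers_[kmeans.labels_]   # replace each pixel by its centroid colour
segmented_img = segmented.reshape(img.shape)          # reshape back into an image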
Using Clustering for Preprocessing

Clustering algorithms are the largest group of data mining algorithms used for unsupervised learning. Additionally, they are often used as a preprocessing step for supervised algorithms.

Clustering is a type of unsupervised learning that can be used to find patterns in data. Clustering algorithms group data points together in such a way that points within a group are more similar to each other than to points in other groups. This process can be useful as a preprocessing step before applying another machine learning model, because it can help to identify groups or patterns within the data that might not be immediately apparent.
For example, let's say you have a dataset containing customer data,
and you want to build a model to predict which customers are most
likely to make a purchase. Before building the predictive model,
you could use a clustering algorithm to identify patterns or groups
within the data. This might reveal that there are certain
characteristics or attributes that are more common among customers
who are more likely to make a purchase, which you can then use to
build a more accurate predictive model.

In summary, clustering can be useful as a preprocessing step before applying another machine learning model because it can help to identify patterns or groups within the data that can be used to improve the accuracy of the model.
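One concrete (and hedged) way to do this with scikit-learn: KMeans can act as a transformer whose output is the distance of each sample to every centroid, and those distances become the features for a downstream classifier. The digits dataset and k = 50 are illustrative choices.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipeline = make_pipeline(
    KMeans(n_clusters=50, n_init=10, random_state=42),   # preprocessing: 50 distance features
    LogisticRegression(max_iter=5000),                   # supervised model trained on those features
)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))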
Using Clustering for Semi-Supervised Learning

Semi-supervised clustering is a method that partitions unlabeled data by making use of domain knowledge. This knowledge is generally expressed as pairwise constraints between instances or as an additional set of labeled instances.

The quality of unsupervised clustering can be substantially improved using some weak form of supervision, for instance in the form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same or to different clusters). Such a clustering procedure that relies on user feedback or guidance constraints is known as semi-supervised clustering.

There are several methods for semi-supervised clustering, which can be divided into two classes as follows:

Constraint-based semi-supervised clustering − uses user-provided labels or constraints to guide the algorithm toward a more appropriate partitioning of the data. This involves modifying the objective function based on the constraints, or initializing and constraining the clustering process based on the labeled objects.

Distance-based semi-supervised clustering − employs an adaptive distance measure that is trained to satisfy the labels or constraints in the supervised data. Multiple adaptive distance measures have been used, including string-edit distance trained using Expectation-Maximization (EM), and Euclidean distance modified by a shortest-path algorithm.

An interesting clustering method is CLTree (CLustering based on decision TREEs). It integrates unsupervised clustering with the concept of supervised classification and is an instance of constraint-based semi-supervised clustering. It turns a clustering task into a classification task by treating the set of points to be clustered as belonging to one class, labeled “Y,” and inserting a set of relatively uniformly distributed “nonexistence points” with another class label, “N.”

The problem of partitioning the data space into data (dense) regions and empty (sparse) regions can then be treated as a classification problem: the original points form the set of “Y” points, and a collection of uniformly distributed “N” points is added.

The original clustering problem is thus changed into a classification problem, which works out a decision boundary that distinguishes “Y” from “N” points. A decision tree induction method can then be used to partition the space, and the clusters are recognized from the “Y” points only.

However, inserting a large number of “N” points into the original data can introduce unnecessary overhead in the computation. Moreover, it is unlikely that the added points would truly be uniformly distributed in a very high-dimensional space, as this would need an exponential number of points.
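Besides constraint-based methods such as CLTree, a simple clustering-based recipe for semi-supervised learning is to cluster the unlabeled data, hand-label only the instance closest to each centroid, and propagate that label to the whole cluster. A hedged sketch (the dataset and k = 50 are illustrative; here the known labels stand in for the few manual labels):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
k = 50

kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
X_dist = kmeans.fit_transform(X)                  # distance of every sample to every centroid
representative_idx = np.argmin(X_dist, axis=0)    # the sample closest to each centroid

y_representative = y[representative_idx]          # pretend only these k samples are hand-labeled
y_propagated = y_representative[kmeans.labels_]   # propagate each label to its whole cluster

clf = LogisticRegression(max_iter=5000).fit(X, y_propagated)   # train on the propagated labels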
Density-based spatial clustering of applications with
noise (DBSCAN)
DBSCAN is a clustering algorithm that defines clusters as
continuous regions of high density and works well if all the clusters
are dense enough and well separated by low-density regions.
In the case of DBSCAN, instead of guessing the number of clusters, we define two hyperparameters, epsilon and minPoints, to arrive at the clusters.

Epsilon (ε): a distance measure that is used to locate the points in the neighbourhood of any point, i.e. to check the density around it.

minPoints (n): the minimum number of points (a threshold) clustered together for a region to be considered dense.
In this algorithm, we have 3 types of data points:

Core point: a point is a core point if it has at least MinPts points within distance eps.

Border point: a point which has fewer than MinPts points within eps, but which lies in the neighbourhood of a core point.

Noise or outlier: a point which is neither a core point nor a border point.
The DBSCAN algorithm can be abstracted in the following steps (a code sketch follows):

Find all the neighbouring points within eps of each point and identify the core points, i.e. those with at least MinPts neighbours.

For each core point, if it is not already assigned to a cluster, create a new cluster.

Recursively find all of its density-connected points and assign them to the same cluster as the core point.

Two points a and b are said to be density-connected if there exists a point c that has a sufficient number of points in its neighbourhood and both a and b are within distance eps of it. This is a chaining process: if b is a neighbour of c, c is a neighbour of d, and d is a neighbour of e, which in turn is a neighbour of a, then b is connected to a.

Iterate through the remaining unvisited points in the dataset. Those points that do not belong to any cluster are noise.
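A minimal scikit-learn sketch of the two hyperparameters described above, applied to toy "moons" data (non-spherical clusters); the eps and min_samples values are illustrative.

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

dbscan = DBSCAN(eps=0.2, min_samples=5)    # eps = epsilon, min_samples = minPoints
labels = dbscan.fit_predict(X)             # label -1 marks noise/outlier points

print(dbscan.core_sample_indices_[:10])    # indices of some of the core points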
Gaussian Mixtures
A Gaussian mixture is a function composed of several Gaussians, each identified by k ∈ {1, …, K}, where K is the number of clusters in our dataset. Each Gaussian k in the mixture has the following parameters:

A mean μ that defines its centre.

A covariance Σ that defines its width. This would be equivalent to the dimensions of an ellipsoid in a multivariate scenario.

A mixing probability π that defines how big or small the Gaussian function will be.
Let us now illustrate these parameters graphically:
Here, we can see that there are three Gaussian functions, hence K = 3. Each Gaussian explains the data contained in one of the three available clusters. The mixing coefficients are themselves probabilities and must meet this condition:

\sum_{k=1}^{K} \pi_k = 1

Now how do we determine the optimal values for these parameters? To achieve this we must ensure that each Gaussian fits the data points belonging to its cluster. This is exactly what maximum likelihood does.

In general, the Gaussian density function is given by:

\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)

where x is a D-dimensional data point, μ is the mean, and Σ is the covariance matrix.
The following are three different steps in using Gaussian mixture models (a code sketch follows):

Determining a covariance matrix that defines how each Gaussian is related to the others. The more similar two Gaussians are, the closer their means will be, and vice versa if they are far apart in terms of similarity. A Gaussian mixture model can have a covariance matrix that is diagonal or full (symmetric).

Determining the number of Gaussians, which defines how many clusters there are.

Selecting the hyperparameters that define how to optimally separate the data using Gaussian mixture models, as well as deciding whether each Gaussian's covariance matrix is diagonal or full.
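A hedged sketch of these choices using scikit-learn's GaussianMixture: n_components sets the number of Gaussians and covariance_type constrains each covariance matrix ('full', 'tied', 'diag' or 'spherical'); the blob data is illustrative.

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
gmm.fit(X)

print(gmm.means_)              # the mean (mu) of each Gaussian
print(gmm.weights_)            # the mixing probabilities (pi), which sum to 1
labels = gmm.predict(X)        # hard cluster assignments
probs = gmm.predict_proba(X)   # soft (probabilistic) assignments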
Here are some real-world problems which can be solved using Gaussian mixture models:
Finding patterns in medical datasets: GMMs can be used for segmenting images into
multiple categories based on their content or finding specific patterns in medical datasets.
They can be used to find clusters of patients with similar symptoms, identify disease
subtypes, and even predict outcomes. In one recent study, a Gaussian mixture model was
used to analyze a dataset of over 700,000 patient records. The model was able to identify
previously unknown patterns in the data, which could lead to better treatment for
patients with cancer.
Modeling natural phenomena: GMMs can be used to model natural phenomena in which the noise has been found to follow Gaussian distributions.
Customer behavior analysis: GMMs can be used for performing customer behavior
analysis in marketing to make predictions about future purchases based on historical
data.
Stock price prediction: Another area Gaussian mixture models are used is in finance
where they can be applied to a stock’s price time series. GMMs can be used to detect
changepoints in time series data and help find turning points of stock prices or other
market movements that are otherwise difficult to spot due to volatility and noise.
Gene expression data analysis: Gaussian mixture models can be used for gene expression
data analysis. In particular, GMMs can be used to detect differentially expressed genes
between two conditions and identify which genes might contribute toward a certain
phenotype or disease state.
Dimensionality Reduction: The Curse of Dimensionality, Main Approaches for Dimensionality Reduction, PCA, Using Scikit-Learn, Randomized PCA, Kernel PCA.
Dimensionality reduction is the process of reducing the number of features (or
dimensions) in a dataset while retaining as much information as possible. This can
be done for a variety of reasons, such as to reduce the complexity of a model, to
improve the performance of a learning algorithm, or to make it easier to visualize
the data.
What is Dimensionality Reduction?

In machine learning classification problems, there are often too many factors on the
basis of which the final classification is done. These factors are basically variables
called features. The higher the number of features, the harder it gets to visualize the
training set and then work on it. Sometimes, most of these features are correlated,
and hence redundant.

This is where dimensionality reduction algorithms come into play. Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
Components of Dimensionality Reduction

There are two components of dimensionality reduction:

Feature selection: Here we try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem. It usually involves three approaches:
Filter
Wrapper
Embedded

Feature extraction: This reduces the data from a high-dimensional space to a lower-dimensional space, i.e. a space with fewer dimensions.
The Curse of Dimensionality
The curse of dimensionality refers to the difficulties a machine learning algorithm faces when working with data in high dimensions that did not exist in lower dimensions. This happens because when you add dimensions (features), the minimum data requirements also increase rapidly.

This means that as the number of features (columns) increases, you need an exponentially growing number of samples (rows) to have all combinations of feature values well represented in the sample.

With the increase in data dimensions, your model:
• would also increase in complexity;
• would become increasingly dependent on the data it is being trained on.

This leads to overfitting of the model, so even though the model performs really well on the training data, it fails drastically on real data.
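A small numerical illustration (not from the slides, assuming NumPy and SciPy) of the sparsity effect described above: with a fixed number of random samples, the average pairwise distance keeps growing as dimensions are added.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(42)
for d in (2, 10, 100, 1000):
    X = rng.rand(500, d)            # 500 samples in a d-dimensional unit cube
    print(d, pdist(X).mean())       # mean pairwise Euclidean distance grows with d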
Methods of Dimensionality Reduction

The various methods used for dimensionality reduction include:

Principal Component Analysis (PCA)

Linear Discriminant Analysis (LDA)

Generalized Discriminant Analysis (GDA)


What is Principal Component Analysis?

Principal Component Analysis is a popular unsupervised learning technique for reducing the dimensionality of data. It increases interpretability while, at the same time, minimizing information loss. It helps to find the most significant features in a dataset and makes the data easy to plot in 2D and 3D. PCA works by finding a sequence of linear combinations of the variables (a code sketch follows).
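A minimal sketch of PCA in scikit-learn; the iris dataset and the choice of 2 components are only illustrative.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)              # keep the 2 most significant components
X_2d = pca.fit_transform(X)            # project the 4-D data down to 2-D for plotting

print(pca.explained_variance_ratio_)   # fraction of the variance retained by each component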
Randomized PCA, Kernel PCA
Randomized PCA:
This is an extension of PCA which uses an approximate Singular Value Decomposition (SVD) of the data.

Conventional PCA works in
O(n·p²) + O(p³)
where n is the number of data points and p is the number of features,

whereas the randomized version works in
O(n·d²) + O(d³)
where d is the number of principal components.

Thus, it is blazingly fast when d is much smaller than p.

Sklearn provides a function randomized_svd in sklearn.utils.extmath which can be used to perform randomized PCA. This function returns three arrays: a truncated U, the singular values Sigma, and VT, containing only the requested number of components.

Another way is to use sklearn.decomposition.PCA and change the svd_solver hyperparameter from 'auto' to 'randomized' (or to 'full' to force the exact SVD). With the default 'auto' setting, Scikit-learn automatically uses randomized PCA if either the number of samples or the number of features exceeds 500 and the number of principal components is less than 80% of the smaller of the two.
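A hedged sketch of both routes just mentioned: the low-level randomized_svd helper and PCA with svd_solver='randomized'. The matrix size and number of components are illustrative.

import numpy as np
from sklearn.utils.extmath import randomized_svd
from sklearn.decomposition import PCA

X = np.random.RandomState(42).rand(1000, 600)   # toy data: n = 1000 samples, p = 600 features

# Route 1: approximate truncated SVD directly.
U, Sigma, VT = randomized_svd(X, n_components=10, random_state=42)

# Route 2: let PCA use the randomized solver.
rnd_pca = PCA(n_components=10, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X)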
Randomized PCA is a variation of Principal Component Analysis
(PCA) that is designed to approximate the first k principal
components of a large dataset efficiently.

Instead of computing the eigenvectors of the covariance matrix of the data, as is done in traditional PCA, randomized PCA uses a random projection matrix to map the data to a lower-dimensional subspace.

The first k principal components of the data can then be approximated by computing the eigenvectors of the covariance matrix of the projected data.
Randomized PCA has several advantages over
traditional PCA:

 Scalability: Randomized PCA can handle large datasets that do not fit into memory with traditional PCA.
 Speed: Randomized PCA is much faster than traditional PCA for large datasets, making it more suitable for real-time applications.
 Sparsity: Randomized PCA is able to handle sparse datasets, which traditional PCA does not handle well.
 Low-rank approximation: Randomized PCA can be used to obtain a low-rank approximation of a large dataset, which can then be used for further analysis or visualization.
PCA linearly transforms the original inputs into new, uncorrelated features. Kernel PCA (KPCA) is a nonlinear extension of PCA developed using the kernel method.

Kernel PCA:
Kernel PCA is yet another extension of PCA, using a kernel. The kernel is a mathematical technique by which we can implicitly map instances into a very high-dimensional space called the feature space, enabling non-linear classification and regression as in Support Vector Machines (SVMs). Kernel PCA is often employed for novelty detection and image de-noising. Scikit-Learn provides a class KernelPCA in sklearn.decomposition which can be used to perform Kernel PCA (see the sketch below).
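A minimal sketch using scikit-learn's KernelPCA class; the RBF kernel and gamma value are illustrative choices, not prescribed by the slides.

from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)     # nonlinear projection via the kernel trick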
Applications of PCA in Machine Learning

PCA is used to visualize multidimensional data.


It is used to reduce the number of dimensions in healthcare data.
PCA can help resize an image.
It can be used in finance to analyze stock data and forecast
returns.
PCA helps to find patterns in the high-dimensional datasets.
Using Scikit-Learn

Scikit-learn (sklearn) is one of the most useful and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python.

This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.
What Can We Achieve Using Python Scikit-Learn?

For the most part, users accomplish three primary tasks with scikit-
learn:
1. Classification
Identifying which category an object belongs to.
Application: Spam detection

2. Regression
Predicting a continuous variable based on relevant independent
variables.
Application: Stock price predictions

3. Clustering
Automatic grouping of similar objects into different clusters.
Application: Customer segmentation
THE END
