
Subject: MACHINE LEARNING UNIT-IV


Faculty: G. Umamaheswari Book Reference:T1

Introduction: Problem Definition & Clustering Overview

Clustering is the task of dividing a population or set of data points into a number of groups such that data points in the same group are more similar to one another than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

Example: Suppose you are the head of a rental store and wish to understand the preferences of your customers to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups. This is what we call clustering.

Clustering algorithms are a part of unsupervised machine learning algorithms. Why unsupervised? Because the target variable is not present. The model is trained on the given input variables and attempts to discover intrinsic groups (or clusters).
What Is Good Clustering?

A good clustering method will produce high quality clusters with

o high intra-class similarity

o low inter-class similarity

 The quality of a clustering result depends on both the similarity measure used by the
method and its implementation.

 The quality of a clustering method is also measured by its ability to discover some or all of
the hidden patterns.

Applications of Cluster Analysis

 Clustering analysis is broadly used in many applications such as market research, pattern
recognition, data analysis, and image processing.

 Clustering can also help marketers discover distinct groups in their customer base and characterize those groups based on their purchasing patterns.

 In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes
with similar functionalities and gain insight into structures inherent to populations.

 Clustering also helps in identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to house type,
value, and geographic location.

 Clustering also helps in classifying documents on the web for information discovery.

 Clustering is also used in outlier detection applications such as detection of credit card fraud.

 As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of
data to observe characteristics of each cluster.

Evaluation of Clustering algorithms

Clustering is an unsupervised machine learning task that groups data points into clusters. Validating a clustering algorithm is trickier than validating a supervised machine learning algorithm because the clustering process does not have ground-truth labels.
How Clustering can be evaluated?

Three important factors by which clustering can be evaluated are

(a) Clustering tendency (b) Number of clusters, k (c) Clustering quality

a) Clustering tendency

Before evaluating clustering performance, it is very important to make sure that the data set we are working with has clustering tendency and does not consist of uniformly distributed points. If the data does not have clustering tendency, the clusters identified by even state-of-the-art clustering algorithms may be irrelevant. A non-uniform distribution of points in the data set is therefore important for clustering.

To check this, the Hopkins test, a statistical test for spatial randomness of a variable, can be used to measure the probability that the data points were generated by a uniform distribution.

Null Hypothesis (H0): The data points are generated by a uniform distribution (implying no meaningful clusters).

Alternate Hypothesis (Ha): The data points are not generated by a uniform distribution (implying the presence of clusters).

If the Hopkins statistic H > 0.5, the null hypothesis can be rejected and it is very likely that the data contains clusters. If H is closer to 0, the data set does not have clustering tendency.
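A minimal sketch of the Hopkins statistic is shown below, assuming the common convention where H is the ratio of nearest-neighbour distances of synthetic uniform points to the total; under this convention values near 0.5 suggest uniform data and values well above 0.5 suggest clustering tendency. The function name and sampling fraction are illustrative choices, not part of the notes.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, sample_frac=0.1, random_state=0):
    """Rough Hopkins statistic: H = sum(u) / (sum(u) + sum(w)),
    where u = nearest-neighbour distances from uniformly sampled points to X,
    and   w = nearest-neighbour distances from sampled real points to the rest of X."""
    X = np.asarray(X)
    rng = np.random.default_rng(random_state)
    n, d = X.shape
    m = max(1, int(sample_frac * n))

    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # u: distances from synthetic uniform points to their nearest real point
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()

    # w: distances from sampled real points to their nearest *other* real point
    sample = X[rng.choice(n, m, replace=False)]
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]

    return u.sum() / (u.sum() + w.sum())
```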

b) Number of Optimal Clusters, k

Some clustering algorithms, such as K-means, require the number of clusters, k, as a parameter. Choosing the optimal number of clusters is very significant in the analysis. If k is too high, each point starts to represent its own cluster; if k is too low, data points are incorrectly lumped together. Finding the optimal number of clusters gives the right granularity in clustering.

There is no definitive answer for finding the right number of clusters, as it depends on (a) the shape of the distribution, (b) the scale of the data set, and (c) the clustering resolution required by the user; finding the number of clusters is therefore a fairly subjective problem. There are two major approaches to finding the optimal number of clusters:

(1) Domain knowledge
(2) Data-driven approach

Domain knowledge — Domain knowledge might give some prior idea of the number of clusters.
Data-driven approach — If domain knowledge is not available, mathematical methods help in finding the right number of clusters.

c) Clustering quality

Once clustering is done, how well it has performed can be quantified by a number of metrics. Ideal clustering is characterized by minimal intra-cluster distance and maximal inter-cluster distance.
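As a hedged illustration, the silhouette score (one such metric, available in scikit-learn) compares intra-cluster cohesion with inter-cluster separation; the synthetic data and the choice of three clusters below are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy data: three well-separated blobs in 2-D
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ((0, 0), (5, 5), (0, 5))])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Silhouette ranges from -1 to +1; values near +1 indicate compact, well-separated clusters
print("silhouette score:", silhouette_score(X, labels))
```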

Types of Clustering

Broadly speaking, clustering can be divided into two subgroups :

 Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not. For example, in the above example each customer is put into exactly one of the 10 groups.
 Soft Clustering: In soft clustering, instead of putting each data point into a single cluster, a probability or likelihood of that data point belonging to each cluster is assigned. For example, in the above scenario each customer is assigned a probability of being in each of the 10 clusters of the retail store.
Types of clustering algorithms

Since the task of clustering is subjective, there are many means that can be used to achieve this goal. Every methodology follows a different set of rules for defining the 'similarity' among

data points. In fact, more than 100 clustering algorithms are known, but only a few are used widely. Let's look at them in detail:

 Connectivity models: As the name suggests, these models are based on the notion that data points closer together in data space exhibit more similarity to each other than data points lying farther apart. These models can follow two approaches. In the first approach, they start by classifying all data points into separate clusters and then aggregate them as the distance decreases. In the second approach, all data points are classified as a single cluster and then partitioned as the distance increases. The choice of distance function is also subjective. These models are very easy to interpret but lack scalability for handling big datasets. Examples of these models are the hierarchical clustering algorithm and its variants.
 Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid of the clusters. The K-Means clustering algorithm is a popular algorithm that falls into this category. In these models, the number of clusters required at the end has to be specified beforehand, which makes it important to have prior knowledge of the dataset. These models run iteratively to find a local optimum.
 Distribution models: These clustering models are based on the notion of how probable it is that all data points in a cluster belong to the same distribution (for example, Normal or Gaussian). These models often suffer from overfitting. A popular example of these models is the Expectation-Maximization algorithm, which uses multivariate normal distributions.
 Density models: These models search the data space for regions of varied density of data points. They isolate the different density regions and assign the data points within each region to the same cluster. Popular examples of density models are DBSCAN and OPTICS.
Clustering Methods

Clustering methods can be classified into the following categories −

 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method
Partitioning Clustering:

Suppose we are given a database of 'n' objects; the partitioning method constructs 'k' partitions of the data. Each partition represents a cluster, and k ≤ n. That is, it classifies the data into k groups, which satisfy the following requirements −

 Each group contains at least one object.

 Each object must belong to exactly one group.

 For a given number of partitions (say k), the partitioning method creates an initial partitioning.
 Then it uses an iterative relocation technique to improve the partitioning by moving objects from one group to another.
Hierarchical Methods

This method creates a hierarchical decomposition of the given set of data objects. We can classify
hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two
approaches here −
 Agglomerative Approach
 Divisive Approach
Agglomerative Approach

This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. It keeps merging the objects or groups that are close to one another, and keeps doing so until all of the groups are merged into one or until the termination condition holds.

Divisive Approach

This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster. In each iteration, a cluster is split into smaller clusters. This is done until each object is in its own cluster or the termination condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.

Approaches to Improve Quality of Hierarchical Clustering

Here are the two approaches that are used to improve the quality of hierarchical clustering −

 Perform careful analysis of object linkages at each hierarchical partitioning.


 Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to
group objects into micro-clusters, and then performing macro-clustering on the micro-clusters.
Density-based Method

This method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.

Grid-based Method

In this, the objects together form a grid. The object space is quantized into finite number of cells that
form a grid structure.

Advantages

 The major advantage of this method is fast processing time.


 It is dependent only on the number of cells in each dimension in the quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a given model.
This method locates the clusters by clustering the density function. It reflects spatial distribution of the
data points.

This method also provides a way to automatically determine the number of clusters based on standard
statistics, taking outlier or noise into account. It therefore yields robust clustering methods.

Constraint-based Method

In this method, the clustering is performed by the incorporation of user or application-oriented


constraints. A constraint refers to the user expectation or the properties of desired clustering results.
Constraints provide us with an interactive way of communication with the clustering process.
Constraints can be specified by the user or the application requirement.

K Means Clustering:

This technique partitions the data set into distinct homogeneous clusters whose observations are similar to each other but different from those in other clusters. The resulting clusters remain mutually exclusive, i.e., non-overlapping.

In this technique, "K" refers to the number of clusters into which we wish to partition the data. Every cluster has a centroid. The name "k-means" is derived from the fact that cluster centroids are computed as the mean of the observations assigned to each cluster.

The value of k is given by the user. It is a hard clustering technique, i.e., one observation gets classified into exactly one cluster.

Technically, the k-means technique tries to classify the observations into K clusters such that the total within-cluster variation is as small as possible. Let C denote a cluster and k denote the number of clusters.

K-means is an iterative clustering algorithm that aims to find a local optimum.

Working of K-Means Algorithm

The following stages will help us understand how the K-Means clustering technique works-

 Step 1: First, we need to provide the number of clusters, K, that need to be generated by this algorithm.

 Step 2: Next, choose K data points at random and assign each data point to a cluster; in other words, partition the data into K groups.
 Step 3: The cluster centroids will now be computed.

 Step 4: Iterate the steps below until we find the ideal centroids, i.e., until the assignment of data points to clusters no longer changes.

 4.1 Compute the sum of squared distances between the data points and the centroids.

 4.2 Assign each data point to the cluster whose centroid is closest to it.

 4.3 Finally, recompute the centroid of each cluster by averaging all of the cluster's data points.

This algorithm works in these steps:

1. Specify the desired number of clusters K: Let us choose k=2 for these 5 data points in 2-D space.

2. Randomly assign each data point to a cluster: Let's assign three points to cluster 1, shown in red, and two points to cluster 2, shown in grey.
3. Compute cluster centroids: The centroid of the data points in the red cluster is shown using a red cross and that of the grey cluster using a grey cross.

4. Re-assign each point to the closest cluster centroid: Note that only the data point at the bottom is assigned to the red cluster even though it is closer to the centroid of the grey cluster. Thus, we re-assign that data point to the grey cluster.
5. Re-compute cluster centroids: Now, we re-compute the centroids for both clusters.

6. Repeat steps 4 and 5 until no improvements are possible: We repeat the 4th and 5th steps until convergence, i.e., until there is no further switching of data points between the two clusters for two successive repeats. This marks the termination of the algorithm if a stopping criterion is not explicitly given.
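The steps above can be sketched in a few lines of NumPy; this is a minimal illustration rather than a production implementation, and the random initialization, tolerance, and sample data are arbitrary choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: pick random initial centroids, then alternate
    (1) assign each point to its nearest centroid and
    (2) recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # distances of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # keep the old centroid if a cluster happens to become empty
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids

# Example with 5 points in 2-D and k = 2, as in the illustration above
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [4.5, 5.0]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```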
Distance Calculation for Clustering

There are some important things you should keep in mind:

1. With quantitative variables, distance calculations are highly influenced by variable units
and magnitude. For example, clustering variable height (in feet) with salary (in rupees) having
different units and distribution (skewed) will invariably return biased results. Hence, always
make sure to standardize (mean = 0, sd = 1) the variables. Standardization results in unit-less
variables.
2. Use of a particular distance measure depends on the variable types; i.e., the formula for calculating the distance between numerical variables is different from that for categorical variables.
 Distance can be calculated using many distance functions as listed below:
o Euclidean Distance
o Manhattan Distance
o Hamming Distance
o Cosine Similarity

Let's look at the most widely used measure, Euclidean distance, which we use in the K-Means algorithm:
Euclidean Distance: d(a, b) = √((x2 – x1)² + (y2 – y1)²), for two points a = (x1, y1) and b = (x2, y2).
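A brief NumPy sketch of the four distance/similarity measures listed above; the two sample vectors are placeholders.

```python
import numpy as np

a = np.array([2.0, 10.0])
b = np.array([5.0, 8.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))      # straight-line distance
manhattan = np.sum(np.abs(a - b))              # sum of absolute differences
hamming = np.sum(a != b)                       # count of mismatching positions (for categorical/binary data)
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based similarity

print(euclidean, manhattan, hamming, cosine_sim)
```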

Elbow Method
The Elbow method is one of the most popular ways to find the optimal
number of clusters. This method uses the concept of WCSS value. WCSS
stands for Within Cluster Sum of Squares, which defines the total
variations within a cluster. The formula to calculate the value of WCSS (for 3
clusters) is given below:

WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²


In the above formula of WCSS,

∑Pi in Cluster1 distance(Pi, C1)² is the sum of the squares of the distances between each data point in Cluster 1 and its centroid C1, and similarly for the other two terms.

To measure the distance between data points and the centroid, we can use any distance method, such as Euclidean or Manhattan distance. To apply the Elbow method, WCSS is computed for a range of values of k and plotted against k; the value of k at which the curve bends sharply (the "elbow") is taken as the optimal number of clusters.
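A hedged sketch of the Elbow method with scikit-learn: KMeans exposes the final WCSS as the inertia_ attribute. The data and the range of k values below are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.7, size=(60, 2)) for c in ((0, 0), (6, 0), (3, 5))])

wcss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the within-cluster sum of squares

# Plotting wcss against k (e.g. with matplotlib) shows a sharp bend at k = 3 for this data
for k, w in zip(range(1, 10), wcss):
    print(k, round(w, 1))
```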

Example:

Problem-01:

Cluster the following eight points (with (x, y) representing locations) into three

clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2),

A8(4, 9)

Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).

The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-

Ρ(a, b) = |x2 – x1| + |y2 – y1|

Use K-Means Algorithm to find the three cluster centers after the second iteration.

Solution-

We follow the above discussed K-Means Clustering Algorithm-

Iteration-01:

 We calculate the distance of each point from each of the centers of the three clusters.
 The distance is calculated using the given distance function.

The following illustration shows the calculation of the distance between point A1(2, 10) and each of the centers of the three clusters-

Calculating Distance Between A1(2, 10) and C1(2, 10)-

Ρ(A1, C1)

= |x2 – x1| + |y2 – y1|

= |2 – 2| + |10 – 10|

=0

Calculating Distance Between A1(2, 10) and C2(5, 8)-

Ρ(A1, C2)

= |x2 – x1| + |y2 – y1|

= |5 – 2| + |8 – 10|

=3+2

=5

Calculating Distance Between A1(2, 10) and C3(1, 2)-

Ρ(A1, C3)

= |x2 – x1| + |y2 – y1|

= |1 – 2| + |2 – 10|

=1+8

=9

In a similar manner, we calculate the distance of the other points from each of the centers of the three clusters.

Next,

 We draw a table showing all the results.


 Using the table, we decide which point belongs to which cluster.
 The given point belongs to that cluster whose center is nearest to it.
Given Points | Distance from center (2, 10) of Cluster-01 | Distance from center (5, 8) of Cluster-02 | Distance from center (1, 2) of Cluster-03 | Point belongs to Cluster
A1(2, 10) | 0 | 5 | 9 | C1
A2(2, 5) | 5 | 6 | 4 | C3
A3(8, 4) | 12 | 7 | 9 | C2
A4(5, 8) | 5 | 0 | 10 | C2
A5(7, 5) | 10 | 5 | 9 | C2
A6(6, 4) | 10 | 5 | 7 | C2
A7(1, 2) | 9 | 10 | 0 | C3
A8(4, 9) | 3 | 2 | 10 | C2

From here, New clusters are-

Cluster-01:

First cluster contains points-

 A1(2, 10)

Cluster-02:

Second cluster contains points-

 A3(8, 4)
 A4(5, 8)
 A5(7, 5)
 A6(6, 4)
 A8(4, 9)

Cluster-03:

Third cluster contains points-

 A2(2, 5)
 A7(1, 2)

Now,

 We re-compute the new cluster centers.


 The new cluster center is computed by taking mean of all the points contained in that cluster.

For Cluster-01:

 We have only one point A1(2, 10) in Cluster-01.


 So, cluster center remains the same.

For Cluster-02:

Center of Cluster-02

= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)

= (6, 6)

For Cluster-03:

Center of Cluster-03

= ((2 + 1)/2, (5 + 2)/2)

= (1.5, 3.5)

This completes Iteration-01.


Iteration-02:

 We calculate the distance of each point from each of the centers of the three clusters.
 The distance is calculated using the given distance function.

The following illustration shows the calculation of the distance between point A1(2, 10) and each of the centers of the three clusters-

Calculating Distance Between A1(2, 10) and C1(2, 10)-

Ρ(A1, C1)

= |x2 – x1| + |y2 – y1|

= |2 – 2| + |10 – 10|

=0

Calculating Distance Between A1(2, 10) and C2(6, 6)-

Ρ(A1, C2)

= |x2 – x1| + |y2 – y1|

= |6 – 2| + |6 – 10|

=4+4

=8

Calculating Distance Between A1(2, 10) and C3(1.5, 3.5)-

Ρ(A1, C3)

= |x2 – x1| + |y2 – y1|

= |1.5 – 2| + |3.5 – 10|

= 0.5 + 6.5

=7

In a similar manner, we calculate the distance of the other points from each of the centers of the three clusters.

Next,
 We draw a table showing all the results.
 Using the table, we decide which point belongs to which cluster.
 The given point belongs to that cluster whose center is nearest to it.

Given Points | Distance from center (2, 10) of Cluster-01 | Distance from center (6, 6) of Cluster-02 | Distance from center (1.5, 3.5) of Cluster-03 | Point belongs to Cluster
A1(2, 10) | 0 | 8 | 7 | C1
A2(2, 5) | 5 | 5 | 2 | C3
A3(8, 4) | 12 | 4 | 7 | C2
A4(5, 8) | 5 | 3 | 8 | C2
A5(7, 5) | 10 | 2 | 7 | C2
A6(6, 4) | 10 | 2 | 5 | C2
A7(1, 2) | 9 | 9 | 2 | C3
A8(4, 9) | 3 | 5 | 8 | C1

From here, New clusters are-

Cluster-01:

First cluster contains points-

 A1(2, 10)
 A8(4, 9)
Cluster-02:

Second cluster contains points-

 A3(8, 4)
 A4(5, 8)
 A5(7, 5)
 A6(6, 4)

Cluster-03:

Third cluster contains points-

 A2(2, 5)
 A7(1, 2)

Now,

 We re-compute the new cluster centers.


 The new cluster center is computed by taking mean of all the points contained in that cluster.

For Cluster-01:

Center of Cluster-01

= ((2 + 4)/2, (10 + 9)/2)

= (3, 9.5)

For Cluster-02:

Center of Cluster-02

= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)

= (6.5, 5.25)

For Cluster-03:

Center of Cluster-03

= ((2 + 1)/2, (5 + 2)/2)


= (1.5, 3.5)

This completes Iteration-02.

After the second iteration, the centers of the three clusters are-

 C1(3, 9.5)
 C2(6.5, 5.25)
 C3(1.5, 3.5)
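As a sanity check, the two iterations above can be reproduced with a short NumPy sketch that follows the same procedure as the worked example (Manhattan distance for assignment, plain means for the center update); this mirrors the textbook exercise rather than standard Euclidean K-means.

```python
import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)  # initial centers A1, A4, A7

for iteration in range(2):
    # Manhattan (city-block) distance of every point to every center
    dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    # New center = mean of the points assigned to the cluster
    centers = np.array([points[labels == j].mean(axis=0) for j in range(len(centers))])
    print(f"after iteration {iteration + 1}:", centers)

# Expected final centers after the second iteration: (3, 9.5), (6.5, 5.25), (1.5, 3.5)
```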

K-Modes clustering
K-Modes clustering is one of the unsupervised Machine Learning algorithms that is used to
cluster categorical variables.
You might be wondering: why K-Modes clustering when we already have K-Means?
K-Means uses a mathematical distance measure to cluster continuous data; the smaller the distance, the more similar the data points are, and the centroids are updated using means.
But for categorical data points, we cannot calculate such distances.
So we use the K-Modes algorithm. It uses the dissimilarities (total mismatches) between data points: the fewer the dissimilarities, the more similar the data points are. It uses modes instead of means.
How does the K-Modes algorithm work?
Unlike hierarchical clustering methods, we need to specify K upfront.
1. Pick K observations at random and use them as leaders/clusters
2. Calculate the dissimilarities and assign each observation to its closest cluster
3. Define new modes for the clusters
4. Repeat steps 2–3 until no re-assignment is required
Let us quickly take an example to illustrate the working step by step.
Example: Imagine we have a dataset with information about the hair color, eye color, and skin color of a group of people. We aim to group them based on the available information.
Hair color, eye color, and skin color are all categorical variables.
Below is how our dataset looks.
Alright, we have the sample data now. Let us proceed by defining the number of clusters (K) = 3.
Step 1: Pick K observations at random and use them as leaders/clusters
I am choosing P1, P7, P8 as the leaders/clusters
Leaders and Observations

Step 2: Calculate the dissimilarities(no. of mismatches) and assign each observation to its closest cluster
Iteratively compare the cluster data points to each of the observations. Similar data points give 0,
dissimilar data points give 1.
Comparing leader/Cluster P1 to the observation P1 gives 0 dissimilarities.
Comparing leader/cluster P1 to the observation P2 gives 3(1+1+1) dissimilarities.
Likewise, calculate all the dissimilarities and put them in a matrix as shown below and assign the
observations to their closest cluster(cluster that has the least dissimilarity)

Dissimilarity matrix


After step 2, the observations P1, P2, P5 are assigned to Cluster 1; P3, P7 are assigned to Cluster 2; and P4, P6, P8 are assigned to Cluster 3.
Note: If all the clusters have the same dissimilarity with an observation, assign it to any cluster randomly. In our case, the observation P2 has 3 dissimilarities with all the leaders, so I randomly assigned it to Cluster 1.
Step 3: Define new modes for the clusters
Mode is simply the most observed value.
Mark the observations according to the cluster they belong to. Observations of Cluster 1 are marked in
Yellow, Cluster 2 are marked in Brick red, and Cluster 3 are marked in Purple.

Looking for Modes


Considering one cluster at a time, for each feature, look for the mode and update the new leaders.
Explanation: The Cluster 1 observations (P1, P2, P5) have brunette as the most observed hair color, amber as the most observed eye color, and fair as the most observed skin color.
Note: If you observe the same number of occurrences of two values, take the mode randomly. In our case, the observations of Cluster 2 (P3, P7) have one occurrence each of brown and fair skin color; I randomly chose brown as the mode. Below are our new leaders after the update.

Obtained new leaders


Repeat steps 2–3
After obtaining the new leaders, again calculate the dissimilarities between the observations and the newly obtained leaders.
Comparing Cluster 1 to the observation P1 gives 1 dissimilarity.
Comparing Cluster 1 to the observation P2 gives 2 dissimilarities.
Likewise, calculate all the dissimilarities and put them in a matrix. Assign each observation to its closest cluster.

The observations P1, P2, P5 are assigned to Cluster 1; P3, P7 are assigned to Cluster 2; and P4, P6, P8 are
assigned to Cluster 3.
We stop here as we see there is no change in the assignment of observations.
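A minimal NumPy sketch of the K-Modes procedure described above; the toy categorical data, leader choices, and tie-breaking are illustrative assumptions, not the exact table from the example.

```python
import numpy as np

def kmodes(X, k, n_iter=10, seed=0):
    """Plain K-Modes: dissimilarity = number of mismatching features,
    leaders are updated to the per-feature mode of each cluster."""
    rng = np.random.default_rng(seed)
    leaders = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(n_iter):
        # count of mismatches between every observation and every leader
        dissim = (X[:, None, :] != leaders[None, :, :]).sum(axis=2)
        labels = dissim.argmin(axis=1)
        new_leaders = leaders.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                # per-feature mode (most frequent category)
                new_leaders[j] = [max(set(col), key=list(col).count) for col in members.T]
        if np.array_equal(new_leaders, leaders):
            break  # no leader changed, assignments are stable
        leaders = new_leaders
    return labels, leaders

# Toy categorical data: hair color, eye color, skin color
X = np.array([
    ["blonde", "amber", "fair"], ["brunette", "gray", "brown"],
    ["red", "green", "brown"], ["black", "hazel", "brown"],
    ["brunette", "amber", "fair"], ["black", "gray", "brown"],
    ["red", "green", "fair"], ["black", "hazel", "fair"],
])
labels, leaders = kmodes(X, k=3)
print(labels)
print(leaders)
```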
Hierarchical Clustering

In simple words, hierarchical clustering tries to create a sequence of nested clusters to explore deeper
insights from the data. For example, this technique is being popularly used to explore the standard plant
taxonomy which would classify plants by family, genus, species, and so on.

Hierarchical clustering technique is of two types:

1. Agglomerative Clustering – It starts by treating every observation as a cluster. Then, it merges the most similar observations into a new cluster. This process continues until all the observations are merged into one cluster. It uses a bottom-up approach (think of an inverted tree).

Computing Distance Matrix:

While merging two clusters, we check the distance between every pair of clusters and merge the pair with the least distance/most similarity. But the question is how that distance is determined. There are different ways of defining inter-cluster distance/similarity. Some of them are:
1. Single Linkage (Min Distance): Find the minimum distance between any two points of the two clusters.
2. Complete Linkage (Max Distance): Find the maximum distance between any two points of the two clusters.
3. Average Linkage (Group Average): Find the average of the distances between every pair of points of the two clusters.
4. Ward's Method: The similarity of two clusters is based on the increase in squared error when the two clusters are merged.

2. Divisive Clustering – In this technique, initially all the observations are placed in one cluster (irrespective of their similarities). Then, the cluster is split into two sub-clusters carrying similar observations; these sub-clusters are internally homogeneous. We continue to split the clusters until each leaf cluster contains exactly one observation. It uses a top-down approach.

Let us study about agglomerative clustering (it's the most commonly used).

This technique creates a hierarchy (in a recursive fashion) to partition the data set into clusters. This partitioning is done in a bottom-up fashion. The hierarchy of clusters is graphically presented using a dendrogram (shown below).
The dendrogram shows how clusters are merged/split hierarchically. Each node in the tree is a cluster, and each leaf of the tree is a singleton cluster (a cluster with one observation). So, how do we find the optimal number of clusters from a dendrogram?

As you know, every leaf in the dendrogram carries one observation. As we move up the leaves,
the leaf observations begin to merge into nodes (carrying observations which are similar to each
other). As we move further up, these nodes again merge further.

Always remember: the lower down the merging happens (towards the bottom of the tree), the more similar the observations are; the higher up the merging happens (towards the top of the tree), the less similar the observations are.

To determine clusters, we make horizontal cuts across the branches of the dendrogram. The number of clusters is then given by the number of vertical lines of the dendrogram that lie under the horizontal line.
As seen above, the horizontal line cuts the dendrogram into three clusters since it crosses three vertical lines. In a way, selecting the height at which to make a horizontal cut is similar to finding k in k-means, since it also controls the number of clusters.

But how do we decide where to cut a dendrogram? Practically, analysts do it based on their judgement and business need. More rigorously, there are several methods (described below) with which you can calculate the accuracy of your model for different cuts and then select the cut with the best accuracy.

Algorithm:

given a dataset (d1, d2, d3, ..., dN) of size N

# compute the distance matrix
for i = 1 to N:
    # the distance matrix is symmetric about the primary diagonal,
    # so we compute only the lower part of the primary diagonal
    for j = 1 to i:
        dis_mat[i][j] = distance(di, dj)

each data point is a singleton cluster

repeat
    merge the two clusters having the minimum distance
    update the distance matrix
until only a single cluster remains
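As a hedged, library-based illustration, SciPy's hierarchical clustering routines implement the procedure above; the sample data and the choice of 'ward' linkage are placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(30, 2)) for c in ((0, 0), (5, 5), (0, 6))])

# Build the merge hierarchy; 'ward' merges the pair that least increases squared error.
# Other options: 'single', 'complete', 'average' (the linkages described above).
Z = linkage(X, method="ward")

# Cut the tree to obtain a flat clustering with 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels[:10])

# dendrogram(Z) can be drawn with matplotlib to visualize the hierarchy
```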

Divisive clustering:

Also known as the top-down approach. This algorithm also does not require pre-specifying the number of clusters. Top-down clustering requires a method for splitting a cluster that contains the whole data, and it proceeds by splitting clusters recursively until individual data points have been split into singleton clusters.

Algorithm:

given a dataset (d1, d2, d3, ..., dN) of size N

at the top we have all the data in one cluster
the cluster is split using a flat clustering method, e.g. K-Means
repeat
    choose the best cluster among all the clusters to split
    split that cluster using the flat clustering algorithm
until each data point is in its own singleton cluster
Hierarchical Agglomerative vs Divisive clustering –

 Divisive clustering is more complex than agglomerative clustering, as in the case of divisive clustering we need a flat clustering method as a "subroutine" to split each cluster until every data point has its own singleton cluster.
 Divisive clustering is more efficient if we do not generate a complete hierarchy all the way down to individual data leaves. The time complexity of a naive agglomerative clustering is O(n³) because we exhaustively scan the N x N matrix dist_mat for the lowest distance in each of the N-1 iterations. Using a priority queue data structure we can reduce this complexity to O(n² log n), and with some more optimizations it can be brought down to O(n²). For divisive clustering, given a fixed number of top levels and using an efficient flat algorithm like K-Means, divisive algorithms are linear in the number of patterns and clusters.
 A divisive algorithm is also more accurate. Agglomerative clustering makes decisions by considering local patterns or neighboring points without initially taking into account the global distribution of the data, and these early decisions cannot be undone, whereas divisive clustering takes the global distribution of the data into consideration when making top-level partitioning decisions.

Hierarchical vs K Means:

The advantage of using hierarchical clustering over k-means is that it doesn't require advance knowledge of the number of clusters. However, some of the advantages which k-means has over hierarchical clustering are as follows:

 It uses less memory.

 It converges faster.
 Unlike hierarchical clustering, k-means doesn't get trapped in mistakes made at a previous level; it improves iteratively.
 k-means is non-deterministic in nature, i.e., each time you initialize it, it may produce different clusters. In contrast, hierarchical clustering is deterministic.

Note: K-means is preferred when the data is numeric. Hierarchical clustering is preferred when the data is categorical.

Key Issues in Hierarchical Clustering

Lack of a Global Objective Function: Agglomerative hierarchical clustering techniques perform clustering at a local level, and as such there is no global objective function as in the K-Means algorithm. This is actually an advantage of the technique, because optimizing a global objective function tends to be very expensive in time and space.

Ability to Handle Different cluster Sizes: we have to decide how to treat clusters of various
sizes that are merged together.

Merging Decisions Are Final: one downside of this technique is that once two clusters have
been merged they cannot be split up at a later time for a more favorable union.

Strengths and weakness

Strength:

 The strengths of hierarchical clustering are that it is easy to understand and easy to do.

 Hierarchical clustering is preferred when the data is categorical.

Weakness:

The weaknesses are that it rarely provides the best solution, it involves many arbitrary decisions, it does not work with missing data, it works poorly with mixed data types, it does not work well on very large data sets, and its main output, the dendrogram, is commonly misinterpreted. There are better alternatives, such as latent class analysis.

Self Organizing Map (or Kohonen Map or SOM):

It is a type of Artificial Neural Network that is inspired by biological models of neural systems from the 1970s. It follows an unsupervised learning approach and trains its network through a competitive learning algorithm. SOM is used for clustering and mapping (or dimensionality reduction) to map multidimensional data onto a lower-dimensional space, which makes complex problems easier to interpret. SOM has two layers: one is the Input layer and the other is the Output layer.
The architecture of the Self Organizing Map with two clusters and n input features per sample is given below:

How does SOM work?

Let's say the input data has size (m, n), where m is the number of training examples and n is the number of features in each example. First, SOM initializes the weights of size (n, C), where C is the number of clusters. Then, iterating over the input data, for each training example it updates the winning vector (the weight vector with the shortest distance, e.g. Euclidean distance, from the training example). The weight update rule is given by:
wij = wij(old) + alpha(t) * (xik - wij(old))
where alpha is the learning rate at time t, j denotes the winning vector, i denotes the ith feature of the training example, and k denotes the kth training example from the input data. After training the SOM network, the trained weights are used for clustering new examples. A new example falls in the cluster of its winning vector.
Algorithm
Training:
Step 1: Initialize the weights wij; random values may be assumed. Initialize the learning rate α.
Step 2: Calculate the squared Euclidean distance:
D(j) = Σ (wij – xi)², where i = 1 to n and j = 1 to m
Step 3: Find the index J for which D(j) is minimum; this is the winning index.
Step 4: For each j within a specific neighborhood of J, and for all i, calculate the new weight:
wij(new) = wij(old) + α[xi – wij(old)]
Step 5: Update the learning rate using:
α(t+1) = 0.5 * α(t)
Step 6: Test the stopping condition.
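A minimal NumPy sketch of the training loop described above, assuming no neighborhood function (each update touches only the winning unit) and a simple halving learning-rate schedule; the data and sizes are illustrative.

```python
import numpy as np

def train_som(X, n_clusters, epochs=20, alpha=0.5, seed=0):
    """Simplified SOM: weights has shape (n_features, n_clusters); for each sample,
    find the column with the smallest squared Euclidean distance (the winner)
    and move it toward the sample."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    weights = rng.random((n_features, n_clusters))
    for _ in range(epochs):
        for x in X:
            d = ((weights - x[:, None]) ** 2).sum(axis=0)            # D(j) for every cluster j
            winner = d.argmin()                                      # winning index J
            weights[:, winner] += alpha * (x - weights[:, winner])   # weight update rule
        alpha *= 0.5                                                 # learning-rate decay
    return weights

X = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.9, 0.1, 0.0], [0.1, 0.0, 0.9, 1.0]])
weights = train_som(X, n_clusters=2)

# Cluster a new example: it falls in the cluster of its winning vector
x_new = np.array([0.9, 1.0, 0.0, 0.1])
print("cluster:", ((weights - x_new[:, None]) ** 2).sum(axis=0).argmin())
```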

Expectation-Maximization (EM) Algorithm

The Expectation-Maximization algorithm can also be used for latent variables (variables that are not directly observable and are actually inferred from the values of the other observed variables) in order to predict their values, provided the general form of the probability distribution governing those latent variables is known to us. This algorithm is actually the basis of many unsupervised clustering algorithms in the field of machine learning.
It was explained, proposed, and given its name in a paper published in 1977 by Arthur Dempster, Nan Laird, and Donald Rubin. It is used to find the local maximum-likelihood parameters of a statistical model in cases where latent variables are involved and the data is missing or incomplete.

Algorithm:
1. Given a set of incomplete data, consider a set of starting parameters.
2. Expectation step (E – step): Using the observed available data of the dataset, estimate (guess)
the values of the missing data.
3. Maximization step (M – step): Complete data generated after the expectation (E) step is used in
order to update the parameters.
4. Repeat step 2 and step 3 until convergence.

The essence of the Expectation-Maximization algorithm is to use the available observed data of the dataset to estimate the missing data and then use that estimate to update the values of the parameters. Let us understand the EM algorithm in detail.
 Initially, a set of initial values of the parameters are considered. A set of incomplete observed
data is given to the system with the assumption that the observed data comes from a specific
model.
 The next step is known as “Expectation” – step or E-step. In this step, we use the observed data
in order to estimate or guess the values of the missing or incomplete data. It is basically used to
update the variables.
 The next step is known as “Maximization”-step or M-step. In this step, we use the complete data
generated in the preceding “Expectation” – step in order to update the values of the
parameters. It is basically used to update the hypothesis.
 Now, in the fourth step, it is checked whether the values are converging or not, if yes, then stop
otherwise repeat step-2 and step-3 i.e. “Expectation” – step and “Maximization” – step until the
convergence occurs.
Flow chart for EM algorithm –

Usage of EM algorithm –
 It can be used to fill the missing data in a sample.
 It can be used as the basis of unsupervised learning of clusters.
 It can be used for the purpose of estimating the parameters of Hidden Markov Model (HMM).
 It can be used for discovering the values of latent variables.
Advantages of EM algorithm –
 It is always guaranteed that likelihood will increase with each iteration.
 The E-step and M-step are often pretty easy for many problems in terms of implementation.
 Solutions to the M-steps often exist in the closed form.
Disadvantages of EM algorithm –
 It has slow convergence.
 It converges to a local optimum only.
 It requires both the forward and backward probabilities (numerical optimization requires only the forward probability).
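To make the E-step and M-step concrete, here is a hedged NumPy sketch of EM for a one-dimensional mixture of two Gaussians; the initial parameters, iteration count, and synthetic data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.5, 200)])

# Initial guesses for the parameters (mixture weights, means, variances)
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibility of each component for each point
    dens = pi[None, :] * normal_pdf(x[:, None], mu[None, :], var[None, :])
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means and variances from the responsibilities
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk

print("weights:", pi, "means:", mu, "variances:", var)
```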
Gaussian Mixture Models
Gaussian Mixture Models (GMMs) assume that there are a certain number of Gaussian
distributions, and each of these distributions represent a cluster. Hence, a Gaussian Mixture
Model tends to group the data points belonging to a single distribution together.
Let’s say we have three Gaussian distributions (more on that in the next section) – GD1, GD2,
and GD3. These have a certain mean (μ1, μ2, μ3) and variance (σ1, σ2, σ3) value respectively.
For a given set of data points, our GMM would identify the probability of each data point
belonging to each of these distributions.
Gaussian Mixture Models are probabilistic models and use the soft clustering approach for
distributing the points in different clusters. I’ll take another example that will make it easier to
understand.
Here, we have three clusters that are denoted by three colors – Blue, Green, and Cyan. Let’s
take the data point highlighted in red. The probability of this point being a part of the blue
cluster is 1, while the probability of it being a part of the green or cyan clusters is 0.

Now, consider another point – somewhere in between the blue and cyan (highlighted in the
below figure). The probability that this point is a part of cluster green is 0, right? And the
probability that this belongs to blue and cyan is 0.2 and 0.8 respectively.
Gaussian Mixture Models use the soft clustering technique for assigning data points to Gaussian
distributions.
The Gaussian Distribution
It has a bell-shaped curve, with the data points symmetrically distributed around the mean value. The image below shows a few Gaussian distributions with different values of the mean (μ) and variance (σ²). Remember that the higher the σ value, the greater the spread.

In a one-dimensional space, the probability density function of a Gaussian distribution is given by:

f(x | μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))

where μ is the mean and σ² is the variance.
In the two-dimensional case, the probability density function is given by:

f(x | μ, Σ) = (1 / (2π√|Σ|)) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))

where x is the input vector, μ is the 2D mean vector, and Σ is the 2×2 covariance matrix. The
covariance would now define the shape of this curve. We can generalize the same for d-
dimensions.
Thus, this multivariate Gaussian model would have x and μ as vectors of length d, and Σ would be a d x
d covariance matrix.
Hence, for a dataset with d features, we would have a mixture of k Gaussian distributions (where k is equivalent to the number of clusters), each having a certain mean vector and covariance matrix.
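A hedged scikit-learn sketch of fitting a Gaussian Mixture Model and reading off the soft cluster memberships; the data and the number of components are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=s, size=(150, 2))
               for c, s in (((0, 0), 0.8), ((6, 3), 1.2), ((0, 6), 0.6))])

# k = 3 Gaussian components, each with its own full covariance matrix
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

print("means:\n", gmm.means_)
# Soft clustering: probability of the first point belonging to each of the 3 components
print("membership probabilities:", gmm.predict_proba(X[:1]).round(3))
```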

Principal Component Analysis(PCA)

Principal component analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity, because smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze the data much more easily and quickly without extraneous variables to process.

So, to sum up, the idea of PCA is simple: reduce the number of variables of a data set while preserving as much information as possible.

Basic Terminologies of PCA

Before getting into PCA, we need to understand some basic terminology:

 Variance – measures how the data is spread along each dimension of the graph.

 Covariance – measures the dependencies and relationships between features.

 Standardizing data – scaling our dataset within a specific range for unbiased output.

 Covariance matrix – used for calculating the interdependencies between the features or variables; it also helps in reducing them to improve performance.

 Eigenvalues and Eigenvectors – the purpose of eigenvectors is to find the directions of largest variance in the dataset in order to compute the principal components. The eigenvalue is the magnitude of the eigenvector: it indicates the variance in a particular direction, whereas the eigenvector gives that direction, along which the transformation expands or contracts the X-Y (2D) graph without altering the direction itself.

In this shear mapping, the blue arrow changes direction whereas the pink arrow does not. The pink arrow in this instance is an eigenvector because of its constant orientation. The length of this arrow is also unaltered, and its eigenvalue is 1. Technically, a principal component (PC) is a straight line that captures the maximum variance (information) of the data. A PC has both direction and magnitude, and PCs are perpendicular to each other.

Dimensionality Reduction – Project the standardized data onto the selected eigenvectors (the feature vector), i.e., multiply the transpose of the feature vector by the transpose of the standardized original data. This reduces the features without losing much information.
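The terminology above maps directly onto a short NumPy sketch: standardize, form the covariance matrix, take its eigenvectors, and project. This is a minimal illustration; the data and the choice of keeping two components are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))           # 200 samples, 5 features (placeholder data)

# 1. Standardize (mean 0, sd 1) so no variable dominates because of its units
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the features
cov = np.cov(Xs, rowvar=False)

# 3. Eigen-decomposition; eigenvalues give the variance along each eigenvector (direction)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]       # sort directions by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep the top-2 principal components and project the data onto them
W = eigvecs[:, :2]                      # "feature vector" of selected eigenvectors
X_reduced = Xs @ W                      # shape (200, 2)

print("explained variance ratio:", (eigvals[:2] / eigvals.sum()).round(3))
```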

Locally Linear Embedding (LLE)

Data sets can often be represented in an n-dimensional feature space, with each dimension used for a specific feature.

The LLE algorithm is an unsupervised method for dimensionality reduction. It tries to reduce these n dimensions while trying to preserve the geometric features of the original non-linear feature structure. For example, in the illustration below, we cast the structure of the swiss roll into a lower-dimensional plane while maintaining its geometric structure.

In short, if we have D dimensions for data X1, we try to reduce X1 to X2 in a feature space with d dimensions.

LLE first finds the k nearest neighbours of the points. Then, it approximates each data vector as a weighted linear combination of its k nearest neighbours. Finally, it computes the weights that best reconstruct the vectors from their neighbours, and then produces the low-dimensional vectors best reconstructed by these weights [6].

1. Finding the K nearest neighbours.

One advantage of the LLE algorithm is that there is only one parameter to tune: the value of K, the number of nearest neighbours to consider as part of a cluster. If K is chosen too small or too large, it will not be able to accommodate the geometry of the original data. Here, for each data point that we have, we compute the K nearest neighbours.

2. We do a weighted aggregation of the neighbours of each point to construct a new point. We try to minimize the cost function E(W) = Σi |Xi − Σj Wij Xj|², where Xj runs over the K nearest neighbours of point Xi.

3. Now we define the new vector space Y such that we minimize the cost Φ(Y) = Σi |Yi − Σj Wij Yj|² for Y as the new points.
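A hedged scikit-learn sketch of LLE on the swiss-roll example mentioned above; the neighbour count and output dimensionality are illustrative.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# 3-D swiss roll; t is the position along the roll (useful for coloring a plot)
X, t = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

# Reduce from D = 3 dimensions to d = 2 while preserving local geometry
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_2d = lle.fit_transform(X)

print(X_2d.shape)                       # (1000, 2)
print("reconstruction error:", lle.reconstruction_error_)
```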
Factor Analysis
Factor analysis is a special technique for reducing a huge number of variables into a few factors; this reduction is known as factoring the data. It is a completely statistical approach that is used to describe fluctuations among the observed and correlated variables in terms of a potentially lower number of unobserved variables called factors.
The factor analysis technique extracts the maximum common variance from all the variables and puts it into a common score. It is a theory that is used in training machine learning models, and so it is quite related to data mining. The belief behind factor-analytic techniques is that the information gained about the interdependencies between observed variables can be used later to reduce the set of variables in a dataset.
Factor analysis is a very effective tool for inspecting variable relationships for complex concepts such as social status, economic status, dietary patterns, psychological scales, biology, psychometrics, personality theories, marketing, product management, operations research, finance, etc. It can help a researcher investigate concepts that are not easily measured directly, in a much easier and quicker way, by collapsing a large number of variables into a few easily interpretable fundamental factors.
Types of factor analysis:
1. Exploratory factor analysis (EFA):
It is used to identify composite inter-relationships among items and to group items that are part of unified concepts. The analyst cannot make any prior assumptions about the relationships among the factors. It is also used to find the fundamental structure of a huge set of variables, reducing the large data to a much smaller set of summary variables. It is quite similar to Confirmatory Factor Analysis (CFA).
Similarities are:
 Evaluate the internal reliability of a measure.
 Examine the factors represented by item sets; they presume that the factors are not correlated.
 Investigate the grade/class of each item.

However, there are some common differences, most of which concern how the factors are used. Basically, EFA is a data-driven approach which allows all items to load on all the factors, while in CFA you need to specify which factors each item is required to load on. EFA is a good choice if you have no idea what common factors might exist: EFA is able to generate a huge number of possible models for your data, something which is not possible if a researcher has to specify the factors. If you have some idea of what the models actually look like, and you then want to test your hypotheses about the data structure, CFA is the better approach.
2. Confirmatory factor analysis (CFA):
It is a more complex (composite) approach that tests the hypothesis that the items are associated with specific factors. Confirmatory Factor Analysis uses a properly structured equation model to test a measurement model whereby the loadings on the factors allow evaluation of the relationships between observed variables and unobserved variables.
As we know, structural equation modelling approaches can accommodate measurement error easily, and they are much less restrictive than least-squares estimation, thus providing more scope to accommodate errors. Hypothesized models are tested against actual data, and the analysis demonstrates the loadings of the observed variables on the latent variables (factors), as well as the correlations between the latent variables.
Confirmatory Factor Analysis allows an analyst or researcher to figure out whether a relationship between a set of observed variables (also known as manifest variables) and their underlying constructs exists. It is similar to Exploratory Factor Analysis.
The main difference between the two is:
 Simply use Exploratory Factor Analysis to explore the pattern.
 Use Confirmatory Factor Analysis to perform hypothesis testing.

Confirmatory Factor Analysis provides information about the quality of the number of factors that are required to represent the data set. Using Confirmatory Factor Analysis, you can define the total number of factors required. For example, Confirmatory Factor Analysis is able to answer questions like "Can my thousand-question survey accurately measure one specific factor?" Even though it is technically applicable to any discipline, it is typically used in the social sciences.
3. Multiple Factor Analysis:
This type of factor analysis is used when your variables are structured into groups. For example, you may have a teenager's health questionnaire with several sections such as sleeping patterns, addictions, psychological health, mobile phone addiction, or learning disabilities.
The Multiple Factor Analysis is performed in two steps:
 First, Principal Component Analysis is performed on each section of the data. This gives a useful eigenvalue, which is used to normalize the data sets for further use.
 The newly formed data sets are then merged into a single matrix and a global PCA is performed.
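A hedged scikit-learn sketch of exploratory factor analysis; the synthetic data (six observed variables driven by two latent factors) and the number of factors are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 500
# Two latent factors drive six observed variables plus noise
latent = rng.normal(size=(n, 2))
loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
                     [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])
X = latent @ loadings.T + 0.3 * rng.normal(size=(n, 6))

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)

# components_ holds the estimated loadings of each observed variable on each factor
print(fa.components_.round(2))
print("factor scores shape:", fa.transform(X).shape)   # (500, 2)
```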

K-Nearest Neighbors
 K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
 The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
 The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
 The K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used for Classification problems.
 K-NN is a non-parametric algorithm, which means it does not make any assumptions about the underlying data.
 It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
 The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into the category that is most similar to the new data.
 Example: Suppose we have an image of a creature that looks similar to a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new data set that are similar to the cat and dog images, and based on the most similar features it will put it in either the cat or the dog category.


 Why do we need a K-NN Algorithm?
 Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:


 How does K-NN work?
 The K-NN working can be explained on the basis of the below algorithm:
 Step-1: Select the number K of the neighbors
 Step-2: Calculate the Euclidean distance of K number of neighbors
 Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
 Step-4: Among these k neighbors, count the number of the data points in each category.
 Step-5: Assign the new data points to that category for which the number of the neighbor
is maximum.
 Step-6: Our model is ready.
 Suppose we have a new data point and we need to put it in the required category. Consider
the below image:

o Firstly, we will choose the number of neighbors, so we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2) it can be calculated as √((x2 − x1)² + (y2 − y1)²).
o By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:
o As we can see, the 3 nearest neighbors are from category A, hence this new data point must belong to category A.

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try several values to find the best one. The most commonly preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and leads to the effects of outliers in the model.
o Large values for K are good, but they may lead to some difficulties.

Advantages of KNN Algorithm:

o It is simple to implement.

o It is robust to noisy training data.

o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o We always need to determine the value of K, which may sometimes be complex.

o The computation cost is high because of calculating the distances between the new data point and all the training samples.
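A hedged scikit-learn sketch of K-NN classification with K = 5; the Iris dataset and the train/test split are just convenient placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# K = 5 neighbours, Euclidean distance by default
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)       # "training" just stores the data (lazy learner)

print("test accuracy:", knn.score(X_test, y_test))
print("predicted class of first test sample:", knn.predict(X_test[:1]))
```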

Neural Networks: Introduction


What are Neural Networks?
Neural networks are used to mimic the basic functioning of the human brain and are inspired by how the human brain interprets information. They are used to solve various real-time tasks because of their ability to perform computations quickly and their fast responses.

An Artificial Neural Network model contains various components that are inspired by the biological nervous system.
An Artificial Neural Network has a huge number of interconnected processing elements, also known as nodes. These nodes are connected to other nodes via connection links. Each connection link carries a weight, and these weights contain information about the input signal. Each iteration and input in turn updates these weights. After all the data instances from the training data set have been fed in, the final weights of the Neural Network along with its architecture are known as the Trained Neural Network; this process is called training the Neural Network. The trained neural network is then used to solve the specific problem defined in the problem statement.
Types of tasks that can be solved using an artificial neural network include classification problems, pattern matching, data clustering, etc.

Perceptron
Perceptron was introduced by Frank Rosenblatt in 1957. He proposed a Perceptron learning rule
based on the original MCP neuron. A Perceptron is an algorithm for supervised learning of
binary classifiers. This algorithm enables a neuron to learn by processing the elements of the
training set one at a time.
Basic Components of Perceptron
Perceptron is a type of artificial neural network, which is a fundamental concept in machine
learning. The basic components of a perceptron are:
1. Input Layer: The input layer consists of one or more input neurons, which receive input
signals from the external world or from other layers of the neural network.
2. Weights: Each input neuron is associated with a weight, which represents the strength of
the connection between the input neuron and the output neuron.
3. Bias: A bias term is added to the input layer to provide the perceptron with additional
flexibility in modeling complex patterns in the input data.
4. Activation Function: The activation function determines the output of the perceptron based on
the weighted sum of the inputs and the bias term. Common activation functions used in
perceptrons include the step function, sigmoid function, and ReLU function.
5. Output: The output of the perceptron is a single binary value, either 0 or 1, which indicates
the class or category to which the input data belongs.
6. Training Algorithm: The perceptron is typically trained using a supervised learning algorithm
such as the perceptron learning algorithm or backpropagation. During training, the weights
and biases of the perceptron are adjusted to minimize the error between the predicted output
and the true output for a given set of training examples.
Overall, the perceptron is a simple yet powerful algorithm that can be used to perform
binary classification tasks and has paved the way for the more complex neural networks used in
deep learning today.
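
To make these components concrete, here is a minimal, illustrative NumPy sketch of the classic perceptron learning rule with a step activation; the AND-gate data, learning rate, and epoch count are assumptions chosen for demonstration, not part of the original notes.

```python
import numpy as np

# Toy data: the AND function, a linearly separable binary problem (illustrative).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(X.shape[1])   # weights, one per input
b = 0.0                    # bias term
lr = 0.1                   # assumed learning rate

step = lambda z: 1 if z > 0 else 0   # step activation function

for _ in range(10):                  # assumed number of epochs
    for xi, target in zip(X, y):
        pred = step(np.dot(w, xi) + b)
        # Perceptron learning rule: adjust weights and bias in proportion to the error.
        w += lr * (target - pred) * xi
        b += lr * (target - pred)

print([step(np.dot(w, xi) + b) for xi in X])  # expected: [0, 0, 0, 1]
```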

Multilayer Perceptron
A fully connected multi-layer neural network is called a Multilayer Perceptron (MLP).
It has at least 3 layers, including one hidden layer. If it has more than one hidden layer, it is called a deep
ANN. An MLP is a typical example of a feedforward artificial neural network. In the figure referred to here, the
ith activation unit in the lth layer is denoted as ai(l).
The number of layers and the number of neurons are referred to as hyperparameters of a
neural network, and these need tuning. Cross-validation techniques should be used to find good
values for them (a small sketch follows below).
The weight adjustment during training is done via backpropagation. Deeper neural networks can model more
complex patterns in the data. However, deeper networks can suffer from the vanishing gradient problem, and special
algorithms or architectures are required to address it.
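
As one hedged illustration of treating the layer sizes as hyperparameters and tuning them by cross-validation, the sketch below uses scikit-learn's MLPClassifier with GridSearchCV; the digits dataset, the candidate architectures, and max_iter are assumptions for demonstration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)  # stand-in dataset

# Treat the hidden-layer architecture as a hyperparameter and cross-validate the candidates.
search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    {"hidden_layer_sizes": [(16,), (32,), (32, 16)]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_)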

Notations

In the representation below:


 ai(in) refers to the ith value in the input layer

 ai(h) refers to the ith unit in the hidden layer

 ai(out) refers to the ith unit in the output layer

 a0(in) is simply the bias unit and is equal to 1; it has the corresponding weight w0

 The weight coefficient from layer l to layer l+1 is represented by wk,j(l)

A simplified view of the multilayer perceptron is presented here. This image shows a fully connected three-layer
neural network with 3 input neurons and 3 output neurons. A bias term is added to the input vector.

Forward Propagation

In the following topics, let us look at forward propagation in detail.

MLP Learning Procedure

The MLP learning procedure is as follows:

 Starting with the input layer, propagate data forward to the output
layer. This step is the forward propagation.
 Based on the output, calculate the error (the difference between
the predicted and known outcome). The error needs to be
minimized.
 Backpropagate the error. Find its derivative with respect to each
weight in the network, and update the model.

Repeat the three steps given above over multiple epochs to learn ideal
weights. Finally, the output is taken through a threshold function to obtain the
predicted class labels.

Forward Propagation in MLP

In the first step, calculate the activation units ai(h) of the hidden layer.
Each activation unit is the result of applying an activation function φ to the net input z,
the weighted sum of the incoming activations plus the bias. The activation function must be
differentiable so that the weights can be learned using gradient descent. The activation
function φ is often the sigmoid (logistic) function, φ(z) = 1 / (1 + e^(−z)).
It provides the nonlinearity needed to solve complex problems like image processing.
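
A hedged sketch of a single forward pass is shown below: one input vector is pushed through a hidden layer and an output layer using the sigmoid activation φ; the layer sizes, random weights, and input values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    # The logistic activation φ(z) = 1 / (1 + e^(-z)); differentiable, so usable with gradient descent.
    return 1.0 / (1.0 + np.exp(-z))

# Assumed sizes: 3 input units, 4 hidden units, 2 output units (plus bias terms).
rng = np.random.default_rng(0)
W_h, b_h = rng.normal(size=(4, 3)), np.zeros(4)      # hidden-layer weights and bias
W_out, b_out = rng.normal(size=(2, 4)), np.zeros(2)  # output-layer weights and bias

a_in = np.array([0.5, -1.0, 2.0])   # example input a(in)

z_h = W_h @ a_in + b_h              # net input of the hidden layer
a_h = sigmoid(z_h)                  # hidden activations a(h)
z_out = W_out @ a_h + b_out         # net input of the output layer
a_out = sigmoid(z_out)              # output activations a(out)
print(a_out)
```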

Support vector machines:

Linear and NonLinear

SVM is a powerful supervised algorithm that works best on smaller but complex datasets. Support
Vector Machine, abbreviated as SVM, can be used for both regression and
classification tasks, but generally it works best on classification problems.

What is a Support Vector Machine?

SVM is a supervised machine learning algorithm in which we try to find a hyperplane that best separates
the two classes. Note: don't get confused between SVM and logistic regression. Both
algorithms try to find the best hyperplane, but the main difference is that logistic regression is a
probabilistic approach, whereas the support vector machine looks for the maximum-margin separating hyperplane.

Now the question is: which hyperplane does it select? There can be an infinite number of
hyperplanes that classify the two classes perfectly. So, which one is
the best?

Well, SVM answers this by finding the hyperplane with the maximum margin, that is, the
maximum distance to the nearest points of the two classes.

Types of Support Vector Machine

Linear SVM

When the data is perfectly linearly separable, only then can we use Linear SVM. Perfectly
linearly separable means that the data points can be classified into 2 classes by using a single
straight line (if 2D).

Non-Linear SVM

When the data is not linearly separable, we can use Non-Linear SVM. That is, when
the data points cannot be separated into 2 classes by using a straight line (if 2D), we use
advanced techniques like the kernel trick to classify them. In most real-world applications we
do not find linearly separable data points, hence we use the kernel trick.

Support Vectors: These are the points that are closest to the hyperplane. A separating line will
be defined with the help of these data points.
Margin: It is the distance between the hyperplane and the observations closest to the hyperplane
(the support vectors). In SVM, a large margin is considered a good margin. There are two types of
margins, hard margin and soft margin, which are discussed in a later section.


How does Support Vector Machine work?

SVM is defined in terms of the support vectors only, so we do not have to
worry about the other observations, since the margin is computed using the points closest to the
hyperplane (the support vectors). In logistic regression, by contrast, the classifier is defined over all the
points. Hence SVM enjoys some natural speed-ups.
Let’s understand the working of SVM using an example. Suppose we have a dataset that has two
classes (green and blue). We want to classify a new data point as either blue or green.

To classify these points, we can have many decision boundaries, but the question is which is the
best and how do we find it? NOTE: since we are plotting the data points in a 2-dimensional
graph we call this decision boundary a straight line, but if we have more dimensions we call this
decision boundary a “hyperplane”.

The best hyperplane is the one that has the maximum distance from both classes, and finding it
is the main aim of SVM. This is done by considering the different hyperplanes that classify the labels
correctly and then choosing the one that is farthest from the data points, i.e., the one
with the maximum margin.
Margin in Support Vector Machine

We all know the equation of a hyperplane is w·x + b = 0, where w is a vector normal to the hyperplane
and b is an offset.
To classify a point as negative or positive we need to define a decision rule. We can define the
decision rule as:
If the value of w·x + b > 0 then we say it is a positive point, otherwise it is a negative point.
Now we need (w, b) such that the margin is as large as possible. Let's say this distance is ‘d’.
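
The decision rule above can be sketched directly. The snippet below fits a linear SVM with scikit-learn on two illustrative clusters and then applies sign(w·x + b); the toy points, labels, and C value are assumptions for demonstration.

```python
import numpy as np
from sklearn.svm import SVC

# Two illustrative, linearly separable clusters.
X = np.array([[1, 1], [2, 1], [1, 2], [6, 5], [7, 6], [6, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Decision rule: w.x + b > 0 -> positive class, otherwise negative class.
x_new = np.array([2.0, 2.0])
print("positive" if w @ x_new + b > 0 else "negative")
print("support vectors:\n", clf.support_vectors_)
```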
Kernel Functions

The most interesting feature of SVM is that it can even work with a non-linear dataset; for
this, we use the “Kernel Trick”, which makes it easier to classify the points. Suppose we have a
dataset like this:


Here we see that we cannot draw a single line (or hyperplane) that classifies the points
correctly. So what we do is map this lower-dimensional space to a higher-dimensional
space, for example using quadratic feature combinations, which allows us to find a decision boundary that
clearly divides the data points. The functions that help us do this are called kernels, and
which kernel to use is usually determined by hyperparameter tuning.


Different Kernel functions

Some kernel functions which you can use in SVM are given below:

1. Polynomial kernel

Following is the formula for the polynomial kernel (in one common convention):

K(X1, X2) = (X1 · X2 + 1)^d

Here d is the degree of the polynomial, which we need to specify manually.

Suppose we have two features X1 and X2 and output variable Y. With a degree-2 polynomial kernel,
the implicit mapping introduces the terms X1², X2² and X1·X2 in addition to X1 and X2,
so the original 2 dimensions get converted into 5 dimensions. A small numerical check of this follows below.
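
A small numerical check of this idea, assuming the common convention K(X1, X2) = (X1·X2 + 1)² for degree d = 2: the kernel value computed in the original 2-D space equals the dot product of the explicitly expanded feature vectors. The two sample points are arbitrary.

```python
import numpy as np

def poly_kernel(x, z, d=2):
    # Polynomial kernel (common convention): K(x, z) = (x . z + 1)^d
    return (np.dot(x, z) + 1) ** d

def phi(x):
    # Explicit degree-2 feature map corresponding to the kernel above.
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

a, b = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(poly_kernel(a, b))          # kernel value computed in the original 2-D space
print(np.dot(phi(a), phi(b)))     # same value via the explicit higher-dimensional map
```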


2. Sigmoid kernel

We can use it as a proxy for neural networks. One common form of the equation is:

K(X1, X2) = tanh(γ · (X1 · X2) + r)

It squashes its input through an S-shaped (sigmoid) curve so that the transformed points can be
separated by a simple straight line.


3. RBF kernel

What it actually does is create non-linear combinations of the features to lift the samples
onto a higher-dimensional feature space where a linear decision boundary can separate the
classes. It is the most widely used kernel in SVM classification. The following formula describes it
mathematically (a short numerical check follows below):

K(X₁, X₂) = exp(−||X₁ – X₂||² / (2σ²))

where,

1. ‘σ’ is a hyperparameter that controls the width of the kernel (σ² is the variance)

2. ||X₁ – X₂|| is the Euclidean distance between two points X₁ and X₂
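
A hedged numerical sketch of this formula: the hand-written version is compared with scikit-learn's rbf_kernel, which uses the equivalent parameterisation γ = 1 / (2σ²). The sample points and the value of σ are made up.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x1, x2, sigma=1.0):
    # K(x1, x2) = exp(-||x1 - x2||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

x1, x2 = np.array([1.0, 2.0]), np.array([2.0, 0.0])
sigma = 1.5
print(rbf(x1, x2, sigma))
# scikit-learn expresses the same kernel through gamma = 1 / (2 * sigma^2)
print(rbf_kernel(x1.reshape(1, -1), x2.reshape(1, -1), gamma=1 / (2 * sigma ** 2))[0, 0])
```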


4. Bessel function kernel

It is mainly used for eliminating the cross term in mathematical functions. Following is the
formula of the Bessel function kernel:

5. Anova Kernel

It performs well on multidimensional regression problems. The formula for this kernel function
is:
