UNIT III - ML
1. Clustering Algorithms:
Clustering or cluster analysis is a machine learning technique that groups an unlabelled
dataset. It can be defined as "a way of grouping the data points into different clusters
consisting of similar data points; the objects with possible similarities remain in a group
that has few or no similarities with another group."
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It partitions the samples into different clusters of roughly equal variance.
The number of clusters must be specified in advance. It is fast and requires relatively
few computations, with linear complexity O(n).
2. Mean-shift algorithm: The mean-shift algorithm tries to find dense areas in a smooth
density of data points. It is an example of a centroid-based model that works by
updating candidate centroids to be the center of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model similar to the
mean-shift, but with some remarkable advantages. In this algorithm, the areas of high
density are separated by the areas of low density. Because of this, the clusters can be
found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as
an alternative to the k-means algorithm, or for those cases where k-means may fail.
In GMM, the data points are assumed to be Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm
performs the bottom-up hierarchical clustering. In this, each data point is treated as a
single cluster at the outset and then successively merged. The cluster hierarchy can be
represented as a tree-structure.
6. Affinity Propagation: It differs from the other clustering algorithms in that it does not
require the number of clusters to be specified. In this algorithm, data points exchange
messages between pairs of points until convergence. Its main drawback is its O(N²T)
time complexity, where N is the number of points and T the number of iterations. A
short sketch comparing a few of these algorithms follows this list.
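A minimal Python sketch, assuming scikit-learn is available, comparing three of the
algorithms listed above on a small synthetic dataset (the parameter values here are
illustrative choices, not prescribed settings):

# Compare K-Means, DBSCAN and agglomerative clustering on synthetic blobs.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

# Generate 300 unlabelled points grouped around 3 centres.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# K-Means: the number of clusters (k) must be specified in advance.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# DBSCAN: density-based, finds arbitrarily shaped clusters; -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Agglomerative: bottom-up hierarchical clustering.
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print("K-Means labels:", set(kmeans_labels))
print("DBSCAN labels:", set(dbscan_labels))
print("Agglomerative labels:", set(agglo_labels))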
2. K – Means:
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabelled
dataset into different clusters.
The algorithm takes the unlabelled dataset as input, divides the dataset into k clusters,
and repeats the process until it finds the best clusters. The value of k should be
predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
o Determines the best value for the K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points which are near to a
particular k-center create a cluster.
Hence each cluster has data points with some commonalities and is away from the other clusters.
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids (they may be points other than those from the input dataset).
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each
cluster, and continue until the assignments no longer change.
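A minimal from-scratch sketch of the steps above, using NumPy; the function and
variable names are chosen only for this example and the demo data is random:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-1/2: choose K and pick K random data points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step-3: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step-5: stop once the centroids (and hence assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
print(centroids)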
3. Hierarchical Clustering:
Hierarchical clustering is a connectivity-based clustering model that groups the data points
together that are close to each other based on the measure of similarity or distance.
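A short sketch of bottom-up (agglomerative) hierarchical clustering using SciPy,
assuming it is installed; the linkage matrix encodes the cluster tree, and the data here
is random demo data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # 20 unlabelled 2-D points
Z = linkage(X, method="ward")                    # repeatedly merge the closest clusters
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)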
4. Cluster Validity:
The term cluster validation refers to the procedure of evaluating the goodness of
clustering algorithm results. This is important to avoid finding patterns in random data, as
well as in situations where you want to compare two clustering algorithms.
Generally, clustering validation statistics can be categorized into 3 classes:
1. Internal cluster validation, which uses the internal information of the clustering
process to evaluate the goodness of a clustering structure without reference to external
information. It can be also used for estimating the number of clusters and the
appropriate clustering algorithm without any external data.
2. External cluster validation, which consists of comparing the results of a cluster
analysis to an externally known result, such as externally provided class labels. It
measures the extent to which cluster labels match the externally supplied class labels. Since
we know the "true" cluster structure in advance, this approach is mainly used for
selecting the right clustering algorithm for a specific data set.
3. Relative cluster validation, which evaluates the clustering structure by varying
different parameter values for the same algorithm (e.g.,: varying the number of clusters
k). It’s generally used for determining the optimal number of clusters.
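A small sketch, assuming scikit-learn, of internal versus external validation: the
silhouette score needs no labels, while the adjusted Rand index compares the predicted
clusters with externally known class labels (the dataset and settings are illustrative):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Internal (silhouette):", silhouette_score(X, y_pred))
print("External (adjusted Rand index):", adjusted_rand_score(y_true, y_pred))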
5. Dimensionality Reduction:
Dimensionality reduction is a technique used to reduce the number of features in a dataset
while retaining as much of the important information as possible.
High-dimensional data can lead to overfitting and high computational cost (the curse of
dimensionality); dimensionality reduction can help to mitigate these problems by reducing
the complexity of the model and improving its generalization performance. There are two
main approaches to dimensionality reduction: feature selection and feature extraction.
Feature Selection:
Feature selection involves selecting a subset of the original features that are most relevant to
the problem at hand.
Feature Extraction:
Feature extraction involves creating new features by combining or transforming the original
features. The goal is to create a set of features that captures the essence of the original data
in a lower-dimensional space.
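A brief sketch, assuming scikit-learn, contrasting the two approaches: feature selection
keeps a subset of the original columns, while feature extraction (here PCA) builds new
combined features. The dataset and the choice of two features are only for illustration:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)          # 4 original features

X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)   # feature selection
X_extracted = PCA(n_components=2).fit_transform(X)             # feature extraction

print(X_selected.shape, X_extracted.shape)  # both reduced to 2 features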
6. Principal Component Analysis (PCA):
Principal Component Analysis (PCA) is used to reduce the dimensionality of a data set by
finding a new set of variables, smaller than the original set of variables, that retains most of
the sample's information and is useful for the regression and classification of data.
Standardization
First, we need to standardize our dataset to ensure that each variable has a mean of 0 and a
standard deviation of 1.
Next, the covariance matrix of the standardized data is computed, and its eigenvalues and
eigenvectors determine the principal components. Here, for a square matrix A, if
AX = λX
for some scalar value λ, then λ is known as an eigenvalue of matrix A and X is known
as the eigenvector of matrix A for the corresponding eigenvalue.
It can also be written as:
(A - λI)X = 0
where I is the identity matrix of the same shape as matrix A. The above condition will be
true only if (A - λI) is non-invertible (i.e., a singular matrix). That means
|A - λI| = 0.
From the above equation, we can find the eigenvalues λ, and the corresponding
eigenvector can then be found using the equation AX = λX.
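A minimal NumPy sketch of PCA built from these ideas: standardize the data, form the
covariance matrix, and take the eigenvectors with the largest eigenvalues as the new axes
(the data and the choice of 2 components are arbitrary, for illustration only):

import numpy as np

X = np.random.rand(100, 5)                       # 100 samples, 5 variables

# Standardization: zero mean, unit standard deviation per variable.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix A of the standardized data.
A = np.cov(X_std, rowvar=False)

# Solve A X = lambda X; eigh is used because A is symmetric.
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Keep the 2 eigenvectors with the largest eigenvalues and project the data.
order = np.argsort(eigenvalues)[::-1][:2]
X_reduced = X_std @ eigenvectors[:, order]
print(X_reduced.shape)                           # (100, 2)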
7.Recommendation Systems:
Recommender systems are systems designed to recommend things to the user based
on many different factors. These systems predict the products that users are most likely to
purchase or be interested in. Companies like Netflix, Amazon, etc. use recommender
systems to help their users identify the right products or movies for them.
The recommender system deals with a large volume of information by filtering out the
most important information based on the data provided by a user and other factors that
account for the user's preferences and interests. It finds the match between user and item and
imputes the similarities between users and items for recommendation. Both the users and the
service providers benefit from these kinds of systems: the quality of decision-making
improves and the content is personalized.
Popularity-Based Recommendation: This type of system recommends anything which is in
trend. These systems check which products or movies are in trend or are most popular
among the users and directly recommend those.
• It does not suffer from cold-start problems, which means it can recommend products
even on day one of the business.
• Not personalized: the system would recommend the same sort of products/movies,
based solely on popularity, to every user.
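A toy Python sketch of the popularity-based idea described above: rank items by how
often (and how well) users rated them, and recommend the same top items to everyone.
The table, column names and scores are made up for illustration:

import pandas as pd

ratings = pd.DataFrame({
    "user":  ["u1", "u2", "u3", "u1", "u2", "u3", "u1"],
    "item":  ["A",  "A",  "A",  "B",  "B",  "C",  "C"],
    "score": [5,     4,    5,    3,    2,    4,    5],
})

# Popularity = number of ratings; ties broken by average score.
popularity = (ratings.groupby("item")["score"]
              .agg(count="count", mean="mean")
              .sort_values(["count", "mean"], ascending=False))

print(popularity.head(2).index.tolist())   # top-2 items recommended to every user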
8. EM algorithm:
The Expectation-Maximization (EM) algorithm is a technique used in various
unsupervised machine learning algorithms to determine the local maximum
likelihood estimates (MLE) or maximum a posteriori estimates (MAP) for unobservable
variables in statistical models. In other words, it is a technique for finding maximum
likelihood estimates when latent variables are present, and it is also referred to as a
latent variable model.
o Expectation step (E - step): It involves the estimation (guess) of all missing values in
the dataset so that after completing this step, there should not be any missing value.
o Maximization step (M - step): This step involves the use of estimated data in the E-
step and updating the parameters.
o Repeat E-step and M-step until the convergence of the values occurs.
Steps in EM Algorithm
o 1st Step: The very first step is to initialize the parameter values. Further, the system is
provided with incomplete observed data with the assumption that data is obtained from
a specific model.
o 2nd Step: This step is known as Expectation or E-Step, which is used to estimate or
guess the values of the missing or incomplete data using the observed data. Further, E-
step primarily updates the variables.
o 3rd Step: This step is known as Maximization or M-step, where we use complete data
obtained from the 2nd step to update the parameter values. Further, M-step primarily
updates the hypothesis.
o 4th step: The last step is to check whether the values of the latent variables are
converging or not. If yes, stop the process; else, repeat the process from step 2 until
convergence occurs.
Advantages of EM algorithm
o The E-step and M-step are often easy to implement for many problems, and the
likelihood is guaranteed not to decrease after each iteration.
Disadvantages of EM algorithm
o It has slow convergence and converges only to a local optimum, so the result can
depend on the initial parameter values.
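A compact NumPy sketch of the EM loop for a two-component 1-D Gaussian mixture;
the synthetic data and the initial guesses are arbitrary, chosen only to illustrate the
E-step and M-step described above:

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])  # observed data

# Initialize parameters: means, variances, mixing weights.
mu, var, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # E-step: estimate responsibilities (posterior of the latent component).
    resp = pi * gaussian(x[:, None], mu, var)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: update the parameters using the estimated responsibilities.
    Nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(x)

print(mu, var, pi)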
9. Reinforcement Learning:
There are mainly three ways to implement reinforcement learning in ML, which are:
1. Value-based:
The value-based approach aims to find the optimal value function, which is the
maximum value achievable at a state under any policy. The agent therefore expects the
long-term return at any state s under policy π (a tiny value-iteration sketch follows this list).
2. Policy-based:
The policy-based approach finds the optimal policy for the maximum future rewards
without using the value function. In this approach, the agent tries to apply a policy
such that the action performed at each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) at any state.
o Stochastic: In this policy, probability determines the produced action.
3. Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that environment to learn it. There is no particular
solution or algorithm for this approach because the model representation is different for
each environment.
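A tiny sketch of the value-based idea: value iteration on a made-up 3-state, 2-action
MDP (the transition and reward tables are invented purely for illustration):

import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# P[s, a] = next state, R[s, a] = immediate reward (deterministic toy MDP).
P = np.array([[1, 2], [0, 2], [2, 0]])
R = np.array([[0.0, 1.0], [0.0, 2.0], [5.0, 0.0]])

V = np.zeros(n_states)
for _ in range(200):
    # Optimal value: the best achievable long-term return over all actions.
    Q = R + gamma * V[P]           # Q[s, a] = r(s, a) + gamma * V(next state)
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)          # greedy policy derived from the value function
print(V, policy)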
10. Model-Based Learning:
Model-based learning in machine learning is a technique that tries to generate a custom
solution for each new challenge.
This paradigm evolved as a result of a significant confluence of three main ideas:
• Factor graphs
• The Bayesian perspective
• Probabilistic programming
• Describe the Model: Using factor graphs, describe the process that created the data.
• Condition on Reported Data: Make the observed variables equal to their known
values.
• Perform Backward Reasoning: Backward reasoning is used to update the prior
distributions over the latent constructs or parameters; that is, estimate the Bayesian
probability distributions of the latent constructs based on the observed variables.
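A minimal sketch of the three steps above for a toy model: a coin's bias is a latent
variable with a Beta prior (describe the model), the observed flips are fixed to their
known values (condition on reported data), and Bayes' rule updates the prior into a
posterior (backward reasoning). The prior and the counts are invented for illustration:

from scipy import stats

# Describe the model: latent bias theta ~ Beta(2, 2); each flip ~ Bernoulli(theta).
prior_a, prior_b = 2, 2

# Condition on reported data: 7 heads and 3 tails were observed.
heads, tails = 7, 3

# Backward reasoning: a Beta prior with a Bernoulli likelihood gives a Beta posterior.
posterior = stats.beta(prior_a + heads, prior_b + tails)
print("posterior mean of theta:", posterior.mean())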