
UNIT III - UNSUPERVISED LEARNING AND REINFORCEMENT LEARNING

Introduction - Clustering Algorithms - K-Means - Hierarchical Clustering - Cluster Validity -
Dimensionality Reduction - Principal Component Analysis - Recommendation Systems - EM
Algorithm. Reinforcement Learning - Elements - Model-based Learning - Temporal Difference
Learning

1. Clustering Algorithms:
Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled
dataset. It can be defined as "a way of grouping the data points into different clusters
consisting of similar data points; the objects with possible similarities remain in a group
that has few or no similarities with any other group."

Some clustering algorithms that are widely used in machine learning are listed below (a short usage sketch follows the list):

1. K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It partitions the dataset by dividing the samples into different clusters of
roughly equal variance. The number of clusters must be specified in advance. It is fast,
requires relatively few computations, and has linear complexity O(n).
2. Mean-shift algorithm: The mean-shift algorithm tries to find dense areas in a smooth
density of data points. It is an example of a centroid-based model that works by
updating candidate centroids to be the mean of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model similar to the
mean-shift, but with some remarkable advantages. In this algorithm, the areas of high
density are separated by the areas of low density. Because of this, the clusters can be
found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as
an alternative to the k-means algorithm, or for cases where k-means may fail. In a
Gaussian Mixture Model (GMM), the data points are assumed to be Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The agglomerative hierarchical algorithm
performs bottom-up hierarchical clustering. Each data point is treated as a
single cluster at the outset, and clusters are then successively merged. The cluster
hierarchy can be represented as a tree structure.
6. Affinity Propagation: It differs from other clustering algorithms in that it does not
require the number of clusters to be specified. Pairs of data points exchange messages
until convergence. Its O(N²T) time complexity is the main drawback of this algorithm.
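
To make the list above concrete, here is a small, hedged Python sketch (not part of the original notes) showing how three of these algorithms can be run with scikit-learn; the synthetic dataset and the parameter values (number of clusters, eps, min_samples) are assumptions chosen purely for illustration.

Example (Python sketch):
# Illustrative sketch only: running K-Means, Mean-shift and DBSCAN with scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, MeanShift, DBSCAN

# Assumed toy dataset with three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
meanshift_labels = MeanShift().fit_predict(X)                  # estimates the number of clusters itself
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)  # label -1 marks noise points

print(set(kmeans_labels), set(meanshift_labels), set(dbscan_labels))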
2. K – Means:
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabelled
dataset into different clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters,
and repeats the process until it finds the best clusters. The value of k must be specified
in advance.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best values for the K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points that are nearest to a
particular k-center form a cluster.

Hence each cluster contains data points with some commonalities and is kept apart from the
other clusters.

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as centroids (they need not be points from the input dataset).

Step-3: Assign each data point to its closest centroid, which will form the predefined K
clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid
of its cluster.

Step-6: If any reassignment occurred, go back to step-4; otherwise go to FINISH.

Step-7: The model is ready.
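
The steps above can be written out as a short NumPy sketch (an illustrative, assumption-based example, not part of the original notes; the toy data, K = 3, and the iteration cap are arbitrary choices, and empty-cluster handling is omitted for brevity):

Example (Python sketch):
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1-2: choose K and pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        # Step 3 / Step 5: assign each data point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 6: if no reassignment occurred, the clusters are final
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: place each new centroid at the mean of its cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

X = np.random.default_rng(1).normal(size=(200, 2))   # assumed toy data
labels, centroids = k_means(X, k=3)                  # Step 7: the model is ready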

3. Hierarchical Clustering:

Hierarchical clustering is a connectivity-based clustering model that groups the data points
together that are close to each other based on the measure of similarity or distance.

A dendrogram, a tree-like figure produced by hierarchical clustering, depicts the hierarchical
relationships between groups. Individual data points are located at the bottom of the
dendrogram, while the largest clusters, which include all the data points, are located at the
top. In order to generate different numbers of clusters, the dendrogram can be sliced at
various heights.
The dendrogram is created by iteratively merging or splitting clusters based on a measure of
similarity or distance between data points. Clusters are divided or merged repeatedly until all
data points are contained within a single cluster, or until the predetermined number of clusters
is attained.

Types of Hierarchical Clustering

Basically, there are two types of hierarchical Clustering:


1. Agglomerative Clustering
2. Divisive clustering

Hierarchical Agglomerative Clustering


It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC).
It produces a structure that is more informative than the unstructured set of clusters returned
by flat clustering, and it does not require us to prespecify the number of clusters.
Bottom-up algorithms treat each data point as a singleton cluster at the outset and then
successively agglomerate pairs of clusters until all clusters have been merged into a single
cluster that contains all the data.
Algorithm:
given a dataset (d1, d2, d3, ..., dN) of size N
# compute the distance matrix
for i = 1 to N:
    # the distance matrix is symmetric about the primary diagonal,
    # so we compute only its lower-triangular part
    for j = 1 to i:
        dis_mat[i][j] = distance(d_i, d_j)
each data point starts as a singleton cluster
repeat
    merge the two clusters having the minimum distance
    update the distance matrix
until only a single cluster remains
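
A minimal Python sketch of the same bottom-up procedure using SciPy is shown below (an illustration under assumptions: the toy data, the "ward" merge criterion, and the cut into 3 clusters are arbitrary choices, not part of the original notes).

Example (Python sketch):
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))   # assumed toy data

# linkage() repeatedly merges the two closest clusters until one remains;
# Z records every merge and can be drawn as a dendrogram with scipy/matplotlib.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")      # slice the tree into 3 clusters
print(labels)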
Hierarchical Divisive Clustering
It is also known as the top-down approach. This algorithm likewise does not require us to
prespecify the number of clusters. Top-down clustering requires a method for splitting a
cluster that contains the whole data, and it proceeds by splitting clusters recursively until
individual data points have been split into singleton clusters.
Algorithm:
given a dataset (d1, d2, d3, ..., dN) of size N
at the top we have all the data in one cluster
the cluster is split using a flat clustering method, e.g., K-Means
repeat
    choose the best cluster among all the clusters to split
    split that cluster with the flat clustering algorithm
until each data point is in its own singleton cluster

4. Cluster Validity:
The term cluster validation is used to designate the procedure of evaluating the goodness of
clustering algorithm results. This is important to avoid finding patterns in random data, as
well as in situations where you want to compare two clustering algorithms.
Generally, clustering validation statistics can be categorized into three classes (a short
validation sketch follows the list):

1. Internal cluster validation, which uses the internal information of the clustering
process to evaluate the goodness of a clustering structure without reference to external
information. It can be also used for estimating the number of clusters and the
appropriate clustering algorithm without any external data.
2. External cluster validation, which consists in comparing the results of a cluster
analysis to an externally known result, such as externally provided class labels. It
measures the extent to which cluster labels match externally supplied class labels. Since
we know the “true” cluster number in advance, this approach is mainly used for
selecting the right clustering algorithm for a specific data set.
3. Relative cluster validation, which evaluates the clustering structure by varying
different parameter values for the same algorithm (e.g.,: varying the number of clusters
k). It’s generally used for determining the optimal number of clusters.
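
As a brief, hedged illustration (not from the original notes; the synthetic data and K = 3 are assumptions), internal validation can be computed with the silhouette coefficient and external validation with the adjusted Rand index in scikit-learn.

Example (Python sketch):
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, true_labels = make_blobs(n_samples=300, centers=3, random_state=0)  # assumed toy data
pred_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal validation: uses only the data and the predicted clustering
print("silhouette coefficient:", silhouette_score(X, pred_labels))
# External validation: compares the clustering with externally known labels
print("adjusted Rand index:", adjusted_rand_score(true_labels, pred_labels))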
5. Dimensionality Reduction:
Dimensionality reduction is a technique used to reduce the number of features in a dataset
while retaining as much of the important information as possible.
Dimensionality reduction can help to mitigate the problems caused by high-dimensional data
(the "curse of dimensionality") by reducing the complexity of the model and improving its
generalization performance. There are two main approaches to dimensionality reduction:
feature selection and feature extraction.
Feature Selection:
Feature selection involves selecting a subset of the original features that are most relevant to
the problem at hand.
Feature Extraction:
Feature extraction involves creating new features by combining or transforming the original
features. The goal is to create a set of features that captures the essence of the original data
in a lower-dimensional space.

Why is Dimensionality Reduction important in Machine Learning and Predictive Modeling?
An intuitive example of dimensionality reduction can be discussed through a simple e-mail
classification problem, where we need to classify whether the e-mail is spam or not. This can
involve a large number of features, such as whether or not the e-mail has a generic title, the
content of the e-mail, whether the e-mail uses a template, etc. However, some of these
features may overlap. In another condition, a classification problem that relies on both
humidity and rainfall can be collapsed into just one underlying feature, since both of the
aforementioned are correlated to a high degree. Hence, we can reduce the number of features
in such problems. A 3-D classification problem can be hard to visualize, whereas a 2-D one
can be mapped to a simple 2-dimensional space, and a 1-D problem to a simple line. For
example, a 3-D feature space can be split into two 2-D feature spaces and, if the resulting
features are found to be correlated, the number of features can be reduced even further.
Components of Dimensionality Reduction
There are two components of dimensionality reduction:
• Feature selection: In this, we try to find a subset of the original set of variables,
or features, to get a smaller subset which can be used to model the problem. It
usually involves three ways:
1. Filter
2. Wrapper
3. Embedded
• Feature extraction: This reduces the data in a high-dimensional space to a
lower-dimensional space, i.e., a space with fewer dimensions; a short sketch of both
components is given below.
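
For illustration only (a hedged sketch using scikit-learn; the Iris dataset, the F-test filter, and k = 2 components are assumptions, not part of the original notes), the two components might look like this in code.

Example (Python sketch):
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Feature selection (filter approach): keep the 2 original features that
# score highest on a univariate ANOVA F-test
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: build 2 new features as linear combinations of all
# original features
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)   # (150, 2) (150, 2)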

Methods of Dimensionality Reduction


The various methods used for dimensionality reduction include:
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)

Advantages of Dimensionality Reduction


• It helps in data compression and hence reduces the required storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.
• It improves data visualization.

Disadvantages of Dimensionality Reduction


• It may lead to some amount of data loss.
• Overfitting
• Sensitivity to outliers
• Computational complexity

6. Principal Component Analysis:


• Principal Component Analysis (PCA) is a statistical procedure that uses an
orthogonal transformation to convert a set of correlated variables into a set of
uncorrelated variables. PCA is the most widely used tool in exploratory data analysis
and in machine learning for predictive models.
• PCA is an unsupervised learning technique used to examine the interrelations
among a set of variables. It is also known as general factor analysis, where
regression determines a line of best fit.
• The main goal of Principal Component Analysis (PCA) is to reduce the
dimensionality of a dataset while preserving the most important patterns or
relationships between the variables without any prior knowledge of the target
variables.

Principal Component Analysis (PCA) is used to reduce the dimensionality of a data set by
finding a new set of variables, smaller than the original set of variables, retaining most of
the sample’s information, and useful for the regression and classification of data.
Principal Component Analysis

1. PCA is a technique for dimensionality reduction that identifies a set of orthogonal
axes, called principal components, that capture the maximum variance in the data.
The principal components are linear combinations of the original variables in the
dataset and are ordered in decreasing order of importance. The total variance
captured by all the principal components is equal to the total variance in the
original dataset.
2. The first principal component captures the most variation in the data, while the
second principal component captures the maximum variance that is orthogonal to
the first principal component, and so on.
3. PCA can be used for a variety of purposes, including data visualization, feature
selection, and data compression. In data visualization, PCA can be used to plot
high-dimensional data in two or three dimensions, making it easier to interpret. In
feature selection, PCA can be used to identify the most important variables in a
dataset. In data compression, PCA can be used to reduce the size of a dataset
without losing important information.
4. In PCA, it is assumed that the information is carried in the variance of the features;
that is, the higher the variation in a feature, the more information that feature
carries.
Overall, PCA is a powerful tool for data analysis and can help to simplify complex datasets,
making them easier to understand and work with.
Step-By-Step Explanation of Principal Component Analysis (PCA)

Standardization

First, we need to standardize our dataset to ensure that each variable has a mean of 0 and a
standard deviation of 1:

Z = \frac{X - \mu}{\sigma}

Here,

• \mu = \{\mu_1, \mu_2, \cdots, \mu_m\} is the mean of the independent features
• \sigma = \{\sigma_1, \sigma_2, \cdots, \sigma_m\} is the standard deviation of the independent features
Covariance Matrix Computation
Covariance measures the strength of joint variability between two or more variables,
indicating how much they change in relation to each other. To find the covariance between
two features x1 and x2, we can use the formula:

cov(x_1, x_2) = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{n - 1}

The value of the covariance can be positive, negative, or zero.

• Positive: as x1 increases, x2 also increases.
• Negative: as x1 increases, x2 decreases.
• Zero: no direct linear relation between x1 and x2.

Eigenvalues and Eigenvectors to Identify Principal Components

Let A be a square n \times n matrix and X be a non-zero vector for which

AX = \lambda X

for some scalar value \lambda. Then \lambda is known as an eigenvalue of the matrix A, and X is
known as the eigenvector of matrix A for the corresponding eigenvalue.
It can also be written as:

(A - \lambda I)X = 0

where I is the identity matrix of the same shape as matrix A. For a non-zero X, this condition
holds only if (A - \lambda I) is non-invertible (i.e., a singular matrix). That means

\left| A - \lambda I \right| = 0

From the above equation, we can find the eigenvalues \lambda, and the corresponding
eigenvectors can then be found using the equation (A - \lambda I)X = 0.
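
The standardization, covariance, and eigen-decomposition steps above can be tied together in a short NumPy sketch (an assumption-based illustration, not part of the original notes; the toy data and the choice of k = 2 components are arbitrary).

Example (Python sketch):
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 5))   # assumed toy data

# Step 1: standardization (zero mean, unit standard deviation per feature)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data
cov = np.cov(Z, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]             # order components by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project the data onto the first k principal components
k = 2
X_reduced = Z @ eigvecs[:, :k]
print(X_reduced.shape)                        # (100, 2)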

7. Recommendation Systems:

Recommender systems are systems that are designed to recommend things to the user based
on many different factors. These systems predict the products that users are most likely to
purchase or be interested in. Companies like Netflix, Amazon, etc. use recommender systems
to help their users identify the right products or movies for them.

The recommender system deals with a large volume of information by filtering out the most
important information based on the data provided by a user and other factors that capture the
user's preferences and interests. It finds the match between user and item and imputes the
similarities between users and items for recommendation. Both the users and the services
provided benefit from these kinds of systems, and the quality of the decision-making process
also improves through them.

Why use a recommendation system?

• It benefits users in finding items of their interest.
• It helps item providers deliver their items to the right users.
• It identifies the products that are most relevant to users.
• It personalizes content.
• It helps websites improve user engagement.

Types of Recommendation System

1. Popularity-Based Recommendation System

It is a type of recommendation system that works on the principle of popularity, i.e., anything
that is currently trending. These systems check which products or movies are trending or are
most popular among users and directly recommend those.

Merits of a popularity-based recommendation system

• It does not suffer from the cold-start problem, which means that even on day one of the
business it can recommend products using various filters.
• There is no need for the users' historical data.

Demerits of a popularity-based recommendation system

• It is not personalized.
• The system recommends the same set of products or movies, based solely on popularity,
to every user.
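
A minimal popularity-based recommender can be sketched with pandas (an illustration only; the toy ratings table, the column names, and the ranking rule are assumptions, not part of the original notes).

Example (Python sketch):
import pandas as pd

# Assumed toy ratings table: user_id, item_id, rating
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4],
    "item_id": ["A", "B", "A", "C", "A", "B", "B"],
    "rating":  [5, 4, 4, 3, 5, 2, 5],
})

# Rank items by how many users rated them and then by average rating;
# every user receives the same "most popular" list (not personalized).
popularity = (ratings.groupby("item_id")["rating"]
              .agg(count="count", mean="mean")
              .sort_values(["count", "mean"], ascending=False))
print(popularity.head(3))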

8. EM algorithm:
The Expectation-Maximization (EM) algorithm is used in various unsupervised machine
learning settings to determine the local maximum likelihood estimates (MLE) or maximum a
posteriori (MAP) estimates for unobservable variables in statistical models. In other words, it
is a technique for finding maximum likelihood estimates when latent variables are present;
such a model is also referred to as a latent variable model.

o Expectation step (E - step): It involves the estimation (guess) of all missing values in
the dataset so that after completing this step, there should not be any missing value.
o Maximization step (M - step): This step involves the use of estimated data in the E-
step and updating the parameters.
o Repeat E-step and M-step until the convergence of the values occurs.

Steps in EM Algorithm

The EM algorithm is completed mainly in 4 steps: the Initialization Step, the Expectation
Step, the Maximization Step, and the Convergence Step. These steps are explained as
follows:

o 1st Step: The very first step is to initialize the parameter values. Further, the system is
provided with incomplete observed data with the assumption that data is obtained from
a specific model.

o 2nd Step: This step is known as Expectation or E-Step, which is used to estimate or
guess the values of the missing or incomplete data using the observed data. Further, E-
step primarily updates the variables.
o 3rd Step: This step is known as Maximization or M-step, where we use complete data
obtained from the 2nd step to update the parameter values. Further, M-step primarily
updates the hypothesis.
o 4th Step: The last step is to check whether the values of the latent variables are
converging. If yes, stop the process; otherwise, repeat from step 2 until convergence
occurs.
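
For illustration (a hedged sketch, not part of the original notes; the synthetic data and the three mixture components are assumptions), scikit-learn's GaussianMixture runs exactly this initialize / E-step / M-step / convergence-check loop internally.

Example (Python sketch):
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # assumed toy data

# Fitting alternates E-steps (soft assignment of points to components) and
# M-steps (re-estimating means, covariances, and mixing weights) until the
# log-likelihood stops improving.
gmm = GaussianMixture(n_components=3, max_iter=100, random_state=0).fit(X)
print(gmm.converged_, gmm.n_iter_)         # convergence flag and iterations used
labels = gmm.predict(X)                    # hard cluster assignments
responsibilities = gmm.predict_proba(X)    # soft, E-step style assignments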

Advantages of EM algorithm

o It is very easy to implement.


o It often generates a solution for the M-step in the closed form.

Disadvantages of EM algorithm

o The convergence of the EM algorithm is very slow.


o It may converge only to a local optimum.

9. Reinforcement Learning:

o Reinforcement Learning is a feedback-based machine learning technique in which an
agent learns to behave in an environment by performing actions and observing the
results of those actions. For each good action, the agent gets positive feedback, and for
each bad action, the agent gets negative feedback or a penalty.
o In Reinforcement Learning, the agent learns automatically from feedback, without any
labeled data, unlike supervised learning.
o Since there is no labeled data, the agent is bound to learn from its experience only.
o RL solves a specific type of problem where decision making is sequential, and the goal
is long-term, such as game-playing, robotics, etc.
o The agent interacts with the environment and explores it by itself. The primary goal of
an agent in reinforcement learning is to improve the performance by getting the
maximum positive rewards.
o The agent learns through trial and error, and based on that experience, it learns to
perform the task in a better way. Hence, we can say that "Reinforcement learning
is a type of machine learning method where an intelligent agent (computer program)
interacts with the environment and learns to act within it." How a robotic dog
learns the movement of its arms is an example of reinforcement learning.
o It is a core part of Artificial Intelligence, and all AI agents work on the concept of
reinforcement learning. Here we do not need to pre-program the agent, as it learns from
its own experience without any human intervention.
o Example: Suppose there is an AI agent present within a maze environment, and its
goal is to find the diamond. The agent interacts with the environment by performing
some actions, and based on those actions, the state of the agent changes, and it also
receives a reward or penalty as feedback.
o The agent continues doing these three things (take an action, change state or remain in
the same state, and get feedback), and by doing so, it learns and explores the
environment.
o The agent learns which actions lead to positive feedback or rewards and which actions
lead to negative feedback or penalties. As a positive reward, the agent gets a positive
point, and as a penalty, it gets a negative point.

Approaches to implement Reinforcement Learning

There are mainly three ways to implement reinforcement learning in ML:

1. Value-based:
The value-based approach is about finding the optimal value function, which gives the
maximum value at a state under any policy. The agent expects the long-term return
at any state(s) under policy π.
2. Policy-based:
The policy-based approach tries to find the optimal policy for the maximum future rewards
without using the value function. In this approach, the agent tries to apply a policy
such that the action performed at each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) at any given state.
o Stochastic: In this policy, probabilities determine the action produced.
3. Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that environment to learn it. There is no particular
solution or algorithm for this approach because the model representation is different for
each environment.
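
To make the value-based approach concrete, here is a minimal tabular Q-learning sketch (an illustration under assumptions: the environment interface env.reset() / env.step() / env.actions is hypothetical, and the learning rate, discount factor, and epsilon values are arbitrary; it is not presented as the method described in these notes).

Example (Python sketch):
import random

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # `env` is a hypothetical environment exposing reset() -> state,
    # step(action) -> (next_state, reward, done), and a list env.actions.
    Q = {}  # (state, action) -> estimated long-term return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy choice over the current value estimates
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q.get((state, a), 0.0))
            next_state, reward, done = env.step(action)
            # Move Q(s, a) toward reward + gamma * max_a' Q(s', a')
            best_next = max(Q.get((next_state, a), 0.0) for a in env.actions)
            target = reward if done else reward + gamma * best_next
            old = Q.get((state, action), 0.0)
            Q[(state, action)] = old + alpha * (target - old)
            state = next_state
    return Q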

10. Model based Learning

Model-based learning in machine learning is a technique that tries to generate a custom
solution for each new problem.
This paradigm evolved as a result of a significant confluence of three main ideas:

• Factor graphs

• The Bayesian perspective

• Probabilistic programming

Model-Based ML Developmental Stages

Model-based ML development consists of three steps:

• Describe the Model: Using factor graphs, describe the process that generated the data.

• Condition on Observed Data: Make the observed variables equal to their known
values.

• Perform Backward Reasoning: Update the prior distributions over the latent constructs
or parameters, estimating the Bayesian probability distributions of the latent
constructs based on the observed variables.

11. Temporal Difference Learning:

Temporal Difference Learning is an unsupervised learning technique that is very commonly
used in reinforcement learning for the purpose of predicting the total reward expected over the
future. It can, however, be used to predict other quantities as well. It is essentially a way to
learn how to predict a quantity that depends on the future values of a given signal. It is a
method used to compute the long-term utility of a pattern of behaviour from a series of
intermediate rewards.

What is the benefit of temporal difference learning?

• TD learning methods are able to learn at each step, online or offline.
• These methods are capable of learning from incomplete sequences, which means that
they can also be used in continuing (non-episodic) problems.
• Temporal difference learning can function in non-terminating environments.

What are the disadvantages of temporal difference learning?

• It is more sensitive to initial values.
• Its estimates are biased.
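
A minimal TD(0) prediction sketch is given below (an illustration only; the episode format, the learning rate alpha, and the discount factor gamma are assumptions, not part of the original notes). It nudges each state-value estimate toward the bootstrapped target r + gamma * V(s') after every observed transition.

Example (Python sketch):
from collections import defaultdict

def td0_prediction(episodes, alpha=0.1, gamma=0.9):
    # `episodes` is assumed to be a list of episodes, each a list of
    # (state, reward, next_state) transitions observed under some fixed policy.
    V = defaultdict(float)   # estimated expected future (discounted) reward per state
    for episode in episodes:
        for state, reward, next_state in episode:
            # TD(0) update: learn at every step, without waiting for the episode to end
            td_target = reward + gamma * V[next_state]
            V[state] += alpha * (td_target - V[state])
    return V

# Hypothetical usage with two tiny hand-made episodes
episodes = [[("s0", 0, "s1"), ("s1", 1, "end")],
            [("s0", 0, "s1"), ("s1", 0, "end")]]
print(dict(td0_prediction(episodes)))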
