MLT Unit 3 Notes

The document discusses two major machine learning techniques: Unsupervised Learning and Reinforcement Learning. Unsupervised Learning focuses on uncovering patterns in unlabeled data through methods like clustering and dimensionality reduction, while Reinforcement Learning involves an agent learning to make decisions based on feedback from its environment. Additionally, it details various clustering algorithms, including K-Means and Hierarchical Clustering, outlining their processes, applications, and key concepts.


UNIT-III

UNSUPERVISED LEARNING AND REINFORCEMENT LEARNING

1. Introduction

Unsupervised Learning

Unsupervised learning is a branch of machine learning where the model is trained on data without labeled
responses. The objective is to uncover the underlying structure, patterns, and relationships within the data.

Key Concepts:

1. Data: Input data with no associated output labels.

2. Clustering: Grouping similar data points together.

3. Dimensionality Reduction: Reducing the number of variables while retaining essential information.

4. Anomaly Detection: Identifying outliers or rare events in data.

Common Algorithms:

 K-Means Clustering: Divides the dataset into K distinct, non-overlapping subsets (clusters).

 Hierarchical Clustering: Creates a tree of clusters based on data similarity.

 Principal Component Analysis (PCA): Reduces the dimensionality of data by transforming it into a new
set of orthogonal variables (principal components).

 Autoencoders: Neural networks that learn efficient codings of input data for dimensionality reduction or
feature learning.

Applications:

 Customer segmentation for targeted marketing.

 Market basket analysis to identify product associations.

 Image and video compression.

 Detection of fraudulent activities in financial transactions.

 Topic modeling in large text corpora.

Process:

1. Data Collection: Gather large volumes of unlabeled data.

2. Preprocessing: Clean and normalize the data to ensure consistency.

3. Model Training: Apply unsupervised learning algorithms to identify patterns or structures.

4. Evaluation: Analyze and interpret the output to extract valuable insights.
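As a hedged illustration of these four steps (assuming NumPy and scikit-learn are available; the data below is synthetic), the sketch normalizes a small unlabeled dataset, fits K-Means, and inspects the output:

```python
# Minimal sketch of the unsupervised learning process (synthetic data, scikit-learn assumed).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1. Data collection: unlabeled points (generated randomly here for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# 2. Preprocessing: normalize features for consistency
X_scaled = StandardScaler().fit_transform(X)

# 3. Model training: apply an unsupervised algorithm (K-Means with 2 clusters)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# 4. Evaluation: interpret the output (cluster sizes and centroids)
print("Cluster sizes:", np.bincount(model.labels_))
print("Centroids:\n", model.cluster_centers_)
```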

Reinforcement Learning

Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by
performing actions in an environment to maximize cumulative rewards. Unlike supervised learning, RL relies on
feedback from the environment rather than explicit instructions.

Key Concepts:

1. Agent: The learner or decision-maker.

2. Environment: The context within which the agent operates.

3. State: A specific situation in the environment at a given time.

4. Action: The set of all possible moves the agent can take.

5. Reward: Feedback from the environment based on the agent's actions.

6. Policy: The strategy used by the agent to decide actions based on states.

7. Value Function: A function that estimates the expected cumulative reward from a given state.

Common Algorithms:

 Q-Learning: An off-policy algorithm that learns the value of an action in a particular state.

 SARSA (State-Action-Reward-State-Action): An on-policy algorithm that updates the value function based on the action actually taken.

 Deep Q-Networks (DQN): Combines Q-learning with deep neural networks to handle complex, high-
dimensional environments.

 Policy Gradient Methods: Directly optimize the policy by adjusting parameters in the direction that
increases expected reward.

Applications:

 Robotics: Training robots to perform tasks through trial and error.

 Game Playing: Developing AI agents that can play and master complex games.

 Autonomous Vehicles: Enabling self-driving cars to navigate safely.

 Resource Management: Optimizing allocation of resources in operations and logistics.

 Finance: Creating algorithms for automated trading and portfolio management.

Process:

1. Initialization: Define the environment and initialize the agent.

2. Exploration: The agent explores the environment to understand the consequences of different actions.

3. Exploitation: The agent exploits its current knowledge to maximize rewards.

4. Learning: The agent updates its policy and value functions based on the feedback received from the
environment.

5. Evaluation and Optimization: Continuously assess and fine-tune the agent's performance to improve
outcomes.
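To make the explore-exploit-learn loop above concrete, here is a hedged sketch of tabular Q-learning on a toy five-state corridor. The environment, rewards, and hyper-parameters are all made up for illustration; this is not a standard benchmark or library API.

```python
# Toy tabular Q-learning sketch: a 5-state corridor where the goal is the right end.
import numpy as np

n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions)) # action-value table
alpha, gamma, epsilon = 0.1, 0.9, 0.3
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0                                        # start at the left end
    while s != n_states - 1:                     # until the goal state is reached
        # Exploration vs. exploitation (epsilon-greedy policy)
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the goal
        # Learning: Q-learning update toward reward plus discounted best future value
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.round(Q, 2))   # in each non-terminal state, the "right" action ends up with the higher Q-value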

2. CLUSTERING ALGORITHMS

What is Clustering?

The task of grouping data points based on their similarity with each other is called Clustering or Cluster Analysis. This method falls under the branch of Unsupervised Learning, which aims at gaining insights from unlabelled data points; that is, unlike supervised learning, we don't have a target variable.

Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It evaluates similarity using a metric like Euclidean distance, Cosine similarity, or Manhattan distance, and then groups the points with the highest similarity scores together.

For example, in the graph given below, we can clearly see that there are 3 circular clusters forming on the basis of distance.

It is not necessary that the clusters formed be circular in shape; the shape of clusters can be arbitrary, and there are many algorithms that work well at detecting arbitrarily shaped clusters.

For example, in the graph given below, the clusters formed are not circular in shape.

Types of Clustering

Broadly speaking, there are 2 types of clustering that can be performed to group similar data points:

 Hard Clustering: In this type of clustering, each data point belongs to a cluster completely or not. For
example, let's say there are 4 data points and we have to cluster them into 2 clusters. So each data point will belong either to cluster 1 or to cluster 2.

 Soft Clustering: In this type of clustering, instead of assigning each data point into a separate cluster, a
probability or likelihood of that point belonging to each cluster is evaluated. For example, let's say there are 4 data points and we have to cluster them into 2 clusters; we will then evaluate, for each data point, a probability of it belonging to each of the two clusters. This probability is calculated for all data points.

Types of Clustering Algorithms


At the surface level, clustering helps in the analysis of unstructured data. Graphing, the shortest distance, and the
density of the data points are a few of the elements that influence cluster formation. Clustering is the process of
determining how related the objects are based on a metric called the similarity measure. Similarity metrics are
easier to locate in smaller sets of features. It gets harder to create similarity measures as the number of features
increases. Depending on the type of clustering algorithm being utilized in data mining, several techniques are
employed to group the data from the datasets. In this part, the clustering techniques are described. Various types of
clustering algorithms are:

1. Centroid-based Clustering (Partitioning methods)

2. Density-based Clustering (Model-based methods)

3. Connectivity-based Clustering (Hierarchical clustering)

4. Distribution-based Clustering

1. Centroid-based Clustering (Partitioning methods)

Partitioning methods are the simplest clustering algorithms. They group data points on the basis of their closeness. Generally, the similarity measure chosen for these algorithms is Euclidean distance, Manhattan distance or Minkowski distance. The dataset is separated into a predetermined number of clusters, and each cluster is referenced by a vector of values; each input data point is compared with these vectors and joins the cluster whose vector it is closest to.

The primary drawback for these algorithms is the requirement that we establish the number of clusters, “k,” either
intuitively or scientifically (using the Elbow Method) before any clustering machine learning system starts
allocating the data points. Despite this, it is still the most popular type of clustering. K-means and K-
medoids clustering are some examples of this type of clustering.

2. Density-based Clustering (Model-based methods)

Density-based clustering, a model-based method, finds groups based on the density of data points. Contrary to
centroid-based clustering, which requires that the number of clusters be predefined and is sensitive to initialization,
density-based clustering determines the number of clusters automatically and is less susceptible to beginning
positions. They are great at handling clusters of different sizes and forms, making them ideally suited for datasets
with irregularly shaped or overlapping clusters. These methods manage both dense and sparse data regions by
focusing on local density and can distinguish clusters with a variety of morphologies.

In contrast, centroid-based grouping, like k-means, has trouble finding arbitrary shaped clusters. Due to its preset
number of cluster requirements and extreme sensitivity to the initial positioning of centroids, the outcomes can
vary. Furthermore, the tendency of centroid-based approaches to produce spherical or convex clusters restricts their
capacity to handle complicated or irregularly shaped clusters. In conclusion, density-based clustering overcomes
the drawbacks of centroid-based techniques by autonomously choosing cluster sizes, being resilient to
initialization, and successfully capturing clusters of various sizes and forms. The most popular density-based
clustering algorithm is DBSCAN.

3. Connectivity-based Clustering (Hierarchical clustering)

A method for assembling related data points into hierarchical clusters is called hierarchical clustering. Each data
point is initially taken into account as a separate cluster, which is subsequently combined with the clusters that are
the most similar to form one large cluster that contains all of the data points.

Think about how you may arrange a collection of items based on how similar they are. Each object begins as its
own cluster at the base of the tree when using hierarchical clustering, which creates a dendrogram, a tree-like
structure. The closest pairings of clusters are then combined into larger clusters after the algorithm examines how
similar the objects are to one another. When every object is in one cluster at the top of the tree, the merging process
has finished. Exploring various granularity levels is one of the fun things about hierarchical clustering. To obtain a
given number of clusters, you can select to cut the dendrogram at a particular height. The more similar two objects
are within a cluster, the closer they are. It’s comparable to classifying items according to their family trees, where
the nearest relatives are clustered together and the wider branches signify more general connections. There are 2
approaches for Hierarchical clustering:

 Divisive Clustering: It follows a top-down approach: here we consider all data points to be part of one big cluster, and then this cluster is divided into smaller groups.

 Agglomerative Clustering: It follows a bottom-up approach, here we consider all data points to be part of
individual clusters and then these clusters are clubbed together to make one big cluster with all data points.

4. Distribution-based Clustering

In distribution-based clustering, data points are grouped according to their propensity to have been generated by the same probability distribution (such as a Gaussian, binomial, or other) within the data. The data elements are grouped using a probability-based model built on statistical distributions: data objects with a higher likelihood of belonging to a cluster are included in it, and a data point is less likely to be included in a cluster the further it is from the cluster's central point, which exists in every cluster.

A notable drawback of density- and boundary-based approaches is the need, for some algorithms, to specify the number of clusters a priori and, for most algorithms, to define the cluster shape. At least one tuning parameter or hyper-parameter must be selected, and while doing so should be simple, getting it wrong could have unanticipated repercussions. Distribution-based clustering has a definite advantage over proximity- and centroid-based clustering approaches in terms of flexibility, accuracy, and cluster structure. The key issue is that, in order to avoid overfitting, many of these clustering methods work only with simulated or well-defined data, or when the bulk of the data points clearly belong to a preset distribution. The most popular distribution-based clustering algorithm is the Gaussian Mixture Model.

3. K-MEANS

K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning or data science. In this topic, we will learn what the K-means clustering algorithm is and how it works, along with a Python-style sketch of k-means clustering.

What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different
clusters. Here K defines the number of pre-defined clusters that need to be created in the process; for example, if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.

It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training labels.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is
to minimize the sum of distances between each data point and its corresponding cluster centroid.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative process.

o Assigns each data point to its closest k-center. Those data points which are near to the particular k-center,
create a cluster.

Hence each cluster contains data points with some commonalities and is far away from the other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as centroids. (They can be points other than those from the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
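Before the visual walkthrough, here is a hedged from-scratch sketch of these steps using only NumPy on synthetic data (a library such as scikit-learn would normally be used in practice):

```python
# From-scratch K-Means sketch following the steps above (NumPy only, synthetic data).
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(3, 0.5, (30, 2))])

K = 2
centroids = X[rng.choice(len(X), K, replace=False)]   # Step 2: random initial centroids

for _ in range(100):
    # Step 3: assign each point to its closest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: recompute each centroid as the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Steps 5-6: stop when the centroids stop moving (no reassignment changes them)
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("Final centroids:\n", centroids)
```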

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:

o Let's take the number of clusters to be K=2, so we will try to group the dataset into two different clusters.

o We need to choose some random K points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as K points, which are not part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute this by calculating the distance between two points. So, we will draw a median line (the perpendicular bisector) between both centroids. Consider the below image:

From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.

As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of the points in each cluster and place the new centroids there, as below:

o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same process of
finding a median line. The median will be like below image:

From the above image, we can see that one yellow point is on the left side of the line and two blue points are to the right of the line. So, these three points will be reassigned to new centroids.

As reassignment has taken place, so we will again go to the step-4, which is finding new centroids or K-points.

We will repeat the process by finding the center of gravity of each cluster's points, so the new centroids will be as shown in the below image:

o As we have got the new centroids, we will again draw the median line and reassign the data points. So, the image will be:

o We can see in the above image that no data points need to be reassigned on either side of the line, which means our model is formed. Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:

4. Hierarchical Clustering

Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group unlabeled datasets into clusters and is also known as hierarchical cluster analysis or HCA.

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known
as the dendrogram.

Sometimes the results of K-means clustering and hierarchical clustering may look similar, but they differ in how they work; in particular, there is no requirement to predetermine the number of clusters as we did in the K-Means algorithm.

The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts with taking all data
points as single clusters and merging them until one cluster is left.

2. Divisive: Divisive algorithm is the reverse of the agglomerative algorithm as it is a top-down approach.

Why hierarchical clustering?

As we already have other clustering algorithms such as K-Means Clustering, why do we need hierarchical clustering? As we have seen, K-means clustering has some challenges: it requires a predetermined number of clusters, and it always tries to create clusters of roughly the same size. To address these two challenges, we can opt for the hierarchical clustering algorithm, because in this algorithm we don't need to know the number of clusters in advance.

Agglomerative Hierarchical clustering

The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the data into clusters, it follows the bottom-up approach. This means the algorithm considers each data point as a single cluster at the beginning and then starts combining the closest pairs of clusters. It does this until all the clusters are merged into a single cluster that contains all the data points.

This hierarchy of clusters is represented in the form of the dendrogram.

How does Agglomerative Hierarchical Clustering Work?

The working of the AHC algorithm can be explained using the below steps:

o Step-1: Create each data point as a single cluster. Let's say there are N data points, so the number of
clusters will also be N.

o Step-2: Take two closest data points or clusters and merge them to form one cluster. So, there will now be
N-1 clusters.

o Step-3: Again, take the two closest clusters and merge them together to form one cluster. There will be N-2
clusters.

o Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following clusters. Consider the below images:

o Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to divide the
clusters as per the problem.

Measure for the distance between two clusters

As we have seen, how the distance between two clusters is measured is crucial for hierarchical clustering. There are various ways to calculate the distance between two clusters, and these ways decide the rule for clustering. These measures are called linkage methods. Some of the popular linkage methods are given below:
1. Single Linkage: It is the Shortest Distance between the closest points of the clusters. Consider the below
image:

2. Complete Linkage: It is the farthest distance between the two points of two different clusters. It is one of
the popular linkage methods as it forms tighter clusters than single-linkage.

3. Average Linkage: It is the linkage method in which the distances between each pair of data points (one from each cluster) are added up and then divided by the total number of pairs to calculate the average distance between two clusters. It is also one of the most popular linkage methods.

4. Centroid Linkage: It is the linkage method in which the distance between the centroids of the clusters is calculated. Consider the below image:

From the above-given approaches, we can apply any of them according to the type of problem or business
requirement.
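As a hedged illustration of how the linkage choice affects the result, the sketch below uses scikit-learn's AgglomerativeClustering on synthetic data; note that scikit-learn offers single, complete, average and Ward linkage (Ward in place of the centroid linkage described above):

```python
# Hedged sketch: comparing linkage methods with scikit-learn's AgglomerativeClustering.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])

for linkage in ["single", "complete", "average", "ward"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    print(linkage, "->", np.bincount(labels))   # cluster sizes under each linkage rule
```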

Working of the Dendrogram in Hierarchical Clustering

The dendrogram is a tree-like structure that is mainly used to record each merge step that the HC algorithm performs. In the dendrogram plot, the Y-axis shows the Euclidean distances between the data points (or clusters), and the X-axis shows all the data points of the given dataset.

The working of the dendrogram can be explained using the below diagram:

In the above diagram, the left part is showing how clusters are created in agglomerative clustering, and the right
part is showing the corresponding dendrogram.

o As we have discussed above, firstly, the data points P2 and P3 combine to form a cluster; correspondingly, a dendrogram link is created, which connects P2 and P3 with a rectangular shape. The height is decided according to the Euclidean distance between the data points.

o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram link is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.

o Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram, and P4, P5, and
P6, in another dendrogram.

o At last, the final dendrogram is created that combines all the data points together.

We can cut the dendrogram tree structure at any level as per our requirement.
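A hedged sketch of building a dendrogram and cutting it to obtain a chosen number of clusters, using SciPy on synthetic data:

```python
# Hedged sketch: building and cutting a dendrogram with SciPy (synthetic data assumed).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])

Z = linkage(X, method="average")                  # merge history; heights are the merge distances
labels = fcluster(Z, t=2, criterion="maxclust")   # "cut" the tree so that 2 clusters remain
print(labels)

# dendrogram(Z) draws the tree if a plotting backend (e.g. matplotlib) is available.
```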

5. CLUSTER VALIDITY

Cluster validity in machine learning is the process of determining the optimal number of clusters in a dataset. It's a
set of techniques that find a set of clusters that best fit natural partitions in a dataset without any a priori class
information. The outcome of the clustering process is validated by a cluster validity index.

For supervised classification we have a variety of measures to evaluate how good our model is

 Accuracy, precision, recall

For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters?

But “clusters are in the eye of the beholder”!

Then why do we want to evaluate them?

 To avoid finding patterns in noise

 To compare clustering algorithms

 To compare two sets of clusters

 To compare two clusters

Clustering algorithms will find clusters even in purely random data, which is why evaluation is needed.

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure
actually exists in the data.

2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class
labels.

3. Evaluating how well the results of a cluster analysis fit the data without reference to external information, i.e., using only the data.

4. Comparing the results of two different sets of cluster analyses to determine which is better.

5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual
clusters.

Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types.
External Index: Used to measure the extent to which cluster labels match externally supplied class labels, e.g., entropy.

Internal Index: Used to measure the goodness of a clustering structure without respect to external information, e.g., Sum of Squared Error (SSE).

Relative Index: Used to compare two different clusterings or clusters. Often an external or internal index is used for this function, e.g., SSE or entropy.

Sometimes these are referred to as criteria instead of indices

 However, sometimes criterion is the general strategy and index is the numerical measure that implements
the criterion.

Recall: evaluating K-means clusters

Most common measure is Sum of Squared Error (SSE)

For each point, the error is the distance to the nearest cluster

To get SSE, we square these errors and sum them:

SSE = Σ_i Σ_{x ∈ Ci} dist(mi, x)²

where x is a data point in cluster Ci and mi is the representative point for cluster Ci. It can be shown that mi corresponds to the center (mean) of the cluster.

Given two sets of clusters, we prefer the one with the smallest error

One easy way to reduce SSE is to increase K, the number of clusters

 A good clustering with smaller K can have a lower SSE than a poor clustering with higher K

Internal Measures: SSE

Clusters in more complicated figures aren’t well separated.

Internal Index: Used to measure the goodness of a clustering structure without respect to external information, e.g., SSE.

SSE is good for comparing two clusterings or two clusters (average SSE).

SSE can also be used to estimate the number of clusters

Look for the knee (elbow) in the curve of SSE versus the number of clusters to determine what might be the "best" number of clusters.
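A hedged sketch of this elbow heuristic (synthetic data with three well-separated groups; scikit-learn assumed), printing SSE for several values of K:

```python
# Hedged sketch of the elbow heuristic: print SSE (inertia) for several values of K.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 3, 6)])   # 3 "true" clusters

for k in range(1, 7):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"K={k}  SSE={sse:.1f}")   # the decrease flattens noticeably after K=3
```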

Internal Measures: Cohesion and Separation

Cluster Cohesion: Measures how closely related are objects in a cluster

 Example: SSE

Cluster Separation: Measure how distinct or well-separated a cluster is from other clusters

 Example: Squared Error, overall

Cohesion is measured by the within-cluster sum of squares: WSS = Σ_i Σ_{x ∈ Ci} (x − mi)²

Separation is measured by the between-cluster sum of squares: BSS = Σ_i |Ci| (m − mi)²

where |Ci| is the size of cluster i, mi is the mean of cluster i, and m is the overall mean of the data.

Note that BSS + WSS = constant (the total sum of squares).
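A small hedged check of this identity on arbitrary synthetic data and an arbitrary partition (NumPy assumed); here m is taken as the overall data mean:

```python
# Hedged check that BSS + WSS equals the total sum of squares (TSS) for any partition.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])   # an arbitrary partition into 3 clusters

m = X.mean(axis=0)                                # overall mean
wss = bss = 0.0
for k in np.unique(labels):
    Ck = X[labels == k]
    mk = Ck.mean(axis=0)                          # cluster mean m_i
    wss += ((Ck - mk) ** 2).sum()                 # within-cluster sum of squares
    bss += len(Ck) * ((mk - m) ** 2).sum()        # between-cluster sum of squares
tss = ((X - m) ** 2).sum()
print(np.isclose(wss + bss, tss))                 # True: BSS + WSS is constant (= TSS)
```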

A proximity graph based approach can also be used for cohesion and separation.

 Cluster cohesion is the sum of the weight of all links within a cluster.

 Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.

6. Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of features (or dimensions) in a dataset while
retaining as much information as possible. This can be done for a variety of reasons, such as to reduce the
complexity of a model, to improve the performance of a learning algorithm, or to make it easier to visualize the
data. There are several techniques for dimensionality reduction, including principal component analysis (PCA),
singular value decomposition (SVD), and linear discriminant analysis (LDA). Each technique uses a different
method to project the data onto a lower-dimensional space while preserving important information.

What is Dimensionality Reduction?

Dimensionality reduction is a technique used to reduce the number of features in a dataset while retaining as much
of the important information as possible. In other words, it is a process of transforming high-dimensional data into
a lower-dimensional space that still preserves the essence of the original data.

In machine learning, high-dimensional data refers to data with a large number of features or variables. The curse of
dimensionality is a common problem in machine learning, where the performance of the model deteriorates as the
number of features increases. This is because the complexity of the model increases with the number of features,
and it becomes more difficult to find a good solution. In addition, high-dimensional data can also lead to
overfitting, where the model fits the training data too closely and does not generalize well to new data.

Dimensionality reduction can help to mitigate these problems by reducing the complexity of the model and
improving its generalization performance. There are two main approaches to dimensionality reduction: feature
selection and feature extraction.

Feature Selection:
Feature selection involves selecting a subset of the original features that are most relevant to the problem at hand.
The goal is to reduce the dimensionality of the dataset while retaining the most important features. There are
several methods for feature selection, including filter methods, wrapper methods, and embedded methods. Filter
methods rank the features based on their relevance to the target variable, wrapper methods use the model
performance as the criteria for selecting features, and embedded methods combine feature selection with the model
training process.

Feature Extraction:
Feature extraction involves creating new features by combining or transforming the original features. The goal is to
create a set of features that captures the essence of the original data in a lower-dimensional space. There are several
methods for feature extraction, including principal component analysis (PCA), linear discriminant analysis (LDA),
and t-distributed stochastic neighbor embedding (t-SNE). PCA is a popular technique that projects the original
features onto a lower-dimensional space while preserving as much of the variance as possible.

Why is Dimensionality Reduction important in Machine Learning and Predictive Modeling?

An intuitive example of dimensionality reduction can be discussed through a simple e-mail classification problem,
where we need to classify whether the e-mail is spam or not. This can involve a large number of features, such as
whether or not the e-mail has a generic title, the content of the e-mail, whether the e-mail uses a template, etc.
However, some of these features may overlap. In another condition, a classification problem that relies on both
humidity and rainfall can be collapsed into just one underlying feature, since both of the aforementioned are
correlated to a high degree. Hence, we can reduce the number of features in such problems. A 3-D classification
problem can be hard to visualize, whereas a 2-D one can be mapped to a simple 2-dimensional space, and a 1-D
problem to a simple line. The below figure illustrates this concept, where a 3-D feature space is split into two 2-D
feature spaces, and later, if found to be correlated, the number of features can be reduced even further.

Components of Dimensionality Reduction

There are two components of dimensionality reduction:

 Feature selection: In this, we try to find a subset of the original set of variables, or features, to get a
smaller subset which can be used to model the problem. It usually involves three ways:

1. Filter

2. Wrapper

3. Embedded

 Feature extraction: This reduces the data in a high dimensional space to a lower dimension space, i.e. a
space with lesser no. of dimensions.

Methods of Dimensionality Reduction

The various methods used for dimensionality reduction include:

 Principal Component Analysis (PCA)

 Linear Discriminant Analysis (LDA)

 Generalized Discriminant Analysis (GDA)

Dimensionality reduction may be both linear and non-linear, depending upon the method used. The prime linear
method, called Principal Component Analysis, or PCA, is discussed below.

Principal Component Analysis

This method was introduced by Karl Pearson. It works on the condition that while the data in a higher dimensional
space is mapped to data in a lower dimension space, the variance of the data in the lower dimensional space should
be maximum.

It involves the following steps:

 Construct the covariance matrix of the data.

 Compute the eigenvectors of this matrix.

 Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of variance of
the original data.

Hence, we are left with a lesser number of eigenvectors, and there might have been some data loss in the process.
But, the most important variances should be retained by the remaining eigenvectors.
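A hedged sketch of exactly these steps with NumPy on synthetic correlated data (the mixing matrix is made up for illustration):

```python
# Hedged sketch of PCA via the covariance matrix and its eigenvectors (NumPy, synthetic data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.3, 0.1],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 0.1]])   # correlated 3-D data

Xc = X - X.mean(axis=0)                 # center the data
C = np.cov(Xc, rowvar=False)            # construct the covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # eigen-decomposition (eigh: C is symmetric)

order = np.argsort(eigvals)[::-1]       # keep eigenvectors with the largest eigenvalues
W = eigvecs[:, order[:2]]               # top-2 principal components
X_reduced = Xc @ W                      # project onto the lower-dimensional space

print("Explained variance ratio:", eigvals[order[:2]] / eigvals.sum())
```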

Advantages of Dimensionality Reduction

 It helps in data compression, and hence reduced storage space.

 It reduces computation time.

 It also helps remove redundant features, if any.

 Improved Visualization: High dimensional data is difficult to visualize, and dimensionality reduction
techniques can help in visualizing the data in 2D or 3D, which can help in better understanding and
analysis.

 Overfitting Prevention: High dimensional data may lead to overfitting in machine learning models, which
can lead to poor generalization performance. Dimensionality reduction can help in reducing the complexity
of the data, and hence prevent overfitting.

 Feature Extraction: Dimensionality reduction can help in extracting important features from high
dimensional data, which can be useful in feature selection for machine learning models.

 Data Preprocessing: Dimensionality reduction can be used as a preprocessing step before applying machine
learning algorithms to reduce the dimensionality of the data and hence improve the performance of the
model.

 Improved Performance: Dimensionality reduction can help in improving the performance of machine
learning models by reducing the complexity of the data, and hence reducing the noise and irrelevant
information in the data.

Disadvantages of Dimensionality Reduction

 It may lead to some amount of data loss.

 PCA tends to find linear correlations between variables, which is sometimes undesirable.

 PCA fails in cases where mean and covariance are not enough to define datasets.

 We may not know how many principal components to keep- in practice, some thumb rules are applied.

 Interpretability: The reduced dimensions may not be easily interpretable, and it may be difficult to
understand the relationship between the original features and the reduced dimensions.

 Overfitting: In some cases, dimensionality reduction may lead to overfitting, especially when the number
of components is chosen based on the training data.

 Sensitivity to outliers: Some dimensionality reduction techniques are sensitive to outliers, which can result
in a biased representation of the data.

 Computational complexity: Some dimensionality reduction techniques, such as manifold learning, can be
computationally intensive, especially when dealing with large datasets.

7. Principal Component Analysis

What is Principal Component Analysis (PCA)?

The Principal Component Analysis (PCA) technique was introduced by the mathematician Karl Pearson in 1901. It works on the condition that while the data in a higher-dimensional space is mapped to data in a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximum.

 Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation
that converts a set of correlated variables to a set of uncorrelated variables. PCA is the most widely used tool in exploratory data analysis and in machine learning for predictive models.

 Principal Component Analysis (PCA) is an unsupervised learning algorithm technique used to examine the
interrelations among a set of variables. It is also known as a general factor analysis where regression
determines a line of best fit.

 The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a dataset while
preserving the most important patterns or relationships between the variables without any prior knowledge
of the target variables.

Principal Component Analysis (PCA) is used to reduce the dimensionality of a data set by finding a new set of
variables, smaller than the original set of variables, retaining most of the sample’s information, and useful for
the regression and classification of data.

Principal Component Analysis

1. Principal Component Analysis (PCA) is a technique for dimensionality reduction that identifies a set of
orthogonal axes, called principal components, that capture the maximum variance in the data. The principal
components are linear combinations of the original variables in the dataset and are ordered in decreasing
order of importance. The total variance captured by all the principal components is equal to the total
variance in the original dataset.

2. The first principal component captures the most variation in the data, but the second principal component
captures the maximum variance that is orthogonal to the first principal component, and so on.

3. Principal Component Analysis can be used for a variety of purposes, including data visualization, feature
selection, and data compression. In data visualization, PCA can be used to plot high-dimensional data in
two or three dimensions, making it easier to interpret. In feature selection, PCA can be used to identify the
most important variables in a dataset. In data compression, PCA can be used to reduce the size of a dataset
without losing important information.

4. In Principal Component Analysis, it is assumed that the information is carried in the variance of the
features, that is, the higher the variation in a feature, the more information that feature carries.

Some common terms used in PCA algorithm:

Dimensionality: It is the number of features or variables present in the given dataset. More easily, it is the number
of columns present in the dataset.

Correlation: It signifies how strongly two variables are related to each other, i.e., if one changes, the other variable also changes. The correlation value ranges from -1 to +1: -1 occurs when the variables are inversely proportional to each other, and +1 indicates that the variables are directly proportional to each other.

Orthogonal: It defines that variables are not correlated to each other, and hence the correlation between the pair of
variables is zero.

Eigenvectors: Given a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is a scalar multiple of v.

Covariance Matrix: A matrix containing the covariance between the pair of variables is called the Covariance
Matrix.
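A hedged sketch tying these terms together with scikit-learn's PCA on synthetic data: the components come out ordered by decreasing explained variance and are orthogonal to each other:

```python
# Hedged sketch relating the PCA terms above (correlation, orthogonality, variance) to code.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
# Three features, two of which are strongly correlated (hypothetical synthetic data).
X = np.column_stack([x1, 0.8 * x1 + 0.2 * rng.normal(size=200), rng.normal(size=200)])

pca = PCA(n_components=2).fit(X)
print("Explained variance ratio (decreasing):", pca.explained_variance_ratio_)
# The principal components are orthogonal: their dot product is ~0.
print("Orthogonality check:", np.dot(pca.components_[0], pca.components_[1]).round(6))
```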

8. Recommendation Systems

Recommendation systems are a subclass of machine learning algorithms that aim to predict the preferences or
interests of users and suggest items that they might like. These systems have become integral to many online
platforms, enhancing user experience by personalizing content.

Key Concepts

1. Users and Items: The primary entities in a recommendation system. Users are individuals interacting with
the system, and items are the products, services, or content being recommended.

2. Preferences: Data representing user interactions with items, such as ratings, clicks, purchases, or viewing
history.

3. Cold Start Problem: The challenge of making recommendations for new users or items with little to no
interaction data.

Types of Recommendation Systems

1. Content-Based Filtering: Recommends items similar to those a user has liked in the past based on item
features.

 Example: If a user likes science fiction books, the system recommends other science fiction books.

 Features: Uses item attributes such as genre, author, or description.

2. Collaborative Filtering: Recommends items based on the preferences of similar users.

 User-Based Collaborative Filtering: Finds users similar to the target user and recommends items
they liked.

 Item-Based Collaborative Filtering: Finds items similar to those the target user has liked and
recommends them.

 Example: If two users have similar viewing histories on a streaming platform, recommendations
for one user might be based on the other user's preferences.

3. Hybrid Methods: Combine content-based and collaborative filtering to leverage the strengths of both
approaches.

 Example: A streaming service might use collaborative filtering for initial recommendations and
content-based filtering to refine them.

4. Matrix Factorization: A technique used in collaborative filtering to discover latent factors underlying
user-item interactions.

 Example: Singular Value Decomposition (SVD) decomposes the user-item interaction matrix into
lower-dimensional representations, facilitating predictions.

5. Deep Learning-Based Methods: Use neural networks to model complex user-item interactions.

 Example: Neural Collaborative Filtering (NCF) applies deep neural networks to learn user and
item representations.

Key Algorithms

1. k-Nearest Neighbors (k-NN): Used in collaborative filtering to find similar users or items.

2. Singular Value Decomposition (SVD): A matrix factorization technique for collaborative filtering.

3. Alternating Least Squares (ALS): Another matrix factorization method, particularly effective for large-
scale recommendation problems.

4. Autoencoders: Used for dimensionality reduction and feature learning in content-based recommendations.

5. Deep Neural Networks: Applied in hybrid and advanced recommendation systems to capture complex
patterns in user-item interactions.
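A hedged sketch of item-based collaborative filtering with cosine similarity on a tiny, entirely made-up ratings matrix (NumPy assumed):

```python
# Hedged sketch: item-based collaborative filtering with cosine similarity (toy ratings matrix).
import numpy as np

# Rows = users, columns = items; 0 means "not rated" (all values are hypothetical).
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

n_items = R.shape[1]
sim = np.array([[cosine(R[:, i], R[:, j]) for j in range(n_items)] for i in range(n_items)])

# Predict user 0's score for item 2 as a similarity-weighted average of their known ratings.
user, target = 0, 2
rated = np.where(R[user] > 0)[0]
pred = sim[target, rated] @ R[user, rated] / (sim[target, rated].sum() + 1e-9)
print(f"Predicted rating of user {user} for item {target}: {pred:.2f}")
```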

Applications

1. E-commerce: Personalized product recommendations based on user browsing and purchase history.

 Example: Amazon suggesting items frequently bought together.

2. Streaming Services: Movie, TV show, or music recommendations tailored to user preferences.

 Example: Netflix recommending shows based on viewing history.

3. Social Media: Suggesting friends, posts, or groups based on user interactions.

 Example: Facebook suggesting friends or pages to follow.

4. News Websites: Recommending articles based on user reading history and preferences.

 Example: Google News personalizing news feed for users.

5. Online Advertising: Displaying targeted ads based on user behavior and preferences.

 Example: Google Ads showing personalized advertisements.

Evaluation Metrics

1. Accuracy Metrics: Measure how closely the recommendations match user preferences.

 Precision: The proportion of recommended items that are relevant.

 Recall: The proportion of relevant items that are recommended.

 F1 Score: The harmonic mean of precision and recall.

2. Ranking Metrics: Evaluate the quality of the ranking of recommended items.

 Mean Average Precision (MAP): The average precision across all users.

 Normalized Discounted Cumulative Gain (NDCG): Measures the ranking quality of the
recommendations.

3. Diversity and Novelty: Ensure that recommendations are not only relevant but also diverse and novel to
the user.

 Diversity: The variety of different items recommended.

 Novelty: The proportion of recommendations that are new to the user.
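A hedged sketch of precision, recall and F1 for one user's top-K list; the item names and relevance judgments are hypothetical:

```python
# Hedged sketch of precision and recall for a single user's top-K recommendation list.
recommended = ["item_a", "item_b", "item_c", "item_d"]   # hypothetical top-4 list
relevant = {"item_b", "item_d", "item_e"}                # items the user actually liked

hits = sum(1 for item in recommended if item in relevant)
precision = hits / len(recommended)        # fraction of recommended items that are relevant
recall = hits / len(relevant)              # fraction of relevant items that were recommended
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 3))     # 0.5, 0.666..., 0.571
```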

9. EM algorithm

The EM algorithm, proposed by Arthur Dempster, Nan Laird, and Donald Rubin in 1977, is a latent variable method for finding the local maximum likelihood parameters of a statistical model. The EM (Expectation-Maximization) algorithm is one of the most commonly used techniques in machine learning to obtain maximum likelihood estimates of variables that are sometimes observable and sometimes not; such unobserved data are also called latent. It has various real-world applications in statistics, including obtaining the mode of the posterior marginal distribution of parameters in machine learning and data mining applications.

What is an EM algorithm?

The Expectation-Maximization (EM) algorithm is defined as the combination of various unsupervised machine
learning algorithms, which is used to determine the local maximum likelihood estimates (MLE) or maximum a
posteriori estimates (MAP) for unobservable variables in statistical models. Further, it is a technique to find
maximum likelihood estimation when the latent variables are present. It is also referred to as the latent variable
model.

A latent variable model consists of both observable and unobservable variables where observable can be predicted
while unobserved are inferred from the observed variable. These unobservable variables are known as latent
variables.

Key Points:

o It is known as the latent variable model to determine MLE and MAP parameters for latent variables.

o It is used to predict values of parameters in instances where data is missing or unobservable for learning,
and this is done until convergence of the values occurs.

EM Algorithm

The EM algorithm is the combination of various unsupervised ML algorithms, such as the k-means clustering
algorithm. Being an iterative approach, it consists of two modes. In the first mode, we estimate the missing or
latent variables. Hence it is referred to as the Expectation/estimation step (E-step). Further, the other mode is
used to optimize the parameters of the models so that it can explain the data more clearly. The second mode is
known as the maximization-step or M-step.

o Expectation step (E - step): It involves the estimation (guess) of all missing values in the dataset so that
after completing this step, there should not be any missing value.

o Maximization step (M - step): This step involves the use of estimated data in the E-step and updating the
parameters.

o Repeat E-step and M-step until the convergence of the values occurs.

The primary goal of the EM algorithm is to use the available observed data of the dataset to estimate the missing
data of the latent variables and then use that data to update the values of the parameters in the M-step.

What is Convergence in the EM algorithm?

Convergence is defined intuitively in terms of probability: if two successive estimates of the random variables have very little difference in their probability, they are said to have converged. In other words, whenever the values of the given variables stop changing and match each other, it is called convergence.

Steps in EM Algorithm

The EM algorithm is completed mainly in 4 steps, which include Initialization Step, Expectation Step,
Maximization Step, and convergence Step. These steps are explained as follows:

o 1st Step: The very first step is to initialize the parameter values. Further, the system is provided with
incomplete observed data with the assumption that data is obtained from a specific model.

o 2nd Step: This step is known as Expectation or E-Step, which is used to estimate or guess the values of the
missing or incomplete data using the observed data. Further, E-step primarily updates the variables.

o 3rd Step: This step is known as Maximization or M-step, where we use complete data obtained from the
2nd step to update the parameter values. Further, M-step primarily updates the hypothesis.

o 4th Step: The last step is to check whether the values of the latent variables are converging or not. If yes, stop the process; otherwise, repeat the process from step 2 until convergence occurs.
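A hedged sketch of these four steps for a one-dimensional mixture of two Gaussians, written from scratch with NumPy on synthetic data (in practice a library GMM estimator would normally be used):

```python
# Hedged sketch: the four EM steps for a 1-D mixture of two Gaussians (NumPy, synthetic data).
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])   # data from two processes

# Step 1: initialize the parameters (weights, means, variances)
w = np.array([0.5, 0.5]); mu = np.array([-1.0, 1.0]); var = np.array([1.0, 1.0])

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # Step 2 (E-step): responsibilities = posterior probability of each component per point
    r = np.stack([w[k] * gauss(x, mu[k], var[k]) for k in range(2)])
    r /= r.sum(axis=0)
    # Step 3 (M-step): update parameters using the responsibility-weighted data
    Nk = r.sum(axis=1)
    mu_new = (r * x).sum(axis=1) / Nk
    var_new = (r * (x - mu_new[:, None]) ** 2).sum(axis=1) / Nk
    w_new = Nk / len(x)
    # Step 4: check convergence of the estimates
    if np.allclose(mu_new, mu, atol=1e-6):
        break
    w, mu, var = w_new, mu_new, var_new

print("weights", w.round(2), "means", mu.round(2), "variances", var.round(2))
```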

Gaussian Mixture Model (GMM)

The Gaussian Mixture Model or GMM is defined as a mixture model that combines several probability distribution functions whose parameters are unspecified. Further, GMM requires estimates of statistics such as the mean and standard deviation of each component as its parameters. It is used to estimate the parameters of the probability distributions to best fit the density of a given training dataset. Although there are plenty of techniques available to estimate the parameters of the Gaussian Mixture Model (GMM), Maximum Likelihood Estimation is one of the most popular techniques among them.

Let's consider a case where we have a dataset with data points generated by two different processes. Both processes have similar Gaussian probability distributions and the data are combined, so it is very difficult to discriminate which distribution a given point belongs to.

The processes used to generate the data points represent a latent variable, i.e., unobservable data. In such cases, the Expectation-Maximization algorithm is one of the best techniques to estimate the parameters of the Gaussian distributions. In the EM algorithm, the E-step estimates the expected value for each latent variable, whereas the M-step optimizes them using Maximum Likelihood Estimation (MLE). This process is repeated until a good set of latent values and a maximum-likelihood fit to the data are achieved.

Applications of EM algorithm

The primary aim of the EM algorithm is to estimate the missing data in the latent variables through observed data
in datasets. The EM algorithm or latent variable model has a broad range of real-life applications in machine
learning. These are as follows:

o The EM algorithm is applicable in data clustering in machine learning.

o It is often used in computer vision and NLP (Natural language processing).

o It is used to estimate the values of parameters in mixture models such as the Gaussian Mixture Model, and in quantitative genetics.

o It is also used in psychometrics for estimating item parameters and latent abilities of item response theory
models.

o It is also applicable in the medical and healthcare industry, such as in image reconstruction and structural
engineering.

o It is used to determine the Gaussian density of a function.

Advantages of EM algorithm

o It is very easy to implement the two basic steps of the EM algorithm, the E-step and the M-step, in various machine learning problems.

o The likelihood is guaranteed not to decrease after each iteration.

o It often generates a solution for the M-step in the closed form.

Disadvantages of EM algorithm

o The convergence of the EM algorithm is very slow.

o It may converge only to a local optimum.

o It takes both forward and backward probabilities into consideration, in contrast to numerical optimization, which takes only forward probabilities.

10. Reinforcement Learning and elements

Reinforcement Learning (RL) in machine learning involves several core elements that define the process of
learning from interactions with an environment to achieve a goal. These elements are critical to understanding how
RL systems function and are typically structured within the framework of a Markov Decision Process (MDP). Here
are the key elements:

1. Agent

The agent is the decision-maker or learner in the reinforcement learning model. It interacts with the environment by
taking actions and receiving feedback in the form of rewards or penalties.

2. Environment

The environment is everything that the agent interacts with and learns from. It encompasses all the states and
dynamics that affect the agent's decisions and is often modeled as an MDP.

3. State (s)

A state represents a specific situation or configuration of the environment at a given time. The state captures all
relevant information needed by the agent to make a decision.

4. Action (a)

An action is a choice made by the agent that affects the state of the environment. Each action taken by the agent
leads to a transition to a new state, accompanied by a reward.

5. Reward (r)

A reward is the feedback from the environment following an action taken by the agent. It serves as a signal
indicating how good or bad the action was in terms of achieving the agent's goal. The objective of the agent is to
maximize cumulative rewards over time.

6. Policy (π)

The policy is a strategy or rule that the agent follows to choose actions based on the current state. It can be
deterministic (always choosing the same action for a given state) or stochastic (choosing actions based on a
probability distribution).

7. Value Function (V)

The value function estimates the expected cumulative reward (or value) of being in a particular state and following
a certain policy from that state onwards. It provides a measure of the long-term benefit of states.

8. Action-Value Function (Q)

The action-value function, or Q-function, estimates the expected cumulative reward of taking a specific action in a
particular state and then following a certain policy. It helps in evaluating the utility of specific actions in the
context of future rewards.

9. Markov Decision Process (MDP)

An MDP provides a formal mathematical framework for modeling decision-making problems where outcomes are
partly random and partly under the control of the agent. An MDP is defined by:

 States (S)

 Actions (A)

 Transition probabilities (P): The probability of transitioning from one state to another given an action.

 Reward function (R): The immediate reward received after transitioning from one state to another given
an action.
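A hedged sketch of an MDP written out as plain Python data, with a simple value-iteration loop over it; all states, actions, probabilities and rewards are made up for illustration:

```python
# Hedged sketch: a tiny MDP spec (S, A, P, R) as plain Python data, plus value iteration.
states = ["s0", "s1"]
actions = ["stay", "go"]

# Transition probabilities P[s][a] = {next_state: probability}
P = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.9, "s0": 0.1}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}}}

# Reward function R[s][a] = immediate reward
R = {"s0": {"stay": 0.0, "go": 1.0},
     "s1": {"stay": 2.0, "go": 0.0}}

gamma = 0.9
V = {s: 0.0 for s in states}
for _ in range(100):   # value iteration: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in actions)
         for s in states}
print(V)
```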

Core Algorithms in RL

1. Value-Based Methods

 Q-Learning: An off-policy algorithm that seeks to find the optimal action-selection policy by
learning the Q-values.

 SARSA (State-Action-Reward-State-Action): An on-policy algorithm that updates the action-
value function based on the action actually taken by the current policy.

2. Policy-Based Methods

 Policy Gradient Methods: These methods directly optimize the policy by adjusting the
parameters using gradient ascent on expected reward.

3. Actor-Critic Methods

 These methods combine value-based and policy-based approaches. The "actor" updates the policy
based on feedback from the "critic," which evaluates the actions using a value function.

4. Model-Based Methods

 These methods involve the agent building a model of the environment's dynamics and using it to
simulate and plan actions, often improving learning efficiency.

Application Examples

1. Gaming: Training AI to play games like Chess, Go, and video games, where the RL agents can surpass
human performance.

2. Robotics: Enabling robots to learn complex tasks like walking, object manipulation, and autonomous
navigation.

3. Finance: Developing trading strategies, portfolio management, and financial decision-making systems.

4. Healthcare: Optimizing treatment plans, drug discovery, and personalized medicine.

5. Autonomous Vehicles: Teaching self-driving cars to make decisions in real-time for navigation and safety.

Challenges

1. Exploration vs. Exploitation: Balancing the need to explore new actions to find better strategies with the
need to exploit known actions that yield high rewards.

2. Scalability: Handling large state and action spaces efficiently.

3. Sample Efficiency: Reducing the number of interactions needed to learn effective policies.

4. Safety and Robustness: Ensuring the agent's decisions are safe and reliable, especially in critical
applications.

Reinforcement Learning is a powerful paradigm in machine learning, allowing agents to learn complex behaviors
through interactions with their environments, with applications spanning various domains and industries.

11. Model-Based Learning

What is Model-Based Machine Learning?

Hundreds of learning algorithms have been developed in the field of machine learning. Scientists typically select from among these algorithms to address specific problems, and their options are frequently restricted by their familiarity with these algorithms. In this classical/traditional machine learning framework, scientists are forced to make some assumptions in order to employ an existing algorithm.

 Model-based machine learning (MBML) is a technique that tries to generate a custom solution for each new problem.
MBML’s purpose is to offer a single development framework that facilitates the building of a wide variety of custom models. This paradigm evolved from the confluence of three main ideas:

 Factor graphs

 The Bayesian perspective

 Probabilistic programming

The essential principle is that all assumptions about the problem domain are made explicit in the form of a model. In model-based machine learning, a model is simply a collection of assumptions expressed in a graphical form.

Factor Graphs

The use of probabilistic graphical models (PGMs), particularly factor graphs, is the pillar of MBML. A PGM is a graph-based diagrammatic representation of the joint probability distribution over all the random variables in a model.

Factor graphs are a form of PGM in which round nodes represent random variables and square nodes represent factors (local probability distributions), with edges expressing the dependencies between variables and factors. They offer a general framework for modeling the joint distribution of a set of random variables.

In factor graphs, we treat latent parameters as random variables and learn their probability distributions across the network using Bayesian inference techniques. Inference/learning reduces to taking products of factors over subsets of the graph’s variables (and summing out the rest), which makes it straightforward to derive local message-passing algorithms.
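
As a rough, self-contained illustration of how inference reduces to products of factors, the sketch below defines a toy factor graph over three binary variables whose joint distribution factorizes as p(x, y, z) ∝ f1(x, y)·f2(y, z) and computes the marginal of y by brute force. The factor tables are invented for the example; real MBML toolkits would use message passing on the graph rather than building the full joint.

    import numpy as np

    # Two hypothetical factors over binary variables: f1(x, y) and f2(y, z).
    f1 = np.array([[1.0, 2.0],    # f1[x, y]
                   [0.5, 1.5]])
    f2 = np.array([[2.0, 1.0],    # f2[y, z]
                   [1.0, 3.0]])

    # Unnormalized joint distribution: joint[x, y, z] = f1[x, y] * f2[y, z]
    joint = f1[:, :, None] * f2[None, :, :]

    # Marginal of y: sum out x and z, then normalize.
    p_y = joint.sum(axis=(0, 2))
    p_y /= p_y.sum()
    print("p(y):", p_y)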

Bayesian Methods

The first essential concept enabling this new machine learning architecture is Bayesian inference/learning. Latent/hidden parameters are represented in MBML as random variables with probability distributions, which provides a consistent and principled way to quantify uncertainty in model parameters. When the observed variables in the model are fixed to their observed values, Bayes’ theorem is used to update the previously assumed (prior) probability distributions.

In contrast, the classical ML framework assigns model parameters single point values obtained by maximizing an objective function. Bayesian inference on large models with millions of variables follows the same principle, Bayes’ theorem, but requires more sophisticated approximate methods, because exact Bayesian inference is intractable on huge datasets. The rise in the processing capacity of computers over the last decade has enabled the research and development of inference algorithms that can scale to enormous datasets.

Probabilistic Programming

Probabilistic programming (PP) is a breakthrough in computer science in which programming languages are designed to compute with uncertainty in addition to logic. Probabilistic programming languages can handle random variables, constraints between variables, and inference packages. You can express a model of your problem concisely with a few lines of code in a PP language; an inference engine is then invoked to produce the inference procedures that solve the problem automatically.

Model-Based ML Developmental Stages

Model-based ML development consists of three stages:

 Describe the Model: Using factor graphs, describe the process that created the data.

 Condition on Observed Data: Fix the observed variables to their known values.

 Perform Backward Reasoning: Update the prior distributions over the latent variables or parameters; that is, estimate the posterior probability distributions of the latent variables given the observed variables.
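
The sketch below walks through these three stages on a deliberately simple, made-up example: inferring the bias of a coin. The model (a Beta prior over the bias with Bernoulli flips), the observed data, and the conjugate closed-form update are all assumptions chosen so the posterior can be written down directly; a probabilistic programming language would derive the inference automatically instead.

    # Stage 1 - Describe the model (hypothetical toy model, not from these notes):
    #   bias ~ Beta(alpha=1, beta=1)   (uniform prior over the coin's bias)
    #   each flip ~ Bernoulli(bias)
    alpha_prior, beta_prior = 1.0, 1.0

    # Stage 2 - Condition on observed data: fix the observed variables to known values.
    observed_flips = [1, 1, 0, 1, 1, 0, 1]   # 1 = heads, 0 = tails (invented data)
    heads = sum(observed_flips)
    tails = len(observed_flips) - heads

    # Stage 3 - Backward reasoning: update the prior over the latent bias to its
    # posterior. For this conjugate model, Bayes' theorem gives a closed form:
    alpha_post = alpha_prior + heads
    beta_post = beta_prior + tails
    posterior_mean = alpha_post / (alpha_post + beta_post)
    print(f"Posterior over bias: Beta({alpha_post}, {beta_post}), mean = {posterior_mean:.3f}")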

12. Temporal Difference Learning

What is temporal difference learning?

Temporal Difference (TD) Learning is a technique, often described as unsupervised, that is very commonly used in reinforcement learning to predict the total reward expected over the future. TD methods can, however, be used to predict other quantities as well. TD learning is essentially a way to learn how to predict a quantity that depends on future values of a given signal, and it is a method for computing the long-term utility of a pattern of behaviour from a series of intermediate rewards.

Essentially, Temporal Difference Learning (TD Learning) focuses on predicting a variable's future value in a
sequence of states. Temporal difference learning was a major breakthrough in solving the problem of reward
prediction. You could say that it employs a mathematical trick that allows it to replace complicated reasoning with
a simple learning procedure that can be used to generate the very same results.

The trick is that rather than attempting to calculate the total future reward, temporal difference learning just
attempts to predict the combination of immediate reward and its own reward prediction at the next moment in time.
Now when the next moment comes and brings fresh information with it, the new prediction is compared with the
expected prediction. If these two predictions are different from each other, the Temporal Difference Learning
algorithm will calculate how different the predictions are from each other and make use of this temporal difference
to adjust the old prediction toward the new prediction.

The temporal difference algorithm always aims to bring the expected prediction and the new prediction together,
thus matching expectations with reality and gradually increasing the accuracy of the entire chain of prediction.

Temporal Difference Learning aims to predict a combination of the immediate reward and its own reward
prediction at the next moment in time.

In TD Learning, the training signal for a prediction is a future prediction. This method is a combination of the Monte Carlo (MC) method and the Dynamic Programming (DP) method. Monte Carlo methods adjust their estimates only after the final outcome is known, whereas temporal difference methods adjust predictions to match later, more accurate predictions of the future well before the final outcome is known. This is essentially a form of bootstrapping.

Temporal difference learning in machine learning got its name from the way it uses changes, or differences, in
predictions over successive time steps for the purpose of driving the learning process.

The prediction at any particular time step gets updated to bring it nearer to the prediction of the same quantity at
the next time step.

What are the parameters used in temporal difference learning?

Parameters used in temporal difference learning

 Alpha (α): learning rate


It shows how much our estimates should be adjusted, based on the error. This rate varies between 0 and 1.

 Gamma (γ): the discount rate


This indicates how much future rewards are valued. A larger discount rate signifies that future rewards are
valued to a greater extent. The discount rate also varies between 0 and 1.

 Epsilon (e): the exploration rate, reflecting the trade-off between exploration and exploitation.


This involves exploring new actions with probability e and exploiting the current best-known action with probability 1 − e. A larger e signifies that more exploration is carried out during training.
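
To show how the exploration parameter e is typically used, here is a minimal epsilon-greedy action-selection sketch. The Q-value table and action names are placeholders invented for the example.

    import random

    def epsilon_greedy(q_values, epsilon):
        # Pick a random action with probability epsilon (explore),
        # otherwise pick the action with the highest estimated value (exploit).
        if random.random() < epsilon:
            return random.choice(list(q_values))
        return max(q_values, key=q_values.get)

    # Hypothetical Q-values for three actions in some state:
    q_values = {"left": 0.2, "right": 0.5, "stay": 0.1}
    print(epsilon_greedy(q_values, epsilon=0.1))   # usually "right", occasionally random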

What is the benefit of temporal difference learning?

The advantages of temporal difference learning in machine learning are:

 TD learning methods are able to learn at each step, online or offline.

 These methods are capable of learning from incomplete sequences, which means that they can also be used
in continuous problems.

 Temporal difference learning can function in non-terminating environments.

 TD Learning has less variance than the Monte Carlo method, because it depends on only a single random action, transition, and reward.

 It tends to be more efficient than the Monte Carlo method.

 Temporal Difference Learning exploits the Markov property, which makes it more effective in Markov
environments.

What are the disadvantages of temporal difference learning?

There are two main disadvantages:

 It is more sensitive to the initial value estimates.

 It produces biased estimates.

What is the temporal difference error?

The TD error arises in various forms throughout reinforcement learning and is commonly written as

δt = rt+1 + γV(st+1) − V(st).

Here the TD error δt is the difference between the current estimate V(st) and the quantity rt+1 + γV(st+1), i.e. the actual reward gained from transitioning between st and st+1 plus the discounted value estimate of the next state. The TD error at each time step is the error in the estimate made at that time. Because the TD error at step t relies on the next state and next reward, it is not available until step t + 1. When we update the value function with the TD error, the update is called a backup. The TD error is closely related to the Bellman equation.
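
Putting this together, the following sketch performs one TD(0) backup driven by the TD error δt = rt+1 + γV(st+1) − V(st). The state names, the reward, and the hyperparameter values are made-up placeholders, not part of the source material.

    from collections import defaultdict

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
        # One TD(0) backup: V(s) <- V(s) + alpha * delta,
        # where delta = r + gamma * V(s_next) - V(s) is the TD error.
        delta = r + gamma * V[s_next] - V[s]
        V[s] += alpha * delta
        return delta

    # Hypothetical usage on a single observed transition (s0 -> s1, reward 1.0):
    V = defaultdict(float)
    delta = td0_update(V, s="s0", r=1.0, s_next="s1")
    print(delta, V["s0"])    # prints 1.0 and 0.1 after one backup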

What is the difference between Q-learning & Temporal Difference Learning?

Temporal Difference Learning is a method for learning to predict a quantity that depends on future values of a given signal. It can be used to learn both the V-function and the Q-function, whereas Q-learning is a specific TD algorithm used to learn the Q-function. If you have only the V-function, you can still derive the Q-function by iterating over all possible next states and combining the immediate reward with the discounted value of the resulting state, then choosing the action that leads to the state with the highest V-value; this, however, requires a model of the transitions, as sketched below.
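
A minimal sketch of that derivation, assuming the transition probabilities and rewards are known (the dictionary layout and numbers are hypothetical, mirroring the toy MDP sketch earlier in this unit):

    def q_from_v(V, s, a, P, R, gamma=0.9):
        # Q(s, a) = R(s, a) + gamma * sum over s' of P(s' | s, a) * V(s')
        return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())

    # Hypothetical model and state values:
    P = {"s0": {"move": {"s0": 0.2, "s1": 0.8}}}
    R = {"s0": {"move": 1.0}}
    V = {"s0": 1.0, "s1": 2.0}
    print(q_from_v(V, "s0", "move", P, R))   # 1.0 + 0.9 * (0.2*1.0 + 0.8*2.0) = 2.62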

In model-free RL, you do not learn the state-transition function (the model); you rely only on samples. However, you might also be interested in learning the model, for example when you cannot collect many real samples and want to generate some virtual ones. In that case, we talk about model-based RL. Model-based RL is quite common in robotics, where you cannot run many real trials without risking damage to the robot. This model-free/model-based distinction is separate from the distinction between TD learning and Q-learning discussed above.

What are the different algorithms in temporal difference learning?

There are predominantly three categories of TD algorithms:

1. TD(1) Algorithm

2. TD(0) Algorithm

3. TD(λ) Algorithm

