UNIT IV: LEARNING
Bayes rule provides us with a way to update our beliefs based on the arrival of
new, relevant pieces of evidence.
For example, suppose we want to estimate the probability that a person has cancer; at first, all we have to go on is the rate of cancer in the general population. However, given additional evidence, such as the fact that the person is a smoker, we can update our probability, since the probability of having cancer is higher given that the person is a smoker.
The Rule
The rule has a very simple derivation that follows directly from the relationship between joint and conditional probabilities. First, note that P(A,B) = P(A|B)P(B) = P(B,A) = P(B|A)P(A). Equating the two right-hand sides and dividing through by P(B) gives Bayes rule:

P(A|B) = P(B|A)P(A) / P(B)

P(A) and P(B) are the probabilities of A and B without regard to each other.
Example
Using the cancer diagnosis example, we can show that Bayes rule allows us to obtain a much better estimate. Putting some made-up numbers into the example lets us assess the difference that Bayes rule makes.
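The notes' own numbers are not reproduced here, so the minimal sketch below uses purely illustrative values (the prior, smoker rate, and likelihood are all assumptions) to show how the posterior is computed from Bayes rule.

```python
# Bayes rule with made-up numbers; every probability below is an
# illustrative assumption, not a value from the original notes.

p_cancer = 0.01               # prior: P(cancer)
p_smoker = 0.20               # evidence: P(smoker)
p_smoker_given_cancer = 0.60  # likelihood: P(smoker | cancer)

# P(cancer | smoker) = P(smoker | cancer) * P(cancer) / P(smoker)
p_cancer_given_smoker = p_smoker_given_cancer * p_cancer / p_smoker
print(f"P(cancer | smoker) = {p_cancer_given_smoker:.3f}")  # 0.030
```

With these values the posterior (0.03) is three times the prior (0.01), which is exactly the kind of update Bayes rule makes once the evidence is observed.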
Bayesian Network
A Bayesian network (also known as a Bayes network, belief network,
or decision network) is a probabilistic graphical model that represents a set
of variables and their conditional dependencies via a directed acyclic
graph (DAG).
Bayesian networks are ideal for taking an event that occurred and predicting
the likelihood that any one of several possible known causes was the
contributing factor.
For example, a Bayesian network could represent the probabilistic
relationships between diseases and symptoms.
A Bayesian network graph is made up of nodes and arcs (directed links), where each node represents a random variable and each arc represents a direct conditional dependency between the variables it connects.
The Bayesian network graph does not contain any cycles; hence it is known as a directed acyclic graph, or DAG.
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm responds reliably to a burglary, but it also responds to minor earthquakes. Harry has two neighbors, David and Sophia, who have taken on the responsibility of informing Harry at work when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he confuses the telephone ringing with the alarm and calls then too. On the other hand, Sophia likes to listen to loud music, so she sometimes fails to hear the alarm. Here we would like to compute the probability of the burglar alarm sounding.
Problem:
Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both David and Sophia have called Harry.
Solution:
The Bayesian network for the above problem is given below. The network structure shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the alarm going off, whereas David's and Sophia's calls depend only on Alarm.
The network thus represents our assumptions that the neighbors do not directly perceive the burglary, do not notice minor earthquakes, and do not confer before calling.
The conditional distribution for each node is given as a conditional probability table, or CPT.
Each row in a CPT must sum to 1 because the entries in the row represent an exhaustive set of cases for the variable.
List of all events occurring in this network:
Burglary (B)
Earthquake(E)
Alarm(A)
David Calls(D)
Sophia calls(S)
We can write the event of the problem statement as the probability P[D, S, A, B, E]. Using the network structure, this joint probability distribution can be rewritten as the product:

P[D, S, A, B, E] = P[D|A] · P[S|A] · P[A|B, E] · P[B] · P[E]
Let's take the observed probabilities for the Burglary and Earthquake components:
P(B = True) = 0.002, which is the probability of a burglary.
P(B = False) = 0.998, which is the probability of no burglary.
P(E = True) = 0.001, which is the probability of a minor earthquake.
P(E = False) = 0.999, which is the probability that no earthquake occurred.
We can provide the conditional probabilities as per the tables below:
From the formula for the joint distribution, we can write the problem statement as a probability expression:

P(S, D, A, ¬B, ¬E) = P(S|A) · P(D|A) · P(A|¬B, ¬E) · P(¬B) · P(¬E)
= 0.75 × 0.91 × 0.001 × 0.998 × 0.999
= 0.00068045
Hence, a Bayesian network can answer any query about the domain by using the joint distribution, as in the sketch below.
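A minimal sketch of this query. P(B) and P(E) come from the notes; the remaining CPT entries (P(A|¬B,¬E) = 0.001, P(D|A) = 0.91, P(S|A) = 0.75) are assumed from the commonly used version of this example, since the notes' tables are not reproduced here; they yield exactly the stated result.

```python
# Joint-distribution query P(S, D, A, ~B, ~E) for the burglar-alarm network.
p_not_b = 0.998     # P(B = False), from the notes
p_not_e = 0.999     # P(E = False), from the notes
p_a = 0.001         # P(A = True | B = False, E = False)  (assumed CPT entry)
p_d_given_a = 0.91  # P(D = True | A = True)              (assumed CPT entry)
p_s_given_a = 0.75  # P(S = True | A = True)              (assumed CPT entry)

# P(S, D, A, ~B, ~E) = P(S|A) * P(D|A) * P(A|~B,~E) * P(~B) * P(~E)
p = p_s_given_a * p_d_given_a * p_a * p_not_b * p_not_e
print(f"{p:.8f}")  # 0.00068045
```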
There are two ways to understand the semantics of a Bayesian network: as a representation of the joint probability distribution over all of its variables, or as an encoding of a collection of conditional independence statements.
Hidden Markov Model (HMM)
Hidden Markov Models (HMMs) are probabilistic models in which the Markov model underlying the data is hidden or unknown. More specifically, we only know the observational data, not information about the states.
[Figure: from Markov chain (left) to hidden Markov model (right), where S = states, y = possible observations, P = state transition probabilities, and b = observation probabilities.]
The start probability: a vector containing, for each state, the probability that it is the first state of the sequence.
Three classical problems are associated with HMMs:
1. Given the model parameters and the observation sequence, estimate the most likely (hidden) state sequence; this is called the decoding problem.
2. Given the model parameters and the observation sequence, find the probability of the observation sequence under the given model. This process involves a maximum-likelihood estimate of the attributes and is sometimes called the evaluation problem.
3. Given the observation sequences y1, y2, …, estimate the model parameters that maximize the probability of y. This is called the learning or optimization problem.
HMM Example
The largest issue we face when trying to apply predictive techniques to asset returns is that the time series is non-stationary. In other words, the expected mean and volatility of asset returns change over time. Most time-series models and techniques assume that the data are stationary, which is a major weakness of these models.
Now, let's frame this problem differently: we know that time series exhibit temporary periods where the expected means and variances are stable through time. These periods, or regimes, can be associated with the hidden states of an HMM. Based on this assumption, all we need are observable variables whose behavior allows us to infer the true hidden states.
In the finance world, if we can better estimate an asset’s most likely regime,
including the associated means and variances, then our predictive models become
more adaptable and will likely improve. Furthermore, we can use the estimated
regime parameters for better scenario analysis.
In this example, the observable variables I use are the underlying asset returns, the ICE BofA US High Yield Index Total Return Index, the Ted Spread, the 10-year minus 2-year constant maturity spread, and the 10-year minus 3-month constant maturity spread; the goal is to find the hidden states.
Now we try to model the hidden states of GE stock using two methods: sklearn's GaussianMixture and hmmlearn's GaussianHMM. Both models require us to specify the number of components to fit to the time series; we can think of these components as regimes. For this specific example, I will assign three components and assume them to correspond to high, neutral, and low volatility, as in the sketch below.
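A minimal sketch of fitting both models. The data-loading step is hypothetical: X stands in for the (n_samples, n_features) array of observables described above, so random placeholder data is used here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # placeholder for the returns and spread series

# Three components, interpreted as high-, neutral-, and low-volatility regimes.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
gmm_regimes = gmm.predict(X)                 # regime label per observation

hmm = GaussianHMM(n_components=3, covariance_type="full",
                  n_iter=100, random_state=0).fit(X)
hmm_regimes = hmm.predict(X)                 # most likely hidden state sequence
```

Unlike the mixture model, the HMM also learns transition probabilities between regimes, which makes its inferred regime sequence temporally coherent.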
1. Learning agents
2. Inductive learning
Learning
Learning is useful as a system-construction method, i.e., expose the agent to reality rather than trying to write it down.
Learning element
Design of a learning element is affected by
Which components of the performance element are to be learned
What feedback is available to learn these components
What representation is used for the components
Type of feedback:
Supervised learning: correct answers for each example
Unsupervised learning: correct answers not given
Reinforcement learning: occasional rewards
Inductive learning
Simplest form: learn a function from examples
f is the target function
An example is a pair (x, f(x))
Problem: find a hypothesis h
such that h ≈ f
given a training set of examples
(This is a highly simplified model of real learning:
Ignores prior knowledge
Assumes examples are given)
Inductive learning method
Construct/adjust h to agree with f on training set
(h is consistent if it agrees with f on all examples)
E.g., curve fitting, as in the sketch below:
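As a concrete (made-up) illustration, this sketch constructs hypotheses h of increasing polynomial degree and measures how well each agrees with f on the training set.

```python
import numpy as np

# Made-up training examples (x, f(x)).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 4.2, 8.8, 16.1])

for degree in (1, 2, 3):
    h = np.poly1d(np.polyfit(x, y, degree))  # hypothesis: degree-d polynomial
    sse = np.sum((h(x) - y) ** 2)            # disagreement with f on the set
    print(f"degree {degree}: sum of squared errors = {sse:.3f}")
```

Higher-degree hypotheses agree with the training set more closely, but agreement on the training set alone does not make a hypothesis the best choice.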
Forms of Learning
Machine Learning (ML) is automated learning with little or no human intervention. It involves programming computers so that they learn from available inputs. The main purpose of machine learning is to explore and construct algorithms that can learn from previous data and make predictions on new input data.
Concepts of Learning
Supervised Learning
Unsupervised Learning
Semi-supervised Learning
Similarly, there are four categories of machine learning algorithms: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
However, the most commonly used ones are supervised and unsupervised learning.
Supervised Learning
Supervised learning is commonly used in real world applications, such as face and
speech recognition, products or movie recommendations, and sales forecasting.
Supervised learning can be further classified into two types: regression and classification.
Supervised learning deals with learning a function from available training data.
Here, a learning algorithm analyzes the training data and produces a derived
function that can be used for mapping new examples. There are
many supervised learning algorithms such as Logistic Regression, Neural
networks, Support Vector Machines (SVMs), and Naive Bayes classifiers.
Learning Decision Trees
Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
In a decision tree, there are two kinds of nodes: decision nodes and leaf nodes. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
The decisions or tests are performed on the basis of the features of the given dataset.
It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
A decision tree simply asks a question and, based on the answer (Yes/No), splits further into subtrees.
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further after reaching a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
A classic example of where a decision tree is used is the Play Tennis problem.
If the outlook is sunny and humidity is normal, then yes, you may play tennis.
ID3 Algorithm
Although there are various decision tree learning algorithms, we will explore the
Iterative Dichotomiser 3 or commonly known as ID3. ID3 was invented by Ross
Quinlan.
Entropy
Entropy measures the randomness (impurity) of the outcome: Entropy(S) = -Σ p_i log2(p_i), where p_i is the proportion of examples in S that belong to class i.
Information Gain
Information gain measures the reduction in entropy achieved by splitting on a feature A: Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) Entropy(S_v), where S_v is the subset of S with value v for feature A.
We decided to make the first decision on the basis of Outlook. We could have based our first decision on Humidity or Wind, but we chose Outlook. Why? Because splitting on Outlook reduces the randomness in the outcome (whether to play or not) more than splitting on Humidity or Wind would.
Let's understand this with an example; please refer to the Play Tennis dataset above.
We have data for 14 days. We have only two outcomes :
Either we played tennis or we didn’t play.
In the given 14 days, we played tennis on 9 occasions and we did not play on 5
occasions.
Probability of playing tennis = 9/14 = 0.643
Probability of not playing tennis = 5/14 = 0.357
And now the entropy of the outcome:
Entropy = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
So the entropy of the whole system, before we ask our first question, is 0.940.
Now, we have four features on which to base the decision:
Outlook
Temperature
Windy
Humidity
Let’s see what happens to entropy when we make our first
decision on the basis of Outlook.
Outlook
If we make a decision tree division at level 0 based on Outlook, three branches are possible: Sunny, Overcast, or Rain.
1. Sunny: In the given data, 5 days were sunny. Among those 5 days, tennis was played on 2 days and not played on 3 days. What is the entropy here?
Probability of playing tennis = 2/5 = 0.4
Probability of not playing tennis = 3/5 = 0.6
Entropy when sunny = -0.4 × log2(0.4) - 0.6 × log2(0.6) = 0.97
2. Overcast: In the given data, 4 days were overcast, and tennis was played on all four days.
Probability of playing tennis = 4/4 = 1
Probability of not playing tennis = 0/4 = 0
Entropy when overcast = 0.0
3. Rain: In the given data, 5 days were rainy. Among those 5 days, tennis was played on 3 days and not played on 2 days. What is the entropy here?
Probability of not playing tennis = 2/5 = 0.4
Probability of playing tennis = 3/5 = 0.6
Entropy when rainy = -0.4 × log2(0.4) - 0.6 × log2(0.6) = 0.97
Temperature
Windy
Humidity
The expected entropy after splitting on Outlook is the weighted average (5/14)(0.97) + (4/14)(0.0) + (5/14)(0.97) ≈ 0.69, so the information gain is 0.940 - 0.69 ≈ 0.25. Repeating this computation for the other features shows that the decrease in randomness, i.e., the information gain, is greatest for Outlook. So we choose Outlook as the first decision node; the sketch below reproduces these numbers.
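A minimal sketch that reproduces the numbers above (0.940 for the whole system and about 0.247 for the information gain of Outlook):

```python
import math

def entropy(counts):
    """Entropy of an outcome given the class counts: -sum p * log2(p)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

base = entropy([9, 5])  # 9 days played, 5 not played -> 0.940

# Outlook splits the 14 days into Sunny (2+, 3-), Overcast (4+, 0-), Rain (3+, 2-).
splits = [[2, 3], [4, 0], [3, 2]]
weighted = sum(sum(s) / 14 * entropy(s) for s in splits)

print(f"entropy = {base:.3f}, gain(Outlook) = {base - weighted:.3f}")
# entropy = 0.940, gain(Outlook) = 0.247
```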
A decision tree may have an overfitting issue, which can be resolved using the Random Forest algorithm.
For more class labels, the computational complexity of the decision tree may increase.
Regression and Classification with Linear Models
Regression
For classification the output(s) is nominal
In regression the output is continuous
Function Approximation
Many models could be used – Simplest is linear regression
Fit data with the best hyper-plane which "goes through"
the points
For each point, the difference between the predicted value and the actual observation is the residual.
For now, assume just one (input) independent variable x and one (output) dependent variable y.
Thus, find the line which minimizes the sum of the squared residuals (i.e., least squares), as in the sketch below.
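A minimal least-squares sketch with made-up (x, y) points; np.polyfit with degree 1 returns the slope and intercept of the line minimizing the sum of squared residuals.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up independent variable
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])   # made-up dependent variable

slope, intercept = np.polyfit(x, y, 1)    # best-fit line
residuals = y - (slope * x + intercept)   # predicted vs. actual, per point
sse = np.sum(residuals ** 2)              # the quantity least squares minimizes

print(f"y = {slope:.2f}x + {intercept:.2f}, SSE = {sse:.3f}")
```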
Intelligibility
One nice advantage of linear regression models (and linear classification) is the potential to look at the coefficients to gain insight into which input variables are most important in predicting the output.
For comparably scaled inputs, the variables whose coefficients have the largest magnitude have the strongest association with the output.
A large positive coefficient implies that the output will increase when this input is increased (positively correlated).
A large negative coefficient implies that the output will decrease when this input is increased (negatively correlated).
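A minimal sketch of this kind of inspection, using synthetic data with known coefficients (the feature names and numbers are hypothetical); with comparably scaled inputs, the fitted coefficients recover which inputs matter most and in which direction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # three standardized inputs
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(size=200)

model = LinearRegression().fit(X, y)
for name, coef in zip(["x1", "x2", "x3"], model.coef_):
    print(f"{name}: {coef:+.2f}")  # large |coef| -> more influential input
```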
Summary
Linear regression and logistic regression are nice tools for many simple situations.
But both force us to fit the data with one shape (a line or a sigmoid), which will often underfit.
They give intelligible results.
When the problem includes more arbitrary non-linearity, we need more powerful models, which we will introduce later.
Though non-linear data transformations can help in these cases while still using a linear model for learning.
These models are commonly used in data mining applications and also as a "first attempt" at understanding data trends, indicators, etc.
Artificial Neural Networks
Introduction
History
1943: McCulloch and Pitts model neural networks based on their understanding of neurology; their threshold units could compute simple logical functions such as "a or b" and "a and b".
1950s: the perceptron: input nodes, an output node, and an association layer.
1969: Minsky and Papert showed that a single-layer perceptron cannot learn the XOR of two binary inputs.
ANN
Promises
Neuron Model
Natural neurons
A natural neuron sends out spikes of electrical activity through an axon, which splits into thousands of branches.
At the end of each branch, a synapse converts the activity into either excitatory or inhibitory activity of a dendrite at another neuron.
Abstract neuron model:
Bias Nodes
Forward propagation
Firing rules:
Threshold rules: the unit fires (outputs 1) when the weighted sum of its inputs plus the bias exceeds a threshold, as in the sketch below.
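A minimal sketch of the abstract neuron with a threshold firing rule; the weights and bias are chosen (as an illustrative assumption) so that the unit computes the logical AND from the history section above.

```python
import numpy as np

def neuron(x, w, b):
    """Threshold unit: fire (output 1) when w . x + b > 0, else output 0."""
    return int(np.dot(w, x) + b > 0)

w = np.array([1.0, 1.0])  # input weights
b = -1.5                  # bias node contribution

for a in (0, 1):
    for c in (0, 1):
        print(a, c, "->", neuron(np.array([a, c]), w, b))  # computes a AND c
```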
ANN Applications
Pattern recognition
Network attacks
Breast cancer
Handwriting recognition
Pattern completion
Auto-association
Noise reduction
Compression
Finding anomalies
Introduction to SVMs:
Given a set of training examples, each marked as belonging to one or the other of
two categories, an SVM training algorithm builds a model that assigns new
examples to one category or the other, making it a non-probabilistic binary linear
classifier.
In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Then we perform classification by finding the hyper-plane that best differentiates the two classes.
Support vectors are simply the coordinates of the individual observations that lie closest to the frontier. The SVM classifier is the frontier that best segregates the two classes (a hyper-plane, or a line in two dimensions); a sketch is given below.
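A minimal sketch with made-up, linearly separable points; sklearn's SVC exposes the fitted support vectors and classifies new examples relative to the learned frontier.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],   # class 0 (made-up points)
              [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)       # the observations that define the frontier
print(clf.predict([[4.0, 4.0]]))  # assign a new example to one of the classes
```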
Statistical Learning
Statistical Approaches
1. Statistical Learning
2. Naïve Bayes
3. Instance-based Learning
4. Neural Networks
Bayesian Learning
Expectation-Maximization Algorithm
Algorithm:
1. Given a set of incomplete data, start with a set of initialized parameter values.
2. The next step is the "Expectation" step, or E-step. In this step, we use the observed data to estimate or guess the values of the missing or incomplete data; it is basically used to update the variables.
3. The next step is the "Maximization" step, or M-step. In this step, we use the complete data generated in the preceding E-step to update the values of the parameters; it is basically used to update the hypothesis.
4. In the fourth step, we check whether the values are converging. If they are, we stop; otherwise we repeat step 2 and step 3, i.e., the E-step and the M-step, until convergence occurs.
Usage of EM algorithm: it can be used to fill in missing data in a sample, and it serves as the basis for estimating the parameters of latent-variable models such as Gaussian mixtures.
Advantages of EM algorithm: the E-step and M-step are often quite easy to implement for many problems.
Disadvantages of EM algorithm: it has slow convergence, and it may converge only to a local optimum. A minimal sketch of the E-step and M-step is given below.
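A minimal EM sketch for a two-component 1D Gaussian mixture with fixed unit variances; the data and starting values are made up, but the loop shows the E-step (responsibilities) and M-step (parameter updates) described above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

mu = np.array([-1.0, 1.0])  # initial guesses for the component means
pi = np.array([0.5, 0.5])   # initial mixing weights

for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    # (the Gaussian normalizing constant cancels because variances are equal).
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update the parameters from the expected assignments.
    pi = resp.mean(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)  # converges toward the true component means (about 0 and 5)
```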
Supervised learning: a teacher provides the correct answer for each training example.
Reinforcement learning: when the agent acts on its environment, it receives some evaluation of its action (reinforcement), but it is not told which action is the correct one for achieving its goal.
Task
Examples
Game playing: the player knows whether it won or lost, but not how it should have moved at each step.
Control: a traffic system can measure the delay of cars, but does not know how to decrease it.
The agent's task: to find an optimal policy, mapping states to actions, that maximizes a long-run measure of the reinforcement.
Think of reinforcement as reward.
Passive learning
The agent simply watches the world go by and tries to learn the utilities of being in various states.
Active learning
The agent must also act using the learned information, deciding which actions to take and trading off exploration of the environment against exploitation of what it already knows.
So far we have assumed that all the functions learned by the agent (U, T, R, Q) are represented in tabular form, as in the sketch below.
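A minimal sketch of that tabular form: a Q table over (state, action) pairs updated with the standard Q-learning rule; the two-state environment and the transition shown are hypothetical.

```python
import numpy as np

n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))  # tabular Q function
alpha, gamma = 0.1, 0.9              # learning rate and discount factor

def update(s, a, reward, s_next):
    """Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])

# One hypothetical transition: action 1 in state 0 gives reward 1, lands in state 1.
update(s=0, a=1, reward=1.0, s_next=1)
print(Q)
```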