
UNIT IV

LEARNING


Probability basics - Bayes Rule and its Applications

Bayes rule provides us with a way to update our beliefs based on the arrival of
new, relevant pieces of evidence.

For example, if we were trying to estimate the probability that a given person has cancer, we would initially just say it is whatever percent of the population has cancer. However, given additional evidence, such as the fact that the person is a smoker, we can update our probability, since the probability of having cancer is higher given that the person is a smoker. This allows us to use prior knowledge to improve our probability estimates.

The Rule

Bayes rule is:

P(A | B) = P(B | A) P(A) / P(B)

The rule has a very simple derivation that follows directly from the relationship between joint and conditional probabilities. First, note that P(A, B) = P(A | B) P(B) = P(B, A) = P(B | A) P(A). Dividing both sides by P(B) gives the rule.

Where A and B are events:

P(A) and P(B) are the probabilities of A and B without regard to each other.

P(A | B), a conditional probability, is the probability of observing event A given that B is true.

P(B | A) is the probability of observing event B given that A is true.


Example

Using the cancer diagnosis example, we can show that Bayes rule allows us to obtain a much better estimate. We will put some made-up numbers into the example so we can assess the difference that Bayes rule makes.

Assume that the probability of having cancer is 0.05, meaning that 5% of people have cancer. Now, assume that the probability of being a smoker is 0.10, meaning that 10% of people are smokers, and that 20% of people with cancer are smokers, so P(smoker | cancer) = 0.20.

Initially, our probability for cancer is simply our prior, so 0.05. However, using the new evidence, we can instead calculate

P(cancer | smoker) = P(smoker | cancer) * P(cancer) / P(smoker) = (0.20 * 0.05) / 0.10 = 0.10.
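
As a quick sanity check, the update can be computed directly in a few lines of Python; this is just the arithmetic above, not any particular library's API.

```python
# Bayes rule update for the cancer/smoker example (made-up numbers from above).
p_cancer = 0.05               # prior P(cancer)
p_smoker = 0.10               # evidence P(smoker)
p_smoker_given_cancer = 0.20  # likelihood P(smoker | cancer)

p_cancer_given_smoker = p_smoker_given_cancer * p_cancer / p_smoker
print(p_cancer_given_smoker)  # 0.1, double the prior of 0.05
```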


Bayesian Network
A Bayesian network (also known as a Bayes network, belief network,
or decision network) is a probabilistic graphical model that represents a set
of variables and their conditional dependencies via a directed acyclic
graph (DAG).
Bayesian networks are ideal for taking an event that occurred and predicting
the likelihood that any one of several possible known causes was the
contributing factor.
For example, a Bayesian network could represent the probabilistic
relationships between diseases and symptoms.
A Bayesian network graph is made up of nodes and arcs (directed links), where:

Each node corresponds to a random variable, which can be continuous or discrete.

Arcs (directed arrows) represent causal relationships or conditional probabilities between random variables; each directed link connects a pair of nodes in the graph.

For example, consider a network whose nodes are the random variables A, B, C, and D. If node B is connected to node A by a directed arrow from A to B, then node A is called the parent of node B. Node C is independent of node A.


A Bayesian network graph does not contain any cycles; hence, it is known as a directed acyclic graph, or DAG.

Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm reliably detects a burglary, but it also responds to minor earthquakes. Harry has two neighbors, David and Sophia, who have taken responsibility for informing Harry at work when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he confuses the phone ringing with the alarm and calls then too. On the other hand, Sophia likes to listen to loud music, so she sometimes misses the alarm. Here we would like to compute a probability in this burglar alarm domain.

Problem:

Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both David and Sophia have called Harry.

Solution:

The Bayesian network for the above problem is given below. The network structure shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the alarm going off, while David's and Sophia's calls depend only on Alarm.

The network also encodes our assumptions: the neighbors do not directly perceive the burglary, do not notice minor earthquakes, and do not confer with each other before calling.

The conditional distribution for each node is given as a conditional probability table, or CPT.

Each row in a CPT must sum to 1, because the entries in the row represent an exhaustive set of cases for the variable.

In a CPT, a Boolean variable with k Boolean parents has 2^k rows, one for each combination of parent values. Hence, if there are two parents, the CPT contains four rows of probability values.

List of all events occurring in this network:

Burglary (B)

Earthquake (E)

Alarm (A)

David calls (D)

Sophia calls (S)

We can write the events of the problem statement as the joint probability P[D, S, A, B, E], and expand it using the chain rule together with the conditional independences encoded in the network:

P[D, S, A, B, E] = P[D | S, A, B, E] . P[S, A, B, E]

= P[D | S, A, B, E] . P[S | A, B, E] . P[A, B, E]

= P[D | A] . P[S | A, B, E] . P[A, B, E]

= P[D | A] . P[S | A] . P[A | B, E] . P[B, E]

= P[D | A] . P[S | A] . P[A | B, E] . P[B | E] . P[E]

Since B and E are independent, P[B | E] = P[B].

Let's take the observed prior probabilities for the Burglary and Earthquake components:

P(B = True) = 0.002, the probability of a burglary.
P(B = False) = 0.998, the probability of no burglary.
P(E = True) = 0.001, the probability of a minor earthquake.
P(E = False) = 0.999, the probability that no earthquake occurred.

We can provide the conditional probabilities as per the tables below.

Conditional probability table for Alarm (A):

The conditional probability of Alarm depends on Burglary and Earthquake (note that each row sums to 1):

B      E      P(A = True)  P(A = False)
True   True   0.94         0.06
True   False  0.95         0.05
False  True   0.31         0.69
False  False  0.001        0.999

Conditional probability table for David calls (D):

The conditional probability that David calls depends on the probability of Alarm:

A      P(D = True)  P(D = False)
True   0.91         0.09
False  0.05         0.95

Conditional probability table for Sophia calls (S):

The conditional probability that Sophia calls depends on her parent node, Alarm:

A      P(S = True)  P(S = False)
True   0.75         0.25
False  0.02         0.98

From the formula for the joint distribution, we can write the problem statement as:

P(S, D, A, ¬B, ¬E) = P(S | A) * P(D | A) * P(A | ¬B, ¬E) * P(¬B) * P(¬E)

= 0.75 * 0.91 * 0.001 * 0.998 * 0.999

= 0.00068045

Hence, a Bayesian network can answer any query about the domain by using the joint distribution.
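
The same product can be checked numerically; the sketch below simply multiplies the relevant CPT entries from the tables above.

```python
# Evaluate P(S, D, A, ¬B, ¬E) by multiplying CPT entries along the
# factorization derived above.
p_not_b = 0.998        # P(¬B)
p_not_e = 0.999        # P(¬E)
p_a = 0.001            # P(A | ¬B, ¬E)
p_d_given_a = 0.91     # P(D | A)
p_s_given_a = 0.75     # P(S | A)

p = p_s_given_a * p_d_given_a * p_a * p_not_b * p_not_e
print(round(p, 8))     # ≈ 0.00068045
```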

The semantics of the Bayesian network:

There are two ways to understand the semantics of a Bayesian network:

1. As a representation of the joint probability distribution. This view is helpful for understanding how to construct the network.

2. As an encoding of a collection of conditional independence statements. This view is helpful for designing inference procedures.

Limitations of Bayesian Networks

They typically require initial knowledge of many probabilities; the quality and extent of prior knowledge play an important role.

Inference has significant computational cost (it is an NP-hard task).

Events not anticipated in the model are not accounted for.

Hidden Markov Model (HMM)

Hidden Markov Models (HMMs) are probabilistic models in which the Markov model underlying the data is hidden or unknown: we know only the observational data, not the state sequence that produced it.

Figure: from a Markov chain (left) to a Hidden Markov Model (right), where S = states, y = possible observations, P = state transition probabilities, and b = observation probabilities.

An HMM is determined by three model parameters:

The start probabilities: a vector containing, for each state, the probability of it being the first state of the sequence.

The state transition probabilities: a matrix consisting of the probabilities of transitioning from state Si to state Sj.

The observation probabilities: the likelihood of a certain observation y when the model is in state Si.

HMMs can be used to solve four fundamental problems:

1. Given the model parameters and the observation sequence, estimate the most likely (hidden) state sequence. This is called a decoding problem.

2. Given the model parameters and the observation sequence, find the probability of the observation sequence under the given model. This involves a maximum likelihood estimate and is sometimes called an evaluation problem.

3. Given observation sequences, estimate the model parameters. This is called a training problem.

4. Estimate the observation sequences y1, y2, … and the model parameters that maximize the probability of y. This is called a learning or optimization problem.

HMM Example

Use case of HMMs in quantitative finance

The largest issue we face when trying to apply predictive techniques to asset returns is that the time series is non-stationary. In other words, the expected mean and volatility of asset returns change over time. Most time series models and techniques assume that the data is stationary, which is a major weakness of these models.

Now, let's frame this problem differently: we know that the time series exhibits temporary periods where the expected means and variances are stable through time. These periods, or regimes, can be associated with the hidden states of an HMM. Based on this assumption, all we need are observable variables whose behavior allows us to infer the true hidden states.

In the finance world, if we can better estimate an asset’s most likely regime,
including the associated means and variances, then our predictive models become
more adaptable and will likely improve. Furthermore, we can use the estimated
regime parameters for better scenario analysis.

In this example, the observable variables I use are the underlying asset returns, the ICE BofA US High Yield Index Total Return Index, the TED spread, the 10-year minus 2-year constant maturity spread, and the 10-year minus 3-month constant maturity spread; the goal is to find the hidden states.

Now we try to model the hidden states of GE stock using two methods: sklearn's GaussianMixture and hmmlearn's GaussianHMM. Both models require us to specify the number of components used to fit the time series; we can think of these components as regimes. For this specific example, I will assign three components and assume them to correspond to high, neutral, and low volatility.
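
A minimal sketch of the hmmlearn side of this workflow follows; the synthetic `returns` array is a stand-in for the real observable variables, so treat the numbers as placeholders.

```python
# Regime detection sketch with hmmlearn's GaussianHMM (3 components = regimes).
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
returns = rng.normal(0, 1, size=(500, 1))   # placeholder for real market data

model = GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
model.fit(returns)
states = model.predict(returns)             # most likely regime per observation

for i in range(model.n_components):
    subset = returns[states == i]
    print(f"regime {i}: mean={subset.mean():.4f}, var={subset.var():.4f}")
```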

Learning from Observations

1. Learning agents

2. Inductive learning

3. Decision tree learning

Learning

Learning is essential for unknown environments, i.e., when the designer lacks omniscience.

Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it all down.

Learning modifies the agent's decision mechanisms to improve performance.
Learning agents

Learning element

The design of a learning element is affected by:
Which components of the performance element are to be learned
What feedback is available to learn these components
What representation is used for the components

Types of feedback:
Supervised learning: correct answers for each example
Unsupervised learning: correct answers not given
Reinforcement learning: occasional rewards
Inductive learning

Simplest form: learn a function from examples.
f is the target function; an example is a pair (x, f(x)).
Problem: find a hypothesis h such that h ≈ f, given a training set of examples.
(This is a highly simplified model of real learning: it ignores prior knowledge and assumes the examples are given.)

Inductive learning method

Construct/adjust h to agree with f on the training set.
(h is consistent if it agrees with f on all examples.)
E.g., curve fitting, as in the sketch below:
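
As an illustration of "adjusting h to agree with f on the training set", here is a small, self-contained curve-fitting sketch; the data points are invented for the example.

```python
# Inductive learning as curve fitting: pick a hypothesis space (polynomials
# of a given degree) and adjust h to agree with f on the training examples.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.7, 5.8, 11.1, 17.9])    # training pairs (x, f(x)), noisy

for degree in (1, 2):
    h = np.poly1d(np.polyfit(x, y, degree))  # fit h within this hypothesis space
    sse = np.sum((h(x) - y) ** 2)            # disagreement with f on the data
    print(f"degree {degree}: SSE = {sse:.3f}")
```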
Forms of Learning

Machine Learning (ML) is automated learning with little or no human intervention. It involves programming computers so that they learn from the available inputs. The main purpose of machine learning is to explore and construct algorithms that can learn from previous data and make predictions on new input data.

The input to a learning algorithm is training data, representing experience; the output is some expertise, which usually takes the form of another algorithm that can perform a task. The input data to a machine learning system can be numerical, textual, audio, visual, or multimedia. The corresponding output can be a floating-point number (for instance, the velocity of a rocket) or an integer representing a category or class (for example, a pigeon or a sunflower in image recognition).

Concepts of Learning

Learning is the process of converting experience into expertise or knowledge.

Learning can be broadly classified into three categories, based on the nature of the learning data and the interaction between the learner and the environment:

Supervised learning

Unsupervised learning

Semi-supervised learning

Similarly, there are four categories of machine learning algorithms:

Supervised learning algorithms

Unsupervised learning algorithms

Semi-supervised learning algorithms

Reinforcement learning algorithms

However, the most commonly used are supervised and unsupervised learning.
Supervised Learning

Supervised learning is commonly used in real-world applications such as face and speech recognition, product or movie recommendations, and sales forecasting. Supervised learning can be further classified into two types: regression and classification.

Regression trains on and predicts a continuous-valued response, for example predicting real estate prices.

Classification attempts to find the appropriate class label, for example analyzing positive/negative sentiment, classifying male and female persons, benign and malignant tumors, or secure and insecure loans.

In supervised learning, the learning data comes with descriptions, labels, targets, or desired outputs, and the objective is to find a general rule that maps inputs to outputs. This kind of learning data is called labeled data. The learned rule is then used to label new data with unknown outputs.

Supervised learning involves building a machine learning model that is based on labeled samples. For example, if we build a system to estimate the price of a plot of land or a house based on various features, such as size and location, we first need to create a database and label it. We need to teach the algorithm which features correspond to which prices. Based on this data, the algorithm will learn how to calculate the price of real estate from the values of the input features.
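
A toy sketch of this labeled-data idea is below; the feature names, prices, and the choice of a linear model are all invented for illustration.

```python
# Learning house prices from labeled samples (made-up data).
from sklearn.linear_model import LinearRegression

X = [[50, 3], [80, 2], [120, 4], [200, 5]]   # [size_m2, location_score]
y = [150_000, 210_000, 340_000, 560_000]     # labels: known prices

model = LinearRegression().fit(X, y)          # learn feature -> price mapping
print(model.predict([[100, 4]]))              # estimate price of a new house
```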

Supervised learning deals with learning a function from available training data: a learning algorithm analyzes the training data and produces a derived function that can be used for mapping new examples. There are many supervised learning algorithms, such as logistic regression, neural networks, support vector machines (SVMs), and naive Bayes classifiers.

Common examples of supervised learning include classifying e-mails into spam and not-spam categories, labeling webpages based on their content, and voice recognition.

Learning Decision Trees

A decision tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. It is a tree-structured classifier in which internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents an outcome.

In a decision tree there are two kinds of nodes: decision nodes and leaf nodes. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.

The decisions or tests are performed on the basis of features of the given dataset.

It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.

It is called a decision tree because, similar to a tree, it starts with a root node, which expands into further branches and constructs a tree-like structure.

To build a tree, we can use the CART algorithm, which stands for Classification and Regression Tree algorithm.

A decision tree simply asks a question and, based on the answer (yes/no), further splits into subtrees.
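
As a concrete sketch of this, scikit-learn's DecisionTreeClassifier (which implements an optimized version of CART) can grow and print such a tree; the iris dataset here is just a convenient stand-in.

```python
# Grow a small CART-style decision tree and print its question/answer structure.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Internal nodes test a feature; each leaf node is an outcome (class label).
print(export_text(clf, feature_names=list(iris.feature_names)))
```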

Figure: the general structure of a decision tree.


Why use Decision Trees?

There are various algorithms in machine learning, so choosing the best algorithm for the given dataset and problem is the main point to remember while creating a machine learning model. Below are two reasons for using a decision tree:

Decision trees usually mimic human thinking while making a decision, so they are easy to understand.

The logic behind a decision tree can be easily understood because it has a tree-like structure.
Decision Tree Terminologies

Root node: the node from which the decision tree starts. It represents the entire dataset, which then gets divided into two or more homogeneous sets.

Leaf node: a final output node; the tree cannot be split further after a leaf node.

Splitting: the process of dividing a decision node/root node into sub-nodes according to the given conditions.

Branch/subtree: a tree formed by splitting the tree.

Pruning: the process of removing unwanted branches from the tree.

Parent/child node: the root node of the tree is called the parent node, and other nodes are called child nodes.

A classic, famous example of a decision tree problem is known as Play Tennis.

If the outlook is sunny and humidity is normal, then yes, you may play tennis.

Where should you use a decision tree?

In any scenario where the learning data has the following traits:

The learning data has attribute-value pairs, as in the example above: Wind is an attribute with two possible values, strong or weak.

The target function has discrete output. Here, the target function is "should you play tennis?", and its discrete outputs are Yes and No.

The training data might contain missing values or errors.

Dataset for Play Tennis

ID3 Algorithm

Although there are various decision tree learning algorithms, we will explore the Iterative Dichotomiser 3, commonly known as ID3. ID3 was invented by Ross Quinlan.

Entropy

Entropy is a measure of randomness; in other words, it is a measure of unpredictability. Let us take a moment to give entropy a mathematical face for a binary event (like a coin toss, where the output can be one of two events, heads or tails):

Entropy = -(probability(a) * log2(probability(a))) - (probability(b) * log2(probability(b)))

where probability(a) is the probability of getting heads and probability(b) is the probability of getting tails.

Of course, this formula generalizes to n discrete outcomes as follows:

Entropy = -p(1)*log2(p(1)) - p(2)*log2(p(2)) - p(3)*log2(p(3)) - ... - p(n)*log2(p(n))
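
The formula is small enough to sanity-check in code; this helper is a sketch, with the convention that terms with p = 0 contribute 0.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # fair coin: 1.0 bit
print(entropy([9/14, 5/14]))   # Play Tennis outcomes: ~0.940 bits (used below)
```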

Information Gain

In the Play Tennis decision tree, we decided to make the first split on the basis of Outlook. We could have based our first decision on Humidity or Wind, but we chose Outlook. Why?

Because splitting on Outlook reduces the randomness in the outcome (whether to play or not) more than splitting on Humidity or Wind would.
Let's understand with the Play Tennis dataset referenced above. We have data for 14 days and only two outcomes: either we played tennis or we didn't. In the given 14 days, we played tennis on 9 occasions and did not play on 5 occasions.

Probability of playing tennis:

Number of favourable events: 9
Number of total events: 14
Probability = (number of favourable events) / (number of total events) = 9/14 = 0.643

Probability of not playing tennis:

Number of favourable events: 5
Number of total events: 14
Probability = (number of favourable events) / (number of total events) = 5/14 = 0.357

And now the entropy of the outcome:

Entropy at source = -(probability of playing tennis) * log2(probability of playing tennis) - (probability of not playing tennis) * log2(probability of not playing tennis)
= -0.643 * log2(0.643) - 0.357 * log2(0.357)
= 0.940

So the entropy of the whole system, before we ask our first question, is 0.940.
Now, we have four features on which to base the decision:

Outlook
Temperature
Windy
Humidity
Let's see what happens to the entropy when we make our first decision on the basis of Outlook.

Outlook

If we make a decision tree division at level 0 based on Outlook, three branches are possible: Sunny, Overcast, or Rain.

1. Sunny: in the given data, 5 days were sunny. Among those 5 days, tennis was played on 2 days and not played on 3 days. What is the entropy here?
Probability of playing tennis = 2/5 = 0.4
Probability of not playing tennis = 3/5 = 0.6
Entropy when sunny = -0.4 * log2(0.4) - 0.6 * log2(0.6) = 0.97

2. Overcast: in the given data, 4 days were overcast, and tennis was played on all four days.
Probability of playing tennis = 4/4 = 1
Probability of not playing tennis = 0/4 = 0
Entropy when overcast = 0.0

3. Rain: in the given data, 5 days were rainy. Among those 5 days, tennis was played on 3 days and not played on 2 days. What is the entropy here?
Probability of playing tennis = 3/5 = 0.6
Probability of not playing tennis = 2/5 = 0.4
Entropy when rainy = -0.4 * log2(0.4) - 0.6 * log2(0.6) = 0.97

Entropy among the three branches:

Entropy among three branches = ((number of sunny days)/(total days) * (entropy when sunny)) + ((number of overcast days)/(total days) * (entropy when overcast)) + ((number of rainy days)/(total days) * (entropy when rainy))
= ((5/14) * 0.97) + ((4/14) * 0) + ((5/14) * 0.97)
= 0.69

What is the reduction in randomness due to choosing Outlook as the decision maker?

Reduction in randomness = entropy at source - entropy among branches
= 0.940 - 0.69
= 0.246

This reduction in randomness is called information gain. A similar calculation can be done for the other features, as in the sketch below.
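
Here is a short sketch that reproduces the Outlook information-gain arithmetic from the counts above (2/3 for Sunny, 4/0 for Overcast, 3/2 for Rain).

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

source = entropy([9/14, 5/14])                       # ≈ 0.940

# (total days, played, not played) for Sunny, Overcast, Rain
branches = [(5, 2, 3), (4, 4, 0), (5, 3, 2)]
weighted = sum(n/14 * entropy([p/n, q/n]) for n, p, q in branches)  # ≈ 0.69

print(f"Information gain for Outlook: {source - weighted:.3f}")
# ≈ 0.247 (0.246 with the rounded intermediate values used in the text)
```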

Temperature: information gain = 0.029

Windy: information gain = 0.048

Humidity: information gain = 0.152

We can see that the decrease in randomness, i.e., the information gain, is largest for Outlook, so we choose Outlook as the first decision maker.

Advantages of the Decision Tree

It is simple to understand, as it follows the same process a human follows while making a decision in real life.

It can be very useful for solving decision-related problems.

It helps us think about all the possible outcomes for a problem.

It requires less data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

A decision tree can contain many layers, which makes it complex.

It may have an overfitting issue, which can be mitigated using the Random Forest algorithm.

With more class labels, the computational complexity of the decision tree may increase.

Regression and Classification with Linear Models

Regression
For classification the output(s) is nominal; in regression the output is continuous.

Function approximation: many models could be used; the simplest is linear regression.

Fit the data with the best hyperplane that "goes through" the points.

For each point, the difference between the predicted value and the actual observation is the residual.

Simple Linear Regression

For now, assume just one (input) independent variable x and one (output) dependent variable y.

Multiple linear regression assumes an input vector x; multivariate linear regression assumes an output vector y.

We will "fit" the points with a line (i.e., a hyperplane). Which line should we use? Choose an objective function. For simple linear regression we choose the sum of squared errors (SSE):

SSE = Σ (predicted_i - actual_i)^2 = Σ (residual_i)^2

Thus, find the line that minimizes the sum of the squared residuals (i.e., least squares).
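
A minimal least-squares sketch follows; the data points are invented, and numpy's polyfit stands in for the closed-form solution.

```python
# Fit y ≈ w*x + b by minimizing the sum of squared residuals (least squares).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

w, b = np.polyfit(x, y, 1)            # closed-form least-squares line
residuals = y - (w * x + b)
print(f"w={w:.3f}, b={b:.3f}, SSE={np.sum(residuals**2):.4f}")
```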


How do we "learn" the parameters?

Multiple Linear Regression

Intelligibility

One nice advantage of linear regression models (and linear classification) is the potential to look at the coefficients for insight into which input variables are most important in predicting the output.

The variables whose coefficients have the largest magnitude have the highest correlation with the output:

A large positive coefficient implies that the output will increase when this input is increased (positively correlated).

A large negative coefficient implies that the output will decrease when this input is increased (negatively correlated).

A small or zero coefficient suggests that the input is uncorrelated with the output (at least to first order).

Linear regression can be used to find the best "indicators". However, be careful not to confuse correlation with causality.

Summary

Linear regression and logistic regression are nice tools for many simple situations, and they give intelligible results.

But both force us to fit the data with one shape (a line or a sigmoid), which will often underfit.

When a problem includes more arbitrary non-linearity, we need more powerful models, which we will introduce later; though non-linear data transformations can help in these cases while still using a linear model for learning.

These models are commonly used in data mining applications and also as a "first attempt" at understanding data trends, indicators, etc.

Artificial Neural Networks

Introduction

Artificial Neural Networks (ANNs)

An information processing paradigm inspired by biological nervous systems.

An ANN is composed of a system of neurons connected by synapses.

ANNs learn by example, adjusting the synaptic connections between neurons.

History

1943: McCulloch and Pitts model neural networks based on their understanding of neurology. Neurons embed simple logic functions such as "a or b" and "a and b".

1950s:

Farley and Clark: an IBM group that tries to model biological behavior, consulting neuroscientists at McGill whenever stuck.

Rochester, Holland, Haibt, and Duda.

Perceptron (Rosenblatt, 1958). A three-layer system:

Input nodes

Output node

Association layer

It can learn to connect or associate a given input to a random output unit.

Minsky and Papert showed that a single-layer perceptron cannot learn the XOR of two binary inputs, which led to a loss of interest (and funding) in the field.


Back-propagation learning method (Werbos, 1974):

Three layers of neurons: input, output, hidden.

A better learning rule for generic three-layer networks.

It regenerated interest in the field in the 1980s, with successful applications in medicine, marketing, risk management, and more in the 1990s.

ANN

Promises

Combine the speed of silicon with the proven success of carbon: artificial brains.

Neuron Model

Natural neurons

A neuron collects signals from dendrites and sends out spikes of electrical activity through an axon, which splits into thousands of branches.

At the end of each branch, a synapse converts activity into either excitation or inhibition of a dendrite at another neuron.

A neuron fires when excitatory activity surpasses inhibitory activity.

Learning changes the effectiveness of the synapses.

Abstract neuron model:

ANN Forward Propagation

Bias nodes: add one node with constant output to each layer.

Forward propagation: calculate from the input layer to the output layer. For each neuron, calculate the weighted sum of its inputs and then apply the activation function.

Firing rules:

Threshold rule: calculate the weighted sum of the inputs and fire if it is larger than a threshold.

Perceptron rule: calculate the weighted sum of the inputs; the output activation level is 1 if the sum exceeds the threshold, and 0 otherwise.
ANN Applications

Pattern recognition: network attacks, breast cancer, handwriting recognition.

Pattern completion.

Auto-association: an ANN trained to reproduce its input as output, used for noise reduction, compression, and finding anomalies.

Support Vector Machines

Introduction to SVMs:

In machine learning, support vector machines (SVMs, also called support vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.

What is a Support Vector Machine?

An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.

In addition to performing linear classification, SVMs can efficiently perform non-linear classification by implicitly mapping their inputs into high-dimensional feature spaces.


What does an SVM do?

Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.

A Support Vector Machine is a supervised machine learning algorithm that can be used for both classification and regression challenges; however, it is mostly used in classification problems. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Then we perform classification by finding the hyperplane that best differentiates the two classes.

Support vectors are simply the coordinates of the individual observations closest to the boundary. The SVM classifier is the frontier (hyperplane/line) that best segregates the two classes.
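
A minimal scikit-learn sketch is below; the blob dataset is synthetic, just to show the fit/predict workflow and the learned support vectors.

```python
# Train a linear SVM on two synthetic clusters and inspect its support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

print(clf.support_vectors_[:3])    # observations that define the frontier
print(clf.predict([[0.0, 0.0]]))   # classify a new point
```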

Statistical Learning

Statistical Approaches

1. Statistical learning

2. Naïve Bayes

3. Instance-based learning

4. Neural networks

Example: Candy Bags

Candy comes in two flavors: cherry and lime. The candy is wrapped, so we can't tell the flavor until it is opened. There are 5 kinds of bags of candy:

h1: 100% cherry (P(cherry | h1) = 1, P(lime | h1) = 0)

h2: 75% cherry, 25% lime

h3: 50% cherry, 50% lime

h4: 25% cherry, 75% lime

h5: 100% lime

Given a new bag of candy, predict H. As candies are opened, we get observations D1, D2, D3, …
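
Bayesian learning updates the posterior over the five hypotheses after each observation. The sketch below assumes the prior (0.1, 0.2, 0.4, 0.2, 0.1) used in the classic textbook version of this example; treat it as an assumption.

```python
# Posterior over bag hypotheses h1..h5 after observing lime candies.
prior = [0.1, 0.2, 0.4, 0.2, 0.1]       # assumed prior P(h_i)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | h_i)

posterior = prior[:]
for obs in ["lime", "lime", "lime"]:     # three limes in a row
    lik = [p if obs == "lime" else 1 - p for p in p_lime]
    unnorm = [l * q for l, q in zip(lik, posterior)]
    z = sum(unnorm)
    posterior = [u / z for u in unnorm]
    print([round(q, 3) for q in posterior])   # mass shifts toward h5
```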

Bayesian Learning

Expectation-Maximization Algorithm

In real-world applications of machine learning, it is very common that many relevant features are available for learning, but only a small subset of them is observable. For a variable that is sometimes observable and sometimes not, we can use the instances in which it is visible for learning, and then predict its value in the instances in which it is not observable.

The Expectation-Maximization (EM) algorithm can also be used for latent variables (variables that are never directly observable and are instead inferred from the values of other observed variables), provided that the general form of the probability distribution governing them is known. The EM algorithm is at the base of many unsupervised clustering algorithms in machine learning. It was explained, proposed, and given its name in a 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin. It is used to find local maximum-likelihood parameters of a statistical model in cases where latent variables are involved and the data is missing or incomplete.

Algorithm:

1. Given a set of incomplete data, start with a set of initial parameters.

2. Expectation step (E-step): using the observed data and the current parameters, estimate (guess) the values of the missing data.

3. Maximization step (M-step): use the complete data generated in the E-step to update the parameters.

4. Repeat steps 2 and 3 until convergence.
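
Here is a compact sketch of EM for a two-component, one-dimensional Gaussian mixture, simplified by assuming equal mixing weights and unit variances so that only the two means are learned.

```python
# EM for a 1-D mixture of two unit-variance Gaussians (learning only the means).
import math, random

random.seed(0)
data = [random.gauss(0, 1) for _ in range(100)] + \
       [random.gauss(5, 1) for _ in range(100)]

def pdf(x, m):                      # unnormalized N(m, 1) density
    return math.exp(-0.5 * (x - m) ** 2)

mu = [-1.0, 1.0]                    # initial parameter guesses
for _ in range(20):
    # E-step: responsibility of component 0 for each point (equal priors)
    r0 = [pdf(x, mu[0]) / (pdf(x, mu[0]) + pdf(x, mu[1])) for x in data]
    # M-step: re-estimate each mean as a responsibility-weighted average
    mu[0] = sum(r * x for r, x in zip(r0, data)) / sum(r0)
    mu[1] = sum((1 - r) * x for r, x in zip(r0, data)) / sum(1 - r for r in r0)

print([round(m, 2) for m in mu])    # ≈ [0.0, 5.0]
```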


The essence of the Expectation-Maximization algorithm is to use the available observed data of the dataset to estimate the missing data, and then to use that completed data to update the values of the parameters. Let us understand the EM algorithm in detail.

Initially, a set of initial values of the parameters is considered. A set of incomplete observed data is given to the system, with the assumption that the observed data comes from a specific model.

The next step is the "Expectation" step, or E-step. In this step, we use the observed data to estimate or guess the values of the missing or incomplete data. It is basically used to update the variables.

The next step is the "Maximization" step, or M-step. In this step, we use the complete data generated in the preceding E-step to update the values of the parameters. It is basically used to update the hypothesis.

In the fourth step, we check whether the values are converging. If yes, we stop; otherwise, we repeat the E-step and the M-step until convergence occurs.

Flow chart for the EM algorithm.

Usage of the EM algorithm:

It can be used to fill in missing data in a sample.

It can be used as the basis of unsupervised learning of clusters.

It can be used to estimate the parameters of a Hidden Markov Model (HMM).

It can be used to discover the values of latent variables.

Advantages of the EM algorithm:

The likelihood is guaranteed to increase with each iteration.

The E-step and M-step are often easy to implement for many problems.

Solutions to the M-step often exist in closed form.

Disadvantages of the EM algorithm:

It has slow convergence.

It converges only to a local optimum.

It requires both forward and backward probabilities (numerical optimization requires only the forward probability).
Reinforcement Learning

Supervised learning: a situation in which sample (input, output) pairs of the function to be learned can be perceived or are given. You can think of it as having a kind teacher.

Reinforcement learning: the agent acts on its environment and receives some evaluation of its action (reinforcement), but it is not told which action is the correct one to achieve its goal.

Task

Learn how to behave successfully to achieve a goal while interacting with an external environment. Learn via experiences!

Examples

Game playing: the player knows whether it won or lost, but not how it should have moved at each step.

Control: a traffic system can measure the delay of cars, but does not know how to decrease it.

RL is learning from interaction.

RL model

Each percept e is enough to determine the state (the state is accessible).

The agent can decompose the reward component from a percept.

The agent's task: find an optimal policy, mapping states to actions, that maximizes a long-run measure of the reinforcement. Think of reinforcement as reward.

This can be modeled as an MDP!

Model-based vs. model-free approaches

But we don't know anything about the environment model, i.e., the transition function T(s, a, s'). There are two approaches:

Model-based RL: learn the model, and use it to derive the optimal policy, e.g., the adaptive dynamic programming (ADP) approach.

Model-free RL: derive the optimal policy without learning the model, e.g., LMS and temporal-difference approaches, as in the sketch below.
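
As a toy illustration of a model-free, temporal-difference method, here is a Q-learning sketch on a made-up five-state chain world; the environment, reward, and hyperparameters are all invented for the example.

```python
# Q-learning on a 1-D chain: states 0..4, reward 1 for reaching state 4.
import random

random.seed(1)
n_states, actions = 5, [-1, +1]                 # move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, eps = 0.5, 0.9, 0.1

for _ in range(500):                             # episodes
    s = 0
    while s != n_states - 1:
        # epsilon-greedy action choice, breaking Q-value ties randomly
        a = (random.choice(actions) if random.random() < eps
             else max(actions, key=lambda b: (Q[(s, b)], random.random())))
        s2 = min(max(s + a, 0), n_states - 1)    # deterministic transition
        r = 1.0 if s2 == n_states - 1 else 0.0
        # TD update: move Q(s,a) toward r + gamma * max_b Q(s',b)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions)
                              - Q[(s, a)])
        s = s2

# Learned greedy policy: expect +1 (move right) in every non-terminal state
print({s: max(actions, key=lambda b: Q[(s, b)]) for s in range(n_states - 1)})
```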
Passive learning vs. active learning

Passive learning: the agent simply watches the world going by and tries to learn the utilities of being in various states.

Active learning: the agent does not simply watch; it also acts.

Generalization in Reinforcement Learning

So far we have assumed that all the functions learned by the agent (U, T, R, Q) are in tabular form, i.e., that it is possible to enumerate the state and action spaces. For large state or action spaces, use generalization techniques such as function approximation.
