UNIT IV: LEARNING
Bayes rule provides us with a way to update our beliefs based on the arrival of
new, relevant pieces of evidence.
For example, suppose we want to estimate the probability that a person has cancer; at first, all we have to go on is the rate of cancer in the general population. However, given additional evidence, such as the fact that the person is a smoker, we can update our probability, since the probability of having cancer is higher given that the person is a smoker.
The Rule
The rule has a very simple derivation that follows directly from the relationship between joint and conditional probabilities. First, note that P(A,B) = P(A|B)P(B) = P(B,A) = P(B|A)P(A). Equating the two right-hand sides and dividing through by P(B) gives Bayes rule:

P(A|B) = P(B|A)P(A) / P(B)

P(A) and P(B) are the probabilities of A and B without regard to each other.
Example
Using the cancer diagnosis example, we can show that Bayes rule allows us to obtain a much better estimate. Putting some made-up numbers into the example lets us assess the difference that Bayes rule makes.
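The notes' own numbers are not reproduced here, so the minimal sketch below uses purely illustrative values (the prior, smoker rate, and likelihood are all assumptions) to show how the posterior is computed from Bayes rule.

```python
# Bayes rule with made-up numbers; every probability below is an
# illustrative assumption, not a value from the original notes.

p_cancer = 0.01               # prior: P(cancer)
p_smoker = 0.20               # evidence: P(smoker)
p_smoker_given_cancer = 0.60  # likelihood: P(smoker | cancer)

# P(cancer | smoker) = P(smoker | cancer) * P(cancer) / P(smoker)
p_cancer_given_smoker = p_smoker_given_cancer * p_cancer / p_smoker
print(f"P(cancer | smoker) = {p_cancer_given_smoker:.3f}")  # 0.030
```

With these values the posterior (0.03) is three times the prior (0.01), which is exactly the kind of update Bayes rule makes once the evidence is observed.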
Bayesian Network
A Bayesian network (also known as a Bayes network, belief network,
or decision network) is a probabilistic graphical model that represents a set
of variables and their conditional dependencies via a directed acyclic
graph (DAG).
Bayesian networks are ideal for taking an event that occurred and predicting
the likelihood that any one of several possible known causes was the
contributing factor.
For example, a Bayesian network could represent the probabilistic
relationships between diseases and symptoms.
A Bayesian network graph is made up of nodes and arcs (directed links), where each node represents a random variable and each arc represents a direct conditional dependency between the variables it connects.
The Bayesian network graph does not contain any cycles; hence it is known as a directed acyclic graph, or DAG.
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm responds reliably to a burglary, but it also responds to minor earthquakes. Harry has two neighbors, David and Sophia, who have taken on the responsibility of informing Harry at work when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he confuses the telephone ringing with the alarm and calls then too. On the other hand, Sophia likes to listen to loud music, so she sometimes fails to hear the alarm. Here we would like to compute the probability of the burglar alarm sounding.
Problem:
Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both David and Sophia have called Harry.
Solution:
The Bayesian network for the above problem is given below. The network structure shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the alarm going off, whereas David's and Sophia's calls depend only on Alarm.
The network thus represents our assumptions that the neighbors do not directly perceive the burglary, do not notice minor earthquakes, and do not confer before calling.
The conditional distribution for each node is given as a conditional probability table, or CPT.
Each row in a CPT must sum to 1 because the entries in the row represent an exhaustive set of cases for the variable.
List of all events occurring in this network:
Burglary (B)
Earthquake(E)
Alarm(A)
David Calls(D)
Sophia calls(S)
We can write the event of the problem statement as the probability P[D, S, A, B, E]. Using the network structure, this joint probability distribution can be rewritten as the product:

P[D, S, A, B, E] = P[D|A] · P[S|A] · P[A|B, E] · P[B] · P[E]
Let's take the observed probabilities for the Burglary and Earthquake components:
P(B = True) = 0.002, which is the probability of a burglary.
P(B = False) = 0.998, which is the probability of no burglary.
P(E = True) = 0.001, which is the probability of a minor earthquake.
P(E = False) = 0.999, which is the probability that no earthquake occurred.
We can provide the conditional probabilities as per the tables below:
From the formula for the joint distribution, we can write the problem statement as a probability expression:

P(S, D, A, ¬B, ¬E) = P(S|A) · P(D|A) · P(A|¬B, ¬E) · P(¬B) · P(¬E)
= 0.75 × 0.91 × 0.001 × 0.998 × 0.999
= 0.00068045
Hence, a Bayesian network can answer any query about the domain by using the joint distribution, as in the sketch below.
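A minimal sketch of this query. P(B) and P(E) come from the notes; the remaining CPT entries (P(A|¬B,¬E) = 0.001, P(D|A) = 0.91, P(S|A) = 0.75) are assumed from the commonly used version of this example, since the notes' tables are not reproduced here; they yield exactly the stated result.

```python
# Joint-distribution query P(S, D, A, ~B, ~E) for the burglar-alarm network.
p_not_b = 0.998     # P(B = False), from the notes
p_not_e = 0.999     # P(E = False), from the notes
p_a = 0.001         # P(A = True | B = False, E = False)  (assumed CPT entry)
p_d_given_a = 0.91  # P(D = True | A = True)              (assumed CPT entry)
p_s_given_a = 0.75  # P(S = True | A = True)              (assumed CPT entry)

# P(S, D, A, ~B, ~E) = P(S|A) * P(D|A) * P(A|~B,~E) * P(~B) * P(~E)
p = p_s_given_a * p_d_given_a * p_a * p_not_b * p_not_e
print(f"{p:.8f}")  # 0.00068045
```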
There are two ways to understand the semantics of a Bayesian network: as a representation of the joint probability distribution over all of its variables, or as an encoding of a collection of conditional independence statements.
Hidden Markov Model (HMM)
Hidden Markov Models (HMMs) are probabilistic models in which the Markov model underlying the data is hidden or unknown. More specifically, we only know the observational data, not information about the states.
[Figure: from Markov chain (left) to hidden Markov model (right), where S = states, y = possible observations, P = state transition probabilities, and b = observation probabilities.]
The start probability: a vector containing, for each state, the probability that it is the first state of the sequence.
Three classical problems are associated with HMMs:
1. Given the model parameters and the observation sequence, estimate the most likely (hidden) state sequence; this is called the decoding problem.
2. Given the model parameters and the observation sequence, find the probability of the observation sequence under the given model. This process involves a maximum-likelihood estimate of the attributes and is sometimes called the evaluation problem.
3. Given the observation sequences y1, y2, …, estimate the model parameters that maximize the probability of y. This is called the learning or optimization problem.
HMM Example
The largest issue we face when trying to apply predictive techniques to asset returns is that the time series is non-stationary. In other words, the expected mean and volatility of asset returns change over time. Most time-series models and techniques assume that the data are stationary, which is a major weakness of these models.
Now, let's frame this problem differently: we know that time series exhibit temporary periods where the expected means and variances are stable through time. These periods, or regimes, can be associated with the hidden states of an HMM. Based on this assumption, all we need are observable variables whose behavior allows us to infer the true hidden states.
In the finance world, if we can better estimate an asset’s most likely regime,
including the associated means and variances, then our predictive models become
more adaptable and will likely improve. Furthermore, we can use the estimated
regime parameters for better scenario analysis.
In this example, the observable variables I use are the underlying asset returns, the ICE BofA US High Yield Index Total Return Index, the Ted Spread, the 10-year minus 2-year constant maturity spread, and the 10-year minus 3-month constant maturity spread; the goal is to find the hidden states.
Now we try to model the hidden states of GE stock using two methods: sklearn's GaussianMixture and hmmlearn's GaussianHMM. Both models require us to specify the number of components to fit to the time series; we can think of these components as regimes. For this specific example, I will assign three components and assume them to correspond to high, neutral, and low volatility, as in the sketch below.
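A minimal sketch of fitting both models. The data-loading step is hypothetical: X stands in for the (n_samples, n_features) array of observables described above, so random placeholder data is used here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # placeholder for the returns and spread series

# Three components, interpreted as high-, neutral-, and low-volatility regimes.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
gmm_regimes = gmm.predict(X)                 # regime label per observation

hmm = GaussianHMM(n_components=3, covariance_type="full",
                  n_iter=100, random_state=0).fit(X)
hmm_regimes = hmm.predict(X)                 # most likely hidden state sequence
```

Unlike the mixture model, the HMM also learns transition probabilities between regimes, which makes its inferred regime sequence temporally coherent.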
1. Learning agents
2. Inductive learning
Learning
Learning is useful as a system-construction method, i.e., expose the agent to reality rather than trying to write it down.
Learning element
Design of a learning element is affected by
Which components of the performance element are to be learned
What feedback is available to learn these components
What representation is used for the components
Type of feedback:
Supervised learning: correct answers for each example
Unsupervised learning: correct answers not given
Reinforcement learning: occasional rewards
Inductive learning
Simplest form: learn a function from examples
f is the target function
An example is a pair (x, f(x))
Problem: find a hypothesis h
such that h ≈ f
given a training set of examples
(This is a highly simplified model of real learning:
Ignores prior knowledge
Assumes examples are given)
Inductive learning method
Construct/adjust h to agree with f on training set
(h is consistent if it agrees with f on all examples)
E.g., curve fitting, as in the sketch below:
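As a concrete (made-up) illustration, this sketch constructs hypotheses h of increasing polynomial degree and measures how well each agrees with f on the training set.

```python
import numpy as np

# Made-up training examples (x, f(x)).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 4.2, 8.8, 16.1])

for degree in (1, 2, 3):
    h = np.poly1d(np.polyfit(x, y, degree))  # hypothesis: degree-d polynomial
    sse = np.sum((h(x) - y) ** 2)            # disagreement with f on the set
    print(f"degree {degree}: sum of squared errors = {sse:.3f}")
```

Higher-degree hypotheses agree with the training set more closely, but agreement on the training set alone does not make a hypothesis the best choice.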
Forms of Learning
Machine Learning (ML) is automated learning with little or no human intervention. It involves programming computers so that they learn from available inputs. The main purpose of machine learning is to explore and construct algorithms that can learn from previous data and make predictions on new input data.
Concepts of Learning
Supervised Learning
Unsupervised Learning
Semi-supervised Learning
Similarly, there are four categories of machine learning algorithms: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
However, the most commonly used ones are supervised and unsupervised learning.
Supervised Learning
Supervised learning is commonly used in real world applications, such as face and
speech recognition, products or movie recommendations, and sales forecasting.
Supervised learning can be further classified into two types: regression and classification.
Supervised learning deals with learning a function from available training data.
Here, a learning algorithm analyzes the training data and produces a derived
function that can be used for mapping new examples. There are
many supervised learning algorithms such as Logistic Regression, Neural
networks, Support Vector Machines (SVMs), and Naive Bayes classifiers.
Learning Decision Trees
Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
In a decision tree, there are two kinds of nodes: decision nodes and leaf nodes. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
The decisions or tests are performed on the basis of the features of the given dataset.
It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
A decision tree simply asks a question and, based on the answer (Yes/No), splits further into subtrees.
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further after reaching a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
A classic example of where a decision tree is used is the Play Tennis problem.
If the outlook is sunny and humidity is normal, then yes, you may play tennis.
ID3 Algorithm
Although there are various decision tree learning algorithms, we will explore the
Iterative Dichotomiser 3 or commonly known as ID3. ID3 was invented by Ross
Quinlan.
Entropy
Entropy measures the randomness (impurity) of the outcome: Entropy(S) = -Σ p_i log2(p_i), where p_i is the proportion of examples in S that belong to class i.
Information Gain
Information gain measures the reduction in entropy achieved by splitting on a feature A: Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) Entropy(S_v), where S_v is the subset of S with value v for feature A.
We decided to make the first decision on the basis of Outlook. We could have based our first decision on Humidity or Wind, but we chose Outlook. Why? Because splitting on Outlook reduces the randomness in the outcome (whether to play or not) more than splitting on Humidity or Wind would.
Let's understand this with an example; please refer to the Play Tennis dataset above.
We have data for 14 days. We have only two outcomes :
Either we played tennis or we didn’t play.
In the given 14 days, we played tennis on 9 occasions and we did not play on 5
occasions.
Probability of playing tennis = 9/14 = 0.643
Probability of not playing tennis = 5/14 = 0.357
And now the entropy of the outcome:
Entropy = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
So the entropy of the whole system, before we ask our first question, is 0.940.
Now, we have four features on which to base the decision:
Outlook
Temperature
Windy
Humidity
Let’s see what happens to entropy when we make our first
decision on the basis of Outlook.
Outlook
If we make a decision tree division at level 0 based on Outlook, three branches are possible: Sunny, Overcast, or Rain.
1. Sunny: In the given data, 5 days were sunny. Among those 5 days, tennis was played on 2 days and not played on 3 days. What is the entropy here?
Probability of playing tennis = 2/5 = 0.4
Probability of not playing tennis = 3/5 = 0.6
Entropy when sunny = -0.4 × log2(0.4) - 0.6 × log2(0.6) = 0.97
2. Overcast: In the given data, 4 days were overcast, and tennis was played on all four days.
Probability of playing tennis = 4/4 = 1
Probability of not playing tennis = 0/4 = 0
Entropy when overcast = 0.0
3. Rain: In the given data, 5 days were rainy. Among those 5 days, tennis was played on 3 days and not played on 2 days. What is the entropy here?
Probability of not playing tennis = 2/5 = 0.4
Probability of playing tennis = 3/5 = 0.6
Entropy when rainy = -0.4 × log2(0.4) - 0.6 × log2(0.6) = 0.97
Temperature
Windy
Humidity
The expected entropy after splitting on Outlook is the weighted average (5/14)(0.97) + (4/14)(0.0) + (5/14)(0.97) ≈ 0.69, so the information gain is 0.940 - 0.69 ≈ 0.25. Repeating this computation for the other features shows that the decrease in randomness, i.e., the information gain, is greatest for Outlook. So we choose Outlook as the first decision node; the sketch below reproduces these numbers.
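A minimal sketch that reproduces the numbers above (0.940 for the whole system and about 0.247 for the information gain of Outlook):

```python
import math

def entropy(counts):
    """Entropy of an outcome given the class counts: -sum p * log2(p)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

base = entropy([9, 5])  # 9 days played, 5 not played -> 0.940

# Outlook splits the 14 days into Sunny (2+, 3-), Overcast (4+, 0-), Rain (3+, 2-).
splits = [[2, 3], [4, 0], [3, 2]]
weighted = sum(sum(s) / 14 * entropy(s) for s in splits)

print(f"entropy = {base:.3f}, gain(Outlook) = {base - weighted:.3f}")
# entropy = 0.940, gain(Outlook) = 0.247
```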
A decision tree may have an overfitting issue, which can be resolved using the Random Forest algorithm.
For more class labels, the computational complexity of the decision tree may increase.
Regression and Classification with Linear Models
Regression
For classification the output(s) is nominal
In regression the output is continuous
Function Approximation
Many models could be used – Simplest is linear regression
Fit data with the best hyper-plane which "goes through"
the points
For each point, the difference between the predicted value and the actual observation is the residual.
For now, assume just one (input) independent variable x and one (output) dependent variable y.
Thus, find the line which minimizes the sum of the squared residuals (i.e., least squares), as in the sketch below.
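A minimal least-squares sketch with made-up (x, y) points; np.polyfit with degree 1 returns the slope and intercept of the line minimizing the sum of squared residuals.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up independent variable
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])   # made-up dependent variable

slope, intercept = np.polyfit(x, y, 1)    # best-fit line
residuals = y - (slope * x + intercept)   # predicted vs. actual, per point
sse = np.sum(residuals ** 2)              # the quantity least squares minimizes

print(f"y = {slope:.2f}x + {intercept:.2f}, SSE = {sse:.3f}")
```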
Intelligibility
One nice advantage of linear regression models (and linear classification) is the potential to look at the coefficients to gain insight into which input variables are most important in predicting the output.
For comparably scaled inputs, the variables whose coefficients have the largest magnitude have the strongest association with the output.
A large positive coefficient implies that the output will increase when this input is increased (positively correlated).
A large negative coefficient implies that the output will decrease when this input is increased (negatively correlated).
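A minimal sketch of this kind of inspection, using synthetic data with known coefficients (the feature names and numbers are hypothetical); with comparably scaled inputs, the fitted coefficients recover which inputs matter most and in which direction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # three standardized inputs
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(size=200)

model = LinearRegression().fit(X, y)
for name, coef in zip(["x1", "x2", "x3"], model.coef_):
    print(f"{name}: {coef:+.2f}")  # large |coef| -> more influential input
```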
Summary
Linear regression and logistic regression are nice tools for many simple situations.
But both force us to fit the data with one shape (a line or a sigmoid), which will often underfit.
They give intelligible results.
When the problem includes more arbitrary non-linearity, we need more powerful models, which we will introduce later.
Though non-linear data transformations can help in these cases while still using a linear model for learning.
These models are commonly used in data mining applications and also as a "first attempt" at understanding data trends, indicators, etc.
Artificial Neural Networks
Introduction
History
1943: McCulloch and Pitts model neural networks based on their understanding of neurology; their threshold units could compute simple logical functions such as "a or b" and "a and b".
1950s: the perceptron: input nodes, an output node, and an association layer.
1969: Minsky and Papert showed that a single-layer perceptron cannot learn the XOR of two binary inputs.
ANN
Promises
Neuron Model
Natural neurons
A natural neuron sends out spikes of electrical activity through an axon, which splits into thousands of branches.
At the end of each branch, a synapse converts the activity into either excitatory or inhibitory activity of a dendrite at another neuron.
Abstract neuron model:
Bias Nodes
Forward propagation
Firing rules:
Threshold rules: the unit fires (outputs 1) when the weighted sum of its inputs plus the bias exceeds a threshold, as in the sketch below.
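A minimal sketch of the abstract neuron with a threshold firing rule; the weights and bias are chosen (as an illustrative assumption) so that the unit computes the logical AND from the history section above.

```python
import numpy as np

def neuron(x, w, b):
    """Threshold unit: fire (output 1) when w . x + b > 0, else output 0."""
    return int(np.dot(w, x) + b > 0)

w = np.array([1.0, 1.0])  # input weights
b = -1.5                  # bias node contribution

for a in (0, 1):
    for c in (0, 1):
        print(a, c, "->", neuron(np.array([a, c]), w, b))  # computes a AND c
```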
ANN Applications
Pattern recognition
Network attacks
Breast cancer
Handwriting recognition
Pattern completion
Auto-association
Noise reduction
Compression
Finding anomalies
Introduction to SVMs:
Given a set of training examples, each marked as belonging to one or the other of
two categories, an SVM training algorithm builds a model that assigns new
examples to one category or the other, making it a non-probabilistic binary linear
classifier.
In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Then we perform classification by finding the hyper-plane that best differentiates the two classes.
Support vectors are simply the coordinates of the individual observations that lie closest to the frontier. The SVM classifier is the frontier that best segregates the two classes (a hyper-plane, or a line in two dimensions); a sketch is given below.
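A minimal sketch with made-up, linearly separable points; sklearn's SVC exposes the fitted support vectors and classifies new examples relative to the learned frontier.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],   # class 0 (made-up points)
              [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)       # the observations that define the frontier
print(clf.predict([[4.0, 4.0]]))  # assign a new example to one of the classes
```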
Statistical Learning
Statistical Approaches
1. Statistical Learning
2. Naïve Bayes
3. Instance-based Learning
4. Neural Networks
Bayesian Learning
Expectation-Maximization Algorithm
Algorithm:
1. Given a set of incomplete data, start with a set of initialized parameter values.
2. The next step is the "Expectation" step, or E-step. In this step, we use the observed data to estimate or guess the values of the missing or incomplete data; it is basically used to update the variables.
3. The next step is the "Maximization" step, or M-step. In this step, we use the complete data generated in the preceding E-step to update the values of the parameters; it is basically used to update the hypothesis.
4. In the fourth step, we check whether the values are converging. If they are, we stop; otherwise we repeat step 2 and step 3, i.e., the E-step and the M-step, until convergence occurs.
Usage of EM algorithm: it can be used to fill in missing data in a sample, and it serves as the basis for estimating the parameters of latent-variable models such as Gaussian mixtures.
Advantages of EM algorithm: the E-step and M-step are often quite easy to implement for many problems.
Disadvantages of EM algorithm: it has slow convergence, and it may converge only to a local optimum. A minimal sketch of the E-step and M-step is given below.
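A minimal EM sketch for a two-component 1D Gaussian mixture with fixed unit variances; the data and starting values are made up, but the loop shows the E-step (responsibilities) and M-step (parameter updates) described above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

mu = np.array([-1.0, 1.0])  # initial guesses for the component means
pi = np.array([0.5, 0.5])   # initial mixing weights

for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    # (the Gaussian normalizing constant cancels because variances are equal).
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update the parameters from the expected assignments.
    pi = resp.mean(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)  # converges toward the true component means (about 0 and 5)
```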
Supervised learning: a teacher provides the correct answer for each training example.
Reinforcement learning: when the agent acts on its environment, it receives some evaluation of its action (reinforcement), but it is not told which action is the correct one for achieving its goal.
Task
Examples
Game playing: the player knows whether it won or lost, but not how it should have moved at each step.
Control: a traffic system can measure the delay of cars, but does not know how to decrease it.
The agent's task: to find an optimal policy, mapping states to actions, that maximizes a long-run measure of the reinforcement.
Think of reinforcement as reward.
Passive learning
The agent simply watches the world go by and tries to learn the utilities of being in various states.
Active learning
The agent must also act using the learned information, deciding which actions to take and trading off exploration of the environment against exploitation of what it already knows.
So far we have assumed that all the functions learned by the agent (U, T, R, Q) are represented in tabular form, as in the sketch below.
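A minimal sketch of that tabular form: a Q table over (state, action) pairs updated with the standard Q-learning rule; the two-state environment and the transition shown are hypothetical.

```python
import numpy as np

n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))  # tabular Q function
alpha, gamma = 0.1, 0.9              # learning rate and discount factor

def update(s, a, reward, s_next):
    """Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])

# One hypothetical transition: action 1 in state 0 gives reward 1, lands in state 1.
update(s=0, a=1, reward=1.0, s_next=1)
print(Q)
```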