Unit IV
Probability basics - Bayes Rule and its Applications - Bayesian Networks – Exact and Approximate
Inference in Bayesian Networks - Hidden Markov Models - Forms of Learning - Supervised Learning
- Learning Decision Trees – Regression and Classification with Linear Models - Artificial Neural
Networks – Nonparametric Models - Support Vector Machines - Statistical Learning - Learning with
Complete Data - Learning with Hidden Variables- The EM Algorithm – Reinforcement Learning
BAYESIAN THEORY
Bayes’ theorem (also called Bayes’ law or Bayes’ rule) describes the probability of an event based on prior
knowledge of conditions that might be related to the event.
For example, if diabetes is related to age, then, using Bayes’ theorem, a person’s age can be used
to assess the probability that they have diabetes more accurately than an assessment made without
knowledge of the person’s age. Bayes’ theorem is the basis of reasoning under uncertainty, where
outcomes cannot be predicted with certainty.
Bayes Rule
P(A|B) = P(B|A) P(A) / P(B)
where P(A|B) is the posterior probability of A given evidence B, P(B|A) is the likelihood, P(A) is the prior, and P(B) is the evidence (a normalizing constant).
BAYESIAN NETWORK
• A Bayesian network is a probabilistic graphical model that represents a set of variables and their
probabilistic dependencies and independencies. It is otherwise known as a Bayes net, Bayesian belief
network, or simply a belief network. A Bayesian network specifies a joint distribution in a structured
form: it represents dependence and independence via a directed graph, forming a network of variables
linked by conditional probabilities.
• Bayesian network consists of
– Nodes = random variables
– Edges = direct dependence
• Directed edges => direct dependence
• Absence of an edge => conditional independence
• Requires that graph is acyclic (no directed cycles)
• 2 components to a Bayesian network
– The graph structure (conditional independence assumptions)
– The numerical probabilities (for each variable given its parents)
For example, suppose a lab is known to produce 98% accurate results. A positive result then means
that a person X has malaria with probability 0.98 and does not have malaria with probability 0.02. This
residual uncertainty is why we use Bayesian theory, which is also known as probability learning.
The probabilities are numeric values between 0 and 1 that represent uncertainties.
i) 3-way Bayesian network (Independent causes)
p(A,B,C) = p(C|A,B) p(A) p(B)
A and B are marginally independent causes of the common effect C
ii) 3-way Bayesian network (Marginal Independence)
p(A,B,C) = p(A) p(B) p(C)
iii) 3-way Bayesian network (Conditionally independent effects)
p(A,B,C) = p(B|A)p(C|A)p(A)
B and C are conditionally independent given A
iv) 3-way Bayesian network (Markov dependence)
p(A,B,C) = p(C|B) p(B|A) p(A)
A, B, C form a Markov chain: C depends only on B, and B depends only on A
Problem 1
You have a new burglar alarm installed. It is reliable at detecting burglary, but also responds to minor
earthquakes. Two neighbors (John and Mary) promise to call you at work when they hear the alarm. John
always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm. Mary
likes loud music and sometimes misses the alarm. Find the probability of the event that the alarm has
sounded but neither a burglary nor an earthquake has occurred, and both Mary and John call.
Consider 5 binary variables
B=Burglary occurs at your house
E=Earthquake occurs at your home
A=Alarm goes off
J=John calls to report alarm
M=Mary calls to report the alarm
Probability of the event that the alarm has sounded but neither a burglary nor an earthquake has
occurred, and both Mary and John call:
P(J,M,A,¬E,¬B) = P(J|A)·P(M|A)·P(A|¬E,¬B)·P(¬E)·P(¬B)
= 0.90 × 0.70 × 0.001 × 0.998 × 0.999
= 0.000628 ≈ 0.00062
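To make the factorization concrete, here is a minimal Python sketch of this computation. The CPT values P(B) = 0.001 and P(E) = 0.002 are the standard ones consistent with the complements 0.999 and 0.998 used above; the variable names are illustrative.

```python
# Minimal sketch of the burglar-alarm joint probability computation.
P_B = 0.001        # P(Burglary)
P_E = 0.002        # P(Earthquake)
P_A_nBnE = 0.001   # P(Alarm | no burglary, no earthquake)
P_J_A = 0.90       # P(John calls | alarm)
P_M_A = 0.70       # P(Mary calls | alarm)

# Chain-rule factorization over the network structure:
# P(J, M, A, ~E, ~B) = P(J|A) P(M|A) P(A|~E,~B) P(~E) P(~B)
joint = P_J_A * P_M_A * P_A_nBnE * (1 - P_E) * (1 - P_B)
print(joint)  # ~0.000628, rounded to 0.00062 in the text
```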
Problem 2
Rain influences sprinkler usage, and rain and sprinkler together influence whether the grass is wet. What
is the probability that it is raining, the sprinkler is on, and the grass is wet?
Solution
Let S= Sprinkler
R=Rain
G=Grass wet
P(G,S,R) = P(G|S,R)·P(S|R)·P(R)
= 0.99 × 0.01 × 0.2
= 0.00198
Problem 3
Bayesian Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data sample
X = (age <= 30, Income = medium, Student = yes, Credit_rating = fair)
P(X|Ci):
P(X|buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|Ci)·P(Ci), with priors P(buys_computer = “yes”) = 9/14 = 0.643 and P(buys_computer = “no”) = 5/14 = 0.357:
P(X|buys_computer = “yes”) · P(buys_computer = “yes”) = 0.044 × 0.643 = 0.028
P(X|buys_computer = “no”) · P(buys_computer = “no”) = 0.019 × 0.357 = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
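As an illustration, a minimal sketch that reproduces this naive Bayes hand computation; the per-attribute conditional probabilities and the priors 9/14 and 5/14 are the worked values from the classic 14-example training set assumed above.

```python
# Minimal naive Bayes scoring sketch for Problem 3.
likelihoods = {
    "yes": [0.222, 0.444, 0.667, 0.667],  # P(attr_k = x_k | yes)
    "no":  [0.6,   0.4,   0.2,   0.4],    # P(attr_k = x_k | no)
}
priors = {"yes": 9 / 14, "no": 5 / 14}

scores = {}
for cls, probs in likelihoods.items():
    score = priors[cls]
    for p in probs:
        score *= p  # attributes assumed conditionally independent given the class
    scores[cls] = score

print(scores)                       # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))  # 'yes' -> X is classified as buys_computer = yes
```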
Problem 4
Does the patient have a malignant tumour or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive
result in only 98% of the cases in which a malignant tumour is actually present, and a correct negative
result in only 97% of the cases in which it is not present. Furthermore, 0.008 of the entire population
have this tumour.
Solution:
P(tumour) = 0.008, P(¬tumour) = 0.992
P(+|tumour) = 0.98, P(−|tumour) = 0.02
P(+|¬tumour) = 0.03, P(−|¬tumour) = 0.97
P(tumour|+) ∝ P(+|tumour)·P(tumour) = 0.98 × 0.008 = 0.0078
P(¬tumour|+) ∝ P(+|¬tumour)·P(¬tumour) = 0.03 × 0.992 = 0.0298
Since 0.0298 > 0.0078, the probability of not having the tumour is higher, so the patient is classified
as not having a malignant tumour.
Case 2:
Hypothesis: Does the patient have a malignant tumour if the test result is negative?
Solution:
P(tumour) = 0.008, P(¬tumour) = 0.992
P(+|tumour) = 0.98, P(−|tumour) = 0.02
P(+|¬tumour) = 0.03, P(−|¬tumour) = 0.97
P(tumour|−) = P(−|tumour)·P(tumour)/P(−) = (0.02)(0.008)/P(−)
P(¬tumour|−) = P(−|¬tumour)·P(¬tumour)/P(−) = (0.97)(0.992)/P(−)
The two posteriors must sum to 1:
(0.02)(0.008)/P(−) + (0.97)(0.992)/P(−) = 1
(0.02)(0.008) + (0.97)(0.992) = P(−)
0.00016 + 0.96224 = P(−)
Hence P(−) = 0.9624
So P(tumour|−) = 0.00016/0.9624 ≈ 0.0002 and P(¬tumour|−) ≈ 0.9998. The probability of not having
the tumour is far higher, so the patient is classified as not having a malignant tumour.
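The same posterior computation can be expressed as a short function. A minimal sketch with the numbers from Problem 4 (the function name is illustrative):

```python
# Minimal sketch of Bayes-rule posterior computation for Problem 4.

def posterior(prior_tumour, p_pos_given_tumour, p_pos_given_no, positive=True):
    """Return (P(tumour|result), P(no tumour|result)) for a test result."""
    prior_no = 1 - prior_tumour
    if positive:
        l_t, l_n = p_pos_given_tumour, p_pos_given_no
    else:
        l_t, l_n = 1 - p_pos_given_tumour, 1 - p_pos_given_no
    evidence = l_t * prior_tumour + l_n * prior_no  # normalizing constant P(result)
    return l_t * prior_tumour / evidence, l_n * prior_no / evidence

# Case 1: positive result
print(posterior(0.008, 0.98, 0.03, positive=True))   # (~0.21, ~0.79)
# Case 2: negative result
print(posterior(0.008, 0.98, 0.03, positive=False))  # (~0.0002, ~0.9998)
```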
HIDDEN MARKOV MODELS
• Set of states: S = {s1, ..., sN}
• Markov chain property: the probability of each subsequent state depends only on the previous state.
• In a hidden Markov model the states are not visible, but each state randomly generates one of M
observations (visible symbols).
• The most likely hidden state sequence for a given observation sequence can be found with the Viterbi
algorithm: compute the best-path scores δk(i) recursively, take the state that maximizes δK(i) at the
final step, and backtrack the best path.
Learning problem. Given some training observation sequences and the general structure of the HMM
(numbers of hidden and visible states), determine the HMM parameters M = (A, B, π) that best fit the
training data, i.e. that maximize P(O | M).
• There is no known algorithm that produces globally optimal parameter values.
• An iterative expectation-maximization procedure, the Baum-Welch algorithm, is used to find a
local maximum of P(O | M).
If the training data contain information about the sequence of hidden states (as in the word recognition
example), then maximum likelihood estimation of the parameters can be used directly, as sketched below.
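A minimal sketch of this counting-based estimate for the transition matrix A, assuming labeled state sequences are available (the data and function name are illustrative):

```python
# Minimal sketch: maximum likelihood estimation of HMM transition
# probabilities when the hidden state sequence is known. Each parameter
# is a normalized count over the training sequences.
from collections import Counter

def estimate_transitions(state_sequences):
    """a[i][j] = count(i -> j) / count(i -> anything)."""
    pair_counts, from_counts = Counter(), Counter()
    for seq in state_sequences:
        for i, j in zip(seq, seq[1:]):
            pair_counts[(i, j)] += 1
            from_counts[i] += 1
    return {(i, j): c / from_counts[i] for (i, j), c in pair_counts.items()}

# Illustrative labeled training data with two hidden states 'R' and 'S':
print(estimate_transitions([list("RRSSR"), list("SRRS")]))
```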
LEARNING DECISION TREE
A decision tree represents a function that takes as input a vector of attribute values and returns a
“decision”—a single output value. The input and output values can be discrete or continuous. A decision
tree reaches its decision by performing a sequence of tests. Each internal node in the tree corresponds
to a test of the value of one of the input attributes and the branches from the node are labeled with the
possible values of the attribute. Each leaf node in the tree specifies a value to be returned by the function.
Choosing attribute tests
The scheme used in decision tree learning for selecting attributes is designed to minimize the depth of
the final tree. The idea is to pick the attribute that goes as far as possible toward providing an exact
classification of the examples. A perfect attribute divides the examples into sets that are all positive or
all negative.
The information gain from the attribute test on A is the expected reduction in entropy:
Gain(A) = Entropy(parent) − Σv (|Ev| / |E|) · Entropy(Ev)
where Ev is the subset of examples for which attribute A has value v, and Entropy(E) = −Σi pi log2 pi
over the class proportions pi.
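For instance, a minimal sketch of entropy and information gain over a toy example set (the attribute names and data are illustrative):

```python
# Minimal sketch: entropy and information gain for attribute selection.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, label):
    """Gain(A) = Entropy(parent) - weighted sum of child entropies."""
    parent = entropy([e[label] for e in examples])
    remainder = 0.0
    for v in {e[attribute] for e in examples}:
        subset = [e[label] for e in examples if e[attribute] == v]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return parent - remainder

data = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rain",  "play": "yes"},
    {"outlook": "rain",  "play": "yes"},
]
print(information_gain(data, "outlook", "play"))  # 1.0: a "perfect" attribute
```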
Over-fitting
It is a situation in which the model performs well on the training data but fails on new test data.
A decision tree learning system for real world applications must be able to handle the following issues.
• Missing data
• Multi-valued attributes
• Continuous and integer-valued input attributes
• Continuous-valued output attributes
LEARNING WITH COMPLETE DATA
Statistical learning begins with parameter learning with complete data. A parameter learning task
involves finding the numerical parameters for a probability model whose structure is fixed. Data are
complete when each data point contains values for every variable in the probability model being learned.
Complete data simplify the problem of learning the parameters of a complex model.
Learning Structure
• Maximum-likelihood parameter learning: Discrete models
• Naive Bayes models
• Maximum-likelihood parameter learning: Continuous models
• Bayesian parameter learning
Suppose we buy a bag of lime and cherry candy from a new manufacturer whose lime-cherry proportions
are completely unknown. Let θ be the proportion of cherry candies, and let hθ denote the corresponding
hypothesis. The proportion of limes is then 1 − θ. Suppose we unwrap N candies, of which c are cherries;
the number of limes is ℓ = N − c.
Equation of likelihood:
P(d | hθ) = θ^c (1 − θ)^ℓ
Log likelihood:
L(d | hθ) = log P(d | hθ) = c log θ + ℓ log(1 − θ)
To find the maximum-likelihood value of θ, we differentiate L with respect to θ and set the
resulting expression to zero:
dL/dθ = c/θ − ℓ/(1 − θ) = 0  ⟹  θ = c/(c + ℓ) = c/N
For a continuous (Gaussian) model, the parameters are the mean μ and the standard deviation σ. The log
likelihood of data x1, ..., xN is
L = Σj log [ (1/(σ√(2π))) e^(−(xj − μ)²/(2σ²)) ]
Setting the derivatives with respect to μ and σ to zero yields
μ = (Σj xj)/N,  σ = √( (Σj (xj − μ)²)/N )
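A minimal sketch checking both closed-form maximum-likelihood estimates on illustrative data:

```python
# Minimal sketch: maximum likelihood estimates for the two models above.
import math

# Discrete model: theta = c / N
N, c = 100, 60   # unwrapped 100 candies, 60 cherries (illustrative counts)
theta = c / N
print(theta)     # 0.6

# Gaussian model: mu = sample mean, sigma = root mean squared deviation
xs = [2.1, 1.9, 2.4, 2.0, 1.6]
mu = sum(xs) / len(xs)
sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
print(mu, sigma)
```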
EM ALGORITHM
The EM algorithm was introduced by Dempster, Laird, and Rubin in 1977. It is an iterative
estimation algorithm that can derive maximum likelihood estimates in the presence of
missing/hidden data (incomplete data).
Uses of EM algorithm
• Filling the missing data in a sample.
• Discovering the value of latent variables.
• Estimating the parameters of HMM
• Estimating the parameters of finite mixtures
• Unsupervised learning of clusters
• Estimating the parameters of mixtures of Gaussians
Algorithm
1. Consider a set of starting parameters:
- given a set of incomplete data,
- assume the observed data come from a specific model.
2. Use the current parameters to “estimate” the missing data, i.e. guess the missing values
(expectation step).
3. Use the “complete” data (observed plus estimated missing data) to update the parameters,
finding the most likely parameter values (maximization step).
4. Repeat steps 2 and 3 until convergence.
Basic Setting in EM
X is a set of data points, and θ is a parameter vector.
EM is a method to find θ* where
θ* = arg maxθ L(θ) = arg maxθ log P(X | θ)
L(θ) is the likelihood function.
Z=(X,Y)
Z: complete data (augmented data)
X: observed data (incomplete data)
Y: hidden data (missing data)
The same problem can be worked out with the EM algorithm when information is missing. Suppose we
do not know which coin was tossed in each set; then we cannot calculate the maximum likelihood
directly, so the EM algorithm is used.
5 sets of 10 tosses each
HTTTHHTHTH
HHHHTHHHHH
HTHHHHHTHH
HTHHHTTTTT
THHHTHHHTH
1. Initialization step: randomly choose initial values between 0 and 1 for the two coin biases θA and θB.
2. Estimation step: for each set, compute the probability that it came from coin A or coin B using the
binomial distribution:
P(h heads in n tosses | θ) = C(n, h) θ^h (1 − θ)^(n−h)
These posterior weights are then used in the maximization step to re-estimate θA and θB, and the two
steps are repeated until convergence, as in the sketch below.
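A minimal EM sketch for this two-coin setting, using the five toss sets above; the initial biases θA = 0.6 and θB = 0.5 are illustrative random guesses:

```python
# Minimal sketch: EM for the two-coin problem. Each set of 10 tosses came
# from one of two coins (unknown which); EM alternates between soft
# assignment of sets to coins (E-step) and re-estimating the biases (M-step).
from math import comb

sets = ["HTTTHHTHTH", "HHHHTHHHHH", "HTHHHHHTHH", "HTHHHTTTTT", "THHHTHHHTH"]
heads = [s.count("H") for s in sets]
n = 10

theta_a, theta_b = 0.6, 0.5  # illustrative initialization in (0, 1)
for _ in range(20):
    ha = ta = hb = tb = 0.0
    for h in heads:
        # Binomial likelihood of this set under each coin
        la = comb(n, h) * theta_a**h * (1 - theta_a) ** (n - h)
        lb = comb(n, h) * theta_b**h * (1 - theta_b) ** (n - h)
        w = la / (la + lb)            # responsibility of coin A for this set
        ha += w * h; ta += w * (n - h)
        hb += (1 - w) * h; tb += (1 - w) * (n - h)
    # M-step: re-estimate biases from expected head/tail counts
    theta_a, theta_b = ha / (ha + ta), hb / (hb + tb)

print(round(theta_a, 2), round(theta_b, 2))  # converges to ~0.80 and ~0.52
```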
Strengths of EM
• Numerical stability: every iteration of the EM algorithm increases the likelihood of the
observed data.
• EM handles parameter constraints gracefully.
Problems with EM
• Convergence can be very slow on some problems and is intimately related to the amount of
missing information.
• It is guaranteed to improve the likelihood of the training data, which is different from reducing
errors directly.
• It cannot guarantee convergence to a global maximum.
REINFORCEMENT LEARNING
Reinforcement learning is close to human learning. The algorithm learns a policy of how to act in
a given environment. Every action has some impact on the environment, and the environment provides
rewards that guide the learning algorithm. Reinforcement learning deals with agents that must sense
and act upon their environment.
In many complex domains, reinforcement learning is the only feasible way to train a program
to perform at high levels. For example, in game playing, it is very hard for a human to provide accurate
and consistent evaluations of large numbers of positions, which would be needed to train an evaluation
function directly from examples. Instead, the program can be told when it has won or lost, and it can
use this information to learn an evaluation function that gives reasonably accurate estimates of the
probability of winning from any given position.
The agent executes a set of trials in the environment using its policy π. In each trial, the agent
starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the terminal
states, (4,2) or (4,3). Its percepts supply both the current state and the reward received in that state.
Typical trials might look like this:
(1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3) [reward +1]
(1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (3,2) → (3,3) → (4,3) [reward +1]
(1,1) → (2,1) → (3,1) → (3,2) → (4,2) [reward −1]
Naïve updating
Naïve updating is otherwise called the LMS (least mean squares) approach. In essence, it assumes that
for each state in a training sequence, the observed reward-to-go on that sequence provides direct
evidence of the actual expected reward-to-go. Thus, at the end of each sequence, the algorithm
calculates the observed reward-to-go for each state and updates the estimated utility for that state
accordingly.
The utility estimates satisfy the fixed-point equation
U(i) = R(i) + Σj Mij U(j)
where R(i) is the reward associated with being in state i, and Mij is the probability that a transition will
occur from state i to state j.
Temporal-difference learning updates the utility estimate after each observed transition from state i to
state j:
U(i) ← U(i) + α (R(i) + U(j) − U(i))
where α is the learning rate parameter. Because this update rule uses the difference in utilities between
successive states, it is often called the temporal-difference, or TD, equation. The basic idea of all
temporal-difference methods is to first define the conditions that hold locally when the utility estimates
are correct, and then to write an update equation that moves the estimates toward this ideal "equilibrium"
equation.
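A minimal sketch of this TD update applied along a single trial of the 4×3 world (the step reward −0.04 and the learning rate are illustrative values):

```python
# Minimal sketch: temporal-difference (TD) utility updates along one trial.
# U(i) <- U(i) + alpha * (R(i) + U(j) - U(i)) for each transition i -> j.
from collections import defaultdict

U = defaultdict(float)   # utility estimates, initialized to 0
U[(4, 3)] = 1.0          # terminal utilities are fixed by their rewards
U[(4, 2)] = -1.0
alpha = 0.1              # learning rate
R = -0.04                # per-step reward in nonterminal states (illustrative)

# One observed trial; in practice many trials are needed for convergence.
trial = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]

for i, j in zip(trial, trial[1:]):
    # Move U(i) toward the one-step sample R(i) + U(j)
    U[i] += alpha * (R + U[j] - U[i])

print({s: round(u, 4) for s, u in U.items()})
```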
Active learning
In active learning, the agent must also learn what to do. An active agent must consider
what actions to take, what their outcomes may be, and how they will affect the rewards received.
• The environment model must now incorporate the probabilities of transitions to other states
given a particular action. We will use Maij to denote the probability of reaching state j if
action a is taken in state i.
• The constraints on the utility of each state must now take into account the fact that the agent
has a choice of actions. A rational agent will maximize its expected utility
• The agent must now choose an action at each step, and will need a performance element to do
so. In the algorithm, this means calling PERFORMANCE-ELEMENT(e) and returning the
resulting action.
SUPPORT VECTOR MACHINES
Support Vectors for the linearly separable case
• Support vectors are the elements of the training set that would change the
position of the dividing hyperplane if removed.
• Support vectors are the critical elements of the training set
• The problem of finding the optimal hyper plane is an optimization problem
and can be solved by optimization techniques (we use Lagrange multipliers
to get this problem into a form that can be solved analytically).
Finding the optimal hyperplane:
If we maximize the margin (distance) between the two supporting hyperplanes, the decision boundary
is obtained by dividing the margin by 2, i.e. it lies halfway between them.
Taking only 2 dimensions, the equation of the separating hyperplane is the line
w·x + b = 0
which can also be written w·x = 0 when the bias b is absorbed into w by augmenting x with a constant
component (one more dimension); the same form holds in higher dimensions.
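A minimal sketch using scikit-learn (assumed available) to find the maximum-margin hyperplane and its support vectors on illustrative data:

```python
# Minimal sketch: maximum-margin hyperplane and support vectors with
# scikit-learn; the dataset is illustrative and linearly separable.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # large C approximates a hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane: w.x + b = 0 with w =", w, "b =", b)
print("support vectors:", clf.support_vectors_)  # the critical training points
print("margin width:", 2 / np.linalg.norm(w))    # distance between the margins
```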