Unit IV
Probability basics - Bayes Rule and its Applications - Bayesian Networks – Exact and Approximate
Inference in Bayesian Networks - Hidden Markov Models - Forms of Learning - Supervised Learning
- Learning Decision Trees – Regression and Classification with Linear Models - Artificial Neural
Networks – Nonparametric Models - Support Vector Machines - Statistical Learning - Learning with
Complete Data - Learning with Hidden Variables- The EM Algorithm – Reinforcement Learning
BAYESIAN THEORY
Bayes’ theorem (also called Bayes’ law or Bayes’ rule) describes the probability of an event based on prior
knowledge of conditions that might be related to the event.
For example, if diabetes is related to age, then, using Bayes’ theorem, a person’s age can be used
to assess the probability that they have diabetes more accurately than an assessment made without
knowledge of the person’s age. Bayes’ theorem is the basis of reasoning under uncertainty, where
outcomes cannot be predicted with certainty.
Bayes Rule
P(A|B) = P(B|A) P(A) / P(B)
where P(A|B) is the posterior probability of A given evidence B, P(B|A) is the likelihood, P(A) is the prior, and P(B) is the evidence (a normalizing constant).
BAYESIAN NETWORK
• A Bayesian network is a probabilistic graphical model that represents a set of variables and their
probabilistic dependencies and independencies. It is otherwise known as a Bayes net, Bayesian belief
network, or simply a belief network. A Bayesian network specifies a joint distribution in a structured
form: it represents dependence and independence via a directed graph, forming a network of variables
linked by conditional probabilities.
• Bayesian network consists of
– Nodes = random variables
– Edges = direct dependence
• Directed edges => direct dependence
• Absence of an edge => conditional independence
• Requires that graph is acyclic (no directed cycles)
• 2 components to a Bayesian network
– The graph structure (conditional independence assumptions)
– The numerical probabilities (for each variable given its parents)
For example, suppose a lab is known to produce 98% accurate results. A positive result then means
that a person X has malaria with probability 0.98 and does not have malaria with probability 0.02. This
residual uncertainty is why we use Bayesian theory, which is also known as probability learning.
The probabilities are numeric values between 0 and 1 that represent uncertainties.
i) 3-way Bayesian network (Independent causes)
p(A,B,C) = p(C|A,B) p(A) p(B)
A and B are marginally independent causes of the common effect C
ii) 3-way Bayesian network (Marginal Independence)
p(A,B,C) = p(A) p(B) p(C)
iii) 3-way Bayesian network (Conditionally independent effects)
p(A,B,C) = p(B|A)p(C|A)p(A)
B and C are conditionally independent given A
iv) 3-way Bayesian network (Markov dependence)
p(A,B,C) = p(C|B) p(B|A) p(A)
A, B, C form a Markov chain: C depends only on B, and B depends only on A
Problem 1
You have a new burglar alarm installed. It is reliable at detecting burglary, but also responds to minor
earthquakes. Two neighbors (John and Mary) promise to call you at work when they hear the alarm. John
always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm. Mary
likes loud music and sometimes misses the alarm. Find the probability of the event that the alarm has
sounded but neither a burglary nor an earthquake has occurred, and both Mary and John call.
Consider 5 binary variables
B=Burglary occurs at your house
E=Earthquake occurs at your home
A=Alarm goes off
J=John calls to report alarm
M=Mary calls to report the alarm
Probability of the event that the alarm has sounded but neither a burglary nor an earthquake has
occurred, and both Mary and John call:
P(J,M,A,¬E,¬B) = P(J|A)·P(M|A)·P(A|¬E,¬B)·P(¬E)·P(¬B)
= 0.90 × 0.70 × 0.001 × 0.998 × 0.999
= 0.000628 ≈ 0.00062
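To make the factorization concrete, here is a minimal Python sketch of this computation. The CPT values P(B) = 0.001 and P(E) = 0.002 are the standard ones consistent with the complements 0.999 and 0.998 used above; the variable names are illustrative.

```python
# Minimal sketch of the burglar-alarm joint probability computation.
P_B = 0.001        # P(Burglary)
P_E = 0.002        # P(Earthquake)
P_A_nBnE = 0.001   # P(Alarm | no burglary, no earthquake)
P_J_A = 0.90       # P(John calls | alarm)
P_M_A = 0.70       # P(Mary calls | alarm)

# Chain-rule factorization over the network structure:
# P(J, M, A, ~E, ~B) = P(J|A) P(M|A) P(A|~E,~B) P(~E) P(~B)
joint = P_J_A * P_M_A * P_A_nBnE * (1 - P_E) * (1 - P_B)
print(joint)  # ~0.000628, rounded to 0.00062 in the text
```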
Problem 2
Rain influences sprinkler usage, and rain and sprinkler together influence whether the grass is wet. What
is the probability that it is raining, the sprinkler is on, and the grass is wet?
Solution
Let S= Sprinkler
R=Rain
G=Grass wet
P(G,S,R) = P(G|S,R)·P(S|R)·P(R)
= 0.99 × 0.01 × 0.2
= 0.00198
Problem 3
Bayesian Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data sample
X = (age <= 30, Income = medium, Student = yes, Credit_rating = fair)
P(X|Ci):
P(X|buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|Ci)·P(Ci), with priors P(buys_computer = “yes”) = 9/14 = 0.643 and P(buys_computer = “no”) = 5/14 = 0.357:
P(X|buys_computer = “yes”) · P(buys_computer = “yes”) = 0.044 × 0.643 = 0.028
P(X|buys_computer = “no”) · P(buys_computer = “no”) = 0.019 × 0.357 = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
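As an illustration, a minimal sketch that reproduces this naive Bayes hand computation; the per-attribute conditional probabilities and the priors 9/14 and 5/14 are the worked values from the classic 14-example training set assumed above.

```python
# Minimal naive Bayes scoring sketch for Problem 3.
likelihoods = {
    "yes": [0.222, 0.444, 0.667, 0.667],  # P(attr_k = x_k | yes)
    "no":  [0.6,   0.4,   0.2,   0.4],    # P(attr_k = x_k | no)
}
priors = {"yes": 9 / 14, "no": 5 / 14}

scores = {}
for cls, probs in likelihoods.items():
    score = priors[cls]
    for p in probs:
        score *= p  # attributes assumed conditionally independent given the class
    scores[cls] = score

print(scores)                       # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))  # 'yes' -> X is classified as buys_computer = yes
```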
Problem 4
Does the patient have a malignant tumour or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive
result in only 98% of the cases in which a malignant tumour is actually present, and a correct negative
result in only 97% of the cases in which it is not present. Furthermore, 0.008 of the entire population
have this tumour.
Solution:
P(tumour) = 0.008, P(¬tumour) = 0.992
P(+|tumour) = 0.98, P(−|tumour) = 0.02
P(+|¬tumour) = 0.03, P(−|¬tumour) = 0.97
P(tumour|+) ∝ P(+|tumour)·P(tumour) = 0.98 × 0.008 = 0.0078
P(¬tumour|+) ∝ P(+|¬tumour)·P(¬tumour) = 0.03 × 0.992 = 0.0298
Since 0.0298 > 0.0078, the probability of not having the tumour is higher, so the patient is classified
as not having a malignant tumour.
Case 2:
Hypothesis: Does the patient have a malignant tumour if the test result is negative?
Solution:
P(tumour) = 0.008, P(¬tumour) = 0.992
P(+|tumour) = 0.98, P(−|tumour) = 0.02
P(+|¬tumour) = 0.03, P(−|¬tumour) = 0.97
P(tumour|−) = P(−|tumour)·P(tumour)/P(−) = (0.02)(0.008)/P(−)
P(¬tumour|−) = P(−|¬tumour)·P(¬tumour)/P(−) = (0.97)(0.992)/P(−)
The two posteriors must sum to 1:
(0.02)(0.008)/P(−) + (0.97)(0.992)/P(−) = 1
(0.02)(0.008) + (0.97)(0.992) = P(−)
0.00016 + 0.96224 = P(−)
Hence P(−) = 0.9624
So P(tumour|−) = 0.00016/0.9624 ≈ 0.0002 and P(¬tumour|−) ≈ 0.9998. The probability of not having
the tumour is far higher, so the patient is classified as not having a malignant tumour.
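The same posterior computation can be expressed as a short function. A minimal sketch with the numbers from Problem 4 (the function name is illustrative):

```python
# Minimal sketch of Bayes-rule posterior computation for Problem 4.

def posterior(prior_tumour, p_pos_given_tumour, p_pos_given_no, positive=True):
    """Return (P(tumour|result), P(no tumour|result)) for a test result."""
    prior_no = 1 - prior_tumour
    if positive:
        l_t, l_n = p_pos_given_tumour, p_pos_given_no
    else:
        l_t, l_n = 1 - p_pos_given_tumour, 1 - p_pos_given_no
    evidence = l_t * prior_tumour + l_n * prior_no  # normalizing constant P(result)
    return l_t * prior_tumour / evidence, l_n * prior_no / evidence

# Case 1: positive result
print(posterior(0.008, 0.98, 0.03, positive=True))   # (~0.21, ~0.79)
# Case 2: negative result
print(posterior(0.008, 0.98, 0.03, positive=False))  # (~0.0002, ~0.9998)
```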
HIDDEN MARKOV MODELS
• Set of states: S = {s1, ..., sN}
• Markov chain property: the probability of each subsequent state depends only on the previous state.
• In a hidden Markov model the states are not visible, but each state randomly generates one of M
observations (visible symbols).
• The most likely hidden state sequence for a given observation sequence can be found with the Viterbi
algorithm: compute the best-path scores δk(i) recursively, take the state that maximizes δK(i) at the
final step, and backtrack the best path.
Learning problem. Given some training observation sequences and the general structure of the HMM
(numbers of hidden and visible states), determine the HMM parameters M = (A, B, π) that best fit the
training data, i.e. that maximize P(O | M).
• There is no known algorithm that produces globally optimal parameter values.
• An iterative expectation-maximization procedure, the Baum-Welch algorithm, is used to find a
local maximum of P(O | M).
If the training data contain information about the sequence of hidden states (as in the word recognition
example), then maximum likelihood estimation of the parameters can be used directly, as sketched below.
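A minimal sketch of this counting-based estimate for the transition matrix A, assuming labeled state sequences are available (the data and function name are illustrative):

```python
# Minimal sketch: maximum likelihood estimation of HMM transition
# probabilities when the hidden state sequence is known. Each parameter
# is a normalized count over the training sequences.
from collections import Counter

def estimate_transitions(state_sequences):
    """a[i][j] = count(i -> j) / count(i -> anything)."""
    pair_counts, from_counts = Counter(), Counter()
    for seq in state_sequences:
        for i, j in zip(seq, seq[1:]):
            pair_counts[(i, j)] += 1
            from_counts[i] += 1
    return {(i, j): c / from_counts[i] for (i, j), c in pair_counts.items()}

# Illustrative labeled training data with two hidden states 'R' and 'S':
print(estimate_transitions([list("RRSSR"), list("SRRS")]))
```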
LEARNING DECISION TREE
A decision tree represents a function that takes as input a vector of attribute values and returns a
“decision”—a single output value. The input and output values can be discrete or continuous. A decision
tree reaches its decision by performing a sequence of tests. Each internal node in the tree corresponds
to a test of the value of one of the input attributes and the branches from the node are labeled with the
possible values of the attribute. Each leaf node in the tree specifies a value to be returned by the function.
Choosing attribute tests
The scheme used in decision tree learning for selecting attributes is designed to minimize the depth of
the final tree. The idea is to pick the attribute that goes as far as possible toward providing an exact
classification of the examples. A perfect attribute divides the examples into sets that are all positive or
all negative.
The information gain from the attribute test on A is the expected reduction in entropy:
Gain(A) = Entropy(parent) − Σv (|Ev| / |E|) · Entropy(Ev)
where Ev is the subset of examples for which attribute A has value v, and Entropy(E) = −Σi pi log2 pi
over the class proportions pi.
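For instance, a minimal sketch of entropy and information gain over a toy example set (the attribute names and data are illustrative):

```python
# Minimal sketch: entropy and information gain for attribute selection.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, label):
    """Gain(A) = Entropy(parent) - weighted sum of child entropies."""
    parent = entropy([e[label] for e in examples])
    remainder = 0.0
    for v in {e[attribute] for e in examples}:
        subset = [e[label] for e in examples if e[attribute] == v]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return parent - remainder

data = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rain",  "play": "yes"},
    {"outlook": "rain",  "play": "yes"},
]
print(information_gain(data, "outlook", "play"))  # 1.0: a "perfect" attribute
```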
Over-fitting
It is a situation in which the model performs well on the training data but fails on new test data.
A decision tree learning system for real world applications must be able to handle the following issues.
• Missing data
• Multi-valued attributes
• Continuous and integer-valued input attributes
• Continuous-valued output attributes
LEARNING WITH COMPLETE DATA
Statistical learning begins with parameter learning with complete data. A parameter learning task
involves finding the numerical parameters for a probability model whose structure is fixed. Data are
complete when each data point contains values for every variable in the probability model being learned.
Complete data simplify the problem of learning the parameters of a complex model.
Learning Structure
• Maximum-likelihood parameter learning: Discrete models
• Naive Bayes models
• Maximum-likelihood parameter learning: Continuous models
• Bayesian parameter learning
Suppose we buy a bag of lime and cherry candy from a new manufacturer whose lime-cherry proportions
are completely unknown. Let θ be the proportion of cherry candies, and let hθ denote the corresponding
hypothesis. The proportion of limes is then 1 − θ. Suppose we unwrap N candies, of which c are cherries;
the number of limes is ℓ = N − c.
Equation of likelihood:
P(d | hθ) = θ^c (1 − θ)^ℓ
Log likelihood:
L(d | hθ) = log P(d | hθ) = c log θ + ℓ log(1 − θ)
To find the maximum-likelihood value of θ, we differentiate L with respect to θ and set the
resulting expression to zero:
dL/dθ = c/θ − ℓ/(1 − θ) = 0  ⟹  θ = c/(c + ℓ) = c/N
For a continuous (Gaussian) model, the parameters are the mean μ and the standard deviation σ. The log
likelihood of data x1, ..., xN is
L = Σj log [ (1/(σ√(2π))) e^(−(xj − μ)²/(2σ²)) ]
Setting the derivatives with respect to μ and σ to zero yields
μ = (Σj xj)/N,  σ = √( (Σj (xj − μ)²)/N )
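A minimal sketch checking both closed-form maximum-likelihood estimates on illustrative data:

```python
# Minimal sketch: maximum likelihood estimates for the two models above.
import math

# Discrete model: theta = c / N
N, c = 100, 60   # unwrapped 100 candies, 60 cherries (illustrative counts)
theta = c / N
print(theta)     # 0.6

# Gaussian model: mu = sample mean, sigma = root mean squared deviation
xs = [2.1, 1.9, 2.4, 2.0, 1.6]
mu = sum(xs) / len(xs)
sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
print(mu, sigma)
```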
EM ALGORITHM
The EM algorithm was introduced by Dempster, Laird, and Rubin in 1977. It is an iterative
estimation algorithm that can derive maximum likelihood estimates in the presence of
missing/hidden data (incomplete data).
Uses of EM algorithm
• Filling the missing data in a sample.
• Discovering the value of latent variables.
• Estimating the parameters of HMM
• Estimating the parameters of finite mixtures
• Unsupervised learning of clusters
• Estimating the parameters of mixtures of Gaussians
Algorithm
1. Consider a set of starting parameters:
- given a set of incomplete data,
- assume the observed data come from a specific model.
2. Use the current parameters to “estimate” the missing data, i.e. guess the missing values
(expectation step).
3. Use the “complete” data (observed plus estimated missing data) to update the parameters,
finding the most likely parameter values (maximization step).
4. Repeat steps 2 and 3 until convergence.
Basic Setting in EM
X is a set of data points, and θ is a parameter vector.
EM is a method to find θ* where
θ* = arg maxθ L(θ) = arg maxθ log P(X | θ)
L(θ) is the likelihood function.
Z=(X,Y)
Z: complete data (augmented data)
X: observed data (incomplete data)
Y: hidden data (missing data)
The same problem can be worked out with the EM algorithm when information is missing. Suppose we
do not know which coin was tossed in each set; then we cannot calculate the maximum likelihood
directly, so the EM algorithm is used.
5 sets of 10 tosses each
HTTTHHTHTH
HHHHTHHHHH
HTHHHHHTHH
HTHHHTTTTT
THHHTHHHTH
1. Initialization step: randomly choose initial values between 0 and 1 for the two coin biases θA and θB.
2. Estimation step: for each set, compute the probability that it came from coin A or coin B using the
binomial distribution:
P(h heads in n tosses | θ) = C(n, h) θ^h (1 − θ)^(n−h)
These posterior weights are then used in the maximization step to re-estimate θA and θB, and the two
steps are repeated until convergence, as in the sketch below.
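A minimal EM sketch for this two-coin setting, using the five toss sets above; the initial biases θA = 0.6 and θB = 0.5 are illustrative random guesses:

```python
# Minimal sketch: EM for the two-coin problem. Each set of 10 tosses came
# from one of two coins (unknown which); EM alternates between soft
# assignment of sets to coins (E-step) and re-estimating the biases (M-step).
from math import comb

sets = ["HTTTHHTHTH", "HHHHTHHHHH", "HTHHHHHTHH", "HTHHHTTTTT", "THHHTHHHTH"]
heads = [s.count("H") for s in sets]
n = 10

theta_a, theta_b = 0.6, 0.5  # illustrative initialization in (0, 1)
for _ in range(20):
    ha = ta = hb = tb = 0.0
    for h in heads:
        # Binomial likelihood of this set under each coin
        la = comb(n, h) * theta_a**h * (1 - theta_a) ** (n - h)
        lb = comb(n, h) * theta_b**h * (1 - theta_b) ** (n - h)
        w = la / (la + lb)            # responsibility of coin A for this set
        ha += w * h; ta += w * (n - h)
        hb += (1 - w) * h; tb += (1 - w) * (n - h)
    # M-step: re-estimate biases from expected head/tail counts
    theta_a, theta_b = ha / (ha + ta), hb / (hb + tb)

print(round(theta_a, 2), round(theta_b, 2))  # converges to ~0.80 and ~0.52
```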
Strengths of EM
• Numerical stability: every iteration of the EM algorithm increases the likelihood of the
observed data.
• EM handles parameter constraints gracefully.
Problems with EM
• Convergence can be very slow on some problems and is intimately related to the amount of
missing information.
• It is guaranteed to improve the likelihood of the training data, which is different from reducing
errors directly.
• It cannot guarantee convergence to a global maximum.
REINFORCEMENT LEARNING
Reinforcement learning is close to human learning. The algorithm learns a policy of how to act in
a given environment. Every action has some impact on the environment, and the environment provides
rewards that guide the learning algorithm. Reinforcement learning deals with agents that must sense
and act upon their environment.
In many complex domains, reinforcement learning is the only feasible way to train a program
to perform at high levels. For example, in game playing, it is very hard for a human to provide accurate
and consistent evaluations of large numbers of positions, which would be needed to train an evaluation
function directly from examples. Instead, the program can be told when it has won or lost, and it can
use this information to learn an evaluation function that gives reasonably accurate estimates of the
probability of winning from any given position.
The agent executes a set of trials in the environment using its policy π. In each trial, the agent
starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the terminal
states, (4,2) or (4,3). Its percepts supply both the current state and the reward received in that state.
Typical trials might look like this:
(1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3) [reward +1]
(1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (3,2) → (3,3) → (4,3) [reward +1]
(1,1) → (2,1) → (3,1) → (3,2) → (4,2) [reward −1]
Naïve updating
Naïve updating is otherwise called the LMS (least mean squares) approach. In essence, it assumes that
for each state in a training sequence, the observed reward-to-go on that sequence provides direct
evidence of the actual expected reward-to-go. Thus, at the end of each sequence, the algorithm
calculates the observed reward-to-go for each state and updates the estimated utility for that state
accordingly.
The utility estimates satisfy the fixed-point equation
U(i) = R(i) + Σj Mij U(j)
where R(i) is the reward associated with being in state i, and Mij is the probability that a transition will
occur from state i to state j.
Temporal-difference learning updates the utility estimate after each observed transition from state i to
state j:
U(i) ← U(i) + α (R(i) + U(j) − U(i))
where α is the learning rate parameter. Because this update rule uses the difference in utilities between
successive states, it is often called the temporal-difference, or TD, equation. The basic idea of all
temporal-difference methods is to first define the conditions that hold locally when the utility estimates
are correct, and then to write an update equation that moves the estimates toward this ideal "equilibrium"
equation.
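A minimal sketch of this TD update applied along a single trial of the 4×3 world (the step reward −0.04 and the learning rate are illustrative values):

```python
# Minimal sketch: temporal-difference (TD) utility updates along one trial.
# U(i) <- U(i) + alpha * (R(i) + U(j) - U(i)) for each transition i -> j.
from collections import defaultdict

U = defaultdict(float)   # utility estimates, initialized to 0
U[(4, 3)] = 1.0          # terminal utilities are fixed by their rewards
U[(4, 2)] = -1.0
alpha = 0.1              # learning rate
R = -0.04                # per-step reward in nonterminal states (illustrative)

# One observed trial; in practice many trials are needed for convergence.
trial = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]

for i, j in zip(trial, trial[1:]):
    # Move U(i) toward the one-step sample R(i) + U(j)
    U[i] += alpha * (R + U[j] - U[i])

print({s: round(u, 4) for s, u in U.items()})
```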
Active learning
In active learning, the agent must also learn what to do. An active agent must consider
what actions to take, what their outcomes may be, and how they will affect the rewards received.
• The environment model must now incorporate the probabilities of transitions to other states
given a particular action. We will use Maij to denote the probability of reaching state j if
action a is taken in state i.
• The constraints on the utility of each state must now take into account the fact that the agent
has a choice of actions. A rational agent will maximize its expected utility
• The agent must now choose an action at each step, and will need a performance element to do
so. In the algorithm, this means calling PERFORMANCE-ELEMENT(e) and returning the
resulting action.
SUPPORT VECTOR MACHINES
Support Vectors for the linearly separable case
• Support vectors are the elements of the training set that would change the
position of the dividing hyperplane if removed.
• Support vectors are the critical elements of the training set
• The problem of finding the optimal hyper plane is an optimization problem
and can be solved by optimization techniques (we use Lagrange multipliers
to get this problem into a form that can be solved analytically).
Finding the optimal hyperplane:
If we maximize the margin (distance) between the two supporting hyperplanes, the decision boundary
is obtained by dividing the margin by 2, i.e. it lies halfway between them.
Taking only 2 dimensions, the equation of the separating hyperplane is the line
w·x + b = 0
which can also be written w·x = 0 when the bias b is absorbed into w by augmenting x with a constant
component (one more dimension); the same form holds in higher dimensions.
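A minimal sketch using scikit-learn (assumed available) to find the maximum-margin hyperplane and its support vectors on illustrative data:

```python
# Minimal sketch: maximum-margin hyperplane and support vectors with
# scikit-learn; the dataset is illustrative and linearly separable.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # large C approximates a hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane: w.x + b = 0 with w =", w, "b =", b)
print("support vectors:", clf.support_vectors_)  # the critical training points
print("margin width:", 2 / np.linalg.norm(w))    # distance between the margins
```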