UNIT IV LEARNING
Probability basics - Bayes Rule and its Applications - Bayesian Networks – Exact and Approximate
Inference in Bayesian Networks - Hidden Markov Models - Forms of Learning - Supervised Learning
- Learning Decision Trees – Regression and Classification with Linear Models - Artificial Neural
Networks – Nonparametric Models - Support Vector Machines - Statistical Learning - Learning with
Complete Data - Learning with Hidden Variables- The EM Algorithm – Reinforcement Learning
BAYESIAN THEORY
Bayes’ theorem (Bayes’ law or Bayes' rule) describes the probability of an event, based on prior
knowledge of conditions that might be related to the event.
For example, if diabetes is related to age, then, using Bayes’ theorem, a person’s age can be used
to more accurately assess the probability that they have diabetes, compared to the assessment made
without knowledge of the person's age. Bayes’ theorem is the basis of uncertain reasoning, where the
results are unpredictable.
Bayes Rule
P(h|D) = P(D|h) P(h) / P(D)
P(h) - prior probability of hypothesis h
P(D) - prior probability of data D (the evidence)
P(h|D) - posterior probability (probability of h given the evidence)
P(D|h) - likelihood of D given h (probability of the evidence given h)
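As a quick illustration, the rule can be applied directly once the three quantities on the right-hand side are known. The sketch below uses hypothetical values for P(h), P(D|h) and P(D); it is not tied to any particular problem in this unit.

# Minimal sketch of Bayes' rule: P(h|D) = P(D|h) * P(h) / P(D)
def posterior(p_d_given_h, p_h, p_d):
    """Posterior probability P(h|D) of hypothesis h given data D."""
    return p_d_given_h * p_h / p_d

# Hypothetical numbers: prior P(h)=0.01, likelihood P(D|h)=0.9, evidence P(D)=0.05
print(posterior(p_d_given_h=0.9, p_h=0.01, p_d=0.05))  # 0.18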
Axioms of probability
1. All probabilities are between 0 and 1, i.e. 0 ≤ P(A) ≤ 1
2. P(True) = 1 and P(False) = 0
3. P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
BAYESIAN NETWORK
• A Bayesian network is a probabilistic graphical model that represents a set of variables and their
probabilistic dependencies and independencies. It is otherwise known as a Bayes net, Bayesian belief
network or simply a belief network. A Bayesian network specifies a joint distribution in a structured
form: it represents dependencies and independencies via a directed graph, i.e. a network of variables
linked with conditional probabilities.
• Bayesian network consists of
– Nodes = random variables
– Edges = direct dependence
• Directed edges => direct dependence
• Absence of an edge => conditional independence
• Requires that graph is acyclic (no directed cycles)
• 2 components to a Bayesian network
– The graph structure (conditional independence assumptions)
– The numerical probabilities (for each variable given its parents)
For example, suppose evidence says that a lab produces 98% accurate results. This means that, given a
positive report, a person X has malaria with probability 0.98 and does not have malaria with probability
0.02; this residual 2% is called the uncertainty factor. Handling such uncertainty is the reason we use
Bayesian theory, which is also known as probability learning.
The probabilities are numeric values between 0 and 1 that represent uncertainties.
i) Simple Bayesian network (independent causes)
p(A,B,C) = p(C|A,B) p(A) p(B)
ii) 3-way Bayesian network (conditionally independent effects)
p(A,B,C) = p(B|A) p(C|A) p(A)
B and C are conditionally independent given A
iii) 3-way Bayesian network (Markov dependence)
p(A,B,C) = p(C|B) p(B|A) p(A)
Each variable depends only on its immediate predecessor in the chain A → B → C
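The sketch below spells out factorization (ii) numerically. The CPT values are hypothetical; the point is only that the joint probability of a full assignment is the product of each node's probability given its parents, and that the resulting distribution is properly normalized.

# Hypothetical CPTs for the common-cause structure A -> B, A -> C:
# p(A,B,C) = p(B|A) * p(C|A) * p(A), so B and C are conditionally independent given A.
p_A = {True: 0.3, False: 0.7}              # p(A)
p_B_given_A = {True: 0.9, False: 0.2}      # p(B=True | A)
p_C_given_A = {True: 0.5, False: 0.1}      # p(C=True | A)

def joint(a, b, c):
    """Joint probability p(A=a, B=b, C=c) under factorization (ii)."""
    pb = p_B_given_A[a] if b else 1 - p_B_given_A[a]
    pc = p_C_given_A[a] if c else 1 - p_C_given_A[a]
    return pb * pc * p_A[a]

# The probabilities of all 8 assignments sum to 1.
print(sum(joint(a, b, c) for a in (True, False)
          for b in (True, False) for c in (True, False)))  # 1.0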
Problem 1
You have a new burglar alarm installed. It is reliable at detecting burglary, but it also responds to minor
earthquakes. Two neighbours (John, Mary) promise to call you at work when they hear the alarm. John
always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm.
Mary likes loud music and sometimes misses the alarm. Find the probability of the event that the alarm
has sounded but neither a burglary nor an earthquake has occurred and both Mary and John call.
Consider 5 binary variables
B=Burglary occurs at your house
E=Earthquake occurs at your home
A=Alarm goes off
J=John calls to report alarm
M=Mary calls to report the alarm
Probability of the event that the alarm has sounded but neither a burglary nor an earthquake has
occurred and both Mary and John call:
P(J, M, A, ¬E, ¬B) = P(J|A).P(M|A).P(A|¬E, ¬B).P(¬E).P(¬B)
= 0.90 * 0.70 * 0.001 * 0.998 * 0.999
= 0.000628 ≈ 0.00062
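A minimal sketch of this calculation, assuming the usual CPT values for this network (P(B) = 0.001, P(E) = 0.002, P(A|¬B,¬E) = 0.001, P(J|A) = 0.90, P(M|A) = 0.70); only the entries needed for this particular query are included.

# Joint probability P(J, M, A, ~E, ~B) for the burglary network,
# built as the product of each variable's probability given its parents.
P_B = 0.001          # P(Burglary)           (assumed standard value)
P_E = 0.002          # P(Earthquake)         (assumed standard value)
P_A_nB_nE = 0.001    # P(Alarm | no burglary, no earthquake)
P_J_A = 0.90         # P(John calls | Alarm)
P_M_A = 0.70         # P(Mary calls | Alarm)

joint = P_J_A * P_M_A * P_A_nB_nE * (1 - P_E) * (1 - P_B)
print(round(joint, 6))  # ~0.000628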
Problem 2
Rain influences sprinkler usage. Rain and the sprinkler together influence whether the grass is wet or
not. What is the joint probability that it is raining, the sprinkler is on, and the grass is wet?
Solution
Let S= Sprinkler
R=Rain
G=Grass wet
P(G,S,R)=P(G|S,R).P(S|R).P(R)
=0.99*0.01*0.2
=0.00198
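The same network also supports queries that a single product cannot answer directly, for example P(R | G) obtained by summing the sprinkler variable out. The sketch below does this by full enumeration; the values P(S|¬R) = 0.4, P(G|S,¬R) = 0.9, P(G|¬S,R) = 0.8 and P(G|¬S,¬R) = 0.0 are assumed for illustration, since only P(R), P(S|R) and P(G|S,R) are given above.

# Exact inference by enumeration in the rain/sprinkler/grass network.
P_R = 0.2
P_S_given_R = {True: 0.01, False: 0.4}                     # assumed P(S=T | R)
P_G_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.8, (False, False): 0.0}   # assumed P(G=T | S, R)

def joint(g, s, r):
    pr = P_R if r else 1 - P_R
    ps = P_S_given_R[r] if s else 1 - P_S_given_R[r]
    pg = P_G_given_SR[(s, r)] if g else 1 - P_G_given_SR[(s, r)]
    return pg * ps * pr

# P(G=T | R=T) = sum_S P(G=T, S, R=T) / sum_{S,G} P(G, S, R=T)
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(g, s, True) for g in (True, False) for s in (True, False))
print(round(num / den, 4))   # ~0.8019 with these assumed CPT values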
Problem 3
Bayesian Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data sample
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31…40    high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31…40    low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31…40    medium   no       excellent      yes
31…40    high     yes      fair           yes
>40      medium   no       excellent      no
Solution
• P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
• Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) :
P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) :
P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
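A short sketch of the same calculation in code, using the 14 training tuples from the table above. The helper `classify` is illustrative: it recomputes the class priors and the per-attribute conditional probabilities, then compares P(X|Ci)·P(Ci) for the two classes.

# Naive Bayes on the buys_computer training data (14 tuples).
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

def classify(x):
    scores = {}
    for c in ("yes", "no"):
        rows = [r for r in data if r[-1] == c]
        score = len(rows) / len(data)                    # P(Ci)
        for i, value in enumerate(x):                    # P(xk | Ci) for each attribute
            score *= sum(1 for r in rows if r[i] == value) / len(rows)
        scores[c] = score
    return scores

print(classify(("<=30", "medium", "yes", "fair")))
# {'yes': ~0.0282, 'no': ~0.0069}  ->  X is classified as buys_computer = yes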
Problem 4
Did the patient have malignant tumour or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive
result in only 98% of the cases in which a malignant tumour is actually present, and a correct negative
result in only 97% of the cases in which it is not present. Furthermore, 0.008 of the entire population
have this tumour.
Solution:
P(tumour) = 0.008        P(¬tumour) = 0.992
P(+|tumour) = 0.98       P(-|tumour) = 0.02
P(+|¬tumour) = 0.03      P(-|¬tumour) = 0.97
P(tumour|+) = P(+|tumour) P(tumour) / P(+) = (0.98 * 0.008) / P(+) = 0.00784 / P(+)
P(¬tumour|+) = P(+|¬tumour) P(¬tumour) / P(+) = (0.03 * 0.992) / P(+) = 0.02976 / P(+)
Since 0.02976 > 0.00784, P(¬tumour|+) > P(tumour|+): even with a positive test result, the patient is
more likely not to have a malignant tumour.
Case 2:
Hypothesis: Does the patient have a malignant tumour if the test result comes back negative?
Solution:
P(tumour) = 0.008        P(¬tumour) = 0.992
P(+|tumour) = 0.98       P(-|tumour) = 0.02
P(+|¬tumour) = 0.03      P(-|¬tumour) = 0.97
P(tumour|-) = P(-|tumour) P(tumour) / P(-) = (0.02)(0.008) / P(-)
P(¬tumour|-) = P(-|¬tumour) P(¬tumour) / P(-) = (0.97)(0.992) / P(-)
Since the two posteriors must sum to 1:
(0.02)(0.008)/P(-) + (0.97)(0.992)/P(-) = 1
(0.02)(0.008) + (0.97)(0.992) = P(-)
0.00016 + 0.96224 = P(-)
Hence P(-) = 0.9624 ≈ 0.96
P(tumour|-) = 0.00016 / 0.9624 ≈ 0.0002, so the probability of not having a tumour is very high; the
person most likely does not have a malignant tumour.
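A small sketch of the same two cases, normalizing the unnormalized products P(result|h)·P(h) over the two hypotheses, so P(+) and P(-) fall out of the normalization instead of being computed separately.

# Bayes rule for the tumour test: normalize over the two hypotheses.
P_tumour = 0.008
P_pos = {"tumour": 0.98, "no_tumour": 0.03}    # P(+ | h)

def posteriors(result):
    """Return ({h: P(h | result)}, P(result)) for result '+' or '-'."""
    prior = {"tumour": P_tumour, "no_tumour": 1 - P_tumour}
    like = {h: (P_pos[h] if result == "+" else 1 - P_pos[h]) for h in prior}
    unnorm = {h: like[h] * prior[h] for h in prior}
    evidence = sum(unnorm.values())            # this is P(+) or P(-)
    return {h: p / evidence for h, p in unnorm.items()}, evidence

print(posteriors("+"))   # tumour ~0.21, no_tumour ~0.79; P(+) ~0.0376
print(posteriors("-"))   # tumour ~0.0002, no_tumour ~0.9998; P(-) ~0.9624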
MARKOV MODEL
A Markov model is a discrete finite system with N distinct states. It begins (at time t = 1) in some initial
state. At each time step (t = 1, 2, ...) the system moves from the current state to the next state according
to transition probabilities associated with the current state. This kind of system is called a finite or
discrete Markov model.
Markov property (memoryless property): the state of the system at time t+1 depends only on the
state of the system at time t; the future is independent of the past given the present. Three basic pieces
of information define a Markov model:
Parameter space
State space
State transition probability
What is the probability that the weather for the next 7 days will be “sun-sun-rain-rain-sun-cloudy-sun”
when today is sunny?
S1: rain, S2: cloudy, S3: sunny
P(O|model)=P(S3, S3, S3, S1, S1, S3, S2,S3|model)
=P(S3)*P(S3|S3)* P(S3|S3)* P(S1|S3)* P(S1|S1)* P(S3|S1)* P(S2|S3)* P(S3|S2)
= π3*a33*a33*a31*a11*a13*a32*a23
= 1*0.8*0.8*0.1*0.4*0.3*0.1*0.2
= 1.536 × 10^-4
Another example. Initial state probability matrix:
π = (πi) = [0.5, 0.2, 0.3]
State transition probability matrix:
A = {aij} = [0.6  0.2  0.2
             0.5  0.3  0.2
             0.4  0.1  0.5]
What is the probability of 5 consecutive up days?
P(1,1,1,1,1) = π1*a11*a11*a11*a11 = 0.5*(0.6)^4 = 0.0648
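Both computations above are instances of the same chain rule: the probability of a state sequence is the initial-state probability times the product of the transition probabilities along the sequence. The sketch below uses the π and A given above, with states indexed 1, 2, 3.

# Probability of a state sequence in a first-order Markov chain.
pi = {1: 0.5, 2: 0.2, 3: 0.3}                          # initial state probabilities
A = {1: {1: 0.6, 2: 0.2, 3: 0.2},                      # A[i][j] = P(next = j | current = i)
     2: {1: 0.5, 2: 0.3, 3: 0.2},
     3: {1: 0.4, 2: 0.1, 3: 0.5}}

def sequence_probability(states):
    p = pi[states[0]]
    for i, j in zip(states, states[1:]):               # multiply transition probabilities
        p *= A[i][j]
    return p

print(sequence_probability([1, 1, 1, 1, 1]))           # 0.5 * 0.6**4 = 0.0648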
HIDDEN MARKOV MODELS
In a hidden Markov model (HMM), aij are the state transition probabilities and bik are the observation
(output) probabilities. Three canonical problems are associated with HMMs.
1. Evaluation Problem
Given the HMM M = (A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the
probability that model M has produced the sequence O.
Solution: Use the efficient forward (or backward) algorithm
2. Decoding Problem
Given the HMM M= (A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the most
likely sequence of hidden states Si that produced this observation sequence O.
Solution: Use efficient Viterbi algorithm
Define variable δk(i) as the maximum probability of producing observation sequence o1, o2 ...
ok when moving along any hidden state sequence q1… qk-1 and getting into qk= si .
δk(i) = max P(q1… qk-1 , qk= si , o1 o2 ... ok) where max is taken over all possible paths q1… qk-1.
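A minimal sketch of the Viterbi recursion. The two-state HMM parameters below are hypothetical (the notes do not fix a numeric example here); the code keeps, for every state, the best δ value and a back pointer, then backtracks to recover the most likely hidden state sequence.

# Viterbi algorithm: most likely hidden state sequence for an observation sequence.
states = ["Rainy", "Sunny"]                    # hypothetical 2-state HMM
pi = {"Rainy": 0.6, "Sunny": 0.4}              # initial probabilities
A = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},    # transition probabilities
     "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
B = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},   # observation probabilities
     "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(observations):
    # delta[i] = max prob of any state path ending in state i and emitting the obs so far
    delta = {s: pi[s] * B[s][observations[0]] for s in states}
    back = []                                  # back pointers, one dict per time step
    for obs in observations[1:]:
        ptr, new_delta = {}, {}
        for j in states:
            best_prev = max(states, key=lambda i: delta[i] * A[i][j])
            ptr[j] = best_prev
            new_delta[j] = delta[best_prev] * A[best_prev][j] * B[j][obs]
        back.append(ptr)
        delta = new_delta
    # backtrack from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return path, delta[last]

print(viterbi(["walk", "shop", "clean"]))      # (['Sunny', 'Rainy', 'Rainy'], ~0.01344)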
3. Learning Problem
Given some training observation sequences O=o1 o2 ... oK and general structure of HMM
(number of hidden and visible states), determine HMM parameters M= (A, B, π) that best fit
training data.
Solution: Use iterative expectation-maximization algorithm to find local maximum of P(O|M)
- Baum-Welch algorithm
aij = (expected number of transitions from state sj to state si) / (expected number of transitions out of state sj)
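When the state sequences are fully observed, the re-estimation above reduces to simple counting; Baum-Welch uses the same ratio but with expected counts computed from forward-backward probabilities. A minimal counting sketch, with hypothetical training sequences:

# Estimating transition probabilities from fully observed state sequences.
# Baum-Welch uses the same ratio, but with expected counts when states are hidden.
from collections import Counter

sequences = [["S1", "S1", "S2", "S3", "S3"],   # hypothetical training sequences
             ["S2", "S2", "S3", "S1"]]

transitions = Counter()
out_of = Counter()
for seq in sequences:
    for i, j in zip(seq, seq[1:]):
        transitions[(i, j)] += 1               # count of i -> j transitions
        out_of[i] += 1                         # count of transitions out of i

a = {(i, j): transitions[(i, j)] / out_of[i] for (i, j) in transitions}
print(a)   # e.g. a[('S1','S1')] = 0.5, a[('S2','S3')] = 2/3, ...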
Example (conditional independence). Let A be the height of a child, B the number of words that the
child knows, and C the child's age. A and B are clearly dependent, but once the age C is known they
carry no further information about each other: A and B are conditionally independent given C.
Markov random fields (MRFs) are the undirected counterpart of Bayesian networks. MRFs have more
power than Bayesian networks, but are more difficult to deal with computationally. A general rule of
thumb is to use Bayesian networks whenever possible, and only switch to MRFs if there is no natural
way to model the problem with a directed graph.
NAIVE BAYES CLASSIFIER
A naive Bayes classifier is a probabilistic ML model used for classification tasks, based on Bayes’
theorem. Naive Bayes is not a single algorithm but a family of algorithms that share a common
principle: every pair of features being classified is independent of each other, given the class.
Bayes’ theorem:
P(A|B) = P(B|A) P(A) / P(B)
Using Bayes’ theorem we can find the probability of A occurring given that the evidence B has
occurred. Here B is the evidence and A is the hypothesis. P(A) is the prior probability of the hypothesis,
P(B|A) is the likelihood, P(B) is the predictor prior probability (the evidence) and P(A|B) is the
posterior probability.
For all entries in a data set the denominator does not change; since it is constant, it can be removed
when comparing classes.
Example: classify a fruit that is lengthy, sweet and yellow, given a training set of 1000 fruits.
Let us work out each factor:
P(Lengthy|Banana) = 400/500 = 0.8
P(Sweet|Banana) = 350/500 = 0.7
P(Yellow|Banana) = 450/500 = 0.9
P(Banana) = 500/1000 = 0.5
P(Sweet) = 650/1000 = 0.65
P(Yellow) = 800/1000 = 0.8
P(Banana|Lengthy, Sweet, Yellow)
= P(Lengthy|Banana) x P(Sweet|Banana) x P(Yellow|Banana) x P(Banana) / (P(Lengthy) x P(Sweet) x P(Yellow))
Substituting the values above, the numerator is 0.8 x 0.7 x 0.9 x 0.5 = 0.252.
Next we need to find P(Orange|Lengthy, Sweet, Yellow)
= P(Lengthy|Orange) x P(Sweet|Orange) x P(Yellow|Orange) x P(Orange) / (P(Lengthy) x P(Sweet) x P(Yellow))
Since P(Lengthy|Orange) = 0, P(Orange|Lengthy, Sweet, Yellow) = 0.
We assign the sample to the class which has the maximum probability; here that class is Banana.
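A small sketch of this comparison in code. The fruit counts below (500 bananas, 300 oranges, 200 others, out of 1000) follow the standard version of this example and match the conditional probabilities used above; they should be treated as assumed values.

# Naive Bayes for the fruit example: compare P(class) * prod P(feature | class).
totals = {"Banana": 500, "Orange": 300, "Other": 200}          # 1000 fruits in total (assumed)
feature_counts = {                                             # fruits with each feature (assumed)
    "Banana": {"Lengthy": 400, "Sweet": 350, "Yellow": 450},
    "Orange": {"Lengthy": 0,   "Sweet": 150, "Yellow": 300},
    "Other":  {"Lengthy": 100, "Sweet": 150, "Yellow": 50},
}
N = sum(totals.values())

def score(cls, features):
    """Unnormalized posterior: P(cls) * prod_k P(feature_k | cls)."""
    s = totals[cls] / N
    for f in features:
        s *= feature_counts[cls][f] / totals[cls]
    return s

for cls in totals:
    print(cls, score(cls, ["Lengthy", "Sweet", "Yellow"]))
# Banana 0.252, Orange 0.0, Other ~0.019  ->  classify the fruit as Banana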