UNIT IV LEARNING
Probability basics - Bayes Rule and its Applications - Bayesian Networks – Exact and Approximate
Inference in Bayesian Networks - Hidden Markov Models - Forms of Learning - Supervised Learning
- Learning Decision Trees – Regression and Classification with Linear Models - Artificial Neural
Networks – Nonparametric Models - Support Vector Machines - Statistical Learning - Learning with
Complete Data - Learning with Hidden Variables- The EM Algorithm – Reinforcement Learning
BAYESIAN THEORY
Bayes’ theorem (Bayes’ law or Bayes' rule) describes the probability of an event, based on prior
knowledge of conditions that might be related to the event.
For example, if diabetes is related to age, then, using Bayes’ theorem, a person’s age can be used
to more accurately assess the probability that they have diabetes, compared to the assessment made
without knowledge of the person's age. Bayes’ theorem is the basis of uncertain reasoning, where the
results are unpredictable.
Bayes Rule
P(h|D) = P(D|h) P(h) / P(D)
P(h) - prior probability of hypothesis h
P(D) - prior probability of data D (the evidence)
P(h|D) - posterior probability (probability of h given the evidence)
P(D|h) - likelihood of D given h (probability of the evidence given h)
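As a quick illustration, the rule can be applied directly once the three quantities on the right-hand side are known. The sketch below uses hypothetical values for P(h), P(D|h) and P(D); it is not tied to any particular problem in this unit.

# Minimal sketch of Bayes' rule: P(h|D) = P(D|h) * P(h) / P(D)
def posterior(p_d_given_h, p_h, p_d):
    """Posterior probability P(h|D) of hypothesis h given data D."""
    return p_d_given_h * p_h / p_d

# Hypothetical numbers: prior P(h)=0.01, likelihood P(D|h)=0.9, evidence P(D)=0.05
print(posterior(p_d_given_h=0.9, p_h=0.01, p_d=0.05))  # 0.18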
Axioms of probability
1. All probabilities are between 0 and 1, i.e. 0 ≤ P(A) ≤ 1
2. P(True) = 1 and P(False) = 0
3. P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
BAYESIAN NETWORK
• A Bayesian network is a probabilistic graphical model that represents a set of variables and their
probabilistic dependencies and independencies. It is otherwise known as a Bayes net, Bayesian belief
network or simply a belief network. A Bayesian network specifies a joint distribution in a structured
form: it represents dependencies and independencies via a directed graph, i.e. a network of variables
linked with conditional probabilities.
• Bayesian network consists of
– Nodes = random variables
– Edges = direct dependence
• Directed edges => direct dependence
• Absence of an edge => conditional independence
• Requires that graph is acyclic (no directed cycles)
• 2 components to a Bayesian network
– The graph structure (conditional independence assumptions)
– The numerical probabilities (for each variable given its parents)
For example, suppose evidence says that a lab produces 98% accurate results. This means that, given a
positive report, a person X has malaria with probability 0.98 and does not have malaria with probability
0.02; this residual 2% is called the uncertainty factor. Handling such uncertainty is the reason we use
Bayesian theory, which is also known as probability learning.
The probabilities are numeric values between 0 and 1 that represent uncertainties.
i) Simple Bayesian network (independent causes)
p(A,B,C) = p(C|A,B) p(A) p(B)
ii) 3-way Bayesian network (conditionally independent effects)
p(A,B,C) = p(B|A) p(C|A) p(A)
B and C are conditionally independent given A
iii) 3-way Bayesian network (Markov dependence)
p(A,B,C) = p(C|B) p(B|A) p(A)
Each variable depends only on its immediate predecessor in the chain A → B → C
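The sketch below spells out factorization (ii) numerically. The CPT values are hypothetical; the point is only that the joint probability of a full assignment is the product of each node's probability given its parents, and that the resulting distribution is properly normalized.

# Hypothetical CPTs for the common-cause structure A -> B, A -> C:
# p(A,B,C) = p(B|A) * p(C|A) * p(A), so B and C are conditionally independent given A.
p_A = {True: 0.3, False: 0.7}              # p(A)
p_B_given_A = {True: 0.9, False: 0.2}      # p(B=True | A)
p_C_given_A = {True: 0.5, False: 0.1}      # p(C=True | A)

def joint(a, b, c):
    """Joint probability p(A=a, B=b, C=c) under factorization (ii)."""
    pb = p_B_given_A[a] if b else 1 - p_B_given_A[a]
    pc = p_C_given_A[a] if c else 1 - p_C_given_A[a]
    return pb * pc * p_A[a]

# The probabilities of all 8 assignments sum to 1.
print(sum(joint(a, b, c) for a in (True, False)
          for b in (True, False) for c in (True, False)))  # 1.0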
Problem 1
You have a new burglar alarm installed. It is reliable at detecting burglary, but it also responds to minor
earthquakes. Two neighbours (John, Mary) promise to call you at work when they hear the alarm. John
always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm.
Mary likes loud music and sometimes misses the alarm. Find the probability of the event that the alarm
has sounded but neither a burglary nor an earthquake has occurred and both Mary and John call.
Consider 5 binary variables
B=Burglary occurs at your house
E=Earthquake occurs at your home
A=Alarm goes off
J=John calls to report alarm
M=Mary calls to report the alarm
Probability of the event that the alarm has sounded but neither a burglary nor an earthquake has
occurred and both Mary and John call:
P(J, M, A, ¬E, ¬B) = P(J|A).P(M|A).P(A|¬E, ¬B).P(¬E).P(¬B)
= 0.90 * 0.70 * 0.001 * 0.998 * 0.999
= 0.000628 ≈ 0.00062
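A minimal sketch of this calculation, assuming the usual CPT values for this network (P(B) = 0.001, P(E) = 0.002, P(A|¬B,¬E) = 0.001, P(J|A) = 0.90, P(M|A) = 0.70); only the entries needed for this particular query are included.

# Joint probability P(J, M, A, ~E, ~B) for the burglary network,
# built as the product of each variable's probability given its parents.
P_B = 0.001          # P(Burglary)           (assumed standard value)
P_E = 0.002          # P(Earthquake)         (assumed standard value)
P_A_nB_nE = 0.001    # P(Alarm | no burglary, no earthquake)
P_J_A = 0.90         # P(John calls | Alarm)
P_M_A = 0.70         # P(Mary calls | Alarm)

joint = P_J_A * P_M_A * P_A_nB_nE * (1 - P_E) * (1 - P_B)
print(round(joint, 6))  # ~0.000628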
Problem 2
Rain influences sprinkler usage. Rain and the sprinkler together influence whether the grass is wet or
not. What is the joint probability that it is raining, the sprinkler is on, and the grass is wet?
Solution
Let S= Sprinkler
R=Rain
G=Grass wet
P(G,S,R)=P(G|S,R).P(S|R).P(R)
=0.99*0.01*0.2
=0.00198
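The same network also supports queries that a single product cannot answer directly, for example P(R | G) obtained by summing the sprinkler variable out. The sketch below does this by full enumeration; the values P(S|¬R) = 0.4, P(G|S,¬R) = 0.9, P(G|¬S,R) = 0.8 and P(G|¬S,¬R) = 0.0 are assumed for illustration, since only P(R), P(S|R) and P(G|S,R) are given above.

# Exact inference by enumeration in the rain/sprinkler/grass network.
P_R = 0.2
P_S_given_R = {True: 0.01, False: 0.4}                     # assumed P(S=T | R)
P_G_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.8, (False, False): 0.0}   # assumed P(G=T | S, R)

def joint(g, s, r):
    pr = P_R if r else 1 - P_R
    ps = P_S_given_R[r] if s else 1 - P_S_given_R[r]
    pg = P_G_given_SR[(s, r)] if g else 1 - P_G_given_SR[(s, r)]
    return pg * ps * pr

# P(G=T | R=T) = sum_S P(G=T, S, R=T) / sum_{S,G} P(G, S, R=T)
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(g, s, True) for g in (True, False) for s in (True, False))
print(round(num / den, 4))   # ~0.8019 with these assumed CPT values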
Problem 3
Bayesian Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data sample
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31…40    high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31…40    low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31…40    medium   no       excellent      yes
31…40    high     yes      fair           yes
>40      medium   no       excellent      no
Solution
• P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
• Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) :
P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) :
P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
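A short sketch of the same calculation in code, using the 14 training tuples from the table above. The helper `classify` is illustrative: it recomputes the class priors and the per-attribute conditional probabilities, then compares P(X|Ci)·P(Ci) for the two classes.

# Naive Bayes on the buys_computer training data (14 tuples).
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

def classify(x):
    scores = {}
    for c in ("yes", "no"):
        rows = [r for r in data if r[-1] == c]
        score = len(rows) / len(data)                    # P(Ci)
        for i, value in enumerate(x):                    # P(xk | Ci) for each attribute
            score *= sum(1 for r in rows if r[i] == value) / len(rows)
        scores[c] = score
    return scores

print(classify(("<=30", "medium", "yes", "fair")))
# {'yes': ~0.0282, 'no': ~0.0069}  ->  X is classified as buys_computer = yes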
Problem 4
Did the patient have malignant tumour or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive
result in only 98% of the cases in which a malignant tumour is actually present, and a correct negative
result in only 97% of the cases in which it is not present. Furthermore, 0.008 of the entire population
have this tumour.
Solution:
P(tumour) = 0.008        P(¬tumour) = 0.992
P(+|tumour) = 0.98       P(-|tumour) = 0.02
P(+|¬tumour) = 0.03      P(-|¬tumour) = 0.97
P(tumour|+) = P(+|tumour) P(tumour) / P(+) = (0.98 * 0.008) / P(+) = 0.00784 / P(+)
P(¬tumour|+) = P(+|¬tumour) P(¬tumour) / P(+) = (0.03 * 0.992) / P(+) = 0.02976 / P(+)
Since 0.02976 > 0.00784, P(¬tumour|+) > P(tumour|+): even with a positive test result, the patient is
more likely not to have a malignant tumour.
Case 2:
Hypothesis: Does the patient have a malignant tumour if the test result comes back negative?
Solution:
P(tumour) = 0.008        P(¬tumour) = 0.992
P(+|tumour) = 0.98       P(-|tumour) = 0.02
P(+|¬tumour) = 0.03      P(-|¬tumour) = 0.97
P(tumour|-) = P(-|tumour) P(tumour) / P(-) = (0.02)(0.008) / P(-)
P(¬tumour|-) = P(-|¬tumour) P(¬tumour) / P(-) = (0.97)(0.992) / P(-)
Since the two posteriors must sum to 1:
(0.02)(0.008)/P(-) + (0.97)(0.992)/P(-) = 1
(0.02)(0.008) + (0.97)(0.992) = P(-)
0.00016 + 0.96224 = P(-)
Hence P(-) = 0.9624 ≈ 0.96
P(tumour|-) = 0.00016 / 0.9624 ≈ 0.0002, so the probability of not having a tumour is very high; the
person most likely does not have a malignant tumour.
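A small sketch of the same two cases, normalizing the unnormalized products P(result|h)·P(h) over the two hypotheses, so P(+) and P(-) fall out of the normalization instead of being computed separately.

# Bayes rule for the tumour test: normalize over the two hypotheses.
P_tumour = 0.008
P_pos = {"tumour": 0.98, "no_tumour": 0.03}    # P(+ | h)

def posteriors(result):
    """Return ({h: P(h | result)}, P(result)) for result '+' or '-'."""
    prior = {"tumour": P_tumour, "no_tumour": 1 - P_tumour}
    like = {h: (P_pos[h] if result == "+" else 1 - P_pos[h]) for h in prior}
    unnorm = {h: like[h] * prior[h] for h in prior}
    evidence = sum(unnorm.values())            # this is P(+) or P(-)
    return {h: p / evidence for h, p in unnorm.items()}, evidence

print(posteriors("+"))   # tumour ~0.21, no_tumour ~0.79; P(+) ~0.0376
print(posteriors("-"))   # tumour ~0.0002, no_tumour ~0.9998; P(-) ~0.9624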
MARKOV MODEL
A Markov model is a discrete finite system with N distinct states. It begins (at time t = 1) in some initial
state. At each time step (t = 1, 2, ...) the system moves from the current state to the next state according
to transition probabilities associated with the current state. This kind of system is called a finite or
discrete Markov model.
Markov property (memoryless property): the state of the system at time t+1 depends only on the
state of the system at time t; the future is independent of the past given the present. Three basic pieces
of information define a Markov model:
Parameter space
State space
State transition probability
What is the probability that the weather for the next 7 days will be “sun-sun-rain-rain-sun-cloudy-sun”
when today is sunny?
S1: rain, S2: cloudy, S3: sunny
P(O|model)=P(S3, S3, S3, S1, S1, S3, S2,S3|model)
=P(S3)*P(S3|S3)* P(S3|S3)* P(S1|S3)* P(S1|S1)* P(S3|S1)* P(S2|S3)* P(S3|S2)
= π3*a33*a33*a31*a11*a13*a32*a23
= 1*0.8*0.8*0.1*0.4*0.3*0.1*0.2
= 1.536 × 10^-4
Another example. Initial state probability matrix:
π = (πi) = [0.5, 0.2, 0.3]
State transition probability matrix:
A = {aij} = [0.6  0.2  0.2
             0.5  0.3  0.2
             0.4  0.1  0.5]
What is the probability of 5 consecutive up days?
P(1,1,1,1,1) = π1*a11*a11*a11*a11 = 0.5*(0.6)^4 = 0.0648
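Both computations above are instances of the same chain rule: the probability of a state sequence is the initial-state probability times the product of the transition probabilities along the sequence. The sketch below uses the π and A given above, with states indexed 1, 2, 3.

# Probability of a state sequence in a first-order Markov chain.
pi = {1: 0.5, 2: 0.2, 3: 0.3}                          # initial state probabilities
A = {1: {1: 0.6, 2: 0.2, 3: 0.2},                      # A[i][j] = P(next = j | current = i)
     2: {1: 0.5, 2: 0.3, 3: 0.2},
     3: {1: 0.4, 2: 0.1, 3: 0.5}}

def sequence_probability(states):
    p = pi[states[0]]
    for i, j in zip(states, states[1:]):               # multiply transition probabilities
        p *= A[i][j]
    return p

print(sequence_probability([1, 1, 1, 1, 1]))           # 0.5 * 0.6**4 = 0.0648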
HIDDEN MARKOV MODELS
In a hidden Markov model (HMM), aij are the state transition probabilities and bik are the observation
(output) probabilities. Three canonical problems are associated with HMMs.
1. Evaluation Problem
Given the HMM M = (A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the
probability that model M has produced the sequence O.
Solution: Use the efficient forward (or backward) algorithm
2. Decoding Problem
Given the HMM M= (A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the most
likely sequence of hidden states Si that produced this observation sequence O.
Solution: Use efficient Viterbi algorithm
Define variable δk(i) as the maximum probability of producing observation sequence o1, o2 ...
ok when moving along any hidden state sequence q1… qk-1 and getting into qk= si .
δk(i) = max P(q1… qk-1 , qk= si , o1 o2 ... ok) where max is taken over all possible paths q1… qk-1.
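A minimal sketch of the Viterbi recursion. The two-state HMM parameters below are hypothetical (the notes do not fix a numeric example here); the code keeps, for every state, the best δ value and a back pointer, then backtracks to recover the most likely hidden state sequence.

# Viterbi algorithm: most likely hidden state sequence for an observation sequence.
states = ["Rainy", "Sunny"]                    # hypothetical 2-state HMM
pi = {"Rainy": 0.6, "Sunny": 0.4}              # initial probabilities
A = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},    # transition probabilities
     "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
B = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},   # observation probabilities
     "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(observations):
    # delta[i] = max prob of any state path ending in state i and emitting the obs so far
    delta = {s: pi[s] * B[s][observations[0]] for s in states}
    back = []                                  # back pointers, one dict per time step
    for obs in observations[1:]:
        ptr, new_delta = {}, {}
        for j in states:
            best_prev = max(states, key=lambda i: delta[i] * A[i][j])
            ptr[j] = best_prev
            new_delta[j] = delta[best_prev] * A[best_prev][j] * B[j][obs]
        back.append(ptr)
        delta = new_delta
    # backtrack from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return path, delta[last]

print(viterbi(["walk", "shop", "clean"]))      # (['Sunny', 'Rainy', 'Rainy'], ~0.01344)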
3. Learning Problem
Given some training observation sequences O=o1 o2 ... oK and general structure of HMM
(number of hidden and visible states), determine HMM parameters M= (A, B, π) that best fit
training data.
Solution: Use iterative expectation-maximization algorithm to find local maximum of P(O|M)
- Baum-Welch algorithm
aij = (expected number of transitions from state sj to state si) / (expected number of transitions out of state sj)
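When the state sequences are fully observed, the re-estimation above reduces to simple counting; Baum-Welch uses the same ratio but with expected counts computed from forward-backward probabilities. A minimal counting sketch, with hypothetical training sequences:

# Estimating transition probabilities from fully observed state sequences.
# Baum-Welch uses the same ratio, but with expected counts when states are hidden.
from collections import Counter

sequences = [["S1", "S1", "S2", "S3", "S3"],   # hypothetical training sequences
             ["S2", "S2", "S3", "S1"]]

transitions = Counter()
out_of = Counter()
for seq in sequences:
    for i, j in zip(seq, seq[1:]):
        transitions[(i, j)] += 1               # count of i -> j transitions
        out_of[i] += 1                         # count of transitions out of i

a = {(i, j): transitions[(i, j)] / out_of[i] for (i, j) in transitions}
print(a)   # e.g. a[('S1','S1')] = 0.5, a[('S2','S3')] = 2/3, ...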
Example (conditional independence). Let A be the height of a child, B the number of words that the
child knows, and C the child's age. A and B are clearly dependent, but once the age C is known they
carry no further information about each other: A and B are conditionally independent given C.
Markov random fields (MRFs) are the undirected counterpart of Bayesian networks. MRFs have more
power than Bayesian networks, but are more difficult to deal with computationally. A general rule of
thumb is to use Bayesian networks whenever possible, and only switch to MRFs if there is no natural
way to model the problem with a directed graph.
NAIVE BAYES CLASSIFIER
A naive Bayes classifier is a probabilistic ML model used for classification tasks, based on Bayes’
theorem. Naive Bayes is not a single algorithm but a family of algorithms that share a common
principle: every pair of features being classified is independent of each other, given the class.
Bayes’ theorem:
P(A|B) = P(B|A) P(A) / P(B)
Using Bayes’ theorem we can find the probability of A occurring given that the evidence B has
occurred. Here B is the evidence and A is the hypothesis. P(A) is the prior probability of the hypothesis,
P(B|A) is the likelihood, P(B) is the predictor prior probability (the evidence) and P(A|B) is the
posterior probability.
For all entries in a data set the denominator does not change; since it is constant, it can be removed
when comparing classes.
Example: classify a fruit that is lengthy, sweet and yellow, given a training set of 1000 fruits.
Let us work out each factor:
P(Lengthy|Banana) = 400/500 = 0.8
P(Sweet|Banana) = 350/500 = 0.7
P(Yellow|Banana) = 450/500 = 0.9
P(Banana) = 500/1000 = 0.5
P(Sweet) = 650/1000 = 0.65
P(Yellow) = 800/1000 = 0.8
P(Banana|Lengthy, Sweet, Yellow)
= P(Lengthy|Banana) x P(Sweet|Banana) x P(Yellow|Banana) x P(Banana) / (P(Lengthy) x P(Sweet) x P(Yellow))
Substituting the values above, the numerator is 0.8 x 0.7 x 0.9 x 0.5 = 0.252.
Next we need to find P(Orange|Lengthy, Sweet, Yellow)
= P(Lengthy|Orange) x P(Sweet|Orange) x P(Yellow|Orange) x P(Orange) / (P(Lengthy) x P(Sweet) x P(Yellow))
Since P(Lengthy|Orange) = 0, P(Orange|Lengthy, Sweet, Yellow) = 0.
We assign the sample to the class which has the maximum probability; here that class is Banana.
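A small sketch of this comparison in code. The fruit counts below (500 bananas, 300 oranges, 200 others, out of 1000) follow the standard version of this example and match the conditional probabilities used above; they should be treated as assumed values.

# Naive Bayes for the fruit example: compare P(class) * prod P(feature | class).
totals = {"Banana": 500, "Orange": 300, "Other": 200}          # 1000 fruits in total (assumed)
feature_counts = {                                             # fruits with each feature (assumed)
    "Banana": {"Lengthy": 400, "Sweet": 350, "Yellow": 450},
    "Orange": {"Lengthy": 0,   "Sweet": 150, "Yellow": 300},
    "Other":  {"Lengthy": 100, "Sweet": 150, "Yellow": 50},
}
N = sum(totals.values())

def score(cls, features):
    """Unnormalized posterior: P(cls) * prod_k P(feature_k | cls)."""
    s = totals[cls] / N
    for f in features:
        s *= feature_counts[cls][f] / totals[cls]
    return s

for cls in totals:
    print(cls, score(cls, ["Lengthy", "Sweet", "Yellow"]))
# Banana 0.252, Orange 0.0, Other ~0.019  ->  classify the fruit as Banana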