System Identification


Sometimes called Individualization/Personalization

Simple Medical Decision Making
First simple applications

• Probability Basics
• Medical decision making (simple approach)
• Therapeutic decision making (applied utility theory)
Probability Basics
Probabilistic assertions summarize effects of
– laziness: failure to enumerate exceptions, qualifications, etc.
– ignorance: lack of relevant facts, initial conditions, etc.

Subjective probability:
• Probabilities relate propositions to the agent's own state of knowledge,
e.g., P(A25 | no reported accidents) = 0.06
(A25 = leaving for the airport 25 minutes before the flight gets me there on time)

These are not assertions about the world

Probabilities of propositions change with new evidence:
e.g., P(A25 | no reported accidents, 5 a.m.) = 0.15
Axioms of probability
• For any propositions A, B

– 0 ≤ P(A) ≤ 1
– P(true) = 1 and P(false) = 0
– P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
Prior probability
• Prior or unconditional probabilities of propositions
e.g., P(Cavity = true) = 0.1 and P(Weather = sunny) = 0.72 correspond to belief prior to arrival
of any (new) evidence

• Probability distribution gives values for all possible assignments:


P(Weather) = <0.72,0.1,0.08,0.1> (normalized, i.e., sums to 1)

• Joint probability distribution for a set of random variables gives the probability of every atomic event on those random variables

P(Weather, Cavity) = a 4 × 2 matrix of values:

                  sunny   rainy   cloudy   snow
Cavity = true     0.144   0.02    0.016    0.02
Cavity = false    0.576   0.08    0.064    0.08
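
To make the table concrete, here is a minimal Python sketch (my own illustration, not part of the slides) that stores the joint distribution, checks that it is normalized, and shows that summing out Cavity recovers P(Weather) = <0.72, 0.1, 0.08, 0.1>:

# Joint distribution P(Weather, Cavity) from the table above.
joint = {
    ("sunny", True): 0.144, ("rainy", True): 0.02,
    ("cloudy", True): 0.016, ("snow", True): 0.02,
    ("sunny", False): 0.576, ("rainy", False): 0.08,
    ("cloudy", False): 0.064, ("snow", False): 0.08,
}

# The joint must sum to 1 (normalization).
assert abs(sum(joint.values()) - 1.0) < 1e-9

# Marginalizing out Cavity recovers P(Weather).
p_weather = {w: joint[(w, True)] + joint[(w, False)]
             for w in ("sunny", "rainy", "cloudy", "snow")}
print(p_weather)   # ≈ {'sunny': 0.72, 'rainy': 0.1, 'cloudy': 0.08, 'snow': 0.1}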


Conditional probability
• Conditional or posterior probabilities
e.g., P(cavity | toothache) = 0.8
i.e., given that toothache is all I know

• Notation for conditional distributions:
P(Cavity | Toothache) = 2-element vector of 2-element vectors

• If we know more, e.g., cavity is also given, then we have
P(cavity | toothache, cavity) = 1


• New evidence may be irrelevant, allowing simplification, e.g.,
P(cavity | toothache, sunny) = P(cavity | toothache) = 0.8
• This kind of inference, sanctioned by domain
knowledge, is crucial
Conditional probability
• Definition of conditional probability:
P(a | b) = P(a ∧ b) / P(b) if P(b) > 0

• Product rule gives an alternative formulation:


P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)

• A general version holds for whole distributions, e.g.,


P(Weather,Cavity) = P(Weather | Cavity) P(Cavity)
• (View as a set of 4 × 2 equations, not matrix mult.)

• Chain rule is derived by successive application of the product rule:

P(X1, …, Xn) = P(X1, …, Xn−1) P(Xn | X1, …, Xn−1)
= P(X1, …, Xn−2) P(Xn−1 | X1, …, Xn−2) P(Xn | X1, …, Xn−1)
= …
= ∏_{i=1}^{n} P(Xi | X1, …, Xi−1)
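
The chain rule can be spelled out numerically. Below is a small sketch with three Boolean variables and arbitrary illustrative numbers (not from the slides), showing that a product of properly normalized conditional factors always yields a normalized joint distribution:

from itertools import product

# Arbitrary conditional factors for three Boolean variables X1, X2, X3.
p_x1_true = 0.3                                        # P(X1 = true)
p_x2_true_given_x1 = {True: 0.9, False: 0.2}           # P(X2 = true | X1)
p_x3_true_given_x1x2 = {(True, True): 0.5, (True, False): 0.1,
                        (False, True): 0.8, (False, False): 0.4}   # P(X3 = true | X1, X2)

def factor(p_true, value):
    """Return P(value) given the probability of 'true'."""
    return p_true if value else 1.0 - p_true

# Chain rule: P(x1, x2, x3) = P(x1) P(x2 | x1) P(x3 | x1, x2)
joint = {(x1, x2, x3): factor(p_x1_true, x1)
                       * factor(p_x2_true_given_x1[x1], x2)
                       * factor(p_x3_true_given_x1x2[(x1, x2)], x3)
         for x1, x2, x3 in product((True, False), repeat=3)}

assert abs(sum(joint.values()) - 1.0) < 1e-9   # the reconstructed joint sums to 1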
Inference by enumeration
• Start with the joint probability distribution (here: Toothache, Catch, Cavity)

• For any proposition φ, sum the atomic events where it is true:
P(φ) = Σ_{ω : ω ⊨ φ} P(ω)

• Examples:
P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
P(toothache ∨ cavity) = 0.108 + 0.012 + 0.016 + 0.064 + 0.072 + 0.008 = 0.28

• Can also compute conditional probabilities:
P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
= (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064)
= 0.4
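
These sums can be reproduced mechanically. The following sketch is illustrative: the six entries quoted above come from the slides, while the two remaining entries of the full joint (0.144 and 0.576 for ¬toothache with ¬cavity) are the standard textbook values and are an assumption here.

# Full joint P(Toothache, Catch, Cavity); keys are (toothache, catch, cavity).
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,  # assumed textbook values
}

def prob(phi):
    """P(phi): sum the entries of all atomic events where phi is true."""
    return sum(p for world, p in joint.items() if phi(*world))

print(prob(lambda t, c, cav: t))              # P(toothache) = 0.2
print(prob(lambda t, c, cav: t or cav))       # P(toothache ∨ cavity) = 0.28
# P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache) = 0.4
print(prob(lambda t, c, cav: t and not cav) / prob(lambda t, c, cav: t))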
Normalization

Denominator can be viewed as a normalization constant α:

P(Cavity | toothache) = α P(Cavity, toothache)
= α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
= α [<0.108, 0.016> + <0.012, 0.064>]
= α <0.12, 0.08> = <0.6, 0.4>

General idea: compute the distribution on the query variable by fixing the evidence variables and summing over the hidden variables
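
The same normalization step in code (a sketch; it reuses the joint from the previous example, repeated here so the snippet runs on its own):

# Full joint P(Toothache, Catch, Cavity) as before; keys are (toothache, catch, cavity).
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

# P(Cavity | toothache): fix the evidence (toothache = true), sum out the hidden
# variable Catch, then normalize.
unnorm = {cav: sum(p for (t, c, cv), p in joint.items() if t and cv == cav)
          for cav in (True, False)}                    # {True: 0.12, False: 0.08}
alpha = 1.0 / sum(unnorm.values())
print({cav: alpha * p for cav, p in unnorm.items()})   # ≈ {True: 0.6, False: 0.4}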
Independence
• A and B are independent iff
P(A|B) = P(A) or P(B|A) = P(B) or P(A, B) = P(A) P(B)

P(Toothache, Catch, Cavity, Weather)


= P(Toothache, Catch, Cavity) P(Weather)

• 32 entries reduced to 12; for n independent biased coins, O(2^n) → O(n)

• Absolute independence powerful but rare

• Dentistry is a large field with hundreds of variables, none of which are


independent. What to do?
Conditional independence
• P(Toothache, Cavity, Catch) has 2^3 − 1 = 7 independent entries

• If I have a cavity, the probability that the probe catches in it doesn't


depend on whether I have a toothache:
(1) P(catch | toothache, cavity) = P(catch | cavity)
• The same independence holds if I haven't got a cavity:
(2) P(catch | toothache,¬cavity) = P(catch | ¬cavity)

• Catch is conditionally independent of Toothache given Cavity:


P(Catch | Toothache,Cavity) = P(Catch | Cavity)

• Equivalent statements:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
Conditional independence contd.
• Write out full joint distribution using chain rule:
P(Toothache, Catch, Cavity)
= P(Toothache | Catch, Cavity) P(Catch, Cavity)
= P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
= P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)

I.e., 2 + 2 + 1 = 5 independent numbers

• In most cases, the use of conditional independence reduces the


size of the representation of the joint distribution from exponential in
n to linear in n.

• Conditional independence is our most basic and robust form of


knowledge about uncertain environments.
Bayes' Rule
• Product rule P(a∧b) = P(a | b) P(b) = P(b | a) P(a)

⇒ Bayes' rule: P(a | b) = P(b | a) P(a) / P(b)

• or in distribution form
P(Y|X) = P(X|Y) P(Y) / P(X) = αP(X|Y) P(Y)

• Useful for assessing diagnostic probability from causal probability:


– P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)
– E.g., let M be meningitis, S be stiff neck:
P(m|s) = P(s|m) P(m) / P(s) = 0.8 × 0.0001 / 0.1 = 0.0008

– Note: posterior probability of meningitis still very small!
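
A one-line numeric check of the meningitis example (illustrative only):

# Bayes' rule: P(m | s) = P(s | m) P(m) / P(s)
p_s_given_m, p_m, p_s = 0.8, 0.0001, 0.1
print(p_s_given_m * p_m / p_s)   # 0.0008, still very small despite the stiff neck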


Bayes' Rule and conditional
independence
P(Cavity | toothache ∧ catch)
= αP(toothache ∧ catch | Cavity) P(Cavity)
= αP(toothache | Cavity) P(catch | Cavity) P(Cavity)

• This is an example of a naïve Bayes model:

P(Cause, Effect1, …, Effectn) = P(Cause) ∏i P(Effecti | Cause)

• Total number of parameters is linear in n
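
A hedged sketch of the naïve Bayes computation for P(Cavity | toothache ∧ catch). The conditionals below are derived from the dental joint used earlier (with the two unlisted entries assumed to be the textbook values), so they are illustrative rather than quoted from the slides:

# Naive Bayes: P(Cause | effects) ∝ P(Cause) * prod_i P(effect_i | Cause)
p_cavity          = {True: 0.2, False: 0.8}
p_toothache_given = {True: 0.6, False: 0.1}   # P(toothache | Cavity)
p_catch_given     = {True: 0.9, False: 0.2}   # P(catch | Cavity)

unnorm = {c: p_cavity[c] * p_toothache_given[c] * p_catch_given[c]
          for c in (True, False)}
alpha = 1.0 / sum(unnorm.values())
print({c: alpha * v for c, v in unnorm.items()})   # ≈ {True: 0.871, False: 0.129}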


Summary
• Probability is a rigorous formalism for uncertain
knowledge
• Joint probability distribution specifies probability of
every atomic event
• Queries can be answered by summing over atomic
events
• For nontrivial domains, we must find a way to
reduce the joint size
• Independence and conditional independence
provide the tools
Medical Decision Making

• Diagnostic decision support
e.g., given the patient's anamnestic history and an ECG, what is the probability of a myocardial infarction?

• Therapeutic decision support
e.g., what is the best intervention if a patient's left coronary artery is 90% obstructed?
Features and distributions
[Figure: feature distribution f(blood pressure)]

Where do features come from?
[Figure: from data, a selection of features and their feature values feeds the diagnosis/classification decision (decision model); feature distributions are shown for the whole population, at the practitioner, and at the cardiologic clinic]

Decision model: classify + if f > Θ and − if f ≤ Θ

                Decision +   Decision −
Truth +         TP           FN
Truth −         FP           TN

TP = True Positive, FP = False Positive, TN = True Negative, FN = False Negative
Simple decision model

Decision model/test (gallbladder sonography, cholecystitis):

           Test +     Test −     Sum
Truth +    TP = 205   FN = 193   398
Truth −    FP = 29    TN = 73    102
Sum        234        266        500

Quality measures
• Sensitivity = TPR = TP/(TP+FN) ≤ 1
• Specificity = TNR = TN/(TN+FP) ≤ 1
• Pos. pred. value = PV+ = TP/(TP+FP) ≤ 1
• Neg. pred. value = PV− = TN/(TN+FN) ≤ 1
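
The four quality measures follow directly from the 2 × 2 table; a minimal sketch (illustrative):

# Gallbladder sonography example: TP = 205, FN = 193, FP = 29, TN = 73
TP, FN, FP, TN = 205, 193, 29, 73

sensitivity = TP / (TP + FN)   # true positive rate:        205/398 ≈ 0.515
specificity = TN / (TN + FP)   # true negative rate:        73/102  ≈ 0.716
ppv         = TP / (TP + FP)   # positive predictive value: 205/234 ≈ 0.876
npv         = TN / (TN + FN)   # negative predictive value: 73/266  ≈ 0.274
print(sensitivity, specificity, ppv, npv)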
Comparing Tests/ROC
curves
Statistics depend on the study population:
• Population bias: e.g., the study population includes only individuals with advanced disease and healthy volunteers, i.e., TPR and TNR are too high
• Test-referral bias: a positive index test is a criterion for applying the (expensive) gold-standard test. The study population then consists of individuals with a higher probability of disease, i.e., TPR is overestimated and TNR underestimated
• Test-interpretation bias: the interpretation of the gold-standard test depends on the index test
Example

• Screening test (HIV) for blood donors: the test currently used to screen blood donors for HIV antibody is an enzyme-linked immunosorbent assay (ELISA). To measure the performance of the ELISA, the test is performed on 400 patients;
Bayes and PV
Posttest Probability
• P(D|+) = number of diseased patients with positive test /
total number of patients with positive test
= TP / (TP+FP)
= PV+
• P(-D|-) = number of healthy patients with negative test/
total number of patients with negative test
= TN/(TN+FN)
= PV-
Odds-Ratio and Probabilities

odds = p / (1 − p)

p = odds / (1 + odds)
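
The two conversions as small helper functions (a sketch; the function names are my own):

def prob_to_odds(p):
    """odds = p / (1 - p)"""
    return p / (1.0 - p)

def odds_to_prob(odds):
    """p = odds / (1 + odds)"""
    return odds / (1.0 + odds)

print(prob_to_odds(0.75))   # 3.0
print(odds_to_prob(3.0))    # 0.75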
Likelihood ratio

LR+ = TPR / FPR = Pr(positive test in diseased) / Pr(positive test in nondiseased)

LR− = FNR / TNR = Pr(negative test in diseased) / Pr(negative test in nondiseased)
Odds-ratio form of Bayes‘ Theorem

Posttest odds = pretest odds x LR

p[D | R] / p[−D | R] = (p[D] / p[−D]) × (p[R | D] / p[R | −D])
Example
• Calculate the posttest probability for a positive exercise test (TPR = 0.65; FPR = 0.2) in a 60-year-old man whose pretest probability is 0.75.
• Pretest odds =…
• LR+=…
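
A sketch of how the odds-ratio form of Bayes' theorem answers this example (the helpers from the previous sketch are repeated so the snippet is self-contained):

def prob_to_odds(p):
    return p / (1.0 - p)

def odds_to_prob(odds):
    return odds / (1.0 + odds)

tpr, fpr, pretest_p = 0.65, 0.2, 0.75
pretest_odds  = prob_to_odds(pretest_p)    # 0.75 / 0.25 = 3.0
lr_plus       = tpr / fpr                  # 0.65 / 0.2  = 3.25
posttest_odds = pretest_odds * lr_plus     # 3.0 * 3.25  = 9.75
print(odds_to_prob(posttest_odds))         # ≈ 0.907 posttest probability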
Implications
TPR = 0.9, TNR = 0.9
[Figure: positive and negative test results for a test with TPR = 0.9 and TNR = 0.9]
Summary
• Probability Theory provides simple methods for
diagnostic reasoning
• Performance of tests can be quantified
• ROC curves may help to choose best test
• Odds and probabilities can be interchanged
• Interpretation of performance measures
Therapy planning
Value for optimization?
– QALYs (quality-adjusted life years)
– Micromorts
Therapy planning

Create a decision tree:


Drug therapy
Surgery or not?
Bayesian network models

Ch. 14, Russell/Norvig (3rd edition)


Outline
• Definitions
• Example
• Construction
• Inference / Answering Queries
• Approximate Inference
Bayesian networks
• A simple, graphical notation for conditional
independence assertions and hence for compact
specification of full joint distributions

• Syntax:
– a set of nodes, one per variable
– a directed, acyclic graph (link ≈ "directly influences")
– a conditional distribution for each node given its parents:
P (Xi | Parents (Xi))

• In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of parent values
Basic concepts of probability
• Probability variables
• Notation
• Conditional probabilities
• Independence – conditional Independence
• Bayes law
A Bayesian Network
A Bayesian network is made up of:
1. A Directed Acyclic Graph

2. A set of tables for each node in the graph


Example
• Topology of network encodes conditional
independence assertions:

• Weather is independent of the other variables


• Toothache and Catch are conditionally
independent given Cavity
A Directed Acyclic Graph
A Set of Tables for Each Node
Conditional Probability Distribution for C given B

If you have a Boolean variable with k Boolean parents, this table has 2^(k+1) probabilities (but only 2^k need to be stored)
Example

Variables: Burglary, Earthquake, Alarm, JohnCalls,


MaryCalls

Network topology reflects "causal" knowledge:


A burglar can set the alarm off
An earthquake can set the alarm off
The alarm can cause Mary to call
The alarm can cause John to call
Example contd.

Each row in a CPT has to sum up to 1. We omit the second probability.


E.g.
P(¬j | a) = 1 − P(j | a) = 0.1.
Compactness
• A CPT for Boolean Xi with k Boolean parents has 2^k rows for the combinations of parent values

• Each row requires one number p for Xi = true
(the number for Xi = false is just 1 − p)

• If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers

• I.e., grows linearly with n, vs. O(2^n) for the full joint distribution

• For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)


Semantics
The full joint distribution is defined as the product of the local conditional distributions:

P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Parents(Xi))

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.000628
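
The same number can be checked in a couple of lines (a sketch; the CPT values are the ones quoted on these slides):

# P(j, m, a, ¬b, ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
p_j_given_a = 0.90
p_m_given_a = 0.70
p_a_given_not_b_not_e = 0.001
p_not_b, p_not_e = 0.999, 0.998

print(p_j_given_a * p_m_given_a * p_a_given_not_b_not_e * p_not_b * p_not_e)
# ≈ 0.000628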
Semantics
One more unresolved issue…
We still haven’t said where we get the
Bayesian network from. There are two
options:
• Get an expert to design it
• Learn it from data
Constructing Bayesian networks
1. Choose an ordering of variables X1, … ,Xn
2. For i= 1 to n
– add Xi to the network
– select parents from X1, … ,Xi-1 such that
P(Xi| Parents(Xi)) = P(Xi| X1, ... Xi-1)

This choice of parents guarantees:

P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | X1, …, Xi−1)   (chain rule)
= ∏_{i=1}^{n} P(Xi | Parents(Xi))   (by construction)
Example
Suppose we choose the ordering
M, J, A, B, E

P(J | M) = P(J)? No
P(A | J, M) = P(A | J)?  P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)? Yes
P(B | A, J, M) = P(B)? No
P(E | B, A, J, M) = P(E | A)? No
P(E | B, A, J, M) = P(E | A, B)? Yes
Example contd.

• Deciding conditional independence is hard in noncausal


directions
• (Causal models and conditional independence seem
hardwired for humans!)
• Network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers
needed
Summary - Basics
• Bayesian networks provide a natural
representation for (causally induced)
conditional independence
• Topology + CPTs = compact
representation of joint distribution
• Generally easy for domain experts to
construct
Conditional Independence
[Figures: a node is conditionally independent of its non-descendants given its parents, and of all other nodes given its Markov blanket]
Answering Queries
• Exact Inference
– by enumeration
– by variable elimination

• Approximate Inference
– Direct sampling
– Rejection sampling
– Likelihood weighting
– Markov Chain Monte Carlo (MCMC)
Typical Inference Tasks
Inference by enumeration
revisited
Usually, our interest is in P(X | e), the posterior distribution of a query variable X given specific values e for the evidence variables E.
Let the hidden or nonevidence variables be Y.

As is already known, the query is answered by summing out the hidden variables y over the joint entries:

P(X | e) = α P(X, e) = α Σ_y P(X, e, y)
Example: enumeration
With the following notation:
b: Burglary = true, e: Earthquake = true, a: Alarm = true, j: JohnCalls = true, m: MaryCalls = true

P(b | j, m) = α Σ_e Σ_a P(b) P(e) P(a | b, e) P(j | a) P(m | a)

i.e., for every e ∈ {true, false} and a ∈ {true, false} we have to multiply 5 numbers found in the CPTs of our Bayesian network.
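
A sketch of the full enumeration for P(B | j, m). Only some CPT entries are quoted on these slides (P(b) = 0.001, P(e) = 0.002, P(a | ¬b, ¬e) = 0.001, P(j | a) = 0.9, P(m | a) = 0.7); the remaining entries below are the usual textbook values and should be read as assumptions:

import itertools

P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}    # P(Alarm = true | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(JohnCalls = true | Alarm)
P_M = {True: 0.70, False: 0.01}                       # P(MaryCalls = true | Alarm)

def joint(b, e, a, j, m):
    """Product of the local conditional probabilities (network semantics)."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# Sum out the hidden variables E and A for each value of B, then normalize.
unnorm = {b: sum(joint(b, e, a, True, True)
                 for e, a in itertools.product((True, False), repeat=2))
          for b in (True, False)}
alpha = 1.0 / sum(unnorm.values())
print({b: alpha * p for b, p in unnorm.items()})   # ≈ {True: 0.284, False: 0.716}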
Example: enumeration cont.
Variable elimination
Inference by enumeration:
complexity
Obvious problems:
• Worst-case time complexity O(d^n), where d is the largest arity and n is the number of variables (i.e., the size of the network)
• Space complexity O(d^n) to store the joint distribution
• How to find the numbers for O(d^n) entries?

• But: for polytrees, i.e., networks in which there is at most one undirected path between any two nodes, time and space complexity are linear in n!
The Bad News
• Exact inference is feasible in small to
medium-sized networks

• Exact inference in large networks takes a


very long time

• We resort to approximate inference


techniques which are much faster and give
pretty good results
Approximate inference
• Instead of exact computation we may use
randomized sampling (Monte Carlo
method)

• If samples reflect the distribution of the


(hidden) random variables the statistics
will approximate the true distribution of the
query variables
1. Direct sampling
• Generate samples from known distribution
i.e. a fair coin has a prior distribution
P(coin) = <0.5, 0.5>
So sampling (i.e. flipping the coin) will
return head with a probability of 0.5
• In a Bayesian network every non-evidence variable is sampled in topological order, conditioned on the values already assigned (by sampling) to its parents
Example: Sprinkler
1. P(C) = <0.5, 0.5>; suppose sampling returns true;
2. P(S | c) = <0.1, 0.9>; suppose we get false;
3. P(R | c) = <0.8, 0.2>; not surprisingly, we find true;
4. P(W | ¬s, r) = <0.9, 0.1>; suppose sampling returns true;

The event [true, false, true, true] is returned
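
Direct (prior) sampling for this network can be sketched as follows. The CPT values quoted on the slide are P(C) = 0.5, P(s | c) = 0.1, P(r | c) = 0.8 and P(w | ¬s, r) = 0.9; the remaining entries are the usual textbook values and are assumptions here:

import random

P_C = 0.5
P_S = {True: 0.10, False: 0.50}                    # P(Sprinkler = true | Cloudy)
P_R = {True: 0.80, False: 0.20}                    # P(Rain = true | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,    # P(WetGrass = true | Sprinkler, Rain)
       (False, True): 0.90, (False, False): 0.00}

def prior_sample():
    """Sample every variable in topological order, conditioned on its sampled parents."""
    c = random.random() < P_C
    s = random.random() < P_S[c]
    r = random.random() < P_R[c]
    w = random.random() < P_W[(s, r)]
    return c, s, r, w

# Estimate e.g. P(Rain = true) from N samples.
N = 10000
print(sum(r for _, _, r, _ in (prior_sample() for _ in range(N))) / N)   # ≈ 0.5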


Direct sampling
Every event sampled by this process is generated with probability
S_DS(x1, …, xn) = ∏_i P(xi | parents(Xi)) = P(x1, …, xn),
i.e., with its prior probability under the network.

If N is the total number of samples and N(x1, …, xn) the frequency of the event x1, …, xn, then N(x1, …, xn)/N converges to P(x1, …, xn) as N grows.
Problems: rejection sampling
• Too many samples rejected and thus effort
(time or space) is wasted.

• Will not work efficiently especially if the


condition e is a rare event!
Likelihood weighting

Problem: not all samples are equal, i.e., samples in which the evidence appears unlikely should be given less weight.

Solution: a weighting factor w is calculated from the network.
Example summarized
Query: P(R | s, wg)
1. w = 1.0  (initialization)
2. Sample from P(C) = <0.5, 0.5>; suppose true is returned
3. Sprinkler is an evidence variable with value true.
Therefore w ← w × P(s | c) = 0.1
4. Sample from P(R | c) = <0.8, 0.2>; suppose this returns true
5. WetGrass is an evidence variable with value true.
Therefore w ← w × P(wg | s, r) = 0.099

This means the event [true, true, true, true] has weight 0.099, which is collected for Rain = true.
The low weight is reasonable because a cloudy day makes the evidence "Sprinkler = true" unlikely.
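
A compact sketch of likelihood weighting for the query P(R | s, wg), using the same (partly assumed) CPTs as in the direct-sampling sketch:

import random

P_C = 0.5
P_S = {True: 0.10, False: 0.50}
P_R = {True: 0.80, False: 0.20}
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.00}

def weighted_sample():
    """One likelihood-weighted sample for P(R | Sprinkler = true, WetGrass = true)."""
    w = 1.0
    c = random.random() < P_C          # Cloudy: non-evidence, sample it
    w *= P_S[c]                        # Sprinkler: evidence (true), weight by P(s | c)
    r = random.random() < P_R[c]       # Rain: non-evidence, sample it
    w *= P_W[(True, r)]                # WetGrass: evidence (true), weight by P(wg | s, r)
    return r, w

totals = {True: 0.0, False: 0.0}
for _ in range(100_000):
    r, w = weighted_sample()
    totals[r] += w                     # collect each sample's weight under Rain = r
norm = sum(totals.values())
print({k: v / norm for k, v in totals.items()})   # ≈ {True: 0.32, False: 0.68}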
Correctness of LW
MCMC
(Markov Chain Monte Carlo)
Idea: Consider a state to be an event
specifying a value for every variable. In
MCMC sampling is performed by making a
random change to a non-evidence variable
X given the current values of the variables in
the Markov blanket of X.
Definitions
Example MCMC
Consider the query P(R | s, wg).

• Evidence variables Sprinkler and WetGrass are fixed to their observed values (i.e., true)
• Hidden variables Cloudy and Rain are initialized randomly, e.g., true and false

⇒ Initial state: [true, true, false, true]
Example: MCMC (contd.)
• Given a state (e.g., [true, true, false, true]), the following steps are executed repeatedly:
1. Cloudy is sampled from P(C | s, ¬r). Suppose the result is Cloudy = false, i.e., the next state is [false, true, false, true].
2. Rain is sampled from P(R | s, wg, ¬c). Suppose the result is Rain = true, i.e., the new state now is [false, true, true, true].
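
A sketch of the corresponding Gibbs sampler (MCMC): each hidden variable is resampled from its distribution given its Markov blanket, with the evidence clamped to true. CPT entries not quoted on the slides are again assumed textbook values:

import random

P_C = 0.5
P_S = {True: 0.10, False: 0.50}
P_R = {True: 0.80, False: 0.20}
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.00}

def bernoulli(w_true, w_false):
    """Sample True with probability proportional to w_true."""
    return random.random() < w_true / (w_true + w_false)

def gibbs_rain(n_steps=100_000):
    c, r = random.random() < 0.5, random.random() < 0.5   # hidden variables, random init
    counts = {True: 0, False: 0}
    for _ in range(n_steps):
        # Resample Cloudy from P(C | s, r) ∝ P(C) P(s | C) P(r | C); evidence s = true.
        c = bernoulli(P_C * P_S[True] * (P_R[True] if r else 1 - P_R[True]),
                      (1 - P_C) * P_S[False] * (P_R[False] if r else 1 - P_R[False]))
        # Resample Rain from P(R | c, s, wg) ∝ P(R | c) P(wg | s, R); evidence wg = true.
        r = bernoulli(P_R[c] * P_W[(True, True)],
                      (1 - P_R[c]) * P_W[(True, False)])
        counts[r] += 1
    return {k: v / n_steps for k, v in counts.items()}

print(gibbs_rain())   # ≈ {True: 0.32, False: 0.68}, matching the likelihood-weighting estimate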
Distribution of X given its Markov Blanket

P(x | mb(X)) = α P(x | parents(X)) × ∏_{Z ∈ children(X)} P(z | parents(Z))

Remember: Y is the set of all hidden variables other than X
Summary
• Exact inference in general Bayesian
networks is infeasible.

• Approximate inference is based on


sampling. Among a couple of approaches
the Gibbs sampler (MCMC variant)
appears to be most convenient and
efficient.
HW 1
Influenza epidemic:
A patient enters the regional practice of a general physician (GP). He has fever (F). The GP knows that, in general, 30% of the population suffer from F. Currently there is an influenza (I) epidemic that comes with F in 80% of cases and affects 15% of the population.
What is the probability that the patient with F has I?
HW2
We have a bag of three biased coins a, b, and c with
probabilities of coming up heads of 20%, 60%, and 80%,
respectively. One coin is drawn randomly from the bag
(with equal likelihood of drawing each of the three coins),
and then the coin is flipped three times to generate the
outcomes X1, X2, and X3.
a. Draw the Bayesian network corresponding to this setup
and define the necessary CPTs.
b. Calculate which coin was most likely to have been drawn
from the bag if the observed flips come out heads twice
and tails once.
HW 3
Consider the Burglary network.
a. If no evidence is observed, are Burglary
and Earthquake independent? Why?
b. If we observe Alarm =true, are Burglary
and Earthquake independent? Justify your
answer by calculating whether the
probabilities involved satisfy the definition of
conditional independence
HW 4
Consider the Bayes net shown in the next
slide.
a. Which of the following are asserted by the
network structure?
(i) P(B, I,M) = P(B)P(I)P(M).
(ii) P(J |G) = P(J | G, I).
(iii) P(M | G,B, I) = P(M | G,B, I, J).
b. Calculate the value of P(b, i,¬m, g, j).
c. Calculate P(j| b,i,m).
HW4 cont.
HW 5
Consider the query P(Rain | Sprinkler =true,
WetGrass =true) and how Gibbs sampling can
answer it.
a. How many states does the Markov chain have?
b. Calculate the transition matrix Q containing q(y → y’)
for all y, y’.
c. What does Q2, the square of the transition matrix,
represent?
d. What about Qn as n→∞?
e. Explain how to do probabilistic inference in Bayesian
networks, assuming that Qn is available. Is this a practical
way to do inference?
HW 6
Use MatLab to implement approximate
reasoning with direct sampling:

Query: P(m|a)
Try n=10, n=100 and n=1000 samples
