Deep Learning
Week 2a. Probability
Alan Blair
School of Computer Science and Engineering
June 3, 2024
Probability (3.1)
Begin with a set Ω – the sample space (e.g. 6 possible rolls of a die)
Each ω ∈ Ω is a sample point / possible world / atomic event
A probability space or probability model is a sample space
with an assignment P(ω) for every ω ∈ Ω such that
0 ≤ P(ω) ≤ 1   and   ∑_{ω} P(ω) = 1
Random Events
e.g. P(die roll < 5) = P(1) + P(2) + P(3) + P(4) = 1/6 + 1/6 + 1/6 + 1/6 = 2/3
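As a minimal sketch of such a probability model in Python, using the fair six-sided die:

# Fair-die probability model: sample space, point probabilities, and an event.
from fractions import Fraction

omega = [1, 2, 3, 4, 5, 6]                    # sample space
P = {w: Fraction(1, 6) for w in omega}        # P(w) for every sample point

assert all(0 <= p <= 1 for p in P.values())   # 0 <= P(w) <= 1
assert sum(P.values()) == 1                   # probabilities sum to 1

event = {w for w in omega if w < 5}           # the event "die roll < 5"
print(sum(P[w] for w in event))               # 2/3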
Random Variables (3.2)
e.g., P(Odd = true) = P(1) + P(3) + P(5) = 1/6 + 1/6 + 1/6 = 1/2
Probability for Continuous Variables (3.3)
For continuous variables, P is a density; it integrates to 1.
e.g. P(X = x) = U[18, 26](x) = uniform density between 18 and 26
[Plot: uniform density of constant height 0.125 on the interval [18, 26]]
When we say P(X = 20.5) = 0.125, it really means
lim_{dx→0} P(20.5 ≤ X ≤ 20.5 + dx) / dx = 0.125
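One way to check this numerically (a sketch using scipy; loc=18 and scale=8 give support [18, 26]):

# Uniform density U[18, 26]: pdf values are densities, not probabilities.
from scipy.stats import uniform

X = uniform(loc=18, scale=8)       # support is [loc, loc + scale] = [18, 26]
print(X.pdf(20.5))                 # 0.125, the density at x = 20.5
print(X.cdf(21.0) - X.cdf(20.0))   # P(20 <= X <= 21) = 0.125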
Gaussian Distribution (3.9.3)
Pµ,σ(x) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))
µ = mean,  σ = standard deviation
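A quick numerical check that this formula agrees with scipy's normal density:

# Evaluate the Gaussian density directly and compare with scipy.stats.norm.
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0
x = np.linspace(-5.0, 7.0, 13)

p_manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
p_scipy = norm(loc=mu, scale=sigma).pdf(x)
assert np.allclose(p_manual, p_scipy)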
Multivariate Gaussians
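For reference, the density of a d-dimensional Gaussian with mean µ and covariance matrix Σ (the distribution the entropy and KL-divergence formulas later in this deck refer to) is
N(x; µ, Σ) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp( −(1/2) (x − µ)^T Σ⁻¹ (x − µ) )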
Probability and Logic
[Venn diagram: events A and B, overlapping in A ∧ B]
Logically related events must have related probabilities
For example, P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
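For instance, with the fair die above, taking A = “roll is odd” and B = “roll < 3” gives
P(A ∨ B) = P(A) + P(B) − P(A ∧ B) = 1/2 + 1/3 − 1/6 = 2/3.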
Conditional Probability (3.5)
[Venn diagram: events A and B, overlapping in A ∧ B]
If P(B) ≠ 0, then the conditional probability of A given B is
P(A | B) = P(A ∧ B) / P(B)
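Again with the fair die, taking A = “roll = 2” and B = “roll < 5” gives
P(A | B) = P(A ∧ B) / P(B) = (1/6) / (2/3) = 1/4.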
Bayes’ Rule (3.11)
Bayes’ rule:  P(A | B) = P(B | A) P(A) / P(B)
This is often useful for assessing the probability of an underlying Cause after an
Effect has been observed:
P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
Example: Medical Diagnosis
Question: Suppose that 1% of patients have a certain type of cancer. A test for this
cancer returns a positive result for 98% of patients who have it, but also for 3% of
patients who do not. If a patient’s test comes back positive, what is the probability
that they actually have the cancer?
Answer: There are two random variables: Cancer (true or false) and Test (positive
or negative). The probability P(Cancer = true) = 0.01 is called a prior, because it
represents our estimate of the probability before we have done the test (or made
some other observation).
Bayes’ Rule for Medical Diagnosis
Cancer? = Yes (0.01)
    → Test = Pos (0.98):  P(Yes, Pos) = 0.01
    → Test = Neg (0.02):  P(Yes, Neg) = 0.00
Cancer? = No (0.99)
    → Test = Pos (0.03):  P(No, Pos) = 0.03
    → Test = Neg (0.97):  P(No, Neg) = 0.96
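Applying Bayes’ rule to the numbers in this tree (the joint probabilities shown are rounded to two decimal places), a short sketch:

# Posterior P(Cancer = Yes | Test = Pos) from the tree above.
p_cancer = 0.01               # prior P(Cancer = Yes)
p_pos_given_yes = 0.98        # P(Test = Pos | Cancer = Yes)
p_pos_given_no = 0.03         # P(Test = Pos | Cancer = No)

p_pos = p_pos_given_yes * p_cancer + p_pos_given_no * (1 - p_cancer)
posterior = p_pos_given_yes * p_cancer / p_pos
print(round(posterior, 3))    # 0.248: a positive test still leaves cancer fairly unlikely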
Example: Light Bulb Defects
Question: You work for a lighting company which manufactures 60% of its light
bulbs in Factory A and 40% in Factory B. One percent of the light bulbs from
Factory A are defective, while two percent of those from Factory B are defective. If
a random light bulb turns out to be defective, what is the probability that it was
manufactured in Factory A?
Answer: There are two random variables: Factory (A or B) and Defect (Yes or No).
In this case, the prior is:
P(A) = 0.6,  P(B) = 0.4
Bayes’ Rule for Light Bulb Defects
Factory? = A (0.6)
    → Defect? = Yes (0.01):  P(A, Yes) = 0.006
    → Defect? = No (0.99):   P(A, No)  = 0.594
Factory? = B (0.4)
    → Defect? = Yes (0.02):  P(B, Yes) = 0.008
    → Defect? = No (0.98):   P(B, No)  = 0.392

P(A | defect) = P(defect | A) P(A) / P(defect)
             = (0.01 × 0.6) / (0.01 × 0.6 + 0.02 × 0.4)
             = 0.006 / (0.006 + 0.008)
             = 3/7
Entropy and KL-Divergence (3.13)
The entropy of a discrete probability distribution p = ⟨p1, . . . , pn⟩ is
H(p) = ∑_{i=1}^{n} pi (− log2 pi)
Entropy and Huffman Coding
Entropy is the number of bits per symbol achieved by a (block) Huffman Coding
scheme.
Example 1: H(⟨0.5, 0.5⟩) = 1 bit.
Suppose we want to encode, in zeros and ones, a long message composed of the
letters A and B, which occur with equal frequency. This can be done efficiently by
assigning A=0, B=1. In other words, one bit is needed to encode each letter.
[Code tree: A = 0, B = 1]
Entropy and Huffman Coding
Example 2: H(⟨0.5, 0.25, 0.25⟩) = 1.5 bits.
Suppose we need to encode a message consisting of the letters A, B and C, where
B and C occur equally often but A occurs twice as often as the other two letters.
In this case, an optimally efficient code would be A=0, B=10, C=11.
The average number of bits needed to encode each letter is 1.5.
[Code tree: A = 0, B = 10, C = 11]
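A quick check that the entropy formula reproduces these averages:

# Entropy (in bits) of the two example distributions.
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * -np.log2(p)))

print(entropy_bits([0.5, 0.5]))          # 1.0 bit  (code A=0, B=1)
print(entropy_bits([0.5, 0.25, 0.25]))   # 1.5 bits (code A=0, B=10, C=11)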
Entropy and KL-Divergence
If the samples occur in some other proportion, we would need to “block” them
together in order to encode them efficiently. Even so, the average number of bits
per symbol required by the most efficient coding scheme is given by
H(⟨p1, . . . , pn⟩) = ∑_{i=1}^{n} pi (− log2 pi)
DKL(p || q) is the number of extra bits we need to transmit if we designed a code for
q() but it turned out that the samples were drawn from p() instead:
DKL(p || q) = ∑_{i=1}^{n} pi (log2 pi − log2 qi)
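A small sketch of this definition, with arbitrary example distributions p and q:

# D_KL(p || q) in bits: the extra cost of coding for q when samples come from p.
import numpy as np

def kl_bits(p, q):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log2(p) - np.log2(q))))

p = [0.5, 0.25, 0.25]        # distribution the samples are actually drawn from
q = [1/3, 1/3, 1/3]          # distribution the code was designed for
print(kl_bits(p, q))         # about 0.085 extra bits per symbol
print(kl_bits(p, p))         # 0.0: no penalty when the code matches the source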
Continuous Entropy and KL-Divergence
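For continuous densities p(x) and q(x), the corresponding (differential) definitions, used with natural logarithms in the Gaussian formulas on the following slides, are
H(p) = − ∫ p(x) log p(x) dx,    DKL(p || q) = ∫ p(x) ( log p(x) − log q(x) ) dx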
Entropy for Gaussian Distributions
Entropy of a Gaussian with mean µ and standard deviation σ:
(1/2)(1 + log(2π)) + log(σ)
Entropy of a d-dimensional Gaussian p() with mean µ and covariance Σ:
H(p) = (d/2)(1 + log(2π)) + (1/2) log |Σ|
If Σ = diag(σ1², . . . , σd²) is diagonal, the entropy is:
H(p) = (d/2)(1 + log(2π)) + ∑_{i=1}^{d} log(σi)
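A numerical sanity check of both closed forms against scipy, which also reports differential entropy in nats:

# Gaussian entropy: closed-form expressions vs scipy's entropy() values.
import numpy as np
from scipy.stats import norm, multivariate_normal

sigma = 1.7
h = 0.5 * (1 + np.log(2 * np.pi)) + np.log(sigma)
assert np.isclose(h, norm(scale=sigma).entropy())

Sigma = np.diag([0.5, 2.0, 1.3]) ** 2        # diagonal covariance, d = 3
d = Sigma.shape[0]
H = 0.5 * d * (1 + np.log(2 * np.pi)) + 0.5 * np.log(np.linalg.det(Sigma))
assert np.isclose(H, multivariate_normal(mean=np.zeros(d), cov=Sigma).entropy())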
KL-Divergence for Gaussians
DKL(q || p) = (1/2) [ (µ2 − µ1)^T Σ2⁻¹ (µ2 − µ1) + Trace(Σ2⁻¹ Σ1) + log(|Σ2| / |Σ1|) − d ],
where q = N(µ1, Σ1) and p = N(µ2, Σ2).
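A numpy sketch of this closed form, assuming (as the notation suggests) q = N(µ1, Σ1) and p = N(µ2, Σ2), with the result in nats:

# Closed-form KL divergence between two multivariate Gaussians.
import numpy as np

def kl_gauss(mu1, Sigma1, mu2, Sigma2):
    d = len(mu1)
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    return 0.5 * (diff @ Sigma2_inv @ diff
                  + np.trace(Sigma2_inv @ Sigma1)
                  + np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1))
                  - d)

mu1, Sigma1 = np.array([0.0, 1.0]), np.diag([1.0, 4.0])
mu2, Sigma2 = np.array([1.0, 0.0]), np.eye(2)
print(kl_gauss(mu1, Sigma1, mu1, Sigma1))   # 0.0 when q and p coincide
print(kl_gauss(mu1, Sigma1, mu2, Sigma2))   # positive otherwise (about 1.81 here)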
Wasserstein Distance
W2(q, p)² = ||µ1||² + ∑_{i=1}^{d} (σi − 1)²
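A small sketch, assuming (as the formula suggests) that q = N(µ1, diag(σ1², . . . , σd²)) and that p is the standard normal N(0, I):

# Squared 2-Wasserstein distance between q = N(mu1, diag(sigma^2)) and p = N(0, I).
import numpy as np

def w2_squared(mu1, sigma):
    mu1 = np.asarray(mu1, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return float(mu1 @ mu1 + np.sum((sigma - 1.0) ** 2))

print(w2_squared([0.0, 0.0], [1.0, 1.0]))   # 0.0: q already equals the standard normal
print(w2_squared([1.0, 2.0], [0.5, 2.0]))   # 5.0 + 0.25 + 1.0 = 6.25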
Forward KL-Divergence
Given P, choose Gaussian Q to minimize DKL(P || Q)
Reverse KL-Divergence
Given P, choose Gaussian Q to minimize DKL(Q || P)
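As a small grid-search sketch of the difference between the two objectives (using an arbitrary bimodal target P): minimizing the forward KL spreads Q across both modes, while minimizing the reverse KL concentrates Q on a single mode.

# Fit a single Gaussian Q to a bimodal P by forward vs reverse KL (grid search).
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
P = 0.5 * norm(-3, 1).pdf(x) + 0.5 * norm(3, 1).pdf(x)      # bimodal target density

def kl(a, b):
    # discretised KL divergence between densities a and b on the grid
    return float(np.sum(a * (np.log(a + 1e-12) - np.log(b + 1e-12))) * dx)

candidates = [(m, s) for m in np.linspace(-5, 5, 41) for s in np.linspace(0.5, 5, 46)]
Q = {(m, s): norm(m, s).pdf(x) for m, s in candidates}

best_forward = min(candidates, key=lambda c: kl(P, Q[c]))   # minimize DKL(P || Q)
best_reverse = min(candidates, key=lambda c: kl(Q[c], P))   # minimize DKL(Q || P)

print("forward KL fit (mu, sigma):", best_forward)   # broad Q: mean near 0, large sigma
print("reverse KL fit (mu, sigma):", best_reverse)   # narrow Q sitting on one of the modes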