Deep Learning
Week 2a. Probability
Alan Blair
School of Computer Science and Engineering
June 3, 2024
Probability (3.1)
Begin with a set Ω – the sample space (e.g. 6 possible rolls of a die)
Each ω ∈ Ω is a sample point / possible world / atomic event
A probability space or probability model is a sample space
with an assignment P(ω) for every ω ∈ Ω such that
0 ≤ P(ω) ≤ 1   and   ∑_{ω} P(ω) = 1
Random Events
e.g. P(die roll < 5) = P(1) + P(2) + P(3) + P(4) = 1/6 + 1/6 + 1/6 + 1/6 = 2/3
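As a minimal sketch of such a probability model in Python, using the fair six-sided die:

# Fair-die probability model: sample space, point probabilities, and an event.
from fractions import Fraction

omega = [1, 2, 3, 4, 5, 6]                    # sample space
P = {w: Fraction(1, 6) for w in omega}        # P(w) for every sample point

assert all(0 <= p <= 1 for p in P.values())   # 0 <= P(w) <= 1
assert sum(P.values()) == 1                   # probabilities sum to 1

event = {w for w in omega if w < 5}           # the event "die roll < 5"
print(sum(P[w] for w in event))               # 2/3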
Random Variables (3.2)
e.g., P(Odd = true) = P(1) + P(3) + P(5) = 1/6 + 1/6 + 1/6 = 1/2
Probability for Continuous Variables (3.3)
For continuous variables, P is a density; it integrates to 1.
e.g. P(X = x) = U[18, 26](x) = uniform density between 18 and 26
[Plot: uniform density of constant height 0.125 on the interval [18, 26]]
When we say P(X = 20.5) = 0.125, it really means
lim_{dx→0} P(20.5 ≤ X ≤ 20.5 + dx) / dx = 0.125
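One way to check this numerically (a sketch using scipy; loc=18 and scale=8 give support [18, 26]):

# Uniform density U[18, 26]: pdf values are densities, not probabilities.
from scipy.stats import uniform

X = uniform(loc=18, scale=8)       # support is [loc, loc + scale] = [18, 26]
print(X.pdf(20.5))                 # 0.125, the density at x = 20.5
print(X.cdf(21.0) - X.cdf(20.0))   # P(20 <= X <= 21) = 0.125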
Gaussian Distribution (3.9.3)
Pµ,σ(x) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))
µ = mean,  σ = standard deviation
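A quick numerical check that this formula agrees with scipy's normal density:

# Evaluate the Gaussian density directly and compare with scipy.stats.norm.
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0
x = np.linspace(-5.0, 7.0, 13)

p_manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
p_scipy = norm(loc=mu, scale=sigma).pdf(x)
assert np.allclose(p_manual, p_scipy)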
Multivariate Gaussians
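For reference, the density of a d-dimensional Gaussian with mean µ and covariance matrix Σ (the distribution the entropy and KL-divergence formulas later in this deck refer to) is
N(x; µ, Σ) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp( −(1/2) (x − µ)^T Σ⁻¹ (x − µ) )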
Probability and Logic
[Venn diagram: events A and B, overlapping in A ∧ B]
Logically related events must have related probabilities
For example, P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
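For instance, with the fair die above, taking A = “roll is odd” and B = “roll < 3” gives
P(A ∨ B) = P(A) + P(B) − P(A ∧ B) = 1/2 + 1/3 − 1/6 = 2/3.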
Conditional Probability (3.5)
[Venn diagram: events A and B, overlapping in A ∧ B]
If P(B) ≠ 0, then the conditional probability of A given B is
P(A | B) = P(A ∧ B) / P(B)
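Again with the fair die, taking A = “roll = 2” and B = “roll < 5” gives
P(A | B) = P(A ∧ B) / P(B) = (1/6) / (2/3) = 1/4.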
Bayes’ Rule (3.11)
Bayes’ rule:  P(A | B) = P(B | A) P(A) / P(B)
This is often useful for assessing the probability of an underlying Cause after an
Effect has been observed:
P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
Example: Medical Diagnosis
Question: Suppose that 1% of patients have a certain type of cancer. A test for this
cancer returns a positive result for 98% of patients who have it, but also for 3% of
patients who do not. If a patient’s test comes back positive, what is the probability
that they actually have the cancer?
Answer: There are two random variables: Cancer (true or false) and Test (positive
or negative). The probability P(Cancer = true) = 0.01 is called a prior, because it
represents our estimate of the probability before we have done the test (or made
some other observation).
Bayes’ Rule for Medical Diagnosis
Cancer? = Yes (0.01)
    → Test = Pos (0.98):  P(Yes, Pos) = 0.01
    → Test = Neg (0.02):  P(Yes, Neg) = 0.00
Cancer? = No (0.99)
    → Test = Pos (0.03):  P(No, Pos) = 0.03
    → Test = Neg (0.97):  P(No, Neg) = 0.96
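Applying Bayes’ rule to the numbers in this tree (the joint probabilities shown are rounded to two decimal places), a short sketch:

# Posterior P(Cancer = Yes | Test = Pos) from the tree above.
p_cancer = 0.01               # prior P(Cancer = Yes)
p_pos_given_yes = 0.98        # P(Test = Pos | Cancer = Yes)
p_pos_given_no = 0.03         # P(Test = Pos | Cancer = No)

p_pos = p_pos_given_yes * p_cancer + p_pos_given_no * (1 - p_cancer)
posterior = p_pos_given_yes * p_cancer / p_pos
print(round(posterior, 3))    # 0.248: a positive test still leaves cancer fairly unlikely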
Example: Light Bulb Defects
Question: You work for a lighting company which manufactures 60% of its light
bulbs in Factory A and 40% in Factory B. One percent of the light bulbs from
Factory A are defective, while two percent of those from Factory B are defective. If
a random light bulb turns out to be defective, what is the probability that it was
manufactured in Factory A?
Answer: There are two random variables: Factory (A or B) and Defect (Yes or No).
In this case, the prior is:
P(A) = 0.6,  P(B) = 0.4
Bayes’ Rule for Light Bulb Defects
Factory? = A (0.6)
    → Defect? = Yes (0.01):  P(A, Yes) = 0.006
    → Defect? = No (0.99):   P(A, No)  = 0.594
Factory? = B (0.4)
    → Defect? = Yes (0.02):  P(B, Yes) = 0.008
    → Defect? = No (0.98):   P(B, No)  = 0.392

P(A | defect) = P(defect | A) P(A) / P(defect)
             = (0.01 × 0.6) / (0.01 × 0.6 + 0.02 × 0.4)
             = 0.006 / (0.006 + 0.008)
             = 3/7
Entropy and KL-Divergence (3.13)
The entropy of a discrete probability distribution p = ⟨p1, . . . , pn⟩ is
H(p) = ∑_{i=1}^{n} pi (− log2 pi)
Entropy and Huffman Coding
Entropy is the number of bits per symbol achieved by a (block) Huffman Coding
scheme.
Example 1: H(⟨0.5, 0.5⟩) = 1 bit.
Suppose we want to encode, in zeros and ones, a long message composed of the
letters A and B, which occur with equal frequency. This can be done efficiently by
assigning A=0, B=1. In other words, one bit is needed to encode each letter.
[Code tree: A = 0, B = 1]
Entropy and Huffman Coding
Example 2: H(⟨0.5, 0.25, 0.25⟩) = 1.5 bits.
Suppose we need to encode a message consisting of the letters A, B and C, where
B and C occur equally often but A occurs twice as often as the other two letters.
In this case, an optimally efficient code would be A=0, B=10, C=11.
The average number of bits needed to encode each letter is 1.5.
[Code tree: A = 0, B = 10, C = 11]
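A quick check that the entropy formula reproduces these averages:

# Entropy (in bits) of the two example distributions.
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * -np.log2(p)))

print(entropy_bits([0.5, 0.5]))          # 1.0 bit  (code A=0, B=1)
print(entropy_bits([0.5, 0.25, 0.25]))   # 1.5 bits (code A=0, B=10, C=11)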
Entropy and KL-Divergence
If the samples occur in some other proportion, we would need to “block” them
together in order to encode them efficiently. Even so, the average number of bits
per symbol required by the most efficient coding scheme is given by
H(⟨p1, . . . , pn⟩) = ∑_{i=1}^{n} pi (− log2 pi)
DKL(p || q) is the number of extra bits we need to transmit if we designed a code for
q() but it turned out that the samples were drawn from p() instead:
DKL(p || q) = ∑_{i=1}^{n} pi (log2 pi − log2 qi)
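A small sketch of this definition, with arbitrary example distributions p and q:

# D_KL(p || q) in bits: the extra cost of coding for q when samples come from p.
import numpy as np

def kl_bits(p, q):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log2(p) - np.log2(q))))

p = [0.5, 0.25, 0.25]        # distribution the samples are actually drawn from
q = [1/3, 1/3, 1/3]          # distribution the code was designed for
print(kl_bits(p, q))         # about 0.085 extra bits per symbol
print(kl_bits(p, p))         # 0.0: no penalty when the code matches the source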
Continuous Entropy and KL-Divergence
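For continuous densities p(x) and q(x), the corresponding (differential) definitions, used with natural logarithms in the Gaussian formulas on the following slides, are
H(p) = − ∫ p(x) log p(x) dx,    DKL(p || q) = ∫ p(x) ( log p(x) − log q(x) ) dx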
Entropy for Gaussian Distributions
Entropy of a Gaussian with mean µ and standard deviation σ:
(1/2)(1 + log(2π)) + log(σ)
Entropy of a d-dimensional Gaussian p() with mean µ and covariance Σ:
H(p) = (d/2)(1 + log(2π)) + (1/2) log |Σ|
If Σ = diag(σ1², . . . , σd²) is diagonal, the entropy is:
H(p) = (d/2)(1 + log(2π)) + ∑_{i=1}^{d} log(σi)
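A numerical sanity check of both closed forms against scipy, which also reports differential entropy in nats:

# Gaussian entropy: closed-form expressions vs scipy's entropy() values.
import numpy as np
from scipy.stats import norm, multivariate_normal

sigma = 1.7
h = 0.5 * (1 + np.log(2 * np.pi)) + np.log(sigma)
assert np.isclose(h, norm(scale=sigma).entropy())

Sigma = np.diag([0.5, 2.0, 1.3]) ** 2        # diagonal covariance, d = 3
d = Sigma.shape[0]
H = 0.5 * d * (1 + np.log(2 * np.pi)) + 0.5 * np.log(np.linalg.det(Sigma))
assert np.isclose(H, multivariate_normal(mean=np.zeros(d), cov=Sigma).entropy())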
KL-Divergence for Gaussians
DKL(q || p) = (1/2) [ (µ2 − µ1)^T Σ2⁻¹ (µ2 − µ1) + Trace(Σ2⁻¹ Σ1) + log(|Σ2| / |Σ1|) − d ],
where q = N(µ1, Σ1) and p = N(µ2, Σ2).
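A numpy sketch of this closed form, assuming (as the notation suggests) q = N(µ1, Σ1) and p = N(µ2, Σ2), with the result in nats:

# Closed-form KL divergence between two multivariate Gaussians.
import numpy as np

def kl_gauss(mu1, Sigma1, mu2, Sigma2):
    d = len(mu1)
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    return 0.5 * (diff @ Sigma2_inv @ diff
                  + np.trace(Sigma2_inv @ Sigma1)
                  + np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1))
                  - d)

mu1, Sigma1 = np.array([0.0, 1.0]), np.diag([1.0, 4.0])
mu2, Sigma2 = np.array([1.0, 0.0]), np.eye(2)
print(kl_gauss(mu1, Sigma1, mu1, Sigma1))   # 0.0 when q and p coincide
print(kl_gauss(mu1, Sigma1, mu2, Sigma2))   # positive otherwise (about 1.81 here)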
Wasserstein Distance
W2(q, p)² = ||µ1||² + ∑_{i=1}^{d} (σi − 1)²
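A small sketch, assuming (as the formula suggests) that q = N(µ1, diag(σ1², . . . , σd²)) and that p is the standard normal N(0, I):

# Squared 2-Wasserstein distance between q = N(mu1, diag(sigma^2)) and p = N(0, I).
import numpy as np

def w2_squared(mu1, sigma):
    mu1 = np.asarray(mu1, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return float(mu1 @ mu1 + np.sum((sigma - 1.0) ** 2))

print(w2_squared([0.0, 0.0], [1.0, 1.0]))   # 0.0: q already equals the standard normal
print(w2_squared([1.0, 2.0], [0.5, 2.0]))   # 5.0 + 0.25 + 1.0 = 6.25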
Forward KL-Divergence
Given P, choose Gaussian Q to minimize DKL(P || Q)
Reverse KL-Divergence
Given P, choose Gaussian Q to minimize DKL(Q || P)
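As a small grid-search sketch of the difference between the two objectives (using an arbitrary bimodal target P): minimizing the forward KL spreads Q across both modes, while minimizing the reverse KL concentrates Q on a single mode.

# Fit a single Gaussian Q to a bimodal P by forward vs reverse KL (grid search).
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
P = 0.5 * norm(-3, 1).pdf(x) + 0.5 * norm(3, 1).pdf(x)      # bimodal target density

def kl(a, b):
    # discretised KL divergence between densities a and b on the grid
    return float(np.sum(a * (np.log(a + 1e-12) - np.log(b + 1e-12))) * dx)

candidates = [(m, s) for m in np.linspace(-5, 5, 41) for s in np.linspace(0.5, 5, 46)]
Q = {(m, s): norm(m, s).pdf(x) for m, s in candidates}

best_forward = min(candidates, key=lambda c: kl(P, Q[c]))   # minimize DKL(P || Q)
best_reverse = min(candidates, key=lambda c: kl(Q[c], P))   # minimize DKL(Q || P)

print("forward KL fit (mu, sigma):", best_forward)   # broad Q: mean near 0, large sigma
print("reverse KL fit (mu, sigma):", best_reverse)   # narrow Q sitting on one of the modes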