
COMP9444: Neural Networks and Deep Learning
Week 2a. Probability

Alan Blair
School of Computer Science and Engineering
June 3, 2024
Outline

➛ Probability and Random Variables (3.1-3.2)


➛ Probability for Continuous Variables (3.3)
➛ Gaussian Distributions (3.9.3)
➛ Conditional Probability (3.5)
➛ Bayes’ Rule (3.11)
➛ Entropy and KL-Divergence (3.13)
➛ Continuous Distributions
➛ Wasserstein Distance

Probability (3.1)

Begin with a set Ω – the sample space (e.g. 6 possible rolls of a die)
Each ω ∈ Ω is a sample point / possible world / atomic event
A probability space or probability model is a sample space
with an assignment P (ω) for every ω ∈ Ω such that

$$0 \le P(\omega) \le 1, \qquad \sum_{\omega \in \Omega} P(\omega) = 1$$

e.g. P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6.

Random Events

A random event A is any subset of Ω


$$P(A) = \sum_{\omega \in A} P(\omega)$$

e.g. P(die roll < 5) = P(1) + P(2) + P(3) + P(4) = 1/6 + 1/6 + 1/6 + 1/6 = 2/3
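As a quick sanity check, here is a minimal Python sketch (the `omega` dictionary and the `prob` helper are ours, not from the slides) that sums sample-point probabilities over an event for a fair six-sided die:

```python
from fractions import Fraction

# Sample space for a fair six-sided die: each outcome has probability 1/6.
omega = {w: Fraction(1, 6) for w in range(1, 7)}

def prob(event):
    """P(A) = sum of P(omega) over the sample points omega in the event A."""
    return sum(p for w, p in omega.items() if event(w))

print(prob(lambda w: w < 5))       # 2/3
print(prob(lambda w: w % 2 == 1))  # 1/2, i.e. P(Odd = true) on the next slide
```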

Random Variables (3.2)

A random variable is a function from sample points to some range


(e.g. the Reals or Booleans)

For example, Odd(3) = true

P induces a probability distribution for any random variable X:


$$P(X = x_i) = \sum_{\{\omega : X(\omega) = x_i\}} P(\omega)$$

e.g. P(Odd = true) = P(1) + P(3) + P(5) = 1/6 + 1/6 + 1/6 = 1/2

Probability for Continuous Variables (3.3)
For continuous variables, P is a density; it integrates to 1.
e.g. P (X = x) = U [18, 26](x) = uniform density between 18 and 26

[Figure: uniform density U[18, 26], constant height 0.125 between 18 and 26]

When we say P(X = 20.5) = 0.125, it really means

$$\lim_{dx \to 0} P(20.5 \le X \le 20.5 + dx)\,/\,dx = 0.125$$
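A small sketch of the same point, assuming SciPy is available: the density at a point is 0.125, while the probability of a tiny interval is (roughly) 0.125 times its width.

```python
from scipy.stats import uniform

# U[18, 26]: width 8, so the density is 1/8 = 0.125 everywhere on the interval.
U = uniform(loc=18, scale=8)

print(U.pdf(20.5))                     # 0.125 (a density, not a probability)
dx = 1e-6
print(U.cdf(20.5 + dx) - U.cdf(20.5))  # ≈ 0.125 * dx
```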

Gaussian Distribution (3.9.3)

$$P_{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2 / 2\sigma^2}$$

where µ = mean and σ = standard deviation.
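A minimal sketch of this density in Python (the function name `gaussian_pdf` is ours):

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of the Gaussian with mean mu and standard deviation sigma at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(gaussian_pdf(0.0))              # ≈ 0.3989, the peak of the standard normal
print(gaussian_pdf(22.0, 22.0, 2.0))  # peak height 1 / (sqrt(2*pi) * 2) ≈ 0.1995
```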

Multivariate Gaussians

The d-dimensional multivariate Gaussian with mean µ and covariance Σ is given by

$$P_{\mu,\Sigma}(x) = \frac{1}{\sqrt{(2\pi)^d\,|\Sigma|}}\; e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$$

where |Σ| denotes the determinant of Σ.

If Σ = diag(σ1², . . . , σd²) is diagonal, the multivariate Gaussian reduces to

$$P_{\mu,\Sigma}(x) = \prod_{i} P_{\mu_i,\sigma_i}(x_i)$$

The Gaussian with µ = 0, Σ = I is called the Standard Normal distribution.
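A quick numerical check of the diagonal case, assuming SciPy is available (the example values are ours): the joint density equals the product of the per-coordinate densities.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# 2-d Gaussian with diagonal covariance: the density factorises coordinate-wise.
mu = np.array([1.0, -2.0])
sigmas = np.array([0.5, 2.0])
x = np.array([0.8, -1.5])

joint = multivariate_normal(mean=mu, cov=np.diag(sigmas ** 2)).pdf(x)
product = norm(mu[0], sigmas[0]).pdf(x[0]) * norm(mu[1], sigmas[1]).pdf(x[1])
print(joint, product)  # the two values agree
```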

Probability and Logic

[Venn diagram: overlapping events A and B; the overlap is A ∧ B]
Logically related events must have related probabilities
For example, P (A ∨ B) = P (A) + P (B) − P (A ∧ B)

Conditional Probability (3.5)

[Venn diagram: overlapping events A and B; the overlap is A ∧ B]
If P(B) ≠ 0, then the conditional probability of A given B is

$$P(A \mid B) = \frac{P(A \wedge B)}{P(B)}$$

Bayes’ Rule (3.11)

The formula for conditional probability can be manipulated to find a relationship when the two variables are swapped:

$$P(A \wedge B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$

$$\Rightarrow \text{Bayes' rule:}\quad P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

This is often useful for assessing the probability of an underlying Cause after an Effect has been observed:

$$P(\mathrm{Cause} \mid \mathrm{Effect}) = \frac{P(\mathrm{Effect} \mid \mathrm{Cause})\,P(\mathrm{Cause})}{P(\mathrm{Effect})}$$

Example: Medical Diagnosis

Question: Suppose we have a test for a type of cancer which occurs in 1% of patients. The test has a sensitivity of 98% and a specificity of 97%. If a patient tests positive, what is the probability that they have the cancer?

Answer: There are two random variables: Cancer (true or false) and Test (positive or negative). The 1% base rate, P(cancer) = 0.01, is called a prior, because it represents our estimate of the probability before we have done the test (or made some other observation).

The sensitivity and specificity are interpreted as follows:

P (positive | cancer) = 0.98, and P (negative | ¬cancer) = 0.97

Bayes’ Rule for Medical Diagnosis
[Probability tree]
Cancer? Yes (0.01): Test Pos (0.98) → P(Yes, Pos) = 0.01;  Test Neg (0.02) → P(Yes, Neg) = 0.00
Cancer? No (0.99):  Test Pos (0.03) → P(No, Pos) = 0.03;   Test Neg (0.97) → P(No, Neg) = 0.96

$$P(\mathrm{cancer} \mid \mathrm{positive}) = \frac{P(\mathrm{positive} \mid \mathrm{cancer})\,P(\mathrm{cancer})}{P(\mathrm{positive})} = \frac{0.98 \times 0.01}{0.98 \times 0.01 + 0.03 \times 0.99} \approx \frac{0.01}{0.01 + 0.03} = \frac{1}{4}$$
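The same calculation as a minimal Python sketch (the helper name `posterior` is ours); the exact value is about 0.248, which the slide rounds to 1/4:

```python
def posterior(prior, p_pos_given_true, p_pos_given_false):
    """Bayes' rule: P(condition | positive test) from the prior and the two likelihoods."""
    evidence = p_pos_given_true * prior + p_pos_given_false * (1 - prior)
    return p_pos_given_true * prior / evidence

# Prior 1%, sensitivity 98%, false-positive rate 1 - specificity = 3%.
print(posterior(0.01, 0.98, 0.03))  # ≈ 0.248
```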

Example: Light Bulb Defects

Question: You work for a lighting company which manufactures 60% of its light
bulbs in Factory A and 40% in Factory B. One percent of the light bulbs from
Factory A are defective, while two percent of those from Factory B are defective. If
a random light bulb turns out to be defective, what is the probability that it was
manufactured in Factory A?

Answer: There are two random variables: Factory (A or B) and Defect (Yes or No).
In this case, the prior is:
P (A) = 0.6, P (B) = 0.4

The conditional probabilities are:


P (defect | A) = 0.01, and P (defect | B) = 0.02

Bayes’ Rule for Light Bulb Defects
[Probability tree]
Factory A (0.6): Defect Yes (0.01) → P(A, Yes) = 0.006;  Defect No (0.99) → P(A, No) = 0.594
Factory B (0.4): Defect Yes (0.02) → P(B, Yes) = 0.008;  Defect No (0.98) → P(B, No) = 0.392
$$P(A \mid \mathrm{defect}) = \frac{P(\mathrm{defect} \mid A)\,P(A)}{P(\mathrm{defect})} = \frac{0.01 \times 0.6}{0.01 \times 0.6 + 0.02 \times 0.4} = \frac{0.006}{0.006 + 0.008} = \frac{3}{7}$$
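Working the same numbers with exact fractions (a small sketch, not part of the slides) recovers 3/7:

```python
from fractions import Fraction

# Joint probabilities P(Factory, Defect = Yes) from the tree above.
p_A_and_defect = Fraction(6, 10) * Fraction(1, 100)   # 0.006
p_B_and_defect = Fraction(4, 10) * Fraction(2, 100)   # 0.008

print(p_A_and_defect / (p_A_and_defect + p_B_and_defect))  # 3/7
```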

Entropy and KL-Divergence (3.13)
The entropy of a discrete probability distribution p = ⟨p1, . . . , pn⟩ is

$$H(p) = \sum_{i=1}^{n} p_i(-\log_2 p_i)$$

Given two probability distributions p = ⟨p1, . . . , pn⟩ and q = ⟨q1, . . . , qn⟩ on the same set Ω, the Kullback-Leibler Divergence between p and q is

$$D_{KL}(p\,\|\,q) = \sum_{i=1}^{n} p_i(\log_2 p_i - \log_2 q_i)$$

KL-Divergence is like a “distance” from one probability distribution to another. But, it is not symmetric:

$$D_{KL}(p\,\|\,q) \ne D_{KL}(q\,\|\,p)$$
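A minimal NumPy sketch of both quantities (the function names are ours), which also shows the asymmetry numerically:

```python
import numpy as np

def entropy(p):
    """H(p) = sum_i p_i * (-log2 p_i), in bits; terms with p_i = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(-p * np.log2(p)))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * (log2 p_i - log2 q_i), in bits."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * (np.log2(p[mask]) - np.log2(q[mask]))))

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(entropy(p))                                # 1.5 bits (Example 2 below)
print(kl_divergence(p, q), kl_divergence(q, p))  # unequal: the divergence is not symmetric
```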

Entropy and Huffman Coding

Entropy is the number of bits per symbol achieved by a (block) Huffman Coding
scheme.
Example 1: H(⟨0.5, 0.5⟩) = 1 bit.
Suppose we want to encode, in zeros and ones, a long message composed of the
letters A and B, which occur with equal frequency. This can be done efficiently by
assigning A=0, B=1. In other words, one bit is needed to encode each letter.

[Code tree: 0 → A, 1 → B]
Entropy and Huffman Coding
Example 2: H(⟨0.5, 0.25, 0.25⟩) = 1.5 bits.
Suppose we need to encode a message consisting of the letters A, B and C, where
B and C occur equally often but A occurs twice as often as the other two letters.
In this case, an optimally efficient code would be A=0, B=10, C=11.
The average number of bits needed to encode each letter is 1.5.

[Code tree: 0 → A; 1 → (0 → B, 1 → C)]
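As a quick check of Example 2 (a tiny sketch, not from the slides), the expected code length under A=0, B=10, C=11 matches the entropy:

```python
codes = {"A": "0", "B": "10", "C": "11"}
freqs = {"A": 0.5, "B": 0.25, "C": 0.25}

# Expected number of bits per letter = sum over letters of frequency * code length.
print(sum(freqs[s] * len(codes[s]) for s in codes))  # 1.5
```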
Entropy and KL-Divergence
If the samples occur in some other proportion, we would need to “block” them
together in order to encode them efficiently. But, the average number of bits
required by the most efficient coding scheme is given by
$$H(\langle p_1, \ldots, p_n\rangle) = \sum_{i=1}^{n} p_i(-\log_2 p_i)$$

DKL(p || q) is the number of extra bits we need to transmit if we designed a code for q() but it turned out that the samples were drawn from p() instead.

$$D_{KL}(p\,\|\,q) = \sum_{i=1}^{n} p_i(\log_2 p_i - \log_2 q_i)$$

Continuous Entropy and KL-Divergence

➛ The entropy of a continuous distribution p() is

$$H(p) = \int_{\theta} p(\theta)(-\log p(\theta))\, d\theta$$

➛ The KL-Divergence between two continuous distributions p() and q() is

$$D_{KL}(p\,\|\,q) = \int_{\theta} p(\theta)(\log p(\theta) - \log q(\theta))\, d\theta$$

Entropy for Gaussian Distributions
Entropy of a Gaussian with mean µ and standard deviation σ:

$$\frac{1}{2}(1 + \log(2\pi)) + \log(\sigma)$$

Entropy of a d-dimensional Gaussian p() with mean µ and covariance Σ:

$$H(p) = \frac{d}{2}(1 + \log(2\pi)) + \frac{1}{2}\log|\Sigma|$$

If Σ = diag(σ1², . . . , σd²) is diagonal, the entropy is:

$$H(p) = \frac{d}{2}(1 + \log(2\pi)) + \sum_{i=1}^{d} \log(\sigma_i)$$
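A small numerical check, assuming SciPy is available (entropies here are in nats, matching the natural log in the formulas above):

```python
import math
import numpy as np
from scipy.stats import norm

# 1-d case: 0.5 * (1 + log(2*pi)) + log(sigma) versus SciPy's differential entropy.
sigma = 2.5
print(0.5 * (1 + math.log(2 * math.pi)) + math.log(sigma))
print(float(norm(loc=0.0, scale=sigma).entropy()))  # same value

# Diagonal d-dimensional case: (d/2) * (1 + log(2*pi)) + sum_i log(sigma_i).
sigmas = np.array([0.5, 1.0, 2.0])
d = len(sigmas)
print(d / 2 * (1 + math.log(2 * math.pi)) + np.sum(np.log(sigmas)))
```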

KL-Divergence for Gaussians

KL-Divergence between Gaussians q(), p() with means µ1, µ2 and covariances Σ1, Σ2:

$$D_{KL}(q\,\|\,p) = \frac{1}{2}\Big[(\mu_2 - \mu_1)^T \Sigma_2^{-1}(\mu_2 - \mu_1) + \mathrm{Trace}(\Sigma_2^{-1}\Sigma_1) + \log\frac{|\Sigma_2|}{|\Sigma_1|} - d\Big]$$

In the case where µ2 = 0, Σ2 = I, the KL-Divergence simplifies to:

$$D_{KL}(q\,\|\,p) = \frac{1}{2}\Big[\|\mu_1\|^2 + \mathrm{Trace}(\Sigma_1) - \log|\Sigma_1| - d\Big]$$

If Σ1 = diag(σ1², . . . , σd²) is diagonal, this reduces to:

$$D_{KL}(q\,\|\,p) = \frac{1}{2}\Big[\|\mu_1\|^2 + \sum_{i=1}^{d}(\sigma_i^2 - 2\log(\sigma_i) - 1)\Big]$$
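A minimal sketch of the diagonal case above (the helper name is ours): the KL-Divergence from a diagonal Gaussian q to the standard normal p.

```python
import numpy as np

def kl_diag_to_standard_normal(mu, sigmas):
    """D_KL( N(mu, diag(sigmas^2)) || N(0, I) ), using the diagonal-case formula."""
    mu = np.asarray(mu, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    return 0.5 * (np.sum(mu ** 2) + np.sum(sigmas ** 2 - 2 * np.log(sigmas) - 1))

print(kl_diag_to_standard_normal([0.0, 0.0], [1.0, 1.0]))   # 0.0: q and p coincide
print(kl_diag_to_standard_normal([1.0, -1.0], [0.5, 2.0]))  # > 0
```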

Wasserstein Distance

Another commonly used measure is the Wasserstein Distance which, for multivariate Gaussians, is given by

$$W_2(q, p)^2 = \|\mu_1 - \mu_2\|^2 + \mathrm{Trace}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1\Sigma_2)^{\frac{1}{2}}\big)$$

In the case where µ2 = 0, Σ2 = I, the Wasserstein Distance simplifies to:

$$W_2(q, p)^2 = \|\mu_1\|^2 + d + \mathrm{Trace}\big(\Sigma_1 - 2\,\Sigma_1^{\frac{1}{2}}\big)$$

If Σ1 = diag(σ1², . . . , σd²) is diagonal, this reduces to:

$$W_2(q, p)^2 = \|\mu_1\|^2 + \sum_{i=1}^{d}(\sigma_i - 1)^2$$
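The diagonal case again as a small sketch (helper name ours), directly comparable to the KL helper on the previous slide:

```python
import numpy as np

def w2_squared_diag_to_standard_normal(mu, sigmas):
    """Squared 2-Wasserstein distance from N(mu, diag(sigmas^2)) to N(0, I)."""
    mu = np.asarray(mu, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    return np.sum(mu ** 2) + np.sum((sigmas - 1.0) ** 2)

print(w2_squared_diag_to_standard_normal([0.0, 0.0], [1.0, 1.0]))   # 0.0
print(w2_squared_diag_to_standard_normal([1.0, -1.0], [0.5, 2.0]))  # 2 + 0.25 + 1 = 3.25
```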

Forward KL-Divergence
Given P , choose Gaussian Q to minimize DKL (P || Q)

Q must not be small in any place where P is large.

Reverse KL-Divergence
Given P , choose Gaussian Q to minimize DKL (Q || P )

Q just needs to be concentrated in some place where P is large.

