Probability

• We use probabilities p(x) to represent our beliefs B(x) about the
  states x of the world.
• There is a formal calculus for manipulating uncertainties
  represented by probabilities.
• Any consistent set of beliefs obeying the Cox Axioms can be
  mapped into probabilities.
  1. Rationally ordered degrees of belief:
     if B(x) > B(y) and B(y) > B(z) then B(x) > B(z)
  2. Belief in x and its negation x̄ are related: B(x) = f[B(x̄)]
  3. Belief in conjunction depends only on conditionals:
     B(x and y) = g[B(x), B(y|x)] = g[B(y), B(x|y)]

Expectations and Moments

• Expectation of a function a(x) is written E[a] or ⟨a⟩:
  E[a] = ⟨a⟩ = Σ_x p(x) a(x)
  e.g. mean = Σ_x x p(x), variance = Σ_x (x − E[x])² p(x)
• Moments are expectations of higher order powers.
  (Mean is first moment. Autocorrelation is second moment.)
• Central moments have lower moments subtracted away
  (e.g. variance, skew, kurtosis).
• Deep fact: knowledge of all orders of moments
  completely defines the entire distribution.
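• As a concrete (non-original) illustration, a minimal Python/NumPy sketch that computes these expectations for a made-up discrete distribution; the values of x and p below are arbitrary:

    import numpy as np

    # A made-up discrete distribution over x in {0, 1, 2, 3}.
    x = np.array([0, 1, 2, 3])
    p = np.array([0.1, 0.2, 0.3, 0.4])           # must sum to 1

    def expectation(a_of_x, p):
        """E[a] = sum_x p(x) a(x) for a discrete distribution."""
        return np.sum(p * a_of_x)

    mean = expectation(x, p)                     # first moment
    second_moment = expectation(x**2, p)         # E[x^2]
    variance = expectation((x - mean)**2, p)     # second central moment
    print(mean, second_moment, variance)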
Joint Probability

• Key concept: two or more random variables may interact.
  Thus, the probability of one taking on a certain value depends on
  which value(s) the others are taking.
• We call this a joint ensemble and write
  p(x, y) = prob(X = x and Y = y)
  [Figure: joint distribution p(x, y, z) over axes x, y, z]

Conditional Probability

• If we know that some event has occurred, it changes our belief
  about the probability of other events.
• This is like taking a "slice" through the joint table:
  p(x|y) = p(x, y)/p(y)
  [Figure: conditional slice p(x, y|z) of the joint p(x, y, z)]
• Another equivalent definition: p(x) = Σ_y p(x|y) p(y).
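• A small Python sketch (not from the notes) illustrating slicing and marginalizing a made-up 2×2 joint table:

    import numpy as np

    # A made-up joint table p(x, y), x indexing rows and y indexing columns.
    p_xy = np.array([[0.10, 0.20],
                     [0.30, 0.40]])        # entries sum to 1

    p_y = p_xy.sum(axis=0)                 # marginal p(y) = sum_x p(x, y)
    p_x_given_y = p_xy / p_y               # "slice": p(x|y) = p(x, y) / p(y)

    # The equivalent definition of the marginal: p(x) = sum_y p(x|y) p(y)
    p_x = (p_x_given_y * p_y).sum(axis=1)
    assert np.allclose(p_x, p_xy.sum(axis=1))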
Independence & Conditional Independence

• Two variables are independent iff their joint factors:
  p(x, y) = p(x) p(y)
  [Figure: a factored joint p(x, y) and its marginal p(x)]
• Two variables are conditionally independent given a third if, for every
  value of the conditioning variable, the corresponding slice factors:
  p(x, y|z) = p(x|z) p(y|z)  ∀z
  [Figure: the joint p(x, y, z) and a conditional slice p(x, y|z)]

Entropy

• Measures the amount of ambiguity or uncertainty in a distribution:
  H(p) = − Σ_x p(x) log p(x)
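• The sketch below (added for illustration; the joint table is made up) checks the factorization test for independence and computes the entropy H(p):

    import numpy as np

    def entropy(p):
        """H(p) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute 0."""
        p = np.asarray(p, dtype=float)
        nz = p > 0
        return -np.sum(p[nz] * np.log(p[nz]))

    p_xy = np.array([[0.12, 0.28],          # a made-up joint that factorizes:
                     [0.18, 0.42]])         # p(x, y) = p(x) p(y)
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    print(np.allclose(p_xy, np.outer(p_x, p_y)))               # True -> independent
    print(entropy(p_xy.ravel()), entropy(p_x) + entropy(p_y))  # equal when independent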
Probability Densities

• Probability density functions p(x) (for continuous variables) or
  probability mass functions p(x = k) (for discrete variables) tell us
  how likely it is to get a particular value for a random variable
  (possibly conditioned on the values of some other variables).
• We can consider various types of variables: binary/discrete
  (categorical), continuous, interval, and integer counts.
• For each type we'll see some basic probability models which are
  parametrized families of distributions.

Exponential Family

• For a (continuous or discrete) random variable x,
  p(x|η) = h(x) exp{η⊤T(x) − A(η)}
         = (1/Z(η)) h(x) exp{η⊤T(x)}
  is an exponential family distribution with natural parameter η.
• Function T(x) is a sufficient statistic.
• Function A(η) = log Z(η) is the log normalizer.
• Key idea: all you need to know about the data is captured in the
  summarizing function T(x).
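• A minimal Python/NumPy sketch of this generic form; the helper expfam_log_prob and its arguments are hypothetical names, with the Bernoulli (next slide) plugged in as an example:

    import numpy as np

    def expfam_log_prob(x, eta, T, A, log_h):
        """log p(x|eta) = log h(x) + eta . T(x) - A(eta)."""
        return log_h(x) + np.dot(np.atleast_1d(eta), np.atleast_1d(T(x))) - A(eta)

    # Bernoulli written in this form (see the next slide):
    # eta = log(pi/(1-pi)), T(x) = x, A(eta) = log(1 + e^eta), h(x) = 1.
    pi = 0.3
    eta = np.log(pi / (1 - pi))
    logp1 = expfam_log_prob(1, eta, T=lambda x: x,
                            A=lambda e: np.log1p(np.exp(e)),
                            log_h=lambda x: 0.0)
    print(np.exp(logp1))   # ~0.3, the chance of heads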
Bernoulli

• For a binary random variable with p(heads) = π:
  p(x|π) = π^x (1 − π)^(1−x)
         = exp{ x log[π/(1 − π)] + log(1 − π) }
• Exponential family with:
  η = log[π/(1 − π)]
  T(x) = x
  A(η) = −log(1 − π) = log(1 + e^η)
  h(x) = 1
• The logistic function relates the natural parameter and the chance
  of heads:
  π = 1/(1 + e^(−η))

Multinomial

• For a set of integer counts on k trials:
  p(x|π) = [k!/(x_1! x_2! ··· x_n!)] π_1^{x_1} π_2^{x_2} ··· π_n^{x_n} = h(x) exp{ Σ_i x_i log π_i }
• But the parameters are constrained: Σ_i π_i = 1.
  So we define the last one: π_n = 1 − Σ_{i=1}^{n−1} π_i.
  p(x|π) = h(x) exp{ Σ_{i=1}^{n−1} log(π_i/π_n) x_i + k log π_n }
• Exponential family with:
  η_i = log π_i − log π_n
  T(x_i) = x_i
  A(η) = −k log π_n = k log Σ_i e^{η_i}
  h(x) = k!/(x_1! x_2! ··· x_n!)
• The softmax function relates the basic and natural parameters:
  π_i = e^{η_i} / Σ_j e^{η_j}
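• A short sketch (not part of the notes) of the two mappings just described, the logistic for the Bernoulli and the softmax for the multinomial; the function names are mine:

    import numpy as np

    def logistic(eta):
        """Bernoulli: natural parameter -> chance of heads, pi = 1/(1 + e^-eta)."""
        return 1.0 / (1.0 + np.exp(-eta))

    def softmax(eta):
        """Multinomial: natural parameters -> probabilities, pi_i = e^eta_i / sum_j e^eta_j."""
        e = np.exp(eta - np.max(eta))      # subtract the max for numerical stability
        return e / e.sum()

    print(logistic(np.log(0.3 / 0.7)))       # recovers pi = 0.3
    print(softmax(np.log([0.2, 0.3, 0.5])))  # recovers [0.2, 0.3, 0.5]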
Poisson

• For an integer count variable with rate λ:
  p(x|λ) = λ^x e^(−λ) / x!
         = (1/x!) exp{ x log λ − λ }
• Exponential family with:
  η = log λ
  T(x) = x
  A(η) = λ = e^η
  h(x) = 1/x!
• e.g. number of photons x that arrive at a pixel during a fixed
  interval given mean intensity λ
• Other count densities: binomial, exponential.
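• For illustration, a small Python check (standard library only) that the direct Poisson log probability and its exponential family form agree; the values 3 and 2.5 are arbitrary:

    import math

    def poisson_log_prob(x, lam):
        """Direct form: log p(x|lambda) = x log(lambda) - lambda - log(x!)."""
        return x * math.log(lam) - lam - math.lgamma(x + 1)

    def poisson_log_prob_expfam(x, lam):
        """Exponential family form: log h(x) + eta*T(x) - A(eta), with
        eta = log(lambda), T(x) = x, A(eta) = e^eta, h(x) = 1/x!."""
        eta = math.log(lam)
        return -math.lgamma(x + 1) + eta * x - math.exp(eta)

    print(poisson_log_prob(3, 2.5), poisson_log_prob_expfam(3, 2.5))  # identical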
Gaussian (Normal)

• For a continuous real-valued variable x with mean µ and variance σ²:
  p(x|µ, σ²) = (2πσ²)^(−1/2) exp{−(x − µ)²/2σ²}

Parameterizing Conditionals

• When the variable(s) being conditioned on (parents) are discrete,
  we just have one density for each possible setting of the parents,
  e.g. a table of natural parameters in exponential models or a table
  of tables for discrete models.
• When the conditioned variable is continuous, its value sets some of
  the parameters for the other variables.
• A very common instance of this for regression is the
  "linear-Gaussian": p(y|x) = gauss(θ⊤x; Σ).
• For discrete children and continuous parents, we often use a
  Bernoulli/multinomial whose parameters are some function f(θ⊤x).

Energy Functions

• We can be even more general and define distributions by arbitrary
  energy functions proportional to the log probability:
  p(x) ∝ exp{− Σ_k H_k(x)}
• A common choice is to use pairwise terms in the energy:
  H(x) = Σ_i a_i x_i + Σ_{pairs ij} w_ij x_i x_j
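• A brief sketch, not from the notes, of a pairwise energy model over three binary variables; the parameters a and W are made up, and the normalizer is computed by brute force since the state space is tiny:

    import itertools
    import numpy as np

    # Made-up parameters for a pairwise energy over 3 binary variables.
    a = np.array([0.5, -1.0, 0.2])                 # unary terms a_i
    W = np.array([[0.0, 0.3, -0.2],
                  [0.0, 0.0, 0.8],
                  [0.0, 0.0, 0.0]])                # w_ij for pairs i < j

    def energy(x):
        """H(x) = sum_i a_i x_i + sum_{pairs ij} w_ij x_i x_j."""
        return a @ x + x @ W @ x

    # p(x) is proportional to exp{-H(x)}; normalize over all 2^3 configurations.
    configs = [np.array(c) for c in itertools.product([0, 1], repeat=3)]
    unnorm = np.array([np.exp(-energy(x)) for x in configs])
    probs = unnorm / unnorm.sum()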
Special Variables

• If certain variables are always observed we may not want to model
  their density, for example inputs in regression or classification.
  This leads to conditional density estimation.
• If certain variables are always unobserved, they are called hidden or
  latent variables. They can always be marginalized out, but can
  make the density modeling of the observed variables easier.
  (We'll see more on this later.)

Likelihood Function

• So far we have focused on the (log) probability function p(x|θ),
  which assigns a probability (density) to any joint configuration of
  variables x given fixed parameters θ.
• But in learning we turn this on its head: we have some fixed data
  and we want to find parameters.
• Think of p(x|θ) as a function of θ for fixed x:
  L(θ; x) = p(x|θ)
  ℓ(θ; x) = log p(x|θ)
  This function is called the (log) "likelihood".
• Choose θ to maximize some cost function c(θ) which includes ℓ(θ):
  c(θ) = ℓ(θ; D)          maximum likelihood (ML)
  c(θ) = ℓ(θ; D) + r(θ)   maximum a posteriori (MAP) / penalized ML
  (also cross-validation, Bayesian estimators, BIC, AIC, ...)
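• A minimal sketch (with made-up coin-flip data and an arbitrary quadratic penalty r(θ)) contrasting the ML and penalized objectives for the Bernoulli model:

    import numpy as np

    x = np.array([1, 1, 0, 1, 0, 1])          # made-up coin flips (1 = heads)

    def loglik(theta, x):
        """l(theta; D) = sum_m log p(x^m | theta) for the Bernoulli model."""
        return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

    thetas = np.linspace(0.01, 0.99, 99)
    ml_obj  = np.array([loglik(t, x) for t in thetas])
    map_obj = ml_obj - 5.0 * (thetas - 0.5) ** 2   # add a penalty r(theta)

    print(thetas[np.argmax(ml_obj)], thetas[np.argmax(map_obj)])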
IID Data

• A single observation of the data X is rarely useful on its own.
• Generally we have data including many observations, which creates
  a set of random variables: D = {x^1, x^2, . . . , x^M}
• Two very common assumptions:
  1. Observations are independently and identically distributed
     according to the joint distribution of the graphical model: IID samples.
  2. We observe all random variables in the domain on each
     observation: complete data.

Maximum Likelihood

• For IID data:
  p(D|θ) = Π_m p(x^m|θ)
  ℓ(θ; D) = Σ_m log p(x^m|θ)
• Idea of maximum likelihood estimation (MLE): pick the setting of
  parameters most likely to have generated the data we saw:
  θ*_ML = argmax_θ ℓ(θ; D)
• Very commonly used in statistics.
  Often leads to "intuitive", "appealing", or "natural" estimators.
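• For illustration, a sketch of the IID sum of log probabilities, here with a Poisson model and a crude grid search for the maximizer (the data are made up):

    from math import lgamma, log
    import numpy as np

    def log_lik_iid(log_prob_one, data, theta):
        """For IID data, l(theta; D) = sum_m log p(x^m | theta)."""
        return sum(log_prob_one(x, theta) for x in data)

    def poisson_logp(x, lam):
        return x * log(lam) - lam - lgamma(x + 1)

    data = [2, 0, 3, 1, 4, 2]                       # made-up counts
    grid = np.linspace(0.1, 6.0, 60)
    theta_ml = grid[np.argmax([log_lik_iid(poisson_logp, data, t) for t in grid])]
    print(theta_ml, np.mean(data))                  # grid argmax is near the sample mean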
Example: Bernoulli Trials

• We observe M iid coin flips: D = H, H, T, H, . . .
• Model: p(H) = θ, p(T) = (1 − θ)
• Likelihood:
  ℓ(θ; D) = log p(D|θ)
          = log Π_m θ^{x^m} (1 − θ)^{1−x^m}
          = log θ Σ_m x^m + log(1 − θ) Σ_m (1 − x^m)
          = N_H log θ + N_T log(1 − θ)
• Take derivatives and set to zero:
  ∂ℓ/∂θ = N_H/θ − N_T/(1 − θ)
  ⇒ θ*_ML = N_H/(N_H + N_T)

Example: Univariate Normal

• We observe M iid real samples: D = 1.18, −.25, .78, . . .
• Model: p(x) = (2πσ²)^(−1/2) exp{−(x − µ)²/2σ²}
• Likelihood (using probability density):
  ℓ(θ; D) = log p(D|θ)
          = −(M/2) log(2πσ²) − (1/2) Σ_m (x^m − µ)²/σ²
• Take derivatives and set to zero:
  ∂ℓ/∂µ = (1/σ²) Σ_m (x^m − µ)
  ∂ℓ/∂σ² = −M/(2σ²) + (1/(2σ⁴)) Σ_m (x^m − µ)²
  ⇒ µ_ML = (1/M) Σ_m x^m
  ⇒ σ²_ML = (1/M) Σ_m (x^m)² − µ²_ML
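• A quick numerical check of both closed-form estimators (the data below are placeholders, extending the samples quoted on the slide):

    import numpy as np

    # Bernoulli trials: theta_ML = N_H / (N_H + N_T).
    flips = np.array([1, 1, 0, 1, 1, 0, 1])          # made-up data, 1 = heads
    theta_ml = flips.sum() / len(flips)

    # Univariate normal: mu_ML = sample mean, sigma2_ML = E[x^2] - mu_ML^2.
    xs = np.array([1.18, -0.25, 0.78, 0.46, -1.30])  # made-up real samples
    mu_ml = xs.mean()
    sigma2_ml = (xs ** 2).mean() - mu_ml ** 2
    assert np.isclose(sigma2_ml, xs.var())           # equals the (biased) sample variance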
Example: Linear Regression

• In linear regression, some inputs (covariates, parents) and all
  outputs (responses, children) are continuous valued variables.
• For each child and setting of discrete parents we use the model:
  p(y|x, θ) = gauss(y|θ⊤x, σ²)
• The likelihood is the familiar "squared error" cost:
  ℓ(θ; D) = −(1/(2σ²)) Σ_m (y^m − θ⊤x^m)²
• The ML parameters can be solved for using linear least-squares:
  ∂ℓ/∂θ = (1/σ²) Σ_m (y^m − θ⊤x^m) x^m
  ⇒ θ*_ML = (X⊤X)^(−1) X⊤Y
  [Figure: least-squares fit of y versus x]

Sufficient Statistics

• A statistic is a function of a random variable.
• T(X) is a "sufficient statistic" for X if
  T(x¹) = T(x²) ⇒ L(θ; x¹) = L(θ; x²)  ∀θ
• Equivalently (by the Neyman factorization theorem) we can write:
  p(x|θ) = h(x, T(x)) g(T(x), θ)
• Example: exponential family models:
  p(x|θ) = h(x) exp{η⊤T(x) − A(η)}
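• A sketch, assuming NumPy, of the least-squares solution; the synthetic data and true parameter vector are arbitrary, and the normal-equation solution is compared against a standard solver:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                    # made-up inputs (covariates)
    theta_true = np.array([1.0, -2.0, 0.5])
    Y = X @ theta_true + 0.1 * rng.normal(size=100)  # linear-Gaussian outputs

    # ML estimate: theta = (X^T X)^{-1} X^T Y, here via a least-squares solver.
    theta_ml, *_ = np.linalg.lstsq(X, Y, rcond=None)
    theta_normal_eq = np.linalg.solve(X.T @ X, X.T @ Y)
    print(np.allclose(theta_ml, theta_normal_eq))    # True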
MLE for Exponential Family Models

• Recall the probability function for exponential models:
  p(x|θ) = h(x) exp{η⊤T(x) − A(η)}
• For iid data, the sufficient statistic is Σ_m T(x^m):
  ℓ(η; D) = log p(D|η) = (Σ_m log h(x^m)) − M A(η) + η⊤ Σ_m T(x^m)
• Take derivatives and set to zero:
  ∂ℓ/∂η = Σ_m T(x^m) − M ∂A(η)/∂η
  ⇒ ∂A(η)/∂η |_{η_ML} = (1/M) Σ_m T(x^m)
  recalling that the natural moments of an exponential distribution
  are the derivatives of the log normalizer.

Fundamental Operations with Distributions

• Generate data: draw samples from the distribution. This often
  involves generating a uniformly distributed variable in the range
  [0,1] and transforming it. For more complex distributions it may
  involve an iterative procedure that takes a long time to produce a
  single sample (e.g. Gibbs sampling, MCMC).
• Compute log probabilities.
  When all variables are either observed or marginalized the result is a
  single number which is the log prob of the configuration.
• Inference: compute expectations of some variables given others
  which are observed or marginalized.
• Learning: set the parameters of the density functions given some
  (partially) observed data to maximize likelihood or penalized likelihood.
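• As an illustration (not from the notes), moment matching for the Poisson case, where T(x) = x and ∂A/∂η = e^η = λ, so the ML rate is just the sample mean:

    import numpy as np

    data = np.array([2, 0, 3, 1, 4, 2])            # made-up counts
    mean_T = data.mean()                           # (1/M) sum_m T(x^m)
    eta_ml = np.log(mean_T)                        # natural parameter at the MLE
    print(mean_T, np.exp(eta_ml))                  # dA/deta at eta_ML matches mean_T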
Basic Learning Problems

• Let's remind ourselves of the basic problems we discussed on the
  first day: density estimation, clustering, classification and regression.
• Density estimation is hardest. If we can do joint density estimation
  then we can always condition to get what we want:
  Regression:     p(y|x) = p(y, x)/p(x)
  Classification: p(c|x) = p(c, x)/p(x)
  Clustering:     p(c|x) = p(c, x)/p(x),  c unobserved

Learning from Data

• In AI the bottleneck is often knowledge acquisition.
• Human experts are rare, expensive, unreliable, slow.
• But we have lots of data.
• We want to build systems automatically based on data and a small
  amount of prior information (from experts).
Jensen's Inequality

• For any concave function f and any distribution on x,
  f(E[x]) ≥ E[f(x)]
  [Figure: a concave f, showing f(E[x]) lying above E[f(x)]]
• e.g. log(·) and √· are concave.
• This allows us to bound expressions like log p(x) = log Σ_z p(x, z).
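• A tiny numerical check (added for illustration, with an arbitrary distribution over positive values) that log(E[x]) ≥ E[log x]:

    import numpy as np

    # Jensen's inequality for the concave function log.
    x = np.array([0.5, 1.0, 2.0, 4.0])
    p = np.array([0.1, 0.4, 0.3, 0.2])
    print(np.log(np.sum(p * x)), np.sum(p * np.log(x)))   # first >= second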