CS 229 Spring 2016 Problem Set #3: Theory & Unsupervised Learning
We assume that the voters correctly disclose their vote during the survey. Thus, for each value of $i$, we have that the $X_{ij}$ are drawn IID from a Bernoulli($\phi_i$) distribution. Moreover, the $X_{ij}$'s (for all $i, j$) are all mutually independent.
After the survey, the fraction of democrat votes in state $i$ is estimated as:
$$\hat{\phi}_i = \frac{1}{m} \sum_{j=1}^{m} X_{ij}$$
Also, let $Z_i = 1\{|\hat{\phi}_i - \phi_i| > \gamma\}$ be a binary random variable that indicates whether the prediction in state $i$ was highly inaccurate.
(a) Let $\psi_i$ be the probability that $Z_i = 1$. Using the Hoeffding inequality, find an upper bound on $\psi_i$.
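For reference, one common two-sided form of the Hoeffding inequality (stated here only as a reminder of the tool, not as part of the problem) is: for $X_1, \ldots, X_m$ drawn IID from a Bernoulli($\phi$) distribution and any $\gamma > 0$,
$$P\left(\left|\frac{1}{m}\sum_{j=1}^{m} X_j - \phi\right| > \gamma\right) \le 2\exp(-2\gamma^2 m).$$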
(b) In this part, we prove a general result which will be useful for this problem. Let $V_i$ and $W_i$ ($1 \le i \le k$) be Bernoulli random variables, and suppose
$$E[V_i] = P(V_i = 1) \ge P(W_i = 1) = E[W_i] \quad \forall i \in \{1, 2, \ldots, k\}.$$
Let the $V_i$'s be mutually independent, and similarly let the $W_i$'s also be mutually independent. Prove that, for any value of $t$, the following holds:
$$P\left(\sum_{i=1}^{k} V_i > t\right) \ge P\left(\sum_{i=1}^{k} W_i > t\right)$$
[Hint: One way to do this is via induction on $k$. If you use a proof by induction, for the base case ($k = 1$), you must show that the inequality holds for $t < 0$, $0 \le t < 1$, and $t \ge 1$.]
(c) The fraction of states on which our predictions are highly inaccurate is given by $\bar{Z} = \frac{1}{n} \sum_{i=1}^{n} Z_i$. Prove a reasonable closed-form upper bound on the probability $P(\bar{Z} > \tau)$ of being highly inaccurate on more than a fraction $\tau$ of the states.
[Note: There are many possible answers, but to be considered reasonable, your bound must decrease to zero as $m \to \infty$ (for fixed $n$ and $\tau > 0$). Also, your bound should either remain constant or decrease as $n \to \infty$ (for fixed $m$ and $\tau > 0$). It is also fine if, for some values of $\tau$, $m$ and $n$, your bound just tells us that $P(\bar{Z} > \tau) \le 1$ (the trivial bound).]
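As a sanity check on whatever bound you derive, the polling setup itself is easy to simulate. The sketch below is purely illustrative and not part of the problem: the parameter values, the choice of drawing each true fraction uniformly, and the helper name estimate_failure_prob are all arbitrary. It estimates $P(\bar{Z} > \tau)$ by Monte Carlo for given $n$, $m$, $\gamma$, $\tau$.

```python
import numpy as np

def estimate_failure_prob(n=50, m=1000, gamma=0.05, tau=0.1,
                          trials=2000, seed=0):
    """Monte Carlo estimate of P(Zbar > tau) for the polling model.

    Each state i has a true democrat fraction phi_i (drawn uniformly here,
    an arbitrary modeling choice); we survey m voters, form the empirical
    fraction, and mark the state as highly inaccurate when the estimate
    is off by more than gamma.
    """
    rng = np.random.default_rng(seed)
    failures = 0
    for _ in range(trials):
        phi = rng.uniform(0.0, 1.0, size=n)          # true fraction per state
        votes = rng.random((n, m)) < phi[:, None]    # X_ij ~ Bernoulli(phi_i)
        phi_hat = votes.mean(axis=1)                 # estimated fraction per state
        z = np.abs(phi_hat - phi) > gamma            # indicators Z_i
        failures += z.mean() > tau                   # event {Zbar > tau}
    return failures / trials

print(estimate_failure_prob())
```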
2. [15 points] More VC dimension
Let the domain of the inputs for a learning problem be $\mathcal{X} = \mathbb{R}$. Consider using hypotheses of the following form:
$$h_\theta(x) = 1\{\theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_d x^d \ge 0\},$$
and let $\mathcal{H} = \{h_\theta : \theta \in \mathbb{R}^{d+1}\}$ be the corresponding hypothesis class. What is the VC dimension of $\mathcal{H}$? Justify your answer.
[Hint: You may use the fact that a polynomial of degree d has at most d real roots. When
doing this problem, you should not assume any other non-trivial result (such as that the
VC dimension of linear classifiers in d-dimensions is d + 1) that was not formally proved
in class.]
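If it helps to build intuition for what "shattering" means here, the sketch below is purely illustrative (the random-search strategy and the names is_shattered and predict are my own choices, not part of the problem). It tries to realize every labeling of a small point set with hypotheses of the above form by sampling coefficient vectors $\theta$ at random; succeeding on all $2^k$ labelings is evidence, though of course not a proof, that the set is shattered.

```python
import itertools
import numpy as np

def predict(theta, xs):
    """Evaluate h_theta(x) = 1{theta_0 + theta_1 x + ... + theta_d x^d >= 0}."""
    powers = np.vander(xs, N=len(theta), increasing=True)  # columns 1, x, ..., x^d
    return (powers @ theta >= 0).astype(int)

def is_shattered(xs, d, samples=20000, seed=0):
    """Heuristic check: can the points xs be shattered by degree-d polynomial
    threshold functions? Searches over random theta; False means inconclusive."""
    rng = np.random.default_rng(seed)
    remaining = set(itertools.product([0, 1], repeat=len(xs)))  # labelings not yet realized
    for _ in range(samples):
        theta = rng.normal(size=d + 1)
        remaining.discard(tuple(int(v) for v in predict(theta, xs)))
        if not remaining:
            return True
    return False

# Example: three distinct points and degree-2 hypotheses.
print(is_shattered(np.array([-1.0, 0.0, 1.0]), d=2))
```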
The maximum likelihood estimate of the parameters $\theta$ is given by
$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta).$$
If we wanted to regularize logistic regression, then we might put a Bayesian prior on the parameters. Suppose we chose the prior $\theta \sim \mathcal{N}(0, \tau^2 I)$ (here, $\tau > 0$, and $I$ is the $(n+1)$-by-$(n+1)$ identity matrix), and then found the MAP estimate of $\theta$ as:
$$\theta_{\mathrm{MAP}} = \arg\max_{\theta} \; p(\theta) \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta)$$
Prove that
$$\|\theta_{\mathrm{MAP}}\|_2 \le \|\theta_{\mathrm{ML}}\|_2$$
[Hint: Consider using a proof by contradiction.]
Remark. For this reason, this form of regularization is sometimes also called weight
decay, since it encourages the weights (meaning parameters) to take on generally smaller
values.
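One standard observation that may help connect the Remark to the problem (offered for orientation only; it is not the required proof): since $\log p(\theta) = -\frac{1}{2\tau^2}\|\theta\|_2^2 + \text{const}$ for the prior $\theta \sim \mathcal{N}(0, \tau^2 I)$, taking logs of the MAP objective gives
$$\theta_{\mathrm{MAP}} = \arg\max_{\theta} \left[ \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}, \theta) - \frac{1}{2\tau^2} \|\theta\|_2^2 \right],$$
so the Gaussian prior acts as an $\ell_2$ penalty on the weights.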
4. [15 points] KL divergence and Maximum Likelihood
The Kullback-Leibler (KL) divergence between two discrete-valued distributions $P(X)$, $Q(X)$ is defined as follows:1
$$KL(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
For notational convenience, we assume $P(x) > 0$ for all $x$. (Otherwise, one standard thing to do is to adopt the convention that $0 \log 0 = 0$.) Sometimes, we also write the KL divergence as $KL(P\|Q) = KL(P(X)\|Q(X))$.
The KL divergence is an asymmetric measure of the distance between two probability distributions. In this problem we will prove some basic properties of KL divergence, and work out a relationship between minimizing KL divergence and the maximum likelihood estimation that we're familiar with.
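As a concrete illustration of the definition (not part of the problem; the two distributions below are arbitrary), here is a direct computation of $KL(P\|Q)$ for two small discrete distributions, which also shows in passing that the divergence is not symmetric:

```python
import math

def kl(p, q):
    """KL(P||Q) for discrete distributions given as lists of probabilities
    over the same support, assuming p[i] > 0 and q[i] > 0 for all i."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

P = [0.5, 0.5]   # fair coin
Q = [0.9, 0.1]   # heavily biased coin

print(kl(P, Q))  # ~0.5108
print(kl(Q, P))  # ~0.3681  (different value: KL is asymmetric)
```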
(a) Nonnegativity. Prove the following:
$$\forall P, Q \quad KL(P\|Q) \ge 0$$
and
$$KL(P\|Q) = 0 \quad \text{if and only if } P = Q.$$
1 If P and Q are densities for continuous-valued random variables, then the sum is replaced by an integral, and everything stated in this problem works fine as well. But for the sake of simplicity, in this problem we'll just work with this form of KL divergence for probability mass functions/discrete-valued distributions.
[Hint: You may use the following result, called Jensen's inequality. If $f$ is a convex function, and $X$ is a random variable, then $E[f(X)] \ge f(E[X])$. Moreover, if $f$ is strictly convex ($f$ is convex if its Hessian satisfies $H \succeq 0$; it is strictly convex if $H \succ 0$; for instance $f(x) = -\log x$ is strictly convex), then $E[f(X)] = f(E[X])$ implies that $X = E[X]$ with probability 1; i.e., $X$ is actually a constant.]
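For a familiar instance of the hint (offered only as orientation, not something to rely on in the proof): taking $f(x) = x^2$, which is strictly convex, Jensen's inequality gives
$$E[X^2] \ge (E[X])^2,$$
i.e., $\mathrm{Var}(X) \ge 0$, with equality exactly when $X$ is a constant.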
(b) Chain rule for KL divergence. The KL divergence between two conditional distributions $P(X|Y)$, $Q(X|Y)$ is defined as follows:
$$KL(P(X|Y) \| Q(X|Y)) = \sum_{y} P(y) \left( \sum_{x} P(x|y) \log \frac{P(x|y)}{Q(x|y)} \right)$$
This can be thought of as the expected KL divergence between the corresponding conditional distributions on $x$ (that is, between $P(X|Y = y)$ and $Q(X|Y = y)$), where the expectation is taken over the random $y$.
Prove the following chain rule for KL divergence:
$$KL(P(X, Y) \| Q(X, Y)) = KL(P(X) \| Q(X)) + KL(P(Y|X) \| Q(Y|X)).$$
(c) KL and maximum likelihood.
Consider a density estimation problem, and suppose we are given a training set $\{x^{(i)}; i = 1, \ldots, m\}$. Let the empirical distribution be $\hat{P}(x) = \frac{1}{m} \sum_{i=1}^{m} 1\{x^{(i)} = x\}$. ($\hat{P}$ is just the uniform distribution over the training set; i.e., sampling from the empirical distribution is the same as picking a random example from the training set.)
Suppose we have some family of distributions $P_\theta$ parameterized by $\theta$. (If you like, think of $P_\theta(x)$ as an alternative notation for $P(x; \theta)$.) Prove that finding the maximum likelihood estimate for the parameter $\theta$ is equivalent to finding $P_\theta$ with minimal KL divergence from $\hat{P}$. I.e. prove:
$$\arg\min_{\theta} KL(\hat{P} \| P_\theta) = \arg\max_{\theta} \sum_{i=1}^{m} \log P_\theta(x^{(i)})$$
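A quick numerical illustration of the claim in part (c) (a sketch under my own choice of model and toy data, not part of the problem): for a Bernoulli family $P_\theta(x) = \theta^x (1-\theta)^{1-x}$, a grid search shows that the $\theta$ minimizing $KL(\hat{P}\|P_\theta)$ coincides with the $\theta$ maximizing the log-likelihood; both land on the sample mean.

```python
import math

data = [1, 1, 0, 1, 0, 1, 1, 0]                          # toy binary training set
m = len(data)
p_hat = {1: data.count(1) / m, 0: data.count(0) / m}     # empirical distribution

def kl_from_empirical(theta):
    """KL(P_hat || P_theta) for the Bernoulli family P_theta."""
    p_theta = {1: theta, 0: 1 - theta}
    return sum(p_hat[x] * math.log(p_hat[x] / p_theta[x]) for x in (0, 1))

def log_likelihood(theta):
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

grid = [i / 1000 for i in range(1, 1000)]                # avoid theta = 0 or 1
theta_min_kl = min(grid, key=kl_from_empirical)
theta_max_ll = max(grid, key=log_likelihood)

print(theta_min_kl, theta_max_ll, sum(data) / m)         # all three agree: 0.625
```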
Remark. Consider the relationship between parts (b-c) and multi-variate Bernoulli Naive Bayes parameter estimation. In the Naive Bayes model we assumed $P_\theta$ is of the following form: $P_\theta(x, y) = p(y) \prod_{i=1}^{n} p(x_i|y)$. By the chain rule for KL divergence, we therefore have:
$$KL(\hat{P} \| P_\theta) = KL(\hat{P}(y) \| p(y)) + \sum_{i=1}^{n} KL(\hat{P}(x_i|y) \| p(x_i|y)).$$
2 In order to use the imread and imshow commands in Octave, you have to install the Image package from octave-forge. This package and installation instructions are available at: http://octave.sourceforge.net
3 Please implement K-means yourself, rather than using built-in functions from, e.g., MATLAB or Octave.