CS 229 Spring 2016 Problem Set #3: Theory & Unsupervised Learning
We assume that the voters correctly disclose their vote during the survey. Thus, for each value of $i$, we have that the $X_{ij}$ are drawn IID from a Bernoulli($\phi_i$) distribution. Moreover, the $X_{ij}$'s (for all $i, j$) are all mutually independent.
After the survey, the fraction of democrat votes in state $i$ is estimated as:
$$\hat{\phi}_i = \frac{1}{m} \sum_{j=1}^{m} X_{ij}$$
Also, let $Z_i = 1\{|\hat{\phi}_i - \phi_i| > \gamma\}$ be a binary random variable that indicates whether the prediction in state $i$ was highly inaccurate.
(a) Let $\psi_i$ be the probability that $Z_i = 1$. Using the Hoeffding inequality, find an upper bound on $\psi_i$.
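For reference, one common two-sided form of the Hoeffding inequality (stated here only as a reminder of the tool, not as part of the problem) is: for $X_1, \ldots, X_m$ drawn IID from a Bernoulli($\phi$) distribution and any $\gamma > 0$,
$$P\left(\left|\frac{1}{m}\sum_{j=1}^{m} X_j - \phi\right| > \gamma\right) \le 2\exp(-2\gamma^2 m).$$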
(b) In this part, we prove a general result which will be useful for this problem. Let $V_i$ and $W_i$ ($1 \le i \le k$) be Bernoulli random variables, and suppose
$$E[V_i] = P(V_i = 1) \ge P(W_i = 1) = E[W_i] \quad \forall i \in \{1, 2, \ldots, k\}.$$
Let the $V_i$'s be mutually independent, and similarly let the $W_i$'s also be mutually independent. Prove that, for any value of $t$, the following holds:
$$P\left(\sum_{i=1}^{k} V_i > t\right) \ge P\left(\sum_{i=1}^{k} W_i > t\right)$$
[Hint: One way to do this is via induction on $k$. If you use a proof by induction, for the base case ($k = 1$), you must show that the inequality holds for $t < 0$, $0 \le t < 1$, and $t \ge 1$.]
(c) The fraction of states on which our predictions are highly inaccurate is given by $\bar{Z} = \frac{1}{n} \sum_{i=1}^{n} Z_i$. Prove a reasonable closed-form upper bound on the probability $P(\bar{Z} > \tau)$ of being highly inaccurate on more than a fraction $\tau$ of the states.
[Note: There are many possible answers, but to be considered reasonable, your bound must decrease to zero as $m \to \infty$ (for fixed $n$ and $\tau > 0$). Also, your bound should either remain constant or decrease as $n \to \infty$ (for fixed $m$ and $\tau > 0$). It is also fine if, for some values of $\tau$, $m$ and $n$, your bound just tells us that $P(\bar{Z} > \tau) \le 1$ (the trivial bound).]
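As a sanity check on whatever bound you derive, the polling setup itself is easy to simulate. The sketch below is purely illustrative and not part of the problem: the parameter values, the choice of drawing each true fraction uniformly, and the helper name estimate_failure_prob are all arbitrary. It estimates $P(\bar{Z} > \tau)$ by Monte Carlo for given $n$, $m$, $\gamma$, $\tau$.

```python
import numpy as np

def estimate_failure_prob(n=50, m=1000, gamma=0.05, tau=0.1,
                          trials=2000, seed=0):
    """Monte Carlo estimate of P(Zbar > tau) for the polling model.

    Each state i has a true democrat fraction phi_i (drawn uniformly here,
    an arbitrary modeling choice); we survey m voters, form the empirical
    fraction, and mark the state as highly inaccurate when the estimate
    is off by more than gamma.
    """
    rng = np.random.default_rng(seed)
    failures = 0
    for _ in range(trials):
        phi = rng.uniform(0.0, 1.0, size=n)          # true fraction per state
        votes = rng.random((n, m)) < phi[:, None]    # X_ij ~ Bernoulli(phi_i)
        phi_hat = votes.mean(axis=1)                 # estimated fraction per state
        z = np.abs(phi_hat - phi) > gamma            # indicators Z_i
        failures += z.mean() > tau                   # event {Zbar > tau}
    return failures / trials

print(estimate_failure_prob())
```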
2. [15 points] More VC dimension
Let the domain of the inputs for a learning problem be $\mathcal{X} = \mathbb{R}$. Consider using hypotheses of the following form:
$$h_\theta(x) = 1\{\theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_d x^d \ge 0\},$$
and let $\mathcal{H} = \{h_\theta : \theta \in \mathbb{R}^{d+1}\}$ be the corresponding hypothesis class. What is the VC dimension of $\mathcal{H}$? Justify your answer.
[Hint: You may use the fact that a polynomial of degree d has at most d real roots. When
doing this problem, you should not assume any other non-trivial result (such as that the
VC dimension of linear classifiers in d-dimensions is d + 1) that was not formally proved
in class.]
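If it helps to build intuition for what "shattering" means here, the sketch below is purely illustrative (the random-search strategy and the names is_shattered and predict are my own choices, not part of the problem). It tries to realize every labeling of a small point set with hypotheses of the above form by sampling coefficient vectors $\theta$ at random; succeeding on all $2^k$ labelings is evidence, though of course not a proof, that the set is shattered.

```python
import itertools
import numpy as np

def predict(theta, xs):
    """Evaluate h_theta(x) = 1{theta_0 + theta_1 x + ... + theta_d x^d >= 0}."""
    powers = np.vander(xs, N=len(theta), increasing=True)  # columns 1, x, ..., x^d
    return (powers @ theta >= 0).astype(int)

def is_shattered(xs, d, samples=20000, seed=0):
    """Heuristic check: can the points xs be shattered by degree-d polynomial
    threshold functions? Searches over random theta; False means inconclusive."""
    rng = np.random.default_rng(seed)
    remaining = set(itertools.product([0, 1], repeat=len(xs)))  # labelings not yet realized
    for _ in range(samples):
        theta = rng.normal(size=d + 1)
        remaining.discard(tuple(int(v) for v in predict(theta, xs)))
        if not remaining:
            return True
    return False

# Example: three distinct points and degree-2 hypotheses.
print(is_shattered(np.array([-1.0, 0.0, 1.0]), d=2))
```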
The maximum likelihood estimate of the parameters $\theta$ is given by
$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta).$$
If we wanted to regularize logistic regression, then we might put a Bayesian prior on the parameters. Suppose we chose the prior $\theta \sim \mathcal{N}(0, \tau^2 I)$ (here, $\tau > 0$, and $I$ is the $(n+1)$-by-$(n+1)$ identity matrix), and then found the MAP estimate of $\theta$ as:
$$\theta_{\mathrm{MAP}} = \arg\max_{\theta} \; p(\theta) \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta)$$
Prove that
$$\|\theta_{\mathrm{MAP}}\|_2 \le \|\theta_{\mathrm{ML}}\|_2$$
[Hint: Consider using a proof by contradiction.]
Remark. For this reason, this form of regularization is sometimes also called weight
decay, since it encourages the weights (meaning parameters) to take on generally smaller
values.
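One standard observation that may help connect the Remark to the problem (offered for orientation only; it is not the required proof): since $\log p(\theta) = -\frac{1}{2\tau^2}\|\theta\|_2^2 + \text{const}$ for the prior $\theta \sim \mathcal{N}(0, \tau^2 I)$, taking logs of the MAP objective gives
$$\theta_{\mathrm{MAP}} = \arg\max_{\theta} \left[ \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}, \theta) - \frac{1}{2\tau^2} \|\theta\|_2^2 \right],$$
so the Gaussian prior acts as an $\ell_2$ penalty on the weights.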
4. [15 points] KL divergence and Maximum Likelihood
The Kullback-Leibler (KL) divergence between two discrete-valued distributions $P(X)$, $Q(X)$ is defined as follows:1
$$KL(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
For notational convenience, we assume $P(x) > 0$ for all $x$. (Otherwise, one standard thing to do is to adopt the convention that $0 \log 0 = 0$.) Sometimes, we also write the KL divergence as $KL(P\|Q) = KL(P(X)\|Q(X))$.
The KL divergence is an asymmetric measure of the distance between two probability distributions. In this problem we will prove some basic properties of KL divergence, and work out a relationship between minimizing KL divergence and the maximum likelihood estimation that we're familiar with.
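As a concrete illustration of the definition (not part of the problem; the two distributions below are arbitrary), here is a direct computation of $KL(P\|Q)$ for two small discrete distributions, which also shows in passing that the divergence is not symmetric:

```python
import math

def kl(p, q):
    """KL(P||Q) for discrete distributions given as lists of probabilities
    over the same support, assuming p[i] > 0 and q[i] > 0 for all i."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

P = [0.5, 0.5]   # fair coin
Q = [0.9, 0.1]   # heavily biased coin

print(kl(P, Q))  # ~0.5108
print(kl(Q, P))  # ~0.3681  (different value: KL is asymmetric)
```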
(a) Nonnegativity. Prove the following:
$$\forall P, Q \quad KL(P\|Q) \ge 0$$
and
$$KL(P\|Q) = 0 \quad \text{if and only if } P = Q.$$
1 If P and Q are densities for continuous-valued random variables, then the sum is replaced by an integral, and everything stated in this problem works fine as well. But for the sake of simplicity, in this problem we'll just work with this form of KL divergence for probability mass functions/discrete-valued distributions.
[Hint: You may use the following result, called Jensen's inequality. If $f$ is a convex function, and $X$ is a random variable, then $E[f(X)] \ge f(E[X])$. Moreover, if $f$ is strictly convex ($f$ is convex if its Hessian satisfies $H \succeq 0$; it is strictly convex if $H \succ 0$; for instance $f(x) = -\log x$ is strictly convex), then $E[f(X)] = f(E[X])$ implies that $X = E[X]$ with probability 1; i.e., $X$ is actually a constant.]
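For a familiar instance of the hint (offered only as orientation, not something to rely on in the proof): taking $f(x) = x^2$, which is strictly convex, Jensen's inequality gives
$$E[X^2] \ge (E[X])^2,$$
i.e., $\mathrm{Var}(X) \ge 0$, with equality exactly when $X$ is a constant.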
(b) Chain rule for KL divergence. The KL divergence between two conditional distributions $P(X|Y)$, $Q(X|Y)$ is defined as follows:
$$KL(P(X|Y) \| Q(X|Y)) = \sum_{y} P(y) \left( \sum_{x} P(x|y) \log \frac{P(x|y)}{Q(x|y)} \right)$$
This can be thought of as the expected KL divergence between the corresponding conditional distributions on $x$ (that is, between $P(X|Y = y)$ and $Q(X|Y = y)$), where the expectation is taken over the random $y$.
Prove the following chain rule for KL divergence:
$$KL(P(X, Y) \| Q(X, Y)) = KL(P(X) \| Q(X)) + KL(P(Y|X) \| Q(Y|X)).$$
(c) KL and maximum likelihood.
Consider a density estimation problem, and suppose we are given a training set $\{x^{(i)}; i = 1, \ldots, m\}$. Let the empirical distribution be $\hat{P}(x) = \frac{1}{m} \sum_{i=1}^{m} 1\{x^{(i)} = x\}$. ($\hat{P}$ is just the uniform distribution over the training set; i.e., sampling from the empirical distribution is the same as picking a random example from the training set.)
Suppose we have some family of distributions $P_\theta$ parameterized by $\theta$. (If you like, think of $P_\theta(x)$ as an alternative notation for $P(x; \theta)$.) Prove that finding the maximum likelihood estimate for the parameter $\theta$ is equivalent to finding $P_\theta$ with minimal KL divergence from $\hat{P}$. I.e. prove:
$$\arg\min_{\theta} KL(\hat{P} \| P_\theta) = \arg\max_{\theta} \sum_{i=1}^{m} \log P_\theta(x^{(i)})$$
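A quick numerical illustration of the claim in part (c) (a sketch under my own choice of model and toy data, not part of the problem): for a Bernoulli family $P_\theta(x) = \theta^x (1-\theta)^{1-x}$, a grid search shows that the $\theta$ minimizing $KL(\hat{P}\|P_\theta)$ coincides with the $\theta$ maximizing the log-likelihood; both land on the sample mean.

```python
import math

data = [1, 1, 0, 1, 0, 1, 1, 0]                          # toy binary training set
m = len(data)
p_hat = {1: data.count(1) / m, 0: data.count(0) / m}     # empirical distribution

def kl_from_empirical(theta):
    """KL(P_hat || P_theta) for the Bernoulli family P_theta."""
    p_theta = {1: theta, 0: 1 - theta}
    return sum(p_hat[x] * math.log(p_hat[x] / p_theta[x]) for x in (0, 1))

def log_likelihood(theta):
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

grid = [i / 1000 for i in range(1, 1000)]                # avoid theta = 0 or 1
theta_min_kl = min(grid, key=kl_from_empirical)
theta_max_ll = max(grid, key=log_likelihood)

print(theta_min_kl, theta_max_ll, sum(data) / m)         # all three agree: 0.625
```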
Remark. Consider the relationship between parts (b-c) and multi-variate Bernoulli Naive Bayes parameter estimation. In the Naive Bayes model we assumed $P_\theta$ is of the following form: $P_\theta(x, y) = p(y) \prod_{i=1}^{n} p(x_i|y)$. By the chain rule for KL divergence, we therefore have:
$$KL(\hat{P} \| P_\theta) = KL(\hat{P}(y) \| p(y)) + \sum_{i=1}^{n} KL(\hat{P}(x_i|y) \| p(x_i|y)).$$
2 In order to use the imread and imshow commands in Octave, you have to install the Image package from octave-forge. This package and installation instructions are available at: http://octave.sourceforge.net
3 Please implement K-means yourself, rather than using built-in functions from, e.g., MATLAB or Octave.