

CS 229, Summer 2019


Problem Set #3 Solutions
Yusu Qian (006239176)

Due Monday, Aug 12 at 11:59 pm on Gradescope.

Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at http://piazza.com/stanford/summer2019/cs229. (3)
If you missed the first lecture or are unfamiliar with the collaboration or honor code policy,
please read the policy on the course website before starting work. (4) For the coding problems,
you may not use any libraries except those defined in the provided environment.yml file. In
particular, ML-specific libraries such as scikit-learn are not permitted. (5) To account for late
days, the due date is Monday, Aug 12 at 11:59 pm. If you submit after Monday, Aug 12 at 11:59
pm, you will begin consuming your late days. If you wish to submit on time, submit before
Monday, Aug 12 at 11:59 pm.
All students must submit an electronic PDF version of the written questions. We highly recommend
typesetting your solutions via LaTeX. All students must also submit a zip file of their
source code to Gradescope, which should be created using the make zip.py script. You should
make sure to (1) restrict yourself to only using libraries included in the environment.yml file,
and (2) make sure your code runs without errors. Your submission may be evaluated by the
auto-grader using a private test set, or used for verifying the outputs reported in the writeup.

1. [25 points] Reinforcement Learning: The inverted pendulum


In this problem, you will apply reinforcement learning to automatically design a policy for a
difficult control task, without ever using any explicit knowledge of the dynamics of the underlying
system.
The problem we will consider is the inverted pendulum or the pole-balancing problem.1
Consider the figure shown. A thin pole is connected via a free hinge to a cart, which can move
laterally on a smooth table surface. The controller is said to have failed if either the angle of
the pole deviates by more than a certain amount from the vertical position (i.e., if the pole falls
over), or if the cart’s position goes out of bounds (i.e., if it falls off the end of the table). Our
objective is to develop a controller to balance the pole with these constraints, by appropriately
having the cart accelerate left and right.

[Figure: the inverted pendulum (cart-pole) system, a pole attached by a hinge to a cart that moves along a bounded track.]

We have written a simple simulator for this problem. The simulation proceeds in discrete time
cycles (steps). The state of the cart and pole at any time is completely characterized by 4
parameters: the cart position x, the cart velocity ẋ, the angle of the pole θ measured as its
deviation from the vertical position, and the angular velocity of the pole θ̇. Since it would be
simpler to consider reinforcement learning in a discrete state space, we have approximated the
state space by a discretization that maps a state vector (x, ẋ, θ, θ̇) into a number from 1 to
NUM_STATES. Your learning algorithm will need to deal only with this discretized representation
of the states.
At every time step, the controller must choose one of two actions - push (accelerate) the cart
right, or push the cart left. (To keep the problem simple, there is no do-nothing action.) These
are represented as actions 1 and 2 respectively in the code. When the action choice is made, the
simulator updates the state parameters according to the underlying dynamics, and provides a
new discretized state.
We will assume that the reward R(s) is a function of the current state only. When the pole
angle goes beyond a certain limit or when the cart goes too far out, a negative reward is given,
and the system is reinitialized randomly. At all other times, the reward is zero. Your program
must learn to balance the pole using only the state transitions and rewards observed.
The files for this problem are in the src/cartpole/ directory. Most of the code has already been
written for you, and you need to make changes only to cartpole.py in the places specified. This
file can be run to show a display and to plot a learning curve at the end. Read the comments at
the top of the file for more details on the working of the simulation.
1 The dynamics are adapted from http://www-anw.cs.umass.edu/rlr/domains.html

To solve the inverted pendulum problem, you will estimate a model (i.e., transition probabilities
and rewards) for the underlying MDP, solve Bellman’s equations for this estimated MDP to
obtain a value function, and act greedily with respect to this value function.
Briefly, you will maintain a current model of the MDP and a current estimate of the value func-
tion. Initially, each state has estimated reward zero, and the estimated transition probabilities
are uniform (equally likely to end up in any other state).
During the simulation, you must choose actions at each time step according to some current
policy. As the program goes along taking actions, it will gather observations on transitions and
rewards, which it can use to get a better estimate of the MDP model. Since it is inefficient to
update the whole estimated MDP after every observation, we will store the state transitions and
reward observations each time, and update the model and value function/policy only periodically.
Thus, you must maintain counts of the total number of times the transition from state si to state
sj using action a has been observed (similarly for the rewards). Note that the rewards at any
state are deterministic, but the state transitions are not because of the discretization of the state
space (several different but close configurations may map onto the same discretized state).
Each time a failure occurs (such as if the pole falls over), you should re-estimate the transition
probabilities and rewards as the average of the observed values (if any). Your program must then
use value iteration to solve Bellman’s equations on the estimated MDP, to get the value function
and new optimal policy for the new model. For value iteration, use a convergence criterion
that checks if the maximum absolute change in the value function on an iteration exceeds some
specified tolerance.
Finally, assume that the whole learning procedure has converged once several consecutive attempts
(defined by the parameter NO_LEARNING_THRESHOLD) to solve Bellman's equation all converge
in the first iteration. Intuitively, this indicates that the estimated model has stopped
changing significantly.
The code outline for this problem is already in cartpole.py, and you need to write code
fragments only at the places specified in the file. There are several details (convergence criteria
etc.) that are also explained inside the code. Use a discount factor of γ = 0.995.
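As a concrete illustration of the value-iteration step described above, here is a minimal sketch. It is independent of the provided starter code; the array names and shapes (trans_probs, rewards) are assumptions for illustration only.

import numpy as np

def value_iteration(trans_probs, rewards, gamma=0.995, tol=0.01):
    """Solve Bellman's equations for the estimated MDP.

    trans_probs: array of shape (num_states, num_actions, num_states),
                 trans_probs[s, a, s2] = estimated P(s2 | s, a).
    rewards:     array of shape (num_states,), estimated R(s).
    Returns the value function and the greedy policy.
    """
    num_states, num_actions, _ = trans_probs.shape
    value = np.zeros(num_states)
    while True:
        # Expected return of each (state, action) pair under the current V.
        q = rewards[:, None] + gamma * trans_probs.dot(value)   # shape (S, A)
        new_value = q.max(axis=1)
        # Convergence criterion: maximum absolute change below the tolerance.
        if np.max(np.abs(new_value - value)) < tol:
            value = new_value
            break
        value = new_value
    policy = q.argmax(axis=1)
    return value, policy

The learning-convergence check described above would additionally track whether this loop terminates after a single iteration on several consecutive calls.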
Implement the reinforcement learning algorithm as specified, and run it.

• How many trials (how many times did the pole fall over or the cart fall off) did it take
before the algorithm converged?
• Plot a learning curve showing the number of time-steps for which the pole was balanced on
each trial. Python starter code already includes the code to plot.

Answer:
It took 10 trials (i.e., 10 failures where the pole fell over or the cart went out of bounds) before the algorithm converged.

2. [15 points] KL divergence and Maximum Likelihood


The Kullback-Leibler (KL) divergence is a measure of how much one probability distribution is
different from a second one. It is a concept that originated in Information Theory, but has made
its way into several other fields, including Statistics, Machine Learning, Information Geometry,
and many more. In Machine Learning, the KL divergence plays a crucial role, connecting various
concepts that might otherwise seem unrelated.
In this problem, we will introduce KL divergence over discrete distributions, practice some simple
manipulations, and see its connection to Maximum Likelihood Estimation.
The KL divergence between two discrete-valued distributions P (X), Q(X) over the outcome
space X is defined as follows2 :

$$D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$$

For notational convenience, we assume P (x) > 0, ∀x. (One other standard thing to do is to
adopt the convention that “0 log 0 = 0.”) Sometimes, we also write the KL divergence more
explicitly as DKL (P ||Q) = DKL (P (X)||Q(X)).
Background on Information Theory
Before we dive deeper, we give a brief (optional) Information Theoretic background on KL
divergence. While this introduction is not necessary to answer the assignment question, it may
help you better understand and appreciate why we study KL divergence, and how Information
Theory can be relevant to Machine Learning.
We start with the entropy H(P ) of a probability distribution P (X), which is defined as
$$H(P) = -\sum_{x \in \mathcal{X}} P(x) \log P(x).$$

Intuitively, entropy measures how dispersed a probability distribution is. For example, a uni-
form distribution is considered to have very high entropy (i.e. a lot of uncertainty), whereas a
distribution that assigns all its mass on a single point is considered to have zero entropy (i.e.
no uncertainty). Notably, it can be shown that among continuous distributions over R, the
Gaussian distribution N (µ, σ 2 ) has the highest entropy (highest uncertainty) among all possible
distributions that have the given mean µ and variance σ 2 .
To further solidify our intuition, we present motivation from communication theory. Suppose we
want to communicate from a source to a destination, and our messages are always (a sequence
of) discrete symbols over space X (for example, X could be letters {a, b, . . . , z}). We want to
construct an encoding scheme for our symbols in the form of sequences of binary bits that are
transmitted over the channel. Further, suppose that in the long run the frequency of occurrence
of symbols follow a probability distribution P (X). This means, in the long run, the fraction of
times the symbol x gets transmitted is P (x).
A common desire is to construct an encoding scheme such that the average number of bits per
symbol transmitted remains as small as possible. Intuitively, this means we want very frequent
symbols to be assigned to a bit pattern having a small number of bits. Likewise, because we are
2 If P and Q are densities for continuous-valued random variables, then the sum is replaced by an integral,

and everything stated in this problem works fine as well. But for the sake of simplicity, in this problem we’ll just
work with this form of KL divergence for probability mass functions/discrete-valued distributions.

interested in reducing the average number of bits per symbol in the long term, it is tolerable for
infrequent words to be assigned to bit patterns having a large number of bits, since their low
frequency has little effect on the long term average. The encoding scheme can be as complex as
we desire, for example, a single bit could possibly represent a long sequence of multiple symbols
(if that specific pattern of symbols is very common). The entropy of a probability distribution
P (X) is its optimal bit rate, i.e., the lowest average bits per message that can possibly be
achieved if the symbols x ∈ X occur according to P (X). It does not specifically tell us how to
construct that optimal encoding scheme. It only tells us that no encoding can possibly give us
a lower long term bits per message than H(P ).
To see a concrete example, suppose our messages have a vocabulary of K = 32 symbols, and
each symbol has an equal probability of transmission in the long term (i.e, uniform probability
distribution). An encoding scheme that would work well for this scenario would be to have
log2 K bits per symbol, and assign each symbol some unique combination of the log2 K bits. In
fact, it turns out that this is the most efficient encoding one can come up with for the uniform
distribution scenario.
It may have occurred to you by now that the long term average number of bits per message
depends only on the frequency of occurrence of symbols. The encoding scheme of scenario A can
in theory be reused in scenario B with a different set of symbols (assume equal vocabulary size
for simplicity), with the same long term efficiency, as long as the symbols of scenario B follow
the same probability distribution as the symbols of scenario A. It might also have occurred to
you, that reusing the encoding scheme designed to be optimal for scenario A, for messages in
scenario B having a different probability of symbols, will always be suboptimal for scenario B.
To be clear, we do not need know what the specific optimal schemes are in either scenarios. As
long as we know the distributions of their symbols, we can say that the optimal scheme designed
for scenario A will be suboptimal for scenario B if the distributions are different.
Concretely, if we reuse the optimal scheme designed for a scenario having symbol distribution
Q(X), into a scenario that has symbol distribution P (X), the long term average number of bits
per symbol achieved is called the cross entropy, denoted by H(P, Q):
$$H(P, Q) = -\sum_{x \in \mathcal{X}} P(x) \log Q(x).$$

To recap, the entropy H(P ) is the best possible long term average bits per message (optimal)
that can be achieved under a symbol distribution P (X) by using an encoding scheme (possibly
unknown) specifically designed for P (X). The cross entropy H(P, Q) is the long term average bits
per message (suboptimal) that results under a symbol distribution P (X), by reusing an encoding
scheme (possibly unknown) designed to be optimal for a scenario with symbol distribution Q(X).
Now, KL divergence is the penalty we pay, as measured in average number of bits, for using the
optimal scheme for Q(X), under the scenario where symbols are actually distributed as P (X).
It is straightforward to see this:
$$D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} = \sum_{x \in \mathcal{X}} P(x) \log P(x) - \sum_{x \in \mathcal{X}} P(x) \log Q(x) = H(P, Q) - H(P),$$
i.e., the difference in the average number of bits.
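As a quick numerical sanity check of this identity (a small illustrative snippet, not part of the assignment code), one can compute the three quantities directly for two arbitrary discrete distributions:

import numpy as np

# Two arbitrary distributions over a 3-element outcome space.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.5, 0.3])

entropy = -np.sum(P * np.log(P))         # H(P)
cross_entropy = -np.sum(P * np.log(Q))   # H(P, Q)
kl = np.sum(P * np.log(P / Q))           # D_KL(P || Q)

# D_KL(P || Q) equals H(P, Q) - H(P) up to floating-point error.
assert np.isclose(kl, cross_entropy - entropy)
print(kl, cross_entropy - entropy)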

If the cross entropy between P and Q is H(P ) (and hence DKL (P ||Q) = 0) then it necessarily
means P = Q. In Machine Learning, it is a common task to find a distribution Q that is “close”
to another distribution P . To achieve this, it is common to use DKL (Q||P ) as the loss function
to be optimized. As we will see in this question below, Maximum Likelihood Estimation, which
is a commonly used optimization objective, turns out to be equivalent to minimizing the KL
divergence between the training data (i.e. the empirical distribution over the data) and the
model.
Now, we get back to showing some simple properties of KL divergence.

(a) [5 points] Nonnegativity.


Prove the following:
∀P, Q. DKL (P kQ) ≥ 0
and

DKL (P kQ) = 0 if and only if P = Q.


[Hint: You may use the following result, called Jensen’s inequality. If f is a convex
function, and X is a random variable, then E[f (X)] ≥ f (E[X]). Moreover, if f is strictly
convex (f is convex if its Hessian satisfies H ≥ 0; it is strictly convex if H > 0; for instance
f (x) = − log x is strictly convex), then E[f (X)] = f (E[X]) implies that X = E[X] with
probability 1; i.e., X is actually a constant.]
Answer:
$$D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_{x \sim P}\!\left[-\log \frac{Q(x)}{P(x)}\right] \geq -\log \mathbb{E}_{x \sim P}\!\left[\frac{Q(x)}{P(x)}\right] = -\log \sum_{x} P(x) \frac{Q(x)}{P(x)} = -\log \sum_{x} Q(x) = -\log 1 = 0,$$
where the inequality is Jensen's inequality applied to the strictly convex function $f(t) = -\log t$.
If $P = Q$, then $\log(P(x)/Q(x)) = 0$ for every $x$, so $D_{KL}(P \| Q) = 0$. Conversely, if $D_{KL}(P \| Q) = 0$, Jensen's inequality holds with equality, and since $f$ is strictly convex this forces $Q(x)/P(x)$ to be constant with probability 1; because both distributions sum to 1, that constant is 1, i.e., $P = Q$.
(b) [5 points] Chain rule for KL divergence.
The KL divergence between 2 conditional distributions P (X|Y ), Q(X|Y ) is defined as fol-
lows:
$$D_{KL}(P(X|Y) \| Q(X|Y)) = \sum_{y} P(y) \left( \sum_{x} P(x|y) \log \frac{P(x|y)}{Q(x|y)} \right)$$

This can be thought of as the expected KL divergence between the corresponding conditional
distributions on x (that is, between P (X|Y = y) and Q(X|Y = y)), where the expectation
is taken over the random y.
Prove the following chain rule for KL divergence:

$$D_{KL}(P(X, Y) \| Q(X, Y)) = D_{KL}(P(X) \| Q(X)) + D_{KL}(P(Y|X) \| Q(Y|X)).$$

Answer:
$$D_{KL}(P(X) \| Q(X)) + D_{KL}(P(Y|X) \| Q(Y|X))$$
$$= \sum_{x} P(x) \log \frac{P(x)}{Q(x)} + \sum_{x} P(x) \sum_{y} P(y|x) \log \frac{P(y|x)}{Q(y|x)}$$
$$= \sum_{x} P(x) \sum_{y} P(y|x) \left( \log \frac{P(x)}{Q(x)} + \log \frac{P(y|x)}{Q(y|x)} \right) \quad \text{(since } \textstyle\sum_{y} P(y|x) = 1\text{)}$$
$$= \sum_{x} \sum_{y} P(x) P(y|x) \log \frac{P(x) P(y|x)}{Q(x) Q(y|x)}$$
$$= \sum_{x} \sum_{y} P(x, y) \log \frac{P(x, y)}{Q(x, y)}$$
$$= D_{KL}(P(X, Y) \| Q(X, Y)).$$
Thus, $D_{KL}(P(X, Y) \| Q(X, Y)) = D_{KL}(P(X) \| Q(X)) + D_{KL}(P(Y|X) \| Q(Y|X))$.
(c) [5 points] KL and maximum likelihood.
Consider a density estimation problem, and suppose we are given a training set {x(i) ; i =
1, . . . , n}. Let the empirical distribution be $\hat{P}(x) = \frac{1}{n} \sum_{i=1}^{n} 1\{x^{(i)} = x\}$. ($\hat{P}$ is just the
uniform distribution over the training set; i.e., sampling from the empirical distribution is
the same as picking a random example from the training set.)
Suppose we have some family of distributions Pθ parameterized by θ. (If you like, think of
Pθ (x) as an alternative notation for P (x; θ).) Prove that finding the maximum likelihood
estimate for the parameter θ is equivalent to finding Pθ with minimal KL divergence from
P̂ . I.e. prove:
$$\arg\min_{\theta} D_{KL}(\hat{P} \| P_\theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log P_\theta(x^{(i)})$$

Remark. Consider the relationship between parts (b-c) and multi-variate Bernoulli Naive
Bayes parameter estimation. In the Naive Bayes model we assumed $P_\theta$ is of the following
form: $P_\theta(x, y) = p(y) \prod_{i=1}^{d} p(x_i | y)$. By the chain rule for KL divergence, we therefore have:
$$D_{KL}(\hat{P} \| P_\theta) = D_{KL}(\hat{P}(y) \| p(y)) + \sum_{i=1}^{d} D_{KL}(\hat{P}(x_i | y) \| p(x_i | y)).$$

This shows that finding the maximum likelihood/minimum KL-divergence estimate of the
parameters decomposes into 2d + 1 independent optimization problems: One for the class
priors p(y), and one for each of the conditional distributions p(xi |y) for each feature xi
given each of the two possible labels for y. Specifically, finding the maximum likelihood
estimates for each of these problems individually results in also maximizing the likelihood
of the joint distribution. (If you know what Bayesian networks are, a similar remark applies
to parameter estimation for them.)
Answer:
$$D_{KL}(\hat{P} \| P_\theta) = \sum_{x} \hat{P}(x) \log \frac{\hat{P}(x)}{P_\theta(x)} = \sum_{x} \hat{P}(x) \log \hat{P}(x) - \sum_{x} \hat{P}(x) \log P_\theta(x).$$
The first term does not depend on $\theta$, so minimizing $D_{KL}(\hat{P} \| P_\theta)$ over $\theta$ is equivalent to maximizing
$$\sum_{x} \hat{P}(x) \log P_\theta(x) = \sum_{x} \frac{1}{n} \sum_{i=1}^{n} 1\{x^{(i)} = x\} \log P_\theta(x) = \frac{1}{n} \sum_{i=1}^{n} \log P_\theta(x^{(i)}),$$
which is $\frac{1}{n}$ times the log-likelihood. Hence $\arg\min_{\theta} D_{KL}(\hat{P} \| P_\theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log P_\theta(x^{(i)})$.

3. [20 points] K-means for compression


In this problem, we will apply the K-means algorithm to lossy image compression, by reducing
the number of colors used in an image.
We will be using the files src/k_means/peppers-small.tiff and src/k_means/peppers-large.tiff.
The peppers-large.tiff file contains a 512x512 image of peppers represented in 24-bit color.
This means that, for each of the 262144 pixels in the image, there are three 8-bit numbers (each
ranging from 0 to 255) that represent the red, green, and blue intensity values for that pixel. The
straightforward representation of this image therefore takes about 262144 × 3 = 786432 bytes (a
byte being 8 bits). To compress the image, we will use K-means to reduce the image to k = 16
colors. More specifically, each pixel in the image is considered a point in the three-dimensional
(r, g, b)-space. To compress the image, we will cluster these points in color-space into 16 clusters,
and replace each pixel with the closest cluster centroid.
Follow the instructions below. Be warned that some of these operations can take a while (several
minutes even on a fast computer)!

(a) [15 points] [Coding Problem] K-Means Compression Implementation. First let us
look at our data. From the src/k_means/ directory, open an interactive Python prompt,
and type
from matplotlib.image import imread; import matplotlib.pyplot as plt;
and run A = imread('peppers-large.tiff'). Now, A is a “three dimensional matrix,”
and A[:,:,0], A[:,:,1] and A[:,:,2] are 512x512 arrays that respectively contain the
red, green, and blue values for each pixel. Enter plt.imshow(A); plt.show() to display
the image.
Since the large image has 262,144 pixels and would take a while to cluster, we will instead
run vector quantization on a smaller image. Repeat (a) with peppers-small.tiff.
Next we will implement image compression in the file src/k_means/k_means.py, which has
some starter code. Treating each pixel’s (r, g, b) values as an element of R3 , implement
K-means with 16 clusters on the pixel data from this smaller image, iterating (preferably)
to convergence, but in no case for less than 30 iterations. For initialization, set each cluster
centroid to the (r, g, b)-values of a randomly chosen pixel in the image.
Take the image of peppers-large.tiff, and replace each pixel’s (r, g, b) values with the
value of the closest cluster centroid from the set of centroids computed with peppers-small.tiff.
Visually compare it to the original image to verify that your implementation is reasonable.
Include in your write-up a copy of this compressed image alongside the original
image.
Answer:
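A minimal sketch of the procedure described above (this is not the provided starter code in k_means.py; the helper names and structure are illustrative only):

import numpy as np
from matplotlib.image import imread
import matplotlib.pyplot as plt

def init_centroids(num_clusters, image):
    """Pick num_clusters random pixels as the initial centroids."""
    h, w, _ = image.shape
    idx = np.random.choice(h * w, size=num_clusters, replace=False)
    return image.reshape(-1, 3)[idx].astype(float)

def run_kmeans(image, num_clusters=16, max_iter=50, min_iter=30):
    """Cluster the (r, g, b) pixels of the image; return the centroids."""
    pixels = image.reshape(-1, 3).astype(float)
    centroids = init_centroids(num_clusters, image)
    for it in range(max_iter):
        # Assign each pixel to its nearest centroid (squared Euclidean distance).
        dists = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignments = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for k in range(num_clusters):
            members = pixels[assignments == k]
            if len(members) > 0:
                new_centroids[k] = members.mean(axis=0)
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        # Iterate to convergence, but never fewer than 30 iterations.
        if converged and it + 1 >= min_iter:
            break
    return centroids

def compress(image, centroids):
    """Replace every pixel with its nearest centroid."""
    pixels = image.reshape(-1, 3).astype(float)
    dists = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids[dists.argmin(axis=1)].reshape(image.shape).astype(np.uint8)

if __name__ == '__main__':
    small = imread('peppers-small.tiff')
    large = imread('peppers-large.tiff')
    centroids = run_kmeans(small, num_clusters=16)
    plt.imshow(compress(large, centroids)); plt.show()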
(b) [5 points] Compression Factor.
If we represent the image with these reduced (16) colors, by (approximately) what factor
have we compressed the image?
Answer:
Each pixel originally requires 24 bits (three 8-bit channels). With only 16 colors, each pixel can instead be stored as a 4-bit index (log2 16 = 4) into a 16-entry color table, and the table itself (16 × 3 bytes) is negligible. The image is therefore compressed by a factor of approximately 24/4 = 6.

4. [35 points] Semi-supervised EM


Expectation Maximization (EM) is a classical algorithm for unsupervised learning (i.e., learning
with hidden or latent variables). In this problem we will explore one of the ways in which the EM
algorithm can be adapted to the semi-supervised setting, where we have some labelled examples
along with unlabelled examples.
In the standard unsupervised setting, we have n ∈ N unlabelled examples {x(1) , . . . , x(n) }. We
wish to learn the parameters of p(x, z; θ) from the data, but z (i) ’s are not observed. The classical
EM algorithm is designed for this very purpose, where we maximize the intractable p(x; θ)
indirectly by iteratively performing the E-step and M-step, each time maximizing a tractable
lower bound of p(x; θ). Our objective can be concretely written as:

$$\ell_{\text{unsup}}(\theta) = \sum_{i=1}^{n} \log p(x^{(i)}; \theta) = \sum_{i=1}^{n} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$$

Now, we will attempt to construct an extension of EM to the semi-supervised setting. Let


us suppose we have an additional ñ ∈ N labelled examples {(x̃(1) , z̃ (1) ), . . . , (x̃(ñ) , z̃ (ñ) )} where
both x and z are observed. We want to simultaneously maximize the marginal likelihood of the
parameters using the unlabelled examples, and full likelihood of the parameters using the labelled
examples, by optimizing their weighted sum (with some hyperparameter α). More concretely,
our semi-supervised objective `semi-sup (θ) can be written as:

$$\ell_{\text{sup}}(\theta) = \sum_{i=1}^{\tilde{n}} \log p(\tilde{x}^{(i)}, \tilde{z}^{(i)}; \theta)$$
$$\ell_{\text{semi-sup}}(\theta) = \ell_{\text{unsup}}(\theta) + \alpha \, \ell_{\text{sup}}(\theta)$$

We can derive the EM steps for the semi-supervised setting using the same approach and steps
as before. You are strongly encouraged to show to yourself (no need to include in the write-up)
that we end up with:

E-step (semi-supervised)

For each i ∈ {1, . . . , n}, set


$$Q_i^{(t)}(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$$

M-step (semi-supervised)

$$\theta^{(t+1)} := \arg\max_{\theta} \left[ \left( \sum_{i=1}^{n} \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i^{(t)}(z^{(i)})} \right) + \alpha \left( \sum_{i=1}^{\tilde{n}} \log p(\tilde{x}^{(i)}, \tilde{z}^{(i)}; \theta) \right) \right]$$

(a) [5 points] Convergence. First we will show that this algorithm eventually converges. In
order to prove this, it is sufficient to show that our semi-supervised objective `semi-sup (θ)
monotonically increases with each iteration of E and M step. Specifically, let θ(t) be the
parameters obtained at the end of t EM-steps. Show that `semi-sup (θ(t+1) ) ≥ `semi-sup (θ(t) ).
Answer:
By Jensen's inequality, with the E-step choice $Q_i^{(t)}(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$, the M-step objective is a lower bound on $\ell_{\text{semi-sup}}(\theta)$ for every $\theta$ (the supervised term $\alpha \ell_{\text{sup}}(\theta)$ appears in both unchanged), and the bound is tight at $\theta = \theta^{(t)}$. Therefore
$$\ell_{\text{semi-sup}}(\theta^{(t+1)}) \geq \text{LB}(\theta^{(t+1)}) \geq \text{LB}(\theta^{(t)}) = \ell_{\text{semi-sup}}(\theta^{(t)}),$$
where LB denotes the lower bound: the first inequality is the lower-bound property, the second holds because the M-step maximizes LB over $\theta$, and the final equality is tightness at $\theta^{(t)}$. Hence the objective monotonically increases and the algorithm converges.

Semi-supervised GMM

Now we will revisit the Gaussian Mixture Model (GMM), to apply our semi-supervised EM al-
gorithm. Let us consider a scenario where data is generated from k ∈ N Gaussian distributions,
with unknown means µj ∈ Rd and covariances Σj ∈ Sd+ where j ∈ {1, . . . , k}. We have n data
points x(i) ∈ Rd , i ∈ {1, . . . , n}, and each data point has a corresponding latent (hidden/un-
known) variable z (i) ∈ {1, . . . , k} indicating which distribution x(i) belongs to. Specifically,
$z^{(i)} \sim \text{Multinomial}(\phi)$, such that $\sum_{j=1}^{k} \phi_j = 1$ and $\phi_j \geq 0$ for all $j$, and $x^{(i)} \mid z^{(i)} \sim \mathcal{N}(\mu_{z^{(i)}}, \Sigma_{z^{(i)}})$
i.i.d. So, µ, Σ, and φ are the model parameters.
We also have additional ñ data points x̃(i) ∈ Rd , i ∈ {1, . . . , ñ}, and an associated observed
variable z̃ (i) ∈ {1, . . . , k} indicating the distribution x̃(i) belongs to. Note that z̃ (i) are known
constants (in contrast to z (i) which are unknown random variables). As before, we assume
x̃(i) |z̃ (i) ∼ N (µz̃(i) , Σz̃(i) ) i.i.d.
In summary we have n + ñ examples, of which n are unlabelled data points x’s with unobserved
z’s, and ñ are labelled data points x̃(i) with corresponding observed labels z̃ (i) . The traditional
EM algorithm is designed to take only the n unlabelled examples as input, and learn the model
parameters µ, Σ, and φ.
Our task now will be to apply the semi-supervised EM algorithm to GMMs in order to also
leverage the additional ñ labelled examples, and come up with semi-supervised E-step and M-
step update rules specific to GMMs. Whenever required, you can cite the lecture notes for
derivations and steps.

(b) [5 points] Semi-supervised E-Step. Clearly state which are all the latent variables that
need to be re-estimated in the E-step. Derive the E-step to re-estimate all the stated
latent variables. Your final E-step expression must only involve x, z, µ, Σ, φ and universal
constants.
Answer:
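A sketch of the standard result, following the GMM E-step from the lecture notes: the only latent variables are the $z^{(i)}$ of the unlabelled examples (the $\tilde{z}^{(i)}$ are observed and need no E-step), and their posteriors are

$$w_j^{(i)} := Q_i(z^{(i)} = j) = \frac{\phi_j \, \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)\right)}{\sum_{l=1}^{k} \phi_l \, \frac{1}{(2\pi)^{d/2} |\Sigma_l|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x^{(i)} - \mu_l)^T \Sigma_l^{-1} (x^{(i)} - \mu_l)\right)}.$$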
(c) [10 points] Semi-supervised M-Step. Clearly state which are all the parameters that
need to be re-estimated in the M-step. Derive the M-step to re-estimate all the stated
parameters. Specifically, derive closed form expressions for the parameter update rules for
µ(t+1) , Σ(t+1) and φ(t+1) based on the semi-supervised objective.
Answer:
The parameters to re-estimate are $\phi$, $\mu_j$, and $\Sigma_j$ for $j = 1, \ldots, k$. Writing $w_j^{(i)} = Q_i^{(t)}(z^{(i)} = j)$ for the unlabelled examples, maximizing the semi-supervised objective gives, for each $j$,
$$\phi_j^{(t+1)} = \frac{\sum_{i=1}^{n} w_j^{(i)} + \alpha \sum_{i=1}^{\tilde{n}} 1\{\tilde{z}^{(i)} = j\}}{n + \alpha \tilde{n}},$$
$$\mu_j^{(t+1)} = \frac{\sum_{i=1}^{n} w_j^{(i)} x^{(i)} + \alpha \sum_{i=1}^{\tilde{n}} 1\{\tilde{z}^{(i)} = j\}\, \tilde{x}^{(i)}}{\sum_{i=1}^{n} w_j^{(i)} + \alpha \sum_{i=1}^{\tilde{n}} 1\{\tilde{z}^{(i)} = j\}},$$
$$\Sigma_j^{(t+1)} = \frac{\sum_{i=1}^{n} w_j^{(i)} (x^{(i)} - \mu_j^{(t+1)})(x^{(i)} - \mu_j^{(t+1)})^T + \alpha \sum_{i=1}^{\tilde{n}} 1\{\tilde{z}^{(i)} = j\} (\tilde{x}^{(i)} - \mu_j^{(t+1)})(\tilde{x}^{(i)} - \mu_j^{(t+1)})^T}{\sum_{i=1}^{n} w_j^{(i)} + \alpha \sum_{i=1}^{\tilde{n}} 1\{\tilde{z}^{(i)} = j\}}.$$
With $\alpha = 0$ (or $\tilde{n} = 0$) these reduce to the usual unsupervised GMM updates.
(d) [5 points] Classical (Unsupervised) EM Implementation. For this sub-question,
we are only going to consider the n unlabelled examples. Follow the instructions in

src/semi_supervised_em/gmm.py to implement the traditional EM algorithm, and run


it on the unlabelled data-set until convergence.
Run three trials and use the provided plotting function to construct a scatter plot of the
resulting assignments to clusters (one plot for each trial). Your plot should indicate clus-
ter assignments with colors they got assigned to (i.e., the cluster which had the highest
probability in the final E-step).
Submit the three plots obtained above in your write-up.
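For reference, a minimal sketch of the classical EM loop being asked for here (assuming x is the (n, d) data matrix; the actual starter code in gmm.py may organize this differently):

import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate normal density evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mu
    inv = np.linalg.inv(sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))

def fit_gmm_em(x, k=4, max_iter=1000, eps=1e-3):
    """Classical (unsupervised) EM for a Gaussian mixture model."""
    n, d = x.shape
    # Random initialization: assign points to clusters uniformly at random.
    z = np.random.randint(k, size=n)
    phi = np.array([(z == j).mean() for j in range(k)])
    mu = np.array([x[z == j].mean(axis=0) for j in range(k)])
    sigma = np.array([np.cov(x[z == j], rowvar=False) for j in range(k)])

    prev_ll = None
    for _ in range(max_iter):
        # E-step: responsibilities w[i, j] proportional to phi_j * N(x_i; mu_j, Sigma_j).
        w = np.stack([phi[j] * gaussian_pdf(x, mu[j], sigma[j]) for j in range(k)], axis=1)
        ll = np.log(w.sum(axis=1)).sum()          # marginal log-likelihood
        w /= w.sum(axis=1, keepdims=True)

        # M-step: re-estimate phi, mu, sigma from the responsibilities.
        nj = w.sum(axis=0)
        phi = nj / n
        mu = (w.T @ x) / nj[:, None]
        for j in range(k):
            diff = x - mu[j]
            sigma[j] = (w[:, j, None] * diff).T @ diff / nj[j]

        # Stop when the log-likelihood change falls below the tolerance.
        if prev_ll is not None and abs(ll - prev_ll) < eps:
            break
        prev_ll = ll
    return w.argmax(axis=1), phi, mu, sigma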
Answer:

Figure 1: pred0

(e) [7 points] Semi-supervised EM Implementation. Now we will consider both the la-
belled and unlabelled examples (a total of n + ñ), with 5 labelled examples per cluster. We
have provided starter code for splitting the dataset into matrices x and x_tilde of unlabelled
and labelled examples respectively. Add to your code in src/semi_supervised_em/gmm.py
to implement the modified EM algorithm, and run it on the dataset until convergence.
Create a plot for each trial, as done in the previous sub-question.
Submit the three plots obtained above in your write-up.
Answer:
(f) [3 points] Comparison of Unsupervised and Semi-supervised EM. Briefly describe
the differences you saw in unsupervised vs. semi-supervised EM for each of the following:
i. Number of iterations taken to converge.
ii. Stability (i.e., how much did assignments change with different random initializations?)

Figure 2: pred1

iii. Overall quality of assignments.


Note: The dataset was sampled from a mixture of three low-variance Gaussian distribu-
tions, and a fourth, high-variance Gaussian distribution. This should be useful in deter-
mining the overall quality of the assignments that were found by the two algorithms.
Answer:

Figure 3: pred2

5. [10 points] PCA


In class, we showed that PCA finds the “variance maximizing” directions onto which to project
the data. In this problem, we find another interpretation of PCA.
Suppose we are given a set of points {x(1) , . . . , x(n) }. Let us assume that we have as usual
preprocessed the data to have zero-mean and unit variance in each coordinate. For a given
unit-length vector u, let fu (x) be the projection of point x onto the direction given by u. I.e., if
V = {αu : α ∈ R}, then
$$f_u(x) = \arg\min_{v \in V} \|x - v\|^2.$$

Show that the unit-length vector u that minimizes the mean squared error between projected
points and original points corresponds to the first principal component for the data. I.e., show
that
$$\arg\min_{u : u^T u = 1} \sum_{i=1}^{n} \|x^{(i)} - f_u(x^{(i)})\|_2^2$$

gives the first principal component.


Remark. If we are asked to find a k-dimensional subspace onto which to project the data so as
to minimize the sum of squares distance between the original data and their projections, then
we should choose the k-dimensional subspace spanned by the first k principal components of the
data. This problem shows that this result holds for the case of k = 1.
Answer:
Since u is unit length, the projection of x onto the direction u is $f_u(x) = (u^T x)\, u$. Then
$$\sum_{i=1}^{n} \|x^{(i)} - f_u(x^{(i)})\|_2^2 = \sum_{i=1}^{n} \left( x^{(i)} - (u^T x^{(i)}) u \right)^T \left( x^{(i)} - (u^T x^{(i)}) u \right)$$
$$= \sum_{i=1}^{n} \left( x^{(i)T} x^{(i)} - 2 (u^T x^{(i)})^2 + (u^T x^{(i)})^2 u^T u \right) = \sum_{i=1}^{n} x^{(i)T} x^{(i)} - \sum_{i=1}^{n} (u^T x^{(i)})^2,$$
using $u^T u = 1$. The first term does not depend on u, so minimizing the mean squared error is equivalent to maximizing
$$\sum_{i=1}^{n} (u^T x^{(i)})^2 = u^T \left( \sum_{i=1}^{n} x^{(i)} x^{(i)T} \right) u$$
subject to $u^T u = 1$. Since the data are zero-mean, $\sum_i x^{(i)} x^{(i)T}$ is (n times) the empirical covariance matrix, and the maximizing unit vector is its principal eigenvector, i.e., the first principal component.
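A small numerical check of this equivalence (illustrative only; the data here are synthetic):

import numpy as np

rng = np.random.default_rng(0)
# Synthetic zero-mean data with one dominant direction.
x = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.5])
x -= x.mean(axis=0)

# First principal component: top eigenvector of the empirical covariance.
cov = x.T @ x / len(x)
eigvals, eigvecs = np.linalg.eigh(cov)
u_pca = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue

def reconstruction_error(u):
    """Sum of squared distances between points and their projections onto u."""
    proj = np.outer(x @ u, u)
    return ((x - proj) ** 2).sum()

# The PCA direction should do at least as well as any random unit vector.
random_dirs = rng.normal(size=(1000, 3))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
assert all(reconstruction_error(u_pca) <= reconstruction_error(v) + 1e-8
           for v in random_dirs)
print(reconstruction_error(u_pca))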

6. [20 points] Independent components analysis


While studying Independent Component Analysis (ICA) in class, we made an informal argu-
ment about why Gaussian distributed sources will not work. We also mentioned that any other
distribution (except Gaussian) for the sources will work for ICA, and hence used the logistic
distribution instead. In this problem, we will go deeper into understanding why Gaussian dis-
tributed sources are a problem. We will also derive ICA with the Laplace distribution, and apply
it to the cocktail party problem.
Reintroducing notation, let s ∈ Rd be source data that is generated from d independent sources.
Let x ∈ Rd be observed data such that x = As, where A ∈ Rd×d is called the mixing matrix.
We assume A is invertible, and W = A−1 is called the unmixing matrix. So, s = W x. The goal
of ICA is to estimate W . Similar to the notes, we denote wjT to be the j th row of W . Note
that this implies that the j th source can be reconstructed with wj and x, since sj = wjT x. We
are given a training set {x(1) , . . . , x(n) } for the following sub-questions. Let us denote the entire
training set by the design matrix X ∈ Rn×d where each example corresponds to a row in the
matrix.

(a) [5 points] Gaussian source


For this sub-question, we assume sources are distributed according to a standard normal
distribution, i.e sj ∼ N (0, 1), j = {1, . . . , d}. The log-likelihood of our unmixing matrix, as
described in the notes, is
 
$$\ell(W) = \sum_{i=1}^{n} \left( \log |W| + \sum_{j=1}^{d} \log g'(w_j^T x^{(i)}) \right),$$

where g is the cumulative distribution function, and g′ is the probability density function of
the source distribution (in this sub-question it is a standard normal distribution). Whereas
in the notes we derive an update rule to train W iteratively, for the case of Gaussian
distributed sources, we can analytically reason about the resulting W .
Try to derive a closed form expression for W in terms of X when g is the standard normal
CDF. Deduce the relation between W and X in the simplest terms, and highlight the
ambiguity (in terms of rotational invariance) in computing W .
Answer:
With Gaussian sources, $s \sim \mathcal{N}(0, I)$, so $x = As \sim \mathcal{N}(0, AA^T)$ and the likelihood depends on $W = A^{-1}$ only through $W^T W$. Setting the gradient of $\ell(W)$ to zero gives the closed-form condition
$$W^T W = \left( \frac{1}{n} \sum_{i=1}^{n} x^{(i)} x^{(i)T} \right)^{-1} = \left( \frac{1}{n} X^T X \right)^{-1},$$
so any $W = R \left( \frac{1}{n} X^T X \right)^{-1/2}$ with $R$ an arbitrary orthogonal (rotation) matrix attains the maximum. This is the ambiguity: with Gaussian sources, W is identifiable only up to an orthogonal transformation (rotational invariance), so the individual independent components cannot be recovered.
(b) [10 points] Laplace source.
For this sub-question, we assume sources are distributed according to a standard Laplace
distribution, i.e si ∼ L(0, 1). The Laplace distribution L(0, 1) has PDF fL (s) = 12 exp (−|s|).
With this assumption, derive the update rule for a single example in the form

W := W + α (. . .) .

Answer:
For the Laplace prior, the source density is $f_L(s) = \frac{1}{2} \exp(-|s|)$, so $\log f_L(s) = -\log 2 - |s|$ and $\frac{\partial}{\partial s} \log f_L(s) = -\mathrm{sign}(s)$. Following the same derivation as for the logistic source in the notes, the stochastic gradient ascent update for a single example $x^{(i)}$ is
$$W := W + \alpha \left( (W^T)^{-1} - \mathrm{sign}(W x^{(i)})\, x^{(i)T} \right),$$
where $\mathrm{sign}(\cdot)$ is applied elementwise to the vector $W x^{(i)} \in \mathbb{R}^d$.

(c) [5 points] Cocktail Party Problem


For this question you will implement the Bell and Sejnowski ICA algorithm, but assuming
a Laplace source (as derived in part-b), instead of the Logistic distribution covered in
class. The file src/ica/mix.dat contains the input data which consists of a matrix with 5
columns, with each column corresponding to one of the mixed signals xi . The code for this
question can be found in src/ica/ica.py.
Implement the update_W and unmix functions in src/ica/ica.py.
You can then run ica.py in order to split the mixed audio into its components. The mixed
audio tracks are written to mixed i.wav in the output folder. The split audio tracks are
written to split i.wav in the output folder.
To make sure your code is correct, you should listen to the resulting unmixed sources.
(Some overlap or noise in the sources may be present, but the different sources should be
pretty clearly separated.)
Submit the full unmixing matrix W (5×5) that you obtained in your writeup.
Note: In our implementation, we anneal the learning rate α (slowly decreased it over
time) to speed up learning. In addition to using the variable learning rate to speed up
convergence, one thing that we also do is choose a random permutation of the training
data, and running stochastic gradient ascent visiting the training data in that order (each
of the specified learning rates was then used for one full pass through the data).
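A minimal sketch of the two functions under the Laplace-source update from part (b). The signatures and the learning-rate schedule here are assumptions for illustration and may differ from the provided starter code in ica.py:

import numpy as np

def update_W(W, x, learning_rate):
    """One stochastic gradient ascent step for a single example x (shape (d,)),
    using the Laplace-source update W := W + alpha * ((W^T)^{-1} - sign(Wx) x^T)."""
    grad = np.linalg.inv(W.T) - np.outer(np.sign(W @ x), x)
    return W + learning_rate * grad

def unmix(X, W):
    """Recover the source estimates: each row of X @ W.T equals (W x^(i))^T."""
    return X @ W.T

def train(X, learning_rates=(0.1, 0.05, 0.02, 0.01, 0.005)):
    """Anneal the learning rate and visit the examples in a random order on
    each pass, as described in the note above (illustrative schedule)."""
    d = X.shape[1]
    W = np.eye(d)
    for lr in learning_rates:
        for i in np.random.permutation(X.shape[0]):
            W = update_W(W, X[i], lr)
    return W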
Answer:
W =
[[ 0.10723895 -1.5225561   0.80968129  0.71812469  0.3211656 ]
 [ 0.40906193  2.73775441 -1.61603613 -3.17100585 -0.83587429]
 [-0.50774346  0.54306621 -0.79968196 -0.19978552 -1.69782201]
 [ 0.65459003  1.75662812 -1.2726741  -3.55243545 -1.07418389]
 [ 0.32744528  0.83548748 -1.30386933 -3.2780422  -2.46637458]]
