HW 23 P 4 Rie
Fall 2016
—————————————————————————————————————
Homework # 2
All questions have multiple-choice answers ([a], [b], [c], ...). You can collaborate
with others, but do not discuss the selected or excluded choices in the answers. You
can consult books and notes, but not other people’s solutions. Your solutions should
be based on your own work. Definitions and notation follow the lectures.
• The problems range from easy to difficult, and from practical to theoretical.
Some problems require running a full experiment to arrive at the answer.
• The answer may not be obvious or numerically close to one of the choices,
but one (and only one) choice will be correct if you follow the instructions
precisely in each problem. You are encouraged to explore the problem further
by experimenting with variations on these instructions, for the learning benefit.
• You are also encouraged to take part in the discussion forum. Please make
sure you don’t discuss specific answers, or specific excluded answers, before the
homework is due.
© 2012-2016 Yaser Abu-Mostafa. All rights reserved. No redistribution in any
format. No translation or derivative products without written permission.
• Hoeffding Inequality
Run a computer simulation for flipping 1,000 virtual fair coins. Flip each coin
independently 10 times. Focus on 3 coins as follows: c1 is the first coin flipped, crand is a
coin chosen randomly from the 1,000, and cmin is the coin which had the minimum
frequency of heads (pick the earlier one in case of a tie). Let ν1, νrand, and νmin be
the fraction of heads obtained for the 3 respective coins out of the 10 tosses.
Run the experiment 100,000 times in order to get a full distribution of ν1, νrand, and
νmin (note that crand and cmin will change from run to run).
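The experiment described above could be simulated along the following lines. This is only a sketch, assuming NumPy is available; the function name run_coin_experiment and its parameters are illustrative and not part of the assignment. Each coin's head count is drawn as a Binomial(10, 0.5) sample, which is equivalent to flipping a fair coin 10 times independently.

```python
import numpy as np

def run_coin_experiment(n_runs=100_000, n_coins=1000, n_flips=10, seed=0):
    """Return three arrays (nu_1, nu_rand, nu_min), one entry per run.

    All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    nu_1 = np.empty(n_runs)
    nu_rand = np.empty(n_runs)
    nu_min = np.empty(n_runs)
    for r in range(n_runs):
        # Number of heads for every coin in this run.
        heads = rng.binomial(n_flips, 0.5, size=n_coins)
        nu = heads / n_flips
        nu_1[r] = nu[0]                         # first coin flipped
        nu_rand[r] = nu[rng.integers(n_coins)]  # coin chosen at random
        # np.argmin returns the earliest index when several coins tie.
        nu_min[r] = nu[np.argmin(heads)]
    return nu_1, nu_rand, nu_min
```

Calling run_coin_experiment() with the defaults follows the 100,000-run setup in the text; a smaller n_runs is convenient while debugging. Histograms of the three returned arrays give the distributions the section asks about.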
[a] 0
[b] 0.01
[c] 0.1
[d] 0.5
[e] 0.67
[a] c1 only
[b] crand only
[c] cmin only
[d] c1 and crand
[e] cmin and crand
[a] µ
[b] λ
[c] 1-µ
[d] (1 − λ) ∗ µ + λ ∗ (1 − µ)
[e] (1 − λ) ∗ (1 − µ) + λ ∗ µ
[a] 0
[b] 0.5
[c] 1/√2
[d] 1
[e] No values of λ
• Linear Regression
In these problems, we will explore how Linear Regression for classification works. As
with the Perceptron Learning Algorithm in Homework # 1, you will create your own
target function f and data set D. Take d = 2 so you can visualize the problem, and
assume X = [−1, 1] × [−1, 1] with uniform probability of picking each x ∈ X. In
each run, choose a random line in the plane as your target function f (do this by
taking two random, uniformly distributed points in [−1, 1] × [−1, 1] and taking the
line passing through them), where one side of the line maps to +1 and the other maps
to −1. Choose the inputs xn of the data set as random points (uniformly in X), and
evaluate the target function on each xn to get the corresponding output yn.
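One possible way to set up the target function and data set described above is sketched here, assuming NumPy. The helper names random_target and make_data are hypothetical. The line through two points p and q is written as w0 + w1 x1 + w2 x2 = 0, and each input carries a leading 1 so that w0 acts as the bias.

```python
import numpy as np

def random_target(rng):
    """Weights (w0, w1, w2) of the line through two random points in [-1, 1]^2."""
    p, q = rng.uniform(-1, 1, size=(2, 2))
    # Normal vector of the line: perpendicular to the direction q - p.
    w1 = q[1] - p[1]
    w2 = p[0] - q[0]
    # Choose w0 so that p (and therefore q) lies on the line.
    w0 = -(w1 * p[0] + w2 * p[1])
    return np.array([w0, w1, w2])

def make_data(rng, w_target, n):
    """Draw n uniform points in [-1, 1]^2 and label them with the target."""
    X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, 2))])
    # Points exactly on the line (probability zero) are assigned +1 here.
    y = np.where(X @ w_target >= 0, 1, -1)
    return X, y
```

Which side of the line maps to +1 is decided by the order in which the two points are drawn, so it is itself random from run to run.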
5. Take N = 100. Use Linear Regression to find g and evaluate Ein, the fraction of
in-sample points which got classified incorrectly. Repeat the experiment 1000
times and take the average (keep the g’s as they will be used again in Problem
6). Which of the following values is closest to the average Ein? (Closest is the
option that makes the expression |your answer − given option| closest to 0. Use
this definition of closest here and throughout.)
[a] 0
[b] 0.001
[c] 0.01
[d] 0.1
[e] 0.5
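A minimal sketch of the experiment in Problem 5, assuming NumPy and assuming the one-step pseudo-inverse solution w = X†y for Linear Regression. All function names are illustrative. The data generation repeats the random-line setup from the section introduction so the block runs on its own.

```python
import numpy as np

def make_problem(rng, n):
    """Random line target plus n labeled points in [-1, 1]^2 (bias column first)."""
    p, q = rng.uniform(-1, 1, size=(2, 2))
    w1, w2 = q[1] - p[1], p[0] - q[0]
    w_f = np.array([-(w1 * p[0] + w2 * p[1]), w1, w2])
    X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, 2))])
    y = np.where(X @ w_f >= 0, 1, -1)
    return w_f, X, y

def linear_regression(X, y):
    """Least-squares weights via the pseudo-inverse of X."""
    return np.linalg.pinv(X) @ y

def classification_error(w, X, y):
    """Fraction of points whose sign(w . x) disagrees with the label."""
    return float(np.mean(np.sign(X @ w) != y))

def average_e_in(n=100, n_runs=1000, seed=0):
    """Average in-sample classification error over independent runs."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_runs):
        _, X, y = make_problem(rng, n)
        g = linear_regression(X, y)
        errors.append(classification_error(g, X, y))
    return float(np.mean(errors))
```

The regression is fit to the ±1 labels as if they were real-valued outputs, and only the sign of the resulting linear signal is used for classification.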
6. Now, generate 1000 fresh points and use them to estimate the out-of-sample
error Eout of g that you got in Problem 5 (number of misclassified out-of-sample
points / total number of out-of-sample points). Again, run the experiment 1000
times and take the average. Which value is closest to the average Eout?
[a] 0
[b] 0.001
[c] 0.01
[d] 0.1
[e] 0.5
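The out-of-sample estimate in Problem 6 amounts to measuring how often g and f disagree on fresh points. A short sketch, assuming NumPy and assuming both g and the target f are stored as weight vectors (w0, w1, w2); the name estimate_e_out is hypothetical:

```python
import numpy as np

def estimate_e_out(w_g, w_f, rng, n_test=1000):
    """Estimate Eout as the disagreement rate of g and f on fresh uniform points."""
    X = np.column_stack([np.ones(n_test), rng.uniform(-1, 1, size=(n_test, 2))])
    return float(np.mean(np.sign(X @ w_g) != np.sign(X @ w_f)))
```

Averaging this quantity over the 1000 hypotheses kept from Problem 5 gives the figure the problem asks for. The test points must be drawn independently of the training set for the estimate to be meaningful.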
7. Now, take N = 10. After finding the weights using Linear Regression, use
them as a vector of initial weights for the Perceptron Learning Algorithm. Run
PLA until it converges to a final vector of weights that completely separates
all the in-sample points. Among the choices below, what is the closest value to
the average number of iterations (over 1000 runs) that PLA takes to converge?
(When implementing PLA, have the algorithm choose a point randomly from
the set of misclassified points at each iteration.)
[a] 1
[b] 15
[c] 300
[d] 5000
[e] 10000
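Problem 7 can be sketched as follows, assuming NumPy. The function names and the max_iter safeguard are illustrative additions, not part of the assignment. An iteration is counted as one weight update, and the misclassified point is picked uniformly at random, as the problem instructs.

```python
import numpy as np

def make_problem(rng, n):
    """Random line target and n labeled points in [-1, 1]^2 (bias column first)."""
    p, q = rng.uniform(-1, 1, size=(2, 2))
    w1, w2 = q[1] - p[1], p[0] - q[0]
    w_f = np.array([-(w1 * p[0] + w2 * p[1]), w1, w2])
    X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, 2))])
    y = np.where(X @ w_f >= 0, 1, -1)
    return X, y

def pla(X, y, w_init, rng, max_iter=1_000_000):
    """Run PLA from w_init; return (final weights, number of updates)."""
    w = np.asarray(w_init, dtype=float).copy()
    for updates in range(max_iter):
        misclassified = np.flatnonzero(np.sign(X @ w) != y)
        if misclassified.size == 0:
            return w, updates
        i = rng.choice(misclassified)  # random misclassified point
        w += y[i] * X[i]               # standard perceptron update
    raise RuntimeError("PLA did not converge within max_iter updates")

def average_pla_iterations(n=10, n_runs=1000, seed=0):
    """Average number of PLA updates when started from the regression weights."""
    rng = np.random.default_rng(seed)
    counts = []
    for _ in range(n_runs):
        X, y = make_problem(rng, n)
        w_init = np.linalg.pinv(X) @ y  # Linear Regression solution
        _, updates = pla(X, y, w_init, rng)
        counts.append(updates)
    return float(np.mean(counts))
```

Because the labels come from a line, the data are linearly separable and PLA terminates; the max_iter cap only guards against implementation mistakes.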
• Nonlinear Transformation
In these problems, we again apply Linear Regression for classification. Consider the
target function:
8. Carry out Linear Regression without transformation, i.e., with feature vector:
(1, x1, x2),
to find the weight w. What is the closest value to the classification in-sample
error Ein? (Run the experiment 1000 times and take the average Ein to reduce
variation in your results.)
[a] 0
[b] 0.1
[c] 0.3
[d] 0.5
[e] 0.8
9. Now, transform the N = 1000 training data into the following nonlinear feature
vector:
(1, x1, x2, x1 x2, x1^2, x2^2)
Find the vector w̃ that corresponds to the solution of Linear Regression. Which
of the following hypotheses is closest to the one you find? Closest here means
agrees the most with your hypothesis (has the highest probability of agreeing on
a randomly selected point). Average over a few runs to make sure your answer
is stable.
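The transformation and the agreement measure used in Problem 9 might be coded as below. This is a sketch assuming NumPy; the names transform, fit and agreement are hypothetical. The labels y are passed in by the caller, because the target function and noise procedure are not reproduced in this section. Agreement between two hypotheses is estimated by Monte Carlo on uniform points in [−1, 1] × [−1, 1].

```python
import numpy as np

def transform(X):
    """Map rows (x1, x2) to (1, x1, x2, x1*x2, x1^2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1**2, x2**2])

def fit(X, y):
    """Linear Regression in the transformed space via the pseudo-inverse."""
    return np.linalg.pinv(transform(X)) @ y

def agreement(w_a, w_b, rng, n_points=10_000):
    """Estimated probability that two 6-weight hypotheses give the same sign."""
    Z = transform(rng.uniform(-1, 1, size=(n_points, 2)))
    return float(np.mean(np.sign(Z @ w_a) == np.sign(Z @ w_b)))
```

To compare the fitted weights with a candidate hypothesis, write the candidate's coefficients as a 6-element vector in the same feature order and pass both to agreement; the candidate with the highest value is the closest in the sense defined above.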
10. What is the closest value to the classification out-of-sample error Eout of your
hypothesis from Problem 9? (Estimate it by generating a new set of 1000 points
and adding noise, as before. Average over 1000 runs to reduce the variation in
your results.)
[a] 0
[b] 0.1
[c] 0.3
[d] 0.5
[e] 0.8