Midterm Exam
Instructors: Eric Xing, Ziv Bar-Joseph
17 November, 2015
Name:
Andrew ID:
1 Basic Probability and MLE - 10 points
1. You are trapped in a dark cave with three indistinguishable exits on the walls. One exit takes 3 hours to travel and leads outside. Of the other two exits, one takes 1 hour to travel and the other takes 2 hours, but both drop you back in the original cave. You have no way of marking which exits you have already attempted. What is the expected time it takes for you to get outside?
2. Let X1 , · · · , Xn be iid data from a uniform distribution over the disc of radius θ in R2 . Thus, Xi ∈ R2 and

    p(x; θ) = 1/(πθ²) if ‖x‖ ≤ θ,   and   p(x; θ) = 0 otherwise.

Find the maximum likelihood estimate (MLE) of θ.
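As an illustration of this density (not part of the original question), here is a small sketch, with made-up data and a hypothetical helper name, that evaluates the model's log-likelihood at a few candidate values of θ:

    import numpy as np

    def log_likelihood(theta, X):
        """Log-likelihood of iid points X (an n x 2 array) under the uniform-disc model:
        -inf if any point lies outside the disc of radius theta,
        otherwise n * log(1 / (pi * theta^2))."""
        radii = np.linalg.norm(X, axis=1)
        if np.any(radii > theta):
            return -np.inf
        return -X.shape[0] * np.log(np.pi * theta ** 2)

    # Made-up data: 100 points drawn uniformly from the disc of radius 2.
    rng = np.random.default_rng(0)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=100)
    radii = 2.0 * np.sqrt(rng.uniform(0.0, 1.0, size=100))  # sqrt makes the density uniform over the disc
    X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

    for theta in [1.5, 2.0, 2.5, 3.0]:
        print(theta, log_likelihood(theta, X))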
2 Decision Trees - 10 points
1. The following figure presents the top two levels of a decision tree learned to predict the attractiveness of a movie. What should be the value of A if the decision tree was learned using the algorithm discussed in class? (You can either say 'At most X', 'At least X', or 'Equal to X', where you should replace X with a number based on your calculation.) Explain your answer.
2. We now focus on all samples assigned to the left side of the tree (i.e., those that are longer than 120 minutes). We know that we have a binary feature, 'American director', which after the 'Action movie' split provides a perfect split for the data (i.e., all samples on one side are 'like' and all those on the other side are 'didn't like'). Fill in the missing values in the picture below:
3 Naïve Bayes & Logistic Regression - 6 points
1. In online learning, we can update the decision boundary of a classifier based on new data without reprocessing the old data. Now, for a new data point that is an outlier, which of the following classifiers is likely to be affected more severely: NB, LR, or SVM? Please give a one-sentence explanation for your answer.
2. Now, to build a classifier on discrete features using small training data, one needs to consider the scenario where some features have rare values that were never observed in the training data (e.g., the term 'Buenos Aires' does not appear in the training set of a text classification problem). To train a generalizable classifier, would you use NB or LR, and how would you augment the original formulation of the classifier under a Bayesian or regularization setting?
3. Now, to build a classifier on high-dimensional features using small training data, one needs to consider the scenario where many features are simply irrelevant noise. To train a generalizable classifier, would you use NB or LR, and how would you augment the original formulation of the classifier under a Bayesian or regularization setting?
4 Deep Neural Networks - 10 points
In homework 3, we counted the model parameters of a convolutional neural network (CNN), which gives us a sense of how much memory a CNN will consume. Now we estimate the computational overhead of a CNN by counting FLOPs (floating point operations). For simplicity we only consider the forward pass.
Consider a convolutional layer C followed by a max pooling layer P . The input of layer C has 50 channels, each of which is of size 12 × 12. Layer C has 20 filters, each of which is of size 4 × 4. The convolution padding is 1 and the stride is 2. Layer P performs max pooling over each of C's output feature maps, with 3 × 3 local receptive fields and stride 1.
Given scalars x1 , x2 , · · · , xn , we assume:
• A scalar multiplication xi · xj accounts for one FLOP;
• A scalar addition xi + xj accounts for one FLOP;
2. How many FLOPs do layers C and P conduct in total during one forward pass?
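As a geometry sanity check (not part of the exam statement; the helper names are ours), the sketch below computes the spatial output size of layer C and of layer P from the stated input size, filter/pooling size, padding, and stride, using the standard output-size formula:

    def conv_output_size(in_size, filter_size, padding, stride):
        """Spatial output size of a convolution: floor((in + 2*pad - filter) / stride) + 1."""
        return (in_size + 2 * padding - filter_size) // stride + 1

    def pool_output_size(in_size, pool_size, stride):
        """Spatial output size of pooling with no padding."""
        return (in_size - pool_size) // stride + 1

    out_c = conv_output_size(12, 4, padding=1, stride=2)  # side length of each of C's output maps
    out_p = pool_output_size(out_c, 3, stride=1)          # side length of each of P's output maps
    print(out_c, out_p)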
5 SVM - 12 points
Recall that the soft-margin primal SVM problem is
    min_{w,b,ξ}  (1/2) w^T w + C Σ_{i=1}^{n} ξi
    s.t.  ξi ≥ 0,  ∀i ∈ {1, · · · , n}                                    (1)
          (w^T xi + b) yi ≥ 1 − ξi ,  ∀i ∈ {1, · · · , n}
We can get the kernel SVM by taking the dual of the primal problem and then replacing the inner product xi^T xj with k(xi , xj ), where k(·, ·) is the kernel function.
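For concreteness (this is our addition, not part of the exam), here are a few standard choices of k(·, ·) that could be plugged in for xi^T xj; the linear kernel simply recovers the original dot product:

    import numpy as np

    def linear_kernel(xi, xj):
        """k(xi, xj) = xi . xj -- recovers the original linear SVM."""
        return np.dot(xi, xj)

    def polynomial_kernel(xi, xj, degree=2, c=1.0):
        """k(xi, xj) = (xi . xj + c)^degree."""
        return (np.dot(xi, xj) + c) ** degree

    def rbf_kernel(xi, xj, gamma=1.0):
        """k(xi, xj) = exp(-gamma * ||xi - xj||^2)."""
        return np.exp(-gamma * np.sum((xi - xj) ** 2))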
Figure 1 plots SVM decision boundaries resulting from using different kernels and/or different slack
penalties. In Figure 1, there are two classes of training data, with labels yi ∈ {−1, 1}, represented by circles
and squares respectively. The SOLID circles and squares represent the support vectors. Label each plot in
Figure 1 with the letter of the optimization problem below. You are NOT required to explain the reasons.
6 Bias-Variance Decomposition - 14 points
1. To understand bias and variance, we will create a graphical visualization using a bulls-eye. Imagine
that the center of the target is our true model (a model that perfectly predicts the correct values).
As we move away from the bulls-eye, our predictions get worse and worse. Imagine we can repeat
our entire model building process to get a number of separate hits on the target. Each hit represents
an individual realization of our model, given the chance variability in the training data we gather.
Sometimes we will get a good distribution of training data so we predict very well and we are close
to the bulls-eye, while sometimes our training data might be full of outliers or non-standard values
resulting in poorer predictions. Consider these four different realizations resulting from a scatter of hits
on the target. Characterize the bias and variance of the estimates of the following models on the data
with respect to the true model as low or high by circling the appropriate entries below each diagram.
2. Explain what effect the following operations will have on the bias and variance of your model. Fill in one of 'increases', 'decreases', or 'no change' in each of the cells:
    Regularizing the weights in a linear/logistic regression model:          Bias __________   Variance __________
    Increasing k in k-nearest neighbor models:                                Bias __________   Variance __________
    Pruning a decision tree (to a certain depth, for example):                Bias __________   Variance __________
    Increasing the number of hidden units in an artificial neural network:    Bias __________   Variance __________
    Using dropout to train a deep neural network:                             Bias __________   Variance __________
    Removing all the non-support vectors in SVM:                              Bias __________   Variance __________
7 Gaussian Mixture Model - 6 points
Consider a mixture distribution given by
    p(x) = Σ_{k=1}^{K} πk p(x | zk).        (2)
Suppose that we partition the vector x into two parts as x = (x1 , x2 ). Then the conditional distribution p(x2 | x1) is also a mixture distribution:

    p(x2 | x1) = Σ_{k=1}^{K} λk p(x2 | x1, zk),        (3)

where λ1 , · · · , λK are new mixing coefficients.
8 Semi-Supervised learning - 12 points
1. We would like to use semi-supervised learning to classify text documents. We are using the ‘bag of
words’ representation discussed in class with binary indicators for the presence of 10000 words in each
document (so each document is represented by a binary vector of length 10000).
For the following classifiers and learning methods discussed in class, state whether the method can be
applied to improve the classifier (Yes) or not (No) and provide a brief explanation.
2. Unlike all other classifiers we discussed, KNN does not have any parameters to tune. For each of the following semi-supervised methods, state whether a KNN classifier (where K is fixed and not allowed to change) learned for some data using both labeled and unlabeled data could differ from a KNN classifier learned using only the labeled data in this dataset (no need to explain).
9 Learning Theory, PAC learning - 10 points
In class we learned the following agnostic PAC learning bound:
Theorem 1. Let H be a finite concept class. Let D be an arbitrary, fixed unknown distribution over X. For
any ε, δ > 0, if we draw a sample S from D of size

    m ≥ (1 / (2ε²)) ( ln |H| + ln (2/δ) ),        (4)

then with probability at least 1 − δ, all hypotheses h ∈ H have |errD(h) − errS(h)| ≤ ε.
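To make the bound concrete (the numbers below are made up, and the function name is ours), here is a small sketch that plugs |H|, ε, and δ into (4):

    import math

    def pac_sample_size(H_size, epsilon, delta):
        """Smallest integer m satisfying m >= (1 / (2 * eps^2)) * (ln|H| + ln(2/delta))."""
        return math.ceil((math.log(H_size) + math.log(2.0 / delta)) / (2.0 * epsilon ** 2))

    # Hypothetical numbers: |H| = 1000 hypotheses, epsilon = 0.1, delta = 0.05.
    print(pac_sample_size(1000, 0.1, 0.05))  # about 530 examples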
Our friend Yan is trying to solve a learning problem that fits the assumptions above.
1. Yan tried a training set of 100 examples and observed some gap between the training error and the test error, so he wants to cut the overfitting in half. How many examples should Yan use, according to the above PAC bound?
2. Yan took your suggestion and ran his algorithm again; however, the overfitting did not halve. Do you think this is possible? Explain briefly.
10 Bayes Networks - 10 points
Consider the Bayesian network in Figure 2. We use (X ⊥⊥ Y |Z) to denote the fact that X and Y are
independent given Z. Answer the following questions:
1. Are there any pairs of variables that are (marginally) independent? If your answer is yes, please list all such pairs.
2. Does (B ⊥⊥ C | A, D) hold? Briefly explain.
3. Does (B ⊥⊥ F | A, D) hold? Briefly explain.
4. Assuming that each of these variables can take d = 10 values (say 1 to 10), how many parameters do we need to model the full joint distribution without using the knowledge encoded in the graph (i.e., no independence / conditional independence assumptions)? How many parameters do we need for the Bayesian network in this setting? (You do not need to provide the exact number; a close approximation or a tight upper / lower bound will do.)