10-601 Machine Learning
Midterm Exam
Fall 2011
Tom Mitchell, Aarti Singh
Carnegie Mellon University
1. Personal information:
• Name:
• Andrew account:
• E-mail address:
2. There should be 11 numbered pages in this exam.
3. This exam is open book, open notes. No computers or internet access is allowed.
4. You do not need a calculator.
5. If you need more room to answer a question, use the back of the page and clearly mark
on the front of the page if we are to look at the back.
6. Work efficiently. Answer the easier questions first.
7. You have 80 minutes.
8. Good luck!
1 Short Questions [35 pts]
Answer True/False in the following 8 questions. Explain your reasoning in 1 sentence.
1. [3 pts] Suppose you are given a dataset of cellular images from patients with and
without cancer. If you are required to train a classifier that predicts the probability that
the patient has cancer, you would prefer to use Decision trees over logistic regression.
F SOLUTION: FALSE. Decision trees only provide a label estimate, whereas logistic
regression provides the probability of a label (patient has cancer) for a given input (cellular
image).
2. [3 pts] Suppose the dataset in the previous question had 900 cancer-free images and
100 images from cancer patients. If I train a classifier which achieves 85% accuracy on
this dataset, it is a good classifier.
F SOLUTION: FALSE. This is not a good accuracy on this dataset, since a classifier
that outputs "cancer-free" for all input images will have better accuracy (90%).
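As a quick numeric illustration (a minimal Python sketch using the class counts stated in the question):

    # Majority-class baseline on the 900/100 split from the question.
    n_cancer_free, n_cancer = 900, 100
    baseline_accuracy = n_cancer_free / (n_cancer_free + n_cancer)
    print(baseline_accuracy)  # 0.9, already better than the 85% classifier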
3. [3 pts] A classifier that attains 100% accuracy on the training set and 70% accuracy
on test set is better than a classifier that attains 70% accuracy on the training set and
75% accuracy on test set.
F SOLUTION: FALSE. The second classifier has better test accuracy which reflects
the true accuracy, whereas the first classifier is overfitting.
4. [3 pts] A football coach whispers a play number n to two players A and B independently.
Due to crowd noise, each player imperfectly and independently draws a conclusion about
what the play number was. A thinks he heard the number nA, and B thinks he heard nB.
True or false: nA and nB are marginally dependent but conditionally independent given
the true play number n.
[Figure: Bayes net with n as the parent of both nA and nB]
6. [3 pts] If you train a linear regression estimator with only half the data, its bias is
smaller.
F SOLUTION: FALSE. Bias depends on the model you use (in this case linear regression)
and not on the amount of training data.
7. [3 pts] The following two Bayes nets encode the same set of conditional independence
relations.
[Figure: two chains, A → B → C and C → B → A]
F SOLUTION: TRUE. Both models encode that C and A are conditionally independent
given B; neither encodes any other conditional independence relation.
8. [3 pts] A, B and C are three Boolean random variables. The following equality holds
without any assumptions on the joint distribution P (A, B, C)
P (A|B) = P (A|B, C = 0)P (C = 0) + P (A|B, C = 1)P (C = 1).
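For comparison, the decomposition of P(A|B) over C that does hold without any assumptions conditions C on B as well:
P(A|B) = P(A|B, C = 0)P(C = 0|B) + P(A|B, C = 1)P(C = 1|B).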
The following three short questions are not True/False questions. Please provide explanations
for your answers.
9. [3 pts] The Bayes net below implies that A is conditionally independent of B given
C (A ⊥⊥ B|C). Prove this, based on its factorization of the joint distribution, and on
the definition of conditional independence.
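A sketch of the argument, assuming (since the figure is not reproduced here) the common-cause structure C → A, C → B: the factorization is P(A, B, C) = P(C) P(A|C) P(B|C), so dividing both sides by P(C) gives P(A, B|C) = P(A|C) P(B|C), which is exactly the definition of A ⊥⊥ B | C.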
10. [3 pts] Which of the following classifiers can perfectly classify the following data:
F SOLUTION: Decision Tree only. A decision tree of depth 2 which first splits on
X1 and then on X2 will perfectly classify it. Logistic regression leads to linear decision
boundaries, and hence cannot classify this data perfectly. Because of its conditional
independence assumption, Gaussian Naive Bayes cannot fit class-conditional Gaussians
that peak at the points of only one class while having no covariance between the features,
so it also cannot classify this data perfectly.
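A small sketch of this comparison in Python (the data figure is not reproduced here, so an XOR-style layout is assumed, matching the depth-2 split described above):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression

    # Assumed XOR-style data: opposite corners share a label.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0])

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)   # splits on X1, then X2
    logreg = LogisticRegression().fit(X, y)                # a single linear boundary

    print("tree accuracy:", tree.score(X, y))              # perfect on this layout
    print("logistic accuracy:", logreg.score(X, y))        # cannot separate all four points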
11. [5 pts] Boolean random variables A and B have the joint distribution specified in the
table below.
A B P (A, B)
0 0 0.32
0 1 0.48
1 0 0.08
1 1 0.12
Given the above table, please compute the following five quantities:
P (A = 1) = 1 − P (A = 0) = 0.2
P (B = 0) = 1 − P (B = 1) = 0.4
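A minimal sketch of how these two marginals follow from the table, by summing the joint over the other variable:

    joint = {  # P(A, B) from the table above
        (0, 0): 0.32, (0, 1): 0.48,
        (1, 0): 0.08, (1, 1): 0.12,
    }
    p_a1 = sum(p for (a, b), p in joint.items() if a == 1)  # 0.08 + 0.12 = 0.2
    p_b0 = sum(p for (a, b), p in joint.items() if b == 0)  # 0.32 + 0.08 = 0.4
    print(p_a1, p_b0)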
2 MLE/MAP Estimation [15 pts]
In this question you will estimate the probability of a coin landing heads using MLE and
MAP estimates.
Suppose you have a coin whose probability of landing heads is p = 0.5, that is, it is a fair
coin. However, you do not know p and would like to form an estimator θ̂ for the probability
of landing heads p. In class, we derived an estimator that assumed p can take on any value
in the interval [0, 1]. In this question, you will derive an estimator that assumes p can take
on only two possible values: 0.3 or 0.6.
Note: Pθ̂ [heads] = θ̂.
Hint: All the calculations involved here are simple. You do not require a calculator.
1. [5 pts] You flip the coin 3 times and note that it landed 2 times on tails and 1 time
on heads. Find the maximum likelihood estimate θ̂ of p over the set of possible values
{0.3, 0.6}.
Solution:
θ̂ = argmax_{θ ∈ {0.3, 0.6}} P_θ[D]
  = argmax_{θ ∈ {0.3, 0.6}} P_θ[heads] · P_θ[tails]²
  = argmax_{θ ∈ {0.3, 0.6}} θ(1 − θ)²
We observe that
P_{θ=0.3}[D] / P_{θ=0.6}[D] = (0.3 · 0.7²) / (0.6 · 0.4²) = 0.49 / 0.32 > 1,
which implies that θ̂ = 0.3.
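A quick numeric check of the two likelihoods (a sketch, using the 1-head / 2-tails data above):

    heads, tails = 1, 2
    for theta in (0.3, 0.6):
        likelihood = theta ** heads * (1 - theta) ** tails
        print(theta, likelihood)   # 0.3 -> 0.147, 0.6 -> 0.096, so the MLE is 0.3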
2. [4 pts] Suppose that you have the following prior on the parameter p:
Again, you flip the coin 3 times and note that it landed 2 times on tails and 1 time on
heads. Find the MAP estimate θ̂ of p over the set {0.3, 0.6}, using this prior.
Solution:
3. [3 pts] Suppose that the number of times you flip the coin tends to infinity. What
would be the maximum likelihood estimate θ̂ of p over the set {0.3, 0.6} in that case?
Justify your answer.
Solution:
As the number of flips tends to infinity, the proportion of heads among all flips tends
to 0.5. The MLE would be 0.6, as this is the value closer to 0.5.
4. [3 pts] Suppose that the number of times you flip the coin tends to infinity. What
would be the MAP estimate θ̂ of p over the set {0.3, 0.6}, using the prior defined in
part 2 of this question? Justify your answer.
Solution:
With the number of flips tending to infinity, the effect of the prior becomes negligible.
Therefore, the MAP estimate will be the same as the MLE.
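A simulation sketch of part 3 (assuming the fair coin p = 0.5 stated at the start); with many flips the restricted MLE picks 0.6, and any fixed prior that puts nonzero probability on both values adds only a constant to the log-posterior, so the MAP estimate of part 4 agrees in the limit:

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (10, 1_000, 100_000):
        heads = int((rng.random(n) < 0.5).sum())   # flips of the fair coin
        tails = n - heads
        loglik = {th: heads * np.log(th) + tails * np.log(1 - th) for th in (0.3, 0.6)}
        print(n, max(loglik, key=loglik.get))      # settles on 0.6 as n grows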
3 Bayes Nets [15 pts]
1. (a) [3 pts] Please draw a Bayes net which represents the following joint distribution:
Solution:
D" E"
C" D"
F"
E"
G"
(b) [2 pts] For the graph that you drew above, assume each variable can take on
the values 1, 2 or 3. Also assume that you are given values for the probabilities
P (D = 1), P (D = 2) and P (D = 3). Please specify the smallest set of Bayes net
parameters you would need in order to calculate P (E = 1). Solution:
We can write:
4" D" 4" E"
C" D"
F"
(d) B is independent of 4"
C given only A
(e) B is not independent of C given A and D E"
Solution: Any of the
G"following satisfy the above:
2"
3. [4 pts] Consider the graph drawn below. Assume that each variable can only take on
values true and false.
(a) How many parameters are necessary to specify the joint distribution P (A, B, C, D, E, F, G)
for this Bayes net? You may answer by writing the number of parameters directly
next to each graph node.
Solution: See below for the number of parameters needed for each node. To-
tal is 17.
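(In general, a Boolean node X with parents Pa(X) needs one parameter P(X = 1 | parent configuration) for each of its 2^|Pa(X)| parent configurations, and the total for the network is the sum of these counts over all nodes.)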
(b) Please give the minimum number of Bayes net parameters required to fully spec-
ify the distribution P (G|A, B, C, D, E, F ). Briefly justify your answer.
Solution: Note that the Markov blanket for G consists only of F . Thus,
P (G|A, B, C, D, E, F ) = P (G|F ) and only two parameters are needed to specify
this distribution.
F" 4"
E"
G" 2"
4. [2 pts] Given the graph provided above, please state if the following are true or false.
(a) E is conditionally independent of G given F. Solution: True.
(b) A is conditionally independent of C given B and G. Solution: False.
4 EM [15 pts]
In this question you will apply EM to train the following simple Bayes net:
using the following data set, for which X2 is unobserved in training example 4.
Example X1 X2
1. 0 1
2. 0 0
3. 1 0
4. 1 ?
5. 0 1
The EM process has run for several iterations. At this point the parameter estimates
are:
θ̂X1=1 = P̂ (X1 = 1) = 0.4
θ̂X2=1|X1=1 = P̂ (X2 = 1|X1 = 1) = 0.4
θ̂X2=1|X1=0 = P̂ (X2 = 1|X1 = 0) = 0.66
θ̂X1=1 = 2/5 = 0.4

θ̂X2=1|X1=0 = 2/3 = 0.66

θ̂X2=1|X1=1 = 0.4/2 = 0.2
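A compact Python sketch of one EM iteration from the estimates above (assuming the net is X1 → X2, as the listed parameters suggest); it reproduces the updated values 2/5, 2/3 and 0.4/2:

    theta_x1 = 0.4                        # current P(X1 = 1)
    theta_x2_given = {1: 0.4, 0: 0.66}    # current P(X2 = 1 | X1)

    data = [(0, 1), (0, 0), (1, 0), (1, None), (0, 1)]  # X2 unobserved in example 4

    # E step: replace the missing X2 with its expected value given X1 = 1.
    filled = [(x1, theta_x2_given[x1] if x2 is None else x2) for x1, x2 in data]

    # M step: re-estimate the parameters from the expected counts.
    n = len(filled)
    theta_x1 = sum(x1 for x1, _ in filled) / n                 # 2/5 = 0.4
    for v in (0, 1):
        rows = [x2 for x1, x2 in filled if x1 == v]
        theta_x2_given[v] = sum(rows) / len(rows)              # 2/3 and 0.4/2

    print(theta_x1, theta_x2_given)   # 0.4 {1: 0.2, 0: 0.666...}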
5 Bias and Variance in Linear Regression [20 pts]
In this question, we will explore bias and variance in linear regression. Assume that a total
of N data points of the form (xi , yi ) are generated from the following (true) model:
xi ∼ Unif(0, 1),  yi = f(xi) + εi,  εi ∼ N(0, 1),  f(x) = x
We assume xi ⊥ εj ∀ i, j and εi ⊥ εj ∀ i ≠ j (note a ⊥ b means a and b are independent).
You may find the following pieces of information useful when solving this problem:
• bias² = ∫_x (ED[hD(x)] − f(x))² p(x) dx

• variance = ∫_x ED[(hD(x) − ED[hD(x)])²] p(x) dx

F SOLUTION: ED[hD(x)] = ED[μ̂] = 1/2

F SOLUTION: Bias² = ∫_0^1 (1/2 − x)² (1) dx = −(1/3)(1/2 − x)³ |_0^1 = 1/12.
The bias is thus √(1/12).
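A quick numeric check of this integral (a sketch; it simply averages (1/2 − x)² over a fine grid on [0, 1]):

    import numpy as np

    x = np.linspace(0.0, 1.0, 1_000_001)
    bias_sq = np.mean((0.5 - x) ** 2)   # approximates the integral over [0, 1]
    print(bias_sq, 1 / 12)              # both about 0.0833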
F SOLUTION: The variance is the variance of the MLE estimator. By the third bullet,
this is 1/N.
F SOLUTION: The unavoidable error and bias do not change. The variance goes to
0 as N → ∞.
Now assume we notice that y in fact depends on x. Therefore, we change to a linear regression
model (with zero intercept), which assumes the data are generated as follows:
xi ∼ Unif(0, 1),  yi = f(xi) + εi,  εi ∼ N(0, 1),  f(x) = ax
We also assume (as in the true model) that xi ⊥ εj ∀ i, j and εi ⊥ εj ∀ i ≠ j.
6. [3 pts] We choose our estimator â for a to minimize the squared sum of errors. That
is, we choose â such that
â = argmin_a (1/2) Σ_{i=1}^N (yi − a xi)²
Derive the closed form expression for â. Once we have chosen the value of â, we now
have a regression model that predicts yi = âxi .
F SOLUTION: Let f(a) = (1/2) Σ_{i=1}^N (yi − a xi)². Then,

∂f/∂a = Σ_{i=1}^N −xi (yi − a xi).

Setting the derivative to 0, we obtain:

Σ_{i=1}^N −xi (yi − a xi) = 0
⟹ Σ_{i=1}^N xi yi = a Σ_{i=1}^N xi²
⟹ â = (Σ_{i=1}^N xi yi) / (Σ_{i=1}^N xi²).
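A minimal sketch of this closed-form estimator on data simulated from the true model:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200
    x = rng.uniform(0.0, 1.0, N)
    y = x + rng.normal(0.0, 1.0, N)       # data drawn from the true model f(x) = x

    a_hat = np.sum(x * y) / np.sum(x * x) # least squares with a zero intercept
    print(a_hat)                          # close to the true slope of 1 for large N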
F SOLUTION: The unavoidable error is still introduced by εi, and is 1 by assumption.
10. [2 pts] In the figure below, draw the two learned regression models if we have an
infinite number of data points.
F SOLUTION: Model 1 (the trivial model) is the horizontal line. Model 2 (the linear
regression model) is the diagonal line.
[Figure: empty axes, x and y ranging from 0 to 1, on which the two models are drawn]