
10-601 Machine Learning

Midterm Exam
Fall 2011
Tom Mitchell, Aarti Singh
Carnegie Mellon University

1. Personal information:
• Name:
• Andrew account:
• E-mail address:
2. There should be 11 numbered pages in this exam.
3. This exam is open book, open notes. No computers or internet access is allowed.
4. You do not need a calculator.
5. If you need more room to answer a question, use the back of the page and clearly mark
on the front of the page if we are to look at the back.
6. Work efficiently. Answer the easier questions first.
7. You have 80 minutes.
8. Good luck!

Question Topic Max. score Score


1 Short questions 35
2 MLE/MAP 15
3 Bayes Nets 15
4 EM 15
5 Regression 20
Total 100

1 Short Questions [35 pts]
Answer True/False in the following 8 questions. Explain your reasoning in 1 sentence.
1. [3 pts] Suppose you are given a dataset of cellular images from patients with and
without cancer. If you are required to train a classifier that predicts the probability that
the patient has cancer, you would prefer to use Decision trees over logistic regression.

F SOLUTION: FALSE. Decision trees only provide a label estimate, whereas logistic
regression provides the probability of a label (patient has cancer) for a given input (cellular
image).

2. [3 pts] Suppose the dataset in the previous question had 900 cancer-free images and
100 images from cancer patients. If I train a classifier which achieves 85% accuracy on
this dataset, it is a good classifier.

F SOLUTION: FALSE. This is not a good accuracy on this dataset, since a classifier
that outputs "cancer-free" for all input images will have better accuracy (90%).

3. [3 pts] A classifier that attains 100% accuracy on the training set and 70% accuracy
on test set is better than a classifier that attains 70% accuracy on the training set and
75% accuracy on test set.

F SOLUTION: FALSE. The second classifier has better test accuracy which reflects
the true accuracy, whereas the first classifier is overfitting.

4. [3 pts] A football coach whispers a play number n to two players A and B inde-
pendently. Due to crowd noise, each player imperfectly and independently draws a
conclusion about what the play number was. A thinks he heard the number nA , and
B thinks he heard nB . True or false: nA and nB are marginally dependent but condi-
tionally independent given the true play number n.

F SOLUTION: TRUE. Knowledge of the value of nA tells us something about nB, therefore
P(nA | nB) ≠ P(nA), hence they are marginally dependent; but given n, nA and nB are
determined independently. This also follows from the following Bayes net:

[Figure: Bayes net with n as the parent of both nA and nB.]

5. [3 pts] Assume m is the minimum number of training examples sufficient to guarantee


that with probability 1 − δ a consistent learner using hypothesis space H will output a
hypothesis with true error at worst ε. Then a second learner that uses hypothesis space
H′ will require 2m training examples (to make the same guarantee) if |H′| = 2|H|.

F SOLUTION: FALSE. Minimum number of training examples sufficient to make


an (ε, δ)-PAC guarantee depends logarithmically on hypothesis class size (ln |H|) and not
linearly.
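
To see the logarithmic dependence concretely, here is a small Python sketch based on the usual sample-complexity bound for consistent learners, m ≥ (1/ε)(ln |H| + ln(1/δ)) (assumed here; the exact constants do not affect the conclusion):

```python
import math

# Assumed bound for a consistent learner over a finite hypothesis space:
#   m >= (1/eps) * (ln|H| + ln(1/delta))
def pac_sample_size(h_size, eps, delta):
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

eps, delta = 0.1, 0.05
m_H  = pac_sample_size(1000, eps, delta)   # |H|  = 1000
m_H2 = pac_sample_size(2000, eps, delta)   # |H'| = 2|H|
print(m_H, m_H2, m_H2 - m_H)  # 100 106 6 -- doubling |H| adds only ~ln(2)/eps examples, not m more
```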

6. [3 pts] If you train a linear regression estimator with only half the data, its bias is
smaller.

F SOLUTION: FALSE. Bias depends on the model you use (in this case linear regres-
sion) and not on the number of training data.

7. [3 pts] The following two Bayes nets encode the same set of conditional independence
relations.

[Figure: two Bayes nets, the chain A → B → C and the chain C → B → A.]

F SOLUTION: TRUE. Both models encode that C and A are conditionally indepen-
dent given B. Also

P(A)P(B|A)P(C|B) = P(A,B)P(B,C)/P(B) = P(C)P(B|C)P(A|B)

8. [3 pts] A, B and C are three Boolean random variables. The following equality holds
without any assumptions on the joint distribution P (A, B, C)
P (A|B) = P (A|B, C = 0)P (C = 0) + P (A|B, C = 1)P (C = 1).

F SOLUTION: TRUE. Since C is a Boolean random variable, we have


P (A|B) = P (A, C = 0|B) + P (A, C = 1|B)
= P (A|B, C = 0)P (C = 0) + P (A|B, C = 1)P (C = 1)
where the last step follows from the definition of conditional probability.

The following three short questions are not True/False questions. Please provide explanations
for your answers.

9. [3 pts] The Bayes net below implies that A is conditionally independent of B given
C (A ⊥⊥ B|C). Prove this, based on its factorization of the joint distribution, and on
the definition of conditional independence.

[Figure: Bayes net with C as the parent of both A and B.]

F SOLUTION: Using factorization of joint distribution


P (A, B, C) = P (C)P (A|C)P (B|C)
and using definition of conditional independence
P (A, B, C) = P (C)P (A, B|C)
Therefore, we have:
P (A, B|C) = P (A|C)P (B|C)
i.e. A is conditionally independent of B given C (A ⊥⊥ B|C).

10. [3 pts] Which of the following classifiers can perfectly classify the following data:

(a) Decision Tree


(b) Logistic Regression
(c) Gaussian Naive Bayes

F SOLUTION: Decision Tree only. A decision tree of depth 2 which first splits on
X1 and then on X2 will perfectly classify it. Logistic regression produces linear decision
boundaries, and hence cannot classify this data perfectly. Because of its conditional
independence assumption, Gaussian Naive Bayes cannot fit a class-conditional Gaussian
(with no covariance between features) that peaks only on the points of one class, so it
cannot classify this data perfectly either.

11. [5 pts] Boolean random variables A and B have the joint distribution specified in the
table below.

A B P (A, B)
0 0 0.32
0 1 0.48
1 0 0.08
1 1 0.12

Given the above table, please compute the following five quantities:

F SOLUTION: P (A = 0) = P (A = 0, B = 0) + P (A = 0, B = 1) = 0.32 + 0.48 = 0.8

P (A = 1) = 1 − P (A = 0) = 0.2

P (B = 1) = P (B = 1, A = 0) + P (B = 1, A = 1) = 0.48 + 0.12 = 0.6

P (B = 0) = 1 − P (B = 1) = 0.4

P (A = 1|B = 0) = P (A = 1, B = 0)/P (B = 0) = 0.08/0.4 = 0.2

Are A and B independent? Justify your answer.

F SOLUTION: YES. Using the calculations above,

P (A = 0)P (B = 0) = 0.8 × 0.4 = 0.32 = P (A = 0, B = 0)
P (A = 0)P (B = 1) = 0.8 × 0.6 = 0.48 = P (A = 0, B = 1)
P (A = 1)P (B = 0) = 0.2 × 0.4 = 0.08 = P (A = 1, B = 0)
P (A = 1)P (B = 1) = 0.2 × 0.6 = 0.12 = P (A = 1, B = 1)
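
The same check can be run mechanically over the joint table; a minimal Python sketch:

```python
# Check P(A=a)P(B=b) == P(A=a, B=b) for every cell of the joint table.
joint = {(0, 0): 0.32, (0, 1): 0.48, (1, 0): 0.08, (1, 1): 0.12}

pA = {a: sum(p for (ai, bi), p in joint.items() if ai == a) for a in (0, 1)}
pB = {b: sum(p for (ai, bi), p in joint.items() if bi == b) for b in (0, 1)}

for (a, b), p in joint.items():
    assert abs(pA[a] * pB[b] - p) < 1e-12   # holds for all four cells, so A and B are independent
print(pA, pB)   # {0: 0.8, 1: 0.2} {0: 0.4, 1: 0.6}
```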

2 MLE/MAP Estimation [15 pts]
In this question you will estimate the probability of a coin landing heads using MLE and
MAP estimates.
Suppose you have a coin whose probability of landing heads is p = 0.5, that is, it is a fair
coin. However, you do not know p and would like to form an estimator θ̂ for the probability
of landing heads p. In class, we derived an estimator that assumed p can take on any value
in the interval [0, 1]. In this question, you will derive an estimator that assumes p can take
on only two possible values: 0.3 or 0.6.
Note: Pθ̂ [heads] = θ̂.
Hint: All the calculations involved here are simple. You do not require a calculator.
1. [5 pts] You flip the coin 3 times and note that it landed 2 times on tails and 1 time
on heads. Find the maximum likelihood estimate θ̂ of p over the set of possible values
{0.3, 0.6}.
Solution:

θ̂ = argmax_{θ∈{0.3,0.6}} Pθ[D]
  = argmax_{θ∈{0.3,0.6}} Pθ[heads] Pθ[tails]²
  = argmax_{θ∈{0.3,0.6}} θ(1 − θ)²

We observe that

Pθ=0.3[D] / Pθ=0.6[D] = (0.3 × 0.7²) / (0.6 × 0.4²) = 0.49/0.32 > 1

which implies that θ̂ = 0.3.
2. [4 pts] Suppose that you have the following prior on the parameter p:

P [p = 0.3] = 0.3 and P [p = 0.6] = 0.7.

Again, you flip the coin 3 times and note that it landed 2 times on tails and 1 time on
heads. Find the MAP estimate θ̂ of p over the set {0.3, 0.6}, using this prior.
Solution:

θ̂ = argmax_{θ∈{0.3,0.6}} Pθ[D] P[θ]

We observe that

Pθ=0.3[D] P[θ = 0.3] / (Pθ=0.6[D] P[θ = 0.6]) = (0.3 × 0.7² × 0.3) / (0.6 × 0.4² × 0.7) = 0.21/0.32 < 1

which implies that θ̂MAP = 0.6.

3. [3 pts] Suppose that the number of times you flip the coin tends to infinity. What
would be the maximum likelihood estimate θ̂ of p over the set {0.3, 0.6} in that case?
Justify your answer.
Solution:
With the number of flips tending to infinity, the proportion of heads among all flips
tends to 0.5. The MLE would be 0.6, as this is closer to 0.5.
4. [3 pts] Suppose that the number of times you flip the coin tends to infinity. What
would be the MAP estimate θ̂ of p over the set {0.3, 0.6}, using the prior defined in
part 2 of this question? Justify your answer.
Solution:
With the number of flips tending to infinity, the effect of the prior becomes negligible.
Therefore, the MAP estimate will be the same as the MLE.

3 Bayes Nets [15 pts]
1. (a) [3 pts] Please draw a Bayes net which represents the following joint distribution:

P (A, B, C, D, E) = P (A)P (B)P (C|A)P (D|A, B, C)P (E|D)

Solution:

A" B" C" A" B"

D" E"
C" D"

F"

E"

G"

(b) [2 pts] For the graph that you drew above, assume each variable can take on
the values 1, 2 or 3. Also assume that you are given values for the probabilities
P (D = 1), P (D = 2) and P (D = 3). Please specify the smallest set of Bayes net
parameters you would need in order to calculate P (E = 1).

Solution: We can write:

P(E = 1) = P(E = 1|D = 1)P(D = 1) + P(E = 1|D = 2)P(D = 2) + P(E = 1|D = 3)P(D = 3)

Thus we need three parameters: P(E = 1|D = 1), P(E = 1|D = 2) and P(E = 1|D = 3).
2. [4 pts] Please draw a single Bayes net which encodes all the following conditional
independence assumptions over the variables A, B, C and D:
(a) A is independent of D given B and C
(b) A is not independent of D given only B
(c) A is not independent of D given only C

(d) B is independent of C given only A
(e) B is not independent of C given A and D

Solution: Any of the following satisfy the above:

[Figure: three Bayes nets over A, B, C, D; for example, the "diamond" network with edges A → B, A → C, B → D, C → D.]
3. [4 pts] Consider the graph drawn below. Assume that each variable can only take on
values true and false.
(a) How many parameters are necessary to specify the joint distribution P (A, B, C, D, E, F, G)
for this Bayes net? You may answer by writing the number of parameters directly
next to each graph node.

Solution: See below for the number of parameters needed for each node. To-
tal is 17.
(b) Please give the minimum number of Bayes net parameters required to fully spec-
ify the distribution P (G|A, B, C, D, E, F ). Briefly justify your answer.

Solution: Note that the Markov blanket for G consists only of F . Thus,
P (G|A, B, C, D, E, F ) = P (G|F ) and only two parameters are needed to specify
this distribution.

1" 1" 1"


A" B" C" A" B"

4" D" 4" E"


C" D"

F" 4"

E"

G" 2"

4. [2 pts] Given the graph provided above, please state if the following are true or false.
(a) E is conditionally independent of G given F. Solution: True.
(b) A is conditionally independent of C given B and G. Solution: False.

4 EM [15 pts]
In this question you will apply EM to train the following simple Bayes net:

[Figure: Bayes net with a single edge X1 → X2.]

using the following data set, for which X2 is unobserved in training example 4.
Example X1 X2
1. 0 1
2. 0 0
3. 1 0
4. 1 ?
5. 0 1
The EM process has run for several iterations. At this point the parameter estimates
are:
θ̂X1=1 = P̂ (X1 = 1) = 0.4
θ̂X2=1|X1=1 = P̂ (X2 = 1|X1 = 1) = 0.4
θ̂X2=1|X1=0 = P̂ (X2 = 1|X1 = 0) = 0.66

1. [2 pts] What is calculated in the next E step?


Answer: The expected value of X2 for example 4: P (X2 = 1|X1 = 1; θ)
2. [5 pts] What precisely is the result of the next E step? Show your work.

P̂ (X2 = 1|X1 = 1) = θ̂X2=1|X1=1 = 0.4

3. [3 pts] What is calculated in the next M step?


New estimates for θ̂X1=1 , θ̂X2=1|X1=0 (which do not change), and θ̂X2=1|X1=1
4. [5 pts] What precisely is the result of the next M step? Show your work.

θ̂X1=1 = 2/5 = 0.4

θ̂X2=1|X1=0 = 2/3 ≈ 0.66

θ̂X2=1|X1=1 = 0.4/2 = 0.2
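
The whole iteration can be reproduced with a short script; a minimal Python sketch (variable names are illustrative):

```python
# One EM iteration for the X1 -> X2 net; example 4 has X2 unobserved (None).
data = [(0, 1), (0, 0), (1, 0), (1, None), (0, 1)]
theta_x2_given_x1 = {0: 0.66, 1: 0.4}     # current P(X2=1 | X1)

# E step: fill in the missing X2 with its expected value under the current parameters.
expected = [(x1, theta_x2_given_x1[x1] if x2 is None else x2) for x1, x2 in data]

# M step: re-estimate each parameter from the expected counts.
theta_x1 = sum(x1 for x1, _ in expected) / len(expected)
for v in (0, 1):
    rows = [x2 for x1, x2 in expected if x1 == v]
    theta_x2_given_x1[v] = sum(rows) / len(rows)

print(theta_x1, theta_x2_given_x1)        # 0.4 {0: 0.666..., 1: 0.2}
```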

5 Bias and Variance in Linear Regression [20 pts]
In this question, we will explore bias and variance in linear regression. Assume that a total
of N data points of the form (xi , yi ) are generated from the following (true) model:
xi ∼ Unif(0, 1),   yi = f(xi) + εi,   εi ∼ N(0, 1),   f(x) = x
We assume xi ⊥ εj ∀i, j and εi ⊥ εj ∀i ≠ j (note a ⊥ b means a and b are independent).
You may find the following pieces of information useful when solving this problem:
• bias² = ∫_x (ED[hD(x)] − f(x))² p(x) dx

• variance = ∫_x ED[(hD(x) − ED[hD(x)])²] p(x) dx

• µ̂ ∼ N(µ, 1/N) if µ̂ is the MLE estimator with N data points

• If x ∼ Unif(0, 1), then ∫_0^1 p(x) dx = 1, and therefore p(x) = 1.
We begin by examining the case where we are not aware that y depends on x. Instead, our
(incorrect) model is that f (x) has some constant value f (x) = µ, and therefore
xi ∼ Unif(0, 1),   yi ∼ N(µ, 1)   with xi ⊥ yi.

We use the MLE estimator for µ. That is, we let µ̂ = (1/N) Σ_{i=1}^{N} yi. The prediction of our trivial
regression model for the value of yi is µ̂, regardless of the value of xi.
1. [2 pts] What is the value for ED [hD (x)] in this case? Here ED refers to the expected
value over different training data sets of size N , and hD (x) is the predictor learned
from a specific data set D.

F SOLUTION: ED[hD(x)] = ED[µ̂] = 1/2

2. [3 pts] What is the bias of this trivial regression model?

F SOLUTION: bias² = ∫_0^1 (1/2 − x)² (1) dx = −(1/3)(1/2 − x)³ |_0^1 = 1/12.

The bias is thus √(1/12).

3. [2 pts] What is the variance of this trivial regression model?

F SOLUTION: The variance is the variance of the MLE estimator. By the third bullet,
this is 1/N.

4. [1 pts] What is the unavoidable error in this learning setting?

F SOLUTION: The unavoidable error is introduced by εi, and is 1 by assumption.

5. [2 pts] How does each of bias, variance, and unavoidable error change as N → ∞?

F SOLUTION: The unavoidable error and bias do not change. The variance goes to
0 as N → ∞.
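
These quantities can also be approximated by simulation; a minimal Python (NumPy) sketch, with Monte Carlo estimates that are only approximate:

```python
import numpy as np

# Monte Carlo sketch: bias^2 of the constant predictor (~1/12) and the variance of mu_hat (~1/N).
rng = np.random.default_rng(0)
N, trials = 50, 20000

# For each training set D, h_D(x) is the constant mu_hat = mean(y), with y_i = x_i + eps_i.
mu_hats = np.array([(rng.uniform(0, 1, N) + rng.normal(0, 1, N)).mean()
                    for _ in range(trials)])

x = rng.uniform(0, 1, 100_000)                # integrate over p(x) = 1 on [0, 1]
bias_sq = np.mean((mu_hats.mean() - x) ** 2)  # approximately 1/12 ≈ 0.083
variance = mu_hats.var()                      # roughly 1/N = 0.02
print(bias_sq, variance)
```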

Now assume we notice that y in fact depends on x. Therefore, we change to a linear regression
model (with zero intercept), which assumes the data are generated as follows:
xi ∼ Unif(0, 1),   yi = f(xi) + εi,   εi ∼ N(0, 1),   f(x) = ax
We also assume (as in the true model) that xi ⊥ εj ∀i, j and εi ⊥ εj ∀i ≠ j.
6. [3 pts] We choose our estimator â for a to minimize the squared sum of errors. That
is, we choose â such that
â = argmin_a (1/2) Σ_{i=1}^{N} (yi − a·xi)²

Derive the closed form expression for â. Once we have chosen the value of â, we now
have a regression model that predicts yi = âxi .

F SOLUTION: Let f(a) = (1/2) Σ_{i=1}^{N} (yi − a·xi)². Then,

∂f/∂a = −Σ_{i=1}^{N} xi(yi − a·xi)

Setting the derivative to 0, we obtain:

Σ_{i=1}^{N} xi(yi − a·xi) = 0
⟹ Σ_{i=1}^{N} xi·yi = a Σ_{i=1}^{N} xi²
⟹ â = (Σ_{i=1}^{N} xi·yi) / (Σ_{i=1}^{N} xi²).
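
The closed form is easy to check on simulated data, where it matches a generic least-squares solver; a minimal Python (NumPy) sketch:

```python
import numpy as np

# a_hat = sum(x*y) / sum(x^2) on data simulated from the true model f(x) = x (so a = 1).
rng = np.random.default_rng(1)
N = 1000
x = rng.uniform(0, 1, N)
y = x + rng.normal(0, 1, N)

a_hat = np.sum(x * y) / np.sum(x * x)
a_lstsq, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)
print(a_hat, a_lstsq[0])   # the two values agree and are close to 1
```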

7. [2 pts] What is the bias of this linear regression model?

F SOLUTION: The bias of the regression model is 0.

8. [2 pts] As N → ∞, what is the variance of this linear regression model?

F SOLUTION: The variance of the linear regression models goes to 0 as N → ∞.

9. [1 pts] What is the unavoidable error in this learning setting?

F SOLUTION: The unavoidable error is still introduced by εi, and is 1 by assumption.

10. [2 pts] In the figure below, draw the two learned regression models if we have an
infinite number of data points.

F SOLUTION: Model 1 (the trivial model) is the horizontal line y = 1/2. Model 2 (the linear
regression model) is the diagonal line y = x.

[Figure: plot of y versus x over the unit square, showing the two learned models.]