ml-20240315
To take this exam you must be registered for this specific exam as well as for the course.
In order to pass this exam, your score x first needs to be 20 or more (out of 42 points in total).
In addition, given your points y from the Programming Challenge (out of 18 points in total), the
requirements on the total points, p = x + y, are preliminarily set for the different grades as:
54 < p ≤ 60 → A
48 < p ≤ 54 → B
42 < p ≤ 48 → C
36 < p ≤ 42 → D
29 < p ≤ 36 → E (A pass is guaranteed with the required points for 'E'.)
0 ≤ p ≤ 29 → F
Page 1 (of 8)
For each term (a–d) in the left list, find the explanation from the right list which best describes
how the term is used in machine learning.
Suppose that we take a data set, divide it into training and test sets, and then try out two
different classification procedures. We use two-thirds of the data for training, and the remaining
one-third for testing. First we use Bagging and get an error rate of 10% on the training data, and
an average error rate of 15% (averaged over both the training and test samples). Next we use
k-nearest neighbor (with k = 1) and get an average error rate (averaged over both the training
and test samples) of 10%.
a) What was the error rate with 1-nearest neighbor on the training set? (1p)
b) What was the error rate with 1-nearest neighbor on the test set? (1p)
c) What was the error rate with Bagging on the test set? (1p)
d) Based on these results, indicate the method which we should prefer to use for classification
of new observations, with a simple reasoning. (1p)
For a set of N training samples {(x1 , y1 ), . . . , (xN , yN )}, each consisting of an input vector x
and an output y, suppose we estimate the regression coefficients w = (w1 , . . . , wd )⊤ ∈ Rᵈ in a
linear regression model by minimizing

    Σ_{n=1}^{N} (yₙ − w⊤xₙ)² + λ Σ_{i=1}^{d} wᵢ²
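To make the objective concrete, here is a one-dimensional sketch (d = 1, illustrative data): setting the derivative of the penalized objective to zero gives a closed-form coefficient.

```python
def ridge_1d(xs, ys, lam):
    """Minimize sum((y - w*x)^2) + lam*w^2 for scalar inputs.
    Setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(ridge_1d(xs, ys, 0.0))   # 2.0: with lam = 0 this is ordinary least squares
print(ridge_1d(xs, ys, 14.0))  # 1.0: the penalty shrinks the coefficient
```

Larger λ pulls the coefficient toward zero, which is the point of the second term in the objective above.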
b) Repeat a) for the training error (residual sum of squares, RSS). (1p)
a) What are the two kinds of randomness involved in the design of Random Forests? (2p)
b) In the AdaBoost algorithm, each training sample is given a weight which is updated according
to certain factors through the iterations of training weak classifiers. What are the two most
dominant factors in updating the weights? (2p)
c) In the AdaBoost algorithm, how are the two factors mentioned in b) used? (1p)
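For orientation, a minimal sketch of the standard (discrete) AdaBoost weight update (illustrative, not the only formulation): α is computed from the weak learner's weighted error, and each sample's weight is multiplied by e^{±α} depending on whether it was classified correctly.

```python
import math

def update_weights(weights, correct, err):
    """One AdaBoost round: alpha from the weak learner's weighted error err,
    then up-weight misclassified samples, down-weight correct ones, renormalize."""
    alpha = 0.5 * math.log((1.0 - err) / err)
    new_w = [w * math.exp(-alpha if ok else alpha)
             for w, ok in zip(weights, correct)]
    z = sum(new_w)
    return [w / z for w in new_w]

# Four equally weighted samples; the last one was misclassified (err = 0.25).
w = update_weights([0.25] * 4, [True, True, True, False], err=0.25)
# The misclassified sample now carries half of the total weight.
```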
We consider solving a K-class classification problem with the Subspace Method, and for that
we compute a subspace L(j) (j = 1, ..., K) using training data for each class, respectively. That
is, given a set of feature vectors (as training data) which belong to a specific class C (i.e. with an
identical class label), we perform PCA on them and generate an orthonormal basis {u1 , ..., up }
which spans a p-dimensional subspace, L, as the outcome. Provide answers to the following
questions.
a) Given that we compute eigenvectors of the auto-correlation matrix Q based on the training
data and use some of them to form the basis {u1 , ..., up }, do we take the eigenvectors cor-
responding to the p smallest or the p largest eigenvalues of Q? Simply state your answer.
(1p)
b) We have a new input vector x whose class is unknown, and consider its projection length
on L. Describe how the projection length is represented, using a simple formula. (2p)
c) Given x, we computed its projection length on each subspace as S (j) (j = 1, ..., K).
Among those, S (α) , S (β) , and S (γ) were the largest, the second largest, and the smallest,
respectively. Based on this observation, which class should x belong to? Simply choose one
of the three. (1p)
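For intuition, a minimal sketch of the projection length (assuming, as above, an orthonormal basis; the helper name is hypothetical): the squared projection length of x onto L is the sum of squared inner products with the basis vectors, S = Σᵢ (uᵢ⊤x)².

```python
def projection_length_sq(x, basis):
    """Squared projection length of x onto the subspace spanned by an
    orthonormal basis: S = sum_i (u_i . x)^2."""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    return sum(dot(u, x) ** 2 for u in basis)

# x lies in the plane spanned by the first two coordinate axes,
# so its squared projection length there equals its full squared length.
x = [3.0, 4.0, 0.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(projection_length_sq(x, basis))  # 25.0
```

In the Subspace Method, such a length (often normalized by ‖x‖²) is computed per class subspace and the largest one decides the class.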
The famous Monty Hall problem is sometimes called a paradox due to how counterintuitive it
may first seem. This problem is usually presented as a game show where the contestant is presented
with three closed doors and asked to pick one. Only one has the prize behind it (a car), while the
rest have goats behind them. After the contestant has chosen a door, and before opening it to reveal
what is behind it, the show host Monty opens a different door and asks if the contestant wants to
stay with their choice or switch to another door. You can assume that Monty will not open a door
that has the prize behind it (as he knows where the prize is).
Model this problem using probability theory and Bayes’ rule to show which choice is best
(staying with your original door, or switching to the remaining door that Monty did not open).
Show your calculations.
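Alongside the Bayesian calculation, a quick Monte Carlo sketch (function names are illustrative) makes the answer easy to check empirically:

```python
import random

def monty_trial(switch, rng):
    """One round: prize placed uniformly, contestant picks door 0 (WLOG),
    Monty opens an unchosen goat door, contestant may switch."""
    prize = rng.randrange(3)
    pick = 0
    # Monty opens a door that is neither picked nor hides the prize.
    # (When he has two goat doors to choose from, the choice does not
    # affect the win probabilities, so any deterministic pick is fine.)
    monty = next(d for d in range(3) if d != pick and d != prize)
    if switch:
        pick = next(d for d in range(3) if d != pick and d != monty)
    return pick == prize

rng = random.Random(0)
n = 100_000
wins_switch = sum(monty_trial(True, rng) for _ in range(n)) / n
wins_stay = sum(monty_trial(False, rng) for _ in range(n)) / n
# wins_switch comes out close to 2/3 and wins_stay close to 1/3,
# matching what the Bayesian calculation should show.
```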
a) Write the likelihood function for ML estimation of the parameters w and σ² from data. (2p)
Hint: Recall that the density of a normally distributed variable x (not the x from above) is
typically written as

    N(x | µ, σ²) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²)).    (1)
b) Derive the ordinary least-squares linear regression problem from the ML estimation problem
above. Show your calculations. (4p)
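As a sketch of the link (an outline under the stated Gaussian model, not a model solution): taking the log of the likelihood turns the product of densities into a sum, and the only part that depends on w is a negative sum of squared residuals.

```latex
\log L(w,\sigma^2)
  = \sum_{n=1}^{N} \log \mathcal{N}\!\left(y_n \mid w^{\top}x_n,\, \sigma^{2}\right)
  = -\frac{N}{2}\log\!\left(2\pi\sigma^{2}\right)
    - \frac{1}{2\sigma^{2}} \sum_{n=1}^{N} \left(y_n - w^{\top}x_n\right)^{2}
```

Since the first term does not involve w, maximizing the likelihood over w is equivalent to minimizing Σₙ (yₙ − w⊤xₙ)², the ordinary least-squares objective.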
Consider a binary (1/0) classification problem where you have a labeled data set D = {(xᵢ , yᵢ)}.
You have assumed the data follows some probabilistic model P r(y|x, θ) with parameter vector
θ, resulting in a parameter likelihood function ∏ᵢ P r(yᵢ |xᵢ , θ). You additionally assume some
(weak) prior distribution P r(θ) on the parameters.
a) Given a new input x′ , show how you would compute the probability of the new label y ′
being y ′ = 1. Assume you are estimating the model parameters from data using
maximum a posteriori (MAP) estimation. (2p)
b) Do the same, but this time assume you are using Bayesian methods for the model parameters
θ. (2p)
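For intuition, a toy sketch of the contrast (all names and numbers here are illustrative assumptions; the model ignores x and uses a discretized posterior): MAP plugs a single point estimate into P r(y′ = 1|x′ , θ), while the Bayesian approach averages P r(y′ = 1|x′ , θ) over the posterior.

```python
# Toy model where Pr(y=1 | x, theta) = theta, with the posterior over theta
# approximated on a small grid. Grid values and weights are made up.

def predictive(thetas, posterior):
    """Bayesian: Pr(y'=1) = sum_theta Pr(y'=1|theta) * Pr(theta|D)."""
    return sum(t * p for t, p in zip(thetas, posterior))

def map_plugin(thetas, posterior):
    """MAP: pick the theta maximizing the posterior, then plug it in."""
    return max(zip(posterior, thetas))[1]

thetas = [0.1, 0.5, 0.9]
posterior = [0.2, 0.5, 0.3]   # assumed posterior weights Pr(theta | D)
print(map_plugin(thetas, posterior))  # uses only the posterior mode
print(predictive(thetas, posterior))  # averages over all of the posterior
```

The two predictions differ exactly when the posterior is asymmetric around its mode, which is why the Bayesian answer in b) is an integral (here a sum) rather than a plug-in.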
Select exactly one option of (1), (2), or (3), justify your answer with keywords!
Complete the following sentence: Out of all hyperplanes which solve a classification problem,
the one with the widest margin will probably ...
The following diagram shows a small data set consisting of four RED samples (A, B, C, D)
and four BLUE samples (E, F, G, H). This data set can be linearly separated.
a) We use a linear support vector machine (SVM) without a kernel function to correctly separate
the RED (A-D) and the BLUE (E-H) classes. Which of the data points (A-H) will the support
vector machine use to separate the two classes? Name the point(s) and explain your answer
IN KEYWORDS! (2p)
b) Assume someone suggests using a non-linear kernel for the SVM classification of the above
data set (A-H). Give one argument in favor of using non-linear SVM classification for such
a data set. USE KEYWORDS! (1p)
Select exactly one option of (1), (2), or (3), justify your answer with keywords!
Error-Backpropagation-Training in neural networks mainly performs the following activity: