Homework #2
Due Date: Analytical part Oct 13 and Programming part Oct 20. Deadlines are midnight PST.
NOTE 1: Please use word-processing software (e.g., Microsoft Word or LaTeX) to write your
answers. The rationale is that hand-written answers are sometimes hard to read and understand.
Thanks for your understanding.
NOTE 2: Please ensure that all graphs are appropriately labeled (x-axis, y-axis, and each
curve). The caption or heading of each graph should be informative and self-contained.
1 Analytical Part (3 Percent Grade)
1. Suppose we have n+ positive training examples and n− negative training examples. Let C+
be the center of the positive examples and C− be the center of the negative examples, i.e.,
$C_+ = \frac{1}{n_+} \sum_{i:\, y_i = +1} x_i$ and $C_- = \frac{1}{n_-} \sum_{i:\, y_i = -1} x_i$. Consider a simple classifier called CLOSE
that classifies a test example x by assigning it to the class whose center is closest.
• Show that the decision boundary of the CLOSE classifier is a linear hyperplane of the
form sign(w · x + b). Compute the values of w and b in terms of C+ and C− .
• Recall that the weight vector can be written as a linear combination of all the training
examples: $w = \sum_{i=1}^{n_+ + n_-} \alpha_i \cdot y_i \cdot x_i$. Compute the dual weights (α’s). How many of the
training examples are support vectors?
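For reference, here is a minimal Python/numpy sketch of the CLOSE rule exactly as defined above. It is not the requested derivation; the function names and toy data are illustrative only.

import numpy as np

def close_fit(X_pos, X_neg):
    """Compute the class centers C+ and C- as the means of the positive/negative examples."""
    c_pos = X_pos.mean(axis=0)   # C+ = (1/n+) * sum of positive examples
    c_neg = X_neg.mean(axis=0)   # C- = (1/n-) * sum of negative examples
    return c_pos, c_neg

def close_predict(x, c_pos, c_neg):
    """Assign x to the class whose center is closest in Euclidean distance."""
    d_pos = np.linalg.norm(x - c_pos)
    d_neg = np.linalg.norm(x - c_neg)
    return +1 if d_pos <= d_neg else -1

# Tiny usage example with made-up 2-D data.
X_pos = np.array([[2.0, 2.0], [3.0, 2.5]])
X_neg = np.array([[-1.0, 0.0], [-2.0, -0.5]])
c_pos, c_neg = close_fit(X_pos, X_neg)
print(close_predict(np.array([2.5, 2.0]), c_pos, c_neg))   # expected +1
print(close_predict(np.array([-1.5, 0.0]), c_pos, c_neg))  # expected -1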
2. Suppose we use the following radial basis function (RBF) kernel: $K(x_i, x_j) = \exp\!\left(-\frac{1}{2}\|x_i - x_j\|^2\right)$,
which has some implicit unknown mapping φ(x).
• Prove that the mapping φ(x) corresponding to the RBF kernel has infinite dimensions.
• Prove that for any two input examples xi and xj, the squared Euclidean distance of their
corresponding points in the higher-dimensional space defined by φ is at most 2, i.e.,
$\|\phi(x_i) - \phi(x_j)\|^2 \leq 2$.
3. The decision boundary of an SVM with a kernel function (via implicit feature mapping φ(·))
is defined as follows:
$w \cdot \phi(x) + b = \sum_{i \in SV} y_i \alpha_i K(x_i, x) + b = f(x; \alpha, b)$,
where w and b are parameters of the decision boundary in the feature space φ defined by
the kernel function K, SV is the set of support vectors, and αi is the dual weight of the ith
support vector.
Let us assume that we use the radial basis function (RBF) kernel $K(x_i, x_j) = \exp\!\left(-\frac{1}{2}\|x_i - x_j\|^2\right)$;
also assume that the training examples are linearly separable in the feature space φ and the SVM
finds a decision boundary that perfectly separates the training examples.
Suppose we choose a testing example $x_{far}$ that is far away from every training instance xi (distance
here is measured in the original feature space $\mathbb{R}^d$). Prove that $f(x_{far}; \alpha, b) \approx b$.
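For intuition, the sketch below evaluates the decision function f(x; α, b) defined above with the RBF kernel on made-up support vectors, dual weights, and bias. It illustrates numerically (but does not prove) that the value approaches b as the test point moves far from all support vectors.

import numpy as np

def rbf_kernel(xi, xj):
    # K(xi, xj) = exp(-0.5 * ||xi - xj||^2), as in the problem statement.
    return np.exp(-0.5 * np.sum((xi - xj) ** 2))

def decision_function(x, support_vectors, labels, alphas, b):
    # f(x; alpha, b) = sum over support vectors of y_i * alpha_i * K(x_i, x) + b
    return sum(y_i * a_i * rbf_kernel(x_i, x)
               for x_i, y_i, a_i in zip(support_vectors, labels, alphas)) + b

# Made-up illustrative values (not from any real training run).
support_vectors = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = np.array([+1, -1])
alphas = np.array([0.7, 0.7])
b = 0.1

x_near = np.array([0.1, 0.1])
x_far = np.array([50.0, 50.0])   # far from every support vector
print(decision_function(x_near, support_vectors, labels, alphas, b))
print(decision_function(x_far, support_vectors, labels, alphas, b))  # approximately b, since K(xi, x_far) is near 0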
6. You are provided with a set of n training examples: (x1, y1), (x2, y2), · · · , (xn, yn), where xi is
the input example and yi is the class label (+1 or -1). Suppose n is very large (say on the order of
millions). In this case, standard SVM training algorithms will not scale due to the large training
set.
Tom wants to devise a solution based on the “Coarse-to-Fine” framework of problem solving.
The basic idea is to cluster the training data; train an SVM classifier based on the clusters
(coarse problem); refine the clusters as needed (fine problem); perform training on the finer
problem; and repeat until convergence. Suppose we start with k+ positive clusters and k−
negative clusters (a cluster is defined as a set of examples). Please specify
the mathematical formulation (define all the variables used in your formulation) and a concrete
algorithm for each of the following steps to instantiate this idea:
a) How do you define the SVM training formulation for a given level of coarseness: a set of k+
positive clusters and a set of k− negative clusters?
b) How do you refine the clusters based on the resulting SVM classifier?
c) What is the stopping criterion?
Optional question: For what kind of problems will this solution fail?
7. You are provided with a set of n training examples: (x1, y1), (x2, y2), · · · , (xn, yn), where xi
is the input example and yi is the class label (+1 or -1). Suppose n is very large (say on the order
of millions). In this case, online kernelized Perceptron algorithms will not scale if the number
of allowed support vectors is unbounded.
a) Suppose you have trained using the kernelized Perceptron algorithm (without any bound on the
number of support vectors) and obtained a set of support vectors SV. Tom wants to use this classifier for
real-time prediction and cannot afford more than B kernel evaluations for each classification
decision. Please give an algorithm to select B support vectors from SV. You need to motivate
your design choices in order to convince Tom to use your solution.
b) Tom wants to train using the kernelized Perceptron algorithm, but wants to use at most B
support vectors during the training process. Please modify the standard kernelized Perceptron
training algorithm (from the class slides) for this new setting. You need to motivate your design
choices in order to convince Tom to use your solution.
2 Programming and Empirical Analysis Part (5 Percent Grade)
1. Empirical analysis question. You can use a publicly available SVM classifier implementation
for the SVM-related experiments, e.g., scikit-learn (http://scikit-learn.org/
stable/modules/svm.html).
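A minimal usage sketch of the scikit-learn SVM interface is shown below; the dataset, kernel, and hyperparameter values are placeholders, not the required experimental settings.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder synthetic data; replace with the data required by the experiments.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC(kernel="rbf", C=1.0)   # kernel and C are tunable; see the scikit-learn SVM docs
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("support vectors per class:", clf.n_support_)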
2. Programming question. You will implement the kernelized Perceptron training algorithm
(discussed in the class) for multi-class classification.
(a) You will use the Fashion MNIST data. Train the kernelized Perceptron classifier for 5
iterations with a polynomial kernel (pick the best degree out of 2, 3, and 4 from the above
experiment). Plot the number of mistakes as a function of training iterations. Compute the
training, validation, and testing accuracy at the end of 5 iterations. A rough sketch of one
multi-class formulation is given below for reference.
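The sketch below shows one common multi-class kernelized Perceptron formulation (per-class dual weights, mistake-driven updates, polynomial kernel). The algorithm you implement should follow the class slides; the kernel form, function names, and toy data here are assumptions made for illustration only.

import numpy as np

def poly_kernel(xi, xj, degree=3):
    # One common polynomial kernel, (1 + xi . xj)^degree; match the exact form from the slides.
    return (1.0 + np.dot(xi, xj)) ** degree

def train_multiclass_kernel_perceptron(X, y, n_classes, iters=5, degree=3):
    n = len(X)
    alpha = np.zeros((n, n_classes))   # dual weight of example i for class c
    # Precompute the kernel matrix (fine for small n; use blocking for large data).
    K = np.array([[poly_kernel(X[i], X[j], degree) for j in range(n)] for i in range(n)])
    mistakes_per_iter = []
    for _ in range(iters):
        mistakes = 0
        for t in range(n):
            scores = alpha.T @ K[:, t]     # score of each class for example t
            y_hat = int(np.argmax(scores))
            if y_hat != y[t]:              # on a mistake: promote true class, demote predicted class
                alpha[t, y[t]] += 1.0
                alpha[t, y_hat] -= 1.0
                mistakes += 1
        mistakes_per_iter.append(mistakes)
    return alpha, mistakes_per_iter

# Toy usage with made-up data (replace with Fashion MNIST features and labels).
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
alpha, mistakes = train_multiclass_kernel_perceptron(X, y, n_classes=2, iters=5, degree=2)
print("mistakes per iteration:", mistakes)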
3. Programming question. “Breast Cancer” classifier using decision trees. You will use the
following dataset for this question: https://archive.ics.uci.edu/ml/datasets/Breast+
Cancer+Wisconsin+%28Diagnostic%29. You will use the first 70 percent of examples for train-
ing, the next 10 percent for validation, and the last 20 percent for testing.
(a) Implement the ID3 decision tree learning algorithm that we discussed in the class. The
key step in decision tree learning is choosing the next feature to split on. Implement
the information gain heuristic for selecting the next feature. Please see the lecture notes or
https://en.wikipedia.org/wiki/ID3_algorithm for more details. Do the following to se-
lect candidate thresholds for continuous features: sort all candidate values for feature f from the
training data. Suppose f1, f2, · · · , fn is the sorted list. The candidate thresholds are chosen
as fi + (fi+1 − fi)/2 for i = 1 to n − 1 (a small sketch is given below).
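A short sketch of the candidate-threshold rule stated above (midpoints between consecutive sorted values); the function name and example values are illustrative, and deduplicating the sorted values is an optional convenience.

import numpy as np

def candidate_thresholds(values):
    f = np.sort(np.unique(values))            # sorted candidate values f1 <= f2 <= ... (duplicates dropped)
    return [f[i] + (f[i + 1] - f[i]) / 2.0    # midpoint fi + (f_{i+1} - fi)/2
            for i in range(len(f) - 1)]

print(candidate_thresholds([3.0, 1.0, 2.0, 2.0]))  # -> [1.5, 2.5]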
(b) Run the decision tree construction algorithm on the training examples. Compute the
accuracy on validation examples and testing examples.
(c) Implement the decision tree pruning algorithm discussed in the class (via validation data).
(d) Run the pruning algorithm on the decision tree constructed using training examples.
Compute the accuracy on validation examples and testing examples. List your observations
by comparing the performance of the decision tree with and without pruning.
To debug and test your implementation, you can employ scikit-learn (http://scikit-learn.
org/stable/modules/tree.html).
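For example, a minimal scikit-learn reference run such as the following can be used to sanity-check your ID3 implementation; the placeholder data and split sizes stand in for the actual Breast Cancer dataset.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 5))                   # placeholder features
y = (X[:, 0] > 0.5).astype(int)            # placeholder labels
X_train, y_train = X[:70], y[:70]
X_val, y_val = X[70:80], y[70:80]
X_test, y_test = X[80:], y[80:]

ref = DecisionTreeClassifier(criterion="entropy")   # entropy criterion corresponds to information gain
ref.fit(X_train, y_train)
print("validation accuracy:", ref.score(X_val, y_val))
print("testing accuracy:", ref.score(X_test, y_test))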
Code submission. You will submit one zip file for your code as per the instructions below. If
your script and/or code does not execute, we will try to give some partial credit by looking at the
overall code contents.
• Mention the programming language and version (e.g., Python 2.5) that you used.
• Submit one folder with name WSUID-LASTNAME.zip (e.g., 111222-Fern.zip) and include a
README file.
• Include a script to run the code; it should be referenced in the README file. Please make
sure that your script runs on a standard Linux machine.
• Don’t submit the data folder. Assume there is a folder “data” with all the files.
• Output of your programs should be well-formatted in order to answer the empirical analysis
questions.
If you have collaborated or discussed the homework with another student, please provide this infor-
mation with all the relevant details. If we find that the code of two different students shows traces of
plagiarism, both students will be penalized and we will report the academic dishonesty case to the grad-
uate school (see https://communitystandards.wsu.edu/policies-and-reporting/academic-integrity-
policy/). The bottom line is: please DO NOT even think of going this route. It is very unpleasant
for the faculty, TAs, and students involved to deal with these cases.
4 Grading Rubric
Each question in the student’s work will be assigned a letter grade of A, B, C, D, or F by the
Instructor and TAs. This five-point (discrete) scale is described as follows:
• A) Exemplary (=100%).
Solution presented solves the problem stated correctly and meets all requirements of the prob-
lem.
Solution is clearly presented.
Assumptions made are reasonable and are explicitly stated in the solution.
Solution represents an elegant and effective way to solve the problem and is not more com-
plicated than necessary.
• B) Capable (=75%).
Solution is mostly correct, satisfying most of the above criteria under the exemplary category,
but contains some minor pitfalls, errors/flaws or limitations.
• D) Unsatisfactory (=25%).
Critical elements of the solution are missing or significantly flawed.
Solution does not demonstrate sufficient understanding of the problem and/or any reasonable
directions to solve the problem.
The points on a given homework question will be equal to the percentage assigned (given by the
letter grades shown above) multiplied by the maximum number of points for that
question. For example, if a question is worth 6 points and the answer is awarded a B grade, then
that implies 4.5 points out of 6.