COL 774: Assignment 2
Due Date: 11:50 pm, Friday Mar 10, 2017. Total Points: 58
(a) (10 points) Implement the Naïve Bayes algorithm to classify each of the articles into one of the given
categories. Report the accuracies over the training as well as the test set. In the remaining parts
below, we will only worry about test accuracy.
Notes:
Make sure to use Laplace smoothing for Naïve Bayes (as discussed in class) to avoid any zero
probabilities. Use c = 1.
You should implement your algorithm using logarithms to avoid underflow issues.
You should implement Naïve Bayes from first principles and not use any existing Matlab/Python
modules.
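If it helps to fix ideas, here is a minimal sketch of such an implementation in Python, combining Laplace smoothing with log-space scoring. All names (train_nb, predict_nb, docs, labels) are hypothetical, and it assumes each article has already been tokenized into a list of words:

    import math
    from collections import Counter

    def train_nb(docs, labels, c=1):
        # docs: list of token lists; labels: parallel list of class ids.
        classes = sorted(set(labels))
        n = len(labels)
        prior = {k: math.log(labels.count(k) / n) for k in classes}
        vocab = {w for doc in docs for w in doc}
        counts = {k: Counter() for k in classes}
        totals = {k: 0 for k in classes}
        for doc, y in zip(docs, labels):
            counts[y].update(doc)
            totals[y] += len(doc)
        # Laplace smoothing: P(w|k) = (count(w,k) + c) / (total(k) + c*|V|)
        denom = {k: totals[k] + c * len(vocab) for k in classes}
        loglik = {k: {w: math.log((counts[k][w] + c) / denom[k]) for w in vocab}
                  for k in classes}
        unseen = {k: math.log(c / denom[k]) for k in classes}  # test-only words
        return prior, loglik, unseen

    def predict_nb(doc, prior, loglik, unseen):
        # Sum log-probabilities instead of multiplying, to avoid underflow.
        return max(prior, key=lambda k: prior[k] +
                   sum(loglik[k].get(w, unseen[k]) for w in doc))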
(b) (2 points) What test set accuracy would you obtain by randomly guessing one of the
categories as the target class for each article (random prediction)? What accuracy would you
obtain if you simply predicted the class that occurs most often in the training data (majority
prediction)? How much improvement does your algorithm give over the random/majority baseline?
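For reference, both baselines take only a few lines to compute. A sketch, assuming train_labels and test_labels are plain lists of class ids and classes is the list of category names:

    import random
    from collections import Counter

    def random_accuracy(test_labels, classes):
        # Expected accuracy is 1/|classes| for uniform random guessing.
        preds = [random.choice(classes) for _ in test_labels]
        return sum(p == y for p, y in zip(preds, test_labels)) / len(test_labels)

    def majority_accuracy(train_labels, test_labels):
        majority = Counter(train_labels).most_common(1)[0][0]
        return sum(y == majority for y in test_labels) / len(test_labels)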
(c) (4 points) Read about the confusion matrix. Draw the confusion matrix for your results in part
(a) above (for the test data only). Which category has the highest diagonal entry in the
confusion matrix, and what does that mean? Which two categories are confused with each other the most,
i.e., which is the highest among the off-diagonal entries in the confusion matrix? Explain
your observations. Include the confusion matrix in your submission.
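One simple way to build the matrix from your predictions (a sketch; it assumes classes are encoded as integers 0 to n_classes-1, and follows the rows-are-true, columns-are-predicted convention, which is one common choice):

    import numpy as np

    def confusion_matrix(y_true, y_pred, n_classes):
        # Rows are true classes, columns are predicted classes.
        cm = np.zeros((n_classes, n_classes), dtype=int)
        for t, p in zip(y_true, y_pred):
            cm[t, p] += 1
        return cm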
(d) (6 points) The dataset provided to you is in raw format, i.e., all the words appearing in the
original set of articles are present. This includes words such as 'of', 'the', 'and', etc. (called stopwords).
Presumably, these words are not relevant for classification. In fact, their presence can sometimes hurt
the performance of the classifier by introducing noise in the data. Similarly, the raw data treats
different forms of the same word separately; e.g., 'eating' and 'eat' would be treated as separate words.
Merging such variations into a single word is called stemming.
Read about stopword removal and stemming (for text classification) online.
Use the script provided to you to perform stemming and remove the stopwords in the training as
well as the test data.
Learn a new model on the transformed data. Again, report the accuracy as well as the confusion
matrix over the test data.
How do your accuracies and confusion matrix change? Comment on your observations.
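The provided script remains the canonical preprocessing step, but for intuition, an equivalent transformation can be sketched with NLTK (this assumes the nltk package and its stopwords corpus are installed; it is an illustration, not the assignment's script):

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    stops = set(stopwords.words('english'))

    def transform(tokens):
        # Drop stopwords first, then reduce each surviving word to its stem.
        return [stemmer.stem(w) for w in tokens if w.lower() not in stops]

    # e.g., transform(['He', 'was', 'eating', 'apples']) -> ['eat', 'appl']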
Use C = 500 in both cases, as before. Report the set of support vectors obtained as well as the test
set accuracies for both the linear and the Gaussian kernel settings. How do these compare with the
numbers obtained using the CVX package? Comment.
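For orientation, the LIBSVM calls involved might look as follows. This sketch assumes the pip-installable libsvm Python package (older installs import svmutil directly), uses toy data in place of the real features, and the -g value stands in for whatever parameterization the assignment's Gaussian kernel uses:

    from libsvm.svmutil import svm_train, svm_predict

    # Toy data for illustration only; substitute the assignment's features/labels.
    y_train = [1, 1, -1, -1]; x_train = [[1, 1], [1, 0], [-1, -1], [0, -1]]
    y_test  = [1, -1];        x_test  = [[1, 0.5], [-0.5, -1]]

    # Linear kernel (-t 0) with C = 500.
    lin_model = svm_train(y_train, x_train, '-t 0 -c 500')
    # Gaussian/RBF kernel (-t 2); LIBSVM's -g is gamma in exp(-gamma*|u-v|^2).
    rbf_model = svm_train(y_train, x_train, '-t 2 -c 500 -g 2.5')

    for name, model in [('linear', lin_model), ('gaussian', rbf_model)]:
        print(name, '#support vectors:', len(model.get_sv_indices()))
        _, (acc, _, _), _ = svm_predict(y_test, x_test, model)
        print(name, 'test accuracy (%):', acc)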
(e) (6 points) Cross-validation is a technique to estimate the best value of the model parameters (e.g.,
C in our problem) by randomly dividing the data into multiple folds, training on all the folds
but one, validating on the remaining fold, repeating this procedure so that every fold gets a chance
to be the validation set, and finally computing the average validation accuracy across all the folds.
This process is repeated for a range of model parameter values, and the parameters which give the best
average validation accuracy are reported as the best parameters. For a detailed introduction, you
can watch this video. Use LIBSVM for this part; you are free to use the cross-validation utility
provided with the package.
In this problem, we will perform 10-fold cross-validation to estimate the value of the C parameter.
We will fix the Gaussian kernel parameter (γ) to 2.5. Vary the value of C over the range
{1, 10, 10^2, 10^3, 10^4, 10^5, 10^6} and compute the 10-fold cross-validation accuracy for each
value of C. Also compute the corresponding accuracy on the test set. Now plot both the average
validation set accuracy and the test set accuracy on a graph as you vary the value of C on the
x-axis (use a log scale on the x-axis). What do you observe? Which value of C gives the best
validation accuracy? Does this value of C also give the best test set accuracy? Comment on your
observations.
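Since svm_train returns the 10-fold cross-validation accuracy directly as a float when given the '-v 10' option, the sweep reduces to a short loop. A sketch, reusing the placeholder y_train/x_train/y_test/x_test from the earlier snippet:

    from libsvm.svmutil import svm_train, svm_predict
    import matplotlib.pyplot as plt

    C_values = [1, 10, 1e2, 1e3, 1e4, 1e5, 1e6]
    cv_acc, test_acc = [], []
    for C in C_values:
        # With '-v 10', svm_train returns the 10-fold CV accuracy as a float.
        cv_acc.append(svm_train(y_train, x_train, f'-t 2 -g 2.5 -c {C} -v 10'))
        model = svm_train(y_train, x_train, f'-t 2 -g 2.5 -c {C}')
        _, (acc, _, _), _ = svm_predict(y_test, x_test, model)
        test_acc.append(acc)

    plt.semilogx(C_values, cv_acc, marker='o', label='10-fold CV accuracy')
    plt.semilogx(C_values, test_acc, marker='s', label='test accuracy')
    plt.xlabel('C (log scale)'); plt.ylabel('accuracy (%)')
    plt.legend(); plt.show()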
(f) (Extra fun! No credits.) You may argue that facial attractiveness is a subjective concept. In
your test data, identify the top 3 images (each) with the highest confidence of being attractive as well
as of being not attractive, based on your model. Think about how you would identify such images using
your learned model parameters. Now display these images using the Matlab script provided with the
dataset repository. You can also use any other utility in Python in case you are working with Python
code. What do you observe? Does your idea of attractiveness align with what the model thinks is
attractive? Why or why not?
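One plausible approach (a sketch, not the only valid one): for a binary SVM, the decision value w·x + b measures signed distance from the separating hyperplane, so the largest and smallest decision values mark the most confident predictions on either side. Here model is a trained LIBSVM model and test_images is a hypothetical array holding the test images:

    import numpy as np
    import matplotlib.pyplot as plt
    from libsvm.svmutil import svm_predict

    # p_vals[i][0] is the decision value for test image i in the binary case;
    # its sign is relative to model.get_labels()[0], so check the label order.
    _, _, p_vals = svm_predict(y_test, x_test, model)
    scores = np.array([v[0] for v in p_vals])

    order = np.argsort(scores)
    most_attractive = order[-3:][::-1]   # largest decision values
    least_attractive = order[:3]         # smallest decision values

    for slot, idx in enumerate(list(most_attractive) + list(least_attractive)):
        plt.subplot(2, 3, slot + 1)
        plt.imshow(test_images[idx], cmap='gray')  # test_images: hypothetical
        plt.axis('off')
    plt.show()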
Note: Do not submit the CVX or LIBSVM code. You should only submit the code that you wrote by
yourself (including wrapper code, if any) to solve this problem.