COL 774: Assignment 2
Due Date: 11:50 pm, Friday Mar 10, 2017. Total Points: 58
(a) (10 points) Implement the Naïve Bayes algorithm to classify each of the articles into one of the given
categories. Report the accuracies over the training as well as the test set. In the remaining parts
below, we will only worry about test accuracy.
Notes:
Make sure to use Laplace smoothing for Naïve Bayes (as discussed in class) to avoid any zero
probabilities. Use c = 1.
You should implement your algorithm using logarithms to avoid underflow issues.
You should implement Naïve Bayes from first principles and not use any existing Matlab/Python
modules.
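If it helps to fix ideas, here is a minimal sketch of such an implementation in Python, combining Laplace smoothing with log-space scoring. All names (train_nb, predict_nb, docs, labels) are hypothetical, and it assumes each article has already been tokenized into a list of words:

    import math
    from collections import Counter

    def train_nb(docs, labels, c=1):
        # docs: list of token lists; labels: parallel list of class ids.
        classes = sorted(set(labels))
        n = len(labels)
        prior = {k: math.log(labels.count(k) / n) for k in classes}
        vocab = {w for doc in docs for w in doc}
        counts = {k: Counter() for k in classes}
        totals = {k: 0 for k in classes}
        for doc, y in zip(docs, labels):
            counts[y].update(doc)
            totals[y] += len(doc)
        # Laplace smoothing: P(w|k) = (count(w,k) + c) / (total(k) + c*|V|)
        denom = {k: totals[k] + c * len(vocab) for k in classes}
        loglik = {k: {w: math.log((counts[k][w] + c) / denom[k]) for w in vocab}
                  for k in classes}
        unseen = {k: math.log(c / denom[k]) for k in classes}  # test-only words
        return prior, loglik, unseen

    def predict_nb(doc, prior, loglik, unseen):
        # Sum log-probabilities instead of multiplying, to avoid underflow.
        return max(prior, key=lambda k: prior[k] +
                   sum(loglik[k].get(w, unseen[k]) for w in doc))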
(b) (2 points) What test set accuracy would you obtain by randomly guessing one of the
categories as the target class for each article (random prediction)? What accuracy would you
obtain if you simply predicted the class that occurs most often in the training data (majority
prediction)? How much improvement does your algorithm give over the random/majority baseline?
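For reference, both baselines take only a few lines to compute. A sketch, assuming train_labels and test_labels are plain lists of class ids and classes is the list of category names:

    import random
    from collections import Counter

    def random_accuracy(test_labels, classes):
        # Expected accuracy is 1/|classes| for uniform random guessing.
        preds = [random.choice(classes) for _ in test_labels]
        return sum(p == y for p, y in zip(preds, test_labels)) / len(test_labels)

    def majority_accuracy(train_labels, test_labels):
        majority = Counter(train_labels).most_common(1)[0][0]
        return sum(y == majority for y in test_labels) / len(test_labels)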
(c) (4 points) Read about the confusion matrix. Draw the confusion matrix for your results in part
(a) above (for the test data only). Which category has the highest diagonal entry in the
confusion matrix, and what does that mean? Which two categories are confused with each other the most,
i.e., which is the highest among the off-diagonal entries in the confusion matrix? Explain
your observations. Include the confusion matrix in your submission.
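One simple way to build the matrix from your predictions (a sketch; it assumes classes are encoded as integers 0 to n_classes-1, and follows the rows-are-true, columns-are-predicted convention, which is one common choice):

    import numpy as np

    def confusion_matrix(y_true, y_pred, n_classes):
        # Rows are true classes, columns are predicted classes.
        cm = np.zeros((n_classes, n_classes), dtype=int)
        for t, p in zip(y_true, y_pred):
            cm[t, p] += 1
        return cm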
(d) (6 points) The dataset provided to you is in raw format, i.e., all the words appearing in the
original set of articles are present. This includes words such as 'of', 'the', 'and', etc. (called stopwords).
Presumably, these words are not relevant for classification. In fact, their presence can sometimes hurt
the performance of the classifier by introducing noise in the data. Similarly, the raw data treats
different forms of the same word separately; e.g., 'eating' and 'eat' would be treated as separate words.
Merging such variations into a single word is called stemming.
Read about stopword removal and stemming (for text classification) online.
Use the script provided to you to perform stemming and remove the stopwords in the training as
well as the test data.
Learn a new model on the transformed data. Again, report the accuracy as well as the confusion
matrix over the test data.
How do your accuracies and confusion matrix change? Comment on your observations.
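The provided script remains the canonical preprocessing step, but for intuition, an equivalent transformation can be sketched with NLTK (this assumes the nltk package and its stopwords corpus are installed; it is an illustration, not the assignment's script):

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    stops = set(stopwords.words('english'))

    def transform(tokens):
        # Drop stopwords first, then reduce each surviving word to its stem.
        return [stemmer.stem(w) for w in tokens if w.lower() not in stops]

    # e.g., transform(['He', 'was', 'eating', 'apples']) -> ['eat', 'appl']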
Use C = 500 in both cases, as before. Report the set of support vectors obtained as well as the test
set accuracies for both the linear and the Gaussian kernel settings. How do these compare with the
numbers obtained using the CVX package? Comment.
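For orientation, the LIBSVM calls involved might look as follows. This sketch assumes the pip-installable libsvm Python package (older installs import svmutil directly), uses toy data in place of the real features, and the -g value stands in for whatever parameterization the assignment's Gaussian kernel uses:

    from libsvm.svmutil import svm_train, svm_predict

    # Toy data for illustration only; substitute the assignment's features/labels.
    y_train = [1, 1, -1, -1]; x_train = [[1, 1], [1, 0], [-1, -1], [0, -1]]
    y_test  = [1, -1];        x_test  = [[1, 0.5], [-0.5, -1]]

    # Linear kernel (-t 0) with C = 500.
    lin_model = svm_train(y_train, x_train, '-t 0 -c 500')
    # Gaussian/RBF kernel (-t 2); LIBSVM's -g is gamma in exp(-gamma*|u-v|^2).
    rbf_model = svm_train(y_train, x_train, '-t 2 -c 500 -g 2.5')

    for name, model in [('linear', lin_model), ('gaussian', rbf_model)]:
        print(name, '#support vectors:', len(model.get_sv_indices()))
        _, (acc, _, _), _ = svm_predict(y_test, x_test, model)
        print(name, 'test accuracy (%):', acc)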
(e) (6 points) Cross-validation is a technique to estimate the best value of the model parameters (e.g.,
C in our problem) by randomly dividing the data into multiple folds, training on all the folds
but one, validating on the remaining fold, repeating this procedure so that every fold gets a chance
to be the validation set, and finally computing the average validation accuracy across all the folds.
This process is repeated for a range of model parameter values, and the parameters which give the best
average validation accuracy are reported as the best parameters. For a detailed introduction, you
can watch this video. Use LIBSVM for this part; you are free to use the cross-validation utility
provided with the package.
In this problem, we will perform 10-fold cross-validation to estimate the value of the C parameter.
We will fix the Gaussian kernel parameter (γ) to 2.5. Vary the value of C over the range
{1, 10, 10^2, 10^3, 10^4, 10^5, 10^6} and compute the 10-fold cross-validation accuracy for each
value of C. Also compute the corresponding accuracy on the test set. Now plot both the average
validation set accuracy and the test set accuracy on a graph as you vary the value of C on the
x-axis (use a log scale on the x-axis). What do you observe? Which value of C gives the best
validation accuracy? Does this value of C also give the best test set accuracy? Comment on your
observations.
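Since svm_train returns the 10-fold cross-validation accuracy directly as a float when given the '-v 10' option, the sweep reduces to a short loop. A sketch, reusing the placeholder y_train/x_train/y_test/x_test from the earlier snippet:

    from libsvm.svmutil import svm_train, svm_predict
    import matplotlib.pyplot as plt

    C_values = [1, 10, 1e2, 1e3, 1e4, 1e5, 1e6]
    cv_acc, test_acc = [], []
    for C in C_values:
        # With '-v 10', svm_train returns the 10-fold CV accuracy as a float.
        cv_acc.append(svm_train(y_train, x_train, f'-t 2 -g 2.5 -c {C} -v 10'))
        model = svm_train(y_train, x_train, f'-t 2 -g 2.5 -c {C}')
        _, (acc, _, _), _ = svm_predict(y_test, x_test, model)
        test_acc.append(acc)

    plt.semilogx(C_values, cv_acc, marker='o', label='10-fold CV accuracy')
    plt.semilogx(C_values, test_acc, marker='s', label='test accuracy')
    plt.xlabel('C (log scale)'); plt.ylabel('accuracy (%)')
    plt.legend(); plt.show()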
(f) (Extra fun! No credits.) You may argue that facial attractiveness is a subjective concept. In
your test data, identify the top 3 images (each) with the highest confidence of being attractive as well
as of being not attractive, based on your model. Think about how you would identify such images using
your learned model parameters. Now display these images using the Matlab script provided with the
dataset repository. You can also use any other utility in Python in case you are working with Python
code. What do you observe? Does your idea of attractiveness align with what the model thinks is
attractive? Why or why not?
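One plausible approach (a sketch, not the only valid one): for a binary SVM, the decision value w·x + b measures signed distance from the separating hyperplane, so the largest and smallest decision values mark the most confident predictions on either side. Here model is a trained LIBSVM model and test_images is a hypothetical array holding the test images:

    import numpy as np
    import matplotlib.pyplot as plt
    from libsvm.svmutil import svm_predict

    # p_vals[i][0] is the decision value for test image i in the binary case;
    # its sign is relative to model.get_labels()[0], so check the label order.
    _, _, p_vals = svm_predict(y_test, x_test, model)
    scores = np.array([v[0] for v in p_vals])

    order = np.argsort(scores)
    most_attractive = order[-3:][::-1]   # largest decision values
    least_attractive = order[:3]         # smallest decision values

    for slot, idx in enumerate(list(most_attractive) + list(least_attractive)):
        plt.subplot(2, 3, slot + 1)
        plt.imshow(test_images[idx], cmap='gray')  # test_images: hypothetical
        plt.axis('off')
    plt.show()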
Note: Do not submit the CVX or LIBSVM code. You should only submit the code that you wrote by
yourself (including wrapper code, if any) to solve this problem.