
Machine Learning Lecture 6 Note

Compiled by Abhi Ashutosh, Daniel Chen, and Yijun Xiao


February 16, 2016

1 Pegasos Algorithm
The Pegasos Algorithm looks very similar to the Perceptron Algorithm. In fact,
just by changing a few lines of code in our Perceptron Algorithm, we can get
the Pegasos Algorithm.

Algorithm 1: Perceptron to Pegasos

1   initialize $w_1 = 0$, $t = 0$;
2   for iter $= 1, 2, \ldots, 20$ do
3       for $j = 1, 2, \ldots, |\text{data}|$ do
4           $t = t + 1$;  $\eta_t = \frac{1}{t\lambda}$;
5           if $y_j (w_t^T x_j) < 1$ then
6               $w_{t+1} = (1 - \eta_t \lambda) w_t + \eta_t y_j x_j$;
7           else
8               $w_{t+1} = (1 - \eta_t \lambda) w_t$;
9           end
10      end
11  end
Side note: We can make both the Pegasos and Perceptron Algorithms more efficient for document classification by using sparse vectors, because most entries in the feature vector $x$ will be zeros.
As discussed in lecture, the original Pegasos algorithm randomly chooses one data point at each iteration instead of going through the data points in order as shown in Algorithm 1. The Pegasos algorithm is an application of the stochastic sub-gradient descent method.
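
To make the updates concrete, the following is a minimal NumPy sketch of Algorithm 1. It is only a sketch: the function name and the dense-array representation are illustrative choices, and the 20 passes mirror the loop in Algorithm 1 (the original Pegasos algorithm would instead sample one data point uniformly at random at each step).

import numpy as np

def pegasos_train(X, y, lam=0.1, n_epochs=20):
    """Sketch of Algorithm 1: cycle through the data in order.

    X is an (n_samples, n_features) array; y contains +1/-1 labels.
    lam is the regularization parameter lambda.
    """
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(n_epochs):
        for j in range(n):
            t += 1
            eta = 1.0 / (t * lam)              # eta_t = 1 / (t * lambda)
            if y[j] * np.dot(w, X[j]) < 1:     # margin violated
                w = (1 - eta * lam) * w + eta * y[j] * X[j]
            else:                              # only shrink w
                w = (1 - eta * lam) * w
    return w

For document classification the feature vectors would typically be sparse; the dot product and the $\eta_t y_j x_j$ step then only touch the nonzero entries of $x_j$ (the uniform shrinking of $w$ can be handled with a separate scale factor).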

2 Using Pegasos to Solve Other SVM Objectives

2.1 Imbalanced data set
Sometimes it may be hard to classify an imbalanced data set, where the classification categories are not equally represented. In this case, we want to weight each data point differently by placing more weight on the data points from the underrepresented categories. We can do this very easily by changing our optimization problem to
$$\min_w \; \|w\|_2^2 + \frac{CN}{2N_+} \sum_{j: y_j = +1} \xi_j + \frac{CN}{2N_-} \sum_{j: y_j = -1} \xi_j$$

where $N_+$, $N_-$ are the numbers of positive and negative data points respectively, and the $\xi_j$'s are the slack variables.
An intuitive way to think about this: suppose we want to build a classifier that decides whether a point is blue or red. If our data set has only 1 data point labelled red and 10 data points labelled blue, then using the modified objective function is equivalent to duplicating the red point 10 times, without explicitly creating more training data.
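
As a hedged illustration, the sketch below folds the per-class weights into the stochastic update of Algorithm 1: each example's hinge step is scaled by $N/(2N_+)$ or $N/(2N_-)$ depending on its label, with the constant $C$ absorbed into $\lambda$. This is one plausible way to realize the modified objective stochastically, not the only one.

import numpy as np

def weighted_pegasos_train(X, y, lam=0.1, n_epochs=20):
    """Sketch: Pegasos-style updates with per-class example weights.

    Examples from the minority class get a larger weight, mirroring the
    CN/(2N+) and CN/(2N-) coefficients in the modified objective.
    """
    n, d = X.shape
    n_pos = np.sum(y == 1)
    n_neg = np.sum(y == -1)
    weight = {1: n / (2.0 * n_pos), -1: n / (2.0 * n_neg)}
    w = np.zeros(d)
    t = 0
    for _ in range(n_epochs):
        for j in range(n):
            t += 1
            eta = 1.0 / (t * lam)
            if y[j] * np.dot(w, X[j]) < 1:
                w = (1 - eta * lam) * w + eta * weight[y[j]] * y[j] * X[j]
            else:
                w = (1 - eta * lam) * w
    return w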

2.2 Transfer learning


Suppose we want to build a personalized spam classifier for Professor David Sontag. However, Professor David has only a few of his emails labelled. Professor Rob, on the other hand, has labelled all of the emails he has ever received as spam or not spam and has trained an accurate spam filter on them. Since Professor David and Professor Rob are both computer science professors and run a lab together, we expect that they share similar standards for spam and non-spam. In this case, a spam classifier built for Professor Rob should, to a certain extent, also work well for Professor David. What should the SVM objective be? (Class ideas: average the weight vectors of both professors; combine David's and Rob's data and put more weight on David's data.)
One solution is to solve the following modified optimization problem:

$$\min_{w_d, b_d} \; \frac{C}{|D_d|} \sum_{(x, y) \in D_d} \max(0,\, 1 - y(w_d^T x + b_d)) + \frac{1}{2} \|w_d - w_r\|^2$$

The idea here is that we assume the weight vector for Rob will be very close to that for David, so we penalize the distance between the two. $C$ can be interpreted as how confident we are that Rob's weights are similar to David's. If we are very confident (a small $C$), the objective mostly tries to keep the two weight vectors close. If we are not confident (a large $C$), the objective focuses more on David's labelled data.
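
Below is a minimal sketch of stochastic subgradient descent on this objective, assuming Rob's weight vector $w_r$ is given and fixed. The step-size schedule, the choice to initialize at $w_r$, and the bias handling are illustrative assumptions, not part of the lecture.

import numpy as np

def transfer_train(X_d, y_d, w_r, C=1.0, n_iters=10000, eta0=0.1):
    """Sketch: learn David's (w_d, b_d) while staying close to Rob's w_r.

    Each step takes a subgradient of
      C * hinge(sampled example) + 0.5 * ||w_d - w_r||^2.
    """
    n, d = X_d.shape
    w_d = w_r.copy()             # start from Rob's classifier
    b_d = 0.0
    for t in range(1, n_iters + 1):
        eta = eta0 / np.sqrt(t)  # illustrative step-size schedule
        j = np.random.randint(n)
        grad_w = w_d - w_r       # gradient of the proximity term
        grad_b = 0.0
        if y_d[j] * (np.dot(w_d, X_d[j]) + b_d) < 1:
            grad_w = grad_w - C * y_d[j] * X_d[j]
            grad_b = -C * y_d[j]
        w_d = w_d - eta * grad_w
        b_d = b_d - eta * grad_b
    return w_d, b_d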

2.3 Multiclass classification


If we want to extend these ideas further to multi-class classification, we have a number of options. The simplest is called a one-vs-all classifier, in which we learn $n$ classifiers, one for each of the $n$ classes. We could run into issues if we want to classify a point that falls in between our classifiers, since we would need to decide which class it belongs to. We can predict the most probable class using the formula

$$\hat{y} = \arg\max_k \; w_k^T x + b_k$$
Another solution is called multiclass SVM. Here, we put soft restrictions on predicting the correct labels for the training data:

$$w^{(y_j)T} x_j + b^{(y_j)} \ge w^{(y')T} x_j + b^{(y')} + 1 - \xi_j, \quad \forall y' \ne y_j, \qquad \xi_j \ge 0, \; \forall j$$

Notice that we have one slack variable $\xi_j$ per data point and one set of weights $w^{(k)}, b^{(k)}$ for each class $k$. We could derive a similar Pegasos algorithm for a multiclass classifier.
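
A small sketch of the one-vs-all prediction rule follows, assuming the $n$ per-class classifiers (one weight vector and bias per class) have already been trained, e.g. each with Pegasos; the stacked-matrix representation is an illustrative choice.

import numpy as np

def one_vs_all_predict(W, b, x):
    """W has shape (n_classes, n_features), b has shape (n_classes,).

    Returns the index k maximizing w_k^T x + b_k.
    """
    scores = W @ x + b
    return int(np.argmax(scores))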

3 Kernel Trick
What if the data is not linearly separable? We can create a mapping $\phi(x)$ that takes our feature vector $x$ and converts it into a higher dimensional space. Finding a linear classifier in this higher dimensional space and projecting it back onto our original feature space gives us a non-linear ("squiggly line") decision boundary.
The kernel trick allows us to perform the aforementioned classification with little extra cost. For the Pegasos algorithm, we can do this by keeping track of just a single variable per data point, $\alpha_i$, and computing the vector $w$ only when required:

$$w = \sum_i \alpha_i y_i x_i$$

Let us now derive the update rule for these $\alpha_i$'s. Notice that in Algorithm 1, the update rule at each iteration is

$$w_{t+1} = (1 - \eta_t \lambda) w_t + \mathbb{1}[y_j w_t^T x_j < 1] \cdot \eta_t y_j x_j$$

where $\mathbb{1}[\text{condition}]$ is the indicator function. Now, instead of $x_j, y_j$, let us use $x^{(t)}, y^{(t)}$ to denote the data point randomly selected at iteration $t$. Substituting $\eta_t = \frac{1}{\lambda t}$, we have

$$w_{t+1} = \left(1 - \frac{1}{t}\right) w_t + \mathbb{1}[y^{(t)} w_t^T x^{(t)} < 1] \cdot \frac{1}{\lambda t} y^{(t)} x^{(t)}$$
Multiplying both sides by $t$ and rearranging,

$$t w_{t+1} - (t-1) w_t = \frac{1}{\lambda} \mathbb{1}[y^{(t)} w_t^T x^{(t)} < 1] \cdot y^{(t)} x^{(t)}$$
As the above equation holds for any $t$, we have the following $t$ equations:

$$t w_{t+1} - (t-1) w_t = \frac{1}{\lambda} \mathbb{1}[y^{(t)} w_t^T x^{(t)} < 1] \cdot y^{(t)} x^{(t)}$$
$$(t-1) w_t - (t-2) w_{t-1} = \frac{1}{\lambda} \mathbb{1}[y^{(t-1)} w_{t-1}^T x^{(t-1)} < 1] \cdot y^{(t-1)} x^{(t-1)}$$
$$\cdots$$
$$w_2 = \frac{1}{\lambda} \mathbb{1}[y^{(1)} w_1^T x^{(1)} < 1] \cdot y^{(1)} x^{(1)}$$
Summing the above $t$ equations and dividing both sides by $t$, we have

$$w_{t+1} = \frac{1}{\lambda t} \sum_{k=1}^{t} \mathbb{1}[y^{(k)} w_k^T x^{(k)} < 1] \cdot y^{(k)} x^{(k)}$$

which can be written as a summation over $i$:

$$w_{t+1} = \sum_i \left( \frac{1}{\lambda t} \sum_{k=1}^{t} \mathbb{1}[y^{(k)} w_k^T x^{(k)} < 1] \cdot \mathbb{1}[(x_i, y_i) = (x^{(k)}, y^{(k)})] \right) y_i x_i$$

The quantity inside the large parentheses corresponds to the $\alpha_i$ we defined earlier. Thus $\lambda t \alpha_i^{(t+1)}$ counts the number of times data point $i$ was drawn up to iteration $t$ and satisfied $y_i w_k^T x_i < 1$. This implies a simple update rule for $\lambda t \alpha_i^{(t+1)}$:

$$\lambda t \alpha_i^{(t+1)} = \lambda (t-1) \alpha_i^{(t)} + \mathbb{1}[(x_i, y_i) = (x^{(t)}, y^{(t)})] \cdot \mathbb{1}[y_i w_t^T x_i < 1]$$

That is, if we draw data point $(x_i, y_i)$ at iteration $t$, we increment $\lambda t \alpha_i^{(t+1)}$ by 1 iff $y_i w_t^T x_i < 1$. The resulting algorithm is shown in Algorithm 2. To simplify the notation, we denote $\beta_i^{(t)} = \lambda (t-1) \alpha_i^{(t)}$.

Algorithm 2: Kernelized Pegasos

1   initialize $\beta^{(1)} = 0$;
2   for $t = 1, 2, \ldots, T$ do
3       randomly choose $(x^{(t)}, y^{(t)}) = (x_j, y_j)$ from the training data;
4       if $y_j \frac{1}{\lambda(t-1)} \sum_i \beta_i^{(t)} y_i x_i^T x_j < 1$ then
5           $\beta_j^{(t+1)} = \beta_j^{(t)} + 1$;
6       else
7           $\beta_j^{(t+1)} = \beta_j^{(t)}$;
8       end
9   end
After convergence, we can recover the $\alpha_i$'s using $\alpha_i = \frac{1}{\lambda T} \beta_i^{(T+1)}$. At test time, predictions can be made with

$$\hat{y} = \mathrm{sign}\left( \sum_i \alpha_i y_i x_i^T x \right)$$
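
Here is a minimal sketch of Algorithm 2 together with this prediction rule. The `kernel` argument defaults to a plain dot product; as discussed below, it can be replaced by any kernel function $K(x_i, x_j)$. Precomputing the full Gram matrix is an illustrative simplification that is quadratic in the number of training points.

import numpy as np

def kernel_pegasos_train(X, y, lam=0.1, T=1000, kernel=np.dot):
    """Sketch of Algorithm 2: maintain one counter beta_i per data point."""
    n = X.shape[0]
    # Gram matrix of pairwise (kernelized) dot products
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    beta = np.zeros(n)
    for t in range(1, T + 1):
        j = np.random.randint(n)
        if t == 1:
            score = 0.0          # w_1 = 0, so the margin condition holds
        else:
            score = y[j] * np.dot(beta * y, K[:, j]) / (lam * (t - 1))
        if score < 1:
            beta[j] += 1
    return beta / (lam * T)      # alpha_i = beta_i^(T+1) / (lam * T)

def kernel_pegasos_predict(alpha, X_train, y_train, x, kernel=np.dot):
    """Predict sign(sum_i alpha_i * y_i * K(x_i, x))."""
    k = np.array([kernel(xi, x) for xi in X_train])
    return np.sign(np.dot(alpha * y_train, k))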

Now suppose we want to use more complex features $\phi(x)$, obtained by transforming the original features $x$ into a higher dimensional space. All we need to do is substitute $x_i^T x_j$ in both training and testing with $\phi(x_i)^T \phi(x_j)$.
Further, notice that $\phi(x)$ always appears in the form of dot products, which means we do not necessarily need to compute it explicitly as long as we have a formula for the dot products. This is where kernels come into use. Instead of defining the function $\phi$ to do the projection, we directly define a kernel function $K$ that computes the dot product of the projected features:

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$

We can create different kernel functions $K(x_i, x_j)$ as long as they correspond to dot products of features in some (possibly higher dimensional) space. We can also construct new valid kernel functions from other valid kernel functions by following certain composition rules. Examples of popular kernel functions include polynomial kernels, Gaussian kernels, and many more.
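
For concreteness, here are sketches of two such kernels; the degree, offset, and bandwidth values are arbitrary illustrative defaults. Either function could be passed as the `kernel` argument in the sketch above in place of the plain dot product.

import numpy as np

def polynomial_kernel(xi, xj, degree=2, c=1.0):
    """Polynomial kernel: (xi . xj + c) ** degree."""
    return (np.dot(xi, xj) + c) ** degree

def gaussian_kernel(xi, xj, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||xi - xj||^2 / (2 * sigma^2))."""
    diff = xi - xj
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))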

References
Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM (extended version). Mathematical Programming, Series B, 127(1):3-30, 2011.
