
Machine Learning Lecture 6 Note

Compiled by Abhi Ashutosh, Daniel Chen, and Yijun Xiao


February 16, 2016

1 Pegasos Algorithm
The Pegasos Algorithm looks very similar to the Perceptron Algorithm. In fact,
just by changing a few lines of code in our Perceptron Algorithm, we can get
the Pegasos Algorithm.

Algorithm 1: Perceptron to Pegasos

1   initialize $w_1 = 0$, $t = 0$;
2   for iter $= 1, 2, \ldots, 20$ do
3       for $j = 1, 2, \ldots, |\text{data}|$ do
4           $t = t + 1$;  $\eta_t = \frac{1}{t\lambda}$;
5           if $y_j (w_t^T x_j) < 1$ then
6               $w_{t+1} = (1 - \eta_t \lambda) w_t + \eta_t y_j x_j$;
7           else
8               $w_{t+1} = (1 - \eta_t \lambda) w_t$;
9           end
10      end
11  end
Side note: We can make both the Pegasos and Perceptron Algorithms more efficient for document classification by using sparse vectors, because most entries in the feature vector $x$ will be zeros.
As discussed in lecture, the original Pegasos algorithm randomly chooses one data point at each iteration instead of going through the data points in order as shown in Algorithm 1. The Pegasos algorithm is an application of the stochastic sub-gradient descent method.
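
To make the updates concrete, the following is a minimal NumPy sketch of Algorithm 1. It is only a sketch: the function name and the dense-array representation are illustrative choices, and the 20 passes mirror the loop in Algorithm 1 (the original Pegasos algorithm would instead sample one data point uniformly at random at each step).

import numpy as np

def pegasos_train(X, y, lam=0.1, n_epochs=20):
    """Sketch of Algorithm 1: cycle through the data in order.

    X is an (n_samples, n_features) array; y contains +1/-1 labels.
    lam is the regularization parameter lambda.
    """
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(n_epochs):
        for j in range(n):
            t += 1
            eta = 1.0 / (t * lam)              # eta_t = 1 / (t * lambda)
            if y[j] * np.dot(w, X[j]) < 1:     # margin violated
                w = (1 - eta * lam) * w + eta * y[j] * X[j]
            else:                              # only shrink w
                w = (1 - eta * lam) * w
    return w

For document classification the feature vectors would typically be sparse; the dot product and the $\eta_t y_j x_j$ step then only touch the nonzero entries of $x_j$ (the uniform shrinking of $w$ can be handled with a separate scale factor).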

2 Using Pegasos to Solve Other SVM Objectives

2.1 Imbalanced data set
Sometimes it may be hard to classify an imbalanced data set, where the classification categories are not equally represented. In this case, we want to weight each data point differently by placing more weight on the data points from the underrepresented categories. We can do this very easily by changing our optimization problem to
$$\min_w \; \|w\|_2^2 + \frac{CN}{2N_+} \sum_{j: y_j = +1} \xi_j + \frac{CN}{2N_-} \sum_{j: y_j = -1} \xi_j$$

where $N_+$, $N_-$ are the numbers of positive and negative data points respectively, and the $\xi_j$'s are the slack variables.
An intuitive way to think about this: suppose we want to build a classifier that decides whether a point is blue or red. If our data set has only 1 data point labelled red and 10 data points labelled blue, then using the modified objective function is equivalent to duplicating the red point 10 times, without explicitly creating more training data.
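
As a hedged illustration, the sketch below folds the per-class weights into the stochastic update of Algorithm 1: each example's hinge step is scaled by $N/(2N_+)$ or $N/(2N_-)$ depending on its label, with the constant $C$ absorbed into $\lambda$. This is one plausible way to realize the modified objective stochastically, not the only one.

import numpy as np

def weighted_pegasos_train(X, y, lam=0.1, n_epochs=20):
    """Sketch: Pegasos-style updates with per-class example weights.

    Examples from the minority class get a larger weight, mirroring the
    CN/(2N+) and CN/(2N-) coefficients in the modified objective.
    """
    n, d = X.shape
    n_pos = np.sum(y == 1)
    n_neg = np.sum(y == -1)
    weight = {1: n / (2.0 * n_pos), -1: n / (2.0 * n_neg)}
    w = np.zeros(d)
    t = 0
    for _ in range(n_epochs):
        for j in range(n):
            t += 1
            eta = 1.0 / (t * lam)
            if y[j] * np.dot(w, X[j]) < 1:
                w = (1 - eta * lam) * w + eta * weight[y[j]] * y[j] * X[j]
            else:
                w = (1 - eta * lam) * w
    return w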

2.2 Transfer learning


Suppose we want to build a personalized spam classifier for Professor David Sontag. However, Professor David has only a few of his emails labelled. Professor Rob, on the other hand, has labelled all of the emails he has ever received as spam or not spam and has trained an accurate spam filter on them. Since Professor David and Professor Rob are both computer science professors and run a lab together, we expect that they share similar standards for spam and non-spam. In this case, a spam classifier built for Professor Rob should, to a certain extent, also work well for Professor David. What should the SVM objective be? (Class ideas: average the weight vectors of both professors; combine David's and Rob's data and put more weight on David's data.)
One solution is to solve the following modified optimization problem:

$$\min_{w_d, b_d} \; \frac{C}{|D_d|} \sum_{(x, y) \in D_d} \max(0,\, 1 - y(w_d^T x + b_d)) + \frac{1}{2} \|w_d - w_r\|^2$$

The idea here is that we assume the weight vector for Rob will be very close to that for David, so we penalize the distance between the two. $C$ can be interpreted as how confident we are that Rob's weights are similar to David's. If we are very confident (a small $C$), the objective mostly tries to keep the two weight vectors close. If we are not confident (a large $C$), the objective focuses more on David's labelled data.
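
Below is a minimal sketch of stochastic subgradient descent on this objective, assuming Rob's weight vector $w_r$ is given and fixed. The step-size schedule, the choice to initialize at $w_r$, and the bias handling are illustrative assumptions, not part of the lecture.

import numpy as np

def transfer_train(X_d, y_d, w_r, C=1.0, n_iters=10000, eta0=0.1):
    """Sketch: learn David's (w_d, b_d) while staying close to Rob's w_r.

    Each step takes a subgradient of
      C * hinge(sampled example) + 0.5 * ||w_d - w_r||^2.
    """
    n, d = X_d.shape
    w_d = w_r.copy()             # start from Rob's classifier
    b_d = 0.0
    for t in range(1, n_iters + 1):
        eta = eta0 / np.sqrt(t)  # illustrative step-size schedule
        j = np.random.randint(n)
        grad_w = w_d - w_r       # gradient of the proximity term
        grad_b = 0.0
        if y_d[j] * (np.dot(w_d, X_d[j]) + b_d) < 1:
            grad_w = grad_w - C * y_d[j] * X_d[j]
            grad_b = -C * y_d[j]
        w_d = w_d - eta * grad_w
        b_d = b_d - eta * grad_b
    return w_d, b_d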

2.3 Multiclass classification


If we want to extend these ideas further to multi-class classification, we have a number of options. The simplest is called a one-vs-all classifier, in which we learn $n$ classifiers, one for each of the $n$ classes. We could run into issues if we want to classify a point that falls in between our classifiers, since we would need to decide which class it belongs to. We can predict the most probable class using the formula

$$\hat{y} = \arg\max_k \; w_k^T x + b_k$$
Another solution is called multiclass SVM. Here, we put soft restrictions on predicting the correct labels for the training data:

$$w^{(y_j)T} x_j + b^{(y_j)} \ge w^{(y')T} x_j + b^{(y')} + 1 - \xi_j, \quad \forall y' \ne y_j, \qquad \xi_j \ge 0, \; \forall j$$

Notice that we have one slack variable $\xi_j$ per data point and one set of weights $w^{(k)}, b^{(k)}$ for each class $k$. We could derive a similar Pegasos algorithm for a multiclass classifier.
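
A small sketch of the one-vs-all prediction rule follows, assuming the $n$ per-class classifiers (one weight vector and bias per class) have already been trained, e.g. each with Pegasos; the stacked-matrix representation is an illustrative choice.

import numpy as np

def one_vs_all_predict(W, b, x):
    """W has shape (n_classes, n_features), b has shape (n_classes,).

    Returns the index k maximizing w_k^T x + b_k.
    """
    scores = W @ x + b
    return int(np.argmax(scores))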

3 Kernel Trick
What if the data is not linearly separable? We can create a mapping $\phi(x)$ that takes our feature vector $x$ and converts it into a higher dimensional space. Finding a linear classifier in this higher dimensional space and projecting it back onto our original feature space gives us a non-linear ("squiggly line") decision boundary.
The kernel trick allows us to perform the aforementioned classification with little extra cost. For the Pegasos algorithm, we can do this by keeping track of just a single variable per data point, $\alpha_i$, and computing the vector $w$ only when required:

$$w = \sum_i \alpha_i y_i x_i$$

Let us now derive the update rule for these $\alpha_i$'s. Notice that in Algorithm 1, the update rule at each iteration is

$$w_{t+1} = (1 - \eta_t \lambda) w_t + \mathbb{1}[y_j w_t^T x_j < 1] \cdot \eta_t y_j x_j$$

where $\mathbb{1}[\text{condition}]$ is the indicator function. Now, instead of $x_j, y_j$, let us use $x^{(t)}, y^{(t)}$ to denote the data point randomly selected at iteration $t$. Substituting $\eta_t = \frac{1}{\lambda t}$, we have

$$w_{t+1} = \left(1 - \frac{1}{t}\right) w_t + \mathbb{1}[y^{(t)} w_t^T x^{(t)} < 1] \cdot \frac{1}{\lambda t} y^{(t)} x^{(t)}$$
Multiplying both sides by $t$ and rearranging,

$$t w_{t+1} - (t-1) w_t = \frac{1}{\lambda} \mathbb{1}[y^{(t)} w_t^T x^{(t)} < 1] \cdot y^{(t)} x^{(t)}$$
As the above equation holds for any $t$, we have the following $t$ equations:

$$t w_{t+1} - (t-1) w_t = \frac{1}{\lambda} \mathbb{1}[y^{(t)} w_t^T x^{(t)} < 1] \cdot y^{(t)} x^{(t)}$$
$$(t-1) w_t - (t-2) w_{t-1} = \frac{1}{\lambda} \mathbb{1}[y^{(t-1)} w_{t-1}^T x^{(t-1)} < 1] \cdot y^{(t-1)} x^{(t-1)}$$
$$\cdots$$
$$w_2 = \frac{1}{\lambda} \mathbb{1}[y^{(1)} w_1^T x^{(1)} < 1] \cdot y^{(1)} x^{(1)}$$
Summing the above $t$ equations and dividing both sides by $t$, we have

$$w_{t+1} = \frac{1}{\lambda t} \sum_{k=1}^{t} \mathbb{1}[y^{(k)} w_k^T x^{(k)} < 1] \cdot y^{(k)} x^{(k)}$$

which can be written as a summation over $i$:

$$w_{t+1} = \sum_i \left( \frac{1}{\lambda t} \sum_{k=1}^{t} \mathbb{1}[y^{(k)} w_k^T x^{(k)} < 1] \cdot \mathbb{1}[(x_i, y_i) = (x^{(k)}, y^{(k)})] \right) y_i x_i$$

The quantity inside the large parentheses corresponds to the $\alpha_i$ we defined earlier. Thus $\lambda t \alpha_i^{(t+1)}$ counts the number of times data point $i$ was drawn up to iteration $t$ and satisfied $y_i w_k^T x_i < 1$. This implies a simple update rule for $\lambda t \alpha_i^{(t+1)}$:

$$\lambda t \alpha_i^{(t+1)} = \lambda (t-1) \alpha_i^{(t)} + \mathbb{1}[(x_i, y_i) = (x^{(t)}, y^{(t)})] \cdot \mathbb{1}[y_i w_t^T x_i < 1]$$

That is, if we draw data point $(x_i, y_i)$ at iteration $t$, we increment $\lambda t \alpha_i^{(t+1)}$ by 1 iff $y_i w_t^T x_i < 1$. The resulting algorithm is shown in Algorithm 2. To simplify the notation, we denote $\beta_i^{(t)} = \lambda (t-1) \alpha_i^{(t)}$.

Algorithm 2: Kernelized Pegasos

1   initialize $\beta^{(1)} = 0$;
2   for $t = 1, 2, \ldots, T$ do
3       randomly choose $(x^{(t)}, y^{(t)}) = (x_j, y_j)$ from the training data;
4       if $y_j \frac{1}{\lambda(t-1)} \sum_i \beta_i^{(t)} y_i x_i^T x_j < 1$ then
5           $\beta_j^{(t+1)} = \beta_j^{(t)} + 1$;
6       else
7           $\beta_j^{(t+1)} = \beta_j^{(t)}$;
8       end
9   end
After convergence, we can recover the $\alpha_i$'s using $\alpha_i = \frac{1}{\lambda T} \beta_i^{(T+1)}$. At test time, predictions can be made with

$$\hat{y} = \mathrm{sign}\left( \sum_i \alpha_i y_i x_i^T x \right)$$
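
Here is a minimal sketch of Algorithm 2 together with this prediction rule. The `kernel` argument defaults to a plain dot product; as discussed below, it can be replaced by any kernel function $K(x_i, x_j)$. Precomputing the full Gram matrix is an illustrative simplification that is quadratic in the number of training points.

import numpy as np

def kernel_pegasos_train(X, y, lam=0.1, T=1000, kernel=np.dot):
    """Sketch of Algorithm 2: maintain one counter beta_i per data point."""
    n = X.shape[0]
    # Gram matrix of pairwise (kernelized) dot products
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    beta = np.zeros(n)
    for t in range(1, T + 1):
        j = np.random.randint(n)
        if t == 1:
            score = 0.0          # w_1 = 0, so the margin condition holds
        else:
            score = y[j] * np.dot(beta * y, K[:, j]) / (lam * (t - 1))
        if score < 1:
            beta[j] += 1
    return beta / (lam * T)      # alpha_i = beta_i^(T+1) / (lam * T)

def kernel_pegasos_predict(alpha, X_train, y_train, x, kernel=np.dot):
    """Predict sign(sum_i alpha_i * y_i * K(x_i, x))."""
    k = np.array([kernel(xi, x) for xi in X_train])
    return np.sign(np.dot(alpha * y_train, k))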

Now suppose we want to use more complex features $\phi(x)$, obtained by transforming the original features $x$ into a higher dimensional space. All we need to do is substitute $x_i^T x_j$ in both training and testing with $\phi(x_i)^T \phi(x_j)$.
Further, notice that $\phi(x)$ always appears in the form of dot products, which means we do not necessarily need to compute it explicitly as long as we have a formula for the dot products. This is where kernels come into use. Instead of defining the function $\phi$ to do the projection, we directly define a kernel function $K$ that computes the dot product of the projected features:

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$

We can create different kernel functions $K(x_i, x_j)$ as long as they correspond to dot products of features in some (possibly higher dimensional) space. We can also construct new valid kernel functions from other valid kernel functions by following certain composition rules. Examples of popular kernel functions include polynomial kernels, Gaussian kernels, and many more.
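
For concreteness, here are sketches of two such kernels; the degree, offset, and bandwidth values are arbitrary illustrative defaults. Either function could be passed as the `kernel` argument in the sketch above in place of the plain dot product.

import numpy as np

def polynomial_kernel(xi, xj, degree=2, c=1.0):
    """Polynomial kernel: (xi . xj + c) ** degree."""
    return (np.dot(xi, xj) + c) ** degree

def gaussian_kernel(xi, xj, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||xi - xj||^2 / (2 * sigma^2))."""
    diff = xi - xj
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))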

References
Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM (extended version). Mathematical Programming, Series B, 127(1):3-30, 2011.
