Lecture 6 Notes
1 Pegasos Algorithm
The Pegasos Algorithm looks very similar to the Perceptron Algorithm. In fact,
just by changing a few lines of code in our Perceptron implementation, we can get
the Pegasos Algorithm.
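To make the comparison concrete, here is a minimal sketch of the Pegasos training loop in Python (our own illustration; it assumes the update rule and the step size η_t = 1/(λt) derived later in these notes, and the function and variable names are made up):

```python
import numpy as np

def pegasos(X, y, lam=0.01, n_iters=1000, seed=0):
    """Minimal Pegasos sketch: stochastic sub-gradient descent on the SVM objective.

    X: (n, d) array of feature vectors; y: (n,) array of labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        j = rng.integers(n)                        # pick one training example uniformly at random
        eta = 1.0 / (lam * t)                      # step size eta_t = 1 / (lambda * t)
        if y[j] * (X[j] @ w) < 1:                  # margin violated (a Perceptron would test <= 0 instead)
            w = (1 - eta * lam) * w + eta * y[j] * X[j]
        else:
            w = (1 - eta * lam) * w                # shrinkage step that the Perceptron does not have
    return w
```

Compared with a Perceptron loop, the only differences are the (1 − η_t λ) shrinkage of w and updating whenever the margin y_j w^T x_j is below 1 rather than only on sign mistakes.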
When the data set is unbalanced, we may want to weight each data point differently by placing more weight on the data points in the
underrepresented categories. We can do this very easily by changing our optimization problem to
$$\min_{w}\;\|w\|_2^2 \;+\; \frac{CN}{2N_+}\sum_{j:\,y_j=+1}\xi_j \;+\; \frac{CN}{2N_-}\sum_{j:\,y_j=-1}\xi_j$$
where N_+, N_− are the number of positive data points and negative data points
respectively, and the ξ_j's are the slack variables.
An intuitive way to think about this is to imagine we want to build a classifier that
decides whether a point is blue or red. If our data set has only 1 data
point labelled red and 10 data points labelled blue, then using the
modified objective function is equivalent to duplicating the 1 red point 10 times,
without explicitly creating more training data.
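As a rough sketch of how this reweighting might be realized in code (our own illustration; the helper name and interface are made up), each example j can simply be assigned the coefficient that multiplies its slack ξ_j in the objective above:

```python
import numpy as np

def per_example_penalties(y, C=1.0):
    """Coefficient on each slack xi_j in the modified objective:
    C*N/(2*N_plus) for positive examples and C*N/(2*N_minus) for negative ones."""
    y = np.asarray(y)
    N = len(y)
    n_pos = np.sum(y == +1)
    n_neg = np.sum(y == -1)
    return np.where(y == +1, C * N / (2 * n_pos), C * N / (2 * n_neg))

# The red/blue example: 1 red (+1) point and 10 blue (-1) points.
y = np.array([+1] + [-1] * 10)
print(per_example_penalties(y))   # red point gets 5.5*C, each blue point gets 0.55*C
```

The 10:1 ratio between the two penalties is exactly the effect of duplicating the single red point 10 times.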
The idea here is that we assume the weights for Rob are going to be very close
to those for David, so we penalize the distance between the two weight vectors. C
here can be interpreted as how confident we are that the weights for Rob will
be similar to the weights for David. If we are very confident (a low C), then we
will really try to minimize the distance between the two weight vectors. If we
are not confident (a large C), then we are more focused on David's labelled data.
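One plausible way to write this down (our reading of the paragraph above; the notes do not give the formula explicitly) is to train a new weight vector w on David's labelled data while penalizing its distance to Rob's already-learned weights w_R:

$$\min_{w,\,b,\,\xi}\;\|w - w_R\|_2^2 + C\sum_j \xi_j \quad \text{s.t.}\quad y_j\,(w^T x_j + b) \ge 1 - \xi_j,\;\; \xi_j \ge 0 \;\text{ for David's data points } (x_j, y_j).$$

With this reading, a small C makes the distance term dominate, while a large C lets David's labelled data dominate, matching the interpretation above.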
Now suppose we have more than two classes, so given a new data point we need to
decide in which class it belongs. One simple solution is to learn one set of weights
w_k, b_k per class k and predict the most probable class using the formula
$$\hat{y} = \arg\max_k \; w_k^T x + b_k$$
Another solution is called Multiclass SVM. Here, we put soft restrictions on
predicting correct labels for the training data using:
$$w^{(y_j)T} x_j + b^{(y_j)} \;\ge\; w^{(y')T} x_j + b^{(y')} + 1 - \xi_j, \quad \forall\, y' \ne y_j, \qquad \xi_j \ge 0, \;\forall j$$
Notice that we have one slack variable ξ_j per data point and one set of weights
w^(k), b^(k) for each class k. We could derive a similar Pegasos Algorithm for a
multiclass classifier.
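As a small sketch (our own, with made-up names), here is the prediction rule together with the smallest slack ξ_j that satisfies the constraints above for a given training point:

```python
import numpy as np

def predict(W, b, x):
    """W: (K, d) matrix whose rows are the per-class weights w^(k); b: (K,) biases.
    Returns argmax_k of w^(k)^T x + b^(k)."""
    return int(np.argmax(W @ x + b))

def slack(W, b, x_j, y_j):
    """Smallest xi_j with w^(y_j)^T x_j + b^(y_j) >= w^(y')^T x_j + b^(y') + 1 - xi_j
    for all y' != y_j (and xi_j >= 0)."""
    scores = W @ x_j + b
    best_other = np.max(np.delete(scores, y_j))
    return max(0.0, best_other + 1.0 - scores[y_j])
```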
3 Kernel Trick
What if the data is not linearly separable? We can create a mapping φ(x)
that takes our feature vector x and converts it into a higher-dimensional space.
Training a linear classifier in this higher-dimensional space and projecting its decision boundary back onto
our original feature space gives us a squiggly-line classifier.
The kernel trick allows us to perform the aforementioned classification with little
extra cost. For the Pegasos algorithm, we can do this by keeping track of just a
single variable per data point, α_i, and calculating the vector w only when required.
$$w = \sum_i \alpha_i y_i x_i$$
Let's now derive the update rule for these α_i's. Notice that in Algorithm 1, the
update rule at each iteration is
$$w_{t+1} = (1 - \eta_t \lambda)\, w_t + \mathbb{1}[y_j w_t^T x_j < 1] \cdot \eta_t y_j x_j$$
where 1[condition] is the indicator function. Now, instead of x_j, y_j, let us use
x^(t), y^(t) to denote the data point randomly selected at iteration t. Substituting
η_t = 1/(λt), we have
$$w_{t+1} = \left(1 - \frac{1}{t}\right) w_t + \mathbb{1}[y^{(t)} w_t^T x^{(t)} < 1] \cdot \frac{1}{\lambda t}\, y^{(t)} x^{(t)}$$
Multiplying both sides by t and rearranging,
$$t\, w_{t+1} - (t-1)\, w_t = \frac{1}{\lambda}\, \mathbb{1}[y^{(t)} w_t^T x^{(t)} < 1] \cdot y^{(t)} x^{(t)}$$
As the above equation holds for any t, we have the following t equations
$$\begin{aligned}
t\, w_{t+1} - (t-1)\, w_t &= \tfrac{1}{\lambda}\, \mathbb{1}[y^{(t)} w_t^T x^{(t)} < 1] \cdot y^{(t)} x^{(t)} \\
(t-1)\, w_t - (t-2)\, w_{t-1} &= \tfrac{1}{\lambda}\, \mathbb{1}[y^{(t-1)} w_{t-1}^T x^{(t-1)} < 1] \cdot y^{(t-1)} x^{(t-1)} \\
&\;\;\vdots \\
w_2 &= \tfrac{1}{\lambda}\, \mathbb{1}[y^{(1)} w_1^T x^{(1)} < 1] \cdot y^{(1)} x^{(1)}
\end{aligned}$$
Summing over the above t equations and dividing both sides by t, we have
$$w_{t+1} = \frac{1}{\lambda t} \sum_{k=1}^{t} \mathbb{1}[y^{(k)} w_k^T x^{(k)} < 1] \cdot y^{(k)} x^{(k)}$$
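Grouping this sum by data point makes the connection to the α_i's explicit (a step the notes leave implicit): if α_i denotes the number of iterations k ≤ t in which point i was the selected point and violated the margin, then

$$w_{t+1} = \frac{1}{\lambda t} \sum_i \alpha_i y_i x_i,$$

which is exactly the form w = Σ_i α_i y_i x_i claimed earlier, with the scalar factor 1/(λt) absorbed into the α_i's. During training we therefore only need to maintain these per-point counts.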
Now suppose we want to use more complex features φ(x), obtained by transforming
the original features x into a higher-dimensional space. All we need to do is
substitute x_i^T x_j in both training and testing with φ(x_i)^T φ(x_j).
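As a rough sketch of how this looks for Pegasos (our own illustration, assuming we precompute a Gram matrix G with G[i, j] = φ(x_i)^T φ(x_j) and keep one count α_i per data point; the names are made up):

```python
import numpy as np

def kernel_pegasos(G, y, lam=0.01, n_iters=1000, seed=0):
    """Kernelized Pegasos sketch.

    G: (n, n) Gram matrix with G[i, j] = phi(x_i)^T phi(x_j); y: (n,) labels in {-1, +1}.
    Returns alpha, where alpha[i] counts the margin violations of point i."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    alpha = np.zeros(n)
    for t in range(1, n_iters + 1):
        j = rng.integers(n)
        # y_j * w^T phi(x_j), with w = (1 / (lam * t)) * sum_i alpha_i y_i phi(x_i):
        margin = y[j] * ((alpha * y) @ G[:, j]) / (lam * t)
        if margin < 1:
            alpha[j] += 1
    return alpha

def score(alpha, y, k_new, lam, T):
    """Decision value for a new point x, where k_new[i] = phi(x_i)^T phi(x)."""
    return ((alpha * y) @ k_new) / (lam * T)
```

Only dot products of the (possibly transformed) feature vectors are ever needed, which is what makes the next observation useful.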
Further notice that φ(x) always appears in the form of dot products, which
indicates we do not necessarily need to compute it explicitly as long as we have
a formula to calculate the dot products. This is where kernels come into use.
Instead of defining the function φ to do the projection, we directly define a
kernel function K that calculates the dot product of the projected features.
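For example (a standard illustration rather than something from these notes), the degree-2 polynomial kernel K(x, z) = (x^T z + 1)^2 computes the dot product of an explicit quadratic feature map without ever constructing it:

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: K(x, z) = (x^T z + 1)^2."""
    return (x @ z + 1.0) ** 2

def phi(x):
    """Explicit feature map for 2-D inputs whose dot product equals poly_kernel."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z))   # 4.0
print(phi(x) @ phi(z))     # 4.0 -- same value, computed without the explicit map
```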
References
Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. Mathematical Programming, Series B, 127(1):3–30, 2011.