
Disclaimer: These slides can include material from different sources. I'll be happy to explicitly acknowledge a source if required. Contact me for requests.

Introduction to Machine Learning


10-315 Fall ‘19

Lecture 15:
Support Vector Machines 2
Teacher:
Gianni A. Di Caro
Recap: SVM (hard-margin) optimization problem, linearly separable

Quadratic (convex) optimization problem with $m$ linear inequality constraints:

$$\min_{\boldsymbol{w},\,b}\ \frac{\lVert \boldsymbol{w} \rVert^2}{2} \qquad \text{s.t.}\ \ y^{(i)}\!\left(\boldsymbol{w}^{\top}\boldsymbol{x}^{(i)} + b\right) \ge 1,\quad i = 1,\cdots,m$$

Equivalently:

$$\min_{\boldsymbol{w},\,b}\ \frac{\boldsymbol{w}^{\top}\boldsymbol{w}}{2} \qquad \text{s.t.}\ \ y^{(i)}\!\left(\boldsymbol{w}^{\top}\boldsymbol{x}^{(i)} + b\right) \ge 1,\quad i = 1,\cdots,m$$

[Plot: linearly separable points in the $(x_1, x_2)$ plane with the max-margin separator.]
Recap: What if data is still not linearly separable? → Slack variables

- Allow errors in classification:

$$\min_{\boldsymbol{w},\,b,\,\boldsymbol{\xi}}\ \boldsymbol{w}^{\top}\boldsymbol{w} + p \sum_{j=1}^{m} \xi_j$$
$$\text{s.t.}\ \ y^{(i)}\!\left(\boldsymbol{w}^{\top}\boldsymbol{x}^{(i)} + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1,\cdots,m$$

- $\xi_i$ = slack variable, gets a value $> 1$ if $\boldsymbol{x}^{(i)}$ is misclassified
- $0 < \xi_i < 1$ if $\boldsymbol{x}^{(i)}$ is correctly classified but within the margin
- $\xi_i = 0$ if $\boldsymbol{x}^{(i)}$ is correctly classified
- Soft margins: some examples are within the margin zone, some examples are misclassified
- Still a convex, quadratic programming problem
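To make the three slack cases concrete, here is a minimal pure-Python sketch; the separator $(\boldsymbol{w}, b)$ and the three points below are made up for illustration:

```python
def slack(x, y, w, b):
    """Minimal feasible slack: xi = max(0, 1 - y * (w^T x + b))."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

w, b = [1.0, 0.0], 0.0          # hypothetical separator: the line x1 = 0

# correctly classified, outside the margin -> xi = 0
print(slack([2.0, 1.0], +1, w, b))   # 0.0
# correctly classified but inside the margin -> 0 < xi < 1
print(slack([0.5, 1.0], +1, w, b))   # 0.5
# misclassified -> xi > 1
print(slack([-1.0, 1.0], +1, w, b))  # 2.0
```

Each printed value matches one of the three bullet cases above.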
Recap: What if data is still not linearly separable? → Slack variables

- $\xi_i > 1$ if $\boldsymbol{x}^{(i)}$ is misclassified
- $0 < \xi_i < 1$ if $\boldsymbol{x}^{(i)}$ is correctly classified, within margin
- $\xi_i = 0$ if $\boldsymbol{x}^{(i)}$ is correctly classified
- $\xi_i$ is linearly proportional to the distance from the class margin if $\boldsymbol{x}^{(i)}$ is misclassified or within the margins
- In the objective we pay a linearly proportional penalty for mistakes, and a small one for being inside the margin
- $p$ = penalty per unit of distance mistake; trade-off parameter between hard and soft objectives (usually set by cross-validation, e.g. $p = 1/d$)

[Plot: margin boundaries $\boldsymbol{w}^{\top}\boldsymbol{x} + b = 1$ and $\boldsymbol{w}^{\top}\boldsymbol{x} + b = -1$ with slack vectors.]
Recap: Soft-margin SVM

Soften the constraints:
$$y^{(i)}\!\left(\boldsymbol{w}^{\top}\boldsymbol{x}^{(i)} + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1,\cdots,m$$

Penalty for misclassifying $\boldsymbol{x}^{(j)}$ or taking it inside the margin: $p\,\xi_j$

$$\min_{\boldsymbol{w},\,b,\,\boldsymbol{\xi}}\ \boldsymbol{w}^{\top}\boldsymbol{w} + p \sum_{j=1}^{m} \xi_j$$

How do we recover hard-margin SVM? Set $p = \infty$.

[Plot: margin boundaries $\boldsymbol{w}^{\top}\boldsymbol{x} + b = \pm 1$.]
Recap: Support Vectors in soft-margin SVM

- Margin support vectors: $\xi_i = 0$, $y^{(i)}\!\left(\boldsymbol{w}^{\top}\boldsymbol{x}^{(i)} + b\right) = 1$
  - Don't contribute to the objective but enforce constraints on the solution
  - Correctly classified, lying exactly on the margin
- Non-margin support vectors: $\xi_i > 0$
  - Contribute to both objective and constraints
  - $0 < \xi_i < 1$: correctly classified but inside the margin
  - $\xi_i > 1$: incorrectly classified

[Plot: margin boundaries $\boldsymbol{w}^{\top}\boldsymbol{x} + b = \pm 1$ with both kinds of support vectors.]
Soft-margin SVM: Hinge loss

Notice that
$$\xi_i(\text{margin}) = \max\!\left(1 - \left(\boldsymbol{w}^{\top}\boldsymbol{x}^{(i)} + b\right) y^{(i)},\ 0\right)$$

[Plot: hinge loss and 0-1 loss as functions of $\left(\boldsymbol{w}^{\top}\boldsymbol{x}^{(i)} + b\right) y^{(i)}$.]
Loss Functions
Hinge Loss Function (SVMs)

- Intuition: the hinge loss upper-bounds the 0-1 loss and has a non-trivial gradient
- Loss = 0 only if the margin is at least 1
- Try to increase the margin if it is less than 1: max-margin classifier
- Hinge loss optimization problem:
$$\min_{\boldsymbol{w}}\ \sum_{i=1}^{m} \max\!\left(0,\ 1 - y^{(i)}\!\left(\boldsymbol{w}^{\top}\boldsymbol{x}^{(i)} + b\right)\right)$$
- Hinge loss regularized optimization problem:
$$\min_{\boldsymbol{w}}\ \sum_{i=1}^{m} \max\!\left(0,\ 1 - y^{(i)}\!\left(\boldsymbol{w}^{\top}\boldsymbol{x}^{(i)} + b\right)\right) + \lambda \sum_{j=1}^{d} w_j^2$$

[Plot: hinge loss vs. 0-1 loss.]
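A quick numeric check of these properties, as a sketch; the sample margin values are arbitrary:

```python
def hinge(z):
    """Hinge loss as a function of the margin z = y * (w^T x + b)."""
    return max(0.0, 1.0 - z)

def zero_one(z):
    """0-1 loss, with the convention that z = 0 counts as an error."""
    return 0.0 if z > 0 else 1.0

# hinge upper-bounds the 0-1 loss at every margin value
for z in [-2.0, -0.5, 0.0, 0.5, 1.0, 3.0]:
    assert hinge(z) >= zero_one(z)

print(hinge(0.5))   # 0.5 -> correctly classified but inside the margin, still penalized
print(hinge(1.0))   # 0.0 -> loss vanishes only once the margin reaches 1
print(hinge(-2.0))  # 3.0 -> misclassified, penalty grows linearly with the distance
```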
Hinge Loss Function

$$\min_{\boldsymbol{w}}\ \sum_{i=1}^{m} \max\!\left(0,\ 1 - y^{(i)}\!\left(\boldsymbol{w}^{\top}\boldsymbol{x}^{(i)} + b\right)\right) + \lambda \sum_{j=1}^{d} w_j^2$$

An error signal is triggered even when there is no classification error (margin > 0), pushing the algorithm to keep learning until a margin of at least 1 is achieved for all data.

The role of the regularization term is to keep the hypothesis simple by avoiding large values for the learned parameters, favoring generalization; this translates to maximizing the margin!
Large margins ⇒ good generalization.

Geometric interpretation: zero loss needs a margin $\boldsymbol{w}^{\top}\boldsymbol{x}\,y \ge 1$; in terms of the geometric margin this means $\lVert \boldsymbol{w} \rVert^{-1}\, \boldsymbol{w}^{\top}\boldsymbol{x}\,y \ge \lVert \boldsymbol{w} \rVert^{-1}$ ⟹ keeping $\lVert \boldsymbol{w} \rVert$ small increases the geometric margin.
Hinge loss optimization ↔ Soft-margin SVM

$$\xi_i = \max\!\left(0,\ 1 - y^{(i)}\!\left(\boldsymbol{w}^{\top}\boldsymbol{x}^{(i)} + b\right)\right)$$

Soft-margin SVM:
$$\min_{\boldsymbol{w},\,b,\,\boldsymbol{\xi}}\ \boldsymbol{w}^{\top}\boldsymbol{w} + p \sum_{j=1}^{m} \xi_j$$
$$\text{s.t.}\ \ y^{(i)}\!\left(\boldsymbol{w}^{\top}\boldsymbol{x}^{(i)} + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1,\cdots,m$$

Regularized hinge loss optimization:
$$\min_{\boldsymbol{w}}\ \sum_{i=1}^{m} \max\!\left(0,\ 1 - y^{(i)}\!\left(\boldsymbol{w}^{\top}\boldsymbol{x}^{(i)} + b\right)\right) + \lambda \sum_{j=1}^{d} w_j^2$$
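The equivalence can be checked numerically: at a fixed $(\boldsymbol{w}, b)$, the smallest feasible slack for each point is exactly its hinge loss, and dividing the soft-margin objective by $p$ matches the regularized hinge objective under the identification $\lambda = 1/p$. That identification is my reading of these particular (unscaled) objectives, not stated on the slide, and the data and parameters below are made up:

```python
def hinge(z):
    return max(0.0, 1.0 - z)

X = [[1.0, 2.0], [-0.5, 0.0], [3.0, 1.0]]   # made-up training points
Y = [1, -1, 1]
w, b, p = [0.5, -1.0], 0.25, 4.0            # made-up parameters

margins = [y * (w[0]*x[0] + w[1]*x[1] + b) for x, y in zip(X, Y)]

# soft-margin SVM: at fixed (w, b) the optimal slack is exactly the hinge loss
xis = [hinge(z) for z in margins]
soft = (w[0]**2 + w[1]**2) + p * sum(xis)

# regularized hinge objective with lam = 1/p
lam = 1.0 / p
reg = sum(hinge(z) for z in margins) + lam * (w[0]**2 + w[1]**2)

assert abs(soft / p - reg) < 1e-12   # same objective, up to the 1/p scaling
```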
SVM vs. Logistic Regression

- SVM: hinge loss
- Logistic regression: log loss (negative log conditional likelihood)

[Plot: log loss, hinge loss, and 0-1 loss as functions of the margin.]
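To compare the two surrogate losses numerically (the margin values are arbitrary, and the log loss is written with the natural logarithm, as in the usual logistic-regression likelihood):

```python
import math

def hinge(z):
    return max(0.0, 1.0 - z)

def log_loss(z):
    # negative log conditional likelihood of the correct class,
    # as a function of the margin z = y * (w^T x + b)
    return math.log(1.0 + math.exp(-z))

for z in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"z={z:+.1f}  hinge={hinge(z):.3f}  log={log_loss(z):.3f}")

# hinge is exactly zero past margin 1; log loss stays positive for any
# finite margin and only vanishes asymptotically
assert hinge(3.0) == 0.0
assert log_loss(3.0) > 0.0
```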
SVM – Linearly separable case

$n$ training points $(\boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_n)$, $d$ features: $\boldsymbol{x}_i$ is a $d$-dimensional vector.

- Primal problem: $\boldsymbol{w}$ – weights on features (a $d$-dimensional problem); separating hyperplane $\boldsymbol{w}^{\top}\boldsymbol{x} + b = 0$
- Convex quadratic program: quadratic objective, linear constraints
- But expensive to solve if $d$ and $n$ are very large
- Often solved in dual form (an $n$-dimensional problem)
Constrained Optimization and Constraint Activation

Toy problem (the slide's equations were lost in extraction; reconstructed from the solution worked out on the later slides):
$$\min_{x}\ x^2 \qquad \text{s.t.}\ \ x \ge b$$
whose solution is $x^* = \max(b, 0)$:
- if $b \le 0$, the unconstrained minimum $x = 0$ is feasible: the constraint is inactive
- if $b > 0$, the constraint is active and tight: $x^* = b$
Constrained Optimization – Dual Problem

Primal problem, constrained (with $b$ positive):
$$\min_{x}\ x^2 \qquad \text{s.t.}\ \ x \ge b$$

Moving the constraint to the objective function gives the Lagrangian function:
$$L(x, \alpha) = x^2 - \alpha\,(x - b)$$
where $\alpha \ge 0$ is the Lagrange multiplier: the price to pay per unit of constraint violation.

Dual problem (inner problem unconstrained in $x$):
$$\max_{\alpha \ge 0}\ \min_{x}\ L(x, \alpha)$$

- $\alpha = 0$: constraint is inactive
- $\alpha > 0$: constraint is active
Dual problem ↔ Relaxation

- Given an optimization problem $P$ (primal), a relaxation $RP$ of $P$ is a derived problem that removes or aggregates constraints, and/or extends the range of the variables (e.g., passing from $x \in \{0,1\}$ to $x \in [0,1]$), and/or changes the objective function.
- The overall aim is to define a problem that is (hopefully) easier to solve than $P$.
- By solving the (easier) problem $RP$ we can obtain bounds on the primal's solution; in certain special cases (which include convex programming), the solution of the relaxed problem can be directly related to that of the primal problem, such that solving the relaxed problem provides the solution to the primal.
- A generic primal $P$ (min) can include both multiple inequality constraints and equality constraints. The scalar value of the objective for an assignment to the decision variables $\boldsymbol{x}$ is denoted $Z_P$:

$$\min_{\boldsymbol{x}}\ Z_P = f(\boldsymbol{x}) \quad \text{s.t.}\ \ g_i(\boldsymbol{x}) \le b_i,\ i = 1,\cdots,m; \quad h_j(\boldsymbol{x}) = d_j,\ j = 1,\cdots,n; \quad \boldsymbol{x} \in X \subseteq \mathbb{R}^d$$

- $V(P)$ = feasibility region for $P$, from the intersection of all inequality and equality constraints:
$$V(P) = \left\{ \boldsymbol{x} \in X \subseteq \mathbb{R}^d \;:\; g_i(\boldsymbol{x}) \le b_i,\ h_j(\boldsymbol{x}) = d_j,\ \ i = 1,\cdots,m,\ j = 1,\cdots,n \right\}$$
Dual problem ↔ Relaxation

Primal problem $P$:
$$\min_{\boldsymbol{x}}\ Z_P = f(\boldsymbol{x}) \qquad \text{s.t.}\ \ \boldsymbol{x} \in V(P)$$

Optimization problem $RP$ (possibly derived from $P$):
$$\min_{\boldsymbol{x}}\ Z_{RP} = \Phi(\boldsymbol{x}) \qquad \text{s.t.}\ \ \boldsymbol{x} \in V(RP)$$

$RP$ is a relaxation of the primal problem $P$ if:
1. $V(RP) \supseteq V(P)$: the feasibility region of the relaxed problem fully includes that of the primal
2. $\Phi(\boldsymbol{x}) \le f(\boldsymbol{x})\ \ \forall\, \boldsymbol{x} \in V(P)$: the objective function of $RP$ is always below $f(\boldsymbol{x})$ (always above for a max problem)
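Both conditions can be checked on a hypothetical toy problem of my own (not the slide's): take $P$ as $\min_x x^2$ s.t. $x \ge 1$, and as $RP$ its Lagrangian relaxation with a fixed multiplier $\lambda \ge 0$:

```python
def f(x):
    return x * x                       # primal objective of: min x^2  s.t.  x >= 1

def phi(x, lam):
    return x * x + lam * (1.0 - x)     # relaxed objective, multiplier lam >= 0

# condition 1: V(RP) is all reals, which contains V(P) = [1, inf)  (by construction)
# condition 2: phi(x, lam) <= f(x) for every x in V(P), since (1 - x) <= 0 there
for lam in [0.0, 0.5, 2.0, 10.0]:
    for x in [1.0, 1.5, 3.0, 10.0]:
        assert phi(x, lam) <= f(x)

# consequence: min_x phi = lam - lam**2/4 (attained at x = lam/2)
# lower-bounds the primal optimum p* = f(1) = 1, for every lam >= 0
for lam in [0.0, 0.5, 2.0, 10.0]:
    assert lam - lam**2 / 4.0 <= 1.0 + 1e-12
```

The bound happens to be tight at $\lambda = 2$, anticipating the convex case discussed below.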
Lagrangian relaxation (Lagrangian dual problem)

Primal problem $P$:
$$\min_{\boldsymbol{x}}\ Z_P = f(\boldsymbol{x}) \quad \text{s.t.}\ \ g_i(\boldsymbol{x}) \le b_i,\ i = 1,\cdots,m; \quad h_j(\boldsymbol{x}) = d_j,\ j = 1,\cdots,n; \quad \boldsymbol{x} \in X \subseteq \mathbb{R}^d$$

Lagrangian relaxation of $P$:
$$\min_{\boldsymbol{x}}\ Z_{RP} = \Phi(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}), \qquad \boldsymbol{x} \in X \subseteq \mathbb{R}^d, \quad \boldsymbol{\lambda} \ge \boldsymbol{0}_m, \quad \boldsymbol{\mu} \in \mathbb{R}^n$$

where $\Phi(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\mu})$ adds to the primal's $f(\boldsymbol{x})$ a linear combination of all (or a subset of) the constraints, weighted by the multipliers $\boldsymbol{\lambda}, \boldsymbol{\mu}$ (each constraint gets its own multiplier):
$$\Phi(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\boldsymbol{x}) + \sum_{i=1}^{m} \lambda_i \left( g_i(\boldsymbol{x}) - b_i \right) + \sum_{j=1}^{n} \mu_j \left( h_j(\boldsymbol{x}) - d_j \right)$$

- It's a relaxation: an easier problem to solve compared to the original
- The primal is constrained; the Lagrangian relaxation is unconstrained
Lagrangian relaxation

For a general primal, the optimal solution of the Lagrangian relaxation is either not feasible for the primal and/or does not correspond to $P$'s optimum.

- The choice of the multipliers also affects the gap.
Lagrangian relaxation

- Each multiplier weights the importance of the related constraint in the objective.
- Moving the $j$-th constraint to the objective, we potentially allow a violation of the constraint, "paying" $\lambda_j$ for each unit of constraint violation (the more the solution $\boldsymbol{x}$ violates the constraint, the more we pay).

Primal:
$$\min_{x_1, x_2}\ Z_P = 24 x_1 + 14 x_2 \quad \text{s.t.}\ \ 3x_1 + x_2 \ge 12, \quad 4x_1 + x_2 \le 10, \quad 2x_1 + x_2 \ge 7, \quad x_1, x_2 \in \mathbb{R}_+$$

Lagrangian relaxation:
$$\min_{x_1, x_2;\, \boldsymbol{\lambda}}\ Z_L = 24 x_1 + 14 x_2 - \lambda_1 (3x_1 + x_2 - 12) + \lambda_2 (4x_1 + x_2 - 10) - \lambda_3 (2x_1 + x_2 - 7), \quad x_1, x_2 \in \mathbb{R}_+,\ \ \boldsymbol{\lambda} \ge \boldsymbol{0}$$

- If $\lambda_1 = 8$ and $(x_1, x_2) = (3, 1)$, the first constraint gives $C_1 = 3x_1 + x_2 - 12 = -2 < 0$, meaning that the constraint is violated by 2 units. Therefore, a penalty $(-8) \cdot (-2) = 16$ has been paid in the objective.
- The $\lambda_1$ penalty has a $-$ sign in front of it because we want to add something to $Z_L$ if a violation of the $\ge$ constraint occurs. $\lambda_2$ has a $+$ sign because a violation of the second constraint, $\le$, generates a positive value.
- In a min (max) problem, we add (subtract) violation units to the objective, to search for a solution with minimal (or zero, if possible) violations!
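The penalty arithmetic on this slide can be verified directly:

```python
# the slide's assignment: multipliers lam = (8, 0, 0), point (x1, x2) = (3, 1)
x1, x2 = 3.0, 1.0
lam1, lam2, lam3 = 8.0, 0.0, 0.0

C1 = 3*x1 + x2 - 12      # first constraint requires 3*x1 + x2 >= 12, i.e. C1 >= 0
assert C1 == -2.0        # violated by 2 units

penalty = -lam1 * C1     # the minus sign in Z_L turns the violation into a cost
assert penalty == 16.0   # the 16 units paid in the objective

Z_P = 24*x1 + 14*x2                                   # raw objective: 86
Z_L = Z_P - lam1*C1 + lam2*(4*x1 + x2 - 10) - lam3*(2*x1 + x2 - 7)
assert Z_L == 86.0 + 16.0                             # relaxed objective: 102
```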
Lagrangian relaxation

- Central question: what is the right value to assign to the multiplier weights?
- What is the right price to pay for a violation of a constraint?
  - Look at the multipliers as constant parameters and assign a weight value based on the importance of each constraint
  - Look at the multipliers as variables and find the best assignment to the multipliers (the best price)

By construction of a relaxation, for any assignment of values to the multipliers, the solution of the Lagrangian relaxation is a lower bound for the solution of a primal minimization problem (or an upper bound, if the primal is a maximization problem). This is the general case.
Lagrangian relaxation: Convex problems

For convex programs, it is possible to find an assignment such that the lower bound is tight and the solution of the Lagrangian relaxation (which defines a concave problem in $\boldsymbol{\lambda}$) is the same as that of the primal problem!

[Left plot: with a generic assignment of values to the multipliers, $\Phi(x; \lambda)$ lies below $f(x)$ and $z_L < z_P$. Right plot: with the best assignment (the solution of a new optimization problem over the multipliers), $z_P = z_L$.]
When, in general, $Z_P \equiv Z_{RP}$: Complementarity conditions

If the solution of the relaxed problem, $\boldsymbol{x}^*_{\boldsymbol{\lambda},\boldsymbol{\mu}}$, is feasible in $V(P)$ (i.e., it's also a feasible solution point for the primal) and the following conditions, termed complementarity conditions, hold, then $\boldsymbol{x}^*_{\boldsymbol{\lambda},\boldsymbol{\mu}}$ is also $P$'s optimum:

$$\sum_{i=1}^{m} \lambda_i \left( g_i(\boldsymbol{x}^*) - b_i \right) = 0$$

(note that for the equality constraints the conditions above automatically hold)

[Plots: one scenario where the complementarity conditions are satisfied, one where they are not.]
Complementarity conditions and Lagrange multipliers

$$\sum_{i=1}^{m} \lambda_i \left( g_i(\boldsymbol{x}^*) - b_i \right) = 0 \qquad \text{(complementarity conditions)}$$

- Therefore, for the solution to be the same between a convex primal and its Lagrangian relaxation, the complementarity conditions must hold, implying that, at the optimum, for the $i$-th constraint either:
  - $\lambda_i = 0$, or
  - the constraint is active: $g_i(\boldsymbol{x}^*) - b_i = 0$
- Another way of saying the same thing is that, at the optimum:
  - all active constraints have a multiplier value $\lambda_i > 0$
  - all constraints that are not active are associated with a multiplier of value zero

Rationale: if a constraint $i$ is active (i.e., it is satisfied with equality) at the optimal solution, then its potential violation should have a price $\lambda_i > 0$, since a small change in the constraint would change the value of the objective and the solution. Instead, if a constraint $j$ is not active, a small change in the constraint would not change the solution or the objective: $j$ is practically irrelevant for finding the solution.
Let's go back to our specific case: Connection between Primal and Dual

Primal problem: $p^* = \min_{x \ge b} x^2$. Dual problem: $d^* = \max_{\alpha \ge 0} d(\alpha)$, with $d(\alpha) = \min_x L(x, \alpha)$.

- Weak duality: the dual solution $d^*$ lower-bounds the primal solution $p^*$, i.e. $d^* \le p^*$.

To see this, recall $L(x, \alpha) = x^2 - \alpha (x - b)$. For every feasible $x$ (i.e. $x \ge b$) and feasible $\alpha$ (i.e. $\alpha \ge 0$), notice that
$$d(\alpha) = \min_{x'} L(x', \alpha) \;\le\; L(x, \alpha) = x^2 - \alpha (x - b) \;\le\; x^2$$
so $d(\alpha) \le p^*$ for every feasible $\alpha$, hence $d^* \le p^*$.

- The dual problem (a maximization) is always concave even if the primal is not convex: $d(\alpha)$ is a pointwise minimum of functions linear in $\alpha$, hence a concave function.
Connection between Primal and Dual

Primal problem: $p^* = \min_{x \ge b} x^2$. Dual problem: $d^* = \max_{\alpha \ge 0} d(\alpha)$.

- Weak duality: the dual solution $d^*$ lower-bounds the primal solution $p^*$, i.e. $d^* \le p^*$.
- Strong duality: $d^* = p^*$ holds for many problems of interest, e.g. if the primal is a feasible convex objective with linear constraints.
Connection between Primal and Dual

What does strong duality say about $\alpha^*$ (the $\alpha$ that achieves the optimal value of the dual) and $x^*$ (the $x$ that achieves the optimal value of the primal problem)?

Whenever strong duality holds, the following conditions (known as KKT conditions) are true for $\alpha^*$ and $x^*$:

1. $\nabla L(x^*, \alpha^*) = 0$, i.e. the gradient of the Lagrangian at $x^*$ and $\alpha^*$ is zero
2. $x^* \ge b$, i.e. $x^*$ is primal feasible
3. $\alpha^* \ge 0$, i.e. $\alpha^*$ is dual feasible
4. $\alpha^* (x^* - b) = 0$ (called complementary slackness)

We use the first one to relate $x^*$ and $\alpha^*$. We use the last one (complementary slackness) to argue that $\alpha^* = 0$ if the constraint is inactive and $\alpha^* > 0$ if the constraint is active and tight.
Solving the dual

The optimization over $x$ is unconstrained:
$$\frac{\partial L}{\partial x} = 2x - \alpha = 0 \ \Longrightarrow\ x^* = \frac{\alpha}{2}, \qquad d(\alpha) = L(x^*, \alpha) = \alpha b - \frac{\alpha^2}{4}$$

Now we need to maximize $L(x^*, \alpha)$ over $\alpha \ge 0$: solve the unconstrained problem to get $\alpha'$ and then take $\max(\alpha', 0)$:
$$\frac{\partial d}{\partial \alpha} = b - \frac{\alpha}{2} = 0 \ \Longrightarrow\ \alpha' = 2b, \qquad \alpha^* = \max(2b, 0)$$

- $\alpha^* = 0$: constraint is inactive ($b \le 0$)
- $\alpha^* > 0$: constraint is active and tight ($b > 0$)
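The whole toy derivation, including the complementarity behavior of $\alpha^*$, fits in a few lines of Python (a sketch of the slide's example):

```python
def solve_dual(b):
    """Solve min x**2 s.t. x >= b via its dual max_{a >= 0} d(a) = a*b - a**2/4."""
    # unconstrained maximizer: d'(a) = b - a/2 = 0  ->  a' = 2b, then clip at 0
    a_star = max(2.0 * b, 0.0)
    # stationarity of L(x, a) = x**2 - a*(x - b): 2x - a = 0  ->  x = a/2;
    # with a* = 0 this is just the unconstrained minimizer x = 0
    x_star = a_star / 2.0
    return x_star, a_star

# b > 0: constraint active and tight, a* > 0, and x* = max(b, 0) = b
x, a = solve_dual(3.0)
assert (x, a) == (3.0, 6.0)
assert a * (x - 3.0) == 0.0            # complementary slackness holds

# b < 0: constraint inactive, a* = 0, x* = 0
x, a = solve_dual(-1.0)
assert (x, a) == (0.0, 0.0)

# strong duality at b = 3: d(a*) = 6*3 - 36/4 = 9 = p* = (x*)**2
assert 6.0 * 3.0 - 6.0**2 / 4.0 == 3.0**2
```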
Dual SVM – Linearly separable case

$n$ training points $(\boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_n)$, $d$ features: $\boldsymbol{x}_i$ is a $d$-dimensional vector.

- Primal problem ($\boldsymbol{w}$ – weights on features, a $d$-dimensional problem):
$$\min_{\boldsymbol{w},\,b}\ \frac{\boldsymbol{w}^{\top}\boldsymbol{w}}{2} \qquad \text{s.t.}\ \ y_i\!\left(\boldsymbol{w}^{\top}\boldsymbol{x}_i + b\right) \ge 1, \quad i = 1,\cdots,n$$
- Dual problem, derived from the Lagrangian ($\boldsymbol{\alpha}$ – weights on training points, an $n$-dimensional problem):
$$L(\boldsymbol{w}, b, \boldsymbol{\alpha}) = \frac{\boldsymbol{w}^{\top}\boldsymbol{w}}{2} - \sum_{i=1}^{n} \alpha_i \left[ y_i\!\left(\boldsymbol{w}^{\top}\boldsymbol{x}_i + b\right) - 1 \right], \qquad \alpha_i \ge 0$$
Dual SVM – linearly separable case

$$\max_{\boldsymbol{\alpha} \ge 0}\ \min_{\boldsymbol{w},\,b}\ L(\boldsymbol{w}, b, \boldsymbol{\alpha})$$

The inner problem $\min_{\boldsymbol{w},\,b} L$ is unconstrained. If we can solve for the $\alpha$'s (dual problem), then we have a solution for $\boldsymbol{w}, b$ (primal problem).
Dual SVM – linearly separable case

Solving $\min_{\boldsymbol{w},\,b} L$ by setting the gradients to zero:
$$\frac{\partial L}{\partial \boldsymbol{w}} = 0 \ \Longrightarrow\ \boldsymbol{w} = \sum_{i=1}^{n} \alpha_i y_i \boldsymbol{x}_i, \qquad \frac{\partial L}{\partial b} = 0 \ \Longrightarrow\ \sum_{i=1}^{n} \alpha_i y_i = 0$$

Substituting the solutions from the min part into $L$:
$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, \boldsymbol{x}_i^{\top} \boldsymbol{x}_j \qquad \text{s.t.}\ \ \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

- The dual problem is also a QP
- The solution gives the $\alpha_j$'s
- What about $b$? It does not appear in the dual equations; it is recovered afterwards from the support vectors.
Dual SVM: Sparsity of dual solution

Only a few $\alpha_j$'s can be non-zero: those where the constraint is active and tight,
$$\left(\boldsymbol{w}^{\top}\boldsymbol{x}_j + b\right) y_j = 1$$

Support vectors – training points $\boldsymbol{x}_j$ whose $\alpha_j$ is non-zero.

[Plot: points with $\alpha_j = 0$ away from the margin; points with $\alpha_j > 0$ on the margin boundaries, with the separator $\boldsymbol{w}^{\top}\boldsymbol{x} + b = 0$.]
Dual SVM – Linearly separable case

- The dual problem is also a QP
- The solution gives the $\alpha_j$'s
- Use the support vectors, those with $\alpha_j > 0$, to compute $b$: their constraint is tight, $\left(\boldsymbol{w}^{\top}\boldsymbol{x}_j + b\right) y_j = 1$
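On a tiny dataset the dual can be solved by hand and the primal recovered, as a check of the recipe above (the two 1-D training points are made up):

```python
# two 1-D training points: x = 2 with y = +1 and x = 0 with y = -1
X = [2.0, 0.0]
Y = [1.0, -1.0]

# the constraint sum_i a_i y_i = 0 forces a_1 = a_2 = a, and the dual objective
# reduces to D(a) = 2a - 2a**2, maximized where D'(a) = 2 - 4a = 0  ->  a = 1/2
a = 0.5
alphas = [a, a]
assert sum(ai * yi for ai, yi in zip(alphas, Y)) == 0.0

# recover the primal weight: w = sum_i a_i y_i x_i
w = sum(ai * yi * xi for ai, yi, xi in zip(alphas, Y, X))
assert w == 1.0

# recover b from a support vector (alpha_j > 0 -> tight constraint y_j*(w*x_j + b) = 1)
b = 1.0 / Y[0] - w * X[0]
assert b == -1.0

# both points are support vectors: they sit exactly on the margin boundaries
for x, y in zip(X, Y):
    assert y * (w * x + b) == 1.0

# strong duality check: dual optimum D(1/2) equals primal optimum w**2 / 2
assert 2*a - 2*a**2 == 0.5 * w * w
```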
Dual SVM – Non-separable case

- Primal problem:
$$\min_{\boldsymbol{w},\,b,\,\{\xi_j\}}\ \frac{\boldsymbol{w}^{\top}\boldsymbol{w}}{2} + C \sum_{j} \xi_j \qquad \text{s.t.}\ \ y_j\!\left(\boldsymbol{w}^{\top}\boldsymbol{x}_j + b\right) \ge 1 - \xi_j, \quad \xi_j \ge 0$$
- Dual problem, with Lagrange multipliers $\boldsymbol{\alpha} \ge 0$ and $\boldsymbol{\mu} \ge 0$ (one pair per training point):
$$\max_{\boldsymbol{\alpha},\,\boldsymbol{\mu}}\ \min_{\boldsymbol{w},\,b,\,\{\xi_j\}}\ L(\boldsymbol{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\mu})$$
Dual SVM – Non-separable case

$$\max_{\boldsymbol{\alpha}}\ \sum_{i} \alpha_i - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j\, \boldsymbol{x}_i^{\top} \boldsymbol{x}_j \qquad \text{s.t.}\ \ 0 \le \alpha_j \le C, \quad \sum_{j} \alpha_j y_j = 0$$

The upper bound $\alpha_j \le C$ comes from
$$\frac{\partial L}{\partial \xi_j} = 0 \ \Longrightarrow\ C - \alpha_j - \mu_j = 0, \quad \mu_j \ge 0 \ \Longrightarrow\ \alpha_j \le C$$

Intuition: if $C \to \infty$, we recover the hard-margin SVM.

- The dual problem is also a QP
- The solution gives the $\alpha_j$'s
