Lecture 15: Support Vector Machines 2
Teacher: Gianni A. Di Caro
Recap: SVM (hard-margin) optimization problem, linearly separable
Quadratic (convex) optimization problem with $m$ linear inequality constraints:

$\min_{\mathbf{w},\,b}\ \frac{\mathbf{w}^\top\mathbf{w}}{2} \;=\; \frac{\|\mathbf{w}\|^2}{2}$
$\text{s.t.}\ \ y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big) \ge 1,\qquad i = 1,\dots,m$

[Figure: maximum-margin separating hyperplane in the $(x_1, x_2)$ plane]
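To make the recap concrete, here is a minimal sketch (not from the slides) that hands this exact QP to a generic convex solver; it assumes the cvxpy library and a made-up four-point dataset.

import numpy as np
import cvxpy as cp

# hypothetical linearly separable toy data (2 positives, 2 negatives)
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.5], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))       # (1/2) w^T w
constraints = [cp.multiply(y, X @ w + b) >= 1]          # y_i (w^T x_i + b) >= 1
cp.Problem(objective, constraints).solve()
print("w* =", w.value, " b* =", b.value)

The resulting geometric margin on each side of the hyperplane is $1/\|\mathbf{w}^*\|$.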
Recap: What if data is still not linearly separable? → Slack variables
• Allow errors in classification:

$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \frac{\mathbf{w}^\top\mathbf{w}}{2} + C \sum_{j=1}^{m} \xi_j$
$\text{s.t.}\ \ y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big) \ge 1 - \xi_i,\qquad i = 1,\dots,m$
$\qquad\ \xi_i \ge 0,\qquad i = 1,\dots,m$
Recap: What if data is still not linearly separable? → Slack variables
o $\xi_i > 1$ if $\mathbf{x}^{(i)}$ is misclassified
o $0 < \xi_i < 1$ if $\mathbf{x}^{(i)}$ is correctly classified, but falls within the margin
o $\xi_i = 0$ if $\mathbf{x}^{(i)}$ is correctly classified and outside the margin
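A small numeric illustration of these three cases (my own example, with a fixed hypothetical hyperplane): at the optimum the slack equals $\max\big(0, 1 - y(\mathbf{w}^\top\mathbf{x} + b)\big)$, as the later hinge-loss slides show.

import numpy as np

# hypothetical fixed hyperplane and three positive points
w, b = np.array([1.0, 1.0]), -3.0
X = np.array([[3.0, 2.0],    # far on the correct side
              [2.2, 1.0],    # correct side, but inside the margin
              [1.0, 0.5]])   # wrong side (misclassified)
y = np.array([1.0, 1.0, 1.0])

xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)   # [0.  0.8  2.5]  ->  xi = 0,  0 < xi < 1,  xi > 1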
Recap: Soft-margin SVM
Soften the constraints:
$y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,m$

Penalty for misclassifying $\mathbf{x}^{(j)}$, or for placing it inside the margin: $C\,\xi_j$

$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \frac{\mathbf{w}^\top\mathbf{w}}{2} + C \sum_{j=1}^{m} \xi_j$

[Figure: separating hyperplane with margin boundaries $\mathbf{w}^\top\mathbf{x} + b = \pm 1$ and slack variables for the violating points]

Setting $C = \infty$ (an infinite penalty on any slack) recovers the hard-margin SVM.
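The same solver sketch as before, extended with the slack variables; again this is illustrative only (it assumes cvxpy and made-up data with one point on the wrong side), not material from the slides.

import numpy as np
import cvxpy as cp

# hypothetical data that is not linearly separable (the last point is an outlier)
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.5], [-1.0, 0.0], [1.8, 1.8]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
C = 1.0                                    # slack penalty

w, b = cp.Variable(X.shape[1]), cp.Variable()
xi = cp.Variable(len(y), nonneg=True)      # xi_i >= 0
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()
print("slacks:", np.round(xi.value, 3))    # per-point slack values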
Recap: Support Vectors in soft-margin SVM
[Figure: soft-margin separating hyperplane with margin boundaries $\mathbf{w}^\top\mathbf{x} + b = 1$ and $\mathbf{w}^\top\mathbf{x} + b = -1$; support vectors lie on the margin, inside it, or on the wrong side]

Notice that
$\xi_i = \max\!\big(1 - y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big),\, 0\big)$
which is exactly the hinge loss.

[Figure: hinge loss vs. 0-1 loss as a function of the margin $y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big)$]
Loss Functions
Hinge Loss Function (SVMs)
Hinge Loss Function
$\min_{\mathbf{w},\,b}\ \sum_{i=1}^{m} \max\!\Big(0,\; 1 - y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big)\Big) \;+\; \lambda \sum_{k=1}^{d} w_k^2$

[Figure: hinge loss vs. 0-1 loss as a function of the margin $y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big)$]

An error signal is triggered even when there is no classification error (margin > 0), pushing the algorithm to keep learning until a margin of at least 1 is achieved for all data points.

The purpose of the regularization term is to keep the hypothesis simple by avoiding large values of the learned parameters, which favors generalization: small weights translate into a large margin.
Large margins ⇒ good generalization

At the optimum, the slack equals the hinge loss:
$\xi_i = \max\!\Big(0,\; 1 - y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big)\Big)$
The constrained soft-margin problem

$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \frac{\mathbf{w}^\top\mathbf{w}}{2} + C \sum_{j=1}^{m} \xi_j$
$\text{s.t.}\ \ y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,m$

is equivalent to the unconstrained, regularized hinge-loss problem

$\min_{\mathbf{w},\,b}\ \sum_{i=1}^{m} \max\!\Big(0,\; 1 - y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big)\Big) \;+\; \lambda \sum_{k=1}^{d} w_k^2$
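Since this unconstrained form is just a (sub)differentiable objective, it can be minimized directly. Below is a minimal full-batch subgradient-descent sketch under assumptions of my own (toy data, fixed step size); it is not an implementation taken from the lecture.

import numpy as np

def svm_hinge_subgradient(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on  sum_i max(0, 1 - y_i(w.x_i + b)) + lam * ||w||^2."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                  # points with a non-zero hinge (sub)gradient
        grad_w = -(y[active][:, None] * X[active]).sum(axis=0) + 2 * lam * w
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# hypothetical toy data: two separable pairs of points
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.5], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_hinge_subgradient(X, y)
print("w =", w, " b =", b, " margins:", y * (X @ w + b))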
SVM vs. Logistic Regression
[Figure: hinge loss (SVM), logistic loss (logistic regression), and 0-1 loss as functions of the margin $y\big(\mathbf{w}^\top\mathbf{x} + b\big)$]
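For a concrete feel of the comparison in the figure, the snippet below (illustrative only, not from the slides) evaluates the three losses at a few margin values $z = y(\mathbf{w}^\top\mathbf{x} + b)$.

import numpy as np

z = np.linspace(-2.0, 2.0, 9)                 # margin values y * (w.x + b)
hinge    = np.maximum(0.0, 1.0 - z)           # SVM
logistic = np.log(1.0 + np.exp(-z))           # logistic regression (log loss)
zero_one = (z <= 0).astype(float)             # 0-1 loss (error counted at z <= 0)

for zi, h, l, o in zip(z, hinge, logistic, zero_one):
    print(f"z = {zi:+.1f}   hinge = {h:.2f}   logistic = {l:.2f}   0-1 = {o:.0f}")

Note that the hinge loss upper-bounds the 0-1 loss, and both surrogate losses are convex in the margin.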
SVM – Linearly separable case
• Primal problem:
$\min_{\mathbf{w},\,b}\ \frac{\mathbf{w}^\top\mathbf{w}}{2}\qquad \text{s.t.}\ \ y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big) \ge 1,\quad i = 1,\dots,m$
Constrained Optimization and Constraint Activation
Example: $\min_x\ x^2$ s.t. $x \ge b$. The solution is $x^* = \max(b, 0)$: when $b$ is positive, the constraint is active at the optimum ($x^* = b$); otherwise it is inactive ($x^* = 0$).

Lagrange multiplier: the price to pay per unit of constraint violation.

General constrained problem:
$\min\ f(\mathbf{x})\quad \text{s.t.}\ \ g_i(\mathbf{x}) \le b_i,\ i = 1,\dots,m;\qquad h_j(\mathbf{x}) = d_j,\ j = 1,\dots,n;\qquad \mathbf{x} \in X \subseteq \mathbb{R}^d$

Feasible region:
$V(P) = \{\, \mathbf{x} \in X \subseteq \mathbb{R}^d \;:\; g_i(\mathbf{x}) \le b_i,\ h_j(\mathbf{x}) = d_j,\ \ i = 1,\dots,m,\ j = 1,\dots,n \,\}$
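A minimal numeric check of the example above (assuming cvxpy; the value b = 2 is arbitrary): the solver returns both the minimizer and the multiplier, i.e., the "price" attached to the active constraint.

import cvxpy as cp

b = 2.0                                        # b > 0, so the constraint is active
x = cp.Variable()
constraint = x >= b
prob = cp.Problem(cp.Minimize(cp.square(x)), [constraint])
prob.solve()
print("x* =", x.value)                         # max(b, 0) = 2
print("lambda* =", constraint.dual_value)      # 2*b = 4: sensitivity of the optimum to relaxing the constraint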
Dual problem ↔ Relaxation
1. $V(RP) \supseteq V(P)$: the feasible region of the relaxed problem fully includes that of the primal.
2. $\Phi(\mathbf{x}) \le f(\mathbf{x}),\ \forall\, \mathbf{x} \in V(P)$: the objective function of $RP$ is always below $f(\mathbf{x})$ (always above, for a maximization problem).
Lagrangian relaxation (Lagrangian dual problem)
Primal problem:
$\min\ f(\mathbf{x})\quad \text{s.t.}\ \ g_i(\mathbf{x}) \le b_i,\ i = 1,\dots,m;\qquad h_j(\mathbf{x}) = d_j,\ j = 1,\dots,n;\qquad \mathbf{x} \in X \subseteq \mathbb{R}^d$

Lagrangian relaxation: for multipliers $\boldsymbol{\lambda} \ge \mathbf{0}_m$ and $\boldsymbol{\mu} \in \mathbb{R}^n$, minimize $\Phi(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu})$ over $\mathbf{x} \in X \subseteq \mathbb{R}^d$.

• $\Phi(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu})$ adds to the primal's $f(\mathbf{x})$ a linear combination of all (or a subset of) the constraints, weighted by the multipliers $\boldsymbol{\lambda}, \boldsymbol{\mu}$ (each constraint gets its own multiplier):

$\Phi(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i \big(g_i(\mathbf{x}) - b_i\big) + \sum_{j=1}^{n} \mu_j \big(h_j(\mathbf{x}) - d_j\big)$

o It is a relaxation: an easier problem to solve than the original.
o The primal is constrained; the Lagrangian relaxation is unconstrained.
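As a concrete instance of this construction (anticipating the SVM derivation later in the lecture), the hard-margin primal has only inequality constraints, $1 - y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big) \le 0$, so its Lagrangian reads:

$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{\mathbf{w}^\top\mathbf{w}}{2} + \sum_{i=1}^{m} \alpha_i \Big(1 - y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big)\Big),\qquad \alpha_i \ge 0$

Here the generic multipliers $\lambda_i$ are renamed $\alpha_i$, matching the notation used in the SVM slides.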
Lagrangian relaxation
For a general primal, the optimal solution of the Lagrangian relaxation is either not feasible for the primal and/or does not correspond to $P$'s optimum.

• By construction of a relaxation, for any assignment of values to the multipliers, the solution of the Lagrangian relaxation is a lower bound on the solution of a primal minimization problem (or an upper bound, if the primal is a maximization problem).

[Figure: general case — the relaxed objective lies below the primal objective]
Lagrangian relaxation: Convex problems
✓ For convex programs, it is possible to find an assignment of multipliers such that the lower bound is tight: the solution of the Lagrangian relaxation (which defines a concave problem in $\boldsymbol{\lambda}$) coincides with that of the primal problem!

[Figure: two panels plotting $f(x)$ and $\Phi(x; \lambda)$. Left: a generic assignment of values to the multipliers, giving $z_{RP} < z_P$. Right: the best assignment of values to the multipliers (itself the solution of a new optimization problem), giving $z_{RP} = z_P$.]
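To see the bound tighten in the convex case, here is the toy problem from the constraint-activation slide worked out (my own derivation, with $b > 0$):

$\Phi(x, \lambda) = x^2 + \lambda\,(b - x),\qquad \frac{\partial \Phi}{\partial x} = 2x - \lambda = 0 \;\Rightarrow\; x = \frac{\lambda}{2},\qquad d(\lambda) = \min_x \Phi(x, \lambda) = \lambda b - \frac{\lambda^2}{4}$

$d(\lambda)$ is concave; $\max_{\lambda \ge 0} d(\lambda)$ is attained at $\lambda^* = 2b$, where $d(\lambda^*) = b^2 = f(x^*)$ with $x^* = b$.

For any other $\lambda \ge 0$, $d(\lambda) < b^2$ is only a strict lower bound — the scenario in the left panel.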
When, in general, is $z_{RP} \equiv z_P$? Complementarity conditions
If the solution of the relaxed problem, $\mathbf{x}^*_{\boldsymbol{\lambda},\boldsymbol{\mu}}$, is feasible in $V(P)$ (i.e., it is also a feasible point for the primal) and the following conditions, termed complementarity conditions, hold, then $\mathbf{x}^*_{\boldsymbol{\lambda},\boldsymbol{\mu}}$ is also $P$'s optimum:

$\sum_{i=1}^{m} \lambda_i \big(g_i(\mathbf{x}^*) - b_i\big) = 0$

(note that for the equality constraints the corresponding conditions automatically hold)

[Figure: two scenarios — complementarity conditions satisfied vs. not satisfied]
Complementarity conditions and Lagrange multipliers
$\sum_{i=1}^{m} \lambda_i \big(g_i(\mathbf{x}^*) - b_i\big) = 0$ (complementarity conditions)

o For the solution to be the same between a convex primal and its Lagrangian relaxation, the complementarity conditions must hold. Since $\lambda_i \ge 0$ and $g_i(\mathbf{x}^*) - b_i \le 0$, every term of the sum is non-positive, so the sum can be zero only if each term is zero. Hence, at the optimum, for the $i$-th constraint, either:
  ✓ $\lambda_i = 0$, or
  ✓ the constraint is active: $g_i(\mathbf{x}^*) - b_i = 0$.

Rationale: if a constraint $i$ is active (i.e., satisfied with equality) at the optimal solution, then its potential violation should have a price $\lambda_i > 0$, since a small change in that constraint would change the objective value and the solution. If instead a constraint $j$ is not active, a small change in it would not change the solution or the objective: constraint $j$ is practically irrelevant for finding the solution.
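Checking the complementarity condition on the same toy example (again my own computation): the single constraint $x \ge b$ is active at $x^* = b$, so its multiplier is allowed, and indeed required, to be positive.

$\lambda^*\,(b - x^*) = 2b\,(b - b) = 0,\qquad \lambda^* = 2b > 0$

Had the constraint been inactive ($b < 0$, hence $x^* = 0$), the condition would instead force $\lambda^* = 0$.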
Let’s go back to our specific case: Connection between Primal and Dual
➢ Weak duality: the dual solution $d^*$ lower-bounds the primal solution $p^*$, i.e., $d^* \le p^*$.

$d(\boldsymbol{\alpha}) = \min_{\mathbf{w},\,b} L(\mathbf{w}, b, \boldsymbol{\alpha}) \;\le\; p^*$

$d(\boldsymbol{\alpha})$ is a concave function of $\boldsymbol{\alpha}$.
Connection between Primal and Dual
➢ Weak duality: the dual solution $d^*$ lower-bounds the primal solution $p^*$, i.e., $d^* \le p^*$.
➢ Strong duality: $d^* = p^*$ holds for many problems of interest, e.g., if the primal is a feasible convex problem with linear constraints.
Connection between Primal and Dual
What does strong duality say about $\boldsymbol{\alpha}^*$ (the $\boldsymbol{\alpha}$ that achieves the optimal value of the dual) and $\mathbf{x}^*$ (the $\mathbf{x}$ that achieves the optimal value of the primal problem)?

Whenever strong duality holds, the following conditions (known as the KKT conditions) are true for $\boldsymbol{\alpha}^*$ and $\mathbf{x}^*$.

We use the first one (stationarity) to relate $\mathbf{x}^*$ and $\boldsymbol{\alpha}^*$. We use the last one (complementary slackness) to argue that $\alpha_i^* = 0$ if a constraint is inactive and $\alpha_i^* > 0$ if a constraint is active and tight.
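For reference, the conditions alluded to above, stated here in standard form (not copied from the slide) for a problem $\min_{\mathbf{x}} f(\mathbf{x})$ s.t. $g_i(\mathbf{x}) \le 0$ with Lagrangian $L(\mathbf{x}, \boldsymbol{\alpha}) = f(\mathbf{x}) + \sum_i \alpha_i g_i(\mathbf{x})$:

1. Stationarity: $\nabla_{\mathbf{x}} L(\mathbf{x}^*, \boldsymbol{\alpha}^*) = 0$
2. Primal feasibility: $g_i(\mathbf{x}^*) \le 0,\ \forall i$
3. Dual feasibility: $\alpha_i^* \ge 0,\ \forall i$
4. Complementary slackness: $\alpha_i^*\, g_i(\mathbf{x}^*) = 0,\ \forall i$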
Solving the dual
For the toy example ($\min_x x^2$ s.t. $x \ge b$): solving the dual gives $\Rightarrow \alpha^* = 2b$ when $b > 0$ (and $\alpha^* = 0$ otherwise).

• Primal problem: $\min_{\mathbf{w},\,b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w}$ s.t. $y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big) \ge 1$, $i = 1,\dots,m$, with Lagrangian $L(\mathbf{w}, b, \boldsymbol{\alpha})$.
• Dual problem:
$\max_{\boldsymbol{\alpha} \ge 0}\ \min_{\mathbf{w},\,b}\ L(\mathbf{w}, b, \boldsymbol{\alpha})$
obtained by first solving the inner problem $\min_{\mathbf{w},\,b}\ L(\mathbf{w}, b, \boldsymbol{\alpha})$.

If we can solve for $\boldsymbol{\alpha}$ (dual problem), then we have a solution for $\mathbf{w}, b$ (primal problem).
Dual SVM – linearly separable case
Solve the inner problem $\min_{\mathbf{w},\,b} L(\mathbf{w}, b, \boldsymbol{\alpha})$, then the outer one, $\max_{\boldsymbol{\alpha}}\ \min_{\mathbf{w},\,b} L(\mathbf{w}, b, \boldsymbol{\alpha})$.

[Figure: separating hyperplane $\mathbf{w}^\top\mathbf{x} + b = 0$; the points on the margin are marked $\alpha_j > 0$, all other points $\alpha_j = 0$]

At the solution, $\alpha_j > 0$ only for the points with $y^{(j)}\big(\mathbf{w}^\top\mathbf{x}^{(j)} + b\big) = 1$; $\alpha_j = 0$ for all the others.

Support vectors: the training points $\mathbf{x}^{(j)}$ whose $\alpha_j$ is non-zero.
Dual SVM – Linearly separable case
The dual problem (after substituting the solution of the inner minimization):

$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{m} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{m}\sum_{k=1}^{m} \alpha_i \alpha_k\, y^{(i)} y^{(k)}\, \mathbf{x}^{(i)\top}\mathbf{x}^{(k)}$
$\text{s.t.}\ \ \alpha_i \ge 0,\qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$

with $\mathbf{w} = \sum_{i=1}^{m} \alpha_i y^{(i)} \mathbf{x}^{(i)}$, and $b$ recovered from any support vector: $b = y^{(k)} - \mathbf{w}^\top\mathbf{x}^{(k)}$ for any $k$ with $\alpha_k > 0$.
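The dual is again a QP, now in $\boldsymbol{\alpha}$, so it can be solved with the same generic tools; the sketch below (my own, assuming cvxpy and a made-up dataset) solves it and recovers $\mathbf{w}$, $b$, and the support vectors.

import numpy as np
import cvxpy as cp

# hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.5], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

alpha = cp.Variable(len(y))
# (1/2) sum_{i,k} alpha_i alpha_k y_i y_k x_i.x_k  rewritten as  (1/2) || sum_i alpha_i y_i x_i ||^2
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = X.T @ (a * y)                          # w = sum_i alpha_i y_i x_i
sv = np.where(a > 1e-5)[0]                 # support vectors: alpha_i > 0
b = y[sv[0]] - w @ X[sv[0]]                # b from any support vector (active constraint)
print("support vectors:", sv, " w =", w, " b =", b)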
Dual SVM – Non-separable case
• Primal problem:
$\min_{\mathbf{w},\,b,\,\{\xi_j\}}\ \frac{\mathbf{w}^\top\mathbf{w}}{2} + C \sum_{j=1}^{m} \xi_j$
$\text{s.t.}\ \ y^{(j)}\big(\mathbf{w}^\top\mathbf{x}^{(j)} + b\big) \ge 1 - \xi_j,\qquad \xi_j \ge 0,\qquad j = 1,\dots,m$

• Lagrangian, with multipliers $\alpha_j \ge 0$ for the margin constraints and $\mu_j \ge 0$ for $\xi_j \ge 0$:
$L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\mu}) = \frac{\mathbf{w}^\top\mathbf{w}}{2} + C \sum_{j=1}^{m} \xi_j + \sum_{j=1}^{m} \alpha_j \Big(1 - \xi_j - y^{(j)}\big(\mathbf{w}^\top\mathbf{x}^{(j)} + b\big)\Big) - \sum_{j=1}^{m} \mu_j \xi_j$

• Dual: $\max_{\boldsymbol{\alpha} \ge 0,\, \boldsymbol{\mu} \ge 0}\ \min_{\mathbf{w},\,b,\,\{\xi_j\}}\ L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\mu})$
Dual SVM – Non-separable case
The resulting dual is the same as in the separable case, except for an extra upper bound on the multipliers:
$0 \le \alpha_j \le C$

The bound $\alpha_j \le C$ comes from $\frac{\partial L}{\partial \xi_j} = 0$, which gives $C - \alpha_j - \mu_j = 0$ with $\mu_j \ge 0$.

Intuition: if $C \to \infty$, we recover the hard-margin SVM.
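A quick empirical look at this intuition (illustrative only, not from the slides), using scikit-learn's linear-kernel SVC on synthetic overlapping data: for a few values of C it reports how many support vectors there are and how many multipliers sit at the box bound $\alpha_j = C$.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# hypothetical synthetic data: two partially overlapping blobs
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in [0.01, 1.0, 1e4]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_sv = int(clf.n_support_.sum())
    # dual_coef_ holds alpha_j * y_j; points with slack (xi_j > 0) must have alpha_j = C
    at_bound = int(np.isclose(np.abs(clf.dual_coef_), C).sum())
    print(f"C = {C:>8}: {n_sv} support vectors, {at_bound} with alpha_j = C")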