Lecture-05__Least Squares and Optimization
$$\hat{w} = \arg\min_w \|y - Xw\|_2^2$$
$$\hat{w} = (X^T X)^{-1} X^T y$$
The question is: how do we use this within a classification setting?
Let $\hat{y} = X\hat{w}$.
The elements of $\hat{y}$ are not in $\{-1, +1\}$.
That is, we are not getting a bunch of plus and minus ones here.
We are actually getting real numbers.
For example, for one sample, $\hat{y}_i$ might be 0.5; for another sample, it might be -0.7.
We can check not only whether $\hat{y}_i$ is close to $y_i$, but also whether the labels we predict equal the labels we actually observed.
For a new sample $x_{new} \in \mathbb{R}^p$, we want to predict its label $\hat{y}_{new}$ (new unknown label):
$$\hat{y}_{new} = \langle \hat{w}, x_{new} \rangle$$
$\hat{y}_{new}$ is the inner product of the feature vector for this new sample and the weight vector that we estimated using our training data.
We use our training data to find a good weight vector, and then we use that weight vector with the new feature vector to make new predictions.
$$\hat{y}_{new} = \langle \hat{w}, x_{new} \rangle$$
$$y_{new} = \mathrm{sign}(\hat{y}_{new})$$
• Prediction Phase
3. Given a new sample, we apply our trained model.
4. The learned weight vector $\hat{w}$ is used in the linear model.
5. The system computes the inner product $\hat{y}_{new} = \langle \hat{w}, x_{new} \rangle$.
6. The result is a predicted label for $x_{new}$ (see the sketch below).
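As a concrete illustration, here is a minimal NumPy sketch of the training and prediction phases, using made-up toy data (the array values and variable names are placeholders, not from the lecture):

```python
import numpy as np

# Toy training data (placeholder values): n = 4 samples, p = 2 features,
# labels in {-1, +1}.
X_train = np.array([[1.0, 2.0],
                    [2.0, 1.0],
                    [-1.0, -2.0],
                    [-2.0, -1.0]])
y_train = np.array([1.0, 1.0, -1.0, -1.0])

# Training phase: least-squares weight vector w_hat = (X^T X)^{-1} X^T y.
# lstsq solves the same problem without forming the inverse explicitly.
w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Prediction phase: inner product <w_hat, x_new>, then take the sign.
x_new = np.array([1.5, 0.5])
y_hat_new = w_hat @ x_new      # real-valued prediction
y_new = np.sign(y_hat_new)     # predicted label in {-1, +1}

print(w_hat, y_hat_new, y_new)
```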
$$\hat{w} = \arg\min_w \|y - Xw\|_2^2$$
$$\hat{w} = (X^T X)^{-1} X^T y$$
What we said is that $\hat{w}$ is the value of $w$ that minimizes the squared 2-norm of $y - Xw$.
Previously, we derived this equation for $\hat{w}$ using a geometric argument (projection onto the column space of $X$).
Today, we will rederive this result by treating it as an optimization
problem, approaching it from a different perspective.
Optimization approach
$$\hat{w} = \arg\min_w \; y^T y - (Xw)^T y - y^T X w + (Xw)^T (Xw)$$
$$= \arg\min_w \; y^T y - w^T X^T y - y^T X w + w^T X^T X w$$
$$= \arg\min_w \; y^T y - 2 w^T X^T y + w^T X^T X w$$
where $X \in \mathbb{R}^{n \times p}$, $w \in \mathbb{R}^{p \times 1}$, $y \in \mathbb{R}^{n \times 1}$.
The last step uses that $w^T X^T y$ and $y^T X w$ are scalars, and $w^T X^T y = y^T X w$.
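As a quick sanity check of this expansion, the following sketch compares the two expressions numerically on random data (the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3                      # arbitrary dimensions for the check
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
w = rng.standard_normal(p)

lhs = np.linalg.norm(y - X @ w) ** 2
rhs = y @ y - 2 * w @ X.T @ y + w @ X.T @ X @ w
print(np.isclose(lhs, rhs))      # True: the two expressions agree
```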
Optimization approach
$$\hat{w} = \arg\min_w \|y - Xw\|_2^2 = \arg\min_w \; y^T y - 2 w^T X^T y + w^T X^T X w$$
$$\hat{w} = \arg\min_w f(w)$$
For a simple one-dimensional example, suppose $\frac{df}{dw} = w - 1$; setting it to zero gives $\hat{w} = 1$.
But for $f(x) = -x^2$, the function curves downward and is unbounded below, so it is hard to minimize: setting the derivative to zero gives a maximizer, not a minimizer.
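A small numerical illustration of this contrast; since the slide does not spell out $f$, the sketch assumes the hypothetical choice $f(w) = \tfrac{1}{2}w^2 - w$, whose derivative is $w - 1$:

```python
import numpy as np

# Hypothetical concrete choice of f with df/dw = w - 1 (not specified on the slide).
f = lambda w: 0.5 * w**2 - w

ws = np.linspace(-2.0, 4.0, 601)
print(ws[np.argmin(f(ws))])      # approx. 1.0, matching df/dw = w - 1 = 0

# By contrast, f(x) = -x**2 keeps decreasing as |x| grows, so a grid search
# just runs to the boundary and there is no minimizer.
g = lambda x: -x**2
print(ws[np.argmin(g(ws))])      # 4.0: the edge of the grid, not a true minimum
```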
Ex 2. $f(x) = x^T Q x$, with $x \in \mathbb{R}^2$, $Q \in \mathbb{R}^{2 \times 2}$.
Case 1: $Q = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I$. We try to minimize $f(x) = x^T Q x = x_1^2 + x_2^2 > 0$ for all $x \neq 0$, so the minimizer is $x = 0$.
Case 2: $Q = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix}$. Now $f(x) = x^T Q x = -x_1^2 - x_2^2 < 0$ for all $x \neq 0$, so $f$ is unbounded below and hard to minimize.
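A quick NumPy check of the two cases (the test vector x is an arbitrary random choice):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(2)        # any nonzero x in R^2

Q_pos = np.eye(2)                 # Q = I:  f(x) = x1^2 + x2^2 > 0
Q_neg = -np.eye(2)                # Q = -I: f(x) = -x1^2 - x2^2 < 0

print(x @ Q_pos @ x)              # positive for every x != 0
print(x @ Q_neg @ x)              # negative for every x != 0

# Definiteness can also be read off the eigenvalues:
print(np.linalg.eigvalsh(Q_pos))  # all eigenvalues > 0 -> positive definite
print(np.linalg.eigvalsh(Q_neg))  # all eigenvalues < 0 -> negative definite
```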
So thinking about positive definite matrices gives us, among other properties:
5. $Q \succeq P$ means $Q - P \succeq 0$.
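Property 5 can be checked numerically: $Q \succeq P$ exactly when all eigenvalues of $Q - P$ are nonnegative. The matrices below are made-up examples:

```python
import numpy as np

# Made-up symmetric matrices for illustration.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
P = np.array([[1.0, 0.0],
              [0.0, 1.0]])

# Q >= P (in the positive semidefinite order) iff Q - P >= 0,
# i.e. all eigenvalues of Q - P are nonnegative.
eigs = np.linalg.eigvalsh(Q - P)
print(eigs, np.all(eigs >= 0))
```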
Recall Optimization Formulation
$$\hat{w} = \arg\min_w \; y^T y - 2 w^T X^T y + w^T X^T X w$$
We try to solve this optimization problem.
From property 3 of positive definite matrices, $X^T X$ is positive semidefinite; here we assume the strict case $X^T X \succ 0$.
Then $f(w) = y^T y - 2 w^T X^T y + w^T X^T X w$ is convex,
so we can compute its "derivative" (gradient) and set it to zero to find the minimizer.
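One way to check the assumption $X^T X \succ 0$ numerically is to look at the smallest eigenvalue of $X^T X$; the random X below is only for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 3
X = rng.standard_normal((n, p))  # with probability 1, X has full column rank

# X^T X is symmetric; if its smallest eigenvalue is > 0, it is positive
# definite and f(w) is (strictly) convex.
eigs = np.linalg.eigvalsh(X.T @ X)
print(eigs.min() > 0)            # True here
```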
The gradient of $f$ with respect to $w$ is the vector of its derivatives:
$$\nabla_w f = \begin{pmatrix} \frac{df}{dw_1} \\ \frac{df}{dw_2} \\ \vdots \\ \frac{df}{dw_p} \end{pmatrix}$$
Ex 1. $f(w) = c^T w = c_1 w_1 + c_2 w_2 + \dots + c_p w_p$
$$\nabla_w f = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_p \end{pmatrix} = c$$
Ex 2. $f(w) = w^T w = \|w\|_2^2 = w_1^2 + w_2^2 + \dots + w_p^2$
$$\nabla_w f = \begin{pmatrix} 2w_1 \\ 2w_2 \\ \vdots \\ 2w_p \end{pmatrix} = 2w$$
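Both gradients can be verified with finite differences; the helper `num_grad` below is introduced here just for this check and is not part of the lecture:

```python
import numpy as np

def num_grad(f, w, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    g = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = eps
        g[k] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(3)
p = 4
c = rng.standard_normal(p)
w = rng.standard_normal(p)

# Ex 1: f(w) = c^T w, gradient should be c.
print(np.allclose(num_grad(lambda v: c @ v, w), c))      # True

# Ex 2: f(w) = w^T w, gradient should be 2w.
print(np.allclose(num_grad(lambda v: v @ v, w), 2 * w))  # True
```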
Ex 3. $f(w) = w^T Q w = \sum_{i=1}^{p} \sum_{j=1}^{p} w_i Q_{ij} w_j$
Let's compute the derivative of just one term in the sum before tackling the whole sum:
$$\frac{d (w_i Q_{ij} w_j)}{d w_k} = \begin{cases} 2 Q_{ii} w_i & \text{if } k = i = j \\ Q_{ij} w_j \;(= Q_{kj} w_j) & \text{if } i = k \neq j \\ Q_{ij} w_i \;(= Q_{ik} w_i) & \text{if } i \neq k = j \\ 0 & \text{if } k \neq i,\ k \neq j \end{cases}$$
To figure out the gradient, we next sum these per-term derivatives over the whole sum:
$$\frac{df}{dw_k} = \sum_{i=1}^{p} \sum_{j=1}^{p} \frac{d (w_i Q_{ij} w_j)}{d w_k} = (Qw)_k + (Q^T w)_k$$
$$\Rightarrow \nabla_w f = Q w + Q^T w$$
If $Q$ is symmetric ($Q = Q^T$), this simplifies to $\nabla_w f = 2 Q w$.
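The same kind of finite-difference check applied to the quadratic form, with an arbitrary (generally non-symmetric) Q:

```python
import numpy as np

def num_grad(f, w, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    g = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = eps
        g[k] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(4)
p = 3
Q = rng.standard_normal((p, p))  # generally not symmetric
w = rng.standard_normal(p)

f = lambda v: v @ Q @ v
print(np.allclose(num_grad(f, w), Q @ w + Q.T @ w))   # True: grad = Qw + Q^T w

# For symmetric Q the two terms coincide, giving 2 Q w.
Q_sym = (Q + Q.T) / 2
f_sym = lambda v: v @ Q_sym @ v
print(np.allclose(num_grad(f_sym, w), 2 * Q_sym @ w)) # True
```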
Ex. Least Squares
$$f(w) = y^T y - 2 w^T X^T y + w^T X^T X w$$
$$\nabla_w f = 0 - 2 X^T y + 2 X^T X w$$
We set $\nabla_w f = -2 X^T y + 2 X^T X w = 0$
$$\Rightarrow X^T X w = X^T y$$
Because $X^T X$ is positive definite, we can take its inverse:
$$\hat{w} = (X^T X)^{-1} X^T y$$
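As a final check, the closed-form expression can be compared against standard NumPy solvers on random data (forming the explicit inverse is fine for this illustration, though solving the normal equations directly is numerically preferable):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Closed form w_hat = (X^T X)^{-1} X^T y (fine here since X^T X is invertible).
w_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Equivalent, numerically preferable: solve X^T X w = X^T y, or use lstsq.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_closed, w_solve), np.allclose(w_closed, w_lstsq))  # True True
```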