
Lecture 5: Least Squares and Optimization in Machine Learning
SWCON253, Machine Learning
Won Hee Lee, PhD
Learning Goals
• Understand the fundamental concepts of least squares and optimization in machine learning
Given
Labels: y ∈ ℝⁿ (target vector for n training samples)
Features: X ∈ ℝ^{n×p} (feature matrix with n samples and p features)
We assume that n ≥ p and rank(X) = p (X has p linearly independent columns).
Goal: we want to find w such that ŷ = Xw ≈ y.
In classification, we are given binary class labels yᵢ ∈ {−1, +1}, i = 1, 2, …, n.

ŵ = argmin_w ‖y − Xw‖₂²

We aim to learn the weight vector ŵ that minimizes the sum of squared errors between the true labels y and the predictions Xw. In other words, we want to minimize the distance between the true labels y and the predictions Xw made by the linear model.

ŵ = (XᵀX)⁻¹ Xᵀy
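As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of this closed-form solution on synthetic data; the data and variable names below are my own assumptions.

```python
import numpy as np

# Synthetic example (assumed for illustration): n = 100 samples, p = 3 features
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                 # feature matrix; rank(X) = p almost surely
w_true = np.array([1.0, -2.0, 0.5])         # ground-truth weights used to generate targets
y = X @ w_true + 0.1 * rng.normal(size=n)   # noisy targets

# Closed-form least squares: w_hat = (X^T X)^{-1} X^T y.
# Solving the linear system X^T X w = X^T y avoids forming the inverse explicitly.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # close to w_true
```

In practice, np.linalg.lstsq(X, y, rcond=None) computes the same minimizer via a more numerically stable factorization than the normal equations.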
The question is how we use this within a classification setting.

Let ŷ = Xŵ. The elements of ŷ are not in {−1, +1}. That is, we are not getting a bunch of plus and minus ones; we are getting real numbers. For example, ŷᵢ might be 0.5 for one sample and −0.7 for another.

We have to come up with a classification rule.


Classification rule

We predict +1 if ŷᵢ > 0 and −1 if ŷᵢ < 0.

In other words, ỹᵢ = sign(ŷᵢ) ∈ {−1, +1}.

We can then check not only whether ŷᵢ is close to yᵢ, but also whether the predicted labels ỹᵢ equal the labels we actually observed.
For a new sample x_new ∈ ℝᵖ, we want to predict its (unknown) label ŷ_new:

ŷ_new = ⟨ŵ, x_new⟩

ŷ_new is the inner product of the new sample's feature vector and the weight vector estimated from the training data. We use the training data to find a good weight vector, and we can then use that weight vector with the new feature vector to make new predictions.

ŷ_new = ⟨ŵ, x_new⟩
ỹ_new = sign(ŷ_new)
Pipeline of the classification system

• Training Phase
1. Take in all the training data.
2. Learn the optimal weight vector ŵ within the linear model.

• Prediction Phase
3. Given a new sample, apply the trained model.
4. The learned weight vector ŵ is used in the linear model.
5. The system computes the inner product ŷ_new = ⟨ŵ, x_new⟩.
6. The result is a predicted label for x_new (see the sketch below).
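To make the two phases concrete, here is a small end-to-end sketch (my own illustration with synthetic data, not from the lecture), assuming the least squares classifier described above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 2

# --- Training phase ---
# Synthetic binary classification data with labels in {-1, +1}
X = rng.normal(size=(n, p))
y = np.sign(X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=n))

# Learn the weight vector by least squares (normal equations)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# --- Prediction phase ---
x_new = np.array([0.8, -0.5])        # a new, unlabeled sample
y_hat_new = w_hat @ x_new            # inner product <w_hat, x_new>, a real number
y_tilde_new = np.sign(y_hat_new)     # classification rule: threshold at zero
print(y_hat_new, y_tilde_new)
```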
ŵ = argmin_w ‖y − Xw‖₂²

ŵ = (XᵀX)⁻¹ Xᵀy

We said that ŵ is the value of w that minimizes the squared 2-norm of y − Xw. Previously, we derived this equation for ŵ using a geometric argument (projection onto the column space of X). Today, we will rederive this result by treating it as an optimization problem, approaching it from a different perspective.
Optimization approach

Recall the squared 2-norm: ‖x‖₂² = Σᵢ₌₁ⁿ xᵢ² = xᵀx = ⟨x, x⟩

ŵ = argmin_w ‖y − Xw‖₂²
  = argmin_w (y − Xw)ᵀ(y − Xw)
  = argmin_w [yᵀy − (Xw)ᵀy − yᵀXw + (Xw)ᵀ(Xw)]
  = argmin_w [yᵀy − wᵀXᵀy − yᵀXw + wᵀXᵀXw]
  = argmin_w [yᵀy − 2wᵀXᵀy + wᵀXᵀXw]

Here X ∈ ℝ^{n×p}, w ∈ ℝ^{p×1}, and y ∈ ℝ^{n×1}. Both wᵀXᵀy and yᵀXw are scalars, and wᵀXᵀy = yᵀXw, which gives the −2wᵀXᵀy term in the last line.
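A quick numerical sanity check of this expansion (my own sketch with arbitrary random X, y, and w): both forms of the objective agree to floating-point precision.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
w = rng.normal(size=p)

lhs = np.linalg.norm(y - X @ w) ** 2                  # ||y - Xw||_2^2
rhs = y @ y - 2 * (w @ X.T @ y) + w @ (X.T @ X) @ w   # expanded form
print(np.isclose(lhs, rhs))  # True
```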
Optimization approach

ŵ = argmin_w ‖y − Xw‖₂² = argmin_w [yᵀy − 2wᵀXᵀy + wᵀXᵀXw]

• We started off by saying we wanted to minimize the sum of squared errors (residuals).
• We took that objective function and rewrote it as a matrix-vector expression.
• Now we want to figure out how to actually solve this optimization problem.
Warmup (a simple optimization problem in 1D)

f(w) = ½w² − w − ½

ŵ = argmin_w f(w)

df/dw = w − 1 = 0  →  ŵ = 1

We take the derivative and set it equal to zero; that gives us the minimum.
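As a side illustration (not in the lecture), the same minimizer can also be found numerically, for example by simple gradient descent on f; the starting point and step size below are arbitrary choices.

```python
def f(w):
    return 0.5 * w**2 - w - 0.5

def df(w):
    return w - 1.0   # derivative of f

w = 5.0              # arbitrary starting point
step = 0.1           # arbitrary fixed step size
for _ in range(200):
    w -= step * df(w)
print(w)  # converges to 1.0, matching the closed-form answer
```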
So now we want to know how we can solve optimization problems where, instead of optimizing over a scalar w, we optimize over a weight vector.

From the geometric perspective, we saw that XᵀX being invertible was important for ensuring a unique solution. From the optimization perspective, the same condition is important as well.
Positive Definite Matrices

From the geometric perspective, we saw that for finding a unique least squares solution ŵ it was important that XᵀX be invertible. Is this important in the optimization setting as well? Yes!

The following two statements are equivalent for X ∈ ℝ^{n×p} with n ≥ p and rank(X) = p (X has p linearly independent columns):
1. Invertibility: XᵀX ∈ ℝ^{p×p} is invertible (i.e., (XᵀX)⁻¹ exists).
2. Positive definiteness: XᵀX is positive definite.

A matrix Q (= XᵀX) is positive definite (p.d.) if zᵀQz > 0 for all z ≠ 0 (shorthand: Q ≻ 0, read "Q curly greater than zero").

A matrix Q is positive semi-definite (p.s.d.) if zᵀQz ≥ 0 for all z (shorthand: Q ≽ 0, read "Q curly greater than or equal to zero").
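For a symmetric matrix Q, one practical way to check these conditions is via eigenvalues: Q ≻ 0 exactly when all eigenvalues are positive, and Q ≽ 0 when they are all nonnegative. A minimal sketch (the helper names are my own):

```python
import numpy as np

def is_positive_definite(Q, tol=1e-10):
    """Check Q > 0: all eigenvalues of the symmetric matrix Q strictly positive."""
    return bool(np.all(np.linalg.eigvalsh(Q) > tol))

def is_positive_semidefinite(Q, tol=1e-10):
    """Check Q >= 0: all eigenvalues of the symmetric matrix Q nonnegative."""
    return bool(np.all(np.linalg.eigvalsh(Q) >= -tol))

print(is_positive_definite(np.array([[2.0, 0.0], [0.0, 3.0]])))      # True
print(is_positive_definite(np.array([[1.0, 0.0], [0.0, 0.0]])))      # False (only semi-definite)
print(is_positive_semidefinite(np.array([[1.0, 0.0], [0.0, 0.0]])))  # True
```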
Ex 1. (1D) f(x) = Qx²

x ∈ ℝ, Q ∈ ℝ

xᵀQx = Qx². When is Qx² greater than 0?

Qx² > 0 for all x ≠ 0 if Q > 0.

Imagine minimizing f(x) = Qx² with Q > 0 (e.g., f(x) = 1·x²): it is easy to minimize (convex).
Convexity and Optimization

A function is convex if its graph has a bowl-shaped structure (e.g., f(x) = 1·x²). This property makes optimization much easier because:

• Single minimum: convex functions have a unique global minimum.
• Easy to minimize: gradient-based methods converge efficiently.
• Tangent property: at any point on the curve we can compute the tangent, and the function always lies above that tangent.

Optimization theory tells us how to actually find minimizers of convex functions.

In contrast, when Q < 0, the function f(x) = Qx² (e.g., f(x) = −1·x²) is non-convex and hard to minimize.
Ex 2. f(x) = xᵀQx

x ∈ ℝ², Q ∈ ℝ^{2×2}

Case 1: Q = [1 0; 0 1] = I
We try to minimize f(x) = xᵀQx = x₁² + x₂² > 0 for all x ≠ 0 → easy to minimize (convex).

Case 2: Q = [−1 0; 0 −1] = −I
f(x) = xᵀQx = −x₁² − x₂² < 0 for all x ≠ 0 → hard to minimize.
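A tiny numerical check of these two cases (my own sketch): evaluating xᵀQx at a few random points shows how the sign flips with Q.

```python
import numpy as np

rng = np.random.default_rng(3)
Q_pos = np.eye(2)     # Q = I:  x^T Q x = x1^2 + x2^2 > 0
Q_neg = -np.eye(2)    # Q = -I: x^T Q x = -x1^2 - x2^2 < 0

for _ in range(3):
    x = rng.normal(size=2)
    print(x @ Q_pos @ x, x @ Q_neg @ x)  # first value positive, second negative
```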
So thinking about positive definite matrices gives us a lens into how we will solve the optimization problem associated with least squares, and into when we will be able to find a good solution versus when finding a good solution is hard.


Properties of Positive Definite Matrices

1. If P ≻ 0 and Q ≻ 0, then P + Q ≻ 0.
   xᵀPx > 0 and xᵀQx > 0 → xᵀ(P + Q)x = xᵀPx + xᵀQx > 0

2. If Q ≻ 0 and a > 0, then aQ ≻ 0.
   xᵀQx > 0 → xᵀ(aQ)x = a·xᵀQx > 0

3. For any A, AᵀA ≽ 0 and AAᵀ ≽ 0.
   If the columns of A are linearly independent, then AᵀA ≻ 0 (see the sketch after this list).

4. If Q ≻ 0, then Q⁻¹ exists.

5. Q ≽ P means Q − P ≽ 0.
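Here is the small numerical illustration of property 3 mentioned above (my own sketch with made-up matrices): AᵀA is always positive semi-definite, and it is positive definite exactly when the columns of A are linearly independent.

```python
import numpy as np

# Columns linearly independent -> A^T A is positive definite
A_indep = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])
print(np.linalg.eigvalsh(A_indep.T @ A_indep))  # all eigenvalues > 0

# Second column = 2 * first column -> A^T A is only positive semi-definite (singular)
A_dep = np.array([[1.0, 2.0],
                  [1.0, 2.0],
                  [1.0, 2.0]])
print(np.linalg.eigvalsh(A_dep.T @ A_dep))      # one eigenvalue is 0
```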
Recall the Optimization Formulation

ŵ = argmin_w [yᵀy − 2wᵀXᵀy + wᵀXᵀXw]

We try to solve this optimization problem. From property 3 of positive definite matrices and our assumption that the columns of X are linearly independent, XᵀX is positive definite.

XᵀX ≻ 0 → f(w) = yᵀy − 2wᵀXᵀy + wᵀXᵀXw is convex
→ compute the "derivative" and set it to zero to find the minimizer

We're not really dealing with scalars anymore; we have vectors. So, instead of the derivative, we use gradients.

When w is a scalar, we set the derivative df/dw to zero and solve for w.
When w is a vector, we set the gradient ∇_w f to zero and solve for w.

∇_w f = [∂f/∂w₁, ∂f/∂w₂, …, ∂f/∂w_p]ᵀ  (the vector of partial derivatives of f with respect to w)
Ex 1.

f(w) = cᵀw = c₁w₁ + c₂w₂ + … + c_p w_p

∇_w f = [c₁, c₂, …, c_p]ᵀ = c
Ex 2.

f(w) = wᵀw = ‖w‖₂² = w₁² + w₂² + … + w_p²

∇_w f = [2w₁, 2w₂, …, 2w_p]ᵀ = 2w
Ex 3.

f(w) = wᵀQw = Σᵢ₌₁ᵖ Σⱼ₌₁ᵖ wᵢ Qᵢⱼ wⱼ

Let's compute the derivative of just one term in the sum before tackling the whole sum:

∂(wᵢ Qᵢⱼ wⱼ)/∂wₖ =
  2Qᵢᵢwᵢ   if k = i = j
  Qₖⱼwⱼ    if i = k ≠ j
  Qᵢₖwᵢ    if i ≠ k = j
  0        if k ≠ i and k ≠ j

To get the gradient, we then sum the derivatives of all the terms:

∂f/∂wₖ = Σᵢ₌₁ᵖ Σⱼ₌₁ᵖ ∂(wᵢ Qᵢⱼ wⱼ)/∂wₖ

→ ∇_w f = Qw + Qᵀw

If Q (= XᵀX) is symmetric (which is the case in least squares problems, where Q = Qᵀ), then

∇_w f = 2Qw
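To double-check the formula ∇_w(wᵀQw) = Qw + Qᵀw (and 2Qw for symmetric Q), here is a finite-difference gradient check (my own sketch with a random Q):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 4
Q = rng.normal(size=(p, p))   # a general (not necessarily symmetric) matrix
w = rng.normal(size=p)

def f(w):
    return w @ Q @ w

# Analytic gradient from the derivation above
grad_analytic = Q @ w + Q.T @ w

# Central finite-difference approximation of each partial derivative
eps = 1e-6
grad_numeric = np.array([
    (f(w + eps * e) - f(w - eps * e)) / (2 * eps)
    for e in np.eye(p)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```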
Ex. Least Squares

f(w) = yᵀy − 2wᵀXᵀy + wᵀXᵀXw

∇_w f = 0 − 2Xᵀy + 2XᵀXw

Setting ∇_w f = −2Xᵀy + 2XᵀXw = 0 gives

XᵀXw = Xᵀy, and because XᵀX is positive definite, we can take its inverse:

ŵ = (XᵀX)⁻¹ Xᵀy

What we got is the same as the result from the geometric perspective!
(Here we used Ex 1 with c = −2Xᵀy and Ex 3 with Q = XᵀX = Qᵀ.)
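As a final check (my own sketch), the ŵ from the normal equations indeed makes the gradient vanish and matches NumPy's built-in least squares solver:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 80, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)        # normal equations
grad_at_w_hat = -2 * X.T @ y + 2 * X.T @ X @ w_hat
print(np.allclose(grad_at_w_hat, 0))             # True: gradient vanishes at w_hat

w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # factorization-based solver
print(np.allclose(w_hat, w_lstsq))               # True: same solution
```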
This time, we arrived at ŵ = (XᵀX)⁻¹ Xᵀy in a completely different way, using optimization arguments: computing the gradient of the objective function and setting it equal to zero. We also saw how notions of positive definiteness help us understand why having XᵀX invertible is important.
We've assumed that X ∈ ℝ^{n×p} has n ≥ p and that the p columns of X are linearly independent (LI).

• Columns of X are LI → Q = XᵀX ≻ 0 (i.e., XᵀX is positive definite)
• XᵀX is positive definite → the least squares loss f(w) = yᵀy − 2wᵀXᵀy + wᵀXᵀXw is convex
• If XᵀX ≻ 0, then XᵀX has an inverse
• XᵀX has an inverse → ŵ = (XᵀX)⁻¹ Xᵀy exists and is unique (the sketch below shows what goes wrong when the columns are not LI)
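To see why the linear independence assumption matters, here is a sketch (my own, not from the lecture) where one column of X is a multiple of another: XᵀX becomes singular and the minimizer is no longer unique.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([x1, 2 * x1])   # second column is a multiple of the first -> rank(X) = 1
y = rng.normal(size=n)

Q = X.T @ X
print(np.linalg.matrix_rank(Q))     # 1: the 2x2 matrix X^T X is singular, so (X^T X)^{-1} does not exist

# The least squares problem still has minimizers, but not a unique one:
# any vector in the null space of X can be added without changing X @ w.
w0, *_ = np.linalg.lstsq(X, y, rcond=None)   # one particular minimizer (minimum-norm)
w_alt = w0 + np.array([2.0, -1.0])           # [2, -1] is in the null space of X here
print(np.allclose(X @ w0, X @ w_alt))        # True: both give the same predictions and loss
```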
Further Readings
• Any linear algebra & optimization book should be fine!
• Mathematics for Machine Learning (MathML), Chapters 2 & 7: Linear Algebra & Continuous Optimization
Announcements
• Homework #1
• Write up your own review of the material by hand on A4 paper, scan it, and submit it.
• At least 1 page per lecture (lectures 3-5)
• Due Thursday, March 25th, at 11:59 pm
