
Lecture 5: Least Squares and Optimization in Machine Learning
SWCON253, Machine Learning
Won Hee Lee, PhD
Learning Goals
• Understand the fundamental concepts of least squares and optimization in machine learning
Given
Labels: y ∈ ℝⁿ (target vector for n training samples)
Features: X ∈ ℝ^{n×p} (feature matrix with n samples and p features)
We assume that n ≥ p and rank(X) = p (X has p linearly independent columns).
Goal: we want to find w such that ŷ = Xw ≈ y.
In classification, we are given binary class labels yᵢ ∈ {−1, +1}, i = 1, 2, …, n.

ŵ = argmin_w ‖y − Xw‖₂²

We aim to learn the weight vector ŵ that minimizes the sum of squared errors between the true labels y and the predictions Xw. In other words, we want to minimize the distance between the true labels y and the predictions Xw made by the linear model.

ŵ = (XᵀX)⁻¹ Xᵀy
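As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of this closed-form solution on synthetic data; the data and variable names below are my own assumptions.

```python
import numpy as np

# Synthetic example (assumed for illustration): n = 100 samples, p = 3 features
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                 # feature matrix; rank(X) = p almost surely
w_true = np.array([1.0, -2.0, 0.5])         # ground-truth weights used to generate targets
y = X @ w_true + 0.1 * rng.normal(size=n)   # noisy targets

# Closed-form least squares: w_hat = (X^T X)^{-1} X^T y.
# Solving the linear system X^T X w = X^T y avoids forming the inverse explicitly.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # close to w_true
```

In practice, np.linalg.lstsq(X, y, rcond=None) computes the same minimizer via a more numerically stable factorization than the normal equations.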
The question is how we use this within a classification setting.

Let ŷ = Xŵ. The elements of ŷ are not in {−1, +1}. That is, we are not getting a bunch of plus and minus ones; we are getting real numbers. For example, ŷᵢ might be 0.5 for one sample and −0.7 for another.

We have to come up with a classification rule.


Classification rule

We predict +1 if ŷᵢ > 0 and −1 if ŷᵢ < 0.

In other words, ỹᵢ = sign(ŷᵢ) ∈ {−1, +1}.

We can then check not only whether ŷᵢ is close to yᵢ, but also whether the predicted labels ỹᵢ equal the labels we actually observed.
For a new sample x_new ∈ ℝᵖ, we want to predict its (unknown) label ŷ_new:

ŷ_new = ⟨ŵ, x_new⟩

ŷ_new is the inner product of the new sample's feature vector and the weight vector estimated from the training data. We use the training data to find a good weight vector, and we can then use that weight vector with the new feature vector to make new predictions.

ŷ_new = ⟨ŵ, x_new⟩
ỹ_new = sign(ŷ_new)
Pipeline of the classification system

• Training Phase
1. Take in all the training data.
2. Learn the optimal weight vector ŵ within the linear model.

• Prediction Phase
3. Given a new sample, apply the trained model.
4. The learned weight vector ŵ is used in the linear model.
5. The system computes the inner product ŷ_new = ⟨ŵ, x_new⟩.
6. The result is a predicted label for x_new (see the sketch below).
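To make the two phases concrete, here is a small end-to-end sketch (my own illustration with synthetic data, not from the lecture), assuming the least squares classifier described above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 2

# --- Training phase ---
# Synthetic binary classification data with labels in {-1, +1}
X = rng.normal(size=(n, p))
y = np.sign(X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=n))

# Learn the weight vector by least squares (normal equations)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# --- Prediction phase ---
x_new = np.array([0.8, -0.5])        # a new, unlabeled sample
y_hat_new = w_hat @ x_new            # inner product <w_hat, x_new>, a real number
y_tilde_new = np.sign(y_hat_new)     # classification rule: threshold at zero
print(y_hat_new, y_tilde_new)
```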
ŵ = argmin_w ‖y − Xw‖₂²

ŵ = (XᵀX)⁻¹ Xᵀy

We said that ŵ is the value of w that minimizes the squared 2-norm of y − Xw. Previously, we derived this equation for ŵ using a geometric argument (projection onto the column space of X). Today, we will rederive this result by treating it as an optimization problem, approaching it from a different perspective.
Optimization approach

Recall the squared 2-norm: ‖x‖₂² = Σᵢ₌₁ⁿ xᵢ² = xᵀx = ⟨x, x⟩

ŵ = argmin_w ‖y − Xw‖₂²
  = argmin_w (y − Xw)ᵀ(y − Xw)
  = argmin_w [yᵀy − (Xw)ᵀy − yᵀXw + (Xw)ᵀ(Xw)]
  = argmin_w [yᵀy − wᵀXᵀy − yᵀXw + wᵀXᵀXw]
  = argmin_w [yᵀy − 2wᵀXᵀy + wᵀXᵀXw]

Here X ∈ ℝ^{n×p}, w ∈ ℝ^{p×1}, and y ∈ ℝ^{n×1}. Both wᵀXᵀy and yᵀXw are scalars, and wᵀXᵀy = yᵀXw, which gives the −2wᵀXᵀy term in the last line.
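A quick numerical sanity check of this expansion (my own sketch with arbitrary random X, y, and w): both forms of the objective agree to floating-point precision.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
w = rng.normal(size=p)

lhs = np.linalg.norm(y - X @ w) ** 2                  # ||y - Xw||_2^2
rhs = y @ y - 2 * (w @ X.T @ y) + w @ (X.T @ X) @ w   # expanded form
print(np.isclose(lhs, rhs))  # True
```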
Optimization approach

ŵ = argmin_w ‖y − Xw‖₂² = argmin_w [yᵀy − 2wᵀXᵀy + wᵀXᵀXw]

• We started off by saying we wanted to minimize the sum of squared errors (residuals).
• We took that objective function and rewrote it as a matrix-vector expression.
• Now we want to figure out how to actually solve this optimization problem.
Warmup (a simple optimization problem in 1D)

f(w) = ½w² − w − ½

ŵ = argmin_w f(w)

df/dw = w − 1 = 0  →  ŵ = 1

We take the derivative and set it equal to zero; that gives us the minimum.
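As a side illustration (not in the lecture), the same minimizer can also be found numerically, for example by simple gradient descent on f; the starting point and step size below are arbitrary choices.

```python
def f(w):
    return 0.5 * w**2 - w - 0.5

def df(w):
    return w - 1.0   # derivative of f

w = 5.0              # arbitrary starting point
step = 0.1           # arbitrary fixed step size
for _ in range(200):
    w -= step * df(w)
print(w)  # converges to 1.0, matching the closed-form answer
```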
So now we want to know how we can solve optimization problems where, instead of optimizing over a scalar w, we optimize over a weight vector.

From the geometric perspective, we saw that XᵀX being invertible was important for ensuring a unique solution. From the optimization perspective, the same condition is important as well.
Positive Definite Matrices

From the geometric perspective, we saw that for finding a unique least squares solution ŵ it was important that XᵀX be invertible. Is this important in the optimization setting as well? Yes!

The following two statements are equivalent for X ∈ ℝ^{n×p} with n ≥ p and rank(X) = p (X has p linearly independent columns):
1. Invertibility: XᵀX ∈ ℝ^{p×p} is invertible (i.e., (XᵀX)⁻¹ exists).
2. Positive definiteness: XᵀX is positive definite.

A matrix Q (= XᵀX) is positive definite (p.d.) if zᵀQz > 0 for all z ≠ 0 (shorthand: Q ≻ 0, read "Q curly greater than zero").

A matrix Q is positive semi-definite (p.s.d.) if zᵀQz ≥ 0 for all z (shorthand: Q ≽ 0, read "Q curly greater than or equal to zero").
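For a symmetric matrix Q, one practical way to check these conditions is via eigenvalues: Q ≻ 0 exactly when all eigenvalues are positive, and Q ≽ 0 when they are all nonnegative. A minimal sketch (the helper names are my own):

```python
import numpy as np

def is_positive_definite(Q, tol=1e-10):
    """Check Q > 0: all eigenvalues of the symmetric matrix Q strictly positive."""
    return bool(np.all(np.linalg.eigvalsh(Q) > tol))

def is_positive_semidefinite(Q, tol=1e-10):
    """Check Q >= 0: all eigenvalues of the symmetric matrix Q nonnegative."""
    return bool(np.all(np.linalg.eigvalsh(Q) >= -tol))

print(is_positive_definite(np.array([[2.0, 0.0], [0.0, 3.0]])))      # True
print(is_positive_definite(np.array([[1.0, 0.0], [0.0, 0.0]])))      # False (only semi-definite)
print(is_positive_semidefinite(np.array([[1.0, 0.0], [0.0, 0.0]])))  # True
```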
Ex 1. (1D) f(x) = Qx²

x ∈ ℝ, Q ∈ ℝ

xᵀQx = Qx². When is Qx² greater than 0?

Qx² > 0 for all x ≠ 0 if Q > 0.

Imagine minimizing f(x) = Qx² with Q > 0 (e.g., f(x) = 1·x²): it is easy to minimize (convex).
Convexity and Optimization

A function is convex if its graph has a bowl-shaped structure (e.g., f(x) = 1·x²). This property makes optimization much easier because:

• Single minimum: convex functions have a unique global minimum.
• Easy to minimize: gradient-based methods converge efficiently.
• Tangent property: at any point on the curve we can compute the tangent, and the function always lies above that tangent.

Optimization theory tells us how to actually find minimizers of convex functions.

In contrast, when Q < 0, the function f(x) = Qx² (e.g., f(x) = −1·x²) is non-convex and hard to minimize.
Ex 2. f(x) = xᵀQx

x ∈ ℝ², Q ∈ ℝ^{2×2}

Case 1: Q = [1 0; 0 1] = I
We try to minimize f(x) = xᵀQx = x₁² + x₂² > 0 for all x ≠ 0 → easy to minimize (convex).

Case 2: Q = [−1 0; 0 −1] = −I
f(x) = xᵀQx = −x₁² − x₂² < 0 for all x ≠ 0 → hard to minimize.
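A tiny numerical check of these two cases (my own sketch): evaluating xᵀQx at a few random points shows how the sign flips with Q.

```python
import numpy as np

rng = np.random.default_rng(3)
Q_pos = np.eye(2)     # Q = I:  x^T Q x = x1^2 + x2^2 > 0
Q_neg = -np.eye(2)    # Q = -I: x^T Q x = -x1^2 - x2^2 < 0

for _ in range(3):
    x = rng.normal(size=2)
    print(x @ Q_pos @ x, x @ Q_neg @ x)  # first value positive, second negative
```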
So thinking about positive definite matrices gives us a lens into how we will solve the optimization problem associated with least squares, and into when we will be able to find a good solution versus when finding a good solution is hard.


Properties of Positive Definite Matrices

1. If P ≻ 0 and Q ≻ 0, then P + Q ≻ 0.
   xᵀPx > 0 and xᵀQx > 0 → xᵀ(P + Q)x = xᵀPx + xᵀQx > 0

2. If Q ≻ 0 and a > 0, then aQ ≻ 0.
   xᵀQx > 0 → xᵀ(aQ)x = a·xᵀQx > 0

3. For any A, AᵀA ≽ 0 and AAᵀ ≽ 0.
   If the columns of A are linearly independent, then AᵀA ≻ 0 (see the sketch after this list).

4. If Q ≻ 0, then Q⁻¹ exists.

5. Q ≽ P means Q − P ≽ 0.
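Here is the small numerical illustration of property 3 mentioned above (my own sketch with made-up matrices): AᵀA is always positive semi-definite, and it is positive definite exactly when the columns of A are linearly independent.

```python
import numpy as np

# Columns linearly independent -> A^T A is positive definite
A_indep = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])
print(np.linalg.eigvalsh(A_indep.T @ A_indep))  # all eigenvalues > 0

# Second column = 2 * first column -> A^T A is only positive semi-definite (singular)
A_dep = np.array([[1.0, 2.0],
                  [1.0, 2.0],
                  [1.0, 2.0]])
print(np.linalg.eigvalsh(A_dep.T @ A_dep))      # one eigenvalue is 0
```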
Recall the Optimization Formulation

ŵ = argmin_w [yᵀy − 2wᵀXᵀy + wᵀXᵀXw]

We try to solve this optimization problem. From property 3 of positive definite matrices and our assumption that the columns of X are linearly independent, XᵀX is positive definite.

XᵀX ≻ 0 → f(w) = yᵀy − 2wᵀXᵀy + wᵀXᵀXw is convex
→ compute the "derivative" and set it to zero to find the minimizer

We're not really dealing with scalars anymore; we have vectors. So, instead of the derivative, we use gradients.

When w is a scalar, we set the derivative df/dw to zero and solve for w.
When w is a vector, we set the gradient ∇_w f to zero and solve for w.

∇_w f = [∂f/∂w₁, ∂f/∂w₂, …, ∂f/∂w_p]ᵀ  (the vector of partial derivatives of f with respect to w)
Ex 1.

f(w) = cᵀw = c₁w₁ + c₂w₂ + … + c_p w_p

∇_w f = [c₁, c₂, …, c_p]ᵀ = c
Ex 2.

f(w) = wᵀw = ‖w‖₂² = w₁² + w₂² + … + w_p²

∇_w f = [2w₁, 2w₂, …, 2w_p]ᵀ = 2w
Ex 3.

f(w) = wᵀQw = Σᵢ₌₁ᵖ Σⱼ₌₁ᵖ wᵢ Qᵢⱼ wⱼ

Let's compute the derivative of just one term in the sum before tackling the whole sum:

∂(wᵢ Qᵢⱼ wⱼ)/∂wₖ =
  2Qᵢᵢwᵢ   if k = i = j
  Qₖⱼwⱼ    if i = k ≠ j
  Qᵢₖwᵢ    if i ≠ k = j
  0        if k ≠ i and k ≠ j

To get the gradient, we then sum the derivatives of all the terms:

∂f/∂wₖ = Σᵢ₌₁ᵖ Σⱼ₌₁ᵖ ∂(wᵢ Qᵢⱼ wⱼ)/∂wₖ

→ ∇_w f = Qw + Qᵀw

If Q (= XᵀX) is symmetric (which is the case in least squares problems, where Q = Qᵀ), then

∇_w f = 2Qw
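To double-check the formula ∇_w(wᵀQw) = Qw + Qᵀw (and 2Qw for symmetric Q), here is a finite-difference gradient check (my own sketch with a random Q):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 4
Q = rng.normal(size=(p, p))   # a general (not necessarily symmetric) matrix
w = rng.normal(size=p)

def f(w):
    return w @ Q @ w

# Analytic gradient from the derivation above
grad_analytic = Q @ w + Q.T @ w

# Central finite-difference approximation of each partial derivative
eps = 1e-6
grad_numeric = np.array([
    (f(w + eps * e) - f(w - eps * e)) / (2 * eps)
    for e in np.eye(p)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```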
Ex. Least Squares

f(w) = yᵀy − 2wᵀXᵀy + wᵀXᵀXw

∇_w f = 0 − 2Xᵀy + 2XᵀXw

Setting ∇_w f = −2Xᵀy + 2XᵀXw = 0 gives

XᵀXw = Xᵀy, and because XᵀX is positive definite, we can take its inverse:

ŵ = (XᵀX)⁻¹ Xᵀy

What we got is the same as the result from the geometric perspective!
(Here we used Ex 1 with c = −2Xᵀy and Ex 3 with Q = XᵀX = Qᵀ.)
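As a final check (my own sketch), the ŵ from the normal equations indeed makes the gradient vanish and matches NumPy's built-in least squares solver:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 80, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)        # normal equations
grad_at_w_hat = -2 * X.T @ y + 2 * X.T @ X @ w_hat
print(np.allclose(grad_at_w_hat, 0))             # True: gradient vanishes at w_hat

w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # factorization-based solver
print(np.allclose(w_hat, w_lstsq))               # True: same solution
```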
This time, we arrived at ŵ = (XᵀX)⁻¹ Xᵀy in a completely different way, using optimization arguments: computing the gradient of the objective function and setting it equal to zero. We also saw how notions of positive definiteness help us understand why having XᵀX invertible is important.
We've assumed that X ∈ ℝ^{n×p} has n ≥ p and that the p columns of X are linearly independent (LI).

• Columns of X are LI → Q = XᵀX ≻ 0 (i.e., XᵀX is positive definite)
• XᵀX is positive definite → the least squares loss f(w) = yᵀy − 2wᵀXᵀy + wᵀXᵀXw is convex
• If XᵀX ≻ 0, then XᵀX has an inverse
• XᵀX has an inverse → ŵ = (XᵀX)⁻¹ Xᵀy exists and is unique (the sketch below shows what goes wrong when the columns are not LI)
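To see why the linear independence assumption matters, here is a sketch (my own, not from the lecture) where one column of X is a multiple of another: XᵀX becomes singular and the minimizer is no longer unique.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([x1, 2 * x1])   # second column is a multiple of the first -> rank(X) = 1
y = rng.normal(size=n)

Q = X.T @ X
print(np.linalg.matrix_rank(Q))     # 1: the 2x2 matrix X^T X is singular, so (X^T X)^{-1} does not exist

# The least squares problem still has minimizers, but not a unique one:
# any vector in the null space of X can be added without changing X @ w.
w0, *_ = np.linalg.lstsq(X, y, rcond=None)   # one particular minimizer (minimum-norm)
w_alt = w0 + np.array([2.0, -1.0])           # [2, -1] is in the null space of X here
print(np.allclose(X @ w0, X @ w_alt))        # True: both give the same predictions and loss
```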
Further Readings
• Any linear algebra & optimization book should be fine!
• Mathematics for Machine Learning (MathML), Chapters 2 & 7: Linear Algebra & Continuous Optimization
Announcements
• Homework #1
• Write up your own review of the material by hand on A4 paper, scan it, and submit it.
• At least 1 page per lecture (lectures 3-5)
• Due Thursday, March 25th, at 11:59 pm
