Lec2 Regression

The document discusses regression techniques, including linear and logistic regression, focusing on their applications in predicting housing prices and classification tasks. It explains concepts such as cost functions, gradient descent algorithms, and the importance of feature scaling and regularization to avoid overfitting. Additionally, it covers multi-class classification using the one-vs-all approach and the use of optimization algorithms for parameter fitting.

Regression

• Linear regression
• Logistic regression
Housing Prices (Portland, OR)
[Scatter plot: Size (feet²) on the x-axis, 0–3000, against Price (in 1000s of dollars) on the y-axis, 0–500.]
Supervised Learning: given the "right answer" for each example in the data.
Regression problem: predict real-valued output.
(Classification: discrete-valued output.)
Training set of housing prices (Portland, OR):

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

Notation:
m = number of training examples
x's = "input" variable / features, e.g. x^(1) = 2104, x^(2) = 1416
y's = "output" variable / "target" variable, e.g. y^(1) = 460
(x, y) – one training example
(x^(i), y^(i)) – the i-th training example
Training Set:

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

Hypothesis: h_θ(x) = θ₀ + θ₁x
θ₀, θ₁: parameters of the model.
How to choose θ₀, θ₁?
Idea: choose θ₀, θ₁ so that h_θ(x) is close to y for our training examples (x, y).
Hypothesis: h_θ(x) = θ₀ + θ₁x
Parameters: θ₀, θ₁
Cost function (squared-error function):
J(θ₀, θ₁) = (1/(2m)) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²
Goal: minimize J(θ₀, θ₁) over θ₀, θ₁.

Simplified version (set θ₀ = 0): hypothesis h_θ(x) = θ₁x, single parameter θ₁, cost J(θ₁), goal minimize J(θ₁).

[Figures: left panels plot h_θ(x) = θ₁x against x for a fixed θ₁ together with the training points; right panels plot J(θ₁) as a function of the parameter θ₁. When θ₁ = 1 the line passes through every training point, so h_θ(x^(i)) = y^(i) and J(1) = 0; other values of θ₁ give a larger cost.]
Hypothesis: h_θ(x) = θ₀ + θ₁x
Parameters: θ₀, θ₁
Cost function: J(θ₀, θ₁) = (1/(2m)) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²
Goal: minimize J(θ₀, θ₁) over θ₀, θ₁.

Have some function J(θ₀, θ₁).
Want min_{θ₀, θ₁} J(θ₀, θ₁).

Outline:
• Start with some θ₀, θ₁ (say θ₀ = 0, θ₁ = 0).
• Keep changing θ₀, θ₁ to reduce J(θ₀, θ₁) until we hopefully end up at a minimum.
Gradient descent algorithm:
repeat until convergence {
  θ_j := θ_j − α · ∂/∂θ_j J(θ₀, θ₁)    (for j = 0 and j = 1)
}

Correct (simultaneous update):
  temp0 := θ₀ − α ∂/∂θ₀ J(θ₀, θ₁)
  temp1 := θ₁ − α ∂/∂θ₁ J(θ₀, θ₁)
  θ₀ := temp0
  θ₁ := temp1

Incorrect (the θ₁ update would use the already-updated θ₀):
  temp0 := θ₀ − α ∂/∂θ₀ J(θ₀, θ₁)
  θ₀ := temp0
  temp1 := θ₁ − α ∂/∂θ₁ J(θ₀, θ₁)
  θ₁ := temp1
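In vectorized Octave code the simultaneous update comes for free, because the whole gradient is computed before θ is overwritten. A sketch (not from the slides) using the linear-regression gradient derived a few slides below; X, y, alpha and m are assumed to be in scope:

grad = zeros(size(theta));
grad(1) = (1 / m) * sum(X * theta - y);               % d/d theta_0
grad(2) = (1 / m) * sum((X * theta - y) .* X(:, 2));  % d/d theta_1
theta = theta - alpha * grad;                         % both parameters updated together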


Gradient descent (single parameter): θ₁ := θ₁ − α · (d/dθ₁) J(θ₁)

If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
At a local optimum the derivative is zero, so the update θ₁ := θ₁ − α · 0 leaves the current value of θ₁ unchanged.
Gradient descent can converge to a local minimum, even with the learning rate α fixed.

As we approach a local
minimum, gradient
descent will automatically
take smaller steps. So, no
need to decrease α over
time.
Gradient descent algorithm for the linear regression model:

∂/∂θ₀ J(θ₀, θ₁) = (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))
∂/∂θ₁ J(θ₀, θ₁) = (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x^(i)

repeat until convergence {
  θ₀ := θ₀ − α (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))
  θ₁ := θ₁ − α (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x^(i)
}  (update θ₀ and θ₁ simultaneously)
J()



J()



(for fixed , this is a function of x) (function of the parameters )
(for fixed , this is a function of x) (function of the parameters )
(for fixed , this is a function of x) (function of the parameters )
(for fixed , this is a function of x) (function of the parameters )
(for fixed , this is a function of x) (function of the parameters )
(for fixed , this is a function of x) (function of the parameters )
(for fixed , this is a function of x) (function of the parameters )
(for fixed , this is a function of x) (function of the parameters )
(for fixed , this is a function of x) (function of the parameters )
“Batch” Gradient Descent
“Batch”: Each step of gradient descent uses all the
training examples.
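A minimal Octave sketch of batch gradient descent for this model (illustrative, not from the slides); X is the m×2 design matrix with a column of ones, and alpha and num_iters are assumed hyperparameters:

function theta = gradientDescent(X, y, theta, alpha, num_iters)
  % Batch gradient descent: every update uses all m training examples.
  m = length(y);
  for iter = 1:num_iters
    errors = X * theta - y;          % h_theta(x^(i)) - y^(i) for all i
    grad   = (X' * errors) / m;      % gradient of J(theta)
    theta  = theta - alpha * grad;   % simultaneous update of all parameters
  end
end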
Stochastic Gradient Descent (SGD)
Updates the parameters for each training example, one by one.

Advantages:
- faster than batch gradient descent on some problems;
- the frequent updates give a detailed picture of the rate of improvement.
Disadvantages:
- performing an update for every single example is more computationally expensive over a full pass than one batch update;
- the frequent updates result in noisy gradients, so the cost fluctuates rather than decreasing smoothly.
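A hedged Octave sketch of one SGD pass (epoch) over the data, under the same conventions as above (not from the slides):

function theta = sgdEpoch(X, y, theta, alpha)
  % Visit the examples in random order, updating theta after each one.
  m = length(y);
  for k = randperm(m)
    x_k = X(k, :)';                      % single example as a column vector
    err = x_k' * theta - y(k);           % h_theta(x^(k)) - y^(k)
    theta = theta - alpha * err * x_k;   % update from this one example
  end
end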

Mini-Batch Gradient Descent

A combination of the concepts of SGD and batch gradient descent:
- splits the training dataset into small batches;
- performs an update for each of these batches.
This creates a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.
Typical mini-batch sizes range between 50 and 256.
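A minimal Octave sketch of one mini-batch epoch (illustrative; batch_size is an assumed hyperparameter):

function theta = miniBatchEpoch(X, y, theta, alpha, batch_size)
  % One shuffled pass over the data, one gradient step per mini-batch.
  m = length(y);
  idx = randperm(m);
  for start = 1:batch_size:m
    batch = idx(start : min(start + batch_size - 1, m));
    Xb = X(batch, :);
    yb = y(batch);
    yb = yb(:);                                          % force a column vector
    grad = (Xb' * (Xb * theta - yb)) / length(batch);    % gradient on this batch
    theta = theta - alpha * grad;
  end
end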


Multiple features (variables)

Size (feet²)    Price ($1000)
2104            460
1416            232
1534            315
852             178
…               …
Multiple features (variables)

Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
2104           5                    1                  45                    460
1416           3                    2                  40                    232
1534           3                    2                  30                    315
852            2                    1                  36                    178
…              …                    …                  …                     …
Notation (for the table above):
n = number of features
x^(i) = the input (features) of the i-th training example
x_j^(i) = the value of feature j in the i-th training example
E.g. x^(2) = [1416, 3, 2, 40] and x_3^(2) = 2.
Hypothesis:
Previously (one variable): h_θ(x) = θ₀ + θ₁x
Multiple variables: h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θ_n x_n
For convenience of notation, define x₀ = 1, so that x = [x₀, x₁, …, x_n] and θ = [θ₀, θ₁, …, θ_n] are (n+1)-dimensional vectors and
h_θ(x) = θᵀx
(θᵀ is a 1 × (n+1) matrix, i.e. a row vector).
Multivariate linear regression

Hypothesis: h_θ(x) = θᵀx = θ₀x₀ + θ₁x₁ + … + θ_n x_n   (with x₀ = 1)
Parameters: θ = (θ₀, θ₁, …, θ_n)
Cost function: J(θ) = (1/(2m)) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²

Gradient descent:
repeat {
  θ_j := θ_j − α · ∂/∂θ_j J(θ)
}  (simultaneously update θ_j for every j = 0, …, n)
Gradient descent

Previously (n = 1):
repeat {
  θ₀ := θ₀ − α (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))
  θ₁ := θ₁ − α (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) x^(i)
}  (simultaneously update θ₀, θ₁)

New algorithm (n ≥ 1):
repeat {
  θ_j := θ_j − α (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) x_j^(i)
}  (simultaneously update θ_j for j = 0, …, n)
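The multivariate batch update above vectorizes to one line in Octave (a sketch under the same conventions as before, with X of size m × (n+1)):

% One gradient-descent step for any number of features:
theta = theta - (alpha / m) * (X' * (X * theta - y));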
Feature Scaling
Idea: make sure features are on a similar scale (e.g. divide size in feet² by 2000 and the number of bedrooms by 5), so that gradient descent converges faster.
Get every feature into approximately a −1 ≤ x_j ≤ 1 range; ranges that are far too big or far too small are a problem.
Mean normalization
Replace x_j with (x_j − μ_j)/s_j to make features have approximately zero mean (do not apply to x₀ = 1), where μ_j is the mean of feature j and s_j is its range (max − min) or standard deviation.
E.g. with an average size of 1000: x₁ = (size − 1000)/2000; the 1–5 bedrooms feature is normalized the same way using its own mean and range.
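A minimal Octave sketch of mean normalization (not from the slides), applied to the raw feature columns before the column of ones is added:

function [X_norm, mu, sigma] = featureNormalize(X)
  % Subtract each feature's mean and divide by its standard deviation.
  mu = mean(X);
  sigma = std(X);
  X_norm = (X - mu) ./ sigma;   % relies on implicit broadcasting (Octave / recent MATLAB)
end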
Gradient descent
- "Debugging": how to make sure gradient descent is working correctly.
- How to choose the learning rate α.
Making sure gradient descent is working correctly
Plot J(θ) against the number of iterations; J(θ) should decrease after every iteration.
[Plot: J(θ) vs. no. of iterations, 0–400.]
Example automatic convergence test: declare convergence if J(θ) decreases by less than some small threshold ε (e.g. 10⁻³) in one iteration.
Making sure gradient descent is working correctly
[Plots: J(θ) vs. no. of iterations curves that increase, or repeatedly rise and fall.] In these cases gradient descent is not working: use a smaller α.
- For sufficiently small α, J(θ) should decrease on every iteration.
- But if α is too small, gradient descent can be slow to converge.
Summary:
- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration; may not converge.
To choose α, try values spaced by roughly a factor of 3, e.g. …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
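An illustrative Octave sketch of such a sweep (not from the slides), recording J after every step so the curves can be compared; X, y and m are as before:

alphas = [0.001 0.003 0.01 0.03 0.1 0.3];
num_iters = 400;
for a = alphas
  theta = zeros(size(X, 2), 1);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    theta = theta - (a / m) * (X' * (X * theta - y));       % one gradient step
    J_history(iter) = sum((X * theta - y) .^ 2) / (2 * m);  % cost after the step
  end
  plot(1:num_iters, J_history); hold on;   % J should fall smoothly for a good alpha
end
xlabel('No. of iterations'); ylabel('J(theta)');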
Polynomial regression
[Plot: Price (y) vs. Size (x) with a curved fit, e.g. a quadratic or cubic in the size.]
Choice of features
[Plot: Price (y) vs. Size (x).] Instead of a plain polynomial in the size, one can define new features, for example using the square root of the size: h_θ(x) = θ₀ + θ₁(size) + θ₂√(size).
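A short Octave sketch (illustrative) of building polynomial features from a raw size column and scaling them, since size, size² and size³ have very different ranges; size_ft is an assumed variable name:

X_poly = [size_ft, size_ft.^2, size_ft.^3];      % x1 = s, x2 = s^2, x3 = s^3
[X_poly, mu, sigma] = featureNormalize(X_poly);  % reuse the earlier normalization sketch
X = [ones(length(size_ft), 1), X_poly];          % prepend the x0 = 1 column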
Normal equation: a method to solve for θ analytically.
Intuition: if θ were a single real number (1D), J(θ) = aθ² + bθ + c; set the derivative dJ/dθ to 0 and solve for θ.
For θ ∈ R^(n+1): set every partial derivative ∂/∂θ_j J(θ) = 0 (for every j) and solve for θ₀, θ₁, …, θ_n.
Examples:

x₀   Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
1    2104           5                    1                  45                    460
1    1416           3                    2                  40                    232
1    1534           3                    2                  30                    315
1    852            2                    1                  36                    178
1    3000           4                    1                  38                    540

X is the m × (n+1) design matrix whose rows are the feature vectors (including x₀ = 1); y is the m-dimensional vector of prices.
m examples, n features.

Normal equation: set θ = (XᵀX)⁻¹ Xᵀ y, where (XᵀX)⁻¹ is the inverse of the matrix XᵀX.
Matlab/Octave: pinv(X'*X)*X'*y
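A hedged Octave sketch putting the pieces together for the table above (variable names are illustrative):

data = [2104 5 1 45 460;
        1416 3 2 40 232;
        1534 3 2 30 315;
         852 2 1 36 178;
        3000 4 1 38 540];
X = [ones(size(data, 1), 1), data(:, 1:4)];   % prepend x0 = 1 to the feature columns
y = data(:, 5);                               % prices in $1000
theta = pinv(X' * X) * X' * y;                % normal equation: no alpha, no iterations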

With m training examples and n features:
Gradient Descent:
• Need to choose α.
• Needs many iterations.
• Works well even when n is large.

Normal Equation:
• No need to choose α.
• Don't need to iterate.
• Need to compute (XᵀX)⁻¹, roughly an O(n³) operation.
• Slow if n is very large.
Normal equation
- What if XᵀX is non-invertible (singular/degenerate)?
- Matlab/Octave: pinv(X'*X)*X'*y uses the pseudo-inverse and still gives a usable θ, whereas inv(X'*X)*X'*y can fail.
What if XᵀX is non-invertible?
• Redundant features (linearly dependent), e.g. x₁ = size in feet² and x₂ = size in m².
• Too many features (e.g. m ≤ n): delete some features, or use regularization.
Logistic regression (classification)

• Classification
• Hypothesis representation
• Decision boundary
• Cost function
• Multi-class classification: One-vs-all
Classification

Email: Spam / Not Spam?
Online Transactions: Fraudulent (Yes / No)?
Tumor: Malignant / Benign?

y ∈ {0, 1}
0: "Negative Class" (e.g., benign tumor)
1: "Positive Class" (e.g., malignant tumor)
h_θ(x) = θᵀx (linear regression applied to a classification problem)
[Plot: Malignant? (0 = No, 1 = Yes) vs. Tumor Size, with a fitted straight line and a 0.5 threshold splitting the axis into "predict y = 0" and "predict y = 1" regions.]

Threshold the classifier output h_θ(x) at 0.5:
If h_θ(x) ≥ 0.5, predict "y = 1"
If h_θ(x) < 0.5, predict "y = 0"
Classification: y = 0 or y = 1.
With linear regression, h_θ(x) can be > 1 or < 0.
Logistic regression: 0 ≤ h_θ(x) ≤ 1.
Logistic Regression Model
Want 0 ≤ h_θ(x) ≤ 1.
h_θ(x) = g(θᵀx), where g(z) = 1/(1 + e^(−z)) is the sigmoid (logistic) function.
[Plot: g(z) is S-shaped, approaching 0 as z → −∞, equal to 0.5 at z = 0, and approaching 1 as z → +∞.]
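A one-line Octave sketch of the sigmoid, which later examples reuse (not from the slides):

function g = sigmoid(z)
  % Logistic function, applied elementwise to scalars, vectors, or matrices.
  g = 1 ./ (1 + exp(-z));
end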
Interpretation of Hypothesis Output
h_θ(x) = estimated probability that y = 1 on input x.
Example: if h_θ(x) = 0.7, tell the patient there is a 70% chance of the tumor being malignant.
Formally, h_θ(x) = P(y = 1 | x; θ), "the probability that y = 1, given x, parameterized by θ".
Since y is 0 or 1, P(y = 0 | x; θ) = 1 − P(y = 1 | x; θ).
Logistic regression: h_θ(x) = g(θᵀx) with g(z) = 1/(1 + e^(−z)).
[Plot: g(z) crossing 0.5 at z = 0.]
Suppose we predict "y = 1" if h_θ(x) ≥ 0.5, which happens exactly when θᵀx ≥ 0,
and predict "y = 0" if h_θ(x) < 0.5, i.e. when θᵀx < 0.
Decision Boundary
[Plot: x₂ vs. x₁ with two classes separated by a straight line.]
With h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂), predict "y = 1" if θ₀ + θ₁x₁ + θ₂x₂ ≥ 0. The line θ₀ + θ₁x₁ + θ₂x₂ = 0 is the decision boundary.
Non-linear decision boundaries
[Plot: x₂ vs. x₁ (each axis from −1 to 1) with two classes separated by the unit circle.]
With h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂²) and suitable parameters, predict "y = 1" if x₁² + x₂² ≥ 1; the decision boundary is the circle x₁² + x₂² = 1.
Training set: {(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))}, m examples, with x ∈ R^(n+1), x₀ = 1, and y ∈ {0, 1}.
h_θ(x) = 1/(1 + e^(−θᵀx))
How to choose the parameters θ?
Cost function
Linear regression used J(θ) = (1/m) Σ_{i=1..m} ½ (h_θ(x^(i)) − y^(i))².
Plugging the logistic h_θ(x) = 1/(1 + e^(−θᵀx)) into this squared-error cost makes J(θ) "non-convex" (many local optima). We therefore build a "convex" cost from the logarithm function (the natural logarithm, with base e).
Logistic regression cost function

Cost(h_θ(x), y) = −log(h_θ(x))       if y = 1
Cost(h_θ(x), y) = −log(1 − h_θ(x))   if y = 0

If y = 1: the cost is 0 if h_θ(x) = 1, but as h_θ(x) → 0 the cost → ∞. This captures the intuition that predicting P(y = 1 | x; θ) ≈ 0 when in fact y = 1 should be penalized by a very large cost.
If y = 0: symmetrically, the cost is 0 if h_θ(x) = 0 and grows to ∞ as h_θ(x) → 1.
[Plots: Cost vs. h_θ(x) on [0, 1] for y = 1 and for y = 0.]
Logistic regression cost function

The two cases combine into a single expression:
Cost(h_θ(x), y) = −y·log(h_θ(x)) − (1 − y)·log(1 − h_θ(x))
If y = 1 this reduces to −log(h_θ(x)); if y = 0 it reduces to −log(1 − h_θ(x)).
Logistic regression cost function

J(θ) = (1/m) Σ_{i=1..m} Cost(h_θ(x^(i)), y^(i))
     = −(1/m) Σ_{i=1..m} [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]

To fit the parameters θ: minimize J(θ) over θ.
To make a prediction given a new x: output h_θ(x) = 1/(1 + e^(−θᵀx)), the estimated probability that y = 1.
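A small Octave sketch (illustrative) that turns this probability into a 0/1 prediction, reusing the sigmoid sketch above:

function p = predict(theta, X)
  % Predict y = 1 whenever h_theta(x) >= 0.5, i.e. whenever theta' * x >= 0.
  p = sigmoid(X * theta) >= 0.5;
end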
Gradient Descent

J(θ) = −(1/m) Σ_{i=1..m} [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
Want min_θ J(θ):
repeat {
  θ_j := θ_j − α · ∂/∂θ_j J(θ)
}  (simultaneously update all θ_j)

The partial derivative works out to ∂/∂θ_j J(θ) = (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) x_j^(i).
Gradient Descent

Want min_θ J(θ):
repeat {
  θ_j := θ_j − α (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) x_j^(i)
}  (simultaneously update all θ_j)

The update rule looks identical to linear regression; the difference is that h_θ(x) is now the sigmoid 1/(1 + e^(−θᵀx)) rather than θᵀx.
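A hedged Octave sketch (not from the slides) of the logistic-regression cost and gradient, written in the [jVal, gradient] form used with the optimizers discussed next:

function [jVal, gradient] = logisticCost(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                         % h_theta(x) for all examples
  jVal = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h));  % J(theta)
  gradient = (X' * (h - y)) / m;                            % dJ/dtheta_j for every j
end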


Optimization algorithm
Cost function J(θ). Want min_θ J(θ).
Given θ, we have code that can compute:
- J(θ)
- ∂/∂θ_j J(θ)   (for j = 0, 1, …, n)

Gradient descent: repeat { θ_j := θ_j − α · ∂/∂θ_j J(θ) }.
Optimization algorithm
Given θ, we have code that can compute:
- J(θ)
- ∂/∂θ_j J(θ)   (for j = 0, 1, …, n)

Optimization algorithms:
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS

Advantages (of the last three):
- No need to manually pick α.
- Often faster than gradient descent.
Disadvantages:
- More complex.
Multiclass classification
Email foldering/tagging: Work, Friends, Family, Hobby
Medical diagnosis: Not ill, Cold, Flu
Weather: Sunny, Cloudy, Rain, Snow


Binary classification vs. multi-class classification:
[Plots: binary classification shows two classes of points in the (x₁, x₂) plane; multi-class classification shows three classes.]

One-vs-all (one-vs-rest):
[Plots: for each class i, the examples of class i are relabeled as the positive class and all remaining examples as the negative class, and a separate classifier h_θ^(i)(x) is trained. Class 1: h_θ^(1)(x); Class 2: h_θ^(2)(x); Class 3: h_θ^(3)(x).]
h_θ^(i)(x) = P(y = i | x; θ)   for each class i.
One-vs-all

Train a logistic regression classifier h_θ^(i)(x) for each class i to predict the probability that y = i.
On a new input x, to make a prediction, pick the class i that maximizes h_θ^(i)(x).
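A hedged Octave sketch of one-vs-all with K classes, reusing the logisticCost sketch above; fminunc and optimset are standard Octave/MATLAB functions, everything else (K, X, y) is assumed:

K = 3;
all_theta = zeros(K, size(X, 2));
options = optimset('GradObj', 'on', 'MaxIter', 400);
for c = 1:K
  initial_theta = zeros(size(X, 2), 1);
  costFn = @(t) logisticCost(t, X, double(y == c));           % class c vs. the rest
  all_theta(c, :) = fminunc(costFn, initial_theta, options)';
end
% Prediction: pick the class whose classifier outputs the largest probability.
[~, prediction] = max(sigmoid(X * all_theta'), [], 2);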
Regularization

• The problem of overfitting


• Cost function
• Regularized logistic regression
Example: Linear regression (housing prices)
[Plots: Price vs. Size fitted with a straight line ("underfit" / "high bias"), a quadratic ("just right"), and a high-order polynomial ("overfit" / "high variance").]

Overfitting: if we have too many features, the learned hypothesis may fit the training set very well (J(θ) ≈ 0), but fail to generalize to new examples (e.g. fail to predict prices on new examples).
Example: Logistic regression
[Plots: two classes in the (x₁, x₂) plane fitted with a linear decision boundary (underfit), a moderate polynomial boundary, and a highly convoluted high-order boundary (overfit); g is the sigmoid function.]
Addressing overfitting:
[Plot: Price vs. Size with a high-order polynomial fit.]
With many features, e.g. x₁ = size of house, x₂ = no. of bedrooms, x₃ = no. of floors, x₄ = age of house, x₅ = average income in neighborhood, x₆ = kitchen size, …, it becomes hard to plot the data and hand-pick which features to keep.
Addressing overfitting:

Options:
1. Reduce the number of features.
   ― Manually select which features to keep.
   ― Model selection algorithm.
2. Regularization.
   ― Keep all the features, but reduce the magnitude/values of the parameters θ_j.
   ― Works well when we have a lot of features, each of which contributes a bit to predicting y.
Intuition
[Plots: Price vs. Size of house, comparing a quadratic fit with a higher-order fit θ₀ + θ₁x + θ₂x² + θ₃x³ + θ₄x⁴.]
Suppose we penalize θ₃ and θ₄ and make them really small, e.g. by adding a large penalty such as 1000·θ₃² + 1000·θ₄² to the cost. Minimizing the penalized cost drives θ₃ ≈ 0 and θ₄ ≈ 0, and the high-order fit collapses back to (roughly) the quadratic.
Regularization

Small values for the parameters θ₀, θ₁, …, θ_n:
― "simpler" hypothesis;
― less prone to overfitting.
Housing example: many features x₁, …, x_n and parameters θ₀, θ₁, …, θ_n, and we do not know in advance which parameters to shrink. So we shrink all of them by adding a regularization term to the cost:
J(θ) = (1/(2m)) [ Σ_{i=1..m} (h_θ(x^(i)) − y^(i))² + λ Σ_{j=1..n} θ_j² ]
λ is the regularization parameter; by convention θ₀ is not penalized (the penalty sum starts at j = 1).
[Plot: Price vs. Size of house. If λ is set to an extremely large value, θ₁, …, θ_n are driven toward 0, h_θ(x) ≈ θ₀ is a flat line, and the model underfits.]
Regularized linear regression
Gradient descent:
repeat {
  θ₀ := θ₀ − α (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) x₀^(i)
  θ_j := θ_j − α [ (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) x_j^(i) + (λ/m) θ_j ]    (j = 1, …, n)
}
Equivalently θ_j := θ_j(1 − α λ/m) − α (1/m) Σ_i (h_θ(x^(i)) − y^(i)) x_j^(i): each step shrinks θ_j slightly before applying the usual update.
Regularized logistic regression
[Plot: two classes in the (x₁, x₂) plane with a high-order polynomial decision boundary that overfits.]
Cost function:
J(θ) = −(1/m) Σ_{i=1..m} [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ] + (λ/(2m)) Σ_{j=1..n} θ_j²
Gradient descent:
repeat {
  θ₀ := θ₀ − α (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) x₀^(i)
  θ_j := θ_j − α [ (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) x_j^(i) + (λ/m) θ_j ]    (j = 1, …, n)
}
The updates look like the regularized linear-regression ones, but h_θ(x) is now the sigmoid.
Advanced optimization
function [jVal, gradient] = costFunction(theta)
  jVal = [ code to compute J(theta) ];
  gradient(1) = [ code to compute dJ/dtheta_0 ];
  gradient(2) = [ code to compute dJ/dtheta_1 ];
  gradient(3) = [ code to compute dJ/dtheta_2 ];
  ...
  gradient(n+1) = [ code to compute dJ/dtheta_n ];
end
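The skeleton above is handed to an advanced optimizer. A hedged usage sketch (fminunc and optimset are standard Octave/MATLAB; the costFunction body is whatever the placeholders compute, e.g. the regularized logistic cost):

options = optimset('GradObj', 'on', 'MaxIter', 100);   % tell fminunc we supply the gradient
initialTheta = zeros(n + 1, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);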


Reference:
- Machine Learning, Andrew Ng, coursera.org
