Lec2 Regression
• Linear regression
• Logistic regression
[Figure: Housing Prices (Portland, OR) – scatter plot of Price (in $1000's) against Size (feet²).]
Supervised Learning: we are given the "right answer" for each example in the data.
Regression problem: predict real-valued output.
Classification problem: predict discrete-valued output.
Training set of housing prices (Portland, OR):

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …
Notation:
m = number of training examples
x's = "input" variable / features (e.g. $x^{(1)} = 2104$, $x^{(2)} = 1416$)
y's = "output" variable / "target" variable (e.g. $y^{(1)} = 460$)
$(x, y)$ – one training example
$(x^{(i)}, y^{(i)})$ – the $i$-th training example
Training Set: the housing-prices table above (size $x$, price $y$).
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
$\theta_i$'s: parameters
How to choose the $\theta_i$'s?
Idea: choose $\theta_0, \theta_1$ so that $h_\theta(x)$ is close to $y$ for our training examples $(x, y)$.
Simplified case ($\theta_0 = 0$):
Hypothesis: $h_\theta(x) = \theta_1 x$
Parameters: $\theta_1$
Cost function: $J(\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Goal: $\min_{\theta_1} J(\theta_1)$
[Figure: cost-function intuition. Left panels plot $h_\theta(x)$ against $x$ for a fixed $\theta_1$ (a function of $x$); right panels plot $J(\theta_1)$ against $\theta_1$ (a function of the parameter $\theta_1$). With $\theta_1 = 1$ the line passes through every training point, so $h_\theta(x^{(i)}) = y^{(i)}$ and $J(1) = 0$; other values of $\theta_1$ give a larger cost.]
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
Parameters: $\theta_0, \theta_1$
Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
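To make the cost function concrete, here is a minimal Octave/MATLAB sketch on the housing data above (the variable names and the particular choice of parameters are illustrative, not from the slides):

% Squared-error cost J(theta0, theta1) on the housing data above.
X = [2104; 1416; 1534; 852];     % sizes in feet^2
y = [460; 232; 315; 178];        % prices in $1000's
m = length(y);
theta0 = 0;  theta1 = 0.2;       % one arbitrary setting of the parameters
predictions = theta0 + theta1 * X;               % h_theta(x) for every example
J = (1 / (2*m)) * sum((predictions - y) .^ 2)    % cost for this theta0, theta1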
Have some function $J(\theta_0, \theta_1)$.
Want $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$.
Outline:
• Start with some $\theta_0, \theta_1$ (say $\theta_0 = 0$, $\theta_1 = 0$).
• Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum.
Gradient descent algorithm:
repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$    (simultaneously for $j = 0$ and $j = 1$)
}
Here $\alpha$ is the learning rate, and the derivative is evaluated at the current value of $\theta_j$.
Gradient descent can converge to a local
minimum, even with the learning rate α fixed.
As we approach a local
minimum, gradient
descent will automatically
take smaller steps. So, no
need to decrease α over
time.
Gradient descent algorithm:
repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$    (for $j = 0$ and $j = 1$)
}
Linear regression model:
$h_\theta(x) = \theta_0 + \theta_1 x$, $\qquad J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Gradient descent algorithm (with the derivatives worked out):
repeat until convergence {
    $\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$
    $\theta_1 := \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}$
}
update $\theta_0$ and $\theta_1$ simultaneously.
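A minimal Octave/MATLAB sketch of these simultaneous updates; the toy data, learning rate, and iteration count are my own choices for illustration:

% Batch gradient descent for h_theta(x) = theta0 + theta1*x (toy data: y = 2x).
X = [1; 2; 3; 4];  y = [2; 4; 6; 8];
m = length(y);
theta0 = 0;  theta1 = 0;
alpha = 0.05;                       % learning rate
for iter = 1:1500
  h = theta0 + theta1 * X;                           % predictions on all m examples
  temp0 = theta0 - alpha * (1/m) * sum(h - y);       % derivative w.r.t. theta0
  temp1 = theta1 - alpha * (1/m) * sum((h - y) .* X);% derivative w.r.t. theta1
  theta0 = temp0;  theta1 = temp1;                   % simultaneous update
end
% On this toy data theta1 approaches 2 and theta0 approaches 0.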
J()
J()
(for fixed , this is a function of x) (function of the parameters )
(for fixed , this is a function of x) (function of the parameters )
(for fixed , this is a function of x) (function of the parameters )
(for fixed , this is a function of x) (function of the parameters )
(for fixed , this is a function of x) (function of the parameters )
(for fixed , this is a function of x) (function of the parameters )
(for fixed , this is a function of x) (function of the parameters )
(for fixed , this is a function of x) (function of the parameters )
(for fixed , this is a function of x) (function of the parameters )
“Batch” Gradient Descent
“Batch”: Each step of gradient descent uses all the
training examples.
Stochastic Gradient Descent (SGD)
It updates the parameters for each training example,
one by one.
Stochastic Gradient Descent (SGD)
Advantages:
- faster than batch GD in some problems.
- the frequent updates give a fairly detailed picture of the rate of improvement.
Disadvantages:
- the frequent updates are more computationally expensive overall.
- the frequent updates can also result in noisy gradients.
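For contrast with the batch version above, a sketch of SGD on the same toy data (again, the data and learning rate are illustrative):

% Stochastic gradient descent: update the parameters after each single example.
X = [1; 2; 3; 4];  y = [2; 4; 6; 8];
m = length(y);
theta0 = 0;  theta1 = 0;  alpha = 0.01;
for epoch = 1:200
  for k = randperm(m)                        % visit the examples in random order
    err = (theta0 + theta1 * X(k)) - y(k);   % error on one example only
    theta0 = theta0 - alpha * err;           % noisy, per-example updates
    theta1 = theta1 - alpha * err * X(k);
  end
end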
Multiple features (variables).

Previously, one feature (size):

Size (feet²)    Price ($1000)
2104            460
1416            232
1534            315
852             178
…               …

Now, multiple features:

Size (feet²)    Number of bedrooms    Number of floors    Age of home (years)    Price ($1000)
2104            5                     1                   45                     460
1416            3                     2                   40                     232
1534            3                     2                   30                     315
852             2                     1                   36                     178
…               …                     …                   …                      …
Notation:
$n$ = number of features (here $n = 4$)
$x^{(i)}$ = input (features) of the $i$-th training example
$x_j^{(i)}$ = value of feature $j$ in the $i$-th training example
E.g. $x^{(2)} = \begin{bmatrix}1416 \\ 3 \\ 2 \\ 40\end{bmatrix}$, $\quad x_3^{(2)} = 2$.
Hypothesis:
Previously (one variable): $h_\theta(x) = \theta_0 + \theta_1 x$
Multiple variables: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$
For convenience of notation, define $x_0 = 1$ (i.e. $x_0^{(i)} = 1$). Then
$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \theta^T x$,
where $x, \theta \in \mathbb{R}^{n+1}$ and $\theta^T$ is a $1 \times (n+1)$ matrix.
Gradient descent:
Repeat {
    $\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
}    (simultaneously update $\theta_j$ for $j = 0, 1, \ldots, n$)
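A vectorized Octave/MATLAB sketch of this update; the function name is mine, and X is assumed to be the $m \times (n+1)$ design matrix with the $x_0 = 1$ column:

% Save as gradientDescent.m
function theta = gradientDescent(X, y, theta, alpha, num_iters)
  % X: m x (n+1) design matrix (first column all ones); theta: (n+1) x 1.
  m = length(y);
  for iter = 1:num_iters
    grad = (1/m) * X' * (X * theta - y);   % all partial derivatives at once
    theta = theta - alpha * grad;          % simultaneous update of every theta_j
  end
end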
Feature Scaling
Idea: make sure features are on a similar scale.
Get every feature into approximately a $-1 \le x_i \le 1$ range; ranges much bigger than this are too big, and ranges much smaller are too small, so rescale both.
Mean normalization
Replace $x_i$ with $x_i - \mu_i$ to make the features have approximately zero mean (do not apply this to $x_0 = 1$); in practice, also divide by the feature's range or standard deviation, $x_i \leftarrow \frac{x_i - \mu_i}{s_i}$.
E.g. if the average size is 1000 feet², subtract 1000 from the size feature; if the number of bedrooms ranges from 1 to 5, subtract its mean and divide by its range.
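A sketch of mean normalization in Octave/MATLAB, using the standard deviation as the scale (the range would also work). The element-wise subtraction relies on implicit broadcasting (Octave, or MATLAB R2016b and later):

% Feature scaling / mean normalization; do not apply to the x_0 = 1 column.
X = [2104 5; 1416 3; 1534 3; 852 2];   % raw features, e.g. [size, bedrooms]
mu    = mean(X);                       % per-feature mean
sigma = std(X);                        % per-feature standard deviation
X_norm = (X - mu) ./ sigma;            % each feature now has ~zero mean, ~unit scale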
Gradient descent
Plot $J(\theta)$ as a function of the number of iterations; $J(\theta)$ should decrease on every iteration.
Example automatic convergence test: declare convergence if $J(\theta)$ decreases by less than $10^{-3}$ in one iteration.
[Figure: $J(\theta)$ vs. number of iterations (0–400), steadily decreasing.]
Making sure gradient descent is working correctly.
If $J(\theta)$ is increasing (or repeatedly bouncing up and down) as the iterations proceed, gradient descent is not working: use a smaller $\alpha$.
[Figure: $J(\theta)$ vs. number of iterations, increasing.]
To choose $\alpha$, try a range of values spaced roughly by factors of three, e.g. …, 0.003, 0.01, 0.03, 0.1, 0.3, …
Polynomial regression
[Figure: Price (y) vs. Size (x) with a curved fit.]
Fit a polynomial in the size, e.g. $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$, by treating $x_1 = (\text{size})$, $x_2 = (\text{size})^2$, $x_3 = (\text{size})^3$ as features and applying linear regression.
Choice of features
[Figure: Price (y) vs. Size (x).]
We are free to define new features, e.g. using $(\text{size})^2$ or $\sqrt{\text{size}}$ instead of higher powers, depending on the shape of the data.
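One way to implement polynomial regression, sketched below, is to build the powers of size as new feature columns and reuse linear regression (feature scaling becomes essential because size³ is enormous compared to size):

% Polynomial regression via feature construction: size, size^2, size^3.
sz = [2104; 1416; 1534; 852];
X = [sz, sz.^2, sz.^3];                % new feature columns
X = (X - mean(X)) ./ std(X);           % scale the features (see feature scaling above)
X = [ones(length(sz), 1), X];          % add the x_0 = 1 column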
Normal equation: method to solve for $\theta$ analytically.
Intuition: if $\theta$ were a single number (1D), $J(\theta) = a\theta^2 + b\theta + c$; set $\frac{d}{d\theta}J(\theta) = 0$ and solve for $\theta$.
For $\theta \in \mathbb{R}^{n+1}$: set $\frac{\partial}{\partial \theta_j}J(\theta) = 0$ (for every $j$) and solve for $\theta_0, \theta_1, \ldots, \theta_n$.
Examples ($m = 4$), with the extra $x_0 = 1$ column added:

$x_0$    Size (feet²)    Number of bedrooms    Number of floors    Age of home (years)    Price ($1000)
1        2104            5                     1                   45                     460
1        1416            3                     2                   40                     232
1        1534            3                     2                   30                     315
1        852             2                     1                   36                     178

$X = \begin{bmatrix}1 & 2104 & 5 & 1 & 45\\ 1 & 1416 & 3 & 2 & 40\\ 1 & 1534 & 3 & 2 & 30\\ 1 & 852 & 2 & 1 & 36\end{bmatrix}$ ($m \times (n+1)$ design matrix), $\qquad y = \begin{bmatrix}460\\232\\315\\178\end{bmatrix}$ (an $m$-dimensional vector).
Examples:

$x_0$    Size (feet²)    Number of bedrooms    Number of floors    Age of home (years)    Price ($1000)
1        2104            5                     1                   45                     460
1        1416            3                     2                   40                     232
1        1534            3                     2                   30                     315
1        852             2                     1                   36                     178
1        3000            4                     1                   38                     540

$m = 5$ examples; $n = 4$ features.
Set $\theta = (X^TX)^{-1}X^Ty$, where $(X^TX)^{-1}$ is the inverse of the matrix $X^TX$.
In Matlab/Octave: pinv(X'*X)*X'*y
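Putting the normal equation together with the table above (a sketch; pinv is the Octave/MATLAB pseudo-inverse):

% Normal equation: theta = pinv(X'*X)*X'*y; no feature scaling or alpha needed.
X = [1 2104 5 1 45;
     1 1416 3 2 40;
     1 1534 3 2 30;
     1  852 2 1 36;
     1 3000 4 1 38];             % m x (n+1) design matrix with the x_0 = 1 column
y = [460; 232; 315; 178; 540];   % prices in $1000's
theta = pinv(X' * X) * X' * y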
With $m$ training examples and $n$ features:

Gradient Descent                              Normal Equation
• Need to choose $\alpha$.                    • No need to choose $\alpha$.
• Needs many iterations.                      • Don't need to iterate.
• Works well even when $n$ is large.          • Need to compute $(X^TX)^{-1}$.
                                              • Slow if $n$ is very large.
• Classification
• Hypothesis representation
• Decision boundary
• Cost function
• Multi-class classification: One-vs-all
Classification
Example: given the tumor size, predict Yes (1) or No (0).
[Figure: outcome (1 = Yes, 0 = No) plotted against Tumor Size, with a threshold on the axis separating "Predict $y = 0$" (left) from "Predict $y = 1$" (right).]
Logistic Regression: want $0 \le h_\theta(x) \le 1$.
Logistic Regression Model
$h_\theta(x) = g(\theta^T x)$, where $g(z) = \dfrac{1}{1 + e^{-z}}$ is the sigmoid function (logistic function).
[Figure: $g(z)$ rises from 0 to 1, passing through 0.5 at $z = 0$.]
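A one-line Octave/MATLAB sketch of the sigmoid:

% Sigmoid (logistic) function, applied element-wise.
g = @(z) 1 ./ (1 + exp(-z));
g(0)              % = 0.5
g([-10 0 10])     % tails approach 0 and 1
% Hypothesis: h_theta(x) = g(theta' * x)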
Interpretation of Hypothesis Output
$h_\theta(x)$ = estimated probability that $y = 1$ on input $x$.
Example: if $h_\theta(x) = 0.7$ for a particular input, the estimated probability that $y = 1$ is 70%.
Suppose we predict "$y = 1$" if $h_\theta(x) \ge 0.5$ and predict "$y = 0$" if $h_\theta(x) < 0.5$. Since $g(z) \ge 0.5$ exactly when $z \ge 0$, this means predicting $y = 1$ whenever $\theta^T x \ge 0$.
Decision Boundary
[Figure: training examples of two classes in the $(x_1, x_2)$ plane, separated by a straight line.]
With $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$, predict "$y = 1$" if $\theta_0 + \theta_1 x_1 + \theta_2 x_2 \ge 0$.
The line $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$ is the decision boundary: it separates the region where the hypothesis predicts $y = 1$ from the region where it predicts $y = 0$.
Non-linear decision boundaries
[Figure: two classes in the $(x_1, x_2)$ plane, separated by the unit circle (axes from $-1$ to $1$).]
By adding polynomial features, e.g. $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$, we can obtain the boundary $x_1^2 + x_2^2 = 1$ (for instance with $\theta_0 = -1$, $\theta_3 = \theta_4 = 1$ and the remaining parameters 0).
Predict "$y = 1$" if $x_1^2 + x_2^2 \ge 1$.
Training set: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$ – $m$ examples, with $x \in \mathbb{R}^{n+1}$ ($x_0 = 1$) and $y \in \{0, 1\}$; $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$. How do we choose the parameters $\theta$?
Logistic regression: reusing the squared-error cost of linear regression with this $h_\theta$ makes $J(\theta)$ "non-convex" (many local optima); the logarithmic cost below makes it "convex".
• Logarithm function
• Natural logarithm with base $e$
Logistic regression cost function
$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases}-\log(h_\theta(x)) & \text{if } y = 1\\ -\log(1 - h_\theta(x)) & \text{if } y = 0\end{cases}$$
If $y = 1$: Cost $= 0$ if $h_\theta(x) = 1$, but as $h_\theta(x) \to 0$, Cost $\to \infty$ – a confident wrong prediction is penalized very heavily. [Figure: $-\log(h_\theta(x))$ for $h_\theta(x) \in [0, 1]$.]
If $y = 0$: Cost $= 0$ if $h_\theta(x) = 0$, but as $h_\theta(x) \to 1$, Cost $\to \infty$. [Figure: $-\log(1 - h_\theta(x))$ for $h_\theta(x) \in [0, 1]$.]
Because $y$ is always 0 or 1, the two cases can be written in a single line:
$\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$.
Logistic regression cost function
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}\big(h_\theta(x^{(i)}), y^{(i)}\big) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big)\Big]$$
To fit the parameters $\theta$: $\min_\theta J(\theta)$.
Want $\min_\theta J(\theta)$: Repeat { $\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$ }, where
$\frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)}$.
Gradient Descent
Want $\min_\theta J(\theta)$:
Repeat {
    $\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)}$
}    (simultaneously update all $\theta_j$)
The update rule looks identical to the one for linear regression, but here $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$.
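A sketch of the cost and gradient in Octave/MATLAB (the function name is mine; X is the $m \times (n+1)$ design matrix, y is an $m \times 1$ vector of 0/1 labels):

% Save as logisticCost.m
function [J, grad] = logisticCost(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                        % h_theta(x) for all examples
  J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));    % the cost J(theta)
  grad = (1/m) * X' * (h - y);                           % all partial derivatives
end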
Optimization algorithm
Given $\theta$, we have code that can compute:
- $J(\theta)$
- $\frac{\partial}{\partial\theta_j}J(\theta)$    (for $j = 0, 1, \ldots, n$)
[Figure: training data with three classes in the $(x_1, x_2)$ plane.]
One-vs-all (one-vs-rest):
[Figure: the three-class problem is turned into three binary problems; in each, one class is treated as positive and the other two as negative.]
Class 1: $h_\theta^{(1)}(x)$
Class 2: $h_\theta^{(2)}(x)$
Class 3: $h_\theta^{(3)}(x)$
One-vs-all
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$. On a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.
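A sketch of the one-vs-all prediction step (the function name and the all_theta layout are my own conventions):

% Save as predictOneVsAll.m
function p = predictOneVsAll(all_theta, X)
  % all_theta: K x (n+1), one row of parameters per class; X: m x (n+1).
  probs = 1 ./ (1 + exp(-(X * all_theta')));   % h^(i)(x) for every class and example
  [~, p] = max(probs, [], 2);                  % pick the class with the largest h^(i)(x)
end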
[Figure: fitting Price vs. Size with hypotheses of increasing flexibility: an underfit straight line, a reasonable fit, and an overfit high-order polynomial.]
[Figure: the same for logistic regression in the $(x_1, x_2)$ plane (with $g$ = the sigmoid function): an underfit linear boundary, a reasonable boundary, and an overfit, highly contorted boundary.]
Addressing overfitting:
Example features: $x_1$ = size of house, $x_2$ = no. of bedrooms, $x_3$ = no. of floors, $x_4$ = age of house, $x_5$ = average income in neighborhood, $x_6$ = kitchen size, …
[Figure: Price vs. Size; with many features and few training examples, the fitted curve can overfit.]
Addressing overfitting:
Options:
1. Reduce number of features.
― Manually select which features to keep.
― Model selection algorithm.
2. Regularization.
― Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
― Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
Intuition
[Figure: Price vs. Size – a quadratic fit vs. a higher-order fit that overfits.]
Suppose we penalize two of the higher-order parameters, say $\theta_3$ and $\theta_4$, and make them really small, e.g. by adding a large penalty such as $1000\,\theta_3^2 + 1000\,\theta_4^2$ to the cost being minimized. The optimal $\theta_3$ and $\theta_4$ then end up close to 0, and the high-order hypothesis behaves almost like a quadratic.
Regularization.
Small values for the parameters $\theta_1, \theta_2, \ldots, \theta_n$ give a "simpler" hypothesis that is less prone to overfitting. Note: do not penalize $\theta_0$.
Regularized cost function:
$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$
where $\lambda$ is the regularization parameter.
[Figure: Price vs. Size of house – the regularized fit is smoother than the unregularized one.]
Regularized linear regression
Gradient descent:
Repeat {
    $\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_0^{(i)}$
    $\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$    (for $j = 1, 2, \ldots, n$)
}
Equivalently, $\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)}$.
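A vectorized sketch of one regularized update (the function name is mine; theta(1) corresponds to $\theta_0$ and is left unregularized):

% Save as regularizedStep.m
function theta = regularizedStep(theta, X, y, alpha, lambda)
  m = length(y);
  grad = (1/m) * X' * (X * theta - y);                     % unregularized gradient
  grad(2:end) = grad(2:end) + (lambda/m) * theta(2:end);   % add (lambda/m)*theta_j, j >= 1
  theta = theta - alpha * grad;                            % one gradient descent step
end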
Regularized logistic regression.
[Figure: a highly non-linear decision boundary in the $(x_1, x_2)$ plane that overfits the training data.]
Cost function:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
Gradient descent:
Repeat {
    $\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_0^{(i)}$
    $\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$    (for $j = 1, \ldots, n$)
}
with $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$.
Advanced optimization
function [jVal, gradient] = costFunction(theta)
  jVal = [ code to compute J(theta) ];
  gradient = [ code to compute the partial derivatives of J(theta) ];
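To show how this template is used, here is a sketch with a deliberately simple, made-up cost whose minimum is known ($\theta = [5; 5]$), passed to fminunc (built into Octave; in MATLAB it is part of the Optimization Toolbox):

% --- costFunction.m (toy example, not the logistic regression cost) ---
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;   % J(theta)
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);             % dJ/dtheta_1
  gradient(2) = 2 * (theta(2) - 5);             % dJ/dtheta_2
end

% --- at the Octave/MATLAB prompt ---
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)
% optTheta should come back close to [5; 5].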