ML - Lec 4-Introduction To Regression

This document introduces regression analysis, focusing on linear regression and its variations, including Bayesian and multivariate regression. It explains the concepts of maximum likelihood estimation and weighted regression, as well as the inclusion of constant terms and handling varying noise in data. The document also touches on non-linear regression and polynomial regression, providing a comprehensive overview of regression techniques and their applications.

Lecture 4: An Introduction to Regression
based on Andrew W. Moore's slides


Single-Parameter Linear Regression
2
Linear Regression

DATASET
inputs     outputs
x1 = 1     y1 = 1
x2 = 3     y2 = 2.2
x3 = 2     y3 = 2
x4 = 1.5   y4 = 1.9
x5 = 4     y5 = 3.1

Linear regression assumes that the expected value of the
output given an input, E[y|x], is linear.
Simplest case: Out(x) = wx for some unknown w.
Given the data, we can estimate w.
3
1-parameter linear regression
Assume that the data is formed by
  yi = w xi + noisei
where…
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance σ²

p(y | w, x) has a normal distribution with
• mean wx
• variance σ²
4
Bayesian Linear Regression
p(y | w, x) = Normal(mean wx, variance σ²)

We have a set of datapoints (x1,y1), (x2,y2), …, (xn,yn) which
are EVIDENCE about w.

We want to infer w from the data:
  p(w | x1, x2, x3, …, xn, y1, y2, …, yn)

• You can use BAYES rule to work out a posterior distribution for w given the data.
• Or you could do Maximum Likelihood Estimation.
5
Maximum likelihood estimation of w

Asks the question:
"For which value of w is this data most likely to have happened?"
<=>
For what w is
  p(y1, y2, …, yn | x1, x2, x3, …, xn, w) maximized?
<=>
For what w is
  Π_{i=1..n} p(yi | w, xi) maximized?
6
For what w is
  Π_{i=1..n} p(yi | w, xi) maximized?
<=>
For what w is
  Π_{i=1..n} exp( -½ ((yi - w xi)/σ)² ) maximized?
<=>
For what w is
  Σ_{i=1..n} -½ ((yi - w xi)/σ)² maximized?
<=>
For what w is
  Σ_{i=1..n} (yi - w xi)² minimized?

Least Squares
7
Linear Regression

The maximum likelihood w is the one that minimizes the
sum-of-squares of residuals:

  E(w) = Σ_i (yi - w xi)²
       = (Σ_i yi²) - 2 (Σ_i xi yi) w + (Σ_i xi²) w²

We want to minimize a quadratic function of w.
(E(w) plotted against w is a parabola.)
8
Linear Regression

Easy to show the sum of squares is minimized when

  w = (Σ_i xi yi) / (Σ_i xi²)

The maximum likelihood model is

  Out(x) = wx

We can use it for prediction.
9
Linear Regression

Easy to show the sum of squares is minimized when

  w = (Σ_i xi yi) / (Σ_i xi²)

The maximum likelihood model is

  Out(x) = wx

We can use it for prediction.

Note: In Bayesian stats you'd have ended up with a probability
distribution p(w) over w, and predictions would have given a
probability distribution over the expected output.

Often useful to know your confidence. Max likelihood can give
some kinds of confidence too.
10
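As a concrete illustration, here is a minimal sketch (not from the slides) of the estimate w = Σ xi yi / Σ xi², applied to the dataset from slide 3:

import numpy as np

# Dataset from the "Linear Regression" slide (slide 3)
x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

# Maximum-likelihood / least-squares slope for the model Out(x) = w*x
w = np.sum(x * y) / np.sum(x * x)

def out(x_new):
    # Prediction with the fitted model
    return w * x_new

print(w, out(2.5))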
Regression example
• Generated: w=2
• Recovered: w=2.03
• Noise: std=1

11
Regression example
• Generated: w=2
• Recovered: w=2.05
• Noise: std=2

12
Regression example
• Generated: w=2
• Recovered: w=2.08
• Noise: std=4

13
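A minimal simulation sketch of these examples (an assumed setup, not the slides' actual code): generate data with true w = 2 and Gaussian noise, then recover w with the least-squares formula above.

import numpy as np

rng = np.random.default_rng(0)
true_w, noise_std = 2.0, 1.0          # try noise_std = 2 and 4 as in the slides

x = rng.uniform(0.0, 5.0, size=50)
y = true_w * x + rng.normal(0.0, noise_std, size=50)

w_hat = np.sum(x * y) / np.sum(x * x)
print(round(w_hat, 2))                # close to 2; the spread grows with noise_std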
Multivariate Linear Regression
14
Multivariate Regression
What if the inputs are vectors?

(figure: a 2-d input example, with datapoints scattered in the (x1, x2) plane,
each labelled with its output value)

Dataset has form
x1   y1
x2   y2
x3   y3
 :    :
xR   yR
15
Multivariate Regression
Write matrix X and vector Y thus:

  X = [ ..... x1 ..... ]   [ x11 x12 ... x1m ]        [ y1 ]
      [ ..... x2 ..... ] = [ x21 x22 ... x2m ]    Y = [ y2 ]
      [       :        ]   [  :    :      :  ]        [  : ]
      [ ..... xR ..... ]   [ xR1 xR2 ... xRm ]        [ yR ]

(there are R datapoints; each input has m components)

The linear regression model assumes a vector w such that
  Out(x) = wᵀx = w1 x[1] + w2 x[2] + … + wm x[m]

The max. likelihood w is  w = (XᵀX)⁻¹(XᵀY)
16
Multivariate Regression
(figure only)
18
Multivariate Regression (con't)

The max. likelihood w is  w = (XᵀX)⁻¹(XᵀY)

XᵀX is an m × m matrix: its (i,j)'th element is  Σ_{k=1..R} xki xkj

XᵀY is an m-element vector: its i'th element is  Σ_{k=1..R} xki yk
19
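A minimal sketch of the multivariate case on simulated data (the data here is hypothetical, not from the slides):

import numpy as np

rng = np.random.default_rng(1)
R, m = 100, 3                              # R datapoints, m input components
X = rng.normal(size=(R, m))
true_w = np.array([1.5, -2.0, 0.5])
Y = X @ true_w + rng.normal(0.0, 0.1, size=R)

# Maximum-likelihood weights w = (X^T X)^{-1} (X^T Y);
# solving the normal equations is preferred to forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ Y)
print(w)                                   # close to true_w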
Constant Term in Linear Regression
20
What about a constant term?
We may expect linear data that does not go through the origin.

Statisticians and Neural Net Folks all agree on a simple obvious hack.

Can you guess??
21
The constant term
• The trick is to create a fake input "X0" that always takes the value 1.

Before:            After:
X1  X2   Y         X0  X1  X2   Y
 2   4  16          1   2   4  16
 3   4  17          1   3   4  17
 5   5  20          1   5   5  20

Before: Y = w1X1 + w2X2 …has to be a poor model.
After:  Y = w0X0 + w1X1 + w2X2 = w0 + w1X1 + w2X2 …has a fine constant term.

In this example, you should be able to see the MLE w0, w1 and w2 by inspection.
22
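A minimal sketch of the X0 trick applied to the table above; for this small table the fit comes out exactly, so the printed weights should be approximately [10, 1, 1]:

import numpy as np

X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

# Prepend the fake input X0 = 1 so that w[0] becomes the constant term w0
X_aug = np.column_stack([np.ones(len(X)), X])

w = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ Y)
print(w)   # expect [10, 1, 1]: Y = 10 + 1*X1 + 1*X2 fits this table exactly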
Linear Regression with Varying Noise
23
Regression with varying noise
• Suppose you know the variance of the noise that was added to each datapoint.

xi    yi    σi²
½     ½     4
1     1     1
2     1     1/4
2     3     4
3     2     1/4

(figure: the datapoints plotted in the (x, y) plane, each with an error bar of width σi)

Assume yi ~ N(w xi, σi²)
24
MLE estimation with varying noise

argmax_w  log p(y1, y2, …, yR | x1, x2, …, xR, σ1², σ2², …, σR², w)
    (assuming independence among the noise terms, then plugging in
     the equation for the Gaussian and simplifying)

= argmin_w  Σ_{i=1..R} (yi − w xi)² / σi²
    (setting dLL/dw equal to zero)

= the w such that  Σ_{i=1..R} xi (yi − w xi) / σi² = 0
    (trivial algebra)

= ( Σ_{i=1..R} xi yi / σi² ) / ( Σ_{i=1..R} xi² / σi² )
25
This is Weighted Regression
• We are asking to minimize the weighted sum of squares

  argmin_w  Σ_{i=1..R} (yi − w xi)² / σi²

(figure: the same datapoints with their error bars)

where the weight for the i'th datapoint is 1/σi²
26
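A minimal sketch of weighted regression for the varying-noise table on slide 24, using the closed form derived above:

import numpy as np

x   = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
y   = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
var = np.array([4.0, 1.0, 0.25, 4.0, 0.25])   # known noise variance sigma_i^2

# w = (sum_i x_i y_i / sigma_i^2) / (sum_i x_i^2 / sigma_i^2);
# each datapoint is weighted by 1/sigma_i^2, so low-noise points dominate.
weights = 1.0 / var
w = np.sum(weights * x * y) / np.sum(weights * x * x)
print(w)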
Non-linear Regression
27
Non-linear Regression
• Suppose you know that y is related to a function of x in such a way that the
  predicted values have a non-linear dependence on w, e.g.:

xi    yi
½     ½
1     2.5
2     3
3     2
3     3

(figure: the datapoints plotted in the (x, y) plane)

Assume yi ~ N(√(w + xi), σ²)
28
Non-linear MLE estimation

argmax_w  log p(y1, y2, …, yR | x1, x2, …, xR, σ, w)
    (assuming i.i.d. and then plugging in the equation
     for the Gaussian and simplifying)

= argmin_w  Σ_{i=1..R} (yi − √(w + xi))²
    (setting dLL/dw equal to zero)

= the w such that  Σ_{i=1..R} (yi − √(w + xi)) / √(w + xi) = 0
29
Non-linear MLE estimation

argmax_w  log p(y1, y2, …, yR | x1, x2, …, xR, σ, w)
= argmin_w  Σ_{i=1..R} (yi − √(w + xi))²
= the w such that  Σ_{i=1..R} (yi − √(w + xi)) / √(w + xi) = 0

Is there an algebraic solution???
30
Non-linear MLE estimation

argmax_w  log p(y1, y2, …, yR | x1, x2, …, xR, σ, w)
= argmin_w  Σ_{i=1..R} (yi − √(w + xi))²
= the w such that  Σ_{i=1..R} (yi − √(w + xi)) / √(w + xi) = 0

An algebraic solution??? Common (but not the only) approach: numerical solutions
• Line Search
• Simulated Annealing
• Gradient Descent
• Conjugate Gradient
• Levenberg-Marquardt
• Newton's Method

Also, special-purpose statistical-optimization-specific tricks such as E.M.
(see the Gaussian Mixtures lecture for an introduction).
31
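As a concrete illustration of one of the options above, here is a minimal gradient-descent sketch for the example model yi ~ N(√(w + xi), σ²), using the dataset from slide 28 (the learning rate and starting point are arbitrary choices):

import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0, 3.0])
y = np.array([0.5, 2.5, 3.0, 2.0, 3.0])

def grad(w):
    pred = np.sqrt(w + x)
    # d/dw of sum_i (y_i - sqrt(w + x_i))^2 = -sum_i (y_i - sqrt(w + x_i)) / sqrt(w + x_i)
    return -np.sum((y - pred) / pred)

w, lr = 1.0, 0.05
for _ in range(500):          # plain gradient descent on the sum of squares
    w = w - lr * grad(w)
print(w)                      # approximately satisfies the dLL/dw = 0 condition above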
Polynomial Regression
32
Polynomial Regression
So far we've mainly been dealing with linear regression.

X1  X2  Y
 3   2  7
 1   1  3
 :   :  :

X = [ 3 2        y = [ 7
      1 1              3
      : : ]            : ]
x1 = (3,2)…   y1 = 7…

Z = [ 1 3 2      y = [ 7
      1 1 1            3
      : : : ]          : ]
z1 = (1,3,2)…   y1 = 7…
zk = (1, xk1, xk2)

b = (ZᵀZ)⁻¹(Zᵀy)

yest = b0 + b1 x1 + b2 x2
33
Quadratic Regression
It's trivial to do linear fits of fixed nonlinear basis functions.

X1  X2  Y
 3   2  7
 1   1  3
 :   :  :

X = [ 3 2        y = [ 7
      1 1              3
      : : ]            : ]
x1 = (3,2)…   y1 = 7…

Z = [ 1 3 2 9 6 4      y = [ 7
      1 1 1 1 1 1            3
      : : : : : : ]          : ]
z = (1, x1, x2, x1², x1x2, x2²)

b = (ZᵀZ)⁻¹(Zᵀy)

yest = b0 + b1 x1 + b2 x2 + b3 x1² + b4 x1x2 + b5 x2²
34
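A minimal sketch of a quadratic fit as a linear regression on the fixed basis z = (1, x1, x2, x1², x1x2, x2²); the data here is simulated, not the slides' table:

import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-2.0, 2.0, size=(50, 2))
true_b = np.array([1.0, 2.0, -1.0, 0.5, 0.3, -0.2])

def quad_terms(x1, x2):
    # one row of Z: constant, linear and quadratic terms
    return [1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2]

Z = np.array([quad_terms(x1, x2) for x1, x2 in X])
y = Z @ true_b + rng.normal(0.0, 0.05, size=len(X))

b = np.linalg.solve(Z.T @ Z, Z.T @ y)   # b = (Z^T Z)^{-1} (Z^T y)
print(b)                                # close to true_b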
Quadratic Regression
Each component of a z vector is called a term.
Each column of the Z matrix is called a term column.

How many terms in a quadratic regression with m inputs?
• 1 constant term
• m linear terms
• (m+1)-choose-2 = m(m+1)/2 quadratic terms

(m+2)-choose-2 terms in total = O(m²)

Note that solving b = (ZᵀZ)⁻¹(Zᵀy) is thus O(m⁶).
35
Qth-degree polynomial Regression

X1  X2  Y
 3   2  7
 1   1  3
 :   :  :

X = [ 3 2        y = [ 7
      1 1              3
      : : ]            : ]
x1 = (3,2)…   y1 = 7…

Z = [ 1 3 2 9 6 …      y = [ 7
      1 1 1 1 1 …            3
      : : : : : … ]          : ]
z = (all products of powers of inputs in which the sum of the powers is Q or less)

b = (ZᵀZ)⁻¹(Zᵀy)

yest = b0 + b1 x1 + …
36
m inputs, degree Q: how many terms?

= the number of unique terms of the form
    x1^q1 x2^q2 … xm^qm    where Σ_{i=1..m} qi ≤ Q

= the number of unique terms of the form
    1^q0 x1^q1 x2^q2 … xm^qm    where Σ_{i=0..m} qi = Q

= the number of lists of non-negative integers [q0, q1, q2, …, qm]
  in which Σ qi = Q

= the number of ways of placing Q red disks on a row of squares of length Q+m

= (Q+m)-choose-Q

Example: Q=11, m=4
  q0=2  q1=2  q2=0  q3=4  q4=3
37
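A quick brute-force check of the counting argument (the number of exponent vectors with Σ qi ≤ Q should equal (Q+m)-choose-Q):

from itertools import product
from math import comb

def n_terms_brute(m, Q):
    # count exponent vectors (q1, ..., qm) with q1 + ... + qm <= Q
    return sum(1 for qs in product(range(Q + 1), repeat=m) if sum(qs) <= Q)

m, Q = 4, 5
print(n_terms_brute(m, Q), comb(Q + m, Q))   # both print 126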
Radial Basis Functions
38
Radial Basis Functions (RBFs)

X1  X2  Y
 3   2  7
 1   1  3
 :   :  :

X = [ 3 2        y = [ 7
      1 1              3
      : : ]            : ]
x1 = (3,2)…   y1 = 7…

Z = [ … … …      y = [ 7
      … … …            3
      … … … ]          : ]
z = (list of radial basis function evaluations)

b = (ZᵀZ)⁻¹(Zᵀy)

yest = b0 + b1 x1 + …
39
1-d RBFs

(figure: three kernel bumps centred at c1, c2, c3 along the x axis)

yest = b1 f1(x) + b2 f2(x) + b3 f3(x)

where
fi(x) = KernelFunction( |x − ci| / KW )
40
Example

(figure: the same three basis functions, now scaled by their coefficients)

yest = 2 f1(x) + 0.05 f2(x) + 0.5 f3(x)

where
fi(x) = KernelFunction( |x − ci| / KW )
41
RBFs with Linear Regression

All ci's are held constant (initialized randomly or on a grid in
m-dimensional input space).

KW is also held constant (initialized to be large enough that there's
decent overlap between basis functions*).
*Usually much better than the crappy overlap on my diagram

yest = 2 f1(x) + 0.05 f2(x) + 0.5 f3(x)
where
fi(x) = KernelFunction( |x − ci| / KW )
42
RBFs with Linear Regression

All ci's are held constant (initialized randomly or on a grid in
m-dimensional input space). KW is also held constant (initialized to be
large enough that there's decent overlap between basis functions).

yest = 2 f1(x) + 0.05 f2(x) + 0.5 f3(x)
where
fi(x) = KernelFunction( |x − ci| / KW )

Then, given Q basis functions, define the matrix Z such that
  Zkj = KernelFunction( |xk − cj| / KW ), where xk is the kth vector of inputs.
And as before, b = (ZᵀZ)⁻¹(Zᵀy).
43
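A minimal 1-d sketch of RBFs with linear regression. The Gaussian kernel, grid-spaced centres, and the dataset are assumptions here; the slides leave the KernelFunction unspecified:

import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 10.0, size=80))
y = np.sin(x) + rng.normal(0.0, 0.1, size=x.shape)   # hypothetical 1-d dataset

centres = np.linspace(0.0, 10.0, 8)   # c_j held constant, on a grid
KW = 1.5                              # kernel width held constant, wide enough to overlap

def kernel(r):
    return np.exp(-0.5 * r ** 2)      # assumed Gaussian KernelFunction

# Z_kj = KernelFunction(|x_k - c_j| / KW)
Z = kernel(np.abs(x[:, None] - centres[None, :]) / KW)

b = np.linalg.solve(Z.T @ Z, Z.T @ y) # b = (Z^T Z)^{-1} (Z^T y), as before
print(np.mean((y - Z @ b) ** 2))      # in-sample mean squared error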
RBFs with NonLinear Regression

Allow the ci's to adapt to the data (initialized randomly or on a grid in
m-dimensional input space).

KW is allowed to adapt to the data. (Some folks even let each basis
function have its own KWj, permitting fine detail in dense regions of
input space.)

yest = 2 f1(x) + 0.05 f2(x) + 0.5 f3(x)
where
fi(x) = KernelFunction( |x − ci| / KW )

But how do we now find all the bj's, ci's and KW?
44
RBFs with NonLinear Regression

(as above: the ci's and KW are allowed to adapt to the data)

yest = 2 f1(x) + 0.05 f2(x) + 0.5 f3(x)
where
fi(x) = KernelFunction( |x − ci| / KW )

But how do we now find all the bj's, ci's and KW?

Answer: Gradient Descent
45
RBFs with NonLinear Regression

(as above: the ci's and KW are allowed to adapt to the data)

yest = 2 f1(x) + 0.05 f2(x) + 0.5 f3(x)
where
fi(x) = KernelFunction( |x − ci| / KW )

But how do we now find all the bj's, ci's and KW?

Answer: Gradient Descent

(But I'd like to see, or hope someone's already done, a hybrid, where the
ci's and KW are updated with gradient descent while the bj's use matrix
inversion.)
46
Radial Basis Functions in 2-d
Two inputs.
Outputs (heights sticking out of the page) not shown.

(figure: the (x1, x2) plane, showing one centre and the sphere of
significant influence of that centre)
47
Happy RBFs in 2-d
Blue dots denote coordinates of input vectors.

(figure: the input points in the (x1, x2) plane, with the centres and
their spheres of significant influence)
48
Crabby RBFs in 2-d
What's the problem in this example?
Blue dots denote coordinates of input vectors.

(figure: the input points in the (x1, x2) plane, with the centres and
their spheres of significant influence)
49
More crabby RBFs
And what's the problem in this example?
Blue dots denote coordinates of input vectors.

(figure: the input points in the (x1, x2) plane, with the centres and
their spheres of significant influence)
50
Hopeless!
Even before seeing the data, you should understand that this is a disaster!

(figure: the centres and their spheres of significant influence in the
(x1, x2) plane)
51
Unhappy
Even before seeing the data, you should understand that this isn't good either.

(figure: the centres and their spheres of significant influence in the
(x1, x2) plane)
52
Robust Regression
53
Robust Regression

(figure: datapoints in the (x, y) plane)
54
Robust Regression
This is the best fit that Quadratic Regression can manage.

(figure: the quadratic fit through the datapoints)
55
Robust Regression
…but this is what we'd probably prefer.

(figure: the preferred fit)
56
LOESS-based Robust Regression
After the initial fit, score each datapoint according to how well it's fitted…

"You are a very good datapoint."

(figure: the fitted curve, with one well-fitted point singled out)
57
LOESS-based Robust Regression
After the initial fit, score each datapoint according to how well it's fitted…

"You are a very good datapoint."
"You are not too shabby."

(figure: the fitted curve, with two reasonably well-fitted points singled out)
58
LOESS-based Robust Regression
After the initial fit, score each datapoint according to how well it's fitted…

"You are a very good datapoint."
"You are not too shabby."
"But you are pathetic."

(figure: the fitted curve, with a badly-fitted point singled out as well)
59
Robust Regression
For k = 1 to R…
• Let (xk, yk) be the kth datapoint.
• Let yestk be the predicted value of yk.
• Let wk be a weight for datapoint k that is large if the datapoint fits
  well and small if it fits badly:
    wk = KernelFn( [yk − yestk]² )
60
Robust Regression
For k = 1 to R…
• Let (xk, yk) be the kth datapoint.
• Let yestk be the predicted value of yk.
• Let wk be a weight for datapoint k that is large if the datapoint fits
  well and small if it fits badly:
    wk = KernelFn( [yk − yestk]² )

Then redo the regression using the weighted datapoints.
(Weighted regression was described earlier in the "varying noise" section,
and is also discussed in the "Memory-based Learning" lecture.)

Guess what happens next?
61
Robust Regression
For k = 1 to R…
• Let (xk, yk) be the kth datapoint.
• Let yestk be the predicted value of yk.
• Let wk be a weight for datapoint k that is large if the datapoint fits
  well and small if it fits badly:
    wk = KernelFn( [yk − yestk]² )

Then redo the regression using the weighted datapoints.
(I taught you how to do this in the "Instance-based" lecture; only then the
weights depended on distance in input space.)

Repeat the whole thing until converged!
62
62
Robust Regression --- what we're doing
What regular regression does:
Assume yk was originally generated using the following recipe:
    yk = b0 + b1 xk + b2 xk² + N(0, σ²)
Computational task is to find the Maximum Likelihood b0, b1 and b2.
63
Robust Regression --- what we're doing
What LOESS robust regression does:
Assume yk was originally generated using the following recipe:
    With probability p:
        yk = b0 + b1 xk + b2 xk² + N(0, σ²)
    But otherwise:
        yk ~ N(μ, σhuge²)
Computational task is to find the Maximum Likelihood b0, b1, b2, p, μ and σhuge.
64
Robust Regression --- what we're doing
What LOESS robust regression does:
Assume yk was originally generated using the following recipe:
    With probability p:
        yk = b0 + b1 xk + b2 xk² + N(0, σ²)
    But otherwise:
        yk ~ N(μ, σhuge²)
Computational task is to find the Maximum Likelihood b0, b1, b2, p, μ and σhuge.

Mysteriously, the reweighting procedure does this computation for us.

Your first glimpse of two spectacular letters: E.M.
65
