ML - Lec 4: Introduction to Regression

Linear regression: a toy dataset

inputs        outputs
x1 = 1        y1 = 1
x2 = 3        y2 = 2.2
x3 = 2        y3 = 2
x4 = 1.5      y4 = 1.9
x5 = 4        y5 = 3.1

Assume y_i = w x_i + noise, where…
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance σ²
Maximum likelihood estimation of w

For what w is
    ∏_{i=1}^{n} p(y_i | w, x_i)    maximized?

For what w is
    ∏_{i=1}^{n} exp( −½ ((y_i − w x_i)/σ)² )    maximized?

For what w is
    ∑_{i=1}^{n} −½ ((y_i − w x_i)/σ)²    maximized?

For what w is
    ∑_{i=1}^{n} (y_i − w x_i)²    minimized?

Least Squares
Linear Regression

The maximum likelihood w is the one that minimizes the sum-of-squares of residuals:

E(w) = ∑_i (y_i − w x_i)²
     = ∑_i y_i² − 2 (∑_i x_i y_i) w + (∑_i x_i²) w²

We want to minimize a quadratic function of w.
Linear Regression

Easy to show the sum of squares is minimized when

    w = ∑_i x_i y_i / ∑_i x_i²

Note: In Bayesian stats you'd have a prior p(w), and you'd end up with a posterior distribution over w rather than a single point estimate.
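As a quick illustration (not from the slides), a minimal NumPy sketch that applies this closed form to the toy dataset from the start of the lecture:

```python
import numpy as np

# Toy dataset from the start of the lecture
x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

# Least-squares / maximum-likelihood estimate for the model y = w*x + noise
w_hat = np.sum(x * y) / np.sum(x * x)
print(w_hat)   # 26.85 / 32.25, roughly 0.83
```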
Regression example
• Generated: w=2
• Recovered: w=2.03
• Noise: std=1
Regression example
• Generated: w=2
• Recovered: w=2.05
• Noise: std=2
Regression example
• Generated: w=2
• Recovered: w=2.08
• Noise: std=4
Multivariate Linear Regression

Multivariate Regression

What if the inputs are vectors?

[Figure: a 2-d input example; datapoints plotted in the (x1, x2) plane.]

The dataset has the form

x_1   y_1
x_2   y_2
x_3   y_3
 :     :
x_R   y_R
Multivariate Regression

Write the matrix X and the vector Y thus:

X = ( x_1ᵀ )      Y = ( y_1 )
    ( x_2ᵀ )          ( y_2 )
    (  :   )          (  :  )
    ( x_Rᵀ )          ( y_R )

(one row of X per datapoint; X is R × m, Y is an R-element vector)

The sum of squares is minimized when

    w = (XᵀX)⁻¹ XᵀY

XᵀX is an m × m matrix: its (i, j)'th element is ∑_{k=1}^{R} x_ki x_kj
XᵀY is an m-element vector: its i'th element is ∑_{k=1}^{R} x_ki y_k
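A minimal NumPy sketch of this solution on made-up data (the data and dimensions are my own, not the slides'); np.linalg.lstsq is the numerically safer way to evaluate (XᵀX)⁻¹XᵀY:

```python
import numpy as np

rng = np.random.default_rng(0)
R, m = 100, 3                          # R datapoints, m-dimensional inputs
X = rng.normal(size=(R, m))            # one row per input vector x_k
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=R)

# w = (X^T X)^{-1} X^T Y, as on the slide
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent solution, but better conditioned
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(w_hat, w_lstsq)
```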
Constant Term in Linear Regression

What about a constant term?

We may expect linear data that does not go through the origin.
Statisticians and Neural Net Folks all agree on a simple obvious hack.

The constant term
• The trick is to create a fake input "X0" that always takes the value 1.

Before:               After:
X1  X2   Y            X0  X1  X2   Y
 2   4  16             1   2   4  16
 3   4  17             1   3   4  17
 5   5  20             1   5   5  20

Y = w1 X1 + w2 X2     Y = w0 X0 + w1 X1 + w2 X2
…has to be a poor       = w0 + w1 X1 + w2 X2
model                 …has a fine constant term

In this example, you should be able to see the MLE w0, w1 and w2 by inspection.
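A quick sketch of the hack in NumPy on the little table above (the code is mine, not the slides'):

```python
import numpy as np

# "Before" data from the table
X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

# "After": prepend the fake input X0 = 1 to every datapoint
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

# Solve for (w0, w1, w2)
w, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
print(w)   # [10. 1. 1.], i.e. Y = 10 + X1 + X2, visible by inspection
```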
Linear Regression with varying noise

Regression with varying noise
• Suppose you know the variance of the noise that was added to each datapoint.

x_i    y_i    σ_i²
 ½      ½      4
 1      1      1
 2      1      ¼
 2      3      4

[Figure: the datapoints plotted with error bars of width σ_i.]

Assume y_i ~ N(w x_i, σ_i²)
MLE estimation with varying noise

w_MLE = argmax_w  log p(y_1, y_2, …, y_R | x_1, …, x_R, σ_1², σ_2², …, σ_R², w)

      = argmin_w  ∑_{i=1}^{R} (y_i − w x_i)² / σ_i²
        (assuming independence of the noise, then plugging in the equation for the Gaussian and simplifying)

      = the w such that  ∑_{i=1}^{R} x_i (y_i − w x_i) / σ_i² = 0
        (setting dLL/dw equal to zero)

      = ( ∑_{i=1}^{R} x_i y_i / σ_i² ) / ( ∑_{i=1}^{R} x_i² / σ_i² )
        (trivial algebra)
This is Weighted Regression
• We are asking to minimize the weighted sum of squares

    argmin_w  ∑_{i=1}^{R} (y_i − w x_i)² / σ_i²

[Figure: the same datapoints and error bars, plotted over x = 0 … 3.]

where the weight for the i'th datapoint is 1/σ_i²
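A minimal sketch of this weighted estimate on the four datapoints tabulated above (code mine):

```python
import numpy as np

# Datapoints and noise variances from the table above
x  = np.array([0.5, 1.0, 2.0, 2.0])
y  = np.array([0.5, 1.0, 1.0, 3.0])
s2 = np.array([4.0, 1.0, 0.25, 4.0])     # sigma_i^2

# w = (sum_i x_i y_i / s_i^2) / (sum_i x_i^2 / s_i^2)
w_hat = np.sum(x * y / s2) / np.sum(x * x / s2)
print(w_hat)    # about 0.58; the low-variance point (2, 1) dominates the estimate
```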
Non-linear Regression

• Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

x_i    y_i
 ½      ½
 1     2.5
 2      3
 3      2

[Figure: the four datapoints.]

Assume y_i ~ N(√(w + x_i), σ²)
Non-linear MLE estimation

w_MLE = argmax_w  log p(y_1, y_2, …, y_R | x_1, …, x_R, σ, w)

      = argmin_w  ∑_{i=1}^{R} ( y_i − √(w + x_i) )²
        (assuming i.i.d. noise, then plugging in the equation for the Gaussian and simplifying)

      = the w such that  ∑_{i=1}^{R} ( y_i − √(w + x_i) ) / √(w + x_i) = 0
        (setting dLL/dw equal to zero)

…and there is no closed-form algebraic solution for that w.
Non-linear MLE estimation

Common (but not the only) approach when there is no algebraic solution: numerical methods, e.g.

• Line Search
• Simulated Annealing
• Gradient Descent
• Conjugate Gradient
• Levenberg-Marquardt
• Newton's Method

Also, special-purpose statistical-optimization tricks such as E.M. (see the Gaussian Mixtures lecture for an introduction).
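As an illustration, here is a small gradient-descent sketch for the model above (assuming, as reconstructed, y_i ~ N(√(w + x_i), σ²)); the step size and iteration count are arbitrary choices of mine:

```python
import numpy as np

# Dataset from the non-linear regression slide
x = np.array([0.5, 1.0, 2.0, 3.0])
y = np.array([0.5, 2.5, 3.0, 2.0])

def sse(w):
    """Sum of squared residuals for the model y ~ sqrt(w + x)."""
    return np.sum((y - np.sqrt(w + x)) ** 2)

def dsse_dw(w):
    """dSSE/dw = -sum_i (y_i - sqrt(w + x_i)) / sqrt(w + x_i)."""
    r = np.sqrt(w + x)
    return -np.sum((y - r) / r)

w = 1.0                      # initial guess; keep w + x_i > 0 throughout
lr = 0.05                    # step size
for _ in range(2000):        # plain gradient descent
    w -= lr * dsse_dw(w)

print(w, sse(w))             # w settles between 2 and 3 on this data
```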
Polynomial Regression

So far we've mainly been dealing with linear regression:

X1  X2  Y        X = ( 3 2 )    y = ( 7 )     x_1 = (3, 2)…,  y_1 = 7…
 3   2  7            ( 1 1 )        ( 3 )
 1   1  3            ( : : )        ( : )
 :   :  :

Z = ( 1 3 2 )    y = ( 7 )     z_k = (1, x_k1, x_k2)
    ( 1 1 1 )        ( 3 )     z_1 = (1, 3, 2)…,  y_1 = 7…
    ( : : : )        ( : )

b = (ZᵀZ)⁻¹(Zᵀy)

y^est = b_0 + b_1 x_1 + b_2 x_2
Quadratic Regression

It's trivial to do linear fits of fixed nonlinear basis functions:

X1  X2  Y        X = ( 3 2 )    y = ( 7 )     x_1 = (3, 2)…,  y_1 = 7…
 3   2  7            ( 1 1 )        ( 3 )
 1   1  3            ( : : )        ( : )
 :   :  :

Z = ( 1 3 2 9 6 4 )    y = ( 7 )     z = (1, x_1, x_2, x_1², x_1 x_2, x_2²)
    ( 1 1 1 1 1 1 )        ( 3 )
    ( :           )        ( : )

b = (ZᵀZ)⁻¹(Zᵀy)

y^est = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1² + b_4 x_1 x_2 + b_5 x_2²
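A small NumPy sketch of this basis-expansion trick; the first print reproduces the z vector for the input (3, 2) shown above, while the fitted dataset is made up by me since the slide elides its remaining rows:

```python
import numpy as np

def quadratic_terms(x1, x2):
    """Map a 2-d input to its quadratic term vector z."""
    return np.array([1.0, x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

print(quadratic_terms(3, 2))      # [1. 3. 2. 9. 6. 4.], the first row of Z

# Made-up 2-d dataset for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 7.0 - X[:, 0] + 2.0 * X[:, 1] ** 2 + 0.1 * rng.normal(size=50)

Z = np.array([quadratic_terms(x1, x2) for x1, x2 in X])
b = np.linalg.solve(Z.T @ Z, Z.T @ y)      # b = (Z^T Z)^{-1} (Z^T y)
y_est = Z @ b
print(b)
```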
Quadratic Regression

Each component of a z vector is called a term.
Each column of the Z matrix is called a term column.

How many terms in a quadratic regression with m inputs?
• 1 constant term
• m linear terms
• (m+1)-choose-2 = m(m+1)/2 quadratic terms

(m+2)-choose-2 terms in total = O(m²)

Note that solving b = (ZᵀZ)⁻¹(Zᵀy) is thus O(m⁶).
Qth-degree polynomial Regression

X1  X2  Y        X = ( 3 2 )    y = ( 7 )     x_1 = (3, 2)…,  y_1 = 7…
 3   2  7            ( 1 1 )        ( 3 )
 1   1  3            ( : : )        ( : )
 :   :  :

Z = ( 1 3 2 9 6 … )    y = ( 7 )     z = (all products of powers of the inputs
    ( 1 1 1 1 1 … )        ( 3 )          in which the sum of powers is Q or less)
    ( :         … )        ( : )

b = (ZᵀZ)⁻¹(Zᵀy)

y^est = b_0 + b_1 x_1 + …
m inputs, degree Q: how many terms?

= the number of unique terms of the form
     x_1^{q_1} x_2^{q_2} … x_m^{q_m}    where  ∑_{i=1}^{m} q_i ≤ Q

= the number of unique terms of the form
     1^{q_0} x_1^{q_1} x_2^{q_2} … x_m^{q_m}    where  ∑_{i=0}^{m} q_i = Q

= the number of lists of non-negative integers [q_0, q_1, q_2, …, q_m] in which ∑ q_i = Q

= the number of ways of placing Q red disks on a row of squares of length Q+m
= (Q+m)-choose-Q

Example (Q = 11, m = 4):  q_0 = 2, q_1 = 2, q_2 = 0, q_3 = 4, q_4 = 3
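A tiny brute-force check of that count (code mine): enumerate the exponent tuples directly and compare with (Q+m)-choose-Q.

```python
from itertools import product
from math import comb

def n_terms(m, Q):
    """Count exponent tuples (q_1, ..., q_m) of non-negative integers with sum <= Q."""
    return sum(1 for qs in product(range(Q + 1), repeat=m) if sum(qs) <= Q)

for m, Q in [(2, 2), (3, 2), (4, 3)]:
    print(m, Q, n_terms(m, Q), comb(Q + m, Q))   # the two counts agree
```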
Radial Basis Functions

Radial Basis Functions (RBFs)

X1  X2  Y        X = ( 3 2 )    y = ( 7 )     x_1 = (3, 2)…,  y_1 = 7…
 3   2  7            ( 1 1 )        ( 3 )
 1   1  3            ( : : )        ( : )
 :   :  :

Z = ( … … … … … … )    y = ( 7 )     z = (list of radial basis
    ( … … … … … … )        ( 3 )          function evaluations)
    ( … … … … … … )        ( : )

b = (ZᵀZ)⁻¹(Zᵀy)

y^est = b_0 + b_1 x_1 + …
1-d RBFs

[Figure: three bump-shaped basis functions along the x axis, centred at c_1, c_2 and c_3, with y on the vertical axis.]

Example

[Figure: an example fit built from these basis functions.]
RBFs with Linear Regression

y^est = 2 f_1(x) + 0.05 f_2(x) + 0.5 f_3(x)

where  f_i(x) = KernelFunction( | x − c_i | / KW )

All the c_i's are held constant (initialized randomly or on a grid in m-dimensional input space).

KW is also held constant (initialized to be large enough that there's decent overlap between basis functions*).

*Usually much better than the crappy overlap on my diagram.

Then, given Q basis functions, define the matrix Z such that Z_kj = KernelFunction( | x_k − c_j | / KW ), where x_k is the k'th vector of inputs.

And as before, b = (ZᵀZ)⁻¹(Zᵀy).
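A minimal 1-d sketch of this recipe (code and data mine), using a Gaussian bump as the KernelFunction since the slide does not fix a particular kernel:

```python
import numpy as np

def kernel(u):
    """A Gaussian bump; any decaying KernelFunction of |x - c| / KW would do."""
    return np.exp(-0.5 * u ** 2)

# Made-up 1-d data for illustration
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=80)
y = np.sin(x) + 0.1 * rng.normal(size=80)

centers = np.linspace(0, 10, 12)     # the c_j's, held fixed on a grid
KW = 1.5                             # kernel width, also held fixed

# Z_kj = KernelFunction(|x_k - c_j| / KW)
Z = kernel(np.abs(x[:, None] - centers[None, :]) / KW)

b = np.linalg.lstsq(Z, y, rcond=None)[0]   # same solution as (Z^T Z)^{-1} Z^T y
y_est = Z @ b
```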
RBFs with NonLinear Regression

[Figures only on these slides.]
Radial Basis Functions in 2-d

Two inputs. Outputs (heights sticking out of the page) not shown.
[Figure: the (x1, x2) plane, showing a center and its sphere of significant influence.]

Happy RBFs in 2-d

Blue dots denote coordinates of input vectors.
[Figure: centers and their spheres of significant influence covering the region where the blue dots lie.]

Crabby RBFs in 2-d

What's the problem in this example?
Blue dots denote coordinates of input vectors.
[Figure: centers and their spheres of significant influence relative to the blue dots.]

More crabby RBFs

And what's the problem in this example?
Blue dots denote coordinates of input vectors.
[Figure: another arrangement of centers and spheres of significant influence relative to the blue dots.]

Hopeless!

Even before seeing the data, you should understand that this is a disaster!
[Figure: an arrangement of centers and their spheres of significant influence in the (x1, x2) plane.]

Unhappy

Even before seeing the data, you should understand that this isn't good either.
[Figure: another arrangement of centers and their spheres of significant influence.]
Robust Regression

[Figure: a 1-d dataset in which a few datapoints lie far from the general trend.]

Robust Regression

This is the best fit that Quadratic Regression can manage.
[Figure: the quadratic fit to the data.]

Robust Regression

…but this is what we'd probably prefer.
[Figure: the fit we would prefer.]

LOESS-based Robust Regression

After the initial fit, score each datapoint according to how well it's fitted…
[Figure: the fit with the datapoints scored; one badly fitted point gets the caption "But you are pathetic."]
Robust Regression

For k = 1 to R…
• Let (x_k, y_k) be the k'th datapoint.
• Let y^est_k be the predicted value of y_k.
• Let w_k be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:

     w_k = KernelFn( [y_k − y^est_k]² )

Then redo the regression using weighted datapoints.
Weighted regression was described earlier in the "varying noise" section, and is also discussed in the "Memory-based Learning" lecture.
I taught you how to do this in the "Instance-based" lecture (only then the weights depended on distance in input-space).

Guess what happens next?
Repeat the whole thing until converged!
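A sketch of the whole loop for a quadratic model (code mine), using a Gaussian-style kernel on the squared residual as KernelFn and plain weighted least squares for the refit:

```python
import numpy as np

def robust_quadratic_fit(x, y, n_iters=10, kernel_width=1.0):
    """Iteratively reweighted fit of y ~ b0 + b1*x + b2*x^2."""
    Z = np.column_stack([np.ones_like(x), x, x ** 2])
    w = np.ones_like(y)                         # start with all weights equal
    for _ in range(n_iters):
        sw = np.sqrt(w)                         # weighted least squares via row scaling
        b = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)[0]
        resid2 = (y - Z @ b) ** 2
        # w_k = KernelFn([y_k - y_est_k]^2): near 1 for good fits, near 0 for bad ones
        w = np.exp(-resid2 / (2 * kernel_width ** 2))
    return b

# Made-up data with a few gross outliers, for illustration
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = 1.0 + 0.5 * x - 0.3 * x ** 2 + 0.2 * rng.normal(size=60)
y[::15] += 8.0                                  # corrupt every 15th point
print(robust_quadratic_fit(x, y))               # close to (1.0, 0.5, -0.3)
```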
Robust Regression---what we're doing

What regular regression does:

Assume y_k was originally generated using the following recipe:

     y_k = b_0 + b_1 x_k + b_2 x_k² + N(0, σ²)

The computational task is to find the Maximum Likelihood b_0, b_1 and b_2.

What LOESS robust regression does:

Assume y_k was originally generated using the following recipe:

     With probability p:
          y_k = b_0 + b_1 x_k + b_2 x_k² + N(0, σ²)
     But otherwise:
          y_k ~ N(μ, σ_huge²)

The computational task is to find the Maximum Likelihood b_0, b_1, b_2, p, μ and σ_huge.

Mysteriously, the reweighting procedure does this computation for us.

Your first glimpse of two spectacular letters: E.M.
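To make that generative recipe concrete, a short simulation sketch (the names p, mu, sigma_huge are mine, matching the symbols above); the reweighting loop sketched earlier should approximately recover b_0, b_1, b_2 from data generated this way:

```python
import numpy as np

rng = np.random.default_rng(0)
b0, b1, b2 = 1.0, 0.5, -0.3
p, sigma = 0.9, 0.2                  # inlier probability and inlier noise std
mu, sigma_huge = 0.0, 10.0           # the wide "anything goes" outlier component

x = np.linspace(-3, 3, 200)
inlier = rng.random(200) < p         # with probability p, use the regression recipe
y = np.where(inlier,
             b0 + b1 * x + b2 * x ** 2 + sigma * rng.normal(size=200),
             mu + sigma_huge * rng.normal(size=200))
```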