Polynomial Regression
Polynomials, such as the quadratic model y = β0 + β1 x + β2 x² + ε, are widely used in regression when the relationship between the predictor(s) and the response is not linear but curvilinear.
Polynomial Models in One Variable
Definition: The regression model

y = β0 + β1 x + β2 x² + · · · + βk x^k + ε

is called a polynomial regression model of order (degree) k in one variable.
[Figure: scatterplot of tensile paper strength (psi) against hardwood concentration.]
Without centering the predictor variable, the correlation between x and x² is unacceptably high (0.97). Based on the scatterplot, a quadratic model seems like a promising fit:
y = β0 + β1 (x − x̄) + β2 (x − x̄)² + ε
The model was fitted with R; its summary statistics are discussed below.
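A minimal sketch of how such a centered quadratic fit could be produced in R, assuming the hardwood concentration and tensile strength are stored in vectors named conc and strength (hypothetical names, not necessarily those used in class):

    conc.c   <- conc - mean(conc)                  # center the predictor
    quad.fit <- lm(strength ~ conc.c + I(conc.c^2))
    summary(quad.fit)                              # coefficient table and R^2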
Both the linear and quadratic coefficients (β1 and β2) are significant, and the model R² is 0.9085. Compare this to the R² value of 0.3054 for the corresponding simple linear regression model (output not shown).
A plot of the Studentized residuals against the fitted values ŷi reveals no outliers
and no (strong) patterns. A qq-plot of the standardized residuals shows that the
distribution of the residuals is not perfectly Normal.
[Plots: Studentized residuals against the fitted values (left) and normal QQ-plot of the standardized residuals (right) for the quadratic model.]
In this case it is pretty obvious that the quadratic term is a meaningful addition to
the model. But we could also test “by-hand” the hypothesis
H0 : β2 = 0    vs.    Ha : β2 ≠ 0

For the reduced (linear) model

y = β0 + β1 (x − x̄) + ε

the sum of squares of regression is SSR(β1 | β0) = 1043.4. Use the output from the previous page to compute the sum of squares of regression for the quadratic model and to conduct the F-test for the hypotheses above.
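One way to carry out this extra-sum-of-squares F-test in R is to compare the nested fits with anova(); the sketch below reuses the hypothetical conc.c, strength, and quad.fit objects from the earlier sketch.

    lin.fit <- lm(strength ~ conc.c)   # reduced model: linear in the centered predictor
    anova(lin.fit, quad.fit)           # extra SS and partial F-test of H0: beta2 = 0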
Splines are piecewise polynomials joined at knots t1 < t2 < · · · < th; the pieces are built from truncated power terms of the form (x − ti)+ raised to suitable powers, where

(x − ti)+ = (x − ti)  if x − ti > 0,    and    (x − ti)+ = 0  if x − ti ≤ 0.
If the positions of the knots in a spline are known, then fitting a spline function to the data reduces to a linear regression problem. If the knot positions are not known, the problem becomes more complicated: it is not easy to decide how many knots to use and where they should be placed. In general, each piece of the spline should be kept as simple as possible to avoid over-fitting the data.
A special case of spline functions is the piecewise-linear function. As with splines in general, we can require piecewise-linear functions to be continuous, or we can allow discontinuities at the knots.
Example: Consider a (not necessarily continuous) piecewise linear function with a
single knot at t:
S(x) = β00 + β01 x + β10 (x − t)^0_+ + β11 (x − t)^1_+
If x ≤ t (before the knot), the function is the line y = β00 + β01 x. If x > t (after the
knot) the function is
y = β00 + β01 x + β10 + β11 (x − t) = (β00 + β10 − β11 t) + (β01 + β11 )x
Notice that β10 is the height of the vertical "jump" at the knot x = t. If we require the piecewise-linear function to be continuous, then β10 must equal zero, and the model becomes

S(x) = β00 + β01 x + β11 (x − t)^1_+
[Figure: a non-continuous linear spline (left) and a continuous linear spline (right); the slope is β01 before the knot t and β01 + β11 after it, β10 is the jump at the knot, and the extended right-hand piece has intercept β00 − β11 t.]
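Because the continuous linear spline with a known knot is linear in its coefficients, it can be fit by ordinary least squares. The sketch below assumes data vectors x and y and a knot position t (all hypothetical); pmax(x - t, 0) implements the plus function (x − t)+.

    t <- 6                                   # assumed (known) knot position
    spline.fit <- lm(y ~ x + pmax(x - t, 0))
    summary(spline.fit)
    # Coefficients: intercept = beta00, slope before the knot = beta01,
    # change in slope after the knot = beta11.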
So far, we have seen models that use polynomials or lines (polynomials of degree
one) that are pieced together at knots. Alternatively, other functions can be consid-
ered. If the scatterplot shows some periodic behavior, then trigonometric functions
(sine, cosine) may be reasonable to include in the model. The trigonometric terms
can have varying amplitudes and frequencies.
Nonparametric Regression
So far, we have discussed linear regression and polynomial regression (and some
more exotic variants of regression). What all these models have in common is that
they specify a functional relationship (line, plane, parabolic surface etc.) between
the predictors and the response. So far, we (the users) have always been the ones making the ultimate decision about which model to use. There is an alternative: we could decline to specify a model and instead allow the data to "pick" its own model.
In nonparametric regression, the regression function does not take any predetermined shape but is derived entirely from the data. The conventional parametrized regression model is
yi = f(xi, β) + εi
where we specify the general class of the function f (linear, quadratic, etc.) and
use the data to estimate the function parameters β. The nonparametric regression
model is similar
yi = f(xi) + εi
but now our (more ambitious) goal is to estimate the function f itself. Of course,
without any restrictions, there are uncountably infinitely many choices. So we’ll
restrict the problem a bit.
Recall that in ordinary linear least squares regression, the predicted values ŷ can
be written as linear combinations of the observed values y. The coefficients in the
linear combinations are determined by the entries of the hat-matrix.
ŷ = Hy
Most nonparametric regression models also model the predicted values as linear
combinations of the observations, but with different weights.
Kernel Regression
As in the OLS case, the predicted observations are computed as weighted sums of
the actual observations. Let ỹi be the kernel smoother estimate for observation yi .
Then

ỹi = Σ_{j=1}^{n} wij yj,

where the weights wij sum to one over j. Alternatively, we could write ỹ = S y, where S = [wij].
S is called the smoothing matrix. The weights are typically chosen such that wij = 0
for all points yj outside of a specified “neighborhood” of the point yi . The width
of this neighborhood is sometimes also called the bandwidth of the kernel. The larger the bandwidth, the smoother the estimated function becomes.
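As an illustration, the weights and the smoothing matrix S could be computed as below for a Gaussian kernel. This is a sketch of one common weighting scheme (Nadaraya–Watson style weights), not necessarily the exact one used in class; x, y, and the bandwidth h are hypothetical.

    smoother.matrix <- function(x, h) {
      # raw weights K((x_i - x_j)/h), then rescale each row to sum to one
      W <- outer(x, x, function(xi, xj) dnorm((xi - xj) / h))
      sweep(W, 1, rowSums(W), "/")
    }

    S       <- smoother.matrix(x, h = 1)
    y.tilde <- S %*% y            # smoothed values, y~ = S y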
There are many possible kernel functions. They need to satisfy the following properties:

K(t) ≥ 0 for all t,     ∫ K(t) dt = 1,     K(−t) = K(t).

Note that these are the properties of symmetric probability density functions, which means that, for instance, the Normal distribution (Gaussian kernel), the triangular distribution (triangular kernel), and the uniform distribution (uniform or box kernel) all make good kernel functions.
Gaussian kernel:       K(t) = (1/√(2π)) exp(−t²/2)

Triangular kernel:     K(t) = 1 − |t| if |t| ≤ 1,    K(t) = 0 if |t| > 1

Uniform (box) kernel:  K(t) = 0.5 if |t| ≤ 1,        K(t) = 0 if |t| > 1
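Written as R functions, these three kernels might look like the sketch below; plotting them produces a picture much like the figure that follows.

    gaussian.kernel   <- function(t) exp(-t^2 / 2) / sqrt(2 * pi)
    triangular.kernel <- function(t) ifelse(abs(t) <= 1, 1 - abs(t), 0)
    uniform.kernel    <- function(t) ifelse(abs(t) <= 1, 0.5, 0)

    curve(gaussian.kernel(x),   from = -3, to = 3, ylim = c(0, 1), ylab = "K(x)")
    curve(triangular.kernel(x), add = TRUE, lty = 2)
    curve(uniform.kernel(x),    add = TRUE, lty = 3)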
[Figure: the rectangular (uniform), triangular, and Gaussian kernel functions K(x), plotted for −3 ≤ x ≤ 3.]
Example: Suppose you want to fit a nonparametric regression curve to the following
data:
x:  1  2  4  5  7
y:  1  4  2  3  1
The R function ksmooth() computes a kernel smoother (the available kernels are the Gaussian and the uniform/box kernel) with a chosen bandwidth. The results for three different bandwidths (1, 2, and 3) are shown below.
[Figure: kernel smoother fits to the example data for different bandwidths.]
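A sketch of ksmooth() calls that would produce fits of this kind for the small data set above; the pairing of bandwidths with particular panels or line types in the original figure is an assumption.

    x <- c(1, 2, 4, 5, 7)
    y <- c(1, 4, 2, 3, 1)

    plot(x, y)
    lines(ksmooth(x, y, kernel = "normal", bandwidth = 1), lty = 1)
    lines(ksmooth(x, y, kernel = "normal", bandwidth = 2), lty = 2)
    lines(ksmooth(x, y, kernel = "normal", bandwidth = 3), lty = 3)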
Note: The regression “function” in this case is a sequence of points for which we
have computed predictions. There is no algebraic function that is estimated.
For large sample sizes, the bandwidth of the kernel to be used becomes much more
important than the choice of the kernel function.
[Figure: Windmill data, DC current plotted against wind velocity, with two LOESS smoothers overlaid.]
The red curve corresponds to a LOESS smoother that uses 50% of the data points, while the green curve corresponds to a LOESS smoother that uses 75% of the points. For a LOESS smoother that uses 100% of the points, the results would be similar (but not identical) to a parametric regression with a polynomial function in wind velocity. The difference comes from the weighting by inverse distance.
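A hedged sketch of how the two LOESS fits could be reproduced with loess(), assuming the windmill data sit in a data frame named windmill with columns velocity and DC (names are assumptions); span controls the fraction of points used in each local fit.

    fit50 <- loess(DC ~ velocity, data = windmill, span = 0.50)
    fit75 <- loess(DC ~ velocity, data = windmill, span = 0.75)

    plot(windmill$velocity, windmill$DC,
         xlab = "wind velocity", ylab = "DC current")
    v.grid <- seq(min(windmill$velocity), max(windmill$velocity), length.out = 100)
    lines(v.grid, predict(fit50, data.frame(velocity = v.grid)), col = "red")
    lines(v.grid, predict(fit75, data.frame(velocity = v.grid)), col = "green")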
Note: As we have seen so far, nonparametric models can sometimes be quite similar
to parametric models and sometimes they can be quite different. Which model to
prefer depends on the specific situation. If there is a good explanation for a spe-
cific parametric model that comes from the experimental context, then this model
is usually preferred if it provides a reasonable fit.
If no reasonable parametric model provides an adequate fit to the data, then non-
parametric models can provide viable alternatives.
Polynomial Models in Two or More Variables
Recall that for an ordinary least squares regression model, the model "surface" of the additive model

y = β0 + β1 x1 + β2 x2 + ε

is a plane in space, while the model surface of the OLS model with interaction,

y = β0 + β1 x1 + β2 x2 + β12 x1 x2 + ε,

is a curved surface.
The quadratic (second-order) regression model in two variables is an extension of this idea:

y = β0 + β1 x1 + β2 x2 + β11 x1² + β22 x2² + β12 x1 x2 + ε

Recall that, depending on the values of the β's, the response surfaces can take many different shapes.
[Figure: four example response surfaces generated by the second-order model for different choices of the β coefficients.]
Four observations are collected at the center of the square, (x1, x2) = (0, 0), and one observation each at the corners of the square and at the axial runs (±√2, 0) and (0, ±√2). The data for this experiment can be found in the file "ChemicalProcess.txt" on the course website.
We will use R to fit a second order model for the response y as a function of the
coded variables x1 and x2 .
y = β0 + β1 x1 + β2 x2 + β11 x1² + β22 x2² + β12 x1 x2 + ε
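A sketch of the corresponding R call, assuming the file has been read into a data frame chem whose columns are named y, x1, and x2 (the column names are assumptions):

    chem <- read.table("ChemicalProcess.txt", header = TRUE)
    second.order <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2, data = chem)
    summary(second.order)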
The coefficient table in R tells us that each component of the model (the linear, quadratic, and mixed terms) is significant. Thus, the model cannot easily be simplified. The residuals are approximately normally distributed. The residual plot shows a pattern, but that is to be expected with such a small sample size. Overall, the residuals fall pretty much within the "pure error" range that we can observe from the repeated observations at the center of the design.
[Plots: Studentized residuals and normal QQ-plot for the second-order model.]
The optimal temperature (T) and concentration (C) combination that would lead to the highest expected response value can now be determined by finding the maximum of the fitted response surface (take the partial derivatives, set them equal to zero, and solve the corresponding system of equations). The maximum occurs at approximately 245°C and 20% concentration.
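In matrix notation the fitted surface is ŷ = b0 + x′b + x′Bx, where b holds the estimated linear coefficients and B is the symmetric matrix built from the quadratic and interaction estimates; setting the gradient to zero gives the stationary point x_s = −(1/2) B⁻¹ b. A sketch using the hypothetical second.order fit from above; the result is in coded units, and translating it to °C and % requires the coding of T and C, which is not reproduced here.

    beta <- coef(second.order)
    b <- beta[c("x1", "x2")]
    B <- matrix(c(beta["I(x1^2)"],   beta["x1:x2"] / 2,
                  beta["x1:x2"] / 2, beta["I(x2^2)"]), nrow = 2)
    x.s <- -0.5 * solve(B, b)    # stationary point (x1, x2) in coded units
    x.s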
Orthogonal Polynomials
We have seen previously that, when fitting polynomial regression models in one variable, ill-conditioning is frequent due to the multicollinearity between the predictors x, x², etc. Some of these problems can be fixed by formulating the model in terms of orthogonal polynomials.
Suppose we want to fit a polynomial regression model with one predictor x and degree k:

yi = β0 + β1 xi + β2 xi² + · · · + βk xi^k + εi
Generally, for this model, the columns of the X matrix will be (highly) correlated. Additionally, if another term β_{k+1} x^{k+1} were added to the model, the matrix (X′X)⁻¹ would have to be recomputed and the estimates of the other β parameters would change.
Instead, consider the polynomial model

yi = α0 P0(xi) + α1 P1(xi) + α2 P2(xi) + · · · + αk Pk(xi) + εi,

where Pj(xi) is an orthogonal polynomial of degree j and P0(xi) = 1.
Then the multiple linear regression model becomes y = Xα + ε, where the predictor matrix is

X = [ P0(x1)  P1(x1)  · · ·  Pk(x1)
      P0(x2)  P1(x2)  · · ·  Pk(x2)
        ⋮       ⋮              ⋮
      P0(xn)  P1(xn)  · · ·  Pk(xn) ]
Since the polynomials are orthogonal, the X′X matrix has the following simple diagonal form, which is very easy to invert:

X′X = diag( Σ_{i=1}^{n} P0(xi)²,  Σ_{i=1}^{n} P1(xi)²,  · · · ,  Σ_{i=1}^{n} Pk(xi)² )
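The diagonal form of X′X is easy to verify numerically for the orthogonal polynomials that R's poly() generates; the predictor values below are made up for illustration.

    x <- c(2, 4, 6, 8, 10, 12)             # any numeric predictor values
    P <- cbind(1, poly(x, degree = 2))     # columns P0, P1, P2
    round(crossprod(P), 10)                # X'X is diagonal (up to rounding)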
Note: There are several different methods by which orthogonal polynomials can be
obtained for a given set of data. R uses a different method than the one described
in your text.
If you do not want to fit orthogonal polynomials, you have to include the argument
raw = TRUE inside the poly() function.
You can see which orthogonal polynomials R uses by typing poly(your predictor,
degree = 2). Caution: they are not the polynomials described in the book.
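A sketch contrasting the two versions of the fit, again with hypothetical vectors x and y; the coefficients differ, but the fitted values agree.

    fit.orth <- lm(y ~ poly(x, degree = 2))              # orthogonal polynomials
    fit.raw  <- lm(y ~ poly(x, degree = 2, raw = TRUE))  # raw powers x, x^2

    poly(x, degree = 2)                            # inspect the orthogonal columns
    all.equal(fitted(fit.orth), fitted(fit.raw))   # identical predictions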
R uses orthogonal polynomials for which Σ_{i=1}^{n} Pj(xi)² = 1 for j = 1, 2, and P0(xi) = 1.
Note: Predictions in R are not influenced by which polynomials are chosen for the regression, but the stability of the estimates is. The method that uses orthogonal polynomials is not subject to the numerical problems that occur in the estimation of the model parameters when the data suffer from multicollinearity. Therefore, in general, predictions made with the orthogonal-polynomial method are preferred. But it can be tricky to derive the "raw" model equation from this fit.