
Math 261A - Spring 2012, M. Bremer

Polynomial Regression Models


Recall that our definition of linear regression models extends to models in which several powers of the same predictor variable are included,

y = β0 + β1 x + β2 x² + ε

and even to second-order polynomial models in two (or more) variables

y = β0 + β1 x1 + β2 x2 + β11 x1² + β22 x2² + β12 x1 x2 + ε

Polynomials are widely used in regression when the relationship between the predictor(s) and the response is not linear but curvilinear.
Polynomial Models in One Variable
Definition: The regression model

y = β0 + β1 x + β2 x² + ε

is called a second-order model or a quadratic model in one variable. The expected value of y is a parabola in x. Here, β0 can still be interpreted as the y-intercept of the parabola. β1 is called the linear effect parameter and β2 is called the quadratic effect parameter. The regression model

y = β0 + β1 x + β2 x² + · · · + βk x^k + ε

is called a k-th order polynomial model in one variable.


Polynomial models may be analyzed with the techniques we have previously developed for multiple regression models if we use xj = x^j as the jth predictor variable.
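For instance, a quadratic model can be fit with the usual multiple regression machinery in R. The sketch below uses simulated data; the data and variable names are illustrative, not from the notes.

set.seed(1)
x <- runif(30, 0, 10)                      # hypothetical predictor
y <- 2 + 1.5 * x - 0.1 * x^2 + rnorm(30)   # hypothetical response
fit <- lm(y ~ x + I(x^2))                  # predictors x and x^2
summary(fit)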
There are several important decisions that have to be made when a polynomial regression model is fit. They are discussed in more detail below.
1. Order of the Model Since our goal is to model the data well with the simplest possible regression model, polynomial models of lower degree are usually preferable. Keep in mind that most datasets of k + 1 observations can be (perfectly!) modeled by a polynomial of degree k. That clearly cannot be the goal. In practice, we usually start with models of degree one and, if transformations on the predictor or the response are insufficient, also consider models of degree two. Higher degree models should be avoided unless the context the data is coming from explicitly calls for one of these models.

[Figure: the same data fit by a polynomial regression model of degree k and by a quadratic regression model.]


2. Model-Building Strategy To decide the appropriate degree of a polynomial regression model, two different strategies are possible. One can start with a linear model and include higher order terms one by one until the highest order term becomes non-significant (look at the p-values of the t-tests for the slopes). This method is generally called Forward variable selection. Or one could start with a high order model and exclude the non-significant highest order terms one by one until the remaining highest order term becomes significant. This method is generally referred to as Backward variable selection. In general, the two methods do not have to lead to the same model. For polynomial models, these strategies are usually more than is needed, since we can typically restrict our attention to first- and second-order polynomial models.
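A sketch of the backward idea in R, again with purely illustrative simulated data: fit a cubic, check the t-test for the highest-order term, and drop it if it is not significant.

set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 1.5 * x - 0.1 * x^2 + rnorm(30)
fit3 <- lm(y ~ x + I(x^2) + I(x^3))
summary(fit3)                          # look at the p-value for I(x^3)
fit2 <- update(fit3, . ~ . - I(x^3))   # drop the cubic term if non-significant
summary(fit2)                          # now examine I(x^2)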
3. Extrapolation You have to take extreme care when making predictions outside the range of the observed data in polynomial models. Recall the windmill data, where we could have modeled the nonlinear increase in DC output as a polynomial function of wind velocity. Obviously, the DC output should not decrease as wind velocity increases, so predictions made on the part of the parabola with negative slope would be incorrect.
4. Ill-Conditioning Usually, we hope that the predictors in multiple regression models are (almost) independent. If they are highly correlated, we have problems with multicollinearity. The predictors in a polynomial regression are not independent: one predictor is x, while the next predictor is x², for instance. Even though the correlation between x and x² is not perfect, it can be high enough to make the matrix X′X ill-conditioned. That means that the matrix is “hard” to invert, or that considerable numerical error will be involved in computing the inverse. It is possible to select the predictor functions more carefully, as curvilinear functions of x, to avoid this problem.
Multicollinearity becomes more of a concern if the range of predictor values is very narrow, so that x and x² are almost linearly related. Some of the ill-conditioning can be remedied by centering the variables (subtracting the mean x̄ from x) before taking their powers.
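A small numerical sketch of this effect (the narrow predictor range below is made up for illustration):

x  <- seq(10, 12, length.out = 25)   # narrow range: x and x^2 nearly collinear
xc <- x - mean(x)                    # centered predictor
cor(x,  x^2)                         # very close to 1
cor(xc, xc^2)                        # essentially 0 for this symmetric grid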
5. Hierarchy Regression models that contain all powers of x (from 1 to k) are said
to be hierarchical. It is possible to exclude lower order terms from the model while
keeping some of the higher order terms. Statisticians are split on whether this is
a good idea. As always, any knowledge about the context the data is taken from
should be utilized to decide which polynomial regression model to fit.


Example: The Hardwood Data


The strength of kraft paper is related to the percentage of hardwood in the batch of
pulp that the paper is produced from. Not enough hardwood makes the paper weak,
but too much hardwood makes the paper brittle. The variables in this dataset are
the hardwood concentration (x, in %) and the tensile strength of the paper (y, in
psi) for 19 samples of paper. The scatterplot of tensile strength against hardwood
percentage (below) shows a clear non-linear relationship:

[Figure: Hardwood data, scatterplot of tensile paper strength (psi) against hardwood concentration (%).]

Without centering the predictor variable, the correlation between x and x² is unacceptably high (0.97). Based on the scatterplot, a quadratic model seems like a possibly good fit:

y = β0 + β1 (x − x̄) + β2 (x − x̄)² + ε
The summary statistics for this model fitted with R are shown below:
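A sketch of how this fit could be produced, assuming a data frame hardwood with columns conc and strength (these names are assumptions, not from the notes):

hardwood$conc.c <- hardwood$conc - mean(hardwood$conc)          # centered predictor
fit.quad <- lm(strength ~ conc.c + I(conc.c^2), data = hardwood)
summary(fit.quad)   # linear and quadratic terms; R^2 about 0.91 per the notes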

Both the linear and quadratic coefficients (β1 and β2) are significant, and the model R² is 0.9085. Compare this to the R² value of 0.3054 for the corresponding linear regression model (output not shown).


A plot of the Studentized residuals against the fitted values ŷi reveals no outliers
and no (strong) patterns. A qq-plot of the standardized residuals shows that the
distribution of the residuals is not perfectly Normal.

[Figure: Hardwood residual plot (Studentized residuals against fitted values) and Normal Q-Q plot (sample quantiles against theoretical quantiles).]

In this case it is pretty obvious that the quadratic term is a meaningful addition to the model. But we could also test “by hand” the hypothesis

H0: β2 = 0   vs.   Ha: β2 ≠ 0

using the extra-sum-of-squares method. For the linear model

y = β0 + β1 (x − x̄) + ε

the sum of squares of regression is SSR(β1 | β0) = 1043.4. Use the output from the previous page to compute the sum of squares of regression for the quadratic model and to conduct the F-test for the hypotheses mentioned above.
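Equivalently, the extra-sum-of-squares F-test can be obtained with anova() in R, continuing the hypothetical variable names from the earlier sketch:

fit.lin  <- lm(strength ~ conc.c, data = hardwood)
fit.quad <- lm(strength ~ conc.c + I(conc.c^2), data = hardwood)
anova(fit.lin, fit.quad)   # F-test of H0: beta2 = 0 via the extra sum of squares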


Piecewise Polynomial Fitting - Splines


If low-order polynomials do not provide a good fit for curvilinear data (poor residual plots, for instance), one possible cause is that the data behave differently in different parts of the predictor range. Sometimes, transformations on x or y can help. If transformations don’t help, then the predictor range can be divided into segments and a different function fit in each segment.
Definition: Splines are piecewise polynomial functions that satisfy certain “smooth-
ness” criteria. These criteria are often continuity and possibly continuity of the
derivative(s). The points where different polynomial pieces are joined together are
called the knots of the spline.
Example: A cubic spline (k = 3) with continuous first and second derivatives and
h knots t1 < t2 < · · · < th can be written as
S(x) = Σ_{j=0}^{3} β0j x^j + Σ_{i=1}^{h} βi (x − ti)+^3

where

(x − ti)+ =  (x − ti)  if x − ti > 0
             0         if x − ti ≤ 0

If the positions of the knots in a spline are known, then fitting a spline function to data reduces to a linear regression problem, since the basis functions are fixed and the model is linear in the parameters. If the positions are not known, the problem becomes more complicated. It is not easy to decide how many knots to use and how they should be placed. In general, each piece of the spline should be kept as simple as possible to avoid over-fitting the data.
A special case of spline functions are the piecewise-linear functions. As with other splines, we can require a piecewise linear function to be continuous, or we can allow discontinuities at the knots.
Example: Consider a (not necessarily continuous) piecewise linear function with a
single knot at t:
S(x) = β00 + β01 x + β10 (x − t)+^0 + β11 (x − t)+^1

If x ≤ t (before the knot), the function is the line y = β00 + β01 x. If x > t (after the knot), the function is

y = β00 + β01 x + β10 + β11 (x − t) = (β00 + β10 − β11 t) + (β01 + β11) x

Notice that β10 is the height of the vertical “jump” at the knot x = t. If we require the piecewise linear function to be continuous, then β10 must equal zero, and the spline becomes

S(x) = β00 + β01 x + β11 (x − t)+^1
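In R, such a continuous linear spline with a known knot can be fit by ordinary least squares, since pmax(x - t, 0) computes exactly (x − t)+. A sketch with made-up data and knot:

set.seed(2)
t <- 5                                   # hypothetical knot
x <- runif(40, 0, 10)
y <- 1 + 0.5 * x + 1.2 * pmax(x - t, 0) + rnorm(40, sd = 0.3)
fit.spline <- lm(y ~ x + pmax(x - t, 0))
coef(fit.spline)                         # estimates of beta00, beta01, beta11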


[Figure: a non-continuous linear spline (jump of height β10 at the knot t, slopes β01 and β01 + β11) and a continuous linear spline (intercept β00, slopes β01 and β01 + β11).]

So far, we have seen models that use polynomials or lines (polynomials of degree one) that are pieced together at knots. Alternatively, other functions can be considered. If the scatterplot shows some periodic behavior, then trigonometric functions (sine, cosine) may be reasonable to include in the model. The trigonometric terms can have varying amplitudes and frequencies.
Nonparametric Regression
So far, we have discussed linear regression and polynomial regression (and some more exotic variants of regression). What all these models have in common is that they specify a functional relationship (line, plane, parabolic surface, etc.) between the predictors and the response. And so far, we (the users) have always had the ultimate decision of which model to use. There is an alternative: we could decline to specify a model and instead allow the data to “pick” its own model.
In nonparametric regression, the regression function does not take any predetermined shape but is derived entirely from the data. The conventional parametrized regression model is

yi = f(xi, β) + εi

where we specify the general class of the function f (linear, quadratic, etc.) and use the data to estimate the function parameters β. The nonparametric regression model is similar,

yi = f(xi) + εi

but now our (more ambitious) goal is to estimate the function f itself. Of course, without any restrictions, there are uncountably infinitely many choices. So we’ll restrict the problem a bit.
Recall that in ordinary linear least squares regression, the predicted values ŷ can be written as linear combinations of the observed values y. The coefficients in the linear combinations are determined by the entries of the hat matrix:

ŷ = Hy

Most nonparametric regression models also model the predicted values as linear
combinations of the observations, but with different weights.


Kernel Regression
As in the OLS case, the predicted observations are computed as weighted sums of
the actual observations. Let ỹi be the kernel smoother estimate for observation yi .
Then

ỹi = Σ_{j=1}^{n} wij yj

where the weights wij sum to one (over j). Alternatively, we could write

ỹ = Sy, where S = (wij )

S is called the smoothing matrix. The weights are typically chosen such that wij = 0
for all points yj outside of a specified “neighborhood” of the point yi . The width
of this neighborhood is sometimes also called the bandwidth of the kernel. The
larger the bandwidth is chosen, the smoother the estimated function will become.
There are many possible kernel functions. They need to satisfy the following properties:

• K(t) ≥ 0 for all t

• ∫_{−∞}^{∞} K(t) dt = 1

• K(−t) = K(t) (symmetry)

Note that these are the properties of symmetric probability density functions, which means that, for instance, the Normal distribution (Gaussian kernel), the triangular distribution (triangular kernel), and the uniform distribution (uniform or box kernel) make good kernel functions.
Gaussian kernel function:    K(t) = (1/√(2π)) exp(−t²/2)

Triangular kernel function:  K(t) = 1 − |t| if |t| ≤ 1, and 0 if |t| > 1

Uniform (box) kernel:        K(t) = 0.5 if |t| ≤ 1, and 0 if |t| > 1

The weights for the kernel smoother are chosen as

wij = K((xi − xj)/b) / Σ_{k=1}^{n} K((xi − xk)/b)

where b is the bandwidth.
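A minimal sketch of this estimator in R (illustrative, not code from the notes), using a Gaussian kernel:

kernel.smooth <- function(x, y, b) {
  K <- function(t) dnorm(t)              # Gaussian kernel
  sapply(x, function(xi) {
    w <- K((xi - x) / b)                 # unnormalized weights
    sum(w * y) / sum(w)                  # weighted average; normalized weights sum to 1
  })
}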


[Figure: the rectangular, triangular, and Gaussian kernel functions K(x) plotted on −3 ≤ x ≤ 3.]

Example: Suppose you want to fit a nonparametric regression curve to the following
data:
x 1 2 4 5 7
y 1 4 2 3 1
The R function ksmooth() computes a kernel smoother (options are Gaussian and uniform kernels) with a chosen bandwidth. The results for three different bandwidths (1, 2, and 3) are shown below.
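A sketch of how these fits could be produced with ksmooth(); the five data points are the ones from the table above, while the plotting details are guesses:

x <- c(1, 2, 4, 5, 7)
y <- c(1, 4, 2, 3, 1)
plot(x, y)
for (b in 1:3) {
  lines(ksmooth(x, y, kernel = "normal", bandwidth = b))   # Gaussian kernel
}
# kernel = "box" gives the uniform kernel instead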

[Figure: kernel smoother fits to the example data with a Gaussian kernel (left) and a uniform kernel (right), for bandwidth choices 1, 2, and 3.]

Note: The regression “function” in this case is a sequence of points for which we
have computed predictions. There is no algebraic function that is estimated.
For large sample sizes, the choice of bandwidth becomes much more important than the choice of the kernel function.


Locally Weighted Regression (LOESS)


Locally weighted scatterplot smoothing, often referred to as LOWESS or LOESS, is similar to kernel regression in that it uses data from a neighborhood of the point where the regression function is to be estimated. The neighborhood is usually defined not by a bandwidth but by a span, which is the fraction of the available data points used for the smoother.
The LOESS procedure then uses the points in the neighborhood to fit a weighted least squares regression, typically either a simple linear regression or a low-order polynomial. The weights are based on the distances of the sample points from the point of interest where the function is to be estimated.
The function loess() in R produces a weighted smoothed curve of predicted response values at a sequence of specified points. The function can handle up to four numeric predictors. The size of the neighborhood used for smoothing is specified with the span argument. The points are weighted with tricube weights. Let x0 be the location where we want to compute a prediction and let Δ(x0) denote the distance from x0 to the farthest point in the neighborhood. Then the weight for a point xi in the neighborhood is

wi = (1 − (|x0 − xi| / Δ(x0))³)³

Weights for points outside the neighborhood are wi = 0.


Example: The Windmill Data
Recall that this dataset models the DC output of a windmill as a function of wind velocity. In Chapter 5, we decided that the best simple linear regression model for these data uses the reciprocal of wind velocity as the predictor. A close second was modeling DC output as a quadratic function of wind velocity. If LOESS is applied to this data set (with quadratic smoothing), the result is very similar to the fitted polynomial model.
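A sketch of how such LOESS fits could be produced, assuming a data frame windmill with columns velocity and DC (these names are assumptions, not from the notes):

fit50 <- loess(DC ~ velocity, data = windmill, span = 0.50, degree = 2)
fit75 <- loess(DC ~ velocity, data = windmill, span = 0.75, degree = 2)
plot(windmill$velocity, windmill$DC,
     xlab = "wind velocity (mph)", ylab = "DC current")
v <- seq(min(windmill$velocity), max(windmill$velocity), length.out = 100)
lines(v, predict(fit50, data.frame(velocity = v)), col = "red")
lines(v, predict(fit75, data.frame(velocity = v)), col = "green")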

[Figure: Windmill data, DC current against wind velocity (mph), with LOESS smoothers.]


The red curve corresponds to a LOESS smoother that uses 50% of the data points, while the green curve corresponds to a LOESS smoother that uses 75% of the points. For a LOESS smoother with 100% of the points, the results would be similar (but not identical) to the parametric polynomial regression in wind velocity; the difference comes from the distance-based (tricube) weighting.
Note: As we have seen so far, nonparametric models can sometimes be quite similar to parametric models and sometimes they can be quite different. Which model to prefer depends on the specific situation. If there is a good explanation for a specific parametric model that comes from the experimental context, then this model is usually preferred if it provides a reasonable fit.
If no reasonable parametric model provides an adequate fit to the data, then nonparametric models can provide viable alternatives.
Polynomial Models in Two or More Variables
Recall that for an ordinary least squares regression model, the model “surface” for the additive model

y = β0 + β1 x1 + β2 x2 + ε

is a plane in space, while the model surface for the OLS model with interaction,

y = β0 + β1 x1 + β2 x2 + β12 x1 x2 + ε,

is a curved surface. The quadratic regression model with two variables is an extension of this idea:

y = β0 + β1 x1 + β2 x2 + β11 x1² + β22 x2² + β12 x1 x2 + ε

Recall that, depending on the values of the β coefficients, the response surface can take many different shapes.

[Figure: four examples of second-order response surfaces with different shapes.]


Under certain circumstances, namely when the matrix

[  β0    ½β1    ½β2  ]
[ ½β1    β11    ½β12 ]
[ ½β2   ½β12    β22  ]
is negative (or positive) definite, the response surface has a single maximum (or minimum). These cases may be of interest if a response function is to be maximized (or minimized) as a function of two predictors. This task is called Response Surface Methodology (RSM) in practice.
Example: Chemical Process
Engineers are interested in maximizing the percent conversion (y) in a chemical
process. They know that the conversion depends on the reaction temperature (T, in °C) and the reactant concentration (C, in %). They have a vague idea which
temperature and reactant concentrations tend to lead to higher response values, but
they would like to find the optimal combination.
They design the experiment in the form of a central composite design. That
means that they will run experiments for specific combinations of T and C and
measure the response. The values at which the experiments will be run are laid out
in the form of a square:

Four observations are collected in the center of the square, (x1, x2) = (0, 0), and one observation each at the corners of the square and at the radial axis runs (±√2, 0) and (0, ±√2). The data for this experiment can be found in the file “ChemicalProcess.txt” on the course website.
We will use R to fit a second order model for the response y as a function of the coded variables x1 and x2:

y = β0 + β1 x1 + β2 x2 + β11 x1² + β22 x2² + β12 x1 x2 + ε
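A sketch of this fit, assuming the file contains columns named x1, x2, and y (the column names are an assumption):

chem <- read.table("ChemicalProcess.txt", header = TRUE)
fit.rsm <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2, data = chem)
summary(fit.rsm)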


The fitted model equation becomes:

ŷ = 79.75 + 9.825 x1 + 4.216 x2 − 8.877 x1² − 5.125 x2² − 7.75 x1 x2

Note that the standardization procedure was

x1 = (T − 225)/25   and   x2 = (C − 20)/5

What does that mean for the quadratic regression model in the original variables T and C?

The coefficients table in R tells us that each component in the model (the linear, quadratic, and mixed terms) is significant. Thus, the model cannot easily be simplified. The residuals are approximately normally distributed. The residual plot shows a pattern, but that is to be expected with such a small sample size. Overall, the residuals fall pretty much within the “pure error” range that we can observe from the repeated observations at the center of the design.

[Figure: Chemical Process data, Studentized residuals against fitted values, and Normal Q-Q plot of the sample quantiles against theoretical quantiles.]

The optimal temperature (T) and concentration (C) combination that would lead to the highest expected response value can now be determined by finding the maximum of the fitted response surface (take partial derivatives, set them equal to zero, and solve the corresponding system). The maximum occurs at approximately 245°C and 20% concentration.
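In matrix form, this calculation can be organized as follows (a standard response-surface identity, sketched here rather than taken from the notes). Write the fitted second-order part as

ŷ = b0 + x′b + x′Bx,   with   b = (β̂1, β̂2)′   and   B = [  β̂11   ½β̂12 ]
                                                         [ ½β̂12   β̂22  ]

Setting the gradient b + 2Bx equal to zero gives the stationary point x_s = −½ B⁻¹ b in the coded variables (x1, x2), which can then be converted back to T and C through the standardization given above.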


Orthogonal Polynomials
We have seen previously that, when fitting polynomial regression models in one variable, ill-conditioning is frequent due to the multicollinearity between the predictors x, x², etc. Some of these problems can be fixed by formulating the model in terms of orthogonal polynomials.
Suppose we want to fit a polynomial regression model with one predictor x and degree k:

yi = β0 + β1 xi + β2 xi² + · · · + βk xi^k + εi

Generally, for this model, the columns of the X matrix will be (highly) correlated. Additionally, if another term β_{k+1} x^{k+1} were added to the model, the matrix (X′X)⁻¹ would have to be recomputed and the estimates of the other β-parameters would change.
Instead, consider the polynomial model

yi = α0 P0(xi) + α1 P1(xi) + α2 P2(xi) + · · · + αk Pk(xi) + εi

where Pu(xi) is a polynomial of degree u defined such that

Σ_{i=1}^{n} Pr(xi) Ps(xi) = 0   for r ≠ s,   r, s = 1, 2, . . . , k

and P0(xi) = 1.
Then the multiple linear regression model becomes y = Xα + ε, where the predictor matrix is

    [ P0(x1)  P1(x1)  · · ·  Pk(x1) ]
X = [ P0(x2)  P1(x2)  · · ·  Pk(x2) ]
    [   ...     ...    ...     ...  ]
    [ P0(xn)  P1(xn)  · · ·  Pk(xn) ]
Since the polynomials are orthogonal, the X′X matrix has the following simple form, which is very easy to invert:

      [ Σ P0(xi)²      0       · · ·      0      ]
X′X = [     0      Σ P1(xi)²   · · ·      0      ]
      [    ...        ...       ...      ...     ]
      [     0          0       · · ·  Σ Pk(xi)²  ]

where each sum runs over i = 1, . . . , n.

Note: There are several different methods by which orthogonal polynomials can be
obtained for a given set of data. R uses a different method than the one described
in your text.


Find the least squares estimates α̂ of this model:
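A sketch of the answer, using standard least squares algebra rather than anything specific to the notes: because X′X is diagonal,

α̂ = (X′X)⁻¹ X′y,   so that   α̂j = Σ_{i=1}^{n} Pj(xi) yi / Σ_{i=1}^{n} Pj(xi)²,   j = 0, 1, . . . , k.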

Note: The zero-degree polynomial can be set to P0(xi) = 1 for i = 1, . . . , n. Thus, α̂0 = ȳ.
R automatically uses orthogonal polynomials in polynomial regression unless you explicitly ask for something else.
Example: Inventory System
An operations research analyst has developed a computer simulation model for a single-item inventory system. He is looking at the reorder quantity (x) for the item and the average annual cost (y) of the inventory. The data are available in the file “Inventory.txt”.
A scatterplot clearly shows that there is a parabolic (quadratic) relationship between cost and reorder quantity. The R command

fit <- lm(y ~ poly(x, degree = 2), data)

fits a polynomial regression model with orthogonal polynomials to the data.

[Figure: Inventory data, scatterplot of average annual cost against reorder quantity, showing a parabolic relationship.]

If you do not want to fit orthogonal polynomials, you have to include the argument
raw = TRUE inside the poly() function.
You can see which orthogonal polynomials R uses by typing poly(your predictor,
degree = 2). Caution: they are not the polynomials described in the book.

R uses orthogonal polynomials for which Σ_{i=1}^{n} Pj(xi)² = 1 for j = 1, 2, and P0(xi) = 1.

Fit both models in R and compare the coefficients:
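A sketch of this comparison, assuming the file has columns x (reorder quantity) and y (average annual cost); the column names are an assumption:

inv <- read.table("Inventory.txt", header = TRUE)
fit.orth <- lm(y ~ poly(x, degree = 2), data = inv)              # orthogonal polynomials
fit.raw  <- lm(y ~ poly(x, degree = 2, raw = TRUE), data = inv)  # raw powers of x
coef(fit.orth)   # coefficients of the orthogonal polynomials
coef(fit.raw)    # beta0, beta1, beta2 of the raw quadratic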

Note: prediction in R is not influenced by the polynomials chosen for the regression:
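For instance, continuing the sketch above (the value x = 150 is arbitrary), the fitted and predicted values agree between the two parameterizations:

all.equal(unname(fitted(fit.orth)), unname(fitted(fit.raw)))   # TRUE
predict(fit.orth, newdata = data.frame(x = 150))
predict(fit.raw,  newdata = data.frame(x = 150))               # same value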

But the stability of the estimates is. The method that uses orthogonal polynomials
is not subject to the numerical problems that occur in the estimation of the model
parameters when the data are subject to multicollinearity. Therefore, in general,
predictions made with the orthogonal polynomial method are preferred. But it can
be tricky to derive the “raw” model equation for this fit.

