
07-11-2019

A PRIMER ON LINEAR REGRESSION

Sam Spade, owner and general manager of the Campus Stationery Store, faces a perplexing
problem. His store has been serving the needs of a large university community for a number of
years. Besides selling office supplies and various forms of university memorabilia, the store has
done a brisk business in more expensive items such as microcomputers and calculators. In
particular, one popular item is an inexpensive noise cancelling headphone.

Sam has followed the sales behavior of the noise cancelling headphone over the past four years
since it was introduced, and is worried about the seemingly erratic behavior of the number of
headphones sold. He realizes that there are many factors which might help explain sales, but believes
that advertising and price are major determinants. Most of the advertising is done through the
local campus newspapers, while suggested list price is set at the beginning of every quarter.
After culling through the store's records, Sam extracts the following data on sales (number of
noise cancelling headphones sold), advertising (number of advertisements), and suggested list
price for every quarter over the past four years.

Quarter   Sales (number of units sold)   Advertising (number of advertisements)   Suggested list price
1 33 3 $125.00
2 38 7 130.00
3 24 6 140.00
4 61 6 115.00
5 38 10 155.00
6 54 8 120.00
7 17 9 145.00
8 70 10 140.00
9 52 10 145.00
10 45 12 160.00
11 65 12 135.00
12 82 13 130.00
13 29 12 170.00
14 63 13 160.00
15 50 14 190.00
16 79 15 160.00

Note: Supporting documents giving Balance Sheets, Operating Statements, and Sources and Uses of Funds Statements are
given in Appendices A, B, and C respectively. These appendices, missing from this handout and absolutely critical to the
understanding of the case, are available from your instructor at $20.00 each (cash only please).

© 2015 by the Kenan-Flagler Business School, University of North Carolina, Chapel Hill, NC 27599-3490. Not to be reproduced
without permission. All rights reserved.

This technical note was prepared by John P. Evans, J. Morgan Jones, Adam J. Mersereau, Alan W. Neebe, and David S. Rubin.
I INTRODUCTION
It is typical in business and economic problems to have large masses of data available for all
sorts of decision making and forecasting, but not to be able to see immediately how all the data
fit together. Sometimes we are lucky and we have only two variables to consider. One variable
is the one we wish to explain or predict, and the other variable is the one that does the explaining.
In this case we can plot the data and graphically visualize the statistical relationship between the
two variables. But at other times the statistical relationships involve many variables. Since most
of us cannot visualize beyond two or three dimensions, these relationships cannot be observed
graphically and might go undetected.

Linear regression is one widely used statistical technique which allows us to analyze such
complex relationships. Linear regression is relatively easy to understand and, since a computer
will usually do most of the calculation, it is easy to use. In fact, it is so easy to use that some
people try to use it without first thinking about the problem they wish to analyze. At this point
linear regression becomes dangerous to the user. Like any computerized technique, a linear
regression computer package will unfailingly do exactly what you tell it to do. And if you tell it to
do something inappropriate, the computer will do the inappropriate analysis perfectly, carrying
out the misleading calculations to 15 significant digits. For this reason you will have to
understand some of the theory that lies behind the calculations. You will see that applying linear
regression correctly is part art and part science. We hope you will learn to appreciate both its
advantages and its limitations.

We will consider the case of the Campus Stationery Store. Sam Spade decides to use Microsoft
Excel to analyze the data. He starts by entering the data into a worksheet. This is shown in
Figure 1. Note that the three variable names are SALES, ADS, and PRICE. Suppose Spade
first wishes to examine the possible statistical relationship between sales (SALES) and the
number of advertisements (ADS). He decides that it might be useful to look at a scatterplot or
scattergram between these two variables. Constructing a scatter diagram is a natural first step
in regression analysis, since the diagram frequently reveals patterns or characteristics of the
data which might otherwise go undetected. That is, the diagram might reveal whether there is
a statistical relationship linking these two variables, and whether the relationship can be
represented by a curved line or by a straight line, and how strong the relationship is. Using Excel
he obtains a scatter plot of SALES versus ADS as shown in Figure 2. On the basis of the scatter
diagram, he decides to begin his analysis of sales behavior with a straight-line model. The
diagram does not reveal any obvious nonlinear patterns, and he feels that a straight line model
at least should be investigated before any more complicated model is considered. Furthermore,
a straight line model is consistent with his intuitive belief that, providing advertising is kept within
a realistic range, each additional advertisement should have roughly the same positive impact
on sales.

We will see later that we can easily examine nonlinear relationships or try to incorporate other
variables such as price using linear regression. Also note that the manager believes that
advertising influences sales, rather than the other way around (which might have been true if it
were his policy to allocate a fixed proportion of revenue to advertising). As such, advertising will
be called the independent or explanatory variable (since it does the explaining), and sales is
called the dependent variable (since its value is explained by, or depends on, other factors).

Figure 1

CAMPUS STATIONERY STORE


SALES ADS PRICE
33 3 125
38 7 130
24 6 140
61 6 115
38 10 155
54 8 120
17 9 145
70 10 140
52 10 145
45 12 160
65 12 135
82 13 130
29 12 170
63 13 160
50 14 190
79 15 160

Figure 2

SALES versus ADS (scatterplot: SALES on the vertical axis, 0 to 90; ADS on the horizontal axis, 0 to 20)

We will say that the manager has decided to regress sales on advertising. It is traditional to
denote the dependent variable by the symbol Y, and the explanatory variable (or variables) by
the symbol X.

II FITTING A STRAIGHT LINE
Figure 3 shows the straight line fit to our set of data points. Suppose the manager planned to
run 10 advertisements next quarter. Using the straight line, what level of sales should he expect?
The predicted value for sales is obtained by reading up to the line from X = 10, and then across.
You should get a sales level of about 50 units. The estimated or predicted value of the
dependent variable is usually denoted by Ŷ (where the “hat” means “estimate of”). Thus Ŷ is
the estimated value of Y given X. Since we are assuming a straight line fit between Y and X, Ŷ
is of the form Ŷ = a + bX, where the constant term "a" is called the intercept coefficient and the
term "b" which is multiplied by X is called the slope coefficient. Because the regression line does
not fit the data perfectly, there are errors in using the line to summarize the data. The error
associated with the ith observation (Xi, Yi) is the vertical distance between the observed value of
Yi and the estimated value Ŷi . The error represents the number of units by which the estimate
is incorrect in the ith observation. This error is usually called the residual and is denoted by ei.
Thus

ei = Yi − Ŷi = the ith residual

We want to find values of a and b which give a "good" fit to the set of points and which minimize
the "total error" defined in an appropriate sense. While other measures of fit are sometimes
used, by far the most popular measure is the so-called least squares one. The least squares
criterion seeks values of a and b which minimize the sum of the squares of the residuals.
Symbolically, we wish to

Minimize ∑ ei² = ∑ (Yi − a − bXi)²

where n denotes the number of observations. Remember that we are solving for the values of
“a” and “b” and that Xi and Yi represent known numeric quantities.
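
For readers who would like to see the arithmetic spelled out, here is a minimal sketch of the least squares calculation in Python (an assumption; the note itself works entirely in Excel). It uses the standard closed-form solution and reproduces the intercept and slope that Excel reports for Model 1 (Figure 4) below.

```python
import numpy as np

# Quarterly data from the Campus Stationery Store case
sales = np.array([33, 38, 24, 61, 38, 54, 17, 70, 52, 45, 65, 82, 29, 63, 50, 79])
ads   = np.array([3, 7, 6, 6, 10, 8, 9, 10, 10, 12, 12, 13, 12, 13, 14, 15])

# Least squares: b = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2),  a = Ybar - b * Xbar
x_dev, y_dev = ads - ads.mean(), sales - sales.mean()
b = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
a = sales.mean() - b * ads.mean()

residuals = sales - (a + b * ads)
print(a, b)                    # 20.0 and 3.0
print(residuals.sum())         # 0: the residuals of a least squares line sum to zero
print(np.sum(residuals ** 2))  # 4114.0: the minimized sum of squared residuals
```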

Figure 3

SALES versus ADS, with the fitted straight line Ŷ = a + bX drawn through the points; the residuals e1, e2, e3, … are the vertical distances from the data points to the line, and a and b are the intercept and slope.

The least squares criterion is so common that the term "regression" is used synonymously with
"least squares" regression. This criterion is important for a number of reasons. It is
computationally convenient, although with the advent of computers this has become of lesser
concern. It is consistent with the statistical definition of variance, and possesses some important
statistical properties. It can be shown that the sum of the residuals is zero for a line computed
by the least squares criterion, so that the average error is zero. Finally, the least squares
criterion, because of the squaring, gives extra weight to what would otherwise be large residuals.
This is consistent with the decision making belief that an error twice as large usually results in
more than twice as many unfortunate consequences, and as such should be penalized more
than twice as much.

While there are formulas for calculating the intercept and slope coefficients, we prefer to use
Excel to do the calculation. To run a regression in Excel, select Tools, then Data Analysis, and
then Regression. The regression dialog box appears. To regress SALES on ADS, and
remembering how the data is given in Figure 1, after Input Y Range we highlighted the SALES
column and after Input X Range we highlighted the ADS column. We checked Labels (because
we highlighted the variable names as well as the numeric values). WE DID NOT check
Constant is Zero! We did not bother to check Confidence Level because the default 95
percent confidence level was acceptable. Under Output Options we asked that the output be
sent to a New Worksheet and under Residuals we simply asked for Residuals.

The Excel regression output for this model, Model 1, is given in Figure 4. The intercept and
slope coefficients can be found in the center of the worksheet under the heading "Coefficients."
The intercept coefficient is a = 20.0000 and the slope coefficient for the explanatory variable
ADS is b = 3.0000. Thus the least squares regression line is Ŷ = 20 + 3X. The bottom third of
Figure 4 contains the RESIDUAL OUTPUT listing the Observation indices, the Predicted
SALES values Ŷi , and the Residuals ei. You can check that the sum of the residuals is zero,
and that the sum of the squared residuals is 4114. Remember that by our choice of “a” and “b”,
we made this sum of squared residuals as small as possible.

In Figures 5 and 6 we see the scatterplot with a linear trend line and the regression output for
this model, Model 2, the regression of SALES on list PRICE. As expected, the slope is negative,
and the least squares line is Ŷ = 64.6218 - 0.1008X. Remember that X now refers to PRICE.

The above two regressions are called simple regressions because each involves only one
explanatory variable. If more than one explanatory variable is included, we have what is called
multiple regression, although we are still talking about linear regression. For example, suppose
the manager of the Campus Stationery Store wishes to examine the regression of SALES on
ADS and PRICE. The regression equation is of the form Ŷ = a + b1X1 + b2X2 , where "a" is
still the intercept coefficient, "b1" is the slope coefficient associated with the first explanatory
variable X1 (ADS), and "b2" is the slope coefficient associated with the second explanatory
variable X2 (PRICE). With two explanatory variables the fit is in three dimensions, and the
regression is obtained by fitting a plane to the data. Our objective is to find values of a, b1, and
b2 so as to

Figure 4

MODEL 1
SALES versus ADS
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.5161
R Square 0.2664
Adjusted R Square 0.2140
Standard Error 17.1423
Observations 16

ANOVA
df SS MS F Significance F
Regression 1 1494.0000 1494.0000 5.0841 0.0407
Residual 14 4114.0000 293.8571
Total 15 5608.0000

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 20.0000 13.9781 1.4308 0.1744 -9.9802 49.9802
ADS 3.0000 1.3305 2.2548 0.0407 0.1464 5.8536

RESIDUAL OUTPUT

Observation Predicted SALES Residuals


1 29.0000 4.0000
2 41.0000 -3.0000
3 38.0000 -14.0000
4 38.0000 23.0000
5 50.0000 -12.0000
6 44.0000 10.0000
7 47.0000 -30.0000
8 50.0000 20.0000
9 50.0000 2.0000
10 56.0000 -11.0000
11 56.0000 9.0000
12 59.0000 23.0000
13 56.0000 -27.0000
14 59.0000 4.0000
15 62.0000 -12.0000
16 65.0000 14.0000

Figure 5

SALES versus PRICE (scatterplot with a linear trend line: SALES on the vertical axis, 0 to 90; PRICE on the horizontal axis, 0 to 200)

Figure 6

MODEL 2
SALES versus PRICE
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.1039
R Square 0.0108
Adjusted R Square -0.0599
Standard Error 19.9060
Observations 16

ANOVA
df SS MS F Significance F
Regression 1 60.5042 60.5042 0.1527 0.7019
Residual 14 5547.4958 396.2497
Total 15 5608.0000

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 64.6218 37.7486 1.7119 0.1090 -16.3410 145.5847
PRICE -0.1008 0.2581 -0.3908 0.7019 -0.6543 0.4527

RESIDUAL OUTPUT

Observation Predicted SALES Residuals


1 52.0168 -19.0168
2 51.5126 -13.5126
3 50.5042 -26.5042
4 53.0252 7.9748
5 48.9916 -10.9916
6 52.5210 1.4790
7 50.0000 -33.0000
8 50.5042 19.4958
9 50.0000 2.0000
10 48.4874 -3.4874
11 51.0084 13.9916
12 51.5126 30.4874
13 47.4790 -18.4790
14 48.4874 14.5126
15 45.4622 4.5378
16 48.4874 30.5126

Minimize ∑ ei² = ∑ (Yi − a − b1Xi1 − b2Xi2)²

The Excel output for this multiple regression model, Model 3, is given in Figures 7 and 8. When
running the regression, we asked for the Residuals, the Standardized Residuals, and the
Residual Plots. (We did not ask for the Line Fit Plots or the Normal Probability Plots. Instead
of giving a normal probability plot of the residuals, which would be really useful, the latter instead
gives a plot of the values of the dependent variable SALES, which is of dubious value.)

The regression equation is Ŷ = 109.6089 + 6.6011X1 - 0.8663X2. Thus the intercept on the
Y-axis is 109.6089. Holding PRICE constant, the slope of SALES with respect to ADS is 6.6011,
while with ADS held constant the slope of SALES with respect to PRICE is -0.8663. Since the
regression fit is in three dimensions, the regression line is actually a regression plane as depicted
in Figure 9.
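
As a cross-check outside Excel, the same multiple regression can be fit with Python's statsmodels package (an assumption of this sketch, not something the note uses); the coefficients and R² match the Figure 7 output.

```python
import numpy as np
import statsmodels.api as sm

sales = np.array([33, 38, 24, 61, 38, 54, 17, 70, 52, 45, 65, 82, 29, 63, 50, 79])
ads   = np.array([3, 7, 6, 6, 10, 8, 9, 10, 10, 12, 12, 13, 12, 13, 14, 15])
price = np.array([125, 130, 140, 115, 155, 120, 145, 140, 145, 160, 135, 130, 170, 160, 190, 160])

# Model 3: regress SALES on ADS and PRICE (with an intercept, as in the Excel output)
X = sm.add_constant(np.column_stack([ads, price]))
model = sm.OLS(sales, X).fit()

print(model.params)    # approximately [109.6089, 6.6011, -0.8663]
print(model.rsquared)  # approximately 0.6789
print(model.resid[:3]) # first few residuals, matching the RESIDUAL OUTPUT in Figure 7
```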

Figure 7

MODEL 3
SALES versus ADS and PRICE
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.8239
R Square 0.6789
Adjusted R Square 0.6295
Standard Error 11.7698
Observations 16

ANOVA
df SS MS F Significance F
Regression 2 3807.1302 1903.5651 13.7413 0.0006
Residual 13 1800.8698 138.5284
Total 15 5608.0000

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 109.6089 23.9373 4.5790 0.0005 57.8955 161.3222
ADS 6.6011 1.2693 5.2006 0.0002 3.8589 9.3432
PRICE -0.8663 0.2120 -4.0863 0.0013 -1.3244 -0.4083

RESIDUAL OUTPUT

Observation Predicted SALES Residuals Standard Residuals


1 21.1194 11.8806 1.0094
2 43.1919 -5.1919 -0.4411
3 27.9275 -3.9275 -0.3337
4 49.5860 11.4140 0.9698
5 41.3366 -3.3366 -0.2835
6 58.4564 -4.4564 -0.3786
7 43.3989 -26.3989 -2.2429
8 54.3317 15.6683 1.3312
9 50.0000 2.0000 0.1699
10 50.2070 -5.2070 -0.4424
11 71.8655 -6.8655 -0.5833
12 82.7983 -0.7983 -0.0678
13 41.5436 -12.5436 -1.0657
14 56.8081 6.1919 0.5261
15 37.4189 12.5811 1.0689
16 70.0102 8.9898 0.7638

Figure 8

ADS Residual Plot and PRICE Residual Plot: the Model 3 residuals (roughly −30 to +20) plotted against ADS (0 to 16) and against PRICE ($0 to $200), respectively.

Figure 9

(The fitted regression plane for Model 3: SALES as a function of ADS and PRICE.)

III MEASURES OF GOODNESS OF FIT
So far we have seen how to use Excel to fit a straight line (or more generally a plane) through a
set of data points. A natural question to ask is: "How good is the fit?". We will look at two
measures of goodness of fit. One (the coefficient of determination) is independent of the units
in which the variables are measured, and the other (the standard error of the estimate) is
measured in the same units as the dependent variable.

The Coefficient of Determination

For a particular observation, the difference between Yi and Ȳ (the average value of Y) is called
the total deviation of Yi. Thus the total deviation of Yi is Yi − Ȳ. By subtracting and adding Ŷi
(the estimated value of Yi), we can write:

( Yi − Ȳ ) = ( Yi − Ŷi ) + ( Ŷi − Ȳ )

   total    =  unexplained  +  explained
 deviation      deviation      deviation

Part of the total deviation is "explained" by the regression estimate Ŷi . Observe that the
unexplained deviation is the same quantity we previously called the residual. This relationship
can be seen by examining Figure 10. Before we used the regression line, the best estimate we
had for Yi was just the average value Ȳ. Thus ( Yi − Ȳ ) is the total deviation that we would like
to use the explanatory variable(s) to explain. Some of the total deviations are positive, and some
are negative, and we would like to be able to explain these differences. Using the regression
line, we are now able to account for part of this total deviation using the explanatory variable(s).
So the regression line takes us from Ȳ to Ŷi . The remaining part from Ŷi to Yi (the residual) is
still unexplained and is due to factors not in our regression model.

It is interesting to note that a new version of the equality above still holds when these three
deviations are individually squared and then summed across all of the observations (provided
that we are using the least squares regression line). This gives us one of the fundamental
relationships in regression analysis:

∑ ( Yi − Ȳ )²   =   ∑ ( Yi − Ŷi )²   +   ∑ ( Ŷi − Ȳ )²

     total      =     unexplained    +     explained
   variation           variation            variation

(sum of squares,   (sum of squares      (sum of squares
     total)          due to error)      due to regression)

     SST        =        SSE          +        SSR

Figure 10

Deviations: for a typical data point, the total deviation (Yi − Ȳ) splits into the explained deviation (Ŷi − Ȳ), measured from Ȳ up to the fitted line Ŷ = a + bX, and the unexplained deviation (Yi − Ŷi), measured from the line up to the data point.

In statistics the sum of squared deviations is usually called a variation. The total variation of the
dependent variable Y (also called the sum of squares total, or simply SST) can be partitioned
into two component parts: the unexplained part (the sum of squares due to error, or SSE) and
the explained part (the sum of squares due to regression, or SSR). Thus the dependent variable
Y has a certain amount of variability (or variation) associated with it (as measured by SST). Part
of this total (as measured by SSR) can be accounted for by the explanatory variable(s), while
the remaining part (as measured by SSE) is unaccounted for.

Observe that we have encountered some of these sums of squares before. SST is just (n-1)
times the sample variance of Y, and is thus consistent with our usual definition of variability. On
the other hand, SSE is just the sum of the squares of the residuals, which is what we minimize
when we do least squares regression. Since we minimize the unexplained variation (SSE) in
performing least squares regression, at the same time we maximize the explained variation
(SSR) for the explanatory variables in that regression.

At this point you may be wondering where this whole discussion is leading. The important
concept to remember is that the total variability of the dependent variable can be partitioned into
two parts: an explained part and an unexplained part. A natural question to ask is: "What
proportion of the total variation in the dependent variable can be accounted for by the
explanatory variable(s)?". Thus it is natural to look at

SSR ÷ SST

This ratio is called the coefficient of determination, and is denoted by R2 (i.e., R-squared). The
coefficient of determination is one of the most important measures of goodness of fit in linear
regression. Provided the values of the dependent variable Y are not all the same (which would
be a pretty boring dependent variable), SST is positive, while SSE and SSR are always
non-negative. If SSR = SST, then R2 = 1, and all of the variation in the dependent variable can
be explained by changes in the explanatory variable(s). If SSR = 0, then R2 = 0 and none of this
variation can be explained. Otherwise, R2 is between 0 and 1.
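
The partition of the total variation can be verified numerically. The sketch below (Python, an assumption; any tool would do) computes SST, SSE, SSR, and R² for Model 1, the regression of SALES on ADS.

```python
import numpy as np

sales = np.array([33, 38, 24, 61, 38, 54, 17, 70, 52, 45, 65, 82, 29, 63, 50, 79])
ads   = np.array([3, 7, 6, 6, 10, 8, 9, 10, 10, 12, 12, 13, 12, 13, 14, 15])

y_hat = 20 + 3 * ads                       # fitted values from Model 1
sst = np.sum((sales - sales.mean()) ** 2)  # total variation
sse = np.sum((sales - y_hat) ** 2)         # unexplained variation (sum of squared residuals)
ssr = np.sum((y_hat - sales.mean()) ** 2)  # explained variation

print(sst, sse, ssr)  # 5608.0, 4114.0, 1494.0 -- note that SST = SSE + SSR
print(ssr / sst)      # R-squared = 0.2664, as in Figure 4
```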

The Standard Error of the Estimate

Another measure of goodness of fit can be obtained by looking at the residuals. If the fit is good,
the data points will tend to lie close to the regression line, and the residuals will tend to be close
to zero. Thus we can look at the size of the residuals to tell us something about the fit.
Remembering that our usual measure of variability is the standard deviation, we can look at the
standard deviation of the residuals, which we denote by se. Since the average value of the
residuals, ē, is zero, we can write:

se = √[ ∑ (ei − ē)² ÷ (n − k − 1) ] = √[ ∑ ei² ÷ (n − k − 1) ] = √[ SSE ÷ (n − k − 1) ]

Here k denotes the number of explanatory variables. In calculating this standard deviation we
divide by (n-k-1) because we have used up k degrees of freedom in estimating the k slope
coefficients and used up 1 degree of freedom estimating the intercept of the regression line. The
standard deviation of the residuals is usually called the standard error of the estimate.
Sometimes it is called the root mean square error.

Returning to Figure 7, we see that much of this information is contained in the Analysis of
Variance or ANOVA table given in the middle of the page. We see the three sum of squares (or
SS) values mentioned above. The sum of squares due to the Regression is SSR = 3807.1302.
The sum of squares due to Error (or due to the Residual) is SSE = 1800.8698. By summing we
see that the Total sum of squares is 5608.0000. Sometimes SST is called the corrected total
sum of squares because the Yi values have been "corrected" by subtracting Y . Each of the
three sum of squares values has a certain number of degrees of freedom (df) associated with it.
Regression has k degrees of freedom. (Remember that k denotes the number of explanatory
variables so in this case k = 2.) Error or Residual has (n-k-1) degrees of freedom (n-k-1 = 16-2-1
= 13). Total has (n-1) degrees of freedom (n-1 = 15). Dividing each of the first two sums of
squares by its associated number of degrees of freedom gives us the mean squares (or MS).
Thus the mean square regression is 3807.1302÷2 = 1903.5651, and the mean square error is
1800.8698÷13 = 138.5284. Notice that mean square error = (se)2, where se is the standard
error of the estimate defined above. The last two columns in the ANOVA table, F and
Significance F, will be discussed shortly.

The general form of the analysis of variance table is:

SOURCE df SUM OF SQUARES MEAN SQUARE F VALUE SIG F

REGRESSION k SSR MSR=SSR÷k MSR÷MSE Prob-value

ERROR n-k-1 SSE MSE=SSE÷(n-k-1)

TOTAL n-1 SST

Returning to Figure 7, directly above the ANOVA table under Regression Statistics we see the
R square value. Thus R2 = 3807.1302÷5608.0000 = 0.6789. Thus approximately 68 percent
of the variation in SALES can be explained knowing the values of PRICE and ADS. Two lines
below we find the standard error of the estimate, se, identified simply as the Standard Error.
Thus the standard error of the estimate is

se = √[ SSE ÷ (n − k − 1) ] = √138.5284 = 11.7698

For two models with the same dependent variable, we can compare the values of se. A smaller
se generally means a better fit. However, values of se should not be used to compare models
that have different dependent variables.

Some additional statistics are given in this section of the output. The Multiple R = 0.8239 is just
the square root of R2 and will be discussed in the next section. The Observations = 16 is the
number of data points that are used to fit the regression. Finally the Adjusted R Square =
0.6295 is used by some statisticians in addition to the regular (unadjusted) R2. It tries to take
into account the number of observations and the number of explanatory variables used in the
regression. For those enquiring minds who wish to know

Adjusted R² = 1 − (1 − R²)(n − 1) ÷ (n − k − 1)
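
As a quick numeric check of this formula for Model 3 (a sketch in Python, which is an assumption; the numbers are the rounded values quoted in Figure 7):

```python
r2, n, k = 0.6789, 16, 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(adj_r2)  # approximately 0.6295, matching the Adjusted R Square in Figure 7
```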

Since we asked for much of the Residuals output, Figure 7 also shows the Standardized (or
Standard) Residuals. These are just the residuals ei divided by their standard deviation. Excel
also generated Residual Plots which are shown in Figure 8. These are plots of the residuals
plotted against each of the two explanatory variables. These plots show that there is not any
obvious relationship between the residuals and the explanatory variables.

IV COVARIANCE AND CORRELATION
The sample covariance Cov(X,Y) between two variables X and Y is:


Cov(X,Y) = ∑ (Xi − X̄)(Yi − Ȳ) ÷ (n − 1)

As an example, let X = ADS and Y = SALES in the Campus Stationery Store. Calculated using
Excel’s DESCRIPTIVE STATISTICS tool, X̄ = 10.0 and Ȳ = 50.0. Then the sample covariance
between ADS and SALES is:

Cov(X,Y) = ∑ (Xi − X̄)(Yi − Ȳ) ÷ (n − 1) = 498 ÷ 15 = 33.20

Consider Figure 11, which is Figure 2 with vertical and horizontal lines drawn in at X̄ = 10.0 and
at Ȳ = 50.0, respectively. In quadrant I we have Xi > X̄ and Yi > Ȳ, while in quadrant III we
have Xi < X̄ and Yi < Ȳ. In quadrants I and III the cross-products (Xi − X̄)(Yi − Ȳ) are nonnegative. In
quadrants II and IV, where one of the deviations is nonpositive and the other is nonnegative, the
cross-products are nonpositive. Since ADS and SALES are related directly, most of the data
points lie in quadrants I and III, and the covariance is a positive number. On the other hand, two
variables which are related inversely will have a negative covariance.

The sign of the covariance thus indicates whether two variables are related directly or inversely.
Unfortunately the magnitude of the covariance does not tell us too much about the strength of
the relationship because it is easily influenced by scale changes to either or both variables. A
more meaningful statistic which measures the strength of the relationship between two variables
is the sample correlation coefficient. The correlation coefficient R(X,Y) between two variables X
and Y is nothing more than the standardized covariance:

R(X,Y) = Cov(X,Y) ÷ (sX sY)

where sX and sY are the sample standard deviations of X and Y, respectively. For ADS and
SALES, using DESCRIPTIVE STATISTICS we can calculate sX = 3.3267 and sY = 19.3356.
The correlation coefficient is:

R(X,Y) = 33.20 ÷ [(3.3267)(19.3356)] = 0.5161

Figure 11

SALES versus ADS, with a vertical line at X̄ = 10.0 and a horizontal line at Ȳ = 50.0 dividing the scatterplot into quadrants I, II, III, and IV.

It can be shown that −1 ≤ R(X,Y) ≤ +1, and that the correlation coefficient is not influenced by
scale changes in the variables. The sign of the correlation indicates how two variables are
related. If R(X,Y) = +1, then X and Y are positively related in a perfect linear fashion (that is,
the data points can be fit perfectly by a straight line with a positive slope). If R(X,Y) = -1, then X
and Y are negatively related in a perfect linear fashion (that is, the data points can be fit perfectly
by a straight line with a negative slope). If R(X,Y) = 0, then no linear relation exists between the
variables. (However, the variables may be related in a nonlinear fashion. For example, consider
the seven data points (-3,9), (-2,4), (-1,1), (0,0), (1,1), (2,4), and (3,9). These data points all lie
on the parabola Y = X2, but for these points R(X,Y) = 0.)

Observe that Cov(X,Y) = Cov(Y,X) and also R(X,Y) = R(Y,X), so that we are not necessarily
identifying one variable as explanatory and the other as dependent. This is one main difference
between a correlation study and a regression study. In a correlation study we are only talking
about the relation between two variables, whereas in a regression study one of the variables is
necessarily dependent on the other.

Below is the output from Excel's Tools Data Analysis Correlation routine. The correlation
matrix gives the correlation coefficients for all pairs of variables.

SALES ADS PRICE


SALES 1
ADS 0.5161 1
PRICE -0.1039 0.6943 1
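
The same covariance and correlation matrix can be reproduced with pandas (a sketch under the assumption that the data have been typed into Python; the note uses Excel's Correlation tool instead).

```python
import pandas as pd

df = pd.DataFrame({
    "SALES": [33, 38, 24, 61, 38, 54, 17, 70, 52, 45, 65, 82, 29, 63, 50, 79],
    "ADS":   [3, 7, 6, 6, 10, 8, 9, 10, 10, 12, 12, 13, 12, 13, 14, 15],
    "PRICE": [125, 130, 140, 115, 155, 120, 145, 140, 145, 160, 135, 130, 170, 160, 190, 160],
})

# Sample covariance between ADS and SALES (uses the n-1 divisor): 33.20
print(df[["ADS", "SALES"]].cov())

# Pairwise correlation coefficients, as in the Excel correlation matrix above
print(df.corr().round(4))
# SALES-ADS ~ 0.5161, SALES-PRICE ~ -0.1039, ADS-PRICE ~ 0.6943
```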

R(X,Y) and R2

It is not by coincidence that the symbol R is used for both the correlation coefficient and the
coefficient of determination. In fact, for the simple regression of Y on X, it is easy to show that
R(X,Y) = ±√R², where R(X,Y) is given the same sign as the slope coefficient b. For the
simple regression of SALES on ADS given in Figure 4, the coefficient of determination R² =
0.2664 and the slope coefficient for ADS is b = +3.0000. The correlation coefficient between
SALES and ADS must be +√0.2664 = +0.5161, as we have previously seen. Thus the
correlation coefficient does not tell us anything that we didn't already know if we had run the
corresponding simple regression.

Multiple R

Referring again to Figure 7, we have previously commented that the Multiple R is the square
root of R2, the coefficient of determination. Interestingly enough, in any regression Multiple R is
also the correlation coefficient between the dependent variable Y and the predicted values of
the dependent variable Ŷ . That is, Multiple R = R(Y, Ŷ ).

R(X,Y) and the slope coefficient b

Another interesting result is that for simple regressions with one explanatory variable, R(X,Y) =
b (sX ÷ sY), where b is the slope coefficient of Y on X, and as before sX and sY are the sample
standard deviations of X and Y, respectively. Thus the correlation coefficient can be thought of
as a standardized value of the slope coefficient b.

V THE UNDERLYING REGRESSION MODEL
So far we have used linear regression for descriptive purposes. We have seen how to fit a line
(or more generally a plane) through a set of points, and we have developed two measures to tell
us how good the fit is. Typically, however, the data used to fit the regression line are just a
sample drawn from a larger population, and we wish to make inferences about the population.
Remember how the sample mean X̄ is used to estimate the population mean μ. In the same
way, the regression coefficients are sample estimates of unknown population parameters. We
might like to develop interval estimates which reflect the degree of confidence we have in our
estimates. Or we might want to perform some hypothesis tests to see whether the data support,
or do not support, some belief we hold. To answer these types of inference questions we need
to look at the regression model, i.e., the process which generates the data.

Let's return to our example of the Campus Stationery Store. The manager has decided to
regress SALES on ADS and PRICE. (The regression output is given in Figure 7.) What the
manager is actually doing (although he may not realize it) is hypothesizing (or suggesting) that
there is some underlying process (or model) determining SALES based on the number of
advertisements (ADS) and PRICE, and that this process is a linear one. In this case, he is
hypothesizing that the data points are generated by a model of the following form:

SALESi = α + β1ADSi + β2PRICEi + εi

Here α is the unknown population intercept, β1 is the unknown slope coefficient associated with
ADS, and β2 is the unknown slope coefficient associated with PRICE. These three population
parameters are estimated by the (sample) regression coefficients a, b1, and b2, respectively.
The final term above, εi, is a disturbance (or perturbation) term that incorporates all effects not
captured by the explanatory variables.

A regression model in general can be visually represented by our model (or perturbation)
machine in Figure 12. For a given observation i, the values of the explanatory variables Xi1, Xi2,
..., Xik can be viewed as input to the machine. The model multiplies each of these Xij in turn by
βj, and adds the products along with an intercept term α to produce a "signal". The model then
adds in a perturbation (or "noise") term εi. The model selects the noise term by reaching into
the "noise basket" (containing noises of various magnitudes) and randomly picking one. This
grand total is called Yi, and is the output for the model machine. Note that we have built noise
into the model to account for the fact that all of the data points do not lie on a straight line. This
reflects the fact that the economic, behavioral, and social processes that we frequently study
using regression analysis are inherently imprecise and non-reproducible. The noises may be
due to measurement error, or may be due to variables or factors of which we are unaware.

Figure 12

(The model machine: the inputs Xi1, ..., Xik are combined into the signal α + β1Xi1 + ... + βkXik, a noise term εi is drawn at random from the noise basket and added, and the total is the output Yi.)

Note that the model would be deterministic were it not for the random noise term. Because of
this term, the same input values X1, ..., Xk do not necessarily lead to the same output Y. In any
case, the following assumptions are usually made about the noise term:

a) Normality: the errors εi are normally distributed around the regression line with an
expected (or mean) value of 0. That is, E(εi) = 0.

b) Homoscedasticity: the standard deviation of εi, denoted by σe, is constant for all values
of the X's. Note that se, the standard error of the estimate, estimates σe.

c) Independence of errors: the errors εi are independent of each other and of the X's (the
explanatory variables).

Since E(εi) = 0, we sometimes say that the population regression line is the line of the expected
value of Y, given X1, ..., Xk. That is, E(Y | X1, ..., Xk) = α + β1X1 + ... + βkXk.

THE UNDERLYING MODEL IN ACTION

We now give an illustration of the relationship between the underlying model and the fitted
regression. Suppose we have a black box that works as follows: It has two dials on it, one for
α and one for β. I set the dials for specific values of α and β, but you can't see the dials, and I
normally won't tell you how I've set them. You give me a value of X which I put into the machine.
Then I turn on the machine, and it computes α + βX and then adds to α + βX a random
disturbance ε to complete the calculation of Y, which appears on a video screen on the front of
the box. You want to find out what α and β are, so you generate a lot of (X,Y) points and fit a
linear regression Ŷ = a + bX to those points. The regression coefficients (a and b) are used to
estimate the unknown underlying coefficients (α and β).

Let me show you how it works. I'll set the machine to α = 1 and β = 1, and let's watch it generate
20 data points. Remember that X values are inputs and the Y values are outputs.

X α+ βX ε Y
16 17 1 18
20 21 -1 20
18 19 1 20
13 14 3 17
16 17 2 19
19 20 1 21
4 5 -1 4
14 15 2 17
6 7 -1 6
15 16 2 18
18 19 3 22
18 19 2 21
18 19 -2 17
14 15 -2 13
5 6 1 7
17 18 -1 17
13 14 -1 13
4 5 1 6
15 16 -3 13
6 7 2 9

Using Excel, the fitted regression is Ŷ = 1.3777 + 1.0054X, with R2 = 0.9010. In hindsight we
see how the calculated intercept "a" (1.3777) does a reasonable job of estimating α, and the
calculated slope "b" (1.0054) does an excellent job of estimating β.

VI TESTS OF SIGNIFICANCE

You will recall our discussion of R2 as a measure of goodness of fit. The closer R2 is to 1, the
better our regression equation fits our data. But a large value of R2 may occur by chance. To
take an extreme example, a blindfolded dart thrower might produce a straight line of darts on his
dartboard. We would consider that a lucky result, but we would like to separate those results
from instances where there is a significant relationship connecting the explanatory variable(s)
and the dependent variable. Thus, we would like to be able to test whether the value of R2 that
we observe is significantly greater than zero.

Suppose we wish to perform a hypothesis test of whether the regression as a whole is significant.
There are many ways of writing the appropriate null and alternate hypotheses. All of the
following are equivalent:

H0: As a whole the regression is insignificant


Ha: As a whole the regression is significant

H0: Y doesn't really depend on (i.e., is not explained by) the X's
Ha: Y does depend on the X's

H0: β1 = β2 = ... = βk = 0 (Note: the intercept α is not included)


Ha: At least one of β1, ..., βk is not equal to 0

H0: ρ2 = 0 (Note: ρ2 is the population R-squared)


Ha: ρ2 > 0

Looking at the last statement of the hypotheses we see that if we accept Ha, we are concluding
that R2 is significantly greater than 0. The test of the regression as a whole is easily performed
using the F Value found in the analysis of variance table. This F-value is calculated by dividing
the mean square regression by the mean square error:

Fobs = MSR÷MSE

Thus, referring back to Figure 7, we see that Fobs = 1903.5651 ÷ 138.5284 = 13.7413. If H0 is
true, the calculated F-value follows an F-distribution with k numerator and (n-k-1) denominator
degrees of freedom. Looking at how the F-value was calculated, we see that "large" calculated
F-values tend to imply significant regressions. This follows because MSR will tend to be large
(and MSE will be small, so F will be large) if Y does depend on the X's. Because of this, "large"
F-values in the right tail of the F-distribution are relatively unlikely if H0 is true. They thus provide
evidence to reject H0 in favor of Ha, and indicate significant regressions.
We proceed by calculating the prob-value for the test, which in this case measures how unlikely
Fobs happens to be. Using Excel, the prob-value = 1-F.DIST(13.7413,2,13,true) = 0.0006. This
gives the area to the right of 13.7413 in the appropriate F distribution. Suppose we wish to test
at a particular α. (Here α is the level of significance of the test rather than the population intercept
coefficient.) We see that we can reject H0 (and conclude that the regression as a whole is
significant) for any α > 0.0006. Alternatively we can look at the Significance F value on the
right of the analysis of variance table which gives the prob-value directly.
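
Outside Excel, the same prob-value can be obtained from any F-distribution routine; here is a sketch using scipy (an assumption of the sketch, since the note works entirely in Excel).

```python
from scipy import stats

msr, mse = 1903.5651, 138.5284
f_obs = msr / mse                         # 13.7413
k, n = 2, 16

# Area to the right of f_obs under the F distribution with k and (n-k-1) degrees of freedom
prob_value = stats.f.sf(f_obs, k, n - k - 1)
print(f_obs, prob_value)                  # 13.7413 and about 0.0006, as in Figure 7
```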

Now that we have tested the regression as a whole, we turn our attention to the individual
regression coefficients. The important point to remember is that the sample regression
coefficients each follow a t-distribution with (n-k-1) degrees of freedom. For illustrative purposes
we will concentrate on the explanatory variable ADS. Returning to Figure 7, we recall that the
slope coefficient associated with ADS is 6.6011. Directly to the right of this figure we see that
the Standard Error (or standard deviation) associated with this coefficient estimate is 1.2693.
This standard error is used to calculate confidence intervals and perform hypothesis tests on the
explanatory variable ADS.

Let's calculate a 95 percent confidence interval for the underlying population slope coefficient
ß1. The confidence interval is centered at b1. The value of t with (n-k-1) = 13 degrees of freedom
which leaves 2.5 percent of the area in each tail is t = 2.1604. Thus the 95 percent confidence
interval for ß1 is:

6.6011 ± 2.1604 (1.2693)


or
6.6011 ± 2.7422
or
3.8589 to 9.3433

Observe that Excel gives us this confidence interval directly in the Lower 95% and Upper 95%
columns. Had we wished for something other than a 95 percent confidence interval, we would
have requested the appropriate interval in the Regression dialog box.
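
The same interval can be reproduced by hand from the coefficient and its standard error; the sketch below uses scipy only for the t critical value (the tooling is an assumption).

```python
from scipy import stats

b1, se_b1 = 6.6011, 1.2693       # slope for ADS and its standard error (Figure 7)
df = 16 - 2 - 1                  # n - k - 1 = 13

t_crit = stats.t.ppf(0.975, df)  # 2.1604: leaves 2.5 percent of the area in each tail
lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(lower, upper)              # approximately 3.8589 to 9.3433
```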

Are we confident that the explanatory variable ADS belongs in the regression? Expressed
another way, is the calculated slope coefficient b1 significantly different from zero? Or, is ADS
a significant explanatory variable? Both of the following statements of hypotheses are thus
equivalent:

H0: ADS is not a significant explanatory variable
Ha: ADS is a significant explanatory variable

H0: β1 = 0
Ha: β1 ≠ 0

If we reject H0 in favor of Ha, we are concluding that b1 is significantly different from 0.

To perform this test we need to determine how far (i.e., how many of its own standard errors) b1
is from 0. The "larger" this value is (in either the positive or the negative direction), the more
reason we have to reject H0. We will not reject H0 if this value is "small". Thus we have to
calculate the test statistic:

tobs = 6.6011 ÷ 1.2693 = 5.2006

We proceed by calculating the prob-value for the test, which in this case measures how unlikely
tobs happens to be. Using Excel, the prob-value = 2*[1-T.DIST(5.2006,13,true)] = 0.0002. This
gives the area to the right of 5.2006 plus the area to the left of -5.2006 in the appropriate t
distribution. We want the area in both tails since this is a two-tailed test. Notice that Excel has
already performed this test for us. In the t Stat column we find the calculated t (the t Stat) value.
To the right of this in the P-value column we see that ADS is a significant explanatory variable
at any α > 0.0002. So Excel calculates a two-tail prob-value although it doesn’t clearly
indicate that it is doing so.

Suppose the manager of the Campus Stationery Store is considering running an additional
advertisement. He knows that the profit contribution of each unit is $10, and advertisements
cost $50 each. He realizes that it is not a good idea to run the ad if it does not generate more
than 5 sales. Thus he is only interested in running the ad if he is reasonably sure that ß1 > 5.
This leads to the following one-tail test. Again test at a level α = 0.05.

H0: β1 ≤ 5
Ha: β1 > 5

The calculated value of t (telling how many of its own standard errors b1 is above 5) is:

tobs = (6.6011 − 5.0) ÷ 1.2693 = 1.2614

Note that we have to do this calculation--Excel doesn't do everything for us! Using Excel the

prob-value = 1-T.DIST(1.2614,13,true) = 0.1147. If we wish to test at α = 0.05 we cannot reject
H0. We should advise the manager not to run the additional ad. Indeed, if we really believe in
this model, we should advise him not to run any ads at all!
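
For completeness, here is the same one-tail calculation scripted in Python with scipy (an assumption; the inputs are the Figure 7 values quoted above).

```python
from scipy import stats

b1, se_b1, df = 6.6011, 1.2693, 13
t_obs = (b1 - 5.0) / se_b1          # how many of its own standard errors b1 lies above 5
prob_value = stats.t.sf(t_obs, df)  # upper-tail area for the one-tail test
print(t_obs, prob_value)            # 1.2614 and about 0.1147 -- cannot reject H0 at alpha = 0.05
```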

One word of warning! Never run a regression without the constant (that is, intercept) term. If
you do so, many of the statistics we have discussed (including the coefficient of determination)
will be misleading and the interpretations we have given them will not apply.

A second word of warning! In deciding which explanatory variables to include in a multiple


regression, do not include an explanatory variable which is a perfect (or near perfect) linear
combination of the other explanatory variables which you have included. If you do so,
a so-called multicollinear situation will result, and the standard errors of the slope coefficients of
the collinear variables will be very large and the corresponding slope coefficients thus will be
insignificant. With perfect multicollinearity Excel cannot calculate the regression coefficients
correctly. With imperfect multicollinearity, the model’s predictive ability is unaffected, and the
F-test is valid, but it just becomes more difficult to attach meanings to the slope coefficients.

Standardized Slope Coefficients

Using a t-test, we have shown that ADS is a significant explanatory variable. What this means
is that we are reasonably confident that the underlying slope coefficient ß1 is different from zero.
But this does not necessarily mean that changes in ADS have any meaningful impact on the
dependent variable SALES. Put another way, just because we have shown the calculated slope
b1 to be significantly different from zero doesn't necessarily mean that ADS is of any practical
significance as an explanatory variable. Looking at the absolute value of b1 doesn't tell the entire
story, since we can always scale the units of measurement of ADS or SALES to make b1
arbitrarily large or small in absolute value. Instead we might want to look at the standardized
slope coefficient

b1 (sADS ÷ sSALES) = 6.6011 (3.3267 ÷ 19.3356) = 1.1357

where sADS = 3.3267 and sSALES = 19.3356 are the sample standard deviations of ADS and
SALES, respectively (here calculated using DESCRIPTIVE STATISTICS). The standardized
slope coefficient is nothing more than the slope coefficient expressed in terms of standard
deviations. Thus if the explanatory variable ADS changes by one of its own standard deviations
(in this case, by 3.3267 units), then the dependent variable SALES changes by 1.1357 of its own
standard deviations (in this case, by 19.3356 units). This represents a relatively large movement
of SALES with respect to ADS, and thus demonstrates that ADS does indeed have a meaningful
impact on SALES.

In a similar fashion, the standardized slope coefficient for PRICE is

b2 (sPRICE ÷ sSALES) = −0.8663 (19.9165 ÷ 19.3356) = −0.8923

where sPRICE = 19.9165 is the sample standard deviation for PRICE. This standardized slope
shows that changes in PRICE also have a relatively large impact on SALES (although not quite
as much as does ADS).

In general the standardized slope coefficient for an explanatory variable X equals

bX (sX ÷ sY)

where bX is the slope coefficient for X, and as before sX and sY are the sample standard
deviations of X and Y, respectively.
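
A minimal sketch of the calculation (Python is an assumption; the standard deviations are the DESCRIPTIVE STATISTICS values quoted above and the slopes come from Model 3 in Figure 7):

```python
s_ads, s_price, s_sales = 3.3267, 19.9165, 19.3356  # sample standard deviations
b_ads, b_price = 6.6011, -0.8663                    # Model 3 slope coefficients

std_slope_ads = b_ads * s_ads / s_sales             # approximately 1.1357
std_slope_price = b_price * s_price / s_sales       # approximately -0.8923
print(std_slope_ads, std_slope_price)
```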

VII PREDICTION
Very frequently the main reason for developing a regression model is to use it for prediction
purposes. Besides seeing how well the regression equation fits the existing set of data points,
we want to determine how well the model might explain other data points where we know the
value(s) of the explanatory variable(s) but not the value of the dependent variable. Of course,
we will find out how good a job of prediction has been done only after the fact, since if we knew
the value of the dependent variable there would be no reason to predict it.

Let's illustrate using the simple regression of sales on advertising. Let Ŷp denote the predicted
value of the dependent variable given some value of the explanatory variable, Xp. Naturally, for
any value of Xp, a point estimate Ŷp is obtained by substituting into the regression equation
given in Figure 4: Ŷp = 20 + 3Xp. Thus, for example, if the manager of the Campus Stationery
Store planned to run Xp = 10 advertisements next quarter, he would expect to sell Ŷp = 50 units.
Along with this point estimate, suppose the manager would like to obtain a confidence interval
for the predicted value (more commonly called a prediction interval). This would give him some
idea about how confident he should be in the prediction. Unfortunately, coming up with the true
prediction interval is easier said than done. The true prediction interval is centered around Ŷp
and follows the t-distribution. The general form of, say, a 95 percent prediction interval for an
individual predicted value is

Ŷp ± t (sp)

where t is the value of t with (n-k-1) degrees of freedom which leaves 2.5 percent of the area in
each tail, and sp is the appropriate standard error of the individual predicted value. An important
observation is that sp is not constant. Instead sp depends on Xp, but is always greater than the
standard error of the estimate se. Thus se is always a lower bound for sp. Furthermore, sp is
smallest at Xp = X̄ (the average value of X) and increases as Xp moves away from X̄ in either
direction. Unfortunately, Excel cannot (easily) calculate the true standard error sp. The best we
can do is calculate an approximate prediction interval using the standard error of the estimate
se to approximate sp. The general form of, say, an approximate 95 percent prediction interval
for an individual predicted value is

Ŷp ± t (se)

where t is the value of t with (n-k-1) degrees of freedom which leaves 2.5 percent of the area in
each tail. For example, if Xp = 10, an approximate 95 percent prediction interval is

50 ± 2.1448 (17.1423), or 50 ± 36.7668

Remember that the true prediction interval is actually somewhat wider than this because se is a
lower bound for sp.
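
The approximate interval is easy to reproduce; the sketch below uses scipy for the t value (an assumption) and the Model 1 numbers quoted above.

```python
from scipy import stats

a, b, se = 20.0, 3.0, 17.1423            # Model 1 coefficients and standard error of the estimate
n, k = 16, 1
x_p = 10                                 # planned number of advertisements

y_hat_p = a + b * x_p                    # point prediction: 50 units
t_crit = stats.t.ppf(0.975, n - k - 1)   # 2.1448 with 14 degrees of freedom
half_width = t_crit * se                 # approximate: uses se as a lower bound for sp
print(y_hat_p - half_width, y_hat_p + half_width)  # roughly 13.2 to 86.8 (50 +/- 36.77)
```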

The regression line (the line of the predicted values) and the true upper and lower 95 percent
prediction bands for the individual predicted values are plotted in Figure 13. (These true
prediction bands were obtained using another statistical package.) We see that the prediction
band is not of constant width, but increases as we move away from the center of the data in
either direction. Note how the width of the band is smallest at Xp = X̄. The widening prediction
band illustrates one reason for not trying to predict Y for values of X beyond the range of the
existing data points: the precision of our predictions decreases (i.e., the width of the prediction
intervals increases) as we get further from X̄. There is another reason for avoiding such
extrapolation of the regression line: although we have found the line of best fit for the data points
we have observed, we have no reason to believe that the same relationship holds for data points
outside this range.

ASSESSING QUALITY OF THE PREDICTION (OPTIONAL)

In this Primer we have looked at two measures of goodness of fit, the coefficient of determination
(or R2), and the standard error of the estimate (or se). Both measure how well the regression
line does at fitting the data that was used to run the regression. But neither are necessarily good
indicators of how well the regression will do at predicting new responses (that is, at predicting
any out-of-sample observations). In fact, R2 can be made arbitrarily large (or even equal to 1.0)
simply by adding explanatory variables, even if the variables have nothing to do with the
dependent variable. An intelligent user of regression analysis thus needs to be on guard for the
danger of overfitting. This is where explanatory variables have been added, or where a
complicated model has been developed, which fits the existing data very well, but whose ability
to predict future observations is highly suspect.

One remedy to the danger of overfitting is through the use of data partitioning. In this method
the data is randomly split into two groups, the first of size n1, and the second of size n2 = n – n1.
The first group (the training data) is used to develop a good and plausible regression model.
The second group is held in reserve as a validation set (or holdout set). Any candidate
regression model under consideration is fitted using the training data, but evaluated using the
validation set.

The evaluation can be done in the following fashion. For any particular model under
consideration, the values of the explanatory variables in the validation set are used to calculate
the predicted values Ŷi for all n2 observations. Then the validation residuals ei = Yi − Ŷi are
calculated and examined to see how well the model predicts the values of the dependent variable
in the validation set. One common measure of goodness of fit is the root mean square error,
defined as


RMSE = √( ∑ ei² ÷ n2 ), where the sum is taken over the n2 observations in the validation set

Figure 13

SALES versus ADS: the fitted regression line (actual predicted values) together with the true upper and lower 95% prediction bands, plotted for ADS from 0 to 20.

The RMSE can be interpreted as the standard deviation of the validation set residuals. It can be
compared with the se, the standard error of the estimate for the training data, to see how well the
model generalizes to new data that it hasn’t seen.

After a best regression model has been selected, the training data and the validation data may be
combined. The regression coefficients can then be estimated using the entire data set, and used
to make future predictions.
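
Here is a sketch of this training/validation idea in Python (an assumption; the 12/4 split, the random seed, and the use of np.polyfit for the candidate simple regression are arbitrary choices made for illustration, not part of the note).

```python
import numpy as np

sales = np.array([33, 38, 24, 61, 38, 54, 17, 70, 52, 45, 65, 82, 29, 63, 50, 79])
ads   = np.array([3, 7, 6, 6, 10, 8, 9, 10, 10, 12, 12, 13, 12, 13, 14, 15])

rng = np.random.default_rng(1)
idx = rng.permutation(len(sales))
train, valid = idx[:12], idx[12:]   # n1 = 12 training points, n2 = 4 validation points

# Fit the candidate model (here SALES on ADS) using only the training data
b, a = np.polyfit(ads[train], sales[train], 1)

# Evaluate on the held-out validation set
resid = sales[valid] - (a + b * ads[valid])
rmse = np.sqrt(np.mean(resid ** 2))  # root mean square error of the validation residuals
print(a, b, rmse)
```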

VIII THE PARTIAL F TEST (OPTIONAL)
Up to now we have seen two tests of significance in regression--an F test to test the significance
of the regression as a whole, and a t-test to test the significance of individual explanatory
variables. There is another useful test, the Partial F Test, which can be used to test the
significance of a set of variables, and which can be used to choose among competing regression
models. Consider two regression models:

Reduced model: Y = α + β1X1 + β2X2 + ... + βgXg + ε

Full model: Y = α + β1X1 + β2X2 + ... + βgXg + βg+1Xg+1 + ... + βkXk+ ε

Note that there are g explanatory variables in the reduced model, and k > g explanatory variables
in the full model, and all of the explanatory variables in the reduced model are contained in the
full model.

Suppose we wish to test whether the set of variables Xg+1, Xg+2, ... Xk are useful in explaining
any variation in Y after taking into account the variation already explained by X1, X2, ..., Xg. Thus
we are comparing the two models to determine whether it is worthwhile to include the additional
variables.

The general idea is to compare the SSR values when Xg+1, Xg+2, ... Xk are excluded and when
they are included in the regression equation (so we need the regression outputs from both the
reduced and the full models). When the additional variables are included, SSR is at least as
large as when they are excluded. The Partial F test tests whether the increase in SSR is more
than could be expected by random chance. In effect, it tests whether R2 (= SSR÷SST) for the
full model is significantly greater than R2 for the reduced model.

To perform the Partial F test, first set up the hypotheses:

H0: βg+1 = βg+2 = ... = βk = 0 (i.e., all of the additional coefficients are 0)

Ha: H0 is not true (i.e., at least one of the additional coefficients is different from 0)

Next calculate the test statistic:

Fobs = [ (SSRfull − SSRreduced) ÷ (k − g) ] ÷ [ SSEfull ÷ (n − k − 1) ]

This is an upper-tail test following an F distribution with (k-g) numerator and (n-k-1) denominator
degrees of freedom.

We will illustrate the Partial F test using Models 1, 3, and 4 from the Campus Stationery Store.
The corresponding Excel outputs are repeated in Figure 14. Suppose we wish to compare the
reduced Model 1 (the regression of SALES versus ADS) with the full Model 4 (the regression of
SALES versus ADS, PRICE, and ADPRICE). We wish to test whether PRICE and ADPRICE
are useful in explaining any variation in SALES after taking into account the variation already
explained by ADS, thereby testing whether the R2 for the full model (0.7952) is significantly
greater than the R2 for the reduced model (0.2664). First set up the hypotheses:

H0: βPRICE = βADPRICE = 0

Ha: H0 is not true

For this example g = 1, k = 3, and n = 16. We next calculate the test statistic:

Fobs = [ (SSRfull − SSRreduced) ÷ (k − g) ] ÷ [ SSEfull ÷ (n − k − 1) ]
     = [ (4459.4326 − 1494.0000) ÷ (3 − 1) ] ÷ [ 1148.5674 ÷ (16 − 3 − 1) ]
     = (2965.4326 ÷ 2) ÷ (1148.5674 ÷ 12)
     = 15.49

Using Excel the prob-value = 1-F.DIST(15.49,2,12,true) = 0.0005. We may reject H0 at any
reasonable level of significance. We conclude that PRICE and ADPRICE as a set are
collectively significant.
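
As a cross-check, the same partial F computation can be scripted; a small helper makes the bookkeeping explicit, and the scipy call for the prob-value is an assumption of this sketch.

```python
from scipy import stats

def partial_f(ssr_full, ssr_reduced, sse_full, n, k, g):
    """Partial F statistic and prob-value for testing the k - g added explanatory variables."""
    f_obs = ((ssr_full - ssr_reduced) / (k - g)) / (sse_full / (n - k - 1))
    prob_value = stats.f.sf(f_obs, k - g, n - k - 1)
    return f_obs, prob_value

# The example above: reduced Model 1 versus full Model 4
print(partial_f(4459.4326, 1494.0000, 1148.5674, n=16, k=3, g=1))  # about (15.49, 0.0005)
```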

Consider a second example from the Campus Stationery Store. Suppose we wish to compare
the reduced Model 3 (the regression of SALES versus ADS and PRICE) with the full Model 4
discussed above. We wish to test whether ADPRICE is useful in explaining any variation in
SALES after taking into account the variation already explained by ADS and PRICE. That is,
we wish to test whether the R2 for the full model (0.7952) is significantly greater than the R2 for
the reduced model (0.6789). First set up the hypotheses:

H0: βADPRICE = 0

Ha: H0 is not true

Figure 14

MODEL 1
SALES versus ADS
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.5161
R Square 0.2664
Adjusted R Square 0.2140
Standard Error 17.1423
Observations 16

ANOVA
df SS MS F Sig. F
Regression 1 1494.0000 1494.0000 5.0841 0.0407
Residual 14 4114.0000 293.8571
Total 15 5608.0000

Coeffs. Std. Error t Stat P-value Lower 95% Upper 95%


Intercept 20.0000 13.9781 1.4308 0.1744 -9.9802 49.9802
ADS 3.0000 1.3305 2.2548 0.0407 0.1464 5.8536

MODEL 3
SALES versus ADS and PRICE
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.8239
R Square 0.6789
Adjusted R Square 0.6295
Standard Error 11.7698
Observations 16

ANOVA
df SS MS F Sig. F
Regression 2 3807.1302 1903.5651 13.7413 0.0006
Residual 13 1800.8698 138.5284
Total 15 5608.0000

Coeffs. Std. Error t Stat P-value Lower 95% Upper 95%


Intercept 109.6089 23.9373 4.5790 0.0005 57.8955 161.3222
ADS 6.6011 1.2693 5.2006 0.0002 3.8589 9.3432
PRICE -0.8663 0.2120 -4.0863 0.0013 -1.3244 -0.4083

MODEL 4
SALES versus ADS, PRICE, and ADPRICE
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.8917
R Square 0.7952
Adjusted R Square 0.7440
Standard Error 9.7834
Observations 16

ANOVA
df SS MS F Sig. F
Regression 3 4459.4326 1486.4775 15.5304 0.0002
Residual 12 1148.5674 95.7140
Total 15 5608.0000

Coeffs. Std. Error t Stat P-value Lower 95% Upper 95%


Intercept 312.4040 80.1898 3.8958 0.0021 137.6855 487.1226
ADS -10.3956 6.5956 -1.5761 0.1410 -24.7662 3.9750
PRICE -2.4085 0.6164 -3.9070 0.0021 -3.7516 -1.0653
ADPRICE 0.1278 0.0489 2.6106 0.0228 0.0211 0.2344

For this example g = 2, k = 3, and n = 16. We next calculate the test statistic:

Fobs = [ (SSRfull − SSRreduced) ÷ (k − g) ] ÷ [ SSEfull ÷ (n − k − 1) ]
     = [ (4459.4326 − 3807.1302) ÷ (3 − 2) ] ÷ [ 1148.5674 ÷ (16 − 3 − 1) ]
     = (652.3024 ÷ 1) ÷ (1148.5674 ÷ 12)
     = 6.82

Using Excel the prob-value = 1-F.DIST(6.82,1,12,true) = 0.0227. If we test at α = 0.05 we can
reject H0 and conclude that ADPRICE is significant.

At this point you may be asking whether this last test differs from a simple t-test of the
significance of ADPRICE in Model 4. (Observe that the t-value for ADPRICE in Model 4 is
2.6106 and the corresponding prob-value is 0.0228 < 0.05, so ADPRICE is significant at the
0.05 level.) Is it just a coincidence that ADPRICE is significant based on the Partial F test and
also based on the t-test? No! When the full model contains only one more explanatory variable
than the reduced model, the Partial F test testing the contribution of the added variable, and the
t-test testing the significance of that same variable, are equivalent in the sense that they have
the same prob-values and always lead to the same conclusion. This is because partial F tests
tell us whether all the additional variables in the full model are collectively significant. When the
full model contains only one more explanatory variable than the reduced model, testing the
significance of “all” the additional variables is of course the same as testing the significance of
that single variable. In fact, it can be shown that the two test statistics are always related by the
equation Fobs = (tobs)2. (Check it out: Fobs = 6.82 = (2.6106)2 = (tobs)2.)

IX INTERACTION MODELS
In Model 3 we examined the following regression model for the Campus Stationery Store:

SALESi = α + β1ADSi + β2PRICEi + εi

The Excel output for this model is given in Figure 7. Remember the interpretations we gave to
the slope coefficients: β1 is the change in the expected value of SALES for a unit-increase in
ADS when PRICE is held fixed, and β2 is the change in the expected value of SALES for a unit-
increase in PRICE when ADS is held fixed. The response surface for this model is shown in
Figure 9 and, as can be seen, is a (flat) plane. In Model 3 the effects of changes in ADS and
PRICE on SALES are thus independent of each other. That is, the effect of ADS does not
depend on the level of PRICE, and the effect of PRICE does not depend on the level of ADS.

On the other hand, sometimes the explanatory variables exhibit an interaction effect where the
change in the expected value of Y for a unit-change in one variable is dependent on the value
of the other variable. In this case, an interaction model is called for. One possible interaction
model for the Campus Stationery Store is Model 4 given as follows:

SALESi = α + β1ADSi + β2PRICEi + β3(ADSi)(PRICEi) + εi

We can rewrite this in two equivalent ways:

SALESi = α + (β1 + β3PRICEi)ADSi + β2PRICEi + εi

SALESi = α + β1ADSi + (β2 + β3ADSi)PRICEi + εi

From version 1 we see that if we hold PRICE constant, then (β1 + β3PRICEi) tells us how SALES
changes as ADS changes. Hence the effect of ADS on SALES also depends on the level of
PRICE. The variables are no longer independent, and the term β3(ADSi)(PRICEi) shows how
they interact with each other—hence the names “interaction” model and “interaction” variable
(also called the “interaction” term). Similarly from version 2, we see that if we hold ADS constant,
then (β2 + β3ADSi) tells us how SALES changes as PRICE changes. Hence the effect of PRICE
on SALES also depends on the level of ADS.

This is a nonlinear model, but may be easily fit using a linear regression package like Excel by
simply creating a third explanatory variable (which we call ADPRICE) equal to ADS multiplied
by PRICE. The listing of the data and the regression output are given in Figure 15. The response
surface for this model is a curved surface (or a twisted plane) as shown in Figure 16. This surface
could be produced by placing a pencil perpendicular to a line and moving it along the line, while
rotating it around the line. The value of the third slope coefficient, β3, controls the rate of twist
of the surface.
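
A sketch of fitting this interaction model outside Excel (statsmodels is an assumption of the sketch; the note instead builds the ADPRICE column directly in the worksheet):

```python
import numpy as np
import statsmodels.api as sm

sales = np.array([33, 38, 24, 61, 38, 54, 17, 70, 52, 45, 65, 82, 29, 63, 50, 79])
ads   = np.array([3, 7, 6, 6, 10, 8, 9, 10, 10, 12, 12, 13, 12, 13, 14, 15])
price = np.array([125, 130, 140, 115, 155, 120, 145, 140, 145, 160, 135, 130, 170, 160, 190, 160])

adprice = ads * price  # the interaction variable, ADS multiplied by PRICE
X = sm.add_constant(np.column_stack([ads, price, adprice]))
model = sm.OLS(sales, X).fit()

print(model.params)    # approximately [312.4040, -10.3956, -2.4085, 0.1278], matching Figure 15
print(model.rsquared)  # approximately 0.7952
```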

Figure 15

MODEL 4
SALES versus ADS, PRICE, and ADPRICE
SALES ADS PRICE ADPRICE
33 3 125 375
38 7 130 910
24 6 140 840
61 6 115 690
38 10 155 1550
54 8 120 960
17 9 145 1305
70 10 140 1400
52 10 145 1450
45 12 160 1920
65 12 135 1620
82 13 130 1690
29 12 170 2040
63 13 160 2080
50 14 190 2660
79 15 160 2400

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.8917
R Square 0.7952
Adjusted R Square 0.7440
Standard Error 9.7834
Observations 16

ANOVA
df SS MS F Significance F
Regression 3 4459.4326 1486.4775 15.5304 0.0002
Residual 12 1148.5674 95.7140
Total 15 5608.0000

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 312.4040 80.1898 3.8958 0.0021 137.6855 487.1226
ADS -10.3956 6.5956 -1.5761 0.1410 -24.7662 3.9750
PRICE -2.4085 0.6164 -3.9070 0.0021 -3.7516 -1.0653
ADPRICE 0.1278 0.0489 2.6106 0.0228 0.0211 0.2344

Figure 16

Looking at the P-values for the regression coefficients, the interaction variable ADPRICE is
significantly different from zero at the 0.0228 level. Thus there is a significant interaction effect
between ADS and PRICE. Unfortunately or not, the variable ADS is only significant at the 0.1410
level, indicating that further refinement of this model might be called for.
