Regression Analysis: Answers To Problems and Cases 1. 2

CHAPTER 6

REGRESSION ANALYSIS

ANSWERS TO PROBLEMS AND CASES

1. Option b is inconsistent because the regression coefficient and the correlation coefficient
must have the same sign.

2. a. If GNP is increased by 1 billion dollars, we expect earnings to increase by
.06 billion dollars.

b. If GNP is equal to zero, we expect earnings to be .078 billion dollars.

3. Correlation of Sales and AdvExpend = 0.848

The regression equation is


Sales = 828 + 10.8 AdvExpend

Predictor Coef SE Coef T P


Constant 828.1 136.1 6.08 0.000
AdvExpend 10.787 2.384 4.52 0.002

S = 67.1945 R-Sq = 71.9% R-Sq(adj) = 68.4%

Analysis of Variance

Source DF SS MS F P
Regression 1 92432 92432 20.47 0.002
Residual Error 8 36121 4515
Total 9 128552

a. Yes, the regression is significant. Reject H0: β1 = 0 using either the t value 4.52
and its p value .002, or the F ratio 20.47 and its p value .002.

b. Y = 828 + 10.8X

c. Y = 828 + 10.8(50) = $1368

d. 72%, since r² = .719

e. Unexplained variation (SSE) = 36,121

f. Total variation (SST) is 128,552
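
The summary statistics in the Problem 3 printout follow directly from the ANOVA sums of squares. The sketch below is a minimal check in Python, assuming only the table values above; the function name `anova_summary` is illustrative and is not part of any statistics package.

```python
import math

# Values copied from the Problem 3 ANOVA table.
SS_REGRESSION = 92432.0
SS_ERROR = 36121.0
SS_TOTAL = 128552.0
DF_ERROR = 8


def anova_summary(ssr, sse, sst, df_error, df_regression=1):
    """Return r-squared, the F ratio, and the standard error of the estimate."""
    mse = sse / df_error
    msr = ssr / df_regression
    return {
        "r_sq": ssr / sst,        # proportion of variation explained
        "f_ratio": msr / mse,     # F statistic for the regression
        "s": math.sqrt(mse),      # standard error of the estimate
    }


summary = anova_summary(SS_REGRESSION, SS_ERROR, SS_TOTAL, DF_ERROR)
# With one predictor, the slope's t statistic is the square root of F
# (carrying the sign of the slope, positive here).
t_slope = math.sqrt(summary["f_ratio"])
print(summary, round(t_slope, 2))
```

The same function applies to the other simple regression printouts in this chapter by substituting their sums of squares and error degrees of freedom.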

4. Correlation of Time and Value = 0.967

The regression equation is


Time = 0.620 + 0.109 Value

Predictor Coef SE Coef T P


Constant 0.6202 0.2501 2.48 0.038
Value 0.10919 0.01016 10.75 0.000

S = 0.470952 R-Sq = 93.5% R-Sq(adj) = 92.7%

Analysis of Variance

Source DF SS MS F P
Regression 1 25.622 25.622 115.52 0.000
Residual Error 8 1.774 0.222
Total 9 27.396

a. Yes, the regression is significant. Reject H0: β1 = 0 using either the t value 10.75
and its p value .000, or the F ratio 115.52 and its p value .000.

b. Y = .620 + .109X

e. Unexplained variation (SSE) = 1.774

f. Total variation (TSS) = 27.396

Point forecast: Y = .620 + .1092(3) = 0.948

99% Interval forecast: Y ± t s_f, where

$s_f = s_{y \cdot x} \sqrt{1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum (X - \bar{X})^2}}$

s_f = .471(1.110) = .523

.948 ± 3.355(.523) → (–.807, 2.702). The prediction interval is wide because of
the small sample size and large confidence coefficient. Not useful.

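
The interval forecast in Problem 4 can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, the t value 3.355 (99%, 8 degrees of freedom) and s_f = .523 are taken from the answer above, and the underlying data are not reproduced here.

```python
import math


def forecast_std_error(s_yx, n, x_new, x_mean, sxx):
    """Standard error of the forecast for simple linear regression.

    sxx is the sum of squared deviations of X about its mean.
    """
    return s_yx * math.sqrt(1.0 + 1.0 / n + (x_new - x_mean) ** 2 / sxx)


def prediction_interval(point_forecast, t_critical, s_f):
    """Return (lower, upper) limits: point forecast +/- t * s_f."""
    half_width = t_critical * s_f
    return point_forecast - half_width, point_forecast + half_width


# Values from the answer above: point forecast .948, t = 3.355, s_f = .523.
lower, upper = prediction_interval(0.948, 3.355, 0.523)
print(round(lower, 3), round(upper, 3))
```

When X equals its mean the last term under the square root vanishes, so the interval is narrowest at the center of the data and widens as X moves away from it.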
5. a, b and d.

The regression equation is


Cost = 208.2 + 70.92 Age (Positive linear relationship)

S = 111.610 R-Sq = 87.9% R-Sq(adj) = 86.2%

Analysis of Variance

Source DF SS MS F P
Regression 1 634820 634820 50.96 0.000
Error 7 87197 12457
Total 8 722017

c. Correlation between Cost and Age = .938

e. Reject H0 at the 5% level since F = 50.96 and its p value = .000 < .05.
Could also use t = 7.14, the t value associated with the slope coefficient, and its
p value = .000. The correlation coefficient is significantly different from 0 since
the slope coefficient is significantly different from 0.

f. Y = 208.20 + 70.92(5) = 562.80 or $562.80

6. a, b and d.

The regression equation is


Books = 32.46 + 36.41 Feet (Positive linear relationship)

S = 17.9671 R-Sq = 90.3% R-Sq(adj) = 89.2%

Analysis of Variance

Source DF SS MS F P
Regression 1 27032.3 27032.3 83.74 0.000
Error 9 2905.4 322.8
Total 10 29937.6

c. Correlation between Books and Feet = .950

e. Reject H0 at the 10% level since F = 83.74 and its p value = .000 < .10.
Could also use t = 9.15, the t value associated with the slope coefficient, and its
p value = .000. The correlation coefficient is significantly different from 0 since
the slope coefficient is significantly different from 0.

f. Based on the residuals versus the fitted values plot, there is no reason to
doubt the adequacy of the simple linear regression model.

g. Y = 32.46 + 36.41(4) = 178 books

7. a, b, c & d.
The regression equation is
Orders = 15.8 + 1.11 Catalogs (Fitted regression line)

Predictor Coef SE Coef T P


Constant 15.846 3.092 5.13 0.000
Catalogs 1.1132 0.3596 3.10 0.011

S = 5.75660 (Standard error of the estimate)

R-Sq = 48.9% (Percentage of variation in Orders explained by Catalogs)
R-Sq(adj) = 43.8%

Analysis of Variance (ANOVA Table)

Source DF SS MS F P
Regression 1 317.53 317.53 9.58 0.011
Residual Error 10 331.38 33.14
Total 11 648.92

Predicted Values for New Observations

New
Obs Fit SE Fit 90% CI 90% PI
1 26.98 1.93 (23.47, 30.48) (15.97, 37.98)

e. Do not reject H0 at the 1% level since t = 3.10 and its p value = .011 > .01.
However, we would reject at, say, the 5% level.

f. Do not reject H0 at the 1% level since F = 9.58 and its p value = .011 > .01.
The result is consistent with the result in e, as it should be.

g. See Fit and 90% PI at end of computer printout above. A 90% prediction interval
for mail orders when 10(000) catalogs are distributed is (16, 38), that is, 16,000 to 38,000 orders.

8. The regression equation is


Dollars = 3538 - 418 Rate

Predictor Coef SE Coef T P


Constant 3538.1 744.4 4.75 0.001
Rate -418.3 150.8 -2.77 0.024

S = 356.690 R-Sq = 49.0% R-Sq(adj) = 42.7%

Analysis of Variance

Source DF SS MS F P
Regression 1 978986 978986 7.69 0.024
Residual Error 8 1017824 127228
Total 9 1996810

a. There is a significant (at the 5% level) negative relationship between these variables.
Reject H0 at the 5% level since t = -2.77 and its p value = .024 < .05.

b. The data set is small. Moreover, r² = .49, so only 49% of the variation in investment
dollars is explained by interest rate. Finally, the last observation (6.2, 1420) has a
large influence on the location of the fitted straight line. If this observation is deleted,
there is a considerable change in the slope (and intercept) of the fitted line. Using the
original straight line equation for prediction is suspect.

c. A forecast can be calculated. It is 1865. However, the 95% prediction interval is wide.
Forecast unlikely to be useful without additional information. See comments in b.

d. See answer to b.

e. It seems reasonable to say movements in interest rate cause changes in the level
of investment.

9. a. The firms seem to be using very similar rationale since r = .959. Also, from the fitted
line plot below, notice the fitted line is not far from the 45° line through the origin (with
intercept 0 and slope 1).

b. If ABC bids 101, the predicted competitor’s bid is 101.212. A 95% prediction
interval (PI) is given below.

New
Obs Fit SE Fit 95% CI 95% PI
101 101.212 0.164 (100.872, 101.552) (99.637, 102.786)

c. Assume normally distributed errors about the population regression line and
treat the least squares line as if it were the population regression line (n is reasonably
large in this case). Then at ABC bid 101, possible competitor bids are normally
distributed about the fitted value 101.212 with a standard deviation estimated by
sy.x = .743. Consequently, the probability that ABC will have the low bid is
P(Z ≥ (101-101.212)/ .743) = P(Z ≥ -.285) = .61.
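
The normal probability in Problem 9c can be checked with Python's standard library. This is a minimal sketch that treats the fitted value 101.212 and sy.x = .743 as the mean and standard deviation of competitor bids, exactly as the answer assumes; the variable names are illustrative.

```python
from statistics import NormalDist

FITTED_COMPETITOR_BID = 101.212   # fitted value at ABC bid 101
S_YX = 0.743                      # estimated standard deviation about the line
ABC_BID = 101.0

# Standardize ABC's bid relative to the distribution of competitor bids.
z = (ABC_BID - FITTED_COMPETITOR_BID) / S_YX

# Probability that the competitor's bid is at least as large as ABC's bid.
prob_competitor_at_or_above = 1.0 - NormalDist().cdf(z)
print(round(z, 3), round(prob_competitor_at_or_above, 3))
```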

10. a. Only if the sample size is large enough. The t statistic associated with the
slope coefficient or the F ratio should be consulted to determine if the population
regression line slope is significantly different from a horizontal line with zero
slope.

b. It will typically produce significant results, not necessarily useful results.
The coefficient of determination, r², might be small, so forecasting using the fitted
line is unlikely to produce a useful result.

11. a. Scatter diagram follows.

b. The regression equation is
Permits = 2217 - 145 Rate

Predictor Coef SE Coef T P


Constant 2217.4 316.2 7.01 0.000
Rate -144.95 27.96 -5.18 0.001

S = 144.298 R-Sq = 79.3% R-Sq(adj) = 76.4%

Analysis of Variance

Source DF SS MS F P
Regression 1 559607 559607 26.88 0.001
Residual Error 7 145753 20822
Total 8 705360

c. Reject H0 at the 5% level since t = -5.18 and its p value = .001 < .05.

d. If interest rate increases by 1%, on average the number of building permits will
decrease by 145.

e. From the computer output above, r² = .793.

f. Interest rate explains about 79% of the variation in number of building permits issued.

g. Memo to Ed explaining the fairly strong negative relationship between mortgage
interest rates and building permits issued.

12. The population for this problem contains X-Y data points whose correlation
coefficient is .846 (ρ = .846). Each student will have a different answer; however,
most will conclude that Y is linearly related to X, that r is around .846, r-squared
= .72, and so on. The population regression equation is Y = 0.948 + 0.00469X.
Any student who fails to find a meaningful relationship between X and Y will be the
victim of a Type II error.

13. a. Scatter diagram follows.

b. The regression equation is


Defectives = - 17.7 + 0.355 BatchSize

Predictor Coef SE Coef T P


Constant -17.731 4.626 -3.83 0.003
BatchSize 0.35495 0.02332 15.22 0.000

S = 7.86344 R-Sq = 95.5% R-Sq(adj) = 95.1%

Analysis of Variance

Source DF SS MS F P
Regression 1 14331 14331 231.77 0.000
Residual Error 11 680 62
Total 12 15011

c. Reject H0 at the 5% level since t = 15.22 and its p value = .000 < .05.

d. The residuals versus fits plot shows curvature in the scatter not captured by the straight line fit.

e. Model with quadratic term in Batch Size fits well. Results with Size**2 as
predictor variable follow.

The regression equation is


Defectives = 4.70 + 0.00101 Size**2

Predictor Coef SE Coef T P


Constant 4.6973 0.9997 4.70 0.001
Size**2 0.00100793 0.00001930 52.22 0.000

S = 2.34147 R-Sq = 99.6% R-Sq(adj) = 99.6%

Analysis of Variance

Source DF SS MS F P
Regression 1 14951 14951 2727.00 0.000
Residual Error 11 60 5
Total 12 15011

f. Reject H0 at the 5% level since t = 52.22 and its p value = .000 < .05.

g. Residual plots below indicate an adequate fit.

h. Predicted Values for New Observations

New
Obs Fit SE Fit 95% CI 95% PI
1 95.411 1.173 (92.829, 97.993) (89.647, 101.175)

i. Prefer second model with the quadratic predictor.

j. Memo to Harry showing the value of transforming the independent (predictor)
variable.

14. a.

b. The regression equation is: Market = 60.7 + 0.414 Assessed

c. About 38% of the variation in market prices is explained by
assessed values (as predictor variable). There is a considerable amount of
unexplained variation.

d. The p value = .000. Regression is highly significant.

e. The prediction is made at an assessed value, 90.5, outside of the range
covered by the data (see scatter diagram). The linear relation may no longer hold.

f. Residual plots follow.

Unusual Observations

Obs Assessed Market Fit SE Fit Residual St Resid


3 64.6 87.200 87.423 1.199 -0.223 -0.10 X
26 72.0 97.200 90.483 0.578 6.717 2.83R

R denotes an observation with a large standardized residual.


X denotes an observation whose X value gives it large leverage.

15. a. The regression equation is: OpExpens = 18.88 + 1.30 PlayCosts

b. About 75% of the variation in operating expenses is explained
by player costs.

c. The p value = .000 < .10. The regression is clearly significant at the 10%
level.

d. Coefficient on = player costs is 1.30. Is reasonable?
(p value = .000) suggests is not supported by
the data. Appears that operating expenses have a fixed cost component
represented by the intercept , and are then about 1.3 times player costs.

e. , gives or (47.2, 70.0).

f. Unusual Observations
Obs PlayCosts OpExpens Fit SE Fit Residual St Resid
7 18.0 60.00 42.31 1.64 17.69 3.45R

R denotes an observation with a large standardized residual


Team 7 has unusually low player costs relative to operating expenses.

16. a. Scatter diagram follows.

b. The regression equation is


Consumption = - 811 + 0.226 Families

Predictor Coef SE Coef T P


Constant -811.0 553.6 -1.47 0.158
Families 0.22596 0.05622 4.02 0.001

S = 819.812 R-Sq = 43.5% R-Sq(adj) = 40.8%

Analysis of Variance

Source DF SS MS F P
Regression 1 10855642 10855642 16.15 0.001
Residual Error 21 14113925 672092
Total 22 24969567

Although the regression is significant, the residual versus fit plot indicates the
magnitudes of the residuals increase with the level. This behavior and the
scatter diagram in a suggest that consumption is not evenly distributed about
the regression line. That is, the data have a megaphone-like appearance. A
straight line regression model for these data is not adequate.

c & d. The response variable is converted to the natural log of newsprint consumption
(LnConsum).

The regression equation is


LnConsum = 5.70 + 0.000134 Families

Predictor Coef SE Coef T P


Constant 5.6987 0.3302 17.26 0.000
Families 0.00013413 0.00003353 4.00 0.001

S = 0.488968 R-Sq = 43.2% R-Sq(adj) = 40.5%

Analysis of Variance

Source DF SS MS F P
Regression 1 3.8252 3.8252 16.00 0.001
Residual Error 21 5.0209 0.2391
Total 22 8.8461

The regression is significant (F = 16, p value = .001) although only 43% of the
variation in ln(consumption) is explained by families. The residual plots
above suggest the straight line regression of ln(consumption) on families is
adequate. This simple linear regression model with ln(consumption) is better
than the same model with consumption as the response.

e. Using the results in c, a forecast of ln(consumption) with 10,000 families is
7.040, so a forecast of consumption is e^7.040 ≈ 1,141.

f. Other variables that will influence newsprint consumption include number of
papers published and retail sales (influencing newspaper advertising).

17. a. Can see from fitted line plot below that growth in number of steakhouses is
exponential, not linear.

b. The slope of a regression of ln(location) versus year is related to the annual
growth rate.

The regression equation is


LnLocations = 0.348 + 0.820 Year

Predictor Coef SE Coef T P


Constant 0.3476 0.3507 0.99 0.378
Year 0.81990 0.09004 9.11 0.001

S = 0.376679 R-Sq = 95.4% R-Sq(adj) = 94.2%

Analysis of Variance

Source DF SS MS F P
Regression 1 11.764 11.764 82.91 0.001
Residual Error 4 0.568 0.142
Total 5 12.332

Estimated annual growth rate is 100(e^.82 – 1)% = 127%

c. Forecast of ln(locations) for 2007 is .348 + .820(20) = 16.748. Hence a forecast of
the number of Outback Steakhouse locations for 2007 is e^16.748 or 18,774,310, an
absurd number. This example illustrates the danger of extrapolating a trend (growth)
curve far into the future.

18. a. Can see from fitted line plot below that growth in number of copy centers is
exponential, not linear.

b. The slope of a regression of ln(centers) on time (year) is related to the annual
growth rate.

The regression equation is


LnCenters = - 0.305 + 0.483 Time

Predictor Coef SE Coef T P


Constant -0.3049 0.1070 -2.85 0.015
Time 0.48302 0.01257 38.42 0.000

S = 0.189608 R-Sq = 99.2% R-Sq(adj) = 99.1%

Analysis of Variance

Source DF SS MS F P
Regression 1 53.078 53.078 1476.38 0.000
Residual Error 12 0.431 0.036
Total 13 53.509

Estimated annual growth rate is 100(e^.483 – 1)% = 62%

c. Forecast of ln(centers) for 2012 is -.305 + .483(20) = 9.355. Hence a forecast of
the number of On The Double copy centers for 2012 is e^9.355 or 11,556, an unlikely
number. This example illustrates the possible danger of extrapolating a trend
(growth) curve some distance into the future.

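
The growth-rate and extrapolation arithmetic in Problems 17 and 18 can be reproduced with a short Python sketch. It assumes the rounded coefficients quoted in the answers (slopes .820 and .483, intercepts .348 and -.305) and a time index of 20 for the forecast year; the function names are illustrative.

```python
import math


def annual_growth_pct(slope):
    """Annual percentage growth implied by the slope of a ln(Y)-on-time regression."""
    return 100.0 * (math.exp(slope) - 1.0)


def forecast_level(intercept, slope, time_index):
    """Back-transform a forecast of ln(Y) to the original units."""
    return math.exp(intercept + slope * time_index)


# Problem 17 (steakhouse locations) and Problem 18 (copy centers).
outback_growth = annual_growth_pct(0.820)
copy_growth = annual_growth_pct(0.483)
outback_2007 = forecast_level(0.348, 0.820, 20)
copy_2012 = forecast_level(-0.305, 0.483, 20)

print(round(outback_growth), round(copy_growth))
print(round(outback_2007), round(copy_2012))
```

Because the forecast is an exponential of the fitted line, a small change in the slope is magnified enormously at a time index of 20, which is why both extrapolations are implausible.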
19. a. Intercept b0 = 17.954, Slope b1 = –.2715

b. Cannot reject H0 at the 10% level since the t value associated with the slope
coefficient, –1.57, has a p value of .138 > .10. The regression is not significant.
There does not appear to be a relationship between profits per employee and
number of employees.

c. r² = .15. Only 15% of the variation in profits per employee is explained by the
number of employees.

d. The regression is not significant. There is no point in using the fitted function to
generate forecasts for profits per employee for a given number of employees.

20. Deleting Dun and Bradstreet gives the following results:

The regression equation is


Profits = 25.0 - 0.713 Employees

Predictor Coef SE Coef T P


Constant 25.013 5.679 4.40 0.001
Employees -0.7125 0.2912 -2.45 0.029

S = 9.83868 R-Sq = 31.5% R-Sq(adj) = 26.3%

Analysis of Variance

Source DF SS MS F P
Regression 1 579.40 579.40 5.99 0.029
Residual Error 13 1258.40 96.80
Total 14 1837.80

The regression is now significant at the 5% level (t value = -2.45, p value = .029 < .05).
r² has increased from 15% to 31.5%. These results suggest there is a linear
relationship between profits per employee and number of employees. A single
observation can have a large influence on the regression analysis, particularly when
the number of observations is relatively small. However, the relatively small r² of 31.5%
indicates there will be a fair amount of uncertainty associated with any forecast of
profits per employee. Dun and Bradstreet should not be thrown out unless there is some
good (non-numerical) reason not to include this firm with the others.

21. a. The regression equation is


Actual = 0.68 + 0.922 Estimate

Predictor Coef SE Coef T P

Constant 0.683 1.691 0.40 0.690
Estimate 0.92230 0.08487 10.87 0.000

S = 5.69743 R-Sq = 83.1% R-Sq(adj) = 82.4%

Analysis of Variance

Source DF SS MS F P
Regression 1 3833.4 3833.4 118.09 0.000
Residual Error 24 779.1 32.5
Total 25 4612.5

b. The regression is significant (t value = 10.87, p value = .000 or, equivalently,
F ratio = 118.09, p value = .000).

c. r² = .831 or 83.1% of the variation in actual costs is explained by estimated
costs.

d. If estimated costs are perfect predictors of actual costs, then β0 = 0 and β1 = 1. The
estimated intercept coefficient, .683, is consistent with β0 = 0. With the t value = .40
and its p value = .69, cannot reject the null hypothesis H0: β0 = 0. To check the
hypothesis H0: β1 = 1, compute t = (.922-1)/.0849 = –.92, which is not in the rejection
region for a two-sided test at any reasonable significance level. The estimated slope
coefficient, .922, is consistent with β1 = 1.
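
The test of the slope against a hypothesized value other than zero in Problem 21d can be sketched in Python. The coefficient and standard error come from the printout above; the critical value 2.064 (two-sided 5% test, 24 degrees of freedom) is taken from a standard t table and is an assumption of this sketch, as is the function name.

```python
def t_for_hypothesized_value(estimate, hypothesized, std_error):
    """t statistic for H0: coefficient equals the hypothesized value."""
    return (estimate - hypothesized) / std_error


# Slope: estimate .92230, SE .08487, hypothesized value 1.
t_slope = t_for_hypothesized_value(0.92230, 1.0, 0.08487)

# Intercept: estimate .683, SE 1.691, hypothesized value 0.
t_intercept = t_for_hypothesized_value(0.683, 0.0, 1.691)

T_CRITICAL = 2.064  # assumed table value: two-sided 5% test, 24 df
reject_slope = abs(t_slope) > T_CRITICAL
reject_intercept = abs(t_intercept) > T_CRITICAL
print(round(t_slope, 2), round(t_intercept, 2), reject_slope, reject_intercept)
```

The printout's own t value for the slope (10.87) tests a slope of zero, so it cannot be used for this hypothesis; the statistic has to be recomputed around 1.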

e. The plot of the residuals versus the fitted values has a megaphone-like appearance.
The residuals are numerically smaller for smaller projects than for larger projects.
Estimated costs are more accurate predictors of actual costs for inexpensive (smaller)
projects than for expensive (larger) projects.

22. a. The regression is significant (t value = 14.71, p value = .000).

b. r² = .90 or 90% of the variation in ln(actual costs) is explained by
ln(estimated costs).

c. If ln(estimated costs) are perfect predictors of ln(actual costs), then β0 = 0 and β1 = 1.
The estimated intercept coefficient, .003, is consistent with β0 = 0. With the
t value = .02 and its p value = .987, cannot reject the null hypothesis H0: β0 = 0.
To check the hypothesis H0: β1 = 1, compute t = (.968-1)/.0658 = –.49, which is not
in the rejection region for a two-sided test at any reasonable significance level.
The estimated slope coefficient, .968, is consistent with β1 = 1.

d. ln(24) = 3.178, so forecast of ln(actual cost) = .0026 + .968(3.178) = 3.079. Forecast
of actual cost is e^3.079 = 21.737.

CASE 6-1: TIGER TRANSPORT

This case asks students to summarize the analysis in a report to management. We find this a
useful exercise since it requires students to put the application and results of a statistical procedure into
their own words. If they are able to do this, they understand the technique.
This case illustrates the use of regression analysis in a situation where determining a good
regression equation is only the first step. The results must then be priced out in order to
arrive at a rational decision regarding a pricing policy. This situation can generate a discussion
regarding the general nature of quantitative techniques: they aid in the decision-making
process rather than replace it. Possible policies regarding the small-load charge can be
discussed after the cost of such loads is determined. One approach would be to take small loads
at company cost, which is low. The resultant goodwill might pay off in increased regular
business. Another would be to charge a low cost for small loads but only if the customer agrees to
book a certain number of large loads.
The low out-of-pocket cost involved in adding small loads can focus management attention
in other directions. Since no significant costs need to be recovered by the small load charge,
a policy based on other considerations is appropriate.

CASE 6-2: BUTCHER PRODUCTS, INC.

1. The 89 degree temperature is 24 degrees off ideal (89 - 65 = 24). This value is placed into
the regression equation yielding a forecast number of units per day of 338.

2. Once again, the temperature is 24 degrees from ideal (65 - 41 = 24). For X = 24, a forecast
of 338 units is calculated from the regression equation.

3. Since there is a fairly strong relationship between output and deviation from ideal
temperature (r = -.80), higher output may well result from efforts to control the
temperature in the work area so that it is close to 65 degrees. Gene should consider ways
to do this.

4. Gene has made a decent start towards finding an effective forecasting tool. However,
since about 36% of the variation in output is unexplained, he should look for additional
important predictor variables.

CASE 6-3: ACE MANUFACTURING


1. The correlation coefficient is: r = .927. The corresponding t = 8.9 for testing
H0: ρ = 0 has a p value of .000. We reject H0 and conclude the correlation between
days absent and employee age holds for the population.

2. Y = –4.28 + .254X

3. r² = .859. About 86% of Y's (absent days) variability can be explained through
knowledge of X (employee age).

4. The null hypothesis is rejected using either t = 8.9, p value = .000 or the
F = 79.3 with p value = .000. There is a significant relation between absent days and
employee age.

5. Placing X = 24 into the prediction equation yields a Y forecast of 1.8 absent days per year.

6. If time and cost are not factors, it might be helpful to take a larger sample to see if these
small sample results hold. If results hold, a larger sample will very likely produce
more precise interval forecasts.

7. The fitted function is likely to produce useful forecasts, although 95% prediction
intervals can be fairly wide because of the small sample size.

CASE 6-4: MR. TUX

1. After John uses simple regression analysis to forecast his monthly sales volume, he is
not satisfied with the results. The low r-squared value (56.3%) disappoints him.
The high seasonal variation should be discussed as a cause of his poor fit
when using only the month number to forecast sales. Using dummy variables
to account for the monthly effect is a possibility. After this topic
is covered in Chapter 7, you can have the students return to this case.

2. Not adequate.

3. The idea of serial correlation can be mentioned at this point. The possibility of
autocorrelated residuals can be introduced based on John's Durbin-Watson statistic.
In fact, the DW is low, indicating definite autocorrelation. A class discussion about
this problem and what might be done about it is useful. After this topic is covered
in Chapter 8, you can have the students return to this case. We hope that by this
time students appreciate the difficulties involved in real-life forecasting.
Compromises and multiple attempts are the norm, not exceptions.

CASE 6-5: CONSUMER CREDIT COUNSELING

1. The correlation of Clients and Stamps = 0.431 and t = 3.24, so the relationship is
significant but not very useful.

The regression equation is


Clients = 32.7 + 0.00349 Stamps

Predictor Coef SE Coef T P


Constant 32.68 31.94 1.02 0.312
Stamps 0.003487 0.001076 3.24 0.002

S = 23.6787 R-Sq = 18.6% R-Sq(adj) = 16.8%

Analysis of Variance

Source DF SS MS F P
Regression 1 5891.9 5891.9 10.51 0.002
Residual Error 46 25791.4 560.7
Total 47 31683.2

The correlation of Clients and Index = 0.752. The relation is significant (see below).

The regression equation is


Clients = - 199 + 2.94 Index

Predictor Coef SE Coef T P


Constant -198.65 28.64 -6.94 0.000
Index 2.9400 0.2619 11.23 0.000

S = 19.9159 R-Sq = 56.5% R-Sq(adj) = 56.1%

Analysis of Variance

Source DF SS MS F P
Regression 1 49993 49993 126.04 0.000
Residual Error 97 38475 397
Total 98 88468

2. The regression equation is Clients = - 199 + 2.94 BI

Jan 1993: Clients = - 199 + 2.94 (125) = 168.5
Feb 1993: Clients = - 199 + 2.94 (125) = 168.5
Mar 1993: Clients = - 199 + 2.94 (130) = 183.2

Note: Students might develop a new equation that leaves out the first three months of
data for 1993. This is a better way to determine whether the model works and the
results are:

The regression equation is


Clients = - 204 + 2.99 Index

Predictor Coef SE Coef T P


Constant -203.85 31.37 -6.50 0.000
Index 2.9898 0.2883 10.37 0.000

S = 20.0046 R-Sq = 53.4% R-Sq(adj) = 52.9%

Analysis of Variance

Source DF SS MS F P
Regression 1 43028 43028 107.52 0.000
Residual Error 94 37617 400
Total 95 80645

Jan 1993: Clients = - 204 + 2.99 (125) = 169.8
Feb 1993: Clients = - 204 + 2.99 (125) = 169.8
Mar 1993: Clients = - 204 + 2.99 (130) = 184.7

Regressing Clients on the reciprocal of Index produces a little better straight line fit.
The results for this transformed predictor variable follow.

The regression equation is


Clients = 470 - 37719 RecipIndex

Predictor Coef SE Coef T P


Constant 469.58 32.07 14.64 0.000
RecipIndex -37719 3461 -10.90 0.000

S = 19.4689 R-Sq = 55.8% R-Sq(adj) = 55.3%

Analysis of Variance

Source DF SS MS F P
Regression 1 45015 45015 118.76 0.000
Residual Error 94 35630 379
Total 95 80645

3.          Actual   Forecast            Forecast                         Forecast
                     (Index, all data)   (Index, Jan-Mar 1993 held out)   (RecipIndex predictor)
Jan 1993    152      169                 170                              168
Feb 1993    151      169                 170                              168
Mar 1993    199      183                 185                              180

4. Only if the business activity index could itself be forecasted accurately. Otherwise, it is
not a viable predictor because the values for the business activity index are not
available in a timely fashion.

5. Perhaps. This topic will be the subject of Chapter 8.

6. If a good regression equation can be developed in which the changes in the predictor
variable lead the response, it might be possible to accurately forecast the rest of 1993.
However, if the regression equation is based on coincident changes in the predictor
variable and response, forecasts for the rest of 1993 could not be developed since values
for the predictor variable are not known in advance.
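
The forecasts in items 2 and 3 of this case can be reproduced from the three fitted equations. The sketch below is a minimal Python illustration; it assumes the rounded coefficients shown in the printouts and the Index values 125, 125, and 130 used above, and the dictionary labels are descriptive names chosen here rather than terms from the case.

```python
# Business activity index values and actual client counts for early 1993.
index_values = {"Jan 1993": 125, "Feb 1993": 125, "Mar 1993": 130}
actual_clients = {"Jan 1993": 152, "Feb 1993": 151, "Mar 1993": 199}

# Fitted equations, using the rounded coefficients quoted in the case answers.
fitted_equations = {
    "Index, all data": lambda idx: -199 + 2.94 * idx,
    "Index, Jan-Mar 1993 held out": lambda idx: -204 + 2.99 * idx,
    "RecipIndex": lambda idx: 469.58 - 37719 / idx,
}

# Forecast for each month under each equation.
forecasts = {
    name: {month: equation(idx) for month, idx in index_values.items()}
    for name, equation in fitted_equations.items()
}

for name, by_month in forecasts.items():
    for month, value in by_month.items():
        error = actual_clients[month] - value
        print(f"{name:30s} {month}: forecast {value:6.1f}, error {error:6.1f}")
```

All three equations overforecast January and February and underforecast March, which is consistent with the modest R-Sq values reported for these fits.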

CASE 6-6: AAA WASHINGTON

1. The four linear regression models are shown below. Both temperature and rainfall are
potential predictor variables.

The regression equation is


Calls = 18366 + 467 Rate

Predictor Coef SE Coef T P


Constant 18366 1129 16.27 0.000
Rate 467.4 174.2 2.68 0.010

S = 1740.10 R-Sq = 11.0% R-Sq(adj) = 9.5%

The regression equation is


Calls = 28582 - 137 Temp

Predictor Coef SE Coef T P


Constant 28582.2 956.0 29.90 0.000
Temp -137.44 18.06 -7.61 0.000

S = 1289.61 R-Sq = 51.3% R-Sq(adj) = 50.4%

The regression equation is


Calls = 20069 + 400 Rain

Predictor Coef SE Coef T P


Constant 20068.9 351.7 57.07 0.000
Rain 400.30 84.20 4.75 0.000

S = 1555.56 R-Sq = 29.1% R-Sq(adj) = 27.8%

The regression equation is


Calls = 27980 - 0.0157 Members

49 cases used, 3 cases contain missing values

Predictor Coef SE Coef T P


Constant 27980 3769 7.42 0.000
Members -0.015670 0.008703 -1.80 0.078

S = 1628.15 R-Sq = 6.5% R-Sq(adj) = 4.5%

2. & 3. Sixty-five degrees was subtracted from the temperature variable. The variable used
was the absolute value of the temperature with relative zero at 65 degrees Fahrenheit
labeled NewTemp.

The correlation coefficient between Calls and NewTemp is .724, indicating a fairly
strong positive linear relationship. However, examination of the fitted line plot below
suggests there is a curvilinear relation between Calls and NewTemp.

4. A linear regression model with predictor variable NewTemp**2 gives a much
better fit. The residual plots also indicate an adequate fit.

The regression equation is


Calls = 20044 + 5.38 NewTemp**2

Predictor Coef SE Coef T P
Constant 20044.4 203.1 98.68 0.000
NewTemp**2 5.3817 0.5462 9.85 0.000

S = 1111.19 R-Sq = 63.8% R-Sq(adj) = 63.2%

Analysis of Variance

Source DF SS MS F P
Regression 1 119870408 119870408 97.08 0.000
Residual Error 55 67910916 1234744
Total 56 187781324
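
The transformed predictor used in this case can be sketched in Python. The code assumes the coefficients from the printout above and the definition NewTemp = |Temp - 65|; the temperatures 45 and 85 are hypothetical inputs chosen only to show that the fitted curve is symmetric about 65 degrees.

```python
IDEAL_TEMP_F = 65.0
INTERCEPT = 20044.4   # constant from the NewTemp**2 regression
SLOPE = 5.3817        # coefficient on NewTemp**2


def new_temp(temp_f, ideal=IDEAL_TEMP_F):
    """Absolute deviation of temperature from the relative zero at 65 degrees."""
    return abs(temp_f - ideal)


def predicted_calls(temp_f):
    """Fitted number of calls from the quadratic-in-NewTemp regression."""
    return INTERCEPT + SLOPE * new_temp(temp_f) ** 2


# Hypothetical temperatures, equally far below and above 65 degrees.
for temp in (45.0, 65.0, 85.0):
    print(temp, round(predicted_calls(temp), 1))
```

Squaring the deviation makes the predicted call volume rise slowly near 65 degrees and quickly at the extremes, which is the curvature the fitted line plot suggested.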

CHAPTER 7

MULTIPLE REGRESSION

ANSWERS TO PROBLEMS AND CASES

1. A good predictor variable is highly related to the dependent variable but not too
highly related to other predictor variables.

2. The population of Y values is normally distributed about E(Y), the plane formed by the
regression equation. The variance of the Y values around the regression plane is
constant. The residuals are independent of each other, implying a random sample. A
linear relationship exists between Y and each predictor variable.

3. The net regression coefficient measures the average change in the dependent variable per
unit change in the relevant independent variable, holding the other independent variables
constant.

4. The standard error of the estimate is an estimate of σ, the standard deviation of Y about the regression plane.

5. Y = 7.52 + 3(20) - 12.2(7) = -17.88

6. a. A correlation matrix displays the correlation coefficients between every
possible pair of variables in the analysis.

b. The proportion of Y's variability that can be explained by the predictor
variables is given by R². It is also referred to as the coefficient of
determination.

c. Collinearity results when predictor variables are highly correlated among
themselves.

d. A residual is the difference between an actual Y value and Ŷ, the value
predicted using the sample regression plane.

e. A dummy variable is used to determine the relationship between a qualitative
independent variable and a dependent variable.

f. Step-wise regression is a procedure for selecting the “best” regression
function by adding or deleting a single independent variable at different
stages of its development.

7. a. Each variable is perfectly related to itself. The correlation is always 1.

b. The entries in a correlation matrix reflected about the main diagonal are the
same. For example, r32 = r23.

c. Variables 5 and 6 with correlation coefficients of .79 and .70, respectively.

d. The r14 = -.51 indicates a negative linear relationship.

e. Yes. Variables 5 and 6 are to some extent collinear, r56 = .69.

f. Models that include variables 4 and 6 or variables 2 and 5 are possibilities. The
predictor variables in these models are related to the dependent variable and not
too highly related to each other.

g. Variable 5.

8. a. Correlations:
Time Amount
Amount 0.959
Items 0.876 0.923

The Full Model regression equation is:

Time = 0.422 + 0.0871 Amount - 0.039 Items

Predictor Coef SE Coef T P VIF


Constant 0.4217 0.5864 0.72 0.483
Amount 0.08715 0.01611 5.41 0.000 6.756
Items -0.0386 0.1131 -0.34 0.737 6.756

S = 0.857511 R-Sq = 92.1% R-Sq(adj) = 91.1%

Analysis of Variance

Source DF SS MS F P
Regression 2 128.988 64.494 87.71 0.000
Residual Error 15 11.030 0.735
Total 17 140.018

Amount and Items are highly collinear (correlation = .923, VIF = 6.756). Both
variables are not needed in the regression function. Deleting Items, which has the
non-significant t value, gives the best regression below.

The regression equation is


Time = 0.263 + 0.0821 Amount

Predictor Coef SE Coef T P


Constant 0.2633 0.3488 0.75 0.461
Amount 0.082068 0.006025 13.62 0.000

S = 0.833503 R-Sq = 92.1% R-Sq(adj) = 91.6%

Analysis of Variance

Source DF SS MS F P
Regression 1 128.90 128.90 185.54 0.000
Residual Error 16 11.12 0.69
Total 17 140.02

b. From the Full Model, checkout time decreases by .039 for each additional item
purchased (holding Amount constant), which does not make sense.

c. Using the best model

Time = .2633 + .0821(28) = 2.5621


e = Y - Ŷ = 2.4 - 2.5621 = -.1621

d. Using the best model, sy.x = .8335

e. The standard deviation of Y is estimated by .8335.

f. Using the best model, the number of Items is not relevant so

Time = .2633 + .0821(70) = 6.01

g. Using the best model, the 95% prediction interval (interval forecast) for
Amount = $70 is given below.

New
Obs Fit SE Fit 95% CI 95% PI
1 6.008 0.238 (5.504, 6.512) (4.171, 7.845)

h. Multicollinearity is a problem. Jennifer should use the regression equation with
the single predictor variable Amount.
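
With exactly two predictors, the variance inflation factor reported in the full-model printout depends only on the correlation between them. A minimal Python sketch, assuming the rounded correlation .923 between Amount and Items from the correlation matrix in part a (the function name is illustrative):

```python
def vif_two_predictors(r):
    """VIF for either predictor in a two-predictor model: 1 / (1 - r^2)."""
    return 1.0 / (1.0 - r ** 2)


# Correlation between Amount and Items from the correlation matrix.
vif_amount_items = vif_two_predictors(0.923)
print(round(vif_amount_items, 2))
```

The result differs from the printed 6.756 only in the third decimal because the correlation is rounded to three places. With more than two predictors the same formula applies, but r² is replaced by the R-Sq from regressing one predictor on all the others.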

9. a. Correlations: Food, Income, Size

Food Income
Income 0.884
Size 0.737 0.867

Income is highly correlated with Food (expenditures) and, to a lesser extent,
so is Size. However, the predictor variables Income and Size are themselves
highly correlated, indicating there is a potential multicollinearity problem.

b. The regression equation is

Food = 3.52 + 2.28 Income - 0.41 Size

Predictor Coef SE Coef T P VIF


Constant 3.519 3.161 1.11 0.302
Income 2.2776 0.8126 2.80 0.026 4.016
Size -0.411 1.236 -0.33 0.749 4.016

S = 2.89279 R-Sq = 78.5% R-Sq(adj) = 72.3%

When income is increased by one thousand dollars holding family size constant, the
average increase in annual food expenditures is 228 dollars. When family size is
increased by one person holding income constant, the average decrease in annual
food expenditures is 41 dollars. Since family size is positively related to food
expenditures, r = .737, it doesn’t make sense that a decrease in expenditures
would occur.

c. Multicollinearity is a problem as indicated by VIF’s of about 4.0. Size should be
dropped from the regression function and the analysis redone with only Income
as the predictor variable.

10. a. Both high temperature and traffic count are positively related to number of six-
packs sold and have potential as good predictor variables. There is some collinearity
(r = .68) between the predictor variables but perhaps not enough to limit their
value.

b. Reject H0 if |t| > 2.898

t = b1/sb1 = 3.45

Reject H0 because 3.45 > 2.898 and conclude that the regression coefficient for
the high temperature variable is unequal to zero in the population.

Reject H0 if |t| > 2.898

t = b2/sb2 = .06795/.02026 = 3.35

Reject H0 because 3.35 > 2.898 and conclude that the regression coefficient for
the traffic count variable is unequal to zero in the population.
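
The t test for an individual coefficient is just the estimate divided by its standard error, compared with the table value. A minimal Python sketch using the traffic count figures above (the function names are illustrative):

```python
def t_ratio(coef, se_coef):
    """t statistic for testing that a regression coefficient is zero."""
    return coef / se_coef


def reject_null(t, t_crit):
    """Two-sided test: reject H0 when |t| exceeds the table value."""
    return abs(t) > t_crit


# Traffic count coefficient and its standard error; 2.898 is the table value used above
t_traffic = t_ratio(0.06795, 0.02026)
print(round(t_traffic, 2), reject_null(t_traffic, 2.898))
```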

c. Y = -26.706 + .78207(60) + .06795(500) = 54 (six-packs)


d. R2 = 1 - Σ(Y - Ŷ)^2 / Σ(Y - Ȳ)^2 = 1 - 2727.9/14316.9 = .81
We are able to explain 81% of the number of six-packs sold variation using
knowledge of daily high temperature and daily traffic count.

e. sy.x’s = sqrt[Σ(Y - Ŷ)^2 / (n - k - 1)] = sqrt[2727.9/(20 - 3)] = sqrt(160.46) = 12.67

f. If there is an increase of one degree in high temperature while the traffic count
is held constant, beer sales increase on an average of .78 six-packs.

g. The predictor variables explain 81% of the variation in six-packs sold. Both
predictor variables are significant. It would be prudent to examine the residuals (not
available in the problem) before deciding to use the fitted regression function for
forecasting however.
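
The values in parts d and e of this problem follow directly from the two sums of squares. The sketch below recomputes them in Python; the function names are illustrative, and n = 20 observations with k = 2 predictors are taken from the problem.

```python
import math


def r_squared(sse, sst):
    """Proportion of total variation explained by the regression."""
    return 1.0 - sse / sst


def std_error_of_estimate(sse, n, k):
    """sy.x's: square root of SSE divided by its degrees of freedom (n - k - 1)."""
    return math.sqrt(sse / (n - k - 1))


r2 = r_squared(sse=2727.9, sst=14316.9)
s = std_error_of_estimate(sse=2727.9, n=20, k=2)
print(round(r2, 2), round(s, 2))
```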

11. a. Scatter diagram follows. Female drivers indicated by solid circles, male drivers by
diamonds.

b. The regression equation is: Ŷ = 25.5 - 1.04 X1 + 1.21 X2
For a given age of car, female drivers expect to get about 1.2 more miles
per gallon than male drivers.

c. Fitted line for female drivers (X2 = 1) has equation: Ŷ = 26.71 - 1.04 X1
Fitted line for male drivers (X2 = 0) has equation: Ŷ = 25.5 - 1.04 X1
(Parallel lines with different intercepts)

d. The fitted line falls “between” the points representing female drivers and the points
representing male drivers. The straight line equation over-predicts mileage for
male drivers and under-predicts mileage for female drivers. It is important to include
the gender variable in this regression function.

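
A dummy variable shifts the intercept and leaves the slope alone, which is why the two fitted lines are parallel. The sketch below assumes X1 is the age of the car and X2 is coded 1 for female drivers and 0 for male drivers (a coding inferred from the interpretation in part b, not stated in the output); the function name is illustrative.

```python
def fitted_mpg(age, female):
    """Fitted miles per gallon from the regression equation in part b.

    age     age of car (X1)
    female  assumed dummy coding for X2: 1 for female drivers, 0 for male drivers
    """
    return 25.5 - 1.04 * age + 1.21 * female


# The gap between the two lines is the dummy coefficient at every age of car
for age in (1, 5, 10):
    gap = fitted_mpg(age, 1) - fitted_mpg(age, 0)
    print(age, round(fitted_mpg(age, 1), 2), round(fitted_mpg(age, 0), 2), round(gap, 2))
```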
12. a. Correlations: Sales, Outlets, Auto

Sales Outlets
Outlets 0.739
Auto 0.548 0.670

Number of retail outlets is positively related to annual sales, r12 = .74, and is
potentially a good predictor variable. Number of automobiles registered is
moderately related to annual sales, r13 = .55, and is positively correlated with
number of retail outlets, r23 = .67. Given number of retail outlets in the
regression function, number of automobiles registered may not be required.

b. The regression equation is

Sales = 10.1 + 0.0110 Outlets + 0.195 Auto

Predictor Coef SE Coef T P VIF


Constant 10.109 7.220 1.40 0.199
Outlets 0.010989 0.005200 2.11 0.068 1.813
Auto 0.1947 0.6398 0.30 0.769 1.813

S = 10.3051 R-Sq = 55.1% R-Sq(adj) = 43.9%

Analysis of Variance

Source DF SS MS F P
Regression 2 1043.7 521.8 4.91 0.041
Residual Error 8 849.6 106.2
Total 10 1893.2

Predicted Values for New Observations

New
Obs Fit SE Fit 95% CI 95% PI
1 37.00 7.15 (20.50, 53.49) (8.07, 65.93)

As can be seen from the regression output, it appears as if each predictor variable is
not significant (at the 5% level); however, the regression is significant at the 5%
level. This is one of the things that can happen when the predictor variables are collinear.
The forecast for region 1 is 37 with a prediction error of 52.3 – 37 = 15.3. However,
it is not a good idea to use this fitted function for forecasting. If the regression is rerun
after deleting Auto, Outlets (and the regression) is significant at the 1% level and
R2 is virtually unchanged at 55%.
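
The overall F ratio, R2 and adjusted R2 in the output for this problem all come from the same analysis of variance entries. A Python sketch that recomputes them from the regression and residual sums of squares (the function name is illustrative; small differences from Minitab are rounding):

```python
def anova_summary(ssr, sse, n, k):
    """F ratio, R2 and adjusted R2 from regression and residual sums of squares."""
    msr = ssr / k                # regression mean square
    mse = sse / (n - k - 1)      # residual mean square
    sst = ssr + sse              # total sum of squares
    return {
        "F": msr / mse,
        "R2": ssr / sst,
        "R2_adj": 1.0 - mse / (sst / (n - 1)),
    }


# Sales on Outlets and Auto: n = 11 regions, k = 2 predictors
out = anova_summary(ssr=1043.7, sse=849.6, n=11, k=2)
print({name: round(value, 3) for name, value in out.items()})
```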

c. Y = 10.11 + .011(2500) + .195(20.2) = 41.549 (million)

d. The standard error of estimate is 10.3, which is quite large. As explained in part b,
the fitted function with both predictor variables should not be used to forecast.
Even if the regression is rerun with the single predictor Outlets, R2 = 55% and
the relatively large standard error of the estimate suggest there will be a lot of
uncertainty associated with any forecast.

e. sy.x’s = sqrt[Σ(Y - Ŷ)^2 / (n - k - 1)] = sqrt[849.6/(11 - 3)] = sqrt(106.2) = 10.3

f. If one retail outlet is added while the number of automobiles registered remains
constant, sales will increase by an average of .011 million or $11,000. If
one million more automobiles are registered while the number of retail outlets
remains constant, sales will increase by an average of .195 million or $195,000.
However, these regression coefficients are suspect due to collinearity
between the predictor variables.

g. New predictor variables should be tried.

13. a. Correlations: Sales, Outlets, Auto, Income

Sales Outlets Auto


Outlets 0.739
Auto 0.548 0.670
Income 0.936 0.556 0.281

The regression equation is


Sales = - 3.92 + 0.00238 Outlets + 0.457 Auto + 0.401 Income

Predictor Coef SE Coef T P VIF


Constant -3.918 2.290 -1.71 0.131
Outlets 0.002384 0.001572 1.52 0.173 2.473
Auto 0.4574 0.1675 2.73 0.029 1.854
Income 0.40058 0.03779 10.60 0.000 1.481

S = 2.66798 R-Sq = 97.4% R-Sq(adj) = 96.2%

Analysis of Variance

Source DF SS MS F P
Regression 3 1843.40 614.47 86.32 0.000
Residual Error 7 49.83 7.12
Total 10 1893.23

Personal income by region makes a significant contribution to sales. Adding Income
to the regression function results in an increase in R2 from 55% to 97%. In addition,
the t value and corresponding p value for Income indicate the coefficient of this
variable in the population is different from 0 given predictor variables Outlets and
Auto. Notice, however, the regression should be rerun after deleting the insignificant
predictor variable Outlets. The correlation matrix and the VIF numbers suggest
Outlets is multicollinear with Auto and Income.

b. Predicted Values for New Observations

New
Obs Fit SE Fit 95% CI 95% PI
1 27.306 1.878 (22.865, 31.746) (19.591, 35.020)

Values of Predictors for New Observations

New
Obs Outlets Auto Income
1 2500 20.2 40.0

Annual sales for region 12 are predicted to be 27.306 million.

c. The standard error of estimate has been reduced to 2.67 from 10.3 and R2 has increased
to 97%. The 95% PI in part b is fairly narrow. The forecast for region 12 sales in
part b should be accurate.

d. The best choice is to drop Outlets from the regression function. If this is done,
the regression equation is
Sales = - 4.03 + 0.621 Auto + 0.430 Income

Predictor Coef SE Coef T P VIF


Constant -4.027 2.468 -1.63 0.141
Auto 0.6209 0.1382 4.49 0.002 1.086
Income 0.43017 0.03489 12.33 0.000 1.086

S = 2.87655 R-Sq = 96.5% R-Sq(adj) = 95.6%

Measures of fit are nearly the same as those for the full model and there is no longer
a multicollinearity problem.

14. a. Reject H0: β1 = 0 if |t| > 3.1.

t = .65/.05 = 13

Reject H0 and conclude that the regression coefficient for the aptitude test variable
is significantly different from zero in the population.
Similarly, reject H0: β2 = 0 if |t| > 3.1.

t = 20.6/1.69 = 12.2

Reject H0 and conclude that the regression coefficient for the effort index variable
is significantly different from zero in the population.

b. If the effort index increases one point while aptitude test score remains constant,
sales performance increases by an average of $20,600.

c. Ŷ = 16.57 + .65(75) + 20.6(.5) = 75.62

d. Σ(Y - Ŷ)^2 = (3.56)^2 (14 - 3) = 139.4

e. Σ(Y - Ȳ)^2 = (16.57)^2 (14 - 1) = 3569.3

f. R2 = 1 - Σ(Y - Ŷ)^2 / Σ(Y - Ȳ)^2 = 1 - 139.4/3569.3 = 1 - .039 = .961

We can explain 96.1% of the variation in sales performance with our
knowledge of the aptitude test score and the effort index.
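
Parts d through f of this problem recover the sums of squares from the two standard deviations and then form R2. A Python sketch of the same arithmetic (the function name is illustrative; 3.56 is the standard error of the estimate, 16.57 the standard deviation of Y, n = 14 and k = 2):

```python
def sums_of_squares(s_yx, s_y, n, k):
    """Recover SSE, SST and R2 from the standard error of the estimate and s_y."""
    sse = s_yx ** 2 * (n - k - 1)   # residual (unexplained) sum of squares
    sst = s_y ** 2 * (n - 1)        # total sum of squares
    return sse, sst, 1.0 - sse / sst


sse, sst, r2 = sums_of_squares(s_yx=3.56, s_y=16.57, n=14, k=2)
print(round(sse, 1), round(sst, 1), round(r2, 3))
```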

g.

15. a. Scatter plot for cash purchases versus number of items (rectangles) and credit card
purchases versus number of items (solid circles) follows.

b. Minitab regression output:

Notice that for a given number of items, sales from cash purchases are estimated to
be about $18.60 less than gross sales from credit card purchases.

c. The regression in part b is significant. The number of items sold and whether
the purchases were cash or credit card explains approximately 83% of the
variation in gross sales. The predictor variable Items is clearly significant. The
coefficient of the dummy variable X2 is significantly different from 0 at the
10% level but not at the 5% level. From the residual plots below we see that
there are a few large residuals (see, in particular, cash sales for day 25 and credit
card sales for day 1); but overall, plots do not indicate any serious departures
from the usual regression assumptions.

d. Y = 13.61 + 5.99(25) – 18.6(1) = $145

e. sy.x’s = 30.98, df = 47, t.025 ≈ z.025 = 1.96

95% (large sample) prediction interval:

145 ± 1.96(30.98) = ($84, $206)
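
The large-sample interval ignores the uncertainty in the fitted value and uses only the standard error of the estimate, so it is the point forecast plus or minus 1.96 standard errors. A minimal Python sketch (the function name is illustrative):

```python
def large_sample_pi(fit, s, z=1.96):
    """Approximate 95% prediction interval: fit +/- z * (standard error of estimate)."""
    half_width = z * s
    return fit - half_width, fit + half_width


# Point forecast from part d and sy.x's from part e
low, high = large_sample_pi(fit=145, s=30.98)
print(round(low), round(high))
```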

f. Fitted function in part b is effectively two parallel straight lines given by the
equations:
Cash purchases: Y = 13.61 + 5.99Items – 18.6(1) = -4.98 + 5.99Items
Credit card purchases: Y = 13.61 + 5.99Items

If we fit separate straight lines to the two types of purchases we get:


Cash purchases: Y = -.60 + 5.78Items R2 = 90.5%
Credit card purchases: Y = 10.02 + 6.46Items R2 = 66.0%
Predictions for cash sales and credit card sales will not be too much different
for the two procedures (one prediction equation or two individual equations).
In terms of R2, the single equation model falls between the fits of the separate
models for cash purchases and credit card purchases but closer to the higher
number for cash purchases. For convenience and overall good fit, prefer the
single equation with the dummy variable.

16. a. Correlations: WINS, ERA, SO, BA, RUNS, HR, SB

WINS ERA SO BA RUNS HR


ERA -0.494
SO 0.049 -0.393
BA 0.446 0.015 -0.007
RUNS 0.627 0.279 -0.209 0.645
HR 0.209 0.490 -0.215 0.154 0.664
SB 0.190 -0.404 -0.062 -0.207 -0.162 -0.305

ERA is moderately negatively correlated with WINS.


SO is essentially uncorrelated with WINS.
BA is moderately positively correlated with WINS and is also correlated
with the predictor variable RUNS.
RUNS is the predictor variable most highly correlated with WINS and will be
the first variable to enter the regression function in a stepwise program. RUNS
is fairly highly correlated with BA, so once RUNS is in the regression function, BA
is unlikely to be needed.
HR is essentially not related to WINS.
SB is essentially not related to WINS.

b. The stepwise results are the same for an alpha to enter = alpha to remove = .05 or
.15 (the Minitab default) or F to remove = F to enter =4.

Response is WINS on 6 predictors, with N = 26

Step 1 2
Constant 20.40 71.23

RUNS 0.087 0.115


T-Value 3.94 10.89
P-Value 0.001 0.000

ERA -18.0
T-Value -9.52
P-Value 0.000

S 7.72 3.55
R-Sq 39.28 87.72
The fitted function from the stepwise program is:

WINS = 71.23 + .115 RUNS - 18 ERA with R2 = 88%

17. a. View will enter the stepwise regression function first since it has the largest
correlation with Price. After that the order of entry is difficult to determine from
the correlation matrix alone. Several of the predictor variable pairs are fairly highly
correlated so multicollinearity could be a problem. For example, once View is in the
model, Elevation may not enter (be significant). Slope and Area are correlated so
it may be only one of these predictors is required.

b. As pointed out in part a, it is difficult to determine the results of a stepwise program.
However, a two predictor model will probably work as well as any in this case.
Potential two predictor models include View and Area or View and Slope.

18. a., b., & c. The regression results follow.

The regression equation is


Y = - 43.2 + 0.372 X1 + 0.352 X2 + 19.1 X3

Predictor Coef SE Coef T P VIF


Constant -43.15 31.67 -1.36 0.192
X1 0.3716 0.3397 1.09 0.290 1.473
X2 0.3515 0.2917 1.21 0.246 1.445
X3 19.12 11.04 1.73 0.103 1.481

S = 13.9119 R-Sq = 49.8% R-Sq(adj) = 40.4%

Analysis of Variance

Source DF SS MS F P
Regression 3 3071.1 1023.7 5.29 0.010
Residual Error 16 3096.7 193.5
Total 19 6167.8

Unusual Observations

Obs X1 Y Fit SE Fit Residual St Resid


20 95 57.00 84.43 4.73 -27.43 -2.10R

R denotes an observation with a large standardized residual.

Predicted Values for New Observations

New
Obs Fit SE Fit 95% CI 95% PI
1 80.88 3.36 (73.77, 88.00) (50.55, 111.22)

F = 5.29 with a p value = .010, so the regression is significant at the 1% level.

The predicted final exam score for within term exam scores of 86 and 77 and a
GPA of 3.4 is

Ŷ = -43.15 + .3716(86) + .3515(77) + 19.12(3.4) = 80.88

The variance inflation factors (VIF’s) are all small (near 1); however, the t ratios and
corresponding p values suggest that each of the predictor variables could be dropped
from the regression equation. Since the F ratio was significant, we conclude that
multicollinearity is a problem.

d. Mean leverage = (3+1)/20= .20. None of the observations are high leverage points.

e. From the regression output above, observation 20 has a large standardized residual.
The fitted model over-predicts the response (final exam score) for this student.

19. Stepwise regression results, with significance level .05 to enter and leave the
regression function, follow.

Alpha-to-Enter: 0.05 Alpha-to-Remove: 0.05

Response is Y on 3 predictors, with N = 20

Step 1
Constant -26.24

X3 31.4
T-Value 3.30
P-Value 0.004

S 14.6
R-Sq 37.71
R-Sq(adj) 34.25

The “best” regression model relates final exam score to the single predictor
variable grade point average.

All possible regression results are summarized in the following table.

Predictor
Variables       R2
X1              .295
X2              .301
X3              .377
X1, X2          .404
X1, X3          .452
X2, X3          .460
X1, X2, X3      .498

The R2 criterion would suggest using all three predictor variables. However, the
results in problem 7.18 suggest there is a multicollinearity problem with three
predictors. The best two independent variable model uses predictors X2 and X3.
When this model is fit, X2 is not required. We end up with a model involving the
single predictor X3, the model selected by the stepwise procedure.
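
R2 can never fall when a predictor is added, so comparing subsets of different sizes on R2 alone favors the largest model. One common adjustment, offered here as a supplementary check and not as the method used in the answer, is the adjusted R2. The sketch applies it to the table values with n = 20; the function name is illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for a model with k predictors fitted to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# R2 values from the all possible regressions table, n = 20
r2_table = {
    ("X1",): 0.295,
    ("X2",): 0.301,
    ("X3",): 0.377,
    ("X1", "X2"): 0.404,
    ("X1", "X3"): 0.452,
    ("X2", "X3"): 0.460,
    ("X1", "X2", "X3"): 0.498,
}

for predictors, r2 in r2_table.items():
    adj = adjusted_r2(r2, n=20, k=len(predictors))
    print(", ".join(predictors), round(r2, 3), round(adj, 3))
```

The adjusted values for X3 alone and for the full model agree with the R-Sq(adj) figures in the Minitab outputs for problems 18 and 19, up to rounding.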
20. Best three predictor variable model selected by stepwise regression follows.

The regression equation is

LnComp = 5.69 - 0.505 Educate + 0.255 LnSales - 0.0246 PctOwn

Predictor Coef SE Coef T P VIF


Constant 5.6865 0.6103 9.32 0.000
Educate -0.5046 0.1170 -4.31 0.000 1.0
LnSales 0.2553 0.0725 3.52 0.001 1.0
PctOwn -0.0246 0.0130 -1.90 0.064 1.0

S = 0.4953 R-Sq = 42.8% R-Sq(adj) = 39.1%

Coefficient on education is negative. Everything else equal, as education level
increases, compensation decreases. Positive coefficient on lnsales implies as sales
increase, compensation increases, everything else equal. Finally, for fixed
education and sales, as percent ownership increases, compensation decreases.
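
Because the response is a logarithm, the coefficients translate into percentage changes in compensation. The sketch below assumes LnComp and LnSales are natural logarithms (suggested by the variable names but not stated in the output); the function names are illustrative.

```python
import math

B_LNSALES = 0.2553   # coefficient of LnSales from the output above
B_PCTOWN = -0.0246   # coefficient of PctOwn from the output above


def pct_change_for_sales(pct_sales_change):
    """Percent change in compensation for a given percent change in sales."""
    return 100.0 * ((1.0 + pct_sales_change / 100.0) ** B_LNSALES - 1.0)


def pct_change_for_ownership(points):
    """Percent change in compensation for added percentage points of ownership."""
    return 100.0 * (math.exp(B_PCTOWN * points) - 1.0)


print(round(pct_change_for_sales(1.0), 3))      # roughly the LnSales coefficient
print(round(pct_change_for_ownership(1.0), 2))  # a decrease of about 2.4 percent
```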

Unusual Observations
Obs Educate LnComp Fit SE Fit Residual St Resid
31 2.00 6.5338 5.9055 0.4386 0.6283 2.73RX
33 0.00 6.3969 7.0645 0.2624 -0.6676 -1.59 X

R denotes an observation with a large standardized residual


X denotes an observation whose X value gives it large influence.

Observation 31 has a large standardized residual and is influential. Observation 33
is also influential. The CEO’s for companies 31 and 33 own relatively large
percentages of their company’s stock, 34% and 17% respectively. They are
outliers in this respect. The large residual for company 31 results from under-
predicting compensation for this CEO. This CEO receives very adequate
compensation in addition to owning a large percentage of the company’s stock.

All in all, this k = 3 predictor model appears to be better than the k = 2 predictor
model of Example 7.12.

21. Scatter diagram with fitted quadratic regression function:

a. & b. The regression equation is

Assets = 7.61 - 0.0046 Accounts + 0.000034 Accounts**2

Predictor Coef SE Coef T P VIF


Constant 7.608 8.503 0.89 0.401
Accounts -0.00457 0.02378 -0.19 0.853 25.965
Accounts**2 0.00003361 0.00000893 3.76 0.007 25.965

S = 12.4117 R-Sq = 97.9% R-Sq(adj) = 97.3%

Analysis of Variance

Source DF SS MS F P
Regression 2 51130 25565 165.95 0.000
Residual Error 7 1078 154
Total 9 52208

The regression is significant (F = 165.95, p value = .000). Given Accounts in the
model, Accounts**2 is significant (t value = 3.76, p value = .007). Here Accounts
could be dropped from the regression function and the analysis repeated with only
Accounts**2 as the predictor variable. If this is done, R2 and the coefficient of
Accounts**2 remain virtually unchanged.

c. Dropping Accounts**2 from the model gives:

The regression equation is


Assets = - 17.1 + 0.0832 Accounts
Predictor Coef SE Coef T P
Constant -17.121 8.778 -1.95 0.087
Accounts 0.083205 0.007592 10.96 0.000

S = 20.1877 R-Sq = 93.8% R-Sq(adj) = 93.0%

The coefficient of Accounts changes from the quadratic model to the straight
line model because, not surprisingly, Accounts and Accounts**2 are highly
collinear (VIF = 25.965 in the quadratic model).

22. The final model:

The regression equation is


Taste = - 30.7 + 4.20 H2S + 17.5 Lactic

Predictor Coef SE Coef T P VIF


Constant -30.733 9.146 -3.36 0.006
H2S 4.202 1.049 4.01 0.002 2.019
Lactic 17.526 8.412 2.08 0.059 2.019

S = 6.52957 R-Sq = 84.4% R-Sq(adj) = 81.8%

Analysis of Variance

Source DF SS MS F P
Regression 2 2777.0 1388.5 32.57 0.000
Residual Error 12 511.6 42.6
Total 14 3288.7

The regression is significant (F = 32.57, p value = .000). Although Lactic is not a
significant predictor at the 5% level, it is at the 6% level (t = 2.08, p value = .059)
and we have chosen to keep it in the model. R2 indicates about 84% of the variation
in Taste is explained by H2S and Lactic. The residual plots below indicate the fitted
function is adequate. There is no reason to doubt the usual regression assumptions.

23. Using the final model from problem 22 with H2S = 7.3 and Lactic = 1.85

Predicted Values for New Observations

New
Obs Fit SE Fit 95% CI 95% PI
1 32.36 3.02 (25.78, 38.95) (16.69, 48.04)

Since sy.x’s = 6.53, a large sample 95% prediction interval is:
32.36 ± 1.96(6.53) = (19.56, 45.16)

Notice the large sample 95% prediction interval is not too much different than the
actual 95% prediction interval (PI) above.

Although the fit in this case is relatively good, the standard error of the estimate is
somewhat large, so there is a fair amount of uncertainty associated with any forecast.
It may be a good idea to collect more data and, perhaps, investigate additional
predictor variables.

24. a. Correlations: GtReceit, MediaRev, StadRev, TotRev, PlayerCt, OpExpens, ...

GtReceit MediaRev StadRev TotRev PlayerCt OpExpens OpIncome
MediaRev 0.304
StadRev 0.587 0.348
TotRev 0.771 0.792 0.753
PlayerCt 0.423 0.450 0.269 0.499
OpExpens 0.636 0.554 0.623 0.766 0.867
OpIncome 0.562 0.672 0.547 0.785 -0.075 0.203
FranValu 0.655 0.780 0.701 0.925 0.397 0.635 0.797

Total Revenue is likely to be a good predictor of Franchise Value. The correlation
between these two variables is .925.

b. Stepwise Regression: FranValu versus GtReceit, MediaRev, ...

Alpha-to-Enter: 0.05 Alpha-to-Remove: 0.05

Response is FranValu on 7 predictors, with N = 26

Step 1
Constant 2.928

TotRev 1.96
T-Value 11.94
P-Value 0.000

S 13.7
R-Sq 85.59
R-Sq(adj) 84.99

Results from stepwise program are not surprising given the definitions of the
variables and the strong (and in some cases perfect) multicollinearity.

c. The coefficient of TotRev from the stepwise program is 1.96 and the constant
is relatively small and, in fact, insignificant. Consequently, Franchise Value is,
on average, about twice Total Revenue.

d. The regression equation is


OpExpens = 18.9 + 1.30 PlayerCt

Predictor Coef SE Coef T P


Constant 18.883 4.138 4.56 0.000
PlayerCt 1.3016 0.1528 8.52 0.000

S = 5.38197 R-Sq = 75.1% R-Sq(adj) = 74.1%


Analysis of Variance

Source DF SS MS F P
Regression 1 2101.7 2101.7 72.56 0.000
Residual Error 24 695.2 29.0
Total 25 2796.9

Unusual Observations

Obs PlayerCt OpExpens Fit SE Fit Residual St Resid


7 18.0 60.00 42.31 1.64 17.69 3.45R

R denotes an observation with a large standardized residual.

The linear relation between Operating expenses and Player costs is fairly strong.
About 75% of the variation in Operating expenses is explained by Player costs.

Observation 7 (Chicago White Sox) has relatively low Player costs as a
component of Operating expenses.

e. Clearly Total revenue, Operating expenses and Operating income are
multicollinear since, by definition, Operating income = Total revenue – Operating
expenses. Also, Total revenue ≈ Gate receipts + Media revenue + Stadium revenue
so this group of variables will be highly multicollinear.

CASE 7-1: THE BOND MARKET

The actual data for this case is supplied in Appendix A. Students can either be asked to
respond to the question at the end of the case or they can be assigned to run and analyze the data.
One approach that I have used successfully is to assign one group of students the role of asking
Judy Johnson's questions and another group the responsibility for Ron's answers.

1. What questions do you think Judy will have for Ron? The students always seem
to come up with questions that Ms. Johnson will ask. The key is that Ron should be able
to answer them. Possible issues include:

Are all the predictor variables in the final model required? Is a simpler model
with fewer predictor variables feasible?

Do the estimated regression coefficients in the final model make sense and are
they reliable?

Four observations have large standardized residuals. Is this a cause for concern?

Is the final model a good one and can it be confidently used to forecast the
utility’s bond interest rate at the time of issuance?

Is multiple regression the appropriate statistical method to use for this situation?

CASE 7-2: AAA WASHINGTON

1. The multiple regression model that includes both unemployment rate and average
monthly temperature is shown below. Temperature is the only good predictor variable.

2. Yes.

3. Unemployment rate lagged 11 months is a good predictor of emergency road service
calls. Unemployment rate lagged 3 months is not a good predictor. The Minitab output
with Temp and Lagged11Rate is given below.

The regression equation is


Calls = 21405 - 88.4 Temp + 756 Lag11Rate

Predictor Coef SE Coef T P


Constant 21405 1830 11.70 0.000
Temp -88.36 19.21 -4.60 0.000
Lag11Rate 756.3 172.0 4.40 0.000

S = 1116.80 R-Sq = 64.1% R-Sq(adj) = 62.8%

Analysis of Variance

Source DF SS MS F P
Regression 2 120430208 60215104 48.28 0.000
Residual Error 54 67351116 1247243
Total 56 187781324

The regression is significant. The signs on the coefficients of the independent variables
make sense. The coefficient of each independent variable is significantly different
from 0 (t = –4.6, p value = .000 and t = 4.4, p value = .000, respectively).

4. The results for a regression model with independent variables unemployment
rate lagged 11 months (Lag11Rate), transformed average temperature (NewTemp)
and NewTemp**2 are given below.

The regression equation is


Calls = 17060 + 635 Lag11Rate - 112 NewTemp + 7.59 NewTemp**2

Predictor Coef SE Coef T P


Constant 17060.2 847.0 20.14 0.000
Lag11Rate 635.4 146.5 4.34 0.000
NewTemp -112.00 47.70 -2.35 0.023
NewTemp**2 7.592 1.657 4.58 0.000

S = 941.792 R-Sq = 75.0% R-Sq(adj) = 73.5%

Analysis of Variance

Source DF SS MS F P
Regression 3 140771801 46923934 52.90 0.000
Residual Error 53 47009523 886972
Total 56 187781324

Unusual Observations

Obs Lag11R Calls Fit SE Fit Residual St Resid


11 6.10 24010 22101 193 1909 2.07R
29 5.60 17424 20346 191 -2922 -3.17R
32 6.19 24861 24854 487 7 0.01 X
34 5.72 19205 21157 201 -1952 -2.12R

R denotes an observation with a large standardized residual.


X denotes an observation whose X value gives it large leverage.

The residual plots follow. There is no significant residual autocorrelation at
any lag.

The regression is significant. Each predictor variable is significant. R2 = 75%.
Apart from a couple of large residuals, the residual plots indicate an adequate
model. There is no indication any of the usual regression assumptions have been
violated. A good model has been developed.

CASE 7-3: FANTASY BASEBALL (A)

1. The regression is significant. The R2 of 78.1% looks good. The t statistic for each
of the predictor variables is large with a very small p-value. The VIF’s are relatively
small for the three predictors indicating that multicollinearity is not a problem. The
residual plots shown in Figure 7-4 indicate that this model is valid. Dr. Hanke has
developed a good model to forecast ERA.

2. The matrix plot below of ERA versus each of five potential predictor variables does
not show any obvious nonlinear relationships. There does not appear to be any
reason to develop a new model.

3. The regression results with WHIP replacing OBA as a predictor variable follow.
The residual plots are very similar to those in Figure 7-4.

The regression equation is


ERA = - 2.81 + 4.43 WHIP + 0.101 CMD + 0.862 HR/9

Predictor Coef SE Coef T P VIF


Constant -2.8105 0.4873 -5.77 0.000
WHIP 4.4333 0.3135 14.14 0.000 1.959
CMD 0.10076 0.04254 2.37 0.019 1.793
HR/9 0.8623 0.1195 7.22 0.000 1.135

S = 0.439289 R-Sq = 77.9% R-Sq(adj) = 77.4%

Analysis of Variance

Source DF SS MS F P
Regression 3 91.167 30.389 157.48 0.000
Residual Error 134 25.859 0.193
Total 137 117.026

The fit and the adequacy of this model are virtually indistinguishable from the
corresponding model with OBA instead of WHIP as a predictor. The estimated
coefficients of CMD and HR/9 are nearly the same in both models. Both models are
good. The original model with OBA as a predictor has a slightly higher R2 and a
slightly smaller standard error of the estimate. Using these criteria, it is the preferred
model.

CASE 7-4: FANTASY BASEBALL (B)

The project may not be doomed to failure. A lot can be learned from investigating the
influence of the various independent variables on WINS. However, the best regression model
does not explain a large percentage of the variation in WINS, R2 = 34%, so the experts have
a point. There will be a lot of uncertainty associated with any forecast of WINS. The stepwise
selection of the best predictor variables and the subsequent full regression output follow.

Stepwise Regression: WINS versus THROWS, ERA, ...

Alpha-to-Enter: 0.05 Alpha-to-Remove: 0.05

Response is WINS on 10 predictors, with N = 138

Step 1 2
Constant 20.531 5.543

ERA -2.16 -2.01


T-Value -7.00 -6.80
P-Value 0.000 0.000

RUNS 0.0182
T-Value 3.86
P-Value 0.000

S 3.33 3.17
R-Sq 26.51 33.83
R-Sq(adj) 25.97 32.85

The regression equation is


WINS = 5.54 - 2.01 ERA + 0.0182 RUNS

Predictor Coef SE Coef T P VIF


Constant 5.543 4.108 1.35 0.179
ERA -2.0110 0.2959 -6.80 0.000 1.017
RUNS 0.018170 0.004702 3.86 0.000 1.017

S = 3.17416 R-Sq = 33.8% R-Sq(adj) = 32.8%

Analysis of Variance

Source DF SS MS F P
Regression 2 695.31 347.66 34.51 0.000
Residual Error 135 1360.17 10.08
Total 137 2055.48

CHAPTER 8

REGRESSION WITH TIME SERIES DATA

ANSWERS TO PROBLEMS AND CASES

1. If not properly accounted for, serial correlation can lead to false inferences under the
usual regression assumptions. Regressions can be judged significant when, in fact,
they are not, coefficient standard errors can be under (or over) estimated so individual
terms in the regression function may be judged significant (or insignificant) when they
are not (or are) and so forth.

2. Serial correlation often arises naturally in time series data. Series, like employment,
whose magnitudes are naturally related to the seasons of the year will be autocorrelated.
Series, like sales, that arise because of a consistently applied mechanism, like advertising
or effort, will be related from one period to the next (serially correlated). In the analysis
of time series data, autocorrelated residuals arise because of a model specification error
or incorrect functional form—the autocorrelation in the series is not properly accounted
for.

3. The independent observations (or, equivalently, independent errors) assumption
is most frequently violated.

4. Durbin-Watson statistic

5. Reject H0 if DW < 1.10. Since 1.0 < 1.10, reject and conclude that the errors are
positively autocorrelated.

6. Reject H0 if DW < 1.55. Do not reject H0 if DW > 1.62. Since 1.6 falls between 1.55
and 1.62, the test is inconclusive.
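
The decision rule used in problems 5 and 6 can be written out directly. The sketch below computes the Durbin-Watson statistic from a list of residuals and applies the rule for positive autocorrelation; the function names are illustrative, and the bounds dL and dU must still be read from a Durbin-Watson table.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def dw_decision(dw, d_lower, d_upper):
    """Test of H0: no autocorrelation against positive autocorrelation."""
    if dw < d_lower:
        return "reject H0"
    if dw > d_upper:
        return "do not reject H0"
    return "inconclusive"


# Problem 6: DW = 1.6 with table bounds 1.55 and 1.62
print(dw_decision(1.6, d_lower=1.55, d_upper=1.62))
```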

7. Serial correlation can be eliminated by specification of the regression function (using
the best predictor variables) consistent with the usual regression assumptions. This can
often be accomplished by using variables defined in terms of percentage changes rather
than magnitudes, or autoregressive models, or regression models involving first
differenced or generalized differenced variables.
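
A generalized difference replaces each observation by Y(t) - ρY(t-1), applied to the response and to every predictor; setting ρ = 1 gives ordinary first differences. A minimal Python sketch of the transformation (the function name and the small series are illustrative, and ρ would in practice be estimated from the residuals):

```python
def generalized_difference(series, rho):
    """Return Y(t) - rho * Y(t-1) for t = 2, ..., n; the first observation is lost."""
    return [series[t] - rho * series[t - 1] for t in range(1, len(series))]


y = [10, 12, 15, 19]                     # illustrative series, not from the text
print(generalized_difference(y, 1.0))    # first differences
print(generalized_difference(y, 0.5))    # generalized differences with rho = .5
```

The regression is then run on the transformed variables in place of the original ones.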

8. A predictor variable is generated by using the Y variable lagged one or more periods.

9. The regression equation is
Fuel = 113 - 8.63 Price - 0.137 Pop

Predictor Coef SE Coef T P


Constant 113.01 16.67 6.78 0.000
Price -8.630 2.798 -3.08 0.009
Pop -0.13684 0.08054 -1.70 0.113

S = 2.29032 R-Sq = 76.6% R-Sq(adj) = 73.0%

Analysis of Variance

Source DF SS MS F P
Regression 2 223.39 111.69 21.29 0.000
Residual Error 13 68.19 5.25
Total 15 291.58

Durbin-Watson statistic = 0.612590

The null and alternative hypotheses are:

H0: ρ = 0    H1: ρ > 0

Using the .05 significance level for a sample size of 16 with 2 predictor variables,
dL = .98. Since DW = .61 < .98, reject H0 and conclude the observations are positively
serially correlated.

10. The regression equation is


Visitors = 309899 + 24431 Time - 193331 Price + 217138 Celeb.

Predictor Coef SE Coef T P


Constant 309899 59496 5.21 0.000
Time 24431 7240 3.37 0.007
Price -193331 97706 -1.98 0.076
Celeb. 217138 47412 4.58 0.001

S = 70006.1 R-Sq = 81.8% R-Sq(adj) = 76.4%

Analysis of Variance

Source DF SS MS F P
Regression 3 2.20854E+11 73617995859 15.02 0.000
Residual Error 10 49008480079 4900848008
Total 13 2.69862E+11

Durbin-Watson statistic = 1.14430

With n = 14, k =3 and α = .05, DW = 1.14 gives an indeterminate test for serial
correlation.
11. Serial correlation is not a problem. However, it is interesting to see whether the students
realize that collinearity is a likely problem since Customer and Charge are highly correlated.
Correlation matrix:

Revenue Use Charge


Use 0.187
Charge 0.989 0.109
Customer 0.918 0.426 0.891

The regression equation is


Revenue = - 65.6 + 0.00173 Use + 29.5 Charge + 0.000197 Customer

Predictor Coef SE Coef T P VIF


Constant -65.63 14.83 -4.43 0.000
Use 0.001730 0.001483 1.17 0.255 2.151
Charge 29.496 2.406 12.26 0.000 8.515
Customer 0.0001968 0.0001367 1.44 0.163 10.280

S = 6.90038 R-Sq = 98.5% R-Sq(adj) = 98.4%

Analysis of Variance

Source DF SS MS F P
Regression 3 77037 25679 539.30 0.000
Residual Error 24 1143 48
Total 27 78180

Durbin-Watson statistic = 2.20656 (Cannot reject H0 at any reasonable
significance level)

Deleting Customer from the regression function gives:

The regression equation is


Revenue = - 57.6 + 0.00328 Use + 32.7 Charge

Predictor Coef SE Coef T P VIF


Constant -57.60 14.03 -4.11 0.000
Use 0.003284 0.001039 3.16 0.004 1.012
Charge 32.7488 0.8472 38.66 0.000 1.012

S = 7.04695 R-Sq = 98.4% R-Sq(adj) = 98.3%

Analysis of Variance

Source DF SS MS F P
Regression 2 76938 38469 774.66 0.000
Residual Error 25 1241 50
Total 27 78180

Durbin-Watson statistic = 1.82064 (Cannot reject at any reasonable significance level)
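
The VIF of 1.012 reported after deleting Customer can be reproduced from the correlation matrix. With exactly two predictors, the R-squared from regressing one predictor on the other is the squared correlation between them, so VIF = 1/(1 - r²). A minimal sketch (the function name is hypothetical):

```python
def vif_two_predictors(r):
    """VIF when a model has exactly two predictors whose sample
    correlation is r: 1 / (1 - r**2)."""
    return 1.0 / (1.0 - r ** 2)


# Correlation of Use and Charge from the matrix above is 0.109.
print(round(vif_two_predictors(0.109), 3))
```

This shortcut does not apply to the three-predictor model, where each VIF depends on the multiple R-squared of one predictor on the other two.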

12. a. Correlations: Share, Earnings, Dividend, Payout

Share Earnings Dividend


Earnings 0.565
Dividend 0.719 0.712
Payout 0.435 -0.049 0.662

The best model, after taking account of the initial multicollinearity, uses the predictor
variables Earnings and Payout (ratio).

The regression equation is


Share = 4749 + 6651 Earnings + 171 Payout

Predictor Coef SE Coef T P VIF


Constant 4749 5844 0.81 0.424
Earnings 6651 1546 4.30 0.000 1.002
Payout 171.40 50.49 3.39 0.002 1.002

S = 3922.16 R-Sq = 53.4% R-Sq(adj) = 49.7%

Analysis of Variance

Source DF SS MS F P
Regression 2 440912859 220456429 14.33 0.000
Residual Error 25 384584454 15383378
Total 27 825497313

Durbin-Watson statistic = 0.293387

b. With n = 28, k = 2 and α = .01, DW = .29 < dL = 1.04 so there is strong evidence of
positive serial correlation.

c. An autoregressive model with lagged Shareholders as a predictor might be a viable
option. A regression using generalized differences is another possibility.
13. a.

b. No. The residual autocorrelation function for the residuals from the straight line fit
indicates significant positive autocorrelation. The independent errors assumption
is not viable.

c. The fitted line plot with the natural logarithms of Passengers as the dependent variable

and the residual autocorrelation function follow.

The residual autocorrelation function looks a little better than that in part b,
but there is still significant positive autocorrelation at lag 1.

d. Exponential trend plot for Passengers follows along with residual autocorrelation
function.

Still some residual autocorrelation. Errors are not independent.

e. Models in parts c and d are equivalent. If you take the natural logarithms of
the fitted exponential growth model you get the fitted model in part c.
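
The equivalence in part e can be illustrated numerically. The sketch below assumes the exponential trend is estimated by least squares on the logged series, which is what the equivalence implies; the data are hypothetical (not the Passengers series) and the helper name is illustrative.

```python
import math


def fit_line(x, y):
    """Ordinary least squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1


# Hypothetical series, used only to show the algebra.
t = [1, 2, 3, 4, 5, 6]
y = [10.2, 11.9, 14.8, 17.1, 21.0, 24.7]

# Part c form: ln(Y) = b0 + b1*t
b0, b1 = fit_line(t, [math.log(v) for v in y])

# Part d form: Y = a * g**t, with a = exp(b0) and g = exp(b1)
a, g = math.exp(b0), math.exp(b1)

for ti in t:
    log_fit = b0 + b1 * ti
    exp_fit = a * g ** ti
    # Taking logs of the exponential fit recovers the log-linear fit.
    assert abs(math.log(exp_fit) - log_fit) < 1e-9
```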

f. As we have pointed out, the errors for either of the models in parts c and d are
not independent. Using a model that assumes the errors are independent can
lead to inaccurate forecasts and, in this case, unwarranted precision.

g. Using the exponential growth model with t = 26, gives


14. a. The best model lags permits by 2 quarters (Lg2Permits):

Sales = 20.2 + 9.23 Lg2Permits

Predictor Coef SE Coef T P


Constant 20.24 27.06 0.75 0.467
Lg2Permits 9.2328 0.8111 11.38 0.000

S = 66.2883 R-Sq = 90.2% R-Sq(adj) = 89.6%

b. DW = 1.47. No evidence of autocorrelation.

c. The regression equation is


Sales = 16.6 + 8.80 Lg2Permits + 30.0 Season

Predictor Coef SE Coef T P


Constant 16.61 27.99 0.59 0.563
Lg2Permits 8.801 1.020 8.63 0.000
Season 30.02 41.67 0.72 0.484

S = 67.4576 R-Sq = 90.6% R-Sq(adj) = 89.2%

d. No. For Season: t = .72, p value = .484.

e. No. DW = 1.44. No evidence of autocorrelation.

f. 2007 1st quarter forecast 177


2nd quarter forecast 113

Forecasts for the 3rd and 4th quarters can be done using several different
approaches. This is best left to the student with a discussion of why they
used a particular method. One method is to average the past values
of Permits for the 1st and 2nd quarters and use these averages in the model.
This will result in forecasts: 3rd quarter 514; 4th quarter 235.

15.
Quarter Sales S2 S3 S4

1 16.3 0 0 0
2 17.7 1 0 0
3 28.1 0 1 0
4 34.3 0 0 1

The regression equation is

Sales = 19.3 - 1.43 S2 + 11.2 S3 + 33.3 S4

Predictor Coef SE Coef T P


Constant 19.292 2.074 9.30 0.000
S2 -1.425 2.933 -0.49 0.630
S3 11.163 2.999 3.72 0.001
S4 33.254 2.999 11.09 0.000

S = 7.18396 R-Sq = 80.1% R-Sq(adj) = 78.7%

Analysis of Variance

Source DF SS MS F P
Regression 3 8726.5 2908.8 56.36 0.000
Residual Error 42 2167.6 51.6
Total 45 10894.1

Durbin-Watson statistic = 1.544

1996(3rd Qt) = 19.3 - 1.43(0) + 11.2(1) + 33.3(0) = 30.5

1996(4th Qt) = 19.3 - 1.43(0) + 11.2(0) + 33.3(1) = 52.6

The regression is significant. The model explains 80.1% of the variation in Sales.
There is no lag 1 autocorrelation but a significant residual autocorrelation at lag 4.
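
The two forecasts follow directly from the dummy variable coding: each quarter's forecast is the intercept plus that quarter's coefficient, with quarter 1 as the base. A minimal sketch using the rounded coefficients from the fitted equation (the names are illustrative):

```python
# Rounded coefficients from the fitted equation above.
INTERCEPT = 19.3
SEASONAL = {1: 0.0, 2: -1.43, 3: 11.2, 4: 33.3}  # quarter 1 is the base


def quarterly_forecast(quarter):
    """Forecast Sales for a quarter: intercept plus that quarter's dummy
    coefficient (all other dummies are 0)."""
    return INTERCEPT + SEASONAL[quarter]


for q in (3, 4):
    print(q, round(quarterly_forecast(q), 1))
```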

16. a. & b. The regression equation is


Dickson = - 6.40 + 2.84 Industry

Predictor Coef SE Coef T P


Constant -6.4011 0.8435 -7.59 0.000
Industry 2.83585 0.02284 124.14 0.000

S = 0.319059 R-Sq = 99.9% R-Sq(adj) = 99.9%

Durbin-Watson statistic = 0.8237 → Consistent with positive autocorrelation.


See also plot of residuals versus time and residual autocorrelation function.

c. Using the estimate of ρ, calculate the generalized differences Y'_t = Y_t - ρ̂Y_{t-1} and
X'_t = X_t - ρ̂X_{t-1} (Y = Dickson, X = Industry), and fit the model given in equation (8.5).
The fitted model has Durbin-Watson statistic = 1.74. In this case, the
estimate of the slope coefficient is nearly the same as the estimate of the slope coefficient in part a.
Here the autocorrelation in the data is not strong enough to have much effect
on the least squares estimate of the slope coefficient.

d. The standard error of the slope estimate is smaller in the initial regression than it is in the
regression involving generalized differences. The standard error in the initial
regression is underestimated because of the positive serial correlation. The
standard error in the regression with generalized differences, although larger,
is the one to be trusted.
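
The generalized differences used in part c are simple to compute once an estimate of ρ is available. The sketch below uses hypothetical numbers (not the Dickson or Industry data) and a hypothetical function name; it only shows the mechanics of the transformation.

```python
def generalized_differences(series, rho):
    """Return series[t] - rho * series[t-1] for t = 1, ..., n-1.
    The first observation is lost, as with simple differences."""
    return [series[t] - rho * series[t - 1] for t in range(1, len(series))]


# Hypothetical values, used only to show the mechanics.
y = [10.0, 12.0, 15.0, 19.0]
print(generalized_differences(y, 0.5))   # generalized differences
print(generalized_differences(y, 1.0))   # rho = 1 gives simple differences
```

Both the response and the predictor are transformed with the same value of rho before the regression is rerun.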

17. The regression equation is


DiffSales = 149 + 9.16 DiffIncome

20 cases used, 1 cases contain missing values

Predictor Coef SE Coef T P


Constant 148.92 97.70 1.52 0.145
DiffIncome 9.155 2.034 4.50 0.000

S = 239.721 R-Sq = 53.0% R-Sq(adj) = 50.3%

Analysis of Variance

Source DF SS MS F P
Regression 1 1164598 1164598 20.27 0.000
Residual Error 18 1034389 57466
Total 19 2198987

Durbin-Watson statistic = 1.1237

Here DiffSales = Sales_t - Sales_{t-1} and DiffIncome = Income_t - Income_{t-1}. The results
involving simple differences are close to the results obtained by the method of
generalized differences in Example 8.5. The estimated slope coefficient is 9.16
versus an estimated slope coefficient of 9.26 obtained with generalized differences. The intercept
coefficient 149 is also somewhat consistent with the intercept coefficient 54483(1 - .997)
= 163 for the generalized differences procedure. We would expect the two methods to
produce similar results since the estimate of ρ, .997, is nearly 1.
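
The intercept comparison rests on the generalized differences model having intercept β0(1 - ρ). A minimal check, assuming the quoted figures 54483 and .997 are the estimates of β0 and ρ:

```python
# Assumed reading of the quoted figures: beta0 estimate and rho estimate.
beta0_hat = 54483
rho_hat = 0.997

# Intercept implied for the generalized differences regression.
intercept_generalized = beta0_hat * (1 - rho_hat)

# Compare with the intercept of 149 from the simple differences regression.
print(round(intercept_generalized))
```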

18. a. The regression equation is


Savings = 4.98 + 0.0577 Income

Predictor Coef SE Coef T P


Constant 4.978 5.149 0.97 0.346
Income 0.05767 0.02804 2.06 0.054

S = 10.0803 R-Sq = 19.0% ← (2) 19% of variation in Savings explained by Income

Analysis of Variance

Source DF SS MS F P
Regression 1 430.0 430.0 4.23 0.054 ← (1) Regression is not significant at .01 level
Residual Error 18 1829.0 101.6
Total 19 2259.0

Durbin-Watson statistic = 0.4135 ← With α = .05, dL = 1.20 so positive autocorrelation is indicated.

Can improve model by allowing for autocorrelated observations (errors).

b. The regression equation is


Savings = - 3.14 + 0.0763 Income + 20.2 War Year

Predictor Coef SE Coef T P
Constant -3.141 2.504 -1.25 0.227
Income 0.07632 0.01279 5.97 0.000
War Year 20.165 2.375 8.49 0.000 ← (1) Given Income, War Year makes a significant contribution at the .01 level.
S = 4.53134 R-Sq = 84.5% R-Sq(adj) = 82.7%

Analysis of Variance

Source DF SS MS F P
Regression 2 1909.94 954.97 46.51 0.000
Residual Error 17 349.06 20.53
Total 19 2259.00

Durbin-Watson statistic = 2.010 ← (2) No significant autocorrelation of any kind is indicated.

Using all the usual criteria for judging the adequacy of a regression model, this model
is much better than the simple linear regression model in part a.
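
The contribution of War Year can also be checked with a partial F statistic built from the two Analysis of Variance tables. This sketch only reuses the printed sums of squares; for a single added predictor, the square root of the partial F should agree with the absolute t value for that predictor.

```python
import math

sse_reduced = 1829.0   # Residual Error SS, part a (Income only)
sse_full = 349.06      # Residual Error SS, part b (Income and War Year)
mse_full = 20.53       # Residual Error MS, part b

# Partial F statistic for adding one predictor (War Year) to the model.
partial_f = (sse_reduced - sse_full) / 1 / mse_full

# Compare sqrt(partial F) with the t value of 8.49 for War Year.
print(round(partial_f, 2), round(math.sqrt(partial_f), 2))
```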

19. a.

The data are clearly seasonal with fourth quarter sales large and sales for the
remaining quarters relatively small. Seasonality is confirmed by the
autocorrelation function with significant autocorrelation at the seasonal
lag 4.

b. From the autocorrelation function observations 4 periods apart are highly


positively correlated. Therefore an autoregressive model with sales lagged 4
time periods as the predictor variable might be appropriate.

c. The regression equation is


Sales = 421 + 0.853 Lg4Sales

24 cases used, 4 cases contain missing values

Predictor Coef SE Coef T P


Constant 421.4 230.0 1.83 0.081
Lg4Sales 0.85273 0.09286 9.18 0.000

S = 237.782 R-Sq = 79.3% R-Sq(adj) = 78.4%

Analysis of Variance

Source DF SS MS F P
Regression 1 4767638 4767638 84.32 0.000
Residual Error 22 1243890 56540
Total 23 6011528

Significant lag 1 residual autocorrelation.

d. May 31 (2003) Y = 421.4 + .85273(2118) = 2227.5 compared to 2150


Aug 31 (2003) Y = 421.4 + .85273(2221) = 2315.3 compared to 2350
Nov 30 (2003) Y = 421.4 + .85273(2422) = 2486.7 compared to 2600
Feb 28 (2004) Y = 421.4 + .85273(3239) = 3183.4 compared to 3400

Forecasts are not bad but they are below the Value Line estimates for the
last 3 quarters and the gap widens each quarter.
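
The four forecasts in part d and their gaps from the Value Line estimates can be reproduced from the fitted equation. A minimal sketch using only numbers quoted above (the variable names are illustrative):

```python
B0, B1 = 421.4, 0.85273  # coefficients from part c

# (quarter ending, sales four quarters earlier, Value Line estimate)
quarters = [
    ("May 31 2003", 2118, 2150),
    ("Aug 31 2003", 2221, 2350),
    ("Nov 30 2003", 2422, 2600),
    ("Feb 28 2004", 3239, 3400),
]

forecasts = [round(B0 + B1 * lag4, 1) for _, lag4, _ in quarters]
# Gap = Value Line estimate minus model forecast.
gaps = [round(est - f, 1) for (_, _, est), f in zip(quarters, forecasts)]

for (label, _, est), f, gap in zip(quarters, forecasts, gaps):
    print(label, f, est, gap)
```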

e. Value Line estimates for the last 3 quarters of 2003-04 seem increasingly optimistic.

f. Model in part c can be improved by allowing for significant lag 1 residual


autocorrelation. One approach is to include sales lagged 1 quarter as an additional
predictor variable.

20. a. Correlations: ChickConsum, Income, ChickPrice, PorkPrice, BeefPrice


ChickConsum Income ChickPrice PorkPrice
Income 0.922
ChickPrice 0.794 0.932
PorkPrice 0.871 0.957 0.970
BeefPrice 0.913 0.986 0.928 0.941

Correlations: LnChickC, LnIncome, LnChickP, LnPorkP, LnBeefP


LnChickC LnIncome LnChickP LnPorkP
LnIncome 0.952
LnChickP 0.761 0.907
LnPorkP 0.890 0.972 0.947
LnBeefP 0.912 0.979 0.933 0.954

The correlations are similar for both the original and natural log transformed data.
Correlations among the potential predictor variables are large implying a
multicollinearity problem. Chicken consumption is most highly correlated with
Income and BeefPrice for both the original and log transformed data. Must be
careful interpreting correlations with time series data since autocorrelation in the
individual series can result in apparent linear association.

b. Stepwise Regression: ChickConsum versus Income, ChickPrice, ...

Alpha-to-Enter: 0.05 Alpha-to-Remove: 0.05

Response is ChickConsum on 4 predictors, with N = 23

Step 1 2
Constant 28.86 37.72

Income 0.00970 0.01454


T-Value 10.90 6.54
P-Value 0.000 0.000

ChickPrice -0.29
T-Value -2.34
P-Value 0.030

S 2.58 2.34
R-Sq 84.98 88.21
R-Sq(adj) 84.27 87.03

c. There is high multicollinearity among the predictor variables so the final model
depends on which non-significant predictor variable is deleted first. If BeefPrice is
deleted, the final model is the one selected by stepwise regression (using a .05 level
for determining significance of individual terms) with significant lag 1 residual
autocorrelation. If Income is deleted first, then the final model involves the three
Price predictor variables as shown below. There is no significant residual
autocorrelation but large VIFs, although the coefficients of the predictor variables
have the right signs. In this data set, Income is essentially a proxy for the three
price variables.

The regression equation is


ChickConsum = 37.9 - 0.665 ChickPrice + 0.195 PorkPrice + 0.123 BeefPrice

Predictor Coef SE Coef T P VIF


Constant 37.859 3.672 10.31 0.000
ChickPrice -0.6646 0.1702 -3.90 0.001 17.649
PorkPrice 0.19516 0.05874 3.32 0.004 21.109
BeefPrice 0.12291 0.02625 4.68 0.000 9.011

S = 2.11241 R-Sq = 90.9% R-Sq(adj) = 89.4%

Analysis of Variance

Source DF SS MS F P
Regression 3 844.44 281.48 63.08 0.000
Residual Error 19 84.78 4.46
Total 22 929.22

Durbin-Watson statistic = 1.2392

21. Stepwise Regression: LnChickC versus LnIncome, LnChickP, ...

Alpha-to-Enter: 0.05 Alpha-to-Remove: 0.05

Response is LnChickC on 4 predictors, with N = 23

Step 1 2
Constant 1.729 2.375

LnIncome 0.283 0.440


T-Value 14.32 15.40
P-Value 0.000 0.000

LnChickP -0.445
T-Value -6.06
P-Value 0.000

S 0.0528 0.0321
R-Sq 90.71 96.72
R-Sq(adj) 90.27 96.40

Final model is the one selected by stepwise regression. There is no significant


residual autocorrelation.

The regression equation is


LnChickC = 2.37 + 0.440 LnIncome - 0.445 LnChickP

Predictor Coef SE Coef T P VIF


Constant 2.3748 0.1344 17.67 0.000
LnIncome 0.43992 0.02857 15.40 0.000 5.649
LnChickP -0.44491 0.07342 -6.06 0.000 5.649

S = 0.0321380 R-Sq = 96.7% R-Sq(adj) = 96.4%

Analysis of Variance

Source DF SS MS F P
Regression 2 0.61001 0.30500 295.30 0.000
Residual Error 20 0.02066 0.00103
Total 22 0.63067

Durbin-Watson statistic = 1.7766

The coefficient of .44 on LnIncome implies as Income increases 1% chicken


consumption increases by .44%, chicken price held constant. Similarly, the
coefficient of –.44 on LnChickP implies as chicken price increases by 1%
chicken consumption decreases by .44%, income held constant.

To obtain a forecast of chicken consumption for the following year, forecasts


of income and chicken price for the following year would be required. After taking
logarithms, these values would be used in the final regression equation to get a
forecast of LnChickC. A forecast of chicken consumption is then generated by taking
the antilog.
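
The forecasting steps and the elasticity interpretation can be sketched together. The income and price inputs below are hypothetical placeholders (the actual data are not reproduced here), and the function name is illustrative; only the coefficients come from the final regression equation.

```python
import math

# Coefficients from the final regression equation above.
B0, B_INCOME, B_PRICE = 2.3748, 0.43992, -0.44491


def forecast_consumption(income, chick_price):
    """Forecast chicken consumption: compute the forecast of LnChickC,
    then take the antilog."""
    ln_forecast = B0 + B_INCOME * math.log(income) + B_PRICE * math.log(chick_price)
    return math.exp(ln_forecast)


# Hypothetical next-year inputs; the units must match the original data.
base = forecast_consumption(2500.0, 50.0)
bumped = forecast_consumption(2500.0 * 1.01, 50.0)  # income up 1%

# Percent change in forecast consumption for a 1% rise in income.
print(round(100 * (bumped / base - 1), 2))
```

The percent change does not depend on the particular income and price values chosen, which is what makes the coefficient interpretable as an elasticity.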

22. The regression equation is


DiffChickC = 1.10 + 0.00075 DiffIncome - 0.145 DiffChickP

22 cases used, 1 cases contain missing values

Predictor Coef SE Coef T P VIF


Constant 1.0967 0.4158 2.64 0.016
DiffIncome 0.000746 0.003477 0.21 0.832 1.029
DiffChickP -0.14473 0.06218 -2.33 0.031 1.029

S = 1.21468 R-Sq = 22.3% R-Sq(adj) = 14.1%

Analysis of Variance

Source DF SS MS F P
Regression 2 8.039 4.020 2.72 0.091
Residual Error 19 28.033 1.475
Total 21 36.073

Durbin-Watson statistic = 1.642

Very little explanatory power in the predictor variables. If the non-significant DiffIncome
is dropped from the model, the resulting regression is significant at the .05 level, R2 is
virtually unchanged and the standard error of the estimate decreases slightly. The residual
plots look good and there is no evidence of autocorrelation. With the very low R2, the fitted
function is not useful for forecasting the change (difference) in chicken consumption.

23. The regression equation is


ChickConsum = 1.94 + 0.975 LagChickC

22 cases used, 1 cases contain missing values

Predictor Coef SE Coef T P


Constant 1.945 1.823 1.07 0.299
LagChickC 0.97493 0.04687 20.80 0.000

S = 1.33349 R-Sq = 95.6% R-Sq(adj) = 95.4%

Analysis of Variance

Source DF SS MS F P
Regression 1 769.45 769.45 432.71 0.000
Residual Error 20 35.56 1.78
Total 21 805.01

Fitted regression function implies this year’s chicken consumption is likely to be
a very good predictor of next year’s chicken consumption. The coefficient on
lagged chicken consumption (LagChickC) is almost 1. The intercept is not significant.
Chicken consumption is essentially a “random walk”—next year’s chicken consumption is
this year’s chicken consumption plus a random amount with mean 0. The residual
plots look good and there is no residual autocorrelation.

We cannot infer the effect of a change in chicken price on chicken consumption with
this model since chicken price does not appear as a predictor variable.

24.

Here the independent error has mean 0 and variance 3σ². So the first differences for
both X and Y are stationary and X and Y are cointegrated of order 1. The cointegrating
linear combination is: .

CASE 8-2: BUSINESS ACTIVITY INDEX FOR SPOKANE COUNTY

1. Why did Young choose to solve the autocorrelation problem first?


Answer: Autocorrelation must be addressed first to create data (or a model) consistent
with the usual regression assumptions.

2. Would it have been better to eliminate multicollinearity first and then tackle
autocorrelation?
Answer: No. In order to solve the autocorrelation problem, the nature of the data was
changed (first differenced). If multicollinearity were solved first, one or more important
variables may have been eliminated. Autocorrelation must be accounted for first so the
usual regression assumptions apply; then multicollinearity can be tackled.

3. How does the small sample size affect the analysis?
Answer: A sample size of 15 is small for a model that uses three independent
variables (ideally, n should be in the neighborhood of 30 or more). A larger sample
size would almost certainly be helpful.

4. Should the regression done on the first differences have been through the origin?
Answer: Perhaps. An intercept can be included in the regression model and then
checked for significance. Ordinarily, regressions with first differenced data do
not require an intercept term.

5. Is there any potential for the use of lagged data?


Answer: Perhaps, although using lagged dependent and independent variables would
make constructing an index more difficult. Since the first differenced data work well in this
case, there is no real need to consider lagged variables.

6. What conclusions can be drawn from a comparison of the Spokane County business
activity index and the GNP?
Answer: The Spokane business activity seems to be extremely stable. It was not
affected by the national recessions of 1970 and 1974. The large peak in 1974 was
caused by Expo 74 (a world fair). It would be inappropriate in this case to expect
the Spokane economy to follow national patterns.

CASE 8-3: RESTAURANT SALES

1. Was Jim’s use of a dummy variable correct?


Answer: Jim’s use of a dummy variable to represent periods when Marquette
was in session or out of session seems very reasonable. A good use of a dummy
variable.

2. Was it correct to use lagged sales as a predictor variable?


Answer: Jim's use of lagged sales as a predictor variable was eminently sensible.
This independent variable is likely to be a good predictor and can account
for autocorrelation. This is a good time to I

3. Do you agree with Jim’s conclusions?


Answer: Yes. Model 6 is the best. However, there may be other predictor variables
that would improve this model; the number of students enrolled at Marquette during
a particular quarter or semester is an example.

4. Would another type of forecasting model be more effective for forecasting weekly sales?
Answer: Possibly! Jim will investigate Box-Jenkins ARIMA models in Chapter 9.

CASE 8-4: MR. TUX

John is correct to be disappointed with the model run with seasonal dummy variables since
the residual autocorrelations have a spike at lag 12. From a forecasting perspective, the
autoregressive model is better. The intercept term allows for a time trend, seasonality is
accounted for by sales lagged 12 months as the predictor variable, R2 is large (91%) and there is
no residual autocorrelation. However, this model does not include predictor variables directly
under John’s control, like price, so he would not be able to determine how a change in price (or
changes in other operational variables) might affect future sales.

CASE 8-5: CONSUMER CREDIT COUNSELING

Nonseasonal model:

The regression equation is


Clients = - 292 + 3.38 Index + 0.370 Bankrupt - 0.0656 Permits

Predictor Coef SE Coef T P


Constant -292.27 41.23 -7.09 0.000
Index 3.3783 0.3404 9.93 0.000
Bankrupt 0.37001 0.09740 3.80 0.000
Permits -0.06559 0.02882 -2.28 0.026

S = 16.6533 R-Sq = 61.0% R-Sq(adj) = 59.5%

Analysis of Variance

Source DF SS MS F P
Regression 3 34630 11543 41.62 0.000
Residual Error 80 22187 277
Total 83 56816

Durbin-Watson statistic = 1.605

The best nonseasonal regression model used the business activity index, number of
bankruptcies filed, and number of building permits to forecast number of clients seen. The
Durbin-Watson test for serial correlation is inconclusive at the .05 level. The residual
autocorrelation function shows some significant autocorrelation around lag 4.

Best seasonal model:

The regression equation is


Clients = - 135 + 2.51 Index - 3.79 S2 + 5.69 S3 - 15.9 S4 - 21.1 S5
- 13.6 S6 - 20.6 S7 - 19.6 S8 - 25.9 S9 - 6.87 S10 - 19.0 S11
- 33.1 S12

Predictor Coef SE Coef T P


Constant -135.08 26.96 -5.01 0.000
Index 2.5099 0.2421 10.37 0.000
S2 -3.793 8.443 -0.45 0.655
S3 5.686 8.469 0.67 0.504
S4 -15.869 8.445 -1.88 0.064
S5 -21.146 8.441 -2.51 0.015
S6 -13.580 8.443 -1.61 0.112
S7 -20.641 8.441 -2.45 0.017
S8 -19.650 8.443 -2.33 0.023
S9 -25.857 8.441 -3.06 0.003
S10 -6.869 8.445 -0.81 0.419
S11 -19.014 8.448 -2.25 0.027
S12 -33.143 8.441 -3.93 0.000

S = 15.7912 R-Sq = 68.8% R-Sq(adj) = 63.6%

Analysis of Variance

Source DF SS MS F P
Regression 12 39111.7 3259.3 13.07 0.000
Residual Error 71 17704.7 249.4
Total 83 56816.3

Durbin-Watson statistic = 1.757

The best seasonal model uses Index and 11 seasonal dummy variables to represent
the months Feb through Dec. We retain all the seasonal dummy variables for forecasting
purposes even though some are non-significant. The Durbin-Watson test is inconclusive at the
.05 level. The residual autocorrelations have a just significant spike at lag 6 but are otherwise
non-significant. Forecasts for the first three months of 1993 follow.

Forecast Actual
Jan 1993 179 151
Feb 1993 175 152
Mar 1993 197 199

Forecasts for Jan and Feb 1993 are high compared to actual numbers of clients but
the forecast for Mar 1993 is very close to the actual number of new clients.
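
The size of these forecast errors can be summarized with a short calculation. The sketch below uses the three forecast and actual values from the table above; defining the error as actual minus forecast and summarizing with MAPE are choices made for this illustration, not something the case specifies.

```python
forecast = {"Jan 1993": 179, "Feb 1993": 175, "Mar 1993": 197}
actual = {"Jan 1993": 151, "Feb 1993": 152, "Mar 1993": 199}

# Error defined here as actual minus forecast (a convention chosen for this sketch).
errors = {m: actual[m] - forecast[m] for m in forecast}

# Mean absolute percentage error over the three months.
mape = 100 * sum(abs(errors[m]) / actual[m] for m in errors) / len(errors)

print(errors, round(mape, 1))
```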

Autoregressive model:

Autoregressive models with number of new clients lagged 1, 4 and 12 months were

tried. None of these models proved to be useful for forecasting. The best model had number of
new clients lagged 1 month. The results are displayed below.

The regression equation is


Client = 61.4 + 0.487 LagClients

95 cases used, 1 cases contain missing values

Predictor Coef SE Coef T P


Constant 61.41 10.91 5.63 0.000
LagClients 0.48678 0.08796 5.53 0.000

S = 24.9311 R-Sq = 24.8% R-Sq(adj) = 24.0%

Analysis of Variance

Source DF SS MS F P
Regression 1 19035 19035 30.62 0.000
Residual Error 93 57805 622
Total 94 76840

There is just significant residual autocorrelation at lag 12 but the remaining


residual autocorrelations are small. The best model of the ones attempted is the final
seasonal model with predictor variables Index and the seasonal dummies.

CASE 8-6: AAA WASHINGTON

1. The results for the best model are shown below (see also solution to Case 7-2). Each of
the independent variables is significantly different from 0 at the .05 level. The signs of
the coefficients are what we would expect them to be.

The regression equation is


Calls = 17060 + 635 Lg11Rate - 112 NewTemp + 7.59 NewTemp**2

Predictor Coef SE Coef T P


Constant 17060.2 847.0 20.14 0.000
Lg11Rate 635.4 146.5 4.34 0.000
NewTemp -112.00 47.70 -2.35 0.023
NewTemp**2 7.592 1.657 4.58 0.000

S = 941.792 R-Sq = 75.0% R-Sq(adj) = 73.5%

Analysis of Variance

Source DF SS MS F P
Regression 3 140771801 46923934 52.90 0.000
Residual Error 53 47009523 886972
Total 56 187781324

Durbin-Watson statistic = 1.6217

2. Serial correlation is not a problem. The value of the Durbin-Watson statistic (1.62)
would not reject the null hypothesis of no serial correlation. There are no
significant residual autocorrelations. Restricting attention to integer powers, 2 is the
best choice for the exponential transformation. Allowing other choices for powers,
e.g. 2.4, may improve the fit a bit but is not as “nice” as an integer power.

3. The memo to Mr. DeCoria should use all the usual inferential and descriptive summaries
to defend the model in part 1. A residual analysis should also be included.

CASE 8-7: ALOMEGA FOOD STORES

1. Julie appears to have a good regression equation with an R-squared of 91%.


Additional significant explanatory variables may be available but there is not much
variation left to explain. However, it is good to have students search for a good
equation using the full range of available variables. Along with the R-squared value,
they should check the t values for the variables in their final equation, and the F value
and the residual autocorrelations. Their results can be used effectively as individual
or team presentations to the class, or as a hand-in writeup or even a small term paper.

2. “Selling” the final regression model to management, including the irascible Jackson
Tilson, ties the statistical exercise in the Alomega case to the real world of business
management. The idea of selling the statistical results to management can be
the focus of team presentations to the class with the instructor playing the role of
Tilson. Working through the presentation of results to the class adds an important
“real world” element to the statistical analysis.

3. As noted in the case, the advertising predictor variables are under the control of
Alomega management. Students can demonstrate the usefulness of this result by
choosing reasonable future values for these advertising variables and generating
forecasts.
However, students must recognize the regression equation does not necessarily
imply a cause and effect relationship between advertising expenditures and sales. In
addition, conditions under which the model was developed may change in the future.

4. All forecasts, including the ones using Julie’s regression equation, assume a future
that is identical to the past except for the identified predictor variables. If her
model is used to generate forecasts for Alomega, she should check the model
accuracy on a regular basis. The errors encountered as the future unfolds should
be compared to those in the data used to generate the model. If significant
changes or trends are observed, the model should be updated to include the most
recent data, along with possibly discarding some of the oldest data. Alternatively,
a different approach to the forecasting problem can be sought if the forecasting errors
suggest that the current regression model is inadequate.

CASE 8-8: SURTIDO COOKIES

1. The positive coefficient on November makes sense because cookie sales are seasonal, with
sales relatively high each year in November, the month before the Christmas holidays.

2. Jame’s model looks good. Almost 94% of the variation in cookie sales is explained
by the model. The residual analysis indicates the usual regression assumptions are
tenable, including the independence assumption.

3. Forecasts:

June 2003 733,122


July 2003 799,823
August 2003 737,002
September 2003 1,562,070
October 2003 1,744,477
November 2003 2,152,463
December 2003 1,932,194

4. The regression equation is


SurtidoSales = 115672 + 0.950 Lg12Sales

29 cases used, 12 cases contain missing values

Predictor Coef SE Coef T P


Constant 115672 91884 1.26 0.219
Lg12Sales 0.94990 0.08732 10.88 0.000

S = 243748 R-Sq = 81.4% R-Sq(adj) = 80.7%

Analysis of Variance

Source DF SS MS F P
Regression 1 7.03141E+12 7.03141E+12 118.35 0.000
Residual Error 27 1.60415E+12 59412957997
Total 28 8.63556E+12

Durbin-Watson statistic = 1.3524

This regression model is very reasonable. About 81% of the variation in cookie
sales is explained with the single predictor variable, sales lagged 12 months
(Lg12Sales). The usual residual plots look good and there is no significant residual
autocorrelation.

Forecasts:

June 2003 717,956


July 2003 632,126
August 2003 681,996
September 2003 1,642,130
October 2003 1,801,762
November 2003 2,113,392
December 2003 1,844,434

5. Both models fit the data well. Apart from July 2003, the forecasts generated by the
models are very close to one another. Dummy variable regression explains more of
the variation in cookie sales but the autoregression is simpler. Could make a case for
either model.
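
The comparison of the two sets of forecasts can be made concrete by computing the percentage gap month by month. The sketch below uses the forecasts listed in parts 3 and 4; measuring the gap relative to the dummy variable forecast is a choice made for this illustration.

```python
months = ["Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
dummy_model = [733122, 799823, 737002, 1562070, 1744477, 2152463, 1932194]
autoreg_model = [717956, 632126, 681996, 1642130, 1801762, 2113392, 1844434]

# Percent gap between the two forecasts, relative to the dummy variable forecast.
pct_gap = {m: 100 * abs(d - a) / d
           for m, d, a in zip(months, dummy_model, autoreg_model)}

# Month with the largest disagreement between the two models.
largest = max(pct_gap, key=pct_gap.get)
print(largest, round(pct_gap[largest], 1))
```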

CASE 8-9: SOUTHWEST MEDICAL CENTER

1. The regression results along with residual plots and the residual autocorrelation
function follow.

The regression equation is


Total Visits = 997 + 3.98 Time - 81.4 Sep + 5.3 Oct - 118 Nov - 149 Dec
- 24.2 Jan - 116 Feb + 23.8 Mar + 18.2 Apr - 30.5 May - 39.4 Jun
+ 35.2 Jul

Predictor Coef SE Coef T P


Constant 996.97 58.42 17.06 0.000
Time 3.9820 0.4444 8.96 0.000
Sep -81.38 71.69 -1.14 0.259
Oct 5.34 71.67 0.07 0.941
Nov -118.34 71.66 -1.65 0.102
Dec -148.62 71.66 -2.07 0.041
Jan -24.21 71.65 -0.34 0.736
Feb -116.39 71.65 -1.62 0.107
Mar 23.80 73.55 0.32 0.747
Apr 18.15 73.53 0.25 0.806
May -30.50 73.53 -0.41 0.679
Jun -39.37 73.52 -0.54 0.593
Jul 35.20 73.51 0.48 0.633

S = 155.945 R-Sq = 48.9% R-Sq(adj) = 42.9%

Analysis of Variance

Source DF SS MS F P
Regression 12 2353707 196142 8.07 0.000
Residual Error 101 2456198 24319
Total 113 4809905

Durbin-Watson statistic = 0.4339

Mary has a right to be disappointed. This regression model does not fit well. Even
allowing for seasonality, only the Dec seasonal dummy variable is significant at the

.05 level. The residual plots clearly show a poor fit in the middle of the series and
there is a considerable amount of significant residual autocorrelation.
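
A rough sense of how strong the residual autocorrelation is can be read off the Durbin-Watson statistic through the standard approximation DW ≈ 2(1 - r1), where r1 is the lag 1 residual autocorrelation. A minimal sketch (the function name is hypothetical, and the result is an approximation rather than the autocorrelation computed from the residuals):

```python
def approx_lag1_autocorrelation(dw):
    """Approximate lag 1 residual autocorrelation implied by a
    Durbin-Watson statistic, using DW ≈ 2(1 - r1)."""
    return 1 - dw / 2


# DW = 0.4339 is the value reported for the seasonal dummy regression.
print(round(approx_lag1_autocorrelation(0.4339), 2))
```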

2. Mary might try an autoregression with different choices of lags of total visits
as predictor variable(s). She might try to fit a Box-Jenkins ARIMA model to
be discussed in Chapter 9. Regardless, finding an adequate model for this
time series will be challenging.
