Manga Cap 3 PDF



Predicting with Many Variables

Thanks for bringing the data.
Don't mention it. It's nice of you to help your friend.


...I have my

Oh, that's
Over here!


I'm late!
My class
ended late.
I ran.
It's okay.
We just got
here too.
Oh! Who's your friend?


Oh, yeah. This is Kazami. He's in my history class.
Ah! Nice to meet you.

We're going to cover multiple regression analysis today, and Kazami brought us some data to analyze.
Thanks for helping me.
No problem. This is going to help me, too.

You like croissants, don't you, Miu?
Of course! They're delicious.
Which bakery is your favorite?
Definitely Kazami bakery. Theirs are the best!
Kazami... oh!

Predicting with Many Variables  109

Then you must be...?
The heir to the Kazami bakery.
Risa, don't be dramatic.

It's just a small family business.
There are only ten right now, and most of them are here in the city. We're planning to open a new one soon.
[Map of the Kazami Bakery shops around town, with the planned Isebashi location]
But I see bakeries all over town.
So today... we're going to predict the sales of the new shop using multiple regression analysis.
Wow!

110  Chapter 3  Multiple Regression Analysis

According to my notes, multiple regression analysis uses more than one factor to predict an outcome.
Right. In simple regression analysis, we used one variable to predict the value of another. In multiple regression analysis, we use more than one variable to predict the value of our outcome variable.

Multiple regression equation:

$$y = a_1x_1 + a_2x_2 + \cdots + a_px_p + b$$

Here $y$ is the outcome variable, $x_1, \ldots, x_p$ are the predictor variables, and $a_1, \ldots, a_p$ are the partial regression coefficients.

Every x variable has its own a, but there's still just one intercept.
Yep. Just one b.
And just one outcome variable, y.
I get it! Like this, see?

[Diagram: regression analysis links one predictor variable to the outcome variable; multiple regression analysis links predictor variable 1, predictor variable 2, ..., predictor variable p to the outcome variable]

Predicting with Many Variables  111

The Multiple Regression Equation

Are the steps the same as in simple regression analysis?
They're similar, but not exactly the same.

Multiple Regression Analysis Procedure

Step 1  Draw a scatter plot of each predictor variable and the outcome variable to see if they appear to be related.
Step 2  Calculate the multiple regression equation.
Step 3  Examine the accuracy of the multiple regression equation.
Step 4  Conduct the analysis of variance (ANOVA) test.
Step 5  Calculate confidence intervals for the population.
Step 6  Make a prediction!

We have to look at each predictor alone and all of them together.
I'll write that down.

112  Chapter 3  Multiple Regression Analysis

Step 1: Draw a scatter plot of each predictor variable and the
outcome variable to see if they appear to be related.

Have you gotten results yet? Can I see?
Not yet. Be patient.
First, let's look at the data from the shops already in business.

Bakery                    Floor space of the   Distance to the nearest   Monthly sales
                          shop (tsubo*)        station (meters)          (¥10,000)
Yumenooka Shop            10                   80                        469
Terai Station Shop        8                    0                         366
Sone Shop                 8                    200                       371
Hashimoto Station Shop    5                    200                       208
Kikyou Town Shop          7                    300                       246
Post Office Shop          8                    230                       297
Suidobashi Station Shop   7                    40                        363
Rokujo Station Shop       9                    0                         436
Wakaba Riverside Shop     6                    330                       198
Misato Shop               9                    180                       364
* 1 tsubo is about 36 square feet.

The outcome variable is monthly sales, and the other two are predictor variables.
Yep! Now draw a scatter plot for each predictor variable.

The Multiple Regression Equation  113





[Scatter plot: floor space (x) vs. monthly sales (y) for the ten shops. Correlation coefficient = .8924]
[Scatter plot: distance to the nearest station (x) vs. monthly sales (y) for the ten shops. Correlation coefficient = -.7751]

Hmmm... looks like monthly sales are higher in bigger shops and in shops near a train station.
Totally. Both of these affect monthly sales.
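The two correlation coefficients quoted with the scatter plots can be checked directly from the table. The following is a minimal sketch using only the Python standard library; the variable and function names are illustrative choices, not part of the book.

```python
from math import sqrt

# Kazami Bakery data, copied from the table in Step 1.
floor_space = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]               # tsubo
distance = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]     # meters
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]  # x 10,000 yen


def pearson_r(xs, ys):
    """Correlation coefficient: Sxy / sqrt(Sxx * Syy)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r_floor = pearson_r(floor_space, sales)
r_distance = pearson_r(distance, sales)
print(f"floor space vs. sales: {r_floor:.4f}")
print(f"distance vs. sales:    {r_distance:.4f}")
```

A positive value for floor space and a negative value for distance match what the plots suggest: bigger shops and shops closer to a station tend to sell more.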

114  Chapter 3  Multiple Regression Analysis

Step 2: Calculate the multiple regression equation.

A lot of the coefficients we calculated during simple regression analysis are also involved in multiple regression, but the calculation is a bit more complicated. Do you remember the method we used?
Linear least squares...
That's right! Let's review.
First get the residual sum of squares, Se. Then differentiate by a1, a2, and b and set the equation equal to zero. Find the values of a1, a2, and b that make Se as small as possible... and presto! Piece of cake.
I don't think I like this cake.

The Multiple Regression Equation  115

Here we go...
Come on! Sharpen your pencils and get ready!
Aha! This is where we use the matrices...
We can find the partial regression coefficients by doing this calculation!
[Matrix calculation built from the data columns: floor space, distance to the nearest station, a column of 1's, and monthly sales]
The 1's are a baseline multiplier used to calculate the intercept (b).
This big thing is equal to the partial regression coefficient!
What the heck is...
It's no use! There are too many numbers! This calculation will take weeks! Nay, months!!
Here, I... Please, the computer is our only hope.



The Multiple Regression Equation  117

I'll do it for you...*

Predictor variable                          Partial regression coefficient
Floor space of the shop (tsubo)             a1 = 41.5
Distance to the nearest station (meters)    a2 = –0.3
Intercept                                   b = 65.3

Hooray!!

* See page 209 for the full calculation.
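The least-squares review above (minimize Se by setting its derivatives to zero) can be sketched in code for the two-predictor case. This is a minimal sketch, not the book's matrix calculation: it assumes the equivalent "normal equations" written with sums of products of deviations, and the helper names `cross_sum` and `fit_two_predictors` are made up for illustration.

```python
# Kazami Bakery data, copied from the table in Step 1.
floor_space = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]               # x1, tsubo
distance = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]     # x2, meters
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]  # y, x 10,000 yen


def cross_sum(us, vs):
    """Sum of products of deviations from the means (S_uv)."""
    mean_u = sum(us) / len(us)
    mean_v = sum(vs) / len(vs)
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))


def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = a1*x1 + a2*x2 + b.

    Solves the 2x2 system
        S11*a1 + S12*a2 = S1y
        S12*a1 + S22*a2 = S2y
    by Cramer's rule, then recovers b from the averages.
    """
    s11, s22, s12 = cross_sum(x1, x1), cross_sum(x2, x2), cross_sum(x1, x2)
    s1y, s2y = cross_sum(x1, y), cross_sum(x2, y)
    det = s11 * s22 - s12 * s12
    a1 = (s22 * s1y - s12 * s2y) / det
    a2 = (s11 * s2y - s12 * s1y) / det
    n = len(y)
    b = sum(y) / n - a1 * sum(x1) / n - a2 * sum(x2) / n
    return a1, a2, b


a1, a2, b = fit_two_predictors(floor_space, distance, sales)
print(f"a1 = {a1:.1f}, a2 = {a2:.1f}, b = {b:.1f}")
```

Rounded to one decimal place, the fitted values should agree with the coefficients in the table above.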

So here's the equation.

$$y = 41.5x_1 - 0.3x_2 + 65.3$$

(y is monthly sales, $x_1$ is floor space, and $x_2$ is distance to the nearest station.)

You are a genius.
You should write this...
One more... that will help you understand the multiple regression equation and multiple regression analysis.
...is it?

118  Chapter 3  Multiple Regression Analysis

The line plotted by the multiple regression equation $y = a_1x_1 + a_2x_2 + \cdots + a_px_p + b$ will always pass through the point $(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p, \bar{y})$, where $\bar{x}_i$ is the average of $x_i$.
This seems familiar... think, think. Where have I seen...
To put it differently, our equation $y = 41.5x_1 - 0.3x_2 + 65.3$ will always create a line that passes through the point where average floor space and average distance to the nearest station meet the average sales of the data that we used.
Oh yeah! When we plot our equation, the line passes through the averages.
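A short derivation shows why the fitted equation must pass through the averages. It uses the same step as the least-squares review (differentiate Se by b and set the result to zero), written here for two predictors; the index i over the shops is notation added for this sketch, not the book's.

$$S_e = \sum_i \left(y_i - a_1x_{1i} - a_2x_{2i} - b\right)^2$$

$$\frac{\partial S_e}{\partial b} = -2\sum_i \left(y_i - a_1x_{1i} - a_2x_{2i} - b\right) = 0 \quad\Longrightarrow\quad \bar{y} = a_1\bar{x}_1 + a_2\bar{x}_2 + b$$

For the bakery data the averages are $\bar{x}_1 = 7.7$, $\bar{x}_2 = 156$, and $\bar{y} = 331.8$. Plugging the unrounded coefficients into the right-hand side gives 331.8; the rounded coefficients 41.5, –0.3, and 65.3 give a slightly different number only because of rounding.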

Step 3: Examine the accuracy of the multiple regression equation.

So now we have an equation, but how well can we really predict the sales of the new shop?
We'll find out using regression diagnostics. We'll need to find R2, and if it's close to 1, then our equation is pretty accurate!
Good memory!

The Multiple Regression Equation  119

before we find R2, we need to find plain old
R, which in this case is called the multiple
correlation coefficient. Remember: R is a way
of comparing the actual measured values (y)
with our estimated values ( ŷ ).*

In the table, y is the actual value, ŷ = 41.5x1 − 0.3x2 + 65.3 is the estimated value, and ȳ and ŷ̄ are the averages of y and ŷ.

Bakery        y      ŷ       y − ȳ    ŷ − ŷ̄    (y − ȳ)²   (ŷ − ŷ̄)²   (y − ȳ)(ŷ − ŷ̄)   (y − ŷ)²
Yumenooka     469    453.2   137.2    121.4    18823.8    14735.1    16654.4          250.0
Terai         366    397.4   34.2     65.6     1169.6     4307.5     2244.6           988.0
Sone          371    329.3   39.2     –2.5     1536.6     6.5        –99.8            1742.6
Hashimoto     208    204.7   –123.8   –127.1   15326.4    16150.7    15733.2          10.8
Kikyou        246    253.7   –85.8    –78.1    7361.6     6016.9     6705.0           58.6
Post Office   297    319.0   –34.8    –12.8    1211.0     163.1      444.4            485.3
Suidobashi    363    342.3   31.2     10.5     973.4      109.9      327.1            429.2
Rokujo        436    438.9   104.2    107.1    10857.6    11480.1    11164.5          8.7
Wakaba        198    201.9   –133.8   –129.9   17902.4    16870.5    17378.8          15.3
Misato        364    377.6   32.2     45.8     1036.8     2096.4     1474.3           184.6
Total         3318   3318    0        0        76199.6    72026.6    72026.6          4173.0
Average       331.8  331.8
              (ȳ)    (ŷ̄)                       (Syy)      (Sŷŷ)      (Syŷ)            (Se)

we don't need Se
yet, but we will
use it later.

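$$R = \frac{\text{sum of } (y - \bar{y})(\hat{y} - \bar{\hat{y}})}{\sqrt{\text{sum of } (y - \bar{y})^2 \times \text{sum of } (\hat{y} - \bar{\hat{y}})^2}} = \frac{S_{y\hat{y}}}{\sqrt{S_{yy} \times S_{\hat{y}\hat{y}}}} = \frac{72026.6}{\sqrt{76199.6 \times 72026.6}} = .9722$$

Nice!

$$R^2 = (.9722)^2 = .9452$$

R2 is .9452.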

* As in Chapter 2, some of the figures in this chapter are rounded for readability, but all calculations are done using the full, unrounded values resulting from the raw data unless otherwise stated.
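The multiple correlation coefficient R can be reproduced from the raw data by comparing actual and estimated sales. The sketch below is one way to do it with the standard library; it refits the equation with unrounded coefficients (as the footnote describes), and the helper name `cross_sum` is an illustrative assumption.

```python
from math import sqrt

floor_space = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]
distance = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]


def cross_sum(us, vs):
    """Sum of products of deviations from the means (S_uv)."""
    mean_u = sum(us) / len(us)
    mean_v = sum(vs) / len(vs)
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))


# Least-squares coefficients for y = a1*x1 + a2*x2 + b (unrounded).
s11, s22 = cross_sum(floor_space, floor_space), cross_sum(distance, distance)
s12 = cross_sum(floor_space, distance)
s1y, s2y = cross_sum(floor_space, sales), cross_sum(distance, sales)
det = s11 * s22 - s12 * s12
a1 = (s22 * s1y - s12 * s2y) / det
a2 = (s11 * s2y - s12 * s1y) / det
n = len(sales)
b = sum(sales) / n - a1 * sum(floor_space) / n - a2 * sum(distance) / n

estimates = [a1 * u + a2 * v + b for u, v in zip(floor_space, distance)]

s_yy = cross_sum(sales, sales)          # Syy
s_hh = cross_sum(estimates, estimates)  # S y-hat y-hat
s_yh = cross_sum(sales, estimates)      # S y y-hat
s_e = sum((y - h) ** 2 for y, h in zip(sales, estimates))  # Se

r = s_yh / sqrt(s_yy * s_hh)
r_squared = r ** 2
print(f"R = {r:.4f}, R^2 = {r_squared:.4f}, Se = {s_e:.1f}")
```

The same R2 also follows from 1 − Se / Syy, which is a handy cross-check on the table totals.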
So the way we calculate R in multiple regression is a lot like in simple regression, isn't it?
Yes, and when the value of R2 is closer to 1, the multiple regression equation is more accurate, just like...
Is there a rule this time for how high R2 needs to be for the equation to be considered accurate?
No, but .5 can again be used as a lower limit.
.9452 is way above .5!
So this multiple regression equation is really accurate!
Yeah, we can predict sales of the new shop with confidence.

You can simplify the R2 calculation.

I won’t explain the whole thing,
but basically it’s something like this.*

$$R^2 = (\text{multiple correlation coefficient})^2 = \frac{a_1S_{1y} + a_2S_{2y} + \cdots + a_pS_{py}}{S_{yy}} = 1 - \frac{S_e}{S_{yy}}$$


* Refer to page 144 for an explanation of S1y, S2y, ... , Spy.

The Multiple Regression Equation  121
The Trouble with R2
Before you get too excited, there's something you should know... this R2 might be misleading.
How can this be? We did the calculations perfectly.
What?
Well, the trouble is... every time we add a predictor variable, R2 gets larger (or, at the very least, never smaller). Guaranteed.


Suppose we add the age of the shop manager to the current data.
Before adding another variable, the R2 was .9452.

Bakery                    Floor area of the   Distance to the nearest   Shop manager's   Monthly sales
                          shop (tsubo)        station (meters)          age (years)      (¥10,000)
Yumenooka Shop            10                  80                        42               469
Terai Station Shop        8                   0                         29               366
Sone Shop                 8                   200                       33               371
Hashimoto Station Shop    5                   200                       41               208
Kikyou Town Shop          7                   300                       33               246
Post Office Shop          8                   230                       35               297
Suidobashi Station Shop   7                   40                        40               363
Rokujo Station Shop       9                   0                         46               436
Wakaba Riverside Shop     6                   330                       44               198
Misato Shop               9                   180                       34               364

Age is now the third variable.
After adding this variable...

122  Chapter 3  Multiple Regression Analysis

With floor area of the shop, distance to the nearest station, and age of the shop manager as predictor variables, R2 is .9495! As you can see, it's larger.
But when we plot age versus monthly sales, there is no pattern, so...
[Scatter plot: age of the shop manager (x) vs. monthly sales (y) for the ten shops. Correlation coefficient = .0368]
I knew it! The age of the shop manager has nothing to do with monthly sales!
Yet despite that, the value of R2 increased.
So what was the point of all those...

Never fear. The adjusted coefficient of determination, aka adjusted R2, will save us!
Another R2?
Adjusted R2

The value of adjusted R2 ($\bar{R}^2$) can be obtained by using this formula.

$$\bar{R}^2 = 1 - \frac{S_e \,/\, (\text{sample size} - \text{number of predictor variables} - 1)}{S_{yy} \,/\, (\text{sample size} - 1)}$$

Too many layers! It's worse than R2!
Miu, could you find the value of adjusted R2 with and without the age of the shop manager?
Um, I...
Go Miu!


When the variables are only floor area and distance... R2 is .9452. So adjusted R2 is:

$$1 - \frac{S_e \,/\, (\text{sample size} - \text{number of predictor variables} - 1)}{S_{yy} \,/\, (\text{sample size} - 1)} = 1 - \frac{4173.0 \,/\, (10 - 2 - 1)}{76199.6 \,/\, (10 - 1)}$$

I got it!

124  Chapter 3  Multiple Regression Analysis

The answer is .9296.
Great! How about when we also include the shop manager's age?
We've already got R2 for that.
Yes, it's R2 = .9495 (predictor variables: floor area of the shop, distance to the nearest station, and age of the shop manager).

So all I have to get is the value of adjusted R2...
What are Syy and Se in this case?
Syy is the same as before. It's 76199.6. We'll cheat and calculate Se using my computer.
Predictor variables:
• floor area
• distance
• manager's age
It's 3846.4.

$$1 - \frac{S_e \,/\, (\text{sample size} - \text{number of predictor variables} - 1)}{S_{yy} \,/\, (\text{sample size} - 1)} = 1 - \frac{3846.4 \,/\, (10 - 3 - 1)}{76199.6 \,/\, (10 - 1)} = .9243$$

Wait a minute...

Adjusted R2 125
Predictor variables                 Adjusted R2
Floor area and distance             .9296
Floor area, distance, and age       .9243

Look! The value of adjusted R2 is larger when the age of the shop manager is not included.
It worked!
See? Adjusted R2 to the rescue!
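The adjusted R2 formula is easy to wrap in a small function. The sketch below plugs in only the sums quoted in the dialogue (Syy = 76199.6, Se = 4173.0 with two predictors, Se = 3846.4 with three); the function name `adjusted_r2` is an illustrative choice.

```python
def adjusted_r2(s_e, s_yy, sample_size, num_predictors):
    """1 - (Se / (n - p - 1)) / (Syy / (n - 1))."""
    residual_term = s_e / (sample_size - num_predictors - 1)
    total_term = s_yy / (sample_size - 1)
    return 1 - residual_term / total_term


S_YY = 76199.6  # same for both models, because the sales data is unchanged

without_age = adjusted_r2(4173.0, S_YY, sample_size=10, num_predictors=2)
with_age = adjusted_r2(3846.4, S_YY, sample_size=10, num_predictors=3)

# Plain R2 for comparison: 1 - Se / Syy.
r2_without_age = 1 - 4173.0 / S_YY
r2_with_age = 1 - 3846.4 / S_YY

print(f"floor area, distance:      R2 = {r2_without_age:.4f}, adjusted = {without_age:.4f}")
print(f"floor area, distance, age: R2 = {r2_with_age:.4f}, adjusted = {with_age:.4f}")
```

Adding the manager's age lowers Se a little, so plain R2 goes up, but the smaller divisor n − p − 1 penalizes the extra variable and adjusted R2 goes down.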

Hey, look at this. Adjusted R2 is smaller than R2 in both cases. Is it always smaller?

Predictor variables                 R2       Adjusted R2
Floor area and distance             .9452    .9296
Floor area, distance, and age       .9495    .9243

Good eye! Yes, it is always smaller.
Is that good?
It means that adjusted R2 is a harsher judge of accuracy, so when we use it, we can be more confident in our multiple regression...

126  Chapter 3  Multiple Regression Analysis

Hypothesis Testing with Multiple Regression

Now... since we're happy with adjusted R2, we'll test our assumptions about the population.
We'll do hypothesis and regression coefficient tests, right?
Yes, but in multiple regression analysis, we have partial regression coefficients, instead.
Do you remember how we did the hypothesis testing before?
I think so. We tested whether the population matched the equation and then checked that A didn't equal zero.
Right! It's the same with multiple regression.

Alternative hypothesis: If the floor area of the shop is x1 tsubo and the distance to the nearest station is x2 meters, the monthly sales follow a normal distribution with mean A1x1 + A2x2 + B and standard deviation σ.

Now, we have more than one x and more than one A. At least one of these A's must not equal zero.
I see!

Hypothesis Testing with Multiple Regression  127

Step 4: Conduct the Analysis of Variance (ANOVA) Test.

Here are our assumptions about the partial regression coefficients. A1, A2, and B are coefficients of the entire population. The equation should reflect the population.

If the regression equation obtained is $y = a_1x_1 + a_2x_2 + b$:
A1 is approximately a1.
A2 is approximately a2.
B is approximately b.

$$\sigma = \sqrt{\frac{S_e}{\text{sample size} - \text{number of predictor variables} - 1}}$$

Could you apply this to Kazami bakery's data?
Sure. The multiple regression equation is y = 41.5x1 − 0.3x2 + 65.3, so these are our assumptions:
A1 is approximately 41.5.
A2 is approximately –0.3.
B is approximately 65.3.
Wonderful!

128  Chapter 3  Multiple Regression Analysis

Now we need to test our model using an F-test.
There are two types. One tests all the partial regression coefficients together. The other tests the individual partial regression coefficients.

Testing all the coefficients together:
  Null hypothesis: A1 = 0 and A2 = 0
  Alternative hypothesis: not (A1 = 0 and A2 = 0)
  In other words, one of the following is true: A1 ≠ 0 and A2 ≠ 0; A1 ≠ 0 and A2 = 0; or A1 = 0 and A2 ≠ 0.

Testing an individual coefficient:
  Null hypothesis: Ai = 0
  Alternative hypothesis: Ai ≠ 0

So, we have to repeat this test for each of the partial regression coefficients?
Yes.
Let's set the significance level to .05. Are you ready to try doing these tests?
Yes, let's!

Hypothesis Testing with Multiple Regression  129

first, we’ll test all the partial
regression coefficients together.

The Steps of ANOVA

Step 1  Define the population.
        The population is all Kazami Bakery shops.
Step 2  Set up a null hypothesis and an alternative hypothesis.
        Null hypothesis is A1 = 0 and A2 = 0.
        Alternative hypothesis is that A1 or A2 or both ≠ 0.
Step 3  Select which hypothesis test to conduct.
        We'll use an F-test.
Step 4  Choose the significance level.
        We'll use a significance level of .05.
Step 5  Calculate the test statistic from the sample data.
        The test statistic is:

$$\frac{S_{yy} - S_e}{\text{number of predictor variables}} \div \frac{S_e}{\text{sample size} - \text{number of predictor variables} - 1} = \frac{76199.6 - 4173.0}{2} \div \frac{4173.0}{10 - 2 - 1} = 60.4$$

        The test statistic, 60.4, will follow an F distribution with first degree of freedom 2 (the number of predictor variables) and second degree of freedom 7 (sample size minus the number of predictor variables minus 1), if the null hypothesis is true.
Step 6  Determine whether the p-value for the test statistic obtained in Step 5 is smaller than the significance level.
        At significance level .05, with d1 being 2 and d2 being 7 (10 − 2 − 1), the critical value is 4.7374. Our test statistic is 60.4.
Step 7  Decide whether you can reject the null hypothesis.
        Since our test statistic is greater than the critical value, we reject the null hypothesis.
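The arithmetic in Steps 5 through 7 can be sketched as follows. The sums of squares and the critical value 4.7374 are taken from the table above rather than computed from an F distribution (the standard library has none); the function name `overall_f_statistic` is an illustrative choice.

```python
def overall_f_statistic(s_yy, s_e, sample_size, num_predictors):
    """F statistic for testing all partial regression coefficients together."""
    explained = (s_yy - s_e) / num_predictors
    unexplained = s_e / (sample_size - num_predictors - 1)
    return explained / unexplained


f_stat = overall_f_statistic(s_yy=76199.6, s_e=4173.0, sample_size=10, num_predictors=2)

# Critical value of F(2, 7) at significance level .05, quoted from Step 6.
CRITICAL_VALUE = 4.7374

reject_null = f_stat > CRITICAL_VALUE
print(f"F = {f_stat:.1f}, critical value = {CRITICAL_VALUE}, reject null: {reject_null}")
```

Comparing the statistic with the critical value is equivalent to checking whether the p-value is below the significance level.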

130  Chapter 3  Multiple Regression Analysis

Next, we’ll test the individual partial
regression coefficients. I will do
this for A1 as an example.

The Steps of ANOVA

Step 1  Define the population.
        The population is all Kazami Bakery shops.
Step 2  Set up a null hypothesis and an alternative hypothesis.
        Null hypothesis is A1 = 0.
        Alternative hypothesis is A1 ≠ 0.
Step 3  Select which hypothesis test to conduct.
        We'll use an F-test.
Step 4  Choose the significance level.
        We'll use a significance level of .05.
Step 5  Calculate the test statistic from the sample data.
        The test statistic is:

$$\frac{a_1^2}{S^{11}} \div \frac{S_e}{\text{sample size} - \text{number of predictor variables} - 1} = \frac{41.5^2}{0.0657} \div \frac{4173.0}{10 - 2 - 1} = 44$$

        The test statistic will follow an F distribution with first degree of freedom 1 and second degree of freedom 7 (sample size minus the number of predictor variables minus 1), if the null hypothesis is true. (The value of $S^{11}$ will be explained on the next page.)
Step 6  Determine whether the p-value for the test statistic obtained in Step 5 is smaller than the significance level.
        At significance level .05, with d1 being 1 and d2 being 7, the critical value is 5.5914. Our test statistic is 44.
Step 7  Decide whether you can reject the null hypothesis.
        Since our test statistic is greater than the critical value, we reject the null hypothesis.

Regardless of the result of step 7, if the value of the test statistic

$$\frac{a_1^2}{S^{11}} \div \frac{S_e}{\text{sample size} - \text{number of predictor variables} - 1}$$

is 2 or more, we still consider the predictor variable corresponding to that partial regression coefficient to be useful for predicting the outcome variable.

Hypothesis Testing with Multiple Regression  131

Finding S11 and S22

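[Matrix calculation: the inverse of a matrix built from the data columns (floor space, distance to the nearest station, and an added line of 1's)]

$$= \begin{pmatrix} 0.0657 & \cdots & \cdots \\ \cdots & 0.00001 & \cdots \\ \cdots & \cdots & \cdots \end{pmatrix}$$

The first diagonal entry, 0.0657, is the $S^{11}$ that appeared in step 5. The second diagonal entry, 0.00001, is $S^{22}$.

You need to add a line with a 1 in all rows and columns.

We use a matrix to find $S^{11}$ and $S^{22}$. We needed $S^{11}$ to calculate the test statistic on the previous page, and we use $S^{22}$ to test our second coefficient independently, in the same way.*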
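For two predictors, the values 0.0657 and 0.00001 can be recovered without a matrix library. The sketch below assumes they are the diagonal entries of the inverse of the 2×2 matrix of sums of products of deviations (S11, S12, S22), which is consistent with the numbers quoted in this chapter; the helper name `cross_sum` and the variable names are illustrative.

```python
floor_space = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]
distance = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]


def cross_sum(us, vs):
    """Sum of products of deviations from the means (S_uv)."""
    mean_u = sum(us) / len(us)
    mean_v = sum(vs) / len(vs)
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))


n, p = len(sales), 2
s11, s22 = cross_sum(floor_space, floor_space), cross_sum(distance, distance)
s12 = cross_sum(floor_space, distance)
s1y, s2y = cross_sum(floor_space, sales), cross_sum(distance, sales)
s_yy = cross_sum(sales, sales)

# Inverse of [[s11, s12], [s12, s22]]: only the diagonal entries are needed.
det = s11 * s22 - s12 * s12
s_inv_11 = s22 / det  # "S superscript 11"
s_inv_22 = s11 / det  # "S superscript 22"

# Unrounded partial regression coefficients and residual sum of squares.
a1 = (s22 * s1y - s12 * s2y) / det
a2 = (s11 * s2y - s12 * s1y) / det
s_e = s_yy - (a1 * s1y + a2 * s2y)
residual_variance = s_e / (n - p - 1)

f_a1 = a1 ** 2 / s_inv_11 / residual_variance
f_a2 = a2 ** 2 / s_inv_22 / residual_variance

CRITICAL_VALUE = 5.5914  # F(1, 7) at significance level .05, quoted from the table
print(f"S^11 = {s_inv_11:.4f}, S^22 = {s_inv_22:.5f}")
print(f"F for A1 = {f_a1:.1f}, F for A2 = {f_a2:.1f}, critical value = {CRITICAL_VALUE}")
```

Both statistics can then be compared with the same critical value, since each test has degrees of freedom 1 and 7.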

so A1 doesn't
equal zero! We
can reject the
null hypothesis.
you really
did it!
You're my
hero, miu!

* Some people use the t distribution instead of the F distribution when explaining the "test of partial regression coefficients." Your final result will be the same no matter which method you choose.
Step 5: Calculate confidence intervals for the population.

Next? Was it something about confidence?
Yes, that's right. We're going to calculate confidence intervals.
But this... the calculation is extremely difficult. Legend has it that a student once went mad trying to calculate it.
Wow. Do you think we can do it?
It starts out like simple regression analysis. But then the Mahalanobis distance* comes in, and things get very complicated very quickly.
Mahala... what?
I know we can, but we'll be here all night. We could have a slumber party.
Slumber...

* The mathematician P.C. Mahalanobis invented a way to use multivariate distances to compare populations.
Hypothesis Testing with Multiple Regression  133
Not to sleep? I'm in.
Let's have a pillow fight!
All right, let's stay up all night doing math!
Can I...
You really want to?
Eep! No math!
No! I changed my mind!
Well then, I guess we'll have to find out the confidence intervals using data analysis...
Thank you, computer!
This time it's okay, but you shouldn't always rely on computers. Doing calculations by hand helps you learn.
It helps you learn.
You are such a jerk...
I thought it was funny!

134  Chapter 3  Multiple Regression Analysis

... Stop daydreaming and pay attention!
In any case, we'll need to choose the confidence level first.
For a 10 tsubo shop that's 80 m from a station, the confidence interval is 453.2 ± 34.9.*

* This calculation is explained in more detail on page 146.

So first we do 453.2 + 34.9... then 453.2 − 34.9.
I got it! We've found out that the average shop earns between ¥4,183,000 and ¥4,881,000, right?
Precisely!

Hypothesis Testing with Multiple Regression  135

Step 6: Make a prediction!

Here is the data for the new shop we're planning to open.

                 Floor space of the   Distance to the nearest
                 shop (tsubo)         station (meters)
Isebashi Shop    10                   110

A shop in Isebashi? That's ... my house!
Can you predict the sales, Miu?
... per month!*
Yep!
You're a genius, Miu! I should name the shop after you.
You should name it after Risa...

* This calculation was made using rounded numbers. If you use the full, unrounded numbers, the result will be 442.96.
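The footnote's point about rounding can be illustrated by computing the Isebashi prediction twice: once with the rounded coefficients from the equation, once with coefficients refitted from the raw data. This is a minimal sketch; `predict` and `cross_sum` are illustrative helper names.

```python
floor_space = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]
distance = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]


def cross_sum(us, vs):
    """Sum of products of deviations from the means (S_uv)."""
    mean_u = sum(us) / len(us)
    mean_v = sum(vs) / len(vs)
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))


def predict(floor, dist, a1, a2, b):
    """Monthly sales (x 10,000 yen) from y = a1*x1 + a2*x2 + b."""
    return a1 * floor + a2 * dist + b


# Prediction with the rounded coefficients quoted in the chapter.
rounded = predict(10, 110, a1=41.5, a2=-0.3, b=65.3)

# Prediction with unrounded least-squares coefficients.
s11, s22 = cross_sum(floor_space, floor_space), cross_sum(distance, distance)
s12 = cross_sum(floor_space, distance)
s1y, s2y = cross_sum(floor_space, sales), cross_sum(distance, sales)
det = s11 * s22 - s12 * s12
a1 = (s22 * s1y - s12 * s2y) / det
a2 = (s11 * s2y - s12 * s1y) / det
n = len(sales)
b = sum(sales) / n - a1 * sum(floor_space) / n - a2 * sum(distance) / n
unrounded = predict(10, 110, a1, a2, b)

print(f"rounded coefficients:   {rounded:.1f}")
print(f"unrounded coefficients: {unrounded:.2f}")
```

The gap between the two results comes mostly from rounding a2 to –0.3, because that coefficient is multiplied by a distance of 110.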

But how could we know the exact sales of a shop that hasn't been built? Should we calculate a prediction interval?
Absolutely.
In simple regression analysis, the method to find both the confidence and prediction intervals was similar. Is that also true for multiple regression analysis?
Yes, it's...

136  Chapter 3  Multiple Regression Analysis

So we'll use the maha... maha... something again?
It's the Mahalanobis distance. Yes, we need to use it to find the...
Yeah. Blah blah distance.
Could we... maybe, please use your computer again? Just once more?
If you...
The confidence level is 95%, so predicted sales... are between ¥3,751,000 and ¥5,109,000.
Not bad!
So, do you think you'll open the shop?
These numbers are pretty good. You know, I think we just might!
This has been ace. Thank you, both of you!
Whoa, whoa! Hold on, there's just one more thing.

Hypothesis Testing with Multiple Regression  137

Choosing the Best Combination of Predictor Variables

We need to check whether there's a better multiple...
A better equation?
What's wrong with the one we have? What about my prediction?
Now who's being dramatic?
Just as with simple regression analysis, we can calculate a multiple regression equation using any variables we have data on, whether or not they actually affect the outcome...
Like the age of the shop manager? We used that, even though it didn't have any effect on sales!
Exactly.
The equation becomes complicated if you have too many predictor variables.
[Illustration: a soup of predictor variables, such as height of the ceiling, age of the shop manager, and number of seats]
So many predictors... this soup is a mess.

138  Chapter 3  Multiple Regression Analysis


The best multiple regression equation balances accuracy and complexity by including only the predictor variables needed to make the best prediction.
[Diagram: equations placed by how easy they are and by whether they are accurate or not accurate]
Short is...

There are several ways to find the equation that gives you the most bang for your buck.

• forward selection
• backward elimination
• forward-backward stepwise selection
• Ask a domain expert which variables are the most important

These are some common ways.

The method we'll use today is simpler than any of those. It's called best subsets regression, or sometimes, the round-robin method.
Round robin? A fat...
What the heck is that?
I'll show you. Suppose x1, x2, and x3 are potential predictor variables. First, we'd calculate the multiple regression equation for every combination of predictor variables!

x1          x1 and x2     x1 and x2 and x3
x2          x2 and x3
x3          x1 and x3

This sure is round-...

Choosing the Best Combination of Predictor Variables  139

Let's replace x1, x2, and x3 with shop size, distance to a station, and manager's age.
We'll make a table that shows the partial regression coefficients and adjusted R2... like this.
Presto!
Is our equation the winner?

Predictor variables   a1      a2      a3      b         Adjusted R2
1                     54.9                    –91.3     .7709
2                             –0.6            424.8     .5508
3                                     0.6     309.1     .0000
1 and 2               41.5    –0.3            65.3      .9296
1 and 3               55.6            2.0     –170.1    .7563
2 and 3                       –0.6    –0.4    438.9     .4873
1 and 2 and 3         42.2    –0.3    1.1     17.7      .9243

1 is floor area, 2 is distance to a station, and 3 is manager's age.

When 1 and 2 are used, adjusted R2 is highest.
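Best subsets regression is a brute-force loop: fit every combination of predictors and keep the one with the highest adjusted R2. The sketch below is one way to do that with the standard library; the Gauss-Jordan solver and all function names are illustrative assumptions, not the book's procedure. Note that the raw formula gives a negative adjusted R2 for manager's age alone, which the table above appears to report as .0000.

```python
from itertools import combinations

predictors = {
    "floor area": [10, 8, 8, 5, 7, 8, 7, 9, 6, 9],
    "distance": [80, 0, 200, 200, 300, 230, 40, 0, 330, 180],
    "manager age": [42, 29, 33, 41, 33, 35, 40, 46, 44, 34],
}
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]


def centered(values):
    """Deviations from the mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    size = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def adjusted_r2(names):
    """Adjusted R2 of the least-squares fit that uses the named predictors."""
    n, p = len(sales), len(names)
    xs = [centered(predictors[name]) for name in names]
    yc = centered(sales)
    s = [[sum(u * v for u, v in zip(xi, xj)) for xj in xs] for xi in xs]
    sy = [sum(u * v for u, v in zip(xi, yc)) for xi in xs]
    coefficients = solve(s, sy)
    s_yy = sum(v * v for v in yc)
    s_e = s_yy - sum(a * c for a, c in zip(coefficients, sy))
    return 1 - (s_e / (n - p - 1)) / (s_yy / (n - 1))


results = {}
for size in range(1, len(predictors) + 1):
    for names in combinations(predictors, size):
        results[names] = adjusted_r2(names)

best = max(results, key=results.get)
for names, value in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{' and '.join(names):40s} {value:.4f}")
print("best subset:", " and ".join(best))
```

With p candidate predictors there are 2^p − 1 subsets to fit, so this exhaustive approach is practical only when p is small, as it is here.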

So our equation is the best! We rock.
Now we really know that y = 41.5x1 – 0.3x2 + 65.3 does a good job at predicting the sales at the new shop.
That's right! Good work...

140  Chapter 3  Multiple Regression Analysis

So, tell me, Miu. Are you starting to understand regression?
Yeah! I can't believe how much I've learned!
Really? Then you should pay me a consultation fee.
I think I may have earned one, too!
You can pay us in croissants!
Oh well... I can't say no to that.
I'll race you to the bakery!
Risa is really cool, isn't she?
Sure is.
Hey, Miu! Let's go, slowpokes!

Assessing Populations with
Multiple Regression Analysis

Let's review the procedure of multiple regression analysis, shown
on page 112.
1. Draw a scatter plot of each predictor variable and the outcome
variable to see if they appear to be related.
2. Calculate the multiple regression equation.
3. Examine the accuracy of the multiple regression equation.
4. Conduct the analysis of variance (ANOVA) test.
5. Calculate confidence intervals for the population.
6. Make a prediction!
As in Chapter 2, we've talked about Steps 1 through 6 as if they
were all mandatory. In reality, Steps 4 and 5 can be skipped for the
analysis of some data sets.
Kazami Bakery currently has only 10 stores, and of those
10 stores, only one (Yumenooka Shop) has a floor area of 10 tsubo¹
and is 80 m from the nearest station. However, Risa calculated a
confidence interval for the population of stores that were 10 tsubo
and 80 m from a station. Why would she do that?
Well, it's possible that Kazami Bakery could open another
10-tsubo store that's also 80 m from a train station. If the chain
keeps growing, there could be dozens of Kazami shops that fit that
description. When Risa did that analysis, she was assuming that
more 10-tsubo stores 80 m from a station might open someday.
The usefulness of this assumption is disputable. Yumenooka
Shop has more sales than any other shop, so maybe the Kazami
family will decide to open more stores just like that one. However,
the bakery's next store, Isebashi Shop, will be 10 tsubo but 110 m
from a station. In fact, it probably wasn't necessary to analyze such
a specific population of stores. Risa could have skipped from
calculating adjusted R2 to making the prediction, but being a good
friend, she wanted to show Miu all the steps.

1. Remember that 1 tsubo is about 36 square feet.

142  Chapter 3  Multiple Regression Analysis

Standardized Residuals

As in simple regression analysis, we calculate standardized residuals
in multiple regression analysis when assessing how well the
equation fits the actual sample data that's been collected.
Table 3-1 shows the residuals and standardized residuals for
the Kazami Bakery data used in this chapter. An example
calculation is shown for the Misato Shop.

Table 3-1: Standardized residuals of the Kazami bakery example

Here x1 is the floor area of the shop, x2 is the distance to the nearest station, y is monthly sales, and ŷ = 41.5x1 − 0.3x2 + 65.3 is the estimated monthly sales.

Bakery                    x1    x2     y      ŷ       Residual   Standardized
                                                      y − ŷ      residual
Yumenooka Shop            10    80     469    453.2   15.8       0.8
Terai Station Shop        8     0      366    397.4   –31.4      –1.6
Sone Shop                 8     200    371    329.3   41.7       1.8
Hashimoto Station Shop    5     200    208    204.7   3.3        0.2
Kikyou Town Shop          7     300    246    253.7   –7.7       –0.4
Post Office Shop          8     230    297    319.0   –22.0      –1.0
Suidobashi Station Shop   7     40     363    342.3   20.7       1.0
Rokujo Station Shop       9     0      436    438.9   –2.9       –0.1
Wakaba Riverside Shop     6     330    198    201.9   –3.9       –0.2
Misato Shop               9     180    364    377.6   –13.6      –0.6

If a residual is positive, the measurement is higher than predicted
by our equation, and if the residual is negative, the measurement
is lower than predicted; if it's 0, the measurement and our
prediction are the same. The absolute value of the residual tells us
how well the equation predicted what actually happened. The larger
the absolute value, the greater the difference between the
measurement and the prediction.

Standardized Residuals  143

If the absolute value of the standardized residual is greater
than 3, the data point can be considered an outlier. Outliers are
measurements that don't follow the general trend. In this case,
an outlier could be caused by a store closure, by road construction
around a store, or by a big event held at one of the bakeries:
anything that would significantly affect sales. When you detect an
outlier, you should investigate the data point to see if it needs to be
removed and the regression equation calculated again.
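The standardized residuals in Table 3-1 can be reproduced with a few lines of code. The formula used below is an assumption, since the worked example is not reproduced in the text: each residual is divided by the square root of Se / (n − p − 1) × (1 − h), where h = 1/n + D²/(n − 1) and D² is the squared Mahalanobis distance of the shop. It is a common definition, and it matches the table values to one decimal place. All names are illustrative.

```python
from math import sqrt

shops = ["Yumenooka", "Terai Station", "Sone", "Hashimoto Station", "Kikyou Town",
         "Post Office", "Suidobashi Station", "Rokujo Station", "Wakaba Riverside",
         "Misato"]
floor_area = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]
distance = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]


def cross_sum(us, vs):
    """Sum of products of deviations from the means (S_uv)."""
    mean_u = sum(us) / len(us)
    mean_v = sum(vs) / len(vs)
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))


n, p = len(sales), 2
m1, m2, my = sum(floor_area) / n, sum(distance) / n, sum(sales) / n
s11, s22 = cross_sum(floor_area, floor_area), cross_sum(distance, distance)
s12 = cross_sum(floor_area, distance)
s1y, s2y = cross_sum(floor_area, sales), cross_sum(distance, sales)
det = s11 * s22 - s12 * s12
a1 = (s22 * s1y - s12 * s2y) / det
a2 = (s11 * s2y - s12 * s1y) / det
b = my - a1 * m1 - a2 * m2

residuals = [y - (a1 * u + a2 * v + b) for u, v, y in zip(floor_area, distance, sales)]
residual_variance = sum(e * e for e in residuals) / (n - p - 1)  # Se / (n - p - 1)

standardized = {}
for shop, u, v, e in zip(shops, floor_area, distance, residuals):
    d1, d2 = u - m1, v - m2
    # Quadratic form with the inverse of the S matrix; equals D^2 / (n - 1).
    quad = (d1 * d1 * s22 - 2 * d1 * d2 * s12 + d2 * d2 * s11) / det
    leverage = 1 / n + quad
    standardized[shop] = e / sqrt(residual_variance * (1 - leverage))

for shop in shops:
    print(f"{shop:20s} {standardized[shop]:5.1f}")
```

None of the ten shops comes close to an absolute value of 3, so by the rule above this data set has no outliers.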

Mahalanobis Distance

The Mahalanobis distance was introduced in 1936 by mathematician
and scientist P.C. Mahalanobis, who also founded the
Indian Statistical Institute. Mahalanobis distance is very useful
in statistics because it considers an entire set of data, rather than
looking at each measurement in isolation. It's a way of calculating
distance that, unlike the more common Euclidean concept of
distance, takes into account the correlation between measurements
to determine the similarity of a sample to an established data set.
Because these calculations reflect a more complex relationship,
a linear equation will not suffice. Instead, we use matrices, which
condense a complex array of information into a more manageable
form that can then be used to calculate all of these distances
at once.
On page 137, Risa used her computer to find the prediction
interval using the Mahalanobis distance. Let's work through that
calculation now and see how she arrived at a prediction interval
from ¥3,751,000 to ¥5,109,000 at a confidence level of 95%.

Step 1
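Obtain the inverse matrix of

$$\begin{pmatrix} S_{11} & S_{12} & \cdots & S_{1p} \\ S_{21} & S_{22} & \cdots & S_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ S_{p1} & S_{p2} & \cdots & S_{pp} \end{pmatrix}$$

, which is

$$\begin{pmatrix} S_{11} & S_{12} & \cdots & S_{1p} \\ S_{21} & S_{22} & \cdots & S_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ S_{p1} & S_{p2} & \cdots & S_{pp} \end{pmatrix}^{-1} = \begin{pmatrix} S^{11} & S^{12} & \cdots & S^{1p} \\ S^{21} & S^{22} & \cdots & S^{2p} \\ \vdots & \vdots & \ddots & \vdots \\ S^{p1} & S^{p2} & \cdots & S^{pp} \end{pmatrix}.$$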

The first matrix is the covariance matrix as calculated on

page 132. The diagonal of this matrix (S11, S22, and so on) is the vari-
ance within a certain variable.

144  Chapter 3  Multiple Regression Analysis

The inverse of this matrix, the second and third matrices shown
here, is also known as the concentration matrix for the different
predictor variables: floor area and distance to the nearest station.
For example, $S_{22}$ is the variance of the values of the distance
to the nearest station. $S_{25}$ would be the covariance of the distance to
the nearest station and some fifth predictor variable.
The values of $S^{11}$ and $S^{22}$ on page 132 were obtained through
this series of calculations.
The values of $S^{ii}$ and $S^{ij}$ in

$$\begin{pmatrix} S^{11} & S^{12} & \cdots & S^{1p} \\ S^{21} & S^{22} & \cdots & S^{2p} \\ \vdots & \vdots & \ddots & \vdots \\ S^{p1} & S^{p2} & \cdots & S^{pp} \end{pmatrix}$$

and the values of $S^{ii}$ and $S^{ij}$ obtained from conducting individual
tests of the partial regression coefficients are always the same.
That is, the values of $S^{ii}$ and $S^{ij}$ found through partial regression
will be equivalent to the values of $S^{ii}$ and $S^{ij}$ found by calculating
the inverse matrix.

Step 2
Next we need to calculate the square of Mahalanobis distance for a
given point using the following equation:

DM2 ( x ) = ( x − x )
(S ) ( x − x )

The x values are taken from the predictors, x is the mean of

a given set of predictors, and S –1 is the concentration matrix from
Step 1. The Mahalanobis distance for the shop at Yumenooka is
shown here:

$$
D^2 = \left\{
\begin{aligned}
&(x_1 - \bar{x}_1)(x_1 - \bar{x}_1)S^{11} + (x_1 - \bar{x}_1)(x_2 - \bar{x}_2)S^{12} + \cdots + (x_1 - \bar{x}_1)(x_p - \bar{x}_p)S^{1p} \\
&+ (x_2 - \bar{x}_2)(x_1 - \bar{x}_1)S^{21} + (x_2 - \bar{x}_2)(x_2 - \bar{x}_2)S^{22} + \cdots + (x_2 - \bar{x}_2)(x_p - \bar{x}_p)S^{2p} \\
&\;\;\vdots \\
&+ (x_p - \bar{x}_p)(x_1 - \bar{x}_1)S^{p1} + (x_p - \bar{x}_p)(x_2 - \bar{x}_2)S^{p2} + \cdots + (x_p - \bar{x}_p)(x_p - \bar{x}_p)S^{pp}
\end{aligned}
\right\} \times (\text{number of individuals} - 1)
$$

With two predictors this becomes

$$
D^2 = \left\{
\begin{aligned}
&(x_1 - \bar{x}_1)(x_1 - \bar{x}_1)S^{11} + (x_1 - \bar{x}_1)(x_2 - \bar{x}_2)S^{12} \\
&+ (x_2 - \bar{x}_2)(x_1 - \bar{x}_1)S^{21} + (x_2 - \bar{x}_2)(x_2 - \bar{x}_2)S^{22}
\end{aligned}
\right\} \times (\text{number of individuals} - 1)
$$

$$
= \left\{
\begin{aligned}
&(10 - 7.7)(10 - 7.7) \times 0.0657 + (10 - 7.7)(80 - 156) \times 0.0004 \\
&+ (80 - 156)(10 - 7.7) \times 0.0004 + (80 - 156)(80 - 156) \times 0.00001
\end{aligned}
\right\} \times (10 - 1)
$$

$$
= 2.4
$$
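
The double sum above is easy to get wrong by hand, so here is a short Python sketch of it. It takes the rounded inverse-matrix values, the means, and the Yumenooka measurements straight from the text. The function name `mahalanobis_squared` is hypothetical, and the final multiplication by (n − 1) follows this book's convention.

```python
def mahalanobis_squared(x, means, inv, n):
    """Sum (x_i - mean_i)(x_j - mean_j) * S^{ij} over all pairs, times (n - 1)."""
    p = len(x)
    total = 0.0
    for i in range(p):
        for j in range(p):
            total += (x[i] - means[i]) * (x[j] - means[j]) * inv[i][j]
    return total * (n - 1)


# Rounded elements of the inverse matrix, as printed in the text.
inv = [[0.0657, 0.0004],
       [0.0004, 0.00001]]

# Yumenooka shop: 10 tsubo, 80 m; means are 7.7 tsubo and 156 m; 10 shops.
d_squared = mahalanobis_squared([10, 80], [7.7, 156], inv, 10)
print(round(d_squared, 1))  # 2.4
```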


Step 3
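Now we can calculate the confidence interval, as illustrated here:

[Figure: the confidence interval on the monthly sales axis, running
from 453 − 35 = 418 to 453 + 35 = 488 and centered on the predicted
value]

The predicted value at the center of the interval is

$$
a_1 \times 10 + a_2 \times 80 + b = 41.5 \times 10 - 0.3 \times 80 + 65.3 = 453
$$

(The coefficients are shown rounded. The result of 453 comes from the
unrounded values, with $a_2 \approx -0.34$.)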
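
As a check on the predicted value, the sketch below refits the two-predictor regression from the Table 3-2 data and evaluates it at 10 tsubo and 80 m. It is an illustration only: the function `fit_two` is a hypothetical helper that solves the two normal equations directly, which works for exactly two predictors.

```python
floor = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]
dist = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]


def fit_two(x1, x2, y):
    """Least-squares fit of y = a1*x1 + a2*x2 + b (two predictors only)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    a1 = (s22 * s1y - s12 * s2y) / det
    a2 = (s11 * s2y - s12 * s1y) / det
    return a1, a2, my - a1 * m1 - a2 * m2


a1, a2, b = fit_two(floor, dist, sales)
predicted = a1 * 10 + a2 * 80 + b
print(round(a1, 1), round(a2, 2), round(b, 1), round(predicted, 1))
```

Keeping the coefficients unrounded until the last step is what gives 453 rather than the 456.3 you would get by plugging in 41.5, −0.3, and 65.3.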

The minimum value of the confidence interval is the same distance
from the mean as the maximum value of the interval. In other words,
the confidence interval “straddles” the mean equally on each side. We
calculate the distance from the mean as shown below ($D^2$ stands for
the square of the Mahalanobis distance from Step 2, $S_e$ is the
residual sum of squares, and $x$ represents the total number of
predictors, not a value of some predictor):

$$
\sqrt{F(1, \text{sample size} - x - 1; .05) \times \left( \frac{1}{\text{sample size}} + \frac{D^2}{\text{sample size} - 1} \right) \times \frac{S_e}{\text{sample size} - x - 1}}
$$

$$
= \sqrt{F(1, 10 - 2 - 1; .05) \times \left( \frac{1}{10} + \frac{2.4}{10 - 1} \right) \times \frac{4173.0}{10 - 2 - 1}}
$$

$$
= 35
$$
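
The arithmetic can be reproduced with a few lines of Python. This sketch plugs in the numbers quoted in the text (the F value 5.6, $D^2$ = 2.4, $S_e$ = 4173.0, ten shops, two predictors). The function name `confidence_half_width` is an assumption made for illustration.

```python
import math


def confidence_half_width(f_crit, n, p, d_squared, s_e):
    """Distance from the predicted value to either end of the confidence interval."""
    return math.sqrt(f_crit * (1 / n + d_squared / (n - 1)) * s_e / (n - p - 1))


half = confidence_half_width(f_crit=5.6, n=10, p=2, d_squared=2.4, s_e=4173.0)
lower, upper = 453 - half, 453 + half
print(round(half), round(lower), round(upper))  # 35 418 488
```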

As with simple regression analysis, when obtaining the prediction
interval, we add 1 to the second term:

$$
\sqrt{F(1, \text{sample size} - x - 1; .05) \times \left( 1 + \frac{1}{\text{sample size}} + \frac{D^2}{\text{sample size} - 1} \right) \times \frac{S_e}{\text{sample size} - x - 1}}
$$

If the confidence rate is 99%, just change the .05 to .01:

$$
F(1, \text{sample size} - x - 1; .05) = F(1, 10 - 2 - 1; .05) = 5.6
$$

$$
F(1, \text{sample size} - x - 1; .01) = F(1, 10 - 2 - 1; .01) = 12.2
$$

You can see that if you want to be more confident that the pre-
diction interval will include the actual outcome, the interval needs
to be larger.
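
To see how much larger, the sketch below evaluates the prediction-interval formula at both confidence rates for the Yumenooka shop. It reuses the values quoted in the text (F values 5.6 and 12.2, $D^2$ = 2.4, $S_e$ = 4173.0). The function name `prediction_half_width` is hypothetical.

```python
import math


def prediction_half_width(f_crit, n, p, d_squared, s_e):
    """Half-width of the prediction interval: note the extra 1 in the middle term."""
    return math.sqrt(f_crit * (1 + 1 / n + d_squared / (n - 1)) * s_e / (n - p - 1))


width_95 = prediction_half_width(5.6, 10, 2, 2.4, 4173.0)
width_99 = prediction_half_width(12.2, 10, 2, 2.4, 4173.0)
print(round(width_95, 1), round(width_99, 1))
```

Under these inputs the 99% half-width comes out roughly one and a half times the 95% half-width, so the extra confidence is paid for with a noticeably wider range of monthly sales.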


Using Categorical Data in Multiple Regression Analysis

Recall from Chapter 1 that categorical data is data that can’t be
measured with numbers. For example, the color of a store manager’s
eyes is categorical (and probably a terrible predictor variable for
monthly sales). Although categorical variables can be represented
by numbers (1 = blue, 2 = green), they are discrete: there’s no such
thing as “green and a half.” Also, one cannot say that 2 (green eyes)
is greater than 1 (blue eyes). So far we’ve been using the numerical
data shown in Table 3-2, which also appears on page 113. Numerical
data can be meaningfully represented by continuous numerical values;
for example, 110 m from the train station is farther than 109.9 m.

Table 3-2: Kazami Bakery Example Data

| Bakery | Floor space of the shop (tsubo) | Distance to the nearest station (meters) | Monthly sales (¥10,000) |
|---|---|---|---|
| Yumenooka Shop | 10 | 80 | 469 |
| Terai Station Shop | 8 | 0 | 366 |
| Sone Shop | 8 | 200 | 371 |
| Hashimoto Station Shop | 5 | 200 | 208 |
| Kikyou Town Shop | 7 | 300 | 246 |
| Post Office Shop | 8 | 230 | 297 |
| Suidobashi Station Shop | 7 | 40 | 363 |
| Rokujo Station Shop | 9 | 0 | 436 |
| Wakaba Riverside Shop | 6 | 330 | 198 |
| Misato Shop | 9 | 180 | 364 |

The predictor variable floor area is measured in tsubo, distance
to the nearest station in meters, and monthly sales in yen. Clearly,
these are all numerically measurable. In multiple regression
analysis, the outcome variable must be a measurable, numerical
variable, but the predictor variables can be
• all numerical variables,
• some numerical and some categorical variables, or
• all categorical variables.
Tables 3-3 and 3-4 both show valid data sets. In the first,
categorical and numerical variables are both present, and in the
second, all of the predictor variables are categorical.


Table 3-3: A Combination of Categorical and Numerical Data

| Bakery | Floor space of the shop (tsubo) | Distance to the nearest station (meters) | Free samples | Monthly sales (¥10,000) |
|---|---|---|---|---|
| Yumenooka Shop | 10 | 80 | 1 | 469 |
| Terai Station Shop | 8 | 0 | 0 | 366 |
| Sone Shop | 8 | 200 | 1 | 371 |
| Hashimoto Station Shop | 5 | 200 | 0 | 208 |
| Kikyou Town Shop | 7 | 300 | 0 | 246 |
| Post Office Shop | 8 | 230 | 0 | 297 |
| Suidobashi Station Shop | 7 | 40 | 0 | 363 |
| Rokujo Station Shop | 9 | 0 | 1 | 436 |
| Wakaba Riverside Shop | 6 | 330 | 0 | 198 |
| Misato Shop | 9 | 180 | 1 | 364 |

In Table 3-3 we’ve included the categorical predictor variable
free samples. Some Kazami Bakery locations put out a tray of free
samples (1), and others don’t (0). When we include this data in the
analysis, we get the multiple regression equation

$$
y = 30.6 x_1 - 0.4 x_2 + 39.5 x_3 + 135.9
$$

where $y$ represents monthly sales, $x_1$ represents floor area, $x_2$
represents distance to the nearest station, and $x_3$ represents free
samples.
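
A 0/1 column like free samples goes into the least-squares fit exactly like a numerical column. The sketch below refits the equation from the Table 3-3 data using only the Python standard library. It builds the normal equations and solves them with a small Gaussian elimination routine; `solve` is an illustrative helper written for this example, not a library function.

```python
floor = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]
dist = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]
samples = [1, 0, 1, 0, 0, 0, 0, 1, 0, 1]  # 1 = offers free samples
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]


def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


# One row per shop; the trailing 1.0 is the column for the intercept b.
rows = [[f, d, s, 1.0] for f, d, s in zip(floor, dist, samples)]
k = 4

# Normal equations: (X^T X) * coefficients = X^T y.
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * y for r, y in zip(rows, sales)) for i in range(k)]

a1, a2, a3, b = solve(xtx, xty)
print(round(a1, 1), round(a2, 1), round(a3, 1), round(b, 1))
```

Read this way, the coefficient on $x_3$ is the estimated difference in monthly sales between a shop that offers samples and one that doesn't, with floor area and distance held fixed.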

Table 3-4: Categorical Predictor Data Only

| Bakery | Floor space of the shop (tsubo) | Distance to the nearest station (meters) | Samples every day | Samples on the weekend only | Monthly sales (¥10,000) |
|---|---|---|---|---|---|
| Yumenooka Shop | 1 | 0 | 1 | 0 | 469 |
| Terai Station Shop | 1 | 0 | 0 | 0 | 366 |
| Sone Shop | 1 | 1 | 1 | 0 | 371 |
| Hashimoto Station Shop | 0 | 1 | 0 | 0 | 208 |
| Kikyou Town Shop | 0 | 1 | 0 | 0 | 246 |
| Post Office Shop | 1 | 1 | 0 | 0 | 297 |
| Suidobashi Station Shop | 0 | 0 | 0 | 0 | 363 |
| Rokujo Station Shop | 1 | 0 | 1 | 1 | 436 |
| Wakaba Riverside Shop | 0 | 1 | 0 | 0 | 198 |
| Misato Shop | 1 | 0 | 1 | 1 | 364 |

Floor space: less than 8 tsubo = 0, 8 tsubo or more = 1
Distance: less than 200 m = 0, 200 m or more = 1
Samples: does not offer samples = 0, offers samples = 1


In Table 3-4, we’ve converted numerical data (floor space and
distance to a station) to categorical data by creating some general
categories. Using this data, we calculate the multiple regression
equation

$$
y = 50.2 x_1 - 110.1 x_2 + 13.4 x_3 + 75.1 x_4 + 336.4
$$

where $y$ represents monthly sales, $x_1$ represents floor area, $x_2$
represents distance to the nearest station, $x_3$ represents samples
every day, and $x_4$ represents samples on the weekend only.
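
The conversion from numerical to categorical data is just a threshold test. Here is a small Python sketch that recodes the floor space and distance columns with the cutoffs given under Table 3-4 (8 tsubo and 200 m). The variable names are illustrative.

```python
floor = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]            # tsubo
dist = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]  # meters

# 1 if the shop has 8 tsubo or more, otherwise 0.
floor_coded = [1 if f >= 8 else 0 for f in floor]

# 1 if the shop is 200 m or more from the nearest station, otherwise 0.
dist_coded = [1 if d >= 200 else 0 for d in dist]

print(floor_coded)
print(dist_coded)
```

Note what the coding throws away: after the conversion, a 10-tsubo shop and an 8-tsubo shop look identical to the regression.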


Multicollinearity

Multicollinearity occurs when two or more of the predictor variables
are strongly correlated with each other. When this happens, it’s hard to
distinguish between the effects of these variables on the outcome
variable, and this can have the following effects on your analysis:
• Less accurate estimate of the impact of a given variable on the
outcome variable
• Unusually large standard errors of the regression coefficients
• Failure to reject the null hypothesis
• Overfitting, which means that the regression equation describes
a relationship between the outcome variable and random error,
rather than the predictor variable
The presence of multicollinearity can be assessed by using an
index such as tolerance or the inverse of tolerance, known as the
variance inflation factor (VIF). Generally, a tolerance of less than
0.1 or a VIF greater than 10 is thought to indicate significant multi-
collinearity, but sometimes more conservative thresholds are used.
When you’re just starting out with multiple regression analysis,
you don’t need to worry too much about this. Just keep in mind
that multicollinearity can cause problems when it’s severe. There-
fore, when predictor variables are correlated to each other strongly,
it may be better to remove one of the highly correlated variables
and then reanalyze the data.
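
As a rough illustration, here is a Python sketch that computes tolerance and VIF for the two bakery predictors. It relies on the standard definition of tolerance as 1 − R², where R² comes from regressing one predictor on all the others; that definition is brought in here as background and isn't spelled out in the text above. With only two predictors, R² is simply the squared correlation between them, which is the shortcut this sketch uses.

```python
import math

floor = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]
dist = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]


def correlation(u, v):
    """Pearson correlation coefficient between two lists of equal length."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


r = correlation(floor, dist)
tolerance = 1 - r ** 2   # two-predictor shortcut: R^2 equals r^2
vif = 1 / tolerance      # VIF is the inverse of tolerance

print(round(r, 2), round(tolerance, 2), round(vif, 2))
```

Both indexes sit well inside the usual thresholds for this data, so floor area and distance to the station can reasonably stay in the same model.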

Determining the Relative Influence of Predictor Variables on the Outcome Variable

Some people use multiple regression analysis to examine the relative
influence of each predictor variable on the outcome variable.
This is a fairly common and accepted use of multiple regression
analysis, but it’s not always a wise use.

The story below illustrates how one researcher used multiple
regression analysis to assess the relative impact of various factors
on the overall satisfaction of people who bought a certain type of
candy.
Mr. Torikoshi is a product development researcher in a
confectionery company. He recently developed a new soda-flavored
candy, Magic Fizz, that fizzes when wet. The candy is selling
astonishingly well. To find out what makes it so popular, the company
gave free samples of the candy to students at the local university
and asked them to rate the product using the following questionnaire.

Magic Fizz Questionnaire

Please let us know what you thought of Magic Fizz by
answering the following questions. Circle the answer that
best represents your opinion.

Flavor: 1. Unsatisfactory  2. Satisfactory  3. Exceptional
Texture: 1. Unsatisfactory  2. Satisfactory  3. Exceptional
Fizz sensation: 1. Unsatisfactory  2. Satisfactory  3. Exceptional
Package design: 1. Unsatisfactory  2. Satisfactory  3. Exceptional
Overall satisfaction: 1. Unsatisfactory  2. Satisfactory  3. Exceptional

Twenty students returned the questionnaires, and the results
are compiled in Table 3-5. Note that unlike in the Kazami Bakery
example, the values of the outcome variable (overall satisfaction)
are already known. In the bakery problem, the goal was to predict
the outcome variable (monthly sales) of a not-yet-existing store
based on the trends shown by existing stores. In this case, the
purpose of the analysis is to examine the relative effects of the
different predictor variables in order to learn how each of the
predictors (flavor, texture, sensation, design) affects the outcome
(satisfaction).


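Table 3-5: Results of Magic Fizz Questionnaire

| Respondent | Flavor | Texture | Fizz sensation | Package design | Overall satisfaction |
|---|---|---|---|---|---|
| 1 | 2 | 2 | 3 | 2 | 2 |
| 2 | 1 | 1 | 3 | 1 | 3 |
| 3 | 2 | 2 | 1 | 1 | 3 |
| 3 | 2 | 2 | 1 | 1 | 1 |
| 4 | 3 | 3 | 3 | 2 | 2 |
| 5 | 1 | 1 | 2 | 2 | 1 |
| 6 | 1 | 1 | 1 | 1 | 1 |
| 7 | 3 | 3 | 1 | 3 | 3 |
| 8 | 3 | 3 | 1 | 2 | 2 |
| 9 | 3 | 3 | 1 | 2 | 3 |
| 10 | 1 | 1 | 3 | 1 | 1 |
| 11 | 2 | 3 | 2 | 1 | 3 |
| 12 | 2 | 1 | 1 | 1 | 1 |
| 13 | 3 | 3 | 3 | 1 | 3 |
| 14 | 3 | 3 | 1 | 3 | 3 |
| 15 | 3 | 2 | 1 | 1 | 2 |
| 16 | 1 | 1 | 3 | 3 | 1 |
| 17 | 2 | 2 | 2 | 1 | 1 |
| 18 | 1 | 1 | 1 | 3 | 1 |
| 19 | 3 | 1 | 3 | 3 | 3 |
| 20 | 3 | 3 | 3 | 3 | 3 |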

Each of the variables was normalized before the multiple
regression equation was calculated. Normalization reduces the
effect of error or scale, allowing a researcher to compare two
variables more accurately. The resulting equation is

$$
y = 0.41 x_1 + 0.32 x_2 + 0.26 x_3 + 0.11 x_4
$$

where $y$ represents overall satisfaction, $x_1$ represents flavor,
$x_2$ represents texture, $x_3$ represents fizz sensation, and $x_4$
represents package design.
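
To show what normalizing does to a regression equation, here is a Python sketch. It assumes that "normalized" means converting every variable to z-scores (mean 0, standard deviation 1); the text doesn't name the method, so treat that as an assumption. To keep the example short it uses the two-predictor Kazami Bakery data rather than the questionnaire results, and `fit_two` and `zscores` are hypothetical helpers.

```python
import math

floor = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]
dist = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]


def zscores(values):
    """Return the z-scored values and the sample standard deviation."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return [(v - m) / sd for v in values], sd


def fit_two(x1, x2, y):
    """Least-squares fit of y = a1*x1 + a2*x2 + b (two predictors only)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    a1 = (s22 * s1y - s12 * s2y) / det
    a2 = (s11 * s2y - s12 * s1y) / det
    return a1, a2, my - a1 * m1 - a2 * m2


raw_a1, raw_a2, raw_b = fit_two(floor, dist, sales)

z_floor, sd_floor = zscores(floor)
z_dist, sd_dist = zscores(dist)
z_sales, sd_sales = zscores(sales)
std_a1, std_a2, std_b = fit_two(z_floor, z_dist, z_sales)

print(round(std_a1, 2), round(std_a2, 2), round(std_b, 2))
```

After z-scoring, the intercept drops to zero (which is why the Magic Fizz equation has no constant term), and each coefficient equals the raw coefficient multiplied by the ratio of the predictor's standard deviation to the outcome's. That puts the coefficients on a common scale, so tsubo and meters no longer distort the comparison.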
If you compare the partial regression coefficients for the four
predictor variables, you can see that the coefficient for flavor is the
largest. Based on that fact, Mr. Torikoshi concluded that the flavor
has the strongest influence on overall satisfaction.
Mr. Torikoshi’s reasoning does make sense. The outcome vari-
able is equal to the sum of the predictor variables multiplied by
their partial regression coefficients. If you multiply a predictor vari-
able by a higher number, it should have a greater impact on the final
tally, right? Well, sometimes—but it’s not always so simple.

Let’s take a closer look at Mr. Torikoshi’s reasoning as
depicted here:

[Figure: flavor, texture, fizz sensation, and package design each
point directly to overall satisfaction]

In other words, he is assuming that all the variables relate
independently and directly to overall satisfaction. However, this is
not necessarily true. Maybe in reality, the texture influences how
satisfied people are with the flavor, like this:


[Figure: texture points to flavor; fizz sensation, flavor, and
package design point to overall satisfaction]

Structural equation modeling (SEM) is a better method for
comparing the relative impact of various predictor variables on an
outcome. This approach makes more flexible assumptions than
linear regression does, and it can even be used to analyze data sets
with multicollinearity. However, SEM is not a cure-all. It relies on
the assumption that the data is relevant to answering the question
asked.
SEM also assumes that the data is correctly modeled. It’s worth
noting that the questions in this survey ask each reviewer for a
subjective interpretation. If Miu gave the candy two “satisfactory”
and two “exceptional” marks, she might rate her overall satisfac-
tion as either “satisfactory” or “exceptional.” Which rating she
picks might come down to what mood she is in that day!
Risa could rate the four primary categories the same as Miu,
give a different overall satisfaction rating from Miu, and still be
confident that she is giving an unbiased review. Because Miu and
Risa had different thoughts on the final category, our data may
not be correctly modeled. However, structural equation modeling
can still yield useful results by telling us which variables have an
impact on other variables rather than the final outcome.
