Manga Cap 3 PDF
Multiple Regression Analysis
Predicting with Many Variables
Well...
...I have my
reasons.
Risa...
Oh, that's
her.
pant
Over here!
Wheeze
Sorry
I'm late!
My class
ended late.
I ran.
It's okay.
We just got
here too.
Oh! Who's your friend?
Hello!
This is Kazami. He's in my history class.
Oh, yeah.
Ah! Nice to meet you.
We're going to cover multiple regression analysis today, and Kazami brought us some data to analyze.
Thanks for helping me.
No problem. This is going to help me, too.
Kazami bakery? Of course! They're delicious.
Definitely. Theirs are the best!
Risa, don't
be dramatic.
[Map of shop locations: Suidobashi Station, Rokujo Station, Post Office, Kikyou Town, Hashimoto Station, Wakaba Riverside, Misato, Isebashi]
We're planning to open a new one soon.
We’re going to
predict the sales Wow!
of the new shop
using multiple
regression
analysis.
That's right.
[Equation labels: outcome variable, predictor variables, partial regression coefficients]
every x variable has its own a, but there's still just one intercept.
Yep. Just one b.
[Diagram: regression analysis and multiple regression analysis, each with one outcome variable]
they're similar, but not exactly the same.
well...
We have to look at each predictor alone and all of them together.
I'll write that down.
have you gotten results yet? Can I see?
first, let's look at the data from the shops already in business.
not yet.
Be patient.
scribble
hold on...
scribble
done!
hmm...
totally.
looks like monthly sales are higher in bigger shops and in shops near a train station.
both of these affect monthly sales.
That’s right!
Let’s review.
Come on! sharpen your pencils and get ready!
aha! this is where we use the matrices...
here, I
labeled
everything.
click
!
you
wimps.
click!
so here's the equation.
you are a genius.
save it, lazybones.
finally! Pay dirt.
[Equation labels: monthly sales, floor space, distance to the nearest station]
you should
write this
down.
there's
one more
thing...
…that will help
you understand the
multiple regression
equation and multiple
regression analysis.
what
is it?
oh yeah! when
we plot our
equation, the line
passes through
the averages.
good memory!
we don't need Se
yet, but we will
use it later.
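$$R = \frac{\text{sum of } (y - \bar{y})(\hat{y} - \bar{\hat{y}})}{\sqrt{\text{sum of } (y - \bar{y})^2 \times \text{sum of } (\hat{y} - \bar{\hat{y}})^2}} = \frac{S_{y\hat{y}}}{\sqrt{S_{yy} \times S_{\hat{y}\hat{y}}}} = \frac{72026.6}{\sqrt{76199.6 \times 72026.6}} = .9722$$

$$R^2 = (.9722)^2 = .9452$$

R2 is .9452.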
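As a quick check of the arithmetic above, here is a minimal Python sketch that recomputes R and R2 from the three sums quoted in the calculation. The variable names are illustrative choices, not notation from the book.

```python
import math

# Sums quoted in the calculation above.
S_yy = 76199.6          # sum of (y - mean of y)^2
S_yhat_yhat = 72026.6   # sum of (yhat - mean of yhat)^2
S_y_yhat = 72026.6      # sum of (y - mean of y) * (yhat - mean of yhat)

# Multiple correlation coefficient and its square.
R = S_y_yhat / math.sqrt(S_yy * S_yhat_yhat)
R2 = R ** 2

print(f"R = {R:.4f}, R2 = {R2:.4f}")
```

Rounded to four decimals, the result agrees with the values in the text.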
okay.
Okay, R2 is .9452!
Well, the trouble is...
what?
huh?!
age is now the third predictor variable.
after adding this variable...
why would age matter?
I knew it! the age of the shop manager has nothing to do with monthly sales!
So what was the point of all those calculations?
what? another R2?
Adjusted R2
too many layers!
it's worse than R2!
Um, I
think
so...
go miu!
let's
see...
Adjusted R2 with 2 predictor variables:

$$1 - \frac{S_e / (\text{sample size} - \text{number of predictor variables} - 1)}{S_{yy} / (\text{sample size} - 1)} = 1 - \frac{4173.0 / (10 - 2 - 1)}{76199.6 / (10 - 1)} = .9296$$

I got it!
Adjusted R2 with 3 predictor variables:

It's 3846.4.

$$1 - \frac{S_e / (\text{sample size} - \text{number of predictor variables} - 1)}{S_{yy} / (\text{sample size} - 1)} = 1 - \frac{3846.4 / (10 - 3 - 1)}{76199.6 / (10 - 1)} = .9243$$

Wait a minute...
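The two adjusted R2 calculations above differ only in Se and the number of predictors, so they fit naturally into one small function. The following is a sketch; the function name `adjusted_r2` and its parameter names are assumptions made for illustration.

```python
def adjusted_r2(S_e, S_yy, sample_size, num_predictors):
    """Adjusted R2 = 1 - (Se / (n - p - 1)) / (Syy / (n - 1))."""
    residual_part = S_e / (sample_size - num_predictors - 1)
    total_part = S_yy / (sample_size - 1)
    return 1 - residual_part / total_part

# Values taken from the two calculations above.
two_predictors = adjusted_r2(4173.0, 76199.6, 10, 2)    # floor area, distance
three_predictors = adjusted_r2(3846.4, 76199.6, 10, 3)  # plus manager's age

print(f"{two_predictors:.4f} {three_predictors:.4f}")
```

Adding the third predictor lowers Se slightly, but the penalty for the extra variable is larger, so the adjusted value drops.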
see? adjusted R2
to the rescue!
hey, look at this.
adjusted R2 is smaller than R2 in both cases. Is it always smaller?
adjusted
R2 is
awesome.
Yes, but in multiple regression analysis, we have partial regression coefficients, instead.
do you remember how we did the hypothesis testing before?
I think so. we tested whether the population matched the equation and then checked that A didn't equal zero.
right! it's basically the same with multiple regression.
alternative hypothesis:
If the floor area of the shop is x1 tsubo and the distance to the nearest station is x2 meters, the monthly sales follow a normal distribution with mean A1x1 + A2x2 + B and standard deviation σ.
now, we have more than one x and more than one A. at least one of these A's must not equal zero.
I see!
the multiple regression equation is y = 41.5x1 − 0.3x2 + 65.3, so...
A1 is approximately 41.5.
A2 is approximately –0.3.
B is approximately 65.3.
[Diagram labels: null hypothesis, alternative hypothesis]
Step 1  Define the population.
        The population is all Kazami Bakery shops.
Step 2  Set up a null hypothesis and an alternative hypothesis.
        Null hypothesis is A1 = 0 and A2 = 0.
        Alternative hypothesis is that A1 or A2 or both ≠ 0.
Step 3  Select which hypothesis test to conduct.
        We'll use an F‑test.
Step 4  Choose the significance level.
        We'll use a significance level of .05.
Step 5  Calculate the test statistic from the sample data.
        The test statistic is:

$$\frac{S_{yy} - S_e}{\text{number of predictor variables}} \div \frac{S_e}{\text{sample size} - \text{number of predictor variables} - 1} = \frac{76199.6 - 4173.0}{2} \div \frac{4173.0}{10 - 2 - 1} = 60.4$$
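The Step 5 arithmetic for the overall test can be reproduced with a short Python sketch. The function name `overall_f` is illustrative. The critical value 4.74 for F(2, 7) at the .05 level does not appear in the text above; it is an assumption taken from a standard F table.

```python
def overall_f(S_yy, S_e, sample_size, num_predictors):
    """F statistic for the null hypothesis that all partial regression
    coefficients are zero: ((Syy - Se) / p) / (Se / (n - p - 1))."""
    explained = (S_yy - S_e) / num_predictors
    residual = S_e / (sample_size - num_predictors - 1)
    return explained / residual

f_stat = overall_f(76199.6, 4173.0, 10, 2)

# Assumption: critical value of F(2, 7) at the .05 level, read from a
# standard F table (not given in the text).
F_CRITICAL_2_7 = 4.74

reject_null = f_stat > F_CRITICAL_2_7
print(f"F = {f_stat:.1f}, reject null hypothesis: {reject_null}")
```

Under that assumed critical value, the statistic is far inside the rejection region.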
Step 1  Define the population.
        The population is all Kazami Bakery shops.
Step 2  Set up a null hypothesis and an alternative hypothesis.
        Null hypothesis is A1 = 0.
        Alternative hypothesis is A1 ≠ 0.
Step 3  Select which hypothesis test to conduct.
        We'll use an F‑test.
Step 4  Choose the significance level.
        We'll use a significance level of .05.
Step 5  Calculate the test statistic from the sample data.
        The test statistic is:

$$\frac{a_1^2}{S^{11}} \div \frac{S_e}{\text{sample size} - \text{number of predictor variables} - 1} = \frac{41.5^2}{0.0657} \div \frac{4173.0}{10 - 2 - 1} = 44$$

        The test statistic will follow an F distribution with first degree of freedom 1 and second degree of freedom 7 (sample size minus the number of predictor variables minus 1), if the null hypothesis is true. (The value of $S^{11}$ will be explained on the next page.)
Step 6  Determine whether the p-value for the test statistic obtained in Step 5 is smaller than the significance level.
        At significance level .05, with d1 being 1 and d2 being 7, the critical value is 5.5914. Our test statistic is 44.
Step 7  Decide whether you can reject the null hypothesis.
        Since our test statistic is greater than the critical value, we reject the null hypothesis.
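Steps 5 through 7 of the test on a single partial regression coefficient can be sketched in Python as follows. All numbers come from the steps above; the function and variable names are illustrative.

```python
def coefficient_f(a_i, inverse_diag, S_e, sample_size, num_predictors):
    """F statistic for the null hypothesis A_i = 0:
    (a_i^2 / S^ii) / (Se / (n - p - 1)), where S^ii is the matching
    diagonal element of the inverse matrix."""
    return (a_i ** 2 / inverse_diag) / (S_e / (sample_size - num_predictors - 1))

f_a1 = coefficient_f(41.5, 0.0657, 4173.0, 10, 2)

# Critical value for F(1, 7) at the .05 level, as quoted in Step 6.
F_CRITICAL_1_7 = 5.5914

reject_null = f_a1 > F_CRITICAL_1_7
print(f"F = {f_a1:.0f}, reject null hypothesis: {reject_null}")
```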
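In the test statistic

$$\frac{a_1^2}{S^{11}} \div \frac{S_e}{\text{sample size} - \text{number of predictor variables} - 1}$$

the value $S^{11}$ is the first diagonal element of the inverse matrix (rows and columns ordered as floor space, distance to the nearest station):

$$\begin{pmatrix} 0.0657 & \cdots \\ \cdots & 0.00001 \end{pmatrix}$$

The second diagonal element, 0.00001, is $S^{22}$.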
so A1 doesn't
equal zero! We
can reject the
null hypothesis.
you really
did it!
You're my
hero, miu!
what's next? was it something about confidence?
yes, that's right. we're going to calculate confidence intervals.
but this
time...
mahala... what?
slumber party?
you really want to?
no! I changed my mind!
eep! no math!
well then, i guess we'll have to find out the confidence intervals using data analysis software.
thank you, computer!
sorry!
you are
such a jerk
sometimes!
I thought it
was funny!
sorry
...
For a 10 tsubo shop that's 80 m from a station, the confidence interval is 453.2 ± 34.9.*
So first we do 453.2 + 34.9... and then 453.2 − 34.9.
I got it! We've found out that the average shop earns between ¥4,183,000 and ¥4,881,000, right?
precisely!
A shop in Isebashi?
That's close to my house!
can you predict the sales, miu?
¥4,473,000 per month!
you're a genius, miu! I should name the shop after you.
you should probably name it after risa...
yes, it's
similar.
smirk
Could we...maybe,
please use your
computer again?
Just once more?
If you
insist.
the confidence level is 95%, so predicted sales...
...are between ¥3,751,000 and ¥5,109,000.
not bad!
a better equation?!
we need to check
whether there's
a better multiple
regression
equation!
what's wrong
with the one we
have? What about
my prediction?
Meaningless!
now who's
being dramatic?
like the age of the shop manager? we used that, even though it didn't have any effect on sales!
the equation becomes complicated if you have too many predictor variables.
so many ingredients...
this soup is a mess.
[Illustration: "predictor stew" with ingredients labeled height of the ceiling, age of the shop manager, number of trays, number of seats]
just as with simple regression analysis, we can calculate a multiple regression equation using any variables we have data on, whether or not they actually affect the outcome variable.
exactly.
• forward selection
• backward elimination
• forward-backward stepwise selection
• Ask a domain expert which variables are the most important
round robin?
a fat bird?
Is our equation the winner?
we'll make a table that shows the partial regression coefficients and adjusted R2...
...like this.
presto!
Predictor
variables        a1      a2      a3      b         Adjusted R2
1                54.9                    –91.3     .7709
2                        –0.6            424.8     .5508
3                                0.6     309.1     .0000
1 and 2          41.5    –0.3            65.3      .9296
1 and 3          55.6            2.0     –170.1    .7563
2 and 3                  –0.6    –0.4    438.9     .4873
1 and 2 and 3    42.2    –0.3    1.1     17.7      .9243
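Choosing the winner from the table above amounts to picking the row with the largest adjusted R2. Here is a minimal Python sketch of that comparison. The dictionary layout is an illustrative choice, and the value for predictor 1 alone is entered as .7709 on the assumption that the printed ".07709" is a typo. The outcome does not depend on that assumption.

```python
# Adjusted R2 for every combination of predictors, keyed by predictor numbers
# (1 = floor area, 2 = distance to the nearest station, 3 = manager's age).
adjusted_r2_by_model = {
    (1,): 0.7709,       # assumed reading of the printed ".07709"
    (2,): 0.5508,
    (3,): 0.0000,
    (1, 2): 0.9296,
    (1, 3): 0.7563,
    (2, 3): 0.4873,
    (1, 2, 3): 0.9243,
}

# The best model is the one with the highest adjusted R2.
best_model = max(adjusted_r2_by_model, key=adjusted_r2_by_model.get)
print(best_model, adjusted_r2_by_model[best_model])
```

This brute-force comparison is only practical because there are three candidate predictors, which gives seven possible equations.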
so our equation is the best!
now we really know that y = 41.5x1 – 0.3x2 + 65.3 does a good job at predicting the sales at the new shop.
we rock.
That’s right!
Good work,
folks!
I think
I may have
learned
something,
too.
you can pay us in croissants!
I earned one, too!
oh well... I can't say no to that.
risa is really cool, isn't she?
she sure is.
hey, miu!
let's go, slowpokes!
Assessing Populations with Multiple Regression Analysis
x1 = floor area of the shop (tsubo); x2 = distance to the nearest station (m); y = monthly sales (¥10,000); ŷ = 41.5x1 − 0.3x2 + 65.3 (predicted monthly sales); y − ŷ = residual.

Bakery                     x1    x2     y      ŷ        y − ŷ    Standardized residual
Yumenooka Shop             10    80     469    453.2    15.8     0.8
Terai Station Shop         8     0      366    397.4    –31.4    –1.6
Sone Shop                  8     200    371    329.3    41.7     1.8
Hashimoto Station Shop     5     200    208    204.7    3.3      0.2
Kikyou Town Shop           7     300    246    253.7    –7.7     –0.4
Post Office Shop           8     230    297    319.0    –22.0    –1.0
Suidobashi Station Shop    7     40     363    342.3    20.7     1.0
Rokujo Station Shop        9     0      436    438.9    –2.9     –0.1
Wakaba Riverside Shop      6     330    198    201.9    –3.9     –0.2
Misato Shop                9     180    364    377.6    –13.6    –0.6
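The coefficients of ŷ = 41.5x1 − 0.3x2 + 65.3 can be recovered from the table above by ordinary least squares. The sketch below solves the two-predictor normal equations directly in plain Python. The helper names (`mean`, `cross_sum`) and the choice to work with centered sums are illustrative, and the book's own calculation may be organized differently (it mentions matrices).

```python
floor_area = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]                  # x1, tsubo
distance = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]       # x2, meters
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]    # y

def mean(values):
    return sum(values) / len(values)

def cross_sum(u, v):
    """Sum of (u - mean(u)) * (v - mean(v)); with u == v, a sum of squares."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

S11 = cross_sum(floor_area, floor_area)
S22 = cross_sum(distance, distance)
S12 = cross_sum(floor_area, distance)
S1y = cross_sum(floor_area, sales)
S2y = cross_sum(distance, sales)

# Solve the 2x2 normal equations by Cramer's rule.
det = S11 * S22 - S12 ** 2
a1 = (S22 * S1y - S12 * S2y) / det
a2 = (S11 * S2y - S12 * S1y) / det
b = mean(sales) - a1 * mean(floor_area) - a2 * mean(distance)

predicted = [a1 * x1 + a2 * x2 + b for x1, x2 in zip(floor_area, distance)]
S_e = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predicted))

print(f"a1 = {a1:.1f}, a2 = {a2:.1f}, b = {b:.1f}, Se = {S_e:.1f}")
```

The unrounded a2 is closer to −0.34 than to −0.3, which is why the ŷ column in the table differs slightly from what the rounded equation gives.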
Mahalanobis Distance
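Step 1
Obtain the inverse matrix of

$$\begin{pmatrix} S_{11} & S_{12} & \cdots & S_{1p} \\ S_{21} & S_{22} & \cdots & S_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ S_{p1} & S_{p2} & \cdots & S_{pp} \end{pmatrix}$$

which is

$$\begin{pmatrix} S_{11} & S_{12} & \cdots & S_{1p} \\ S_{21} & S_{22} & \cdots & S_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ S_{p1} & S_{p2} & \cdots & S_{pp} \end{pmatrix}^{-1} = \begin{pmatrix} S^{11} & S^{12} & \cdots & S^{1p} \\ S^{21} & S^{22} & \cdots & S^{2p} \\ \vdots & \vdots & \ddots & \vdots \\ S^{p1} & S^{p2} & \cdots & S^{pp} \end{pmatrix}.$$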
and the values of Sii and Sij obtained from conducting individual
tests of the partial regression coefficients are always the same.
That is, the values of Sii and Sij found through partial regression
will be equivalent to the values of Sii and Sij found by calculating
the inverse matrix.
Step 2
Next we need to calculate the square of Mahalanobis distance for a given point using the following equation:

$$D_M^2(x) = (x - \bar{x})^T \left(S^{-1}\right) (x - \bar{x})$$

$$\begin{aligned} D^2 = {} & (x_1 - \bar{x}_1)(x_1 - \bar{x}_1) S^{11} + (x_1 - \bar{x}_1)(x_2 - \bar{x}_2) S^{12} + \cdots + (x_1 - \bar{x}_1)(x_p - \bar{x}_p) S^{1p} \\ & + (x_2 - \bar{x}_2)(x_1 - \bar{x}_1) S^{21} + (x_2 - \bar{x}_2)(x_2 - \bar{x}_2) S^{22} + \cdots + (x_2 - \bar{x}_2)(x_p - \bar{x}_p) S^{2p} \\ & + \cdots \end{aligned}$$
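For two predictors, the quadratic form in Step 2 can be evaluated by hand-inverting a 2x2 matrix. The sketch below does this in plain Python for a 10 tsubo shop that is 80 m from a station. It assumes that S is the sample covariance matrix of the predictors (sums of squares and products divided by sample size − 1), an assumption that reproduces the D2 ≈ 2.4 used in the book's confidence interval calculation. The function name and data layout are illustrative.

```python
floor_area = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]                # x1, tsubo
distance = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]     # x2, meters

def mahalanobis_squared(x1, x2, xs1, xs2):
    """Squared Mahalanobis distance of the point (x1, x2) from the mean of
    the data, assuming S is the sample covariance matrix."""
    n = len(xs1)
    m1 = sum(xs1) / n
    m2 = sum(xs2) / n
    # Sample variances and covariance (divide by n - 1).
    c11 = sum((a - m1) ** 2 for a in xs1) / (n - 1)
    c22 = sum((b - m2) ** 2 for b in xs2) / (n - 1)
    c12 = sum((a - m1) * (b - m2) for a, b in zip(xs1, xs2)) / (n - 1)
    # Inverse of the 2x2 covariance matrix.
    det = c11 * c22 - c12 ** 2
    inv11, inv22, inv12 = c22 / det, c11 / det, -c12 / det
    # Quadratic form (x - mean)^T S^-1 (x - mean).
    d1, d2 = x1 - m1, x2 - m2
    return d1 * d1 * inv11 + 2 * d1 * d2 * inv12 + d2 * d2 * inv22

D2 = mahalanobis_squared(10, 80, floor_area, distance)
print(f"D2 = {D2:.1f}")
```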
The minimum value of the confidence interval is the same distance from the mean as the maximum value of the interval. In other words, the confidence interval “straddles” the mean equally on each side. We calculate the distance from the mean as shown below (D2 stands for the square of the Mahalanobis distance, and x represents the total number of predictors, not a value of some predictor):

$$\sqrt{F(1, \text{sample size} - x - 1; .05) \times \left(\frac{1}{\text{sample size}} + \frac{D^2}{\text{sample size} - 1}\right) \times \frac{S_e}{\text{sample size} - x - 1}}$$

$$= \sqrt{F(1, 10 - 2 - 1; .05) \times \left(\frac{1}{10} + \frac{2.4}{10 - 1}\right) \times \frac{4173.0}{10 - 2 - 1}} = 35$$

For the prediction interval, the distance has an extra 1 inside the parentheses:

$$\sqrt{F(1, \text{sample size} - x - 1; .05) \times \left(1 + \frac{1}{\text{sample size}} + \frac{D^2}{\text{sample size} - 1}\right) \times \frac{S_e}{\text{sample size} - x - 1}}$$
You can see that if you want to be more confident that the prediction interval will include the actual outcome, the interval needs to be larger.
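The two interval half-widths differ only by the extra 1 inside the parentheses, which a short Python sketch makes explicit. The critical value 5.5914 for F(1, 7) at the .05 level is the one quoted in the hypothesis-test steps earlier. The function name and the `prediction` flag are illustrative, and the square root is assumed from the numerical result of about 35.

```python
import math

F_CRITICAL_1_7 = 5.5914  # F(1, 7; .05), as quoted in the hypothesis test

def interval_half_width(F, sample_size, num_predictors, D2, S_e, prediction=False):
    """Distance from the predicted value to either end of the interval.
    With prediction=True, the extra 1 widens it to a prediction interval."""
    extra = 1.0 if prediction else 0.0
    spread = extra + 1 / sample_size + D2 / (sample_size - 1)
    residual_variance = S_e / (sample_size - num_predictors - 1)
    return math.sqrt(F * spread * residual_variance)

confidence_half = interval_half_width(F_CRITICAL_1_7, 10, 2, 2.4, 4173.0)
prediction_half = interval_half_width(F_CRITICAL_1_7, 10, 2, 2.4, 4173.0,
                                      prediction=True)

print(f"confidence: {confidence_half:.0f}, prediction: {prediction_half:.1f}")
```

For the same shop, the prediction interval is always wider, because it also accounts for the scatter of an individual month's sales around the mean.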
Less than 8 tsubo = 0    Less than 200 m = 0    Does not offer samples = 0
8 tsubo or more = 1      200 m or more = 1      Offers samples = 1
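The 0/1 coding above can be expressed as a small Python function. This is a sketch only: the function name, argument names, and the tuple return format are assumptions made for illustration.

```python
def encode_shop(floor_area_tsubo, distance_m, offers_samples):
    """Convert a shop's attributes to the 0/1 codes listed above."""
    area_code = 1 if floor_area_tsubo >= 8 else 0      # 8 tsubo or more = 1
    distance_code = 1 if distance_m >= 200 else 0      # 200 m or more = 1
    samples_code = 1 if offers_samples else 0          # offers samples = 1
    return (area_code, distance_code, samples_code)

# Hypothetical shops, used only to exercise the coding rules.
print(encode_shop(10, 80, False))
print(encode_shop(7, 200, True))
```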
Multicollinearity
The story below illustrates how one researcher used multiple
regression analysis to assess the relative impact of various factors
on the overall satisfaction of people who bought a certain type of
candy.
Mr. Torikoshi is a product development researcher in a confectionery
company. He recently developed a new soda-flavored candy,
Magic Fizz, that fizzes when wet. The candy is selling astonishingly
well. To find out what makes it so popular, the company gave free
samples of the candy to students at the local university and asked
them to rate the product using the following questionnaire.
Determining the Relative Influence of Predictor Variables on the Outcome Variable
Let’s take a closer look at Mr. Torikoshi’s reasoning as
depicted here:
[Figure labels: Texture, Overall Satisfaction]