
Chapter 1 Simple Linear Regression (part 4)

Analysis of Variance (ANOVA) approach to regression analysis

Recall the model again:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad i = 1, \dots, n.$$
The observations can be written as

  obs   1       2       ...   n
  Y     $Y_1$   $Y_2$   ...   $Y_n$
  X     $X_1$   $X_2$   ...   $X_n$

The deviation of each $Y_i$ from the mean $\bar{Y}$ is $Y_i - \bar{Y}$.

The fitted values $\hat{Y}_i = b_0 + b_1 X_i$, $i = 1, \dots, n$, come from the regression and are determined by $X_i$. Their mean is
$$\frac{1}{n} \sum_{i=1}^n \hat{Y}_i = \bar{Y} \quad \text{(why?)}$$
Thus the deviation of $\hat{Y}_i$ from its mean is $\hat{Y}_i - \bar{Y}$.

The residuals $e_i = Y_i - \hat{Y}_i$ have mean $\bar{e} = 0$. Thus the deviation of $e_i$ from its mean is $e_i = Y_i - \hat{Y}_i$.
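Both facts (the fitted values average to $\bar{Y}$, and the residuals average to zero) can be checked numerically. A minimal R sketch on simulated data (the data and the names x, y, fit are our own illustration, not from the notes):

  set.seed(1)
  x <- runif(20)                 # simulated predictor
  y <- 2 + 3 * x + rnorm(20)     # simulated response
  fit <- lm(y ~ x)               # fit the simple linear regression
  mean(fitted(fit)) - mean(y)    # ~ 0: the fitted values have mean Ybar
  mean(resid(fit))               # ~ 0: the residuals have mean ebar = 0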

Write $Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + e_i$:

  obs              total deviation                    deviation due to the regression              deviation due to the error
                   $Y_i - \bar{Y}$                    $\hat{Y}_i - \bar{Y}$                        $e_i = Y_i - \hat{Y}_i$
  1                $Y_1 - \bar{Y}$                    $\hat{Y}_1 - \bar{Y}$                        $e_1 - \bar{e} = e_1$
  2                $Y_2 - \bar{Y}$                    $\hat{Y}_2 - \bar{Y}$                        $e_2 - \bar{e} = e_2$
  ...              ...                                ...                                          ...
  n                $Y_n - \bar{Y}$                    $\hat{Y}_n - \bar{Y}$                        $e_n - \bar{e} = e_n$
  Sum of squares   $\sum_{i=1}^n (Y_i - \bar{Y})^2$   $\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2$       $\sum_{i=1}^n e_i^2$
                   Total sum of squares (SST)         Sum of squares due to regression (SSR)       Sum of squares of error/residuals (SSE)

We have
$$\underbrace{\sum_{i=1}^n (Y_i - \bar{Y})^2}_{SST} = \underbrace{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}_{SSR} + \underbrace{\sum_{i=1}^n e_i^2}_{SSE}.$$

Proof:
$$\begin{aligned}
\sum_{i=1}^n (Y_i - \bar{Y})^2
&= \sum_{i=1}^n (\hat{Y}_i - \bar{Y} + Y_i - \hat{Y}_i)^2 \\
&= \sum_{i=1}^n \left\{ (\hat{Y}_i - \bar{Y})^2 + (Y_i - \hat{Y}_i)^2 + 2 (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i) \right\} \\
&= SSR + SSE + 2 \sum_{i=1}^n (\hat{Y}_i - \bar{Y}) e_i \\
&= SSR + SSE + 2 \sum_{i=1}^n (b_0 + b_1 X_i - \bar{Y}) e_i \\
&= SSR + SSE + 2 b_0 \sum_{i=1}^n e_i + 2 b_1 \sum_{i=1}^n X_i e_i - 2 \bar{Y} \sum_{i=1}^n e_i \\
&= SSR + SSE,
\end{aligned}$$
where the last step uses $\sum_{i=1}^n e_i = 0$ and $\sum_{i=1}^n X_i e_i = 0$ (the normal equations).

It is also easy to check that


$$SSR = \sum_{i=1}^n (b_0 + b_1 X_i - b_0 - b_1 \bar{X})^2 = b_1^2 \sum_{i=1}^n (X_i - \bar{X})^2. \tag{1}$$
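Both the decomposition and identity (1) are easy to verify numerically. A minimal R sketch on simulated data (the names x, y, fit are our own illustration):

  set.seed(1)
  x <- runif(20); y <- 2 + 3 * x + rnorm(20)
  fit <- lm(y ~ x)
  SST <- sum((y - mean(y))^2)
  SSR <- sum((fitted(fit) - mean(y))^2)
  SSE <- sum(resid(fit)^2)
  SST - (SSR + SSE)                            # ~ 0: SST = SSR + SSE
  coef(fit)[2]^2 * sum((x - mean(x))^2) - SSR  # ~ 0: identity (1)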

Breakdown of the degrees of freedom

The degrees of freedom for SST is $n - 1$: noticing that $Y_1 - \bar{Y}, \dots, Y_n - \bar{Y}$ have one constraint, $\sum_{i=1}^n (Y_i - \bar{Y}) = 0$.

The degrees of freedom for SSR is 1: noticing that $\hat{Y}_i = b_0 + b_1 X_i$, so the fitted values all lie on one straight line (see Figure 1).

[Figure 1: A figure showing the degrees of freedom; its panels plot Y, the fitted values $\hat{y}$, and the residuals $e$ against X.]

The degrees of freedom for SSE is $n - 2$: noticing that $e_1, \dots, e_n$ have TWO constraints, $\sum_{i=1}^n e_i = 0$ and $\sum_{i=1}^n X_i e_i = 0$ (i.e., the normal equations).

Mean (of) Squares
$$MSR = SSR/1, \quad \text{called the regression mean square},$$
$$MSE = SSE/(n-2), \quad \text{called the error mean square}.$$

Analysis of variance (ANOVA) table

Based on the breakdown, we write it as a table:

  Source of variation   SS                                             df      MS                  F-value            P(>F)
  Regression            $SSR = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2$   1       $MSR = SSR/1$       $F^* = MSR/MSE$    p-value
  Error                 $SSE = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2$       n - 2   $MSE = SSE/(n-2)$
  Total                 $SST = \sum_{i=1}^n (Y_i - \bar{Y})^2$         n - 1

R command for the calculation: anova(object, ...), where object is the output of a regression (a fitted lm object).
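For instance (a hypothetical simulated dataset; lm() and anova() are the standard R functions):

  set.seed(7)
  X <- runif(25)
  Y <- 10 + 5 * X + rnorm(25)
  fit <- lm(Y ~ X)   # the regression "object"
  anova(fit)         # prints Df, Sum Sq, Mean Sq, F value, Pr(>F) for X and Residuals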

Expected Mean Squares
$$E(MSE) = \sigma^2 \quad \text{and} \quad E(MSR) = \sigma^2 + \beta_1^2 \sum_{i=1}^n (X_i - \bar{X})^2.$$
[Proof: the first equation was proved (where?). By (1), we have
$$\begin{aligned}
E(MSR) &= E(b_1^2) \sum_{i=1}^n (X_i - \bar{X})^2 = \left[ Var(b_1) + (E b_1)^2 \right] \sum_{i=1}^n (X_i - \bar{X})^2 \\
&= \left[ \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2} + \beta_1^2 \right] \sum_{i=1}^n (X_i - \bar{X})^2 = \sigma^2 + \beta_1^2 \sum_{i=1}^n (X_i - \bar{X})^2.]
\end{aligned}$$
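These two expectations can be checked by Monte Carlo simulation. A minimal sketch, assuming the (hypothetical) true values $\beta_0 = 1$, $\beta_1 = 2$, $\sigma^2 = 4$:

  set.seed(2)
  n <- 30; x <- seq(0, 1, length.out = n)
  beta0 <- 1; beta1 <- 2; sigma2 <- 4
  msr <- mse <- numeric(2000)
  for (r in 1:2000) {
    y <- beta0 + beta1 * x + rnorm(n, sd = sqrt(sigma2))
    a <- anova(lm(y ~ x))
    msr[r] <- a["x", "Mean Sq"]
    mse[r] <- a["Residuals", "Mean Sq"]
  }
  mean(mse)                                 # ~ sigma2
  mean(msr)                                 # ~ sigma2 + beta1^2 * sum((x - mean(x))^2)
  sigma2 + beta1^2 * sum((x - mean(x))^2)   # the theoretical value for comparison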

F-test of $H_0$: $\beta_1 = 0$

Consider the hypothesis test
$$H_0: \beta_1 = 0, \qquad H_a: \beta_1 \neq 0.$$
Note that $\hat{Y}_i = b_0 + b_1 X_i$ and $SSR = b_1^2 \sum_{i=1}^n (X_i - \bar{X})^2$. If $b_1 = 0$ then $SSR = 0$ (why?). Thus we can test $\beta_1 = 0$ based on SSR; i.e., under $H_0$, SSR (or MSR) should be small. We consider the F-statistic
$$F^* = \frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)}.$$
Under $H_0$, $F^* \sim F(1, n-2)$. For a given significance level $\alpha$, our criterion is:

If $F^* \leq F(1-\alpha; 1, n-2)$ (i.e., indeed small), accept $H_0$.
If $F^* > F(1-\alpha; 1, n-2)$ (i.e., not small), reject $H_0$,

where $F(1-\alpha; 1, n-2)$ is the $(1-\alpha)$ quantile of the F distribution. We can also do the test based on the p-value $= P(F > F^*)$:

If p-value $\geq \alpha$, accept $H_0$.
If p-value $< \alpha$, reject $H_0$.
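This criterion can be computed by hand in R, as a sketch (simulated data; qf() and pf() give the F quantile and tail probability):

  set.seed(3)
  n <- 25
  x <- rnorm(n); y <- 1 + 0.5 * x + rnorm(n)
  fit <- lm(y ~ x)
  SSE <- sum(resid(fit)^2)
  SSR <- sum((fitted(fit) - mean(y))^2)
  Fstar <- (SSR / 1) / (SSE / (n - 2))       # F* = MSR / MSE
  alpha <- 0.05
  Fstar > qf(1 - alpha, 1, n - 2)            # TRUE => reject H0
  pf(Fstar, 1, n - 2, lower.tail = FALSE)    # p-value = P(F > F*)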

Example 2.1 For the example above (with n = 25, in part 3), we fit a model $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ (by R code) and obtain the following output:

  Analysis of Variance Table

  Response: Y
            Df Sum Sq Mean Sq F value    Pr(>F)
  X          1 252378  252378  105.88 4.449e-10 ***
  Residuals 23  54825    2384

Suppose we need to test $H_0$: $\beta_1 = 0$ with significance level 0.01. Based on the calculation, the p-value is $4.449 \times 10^{-10} < 0.01$, so we should reject $H_0$.

Equivalence of F-test and t-test

We have two methods to test $H_0$: $\beta_1 = 0$ versus $H_1$: $\beta_1 \neq 0$. Recall $SSR = b_1^2 \sum_{i=1}^n (X_i - \bar{X})^2$. Thus
$$F^* = \frac{SSR/1}{SSE/(n-2)} = \frac{b_1^2 \sum_{i=1}^n (X_i - \bar{X})^2}{MSE}.$$
But since $s^2(b_1) = MSE / \sum_{i=1}^n (X_i - \bar{X})^2$ (where?), we have under $H_0$,
$$F^* = \frac{b_1^2}{s^2(b_1)} = \left( \frac{b_1}{s(b_1)} \right)^2 = (t^*)^2.$$
Thus
$$F^* > F(1-\alpha; 1, n-2) \iff (t^*)^2 > (t(1-\alpha/2; n-2))^2 \iff |t^*| > t(1-\alpha/2; n-2),$$
and
$$F^* \leq F(1-\alpha; 1, n-2) \iff (t^*)^2 \leq (t(1-\alpha/2; n-2))^2 \iff |t^*| \leq t(1-\alpha/2; n-2).$$
(You can check in the statistical table that $F(1-\alpha; 1, n-2) = (t(1-\alpha/2; n-2))^2$.) Therefore, the test results based on the F and t statistics are the same. (But ONLY for the simple linear regression model.)
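The equivalence can also be seen numerically, as a sketch (simulated data; the names are ours):

  set.seed(4)
  x <- rnorm(30); y <- 2 + x + rnorm(30)
  fit <- lm(y ~ x)
  tstar <- summary(fit)$coefficients["x", "t value"]
  Fstar <- anova(fit)["x", "F value"]
  tstar^2 - Fstar                              # ~ 0: (t*)^2 = F*
  qt(1 - 0.05/2, 28)^2 - qf(1 - 0.05, 1, 28)   # ~ 0: the quantiles also match (n - 2 = 28)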

General linear test approach

To test $H_0$: $\beta_1 = 0$, we can compare two models:
$$\text{Full model}: Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad \text{and} \qquad \text{Reduced model}: Y_i = \beta_0 + \varepsilon_i.$$
Denote the SSR of the FULL and REDUCED models by SSR(F) and SSR(R) respectively (and similarly SSE(F), SSE(R)). We have immediately
$$SSR(F) \geq SSR(R) \quad \text{or} \quad SSE(F) \leq SSE(R).$$
A question: when does the equality hold?

Note that if $H_0$: $\beta_1 = 0$ holds, then
$$\frac{SSE(R) - SSE(F)}{SSE(F)} \quad \text{should be small.}$$
Considering the degrees of freedom, define
$$F^* = \frac{(SSE(R) - SSE(F)) / (df_R - df_F)}{SSE(F) / df_F},$$
which should be small under $H_0$, where $df_R$ and $df_F$ indicate the degrees of freedom of SSE(R) and SSE(F) respectively. Under $H_0$: $\beta_1 = 0$, it can be proved that $F^* \sim F(df_R - df_F, df_F)$. Suppose we get the F value $F^*$; then:

If $F^* \leq F(1-\alpha; df_R - df_F, df_F)$, accept $H_0$.
If $F^* > F(1-\alpha; df_R - df_F, df_F)$, reject $H_0$.

Similarly, based on the p-value $= P(F > F^*)$:

If p-value $\geq \alpha$, accept $H_0$.
If p-value $< \alpha$, reject $H_0$.
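In R, this comparison is what anova() computes when given two nested model objects; a minimal sketch (simulated data):

  set.seed(5)
  x <- rnorm(40); y <- 1 + 0.3 * x + rnorm(40)
  full    <- lm(y ~ x)    # full model: Y_i = beta0 + beta1 X_i + eps_i
  reduced <- lm(y ~ 1)    # reduced model: Y_i = beta0 + eps_i
  anova(reduced, full)    # F = ((SSE(R) - SSE(F))/(dfR - dfF)) / (SSE(F)/dfF)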

Descriptive measures of linear association between X and Y

It follows from SST = SSR + SSE that
$$1 = \frac{SSR}{SST} + \frac{SSE}{SST},$$
where

$\frac{SSR}{SST}$ is the proportion of the total sum of squares that can be explained/predicted by the predictor X;

$\frac{SSE}{SST}$ is the proportion of the total sum of squares caused by the random effect.

A good model should have a large
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}.$$

$R^2$ is called R-square, or the coefficient of determination. Some facts about $R^2$ for the simple linear regression model:

1. $0 \leq R^2 \leq 1$.
2. If $R^2 = 0$, then $b_1 = 0$ (because $SSR = b_1^2 \sum_{i=1}^n (X_i - \bar{X})^2$).
3. If $R^2 = 1$, then $Y_i = b_0 + b_1 X_i$ (why?).
4. The correlation coefficient between X and Y satisfies $r_{X,Y} = \pm \sqrt{R^2}$ (see the sketch after this list).
   [Proof: $R^2 = \frac{SSR}{SST} = \frac{b_1^2 \sum_{i=1}^n (X_i - \bar{X})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2} = r_{XY}^2$.]
5. $R^2$ only indicates the fit in the observed range/scope. We need to be careful if we make predictions outside the range.
6. $R^2$ only indicates linear relationships. $R^2 = 0$ does not mean X and Y have no nonlinear association.
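A quick numerical check of fact 4, as a sketch (simulated data; the names are ours):

  set.seed(6)
  x <- runif(50); y <- 1 + 2 * x + rnorm(50)
  fit <- lm(y ~ x)
  summary(fit)$r.squared - cor(x, y)^2   # ~ 0: R^2 equals the squared sample correlation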

Considerations in applying regression analysis


1. In predicting a new case, we need to ensure the model is applicable to the new case.
2. Sometimes we need to predict X first, and then predict Y. As a consequence, the prediction accuracy also depends on the prediction of X.
3. Mind the range of X for the model. If a new case's X is far from that range, we need to be careful in the prediction.
4. $\beta_1 \neq 0$ only indicates a correlation relationship, not a cause-and-effect relation (causality).
5. Even if $\beta_1 = 0$ can be concluded, we cannot say Y has no relationship/association with X. We can only say there is no LINEAR relationship/association between X and Y.

Write an estimated model


$$\hat{Y} = b_0 + b_1 X$$
$$(\text{S.E.}) \qquad (s(b_0)) \quad (s(b_1))$$

$\hat{\sigma}^2$ (or MSE) = ..., $R^2$ = ..., F-statistic = ... (and others). Other formats of writing a fitted model can be found in Part 3 of the lecture notes.
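The quantities needed for this write-up can be pulled from a fitted model. A minimal R sketch (simulated data; the names are ours):

  set.seed(8)
  x <- runif(30); y <- 1 + 2 * x + rnorm(30)
  fit <- lm(y ~ x)
  s <- summary(fit)
  s$coefficients[, c("Estimate", "Std. Error")]  # b0, b1 and s(b0), s(b1)
  s$sigma^2                                      # sigma-hat^2 (= MSE)
  s$r.squared                                    # R^2
  s$fstatistic                                   # F-statistic and its degrees of freedom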
