Simple Linear Regression Analysis
Recall the model again:

    Yi = β0 + β1 Xi + εi,   i = 1, ..., n.

The observations can be written as

    obs   1    2    ...   n
    Y     Y1   Y2   ...   Yn
    X     X1   X2   ...   Xn
The deviation of each Yi from the mean Ȳ is Yi − Ȳ.

The fitted values Ŷi = b0 + b1 Xi, i = 1, ..., n, are obtained from the regression and are determined by the Xi. Their mean is

    (1/n) Σ_{i=1}^n Ŷi = Ȳ   (why?)

Thus the deviation of Ŷi from its mean is Ŷi − Ȳ.

The residuals are ei = Yi − Ŷi, with mean ē = 0. Thus the deviation of ei from its mean is simply ei = Yi − Ŷi.
Write

    Yi − Ȳ  =  (Ŷi − Ȳ)  +  ei
    (total deviation = deviation due to the regression + deviation due to the error)

    obs   total deviation   deviation due to regression   deviation due to error
    1     Y1 − Ȳ            Ŷ1 − Ȳ                        e1 − ē = e1
    2     Y2 − Ȳ            Ŷ2 − Ȳ                        e2 − ē = e2
    ...   ...               ...                           ...
    n     Yn − Ȳ            Ŷn − Ȳ                        en − ē = en

The corresponding sums of squares are

    SST = Σ_{i=1}^n (Yi − Ȳ)²,   the total sum of squares,
    SSR = Σ_{i=1}^n (Ŷi − Ȳ)²,   the sum of squares due to regression,
    SSE = Σ_{i=1}^n ei²,         the sum of squares of error/residuals.

We have

    Σ_{i=1}^n (Yi − Ȳ)² = Σ_{i=1}^n (Ŷi − Ȳ)² + Σ_{i=1}^n ei²,   i.e.   SST = SSR + SSE.
Proof:

    Σ_{i=1}^n (Yi − Ȳ)² = Σ_{i=1}^n ((Ŷi − Ȳ) + ei)²
                        = Σ_{i=1}^n (Ŷi − Ȳ)² + Σ_{i=1}^n ei² + 2 Σ_{i=1}^n (Ŷi − Ȳ) ei
                        = SSR + SSE + 2 Σ_{i=1}^n (b0 + b1 Xi − Ȳ) ei
                        = SSR + SSE + 2 b0 Σ_{i=1}^n ei + 2 b1 Σ_{i=1}^n Xi ei − 2 Ȳ Σ_{i=1}^n ei
                        = SSR + SSE,

where the last step uses Σ_{i=1}^n ei = 0 and Σ_{i=1}^n Xi ei = 0 (the normal equations).
Moreover, since Ȳ = b0 + b1 X̄ (the fitted line passes through (X̄, Ȳ)),

    SSR = Σ_{i=1}^n (b0 + b1 Xi − b0 − b1 X̄)² = b1² Σ_{i=1}^n (Xi − X̄)².   (1)
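As a quick numerical check of the decomposition SST = SSR + SSE and of (1), here is a small R sketch; the simulated data (sample size 25, intercept 2, slope 3) are arbitrary toy values, not the data from the notes:

    # Toy data (assumed values, for illustration only)
    set.seed(1)
    X <- runif(25)
    Y <- 2 + 3 * X + rnorm(25)
    fit  <- lm(Y ~ X)
    Yhat <- fitted(fit)                  # fitted values Yhat_i = b0 + b1*Xi
    e    <- resid(fit)                   # residuals e_i = Yi - Yhat_i
    SST  <- sum((Y - mean(Y))^2)
    SSR  <- sum((Yhat - mean(Y))^2)
    SSE  <- sum(e^2)
    all.equal(SST, SSR + SSE)            # TRUE: the decomposition holds
    b1 <- unname(coef(fit)["X"])
    all.equal(SSR, b1^2 * sum((X - mean(X))^2))   # TRUE: equation (1)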
Breakdown of the degrees of freedom

The degrees of freedom for SST is n − 1: notice that Y1 − Ȳ, ..., Yn − Ȳ have one constraint,

    Σ_{i=1}^n (Yi − Ȳ) = 0.
[Figure 1: a figure showing the degrees of freedom, with two panels plotting the residuals e against X and the fitted values Ŷ against X.]

The degrees of freedom for SSE is n − 2: notice that e1, ..., en have TWO constraints,

    Σ_{i=1}^n ei = 0   and   Σ_{i=1}^n Xi ei = 0.

(Correspondingly, SSR has 1 degree of freedom, and n − 1 = 1 + (n − 2).)

Mean (of) squares:

    MSR = SSR/1        is called the regression mean square,
    MSE = SSE/(n − 2)  is called the error mean square.
Analysis of variance (ANOVA) table

Based on the breakdown, we write it as a table:

    Source of variation   SS    df      F-value        P(> F)
    Regression            SSR   1       F = MSR/MSE    p-value
    Error                 SSE   n − 2
    Total                 SST   n − 1

R command for the calculation: anova(object, ...), where object is the output of a regression.
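For instance, continuing the simulated toy fit from the sketch above:

    fit <- lm(Y ~ X)   # object: the output of a regression
    anova(fit)         # prints Df, Sum Sq, Mean Sq, F value, Pr(>F)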
Expectation of the mean squares: it holds that

    E(MSE) = σ²   and   E(MSR) = σ² + β1² Σ_{i=1}^n (Xi − X̄)².

[Proof: the first equation was proved earlier (where?). By (1), we have

    E(MSR) = E(b1²) Σ_{i=1}^n (Xi − X̄)²
           = { Var(b1) + (E(b1))² } Σ_{i=1}^n (Xi − X̄)²
           = { σ² / Σ_{i=1}^n (Xi − X̄)² + β1² } Σ_{i=1}^n (Xi − X̄)²
           = σ² + β1² Σ_{i=1}^n (Xi − X̄)². ]
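These two expectations can be illustrated by a small Monte Carlo sketch in R; the design values (n0 = 25, β1 = 3, σ = 1.5) and variable names are arbitrary choices for illustration:

    # Fixed design: x0 is held fixed; new errors are drawn in each replication
    set.seed(2)
    n0 <- 25; x0 <- runif(n0)
    beta1 <- 3; sigma <- 1.5
    ms <- replicate(5000, {
      y0 <- 2 + beta1 * x0 + rnorm(n0, sd = sigma)
      a  <- anova(lm(y0 ~ x0))
      c(MSR = a["x0", "Mean Sq"], MSE = a["Residuals", "Mean Sq"])
    })
    rowMeans(ms)                                 # Monte Carlo E(MSR), E(MSE)
    sigma^2 + beta1^2 * sum((x0 - mean(x0))^2)   # theoretical E(MSR)
    sigma^2                                      # theoretical E(MSE)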
F-test of H0: β1 = 0 versus Ha: β1 ≠ 0

Recall from (1) that SSR = b1² Σ_{i=1}^n (Xi − X̄)². If b1 = 0 then SSR = 0 (why?). Thus we can test β1 = 0 based on SSR; i.e., under H0, SSR (or MSR) should be small. We consider the F-statistic

    F* = MSR/MSE = (SSR/1) / (SSE/(n − 2)).

Under H0, F* ~ F(1, n − 2). For a given significance level α, our criterion is:
If F* ≤ F(1 − α; 1, n − 2) (i.e., F* is indeed small), accept H0.
If F* > F(1 − α; 1, n − 2) (i.e., F* is not small), reject H0,
where F(1 − α; 1, n − 2) is the (1 − α) quantile of the F(1, n − 2) distribution.

We can also do the test based on the p-value = P(F > F*), where F ~ F(1, n − 2):
If p-value ≥ α, accept H0.
If p-value < α, reject H0.

Example 2.1 For the example above (with n = 25, in Part 3), we fit a model Yi = β0 + β1 Xi + εi. By the R code, we have the following output:

    Analysis of Variance Table

    Response: Y
              Df Sum Sq Mean Sq F value    Pr(>F)
    X          1 252378  252378  105.88 4.449e-10 ***
    Residuals 23  54825    2384
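The critical value and the p-value in this output can be reproduced with R's F-distribution functions (df1 = 1 and df2 = 23 are read off the table above; the level α = 0.01 matches the test below):

    qf(0.99, df1 = 1, df2 = 23)                        # critical value F(0.99; 1, 23), about 7.88
    pf(105.88, df1 = 1, df2 = 23, lower.tail = FALSE)  # p-value, about 4.449e-10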
Suppose we need to test H0: β1 = 0 at significance level 0.01. Based on the calculation, the p-value is 4.449 × 10⁻¹⁰ < 0.01, so we should reject H0.

Equivalence of F-test and t-test

We have two methods to test H0: β1 = 0 versus Ha: β1 ≠ 0. Recall SSR = b1² Σ_{i=1}^n (Xi − X̄)². Thus

    F* = MSR/MSE = b1² Σ_{i=1}^n (Xi − X̄)² / MSE = b1² / s²(b1) = ( b1 / s(b1) )² = (t*)²,

since s²(b1) = MSE / Σ_{i=1}^n (Xi − X̄)². Thus

    F* > F(1 − α; 1, n − 2)  ⟺  (t*)² > (t(1 − α/2; n − 2))²  ⟺  |t*| > t(1 − α/2; n − 2),

and

    F* ≤ F(1 − α; 1, n − 2)  ⟺  (t*)² ≤ (t(1 − α/2; n − 2))²  ⟺  |t*| ≤ t(1 − α/2; n − 2).

(You can check in the statistical table that F(1 − α; 1, n − 2) = (t(1 − α/2; n − 2))².) Therefore, the test results based on the F and t statistics are the same. (But ONLY for the simple linear regression model.)
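The equivalence is easy to verify numerically, e.g. on the simulated toy fit from earlier:

    t_star <- summary(fit)$coefficients["X", "t value"]   # t-statistic for b1
    F_star <- anova(fit)["X", "F value"]                  # F-statistic
    all.equal(F_star, t_star^2)                           # TRUE: F* = (t*)^2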
General linear test: comparing a full and a reduced model

To test H0: β1 = 0, we can also do it by comparing two models:

    Full model:    Yi = β0 + β1 Xi + εi,
    Reduced model: Yi = β0 + εi.

Denote the SSR of the FULL and REDUCED models by SSR(F) and SSR(R) respectively (and similarly SSE(F) and SSE(R)). We have immediately

    SSR(F) ≥ SSR(R),   or equivalently   SSE(F) ≤ SSE(R).

A question: when does the equality hold?

Note that if H0: β1 = 0 holds, then (SSE(R) − SSE(F)) / SSE(F) should be small. Taking the degrees of freedom into account, define

    F* = { (SSE(R) − SSE(F)) / (dfR − dfF) } / { SSE(F) / dfF },

which should be small under H0, where dfR and dfF denote the degrees of freedom of SSE(R) and SSE(F) respectively. Under H0: β1 = 0, it can be proved that F* ~ F(dfR − dfF, dfF). Suppose we get the F-value F*; then:

If F* ≤ F(1 − α; dfR − dfF, dfF), accept H0.
If F* > F(1 − α; dfR − dfF, dfF), reject H0.

Similarly, based on the p-value = P(F > F*):
If p-value ≥ α, accept H0.
If p-value < α, reject H0.
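In R, anova applied to two nested fits carries out exactly this comparison; a sketch, again on the simulated toy data:

    full    <- lm(Y ~ X)   # full model
    reduced <- lm(Y ~ 1)   # reduced model: intercept only
    anova(reduced, full)   # F* = ((SSE(R)-SSE(F))/(dfR-dfF)) / (SSE(F)/dfF)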
R-square

Define

    R² = SSR/SST = 1 − SSE/SST.

Here SSR/SST is the proportion of the total sum of squares that is explained by the predictor X, and SSE/SST is the proportion of the total sum of squares that is caused by the random effect.
R² is called R-square, or the coefficient of determination. Some facts about R² for the simple linear regression model:

1. 0 ≤ R² ≤ 1.
2. If R² = 0, then b1 = 0 (because SSR = b1² Σ_{i=1}^n (Xi − X̄)²).
3. If R² = 1, then Yi = b0 + b1 Xi for all i (why?).
4. The correlation coefficient between X and Y satisfies r²_{X,Y} = R² (see the numerical check after this list). [Proof:

    R² = SSR/SST = b1² Σ_{i=1}^n (Xi − X̄)² / Σ_{i=1}^n (Yi − Ȳ)² = r²_{XY}. ]
5. R² only indicates the fit within the observed range/scope of X. We need to be careful if we make predictions outside that range.
6. R² only indicates linear relationships. R² = 0 does not mean X and Y have no nonlinear association.
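Fact 4 can be checked numerically on the simulated toy data from before:

    R2 <- summary(fit)$r.squared
    all.equal(R2, cor(X, Y)^2)   # TRUE: R^2 equals the squared correlation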
When we report a fitted model, we usually also report σ̂² (or MSE) = ..., R² = ..., the F-statistic = ... (and others). Other formats of writing a fitted model can be found in Part 3 of the lecture notes.