Linear Regression
ACTL3142 & ACTL5110 Statistical Machine Learning for Risk Applications
Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with
permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
Linear Regression
A classical and easily applicable approach for supervised learning
Useful tool for predicting a quantitative response
Model is easy to interpret
Many more advanced techniques can be seen as an extension of linear
regression
Lecture Outline
Simple Linear Regression
Multiple Linear Regression
Categorical predictors
R Demo
ANOVA
Linear model selection
Potential problems with Linear Regression
So what’s next
Appendices
Overview
Suppose we have pairs of data (y₁, x₁), (y₂, x₂), ..., (yₙ, xₙ) and we want to predict the response y from the predictor x. There are several ways we could do this, and all of these methods are examples of models we can specify. Let's focus on the linear prediction.
Overview
Simple linear regression is a linear prediction:
Y = β₀ + β₁X + ϵ,
where the error ϵ is assumed to have zero mean, for identifiability reasons.
Advertising Example
sales ≈ β0 + β1 × TV
We assume
E(ϵᵢ | X) = 0,  V(ϵᵢ | X) = σ²,
for i = 1, 2, 3, ..., n, and, under the strong assumptions, ϵᵢ | X ∼ N(0, σ²) independently. In other words, errors are i.i.d. Normal random variables with zero mean and constant variance.
Parameter estimation: Maximum Likelihood or Least Squares.
Model estimation
We have paired data (y₁, x₁), ..., (yₙ, xₙ), assumed to follow
Y = β₀ + β₁X + ϵ.
The fitted values are
ŷᵢ = β̂₀ + β̂₁xᵢ,
where we used the fact that E[ϵᵢ | X] = 0 and we estimate βⱼ by β̂ⱼ.
Minimising the sum of squared residuals gives the least squares estimates:
β̂₁ = ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / ∑ᵢ₌₁ⁿ (xᵢ − x̄)² = Sxy / Sxx,
β̂₀ = ȳ − β̂₁ x̄.
LS Demo
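As a quick illustration, here is a minimal R sketch of the formulas above; R's built-in cars data is used purely as a stand-in, not as one of the lecture's examples:

# Least squares estimates computed directly from the formulas above
x <- cars$speed
y <- cars$dist

Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)

beta1_hat <- Sxy / Sxx
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(dist ~ speed, data = cars))  # should match the manual calculation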
The least squares estimators are unbiased:
E[β̂₀ | X] = β₀  and  E[β̂₁ | X] = β₁.
Their variances and covariance are:
Var(β̂₀ | X) = σ² (1/n + x̄²/Sxx) = SE(β̂₀)²,
Var(β̂₁ | X) = σ² / ∑ᵢ₌₁ⁿ (xᵢ − x̄)² = σ²/Sxx = SE(β̂₁)²,
Cov(β̂₀, β̂₁ | X) = −x̄σ² / ∑ᵢ₌₁ⁿ (xᵢ − x̄)² = −x̄σ²/Sxx.
An unbiased estimator of σ² is s² = RSS/(n − 2).
Proof: See Lab questions. Verify yourself that all three quantities go to 0 as n gets larger.
Under the strong assumptions (i.i.d. Normal RVs), the joint density of Y₁, Y₂, …, Yₙ is the product of their marginals (independent by assumption). Since Y = β₀ + β₁X + ϵ, where ϵᵢ ∼ i.i.d. N(0, σ²), then yᵢ ∼ N(β₀ + β₁xᵢ, σ²), and maximising the resulting likelihood gives
β̂₁ = ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / ∑ᵢ₌₁ⁿ (xᵢ − x̄)² = Sxy/Sxx,
β̂₀ = ȳ − β̂₁ x̄,
and
σ̂²_MLE = (1/n) ∑ᵢ₌₁ⁿ (yᵢ − (β̂₀ + β̂₁xᵢ))².
Note that the parameters β₀ and β₁ have the same estimators as those produced by least squares.
Interpretation of parameters
How do we interpret a linear regression model such as β̂₀ = 1 and β̂₁ = −0.5?
The intercept parameter β̂₀ is interpreted as the expected value of yᵢ when xᵢ = 0.
E.g., predict yᵢ = 1 if xᵢ = 0.
The slope parameter β̂₁ is interpreted as the expected change in the mean response of yᵢ for a 1 unit increase in xᵢ.
E.g., predict yᵢ to decrease by 0.5 for every 1 unit increase in xᵢ.
Example 1
The below data was generated by Y = 1 − 0.5 × X + ϵ where X ∼ U [0, 10] and
ϵ ∼ N (0, 1) with n = 30.
Estimates of Beta_0 and Beta_1:
1.309629 -0.5713465
Standard error of the estimates:
0.346858 0.05956626
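A sketch of how an example like this can be generated and fitted in R; the seed is arbitrary, so the numbers will not match exactly:

# Simulate data as in Example 1 and fit a simple linear regression
set.seed(1)  # arbitrary seed
n <- 30
X <- runif(n, 0, 10)
Y <- 1 - 0.5 * X + rnorm(n, mean = 0, sd = 1)

fit <- lm(Y ~ X)
coef(fit)                                   # estimates of beta_0 and beta_1
summary(fit)$coefficients[, "Std. Error"]   # standard errors of the estimates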
Example 2
The below data was generated by Y = 1 − 0.5 × X + ϵ where X ∼ U [0, 10] and
ϵ ∼ N (0, 1) with n = 5000.
Estimates of Beta_0 and Beta_1:
1.028116 -0.5057372
Standard error of the estimates:
0.02812541 0.00487122
Example 3
The below data was generated by Y = 1 − 0.5 × X + ϵ where X ∼ U [0, 10] and
ϵ ∼ N (0, 100) with n = 30.
Estimates of Beta_0 and Beta_1:
-2.19991 -0.4528679
Standard error of the estimates:
3.272989 0.5620736
Example 4
The below data was generated by Y = 1 − 0.5 × X + ϵ where X ∼ U [0, 10] and
ϵ ∼ N (0, 100) with n = 5000.
Estimates of Beta_0 and Beta_1:
1.281162 -0.5573716
Standard error of the estimates:
0.2812541 0.0487122
Example 5
The below data was generated by Y = 1 − 40 × X + ϵ where X ∼ U [0, 10] and
ϵ ∼ N (0, 100) with n = 30.
Estimates of Beta_0 and Beta_1:
4.096286 -40.71346
Standard error of the estimates:
3.46858 0.5956626
Example 6
The below data was generated by Y = 1 + 0.2 × X² + ϵ where X ∼ U [0, 10] and
ϵ ∼ N (0, 0.01) with n = 30.
Estimates of Beta_0 and Beta_1:
-2.32809 2.000979
Variances of the estimates:
0.01808525 0.0005420144
Note
For inference (e.g. confidence intervals, hypothesis tests), we need the strong assumptions!
Confidence intervals at level 1 − α:
for β₁:  β̂₁ ± t_{1−α/2, n−2} ⋅ s/√Sxx,  where s/√Sxx = ŜE(β̂₁);
for β₀:  β̂₀ ± t_{1−α/2, n−2} ⋅ s √(1/n + x̄²/Sxx),  where s √(1/n + x̄²/Sxx) = ŜE(β̂₀).
Hypothesis tests: to test H₀: β₁ = β₁* vs H₁: β₁ ≠ β₁* (where β₁* denotes the hypothesised value), use the test statistic
t(β̂₁) = (β̂₁ − β₁*) / ŜE(β̂₁) = (β̂₁ − β₁*) / (s/√Sxx).
Reject H₀ at level α when:
H₁: β₁ ≠ β₁*  and  |t(β̂₁)| > t_{1−α/2, n−2};
H₁: β₁ > β₁*  and  t(β̂₁) > t_{1−α, n−2};
H₁: β₁ < β₁*  and  t(β̂₁) < −t_{1−α, n−2}.
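In R these intervals and tests can be read straight off the fitted model; a small sketch, assuming fit is the simple regression fitted in the earlier sketch:

confint(fit, level = 0.95)    # 95% confidence intervals for beta_0 and beta_1
summary(fit)$coefficients     # estimates, SEs, t values and p-values for H0: beta_j = 0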
Call:
lm(formula = Y ~ X)
Residuals:
Min 1Q Median 3Q Max
-1.8580 -0.7026 -0.1236 0.5634 1.8463
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.30963 0.34686 3.776 0.000764 ***
X -0.57135 0.05957 -9.592 2.4e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = Y ~ X)
Residuals:
Min 1Q Median 3Q Max
-3.1179 -0.6551 -0.0087 0.6655 3.4684
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.028116 0.028125 36.55 <2e-16 ***
X -0.505737 0.004871 -103.82 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = Y ~ X)
Residuals:
Min 1Q Median 3Q Max
-20.306 -5.751 -2.109 5.522 27.049
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.1999 3.2730 -0.672 0.507
X -0.4529 0.5621 -0.806 0.427
Call:
lm(formula = Y ~ X)
Residuals:
Min 1Q Median 3Q Max
-31.179 -6.551 -0.087 6.655 34.684
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.28116 0.28125 4.555 5.36e-06 ***
X -0.55737 0.04871 -11.442 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = Y ~ X)
Residuals:
Min 1Q Median 3Q Max
-18.580 -7.026 -1.236 5.634 18.463
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0963 3.4686 1.181 0.248
X -40.7135 0.5957 -68.350 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = Y ~ X)
Residuals:
Min 1Q Median 3Q Max
-1.8282 -1.3467 -0.4217 1.1207 3.4041
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.32809 0.13448 -17.31 <2e-16 ***
X 2.00098 0.02328 85.95 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
yᵢ − ȳ = (yᵢ − ŷᵢ) + (ŷᵢ − ȳ).
We then obtain:
∑ᵢ₌₁ⁿ (yᵢ − ȳ)² = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + ∑ᵢ₌₁ⁿ (ŷᵢ − ȳ)²,
where:
the left-hand side is the total sum of squares TSS = Syy,
the first term on the right is the error (residual) sum of squares RSS = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)², with DFE = n − 2 degrees of freedom and MSE = RSS/DFE,
and the second term is the model sum of squares SSM.
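This decomposition can be checked numerically in R; a minimal sketch, again assuming the fitted object fit from the earlier sketch:

y_obs <- fitted(fit) + residuals(fit)        # recover the observed responses
TSS <- sum((y_obs - mean(y_obs))^2)          # total sum of squares (Syy)
RSS <- sum(residuals(fit)^2)                 # error sum of squares
SSM <- sum((fitted(fit) - mean(y_obs))^2)    # model sum of squares
all.equal(TSS, RSS + SSM)                    # the decomposition holds

anova(fit)                                   # the same quantities in ANOVA-table form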
Example 6 has a higher R² than Examples 1-4, despite the relationship clearly not being linear.
Since Example 6 does not satisfy either the weak or the strong assumptions, its results cannot be trusted. (More on this later.)
There is more to modelling than looking at numbers!
Lecture Outline
Simple Linear Regression
Multiple Linear Regression
Categorical predictors
R Demo
ANOVA
Linear model selection
Potential problems with Linear Regression
So what’s next
Appendices
Overview
Extend the simple linear regression model to accommodate multiple
predictors
Y = β 0 + β 1 X1 + β 2 X2 + ⋯ + β p Xp + ϵ
Data is now of the form (y₁, x₁₁, x₁₂, ..., x₁ₚ), ..., (yₙ, xₙ₁, ..., xₙₚ).
βⱼ: the average effect on yᵢ of a one unit increase in xᵢⱼ, holding all other variables xᵢₖ, k ≠ j, fixed.
Instead of fitting a line, we are now fitting a (hyper-)plane
Important note: If we denote xᵢ to be the i'th row of X, you should observe that the response is still linear with respect to the predictors, since yᵢ = xᵢβ + ϵᵢ.
Advertising Example
sales ≈ β0 + β1 × TV + β2 × radio
Y = Xβ + ϵ
with β = (β₀, β₁, ..., βₚ)⊤, and Y and ϵ defined as in simple linear regression (now stacked into vectors). The design matrix is
X = ⎡ 1  x₁₁  x₁₂  …  x₁ₚ ⎤
    ⎢ 1  x₂₁  x₂₂  …  x₂ₚ ⎥
    ⎢ ⋮   ⋮    ⋮   ⋱   ⋮  ⎥
    ⎣ 1  xₙ₁  xₙ₂  …  xₙₚ ⎦
Verify all the dimensions make sense, expand it! Also verify simple linear
regression can be recovered from this notation.
Take careful note of the notation in different contexts. Here X is a matrix,
while in simple linear regression it was a column vector. Depending on the
context it should be obvious which is which.
Weak assumptions:
E(ϵᵢ | X) = 0,  Var(ϵᵢ | X) = σ²,  Cov(ϵᵢ, ϵⱼ | X) = 0 for all i ≠ j.
In words, the errors have zero means, common variance, and are uncorrelated.
In matrix form, we have:
E(ϵ | X) = 0  and  Var(ϵ | X) = σ²Iₙ.
Strong assumptions: ϵᵢ | X ∼ i.i.d. N(0, σ²).
In words, errors are i.i.d. normal random variables with zero mean and constant variance.
Fitting: minimise the residual sum of squares
RSS = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = (Y − Xβ̂)⊤(Y − Xβ̂) = ∑ᵢ₌₁ⁿ ϵ̂ᵢ².
If (X⊤X)⁻¹ exists, it can be shown that the solution is given by:
β̂ = (X⊤X)⁻¹ X⊤Y,
and the fitted values are
Ŷ = Xβ̂.
Under the weak assumptions:
1. E[β̂ | X] = β, i.e. the estimator is unbiased.
2. Var(β̂ | X) = σ² × (X⊤X)⁻¹.
3. An unbiased estimator of σ² is:
s² = (Y − Ŷ)⊤(Y − Ŷ) / (n − p − 1) = RSS / (n − p − 1),
where RSS is as defined on the previous slide.
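These quantities can be verified numerically in R; a sketch using the built-in trees data purely as an illustration (it is not one of the course data sets):

fit_mlr <- lm(Volume ~ Girth + Height, data = trees)
X <- model.matrix(fit_mlr)

s2 <- sum(residuals(fit_mlr)^2) / (nrow(X) - ncol(X))  # RSS / (n - p - 1)
sqrt(s2)                          # equals summary(fit_mlr)$sigma

s2 * solve(t(X) %*% X)            # estimated variance matrix of beta_hat
vcov(fit_mlr)                     # the same matrix computed by R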
Lecture Outline
Simple Linear Regression
Multiple Linear Regression
Categorical predictors
R Demo
ANOVA
Linear model selection
Potential problems with Linear Regression
So what’s next
Appendices
Qualitative predictors
Suppose a predictor is qualitative (e.g., 2 different levels) - how would you
model/code this in a regression? What if there are more than 2 levels?
Consider for example the problem of predicting salary for a potential job
applicant:
A quantitative variable could be years of relevant work experience.
A two-category variable could be is the applicant currently an employee of
this company? (T/F)
A multiple-category variable could be highest level of education (HS diploma, Bachelors, Masters, PhD).
How do we incorporate this qualitative data into our modelling?
Integer encoding
One solution - assign the values of the categories to a number.
Problem? The numbers you use impose a relationship between the categories.
For example, if we code HS diploma = 1 and Bachelors = 2, we are saying a Bachelors degree is above a HS diploma (in particular, is worth twice as much), so βedu(B) = 2 × βedu(HS).
And if the numbers were assigned in some other arbitrary order, we could end up implying that a HS diploma is worth more than a PhD but less than a Bachelors.
What if the categories are completely unrelated like colours (green, blue, red,
yellow)?
One-hot encoding
Another solution is to use a technique called one-hot encoding. Create a set of
binary variables that take 0 or 1 depending if the variable belongs to a certain
category.
⎛ R ⎞   ⎛ 1 0 0 ⎞
⎜ G ⎟ = ⎜ 0 1 0 ⎟
⎜ G ⎟   ⎜ 0 1 0 ⎟
⎝ B ⎠   ⎝ 0 0 1 ⎠ ,
where the first column represents red, the second green and the third blue.
Dummy encoding
Technically, we cannot use one-hot encoding in linear regression, but instead
use a technique called dummy encoding.
We pick a base case, i.e. set the entry of the row of the matrix to be 0 if it’s the
base case.
Using the same example as before and we set ‘Red’ to be the base case we have:
⎛ R ⎞   ⎛ 0 0 ⎞
⎜ G ⎟ = ⎜ 1 0 ⎟
⎜ G ⎟   ⎜ 1 0 ⎟
⎝ B ⎠   ⎝ 0 1 ⎠ ,
where now the first column is green and the second is blue. If both columns are 0, then the category is red (implicitly).
We need this to prevent a singularity in (X⊤X), since the first column of X is all 1's (recall your definition of linear independence!).
Bonus question: What if we remove the intercept column in our design matrix
X ? Do we still need a base case?
Lecture Outline
Simple Linear Regression
Multiple Linear Regression
Categorical predictors
R Demo
ANOVA
Linear model selection
Potential problems with Linear Regression
So what’s next
Appendices
(Figure: the advertising data arranged in matrix form, a response vector Y of sales values alongside a design matrix X with a leading column of ones and the predictor columns.)
Brief refresher
Fitting: Minimise the residual sum of squares
RSS = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = ∑ᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ,₁ − … − β̂ₚxᵢ,ₚ)² = (Y − Xβ̂)⊤(Y − Xβ̂).
If (X⊤X)⁻¹ exists, it can be shown that the solution is given by:
β̂ = (X⊤X)⁻¹ X⊤Y,
and the fitted values are
Ŷ = Xβ̂.
model <- lm(sales ~ TV + radio, data = df_adv)
coef(model)

(Intercept)          TV       radio 
 2.92109991  0.04575482  0.18799423 

X <- model.matrix(~ TV + radio, data = df_adv)
y <- df_adv$sales
beta <- solve(t(X) %*% X) %*% t(X) %*% y
beta

                  [,1]
(Intercept) 2.92109991
TV          0.04575482
radio       0.18799423

Prediction uses Ŷ = Xβ̂:

budgets <- data.frame(TV = c(100, 200, 300), radio = c(20, 30, 40))
predict(model, newdata = budgets)

       1        2        3 
11.25647 17.71189 24.16731 

X_new <- model.matrix(~ TV + radio, data = budgets)
X_new %*% beta

      [,1]
1 11.25647
2 17.71189
3 24.16731
Dummy encoding
Design matrices are normally an ‘Excel’-style table of covariates/predictors plus
a column of ones.
If categorical variables are present, they are added as dummy variables:
fake <- tibble(
  speed = c(100, 80, 60, 60, 120, 40),
  risk = c("Low", "Medium", "High",
           "Medium", "Low", "Low")
)
fake

# A tibble: 6 × 2
  speed risk  
  <dbl> <chr> 
1   100 Low   
2    80 Medium
3    60 High  
4    60 Medium
5   120 Low   
6    40 Low   

model.matrix(~ speed + risk, data = fake)

  (Intercept) speed riskLow riskMedium
1           1   100       1          0
2           1    80       0          1
3           1    60       0          0
4           1    60       0          1
5           1   120       1          0
6           1    40       1          0
attr(,"assign")
[1] 0 1 2 2
attr(,"contrasts")
attr(,"contrasts")$risk
[1] "contr.treatment"
Lecture Outline
Simple Linear Regression
Multiple Linear Regression
Categorical predictors
R Demo
ANOVA
Linear model selection
Potential problems with Linear Regression
So what’s next
Appendices
H₀: β₁ = ⋯ = βₚ = 0
F-statistic = [(TSS − RSS)/p] / [RSS/(n − p − 1)] ∼ F_{p, n−p−1}
This tests whether at least one predictor is useful in the linear regression!
Question: Given the individual p-values for each variable, why do we need to
look at the overall F-statistics?
Because a model with all insignificant p-values may jointly still be able to
explain a significant proportion of the variance.
Conversely, a model with significant predictors may still fail to explain a
significant proportion of the variance.
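In R the overall F-test is reported at the end of summary(); a short sketch reusing model and df_adv from the R demo:

summary(model)$fstatistic        # F value with its numerator and denominator df

# Equivalently, compare against the intercept-only (null) model
null_model <- lm(sales ~ 1, data = df_adv)
anova(null_model, model)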
ANOVA
This partitioning of the variability is used in ANOVA tables:
Source   Sum of squares              df                Mean square      F          p-value
Model    SSM = ∑ᵢ₌₁ⁿ (ŷᵢ − ȳ)²       DFM = p           MSM = SSM/DFM    MSM/MSE    1 − F_{DFM,DFE}(F)
Error    SSE = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²      DFE = n − p − 1   MSE = SSE/DFE
Total    TSS = ∑ᵢ₌₁ⁿ (yᵢ − ȳ)²       DFT = n − 1
Even a model of high statistical significance is not a perfect description of the data:
the linear model is an approximation;
there is always random error ϵ.
Call:
lm(formula = Y ~ X1 + X2)
Residuals:
Min 1Q Median 3Q Max
-1.6923 -0.4883 -0.1590 0.5366 1.9996
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.22651 0.45843 2.675 0.0125 *
X1 -0.71826 0.05562 -12.913 4.56e-13 ***
X2 1.01285 0.05589 18.121 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = Y ~ X1 + X2)
Residuals:
Min 1Q Median 3Q Max
-16.923 -4.883 -1.591 5.366 19.996
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.2651 4.5843 0.712 0.4824
X1 -0.8826 0.5562 -1.587 0.1242
X2 1.1285 0.5589 2.019 0.0535 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Lecture Outline
Simple Linear Regression
Multiple Linear Regression
Categorical predictors
R Demo
ANOVA
Linear model selection
Potential problems with Linear Regression
So what’s next
Appendices
Subset selection
The classic approach is subset selection
Standard approaches include
Best subset
Forward stepwise
Backwards stepwise
Hybrid stepwise
Y = β 0 + β 1 X1 + β 2 X2 + ⋯ + β p Xp
Algorithm:
Consider the model with 0 predictors, and call this M₀. This is the null model.
Consider all models with 1 predictor, pick the best fit, and call this M₁.
…
Consider the model with p predictors, and call this Mₚ. This is the full model.
Finally, select the single best model among M₀, M₁, …, Mₚ.
Forward stepwise
Start with the null model M₀ (no predictors).
Consider the p models with 1 predictor, pick the best, and call this M₁.
Extend M₁ with one of the p − 1 remaining predictors. Pick the best, and call this M₂.
…
End with the full model Mₚ.
Finally, select the single best model among M₀, M₁, …, Mₚ.
RSS and R2 for each possible model containing a subset of the ten predictors in the Credit data set.
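A hedged sketch of how such a search can be run in R with the leaps package; the Credit data frame (with Balance and its predictors) is assumed to be available, e.g. from the ISLR package:

library(leaps)

# Best subset selection for Balance over the available predictors
best <- regsubsets(Balance ~ ., data = Credit, nvmax = 10)
summary(best)$rss      # RSS of the best model of each size
summary(best)$adjr2    # adjusted R^2 of the best model of each size

# Forward stepwise selection uses the same interface
fwd <- regsubsets(Balance ~ ., data = Credit, nvmax = 10, method = "forward")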
Indirect methods
1. Cp with d predictors:
Cp = (1/n)(RSS + 2dσ̂²)
This is an unbiased estimate of the test MSE if σ̂² is an unbiased estimate of σ².
2. Akaike information criterion (AIC) with d predictors:
AIC = (1/n)(RSS + 2dσ̂²)
3. Bayesian information criterion (BIC) with d predictors:
BIC = (1/n)(RSS + log(n) dσ̂²)
log(n) > 2 for n > 7, so this is a much heavier penalty.
4. Adjusted R² with d predictors:
Adjusted R² = 1 − [RSS/(n − d − 1)] / [TSS/(n − 1)]
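In R, AIC, BIC and adjusted R² can be obtained directly; a small sketch comparing two nested advertising models, reusing model and df_adv from the R demo (the smaller fit is purely illustrative):

small <- lm(sales ~ TV, data = df_adv)      # illustrative smaller model
full  <- model                              # sales ~ TV + radio from the R demo

AIC(small, full)                            # lower is better
BIC(small, full)                            # heavier penalty for model size
c(summary(small)$adj.r.squared,             # higher is better
  summary(full)$adj.r.squared)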
Lecture Outline
Simple Linear Regression
Multiple Linear Regression
Categorical predictors
R Demo
ANOVA
Linear model selection
Potential problems with Linear Regression
So what’s next
Appendices
Potential Problems/Concerns
To apply linear regression properly:
The relationship between the predictors and the response is linear and additive (i.e. the effects of the covariates must be additive);
Homoskedastic (constant) variance;
Errors must be independent of the explanatory variables with mean zero
(weak assumptions);
Errors must be Normally distributed, and hence, symmetric (only in case of
testing, i.e., strong assumptions).
Potential Problems/Concerns
1. Non-linearity of the response-predictor relationships
2. Correlation of error terms
3. Non-constant variance of error terms
4. Outliers
5. High-leverage points
6. Collinearity
7. Confounding effect (correlation does not imply causality!)
1. Non-linearities
Example: residuals vs fitted for MPG vs Horsepower:
A quadratic model removes much of the pattern; we look at these models in more detail later.
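A sketch of this diagnostic in R; the Auto data is assumed to be the version in the ISLR package:

library(ISLR)   # assumed source of the Auto data

fit_lin  <- lm(mpg ~ horsepower, data = Auto)
fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)

plot(fit_lin,  which = 1)   # residuals vs fitted: a clear U-shaped pattern
plot(fit_quad, which = 1)   # the quadratic term removes much of the pattern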
4. Outliers
5. High-leverage points
The following compares the fitted line with (RED) and without (BLUE) observation 41 included in the fit.
High-leverage points
Have unusual predictor values, causing the regression line to be dragged
towards them
A few points can significantly affect the estimated regression line
Compute the leverage using the hat matrix:
H = X(X ⊤ X)−1 X ⊤
Note that
ŷᵢ = ∑ⱼ₌₁ⁿ hᵢⱼ yⱼ = hᵢᵢ yᵢ + ∑_{j≠i} hᵢⱼ yⱼ,
so each prediction is a linear function of all observations, and hᵢᵢ = [H]ᵢᵢ is the leverage of the i-th observation.
This point (y = −7, x = 20) has a leverage value of 0.47 ≫ 4/30, despite it not being an outlier.
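Leverages are the diagonal entries of the hat matrix and can be extracted directly in R; a sketch, again assuming the simple regression object fit with n = 30:

h <- hatvalues(fit)              # the h_ii values
h[h > 4 / length(h)]             # points exceeding the 4/n rule of thumb used above

# Equivalent computation from the design matrix
X <- model.matrix(fit)
H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(unname(h), unname(diag(H)))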
6. Collinearity
Two or more predictor values are closely related to each other (linearly
dependent)
If a column is linearly dependent on another, the matrix (X ⊤ X) is singular,
hence non-invertible.
Reduces the accuracy of the regression by increasing the set of plausible
coefficient values
In effect, this causes the SEs of the beta coefficients to grow.
Correlation can indicate one-to-one (linear) collinearity
Multicollinearity
Use the variance inflation factor:
VIF(β̂ⱼ) = 1 / (1 − R²_{Xⱼ|X₋ⱼ}),
where R²_{Xⱼ|X₋ⱼ} is the R² from regressing Xⱼ onto all the other predictors.
Example: X₁ ∼ U[0, 10] and X₂ = 2X₁ + ε, where ε ∼ N(0, 10⁻⁸) is a small perturbation (to make this fit possible, since perfectly collinear columns would make X⊤X singular):
Call:
lm(formula = Y ~ X1 + X2)
Residuals:
Min 1Q Median 3Q Max
-2.32126 -0.46578 0.02207 0.54006 1.89817
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.192e-01 3.600e-01 1.442 0.1607
X1 5.958e+04 3.268e+04 1.823 0.0793 .
X2 -2.979e+04 1.634e+04 -1.823 0.0793 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
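The inflated standard errors above are exactly what the VIF quantifies; a hedged sketch that rebuilds a similar collinear example (the response used here is an illustrative choice, not necessarily the one on the slide):

set.seed(2)                                # arbitrary seed
X1 <- runif(30, 0, 10)
X2 <- 2 * X1 + rnorm(30, sd = 1e-4)        # variance 1e-8: almost perfectly collinear
Y  <- 1 - 0.5 * X1 + rnorm(30)             # illustrative response

fit_col <- lm(Y ~ X1 + X2)
summary(fit_col)                           # note the huge standard errors

# VIF by hand: regress each predictor on the other(s)
1 / (1 - summary(lm(X1 ~ X2))$r.squared)   # VIF for X1 (enormous)
# The car package's vif() function reports the same diagnostic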
7. Confounding effects
But what about confounding variables? Be careful, correlation does not imply causality!
C is a confounder (confounding variable) of the relation between X and Y if:
C influences X and C influences Y ,
but X does not influence Y (directly).
Confounding effects
The predictor variable X would have an indirect influence on the dependent
variable Y .
Example: Age ⇒ Experience ⇒ Aptitude for mathematics. If experience
can not be measured, age can be a proxy for experience.
The predictor variable X would have no direct influence on dependent
variable Y .
Example: Being old doesn’t necessarily mean you are good at maths!
Hence, a predictor variable works as a predictor, but action taken on the
predictor itself will have no effect.
Confounding effects
How do we correctly use (or avoid misusing) confounding variables?
Lecture Outline
Simple Linear Regression
Multiple Linear Regression
Categorical predictors
R Demo
ANOVA
Linear model selection
Potential problems with Linear Regression
So what’s next
Appendices
Lecture Outline
Simple Linear Regression
Multiple Linear Regression
Categorical predictors
R Demo
ANOVA
Linear model selection
Potential problems with Linear Regression
So what’s next
Appendices
Sxx = ∑ᵢ₌₁ⁿ (xᵢ − x̄)²  ⟹  sx² = Sxx / (n − 1),
Syy = ∑ᵢ₌₁ⁿ (yᵢ − ȳ)²  ⟹  sy² = Syy / (n − 1),
Sxy = ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ)  ⟹  sxy = Sxy / (n − 1).
Rationale for β₁: Recall that β̂₁ is unbiased and Var(β̂₁) = σ²/Sxx. However σ² is unknown, so we replace it with s², giving
(β̂₁ − β₁) / (s/√Sxx) = [(β̂₁ − β₁) / (σ/√Sxx)] / √[ ((n − 2)s²/σ²) / (n − 2) ] ∼ t_{n−2},
since the numerator is standard Normal and (n − 2)s²/σ² ∼ χ²_{n−2}, independently of the numerator.
E[β̂ₖ] = βₖ,
Var(β̂ₖ) = σ² ⋅ cₖₖ,
where cₖₖ is the (k + 1)th diagonal entry of the matrix C = (X⊤X)⁻¹.
Suppose we are given a new value x₀ and want to predict the corresponding Y value associated with it. The mean of Y is:
E[Y | x₀] = E[β₀ + β₁x | x = x₀] = β₀ + β₁x₀.
Our (unbiased) estimator for this mean (also the fitted value of y₀) is ŷ₀ = β̂₀ + β̂₁x₀, with
Var(ŷ₀) = (1/n + (x̄ − x₀)²/Sxx) σ² = SE(ŷ₀)².
A 100(1 − α)% confidence interval for β₀ + β₁x₀ (the mean of Y) is:
(β̂₀ + β̂₁x₀) ± t_{1−α/2, n−2} × s √(1/n + (x̄ − x₀)²/Sxx),
which we can write as ŷ₀ ± t_{1−α/2, n−2} ⋅ ŜE(ŷ₀), since ŜE(ŷ₀) = s √(1/n + (x̄ − x₀)²/Sxx).
For predicting an individual new response (rather than its mean), we have
E[Yᵢ − ŷᵢ | X = x, X = xᵢ] = 0, and
Var(Yᵢ − ŷᵢ | X = x, X = xᵢ) = σ² (1 + 1/n + (x̄ − xᵢ)²/Sxx).
A 100(1 − α)% prediction interval is therefore
β̂₀ + β̂₁xᵢ ± t_{1−α/2, n−2} ⋅ s √(1 + 1/n + (x̄ − xᵢ)²/Sxx),
which we can write as ŷᵢ ± t_{1−α/2, n−2} ⋅ ŜE, since
(Yᵢ − ŷᵢ | X = x, X = xᵢ) ∼ N(0, σ² (1 + 1/n + (x̄ − xᵢ)²/Sxx)), and hence
(Yᵢ − ŷᵢ) / (s √(1 + 1/n + (x̄ − xᵢ)²/Sxx)) ∼ t_{n−2}.
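Both intervals are produced by predict() in R; a small sketch, assuming the simple regression object fit from the earlier sketch and an illustrative new value x₀ = 5:

new_data <- data.frame(X = 5)                               # hypothetical new value x_0
predict(fit, newdata = new_data, interval = "confidence")   # CI for the mean response
predict(fit, newdata = new_data, interval = "prediction")   # wider interval for a new observation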