Sasin DECS 434 Session 3 - Regression For Prediction
Professor Brett Saraniti
Module 3, 2022
What single factor is the biggest
determinant in the price of a home?
Goal: use DATA to build a MODEL that predicts the market value of a home in
Wilmette, IL.
Wilmette is the small suburb just north of Evanston where some Kellogg faculty
live. Also a few derivatives traders, big-shot attorneys, various tycoons & titans of
industry…
Price = β0 + β1 squarefeet + ε
(β0 + β1 squarefeet is the linear part; ε is the noise, or error, term.)
The simple linear regression model
Y = β0 + β1 X + ε
Y is called the dependent variable
X is called the independent variable
We use Excel (or other software) to estimate β0 and β1 .
We’ll make a few assumptions about ε
Price = β0 + β1 (2968) + ε
The error term represents the value of the home that
cannot be explained just by its size…
Excel and a million other software platforms use a process called the “least
squares method” to estimate the coefficients in the model.
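As a rough illustration of what the least squares method does, here is a minimal Python sketch for the one-variable case. The data are made-up toy numbers (not the Wilmette data set), and the function name `least_squares_fit` is only an illustrative choice; Excel's internals may differ in detail, but for a single X the closed-form answer is the same.

```python
from statistics import mean

def least_squares_fit(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals for y ≈ b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Hypothetical toy data (NOT the Wilmette data): square feet and sale price
toy_sqft = [1000, 1500, 2000, 2500, 3000]
toy_price = [300000, 410000, 560000, 640000, 790000]

b0, b1 = least_squares_fit(toy_sqft, toy_price)
print(f"Price = {b0:.2f} + {b1:.2f} * squarefeet")
```

The slope is the ratio of how X and Y move together to how much X moves on its own; the intercept is then chosen so the line passes through the point of means.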
Some details
Excel will do all the hard work for us. It will produce the equation for “the
best” line. In our case, it looks like this:
Slope = Rise / Run = 273.54
The Regression MODEL: Y = β0 + β1 X + ε
In Excel's Regression dialog, input the column info for Y and X.
We are 95% confident that β1 is in the range [249.99, 297.09].
$\bar{X} \pm t \, \frac{s}{\sqrt{n}}$
$\bar{X} \pm t \, (\text{Std Error})$
$b_1 \pm t \, (\text{Std Err})$
t is computed using =T.INV.2T(.05,185)
273.54 ± 23.55
[249.99 , 297.09]
The degrees of freedom used to compute t-values for confidence intervals in
regression equal n – k – 1, where k is the number of X’s.
Here n = 187 and k = 1, so the degrees of freedom are 185.
=T.INV.2T(.05,185) = 1.973
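The arithmetic behind the interval can be retraced in a few lines. This is a sketch that takes the slope, its standard error, and the number of observations from the regression output shown in these slides, and uses the slide's quoted value of 1.973 for the t critical value rather than recomputing it.

```python
# Values taken from the Excel regression output in these slides
b1 = 273.5377151       # estimated slope (SQUARE FEET coefficient)
se_b1 = 11.93702873    # standard error of the slope
n, k = 187, 1          # observations and number of X variables
t_crit = 1.973         # =T.INV.2T(.05,185), as quoted on the slide

df = n - k - 1                  # degrees of freedom
margin = t_crit * se_b1         # half-width of the 95% interval
lower, upper = b1 - margin, b1 + margin

print(f"df = {df}")
print(f"{b1:.2f} ± {margin:.2f}  ->  [{lower:.2f}, {upper:.2f}]")
```

Up to rounding, this matches the Lower 95% and Upper 95% columns that Excel reports for SQUARE FEET.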
We ALWAYS run this particular test:
H0: β1 = 0
Ha: β1 ≠ 0
Recall our Model: Y = β0 + β1 X + ε
Suppose we find that the p-value for this test is quite small.
Then we say …
the coefficient β1 is statistically significant or
X has a statistically significant relationship with Y
              Coefficients   Standard Error   t Stat        P-value    Lower 95%       Upper 95%
Intercept     14699.43561    37993.96821      0.386888664   0.699283   -60257.72241    89656.59364
SQUARE FEET   273.5377151    11.93702873      22.91505879   6.29E-56   249.9875099     297.0879203
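The t Stat column is just each coefficient divided by its standard error. The short sketch below recomputes it from the numbers in the output and compares it with the 1.973 critical value quoted in these slides; the "significant / not significant" wording is my own shorthand for the two-sided 5% test.

```python
# (coefficient, standard error) pairs copied from the Excel output in these slides
coefficients = {
    "Intercept": (14699.43561, 37993.96821),
    "SQUARE FEET": (273.5377151, 11.93702873),
}
t_crit = 1.973  # two-sided 5% critical value for 185 df, as quoted in the slides

t_stats = {}
for name, (coef, se) in coefficients.items():
    # Test statistic for H0: coefficient = 0
    t_stats[name] = coef / se
    verdict = "significant" if abs(t_stats[name]) > t_crit else "not significant"
    print(f"{name}: t = {t_stats[name]:.3f} ({verdict} at the 5% level)")
```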
H0: β1 = 0
Ha: β1 ≠ 0
“We can be very very very sure that something matters a teeny
teeny tiny bit”
-Professor Brett Saraniti
$s_{b_1} = \frac{s}{s_x \sqrt{n - 1}}$
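To see what goes into the standard error of the slope, here is a small sketch on made-up toy data (not the Wilmette data). It assumes s is the standard error of the regression with n − 2 degrees of freedom and s_x is the sample standard deviation of X, and it checks the formula above against the equivalent form s / sqrt(Σ(x − x̄)²).

```python
from math import isclose, sqrt
from statistics import mean, stdev

# Hypothetical toy data (NOT the Wilmette data)
x = [1000, 1500, 2000, 2500, 3000]
y = [300000, 410000, 560000, 640000, 790000]
n = len(x)

# Least squares fit
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# s: standard error of the regression (residual SS over n - 2 degrees of freedom)
residual_ss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(residual_ss / (n - 2))

s_b1_slide = s / (stdev(x) * sqrt(n - 1))   # formula from the slide
s_b1_direct = s / sqrt(sxx)                 # equivalent form using the sum of squares of x

print(s_b1_slide, s_b1_direct, isclose(s_b1_slide, s_b1_direct))
```

The formula shows why the slope is pinned down more tightly when the noise s is small, when the X values are spread out (large s_x), or when n is large.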
We want to say something about Professor Schummer’s house
We know sqft = 2968. How do we use our regression to estimate the selling price?
Confidence Interval for the mean vs. Prediction Interval for one instance
(b0 + b1 x*) ± 1.96 (se_regression)
st. err. of an individual prediction at x* = $se_{regression} \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n-1) s_x^2} + 1}$
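Plugging the slide's numbers into the quick ± 1.96 rule gives a feel for how wide the interval is for a 2968 sq ft house. The coefficients and standard error below come from the Excel output in these slides. The helper `individual_prediction_se` is a hypothetical name for the exact formula above; it needs the mean and standard deviation of square feet, which are not given in the slides, so they are left as parameters.

```python
from math import sqrt

# Values from the Excel regression output in these slides
b0, b1 = 14699.43561, 273.5377151
se_regression = 209343.68
n = 187
x_star = 2968                     # square feet of the house in question

# Point estimate and the quick approximate interval
point = b0 + b1 * x_star
margin = 1.96 * se_regression
approx_low, approx_high = point - margin, point + margin
print(f"Point estimate: {point:,.0f}")
print(f"Approximate 95% interval: [{approx_low:,.0f}, {approx_high:,.0f}]")

def individual_prediction_se(se_reg, n, x_star, x_bar, s_x):
    """Exact standard error of an individual prediction at x_star.

    x_bar and s_x (mean and sample std. dev. of X) are NOT reported in the
    slides; the caller has to supply them from the data.
    """
    return se_reg * sqrt(1 / n + (x_star - x_bar) ** 2 / ((n - 1) * s_x ** 2) + 1)

# The exact standard error is smallest when x_star equals x_bar (s_x then drops out),
# so this is a floor on the exact value, slightly above se_regression itself.
se_floor = individual_prediction_se(se_regression, n, x_star, x_bar=x_star, s_x=1.0)
print(f"Smallest possible individual-prediction std. error: {se_floor:,.0f}")
```

The approximate interval spans several hundred thousand dollars on either side of the point estimate, which is why the result is disappointing as a pricing tool.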
This is disappointing…
Can Zillow do better?
86% are within 20% of sale price
[Figure: at a point x_i, the observed value y_i, the fitted value ŷ_i, the mean ȳ, and the residual e_i]
$e_i = y_i - \hat{y}_i$
$y_i = \hat{y}_i + e_i$
$\mathrm{Var}(y_i) = \mathrm{Var}(\hat{y}_i) + \mathrm{Var}(e_i)$
$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Total SS = Regression SS + Residual SS
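The decomposition can be checked numerically. The sketch below fits a least squares line to made-up toy data (not the Wilmette data) and confirms that the three sums of squares add up; the identity relies on the line being the least squares fit with an intercept.

```python
from math import isclose
from statistics import mean

# Hypothetical toy data (NOT the Wilmette data)
x = [1000, 1500, 2000, 2500, 3000]
y = [300000, 410000, 560000, 640000, 790000]

# Least squares fit
x_bar, y_bar = mean(x), mean(y)
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]          # fitted values

total_ss = sum((yi - y_bar) ** 2 for yi in y)
regression_ss = sum((yh - y_bar) ** 2 for yh in y_hat)
residual_ss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

print(total_ss, regression_ss + residual_ss)
print(isclose(total_ss, regression_ss + residual_ss))
```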
We define R² as follows:
$R^2 = \frac{\text{Regression SS}}{\text{Total SS}}$
R² is a measure of relative precision. It tells us the fraction of the total
variance that can be explained by our linear model.
[Figure: total variation split into the part explained by the model and the unexplained part]
Regression Statistics
Multiple R          0.859926325
R Square            0.739473284
Adjusted R Square   0.738065032
Standard Error      209343.68
Observations        187
ANOVA
             df    SS            MS            F          Significance F
Regression   1     2.30124E+13   2.30124E+13   525.0999   6.29254E-56
Residual     185   8.10758E+12   43824776351
Total        186   3.112E+13
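Most of the Regression Statistics block can be rebuilt from the ANOVA table. The sketch below does this with the rounded sums of squares as printed in the output, so the results agree with Excel only up to rounding; the adjusted R² formula used is the standard 1 − (1 − R²)(n − 1)/(n − k − 1).

```python
from math import sqrt

# Sums of squares as printed (rounded) in the ANOVA table in these slides
regression_ss = 2.30124e13
residual_ss = 8.10758e12
total_ss = 3.112e13
n, k = 187, 1                      # observations and number of X variables

r_squared = regression_ss / total_ss
multiple_r = sqrt(r_squared)       # with one X, this is |correlation(X, Y)|
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

ms_residual = residual_ss / (n - k - 1)
standard_error = sqrt(ms_residual)           # standard error of the regression
f_stat = (regression_ss / k) / ms_residual   # MS Regression / MS Residual

print(f"R Square          {r_squared:.4f}")
print(f"Multiple R        {multiple_r:.4f}")
print(f"Adjusted R Square {adjusted_r_squared:.4f}")
print(f"Standard Error    {standard_error:,.0f}")
print(f"F                 {f_stat:.2f}")
```

Note that R² is about 0.74 while the standard error is still over $200,000: the two numbers answer different questions.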
If the variance of Y is extremely high, then even an R² close to ONE does not
guarantee useful prediction intervals.
R² is one way (but not the only way) to measure regression performance
R² is the fraction of the variation in y explained by the x-variables in your
regression model
A regression with a high R² can be uninformative
A regression with a low R² can be highly informative
The specific questions motivating your analysis usually suggest better, more
relevant tools than R²
Practice: Prediction
Movie studios often budget tens or even hundreds of millions of dollars
towards marketing a film…
They can reallocate funds or other assets up to the very last minute
The critical point for any movie is the opening weekend. After that there is
little more that the studios can do to improve the financial performance of a
movie
Any surveys from more than a week or two away from opening were total
noise.
HSX (the Hollywood Stock Exchange) sold forecasts to Hollywood studios for a six-figure fee.
The studios want to receive their forecasts on Thursday morning which is the
last time they have any ability to add or subtract support for the film.
HSX can take the Wednesday closing price to utilize the most recent
information when making its forecasts.
So can we…
Grab the data from Canvas: HSX.xlsx
Grab the homework instructions on sasinware