Sasin DECS 434 Session 3 - Regression For Prediction

The document discusses building a linear regression model to predict home prices in Wilmette, IL based on square footage. It explains how to create the model using Excel and estimate the coefficients, and how to use the model to predict prices and provide prediction intervals.

Uploaded by

rawich

Regression: Prediction
Professor Brett Saraniti
Module 3, 2022
What single factor is the biggest
determinant in the price of a home?
 Goal: use DATA to build a MODEL that predicts the market value of a home in Wilmette, IL.

 Wilmette is the small suburb just north of Evanston where some Kellogg faculty live. Also a few derivative traders, big shot attorneys, various tycoons & titans of industry…

 Most famous recent resident 


Every house sold in Wilmette: April 17, 2018 - April 17, 2019
Excel Spreadsheet - Wilmette.xlsx
 Is there a relationship between the size of the home and its sale price?
 Is this relationship LINEAR?
 How precise is the relationship?
   Does it FIT exactly on a straight line?
   How far from “the line” do the data points tend to be?
 Are there other observable/testable traits about the relationship between Price and square footage?
 What can we do to improve our model?
https://www.redfin.com/IL/Winnetka/56-Indian-Hill-Rd-60093/home/13782109
 Let’s build a model to Predict the Price of a home in Wilmette

 We believe Price can be expressed as an “almost linear” function of squarefeet:

Price = β0 + β1 squarefeet + ε

Linear part: β0 + β1 squarefeet
Noise or Error: ε
 The simple linear regression model

Y = β0 + β1 X + ε
 Y is called the dependent variable
 X is called the independent variable
 We use Excel (or other software) to estimate β0 and β1.
 We’ll make a few assumptions about ε
 Specifically, we’ll assume ε ~ Normal(0, σ²)


Our model assumes that the actual price for any home of a certain square footage (sqft) will be “near” the line.

 For a house with 2968 sqft, we assume the price can be represented by the following model:

Price = β0 + β1 (2968) + ε
 The error term represents the value of the home that cannot be explained just by its size…
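To make the role of the error term concrete, here is a minimal simulation sketch. The parameter values (BETA0, BETA1, SIGMA) are hypothetical numbers picked for illustration only, since the true β0, β1 and σ are never known; the function name is also just an assumption of this sketch.

```python
import random

# Hypothetical "true" parameters, chosen only for illustration.
BETA0 = 15_000.0    # intercept, in dollars
BETA1 = 275.0       # dollars per square foot
SIGMA = 200_000.0   # standard deviation of the error term

def simulate_price(sqft, rng):
    """Draw one price from Price = beta0 + beta1 * sqft + epsilon."""
    epsilon = rng.gauss(0.0, SIGMA)   # epsilon ~ Normal(0, sigma^2)
    return BETA0 + BETA1 * sqft + epsilon

rng = random.Random(434)              # fixed seed so the draws are repeatable
line_value = BETA0 + BETA1 * 2968     # the point on the line for 2968 sqft
draws = [simulate_price(2968, rng) for _ in range(5)]
for price in draws:
    print(f"simulated price: {price:12,.0f}   error: {price - line_value:12,.0f}")
```

Every simulated house has the same 2968 sqft, so all of the spread in the simulated prices comes from ε.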
 Excel and a million other software platforms use a process called the “least
squares method” to estimate the coefficients in the model.

 The method dates back to 1795 and a guy named Gauss

 Some details 
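As a rough sketch of what the least squares method computes for one X variable, the closed-form estimates are b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b0 = ȳ − b1·x̄. The data below are made-up (sqft, price) pairs, not rows from Wilmette.xlsx, and `fit_line` is a name invented for this example.

```python
def fit_line(xs, ys):
    """Least squares estimates (b0, b1) for y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Made-up (sqft, price) pairs, NOT the Wilmette data.
sqft = [1200, 1800, 2400, 3000, 3600]
price = [350_000, 520_000, 640_000, 850_000, 990_000]

b0, b1 = fit_line(sqft, price)
print(f"Price = {b0:,.0f} + {b1:.2f} sqft")
```

These are the values that minimize the sum of squared vertical distances between the data points and the line.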
 Excel will do all the hard work for us. It will produce the equation for “the best” line. In our case, it looks like this:

Price = 14,699 + 273.54 squarefeet

 The coefficient β0 is estimated to be 14,699.
 β0 represents the Y-intercept or constant term in the line.
 Always remember: this is just an estimate of β0. We won’t ever KNOW the true value of β0.
 The value reported by Excel as the intercept is referred to as b0 = 14,699.
 Excel gave us this linear equation:

Price = 14,699 + 273.54 sqft

 The coefficient β1 is estimated to be 273.54
 β1 represents the slope or rate of change term between sqft and price.
 Always remember: this is just an estimate of β1. We won’t ever KNOW the true value of β1.
 The value reported by Excel as the coefficient of X is referred to as b1
 Slope = Rise / Run = 273.54
 The Regression MODEL: Y = β0 + β1 X + ε
 The Regression LINE: ŷ = b0 + b1 x
[Screenshot: running the regression in Excel]
 Input the column info for Y and X.
 Tick the boxes for Residuals and Line Fit Plots.
Wilmette Real Estate

Price = 14699 + 273.54 sqft


 b1 estimates β1
 We can compute 95% Confidence Intervals for β1
 We are 95% confident that β1 is in the range [249.99, 297.09]

Confidence Interval for the unknown population mean, μ:
X̄ ± t · s/√n, that is, X̄ ± t · (Std. Error)

Confidence Interval for the unknown model parameter, β1:
b1 ± t · (Std. Err.)
              Coefficients   Standard Error   t Stat        P-value    Lower 95%       Upper 95%
Intercept     14699.43561    37993.96821      0.386888664   0.699283   -60257.72241    89656.59364
SQUARE FEET   273.5377151    11.93702873      22.91505879   6.29E-56   249.9875099     297.0879203

b1 ± t · (Std. Err.) = 273.54 ± 1.973 · 11.94 = 273.54 ± 23.55 = [249.99, 297.09]

t is computed using =T.INV.2T(.05,185) = 1.973

The degrees of freedom used to compute t-values for confidence intervals in regression equals n – k – 1, where k is the number of X’s. Here n = 187 and k = 1, so the degrees of freedom are 187 – 1 – 1 = 185.
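The interval arithmetic can be checked with a few lines of Python. This is only a sketch: the coefficient and standard error are copied from the SQUARE FEET row of the Excel output, and the t-value is hard-coded from the rounded =T.INV.2T(.05,185) result rather than computed.

```python
# Values copied from the SQUARE FEET row of the Excel output.
b1 = 273.5377151
std_err_b1 = 11.93702873
t_crit = 1.973   # rounded result of =T.INV.2T(.05,185); hard-coded, not computed here

margin = t_crit * std_err_b1              # the "plus or minus" part
lower, upper = b1 - margin, b1 + margin   # 95% confidence interval for beta1
print(f"{b1:.2f} +/- {margin:.2f} -> [{lower:.2f}, {upper:.2f}]")
```

The endpoints agree with Excel's Lower 95% and Upper 95% columns to two decimals; the tiny remaining gap comes from rounding t to 1.973.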
 We ALWAYS run this particular test:
H0: β1 = 0
Ha: β1 ≠ 0
 Recall our Model: Y = β0 + β1 X + ε
 Suppose we find that the p-value for this test is quite small.
 Then we say …
 the coefficient β1 is statistically significant or
 X has a statistically significant relationship with Y
              Coefficients   Standard Error   t Stat        P-value    Lower 95%       Upper 95%
Intercept     14699.43561    37993.96821      0.386888664   0.699283   -60257.72241    89656.59364
SQUARE FEET   273.5377151    11.93702873      22.91505879   6.29E-56   249.9875099     297.0879203

H0: β1 = 0
Ha: β1 ≠ 0

t = (b1 − β1) / s_b1 = (273.54 − 0) / 11.94 = 22.915

p = T.DIST.2T(22.915,185) = 6.29 × 10^-56 ≈ 0.000000000
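A short sketch of the same test statistic in Python, using the unrounded values from the Excel output. The p-value itself needs a t-distribution function (Excel's T.DIST.2T or a statistics library), so this example only compares |t| with the 5% critical value of 1.973 quoted earlier for 185 degrees of freedom.

```python
# Unrounded values from the SQUARE FEET row of the Excel output.
b1 = 273.5377151
std_err_b1 = 11.93702873
beta1_under_h0 = 0.0     # the value of beta1 claimed by H0

t_stat = (b1 - beta1_under_h0) / std_err_b1

t_crit = 1.973           # two-sided 5% critical value for 185 df (hard-coded)
reject_h0 = abs(t_stat) > t_crit
print(f"t = {t_stat:.3f}, reject H0 at the 5% level: {reject_h0}")
```

A t-statistic this far beyond the critical value is what produces the essentially zero p-value in the output.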
 Statistical Significance ≠ Economic Significance

“We can be very very very sure that something matters a teeny
teeny tiny bit”
-Professor Brett Saraniti

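s_b1 = s / (s_x · √(n − 1))

where s is the Standard Error of the regression and s_x is the sample standard deviation of X.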
 We want to say something about Professor Schummer’s house
 We know sqft = 2968. How do we use our regression to estimate the selling price?

 Price = 14699 + 273.54 sqft 

 Price = 14699 + 273.54 (2968) = $826,559
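The point estimate can be reproduced with a one-line function. This sketch uses the unrounded coefficients from the Excel output; `predict_price` is a name made up for the example.

```python
# Unrounded coefficients from the Excel output.
B0 = 14699.43561    # intercept
B1 = 273.5377151    # slope, dollars per square foot

def predict_price(sqft):
    """Point estimate from the regression line: y-hat = b0 + b1 * x."""
    return B0 + B1 * sqft

print(f"${predict_price(2968):,.0f}")
```

Plugging the rounded coefficients 14699 and 273.54 into a calculator gives a figure about $7 higher, which is why the slide's $826,559 is best read as coming from the unrounded values.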

 Next Question: How GOOD is our prediction?


Y = β0 + β1 X + ε, where ε ~ Normal(0, σ²)
The Standard Error of the regression estimates the σ for the error term.
 We want to add a sense of precision to our Point Estimate: $826,559
 A Confidence Interval is a range or interval estimate for the MEAN
 We don’t want to be 95% confident about the average price of houses with 2968
sqft of living space.
 We want to be 95% confident about the price of ONE PARTICULAR house: Jim’s

 The tool for this job is called a Prediction Interval

Confidence Interval for the mean vs. Prediction Interval for one instance
 (b0 + b1x*) +/- 1.96(se-regression)
 $826,559 +/- 1.96 ∙ $209,344
 We are 95% confident that Jimmy’s home price will be in this range: $826,559 +/- $410,314
 Why is this “approximate”?
   We should use t, not 1.96
   There is also error in the coefficients which we are ignoring (for now.)
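The approximate interval is simple enough to compute by hand, but a short sketch makes the endpoints and the relative width explicit. The inputs are the rounded figures from the slide; nothing here corrects for the two approximations listed above.

```python
point_estimate = 826_559    # b0 + b1 * 2968, from the regression line
se_regression = 209_344     # Standard Error of the regression, rounded
z = 1.96                    # normal approximation; the exact interval uses t

half_width = z * se_regression
low = point_estimate - half_width
high = point_estimate + half_width
relative_width = half_width / point_estimate   # "+/-" as a share of the estimate

print(f"${point_estimate:,} +/- ${half_width:,.0f}")
print(f"range: ${low:,.0f} to ${high:,.0f} (about +/- {relative_width:.0%})")
```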
 (b0 + b1x*) +/- (t-value)(st. err. of an individual prediction at x*)

st. err. of an individual prediction at x* = se_regression · √( 1/n + (x* − x̄)² / ((n − 1) s_x²) + 1 )

Q: What happens when n is really really really large?

A1: This standard error converges to se_regression

A2: The t-value converges to 1.96
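The convergence in A1 can be seen numerically. In this sketch the standard error of the regression is taken from the Excel output, but x̄ and s_x are NOT given in these slides, so the values of X_BAR and S_X below are hypothetical placeholders; the function name is also an assumption of the example.

```python
import math

def se_individual_prediction(se_regression, n, x_star, x_bar, s_x):
    """se_regression * sqrt(1/n + (x* - x_bar)^2 / ((n - 1) * s_x^2) + 1)."""
    extra = 1.0 / n + (x_star - x_bar) ** 2 / ((n - 1) * s_x ** 2)
    return se_regression * math.sqrt(extra + 1.0)

SE_REGRESSION = 209_343.68   # from the Excel regression statistics
X_BAR = 3000.0               # hypothetical mean square footage (not in the slides)
S_X = 1300.0                 # hypothetical std. dev. of square footage (not in the slides)

for n in (187, 10_000, 1_000_000):
    se = se_individual_prediction(SE_REGRESSION, n, 2968, X_BAR, S_X)
    ratio = se / SE_REGRESSION
    print(f"n = {n:>9,}: se of prediction = {se:,.0f} ({ratio:.5f} x se_regression)")
```

As n grows, both terms added to the 1 under the square root shrink toward zero, so the ratio approaches 1.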


 The value of the “± part” depends on x*, the particular value of X we are using to make our prediction of Y:

t · se_regression · √( 1/n + (x* − x̄)² / ((n − 1) s_x²) + 1 )

 When x* is closer to x̄ the interval will be tighter
 When x* is further from x̄ the interval will be wider
 In our example, x* is 2968 sqft.

$826,559 +/- $410,314

This is disappointing…
Can Zillow do better?
 86% are within 20% of sale price

 Our Model -- 95% are within 50% of sale price

 They WIN!! Do we surrender?


 Analysts often talk about how well (or not so well) their models “fit the data”
 The best known (but not the only) statistic that is used to measure regression performance is the R²
 Some analysts put enormous weight on R²
 A former adjunct professor at Kellogg once offered “Rules of Thumb” along the lines of “if your R² is below 0.30, your work is trash; if it is above 0.70, your work is excellent”
 Most seasoned analysts, and virtually everyone who truly understands statistics, usually make no more than a passing reference to R²
 I would like to explain why R² is usually not an important statistic and, in the process, dissuade you from paying it much attention
 Wilmette home prices vary.

 Some of the variance in Price can be explained by the square footage.


“That house sold for a high Price because it was enormous”
“That other house sold for a small Price because it was tiny”

 R² measures the percentage of the variance of Y that can be explained by the value of X.
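Notation (figure): for each xi, the plot marks the observed value yi, the fitted value ŷi on the regression line, the sample mean ȳ, and the residual ei = (yi − ŷi).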
 ei = yi − ŷi
 yi = ŷi + ei
 Var(yi) = Var(ŷi) + Var(ei)
 Σ (yi − ȳ)² = Σ (ŷi − ȳ)² + Σ (yi − ŷi)², with each sum over i = 1, …, n

Total SS = Regression SS + Residual SS
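The decomposition can be verified numerically on any least squares fit. The sketch below uses a small made-up data set (not the Wilmette data) and fits the line with the standard closed-form formulas before computing the three sums of squares.

```python
# Made-up (sqft, price) pairs, NOT the Wilmette data.
xs = [1200, 1800, 2400, 3000, 3600]
ys = [350_000, 520_000, 640_000, 850_000, 990_000]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope and intercept.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]     # y-hat for each observation

total_ss = sum((y - y_bar) ** 2 for y in ys)
regression_ss = sum((f - y_bar) ** 2 for f in fitted)
residual_ss = sum((y - f) ** 2 for y, f in zip(ys, fitted))

print(f"Total SS              = {total_ss:.6e}")
print(f"Regression + Residual = {regression_ss + residual_ss:.6e}")
```

The two printed values match up to floating-point rounding. The identity relies on the line being the least squares line; for an arbitrary line the cross term does not vanish.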
 We define R² as follows:

R² = Regression SS / Total SS

R² is a measure of relative precision. It tells us the fraction of the total variance that can be explained by our linear model.

Total SS = Regression SS + Residual SS

Regression SS: Explained by Model
Residual SS: Unexplained
Regression Statistics
Multiple R          0.859926325
R Square            0.739473284
Adjusted R Square   0.738065032
Standard Error      209343.68
Observations        187

ANOVA
             df    SS            MS            F          Significance F
Regression   1     2.30124E+13   2.30124E+13   525.0999   6.29254E-56
Residual     185   8.10758E+12   43824776351
Total        186   3.112E+13

R² = Regression SS / Total SS = 2.30e+13 / 3.11e+13 = 0.7395
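Two numbers in the Regression Statistics block can be recovered from the ANOVA table alone. This sketch uses the sums of squares as displayed (rounded to six significant figures), so the results match Excel only approximately.

```python
import math

# Sums of squares as displayed in the ANOVA table (rounded by Excel).
regression_ss = 2.30124e13
residual_ss = 8.10758e12
residual_df = 185                       # n - k - 1 = 187 - 1 - 1

total_ss = regression_ss + residual_ss

r_squared = regression_ss / total_ss    # share of Total SS explained by the model
se_regression = math.sqrt(residual_ss / residual_df)   # sqrt of Residual MS

print(f"R Square       ~ {r_squared:.4f}")
print(f"Standard Error ~ {se_regression:,.0f}")
```

The second line shows where the "Standard Error" of about $209,344 used in the prediction interval comes from: it is the square root of the Residual MS.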
 R² is a relative measure of accuracy.
 If R² is very high, it means a very specific thing: the regression equation does a much better job predicting y as compared to the sample mean, ȳ.
 If the variance of Y is extremely high, then even an R² close to ONE does not guarantee useful prediction intervals.
 R² is one way (but not the only way) to measure regression performance
 R² is the fraction of the variation in y explained by the x-variables in your regression model
 A regression with a high R² can be uninformative
 A regression with a low R² can be highly informative
 The specific questions motivating your analysis usually suggest better, more relevant tools than R²
Practice:
Prediction
Professor Brett Saraniti
 Movie studios often budget tens or even hundreds of millions of dollars
towards marketing a film…

They can reallocate funds or other assets up to the very last minute

 The critical point for any movie is the opening weekend. After that there is
little more that the studios can do to improve the financial performance of a
movie

 Predicting opening weekend gross is worth a great deal to the studios


 Until the 2000s, the studios used mall surveys to forecast revenues…

 These all sucked. Really bad.

 Any surveys from more than a week or two away from opening were total
noise.
 HSX (the Hollywood Stock Exchange) sold forecasts to Hollywood studios for a six-figure fee.

 The studios want to receive their forecasts on Thursday morning which is the
last time they have any ability to add or subtract support for the film.

 HSX can take the Wednesday closing price to utilize the most recent
information when making its forecasts.

 So can we…
 Grab the data from canvas: HSX.xlsx
 Grab the homework instructions on sasinware
