Sasin DECS 434 Session 3 - Regression For Prediction

The document discusses building a linear regression model to predict home prices in Wilmette, IL based on square footage. It explains how to create the model using Excel and estimate the coefficients, and how to use the model to predict prices and provide prediction intervals.

Uploaded by

rawich

Regression:

Prediction
Professor Brett Saraniti
Module 3, 2022
What single factor is the biggest
determinant in the price of a home?
 To use DATA to build a MODEL that predicts the market value of a home in
Wilmette, IL.

 Wilmette is the small suburb just North of Evanston where some Kellogg faculty
live. Also a few derivative traders, big shot attorneys, various tycoons & titans of
industry…

 Most famous recent resident 

Every house sold in Wilmette: April 17, 2018 - April 17, 2019
Excel Spreadsheet - Wilmette.xlsx
 Is there a relationship between the size of the home and its sale price?
 Is this relationship LINEAR?

 How precise is the relationship –

 Does it FIT exactly on a straight line?
 How far from “the line” do the data points tend to be?

 Are there other observable/testable traits about the relationship between Price and Squarefootage?
 What can we do to improve our model?
https://www.redfin.com/IL/Winnetka/56-Indian-Hill-Rd-60093/home/13782109
 Let’s build a model to Predict the Price of a home in Wilmette

 We believe Price can be expressed as an “almost linear” function of squarefeet:

Price = β0 + β1 squarefeet + ε

(β0 + β1 squarefeet is the Linear part; ε is the Noise or Error)
 The simple linear regression model

Y = β0 + β1 X + ε
 Y is called the dependent variable
 X is called the independent variable
 We use Excel (or other software) to estimate β0 and β1 .
 We’ll make a few assumptions about ε

 Specifically, we’ll assume ε ~ Normal(0, σ²)

Our model assumes that the actual price for any home of a
certain squarefootage (sqft), will be “near” the line --

 For a house with 2968 sqft, we assume the price can be represented by the following model:

Price = β0 + β1 (2968) + ε
 The error term represents the value of the home that
cannot be explained just by its size…
 Excel and a million other software platforms use a process called the “least
squares method” to estimate the coefficients in the model.

 The method dates back to 1795 and a guy named Gauss

 Some details 
 Excel will do all the hard work for us. It will produce the equation for “the
best” line. In our case, it looks like this:

Price = 14,699 + 273.54 squarefeet
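To make the least squares idea concrete, here is a minimal Python sketch of the closed-form estimates for a one-variable regression. The five data points are hypothetical toy values, not the Wilmette sales, and the function name `fit_line` is only an illustration; Excel reaches the same kind of answer through its own routines.

```python
from statistics import mean


def fit_line(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx              # slope estimate
    b0 = y_bar - b1 * x_bar     # intercept estimate
    return b0, b1


# HYPOTHETICAL toy data (not from Wilmette.xlsx): square feet and sale price.
sqft = [1500, 2000, 2500, 3000, 3500]
price = [430000, 560000, 700000, 820000, 990000]

b0, b1 = fit_line(sqft, price)
print(f"Price = {b0:,.0f} + {b1:.2f} * sqft")
```

The fitted line always passes through the point of sample means, which is why the intercept is computed from the slope and the two averages.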

 The coefficient β0 is estimated to be 14,699.

 β0 represents the Y-intercept or constant term in the line.
 Always remember: this is just an estimate of β0. We won’t ever KNOW the
true value of β0
 The value reported by Excel as the intercept is referred to as b0 (here, b0 = 14,699)
 Excel gave us this linear equation:

Price = 14,699 + 273.54 sqft

 The coefficient β1 is estimated to be 273.54

 β1 represents the slope or rate of change term between sqft and price.
 Always remember: this is just an estimate of β1. We won’t ever KNOW the
true value of β1
 The value reported by Excel as the coefficient of X is referred to as b1
Rise / Run = 273.54
 The Regression MODEL –
Y = β0 + β1 X + ε

 The Regression LINE –

ŷ = b0 + b1 x
[Excel screenshots: “Click Here”, “Then Here” callouts for opening the regression tool]
Input the Column Info for Y and X.
Tick Boxes for Residuals and Line Fit Plots.
Wilmette Real Estate

Price = 14699 + 273.54 sqft

 b1 estimates β1
 We can compute 95% Confidence Intervals for β1

We are 95% Confident that β1 is in the range [249.99, 297.09]
Confidence Interval for the unknown population mean, μ:
$\bar{X} \pm t \cdot \frac{s}{\sqrt{n}}$, that is, $\bar{X} \pm t \cdot (\text{Std Error})$

Confidence Interval for the unknown model parameter, β1:
$b_1 \pm t \cdot (\text{Std Err})$
              Coefficients   Standard Error   t Stat        P-value    Lower 95%      Upper 95%
Intercept     14699.43561    37993.96821      0.386888664   0.699283   -60257.72241   89656.59364
SQUARE FEET   273.5377151    11.93702873      22.91505879   6.29E-56   249.9875099    297.0879203

$b_1 \pm t \cdot (\text{Std. Err}) = 273.54 \pm 1.973 \times 11.94 = 273.54 \pm 23.55$

[249.99 , 297.09]

t is computed using =T.INV.2T(.05,185) = 1.973

The degrees of freedom used to compute t-values for confidence intervals in regression equals: n – k – 1, where k is the number of X’s. Here n = 187 and k = 1, so there are 185 degrees of freedom.
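The interval arithmetic can be checked with a few lines of Python. This is only a sketch: the slope, its standard error, and the observation count are copied from the Excel output, and the critical value 1.973 is taken from the slide's T.INV.2T result rather than recomputed.

```python
# Values copied from the Excel regression output.
b1 = 273.5377151       # estimated slope
se_b1 = 11.93702873    # standard error of the slope
n, k = 187, 1          # observations and number of X variables

# Critical value as reported on the slide for =T.INV.2T(.05,185).
t_crit = 1.973

df = n - k - 1                    # degrees of freedom
margin = t_crit * se_b1           # the "plus or minus" part
lower, upper = b1 - margin, b1 + margin

print(f"df = {df}")
print(f"95% CI for beta1: [{lower:.2f}, {upper:.2f}]")
```

Because the critical value is rounded to three decimals, the endpoints agree with Excel's Lower 95% and Upper 95% columns only to about two decimal places.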
 We ALWAYS run this particular test:
H0: β1 = 0
Ha: β1 ≠ 0
 Recall our Model: Y = β0 + β1 X + ε
 Suppose we find that the p-value for this test is quite small.
 Then we say …
 the coefficient β1 is statistically significant or
 X has a statistically significant relationship with Y
              Coefficients   Standard Error   t Stat        P-value    Lower 95%      Upper 95%
Intercept     14699.43561    37993.96821      0.386888664   0.699283   -60257.72241   89656.59364
SQUARE FEET   273.5377151    11.93702873      22.91505879   6.29E-56   249.9875099    297.0879203

H0: β1 = 0
Ha: β1 ≠ 0

$t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{273.54 - 0}{11.94} = 22.915$

p = T.DIST.2T(22.915,185) = 6.29 × 10^-56 ≈ 0.000000000
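As a quick check of the test statistic, the sketch below recomputes t from the Excel output and compares it with the two-sided 5% critical value quoted on the slides (1.973). It does not recompute the p-value, which would need a t-distribution routine outside the Python standard library.

```python
# Values copied from the Excel regression output.
b1 = 273.5377151       # estimated slope
se_b1 = 11.93702873    # standard error of the slope
beta1_null = 0.0       # value of beta1 under H0

# Two-sided 5% critical value for 185 degrees of freedom, as given on the slide.
t_crit = 1.973

t_stat = (b1 - beta1_null) / se_b1
reject_null = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}, reject H0 at the 5% level: {reject_null}")
```

Rejecting H0 at the 5% level is the same statement as saying the 95% confidence interval for β1 does not contain zero.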
 Statistical Significance ≠ Economic Significance

“We can be very very very sure that something matters a teeny
teeny tiny bit”
-Professor Brett Saraniti

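$s_{b_1} = \frac{s}{s_x \sqrt{n - 1}}$

(s is the standard error of the regression; s_x is the sample standard deviation of X)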
 We want to say something about Professor Schummer’s house
 We know sqft = 2968. How do we use our regression to estimate the selling price?

 Price = 14699 + 273.54 sqft

 Price = 14699 + 273.54 (2968) = $826,559 (computed with Excel’s unrounded coefficients; the rounded ones shown here give about $826,566)
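The point estimate is just the regression line evaluated at 2968 sqft. The short sketch below does that twice, once with the rounded coefficients shown on the slide and once with the unrounded coefficients from the Excel output; the helper name `predict_price` is hypothetical.

```python
def predict_price(sqft, b0, b1):
    """Evaluate the fitted line y-hat = b0 + b1 * x at a given square footage."""
    return b0 + b1 * sqft


x_star = 2968  # square footage of the house in question

# Rounded coefficients as displayed on the slide.
rounded = predict_price(x_star, 14699, 273.54)
# Unrounded coefficients from the Excel coefficient table.
unrounded = predict_price(x_star, 14699.43561, 273.5377151)

print(f"rounded coefficients:   ${rounded:,.0f}")
print(f"unrounded coefficients: ${unrounded:,.0f}")
```

The two answers differ by only a few dollars, which is negligible next to the uncertainty in any single home's price.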

 Next Question: How GOOD is our prediction?

Y = β0 + β1 X + ε, with ε ~ Normal(0, σ²)
The Standard Error of the regression estimates the σ for the error term.
 We want to add a sense of precision to our Point Estimate: $826,559
 A Confidence Interval is a range or interval estimate for the MEAN
 We don’t want to be 95% confident about the average price of houses with 2968
sqft of living space.
 We want to be 95% confident about the price of ONE PARTICULAR house: Jim’s

 The tool for this job is called a Prediction Interval

[Figure: Confidence Interval for the mean vs. Prediction Interval for one instance]
 Approximate 95% prediction interval: (b0 + b1 x*) +/- 1.96 (se-regression)

 $826,559 +/- 1.96 ∙ $209,344

 We are 95% confident that Jimmy’s home price will be in this range:
$826,559 +/- $410,314, i.e. roughly $416,245 to $1,236,873

 Why is this “approximate”?

 We should use t, not 1.96
 There is also error in the coefficients which we are ignoring (for now.)
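The arithmetic behind this approximate interval can be sketched in Python using only numbers from the slides: the point estimate, the standard error of the regression, and the normal multiplier 1.96.

```python
# Numbers taken from the slides.
point_estimate = 826_559    # predicted price at 2968 sqft
se_regression = 209_344     # standard error of the regression (rounded)
z = 1.96                    # normal approximation to the t-value

margin = z * se_regression
low, high = point_estimate - margin, point_estimate + margin
relative_half_width = margin / point_estimate

print(f"margin: ${margin:,.0f}")
print(f"interval: ${low:,.0f} to ${high:,.0f}")
print(f"half-width as a share of the estimate: {relative_half_width:.1%}")
```

The half-width is roughly half of the point estimate, which is a useful number to keep in mind when judging how informative the prediction is.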
 (b0 + b1 x*) +/- (t-value)(st. err. of an individual prediction at x*)

st. err. of an individual prediction at x* = $\text{se}_{\text{regression}} \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n-1) s_x^2} + 1}$

Q: What happens when n is really really really large?

A1: This standard error converges to $\text{se}_{\text{regression}}$

A2: The t-value converges to 1.96

 The value of the “± part” depends on x*, the particular value of X we are using to make our prediction of Y:

$t \cdot \text{se}_{\text{regression}} \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n-1) s_x^2} + 1}$

 When x* is closer to x̄ the interval will be tighter

 When x* is further from x̄ the interval will be wider

 In our example, x* is 2968 sqft.
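The behavior of the exact standard error can be explored with a small Python sketch. The standard error of the regression and n come from the Excel output, but the slides do not report the sample mean or standard deviation of square footage, so `X_BAR` and `S_X` below are hypothetical placeholders chosen only to show the shape of the formula.

```python
import math


def prediction_se(se_regression, n, x_star, x_bar, s_x):
    """Standard error of an individual prediction at x_star."""
    distance_term = (x_star - x_bar) ** 2 / ((n - 1) * s_x ** 2)
    return se_regression * math.sqrt(1 + 1 / n + distance_term)


SE_REG, N = 209_343.68, 187    # from the Excel regression output
X_BAR, S_X = 3000.0, 1300.0    # HYPOTHETICAL: not reported on the slides

at_mean = prediction_se(SE_REG, N, X_BAR, X_BAR, S_X)   # x* equal to the mean
near = prediction_se(SE_REG, N, 2968, X_BAR, S_X)       # x* close to the mean
far = prediction_se(SE_REG, N, 6000, X_BAR, S_X)        # x* far from the mean
huge_n = prediction_se(SE_REG, 10**9, 2968, X_BAR, S_X) # very large sample

print(f"at mean: {at_mean:,.0f}  near: {near:,.0f}  far: {far:,.0f}")
print(f"n = 1e9: {huge_n:,.2f} (se-regression = {SE_REG:,.2f})")
```

Under these assumed inputs the standard error is smallest at the mean, grows as x* moves away from it, and collapses to the standard error of the regression as n becomes very large.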

$826,559 +/- $410,314

This is disappointing…
Can Zillow do better?
 Zillow -- 86% are within 20% of sale price

 Our Model -- 95% are within 50% of sale price ($410,314 is about 50% of $826,559)

 They WIN!! Do we surrender?

 Analysts often talk about how well (or not so well) their models “fit the data”
 The best known (but not the only) statistic that is used to measure regression
performance is the R2
 Some analysts put enormous weight on R2
 A former adjunct professor at Kellogg once offered “Rules of Thumb” along the lines of “if your R2 is below 0.30, your work is trash; if it is above 0.70, your work is excellent”
 Most seasoned analysts, and virtually everyone who truly understands statistics,
usually make no more than a passing reference to R2
 I would like to explain why R2 is usually not an important statistic and, in
the process, dissuade you from paying it much attention
 Wilmette home prices vary.

 Some of the variance in Price can be explained by the square footage.

“That house sold for a high Price because it was enormous”
“That other house sold for a small Price because it was tiny”

 R2 measures the percentage of the variance of Y that can be explained by the value of X.
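Notation

[Figure: scatter plot marking, at a given x_i, the observed value y_i, the fitted value ŷ_i, the sample mean ȳ, and the residual e_i = (y_i − ŷ_i)]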
 $e_i = y_i - \hat{y}_i$
 $y_i = \hat{y}_i + e_i$
 $Var(y_i) = Var(\hat{y}_i) + Var(e_i)$

 $\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Total SS = Regression SS + Residual SS
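The decomposition can be verified numerically. The sketch below fits a least squares line to five hypothetical data points (not the Wilmette sales) and confirms that the three sums of squares add up as stated; the identity relies on the line being the least squares fit with an intercept.

```python
from statistics import mean

# HYPOTHETICAL toy data: square feet and sale price.
x = [1500, 2000, 2500, 3000, 3500]
y = [430000, 560000, 700000, 820000, 990000]

# Least squares fit.
x_bar, y_bar = mean(x), mean(y)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# The three sums of squares.
total_ss = sum((yi - y_bar) ** 2 for yi in y)
regression_ss = sum((yh - y_bar) ** 2 for yh in y_hat)
residual_ss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

print(f"Total SS      = {total_ss:.4e}")
print(f"Regression SS = {regression_ss:.4e}")
print(f"Residual SS   = {residual_ss:.4e}")
print(f"Reg + Resid   = {regression_ss + residual_ss:.4e}")
```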
 We define R2 as follows:

$R^2 = \frac{\text{Regression SS}}{\text{Total SS}}$
R2 is a measure of relative precision. It tells us the fraction of the total
variance that can be explained by our linear model.

Total SS = Regression SS + Residual SS

(Regression SS: Explained by Model; Residual SS: Unexplained)
Regression Statistics
Multiple R          0.859926325
R Square            0.739473284
Adjusted R Square   0.738065032
Standard Error      209343.68
Observations        187

ANOVA
             df    SS            MS            F          Significance F
Regression   1     2.30124E+13   2.30124E+13   525.0999   6.29254E-56
Residual     185   8.10758E+12   43824776351
Total        186   3.112E+13

$R^2 = \frac{\text{Regression SS}}{\text{Total SS}} = \frac{2.30 \times 10^{13}}{3.11 \times 10^{13}} = 0.7395$
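Several entries in the Excel output can be rebuilt from the ANOVA sums of squares alone. The sketch below does this in Python with the rounded values printed in the table, so the results match Excel only to the precision of those inputs.

```python
import math

# Sums of squares as printed in the ANOVA table (rounded by Excel's display).
regression_ss = 2.30124e13
residual_ss = 8.10758e12
total_ss = 3.112e13
n, k = 187, 1                      # observations and number of X variables

r2 = regression_ss / total_ss                       # R Square
multiple_r = math.sqrt(r2)                          # Multiple R
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)       # Adjusted R Square
ms_residual = residual_ss / (n - k - 1)             # MS for the Residual row
se_regression = math.sqrt(ms_residual)              # Standard Error
f_stat = (regression_ss / k) / ms_residual          # F

print(f"R Square          = {r2:.4f}")
print(f"Multiple R        = {multiple_r:.4f}")
print(f"Adjusted R Square = {adj_r2:.4f}")
print(f"Standard Error    = {se_regression:,.0f}")
print(f"F                 = {f_stat:.1f}")
```

With a single X variable, F is the square of the slope's t-statistic, which is why its significance matches the slope's p-value in the coefficient table.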
 R2 is a relative measure of accuracy.
 If R2 is very high, it means a very specific thing:

The regression equation does a much better job predicting y as compared to the sample mean, ȳ.

 If the variance of Y is extremely high, then even an R2 close to ONE does not
guarantee useful prediction intervals.
 R2 is one way (but not the only way) to measure regression performance
 R2 is the fraction of the variation in y explained by the x-variables in your
regression model
 A regression with a high R2 can be uninformative
 A regression with a low R2 can be highly informative
 The specific questions motivating your analysis usually suggest better, more
relevant tools than R2
Practice:
Prediction
Professor Brett Saraniti
 Movie studios often budget tens or even hundreds of millions of dollars
towards marketing a film…

They can reallocate funds or other assets up to the very last minute

 The critical point for any movie is the opening weekend. After that there is
little more that the studios can do to improve the financial performance of a
movie

 Predicting opening weekend gross is worth a great deal to the studios

 Until the 2000s the studios used mall surveys to forecast revenues…

 These all sucked. Really bad.

 Any surveys from more than a week or two away from opening were total
noise.
 HSX (the Hollywood Stock Exchange, a prediction market for movies) sold forecasts to Hollywood studios for a six-figure fee.

 The studios want to receive their forecasts on Thursday morning which is the
last time they have any ability to add or subtract support for the film.

 HSX can take the Wednesday closing price to utilize the most recent
information when making its forecasts.

 So can we…
 Grab the data from canvas: HSX.xlsx
 Grab the homework instructions on sasinware
