Week 10_Lecture 10

The document discusses model building in the context of linear and logistic regression, emphasizing the importance of evidence-based modeling and the relationship between independent and dependent variables. It outlines the goals of model development, assessment, and the need to avoid overfitting by using training and validation datasets. Additionally, it covers techniques for selecting predictor variables and evaluating model performance using various statistical measures.



W10
Tertulia:…

Lecture: Model Building

Class discussion: Based on this week’s readings

Case Presentation: Moved to March 24

Python workshop: Linear and Logistic Regression with Python


1
• Election and Digital Transformation

2
3
• Linear and Logistic Regression

4
5
George Box

6
Models are built
based on evidence,
not vice versa.
7
Careful: It is far from SIMPLE!
The outcome has consequences!
There is never ONLY ONE WAY!
It is all about the art of presentation!
But your story should MAKE SENSE!
50-50 is a bad predictive performance.
Prescribing matters the most.

8
[Figure: regression setup — the Dependent Variable (e.g., actual Price or Salary) plotted against the Independent Variable(s) (e.g., Performance, Condition), with the fitted Model line and the Mean line]
9
10
Advantages
• Simple, easy to understand, easy to implement

Limitations
• Cannot capture or represent non-linear relationships between input and output variables
• Cannot deal with interactions between input variables
11
Description Prediction

12
Goal: to explain the relationship between the independent (explanatory) variables and the dependent variable

Datasets: rows are cases (observations) and columns are variables (features)

Objective: to fit the data well and understand the contribution of the explanatory variables to the target variable

Measurement of goodness-of-fit: R², residual analysis, p-values
13
Goal: to predict target values in situations where we only have predictor values, but not target values

Datasets: rows are cases (observations) and columns are variables (features)

Objective: to optimize predictive accuracy

Model development and assessment: train the model on the training set and measure its performance on the validation set
14
• Problem: How well will our model perform with new data?
• Solution: Separate the data into two parts
  • Training partition: used to develop the model
  • Validation partition: used to implement the model and evaluate its performance on "new" data; this addresses the issue of overfitting

[Diagram: Training Data → Build Model(s); Validation Data → Evaluate Model(s)]

15
MULTIPLE LINEAR REGRESSION (WHAT DATA DO WE NEED?)

y = b0 + b1x1 + b2x2 + ... + bkxk + e

Given values of the predictors (independent variables) (x1, x2, …, xk), the algorithm chooses the regression coefficients (b0, b1, b2, ..., bk) to minimize the error (e): the difference between the actual values of the dependent variable (y) and the predicted values (y′).
16
[Diagram: PRICE (y) = b0 + b1·Age + b2·Mileage + b3·Fuel Type + b4·Engine Power + b5·Metallic Color + b6·Automatic Gear + b7·Engine Capacity + b8·Number of Doors + b9·Number of Gears + b10·Weight + b11·ABS + b12·Air Conditioner + b13·Central Lock + b14·Powered Windows + e]
17
remodeled   None  Old  Recent
None          1    0     0
Old           0    1     0
Recent        0    0     1

18
region   East  North  South  West
East       1     0      0     0
North      0     1      0     0
South      0     0      1     0
West       0     0      0     1
19
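As an illustration of this dummy coding in Python — a minimal sketch using a hypothetical toy column; pandas' get_dummies is the same function used on the next slide:

```python
import pandas as pd

# Hypothetical toy column standing in for the 'remodeled' feature above.
df = pd.DataFrame({"remodeled": ["None", "Old", "Recent", "Old"]})

# Full dummy coding: one 0/1 indicator column per category, as in the tables.
print(pd.get_dummies(df["remodeled"]))

# drop_first=True drops one reference category ('None' here) to avoid
# perfect collinearity (the "dummy variable trap") in linear regression.
print(pd.get_dummies(df["remodeled"], drop_first=True))
```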
X = pd.get_dummies(dataframe[predictors], drop_first=True)  # drop_first=True (or False)
y = dataframe[outcome]
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.3, random_state=1)

CONTINUOUS VARIABLES + CATEGORICAL VARIABLES → DUMMY VARIABLES

[Diagram: THE DATASET — the PREDICTORS (X) and the OUTCOME VARIABLE (y) are split into training (train_X, train_y) and validation (valid_X, valid_y) partitions; the MODEL produces predictions and errors for each partition]
20
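A self-contained version of this preprocessing pipeline — a sketch assuming a hypothetical toy car-price DataFrame; the split uses scikit-learn's train_test_split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the car-price example.
dataframe = pd.DataFrame({
    "age":       [5, 3, 8, 2, 6, 4],
    "mileage":   [60000, 35000, 90000, 15000, 70000, 40000],
    "fuel_type": ["Diesel", "Petrol", "Diesel", "Petrol", "CNG", "Diesel"],
    "price":     [9500, 13200, 7200, 15800, 8100, 11900],
})
predictors = ["age", "mileage", "fuel_type"]
outcome = "price"

# Continuous predictors pass through; categorical ones become dummy variables.
X = pd.get_dummies(dataframe[predictors], drop_first=True)
y = dataframe[outcome]

# 70/30 train/validation split with a fixed seed for reproducibility.
train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, test_size=0.3, random_state=1)
```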
21
Prediction error: e = y − y′

• Average error (captures systematic over- or under-prediction): (1/n) Σ eᵢ
• MAE or MAD (mean absolute error/deviation): (1/n) Σ |eᵢ|
• MAPE (mean absolute percentage error): (1/n) Σ |eᵢ / yᵢ|
• RMSE or RASE (root mean/average squared error): √((1/n) Σ eᵢ²)
• Total SSE (total sum of squared errors): Σ eᵢ²

22
 y    y′    e   |e|   e²   |e/y|
33    34   −1    1     1   0.03
59    49   10   10   100   0.17
47    51   −4    4    16   0.09
65    70   −5    5    25   0.08
Total       0   20   142   0.36

Average error = 0/4 = 0    MAE = 20/4 = 5    RMSE = √(142/4) ≅ 6    MAPE = 0.36/4 ≅ 0.09

23
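These measures are easy to reproduce in plain Python — a minimal sketch using the y and y′ values from the table above:

```python
import math

y      = [33, 59, 47, 65]   # actual values from the table above
y_pred = [34, 49, 51, 70]   # predicted values from the table above

errors = [a - p for a, p in zip(y, y_pred)]  # e = y − y′ → [-1, 10, -4, -5]
n = len(errors)

average_error = sum(errors) / n                        # 0.0
mae  = sum(abs(e) for e in errors) / n                 # 5.0
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)      # ≈ 5.96
mape = sum(abs(e / a) for e, a in zip(errors, y)) / n  # ≈ 0.09
sse  = sum(e ** 2 for e in errors)                     # 142
```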
      y    y′   |e|   % of MAE total (20)   e²   % of SSE total (142)
e1   33    34    1          5%               1          0.7%
e2   59    49   10         50%             100         70.4%
e3   47    51    4         20%              16         11.3%
e4   65    70    5         25%              25         17.6%
Total           20        100%             142        100%

MAE = 20/4 = 5        RMSE = √(142/4) ≅ 5.96

[Chart: contribution of each error e1–e4 to MAE vs. RMSE — squaring makes the largest error (e2) dominate RMSE (70.4%) far more than MAE (50%)]

24
TSS: represents the total variation of y
ESS: represents the variation explained by a function of the predicting variables
RSS: represents the prediction error

TSS (Total Sum of Squares):     TSS = Σ (yᵢ − ȳ)²
ESS (Explained Sum of Squares): ESS = Σ (ŷᵢ − ȳ)²
RSS (Residual Sum of Squares):  RSS = Σ (yᵢ − ŷᵢ)²

Since TSS = ESS + RSS, the two forms agree: R² = ESS/TSS = 1 − RSS/TSS

Model explains nothing ← 0 ≤ R² ≤ 1 → Model explains everything

[Figure: scatter of observations y against x, with the fitted line (ŷ, predicted) and the mean line (ȳ), illustrating the decomposition]
25
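A toy illustration of R² = 1 − RSS/TSS, reusing the y and y′ values from the earlier error table:

```python
y      = [33, 59, 47, 65]   # actual values
y_pred = [34, 49, 51, 70]   # predicted values
mean_y = sum(y) / len(y)    # ȳ = 51

rss = sum((a - p) ** 2 for a, p in zip(y, y_pred))  # Σ(y − ŷ)² = 142
tss = sum((a - mean_y) ** 2 for a in y)             # Σ(y − ȳ)² = 600
r2  = 1 - rss / tss                                 # ≈ 0.76
print(r2)
```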
[Figure: the same scatter plot — observations y, fitted line ŷ, and mean line ȳ against x]
26
[Figure: the same scatter plot — observations y, fitted line ŷ, and mean line ȳ against x]
27
REGRESSION FORMULA

predicted_price =
  631,256
  − 10,706 · age
  + 65 · size
  + 116,211 · rooms
  − 250 · area_density
  + 54,307 · school_score
  + 105,593 · remodeled_Old
  + 267,035 · remodeled_Recent
  − 16,081 · region_North
  − 78,267 · region_South
  + 16,655 · region_West

28
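A sketch of how such a formula is obtained in Python, continuing the hypothetical train/validation split from slide 20 (scikit-learn's LinearRegression fits ordinary least squares):

```python
from sklearn.linear_model import LinearRegression

# Fit ordinary least squares on the training partition
# (train_X / train_y from the toy split sketched after slide 20).
model = LinearRegression()
model.fit(train_X, train_y)

# The intercept and coefficients correspond to the numbers in a
# formula like the one above.
print("intercept:", round(model.intercept_))
for name, coef in zip(train_X.columns, model.coef_):
    print(f"{name}: {coef:,.0f}")
```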
UNDERFIT,
OVERFIT,
OR RIGHT FIT
29
Problem:

• Overly complex models run the danger of overfitting

Solution:

• Reduce the number of variables via automated selection of variable subsets
30
[Figure: Happiness vs. Income — training-set and validation-set data points]
31
[Figure: Happiness vs. Income — an underfit model on the training and validation sets]
High Bias, Low Variance
High Training Error, High Validation Error
32
[Figure: Happiness vs. Income — an overfit model on the training and validation sets]
Low Bias, High Variance
No Training Error, High Validation Error
33
[Figure: Happiness vs. Income — a right-fit model on the training and validation sets]
Low Bias, Low Variance
Low Training Error, Low Validation Error
34
[Chart: Prediction Error vs. Model Complexity — the training error keeps falling as complexity grows, while the validation error is U-shaped; the Optimal Model Complexity lies at the minimum of the validation error, between the high-bias/low-variance region (low complexity) and the low-bias/high-variance region (high complexity)]
35


Selecting the best set of
variables for predicting
• Manually
• Automatically

36
SELECTING SUBSETS OF PREDICTORS

Goal:

• To find a parsimonious model: the simplest model that performs sufficiently well.

Benefits:

• Minimizing data collection requirements
• Maximizing robustness
• Improving predictive accuracy

37
Forward Selection
• Addition

Backward Elimination
• Deletion

Mixed Stepwise
• Addition/Deletion

Exhaustive Search
• All possible combinations
38
• Starts with no predictors
• Adds them one by one (at each step, add the one with the largest contribution)
• When to stop:
  • p-value threshold: stop when no remaining candidate predictor has a statistically significant contribution
  • Max validation R²: stop when the R² on the validation set stops improving as predictors are added (only available when there is a validation column)
• Note: using more predictors in the prediction model does not always improve R² on the validation data set
39
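A sketch of forward selection in Python, using scikit-learn's SequentialFeatureSelector (assuming scikit-learn ≥ 1.1 and a reasonably sized train_X/train_y; this mirrors the "max validation R²" stopping rule, but with cross-validation standing in for a single validation column):

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Greedy forward selection: start with no predictors and, at each step,
# add the one that most improves cross-validated R².
sfs = SequentialFeatureSelector(
    LinearRegression(),
    direction="forward",            # "backward" gives backward elimination
    n_features_to_select="auto",
    tol=1e-3,                       # stop when the improvement falls below tol
    scoring="r2",
    cv=5,
)
sfs.fit(train_X, train_y)
print("selected predictors:", list(train_X.columns[sfs.get_support()]))
```

Setting direction="backward" makes the same selector start from all predictors and drop the least useful one at each step, i.e., the backward elimination of the next slide.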
[Diagram: model y = a0·x0 + a1·x1 + a2·x2 + a3·x3 + a4·x4 (with x0 = 1), and R² on a 0–1 scale tracking how the fit improves as predictors are added]
40
• Starts with all predictors
• Successively eliminates the least useful predictor, one at a time
• Stopping rules:
  • p-value threshold: stop when all remaining predictors have a statistically significant contribution
  • Max validation R²: stop when the R² on the validation set does not improve by removing a predictor (only available when there is a validation column)

41
[Diagram: model y = a0·x0 + a1·x1 + a2·x2 + a3·x3 + a4·x4 (with x0 = 1), and R² on a 0–1 scale tracking how the fit changes as predictors are removed]
42
• Like forward selection, except that at each step we also consider dropping predictors that are not statistically significant, as in backward elimination
• The stopping rule is a p-value threshold with:
  • Prob to Enter (for adding) and
  • Prob to Leave (for dropping)

43
[Diagram: model y = a0·x0 + a1·x1 + a2·x2 + a3·x3 + a4·x4 (with x0 = 1), and R² on a 0–1 scale tracking how the fit changes as predictors are added or dropped]
44
• Exhaustive Search
  • Evaluates all possible subsets of predictors
  • 2ⁿ − 1 combinations (15 combinations for 4 variables)
  • Computationally intensive: O(2ⁿ)

All 15 subsets of {a, b, c, d}:
 1: a     2: b     3: c     4: d
 5: ab    6: ac    7: ad    8: bc    9: bd   10: cd
11: abc  12: abd  13: acd  14: bcd
15: abcd

45
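Enumerating these subsets is a one-liner in Python — a minimal illustration of why exhaustive search grows as 2ⁿ:

```python
from itertools import combinations

variables = ["a", "b", "c", "d"]

# Exhaustive search evaluates every non-empty subset of predictors:
# 2**n − 1 of them, i.e. 15 for n = 4.
subsets = [combo
           for k in range(1, len(variables) + 1)
           for combo in combinations(variables, k)]
print(len(subsets))  # 15
```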
RSS: the residual sum of squares (the smaller, the better)

R²: goodness of fit (the larger, the better)

Adjusted R²: R² adjusted for the number of predictors (the larger, the better)

Cp: Mallows' Cp is a measure of the error in the best-subset model, relative to the error when incorporating all variables (should be smaller)

AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion): measure the information lost by fitting a given model (should be smaller)
46
• Adjusted R² is a modification of R² that adjusts for the number of explanatory terms in a model:

R²_adj = 1 − (1 − R²)(n − 1)/(n − p − 1) = 1 − (SSₑ/SSₜ) × (dfₜ/dfₑ)

• where p is the total number of regressors in the linear model (not counting the constant term), n is the sample size, dfₜ = n − 1 is the degrees of freedom of the estimate of the population variance of the dependent variable, and dfₑ = n − p − 1 is the degrees of freedom of the estimate of the underlying population error variance.

47
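The adjustment is simple to compute directly — a minimal sketch with hypothetical values:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical example: R² = 0.76 with n = 100 observations, p = 10 regressors.
print(adjusted_r2(0.76, n=100, p=10))  # ≈ 0.73
```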
• Measurements for comparing models
• Both measures penalize complexity in models
• Use them in context, in combination with other measures, informed by domain knowledge
• AICc is AIC corrected for small samples
• It can be as simple as: THE MODEL WITH THE LOWEST AIC OR BIC VALUE IS THE BEST MODEL.
• AIC and BIC are based on the likelihood and the MSE, which can be a bit complicated for an introductory course
• It is challenging to find consistent formulas for AICc and BIC

48
• AICc = AIC + (2k(k + 1)) / (n − k − 1)
• AICc = (2k − 2LL) + (2k(k + 1)) / (n − k − 1)
• AICc = 2k + n × ln(RSS/n) + n × ln(2π) + n + (2k(k + 1)) / (n − k − 1)

• Where
  • n = number of observations
  • k = number of parameters in the model
  • LL = the model's log-likelihood
49
• BIC = −2LL + k × ln(n)
• BIC = n × ln(RSS/n) + k × ln(n) + n × ln(2π) + n
• BIC = n × ln(RSS/n) + k × ln(n)   (dropping terms that are constant across models)

• Where
  • n = number of observations
  • k = number of parameters in the model
50
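A small helper tying the two formulas together — an illustrative sketch for least-squares models, using toy numbers rather than a real fit:

```python
import math

def aic_bic(rss: float, n: int, k: int):
    """AIC and BIC for a least-squares model, per the formulas above."""
    ll_term = n * math.log(rss / n) + n * math.log(2 * math.pi) + n
    aic = 2 * k + ll_term
    bic = k * math.log(n) + ll_term
    return aic, bic

# Hypothetical comparison: the candidate model with the lower AIC/BIC wins.
print(aic_bic(rss=142.0, n=4, k=2))
```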
REGRESSION FORMULA

predicted_price =
  641,400
  − 10,477 · age
  + 71 · size
  + 118,214 · rooms
  + 55,257 · school_score
  + 169,436 · remodeled_Recent
  + 39,656 · region_West

51
52
• "Algorithms Need Managers, Too"
  In: Porter, M. E., Davenport, T. H., Daugherty, P., & Wilson, H. J. (2018). HBR's 10 Must Reads on AI, Analytics, and the New Machine Age.

53
Algorithms follow literal instructions, but business goals often involve nuanced trade-offs. How can managers ensure that both short-term and long-term goals are embedded in algorithm design?

• Discuss examples where algorithms maximized short-term gains at the expense of long-term brand reputation.

• Explore how managers can set explicit multi-objective goals (e.g., balancing profitability with fairness, as in the case of targeted neighborhood inspections).

54
Algorithms often provide accurate predictions but don't explain why they made a particular recommendation. How can managers trust and use these predictions effectively?

• How have companies like Netflix or eBay benefited from algorithms, despite limited understanding of the "why" behind their predictions?

• What challenges do managers face when algorithmic decisions lack transparency, and how can experimentation and data validation help?

• Does reliance on predictions without understanding causation lead to suboptimal outcomes (e.g., eBay's ineffective advertising)?

55
Algorithms rely heavily on the quality and diversity of data inputs. How can businesses ensure they are using the right data to train their algorithms?

• What is the importance of using wide and diverse data inputs (e.g., Yelp reviews used in Boston's restaurant inspection algorithm)?

• How can managers prevent the algorithm from being too myopic by expanding the range of data inputs (e.g., short-term sales versus long-term customer satisfaction)?

• How do companies adjust their data strategy to improve predictive power (e.g., moving from sales data to satisfaction metrics for product longevity predictions)?
56
57
ANY QUESTIONS?
W10
