Week 10_Lecture 10
Tertulia:…
• Linear and Logistic Regression
George Box
Models are built based on evidence, not vice versa.
Careful:
• The outcome matters the most.
• There is never ONLY ONE WAY!
• It is all about the art of presentation!
• But your story should MAKE SENSE!
• 50-50 is a bad predictive performance.
• It is far from SIMPLE!
• Prescribing has consequences!
[Diagram: the dependent variable (e.g., Price, Salary) plotted against the independent variable(s) (e.g., Performance, Condition), showing the actual values, the model, and the mean.]
Advantages
• Simple, easy to understand, easy to implement
Limitations
• Cannot capture and represent non-linear relationships between input and output variables
• Cannot deal with interactions between input variables
Description vs. Prediction
Goal: to explain the relationship between the independent (explanatory) variables and the dependent variable.

How will the model perform on new data?
• Solution: Separate the data into two parts
  • Training partition to develop the model
  • Validation partition to implement the model and evaluate its performance on "new" data; this addresses the issue of overfitting.

[Diagram: Training Data → Build Model(s); Validation Data → Evaluate Model(s)]
y = b0 + b1x1 + b2x2 + ... + bkxk + e

[Diagram: an example with ten predictors, b1x1 + b2x2 + ... + b10x10]
Example variables: Automatic Gear, Engine Capacity, Number of Doors, Number of Gears, Weight

Dummy coding of a three-level categorical variable (None / Old / Recent):
Category   Dummy 1   Dummy 2   Dummy 3
None       1         0         0
Old        0         1         0
Recent     0         0         1
Categorical variables such as Fuel Type or region are encoded with one dummy column per level:
region   East   North   South   West
East     1      0       0       0
North    0      1       0       0
South    0      0       1       0
West     0      0       0       1
import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode the categorical predictors (drop_first=True, or False to keep every dummy)
X = pd.get_dummies(dataframe[predictors], drop_first=True)
y = dataframe[outcome]

# Split into 70% training and 30% validation partitions
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.3, random_state=1)
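A minimal sketch of the next step, continuing from the snippet above: fit a linear regression on the training partition and generate predictions for the validation partition (dataframe, predictors, and outcome are placeholders carried over from above).

from sklearn.linear_model import LinearRegression

# train_X, valid_X, train_y, valid_y come from the split above
model = LinearRegression()
model.fit(train_X, train_y)        # develop the model on the training partition
pred_y = model.predict(valid_X)    # predictions on the "new" validation data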
RMSE or RASE: root mean squared (average) error, sqrt((1/n) Σ e_i²)
MAE or MAD: mean absolute error (deviation), (1/n) Σ |e_i|
Average error: systematic over- or under-prediction, (1/n) Σ e_i
MAPE: mean absolute percentage error, (1/n) Σ |e_i / y_i|
Total SSE: total sum of squared errors, Σ e_i²
y      y'     e      |e|    e²     |e/y|
33     34     -1     1      1      0.03
59     49     10     10     100    0.17
47     51     -4     4      16     0.09
65     70     -5     5      25     0.08
Total         0      20     142    0.36

Average error = 0/4 = 0
MAE = 20/4 = 5
RMSE = sqrt(142/4) ≅ 6
MAPE = 0.36/4 = 0.09
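A small sketch reproducing the numbers in the table above with numpy; the two arrays are the y and y' columns from the example.

import numpy as np

y = np.array([33, 59, 47, 65])        # actual values
y_pred = np.array([34, 49, 51, 70])   # predicted values
e = y - y_pred                        # errors: -1, 10, -4, -5

avg_error = e.mean()                  # 0.0  (no systematic over- or under-prediction)
mae = np.abs(e).mean()                # 5.0
rmse = np.sqrt((e ** 2).mean())       # ≈ 5.96
mape = np.abs(e / y).mean()           # ≈ 0.09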
Each error's share of the MAE total (20) versus the SSE total (142):

       y      y'     |e|    % of 20    e²     % of 142
e1     33     34     1      5%         1      0.7%
e2     59     49     10     50%        100    70.4%
e3     47     51     4      20%        16     11.3%
e4     65     70     5      25%        25     17.6%
Total                20                142

MAE = 20/4 = 5      RMSE = sqrt(142/4) ≅ 5.96

[Pie charts: contribution of e1–e4 to MAE and to RMSE]
TSS: represents the variation of y
ESS: represents the variation explained by a function of the predicting variables
RSS: represents the prediction error

TSS: Total Sum of Squares, TSS = Σ (y_i − ȳ)²
ESS: Explained Sum of Squares, ESS = Σ (ŷ_i − ȳ)²
RSS: Residual Sum of Squares, RSS = Σ (y_i − ŷ_i)²

R² = ESS / TSS = 1 − RSS / TSS

0 ≤ R² ≤ 1: R² = 0 means the model explains nothing; R² = 1 means the model explains everything.

[Scatter plot: observations y, predicted values ŷ, and the mean ȳ against x]
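A minimal sketch of computing TSS, RSS, and R² directly from these definitions; y and y_pred are assumed to be numpy arrays of actual and predicted values.

import numpy as np

def r_squared(y, y_pred):
    tss = np.sum((y - y.mean()) ** 2)    # total variation of y around its mean
    rss = np.sum((y - y_pred) ** 2)      # residual (unexplained) variation
    return 1 - rss / tss                 # R² = 1 − RSS/TSS = ESS/TSS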
[Scatter plots (two slides): the same observations, highlighting the deviations from the mean ȳ and the deviations from the predicted values ŷ.]
REGRESSION FORMULA

predicted_price = 631,256
  − 10,706 × age
  + 65 × size
  + 116,211 × rooms
  − 250 × area_density
  + 54,307 × school_score
  + 105,593 × remodeled_Old
  + 267,035 × remodeled_Recent
  − 16,081 × region_North
  − 78,267 × region_South
  + 16,655 × region_West
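A hedged sketch of how a formula like the one above can be read off a fitted model; the predictor names come from the slide, and train_X / train_y are assumed to be the training partition created earlier.

import pandas as pd
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(train_X, train_y)

# One coefficient per (dummy-encoded) predictor: age, size, rooms, ..., region_West
coefficients = pd.Series(model.coef_, index=train_X.columns)
print("intercept:", round(model.intercept_))
print(coefficients.round(0))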
UNDERFIT, OVERFIT, OR RIGHT FIT
Problem:
Solution:
[Plot: Happiness vs. Income]
Underfit [Plot: Happiness vs. Income, training and validation sets]
• High Bias, Low Variance
• High Training Error, High Validation Error

Overfit [Plot: Happiness vs. Income, training and validation sets]
• Low Bias, High Variance
• No Training Error, High Validation Error

Right fit [Plot: Happiness vs. Income, training and validation sets]
• Low Bias, Low Variance
• Low Training Error, Low Validation Error
[Plot: prediction error and training error versus model complexity, with the bias and variance components; the optimal model complexity is where prediction error is lowest.]
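An illustrative sketch (synthetic data, arbitrary polynomial degrees) of the pattern above: as model complexity grows, training error keeps falling while validation error eventually rises.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
income = rng.uniform(0, 10, 100).reshape(-1, 1)                  # synthetic "Income"
happiness = np.sqrt(income).ravel() + rng.normal(0, 0.3, 100)    # synthetic "Happiness"

X_tr, X_va, y_tr, y_va = train_test_split(income, happiness, test_size=0.3, random_state=1)

for degree in (1, 3, 15):    # low, moderate, and high complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    tr_err = mean_squared_error(y_tr, model.predict(X_tr))
    va_err = mean_squared_error(y_va, model.predict(X_va))
    print(f"degree {degree:2d}: training MSE = {tr_err:.3f}, validation MSE = {va_err:.3f}")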
SUBSETS OF PREDICTORS
Goal:
Benefits:
• Minimizing data collection requirement
• Maximizing robustness
• Improving predictive accuracy
Forward Selection
• Addition of predictors, one at a time (a sketch using scikit-learn follows below)
Backward Elimination
• Deletion of predictors, one at a time
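A hedged sketch using scikit-learn's SequentialFeatureSelector, which supports both directions; train_X and train_y are assumed from the earlier split, and the number of predictors to keep is arbitrary here.

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# direction="forward" adds predictors one at a time;
# direction="backward" starts from all predictors and deletes them one at a time
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward", cv=5)
selector.fit(train_X, train_y)
print(train_X.columns[selector.get_support()])   # the selected predictors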
[Diagram: predictors x1–x4, plus the constant x0 = 1, with coefficients a0–a4 feeding into y; R² annotated]
y = a0x0 + a1x1 + a2x2 + a3x3 + a4x4, with x0 = 1
• Like forward selection, except that at each step we also consider dropping predictors that are not statistically significant, as in backward elimination
• The stopping rule is a p-value threshold with
  • Prob to Enter (for adding) and
  • Prob to Leave (for dropping)
[Same diagram: y = a0x0 + a1x1 + a2x2 + a3x3 + a4x4, with x0 = 1]
• Exhaustive Search
  • All possible subsets of predictors
  • 2^n − 1 combinations (15 combinations for 4 variables)
  • Computationally intensive: O(2^n)

The 15 subsets of {a, b, c, d}:
a, b, c, d, ab, ac, ad, bc, bd, cd, abc, abd, acd, bcd, abcd
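A small sketch enumerating all 2^n − 1 non-empty subsets with itertools, here for the four variables a–d from the slide.

from itertools import combinations

variables = ["a", "b", "c", "d"]
subsets = [combo
           for size in range(1, len(variables) + 1)
           for combo in combinations(variables, size)]
print(len(subsets))   # 15 == 2**4 - 1
print(subsets)        # ('a',), ('b',), ..., ('a', 'b', 'c', 'd')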
RSS: the residual sum of squares (the smaller the better)
Cp: Mallows' Cp is a measure of the error of the best subset model relative to the error of the model incorporating all variables (should be smaller)
AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion): measures of the information lost by fitting a given model (should be smaller)
• Adjusted R² is a modification of R² that adjusts for the number of explanatory terms in a model.

  R²_adj = 1 − (1 − R²) × (n − 1) / (n − p − 1) = 1 − (SS_res / SS_tot) × (df_t / df_e)

• where p is the total number of regressors in the linear model (not counting the constant term), n is the sample size, df_t = n − 1 is the degrees of freedom of the estimate of the population variance of the dependent variable, and df_e = n − p − 1 is the degrees of freedom of the estimate of the underlying population error variance.
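A minimal sketch of adjusted R² as defined above; r2 is the ordinary R², n the sample size, and p the number of regressors (excluding the constant).

def adjusted_r_squared(r2, n, p):
    # R²_adj = 1 − (1 − R²) × (n − 1) / (n − p − 1)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# e.g., hypothetical values: R² = 0.85 with n = 100 observations and p = 10 regressors
print(adjusted_r_squared(0.85, 100, 10))   # ≈ 0.833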
• Measurements for comparing models
• Both measures penalize complexity in models.
• Use them in context, in combination with other measures, and informed by domain knowledge.
• AICc (corrected for small samples)
• It could be as simple as: THE MODEL WITH THE LOWEST AIC OR BIC VALUE IS THE BEST MODEL.
• AIC and BIC are based on likelihood and MSE, which can be a bit complicated for an introductory course.
• It is challenging to find consistent formulas for AICc and BIC.
• AICc = AIC + 2k × (k + 1) / (n − k − 1)
• AICc = (2k − 2LL) + 2k × (k + 1) / (n − k − 1)
• AICc = 2k + n × ln(SSE/n) + n × ln(2π) + n + 2k × (k + 1) / (n − k − 1)
• Where
  • n = number of observations
  • k = number of parameters in the model
• BIC = −2LL + k × ln(n)
• BIC = n × ln(SSE/n) + k × ln(n) + n × ln(2π) + n
• BIC = n × ln(SSE/n) + k × ln(n)
• Where
  • n = number of observations
  • k = number of parameters in the model
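A hedged sketch of these criteria for a linear regression with Gaussian errors, computed from the residual sum of squares (sse), the sample size n, and the parameter count k, following the formulas above.

import numpy as np

def aic_aicc_bic(sse, n, k):
    # Gaussian log-likelihood, so that −2LL = n·ln(2π) + n·ln(SSE/n) + n
    ll = -0.5 * n * (np.log(2 * np.pi) + np.log(sse / n) + 1)
    aic = 2 * k - 2 * ll
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)   # small-sample correction
    bic = -2 * ll + k * np.log(n)
    return aic, aicc, bic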
REGRESSION FORMULA

predicted_price = 641,400
  − 10,477 × age
  + 71 × size
  + 118,214 × rooms
  + 55,257 × school_score
  + 169,436 × remodeled_Recent
  + 39,656 × region_West
• Algorithms Need Managers, Too. by: Porter, M. E., Davenport, T. H., Daugherty, P., & Wilson, H. J. (2018). HBR's 10 Must Reads on AI, Analytics, and the New Machine Age.
Algorithms follow literal instructions, but business goals often involve nuanced trade-offs. How can managers ensure that both short-term and long-term goals are embedded in algorithm design?
• Discuss examples where algorithms maximized short-term gains at the expense of long-term brand reputation.
• Explore how managers can set explicit multi-objective goals (e.g., balancing profitability with fairness, as in the case of targeted neighborhood inspections).
Algorithms often provide accurate predictions but don't explain why they made a particular recommendation. How can managers trust and use these predictions effectively?
• How have companies like Netflix or eBay benefited from algorithms, despite a limited understanding of the "why" behind their predictions?
• What challenges do managers face when algorithmic decisions lack transparency, and how can experimentation and data validation help?
• Does reliance on predictions without understanding causation lead to suboptimal outcomes (e.g., eBay's ineffective advertising)?
Algorithms rely heavily on the quality and diversity of data inputs. How can businesses ensure they are using the right data to train their algorithms?
• What is the importance of using wide and diverse data inputs (e.g., Yelp reviews used in Boston's restaurant inspection algorithm)?
• How can managers prevent the algorithm from being too myopic by expanding the range of data inputs (e.g., short-term sales versus long-term customer satisfaction)?
• How do companies adjust their data strategy to improve predictive power (e.g., moving from sales data to satisfaction metrics for product longevity predictions)?
ANY QUESTIONS?