Session 5 - Logistic Regression
CLASSIFICATION PROBLEMS
Classification is an important category of problems in which the decision maker would like to
classify a case, entity, or customer into two or more groups.
CHALLENGING CLASSIFICATION PROBLEMS
Ransomware Detection
Anomaly Detection
Text Classification
CLASSIFICATION PROBLEMS
Classification problems are an important category of problems in analytics in which the
response variable (Y) takes a discrete value.
The primary objective is to predict the class of a customer (or class probability) based
on the values of explanatory variables or predictors.
Logistic regression is referred to as a regression because it takes the output of a linear
regression function as input and uses a sigmoid function to estimate the probability of the
given class.
The difference between linear regression and logistic regression is that linear regression
outputs a continuous value that can take any real number, while logistic regression predicts
the probability that an instance belongs to a given class.
EXAMPLE
The X-axis of this graph displays the number of years in the company, which is the explanatory
(independent) variable. The Y-axis shows the probability that a person will get promoted, which is
the response; these values range from 0 to 1.
EXAMPLE
You can see that as an employee spends more time working in the company, their chances of
getting promoted increase.
EXAMPLE
Linear regression is a technique that is commonly used to model problems with continuous output.
Here is an example of how linear regression fits a straight line to model the observed data:
EXAMPLE
If we used linear regression to model whether a person will get promoted:
LOGISTIC REGRESSION
Logistic regression is used for classification, discrete choice modelling, and class probability estimation.
LOGISTIC REGRESSION - INTRODUCTION
Logistic Function (Sigmoidal function)
$$\pi(z) = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{1 + e^{z}}$$
BINARY LOGISTIC REGRESSION
$$P(Y = 1) = \pi(z) = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{1 + e^{z}}$$
𝑌𝑌 – response variable takes only two values. For example, assume that the value of
Y is either 1 (positive outcome) or 0 (negative outcome).
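As a quick illustration of this logistic (sigmoid) function, here is a minimal Python sketch (assuming NumPy is available) that evaluates π(z) for a few values of z and checks that the two algebraically equivalent forms agree:

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: pi(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(sigmoid(z))                     # all values lie between 0 and 1; pi(0) = 0.5
print(np.exp(z) / (1.0 + np.exp(z)))  # equivalent form e^z / (1 + e^z)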
LOGISTIC REGRESSION WITH ONE EXPLANATORY VARIABLE
$$P(Y = 1 \mid X = x) = \pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$

$$\frac{\pi}{1 - \pi} = e^{\beta_0 + \beta_1 x}$$
LOGIT FUNCTION
The logit function is the logarithmic transformation of the logistic function.
It is defined as the natural logarithm of odds.
Logit of a variable π (with value between 0 and 1) is given by:
$$\mathrm{Logit}(\pi) = \ln\!\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x$$
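As a small sketch of the relationship between the logistic and logit functions (NumPy assumed; the coefficients are illustrative, not taken from the slides), applying the log-odds transform to π(x) recovers the linear predictor β0 + β1x:

import numpy as np

beta0, beta1 = -2.0, 0.5                      # illustrative coefficients
x = np.linspace(-2.0, 10.0, 5)
pi = np.exp(beta0 + beta1 * x) / (1 + np.exp(beta0 + beta1 * x))
odds = pi / (1 - pi)                          # odds = e^(beta0 + beta1*x)
print(np.allclose(np.log(odds), beta0 + beta1 * x))   # True: logit(pi) = beta0 + beta1*x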
PARAMETER ESTIMATION IN LOGISTIC REGRESSION (MAXIMUM LIKELIHOOD ESTIMATE)
LIKELIHOOD FUNCTION FOR BINARY LOGISTIC FUNCTION
One of the major assumptions of simple linear regression and multiple linear
regression is that the residuals follow a normal distribution. However, the
residuals in logistic regression do not follow a normal distribution.
For example, consider a regression in which the response variable Y takes only
two values (0 or 1):
$$Y_i = \beta_0 + \beta_1 x_{1i} + \epsilon_i$$
The error $\epsilon_i$ can take only two values for a given $x_{1i}$: $1 - (\beta_0 + \beta_1 x_{1i})$ when $Y_i = 1$ and $-(\beta_0 + \beta_1 x_{1i})$ when $Y_i = 0$, so it cannot be normally distributed.
LIKELIHOOD FUNCTION FOR BINARY LOGISTIC FUNCTION
$$f(y_i) = \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}$$

$$L(\beta) = f(y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}$$
LIKELIHOOD FUNCTION FOR BINARY LOGISTIC FUNCTION
$$\ln[L(\beta)] = \sum_{i=1}^{n} y_i(\beta_0 + \beta_1 x_i) - \sum_{i=1}^{n} \ln\!\left(1 + e^{\beta_0 + \beta_1 x_i}\right)$$
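This log-likelihood can be evaluated directly in code. A minimal sketch (NumPy assumed; the temperatures and damage indicators are the first few flights from the Challenger table, and the coefficients are the fitted values used later in the session):

import numpy as np

def log_likelihood(beta0, beta1, x, y):
    # ln L(beta) = sum_i [ y_i*(b0 + b1*x_i) - ln(1 + exp(b0 + b1*x_i)) ]
    z = beta0 + beta1 * x
    return np.sum(y * z - np.log(1 + np.exp(z)))

x = np.array([66, 70, 69, 80, 68], dtype=float)   # launch temperatures
y = np.array([0, 1, 0, 0, 0], dtype=float)        # damage indicator
print(log_likelihood(15.297, -0.236, x, y))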
ESTIMATION OF LR PARAMETERS
$$\frac{\partial \ln L(\beta_0, \beta_1)}{\partial \beta_0} = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} = 0$$

$$\frac{\partial \ln L(\beta_0, \beta_1)}{\partial \beta_1} = \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} \frac{x_i e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} = 0$$
LIMITATIONS OF MLE
A closed-form solution may not exist in many cases, so one may have to use an
iterative procedure to estimate the parameter values.
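In practice the maximization is done numerically. A hedged sketch using SciPy's general-purpose minimizer on the negative log-likelihood (the data are the first ten flights from the Challenger table; any gradient-based optimizer could be substituted):

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(beta, x, y):
    z = beta[0] + beta[1] * x
    return -np.sum(y * z - np.log(1 + np.exp(z)))

x = np.array([66, 70, 69, 80, 68, 67, 72, 73, 70, 57], dtype=float)
y = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1], dtype=float)

result = minimize(neg_log_likelihood, x0=np.zeros(2), args=(x, y), method="BFGS")
print(result.x)   # estimated (beta0, beta1) for this subset of flights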
LOGISTIC REGRESSION MODEL DEVELOPMENT
[Flowchart: iterate model development until the model satisfies the diagnostic tests (NO → revise the model; YES → stop).]
SPACE SHUTTLE CHALLENGER CRASH
Space shuttle orbiter Challenger (Mission STS-51-L) was the 25th shuttle
launched by NASA on January 28, 1986 (Smith, 1986; Feynman 1988).
The Challenger crashed 73 seconds into its flight due to the erosion of O-rings
which were part of the solid rocket boosters of the shuttle.
Before the launch, the engineers at NASA were concerned about the outside
temperature which was very low (the actual launch occurred at 36°F).
SPACE SHUTTLE CHALLENGER DATA
Flt Temp Damage Flt Temp Damage
STS-1 66 No STS-41G 78 No
STS-2 70 Yes STS-51-A 67 No
STS-3 69 No STS-51-C 53 Yes
STS-4 80 No STS-51-D 67 No
STS-5 68 No STS-51-B 75 No
STS-6 67 No STS-51-G 70 No
STS-7 72 No STS-51-F 81 No
STS-8 73 No STS-51-I 76 No
STS-9 70 No STS-51-J 79 No
STS-41B 57 Yes STS-61-A 75 Yes
STS-41C 63 Yes STS-61-B 76 No
STS-41D 70 Yes STS-61-C 58 Yes
LOGISTIC REGRESSION OF CHALLENGER DATA
Let:
Yi = 0 denote no damage
Yi = 1 denote damage to the O-ring
P(Yi = 1) = πi and P(Yi = 0) = 1 − πi.
We have to estimate P(Yi = 1 | Xi).
The logistic regression model is given by:
$$\pi = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$
Odds (the likelihood of damage relative to the likelihood of no damage):
$$\frac{\pi}{1 - \pi} = e^{\beta_0 + \beta_1 x}$$
Logit function:
$$\ln\!\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x$$
LOGISTIC REGRESSION OF CHALLENGER DATA
Fitting this model to the Challenger data (the "Variables in the Equation" output) gives the estimated equation:
$$\ln\!\left(\frac{\pi_i}{1 - \pi_i}\right) = 15.297 - 0.236\, X_i$$
where Xi is the launch temperature.
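A sketch of how this estimate could be reproduced in Python (statsmodels assumed to be installed). With the 24 flights listed earlier the coefficients should come out close to the 15.297 and −0.236 reported on the slide, though the exact values depend on the dataset and software used:

import pandas as pd
import statsmodels.api as sm

# Launch temperature (deg F) and O-ring damage indicator for the 24 flights listed earlier
temp = [66, 70, 69, 80, 68, 67, 72, 73, 70, 57, 63, 70,
        78, 67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58]
damage = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
          0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1]

X = sm.add_constant(pd.Series(temp, name="Temp"))
challenger_fit = sm.Logit(damage, X).fit()
print(challenger_fit.params)   # intercept and temperature coefficient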
CHALLENGER: PROBABILITY OF FAILURE ESTIMATE
Probability of damage to O-ring as a function of launch temperature is:
$$\pi_i = \frac{e^{15.297 - 0.236 X_i}}{1 + e^{15.297 - 0.236 X_i}}$$
EXERCISE
Use the model to calculate the probability that an O-ring will be damaged at the following
ambient temperatures: 51, 53, and 55 degrees Fahrenheit.
$$\ln\!\left(\frac{\pi_i}{1 - \pi_i}\right) = 15.297 - 0.236\, X_i$$
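A quick way to check this exercise (a minimal sketch in plain NumPy):

import numpy as np

temps = np.array([51.0, 53.0, 55.0])
z = 15.297 - 0.236 * temps
prob_damage = np.exp(z) / (1 + np.exp(z))
print(dict(zip(temps, prob_damage.round(3))))   # all three probabilities come out above 0.9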
ACTUAL VS. PREDICTED
[Table: flight number, launch temperature, observed damage to O-ring, and predicted probability for each flight.]
INTERPRETATION
ODDS AND ODDS RATIO
$$\mathrm{odds} = \frac{\pi}{1 - \pi}$$
ODDS RATIO
$$OR = \frac{\pi(1)/(1 - \pi(1))}{\pi(0)/(1 - \pi(0))}$$
ODDS RATIO
$$OR = \frac{\pi(1)/(1 - \pi(1))}{\pi(0)/(1 - \pi(0))} = e^{\beta_1}$$
INTERPRETATION OF LR COEFFICIENTS
$$\beta_1 = \ln\!\left[\frac{\pi(x+1)/(1 - \pi(x+1))}{\pi(x)/(1 - \pi(x))}\right] = \text{change in log odds ratio}$$

$$e^{\beta_1} = \frac{\pi(x+1)/(1 - \pi(x+1))}{\pi(x)/(1 - \pi(x))} = \text{change in odds ratio}$$
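Applied to the Challenger model, e^β1 gives the multiplicative change in the odds of damage for a one-degree increase in launch temperature (a minimal sketch):

import numpy as np

beta1 = -0.236                 # temperature coefficient from the fitted model
print(round(np.exp(beta1), 3)) # about 0.79: each extra degree F multiplies the odds of damage by roughly 0.79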
CLASSIFICATION TABLE
The output from a logistic regression model is the class probability P(Y = 1). Based on the
value of P(Y = 1), the decision maker has to classify the observation as belonging to either
class 1 (positive) or class 0 (negative).
To classify the observations, the decision maker has to first decide the classification
cut-off probability Pc .
Whenever the predicted probability of an observation, P(Yi = 1), is less than the
classification cut-off probability Pc, the observation is classified as negative (Yi = 0);
if the predicted probability is greater than or equal to Pc, the observation is classified
as positive (Yi = 1).
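A minimal sketch of applying a classification cut-off to predicted probabilities (NumPy assumed; the probabilities are illustrative):

import numpy as np

p_hat = np.array([0.05, 0.18, 0.22, 0.61, 0.90])   # predicted P(Y = 1) for five observations
p_c = 0.2                                          # classification cut-off probability
y_pred = (p_hat >= p_c).astype(int)                # >= cut-off -> positive (1), otherwise negative (0)
print(y_pred)                                      # [0 0 1 1 1]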
SENSITIVITY, SPECIFICITY AND PRECISION
The ability of the model to correctly classify positives and negatives is called sensitivity and
specificity, respectively.
Sensitivity = P(model classifies Yi as positive | Yi is positive) = TP / (TP + FN)
where True Positive (TP) is the number of positives correctly classified as positives by the model
and False Negative (FN) is the number of positives misclassified as negatives by the model.
Sensitivity is also called recall.
SPECIFICITY
Specificity is the ability of the diagnostic test to correctly classify the test as negative when the
disease is not present. That is:
Specificity = P(model classifies Yi as negative | Yi is negative)
Specificity can be calculated using the following equation:
Specificity = TN / (TN + FP)
where True Negative (TN) is the number of negatives correctly classified as negatives by the model
and False Positive (FP) is the number of negatives misclassified as positives by the model.
PRECISION
The decision maker has to consider the tradeoff between sensitivity and
specificity to arrive at an optimal cut-off probability.
Precision measures the accuracy of the positives classified by the model:
Precision = P(Yi is positive | model classifies Yi as positive) = TP / (TP + FP)
F-SCORE
$$F\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
The interpretation of the F-score is straightforward: for any choice of the beta parameter
(in the more general F-beta measure), the best possible value is 1 and the worst is 0.
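All four metrics can be computed from the confusion matrix. A sketch (scikit-learn assumed; the labels and predictions are illustrative):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)      # recall
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f_score = 2 * precision * sensitivity / (precision + sensitivity)
print(sensitivity, specificity, precision, f_score)
# cross-check against scikit-learn's built-in metrics
print(recall_score(y_true, y_pred), precision_score(y_true, y_pred), f1_score(y_true, y_pred))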
SENSITIVITY, SPECIFICITY AND PRECISION
Classification table for the Challenger data at a cut-off probability of 0.2 (observed vs. predicted damage to O-ring): TP = 6, FN = 1, TN = 9, FP = 8.

$$Sensitivity = \frac{TP}{TP + FN} = \frac{6}{6 + 1} = 0.857$$

$$Specificity = \frac{TN}{TN + FP} = \frac{9}{9 + 8} = 0.529$$

$$Precision = \frac{TP}{TP + FP} = \frac{6}{6 + 8} = 0.428$$

$$Accuracy = \frac{TP + TN}{TP + FP + FN + TN} = \frac{6 + 9}{24} = 0.625$$
ACCURACY PARADOX
[Classification table for the Challenger data at a cut-off probability of 0.2, observed vs. predicted damage to O-ring.]
The accuracy paradox in classification problems states that a model with a higher overall accuracy
may not be a better model.
CUSTOMER SUBSCRIPTION PREDICTION FOR A FINANCIAL SERVICE
Background: A bank offers a special term deposit scheme with different benefits, and
they want to improve their subscription rates for this service. The bank is interested in
predicting whether a customer will subscribe to the premium service based on various
features. This prediction will enable the bank to tailor marketing strategies and
promotions to target potential subscribers more effectively.
Objective: The objective is to build a logistic regression model to predict whether a
customer will subscribe to the premium financial service or not based on features such
as income, account balance, credit score, and transaction history.
DATA DESCRIPTION:
Features: Income, AccountBalance, CreditScore, TransactionHistory, and Age. Target: Subscription (1 = subscribed to the premium service, 0 = did not subscribe).
DATA
0 50 20 700 8 30 1
1 80 50 750 9 35 1
2 45 10 650 6 28 0
3 75 40 720 7 40 1
4 60 30 680 7 32 0
5 85 60 800 10 45 1
6 55 25 700 8 29 0
7 70 35 760 9 38 1
8 40 15 630 5 27 0
9 95 70 820 10 50 1
48
CONCORDANT AND DISCORDANT PAIRS
Concordant pairs: a pair consisting of one positive and one negative observation for which there
exists a cut-off probability that classifies both of them correctly (i.e., the model assigns a higher
predicted probability to the positive observation than to the negative one). Pairs for which no such
cut-off exists are called discordant pairs.
ACTUAL VS. PREDICTED
[Table: flight number, launch temperature, observed damage to O-ring, and predicted probability, with discordant pairs highlighted.]
RECEIVER OPERATING CHARACTERISTICS (ROC) CURVE
The ROC curve is a plot of sensitivity (true positive rate) on the vertical axis against
1 − specificity (false positive rate) on the horizontal axis.
The area under the ROC curve (AUC) summarizes the model's ability to discriminate between the classes:
• 0.5 ⇒ no discrimination (equivalent to random guessing)
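A sketch of plotting the ROC curve and computing the AUC with scikit-learn (the labels and scores are illustrative):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.35, 0.4, 0.8, 0.25, 0.7, 0.5, 0.9]    # predicted P(Y = 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)        # fpr = 1 - specificity, tpr = sensitivity
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label="AUC = %.2f" % auc)
plt.plot([0, 1], [0, 1], linestyle="--")                  # diagonal: no discrimination (AUC = 0.5)
plt.xlabel("1 - Specificity (False Positive Rate)")
plt.ylabel("Sensitivity (True Positive Rate)")
plt.legend()
plt.show()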
YOUDEN’S INDEX FOR OPTIMAL CUT-OFF PROBABILITY
In the ROC curve, the coordinate (0,1) implies sensitivity = specificity = 1,
which is the ideal model that we would like to use.
The ROC curve provides information regarding how sensitivity and specificity
change when the classification cut-off probability changes.
The point on the ROC curve which is at minimum distance from coordinate
(0,1) (or which is at the maximum distance from the diagonal line) will give the
best cut-off probability.
Youden’s Index, J = sensitivity + specificity − 1, is the vertical distance between the ROC curve and
the diagonal line; the optimal classification cut-off probability is the value for which this distance
is maximum.
We use Youden’s Index only when both sensitivity and specificity are equally important.
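A minimal sketch for locating the cut-off that maximizes Youden’s index, J = sensitivity + specificity − 1 (scikit-learn assumed; the labels and scores are illustrative):

import numpy as np
from sklearn.metrics import roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.35, 0.4, 0.8, 0.25, 0.7, 0.5, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
j = tpr - fpr                        # equals sensitivity + specificity - 1 at each threshold
best = np.argmax(j)
print(thresholds[best], j[best])     # cut-off with the maximum Youden's index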
COST-BASED CUT-OFF PROBABILITY
When sensitivity and specificity are not equally important, we use a cost-based approach to choose the
cut-off probability.
In the cost-based approach, we assign a penalty cost to the misclassification of positives and negatives.
Assume that the cost of misclassifying a negative (0) as positive (1) is C01 and the cost of
misclassifying a positive (1) as negative (0) is C10, as shown in the table below.

Observed    Classified as 0    Classified as 1
0           ---                C01
1           C10                ---

The optimal cut-off probability is the one which minimizes the total penalty cost:
$$\min_{p}\; \left[ C_{01} P_{01} + C_{10} P_{10} \right]$$
where P01 and P10 are the probabilities of classifying a negative as positive and a positive as negative, respectively.
COST-BASED CUT-OFF PROBABILITY
C01 = cost of classifying 0 as 1 = 100
C10 = cost of classifying 1 as 0 = 200
(In the predicted-vs-observed table, P00 and P11 are the probabilities of correct classification, while P01 is the false-positive probability and P10 the false-negative probability.)

Cut-off Probability   P01    P10    C01      C10      Cost
0.05                  0.85   0.01   100.00   200.00   88.00
0.10                  0.68   0.07   100.00   200.00   81.80
0.15                  0.52   0.11   100.00   200.00   74.20
0.20                  0.44   0.14   100.00   200.00   72.3
0.25                  0.37   0.21   100.00   200.00   78.1
0.28                  0.33   0.23   100.00   200.00   77.8
0.30                  0.29   0.24   100.00   200.00   77.3
0.35                  0.24   0.76   100.00   200.00   175.3

The total penalty cost is minimized at a cut-off probability of 0.20.
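A sketch of the cost-based search over candidate cut-offs, using the misclassification probabilities from the table above (NumPy assumed; because the tabled probabilities are rounded, the recomputed costs differ slightly from those shown, but the minimum still falls at a cut-off of 0.20):

import numpy as np

C01, C10 = 100, 200                                                  # penalty costs
cutoffs = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.28, 0.30, 0.35])
P01 = np.array([0.85, 0.68, 0.52, 0.44, 0.37, 0.33, 0.29, 0.24])     # P(classify 0 as 1)
P10 = np.array([0.01, 0.07, 0.11, 0.14, 0.21, 0.23, 0.24, 0.76])     # P(classify 1 as 0)

total_cost = C01 * P01 + C10 * P10
best = np.argmin(total_cost)
print(cutoffs[best], total_cost[best])                               # cut-off with the minimum expected cost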
LOGISTIC REGRESSION MODEL DIAGNOSTICS
OMNIBUS TESTS
In the logistic regression model, the likelihood ratio test is used to check the statistical
significance of the overall model.
The log-likelihood function for the binary logistic regression model is given by:
$$LL = \sum_{i=1}^{n} Y_i \ln[\pi(Z_i)] + \sum_{i=1}^{n} (1 - Y_i)\ln[1 - \pi(Z_i)]$$
WALD’S TEST
Wald’s test statistic is given by:
$$W = \left(\frac{\hat{\beta}_i}{S_e(\hat{\beta}_i)}\right)^{2}$$
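Both the omnibus likelihood-ratio test and the Wald statistics can be read off a fitted statsmodels model. A hedged sketch (statsmodels assumed; the Challenger-style data are illustrative):

import numpy as np
import statsmodels.api as sm

x = np.array([66, 70, 69, 80, 68, 67, 72, 73, 70, 57, 63, 70,
              78, 67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58], dtype=float)
y = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
              0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1])

fit = sm.Logit(y, sm.add_constant(x)).fit()
print(fit.llr, fit.llr_pvalue)           # omnibus likelihood-ratio (chi-square) test for the overall model
print((fit.params / fit.bse) ** 2)       # Wald statistic W = (beta_hat / SE(beta_hat))^2 for each coefficient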
What are the applications of logistic regression?
PYTHON CODE – CUSTOMER PREDICTION
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
data = {
'Income': [50, 80, 45, 75, 60, 85, 55, 70, 40, 95],
'AccountBalance': [20, 50, 10, 40, 30, 60, 25, 35, 15, 70],
'CreditScore': [700, 750, 650, 720, 680, 800, 700, 760, 630, 820],
'TransactionHistory': [8, 9, 6, 7, 7, 10, 8, 9, 5, 10],
'Age': [30, 35, 28, 40, 32, 45, 29, 38, 27, 50],
'Subscription': [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]
}
df = pd.DataFrame(data)
# Split the data into features (X) and target variable (y)
X = df.drop('Subscription', axis=1)
y = df['Subscription']
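The remaining steps of the workflow are not shown on the slides; a minimal sketch of how they might continue, using only the libraries imported above (with such a tiny 10-row dataset the split and scores are purely illustrative):

# Split into training and test sets (stratified so both classes appear in the test set)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the logistic regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Evaluate on the test set
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(roc_auc_score(y_test, y_prob))

# Visualize the confusion matrix
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Observed')
plt.show()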