Lab 4

Logistic Regression in R

MACC7006 Accounting Data and Analytics

Keri Hu

Faculty of Business and Economics

Today: Logistic regression in R

By the end of today’s lecture, you should be able to:

• Create training and testing sets


• Build a logistic regression model
• Evaluate the model

We will work with the dataset: Healthcare.csv

• Predict whether a patient receives poor-quality care, based on information in his/her medical claims history

Variables in the dataset
Create training and testing sets

• Training dataset: used to build model

• Testing dataset: used to test the model’s out-of-sample accuracy

• If the observations have no chronological order, we randomly assign them to the training set or the testing set.

Install and load new package

1. Install the package: install.packages("caTools")

2. Load into your current R session: library(caTools)


• When you use this package in the future, you will not need to
re-install it, but you will need to load it with the library function.

Split dataset

1. To make results that depend on random numbers reproducible:


set.seed(any number)
• Setting the same seed restores the state of the random number generator, which enables us to reuse the same set of random values across sessions.

2. Randomly group data points:


sample.split(dependent variable, SplitRatio = fraction of data in training set)
• Produces a TRUE/FALSE vector that randomly splits the data into two pieces according to the SplitRatio value (the proportion of data in the training set)

3. Split data into training set or testing set:


subset(data frame, spl==TRUE/FALSE )
• If spl is TRUE, put the corresponding observation in the training set;
if spl is FALSE, put the corresponding observation in the testing set.
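
Putting the three steps together, here is a minimal sketch. It assumes the data frame is called Healthcare (read from Healthcare.csv) with outcome column PoorCare; the seed value and the 75% split ratio are illustrative choices, not necessarily the ones used in class.

library(caTools)

Healthcare <- read.csv("Healthcare.csv")

set.seed(88)                                    # illustrative seed; any fixed number works
spl <- sample.split(Healthcare$PoorCare,        # TRUE/FALSE vector, keeps the outcome ratio in both pieces
                    SplitRatio = 0.75)          # 75% of observations go to the training set

HealthTrain <- subset(Healthcare, spl == TRUE)  # training set
HealthTest  <- subset(Healthcare, spl == FALSE) # testing set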

Build a logistic regression model

1. Change the type/class of variables if needed using as.factor(), as.numeric(), as.character(), etc.
• Here, PoorCare = "Y" means quality is poor and "N" otherwise.

2. Generalized linear model:


glm(dependent variable ~ sum of independent variables, data =
training set, family = binomial)
• Used for many different types of models
• family = binomial indicates that we are building a logistic
regression model
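
A minimal sketch, continuing from the split above; the predictor names OfficeVisits and Narcotics are placeholders for whichever variables in Healthcare.csv you want to include.

HealthTrain$PoorCare <- as.factor(HealthTrain$PoorCare)   # make sure the outcome is a factor ("N"/"Y")

QualityLog <- glm(PoorCare ~ OfficeVisits + Narcotics,    # placeholder predictors
                  data = HealthTrain, family = binomial)  # family = binomial gives logistic regression
summary(QualityLog)                                       # coefficients, significance, deviance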

Result of the model
Evaluate performance of the model

If we want to calculate accuracy on the training set with threshold 0.5:

1. Prediction for the training set:


PredictTrain <- predict(logistic model, type="response")
• The type="response" option tells R to output probabilities of the form Pr(Y = 1 | X), as opposed to other quantities such as the logit (log-odds).
• If no new data is specified within predict(), then probabilities are
computed for the training data used to fit the logistic regression.

2. Create a classification/confusion matrix for a threshold of 0.5:


table(training set$dependent variable, PredictTrain > 0.5)
• table() counts observations in each class of the variable(s).
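
A minimal sketch of both steps, continuing from the QualityLog model above:

PredictTrain <- predict(QualityLog, type = "response")        # predicted probabilities on the training set

confTrain <- table(HealthTrain$PoorCare, PredictTrain > 0.5)  # rows = actual class, columns = prediction
confTrain

sum(diag(confTrain)) / sum(confTrain)                         # training-set accuracy: correct / all observations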

Plot predictions

1. Add the vector of predictions to the data set:


training set$Predict <- PredictTrain
2. Plot predictions (about the training set)
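
A minimal sketch of one way to visualise the training-set predictions; colouring points by the actual outcome is an illustrative choice, not necessarily the plot shown in class.

HealthTrain$Predict <- PredictTrain                               # add the predictions to the training data

plot(HealthTrain$Predict,
     col  = ifelse(HealthTrain$PoorCare == "Y", "red", "blue"),   # red = actual poor care, blue = good care
     ylab = "Predicted probability of poor care")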

Example: Classification/confusion matrix

Threshold value = 0.5:

                       FALSE (predicted good care)   TRUE (predicted poor care)
N (actual good care)              71                             3
Y (actual poor care)              14                            11

• The prediction is FALSE if the probability is less than (or equal to)
0.5, and TRUE if the probability is greater than 0.5.

Accuracy = (71 + 11) / [(71 + 11) + (3 + 14)] = 82.83%

• 3 false positive errors: predict poor care but actually good care
• 14 false negative errors: predict good care but actually poor care
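
The same accuracy, computed directly in R from the matrix entries:

(71 + 11) / (71 + 11 + 3 + 14)   # correct predictions / all observations = 0.8283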

Different threshold values

Threshold value = 0.3:

                       FALSE (predicted good care)   TRUE (predicted poor care)
N (actual good care)              67                             7
Y (actual poor care)              12                            13

Accuracy = (67 + 13) / [(67 + 13) + (7 + 12)] = 80.81%

• 7 false positive errors: predict poor care but actually good care
• 12 false negative errors: predict good care but actually poor care

Different threshold values

Threshold value = 0.7:

                       FALSE (predicted good care)   TRUE (predicted poor care)
N (actual good care)              73                             1
Y (actual poor care)              19                             6

Accuracy = (73 + 6) / [(73 + 6) + (1 + 19)] = 79.80%

• 1 false positive error: predict poor care but actually good care
• 19 false negative errors: predict good care but actually poor care

ROC curve for the training set

1. Install and load the ROCR package:


install.packages("ROCR"), library(ROCR)

2. Generate an ROC curve:


2.1 Create a prediction object that the ROCR package can understand:
ROCRpred <- prediction(PredictTrain, training set$dependent variable)
2.2 Calculate performance metrics for the ROC curve:
ROCCurve <- performance(ROCRpred, "tpr", "fpr")
• "tpr": true positive rate
• "fpr": false positive rate

2.3 Plot the ROC curve:


plot(ROCCurve)
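
A minimal sketch of the full sequence, continuing from PredictTrain and HealthTrain above:

# install.packages("ROCR")                                  # only needed once
library(ROCR)

ROCRpred <- prediction(PredictTrain, HealthTrain$PoorCare)  # predicted probabilities first, then actual labels
ROCCurve <- performance(ROCRpred, "tpr", "fpr")             # true positive rate vs. false positive rate
plot(ROCCurve)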

Example: ROC curve

Where is the threshold, say 0.5, on the curve?
Add threshold labels and calculate AUC

• plot(ROCCurve, colorize=TRUE,
print.cutoffs.at=seq(0,1,0.1), text.adj=c(-0.2,0.7))

• AUC of the training set


as.numeric(performance(ROCRpred, "auc")@y.values)
[1] 0.7945946
Prediction for the test set

• We should make out-of-sample predictions.

• This can be done on the test set by adding newdata:


PredictTest = predict(logistic model, type = "response",
newdata = testing set)

Classification/confusion matrix for the test set

Threshold value = 0.5:


table(testing set$dependent variable, PredictTest > 0.5)

Example: Classification matrix for the test set

                       FALSE (predicted good care)   TRUE (predicted poor care)
N (actual good care)              23                             1
Y (actual poor care)               3                             5

• Accuracy on the test set = (23 + 5) / [(23 + 5) + (1 + 3)] = 87.5%


• 1 false positive prediction
• 3 false negative predictions
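
A minimal sketch tying the test-set steps together, continuing from QualityLog and HealthTest above:

PredictTest <- predict(QualityLog, type = "response", newdata = HealthTest)  # out-of-sample probabilities

confTest <- table(HealthTest$PoorCare, PredictTest > 0.5)   # confusion matrix at threshold 0.5
confTest

sum(diag(confTest)) / sum(confTest)                         # test-set accuracy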

ROC curve and AUC of the test set

• Plot ROC curve


• ROCRpredtest = prediction(PredictTest, testing set$dependent variable)
• ROCCurvetest = performance(ROCRpredtest, "tpr", "fpr")
• plot(ROCCurvetest, colorize=TRUE,
print.cutoffs.at=seq(0,1,0.1), text.adj=c(-0.2,0.7))

• AUC of the test set


as.numeric(performance(ROCRpredtest, "auc")@y.values)
[1] 0.875

