Logistic Regression Implementation in R: The Dataset
Logistic regression is a method for fitting a regression curve, y = f(x), when y is a
categorical variable. The typical use of this model is predicting y given a set of
predictors x. The predictors can be continuous, categorical, or a mix of both.
The categorical variable y can, in general, assume different values. In the simplest case,
y is binary, meaning that it can assume either the value 1 or 0. A classical example
used in machine learning is email classification: given a set of attributes for each email,
such as the number of words, links, and pictures, the algorithm should decide whether the
email is spam (1) or not (0). In this post, we call the model “binomial logistic regression”,
since the variable to predict is binary. However, logistic regression can also be used to
predict a dependent variable that can assume more than two values. In this second case, we
call the model “multinomial logistic regression”. A typical example would be classifying
films as “entertaining”, “borderline”, or “boring”.
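To make the binary case concrete, here is a minimal sketch of how a logistic model turns a linear combination of predictors into a probability and then into a class. The coefficients b0 and b1 and the predictor values are made up for illustration and do not come from any fitted model; plogis() is base R's logistic function.

```r
# Hypothetical coefficients, chosen only for illustration
b0 <- -1.5   # intercept
b1 <- 0.8    # slope for a single continuous predictor

x   <- c(-2, 0, 2, 4)     # made-up predictor values
eta <- b0 + b1 * x        # linear predictor (the log-odds)
p   <- plogis(eta)        # logistic function: 1 / (1 + exp(-eta))

# Classify as 1 when the estimated probability exceeds 0.5
y_hat <- ifelse(p > 0.5, 1, 0)
data.frame(x, eta, p, y_hat)
```

The probability crosses 0.5 exactly where the log-odds cross zero, which is why a 0.5 cut-off is the usual default for a binary decision.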
The dataset
We’ll be working on the Titanic dataset. There are different versions of this dataset freely
available online; however, I suggest using the one available at Kaggle since it is almost
ready to be used (in order to download it you need to sign up to Kaggle).
The dataset (training) is a collection of data about some of the passengers (891 to be
precise), and the goal of the competition is to predict survival (1 if the passenger
survived, 0 if they did not) based on features such as the class of service, sex, and
age. As you can see, we are going to use both categorical and continuous variables.
Now we need to check for missing values and see how many unique values there are for
each variable using the sapply() function, which applies the function passed as an argument
to each column of the data frame.
sapply(training.data.raw,function(x) sum(is.na(x)))
sapply(training.data.raw, function(x) length(unique(x)))
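PassengerId    Survived      Pclass        Name         Sex
          0           0           0           0           0
        Age       SibSp       Parch      Ticket        Fare
        177           0           0           0           0
      Cabin    Embarked
        687           2

PassengerId    Survived      Pclass        Name         Sex
        891           2           3         891           2
        Age       SibSp       Parch      Ticket        Fare
         89           7           7         681         248
      Cabin    Embarked
        148           4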
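To see what these two sapply() calls return without downloading anything, here is a small sketch on a toy data frame. The CSV text and the object name toy are invented for illustration; the na.strings = "" argument is an assumption about how empty fields should be handled, so that blank strings are counted as missing.

```r
# Toy CSV standing in for the real file; empty fields mark missing values
csv_text <- "Age,Sex,Embarked
22,male,S
,female,C
35,female,
,male,S"

# na.strings = "" makes read.csv() treat empty strings as NA
toy <- read.csv(text = csv_text, na.strings = "")

n_missing <- sapply(toy, function(x) sum(is.na(x)))      # NAs per column
n_unique  <- sapply(toy, function(x) length(unique(x)))  # distinct values; NA counts as one

n_missing
n_unique
```

Note that unique() treats NA as a value of its own, so a column with missing entries reports one more "unique value" than it has real categories.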
A visual take on the missing values might be helpful: the Amelia package has a special
plotting function, missmap(), that plots your dataset and highlights missing values:
library(Amelia)
missmap(training.data.raw)
As far as categorical variables are concerned, read.table() and read.csv() encoded them as
factors by default in R versions before 4.0.0; from R 4.0.0 onward you need to pass
stringsAsFactors = TRUE to get the same behaviour. A factor is how R represents categorical
variables.
We can check the encoding using the following lines of code:
is.factor(data$Sex)
is.factor(data$Embarked)
TRUE
TRUE
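Whether these checks return TRUE depends on how the file was read. The sketch below uses a made-up three-row CSV to show the difference; the object names as_chr and as_fct are hypothetical.

```r
csv_text <- "Sex,Embarked
male,S
female,C
female,Q"

# Default read: in R >= 4.0.0 character columns stay character
as_chr <- read.csv(text = csv_text)

# Explicitly request factors (the default behaviour before R 4.0.0)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

is.factor(as_chr$Sex)
is.factor(as_fct$Sex)
levels(as_fct$Embarked)   # levels are sorted alphabetically
```

An existing character column can also be converted afterwards with factor() or as.factor().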
For a better understanding of how R is going to deal with the categorical variables, we can
use the contrasts() function. This function shows how R has dummy-coded the variables
and how to interpret them in a model.
contrasts(data$Sex)
contrasts(data$Embarked)
male
female 0
male 1
Q S
C 0 0
Q 1 0
S 0 1
For instance, you can see that for the variable Sex, female is used as the reference level.
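The reference level is simply the first level of the factor, which by default is the first in alphabetical order. If another baseline is preferred, base R's relevel() can change it. A small sketch on a made-up factor:

```r
sex <- factor(c("male", "female", "female", "male"))
levels(sex)       # sorted alphabetically, so "female" comes first
contrasts(sex)    # treatment coding: the first level is the all-zero reference

# relevel() makes another level the reference without changing the data
sex_male_ref <- relevel(sex, ref = "male")
contrasts(sex_male_ref)
```

With "male" as the reference, the model coefficient for this variable would describe the effect of being female relative to male, rather than the other way round.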
As for the missing values in Embarked, since there are only two, we will discard those two
rows (we could also have replaced the missing values with the mode and kept the data
points).
data <- data[!is.na(data$Embarked),]
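For comparison, here is a minimal sketch of the imputation alternative: the mean for a numeric column and the mode for a categorical one. The toy data frame and its values are invented for illustration; whether imputing is preferable to dropping rows depends on how much data is missing.

```r
toy <- data.frame(
  Age      = c(22, NA, 35, 58, NA),
  Embarked = factor(c("S", "C", NA, "S", "Q"))
)

# Numeric column: replace NAs with the mean of the observed values
toy$Age[is.na(toy$Age)] <- mean(toy$Age, na.rm = TRUE)

# Categorical column: replace NAs with the most frequent level (the mode)
mode_level <- names(which.max(table(toy$Embarked)))
toy$Embarked[is.na(toy$Embarked)] <- mode_level

toy
```

Leaving NAs in a predictor matters later on: glm() silently drops incomplete rows by default, and predict() returns NA for them.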
Before proceeding to the fitting process, let me remind you how important cleaning and
formatting the data is. This preprocessing step is often crucial for obtaining a good fit of
the model and better predictive ability.
Model fitting
We split the data into two chunks: a training set and a testing set. The training set will be
used to fit our model, which we will then evaluate on the testing set.
train <- data[1:800,]
test <- data[801:889,]
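Slicing by row position assumes the rows are in no meaningful order. If that is in doubt, a random split is a common alternative. The sketch below only builds the index vectors; the seed and the row count n are assumptions made for illustration.

```r
set.seed(42)      # arbitrary seed, only for reproducibility
n <- 889          # assumed number of cleaned rows

train_idx <- sample(seq_len(n), size = 800)   # 800 random row indices
test_idx  <- setdiff(seq_len(n), train_idx)   # the remaining rows

length(train_idx)
length(test_idx)
# With a real data frame: train <- data[train_idx, ]; test <- data[test_idx, ]
```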
Now, let’s fit the model. Be sure to specify the parameter family=binomial in
the glm() function.
model <- glm(Survived ~.,family=binomial(link='logit'),data=train)
[Output of summary(model): call, deviance residuals, and coefficient table with significance codes]
[Output of anova(model, test="Chisq"): analysis of deviance table, response Survived, with significance codes]
The difference between the null deviance and the residual deviance shows how our model
is doing against the null model (a model with only the intercept). The wider this gap, the
better. Analyzing the table, we can see the drop in deviance when adding each variable one
at a time. Again, adding Pclass, Sex and Age significantly reduces the residual deviance.
The other variables seem to improve the model less, even though SibSp has a low p-value.
A large p-value here indicates that the model without the variable explains more or less the
same amount of variation. Ultimately, what you would like to see is a significant drop in
deviance and the AIC.
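The same reading of the deviance table can be tried on simulated data, where the truth is known. In this sketch x1 really drives the response and x2 is pure noise; the data, the seed and all object names are invented for illustration.

```r
set.seed(1)                      # arbitrary seed for reproducibility
n  <- 200
x1 <- rnorm(n)                   # informative predictor
x2 <- rnorm(n)                   # pure noise, unrelated to y
y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x1))
sim <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = sim)

fit$null.deviance    # deviance of the intercept-only model
fit$deviance         # residual deviance of the fitted model

# Sequential analysis of deviance: terms are added one at a time
dev_table <- anova(fit, test = "Chisq")
dev_table
```

The per-term drops in the Deviance column add up to the gap between the null and residual deviance, and because the table is sequential, the drop attributed to a variable can change if the order of terms in the formula changes.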
While no exact equivalent to the R² of linear regression exists, the McFadden R² index can
be used to assess the model fit.
library(pscl)
pR2(model)
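For readers who want to see what the McFadden index measures, it can also be computed by hand in base R as one minus the ratio of the model's log-likelihood to that of the intercept-only model. The sketch below uses simulated data and hypothetical object names; it is not a replacement for pR2(), just an illustration of the definition.

```r
set.seed(2)                      # arbitrary seed for reproducibility
n <- 300
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.3 + 1.2 * x))

fit  <- glm(y ~ x, family = binomial)   # model of interest
null <- glm(y ~ 1, family = binomial)   # intercept-only model

# McFadden's pseudo R-squared: 1 - logLik(model) / logLik(null model)
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
mcfadden
```

For a 0/1 response this is the same quantity as 1 - residual deviance / null deviance, which ties it back to the deviance comparison discussed earlier. Values are typically much lower than a linear-regression R², so they should not be judged on the same scale.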
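To assess predictive ability, we predict on the test set and classify a passenger as a
survivor when the predicted probability exceeds 0.5:
misClasificError <- mean(ifelse(predict(model, newdata = test, type = 'response') > 0.5, 1, 0) != test$Survived)
print(paste('Accuracy',1-misClasificError))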
"Accuracy 0.842696629213483"
The 0.84 accuracy on the test set is quite a good result. However, keep in mind that this
result depends to some extent on the manual split of the data made earlier; for a more
reliable estimate, you would be better off running some kind of cross-validation, such as
k-fold cross-validation.
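A minimal base-R sketch of k-fold cross-validation for a logistic model is shown below. It runs on simulated data so that it is self-contained; the number of folds, the seed and the object names are all assumptions, and the same loop could be adapted to the Titanic data frame.

```r
set.seed(4)                      # arbitrary seed for reproducibility
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.2 + 1.3 * x))
sim <- data.frame(x, y)

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))   # random fold label per row

cv_acc <- sapply(seq_len(k), function(i) {
  fit <- glm(y ~ x, family = binomial, data = sim[fold != i, ])   # fit on k-1 folds
  p   <- predict(fit, newdata = sim[fold == i, ], type = "response")
  mean(ifelse(p > 0.5, 1, 0) == sim$y[fold == i])   # accuracy on the held-out fold
})

cv_acc          # one accuracy per fold
mean(cv_acc)    # cross-validated estimate
```

The spread of the per-fold accuracies gives a rough idea of how much a single manual split can move the score.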
As a last step, we are going to plot the ROC curve and calculate the AUC (area under the
curve) which are typical performance measurements for a binary classifier.
The ROC is a curve generated by plotting the true positive rate (TPR) against the false
positive rate (FPR) at various threshold settings while the AUC is the area under the ROC
curve. As a rule of thumb, a model with good predictive ability should have an AUC closer
to 1 (1 is ideal) than to 0.5.
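These definitions can be checked without any extra package. The sketch below computes TPR and FPR over all thresholds on simulated scores, integrates the curve with the trapezoidal rule, and compares the result with the equivalent rank-based formula. The data and names are invented, and the scores are in-sample fitted values used only for illustration.

```r
set.seed(5)                      # arbitrary seed for reproducibility
n <- 300
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(1.2 * x))
p <- fitted(glm(y ~ x, family = binomial))   # in-sample scores, illustration only

# TPR and FPR at each candidate threshold, from strictest to most lenient
thresholds <- c(Inf, sort(unique(p), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) mean(p[y == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(p[y == 0] >= t))

# Area under the ROC curve by the trapezoidal rule
auc_trap <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# Rank-based form: chance that a random positive outscores a random negative
r  <- rank(p)
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc_rank <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

c(auc_trap, auc_rank)
```

The rank-based form explains the rule of thumb: an AUC of 0.5 means the model ranks a random positive above a random negative no better than a coin flip.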
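library(ROCR)
pr <- prediction(predict(model, newdata = test, type = "response"), test$Survived)  # scores and true labels
prf <- performance(pr, measure = "tpr", x.measure = "fpr")  # points of the ROC curve
plot(prf)
auc <- performance(pr, measure = "auc")@y.values[[1]]  # extract the AUC as a number
auc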
0.8647186
And here is the ROC plot: