R Assignment
# Code: NC6NVD
#### What types of machine learning models perform best on this dataset? Build at least two types of models.
###------------------------- The answer---------------------------------------------
This is a classification problem (supervised learning), so the machine learning models for this dataset are classification methods. There are many such methods, for example:
# 1- Logistic Regression -- 2- Decision Tree Classifier -- 3- Random Forest Classifier, or we can use K Nearest Neighbors as a fourth technique (a short sketch of this option appears at the end of the script).
## I built three models to solve this problem: the decision tree model, the random forest model and, finally, the logistic regression model.
install.packages("GGally")
library(ggplot2)
library(dplyr)
library(rpart)
library(rpart.plot)
library(randomForest)
library(GGally)
library(readr)
default_creditcard <- read_delim("D:/MSc 3 Semester/data mining/default_creditcard.csv",
                                 delim = ",") # the original call was truncated here; a comma delimiter is assumed
str(default_creditcard)
cols <- c("DEFAULT", "PAY_2", "PAY_3") # the categorical columns stored as numbers
for (i in cols) {
  default_creditcard[, i] <- as.factor(unlist(default_creditcard[, i]))
}
str(default_creditcard)
The variables DEFAULT, PAY_2 and PAY_3 are stored as numbers, so I converted them to factors to make them categorical, as is obvious from the code above; then I checked the structure of the data again to make sure they were converted.
# Missing values
colSums(is.na(default_creditcard))
# There are 2657 missing values in the LIMIT_BAL feature (out of 22152 rows) and one in BILL_AMT1
Comment:
I checked the missing values in all features; there were missing values in LIMIT_BAL (2657 out of 22152) and in BILL_AMT1 (1), so I replaced all these missing values with the mean and checked again to confirm that zero missing values remained.
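The replacement step itself is not shown in the listing; a minimal sketch of the mean imputation described above (column names taken from the rest of this script) could be:
# Sketch of the mean-imputation step described in the comment above
# (assumed; not part of the original listing)
for (col in c("LIMIT_BAL", "BILL_AMT1")) {
  default_creditcard[[col]][is.na(default_creditcard[[col]])] <-
    mean(default_creditcard[[col]], na.rm = TRUE)
}
colSums(is.na(default_creditcard)) # check again: all zeros now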
# Let's look at the relationship between SEX and DEFAULT:
ggplot(data = default_creditcard,
       aes(x = SEX, fill = DEFAULT)) +
  geom_bar()
# The relationship between DEFAULT and AGE:
ggplot(data = default_creditcard,
       aes(x = AGE, fill = DEFAULT)) +
  geom_histogram(binwidth = 3)
# The relationship between DEFAULT and EDUCATION:
ggplot(data = default_creditcard,
       aes(x = EDUCATION, fill = DEFAULT)) +
  geom_bar(position = "fill") + ylab("Frequency")
# The relationship between DEFAULT and MARITAL status:
ggplot(data = default_creditcard,
       aes(x = MARRIAGE, fill = DEFAULT)) +
  geom_bar(position = "fill") + ylab("Frequency")
train_features <- default_creditcard[, c("DEFAULT", "LIMIT_BAL", "SEX", "EDUCATION", "PAY_2",
                                         "PAY_3", "BILL_AMT1", "PAY_AMT1", "PAY_AMT2",
                                         "PAY_AMT3", "PAY_AMT4")]
set.seed(2019)
defaultdatasample <- sample(1:nrow(train_features), 10000)
train<-train_features[defaultdatasample,] # The train set of the model
test<-train_features[-defaultdatasample,] # The test set of the model
train
test
Comment:
Before starting the modeling, I split the data into two parts: the training set, which consists of 10000 randomly sampled observations, and the remaining observations as the test set, used to measure the accuracy of the models, as the code above indicates. Then we deploy the different methods as follows.
# A- Let's try a decision tree model to predict the default:
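The tree-fitting step itself is missing from the listing; a minimal sketch that produces the confusion matrix Confmatrix used below (assuming rpart, loaded above, with class predictions on the test set) would be:
# Sketch of the missing fitting step (assumed; not in the original listing)
model_dt <- rpart(DEFAULT ~ ., data = train, method = "class")
rpart.plot(model_dt)                                # visualize the fitted tree
pred_dt <- predict(model_dt, test, type = "class")  # class predictions on the test set
Confmatrix <- table(pred_dt, test$DEFAULT)          # confusion matrix used below
Confmatrix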
precision_dt <- Confmatrix[2,2]/(sum(Confmatrix[2,]))
recall_dt <- Confmatrix[2,2]/(sum(Confmatrix[,2]))
precision_dt
recall_dt
# The results of the code:
> precision_dt
[1] 0.7979481
> recall_dt
[1] 0.2768773
# 79.8% precision and 27.7% recall for the default class
# B- Let's try a random forest model to predict the default:
model_rf <- randomForest(DEFAULT ~ ., data = train)
model_rf
pred.train.rf <- predict(model_rf, test) # the prediction step was missing from the listing
t3 <- table(pred.train.rf, test$DEFAULT)
t3
# Precision and recall for the non-default class:
precision_rf <- t3[1,1]/(sum(t3[1,]))
recall_rf <- t3[1,1]/(sum(t3[,1]))
precision_rf
recall_rf
# 79.7% precision and 93.2% recall for the non-default --> almost the same as the tree!!
# Precision and recall for the default class:
precision_rf <- t3[2,2]/(sum(t3[2,])) # the original listing assigned presicion_rf (typo), so the
                                      # stale non-default precision was printed again below
recall_rf <- t3[2,2]/(sum(t3[,2]))
precision_rf
recall_rf
# 28.2% recall for the default --> a bit better than the tree; recomputing precision from t3
# gives 851/1473, about 57.8% (the 79.6% printed below is the stale non-default value)
# The results of the code:
Call:
 randomForest(formula = DEFAULT ~ ., data = train)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

> t3
pred.train.rf    0    1
            0 8507 2172
            1  622  851
> precision_rf
[1] 0.7966102
> recall_rf
[1] 0.9318655
> precision_rf
[1] 0.7966102
> recall_rf
[1] 0.2815084
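Since the same precision/recall arithmetic is repeated for every model, a small helper function (hypothetical; not part of the original script) makes the pattern explicit:
# Hypothetical helper: precision and recall for one class, given a confusion
# table with predictions in the rows and true labels in the columns
prec_recall <- function(cm, class) {
  i <- as.character(class)
  c(precision = cm[i, i] / sum(cm[i, ]),
    recall    = cm[i, i] / sum(cm[, i]))
}
prec_recall(t3, 1) # metrics for the default class of the random forest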
# C- Let's try to run a logistic regression (glm, a generalized linear model):
logit <- glm(DEFAULT ~ ., family = binomial(link = 'logit'), data = train)
summary(logit)
# We can see that EDUCATION and other features are not statistically significant at 5%.
# Let's try to run a simple backward feature selection in order to remove the
# non-significant features
backwards = step(logit, direction = "backward")
# Let's look at the predictions of this model on the test set:
pred.test <- predict(backwards, test, type = "response") # type = "response" returns probabilities;
                                                         # the original call omitted it, so the 0.5
                                                         # threshold was applied on the log-odds scale
pred.test <- ifelse(pred.test > 0.5, 1, 0)
t1 <- table(pred.test, test$DEFAULT)
t1
# Precision and recall of the model for the non-defaults
precision <- t1[1,1]/(sum(t1[1,]))
recall <- t1[1,1]/(sum(t1[,1]))
precision
recall
# 77.1% precision and 97.3% recall for the non-defaults, better than before!!
# Precision and recall of the model for the defaults
precision <- t1[2,2]/(sum(t1[2,]))
recall <- t1[2,2]/(sum(t1[,2]))
precision
recall
# 79% precision and 18.7% recall for the defaults; the recall is much lower than the other models
# The results of the code:
Call:
glm(formula = DEFAULT ~ ., family = binomial(link = "logit"),
    data = train)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.8037 -0.6716 -0.5737 -0.3661  2.6094

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)          -9.028e-01  1.014e-01  -8.899  < 2e-16 ***
LIMIT_BAL            -1.572e-06  3.169e-07  -4.959 7.09e-07 ***
SEXmale               1.624e-01  5.153e-02   3.151 0.001628 **
EDUCATIONhigh school  7.989e-02  7.586e-02   1.053 0.292292
EDUCATIONother       -1.322e-01  4.952e-01  -0.267 0.789492
EDUCATIONuniversity   4.964e-02  5.973e-02   0.831 0.405862
PAY_2-1              -4.336e-01  1.466e-01  -2.957 0.003107 **
PAY_20               -6.630e-01  1.673e-01  -3.962 7.43e-05 ***
PAY_21                3.495e-01  8.752e-01   0.399 0.689674
PAY_22                6.442e-01  1.710e-01   3.767 0.000165 ***
PAY_23                3.007e-01  2.775e-01   1.084 0.278529
PAY_3-1               2.743e-01  1.451e-01   1.890 0.058695 .
PAY_30                2.277e-01  1.618e-01   1.408 0.159265
PAY_31               -9.452e+00  1.195e+02  -0.079 0.936941
PAY_32                9.531e-01  1.669e-01   5.712 1.11e-08 ***
PAY_33                1.482e+00  3.492e-01   4.243 2.20e-05 ***
BILL_AMT1             3.702e-06  8.609e-07   4.301 1.70e-05 ***
PAY_AMT1             -6.184e-05  1.528e-05  -4.046 5.21e-05 ***
PAY_AMT2             -2.725e-05  1.419e-05  -1.920 0.054916 .
PAY_AMT3             -5.776e-05  1.478e-05  -3.909 9.28e-05 ***
PAY_AMT4             -4.121e-05  1.518e-05  -2.715 0.006625 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10917  on 9999  degrees of freedom
Residual deviance:  9710  on 9979  degrees of freedom
AIC: 9752

Number of Fisher Scoring iterations: 9

# After deleting the insignificant features, the backward step ends with the following model:
            Df Deviance    AIC
<none>           9711.3 9747.3
- PAY_AMT2   1   9715.1 9749.1
- PAY_AMT4   1   9719.0 9753.0
- SEX        1   9721.0 9755.0
- PAY_AMT3   1   9727.3 9761.3
- PAY_AMT1   1   9728.8 9762.8
- BILL_AMT1  1   9730.0 9764.0
- LIMIT_BAL  1   9739.6 9773.6
- PAY_3      5   9790.9 9816.9
> t1
pred.test    0    1
        0 8885 2639
        1  244  384
> precision
[1] 0.7709997
> recall
[1] 0.973272
> precision
[1] 0.7897
> recall
[1] 0.1870261
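The fourth technique mentioned at the top, K Nearest Neighbors, was never built above. Purely as an illustration, a minimal sketch using the class package (an assumption, not part of the original script; only the numeric features enter, since class::knn needs numeric predictors) could be:
library(class)
# Hypothetical KNN sketch; feature and object names are taken from this script
num_cols <- c("LIMIT_BAL", "BILL_AMT1", "PAY_AMT1",
              "PAY_AMT2", "PAY_AMT3", "PAY_AMT4")
X_train <- scale(as.matrix(train[, num_cols]))  # standardize the training features
X_test  <- scale(as.matrix(test[, num_cols]),   # reuse the training center and scale
                 center = attr(X_train, "scaled:center"),
                 scale  = attr(X_train, "scaled:scale"))
pred_knn <- knn(X_train, X_test, cl = train$DEFAULT, k = 15) # k = 15 is an arbitrary choice
table(pred_knn, test$DEFAULT) # confusion matrix, evaluated like the other models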