R Assignment

# Name: Mohamed Reda Ibrahim

# Code: NC6NVD

#### What types of machine learning models perform best on this dataset? Build at least two types of models.
###------------------------- The Answer -------------------------

This is a classification problem (supervised learning), and there are many
classification methods that could solve it, such as:

# 1- Logistic regression  2- Decision tree classifier  3- Random forest classifier, and K-nearest neighbors could serve as a fourth technique
## I built three models to solve the problem: a decision tree, a random forest, and finally a logistic regression

## The steps for building the models:

Before explaining the code: I start by loading the necessary packages, then I load the data. After loading it, I clean the data, draw some visualizations of the features, and finally fit the models and test their accuracy.

# 1. Install the randomForest and GGally packages


install.packages("randomForest")

install.packages("GGally")

# 2. Load the necessary packages

library(ggplot2)

library(dplyr)

library(rpart)

library(rpart.plot)
library(randomForest)

library(GGally)

Data loading and cleaning

I loaded the data; before processing, I renamed the label variable default_nextMonth to DEFAULT.

library(readr)
default_creditcard <- read_delim("D:/MSc 3 Semester/data mining/default_creditcard.csv",
                                 ";", escape_double = FALSE, trim_ws = TRUE)

View(default_creditcard)

# Renaming default_nextMonth as DEFAULT

colnames(default_creditcard)[colnames(default_creditcard) == "default_nextMonth"] <- "DEFAULT"
# Checking the structure

str(default_creditcard)

# 3. Let's convert some features to factors

cols <- c("DEFAULT", "PAY_2", "PAY_3")

for (i in cols) {
  default_creditcard[, i] <- as.factor(unlist(default_creditcard[, i]))
}

# Now let's look at the structure of the full data set

str(default_creditcard)

The variables DEFAULT, PAY_2, and PAY_3 are numeric, so I converted them to factors (categorical), as the code above shows; then I checked the structure of the data again to confirm the conversion.

# Missing values

colSums(is.na(default_creditcard))

# We have a lot of missing data in the LIMIT_BAL feature: 2657 missing values out of 221521,
# so I replaced them with the mean as follows

default_creditcard$LIMIT_BAL[is.na(default_creditcard$LIMIT_BAL)] <- mean(default_creditcard$LIMIT_BAL, na.rm = TRUE)

# Replace BILL_AMT1 missing values with the mean

default_creditcard$BILL_AMT1[is.na(default_creditcard$BILL_AMT1)] <- mean(default_creditcard$BILL_AMT1, na.rm = TRUE)

# So, no missing values now

colSums(is.na(default_creditcard))

Comment:
I checked for missing values in all features: LIMIT_BAL had 2657 missing values (out of 221521) and BILL_AMT1 had one, so I replaced all of them with the column means and checked again to confirm there are zero missing values.
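The same mean-replacement pattern can be wrapped in a small helper so it does not have to be repeated per column. This is just a sketch; the function name impute_mean is my own, not part of the assignment, shown on a toy data frame rather than the credit card data:

```r
# Sketch: mean-impute every column named in `cols` of a data frame.
impute_mean <- function(df, cols) {
  for (col in cols) {
    m <- mean(df[[col]], na.rm = TRUE)        # column mean, ignoring NAs
    df[[col]][is.na(df[[col]])] <- m          # fill the NAs with that mean
  }
  df
}

# Toy example with one NA per column:
toy <- data.frame(LIMIT_BAL = c(100, NA, 300), BILL_AMT1 = c(NA, 2, 4))
toy <- impute_mean(toy, c("LIMIT_BAL", "BILL_AMT1"))
colSums(is.na(toy))  # both counts are now zero
```

Mean imputation is simple but shrinks the variance of the imputed column; for a report like this it is fine, but median imputation is a common alternative when the feature is skewed.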

######## Some visualizations ########------------------------------

# Density plot of clients grouped by education

ggplot(default_creditcard, aes(x = EDUCATION)) +
  geom_density(aes(fill = factor(EDUCATION)), alpha = 0.8) +
  labs(title = "Density Plot",
       subtitle = "Clients Grouped by Education",
       caption = "Source: UCI Credit Card",
       x = "Education",
       fill = "Education")

# Let's look at the relationship between SEX and DEFAULT:

ggplot(data = default_creditcard,
       aes(x = SEX, fill = DEFAULT)) +
  geom_bar()

# Females defaulted much more often than males

# The relationship between DEFAULT and AGE:

ggplot(data = default_creditcard,
       aes(x = AGE, fill = DEFAULT)) +
  geom_histogram(binwidth = 3)

# Clients just over 20 years old show more defaults than the others,
# while clients aged 60 and above show almost no defaults
# The relationship between DEFAULT and EDUCATION:

ggplot(data = default_creditcard,
       aes(x = EDUCATION, fill = DEFAULT)) +
  geom_bar(position = "fill") + ylab("Frequency")

# High-school and university clients are more exposed to default
# than the graduate-school and other categories

# The relationship between DEFAULT and marital status (MARRIAGE):

ggplot(data = default_creditcard,
       aes(x = MARRIAGE, fill = DEFAULT)) +
  geom_bar(position = "fill") + ylab("Frequency")

# There is no significant difference between married and single clients;
# the default percentages are almost the same

###--------------------- Predictions with simple machine learning models ---------------------###

# 4. The train set with the selected variables

train_features <- default_creditcard[, c("DEFAULT",
  "LIMIT_BAL", "SEX", "EDUCATION", "PAY_2", "PAY_3", "BILL_AMT1",
  "PAY_AMT1", "PAY_AMT2", "PAY_AMT3", "PAY_AMT4")]
set.seed(2019)
defaultdatasample <- sample(1:nrow(train_features), 10000)
train <- train_features[defaultdatasample, ]  # the train set of the model
test  <- train_features[-defaultdatasample, ] # the test set of the model
train
test

Comment:
Before modeling, I split the data into two parts: a training set of 10000 randomly sampled observations, and the remaining observations as a test set for measuring model accuracy, as the code above shows. Then I deployed the different methods as follows.
# A- Let's try a decision tree model to predict the default
# (the Gini index is the default splitting criterion)

model_dt <- rpart(DEFAULT ~ ., data = train, method = "class")
rpart.plot(model_dt)
pred.train.dt <- predict(model_dt, test, type = "class")
Confmatrix <- table(pred.train.dt, test$DEFAULT)
Confmatrix
precision_dt <- Confmatrix[1,1]/(sum(Confmatrix[1,]))
recall_dt <- Confmatrix[1,1]/(sum(Confmatrix[,1]))
precision_dt
recall_dt
# 80% precision and 94.6% recall for the non-defaults

precision_dt <- Confmatrix[2,2]/(sum(Confmatrix[2,]))
recall_dt <- Confmatrix[2,2]/(sum(Confmatrix[,2]))
precision_dt
recall_dt
# 27.6% recall for the defaults. Note: the original run misspelled the
# variable as presicion_dt, so the log below re-prints the stale
# non-default precision (0.7979481); from the confusion matrix, the
# default precision is 837/(496 + 837), about 62.8%.

The results of the code:

pred.train.dt    0    1
            0 8633 2186
            1  496  837
> precision_dt <- Confmatrix[1,1]/(sum(Confmatrix[1,]))
> recall_dt <- Confmatrix[1,1]/(sum(Confmatrix[,1]))
> precision_dt
[1] 0.7979481
> recall_dt
[1] 0.9456677
> presicion_dt <- Confmatrix[2,2]/(sum(Confmatrix[2,]))
> recall_dt <- Confmatrix[2,2]/(sum(Confmatrix[,2]))
> precision_dt
[1] 0.7979481
> recall_dt
[1] 0.2768773
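The precision/recall arithmetic is repeated for every model, so it can be factored into a tiny helper. This is a sketch, not part of the assignment (the name prf is my own shorthand), applied to the decision-tree confusion matrix reported above:

```r
# Sketch: per-class precision and recall from a table()-style confusion
# matrix with rows = predictions and columns = truth, as used above.
prf <- function(cm, k) {
  c(precision = cm[k, k] / sum(cm[k, ]),   # correct / all predicted as class k
    recall    = cm[k, k] / sum(cm[, k]))   # correct / all truly class k
}

# The decision-tree confusion matrix reported above:
cm <- matrix(c(8633, 496, 2186, 837), nrow = 2,
             dimnames = list(pred = c("0", "1"), truth = c("0", "1")))
round(prf(cm, 1), 3)  # non-defaults: precision 0.798, recall 0.946
round(prf(cm, 2), 3)  # defaults:     precision 0.628, recall 0.277
```

A helper like this also avoids the misspelled-variable slip that affected the logged default-class precisions.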

# B- Let's try to predict DEFAULT using a random forest

model_rf <- randomForest(DEFAULT ~ ., data = train)
model_rf

# Let's look at the error as a function of the number of trees

plot(model_rf)

# About 200 trees are enough

model_rf <- randomForest(DEFAULT ~ ., data = train, ntree = 200)
pred.train.rf <- predict(model_rf, test)
t3 <- table(pred.train.rf, test$DEFAULT)
t3
precision_rf <- t3[1,1]/(sum(t3[1,]))
recall_rf <- t3[1,1]/(sum(t3[,1]))
precision_rf
recall_rf
# 79.7% precision and 93.2% recall for the non-defaults --> almost the same as the tree!

precision_rf <- t3[2,2]/(sum(t3[2,]))
recall_rf <- t3[2,2]/(sum(t3[,2]))
precision_rf
recall_rf
# 28.1% recall for the defaults --> a bit better than the tree. Note: the
# original run misspelled the variable as presicion_rf, so the log below
# re-prints the stale non-default precision (0.7966102); from the confusion
# matrix, the default precision is 851/(622 + 851), about 57.8%.

The results of the code:

Call:
randomForest(formula = DEFAULT ~ ., data = train)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3

OOB estimate of error rate: 21.52%


Confusion matrix:
     0    1 class.error
0 7172  473   0.0618705
1 1679  676   0.7129512
> # Let's look at the error as a function of the number of trees
> plot(model_rf)
> # taking about 200 trees
> model_rf<-randomForest(DEFAULT~.,data=train,ntree=200)
> pred.train.rf <- predict(model_rf,test)
> t3<-table(pred.train.rf,test$DEFAULT)

> t3

pred.train.rf 0 1
0 8507 2172
1 622 851
> precision_rf<- t3[1,1]/(sum(t3[1,]))
> recall_rf<- t3[1,1]/(sum(t3[,1]))

> precision_rf

[1] 0.7966102

> recall_rf

[1] 0.9318655

> presicion_rf<- t3[2,2]/(sum(t3[2,]))

> recall_rf<- t3[2,2]/(sum(t3[,2]))

> precision_rf

[1] 0.7966102

> recall_rf

[1] 0.2815084
# C- Let's try to run a logistic regression (glm, a generalized linear model)

logit <- glm(DEFAULT ~ ., family = binomial(link = "logit"), data = train)
summary(logit)
# We can see that EDUCATION and some other features are not statistically significant at 5%

# Let's run a simple backward feature selection to remove the non-significant features

backwards <- step(logit, direction = "backward")

# Let's look at the prediction of this model on the test set.
# Caution: predict() on a glm returns log-odds by default; adding
# type = "response" would return probabilities, so the 0.5 cutoff below is
# applied on the log-odds scale rather than the probability scale.

pred.test <- predict(backwards, test)
pred.test <- ifelse(pred.test > 0.5, 1, 0)
t1 <- table(pred.test, test$DEFAULT)
t1

# Precision and recall of the model for the non-defaults

precision <- t1[1,1]/(sum(t1[1,]))
recall <- t1[1,1]/(sum(t1[,1]))
precision
recall
# 77.1% precision and 97.3% recall for the non-defaults, better than before!

# Precision and recall of the model for the defaults

precision <- t1[2,2]/(sum(t1[2,]))
recall <- t1[2,2]/(sum(t1[,2]))
precision
recall
# Much lower recall for the defaults than in the other models. Note: the
# original run misspelled the variable as presicion, so the precision
# printed in the log does not belong to the default class; from the
# confusion matrix shown, the default precision is 384/(244 + 384), about 61.1%.

The results of the code:

Call:
glm(formula = DEFAULT ~ ., family = binomial(link = "logit"), data = train)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.8037 -0.6716 -0.5737 -0.3661  2.6094

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)          -9.028e-01  1.014e-01  -8.899  < 2e-16 ***
LIMIT_BAL            -1.572e-06  3.169e-07  -4.959 7.09e-07 ***
SEXmale               1.624e-01  5.153e-02   3.151 0.001628 **
EDUCATIONhigh school  7.989e-02  7.586e-02   1.053 0.292292
EDUCATIONother       -1.322e-01  4.952e-01  -0.267 0.789492
EDUCATIONuniversity   4.964e-02  5.973e-02   0.831 0.405862
PAY_2-1              -4.336e-01  1.466e-01  -2.957 0.003107 **
PAY_20               -6.630e-01  1.673e-01  -3.962 7.43e-05 ***
PAY_21                3.495e-01  8.752e-01   0.399 0.689674
PAY_22                6.442e-01  1.710e-01   3.767 0.000165 ***
PAY_23                3.007e-01  2.775e-01   1.084 0.278529
PAY_3-1               2.743e-01  1.451e-01   1.890 0.058695 .
PAY_30                2.277e-01  1.618e-01   1.408 0.159265
PAY_31               -9.452e+00  1.195e+02  -0.079 0.936941
PAY_32                9.531e-01  1.669e-01   5.712 1.11e-08 ***
PAY_33                1.482e+00  3.492e-01   4.243 2.20e-05 ***
BILL_AMT1             3.702e-06  8.609e-07   4.301 1.70e-05 ***
PAY_AMT1             -6.184e-05  1.528e-05  -4.046 5.21e-05 ***
PAY_AMT2             -2.725e-05  1.419e-05  -1.920 0.054916 .
PAY_AMT3             -5.776e-05  1.478e-05  -3.909 9.28e-05 ***
PAY_AMT4             -4.121e-05  1.518e-05  -2.715 0.006625 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10917  on 9999  degrees of freedom
Residual deviance:  9710  on 9979  degrees of freedom
AIC: 9752

Number of Fisher Scoring iterations: 9

After deleting the insignificant features, backward selection keeps the
following features:

            Df Deviance    AIC
<none>           9711.3 9747.3
- PAY_AMT2   1   9715.1 9749.1
- PAY_AMT4   1   9719.0 9753.0
- SEX        1   9721.0 9755.0
- PAY_AMT3   1   9727.3 9761.3
- PAY_AMT1   1   9728.8 9762.8
- BILL_AMT1  1   9730.0 9764.0
- LIMIT_BAL  1   9739.6 9773.6
- PAY_3      5   9790.9 9816.9
pred.test    0    1
        0 8885 2639
        1  244  384

> # Precision and recall of the model for the non-defaults

> precision<- t1[1,1]/(sum(t1[1,]))

> recall<- t1[1,1]/(sum(t1[,1]))

> precision

[1] 0.7709997

> recall

[1] 0.973272

> # Precision and recall of the model for the defaults

> presicion<- t1[2,2]/(sum(t1[2,]))

> recall<- t1[2,2]/(sum(t1[,2]))

> precision

[1] 0.7897

> recall

[1] 0.1870261
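One caveat worth a small demonstration: for a glm fit, predict() returns values on the link (log-odds) scale by default, so a 0.5 cutoff there is not a 50% probability cutoff. The sketch below uses the built-in mtcars data (my own toy example, not the assignment's data) to show the two scales and how they relate:

```r
# Toy logistic regression on built-in data (am is already a 0/1 outcome)
fit <- glm(am ~ mpg + wt, family = binomial(link = "logit"), data = mtcars)

link_scale <- predict(fit, mtcars)                     # log-odds (the default)
prob_scale <- predict(fit, mtcars, type = "response")  # probabilities in [0, 1]

# The two scales are related through the logistic function:
all.equal(unname(prob_scale), unname(1 / (1 + exp(-link_scale))))

# A 0.5 cutoff on the log-odds scale corresponds to a probability cutoff of
# 1 / (1 + exp(-0.5)), about 0.62 -- stricter than a true 50% cutoff, so it
# flags fewer (or at most the same number of) positives:
mean(prob_scale > 0.5) >= mean(link_scale > 0.5)
```

This is likely part of why the backward-selected model's recall for the defaults is so low: thresholding the log-odds at 0.5 effectively demands roughly 62% predicted default probability before labeling a client as a default.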
