Janani Prakash Loan Prediction Study


Loan Prediction Study

29.03.2019

Janani Prakash
PGPBABI-Online
GreatLearning, Great Lakes Institute of Management
Project Objective

Exploratory Data Analysis – Step by step approach

Install necessary Packages and Invoke Libraries
Set up working Directory
Import and Read the Dataset

Explanation and conclusion

Exploring the dataset
Building Classification and Regression Tree (CART)
Building Random Forest
Building Artificial Neural Network
Comparison between the models

Appendix A – Source Code


1. Project Objective
The dataset provides details from MyBank about a Personal Loan Campaign executed by
the bank. 20000 customers were targeted with an offer of a personal loan at a 10%
interest rate, out of which 2512 customers responded positively. This data is to be used
to build classification model(s) that predict the response of a new set of customers in
the future, based on the attributes available in the data.
Classification models are to be built using the following Supervised Machine Learning
techniques:
1. Classification and Regression Tree
2. Random Forest
3. Artificial Neural Network
Once the classification models are built, a comparison is to be drawn among the three
models.

2. Exploratory Data Analysis – Step by step approach
2.1.1. Install necessary Packages and Invoke Libraries
The necessary packages were installed and the associated libraries were
invoked. Keeping all the package and library calls in one place increases
code readability.
Please refer to Appendix A for Source Code.
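As an illustrative sketch only (the definitive list is in Appendix A), the packages implied by the functions used later in this report can be installed and loaded as follows:
>pkgs <- c("caret", "rpart", "rattle", "rpart.plot", "randomForest", "neuralnet",
"ROCR", "ineq", "data.table", "scales", "ROSE")
>new.pkgs <- setdiff(pkgs, rownames(installed.packages()))   # packages not yet installed
>if (length(new.pkgs) > 0) install.packages(new.pkgs)
>lapply(pkgs, library, character.only = TRUE)                # invoke the libraries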

2.1.2. Set up working Directory


Setting a working directory at the start of the R session makes importing
and exporting data files and code files easier. The working directory is the
location/folder on the PC where the data, code and other files related to the
project are kept.
Please refer to Appendix A for Source Code.
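A minimal sketch, with a hypothetical folder path (the actual path is in Appendix A):
>setwd("D:/GreatLearning/LoanPrediction")   # hypothetical project folder
>getwd()                                    # confirm the working directory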

2.1.3. Import and Read the Dataset


The given dataset is in .csv format. Hence, the command ‘read.csv’ is
used for importing the file.
Please refer to Appendix A for Source Code.
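A minimal sketch, assuming the file is named Loan.csv (the name used when the dataset is re-read in the neural network section); attach() is assumed here because TARGET is later referenced without the Loan$ prefix:
>Loan <- read.csv("Loan.csv", header = TRUE)   # import the campaign dataset
>attach(Loan)                                  # allows e.g. table(TARGET) later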

3. Explanation and conclusion

3.1. Exploring the dataset


After importing the dataset, the following can be done to analyse and explore the
dataset.

>dim(Loan)
[1] 20000 40
The dim function returns the dimensions (number of rows and columns) of the dataset.

>names(Loan)
[1] "CUST_ID" "TARGET"
[3] "AGE" "GENDER"
[5] "BALANCE" "OCCUPATION"
[7] "AGE_BKT" "SCR"
[9] "HOLDING_PERIOD" "ACC_TYPE"
[11] "ACC_OP_DATE" "LEN_OF_RLTN_IN_MNTH"
[13] "NO_OF_L_CR_TXNS" "NO_OF_L_DR_TXNS"
[15] "TOT_NO_OF_L_TXNS" "NO_OF_BR_CSH_WDL_DR_TXNS"
[17] "NO_OF_ATM_DR_TXNS" "NO_OF_NET_DR_TXNS"
[19] "NO_OF_MOB_DR_TXNS" "NO_OF_CHQ_DR_TXNS"
[21] "FLG_HAS_CC" "AMT_ATM_DR"
[23] "AMT_BR_CSH_WDL_DR" "AMT_CHQ_DR"
[25] "AMT_NET_DR" "AMT_MOB_DR"
[27] "AMT_L_DR" "FLG_HAS_ANY_CHGS"
[29] "AMT_OTH_BK_ATM_USG_CHGS" "AMT_MIN_BAL_NMC_CHGS"
[31] "NO_OF_IW_CHQ_BNC_TXNS" "NO_OF_OW_CHQ_BNC_TXNS"
[33] "AVG_AMT_PER_ATM_TXN" "AVG_AMT_PER_CSH_WDL_TXN"
[35] "AVG_AMT_PER_CHQ_TXN" "AVG_AMT_PER_NET_TXN"
[37] "AVG_AMT_PER_MOB_TXN" "FLG_HAS_NOMINEE"
[39] "FLG_HAS_OLD_LOAN" "random"
The names function lists the column names of the dataset.

> head(Loan,5)
CUST_ID TARGET AGE GENDER BALANCE OCCUPATION AGE_BKT SCR
1 C7927 0 27 M 3383.75 SELF-EMP 26-30 776
2 C6877 0 47 M 287489.04 SAL 46-50 324
3 C19922 0 40 M 18216.88 SELF-EMP 36-40 603
4 C8183 0 53 M 71720.48 SAL >50 196
5 C12123 0 36 M 1671622.89 PROF 36-40 167
HOLDING_PERIOD ACC_TYPE ACC_OP_DATE LEN_OF_RLTN_IN_MNTH
NO_OF_L_CR_TXNS
1 30 SA 3/23/2005 146 7
2 28 SA 10-11-08 104 8
3 2 SA 4/26/2012 61 10
4 13 CA 07-04-08 107 36
5 24 SA 12/29/2001 185 20
NO_OF_L_DR_TXNS TOT_NO_OF_L_TXNS
NO_OF_BR_CSH_WDL_DR_TXNS
1 3 10 0
2 2 10 0
3 5 15 1
4 14 50 4
5 1 21 1
NO_OF_ATM_DR_TXNS NO_OF_NET_DR_TXNS NO_OF_MOB_DR_TXNS
NO_OF_CHQ_DR_TXNS
1 1 2 0 0
2 1 1 0 0
3 1 1 0 2
4 2 3 1 4
5 0 0 0 0
FLG_HAS_CC AMT_ATM_DR AMT_BR_CSH_WDL_DR AMT_CHQ_DR
AMT_NET_DR AMT_MOB_DR
1 0 13100 0 0 973557 0
2 0 6600 0 0 799813 0
3 0 11200 561120 49320 997570 0
4 0 26100 673590 60780 741506 71388
5 0 0 808480 0 0 0
AMT_L_DR FLG_HAS_ANY_CHGS AMT_OTH_BK_ATM_USG_CHGS
AMT_MIN_BAL_NMC_CHGS
1 986657 0 0 0
2 806413 1 0 0
3 1619210 1 0 0
4 1573364 0 0 0
5 808480 0 0 0
NO_OF_IW_CHQ_BNC_TXNS NO_OF_OW_CHQ_BNC_TXNS
AVG_AMT_PER_ATM_TXN
1 0 0 13100
2 0 0 6600
3 0 1 11200
4 0 0 13050
5 0 0 0
AVG_AMT_PER_CSH_WDL_TXN AVG_AMT_PER_CHQ_TXN
AVG_AMT_PER_NET_TXN
1 0.0 0 486778.5
2 0.0 0 799813.0
3 561120.0 24660 997570.0
4 168397.5 15195 247168.7
5 808480.0 0 0.0
AVG_AMT_PER_MOB_TXN FLG_HAS_NOMINEE FLG_HAS_OLD_LOAN
random
1 0 1 1 0.000011400
2 0 1 0 0.000111373
3 0 1 1 0.000119954
4 71388 1 0 0.000136825
5 0 1 0 0.000173976

>table(TARGET)
TARGET
0 1
17488 2512
The table function on TARGET shows that there were 2512 responders
and 17488 non-responders out of the 20000 customers who
were contacted by the bank.

3.2. Building Classification and Regression Tree (CART)


The Classification and Regression Tree technique covers the following decision trees:
● Classification Trees: where the target variable is categorical and the tree
is used to identify the “class” within which the target variable would most
likely fall.
● Regression Trees: where the target variable is continuous and the tree is
used to predict its value.

The CART algorithm is structured as a sequence of questions, the answers to
which determine what the next question, if any, should be. The result of these
questions is a tree-like structure whose ends are terminal nodes, at which
point there are no more questions.

Splitting data into training and test dataset


>trainIndex <- createDataPartition(Loan$TARGET,
p=0.7,
list = FALSE,
times = 1)
>Loan.dev <- Loan[trainIndex, ]
>Loan.holdout <- Loan[-trainIndex,]
>dim(Loan.dev)
[1] 14000 40
>dim(Loan.holdout)
[1] 6000 40
>table(Loan.dev$TARGET)
0 1
12272 1728
>table(Loan.holdout$TARGET)
0 1
5216 784

Model Building - CART (Unbalanced Dataset)


The control parameter inputs for rpart are set using the following code. Here,
minsplit denotes the minimum number of observations that must exist in a node
in order for a split to be attempted. minbucket denotes the minimum number
of observations in any terminal node. xval denotes the number of cross-
validations. cp denotes the complexity parameter.
>r1.ctrl <- rpart.control(minsplit = 100, minbucket = 10, cp = 0, xval = 10)

The following code is used to build the model on Training Dataset.

>cart.train <- Loan.dev


> m2 <- rpart(formula = TARGET~., data = cart.train[,-c(1,11)], method = "class",
control = r1.ctrl)

>fancyRpartPlot(m2)
The fancyRpartPlot function plots the rpart decision tree using the pretty rpart plotter
from the rattle package.

>printcp(m2)

Classification tree:
rpart(formula = cart.train$TARGET ~ ., data = cart.train[, -c(1,
11)], method = "class", control = r1.ctrl)

Variables actually used in tree construction:


[1] ACC_TYPE AGE_BKT AMT_ATM_DR
[4] AMT_BR_CSH_WDL_DR AMT_L_DR AMT_NET_DR
[7] AVG_AMT_PER_ATM_TXN AVG_AMT_PER_CSH_WDL_TXN
AVG_AMT_PER_MOB_TXN
[10] AVG_AMT_PER_NET_TXN BALANCE FLG_HAS_CC
[13] GENDER HOLDING_PERIOD LEN_OF_RLTN_IN_MNTH
[16] NO_OF_ATM_DR_TXNS NO_OF_BR_CSH_WDL_DR_TXNS
NO_OF_L_CR_TXNS
[19] NO_OF_L_DR_TXNS NO_OF_OW_CHQ_BNC_TXNS
OCCUPATION
[22] SCR

Root node error: 1728/14000 = 0.12343


n= 14000

CP nsplit rel error xerror xstd


1 0.00593171 0 1.00000 1.00000 0.022523
2 0.00405093 4 0.97627 0.99942 0.022517
3 0.00173611 10 0.95139 0.98669 0.022393
4 0.00166377 12 0.94792 0.99884 0.022512
5 0.00154321 26 0.91725 0.99826 0.022506
6 0.00135031 30 0.91088 0.99884 0.022512
7 0.00101273 41 0.89294 1.00347 0.022556
8 0.00096451 45 0.88889 1.00926 0.022612
9 0.00086806 48 0.88600 1.00926 0.022612
10 0.00057870 50 0.88426 1.01215 0.022640
11 0.00000000 55 0.88021 1.02546 0.022767

From the above result, it can be seen that the xerror value decreases till cp =
0.0017 and steadily increases thereafter.
>plotcp(m2)

Pruning the Tree


>ptree1<- prune(m2, cp= 0.0017 ,"CP")
>printcp(ptree1)
Classification tree:
rpart(formula = cart.train$TARGET ~ ., data = cart.train[, -c(1, 11)], method =
"class", control = r1.ctrl)

Variables actually used in tree construction:


[1] AGE_BKT AMT_ATM_DR BALANCE HOLDING_PERIOD
NO_OF_L_CR_TXNS
[6] NO_OF_L_DR_TXNS OCCUPATION SCR
Root node error: 1728/14000 = 0.12343
n= 14000
CP nsplit rel error xerror xstd
1 0.0059317 0 1.00000 1.00000 0.022523
2 0.0040509 4 0.97627 0.99942 0.022517
3 0.0017361 10 0.95139 0.98669 0.022393
4 0.0017000 12 0.94792 0.99884 0.022512

>fancyRpartPlot(ptree1, uniform=TRUE, main="Pruned Classification Tree")
Performance Measures on Training Data Set
The following model performance measures will be calculated on the training set
to gauge the goodness of the model:
● Rank Ordering
● KS
● Area Under Curve (AUC)
● Gini Coefficient
● Classification Error

>cart.train$predict.class = predict(ptree1, cart.train, type = "class")


>cart.train$predict.score = predict(ptree1, cart.train, type = "prob")
Deciling Code
>decile <- function(x)
{
deciles <- vector(length=10)
for (i in seq(0.1,1,.1))
{
deciles[i*10] <- quantile(x, i, na.rm=T)
}
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}
>cart.train$deciles <- decile(cart.train$predict.score[,2])

Ranking Code
>tmp_DT = data.table(cart.train)
>rank <- tmp_DT[, list(cnt=length(TARGET),
cnt_resp=sum(TARGET==1),
cnt_non_resp=sum(TARGET==0)
), by=deciles][order(-deciles)]
>rank$rrate <- round(rank$cnt_resp / rank$cnt,4);
>rank$cum_resp <- cumsum(rank$cnt_resp)
>rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
>rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
>rank$cum_rel_non_resp <- round(rank$cum_non_resp /
sum(rank$cnt_non_resp),4);
>rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
>rank$rrate <- percent(rank$rrate)
>rank$cum_rel_resp <- percent(rank$cum_rel_resp)
>rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
>rank

deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp


1: 10 2670 654 2016 24.5% 654 2016
2: 9 2911 310 2601 10.6% 964 4617
3: 7 8419 764 7655 9.1% 1728 12272
cum_rel_resp cum_rel_non_resp ks
1: 37.8% 16.4% 21.42
2: 55.8% 37.6% 18.17
3: 100.0% 100.0% 0.00
Interpretation:
● The response rate in the top decile is 24.5%.
● The KS is around 21.42%, indicating that the model is not good.

KS and Area under Curve


>pred <- prediction(cart.train$predict.score[,2], cart.train$TARGET)
>perf <- performance(pred, "tpr", "fpr")
>plot(perf)

>KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])


>auc <- performance(pred,"auc");
>auc <- as.numeric(auc@y.values)
>gini = ineq(cart.train$predict.score[,2], type="Gini")
>with(cart.train, table(TARGET, predict.class))
predict.class
TARGET 0 1
0 12226 46
1 1592 136
>KS
[1] 0.2142056
>auc
[1] 0.624329
>gini
[1] 0.2179665

Summary: CART - Model Performance(Training Unbalanced Dataset):


● The KS = 21.4% and the AUC = 62.4%, which indicates that the model is
not good.
● The Gini Coefficient = 21.8% also indicates that the model is not good.
● Confusion matrix:
1. Accuracy = (12226+136)/(12226+1592+46+136) = 88.3%
2. Classification Error Rate = 1 - Accuracy = 11.7%

Predict Test Data Set


>cart.test <- Loan.holdout
>cart.test$predict.class = predict(ptree1, cart.test, type = "class")
>cart.test$predict.score = predict(ptree1, cart.test, type = "prob")
>cart.test$deciles <- decile(cart.test$predict.score[,2])
Ranking Code - Test Data
>tmp_DT = data.table(cart.test)
>rank <- tmp_DT[, list(cnt=length(TARGET),
cnt_resp=sum(TARGET==1),
cnt_non_resp=sum(TARGET==0)
), by=deciles][order(-deciles)]
>rank$rrate <- round(rank$cnt_resp / rank$cnt,4);
>rank$cum_resp <- cumsum(rank$cnt_resp)
>rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
>rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
>rank$cum_rel_non_resp <- round(rank$cum_non_resp /
sum(rank$cnt_non_resp),4);
>rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
>rank$rrate <- percent(rank$rrate)
>rank$cum_rel_resp <- percent(rank$cum_rel_resp)
>rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
>rank

deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp


1: 10 1230 324 906 26.3% 324 906
2: 8 1212 129 1083 10.6% 453 1989
3: 6 3558 331 3227 9.3% 784 5216
cum_rel_resp cum_rel_non_resp ks
1: 41.3% 17.4% 23.96
2: 57.8% 38.1% 19.65
3: 100.0% 100.0% 0.00

Interpretation:
● The response rate in the top decile is 26.3%.
● The KS is around 23.9%, indicating that the model is not good.
KS and Area under Curve

>pred <- prediction(cart.test$predict.score[,2], cart.test$TARGET)


>perf <- performance(pred, "tpr", "fpr")
>KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
>auc <- performance(pred,"auc");
>auc <- as.numeric(auc@y.values)
>gini = ineq(cart.test$predict.score[,2], type="Gini")
>with(cart.test, table(TARGET, predict.class))
predict.class
TARGET 0 1
0 5193 23
1 738 46

>plot(perf)

>KS
[1] 0.2456805
>auc
[1] 0.6350559
>gini
[1] 0.2192173

CART - Model Performance(Test Dataset):


● The KS = 24.5% and the AUC = 63.5%, which indicates that the model is
not good.
● The Gini Coefficient = 21.9% also indicates that the model is not good.
● Confusion matrix:
1. Accuracy = (5193 + 46)/(5193+738+23+46) = 87.32%
2. Classification Error Rate = 1 - Accuracy = 12.68%

Balancing the training data by oversampling using the ROSE package


>N=12272*2
>N
24544

>cart.train.over <- ovun.sample(TARGET~.,data=cart.train,


method="over",N=24544)$data
>table(cart.train.over$TARGET)
0 1
12272 12272

Model Building - CART with Oversampled Dataset


Setting the control parameter
>r1.ctrl = rpart.control(minsplit=100, minbucket = 20, cp = 0, xval = 10)

>m2 <- rpart(formula = TARGET ~ ., data = cart.train.over[,-c(1,11,41,42,43)],


method = "class", control = r1.ctrl)

>printcp(m2)
Classification tree:
rpart(formula = TARGET ~ ., data = cart.train.over[, -c(1, 11,
41, 42, 43)], method = "class", control = r1.ctrl)

Variables actually used in tree construction:


[1] ACC_TYPE AGE AGE_BKT
AMT_ATM_DR
[5] AMT_BR_CSH_WDL_DR AMT_CHQ_DR AMT_L_DR
AMT_MOB_DR
[9] AMT_NET_DR AVG_AMT_PER_ATM_TXN
AVG_AMT_PER_CHQ_TXN AVG_AMT_PER_CSH_WDL_TXN
[13] AVG_AMT_PER_NET_TXN BALANCE
FLG_HAS_ANY_CHGS FLG_HAS_CC
[17] FLG_HAS_NOMINEE FLG_HAS_OLD_LOAN GENDER
HOLDING_PERIOD
[21] LEN_OF_RLTN_IN_MNTH NO_OF_ATM_DR_TXNS
NO_OF_BR_CSH_WDL_DR_TXNS NO_OF_CHQ_DR_TXNS
[25] NO_OF_IW_CHQ_BNC_TXNS NO_OF_L_CR_TXNS
NO_OF_L_DR_TXNS NO_OF_NET_DR_TXNS
[29] NO_OF_OW_CHQ_BNC_TXNS OCCUPATION SCR
TOT_NO_OF_L_TXNS

Root node error: 12272/24544 = 0.5

n= 24544

CP nsplit rel error xerror xstd


1 1.8106e-01 0 1.00000 1.02591 0.0063809
2 2.6239e-02 1 0.81894 0.83572 0.0062963
3 1.9883e-02 2 0.79270 0.80256 0.0062574
4 1.3174e-02 3 0.77282 0.77705 0.0062224
5 1.1367e-02 6 0.73330 0.74389 0.0061701
6 1.0308e-02 8 0.71056 0.72849 0.0061433
7 8.2301e-03 10 0.68994 0.69712 0.0060832
8 6.4782e-03 11 0.68171 0.68970 0.0060680
9 6.1930e-03 13 0.66876 0.68326 0.0060544
10 4.5361e-03 14 0.66257 0.67031 0.0060261
11 4.2373e-03 17 0.64896 0.65271 0.0059857
12 3.5039e-03 19 0.64048 0.63722 0.0059482
13 3.2595e-03 24 0.61954 0.61514 0.0058914
14 3.2323e-03 26 0.61302 0.60495 0.0058638
15 3.1780e-03 29 0.60332 0.60373 0.0058605
16 3.0557e-03 33 0.58841 0.59884 0.0058469
17 3.0150e-03 37 0.57309 0.59037 0.0058229
18 2.9878e-03 38 0.57008 0.58336 0.0058026
19 2.9335e-03 41 0.56111 0.57863 0.0057887
20 2.8928e-03 42 0.55818 0.57668 0.0057829
21 2.8113e-03 45 0.54808 0.56454 0.0057461
22 2.7705e-03 47 0.54245 0.55590 0.0057191
23 2.6890e-03 49 0.53691 0.55386 0.0057126
24 2.6076e-03 52 0.52885 0.54319 0.0056781
25 2.5261e-03 53 0.52624 0.53382 0.0056470
26 2.4853e-03 55 0.52119 0.52893 0.0056304
27 2.4446e-03 62 0.49756 0.52306 0.0056103
28 2.3631e-03 63 0.49511 0.51263 0.0055736
29 2.2816e-03 73 0.46537 0.50301 0.0055389
30 2.2545e-03 78 0.45396 0.50196 0.0055351
31 2.2409e-03 82 0.44492 0.49641 0.0055146
32 2.2001e-03 85 0.43783 0.49641 0.0055146
33 2.1186e-03 88 0.42911 0.49047 0.0054923
34 1.9964e-03 91 0.42275 0.47906 0.0054485
35 1.9557e-03 93 0.41876 0.47531 0.0054338
36 1.9149e-03 97 0.41094 0.47376 0.0054277
37 1.8742e-03 99 0.40711 0.46855 0.0054070
38 1.7927e-03 103 0.39920 0.46545 0.0053945
39 1.7384e-03 104 0.39741 0.45681 0.0053593
40 1.7112e-03 109 0.38649 0.45331 0.0053447
41 1.6705e-03 114 0.37793 0.44524 0.0053108
42 1.6297e-03 116 0.37459 0.44043 0.0052902
43 1.5890e-03 119 0.36970 0.43530 0.0052679
44 1.5482e-03 121 0.36653 0.42699 0.0052312
45 1.4668e-03 122 0.36498 0.42080 0.0052033
46 1.4260e-03 127 0.35764 0.41387 0.0051716
47 1.3853e-03 129 0.35479 0.41354 0.0051701
48 1.2223e-03 133 0.34925 0.40238 0.0051178
49 1.1951e-03 139 0.34192 0.38828 0.0050495
50 1.1408e-03 142 0.33833 0.38095 0.0050129
51 1.0593e-03 145 0.33491 0.37744 0.0049952
52 9.7784e-04 147 0.33279 0.37036 0.0049589
53 8.9635e-04 151 0.32888 0.36685 0.0049407
54 8.5561e-04 168 0.31274 0.36294 0.0049201
55 8.1486e-04 175 0.30525 0.36278 0.0049193
56 7.3338e-04 176 0.30443 0.35927 0.0049007
57 6.9263e-04 178 0.30297 0.35528 0.0048793
58 6.5189e-04 182 0.29979 0.35292 0.0048666
59 5.7040e-04 186 0.29718 0.34998 0.0048506
60 5.2966e-04 190 0.29490 0.34958 0.0048484
61 4.8892e-04 192 0.29384 0.34892 0.0048448
62 4.4817e-04 207 0.28471 0.34868 0.0048435
63 4.3459e-04 209 0.28382 0.34762 0.0048377
64 4.0743e-04 212 0.28251 0.34754 0.0048372
65 3.6669e-04 217 0.28048 0.34672 0.0048327
66 2.4446e-04 219 0.27974 0.34306 0.0048124
67 2.0372e-04 220 0.27950 0.34208 0.0048070
68 1.9013e-04 224 0.27836 0.34249 0.0048093
69 1.6297e-04 227 0.27779 0.34224 0.0048079
70 8.1486e-05 228 0.27762 0.34192 0.0048061
71 5.4324e-05 230 0.27746 0.34241 0.0048088
72 4.0743e-05 233 0.27730 0.34241 0.0048088
73 0.0000e+00 235 0.27722 0.34265 0.0048102

>plotcp(m2)

Pruning the Tree


>ptree<- prune(m2, cp= 0.00029 ,"CP")
>printcp(ptree)
Classification tree:
rpart(formula = TARGET ~ ., data = cart.train.over[, -c(1, 11,
41, 42, 43)], method = "class", control = r1.ctrl)

Variables actually used in tree construction:


[1] ACC_TYPE AGE AGE_BKT
AMT_ATM_DR
[5] AMT_BR_CSH_WDL_DR AMT_CHQ_DR AMT_L_DR
AMT_MOB_DR
[9] AMT_NET_DR AVG_AMT_PER_ATM_TXN
AVG_AMT_PER_CHQ_TXN AVG_AMT_PER_CSH_WDL_TXN
[13] AVG_AMT_PER_NET_TXN BALANCE
FLG_HAS_ANY_CHGS FLG_HAS_CC
[17] FLG_HAS_NOMINEE FLG_HAS_OLD_LOAN GENDER
HOLDING_PERIOD
[21] LEN_OF_RLTN_IN_MNTH NO_OF_ATM_DR_TXNS
NO_OF_BR_CSH_WDL_DR_TXNS NO_OF_CHQ_DR_TXNS
[25] NO_OF_IW_CHQ_BNC_TXNS NO_OF_L_CR_TXNS
NO_OF_L_DR_TXNS NO_OF_NET_DR_TXNS
[29] NO_OF_OW_CHQ_BNC_TXNS OCCUPATION SCR
TOT_NO_OF_L_TXNS

Root node error: 12272/24544 = 0.5

n= 24544

CP nsplit rel error xerror xstd


1 0.18106258 0 1.00000 1.02591 0.0063809
2 0.02623859 1 0.81894 0.83572 0.0062963
3 0.01988266 2 0.79270 0.80256 0.0062574
4 0.01317362 3 0.77282 0.77705 0.0062224
5 0.01136734 6 0.73330 0.74389 0.0061701
6 0.01030802 8 0.71056 0.72849 0.0061433
7 0.00823012 10 0.68994 0.69712 0.0060832
8 0.00647816 11 0.68171 0.68970 0.0060680
9 0.00619296 13 0.66876 0.68326 0.0060544
10 0.00453607 14 0.66257 0.67031 0.0060261
11 0.00423729 17 0.64896 0.65271 0.0059857
12 0.00350391 19 0.64048 0.63722 0.0059482
13 0.00325945 24 0.61954 0.61514 0.0058914
14 0.00323229 26 0.61302 0.60495 0.0058638
15 0.00317797 29 0.60332 0.60373 0.0058605
16 0.00305574 33 0.58841 0.59884 0.0058469
17 0.00301499 37 0.57309 0.59037 0.0058229
18 0.00298783 38 0.57008 0.58336 0.0058026
19 0.00293351 41 0.56111 0.57863 0.0057887
20 0.00289276 42 0.55818 0.57668 0.0057829
21 0.00281128 45 0.54808 0.56454 0.0057461
22 0.00277053 47 0.54245 0.55590 0.0057191
23 0.00268905 49 0.53691 0.55386 0.0057126
24 0.00260756 52 0.52885 0.54319 0.0056781
25 0.00252608 53 0.52624 0.53382 0.0056470
26 0.00248533 55 0.52119 0.52893 0.0056304
27 0.00244459 62 0.49756 0.52306 0.0056103
28 0.00236310 63 0.49511 0.51263 0.0055736
29 0.00228162 73 0.46537 0.50301 0.0055389
30 0.00225445 78 0.45396 0.50196 0.0055351
31 0.00224087 82 0.44492 0.49641 0.0055146
32 0.00220013 85 0.43783 0.49641 0.0055146
33 0.00211864 88 0.42911 0.49047 0.0054923
34 0.00199641 91 0.42275 0.47906 0.0054485
35 0.00195567 93 0.41876 0.47531 0.0054338
36 0.00191493 97 0.41094 0.47376 0.0054277
37 0.00187419 99 0.40711 0.46855 0.0054070
38 0.00179270 103 0.39920 0.46545 0.0053945
39 0.00173837 104 0.39741 0.45681 0.0053593
40 0.00171121 109 0.38649 0.45331 0.0053447
41 0.00167047 114 0.37793 0.44524 0.0053108
42 0.00162973 116 0.37459 0.44043 0.0052902
43 0.00158898 119 0.36970 0.43530 0.0052679
44 0.00154824 121 0.36653 0.42699 0.0052312
45 0.00146675 122 0.36498 0.42080 0.0052033
46 0.00142601 127 0.35764 0.41387 0.0051716
47 0.00138527 129 0.35479 0.41354 0.0051701
48 0.00122229 133 0.34925 0.40238 0.0051178
49 0.00119513 139 0.34192 0.38828 0.0050495
50 0.00114081 142 0.33833 0.38095 0.0050129
51 0.00105932 145 0.33491 0.37744 0.0049952
52 0.00097784 147 0.33279 0.37036 0.0049589
53 0.00089635 151 0.32888 0.36685 0.0049407
54 0.00085561 168 0.31274 0.36294 0.0049201
55 0.00081486 175 0.30525 0.36278 0.0049193
56 0.00073338 176 0.30443 0.35927 0.0049007
57 0.00069263 178 0.30297 0.35528 0.0048793
58 0.00065189 182 0.29979 0.35292 0.0048666
59 0.00057040 186 0.29718 0.34998 0.0048506
60 0.00052966 190 0.29490 0.34958 0.0048484
61 0.00048892 192 0.29384 0.34892 0.0048448
62 0.00044817 207 0.28471 0.34868 0.0048435
63 0.00043459 209 0.28382 0.34762 0.0048377
64 0.00040743 212 0.28251 0.34754 0.0048372
65 0.00036669 217 0.28048 0.34672 0.0048327
66 0.00029000 219 0.27974 0.34306 0.0048124

>fancyRpartPlot(ptree, uniform = TRUE, main = "Final Tree")


Measure CART Model Performance (Oversampled Training Dataset)
>cart.train.over$predict.class <- predict(ptree, cart.train.over, type = "class")
>cart.train.over$predict.score <- predict(ptree, cart.train.over, type = "prob")

Deciling Code
>decile <- function (x)
{
deciles <- vector(length=10)
for (i in seq(0.1,1,.1))
{
deciles[i*10] <- quantile(x, i, na.rm=T)
}
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}

>cart.train.over$deciles <- decile(cart.train.over$predict.score[,2])

Ranking Code
>tmp_DT = data.table(cart.train.over)
>rank <- tmp_DT[, list(cnt=length(TARGET),
cnt_resp=sum(TARGET==1),
cnt_non_resp=sum(TARGET==0)
), by=deciles][order(-deciles)]
>rank$rrate <- round(rank$cnt_resp / rank$cnt,4);
>rank$cum_resp <- cumsum(rank$cnt_resp)
>rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
>rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
>rank$cum_rel_non_resp <- round(rank$cum_non_resp /
sum(rank$cnt_non_resp),4);
>rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
>rank$rrate <- percent(rank$rrate)
>rank$cum_rel_resp <- percent(rank$cum_rel_resp)
>rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
>rank

deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp


1: 10 2463 2318 145 94.1% 2318 145
2: 9 2644 2384 260 90.2% 4702 405
3: 8 2346 2009 337 85.6% 6711 742
4: 7 2530 2085 445 82.4% 8796 1187
5: 6 2334 1750 584 75.0% 10546 1771
6: 5 2413 976 1437 40.4% 11522 3208
7: 4 2668 451 2217 16.9% 11973 5425
8: 3 2297 218 2079 9.5% 12191 7504
9: 2 2489 81 2408 3.2% 12272 9912
10: 1 2360 0 2360 0.0% 12272 12272
cum_rel_resp cum_rel_non_resp ks
1: 18.9% 1.2% 17.71
2: 38.3% 3.3% 35.01
3: 54.7% 6.0% 48.64
4: 71.7% 9.7% 62.01
5: 85.9% 14.4% 71.51
6: 93.9% 26.1% 67.75
7: 97.6% 44.2% 53.35
8: 99.3% 61.2% 38.19
9: 100.0% 80.8% 19.23
10: 100.0% 100.0% 0.00

Interpretation from the rank order table:


● The response rate in the top five deciles is 75% or higher.
● The maximum KS is above 70%, indicating a good model.

KS and Area under Curve


>pred <- prediction(cart.train.over$predict.score[,2], cart.train.over$TARGET)
>perf <- performance(pred, "tpr", "fpr")
>KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
>auc <- performance(pred,"auc");
>auc <- as.numeric(auc@y.values)
>gini = ineq(cart.train.over$predict.score[,2], type="Gini")
>with(cart.train.over, table(TARGET, predict.class))
predict.class
TARGET 0 1
0 10235 2037
1 1396 10876

>plot(perf)
>KS
[1] 0.7202575
>auc
[1] 0.9162399
>gini
[1] 0.4162399
Summary - CART Model Performance (Oversampled Training Dataset):
● The KS = 72.18% and the AUC = 91.46%, indicating a good model.
● The Gini Coefficient = 41.46% also indicates a reasonably good model, with
scope for improvement.
● Confusion matrix:
1. Accuracy = (10106 + 11025)/(10106 + 1247 + 2166 + 11025) =
86.09%
2. Classification Error Rate = 1 - Accuracy = 13.91%

Scoring Holdout Sample


>cart.test$predict.class <- predict(ptree, cart.test, type = "class")
>cart.test$predict.score <- predict(ptree, cart.test, type = "prob")
>cart.test$deciles <- decile(cart.test$predict.score[,2])

Ranking Code
>tmp_DT = data.table(cart.test)
>h_rank <- tmp_DT[, list(cnt=length(TARGET),
cnt_resp=sum(TARGET==1),
cnt_non_resp=sum(TARGET==0)
), by=deciles][order(-deciles)]
>h_rank$rrate <- round(h_rank$cnt_resp / h_rank$cnt,4);
>h_rank$cum_resp <- cumsum(h_rank$cnt_resp)
>h_rank$cum_non_resp <- cumsum(h_rank$cnt_non_resp)
>h_rank$cum_rel_resp <- round(h_rank$cum_resp / sum(h_rank$cnt_resp),4);
>h_rank$cum_rel_non_resp <- round(h_rank$cum_non_resp /
sum(h_rank$cnt_non_resp),4);
>h_rank$ks <- abs(h_rank$cum_rel_resp - h_rank$cum_rel_non_resp) * 100;
>h_rank$rrate <- percent(h_rank$rrate)
>h_rank$cum_rel_resp <- percent(h_rank$cum_rel_resp)
>h_rank$cum_rel_non_resp <- percent(h_rank$cum_rel_non_resp)
>h_rank
deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
1: 10 637 270 367 42.4% 270 367
2: 9 605 199 406 32.9% 469 773
3: 8 573 108 465 18.8% 577 1238
4: 7 621 66 555 10.6% 643 1793
5: 6 664 33 631 5.0% 676 2424
6: 5 519 41 478 7.9% 717 2902
7: 4 759 26 733 3.4% 743 3635
8: 3 563 9 554 1.6% 752 4189
9: 2 1059 32 1027 3.0% 784 5216
cum_rel_resp cum_rel_non_resp ks
1: 34.4% 7.0% 27.40
2: 59.8% 14.8% 45.00
3: 73.6% 23.7% 49.87
4: 82.0% 34.4% 47.64
5: 86.2% 46.5% 39.75
6: 91.4% 55.6% 35.81
7: 94.8% 69.7% 25.08
8: 95.9% 80.3% 15.61
9: 100.0% 100.0% 0.00

Interpretation from the rank order table:


● The response rate in the top three deciles is 45%, 29% and 17% respectively.
● With the top 3 deciles, the KS is 46%, indicating a good model.

KS and Area under Curve


>pred <- prediction(cart.test$predict.score[,2], cart.test$TARGET)
>perf <- performance(pred, "tpr", "fpr")
>KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
>auc <- performance(pred,"auc");
>auc <- as.numeric(auc@y.values)
>gini = ineq(cart.test$predict.score[,2], type="Gini")
>with(cart.test, table(TARGET, predict.class))
predict.class
TARGET 0 1
0 4228 988
1 255 529
>plot(perf)
>KS
[1] 0.5004147
>auc
[1] 0.8027087
>gini
[1] 0.5933811
Summary - CART Model Performance (Oversampled Model on Test Dataset):
● The KS = 47% and the AUC = 79%, indicating a good model.
● The Gini Coefficient = 60.4% also indicates a good model, with
scope for improvement.
● Confusion matrix:
1. Accuracy = (4178 + 523)/(4178 + 261 + 1038 + 523) = 78.35%
2. Classification Error Rate = 1 - Accuracy = 21.65%

CART - Conclusion
Comparative Summary of the CART Model on Training and Testing Dataset
is as follows:
Measures Train Test %Deviation
KS 72.18 47.40 35%
AUC 91.46 79.23 13%
Gini 41.46 60.42 -46%
Accuracy 86.09 78.35 9%
CeR 13.91 21.65 -56%

CeR= Classification error rate
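The %Deviation column in this and the following summary tables is the relative change from the training value to the test value. An illustrative check for the KS row above:
>round((72.18 - 47.40) / 72.18 * 100)   # about 34%, which the table reports as 35%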


It can be observed that for most of the model performance measures, the deviation
between the Training and Testing sets is above the maximum tolerance of +/- 10%.
Hence, the model is over-fitting.
Nevertheless, the absolute values of the performance measures indicate reasonable
prediction-making capability of the developed CART model.
3.3. Building Random Forest
Random Forest is a supervised classification algorithm that, as the name suggests,
builds a forest of decision trees grown on random samples of the data. In general,
the more trees in the forest, the more robust the model tends to be; a higher number
of trees in a random forest classifier generally gives more stable and accurate
results.
Some advantages of using Random Forest are as follows:
1. The same random forest algorithm can be used for both classification and
regression tasks.
2. The random forest classifier can handle missing values.
3. With a sufficient number of trees in the forest, the random forest classifier is
less prone to over-fitting.
4. The random forest classifier can also handle categorical variables.
Creating Training and Testing Dataset for RF Model
>trainIndex <- createDataPartition(Loan$TARGET,
p=0.7,
list = FALSE,
times = 1)
>Loan.dev <- Loan[trainIndex, ]
>Loan.holdout <- Loan[-trainIndex,]
>dim(Loan.dev)
[1] 14000 40
>dim(Loan.holdout)
[1] 6000 40
>table(Loan.dev$TARGET)
0 1
12272 1728
>table(Loan.holdout$TARGET)
0 1
5216 784
Random Forest Model - Train Dataset
The model is built with TARGET as the dependent variable, using all
independent variables except Customer ID and Account Opening Date.
>rf.train <- Loan.dev
>RF=randomForest(as.factor(TARGET)~., data = rf.train[,-c(1,11)], ntree = 501,
mtry = 3, nodesize = 10, importance=TRUE)
>print(RF)
Call:
randomForest(formula = as.factor(TARGET) ~ ., data = rf.train[, -c(1, 11)],
ntree = 501, mtry = 3, nodesize = 10, importance = TRUE)
Type of random forest: classification
Number of trees: 501
No. of variables tried at each split: 3

OOB estimate of error rate: 7.75%


Confusion matrix:
0 1 class.error
0 12271 1 8.148631e-05
1 1084 644 6.273148e-01

Out of Bag Error Rate:


The Random Forest algorithm is a classifier based primarily on two methods:
bagging and the random subspace method. Suppose we decide to have S trees in
our forest; we first create S datasets of the same size as the original by random
resampling of the data with replacement. Each of these datasets is called a
bootstrap dataset. Because of the with-replacement sampling, every dataset can
contain duplicate records and, at the same time, can be missing several records
from the original dataset. This is called bagging. To grow each tree, the algorithm
uses m (approximately sqrt(M)) randomly chosen features out of the M available
features at each split. This is called the random subspace method. For every record
in the original dataset there is a subset of trees that were built without that record;
the record is said to be out-of-bag (OOB) for those trees. The OOB prediction for a
record is the aggregation of the predictions of all trees for which that record was
out-of-bag, and the out-of-bag estimate of the generalization error is the error rate
of these OOB predictions on the training set (compared with the known responses).
Out-of-bag error is a standard way of measuring the prediction error of random
forests, boosted decision trees and other models that use bootstrap aggregation,
and it helps avoid the need for an independent validation dataset.
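As a small illustrative sketch (not part of the report's code), this is how one bootstrap sample and its out-of-bag records are formed for a single tree; on average about 37% of the records are OOB for any given tree:
>n <- nrow(Loan.dev)                                # training records
>boot.idx <- sample(1:n, size = n, replace = TRUE)  # bootstrap sample ("bagging")
>oob.idx <- setdiff(1:n, boot.idx)                  # records never drawn are out-of-bag
>length(oob.idx) / n                                # roughly 0.368 on average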
The graphical output for the OOB estimate of error rate.
>plot(RF, main="")
>legend("topright", c("OOB", "0", "1"), text.col=1:6, lty=1:3, col=1:3)
>title(main="Error Rates Random Forest Loan Training data")

It is observed that as the number of trees increases, the OOB error rate starts
decreasing till it reaches around 155th tree with OOB = 0.0757 (the minimum
value). After this, the OOB doesn’t decrease further and remains largely steady.
Hence, the optimal number of trees would be 155.
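The same minimum can be read off the fitted forest object directly; err.rate is a standard component of a randomForest classification fit:
>which.min(RF$err.rate[, "OOB"])   # number of trees at which the OOB error is lowest
>min(RF$err.rate[, "OOB"])         # the minimum OOB error itself (about 0.0757, as noted above)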
Variable Importance
To understand the important variables in Random Forest, the following measures
are generally used:
● Mean Decrease in Accuracy is based on permutation: randomly permute the
values of the variable whose importance is to be computed in the OOB sample,
compute the error rate with the permuted values, compute the decrease in OOB
error rate (permuted minus not permuted), and average the decrease over all trees.
● Mean Decrease in Gini is computed as the “total decrease in node impurities from
splitting on the variable, averaged over all trees”.

List the importance of the variable


>impVar <- round(randomForest::importance(RF), 2)
>impVar[order(impVar[,3], decreasing=TRUE),]
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## LEN_OF_RLTN_IN_MNTH 34.95 51.92 50.20 98.80
## AGE_BKT 35.40 47.16 46.81 88.60
## OCCUPATION 38.57 43.34 45.44 66.11
## BALANCE 35.40 44.26 43.18 124.62
## SCR 34.33 43.85 42.03 116.92
## AGE 31.46 39.66 41.08 73.73
## AMT_L_DR 32.93 31.95 39.27 105.82
## NO_OF_L_CR_TXNS 32.00 28.38 37.55 93.59
## HOLDING_PERIOD 29.40 38.55 36.72 98.03
## FLG_HAS_CC 30.26 34.22 34.61 30.22
## TOT_NO_OF_L_TXNS 28.20 22.28 33.42 95.35
## AMT_BR_CSH_WDL_DR 26.38 32.95 32.48 86.82
## AVG_AMT_PER_CSH_WDL_TXN 25.01 33.86 31.92 86.31
## AMT_ATM_DR 23.60 22.21 28.24 82.45
## AVG_AMT_PER_CHQ_TXN 22.65 21.94 27.10 75.33
## NO_OF_BR_CSH_WDL_DR_TXNS 19.35 28.20 24.86 44.06
## FLG_HAS_OLD_LOAN 17.95 24.99 24.86 13.71
## AVG_AMT_PER_ATM_TXN 20.31 23.13 24.75 80.67
## AMT_CHQ_DR 21.10 24.95 24.52 75.79
## NO_OF_L_DR_TXNS 21.75 23.99 24.05 58.82
## GENDER 20.35 21.13 22.56 21.71
## AVG_AMT_PER_NET_TXN 19.03 20.92 22.51 61.53
## AMT_NET_DR 18.38 22.15 21.65 61.92
## NO_OF_CHQ_DR_TXNS 18.40 21.46 20.66 36.37
## FLG_HAS_ANY_CHGS 14.98 21.39 20.40 12.47
## AMT_MOB_DR 15.95 14.57 20.13 39.82
## FLG_HAS_NOMINEE 12.28 19.08 19.20 9.88
## AVG_AMT_PER_MOB_TXN 14.34 15.96 18.22 37.73
## NO_OF_ATM_DR_TXNS 17.11 6.50 17.80 26.76
## NO_OF_IW_CHQ_BNC_TXNS 7.26 17.83 16.71 7.67
## NO_OF_NET_DR_TXNS 14.04 18.44 16.41 21.32
## ACC_TYPE 14.21 13.34 16.30 12.67
## NO_OF_OW_CHQ_BNC_TXNS 8.96 17.20 15.91 8.38
## AMT_MIN_BAL_NMC_CHGS 5.50 9.75 10.03 2.79
## NO_OF_MOB_DR_TXNS 7.75 8.93 8.79 9.70
## AMT_OTH_BK_ATM_USG_CHGS 2.00 3.07 3.69 1.29
## random -0.86 -0.48 -0.99 58.39
Optimal mtry value
In a random forest, the number of variables available for splitting at each tree
node is referred to as the mtry parameter. The optimum number of variables is
obtained using the tuneRF function, whose main arguments are:
x = predictor variables
y = target variable
mtryStart = starting value of mtry
ntreeTry = number of trees used for tuning
stepFactor = factor by which mtry is inflated (or deflated) at each step
improve = the relative improvement in OOB error must be at least this much for the search to continue
trace = whether to print the search trace
plot = whether to plot OOB error vs mtry
doBest = whether to finally build the RF using the optimal mtry
nodesize = minimum terminal node size
importance = whether to compute variable importance
Tuning Random Forest
>tRF<- tuneRF(x = rf.train[,-c(1,2,11)],
y=as.factor(rf.train$TARGET),
mtryStart = 6, # Approx. sqrt of total no. of variables
ntreeTry = 155,
stepFactor = 1.5,
improve = 0.0001,
trace = TRUE,
plot = TRUE,
doBest = TRUE,
nodesize = 10,
importance = TRUE
)
mtry = 6 OOB error = 6.44%
Searching left ...
mtry = 4 OOB error = 7.1%
-0.1032186 1e-04
Searching right ...
mtry = 9 OOB error = 6.14%
0.04661487 1e-04
mtry = 13 OOB error = 5.9%
0.03841676 1e-04
mtry = 19 OOB error = 5.55%
0.05932203 1e-04
mtry = 28 OOB error = 5.54%
0.002574003 1e-04
mtry = 37 OOB error = 5.36%
0.03225806 1e-04

As can be seen from the trace, the lowest OOB error is obtained at mtry = 37, which is the value used for the tuned forest.

Scoring
>rf.train$predict.class <- predict(tRF, rf.train, type = "class")
>rf.train$predict.score <- predict(tRF, rf.train, type = "prob")
Deciling code
>decile <- function (x)
{
deciles <- vector(length=10)
for (i in seq(0.1,1,.1)){
deciles[i*10] <- quantile(x, i, na.rm=T)
}
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}
>rf.train$deciles <- decile(rf.train$predict.score[,2])
Rank order table
>tmp_DT = data.table(rf.train)
>rank <- tmp_DT[, list(
cnt = length(TARGET),
cnt_resp = sum(TARGET==1),
cnt_non_resp = sum(TARGET == 0)) ,
by=deciles][order(-deciles)]
>rank$rrate <- round (rank$cnt_resp / rank$cnt,2);
>rank$cum_resp <- cumsum(rank$cnt_resp)
>rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
>rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),2);
>rank$cum_rel_non_resp <- round(rank$cum_non_resp /
sum(rank$cnt_non_resp),2);
>rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp); library (scales)
>rank$rrate <- percent(rank$rrate)
>rank$cum_rel_resp <- percent(rank$cum_rel_resp)
>rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
>rank
deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp cum_rel_resp
1: 10 1404 1404 0 100% 1404 0 81.0%
2: 9 1435 324 1111 23% 1728 1111 100.0%
3: 8 1431 0 1431 0% 1728 2542 100.0%
4: 7 1402 0 1402 0% 1728 3944 100.0%
5: 6 1446 0 1446 0% 1728 5390 100.0%
6: 5 1522 0 1522 0% 1728 6912 100.0%
7: 4 1493 0 1493 0% 1728 8405 100.0%
8: 3 1495 0 1495 0% 1728 9900 100.0%
9: 2 1327 0 1327 0% 1728 11227 100.0%
10: 1 1045 0 1045 0% 1728 12272 100.0%
cum_rel_non_resp ks
1: 0% 0.81
2: 9% 0.91
3: 21% 0.79
4: 32% 0.68
5: 44% 0.56
6: 56% 0.44
7: 68% 0.32
8: 81% 0.19
9: 91% 0.09
10: 100% 0.00

Baseline Response Rate


>sum(rf.train$TARGET) / nrow(rf.train)
[1] 0.1234286
Interpretation:
● The baseline Response Rate is 12.34%, whereas the response rate in top
two deciles is 100% and 23% respectively.
● With the top 2 deciles, the KS is 91%, indicating a good model.

KS and Area under Curve


>pred <- prediction(rf.train$predict.score[,2], rf.train$TARGET)
>perf <- performance(pred, "tpr", "fpr")

>KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])


>KS
[1] 0.9975471
>auc <- performance(pred,"auc");
>auc <- as.numeric(auc@y.values)
>auc
[1] 0.9999683
>gini = ineq(rf.train$predict.score[,2], type="Gini")
>gini
[1] 0.7485589

Classification Error
> with(rf.train, table(TARGET, predict.class))
predict.class
TARGET 0 1
0 12270 2
1 171 1557
KS=99.75% AUC=99.99% Gini=74.85% Accuracy=98.76% CeR=1.24%

Model Performance on Testing Data Set


>rf.test <- Loan.holdout
>rf.test$predict.class <- predict(tRF, rf.test, type="class")
>rf.test$predict.score <- predict(tRF, rf.test, type="prob")
>rf.test$deciles <- decile(rf.test$predict.score[,2])

>tmp_DT = data.table(rf.test)
>h_rank <- tmp_DT[, list(
cnt = length(TARGET),
cnt_resp = sum(TARGET),
cnt_non_resp = sum(TARGET == 0)) ,
by=deciles][order(-deciles)]
>h_rank$rrate <- round (h_rank$cnt_resp / h_rank$cnt,2);
>h_rank$cum_resp <- cumsum(h_rank$cnt_resp)
>h_rank$cum_non_resp <- cumsum(h_rank$cnt_non_resp)
>h_rank$cum_rel_resp <- round(h_rank$cum_resp / sum(h_rank$cnt_resp),2);
>h_rank$cum_rel_non_resp <- round(h_rank$cum_non_resp /
sum(h_rank$cnt_non_resp),2);
>h_rank$ks <- abs(h_rank$cum_rel_resp - h_rank$cum_rel_non_resp);
>h_rank$rrate <- percent(h_rank$rrate)
>h_rank$cum_rel_resp <- percent(h_rank$cum_rel_resp)
>h_rank$cum_rel_non_resp <- percent(h_rank$cum_rel_non_resp)
>h_rank
deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
1: 10 602 542 60 90% 542 60
2: 9 607 139 468 23% 681 528
3: 8 591 51 540 9% 732 1068
4: 7 627 27 600 4% 759 1668
5: 6 592 10 582 2% 769 2250
6: 5 592 12 580 2% 781 2830
7: 4 657 1 656 0% 782 3486
8: 3 647 0 647 0% 782 4133
9: 2 616 0 616 0% 782 4749
10: 1 469 2 467 0% 784 5216
cum_rel_resp cum_rel_non_resp ks
1: 69% 1% 0.68
2: 87% 10% 0.77
3: 93% 20% 0.73
4: 97% 32% 0.65
5: 98% 43% 0.55
6: 100% 54% 0.46
7: 100% 67% 0.33
8: 100% 79% 0.21
9: 100% 91% 0.09
10: 100% 100% 0.00
Interpretation:
● The baseline Response Rate is 12.34%, whereas the response rate in top
three deciles is 90%, 23%, 9% respectively.
● With the top 2 deciles, the KS is 77%, indicating a good model.
KS and AUC
>pred1 <- prediction(rf.test$predict.score[,2], rf.test$TARGET)
>perf1 <- performance(pred1, "tpr", "fpr")
>plot(perf1)
>KS1 <- max(attr(perf1, 'y.values')[[1]]-attr(perf1, 'x.values')[[1]])
>KS1
[1] 0.774446
Area Under Curve
>auc1 <- performance(pred1,"auc");
>auc1 <- as.numeric(auc1@y.values)
>auc1
[1] 0.9584023
Gini Coefficient
>gini1 = ineq(rf.test$predict.score[,2], type="Gini")
>gini1
[1] 0.6463728
Classification Error
>with(rf.test, table(TARGET, predict.class))
predict.class
TARGET 0 1
0 5211 5
1 343 441
KS = 77.44% AUC = 95.84% Gini = 64.63% Accuracy = 94.2% CeR = 5.8%

Random Forest Conclusion


Measures Train Test %Deviation
KS 99.75% 77.44% 22%
AUC 99.99% 95.84% 04%
Gini 74.85% 64.63% 14%
Accuracy 98.76% 94.20% 05%
CeR 01.24% 05.80% -368%

3.4. Building Artificial Neural Network


Artificial neural networks (ANNs) are statistical models directly inspired by, and
partially modeled on, biological neural networks. They are capable of modeling
and processing nonlinear relationships between inputs and outputs in parallel.
Artificial neural networks are characterized by adaptive weights along the paths
between neurons, which are tuned by a learning algorithm that learns from
observed data in order to improve the model. In addition to the learning algorithm
itself, one must choose an appropriate cost function.
The cost function is what’s used to learn the optimal solution to the problem
being solved. This involves determining the best values for all of the tuneable
model parameters, with neuron path adaptive weights being the primary target,
along with algorithm tuning parameters such as the learning rate. It’s usually
done through optimization techniques such as gradient descent or stochastic
gradient descent.
These optimization techniques basically try to make the ANN solution be as close
as possible to the optimal solution, which when successful means that the ANN is
able to solve the intended problem with high performance.
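As a purely illustrative sketch (toy data, not the report's model), the snippet below trains a single sigmoid neuron by gradient descent on a sum-of-squared-errors cost, the same err.fct = "sse" that is passed to neuralnet() later in this section:
>sigmoid <- function(z) 1 / (1 + exp(-z))
>set.seed(1)                                  # toy, reproducible example
>x <- matrix(rnorm(200), ncol = 2)            # two toy input variables
>y <- as.numeric(x[, 1] + x[, 2] > 0)         # toy binary target
>w <- c(0, 0); b <- 0; lr <- 0.1              # weights, bias, learning rate
>for (i in 1:500) {
p <- sigmoid(x %*% w + b)                     # forward pass
grad <- (p - y) * p * (1 - p)                 # derivative of SSE w.r.t. the net input
w <- w - lr * t(x) %*% grad / nrow(x)         # gradient descent step on the weights
b <- b - lr * mean(grad)                      # gradient descent step on the bias
}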
>NNInput = read.table("Loan.csv", sep = ",", header = T)
>attach(NNInput)
>dim(NNInput)
[1] 20000 40

Create Dummy Variables for Categorical Variables


# Gender
>GEN.matrix <- model.matrix(~ GENDER - 1, data = NNInput)
>NNInput <- data.frame(NNInput, GEN.matrix)
# Occupation
>occ.matrix <- model.matrix(~ OCCUPATION - 1, data = NNInput)
>NNInput <- data.frame(NNInput, occ.matrix)
# AGE_BKT
>AGEBKT.matrix <- model.matrix(~ AGE_BKT - 1, data = NNInput)
>NNInput <- data.frame(NNInput, AGEBKT.matrix)
# ACC_TYPE
>ACCTYP.matrix <- model.matrix(~ ACC_TYPE - 1, data = NNInput)
>NNInput <- data.frame(NNInput, ACCTYP.matrix)
>names(NNInput)
[1] "CUST_ID" "TARGET" "AGE"
[4] "GENDER" "BALANCE" "OCCUPATION"
[7] "AGE_BKT" "SCR" "HOLDING_PERIOD"
[10] "ACC_TYPE" "ACC_OP_DATE"
"LEN_OF_RLTN_IN_MNTH"
[13] "NO_OF_L_CR_TXNS" "NO_OF_L_DR_TXNS"
"TOT_NO_OF_L_TXNS"
[16] "NO_OF_BR_CSH_WDL_DR_TXNS" "NO_OF_ATM_DR_TXNS"
"NO_OF_NET_DR_TXNS"
[19] "NO_OF_MOB_DR_TXNS" "NO_OF_CHQ_DR_TXNS"
"FLG_HAS_CC"
[22] "AMT_ATM_DR" "AMT_BR_CSH_WDL_DR" "AMT_CHQ_DR"
[25] "AMT_NET_DR" "AMT_MOB_DR" "AMT_L_DR"
[28] "FLG_HAS_ANY_CHGS" "AMT_OTH_BK_ATM_USG_CHGS"
"AMT_MIN_BAL_NMC_CHGS"
[31] "NO_OF_IW_CHQ_BNC_TXNS" "NO_OF_OW_CHQ_BNC_TXNS"
"AVG_AMT_PER_ATM_TXN"
[34] "AVG_AMT_PER_CSH_WDL_TXN" "AVG_AMT_PER_CHQ_TXN"
"AVG_AMT_PER_NET_TXN"
[37] "AVG_AMT_PER_MOB_TXN" "FLG_HAS_NOMINEE"
"FLG_HAS_OLD_LOAN"
[40] "random" "GENDERF" "GENDERM"
[43] "GENDERO" "OCCUPATIONPROF" "OCCUPATIONSAL"
[46] "OCCUPATIONSELF.EMP" "OCCUPATIONSENP"
"AGE_BKT.25"
[49] "AGE_BKT.50" "AGE_BKT26.30" "AGE_BKT31.35"
[52] "AGE_BKT36.40" "AGE_BKT41.45" "AGE_BKT46.50"
[55] "ACC_TYPECA" "ACC_TYPESA"

>dim(NNInput)
[1] 20000 56
Creating Training and Test Datasets
>trainIndex <- createDataPartition(TARGET,
p = .7,
list = FALSE,
times = 1)
>NN.train.data <- NNInput[trainIndex,]
>NN.test.data <- NNInput[-trainIndex,]
>dim(NN.train.data)
[1] 14000 56
>dim(NN.test.data)
[1] 6000 56
Scaling the Dataset and Variables
>x <- subset(NN.train.data,
select = c("AGE",
"BALANCE",
"SCR",
"HOLDING_PERIOD",
"LEN_OF_RLTN_IN_MNTH",
"NO_OF_L_CR_TXNS",
"NO_OF_L_DR_TXNS",
"TOT_NO_OF_L_TXNS",
"NO_OF_BR_CSH_WDL_DR_TXNS",
"NO_OF_ATM_DR_TXNS",
"NO_OF_NET_DR_TXNS",
"NO_OF_MOB_DR_TXNS",
"NO_OF_CHQ_DR_TXNS",
"FLG_HAS_CC",
"AMT_ATM_DR",
"AMT_BR_CSH_WDL_DR",
"AMT_CHQ_DR",
"AMT_NET_DR",
"AMT_MOB_DR",
"AMT_L_DR",
"FLG_HAS_ANY_CHGS",
"AMT_OTH_BK_ATM_USG_CHGS",
"AMT_MIN_BAL_NMC_CHGS",
"NO_OF_IW_CHQ_BNC_TXNS",
"NO_OF_OW_CHQ_BNC_TXNS",
"AVG_AMT_PER_ATM_TXN",
"AVG_AMT_PER_CSH_WDL_TXN",
"AVG_AMT_PER_CHQ_TXN",
"AVG_AMT_PER_NET_TXN",
"AVG_AMT_PER_MOB_TXN",
"FLG_HAS_NOMINEE",
"FLG_HAS_OLD_LOAN",
"random",
"GENDERF",
"GENDERM",
"GENDERO",
"OCCUPATIONPROF",
"OCCUPATIONSAL",
"OCCUPATIONSELF.EMP",
"OCCUPATIONSENP",
"AGE_BKT.25",
"AGE_BKT.50",
"AGE_BKT26.30",
"AGE_BKT31.35",
"AGE_BKT36.40",
"AGE_BKT41.45",
"AGE_BKT46.50",
"ACC_TYPECA",
"ACC_TYPESA" )
)
>nn.devscaled <- scale(x)
>nn.devscaled <- cbind(NN.train.data[2], nn.devscaled)
Building the NN Model
>nn2 <- neuralnet(formula = TARGET ~
AGE +
BALANCE +
SCR +
HOLDING_PERIOD +
LEN_OF_RLTN_IN_MNTH +
NO_OF_L_CR_TXNS +
NO_OF_L_DR_TXNS +
TOT_NO_OF_L_TXNS +
NO_OF_BR_CSH_WDL_DR_TXNS +
NO_OF_ATM_DR_TXNS +
NO_OF_NET_DR_TXNS +
NO_OF_MOB_DR_TXNS +
NO_OF_CHQ_DR_TXNS +
FLG_HAS_CC +
AMT_ATM_DR +
AMT_BR_CSH_WDL_DR +
AMT_CHQ_DR +
AMT_NET_DR +
AMT_MOB_DR +
AMT_L_DR +
FLG_HAS_ANY_CHGS +
AMT_OTH_BK_ATM_USG_CHGS +
AMT_MIN_BAL_NMC_CHGS +
NO_OF_IW_CHQ_BNC_TXNS +
NO_OF_OW_CHQ_BNC_TXNS +
AVG_AMT_PER_ATM_TXN +
AVG_AMT_PER_CSH_WDL_TXN +
AVG_AMT_PER_CHQ_TXN +
AVG_AMT_PER_NET_TXN +
AVG_AMT_PER_MOB_TXN +
FLG_HAS_NOMINEE +
FLG_HAS_OLD_LOAN +
random +
GENDERF +
GENDERM +
GENDERO +
OCCUPATIONPROF +
OCCUPATIONSAL +
OCCUPATIONSELF.EMP +
OCCUPATIONSENP +
AGE_BKT.25 +
AGE_BKT.50 +
AGE_BKT26.30 +
AGE_BKT31.35 +
AGE_BKT36.40 +
AGE_BKT41.45 +
AGE_BKT46.50 +
ACC_TYPECA +
ACC_TYPESA ,
data = nn.devscaled,
hidden = 3,
err.fct = "sse",
linear.output = FALSE,
lifesign = "full",
lifesign.step = 10,
threshold = 0.1,
stepmax = 2000
)

hidden: 3 thresh: 0.1 rep: 1/1 steps: 10 min thresh: 3.37952280114674
20 min thresh: 1.89140910238266
30 min thresh: 0.862865375968638
40 min thresh: 0.862865375968638
50 min thresh: 0.862865375968638
60 min thresh: 0.862865375968638
70 min thresh: 0.862865375968638
80 min thresh: 0.862865375968638
90 min thresh: 0.862865375968638
100 min thresh: 0.862865375968638
110 min thresh: 0.862865375968638
120 min thresh: 0.608434037001895
130 min thresh: 0.480596576949426
140 min thresh: 0.480596576949426
150 min thresh: 0.39689809880363
160 min thresh: 0.39689809880363
170 min thresh: 0.39689809880363
180 min thresh: 0.379798690155089
190 min thresh: 0.379798690155089
200 min thresh: 0.379798690155089
210 min thresh: 0.379798690155089
220 min thresh: 0.379798690155089
230 min thresh: 0.368412793351693
240 min thresh: 0.368412793351693
250 min thresh: 0.368412793351693
260 min thresh: 0.30602933263663
270 min thresh: 0.30602933263663
280 min thresh: 0.288789699213087
290 min thresh: 0.288789699213087
300 min thresh: 0.288789699213087
310 min thresh: 0.288789699213087
320 min thresh: 0.288789699213087
330 min thresh: 0.288789699213087
340 min thresh: 0.288789699213087
350 min thresh: 0.288789699213087
360 min thresh: 0.282516159624653
370 min thresh: 0.282516159624653
380 min thresh: 0.246596879777488
390 min thresh: 0.220709600883585
400 min thresh: 0.220709600883585
410 min thresh: 0.220709600883585
420 min thresh: 0.220709600883585
430 min thresh: 0.220709600883585
440 min thresh: 0.220709600883585
450 min thresh: 0.185235140113238
460 min thresh: 0.185235140113238
470 min thresh: 0.185235140113238
480 min thresh: 0.185235140113238
490 min thresh: 0.185235140113238
500 min thresh: 0.185235140113238
510 min thresh: 0.185235140113238
520 min thresh: 0.153225062178873
530 min thresh: 0.153225062178873
540 min thresh: 0.153225062178873
550 min thresh: 0.153225062178873
560 min thresh: 0.153225062178873
570 min thresh: 0.153225062178873
580 min thresh: 0.153225062178873
590 min thresh: 0.153225062178873
600 min thresh: 0.153225062178873
610 min thresh: 0.153225062178873
620 min thresh: 0.153225062178873
630 min thresh: 0.153225062178873
640 min thresh: 0.153225062178873
650 min thresh: 0.153225062178873
660 min thresh: 0.153225062178873
670 min thresh: 0.153225062178873
680 min thresh: 0.153225062178873
690 min thresh: 0.153225062178873
700 min thresh: 0.153225062178873
710 min thresh: 0.153225062178873
720 min thresh: 0.153225062178873
730 min thresh: 0.153225062178873
740 min thresh: 0.153225062178873
750 min thresh: 0.153225062178873
760 min thresh: 0.153225062178873
770 min thresh: 0.153225062178873
780 min thresh: 0.153225062178873
790 min thresh: 0.153225062178873
800 min thresh: 0.153225062178873
810 min thresh: 0.135546561018561
820 min thresh: 0.135546561018561
830 min thresh: 0.135546561018561
840 min thresh: 0.135546561018561
850 min thresh: 0.135546561018561
860 min thresh: 0.135546561018561
870 min thresh: 0.135546561018561
880 min thresh: 0.133368085051786
890 min thresh: 0.133368085051786
900 min thresh: 0.120131426392604
910 min thresh: 0.120131426392604
920 min thresh: 0.117968622426568
930 min thresh: 0.117968622426568
940 min thresh: 0.117968622426568
950 min thresh: 0.117968622426568
960 min thresh: 0.117968622426568
970 min thresh: 0.101892921480415
980 min thresh: 0.101892921480415
990 min thresh: 0.101892921480415
1000 min thresh: 0.101892921480415
1010 min thresh: 0.101892921480415
1020 min thresh: 0.101892921480415
1030 min thresh: 0.101892921480415
1034 error: 649.5383 time: 10.66 secs

The Artificial Neural Network - Graphical Representation

>plot (nn2)
Measuring Model Performance
Prediction on Train Data
>NN.train.data$Prob = nn2$net.result[[1]]
Probabilities in Training Dataset
>quantile(NN.train.data$Prob, c(0,1,5,10,25,50,75,90,95,98,99,100)/100)
0% 1% 5% 10% 25%
0.005340550 0.005340550 0.005341659 0.005787379 0.031114205
50% 75%
0.036584905 0.180627476
90% 95% 98% 99% 100%
0.188810014 0.513345635 0.581857453 0.581921405 0.581921674

>hist(NN.train.data$Prob)

Performance Measures on Training Data Set


The following model performance measures will be calculated on the training set
to gauge the goodness of the model:
● Rank Ordering
● KS
● Area Under Curve (AUC)
● Gini Coefficient
● Classification Error
Deciling code
>decile <- function (x)
{
deciles <- vector(length=10)
for (i in seq(0.1,1,.1)){
deciles[i*10] <- quantile(x, i, na.rm=T)
}
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}
>NN.train.data$deciles <- decile(NN.train.data$Prob)

Ranking code

>tmp_DT = data.table(NN.train.data)
>rank <- tmp_DT[, list(
cnt = length(TARGET),
cnt_resp = sum(TARGET==1),
cnt_non_resp = sum(TARGET == 0)) ,
by=deciles][order(-deciles)]
>rank$rrate <- round (rank$cnt_resp / rank$cnt,2);
>rank$cum_resp <- cumsum(rank$cnt_resp)
>rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
>rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),2);
>rank$cum_rel_non_resp <- round(rank$cum_non_resp /
sum(rank$cnt_non_resp),2);
>rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp)
>rank$rrate <- percent(rank$rrate)
>rank$cum_rel_resp <- percent(rank$cum_rel_resp)
>rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
>rank
deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp cum_rel_resp
1: 10 1400 549 851 39.0% 549 851 32.0%
2: 9 1400 251 1149 18.0% 800 2000 46.0%
3: 8 1400 251 1149 18.0% 1051 3149 61.0%
4: 7 1400 228 1172 16.0% 1279 4321 74.0%
5: 6 1400 62 1338 4.0% 1341 5659 78.0%
6: 5 1400 74 1326 5.0% 1415 6985 82.0%
7: 4 1400 72 1328 5.0% 1487 8313 86.0%
8: 3 1400 96 1304 7.0% 1583 9617 92.0%
9: 2 1400 85 1315 6.0% 1668 10932 97.0%
10: 1 1400 60 1340 4.0% 1728 12272 100.0%
cum_rel_non_resp ks
1: 7.0% 0.25
2: 16.0% 0.30
3: 26.0% 0.35
4: 35.0% 0.39
5: 46.0% 0.32
6: 57.0% 0.25
7: 68.0% 0.18
8: 78.0% 0.14
9: 89.0% 0.08
10: 100.0% 0.00

Interpretation:
● The baseline Response Rate is 12.34%, whereas the response rate in the top
three deciles is 39%, 18% and 18% respectively.
● With the top 4 deciles, the KS is 39%, which is close to the level expected of a
good model.
Assigning 0 / 1 class based on certain threshold
>NN.train.data$Class = ifelse(NN.train.data$Prob>0.5,1,0)
>with( NN.train.data, table(TARGET, as.factor(Class) ))

TARGET 0 1
0 11981 291
1 1296 432
Error Computation
>sum((NN.train.data$TARGET - NN.train.data$Prob)^2)/2
[1] 649.5382956
Other Model Performance Measures
>pred3 <- prediction(NN.train.data$Prob, NN.train.data$TARGET)
>perf3 <- performance(pred3, "tpr", "fpr")
>plot(perf3)
>KS3 <- max(attr(perf3, 'y.values')[[1]]-attr(perf3, 'x.values')[[1]])
>KS3
[1] 0.3913591313
>auc3 <- performance(pred3,"auc");
>auc3 <- as.numeric(auc3@y.values)
>auc3
[1] 0.7290062405
>gini3 = ineq(NN.train.data$Prob, type="Gini")
>auc3
[1] 0.7290062405
>KS3
[1] 0.3913591313
>gini3
[1] 0.5473745079
Summary: Model Performance Measures (Training Dataset)
Measure | Value
———-|———
KS | 39.13%
AUC | 72.90%
Gini | 54.73%
Accuracy | 88.66%
CeR | 11.00%
The Gini Coefficient is the ratio of the area between the line of perfect equality
and the observed Lorenz curve to the area between the line of perfect equality
and the line of perfect inequality.
The higher the coefficient, the more unequal the distribution is. The Gini coefficient
can also be derived directly from the ROC AUC.
A Gini above 60% indicates a good model.
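For reference, the conventional ROC-based relation is Gini = 2 * AUC - 1; note that the gini values reported in this study come from ineq() on the predicted score distribution, which is a different (Lorenz-based) quantity:
>2 * auc3 - 1   # ROC-based Gini for the training model, using auc3 computed above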
The lower the classification error rate, the higher the model accuracy, and the better
the model. The classification error rate could be reduced if more independent
variables were available for modeling.
The model is observed to perform at par with expectations on the majority of the
model performance measures, indicating it to be a good model, with scope for
improvement.

Scoring Test dataset using the Neural Net Model Object


>x <- subset(NN.test.data,
select = c("AGE",
"BALANCE",
"SCR",
"HOLDING_PERIOD",
"LEN_OF_RLTN_IN_MNTH",
"NO_OF_L_CR_TXNS",
"NO_OF_L_DR_TXNS",
"TOT_NO_OF_L_TXNS",
"NO_OF_BR_CSH_WDL_DR_TXNS",
"NO_OF_ATM_DR_TXNS",
"NO_OF_NET_DR_TXNS",
"NO_OF_MOB_DR_TXNS",
"NO_OF_CHQ_DR_TXNS",
"FLG_HAS_CC",
"AMT_ATM_DR",
"AMT_BR_CSH_WDL_DR",
"AMT_CHQ_DR",
"AMT_NET_DR",
"AMT_MOB_DR",
"AMT_L_DR",
"FLG_HAS_ANY_CHGS",
"AMT_OTH_BK_ATM_USG_CHGS",
"AMT_MIN_BAL_NMC_CHGS",
"NO_OF_IW_CHQ_BNC_TXNS",
"NO_OF_OW_CHQ_BNC_TXNS",
"AVG_AMT_PER_ATM_TXN",
"AVG_AMT_PER_CSH_WDL_TXN",
"AVG_AMT_PER_CHQ_TXN",
"AVG_AMT_PER_NET_TXN",
"AVG_AMT_PER_MOB_TXN",
"FLG_HAS_NOMINEE",
"FLG_HAS_OLD_LOAN",
"random",
"GENDERF",
"GENDERM",
"GENDERO",
"OCCUPATIONPROF",
"OCCUPATIONSAL",
"OCCUPATIONSELF.EMP",
"OCCUPATIONSENP",
"AGE_BKT.25",
"AGE_BKT.50",
"AGE_BKT26.30",
"AGE_BKT31.35",
"AGE_BKT36.40",
"AGE_BKT41.45",
"AGE_BKT46.50",
"ACC_TYPECA",
"ACC_TYPESA" )
)
>x.scaled <- scale(x)
Scoring the Test Dataset
To score we will use the compute function
>compute.output <- compute(nn2, x.scaled)
>NN.test.data$Predict.score <- compute.output$net.result
Estimated Probabilities
The distribution of the estimated probabilities
>quantile(NN.test.data$Predict.score, c(0,1,5,10,25,50,75,90,95,98,99,100)/100)
0% 1% 5% 10%
0.005340550102 0.005340550137 0.005341313796 0.005684425855
25% 50% 75% 90%
0.031114934658 0.036585067251 0.183669290563 0.188810016694
95% 98% 99% 100%
0.494376372589 0.581827903967 0.581920926121 0.581921674484

> hist(NN.test.data$Predict.score)
Deciling
>NN.test.data$deciles <- decile(NN.test.data$Predict.score)
Rank ordering
>tmp_DT = data.table(NN.test.data)
>h_rank <- tmp_DT[, list(
cnt = length(TARGET),
cnt_resp = sum(TARGET),
cnt_non_resp = sum(TARGET == 0)) ,
by=deciles][order(-deciles)]
>h_rank$rrate <- round (h_rank$cnt_resp / h_rank$cnt,2);
>h_rank$cum_resp <- cumsum(h_rank$cnt_resp)
>h_rank$cum_non_resp <- cumsum(h_rank$cnt_non_resp)
>h_rank$cum_rel_resp <- round(h_rank$cum_resp / sum(h_rank$cnt_resp),2);
>h_rank$cum_rel_non_resp <- round(h_rank$cum_non_resp /
sum(h_rank$cnt_non_resp),2);
>h_rank$ks <- abs(h_rank$cum_rel_resp - h_rank$cum_rel_non_resp);
>h_rank$rrate <- percent(h_rank$rrate)
>h_rank$cum_rel_resp <- percent(h_rank$cum_rel_resp)
>h_rank$cum_rel_non_resp <- percent(h_rank$cum_rel_non_resp)
>h_rank

deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp


1: 10 600 226 374 38.0% 226 374
2: 9 600 118 482 20.0% 344 856
3: 8 600 103 497 17.0% 447 1353
4: 7 600 103 497 17.0% 550 1850
5: 6 600 53 547 9.0% 603 2397
6: 5 600 48 552 8.0% 651 2949
7: 4 600 42 558 7.0% 693 3507
8: 3 600 33 567 6.0% 726 4074
9: 2 600 35 565 6.0% 761 4639
10: 1 600 23 577 4.0% 784 5216
cum_rel_resp cum_rel_non_resp ks
1: 29.0% 7.0% 0.22
2: 44.0% 16.0% 0.28
3: 57.0% 26.0% 0.31
4: 70.0% 35.0% 0.35
5: 77.0% 46.0% 0.31
6: 83.0% 57.0% 0.26
7: 88.0% 67.0% 0.21
8: 93.0% 78.0% 0.15
9: 97.0% 89.0% 0.08
10: 100.0% 100.0% 0.00
Interpretation:
The baseline Response Rate is 12.34%, whereas the response rate in the top three
deciles is 38%, 20% and 17% respectively.
With the top 4 deciles, the KS is 35%, indicating scope for improvement.
Assigning the Probabilities to the Test Dataset
>NN.test.data$Prob = compute.output$net.result
Assigning 0/1 class based on certain threshold
>NN.test.data$Class = ifelse(NN.test.data$Prob>0.5,1,0)
>with( NN.test.data, table(TARGET, as.factor(Class) ))

TARGET 0 1
0 5065 151
1 639 145
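For reference, the Accuracy and CeR figures quoted in the summary below can be reproduced from this confusion matrix; a minimal sketch (cm, accuracy and cer are illustrative names, not part of the original script):
>cm <- with(NN.test.data, table(TARGET, as.factor(Class)))
>accuracy <- sum(diag(cm)) / sum(cm) # correctly classified 0s and 1s over all test records
>cer <- 1 - accuracy # Classification error Rate
>accuracy
>cer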
Error Computation
>sum((NN.test.data$TARGET - NN.test.data$Prob)^2)/2
[1] 308.6009414
Other Model Performance Measures
>pred4 <- prediction(NN.test.data$Prob, NN.test.data$TARGET)
>perf4 <- performance(pred4, "tpr", "fpr")
>plot(perf4)
>KS4 <- max(attr(perf4, 'y.values')[[1]]-attr(perf4, 'x.values')[[1]])
>KS4
[1] 0.3545409728
>auc4 <- performance(pred4,"auc");
>auc4 <- as.numeric(auc4@y.values)
>auc4
[1] 0.7213823293
>gini4 = ineq(NN.test.data$Prob, type="Gini")
>gini4
[1] 0.5462225246

Summary: Model Performance Measures (Test Dataset)


Measure | Value
KS | 35.45%
AUC | 72.13%
Gini | 54.62%
Accuracy | 86.83%
CeR | 13.16%
The model was observed to perform at a “normal” level on the majority of the model
performance measures, indicating a good model with scope for improvement.

Artificial Neural Network - Conclusion


Comparative Summary of the Neural Network Model on Training and Testing
Dataset is as follows:
Measures Train Test %Deviation
KS 39.00% 35.45% 9.10%
AUC 73.00% 72.13% 1.19%
Gini 54.70% 54.62% 0.14%
Accuracy 88.66% 86.83% 2.06%
CeR 11.00% 13.16% -19.6%
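The %Deviation column above appears to be computed as (Train - Test) / Train; a minimal sketch reproducing it with the values from the table (nn_perf is an illustrative object name):
>nn_perf <- data.frame(measure = c("KS", "AUC", "Gini", "Accuracy", "CeR"),
                       train   = c(39.00, 73.00, 54.70, 88.66, 11.00),
                       test    = c(35.45, 72.13, 54.62, 86.83, 13.16))
>nn_perf$deviation_pct <- round((nn_perf$train - nn_perf$test) / nn_perf$train * 100, 2)
>nn_perf$within_tolerance <- abs(nn_perf$deviation_pct) <= 10 # the +/- 10% tolerance used below
>nn_perf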
It can be observed that most of the model performance values for the Training and
Testing sets are within the maximum tolerance deviation of +/- 10%. Hence, the
model is not over-fitting.
At an overall level, the model performs at a “normal” level on the various model
performance measures, such as a KS value close to 40 and a Gini coefficient above
50. This performance indicates good predictive capability of the developed Neural
Network model.

3.5. Comparison between the models


The main objective of the project was to develop a predictive model, using Machine
Learning techniques, to predict whether MyBank customers will respond positively to
a promotion or an offer. In this context, the key parameter for model evaluation was
‘Accuracy’, i.e., the proportion of the total number of predictions that were correct
(the % of customers whose response was correctly predicted).

The predictive models were developed using the following Machine Learning
techniques:
1. Classification Tree - CART
2. Random Forest
3. Neural Network

A snapshot of the performance of all the models on accuracy, over-fitting and
other model performance measures is provided below:
CART
Measures Train Test %Deviation
KS 72.18 47.40 35%
AUC 91.46 79.23 13%
Gini 41.46 60.42 -46%
Accuracy 86.9 78.35 10%
CeR 13.91 21.65 -56%

Random Forest


Measures Train Test %Deviation
KS 99.75% 77.44% 22%
AUC 99.99% 95.84% 04%
Gini 74.85% 64.63% 14%
Accuracy 98.76% 94.20% 05%
CeR 01.24% 05.80% -368%

Artificial Neural Network


Measures Train Test %Deviation
KS 39.00% 35.45% 9.10%
AUC 73.00% 72.13% 1.19%
Gini 54.70% 54.62% 0.14%
Accuracy 88.66% 86.83% 2.06%
CeR 11.00% 13.16% -19.6%

Interpretation:
● The CART method gives poor performance compared to Random Forest and
ANN. Looking at the percentage deviation between the Training and Testing
datasets, the model appears to be over-fit.
● The Random Forest method has the best performance (highest accuracy)
among the three models. The percentage deviation between the Training and
Testing datasets is also reasonably under control, suggesting a robust model.
● The Neural Network performs worse than Random Forest but better than
CART; however, its percentage deviation between the Training and Testing
datasets is the smallest of the three models.

Random Forest is therefore the overall winner, with the best accuracy and
reasonable deviations between the Training and Testing datasets.
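As a compact recap, a minimal sketch ranking the three models by test-set accuracy, with values taken from the tables above (comparison is an illustrative object name):
>comparison <- data.frame(model = c("CART", "Random Forest", "Neural Network"),
                          test_accuracy = c(78.35, 94.20, 86.83),
                          accuracy_deviation_pct = c(10, 5, 2.06))
>comparison[order(-comparison$test_accuracy), ]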

4. Appendix A – Source Code


4.1. The source code for libraries has been written below
>install.packages("caret")
>install.packages("rpart")
>install.packages("rpart.plot")
>install.packages("randomForest")
>install.packages("rattle")
>install.packages("RColorBrewer")
>install.packages("data.table")
>install.packages("scales")
>install.packages("neuralnet")
>library(caret)
>library(rpart)
>library(rpart.plot)
>library(randomForest)
>library(rattle)
>library(RColorBrewer)
>library(data.table)
>library(scales)
>library(neuralnet)

4.2. The source code for importing dataset has been written below
>Loan <- read.csv("Loan.csv", header=TRUE)

4.3. The source code for Exploratory Data Analysis and Descriptive Statistics
has been written below
>head(Loan)
# Find out Class of each Feature, along with internal structure
>str(Loan)
# Find out Total Number of Rows and Columns
>dim(Loan)
# Find out Names of the Columns (Features)
>names(Loan)
>summary(Loan)

4.4. The source code for Creating Training and Testing Dataset has been
written below
>trainIndex <- createDataPartition(Loan$TARGET,
p=0.7,
list = FALSE,
times = 1)
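# (Note: createDataPartition builds a 70% training index; for a binary TARGET
# it keeps the proportion of responders roughly the same in both splits.)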
>Loan.dev <- Loan[trainIndex, ]
>Loan.holdout <- Loan[-trainIndex,]
>dim(Loan.dev)
>dim(Loan.holdout)
>table(Loan.dev$TARGET)
>table(Loan.holdout$TARGET)
4.5. The source code for CART technique has been written below
#Build the model on Training Dataset (Unbalanced)
>cart.train <- Loan.dev
#Setting the control parameter inputs for rpart
>r1.ctrl = rpart.control(minsplit=100, minbucket = 10, cp = 0, xval = 10)
>m2 <- rpart(formula = cart.train$TARGET ~ .,
data = cart.train[,-c(1,11)], method = "class",
control = r1.ctrl)
>m2
>fancyRpartPlot(m2)
>printcp(m2)
>plotcp(m2)

# Pruning Code
>ptree1<- prune(m2, cp= 0.0017 ,"CP")
>printcp(ptree1)
>fancyRpartPlot(ptree1, uniform=TRUE, main="Pruned Classification Tree")

#Predict Training Data Set


>cart.train$predict.class = predict(ptree1, cart.train, type = "class")
>cart.train$predict.score = predict(ptree1, cart.train, type = "prob")

# Deciling code
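# (The decile() helper below assigns each record to one of 10 score bands:
# it computes the 10%, 20%, ..., 100% quantiles of the predicted score and
# returns 1 for scores in the lowest band up to 10 for the highest band.)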
>decile <- function(x)
{
deciles <- vector(length=10)
for (i in seq(0.1,1,.1))
{
deciles[i*10] <- quantile(x, i, na.rm=T)
}
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}

# Deciling
>cart.train$deciles <- decile(cart.train$predict.score[,2])
>View(cart.train)

# Ranking code
>tmp_DT = data.table(cart.train)
>rank <- tmp_DT[, list(cnt=length(TARGET),
cnt_resp=sum(TARGET==1),
cnt_non_resp=sum(TARGET==0)
), by=deciles][order(-deciles)]
>rank$rrate <- round(rank$cnt_resp / rank$cnt,4);
>rank$cum_resp <- cumsum(rank$cnt_resp)
>rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
>rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
>rank$cum_rel_non_resp <- round(rank$cum_non_resp /
sum(rank$cnt_non_resp),4);
>rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
>rank$rrate <- percent(rank$rrate)
>rank$cum_rel_resp <- percent(rank$cum_rel_resp)
>rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
>rank

#KS and Area under Curve


>pred <- prediction(cart.train$predict.score[,2], cart.train$TARGET)
>perf <- performance(pred, "tpr", "fpr")
>plot(perf)
>KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
>auc <- performance(pred,"auc");
>auc <- as.numeric(auc@y.values)
>gini = ineq(cart.train$predict.score[,2], type="Gini")
>with(cart.train, table(TARGET, predict.class))
>auc
>KS
>gini

# Scoring Holdout sample


>cart.test <- Loan.holdout
>cart.test$predict.class = predict(ptree1, cart.test, type = "class")
>cart.test$predict.score = predict(ptree1, cart.test, type = "prob")
>cart.test$deciles <- decile(cart.test$predict.score[,2])

#Ranking Code Test Data


>tmp_DT = data.table(cart.test)
>rank <- tmp_DT[, list(cnt=length(TARGET),
cnt_resp=sum(TARGET==1),
cnt_non_resp=sum(TARGET==0)
), by=deciles][order(-deciles)]

>rank$rrate <- round(rank$cnt_resp / rank$cnt,4);


>rank$cum_resp <- cumsum(rank$cnt_resp)
>rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
>rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
>rank$cum_rel_non_resp <- round(rank$cum_non_resp /
sum(rank$cnt_non_resp),4);
>rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
>rank$rrate <- percent(rank$rrate)
>rank$cum_rel_resp <- percent(rank$cum_rel_resp)
>rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
>rank

#KS and Area under Curve


>pred <- prediction(cart.test$predict.score[,2], cart.test$TARGET)
>perf <- performance(pred, "tpr", "fpr")
>KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
>auc <- performance(pred,"auc");
>auc <- as.numeric(auc@y.values)
>gini = ineq(cart.test$predict.score[,2], type="Gini")
>with(cart.test, table(TARGET, predict.class))
>plot(perf)
>auc
>KS
>gini

#Balancing the training data using ROSE Algorithm


>N=12272*2
>N
>cart.train.over <- ovun.sample(TARGET~.,data=cart.train,
method="over",N=24544)$data
>table(cart.train.over$TARGET)
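# (Note: ovun.sample with method = "over" keeps every original row and
# re-samples the minority responder class with replacement until the data has
# N = 24544 rows, giving an approximately 50:50 class mix; it needs the ROSE
# package loaded.)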

#Setting the control parameter


>r1.ctrl = rpart.control(minsplit=100, minbucket = 20, cp = 0, xval = 10)

#Building the CART Model on Oversampled Training Dataset


>names(cart.train.over)
>m2 <- rpart(formula = TARGET ~ .,
data = cart.train.over[,-c(1,11,41,42,43)],
method = "class",
control = r1.ctrl)
>printcp(m2)
>plotcp(m2)

# Pruning the Tree


>ptree<- prune(m2, cp= 0.00029 ,"CP")
>printcp(ptree)
>fancyRpartPlot(m2, uniform = TRUE, main = "Final Tree")

#Measure CART Model Performance (Oversampled Training Dataset)


>cart.train.over$predict.class <- predict(ptree, cart.train.over, type = "class")
>cart.train.over$predict.score <- predict(ptree, cart.train.over, type = "prob")

#Deciling Code
>decile <- function(x)
{
deciles <- vector(length=10)
for (i in seq(0.1,1,.1))
{
deciles[i*10] <- quantile(x, i, na.rm=T)
}
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}

# Deciling
>cart.train.over$deciles <- decile(cart.train.over$predict.score[,2])

#Ranking Code
>tmp_DT = data.table(cart.train.over)
>rank <- tmp_DT[, list(cnt=length(TARGET),
cnt_resp=sum(TARGET==1),
cnt_non_resp=sum(TARGET==0)
), by=deciles][order(-deciles)]
>rank$rrate <- round(rank$cnt_resp / rank$cnt,4);
>rank$cum_resp <- cumsum(rank$cnt_resp)
>rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
>rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
>rank$cum_rel_non_resp <- round(rank$cum_non_resp /
sum(rank$cnt_non_resp),4);
>rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
>rank$rrate <- percent(rank$rrate)
>rank$cum_rel_resp <- percent(rank$cum_rel_resp)
>rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
>rank

#KS and Area under Curve


>pred <- prediction(cart.train.over$predict.score[,2], cart.train.over$TARGET)
>perf <- performance(pred, "tpr", "fpr")
>KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
>auc <- performance(pred,"auc");
>auc <- as.numeric(auc@y.values)
>gini = ineq(cart.train.over$predict.score[,2], type="Gini")
>with(cart.train.over, table(TARGET, predict.class))
>plot(perf)
>auc
>KS
>gini

#Scoring Holdout Sample


>cart.test$predict.class <- predict(ptree, cart.test, type = "class")
>cart.test$predict.score <- predict(ptree, cart.test, type = "prob")
>cart.test$deciles <- decile(cart.test$predict.score[,2])

#Ranking Code
>tmp_DT = data.table(cart.test)
>h_rank <- tmp_DT[, list(cnt=length(TARGET),
cnt_resp=sum(TARGET==1),
cnt_non_resp=sum(TARGET==0)
), by=deciles][order(-deciles)]
>h_rank$rrate <- round(h_rank$cnt_resp / h_rank$cnt,4);
>h_rank$cum_resp <- cumsum(h_rank$cnt_resp)
>h_rank$cum_non_resp <- cumsum(h_rank$cnt_non_resp)
>h_rank$cum_rel_resp <- round(h_rank$cum_resp / sum(h_rank$cnt_resp),4);
>h_rank$cum_rel_non_resp <- round(h_rank$cum_non_resp /
sum(h_rank$cnt_non_resp),4);
>h_rank$ks <- abs(h_rank$cum_rel_resp - h_rank$cum_rel_non_resp) * 100;
>h_rank$rrate <- percent(h_rank$rrate)
>h_rank$cum_rel_resp <- percent(h_rank$cum_rel_resp)
>h_rank$cum_rel_non_resp <- percent(h_rank$cum_rel_non_resp)
>h_rank

#KS and Area under Curve


>pred <- prediction(cart.test$predict.score[,2], cart.test$TARGET)
>perf <- performance(pred, "tpr", "fpr")
>KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
>auc <- performance(pred,"auc");
>auc <- as.numeric(auc@y.values)
>gini = ineq(cart.test$predict.score[,2], type="Gini")
>with(cart.test, table(TARGET, predict.class))
>plot(perf)
>auc
>KS
>gini

4.6. The source code for Random Forest technique has been written below
#Random Forest Model - Train Dataset
>rf.train <- Loan.dev
>RF=randomForest(as.factor(TARGET)~.,
data = rf.train[,-c(1,11)],
ntree = 501, mtry = 3, nodesize = 10,
importance=TRUE)
>print(RF)
>plot(RF, main="")
>legend("topright", c("OOB", "0", "1"), text.col=1:6, lty=1:3, col=1:3)
>title(main="Error Rates Random Forest Loan Training data")

#List the importance of the variable


>impVar <- round(randomForest::importance(RF), 2)
>impVar[order(impVar[,3], decreasing=TRUE),]

#Tuning Random Forest


>tRF<- tuneRF(x = rf.train[,-c(1,2,11)],
y=as.factor(rf.train$TARGET),
mtryStart = 6, # approx. square root of the total no. of variables
ntreeTry = 155,
stepFactor = 1.5,
improve = 0.0001,
trace = TRUE,
plot = TRUE,
doBest = TRUE,
nodesize = 10,
importance = TRUE
)
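# (tuneRF searches for the mtry with the lowest OOB error, starting from
# mtryStart = 6 and stepping by a factor of 1.5 while the relative improvement
# exceeds `improve`; doBest = TRUE returns the forest refit with the best mtry.)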
>tRF$importance
#Scoring
>rf.train$predict.class <- predict(tRF, rf.train, type = "class")
>rf.train$predict.score <- predict(tRF, rf.train, type = "prob")
>head(rf.train)
>class(rf.train$predict.score)

#Deciling code
>decile <- function(x)
{
deciles <- vector(length=10)
for (i in seq(0.1,1,.1)){
deciles[i*10] <- quantile(x, i, na.rm=T)
}
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}
>rf.train$deciles <- decile(rf.train$predict.score[,2])

#Rank order table


>tmp_DT = data.table(rf.train)
>rank <- tmp_DT[, list(
cnt = length(TARGET),
cnt_resp = sum(TARGET==1),
cnt_non_resp = sum(TARGET == 0)) ,
by=deciles][order(-deciles)]
>rank$rrate <- round (rank$cnt_resp / rank$cnt,2);
>rank$cum_resp <- cumsum(rank$cnt_resp)
>rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
>rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),2);
>rank$cum_rel_non_resp <- round(rank$cum_non_resp /
sum(rank$cnt_non_resp),2);
>rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp)
>rank$rrate <- percent(rank$rrate)
>rank$cum_rel_resp <- percent(rank$cum_rel_resp)
>rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
>rank

# Baseline Response Rate


>sum(rf.train$TARGET) / nrow(rf.train)

#KS and Area under Curve


>pred <- prediction(rf.train$predict.score[,2], rf.train$TARGET)
>perf <- performance(pred, "tpr", "fpr")
>KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
>KS

# Area Under Curve


>auc <- performance(pred,"auc");
>auc <- as.numeric(auc@y.values)
>auc

# Gini Coefficient
>gini = ineq(rf.train$predict.score[,2], type="Gini")
>gini

# Classification Error
>with(rf.train, table(TARGET, predict.class))

# Scoring syntax
>rf.test <- Loan.holdout
>rf.test$predict.class <- predict(tRF, rf.test, type="class")
>rf.test$predict.score <- predict(tRF, rf.test, type="prob")
>rf.test$deciles <- decile(rf.test$predict.score[,2])

#Rank order
>tmp_DT = data.table(rf.test)
>h_rank <- tmp_DT[, list(
cnt = length(TARGET),
cnt_resp = sum(TARGET),
cnt_non_resp = sum(TARGET == 0)) ,
by=deciles][order(-deciles)]
>h_rank$rrate <- round (h_rank$cnt_resp / h_rank$cnt,2);
>h_rank$cum_resp <- cumsum(h_rank$cnt_resp)
>h_rank$cum_non_resp <- cumsum(h_rank$cnt_non_resp)
>h_rank$cum_rel_resp <- round(h_rank$cum_resp / sum(h_rank$cnt_resp),2);
>h_rank$cum_rel_non_resp <- round(h_rank$cum_non_resp /
sum(h_rank$cnt_non_resp),2);
>h_rank$ks <- abs(h_rank$cum_rel_resp - h_rank$cum_rel_non_resp);
>h_rank$rrate <- percent(h_rank$rrate)
>h_rank$cum_rel_resp <- percent(h_rank$cum_rel_resp)
>h_rank$cum_rel_non_resp <- percent(h_rank$cum_rel_non_resp)
>h_rank

#KS and AUC


>pred1 <- prediction(rf.test$predict.score[,2], rf.test$TARGET)
>perf1 <- performance(pred1, "tpr", "fpr")
>plot(perf1)
>KS1 <- max(attr(perf1, 'y.values')[[1]]-attr(perf1, 'x.values')[[1]])
>KS1

# Area Under Curve


>auc1 <- performance(pred1,"auc");
>auc1 <- as.numeric(auc1@y.values)
>auc1

# Gini Coefficient
>gini1 = ineq(rf.test$predict.score[,2], type="Gini")
>gini1

# Classification Error
>with(rf.test, table(TARGET, predict.class))

4.7. The source code for Artificial Neural Network technique has been written
below
>NNInput = read.table("Loan.csv", sep = ",", header = T)
>attach(NNInput)
>dim(NNInput)
# Gender
>GEN.matrix <- model.matrix(~ GENDER - 1, data = NNInput)
>NNInput <- data.frame(NNInput, GEN.matrix)

# Occupation
>occ.matrix <- model.matrix(~ OCCUPATION - 1, data = NNInput)
>NNInput <- data.frame(NNInput, occ.matrix)

# AGE_BKT
>AGEBKT.matrix <- model.matrix(~ AGE_BKT - 1, data = NNInput)
>NNInput <- data.frame(NNInput, AGEBKT.matrix)

# ACC_TYPE
>ACCTYP.matrix <- model.matrix(~ ACC_TYPE - 1, data = NNInput)
>NNInput <- data.frame(NNInput, ACCTYP.matrix)

#Creating Training and Test Datasets


>trainIndex <- createDataPartition(TARGET,
p = .7,
list = FALSE,
times = 1)
>NN.train.data <- NNInput[trainIndex,]
>NN.test.data <- NNInput[-trainIndex,]
>dim(NN.train.data)
>dim(NN.test.data)

#Scaling the Dataset and Variables


>names(NN.train.data)
>x <- subset(NN.train.data,
select = c("AGE",
"BALANCE",
"SCR",
"HOLDING_PERIOD",
"LEN_OF_RLTN_IN_MNTH",
"NO_OF_L_CR_TXNS",
"NO_OF_L_DR_TXNS",
"TOT_NO_OF_L_TXNS",
"NO_OF_BR_CSH_WDL_DR_TXNS",
"NO_OF_ATM_DR_TXNS",
"NO_OF_NET_DR_TXNS",
"NO_OF_MOB_DR_TXNS",
"NO_OF_CHQ_DR_TXNS",
"FLG_HAS_CC",
"AMT_ATM_DR",
"AMT_BR_CSH_WDL_DR",
"AMT_CHQ_DR",
"AMT_NET_DR",
"AMT_MOB_DR",
"AMT_L_DR",
"FLG_HAS_ANY_CHGS",
"AMT_OTH_BK_ATM_USG_CHGS",
"AMT_MIN_BAL_NMC_CHGS",
"NO_OF_IW_CHQ_BNC_TXNS",
"NO_OF_OW_CHQ_BNC_TXNS",
"AVG_AMT_PER_ATM_TXN",
"AVG_AMT_PER_CSH_WDL_TXN",
"AVG_AMT_PER_CHQ_TXN",
"AVG_AMT_PER_NET_TXN",
"AVG_AMT_PER_MOB_TXN",
"FLG_HAS_NOMINEE",
"FLG_HAS_OLD_LOAN",
"random",
"GENDERF",
"GENDERM",
"GENDERO",
"OCCUPATIONPROF",
"OCCUPATIONSAL",
"OCCUPATIONSELF.EMP",
"OCCUPATIONSENP",
"AGE_BKT.25",
"AGE_BKT.50",
"AGE_BKT26.30",
"AGE_BKT31.35",
"AGE_BKT36.40",
"AGE_BKT41.45",
"AGE_BKT46.50",
"ACC_TYPECA",
"ACC_TYPESA" )
)
>nn.devscaled <- scale(x)
>nn.devscaled <- cbind(NN.train.data[2], nn.devscaled)
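# (scale() centres each predictor to mean 0 and unit standard deviation; the
# unscaled TARGET column, column 2 of NN.train.data, is then bound back on.)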

#Building the NN Model


>nn2 <- neuralnet(formula = TARGET ~
AGE +
BALANCE +
SCR +
HOLDING_PERIOD +
LEN_OF_RLTN_IN_MNTH +
NO_OF_L_CR_TXNS +
NO_OF_L_DR_TXNS +
TOT_NO_OF_L_TXNS +
NO_OF_BR_CSH_WDL_DR_TXNS +
NO_OF_ATM_DR_TXNS +
NO_OF_NET_DR_TXNS +
NO_OF_MOB_DR_TXNS +
NO_OF_CHQ_DR_TXNS +
FLG_HAS_CC +
AMT_ATM_DR +
AMT_BR_CSH_WDL_DR +
AMT_CHQ_DR +
AMT_NET_DR +
AMT_MOB_DR +
AMT_L_DR +
FLG_HAS_ANY_CHGS +
AMT_OTH_BK_ATM_USG_CHGS +
AMT_MIN_BAL_NMC_CHGS +
NO_OF_IW_CHQ_BNC_TXNS +
NO_OF_OW_CHQ_BNC_TXNS +
AVG_AMT_PER_ATM_TXN +
AVG_AMT_PER_CSH_WDL_TXN +
AVG_AMT_PER_CHQ_TXN +
AVG_AMT_PER_NET_TXN +
AVG_AMT_PER_MOB_TXN +
FLG_HAS_NOMINEE +
FLG_HAS_OLD_LOAN +
random +
GENDERF +
GENDERM +
GENDERO +
OCCUPATIONPROF +
OCCUPATIONSAL +
OCCUPATIONSELF.EMP +
OCCUPATIONSENP +
AGE_BKT.25 +
AGE_BKT.50 +
AGE_BKT26.30 +
AGE_BKT31.35 +
AGE_BKT36.40 +
AGE_BKT41.45 +
AGE_BKT46.50 +
ACC_TYPECA +
ACC_TYPESA ,
data = nn.devscaled,
hidden = 3,
err.fct = "sse",
linear.output = FALSE,
lifesign = "full",
lifesign.step = 10,
threshold = 0.1,
stepmax = 2000
)
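# (hidden = 3 fits a single hidden layer of 3 neurons; linear.output = FALSE
# keeps the logistic activation on the output node so net.result can be read
# as a probability; training stops once the error gradient falls below
# threshold = 0.1 or after stepmax = 2000 steps.)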
>plot (nn2)
# Prediction on Train Data
>NN.train.data$Prob = nn2$net.result[[1]]
>head(NN.train.data)

# The distribution of the estimated probabilities


>quantile(NN.train.data$Prob, c(0,1,5,10,25,50,75,90,95,98,99,100)/100)
>hist(NN.train.data$Prob)

#Deciling code
>decile <- function(x)
{
deciles <- vector(length=10)
for (i in seq(0.1,1,.1)){
deciles[i*10] <- quantile(x, i, na.rm=T)
}
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}
>NN.train.data$deciles <- decile(NN.train.data$Prob)
# Ranking code
>tmp_DT = data.table(NN.train.data)
>rank <- tmp_DT[, list(
cnt = length(TARGET),
cnt_resp = sum(TARGET==1),
cnt_non_resp = sum(TARGET == 0)) ,
by=deciles][order(-deciles)]
>rank$rrate <- round (rank$cnt_resp / rank$cnt,2);
>rank$cum_resp <- cumsum(rank$cnt_resp)
>rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
>rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),2);
>rank$cum_rel_non_resp <- round(rank$cum_non_resp /
sum(rank$cnt_non_resp),2);
>rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp)
>rank$rrate <- percent(rank$rrate)
>rank$cum_rel_resp <- percent(rank$cum_rel_resp)
>rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
>rank

#Assigning 0 / 1 class based on certain threshold


>NN.train.data$Class = ifelse(NN.train.data$Prob>0.5,1,0)
>with( NN.train.data, table(TARGET, as.factor(Class) ))

#Error Computation
>sum((NN.train.data$TARGET - NN.train.data$Prob)^2)/2

#Other Model Performance Measures


>pred3 <- prediction(NN.train.data$Prob, NN.train.data$TARGET)
>perf3 <- performance(pred3, "tpr", "fpr")
>plot(perf3)
>KS3 <- max(attr(perf3, 'y.values')[[1]]-attr(perf3, 'x.values')[[1]])
>KS3
>auc3 <- performance(pred3,"auc");
>auc3 <- as.numeric(auc3@y.values)
>auc3
>gini3 = ineq(NN.train.data$Prob, type="Gini")
>gini3

#Scoring Test dataset using the Neural Net Model Object


>x <- subset(NN.test.data,
select = c("AGE",
"BALANCE",
"SCR",
"HOLDING_PERIOD",
"LEN_OF_RLTN_IN_MNTH",
"NO_OF_L_CR_TXNS",
"NO_OF_L_DR_TXNS",
"TOT_NO_OF_L_TXNS",
"NO_OF_BR_CSH_WDL_DR_TXNS",
"NO_OF_ATM_DR_TXNS",
"NO_OF_NET_DR_TXNS",
"NO_OF_MOB_DR_TXNS",
"NO_OF_CHQ_DR_TXNS",
"FLG_HAS_CC",
"AMT_ATM_DR",
"AMT_BR_CSH_WDL_DR",
"AMT_CHQ_DR",
"AMT_NET_DR",
"AMT_MOB_DR",
"AMT_L_DR",
"FLG_HAS_ANY_CHGS",
"AMT_OTH_BK_ATM_USG_CHGS",
"AMT_MIN_BAL_NMC_CHGS",
"NO_OF_IW_CHQ_BNC_TXNS",
"NO_OF_OW_CHQ_BNC_TXNS",
"AVG_AMT_PER_ATM_TXN",
"AVG_AMT_PER_CSH_WDL_TXN",
"AVG_AMT_PER_CHQ_TXN",
"AVG_AMT_PER_NET_TXN",
"AVG_AMT_PER_MOB_TXN",
"FLG_HAS_NOMINEE",
"FLG_HAS_OLD_LOAN",
"random",
"GENDERF",
"GENDERM",
"GENDERO",
"OCCUPATIONPROF",
"OCCUPATIONSAL",
"OCCUPATIONSELF.EMP",
"OCCUPATIONSENP",
"AGE_BKT.25",
"AGE_BKT.50",
"AGE_BKT26.30",
"AGE_BKT31.35",
"AGE_BKT36.40",
"AGE_BKT41.45",
"AGE_BKT46.50",
"ACC_TYPECA",
"ACC_TYPESA" )
)
>x.scaled <- scale(x)

#Scoring the Test Dataset


>compute.output <- compute(nn2, x.scaled)
>NN.test.data$Predict.score <- compute.output$net.result
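# (compute() runs a forward pass of the fitted network nn2 on the scaled test
# predictors; $net.result holds the output-node value, i.e. the predicted
# response probability, for each test record.)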
>head(NN.test.data)

# The distribution of the estimated probabilities


>quantile(NN.test.data$Predict.score, c(0,1,5,10,25,50,75,90,95,98,99,100)/100)
>hist(NN.test.data$Predict.score)

# Deciling
>NN.test.data$deciles <- decile(NN.test.data$Predict.score)

#Rank ordering
>tmp_DT = data.table(NN.test.data)
>h_rank <- tmp_DT[, list(
cnt = length(TARGET),
cnt_resp = sum(TARGET),
cnt_non_resp = sum(TARGET == 0)) ,
by=deciles][order(-deciles)]
>h_rank$rrate <- round (h_rank$cnt_resp / h_rank$cnt,2);
>h_rank$cum_resp <- cumsum(h_rank$cnt_resp)
>h_rank$cum_non_resp <- cumsum(h_rank$cnt_non_resp)
>h_rank$cum_rel_resp <- round(h_rank$cum_resp / sum(h_rank$cnt_resp),2);
>h_rank$cum_rel_non_resp <- round(h_rank$cum_non_resp /
sum(h_rank$cnt_non_resp),2)
>h_rank$ks <- abs(h_rank$cum_rel_resp - h_rank$cum_rel_non_resp)
>h_rank$rrate <- percent(h_rank$rrate)
>h_rank$cum_rel_resp <- percent(h_rank$cum_rel_resp)
>h_rank$cum_rel_non_resp <- percent(h_rank$cum_rel_non_resp)
>h_rank

#Assigning the Probabilities to the Test Dataset


>NN.test.data$Prob = compute.output$net.result

#Assigning 0/1 class based on certain threshold


>NN.test.data$Class = ifelse(NN.test.data$Prob>0.5,1,0)
>with( NN.test.data, table(TARGET, as.factor(Class) ))

# Error Computation
>sum((NN.test.data$TARGET - NN.test.data$Prob)^2)/2

#Other Model Performance Measures


>pred4 <- prediction(NN.test.data$Prob, NN.test.data$TARGET)
>perf4 <- performance(pred4, "tpr", "fpr")
>plot(perf4)
>KS4 <- max(attr(perf4, 'y.values')[[1]]-attr(perf4, 'x.values')[[1]])
>KS4
>auc4 <- performance(pred4,"auc");
>auc4 <- as.numeric(auc4@y.values)
>auc4
>gini4 = ineq(NN.test.data$Prob, type="Gini")
>gini4
