Big Data Management and Architecture Assignment

The document discusses learning curves and model performance for decision trees and logistic regression models. It then analyzes variable importance and proposes using a decision tree model to target customers. An expected value framework and profit curve analysis is also suggested.


Big Data Management and Analytics

Individual Assignment

UTSAV PALIWAL
STUDENT NUMBER - 555818

A1.

[Figure: learning curves for Decision Tree and Logistic Regression, accuracy (40%-80%) plotted against training-set size (10 to 9000 samples)]

Fig. 1 Learning Curve charted by Accuracy

[Figure: learning curves for Decision Tree and Logistic Regression, AUC (0%-90%) plotted against training-set size (10 to 9856 samples)]

Fig. 2 Learning Curve charted by Area under ROC curve

A1.1

To interpret the plots above, we first need to differentiate between accuracy and AUC. Accuracy measures how often a model classifies an instance correctly as positive or negative, whereas AUC is the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative one.

Decision tree - In plots 1 and 2 the decision tree's performance stays roughly constant until about 80-160 samples. With so little data, the tree effectively classifies every customer as "stay", which is accurate 70% of the time because 70% of the training data belongs to the "stay" class. A tree trained and evaluated on such a small sample simply memorises it; this is overfitting. Only once the sample grows to a reasonable size does the decision tree become highly accurate, thanks to its high flexibility.

Logistic regression - Regression models are better probabilistic classifiers and cope better with small data sets; they are useful when the probability of an event must be estimated from limited data. This is why the logistic regression model outperforms the decision tree on the AUC plot (though not on the accuracy plot) at the smaller sample sizes of 40 and 80.
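The contrast between the two metrics can be made concrete with a small sketch. The snippet below (illustrative only; the class split and scores are assumptions mirroring the 70/30 "stay"/"leave" ratio described above) shows that a degenerate model which predicts "stay" for everyone reaches 70% accuracy yet only 50% AUC, because AUC is exactly the pairwise ranking probability:

```python
# AUC as a ranking probability: the chance that a randomly chosen
# positive ("leave") receives a higher score than a randomly chosen
# negative ("stay"), counting ties as half a win.
def auc(pos_scores, neg_scores):
    pairs = len(pos_scores) * len(neg_scores)
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / pairs

# A degenerate "everyone stays" model on a 70/30 split:
# accuracy simply matches the majority-class share...
y_true = [0] * 30 + [1] * 70          # 1 = stay, 0 = leave
y_pred = [1] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / 100

# ...but a constant score carries no ranking information at all.
const_auc = auc([0.5] * 30, [0.5] * 70)
print(accuracy, const_auc)            # 0.7 0.5
```

This is why the flat early stretch of the decision tree's accuracy curve sits at roughly the majority-class proportion rather than at chance level.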

A1.2

The ideal learning curve is shaped like an L rotated 90 degrees clockwise: it rises steeply at first, then tapers off as the sample size grows, eventually flattening as the marginal benefit of additional data diminishes.

Therefore, beyond a certain point it makes no business sense to collect more data. From a time- and cost-saving perspective, the additional data would bring in significant expense but no significant improvement to the model.

In our case, the decision tree's curve has still not tapered off at the end. This means there is still scope to collect and incorporate more samples into the study, as they would bring significant gains in accuracy.

The logistic regression, on the other hand, starts to flatten around the 300-sample mark, indicating its limits with respect to larger sample sizes.

A2.

[Figure: fitting graph showing Accuracy and AUC (65%-79%) plotted against minimum leaf size (1 to 291)]

Fig. 3 Fitting Graph for Accuracy and AUC

A2.1

Firstly, we can see that the model badly overfits up to a minimum leaf size of about 20, as shown by the sharp rise in both accuracy and AUC once the leaf size grows past 20 and model complexity starts to decrease.

Beyond 20, accuracy and AUC reach a reasonable fit of roughly 77% and 75% respectively. This trend continues until a minimum leaf size of about 150, where accuracy starts to dip slightly and the gap between the two curves narrows. At this complexity level the model begins to underfit just a little, making this the "sweet spot" of complexity.

Given AUC's greater reliability as a measure of ranking quality, we should treat the tree with a minimum leaf size of 150 as the optimal model. Another reason for preferring AUC over accuracy is the class imbalance between "leave" (2,762) and "stay" (7,094) in the sample data set.
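The selection rule argued for above, sweeping the minimum leaf size and keeping the value with the best cross-validated AUC, can be sketched as follows. This is illustrative only: the assignment's actual data set is replaced by a synthetic imbalanced one, and the leaf sizes tried are assumed values spanning the range in Fig. 3.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data, with a ~70/30 class split.
X, y = make_classification(n_samples=2000, weights=[0.7], random_state=0)

scores = {}
for leaf in (1, 5, 20, 50, 150, 300):
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
    # roc_auc rather than accuracy, because of the class imbalance.
    scores[leaf] = cross_val_score(tree, X, y, cv=5,
                                   scoring="roc_auc").mean()

best_leaf = max(scores, key=scores.get)
print(best_leaf, round(scores[best_leaf], 3))
```

Very small leaf sizes let the tree memorise the training folds, so their cross-validated AUC tends to be worse, which is the overfitting region on the left of Fig. 3.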

A2.2

Variable Importance Table

ATTRIBUTE                       WEIGHT
Income                          0,486
College                         0,291
Leftover                        0,310
Average call duration           0,054
Over 15mins calls per month     0,000
House                           0,284
Overage                         1,000
Handset_price                   0,004

The variable importance table tells us the significance, or influence, that each variable has on the model. It is important to analyse this table because some variables assumed to be important may turn out to have little significance, and vice versa.

From the table above we can gather that overage is by far the most significant predictor in the model, with the maximum weight of 1. By the same logic, income and leftover are also important variables. Handset price and over-15-minute calls, however, are the least significant, with importances close to 0. We can therefore consider discarding them from the model to save the time and resources spent on their computation.
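A table of this kind can be reproduced from a fitted decision tree's impurity-based importances. The sketch below is hypothetical: the attribute names are taken from the table above, but the data is synthetic, with churn driven mostly by overage so that the resulting ranking resembles the assignment's.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
features = ["income", "college", "leftover", "avg_call_duration",
            "over_15min_calls", "house", "overage", "handset_price"]

# Synthetic data: the label depends mainly on overage, partly on income.
X = rng.normal(size=(1000, len(features)))
y = (X[:, 6] + 0.5 * X[:, 0] > 0).astype(int)

tree = DecisionTreeClassifier(min_samples_leaf=150, random_state=0).fit(X, y)

# feature_importances_ sums to 1 across all attributes.
ranked = sorted(zip(features, tree.feature_importances_),
                key=lambda kv: -kv[1])
for name, weight in ranked:
    print(f"{name:20s} {weight:.3f}")
```

Note that these importances are normalised to sum to 1, whereas the table above appears to scale the strongest attribute to 1,000; the ranking, which is what the discussion relies on, is unaffected by the choice of scaling.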

A3.

A3.1

We should move ahead with the decision tree as the model of choice. As shown in the previous questions, it is the best-performing model for our data set and objectives: it achieves higher accuracy than logistic regression (Q1) and, thanks to its greater flexibility, is better suited to our large data set.

A3.2

The following assumptions are needed in order to apply the expected value framework:

1. Base scenario - the default scenario against which all costs will be compared. In this case, the base scenario is to target no customers at all.

2. Cost of targeting - we assume that targeting a customer, i.e. giving them an offer, costs 5 euros.

3. Timeline - the analysis covers only the 6-month time frame following the targeting of customers, and costs will be incorporated accordingly.

A3.3
Cost-Benefit Matrix

                     Leave (Actual)   Stay (Actual)
Leave (Predicted)    87 Euros         -5 Euros
Stay (Predicted)     0 Euros          0 Euros

The following are the justifications for each cell of the cost-benefit matrix:

1. Benefit from customers targeted successfully (cell 2,2) - these are the true positives. The benefit is 50% of 50 euros x 6 months (the revenue retained after the discount) minus the 5-euro targeting cost, i.e. 150 - 5 = 145 euros. But since the offer is only 60% effective, we expect to make only 60% of the 145 euros. The benefit is therefore 145 x 0.6 = 87 euros.

2. Loss from customers targeted unsuccessfully (cell 2,3) - these are the false positives. The loss in this case is the cost incurred to target them: 5 euros, per assumption 2.

3. Benefit from not targeting those who would not have accepted (cell 3,3) - these are the true negatives: people who, had we targeted them, would have declined the offer. The benefit is 0 euros, since in the base scenario we were not going to target anyone anyway, making both benefits 0 euros.

4. Loss from not targeting those who would have accepted (cell 3,2) - these are the false negatives: people who, had we targeted them, would have accepted our offer. The loss in such cases is 0 euros, because compared with the base scenario we were not targeting anyone anyway, making both losses 0 euros.
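The arithmetic behind the two non-zero cells can be reproduced step by step under the stated assumptions (50% discount on an assumed 50-euro monthly bill, a 6-month horizon, a 5-euro targeting cost, and 60% offer effectiveness):

```python
# Assumptions from the expected value framework above.
monthly_bill = 50.0     # euros per month
discount = 0.5          # 50% discount offer
months = 6              # analysis horizon
targeting_cost = 5.0    # cost of making the offer
effectiveness = 0.6     # fraction of targeted churners who accept

retained_revenue = discount * monthly_bill * months   # 150 euros kept
net_if_accepted = retained_revenue - targeting_cost   # 145 euros
tp_benefit = net_if_accepted * effectiveness          # 87 euros (cell 2,2)
fp_cost = -targeting_cost                             # -5 euros (cell 2,3)
print(tp_benefit, fp_cost)                            # 87.0 -5.0
```

The true-negative and false-negative cells need no computation: both equal the base scenario of targeting no one, so their value is 0 by construction.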

A3.4

Profit Curve Table

Threshold   True Pos.   False Pos.   True Neg.   False Neg.   Revenue       Cost         Profit
0%          0           0            0           0            € 0,00        € 0,00       € 0,00
10%         242         183          0           0            € 21.054,00   € 915,00     € 20.139,00
20%         466         383          0           0            € 40.542,00   € 1.915,00   € 38.627,00
30%         496         410          221         146          € 43.152,00   € 2.050,00   € 41.102,00
40%         496         410          491         300          € 43.152,00   € 2.050,00   € 41.102,00
50%         496         410          804         411          € 43.152,00   € 2.050,00   € 41.102,00
60%         496         410          1134        505          € 43.152,00   € 2.050,00   € 41.102,00
70%         496         410          1484        579          € 43.152,00   € 2.050,00   € 41.102,00
80%         496         410          1864        623          € 43.152,00   € 2.050,00   € 41.102,00
90%         496         410          2257        654          € 43.152,00   € 2.050,00   € 41.102,00
100%        496         410          2644        681          € 43.152,00   € 2.050,00   € 41.102,00

A3.5

[Figure: profit curve, profit (€ 0,00 to € 45.000,00) plotted against % of customers to be targeted (0%-100%)]

Fig. 4 Profit Curve

According to the plot presented above, at most 30% of the most likely churners should be targeted, because beyond that point there is no additional profit to be made.

Our true positives stay constant because no one is classified as "leave" beyond a 0.5 confidence level, and the same holds for the false positives. Similarly, there are no "stay" cases among the top 30% of predicted churners, so the true negatives and false negatives only start increasing after the 30% mark.

Profit nevertheless stays constant, because it is a function only of the true positives and false positives (TP x revenue - FP x cost); the true negatives and false negatives have a value of 0 in the cost-benefit matrix.
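The plateau can be verified by recomputing profit directly from the table's counts with the cost-benefit matrix (TP -> +87 euros, FP -> -5 euros, TN and FN -> 0). Only a few representative rows are reproduced here:

```python
# Per-cell values from the cost-benefit matrix.
TP_BENEFIT = 87   # euros per true positive
FP_COST = 5       # euros lost per false positive

# (threshold, true positives, false positives) from the profit curve
# table; TP and FP stop growing after the 30% cutoff.
rows = [
    (0.0, 0, 0),
    (0.1, 242, 183),
    (0.2, 466, 383),
    (0.3, 496, 410),
    (0.4, 496, 410),
    (0.5, 496, 410),
]

# Profit depends only on TP and FP, since the TN/FN cells are 0.
profits = {t: tp * TP_BENEFIT - fp * FP_COST for t, tp, fp in rows}
print(profits[0.3])   # 41102 -- the plateau value from the table
```

Because the counts freeze at the 30% threshold, every later row yields the same 41.102-euro profit, which is exactly the flat right-hand section of Fig. 4.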
