Big Data Management and Architecture Assignment
Individual Assignment
UTSAV PALIWAL
STUDENT NUMBER - 555818
A1.
A1.1
To interpret the plots above, we first need to differentiate between accuracy and AUC. Accuracy measures how often a model classifies an instance correctly as positive or negative, whereas AUC is the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative one.
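The distinction can be made concrete with a small sketch (toy data, not the assignment's data set): a majority-class predictor on a 70/30 split scores 70% accuracy while its AUC, computed with the pairwise-ranking definition above, is only 0.5.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """P(score of a positive > score of a negative); ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 7 "stay" (0) vs 3 "leave" (1): always predicting the majority class
# scores 70% accuracy but is useless at ranking (AUC = 0.5).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(accuracy(y_true, [0] * 10))   # 0.7
print(auc(y_true, [0.0] * 10))      # 0.5
```

This is exactly why accuracy flatters a model on imbalanced data while AUC does not.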
Decision trees - In plots 1 and 2, the decision tree's performance stays roughly constant until about 80-160 samples. With so few samples, the tree effectively classifies every customer as "stay", and since 70% of the training data belongs to the "stay" class, it is correct 70% of the time. The figures also overstate small-sample performance because the model is trained and evaluated on the same training data, which leads to overfitting. Only once the sample grows to a reasonable size does the decision tree become highly accurate, thanks to its high flexibility.
Logistic regression - Logistic regression produces better-calibrated probability estimates and tends to perform well on small data sets, which makes it useful when the probability of an event must be estimated from limited data. This is why the logistic regression model outperforms the decision tree on the AUC plot, but not the accuracy plot, at the smaller sample sizes of 40 and 80.
A1.2
The ideal learning curve is shaped like a clockwise 90-degree-rotated "L": it rises steeply at first, then tapers off as the sample size grows and eventually flattens as the marginal benefit of additional data diminishes. Therefore, beyond a certain point it makes no business sense to collect more data: from a time- and cost-saving perspective, the additional data would bring significant expense but no significant improvement to the model.
In our case, the decision tree's curve has still not tapered off at the right-hand end. This means there is still scope to collect and incorporate more samples into the study, as they would bring significant gains in accuracy. The logistic regression curve, on the other hand, starts to slope downward around the 300-sample mark, indicating that larger samples offer it no further benefit.
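The stopping logic above can be sketched numerically. The learning-curve values, batch cost, and value-per-accuracy-point below are all hypothetical placeholders, purely to show how the marginal-benefit comparison would be made:

```python
def stopping_point(curve, cost_per_batch, value_per_point):
    """Return the sample size after which the next data batch is not worth
    collecting: its accuracy gain, valued in money, falls below its cost."""
    sizes = sorted(curve)
    for prev, cur in zip(sizes, sizes[1:]):
        gain = curve[cur] - curve[prev]
        if gain * value_per_point < cost_per_batch:
            return prev
    return sizes[-1]  # never stopped: the curve has not tapered off yet

# Hypothetical plateauing learning curve (accuracy at each sample size).
learning_curve = {40: 0.62, 80: 0.68, 120: 0.72, 160: 0.745,
                  200: 0.76, 240: 0.768, 280: 0.772, 320: 0.774}

print(stopping_point(learning_curve, cost_per_batch=100, value_per_point=5000))
# -> 160: the 160 -> 200 step gains only 0.015 accuracy, worth 75 < 100
```

A curve that is still rising at its last point (like our decision tree's) would return the final sample size, i.e. "keep collecting".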
A2.
[Figure: Fitting graph for the decision tree - accuracy and AUC (y-axis, 65% to 79%) plotted against minimum leaf size (x-axis, 1 to 291).]
A2.1
Firstly, we can see that the model overfits badly at minimum leaf sizes below about 20, as shown by the sharp rise in both AUC and accuracy once the minimum leaf size passes 20 and model complexity starts to decrease.
Beyond 20, accuracy and AUC reach a reasonable fit at roughly 77% and 75% respectively. This trend continues until a minimum leaf size of about 150, where accuracy starts to dip slightly and the gap between the two curves narrows. At this complexity level the model begins to underfit just a little, making this the "sweet spot" of complexity.
Given AUC's greater reliability for this prediction task, we should treat the model with a minimum leaf size of 150 as the optimal one. Another reason for preferring AUC over accuracy is the class imbalance between leave (2,762) and stay (7,094) in the sample data set.
A2.2
The variable importance table tells us how much influence each variable has on the model. It is important to analyze this table because some variables that might be considered important may in fact carry little significance, and vice versa.
From the table above, we can see that overage is by far the most significant predictor in the model, with the maximum importance of 1. Income and leftover are also important variables by the same logic. Handset price and over-15-minute calls, however, are the least significant variables, with importances close to 0. We can therefore consider discarding them from the model to save the time and resources spent on computing them.
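The pruning step can be sketched as a simple threshold filter. Only overage's importance of 1.0 is stated in the text; the other values below are hypothetical placeholders standing in for the table:

```python
# Importance values: "overage" = 1.00 is from the text; the rest are
# hypothetical stand-ins for the variable importance table.
importances = {
    "overage": 1.00,
    "income": 0.55,            # hypothetical
    "leftover": 0.48,          # hypothetical
    "handset_price": 0.03,     # hypothetical, near zero
    "over_15min_calls": 0.01,  # hypothetical, near zero
}

THRESHOLD = 0.05  # assumed cut-off for "close to 0"
kept = {v: w for v, w in importances.items() if w >= THRESHOLD}
dropped = sorted(set(importances) - set(kept))
print(dropped)   # ['handset_price', 'over_15min_calls']
```

Variables below the threshold are the candidates for removal discussed above.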
A3.
A3.1
We should move ahead with the decision tree as the model of choice. As the previous questions show, it is the best-performing model for our data set and objectives: it achieves higher accuracy than logistic regression (as shown in Q1) and, thanks to its greater flexibility, is better suited to our large data set.
A3.2
The following are the assumptions we need to make in order to apply the Expected Value Framework:
1. Base scenario - the default scenario against which all costs are compared. In this case, the base scenario is to target no customers at all.
2. Targeting cost - targeting a customer costs 5 Euros, incurred whether or not the offer is accepted.
3. Timeline - the analysis covers only the 6-month window from the time customers are targeted, and costs are incorporated accordingly.
A3.3
Cost-Benefit Matrix

                      Leave (Actual)    Stay (Actual)
Leave (Predicted)     87 Euros          -5 Euros
Stay (Predicted)      0 Euros           0 Euros
The following are the justifications for each cell of the cost-benefit matrix:
1. Benefit from customers targeted successfully (cell 2,2) - These are the True Positives. Our benefit is 50% of 50 Euros x 6 months (because of the discount), minus the 5 Euro targeting cost, which comes to 145 Euros. But since the offer is only 60% effective, we only realise 60% of the 145 Euros. The benefit is therefore 145 x 0.6 = 87 Euros.
2. Loss from customers targeted unsuccessfully (cell 2,3) - These are the False Positives. Our loss in this case is the cost incurred to target them: 5 Euros, per the 2nd assumption.
3. Benefit from not targeting those who would not have accepted (cell 3,3) - These are the True Negatives: people who, had we targeted them, would have declined the offer. The benefit here is 0 Euros, since in the base scenario we were not going to target anyone anyway, making both benefits 0 Euros.
4. Loss from not targeting those who would have accepted (cell 3,2) - These are the False Negatives: people who, had we targeted them, would have accepted our offer. The loss is 0 Euros, because compared with the base scenario we were not targeting anyone anyway, making both losses 0 Euros.
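The arithmetic behind the matrix can be checked in a few lines. The cell values (87, -5, 0, 0) and the 50 Euro / 50% / 60% / 5 Euro figures come from the text above; the confusion-matrix counts at the end are hypothetical, purely to show how the matrix turns into a campaign profit:

```python
# Cost-benefit matrix from the text: TP = 87, FP = -5, TN = FN = 0 Euros.
benefit = {"TP": 87, "FP": -5, "TN": 0, "FN": 0}

# The TP cell itself: 50% discount on 50 EUR x 6 months, minus the 5 EUR
# targeting cost, scaled by the 60% offer effectiveness.
tp_value = (0.5 * 50 * 6 - 5) * 0.6
print(tp_value)   # 87.0

def expected_profit(counts):
    """Total profit for a confusion matrix given as a dict of cell counts."""
    return sum(benefit[cell] * n for cell, n in counts.items())

# Hypothetical campaign outcome: 1000 targeted customers, 300 true churners.
print(expected_profit({"TP": 300, "FP": 700, "TN": 0, "FN": 0}))  # 22600
```

Because the TN and FN cells are 0, only the targeted customers ever affect the total, which is what makes the base scenario a clean comparison point.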
A3.4
A3.5
[Figure: Profit curve - profit in Euros (y-axis, EUR 0 to EUR 45,000) plotted against % of customers to be targeted (x-axis, 0% to 100%).]
According to the plot presented above, at most the top 30% of the most likely churners should be targeted, because beyond that point there is no additional profit to be made.
Our True Positives stay constant because no one beyond that point is classified as "Leave" at the 0.5 confidence level, and the same holds for the False Positives. Likewise for the True Negatives and False Negatives: there are no "Stay" classifications among the top 30% of churners, so those counts start increasing only after 30%.
Despite this, profit stays constant because it is a function only of the True Positives and False Positives (TP x revenue - FP x cost), since the True Negative and False Negative cells of the cost-benefit matrix are 0.
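The construction of such a curve can be sketched directly: rank customers by predicted churn probability, then accumulate per-customer profit as more of the list is targeted. The per-customer values (87 / -5) come from the cost-benefit matrix above; the ranking and labels below are synthetic, not the assignment's data:

```python
TP_VALUE, FP_COST = 87, -5  # Euros, from the cost-benefit matrix

def profit_curve(labels_by_score):
    """Cumulative profit when targeting the top k customers, k = 1..n.

    labels_by_score: 1 = would churn ("leave"), 0 = would stay, ordered
    from highest to lowest predicted churn probability.
    """
    profits, total = [], 0
    for label in labels_by_score:
        total += TP_VALUE if label == 1 else FP_COST
        profits.append(total)
    return profits

# Synthetic ranking: the top 3 of 10 customers are true churners.
curve = profit_curve([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
best_k = max(range(len(curve)), key=lambda k: curve[k]) + 1
print(best_k, max(curve))   # 3 261
```

Here the peak falls exactly where the true churners run out, mirroring the 30% cut-off read off the plot above.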