Big Data Management and Architecture Assignment
Individual Assignment
UTSAV PALIWAL
STUDENT NUMBER - 555818
A1.
A1.1
To interpret the plots above, we first need to differentiate between accuracy and AUC. Accuracy measures how often a model classifies an instance correctly as positive or negative, whereas AUC is the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative one.
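The distinction can be made concrete with a small sketch (toy data, not the assignment's data set): a majority-class predictor on a 70/30 split scores 70% accuracy while its AUC, computed with the pairwise-ranking definition above, is only 0.5.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """P(score of a positive > score of a negative); ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 7 "stay" (0) vs 3 "leave" (1): always predicting the majority class
# scores 70% accuracy but is useless at ranking (AUC = 0.5).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(accuracy(y_true, [0] * 10))   # 0.7
print(auc(y_true, [0.0] * 10))      # 0.5
```

This is exactly why accuracy flatters a model on imbalanced data while AUC does not.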
Decision trees - In plots 1 and 2, the decision tree's performance stays roughly constant until about 80-160 samples. With so few samples, the tree effectively classifies every customer as "stay", and since 70% of the training data belongs to the "stay" class, it is correct 70% of the time. The figures also overstate small-sample performance because the model is trained and evaluated on the same training data, which leads to overfitting. Only once the sample grows to a reasonable size does the decision tree become highly accurate, thanks to its high flexibility.
Logistic regression - Logistic regression produces better-calibrated probability estimates and tends to perform well on small data sets, which makes it useful when the probability of an event must be estimated from limited data. This is why the logistic regression model outperforms the decision tree on the AUC plot, but not the accuracy plot, at the smaller sample sizes of 40 and 80.
A1.2
The ideal learning curve is shaped like a clockwise 90-degree-rotated "L": it rises steeply at first, then tapers off as the sample size grows and eventually flattens as the marginal benefit of additional data diminishes. Therefore, beyond a certain point it makes no business sense to collect more data: from a time- and cost-saving perspective, the additional data would bring significant expense but no significant improvement to the model.
In our case, the decision tree's curve has still not tapered off at the right-hand end. This means there is still scope to collect and incorporate more samples into the study, as they would bring significant gains in accuracy. The logistic regression curve, on the other hand, starts to slope downward around the 300-sample mark, indicating that larger samples offer it no further benefit.
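The stopping logic above can be sketched numerically. The learning-curve values, batch cost, and value-per-accuracy-point below are all hypothetical placeholders, purely to show how the marginal-benefit comparison would be made:

```python
def stopping_point(curve, cost_per_batch, value_per_point):
    """Return the sample size after which the next data batch is not worth
    collecting: its accuracy gain, valued in money, falls below its cost."""
    sizes = sorted(curve)
    for prev, cur in zip(sizes, sizes[1:]):
        gain = curve[cur] - curve[prev]
        if gain * value_per_point < cost_per_batch:
            return prev
    return sizes[-1]  # never stopped: the curve has not tapered off yet

# Hypothetical plateauing learning curve (accuracy at each sample size).
learning_curve = {40: 0.62, 80: 0.68, 120: 0.72, 160: 0.745,
                  200: 0.76, 240: 0.768, 280: 0.772, 320: 0.774}

print(stopping_point(learning_curve, cost_per_batch=100, value_per_point=5000))
# -> 160: the 160 -> 200 step gains only 0.015 accuracy, worth 75 < 100
```

A curve that is still rising at its last point (like our decision tree's) would return the final sample size, i.e. "keep collecting".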
A2.
[Figure: Fitting graph for the decision tree - accuracy and AUC (y-axis, 65% to 79%) plotted against minimum leaf size (x-axis, 1 to 291).]
A2.1
Firstly, we can see that the model overfits badly at minimum leaf sizes below about 20, as shown by the sharp rise in both AUC and accuracy once the minimum leaf size passes 20 and model complexity starts to decrease.
Beyond 20, accuracy and AUC reach a reasonable fit at roughly 77% and 75% respectively. This trend continues until a minimum leaf size of about 150, where accuracy starts to dip slightly and the gap between the two curves narrows. At this complexity level the model begins to underfit just a little, making this the "sweet spot" of complexity.
Given AUC's greater reliability for this prediction task, we should treat the model with a minimum leaf size of 150 as the optimal one. Another reason for preferring AUC over accuracy is the class imbalance between leave (2,762) and stay (7,094) in the sample data set.
A2.2
The variable importance table tells us how much influence each variable has on the model. It is important to analyze this table because some variables that might be considered important may in fact carry little significance, and vice versa.
From the table above, we can see that overage is by far the most significant predictor in the model, with the maximum importance of 1. Income and leftover are also important variables by the same logic. Handset price and over-15-minute calls, however, are the least significant variables, with importances close to 0. We can therefore consider discarding them from the model to save the time and resources spent on computing them.
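The pruning step can be sketched as a simple threshold filter. Only overage's importance of 1.0 is stated in the text; the other values below are hypothetical placeholders standing in for the table:

```python
# Importance values: "overage" = 1.00 is from the text; the rest are
# hypothetical stand-ins for the variable importance table.
importances = {
    "overage": 1.00,
    "income": 0.55,            # hypothetical
    "leftover": 0.48,          # hypothetical
    "handset_price": 0.03,     # hypothetical, near zero
    "over_15min_calls": 0.01,  # hypothetical, near zero
}

THRESHOLD = 0.05  # assumed cut-off for "close to 0"
kept = {v: w for v, w in importances.items() if w >= THRESHOLD}
dropped = sorted(set(importances) - set(kept))
print(dropped)   # ['handset_price', 'over_15min_calls']
```

Variables below the threshold are the candidates for removal discussed above.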
A3.
A3.1
We should move ahead with the decision tree as the model of choice. As the previous questions show, it is the best-performing model for our data set and objectives: it achieves higher accuracy than logistic regression (as shown in Q1) and, thanks to its greater flexibility, is better suited to our large data set.
A3.2
The following are the assumptions we need to make in order to apply the Expected Value Framework:
1. Base scenario - the default scenario against which all costs are compared. In this case, the base scenario is to target no customers at all.
2. Targeting cost - targeting a customer costs 5 Euros, incurred whether or not the offer is accepted.
3. Timeline - the analysis covers only the 6-month window from the time customers are targeted, and costs are incorporated accordingly.
A3.3
Cost-Benefit Matrix

                      Leave (Actual)    Stay (Actual)
Leave (Predicted)     87 Euros          -5 Euros
Stay (Predicted)      0 Euros           0 Euros
The following are the justifications for each cell of the cost-benefit matrix:
1. Benefit from customers targeted successfully (cell 2,2) - These are the True Positives. Our benefit is 50% of 50 Euros x 6 months (because of the discount), minus the 5 Euro targeting cost, which comes to 145 Euros. But since the offer is only 60% effective, we only realise 60% of the 145 Euros. The benefit is therefore 145 x 0.6 = 87 Euros.
2. Loss from customers targeted unsuccessfully (cell 2,3) - These are the False Positives. Our loss in this case is the cost incurred to target them: 5 Euros, per the 2nd assumption.
3. Benefit from not targeting those who would not have accepted (cell 3,3) - These are the True Negatives: people who, had we targeted them, would have declined the offer. The benefit here is 0 Euros, since in the base scenario we were not going to target anyone anyway, making both benefits 0 Euros.
4. Loss from not targeting those who would have accepted (cell 3,2) - These are the False Negatives: people who, had we targeted them, would have accepted our offer. The loss is 0 Euros, because compared with the base scenario we were not targeting anyone anyway, making both losses 0 Euros.
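The arithmetic behind the matrix can be checked in a few lines. The cell values (87, -5, 0, 0) and the 50 Euro / 50% / 60% / 5 Euro figures come from the text above; the confusion-matrix counts at the end are hypothetical, purely to show how the matrix turns into a campaign profit:

```python
# Cost-benefit matrix from the text: TP = 87, FP = -5, TN = FN = 0 Euros.
benefit = {"TP": 87, "FP": -5, "TN": 0, "FN": 0}

# The TP cell itself: 50% discount on 50 EUR x 6 months, minus the 5 EUR
# targeting cost, scaled by the 60% offer effectiveness.
tp_value = (0.5 * 50 * 6 - 5) * 0.6
print(tp_value)   # 87.0

def expected_profit(counts):
    """Total profit for a confusion matrix given as a dict of cell counts."""
    return sum(benefit[cell] * n for cell, n in counts.items())

# Hypothetical campaign outcome: 1000 targeted customers, 300 true churners.
print(expected_profit({"TP": 300, "FP": 700, "TN": 0, "FN": 0}))  # 22600
```

Because the TN and FN cells are 0, only the targeted customers ever affect the total, which is what makes the base scenario a clean comparison point.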
A3.4
A3.5
[Figure: Profit curve - profit in Euros (y-axis, EUR 0 to EUR 45,000) plotted against % of customers to be targeted (x-axis, 0% to 100%).]
According to the plot presented above, at most the top 30% of the most likely churners should be targeted, because beyond that point there is no additional profit to be made.
Our True Positives stay constant because no one beyond that point is classified as "Leave" at the 0.5 confidence level, and the same holds for the False Positives. Likewise for the True Negatives and False Negatives: there are no "Stay" classifications among the top 30% of churners, so those counts start increasing only after 30%.
Despite this, profit stays constant because it is a function only of the True Positives and False Positives (TP x revenue - FP x cost), since the True Negative and False Negative cells of the cost-benefit matrix are 0.
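The construction of such a curve can be sketched directly: rank customers by predicted churn probability, then accumulate per-customer profit as more of the list is targeted. The per-customer values (87 / -5) come from the cost-benefit matrix above; the ranking and labels below are synthetic, not the assignment's data:

```python
TP_VALUE, FP_COST = 87, -5  # Euros, from the cost-benefit matrix

def profit_curve(labels_by_score):
    """Cumulative profit when targeting the top k customers, k = 1..n.

    labels_by_score: 1 = would churn ("leave"), 0 = would stay, ordered
    from highest to lowest predicted churn probability.
    """
    profits, total = [], 0
    for label in labels_by_score:
        total += TP_VALUE if label == 1 else FP_COST
        profits.append(total)
    return profits

# Synthetic ranking: the top 3 of 10 customers are true churners.
curve = profit_curve([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
best_k = max(range(len(curve)), key=lambda k: curve[k]) + 1
print(best_k, max(curve))   # 3 261
```

Here the peak falls exactly where the true churners run out, mirroring the 30% cut-off read off the plot above.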