
CAPSTONE GRADED PROJECT -1

A PROJECT REPORT

Submitted by

BALAJI S (PGP-DSBA, JUNE 2023 TO JUNE 2024)


Introduction of the business problem
Problem Statement: -
A DTH company is facing a lot of competition in the current market, and it has become a challenge to retain existing customers in the current situation. Hence, the DTH company wants to develop a model through which it can predict churn at the account level and provide segmented offers to the potential churners. For this company, account churn is a major concern because one account can have multiple customers; hence, by losing one account the company might be losing more than one customer.
We have been assigned to develop a churn prediction model for this company and provide business recommendations on the campaign. The model or campaign has to be unique and has to be sharp when offers are suggested. The suggested offers should create a win-win situation for the company as well as the customers, so that the company does not take a hit on revenue and, on the other hand, is able to retain its customers.

Need of the study/project


This study/project is essential for the client to plan for the future in terms of product design, sales, and rolling out different offers for different segments of customers. The outcome of this project will give a clear understanding of where the firm stands now and what capacity it holds in terms of taking risk. It will also indicate the future prospects of the organization, how it can do even better and plan accordingly, and how it can retain customers in the longer run.

Understanding business/social opportunity


This is a case study of a DTH company in which customers are assigned a unique account ID, and a single account ID can hold many customers (like a family plan) across genders and marital statuses. Customers get flexibility in the mode of payment they want to opt for. Customers are further segmented across the various plans they opt for as per their usage, which is also based on the device they use (computer or mobile); moreover, they earn cashback on bill payments.
The overall business runs on customer loyalty and stickiness, which in turn comes from providing quality, value-added services. Also, running various promotional and festival offers may help the organization acquire new customers as well as retain the old ones.
We can conclude that a customer retained is regular income for the organization, a customer added is new income for the organization, and a customer lost has a negative impact, since a single account ID holds multiple customers, i.e., closure of one account ID means losing multiple customers.
It is a great opportunity for the company, as almost every individual or family needs a DTH connection, which in turn also leads to increased competition.
The question arises: how can a company differentiate itself from its competitors, and which parameters play a vital role in earning customer loyalty and making customers stay? How well these questions are answered will decide the best player in the market.
Data Report
Dataset of the problem: Customer Churn Data

Data Dictionary: -
• AccountID -- account unique identifier
• Churn -- account churn flag (target variable)
• Tenure -- tenure of the account
• City_Tier -- tier of the primary customer's city
• CC_Contacted_L12m -- how many times all the customers of the account have contacted customer care in the last 12 months
• Payment -- preferred payment mode of the customers in the account
• Gender -- gender of the primary customer of the account
• Service_Score -- satisfaction score given by customers of the account on the service provided by the company
• Account_user_count -- number of customers tagged with this account
• account_segment -- account segmentation on the basis of spend
• CC_Agent_Score -- satisfaction score given by customers of the account on the customer care service provided by the company
• Marital_Status -- marital status of the primary customer of the account
• rev_per_month -- monthly average revenue generated by the account in the last 12 months
• Complain_l12m -- whether any complaint has been raised by the account in the last 12 months
• rev_growth_yoy -- revenue growth percentage of the account (last 12 months vs. last 24 to 13 months)
• coupon_used_l12m -- how many times customers have used coupons to make payments in the last 12 months
• Day_Since_CC_connect -- number of days since any customer in the account last contacted customer care
• cashback_l12m -- monthly average cashback generated by the account in the last 12 months
• Login_device -- preferred login device of the customers in the account
Data Ingestion: -
Loaded the required packages, set the working directory, and
loaded the data file.
The data set has 11,260 observations and 19 variables (18
independent and 1 dependent or target variable).
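A minimal sketch of this ingestion step with pandas is shown below; the file name "Customer_Churn_Data.xlsx" and the Excel format are assumptions, not taken from the report.

```python
# Ingestion sketch -- file name and format are assumptions; adjust to the actual export.
import pandas as pd

df = pd.read_excel("Customer_Churn_Data.xlsx")   # use pd.read_csv(...) for a CSV file

print(df.shape)    # expected (11260, 19): 11,260 observations, 19 variables
print(df.head())   # top 5 rows, as shown in Table 1
df.info()          # column data types and non-null counts
```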

Table 1 – glimpse of the data-frame head with top 5 rows


Understanding how the data was collected in terms of time, frequency and methodology
• Data has been collected for 11,260 randomly selected unique account IDs, across genders and marital statuses.
• Looking at the variables “CC_Contacted_L12m”, “rev_per_month”, “Complain_l12m”, “rev_growth_yoy”, “coupon_used_l12m”, “Day_Since_CC_connect” and “cashback_l12m”, we can conclude that the data has been collected for the last 12 months.
• The data has 19 variables, 18 independent and 1 dependent (target) variable, which shows whether the customer churned or not.
• The data combines the services customers are using along with their payment options and basic individual details. It is a mix of categorical and continuous variables.
Visual inspection of data (rows, columns, descriptive details)

Data has 11,260 rows and 19 variables.

Table 2:- Dataset Information

Fig 1:- Shape of dataset



• Describing the data: the summary statistics below show how the statistical measures vary across variables, which indicates that each variable has its own scale and distribution.
Table 3: - Describing Dataset

1. Except for the variables “AccountID”, “Churn”, “rev_growth_yoy” and “coupon_used_for_payment”, all other variables have null values present.

Table 4: - Showing Null Values in Dataset

The data has no duplicate observations.
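The null-value and duplicate checks above can be reproduced with pandas along these lines (a sketch, assuming the loaded data-frame is named df):

```python
# Null values per variable (Table 4) and duplicate-row count.
null_counts = df.isnull().sum().sort_values(ascending=False)
print(null_counts)

print("Duplicate observations:", df.duplicated().sum())   # expected to be 0
```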

Understanding of attributes (variable info, renaming if required)


This project has 18 attributes contributing towards the target variable. Let us discuss these variables one by one.
• AccountID – This variable is a unique ID representing a unique account. It is of integer data type and has no null values.
• Churn – This is our target variable, which represents whether the account has churned or not. It is categorical in nature with no null values; “0” represents “No” and “1” represents “Yes”.
• Tenure – This represents the total tenure of the account since it was opened. It is a continuous variable with 102 null values.
• City_Tier – This variable segregates customers into 3 tiers based on the city in which the primary customer resides. It is categorical in nature and has 112 null values.
• CC_Contacted_L12m – This variable represents the number of times all the customers of the account have contacted customer care in the last 12 months. It is continuous in nature and has 102 null values.
• Payment – This variable represents the preferred mode of bill payment opted for by the customers. It is categorical in nature and has 109 null values.
• Gender – This variable represents the gender of the primary account holder. It is categorical in nature and has 108 null values.
• Service_Score – Score given by the customers for the service provided by the company. It is categorical in nature and has 98 null values.
• Account_user_count – This variable gives the number of customers attached to an AccountID. It is continuous in nature and has 112 null values.
• account_segment – This variable segregates customers into different segments based on their spend and revenue generation. It is categorical in nature and has 97 null values.
• CC_Agent_Score – Score given by the customers for the service provided by the company's customer care representatives. It is categorical in nature and has 116 null values.
• Marital_Status – This represents the marital status of the primary account holder. It is categorical in nature and has 212 null values.
• rev_per_month – This represents the average monthly revenue generated per account ID in the last 12 months. It is continuous in nature and has 102 null values.
• Complain_l12m – This denotes whether the account has raised any complaint in the last 12 months. It is categorical in nature and has 357 null values.
• rev_growth_yoy – This variable shows the revenue growth percentage of the account for the last 12 months vs. the last 24 to 13 months. It is continuous in nature and has no null values.
• coupon_used_l12m – This represents the number of times customers have used discount coupons for bill payment. It is continuous in nature and has no null values.
• Day_Since_CC_connect – This represents the number of days since a customer of the account last contacted customer care; a higher number of days suggests better service. It is continuous in nature and has 357 null values.
• cashback_l12m – This variable represents the amount of cashback earned by the customers during bill payment. It is continuous in nature and has 471 null values.
• Login_device – This variable represents the device on which the customers avail the services, whether phone or computer. It is categorical in nature and has 221 null values.

❖ With the above understanding of the data, renaming any of the variables is not required.
❖ With this understanding we can move towards the EDA part, where we will understand the data better while treating bad data, null values, and outliers.

Exploratory data analysis


Univariate analysis (distribution and spread for every continuous
attribute, distribution of data in categories for categorical ones)
Univariate Analysis: -
• Several variables show outliers in the data, which need to be treated in further steps.
Table 5: - Showing Outliers in data
❖ None of the variables follows a normal distribution; all are skewed in nature.
Fig 2: - Count plots of the categorical variables
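The box plots and count plots referred to above could be produced with a loop like the sketch below; the split into continuous and categorical columns is an assumption based on the data dictionary and the column names used later in this report.

```python
# Univariate sketch: box plots for continuous variables, count plots for categorical ones.
import matplotlib.pyplot as plt
import seaborn as sns

continuous_cols = ["Tenure", "CC_Contacted_LY", "Account_user_count", "rev_per_month",
                   "rev_growth_yoy", "coupon_used_for_payment",
                   "Day_Since_CC_connect", "cashback"]
categorical_cols = ["City_Tier", "Payment", "Gender", "Service_Score",
                    "account_segment", "CC_Agent_Score", "Marital_Status",
                    "Complain_ly", "Login_device"]

for col in continuous_cols:
    sns.boxplot(x=df[col])                # dots beyond the whiskers are the outliers
    plt.title(f"Box plot of {col}")
    plt.show()

for col in categorical_cols:
    sns.countplot(x=col, data=df)         # category frequencies, as in Fig 2
    plt.title(f"Count plot of {col}")
    plt.show()
```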
Inferences from the count plots: -
• Most customers are from city tier “1”, which indicates a higher population density in this city type.
• Most customers prefer debit and credit cards as their mode of payment.
• The proportion of male customers is higher compared to female customers.
• The average service score given by customers for the service provided is around “3”, which shows there is room for improvement.
• Most of the customers are in the “Super+” segment and the fewest customers are in the “Regular” segment.
• Most of the customers availing the services are “Married”.
• Most customers prefer “Mobile” as the device to avail the services.
Bi-variate Analysis: -
• Pair plot across the categorical variables and their relationship with the target variable (see the sketch below).
Fig 4: - Pair plot across categorical variables
• The pair plot shown above indicates that the independent variables are weak predictors of the target variable, as the densities of each independent variable for churned and non-churned accounts largely overlap.
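A sketch of how a pair plot coloured by the target variable can be drawn with seaborn is given below; the subset of columns shown here is illustrative only and not necessarily the set used in Fig 4.

```python
# Pair plot of a few variables, coloured by the churn flag.
import seaborn as sns

sns.pairplot(df, hue="Churn",
             vars=["Tenure", "rev_per_month", "cashback", "Day_Since_CC_connect"],
             corner=True)
```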

Correlation among variables: -

We performed the correlation analysis between variables after treating bad data and missing values. We also converted the categorical variables into integer data types, because correlations cannot be computed on categorical data types, as shown in the picture below.
Fig 6: - Correlation among variables
Inferences from correlation: -
• The variable “Tenure” shows a relatively high correlation with Churn.
• The variable “Marital_Status” shows a relatively high correlation with Churn.
• The variable “Complain_ly” shows a relatively high correlation with Churn.
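A sketch of the correlation heat map, assuming the categorical columns have already been converted to integer codes as described above:

```python
# Correlation matrix and heat map (AccountID excluded, as it is only an identifier).
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.drop(columns=["AccountID"]).corr(numeric_only=True)
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation among variables")
plt.show()
```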

Removal of unwanted variables: - After an in-depth understanding of the data, we conclude that removal of variables is not required at this stage of the project. The variable “AccountID”, which only denotes the unique ID assigned to each account, could be removed; however, removing it leads to 8 duplicate rows, so it is retained for now. All the remaining variables look important based on the univariate and bi-variate analysis.

Outlier treatment: -
This dataset is a mix of continuous and categorical variables. It does not make any sense to perform outlier treatment on categorical variables, as each category denotes a type of customer. So we perform outlier treatment only on the variables that are continuous in nature.
• Used box plots to determine the presence of outliers in a variable.
• The dots beyond the whiskers (the upper and lower limits) represent the outliers in the variable.
• We have 8 continuous variables in the dataset, namely “Tenure”, “CC_Contacted_LY”, “Account_user_count”, “cashback”, “rev_per_month”, “Day_Since_CC_connect”, “coupon_used_for_payment” and “rev_growth_yoy”.
• We used the upper limit and lower limit to treat the outliers (see the sketch after the figure below). Below is the pictorial representation of the variables before and after outlier treatment.
Before After
Fig 7: - Before and after outlier treatment
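The report does not show the exact treatment code; one common way to implement the "upper limit and lower limit" approach described above is IQR-based capping, sketched here under that assumption.

```python
# Cap each continuous variable at its IQR-based lower and upper limits
# (values beyond the limits are pulled back to the limit rather than dropped).
# Assumes these columns are already numeric, i.e. bad data has been converted to NaN.
continuous_cols = ["Tenure", "CC_Contacted_LY", "Account_user_count", "cashback",
                   "rev_per_month", "Day_Since_CC_connect",
                   "coupon_used_for_payment", "rev_growth_yoy"]

for col in continuous_cols:
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df[col] = df[col].clip(lower=lower, upper=upper)
```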
Missing value treatment and variable transformation: -
• Out of 19 variables, we have data anomalies present in 17 variables and null values in 15 variables.
• We use the median to impute null values where the variable is continuous, because the median is less affected by outliers than the mean.
• We use the mode to impute null values where the variable is categorical.
• We treated the null values variable by variable, as each variable is unique; the general pattern is sketched below.
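A compact sketch of that general pattern is shown here; the per-variable steps below additionally replace anomaly symbols ("#", "@", "$", etc.) with NaN before imputing, so this loop assumes that step has already been done.

```python
# General imputation pattern: median for continuous columns, mode for categorical ones.
# continuous_cols / categorical_cols are the lists defined in the earlier sketch.
import pandas as pd

for col in continuous_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")   # any leftover anomalies become NaN
    df[col] = df[col].fillna(df[col].median())

for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])
```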
Treating Variable “Tenure”
• We look at the unique observations in the variable and see that “#” and “nan” are present in the data, where “#” is an anomaly and “nan” represents a null value.
Fig 8: - before treatment
• We replace “#” with “nan” and then replace “nan” with the calculated median of the variable; after that there is no remaining bad data or null values.
• Converted the data type to integer, because the IDE had recognized it as object data type due to the presence of bad data.
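A sketch of these steps for “Tenure”, following the anomaly replacement and median imputation described above:

```python
# Tenure: replace the "#" anomaly with NaN, impute the median, then cast to integer.
import numpy as np
import pandas as pd

df["Tenure"] = df["Tenure"].replace("#", np.nan)
df["Tenure"] = pd.to_numeric(df["Tenure"], errors="coerce")
df["Tenure"] = df["Tenure"].fillna(df["Tenure"].median()).astype(int)
```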
Treating Variable “City_Tier”
• We look at the unique observations in the variable and see the presence of null values, as shown below.

Fig 9: - before treatment

• We replace “nan” with the calculated mode of the variable; after that there are no remaining null values.
• Converted the data type to integer, because the IDE had recognized it as a non-integer data type due to the presence of null values.

Treating Variable “CC_Contacted_LY”
• We look at the unique observations in the variable and see the presence of null values, as shown below.
• We replace “nan” with the calculated median of the variable; after that there are no remaining null values.
• Converted the data type to integer, because the IDE had recognized it as a non-integer data type due to the presence of null values.
Treating Variable “Payment”
• We look at the unique observations in the variable and see the presence of null values, as shown below.

Fig 10: - before treatment

• We replace “nan” with the calculated mode of the variable; after that there are no remaining null values.
• Also performed label encoding of the observations, where 1 = Debit Card, 2 = UPI, 3 = Credit Card, 4 = Cash on Delivery and 5 = E-wallet, and then converted them to integer data type as they will be used for further model building.
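A sketch of the mode imputation and label encoding for “Payment”; the exact category spellings in the raw file are assumptions.

```python
# Payment: impute the mode, then map each category to the integer code used in the report.
df["Payment"] = df["Payment"].fillna(df["Payment"].mode()[0])

payment_map = {"Debit Card": 1, "UPI": 2, "Credit Card": 3,
               "Cash on Delivery": 4, "E wallet": 5}   # spellings are assumptions
df["Payment"] = df["Payment"].map(payment_map).astype(int)
```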
Treating Variable “Gender”
• We look at the unique observations in the variable and see the presence of null values and multiple abbreviations of the same observations, as shown below.

Fig 11: - before treatment

• We standardize the abbreviations and replace “nan” with the calculated mode of the variable; after that there are no remaining null values.
• Also performed label encoding of the observations, where 1 = Female and 2 = Male, and then converted them to integer data type as they will be used for further model building.
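A sketch for “Gender”, assuming the raw file abbreviates the same category as “F”/“Female” and “M”/“Male” (the exact raw values are assumptions):

```python
# Gender: unify abbreviations, impute the mode, then encode 1 = Female, 2 = Male.
df["Gender"] = df["Gender"].replace({"F": "Female", "M": "Male"})
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])
df["Gender"] = df["Gender"].map({"Female": 1, "Male": 2}).astype(int)
```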
Treating Variable “Service_Score”
• We look at the unique observations in the variable and see the presence of null values, as shown below.
• We replace “nan” with the calculated mode of the variable; after that there are no remaining null values.
• Then converted them to integer data type as they will be used for further model building.
Treating Variable “Account_user_count”
• We look at the unique observations in the variable and see the presence of null values as well as “@” as bad data, as shown below.

Fig 12: - before treatment

• We replace “@” with “nan” and then replace “nan” with the calculated median of the variable; after that there is no remaining bad data or null values.
• Then converted them to integer data type as they will be used for further model building.
Treating Variable “account_segment”
• We look at the unique observations in the variable and see the presence of null values as well as different denotations for the same type of observation, as shown below.

Fig 13: - before treatment

• We replace “nan” with the calculated mode of the variable and also label the different account segments, where 1 = Super, 2 = Regular Plus, 3 = Regular, 4 = HNI and 5 = Super Plus; after that there is no remaining bad data or null values.
• Then converted them to integer data type as they will be used for further model building.
Treating Variable “CC_Agent_Score”
• We look at the unique observations in the variable and see the presence of null values, as shown below.

• We replace “nan” with the calculated mode of the variable; after that there are no remaining null values.
• Then converted them to integer data type as they will be used for further model building.

Treating Variable “Marital_Status”
• We look at the unique observations in the variable and see the presence of null values, as shown below.

Fig 14: - before treatment

• We replace “nan” with the calculated mode of the variable and also label the observations, where 1 = Single, 2 = Divorced and 3 = Married; after that there are no remaining null values.
• Then converted them to integer data type as they will be used for further model building.
Treating Variable “rev_per_month”
• We look at the unique observations in the variable and see the presence of null values as well as “+”, which denotes bad data, as shown below.
Fig 15: - before treatment
• We replace “+” with “nan” and then replace “nan” with the calculated median of the variable; after that there is no remaining bad data or null values.
• Then converted them to integer data type as they will be used for further model building.
Treating Variable “Complain_ly”
• We look at the unique observations in the variable and see the presence of null values.
• We replace “nan” with the calculated mode of the variable; after that there are no remaining null values.
• Then converted them to integer data type as they will be used for further model building.
Treating Variable “rev_growth_yoy”
• We look at the unique observations in the variable and see the presence of “$”, which denotes bad data, as shown below.

Fig 15: - before treatment

• We replace “$” with “nan” and then replace “nan” with the calculated median of the variable; after that there is no remaining bad data or null values.
• Then converted them to integer data type as they will be used for further model building.

Treating Variable “coupon_used_for_payment”
• We look at the unique observations in the variable and see the presence of “$”, “*” and “#”, which denote bad data, as shown below.

• We replace “$”, “*” and “#” with “nan” and then replace “nan” with the calculated median of the variable; after that there is no remaining bad data or null values.
• Then converted them to integer data type as they will be used for further model building.
Treating Variable “Day_Since_CC_connect”
• We look at the unique observations in the variable and see the presence of “$”, which denotes bad data, as well as null values.
• We replace “$” with “nan” and then replace “nan” with the calculated median of the variable; after that there is no remaining bad data or null values.
• Then converted them to integer data type as they will be used for further model building.

Treating Variable “cashback”
• We look at the unique observations in the variable and see the presence of “$”, which denotes bad data, as well as null values.
• We replace “$” with “nan” and then replace “nan” with the calculated median of the variable; after that there is no remaining bad data or null values.
• Then converted them to integer data type as they will be used for further model building.
Treating Variable “Login_device”
• We look at the unique observations in the variable and see the presence of “&&&&”, which denotes bad data, as well as null values.
• We replace “&&&&” with “nan” and then replace “nan” with the calculated mode of the variable.
• Also labelled the observations, where 1 = Mobile and 2 = Computer; after that there is no remaining bad data or null values.
• Then converted them to integer data type as they will be used for further model building.
Count of null values before and after treatment

Before After

Fig 16: - Before and after null value treatment


• We see zero null values across the variables, which indicates that the data is now clean and we can move on to data transformation where required.
Variable transformation: -
• We see that the different variables have different dimensions. For example, the variable “cashback” denotes a currency amount, whereas “CC_Agent_Score” denotes a rating provided by the customers; as a result, their statistical summaries also differ widely.
• Scaling is therefore required for this dataset, which will bring all variables onto a comparable 0-to-1 scale.
• We use the MinMax scaler to perform the normalization of the data (see the sketch below).
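A sketch of the MinMax scaling step with scikit-learn, assuming the identifier and target columns are left unscaled:

```python
# Scale all feature columns to the 0-1 range; AccountID and Churn are excluded.
from sklearn.preprocessing import MinMaxScaler

feature_cols = [c for c in df.columns if c not in ("AccountID", "Churn")]
scaler = MinMaxScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])
```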
Addition of new variables: -
At the current stage we do not see a need to create any new variables. They may be required at a later stage of model building and can be created accordingly.
Business insights from EDA
Is the data unbalanced? If so, what can be done? Please explain in
the context of the business
• The dataset provided is imbalanced in nature: the class counts of our target variable “Churn” show a large difference.
• We have 9,364 observations with “Churn” = 0 and 1,896 with “Churn” = 1, i.e. roughly 83% vs. 17% (a quick check is sketched after the figure below).

Fig 18: - Imbalanced dataset
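A quick check of the class balance; the counts come from the figures above, and the percentages follow from 9,364 and 1,896 out of 11,260.

```python
# Class distribution of the target variable.
print(df["Churn"].value_counts())                         # expected: 0 -> 9364, 1 -> 1896
print(df["Churn"].value_counts(normalize=True).round(3))  # roughly 0.832 vs 0.168
```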


Any other business insights
• We see decent variation in the data collected, with a mixture of the services provided, the ratings given by the customers, and the customer profiles.
• The business needs to increase its visibility in tier 2 cities and can acquire new customers there.
• The business can promote payment via standing instruction on the bank account or UPI, which is hassle-free and safe for the customer.
• There is a need for improvement in the service scores; a lot of grey area is left to cover.
• The business can roll out a survey for a better understanding of customers' expectations.
• The business can train its customer care executives to provide a better customer experience, which in turn will improve their feedback scores.
• The business can have curated plans for customers based not only on their spend but also on the tenure they have spent with the business.
• The business can have a curated plan for married people, something like a family floater.
End of Project Note - 1
