Customer Churn Internship Report PDF
on
CUSTOMER CHURN PREDICTION IN TELECOM
INDUSTRY
Submitted by
R SANTOSHKUMAR REDDY
22695A3114
BACHELOR OF TECHNOLOGY
In
ARTIFICIAL INTELLIGENCE
2024-2025
BONAFIDE CERTIFICATE
This is to certify that the internship work entitled “Customer churn prediction in telecom industry”
is a bonafide work carried out by
Submitted in partial fulfillment of the requirements for the award of degree Bachelor of Technology
in the Department of Artificial Intelligence, Madanapalle Institute of Technology & Science,
Madanapalle, affiliated to Jawaharlal Nehru Technological University Anantapur,
Ananthapuramu during the academic year 2024-2025.
ACKNOWLEDGEMENT
I sincerely thank Dr. C. Yuvaraj, M.E., Ph.D., Principal, for guiding me and providing facilities
for the successful completion of my project at Madanapalle Institute of Technology & Science,
Madanapalle.
I express my deep sense of gratitude to Dr. K. Chokkanathan, M. Tech., Ph.D., Associate
Professor & Head, Department of AI, for his continuous support in making the necessary arrangements
for the successful completion of the project.
I express my sincere thanks to the Internship Coordinator, Dr. A. Poongodai, Ph.D.,
Assistant Professor, Department of AI, for his tremendous support for the successful completion of
the internship project.
CERTIFICATE:
DECLARATION
I, the undersigned, hereby declare that the results embodied in this internship report,
“Customer churn prediction in telecom industry”, are a bonafide record of the work done by
me in partial fulfillment of the award of Bachelor of Technology in Artificial Intelligence
from Jawaharlal Nehru Technological University Anantapur, Anantapur. The content of
this report has not been submitted to any other University/Institute for the award of any other degree.
CONTENTS
ABSTRACT…………………………………………………………………………… 7
List of Figures…………………………………………………………………………. 8
List of Symbols, Abbreviations or Nomenclature (optional)………………………… 9
1. INTRODUCTION…………………………………………………………………….. 15
1.1. Problem Statement 10
1.2. Objective 10
1.3. Domain Technology 10
1.4. Industry Vertical 11
1.5. Data 13
1.6. Methods 14
2. HARDWARE AND SOFTWARE…………………………………………………… 17
2.1. Platform and Hardware Used 16
2.2. Software Used 17
3. PROJECT ANALYSIS……………………………………………………….……… 24
3.1. Architecture Diagram 18
3.2. Implementation 18
3.3. Algorithm / Techniques 20
3.4. Result 24
CONCLUSIONS…………………………………………………………………..………. 25
APPENDICES………………………………………………………………………..……. 33
SOURCE CODE 26
SCREEN SHOT 27
BIBLIOGRAPHY………………………………………………………………….……… 34
ABSTRACT
In this project we implement a machine learning approach to predict customer churn
in the telecom industry.
Customer churn prediction is a crucial analytical technique for the telecom industry to identify
customers likely to discontinue their services. High churn rates can lead to revenue losses and increased
customer acquisition costs.
This project explores the development of a predictive model that can accurately forecast
customer churn using historical data, including customer demographics, service usage, billing
information, and interaction history. By leveraging machine learning techniques, such as logistic
regression, decision trees, and ensemble methods, we aim to identify patterns that contribute to customer
attrition. The model's insights enable telecom providers to implement targeted retention strategies,
enhance customer satisfaction, and ultimately reduce churn rates.
LIST OF FIGURES
Figure
Number Description Page No
LIST OF ABBREVIATIONS
ACRONYM EXPANSION
DS Data Science
ML Machine Learning
AI Artificial Intelligence
CHAPTER 1
INTRODUCTION
1.2 Objective
Customer Churn Prediction in Telecom Industry
The primary objective of this project is to develop a predictive model for customer churn
in the telecom industry, enabling the early identification of customers likely to discontinue their
service. Specifically, the project aims to:
o Analyze customer behavior patterns and identify critical factors contributing to churn.
o Build and evaluate machine learning models that can predict churn with high accuracy.
o Provide actionable insights to help the telecom company design targeted retention
strategies.
o Develop a cost-effective and efficient solution to reduce churn, enhance customer
satisfaction, and increase overall customer lifetime value.
Machine Learning
Machine learning plays a crucial role in developing predictive models that can analyze
historical customer data and forecast future churners. Machine learning can be applied
at the following stages of the project.
1. Data Preprocessing
• Normalization/Scaling
Ensuring that features are on similar scales to prevent certain features from dominating
the learning process.
2. Model Selection
• Machine Learning Algorithms
Classification algorithms (logistic regression, decision trees, random forests) and more
advanced techniques like gradient boosting or neural networks can capture complex patterns.
4. Evaluation Metrics
• Accuracy, Precision, Recall (Sensitivity), F1 Score, ROC-AUC (Receiver Operating
Characteristic - Area Under Curve), Confusion Matrix
Common metrics for evaluating classification models.
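The metrics listed above can be computed with scikit-learn. The following is a minimal sketch; the label and score lists are hypothetical values chosen for illustration and are not results from this project.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical labels: 1 = churned, 0 = stayed (illustrative values only).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]
# Hypothetical predicted churn probabilities, needed for ROC-AUC.
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.4, 0.8, 0.2, 0.7, 0.1]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all samples
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)         # ranking quality of the scores

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:", accuracy, "Precision:", precision, "Recall:", recall)
print("F1:", f1, "ROC-AUC:", auc)
```

Because churners are usually the minority class, recall and F1 on the churn class tend to be more informative than accuracy alone.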
It's important to note that customer churn prediction is a challenging task because customer
behaviour is complex and changes over time. Continuous monitoring and adaptation of the model
to changing market conditions are often necessary for sustained performance. Additionally,
considering economic indicators and external events that might influence customer churn can further
enhance the predictive capabilities of the model.
In this domain, customer churn prediction has emerged as a crucial strategic focus, as it helps providers
identify which customers are at risk of leaving for a competitor, allowing for proactive retention
strategies.
Churn prediction leverages data science and machine learning to analyse customer behaviours, usage
patterns, complaints, and service interactions, identifying factors that contribute to churn.
By effectively predicting and managing churn, telecom companies can reduce revenue loss, optimize
customer service, and increase customer satisfaction and loyalty.
This focus on customer retention is critical in the telecom industry where acquisition costs are high, and
customer lifetime value is a key metric for profitability.
1.5 Data
1. Demographic Data:
Includes age, gender, location, and income level, which helps understand different customer
segments and their propensity to churn.
2. Usage Data:
Encompasses information about call usage, SMS, data usage, and roaming frequency.
Patterns in high or low usage can be indicators of customer engagement or dissatisfaction.
7. Engagement Data:
Captures how customers interact with various services, apps, promotions, and loyalty
programs. Low engagement with these offerings can indicate a lower likelihood of customer
retention.
Using these data types, machine learning models can be developed to predict churn risk
accurately, enabling telecom companies to take targeted, data-driven retention actions.
1.6 Methods
In a "Customer Churn Prediction" project, various methods and techniques are used to analyze
historical data and forecast future customer churn. Here are common methods employed in
such projects:
Decision Trees and Random Forests: Ensembles of decision trees can capture complex patterns
in historical data.
Gradient Boosting Machines (GBM): Builds a strong predictive model by combining weak
models sequentially.
Support Vector Machines (SVM): Can be used for classification to separate churners from non-churners.
1. Ensemble Methods:
Combining multiple models using ensemble techniques, such as bagging or boosting, to improve
overall prediction accuracy.
4. Hyperparameter Tuning:
Using techniques like grid search or random search to find the optimal hyperparameters for the
chosen models.
1. Cross-Validation:
Employing k-fold cross-validation to assess model performance on different subsets of the data.
2. Feature Selection:
Identifying and selecting the most relevant features to improve model efficiency and
interpretability.
5. Backtesting:
Simulating the performance of the predictive model on historical data to assess how well it would
have performed in the past.
6. Advanced Deep Learning Models:
Exploring more advanced deep learning architectures beyond traditional neural networks, such as
recurrent neural networks (RNNs) or transformer models.
8. Time Series Forecasting Libraries:
Leveraging specialized libraries like Prophet or statsmodels for time series forecasting.
Reinforcement Learning (in specific cases):
Exploring reinforcement learning approaches for dynamic decision-making, although
this is less common in churn prediction.
It's important to note that the choice of methods depends on factors such as the characteristics of
the data, the time horizon for predictions, and the specific goals of the project. Often, a
combination of methods or an ensemble approach yields the best results. Additionally, regular
model updating and adaptation to changing market conditions are crucial for sustained performance.
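The cross-validation and hyperparameter-tuning methods described in this section can be sketched as follows. This is an illustration only: the data is synthetic (generated with `make_classification` as a stand-in for a churn table), and the parameter grid and F1 scoring are assumptions, not settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic, imbalanced stand-in for a churn table (about 25% positives).
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.75, 0.25], random_state=0)

# k-fold cross-validation (k = 5) of a baseline model.
baseline = RandomForestClassifier(n_estimators=50, random_state=0)
cv_scores = cross_val_score(baseline, X, y, cv=5, scoring="f1")
print("Mean F1 over 5 folds:", cv_scores.mean())

# Grid search: try every combination in the (illustrative) grid,
# scoring each one with 5-fold cross-validation.
param_grid = {"max_depth": [3, 6, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid, cv=5, scoring="f1",
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```

`RandomizedSearchCV` has the same interface and samples a fixed number of combinations instead of trying all of them, which is cheaper when the grid is large.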
CHAPTER 2
Hardware description: the development and deployment of the application require the following
general and specific minimum hardware requirements.
Software description: the development and deployment of the application require the following
general and specific minimum software requirements.
Operating System : Windows operating system
IDE : Google Colab
Programming Language : Python 3
Domain : Machine Learning
Module : scikit-learn (sklearn)
Packages : NumPy, Pandas, Matplotlib, Seaborn, os
GOOGLE COLAB
Colaboratory, or “Colab” for short, is a product from Google Research. Colab
allows anybody to write and execute arbitrary Python code through the browser, and is
especially well suited to machine learning, data analysis and education.
Google Colab Features
Google Colab has the features of a modern IDE, plus a lot more. The
following are some of its most useful features.
➢ Machine learning and neural networks are taught using interactive tutorials.
➢ Without a local setup, you can write and execute Python 3 code.
➢ Execute terminal commands from the Notebook.
➢ Import datasets from external sources such as Kaggle.
➢ Save your Notebooks to Google Drive.
➢ Import Notebooks from Google Drive.
➢ Free cloud service, GPUs and TPUs.
➢ Integrate with PyTorch, TensorFlow, Open C
1. Programming Languages
• Python
Widely used for data analysis, machine learning, and building predictive models.
2. Data Processing and Analysis
• Pandas
A Python library for data manipulation and analysis.
• NumPy
Essential for numerical operations in Python.
• Matplotlib and Seaborn
Libraries for data visualization.
3. Machine Learning Libraries
• Scikit-learn
Offers a wide range of machine learning algorithms and tools for data preprocessing.
4. Time Series Analysis
• statsmodels
Useful for time series analysis and statistical modeling.
5. Financial Data APIs
• Yahoo Finance API, Alpha Vantage, Quandl
Sources for obtaining historical stock price data.
6. Data Visualization
• Matplotlib, Plotly
Tools for creating interactive and informative visualizations.
CHAPTER 3
PROJECT ANALYSIS
3.1 Architecture Diagram
3.2 Implementation
• Importing the data
• Importing the libraries
• Loading the dataset and analyzing the customer behavior
• Scaling the data
• Building and training the model
• Performance evaluation on test set
1. Data Collection:
Gather historical customer data for the target company, including features such as
tenure, payment and services, monthly charges and yearly charges.
2. Data Preprocessing:
Handle missing data, normalize or scale features as needed, and perform any necessary data
cleaning.
3. Feature Engineering:
Create derived features, such as dummy (one-hot) encodings of categorical variables, to enhance
the dataset.
4. Model Development:
Implement three classification models: Decision Tree, Random Forest and Support Vector
Machine (SVM).
Train each model on a subset of historical data.
5. Hyperparameter Tuning:
Optimize hyperparameters for each model to improve predictive accuracy.
6. Evaluation Metrics:
Use appropriate evaluation metrics such as Accuracy, Precision, Recall, F1-score and ROC-AUC to
assess the performance of each model.
7. Model Comparison:
Compare the accuracy of the three models and identify the one that demonstrates the highest
predictive performance.
8. Visualization:
Visualize the predicted customer churn against the actual churn labels to provide a clear understanding of
the model's performance.
9. Documentation:
Document the entire process, including data preprocessing steps, feature engineering, model
development, hyperparameter tuning, and evaluation metrics.
10. Deliverables:
A report detailing the methodology, findings, and insights gained from the customer churn
prediction models.
Visualizations comparing predicted and actual customer churn.
Codebase with well-documented scripts for data preprocessing, feature engineering, and model
development.
11. Success Criteria:
The success of the project will be measured by selecting the model that achieves the highest
accuracy in predicting customer churn, as determined by the chosen evaluation metrics.
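The workflow above (scaling, training the three models, and comparing them on a held-out test set) can be sketched as follows. This is a minimal illustration under stated assumptions: the data is synthetic rather than the Telco dataset, the hyperparameters are arbitrary, and picking the winner by F1 score is one possible success criterion.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an encoded churn table (about 27% churners).
X, y = make_classification(n_samples=500, n_features=12,
                           weights=[0.73, 0.27], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Tree-based models do not need scaling; the SVM is wrapped in a pipeline
# so that the scaler is fitted on the training data only.
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {"accuracy": accuracy_score(y_test, pred),
                     "f1": f1_score(y_test, pred)}
    print(name, results[name])

# Select the model with the highest F1 score on the test set.
best = max(results, key=lambda n: results[n]["f1"])
print("Best model by F1:", best)
```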
3.3 Algorithm/Techniques
These algorithms and techniques offer a diverse set of tools for addressing the complexities of
customer churn prediction, combining exploratory data analysis with machine learning
classifiers such as decision trees and random forests.
3.3.1 Algorithm 1
• Importing the data
• Importing the libraries
• Loading the dataset and analyzing the customer behavior
• Scaling the data
• Building and training the model
• Performance evaluation on test set
Here we import the dataset and explore it.
The data set contains the following features.
• customerID: Unique identifier for each customer.
Steps for Exploratory Data Analysis (EDA):
1. Data Cleaning:
o Handle missing values.
o Convert data types where necessary (e.g., Total Charges might need to be converted from
string to numeric).
2. Univariate Analysis:
o Distribution of the Churn variable to check churn rates.
o Analyze individual features like tenure, Monthly Charges, etc., using histograms or box plots
to understand distributions.
3. Bivariate Analysis:
o Analyze relationships between Churn and other features (e.g., gender, Contract, Payment
Method) using bar plots or stacked charts.
o Correlation analysis between numeric variables like tenure, Monthly Charges, and Total
Charges.
4. Multivariate Analysis:
o Investigate interactions between multiple variables (e.g., how tenure and Contract affect
Churn).
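The four EDA steps above can be sketched on a small table. The rows below are hypothetical; only the column names follow the Telco dataset used in the appendix, and the 12-month tenure threshold is an assumption made for the example.

```python
import pandas as pd

# Tiny hypothetical sample; TotalCharges is stored as text, with one blank entry.
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 0, 8],
    "Contract": ["Month-to-month", "One year", "Month-to-month",
                 "One year", "Two year", "Month-to-month"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 52.55, 99.65],
    "TotalCharges": ["29.85", "1889.5", "108.15", "1840.75", " ", "820.5"],
    "Churn": ["No", "No", "Yes", "No", "No", "Yes"],
})

# 1. Data cleaning: convert text to numbers (blanks become NaN), then drop them.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
n_missing = int(df["TotalCharges"].isnull().sum())
df = df.dropna(subset=["TotalCharges"]).copy()

# 2. Univariate analysis: share of churners and non-churners.
churn_rate = df["Churn"].value_counts(normalize=True)

# 3. Bivariate analysis: churn rate per contract type, and numeric correlations.
df["ChurnFlag"] = (df["Churn"] == "Yes").astype(int)
by_contract = df.groupby("Contract")["ChurnFlag"].mean()
corr = df[["tenure", "MonthlyCharges", "TotalCharges"]].corr()

# 4. Multivariate analysis: churn rate by contract type and tenure band.
df["LongTenure"] = df["tenure"] >= 12
by_contract_tenure = df.groupby(["Contract", "LongTenure"])["ChurnFlag"].mean()

print(n_missing, churn_rate, by_contract, corr, by_contract_tenure, sep="\n\n")
```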
1. Decision Trees
A Decision Tree is a simple, interpretable model that splits the data into branches based on
feature values to reach a decision (in this case, whether a customer will churn). It uses a series of
if-then rules to classify data. For instance, a tree might split customers based on features like
monthly charges, contract length, or customer service calls. Each split aims to increase the purity
of the nodes, meaning each node ideally contains mostly churners or non-churners.
Pros:
• Easy to interpret and visualize.
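Node purity can be made concrete with Gini impurity, one common purity measure (the report does not state which criterion was used, so this choice is an assumption). The sketch below uses hypothetical labels for ten customers and shows how a split lowers the weighted impurity.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical node with 10 customers: 1 = churn, 0 = no churn.
parent = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Hypothetical split, e.g. on contract type.
left = [1, 1, 1, 0]          # mostly churners
right = [1, 0, 0, 0, 0, 0]   # mostly non-churners

parent_impurity = gini(parent)
split_impurity = weighted_gini(left, right)
gain = parent_impurity - split_impurity  # larger gain means a better split

print("Parent impurity:", round(parent_impurity, 4))
print("Impurity after split:", round(split_impurity, 4))
print("Impurity reduction:", round(gain, 4))
```

A tree-building algorithm evaluates many candidate splits in this way and keeps the one with the largest impurity reduction at each node.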
2. Random Forests
A Random Forest is an ensemble method that builds multiple decision trees and combines their
predictions. Each tree in the forest is trained on a random subset of the data and features. The final
prediction is made by averaging the predictions from all the trees (for regression) or using
majority voting (for classification).
Pros:
• Reduces overfitting compared to a single decision tree.
• More accurate and robust to variations in data.
Cons:
• Less interpretable than a single decision tree.
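The idea of training trees on random subsets and combining them by majority vote can be sketched by hand. This is an illustration on synthetic data, not the project's model; the number of trees and the random seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data as a stand-in for churn data.
X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

rng = np.random.default_rng(1)
trees = []
for i in range(25):
    # Bootstrap sample: draw training rows with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Each tree votes 0 or 1; with 25 trees and two classes there are no ties.
votes = np.array([tree.predict(X_test) for tree in trees])  # (25, n_test)
forest_pred = (votes.sum(axis=0) > len(trees) / 2).astype(int)

forest_acc = (forest_pred == y_test).mean()
single_acc = (trees[0].predict(X_test) == y_test).mean()
print("First tree alone:", single_acc)
print("Majority vote of 25 trees:", forest_acc)
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes; the hand-written vote above follows the description in the text.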
3.4 Result
1. Model Performance Metrics:
Higher values of Accuracy, Precision and Recall indicate better predictive performance.
Evaluate the metrics for each model to understand how close the predicted customer churn is to
the actual churn.
2. Model Comparison:
Identify the model that consistently achieves the highest values across the chosen metrics. This
model is considered the best performer in terms of accuracy for your specific dataset.
3. Visualization:
Create visualizations, such as charts comparing the predicted customer churn of each model
against the actual churn. This provides a qualitative understanding of how well the
models capture the trends and patterns in the data.
4. Consideration of Complexity:
While accuracy is crucial, also consider the complexity of each model. Simpler models are often
preferred, especially if they achieve comparable accuracy, as they may generalize better to
unseen data.
5. Potential Next Steps:
If the selected model meets your accuracy requirements, you may proceed to deploy it for real-
time predictions or further refine it with additional features or hyperparameter tuning.
6. Documentation:
Document the results comprehensively, including insights gained from the comparison, any
challenges encountered, and recommendations for future improvements or adjustments to the
models.
CONCLUSIONS
The "Customer Churn Prediction in Telecom Industry" project aimed to leverage Decision Tree,
Random Forest, and Support Vector Machine (SVM) models to forecast the customer churn of a company. The
following key conclusions and insights were obtained from the analysis:
Random Forest predicts the customer churn of a company most precisely: its accuracy
is higher than the accuracies of the other models.
APPENDICES
SOURCE CODE:
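import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # needed for the plt.* calls below
import seaborn as sns            # needed for the sns.* calls below

# Load the Telco dataset and take a first look at it
data=pd.read_csv("/content/Telco_Customer_Churn_Dataset (1).csv")
data.head()
data.shape
data.columns.values
data.dtypes
data.describe()

# Distribution of the target variable (counts and percentages)
data['Churn'].value_counts().plot(kind='barh', figsize=(8,6))
plt.xlabel("count",labelpad=14)
plt.ylabel("Target variable",labelpad=14)
plt.title("count of Target per category",y=1.02)
data['Churn'].value_counts()
100*data['Churn'].value_counts()/len(data['Churn'])
data.info(verbose=True)

# Percentage of missing values per column
miss=pd.DataFrame((data.isnull().sum())*100/data.shape[0]).reset_index()
plt.figure(figsize=(16,5))
ax=sns.pointplot(x='index',y=0,data=miss)
plt.xticks(rotation=90,fontsize=7)
plt.title("percentage of missing value")
plt.ylabel("percentage")
plt.show()

# Assumption: telco_data is a working copy of data (its definition is missing from the listing)
telco_data=data.copy()
# TotalCharges is read as text; coerce it to numeric (blanks become NaN) and drop those rows
telco_data.TotalCharges=pd.to_numeric(telco_data.TotalCharges,errors='coerce')
telco_data.isnull().sum()
telco_data.loc[telco_data['TotalCharges'].isnull()==True]
telco_data.dropna(how='any',inplace=True)
print(telco_data['tenure'].max())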
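#Data Exploration
# Count plot of every predictor, split by Churn
for i,predictor in enumerate(telco_data.drop(columns=['Churn','TotalCharges','MonthlyCharges'])):
    plt.figure(i)
    sns.countplot(data=telco_data,x=predictor,hue='Churn')

# Assumption: the listing omits how telco_data_dummies is built. A 0/1 target and
# one-hot encoding of the remaining columns (without customerID) are assumed here.
telco_data['Churn']=np.where(telco_data.Churn=='Yes',1,0)
telco_data_dummies=pd.get_dummies(telco_data.drop(columns=['customerID']),dtype=int)

# Density of MonthlyCharges for non-churners (red) and churners (blue)
plt.figure()
Mth=sns.kdeplot(telco_data_dummies.MonthlyCharges[(telco_data_dummies["Churn"]==0)],
                color="Red",fill=True)
Mth=sns.kdeplot(telco_data_dummies.MonthlyCharges[(telco_data_dummies["Churn"]==1)],
                ax=Mth,color="Blue",fill=True)
# Legend order follows plotting order: non-churners first, then churners
Mth.legend(["No Churn","Churn"],loc='upper right')
Mth.set_ylabel('Density')
Mth.set_xlabel('Monthly Charges')
Mth.set_title('Monthly Charges by churn')

# Correlation of every feature with Churn, then the full correlation heatmap
plt.figure(figsize=(20,8))
telco_data_dummies.corr()['Churn'].sort_values(ascending=False).plot(kind='bar')
plt.figure(figsize=(20,8))
sns.heatmap(telco_data_dummies.corr(),cmap="Paired")

# Helper: log-scaled count plot of one column, optionally split by hue
def uniplot(df,col,title,hue=None):
    sns.set_style('whitegrid')
    sns.set_context('talk')
    plt.rcParams["axes.labelsize"]=20
    plt.rcParams['axes.titlesize']=22
    plt.rcParams['axes.titlepad']=30
    temp=pd.Series(data=hue)
    fig,ax=plt.subplots()
    width=len(df[col].unique()) + 7 + 4 * len(temp.unique())
    fig.set_size_inches(width,8)
    plt.xticks(rotation=45)
    plt.yscale('log')
    plt.title(title)
    ax=sns.countplot(data=df,x=col,order=df[col].value_counts().index,hue=hue,palette='bright')
    plt.show()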
Model Building
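import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from imblearn.combine import SMOTEENN

# Load the one-hot encoded dataset
df=pd.read_csv("df_dummies.csv")
df

# Separate the predictors (x) from the target (y)
x=df.drop('Churn',axis=1)
print(x)
y=df['Churn']
print(y)

# Hold out 20% of the rows as a test set
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)

# Decision Tree Classifier
# Assumption: the listing omits the model definition and fit; default hyperparameters are used here
dt=DecisionTreeClassifier()
dt.fit(x_train,y_train)
y_pred=dt.predict(x_test)
y_pred
# Accuracy is scored against the true test labels, not against the predictions
dt.score(x_test,y_test)
print(classification_report(y_test,y_pred,labels=[0,1]))
print(confusion_matrix(y_test,y_pred))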
# Random Forest Classifier
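The listing ends at this heading, so the Random Forest code itself is not available. The following is a minimal sketch of what such a step could look like, mirroring the Decision Tree steps above. It is not the author's code: the data is a synthetic stand-in for `df_dummies.csv`, and the hyperparameters and `random_state` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded churn table (assumption, not the real data).
x, y = make_classification(n_samples=600, n_features=15,
                           weights=[0.73, 0.27], random_state=100)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=100)

# Illustrative hyperparameters; the values used in the project are not given.
rf = RandomForestClassifier(n_estimators=100, random_state=100)
rf.fit(x_train, y_train)

y_pred_rf = rf.predict(x_test)
rf_accuracy = rf.score(x_test, y_test)  # accuracy against the true test labels
rf_confusion = confusion_matrix(y_test, y_pred_rf)

print("Accuracy:", rf_accuracy)
print(classification_report(y_test, y_pred_rf, labels=[0, 1]))
print(rf_confusion)
```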
ScreenShots:
Dataset:
Percentage of missing values:
Distribution of Churners:
HeatMap:
BIBLIOGRAPHY
Code Links:
https://colab.research.google.com/drive/1N-1kkpIRlKglirZHJxqpzkAN9Fx868ew?authuser=3#scrollTo=p5TybBo9v9xz
Book References: