Applying Data Mining to Telecom Churn Management
Abstract
Taiwan deregulated its wireless telecommunication services in 1997. Fierce competition followed, and churn management became a major
focus for mobile operators, which try to retain subscribers by satisfying their needs under resource constraints. One of the challenges is churner prediction.
Through empirical evaluation, this study compares various data mining techniques that can assign a ‘propensity-to-churn’ score periodically to
each subscriber of a mobile operator. The results indicate that both decision tree and neural network techniques can deliver accurate churn
prediction models by using customer demographics, billing information, contract/service status, call detail records, and service change log.
© 2005 Elsevier Ltd. All rights reserved.
Keywords: Churn management; Wireless telecommunication; Data mining; Decision tree; Neural network
Table 1 below summarizes some data mining functionalities, techniques, and applications in the CRM domain.

[Fig. 1. Annual telecom operator customer churn rate by region: Europe, U.S., and Asia (Mattersion, 2001).]

3. Churn prediction data mining assessment methodology

The purpose of this research is to assess the performance of various data mining techniques when applied to churn prediction. The methodology consists of three parts:
[Fig. 2. The churn prediction process: extract sample data, create and test the predictive model, perform full data extraction and scoring of the subscriber population, and monitor the results via cubes/reports.]
This research selected decision tree, neural network, and K-means clustering as the data mining techniques to build predictive models and segment customers.
Note that in addition to conducting empirical research, we can use the same IT infrastructure to collect, analyze, detect, and eliminate major customer churn factors. This 'closed-loop' infrastructure is imperative to business management as we manage churn to sustain our relationships with customers.

3.2. Prediction model creation process

Fig. 3 shows our process of creating a predictive model.

3.2.1. Define scope
In this study, we focus on post-paid subscribers who pay a monthly fee and were activated for at least 3 months prior to July 1, 2001. A churner is defined as a subscriber who voluntarily chooses to
[Fig. 3. Prediction model creation process: data warehouse (DW) → exploratory data analysis → data preprocessing → variable analysis → data extraction → sample DB. Approach 1: decision tree with K-means segmentation; Approach 2: BPN segmentation followed by decision tree.]
leave; a non-churner is a subscriber who is still using this operator's service. Moreover, we used the latest 6 months of transactions of each subscriber to predict the customer's churn probability in the following month. The transaction data include billing data, call detail records (CDR), customer care records, etc.

3.2.2. Exploratory data analysis (EDA)
The purpose of EDA is to explore from the customer database those possible variables that can characterize or differentiate customer behavior. For variable extraction, we interviewed telecom experts, such as telecom business consultants, marketing analysts, customers, and the mobile provider's sales staff, to identify churn causes or symptoms prior to customer churn, such as 'contract expired', 'low usage', or 'query about terminating the contract'.

3.2.3. Data preprocessing, variable analysis and selection, and data extraction
Based on the results of the interviews with experts, we extract some possible variables from the customer database as an analytical base for EDA to determine which variables are useful to differentiate between churners and non-churners.
For each of the causes/symptoms gained from the interviews, we determine whether we can observe similar customer behavior in the database. For example, for the symptom 'contract expired' we can define a variable 'number of days between today and contract expiration date' to test its correlation with customer churn, where 'today' is the date of prediction. Depending on the variable type, we can use different statistical test tools, such as the z-test. (That is, we examine variable significance by Z-score at the 99% confidence level, and select a variable if its Z-score is over 3.)
Note that 'contract expiration date' must be a quality data field ('table column') in the database. Otherwise, the statistical inference based on this variable would be invalid. A significant effort in data preprocessing is to resolve data quality issues related to unspecified business rules or business rules not enforced in the business process.
Note also that we can have alternative variable definitions, such as '1 if today is later than the contract expiration date and 0 otherwise'. It is an iterative process to define the variables, identify the table columns, specify the calculation formulas, test the validity of the statistical inference, and select useful variables for modeling. Data extraction is a formalized system integration process to ensure data quality and code optimization in modeling, production (e.g. scoring), and model maintenance.

3.2.4. Machine learning (model creation)
We took two approaches to assess how models built with decision tree (C5.0) and back-propagation neural network (BPN) techniques perform.
In Approach 1, we used the K-means clustering method to segment the customers into five clusters according to their billing amount (to approximate 'customer value'), tenure in months (to approximate 'customer loyalty'), and payment behavior (to approximate 'customer credit risk'). Then we create a decision tree model in each cluster (see Approach 1 in Fig. 3). This is to assess whether churn behaviors differ across 'value-loyalty' segments.
In Approach 2 (see Approach 2 in Fig. 3), we used a neural network to segment customers, followed by decision tree modeling. This is a technology assessment to test whether BPN can improve DT prediction accuracy.

3.3. Model performance evaluation

In actual practice, it is necessary to know how accurately a model predicts and how long before the model requires maintenance.
To assess model performance, we use LIFT and hit ratio, where A is the number of subscribers predicted to churn in the predictive time window who actually churned, and B is the number of subscribers predicted to churn who did not.
Hit ratio is defined as A/(A+B), instead of (A+D)/(A+B+C+D). This measures the model's effectiveness in predicting churners rather than in predicting all customer behavior in the predictive window.
To assess LIFT, we rank-order all customers by their churn score and define hit ratio (X%) as the hit ratio of the X% of customers with the top churn scores. LIFT (X%) is then defined as the ratio of hit ratio (X%) to the overall monthly churn rate. For example, if the overall monthly churn rate = 2%, X = 5, and hit ratio (5%) = 20%, then LIFT (5%) = 20%/2% = 10. LIFT is a measure of modeling productivity: a random sample of the entire customer base would yield 2% churners, whereas focusing on the 5% of customers with the top churn scores would yield 20% churners. (Note that in this case, the top 5% contains 50% of the total churners.)
To assess model robustness, we monitor each month's model hit ratio and LIFT for an extended period of time to detect degradation.

4. Empirical finding

4.1. Data source

A wireless telecom company in Taiwan provided the customer-related data. To protect customer privacy, the data source includes data on about 160,000 subscribers, including 14,000 churners, from July 2001 to June 2002, randomly selected based on their telephone numbers.

4.2. Exploring data analysis results

We developed possible variables from other research and from interviews with telecom experts. We then analyzed these variables with the z-test along four dimensions and list the significant churn variables below, based on our analysis database.

- Customer demography
† Age: analysis shows that customers between 45 and 48 have a higher propensity to churn than the population's churn rate.
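As a concrete illustration of the hit-ratio and LIFT definitions in Section 3.3, the following sketch scores a toy customer base. The function and variable names are ours, not the paper's code:

```python
def hit_ratio(scores, churned, top_frac):
    """Hit ratio (X%): share of actual churners (A) among the A + B
    subscribers holding the top X% of churn scores."""
    # rank subscribers by descending churn score
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    k = max(1, int(len(scores) * top_frac))
    hits = sum(1 for i in order[:k] if churned[i])  # A
    return hits / k                                 # A / (A + B)

def lift(scores, churned, top_frac):
    """LIFT (X%) = hit ratio (X%) / overall churn rate."""
    overall = sum(churned) / len(churned)
    return hit_ratio(scores, churned, top_frac) / overall
```

Applied to the paper's own example, an overall monthly churn rate of 2% with hit ratio (5%) = 20% yields LIFT (5%) = 10.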
S.-Y. Hung et al. / Expert Systems with Applications 31 (2006) 515–524 519
[Fig. 4. Customer segmentation: share of population and monthly churn rate by cluster (C1–C5).]

Table 2
Significant variables of churn

Dimension    Items
[…]

Table 3
Customer segmentation by cluster

Cluster ID  Tenure  Bill AMT  MOU  MTU  PYMT rate  Percentage of population  Churn rate (%)
C1          H       H         H    H    M          32.9                      0.50
C2          L       L         L    L    L          26.8                      1.19
C3          H       M         M    M    L          14.0                      0.32
C4          L       M         M    M    L          16.7                      0.30
C5          M       M         M    M    H          9.6                       1.37

(H = high, M = medium, L = low.)
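Approach 1's segmentation step (K-means on billing amount, tenure months, and payment behavior, with k = 5) can be sketched with a plain Lloyd's-algorithm implementation. The feature choice and cluster count follow the text; the code, the standardization advice, and all names are our own assumptions:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=7):
    """Plain Lloyd's algorithm: returns (centroids, labels).
    Features should be standardized beforehand so that billing
    amount does not dominate tenure and payment score."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid for every point
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # update step: each centroid moves to its cluster mean
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids, labels
```

In the paper's setting, each point would be a subscriber's three standardized features and k = 5, with the resulting cluster ID driving the per-segment decision-tree modeling.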
modeling by the cluster identity. Then we used the same validation sets to evaluate all models. Table 5 shows the performance of each cluster's decision tree model.

4.4.2. Neural network (back propagation network, BPN)
Based on other research results (e.g. Cybenco, 1998; Zhan, Patuwo, & Hu, 1998), we know that using one hidden layer and an optimal network design might provide a more accurate neural network model. In this study, we use 1-1-1 (input-hidden-output) as the training model type. This training type includes 43 inputs and only one output. Since public information is not available on key modeling parameters such as the learning rate or the number of neurons in the hidden layer, we tried many different combinations. Table 6 shows the results, in which model N18-R6, for example, uses 18 neurons in the hidden layer with a 0.6 learning rate.
To minimize other variances, we use the same training set for BPN as for the decision tree. Table 6 shows that N21-R6 achieves the best performance on the R-square and MSE measurements.

4.5. Model performance stability

4.5.1. Overall performance trend
We use the data from the telecom operator to 'track' model performance over a period of time. Fig. 5 shows the trend of performance in terms of hit rate and capture rate:

† Fig. 5 shows that all the models demonstrate stable accuracy in the first 6 months. However, significant degradation occurs in February 2002, regardless of the modeling technique. The Chinese New Year fell in February 2002, and it is possible that consumers behave differently during this period.
† In the first 6 months, NN outperforms DT, and DT without segmentation slightly outperforms DT with segmentation.
† In general, a predictive model built on individual segments is more accurate than one built on the entire customer population. Thus, a decision tree model with segmentation should outperform a decision tree model without segmentation. Our experiment shows otherwise. Furthermore, we are concerned about the significant performance gap after the first month between the NN and DT techniques.

4.5.2. Test modeling technique differences
We use the T-test to compare modeling techniques under different modeling parameters. The hypotheses are:

† H01: the hit ratio of the decision tree model without segmentation is not different from that with segmentation. (DTH-DTSH)
† H02: the capture rate of the decision tree model without segmentation is not different from that with segmentation. (DTC-DTSC)
† H03: the hit ratio of the neural network model is not different from that of the decision tree model without segmentation. (NNH-DTH)
† H04: the capture rate of the neural network model is not different from that of the decision tree model without segmentation. (NNC-DTC)
† H05: the hit ratio of the neural network model is not different from that of the decision tree model with segmentation. (NNH-DTSH)
† H06: the capture rate of the neural network model is not different from that of the decision tree model with segmentation. (NNC-DTSC)

Table 7 lists the T-test results:

† The performance of the decision tree model without segmentation is better than that with segmentation.
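Each hypothesis in Section 4.5.2 is evaluated with a one-sample t-test on the twelve monthly performance differences, which is why Table 7 reports df = 11. A minimal sketch of the statistic follows; the sample numbers used to exercise it are illustrative, not the paper's data:

```python
from math import sqrt

def one_sample_t(diffs):
    """One-sample t statistic for H0: mean(diffs) = 0.
    Returns (t, degrees of freedom)."""
    n = len(diffs)
    mean = sum(diffs) / n
    # unbiased sample variance (divide by n - 1)
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / sqrt(var / n), n - 1
```

With twelve monthly hit-ratio differences between two models, this yields df = 11, matching the Table 7 rows.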
Table 4
Models evaluation of decision tree without segmentation

Model         Hit ratio (%)  Capture ratio (%)  Lift at 10%  Description
S1-1M-RS-30K  92.92          85.91              8.74         Analytical base = 1 M, random sample, learning records = 30 K
S1-1M-P3      97.90          84.82              9.18         Analytical base = 1 M, over sampling = 3%, learning records = 36.7 K
S1-1M-P5      96.21          94.55              9.96         Analytical base = 1 M, over sampling = 5%, learning records = 22 K
S1-1M-P10     96.72          93.82              9.93         Analytical base = 1 M, over sampling = 10%, learning records = 11 K
S1-2M-RS      87.85          92.95              9.21         Analytical base = 2 M, random sample, learning records = 30 K
S1-3M-RS      98.04          86.27              9.53         Analytical base = 3 M, random sample, learning records = 30 K
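The 'over sampling' settings in Table 4 are consistent with keeping a fixed pool of roughly 1.1 K churners and subsampling non-churners down to the target churn rate (3% → 36.7 K records, 5% → 22 K, 10% → 11 K). A minimal sketch of such rebalancing, under that reading (the function and record layout are ours, not the paper's stated procedure):

```python
import random

def rebalance(records, is_churner, target_rate, seed=42):
    """Keep every churner and subsample non-churners so that churners
    make up roughly target_rate of the learning records."""
    rng = random.Random(seed)
    churners = [r for r in records if is_churner(r)]
    others = [r for r in records if not is_churner(r)]
    # churners / (churners + kept) == target_rate  =>  kept as below
    n_keep = round(len(churners) * (1 - target_rate) / target_rate)
    return churners + rng.sample(others, min(n_keep, len(others)))
```

For instance, 1,100 churners with a 5% target keeps 20,900 non-churners, for 22,000 learning records, matching the S1-1M-P5 row.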
[Table 5. Performance of each cluster's decision tree model (DT with segmentation, C2–C5): monthly hit ratio (%), capture rate (%), and LIFT for predicted months 200108–200207.]

4.5.3. Sample size impact
One theory is that our results are biased because of the limited churn samples in the analysis base: the mobile service provider […] customers, and the associated monthly churn rate was only 0.71%. The data size was not sufficient to build a good predictive model for each customer segment, because we could not extract really significant information from the few churners in each customer segment. For example, Table 3 shows that C3 contains about 17% of the customer population with a 0.3% […] months disappeared.

Fig. 7 compares the LIFT of all models: both NN and DT techniques generate models with a hit rate of 98% from the top […] about 10. Wei (Wei & Chiu, 2002) used customer call detail records as a predictor and generated models with a hit rate of less than 50% from the top 10% of predicted churners in the list. That is, LIFT (10%) is less than 5. Our LIFT (10%) is about 10. Although the customer bases are different and there are other modeling parameters to consider, the LIFT achieved by all proposed […]
Table 6
Learning results of BPN
[…]

Table 7
Model performance evaluation (one-sample test)

          t      df  Sig.  Mean difference  95% CI lower  95% CI upper
DTH_DTSH  6.020  11  .000  1.992×10⁻²       1.263×10⁻²    2.720×10⁻²
DTC_DTSC  4.762  11  .001  1.108×10⁻²       5.961×10⁻³    1.621×10⁻²
NNH_DTH   6.307  11  .000  0.1508           9.820×10⁻²    0.2035
NNC_DTC   5.490  11  .000  3.333×10⁻²       1.997×10⁻²    4.670×10⁻²
NNH_DTSH  7.980  11  .000  0.1708           0.1237        0.2180
NNC_DTSC  5.863  11  .000  4.417×10⁻²       2.759×10⁻²    6.075×10⁻²

DTH, hit ratio of decision tree model without segmentation; DTC, capture rate of decision tree model without segmentation; DTSH, hit ratio of decision tree model with segmentation; DTSC, capture rate of decision tree model with segmentation; NNH, hit ratio of neural network (BPN); NNC, capture rate of neural network (BPN).
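The single-hidden-layer BPN of Section 4.4.2 (43 inputs, one sigmoid output, learning rates around 0.6) can be sketched as a plain back-propagation loop. The architecture follows the text; the initialization, update rule, and class names are our own assumptions:

```python
import math
import random

class BPN:
    """Minimal single-hidden-layer back-propagation network with one
    sigmoid output (a churn score). The last weight in each row acts
    as the bias term."""

    def __init__(self, n_in, n_hidden, rate=0.6, seed=1):
        rng = random.Random(seed)
        w = lambda: rng.uniform(-0.5, 0.5)
        self.rate = rate
        self.w1 = [[w() for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.w2 = [w() for _ in range(n_hidden + 1)]

    @staticmethod
    def _sig(z):
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x):
        """Forward pass: returns (hidden activations, output)."""
        h = [self._sig(row[-1] + sum(w * xi for w, xi in zip(row, x)))
             for row in self.w1]
        y = self._sig(self.w2[-1] + sum(w * hi for w, hi in zip(self.w2, h)))
        return h, y

    def train_one(self, x, target):
        """One stochastic gradient step; returns the squared error."""
        h, y = self.predict(x)
        d_out = (target - y) * y * (1.0 - y)             # output delta
        d_hid = [d_out * self.w2[j] * h[j] * (1.0 - h[j])
                 for j in range(len(h))]                 # hidden deltas
        for j, hj in enumerate(h):
            self.w2[j] += self.rate * d_out * hj
        self.w2[-1] += self.rate * d_out                 # output bias
        for j, dj in enumerate(d_hid):
            for i, xi in enumerate(x):
                self.w1[j][i] += self.rate * dj * xi
            self.w1[j][-1] += self.rate * dj             # hidden bias
        return (target - y) ** 2
```

Under this reading, a model such as N21-R6 would correspond to `BPN(43, 21, rate=0.6)`.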
can effectively assist telecom service providers in making more accurate churner predictions.
However, an effective churn prediction model only helps companies know which customers are about to leave. Successful churn management must also include effective retention actions. Mobile service providers need to develop attractive retention programs to satisfy those customers. Furthermore, integrating the churn score with customer segments and applying customer value also helps mobile service providers design the right strategies to retain valuable customers.
Data mining techniques can be applied to many CRM fields, such as credit card fraud detection, credit scoring, affinity between churners and retention programs, response modeling,
[Figs. 5–7. Monthly hit rate and capture rate trends for all models (200108–200207), and LIFT curves (LIFT versus % of population) for the decision tree and neural network models.]
and customer purchase decision modeling. We expect to see more data mining applications in business management, and more sophisticated data mining techniques will be developed as business complexity increases.

References

Berson, A., Smith, S., & Thearling, K. (2000). Building data mining applications for CRM. New York, NY: McGraw-Hill.
Bortiz, J. E., & Kennedy, D. B. (1995). Effectiveness of neural network types for prediction of business failure. Expert Systems with Applications, 9(4), 503–512.
Cybenco, H. (1998). Approximation by super-positions of sigmoidal function. Mathematical Control Signal Systems, 2, 303–314.
Fletcher, D., & Goss, E. (1993). Forecasting with neural networks: An application using bankruptcy data. Information and Management, 3, 159–167.
Kentrias, S. (2001). Customer relationship management: The SAS perspective. www.cm2day.com.
Langley, P., & Simon, H. A. (1995). Applications of machine learning and rule induction. Communications of the ACM, 38(11), 55–64.
Lariviere, B., & Van den Poel, D. (2004). Investigating the role of product features in preventing customer churn, by using survival analysis and choice modeling: The case of financial services. Expert Systems with Applications, 27(2), 277–285.
Lau, H. C. W., Wong, C. W. Y., Hui, I. K., & Pun, K. F. (2003). Design and implementation of an integrated knowledge system. Knowledge-Based Systems, 16(2), 69–76.
Lejeune, M. (2001). Measuring the impact of data mining on churn management. Internet Research: Electronic Network Applications and Policy, 11(5), 375–387.
Mattersion, R. (2001). Telecom churn management. Fuquay-Varina, NC: APDG Publishing.
Salchenberger, L. M., Cinar, E. M., & Lash, N. A. (1992). Neural networks: A new tool for predicting thrift failures. Decision Sciences, 23(4), 899–916.
SAS Institute (2000). Best practice in churn prediction. SAS Institute White Paper.
Su, C. T., Hsu, H. H., & Tsai, C. H. (2002). Knowledge mining from trained neural networks. Journal of Computer Information Systems, 42(4), 61–70.
Tam, K. Y., & Kiang, M. Y. (1992). Managerial applications of neural networks: The case of bank failure predictions. Management Science, 38(7), 926–947.
Thearling, K. (1999). An introduction to data mining. Direct Marketing Magazine.
Wei, C. P., & Chiu, I. T. (2002). Turning telecommunications call details to churn prediction: A data mining approach. Expert Systems with Applications, 23, 103–112.
Zhan, G., Patuwo, B. E., & Hu, M. Y. (1998). Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting, 14, 35–62.
Zhang, G., Hu, M. Y., Patuwo, B. E., & Indro, D. C. (1999). Artificial neural networks in bankruptcy prediction: General framework and cross-validation analysis. European Journal of Operational Research, 116, 16–32.