Lariviere 2005
www.elsevier.com/locate/eswa
Abstract
In an era of strong customer relationship management (CRM) emphasis, firms strive to build valuable relationships with their existing
customer base. In this study, we attempt to better understand three important measures of customer outcome: next buy, partial-defection and
customers’ profitability evolution. By means of random forests techniques we investigate a broad set of explanatory variables, including past
customer behavior, observed customer heterogeneity and some typical variables related to intermediaries. We analyze a real-life sample of
100,000 customers taken from the data warehouse of a large European financial services company. Two types of random forests techniques
are employed to analyze the data: random forests are used for binary classification, whereas regression forests are applied for the models with
linear dependent variables. Our research findings demonstrate that both random forests techniques provide a better fit for the estimation and
validation samples than ordinary linear regression and logistic regression models. Furthermore, we find evidence that the same set of
variables has a different impact on buying versus defection versus profitability behavior. Our findings suggest that past customer behavior is
more important to generate repeat purchasing and favorable profitability evolutions, while the intermediary’s role has a greater impact on the
customers’ defection proneness. Finally, our results demonstrate the benefits of analyzing different customer outcome variables
simultaneously, since an extended investigation of the next buy–partial-defection–customer profitability triad indicates that one cannot fully
understand a particular outcome without understanding the other related behavioral outcome variables.
© 2005 Elsevier Ltd. All rights reserved.
Keywords: Data mining; Customer relationship management; Customer retention and profitability; Random forests and regression forests
cancel a product that is characterized by a ‘non-ending’ status. Contrary to typical grocery products like milk, coffee or cookies, financial products are bought and owned for a specific period in time. As a consequence, you remain a customer until all the products are closed or expired. Regarding the ending status of financial products, there exist two notable types: (i) products that have a fixed duration term and as a consequence automatically end when the expiration date is reached, and (ii) products that do not have a fixed expiration date and hence receive a ‘non-ending’ label, since they only stop when a customer explicitly asks to cancel that product. With the ‘active partial-defection’ retention variable, we emphasize the latter ending status scenario. The ‘partial’ refers to the fact that the closure of one particular product does not necessarily mean a ‘total’ defection of the customer, since that customer is allowed to have other products that are still open or not expired.

With respect to the ‘customer profitability’ dependent variables, we investigate the customer’s evolution in profit. Contrary to the existent literature that mainly investigated profitability in a cross-sectional manner by spanning companies and industries, we investigate each customer’s profitability longitudinally. As such we are able to analyze the direct relationship between a customer’s set of explanatory variables and his generated profits, in contrast to previous studies that were often constrained to linking aggregated customer information with, for example, the stock-price performance per firm or the turnover per outlet, due to the unavailability of profitability measures at the customer level. In this study, we investigate two measures of customer profitability. The first measure is ‘profit evolution’ and represents the customers’ evolution with respect to the profits generated during the observation window. The second variable, ‘profit drop’, is a deduced version of the former profitability measure. ‘Profit drop’ is a binary variable expressing whether the customer has become less profitable for the company by the end of observation. The variable is created as an extra tool to validate the accuracy of predicting customers’ profitability evolutions and to compare its performance with the other binary retention dependent variables.

In sum, we investigate two major groups of customer outcome: customer retention and profitability. We analyze two measures of retention that both involve an ‘active’ transaction of the customer: the opening of a new product (next buy) and the decision to end a product that is still open (active partial-defection). Furthermore, we also investigate how customers evolve in terms of the profitability they represent for the company by means of a linear (profit evolution) and a binary (profit drop) dependent variable.

The rest of the paper is organized as follows. In Section 2, we elucidate the methodological underpinnings of the random forests and the regression forests techniques. In Section 3, we present the data set and the explanatory variables under investigation. The study results and their implications are reported in Section 4. In Section 5, we summarize and discuss the results of this study.

2. Methodology

In this study, we use random forests techniques to predict customers’ profitability evolution and their next buy and partial-defection decisions. Two types of random forests are used depending on the conceptualization of the dependent variable: that is binary classification and linear prediction outcomes. In the next paragraphs we present the methodological underpinnings of the random forests techniques and the evaluation criteria we use to investigate their performance.

2.1. Random forests

With regard to binary classification tasks, decision trees (DT) have become very popular, thanks to their ease of use and interpretability (Duda, Hart, & Stork, 2001) as well as their ability to deal with covariates measured at different measurement levels (including nominal variables). Nevertheless, conventional decision tree techniques also have their disadvantages. For instance, Dudoit, Fridlyand, and Speed (2002) mention their lack of robustness and their suboptimal performance. Fortunately, many of these disadvantages have been dealt with by researchers who optimized the DT technique. More specifically, the creation of an ensemble of trees followed by a vote for the most popular class, labeled forests (Breiman, 2001), is the result of such a DT optimization.

In this paper, we also use this more advanced DT technique. We select the random forests as proposed by Breiman (2001), which uses the strategy of a random selection of a subset of m predictors to grow each tree, where each tree is grown on a bootstrap sample of the training set. This number, m, is used to split the nodes and is much smaller than the total number of variables available for analysis.

Since their introduction, random forests have been enjoying increased popularity. The number of applications in fields with large datasets is growing: e.g. in bioinformatics (Deng et al., 2004). On the other hand, applications in economics, and more specifically in marketing related issues, are rather scarce (Buckinx & Van den Poel, 2005). The available applications using random forests reveal that the predictive performance is among the best of available techniques (Luo et al., 2004). Furthermore, an interesting by-product of the technique is the set of importance measures produced for each variable (Ishwaran, Blackstone, Pothier, & Lauer, 2004), which indicate which variables have the strongest impact on the dependent variables of investigation. Another advantage of the technique concerns the consistently high and robust performance results (Breiman, 2001). Finally, the random forests as proposed
474 B. Larivière, D. Van den Poel / Expert Systems with Applications 29 (2005) 472–484
by Breiman have reasonable computing times (Buckinx & Van den Poel, 2005) and are easy to use; the only two parameters a user of the technique has to determine are the number of trees to be used and the number of variables (m) to be randomly selected from the available set of variables. In both cases, we follow Breiman’s recommendation to pick a large number (5000 in this case) for the number of trees to be used, as well as the square root of the number of variables for the latter parameter. Since the number of explanatory variables equals 30 (cf. Table 2) in this study, we fix the number of variables to six.

2.2. Regression forests

Breiman also extended the concept of random forests to regression cases. Random forests for regression are formed by growing trees depending on a random vector such that the tree predictor takes on numerical values as opposed to class labels (cf. Section 2.1). The random forests predictor is formed by taking the average over a number of trees specified by the user.

In this study, we investigate four different dependent variables: next buy, active partial-defection, profit drop and profit evolution. The first three measures involve a binary classification problem of a specific event; that is the event of buying a new product, the event of canceling a ‘non-ending’ status product and the event of becoming less profitable for the company.

In order to assess the predictive performance of the classification models based on the random forests technique, we use the area under the receiver operating characteristic curve (AUC) criterion. Furthermore, we benchmark the performance of the random forests against the AUC resulting from conventional logistic regression models in which we use the same set of customers, independent and dependent variables. The AUC measure is based on a range of comparisons between the predicted status of the event and the true status of the customer with respect to that event, by considering all possible cut off levels for the predicted values. More specifically, for all the cut off points, the sensitivity (the number of true positives versus the total number of events) and the specificity (the number of true negatives versus the total number of non-events) of the confusion matrix are considered and summarized by means of a two-dimensional graph, resulting in a ROC curve. The area under this curve is used to evaluate the predictive accuracy of the classification models (Hanley & McNeil, 1982). In order to compare the AUCs resulting from the random forests with those of the logistic regression models, we apply the non-parametric test proposed by DeLong, DeLong, and Clark-Pearson (1988) that investigates whether the areas under both ROC curves are significantly different.

With respect to the linear dependent variable, profit evolution, we cannot use the AUC evaluator, since both the predicted and the real values are continuous rather than binary. Profit evolution represents the change in the customer’s profitability during the observed window of analysis, and consequently can have a wide range of both positive and negative values. In order to evaluate the predicted values, we calculate the mean absolute deviation (MAD)

$$\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} \left| P_i - R_i \right| \qquad (1)$$

where $n$ is the sample size, $P_i$ the predicted profit evolution for customer $i$ and $R_i$ the real profit evolution for customer $i$. Similar to the goodness-of-fit evaluation of the random forests models, we also apply conventional linear regression models in order to benchmark their performance against the regression forests results with respect to the profit evolution target variable.

A major Belgian financial services company delivered the data for this study. Their data warehouse stores detailed information about customers’ banking and insurance acquisitions; that is we know when, what, how much and at which point of sales the customer has bought a specific product. Furthermore, the company gathers demographic information about its customers and provides its customers with a monthly revenue indicator. Since our research setting implies a fourfold analysis of dependent variables, we decided to use the same group of customers, as well as the same set of potential explanatory variables, in order to compare their relative and different impact on the customer retention and profitability target variables we emphasize. We decided to take two randomly selected samples of 50,000 customers each, of which one is used for the estimation process and the second is used for validation. In the next paragraphs, we present the dependent and explanatory variables that are created to perform the customer retention and profitability models.

3.1. Conceptualization of the dependent variables

The following timeline provides some detailed information about the period of analysis in this study.

As is clear from the timeline, we determine the dependent variables within the time period of 1 June 2003 through 1 February 2004 (the latest release date of the data warehouse). Two measures of retention are created in order to investigate the postulated research objectives. The first measure is ‘next buy’ and expresses whether the customer has bought a new product during the 8 months of follow-up (i.e. 1 June 2003 through 1 February 2004). The second
dependent variable ‘active partial-defection’ explores whether the customer has himself ended a product that was still open. Note that with respect to the latter dependent variable, we explicitly focus on ‘active’ defection, meaning that we do not consider an ‘automatic’ product defection as the event of investigation (cf. Section 1). Both retention variables are binary and receive the value of ‘1’ when the event happened during the follow-up period (‘0’ in the other case).

With respect to the profitability measures we make use of the company’s internal records. Each month, the investigated company computes an individual profitability score for its entire customer base. The monthly score is calculated as a weighted average of the total number of products owned multiplied by the corresponding balance amount (at the end of each month) and the net margin that the product represents for the company. Based on the scores throughout the follow-up period, we were able to investigate the customers’ evolution with respect to that profitability score. We created two dependent variables. The first profitability measure is ‘profit evolution’ and represents the shift (expressed in profitability points) in the customer’s profitability, whereas ‘profit drop’ is a binary indicator expressing whether the customer showed a negative evolution with respect to his revenue profile, meaning that he became less profitable for the company by the end of the follow-up period. In Table 1, we provide some insights about the 100,000 customers under investigation in this study and their corresponding retention and profitability measures.

It is clear from Table 1 that some 13% of the customers bought a new product during the follow-up period, whereas fewer customers (6.8% of the customers) decided to cancel a product with a non-ending status. With respect to the binary profitability measure, we observe that approximately a quarter of the customers experienced a negative evolution in the profitability they represent for the firm. This latter finding is intriguing in the context of the second profitability measure that reflects the absolute shift in a customer’s profit evolution expressed in profitability points. It is clear from the minimum and maximum values that there is a wide range of movements within the follow-up period with regard to the customers’ profit evolutions. Furthermore, the mean and median values are situated around zero, indicating that the extra profits generated by some customers are fully absorbed by the lost revenues of some other profitability defectors. Given the fact that only one quarter experienced a decrease in profits, we can ascertain the need to gain insight into the drivers of the target variable ‘profit drop’, because on average one customer is likely to absorb the extra profits generated by three other customers.

3.2. Explanatory variables

In this study, we explore three major predictor categories that encompass potential explanatory variables. The three categories are: past customer behavior, observed customer heterogeneity and variables related to intermediaries. In the next paragraphs, we introduce each category by presenting its variables. Note that all explanatory variables are measured at the date of 31 May 2003 (cf. Fig. 1). Table 2 presents the explanatory variables that are investigated in this study.

3.2.1. Past customer behavior

There is ample evidence in the literature that behavioral exchange characteristics are strong predictors of future customer behavior (Baesens et al., 2004; Reinartz & Kumar, 2003) and profitability (Hsieh, 2004). In this study, we investigate the following past customer behavior variables: specific product ownership, self-banking activity, total number of products owned, monetary value and cross-buying.

3.2.1.1. Specific product ownership. Some researchers investigated the impact of specific product ownership on customer outcome. Their findings indicate that specific product ownership is likely to influence future customer behavior (e.g. Athanassopoulos, 2000; Larivière & Van den Poel, 2004). In this study, we test for the impact of seven
Table 1
Insight in the dependent variables for both estimation and validation sample
Fig. 1. Period of analysis. Explanatory variables are measured at 31 May 2003; the follow-up period used for the conceptualization of the dependent variables runs from 1 June 2003 to 1 Feb 2004.

different ownership variables. We introduce six dummy variables that categorize all types of banking and insurance products as well as one variable expressing whether the customer owns credit cards or not.

3.2.1.2. Self-banking by means of internet and phone. Nowadays, more and more financial services providers encourage their customers to perform their daily transactions by means of electronic banking services (such as
the company of investigation enables its customers to use internet or phone services for both banking and insurance transactions. For this study, we created a dummy variable expressing whether the customer is a self-banking user (by means of internet or phone).

3.2.1.3. Total number of products owned and monetary value. Previous research suggests that there exists a positive association between these two explanatory variables and customers’ subsequent customer behavior. For instance, Huber, Lane, and Pofcher (1998) reveal that the more products a customer possesses with the bank, the more retention prone he is. Similarly, the more money a customer invests with a company the more likely he is to stay (Baesens, Viaene, Van den Poel, Vanthienen, & Dedene, 2002; Ganesan, 1994). With respect to the customer’s profitability, it is plausible to assume that a higher quantity of products represents higher profits, since previous research found a positive relationship between customers’
Table 2
Explanatory variables used in this study
spending level and profitable lifetimes (Reinartz & Kumar, 2003). In this study, we also control for the customers’ total product ownership and monetary value.

3.2.1.4. Cross-buying. Cross-buying refers to the degree to which customers purchase products from different product categories offered by the company. In this study, we explicitly decided to create a cross-buying variable, since the investigated company is characterized by a large group of mono-product customers. As such, it offers a viable opportunity to investigate the impact of a higher share-of-wallet on both retention and profitability dependent variables.

3.2.2. Customer demographics

It is clear from previous research that accounting for observed customer heterogeneity is warranted. In this study, we control for the customer’s age, lifecycle stage, gender, geographical region, and some geo-demographic data.

3.2.2.1. Age and lifecycle stage. It is well known that customers’ financial-need priorities and resource availability vary at different stages of their lifecycle, and as such influence the quantity and the sequence in which financial products and services are acquired (Kamakura, Ramaswami, & Srivastava, 1991): e.g. in general, younger customers (i.e. the ‘bachelor’ stage) have less money to invest than older individuals (i.e. the ‘empty-nest’ and the ‘retirement’ stage). As such, older people that belong to a later stage in their lifecycle are assumed to have more money available. In this study, the lifecycle stage consists of five stages; as such we create four dummy variables in order to express to which stage the customer belongs. A higher number corresponds with a later stage in the lifecycle.

3.2.2.2. Gender. As in most studies that account for customer demographic data, we also control for the customer’s gender. The variable ‘gender’ is operationalized as a dummy variable that receives the value of ‘1’ when the customer is male, and a ‘0’ when the customer is female.

3.2.2.3. Geographical region. The investigated company provides its financial products and services on the Belgian market. The variable region reflects a geographical cohort and is operationalized as a dummy variable. In general, the Belgian market can be divided into two large geographical areas: Flanders in the north and the Walloon part in the south. Besides the fact that each region has its own language (respectively, Dutch and French), the marketing department of the investigated company reveals that they also have a different way of doing business with a financial services company; that is Flemish people are known to be ‘savers’, whereas the Walloons make more use of personal loans to acquire the products they want. Previous research has also taken the geographical region into account. For example, Patterson and Smith (2003) investigated the propensity to remain with a service supplier for both Australia and Thailand and found a significant difference. In this study, we also account for this cohort information in order to test whether we observe some significant differences with respect to the profitability and retention proneness for Flemish versus Walloon customers.

3.2.2.4. Geo-demographic data. Besides customer demographic data gathered at the customer level, the company also buys some additional customer information that is gathered based on the place of residence (that is geo-demographic data). In this study, we analyze two different information items: the social status and the median income of the region of residence. The social status consists of nine groups. Therefore, we create eight dummy variables per customer in order to know to which categorical group a customer belongs. We wonder whether these variables provide some additional explanatory information with respect to the dependent variables we emphasize and, as a consequence, whether they are worth paying for from the company’s practical point of view.

3.2.3. Variables related to intermediaries

To date there is still a poor understanding of the impact of salespersons (or intermediaries) on customers’ behavior (Guenzi & Pelloni, 2004). Nevertheless, it seems important to investigate the salesperson’s role, since he acts as the crucial player who interacts with the company’s customers. In this study, we investigate three variables related to these intermediary agents: the selling tendency of the salesperson, the number of customers served by a salesperson and the sales assortment.

3.2.3.1. Selling tendency of the salesperson. In real life, it is reasonable to assume that not every intermediary is equally skilled in selling financial products and services to the company’s customers. With the variable ‘selling tendency’, we aim to explore the impact of a salesperson’s selling capabilities on both the customers’ profitability and retention proneness. The variable ‘selling tendency’ represents the number of products sold in relation to the number of customers served by a specific intermediary. The variable is created by using the information from 1 year preceding the date of 31 May 2003 (cf. Fig. 1); the higher the value for the variable the more products the intermediary had sold to its own customer base.

3.2.3.2. Number of customers served by the salesperson. Although many researchers have suggested that the performance of the salesperson during sales encounters is critical, many of the underlying mechanisms that govern the interaction between salespersons and customers are still unclear (Van Dolen, Lemmink, de Ruyter, & de Jong, 2002). In the financial services setting, it is plausible to believe that some customers experience less personal attention when a salesperson is serving a large customer base, and as a
consequence is unable to know each customer personally. In this study, we account for this information and investigate its impact on both customers’ behavior and profitability.

3.2.3.3. Sales assortment. The ‘sales assortment’ represents the product variety offered by the salesperson. In their study, Hoch, Bradlow, and Wansink (1999) state that the variety in offerings is viewed as the entry fee for maintaining future customer loyalty. With respect to the investigated financial services company, not every intermediary is selling the whole range of financial products to its customers; that is some typical ‘banking’ intermediaries solely supply banking products, whereas some others only sell a limited variety of insurance products to their customers. As such, it is possible that some customers are unable to acquire all financial products and services they need with their current salesperson. In this study, we explore its impact on both retention and future customer profitability.

4. Findings

The next paragraphs present the findings of the study. First, we report the prediction accuracies of the various models. Next, we present the relative importance of each explanatory variable with respect to the four dependent variables under investigation. Finally, we further examine the signs of the 10 most important covariates for each target variable by means of some descriptive statistics.

4.1. Performance evaluation

The evaluation criteria applied to investigate the predictability of the four dependent variables are presented in Table 3.

It is clear from the table that random forests provide better prediction accuracies compared to logistic regression models. For all three binary classification targets, we observe a significant (DeLong et al., 1988) and better performance in favor of the random forests (cf. all p-values range between <0.0001 and 0.025). Even for the next-buy classification, we find a significant difference although the increase in prediction accuracy is rather low; that is an AUC improvement of 0.006 (0.005) for the validation (estimation) sample. With respect to the predictive performance of the profit drop target variable, we observe a significant difference in AUC of 0.019 and 0.016 for, respectively, the validation and estimation sample. In this study, the largest gain in prediction accuracy of random forests can be found in the active partial-defection analysis, where we observe an AUC improvement of 0.106 (0.094) for the validation (estimation) sample when benchmarking its performance against a logistic regression model.

In sum, the classification findings of this study indicate the viable opportunity for both academics and practitioners to consider techniques other than the conventional prediction techniques (such as logistic regression) when investigating a binary-classification problem. Especially when the goodness-of-fit indices obtained from conventional prediction models are rather low, indicating that there is more room for improvement, it is appealing to investigate whether other prediction techniques (such as random forests) perform better, since each major improvement in predictive accuracy is likely to represent major shifts in terms of the effectiveness and the return on investment of marketing actions that are based on prediction models.

In order to evaluate the performance of the linear dependent variable, we use the mean absolute deviation (MAD) criterion (cf. Section 2.3). The MAD for the regression forests model amounts to 5 (more specifically, 5.099 for the test sample and 4.940 for the estimation
Table 3
Performance results
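
To make the two evaluation criteria behind these performance results concrete, the following minimal Python sketch computes an AUC and a MAD on toy data. It is an illustration under stated assumptions, not the code used in the study: the function names `auc` and `mad` and all numbers are hypothetical, and the AUC is obtained through the equivalent pairwise-ranking formulation (the share of event/non-event pairs in which the event receives the higher predicted score) rather than by sweeping all cut-off levels of the ROC curve.

```python
from itertools import product


def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking.

    Counts, over all (event, non-event) pairs, how often the event has the
    higher predicted score; ties count as one half.
    """
    events = [s for s, y in zip(scores, labels) if y == 1]
    non_events = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if e > n else 0.5 if e == n else 0.0
        for e, n in product(events, non_events)
    )
    return wins / (len(events) * len(non_events))


def mad(predicted, real):
    """Mean absolute deviation between predicted and real values, as in Eq. (1)."""
    return sum(abs(p - r) for p, r in zip(predicted, real)) / len(real)


# Hypothetical toy data (not taken from the study).
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]   # predicted event probabilities
toy_labels = [1, 1, 1, 0, 0, 0]                # 1 = event happened, 0 = no event
toy_predicted = [2.0, -1.0, 0.5]               # predicted profit evolution
toy_real = [1.0, 1.0, 0.0]                     # real profit evolution

toy_auc = auc(toy_scores, toy_labels)          # 8 of 9 pairs ranked correctly
toy_mad = mad(toy_predicted, toy_real)         # (1 + 2 + 0.5) / 3
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why even small AUC differences between models are tested for significance in the text.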
Table 4
Importance of variables
Each row below holds four groups of three columns (No., Importance measure (a), Variable name), one group per model, from left to right:
(1) Random forests, dependent variable = Next buy
(2) Random forests, dependent variable = Active partial-defection
(3) Random forests, dependent variable = Profit drop
(4) Regression forests, dependent variable = Profit evolution
1 149.536 d_SI_high_risk 1 273.260 ST 1 174.616 mon_val 1 1.490 d_credits
2 147.642 d_curracc 2 240.702 nbr_cust 2 148.870 d_curracc 2 1.004 age
3 141.566 age 3 191.625 sales_assort 3 125.883 d_card 3 0.623 mon_val
4 128.041 mon_val 4 167.320 age 4 114.565 nbr_p 4 0.575 nbr_p
5 124.833 nbr_p 5 144.917 mon_val 5 95.953 age 5 0.273 cross_b
6 118.025 d_card 6 137.873 nbr_p 6 95.284 cross_b 6 0.232 nbr_cust
7 106.901 d_risks 7 132.024 d_region 7 94.376 d_credits 7 0.171 d_curracc
8 92.611 d_lifec_stage_4 8 126.118 d_risks 8 89.111 d_lifec_stage_4 8 0.144 d_soc_status_6
9 91.995 cross_b 9 123.324 cross_b 9 83.848 d_self_b 9 0.137 d_risks
10 75.584 ST 10 94.772 d_SI_high_risk 10 81.666 d_lifec_stage_2 10 0.091 d_lifec_stage_2
11 74.022 d_SI_low_risk 11 93.972 d_lifec_stage_2 11 79.941 sales_assort 11 0.039 d_card
12 73.556 d_credits 12 88.316 d_credits 12 72.971 d_region 12 0.009 d_self_b
13 70.177 nbr_cust 13 86.139 d_card 13 72.587 d_risks 13 0 ST
14 67.816 med_income 14 85.948 med_income 14 69.856 d_SI_high_risk 14 0 sales_assort
15 62.856 d_self_b 15 85.570 d_curracc 15 69.758 nbr_cust 15 0 med_income
16 61.791 d_lifec_stage_2 16 72.341 d_lifec_stage_4 16 64.157 d_SI_low_risk 16 0 d_SI_low_risk
17 57.312 d_lifec_stage_3 17 66.582 d_SI_low_risk 17 59.016 ST 17 0 d_SI_high_risk
18 54.640 sales_assort 18 63.783 d_lifec_stage_3 18 54.052 med_income 18 0 d_SI_stepst
19 54.232 d_SI_stepst 19 59.480 d_soc_status_6 19 47.041 d_lifec_stage_3 19 0 d_gender
20 36.329 d_region 20 54.517 d_self_b 20 39.008 d_SI_stepst 20 0 d_soc_status_1
21 28.606 d_soc_status_1 21 52.649 d_soc_status_8 21 38.321 d_soc_status_8 21 0 d_soc_status_2
22 25.130 d_soc_status_7 22 39.375 d_soc_status_2 22 25.617 d_soc_status_7 22 0 d_soc_status_3
23 21.529 d_soc_status_2 23 37.604 d_SI_stepst 23 22.509 d_soc_status_3 23 0 d_soc_status_4
24 19.544 d_soc_status_6 24 35.199 d_soc_status_1 24 21.363 d_soc_status_6 24 0 d_soc_status_5
25 14.317 d_soc_status_3 25 32.463 d_soc_status_7 25 17.138 d_soc_status_2 25 0 d_soc_status_7
26 8.789 d_soc_status_8 26 30.958 d_soc_status_3 26 13.215 d_soc_status_1 26 0 d_soc_status_8
27 4.66 d_lifec_stage_1 27 12.634 d_soc_status_5 27 11.699 d_soc_status_4 27 0 d_lifec_stage_1
28 4.193 d_gender 28 9.913 d_gender 28 10.836 d_lifec_stage_1 28 0 d_lifec_stage_3
29 0 d_soc_status_4 29 9.059 d_lifec_stage_1 29 9.72 d_gender 29 0 d_lifec_stage_4
30 0 d_soc_status_5 30 6.666 d_soc_status_4 30 4.288 d_soc_status_5 30 0 d_region
(a) An importance measure of ‘0’ represents no significant impact of the corresponding explanatory variable on the target variable of investigation.
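
The forests behind these importance measures rely on two random sampling steps described in Section 2.1: every tree is grown on a bootstrap sample of the training set, and a random subset of m of the 30 explanatory variables is considered when splitting. The sketch below illustrates only these two steps in plain Python. It is a hedged illustration, not the authors' implementation: the helper names are hypothetical, drawing the subset anew at each split follows our reading of Breiman (2001), and rounding the square root of 30 (about 5.48) up to six is an assumption about how the value of six reported in the paper was obtained.

```python
import math
import random

N_VARIABLES = 30                             # explanatory variables reported in the paper
M_TRY = math.ceil(math.sqrt(N_VARIABLES))    # assumption: sqrt(30) ~ 5.48 rounded up to 6


def bootstrap_indices(n_rows, rng):
    """Row indices for one tree: n_rows draws with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_variables(rng, n_variables=N_VARIABLES, m=M_TRY):
    """Variable indices considered at one split: m draws without replacement."""
    return sorted(rng.sample(range(n_variables), m))


rng = random.Random(0)                       # fixed seed only to make the sketch repeatable
rows_for_tree = bootstrap_indices(10, rng)   # toy training set of 10 customers
split_candidates = candidate_variables(rng)  # 6 distinct variable indices out of 30
```

Because each tree sees a different bootstrap sample and each split a different candidate subset, the individual trees are decorrelated, which is what allows averaging or voting over many trees (5000 in the paper) to stabilize the predictions.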
4.3. Investigation of the direction of impact on the dependent variables

While Section 4.2 provides a clear understanding of the explanatory variables that have a strong impact on the four dependent variables of this study, the directions of these impacts are still unknown. For example, the variable ‘d_region’ plays an important role in the prediction of active partial-defection, but nevertheless we have no indication whether Flemish customers, in contrast to their Walloon counterparts, are less or more likely to defect. Hence, we decided to perform some additional descriptive analyses to gain insight into the direction of the most important explanatory variables. Analogous to Section 4.2, we focus on the top-10 most important predictors and we only investigate the binary target variables. Table 5 summarizes the descriptive statistics. In fact, we analyze two strata (e.g. next buyers or not) and we examine whether we observe a statistically significant difference with respect to the 10 most important variables. For the binary explanatory variables, we apply simple chi-square statistics, whereas T-tests are performed for the other covariates.

While most explanatory variables have the expected sign, some other findings deserve further explanation. In the next paragraphs we briefly summarize the most intriguing findings of Table 5.

Section 4.2 revealed the importance of the past behavior variables, such as total product ownership, cross-buying, monetary value and specific product ownership; in this extended analysis we find that all these explanatory variables have a positive association with all the events under investigation: that is next buy, active partial-defection and profit drop. The latter finding implies that, for instance, customers with higher monetary value or individuals that possess more products from different categories (cross-buying) are not only more likely to buy new products in the future (next buy), they are also more likely to cancel other products with a non-ending status (active partial-defection), which probably results in a negative profitability evolution (profit drop). In sum, our findings suggest the
B. Larivière, D. Van den Poel / Expert Systems with Applications 29 (2005) 472–484 481
Table 5
Descriptive statistics for the most important explanatory variables
Table 5 (continued)
Explanatory variable Strata
existence of a typical group of active customers that are retention and profitability. For the first outcome, we analyze
constantly buying and defecting on financial products. two different measures: the opening of a new product (next
Furthermore, with respect to the variables related to the buy) and the decision to cancel a product with a non-ending
salesperson in the active partial-defection case, Table 5 status (active partial-defection). With respect to the latter
reveals rather small (but significant) differences when outcome, we investigate how customers evolve in terms of
comparing defectors versus non-defectors. Given the fact the profitability they represent for the company by means of
that these variables nevertheless represent the top three of a binary (profit drop) and a linear (profit evolution)
most important variables, we can certainly ascertain the dependent variable. More specifically, the first three
need to consider the intermediaries’ role when trying to measures involve a binary classification problem and are
understand typical customer behavior outcomes; since even analyzed by using random forests; for the latter target
small improvements in, for example, the intermediary’s variable (profit evolution), we applied regression forests.
selling capabilities or the sales assortment are likely to result Our research findings support previous studies that favor
in favorable customer behaviors. With regard to the the use of random forests techniques. In this study, we
‘number of customers served’ variable, we find the opposite observe significant improvements in terms of prediction
effect of what was hypothesized; that is customers accuracy when benchmarking the random and regression
belonging to larger agencies show lower active partial-
forests against the conventional logistic and linear
defection rates. A possible explanation might be found in
regression models.
the fact that serving fewer customers is just the result
Another interesting feature of the random forests
(instead of the ‘cause’) of customer defections in the past.
technique concerns the produced importance measures
Furthermore, it is also likely to assume that intermediaries
which indicate the variables that have the greatest impact
who serve fewer customers, experience a heavier compe-
on the dependent variable of investigation. In this study,
tition in their immediate vicinity, such that their customers
have more alternatives to switch. Another explanation might we find evidence that past customer behavior variables
be that customers perceive large agencies as more reliable, play an important role in predicting future customer
and as a consequence prefer them above smaller agencies. behavior and profitability. Another important finding of
Further research on this issue is warranted. the study is the relative importance of the variables
Finally, when we consider the customer lifecycle stage, related to intermediaries with respect to the active
we observe that seniors (d_lifec_stage_4) are more likely to partial-defection classification. It is clear that good
repurchase, but less vulnerable to decrease their profitability. selling agents not only generate more repeat purchases,
Also, families with young children show evidence of positive they also indirectly prevent customers from (partial)
profitability evolutions. As a consequence, the other defection. The same logics apply for the sales assortment
categories, such as the youngsters and the midlife category of the salesperson. For the company of investigation, it
are mainly responsible for the negative profit evolutions. offers a viable opportunity to encourage its salesforce to
Hence, it is crucial for financial services companies to gain a supply the whole range of financial products and
better understanding of these typical lifecycle stages such services, since a limited sales assortment is likely to
that the appropriate and proactive actions can be taken to stimulate customer-switching behavior. With respect to
guarantee the company’s future profits. the customer demographic variables, our findings reveal
the importance of the customer’s age and the stage of his
lifecycle. On the other hand, the customer’s gender and
5. Discussion the geo-graphical data gathered at the place of residence
level are less powerful in terms of predicting customer
This study investigates two typical and major outcomes retention and profitability, although they report significant
of customer relationship management (CRM): customer associations with the binary dependent variables.
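As an illustration of the kind of chi-square comparison between two strata described in Section 4.3, the following minimal Python sketch tests the association between two binary variables, using the next buy and active partial-defection counts reported in Appendix A. The function names are hypothetical, no continuity correction is applied, and the sketch is not necessarily the procedure or software used in this study.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 df) for a 2x2 table of counts.

    Assumption: plain Pearson statistic without Yates continuity correction.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > stat) equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def row_percentages(table):
    """Share of each cell within its row, in percent."""
    return [[100.0 * cell / sum(row) for cell in row] for row in table]


# Counts from Appendix A.
# Rows: next buy (No, Yes); columns: active partial-defection (No, Yes).
next_buy_vs_defection = [[81671, 5043], [11523, 1763]]

stat, p_value = chi_square_2x2(next_buy_vs_defection)
row_pct = row_percentages(next_buy_vs_defection)

print(f"chi-square = {stat:.1f}, p-value = {p_value:.3g}")
print([[round(value, 2) for value in row] for row in row_pct])
```

The row percentages computed this way can be checked against the Row% entries of the corresponding frequency table, and the p-value against the reported threshold of 0.0001.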
Furthermore, when comparing the three binary classification outcomes and their most important predictors, it is striking that four of the top-10 variables are the same: total product ownership, monetary value, cross-buying and the customer’s age. Moreover, when exploring their impact on the dependent variables by means of descriptive statistics, we observe the same positive impact on next buy, active partial-defection and profit drop. As such, we find evidence that the same set of variables is likely to generate both next-buy and defection behavior in terms of profits and products. These intriguing findings suggest the existence of a highly active customer segment that buys new products while switching on other financial products, and they invite us to perform some extra analyses that relate next buy to the active partial-defection and customer profitability variables.

In Appendix A, we present the statistics for the next buy versus active partial-defection versus customer profitability triad. The statistics indicate that more than 25% of the active partial-defectors also bought a new product within the same period of observation, compared to a buying percentage of 12.36% for the customers who did not cancel a non-ending product. As such, we find support for our theory that the company's customer base contains a typical segment of active customers who constantly replace old products with newer ones. Another striking finding concerns the link between next buy and customers’ profit evolutions. It is clear from Appendix A that more than 35% of the customers who bought a new product also experienced a profit drop, whereas their counterparts who did not repurchase report a lower percentage for the profit drop variable (that is, 27.44%). Fortunately, in terms of absolute profitability shifts (cf. profit evolution), we do not observe a statistically significant difference between customers who did and did not purchase a next product. With respect to the relationship between active partial-defection and customers’ profit evolution, we observe the dramatic impact of customers’ decision to cancel a non-ending product on their profitability evolution. Appendix A reveals that almost 70% of the active partial-defectors experienced a profit drop, while approximately one quarter (25.56%) of the people who did not defect on products showed a negative evolution with regard to the revenues they represent for the company. Similar conclusions can be derived for the profit evolution variable. In summary, these findings are in line with previous research studies that underscore the impact of customer retention on a company’s profitability: ‘It is important to retain existing customers’. Finally, when linking the two profit evolution variables with each other, we confirm our descriptive findings resulting from Table 1 (cf. Section 3.1): on average, the extent to which customers experience profit drops (in terms of absolute profitability points) is more intense than the extent to which other customers are able to grow in profits (that is, −8.72 versus +1.60 profitability points, respectively). In sum, our research findings extend the well-known claim that ‘it is important to retain existing customers’ to customers’ profitability: ‘It is more profitable to retain the most profitable customers of the company’.

Acknowledgements

The authors would like to thank the anonymous company that supplied the data to perform this research study. Moreover, we are grateful to Leo Breiman for the public availability of the random forests and regression forests software.

Appendix A. Investigation of the next buy–partial-defection–customer profitability triad

Frequency table of next buy × active partial-defection
                                  Active partial-defection
                                  No          Yes         Total
Next buy  No     Frequency        81,671      5043        86,714
                 Row%             94.18%      5.82%
                 Column%          87.64%      74.10%
          Yes    Frequency        11,523      1763        13,286
                 Row%             86.73%      13.27%
                 Column%          12.36%      25.90%
Total                             93,194      6806        100,000
p-value < 0.0001.

Frequency table of next buy × profit drop
                                  Profit drop
                                  No          Yes         Total
Next buy  No     Frequency        62,923      23,791      86,714
                 Row%             72.56%      27.44%
                 Column%          88.02%      83.43%
          Yes    Frequency        8561        4725        13,286
                 Row%             64.44%      35.56%
                 Column%          11.98%      16.57%
Total                             71,484      28,516      100,000
p-value < 0.0001.

Frequency table of active partial-defection × profit drop
                                            Profit drop
                                            No          Yes         Total
Active partial-defection  No   Frequency    69,372      23,822      93,194
                               Row%         74.44%      25.56%
                               Column%      97.05%      83.54%
                          Yes  Frequency    2112        4694        6806
                               Row%         31.03%      68.97%
                               Column%      2.95%       16.46%
Total                                       71,484      28,516      100,000
p-value < 0.0001.

T-tests for the profit evolution variables