Assessed Coursework Coversheet: Leeds University Business School
Student ID Number: 201456454
Please Note:
Your declared word count must be accurate, and should not mislead. Making a fraudulent statement concerning the
work submitted for assessment could be considered academic malpractice and investigated as such.
If the amount of work submitted is higher than that specified by the word limit or that declared on your word count, this
may be reflected in the mark awarded and noted through individual feedback given to you.
It is not acceptable to present matters of substance, which should be included in the main body of the text, in the
appendices (“appendix abuse”). It is not acceptable to attempt to hide words in graphs and diagrams; only text which
is strictly necessary should be included in graphs and diagrams.
By submitting an assignment you confirm you have read and understood the University of Leeds
Declaration of Academic Integrity
( http://www.leeds.ac.uk/secretariat/documents/academic_integrity.pdf).
Data Analysis Report for a Premium Chocolate Company: Managing
Customer Segmentation, Preferences, and Sustainability
Student ID: 201456454
Table of Contents
1. Introduction
2. Managing customer heterogeneity
2.1 Customer segmentation
a. Hierarchical method:
b. K-means method:
c. Mclust method:
2.2 Understanding consumers’ perception
3. Customer dynamics
3.1 Consumer sustainability
3.2 Customer lifetime value
4. Company sustainable competitive advantage
4.1 Customer value
4.2 Managing sustainability
5. Summary
6. Future suggestions
7. References
1. Introduction
This report analyses data from a premium chocolate manufacturer (Crafty
Chocolates) to understand its customer heterogeneity, customer dynamics and
sustainable competitive advantage, using RStudio as the analysis tool.
a. Hierarchical method:
The hierarchical cluster analysis shows a clear picture of 4 clusters. The resulting
data frame assigns each observation to a group according to the distances between
observations, which means that the 378 customers can be divided into 4 groups.
Table 1. Cluster Dendrogram
The hierarchical cluster plot gives a closer look at solutions with different numbers
of clusters.
Table 2. Hierarchical cluster plot
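The hierarchical step described above can be sketched as follows. This is a minimal illustration, not the code used for the report: the data are simulated, the variable names (salary, consumption, sustainability) and the Ward linkage are assumptions, and only the sample size (378) and the number of clusters (4) come from the text.

```r
set.seed(42)

# Hypothetical stand-in for the customer data (378 customers, 3 variables)
customers <- data.frame(
  salary         = rnorm(378),
  consumption    = rnorm(378),
  sustainability = rnorm(378)
)

# Euclidean distances between standardised observations
d <- dist(scale(customers))

# Agglomerative clustering; the linkage method is an assumption
hc <- hclust(d, method = "ward.D2")

# Cut the dendrogram into 4 groups and count the members of each group
groups <- cutree(hc, k = 4)
table(groups)

# plot(hc); rect.hclust(hc, k = 4)  # draws the dendrogram with 4 boxes
```

Changing `k` in `cutree()` is a quick way to compare solutions with different numbers of clusters.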
b. K-means method:
This method is better suited to larger data sets.
The result shows that the 4 groups differ in their scaled sustainability scores.
Table 4. Boxplot
3 2.305 3.011 -0.303
4 1.851 3.021 0.152
The boxplot suggests that group 2 is the best option: this group has the highest
salary, positive chocolate consumption and the highest sustainability score.
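A k-means segmentation of this kind could look like the sketch below. The data are simulated and the column names are assumptions taken from the variables mentioned in the text; `nstart = 25` is an illustrative choice to reduce the effect of random starting centres.

```r
set.seed(42)

# Hypothetical stand-in for the customer data
customers <- data.frame(
  salary         = rnorm(378),
  consumption    = rnorm(378),
  sustainability = rnorm(378)
)

# Partition the standardised data into 4 clusters
km <- kmeans(scale(customers), centers = 4, nstart = 25)

# Mean of each variable within each cluster, to compare the groups
cluster_means <- aggregate(customers, by = list(cluster = km$cluster), FUN = mean)
cluster_means

# boxplot(customers$sustainability ~ km$cluster)  # score by cluster
```

The table of cluster means is what allows one group to be singled out as the most attractive segment.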
c. Mclust method:
The Mclust method suggests the best number of groups into which to separate the
customers; here it selects 7 components:
## Mclust VEV (ellipsoidal, equal shape) model with 7 components:
##
## log-likelihood n df BIC ICL
## -4131.093 378 336 -10256.31 -10260.48
##
## Clustering table:
## 1 2 3 4 5 6 7
## 60 79 48 62 59 49 21
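As a check on the output above, the BIC can be recomputed from the reported log-likelihood, sample size and degrees of freedom. The sketch assumes the mclust convention BIC = 2 * logLik - df * log(n), under which a larger value indicates a better model.

```r
# Values copied from the Mclust summary
log_lik <- -4131.093
n       <- 378
df      <- 336

# BIC in the mclust convention (larger is better)
bic <- 2 * log_lik - df * log(n)
round(bic, 2)

# The cluster sizes should add up to the number of customers
cluster_sizes <- c(60, 79, 48, 62, 59, 49, 21)
sum(cluster_sizes)
```

Note that 336 estimated parameters for 378 observations is a large model relative to the sample, which is worth keeping in mind when interpreting the 7-component solution.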
Table 7. Correlation plot
From the principal component analysis (PCA) result, we retain the majority of the
information in the data if we keep 5 components to compare our brand ranking.
Table 8. PCA brand.pc
Next, we look at the first 5 components, which together account for more than half
of the variance.
The cumulative proportion of variance for each component is:
Component   Cumulative proportion
Comp.1      27.6%
Comp.2      48.1%
Comp.3      63.3%
Comp.4      75.3%
Comp.5      84%
Comp.6      91.4%
Comp.7      97.3%
Comp.8      99.7%
Comp.9      100%
Comp.10     100%
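The cumulative proportions can be obtained from a PCA object as sketched below. The data are simulated and the 80% threshold is an illustrative assumption, so the numbers will not match the table; `brand.pc` mirrors the object name in the caption of Table 8.

```r
set.seed(1)

# Hypothetical stand-in: 200 chocolates rated on 10 attributes
ratings <- as.data.frame(matrix(rnorm(200 * 10), ncol = 10))

# PCA on the correlation matrix (components are named Comp.1, Comp.2, ...)
brand.pc <- princomp(ratings, cor = TRUE)

# Cumulative proportion of variance explained by the components
cum_var <- cumsum(brand.pc$sdev^2) / sum(brand.pc$sdev^2)
round(cum_var, 3)

# Smallest number of components reaching the (assumed) 80% threshold
n_keep <- which(cum_var >= 0.80)[1]
n_keep
```

`summary(brand.pc)` prints the same cumulative proportions alongside the standard deviations.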
Comparing the position map of component 1 and component 2 (which together
account for 48% of the variance) helps us understand how the variables group.
Sugar, cocoa percentage and sweetener can be considered one factor, and the other
variables can be seen as another factor. However, this result is not clear, so we need
to use rotation to confirm the factors.
Table 9. Visualising PCA
In order to load the variables onto two factors, we delete two highly correlated
variables: sweetener (correlated with cocoa percentage) and salt (correlated with
vanilla). After loading, each variable is assigned to the factor on which its absolute
loading is larger; for example, vanilla loads 0.9582 on factor 2 but only 0.2772 on
factor 1. From this result, counts of ingredients, cocoa butter, organic and sugar
belong to factor 1, and brand, rating, cocoa percent and vanilla belong to factor 2.
Variable                Factor1    Factor2
Brand                    0.0124    -0.070
Cocoa percent           -0.1157     0.1183
Rating                   0.0351    -0.2156
Counts of ingredients    0.9767     0.2024
Cocoa butter             0.8499    -0.0589
Vanilla                  0.2772     0.9582
Organic                  0.7186    -0.2686
Sugar                    0.2176    -0.0126
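The grouping of variables into the two factors can be reproduced from the loadings table. The sketch below assumes the rule "assign each variable to the factor with the larger absolute loading"; the numbers are copied from the table, and the object names are illustrative.

```r
# Loadings copied from the table above
loadings <- matrix(
  c( 0.0124, -0.0700,
    -0.1157,  0.1183,
     0.0351, -0.2156,
     0.9767,  0.2024,
     0.8499, -0.0589,
     0.2772,  0.9582,
     0.7186, -0.2686,
     0.2176, -0.0126),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Brand", "Cocoa percent", "Rating", "Counts of ingredients",
      "Cocoa butter", "Vanilla", "Organic", "Sugar"),
    c("Factor1", "Factor2")
  )
)

# Assign each variable to the factor with the larger absolute loading
assigned <- colnames(loadings)[apply(abs(loadings), 1, which.max)]
names(assigned) <- rownames(loadings)

# List the variables that belong to each factor
split(names(assigned), assigned)
```

Several loadings (for example brand and cocoa percent) are small on both factors, so their assignment is weak and should be read with caution.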
We then plot the relationships between the original variables.
Table 10. Variable relation
From this factor analysis, we conclude that customers rank brands on two key
attributes: the main chocolate ingredients and the brand image.
3. Customer dynamics
In this part, we examine customers’ purchase orientation and lifetime value.
3.1 Consumer sustainability
To understand how customer requirements change, we use a dynamic customer
segmentation approach to analyse the data.
Here the purchase label is the dependent variable, and country, rural or urban,
GDP, BMI and children are the five independent variables.
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.457e+00 8.129e-01 6.713 1.91e-11 ***
## country 1.175e-02 2.322e-03 5.061 4.18e-07 ***
## ruralurban 1.206e-01 2.065e-02 5.840 5.23e-09 ***
## GDP 2.807e-05 1.492e-06 18.809 < 2e-16 ***
## BMI -3.206e-01 3.133e-02 -10.233 < 2e-16 ***
## children 1.851e-01 3.619e-02 5.115 3.13e-07 ***
In this coefficients result, the p-values of all five independent variables are below
0.001, so each coefficient is statistically significant at the 0.1% level.
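A logistic regression of this form can be fitted with `glm()`. The sketch below uses simulated data with only three of the predictors, and the coefficients used to generate the data are arbitrary assumptions, so the estimates are not comparable with the report's output.

```r
set.seed(7)
n <- 2000

# Hypothetical stand-in for the customer data
sim <- data.frame(
  ruralurban = rbinom(n, 1, 0.5),
  BMI        = rnorm(n, mean = 25, sd = 4),
  children   = rpois(n, 1)
)

# Simulate a binary purchase label from an assumed linear predictor
eta <- 1 + 0.1 * sim$ruralurban - 0.1 * sim$BMI + 0.2 * sim$children
sim$purchaselabel <- rbinom(n, 1, plogis(eta))

# Fit the logistic regression
model_sim <- glm(purchaselabel ~ ruralurban + BMI + children,
                 data = sim, family = binomial)

# Estimates, standard errors, z values and p-values
summary(model_sim)$coefficients
```

The `Pr(>|z|)` column of this table is what the significance statement above is read from.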
We then take the exponential of each coefficient. The odds are the ratio between
the purchase probability and the non-purchase probability, and each exponentiated
coefficient is the factor by which the odds change for a one-unit increase in that
independent variable.
##> exp(coef(model1))
##(Intercept) country ruralurban GDP BMI children
234.3179038 1.0118223 1.1281484 1.0000281 0.7256828 1.2033729
This result means that if the children value increases by one unit, the odds of the
purchase label are multiplied by about 1.2, i.e. they increase by about 20%.
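The odds ratios can be recomputed directly from the reported estimates. The sketch below copies the coefficients from the output above (rounded as printed, so the last digits can differ slightly) and also expresses each effect as a percentage change in the odds.

```r
# Estimates copied from the coefficient table
coefs <- c(
  "(Intercept)" = 5.457,
  country       = 1.175e-02,
  ruralurban    = 1.206e-01,
  GDP           = 2.807e-05,
  BMI           = -3.206e-01,
  children      = 1.851e-01
)

# Exponentiate to get the multiplicative effect on the odds
odds_ratios <- exp(coefs)
round(odds_ratios, 4)

# Percentage change in the odds for a one-unit increase
pct_change <- (odds_ratios - 1) * 100
round(pct_change[-1], 1)
```

A value below 1, as for BMI, means that the odds of purchase fall as the variable increases.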
Next, we use anova to compare model1 with an intercept-only model in order to
check whether model1 is better.
## Analysis of Deviance Table
## Model 1: purchaselabel ~ country + ruralurban + GDP + BMI + children
## Model 2: purchaselabel ~ 1
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
##1 25337 23758
##2 25342 24346 -5 -588 < 2.2e-16 ***
This result shows that model1 fits significantly better than the intercept-only
model: the five variables reduce the residual deviance by 588 (p < 2.2e-16).
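The deviance comparison is a likelihood-ratio (chi-squared) test, and its p-value can be checked by hand from the table. The sketch below uses only the residual deviances and degrees of freedom reported above.

```r
# Values copied from the analysis of deviance table
dev_model1 <- 23758   # purchaselabel ~ country + ruralurban + GDP + BMI + children
dev_null   <- 24346   # purchaselabel ~ 1
df_model1  <- 25337
df_null    <- 25342

# Test statistic: drop in deviance, with df equal to the number of added terms
lr_stat <- dev_null - dev_model1
lr_df   <- df_null - df_model1

# Upper-tail chi-squared probability
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df, p = p_value)
```

The sign of the reported deviance (-588) only reflects the order in which the two models were passed to `anova()`.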
Next, we use the estimated logistic regression model to predict customer
purchase probability.
Table 11. Logistic regression results
We then predict each customer’s purchase probability and segment customers into
buy (probability > 0.5) and not buy (probability < 0.5), and then build the confusion
matrix to compare the predicted purchases with the actual purchase behaviour.
0 1
0 20409 4627
1 221 86
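From this result, reading the rows as predicted and the columns as actual purchases
(the orientation consistent with the null deviance of 24346, which implies that about
18.6% of customers purchased), 221 customers are predicted to purchase but actually
do not, and 4627 customers are predicted not to purchase but actually do. The
accuracy is therefore (20409 + 86) / 25343 = 80.87%.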
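The accuracy figure can be verified from the confusion matrix. The sketch below enters the four counts as printed and divides the correctly classified cases (the diagonal) by the total; the calculation does not depend on which dimension holds the predictions.

```r
# Confusion matrix copied from the output above (filled column by column)
cm <- matrix(c(20409, 221, 4627, 86), nrow = 2,
             dimnames = list(c("0", "1"), c("0", "1")))
cm

# Accuracy: correctly classified customers over all customers
accuracy <- sum(diag(cm)) / sum(cm)
round(100 * accuracy, 2)
```

Because most customers fall in the 0/0 cell, accuracy alone says little about how well purchasers are identified, which is why the ROC curve is examined next.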
In the final step, we use the receiver operating characteristic (ROC) curve and the
area under the curve (AUC) to assess the predictions.
Table 12. ROC curve
The curve indicates poor predictive performance (AUC below 90%), although it is
still better than chance. If we want the best cutoff, we can use 0.22.
The AUC means that in 61.9% of cases a customer who purchases receives a higher
predicted purchase probability than a customer who does not.
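The pairwise reading of the area under the ROC curve can be illustrated in base R. The sketch below uses simulated scores and a hypothetical helper `auc_rank()`; it is not the routine used for Table 12. It computes the AUC from ranks and compares it with the share of purchaser/non-purchaser pairs in which the purchaser has the higher score.

```r
set.seed(3)

# Hypothetical stand-in: actual labels and predicted probabilities
actual <- rbinom(500, 1, 0.2)
score  <- plogis(-1.5 + 0.8 * actual + rnorm(500))

# AUC from the rank-sum (Mann-Whitney) statistic
auc_rank <- function(score, actual) {
  r  <- rank(score)               # average ranks handle ties
  n1 <- sum(actual == 1)
  n0 <- sum(actual == 0)
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc <- auc_rank(score, actual)

# Same quantity by direct comparison of every purchaser/non-purchaser pair
pos <- score[actual == 1]
neg <- score[actual == 0]
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

c(auc = auc, auc_pairs = auc_pairs)
```

An AUC of 0.5 corresponds to random guessing and 1 to perfect separation.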
Using these data, we can then calculate the CLV for each month.
Table 13. Monthly data and CLV
Each variable comes from the customer buying data: P is the purchase cost, C is the
total number bought, and r is the retention ratio.
Table 14. CLV evolution
From these data we observe that online sales in the first month obtain the highest
CLV, and that the sum of the CLV values over the eighteen months is 22970.6.
First, we count how many times each price was chosen:
## 2.76 price has 266 choices
## Price
## 2 2.76 3 4 4.95 5 7
## 81 266 443 339 31 380 350
We find that the lowest price, £2, is not the most frequently chosen. On the
contrary, the two highest prices (£5 and £7) have the second and third largest
numbers of choices.
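The price ranking can be reproduced with `xtabs()`. The sketch below rebuilds a one-column data frame from the reported counts (the column name `Price` follows the output; the object names are illustrative) and then sorts the table.

```r
# Counts copied from the price table above
prices <- c(2, 2.76, 3, 4, 4.95, 5, 7)
counts <- c(81, 266, 443, 339, 31, 380, 350)

# One row per choice, as in the original choice data
choices <- data.frame(Price = rep(prices, times = counts))

# Frequency of each price
price_tab <- xtabs(~ Price, data = choices)
price_tab

# Prices ranked from most to least frequently chosen
ranked <- sort(setNames(as.vector(price_tab), names(price_tab)),
               decreasing = TRUE)
ranked
```

The three most chosen prices are £3, £5 and £7, which matches the reading given above.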
Next, we want to know whether premium chocolates are popular, so we use
xtabs() to calculate the quantity.
##No Yes
##394 1496
From the result, more customers choose premium chocolates (1496 choices) than
low-priced chocolates (394 choices), so premium is the more popular choice.
Compared with the reference levels, there are significant differences for nuts,
loyalty taken with chocolates (especially the donate one), organic, premium,
fairtrade, sugar, and price (a higher price means lower utility). The results also
show that consumers attach lower sub-utility to origin and manufacturing location.
## NutsNuts and Fruit
## -2.845858
For example, from this coefficient we know that customers are more willing to pay
for nuts-only chocolate than for nuts and fruit chocolate.
This result shows 7824 groups with significant co-occurrence.
We then focus on the groups that contain chocolate.
Table 18. Data of the baskets have chocolate
From the plot, we can see that baskets in which chocolate is bought with milk, salty
or snack items have a high lift. It is therefore highly recommended that the company
place its products beside the milk, salty and snack areas, or bundle them into a sale
package.
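Lift compares how often two items are bought together with how often that would happen if they were bought independently. The sketch below computes it in base R for a small made-up set of baskets; the baskets and the helper functions `support()` and `lift()` are illustrative and not taken from the report's data.

```r
# Hypothetical toy baskets
baskets <- list(
  c("chocolate", "milk"),
  c("chocolate", "milk", "snack"),
  c("milk"),
  c("snack"),
  c("chocolate", "snack"),
  c("bread"),
  c("chocolate", "milk"),
  c("bread", "snack")
)

# Share of baskets containing all of the given items
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Lift of the pair: joint support over the product of individual supports
lift <- function(lhs, rhs) {
  support(c(lhs, rhs)) / (support(lhs) * support(rhs))
}

lift_choc_milk  <- lift("chocolate", "milk")
lift_choc_snack <- lift("chocolate", "snack")
c(milk = lift_choc_milk, snack = lift_choc_snack)
```

A lift above 1 means the items appear together more often than expected under independence, which is the basis for the shelf placement recommendation.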
5. Summary
To sum up all of the analyses:
If the company wants to segment its customers, the Mclust method is the best choice
when it focuses on small, high-quality groups. This method not only separates
customers into the smallest groups, but also provides the subdivision result. If the
company has an ample budget and wants to promote to broader customer groups,
the hierarchical method is the best choice because it covers a wider range of
customers.
When we compare several brands across many dimensions, we can use a few helpful
components to position them. A perceptual map helps us understand the differences
that influence customers’ rankings in the chocolate industry.
From the logistic regression, we know that factors such as country, living in a rural
or urban area, GDP, and having children influence customers’ sustainable buying.
With the high prediction accuracy, we know that the company has about a 12%
sustainable buying rate among all customers who have bought its chocolate.
To understand how customer lifetime value changes over time, we can use customer
evolution to explore the change. We needed to reorganize the data in order to run
them in R, so this case taught me how to manage big data.
Conjoint analysis helps the company pass market requirements on to the innovation
department. It also helps the company compare its products with those of other
brands.
Using the market basket data, the company can find out which product sets it can
use to increase sales at the point of sale.
6. Future suggestions
From this R analysis, we know that the customer segmentation, brand rating, choice
model, customer lifetime value, sustainability, and basket correlation analyses can
help the company make decisions on promotion, brand image management,
innovation, market trends and requirement innovation.
In addition, I think there could be a model to analyse the customer journey and the
experience of the post-selling process; it would be very helpful for the company in
improving its future sales service and product development.
Also, because trends and data are always changing, a system that converts the data
that companies routinely collect into specific indicators, and that automatically
filters out irrelevant data, would help companies understand the meaning of the
data more easily and use them quickly.
7. References
1. Brown, A.L., Bakke, A.J. and Hopfer, H. 2020. Understanding American
premium chocolate consumer perception of craft chocolate and desirable
product attributes using focus groups and projective mapping. PLoS ONE.
15(11), e0240177.