
ISSN: 0374-8588

Volume 21 Issue 14, December 2019

_____________________________________________________________________________________

RETAIL SALES ANALYSIS USING CLUSTERING
Dr. M. Rajeshwari 1 , P.R.Bharathi Nandha 2

ABSTRACT
This project analyses the sales data of a retail company and indicates where the company
should invest more in order to increase its profit. A retail company can have several branches
and can sell many products through online and offline channels of sale. Because it sells its
products through several channels, the company should analyse its sales and invest more in
the area that yields the most profit. The profitable area can be a particular branch where
sales are high, a product that sells the most, or a sales channel (online or offline) that
brings in high profit. This analysis helps to find the mode of sale that gives the company the
most profit and the reason why that mode is the most profitable. This paper also identifies
the areas of improvement for a company so that the company can focus on improving them.

Keywords: Retail sales, Clustering, Data Analysis, K-means.


_____________________________________________________________________________

I. INTRODUCTION:

Data analysis is the process of inspecting, transforming and analyzing data sets to gain
insights that support business decisions, here with the help of a machine learning algorithm.
For that, we used the analysis tool WEKA, a data mining software package developed at the
University of Waikato, New Zealand. It contains tools for data pre-processing, classification,
clustering and regression. Here, Simple K-means clustering is used to analyse the retail sales
data (clustering is the method of grouping similar objects together in the same group).

Every retailer's main aim is to make a high profit. For that purpose, and to make purchasing
easier for the customer, many retailers extend their business online through a mobile app, a
website or social media. However, not every retailer succeeds in that. It is important to know
what customers dislike about online shopping.

II. RELATED WORKS

This paper mainly focuses on improving the retail sales of a shop that has both online and
offline establishments in different regions. The major analysis is the comparison of sales in
the offline and online channels. We conclude with steps to improve the channel that has low
sales. This paper examines the relationship between the sales channel and the sales revenue.
This study may be helpful for retailers who have both online and offline channels for their
shop. It can also be useful when giving offers or discounts, for example to decide which offer
will increase revenue in each sales channel.
Data analytics is the science of examining raw data to draw conclusions about that
information. It is a process of analyzing raw data, whether structured or unstructured, to find
trends and answer questions, which involves applying an algorithmic or mechanical process to
derive insights from the data. Nowadays, data is everywhere and has become a main asset. In
fact, the amount of digital data that exists is growing at a rapid rate, so data analysis has
become a powerful technology. Data mining is a data analysis technique that relies on
statistical modeling to gain insights from the data for predictive analysis.
Predictive analytics and text analytics are the main applications of data analysis. Predictive
analytics mainly focuses on forecasting and classification, while text analytics focuses on
unstructured text data and has wide application in all sectors. In today's world, particularly
in business, data analysis plays a major role in making more effective business decisions.
Cluster analysis, or clustering, is one of the most commonly used techniques of machine
learning. Machine learning (ML) is an application and a part of artificial intelligence. It gives
computers the capability to learn without being explicitly programmed. ML is one of today's
most widely used and powerful technologies. It has two major types.
One is supervised learning, in which an algorithm learns from existing data together with its
respective target responses, which consist of numeric values or string labels such as classes
or tags. The purpose is to predict the correct target variable, such as sales in a future year,
when new data is given to it.
The other is unsupervised learning, in which the algorithm is trained on data that is neither
classified nor labelled and acts on that data without any guidance. Here, the main task is to
group the unlabelled data according to its similarities, patterns and differences without any
previous training on the data. Clustering is a main part of unsupervised learning. It is
mainly used to find data clusters (groups of similar data points) such that each cluster
contains the most closely matched data. Clustering can be described as “the process of
organizing objects into groups whose members are similar in some way”.

Clustering algorithms can be applied in many fields, for instance:


• Marketing: finding groups of customers with similar behaviour, given a large database of
customer data containing their properties and past buying records.
• Biology: finding clusters of similar genes in DNA analysis; segmenting communities in
ecology.
• Libraries: book ordering.
• Retail: grouping the content of a website or the products of a retail business; customer
segmentation.

An important component of a clustering algorithm is the distance measure between data points.
If the components of the data instance vectors are all in the same physical units, the simple
Euclidean distance metric may be sufficient to group similar data instances successfully.

For higher dimensional data, a popular measure is the Minkowski metric,

$$d_p(x_i, x_j) = \left( \sum_{k=1}^{d} \left| x_{i,k} - x_{j,k} \right|^{p} \right)^{1/p}$$

where d is the dimensionality of the data, x_i and x_j are two data points, and x_{i,k} is the
k-th component of x_i. The Euclidean distance is the special case where p=2, while the
Manhattan metric has p=1. However, there are no general theoretical guidelines for selecting
a measure for any given application.
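
As a small illustration of the Minkowski metric, the following Python sketch computes the distance for a given p. The function name `minkowski_distance` and the sample points are illustrative assumptions and are not part of the original study.

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance between two equal-length numeric vectors.

    p=2 gives the Euclidean distance, p=1 the Manhattan distance.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimensionality")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical two-dimensional points, chosen only for illustration.
a = (0.0, 0.0)
b = (3.0, 4.0)
euclidean = minkowski_distance(a, b, p=2)  # straight-line distance
manhattan = minkowski_distance(a, b, p=1)  # sum of absolute differences
```

The same function covers both special cases named in the text, so the choice of measure reduces to the choice of p.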

III. METHODOLOGY:

The most used and important clustering algorithms are:

• K-means
• Fuzzy C-means
• Hierarchical clustering
• Mixture of Gaussians

Among those algorithms, K-means is one of the most popular and simplest methods for
clustering. It assumes that most of the data is located near prototypes (elements of the data
space that each represent a group of elements). It assigns training data to the matching
cluster based on similarity, and iterates to place the data points in the best possible
clusters until the model is optimized. It is commonly used in medical imaging, biometrics, and
related fields.
The algorithm keeps track of the centroids of the subsets and proceeds in simple iterations.
The initial partitioning is randomly generated; that is, we randomly initialize the centroids
to some points in the region of the space.
The K-means algorithm is an iterative algorithm that tries to partition the dataset into K
pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to
only one group. It involves the following steps:

• Start by picking k random centroids.
• Assign each point to the nearest centroid.
• Move each centroid to the centre of its cluster.
• Calculate the distance of each point from the centroids again.
• Reassign points to clusters and recalculate the distances from the centroids.
• Keep reassigning points until the distance from the cluster centres is minimized, that is,
until the assignments stop changing.
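
The steps above can be sketched in plain Python as follows. This is a minimal illustration and not the WEKA SimpleKMeans implementation used in the study. The function names, the `seed` and `max_iterations` parameters, the choice of existing data points as initial centroids, and the sample data are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iterations=100, seed=0):
    """Cluster `points` into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids
    # (an assumption; other initialization schemes exist).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignments


# Hypothetical two-dimensional points forming two obvious groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(data, k=2)
```

On data this well separated, the first three points end up in one cluster and the last three in the other; on real sales data the result can depend on the initial centroids, which is why the seed is exposed as a parameter.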

S.No  Variable        Description                                         Possible values
1.    Region          The region of the customer who bought the product   string
2.    Country         The country of the customer                         numeric (min: 100, max: 1750)
3.    Item types      Type of the product                                 character (Business to Business, Business to consumer, Business to govt. nonprofit)
4.    Sales Channel   Mode of the order                                   string (offline, online)
5.    Shipping Cost   Cost of shipping the product                        numeric
                      (applicable only to online customers)
6.    Units sold      Total quantity of the product ordered               numeric
7.    Unit price      Selling price of one product                        numeric
8.    Unit cost       Actual price of one product                         numeric
9.    Total Revenue   Total revenue for the product                       numeric
10.   Total cost      Actual cost of the product                          numeric
11.   Total Profit    Total profit                                        numeric

Table 1: Selected variables from retail sales record.


The table above lists the attributes used in the data, with a description and the possible
values of each. Using these attributes, the frequency of each sales channel value was analyzed
to find out which channel customers prefer more. The relationship between the sales channel
and the shipping cost was also analyzed, and the reasons why one channel is less preferred
were then investigated.
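
The channel frequency count and the comparison of shipping cost by channel described above can be sketched in plain Python. The records, the field names `sales_channel` and `shipping_cost`, and all values below are invented for illustration and are not taken from the data set used in the study. A shipping cost of 0.0 for offline orders is an assumption based on Table 1, which states that shipping cost applies only to online customers.

```python
from collections import Counter, defaultdict

# Hypothetical sample records; not the data used in the paper.
orders = [
    {"sales_channel": "Offline", "shipping_cost": 0.0},
    {"sales_channel": "Online", "shipping_cost": 40.0},
    {"sales_channel": "Offline", "shipping_cost": 0.0},
    {"sales_channel": "Online", "shipping_cost": 60.0},
    {"sales_channel": "Offline", "shipping_cost": 0.0},
]

# Frequency of each sales channel value, as a count and as a percentage.
channel_counts = Counter(order["sales_channel"] for order in orders)
channel_share = {
    channel: 100.0 * count / len(orders)
    for channel, count in channel_counts.items()
}

# Mean shipping cost per channel.
costs_by_channel = defaultdict(list)
for order in orders:
    costs_by_channel[order["sales_channel"]].append(order["shipping_cost"])
mean_shipping_cost = {
    channel: sum(costs) / len(costs)
    for channel, costs in costs_by_channel.items()
}
```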

IV. RESULT

Cluster No.    No. of Orders    Percentage
0 (online)     75               75%
1 (offline)    25               25%
Total samples: 100


Figure 1: -

Problem 1: Which sales channel (offline/online) is more preferred by customers?

From the above analysis, it is clear that customers prefer the offline sales channel to the
online sales channel. Drilling further down into the plot shows that offline orders bring in
less revenue than online orders. Both channels have their own advantages and disadvantages.
Further analysis of the regions and units sold shows that most online purchases come from the
farther regions and involve larger quantities of the product. According to the insights that I
gained from the data, products ordered in large quantities, or high-weight products (where the
quantity is low but the price is high), may be purchased mostly through the online channel.
Further analysis of the shipping cost gave some strong insights.


Figure 2: -

Problem 2: Why is the online channel less preferred?

Referring to Figure 1, orders placed online were mostly bulk orders and brought in more
revenue than the offline mode of sale. This may be due to the following reasons:
1. Products ordered online cannot easily be transported by the customers themselves.
2. Online products might offer more options to choose from and customize than offline
products. This might be the reason for the high revenue.
3. Online ordering might provide tracking facilities, which are not available offline.

My analysis of why there are fewer orders online:
1. The shipping cost is high for small purchases.
2. Some household products might be needed quickly. Since online orders are delayed in
delivery, offline sales of small products are high.
3. People who are unaware of the online mode of sale will buy offline.

V. CONCLUSION
In this paper, WEKA is used to analyze the sales data of a retail company using centroid-based
clustering. The main intention of this paper is to help the company identify the sectors where
it is making more profit and where it needs to improve its sales. We used 100 sales records,
which contain various factors such as mode of sale, area of sale, profit amounts and product.
Using these data, this project analyses the sales and provides the necessary information to
the company.

