EPSM Unit 7 Data Analytics
Business Understanding
Business understanding is the essential and mandatory first phase in any data mining or data
analytics project.
It involves identifying and describing the fundamental aims of the project from a business perspective.
This may involve solving a key business problem or exploring a particular business opportunity. Such
problems might be:
Establishing whether the business has been performing well or under-performing, and in which
areas
Monitoring and controlling performance against targets or budgets
Identifying areas where efficiency and effectiveness in business processes can be improved
Understanding customer behaviour to identify trends, patterns and relationships
Predicting sales volumes at given prices
Detecting and preventing fraud more easily
Using scarce resources most profitably
Optimising sales or profits
Having identified the aims of the project to address the business problem or opportunity, the next step
is to establish a set of project objectives and requirements. These are then used to inform the
development of a project plan. The plan will detail the steps to be performed over the course of the
rest of the project and should cover the following:
Determining the criteria by which the success of the project will be judged;
Evaluating whether the project has been a success using the predetermined criteria.
Data Understanding
The second phase of the CRISP-DM process involves obtaining and exploring the data identified as
part of the previous phase and has three separate steps, each resulting in the production of a report.
Data acquisition
This step involves retrieving the data from their respective sources and producing a data
acquisition report that lists the sources of data, along with their provenance and the tools or
techniques used to acquire them. It should also document any issues which arose during the acquisition along
with the relevant solutions. This report will facilitate the replication of the data acquisition process if
the project is repeated in the future.
Data description
The next step requires loading the data and performing a rudimentary examination of the data to aid
in the production of a data quality report. This report should describe the data that has been
acquired.
It should detail the number of attributes and the type of data they contain. For quantitative data, this
should include descriptive statistics such as minimum and maximum values as well as their mean and
median and other statistical measures. For qualitative data, the summary should include the
number of distinct values, known as the cardinality of the data, and how many instances of each value
exist. At this stage the task is simply to describe the raw data: if analysing a purchases ledger, for
instance, you would produce counts of the number of transactions for each department and cost
centre, and the minimum, mean and maximum amounts, and so on. Relationships between variables are
examined in the data exploration step (e.g. by calculating correlations). For both types of data, the
report should also detail the number of missing or invalid values in each of the attributes.
If there are multiple sources of data, the report should state on which common attributes these
sources will be joined. Finally, the report should include a statement as to whether the data acquired
is complete and satisfies the requirements outlined during the business understanding phase.
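As an illustration of how such a summary might be produced, the following is a minimal sketch using the Python pandas library; the file name purchases.csv and its columns are assumptions rather than part of the original example.

```python
import pandas as pd

# Hypothetical purchases ledger; the file and column names are assumptions.
df = pd.read_csv("purchases.csv")

# Quantitative attributes: minimum, maximum, mean, median and other summary statistics.
print(df.describe())

# Qualitative attributes: cardinality (number of distinct values) and the
# number of instances of each value.
for col in df.select_dtypes(include="object").columns:
    print(col, "cardinality:", df[col].nunique())
    print(df[col].value_counts())

# Missing values in each attribute.
print(df.isna().sum())
```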
Data exploration
This step builds on the data description and involves using statistical and visualisation techniques to
develop a deeper understanding of the data and their suitability for the analysis.
The results of the exploratory data analysis should be presented as part of a data exploration
report that should also detail any initial findings.
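As a brief sketch of this step, again assuming pandas (plus matplotlib) and the hypothetical purchases data used above, correlations and simple distribution plots can be produced directly:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("purchases.csv")  # hypothetical dataset, as in the previous sketch

# Pairwise correlations between the numeric attributes.
print(df.corr(numeric_only=True))

# Quick visual check of the distribution of one numeric attribute (column name assumed).
df["amount"].hist(bins=30)
plt.title("Distribution of purchase amounts")
plt.show()
```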
Data Preparation
As with the data understanding phase, the data preparation phase is composed of multiple steps and is
about ensuring that the correct data is used, in the correct form, for the data analytics model to
work effectively:
Data selection
The first step in data preparation is to determine the data that will be used in the analysis. This
decision will be informed by the reports produced in the data understanding phase but may also be
based on the relevance of particular datasets or attributes to the objectives of the data mining project,
as well as the capabilities of the tools and systems used to build analytical models. There are two
distinct types of data selection, both of which may be used as part of this step.
Feature selection is the process of eliminating features or variables that exhibit little predictive
value, or that are highly correlated with others, and retaining those that are most relevant to
building analytical models (a simple correlation-based sketch follows the list below) such as:
Multiple linear regression, where the correlation between multiple independent variables and
the dependent variable is used to model the relationship between them;
Decision trees, simulating human approaches to solving problems by dividing the set of
predictors into smaller and smaller subsets and associating an outcome with each one.
Neural networks, a naïve simulation of multiple interconnected brain cells that can be
configured to learn and recognise patterns.
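The following is a minimal sketch, using pandas, of the correlation-based elimination mentioned above; the 0.9 threshold is an arbitrary illustrative choice rather than a recommendation from the unit.

```python
import pandas as pd

def drop_highly_correlated(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute correlation exceeds the threshold."""
    corr = features.corr().abs()
    to_drop = set()
    cols = list(corr.columns)
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            if second not in to_drop and corr.loc[first, second] > threshold:
                to_drop.add(second)  # keep the first feature of the pair, drop the second
    return features.drop(columns=sorted(to_drop))
```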
Sampling may be needed if the amount of data exceeds the capabilities of the tools or systems used
to build the model. This normally involves retaining a random selection of rows as a predetermined
percentage of the total number of rows. Surprisingly small samples can often give reasonably reliable
information about the wider population of data, such as the information obtained from voter exit polls
in local and national elections.
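In pandas, such a sample can be taken in a single line; the 10% fraction and the fixed seed below are illustrative assumptions.

```python
import pandas as pd

df = pd.read_csv("purchases.csv")                 # hypothetical dataset, as before
sample = df.sample(frac=0.10, random_state=42)    # retain a random 10% of rows, reproducibly
print(len(df), "rows reduced to", len(sample))
```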
Any decisions taken during this step should be documented, along with a description of the reasons
for eliminating non-significant variables or selecting samples of data from a wider population of such
data.
Data cleaning
Data cleaning is the process of ensuring the data can be used effectively in the analytical model. It
involves processing the missing and erroneous data identified during the data understanding or
collection phase. Erroneous data (values outside reasonably expected ranges) are generally set as
missing.
Missing values in each feature are then replaced either using simple rules of thumb, such as setting
them to be equal to the mean or median of data in the feature or by building models that represent the
patterns of missing data and using those models to 'predict' the missing values.
Other data cleaning tasks include transforming dates into a common format and removing non-
alphanumeric characters from text. The activities undertaken, and decisions made during this step
should be documented in a data cleaning report.
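A minimal sketch of these cleaning rules, assuming pandas; the column names and the 0 to 120 valid range for age are purely illustrative.

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical dataset

# Treat values outside a reasonably expected range as missing.
df["age"] = df["age"].where(df["age"].between(0, 120))

# Replace missing values with the median of the feature (a simple rule of thumb).
df["age"] = df["age"].fillna(df["age"].median())

# Transform dates into a common format; unparseable dates become missing.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Remove non-alphanumeric characters from a text field.
df["customer_name"] = df["customer_name"].str.replace(r"[^A-Za-z0-9 ]", "", regex=True)
```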
Data integration
Data mining algorithms expect a single source of data to be organised into rows and columns. If
multiple sources of data are to be used in the analysis, it is necessary to combine them. This involves
using common features in each dataset to join the datasets together. For example, a dataset of
customer details may be combined with records of their purchases. The resulting joined dataset will
have one row for each purchase containing attributes of the purchase combined with attributes related
to the customer.
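A sketch of such a join using pandas; the file names and the customer_id key are assumptions.

```python
import pandas as pd

customers = pd.read_csv("customers.csv")  # one row per customer (hypothetical)
purchases = pd.read_csv("purchases.csv")  # one row per purchase (hypothetical)

# Join on the common attribute so the result has one row per purchase, carrying
# the attributes of the purchase together with those of the related customer.
combined = purchases.merge(customers, on="customer_id", how="left")
print(combined.head())
```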
Feature engineering
This optional step involves creating new variables or derived attributes from the features originally
included, in order to improve the model's capability. It is frequently performed when the data analyst
believes that the derived attribute or new feature is likely to make a positive contribution to the
modelling process, and where it captures a complex relationship that the model is unlikely to infer by
itself.
An example of a derived feature might be adding attributes such as the amount a customer spends on
different products in a given time period, how quickly they pay and how often they return goods, in
order to assess the profitability of that customer more reliably than by simply measuring the gross
profit generated by their sales.
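A sketch of how such derived, customer-level features might be engineered with pandas; the column names are illustrative assumptions.

```python
import pandas as pd

purchases = pd.read_csv("purchases.csv")  # hypothetical purchase-level data

# Derived attributes of the kind described above, aggregated per customer.
features = purchases.groupby("customer_id").agg(
    total_spend=("amount", "sum"),            # amount spent in the period
    avg_days_to_pay=("days_to_pay", "mean"),  # how quickly the customer pays
    return_rate=("returned", "mean"),         # how often goods are returned
)
print(features.head())
```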
Modelling
This key part of the data mining process involves creating generalised, concise representations of the
data. These are frequently mathematical in nature and are used later to generate predictions from
new, previously unseen data.
Evaluation
At this stage in the project, you need to verify and document that the results obtained from modelling
are reliable enough for you to accept or reject the hypotheses set out in the business understanding
stage. For example, if you have performed a multiple regression analysis to predict sales based on
weather patterns, are you sure that the results are statistically significant enough to justify
implementing the solution, and have you checked that there are no other intermediate variables linked
to the X and Y variables in your relationship that provide a more direct causal link?
Before proceeding to final deployment of the model, it is important to thoroughly evaluate it and review
the steps executed to create it, to be certain the model properly achieves the business objectives. A
key objective is to determine if there is some important business issue that has not been sufficiently
considered. At the end of this phase, a decision on the use of the data mining results should be
reached.
At this stage, you will determine whether it is feasible to move on to the final phase, deployment, or
whether it is preferable to return to and refine some of the earlier steps. The outcome of this phase should be a
document providing an overview of the evaluation and details of the final decision together with a
supporting rationale for proceeding.
Deployment
During this final phase, the outcome of the evaluation will be used to establish a timetable and
strategy for the deployment of the data mining models, detailing the required steps and how they
should be implemented.
Data mining projects are rarely 'set it and forget it' in nature. At this time, you will need to develop a
comprehensive plan for the monitoring of the deployed models as well as their future maintenance.
This should take the form of a detailed document. Once the project has been completed there should
be a final written report, re-stating and re-affirming the project objectives, identifying the deliverables,
providing a summary of the results and identifying any problems encountered and how they were
dealt with.
Depending on the requirements, the deployment phase can be as simple as generating a report and
presenting it to the sponsors or as complex as implementing a repeatable data mining process across
the enterprise.
In many cases, it is the customer, not the data analyst, who carries out the deployment steps.
However, even if the analyst does carry out the deployment, it is important for the customer to clearly
understand which actions need to be carried out in order to actually make use of the created models.
This is where data visualisation is most important as the data analyst hands over the findings from the
modelling to the sponsor or the end user and these should be presented and communicated in a form
which is easily understood.
The main focus in big data and the digital revolution is not so much the quantity of data, although this
is a big advantage, but the speed and currency of the data and the variety of forms in which it is made
available. Sophisticated data analytics is about accessing data that is useful for decision making, and
the three things that big data brings to improve the quality of decision making are:
Volume
In data analytics, the amount of data can make a difference. With big data, you will often have to
process large amounts of data, most of it unstructured and with low information density.
The term volume is used to refer to these massive quantities of data. Most of this may have no value
but you will not know until you somehow try and structure it and use it. This data can come from a
wide range of sources as we will see later, but could include social media data, hits on a website, or
results from surveys or approval ratings given by consumers.
The main benefit from the volume of big data is the additional reliability it gives the data analyst. As
any statistician knows, the more data you have, the more reliable your analysis becomes and the
more confident you are about using the results you obtain to inform decision-making.
For some organisations the quantity of this data will be enormous, and will be difficult to collect, store
and manage without the correct infrastructure, including adequate storage and processing capacity.
Velocity
Velocity refers to the rate at which data is received, stored and used. In today’s world transactions are
conducted and recorded in real time. Every time you scan your goods at a supermarket, the store
knows instantly how much inventory it still has available and so it knows as soon as possible when it
needs to re-order each item.
Similarly, as people shop with debit and credit cards using their phone apps, these transactions are
updated immediately. The bank knows immediately that funds have gone out of your account. The
business also knows that funds have been transferred into their account – all in real time.
Variety
In the past the data we collected electronically came in the very familiar rows and columns form
encountered in any database or spreadsheet. With the advent of the Internet and, more recently, the
World Wide Web, the forms data comes in have broadened significantly. Variety refers to the multitude
of types of data that are available for analysis as a result of these changes. Thanks to rapid
developments in communications technology, the data we store increasingly comes in different forms
which possess far less structure and which take on a variety of forms. Examples of these include
numerical data, plain text, audio, pictures and videos.
With the increasingly prevalent use of mobile internet access, sensor data also counts as a new data
type. We still also have the more traditional data types; those that are highly structured such as data
held in relational databases. Corporate information systems such as Enterprise Resource Planning,
Customer Relationship Management and financial accounting functions employ such database
systems and the data these systems contain are a valuable resource for data analysts to work with.
Unstructured data require significant additional processing, as described in the data preparation stage
of the CRISP-DM framework, to transform them into meaningful and useful data which can be used to
support decision-making. However, being able to access and use them provides richer information,
which can make the insights obtained more relevant and significant than those from larger amounts of
data held in more structured sources.
Big data has become an important form of organisational capital. For some of the world’s biggest tech
companies, such as Facebook, a large part of the value they offer comes from their data, which they
constantly analyse to drive greater efficiency and develop new revenue streams.
However, the impact of big data and data reliance doesn't stop with the tech giants. Data is
increasingly considered by many enterprises to be a key business asset with significant potential
value.
Data which is not used or analysed has no real value. However, value can be added to data as it is
cleaned, processed, transformed and analysed. Data collected can be considered to be the raw
material, as in a manufacturing process and is frequently referred to as 'raw data'.
Some of this raw material is unrefined, such as unstructured data, and some refined, as is the case
with structured data. Such data needs to be stored in a virtual warehouse, such as a cloud storage
provider or an on-premise storage solution.
The cleaning and transformation of the data into a suitable form for analysis is really where the value
is being added, so that the data can become the finished product - the useful information which needs
to be delivered or communicated to the user. Reliable, timely and relevant information is what the
customer wants.
Deriving value from big data isn’t only about analysing it. It is a discovery process that requires
insightful analysts, business users and managers who ask the right questions, recognise patterns,
make informed assumptions, and predict behaviour.
If the original assumptions are wrong, the interpretation of the original business question or issue is
incorrect, or the integrity of the data used in the analysis is suspect, the data analysis may yield
unreliable or irrelevant information. A data analyst must be sceptical of the information that comes out
of the data analytics process and properly challenge or verify what it is saying.
Recent technological breakthroughs have exponentially reduced the cost of data storage and
computing, making it easier and less expensive to store and process more data than ever before. As
handling big data becomes cheaper and more accessible, it is possible to make more accurate and
informed business decisions, as long as the big data is stored, processed and interpreted
appropriately.
HDFS
The Hadoop Distributed File System allows the storage of extremely large files in a highly redundant
manner, using a cluster of computers, in this case built using ‘off-the-shelf’ commodity hardware.
MapReduce
This is a divide and conquer approach to big data processing, allowing processing of data to be
distributed across multiple computers in a Hadoop cluster.
Hive
Hive is a query tool used to analyse large sets of data stored on HDFS. It uses a SQL-like language
called HiveQL. It is a declarative language – in other words, you specify what you want, not how
to retrieve it.
Pig
Another high-level programming language used to query large data sets stored on HDFS. It is a data-
flow language that specifies the flows of data from one task to another.
HBase
A NoSQL database that runs on Hadoop clusters. NoSQL stands for Not Only SQL and is a pattern of
data access that is more suited to larger data stores. It differs from relational databases in a number
of ways, not least in that it stores data grouped by column family rather than by row.
Drill
A data processing environment for large-scale data projects where data is spread across thousands
of nodes in a cluster and the volume of data is in the petabytes.
Descriptive Analytics
Descriptive analytics takes raw data and summarises or describes it in order to provide useful
information about the past. In essence, this type of analytics attempts to answer the question 'What
has happened?'.
Descriptive analytics do exactly what the name implies: they 'describe' raw data and allow the user to
see and analyse data which has been classified and presented in some logical way. They are analytics
that describe the past, where the past refers to any point in time at which an event occurred, whether
that was a second ago or a year ago.
Descriptive analytics are useful because they allow analysts to learn from past behaviours, and
understand how they might influence future outcomes. Spreadsheet tools such as filtering and pivot
tables are an excellent way to view and analyse historic data in a variety of ways.
Descriptive statistics can be used to summarise many different types of business data, such as total
sales by volume or value, cost breakdowns, average amounts spent per customer and profitability per
product.
An example of this kind of descriptive analytics is where retail data on the sales, cost of sales (COS)
and gross profit margin (GP) of a range of five products in each of six retail outlets are tracked over
time to establish trends and/or to detect potential fraud or loss.
By looking at the overall figures for the company as a whole, or even by individual product across the
company, or for a store as a whole, the business leader may not notice any unusual trends or
departures from the expected levels from a chart or graph of these measures.
See below how all these metrics are reasonably constant when the overall performance is described:
Only by analysing and charting these trends more closely by product, in each individual store (for
example by using pivot tables), could the business leader detect whether and where there is any specific
fraud or loss; such discrepancies become more apparent when this type of micro-level descriptive
analysis is undertaken. In the above example it looks as though there was a problem with Product 2 in
Store 6. See below:
In the above example, when the trend for Product 2 in Store 6 is examined more closely, it can be
seen that the GP margin falls from 33% to about 17%. This has nothing to do with sales, which remain
relatively constant over time, but is caused by a COS that rises significantly from period 2, climbing
from just above $800 in periods 1 and 2 to $1,000 by period 5. In this case the business manager,
possibly an internal auditor, would be looking at a potential loss or theft of inventory relating to this
product and would need to investigate further.
Another example of descriptive analytics would be to analyse historical data about customers.
The data table below covers three products sold by a business and shows how much each product
sells for, how much it costs to make or purchase, and the cost associated with the return of each unit of
the product from customers. For each of the 30 customers, the table records how much of each product
they have purchased and how many units they have returned, together with an analysis of the sales and
gross profit each customer has generated. It also shows the totals, the means and the medians of each
measure.
Using data analytics, this type of descriptive analysis could help the business understand more about
the following business questions:
The spreadsheet above uses filters at the top of each column so that the analyst can sort the data in
any way they choose. For example, they might wish to see a list of customers listed in order of sales,
profitability or by the number of returns they process.
A powerful tool to use in descriptive analytics is pivot tables in Excel. Pivot tables allow the original
data table in a spreadsheet to be presented in a number of different ways, where the rows and
columns can be interchanged or where only certain fields or data are displayed.
For example, from the above data, the analyst may want to focus only on the returns which each
customer sends back to the company. The following shows an example of how that might be done:
Using the PivotTable option under the Insert tab of the spreadsheet and selecting the range of data from
the previous example, the pivot table has been set up so that, against each customer ID under the row
labels, only the values for the quantity of returns per product and the totals are shown.
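The same reshaping can be reproduced outside Excel; the sketch below uses the pandas pivot_table function on a hypothetical table of customer transactions to show the quantity of returns per product for each customer.

```python
import pandas as pd

sales = pd.read_csv("customer_sales.csv")  # hypothetical: one row per customer/product

# Rows: customer ID; columns: product; values: quantity returned, with overall totals.
returns_pivot = pd.pivot_table(
    sales,
    index="customer_id",
    columns="product",
    values="qty_returned",
    aggfunc="sum",
    margins=True,  # adds an 'All' row and column of totals
)
print(returns_pivot)
```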
By examining this, it seems that the total numbers of returns of products 1 and 3 are very similar across
all customers, but what does this really tell us? To get more insight it would be necessary to
compare these returns figures with the actual quantities of each product sold to each customer to
identify what percentage of each product was returned overall. More relevant still would be an
analysis of the percentage of each product returned by individual customers, to establish which
customers sent back the greatest proportion of returns under each product type; this requires an even
more targeted analysis.
An increasingly popular area to apply descriptive data analytics is in finance by using externally
available information from the stock markets to help inform and support investment decisions. Many
analysts source their data from a range of external sources such as Yahoo, Google finance or other
easily accessible and free to use databases. This now means that historical data of share prices and
stock market indices are readily and widely available to use by anyone.
As an example, finance analysts often need to calculate the riskiness of stocks in order to estimate
the equity cost of capital and to inform their investment plans.
An analyst wants to estimate the beta of Amazon shares against the Standard & Poor's (S&P) 100
stock index. The beta measures how volatile the periodic returns in this share have been against the
S&P index as a whole.
To do this, the analyst would access the financial data from an external website and download it to
their spreadsheet.
In the example shown, monthly returns in Amazon shares have been measured against the returns in
the S&P index between February 2017 and January 2019 and shown using a scatter chart:
The above sheet shows the share/index price and returns data on the left and the comparative returns
plotted in a scatter chart on the right. The beta of the Amazon stock is calculated in cell F5. The
formula used in F5 is shown in the formula bar above the spreadsheet.
The easiest way to calculate a beta is to estimate the slope of a best-fit line through the data. This is
achieved using the =SLOPE formula in Excel, selecting the range of returns from the individual
company (Y axis) and then selecting the range containing the returns from the stock market as a whole
(X axis).
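The same calculation can be sketched outside Excel; the following uses linregress from scipy (a library choice assumed here) on short, invented return series to estimate the slope of the best-fit line.

```python
from scipy.stats import linregress

# Hypothetical monthly returns (as decimal fractions) for the index (x) and the share (y).
index_returns = [0.010, -0.020, 0.015, 0.030, -0.010, 0.020]
share_returns = [0.014, -0.026, 0.020, 0.035, -0.011, 0.025]

result = linregress(index_returns, share_returns)
print("beta (slope of the best-fit line):", round(result.slope, 2))
```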
Interpreting this, it can be seen that Amazon has a positive beta, meaning that as the stock market
rises over a period, the price of the company's shares also rises in that same period. The beta
calculated here is +1.2 for Amazon over the period considered, which means that investing in
Amazon's shares independently is riskier than investing in, or tracking, the index as a whole.
Predictive Analytics
Predictive analytics builds statistical models from processed raw data with the aim of being able to
forecast future outcomes. It attempts to answer the question 'What will happen?'
This type of analytics is about understanding the future. Predictive analytics provides businesses with
valuable insights based on data which allow analysts to extrapolate from the past to assume
behaviours and outcomes in the future. It is important to remember that data analytics cannot be
relied upon to 'predict' the future with complete certainty. Business managers should therefore be
sceptical, recognise the limitations of such analytics, and understand that any prediction can only be
based on reasonable probabilities and assumptions.
These analytics use historical (descriptive) data and statistical techniques to estimate future outcomes
based on observed relationships between attributes or variables. They identify patterns in the data
and apply statistical models and algorithms to capture relationships between various data sets.
Predictive analytics can be used throughout the organization, from forecasting customer behaviour
and purchasing patterns to identifying trends in manufacturing processes and the predicted impact on
quality control.
Regression
Regression analysis is a popular method of predicting a continuous numeric value. A simple example
in a business context would be using past data on sales volumes and advertising spend to build a
regression model that allows managers to predict future sales volumes on the basis of the projected or
planned advertising spend. Using a single predictor or independent variable to forecast the value of a
target or dependent variable is known as simple regression. The inclusion of multiple independent
variables is more typical of real-world applications and is known as multiple regression.
The simplest regression models, such as those produced by Microsoft Excel, assume that the
relationship between the independent variables and the dependent variable is strictly linear. It is
possible to accommodate a limited range of alternative possible relationships by transforming the
variables using logarithms or by raising them to a power. More sophisticated algorithms can model
curved or even arbitrarily-shaped relationships between the variables.
The performance of a regression model is determined by how far its predictions are from the actual
values. If large errors are of particular concern, the squared differences are used; otherwise the
absolute differences are used. In Excel, this information is given by a regression output table, which
indicates the predictive value of the independent variable(s) for the dependent variable. The key
statistic is R², which ranges from 0, a completely random association, to 1, a perfect fit. The statistical
significance of the relationships can also be confirmed by looking at the P-values and the Significance
F, which should be sufficiently small to allow greater confidence.
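As a sketch of how such a model might be fitted outside Excel, the following uses the statsmodels library (an assumed choice) to fit a multiple regression and print an output table containing R-squared, P-values and the F-statistic; the file and column names are illustrative.

```python
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("sales_history.csv")  # hypothetical monthly sales data

# Independent variables (with a constant term) and the dependent variable.
X = sm.add_constant(data[["advertising_spend", "hours_of_sunshine"]])
y = data["sales_volume"]

model = sm.OLS(y, X).fit()
print(model.summary())  # includes R-squared, coefficient P-values and the F-statistic
```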
One common application that most people are familiar with is the use of predictive analytics to
estimate sales of a product based on different factors such as the weather.
The following spreadsheet includes data on monthly barbecue sales and how these are potentially
influenced by:
Artificial intelligence is an important branch of computer science with the broad aim of creating
machines that behave intelligently. The field has several subfields, including robotics and machine
learning. There are three major categories of artificial intelligence:
Artificial General Intelligence, also known as Strong AI or Human-Level AI, is the term used for
artificial intelligence that permits a machine to have the same capabilities as a human.
Artificial Intelligence
Artificial Intelligence has many uses in business and finance, many of which draw heavily on machine
learning, including:
Robotics
Not all robots are designed to resemble humans in appearance, but many are given human-like
features to allow them to perform physical tasks otherwise performed by humans. The design of such
robots makes considerable use of sensor technology, including but not limited to computer vision
systems, which allow the robot to 'see' and identify objects.
Machine Learning
Machine learning is the use of statistical models and other algorithms in order to enable computers to
learn from data. It is divided into two distinct types, unsupervised and supervised learning. The main
feature of machine learning is that the machine learns from its own experience of interacting with the
data it is processing and can make decisions independently of any input from human beings. Such
systems can adapt or create their own algorithms to help them make better and more relevant decisions
on the basis of this experience.
Unsupervised Learning draws inferences and learns structure from data without being provided with
any labels, classifications or categories. In other words, unsupervised learning can occur without
being provided with any prior knowledge of the data or patterns it may contain.
The most frequently used form of unsupervised learning is clustering, which is the task of grouping a
set of observations so that those in the same group (cluster) are more similar to each other in some
way than they are to those in other clusters. There are multiple methods of determining similarity or
dissimilarity. The most commonly used is some form of distance measure, with observations that
are close to each other being considered to be part of the same cluster.
The quality of clusters can be determined by a number of evaluation measures. These generally base
their quality score on how compact each cluster is and how distant it is from other clusters.
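A minimal clustering sketch using k-means from scikit-learn (a library choice assumed here), grouping observations by distance as described above; the customer figures are invented.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical observations: annual spend ($) and number of orders per customer.
X = np.array([[200, 3], [220, 4], [2100, 30], [1900, 28], [950, 12], [1000, 14]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # the cluster assigned to each observation
print(kmeans.cluster_centers_)  # the centre of each cluster
```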
Market basket analysis, the task of discovering which products tend to be bought together, can also be
found in online outlets such as Amazon, who use the results of the analysis to inform their product
recommendation system. The two most often used market basket analysis approaches are the a-priori
algorithm and frequent pattern growth.
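A sketch of the a-priori approach using the mlxtend library (an assumed choice); the baskets below are invented and the support and confidence thresholds are arbitrary.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One row per basket, one boolean column per product (illustrative data).
baskets = pd.DataFrame(
    [[True, True, False], [True, True, True], [False, True, True], [True, False, False]],
    columns=["bread", "butter", "jam"],
)

frequent = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```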
Supervised Learning is similar to the human task of concept learning. At its most basic level, it
allows a computer to learn a function that maps a set of input variables to an output variable using a
set of example input-output pairs. It does this by analysing the supplied examples and inferring what
the relationship between the two may be.
The goal is to produce a mapping that allows the algorithm to correctly determine the output value for
as yet unseen data instances. This is closely related to the predictive analytics covered earlier in the
unit: the machine learns from past relationships between variables and builds up a measure of how
certain variables, factors or behaviours predict the responses it should give. For example, knowing the
temperature and other weather conditions forecast for the coming week allows the machine to calculate
orders for specific products based on those forecast factors, as in the sketch below.
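A minimal supervised-learning sketch with scikit-learn (an assumed choice), learning the mapping from forecast temperature to product orders from example input-output pairs; all figures are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Example input-output pairs: forecast temperature (°C) -> units to order (illustrative).
temperature = np.array([[12], [15], [18], [22], [26], [30]])
orders = np.array([40, 55, 80, 120, 170, 230])

model = LinearRegression().fit(temperature, orders)

# Predict the order quantity for a previously unseen forecast of 28°C.
print(model.predict(np.array([[28]])))
```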
There are many tools available for descriptive analytics, some of which are briefly described below:
Microsoft Excel
Microsoft Excel with the Data Analysis Tool Pack is a relatively easy to use yet powerful application
for descriptive analysis. Its main drawback is that the number of rows of data that can be processed
is limited to just over one million. However, it is a viable and readily available tool for descriptive statistical
analysis of smaller datasets.
RapidMiner
RapidMiner is a data science software platform developed by the company of the same name that
provides an integrated environment for data preparation, machine learning, deep learning, text mining,
and predictive analytics.
WEKA
WEKA, the Waikato Environment for Knowledge Analysis is a suite of machine learning software
written in Java, developed at the University of Waikato, New Zealand.
KNIME
KNIME, the Konstanz Information Miner, is a free and open-source data analytics, reporting and
integration platform. KNIME integrates various components for machine learning and data mining
through its modular data pipelining concept.
R
R is a statistical programming language and computing environment created by the R Foundation for
Statistical Computing. The R language is widely used among statisticians and data miners for
developing statistical software. It is particularly useful for data analysts because it can read any type
of data and supports much larger data sets than is currently possible with spreadsheets.
Python
Python is a general-purpose programming language that can make use of additional code in the form
of 'packages' that provide statistical and machine learning tools.
SAS
SAS is a commercial provider of Business Intelligence and data management software with a suite of
solutions that include artificial intelligence and machine learning tools, data management, risk
management and fraud intelligence.
SPSS Statistics
SPSS Statistics is a commercial solution from IBM which, while originally designed for social science
research, is increasingly used in health sciences and marketing. In common with the other
applications listed here, it provides a comprehensive range of tools for descriptive statistics.
Stata
Stata is a commercial statistical software solution frequently used in economics and health sciences.
All of the tools mentioned in the previous section can also be used for predictive analytics. Some,
such as Excel and SPSS Statistics, are limited in the range of predictive analytics tasks they can
perform. In particular, these tools do not offer the wide range of options for classification or advanced
regression available in more specialised platforms.
Predictive analytics features are also provided by applications and services such as IBM Predictive
Analytics, SAS Predictive Analytics, Salford Systems SPM 8, SAP Predictive Analytics, Google Cloud
Prediction API. R and Python can also be used to perform predictive analytics.
Other tools in the predictive analytics space include SPSS Modeler from IBM, Oracle Data Mining,
Microsoft Azure Machine Learning and TIBCO Spotfire.
Tools in the prescriptive analytics space are fewer in number. One frequently overlooked solution is
the 'what if' analysis tool which is part of Excel's Analysis Tool Pack. This simple yet effective small-
scale prescriptive analytics tool allows the user to model different scenarios by plugging different
values into a worksheet's formulas.
As mentioned earlier in the unit, there is also 'Scenario Manager', which allows the analyst to test
outcomes from different scenarios, but the most powerful tool in the Tool Pack is 'Solver', a
flexible and powerful optimisation tool; examples of how 'Solver' can help solve business
problems and determine optimal solutions have already been illustrated.
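As an illustration of the kind of optimisation problem Solver handles, the following sketch uses linprog from scipy (an assumed alternative to Solver) to maximise contribution from two products subject to resource constraints; all of the figures are invented.

```python
from scipy.optimize import linprog

# Maximise 30*x1 + 20*x2 (contribution per unit) subject to:
#   2*x1 + 1*x2 <= 100  (labour hours available)
#   1*x1 + 1*x2 <= 80   (machine hours available)
# linprog minimises, so the objective coefficients are negated.
result = linprog(
    c=[-30, -20],
    A_ub=[[2, 1], [1, 1]],
    b_ub=[100, 80],
    bounds=[(0, None), (0, None)],
)
print("optimal production plan:", result.x)
print("maximum contribution:", -result.fun)
```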
Although spreadsheets are versatile tools which most people have access to and can easily use, R
and Python would be two other widely used tools for more advanced prescriptive analytics as they
use programming languages which allow the user the flexibility to design prescriptive analytical
models, only limited in their sophistication by the programmer or coder’s skill, ingenuity and
imagination.
Answer questions that would be difficult, if not impossible, to answer using non-visual
analyses.
Discover questions that were not previously apparent and reveal previously unidentified
patterns.
Gain rapid insights into data which are relevant and timely.
Data visualisation is not a new concept. It could be argued that it reaches all the way back to pre-
history.
One of the most ancient calculators, the Abacus, was invented by the Chinese over 2,500 years ago. It
is not only an ancient calculator but also an early example of data visualisation, where the number of
beads in a column shows the relative quantities as counted on rods. The Abacus has two sections, top
and bottom, with a bar or 'beam' dividing them. When beads are pushed upwards or downwards
towards the bar they are considered as counted.
The magnitude of the numbers increases by a multiple of 10 for each rod, going from the right to the
left, with the far right-hand side rod containing the beads with the lowest value or denomination. This
means that below the bar of the rod at the far right, beads are worth 1 unit each and there are a total
of 5. Each of the two beads above the bar are worth the same as all five beads below the bar, so each
of the beads above the bar on the far right rod is worth 5 each. In the next adjacent column to the left,
each of the bottom beads is worth 10 and the top beads are worth 50 each and so on.
In the following example Mr Hoo has been calculating his fuel expenses for a month on an
Abacus and has arrived at the total. (Note the minimum denomination in the Abacus is assumed
at $1):
More recently, William Playfair is credited with the invention of a now common form of data
visualisation, the bar chart, with this visualisation of exports and imports of Scotland to and from
different parts of the world, over one year from December 1780 to December 1781. He was one of the
earliest data visualisers, having also created both the area chart and the stacked bar chart.
Types of data visualisation - Comparison
In business many types of data visualisation are used to present data and information more clearly
and effectively to users of that data. Visualisation types for comparison and composition among
categories are classified into two distinct types:
Static
For a single category or a small number of categories, each with few items, a column chart is the best
choice. In this case sales of groceries, home wares, Deli and Bakery are shown as sold in two
different stores.
For a single category or a small number of categories, each with many items, a bar chart is the best
choice. In the chart below, where there are six stores, the vertical column chart becomes less useful. A
more effective way to visualise this data is in a horizontal bar chart, with stores being listed on the Y
axis and the sales along the X axis.
The pie chart is an example of a static visualisation, but shows the relative composition using the size
of the slices, each of which represents a simple share of the total.
The waterfall chart shows how each component adds to or subtracts from the total.
In this example, the green bars represent revenue, adding to the total. The red bars represent costs
and are subtracted from the total. The net amount left after costs have been subtracted from the
revenues, is represented by the blue bar which is profit.
Dynamic composition shows the change in the composition of the data over time. Where the
analysis involves few periods, a stacked bar chart is used where the absolute value for each
category matters in addition to the relative differences between categories.
Where only the relative differences between categories matter, a stacked 100% column chart can be
used. In this example it is useful as a way of visualising how much of total sales is made up of each
product group. In the example below it can be seen that in 2018 grocery is becoming a bigger
component of total sales, while deli sales and bakery sales are declining as a percentage of the total.
The scatter plot is ideal for visualising the relationship between two variables and identifying potential
correlations between them. Each observation for the two variables is plotted as a point, with the
position on the x axis representing the value of one variable and the position on the y axis
representing the value of the other.
The example below, which uses the earlier barbecue data, is a scatter diagram of barbecue sales against
the recorded hours of sunshine per month:
The scatter diagram shows that there is a reasonably close positive correlation between the monthly
hours of sunshine and the sales of barbecues.
Although it is possible to introduce a third variable and create a 3D scatter chart, these can be difficult
to visualise and for the user to interpret. In this instance, the preferred solution is to produce multiple
2D scatter charts for each combination of variables. An alternative to this approach is the bubble
chart, which is a standard 2D scatter chart where the values in the third variable are represented by
the size of the points or ‘bubbles’. The bubble chart can sometimes be difficult to interpret where the
number of observations is high and the range of values in the third variable is wide.
The main problem with bubble charts is that it can be difficult for the reader to compare the size of each
bubble in absolute terms. A useful way to resolve this problem is to add a key indicating the relative
size of the bubbles. The bubble chart below shows the position on the grid of the sales generated for a
product over four quarterly periods. The size of the bubbles represents the advertising spend in each
quarter. As the size and height of the bubbles in the graph are positively correlated, this chart shows
clearly that the greater the advertising spend, the greater the quarterly sales generated. The key, which
has been added in white, helps the reader gauge the value of absolute sales generated in different
periods against the total advertising spend.
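A bubble chart of this kind can be sketched with matplotlib (an assumed tool choice), using the third variable to scale the point sizes; the quarterly figures are invented.

```python
import matplotlib.pyplot as plt

quarters = [1, 2, 3, 4]
sales = [120, 180, 150, 240]     # quarterly sales ($000), illustrative
advertising = [10, 25, 15, 40]   # advertising spend ($000), illustrative

# Bubble size represents the third variable (advertising spend).
plt.scatter(quarters, sales, s=[a * 20 for a in advertising], alpha=0.5)
plt.xlabel("Quarter")
plt.ylabel("Sales ($000)")
plt.title("Quarterly sales (bubble size = advertising spend)")
plt.show()
```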
According to Andy Kirk, a data visualisation specialist, a good data visualisation should have the
following qualities:
It must be trustworthy;
It must be accessible;
It must be elegant.
In his work on graphical excellence, statistician Edward R. Tufte describes an effective visualisation
as:
that which gives to the viewer the greatest number of ideas in the shortest time with the least
ink in the smallest space;
nearly always multivariate;
requires telling the truth about the data.
The principle of accessibility suggested by Kirk echoes Tufte's statement that a good visualisation
should not only give the viewer the greatest number of ideas in the shortest space of time, but should
also have clarity, precision and efficiency. In effect, this means concentrating on those design
elements that actively contribute to visualising the data and avoiding the use of unnecessary
decoration; which Tufte refers to as 'chart junk'. It also means we should avoid trying to represent too
many individual data series in a single visualisation, breaking them into separate visualisations if
necessary.
Accessibility is, to paraphrase Kirk, all about offering the most insight for the least amount of viewer
effort. This implies that a significant part of designing a visualisation is to understand the needs of the
audience for the work and making conscious design decisions based on that knowledge.
Careful use of colour is a fundamental part of ensuring an accessible design. The choice of colours
should be a deliberate decision. They should be limited in number, should complement each other
and should be used in moderation to draw attention to key parts of the design. When colour is used
this way, it provides a cue to the viewer as to where their attention should be focussed. For
visualisations with a potentially international audience, the choice of colours should also be informed
by cultural norms. For example, in the western world, the colour red means danger but for the
population of East Asia, it signifies luck or spirituality.
An accessible design should enable the viewer to gain rapid insights from the work, much as a
handle enables us to use a cup more efficiently. Useful visual metaphors of this kind include the 'traffic
light' design used to provide a high-level summary of the state of some key business risk indicator, and
the speedometer metaphor employed to indicate performance. Both are frequently found on Business
Intelligence dashboards used by businesses.
Kirk's final principle, elegance, is a difficult one to describe, but is vital to any successful visualisation.
The key here is that what is aesthetically pleasing is usually simpler and also easier to interpret and is
likely to not only catch our attention but hold it for longer. A good design should avoid placing
obstacles in the way of the viewer; it should flow seamlessly and should guide the viewer to the key
insights it is designed to impart.