Business Analytics
Content Writers
Dr. Abhishek Kumar Singh, Dr. Satish Kumar Goel,
Mr. Anurag Goel, Dr. Sanjay Kumar
Academic Coordinator
Mr. Deekshant Awasthi
Published by:
Department of Distance and Continuing Education
Campus of Open Learning/School of Open Learning,
University of Delhi, Delhi-110007
Printed by:
School of Open Learning, University of Delhi
Printed at: Taxmann Publications Pvt. Ltd., 21/35, West Punjabi Bagh,
New Delhi - 110026 (300 Copies, 2023)
L E S S O N
1
Introduction to Business
Analytics and Descriptive
Analytics
Dr. Abhishek Kumar Singh
Assistant Professor
University of Delhi
Email-Id: abhishekbhu008@gmail.com
STRUCTURE
1.1 Learning Objectives
1.2 Introduction
1.3 Business Analytics
1.4 Role of Analytics for Data-Driven Decision Making
1.5 Types of Business Analytics
1.6 Introduction to the Concepts of Big Data Analytics
1.7 Overview of Machine Learning Algorithms
1.8 Introduction to Relevant Statistical Software Packages
1.9 Summary
1.10 Answer to In-Text Questions
1.11 Self-Assessment Questions
1.12 References
1.13 Suggested Readings
1.2 Introduction
Business Analytics (BA) consists of using data to gain valuable insights
and make informed decisions in a business setting. It involves analysing
and interpreting data to uncover patterns, trends, and correlations that
can help organizations improve their operations, better understand their
customers, and make strategic decisions. Business Analytics places a strong emphasis on statistical analysis, but it also covers other aspects such as data mining, predictive modelling, data visualization, machine learning, and data-driven decision making.
Companies committed to making data-driven decisions employ business analytics. The study of data through statistical and operational analysis, the creation of predictive models, the use of optimisation techniques, and the communication of these results to clients, business partners, and business executives are all considered components of business analytics. It relies on quantitative methodologies, and the data needed to build specific business models and reach profitable conclusions must be supported by evidence. As a result, Business Analytics heavily relies on and utilises Big Data. Business analytics is the process of analysing data on past outcomes and problems in order to create an effective plan for the future. Big Data, that is, very large volumes of data, is utilised to generate answers. The economy, and the sectors that prosper within it, depend on this way of conducting business and this outlook on creating and maintaining a business. Over the past decade or so, the term analytics has gained popularity, and analytics has become incredibly important due to the growth of the internet and information technology.
In this lesson, we are going to learn about Business Analytics. The field of analytics integrates data, information technology, statistical analysis, and quantitative techniques with computer-based models. All of these elements work together to present decision-makers with every possibility that can arise, allowing them to make well-informed choices. The computer-based models ensure that decision-makers can examine how their choices would play out in various scenarios.
Figure 1.1
1.3.1 Meaning
Business Analytics (BA) is a comprehensive discipline and technological approach that utilizes data analysis, statistical models, and various quantitative techniques. It involves a systematic and iterative examination of organizational data, with a specific emphasis on statistical analysis, to facilitate informed decision-making.
Business analytics primarily entails a combination of the following:
discovering novel patterns and relationships using data mining; developing
business models using quantitative and statistical analysis; conducting A/B
and multi-variable testing based on findings; forecasting future business
needs, performance, and industry trends using predictive modelling; and
reporting your findings to co-workers, management, and clients in simple-
to-understand reports.
market share and revenue, and give shareholders a higher return. It entails
improved primary and secondary data interpretation, which again affects
the operational effectiveness of several departments. Moreover, it provides
a competitive advantage to the organization. The flow of information is
nearly equal among all actors in this digital age. The competitiveness of
the company is determined by how this information is used. Corporate
analytics improves corporate decisions by combining readily available
data with numerous carefully considered models.
Structured data, as the name implies, has a clear structure and follows a regular sequence. A person or machine can readily access and utilise this type of information, since it has been designed to be user-friendly. Structured data is typically kept in databases, especially relational database management systems (RDBMS), and in tables with clearly defined rows and columns, such as spreadsheets.
While semi-structured data displays some of the same characteristics
as structured data, for the most part it lacks a clear structure and
cannot adhere to the formal specifications of data models like an
RDBMS.
Unstructured data does not adhere to the formal structural norms of
traditional data models and lacks a consistent structure across all
of its different forms. In a very small number of cases, it might
contain information on the date and time.
Veracity: The level of dependability and truth that big data can provide in terms of its applicability, rigour, and correctness.
Value: This feature examines whether the information and analytics will ultimately be beneficial or detrimental, as the main goal of big data collection and analysis is to uncover insights that can guide decision-making and other activities.
Figure 1.3
3. A process for model optimisation: If the model can fit the data points in the training set more closely, the weights are adjusted to lessen the difference between the known example and the model prediction. The algorithm will iteratively evaluate and optimise, updating the weights on its own each time, until a desired accuracy level is reached.
feature extraction and classification from a larger, unlabelled data set. If you don’t have enough labelled data—or can’t pay to label enough data—to train a supervised learning system, semi-supervised learning can help.
Access to a sizable database that reduces sampling error and enables data-driven decision making.
Relevant statistical software packages:
Increase productivity and accuracy in data management and analysis while requiring less time.
Offer simple personalization and access to a sizable database, which reduces sampling error and enables data-driven decision-making.
1. SPSS (Statistical Package for Social Sciences)
The most popular and effective programme for analysing complex
statistical data is called SPSS.
To make the results easy to discuss, it quickly generates descriptive
statistics, parametric and non-parametric analysis, and delivers graphs
and presentation-ready reports.
Here, estimation and the identification of missing values in the data sets lead to more accurate reports.
For the analysis of quantitative data, SPSS is utilised.
2. Stata
Stata is another commonly used programme that makes it possible
to manage, save, produce, and visualise data graphically. It does
not require any coding expertise to use.
Its use is more intuitive because it has both a command line and a
graphical user interface.
3. R
Free statistical software known as “R” offers graphical and statistical
tools, including linear and non-linear modelling.
Toolboxes, which are effective plugins, are available for a wide range of applications. Here, coding expertise is necessary.
It offers interactive reports and apps, makes extensive use of data,
and complies with security guidelines.
R is used to analyse quantitative data.
4. Python
Python is another freely available software package.
8. Epi Info
It is a public domain software suite created by the Centers for Disease Control and Prevention (CDC) for researchers and public health professionals worldwide.
For those who might not have a background in information technology,
it offers simple data entry forms, database development, and data
analytics including epidemiology statistics, maps, and graphs.
Investigations into disease outbreaks, the creation of small to medium-
sized disease monitoring systems, and the Analysis, Visualisation,
and Reporting (AVR) elements of bigger systems all make use of
it.
It is utilised for the analysis of numerical data.
9. NVivo
It is a piece of software that enables the organisation and archiving
of qualitative data for analysis.
The analysis of unstructured text, audio, video, and image data, such
as that from interviews, Focus Group Discussion (FGD), surveys,
social media, and journal articles, is done using NVivo.
You can import Word documents, PDFs, audio, video, and photos
here.
It facilitates users’ more effective organisation, analysis, and discovery of insights from unstructured or qualitative data.
The user-friendly layout makes it instantly familiar and intuitive for
the user. It contains a free version as well as automated transcribing
and auto coding.
Research using mixed methods and qualitative data is conducted
using NVivo.
10. Minitab
Minitab provides both fundamental and moderately sophisticated statistical analysis capabilities.
It has the ability to analyse a variety of data sets, automate statistical calculations, and produce attractive visualisations.
Minitab allows users to concentrate more on data analysis by letting them examine both current and historical data to spot trends and patterns as well as hidden links between variables.
It makes it easier to understand the data’s insights.
Minitab is employed for the examination of quantitative data.
11. Dedoose
Dedoose, a tool for qualitative and quantitative data analysis, is
entirely web-based.
This low-cost programme is user-friendly and team-oriented, and
it makes it simple to import both text and visual data.
It provides access to cutting-edge data security technology.
12. ATLAS.ti
It is a pioneer in qualitative analysis software and has incorporated
AI as it has developed.
It is best suited to research organisations, businesses, and academic institutions, given the cost of conducting individual studies.
It is made more powerful by sentiment analysis and auto coding.
It gives users the option to use any language or character set.
13. MAXQDA 12
It is expert software for analysing data using quantitative, qualitative,
and mixed methods.
It imports the data, reviews it in a single spot, and categorises any
unstructured data with ease.
With this software, a literature review may also be created.
It is paid software, and collaborating with others in a team is not always easy.
IN-TEXT QUESTIONS
1. What is the term used for a collection of large, complex data
sets that cannot be processed using traditional data processing
tools?
11. What is the process of cleaning and transforming data before it is used for analysis?
(a) Data Mining
(b) Data Warehousing
(c) Data Integration
(d) Data Preprocessing
12. Which of the following is not a common use case for Big Data
analytics?
(a) Fraud Detection
(b) Customer Segmentation
(c) Social Media Analysis
(d) Inventory Management
13. Which of the following is not a common method for selecting
the best features for a machine learning model?
(a) Filter Methods
(b) Wrapper Methods
(c) Embedded Methods
(d) Extrapolation Methods
14. Which of the following is a technique for grouping similar data
points together?
(a) Classification
(b) Regression
(c) Clustering
(d) Dimensionality Reduction
15. Which of the following is a measure of how well a machine
learning model is able to make predictions on new data?
(a) Accuracy
(b) Precision
(c) Recall
(d) All of the above
1.12 References
Evans, J.R. (2021), Business Analytics: Methods, Models and Decisions,
Pearson India.
Kumar, U. D. (2021), Business Analytics: The Science of Data-Driven
Decision Making, Wiley India.
Larose, D. T. (2022), Data Mining and Predictive Analytics, Wiley
India.
Shmueli, G. (2021), Data Mining and Business Analytics, Wiley India.
L E S S O N
2
Predictive Analytics
Dr. Satish Kumar Goel
Assistant Professor
Shaheed Sukhdev College of Business Studies
University of Delhi
Email-Id: satish@sscbsdu.ac.in
STRUCTURE
2.1 Learning Objectives
2.2 Introduction
2.3 Classical Linear Regression Model (CLRM)
2.4 Multiple Linear Regression Model
2.5 Practical Exercises Using R/Python Programming
2.6 Summary
2.7 Answers to In-Text Questions
2.8 Self-Assessment Questions
2.9 Reference
2.10 Suggested Readings
2.2 Introduction
In this chapter, we will explore the field of predictive analytics, focusing on two fundamental
techniques: Simple Linear Regression and Multiple Linear Regression. Predictive analytics
is a powerful tool for analysing data and making predictions about future
outcomes. We will cover various aspects of regression models, including
parameter estimation, model validation, coefficient of determination,
significance tests, residual analysis, and confidence and prediction
intervals. Additionally, we will provide practical exercises to reinforce your
understanding of these concepts, using R or Python for implementation.
2.3.1 Introduction
Predictive analytics is the use of statistical techniques, machine learning
algorithms, and other tools to identify patterns and relationships in
historical data and use them to make predictions about future events. These
predictions can be used to inform decision-making in a wide variety of
areas, such as business, marketing, healthcare, and finance.
Linear regression is the traditional statistical technique used to model the
relationship between one or more independent variables and a dependent
variable.
Linear regression involving only two variables is called simple linear regression. Let us consider two variables, ‘x’ and ‘y’. Here ‘x’ represents the independent (explanatory) variable and ‘y’ represents the dependent (response) variable. The dependent variable must be a ratio variable, whereas the independent variable can be a ratio or a categorical variable. A regression model may be built for cross-sectional data or for time series data. In a time series regression model, time is taken as the independent variable, which is very useful for predicting the future. Before we develop a regression model, it is good practice to ensure that the two variables are linearly related. For this, plotting a scatter diagram is very helpful, as a linear pattern can easily be identified in the data.
The Classical Linear Regression Model (CLRM) is a statistical framework
used to analyse the relationship between a dependent variable and one or
more independent variables. It is a widely used method in econometrics
and other fields to study and understand the nature of this relationship,
make predictions, and test hypotheses.
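As a hands-on illustration of these ideas, the following is a minimal sketch of fitting a simple linear regression in Python with the statsmodels library (which is also used later in this lesson); the x and y values below are invented purely for illustration.

```python
# A minimal sketch of estimating a simple linear regression y = b0 + b1*x
# with statsmodels; the data below are made up for illustration only.
import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)        # independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])  # dependent variable

X = sm.add_constant(x)        # add the intercept term
model = sm.OLS(y, X).fit()    # ordinary least squares estimation
print(model.params)           # estimated intercept (b0) and slope (b1)
print(model.summary())        # t-tests, R-squared, residual diagnostics, etc.
```

A scatter plot of x against y, as suggested above, would confirm the linear pattern before fitting the model.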
Var(εi) = σ²
TSS = Σ(yi – ȳ)²
TSS can further be split into two components: the Explained Sum of Squares (ESS) and the Residual Sum of Squares (RSS).
Explained Sum of Square (ESS) or Regression sum of squares or Model
sum of squares is a statistical quantity used in modelling of a process.
ESS gives an estimate of how well a model explains the observed data
for the process.
ESS = Σ(ŷi – ȳ)²
The Residual Sum of Squares (RSS) measures the variation left unexplained by the model:
RSS = Σ(yi – ŷi)²
r² = ESS/TSS
Alternatively, from the above, r² = 1 – RSS/TSS
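The small Python sketch below shows the TSS/ESS/RSS decomposition and the resulting r², using an ordinary least-squares line fitted with NumPy; the data points are illustrative.

```python
# Illustrative computation of TSS, ESS, RSS and r-squared for a fitted line.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.8, 7.2, 8.9, 11.0])

b1, b0 = np.polyfit(x, y, 1)     # least-squares slope and intercept
y_hat = b0 + b1 * x              # fitted values

tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
rss = np.sum((y - y_hat) ** 2)          # residual sum of squares

print(ess / tss, 1 - rss / tss)  # the two expressions for r-squared agree
```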
Sensitivity: OLS estimators and their standard errors are sensitive to small changes in the data.
Efficiency: Despite increased variance, OLS estimators are still efficient,
meaning they have minimum variance among all linear unbiased estimators.
In summary, multicollinearity undermines the precision of coefficient
estimates and can lead to unreliable statistical inference. While the OLS
estimators remain unbiased, they become imprecise, resulting in wider
confidence intervals and potential insignificance of coefficients.
We will learn how to detect multicollinearity using the Variance Inflation
Factor (VIF) and explore strategies to address this issue, ensuring the
accuracy and interpretability of the regression model.
VIF stands for Variance Inflation Factor, which is a measure used to
assess multicollinearity in multiple regression model. VIF quantifies how
much the variance of the estimated regression coefficient is increased
due to multicollinearity. It measures how much the variance of one
independent variable’s estimated coefficient is inflated by the presence
of other independent variables in the model.
The formula for calculating the VIF for an independent variable Xj is:
VIF(Xj) = 1 / (1 – rj²)
where rj² represents the coefficient of determination (R-squared) from a regression model that regresses Xj on all other independent variables.
The interpretation of VIF is as follows:
If VIF(Xj) is equal to 1, it indicates that there is no correlation between
Xj and the other independent variables.
If VIF(Xj) is greater than 1 but less than 5, it suggests moderate
multicollinearity.
If VIF(Xj) is greater than 5, it indicates a high degree of multicollinearity,
and it is generally considered problematic.
When assessing multicollinearity, it is common to examine the VIF
values for all independent variables in the model. If any variables have
high VIF values, it indicates that they are highly correlated with the
other variables, which may affect the reliability and interpretation of the
regression coefficients.
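As a sketch of how the definition above can be applied, the code below computes the VIF of each predictor by running the auxiliary regression of Xj on the other predictors; the small data frame is hypothetical.

```python
# VIF from the auxiliary-regression definition: VIF(Xj) = 1 / (1 - rj^2).
import pandas as pd
import statsmodels.api as sm

def vif(df, col):
    others = df.drop(columns=[col])
    aux = sm.OLS(df[col], sm.add_constant(others)).fit()  # regress Xj on the rest
    return 1.0 / (1.0 - aux.rsquared)

df = pd.DataFrame({"X1": [1, 2, 3, 4, 5, 6],
                   "X2": [2, 4, 6, 8, 11, 12],   # nearly collinear with X1
                   "X3": [5, 3, 6, 2, 7, 4]})

for col in df.columns:
    print(col, round(vif(df, col), 2))           # large values flag multicollinearity
```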
If high multicollinearity is detected (e.g., VIF greater than 5), some steps can be taken to address it:
Remove one or more of the highly correlated independent variables
from the model. Combine or transform the correlated variables into
a single variable.
Obtain more data to reduce the correlation among the independent
variables.
By addressing multicollinearity, the stability and interpretability of the
regression model can be improved, allowing for more reliable inferences
about the relationships between the independent variables and the dependent
variable.
How to Detect Multicollinearity
To detect multicollinearity in your regression model, you can use several
methods:
Pairwise Correlation: Calculate the pairwise correlation coefficients between
each pair of explanatory variables. If the correlation coefficient is very
high (typically greater than 0.8), it indicates potential multicollinearity.
However, low pairwise correlations do not guarantee the absence of
multicollinearity.
Variance Inflation Factor (VIF) and Tolerance: VIF measures the extent
to which the variance of the estimated regression coefficient is increased
due to multicollinearity. High VIF values (greater than 10) suggest
multicollinearity. Tolerance, which is the reciprocal of VIF, measures
the proportion of variance in the predictor variable that is not explained
by other predictors. Low tolerance values (close to zero) indicate high
multicollinearity.
Insignificance of Individual Variables: If many of the explanatory
variables in the model are individually insignificant (i.e., their t-statistics
are statistically insignificant) despite a high R-squared value, it suggests
the presence of multicollinearity.
Auxiliary Regressions: Conduct auxiliary regressions where each independent variable is regressed against the remaining independent variables. Check the overall significance of these regressions using the F-test; a statistically significant auxiliary regression points to multicollinearity.
2.4.5 Autocorrelation
Autocorrelation, also known as serial correlation, refers to the correlation
between observations in a time series data set or within a regression
model. It arises when there is a systematic relationship between the current
observation and one or more past observation. Autocorrelation occurs
when the residuals of a regression model exhibit a pattern, indicating a
potential violation of the model’s assumptions. We will cover methods
intermediate lags. Similar to the PACF plot, significant values beyond the first lag in the ACF plot indicate the presence of autocorrelation.
Figure 2.2
Autocorrelation and partial autocorrelation function (ACF and PACF)
plots, prior to differencing (A and B) and after differencing (C and D).
In both the PACF and ACF plots, significance can be determined by
comparing the correlation values against the confidence intervals. If the
correlation values fall outside the confidence intervals, it suggests the
presence of autocorrelation.
It’s important to note that these graphical methods provide indications of
autocorrelation, but further statistical tests, such as the Durbin-Watson
test or Ljung-Box test, should be conducted to confirm and quantify the
autocorrelation in the model.
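A possible way to produce the ACF and PACF plots described above in Python is sketched below; the residuals are simulated stand-ins for the residuals of a fitted regression.

```python
# ACF and PACF plots of (simulated) regression residuals using statsmodels.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

rng = np.random.default_rng(0)
residuals = rng.normal(size=100)            # stand-in for model residuals

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
plot_acf(residuals, lags=20, ax=axes[0])    # autocorrelation function
plot_pacf(residuals, lags=20, ax=axes[1])   # partial autocorrelation function
plt.show()   # bars outside the confidence bands suggest autocorrelation
```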
2. Durbin Watson D Test
The Durbin-Watson test is a statistical test used to detect autocorrelation
in the residuals of a regression model. It is specifically designed for
detecting first-order autocorrelation, which is the correlation between
adjacent observations.
The Durbin-Watson test statistic is computed using the following formula:
d = Σ(e_i – e_(i–1))² / Σ e_i²
where:
e_i is the residual for observation i
e_(i–1) is the residual for the previous observation (i – 1)
The test statistic is then compared to critical values to determine the
presence of autocorrelation. The critical values depend on the sample
size, the number of independent variables in the regression model, and
the desired level of significance.
The Durbin-Watson test statistic, denoted as d, ranges from 0 to 4. The
test statistic is calculated based on the residuals of the regression model
and is interpreted as follows:
A value of d close to 2 indicates no significant autocorrelation. It
suggests that the residuals are independent and do not exhibit a systematic
relationship.
A value of d less than 2 indicates positive autocorrelation. It suggests
that there is a positive relationship between adjacent residuals, meaning
that if one residual is high, the next one is likely to be high as well.
A value of d greater than 2 indicates negative autocorrelation. It suggests
that there is a negative relationship between adjacent residuals, meaning
that if one residual is high, the next one is likely to be low.
The closer it is to zero, the greater is the evidence of positive autocorrelation,
and the closer it is to 4, the greater is the evidence of negative autocorrelation.
If d is about 2, there is no evidence of positive or negative (first-) order
autocorrelation.
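The sketch below computes the Durbin-Watson statistic both directly from its formula and with the statsmodels helper; the residuals are illustrative.

```python
# Durbin-Watson statistic: d = sum((e_i - e_{i-1})^2) / sum(e_i^2).
import numpy as np
from statsmodels.stats.stattools import durbin_watson

resid = np.array([0.5, -0.2, 0.1, 0.4, -0.3, -0.1, 0.2])   # illustrative residuals

d_manual = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
d_library = durbin_watson(resid)
print(d_manual, d_library)   # values near 2 suggest no first-order autocorrelation
```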
3. The Breusch-Godfrey Test
The Breusch-Godfrey test, also known as the LM test for autocorrelation,
is a statistical test used to detect autocorrelation in the residuals of a
regression model. Unlike the Durbin-Watson test, which is primarily
designed for detecting first-order autocorrelation, the Breusch-Godfrey
test can detect higher-order autocorrelation.
The test is based on the idea of regressing the residuals of the original
regression model on lagged values of the residuals. It tests whether the
lagged residuals are statistically significant in explaining the current
residuals, indicating the presence of autocorrelation.
The general steps for performing the Breusch-Godfrey test are as follows:
1. Estimate the initial regression model and obtain the residuals.
2. Extend the initial regression model by including lagged values of
the residuals as additional independent variables.
3. Estimate the extended regression model and obtain the residuals from
this model.
4. Perform a hypothesis test on whether the lagged residuals are jointly
significant in explaining the current residuals.
The test statistic for the Breusch-Godfrey test follows a chi-square
distribution and is calculated based on the Residual Sum of Squares
(RSS) from the extended regression model. The test statistic is compared
to the critical values from the chi-square distribution to determine the
presence of autocorrelation.
The interpretation of the Breusch-Godfrey test involves the following steps:
1. Set up the null hypothesis (H0): There is no autocorrelation in the
residuals (autocorrelation is absent).
2. Set up the alternative hypothesis (Ha): There is autocorrelation in
the residuals (autocorrelation is present).
3. Conduct the Breusch-Godfrey test and calculate the test statistic.
4. Compare the test statistic to the critical value(s) from the chi-square
distribution.
5. If the test statistic is greater than the critical value, reject the null hypothesis and conclude that there is evidence of autocorrelation. If the test statistic is less than the critical value, fail to reject the null hypothesis and conclude that there is no significant evidence of autocorrelation.
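A minimal sketch of running the Breusch-Godfrey test in Python with statsmodels follows; the regression is fitted on simulated data, and nlags sets the order of autocorrelation being tested.

```python
# Breusch-Godfrey (LM) test for autocorrelation in regression residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)          # simulated data

model = sm.OLS(y, sm.add_constant(x)).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = acorr_breusch_godfrey(model, nlags=2)
print(lm_stat, lm_pvalue)   # a small p-value means evidence of autocorrelation
```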
Generally, a p-value less than a significance level (e.g., 0.05) suggests
that the coefficient is statistically significant, implying a relationship
between the independent variable and the dependent variable.
R-squared: The R-squared value (R-squared or R2) measures the
proportion of the variance in the dependent variable that can be explained
by the independent variable(s). It ranges from 0 to 1, with higher values
indicating a better fit of the regression model to the data. R-squared can
be interpreted as the percentage of the dependent variable’s variation
explained by the independent variable(s).
Residuals: The regression results also include information about the
residuals, which are the differences between the observed values of the
dependent variable and the predicted values from the regression model.
Residuals should ideally follow a normal distribution with a mean of zero,
and their distribution can provide insights into the model’s goodness of
fit and potential violations of the regression assumptions.
It’s important to note that interpretation may vary depending on the specific
context and dataset. Therefore, it’s essential to consider the characteristics
of your data and the objectives of your analysis while interpreting the
results of an OLS regression.
Exercise 2: Test the assumptions of OLS (multicollinearity, autocorrelation,
normality etc.) on R/Python.
Sol. To test the assumptions of OLS, including multicollinearity,
autocorrelation, and normality, you can use various diagnostic tests in R
or Python. Here are the steps and some commonly used tests for each
assumption:
Multicollinearity:
Step 1: Calculate the pairwise correlation matrix between the independent
variables using the cor () function in R or the corrcoef () function in
Python (numpy).
Step 2: Calculate the Variance Inflation Factor (VIF) for each independent variable using the vif() function from the “car” package in R or the variance_inflation_factor() function from the “statsmodels” library in Python. VIF values greater than 10 indicate high multicollinearity.
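A possible Python version of Steps 1 and 2 is sketched below; the predictors are assumed to sit in a pandas DataFrame, and the column names are placeholders.

```python
# Step 1: pairwise correlations; Step 2: VIF via statsmodels.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({"x1": [1, 2, 3, 4, 5, 6, 7, 8],
                  "x2": [2, 4, 5, 8, 10, 12, 13, 16],
                  "x3": [7, 6, 6, 5, 5, 4, 4, 3]})

print(X.corr())                       # Step 1: pairwise correlation matrix

X_const = sm.add_constant(X)          # Step 2: VIF for each predictor
for i, name in enumerate(X_const.columns):
    if name != "const":
        print(name, variance_inflation_factor(X_const.values, i))
```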
IN-TEXT QUESTIONS
1. What is the primary goal of linear regression?
(a) Classification
(b) Prediction
(c) Clustering
(d) Dimensionality reduction
2.6 Summary
This chapter discusses a comprehensive understanding of predictive
analytics techniques, with a specific focus on simple linear regression
and multiple linear regression. It provides the knowledge and practical
skills necessary to apply these techniques using R or Python, enabling
one to make informed predictions and interpretations in the context of
the regression analysis.
2.7 Answers to In-Text Questions
1. (b) Prediction
2. (a) One independent variable
3. (a) Y = aX + b
4. (c) Residual
5. (a) The strength of the linear relationship
6. (b) As many as desired
7. (b) To minimize the sum of squared residuals
8. (c) The significance of an independent variable
9. (c) When dealing with categorical outcomes
10. (c) Instability in coefficient estimates
2.9 Reference
Business Analytics: The Science of Data Driven Decision Making,
First Edition (2017), U Dinesh Kumar, Wiley, India.
L E S S O N
3
Logistic and Multinomial
Regression
Anurag Goel
Assistant Professor, CSE Deptt.
Delhi Technological University
Email-Id: anurag@dtu.ac.in
STRUCTURE
3.1 Learning Objectives
3.2 Introduction
3.3 Logistic Function
3.4 Omnibus Test
3.5 Wald Test
3.6 Hosmer Lemeshow Test
3.7 Pseudo R Square
3.8 Classification Table
3.9 Gini Coefficient
3.10 ROC
3.11 AUC
3.12 Summary
3.13 Answers to In-Text Questions
3.14 Self-Assessment Questions
3.15 References
3.16 Suggested Readings
3.2 Introduction
In machine learning, we are often required to determine whether a particular observation belongs to a given class. In such cases, one can use logistic
regression. Logistic Regression, a popular supervised learning technique,
is commonly employed when the desired outcome is a categorical variable
such as binary decisions (e.g., 0 or 1, yes or no, true or false). It finds
extensive applications in various domains, including fake news detection
and cancerous cell identification.
Some examples of logistic regression applications are as follows:
To detect whether a given news is fake or not.
To detect whether a given cell is Cancerous cell or not.
In essence, logistic regression can be understood as the probability of
belonging to a class given a particular input variable. Since it’s probabilistic
in nature, the logistic regression output values lie in the range of 0 and 1.
When we think about regression from a strictly statistical perspective, the output value is generally not restricted to a particular interval. Thus, to achieve this restriction in logistic regression, we utilise the logistic function. An intuitive way to see the use of the logistic function is to view logistic regression as a simple regression model on top of whose output a logistic function is applied, so that the final output is restricted to the range defined above.
Generally, logistic regression results work well when the output is of binary
type, that is, it either belongs to a specific category or it does not. This,
however, is not always the case in real-life problem statements. We may
f(x) = 1 / (1 + e^(–x))
It is a mathematical function that assigns values between 0 and 1 based on the input variable. It is characterized by its S-shaped curve and is commonly used in statistics, machine learning, and neural networks to model non-linear relationships and provide probabilistic interpretations.
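A minimal Python sketch of the logistic function follows; it simply evaluates f(x) = 1/(1 + e^(–x)) at a few points to show how any real input is squeezed into (0, 1).

```python
# The logistic (sigmoid) function maps any real number into the interval (0, 1).
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (-6, -2, 0, 2, 6):
    print(x, round(logistic(x), 3))   # e.g. logistic(0) = 0.5
```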
X1 X2 X3 Y
2.1 6 7 10.8
2.7 7 6 9.7
3.9 4 9 12.9
2.4 5 8 10.1
2.8 6 7 11.5
W = (β – β0)² / Var(β)
where β is the estimated coefficient for the predictor variable of interest, β0 is the hypothesized value of the coefficient under the null hypothesis (typically 0 for testing if the coefficient is zero), and Var(β) is the estimated variance of the coefficient.
The Wald test statistic is compared to the chi-square distribution, where the
degrees of freedom are set to 1 (since we are testing a single parameter)
to obtain the associated p-value. Rejecting the null hypothesis occurs
when the calculated p-value falls below a predetermined significance
level (e.g., 0.05), indicating that the predictor variable has a statistically
significant impact on the dependent variable.
The Wald test allows us to determine the individual significance of
predictor variables by testing whether their coefficients significantly deviate
from zero. It is a valuable tool for identifying which variables have a
meaningful impact on the outcome of interest in a regression model.
Let’s consider an example where we have a logistic regression model with
two predictor variables (X1 and X2) and a binary outcome variable (Y).
We want to assess the significance of the coefficient for each predictor using the Wald test.
Here is a sample dataset with the predictor variables and the binary
outcome variable:
X1 X2 Y
2.5 6 0
3.2 4 1
1.8 5 0
2.9 7 1
3.5 5 1
2.1 6 0
2.7 7 1
3.9 4 0
2.4 5 0
2.8 6 1
Step 1: Fit the Logistic Regression Model
We start by fitting the logistic regression model with the predictor
variables X1 and X2:
logit(p) = β0 + β1 × X1 + β2 × X2
By using statistical software, we obtain the estimated coefficients and
their standard errors:
β1 = 0.921, β2 = 0.372
SE(β1) = 0.512, SE(β2) = 0.295
Step 2: Calculate the Wald Test Statistic
Next, we calculate the Wald test statistic for each predictor variable
using the formula:
W = (β – β0)² / Var(β)
For X1:
W1 = (0.921 – 0)² / (0.512)² ≈ 3.236
For X2:
W2 = (0.372 – 0)² / (0.295)² ≈ 1.590
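As a small sketch, the Wald statistics computed above can be turned into p-values with the chi-square distribution (1 degree of freedom) using scipy.

```python
# Wald statistics and their chi-square (df = 1) p-values for the example above.
from scipy.stats import chi2

for name, beta, se in [("X1", 0.921, 0.512), ("X2", 0.372, 0.295)]:
    w = (beta - 0.0) ** 2 / se ** 2   # W = (beta - beta0)^2 / Var(beta)
    p = chi2.sf(w, df=1)              # upper-tail p-value
    print(name, round(w, 3), round(p, 3))
```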
IN-TEXT QUESTIONS
1. What does the Wald test statistic compare to obtain the associated
p-value?
(a) The F-distribution
(b) The t-distribution
(c) The normal distribution
(d) The chi-square distribution
2. What does the Omnibus test assess in a regression model?
(a) The individual significance of predictor variables
(b) The collinearity between predictor variables
(c) The overall significance of predictor variables collectively
(d) The goodness-of-fit of the regression model
H = Σ (Oij – Eij)² / (Eij (1 – Eij))
variable (X). We will divide the predicted probabilities into three bins and calculate the observed and expected frequencies in each bin.
Y X Predicted Probability
0 2.5 0.25
1 3.2 0.40
0 1.8 0.15
1 2.9 0.35
1 3.5 0.45
0 2.1 0.20
1 2.7 0.30
0 3.9 0.60
0 2.4 0.18
1 2.8 0.28
Step 1: Fit the Logistic Regression Model
By fitting the logistic regression model, we obtain the predicted probabilities
for each observation based on the predictor variable X.
Step 2: Divide the Predicted Probabilities into Bins
Let’s divide the predicted probabilities into three bins: [0.1-0.3], [0.3-
0.5], and [0.5-0.7].
Step 3: Calculate Observed and Expected Frequencies in Each Bin
Now, we calculate the observed and expected frequencies in each bin.
Bin: [0.1-0.3]
Total cases in bin: 3
Observed cases (Y = 1): 1
Expected cases: 0.25 + 0.20 + 0.28 = 0.73
Bin: [0.3-0.5]
Total cases in bin: 4
Observed cases (Y = 1): 2
Expected cases: 0.40 + 0.35 + 0.30 + 0.28 = 1.33
Bin: [0.5-0.7]
Total cases in bin: 3
There are various methods to calculate Pseudo R-squared, and one commonly used method is Nagelkerke’s R-squared. The formula for Nagelkerke’s R-squared is as follows:
R² = (Lmodel – Lnull) / (Lmax – Lnull)
where Lmodel is the log-likelihood of the full model, Lnull is the log-
likelihood of the null model (a model with only an intercept term)
and Lmax is the log-likelihood of a model with perfect prediction (a
hypothetical model that perfectly predicts all outcomes).
Nagelkerke’s R-squared ranges from 0 to 1, with 0 indicating that the
predictors have no explanatory power, and 1 suggesting a perfect fit of
the model. However, it is important to note that Nagelkerke’s R-squared
is an adjusted measure and should not be interpreted in the same way
as R-squared in linear regression.
Pseudo R-squared provides an indication of how well the predictor variables
explain the variance in the dependent variable in logistic regression. While
it does not have a direct interpretation as the proportion of variance
explained, it serves as a relative measure to compare the goodness-of-fit
of different models or assess the improvement of a model compared to
a null model.
One commonly used pseudo R-squared measure is the Cox and Snell
R-squared. Let’s calculate the Cox and Snell R-squared using the given
example of a logistic regression model with two predictor variables.
X1 X2 Y
2.5 6 0
3.2 4 1
1.8 5 0
2.9 7 1
3.5 5 1
2.1 6 0
2.7 7 1
3.9 4 0
2.4 5 0
2.8 6 1
problem on a dataset of 100 random cells, in which 10 cells are cancerous and 90 cells are non-cancerous. Let us suppose that model X classifies 20 input cells as cancerous and the remaining 80 as non-cancerous. Out of the total predicted cancerous cells, only 5 input cells are actually cancerous as per the ground truth, while the remaining 15 cells are non-cancerous. On the other hand, out of the total predicted non-cancerous cells, 75 cells are also non-cancerous in the ground truth, but 5 cells are cancerous.
Here, cancerous cell is considered as positive class while non-cancerous
cell is considered as negative class for the given classification problem.
Now, we define the four primary building blocks of the various evaluation
metrics of classification models as follows:
True Positive (TP): The number of input cells for which the classification model X correctly predicts that they are cancerous cells is referred to as True Positives. For example, for model X, TP = 5.
True Negative (TN): The number of input cells for which the classification model X correctly predicts that they are non-cancerous cells is referred to as True Negatives. For example, for model X, TN = 75.
False Positive (FP): The number of input cells for which the classification model X incorrectly predicts that they are cancerous cells is referred to as False Positives. For example, for model X, FP = 15.
False Negative (FN): The number of input cells for which the classification model X incorrectly predicts that they are non-cancerous cells is referred to as False Negatives. For example, for model X, FN = 5.
Actual
Cancerous Non-Cancerous
Predicted Cancerous TP = 5 FP = 15
Non-Cancerous FN = 5 TN = 75
Figure 3.1: Classification Matrix
3.8.1 Sensitivity
Sensitivity, also referred to as True Positive Rate or Recall, is calculated
as the ratio of correctly predicted cancerous cells to the total number of
cancerous cells in the ground truth. To compute sensitivity, you can use
the following formula:
Sensitivity = TP / (TP + FN)
3.8.2 Specificity
Specificity is defined as the ratio of number of input cells that are correctly
predicted as non-cancerous to the total number of non-cancerous cells
in the ground truth. Specificity is also known as True Negative Rate. To
compute specificity, we can use the following formula:
Specificity = TN / (TN + FP)
3.8.3 Accuracy
Accuracy is calculated as the ratio of correctly classified cells to the total
number of cells. To compute accuracy, you can use the following formula:
Accuracy = (TP + TN) / (TP + FP + TN + FN)
3.8.4 Precision
Precision is calculated as the ratio of the correctly predicted cancerous
cells to the total number of cells predicted as cancerous by the model.
To compute precision, you can use the following formula:
Precision = TP / (TP + FP)
3.8.5 F score
The F1-score is calculated as the harmonic mean of Precision and Recall.
To compute the F1-score, you can follow the following formula:
F1-score = (2 × Precision × Recall) / (Precision + Recall)
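The following short sketch computes all five metrics from the confusion-matrix counts of the hypothetical model X above (TP = 5, FP = 15, FN = 5, TN = 75).

```python
# Classification metrics for model X computed from its confusion-matrix counts.
TP, FP, FN, TN = 5, 15, 5, 75

sensitivity = TP / (TP + FN)                      # recall / true positive rate
specificity = TN / (TN + FP)                      # true negative rate
accuracy = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(sensitivity, specificity, accuracy, precision, round(f1, 3))
# 0.5  0.833...  0.8  0.25  0.333
```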
3.10 ROC
The performance of a binary classification model, particularly in logistic regression and other machine learning techniques, is assessed using a graphical representation called the Receiver Operating Characteristic (ROC) curve. It demonstrates the trade-off between the true positive rate (sensitivity) and the false positive rate (1 minus specificity) for various classification thresholds.
Plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds results in the ROC curve. The formulas for TPR and FPR are as follows:
True Positive Rate (TPR) = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)
We may evaluate the model’s capacity to distinguish between positive
and negative examples at various classification levels using the ROC
curve. With a TPR of 1 and an FPR of 0, a perfect classifier would have
a ROC curve that reaches the top left corner of the plot. The model’s
discriminatory power increases with the distance between the ROC curve
and the top left corner.
3.11 AUC
When employing a Receiver Operating Characteristic (ROC) curve, the
Area Under the Curve (AUC) is a statistic used to assess the effectiveness
of a binary classification model. The likelihood that a randomly selected
positive occurrence will have a greater projected probability than a
randomly selected negative instance is represented by the AUC.
The AUC is calculated by integrating the ROC curve. However, it is
important to note that the AUC does not have a specific formula since
it involves calculating the area under a curve. Instead, it is commonly
calculated using numerical methods or software.
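In practice the ROC curve and AUC are usually obtained from a library; the sketch below uses scikit-learn, reusing the labels and predicted probabilities from the earlier ten-observation example.

```python
# ROC curve points and AUC with scikit-learn.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]                     # actual classes
y_score = [0.25, 0.40, 0.15, 0.35, 0.45, 0.20, 0.30, 0.60,
           0.18, 0.28]                                        # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(auc)   # 0.5 ~ random classifier; values near 1 ~ strong discrimination
```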
The AUC value ranges between 0 and 1. A model with an AUC of 0.5
indicates a random classifier, where the model’s predictive power is no
better than chance. An AUC value that is nearer 1 indicates a classifier
that is more accurate and is better able to distinguish between positive
and negative situations. Conversely, an AUC value closer to 0 suggests
poor performance, with the model performing worse than random guessing.
In binary classification tasks, the AUC is a commonly utilized statistic
since it offers a succinct assessment of the model’s performance at
different classification thresholds. It is especially useful when the dataset is imbalanced, i.e., when the numbers of positive and negative instances differ significantly.
In conclusion, the AUC measure evaluates a binary classification model’s
total discriminatory power by delivering a single value that encapsulates the
model’s capacity to rank cases properly. Better classification performance
is shown by higher AUC values, whilst worse performance is indicated
by lower values.
IN-TEXT QUESTIONS
5. Which of the following illustrates trade-off between True Positive
Rate and False Positive Rate?
(a) Gini Coefficient (b) F1-score
(c) ROC (d) AUC
6. Which of the following value of AUC indicates a more accurate
classifier?
(a) 0.01 (b) 0.25
(c) 0.5 (d) 0.99
7. What is the range of values for the Gini coefficient?
(a) -1 to 1 (b) 0 to 1
(c) 0 to infinity (d) -infinity to infinity
8. How can the Gini coefficient be computed?
(a) By calculating the area under the precision-recall curve
(b) By calculating the area under the Receiver Operating
Characteristic (ROC) curve
(c) By calculating the ratio of true positives to true negatives
(d) By calculating the ratio of false positives to false negatives
3.15 References
LaValley, M. P. (2008). Logistic regression. Circulation, 117(18), 2395-2399.
Wright, R. E. (1995). Logistic regression.
Chatterjee, S., & Simonoff, J. S. (2013). Handbook of regression analysis. John Wiley & Sons.
Kleinbaum, D. G., Dietz, K., Gail, M., Klein, M., & Klein, M. (2002). Logistic regression. Springer-Verlag.
DeMaris, A. (1995). A tutorial in logistic regression. Journal of Marriage and the Family, 956-968.
Osborne, J. W. (2014). Best practices in logistic regression. Sage Publications.
Bonaccorso, G. (2017). Machine learning algorithms. Packt Publishing Ltd.
L E S S O N
4
Decision Tree and
Clustering
Dr. Sanjay Kumar
Deptt. of Computer Science and Engineering
Delhi Technological University
Email-Id: sanjay.kumar@dtu.ac.in
STRUCTURE
4.1 Learning Objectives
4.2 Introduction
4.3 Classification and Regression Tree
4.4 CHAID
4.5 Impurity Measures
4.6 Ensemble Methods
4.7 Clustering
4.8 Summary
4.9 Answers to In-Text Questions
4.10 Self-Assessment Questions
4.11 References
4.12 Suggested Readings
4.2 Introduction
Decision tree is a popular machine learning approach for classification
and regression tasks. Its structure is similar to a flowchart, where internal
nodes represent features or attributes, branches depict decision rules, and
leaf nodes signify outcomes or predicted values. The decision tree algorithm creates the tree by dividing the data recursively according to feature values. At each stage, it chooses the best feature for partitioning the data by analysing measures such as information gain or Gini impurity.
The goal is to divide the data into homogeneous subsets within each
branch to increase the tree’s capacity for prediction.
begins with the question, “Is it a mammal?” If the answer is “Yes,” we
follow the branch on the left. The next question asks, “Does it have
spots?” If the answer is “Yes,” we conclude that it is a leopard. If the
answer is “No,” we determine it is a cheetah.
If the answer to the initial question, “Is it a mammal?” is “No,” we
follow the branch on the right, which asks, “Is it a bird?” If the answer
is “Yes,” we classify it as a parrot. If the answer is “No,” we classify
it as a fish.
Thus, the decision tree demonstrates a classification scenario where we aim to
determine the type of animal based on specific attributes. By following
the flowchart, we can systematically navigate through the questions to
reach a final classification.
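A minimal scikit-learn sketch of the same idea is given below; the tiny data set encodes the animal attributes as 0/1 features and is invented to mirror the example.

```python
# A small decision tree that reproduces the mammal/spots/bird flowchart.
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [is_mammal, has_spots, is_bird]
X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0]]
y = ["leopard", "cheetah", "parrot", "fish"]

tree = DecisionTreeClassifier(criterion="gini").fit(X, y)
print(export_text(tree, feature_names=["is_mammal", "has_spots", "is_bird"]))
print(tree.predict([[1, 1, 0]]))   # -> ['leopard']
```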
χ² = Σ (O – E)² / E    (1.1)
O represents the observed frequencies in each category or cell of a
contingency table.
E represents the expected frequencies under the assumption of independence
between variables.
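A small sketch of computing the chi-square statistic of equation (1.1) for a contingency table with scipy follows; the counts are illustrative.

```python
# Chi-square test of independence for an illustrative 2x2 contingency table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10],
                     [20, 40]])
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2_stat, p_value, dof)
print(expected)   # expected frequencies under the independence assumption
```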
To apply the Bonferroni correction, we divide the desired significance level (usually denoted as α) by the number of tests being performed (denoted as m). This adjusted significance level, denoted as α’ or α_B, becomes the new threshold for determining statistical significance.
Mathematically, the Bonferroni correction can be represented as:
α’ = α / m    (1.2)
Gini Index = 1 − ∑ (pᵢ)²   (1.4)
where pᵢ represents the probability of each class label in the node.
By using the Gini impurity index, decision tree algorithms can make
decisions on how to split the data by selecting the feature and threshold
that minimize the impurity after the split. A lower Gini impurity index
indicates a more homogeneous distribution of class labels, which helps
in creating pure and informative branches in the decision tree.
Example:
Suppose we have a dataset with 50 samples and two classes, “A” and
“B”. The table below shows the distribution of class labels for a particular
node in a decision tree:
Sample Class
20 A
10 B
15 A
5 B
To calculate the Gini impurity index, we follow these steps:
Calculate the probability of each class label:
Probability of class A = (20 + 15) / 50 = 35 / 50 = 0.7
Probability of class B = (10 + 5) / 50 = 15 / 50 = 0.3
Square each probability:
Square of 0.7 = 0.49
Square of 0.3 = 0.09
Sum up the squared probabilities:
0.49 + 0.09 = 0.58
Subtract the sum from 1 to obtain the Gini impurity index:
Gini Index = 1 - 0.58 = 0.42
So, the Gini impurity index for this particular node is 0.42. This value
represents the impurity or disorder within the node, with a lower Gini
impurity index indicating a more homogeneous distribution of class labels.
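The same calculation can be written as a small Python function; the class counts below (35 of class A, 15 of class B) match the worked example above.

# Gini impurity of equation (1.4) computed from raw class counts.
def gini_impurity(class_counts):
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)

print(round(gini_impurity([35, 15]), 2))   # 1 - (0.7**2 + 0.3**2) = 0.42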
4.5.2 Entropy
Entropy is a concept used in information theory and decision tree
algorithms to measure the level of uncertainty or disorder within a set
of class labels. It helps us understand how mixed or impure the class
distribution is in a given node. The entropy value is calculated based on
the probabilities of each class label within the node.
To compute entropy, we start by determining the probability of each class label. This is done by dividing the count of elements belonging to each class by the total number of elements in the node. The entropy is then given by:
Entropy = −∑ pᵢ · log₂(pᵢ)   (1.5)
where pᵢ represents the probability of each class label in the node.
By using entropy, decision tree algorithms can assess the impurity within
a node and determine the feature and threshold that minimize the entropy
after the split. A lower entropy value indicates a more homogeneous
distribution of class labels, leading to more informative and accurate
splits in the decision tree.
Example:
Let’s consider a node with 80 samples, where 60 samples belong to class
A and 20 samples belong to class B. The probability of class A is 60/80
= 0.75, and the probability of class B is 20/80 = 0.25. Applying the
base-2 logarithm to these probabilities, we get -0.415 and -2.000, respectively. Multiplying these values by their probabilities and summing them up, we obtain -0.811. Taking the negative of this sum, the entropy for this node is approximately 0.811.
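A small Python function mirroring equation (1.5) reproduces this calculation; the class counts (60 of class A, 20 of class B) match the example above.

# Entropy of equation (1.5) computed from raw class counts.
import math

def entropy(class_counts):
    total = sum(class_counts)
    probabilities = [count / total for count in class_counts if count > 0]
    return -sum(p * math.log2(p) for p in probabilities)

print(round(entropy([60, 20]), 3))   # 0.811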
The choice of cost-based splitting criterion depends on the problem domain and the assigned costs for different
types of misclassifications. For instance, in a medical diagnosis scenario,
misclassifying a severe condition as a less severe one might incur a higher
cost compared to the opposite error.
Example:
Let’s consider a dataset of 30 fruits, where each fruit has two features:
colour (red, green, or orange) and diameter (small or large). The target
variable is the type of fruit, which can be “Apple” or “Orange”. We
also have costs associated with misclassifications: $10 for each false
positive (classifying an orange as an apple) and $5 for each false negative
(classifying an apple as an orange).
When using cost-based splitting criteria, the decision tree algorithm
considers the features (colour and diameter) to find the optimal split that
minimizes the overall cost. For simplicity, let’s assume the first split is
based on the colour feature. The algorithm assesses the costs associated
with misclassification for each colour category and chooses the colour
that results in the lowest expected cost. For instance, if the algorithm
determines that splitting the data based on colour between “Red/Green”
and “Orange” fruits minimizes the expected cost, it proceeds to evaluate
the diameter feature for each branch. The algorithm continues this recursive
process of splitting the data until it constructs a complete decision tree.
The resulting decision tree may look like this:
(Figure: a decision tree whose root node splits on Colour, with further splits on Diameter leading to “Apple” or “Orange” leaves)
The decision tree in the above figure displays splits depending on colour
and diameter, resulting in the labelling of fruits as “Apple” or “Orange”
at the leaf nodes. Now we can use the decision tree to estimate the type
of a new fruit when its colour and diameter are displayed. The model
determines the fruit’s expected class (apple or orange) by tracing the path
down the tree based on the provided attributes.
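As a rough sketch of cost-sensitive learning: scikit-learn's decision trees do not accept an explicit cost matrix, but the class_weight parameter can approximate asymmetric misclassification costs. The tiny fruit table below is invented only to mirror the example and is not data from the text.

# Approximate cost-based splitting by weighting classes by their misclassification cost.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "colour":   ["red", "green", "orange", "red", "orange", "green"],
    "diameter": ["small", "large", "large", "small", "small", "large"],
    "fruit":    ["Apple", "Apple", "Orange", "Apple", "Orange", "Apple"],
})
X = pd.get_dummies(data[["colour", "diameter"]])   # one-hot encode the features
y = data["fruit"]

# Misclassifying an Orange (a false positive for Apple) costs $10, an Apple costs $5.
tree = DecisionTreeClassifier(class_weight={"Orange": 10, "Apple": 5}, random_state=0)
tree.fit(X, y)
print(tree.predict(X.iloc[[2]]))   # predicted class for the large orange fruit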
Consider, for example, an ensemble of decision trees built to detect spam emails: a random subset of the data is used to train each decision tree, and their
predictions are aggregated via majority voting.
Each decision tree in this ensemble will identify various spam email
patterns and characteristics. While some trees may concentrate on certain
words or phrases, others may take sender information into consideration.
Each decision tree in the ensemble will make a forecast when a new email
is received, and the final prediction will be based on the consensus of
all the decision trees.
Each tree is unique thanks to this randomness, which also lessens the correlation between the trees.
Voting and Aggregation: When making predictions, each tree in the Random Forest independently predicts the target class (for classification tasks) or a numeric value (for regression tasks). For classification, the final prediction is chosen by a majority vote; for regression, the predictions are averaged. This voting and aggregation procedure enhances the overall forecast accuracy.
Random Forest has several key features and advantages:
Robustness against overfitting: The Random Forest is more resilient to
noise or outliers in the data thanks to the integration of many decision
trees, which also helps to avoid overfitting.
Feature importance estimation: Random Forest provides a measure of feature importance that identifies the features with the greatest impact on the predictions. This information can guide feature selection and helps in understanding the underlying relationships in the data.
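A minimal Random Forest sketch with feature importance estimation, assuming scikit-learn and one of its built-in datasets (neither is specified by the lesson):

# Random Forest: many trees, majority voting, plus feature importance scores.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))

# Features with the greatest impact on the ensemble's predictions
importances = sorted(zip(X.columns, forest.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
print(importances[:5])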
4.7 Clustering
Cluster analysis, commonly referred to as clustering, is a machine learning
technique that categorizes datasets without labels. In this method, objects
that share similarities are grouped together while maintaining minimal
or no similarities with other groups. By relying solely on the inherent
characteristics of the data, clustering algorithms unveil patterns and
associations. Their objective is to divide a collection of data points into
distinct clusters or groups, where objects within the same cluster exhibit
greater similarity to each other compared to those in different clusters.
The primary aim is to maximize the similarity within clusters while
minimizing the similarity between clusters.
Let’s look at the clustering technique in action using the real-world example of a shopping mall: when we go to a shopping centre, we notice that items with comparable uses are placed together. T-shirts, for example, are arranged in one section and pants in another; similarly, in the fruit and vegetable section, apples, bananas, mangoes, and so on are grouped separately so that we can easily find what we are looking for.
Figure 4.5 shows images of uncategorized and categorized data in the form of three types of fruits mixed together. The left side shows the uncategorized data, while the right side shows the categorized data, i.e., groups of the same fruit:
Points in the same cluster are similar
Points in different clusters are dissimilar
Advantages of partitioning clustering are:
Ease of implementation
Interpretability
Applicability to various data types
Disadvantages of partitioning clustering are:
Sensitivity to initial centroid selection
Dependence on the number of clusters
Limited ability to handle complex cluster shapes
K-means Clustering
K-means is one of the most popular and widely used clustering algorithms. It partitions the data into a predetermined number of clusters (K) by minimizing the within-cluster sum of squares, that is, the total squared distance between data points and the centroid of their assigned cluster. It is an iterative algorithm that assigns data points to clusters and updates the cluster centres until convergence. Here is a detailed description of the K-means clustering algorithm:
Initialization: Select K initial cluster centres, for example by choosing K data points at random.
Assignment: Assign each data point to the nearest cluster centre.
Update: Recalculate each cluster centre as the mean of the data points assigned to it. Then reassign data points to clusters based on the updated cluster centres and recalculate the cluster centres, repeating these two steps.
Convergence: The algorithm stops when the cluster centres (and hence the assignments) no longer change significantly, or when a maximum number of iterations is reached.
As a small example, suppose four points A, B, C, and D initially form two clusters, AB and CD, each with its own centroid. Now calculate the Euclidean distance between each point and the centroids and assign each point to the closest centroid:
Centroid    A     B     C     D
AB          5     5     9     5
CD          4     16    2     2
From the table we can see that the distance between point A and centroid CD is 4, which is smaller than the distance between A and centroid AB, which is 5. So we move A to the CD cluster, giving two clusters: ACD and B. We then recompute the centroids of B and ACD.
We repeat the process by calculating the distance between each point and
the updated centroids and reassigning the points to the closest centroid. We
continue this iteration until the centroids no longer change significantly.
After a few iterations, the algorithm converges, and the final cluster
assignments are:
Cluster 1: B
Cluster 2: ACD
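A minimal K-means sketch in Python (scikit-learn assumed); the two-dimensional points are invented purely to illustrate the assign-update-converge loop described above.

# K-means: assign points to the nearest centroid, update centroids, repeat.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.5, 2], [1, 0.5],
                   [8, 8], [9, 9], [8.5, 9.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print("Cluster assignments:", labels)
print("Final centroids:\n", kmeans.cluster_centers_)
print("Within-cluster sum of squares:", kmeans.inertia_)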
(B) Density-Based Clustering:
Density-based algorithms, such as DBSCAN (Density-Based Spatial
Clustering of Applications with Noise), identify clusters based on regions
of high data point density. Data points that are close to each other are grouped together, while regions of low density separate the clusters. Density-based clustering does not assume any specific shape for the clusters: it can detect clusters of arbitrary shapes, including non-linear and irregular ones. Such techniques also handle noise and outlier points appropriately.
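A minimal DBSCAN sketch (scikit-learn assumed); the eps and min_samples values are illustrative and would normally be tuned to the density of the data at hand.

# DBSCAN groups dense regions and labels sparse points as noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 0.8],
                   [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],
                   [50.0, 50.0]])                  # an isolated outlier

db = DBSCAN(eps=0.5, min_samples=2).fit(points)
print(db.labels_)   # e.g. [0 0 0 1 1 1 -1]; -1 marks noise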
Hierarchical clustering builds clusters step by step, merging the closest clusters until a single cluster, containing all the data points, is formed. This algorithm
is also known as agglomerative clustering or bottom-up clustering. Here
are the key steps and characteristics of the agglomerative hierarchical
algorithm.
Initialization: Assign each data point to its own initial cluster,
resulting in N clusters for N data points.
Compute the proximity or dissimilarity matrix: Calculate the
dissimilarity or similarity measure between each pair of clusters.
The choice of distance or dissimilarity measure depends on the
specific application and the nature of the data.
Merge the closest clusters: Identify the pair of clusters with the
smallest dissimilarity or highest similarity measure and merge them
into a single cluster. The dissimilarity or similarity between the new
merged cluster and the remaining clusters is updated.
Repeat the merging process: Repeat the proximity computation and merging steps until all the data points are part of a single cluster or until a predefined stopping criterion is met.
Hierarchical representation: The merging process forms a hierarchy
of clusters, often represented as a dendrogram. The dendrogram
illustrates the sequence of merging and allows for different levels
of granularity in cluster interpretation.
The advantages of agglomerative hierarchical algorithms are the hierarchical structure they produce and the fact that, unlike partitioning algorithms, the number of clusters does not need to be specified in advance. The drawbacks of this approach are its high computational complexity and lack of stability.
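A minimal bottom-up clustering sketch using SciPy's hierarchical clustering routines (an assumption, since the lesson does not name a package); the five two-dimensional points are invented for illustration.

# Agglomerative (bottom-up) clustering: merge the closest clusters step by step.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5.2], [9, 9]])

merges = linkage(points, method="ward")                  # the sequence of merges
labels = fcluster(merges, t=2, criterion="maxclust")     # cut into two flat clusters
print(labels)

# scipy.cluster.hierarchy.dendrogram(merges) can be drawn with matplotlib
# to visualise the full merging hierarchy.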
1. Euclidean Distance: This is one of the most widely used distance measures in clustering. It calculates the straight-line distance between two data points in a Euclidean space. For two points, P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn), the Euclidean distance is given by:
d(P, Q) = √((p1 − q1)² + (p2 − q2)² + ... + (pn − qn)²)
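The formula translates directly into a short Python function; the two points used below are arbitrary illustrative vectors.

# Euclidean (straight-line) distance between two points of equal dimension.
import math

def euclidean_distance(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((1, 2, 3), (4, 6, 3)))   # 5.0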
Compactness: Compactness refers to how close the data points are
within each cluster. A good clustering result should have data points
tightly clustered together within their assigned clusters.
Separability: Separability refers to the distance between different
clusters. A high-quality clustering result should exhibit distinct
separation between clusters, indicating that the clusters are well-
separated from each other.
Stability: Stability measures the consistency of clustering results
under different conditions, such as different initializations or subsets
of the data. A stable clustering result is less prone to variations
and demonstrates robustness.
Domain-specific Measures: Depending on the application domain,
additional measures specific to the problem can be used to assess
clustering quality. For example, in customer segmentation, metrics
like homogeneity, completeness, and silhouette coefficient can be used
to evaluate the effectiveness of clustering in capturing meaningful
customer groups.
(B) Determining the Optimal Number of Clusters in K-means clustering:
Determining the optimal number of clusters, K, in K-means clustering is
a crucial step in clustering analysis. Selecting the appropriate number of
clusters is important for interpreting and extracting meaningful information
from the data. Several methods are commonly used to determine the
optimal number of clusters:
Elbow Method: “The elbow method involves plotting the within-cluster
sum of squares (WCSS) as a function of the number of clusters”.
The plot resembles an arm, and the optimal number of clusters is
often identified at the “elbow” point, where the rate of decrease
in WCSS slows down significantly. In the Figure 4.9 below, it is
clear that k=3 is the optimal number of clusters.
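A minimal elbow-method sketch (scikit-learn and matplotlib assumed); synthetic blob data with three centres is generated purely so that the bend in the WCSS curve is visible.

# Elbow method: plot WCSS (scikit-learn's inertia_) against the number of clusters K.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

ks = range(1, 9)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()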
(b) Random forests are easy to interpret but often very accurate
(c) Random forests are difficult to interpret but much less accurate
(d) None of these
5. Predicting with trees evaluates _________ within each group of
data.
(a) Equality
(b) Homogeneity
(c) Heterogeneity
(d) All of the mentioned
6. Which of the following is finally produced by Hierarchical
Clustering?
(a) Final estimate of cluster centroids
(b) Tree showing how close things are to each other
(c) Assignment of each point to clusters
(d) All of the mentioned
7. Which of the following is required by K-means clustering?
(a) Defined distance metric
(b) Number of clusters
(c) Initial guess as to cluster centroids
(d) All of the mentioned
8. Which of the following combination is incorrect?
(a) Continuous – Euclidean distance
(b) Continuous – Correlation similarity
(c) Binary – Manhattan distance
(d) None of the mentioned
9. Which is not true about K-means clustering?
(a) K-means is sensitive to cluster center initializations
(b) Bad initialization can lead to Poor convergence speed
(c) Bad initialization can produce good clustering
(d) None of the mentioned
4.8 Summary
Decision Tree is a popular machine learning approach for classification
and regression tasks.
The CHAID algorithm looks for meaningful patterns by splitting the
data into groups based on different categories of variables.
The Bonferroni correction is a statistical method used to adjust the
significance levels (p-values).
Gini impurity index is a measure used in decision tree algorithms
to evaluate the impurity or disorder within a set of class labels.
Entropy is a concept used in information theory and decision tree
algorithms to measure the impurity.
4.11 References
Han, J., Kamber, M., & Pei, J. (2011). Data mining: Concepts and
techniques. Morgan Kaufmann.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of
statistical learning: data mining, inference, and prediction. Springer
Science & Business Media.
Bishop, C. M. (2006). Pattern recognition and machine learning.
Springer.
Murphy, K. P. (2012). Machine learning: A probabilistic perspective.
MIT Press.
Mann, A. K., & Kaur, N. (2013). Review paper on clustering
techniques. Global Journal of Computer Science and Technology.
Rai, P., & Singh, S. (2010). A survey of clustering techniques.
International Journal of Computer Applications, 7(12), 1-5.
Cheng, Y. M., & Leu, S. S. (2009). Constraint-based clustering and
its applications in construction management. Expert Systems with
Applications, 36(3), 5761-5767.
Glossary
Adjusted R-squared: A modified version of R-squared that accounts for the number of
independent variables in the model, penalizing the inclusion of irrelevant variables.
Big Data: Big data refers to large and complex datasets. It is characterized by the volume,
velocity, and variety of data, often generated from various sources such as social media,
sensors, devices, and business transactions.
Business Analytics: Business analytics consists of using data analysis and statistical
methods to gain insights, make informed decisions, and drive strategic actions in a business
or organizational context.
Classification: The classification algorithm is a supervised learning technique that is used to categorize new observations on the basis of the training data.
Clustering: Clustering is a machine learning technique that groups unlabelled data points into different clusters.
Coefficient: In the context of linear regression, these are the parameters that represent
the weights or slopes of the independent variables in the linear equation.
Dependent Variable: Also known as the response variable or target variable, it is the
variable that you are trying to predict or explain in a linear regression model.
Distance Measures: Distance measures quantify the similarity or dissimilarity between data points. They are widely used in clustering, classification, and nearest neighbour search.
Entropy: Entropy is used to measure the impurity or disorder in a dataset. It is commonly used in decision trees.
F-statistic: A statistical test used to determine the overall significance of a linear regression
model by comparing the model’s fit to a model with no independent variables.
F1-score: The F1-score is calculated as the harmonic mean of Precision and Recall.
Gini Coefficient: A metric used to measure inequality within a distribution.
Gini Index: Also known as Gini impurity, it measures the degree or probability of a randomly chosen element being incorrectly classified.
Heteroscedasticity: A violation of one of the assumptions of linear regression, where the
variance of the residuals is not constant across all levels of the independent variables.
R-squared (R²): A statistical measure that represents the proportion of the
variance in the dependent variable that is explained by the independent
variables in the linear regression model.
Residuals: The differences between the actual observed values of the
dependent variable and the values predicted by the linear regression model.
ROC curve: ROC curve demonstrates the balance between the true positive
rate and the false positive rate across various classification thresholds.
Simple Linear Regression: A type of linear regression that involves only
one independent variable to predict the dependent variable.
T-statistic: A statistical test used to determine the individual significance
of each independent variable’s coefficient in a linear regression model.
Wald Test: A statistical test used to evaluate the significance of each individual predictor variable within a regression model.