Business Analytics

Editorial Board

Dr. Rishi Rajan Sahay


Assistant Professor, Shaheed Sukhdev College of Business Studies, University of Delhi
Dr. Sanjay Kumar
Assistant Professor, Delhi Technological University

Content Writers
Dr. Abhishek Kumar Singh, Dr. Satish Kumar Goel,
Mr. Anurag Goel, Dr. Sanjay Kumar
Academic Coordinator
Mr. Deekshant Awasthi

© Department of Distance and Continuing Education


ISBN: 978-81-19169-84-9
1st Edition: 2023
E-mail: ddceprinting@col.du.ac.in
management@col.du.ac.in

Published by:
Department of Distance and Continuing Education
Campus of Open Learning/School of Open Learning,
University of Delhi, Delhi-110007

Printed by:
School of Open Learning, University of Delhi
DISCLAIMER

This Study Material is duly recommended and approved in the Academic Council meeting held on 11/08/2023 vide item no. 1015 and subsequently in the Executive Council meeting held on 25/08/2023 vide item no. 1267.

Corrections/modifications/suggestions proposed by a Statutory Body, DU/stakeholder(s) in the Self Learning Material (SLM) will be incorporated in the next edition. However, these corrections/modifications/suggestions will be uploaded on the website https://sol.du.ac.in.
Any feedback or suggestions can be sent to the email-feedback.slm@col.du.ac.in.

Printed at: Taxmann Publications Pvt. Ltd., 21/35, West Punjabi Bagh,
New Delhi - 110026 (300 Copies, 2023)

Contents


Lesson 1: Introduction to Business Analytics and Descriptive Analytics


1.1 Learning Objectives
1.2 Introduction
1.3 Business Analytics
1.4 Role of Analytics for Data-Driven Decision Making
1.5 Types of Business Analytics
1.6 Introduction to the Concepts of Big Data Analytics
1.7 Overview of Machine Learning Algorithms
1.8 Introduction to Relevant Statistical Software Packages
1.9 Summary
1.10 Answers to In-Text Questions
1.11 Self-Assessment Questions
1.12 References
1.13 Suggested Readings

Lesson 2: Predictive Analytics


2.1 Learning Objectives
2.2 Introduction
2.3 Classical Linear Regression Model (CLRM)
2.4 Multiple Linear Regression Model
2.5 Practical Exercises using R/Python Programming
2.6 Summary
2.7 Answers to In-Text Questions
2.8 Self-Assessment Questions
2.9 Reference
2.10 Suggested Readings


Lesson 3: Logistic and Multinomial Regression


3.1 Learning Objectives
3.2 Introduction
3.3 Logistic Function
3.4 Omnibus Test
3.5 Wald Test
3.6 Hosmer Lemeshow Test
3.7 Pseudo R Square
3.8 Classification Table
3.9 Gini Coefficient
3.10 ROC
3.11 AUC
3.12 Summary
3.13 Answers to In-Text Questions
3.14 Self-Assessment Questions
3.15 References
3.16 Suggested Readings

Lesson 4: Decision Tree and Clustering


4.1 Learning Objectives
4.2 Introduction
4.3 Classification and Regression Tree
4.4 CHAID
4.5 Impurity Measures
4.6 Ensemble Methods
4.7 Clustering
4.8 Summary
4.9 Answers to In-Text Questions
4.10 Self-Assessment Questions
4.11 References
4.12 Suggested Readings
Glossary

LESSON 1
Introduction to Business Analytics and Descriptive Analytics
Dr. Abhishek Kumar Singh
Assistant Professor
University of Delhi
Email-Id: abhishekbhu008@gmail.com

STRUCTURE
1.1 Learning Objectives
1.2 Introduction
1.3 Business Analytics
1.4 Role of Analytics for Data-Driven Decision Making
1.5 Types of Business Analytics
1.6 Introduction to the Concepts of Big Data Analytics
1.7 Overview of Machine Learning Algorithms
1.8 Introduction to Relevant Statistical Software Packages
1.9 Summary
1.10 Answers to In-Text Questions
1.11 Self-Assessment Questions
1.12 References
1.13 Suggested Readings

1.1 Learning Objectives


‹ Define Business Analytics.
‹ State the Role of Analytics for Data-Driven Decision Making.


‹ Mention the types of Business Analytics.


‹ Classify the concepts of Big Data Analytics.
‹ Describe Machine Learning Algorithms.
‹ Identify relevant statistical software packages.

1.2 Introduction
Business Analytics (BA) consists of using data to gain valuable insights
and make informed decisions in a business setting. It involves analysing
and interpreting data to uncover patterns, trends, and correlations that
can help organizations improve their operations, better understand their
customers, and make strategic decisions. Business Analytics (BA) places
a focus on statistical analysis. In addition to statistical analysis, business
analytics also focuses on various other aspects, such as data mining,
predictive modelling, data visualization, machine learning, and data-driven
decision making.
Companies committed to making data-driven decisions employ business
analytics. The study of data through statistical and operational analysis, the
creation of predictive models, the use of optimisation techniques, and the
communication of these results to clients, business partners, and company
executives are all considered to be components of business analytics. It
relies on quantitative methodologies, and data needed to create specific
business models and reach lucrative conclusions must be supported by
proof. As a result, Business Analytics heavily relies on and utilises Big
Data. Business analytics is the process used to analyse data after looking
at past outcomes and problems in order to create an effective future plan.
Big Data, or a lot of data, is utilised to generate answers. The economy
and the sectors that prosper inside it depend on this way of conducting
business or this outlook on creating and maintaining a business. Over the
past ten or so years, the word analytics has gained popularity. Analytics are
now incredibly important due to the growth of the internet and information
technology. In this lesson, we are going to learn about Business Analytics. The area of analytics integrates data, information technology, statistical analysis, and quantitative techniques with computer-based models. All of these factors work together to give decision-makers every possibility that can arise, allowing them to make well-informed choices. The computer-based model ensures that decision-makers can examine how their choices would function in various scenarios.

Figure 1.1

1.3 Business Analytics

1.3.1 Meaning
Business Analytics (BA) utilizes data analysis, statistical models, and
various quantitative techniques as a comprehensive discipline and
technological approach. It involves a systematic and iterative examination
of organizational data, with a specific emphasis on statistical analysis, to
facilitate informed decision-making.
Business analytics primarily entails a combination of the following:
discovering novel patterns and relationships using data mining; developing
business models using quantitative and statistical analysis; conducting A/B
and multi-variable testing based on findings; forecasting future business
needs, performance, and industry trends using predictive modelling; and
reporting your findings to co-workers, management, and clients in simple-
to-understand reports.


1.3.2 Definition


Business Analytics (BA) involves utilizing knowledge, tools, and procedures
to analyse past business performance in order to gain insight and inform
present and future business strategy.
Business analytics is the process of transforming data into insights to
improve business choices. It is based on data and statistical approaches to
provide new insights and understanding of business performance. Some of
the methods used to extract insights from data include data management,
data visualisation, predictive modelling, data mining, forecasting, simulation,
and optimisation.

1.3.3 Business Analytics Evolution


Business analytics has been around for a very long time and has developed
as more and better technologies have become available. Operations research,
which was widely applied during World War II, is where it has its roots.
Operations research was initially designed as a methodical strategy to
analyse data in military operations. Over time, this strategy began to be
applied in business domain as well. Gradually, the study of operations
evolved into management science. Furthermore, the fundamental elements
such as decision-making models, and other foundations of management
science were the same as those of operation research.
Ever since Frederick Winslow Taylor implemented management exercises
in the late 19th century, analytics have been employed in business. Henry
Ford’s freshly constructed assembly line involved timing of each component.
However, when computers were deployed in decision support systems in
the late 1960s, analytics started to garner greater attention. Since then,
Enterprise Resource Planning (ERP) systems, data warehouses, and a
huge range of other software tools and procedures have all modified and
shaped analytics.
With the advent of computers, business analytics has grown rapidly in recent
years. This development has elevated analytics to entirely new heights
and opened up a world of opportunity. Many people would never guess
that analytics began in the early 1900s with Mr Ford himself, given how
far analytics has come in history and what the discipline is now. Business


intelligence, decision support systems, and PC software all developed from management science.

1.4 Role of Analytics for Data-Driven Decision Making

1.4.1 Applications and Uses of Business Analytics


Applications and uses for business analytics are numerous. It can be
applied to descriptive analysis, which makes use of facts to comprehend
the past and present. This form of descriptive analysis is employed to
evaluate the company’s present position in the market and the success
of earlier business decisions.
Predictive analysis, which uses past business performance data to anticipate
future outcomes, is often employed alongside it. Prescriptive analysis, which is used
to develop optimisation strategies for better corporate performance, is
another application of business analytics. Business analytics, for instance,
is used to base price decisions for various products in a department store
on historical and current data.

1.4.2 Workings of Business Analytics


Several fundamental procedures are first carried out by BA before any
data analysis is done:
‹ Identify the analysis’s corporate objective.
‹ Choose an analytical strategy.
‹ Gather business data often from multiple systems and sources to
support the study.
‹ Cleanse and incorporate all data into one location, such as a data
warehouse or data mart.

1.4.3 Need/Importance of Business Analytics


Business analytics serves as an approach to help in making informed
business decisions.
As a result, it affects how the entire organisation functions. Business
analytics can therefore help a company become more profitable, grow its


market share and revenue, and give shareholders a higher return. It entails
improved primary and secondary data interpretation, which again affects
the operational effectiveness of several departments. Moreover, it provides
a competitive advantage to the organization. The flow of information is
nearly equal among all actors in this digital age. The competitiveness of
the company is determined by how this information is used. Corporate
analytics improves corporate decisions by combining readily available
data with numerous carefully considered models.

1.4.4 Transforms Data into Insightful Knowledge


Business analytics serves as a resource for a firm to make informed
decisions. These choices will probably have an effect on your entire business
because they will help you expand market share, boost profitability, and
give potential investors a higher return.
While some businesses struggle with how to use massive volumes of
data, business analytics aims to combine this data with useful insights
to enhance the decisions your organisation makes.
In essence, business analytics is significant across all industries for the
following four reasons:
‹ Enhances performance by providing your company with a clear
picture of what is and what isn’t working.
‹ Facilitates quicker and more precise decision-making.
‹ Reduces risks by assisting a company in making wise decisions on
consumer behaviour, trends, and performance.
‹ By providing information on the consumer, it encourages innovation
and change.

1.5 Types of Business Analytics


Business analytics can be divided into four primary categories, each of
which gets more complex. They bring us one step closer to implementing
scenario insight applications for the present and the future. Below is a
description of each of these business analytics categories:
1. Descriptive analytics
2. Diagnostic Analytics


3. Predictive Analytics


4. Prescriptive Analytics
1. Descriptive analytics: In order to understand what has occurred in the
past or is happening right now, it summarises the data that an organisation
currently has. The simplest type of analytics is descriptive analytics, which
uses data aggregation and mining techniques. It increases the availability
of data to an organization’s stakeholders, including shareholders, marketing
executives, and sales managers. It can aid in discovering strengths and
weaknesses and give information about customer behaviour. This aids in
the development of strategies for the field of focused marketing.
2. Diagnostic Analytics: This kind of analytics aids in refocusing attention
from past performance to present occurrences and identifies the variables
impacting trends. Drill-down, data mining, and other techniques are used
to find the underlying cause of occurrences. Probabilities and likelihoods
are used in diagnostic analytics to comprehend the potential causes of
events. For classification and regression, methods like sensitivity analysis
and training algorithms are used.
3. Predictive Analytics: With the aid of statistical models and ML
approaches, this type of analytics is used to predict the likelihood of a
future event. The outcome of descriptive analytics is built upon to create
models that extrapolate item likelihood. Machine learning specialists are
used to conduct predictive analyses. They can be more accurate than they
might be with just business intelligence. Sentiment analysis is among its
most popular uses. Here, social media data already in existence is used
to construct a complete picture of a user’s viewpoint. To forecast their
attitude (positive, neutral, or negative), this data is evaluated.
4. Prescriptive Analytics: It offers suggestions for the next best course
of action, going beyond predictive analytics. It makes all beneficial
predictions in accordance with a particular course of action and also
provides the precise steps required to produce the most desirable outcome.
It primarily depends on a robust feedback system and ongoing iterative
analysis. It gains knowledge of the connection between acts and their
results. The development of recommendation systems is a typical use of
this kind of analytics.


Prescriptive Analytics (Automation)
‹ Business Questions: What should I do? Why should I do it?
‹ Tools: Decision modelling, optimization, simulation, expert systems
‹ Outcomes: Optimization (the best possible business decision)
‹ Focus: Decision making and efficiency

Predictive Analytics (Foresight)
‹ Business Questions: What is likely to happen? What will happen? Why will it happen?
‹ Tools: Data mining, text/media mining, predictive modelling, Artificial Neural Networks (ANN)
‹ Outcomes: Accurate projection of future conditions and states
‹ Focus: Identify past patterns to predict the future

Diagnostic Analytics (Insight)
‹ Business Questions: Why did it happen?
‹ Tools: Enterprise data warehouse, data discovery, data mining and correlations, drill-down/roll-up
‹ Outcomes: Accurate projections of future conditions and states
‹ Focus: Identify past patterns to predict the future

Descriptive Analytics (Hindsight)
‹ Business Questions: What happened? What is happening?
‹ Tools: Data modelling, business reporting, visualization, dashboards, regression
‹ Outcomes: Well-defined business problems or opportunities
‹ Focus: Uncovering patterns that offer insight

Figure 1.2

1.6 Introduction to the Concepts of Big Data Analytics


Big Data Analytics deals with enormous volumes of information
that cannot be processed or stored using conventional data processing or
storage methods. This data typically comes in three distinct forms:


‹ Structured data, as the name implies, has a clear structure and follows
a regular sequence. A person or machine may readily access and
utilise this type of information since it has been intended to be user-
friendly. Structured data is typically kept in databases, especially
relational database management systems, or RDBMS, and tables
with clearly defined rows and columns, such as spreadsheets.
‹ While semi-structured data displays some of the same characteristics
as structured data, for the most part it lacks a clear structure and
cannot adhere to the formal specifications of data models like an
RDBMS.
‹ Unstructured data does not adhere to the formal structural norms of
traditional data models and lacks a consistent structure across all
of its different forms. In a very small number of cases, it might
contain information on the date and time.

1.6.1 Large-scale Data Management Traits


According to traditional definitions of the term, big data is typically
linked to three essential traits:
‹ Volume: The massive amounts of information produced every second
by social media, mobile devices, automobiles, transactions, connected
sensors, photos, video, and text are referred to by this characteristic.
Only big data technologies can handle enormous volumes, which
come in petabyte, terabyte, or even zettabyte sizes.
‹ Diversity (Variety): Information in the form of images, audio streams, video, and a variety of other formats now adds a diversity of data kinds, around 80% of which are completely unstructured, to the existing landscape of transactional and demographic data such as phone numbers and addresses.
‹ Velocity: This attribute relates to the velocity of data accumulation
and refers to the phenomenal rate at which information is flooding
into data repositories. It also describes how quickly massive data can
be analysed and processed to draw out the insights and patterns it
contains. Now, that speed is frequently real-time. In addition to these
three Vs, current definitions of big data management also include two
additional features, namely:


‹ Veracity: The level of dependability and truth that big data can
provide in terms of its applicability, rigour, and correctness.
‹ Value: This feature examines whether information and analytics will
eventually be beneficial or detrimental as the main goal of big
data collection and analysis is to uncover insights that can guide
decision-making and other activities.

Figure 1.3

1.6.2 Services for Big Data Management


Organisations can pick from a wide range of big data management
options when it comes to technology. Big data management solutions
can be standalone or multi-featured, and many businesses employ several
of them. The following are some of the most popular kinds of big data
management capabilities:
‹ Data cleansing: finding and resolving problems in data sets.
‹ Data integration: merging data from several sources.
‹ Data preparation: preparing data for use in analytics or other applications.
‹ Data enrichment: enhancing data by adding new data sets, fixing minor errors, or extrapolating new information from raw data.
‹ Data migration: moving data from one environment to another, such as from internal data centres to the cloud.
‹ Data analytics: analysing data using a variety of techniques in order to gain insights.

1.7 Overview of Machine Learning Algorithms

1.7.1 Machine Learning


Machine Learning (ML) is the study of computer algorithms that can
get better on their own over time and with the help of data. It is thought
to be a component of artificial intelligence. Without being expressly
taught to do so, machine learning algorithms create a model using sample
data, also referred to as training data, in order to make predictions or
judgments. In a wide range of fields where it is challenging or impractical
to design traditional algorithms, such as medicine, email filtering, speech
recognition, and computer vision, machine learning algorithms are applied.
Computational statistics, which focuses on making predictions with
computers, is closely related to a subset of machine learning, but not
all machine learning is statistical learning. The field of machine learning
benefits from the tools, theory, and application fields that come from
the study of mathematical optimisation. Data mining is a related area of
study that focuses on unsupervised learning for exploratory data analysis.
Some machine learning applications employ data and neural networks in
a way that closely resembles how a biological brain functions. Machine
learning is also known as predictive analytics when it comes to solving
business problems.
How does machine learning operate?
The operation of machine learning involves three steps:
1. The Making of a Decision: Typically, machine learning algorithms
are employed to produce a forecast or classify something. Your
algorithm will generate an estimate about a pattern in the supplied
data, which may be tagged or unlabelled.
2. An Error Function: A model’s prediction can be assessed using an
error function. In order to evaluate the model’s correctness when
there are known examples, an error function can compare the results.


3. A Process for Model Optimisation: So that the model can more closely match the data points in the training set, its weights are adjusted to lessen the difference between the known examples and the model's predictions. Until an accuracy level is reached, the algorithm will iteratively evaluate and optimise, updating the weights on its own each time (a minimal sketch of this loop follows).
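To make the three steps above concrete, here is a minimal sketch, not taken from this text, that fits a straight line to invented toy data by repeatedly computing an error function (mean squared error) and adjusting the weights to reduce it. It assumes only NumPy is installed; all numbers are illustrative.

```python
import numpy as np

# Invented toy data: y is roughly 2*x + 1 plus some noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=50)

w, b = 0.0, 0.0          # initial weights (slope and intercept)
learning_rate = 0.01

for step in range(2000):
    y_pred = w * x + b                   # 1. make a prediction
    error = y_pred - y
    mse = np.mean(error ** 2)            # 2. error function (mean squared error)
    grad_w = 2 * np.mean(error * x)      # 3. adjust the weights to reduce the error
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned slope ~ {w:.2f}, intercept ~ {b:.2f}, final MSE ~ {mse:.3f}")
```

After enough iterations the learned slope and intercept settle close to the values used to generate the data, which is exactly the "evaluate and optimise until an accuracy level is reached" loop described above.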

1.7.2 Machine Learning Methods


Machine learning methods fall into three primary categories:
1. Supervised machine learning: Supervised learning, commonly referred to as supervised machine learning, is defined by the use of labelled datasets to train algorithms that can reliably classify data or predict outcomes. The model receives input data and adjusts its weights until it is properly fitted; this happens as part of the cross-validation process to make sure the model does not fit too well or too poorly. Supervised learning assists organisations in finding scalable solutions to a range of real-world issues, such as classifying spam into a separate folder from your email (a short sketch contrasting supervised and unsupervised learning follows this list). Neural networks, naive Bayes, linear regression, logistic regression, random forest, Support Vector Machines (SVM), and other techniques are used in supervised learning.
2. Unsupervised Machine learning: Unsupervised learning, commonly
referred to as unsupervised machine learning, analyses and groups un-labelled
datasets using machine learning algorithms. These algorithms identify
hidden patterns or data clusters without the assistance of a human. It is
the appropriate solution for exploratory data analysis, cross-selling tactics,
consumer segmentation, and picture and pattern recognition because of its
capacity to find similarities and differences in information. Through the
process of dimensionality reduction, it is also used to lower the number
of features in a model; Principal Component Analysis (PCA) and Singular
Value Decomposition (SVD) are two popular methods for this. The use of
neural networks, k-means clustering, probabilistic clustering techniques,
and other algorithms is also common in unsupervised learning.
3. Semi-supervised learning: A satisfying middle ground between
supervised and unsupervised learning is provided by semi-supervised
learning. It employs a smaller, labelled data set during training to direct


feature extraction and classification from a larger, unlabelled data set. If you don't have enough labelled data, or can't afford to label enough data, to train a supervised learning system, semi-supervised learning can help.
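As promised above, the sketch below contrasts the first two categories on the same invented data set. It is only an illustration, assuming scikit-learn and NumPy are installed; the two synthetic "customer" groups and all values are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two invented groups of customers, each described by two features
group_a = rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[6, 6], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# Supervised learning: labels are known, so the model learns to predict them
y = np.array([0] * 50 + [1] * 50)
clf = LogisticRegression().fit(X, y)
print("Predicted class for a new point:", clf.predict([[5.5, 6.2]]))

# Unsupervised learning: no labels are given; the algorithm discovers the groups itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Discovered cluster labels for the first five points:", km.labels_[:5])
```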

1.7.3 Reinforcement Learning with Computers


Although the algorithm isn’t trained on sample data, reinforcement
machine learning is a behavioural machine learning model that is similar
to supervised learning. Trial and error are used by this model to learn as
it goes. The optimal suggestion or strategy will be created for a specific
problem by reinforcing a string of successful outcomes.
A subset of artificial intelligence called “machine learning” employs
computer algorithms to enable autonomous learning from data and
knowledge. In machine learning, computers can change and enhance their
algorithms without needing to be explicitly programmed.
Computers can now interact with people, drive themselves, write and
publish sport match reports, and even identify terrorism suspects thanks
to machine learning algorithms.

1.8 Introduction to Relevant Statistical Software Packages


A statistical package is essentially a group of software programmes that
share a common user interface and were created to make it easier to do
statistical analysis and related duties like data management.
What is Statistical Software?
Software for doing complex statistical analysis is known as statistical
software. They serve as tools for organising, interpreting, and presenting
particular data sets in order to provide scientific insights on patterns and
trends. To perform data sciences, statistical software uses statistical analysis
theorems and procedures like regression analysis and time series analysis.
Benefits of Statistical Software:
‹ Increases productivity and accuracy in data management and analysis
‹ Requires less time
‹ Simple personalization


‹ Access to a sizable database that reduces sampling error and enables
data-driven decision making.
Relevant statistical software packages:
1. SPSS (Statistical Package for Social Sciences)
‹ The most popular and effective programme for analysing complex
statistical data is called SPSS.
‹ To make the results easy to discuss, it quickly generates descriptive
statistics, parametric and non-parametric analysis, and delivers graphs
and presentation-ready reports.
‹ Here, estimation and the discovery of missing values in the data sets
lead to more accurate reports.
‹ For the analysis of quantitative data, SPSS is utilised.
2. Stata
‹ Stata is another commonly used programme that makes it possible
to manage, save, produce, and visualise data graphically. It does
not require any coding expertise to use.
‹ Its use is more intuitive because it has both a command line and a
graphical user interface.
3. R
‹ Free statistical software known as “R” offers graphical and statistical
tools, including linear and non-linear modelling.
‹ Toolboxes, which are effective plugins, are available for a wide range
of applications. Here, coding expertise is necessary.
‹ It offers interactive reports and apps, makes extensive use of data,
and complies with security guidelines.
‹ R is used to analyse quantitative data.
4. Python
‹ Another freely available software.


‹ Extensive libraries and frameworks (a short example appears after this list of packages).


‹ A popular choice for machine learning tasks.
‹ Simplicity and Readability.
5. SAS (Statistical Analysis Software)
‹ It is a cloud-based platform that offers ready-to-use applications for
manipulating data, storing information, and retrieving it.
‹ Its processes employ several threads executing several tasks at once.
‹ Business analysts, statisticians, data scientists, researchers, and
engineers utilise it largely for statistical modelling, spotting trends
and patterns in data, and assisting in decision-making.
‹ For someone unfamiliar with this method, coding can be challenging.
‹ It is utilised for the analysis of numerical data.
6. MATLAB (MATrix LABoratory)
‹ The initials MATLAB stand for Matrix Laboratory.
‹ Software called MATLAB offers both an analytical platform and a
programming language.
‹ It supports matrix and array mathematics, function and data plotting,
algorithm implementation, and user interface development.
‹ A script that combines code, output, and formatted text into an
executable notebook is produced by Live Editor, which is also
provided.
‹ Engineers and scientists utilise it a lot.
‹ For the analysis of quantitative data, MATLAB is employed.
7. Epi-data
‹ Epi-data is a widely used, free data programme created to help
epidemiologists, public health researchers, and others enter, organise,
and analyse data while working on the ground.
‹ It manages all of the data and produces graphs and elementary
statistical analysis.
‹ Here, users can design their own databases and forms.
‹ Epi-data is a tool for analysing quantitative data.


8. Epi-info
‹ It is a public domain software suite created by the Centres for
Disease Control and Prevention (CDC) for researchers and public
health professionals worldwide.
‹ For those who might not have a background in information technology,
it offers simple data entry forms, database development, and data
analytics including epidemiology statistics, maps, and graphs.
‹ Investigations into disease outbreaks, the creation of small to medium-
sized disease monitoring systems, and the Analysis, Visualisation,
and Reporting (AVR) elements of bigger systems all make use of
it.
‹ It is utilised for the analysis of numerical data.
9. NVivo
‹ It is a piece of software that enables the organisation and archiving
of qualitative data for analysis.
‹ The analysis of unstructured text, audio, video, and image data, such
as that from interviews, Focus Group Discussion (FGD), surveys,
social media, and journal articles, is done using NVivo.
‹ You can import Word documents, PDFs, audio, video, and photos
here.
‹ It facilitates users’ more effective organisation, analysis, and discovery
of insights from unstructured or qualitative data.
‹ The user-friendly layout makes it instantly familiar and intuitive for
the user. It contains a free version as well as automated transcribing
and auto coding.
‹ Research using mixed methods and qualitative data is conducted
using NVivo.
10. Mini-tab
‹ Mini-tab provides both fundamental and moderately sophisticated
statistical analysis capabilities.
‹ It has the ability to analyse a variety of data sets, automate statistical
calculations, and provide beautiful visualisations.
‹ Using Mini-tab allows users to concentrate more on data analysis
by letting them examine both current and historical


data to spot trends and patterns as well as hidden links between
variables.
‹ It makes it easier to understand the data’s insights.
‹ For the examination of quantitative data, Mini-tab is employed.
11. Dedoose
‹ Dedoose, a tool for qualitative and quantitative data analysis, is
entirely web-based.
‹ This low-cost programme is user-friendly and team-oriented, and
it makes it simple to import both text and visual data.
‹ It has access to cutting-edge data security equipment.
12. ATLAS.ti
‹ It is a pioneer in qualitative analysis software and has incorporated
AI as it has developed.
‹ It is best suited to research organisations, businesses, and academic
institutions, given the cost of conducting individual studies.
‹ With sentiment analysis and auto coding, it is more potent.
‹ It gives users the option to use any language or character set.
13. MAXQDA 12
‹ It is expert software for analysing data using quantitative, qualitative,
and mixed methods.
‹ It imports the data, reviews it in a single spot, and categorises any
unstructured data with ease.
‹ With this software, a literature review may also be created.
‹ It costs money and is not always easy to collaborate with others in
a team.
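Several of the packages above, notably R and Python, are used later in this material for hands-on regression exercises. As a small, hypothetical taste of the kind of descriptive analysis such programmable tools support (a sketch assuming pandas is installed; the column names and sales figures below are invented purely for illustration):

```python
import pandas as pd

# Invented monthly figures for a small retail business
sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "revenue": [120, 135, 150, 160, 155, 170],      # in thousands of rupees
    "customers": [300, 320, 350, 380, 370, 400],
})

print(sales.describe())                               # descriptive statistics in one call
print(sales["revenue"].corr(sales["customers"]))      # correlation between two columns
```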

IN-TEXT QUESTIONS
1. What is the term used for a collection of large, complex data
sets that cannot be processed using traditional data processing
tools?


(a) Big Data


(b) Small Data
(c) Medium Data
(d) Mini Data
2. Which of the following is not one of the four V’s of Big Data?
(a) Velocity
(b) Volume
(c) Variety
(d) Value
3. What is the process of transforming structured and unstructured
data into a format that can be easily analyzed?
(a) Data Mining
(b) Data Warehousing
(c) Data Integration
(d) Data Processing
4. Which of the following is a tool used for processing and
analyzing Big Data?
(a) Hadoop
(b) MySQL
(c) Postgre SQL
(d) Oracle
5. What is the process of examining large and varied data sets to
uncover hidden patterns, unknown correlations, market trends,
customer preferences, and other useful information?
(a) Data Mining
(b) Data Warehousing
(c) Data Integration
(d) Data Processing
6. Which of the following is not a common challenge associated
with Big Data?


(a) Data Quality

(b) Data Integration


(c) Data Privacy
(d) Data Duplication
7. Which of the following is a technique used to extract meaningful
insights from data sets that are too large or complex to be
processed by traditional data processing tools?
(a) Business Intelligence
(b) Machine Learning
(c) Artificial Intelligence
(d) Data Science
8. Which of the following is a common programming language
used for Big Data processing?
(a) C++
(b) Java
(c) Python
(d) All of the above
9. Which of the following is a popular NoSQL database used for
Big Data processing?
(a) MySQL
(b) PostgreSQL
(c) Oracle
(d) MongoDB
10. What is the term used for the ability of a system to handle
increasing amounts of data and traffic without compromising
performance?
(a) Scalability
(b) Reliability
(c) Availability
(d) Security


11. What is the process of cleaning and transforming data before
it is used for analysis?
(a) Data Mining
(b) Data Warehousing
(c) Data Integration
(d) Data Preprocessing
12. Which of the following is not a common use case for Big Data
analytics?
(a) Fraud Detection
(b) Customer Segmentation
(c) Social Media Analysis
(d) Inventory Management
13. Which of the following is not a common method for selecting
the best features for a machine learning model?
(a) Filter Methods
(b) Wrapper Methods
(c) Embedded Methods
(d) Extrapolation Methods
14. Which of the following is a technique for grouping similar data
points together?
(a) Classification
(b) Regression
(c) Clustering
(d) Dimensionality Reduction
15. Which of the following is a measure of how well a machine
learning model is able to make predictions on new data?
(a) Accuracy
(b) Precision
(c) Recall
(d) All of the above


1.9 Summary

The disciplines of management, business, and computer science are all


combined in business analytics. The commercial component requires
knowledge of the industry at a high level as well as awareness of current
practical constraints. An understanding of data, statistics, and computer
science is required for the analytical portion. Business analysts can close
the gap between management and technology thanks to this confluence
of disciplines. Business analytics also includes effective problem-solving
and communication to translate data insights into information that is
understandable to executives. A related field called business intelligence
likewise uses data to better understand and inform businesses. What
distinguishes business analytics from business intelligence in terms of
objectives? Despite the fact that both areas rely on data to provide
answers, the goal of business intelligence is to comprehend how an
organisation came to be in the first place. Measurement and monitoring of
Key Performance Indicators (KPIs) are part of this. The goal of business
analytics, on the other hand, is to support business improvements by
utilizing predictive models that offer insight into the results of suggested
adjustments. Big data, statistical analysis, and data visualization are all
used in business analytics to implement organizational changes. This
work includes predictive analytics, which is crucial since it uses data
that is already accessible to build statistical models. These models can be
applied to decision-making and result prediction. Business analytics can
provide specific recommendations to fix issues and enhance enterprises
by learning from the data already available.

1.10 Answers to In-Text Questions

1. (a) Big Data


2. (d) Value
3. (c) Data Integration
4. (a) Hadoop
5. (a) Data Mining
6. (d) Data Duplication


7. (b) Machine Learning


8. (d) All of the above
9. (d) MongoDB
10. (a) Scalability
11. (d) Data Preprocessing
12. (d) Inventory Management
13. (d) Extrapolation Methods
14. (c) Clustering
15. (d) All of the above

1.11 Self-Assessment Questions


1. What is Business Analysis?
2. Why is a Business Analyst needed in an organization?
3. What is SaaS?
4. What are considered to be the four types of Business Analytics? Explain them in your own words.
5. Explain the importance of Business Analytics.
6. Explain three relevant statistical software packages.
7. How does the machine learning method work?
8. Explain the difference between any two software packages.
9. What is Big Data Analytics?

1.12 References
‹ Evans, J.R. (2021), Business Analytics: Methods, Models and Decisions,
Pearson India.
‹ Kumar, U. D. (2021), Business Analytics: The Science of Data-Driven
Decision Making, Wiley India.
‹ Larose, D. T. (2022), Data Mining and Predictive Analytics, Wiley
India.
‹ Shmueli, G. (2021), Data Mining and Business Analytics, Wiley India.


1.13 Suggested Readings

‹ Cadle, Paul and Turner (2014), Business Analysis Techniques: 99 Essential Tools for Success, BCS, Swindon.
‹ Ziemski, K., Vander Horst, R. and Hass, K. B. (2008), Business analyst management concepts: elevating the role of the analyst, ISBN 1-56726-213-9, p. 94: “As business analysis becomes a more professionalised discipline”.

LESSON 2
Predictive Analytics
Dr. Satish Kumar Goel
Assistant Professor
Shaheed Sukhdev College of Business Studies
University of Delhi
Email-Id: satish@sscbsdu.ac.in

STRUCTURE
2.1 Learning Objectives
2.2 Introduction
2.3 Classical Linear Regression Model (CLRM)
2.4 Multiple Linear Regression Model
2.5 Practical Exercises Using R/Python Programming
2.6 Summary
2.7 Answers to In-Text Questions
2.8 Self-Assessment Questions
2.9 Reference
2.10 Suggested Readings

2.1 Learning Objectives


‹ To understand the basic concept of linear regression and where to apply it.
‹ To develop a linear relationship between two or more variables.
‹ To predict the value of the dependent variable, given the value of the independent variable, using the regression line.
‹ To be familiar with the different metrics used in regression.
‹ Use of R and Python for regression implementation.

2.2 Introduction
In this chapter, we will explore the field of predictive analytics, focusing on two fundamental
techniques: Simple Linear Regression and Multiple Linear Regression. Predictive analytics


is a powerful tool for analysing data and making predictions about future
outcomes. We will cover various aspects of regression models, including
parameter estimation, model validation, coefficient of determination,
significance tests, residual analysis, and confidence and prediction
intervals. Additionally, we will provide practical exercises to reinforce your
understanding of these concepts, using R or Python for implementation.

2.3 Classical Linear Regression Model (CLRM)

2.3.1 Introduction
Predictive analytics is the use of statistical techniques, machine learning
algorithms, and other tools to identify patterns and relationships in
historical data and use them to make predictions about future events. These
predictions can be used to inform decision-making in a wide variety of
areas, such as business, marketing, healthcare, and finance.
Linear regression is the traditional statistical technique used to model the
relationship between one or more independent variables and a dependent
variable.
Linear regression involving only two variables is called simple linear
regression. Let us consider two variables as ‘x’ and ‘y’. Here ‘x’ represents
independent variable or explanatory variable and ‘y’ represents dependent
variable or response variable. Dependent variable must be a ratio variable,
whereas independent variable can be ratio or categorical variable. We can
talk about regression model for cross-sectional data or for time series data.
In time series regression model, time is taken as independent variable
and is very useful for predicting future. Before we develop a regression
model, it is a good exercise to ensure that two variables are linearly
related. For this, plotting the scatter diagram is really helpful. A linear
pattern can easily be identified in the data.
The Classical Linear Regression Model (CLRM) is a statistical framework
used to analyse the relationship between a dependent variable and one or
more independent variables. It is a widely used method in econometrics
and other fields to study and understand the nature of this relationship,
make predictions, and test hypotheses.


Regression analysis aims to examine how changes in the independent


variable(s) affect the dependent variable. The CLRM assumes a linear
relationship between the dependent variable (Y) and the independent
variable(s) (X), allowing us to estimate the parameters of this relationship
and make predictions.
The regression equation in the CLRM is expressed as:
Yi = α + βxi + μi
Here, Yi represents the dependent variable, xi represents the independent variable, α represents the intercept, β represents the coefficient or slope that quantifies the effect of xi on Yi, and μi represents the error term or residual. The error term captures the unobserved factors and random variations that affect the dependent variable but are not explicitly included in the model.
The CLRM considers the Population Regression Function (PRF), which is the true underlying relationship between the variables in the population. The PRF is expressed as:
Yi = α + βxi + μi
The difference between the regression equation and the PRF is the inclusion of the error term (μi) in the PRF. The error term represents the discrepancy between the observed value of Yi and the predicted value based on the regression equation.
In practice, we estimate the parameters of the PRF using sample data and derive the Sample Regression Function (SRF), which is an approximation of the PRF. The SRF is represented as:
Yi = α̂ + β̂xi + ûi
In the SRF, α̂ and β̂ are the estimated intercept and coefficient, respectively, obtained through statistical methods such as Ordinary Least Squares (OLS). The estimated error term (ûi) captures the residuals or discrepancies between the observed and predicted values based on the estimated parameters.
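As a concrete illustration of this estimation step, here is a minimal sketch, not part of the original text, assuming NumPy and statsmodels are installed; the simulated data and its "true" parameters are invented purely so the estimates can be checked against them.

```python
import numpy as np
import statsmodels.api as sm

# Simulate data from Yi = 3 + 0.8*xi + ui (invented true alpha = 3, true beta = 0.8)
rng = np.random.default_rng(42)
x = rng.uniform(0, 100, size=200)
u = rng.normal(0, 5, size=200)
y = 3 + 0.8 * x + u

X = sm.add_constant(x)           # adds a column of ones for the intercept (alpha)
results = sm.OLS(y, X).fit()     # ordinary least squares estimation

print(results.params)            # estimated alpha-hat and beta-hat
print(results.rsquared)          # coefficient of determination (discussed later)
```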


2.3.2 Assumptions

To ensure reliable and meaningful results, the CLRM relies on several


key assumptions. Let’s discuss these assumptions one by one:
‹ Linearity: The regression model must be linear in its parameters. Linearity refers to the linearity of the parameters α and β, not necessarily the linearity of the variables themselves. For example, even if the variable xi is not linear, the model can still be considered linear if the parameters α and β are linear.
‹ Variation in Independent Variables: There should be sufficient
variation in the independent variable(s) to be qualified as an explanatory
variable. In other words, if there is little or no variation in the
independent variable, it cannot effectively explain the differences
in the dependent variable.
For example, suppose we want to model the consumption level taking
income as the independent variable. If everyone in the sample has
an income of Rs. 10,000, then there is no variation in Xi. Hence,
the difference in their consumption levels cannot be explained by
Xi.
Hence, we assume that there is enough variation in Xi. Otherwise,
we cannot include it as an explanatory variable in the model.
‹ Zero Mean and Normal Distribution of Error Term: The error term (μi) should have a mean of zero. This means that, on average,
the errors do not systematically overestimate or underestimate the
dependent variable. Additionally, the error term is assumed to
follow a normal distribution, allowing for statistical inference and
hypothesis testing.
‹ Fixed Values of Independent Variables: The values of the independent
variable(s) are considered fixed over repeated sampling. This
assumption implies that the independent variables are not subject
to random fluctuations or changes during the sampling process.
‹ No Endogeneity: Endogeneity refers to the situation where there
is a correlation between the independent variables and the error
term. In other words, the independent variables are not independent
of the error term. To ensure valid results, it is crucial to address


endogeneity issues, as violating this assumption can lead to biased


and inconsistent parameter estimates.
‹ Number of Parameters vs. Sample Size: The number of parameters
to be estimated (k) from the model should be significantly smaller
than the total number of observations in the sample (n). In general,
it is recommended that the sample size (n) should be at least 20
times greater than the number of parameters (k) to obtain reliable
and stable estimates.
‹ Correct Model Specification: The econometric model should be
correctly specified, meaning that it reflects the true relationship
between the variables in the population. Model misspecification
can occur in two ways: improper functional form and inclusion/
exclusion of relevant variables. Improper functional form refers
to using a linear model when the true relationship is non-linear,
leading to biased parameter estimates. The inclusion of irrelevant
variables or exclusion of relevant variables can also lead to biased
and inefficient estimates.
‹ Homoscedasticity: Homoscedasticity assumes that the variance of the
error term is constant across all levels of the independent variables.
It means that the spread or dispersion of the errors does not change
systematically with the values of the independent variable(s). This
assumption is important for obtaining efficient and unbiased estimates
of the parameters.
To understand homoscedasticity visually, let’s consider a scatter plot
with a regression line. In a homoskedastic scenario, the spread of
the residuals around the regression line will be relatively constant
across different values of the independent variable(s).
Homoscedasticity means that the variance of the error term is constant:
Yi = α + βxi + μi
Var(μi) = σ²



Figure 2.1: Scatter Plot


Even at higher levels of Xi, the variance of the error term remains constant.
In a homoskedastic scenario, the spread of the residuals (green lines)
remains relatively constant across different values of the independent
variable. This means that the variability of the dependent variable is
consistent across the range of the independent variable.
Homoscedasticity is an important assumption in the CLRM because
violations of this assumption can lead to biased and inefficient estimators,
affecting the reliability of the regression analysis. If heteroscedasticity is
present (where the spread of the residuals varies across the range of the
independent variable), it can indicate that the model is not adequately
capturing the relationship between the variables, leading to unreliable
inference and misleading results.
To detect heteroscedasticity, you can visually inspect the scatter plot of the residuals or employ statistical tests specifically designed to assess the presence of heteroscedasticity, such as the Breusch-Pagan test or the White test (a short Python sketch illustrating such diagnostics appears after this list of assumptions).
If heteroscedasticity is detected, various techniques can be employed
to address it, such as transforming the variables, using Weighted Least
Squares (WLS) regression, or employing heteroscedasticity-consistent
standard errors.


‹ No Autocorrelation: Autocorrelation, also known as serial correlation,


refers to the correlation between error terms of different observations.
In the case of cross-sectional data, autocorrelation occurs when
the error terms of different individuals or units are correlated. In
time series data, autocorrelation occurs when the error terms of
consecutive time periods are correlated. Autocorrelation violates
the assumption of independent and identically distributed errors,
and it can lead to biased and inefficient estimates.
This means that the covariance between μi and μi-1 should be zero.
If that is not the case, then it is a situation of autocorrelation.
Yi = α + βxi + μi
Yj = α + βxj + μj
Cov(μi, μj) ≠ 0 : spatial autocorrelation
Cov(μt, μt+1) ≠ 0 : autocorrelation
In cross sectional data, if two error terms do not have zero covariance,
then it is a situation of SPATIAL CORRELATION. In time series
data, if two error terms for consecutive time periods do not have
zero covariance, then it is a situation of AUTOCORRELATION
OR SERIAL CORRELATION.
‹ No Multicollinearity: Multicollinearity occurs when there is a high
degree of correlation between two or more independent variables in
the regression model. This can pose a problem because it becomes
challenging to separate the individual effects of the correlated
variables. Multicollinearity can lead to imprecise and unstable
parameter estimates.
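As promised above, here is a hedged sketch of how the homoscedasticity and autocorrelation assumptions can be checked in practice. It assumes statsmodels and NumPy are installed and uses invented simulated data; it runs the Breusch-Pagan test mentioned earlier and, as an additional commonly used diagnostic not discussed in this text, the Durbin-Watson statistic for first-order autocorrelation.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Invented data with homoskedastic, uncorrelated errors
rng = np.random.default_rng(7)
x = rng.uniform(0, 50, size=150)
y = 10 + 2 * x + rng.normal(0, 3, size=150)

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# Breusch-Pagan test: a small p-value would suggest heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, results.model.exog)
print("Breusch-Pagan p-value:", round(lm_pvalue, 3))

# Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation
print("Durbin-Watson statistic:", round(durbin_watson(results.resid), 2))
```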
By adhering to these assumptions, the CLRM exhibits desirable properties
such as efficiency, unbiasedness, and consistency. Efficiency refers to
obtaining parameter estimates with the minimum possible variance,
allowing for precise estimation. Unbiasedness means that, on average,
the estimated parameters are not systematically over or underestimating
the true population parameters. Consistency implies that as the sample
size increases, the estimated parameters converge to the true population
parameters.


In conclusion, the Classical Linear Regression Model (CLRM) is a


widely used statistical framework for analysing the relationship between a
dependent variable and one or more independent variables. By estimating
the parameters of the regression equation, we can make predictions, test
hypotheses, and gain insights into the factors influencing the dependent
variable. However, it is crucial to ensure that the assumptions of the
CLRM are met to obtain reliable and meaningful results. Violating these
assumptions can lead to biased and inconsistent parameter estimates,
compromising the validity of the analysis.

2.3.3 Simple Linear Regression


2.3.3.1 Estimation of Parameters
Simple Linear Regression involves estimating the parameters of a linear
equation that best fits the relationship between a single independent
variable and a dependent variable. We will discuss the methods used to
estimate these parameters and interpret their meaning in the context of
the problem at hand using R/Python programming.
2.3.3.2 Model Validation
Validating the simple linear regression model is crucial to ensure its
reliability. We will cover various techniques, such as hypothesis testing,
to assess the significance of the model and evaluate its performance.
Additionally, we will examine residual analysis to understand the differences
between the observed and predicted values and identify potential issues
with the model.
Validation of a simple linear regression model involves assessing the
model’s performance and determining how well it fits the data. Here are
some common techniques for validating a simple linear regression model:
Residual Analysis: Residuals are the differences between the observed
values and the predicted values of the dependent variable. By analysing
the residuals, you can evaluate the model’s performance. Some key
aspects to consider are:
‹ Checking for randomness: Plotting the residuals against the predicted
values or the independent variable can help identify any patterns
or non-random behaviour.

‹ Assessing Normality: Plotting a histogram or a Q-Q plot of the residuals
can indicate whether they follow a normal distribution. Departures
from normality might suggest violations of the assumptions.
‹ Checking for Homoscedasticity: Plotting the residuals against the
predicted values or the independent variable can reveal any patterns
indicating non-constant variance. The spread of the residuals should
be consistent across all levels of the independent variable.
R-squared (Coefficient of Determination): R-squared measures the
proportion of the total variation in the dependent variable that is explained
by the linear regression model. A higher R-squared value indicates a better
fit. However, R-squared alone does not provide a complete picture of
model performance and should be interpreted along with other validation
metrics.
Adjusted R-squared: Adjusted R-squared takes into account the number
of independent variables in the model. It penalizes the addition of
irrelevant variables and provides a more reliable measure of model fit
when comparing models with different numbers of predictors.
F-statistic: The F-statistic assesses the overall significance of the linear
regression model. It compares the fit of the model with a null model
(no predictors) and provides a p-value indicating whether the model
significantly improves upon the null model.
Outlier Analysis: Identify potential outliers in the data that may have a
substantial impact on the model’s fit. Outliers can skew the regression
line and affect the estimated coefficients. It is important to investigate
and understand the reasons behind any outliers and assess their influence
on the model.
Cross-Validation: Splitting the dataset into training and testing subsets
allows you to assess the model’s performance on unseen data. The model
is trained on the training set and then evaluated on the testing set. Metrics
such as Mean Squared Error (MSE), or Root Mean Squared Error (RMSE)
can be calculated to quantify the model’s predictive accuracy.
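A minimal scikit-learn sketch of this hold-out approach is shown below. It assumes a pandas DataFrame named data with a predictor column 'X' and a response column 'Y'; these names are illustrative placeholders.
# A minimal sketch of hold-out validation, assuming a DataFrame `data`
# with columns 'X' (predictor) and 'Y' (response); the names are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(
    data[['X']], data['Y'], test_size=0.3, random_state=42)

reg = LinearRegression().fit(X_train, y_train)   # fit on the training subset
y_pred = reg.predict(X_test)                     # predict on unseen data

mse = mean_squared_error(y_test, y_pred)         # Mean Squared Error
rmse = np.sqrt(mse)                              # Root Mean Squared Error
print(mse, rmse)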
By employing these validation techniques, you can gain insights into
the model’s performance, evaluate its assumptions, and make informed
decisions about its reliability and usefulness for predicting the dependent
variable.

2.3.4 Coefficient of Determination

The coefficient of determination, commonly known as R-squared, quantifies


the proportion of variance in the dependent variable that can be explained
by the independent variable in a simple linear regression model. We will
delve into the calculation and interpretation of this important metric.
Introduction:
The overall goodness of fit of the regression model is measured by the
coefficient of determination, r2. It tells what proportion of the variation in
the dependent variable, or regressand, is explained by the explanatory
variable, or regressor. This r2 lies between 0 and 1; the closer it is to 1,
the better is the fit.
Let TSS denotes TOTAL SUM OF SQUARES which is Total variation
of the actual Y values about their sample means which may be called
the total sum of squares:

TSS = Σ(yi − ȳ)²

TSS can further be split into two variations; Explained Sum of Square
(ESS) and Residual Sum of Squares (RSS).
Explained Sum of Square (ESS) or Regression sum of squares or Model
sum of squares is a statistical quantity used in modelling of a process.
ESS gives an estimate of how well a model explains the observed data
for the process.

ESS = Σ(ŷi − ȳ)²

The Residual Sum of Squares (RSS) is a statistical technique used to


measure the amount of variance in a data set that is not explained by a
regression model itself. Instead, it estimates the variance in the residuals,
or error term.

RSS = Σ(yi − ŷi)²

Since, TSS = ESS + RSS


or 1 = ESS/TSS + RSS/TSS
Since ESS/TSS determines the proportion of variability in Y explained by
the regression model, therefore:

r2 = ESS/TSS
Alternatively, from above r2 = 1-RSS/TSS
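As a quick numerical check, this decomposition can be verified in Python. The sketch below assumes y (actual values) and y_hat (fitted values) are NumPy arrays obtained from a fitted model; the names are illustrative.
# A minimal sketch of the TSS = ESS + RSS decomposition, assuming
# NumPy arrays y (actual values) and y_hat (fitted values).
import numpy as np

tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)    # explained sum of squares
rss = np.sum((y - y_hat) ** 2)           # residual sum of squares

r_squared = ess / tss                    # equivalently 1 - rss / tss
print(r_squared)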

2.3.5 Significance Tests


To determine the significance of the simple linear regression model and its
coefficients, we will explore statistical tests such as t-tests and p-values
in the practical exercise. These tests help assess the statistical significance
of the relationships between variables and make informed conclusions.

2.3.6 Residual Analysis


Residual analysis is a critical step in evaluating the adequacy of a simple
linear regression model. Using practical examples, we will discuss how to
interpret and analyse residuals, which represent the differences between
the observed and predicted values. Residual analysis provides insights
into the model’s assumptions and potential areas for improvement.

2.3.7 Confidence and Prediction Intervals


Confidence and prediction intervals are essential in understanding the
uncertainty associated with the predictions made by a simple linear
regression model. We will cover the calculation and interpretation of
these intervals, allowing us to estimate the range within which future
observations are expected to fall in the practical exercises.
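A minimal statsmodels sketch of how such intervals can be obtained is given below. It assumes results is a fitted OLS results object and X_new contains the new observations (with a constant column, as in the fitted model); both names are placeholders.
# A minimal sketch of confidence and prediction intervals with statsmodels,
# assuming `results` is a fitted OLS results object and `X_new` holds new observations.
pred = results.get_prediction(X_new)
frame = pred.summary_frame(alpha=0.05)   # 95% intervals
# mean_ci_lower / mean_ci_upper : confidence interval for the mean response
# obs_ci_lower / obs_ci_upper   : prediction interval for individual observations
print(frame[['mean', 'mean_ci_lower', 'mean_ci_upper',
             'obs_ci_lower', 'obs_ci_upper']])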

2.4 Multiple Linear Regression Model


Multiple regression is a statistical analysis technique used to examine the
relationship between a dependent variable and two or more independent
variables. It builds upon the concept of simple linear regression, which
analyses the relationship between a dependent variable and a single
independent variable.
The multiple regression model equation looks like this:
Y = β0 + β1X1 + β2X2 + ... + βnXn + ε

In this equation:


Y represents the dependent variable that we want to predict or explain.
X1, X2, ..., Xn are the independent variables.
β0 is the y-intercept or constant term.
β1, β2, ..., βn are the coefficients or regression weights that represent
the change in the dependent variable associated with a one-unit change
in the corresponding independent variable.
ε is the error term or residual, representing the unexplained variation in
the dependent variable.

2.4.1 Interpretation of Partial Regression Coefficients


Multiple Linear Regression extends the simple linear regression framework
to include multiple independent variables. We will explore the interpretation
of partial regression coefficients, which quantify the relationship between
each independent variable and the dependent variable while holding other
variables constant.

2.4.2 Working with Categorical Variables


Categorical variables require special treatment in regression analysis.
We will discuss how to handle categorical variables by creating dummy
variables or qualitative variables. The interpretation of these coefficients
will be explained to understand the impact of categorical variables on
the dependent variable.

2.4.3 Multicollinearity and VIF


Multicollinearity refers to the presence of high correlation between
independent variables in a multiple linear regression model. One of the
assumptions of the CLRM is that there is no exact linear relationship
among the independent variables (regressors). If there are one or more
such relationships among the regressors, we call it multicollinearity.
There are two types of multicollinearity:
1. Perfect collinearity
2. Imperfect collinearity

Perfect multicollinearity occurs when two or more independent variables in a


regression model exhibit a deterministic (perfectly predictable or containing
no randomness) linear relationship. With imperfect multicollinearity, an
independent variable is a strong, but not perfect, linear function of one
or more of the other independent variables.
This also means that there are other variables in the model that affect
the independent variable.
Multicollinearity occurs when there is a high correlation between independent
variables in a regression model. It can cause issues with the estimation
of coefficients and affect the reliability of statistical inference.
The causes of multicollinearity are as follows:
(1) Data collection method: If we sample over a limited range of
values taken by the regressors in the population, it can lead to
multicollinearity.
(2) Model specification: If we introduce polynomial terms into the
model, especially when the values of the explanatory variables are
small; it can lead to multicollinearity.
(3) Constraint on the model or in the population: For example, if
we try to regress electricity expenditure on house size and income,
it may suffer from multicollinearity as there is a constraint in the
population. People with higher incomes typically have bigger houses.
(4) Over determined model: If we have more explanatory variables than
the number of observations, then it could lead to multicollinearity.
Often happens in medical research when you only have a limited
number of patients about whom a large amount of information is
collected.
Impact of multicollinearity:
Unbiasedness: The Ordinary Least Squares (OLS) estimators remain
unbiased.
Precision: OLS estimators have large variances and covariances, making
precise estimation difficult and leading to wider confidence intervals.
Statistically insignificant coefficients may be observed.
High R-squared: The R-squared value can still be high, even with
statistically insignificant coefficients.

Sensitivity: OLS estimators and their standard errors are sensitive to
small changes in the data.
Efficiency: Despite increased variance, OLS estimators are still efficient,
meaning they have minimum variance among all linear unbiased estimators.
In summary, multicollinearity undermines the precision of coefficient
estimates and can lead to unreliable statistical inference. While the OLS
estimators remain unbiased, they become imprecise, resulting in wider
confidence intervals and potential insignificance of coefficients.
We will learn how to detect multicollinearity using the Variance Inflation
Factor (VIF) and explore strategies to address this issue, ensuring the
accuracy and interpretability of the regression model.
VIF stands for Variance Inflation Factor, which is a measure used to
assess multicollinearity in multiple regression model. VIF quantifies how
much the variance of the estimated regression coefficient is increased
due to multicollinearity. It measures how much the variance of one
independent variable’s estimated coefficient is inflated by the presence
of other independent variables in the model.
The formula for calculating the VIF for an independent variable Xj is:
VIF(Xj) = 1 / (1 – rj2)
where rj2 represents the coefficient of determination (R-squared) from a
regression model that regresses Xj on all other independent variables.
The interpretation of VIF is as follows:
If VIF(Xj) is equal to 1, it indicates that there is no correlation between
Xj and the other independent variables.
If VIF(Xj) is greater than 1 but less than 5, it suggests moderate
multicollinearity.
If VIF(Xj) is greater than 5, it indicates a high degree of multicollinearity,
and it is generally considered problematic.
When assessing multicollinearity, it is common to examine the VIF
values for all independent variables in the model. If any variables have
high VIF values, it indicates that they are highly correlated with the
other variables, which may affect the reliability and interpretation of the
regression coefficients.
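A minimal sketch of this calculation in Python is shown below; it assumes X is a pandas DataFrame of the independent variables with a constant column already added (via sm.add_constant). The detailed exercise at the end of this lesson uses the same function.
# A minimal sketch of VIF computation, assuming X is a pandas DataFrame
# of regressors (including a constant column).
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif["Variable"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)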

If high multicollinearity is detected (e.g., VIF greater than 5), some steps
can be taken to address it:
‹ Remove one or more of the highly correlated independent variables
from the model. Combine or transform the correlated variables into
a single variable.
‹ Obtain more data to reduce the correlation among the independent
variables.
By addressing multicollinearity, the stability and interpretability of the
regression model can be improved, allowing for more reliable inferences
about the relationships between the independent variables and the dependent
variable.
How to Detect Multicollinearity
To detect multicollinearity in your regression model, you can use several
methods:
Pairwise Correlation: Calculate the pairwise correlation coefficients between
each pair of explanatory variables. If the correlation coefficient is very
high (typically greater than 0.8), it indicates potential multicollinearity.
However, low pairwise correlations do not guarantee the absence of
multicollinearity.
Variance Inflation Factor (VIF) and Tolerance: VIF measures the extent
to which the variance of the estimated regression coefficient is increased
due to multicollinearity. High VIF values (greater than 10) suggest
multicollinearity. Tolerance, which is the reciprocal of VIF, measures
the proportion of variance in the predictor variable that is not explained
by other predictors. Low tolerance values (close to zero) indicate high
multicollinearity.
Insignificance of Individual Variables: If many of the explanatory
variables in the model are individually insignificant (i.e., their t-statistics
are statistically insignificant) despite a high R-squared value, it suggests
the presence of multicollinearity.
Auxiliary Regressions: Conduct auxiliary regressions where each
independent variable is regressed against the remaining independent
variables. Check the overall significance of these regressions using the

F-test. If any of the auxiliary regressions show significant F-values, it


indicates collinearity with other variables in the model.
How To Fix Multicollinearity
To address multicollinearity, you can consider the following approaches:
‹ Increase Sample Size: By collecting a larger sample, you can
potentially reduce the severity of multicollinearity. With a larger
sample, you can include individuals with different characteristics,
reducing the correlation between variables. Increasing the sample size
leads to more efficient estimators and mitigates the multicollinearity
problem.
‹ Drop Non-Essential Variables: If you have variables that are highly
correlated with each other, consider excluding non-essential variables
from the model. For example, if both father’s and mother’s education
are highly correlated, you can choose to include only one of them.
However, be cautious when dropping variables as it may result
in model misspecification if the excluded variable is theoretically
important.
Detecting and addressing multicollinearity is crucial for obtaining reliable
regression results. By understanding the signs of multicollinearity and
applying appropriate remedies, you can improve the accuracy and
interpretability of your regression model.

2.4.4 Outlier Analysis


Outliers can significantly influence the results of a regression model. We
will discuss techniques for identifying and handling outliers effectively,
enabling us to build more robust and reliable models.

2.4.5 Autocorrelation
Autocorrelation, also known as serial correlation, refers to the correlation
between observations in a time series data set or within a regression
model. It arises when there is a systematic relationship between the current
observation and one or more past observation. Autocorrelation occurs
when the residuals of a regression model exhibit a pattern, indicating a
potential violation of the model’s assumptions. We will cover methods

for detecting and addressing autocorrelation, ensuring the independence


of residuals and the validity of our model.
Consequences of Autocorrelation
I. OLS estimators are still unbiased and consistent.
II. They are still normally distributed in large samples.
III. But they are no longer efficient. That is, they are no longer BLUE (best linear unbiased). In
a case of autocorrelation, standard errors are UNDERESTIMATED.
This means that the t-values are OVERESTIMATED. Hence, it means
that variables that may not be statistically significant erroneously
appear to be statistically significant with high t-values.
IV. Hypothesis testing procedure is not reliable as standard errors are
erroneous, even with large samples. Therefore, the F and T tests
may not be valid.
Autocorrelation can be detected by following methods:
‹ Graphical Method
‹ Durbin Watson test
‹ Breusch-Godfrey test
1. Graphical Method
Autocorrelation can be detected using graphical methods. Here are a few
graphical techniques to identify autocorrelation:
Residual Plot: Plot the residuals of the regression model against the
corresponding time or observation index. If there is no autocorrelation, the
residuals should appear random and evenly scattered around zero. However,
if autocorrelation is present, you may observe patterns or clustering of
residuals above or below zero, indicating a systematic relationship.
Partial Autocorrelation Function (PACF) Plot: The PACF plot displays
the correlation between the residuals at different lags, while accounting
for the intermediate lags. In the absence of autocorrelation, the PACF
values should be close to zero for all lags beyond the first. If there is
significant autocorrelation, you may observe spikes or significant values
beyond the first lag.
Autocorrelation Function (ACF) Plot: The ACF plot shows the correla-
tion between the residuals at different lags, without accounting for the

intermediate lags. Similar to the PACF plot, significant values beyond
the first lag in the ACF plot indicate the presence of autocorrelation.

Figure 2.2
Autocorrelation and partial autocorrelation function (ACF and PACF)
plots, prior to differencing (A and B) and after differencing (C and D).
In both the PACF and ACF plots, significance can be determined by
comparing the correlation values against the confidence intervals. If the
correlation values fall outside the confidence intervals, it suggests the
presence of autocorrelation.
It’s important to note that these graphical methods provide indications of
autocorrelation, but further statistical tests, such as the Durbin-Watson
test or Ljung-Box test, should be conducted to confirm and quantify the
autocorrelation in the model.
2. Durbin Watson D Test
The Durbin-Watson test is a statistical test used to detect autocorrelation
in the residuals of a regression model. It is specifically designed for
detecting first-order autocorrelation, which is the correlation between
adjacent observations.
The Durbin-Watson test statistic is computed using the following formula:
d = Σ(ei − ei−1)² / Σei²

where:
‹ ei is the residual for observation i
‹ ei−1 is the residual for the previous observation (i − 1)
The test statistic is then compared to critical values to determine the
presence of autocorrelation. The critical values depend on the sample
size, the number of independent variables in the regression model, and
the desired level of significance.
The Durbin-Watson test statistic, denoted as d, ranges from 0 to 4. The
test statistic is calculated based on the residuals of the regression model
and is interpreted as follows:
A value of d close to 2 indicates no significant autocorrelation. It
suggests that the residuals are independent and do not exhibit a systematic
relationship.
A value of d less than 2 indicates positive autocorrelation. It suggests
that there is a positive relationship between adjacent residuals, meaning
that if one residual is high, the next one is likely to be high as well.
A value of d greater than 2 indicates negative autocorrelation. It suggests
that there is a negative relationship between adjacent residuals, meaning
that if one residual is high, the next one is likely to be low.
The closer it is to zero, the greater is the evidence of positive autocorrelation,
and the closer it is to 4, the greater is the evidence of negative autocorrelation.
If d is about 2, there is no evidence of positive or negative (first-) order
autocorrelation.
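A minimal Python sketch of how this statistic can be obtained, assuming residuals holds the residuals of a fitted regression model, is:
# A minimal sketch of the Durbin-Watson statistic, assuming `residuals`
# is an array of residuals from a fitted regression model.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

d_manual = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)  # formula above
d_library = durbin_watson(residuals)                                 # library version
print(d_manual, d_library)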
3. The Breusch-Godfrey Test
The Breusch-Godfrey test, also known as the LM test for autocorrelation,
is a statistical test used to detect autocorrelation in the residuals of a
regression model. Unlike the Durbin-Watson test, which is primarily
designed for detecting first-order autocorrelation, the Breusch-Godfrey
test can detect higher-order autocorrelation.
The test is based on the idea of regressing the residuals of the original
regression model on lagged values of the residuals. It tests whether the
lagged residuals are statistically significant in explaining the current
residuals, indicating the presence of autocorrelation.

The general steps for performing the Breusch-Godfrey test are as follows:
1. Estimate the initial regression model and obtain the residuals.
2. Extend the initial regression model by including lagged values of
the residuals as additional independent variables.
3. Estimate the extended regression model and obtain the residuals from
this model.
4. Perform a hypothesis test on whether the lagged residuals are jointly
significant in explaining the current residuals.
The test statistic for the Breusch-Godfrey test follows a chi-square
distribution and is calculated based on the Residual Sum of Squares
(RSS) from the extended regression model. The test statistic is compared
to the critical values from the chi-square distribution to determine the
presence of autocorrelation.
The interpretation of the Breusch-Godfrey test involves the following steps:
1. Set up the null hypothesis (H0): There is no autocorrelation in the
residuals (autocorrelation is absent).
2. Set up the alternative hypothesis (Ha): There is autocorrelation in
the residuals (autocorrelation is present).
3. Conduct the Breusch-Godfrey test and calculate the test statistic.
4. Compare the test statistic to the critical value(s) from the chi-square
distribution.
5. If the test statistic is greater than the critical value, reject the null
hypothesis and conclude that there is evidence of autocorrelation.
If the test statistic is less than the critical value, fail to reject the
null hypothesis and conclude that there is no significant evidence
of autocorrelation (a statsmodels-based sketch is shown below).
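In practice, statsmodels provides this test directly. A minimal sketch, assuming results is a fitted OLS results object and testing up to two lags (the lag order is illustrative):
# A minimal sketch of the Breusch-Godfrey test, assuming `results` is a
# fitted OLS results object; nlags sets the autocorrelation order tested.
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

lm_stat, lm_pvalue, f_stat, f_pvalue = acorr_breusch_godfrey(results, nlags=2)
print("LM statistic:", lm_stat, "p-value:", lm_pvalue)
# A p-value below 0.05 is evidence of autocorrelation up to the chosen lag order.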

2.4.6 Transformation of Variables


Transforming variables can enhance the fit and performance of a regression
model. We would explore techniques such as logarithmic and power
transformations in practical examples, which can help improve linearity,
normality, and homoscedasticity assumptions.
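For instance, a logarithmic transformation of a strictly positive, skewed dependent variable can be applied before fitting. A minimal sketch, assuming a DataFrame data with columns 'Y' and 'X' (illustrative names):
# A minimal sketch of a log transformation, assuming a DataFrame `data`
# with a strictly positive dependent variable 'Y' and a predictor 'X'.
import numpy as np
import statsmodels.api as sm

data['log_Y'] = np.log(data['Y'])              # log transform of Y
X = sm.add_constant(data[['X']])
log_model = sm.OLS(data['log_Y'], X).fit()     # regress log(Y) on X
print(log_model.summary())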

2.4.7 Variable Selection in Regression Model Building


Building an optimal regression model involves selecting the most relevant
independent variables. We will discuss various techniques for variable
selection, including stepwise regression and regularization methods like
Lasso and Ridge regression.
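A minimal scikit-learn sketch of the regularization approach is given below; X (feature matrix), y (response) and the alpha values are illustrative assumptions.
# A minimal sketch of Lasso and Ridge regression with scikit-learn, assuming
# X (features) and y (response); alpha controls the strength of the penalty.
from sklearn.linear_model import Lasso, Ridge

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty; can shrink coefficients to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty; shrinks coefficients towards zero
print("Lasso coefficients:", lasso.coef_)   # zero coefficients are effectively dropped
print("Ridge coefficients:", ridge.coef_)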

2.5 Practical Exercises Using R/Python Programming


To reinforce the concepts covered in this chapter, practical exercises using
R/Python programming are shown. These exercises will involve
implementing simple OLS regression using R or Python, interpreting the
results obtained, and conducting assumption tests such as checking for
multicollinearity, autocorrelation, and normality. Furthermore, regression
analysis with categorical/dummy/qualitative variables will be performed
to understand their impact on the dependent variable.
Exercise 1: Perform simple OLS regression on R/Python and interpret
the results obtained.
Sol. Certainly! Here’s an example of how you can perform a simple
Ordinary Least Squares (OLS) regression in both R and Python, along
with results interpretation.
Let’s assume you have a dataset with a dependent variable (Y) and an
independent variable (X). We will use this dataset to demonstrate the
OLS regression.
Using R:
# Load the necessary libraries
library(dplyr)
# Read the dataset
data <- read.csv("your_dataset.csv")
# Perform the OLS regression
model <- lm(Y ~ X, data = data)
# Print the summary of the regression results
summary(model)

Using Python (using the statsmodels library):


# Import the necessary libraries
import pandas as pd
import statsmodels.api as sm
# Read the dataset
data = pd.read_csv("your_dataset.csv")
# Perform the OLS regression
model = sm.OLS(data['Y'], sm.add_constant(data['X']))
# Fit the model
results = model.fit()
# Print the summary of the regression results
print(results.summary())
In both R and Python, we first load the necessary libraries (e.g., dplyr
in R and pandas and statsmodels in Python). Then, we read the dataset
containing the variables Y and X.
Next, we perform the OLS regression by specifying the formula in R (Y ~
X) and using the lm function. In Python, we create an OLS model object
using sm.OLS and provide the dependent variable (Y) and independent
variable (X) as arguments. We also add a constant term using sm.add_
constant to account for the intercept in the regression.
After fitting the model, we can print the summary of the regression results
using summary(model) in R and print(results.summary()) in Python.
The summary provides various statistical measures and information about
the regression model.
Interpreting the results:
Coefficients: The regression results will include the estimated coefficients
for the intercept and the independent variable. These coefficients represent
the average change in the dependent variable for a one-unit increase in
the independent variable. For example, if the coefficient for X is 0.5,
it suggests that, on average, Y increases by 0.5 units for every one-unit
increase in X.
p-values: The regression results also provide p-values for the coefficients.
These p-values indicate the statistical significance of the coefficients.

Generally, a p-value less than a significance level (e.g., 0.05) suggests
that the coefficient is statistically significant, implying a relationship
between the independent variable and the dependent variable.
R-squared: The R-squared value (R-squared or R2) measures the
proportion of the variance in the dependent variable that can be explained
by the independent variable(s). It ranges from 0 to 1, with higher values
indicating a better fit of the regression model to the data. R-squared can
be interpreted as the percentage of the dependent variable’s variation
explained by the independent variable(s).
Residuals: The regression results also include information about the
residuals, which are the differences between the observed values of the
dependent variable and the predicted values from the regression model.
Residuals should ideally follow a normal distribution with a mean of zero,
and their distribution can provide insights into the model’s goodness of
fit and potential violations of the regression assumptions.
It’s important to note that interpretation may vary depending on the specific
context and dataset. Therefore, it’s essential to consider the characteristics
of your data and the objectives of your analysis while interpreting the
results of an OLS regression.
Exercise 2: Test the assumptions of OLS (multicollinearity, autocorrelation,
normality etc.) on R/Python.
Sol. To test the assumptions of OLS, including multicollinearity,
autocorrelation, and normality, you can use various diagnostic tests in R
or Python. Here are the steps and some commonly used tests for each
assumption:
Multicollinearity:
Step 1: Calculate the pairwise correlation matrix between the independent
variables using the cor () function in R or the corrcoef () function in
Python (numpy).
Step 2: Calculate the Variance Inflation Factor (VIF) for each independent
variable using the vif () function from the “car” package in R or the
variance_inflation_factor() function from the "statsmodels" library in
Python. VIF values greater than 10 indicate high multicollinearity.

Step 3: Perform auxiliary regressions by regressing each independent


variable against the remaining independent variables to identify highly
collinear variables.
Autocorrelation:
Step 1: Plot the residuals against the predicted values (fitted values)
from the regression model. In R, you can use the plot () function with
the residuals () and fitted () functions. In Python, you can use the scatter
() function from matplotlib.
Step 2: Conduct the Durbin-Watson test using the dwtest () function
from the "lmtest" package in R or the durbin_watson() function from
the “statsmodels.stats.stattools” module in Python. A value close to 2
indicates no autocorrelation, while values significantly greater or smaller
than 2 suggest positive or negative autocorrelation, respectively.
Normality of Residuals:
Step 1: Plot a histogram or a kernel density plot of the residuals. In R,
you can use the hist () or density () functions. In Python, you can use
the histplot () or kdeplot () functions from the seaborn library.
Step 2: Perform a normality test such as the Shapiro-Wilk test using the
shapiro.test () function in R or the shapiro () function from the “scipy.
stats” module in Python. A p-value greater than 0.05 indicates that the
residuals are normally distributed.
It’s important to note that these tests provide diagnostic information, but
they may not be definitive. It’s also advisable to consider the context and
assumptions of the specific regression model being used.
Here is a random data set, generated in the code below, to perform the
regression in either R or Python.

This dataset consists of three columns: y represents the dependent variable,


and x1 and x2 are the independent variables. Each row corresponds to
an observation in the dataset.
We can use this dataset to run the provided code and perform diagnostic
tests on the OLS regression model.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
from scipy import stats
# Set random seed for reproducibility
np.random.seed(123)
# Generate random data
n = 100  # Number of observations
x1 = np.random.normal(0, 1, n)  # Independent variable 1
x2 = np.random.normal(0, 1, n)  # Independent variable 2
epsilon = np.random.normal(0, 1, n)  # Error term
# Generate dependent variable
y = 1 + 2 * x1 + 3 * x2 + epsilon
# Create a DataFrame
data = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2})
# Fit OLS regression model
X = sm.add_constant(data[['x1', 'x2']])  # Add constant term
model = sm.OLS(data['y'], X)
results = model.fit()
# Diagnostic tests
print("Multicollinearity:")
vif = pd.DataFrame()
vif["Variable"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
print("\nAutocorrelation:")
residuals = results.resid
fig, ax = plt.subplots()
ax.scatter(results.fittedvalues, residuals)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
plt.show()
print("Durbin-Watson test:")
dw_statistic = durbin_watson(residuals)
print(f"Durbin-Watson statistic: {dw_statistic}")
print("\nNormality of Residuals:")
sns.histplot(residuals, kde=True)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()
shapiro_test = stats.shapiro(residuals)
print(f"Shapiro-Wilk test p-value: {shapiro_test[1]}")
In this example, we generated a random dataset with two independent
variables (x1 and x2) and a dependent variable (y). We fit an OLS regression
model using the statsmodels library. Then, we perform diagnostic tests
for multicollinearity, autocorrelation, and normality of residuals.
The code calculates the VIF for each independent variable, plots the
residuals against the fitted values, performs the Durbin-Watson test for
autocorrelation, and plots a histogram of the residuals. Additionally, the
Shapiro-Wilk test is conducted to check the normality of residuals.
We can run this code in a Python environment to see the results and
interpretations for each diagnostic test based on the random dataset
provided.
3. Perform regression analysis with categorical/dummy/qualitative variables
on R/Python.
import pandas as pd
import statsmodels.api as sm
# Create a DataFrame with the data
data = {
‘y’: [3.3723, 5.5593, 8.1878, -2.4581, 3.8578, 5.4747, 6.4135, 8.1032,
5.56, 5.3514, 5.8457],
‘x1’: [-1.085631, 0.997345, 0.282978, -1.506295, -0.5786, 1.651437,
-2.426679, -0.428913, -0.86674, 0.742045, 2.312265],

'x2': [-0.076047, 0.352978, -2.242685, 1.487477, 1.058969, -0.37557,
-0.600516, 0.955434, -0.151318, -0.10322, 0.410598],
‘category’: [‘A’, ‘B’, ‘A’, ‘B’, ‘B’, ‘A’, ‘B’, ‘A’, ‘A’, ‘B’, ‘B’]
}
df = pd.DataFrame(data)
# Convert the categorical variable to dummy variables
df = pd.get_dummies(df, columns=['category'], drop_first=True)
# Define the dependent and independent variables
X = df[['x1', 'x2', 'category_B']]
y = df['y']
# Add a constant term to the independent variables
X = sm.add_constant(X)
# Fit the OLS model
model = sm.OLS(y, X).fit()
# Print the summary of the regression results
print(model.summary())
In this example, we have created a DataFrame df with the y, x1, x2, and
category variables. The category variable is converted into dummy variables
using the get_dummies function, and the category_A column is dropped
to avoid multicollinearity. We then define the dependent variable y and
the independent variables X, including the dummy variable category_B. A
constant term is added to the independent variables using sm.add_constant.
Finally, we fit the OLS model using sm.OLS and print the summary of
the regression results using model.summary(). The regression analysis
provides the estimated coefficients, standard errors, t-statistics, and p-values
for each independent variable, including the dummy variable category_B.

IN-TEXT QUESTIONS
1. What is the primary goal of linear regression?
(a) Classification
(b) Prediction
(c) Clustering
(d) Dimensionality reduction

2. In simple linear regression, how many variables are involved?


(a) One independent variable
(b) Two independent variables
(c) One dependent variable
(d) None of the above
3. What is the equation of a simple linear regression line?
(a) Y = aX + b
(b) Y = aX² + bX + c
(c) Y = aX² + b
(d) Y = aX
4. What is the term for the vertical distance between an observed
data point and the regression line?
(a) Slope
(b) Intercept
(c) Residual
(d) Coefficient
5. What does the coefficient of determination (R-squared) measure
in linear regression?
(a) The strength of the linear relationship
(b) The number of data points
(c) The standard error of the regression
(d) The intercept of the regression line
6. In multiple linear regression, how many independent variables
can be included in the model?
(a) Only one
(b) As many as desired
(c) Two
(d) Three

7. What is the purpose of the least squares method in linear


regression?
(a) To maximize the coefficient of determination
(b) To minimize the sum of squared residuals
(c) To minimize the number of independent variables
(d) To maximize the number of data points
8. What does a p-value in linear regression indicate?
(a) The strength of the relationship between variables
(b) The number of data points
(c) The significance of an independent variable
(d) The slope of the regression line
9. When should you use logistic regression instead of linear
regression?
(a) When the relationship between variables is not linear
(b) When there are multiple dependent variables
(c) When dealing with categorical outcomes
(d) When there is no correlation between variables
10. What problem can multicollinearity cause in multiple linear
regression?
(a) Overfitting
(b) Underfitting
(c) Instability in coefficient estimates
(d) Lack of a regression line

2.6 Summary
This chapter discusses a comprehensive understanding of predictive
analytics techniques, with a specific focus on simple linear regression
and multiple linear regression. It provides the knowledge and practical
skills necessary to apply these techniques using R or Python, enabling
one to make informed predictions and interpretations in the context of
the regression analysis.

2.7 Answers to In-Text Questions

1. (b) Prediction
2. (a) One independent variable
3. (a) Y = aX + b
4. (c) Residual
5. (a) The strength of the linear relationship
6. (b) As many as desired
7. (b) To minimize the sum of squared residuals
8. (c) The significance of an independent variable
9. (c) When dealing with categorical outcomes
10. (c) Instability in coefficient estimates

2.8 Self-Assessment Questions

1. What is the purpose of residual analysis in regression?


2. How do you interpret the p-value in regression analysis?
3. What is the purpose of stepwise regression?
4. What is the difference between simple linear regression and multiple
linear regression?
5. What is the purpose of interaction terms in multiple linear regression?
6. How can you assess the goodness of fit in regression analysis?
7. What is the main objective of simple linear regression?
8. What are the key assumptions of multiple linear regression?
9. What is the interpretation of the coefficient of determination
(R-squared)?
10. How is multicollinearity detected in multiple linear regression?

2.9 Reference
‹ Business Analytics: The Science of Data Driven Decision Making,
First Edition (2017), U Dinesh Kumar, Wiley, India.

2.10 Suggested Readings

‹ Introduction to Machine Learning with Python, Andreas C. Mueller


and Sarah Guido, O’Reilly Media, Inc.
‹ Data Mining for Business Analytics – Concepts, Techniques, and
Applications in Python. Galit Shmueli, Peter C. Bruce, Peter Gedeck,
and Nitin R. Patel. Wiley.

L E S S O N

3
Logistic and Multinomial
Regression
Anurag Goel
Assistant Professor, CSE Deptt.
Delhi Technological University
Email-Id: anurag@dtu.ac.in

STRUCTURE
3.1 Learning Objectives
3.2 Introduction
3.3 Logistic Function
3.4 Omnibus Test
3.5 Wald Test
3.6 Hosmer Lemeshow Test
3.7 Pseudo R Square
3.8 &ODVVL¿FDWLRQ 7DEOH
3.9 *LQL &RHI¿FLHQW
3.10 ROC
3.11 AUC
3.12 Summary
3.13 Answers to In-Text Questions
3.14 Self-Assessment Questions
3.15 References
3.16 Suggested Readings

56 PAGE
© Department of Distance & Continuing Education, Campus of Open Learning,
School of Open Learning, University of Delhi
INTRODUCTION TO BUSINESS ANALYTICS

3.1 Learning Objectives Notes

 ‹Familiarize yourself with the concepts of logistic regression and multinomial logistic


regression.
 ‹Understand the various evaluation metrics to evaluate the logistic
regression model.
 ‹Analyse the scenario where the logistic regression model is relevant.
 ‹Apply logistic regression model for nominal and ordinal outcomes.

3.2 Introduction
In machine learning, we often are required to determine if a particular
variable belongs to a given class. In such cases, one can use logistic
regression. Logistic Regression, a popular supervised learning technique,
is commonly employed when the desired outcome is a categorical variable
such as binary decisions (e.g., 0 or 1, yes or no, true or false). It finds
extensive applications in various domains, including fake news detection
and cancerous cell identification.
Some examples of logistic regression applications are as follows:
 ‹To detect whether a given news is fake or not.
 ‹To detect whether a given cell is Cancerous cell or not.
In essence, logistic regression can be understood as the probability of
belonging to a class given a particular input variable. Since it’s probabilistic
in nature, the logistic regression output values lie in the range of 0 and 1.
Generally, when we think about regression from a strictly statistical
perspective, the output value is generally not restricted to a particular
interval. Thus, to achieve this in logistic regression, we utilise logistic
function. An intuitive example to see the use of logistic function can be
to understand logistic regression as any simple regression value model,
on top of whose output value, we have applied a logistic function so that
the final output becomes restricted in the above defined range.
Generally, logistic regression results work well when the output is of binary
type, that is, it either belongs to a specific category or it does not. This,
however, is not always the case in real-life problem statements. We may

encounter a lot of scenarios where we have a dependent variable having


multiple classes or categories. In such cases, Multinomial Regression
emerges as a valuable extension of logistic regression, specifically designed
to handle multiclass problems. Multinomial Regression is the generalization
of logistic regression to multiclass problems. For example, based on the
results of some analysis, predicting the engineering branch students will
choose for their graduation is a multinomial regression problem since the
output categories of engineering branches are multiple. In this multinomial
regression problem, the engineering branch will be the dependent variable
predicted by the multinomial regression model while the independent
variables are student’s marks in XII board examination, student’s score
in engineering entrance exam, student’s interest areas/courses etc. These
independent variables are used by the multinomial regression model to
predict the outcome i.e. engineering branch the student may opt for.
To better understand the application of multinomial regression, consider
the example of predicting a person’s blood group based on the results
of various diagnostic tests. Unlike binary classification problems that
involve two categories, blood group prediction involves multiple possible
outcomes. In this case, the output categories are the different blood
groups, and predicting the correct blood group for an individual becomes
a multinomial regression problem. The multinomial regression model
aims to estimate the probabilities associated with each class or category,
allowing us to assign an input sample to the most likely category.
Now, let us understand this better by doing a simple walkthrough of
how a multinomial logistic regression model might work on the above
example. For simplicity, let us assume we have a well-balanced, cleaned,
pre-processed and labelled dataset available with us which has an input
variable (or feature) and a corresponding output blood group. During
training, our multinomial logistic regression model will try to learn the
underlying patterns and relationships between the input features and the
corresponding class labels (from training data). Once trained, the model
can utilise these learned patterns and relationships on new (or novel)
input variable to assign a probability of the input variable to belonging
to each output class using the logistic function. Model can then simply
select the class which has the highest probability as the predicted output
of our overall model.
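A minimal scikit-learn sketch of this idea is shown below; the feature matrix X, the class labels y and the single new observation x_new are illustrative assumptions.
# A minimal sketch of multinomial logistic regression with scikit-learn, assuming
# X (features), y (multiclass labels such as blood groups) and x_new (one new case,
# shaped as a 2-D array with a single row).
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)   # for multiclass labels, a multinomial (softmax) model is fitted by default
clf.fit(X, y)                             # learn patterns from the labelled training data
probs = clf.predict_proba(x_new)          # probability of each class for the new input
prediction = clf.predict(x_new)           # class with the highest probability
print(probs, prediction)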

Thus, multinomial regression serves as a powerful extension of logistic


regression, enabling the handling of multiclass classification problems.
By estimating the probabilities associated with each class using the
logistic function, it provides a practical and effective approach for
assigning input samples to their most likely categories. Applications of
multinomial regression encompass a wide range of domains, including
medical diagnosis, sentiment analysis, and object recognition, where
classification tasks involve more than two possible outcomes.

3.3 Logistic Function

3.3.1 Logistic Function (Sigmoid Function)


The sigmoid function is represented as follows:
f(x) = 1 / (1 + e^(−x))
It is a mathematical function that assigns values between 0 and 1 based
on the input variable. It is characterized by its S-shaped curve and is
commonly used in statistics,
machine learning, and neural networks to model non-linear relationships
and provide probabilistic interpretations.

3.3.2 Estimation of Probability Using Logistic Function


The logistic function is often used for estimating probabilities in various
fields. By applying the logistic function to a linear combination of input
variables, such as in logistic regression, it transforms the output into a
probability value between 0 and 1. This allows for the prediction and
classification of events based on their likelihoods.
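A minimal Python sketch of this probability estimate is given below; the coefficients b0, b1 and the input value x are illustrative, not taken from any model in this lesson.
# A minimal sketch of converting a linear score into a probability with the
# logistic (sigmoid) function; b0, b1 and x are illustrative values.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # output always lies between 0 and 1

b0, b1, x = -1.5, 0.8, 2.0
p = sigmoid(b0 + b1 * x)              # estimated probability of belonging to class 1
print(p)                              # approximately 0.525 for these values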

3.4 Omnibus Test


Omnibus test is a statistical test used to test the significance of several
model parameters at once. It examines whether the combined effect of
the predictors is statistically significant.

The Omnibus statistic is calculated by examining the difference in deviance


between the full model (with predictors) and the reduced model (without
predictors) to derive its formula:
Omnibus = (Dr − Df) / Dr
where Dr represents the deviance of the reduced model (without predictors)
and Df represents the deviance of the full model (with predictors).
The Omnibus test statistic approximately follows chi-square distribution
with degrees of freedom given by the difference in the number of predictors
between the full and reduced models. By comparing the test statistic to
the chi-square distribution and calculating the associated p-value, we can
calculate the collective statistical significance of the predictor variables.
When the calculated p-value is lower than a predefined significance level
(e.g., 0.05), we reject the null hypothesis, indicating that the group of
predictor variables collectively has a statistically significant influence on
the dependent variable. On the other hand, if the p-value exceeds the
significance level, we fail to reject the null hypothesis, suggesting that
the predictors may not have a significant collective effect.
The Omnibus test provides a comprehensive assessment of the overall
significance of the predictor variables within a regression model, aiding in
the understanding of how these predictors jointly contribute to explaining
the variation in the dependent variable.
Let’s consider an example where we have a regression model with three
predictor variables (X1, X2, X3) and a continuous dependent variable
(Y). We want to assess the overall significance of these predictors using
the Omnibus test.
Here is a sample dataset with the predictor variables and the dependent
variable:
X1 X2 X3 Y
2.5 6 8 10.2
3.2 4 7 12.1
1.8 5 6 9.5
2.9 7 9 11.3
3.5 5 8 13.2

2.1 6 7 10.8
2.7 7 6 9.7
3.9 4 9 12.9
2.4 5 8 10.1
2.8 6 7 11.5

Step 1: Fit the Full Model


We start by fitting the full regression model that includes all three
predictor variables:
Y = β0 + β1*X1 + β2*X2 + β3*X3
By using statistical software, we obtain the estimated coefficients and
the deviance of the full model:
β0 = …, β1 = …, β2 = …, β3 = 0.812
Deviance_full = 5.274
Step 2: Fit the Reduced Model
Next, we fit the reduced model, which only includes the intercept term:
Y = β0
Similarly, we obtain the deviance of the reduced model:
Deviance_reduced = 15.924
Step 3: Calculate the Omnibus Test Statistic
Using the deviance values obtained from the full and reduced models,
we can calculate the Omnibus test statistic:
Omnibus = (Deviance_reduced - Deviance_full)/Deviance_reduced
= (15.924 - 5.274) / 15.924
= 0.668
Step 4: Conduct the Hypothesis Test
To assess the statistical significance of the predictors, we compare the
Omnibus test statistic to the chi-square distribution with degrees of
freedom equal to the difference in the number of predictors between the
full and reduced models. In this case, the difference is 3 (since we have
3 predictor variables).

By referring to the chi-square distribution table or using statistical software,


we determine the p-value associated with the Omnibus test statistic. Let’s
assume the p-value is 0.022.
Step 5: Interpret the Results
Since the p-value (0.022) is smaller than the predetermined significance
level (e.g., 0.05), we reject the null hypothesis. This indicates that the
set of predictor variables (X1, X2, X3) collectively has a statistically
significant impact on the dependent variable (Y). In other words, the
predictors significantly contribute to explaining the variation in Y.
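A minimal Python sketch of these calculations, using the deviance values above, is given below. It reports the ratio defined in this section and also the deviance difference Dr − Df, which is the quantity conventionally compared with the chi-square distribution (the likelihood-ratio form), so its p-value may differ slightly from the value assumed above.
# A minimal sketch using the deviance values from the example above.
from scipy.stats import chi2

deviance_reduced = 15.924   # Dr: model with intercept only
deviance_full = 5.274       # Df: model with X1, X2, X3
omnibus_ratio = (deviance_reduced - deviance_full) / deviance_reduced   # 0.668
lr_statistic = deviance_reduced - deviance_full    # deviance difference, 10.65
p_value = chi2.sf(lr_statistic, df=3)              # 3 added predictors; about 0.014
print(omnibus_ratio, lr_statistic, p_value)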

3.5 Wald Test


The Wald test is a statistical test utilized to assess the significance of
individual predictor variables in a regression model. It examines whether the
estimated coefficient for a specific predictor differs significantly
from zero, indicating its importance in predicting the dependent variable.
The formula for the Wald test statistic is as follows:
W = (β − β0)² / Var(β)
where β is the estimated coefficient for the predictor variable of interest,
β0 is the hypothesized value of the coefficient under the null hypothesis
(typically 0 for testing if the coefficient is zero), and Var(β) is the estimated
variance of the coefficient.
The Wald test statistic is compared to the chi-square distribution, where the
degrees of freedom are set to 1 (since we are testing a single parameter)
to obtain the associated p-value. Rejecting the null hypothesis occurs
when the calculated p-value falls below a predetermined significance
level (e.g., 0.05), indicating that the predictor variable has a statistically
significant impact on the dependent variable.
The Wald test allows us to determine the individual significance of
predictor variables by testing whether their coefficients significantly deviate
from zero. It is a valuable tool for identifying which variables have a
meaningful impact on the outcome of interest in a regression model.
Let’s consider an example where we have a logistic regression model with
two predictor variables (X1 and X2) and a binary outcome variable (Y).

We want to assess the significance of the coefficient for each predictor
using the Wald test.
Here is a sample dataset with the predictor variables and the binary
outcome variable:
X1 X2 Y
2.5 6 0
3.2 4 1
1.8 5 0
2.9 7 1
3.5 5 1
2.1 6 0
2.7 7 1
3.9 4 0
2.4 5 0
2.8 6 1
Step 1: Fit the Logistic Regression Model
We start by fitting the logistic regression model with the predictor
variables X1 and X2:
logit(p) = β0 + β1 × X1 + β2 × X2
By using statistical software, we obtain the estimated coefficients and
their standard errors:
β0 = …, β1 = 0.921, β2 = 0.372
SE(β0) = …, SE(β1) = 0.512, SE(β2) = 0.295
Step 2: Calculate the Wald Test Statistic
Next, we calculate the Wald test statistic for each predictor variable
using the formula:
W = (β − β0)² / Var(β)
For X1:
W1 = (0.921 − 0)²/(0.512)² ≈ 3.24
For X2:
W2 = (0.372 − 0)²/(0.295)² ≈ 1.59

Step 3: Conduct the Hypothesis Test


To assess the statistical significance of each predictor, we compare the
Wald test statistic for each variable to the chi-square distribution with 1
degree of freedom (since we are testing a single parameter).
By referring to the chi-square distribution table or using statistical software,
we determine the p-value associated with each Wald test statistic. The
p-value for X1 is approximately 0.072 and the p-value for X2 is approximately 0.207.
Step 4: Interpret the Results
For X1, since the p-value (0.072) is larger than the predetermined
significance level (e.g., 0.05), we fail to reject the null hypothesis.
This suggests that the coefficient for X1 is not statistically significantly
different from zero, indicating that X1 may not have a significant effect
on the binary outcome variable Y.
Similarly, for X2, since the p-value (0.207) is larger than the significance
level, we fail to reject the null hypothesis. This suggests that the coefficient
for X2 is not statistically significantly different from zero, indicating that
X2 may not have a significant effect on the binary outcome variable Y.
In summary, based on the Wald tests, we do not have sufficient evidence
to conclude that either X1 or X2 has a significant impact on the binary
outcome variable in the logistic regression model.
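A minimal Python sketch of the Wald calculations for this example, using the estimates and standard errors from Step 1, is:
# A minimal sketch of the Wald test for the example above; the chi-square
# distribution with 1 degree of freedom gives the p-values.
from scipy.stats import chi2

coefs = {"X1": (0.921, 0.512), "X2": (0.372, 0.295)}   # (estimate, standard error)
for name, (beta, se) in coefs.items():
    w = (beta - 0) ** 2 / se ** 2        # Wald statistic
    p = chi2.sf(w, df=1)                 # p-value from chi-square with 1 df
    print(name, round(w, 2), round(p, 3))
# Both p-values exceed 0.05, so neither coefficient is statistically significant.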

IN-TEXT QUESTIONS
1. What does the Wald test statistic compare to obtain the associated
p-value?
(a) The F-distribution
(b) The t-distribution
(c) The normal distribution
(d) The chi-square distribution
2. What does the Omnibus test assess in a regression model?
(a) The individual significance of predictor variables
(b) The collinearity between predictor variables
(c) The overall significance of predictor variables collectively
(d) The goodness-of-fit of the regression model

3.6 Hosmer Lemeshow Test

The Hosmer-Lemeshow test is a statistical test used to evaluate the


goodness-of-fit of a logistic regression model. It assesses how well the
predicted probabilities from the model align with the observed outcomes.
The Hosmer-Lemeshow test is based on dividing the observations into
groups or “bins” based on the predicted probabilities of the logistic
regression model. The formula for the Hosmer-Lemeshow test statistic
is as follows:

H = Σij (Oij − Eij)² / (Eij (1 − Eij))

Where Oij is the observed number of outcomes (events or non-events)


in the ith bin and jth outcome category, Eij is the expected number of
outcomes (events or non-events) in the ith bin and jth outcome category,
calculated as the sum of predicted probabilities in the ith bin for the jth
outcome category.
The test statistic H follows an approximate chi-square distribution with
degrees of freedom equal to the number of bins minus the number of
model parameters. A smaller p-value obtained by comparing the test
statistic to the chi-square distribution suggests a poorer fit of the model
to the data, indicating a lack of goodness-of-fit.
By conducting the Hosmer-Lemeshow test, we can determine whether
the logistic regression model adequately fits the observed data. A non-
significant result (p > 0.05) indicates that the model fits well, suggesting
that the predicted probabilities align closely with the observed outcomes.
Conversely, a significant result (p < 0.05) suggests a lack of fit, indicating
that the model may not accurately represent the data.
The Hosmer-Lemeshow test is a valuable tool in assessing the goodness-
of-fit of logistic regression models, allowing us to evaluate the model’s
performance in predicting outcomes based on observed and predicted
probabilities.
Let’s consider the example again with the logistic regression model
predicting the probability of a disease (Y) based on a single predictor

variable (X). We will divide the predicted probabilities into three bins
and calculate the observed and expected frequencies in each bin.
Y    X    Predicted Probability
0 2.5 0.25
1 3.2 0.40
0 1.8 0.15
1 2.9 0.35
1 3.5 0.45
0 2.1 0.20
1 2.7 0.30
0 3.9 0.60
0 2.4 0.18
1 2.8 0.28
Step 1: Fit the Logistic Regression Model
By fitting the logistic regression model, we obtain the predicted probabilities
for each observation based on the predictor variable X.
Step 2: Divide the Predicted Probabilities into Bins
Let's divide the predicted probabilities into three bins: [0.1-0.3], (0.3-0.5], and (0.5-0.7].
Step 3: Calculate Observed and Expected Frequencies in Each Bin
Now, we calculate the observed and expected frequencies in each bin.
Bin: [0.1-0.3]
Total cases in bin: 6 (predicted probabilities 0.25, 0.15, 0.20, 0.30, 0.18, 0.28)
Observed cases (Y = 1): 2
Expected cases: 0.25 + 0.15 + 0.20 + 0.30 + 0.18 + 0.28 = 1.36
Bin: (0.3-0.5]
Total cases in bin: 3 (predicted probabilities 0.40, 0.35, 0.45)
Observed cases (Y = 1): 3
Expected cases: 0.40 + 0.35 + 0.45 = 1.20
Bin: (0.5-0.7]
Total cases in bin: 1 (predicted probability 0.60)

Observed cases (Y = 1): 0
Expected cases: 0.60
Step 4: Calculate the Hosmer-Lemeshow Test Statistic
We calculate the Hosmer-Lemeshow test statistic by summing the
contributions from each bin:
HL = ((O1 - E1)² / E1) + ((O2 – E2)² / E2) + ((O3 – E3)² / E3)
HL = ((2 - 1.36)² / 1.36) + ((3 - 1.20)² / 1.20) + ((0 - 0.60)² / 0.60)
= 0.301 + 2.700 + 0.600
= 3.601
Step 5: Conduct the Hypothesis Test
We compare the Hosmer-Lemeshow test statistic (HL) to the chi-square
distribution with 1 degree of freedom (number of bins - 2).
By referring to the chi-square distribution table or using statistical
software, let’s assume that the critical value for a significance level of
0.05 is 3.841.
Since the calculated test statistic (3.601) is less than the critical value
(3.841), we fail to reject the null hypothesis. This suggests that the logistic
regression model fits the data well.
Step 6: Interpret the Results
Based on the Hosmer-Lemeshow test, there is no evidence to suggest lack
of fit for the logistic regression model. The calculated test statistic (3.601)
is below the critical value, indicating good fit between the observed and
expected frequencies in the different bins.
In summary, the Hosmer-Lemeshow test assesses the goodness of fit
of a logistic regression model by comparing the observed and expected
frequencies in different bins of predicted probabilities. In this example,
the test result indicates that the model fits the data well.
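The bin-by-bin calculation above can be reproduced with a short Python sketch (assuming numpy is available). Note that it mirrors the simplified, event-count version of the statistic used in this example rather than the full two-outcome-category formula.

import numpy as np

y = np.array([0, 1, 0, 1, 1, 0, 1, 0, 0, 1])            # observed outcomes
p = np.array([0.25, 0.40, 0.15, 0.35, 0.45,
              0.20, 0.30, 0.60, 0.18, 0.28])            # predicted probabilities

# Three bins on the predicted probabilities: <= 0.3, (0.3, 0.5], > 0.5
bins = [p <= 0.3, (p > 0.3) & (p <= 0.5), p > 0.5]

hl = 0.0
for mask in bins:
    observed = y[mask].sum()     # observed events (Y = 1) in the bin
    expected = p[mask].sum()     # expected events = sum of predicted probabilities
    hl += (observed - expected) ** 2 / expected

print(round(hl, 3))   # about 3.60, below the 5% critical value 3.841 (chi-square, 1 df)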

3.7 Pseudo R Square


Pseudo R-square is a measure used in regression analysis, particularly in
logistic regression, to assess the proportion of variance in the dependent
variable explained by the predictor variables. It is called “pseudo” because
it is not directly comparable to the R-squared used in linear regression.
There are various methods to calculate Pseudo R-squared, and one
commonly used method is Nagelkerke’s R-squared. The formula for
Nagelkerke’s R-squared is as follows:
R² = (Lmodel − Lnull) / (Lmax − Lnull)
where Lmodel is the log-likelihood of the full model, Lnull is the log-
likelihood of the null model (a model with only an intercept term)
and Lmax is the log-likelihood of a model with perfect prediction (a
hypothetical model that perfectly predicts all outcomes).
Nagelkerke’s R-squared ranges from 0 to 1, with 0 indicating that the
predictors have no explanatory power, and 1 suggesting a perfect fit of
the model. However, it is important to note that Nagelkerke’s R-squared
is an adjusted measure and should not be interpreted in the same way
as R-squared in linear regression.
Pseudo R-squared provides an indication of how well the predictor variables
explain the variance in the dependent variable in logistic regression. While
it does not have a direct interpretation as the proportion of variance
explained, it serves as a relative measure to compare the goodness-of-fit
of different models or assess the improvement of a model compared to
a null model.
One commonly used pseudo R-squared measure is the Cox and Snell
R-squared. Let’s calculate the Cox and Snell R-squared using the given
example of a logistic regression model with two predictor variables.
X1 X2 Y
2.5 6 0
3.2 4 1
1.8 5 0
2.9 7 1
3.5 5 1
2.1 6 0
2.7 7 1
3.9 4 0
2.4 5 0
2.8 6 1

Step 1: Fit the Logistic Regression Model


By fitting the logistic regression model using the predictor variables X1
and X2, we obtain the estimated coefficients for each predictor.
Step 2: Calculate the Null Log-Likelihood (LL0)
To calculate the null log-likelihood, we fit a null model with only an
intercept term. Let’s assume that the null log-likelihood (LL0) is -48.218.
Step 3: Calculate the Full Log-Likelihood (LLF)
The full log-likelihood represents the maximum value of the log-likelihood
for the fitted logistic regression model. Let’s assume that the full log-
likelihood (LLF) is -31.384.
Step 4: Calculate the Cox and Snell R-Squared
The Cox and Snell R-squared is defined in terms of the likelihoods of the null and fitted models; written with log-likelihoods, the formula is R²_CS = 1 - exp[(2/n) × (LL0 - LLF)], and we can use it to calculate the Cox and Snell R-squared.
Given:
LL0 = -48.218
LLF = -31.384
n = 10 (number of observations)
R²_CS = 1 - exp[(2/10) × (-48.218 - (-31.384))]
= 1 - exp(-3.367)
= 1 - 0.034
= 0.966
Step 5: Interpret the Results
The calculated Cox and Snell R-squared is approximately 0.966. Since a pseudo R-squared is not a literal proportion of explained variance, this value should be read as indicating a substantial improvement of the fitted model over the null (intercept-only) model.
In summary, based on the calculations, the Cox and Snell R-squared for the logistic regression model with X1 and X2 as predictors is approximately 0.966, suggesting that the fitted model represents these (illustrative) data far better than the null model.
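The same calculation is easy to script. The log-likelihood values below are the illustrative figures assumed above; in practice they would be read from the fitted model (statsmodels, for example, exposes them as the llf and llnull attributes of a fitted Logit result).

from math import exp

ll_null = -48.218    # log-likelihood of the intercept-only model (assumed value)
ll_full = -31.384    # log-likelihood of the fitted model (assumed value)
n = 10               # number of observations

# Cox and Snell pseudo R-squared: 1 - exp((2/n) * (LL0 - LLF))
r2_cox_snell = 1 - exp((2 / n) * (ll_null - ll_full))

# Nagelkerke rescales it so that the maximum attainable value is 1
r2_max = 1 - exp((2 / n) * ll_null)
r2_nagelkerke = r2_cox_snell / r2_max

print(round(r2_cox_snell, 3), round(r2_nagelkerke, 3))   # both roughly 0.966 here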

3.8 Classification Table


To understand the classification table, let’s consider a binary classification
problem of detecting whether an input cell is cancerous or not. Consider
a logistic regression model X implemented for the given classification

problem on a dataset of 100 random cells, in which 10 cells are cancerous and 90 cells are non-cancerous. Suppose the model X labels 20 input cells as cancerous and the remaining 80 as non-cancerous. Out of the
total predicted cancerous cells, only 5 are actually cancerous as per the ground truth, while the remaining 15 are non-cancerous. On the
other hand, out of the total predicted non-cancerous cells, 75 cells are
also non-cancerous cells in the ground truth but 5 cells are cancerous.
Here, cancerous cell is considered as positive class while non-cancerous
cell is considered as negative class for the given classification problem.
Now, we define the four primary building blocks of the various evaluation
metrics of classification models as follows:
True Positive (TP): The number of input cells for which the classification model X correctly predicts that they are cancerous is referred to as True Positive. For model X, TP = 5.
True Negative (TN): The number of input cells for which the classification model X correctly predicts that they are non-cancerous is referred to as True Negative. For model X, TN = 75.
False Positive (FP): The number of input cells for which the classification model X incorrectly predicts that they are cancerous is referred to as False Positive. For model X, FP = 15.
False Negative (FN): The number of input cells for which the classification model X incorrectly predicts that they are non-cancerous is referred to as False Negative. For model X, FN = 5.
Actual
Cancerous Non-Cancerous
Predicted Cancerous TP = 5 FP = 15
Non-Cancerous FN = 5 TN = 75
Figure 3.1: Classification Matrix

3.8.1 Sensitivity
Sensitivity, also referred to as True Positive Rate or Recall, is calculated
as the ratio of correctly predicted cancerous cells to the total number of
cancerous cells in the ground truth. To compute sensitivity, you can use
the following formula:

Sensitivity = TP / (TP + FN)

3.8.2 Specificity
Specificity is defined as the ratio of number of input cells that are correctly
predicted as non-cancerous to the total number of non-cancerous cells
in the ground truth. Specificity is also known as True Negative Rate. To
compute specificity, we can use the following formula:
Specificity = TN / (TN + FP)

3.8.3 Accuracy
Accuracy is calculated as the ratio of correctly classified cells to the total
number of cells. To compute accuracy, you can use the following formula:
Accuracy = (TP + TN) / (TP + FP + TN + FN)

3.8.4 Precision
Precision is calculated as the ratio of the correctly predicted cancerous
cells to the total number of cells predicted as cancerous by the model.
To compute precision, you can use the following formula:
Precision = TP / (TP + FP)

3.8.5 F score
The F1-score is calculated as the harmonic mean of Precision and Recall.
To compute the F1-score, you can follow the following formula:
F1-score = (2 × Precision × Recall) / (Precision + Recall)
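All of these measures can be checked with a few lines of Python using the confusion-matrix counts of model X (TP = 5, FP = 15, TN = 75, FN = 5).

TP, FP, TN, FN = 5, 15, 75, 5                    # counts from Figure 3.1

sensitivity = TP / (TP + FN)                     # 5 / 10  = 0.50 (recall)
specificity = TN / (TN + FP)                     # 75 / 90 = 0.833
accuracy = (TP + TN) / (TP + FP + TN + FN)       # 80 / 100 = 0.80
precision = TP / (TP + FP)                       # 5 / 20  = 0.25
f1_score = 2 * precision * sensitivity / (precision + sensitivity)   # = 0.333

print(sensitivity, specificity, accuracy, precision, f1_score)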

IN-TEXT QUESTIONS


3. For the model X results on the given dataset of 100 cells, the
precision of model is:
(a) 0 (b) 0.25
(c) 0.5 (d) 1
4. For the model X results on the given dataset of 100 cells, the
recall of model is:
(a) 0 (b) 0.25
(c) 0.5 (d) 1

3.9 Gini Coefficient


The Gini coefficient, also referred to as the Gini index, is a metric originally used to assess inequality; in classification it summarizes a model's discriminatory power. The Gini coefficient takes a value between 0 and 1, and the performance of the model improves with increasing Gini coefficient values. The Gini coefficient can be computed from the AUC of the ROC curve
using the formula:
Gini Coefficient = 2 × AUC - 1

3.10 ROC
The Receiver Operating Characteristic (ROC) curve is a graphical representation used to assess the performance of a binary classification model, particularly in logistic regression and machine learning techniques. It illustrates the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity) at various classification thresholds.


Plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds produces the ROC curve. The formulas for TPR and FPR are as follows:
True Positive Rate (TPR) = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)
We may evaluate the model’s capacity to distinguish between positive
and negative examples at various classification levels using the ROC
curve. With a TPR of 1 and an FPR of 0, a perfect classifier would have
a ROC curve that reaches the top left corner of the plot. The closer the ROC curve is to the top left corner (and the farther it is from the diagonal of a random classifier), the greater the model's discriminatory power.

3.11 AUC
When employing a Receiver Operating Characteristic (ROC) curve, the
Area Under the Curve (AUC) is a statistic used to assess the effectiveness
of a binary classification model. The likelihood that a randomly selected
positive instance will have a greater predicted probability than a randomly selected negative instance is represented by the AUC.
The AUC is calculated by integrating the ROC curve. However, it is
important to note that the AUC does not have a specific formula since
it involves calculating the area under a curve. Instead, it is commonly
calculated using numerical methods or software.
The AUC value ranges between 0 and 1. A model with an AUC of 0.5
indicates a random classifier, where the model’s predictive power is no

better than chance. An AUC value that is nearer 1 indicates a classifier
that is more accurate and is better able to distinguish between positive
and negative situations. Conversely, an AUC value closer to 0 suggests
poor performance, with the model performing worse than random guessing.
In binary classification tasks, the AUC is a commonly utilized statistic
since it offers a succinct assessment of the model’s performance at
different classification thresholds. It is especially useful when the dataset is imbalanced, i.e., when the numbers of positive and negative instances differ significantly.
In conclusion, the AUC measure evaluates a binary classification model’s
total discriminatory power by delivering a single value that encapsulates the
model’s capacity to rank cases properly. Better classification performance
is shown by higher AUC values, whilst worse performance is indicated
by lower values.
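As a small illustration, the sketch below (assuming scikit-learn is installed) computes the ROC curve points and the AUC for the ten observations and predicted probabilities used in the Hosmer-Lemeshow example of Section 3.6, and then derives the Gini coefficient from the AUC.

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
y_score = [0.25, 0.40, 0.15, 0.35, 0.45, 0.20, 0.30, 0.60, 0.18, 0.28]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
auc = roc_auc_score(y_true, y_score)                # area under that curve
gini = 2 * auc - 1                                  # Gini coefficient from the AUC

print(f"AUC = {auc:.2f}, Gini = {gini:.2f}")        # AUC = 0.80, Gini = 0.60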

IN-TEXT QUESTIONS
5. Which of the following illustrates trade-off between True Positive
Rate and False Positive Rate?
(a) Gini Coefficient (b) F1-score
(c) ROC (d) AUC
6. Which of the following value of AUC indicates a more accurate
classifier?
(a) 0.01 (b) 0.25
(c) 0.5 (d) 0.99
7. What is the range of values for the Gini coefficient?
(a) -1 to 1 (b) 0 to 1
(c) 0 to infinity (d) -infinity to infinity
8. How can the Gini coefficient be computed?
(a) By calculating the area under the precision-recall curve
(b) By calculating the area under the Receiver Operating
Characteristic (ROC) curve
(c) By calculating the ratio of true positives to true negatives
(d) By calculating the ratio of false positives to false negatives

3.12 Summary

Logistic regression is used to solve the classification problems by producing


the probabilistic values within the range of 0 and 1. Logistic regression
uses Logistic function i.e. sigmoid function. Multinomial Regression is the
generalization of logistic regression to multiclass problems. Omnibus test is
a statistical test utilized to test the significance of several model parameters
at once. Wald test is a statistical test used to assess the significance of
individual predictor variables in a regression model. Hosmer-Lemeshow
test is a statistical test employed to assess the adequacy of a logistic
regression model. Pseudo R-square is a measure to assess the proportion
of variance in the dependent variable explained by the predictor variables.
There are various classification metrics namely Sensitivity, Specificity,
Accuracy, Precision, F-score, Gini Coefficient, ROC and AUC, which are
utilized to evaluate the performance of a classifier model.

3.13 Answers to In-Text Questions

1. (d) The chi-square distribution


2. (c) The overall significance of predictor variables collectively
3. (b) 0.25
4. (c) 0.5
5. (c) ROC
6. (d) 0.99
7. (b) 0 to 1
8. (b) By calculating the area under the receiver operating characteristic
(ROC) curve

3.14 Self-Assessment Questions


1. Differentiate between Linear Regression and Logistic Regression.
2. Differentiate between Sensitivity and Specificity.
3. Define True Positive Rate and False Positive Rate.
4. Consider a logistic regression model X that is applied on a problem
of classifying whether a statement is hateful or not. Consider a dataset D of

100 statements containing equal number of hateful statements and


non-hateful statements. Suppose the model X is classifying all the
input statements as hateful. Comment on the precision and recall
values of the model X.
5. Define F-score and Gini Index.
6. Explain the use of ROC curve and AUC of a ROC curve.

3.15 References
 ‹LaValley, M. P. (2008). Logistic regression. Circulation, 117(18),
2395-2399.
 ‹Wright, R. E. (1995). Logistic regression.
 ‹Chatterjee, Samprit, and Jeffrey S. Simonoff. Handbook of regression
analysis. John Wiley & Sons, 2013.
 ‹Kleinbaum, David G., K. Dietz, M. Gail, Mitchel Klein, and Mitchell
Klein. Logistic regression. New York: Springer-Verlag, 2002.
 ‹DeMaris, Alfred. “A tutorial in logistic regression.” Journal of
Marriage and the Family (1995): 956-968.
 ‹Osborne, J. W. (2014). Best practices in logistic regression. Sage
Publications.
 ‹Bonaccorso, Giuseppe. Machine learning algorithms. Packt Publishing
Ltd, 2017.

3.16 Suggested Readings


 ‹Huang, F. L. (2022). Alternatives to logistic regression models in
experimental studies. The Journal of Experimental Education, 90(1),
213-228.
 ‹https://towardsdatascience.com/logistic-regression-in-real-life-building-
a-daily-productivity-classification-model-a0fc2c70584e.

L E S S O N

4
Decision Tree and
Clustering
Dr. Sanjay Kumar
Deptt. of Computer Science and Engineering
Delhi Technological University
Email-Id: sanjay.kumar@dtu.ac.in

STRUCTURE
4.1 Learning Objectives
4.2 Introduction
4.3 Classification and Regression Tree
4.4 CHAID
4.5 Impurity Measures
4.6 Ensemble Methods
4.7 Clustering
4.8 Summary
4.9 Answers to In-Text Questions
4.10 Self-Assessment Questions
4.11 References
4.12 Suggested Readings

4.1 Learning Objectives


 ‹Exploring the concept of decision tree and its components.
 ‹Evaluating attribute selection measures.
 ‹Understanding ensemble methods.
 ‹Comprehending the random forest algorithm.

‹Exploring the concept of clustering and its types.


 ‹Comprehending distance and similarity measures.
 ‹Evaluating cluster quality.

4.2 Introduction
Decision tree is a popular machine learning approach for classification
and regression tasks. Its structure is similar to a flowchart, where internal
nodes represent features or attributes, branches depict decision rules, and
leaf nodes signify outcomes or predicted values. The data are divided
recursively according to feature values by the decision tree algorithm to
create the tree. It chooses the best feature for data partitioning at each
stage by analysing parameters such as information gain or Gini impurity.
The goal is to divide the data into homogeneous subsets within each
branch to increase the tree’s capacity for prediction.

Figure 4.1: Decision Tree for classification scenario of a mammal


By choosing a path through the tree based on feature values, the decision tree can be used to generate predictions on new, unseen data after it has been constructed. Figure 4.1 shows a decision tree that helps classify an animal based on a series of questions. The flowchart

begins with the question, “Is it a mammal?” If the answer is “Yes,” we
follow the branch on the left. The next question asks, “Does it have
spots?” If the answer is “Yes,” we conclude that it is a leopard. If the
answer is “No,” we determine it is a cheetah.
If the answer to the initial question, “Is it a mammal?” is “No,” we
follow the branch on the right, which asks, “Is it a bird?” If the answer
is “Yes,” we classify it as a parrot. If the answer is “No,” we classify
it as a fish.
Thus decision tree demonstrates a classification scenario where we aim to
determine the type of animal based on specific attributes. By following
the flowchart, we can systematically navigate through the questions to
reach a final classification.

4.3 Classification and Regression Tree


A popular machine learning approach for classification and regression
tasks is called the Classification and Regression Tree (CART). It is a
decision tree-based model that divides the data into subsets according
to the values of the input features and then predicts the target variable
using the tree structure.
CART is especially well-liked because of how easy it is to understand.
Each internal node represents a test on a specific feature, and each leaf
node represents a class label or a predicted value, forming a binary
tree structure. The method divides the data iteratively according to the
features with the goal of producing homogeneous subsets with regard to
the target variable.
In classification tasks, CART measures the impurity or disorder within
each node using a criterion like Gini impurity or entropy. Selecting the
best feature and split point at each node aims to reduce this impurity. The
outcome is a tree that correctly categorises new instances according to
their feature values. In regression problems, CART measures the quality
of each split using a metric called Mean Squared Error (MSE). In order
to build a tree that can forecast the continuous target variable, it searches
for the feature and split point that minimises the MSE.

Example: Suppose we have a dataset of patients and we want to


predict whether they have a heart disease based on their age and cholesterol
level. The dataset contains the following information:
Age Cholesterol Disease
45 180 Yes
50 210 No
55 190 Yes
60 220 No
65 230 Yes
70 200 No
Using the CART algorithm, we can build a decision tree to make
predictions. The decision tree may look like this:

Figure 4.2: Predicting Disease based on Age and Cholesterol Levels


The decision tree in this illustration begins at the node at the top, which
evaluates the condition "Age ≤ 55." If a patient is 55 or younger, we proceed to the left branch and examine the "Cholesterol ≤ 200"
condition. The diagnosis is “No Disease” if the patient’s cholesterol
level is less than or equal to 200. The forecast is “Yes Disease” if the
cholesterol level is more than 200.
However, if the patient is older than 55, we switch to the right branch,
where “No Disease” is predicted regardless of the cholesterol level.
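A tree of this kind can be grown in Python with scikit-learn (an assumed library choice). Because CART selects its own split points from the data, the tree learned from these six patients may differ from the illustrative tree described above.

from sklearn.tree import DecisionTreeClassifier, export_text

age = [45, 50, 55, 60, 65, 70]
cholesterol = [180, 210, 190, 220, 230, 200]
disease = ["Yes", "No", "Yes", "No", "Yes", "No"]

X = list(zip(age, cholesterol))          # feature matrix: [age, cholesterol]
clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
clf.fit(X, disease)

# Inspect the learned splits and predict for a new patient (age 52, cholesterol 195)
print(export_text(clf, feature_names=["Age", "Cholesterol"]))
print(clf.predict([[52, 195]]))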

4.4 CHAID

4.4.1 Chi-Square Automatic Interaction Detection


CHAID (Chi-Square Automatic Interaction Detection) is a statistical
method used to analyze the interaction between different categories of
variables. It is particularly useful when working with data that involves
categorical variables, which represent different groups or categories. The
CHAID algorithm aims to identify meaningful patterns by dividing the
data into groups based on various categories of variables. This is achieved
through the application of statistical tests, particularly the chi-square test.
The chi-square test helps determine if there is a significant relationship
between the categories of a variable and the outcome of interest.
It divides the data into smaller groups. It repeats this procedure for each
of these smaller groups in order to find other categories that might be
significantly related to the result. The leaves on the tree indicate the
expected outcomes, and each branch represents a distinct category.
Calculate the Chi-Square statistic (χ²):
χ² = Σ (O − E)² / E    (1.1)
O represents the observed frequencies in each category or cell of a
contingency table.
E represents the expected frequencies under the assumption of independence
between variables.
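Equation (1.1) is the ordinary chi-square statistic for a contingency table. The short Python sketch below applies it to a hypothetical 2 × 2 table of Age Group versus Customer Satisfaction (the counts are invented purely for illustration); scipy's chi2_contingency computes the expected frequencies and the statistic, and correction=False gives the plain Pearson form of equation (1.1).

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = Age Group (Young, Middle-aged),
# columns = Customer Satisfaction (Satisfied, Not Satisfied)
observed = np.array([[30, 20],
                     [10, 40]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

print("Expected counts under independence:\n", expected)
print(f"Chi-square = {chi2_stat:.3f}, df = {dof}, p-value = {p_value:.4f}")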

Figure 4.3: Determining Satisfaction Levels of customer


This flowchart shows how CHAID gradually divides the dataset into
subsets according to the most important predictor factors, resulting in
a hierarchical structure. It enables us to clearly and orderly visualise
the links between the variables and their effects on the target variable
(Customer Satisfaction).
Age Group is the first variable on the flowchart, and it has two branches:
“Young” and “Middle-aged.” We further examine the Gender variable within
the “Young” branch, resulting in branches for “Male” and “Female.” The
Purchase Frequency variable is next examined for each gender sub-group,
yielding three branches: “Low,” “Medium,” and “High.” We arrive at the
leaf nodes, which represent the customer satisfaction outcome and are
either “Satisfied” or “Not Satisfied.”

4.4.2 Bonferroni Correction


The Bonferroni correction is a statistical method used to adjust the
significance levels (p-values) when conducting multiple hypothesis tests
at the same time. It helps control the overall chance of falsely claiming
a significant result by making the criteria for significance more strict.

To apply the Bonferroni correction, we divide the desired significance level (usually denoted as α) by the number of tests being performed (denoted as m). This adjusted significance level, denoted as α' or α_B, becomes
the new threshold for determining statistical significance.
Mathematically, the Bonferroni correction can be represented as:
α' = α / m    (1.2)

For example, suppose we are conducting 10 hypothesis tests and we want a significance level of 0.05 (α = 0.05). By applying the Bonferroni correction, we divide α by 10, resulting in an adjusted significance level of
α' = 0.05/10 = 0.005    (1.3)
Now, when we assess the p-values obtained from each test, we compare
them against the adjusted significance level (α') instead of the original α. If a p-value is less than or equal to α', we consider the result to be
statistically significant.
Let's consider an example. Suppose five of the ten tests yield p-values of 0.02, 0.07, 0.01, 0.03 and 0.04. Using the Bonferroni correction with α = 0.05 and m = 10, the adjusted significance level is α' = 0.05/10 = 0.005.
Comparing these p-values to the adjusted significance level (α'), we find:
Hypothesis test 1: p-value (0.02) > α' (0.005): Not statistically significant
Hypothesis test 2: p-value (0.07) > α' (0.005): Not statistically significant
Hypothesis test 3: p-value (0.01) > α' (0.005): Not statistically significant
Hypothesis test 4: p-value (0.03) > α' (0.005): Not statistically significant
Hypothesis test 5: p-value (0.04) > α' (0.005): Not statistically significant
Based on the Bonferroni correction, none of these tests is statistically significant, even though p-values such as 0.01, 0.02, 0.03 and 0.04 would have been significant at the original level of 0.05. Only a test with a p-value of at most 0.005 would be declared statistically significant after the correction, which illustrates how the Bonferroni correction makes the criterion for significance stricter.
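The correction itself is a one-line computation; the sketch below applies it to the five p-values listed above with m = 10 planned tests. (statsmodels offers the same adjustment through multipletests(..., method='bonferroni') if that library is available.)

alpha = 0.05
m = 10                         # total number of planned hypothesis tests
alpha_adj = alpha / m          # Bonferroni-adjusted significance level = 0.005

p_values = [0.02, 0.07, 0.01, 0.03, 0.04]   # the p-values from the example

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p <= alpha_adj else "not significant"
    print(f"Test {i}: p = {p} -> {verdict}")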

4.5 Impurity Measures

4.5.1 Gini Impurity Index


Gini impurity index is a measure used in decision tree algorithms to
evaluate the impurity or disorder within a set of class labels. It quantifies
the likelihood of a randomly selected element being misclassified based
on the distribution of class labels in a given node. The Gini impurity
index ranges from 0 to 1, where 0 represents a perfectly pure node with
all elements belonging to the same class, and 1 represents a completely
impure node with an equal distribution of elements across different classes.
To calculate the Gini impurity index, we first compute the probability
of each class label within the node by dividing the count of elements
belonging to that class by the total number of elements. Then, we square
each probability and sum up the squared probabilities for all classes.
Finally, we subtract the sum from 1 to obtain the Gini impurity index.
Mathematically, the formula for Gini impurity index is as follows:

Gini Index = 1 − Σ (pi)²    (1.4)
where pi represents the probability of each class label in the node.
By using the Gini impurity index, decision tree algorithms can make
decisions on how to split the data by selecting the feature and threshold
that minimize the impurity after the split. A lower Gini impurity index
indicates a more homogeneous distribution of class labels, which helps
in creating pure and informative branches in the decision tree.

Example:
Suppose we have a dataset with 50 samples and two classes, “A” and
“B”. The table below shows the distribution of class labels for a particular
node in a decision tree:
Sample Class
20 A
10 B
15 A
5 B
To calculate the Gini impurity index, we follow these steps:
Calculate the probability of each class label:
Probability of class A = (20 + 15) / 50 = 35 / 50 = 0.7
Probability of class B = (10 + 5) / 50 = 15 / 50 = 0.3
Square each probability:
Square of 0.7 = 0.49
Square of 0.3 = 0.09
Sum up the squared probabilities:
0.49 + 0.09 = 0.58
Subtract the sum from 1 to obtain the Gini impurity index:
Gini Index = 1 - 0.58 = 0.42
So, the Gini impurity index for this particular node is 0.42. This value
represents the impurity or disorder within the node, with a lower Gini
impurity index indicating a more homogeneous distribution of class labels.

4.5.2 Entropy
Entropy is a concept used in information theory and decision tree
algorithms to measure the level of uncertainty or disorder within a set
of class labels. It helps us understand how mixed or impure the class
distribution is in a given node. The entropy value is calculated based on
the probabilities of each class label within the node.
To compute entropy, we start by determining the probability of each
class label. This is done by dividing the count of elements belonging

to a particular class by the total number of elements. Next, we apply


the logarithm (typically base 2) to each probability, multiply it by the
probability itself, and sum up these values. Finally, we take the negative
of the sum to obtain the entropy value.
Mathematically, the formula for entropy is as follows:

Entropy = −Σ pi × log2(pi)    (1.5)
Where, pi represents the probability of each class label in the node.
By using entropy, decision tree algorithms can assess the impurity within
a node and determine the feature and threshold that minimize the entropy
after the split. A lower entropy value indicates a more homogeneous
distribution of class labels, leading to more informative and accurate
splits in the decision tree.
Example:
Let’s consider a node with 80 samples, where 60 samples belong to class
A and 20 samples belong to class B. The probability of class A is 60/80
= 0.75, and the probability of class B is 20/80 = 0.25. Applying the
logarithm (base 2) to these probabilities, we get −0.415 and −2.000, respectively. Multiplying these values by their probabilities and summing them up, we obtain −0.811. Taking the negative of this sum, the entropy for this node is 0.811.
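Both impurity measures can be computed with a small helper function; the sketch below reproduces the two worked examples (the 35/15 node for the Gini index and the 60/20 node for entropy).

from math import log2

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(round(gini([35, 15]), 2))      # 0.42, as in the Gini impurity example
print(round(entropy([60, 20]), 3))   # about 0.811, as in the entropy example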

4.5.3 Cost-based splitting criteria


Cost-based splitting criteria in decision trees involve considering cost-
related measures when determining how to split the data at each node of
the tree. Cost-based criteria consider the associated costs or penalties of
misclassification, whereas standard decision tree algorithms concentrate
on metrics like information gain or Gini impurity. When compared to
misclassifying instances of another class, misclassifying instances of one
class might occasionally have a bigger effect or cost.
The goal of cost-based splitting criteria is to minimize the overall cost
or expenses related to misclassification by selecting the best feature
and split point at each node. Instead of solely maximizing information
gain or reducing impurity, the algorithm assesses the cost associated
with potential misclassifications. The specific cost-based measure used

depends on the problem domain and the assigned costs for different
types of misclassifications. For instance, in a medical diagnosis scenario,
misclassifying a severe condition as a less severe one might incur a higher
cost compared to the opposite error.
Example:
Let’s consider a dataset of 30 fruits, where each fruit has two features:
colour (red, green, or orange) and diameter (small or large). The target
variable is the type of fruit, which can be “Apple” or “Orange”. We
also have costs associated with misclassifications: $10 for each false
positive (classifying an orange as an apple) and $5 for each false negative
(classifying an apple as an orange).
When using cost-based splitting criteria, the decision tree algorithm
considers the features (colour and diameter) to find the optimal split that
minimizes the overall cost. For simplicity, let’s assume the first split is
based on the colour feature. The algorithm assesses the costs associated
with misclassification for each colour category and chooses the colour
that results in the lowest expected cost. For instance, if the algorithm
determines that splitting the data based on colour between “Red/Green”
and “Orange” fruits minimizes the expected cost, it proceeds to evaluate
the diameter feature for each branch. The algorithm continues this recursive
process of splitting the data until it constructs a complete decision tree.
The resulting decision tree may look like this:


Figure 4.4: Split based on colour and diameter

The decision tree in the above picture displays splits depending on colour
and diameter, resulting in the labelling of fruits as “Apple” or “Orange”
at the leaf nodes. Now we can use the decision tree to estimate the type
of a new fruit when its colour and diameter are displayed. The model
determines the fruit’s expected class (apple or orange) by tracing the path
down the tree based on the provided attributes.

4.6 Ensemble Methods

4.6.1 Introduction to Ensemble Methods


In order to increase prediction accuracy and generalisation, ensemble
methods are used in machine learning. Ensemble methods construct an
ensemble from several models rather of relying on a single model. The
ensemble’s separate models are each trained on a distinct subset of the
data or with a different algorithm. Two common approaches for ensemble
method are as follows:
Voting: According to this method, each model in the ensemble provides
a prediction, and the outcome is decided by considering the weighted
majority of all the individual models. For instance, when performing
classification tasks, the ensemble selects the class that receives the most
model votes as its forecast.
Averaging: The predicted outcomes of each individual model are integrated
using this method by averaging their results. This method is frequently
employed in regression tasks because it enables the ensemble to provide
a final prediction by averaging the values predicted by each model.
Some popular ensemble methods include Random Forest, AdaBoost,
Gradient Boosting, and Bagging. These methods have been widely adopted
in various domains and have shown significant improvements in prediction
performance compared to using a single model alone.
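As a brief illustration of the voting approach, the sketch below (assuming scikit-learn is available) combines three different classifiers with a hard majority vote on a synthetic dataset generated only for demonstration.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    voting="hard")               # each model votes; the majority class wins

ensemble.fit(X, y)
print(ensemble.predict(X[:5]))   # predictions for the first five samples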
Example:
Suppose our goal is to determine whether a specific email is spam or
not. We have a dataset that includes details on email sender, subject, and
content. Using bagging, we can assemble several decision tree models. A

random subset of the data is used to train each decision tree, and their
predictions are aggregated via majority voting.
Each decision tree in this ensemble will identify various spam email
patterns and characteristics. While some trees may concentrate on certain
words or phrases, others may take sender information into consideration.
Each decision tree in the ensemble will make a forecast when a new email
is received, and the final prediction will be based on the consensus of
all the decision trees.

4.6.2 Random Forest


In machine learning, the widely used ensemble learning technique Random
Forest is utilised for both classification and regression applications. The
outcomes from each individual decision tree are aggregated to create
forecasts using many decision trees. With each tree based on a distinct
subset of the training data, the Random Forest method builds a collection
of decision trees. The subset is produced using a technique known as
bootstrap sampling, which includes replacing some of the randomly chosen
data points. Further randomness is added by considering a random subset
of features for splitting at each node of the decision tree.
First, we create a bunch of decision trees, each using a different set of
data. We randomly pick some of the data for each tree, which helps add
variety to the predictions. Next, we train each decision tree by dividing
the data into smaller groups based on different features. We want the trees
to be different from each other, so we use random subsets of features to
make the divisions. For example, if we’re trying to classify something,
each tree votes for the class it thinks is correct. The final prediction is
based on the majority vote.
Step-by-step explanation of how the Random Forest algorithm works:
Random Sampling: The algorithm starts by randomly selecting subsets
of the training data from the original dataset. Each subset is constructed
by randomly selecting data points with replacement. These subsets are
used to build individual decision trees.
Tree Construction: Recursive partitioning is a technique used to build
a decision tree for each subset of the training data. A random subset of
features is taken into account for splitting at each node of the tree. Each

tree is unique thanks to the randomness, which also lessens association
between the trees.
Voting and Aggregation: Each tree in the Random Forest identifies
the target variable separately (for classification tasks) or predicts its
value independently (for regression tasks) while making predictions.
For classification, the final prediction is chosen by a majority vote; for
regression, the predictions are averaged. The overall forecast accuracy
is enhanced by the voting and aggregation procedure.
Random Forest has several key features and advantages:
Robustness against overfitting: The Random Forest is more resilient to
noise or outliers in the data thanks to the integration of many decision
trees, which also helps to avoid overfitting.
Feature importance estimation: The features that have the greatest impact
on the predictions are identified by Random Forest using a measure of
feature importance. With this knowledge, features may be chosen and
underlying relationships in the data can be understood.
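A minimal Random Forest sketch with scikit-learn is shown below; it also prints the feature-importance scores mentioned above. The built-in iris dataset is used purely as an example.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 bootstrapped trees
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
print("Feature importances:", forest.feature_importances_)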

4.7 Clustering
Cluster analysis, commonly referred to as clustering, is a machine learning
technique that categorizes datasets without labels. In this method, objects
that share similarities are grouped together while maintaining minimal
or no similarities with other groups. By relying solely on the inherent
characteristics of the data, clustering algorithms unveil patterns and
associations. Their objective is to divide a collection of data points into
distinct clusters or groups, where objects within the same cluster exhibit
greater similarity to each other compared to those in different clusters.
The primary aim is to maximize the similarity within clusters while
minimizing the similarity between clusters.
Let’s look at the clustering technique in activity using the real-world
example of Mall: When we go to a shopping centre, we notice that
items with comparable uses are clustered together. T-shirts, for example,
are arranged in one section and pants in another; similarly, in vegetable
sections, apples, bananas, mangoes, and so on are grouped in distinct
sections so that we can easily discover what we’re looking for.

Figure 4.5 shows uncategorized and categorized data in the form of three types of fruit mixed together. The left side shows the uncategorized data, while the right side shows the categorized data, i.e., groups of the same fruit:
 ‹Points in the same cluster are similar
 ‹Points in the different clusters are dissimilar

Figure 4.5: Clustering from mixed input


The above diagrams (Figure 4.5) show that the different fruits are divided
into different clusters or groups with similar properties.

4.7.1 Characteristics of Clustering


Clustering analysis possesses several distinct characteristics that make it
a powerful tool in data analysis. Firstly, it is an unsupervised learning
technique, meaning that it does not require prior knowledge or labelled
data to guide the clustering process. Instead, clustering algorithms discover
patterns and groupings solely based on the inherent characteristics of
the data. Secondly, clustering analysis is exploratory in nature, allowing
researchers to uncover hidden structures and relationships that may not be
immediately apparent. This exploratory aspect of clustering is data-driven
and does not impose assumptions or constraints on the structure of the
data, making it applicable to a wide range of domains and data types.

4.7.2 Types of Clustering


Clustering methods are broadly categorized into two types, namely hard clustering (each data point belongs to exactly one group) and soft clustering (a data point may belong to more than one group). Beyond this distinction, several clustering techniques are available. The following are the most common clustering approaches used in machine learning:
(A) Partitioning Clustering
(B) Density-Based Clustering
(C) Hierarchical clustering
(A) Partitioning Clustering:
Partitioning clustering is a clustering algorithm that seeks to divide a
dataset into separate and non-overlapping clusters. In this type of clustering,
the dataset is split into a predetermined number of groups, denoted as
K. The cluster centres are positioned in a manner that minimizes the distance between the data points in a cluster and their own cluster centroid. Figure 4.6 illustrates the resulting partition of clusters.

Figure 4.6: Partitioning Clustering


The most well-known partitioning algorithm is K-means clustering. Here
are the advantages and disadvantages of partitioning clustering:
Advantages of Partitioning Clustering are:
 ‹Scalability

 ‹Ease of implementation

‹Interpretability
 ‹Applicability to various data types
Disadvantages of partitioning clustering are:
 ‹Sensitivity to initial centroid selection
 ‹Dependence on the number of clusters
 ‹Limited ability to handle complex cluster shapes
K-means Clustering
This is one of the most popular clustering algorithms. It aims to partition
the data into a predetermined number of clusters (K) by minimizing
the sum of squared distances between data points and the centroid of
their assigned cluster. K-means clustering is a popular and widely used
algorithm for partitioning a dataset into a predefined number of clusters.
It is an iterative algorithm that aims to minimize the within-cluster sum
of squares, also known as the total squared distance between data points
and their cluster centres. The algorithm assigns data points to clusters
by iteratively updating the cluster centres until convergence. Here is a
detailed description of the K-means clustering algorithm:
 ‹Initialization:

Specify the number of clusters K that you want to identify in the


dataset. Initialize K cluster centres randomly or using a predefined
strategy, such as randomly selecting K data points as initial centres.
 ‹Assignment Step:
For each data point in the dataset, calculate the distance (e.g.,
Euclidean distance) to each of the K cluster centres.
Assign the data point to the cluster with the nearest cluster centre,
forming K initial clusters.
 ‹Update Step:
Calculate the new cluster centres by computing the mean (centroid)
of all data points assigned to each cluster. The cluster centre is the
average of the feature values of all data points in that cluster.
 ‹Iteration:

Repeat the assignment and update steps until convergence or until


a predefined stopping criterion is met. In each iteration, reassign

data points to clusters based on the updated cluster centres and
recalculate the cluster centres.
 ‹Convergence:

The algorithm converges when the cluster assignments no longer


change significantly between iterations or when the maximum
number of iterations is reached.
 ‹Final Result:
Once the algorithm converges, the final result is a set of K clusters,
where each data point is assigned to one of the clusters based on
its nearest cluster centre.
It is worth noting that K-means clustering is sensitive to the initial
placement of cluster centres. Different initializations can lead to
different clustering results. To mitigate this, it is common to run the
algorithm multiple times with different random initializations and
choose the solution with the lowest within-cluster sum of squares
as the final result. K-means clustering has several advantages,
including its simplicity, scalability to large datasets, and efficiency.
Example:
Cluster the following 4 points in two-dimensional space using K = 2
X1 X2
A 2 3
B 6 1
C 1 2
D 3 0
Solution:
Select two centroids as AB and CD, calculate as
AB = Average of A, B
CD = Average of C, D
X1 X2
AB 4 2
CD 2 1

Now calculate the squared Euclidean distance between each point and the centroids and assign each point to the closest centroid:
A B C D
AB 5 5 9 5
CD 4 16 2 2
We can observe in the above table that the distance between A and centroid CD (4) is smaller than the distance between A and centroid AB (5), so we move A to the CD cluster. Two clusters are formed: ACD and B. Now we recompute the centroids of B and ACD.
We repeat the process by calculating the distance between each point and
the updated centroids and reassigning the points to the closest centroid. We
continue this iteration until the centroids no longer change significantly.
After a few iterations, the algorithm converges, and the final cluster
assignments are:
Cluster 1: B
Cluster 2: ACD
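The same four points can be clustered with scikit-learn's KMeans (a sketch under the assumption that the library is available). Starting from the same initial centroids as the worked example reproduces its result; with other initializations K-means may settle on a different local optimum for these four points.

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[2, 3],    # A
                   [6, 1],    # B
                   [1, 2],    # C
                   [3, 0]])   # D

# Start from the initial centroids of the worked example: AB = (4, 2), CD = (2, 1)
init_centroids = np.array([[4.0, 2.0], [2.0, 1.0]])
km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(points)

print(km.labels_)            # A, C and D share one label; B gets the other
print(km.cluster_centers_)   # final centroids of the two clusters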
(B) Density-Based Clustering:
Density-based algorithms, such as DBSCAN (Density-Based Spatial
Clustering of Applications with Noise), identify clusters based on regions
of high data point density. Data points that are close to each other are grouped together, while sparse (low-density) regions separate the clusters. Density-based
clustering does not assume any specific shape for clusters. It can detect
clusters of arbitrary shapes, including non-linear and irregular clusters. Also,
such clustering techniques handles noise or outlier points appropriately.

Figure 4.7: Density based Clustering

(C) Hierarchical Clustering:


Hierarchical clustering can be used instead of partitioning clustering
because there is no need to indicate the number of clusters to be produced.
Hierarchical clustering constructs a hierarchical structure of clusters,
represented by a dendrogram, which resembles a tree-like formation.
This clustering method can be categorized into two primary approaches:
agglomerative, where individual data points begin as separate clusters and
are progressively merged, and divisive, where all data points commence
in a single cluster and are recursively divided.

Figure 4.8: Hierarchical Clustering (Dendrogram)


In this example, we have nine data points labelled A, B, C, D, E, F, G,
H, and I. The dendrogram represents the hierarchical relationships between
these data points. Data points B and C are combined into a single cluster
at the first level of the dendrogram because they are the closest to one
another. The difference between B and C can be seen in the heights of
the branches that connect them.
The nearest data points at the level below are D, E, and F, which group
together to create a cluster. The data points G, H, and I also form a
cluster. The branch joining these two clusters to the combined clusters B
and C represents the merging of these two clusters into the bigger cluster.
Agglomerative hierarchical algorithm
The agglomerative hierarchical algorithm is a popular clustering algorithm
that follows a bottom-up approach to create a hierarchical structure of
clusters. It starts with each data point assigned to its own individual
cluster and progressively merges the closest pairs of clusters until a

single cluster, containing all the data points, is formed. This algorithm
is also known as agglomerative clustering or bottom-up clustering. Here
are the key steps and characteristics of the agglomerative hierarchical
algorithm.
 ‹Initialization: Assign each data point to its own initial cluster,
resulting in N clusters for N data points.
 ‹Compute the proximity or dissimilarity matrix: Calculate the
dissimilarity or similarity measure between each pair of clusters.
The choice of distance or dissimilarity measure depends on the
specific application and the nature of the data.
 ‹Merge the closest clusters: Identify the pair of clusters with the
smallest dissimilarity or highest similarity measure and merge them
into a single cluster. The dissimilarity or similarity between the new
merged cluster and the remaining clusters is updated.
 ‹Repeat the merging process: Repeat steps 2 and 3 until all the data
points are part of a single cluster or until a predefined stopping
criterion is met.
 ‹Hierarchical representation: The merging process forms a hierarchy
of clusters, often represented as a dendrogram. The dendrogram
illustrates the sequence of merging and allows for different levels
of granularity in cluster interpretation.
The advantages of the agglomerative hierarchical algorithm are its hierarchical structure and the fact that, unlike partitioning algorithms, the number of clusters does not need to be specified in advance. Its drawbacks are high computational complexity and a lack of stability.
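A compact sketch using scipy (an assumed choice of library) builds the agglomerative linkage structure for a handful of illustrative 2-D points and then cuts the resulting dendrogram into two clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 2], [2, 3], [3, 0], [6, 1], [7, 2]])   # small illustrative dataset

# Agglomerative (bottom-up) merging; 'ward' is one common linkage criterion
Z = linkage(points, method="ward")

# Cut the dendrogram so that two clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)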

4.7.3 Distance and Dissimilarity Measures in Clustering


In clustering, distance and dissimilarity measures play a crucial role in
determining the similarity or dissimilarity between data points. These
measures quantify the proximity between objects and are used by
clustering algorithms to assign data points to clusters or determine the
cluster centres. Here are some commonly used distance and dissimilarity
measures in clustering.

1. Euclidean Distance: This is one of the most widely used distance
measures in clustering. It calculates the straight-line distance between
two data points in a Euclidean space. For two points, P = (p1, p2,
..., pn) and Q = (q1, q2, ..., qn), the Euclidean distance is given
by:

√[(p1 − q1)² + (p2 − q2)² + ... + (pn − qn)²]    (1.6)
2. Manhattan Distance: Also known as the City Block distance or
L1 norm, it calculates the sum of absolute differences between the
coordinates of two points. For two points, P = (p1, p2, ..., pn) and
Q = (q1, q2, ..., qn), the Manhattan distance is given by:
|p1 − q1| + |p2 − q2| + ... + |pn − qn|    (1.7)
3. Cosine Similarity: Cosine similarity measures the cosine of the angle
between two vectors, indicating the similarity in their directions.
It is commonly used in text mining or when dealing with high-
dimensional data. For two vectors, P = (p1, p2, ..., pn) and Q =
(q1, q2, ..., qn), the cosine similarity is given by:

(p1q1 + p2q2 + ... + pnqn) / (√(p1² + p2² + ... + pn²) × √(q1² + q2² + ... + qn²))    (1.8)

4. Minkowski Distance: This is a generalized distance measure that


includes Euclidean and Manhattan distances as special cases. The
Minkowski distance between two points P = (p1, p2, ..., pn) and
Q = (q1, q2, ..., qn) is given by:
(|p1 − q1|^p + |p2 − q2|^p + ... + |pn − qn|^p)^(1/p)    (1.9)
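All four measures can be computed directly with numpy, as in the sketch below for two illustrative vectors (scipy.spatial.distance also provides ready-made functions for them).

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 0.0, 4.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))                             # equation (1.6)
manhattan = np.sum(np.abs(p - q))                                     # equation (1.7)
cosine_sim = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))   # equation (1.8)
minkowski3 = np.sum(np.abs(p - q) ** 3) ** (1 / 3)                    # equation (1.9), order 3

print(euclidean, manhattan, cosine_sim, minkowski3)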

4.7.4 Quality and Optimal Number of Clusters


Quality and determining the optimal number of clusters are important
considerations in clustering analysis. Let’s explore each of these aspects:
(A) Quality of Clustering:
The quality of clustering refers to how well the clustering algorithm
captures the inherent structure and patterns in the data. Several factors
contribute to the assessment of clustering quality:

‹Compactness: Compactness refers to how close the data points are
within each cluster. A good clustering result should have data points
tightly clustered together within their assigned clusters.
 ‹Separability: Separability refers to the distance between different
clusters. A high-quality clustering result should exhibit distinct
separation between clusters, indicating that the clusters are well-
separated from each other.
 ‹Stability: Stability measures the consistency of clustering results
under different conditions, such as different initializations or subsets
of the data. A stable clustering result is less prone to variations
and demonstrates robustness.
 ‹Domain-specific Measures: Depending on the application domain,
additional measures specific to the problem can be used to assess
clustering quality. For example, in customer segmentation, metrics
like homogeneity, completeness, and silhouette coefficient can be used
to evaluate the effectiveness of clustering in capturing meaningful
customer groups.
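As an illustration of one such quality measure, the sketch below computes the silhouette coefficient for a K-means result using scikit-learn; the synthetic two-blob dataset and the parameter settings are assumptions used only for demonstration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs of 50 points each
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The silhouette coefficient ranges from -1 to 1; values near 1 indicate
# compact, well-separated clusters
print(silhouette_score(X, labels))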
(B) Determining the Optimal Number of Clusters in K-means clustering:
Determining the optimal number of clusters, K, in K-means clustering is
a crucial step in clustering analysis. Selecting the appropriate number of
clusters is important for interpreting and extracting meaningful information
from the data. Several methods are commonly used to determine the
optimal number of clusters:
 ‹Elbow Method: “The elbow method involves plotting the within-cluster
sum of squares (WCSS) as a function of the number of clusters”.
The plot resembles an arm, and the optimal number of clusters is
often identified at the “elbow” point, where the rate of decrease
in WCSS slows down significantly. In Figure 4.9 below, the elbow occurs
at k = 3, which is therefore the optimal number of clusters; a short
Python sketch of this computation follows the figure.


Figure 4.9: Elbow Method
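A minimal Python sketch of the computation behind an elbow plot such as Figure 4.9 is given below; the synthetic three-blob dataset and the range of k values are assumptions chosen purely for illustration.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three artificial blobs, so the elbow should appear near k = 3
X = np.vstack([rng.normal(c, 0.4, size=(60, 2)) for c in (0, 4, 8)])

ks = range(1, 9)
# scikit-learn's KMeans exposes the WCSS of a fitted model as inertia_
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Within-cluster sum of squares (WCSS)")
plt.title("Elbow method")
plt.show()

The value of k at which the decrease in WCSS levels off (the bend of the “arm”) is read off as the optimal number of clusters.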


IN-TEXT QUESTIONS
1. Why are Classification and Regression Trees (CART) advantageous?
(a) Decision trees implicitly perform variable screening or
feature selection
(b) Can handle both numerical and categorical data
(c) Can handle multi-output problems
(d) All of these
2. Decision tree can be used for ___________.
(a) Classification (b) Regression
(c) Both (d) None of these
3. What is the maximum depth in a decision tree?
(a) The length of the longest path from a root to a leaf
(b) The length of the shortest path from a root to a leaf
(c) The length of the longest path from a root to a sub-node
(d) None of these
4. Which of the following is correct with respect to random forests?
(a) Random forests are difficult to interpret but often very
accurate


(b) Random forests are easy to interpret but often very accurate

(c) Random forests are difficult to interpret but much less accurate
(d) None of these
5. Predicting with trees evaluates _________ within each group of
data.
(a) Equality
(b) Homogeneity
(c) Heterogeneity
(d) All of the mentioned
6. Which of the following is finally produced by Hierarchical
Clustering?
(a) Final estimate of cluster centroids
(b) Tree showing how close things are to each other
(c) Assignment of each point to clusters
(d) All of the mentioned
7. Which of the following is required by K-means clustering?
(a) Defined distance metric
(b) Number of clusters
(c) Initial guess as to cluster centroids
(d) All of the mentioned
8. Which of the following combinations is incorrect?
(a) Continuous – Euclidean distance
(b) Continuous – Correlation similarity
(c) Binary – Manhattan distance
(d) None of the mentioned
9. Which is not true about K-means clustering?
(a) K-means is sensitive to cluster center initializations
(b) Bad initialization can lead to Poor convergence speed
(c) Bad initialization can produce good clustering
(d) None of the mentioned


10. Which of the following clustering algorithms requires a merging approach?


(a) Fuzzy mean
(b) Hierarchical
(c) Naive Bayes
(d) K-means
11. Which one of the following is not true about ensemble classifiers?
(a) The different learners in boosting can be trained in parallel
(b) The different learners in bagging can be trained in parallel
(c) Boosting methods generally use weak learners as individual
classifiers
(d) An individual classifier in a boosting-based ensemble is
trained with every point in the training sample
12. Which of the following algorithms is most sensitive to outliers?
(a) K-means clustering algorithm
(b) DBSCAN clustering algorithm
(c) K-medoids clustering algorithm
(d) None of these

4.8 Summary
 ‹Decision Tree is a popular machine learning approach for classification
and regression tasks.
 ‹The CHAID algorithm looks for meaningful patterns by splitting the
data into groups based on different categories of variables.
 ‹The Bonferroni correction is a statistical method used to adjust the
significance levels (p-values).
 ‹Gini impurity index is a measure used in decision tree algorithms
to evaluate the impurity or disorder within a set of class labels.
 ‹Entropy is a concept used in information theory and decision tree
algorithms to measure the impurity or disorder of a dataset.


 ‹Cost-based splitting criteria aim to minimize the overall cost or
misclassification expenses by selecting the optimal feature and split
point at each node.
 ‹CART measures the impurity or disorder within each node using a
criterion like Gini impurity or entropy.
 ‹Random forest combines multiple decision trees to make predictions
by aggregating the results from each individual tree.
 ‹Clustering algorithms discover patterns and groupings solely based
on the inherent characteristics of the data.
 ‹K-means clustering is a popular and widely used algorithm for
partitioning a dataset into a predefined number of clusters.
 ‹Distance measures quantify the proximity between objects and are
used by clustering algorithms to assign data points to clusters or
determine the cluster centers.

4.9 Answers to In-Text Questions

1. (d) All of these


2. (c) Both
3. (a) The length of the longest path from a root to a leaf
4. (a) Random forests are difficult to interpret but often very accurate
5. (b) Homogeneity
6. (b) Tree showing how close things are to each other
7. (d) All of the mentioned
8. (d) None of the mentioned
9. (c) Bad initialization can produce good clustering
10. (b) Hierarchical
11. (a) The different learners in boosting can be trained in parallel
12. (a) K-means clustering algorithm


4.10 Self-Assessment Questions


1. What are the different decision tree algorithms used in machine
learning?
2. What is entropy?
3. Which metric is better for node selection in a decision tree: entropy
or Gini impurity?
4. Write some advantages and disadvantages of decision trees.
5. How are decision trees used for classification and regression tasks?
6. Can a random forest handle categorical features and missing values?
7. What is the purpose of using random subsets of data and features
in a random forest?
8. What are the main types of clustering algorithms?
9. What is the key parameter in k-means clustering?
10. What are the limitations of clustering?

4.11 References
 ‹Han, J., Kamber, M., & Pei, J. (2011). Data mining: Concepts and
techniques. Morgan Kaufmann.
 ‹Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of
statistical learning: data mining, inference, and prediction. Springer
Science & Business Media.
 ‹Bishop, C. M. (2006). Pattern recognition and machine learning.
Springer.
 ‹Murphy, K. P. (2012). Machine learning: A probabilistic perspective.
MIT Press.
 ‹Mann, A. K., & Kaur, N. (2013). Review paper on clustering
techniques. Global Journal of Computer Science and Technology.
 ‹Rai, P., & Singh, S. (2010). A survey of clustering techniques.
International Journal of Computer Applications, 7(12), 1-5.
 ‹Cheng, Y. M., & Leu, S. S. (2009). Constraint-based clustering and
its applications in construction management. Expert Systems with
Applications, 36(3), 5761-5767.


 ‹Bijuraj, L. V. (2013). Clustering and its applications. In Proceedings
of National Conference on New Horizons in IT-NCNHIT (Vol. 169,
p. 172).
 ‹Kameshwaran, K., & Malarvizhi, K. (2014). Survey on clustering
techniques in data mining. International Journal of Computer Science
and Information Technologies, 5(2), 2272-2276.

4.12 Suggested Readings


 ‹Han, J., Kamber, M., & Pei, J. (2011). Data mining: Concepts and
techniques. Morgan Kaufmann.
 ‹Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of
statistical learning: Data mining, inference, and prediction. Springer
Science & Business Media.
 ‹Bishop, C. M. (2006). Pattern recognition and machine learning.
Springer.

Glossary
Adjusted R-squared: A modified version of R-squared that accounts for the number of
independent variables in the model, penalizing the inclusion of irrelevant variables.
Big Data: Big data refers to large and complex datasets. It is characterized by the volume,
velocity, and variety of data, often generated from various sources such as social media,
sensors, devices, and business transactions.
Business Analytics: Business analytics consists of using data analysis and statistical
methods to gain insights, make informed decisions, and drive strategic actions in a business
or organizational context.
Classification: The classification algorithm is a supervised learning technique that is used
to categorize new observations on the basis of the training data.
Clustering: Clustering is a machine learning technique that groups unlabelled data points
into different clusters.
Coefficient: In the context of linear regression, these are the parameters that represent
the weights or slopes of the independent variables in the linear equation.
Dependent Variable: Also known as the response variable or target variable, it is the
variable that you are trying to predict or explain in a linear regression model.
Distance Measures: Distance measures are used to quantify the similarity or dissimilarity
between data points. They are widely used in clustering, classification, and nearest-neighbour
search.
Entropy: Entropy is used to measure the impurity or disorder in the dataset. It is commonly
used in decision tree algorithms.
F-statistic: A statistical test used to determine the overall significance of a linear regression
model by comparing the model’s fit to a model with no independent variables.
F1-score: The F1-score is calculated as the harmonic mean of Precision and Recall.
Gini Coefficient: A metric used to measure inequality; in classification, it summarizes a model's discriminatory power.
Gini index: Also known as Gini impurity, it is used to measure the degree or
probability of a particular element being wrongly classified.
Heteroscedasticity: A violation of one of the assumptions of linear regression, where the
variance of the residuals is not constant across all levels of the independent variables.


Homoscedasticity: The assumption that the variance of the residuals
is constant across all levels of the independent variables in a linear
regression model.
Hosmer-Lemeshow Test: A statistical test utilized to assess the adequacy
of fit for a logistic regression model.
Independent Variable: Also known as predictor variables or features,
these are the variables that are used to predict the dependent variable in
a linear regression model.
Intercept: The constant term in the linear equation, representing the
predicted value of the dependent variable when all independent variables
are zero.
Linear Regression: A statistical method used to model the relationship
between a dependent variable and one or more independent variables by
fitting a linear equation to the observed data.
Multicollinearity: A situation in which two or more independent variables
in a multiple linear regression model are highly correlated, making it
difficult to distinguish their individual effects.
Multiple Linear Regression: A type of linear regression that involves
two or more independent variables to predict the dependent variable.
Omnibus Test: A statistical test used to test the significance of multiple
model parameters simultaneously.
Ordinary Least Squares (OLS): The most common method used to
estimate the coefficients of a linear regression model by minimizing the
sum of squared residuals.
Outliers: Data points that are significantly different from the rest of the
data and can have a strong influence on the linear regression model.
Overfitting: A situation where a linear regression model fits the training
data too closely, capturing noise and making it perform poorly on unseen
data.
P-value: A measure that indicates the significance of a coefficient or a
statistical test in a linear regression model. It helps determine whether a
variable is statistically significant.
Pseudo R-square: A metric used to evaluate the proportion of variability in
the dependent variable that can be accounted for by the predictor variables.

108 PAGE
© Department of Distance & Continuing Education, Campus of Open Learning,
School of Open Learning, University of Delhi
GLOSSARY

R-squared (R²): A statistical measure that represents the proportion of the
variance in the dependent variable that is explained by the independent
variables in the linear regression model.
Residuals: The differences between the actual observed values of the
dependent variable and the values predicted by the linear regression model.
ROC curve: The ROC curve demonstrates the trade-off between the true positive
rate and the false positive rate across various classification thresholds.
Simple Linear Regression: A type of linear regression that involves only
one independent variable to predict the dependent variable.
T-statistic: A statistical test used to determine the individual significance
of each independent variable’s coefficient in a linear regression model.
Wald Test: A statistical test used to evaluate the significance of each
individual predictor variable within a regression model.

