Unit 2- Statistics Notes
1.14 Arithmetic Mean
1.14.1 How to Calculate Arithmetic Mean?
1.14.2 Mean of Individual Series
1.14.2.1 Direct Method
1.14.2.2 Assumed Mean Method
1.14.2.3 Step-Deviation Method
1.14.2.4 Assumed Mean Method vs Step Deviation Method
1.14.3 Mean of Ungrouped Frequency Distribution or Discrete Series
1.14.3.1 Direct Method
1.14.3.2 Assumed Mean Method
1.14.3.3 Step-Deviation Method
1.14.4 Mean of Continuous Series (or Grouped Frequency Distribution)
1.14.4.1 Direct Method
1.14.4.2 Assumed Mean Method
1.14.4.3 Step-Deviation Method
1.14.5 Practice Questions
1.15 Geometric Mean
1.15.1 Advantages of Geometric Mean
1.15.2 Disadvantages of Geometric Mean
1.15.3 Application of Geometric Mean
1.15.4 Geometric Mean for Individual Series
1.15.5 Geometric Mean for Discrete Series
1.15.6 Geometric Mean for Continuous Series
1.15.7 Geometric Mean vs Arithmetic Mean
1.15.8 When is the Geometric mean better than the Arithmetic mean?
1.16 Harmonic Mean
1.16.1 Advantages of Harmonic Mean
1.16.2 Limitations of Harmonic Mean
1.16.3 Applications of Harmonic Mean
1.16.4 Harmonic Mean for Individual Series
1.16.5 Harmonic Mean for a Discrete Series
1.16.6 Harmonic Mean for a Continuous Series
1.16.7 Weighted Harmonic Mean
1.17 Relation Between AM, GM, and HM
1.18 Median
1.18.1 Advantages of the Median
1.18.2 Limitations of the Median
1.18.3 Applications of the Median
1.18.4 Median for Individual Series
1.18.5 Median for Discrete Series
1.18.6 Median for Continuous Series
1.19 Mode
1.19.1 Advantages of Mode
1.19.2 Limitations of Mode
1.19.3 Applications of Mode
1.19.4 Mode for Individual Series
1.19.5 Mode for Discrete Series
1.19.6 Mode for Continuous Series
1.19.7 Relationship Between Mean, Median, and Mode
1.20 Measure of Dispersion
1.20.1 Advantages of Measures of Dispersion
1.20.2 Limitations of Measures of Dispersion
1.20.3 Applications of Measures of Dispersion
1.20.4 Types of Measures of Dispersion
1.20.4.1 Absolute Measure of Dispersion
1.20.4.2 Relative Measure of Dispersion
1.21 Range
1.21.1 Range for Individual Series
1.21.2 Range for Discrete Series
1.21.3 Range for Continuous Series
1.21.4 Advantages of Range
1.21.5 Limitations of Range
1.21.6 Applications of Range
1.22 Mean Deviation (MD)
1.22.1 Mean Deviation for Individual Series
1.22.1.1 Mean Deviation around Mean for Individual Series
1.22.1.2 Mean Deviation around Median for Individual Series
1.22.1.3 Mean Deviation around Mode for Individual Series
1.22.2 Mean Deviation for Discrete Series
1.22.2.1 Mean Deviation around the Mean for Discrete Series
1.22.2.2 Mean Deviation around the Median for Discrete Series
1.22.2.3 Mean Deviation around the Mode for Discrete Series
1.22.3 Mean Deviation for Continuous Series
1.22.3.1 Mean Deviation around Mean for a Continuous Series
1.22.3.2 Mean Deviation around Median for a Continuous Series
1.22.3.3 Mean Deviation around Mode for Continuous Series
1.22.4 Advantages of Mean Deviation
1.22.5 Limitations of Mean Deviation
1.22.6 Applications of Mean Deviation
1.23 Sampling
1.23.1 Probability Sampling
1.23.1.1 Random Sampling
1.23.1.2 Systematic Sampling
1.23.1.3 Stratified Sampling
1.23.1.4 Cluster Sampling
1.23.1.5 Multi-Stage Sampling
1.23.2 Non-Probability Sampling
1.23.2.1 Convenience Sampling
1.23.2.2 Judgmental Sampling
1.23.2.3 Quota Sampling
1.24 Inferential Statistics
Chapter 1
5. Scientific Research
• Hypothesis Testing: Statistics are essential in testing hypotheses and validating research findings
across various scientific disciplines.
• Data Collection and Analysis: Researchers use statistical methods to design experiments, collect
data, and analyze results, ensuring the validity and reliability of their studies.
6. Understanding Variability
• Managing Uncertainty: Statistics help in understanding and managing variability in data, which is
inherent in any real-world process or phenomenon.
• Quantifying Differences: Through statistical tests, it’s possible to determine if observed differences
in data are significant or due to random variation.
7. Policy Formulation and Evaluation
• Public Policy: Governments and organizations use statistical data to formulate policies, assess their
impact, and make necessary adjustments.
• Socio-Economic Analysis: Statistics help in understanding social and economic issues, guiding
policy decisions on health, education, employment, and more.
8. Business and Market Research
• Consumer Insights: Businesses use statistics to understand consumer behavior, preferences, and
market trends.
• Product Development: Statistical analysis helps in identifying market needs, leading to the
development of new products and services.
9. Education and Psychology
• Educational Assessment: Statistics are used to analyze educational data, assess student performance,
and improve teaching methods.
• Psychological Research: In psychology, statistics help in studying human behavior, testing theories,
and validating psychological assessments.
10. Healthcare and Medicine
• Clinical Trials: Statistics are crucial in designing and analyzing clinical trials to ensure the efficacy
and safety of new treatments.
• Epidemiology: Statistical methods help in studying the distribution and determinants of
health-related events in populations, guiding public health interventions.
1.1.2 Advantages of Statistics
• Informed Decision Making
– Data-Driven Decisions: Statistics enable decisions based on data rather than intuition, increasing
the reliability and effectiveness of outcomes.
– Risk Management: Statistical analysis helps in identifying and managing risks, allowing for better
planning and mitigation strategies.
• Predictive Analysis
• Quality Control
• Scientific Research
• Understanding Variability
• Policy Formulation and Evaluation
• Business Applications
• Healthcare Applications
1.1.4 Applications of Statistics
1. Business and Economics
• Market Research: Analyzing consumer behavior, preferences, and market trends to guide marketing
strategies and product development.
• Quality Control: Using statistical methods to monitor and improve product and service quality.
• Financial Analysis: Evaluating investment opportunities, assessing risks, and forecasting financial
trends.
• Operational Efficiency: Optimizing supply chain management, inventory control, and resource
allocation.
2. Healthcare and Medicine
• Clinical Trials: Designing and analyzing clinical trials to determine the efficacy and safety of new
drugs and treatments.
• Epidemiology: Studying the distribution and determinants of health-related events to guide public
health interventions and policy.
• Medical Research: Analyzing data from medical studies to understand disease patterns, treatment
outcomes, and health risks.
• Health Services Management: Improving hospital management, patient care, and resource allocation
through statistical analysis.
3. Social Sciences
• Sociological Research: Analyzing social behaviors, trends, and patterns to understand societal
dynamics and inform policy.
• Psychology: Using statistical methods to validate psychological theories, assess interventions, and
analyze behavioral data.
• Education: Evaluating educational programs, assessing student performance, and improving teaching
methods through data analysis.
4. Engineering and Manufacturing
• Quality Assurance: Applying statistical process control (SPC) to monitor and improve manufacturing
processes.
• Reliability Engineering: Analyzing the reliability and life-cycle of products to enhance durability and
performance.
• Design of Experiments: Optimizing product design and development through systematic
experimentation and analysis.
5. Environmental Science
• Climate Studies: Analyzing climate data to understand trends, model climate change, and predict
future conditions.
• Environmental Monitoring: Assessing pollution levels, natural resource management, and ecological
impacts through statistical analysis.
• Conservation Biology: Studying species populations, habitat use, and conservation strategies using
statistical methods.
6. Government and Public Policy
• Census and Surveys: Collecting and analyzing population data to inform policy decisions and
resource allocation.
• Economic Planning: Using statistical models to forecast economic growth, unemployment, inflation,
and other macroeconomic indicators.
• Policy Evaluation: Assessing the impact and effectiveness of public policies and programs through
data analysis.
7. Sports and Entertainment
• Performance Analysis: Analyzing athlete performance, game statistics, and team strategies to
enhance competitive edge.
• Audience Analytics: Studying viewer preferences, ratings, and engagement to optimize content and
marketing strategies in media and entertainment.
8. Information Technology and Data Science
• Machine Learning: Using statistical methods to develop algorithms for predictive modeling,
classification, and clustering.
• Data Mining: Extracting meaningful patterns and insights from large datasets to inform business
decisions and strategies.
• Cybersecurity: Analyzing security threats, intrusion patterns, and system vulnerabilities through
statistical techniques.
9. Agriculture and Food Science
• Crop Yield Analysis: Studying factors affecting crop yields, pest control, and soil health to improve
agricultural practices.
• Food Safety: Monitoring and analyzing food production processes to ensure safety and compliance
with health regulations.
10. Education
• Assessment and Evaluation: Analyzing student performance data, evaluating educational programs,
and improving instructional methods.
• Educational Research: Using statistical methods to study learning outcomes, teaching effectiveness,
and educational trends.
11. Astronomy and Space Science
• Astrophysical Research: Analyzing astronomical data to study celestial bodies, cosmic phenomena,
and the structure of the universe.
• Space Mission Planning: Using statistical models to plan and optimize space missions, satellite
deployments, and exploration strategies.
12. Law and Forensics
• Criminology: Analyzing crime data to understand trends, patterns, and the effectiveness of law
enforcement strategies.
• Forensic Analysis: Using statistical methods in forensic science to analyze evidence, identify
patterns, and solve crimes.
1.2 Types of Data or Variables in Statistics
1.3 Qualitative vs Quantitative Data
2 https://www.youtube.com/watch?v=E1C5hB0yAM4
1.3.1.1 Nominal Data
• Nominal data is a type of data that consists of categories or names that cannot be ordered or ranked.
• Nominal data is often used to categorize observations into groups, and the groups are not comparable.
• In other words, nominal data has no inherent order or ranking. Therefore, changing the order of
its values does not change the meaning.
• Examples of nominal data include:
– Gender (Male, Female)
– Race (White, Black, Asian)
– Religion (Hinduism, Christianity, Islam, Judaism)
– Blood type (A, B, AB, O)
• Nominal data can be represented using frequency tables and bar charts, which display the number or
proportion of observations in each category.
• For example, a frequency table for gender might show the number of males and females in a sample of
people.
• Nominal data is analyzed using non-parametric tests, which do not make any assumptions about the
underlying distribution of the data.
• Common non-parametric tests for nominal data include Chi-Squared Tests and Fisher’s Exact Tests.
These tests are used to compare the frequency or proportion of observations in different categories.
1.3.1.2 Ordinal Data
• Ordinal data is a type of data that consists of categories that can be ordered or ranked. However, the distance
between categories is not necessarily equal.
• Ordinal data is nearly the same as nominal data, except that its ordering matters.
• Ordinal data is often used to measure subjective attributes or opinions, where there is a natural order to the
responses.
• Examples of ordinal data include education level (Elementary, Middle, High School, College), job position
(Manager, Supervisor, Employee), etc.
• Note that the difference between Elementary and High School is not the same as the difference between
High School and College. This is the main limitation of ordinal data: the differences between the values are
not really known. Because of that, ordinal scales are usually used to measure non-numeric features such as
happiness and customer satisfaction.
• Ordinal data can be represented using bar charts or line charts. These displays show the order or ranking of
the categories, but they do not imply that the distances between categories are equal.
• Ordinal data is analyzed using non-parametric tests, which make no assumptions about the underlying
distribution of the data.
• Common non-parametric tests for ordinal data include the Wilcoxon Signed-Rank test and Mann-Whitney
U test.
1.3.2 Quantitative Data or Numerical Data
• Quantitative data talks about quantity: something that we can measure in numbers [3].
• Quantitative Data is a fundamental component of Statistics, providing a numerical foundation for analysis
and decision-making.
• Quantitative data are data represented numerically, including anything that can be counted, measured, or
given a numerical value.
• Quantitative data is also called numerical data (it answers how much, how often, or how many).
• Quantitative data is used to represent quantities, measurements, and observations such as height, weight,
and length.
• Quantitative data is further classified into two categories [4]:
– Discrete Data
– Continuous Data
3 https://www.youtube.com/watch?v=kNARs2oeuk0
4 https://www.youtube.com/watch?v=Cg0W6mod9Hw
1.3.2.1 Discrete Data
• Discrete data is a type of data in statistics that takes only distinct, countable values.
• Discrete data can take only a finite (or at most countable) number of possible values. Those values cannot
be subdivided meaningfully.
• In a discrete dataset, there are apparent gaps between the values. These gaps indicate that there are
no values between the specified data points.
• Examples of discrete data include:
– Marks of the students in a class test
– Number of customers
– Dice rolls: When rolling a six-sided die, the possible outcomes are discrete and countable, ranging
from 1 to 6.
• Discrete Data is often analyzed using Statistical techniques tailored to discrete variables, such as frequency
distributions, bar charts, and probability calculations. These methods help to summarize and interpret Data
that can be counted or categorized into distinct values.
• Key characteristics of discrete data:
– Finite, countable, and nondivisible: Discrete variables take countable values that cannot be meaningfully
subdivided, typically non-negative integers (5, 10, 15, and so on).
– Easy to visualize: Discrete data can be easily visualized using simple charts such as bar charts,
line charts, or pie charts.
– Can be categorical: Discrete data can also be categorical - containing a finite number of data values,
such as the gender of a person.
– Easy to distribute: Discrete data is distributed discretely in terms of time and space. Discrete
distributions make analyzing discrete values more practical.
1.4 Interval vs Ratio Data
Interval vs Ratio Data: Video Lecture [5]
1.5 Other Types of Data/Variables
1.5.1 Primary Data
In mathematics, primary data is defined as data that is collected for the first time. It is pure, raw data on which
no analysis has yet been performed.
1.6 Statistical Series
[Figure: classification of Statistical Series. On the basis of characteristics: Time Series, Spatial Series,
Condition Series. On the basis of construction: Individual Series (or Raw Data) and Frequency Distribution,
the latter covering Frequency Array, Inclusive Series, Exclusive Series, Open End Series, Cumulative
Frequency Series, Mid-Value Frequency Series, and Equal and Unequal Class Interval Series.]
• Data is important for researchers but in its raw form, it is hardly usable.
• Therefore, data is often organized in series to facilitate analysis and interpretation [6].
• Each series has its own characteristics and obeys some general principles.
• Such series are very important for researchers and economists, who use them to gain insights that
they can act on.
• A statistical series refers to a set of observations arranged in a particular order based on one or more criteria.
• In other words, arranging data in some logical order, such as according to the time of occurrence, size, or
some other measurable or non-measurable characteristic, is known as a Statistical Series [7].
• Understanding the different types of statistical series is crucial for effectively analyzing and presenting data.
• Statistical Series can be classified:
– On the Basis of Characteristics:
* Time Series
* Spatial Series
* Condition Series
– On the Basis of Construction:
* Individual Series
* Discrete Series
* Continuous Series
6 https://www.youtube.com/watch?v=NWNW1jln8cc
7 https://www.youtube.com/watch?v=VunpIAw5pPg
1.7 On the Basis of Characteristics
When the data is arranged on the basis of qualitative characteristics, statistical series are of three kinds:
• Time Series
• Spatial Series
• Condition Series
1.7.1 Time Series
• If the different values taken by a variable in a period of time are arranged in chronological order, the series
obtained is called a Time Series. Thus, a Time series is a series of data points indexed (or listed or graphed)
in time order.
• Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a
sequence of discrete-time data.
• Simply put, a time series is a statistical series in which the data is presented with respect to a time unit,
e.g., day, week, month, or year.
• Time series analysis is used for non-stationary data—things that are constantly fluctuating over time or are
affected by time. Industries like finance, retail, and economics frequently use time series analysis because
currency and sales are always changing. Stock market analysis is an excellent example of time series
analysis in action, especially with automated trading algorithms. Likewise, time series analysis is ideal for
forecasting weather changes, helping meteorologists predict everything from tomorrow’s weather report to
future years of climate change.
• Examples of time series analysis in action include:
– Weather data
– Rainfall measurements
– Temperature readings
– Heart rate monitoring (EKG)
– Brain monitoring (EEG)
– Quarterly sales
– Stock prices
– Automated stock trading
– Industry forecasts
– Interest rates
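As a minimal sketch of the definition above, the snippet below arranges a handful of dated observations in chronological order to form a time series. The quarterly sales figures and the variable names are made up purely for illustration.

```python
from datetime import date

# Hypothetical quarterly sales, recorded out of order (illustrative values only).
observations = [
    (date(2023, 7, 1), 140),
    (date(2023, 1, 1), 120),
    (date(2023, 10, 1), 155),
    (date(2023, 4, 1), 135),
]

# A time series lists the values of the variable in chronological order.
time_series = sorted(observations, key=lambda obs: obs[0])

for when, value in time_series:
    print(when.isoformat(), value)
```

Sorting by the date component is all that is needed here because the observations are already equally spaced (one per quarter).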
1.7.2 Spatial Series
• Spatial data is any type of data that directly or indirectly references a specific geographical area or location.
• Example: The following is the sex ratio of 6 different states of India as per the Census of 2011.
1.7.3 Condition Series
• When data is classified according to the changes occurring in a variable under a certain condition,
the series is called a Condition Series.
• For instance, the students of a certain class may be arranged according to their age, height, weight, marks, etc.
• Example: The following is the table showing the arrangement of 40 students in a class according to their
age. It is a condition series because the data is arranged on the basis of the age of the students.
1.8 On the Basis of Construction
When the data is arranged on the basis of quantitative characteristics, statistical series are of three kinds [8][9]:
• Individual Series
– Unorganized Individual Series
– Organized Individual Series
• Discrete Series
• Continuous Series
– Exclusive Series
– Inclusive Series
– Open-end Distribution
– Cumulative Frequency Series
– Equal and Unequal Class Interval Series
– Mid-value Series
8 https://www.youtube.com/watch?v=NWNW1jln8cc
9 https://www.tutorialspoint.com/statistical-series
1.8.1 Individual Series (or Raw Data)
• Individual series is that series in which the terms are listed singly.
• In simple terms, a separate value of the measurement is given to each item.
• Example: If the marks of 10 students of a class are given individually as 80, 82, 75, 95, 77, 81, 60, 35, 54, and
99, then the resultant series is an individual series.
• In such a series, the items are neither grouped into classes nor assigned frequencies.
• The two types of individual series are:
– Unorganized Individual Series
– Organized Individual Series
1.8.2 Frequency
• In statistics, the frequency (or absolute frequency) of an event $i$ is the number $n_i$ of times the observation has
occurred or been recorded in an experiment or study.
• Frequency is basically the number of times a data item occurs in the series. In other words, it deals with
how frequent a data item is in the series.
10 https://edu.gcfglobal.org/en/statistics-basic-concepts/frequency-tables/1/
1.8.3 Discrete Series or Ungrouped Frequency Distribution or Frequency Array
• Discrete Series is nothing but ungrouped frequency distribution series where different values of the variables
are shown with their respective frequencies.
• The classification of data for a discrete variable is known as Frequency Array.
• In discrete series, data obtained in raw form are presented along with their frequencies. In such a series,
data are not presented in ascending or descending manner.
• Instead, the data and its frequencies are presented in a tabular or grouped manner.
• For example, if the monthly wages of seven employees of a company are 10,000, 12,000, 10,000, 12,000,
13,000, 14,000, and 15,000, then the discrete series will be made as follows:
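The frequencies for the wage example can be tallied programmatically. The sketch below is one possible way to do it, using Python's `collections.Counter`; the variable names are illustrative, and sorting the values is only a presentational choice.

```python
from collections import Counter

# Monthly wages listed in the example.
wages = [10_000, 12_000, 10_000, 12_000, 13_000, 14_000, 15_000]

# Pair each distinct wage with the number of times it occurs.
# Sorting is optional and done here only for readability.
discrete_series = sorted(Counter(wages).items())

print(f"{'Wage':>8} {'Frequency':>10}")
for wage, frequency in discrete_series:
    print(f"{wage:>8,} {frequency:>10}")
```

Each row of the printed table is one (value, frequency) pair, which is exactly the shape of a discrete series.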
1.8.4 Continuous Series or Grouped Frequency Distribution
• A discrete series cannot represent every value within an interval. Therefore, when a continuous variable
must be represented over a range of values for the items of a given data set, a Continuous Series is used.
• In a continuous series (grouped frequency distribution), the values of a variable are grouped into several class
intervals (such as 0-5, 5-10, 10-15) along with the corresponding frequencies.
• Other names of Continuous Series are Frequency Distribution, Grouped Frequency Distribution, Series with
Class Intervals, and Series of Grouped Data.
• Different types of Continuous Series [11][12]:
– Inclusive Series
– Exclusive Series
– Open-end Distribution
– Cumulative Frequency Series
– Equal and Unequal Class Interval Series
– Mid-value Series
11 https://www.geeksforgeeks.org/types-of-frequency-distribution/
12 https://www.toppr.com/guides/economics/organisation-of-data/frequency-distribution/
1.8.4.1 Important Terms under Continuous Series
• Class: Class in Continuous Series refers to a group of numbers in which the items are placed. For example,
0-5, 5-10, 10-15, 15-20, 20-25, etc.
• Number of Classes: The decision regarding the number of classes of a given data usually depends upon
the judgement of the individual investigator. Even though there is no strict rule regarding the number of
classes, the number should not be very small or very large.
• Class Limits: In continuous series, the class limit is formed by the two numbers between which every class
is located. The lowest value of the class is known as Lower Limit and the highest value of the class is known
as Upper Limit. For example, if a class is 5 - 10, then 5 is the lower limit and 10 is the upper limit.
• Class Interval: It is the difference between the lower limit and upper limit of a class.
• Range: It is the difference between the lower limit of the first class interval and the upper limit of the last
class interval. For example, if the classes of a distribution are 0-5, 5-10, 10-15, ..., up to 45-50, then
the range will be 50 - 0 = 50.
• Width of Class Intervals: At the time of constructing the frequency distribution, it is suggested that the
width of each class interval is equal in size. The formula for determining the size or width of each class
interval is as follows:
\[ \text{width} = \frac{\text{range}}{\sqrt{\text{sample size}}} \]
• How to make a grouped frequency table?
Example: A sociologist conducted a survey of 20 adults. She wants to report the frequency distribution of
the ages of the survey respondents. The respondents were the following ages in years:
52, 34, 32, 29, 63, 40, 46, 54, 36, 36, 24, 19, 45, 20, 28, 29, 38, 33, 49, 37
19 ≤ a < 29
29 ≤ a < 39
39 ≤ a < 49
49 ≤ a < 59
59 ≤ a < 69
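The following is a rough sketch of how the class width and the frequencies for this survey could be computed. It assumes the width is rounded up to the next whole number and that each class includes its lower limit and excludes its upper limit, so that no age is counted twice; neither choice is stated explicitly in the example.

```python
import math

ages = [52, 34, 32, 29, 63, 40, 46, 54, 36, 36,
        24, 19, 45, 20, 28, 29, 38, 33, 49, 37]

data_range = max(ages) - min(ages)               # 63 - 19 = 44
raw_width = data_range / math.sqrt(len(ages))    # range / sqrt(sample size)
width = math.ceil(raw_width)                     # assumption: round up to a whole number

# Build class intervals [lower, upper), starting at the smallest age.
classes = []
lower = min(ages)
while lower <= max(ages):
    classes.append((lower, lower + width))
    lower += width

# Count the ages in each class (lower limit included, upper limit excluded).
frequency_table = {
    (lo, hi): sum(1 for a in ages if lo <= a < hi) for lo, hi in classes
}

for (lo, hi), count in frequency_table.items():
    print(f"{lo} <= a < {hi}: {count}")
```

With these assumptions the width comes out as 10, and the five classes start at 19, 29, 39, 49, and 59.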
1.8.4.2 Inclusive Series
• The series with class intervals, in which all the items having the range from the lower limit up to the upper
limit are included, is known as Inclusive Series.
• However, there is a gap (between 0.1 and 1) between the upper limit of one class interval and the lower
limit of the next class interval.
• For example, class intervals of an inclusive series can be, 0-9, 10-19, 20-29, 30-39, and so on. In this case,
the gap between the upper limit of one class interval and the lower limit of the next class interval is 1.
• From the above table of inclusive series, it can be seen that the upper limit of one class interval (say, 9 of
interval 0-9) is not the same as the lower limit of the next class interval (10 of interval 10-19). Also, all the
values that come under 0-9, including 0 and 9 are included in the frequency against 0-9.
• For statistical calculation, sometimes it becomes necessary to convert the inclusive series into exclusive
series. Suppose, in the above example, some students have obtained marks such as 10.5, 40.5, etc. In this
case, the series will be converted into an exclusive series.
1.8.4.3 Exclusive Series
• The series with class intervals, in which all the items having the range from the lower limit to the value just
below its upper limit are included, is known as the Exclusive Series.
• For example, if a class interval is 0-10, and the values of the given series are 4, 10, 2, 15, 8, and 9, then only
4, 2, 8, and 9 will be included in the 0-10 class interval. 10 and 15 will be included in the next class interval,
i.e., 10-20.
• In Exclusive Series, the upper limit of a class interval is the lower limit of the next class interval.
• From the above table of exclusive series, it can be seen that the upper limit of the first class interval is the
lower limit of the second class interval, and so on.
• If the data includes a value 10, it will be included in the class interval 10-20, not in 0-10.
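The rule for exclusive class intervals can be sketched as a small helper that places each value in the class whose lower limit it reaches but whose upper limit it stays below. The function name `assign_exclusive` is hypothetical; the values and classes are those of the example above.

```python
values = [4, 10, 2, 15, 8, 9]
classes = [(0, 10), (10, 20)]


def assign_exclusive(value, classes):
    """Return the class (lower, upper) with lower <= value < upper, or None."""
    for lower, upper in classes:
        if lower <= value < upper:
            return (lower, upper)
    return None


# Group the values by the class they fall into.
grouped = {c: [] for c in classes}
for v in values:
    c = assign_exclusive(v, classes)
    if c is not None:
        grouped[c].append(v)

print(grouped)
```

Because the comparison on the upper limit is strict, the value 10 lands in 10-20 rather than 0-10.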
1.8.4.4 Conversion of Inclusive Series into Exclusive Series
• For statistical calculation, sometimes it becomes necessary to convert the inclusive series into exclusive
series.
• Suppose, in the above example, some students have obtained marks such as 10.5, 40.5, etc. In this case, the
series will be converted into an exclusive series.
• The steps for converting an inclusive series into exclusive series are:
– In the first step, calculate the difference between the upper limit of one class interval and the
lower limit of the next class interval.
– The next step is to divide the difference by two and then add the resulting value to the upper limit of
every class interval and subtract it from the lower limit of every class interval.
• The inclusive series of the above example is converted into exclusive series as under:
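The two conversion steps can be sketched as follows. This is a minimal illustration, assuming the gap between consecutive classes is the same throughout and that at least two classes are given; the function name `inclusive_to_exclusive` is hypothetical, and the classes 0-9, 10-19, 20-29, 30-39 are the ones used as an example of inclusive class intervals.

```python
def inclusive_to_exclusive(classes):
    """Convert inclusive (lower, upper) classes into exclusive ones."""
    # Step 1: gap between one class's upper limit and the next class's lower limit.
    gap = classes[1][0] - classes[0][1]
    # Step 2: halve the gap, subtract it from every lower limit
    # and add it to every upper limit.
    half = gap / 2
    return [(lower - half, upper + half) for lower, upper in classes]


inclusive = [(0, 9), (10, 19), (20, 29), (30, 39)]
exclusive = inclusive_to_exclusive(inclusive)
print(exclusive)
```

Here the gap is 1, so every limit moves by 0.5 and the upper limit of each converted class coincides with the lower limit of the next one.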
1.8.4.5 Difference between Inclusive and Exclusive Series
• In Inclusive Series, the upper limit of one class interval is not the same as the lower limit of the next class
interval. There is a gap ranging from 0.1 to 1.0 between the upper class limit of one class interval and the
lower class limit of the next class interval. However, in the Exclusive Series, the upper limit of one class
interval is the same as the lower limit of the next class interval.
• In the case of Inclusive Series, the value of the upper and the lower limit are included in that class interval
only. However, in the case of Exclusive Series, the value of upper limit of a class interval is not included in
that interval, instead, it is included in the next class interval.
• An Inclusive Series is suitable only when the values are whole numbers and not decimals. However, an Exclusive Series is suitable whether the values are whole numbers or decimals.
• Counting in Inclusive Series is possible only after converting it into an Exclusive Series. However, counting in Exclusive Series is possible in all cases.
1.8.5 Open End Series
• Sometimes the lower limit of the first class interval and the upper limit of the last class interval are not available; instead, Less than or Below is mentioned in the former case (in place of the lower limit of the first class interval), and More than or Above is mentioned in the latter case (in place of the upper limit of the last class interval). Such series are known as Open End Series.
• For statistical calculations, if the open-end first and last class intervals need definite limits, the general practice is to give them the same magnitude (class size) as the other class intervals.
• In the above example, the magnitude of the other class intervals is 5. Therefore, the open-end class intervals can be written as 5-10 and 30-35, respectively.
1.9 Types of Statistics
Statistics can be broadly classified into two main types:
1. Descriptive Statistics
2. Inferential Statistics
1.11 Population vs Sample
• Population: A collection or set of individuals or objects or events whose properties are to be analyzed.
• Sample: A subset of the population is called ‘Sample’. A well-chosen sample will contain most of the
information about a particular population parameter.
• Outliers: An outlier is a data point that differs significantly from the majority of the data taken from a
sample or population. There are many possible causes of outliers, but here are a few to start you off:
– Natural variation in data
– Change in the behavior of the observed system
– Errors in data collection
1.12 Measures of Central Tendency
• Central tendencies in Statistics are numerical values that represent the mid-value or central value of a large collection of numerical data. These numerical values are called central values in Statistics.
• Measures of central tendency are statistical metrics that describe the center of a dataset, that is, a single value that is representative of the entire distribution.
• Such a value is of great significance because it depicts the nature or characteristics of the entire data, which
is otherwise very difficult to observe.
• The three most common measures of central tendency are:
– Mean : provides the average value of the dataset
– Median: provides the central value of the dataset
– Mode: provides the most frequent value in the dataset
1.13 Mean
• Mean is the most commonly used measure of central tendency in Statistics.
• Mean is the central tendency of the distributed data, which refers to the average value of the given set of
data.
• The method of finding the mean is also different depending on the type of data (Grouped or Ungrouped
Data).
• Mean is also referred to as the average.
• Mean is sensitive to skewed data and extreme values.
• The three types of mean covered in these notes are:
– Arithmetic Mean
– Geometric Mean
– Harmonic Mean
When not specified, the mean generally refers to the arithmetic mean.
1.14 Arithmetic Mean
1.14.1 How to Calculate Arithmetic Mean?
There are three ways to determine the arithmetic mean of grouped or ungrouped data, that is, of Individual, Discrete, and Continuous Series [13][14]:
• Direct Method
• Assumed Mean Method or Short-Cut Method
• Step Deviation Method
[13] https://www.youtube.com/playlist?list=PLYwJOKtPsLuiFjFGKDFoPZOM0g4JBKUrj
[14] https://www.youtube.com/playlist?list=PLEHGYFbPuuMEhz_AU8iCrBTYb5eNtFpeg
1.14.2 Mean of Individual Series
• Raw data is a dataset that simply contains all the observations in no particular order.
• The series in which the items are listed singly is known as Individual Series.
• The mean of raw data is calculated by adding up all the observations and dividing the sum by the total number of observations in the set.
• Mean = Sum of all Observations ÷ Total number of Observations
• The population mean is represented by the Greek letter µ (mu).
1.14.2.1 Direct Method
• The following equations compute the population mean and the sample mean:
$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}$$
where N is the total number of observations in the population.
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where n is the total number of observations in the sample.
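As a quick check of the direct method, here is a minimal Python sketch. The helper name `direct_mean` and the sample values are illustrative assumptions; the result is compared with the standard library's `statistics.mean`.

```python
from statistics import mean


def direct_mean(values):
    # Sum of all observations divided by the number of observations.
    return sum(values) / len(values)


data = [73, 75, 76, 78, 79]  # illustrative individual series
result = direct_mean(data)
print(round(result, 2))  # 76.2
```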
1.14.2.2 Assumed Mean Method
• The assumed mean method finds the actual mean of the data by first assuming a mean value [15].
• When the calculation of the mean for raw data using the direct method becomes very tedious, then the mean
can be calculated using the assumed mean method.
• With the direct method, you work with the full-sized observations, so the sums can become large. The assumed mean method, also known as a shift of origin, reduces the likelihood of calculation errors because it gives you smaller numbers to work with (including negative deviations that partly cancel the positive ones).
• By reducing the size of the numbers involved, the assumed mean method makes the arithmetic mean easier to compute, which makes it suitable for data sets with large values.
• The following equations compute the population mean and the sample mean:
$$\mu = A + \frac{\sum d_i}{N}$$
$$\bar{x} = A + \frac{\sum d_i}{n}$$
where A is the assumed mean and $d_i = x_i - A$ is the deviation of each observation from the assumed mean.
• Advantages:
– Simplifies arithmetic by using smaller numbers.
– Reduces computational complexity.
• Disadvantages:
– Assumed mean is still a central value, so deviations might still be relatively large.
• How to Calculate Mean using Assumed Mean Method?: We can calculate mean using the assumed
mean method by following the below steps:
1. Choose an Assumed Mean (A): Select a value from the data, often a central value, to act as an
assumed mean.
2. Calculate the Deviations (d): Subtract the assumed mean from each data point to find the deviation
di = xi − A, where xi is each data point.
3. Find the Sum of Deviations (∑ di ): Add up all the deviations.
4. Calculate the Mean using the above formulas
[15] https://testbook.com/maths/assumed-mean-method
• Example:
– Assume your data set is 73, 75, 76, 78 and 79.
– Sort your data set from smallest to largest.
– Assume a mean. This should be a number that you feel is a close representation of your data set.
– In a simple example, take the number in the center of your data set; in this case 76.
– Subtract your assumed mean from each data entry.
– In our example, 73 − 76 = −3, 75 − 76 = −1, 76 − 76 = 0, 78 − 76 = 2, and 79 − 76 = 3.
– Add together these differences from the mean.
– (−3) + (−1) + 0 + 2 + 3 = 1
– Divide the sum of the differences from assumed mean by the number of data points.
– 1/5 = 0.2
– Add the result of the division to your assumed mean.
– Mean = 76 + 0.2 = 76.2
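The worked example above can be reproduced with a short Python sketch. The function name `assumed_mean` is hypothetical; the second call uses a different assumed mean to show that the choice of A affects only the size of the intermediate numbers, not the result.

```python
def assumed_mean(values, A):
    # Deviations of each observation from the assumed mean A.
    deviations = [x - A for x in values]
    # Actual mean = assumed mean + average deviation.
    return A + sum(deviations) / len(values)


data = [73, 75, 76, 78, 79]
result = assumed_mean(data, A=76)
other = assumed_mean(data, A=70)  # a different assumed mean
print(round(result, 2), round(other, 2))  # 76.2 76.2
```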
• Example: Find the mean of the following data using Assumed mean method 40, 50, 55, 78, 58
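Taking A = 40, the deviations $d_i = x_i - A$ are 0, 10, 15, 38, and 18, so $\sum d_i = 81$.
$$\bar{x} = A + \frac{\sum_{i=1}^{n} d_i}{n} = 40 + \frac{81}{5} = 56.2$$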
• Find the average for the following data using Assumed mean method
$$\bar{x} = A + \frac{\sum d_i}{n} = 8 + \frac{17}{10} = 9.7$$
1.14.2.3 Step-Deviation Method
• The Step Deviation method is an extension of the Assumed Mean method.
• This method further simplifies calculations by choosing a common factor (step size) to reduce the size of
the deviations from an assumed mean.
• Advantage:
– Step deviations simplify the calculations, especially when the original deviations are large or awkward to work with.
– Makes it easier to work with data when the values are spread out over a large range.
• Disadvantage:
– Requires an additional step of selecting an appropriate step size h.
– May not lead to simpler calculations if h is not chosen wisely.
• How to Calculate Mean using Step Deviation Method?
– Choose an Assumed Mean (A): Select a value close to the center of your data as the assumed mean.
This value can be one of the data points.
– Calculate the Deviations (d): Subtract the assumed mean from each data point to find the deviation
di = xi − A, where xi is each data point.
– Select a Common Factor (h): Choose a common factor h (also known as the step size), which could be a convenient value, such as 2, 5, 10, etc., depending on the data range.
– Calculate Step Deviations: Divide each deviation by the chosen factor h to obtain the step deviations $u_i$:
$$u_i = \frac{d_i}{h} = \frac{x_i - A}{h}$$
– Find the Sum of Step Deviations ($\sum u_i$): Add up all the step deviations.
– Calculate the Mean: The following equations compute the population mean and the sample mean:
$$\mu = A + h \times \frac{\sum u_i}{N}$$
$$\bar{x} = A + h \times \frac{\sum u_i}{n}$$
• Example: Let’s consider the following ungrouped data: 47, 53, 59, 65, 71
1. Choose an Assumed Mean (A): Select A = 59 (a central value from the data).
2. Calculate Deviations (d):
(a) $d_1 = 47 - 59 = -12$
(b) $d_2 = 53 - 59 = -6$
(c) $d_3 = 59 - 59 = 0$
(d) $d_4 = 65 - 59 = 6$
(e) $d_5 = 71 - 59 = 12$
3. Select a Common Factor (h): Choose h = 6 (a convenient value given the range of deviations).
4. Calculate Step Deviations (u):
(a) $u_1 = -12/6 = -2$
(b) $u_2 = -6/6 = -1$
(c) $u_3 = 0/6 = 0$
(d) $u_4 = 6/6 = 1$
(e) $u_5 = 12/6 = 2$
5. Find the Sum of Step Deviations: $\sum u_i = (-2) + (-1) + 0 + 1 + 2 = 0$
6. Calculate the Arithmetic Mean using the above formula:
$$\bar{x} = A + h \times \frac{\sum u_i}{n} = 59 + 6 \times \frac{0}{5} = 59$$
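The same example can be checked with a minimal Python sketch of the step deviation method. The function name `step_deviation_mean` is an illustrative assumption.

```python
def step_deviation_mean(values, A, h):
    # Step deviations: u_i = (x_i - A) / h
    u = [(x - A) / h for x in values]
    # Mean = A + h * (sum of step deviations / number of observations)
    return A + h * sum(u) / len(values)


data = [47, 53, 59, 65, 71]
result = step_deviation_mean(data, A=59, h=6)
print(result)  # 59.0
```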
• Example: Find the mean of the following data using direct method, assumed mean method and step
deviation method. 40, 50, 55, 78, 58
• Find the average for the following data 35, 40, 60, 75, 90 using step-deviation method.
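Taking A = 60 and h = 5, the step deviations $u_i = (x_i - A)/h$ are −5, −4, 0, 3, and 6, so $\sum u_i = 0$.
$$\bar{x} = A + h \times \frac{\sum u_i}{n} = 60 + 5 \times \frac{0}{5} = 60$$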
1.14.2.4 Assumed Mean Method vs Step Deviation Method
• Both methods are computational shortcuts for the same quantity: for any choice of A (and of h), they give exactly the same mean as the direct method. Neither requires the true mean to be known in advance.
• The assumed mean method shifts the origin only: it works with the deviations $d_i = x_i - A$.
• The step deviation method shifts the origin and also changes the scale: it works with the step deviations $u_i = (x_i - A)/h$, and the factor h is multiplied back in at the end.
• The same deviations can be reused when computing the standard deviation. The divisor in that formula depends on whether the data is treated as a whole population or as a sample, not on which shortcut is used:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \quad \text{(whole population)}$$
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \quad \text{(sample)}$$
• To summarize:
– Use the assumed mean method when the values are large but the deviations from A have no convenient common factor.
– Use the step deviation method when the deviations share a common factor h, for example when the values (or class midpoints) are equally spaced.
1.14.3 Mean of Ungrouped Frequency Distribution or Discrete Series
• In a discrete series (ungrouped frequency distribution), each distinct value of the variable is listed together with its frequency, that is, the number of times it is repeated.
• It means that the frequencies are given corresponding to the different values of the variable.
• The total number of observations in a discrete series, N, equals the sum of the frequencies, $\sum f_i$.
• Example of Discrete Series: If 6 students of a class score 50 marks, 4 students score 60 marks, 7 students score 70 marks, 3 students score 80 marks, and 5 students score 90 marks, then this information will be shown as:
Marks (x)              | 50 | 60 | 70 | 80 | 90
Number of students (f) |  6 |  4 |  7 |  3 |  5
1.14.3.1 Direct Method
1. List the Data: Prepare a frequency table with the values ($x_i$) and their corresponding frequencies ($f_i$).
2. Calculate the Product of $x_i$ and $f_i$: Multiply each value by its frequency to get $x_i f_i$.
3. Find the Sum of the Products $\sum x_i f_i$: Add all the products together.
4. Calculate the Mean: Divide the sum of the products by the sum of the frequencies:
$$\bar{x} = \frac{\sum x_i f_i}{\sum f_i}$$
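A minimal Python sketch of the direct method for a discrete series, using the marks example from section 1.14.3 (6 students with 50 marks, 4 with 60, and so on). The function name `discrete_mean` is hypothetical.

```python
def discrete_mean(values, freqs):
    # Sum of x_i * f_i divided by the total frequency N.
    total = sum(x * f for x, f in zip(values, freqs))
    return total / sum(freqs)


marks = [50, 60, 70, 80, 90]
students = [6, 4, 7, 3, 5]
result = discrete_mean(marks, students)
print(round(result, 2))  # 68.8
```

Here the sum of the products is 1720 and the total frequency is 25, which gives a mean of 68.8 marks.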
• Example:
$$\bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{264}{28} \approx 9.43$$
• Example: Calculate the mean of the following distribution, which represents the scores obtained by students
in a quiz.
$$\bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{3595}{115} \approx 31.26$$
• Example: If the mean of the following distribution is 28, locate the missing frequency.
1.14.3.2 Assumed Mean Method
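1. Choose an Assumed Mean (A): Select a value close to the center of the data as the assumed mean.
2. Calculate Deviations ($d_i$): Find the deviation of each value from the assumed mean, $d_i = x_i - A$.
3. Multiply Each Deviation by Its Frequency: Compute $f_i d_i$ for each value.
4. Find the Sum of the Products: Add all the products together, $\sum f_i d_i$.
5. Calculate the Mean:
$$\bar{x} = A + \frac{\sum f_i d_i}{\sum f_i}$$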
• Example: Calculate the arithmetic mean for the following data using Assumed Mean Method.
$$\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum f_i} = 80 + \frac{115}{50} = 82.3$$
• Example: Consider the following data set and calculate the mean using Direct, Assumed Mean and Step
Deviation Method.
1.14.3.3 Step-Deviation Method
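1. Choose an Assumed Mean (A): Select a central value as the assumed mean.
2. Calculate Deviations ($d_i$): Find the deviation of each value from the assumed mean, $d_i = x_i - A$.
3. Select a Common Factor (h): Choose a step size based on the data.
4. Calculate Step Deviations ($u_i$): Divide each deviation by the step size, $u_i = d_i / h$.
5. Find the Sum of the Products: Multiply each step deviation by its frequency and add the products together, $\sum f_i u_i$.
6. Calculate the Mean:
• Arithmetic Mean for Sample:
$$\bar{x} = A + h \times \frac{\sum_{i=1}^{n} f_i u_i}{\sum f_i}$$
• Arithmetic Mean for Population:
$$\mu = A + h \times \frac{\sum_{i=1}^{N} f_i u_i}{\sum f_i}$$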
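The steps above can be sketched in Python as follows. The function name `step_deviation_mean_freq` and the choices A = 70 and h = 10 are illustrative assumptions; the data is the marks example from section 1.14.3.

```python
def step_deviation_mean_freq(values, freqs, A, h):
    # Step deviations u_i = (x_i - A) / h for each distinct value.
    u = [(x - A) / h for x in values]
    # Sum of f_i * u_i over all values.
    sum_fu = sum(f * ui for f, ui in zip(freqs, u))
    # Mean = A + h * (sum of f_i * u_i) / (sum of f_i)
    return A + h * sum_fu / sum(freqs)


marks = [50, 60, 70, 80, 90]
students = [6, 4, 7, 3, 5]
result = step_deviation_mean_freq(marks, students, A=70, h=10)
print(round(result, 2))  # 68.8
```

With these choices the step deviations are −2, −1, 0, 1, 2 and the sum of the products is −3, so the mean is 70 + 10 × (−3/25) = 68.8, the same value the direct method gives.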
• Determine the arithmetic mean from the following frequency table using Step-Deviation Method:
$$\mu = A + h \times \frac{\sum_{i=1}^{N} f_i u_i}{\sum f_i} = 60 + 10 \times \frac{-20}{50} = 56$$
1.14.4 Mean of Continuous Series (or Grouped Frequency Distribution)
• In a continuous series (grouped frequency distribution), the data is grouped into class intervals with
corresponding frequencies.
• Each class interval represents a range of values (such as 0-5,5-10,10-15), and the frequency shows how
many observations fall within that interval.
• Example: If 15 students of a class score marks between 50-60, 10 students score marks between 60-70, and 20 students score marks between 70-80, then this information will be shown as:
Marks (class interval) | 50-60 | 60-70 | 70-80
Number of students (f) |  15   |  10   |  20
1.14.4.1 Direct Method
1. Determine the Midpoint (Class Mark) for Each Class Interval: For each class interval, find the midpoint ($x_i$) using:
$$x_i = \frac{\text{Lower Limit} + \text{Upper Limit}}{2}$$
2. Calculate the Product of the Midpoint and Frequency: Multiply each midpoint by its corresponding frequency, $f_i x_i$.
3. Find the Sum of the Products: Add all the products together, $\sum x_i f_i$.
4. Calculate the Mean:
• Arithmetic Mean for Sample:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i f_i}{\sum f_i}$$
• Arithmetic Mean for Population:
$$\mu = \frac{\sum_{i=1}^{N} x_i f_i}{\sum f_i}$$
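A minimal Python sketch of the direct method for a continuous series, using the class intervals from the example in section 1.14.4 (50-60, 60-70, 70-80 with 15, 10, and 20 students). The function name `grouped_mean` is hypothetical.

```python
def grouped_mean(intervals, freqs):
    # Class mark (midpoint) of each interval.
    mids = [(lo + hi) / 2 for lo, hi in intervals]
    # Sum of midpoint * frequency divided by the total frequency.
    return sum(m * f for m, f in zip(mids, freqs)) / sum(freqs)


intervals = [(50, 60), (60, 70), (70, 80)]
freqs = [15, 10, 20]
result = grouped_mean(intervals, freqs)
print(round(result, 2))  # 66.11
```

The midpoints are 55, 65, and 75, the sum of the products is 2975, and the total frequency is 45. Because midpoints stand in for the actual observations, the result is an approximation of the mean of the underlying raw data.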
1.14.4.2 Assumed Mean Method
1. Determine the Midpoint (Class Mark) for Each Class Interval: For each class interval, find the midpoint ($x_i$) using:
$$x_i = \frac{\text{Lower Limit} + \text{Upper Limit}}{2}$$
2. Choose an Assumed Mean (A): Select a midpoint value close to the center of the data as the assumed mean.
3. Calculate Deviations ($d_i$): Find the deviation of each midpoint from the assumed mean, $d_i = x_i - A$.
4. Multiply Each Deviation by Its Frequency: Compute $f_i d_i$ for each class interval.
5. Find the Sum of the Products: Add all the products together, $\sum f_i d_i$.
6. Calculate the Mean:
• Arithmetic Mean for Sample:
$$\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum f_i}$$
• Arithmetic Mean for Population:
$$\mu = A + \frac{\sum_{i=1}^{N} f_i d_i}{\sum f_i}$$
$$\mu = A + \frac{\sum_{i=1}^{N} f_i d_i}{\sum f_i} = 25 + \frac{-10}{110} \approx 24.91$$
$$\mu = A + \frac{\sum_{i=1}^{N} f_i d_i}{\sum f_i} = 7 + \frac{32}{20} = 8.6$$
$$\mu = A + \frac{\sum_{i=1}^{N} f_i d_i}{\sum f_i} = 50 + \frac{580}{150} \approx 53.87$$
$$\mu = A + \frac{\sum_{i=1}^{N} f_i d_i}{\sum f_i} = 40 + \frac{360}{35} \approx 50.29$$
1.14.4.3 Step-Deviation Method
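1. Determine the Midpoint (Class Mark) for Each Class Interval: For each class interval, find the midpoint ($x_i$) using:
$$x_i = \frac{\text{Lower Limit} + \text{Upper Limit}}{2}$$
2. Choose an Assumed Mean (A): Select a midpoint value close to the center of the data as the assumed mean.
3. Calculate Deviations ($d_i$): Find the deviation of each midpoint from the assumed mean, $d_i = x_i - A$.
4. Select a Common Factor (h): Choose a step size based on the data; the class width is the usual choice when all classes have the same width.
5. Calculate Step Deviations ($u_i$): Divide each deviation by the step size, $u_i = d_i / h$.
6. Find the Sum of the Products: Multiply each step deviation by its frequency and add the products together, $\sum f_i u_i$.
7. Calculate the Mean:
• Arithmetic Mean for Sample:
$$\bar{x} = A + h \times \frac{\sum_{i=1}^{n} f_i u_i}{\sum f_i}$$
• Arithmetic Mean for Population:
$$\mu = A + h \times \frac{\sum_{i=1}^{N} f_i u_i}{\sum f_i}$$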
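The procedure above can be sketched in Python. The function name `grouped_step_deviation_mean` and the choices A = 65 and h = 10 are illustrative assumptions; the class intervals are the ones from the example in section 1.14.4.

```python
def grouped_step_deviation_mean(intervals, freqs, A, h):
    # Midpoint (class mark) of each class interval.
    mids = [(lo + hi) / 2 for lo, hi in intervals]
    # Step deviations of the midpoints from the assumed mean.
    u = [(m - A) / h for m in mids]
    sum_fu = sum(f * ui for f, ui in zip(freqs, u))
    return A + h * sum_fu / sum(freqs)


intervals = [(50, 60), (60, 70), (70, 80)]
freqs = [15, 10, 20]
result = grouped_step_deviation_mean(intervals, freqs, A=65, h=10)
print(round(result, 2))  # 66.11
```

The step deviations are −1, 0, and 1, so the sum of the products is 5 and the mean is 65 + 10 × 5/45, which matches the direct calculation 2975/45.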
$$\mu = A + h \times \frac{\sum_{i=1}^{N} f_i u_i}{\sum f_i} = 35 + 10 \times \frac{-4}{35} \approx 33.86$$
Example: Calculate average profit earned by 50 companies from the following data using Step Deviation
Method:
$$\mu = A + h \times \frac{\sum_{i=1}^{N} f_i u_i}{\sum f_i} = 50 + 20 \times \frac{-15}{50} = 44$$
1.14.5 Practice Questions
• Calculate the mean for the following set of data 2, 6, 7, 9, 15, 11, 13, 12
• If there are 5 observations, which are 27, 11, 17, 19, and 21 then find the mean
• Find the mean for the following sample data set: 6.4, 5.2, 7.9, 3.4
• Find the mean of 9, 6, -3, 2, -7, 1
• Find the mean of 5,10,15,20,25.
• Find the mean of the given data set: 10,20,30,40,50,60,70,80,90.
• Calculate the mean of the first 10 natural numbers.
• Find the mean of the first 10 even numbers.
• Find the mean of the first 10 odd numbers.
• The Mean of a series with 5 items is 40, and the values of four items are 35, 10, 65, 50. Find out the missing
5th item.
1.15 Geometric Mean
• The geometric mean is a measure of central tendency that is particularly useful when dealing with data that involves rates, ratios, percentages, or data that grows exponentially [16].
• The geometric mean is preferred when averaging percentage changes or ratios, which are derived from values, while the standard arithmetic mean works with the values themselves.
[16] https://www.youtube.com/watch?v=HuZdOvoK4hM
1.15.4 Geometric Mean for Individual Series
• The geometric mean is calculated for a set of n values by taking the nth root of the product of all n observed values.
• Before calculating this (geometric mean) measure of central tendency, note that:
– The geometric mean is meaningful only for positive values.
– If any value in the dataset is zero, the product, and therefore the geometric mean, is zero.
• There are two main steps to calculating the geometric mean:
– Multiply all values together to get their product.
– Find the nth root of the product (n is the number of values).
– Given a set of n positive numbers $x_1, x_2, x_3, \ldots, x_n$, the geometric mean (GM) is calculated as:
$$GM = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n} = \left( \prod_{i=1}^{n} x_i \right)^{1/n}$$
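A minimal Python sketch of the geometric mean for an individual series. The function name `geometric_mean_of` and the sample values are illustrative assumptions. It averages logarithms instead of multiplying directly, which is mathematically equivalent to the nth root of the product and avoids very large intermediate products.

```python
import math


def geometric_mean_of(values):
    # The geometric mean is only meaningful for positive values.
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean needs positive values")
    # exp(mean of logs) equals the n-th root of the product.
    return math.exp(sum(math.log(x) for x in values) / len(values))


result = geometric_mean_of([2, 8])
print(round(result, 6))  # 4.0
```

For the values 2 and 8 the product is 16 and its square root is 4.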
• While the arithmetic means show higher efficiency for Machine A, the geometric means show that Machine B is more efficient.
• The geometric mean is more accurate here because the arithmetic mean is skewed towards values that are higher than most of your dataset.
1.15.5 Geometric Mean for Discrete Series
• Steps to Calculate Geometric Mean:
1. Raise Each Value to the Power of Its Frequency: Raise each data value to the power of its corresponding frequency.
2. Multiply All the Resulting Terms: Take the product of all the values obtained in step 1.
3. Take the Nth Root: Take the Nth root of the product, where N is the sum of the frequencies.
• Given a set of data with frequencies, the formula for the geometric mean (GM) is:
$$GM = \left( \prod_{i=1}^{n} x_i^{f_i} \right)^{1/N} = \sqrt[N]{x_1^{f_1} \times x_2^{f_2} \times \cdots \times x_n^{f_n}}$$
where:
• $x_1, x_2, \ldots, x_n$ are the data values.
• $f_1, f_2, \ldots, f_n$ are the corresponding frequencies of these values.
• $N = \sum_{i=1}^{n} f_i$ is the total number of observations.
1.15.6 Geometric Mean for Continuous Series
• Steps to Calculate Geometric Mean:
– Calculate the Midpoint (Class Mark) for Each Class Interval:
$$x_i = \frac{\text{Lower Limit} + \text{Upper Limit}}{2}$$
– Raise Each Midpoint $x_i$ to the Power of Its Frequency $f_i$.
– Multiply All the Results: Take the product of all the values obtained in step 2.
– Take the Nth Root: Take the Nth root of the product, where N is the sum of the frequencies.
• The formula for the geometric mean for continuous series is:
$$GM = \left( \prod_{i=1}^{n} x_i^{f_i} \right)^{1/N} = \sqrt[N]{x_1^{f_1} \times x_2^{f_2} \times \cdots \times x_n^{f_n}}$$
where:
– $x_1, x_2, \ldots, x_n$ are the midpoints of the class intervals.
– $f_1, f_2, \ldots, f_n$ are the corresponding frequencies of the class intervals.
– $N = \sum_{i=1}^{n} f_i$ is the total number of observations.
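Step 1: Calculate the Midpoint
$$M_1 = \frac{1000 + 2000}{2} = 1500, \quad M_2 = \frac{2000 + 3000}{2} = 2500, \quad M_3 = \frac{3000 + 4000}{2} = 3500, \quad M_4 = \frac{4000 + 5000}{2} = 4500$$
Step 2: Raise Each Midpoint to the Power of Its Frequency and Multiply the Results (the frequencies are 3, 5, 4, and 2)
$$1500^3 \times 2500^5 \times 3500^4 \times 4500^2 \approx 1.0015 \times 10^{48}$$
Step 3: Take the Nth Root
$$N = 3 + 5 + 4 + 2 = 14$$
– Now, take the 14th root of the product:
$$GM = \sqrt[14]{1.0015 \times 10^{48}} \approx 2683.0$$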
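Because the product in this example is far too large to handle comfortably by hand, it is worth checking numerically. The sketch below is a hedged illustration: the function name `grouped_geometric_mean` is hypothetical, and it assumes the frequencies 3, 5, 4, and 2 belong to the classes 1000-2000, 2000-3000, 3000-4000, and 4000-5000 in that order. It uses logarithms, which is equivalent to taking the Nth root of the product.

```python
import math


def grouped_geometric_mean(intervals, freqs):
    # Midpoint (class mark) of each class interval.
    mids = [(lo + hi) / 2 for lo, hi in intervals]
    n_total = sum(freqs)
    # exp(sum(f_i * ln m_i) / N) equals the N-th root of prod(m_i ** f_i).
    log_sum = sum(f * math.log(m) for m, f in zip(mids, freqs))
    return math.exp(log_sum / n_total)


intervals = [(1000, 2000), (2000, 3000), (3000, 4000), (4000, 5000)]
freqs = [3, 5, 4, 2]
gm = grouped_geometric_mean(intervals, freqs)
print(round(gm))  # 2683
```

Under these assumptions the geometric mean comes out at about 2683.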
1.15.7 Geometric Mean vs Arithmetic Mean
• Use Cases:
– Arithmetic Mean:
* Used when data points are independent of each other.
* Commonly used for data that is additive in nature, such as total scores, sums, and averages in
daily use.
* Examples: Average income, average test scores, average temperature.
– Geometric Mean:
* Used when data points are interrelated, such as when dealing with rates of change, ratios, or
proportional growth.
* Commonly used in finance, economics, and population studies where growth rates, percentages,
or ratios are involved.
* Examples: Compound annual growth rate (CAGR), average growth rate of populations, invest-
ment returns over time.
• Sensitivity to Values:
– Arithmetic Mean:
* Highly sensitive to extreme values (outliers) as it directly sums all the values.
* An unusually high or low value can skew the mean.
– Geometric Mean:
* Less sensitive to extreme values since it multiplies values and takes the root, thereby diluting
the impact of outliers.
* Provides a better central tendency measure in skewed distributions, especially when dealing
with multiplicative processes.
• Mathematical Relationship:
– The geometric mean is always less than or equal to the arithmetic mean for any set of non-negative
data.
Geometric Mean ≤ Arithmetic Mean
– The only time they are equal is when all the data points are the same.
x1 = x2 = . . . = xn
1.15.8 When is the Geometric mean better than the Arithmetic mean?
• Multiplicative Processes or When Dealing with Growth Rates: If a stock grows by 20% in the first year,
30% in the second year, and declines by 10% in the third year, the geometric mean will give the true average
annual growth rate, considering the compounding effect.
• Non-Negative Data with Skewed Distributions:
– For datasets that are heavily skewed, especially with positive values, the geometric mean is less
influenced by extreme outliers than the arithmetic mean, providing a better central tendency measure.
– In a positively skewed distribution, there’s a cluster of lower scores and a spread-out tail on the right.
Income distribution is a common example of a skewed dataset.
– While most values tend to be low, the arithmetic mean is often pulled upward (or rightward) by high
values or outliers in a positively skewed dataset.
– Because the geometric mean tends to be lower than the arithmetic mean, it represents smaller values
better than the arithmetic mean.
• Normalizing Data: The geometric mean is useful in normalizing different data sets, especially when
comparing ratios or indices. For example, when comparing price indexes across different periods or regions,
the geometric mean helps in making the comparisons more meaningful.
• Consistent Measurement Across Different Scales or Combining Different Scales: When aggregating
data measured on different scales, the geometric mean helps in maintaining consistency, as it doesn’t
disproportionately weigh higher values.
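The stock example from the first point above (+20%, +30%, then −10%) can be worked through in Python. This is a minimal sketch; the function name `average_growth_rate` is an illustrative assumption.

```python
def average_growth_rate(rates):
    # Convert each rate to a growth factor, e.g. +20% -> 1.20, -10% -> 0.90,
    # and multiply the factors to get the overall growth.
    product = 1.0
    for r in rates:
        product *= 1 + r
    # The geometric mean of the factors, minus 1, is the constant yearly
    # rate that compounds to the same final value.
    return product ** (1 / len(rates)) - 1


rates = [0.20, 0.30, -0.10]
g = average_growth_rate(rates)
arithmetic = sum(rates) / len(rates)
print(f"{g:.2%} vs {arithmetic:.2%}")  # 11.98% vs 13.33%
```

The three growth factors multiply to 1.404, so a constant rate of about 11.98% per year reproduces the final value, whereas the arithmetic average of 13.33% would overstate it.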
1.16 Harmonic Mean
• The harmonic mean is one of the measures of central tendency that is particularly useful when the data set contains rates, ratios, or speeds (e.g., km/hr, km/liter, hours/semester, and tonnes/month) [17].
• Consistent with the Average Rate Concept: For example, the harmonic mean correctly calculates the average speed when traveling the same distance at different speeds.
• Useful in Finance and Economics: It is commonly used to calculate the average price-earnings (P/E) ratio or the average cost in dollar-cost averaging.
[17] https://www.youtube.com/watch?v=LK52iuIp84o
1.16.3 Applications of Harmonic Mean
• Speed and Time Problems: The harmonic mean is used to find the average speed of an object when it covers equal distances at different speeds. (When the object travels for equal time intervals instead, the arithmetic mean of the speeds is the correct average.)
• Finance: In finance, the harmonic mean is used to average multiples such as the price-earnings ratio (P/E ratio) across companies.
• Physics: The harmonic mean is used in situations involving rates of change, such as calculating average
densities, resistance in parallel circuits, or optical problems involving different mediums.
• Economics: It is used to calculate the average rate of growth or to analyze economic data where rates or
ratios are prevalent.
• Harmonic Mean in Decision Making: In multi-criteria decision analysis, the harmonic mean is used when
the aggregation of criteria favors a balance rather than the dominance of one criterion over others.
1.16.4 Harmonic Mean for Individual Series
• The harmonic mean is defined as the reciprocal of the arithmetic mean of the reciprocals of the data values. Equivalently, the total number of observations is divided by the sum of the reciprocals of all observations.
• Thus, the harmonic mean formula is given by
$$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
• For two values a and b (n = 2):
$$HM = \frac{2}{\frac{1}{a} + \frac{1}{b}} = \frac{2ab}{a + b}$$
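A minimal Python sketch of the harmonic mean for an individual series. The function name `hm_individual` and the two speeds are hypothetical; the result is compared with the standard library's `statistics.harmonic_mean`.

```python
from statistics import harmonic_mean


def hm_individual(values):
    # Number of observations divided by the sum of their reciprocals.
    return len(values) / sum(1 / x for x in values)


speeds = [40, 60]  # hypothetical speeds in km/h over two equal distances
result = hm_individual(speeds)
print(round(result, 2))  # 48.0
```

This agrees with the two-value formula: 2 × 40 × 60 / (40 + 60) = 48 km/h, which is lower than the arithmetic mean of 50 km/h because more time is spent on the slower leg.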
1.16.5 Harmonic Mean for a Discrete Series
• Given a set of n values $x_1, x_2, \ldots, x_n$ with corresponding frequencies $f_1, f_2, \ldots, f_n$, the harmonic mean (HM) can be calculated as:
$$HM = \frac{N}{\sum_{i=1}^{n} \frac{f_i}{x_i}}$$
where:
– $N = \sum_{i=1}^{n} f_i$ is the total number of observations (sum of all frequencies).
– $x_i$ is the ith data value.
– $f_i$ is the frequency of the ith data value.
• Steps to Calculate Harmonic Mean for a Discrete Series:
1. Calculate the Reciprocal of Each Value: For each data value $x_i$, find the reciprocal $\frac{1}{x_i}$.
2. Multiply Each Reciprocal by Its Frequency: For each value, multiply its reciprocal by the corresponding frequency $f_i$.
3. Sum the Results: Sum all the products obtained from step 2.
4. Divide the Total Number of Observations by the Sum of Reciprocals: Use the formula to calculate the harmonic mean.
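• Example: Suppose we have the following dataset representing the frequency of different response times (in seconds) of a computer system:
Response time in seconds (x) | 2 | 4 | 6 | 8
Frequency (f)                | 3 | 5 | 4 | 2
Step 1: Calculate the Reciprocal of Each Value:
$$\frac{1}{2} = 0.5, \quad \frac{1}{4} = 0.25, \quad \frac{1}{6} \approx 0.1667, \quad \frac{1}{8} = 0.125$$
Step 2: Multiply Each Reciprocal by Its Frequency:
3 × 0.5 = 1.5
5 × 0.25 = 1.25
4 × 0.1667 ≈ 0.6668
2 × 0.125 = 0.25
Step 3: Sum the Results:
1.5 + 1.25 + 0.6668 + 0.25 = 3.6668
Step 4: Divide the Total Number of Observations by the Sum of Reciprocals:
N = 3 + 5 + 4 + 2 = 14
$$HM = \frac{N}{\sum_{i=1}^{n} \frac{f_i}{x_i}} = \frac{14}{3.6668} \approx 3.82 \text{ seconds}$$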
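The response-time example can be checked with a short Python sketch. The function name `harmonic_mean_freq` is hypothetical; the values 2, 4, 6, 8 and frequencies 3, 5, 4, 2 are the ones used in the calculation above.

```python
def harmonic_mean_freq(values, freqs):
    # Total frequency N divided by the sum of f_i / x_i.
    return sum(freqs) / sum(f / x for x, f in zip(values, freqs))


times = [2, 4, 6, 8]   # response times in seconds
freqs = [3, 5, 4, 2]   # how often each response time occurred
hm = harmonic_mean_freq(times, freqs)
print(round(hm, 2))  # 3.82
```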
1.16.6 Harmonic Mean for a Continuous Series
• Calculating the harmonic mean for a continuous series (grouped frequency distribution) is an extension of
the process used for discrete series.
• In a continuous series, data is grouped into class intervals, and the harmonic mean is calculated by using the
midpoints of these intervals along with their corresponding frequencies.
• Formula for Harmonic Mean in a Continuous Series: Given a set of class intervals, the harmonic mean (HM) is calculated using the midpoints of the intervals and their frequencies. The formula is:
$$HM = \frac{N}{\sum_{i=1}^{n} \frac{f_i}{m_i}}$$
where:
– $N = \sum_{i=1}^{n} f_i$ is the total number of observations (sum of all frequencies).
– $f_i$ is the frequency of the ith class interval.
– $m_i$ is the midpoint of the ith class interval.
• Steps to Calculate Harmonic Mean for a Continuous Series
1. Find the Midpoint of Each Class Interval: For each class interval, calculate the midpoint ($m_i$) using the formula:
$$m_i = \frac{\text{Lower Limit} + \text{Upper Limit}}{2}$$
2. Calculate the Reciprocal of Each Midpoint: Find the reciprocal of each midpoint ($\frac{1}{m_i}$).
3. Multiply Each Reciprocal by Its Frequency: For each class interval, multiply the reciprocal of the midpoint by the corresponding frequency ($f_i$).
4. Sum the Results: Sum all the products obtained from step 3.
5. Divide the Total Number of Observations by the Sum of Reciprocals: Use the formula to calculate the harmonic mean.
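• Example: Suppose we have the following data representing the time taken (in minutes) by a group of students to complete a test, grouped into class intervals:
Time in minutes (class interval) | 10-20 | 20-30 | 30-40 | 40-50
Number of students (f)           |   5   |   8   |  12   |   6
Step 1: Find the Midpoint of Each Class Interval:
$$m_1 = \frac{10 + 20}{2} = 15, \quad m_2 = \frac{20 + 30}{2} = 25, \quad m_3 = \frac{30 + 40}{2} = 35, \quad m_4 = \frac{40 + 50}{2} = 45$$
Step 2: Calculate the Reciprocal of Each Midpoint:
$$\frac{1}{15} \approx 0.0667, \quad \frac{1}{25} = 0.04, \quad \frac{1}{35} \approx 0.0286, \quad \frac{1}{45} \approx 0.0222$$
Step 3: Multiply Each Reciprocal by Its Frequency:
5 × 0.0667 = 0.3335
8 × 0.04 = 0.32
12 × 0.0286 ≈ 0.3432
6 × 0.0222 ≈ 0.1332
Step 4: Sum the Results:
0.3335 + 0.32 + 0.3432 + 0.1332 = 1.1299
Step 5: Divide the Total Number of Observations by the Sum of Reciprocals:
N = 5 + 8 + 12 + 6 = 31
$$HM = \frac{31}{1.1299} \approx 27.44 \text{ minutes}$$
(about 27.45 minutes if the reciprocals are not rounded)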
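A minimal Python sketch of the same calculation without intermediate rounding. The function name `grouped_harmonic_mean` is hypothetical; the class intervals and frequencies are the ones used in the worked steps above.

```python
def grouped_harmonic_mean(intervals, freqs):
    # Midpoint of each class interval.
    mids = [(lo + hi) / 2 for lo, hi in intervals]
    # Total frequency divided by the sum of f_i / m_i.
    return sum(freqs) / sum(f / m for m, f in zip(mids, freqs))


intervals = [(10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 8, 12, 6]
hm = grouped_harmonic_mean(intervals, freqs)
print(round(hm, 2))  # 27.45
```

The unrounded result is about 27.45 minutes. Hand calculations that round each reciprocal to four decimals, or round the sum to 1.13, land a few hundredths lower.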
1.16.7 Weighted Harmonic Mean
• The weighted harmonic mean is an extension of the harmonic mean that accounts for the importance or
weight of each observation.
• It is particularly useful when different data points contribute unequally to the overall mean, where not all
observations have the same significance.
• Formula for Weighted Harmonic Mean: Given a set of values $x_1, x_2, \ldots, x_n$ with corresponding weights $w_1, w_2, \ldots, w_n$, the weighted harmonic mean (WHM) is calculated using the formula:
$$WHM = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} \frac{w_i}{x_i}}$$
where:
– $w_i$ is the weight associated with the ith value.
– $x_i$ is the ith data value.
• Steps to Calculate Weighted Harmonic Mean
1. Calculate the Reciprocal of Each Value: For each data value $x_i$, find the reciprocal $\frac{1}{x_i}$.
2. Multiply Each Reciprocal by Its Corresponding Weight: For each value, multiply its reciprocal by the corresponding weight $w_i$.
3. Sum the Results: Sum all the products obtained from step 2.
4. Divide the Sum of the Weights by the Sum of the Weighted Reciprocals: Use the formula to calculate the weighted harmonic mean.
• Example: Suppose we have the following dataset representing the time taken (in hours) by different machines to complete a task, with the number of tasks completed as the weight:
Time in hours (x)       | 2 | 4 | 6
Tasks completed (w)     | 3 | 5 | 2
Step 1: Calculate the Reciprocal of Each Value:
$$\frac{1}{2} = 0.5, \quad \frac{1}{4} = 0.25, \quad \frac{1}{6} \approx 0.1667$$
Step 2: Multiply Each Reciprocal by Its Corresponding Weight:
3 × 0.5 = 1.5
5 × 0.25 = 1.25
2 × 0.1667 ≈ 0.3334
Step 3: Sum the Results:
1.5 + 1.25 + 0.3334 = 3.0834
Step 4: Divide the Sum of the Weights by the Sum of the Weighted Reciprocals:
Sum of the weights = 3 + 5 + 2 = 10
$$WHM = \frac{10}{3.0834} \approx 3.24 \text{ hours}$$
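The machine example can be reproduced with a short Python sketch. The function name `weighted_harmonic_mean` is hypothetical; the values 2, 4, 6 and weights 3, 5, 2 are the ones used in the calculation above.

```python
def weighted_harmonic_mean(values, weights):
    # Sum of the weights divided by the sum of w_i / x_i.
    return sum(weights) / sum(w / x for x, w in zip(values, weights))


hours = [2, 4, 6]    # time per task for each machine
tasks = [3, 5, 2]    # number of tasks completed, used as weights
whm = weighted_harmonic_mean(hours, tasks)
print(round(whm, 2))  # 3.24
```

With all weights equal, the function reduces to the ordinary harmonic mean.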
1.17 Relation Between AM, GM, and HM
• For any set of positive numbers, the arithmetic mean is always greater than or equal to the geometric mean, which in turn is greater than or equal to the harmonic mean:
$$HM \le GM \le AM$$
• For two positive numbers a and b, the square of the geometric mean is equal to the product of the arithmetic mean and the harmonic mean:
$$HM \times AM = GM^2$$
This identity does not hold in general for more than two numbers.
• This relationship indicates a special balance between the different means: for two numbers, the geometric mean is the square root of the product of the arithmetic mean and the harmonic mean, which shows its centrality among these means.
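A small numerical check of these relations, sketched with Python's standard `statistics` module (Python 3.8 or later). The sample values are illustrative assumptions. The sketch shows the ordering HM ≤ GM ≤ AM, confirms that the product identity holds for a pair of numbers, and shows that it fails for one particular three-value set.

```python
from statistics import mean, geometric_mean, harmonic_mean

# Two positive numbers: the ordering and the product identity both hold.
pair = [4, 16]
am, gm, hm = mean(pair), geometric_mean(pair), harmonic_mean(pair)
print(am, round(gm, 6), round(hm, 6))  # 10 8.0 6.4

# Three positive numbers: the ordering still holds,
# but AM * HM differs from GM squared for this set.
triple = [1, 2, 3]
am3, gm3, hm3 = mean(triple), geometric_mean(triple), harmonic_mean(triple)
print(round(am3 * hm3, 4), round(gm3 ** 2, 4))
```

For the pair, AM × HM = 10 × 6.4 = 64 = 8², in line with (a + b)/2 × 2ab/(a + b) = ab.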
1.18 Median
• The median is a measure of central tendency that represents the middle value in a sorted, ascending or
descending, list of numbers.
• If the dataset contains an odd number of observations, the median is the middle number. If there is an even
number of observations, the median is typically calculated as the average of the two middle numbers.
1.18.2 Limitations of the Median
• Not Suitable for All Types of Data: The median is not appropriate for nominal data (data that cannot be
ordered or ranked) because it requires a sense of order among the values.
• Less Sensitive to Data Changes: The median is less sensitive to small changes in the data than the mean.
For example, changing any value that is not at the median position does not affect the median, while it may
affect the mean.
• Does Not Utilize All Data Points: The median only considers the middle value(s), ignoring the actual
values of all other observations. Therefore, it may not provide a comprehensive picture of the dataset,
especially when the distribution is complex or multimodal.
• Difficult to Use in Mathematical Calculations: Unlike the mean, which can be easily used in further
statistical calculations (like variance and standard deviation), the median does not lend itself to further
mathematical analysis as easily.
• Less Informative for Small Sample Sizes: In very small datasets, the median might not provide as clear an
insight into central tendency as it does in larger datasets because the middle value might not be representative
of the overall trend.
1.18.4 Median for Individual Series
To calculate the median for an individual series (ungrouped or raw data series), follow these steps:
– If n is odd: Median = value at position $\frac{n+1}{2}$
– If n is even: Median = average of the values at positions $\frac{n}{2}$ and $\frac{n}{2} + 1$
• Example: Odd Number of Observations
Consider the following individual series of data:
7, 3, 5, 9, 1
1. Arrange in Ascending Order:
1, 3, 5, 7, 9
2. Count the Observations (n):
n = 5 (Odd number)
3. Find the Median Position:
$$\text{Median position} = \frac{n+1}{2} = \frac{5+1}{2} = 3$$
4. Identify the Median Value:
The 3rd value in the ordered list is 5.
• Example: Even Number of Observations
Consider the following individual series of data: 8, 3, 4, 10
1. Arrange in Ascending Order:
3, 4, 8, 10
2. Count the Observations (n):
n = 4 (even number)
3. Find the Median Position:
Median position = average of the $\frac{n}{2}$-th and $\left(\frac{n}{2} + 1\right)$-th values
$$\frac{n}{2} = \frac{4}{2} = 2 \quad \text{and} \quad \frac{n}{2} + 1 = 3$$
4. Identify the Median Value:
The 2nd value is 4 and the 3rd value is 8.
$$\text{Median} = \frac{4 + 8}{2} = 6$$
Therefore, the median is 6.
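The odd and even cases above can be sketched in a few lines of Python. The function name `median_individual` is a hypothetical helper introduced here for illustration; the two calls reuse the data from the worked examples.

```python
def median_individual(values):
    """Median of an individual (raw, ungrouped) series."""
    ordered = sorted(values)              # Step 1: arrange in ascending order
    n = len(ordered)                      # Step 2: count the observations
    if n == 0:
        raise ValueError("median of an empty series is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th value (1-based) is index mid (0-based)
        return ordered[mid]
    # Even n: average of the (n/2)-th and (n/2 + 1)-th values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_individual([7, 3, 5, 9, 1]))   # odd example from the notes
print(median_individual([8, 3, 4, 10]))     # even example from the notes
```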
• Example:
– Step 1: Consider the data: 4, 4, 6, 3, and 2. Let’s arrange this data in ascending order: 2, 3, 4, 4, 6.
– Step 2: Count the number of values. There are 5 values.
– Step 3: Look for the middle value. The middle value is the median. Thus, median = 4.
• Example: The ages of the members of a weekend poker team are listed below. Find the median of the set.
42, 40, 50, 60, 35, 58, 32
1.18.5 Median for Discrete Series
Calculating the median for a discrete series (ungrouped frequency distribution) involves determining the middle
value of the data, taking into account the frequency of each data point. Here’s a step-by-step guide:
N = ∑ f = 3 + 5 + 4 + 2 + 1 = 15
The median position is given by:
$$\text{Median position} = \frac{N + 1}{2}$$
In this example:
$$\text{Median position} = \frac{15 + 1}{2} = 8$$
4. Locate the Median Class:
Look at the cumulative frequency column to find where the median position falls. Identify the first cumulative
frequency that is equal to or greater than the median position.
In this example, the 8th position falls within the cumulative frequency of 8, which corresponds to the data
value x = 4.
5. Determine the Median Value:
The median is the data value corresponding to the median position. Based on the cumulative frequency, the
median position of 8 falls under the data value x = 4.
Therefore, the median is 4.
Example:
N = 2 + 3 + 5 + 4 + 1 = 15
$$\text{Median position} = \frac{15 + 1}{2} = 8$$
3. Locate the Median Class:
The 8th position is under the cumulative frequency of 10, corresponding to x = 15.
4. Determine the Median:
The median is 15.
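The cumulative-frequency lookup can be sketched as below. This is only an illustration: the helper name `median_discrete` is hypothetical, and because the data table for this example is not shown, the values 5, 10, 15, 20, 25 are an assumption chosen to be consistent with the stated frequencies and the result x = 15.

```python
from itertools import accumulate

def median_discrete(values, freqs):
    """Median of a discrete series given distinct values and their frequencies."""
    pairs = sorted(zip(values, freqs))            # order by value
    xs = [x for x, _ in pairs]
    cum = list(accumulate(f for _, f in pairs))   # cumulative frequencies
    n = cum[-1]                                   # N = sum of frequencies

    def value_at(position):
        # First value whose cumulative frequency reaches the (1-based) position
        for x, cf in zip(xs, cum):
            if cf >= position:
                return x

    if n % 2 == 1:
        return value_at((n + 1) // 2)
    # Even N: average of the (N/2)-th and (N/2 + 1)-th observations
    return (value_at(n // 2) + value_at(n // 2 + 1)) / 2

# Assumed x values; the frequencies 2, 3, 5, 4, 1 are taken from the example
print(median_discrete([5, 10, 15, 20, 25], [2, 3, 5, 4, 1]))
```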
1.18.6 Median for Continuous Series
Calculating the median for a continuous series (grouped frequency distribution) involves finding the value that
divides the data into two equal parts, taking into account the frequencies and class intervals. Here’s a detailed guide:
N = ∑ f = 5 + 8 + 12 + 7 + 3 = 35
The median position is given by:
$$\text{Median position} = \frac{N}{2}$$
In this example:
$$\text{Median position} = \frac{35}{2} = 17.5$$
4. Locate the Median Class:
Identify the class interval where the cumulative frequency is greater than or equal to the median position.
This class interval is called the median class.
In this example, the cumulative frequency first exceeds 17.5 at 25, which falls in the class interval 20 − 30.
5. Apply the Median Formula:
Use the following formula to calculate the median:
$$\text{Median} = L + \left(\frac{\frac{N}{2} - cF}{f_m}\right) \times C$$
where:
• $L$ = lower boundary of the median class
• $cF$ = cumulative frequency of the class preceding the median class
• $f_m$ = frequency of the median class
• $C$ = class width (size of the class interval)
For the class interval 20 − 30:
• $L = 20$
• $cF = 13$ (cumulative frequency of the class before the median class)
• $f_m = 12$ (frequency of the median class)
• $C = 10$ (width of each class interval)
Substituting the values into the formula:
$$\text{Median} = 20 + \frac{17.5 - 13}{12} \times 10 = 20 + \frac{4.5}{12} \times 10 = 20 + (0.375) \times 10 = 20 + 3.75 = 23.75$$
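A minimal sketch of the same interpolation in Python follows. The helper name `median_grouped` is hypothetical, and the class lower boundaries 0, 10, 20, 30, 40 are an assumption inferred from the median class 20 − 30 and the class width of 10, since the table itself is not shown.

```python
def median_grouped(lower_bounds, width, freqs):
    """Median of a continuous series with equal-width classes."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("total frequency must be positive")
    half = n / 2                        # median position N/2
    cum_before = 0                      # cF: cumulative frequency before the class
    for lower, f in zip(lower_bounds, freqs):
        if cum_before + f >= half:      # first class whose cumulative frequency reaches N/2
            return lower + (half - cum_before) / f * width
        cum_before += f

# Assumed class boundaries; frequencies 5, 8, 12, 7, 3 come from the example
print(median_grouped([0, 10, 20, 30, 40], 10, [5, 8, 12, 7, 3]))
```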
Example: Let’s use another example to solidify the concept:
N = 6 + 11 + 15 + 8 + 5 = 45
$$\text{Median position} = \frac{45}{2} = 22.5$$
3. Locate the Median Class:
The 22.5th position is under the cumulative frequency of 32, corresponding to the class interval 25 − 35.
4. Apply the Median Formula:
$$\text{Median} = L + \left(\frac{\frac{N}{2} - cF}{f_m}\right) \times C = 25 + \frac{22.5 - 17}{15} \times 10 = 25 + \frac{5.5}{15} \times 10 = 25 + 3.67 \approx 28.67$$
The median = 45.71
1.19 Mode
• Mode is a measure of central tendency that identifies the most frequently occurring value in a dataset.
• Mode: The mode of a dataset is the value (or values) that appear most frequently. A dataset may have
one mode (unimodal), more than one mode (bimodal or multimodal), or no mode at all (if all values are
unique).
• Example:
– Single Mode (Unimodal):
* Data: 2, 4, 4, 4, 5, 6, 7
* Mode: 4 (since 4 appears most frequently)
– Two Modes (Bimodal):
* Data: 1, 2, 3, 4, 4, 5, 5, 6
* Modes: 4 and 5 (both appear twice, more than any other values)
– Multiple Modes (Multimodal):
* Data: 1, 2, 2, 3, 3, 4, 4
* Modes: 2, 3, and 4 (each appears twice)
– No Mode:
* Data: 1, 2, 3, 4, 5
* Mode: None (no value repeats)
1.19.1 Advantages of Mode
• Simplicity and Ease of Calculation: The mode is straightforward to identify and calculate, especially in
small or ordered datasets.
• Applicable to Categorical Data: The mode is especially useful for qualitative (categorical or nominal)
data, where arithmetic operations such as calculating a mean are not meaningful. For example, it identifies
the most common category (e.g., favorite color, most purchased product).
• Not Affected by Extreme Values: Since the mode is based on frequency, it is not influenced by extreme
values (outliers), which can skew the mean.
1.19.4 Mode for Individual Series
Steps to Calculate the Mode for Individual Series:
1. Organize the Data: Arrange the data values in ascending or descending order to facilitate the identification
of frequencies.
2. Count the Frequency: Determine the frequency of each value (i.e., how many times each value appears in
the dataset).
3. Identify the Mode: The mode is the value with the highest frequency. If multiple values share the highest
frequency, the dataset is multimodal (i.e., has more than one mode).
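The counting steps above can be sketched with `collections.Counter`. The helper name `modes` is hypothetical, and returning an empty list when no value repeats follows the "no mode" convention used in these notes rather than any universal rule.

```python
from collections import Counter

def modes(values):
    """Return the list of modes of an individual series (empty if no value repeats)."""
    counts = Counter(values)            # Step 2: frequency of each value
    if not counts:
        return []
    top = max(counts.values())          # Step 3: highest frequency
    if top == 1:
        return []                       # every value is unique: no mode
    return sorted(x for x, c in counts.items() if c == top)

print(modes([2, 4, 4, 4, 5, 6, 7]))     # unimodal
print(modes([1, 2, 3, 4, 4, 5, 5, 6]))  # bimodal
print(modes([1, 2, 3, 4, 5]))           # no mode
```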
Example 1: Unimodal Series (Single Mode)
3. Identify the Mode: The mode is 85 because it appears 3 times, which is more frequent than any
other value.
• Result: Mode = 85
Example 2: Bimodal Series (Two Modes)
3. Identify the Mode: The mode values are 12 and 15, each appearing 3 times.
• Result: 12 and 15 (bimodal)
Example 3: No Mode
3. Identify the Mode: Since each value appears only once, there is no mode.
• Result: No Mode (all values have the same frequency)
1.19.5 Mode for Discrete Series
Steps to Calculate the Mode for Discrete Series
1. List the Values and Their Frequencies: Start by organizing the data into a table format, showing each
distinct value and its corresponding frequency.
2. Identify the Highest Frequency: Look for the highest frequency in the list. The value corresponding to
this highest frequency is the mode.
3. Check for Multiple Modes: If more than one value shares the highest frequency, the series is multimodal,
meaning it has multiple modes.
Example:
Number of Sales (x) Frequency (f)
5 2
6 3
7 6
8 8
9 5
10 4
1. Organize the Data:
The data is already organized into a table with the number of sales (x) and their corresponding frequencies
( f ).
2. Identify the Highest Frequency:
From the table, the highest frequency is 8.
3. Find the Value Corresponding to the Highest Frequency:
The number of sales corresponding to the highest frequency (8) is 8.
Result:
Mode = 8
(The most frequent number of sales made by a representative is 8.)
Example:
Number of Pets (x) Frequency (f)
0 3
1 5
2 7
3 7
4 2
5 1
1. Organize the Data:
The data is already listed with the number of pets (x) and their frequencies ( f ).
2. Identify the Highest Frequency:
The highest frequency here is 7.
3. Find the Values Corresponding to the Highest Frequency:
The numbers of pets corresponding to the highest frequency (7) are 2 and 3.
Result:
Modes = 2 and 3
(The most common number of pets owned by families is either 2 or 3. This dataset is bimodal.)
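For a frequency table, the same idea reduces to picking the value or values with the largest frequency. A minimal sketch, with `modes_discrete` as a hypothetical helper name and the two tables above as input:

```python
def modes_discrete(values, freqs):
    """Return all values that share the highest frequency in a discrete series."""
    top = max(freqs)                                  # highest frequency
    return [x for x, f in zip(values, freqs) if f == top]

# Number of sales example
print(modes_discrete([5, 6, 7, 8, 9, 10], [2, 3, 6, 8, 5, 4]))
# Number of pets example (two values share the highest frequency)
print(modes_discrete([0, 1, 2, 3, 4, 5], [3, 5, 7, 7, 2, 1]))
```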
1.19.6 Mode for Continuous Series
Steps to Calculate the Mode for Continuous Series:
1. Identify the Modal Class: The modal class is the class interval with the highest frequency.
2. Use the Mode Formula for Grouped Data: Once the modal class is identified, use the following formula
to calculate the mode:
$$\text{Mode} = L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times h$$
where:
• L = Lower boundary of the modal class
• fm = Frequency of the modal class
• fm−1 = Frequency of the class preceding the modal class
• fm+1 = Frequency of the class succeeding the modal class
• h = Width of the class intervals (assuming all classes have the same width)
Example:
Class Interval (Scores) Frequency (f)
0 − 10 5
10 − 20 8
20 − 30 12
30 − 40 20
40 − 50 10
50 − 60 6
1. Identify the Modal Class: The class interval with the highest frequency is the modal class. In this case,
the modal class is 30-40 (with a frequency of 20).
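2. Calculate the Mode Using the Formula: Here L = 30, fm = 20, fm−1 = 12, fm+1 = 10 and h = 10, so
$$\text{Mode} = 30 + \frac{20 - 12}{(20 - 12) + (20 - 10)} \times 10 = 30 + \frac{8}{18} \times 10 \approx 34.44$$
Result: Mode ≈ 34.44
(The most frequent score is estimated to be around 34.44.)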
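The grouped-data formula can be sketched as follows. The helper name `mode_grouped` is hypothetical; treating a missing neighbouring class as having frequency 0, and picking the first class when several share the highest frequency, are assumptions of this sketch rather than rules stated in the notes.

```python
def mode_grouped(lower_bounds, width, freqs):
    """Mode of a continuous series with equal-width classes."""
    m = max(range(len(freqs)), key=freqs.__getitem__)    # index of the modal class
    f_m = freqs[m]
    f_prev = freqs[m - 1] if m > 0 else 0                # assumed 0 if no preceding class
    f_next = freqs[m + 1] if m + 1 < len(freqs) else 0   # assumed 0 if no succeeding class
    d1, d2 = f_m - f_prev, f_m - f_next
    if d1 + d2 == 0:
        raise ValueError("formula is undefined when all three frequencies are equal")
    return lower_bounds[m] + d1 / (d1 + d2) * width

# Scores example: classes 0-10, ..., 50-60
print(round(mode_grouped([0, 10, 20, 30, 40, 50], 10, [5, 8, 12, 20, 10, 6]), 2))
```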
1.19.7 Relationship Between Mean, Median, and Mode
4. Empirical Relationship (Approximation):
• In many practical scenarios, especially for moderately skewed distributions, the mean, median, and mode
are approximately related by the following empirical formula:
$$\text{Mode} \approx 3 \times \text{Median} - 2 \times \text{Mean}$$
1.20 Measure of Dispersion
• While the measures of central tendency (like mean, median, and mode) provide information about the central
point of a dataset, measures of dispersion tell us how much the data values deviate from this central point.
• Measures of dispersion are statistical tools used to describe how spread out or scattered the data
is around an average value. They help to understand whether the data points are close together or far apart.
• They measure the scattering of the data, that is, how the values are distributed in the data set.
• They help in understanding the degree of variability and the reliability of the central tendency.
• Dispersion shows the variability or consistency in a set of data.
1.20.4 Types of Measures of Dispersion
Measures of dispersion can be classified into the following two types:
1.20.4.1 Absolute Measure of Dispersion
• The measures of dispersion that measure and express the amount of variation in a dataset in the units of data
themselves are called Absolute Measure of Dispersion.
• They provide a direct and specific measure of the spread of values.
• Advantages of Absolute Measures:
– Easy Interpretation: Since they are in the same units as the data, they are easy to understand and
interpret.
– Direct Measurement: They provide a direct measure of the spread of data.
• Limitations of Absolute Measures:
– Unit Dependency: These measures are dependent on the units of measurement, which makes
comparisons across different datasets with different units challenging.
– Not Scaled: They do not provide relative comparisons or standardized measures of variability.
• Some absolute measures of dispersion are:
– Range
– Mean Deviation
– Standard Deviation
– Variance
– Quartile Deviation
– Interquartile Range
• The corresponding coefficients are unit-free ratios, so they are relative rather than absolute measures of
dispersion:
– Coefficient of Range
– Coefficient of Variation
– Coefficient of Mean Deviation
– Coefficient of Quartile Deviation
1.21 Range
• The range is the simplest measure of dispersion, calculated as the difference between the maximum and
minimum values in a dataset.
• It provides a quick snapshot of the extent to which the data varies but does not provide information about
the distribution of values within the range.
1.21.1 Range for Individual Series
• The range is the difference between the largest and the smallest values in the distribution.
• How to Calculate Range?
1. Identify the maximum value (the largest value) in your dataset.
2. Identify the minimum value (the smallest value) in your dataset.
3. Subtract the minimum value from the maximum value to find the range.
1.21.2 Range for Discrete Series
• In the context of a discrete series, which typically represents data with specific values and their corresponding
frequencies, the range is still calculated as the difference between the maximum and minimum values.
• These values are the highest and lowest data points in the series, not the frequencies themselves.
• The frequencies (how often each value occurs) do not affect the calculation of the range. The range is
purely based on the extreme values in the data.
• Steps to Calculate Range for a Discrete Series
1. Identify the Maximum Value: Find the highest value in the dataset (the maximum).
2. Identify the Minimum Value: Find the lowest value in the dataset (the minimum).
3. Calculate the Range: Subtract the minimum value from the maximum value.
1.21.3 Range for Continuous Series
• In a continuous series, the data is grouped into intervals or classes. To calculate the range for such a series,
we focus on the class boundaries rather than individual data points.
• The range in a continuous series is the difference between the upper boundary of the highest class and the
lower boundary of the lowest class.
• Steps to Calculate Range for a Continuous Series:
1. Identify the Upper Boundary of the Highest Class: This is the maximum value of the dataset,
represented by the upper limit of the last class interval.
2. Identify the Lower Boundary of the Lowest Class: This is the minimum value of the dataset,
represented by the lower limit of the first class interval.
3. Calculate the Range: Subtract the lower boundary of the lowest class from the upper boundary of
the highest class.
Range = Upper Limit of the Last Class Interval − Lower Limit of First Class Interval
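The three range calculations (individual, discrete and continuous series) can be sketched side by side. The function names and all sample data below are illustrative assumptions, not data from the notes.

```python
def range_individual(values):
    """Range of raw data: maximum minus minimum."""
    return max(values) - min(values)

def range_discrete(values, freqs):
    """Range of a discrete series: frequencies are ignored, except that
    values with zero frequency are treated as absent."""
    present = [x for x, f in zip(values, freqs) if f > 0]
    return max(present) - min(present)

def range_continuous(class_intervals):
    """Range of a continuous series: upper limit of the last class
    minus lower limit of the first class."""
    lowers = [lo for lo, _ in class_intervals]
    uppers = [hi for _, hi in class_intervals]
    return max(uppers) - min(lowers)

# Hypothetical data for illustration
print(range_individual([7, 3, 5, 9, 1]))
print(range_discrete([5, 6, 7, 8, 9, 10], [2, 3, 6, 8, 5, 4]))
print(range_continuous([(0, 10), (10, 20), (20, 30), (30, 40)]))
```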
1.21.4 Advantages of Range
• Simplicity: The range is easy to understand and calculate. It gives a quick sense of the variability in the
dataset.
• Quick Comparison: It is useful for making quick comparisons between the spread of different datasets.
• Initial Insight: Provides an initial idea about the variability or spread of the data, which can be useful in
exploratory data analysis.
• Limitations of Range:
– Sensitive to Extreme Values: Because it depends only on the two extreme values, the range can indicate
much more variability than there actually is. A dataset can have a large range even though most values
are clustered around a clear middle.
– No Information About Distribution: It does not provide any information about how the values are
distributed within the range. Two datasets can have the same range but different distributions.
– Not Robust: The range is not a robust measure of dispersion because it only considers the extreme values
and ignores the rest of the data.
1.21.6 Applications of Range
• Quality Control: In manufacturing, the range can be used to monitor the consistency of product dimensions,
weights, or other characteristics.
• Weather Reports: The range is often used in weather reports to indicate the difference between the highest
and lowest temperatures recorded in a day or a specific period.
• Finance: In finance, the range can be used to measure the volatility of stock prices over a given period.
• Education: Teachers and educational researchers use the range to analyze the spread of students’ scores in
tests and exams.
1.22 Mean Deviation (MD)
• Mean deviation is used to show how far the observations are situated from the central point of the data (the
central point can be the mean, median or mode).
• We define the mean deviation of a given data distribution as the mean of the absolute deviations of the
observations from a suitable central value. This central value can be any one measure of central tendency
of the data: the mean, the median, or the mode.
• Steps to Calculate Mean Deviation:
1. Calculate the Central Value: Determine whether you are calculating the MD around the Mean ,
Median or Mode and calculate that value.
2. Find Absolute Deviations: Compute the absolute deviations of each data point from the chosen
central value.
3. Sum the Absolute Deviations: Add up all the absolute deviations.
4. Calculate the Mean Deviation: Divide the sum of absolute deviations by the number of observations.
• Why absolute values? Some deviations from the central value are positive and some are negative. If they are
added with their signs, they tend to cancel each other out (about the mean, they sum to exactly zero), so the
sum reveals little about the spread.
1.22.1 Mean Deviation for Individual Series
For an individual series, the Mean Deviation can be calculated around the Mean ($\bar{x}$) (Section 1.14.2),
Median ($M$) (Section 1.18.4) or Mode ($M_o$) (Section 1.19.4). The formulas are as follows:
1. Mean Deviation about the Mean:
$$\text{Mean Deviation (MD)} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$
2. Mean Deviation about the Median:
$$\text{Mean Deviation (MD)} = \frac{\sum_{i=1}^{n} |x_i - M|}{n}$$
3. Mean Deviation about the Mode:
$$\text{Mean Deviation (MD)} = \frac{\sum_{i=1}^{n} |x_i - M_o|}{n}$$
where:
• $n$ = number of observations
• $x_i$ = each individual observation
• $\bar{x}$ = mean of the data
• $M$ = median of the data
• $M_o$ = mode of the data
• $|x_i - \bar{x}|$, $|x_i - M|$ or $|x_i - M_o|$ = absolute deviation from the mean, median or mode
1.22.1.1 Mean Deviation around Mean for Individual Series
Consider a dataset representing the number of hours students study per day:
{2, 3, 4, 5, 7}
• Step 1: Calculate the Mean
$$\bar{x} = \frac{2 + 3 + 4 + 5 + 7}{5} = \frac{21}{5} = 4.2$$
• Step 2: Find the Absolute Deviations from the Mean
|2 − 4.2| = 2.2
|3 − 4.2| = 1.2
|4 − 4.2| = 0.2
|5 − 4.2| = 0.8
|7 − 4.2| = 2.8
• Step 3: Calculate the Mean Deviation
$$\text{Mean Deviation} = \frac{2.2 + 1.2 + 0.2 + 0.8 + 2.8}{5} = \frac{7.2}{5} = 1.44$$
Explanation: The mean deviation of 1.44 hours indicates that, on average, each student’s study time deviates
from the mean study time (4.2 hours) by about 1.44 hours. This provides a sense of how much the study habits vary
among the students.
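The calculation above can be reproduced with a short Python sketch. The helper name `mean_deviation` is hypothetical; the central value is passed in explicitly so that the same function works for the mean, median or mode.

```python
from statistics import mean, median, mode

def mean_deviation(values, center):
    """Mean of the absolute deviations of the values from a chosen central value."""
    return sum(abs(x - center) for x in values) / len(values)

hours = [2, 3, 4, 5, 7]                                  # study-hours example
print(round(mean_deviation(hours, mean(hours)), 2))      # about the mean
print(round(mean_deviation(hours, median(hours)), 2))    # about the median
```

Rounding is applied only for display, because sums of decimal fractions such as 2.2 and 1.2 are not exact in floating point.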
1.22.1.2 Mean Deviation around Median for Individual Series
1.22.1.3 Mean Deviation around Mode for Individual Series
• Consider a dataset representing the number of items sold by a store each day:
{3, 3, 4, 4, 4, 5, 6}
• Step 1: Find the Mode
The mode (Mo) is the most frequently occurring value:
Mo = 4
• Step 2: Find the Absolute Deviations from the Mode
|3 − 4| = 1
|3 − 4| = 1
|4 − 4| = 0
|4 − 4| = 0
|4 − 4| = 0
|5 − 4| = 1
|6 − 4| = 2
• Step 3: Calculate the Mean Deviation
$$\text{Mean Deviation} = \frac{1 + 1 + 0 + 0 + 0 + 1 + 2}{7} = \frac{5}{7} \approx 0.71$$
1.22.2 Mean Deviation for Discrete Series
The general formula for Mean Deviation (MD) around a central value (C) (Mean - Section 1.14.3, Median - 1.18.5 ,
Mode - 1.19.5) is:
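$$\text{Mean Deviation (MD)} = \frac{\sum |x_i - C| \times f_i}{\sum f_i}$$
where $x_i$ is each distinct value, $f_i$ is its frequency, and $C$ is the chosen central value.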
1.22.2.1 Mean Deviation around the Mean for Discrete Series
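1. Calculate the Mean ($\bar{x}$):
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i f_i}{\sum_{i=1}^{n} f_i}$$
2. Find Absolute Deviations from the Mean: Calculate $|x_i - \bar{x}|$ for each observation $x_i$.
3. Calculate the Mean Deviation:
$$MD = \frac{\sum |x_i - \bar{x}| \times f_i}{\sum f_i}$$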
Example:
xi (Value) fi (Frequency)
2 3
4 5
6 4
8 2
1. Calculate the Mean
$$\bar{x} = \frac{(2 \times 3) + (4 \times 5) + (6 \times 4) + (8 \times 2)}{3 + 5 + 4 + 2} = \frac{6 + 20 + 24 + 16}{14} = \frac{66}{14} \approx 4.71$$
2. Find Absolute Deviations from the Mean
xi fi |xi − x| |xi − x| × fi
2 3 |2 − 4.71| = 2.71 2.71 × 3 = 8.13
4 5 |4 − 4.71| = 0.71 0.71 × 5 = 3.55
6 4 |6 − 4.71| = 1.29 1.29 × 4 = 5.16
8 2 |8 − 4.71| = 3.29 3.29 × 2 = 6.58
3. Mean Deviation Around Mean:
$$MD = \frac{8.13 + 3.55 + 5.16 + 6.58}{14} = \frac{23.42}{14} \approx 1.67$$
1.22.2.2 Mean Deviation around the Median for Discrete series
1. Find the Median: For a discrete series, use the cumulative frequencies to locate the median value
(Section 1.18.5).
2. Find Absolute Deviations from the Median: Calculate $|x_i - M|$ for each observation $x_i$.
3. Calculate the Mean Deviation:
$$MD = \frac{\sum |x_i - M| \times f_i}{\sum f_i}$$
Example:
1. Find the Median: Since there are 14 total observations, the median position is the
$\frac{14+1}{2} = 7.5$th observation, i.e. the average of the 7th and 8th observations.
xi fi Cumulative Frequency
2 3 3
4 5 8
6 4 12
8 2 14
2. Both the 7th and the 8th observations fall under the cumulative frequency of 8, so the median is M = 4.
xi fi |xi − M| |xi − M| × fi
2 3 |2 − 4| = 2 2 × 3 = 6
4 5 |4 − 4| = 0 0 × 5 = 0
6 4 |6 − 4| = 2 2 × 4 = 8
8 2 |8 − 4| = 4 4 × 2 = 8
3. Mean Deviation Around Median:
$$MD = \frac{6 + 0 + 8 + 8}{14} = \frac{22}{14} \approx 1.57$$
1.22.2.3 Mean Deviation around the Mode for Discrete Series
1. Identify the Mode: The mode is the value that appears most frequently in the series.
2. Calculate Absolute Deviations from the Mode: For each observation xi , calculate the absolute deviation
from the mode |xi − Mo|.
3. Calculate the Mean Deviation:
$$MD_{Mode} = \frac{\sum |x_i - M_o| \times f_i}{\sum f_i}$$
Where:
• $x_i$ = each individual observation
• $f_i$ = frequency of each observation
• $M_o$ = the modal value
• $|x_i - M_o|$ = absolute deviation from the mode
• $\sum f_i$ = total number of observations (sum of frequencies)
Example:
xi (Value) fi (Frequency)
2 3
4 5
6 4
8 2
1. Step 1: Identify the Mode
The mode is the value with the highest frequency. Here, the mode is Mode = 4 because it has the highest
frequency (5).
2. Step 2: Calculate Absolute Deviations from the Mode
Calculate |xi − 4| for each value xi :
xi fi |xi − 4|
2 3 |2 − 4| = 2
4 5 |4 − 4| = 0
6 4 |6 − 4| = 2
8 2 |8 − 4| = 4
3. Step 3: Multiply by Frequencies
Now multiply the absolute deviations by the corresponding frequencies:
xi fi |xi − 4| |xi − 4| × fi
2 3 2 2×3 = 6
4 5 0 0×5 = 0
6 4 2 2×4 = 8
8 2 4 4×2 = 8
4. Step 4: Calculate the Mean Deviation Around the Mode: Finally, sum all the products and divide by the
total frequency:
$$MD = \frac{6 + 0 + 8 + 8}{3 + 5 + 4 + 2} = \frac{22}{14} \approx 1.57$$
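The frequency-weighted calculation used throughout this subsection can be sketched in one small function. The helper name `mean_deviation_discrete` is hypothetical; the data is the value and frequency table used in the examples above.

```python
def mean_deviation_discrete(values, freqs, center):
    """Frequency-weighted mean of absolute deviations from a central value."""
    total = sum(freqs)                                        # sum of f_i
    weighted = sum(f * abs(x - center) for x, f in zip(values, freqs))
    return weighted / total

values = [2, 4, 6, 8]
freqs = [3, 5, 4, 2]

# About the mode (and median), which is 4 for this table
print(round(mean_deviation_discrete(values, freqs, 4), 2))

# About the mean, computed from the same table
x_bar = sum(x * f for x, f in zip(values, freqs)) / sum(freqs)
print(round(mean_deviation_discrete(values, freqs, x_bar), 2))
```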
1.22.3 Mean Deviation for Continuous Series
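The central values Mean, Median and Mode for a continuous series can be reviewed in Section 1.14.4, Section 1.18.6,
and Section 1.19.6 respectively.
3. Calculate the Absolute Deviations from the Mean:
$$|x_i - \bar{x}|$$
where $x_i$ is the midpoint of each class interval.
4. Multiply Absolute Deviations by Frequency:
$$f_i |x_i - \bar{x}|$$
5. Calculate the Mean Deviation Around the Mean:
$$MD_{\bar{x}} = \frac{\sum f_i |x_i - \bar{x}|}{\sum f_i}$$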
Example:
Consider the following frequency distribution:
$$\bar{x} = \frac{5 \times 5 + 8 \times 15 + 12 \times 25 + 10 \times 35 + 5 \times 45}{40} = \frac{1020}{40} = 25.5$$
3. Absolute deviations and their products with frequency:
xi     fi     |xi − x̄|     fi |xi − x̄|
5      5      20.5         102.5
15     8      10.5         84
25     12     0.5          6
35     10     9.5          95
45     5      19.5         97.5
4. Calculate Mean Deviation Around the Mean:
$$MD_{\bar{x}} = \frac{102.5 + 84 + 6 + 95 + 97.5}{40} = \frac{385}{40} = 9.625$$
1.22.3.2 Mean Deviation around Median for a Continuous Series
1. Find the Median Class (1.18.6): The median class is the class interval where the cumulative frequency is
greater than or equal to half the total frequency.
2. Use the Formula to Calculate the Median:
$$\text{Median} = L + \left(\frac{\frac{N}{2} - cF}{f_m}\right) \times C$$
where:
• L = lower boundary of the median class
• cF = cumulative frequency of the class preceding the median class
• f m = frequency of the median class
• C = class width (size of the class interval)
3. Calculate Absolute Deviations from the Median: For each class midpoint $x_i$, calculate the absolute
deviation from the median, $|x_i - M|$.
4. Multiply by Frequencies and calculate the Mean Deviation as done with the Mean.
$$MD = \frac{\sum f_i |x_i - \text{Median}|}{\sum f_i}$$
Example:
Now compute the absolute deviations and Mean Deviation around the median:
1.22.3.3 Mean Deviation around Mode for Continuous Series
1. Identify the Modal Class: The modal class is the class interval with the highest frequency.
2. Find the mode (Section 1.19.6) using the formula:
$$\text{Mode} = L + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times h$$
Where:
• L = lower boundary of the modal class
• f1 = frequency of the modal class
• f0 = frequency of the class preceding the modal class
• f2 = frequency of the class succeeding the modal class
• h = class width
3. Calculate the Absolute Deviations from the Mode:
|xi − Mode|
where xi is the midpoint of each class interval.
4. Multiply Absolute Deviations by Frequencies:
fi |xi − Mode|
where f i is the frequency of each class interval.
5. Calculate Mean Deviation around the mode using the formula:
$$MD_{Mode} = \frac{\sum f_i |x_i - \text{Mode}|}{\sum f_i}$$
Example:
1. Consider the following frequency distribution:
$$\text{Mode} = 20 + \frac{12 - 8}{2 \times 12 - 8 - 10} \times 10 = 20 + \frac{4}{24 - 18} \times 10 = 20 + \frac{40}{6} \approx 20 + 6.67 = 26.67$$
So, the mode is approximately 26.67.
4. Step 3: Calculate the Midpoints xi
The midpoints xi for each class interval are calculated as follows:
5. Step 4: Calculate the Absolute Deviations and Multiply by Frequency
$$\sum f_i = 5 + 8 + 12 + 10 + 5 = 40$$
The sum of $f_i |x_i - \text{Mode}|$ is 396.7, so:
$$MD_{Mode} = \frac{396.7}{40} \approx 9.92$$
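A minimal sketch of the continuous-series calculation follows. The helper name `mean_deviation_grouped` is hypothetical, and the class intervals 0 − 10 through 40 − 50 are an assumption inferred from the midpoints 5, 15, 25, 35, 45 and the class width of 10, since the table itself is not shown.

```python
def mean_deviation_grouped(class_intervals, freqs, center):
    """Mean deviation of a continuous series, using class midpoints as x_i."""
    midpoints = [(lo + hi) / 2 for lo, hi in class_intervals]
    total = sum(freqs)                                         # sum of f_i
    return sum(f * abs(x - center) for x, f in zip(midpoints, freqs)) / total

# Assumed class intervals; frequencies 5, 8, 12, 10, 5 come from the example
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 8, 12, 10, 5]

# Mode from the grouped-data formula with L = 20, f1 = 12, f0 = 8, f2 = 10, h = 10
mode_value = 20 + (12 - 8) / (2 * 12 - 8 - 10) * 10
print(round(mode_value, 2))
print(round(mean_deviation_grouped(classes, freqs, mode_value), 2))
```

Passing the mean or the median as `center` gives the other two variants of the mean deviation for the same table.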
1.22.4 Advantages of Mean Deviation
• Simplicity: Mean deviation is easy to understand and compute, as it involves basic arithmetic operations.
• Based on All Observations: As it uses every data value, it reflects the dispersion of the whole dataset
rather than only its extremes.
• Uses Absolute Values: By using absolute deviations, mean deviation avoids the problem of positive
and negative deviations canceling each other out, which would otherwise make the sum of deviations about
the mean zero.
• Indicative of Variability: Mean deviation provides a straightforward indication of the average amount of
variation or dispersion from the central value.
1.23 Sampling
• Why do we need sampling? Consider a scenario in which you are asked to survey the eating
habits of teenagers in the US. There are over 42 million teens in the US at present, and this number is
growing. Is it possible to survey each of these 42 million individuals about their eating habits?
Obviously not! That is why sampling is used.
• How can one choose a sample that best represents the entire population? Sampling is a statistical
method that deals with the selection of individual observations from a population so that they best
represent the entire population.
• There are two main types of Sampling techniques:
– Probability Sampling
– Non-Probability Sampling
1.23.1 Probability Sampling
• This is a sampling technique in which samples from a large population are chosen using the theory of
probability.
• Probability sampling techniques ensure that every member of the population has a known and non-zero
chance of being selected.
• The main types of probability sampling covered here are:
– Simple Random Sampling or Random Sampling
– Systematic Sampling
– Stratified Sampling
– Cluster Sampling
– Multi-Stage Sampling
1.23.1.1 Random Sampling
• In this method, each member of the population has an equal chance of being selected in the sample.
• Example: A company wants to survey its employees’ job satisfaction. They use a random number generator
to select 50 employees out of 500, ensuring each employee has an equal chance of being chosen.
• Advantages:
– Easy to implement.
– Reduces selection bias.
• Disadvantages:
– Requires a complete list of the population.
– May not be practical for large populations.
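The employee example can be sketched with Python's standard `random` module. The helper name `simple_random_sample`, the numeric employee IDs and the seed are illustrative assumptions; the seed only makes the draw repeatable.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct members, each equally likely to be chosen."""
    rng = random.Random(seed)          # private generator so results can be repeated
    return rng.sample(population, k)   # sampling without replacement

employees = list(range(1, 501))        # hypothetical employee IDs 1..500
chosen = simple_random_sample(employees, 50, seed=42)
print(len(chosen), len(set(chosen)))
```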
1.23.1.2 Systematic Sampling
• In systematic sampling, every k-th record is chosen from the population to be part of the sample, after a
random starting point.
• Example: In a factory with 1000 products, an inspector selects every 10th product for quality testing,
starting from a randomly chosen product among the first ten (say, the 5th).
• Advantages:
– Simple and quick to implement.
– Ensures a spread across the population.
• Disadvantages:
– Can introduce bias if there is a hidden pattern in the population.
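A minimal sketch of the factory example is shown below. The helper name `systematic_sample`, the numeric product IDs and the seed are illustrative assumptions.

```python
import random

def systematic_sample(population, step, seed=None):
    """Pick every step-th item after a random start within the first step items."""
    rng = random.Random(seed)
    start = rng.randrange(step)        # random starting index in 0..step-1
    return population[start::step]

products = list(range(1, 1001))        # hypothetical product IDs 1..1000
picked = systematic_sample(products, 10, seed=0)
print(len(picked), picked[:3])
```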
1.23.1.3 Stratified Sampling
• Stratified sampling divides the population into strata (subgroups).
• A stratum is a subset of the population that shares at least one common characteristic.
• After this, the random sampling method is used to select a sufficient number of subjects from each stratum.
• Example: A researcher wants to study the income levels of different age groups. They divide the population
into age strata (e.g., 18-29, 30-49, 50-69) and randomly select individuals from each stratum.
• Advantages:
– Ensures representation of all subgroups.
– Increases precision.
• Disadvantages:
– Requires detailed population information.
– More complex to administer.
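The age-group example can be sketched as follows. Everything here is illustrative: the helper names `stratified_sample` and `age_band`, the synthetic list of people, and the choice of an equal number of units per stratum are assumptions, not a method prescribed by the notes.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, per_stratum, seed=None):
    """Group units into strata, then draw a simple random sample from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    return {name: rng.sample(members, min(per_stratum, len(members)))
            for name, members in strata.items()}

def age_band(person):
    """Map a person to one of the age strata used in the example."""
    age = person["age"]
    if age < 30:
        return "18-29"
    if age < 50:
        return "30-49"
    return "50-69"

# Synthetic population with ages cycling through 18..69
people = [{"id": i, "age": 18 + i % 52} for i in range(200)]
sample = stratified_sample(people, age_band, per_stratum=5, seed=1)
print({band: len(members) for band, members in sample.items()})
```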
1.23.1.4 Cluster Sampling
• Divides the population into clusters, randomly selects some clusters, and then samples all or some members
within those clusters.
• Example: A school district wants to evaluate student performance. They randomly select 5 out of 20
schools (clusters) and then test all students in those selected schools.
• Advantages:
– Cost-effective for large populations.
– Reduces travel and administrative costs.
• Disadvantages:
– Less precise when the members of a cluster resemble each other more than they resemble the rest of the population.
– Can increase sampling error.
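The school-district example can be sketched as below, for the one-stage case in which every member of a selected cluster is included. The helper name `cluster_sample` and the synthetic school and student names are illustrative assumptions.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and keep every member of each selected cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # sample cluster names
    return {name: list(clusters[name]) for name in chosen}

# Synthetic district: 20 schools with 30 students each
schools = {f"school_{s:02d}": [f"s{s:02d}_{i:02d}" for i in range(30)]
           for s in range(20)}
selected = cluster_sample(schools, 5, seed=7)
print(sorted(selected))
```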
1.23.1.5 Multi-Stage Sampling
• Multistage sampling is an extension of cluster sampling in that, first, clusters are randomly selected and,
second, sample units within the selected clusters are randomly selected.
• It involves multiple stages of sampling, where each stage becomes progressively smaller and more focused.
• Here’s a step-by-step explanation:
– Stage 1: Primary Sampling Units (PSUs) - Divide the population into larger groups or clusters, such
as cities, states, or regions.
– Stage 2: Secondary Sampling Units (SSUs) - Select a random sample of PSUs.
– Stage 3: Tertiary Sampling Units (TSUs) - Divide the selected SSUs into smaller sub-groups, such as
neighborhoods or blocks.
– Stage 4: Final Sample - Select a random sample of individuals or units from the TSUs.
• Example: A national health survey first randomly selects regions (stage 1), then randomly selects towns
within those regions (stage 2), and finally selects households within those towns (stage 3).
• Advantages:
– Flexible and cost-effective.
– Suitable for large-scale surveys.
• Disadvantages:
– Complex to design and analyze.
– Errors can accumulate at each stage, affecting the overall accuracy of the sample.
– If not properly implemented, multistage sampling can introduce bias at each stage.
1.23.2 Non-Probability Sampling
• Non-probability sampling techniques do not provide every individual with a known or equal chance of being
selected.
• These techniques are often used when probability sampling is not feasible.
1.23.2.3 Quota Sampling
• Quota sampling is a non-probability sampling method that relies on the non-random selection of a
predetermined number or proportion of units, called a quota. It can be seen as a non-probabilistic version
of stratified sampling.
• It ensures that specific characteristics (quotas) are represented in the sample.
• You first divide the population into mutually exclusive subgroups (called strata) and then recruit sample
units until you reach your quota. These units share specific characteristics, determined by you prior to
forming your strata.
• Example: A researcher ensures that their sample includes a certain number of men and women, age groups,
and ethnic backgrounds, reflecting the population’s proportions.
• Advantages:
– Ensures representation of specific groups.
– More practical than stratified sampling.
• Disadvantages:
– Can introduce bias.
– Not random, limiting generalizability.