BHARATHIDASAN UNIVERSITY

TIRUCHIRAPPALLI - 620 024

CENTRE FOR DISTANCE EDUCATION

B.Sc. GEOGRAPHY
STATISTICS
I Year Allied Paper
(Full Package)
Copyright reserved. For private circulation only.

Allied Paper I - STATISTICS
Syllabus

Unit-1: Collection of data - Sampling and sample design - Representation of geographic data: Line and bar graphs, Pie diagrams, Histograms, Frequency polygon, Ogives

Unit-2: Measures of Central Tendency: Mean, Median, Mode - Measures of Dispersion: range, mean deviation, quartile deviation, standard deviation, Lorenz curve

Unit-3: Correlation: Scatter diagram and graphic method, Karl Pearson’s Correlation Coefficient, Spearman’s Rank Correlation - Regression

Unit-4: Measurement of Trend: graphic method, semi-average and moving average - Measurement of variations: seasonal, cyclical and irregular variations

Unit-5: Concept of Probability - Probability distribution: Binomial distribution and Normal distribution - Statistical hypothesis - Tests of hypothesis - Tests of significance

Content

Unit I: Lesson 1, Lesson 2
Unit II: Lesson 3, Lesson 4
Unit III: Lesson 5, Lesson 6
Unit IV: Lesson 7, Lesson 8
Unit V: Lesson 9, Lesson 10

UNIT - I

Lesson 1
Statistics
Introduction
“Statistics” is a word in frequent use. It is derived from the Latin word ‘status’, meaning a political state, and today refers to a group of numbers or figures that represent some information of human interest.

“Statistics may be defined as the collection, presentation, analysis and interpretation of numerical data” - Croxton and Cowden. This definition clearly points out four stages in a statistical investigation, namely:
 Collection of data
 Presentation of data
 Analysis of data and
 Interpretation of data

Data are individual pieces of factual information recorded and used for the purpose of analysis. They are the raw information from which statistics are created. Statistics are the results of data analysis, that is, its interpretation and presentation; statistics of this kind are often referred to as 'statistical data'.

Data
Data is the name given to basic facts and entities such as names and numbers. The
main examples of data are weights, prices, costs, numbers of items sold, employee names,
product names, addresses, tax codes, registration marks etc.

We do not generally associate data with mathematics. However, data is the base of all
operations in statistics. So let us learn more about data collection, primary data, secondary
data, and a few other important terms.

Types of Data

Data may be qualitative or quantitative. Once you know the difference between them,
you can know how to use them.

 Qualitative Data: These represent characteristics or attributes. They are descriptions that can be observed but cannot be computed or calculated. For example, data on attributes such as intelligence, honesty, wisdom, cleanliness, and creativity, collected using the students of your class as a sample, would be classified as qualitative. Such data are more exploratory than conclusive in nature.
 Quantitative Data: These can be measured and not simply observed. They can be represented numerically, and calculations can be performed on them. For example, data on the number of students from your class playing different sports show how many of the total students play which sport. This information is numerical and can be classified as quantitative.

Data Collection

Depending on the source, data can be classified as primary data or secondary data. Let us take a look at both.

Collection of Data

Everybody collects, interprets and uses information in day-to-day life, much of it in numerical or statistical form. People receive large quantities of information every day through conversations, television, computers, radio, newspapers, posters, notices and instructions. Precisely because so much information is available, people need to be able to absorb, select and reject it. In everyday life, in business and in industry, certain statistical information is necessary, and it is important to know where to find it and how to collect it. Consumers, for example, compare prices and quality before deciding what goods to buy. Employees of a firm want to compare their salaries, working conditions, promotion opportunities and so on. Firms, for their part, want to control costs and expand their profits. One of the main functions of statistics is to provide information which will help in making decisions. Statistics provides this type of information by giving a description of the present, a profile of the past and an estimate of the future.

The following are some of the objectives of collecting statistical information.

 To describe the methods of collecting primary statistical information.

 To consider the status involved in carrying out a survey.
 To analyse the process involved in observation and interpreting.
 To define and describe sampling.
 To analyse the basis of sampling.
 To describe a variety of sampling methods.

Statistical investigation is comprehensive and requires the systematic collection of data about some group of people or objects, describing and organizing the data, analyzing the data with the help of different statistical methods, summarizing the analysis and using the results for making judgements, decisions and predictions. The validity and accuracy of the final judgement is crucial and depends heavily on how well the data were collected in the first place. The quality of the data greatly affects the conclusions; hence the utmost importance must be given to this process, and every possible precaution should be taken to ensure accuracy while collecting the data.

Nature of Data

Different types of data can be collected for different purposes. The data can be collected in connection with time, with geographical location, or with both time and location. The following are the three types of data:

 Time series data
 Spatial data
 Spatio-temporal data

Time Series Data

Time series data are a set of numerical values collected over a period of time. The data may have been collected at regular or irregular intervals of time.

Spatial Data

If the data collected are connected with a place, they are termed spatial data.

Spatio-Temporal Data

If the data collected are connected with time as well as place, they are known as spatio-temporal data.
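
The three types of data described above can be told apart by checking whether the observations vary over time, over place, or over both. The following is a minimal sketch of that check; the function name `nature_of_data`, the record layout `(place, period, value)` and the rainfall figures are assumptions made for illustration only.

```python
def nature_of_data(records):
    """Classify a list of (place, period, value) records by how they vary."""
    places = {place for place, _, _ in records}
    periods = {period for _, period, _ in records}
    if len(places) > 1 and len(periods) > 1:
        return "spatio-temporal"      # varies over both place and time
    if len(periods) > 1:
        return "time series"          # one place, several periods
    if len(places) > 1:
        return "spatial"              # several places, one period
    return "single observation"

# Hypothetical annual rainfall figures (mm); the values are illustrative only.
one_town = [("Madurai", 2021, 840), ("Madurai", 2022, 910)]
one_year = [("Madurai", 2022, 910), ("Delhi", 2022, 790)]
both = one_town + [("Delhi", 2021, 760), ("Delhi", 2022, 790)]

print(nature_of_data(one_town))
print(nature_of_data(one_year))
print(nature_of_data(both))
```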

Categories of Data

Any statistical data can be classified under two categories depending upon the
sources utilized. These categories are,

 Primary data
 Secondary data

Primary Data

Primary data are data collected by the investigator for the purpose of a specific inquiry or study. Such data are original in character and are generated by surveys conducted by individuals, research institutions or other organisations.

Because they are collected for the first time, primary data are ‘pure’ in the sense that no statistical operations have been performed on them. An example of primary data is the Census of India.

Example: If a researcher wants to know the impact of the noon meal scheme on school children, he or she has to undertake a survey and collect data on the opinions of parents and children by asking relevant questions. Data collected in this way for the purpose are called primary data.

The primary data can be collected by the following five methods.

 Direct personal interviews.

 Indirect Oral interviews.
 Information from correspondents.
 Mailed questionnaire method.
 Schedules sent through enumerators.

Direct Personal Interviews

The persons from whom information is collected are known as informants. The investigator personally meets them and asks questions to gather the necessary information. This method suits intensive rather than extensive field surveys, that is, the intensive study of a limited field.

Indirect Oral Interviews

Under this method the investigator contacts witnesses, neighbours, friends or other third parties who are capable of supplying the necessary information. This method is preferred if the required information concerns addiction or the cause of a fire, theft, murder, etc. If a fire has broken out at a certain place, the persons living in the neighbourhood and witnesses are likely to give information on the cause of the fire.

In some cases, the police interrogate third parties who are supposed to have knowledge of a theft or a murder and get some clues. Enquiry committees appointed by governments generally adopt this method to get people’s views and all possible details of facts relating to the enquiry.

This method is suitable whenever direct sources do not exist, cannot be relied upon, or would be unwilling to part with the information. The validity of the results depends upon a few factors, such as the nature of the person whose evidence is being recorded, the ability of the interviewer to draw out information from the third parties by means of appropriate questions and cross-examination, and the number of persons interviewed. For this method to succeed, one person or one group alone should not be relied upon.

Information from Correspondents

The investigator appoints local agents or correspondents in different places and compiles the information sent by them. Newspapers and some government departments receive information by this method.

The advantage of this method is that it is cheap and appropriate for extensive investigations. But it may not ensure accurate results, because the correspondents may be negligent, prejudiced or biased. This method is adopted where information is to be collected periodically from a wide area over a long time.

Mailed Questionnaire Method

Under this method a list of questions is prepared and sent to all the informants by post. The list of questions is technically called a questionnaire. A covering letter accompanying the questionnaire explains the purpose of the investigation and the importance of correct information, and requests the informants to fill in the blank spaces provided and to return the form within a specified time. This method is appropriate where the informants are literate and are spread over a wide area.

Schedules sent through Enumerators

Under this method enumerators or interviewers take the schedules, meet the informants and fill in their replies. A distinction is often made between a schedule and a questionnaire. A schedule is filled in by the interviewer in a face-to-face situation with the informant. A questionnaire is filled in by the informant, who receives and returns it by post. The schedule method is suitable for extensive surveys.

Secondary Data

Secondary data are those data which have been already collected and analysed by
some earlier agency for its own use; and later the same data are used by a different agency.
According to W.A. Neiswanger, ‘A primary source is a publication in which the data are
published by the same authority which gathered and analysed them. A secondary source is
a publication, reporting the data which have been gathered by other authorities and for
which others are responsible’.

Sources of Secondary Data

In most studies the investigator finds it impracticable to collect first-hand information on all related issues and therefore makes use of data collected by others. There is a vast amount of published information from which statistical studies may be made, and fresh statistics are constantly being produced.

Secondary data are sourced from some agency that originally collected them. Such data have already been collected by researchers or investigators in the past and are available in either published or unpublished form. This information is ‘impure’ in the sense that statistical operations may already have been performed on it. An example is the information available on the website of the Department of Finance, Government of India, or in other repositories, books, journals, etc.

The sources of secondary data can broadly be classified under two heads:

 Published sources
 Unpublished sources

Published Sources

The various sources of published data are clinical and other personal records, death
certificates, published mortality statistics, census publications, etc. Examples include:

 Official publications of the Central Statistical Authority
 Publications of the Ministry of Health and other Ministries
 Newspapers and journals
 International publications, such as those of the WHO, the World Bank and UNICEF
 Records of hospitals or other health institutions

Note: A lot of secondary data is available on the internet and can be accessed at any time for further study.

Unpublished Sources

Not all statistical material is published. There are various sources of unpublished data, such as records maintained by various Government and private offices and studies made by research institutions, scholars, etc. Such sources can also be used where necessary.

Precautions in the Use of Secondary Data

The following are some of the points to be considered in the use of secondary data:

 How have the data been collected and processed?
 How accurate are the data?
 How far have the data been summarized?
 How comparable are the data with other tabulations?
 How should the data be interpreted, especially when figures collected for one purpose are used for another?

Generally speaking, with secondary data, people have to compromise between what they want and what they are able to find.

Classification of Data

The collected data, also known as raw data or ungrouped data, are always in an unorganised form and need to be organised and presented in a meaningful and readily comprehensible form in order to facilitate further statistical analysis. It is, therefore, essential for an investigator to condense a mass of data into a more comprehensible and assimilable form.

The process of grouping data into different classes or sub-classes according to some characteristics is known as classification. Tabulation is concerned with the systematic arrangement and presentation of classified data. Thus classification is the first step in tabulation. For example, letters in the post office are classified according to their destinations, viz., Delhi, Madurai, Bangalore, Mumbai, etc.

Objects of Classification
The following are the main objectives of classifying the data:

 It condenses the mass of data in an easily assimilable form.

 It eliminates unnecessary details.
 It facilitates comparison and highlights the significant aspect of data.
 It enables one to get a mental picture of the information and helps in drawing
inferences.
 It helps in the statistical treatment of the information collected.

Types of Classification
Statistical data are classified in respect of their characteristics. Broadly there are
four basic types of classification namely

 Chronological classification
 Geographical classification
 Qualitative classification
 Quantitative classification.

Chronological classification

In chronological classification the collected data are arranged according to the order of time expressed in years, months, weeks, etc.

Geographical classification

In this type of classification the data are classified according to geographical region or place, for instance, the production of paddy in different states in Iraq or the production of wheat in different countries.

Qualitative classification

In this type of classification the data are classified on the basis of some attribute or quality. For example, if the population is to be classified with respect to one attribute, say sex, then we can classify it into two classes, namely males and females.

Similarly, the population can also be classified into ‘married’ or ‘single’ on the basis of another attribute, ‘marital status’. Thus when the classification is done with respect to one attribute which is dichotomous in nature, two classes are formed, one possessing the attribute and the other not possessing it.

This type of classification is called simple or dichotomous classification. A classification in which two or more attributes are considered and several classes are formed is called a manifold classification.
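
The difference between simple and manifold classification can be sketched by counting the same records first by one attribute and then by two. The records below are hypothetical and serve only as an illustration; `Counter` is from Python's standard library.

```python
from collections import Counter

# Hypothetical records of (sex, marital status); illustrative values only.
people = [
    ("male", "married"),
    ("female", "single"),
    ("female", "married"),
    ("male", "single"),
    ("female", "married"),
]

# Simple (dichotomous) classification: one attribute, two classes.
simple = Counter(sex for sex, _ in people)

# Manifold classification: two attributes considered together, several classes.
manifold = Counter(people)

print(dict(simple))
print(dict(manifold))
```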

Quantitative classification

Quantitative classification refers to the classification of data according to some characteristics that can be measured, such as height, weight, etc.

Tabulation of Data

Tabulation is the process of summarizing classified or grouped data in the form of a table so that it is easily understood and an investigator can quickly locate the desired information. A table is a systematic arrangement of classified data in columns and rows. Thus, a statistical table makes it possible for the investigator to present a huge mass of data in a detailed and orderly form. It facilitates comparison and often reveals patterns in the data which are otherwise not obvious. Classification and tabulation, as a matter of fact, are not two distinct processes.
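
As a rough illustration of arranging classified data in columns and rows, the sketch below lays out counts of letters by destination as an aligned plain-text table. The helper `format_table` and the letter counts are assumptions made for this example, not part of the lesson.

```python
def format_table(headers, rows):
    """Return the rows as an aligned plain-text table with a header line."""
    # Width of each column = longest entry in that column, header included.
    widths = [max(len(str(cell)) for cell in column)
              for column in zip(headers, *rows)]

    def line(cells):
        padded = (str(cell).ljust(width) for cell, width in zip(cells, widths))
        return "  ".join(padded).rstrip()

    rule = ["-" * width for width in widths]
    return "\n".join([line(headers), line(rule)] + [line(row) for row in rows])

# Hypothetical counts of letters sorted by destination.
rows = [("Delhi", 120), ("Madurai", 45), ("Mumbai", 98)]
table = format_table(("Destination", "Letters"), rows)
print(table)
```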

Sampling

Sampling is a technique of selecting individual members or a subset of the population in order to make statistical inferences from them and estimate characteristics of the whole population. Different sampling methods are widely used by researchers in market research so that they do not need to study the entire population to collect actionable insights.

It is also a time-saving and cost-effective method and hence forms the basis of any research design. Sampling techniques can also be applied within research survey software.

For example, if a drug manufacturer would like to research the adverse side effects of a drug on the country’s population, it is almost impossible to conduct a research study that involves everyone. In this case, the researcher selects a sample of people from each demographic group and studies them, which gives indicative feedback on the drug’s behaviour.

Types of Sampling

Sampling in market research is of two types: probability sampling and non-probability sampling.

Let’s take a closer look at these two methods of sampling.

 Probability Sampling: Probability sampling is a sampling technique in which the researcher sets a few selection criteria and chooses members of the population randomly. With this selection parameter, all the members have an equal opportunity to be a part of the sample.
 Non-probability Sampling: In non-probability sampling, the researcher chooses members for research based on convenience or subjective judgment rather than random selection. This sampling method does not follow a fixed or predefined selection process, which makes it difficult for all elements of a population to have equal opportunities to be included in a sample.

Difference between Probability and Non-Probability Sampling

Definition
   Probability sampling methods: A sampling technique in which samples from a larger population are chosen using a method based on the theory of probability.
   Non-probability sampling methods: A sampling technique in which the researcher selects samples based on subjective judgment rather than random selection.

Alternatively known as
   Probability sampling methods: Random sampling method.
   Non-probability sampling methods: Non-random sampling method.

Population selection
   Probability sampling methods: The population is selected randomly.
   Non-probability sampling methods: The population is selected arbitrarily.

Nature
   Probability sampling methods: The research is conclusive.
   Non-probability sampling methods: The research is exploratory.

Sample
   Probability sampling methods: Since there is a method for deciding the sample, the population demographics are conclusively represented.
   Non-probability sampling methods: Since the sampling method is arbitrary, the representation of the population demographics is almost always skewed.

Time taken
   Probability sampling methods: Takes longer to conduct, since the research design defines the selection parameters before the market research study begins.
   Non-probability sampling methods: Quick, since neither the sample nor its selection criteria are defined in advance.

Results
   Probability sampling methods: The sampling is unbiased by design, and hence the results are unbiased and conclusive.
   Non-probability sampling methods: The sampling is biased, and hence the results are biased too, rendering the research speculative.

Hypothesis
   Probability sampling methods: There is an underlying hypothesis before the study begins, and the objective of this method is to test the hypothesis.
   Non-probability sampling methods: The hypothesis is derived after conducting the research study.

Types of Probability Sampling

 Simple Random Sampling: One of the best probability sampling techniques for saving time and resources is the simple random sampling method. It is a reliable method of obtaining information in which every member of the sample is chosen randomly, merely by chance. Each individual has the same probability of being chosen to be a part of the sample.

 Cluster Sampling: Cluster sampling is a method in which the researchers divide the entire population into sections or clusters that represent the population. Clusters are identified and included in a sample based on demographic parameters like age, sex, location, etc.

 Systematic Sampling: Researchers use the systematic sampling method to choose the sample members of a population at regular intervals. It requires the selection of a starting point and a sample size, which together fix the regular interval at which members are chosen. Because this interval is predefined, this sampling technique is the least time-consuming.

 Stratified Random Sampling: Stratified random sampling is a method in which the researcher divides the population into smaller groups that do not overlap but together represent the entire population. While sampling, the researcher organizes these groups and then draws a sample from each group separately.
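
The four probability sampling methods listed above can be sketched with Python's standard `random` module. This is a minimal illustration under stated assumptions: the function names, the roll numbers 1 to 100, the two strata and the four hostels are all hypothetical, and the cluster version shown is the common one-stage form in which whole clusters are chosen at random.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member has the same chance of being chosen."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Pick a random starting point, then every k-th member (k = N // n)."""
    step = len(population) // n
    start = rng.randrange(step)
    return [population[start + i * step] for i in range(n)]

def stratified_sample(strata, fraction, rng):
    """Draw the same fraction separately from each non-overlapping group."""
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Choose whole clusters at random and keep every member of each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(42)                 # fixed seed so the run is repeatable
students = list(range(1, 101))          # hypothetical roll numbers 1..100
strata = {"first_year": students[:60], "second_year": students[60:]}
clusters = {f"hostel_{i}": students[i * 25:(i + 1) * 25] for i in range(4)}

srs = simple_random_sample(students, 10, rng)
systematic = systematic_sample(students, 10, rng)
stratified = stratified_sample(strata, 0.1, rng)
clustered = cluster_sample(clusters, 2, rng)
print(srs, systematic, stratified, len(clustered), sep="\n")
```

Passing the random generator `rng` into each function keeps the sketch repeatable; in real fieldwork the seed would not normally be fixed.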

Uses of Probability Sampling

There are multiple uses of probability sampling:

 Reduce Sample Bias: With the probability sampling method, the bias in the sample derived from a population is negligible to non-existent. The selection of the sample depends mainly on chance rather than on the understanding and inference of the researcher. Probability sampling leads to higher-quality data collection, as the sample appropriately represents the population.

 Diverse Population: When the population is vast and diverse, it is essential to have adequate representation so that the data are not skewed towards one demographic. For example, if Square would like to understand the people who could use its point-of-sale devices, a survey of a sample of people across the US from different industries and socio-economic backgrounds helps.

 Create an Accurate Sample: Probability sampling helps the researchers plan
and create an accurate sample. This helps to obtain well-defined data.

Types of Non-Probability Sampling

 Convenience Sampling: This method depends on the ease of access to subjects, such as surveying customers at a mall or passers-by on a busy street. It is termed convenience sampling because of the ease with which the researcher can carry it out and get in touch with the subjects. Researchers have almost no control over the selection of sample elements; selection is based purely on proximity and not on representativeness. This non-probability sampling method is used when there are time and cost limitations in collecting feedback, or resource limitations such as in the initial stages of research.

 Judgmental or Purposive Sampling: Judgmental or purposive samples are formed at the discretion of the researcher. Researchers consider purely the purpose of the study, along with their understanding of the target audience. For instance, researchers who want to understand the thought process of people interested in studying for a master’s degree would select only such people.

 Snowball Sampling: Snowball sampling is a sampling method that researchers apply when the subjects are difficult to trace.

 Quota Sampling: In quota sampling, members are selected on the basis of a pre-set standard. Because the sample is formed based on specific attributes, the created sample will have the same qualities found in the total population. It is a rapid method of collecting samples.
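
Quota sampling, as described above, can be sketched as taking respondents in the order they happen to arrive until a pre-set quota for each group is filled. The function `quota_sample`, the respondent records and the quotas below are hypothetical and only illustrate the idea; note that no random selection is involved.

```python
def quota_sample(respondents, attribute, quotas):
    """Accept respondents in arrival order until each group's quota is full."""
    remaining = dict(quotas)            # copy so the caller's quotas are untouched
    sample = []
    for person in respondents:
        group = person[attribute]
        if remaining.get(group, 0) > 0:
            sample.append(person)
            remaining[group] -= 1
    return sample

# Hypothetical passers-by in the order they were met.
respondents = [
    {"name": "A", "sex": "female"},
    {"name": "B", "sex": "female"},
    {"name": "C", "sex": "male"},
    {"name": "D", "sex": "female"},
    {"name": "E", "sex": "male"},
    {"name": "F", "sex": "male"},
]

sample = quota_sample(respondents, "sex", {"female": 2, "male": 2})
print([person["name"] for person in sample])
```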

Uses of Non-Probability Sampling

Non-probability sampling is used for the following:

 Create a Hypothesis: Researchers use the non-probability sampling method to create an assumption when limited or no prior information is available. This method helps with the immediate return of data and builds a base for further research.

 Exploratory Research: Researchers use this sampling technique widely when conducting qualitative research, pilot studies, or exploratory research.

 Budget and Time Constraints: The non-probability method is used when there are budget and time constraints and some preliminary data must be collected. Since the survey design is not rigid, it is easier to pick whichever respondents are available and have them take the survey or questionnaire.

Sampling Design

Sampling design is a mathematical function that gives the probability of any given sample being drawn. It involves not only learning how to derive the probability functions which describe a given sampling method but also understanding how to design a best-fit sampling method for a real-life situation.
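
To make the idea of a design as a probability function concrete, the sketch below writes out the design for simple random sampling without replacement, where each of the C(N, n) possible samples is equally likely. The function name `srs_design` and the four-unit population are assumptions for illustration; other sampling methods would give different functions.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def srs_design(population, n):
    """Map every possible sample of size n to its probability 1 / C(N, n)."""
    probability = Fraction(1, comb(len(population), n))
    return {frozenset(s): probability for s in combinations(population, n)}

design = srs_design(["A", "B", "C", "D"], 2)

# Probability that unit "A" is included in the sample: sum over samples containing it.
inclusion_a = sum(p for sample, p in design.items() if "A" in sample)

print(len(design), sum(design.values()), inclusion_a)
```

Under this design the chance that any one unit is included works out to n / N, which is the sense in which every member has an equal opportunity of selection.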

Types of Sampling Design

Sampling Designing Process

The sampling design process includes five steps which are closely related and are important to all aspects of the marketing research project. The five steps are: defining the target population; determining the sample frame; selecting a sampling technique; determining the sample size; and executing the sampling process.

Lesson 2
Representation of Geographic Data

Discrete Data

These are data that can take only certain specific values rather than a range of values. For example, data on the blood groups of a certain population or on their genders are termed discrete data. A usual way to represent such data is by using bar charts.

Continuous Data

These are data that can take any value within a certain range, between a lowest and a highest value. The difference between the highest and lowest values is called the range of the data. For example, the age of a person can take decimal values, and so can the heights and weights of the students of your school.

These are classified as continuous data. Continuous data can be tabulated in what is called a frequency distribution and can be represented graphically using histograms.
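
The range and a frequency distribution for continuous data can be worked out as in the sketch below. The ages are hypothetical, and the choice of four equal-width classes is an assumption made for the example; the sketch also assumes the values are not all identical, since the class width would then be zero.

```python
def data_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

def frequency_distribution(values, n_classes):
    """Return (lower, upper, frequency) for n_classes equal-width classes."""
    low = min(values)
    width = data_range(values) / n_classes
    counts = [0] * n_classes
    for value in values:
        # The highest value is placed in the last class rather than a new one.
        index = min(int((value - low) / width), n_classes - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i])
            for i in range(n_classes)]

# Hypothetical ages of students, in years.
ages = [12.5, 13.0, 13.2, 14.1, 14.8, 15.0, 15.5, 16.5]
distribution = frequency_distribution(ages, 4)

print(data_range(ages))
for lower, upper, frequency in distribution:
    print(f"{lower:.1f} to {upper:.1f}: {frequency}")
```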

Frequency Polygon

Presenting data in pictorial or graphical form is of immense value. Frequency polygons give an idea of the shape of the data and the trends that a particular data set follows. A frequency polygon can be drawn with or without a histogram.
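
One common way of constructing a frequency polygon without a histogram, shown here as a sketch and not necessarily the procedure intended by this lesson, is to plot each class frequency against the class midpoint and to close the figure with a zero-frequency point one class width before the first midpoint and one after the last. The function name and the class data below are hypothetical, and equal class widths are assumed.

```python
def frequency_polygon_points(classes):
    """classes: ordered list of (lower, upper, frequency) with equal widths."""
    width = classes[0][1] - classes[0][0]
    # One point per class: (class midpoint, frequency).
    points = [((lower + upper) / 2, frequency)
              for lower, upper, frequency in classes]
    first_mid, last_mid = points[0][0], points[-1][0]
    # Zero-frequency end points close the polygon on the horizontal axis.
    return [(first_mid - width, 0)] + points + [(last_mid + width, 0)]

# Hypothetical frequency distribution of marks.
classes = [(0, 10, 3), (10, 20, 7), (20, 30, 5)]
points = frequency_polygon_points(classes)
print(points)
```

Joining these points in order with straight lines gives the polygon.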

Presentation of Data

A key objective of statistics is to collect and organize data. A basic part of data organization is presenting the data in a recognizable form so that they can be interpreted easily. You can organize data in the form of tables or present them pictorially.

Pictorial representation of data takes the form of bar charts, pie charts, histograms or frequency polygons. The benefit is that data in visual form are easy to understand at a glance.

Graphical representation refers to the use of charts and graphs to visually display, analyze, clarify, and interpret numerical data, functions, and other qualitative structures.

Types of Graphs and Charts

The most commonly used graph types are as follows:

 Bar graph
 Pie graph
 Line graph

All these graphs are used in various places to represent a specific set of data concisely. Each of these graphs (or charts) is explained below, which will help not only to understand them better but also to choose the right kind of graph for a particular data set.

Statistical Graphs

A statistical graph or chart is the pictorial representation of statistical data in graphical form. Statistical graphs are used to represent a set of data so that statistical information is easier to understand and interpret.

Statistical data can be represented by various methods such as tables, bar graphs, pie charts, histograms, frequency polygons, etc.

The four basic graphs used in statistics are bar graphs, line graphs, histograms and pie charts. These are explained here in brief.

Bar Graph

Bar graphs are the pictorial representation of data (generally grouped) in the form of vertical or horizontal rectangular bars, where the length of each bar is proportional to the measure of the data. They are also known as bar charts. Bar graphs are one of the means of data handling in statistics.

In a vertical bar chart, the horizontal axis represents the categories, whereas the vertical axis shows the discrete value for each category.

Types of Bar Charts

Bar graphs can be vertical or horizontal. The primary feature of any bar graph is the length or height of its bars: the longer the bar, the greater the value it represents.

Bar graphs normally show categorical and numeric variables arranged in class intervals. They consist of an axis and a series of labelled horizontal or vertical bars.

The bars represent frequencies of distinct values of a variable or, commonly, the distinct values themselves. The number of values on the x-axis of a bar graph or the y-axis of a column graph is called the scale.

The types of bar charts are as follows:

 Vertical bar chart
 Horizontal bar chart

Although a bar graph can be plotted horizontally or vertically, the vertical bar graph is the most usual type.

The orientation of the x-axis and y-axis changes depending on whether the bar chart is vertical or horizontal. Apart from vertical and horizontal bar graphs, there are two further types of bar charts:

 Grouped Bar Graph

 Stacked Bar Graph

Vertical Bar Graphs

When grouped data are represented vertically in a graph or chart with the help of bars, where the bars denote the measure of data, such graphs are called vertical bar graphs. The data values are represented along the y-axis of the graph, and the height of the bars shows the values.

Horizontal Bar Graphs

When grouped data are represented horizontally in a chart with the help of bars, where the bars show the measure of data, such graphs are called horizontal bar graphs. The data values are depicted along the x-axis of the graph, and the length of the bars denotes the values.

Grouped Bar Graph

The grouped bar graph, also called the clustered bar graph, is used to represent the discrete values of more than one object sharing the same category. In this type of bar chart, the bars for the different objects within a category are placed side by side rather than combined into a single bar.

In other words, a grouped bar graph is a type of bar graph in which different sets of data items are compared. Here, a single colour is used to represent each specific series across the set. A grouped bar graph can be drawn with either vertical or horizontal bars.

Stacked Bar Graph

The stacked bar graph is also called the composite bar chart, which divides the
aggregate into different parts. In this type of bar graph, each part can be represented using
different colours, which helps to easily identify the different categories.

The stacked bar chart requires specific labelling to show the different parts of the
bar. In a stacked bar graph, each bar represents the whole and each segment represents the
different parts of the whole.

Uses of Bar Graphs

Bar graphs are used to compare things between different groups or to track changes
over time. When showing change over time, however, bar graphs are most suitable
when the changes are large.

Bar charts have a discrete domain of categories and are normally scaled so that
all the data fit on the graph. When the categories being compared have no natural
order, the bars may be arranged in any order. Bar charts arranged from the
highest to the lowest value are called Pareto charts.

Properties of Bar Graph

Some of the important properties of a bar graph are as follows:

 All the bars should have a common base.

 Each column in the bar graph should have equal width.

 The height of the bar should correspond to the data value.

 The distance between each bar should be the same.

Advantages

 A bar graph summarises a large set of data in a simple visual form.

 It displays each category of data in the frequency distribution.

 It shows the trend of the data more clearly than a table.

 It helps in estimating the key values at a glance.

Disadvantages

 Sometimes, the bar graph fails to reveal patterns, causes, effects, etc.

 It can easily be manipulated to give a misleading impression.

Example : The number of trees planted by Eco-club of a school in different years is given
below. Draw the bar graph to represent the data.

Year Number of Trees

2005 150
2006 220
2007 350
2008 400
2009 300
2010 380

Solution:

[Figure: vertical bar graph with Year (2005 to 2010) on the x-axis and Number of Trees on the y-axis, scaled from 0 to 450.]
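
The construction of the bars can also be sketched in code. The following is a minimal illustration only: it draws the bars as rows of text marks rather than as a plotted figure, and the scale of one mark per 10 trees (`TREES_PER_MARK`) and the function names are assumptions made for this sketch.

```python
# Data from the example table: year -> number of trees planted.
trees = {2005: 150, 2006: 220, 2007: 350, 2008: 400, 2009: 300, 2010: 380}

TREES_PER_MARK = 10  # assumed scale: one '#' stands for 10 trees


def bar_lengths(data, per_mark):
    """Return the bar length (in marks) for each category."""
    return {key: value // per_mark for key, value in data.items()}


def render(data, per_mark):
    """Build one text line per bar; every bar starts from the same base."""
    lengths = bar_lengths(data, per_mark)
    return [f"{year} | {'#' * n} {data[year]}" for year, n in lengths.items()]


if __name__ == "__main__":
    print("\n".join(render(trees, TREES_PER_MARK)))
```

Each bar starts from a common base, has the same width (one text row), and has a length proportional to its value, which mirrors the properties of a bar graph listed earlier.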

Pie Chart

The “pie chart”, also known as a “circle chart”, divides a circular statistical
graphic into sectors or slices to illustrate numerical proportions.

Each sector denotes a proportionate part of the whole. A pie chart works best when
the aim is to show the composition of something. In such cases, pie charts can replace
other graphs such as bar graphs, line plots and histograms.

A pie chart is a pictorial representation of data. The slices of the pie show the
relative sizes of the data: each category is represented by a slice whose size matches
its share of the total.

Pie charts are used to represent proportional or relative data in a single
chart. The concept of pie slices is used to show the percentage that a particular
category forms of the whole pie.

To find the percentage represented by a slice, measure the angle of the slice, divide
it by 360 degrees and multiply the result by 100.

There are many real-life examples of pie charts, such as the representation of marks
obtained by students in a class, the kinds of cars sold in a month, or the types of food
liked by the people in a room.

Formula

The pie chart is an important type of data representation. It contains different
sectors, each of which forms a specific portion (percentage) of the total.

The angles of all the sectors add up to 360°, and the total value of the pie is always 100%.

To work out the percentages and angles for a pie chart, follow the steps given below:

 Categorize the data

 Calculate the total

 Divide the value of each category by the total

 Convert into percentages

 Finally, calculate the degrees
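
The steps above amount to two simple proportions. They can be written as follows, where the symbols $v_i$ (the value of category $i$), $T$ (the total), $p_i$ (the percentage) and $\theta_i$ (the sector angle) are notation introduced here for illustration and do not appear in the original text:

```latex
\[
T = \sum_i v_i, \qquad
p_i = \frac{v_i}{T} \times 100\%, \qquad
\theta_i = \frac{v_i}{T} \times 360^\circ
\]
```

Read in reverse, a slice with angle $\theta$ represents $\theta / 360^\circ \times 100$ percent of the whole.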

Uses of Pie Chart

 Within a business, it is used to compare areas of growth, such as turnover,
profit and exposure.

 To represent categorical data.

 To show the performance of a student in a test, etc.

Advantages

 The picture is simple and easy to understand.

 Data can be represented visually as a fractional part of a whole.

 It provides an effective communication tool, even for an uninformed
audience.

 It provides a data comparison at a glance, so that the audience can make an
immediate analysis or quickly understand the information.

 Readers do not need to examine or measure the underlying numbers themselves.

 Pieces of data in the pie chart can be emphasised to highlight the points you
want to make.

Disadvantages

 It becomes less effective if there are too many pieces of data to show.

 If there are too many pieces of data, even adding data labels and numbers
may not help, as they themselves may become crowded and hard to read.

 As this chart represents only one data set, a series of charts is needed to
compare multiple sets.

 This may make it more difficult for readers to analyse and
assimilate the information quickly.

Example : The following data shows the agricultural production in India during a certain
year. Draw a pie chart to represent the data.

Food grain          Production (million tonnes)

Rice                57

Wheat               76

Coarse Cereals      38

Pulses              19

Total production = (57 + 76 + 38 + 19) million tonnes = 190 million tonnes.
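
To draw the pie chart, each production figure has to be converted into a share of the total and then into a sector angle. The following sketch shows one way to carry out that arithmetic; the function name `pie_shares` is chosen here for illustration.

```python
# Production figures (million tonnes) from the example table.
production = {"Rice": 57, "Wheat": 76, "Coarse Cereals": 38, "Pulses": 19}


def pie_shares(data):
    """Return {category: (percentage of total, sector angle in degrees)}."""
    total = sum(data.values())
    return {
        name: (value * 100 / total, value * 360 / total)
        for name, value in data.items()
    }


if __name__ == "__main__":
    for name, (percent, angle) in pie_shares(production).items():
        print(f"{name}: {percent:.0f}% of the total, sector angle {angle:.0f} degrees")
```

For these figures the sector angles work out to 108°, 144°, 72° and 36°, which add up to 360° as required.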

Line Graph

A line graph is a type of graph commonly used in statistics. It represents
the change in one quantity with respect to another quantity.

For example, the variation in the price of different flavours of chocolate can be
represented with the help of this graph. This variation is usually plotted in a
two-dimensional XY plane. If the relation between two measures can be expressed by a
straight line in a graph, then such graphs are called linear graphs. For this reason, the
line graph is sometimes also called a linear graph.

A line graph, line chart or line plot is a graph that uses points and lines to
represent change over time. It is a chart that shows a line joining several points, or a
line that shows the relation between the points.

The graph represents quantitative data between two changing variables with a line
or curve that joins a series of successive data points. Line graphs compare these two
variables on a vertical axis and a horizontal axis.

A few key takeaways about line graphs are as follows:

 A line graph is used to display change over time as a series of
data points connected by straight line segments on two axes.

 A line graph is also called a line chart. It helps to determine the relationship
between two sets of values, with one data set always being dependent on the
other data set.

 Line graphs are helpful for showing information on factors and trends. They
can also be used to make predictions about data not yet recorded.

 The slope of the line is the most important observation. The slope
represents how steep a line is, and it helps in comparing the magnitude of change
between any two consecutive points on the graph: the steeper the
slope, the greater the change in magnitude between two consecutive points.

A line graph consists of a horizontal x-axis and a vertical y-axis. Most line graphs
only deal with positive number values, so these axes typically intersect near the bottom of
the y-axis and the left end of the x-axis. The point at which the axes intersect is
always (0,0). Each axis is labelled with a data type. For example, the x-axis could be days,
weeks, quarters, or years, while the y-axis shows revenue in dollars.

Different Parts of Line Graph

Title: The title explains what graph is to be plotted.

Scale: The scale is the set of numbers that indicate the units used on the line graph.

Labels: Both the side and the bottom of the line graph have a label that indicates what
kind of data is represented in the graph. The x-axis describes the data points on the line
and the y-axis shows the numeric value for each point on the line.

Bars: They measure the data number.

Data values: They are the actual numbers for each data point.

Example : Sketch the solution of |x + 2| ≤ 5 on a number line.

Solution: Given the inequality |x + 2| ≤ 5.

Step 1: Change the inequality into a compound inequality: −5 ≤ x + 2 ≤ 5.

Step 2: Subtract 2 from all three parts to get −5 − 2 ≤ x + 2 − 2 ≤ 5 − 2, that is, −7 ≤ x ≤ 3.

Step 3: Mark this interval on a number line.
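
The same steps can be checked with a short sketch. It handles the general form |x + a| ≤ b; the function names `solve_abs_le` and `satisfies` are hypothetical helpers introduced only for this illustration.

```python
def solve_abs_le(a, b):
    """Solve |x + a| <= b for x; return (low, high), or None if b < 0."""
    if b < 0:
        return None  # an absolute value is never negative
    # |x + a| <= b is equivalent to -b <= x + a <= b; subtract a throughout.
    return (-b - a, b - a)


def satisfies(x, a, b):
    """Check a single value of x directly against |x + a| <= b."""
    return abs(x + a) <= b


if __name__ == "__main__":
    low, high = solve_abs_le(2, 5)
    print(f"{low} <= x <= {high}")
```

Testing the end points and the values just outside them with `satisfies` is a quick way to confirm that the interval marked on the number line is correct.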

Vertical Line Graph

Vertical line graphs are graphs in which a vertical line extends from each data
point down to the horizontal axis. A vertical line graph is sometimes also called a column
graph. A line parallel to the y-axis is called a vertical line.

Horizontal Line Graph

Horizontal line graphs are graphs in which a horizontal line extends from each data
point parallel to the x-axis. A horizontal line graph is sometimes also called a row graph.
A line parallel to the x-axis is called a horizontal line.

Straight Line Graph

A line graph is a graph formed by segments of straight lines that join the plotted
points that represent given data. The line graph is used to show changing conditions,
often over a certain time interval. A general linear function has the form y = mx + c,
where m and c are constants.

The fundamental rule behind sketching a linear graph is that only two points are
required to graph a straight line. The following procedure is used in drawing linear
graphs:

 Substitute two different values for x in the equation y = mx + c to get two
values for y. This gives two points (x1, y1) and (x2, y2) on the line.

 Draw the horizontal and vertical axes and select a suitable scale for both
axes.

 If the values in the table are large, choose the scale accordingly.

 Plot the two points in the Cartesian plane. Join the two points
using a line segment and extend it in both directions. The line obtained
is the required linear graph.

Example: Draw a graph for the line y = 2x − 3.

Step 1: Construct a table with suitable x values.

X -2 0 2 4
Y

Step 2: Find the values of y for each x value.

To work out the missing values, we substitute each x value from the table into the
equation:

When x = −2, we get y = (2×−2)−3 = −7

When x = 0, we get y = (2×0)−3 = −3

When x = 2, we get y = (2×2)−3 = 1

When x = 4, we get y = (2×4)−3 = 5

X -2 0 2 4
Y -7 -3 1 5
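
The substitution in Step 2 can be reproduced with a short sketch. The function name `line_points` is an illustrative choice; the slope 2, intercept −3 and the x values are those of the example.

```python
def line_points(m, c, xs):
    """Return the (x, y) points on the line y = m*x + c for the given x values."""
    return [(x, m * x + c) for x in xs]


if __name__ == "__main__":
    # Slope 2 and intercept -3 come from the example; the x values are those in the table.
    for x, y in line_points(2, -3, [-2, 0, 2, 4]):
        print(f"x = {x:>2}, y = {y:>2}")
```

Any two of these points are enough to fix the line; the remaining points serve as a check on the arithmetic.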

Step 3: So, we know that the line passes through (−2, −7), (0, −3), (2, 1) and (4, 5).

Now all that remains is to plot them on a pair of axes and draw a straight line
through them. The result should look like the graph below.

Double Line Graph

A double line graph is a line graph with two lines drawn within one chart. It compares
two different subjects over a period of time and shows how each of them changes.
Double line graphs are used to compare trends and patterns between two subjects.

Steps to Make a Double Line Graph

 Use the data from the table to choose an appropriate scale.

 Draw and label the scale on the vertical and horizontal axes.

 List each item and locate the points on the graph for both lines.

 Connect the points of each line separately with line segments.

 Draw the two line graphs within one chart.

Example: See the graph below.

Uses of Line Graph

The main use of a line graph is to track changes over short and long
periods of time. It is also used to compare the changes over the same period of time for
different groups. When the changes are small, a line graph is better than a bar graph.

For example, the finance team of a company may want to plot the changes in the cash
that the company has on hand over time. In that case, they use a line graph,
plotting the points against the horizontal and vertical axes. The horizontal axis usually
represents the time period of the data.

Example : The population of a small village is recorded every 10 years. Draw a line graph
to show this data.

Year Population (1000s)

1970 0.97

1980 1.69

1990 3.51

2000 4.1

2010 6.71
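
Since the slope between consecutive points is the key feature of a line graph, it can help to tabulate the changes before plotting. The sketch below does this for the village data; the populations are converted from thousands to persons (an adjustment made here so that the arithmetic stays in whole numbers), and the function name is illustrative.

```python
# Census values from the table, converted from thousands to persons
# (for example, 0.97 thousand = 970 persons).
population = {1970: 970, 1980: 1690, 1990: 3510, 2000: 4100, 2010: 6710}


def changes_between_points(series):
    """Return [((x1, x2), change in y, slope)] for consecutive points."""
    points = sorted(series.items())
    result = []
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        result.append(((x1, x2), y2 - y1, (y2 - y1) / (x2 - x1)))
    return result


if __name__ == "__main__":
    for (start, end), change, slope in changes_between_points(population):
        print(f"{start}-{end}: +{change} persons, about {slope:.0f} per year")
```

The segment with the largest slope, here 2000 to 2010, is the steepest part of the line graph.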

The line graph should have the year on the x-axis and the population on the y-axis.
It should also have the axes clearly labelled and an appropriate title at the top. With all
points plotted correctly and joined with straight lines, the line graph should look like:

Histogram

A histogram displays the frequency of data in a dataset using connected rectangular
bars. Here, the number of observations that fall into a predefined class interval is
represented by a rectangular bar. Statistics is a branch of mathematics that is applied in
various fields. When values are repeated in statistical data, the number of repetitions is
known as the frequency, and it can be written in the form of a table called a frequency
distribution. A frequency distribution can be shown graphically by using different types
of graphs, and the histogram is one among them.

A histogram is a graphical representation of a grouped frequency distribution with
continuous classes. It is an area diagram and can be defined as a set of rectangles with
bases along the intervals between class boundaries and with areas proportional to the
frequencies in the corresponding classes.

In such representations, all the rectangles are adjacent since the bases cover the
intervals between class boundaries.

For classes of equal width, the heights of the rectangles are proportional to the
corresponding frequencies; for classes of unequal width, the heights are proportional to
the corresponding frequency densities.
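
Frequency density is not defined in the passage above. Under the usual definition, which is taken here as an assumption, it is the class frequency divided by the class width, so that the area of each rectangle equals the frequency of its class. With $f_i$ the frequency, $w_i$ the width and $d_i$ the frequency density of class $i$ (symbols introduced for illustration):

```latex
\[
d_i = \frac{f_i}{w_i}, \qquad
\text{area of rectangle } i = w_i \times d_i = f_i
\]
```

When all classes have the same width, dividing by that common width only rescales the vertical axis, which is why heights and frequencies are then proportional.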

In other words, a histogram is a diagram consisting of rectangles whose areas are
proportional to the frequencies of a variable and whose widths are equal to the class
intervals.

A histogram is the graphical representation of data where data is grouped into
continuous number ranges and each range corresponds to a vertical bar.

 The horizontal axis displays the number range.

 The vertical axis (frequency) represents the amount of data that is present in
each range.

The number ranges depend upon the data that is being used.

A histogram looks like a bar graph: the range of outcomes is divided into columns
along the x-axis, and the count of occurrences in the data for each column is
represented on the y-axis.

It is one of the easiest ways to visualize data distributions. Let us
understand the histogram by plotting one for the example given below.

Uncle Bruno owns a garden with 30 black cherry trees. Each tree is of a different
height. The height of the trees (in inches): 61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73,
73.5, 74, 74.5, 76, 76.2, 76.5, 77, 77.5, 78, 78.5, 79, 79.2, 80, 81, 82, 83, 84, 85 and 87.
We can group the data as follows in a frequency distribution table by setting a range:

Height Range (in)    Number of Trees (Frequency)
60 - 65              3
66 - 70              3
71 - 75              8
76 - 80              10
81 - 85              5
86 - 90              1
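
The grouping of the raw heights into classes can be reproduced with the following sketch. It assumes that each class runs up to and including its upper limit, i.e. the classes are (60, 65], (65, 70] and so on; the text does not state how boundary values are assigned, so this convention and the function name are assumptions.

```python
from bisect import bisect_left

# Tree heights (in inches) exactly as listed in the example.
heights = [61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5,
           76, 76.2, 76.5, 77, 77.5, 78, 78.5, 79, 79.2, 80,
           81, 82, 83, 84, 85, 87]

# Assumption: each class includes its upper limit,
# so the classes are (60, 65], (65, 70], ..., (85, 90].
upper_limits = [65, 70, 75, 80, 85, 90]


def class_frequencies(values, uppers):
    """Count how many values fall in each class, identified by its upper limit."""
    counts = [0] * len(uppers)
    for value in values:
        # bisect_left finds the first class whose upper limit is >= value.
        counts[bisect_left(uppers, value)] += 1
    return counts


if __name__ == "__main__":
    for upper, count in zip(upper_limits, class_frequencies(heights, upper_limits)):
        print(f"up to {upper} in: {count} trees")
```

The frequencies add up to 30, the number of trees, which is a useful check that no observation has been dropped or counted twice.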

This data can be now shown using a histogram. We need to make sure that while
plotting a histogram, there shouldn’t be any gaps between the bars.

How to Make a Histogram?

The process of making a histogram using the given data is described below:

 Step 1: Choose a suitable scale to represent weights on the horizontal axis.

 Step 2: Choose a suitable scale to represent the frequencies on the vertical
axis.

 Step 3: Then draw the bars corresponding to each of the given weights using
their frequencies.

Example:

Construct a histogram for the following frequency distribution table that describes
the frequencies of weights of 26 students in a class.

Weights (in lbs) Frequency (Number of students)

65 - 70 4

70 - 75 10

75 - 80 8

80 - 85 4

Steps to draw a Histogram

 Step 1: On the horizontal axis, we can choose the scale to be 1 unit = 11 lb.
Since the weights in the table start from 65, not from 0, we give a break/kink
on the X-axis.

 Step 2: On the vertical axis, the frequencies are varying from 4 to 10. Thus, we
choose the scale to be 1 unit = 2.

 Step 3: Then draw the bars corresponding to each of the given weights using
their frequencies.

Frequency Histogram

A frequency histogram is a histogram that shows the frequencies (the number of
occurrences) of the given data items. For example, in a hospital, there are 20 newborn
babies whose ages (in days), in increasing order, are as follows: 1, 1, 1, 1, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4 and 5. This information can be shown in a frequency
distribution table as follows:

Age (in days) Frequency

1 4

2 5

3 8

4 2

5 1
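
Counting the occurrences of each age can be sketched with Python's standard `collections.Counter`; the function name `frequency_table` is an illustrative choice.

```python
from collections import Counter

# Ages (in days) of the 20 newborn babies, as listed in the example.
ages = [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5]


def frequency_table(values):
    """Return (value, frequency) pairs sorted by value."""
    return sorted(Counter(values).items())


if __name__ == "__main__":
    for age, frequency in frequency_table(ages):
        print(f"Age {age} day(s): {frequency}")
```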

This data can be now shown using a frequency histogram.

Histogram Shapes

The histogram can be classified into different types based on the frequency
distribution of the data. There are different types of distributions, such as normal
distribution, skewed distribution, bimodal distribution, multimodal distribution, comb
distribution, edge peak distribution, dog food distribution, heart cut distribution, and so on.

The histogram can be used to represent these different types of distributions. The
different types of a histogram are uniform histogram, symmetric histogram, bimodal
histogram, probability histogram.

Symmetric Histogram

When a vertical line is drawn down the centre of the histogram and the two
sides are identical in size and shape, the histogram is said to be symmetric. The diagram
is perfectly symmetric if the right half is the mirror image of the left half.
Histograms that are not symmetric are known as skewed.

Probability Histogram

A probability histogram is a pictorial representation of a discrete probability
distribution. It consists of a rectangle centred on every value of x, and the area of each
rectangle is proportional to the probability of the corresponding value. Drawing a
probability histogram begins with selecting the classes. The probabilities of the outcomes
are the heights of the bars of the histogram.

There are mainly 5 types of histogram shapes. They are listed below:

 Bell Shaped Histogram

 Bimodal Histogram

 Skewed Right Histogram

 Skewed Left Histogram

 Uniform Histogram

Bell-Shaped Histogram

A bell-shaped histogram has a single peak. For example, the following
histogram shows the number of children visiting a park at different time intervals. The
maximum number of children visit the park between 5.30 p.m. and 6 p.m. The histogram
has just one peak, at this time interval, and hence it is a bell-shaped histogram.

Bimodal Histogram

If a histogram has two peaks, it is said to be bimodal. Bimodality occurs when the
data set has observations on two different kinds of individuals or combined groups,
provided the centres of the two separate histograms are far enough apart relative to the
variability in both data sets.

A bimodal histogram has two peaks and looks like the graph given below. For
example, the following histogram shows the marks obtained by the 48 students of Class 8
of St. Mary’s School.

The maximum number of students have scored either between 40 and 50 marks or
between 60 and 70 marks. This histogram has two peaks (between 40 and 50 and between
60 and 70) and hence it is a bimodal histogram.

Skewed Right Histogram

A skewed right histogram is a histogram whose longer tail extends to the right: most
of the data are concentrated on the left, and the bars become shorter towards the right.
For example, the following histogram shows the number of people
corresponding to different wage ranges. The histogram is skewed to the right. For the
maximum number of people, wages ranged from 10 to 20 (thousands).

Skewed Left Histogram

A skewed left histogram is a histogram whose longer tail extends to the left: most of
the data are concentrated on the right, and the bars become shorter towards the left.
For example, the following histogram shows the number of students of Class 10 of
Greenwood High School according to the amount of time they spend on their studies on a
daily basis. The maximum number of students study for 4.5 to 5 hours daily.

Uniform Histogram

A uniform distribution may indicate that the number of classes is too small, so that
each class has roughly the same number of elements. It may also arise from a distribution
that has several peaks.

A uniform histogram is a histogram in which all the bars are more or less of the same
height.

For example, Ma’am Lucy, the Principal of Little Lilly Playschool, wanted to
record the heights of her students. The following histogram shows the number of students
and their varying heights. The heights of the students range from 30 inches to 50
inches.

Difference Between a Bar Chart and a Histogram

The fundamental difference between histograms and bar graphs from a visual
aspect is that bars in a bar graph are not adjacent to each other.

 A bar graph is the graphical representation of categorical data using
rectangular bars where the length of each bar is proportional to the value it
represents.

 A histogram is the graphical representation of data where data is grouped into
continuous number ranges and each range corresponds to a vertical bar.

The main differences between a bar chart and a histogram are as follows:

Bar Graph                          Histogram

Equal space between every two      No space between two consecutive bars.
consecutive bars.                  They should be attached to each other.

X-axis can represent anything.     X-axis should represent only continuous
                                   data that is in terms of numbers.

In both graphs, however, the y-axis represents numbers only. We can understand these
differences from the following figure:

[Figure: Difference between Bar Chart and Histogram]

Example : Consider the following histogram that represents the weights of 34 newborn
babies in a hospital. If the children weighing between 6.5 lb to 8.5 lb are considered
healthy, then find the percentage of the children of this hospital that are healthy.

Solution: We have to first find the number of children weighing between 6.5 lb and 8.5 lb.
From the given histogram, the number of children weighing between:
6.5 lb - 7.5 lb = 10
7.5 lb - 8.5 lb = 18

Therefore, the number of children weighing between 6.5 lb and 8.5 lb = 10 + 18 = 28.

The total number of children in the hospital = 34. Hence, the required percentage is:
28/34 × 100 = 82.35, or approximately 82%. ∴ Required percentage ≈ 82%.
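
The percentage can be checked with a short calculation. This is only a sketch of the arithmetic; the counts 10, 18 and 34 are those read from the histogram in the example, and the function name is illustrative.

```python
def percentage(part, whole):
    """Return part as a percentage of whole."""
    return part * 100 / whole


if __name__ == "__main__":
    # Counts read from the histogram in the example: 10 and 18 babies
    # in the two healthy classes, out of 34 babies in total.
    healthy = 10 + 18
    print(f"{percentage(healthy, 34):.2f}%")
```

Here 28/34 is about 0.8235, so the share of healthy babies is about 82.35%.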

Uses of Histogram

The histogram graph is used under certain conditions. They are:

 The data should be numerical.

 A histogram is used to check the shape of the data distribution.

 Used to check whether the process changes from one period to another.

 Used to determine whether the output is different when it involves two or more
processes.

 Used to analyse whether the given process meets the customer requirements

Graphical Representation

Graphical representation is a way of analysing numerical data. It exhibits the
relation between data, ideas, information and concepts in a diagram. It is easy to
understand and it is one of the most important learning strategies.

The choice of representation always depends on the type of information in a particular
domain. There are different types of graphical representation. Some of them are as follows:

 Line Graphs – A line graph, or linear graph, is used to display continuous
data and is useful for predicting future events over time.

 Bar Graphs – A bar graph is used to display categories of data and
compares the data using solid bars to represent the quantities.

 Histograms – A histogram uses bars to represent the frequency of numerical
data that are organised into intervals. Since all the intervals are equal and
continuous, all the bars have the same width.

 Line Plot – It shows the frequency of data on a given number line. An ‘x’ is
placed above the number line each time that value occurs.

 Frequency Table – The table shows the number of pieces of data that fall
within each given interval.

 Circle Graph – Also known as the pie chart, it shows the relationship of the
parts to the whole. The whole circle represents 100%, and each category
occupies a sector corresponding to its percentage, such as 15%, 56%, etc.

 Stem and Leaf Plot – In the stem and leaf plot, the data are organised from
the least value to the greatest value. The digits of the least place value form the
leaves and the digits of the next place value form the stems.

 Box and Whisker Plot – The plot summarises the data by dividing them
into four parts. The box and whiskers show the range (spread) and the middle
(median) of the data.

General Rules for Graphical Representation of Data

There are certain rules to effectively present the information in the graphical
representation. They are:

 Suitable Title: Make sure that the appropriate title is given to the graph which
indicates the subject of the presentation.

 Measurement Unit: Mention the measurement unit in the graph.

 Proper Scale: To represent the data in an accurate manner, choose a proper
scale.

 Index: Index the appropriate colours, shades, lines, design in the graphs for
better understanding.

 Data Sources: Include the source of information wherever it is necessary at the
bottom of the graph.

 Keep it Simple: Construct a graph in an easy way that everyone can
understand.

 Neat: Choose the correct size, fonts, colours etc in such a way that the graph
should be a visual aid for the presentation of information.

Advantages of Graphical Method

Some of the advantages of graphical representation are:

 It makes data more easily understandable.

 It saves time.

 It makes the comparison of data more efficient.

[Figure: Different Types of Graphical Representation]

Graphical Representation in Maths

In Mathematics, a graph is defined as a chart with statistical data, which are
represented in the form of curves or lines drawn through the coordinate points plotted on
its surface.

It helps to study the relationship between two variables by measuring
the change in one variable with respect to another variable within a given interval
of time. It also helps to study the series distribution and frequency distribution for a
given problem. There are two types of graphs to visually depict the information. They are:

 Time Series Graphs – Example: Line Graph

 Frequency Distribution Graphs – Example: Frequency Polygon Graph.

Principles of Graphical Representation

Algebraic principles are applied to all types of graphical representation of data. In
graphs, data are represented using two lines called coordinate axes. The horizontal axis is
denoted as the x-axis and the vertical axis is denoted as the y-axis.

The point at which the two lines intersect is called the origin ‘O’. On the x-axis,
distances to the right of the origin take positive values and distances to the left of
the origin take negative values. Similarly, on the y-axis, points above the origin take
positive values and points below the origin take negative values.

[Figure: Principles of Graphical Representation]

Generally, the frequency distribution is represented by the following methods:

 Histogram

 Smoothed frequency graph

 Pie diagram

 Cumulative or ogive frequency graph

 Frequency Polygon

Merits of Using Graphs

Some of the merits of using graphs are as follows:

 The graph is easily understood by everyone without any prior knowledge.

 It saves time

 It allows us to relate and compare the data for different time periods

 It is used in statistics to determine the mean, median and mode for different
data, as well as in the interpolation and the extrapolation of data.

Example for Frequency Polygon Graph

The steps to represent a frequency distribution graphically as a frequency
polygon are as follows:

 Obtain the frequency distribution and find the midpoint of each class interval.

 Represent the midpoints along the x-axis and the frequencies along the y-axis.

 Plot the points corresponding to the frequency at each midpoint.

 Join these points, in order, using straight lines.

 To complete the polygon, join the point at each end to the class mark of the
adjacent lower or higher class (which has zero frequency) on the x-axis.

Frequency Polygon

A frequency polygon is a graphical form of representation of data. It is used to depict
the shape of the data and to depict trends. It is usually drawn with the help of a histogram but
can be drawn without it as well. A histogram is a series of rectangular bars with no space
between them and is used to represent frequency distributions.

Steps to Draw a Frequency Polygon

 Mark the class intervals for each class on the horizontal axis. We will plot the
frequency on the vertical axis.

 Calculate the class mark for each class interval. The formula for class mark is:

Class mark = (Upper limit + Lower limit) / 2

 Mark all the class marks on the horizontal axis. It is also known as the mid-
value of every class.

 Corresponding to each class mark, plot the frequency as given to you. The
height always depicts the frequency. Make sure that the frequency is plotted
against the class mark and not the upper or lower limit of any class.

 Join all the plotted points using a line segment. The curve obtained will be
kinked.

 This resulting curve is called the frequency polygon.

Note that the above method is used to draw a frequency polygon without drawing
a histogram. You can also draw a histogram first by drawing rectangular bars against the
given class intervals. After this, you must join the midpoints of the bars to obtain the
frequency polygon. Remember that the bars will have no spaces between them in a
histogram.

Example: Construct a frequency polygon using the data given below:

Test scores Frequency

49.5-59.5 5
59.5-69.5 10
69.5-79.5 30
79.5-89.5 40
89.5-99.5 15

Answer: We first calculate the cumulative frequency from the frequencies given.

Test scores Frequency Cumulative Frequency

49.5-59.5 5 5
59.5-69.5 10 15
69.5-79.5 30 45
79.5-89.5 40 85
89.5-99.5 15 100
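
The cumulative frequency column is simply a running total of the frequencies. A minimal sketch using the standard `itertools.accumulate` function (the name `cumulative` is chosen here for illustration):

```python
from itertools import accumulate

# Class frequencies from the test-score table, in class order.
frequencies = [5, 10, 30, 40, 15]


def cumulative(freqs):
    """Return the running totals of a list of class frequencies."""
    return list(accumulate(freqs))


if __name__ == "__main__":
    print(cumulative(frequencies))
```

The last cumulative frequency always equals the total number of observations, which is 100 in this example.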

We now start by plotting the class marks such as 54.5, 64.5, 74.5 and so on till 94.5.
Note that we will also plot the previous and next class marks to start and end the polygon, i.e.
we plot 44.5 and 104.5 as well.
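
The class marks and the two closing points can be computed as in the following sketch. It assumes that all classes have the same width, as they do in this example, so that the closing points lie one class width beyond the first and last class marks; the function names are illustrative.

```python
# Class boundaries and frequencies from the test-score table.
classes = [(49.5, 59.5), (59.5, 69.5), (69.5, 79.5), (79.5, 89.5), (89.5, 99.5)]
frequencies = [5, 10, 30, 40, 15]


def class_mark(lower, upper):
    """Class mark = (upper limit + lower limit) / 2."""
    return (upper + lower) / 2


def polygon_points(class_limits, freqs):
    """Return the (class mark, frequency) points, closed at both ends.

    Assumes equal class widths, so the closing points sit one class
    width before the first class mark and after the last class mark.
    """
    marks = [class_mark(lo, hi) for lo, hi in class_limits]
    width = class_limits[0][1] - class_limits[0][0]
    points = list(zip(marks, freqs))
    return [(marks[0] - width, 0)] + points + [(marks[-1] + width, 0)]


if __name__ == "__main__":
    for mark, frequency in polygon_points(classes, frequencies):
        print(f"class mark {mark}: frequency {frequency}")
```

Joining these seven points in order with straight lines gives the frequency polygon, starting and ending on the x-axis.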

Then, the frequencies corresponding to the class marks are plotted against each class
mark. The frequencies for class marks 44.5 and 104.5 are zero, so these two points lie on
the x-axis. They are used only to give a closed shape to the polygon. The polygon looks
like this:

Construction of frequency polygon

In order to create a frequency polygon, one must follow these steps:

 Creation of a histogram.

 Finding the midpoints for each bar that exists on the histogram.

 Placing a point on the origin of the histogram and its end.

 Connection of the points.

The frequency histogram is similar to a column graph, but without spaces between the
columns. The frequency polygon is a special line graph used in statistics. The two graphs
can be drawn either separately or combined, using the information available in a
frequency distribution table. Frequency polygons give an understanding of the shape
of the data and its trends.

Difference between a frequency polygon and frequency curve

The major difference between a frequency polygon and a frequency curve is that a
frequency polygon is drawn by joining the points with straight lines, while a
frequency curve is drawn as a smooth freehand curve through the points.

The following is the age distribution of 1000 persons working in a large industrial
house:

Age group Number of persons
20-25 30
25-30 160
30-35 210
35-40 180
40-45 145
45-50 105
50-55 70
55-60 60
60-65 40

Represent this data using a pie diagram, a line diagram, a bar diagram and a histogram.

Bar Diagram

[Figure: bar diagram of number of persons (y-axis, 0 to 250) for each age group (x-axis, 20-25 to 60-65)]

Pie Diagram

[Figure: pie diagram of number of persons by age group. Shares: 20-25, 3%; 25-30, 16%; 30-35, 21%; 35-40, 18%; 40-45, 14%; 45-50, 11%; 50-55, 7%; 55-60, 6%; 60-65, 4%]

Line Diagram

[Figure: line diagram of number of persons (y-axis, 0 to 250) for each age group (x-axis, 20-25 to 60-65)]

Histogram

A frequency polygon is a graph constructed by using lines to join the midpoints of
each interval, or bin. The heights of the points represent the frequencies. A frequency
polygon can be created from the histogram or by calculating the midpoints of the bins
from the frequency distribution table.

The following histogram represents the marks made by 40 students on a Math 10 test.

Use the histogram to construct a frequency polygon to represent the data.

The following distribution table represents the number of miles run by 20
randomly selected runners during a recent road race:
Bin (Size) Frequency
5.5-10.5 1
10.5-15.5 3
15.5-20.5 2
20.5-25.5 4
25.5-30.5 5
30.5-35.5 3
35.5-40.5 2
Using this table, construct a frequency polygon.

Step 1: Calculate the midpoint of each bin by adding the 2 numbers of the interval
and dividing the sum by 2.

Example of Midpoints: (5.5+10.5) /2 = 8

Bin Size Frequency Mid point
5.5-10.5 1 8
10.5-15.5 3 13
15.5-20.5 2 18
20.5-25.5 4 23
25.5-30.5 5 28
30.5-35.5 3 33
35.5-40.5 2 38

Step 2: Plot the midpoints on a grid, making sure to number the x-axis with a scale
that will include the bin sizes. Join the plotted midpoints with lines.

A frequency polygon usually extends 1 unit below the smallest bin value and 1
unit beyond the greatest bin value. This extension gives the frequency polygon an
appearance of having a starting point and an ending point, which provides a view of the
distribution of data. If the data set were very large so that the number of bins had to be
increased and the bin size decreased, the frequency polygon would appear as a smooth
curve.

An ogive, sometimes called a cumulative frequency polygon, is a type of frequency
polygon that shows cumulative frequencies.

In other words, the cumulative frequencies (or cumulative percentages) are plotted on the
graph from left to right.

An ogive plots cumulative frequency on the y-axis and class boundaries along the
x-axis. It is very similar to a histogram, except that instead of rectangles, an ogive
has a single point marking where the top right corner of each rectangle would be.

It is usually easier to create this kind of graph from a frequency table.

Example: Draw an ogive for the following set of data: 2, 7, 3, 8, 3, 15, 19, 16, 17,
13, 29, 20, 21, 21, 22, 25, 31, 51, 55, 55, 57, 58, 56, 57 and 58.

Step 1: Make a relative frequency table from the data. The first column has the
class limits, the second column has the frequency (the count) and the third column has the
relative frequency (class frequency / total number of items):

Class Limits Frequency Relative Frequency

01 to 09 5 5/25=0.20

10 to 19 5 5/25=0.20

20 to 29 6 6/25=0.24

30 to 39 1 1/25=0.04

40 to 49 0 0/25=0

50 to 59 8 8/25=0.32

Step 2: Add a fourth column and cumulate (add up) the frequencies in column 2,
going down from top to bottom. For example, the second entry is the sum of the first row
and the second row in the frequency column (5 + 5 = 10), and the third entry is the sum of
the first, second, and third rows in the frequency column (5 + 5 + 6 = 16):

Class Limits    Frequency    Relative Frequency    Cumulative Frequency

01 to 09        5            5/25=0.20             5

10 to 19        5            5/25=0.20             10

20 to 29        6            6/25=0.24             16

30 to 39        1            1/25=0.04             17

40 to 49        0            0/25=0                17

50 to 59        8            8/25=0.32             25

Step 3: Add a fifth column and cumulate the relative frequencies from column 3.
If you do this step correctly, the last cumulative value should equal 100% (or 1 as a decimal):

Class Limits    Frequency    Relative Frequency    Cumulative Frequency    Cumulative Relative Frequency

01 to 09        5            5/25=0.20             5                       0.20

10 to 19        5            5/25=0.20             10                      0.40

20 to 29        6            6/25=0.24             16                      0.64

30 to 39        1            1/25=0.04             17                      0.68

40 to 49        0            0/25=0                17                      0.68

50 to 59        8            8/25=0.32             25                      1.00

Step 4: Draw a Cartesian plane (x-y graph) with percent cumulative relative
frequency on the y-axis (from 0 to 100%, or as a decimal, 0 to 1). Mark the x-axis with the
class boundaries.

Step 5: Plot your points.

Note: Each point should be plotted on the upper limit of the class boundary. For example,
if your first class boundary is 0 to 10, the point should be plotted at 10.

Step 6: Connect the dots with straight lines. The ogive is one continuous line, made
up of several smaller lines that connect pairs of dots, moving from left to right.
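
Steps 1 to 3 can be checked mechanically. The sketch below rebuilds the frequency, cumulative frequency and cumulative relative frequency columns from the raw data; the function name `ogive_table` is an assumption made for illustration, and each class is identified by its upper limit, which is where the ogive point is plotted.

```python
# Raw data and class limits from the ogive example.
data = [2, 7, 3, 8, 3, 15, 19, 16, 17, 13, 29, 20, 21, 21, 22, 25,
        31, 51, 55, 55, 57, 58, 56, 57, 58]
classes = [(1, 9), (10, 19), (20, 29), (30, 39), (40, 49), (50, 59)]

def ogive_table(data, classes):
    """Return rows of (upper limit, frequency, cumulative frequency,
    cumulative relative frequency), one row per class."""
    n = len(data)
    rows = []
    running = 0
    for lo, hi in classes:
        freq = sum(1 for x in data if lo <= x <= hi)
        running += freq
        rows.append((hi, freq, running, running / n))
    return rows

rows = ogive_table(data, classes)
for row in rows:
    print(row)
```

The third and fourth entries of each row are the values plotted on the y-axis, depending on whether the ogive uses counts or proportions.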

UNIT - II

Lesson 3
Measures of Central Tendency

Arithmetic Mean

The mean (or average) is the most popular and well known measure of central
tendency. It can be used with both discrete and continuous data, although its use is most
often with continuous data. The mean is equal to the sum of all the values in the data set
divided by the number of values in the data set.

$\bar{X} = \dfrac{\sum fx}{N}$

where $\bar{X}$ = mean

f = frequency of each class

x = mid-interval value of each class

N = total frequency
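
As a quick illustration of the formula, here is a minimal Python sketch. The midpoints and frequencies are those of the marks table worked later in this lesson; the function name `grouped_mean` is an assumption for illustration.

```python
# Class midpoints (x) and class frequencies (f).
midpoints = [5, 15, 25, 35, 45, 55]
frequencies = [12, 18, 27, 20, 17, 6]

def grouped_mean(midpoints, frequencies):
    """Mean of a frequency distribution: sum(f * x) / N."""
    total_fx = sum(f * x for x, f in zip(midpoints, frequencies))
    n = sum(frequencies)  # N = total frequency
    return total_fx / n

mean_marks = grouped_mean(midpoints, frequencies)
print(mean_marks)
```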

Median

The median is the middle score for a set of data that has been arranged in order of
magnitude. The median is less affected by outliers and skewed data.

For Discrete Data : Median = [(n + 1)/2]th term

For Grouped Data : Median $= l + \left(\dfrac{\frac{n}{2} - cf}{f}\right) \times h$
where,

l = lower limit of the median class

n = number of observations

cf = cumulative frequency of the class preceding the median class

f = frequency of the median class

h = class size
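
The grouped-data formula can be sketched in Python as follows. This is a hedged illustration only: the class limits and frequencies are sample values chosen for the sketch (equal class width h = 10), and the function name `grouped_median` is hypothetical.

```python
# Lower class limits and frequencies for classes 0-10, 10-20, ..., 50-60.
lower_limits = [0, 10, 20, 30, 40, 50]
frequencies = [12, 18, 27, 20, 17, 6]

def grouped_median(lower_limits, frequencies, h):
    """Median = l + ((n/2 - cf) / f) * h, where the median class is the
    first class whose cumulative frequency reaches n/2."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("empty distribution")
    cf = 0  # cumulative frequency of the classes before the current one
    for l, f in zip(lower_limits, frequencies):
        if cf + f >= n / 2:
            return l + (n / 2 - cf) / f * h
        cf += f

median_marks = grouped_median(lower_limits, frequencies, 10)
print(round(median_marks, 3))
```

Here n/2 = 50 falls in the class 20-30 (cumulative frequencies 12, 30, 57), so l = 20, cf = 30 and f = 27.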

Mode

The mode is the most frequent score in our data set. On a bar chart or histogram it
is represented by the highest bar. You can, therefore, sometimes consider the
mode as being the most popular option.

$M_O = l + \left(\dfrac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times h$
where,

l = lower limit of the modal class

h = size of the class interval (assuming all class size to be equal)

f1 = frequency of the modal class

f0 = frequency of the class preceding the modal class

f2 = frequency of the class succeeding the modal class.
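
A minimal Python sketch of the grouped mode formula follows. The data are sample values chosen for illustration, the function name `grouped_mode` is hypothetical, and a missing neighbouring class is treated as having frequency zero, which is an assumption of this sketch.

```python
# Lower class limits and frequencies for classes 0-10, 10-20, ..., 50-60.
lower_limits = [0, 10, 20, 30, 40, 50]
frequencies = [12, 18, 27, 20, 17, 6]

def grouped_mode(lower_limits, frequencies, h):
    """Mode = l + (f1 - f0) / (2*f1 - f0 - f2) * h for equal class widths."""
    i = frequencies.index(max(frequencies))  # position of the modal class
    f1 = frequencies[i]
    f0 = frequencies[i - 1] if i > 0 else 0
    f2 = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    return lower_limits[i] + (f1 - f0) / (2 * f1 - f0 - f2) * h

mode_marks = grouped_mode(lower_limits, frequencies, 10)
print(mode_marks)
```

The modal class is 20-30, so l = 20, f1 = 27, f0 = 18 and f2 = 20. The formula is undefined when f1 = f0 = f2, since the denominator is then zero.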

Finding the Mean, Median and Mode

We want to work out the mean, median and mode for the data as 5, 9, 12, 4, 5, 14,
19, 16, 3, 5 and 7.

To calculate the mean, we need to add all the values up and divide by the number
of values.

(5 + 9 + 12 + 4 + 5 + 14 + 19 + 16 + 3 + 5 + 7) ÷ 11 = 99 ÷ 11 = 9

In this case the mean is 9 which is one of the values in the list. Sometimes the
mean will not appear in the original list. It might even be a decimal value.

To calculate the median, we need to put the numbers in order and find the middle
value.

3 4 5 5 5 7 9 12 14 16 19

Here the median is 7 because this is the middle value. Half of the other values in
the list are below 7 and half are above 7.

To calculate the mode, we need to look at which value appears the most often. It
can help if the numbers are in order.

3 4 5 5 5 7 9 12 14 16 19

In this list the mode is 5, because it appears most often.

Sometimes there will be more than one mode, because two or more values appear the
same number of times.
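
These three measures can be checked with Python's standard `statistics` module. This is a small sketch for verification only; `multimode` (available from Python 3.8) returns every value that ties for the highest count, which covers the case of more than one mode.

```python
import statistics

scores = [5, 9, 12, 4, 5, 14, 19, 16, 3, 5, 7]

mean_score = statistics.mean(scores)      # sum of values / number of values
median_score = statistics.median(scores)  # middle value of the sorted list
modes = statistics.multimode(scores)      # list of the most frequent value(s)

# With an even number of values, median() averages the two middle values.
even_median = statistics.median([3, 6, 7, 8, 11, 15])

print(mean_score, median_score, modes, even_median)
```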

Finding the median where there are an even number of values

When there are an even number of values, there is no clear middle value. For
example,

What is the median of 3, 6, 7, 8, 11 and 15?

In this case, there are two middle values.

3 6 7 8 11 15

The median is the mean of these two middle numbers.

(7 + 8) ÷ 2 = 15 ÷ 2 = 7.5

So the median for this set of values is 7.5. Like the mean, the median value does
not always appear in the original list of values.

Example : Find the mean, median, mode, and range for the following list of values: 1, 2, 4
and 7.

Solution: The mean is the usual average:

(1 + 2 + 4 + 7) ÷ 4 = 14 ÷ 4 = 3.5

The median is the middle number. In this example, the numbers are already listed
in numerical order, so we don’t have to rewrite the list. But there is no “middle” number,
because there is even number of numbers. Because of this, the median of the list will be
the mean (that is, the usual average) of the middle two values within the list. The middle
two numbers are 2 and 4, so:

(2 + 4) ÷ 2 = 6 ÷ 2 = 3

So the median of this list is 3, a value that isn’t in the list at all.

The mode is the number that is repeated most often, but all the numbers in this list
appear only once, so there is no mode.

The largest value in the list is 7, and the smallest is 1, so the range is 7 - 1 = 6.

Example: Find the mean, median, mode, and range for the following list of values: 13,
18, 13, 14, 13, 16, 14, 21 and 13.

Solution: The mean is the usual average, so we’ll add and then divide:

(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15

Note that the mean, in this case, isn’t a value from the original list. This is a
common result.

You should not assume that your mean will be one of your original numbers. The
median is the middle value, so first we’ll have to rewrite the list in numerical order:

13, 13, 13, 13, 14, 14, 16, 18, 21

There are nine numbers in the list, so the middle one will be the
(9 + 1) ÷ 2 = 10 ÷ 2 = 5th number:

13, 13, 13, 13, 14, 14, 16, 18, 21

So the median is 14.

The mode is the number that is repeated more often than any other, so 13 is the
mode, since 13 is being repeated 4 times.

The largest value in the list is 21, and the smallest is 13, so the range is 21-13 = 8.

Merits of Mean
• Arithmetic mean is simple to understand and easy to calculate.
• It is rigidly defined.
• It is suitable for further algebraic treatment.
• It is least affected by fluctuations of sampling.
• It takes into account all the values in the series.

Demerits of Mean
• It is highly affected by the presence of a few abnormally high or abnormally
low scores.
• If even a single item is missing, its value cannot be computed accurately.
• It cannot be determined by inspection.

Example :

(a) Find the arithmetic mean of the following frequency distribution:

x 1 2 3 4 5 6 7

f 5 9 12 17 14 10 6

(b) Calculate the arithmetic mean of the marks from the following table:

Marks No. of students

0-10 12

10-20 18

20-30 27

30-40 20

40-50 17

50-60 6

Solution:
(a) Computation of mean:

x f fx

1 5 5

2 9 18

3 12 36

4 17 68

5 14 70

6 10 60

7 6 42

Total N = 73 ∑fx = 299

$\bar{X} = \dfrac{1}{N}\sum fx = \dfrac{299}{73} \approx 4.10$

(b) Computing mean:
Marks No. of students (f) Mid point(x) fx

0-10 12 5 60

10-20 18 15 270

20-30 27 25 675

30-40 20 35 700

40-50 17 45 765

50-60 6 55 330

Total N=100 2800

$\bar{X} = \dfrac{1}{N}\sum fx = \dfrac{1}{100} \times 2{,}800 = 28$

Merits of Median

• It is simple to understand and easy to calculate, particularly in individual and discrete series.

• It is not affected by the extreme items in the series.

• It can be determined graphically.

• For open-ended classes, median can be calculated.

• It can be located by inspection, after arranging the data in order of magnitude.

Demerits of Median

• It does not consider all the observations because it is a positional average.

• The value of the median is affected more by sampling fluctuations.

• It is not capable of further algebraic treatment. Unlike the mean, a combined median cannot be calculated.

• It cannot be computed precisely when it lies between two items.

Example:
Obtain the median for the following frequency distribution:

x f

1 8

2 10

3 11

4 16

5 20

6 25

7 15

8 9

9 6

Solution:

x f c.f

1 8 8

2 10 18

3 11 29

4 16 45

5 20 65

6 25 90

7 15 105

8 9 114

9 6 120

Total N=120

Here N = 120, so

$\dfrac{N}{2} = \dfrac{120}{2} = 60$

The cumulative frequency just greater than $\dfrac{N}{2}$ is 65, and the value of x
corresponding to 65 is 5. Therefore, the median is 5.
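
The rule used in this solution, picking the value whose cumulative frequency is just greater than N/2, can be sketched in Python. The function name `discrete_median` is an assumption for illustration, and the sketch follows the textbook rule as stated rather than averaging two middle items.

```python
# Values and frequencies from the example above.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
frequencies = [8, 10, 11, 16, 20, 25, 15, 9, 6]

def discrete_median(values, frequencies):
    """Return the value whose cumulative frequency first exceeds N/2."""
    n = sum(frequencies)
    running = 0  # cumulative frequency so far
    for x, f in zip(values, frequencies):
        running += f
        if running > n / 2:
            return x
    raise ValueError("empty distribution")

median_x = discrete_median(values, frequencies)
print(median_x)
```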

Merits of Mode:

 It is comparatively easy to understand.

 It can be found graphically.

 It is easy to locate in some cases by inspection.

 It is not affected by extreme values.

 It is the simplest descriptive measure of average.

Demerits of Mode:

 It is not suitable for further mathematical treatment.

 It is an unstable measure as it is affected more by sampling fluctuations.

 Mode for the series with unequal class intervals cannot be calculated.

 In a bimodal distribution, there are two modal classes and it is difficult to

determine the values of the mode.

Example:

The sugar levels of 10 patients checked by a doctor are given below. Find the mode
of the sugar levels: 80, 112, 110, 115, 124, 130, 100, 90, 150 and 180.

Solution: Since each value occurs only once, there is no mode.

Example:

Compute the mode for the following observations: 2, 7, 10, 12, 10, 19, 2, 11, 3
and 12.

Solution: Here the observations 2, 10 and 12 each occur twice in the data set and every
other value occurs once, so the modes are 2, 10 and 12. For a discrete frequency
distribution, the mode is the value of the variable corresponding to the maximum frequency.

Geometric Mean

Definition: In Mathematics, the Geometric Mean (GM) is the average value or
mean which signifies the central tendency of a set of numbers by using the product of
their values. Basically, we multiply all the numbers together and take the nth root of the
product, where n is the total number of data values.

Difference between Arithmetic Mean and Geometric Mean

Arithmetic Mean: The arithmetic mean or mean is found by adding all the numbers in the
given data set and dividing the sum by the number of data points in the set.
For example, take the data set 5, 10, 15 and 20.
Here, the number of data points = 4
Arithmetic mean or mean = (5+10+15+20)/4
Mean = 50/4 = 12.5

Geometric Mean: The geometric mean is found by multiplying all the numbers in the
given data set and taking the nth root of the result.
For example, consider the data set 4, 10, 16 and 24.
Here n = 4
Therefore, the G.M = 4th root of (4 × 10 × 16 × 24) = 4th root of 15360
G.M = 11.13

Geometric Mean Properties

Some of the important properties of the G.M are:

• The G.M of a data set is never greater than the arithmetic mean of the same data
set; the two are equal only when all the values are equal.

• If each item in the data set is replaced by the G.M, then the product of the
items remains unchanged.

• The G.M of the ratios of the corresponding observations in two series is equal to
the ratio of their geometric means.

• The G.M of the products of the corresponding items in two series is equal to
the product of their geometric means.

Application of Geometric Mean

The main assumption behind the G.M is that the data can be interpreted as scaling
factors. It should be applied only to positive values, and it is most often used for sets
of numbers whose values are exponential in nature or are meant to be multiplied together.
This means that the data must not contain zero or negative values. The geometric mean is
used in many fields. Some of its applications are as follows:

• It is used in stock indexes. Many of the Value Line indexes used by financial
departments are based on the G.M.

• It is used to calculate the annual return on a portfolio.

• It is used in finance to find average growth rates, also referred to as
the compounded annual growth rate.

• It is also used in studies of cell division, bacterial growth, etc.

Example : Find the geometric mean of the following data.

Weight of ear head x (g) Log x

45 1.653

60 1.778

48 1.681

100 2.000

65 1.813

Total 8.925

Solution: Here n = 5,

$G = \text{Antilog}\left(\dfrac{1}{n}\sum_{i=1}^{n} \log x_i\right) = \text{Antilog}\left(\dfrac{8.925}{5}\right)$

= Antilog 1.785 = 60.95

Therefore the G.M of the given data is 60.95.

Example: Find the geometric mean of the following grouped data for the frequency
distribution of weights.

Weights of ear heads (g) No of ear heads (f)

60-80 22

80-100 38

100-120 45

120-140 35

140-160 20

Total 160

Solution:

Weights of ear heads (g)    No of ear heads (f)    Mid x    Log x    f log x

60-80                       22                     70       1.845    40.59

80-100                      38                     90       1.954    74.25

100-120                     45                     110      2.041    91.85

120-140                     35                     130      2.114    73.99

140-160                     20                     150      2.176    43.52

Total                       160                                      324.2

From the given data, N = 160.

We know that the G.M for grouped data is

$G = \text{Antilog}\left(\dfrac{1}{N}\sum_{i=1}^{n} f_i \log x_i\right)$

GM = Antilog (324.2/160)

= Antilog (2.02625) = 106.23

Therefore, the G.M = 106.23.
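
Both geometric-mean examples can be reproduced with a short Python sketch. The function name `grouped_geometric_mean` is an assumption for illustration; `statistics.geometric_mean` is part of the standard library from Python 3.8. Because the code uses full-precision logarithms rather than three-decimal table values, its results differ slightly from the hand-worked figures.

```python
import math
import statistics

def grouped_geometric_mean(midpoints, frequencies):
    """G = antilog( (1/N) * sum(f * log10(x)) ), for positive midpoints."""
    n = sum(frequencies)
    total = sum(f * math.log10(x) for x, f in zip(midpoints, frequencies))
    return 10 ** (total / n)

# Ungrouped example: weights of ear heads.
gm_simple = statistics.geometric_mean([45, 60, 48, 100, 65])

# Grouped example: class midpoints and number of ear heads.
gm_grouped = grouped_geometric_mean([70, 90, 110, 130, 150],
                                    [22, 38, 45, 35, 20])

print(round(gm_simple, 2), round(gm_grouped, 2))
```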

Harmonic Mean

Harmonic mean is a type of average that is calculated by dividing the number of
values in a data series by the sum of the reciprocals (1/xi) of each value in the data series.

Formula for Harmonic Mean

The general formula for calculating a harmonic mean is:

Harmonic mean $H = \dfrac{1}{\dfrac{1}{n}\sum_{i=1}^{n}\dfrac{1}{x_i}}$

where
n = the number of values in the dataset

xi = the ith value in the dataset

Applications of Harmonic Mean

A few common applications of the harmonic mean formula are given below

 Used in calculating the average under certain conditions.

 Computing Fibonacci sequences.

 Used in finance, specifically to calculate average multiples.

Example : The number of tomatoes per plant is given below. Calculate the harmonic
mean.

Number of tomatoes per plant Number of plants

20 4

21 2

22 7

23 1

24 3

25 1

Solution:

Number of tomatoes per plant (x)    Number of plants (f)    1/x       f(1/x)

20                                  4                       0.05      0.2

21                                  2                       0.0476    0.0952

22                                  7                       0.0454    0.3178

23                                  1                       0.0435    0.0435

24                                  3                       0.0417    0.1251

25                                  1                       0.04      0.04

Total                               N = 18                            0.8216

Hence, for a frequency distribution with N = ∑f, the harmonic mean is

$H = \dfrac{N}{\sum_{i} f_i \left(\dfrac{1}{x_i}\right)}$

Harmonic Mean = 18/0.8216 = 21.908

Harmonic mean for the given data is 21.908.

Example :
Calculate the harmonic mean for the following data:

x f

1 2

3 4

5 6

7 8

9 10

11 12
Solution:

The calculation for the harmonic mean is shown in the below table:

x f 1/x f(1/x)

1 2 1 2

3 4 0.333 1.332

5 6 0.2 1.2

7 8 0.143 1.144

9 10 0.1111 1.111

11 12 0.091 1.092

N =42 Σ f/x = 7.879

The formula for the weighted harmonic mean is

$H = \dfrac{N}{\sum_{i} f_i \left(\dfrac{1}{x_i}\right)}$, where N = ∑f

= 42 / 7.879 = 5.331

Therefore, the harmonic mean H is 5.331.
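
The two harmonic-mean examples can be verified with a short Python sketch. The function name `weighted_harmonic_mean` is an assumption for illustration. The code works with unrounded reciprocals, so its results agree with the hand calculations only to about two decimal places.

```python
def weighted_harmonic_mean(values, frequencies):
    """H = N / sum(f / x), where N = sum(f). All values must be non-zero."""
    n = sum(frequencies)
    reciprocal_sum = sum(f / x for x, f in zip(values, frequencies))
    return n / reciprocal_sum

# Tomatoes per plant and number of plants.
h_tomatoes = weighted_harmonic_mean([20, 21, 22, 23, 24, 25],
                                    [4, 2, 7, 1, 3, 1])

# Second example: x = 1, 3, ..., 11 with f = 2, 4, ..., 12.
h_second = weighted_harmonic_mean([1, 3, 5, 7, 9, 11],
                                  [2, 4, 6, 8, 10, 12])

print(round(h_tomatoes, 2), round(h_second, 2))
```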

Uses of Harmonic Mean

The main uses of harmonic means are as follows:

 The harmonic mean is applied in the finance to the average multiples like
price-earnings ratio

 It is also used by the market technicians in order to determine the patterns like
Fibonacci Sequences

Merits and Demerits of Harmonic Mean

The following are the merits of the harmonic mean:

• It is rigidly defined.

• It is based on all the observations of a series, i.e. it cannot be computed by ignoring
any item of a series.

• It is capable of further algebraic treatment.

• It provides a more reliable result when the results to be achieved are the same
for the various means adopted.

• It gives the highest weight to the smallest item of a series.

• It can also be calculated when a series contains negative values.

• It produces a skewed distribution of a normal one.

• It produces a curve straighter than that of the A.M and G.M.

The demerits of the harmonic mean are as follows:

• The harmonic mean is greatly affected by the values of the extreme items.

• It cannot be calculated if any of the items is zero.

• The calculation of the harmonic mean is cumbersome, as it involves the
reciprocals of the numbers.

Weighted Mean

In calculating the arithmetic mean we suppose that all the items in the distribution have
equal importance. But in practice this may not be so. If some items in a distribution are
more important than others, this must be taken into account. In such cases, proper
weightage is to be given to the various items, the weight attached to each item being
proportional to the importance of the item in the distribution.

Weighted arithmetic mean (or weighted mean) $= \dfrac{\sum_i w_i x_i}{\sum_i w_i}$

where $x_i$ is the ith item and $w_i$ is the weight attached to it.

Weighted Average Meaning

A weighted average is an average in which each value is given a weight according to
its importance. The weighted average of the values is the sum of the weights times the
values, divided by the sum of the weights.

Examples :

The numbers 40, 45, 80, 75 and 10 have weights 1, 2, 3, 4, and 5 respectively.
Find the weighted mean for the given data set.

Solution:

Weighted Arithmetic Mean $= \dfrac{\sum_i w_i x_i}{\sum_i w_i}$

$= \dfrac{40 \times 1 + 45 \times 2 + 80 \times 3 + 75 \times 4 + 10 \times 5}{1 + 2 + 3 + 4 + 5}$

$= \dfrac{40 + 90 + 240 + 300 + 50}{15} = \dfrac{720}{15} = 48$
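
The same calculation can be written as a short Python sketch; the function name `weighted_mean` is an assumption for illustration.

```python
def weighted_mean(values, weights):
    """Weighted mean = sum(w * x) / sum(w)."""
    total = sum(w * x for x, w in zip(values, weights))
    return total / sum(weights)

result = weighted_mean([40, 45, 80, 75, 10], [1, 2, 3, 4, 5])
print(result)
```

With equal weights the function reduces to the ordinary arithmetic mean.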

Lesson 4

Measures of Dispersion

In statistics, the measures of dispersion help to interpret the variability of data i.e.
to know how much homogenous or heterogeneous the data is. In simple terms, it shows
how squeezed or scattered the variable is.

Meaning of Dispersion

Dispersion is the extent to which values in a distribution differ from the average of
the distribution.

In measuring dispersion, it is imperative to know the amount of variation (absolute
measure) and the degree of variation (relative measure).

In the former case we consider the range, quartile deviation, standard deviation etc.
In the latter case we consider the coefficient of range, the coefficient of quartile
deviation, the coefficient of variation etc.

Types of Measures of Dispersion

There are two main types of dispersion methods in statistics which are:

 Absolute Measure of Dispersion

 Relative Measure of Dispersion

Absolute Measure of Dispersion

An absolute measure of dispersion is expressed in the same unit as the original data set.
The absolute dispersion method expresses the variation in terms of the average of the
deviations of the observations, such as the standard deviation or the mean deviation.
It includes range, standard deviation, quartile deviation, etc.

The types of absolute measures of dispersion are:

 Range: It is simply the difference between the maximum value and the
minimum value given in a data set.

Range (R) = L – S

Co-efficient of Range $= \dfrac{L - S}{L + S}$

where L is the largest value and S is the smallest value.

Example: 1, 3, 5, 6, 7

=> Range = 7 - 1 = 6

• Variance: Subtract the mean from each value in the set, square each
difference, add the squares, and finally divide the sum by the total number of
values in the data set. Variance $\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$

• Standard Deviation: The square root of the variance is known as the standard
deviation, i.e. S.D. $= \sigma = \sqrt{\sigma^2}$.

 Quartiles and Quartile Deviation: The quartiles are values that divide a list
of numbers into quarters. The quartile deviation is half of the distance between
the third and the first quartile.

 Mean and Mean Deviation: The average of numbers is known as the mean
and the arithmetic mean of the absolute deviations of the observations from a
measure of central tendency is known as the mean deviation (also called mean
absolute deviation).

Relative Measure of Dispersion

The relative measures of dispersion are used to compare the distribution of two or
more data sets. This measure compares values without units. Common relative dispersion
methods include:

 Co-efficient of Range

 Co-efficient of Variation

 Co-efficient of Standard Deviation

 Co-efficient of Quartile Deviation

 Co-efficient of Mean Deviation

Figure - Relative Measure of Dispersion

Co-efficient of Dispersion

The coefficients of dispersion are calculated (along with the measure of dispersion)
when two series are compared, that differ widely in their averages. The dispersion
coefficient is also used when two series with different measurement units are compared. It
is denoted as C.D.

Properties of a good measure of dispersion:

 Easy to understand

 Simple to calculate

 Uniquely defined

 Based on all observations

 Not affected by extreme observations

 Capable of further algebraic treatment

Range:

(i) Co-efficient of range $= \dfrac{L - S}{L + S}$

where L = Largest item

S = Smallest item

(ii) Co-efficient of quartile deviation $= \dfrac{Q_3 - Q_1}{Q_3 + Q_1}$

where Q1 = First Quartile

Q3 = Third Quartile

(iii) Inter Quartile Range = Q3 - Q1

Quartile deviation is half of the inter quartile range; it is also called the semi inter
quartile range.

Quartile Deviation $= \dfrac{Q_3 - Q_1}{2}$

(iv) Co-efficient of Mean Deviation $= \dfrac{\text{Mean Deviation}}{\text{Mean or Median}}$

(v) Co-efficient of Standard Deviation $= \dfrac{\sigma}{\bar{X}}$

(vi) Co-efficient of variation $= \dfrac{\sigma}{\bar{X}} \times 100$

where $\sigma$ = Standard deviation

$\bar{X}$ = Arithmetic mean

Standard Deviation

 Most important & widely used measure of dispersion

 First used by Karl Pearson in 1893

 Also called root mean square deviations

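• It is defined as the square root of the arithmetic mean of the squares of the
deviations of the values taken from the mean.

• Denoted by $\sigma$ (sigma)

$\sigma = \sqrt{\dfrac{\sum (X - \bar{X})^2}{N}}$ (or) $\sigma = \sqrt{\dfrac{\sum x^2}{N}}$

where $x = X - \bar{X}$

• Co-efficient of S.D $= \dfrac{\sigma}{\bar{X}}$

Individual series, Assumed Mean / Shortcut Method:

$\sigma = \sqrt{\dfrac{\sum d^2}{N} - \left(\dfrac{\sum d}{N}\right)^2}$

or

$\sigma = \sqrt{\dfrac{\sum X^2}{N} - \left(\dfrac{\sum X}{N}\right)^2}$

Standard Deviation, Discrete series:

$\sigma = \sqrt{\dfrac{\sum fd^2}{N} - \left(\dfrac{\sum fd}{N}\right)^2}$

where d = X - A and A is the assumed mean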
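
The direct formula and the assumed-mean shortcut give the same result, which can be confirmed with a small Python sketch. The data set 1, 3, 5, 6, 7 and the assumed mean A = 5 are illustrative choices, and the function names are hypothetical.

```python
import math

def sd_direct(values):
    """sigma = sqrt( sum((X - mean)^2) / N )."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

def sd_assumed_mean(values, a):
    """sigma = sqrt( sum(d^2)/N - (sum(d)/N)^2 ), with d = X - A."""
    n = len(values)
    d = [x - a for x in values]
    return math.sqrt(sum(v * v for v in d) / n - (sum(d) / n) ** 2)

values = [1, 3, 5, 6, 7]
sigma_direct = sd_direct(values)
sigma_shortcut = sd_assumed_mean(values, a=5)
print(round(sigma_direct, 3), round(sigma_shortcut, 3))
```

The choice of A does not change the answer; a value near the middle of the data simply keeps the deviations d small for hand calculation.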

Merits of Standard Deviation

• Squaring the deviations overcomes the drawback of ignoring signs in mean
deviations

• Suitable for further mathematical treatment

• Least affected by fluctuations of sampling

• The standard deviation is zero if all the observations are equal

• Independent of change of origin

Demerits of Standard Deviation

 Not easy to calculate

 Difficult to understand for a layman

 Dependent on the change of scale

Co-efficient of variation (C.V):

• It was developed by Karl Pearson.

• It is an important relative measure of dispersion

• It is used in comparing the variability, homogeneity, stability, uniformity &
consistency of two or more series.

• Higher the C.V, lesser the consistency

• C.V $= \dfrac{\sigma}{\bar{X}} \times 100$

Merits of Range

 It is the simplest of the measure of dispersion

 Easy to calculate

 Easy to understand

 Independent of change of origin

Demerits of Range

• It is based on only the two extreme observations and is therefore strongly affected by fluctuations in them

• A range is not a reliable measure of dispersion

• Dependent on change of scale

Quartile Deviation

 The quartiles divide a data set into quarters. The first quartile, (Q1) is the middle
number between the smallest number and the median of the data. The second
quartile, (Q2) is the median of the data set. The third quartile, (Q3) is the middle
number between the median and the largest number.
 Quartile deviation or semi-inter-quartile deviation is Q = ½ × (Q3 – Q1).

Merits of Quartile Deviation
 All the drawbacks of Range are overcome by quartile deviation

 It uses half of the data

 Independent of change of origin

 The best measure of dispersion for open-end classification

Demerits of Quartile Deviation

 It ignores 50% of the data

 Dependent on change of scale

 Not a reliable measure of dispersion

Mean Deviation

Mean deviation is the arithmetic mean of the absolute deviations of the observations
from a measure of central tendency. If x1, x2, … , xn are the set of observation, then the mean
deviation of x about the average A (mean, median, or mode) is

Mean deviation from average A = 1⁄n [∑i|xi – A|]

For a grouped frequency, it is calculated as:
Mean deviation from average A = 1⁄N [∑i fi |xi – A|], N = ∑fi

Here, xi and fi are respectively the mid value and the frequency of the ith class
interval.

Merits of Mean Deviation

 Based on all observations

 It provides a minimum value when the deviations are taken from the median

 Independent of change of origin

Demerits of Mean Deviation

• Not easily understandable

• Its calculation is neither easy nor quick

• Dependent on the change of scale

• Ignoring the negative signs is artificial and makes it unsuitable for further
mathematical treatment

Coefficient of Dispersion

Whenever we want to compare the variability of two series which differ widely in
their averages, or whose units of measurement are different, we need to calculate the
coefficients of dispersion along with the measures of dispersion. The coefficients of
dispersion (C.D.) based on different measures of dispersion are

 Based on Range = (Xmax – Xmin) ⁄ (Xmax + Xmin).

 C.D. based on quartile deviation = (Q3 – Q1) ⁄ (Q3 + Q1).

 Based on mean deviation = Mean deviation/average from which it is calculated.

 For Standard deviation = S.D. ⁄ Mean

Coefficient of Variation

100 times the coefficient of dispersion based on standard deviation is the coefficient
of variation (C.V.).

C.V. = 100 × (S.D. / Mean) = (σ / X̄) × 100.

Range

Example : The amount spent (in rupees) by the group of 10 students in the school canteen
is as follows:

110, 117, 129, 197, 190, 100, 100, 178, 255, 790.

Find the range and the co-efficient of the range.

Solution: R = L - S = 790 - 100 = 690

Co-efficient of range = (L - S)/(L + S) = (790 - 100)/(790 + 100) = 690/890 = 0.78

Example : Find the range and its co-efficient from the following data.

Size 10-20 20-30 30-40 40-50 50-60

Frequency 2 2 3 4 2

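Solution: For grouped data, L is the upper limit of the highest class (60) and S is the
lower limit of the lowest class (10).

R = L - S = 60 - 10 = 50

Co-efficient of range = (L - S)/(L + S) = (60 - 10)/(60 + 10) = 50/70 = 0.71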
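
For ungrouped data the range and its co-efficient can be computed directly. A minimal Python sketch using the canteen data from the first range example follows; the function name `range_and_coefficient` is an assumption for illustration.

```python
def range_and_coefficient(values):
    """Return (range, co-efficient of range) = (L - S, (L - S) / (L + S))."""
    largest, smallest = max(values), min(values)
    r = largest - smallest
    return r, r / (largest + smallest)

spend = [110, 117, 129, 197, 190, 100, 100, 178, 255, 790]
r, coeff = range_and_coefficient(spend)
print(r, round(coeff, 3))
```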

Example : Find the quartile deviation of the daily wages (in rupees) of 7 persons
given below: 120, 70, 150, 100, 190, 170 and 250.

Solution : Arranging the data in an ascending order we get 70, 100, 120, 150, 170, 190
and 250.

Here, n=7

Q1 = Size of ((N + 1)/4)th item

= Size of ((7 + 1)/4)th item = 2nd item

= 100 rupees

Q3 = Size of (3(N + 1)/4)th item

= Size of (3(7 + 1)/4)th item = 6th item

= 190 rupees

Quartile Deviation = (Q3 - Q1)/2 = (190 - 100)/2 = 45 rupees

Example : The wheat production (in kg) of 20 acres given as: 1120, 1240, 1320, 1040,
1080, 1200, 1440, 1360, 1680, 1730, 1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470,
1750 and 1885, find the quartile deviation and co efficient of quartile deviation.

Solution : After arranging the observations in ascending order we get 1040, 1080, 1120,
1200, 1240, 1320, 1342, 1360, 1440, 1470, 1600, 1680, 1720, 1730, 1750, 1755, 1785,
1880, 1885 and 1960.

Q1 = Value of ((N + 1)/4)th item = Value of ((20 + 1)/4)th item

= value of (5.25)th item = 5th item + 0.25 (6th item - 5th item)

= 1240 + 0.25 (1320 - 1240) = 1240 + 20

Q1 = 1260 kg.

Q3 = value of (3(N + 1)/4)th item = value of (3(20 + 1)/4)th item

= value of (15.75)th item = 15th item + 0.75 (16th item - 15th item)

= 1750 + 0.75 (1755 - 1750) = 1750 + 3.75

Q3 = 1753.75 kg.

Quartile Deviation = (Q3 - Q1)/2 = (1753.75 - 1260)/2 = 493.75/2 = 246.875 kg

Co-efficient of Quartile Deviation = (Q3 - Q1)/(Q3 + Q1) = (1753.75 - 1260)/(1753.75 + 1260) = 493.75/3013.75 = 0.164
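
The (N + 1)/4 positional rule with linear interpolation, as used in both quartile examples, can be sketched in Python. The function name `quartile` is an assumption for illustration; note that statistical software often uses other quartile conventions, so library results may differ slightly from this rule.

```python
def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) using the k(N + 1)/4-th item rule,
    interpolating linearly between neighbouring sorted items."""
    data = sorted(values)
    n = len(data)
    pos = k * (n + 1) / 4          # 1-based position, may be fractional
    lower = int(pos)               # whole part of the position
    frac = pos - lower             # fractional part of the position
    if lower < 1:
        return data[0]
    if lower >= n:
        return data[-1]
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])

wheat = [1120, 1240, 1320, 1040, 1080, 1200, 1440, 1360, 1680, 1730,
         1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470, 1750, 1885]
q1, q3 = quartile(wheat, 1), quartile(wheat, 3)
qd = (q3 - q1) / 2                 # quartile deviation
coeff_qd = (q3 - q1) / (q3 + q1)   # co-efficient of quartile deviation
print(q1, q3, qd, round(coeff_qd, 3))
```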

Mean Deviation

• Arrange the data in ascending order (for calculating the median)

• Calculate the median/mean/mode

• Take the deviations of the items from the median/mean, ignoring ± signs, and denote the
column as |D|

• Calculate the sum of these deviations, ∑|D|

• In the case of discrete and continuous series, multiply each |D| by the respective
frequency of the item to get ∑ f|D|

• Divide the total obtained by the number of items to get the mean deviation

• M.D $= \dfrac{\sum f|D|}{N}$

• Co-efficient of M.D $= \dfrac{\text{M.D}}{\text{Median/Mean/Mode}}$

Example :

Calculate mean deviation and co-efficient of mean deviation from both mean and
median for the following data on the monthly income (in rupees) of households.

Income: 8520 6350 7920 8360 7500

Calculation of mean deviation from mean and from median:

Monthly income    Deviation from mean (7730) |D|    Deviation from median (7920) |D|

6350              1380                              1570

7500              230                               420

7920              190                               0

8360              630                               440

8520              790                               600

∑X = 38650        ∑|D| = 3220                       ∑|D| = 3030

Mean = ∑X / N = 38650 / 5 = 7730

M.D from mean = ∑|D| / N = 3220 / 5 = 644

Co-efficient of M.D = M.D / Mean = 644 / 7730 = 0.083

and

Median = size of ((N + 1)/2)th item = size of ((5 + 1)/2)th item = size of 3rd item = 7920

M.D from median = ∑|D| / N = 3030 / 5 = 606

Co-efficient of M.D = M.D / Median = 606 / 7920 = 0.077
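
The calculation for the income data can be reproduced with a short Python sketch; the function name `mean_deviation` is an assumption for illustration.

```python
import statistics

def mean_deviation(values, centre):
    """Mean of the absolute deviations of the values from a chosen average."""
    return sum(abs(x - centre) for x in values) / len(values)

income = [8520, 6350, 7920, 8360, 7500]
mean_income = statistics.mean(income)
median_income = statistics.median(income)

md_mean = mean_deviation(income, mean_income)
md_median = mean_deviation(income, median_income)

print(md_mean, round(md_mean / mean_income, 3))
print(md_median, round(md_median / median_income, 3))
```

The deviation taken from the median is the smaller of the two, in line with the merit listed above that mean deviation is least when measured from the median.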

Lorenz Curve

Definition: The Lorenz curve is a way of showing the distribution of income (or
wealth) within an economy. It was developed by Max O. Lorenz in 1905 for representing
wealth distribution. In other words, it is a graphical representation of the distribution of
income or wealth.

 The Lorenz curve shows the cumulative share of income from different
sections of the population.

 If there was perfect equality – if everyone had the same salary – the poorest
20% of the population would gain 20% of the total income. The poorest 60%
of the population would get 60% of the income.

Meaning of Lorenz Curve

A Lorenz curve is a graphical representation of income inequality or wealth
inequality developed by American economist Max Lorenz in 1905.

The graph plots percentiles of the population on the horizontal axis according to
income or wealth. It plots cumulative income or wealth on the vertical axis, so that an x-
value of 45 and a y-value of 14.2 would mean that the bottom 45% of the population
controls 14.2% of the total income or wealth.

In practice, a Lorenz curve is usually a mathematical function estimated from an incomplete set of observations of income or wealth.

In this Lorenz curve, the poorest 20% of households have 5% of the nation’s total
income. The poorest 90% of the population holds 55% of the total income. That means the
richest 10% of income earners gain 45% of total income.

Shift in the Lorenz Curve

In this example, there has been a reduction in inequality – the Lorenz curve has
moved closer to the line of equality.

• The poorest 20% of the population now gain 9% of total income.

• The richest 10% of the population used to gain 45% of total income but now get only 25% of total income.

The Lorenz Curve and Gini Coefficient

The Lorenz Curve can be used to calculate the Gini coefficient – another measure
of inequality.

Let A denote the area between the line of equality and the Lorenz curve. The closer the Lorenz curve is to the line of equality, the smaller area A is, and the lower the Gini coefficient will be.

If there is a high degree of inequality, then area A will be a bigger percentage of the total area under the line of equality.

A rise in the Gini coefficient shows a rise in inequality – it shows the Lorenz curve
is further away from the line of equality.

Lorenz Curve and wealth

[Figure: Wealth inequality and the Lorenz curve]

Example of Lorenz Curve

Following is the example to understand the Lorenz curve with the help of a graph.
Let us consider an economy with the following population and income statistics:

Population (cumulative %)        0    20    40    60    80    100
Income portion (cumulative %)    0    10    20    35    60    100

And for the line of perfect equality, let us consider this table:

Population (cumulative %)    Income portion (cumulative %)
0                            0
20                           20
40                           40
60                           60
80                           80
100                          100

Let us now see how a graph for this data actually looks:

As we can see, there are two lines in the graph of the Lorenz curve, the curved red
line, and the straight black line.

The black line represents the fictional line called the line of equality i.e. the ideal
graph when income or wealth is equally distributed amongst the population. The red
curve, the Lorenz curve, which we have been discussing, represents the actual
distribution of wealth among the population.

Hence, we can say that the Lorenz curve is the graphical method of studying
dispersion. Gini Coefficient, also known as the Gini Index, can be computed as follows.

Let the area between the Lorenz curve and the line of equality be A1, and the area below the Lorenz curve be A2. Then,

Gini Coefficient = A1 / (A1 + A2)

The Gini coefficient lies between 0 and 1; 0 is the case of perfect equality and 1 is the case of perfect inequality. The larger the area enclosed between the two lines, the higher the inequality in the economy.
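As a rough illustration, the Gini coefficient for the example table above can be approximated by treating the Lorenz curve as straight segments between the tabulated points and adding up trapezoids. This is a sketch under that piecewise-linear assumption; the function name `gini_from_lorenz` is hypothetical, and a smoother curve through the same points would give a slightly different value.

```python
def gini_from_lorenz(population_pct, income_pct):
    """Approximate Gini = A1 / (A1 + A2) from cumulative percentages.

    A2 (area below the Lorenz curve) is estimated with the trapezoidal rule;
    A1 + A2 is the area under the line of equality, which is 0.5 when both
    axes run from 0 to 1.
    """
    area_below_curve = 0.0
    for i in range(1, len(population_pct)):
        width = (population_pct[i] - population_pct[i - 1]) / 100
        average_height = (income_pct[i] + income_pct[i - 1]) / 200
        area_below_curve += width * average_height
    a2 = area_below_curve
    a1 = 0.5 - a2
    return a1 / (a1 + a2)


# Cumulative percentages from the example table
population = [0, 20, 40, 60, 80, 100]
income = [0, 10, 20, 35, 60, 100]

gini = gini_from_lorenz(population, income)
print("Approximate Gini coefficient:", round(gini, 3))

# The line of perfect equality should give a coefficient of (about) zero
gini_equal = gini_from_lorenz(population, population)
print("Gini for perfect equality:", round(gini_equal, 3))
```

For this table the area below the curve works out to 0.35, so A1 = 0.15 and the approximate Gini coefficient is 0.15 / 0.5 = 0.3.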

By this, we can say that in measuring income inequality, there are two indicators:
• The Lorenz curve is the visual indicator, and
• The Gini coefficient is the mathematical indicator.

Income inequality is a pressing issue across the world. So, what are the reasons for inequality in an economy?

• Corruption
• Education
• Tax
• Gender differences
• Culture
• Race and caste discrimination
• Differences in preferences for leisure and risk

Reasons for income inequality

• The distribution of economic characteristics across the population should be considered.

• It should then be analysed how differences in these characteristics give rise to different outcomes in terms of income.

• A country may have a high degree of inequality because:
    - there is great disparity in these characteristics across the population, or
    - these characteristics have a large effect on the amount of income a person earns.

Uses of the Lorenz Curve

• It can be used to show the effectiveness of a government policy intended to redistribute income. The impact of a particular policy can be shown by how far the Lorenz curve has moved towards the line of perfect equality after the policy was implemented.

• It is one of the simplest representations of inequality.

• It is most useful in comparing the variability of two or more distributions.

• It shows, with the help of a graph, the distribution of a country's wealth among different percentages of the population, which helps many businesses in establishing their target bases.

• It helps in business modelling.

• It is mainly used when designing specific measures to develop the weaker sections of the economy.

Limitations

• The curve might not always be rigorously accurate for a finite population.

• The equality measure shown may be misleading.

• When two Lorenz curves are compared and the two curves intersect, it is not possible to ascertain which of the distributions they represent displays more inequality.

• The variation of income over the lifecycle of an individual is ignored by the Lorenz curve when determining inequality.

UNIT - III

Lesson 5
Correlation
Definition
Correlation means association; more precisely, it is a measure of the extent to which two variables are related. There are three possible results of a correlation study: a positive correlation, a negative correlation, and no correlation.

A positive correlation is a relationship between two variables in which both variables move in the same direction: one variable increases as the other increases, or one variable decreases as the other decreases. An example of positive correlation would be height and weight. Taller people tend to be heavier.

A negative correlation is a relationship between two variables in which an increase in one variable is associated with a decrease in the other. An example of negative correlation would be height above sea level and temperature. As you climb a mountain (increase in height) it gets colder (decrease in temperature).

A zero correlation exists when there is no relationship between two variables. For
example there is no relationship between the amount of tea drunk and level of intelligence.

So far we have confined ourselves to univariate distributions, i.e., distributions involving only one variable. Often we come across situations in which our focus is simultaneously on two or more variables, and we frequently observe that movements in one variable are accompanied by movements in another. For example, the ages of husbands and wives move together, and scores on an I.Q. test move with scores in university examinations.

Similarly, studies of the income and expenditure of households, or of the price and demand of commodities, exhibit accompanying movements of two variables. There are, of course, cases of spurious or nonsensical relations, such as funny combinations of variables suggesting that the number of runs scored by a batsman increases with the consumption of fertilizer in the local market, or that the number of space flights increases as the population of tigers decreases. Such cases aside, the study of variables that show accompanying behaviour is of great interest in statistics.

However, the above statements are not precise enough to be of use to decision
makers. We are therefore on the lookout for a quantitative measure of the relationship
between the two variables, and also for an appropriate mathematical or statistical form of
the relationship.

Meaning of Correlation

In a bivariate distribution we may be interested to find out if there is any correlation or covariation between the two variables under study. If the change in one variable affects a change in the other variable, the variables are said to be correlated. If the two variables deviate in the same direction, i.e., if an increase (or decrease) in one results in a corresponding increase (or decrease) in the other, correlation is said to be direct or positive. But if they constantly deviate in opposite directions, i.e., if an increase (or decrease) in one results in a corresponding decrease (or increase) in the other, correlation is said to be inverse or negative.

For example, the correlation between (i) the heights and weights of a group of
persons, and (ii) the income and expenditure; is positive and the correlation between
(i) price and demand of a commodity and (ii) the volume and pressure of a perfect gas; is
negative.

Correlation is said to be perfect if the deviation in one variable is followed by a corresponding and proportional deviation in the other.

Scatter diagram
A correlation can be expressed visually. This is done by drawing a scatter diagram (also known as a scatter plot, scatter graph, or scatter chart).

A scatter diagram is a graphical display that shows the relationship or association between two numerical variables (or co-variables), with each pair of scores represented as a point (or dot).

A scatter graph indicates the strength and direction of the correlation between the co-variables.

When you draw a scatter diagram it doesn't matter which variable goes on the
x-axis and which goes on the y-axis.

Remember, in correlations we are always dealing with paired scores, so the values
of the 2 variables taken together will be used to make the diagram. Decide which variable
goes on each axis and then simply put a cross at the point where the 2 values coincide.

Graphic Presentation of Data

Apart from diagrams, graphic presentation is another way of presenting data and information. Usually, graphs are used to present time series and frequency distributions. In this section, we will look at the graphic presentation of data and information along with its merits, limitations, and types.

Construction of a Graph

The graphic presentation of data and information offers a quick and simple way of
understanding the features and drawing comparisons. Further, it is an effective analytical tool
and a graph can help us in finding the mode, median, etc.

We can locate a point in a plane using two mutually perpendicular lines – the X-axis
(the horizontal line) and the Y-axis (the vertical line). Their point of intersection is the origin.
We can locate the position of a point in terms of its distance from both these axes.

For example, if a point P is 3 units away from the Y-axis and 5 units away from the X-axis, then, taking both distances in the positive direction, its coordinates are (3, 5).

General Rules for Graphic Presentation of Data and Information

There are certain guidelines for an attractive and effective graphic presentation of
data and information. These are as follows:

• Suitable Title – Ensure that you give a suitable title to the graph which clearly indicates the subject for which you are presenting it.

• Unit of Measurement – Clearly state the unit of measurement below the title.

• Suitable Scale – Choose a suitable scale so that you can represent the entire data in an accurate manner.

• Index – Include a brief index which explains the different colors and shades, lines and designs that you have used in the graph. Also, include a scale of interpretation for better understanding.

• Data Sources – Wherever possible, include the sources of information at the bottom of the graph.

• Keep it Simple – You should construct a graph which even a layman (without any exposure in the areas of statistics or mathematics) can understand.

• Neat – A graph is a visual aid for the presentation of data and information. Therefore, you must keep it neat and attractive. Choose the right size, right lettering, and appropriate lines, colors, dashes, etc.

Merits of a Graph

• The graph presents data in a manner which is easier to understand.

• It allows us to present statistical data in an attractive manner as compared to tables. Users can understand the main features, trends, and fluctuations of the data at a glance.

• A graph saves time.

• It allows the viewer to compare data relating to two different time-periods or regions.

• The viewer does not require prior knowledge of mathematics or statistics to understand a graph.

• We can use a graph to locate the mode, median, and mean values of the data.

• It is useful in forecasting, interpolation, and extrapolation of data.

Limitations of a Graph

• A graph lacks complete accuracy of facts.

• It depicts only a few selected characteristics of the data.

• We cannot use a graph in support of a statement.

• A graph is not a substitute for tables.

• Usually, laymen find it difficult to understand and interpret a graph.

• Typically, a graph shows only the general tendency of the data, and the actual values are not clear.

Types of Graphs

Graphs are of two types:

• Time Series graphs

• Frequency Distribution graphs

Time Series Graphs

A time series graph or a “historigram” is a graph which depicts the value of a variable at different points of time.

In a time series graph, time is the most important factor and the variable is related to time. It helps in the understanding and analysis of the changes in the variable at different points of time.

Many statisticians and businessmen use these graphs because they are easy to
understand and also because they offer complex information in a simple manner.

Further, constructing a time series graph does not require a user with technical skills.
Here are some major steps in the construction of a time series graph:

• Represent time on the X-axis and the value of the variable on the Y-axis.

• Start the Y-value with zero and devise a suitable scale which helps you present the whole data in the given space.

• Plot the values of the variable and join successive points with straight lines.

• You can plot multiple variables through different lines.

Line Graph

You can use a line graph to summarize how two pieces of information are related and
how they vary with each other.

Advantages

• You can compare multiple continuous data-sets easily.

• You can infer the interim data from the graph line.

Disadvantages

• It is only used with continuous data.

Use of a false Base Line

Usually, in a graph, the vertical scale starts from the origin. However, in some cases, a false Base Line is used for a better representation of the data. There are two scenarios where you should use a false Base Line:

• To magnify minor fluctuations in the time series data

• To economize on space

Net Balance Graph

If you have to show the net balance of income and expenditure or revenue and costs
or imports and exports, etc., then you must use a net balance graph. You can use different
colors or shades for positive and negative differences.

Frequency Distribution Graphs

Let’s look at the different types of frequency distribution graphs.

Histogram

A histogram is a graph of a grouped frequency distribution. In a histogram, we plot the class intervals on the X-axis and their respective frequencies on the Y-axis. Further, we create a rectangle on each class interval with its height proportional to the frequency density of the class.

Frequency Polygon or Histograph

A frequency polygon or a Histograph is another way of representing a frequency distribution on a graph. You draw a frequency polygon by joining the midpoints of the tops of the adjacent rectangles of the histogram with straight lines.

Frequency Curve

When you join the vertices of a frequency polygon using a smooth curve, the resulting figure is a Frequency Curve. As the number of observations increases, we need to accommodate more classes. Therefore, the width of each class reduces. In such a scenario, the variable tends to become continuous and the frequency polygon starts taking the shape of a frequency curve.

Cumulative Frequency Curve or Ogive

A cumulative frequency curve or Ogive is the graphical representation of a cumulative frequency distribution. Since a cumulative frequency is either of a ‘less than’ or a ‘more than’ type, Ogives are of two types too – ‘less than ogive’ and ‘more than ogive’.

Uses of Correlations

Prediction

• If there is a relationship between two variables, we can make predictions about one from another.

Validity

• Concurrent validity (correlation between a new measure and an established measure).

Reliability

• Test-retest reliability (are measures consistent?).

• Inter-rater reliability (are observers consistent?).

Theory verification

• Predictive validity.

Coefficient of Correlation
A coefficient of correlation is generally applied in statistics to measure the relationship between two variables. It gives a specific value for the degree of linear relationship between two variables, say X and Y. There are various types of correlation coefficients. However, Pearson’s correlation (also known as Pearson’s r) is the correlation coefficient most frequently used alongside linear regression.

Karl Pearson’s Coefficient of Correlation

Karl Pearson’s coefficient of correlation is a widely used numerical measure of the degree of relationship between two linearly related variables. The coefficient of correlation is denoted by “r”.

Karl Pearson Correlation Coefficient Formula

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}}$$

where $\bar{X}$ = mean of X variable

$\bar{Y}$ = mean of Y variable
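For hand calculation it is often easier to work with raw sums than with deviations from the means. As a short sketch (writing n for the number of pairs of observations), expanding the deviations in Karl Pearson's formula gives an equivalent computational form:

```latex
\sum (X-\bar{X})(Y-\bar{Y}) = \sum XY - \frac{\sum X \sum Y}{n},
\qquad
\sum (X-\bar{X})^2 = \sum X^2 - \frac{\left(\sum X\right)^2}{n},
\qquad
\sum (Y-\bar{Y})^2 = \sum Y^2 - \frac{\left(\sum Y\right)^2}{n}

% Substituting and multiplying numerator and denominator by n:
r = \frac{n\sum XY - \sum X \sum Y}
         {\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]
                \left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}
```

Both forms give the same value of r; the second only needs the column totals of X, Y, X², Y² and XY.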

Problem 1: The following data give the heights (in inches) of fathers and their eldest sons. Compute the correlation coefficient between the heights of fathers and sons using Karl Pearson’s method.

Height of father Height of son

65 67

66 68

67 65

67 68

68 72

69 72

70 69

72 71

Solution: Let x denote height of father and y denote height of son. The data is on the ratio
scale.

We use Karl Pearson’s method.

$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}$$

         x_i    y_i    x_i²     y_i²     x_i y_i
         65     67     4225     4489     4355
         66     68     4356     4624     4488
         67     65     4489     4225     4355
         67     68     4489     4624     4556
         68     72     4624     5184     4896
         69     72     4761     5184     4968
         70     69     4900     4761     4830
         72     71     5184     5041     5112
Total    544    552    37028    38132    37560

$$r = \frac{8 \times 37560 - 544 \times 552}{\sqrt{\left(8 \times 37028 - 544^2\right)\left(8 \times 38132 - 552^2\right)}} = \frac{192}{\sqrt{288 \times 352}} = 0.603$$

The heights of fathers and sons are positively correlated. This means that, on average, if fathers are tall then their sons will probably be tall, and if fathers are short, their sons will probably be short.
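The same calculation can be reproduced in Python as a check on the column totals. This is a minimal sketch of the raw-sums form of Karl Pearson's formula; the function name `pearson_r` is an illustrative choice, not something defined in the text.

```python
from math import sqrt


def pearson_r(x, y):
    """Karl Pearson's r computed from the column totals of x, y, x^2, y^2 and xy."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Heights (in inches) of fathers and their eldest sons from the problem above
fathers = [65, 66, 67, 67, 68, 69, 70, 72]
sons = [67, 68, 65, 68, 72, 72, 69, 71]

r = pearson_r(fathers, sons)
print("r =", round(r, 3))
```

The function works for any two lists of equal length, so it can also be used to verify other worked problems of the same kind.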

Problem 2: The following are the marks scored by 7 students in two tests in a subject.
Calculate coefficient of correlation from the following data and interpret.

Marks in test-1 12 9 8 10 11 13 7

Marks in test-2 14 8 6 9 11 12 3

Solution: Let x denote marks in test-1 and y denote marks in test-2.

         x_i    y_i    x_i²    y_i²    x_i y_i
         12     14     144     196     168
         9      8      81      64      72
         8      6      64      36      48
         10     9      100     81      90
         11     11     121     121     121
         13     12     169     144     156
         7      3      49      9       21
Total    70     63     728     651     676

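$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}$$

$$\sum_{i=1}^{n} x_i = 70, \quad \sum_{i=1}^{n} x_i^2 = 728, \quad \sum_{i=1}^{n} x_i y_i = 676, \quad \sum_{i=1}^{n} y_i = 63, \quad \sum_{i=1}^{n} y_i^2 = 651$$

Here n = 7.

$$r = \frac{7 \times 676 - 70 \times 63}{\sqrt{\left(7 \times 728 - 70^2\right)\left(7 \times 651 - 63^2\right)}} = \frac{4732 - 4410}{\sqrt{(5096 - 4900)(7 \times 651 - 3969)}}$$

$$= \frac{322}{\sqrt{196 \times 588}} = \frac{322}{14 \times 24.25} = \frac{322}{339.5} = 0.95$$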

There is a high positive correlation between test-1 and test-2. That is, those who perform well in test-1 tend to perform well in test-2, and those who perform poorly in test-1 tend to perform poorly in test-2.

Spearman’s Rank Correlation

Let us suppose that a group of n individuals is arranged in order of merit or proficiency in possession of two characteristics A and B. These ranks in the two characteristics will, in general, be different.

For example, if we consider the relation between intelligence and beauty, it is not necessary that a beautiful individual is intelligent also. Let $(x_i, y_i)$; i = 1, 2, ..., n be the ranks of the i-th individual in the two characteristics A and B respectively. The Pearsonian coefficient of correlation between the ranks $x_i$ and $y_i$ is called the rank correlation coefficient between A and B for that group of individuals.

Problem : The scores of 9 students in History and Geography are mentioned in the table
below.

History    Rank (History)    Geography    Rank (Geography)    d    d²
35         3                 30           5                   2    4
23         5                 33           3                   2    4
47         1                 45           2                   1    1
17         6                 23           6                   0    0
10         7                 8            8                   1    1
43         2                 49           1                   1    1
9          8                 12           7                   1    1
6          9                 4            9                   0    0
28         4                 31           4                   0    0
                                                          ∑d² = 12

• Step 1: Create a table of the data obtained.

• Step 2: Rank the two data sets. Assign rank 1 to the biggest number in the column, rank 2 to the second biggest number, and so forth, so that the smallest value gets the last rank. Do this for both sets of measurements.

• Step 3: Add a column d to the table, where d denotes the difference between the two ranks of the same student. For example, the first student’s History rank is 3 and Geography rank is 5, so the difference in rank is 2 (the sign can be ignored because d is squared). In the next column, square the d values.

• Step 4: Add up all the d² values, which gives ∑d² = 12.

• Step 5: Insert these values in the formula.

$$r_R = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 12}{9(81 - 1)} = 1 - \frac{72}{720} = 1 - 0.1 = 0.9$$

The Spearman’s rank correlation for this data is 0.9. Because the value is close to +1, the two sets of ranks show a strong positive association; a value of exactly +1 would mean that the ranks agree perfectly.
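The ranking and the formula can be sketched in Python as follows. The helper names `ranks_descending` and `spearman_rho` are hypothetical, and the ranking helper assumes there are no tied scores, as in this example; tied values would need averaged ranks, which this sketch does not handle.

```python
def ranks_descending(values):
    """Rank 1 for the largest value, 2 for the next, and so on (no ties assumed)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Spearman's rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    rank_x = ranks_descending(x)
    rank_y = ranks_descending(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Scores of the 9 students in History and Geography
history = [35, 23, 47, 17, 10, 43, 9, 6, 28]
geography = [30, 33, 45, 23, 8, 49, 12, 4, 31]

rho = spearman_rho(history, geography)
print("Rank correlation:", round(rho, 2))
```

Because the differences are squared, it makes no difference whether d is taken as History rank minus Geography rank or the other way round.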

Problem : To calculate a Spearman rank-order correlation on data without any ties we
will use the following data:

English 56 75 45 71 62 64 58 80 76 61

Maths 66 70 40 60 65 56 59 77 67 63

We then complete the following table:

English    Maths    Rank (English)    Rank (Maths)    d    d²

56 66 9 4 5 25

75 70 3 2 1 1

45 40 10 10 0 0

71 60 4 7 3 9

62 65 6 5 1 1

64 56 5 9 4 16

58 59 8 8 0 0

80 77 1 1 0 0

76 67 2 3 1 1

61 63 7 6 1 1

where d = difference between ranks and d² = difference squared.

We then calculate the following:

$$\sum d_i^2 = 25 + 1 + 9 + 1 + 16 + 1 + 1 = 54$$

We then substitute this into the main equation with the other information (n = 10) as follows:

$$\rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$

$$\rho = 1 - \frac{6 \times 54}{10(10^2 - 1)}$$

$$\rho = 1 - \frac{324}{990} = 1 - 0.33 = 0.67$$

Hence, we have a ρ of 0.67.

This indicates a strong positive relationship between the ranks individuals obtained
in the Maths and English exam. That is, the higher you ranked in Maths, the higher you
ranked in English also, and vice versa.

Lesson 6

Regression

The term “regression” literally means “stepping back towards the average”.
It was first used by British biometrician Sir Francis Galton (1822-1911), in connection
with the inheritance of stature.

Galton found that the offspring of abnormally tall or short parents tend to “regress” or “step back” to the average population height.

But the term “regression” as now used in statistics is only a convenient term
without having any reference to biometry.

Regression Analysis

Regression analysis is a mathematical measure of the average relationship between two or more variables in terms of the original units of the data.

In regression analysis there are two types of variables. The variable whose value is influenced or is to be predicted is called the dependent variable, and the variable which influences the values or is used for prediction is called the independent variable. In regression analysis the independent variable is also known as the regressor, predictor or explanatory variable, while the dependent variable is also known as the regressed or explained variable.

Problem : For 10 randomly selected observations, the following data were recorded:

Observation No. 1 2 3 4 5 6 7 8 9 10

Overtime hours (X) 1 1 2 2 3 3 4 5 6 7

Additional units (Y) 2 7 7 10 8 12 10 14 11 14

Determine the coefficients of the regression equation using the non-linear form:

$$Y = a + b_1 X + b_2 X^2$$

Solution:

Sl. No. X Y X² X³ X⁴ XY X²Y
1 1 2 1 1 1 2 2
2 1 7 1 1 1 7 7
3 2 7 4 8 16 14 28
4 2 10 4 8 16 20 40
5 3 8 9 27 81 24 72
6 3 12 9 27 81 36 108
7 4 10 16 64 256 40 160
8 5 14 25 125 625 70 350
9 6 11 36 216 1296 66 396
10 7 14 49 343 2401 98 686
Total 34 95 154 820 4774 377 1849

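The normal equations for the quadratic form are:

$$\sum Y = na + b_1 \sum X + b_2 \sum X^2$$

$$\sum XY = a \sum X + b_1 \sum X^2 + b_2 \sum X^3$$

$$\sum X^2 Y = a \sum X^2 + b_1 \sum X^3 + b_2 \sum X^4$$

Substituting the totals from the table, we get

$$10a + 34b_1 + 154b_2 = 95,$$

$$34a + 154b_1 + 820b_2 = 377,$$

and

$$154a + 820b_1 + 4774b_2 = 1849.$$

The solutions to these three simultaneous equations are:

$$a = 1.80, \quad b_1 = 3.48 \quad \text{and} \quad b_2 = -0.27$$

The regression equation is therefore:

$$Y = 1.80 + 3.48X - 0.27X^2$$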
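Solving three simultaneous equations by hand is error-prone, so the result can be checked numerically. The sketch below builds the normal equations from the data and solves them with a small Gaussian elimination routine written for this illustration (the name `solve_linear_system` is hypothetical); a numerical library could be used instead.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix: each row of coefficients followed by its right-hand side
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


# Overtime hours (X) and additional units (Y) from the problem above
x = [1, 1, 2, 2, 3, 3, 4, 5, 6, 7]
y = [2, 7, 7, 10, 8, 12, 10, 14, 11, 14]

n = len(x)
sum_x = sum(x)
sum_x2 = sum(v ** 2 for v in x)
sum_x3 = sum(v ** 3 for v in x)
sum_x4 = sum(v ** 4 for v in x)
sum_y = sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2y = sum(a * a * b for a, b in zip(x, y))

# Normal equations for Y = a + b1*X + b2*X^2
coefficients = [
    [n, sum_x, sum_x2],
    [sum_x, sum_x2, sum_x3],
    [sum_x2, sum_x3, sum_x4],
]
totals = [sum_y, sum_xy, sum_x2y]

a, b1, b2 = solve_linear_system(coefficients, totals)
print("a =", round(a, 2), "b1 =", round(b1, 2), "b2 =", round(b2, 2))
```

The computed totals match the last row of the table (34, 95, 154, 820, 4774, 377, 1849), and the rounded coefficients agree with the solution stated above.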

Types of Regression

Based on the form of the regression line, regression analysis is divided into two types:

• Linear Regression
• Non-Linear Regression

The shape of the regression line depends on the distribution of the data. We can infer this from the image below. The first image shows linear regression whereas the second image shows non-linear regression.

[Figure: Linear Function and Nonlinear Function]

Linear Regression

If the variables in a bivariate distribution are related, we will find that the points in the scatter diagram cluster round some curve called the “curve of regression”. If the curve is a straight line, it is called the line of regression and there is said to be linear regression between the variables; otherwise regression is said to be curvilinear.

The line of regression is the line which gives the best estimate to the value of one
variable for any specific value of the other variable.

Thus the line of regression is the line of “best fit” and is obtained by the principle
of least squares.

Linear Regression Types

Linear regression is further divided into two categories:

• Simple linear regression

• Multiple linear regression

The choice of the regression technique depends on the number of explanatory variables.

In the case of linear regression, if there is only one input variable, then we will do simple linear regression.

If instead the input variables are two or more, we will need to perform multiple linear regression.

To summarize, a simple linear regression shows the relationship between a dependent variable y and an independent variable x.

A multiple regression model shows the relationship between a dependent variable y and multiple independent variables x.

In cases where we have multiple response variables rather than multiple explanatory variables, we use multivariate regression. In this technique, a single regression model can estimate more than one response variable.

[Figure: Simple Linear Regression and Multiple Linear Regression]

The equation of the regression line with a single explanatory variable (simple linear regression) can be written as:

$$Y = W_1 X + B_0$$

whereas the equation of the regression line with multiple explanatory variables (multiple linear regression) is given by:

$$Y = W_1 X_1 + W_2 X_2 + \dots + B_0$$

Linear Regression Problem

The sales of a company (in million dollars) for each year are shown in the table as
follows:

X (year) 2005 2006 2007 2008 2009

Y (sales) 12 19 29 37 45

a) Find the least squares regression line y = a x + b.

b) Use the least squares regression line as a model to estimate the sales of the company in 2012.

Solution:

a) We first change the variable x into t such that t = x-2005 and therefore t represents
the number of years after 2005.
Using t instead of x makes the numbers smaller and therefore manageable. The
table of values becomes.

t (years after 2005) y (sales)

0 12

1 19

2 29

3 37

4 45

We now use the table to calculate the sums needed for a and b in the least squares regression line, which in terms of t is y = a t + b.

t         y          ty          t²
0         12         0           0
1         19         19          1
2         29         58          4
3         37         111         9
4         45         180         16
∑t = 10   ∑y = 142   ∑ty = 368   ∑t² = 30

We now calculate a and b using the least squares regression formulas for a and b.

$$a = \frac{n\sum ty - \sum t \sum y}{n\sum t^2 - \left(\sum t\right)^2} = \frac{5 \times 368 - 10 \times 142}{5 \times 30 - 10^2} = 8.4$$

$$b = \frac{1}{n}\left(\sum y - a\sum t\right) = \frac{1}{5}\left(142 - 8.4 \times 10\right) = 11.6$$

b) In 2012, t = 2012 − 2005 = 7.

The estimated sales in 2012 are y = 8.4 × 7 + 11.6 = 70.4 million dollars.
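The fit and the forecast can be reproduced with a short Python sketch. It applies the same least squares formulas for a (slope) and b (intercept) to the shifted variable t; the function name `least_squares_line` is an illustrative choice.

```python
def least_squares_line(t, y):
    """Return slope a and intercept b of the least squares line y = a*t + b."""
    n = len(t)
    sum_t, sum_y = sum(t), sum(y)
    sum_ty = sum(p * q for p, q in zip(t, y))
    sum_t2 = sum(p * p for p in t)
    slope = (n * sum_ty - sum_t * sum_y) / (n * sum_t2 - sum_t ** 2)
    intercept = (sum_y - slope * sum_t) / n
    return slope, intercept


# Sales (in million dollars) for 2005 to 2009
years = [2005, 2006, 2007, 2008, 2009]
sales = [12, 19, 29, 37, 45]

# Shift the time variable so that t counts years after 2005
t = [year - 2005 for year in years]

a, b = least_squares_line(t, sales)
estimate_2012 = a * (2012 - 2005) + b

print("a =", round(a, 1), "b =", round(b, 1))
print("Estimated sales in 2012:", round(estimate_2012, 1), "million dollars")
```

Note that the 2012 estimate is an extrapolation beyond the observed years, so it assumes the linear trend of 2005 to 2009 continues unchanged.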

UNIT - IV

Lesson 7

Measurement of Trend

Time Series

An arrangement of statistical data in accordance with time of occurrence or in chronological order is called a time series.

In time series analysis, current data in a series may be compared with past data in the same series. We may also compare the development of two or more series over time. These comparisons may afford important guidelines for the individual firm. Time series analysis plays an important role in economics, statistics and commerce.

Definition

A time series is an ordered sequence of values of a variable at equally spaced time intervals. For example, measuring the value of retail sales each month of the year would comprise a time series. This is because sales revenue is well defined, and consistently measured at equally spaced intervals.

Applications of Time Series

The usage of time series models is twofold:

• Obtain an understanding of the underlying forces and structure that produced the observed data.

• Fit a model and proceed to forecasting, monitoring or even feedback and feedforward control.

Time series analysis is useful in various fields such as:

• Financial Analysis: it includes sales forecasting, inventory analysis, stock market analysis and price estimation.
• Weather Analysis: it includes temperature estimation, climate change, seasonal shift recognition and weather forecasting.
• Network Data Analysis: it includes network usage prediction, anomaly or intrusion detection and predictive maintenance.
• Healthcare Analysis: it includes census prediction, insurance benefits prediction and patient monitoring.

It is also used in many other applications, such as:

• Economic Forecasting
• Sales Forecasting
• Stock Market Analysis
• Budgetary Analysis
• Yield Projections
• Process and Quality Control
• Inventory Studies
• Workload Projections
• Utility Studies
• Census Analysis, and many more.

The essential requirements of a Time Series are:

• The time gap between successive values must be, as far as possible, equal.
• It must consist of a homogeneous set of values.
• Data must be available for a long period.

Symbolically, if ‘t’ stands for time and ‘yt’ represents the value at time t, then the paired values (t, yt) represent time series data.

Example : Production of rice in Tamil Nadu for the period from 2010-11 to 2016-17.

Production of rice in Tamil Nadu (in ‘000 metric tons)

Year Production

2010-11 400

2011-12 450

2012-13 440

2013-14 420

2014-15 460

2016-17 520

Components of Time Series

The values of a time series may be affected by a number of movements or fluctuations, which are its characteristics. The types of movements characterizing a time series are called components of a time series or elements of a time series. There are four types:

• Secular Trend
• Seasonal Variations
• Cyclical Variations
• Irregular Variations

Secular Trend

Secular Trend is also called long term trend or simply trend. The trend is the long
term pattern of a time series.

A trend can be positive or negative depending on whether the time series exhibits
an increasing long term pattern or a decreasing long term pattern. If a time series does not
show an increasing or decreasing pattern then the series is stationary in the mean.

For example, if we are studying the sales figures of a cloth store for 1996-1997 and we find that the sales went up in 1997, this increase cannot be called a secular trend, because the period is too short to conclude that sales are showing an increasing tendency.

Cyclical Variations

This is a variation that recurs over a period of more than one year. The rhythmic movements in a time series with a period of oscillation (repeated again and again in the same manner) of more than one year are called cyclical variations, and the period is called a cycle.

Time series related to business and economics show some kind of cyclical variation.

One of the best examples of cyclical variation is the “Business Cycle”. In this cycle there are four well defined periods or phases.

• Boom
• Decline
• Depression
• Improvement

Seasonal Variations

Seasonal variations occur within a period of one year and have the same pattern
year after year. Here the data may be recorded monthly, weekly or hourly.

But if the figures are given in yearly terms then seasonal fluctuations do not appear.
Seasonal fluctuations occur in a time series due to two factors.

• Natural forces

• Man-made conventions

The most important factor causing seasonal variations is the climate. Changes in the
climate and weather conditions such as rainfall, humidity, heat etc. act on different
products and industries differently.

For example, during winter there is greater demand for woolen clothes, hot drinks
etc., whereas in summer cotton clothes and cold drinks have greater sales, and in the rainy season
umbrellas and raincoats have greater demand.

Though nature is primarily responsible for seasonal variation in time series,
customs, traditions and habits also have their impact.

For example, on occasions like Deepavali, Dussehra, Christmas etc. there is a big
demand for sweets, clothes etc.

There is a large demand for books and stationery in the first few months after the
opening of schools and colleges.

Irregular Variations

This component is unpredictable. Every time series has some unpredictable
component that makes it a random variable. In prediction, the objective is to “model” all
the components to the point that the only component that remains unexplained is the
random component.

Fluctuations of this type occur in random or irregular ways. They are
unforeseen, unpredictable and due to irregular circumstances beyond
human control, such as earthquakes, wars, floods, famines, lockouts, etc.

Models of Time Series Analysis

The following are the two models which we generally use for the decomposition of a
time series into its four components. The objective is to estimate and separate the four
types of variations and to bring out the relative effect of each on the overall behavior of
the time series.

• Additive model, and

• Multiplicative model

Additive Model

In the additive model, we represent a particular observation in a time series as the
sum of these four components,

i.e. O = T + S + C + I

where O represents the original data, T represents the trend, S represents the seasonal
variations, C represents the cyclical variations and I represents the irregular variations.

In another way, we can write

Y(t) = T(t) + S(t) + C(t) + I(t)

Multiplicative Model

In this model, the four components have a multiplicative relationship. So, we represent
a particular observation in a time series as the product of these four components,

i.e. O = T × S × C × I

where O, T, S, C and I represent the same terms as in the additive model.

In another way, we can write

Y(t) = T(t) × S(t) × C(t) × I(t)

This is the most commonly used model in the decomposition of time series. To remove
any doubt between the two models, it should be made clear that in the multiplicative model S,
C and I are indices expressed as decimal percentages, whereas in the additive model S, C
and I are quantitative deviations about the trend that are seasonal, cyclical
and irregular in nature.

Example: If in a multiplicative model

T = 500, S = 1.4, C = 1.20, I = 0.7

then O = T × S × C × I. By substituting the values, we get

O = 500 × 1.4 × 1.20 × 0.7 = 588

If in an additive model

T = 500, S = 100, C = 25, I = –60

then O = T + S + C + I. By substituting the values, we get

O = 500 + 100 + 25 – 60 = 565.
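The two substitutions above can be checked with a few lines of Python. This is only an illustrative sketch: the function names `compose_multiplicative` and `compose_additive` are assumptions made for this example and do not come from the text.

```python
def compose_multiplicative(t, s, c, i):
    """O = T x S x C x I, with S, C and I given as index ratios (1.0 = 100%)."""
    return t * s * c * i


def compose_additive(t, s, c, i):
    """O = T + S + C + I, with S, C and I given as deviations from the trend."""
    return t + s + c + i


# Values taken from the worked example above.
print(round(compose_multiplicative(500, 1.4, 1.20, 0.7), 2))
print(compose_additive(500, 100, 25, -60))
```

Note that the same trend value T = 500 leads to different observed values because S, C and I are ratios in one model and absolute deviations in the other.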

Trend Analysis

Trend is the long term movement in a time series. This component represents the basic
tendency of the series.

The following methods are generally used to determine the trend in any given time
series.

• Graphic method or eye inspection method
• Semi average method
• Method of moving average
• Method of least squares

Graphic method or eye inspection method

The graphic method is the simplest of all methods and is easy to understand. The method
is as follows.

First plot the given time series data on a graph.

Then draw a smooth free hand curve through the plotted points in such a way
that it represents the general tendency of the series.

As the curve is drawn by eye inspection, this is also called the eye-inspection
method.

The graphic method removes the short term variations to show the basic tendency
of the data.

The trend line drawn by the graphic method can be extended further to
predict or estimate values for future time periods.

As the method is subjective, the prediction may not be reliable.

Example: Fit a trend line by the Graphic method for the given data.

Year Sales
2000 30
2001 46
2002 25
2003 59
2004 40
2005 60
2006 38
2007 65

Solution:

Advantages

• It is the simplest method of studying trend values, and the trend is easy to draw.
• Sometimes the trend line drawn by a statistician experienced in computing
trend may be considered better than a trend line fitted by the use of a
mathematical formula.
• Although the free hand curve method is not recommended for beginners, it
has considerable merits in the hands of experienced statisticians and is widely
used in applied situations.

Disadvantages

• This method is highly subjective and the curve varies with the person who
draws it.
• The work must be handled by skilled and experienced people.
• Since the method is subjective, the prediction may not be reliable.
• Drawing a trend line by this method requires great care.

Method of Moving Average

It is a method for computing trend values in a time series which eliminates the
short term and random fluctuations from the time series by means of moving averages.

A moving average of period m is a series of successive arithmetic means of m
terms at a time, starting with the 1st, 2nd, 3rd term and so on.

The first average is the mean of the first m terms; the second average is the mean of the
2nd term to the (m+1)th term; the 3rd average is the mean of the 3rd term to the (m+2)th term, and so on.

If m is odd then the moving average is placed against the mid value of the time
interval it covers. But if m is even then the moving average lies between the two middle
periods and does not correspond to any time period.

So a further step has to be taken to place the moving average against a particular period
of time. For that we take a 2-period moving average of the moving averages (centering), which
corresponds to a particular time period. The resultant centered moving averages are the trend values.
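The placement rule for odd and even periods can be sketched in Python as follows. This is a minimal illustration rather than a prescribed procedure: the function name `moving_average` is an assumption, and the sales figures for 2000 to 2007 are borrowed from the graphic method example purely to have some numbers to work with.

```python
def moving_average(values, m):
    """Return trend values aligned with `values`; None where no value exists."""
    n = len(values)
    # Successive arithmetic means of m terms at a time.
    means = [sum(values[i:i + m]) / m for i in range(n - m + 1)]
    trend = [None] * n
    half = m // 2
    if m % 2 == 1:
        # Odd period: place each mean against the middle term it covers.
        for i, avg in enumerate(means):
            trend[i + half] = avg
    else:
        # Even period: centre by averaging two neighbouring means.
        for i in range(len(means) - 1):
            trend[i + half] = (means[i] + means[i + 1]) / 2
    return trend


sales = [30, 46, 25, 59, 40, 60, 38, 65]  # years 2000 to 2007
print(moving_average(sales, 3))  # 3-yearly moving average
print(moving_average(sales, 4))  # centered 4-yearly moving average
```

With a 3-yearly average the first and last years have no trend value, and with a centered 4-yearly average the first two and last two years have none.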

Example : Calculate 3-yearly moving average for the following data.

Advantages

• This method is simple to understand and easy to execute.

• It is flexible in application, in the sense that if we add data for a few
more time periods to the original data, the previous calculations are not
affected and we get a few more trend values.
• It gives a correct picture of the long term trend if the trend is linear.
• If the period of the moving average coincides with the period of oscillation (cycle),
the periodic fluctuations are eliminated.
• The moving average has the advantage that it follows the general movements
of the data and that its shape is determined by the data rather than the
statistician’s choice of mathematical function.

Example : Calculate 4-yearly moving average for the following data.

Disadvantages

• For a moving average of period 2m+1, one does not get trend values for the first m and
last m periods.

• As the trend path does not correspond to any mathematical function, it cannot
be used for forecasting or predicting values for future periods.

• If the trend is not linear, the trend values calculated through moving averages
may not show the true tendency of the data.

• The choice of the period is sometimes left to human judgment and hence
may carry the effect of human bias.

Method of Semi Averages

In this method the whole data set is divided into two equal parts with respect to time.

For example, if we are given data from 1999 to 2016, i.e. over a period of 18 years,
the two equal parts will be the first nine years, 1999 to 2007, and the last nine years, 2008 to 2016.

In the case of an odd number of years, like 9, 13, 17 etc., two equal parts can be made
simply by omitting the middle year.

For example, if the data are given for 19 years from 1998 to 2016, the two equal
parts would be from 1998 to 2006 and from 2008 to 2016; the middle year 2007 will be
omitted.

After the data have been divided into two parts, an average (arithmetic mean) of
each part is obtained.

We thus get two points. Each point is plotted against the mid year of its part.
Then these two points are joined by a straight line, which gives us the trend line. The line
can be extended downwards or upwards to get intermediate values or to predict future
values.

Example:

Thus we get two points 41.75 and 53.75 which shall be plotted corresponding to
their middle years i.e. 2002.5 and 2006.5. By joining these points we shall obtain the
required trend line. This line can be extended and can be used either for prediction or for
determining intermediate values.

Example: Fit a trend line by the method of semi-averages for the given data.

Year 2000 2001 2002 2003 2004 2005 2006

Production 105 115 120 100 110 125 135

Solution:

Year    Production    Average

2000    105
2001    115           (105 + 115 + 120) / 3 = 113.33
2002    120

2003    100           (left out)

2004    110
2005    125           (110 + 125 + 135) / 3 = 123.33
2006    135
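The semi-average calculation above can be reproduced with a short Python sketch. The function name `semi_average_trend` is an assumption for this illustration. It omits the middle year when the number of years is odd, plots each average against the mid year of its part, and returns the straight line through the two points.

```python
def semi_average_trend(years, values):
    """Return the two semi-average points and a function giving the trend line."""
    n = len(values)
    half = n // 2  # with an odd n the middle year is left out
    x1 = sum(years[:half]) / half
    y1 = sum(values[:half]) / half
    x2 = sum(years[n - half:]) / half
    y2 = sum(values[n - half:]) / half
    slope = (y2 - y1) / (x2 - x1)  # change in trend per year

    def trend(year):
        return y1 + slope * (year - x1)

    return (x1, y1), (x2, y2), trend


years = [2000, 2001, 2002, 2003, 2004, 2005, 2006]
production = [105, 115, 120, 100, 110, 125, 135]
p1, p2, trend = semi_average_trend(years, production)
print(p1, p2)
print(round(trend(2003), 2), round(trend(2007), 2))
```

The two points are the semi-averages plotted against 2001 and 2005. Evaluating `trend` at other years corresponds to reading intermediate or future values off the extended line.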

Lesson 8

Measurement of Variations

Cyclical Variations

The various methods used for measuring cyclical variations are

• Residual method
• Reference cycle analysis method
• Direct method
• Harmonic analysis method

Business Cycle

According to Mitchell, "Business cycles are a type of fluctuation found in the
aggregate economic activity of nations that organize their work mainly in business
enterprises: a cycle consists of expansions occurring at about the same time in many
activities, followed by general recessions, contractions and revivals which merge into the
expansion phase of the next cycle; this sequence of changes is recurrent but not periodic;
in duration business cycles vary from more than one year to ten or twelve years".

There are four phases of a business cycle:

• Expansion (prosperity)
• Recession
• Depression (contraction)
• Revival (recovery)

A cycle is measured either from trough to trough or from peak to peak. Recession
and contraction are the result of the cumulative downswing of a cycle, whereas revival and
expansion are the result of the cumulative upswing of a cycle.

Seasonal Variations

Seasonal variations are regular and periodic variations with a period of one year.
Examples of series that show seasonal variations are the production of cold
drinks, which is high during the summer months and low during the winter season, and the sales of
sarees in a cloth store, which are high during the festival season and low during other periods.
The reason for determining the seasonal variations in a time series is to isolate them and to study
their effect on the size of the variable in index form, which is usually referred to as the seasonal
index.

Measurement of Seasonal Variations

The study of seasonal variation is of great importance for business enterprises in
planning the production schedule efficiently, so that they can meet
public demand according to the seasons.

There are different methods of measuring seasonal variations. These are

• Method of simple averages
• Ratio to trend method
• Ratio to moving average method
• Link relative method

Method of Simple Averages

This is the simplest of all the methods of measuring seasonality. This method is
based on the additive model of the time series. That is, the observed values of the series are
expressed as $Y_t = T_t + S_t + C_t + R_t$, and in this method we assume that the trend
component and the cyclical component are absent.

The method consists of the following steps.

• Arrange the data by years and months (or quarters if quarterly data are given).

• Compute the average $\bar{x}_i$ (i = 1, 2, ..., 12 for monthly and i = 1, 2, 3, 4 for quarterly data)
of the i-th month or quarter over all the years.

• Compute the average $\bar{x}$ of these averages,

i.e. $\bar{x} = \frac{1}{12}\sum_{i=1}^{12} \bar{x}_i$ for monthly and $\bar{x} = \frac{1}{4}\sum_{i=1}^{4} \bar{x}_i$ for quarterly data.

• Seasonal indices for the different months (quarters) are obtained by expressing the
monthly (quarterly) averages as percentages of $\bar{x}$. Thus the seasonal index for the i-th
month (quarter) $= \frac{\bar{x}_i}{\bar{x}} \times 100$

Advantages and Disadvantages:

• The method of simple averages is easy and simple to execute.

• This method is based on the assumption that the data do not contain any
trend or cyclical components. Since most economic and business time series have
trends, this method, though simple, is not of much practical utility.

Example : Assuming that the trend is absent, determine if there is any seasonality in the
data given below.

Year 1stQuarter 2ndQuarter 3rdQuarter 4thQuarter

2004 3.7 4.1 3.3 3.5

2005 3.7 3.9 3.6 3.6

2006 4.0 4.1 3.3 3.1

2007 3.3 4.4 4.0 4.0

What are the seasonal indices for various quarters?

Solution:

Year 1stQuarter 2ndQuarter 3rdQuarter 4thQuarter

2004 3.7 4.1 3.3 3.5

2005 3.7 3.9 3.6 3.6

2006 4.0 4.1 3.3 3.1

2007 3.3 4.4 4.0 4.0

Total 14.7 16.5 14.2 14.2

Average 3.675 4.125 3.55 3.55

Seasonal Index 98.66 110.74 95.30 95.30

Notes on calculating the seasonal indices

The average of the averages = (3.675 + 4.125 + 3.55 + 3.55) / 4 = 14.9 / 4 = 3.725

Seasonal index = (Quarterly average / General average) × 100

Seasonal index for the first quarter = (3.675 / 3.725) × 100 = 98.66

Seasonal index for the second quarter = (4.125 / 3.725) × 100 = 110.74

Seasonal index for the third quarter = (3.55 / 3.725) × 100 = 95.30

Seasonal index for the fourth quarter = (3.55 / 3.725) × 100 = 95.30
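The method of simple averages is easy to express in code. The following Python sketch assumes a table with one row per year and one column per quarter (or month). The function name `seasonal_indices_simple_average` is hypothetical and chosen only for this illustration.

```python
def seasonal_indices_simple_average(table):
    """table: list of yearly rows, each holding one value per season."""
    n_years = len(table)
    n_seasons = len(table[0])
    # Average of each quarter (or month) over all the years.
    season_avg = [sum(row[j] for row in table) / n_years for j in range(n_seasons)]
    # Average of the averages.
    grand_avg = sum(season_avg) / n_seasons
    # Each seasonal average expressed as a percentage of the grand average.
    return [100 * a / grand_avg for a in season_avg]


data = [
    [3.7, 4.1, 3.3, 3.5],  # 2004
    [3.7, 3.9, 3.6, 3.6],  # 2005
    [4.0, 4.1, 3.3, 3.1],  # 2006
    [3.3, 4.4, 4.0, 4.0],  # 2007
]
print([round(v, 2) for v in seasonal_indices_simple_average(data)])
```

Because every index is a percentage of the average of the averages, the four quarterly indices always add up to 400.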

Ratio to Moving Average Method

The ratio to moving average method is also known as the percentage of moving
average method and is the most widely used method of measuring seasonal variations.

The steps necessary for determining seasonal variations by this method are

• Calculate the centered 12-monthly moving average (or 4-quarterly moving
average) of the given data. These moving average values eliminate S and
I, leaving us with the T and C components.

• Express the original data as percentages of the centered moving average
values.

• The seasonal indices are now obtained by eliminating the irregular or random
components by averaging these percentages using the arithmetic mean or the median.

• The sum of these indices will not in general be equal to 1200 (for monthly) or
400 (for quarterly data).

• Finally, an adjustment is made to bring the sum of the indices to a total of 1200
for monthly and 400 for quarterly data, by multiplying them throughout by a
constant K which is given by

$K = \frac{1200}{\text{Total of the indices}}$ for monthly data

$K = \frac{400}{\text{Total of the indices}}$ for quarterly data

Advantages

• Of all the methods of measuring seasonal variations, the ratio to moving
average method is the most satisfactory, flexible and widely used method.

• The fluctuation of indices based on the ratio to moving average method is less than that of
indices based on other methods.

Disadvantages

• This method does not completely utilize the data. For example, in the case of a
12-monthly moving average, seasonal indices cannot be obtained for the first 6 and
last 6 months.

Example:

Calculate the seasonal indices by the ratio to moving average method from the
following data:

Year 1stQuarter 2ndQuarter 3rdQuarter 4thQuarter

2005 68 62 61 63

2006 65 58 66 61

2007 68 63 63 67

Solution

Calculation of Seasonal Index

Year 1stQuarter 2ndQuarter 3rdQuarter 4thQuarter

2005 - - 96.63 101.20

2006 104.21 92.43 104.97 95.50

2007 106.04 97.67 - -

Total 210.25 190.10 201.60 196.70

Average 105.125 95.05 100.80 98.35

Seasonal Index 105.30 95.21 100.97 98.52

Arithmetic average of the averages = 399.32 / 4 = 99.83

By expressing each quarterly average as a percentage of 99.83, we obtain the seasonal
indices.

Seasonal index for the first quarter = (105.125 / 99.83) × 100 = 105.30

Seasonal index for the second quarter = (95.05 / 99.83) × 100 = 95.21

Seasonal index for the third quarter = (100.80 / 99.83) × 100 = 100.97

Seasonal index for the fourth quarter = (98.35 / 99.83) × 100 = 98.52
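The steps of the ratio to moving average method can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the function name `ratio_to_moving_average` is hypothetical, the data are given as one row per year, the number of seasons per year is even (4 or 12), and the arithmetic mean is used to average the percentages.

```python
def ratio_to_moving_average(table):
    """table: yearly rows of seasonal values (4 quarters or 12 months per row)."""
    period = len(table[0])
    series = [v for row in table for v in row]  # chronological order
    half = period // 2
    # Moving averages of `period` terms at a time.
    ma = [sum(series[i:i + period]) / period for i in range(len(series) - period + 1)]
    ratios = [[] for _ in range(period)]
    for i in range(len(ma) - 1):
        centered = (ma[i] + ma[i + 1]) / 2  # centered moving average
        t = i + half                        # position it corresponds to
        ratios[t % period].append(100 * series[t] / centered)
    # Average the percentages for each season, then adjust the total.
    season_avg = [sum(r) / len(r) for r in ratios]
    k = 100 * period / sum(season_avg)      # 400 or 1200 over the total
    return [k * a for a in season_avg]


data = [
    [68, 62, 61, 63],  # 2005
    [65, 58, 66, 61],  # 2006
    [68, 63, 63, 67],  # 2007
]
print([round(v, 2) for v in ratio_to_moving_average(data)])
```

Since intermediate values are not rounded here, the results may differ from the hand calculation in the last decimal place.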

Link Relative Method

This method is slightly more complicated than the other methods. It is also
known as Pearson’s method. It consists of the following steps.

• The link relatives for each period are calculated by using the formula below.

$\text{Link relative for any period} = \frac{\text{Current period's figure}}{\text{Previous period's figure}} \times 100$

• Calculate the average of the link relatives for each period over all the years, using the
mean or the median.

• Convert the average link relatives into chain relatives on the basis of the first
season.

• The chain relative for any period is obtained as

$\frac{\text{Average link relative for that period} \times \text{Chain relative of the previous period}}{100}$

The chain relative for the first period is taken to be 100.

• Now the adjusted chain relatives are calculated by subtracting the correction
‘k d’ from the (k+1)th chain relative,

where k = 1, 2, ..., 11 for monthly and k = 1, 2, 3 for quarterly data, and

$d = \frac{1}{N}\,[\text{New chain relative for first period} - 100]$

where N denotes the number of periods, i.e. N = 12 for monthly and N = 4 for quarterly data.
The new chain relative for the first period is obtained from the chain relative of the last
period and the average link relative of the first period.

• Finally, calculate the average of the corrected chain relatives and express the
corrected chain relatives as percentages of this average.

• These percentages are the seasonal indices calculated by the link relative method.

Advantages

• Compared with the method of moving averages, the link relative method makes fuller
use of the data.

Example :

Apply the method of link relatives to the following data and calculate the seasonal
indices:

Quarterly Figures

Quarter 2003 2004 2005 2006 2007

I 6.0 5.4 6.8 7.2 6.6

II 6.5 7.9 6.5 5.8 7.3

III 7.8 8.4 9.3 7.5 8.0

IV 8.7 7.3 6.4 8.5 7.1

Solution: Calculation of Seasonal Indices by the method of Link Relatives

                            Quarter

Year                        I        II        III       IV

2003                        -        108.3     120.0     111.5
2004                        62.1     146.3     106.3     86.9
2005                        93.2     95.6      143.1     68.8
2006                        112.5    80.6      129.3     113.3
2007                        77.6     110.6     109.6     88.8

Total                       345.4    541.4     608.3     469.3
Arithmetic average          86.35    108.28    121.66    93.86
Chain relatives             100      108.28    131.73    123.64
Corrected chain relatives   100      106.605   128.38    118.615
Seasonal indices            88.18    94.01     113.21    104.60

Arithmetic averages: 345.4 / 4 = 86.35, 541.4 / 5 = 108.28, 608.3 / 5 = 121.66, 469.3 / 5 = 93.86

Chain relatives: (100 × 108.28) / 100 = 108.28, (121.66 × 108.28) / 100 = 131.73, (93.86 × 131.73) / 100 = 123.64

Corrected chain relatives: 108.28 – 1.675 = 106.605, 131.73 – 3.35 = 128.38, 123.64 – 5.025 = 118.615

Seasonal indices: (100 / 113.4) × 100 = 88.18, (106.605 / 113.4) × 100 = 94.01, (128.38 / 113.4) × 100 = 113.21, (118.615 / 113.4) × 100 = 104.60

The calculations in the above table are explained below:

Chain relative of the first quarter (on the basis of the first quarter) = 100

Chain relative of the first quarter (on the basis of the last quarter) = (86.35 × 123.64) / 100 = 106.7

The difference between these chain relatives = 106.7 – 100 = 6.7

Difference per quarter = 6.7 / 4 = 1.675

The adjusted chain relatives are obtained by subtracting

1 × 1.675, 2 × 1.675 and 3 × 1.675

from the chain relatives of the 2nd, 3rd and 4th quarters respectively.

Average of the corrected chain relatives = (100 + 106.605 + 128.38 + 118.615) / 4 = 453.6 / 4 = 113.4

Seasonal variation index = (Corrected chain relative / 113.4) × 100
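The whole link relative procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `link_relative_indices` is hypothetical, the data are arranged as one row per year, and the arithmetic mean is used for averaging. No intermediate rounding is done, so the results agree with the hand calculation only to within a few hundredths.

```python
def link_relative_indices(table):
    """table[y][s]: figure for year y and season s, in chronological order."""
    n_seasons = len(table[0])
    series = [v for row in table for v in row]
    # Link relative = current period's figure / previous period's figure x 100.
    links = [[] for _ in range(n_seasons)]
    for t in range(1, len(series)):
        links[t % n_seasons].append(100 * series[t] / series[t - 1])
    avg_link = [sum(v) / len(v) for v in links]
    # Chain relatives on the basis of the first season (taken as 100).
    chain = [100.0]
    for s in range(1, n_seasons):
        chain.append(avg_link[s] * chain[s - 1] / 100)
    # New chain relative of the first season on the basis of the last season.
    new_first = avg_link[0] * chain[-1] / 100
    d = (new_first - 100) / n_seasons
    # Subtract k*d from the (k+1)th chain relative.
    corrected = [chain[k] - k * d for k in range(n_seasons)]
    mean_corrected = sum(corrected) / n_seasons
    return [100 * c / mean_corrected for c in corrected]


data = [
    [6.0, 6.5, 7.8, 8.7],  # 2003
    [5.4, 7.9, 8.4, 7.3],  # 2004
    [6.8, 6.5, 9.3, 6.4],  # 2005
    [7.2, 5.8, 7.5, 8.5],  # 2006
    [6.6, 7.3, 8.0, 7.1],  # 2007
]
print([round(v, 2) for v in link_relative_indices(data)])
```

The indices add up to 400 by construction, which gives a quick check on any hand calculation.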

Disadvantages

• The link relative method needs extensive calculations compared to other
methods and is not as simple as the method of moving averages.

• The average of the link relatives contains both trend and cyclical components, and
these components have to be eliminated by applying a correction.

Ratio to Trend Method

This method is an improvement over the simple averages method. It
assumes a multiplicative model,

i.e. $Y_t = T_t \times S_t \times C_t \times R_t$

The measurement of seasonal indices by this method consists of the following
steps.

• Obtain the trend values by the least squares method by fitting a mathematical
curve, either a straight line or a second degree polynomial.

• Express the original data as percentages of the trend values. Assuming the
multiplicative model, these percentages will contain the seasonal, cyclical and
irregular components.

• The cyclical and irregular components are eliminated by averaging the
percentages for the different months (quarters) if the data are monthly
(quarterly), thus leaving us with indices of seasonal variations.

• Finally, the indices obtained in the previous step are adjusted to a total of 1200 for
monthly and 400 for quarterly data by multiplying them throughout by a
constant K which is given by

$K = \frac{1200}{\text{Total of the indices}}$ for monthly data

$K = \frac{400}{\text{Total of the indices}}$ for quarterly data

Advantages

• It is easy to compute and easy to understand.

• Compared with the method of monthly averages, this method is certainly a
more logical procedure for measuring seasonal variations.

• It has an advantage over the ratio to moving average method in that
we obtain ratio to trend values for each period for which data are
available, whereas this is not possible in the ratio to moving average method.

Disadvantages

• The main defect of the ratio to trend method is that if there are cyclical swings
in the series, the trend, whether a straight line or a curve, can never follow the
actual data as closely as a 12-monthly moving average does. So a seasonal
index computed by the ratio to moving average method may be less biased than
the one calculated by the ratio to trend method.

Example: Calculate Seasonal Indices by the Ratio to Trend Method from the
following data:

Year 1stQuarter 2ndQuarter 3rdQuarter 4thQuarter

2003 30 40 36 34
2004 34 52 50 44
2005 40 58 54 48

2006 54 76 68 62
2007 80 92 86 82

Solution: To determine the seasonal variation by the ratio to trend method, we first
determine the trend for the yearly data and then convert it to quarterly data.

Calculating Trend by Method of Least Squares

Year    Yearly totals    Yearly average Y    Deviation from mid-year X    XY      X²    Trend values

2003    140              35                  -2                           -70     4     32
2004    180              45                  -1                           -45     1     44
2005    200              50                  0                            0       0     56
2006    260              65                  +1                           +65     1     68
2007    340              85                  +2                           +170    4     80

N = 5, $\sum Y = 280$, $\sum XY = 120$, $\sum X^2 = 10$

The equation of the straight line is $Y_c = a + bX$.

Since $\sum X = 0$,

$a = \frac{\sum Y}{N} = \frac{280}{5} = 56, \qquad b = \frac{\sum XY}{\sum X^2} = \frac{120}{10} = 12$

Quarterly increment $= \frac{12}{4} = 3$

Calculation of quarterly trend values.

Consider 2003. The trend value for the middle of the year, i.e. half way between the 2nd and
3rd quarters, is 32. The quarterly increment is 3. So the trend value of the 2nd quarter is 32 – 3/2,
i.e. 30.5, and of the 3rd quarter is 32 + 3/2, i.e. 33.5. The trend value for the 1st quarter is 30.5 – 3,
i.e. 27.5, and for the 4th quarter is 33.5 + 3, i.e. 36.5.

We thus get quarterly trend values as shown below:

Trend Values

Year 1stQuarter 2ndQuarter 3rdQuarter 4thQuarter

2003 27.5 30.5 33.5 36.5
2004 39.5 42.5 45.5 48.5
2005 51.5 54.5 57.5 60.5
2006 63.5 66.5 69.5 72.5
2007 75.5 78.5 81.5 84.5

The given values are expressed as percentages of the corresponding trend values.
Thus for the 1st quarter of 2003 the percentage is

(30/27.5) × 100 = 109.09, for the 2nd quarter

(40/30.5) × 100 = 131.15, etc.

Given quarterly values as % of trend values

Year 1stQuarter 2ndQuarter 3rdQuarter 4thQuarter

2003 109.09 131.15 107.46 93.15

2004 86.08 122.35 109.89 90.72

2005 77.67 106.42 93.91 79.34

2006 85.04 114.29 97.84 85.52

2007 105.96 117.20 105.52 97.04

Total 463.84 591.41 514.62 445.77

Average 92.77 118.28 102.92 89.15

S.I. Adjusted 92.05 117.36 102.12 88.46

Total of the averages = 92.77 + 118.28 + 102.92 + 89.15 = 403.12.

Since the total is more than 400, an adjustment is made by multiplying each
average by 400/403.12, and the final indices are obtained.
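The ratio to trend calculation for quarterly data can be sketched in Python as below. This is an illustration under stated assumptions: the function name `ratio_to_trend_quarterly` is hypothetical, a straight line is fitted to the yearly averages with X measured from the middle of the period, and the yearly trend value is taken to lie half way between the 2nd and 3rd quarters, as in the worked solution.

```python
def ratio_to_trend_quarterly(table):
    """table: one row of four quarterly values per year, in order."""
    n = len(table)
    yearly_avg = [sum(row) / 4 for row in table]
    # X = deviation of each year from the middle of the period, so sum(X) = 0.
    x = [i - (n - 1) / 2 for i in range(n)]
    a = sum(yearly_avg) / n
    b = sum(xi * yi for xi, yi in zip(x, yearly_avg)) / sum(xi * xi for xi in x)
    step = b / 4  # quarterly increment
    # Quarters sit 1.5 and 0.5 steps below and above the mid-year trend value.
    trend = [[a + b * xi + step * (q - 1.5) for q in range(4)] for xi in x]
    pct = [[100 * table[i][q] / trend[i][q] for q in range(4)] for i in range(n)]
    avg = [sum(pct[i][q] for i in range(n)) / n for q in range(4)]
    k = 400 / sum(avg)  # adjust the indices to a total of 400
    return trend, [k * v for v in avg]


data = [
    [30, 40, 36, 34],  # 2003
    [34, 52, 50, 44],  # 2004
    [40, 58, 54, 48],  # 2005
    [54, 76, 68, 62],  # 2006
    [80, 92, 86, 82],  # 2007
]
trend, indices = ratio_to_trend_quarterly(data)
print(trend[0])
print([round(v, 2) for v in indices])
```

Because the percentages are not rounded before averaging, the adjusted indices may differ from the tabulated ones in the second decimal place.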

UNIT-5

Lesson 9
Probability

Probability is the measure of the likelihood that an event will occur in a Random
Experiment. The probability of an event is a number between 0 and 1, where, roughly
speaking, 0 indicates impossibility of the event and 1 indicates certainty. The higher the
probability of an event P(E), the more likely it is that the event will occur.

Example

A simple example is the tossing of a fair (unbiased) coin. Since the coin is fair, the
two outcomes (“heads” and “tails”) are equally probable; the probability of “heads”
equals the probability of “tails”; and since no other outcomes are possible, the probability
of each of “heads” and “tails” is 1/2 (which could also be written as 0.5 or 50%).

Formula of Probability

The probability of an event is defined as the ratio of the number of outcomes
favourable to the event to the total number of equally likely outcomes.

P(E) = Number of favourable outcomes / Total number of outcomes

Terms used in Probability and Statistics

There are various terms used in probability and statistics, such as:

• Random Experiment
• Sample Space
• Random Variable
• Expected Value
• Independence
• Variance
• Mean

Random Experiment

An experiment whose result cannot be predicted until it is observed is called a
random experiment.

For example, when we throw a die, the result is uncertain to us.

We can get any number from 1 to 6. Hence, this experiment is random.

Sample Space

A sample space is the set of all possible results or outcomes of a random
experiment.

Suppose we throw a die; then the sample space for this
experiment is the set of all possible outcomes of throwing a die, namely

Sample Space = {1,2,3,4,5,6}

Random Variable

A random variable is a variable whose possible values are numerical outcomes of
a random experiment. There are two types of random variables.

• A discrete random variable is one which may take on only a countable number
of distinct values. Examples:

    • Number of students in the class
    • Number of stars in the sky

• A continuous random variable is one which can take any value within an interval,
and hence an uncountably infinite number of possible values. Examples:

    • Height of the students in the class
    • Weight of the students in the class

Independent Event

When the occurrence of one event has no effect on the probability
of another event, the two events are said to be independent of each other.

For example, if you flip a coin and at the same time throw a die, the
probability of getting a ‘head’ is independent of the probability of getting a 6 on the die.

Mean

The mean of a random variable is the average of its possible values, each weighted
by its probability.

In simple terms, it is the long-run average of the outcomes when the random
experiment is repeated again and again, a large number n of times.

It is also called the expectation of a random variable.

Expected Value

The expected value is the mean of a random variable.

It is the value we expect on average from a random experiment.

It is also called the expectation, mathematical expectation or first moment.

For example, if we roll a die having six faces, then the expected value is the
average of all the possible outcomes, i.e. (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5.

Variance

The variance tells us how the values of the random variable are spread
around the mean value.

It is the average of the squared deviations of the values from the mean, and so
measures how widely the outcomes are dispersed about the mean.
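For a fair six-faced die the mean and the variance can be computed exactly. The short Python sketch below uses the standard `fractions` module so that no rounding is involved. The variable names are chosen only for this illustration.

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)  # each face of a fair die is equally likely

# Mean (expected value): sum of value x probability.
mean = sum(x * p for x in faces)

# Variance: expected squared deviation from the mean.
variance = sum((x - mean) ** 2 * p for x in faces)

print(mean, float(mean))
print(variance, float(variance))
```

The mean equals 7/2 = 3.5, in agreement with the expected value stated above, and the variance works out to 35/12.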

Probability Terms and Definition

Some of the important probability terms:

Term: Sample Space
Definition: The set of all the possible outcomes that can occur in any trial.
Examples: Tossing a coin, Sample Space (S) = {H,T}; rolling a die, Sample Space (S) = {1,2,3,4,5,6}.

Term: Sample Point
Definition: One of the possible results.
Examples: In a deck of cards, the 4 of hearts is a sample point; the queen of clubs is a sample point.

Term: Experiment or Trial
Definition: A series of actions where the outcomes are always uncertain.
Examples: Tossing a coin, selecting a card from a deck of cards, throwing a die.

Term: Event
Definition: A single outcome of an experiment.
Example: Getting a head while tossing a coin is an event.

Term: Outcome
Definition: A possible result of a trial/experiment.
Example: T (tail) is a possible outcome when a coin is tossed.

Term: Complementary Event
Definition: The non-happening of an event. The complement of an event A is the event not A (or A’).
Example: In a standard 52-card deck, if A = draw a heart, then A’ = don’t draw a heart.

Term: Impossible Event
Definition: An event that cannot happen.
Example: In tossing a coin, it is impossible to get both head and tail at the same time.

Example: A bucket contains 5 blue, 4 green and 5 red balls. Sudheer is asked to pick 2
balls randomly from the bucket without replacement and then one more ball is to be
picked. What is the probability that he picks 2 green balls and then 1 blue ball?

Solution: Total number of balls = 14

Probability of drawing

1 green ball = 4/14

Another green ball = 3/13

1 blue ball = 5/12

Probability of picking 2 green balls and then 1 blue ball = 4/14 × 3/13 × 5/12 = 5/182.
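This result can be checked in two ways with a short Python sketch: by multiplying the three fractions, and by listing every ordered draw of three distinct balls and counting those that come out green, green, blue. The variable names are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import permutations

balls = ["B"] * 5 + ["G"] * 4 + ["R"] * 5  # 5 blue, 4 green, 5 red

# Product of the three conditional probabilities (drawing without replacement).
p_formula = Fraction(4, 14) * Fraction(3, 13) * Fraction(5, 12)

# Enumerate all ordered draws of three distinct balls.
draws = list(permutations(range(len(balls)), 3))
favourable = sum(
    1 for i, j, k in draws if (balls[i], balls[j], balls[k]) == ("G", "G", "B")
)
p_enum = Fraction(favourable, len(draws))

print(p_formula, p_enum)
```

Both approaches give 5/182, since 60 of the 14 × 13 × 12 = 2184 ordered draws are favourable.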

Example: A bowl contains 3 red, 2 black and 5 green marbles. What is the probability
that a marble Ram chooses at random is not black?

Solution:

Total number of marbles = 10

Red and green marbles = 8

Find the number of marbles that are not black and divide by the total number of
marbles. So,

P(not black) = (number of red or green marbles)/(total number of marbles) = 8/10 = 4/5.

Types of Probability

Theoretical Probability

For theoretical reasons, we assume that all n possible outcomes of a particular

experiment are equally likely, and we assign a probability of 1/n to each possible outcome.

Example: The theoretical probability of rolling a 3 on a regular 6 sided die is 1/6.

Relative Frequency interpretation of Probability

We conduct an experiment many, many times. Then we say

The probability of an event A = (How many times A occurs) / (How many trials)

Relative Frequency is based on observation or actual measurements.

Example: A die is rolled 100 times. The number 3 is rolled 12 times. The relative
frequency of rolling a 3 is 12/100.

Personal or Subjective Probability

These are values (between 0 and 1 or 0 and 100%) assigned by individuals based
on how likely they think events are to occur.

Example: The probability of my being asked on a date for this weekend is 10%.

Probability Rules (or) Probability Models

1. The probability of an event is between 0 and 1. A probability of 1 is equivalent to
100% certainty. Probabilities can be expressed as fractions, decimals, or percents.

0 ≤ P(A) ≤ 1

2. The sum of the probabilities of all possible outcomes is 1 or 100%. If A, B, and C are
the only possible outcomes, then

P(A) + P(B) + P(C) = 1

Example: A bag contains 5 red marbles, 3 blue marbles, and 2 green marbles.

P(red) + P(blue) + P(green) = 1

i.e., 5/10 + 3/10 + 2/10 = 1

3. The sum of the probability of an event occurring and it not occurring is 1.

P(A) + P(not A) = 1 or
P(not A) = 1 - P(A).

Example: A bag contains 5 red marbles, 3 blue marbles, and 2 green marbles.

P(red) + P(not red) = 1

P(red) = 5/10, P(not red) = 5/10

i.e., 5/10 + 5/10 = 1

4. If two events A and B are independent (this means that the occurrence of A has no
impact at all on whether B occurs and vice versa), then the probability of A and B
occurring is the product of their individual probabilities.

P(A and B) = P(A) × P(B)

Example: Roll a die and flip a coin.

P(heads and roll a 3) = P(H) × P(3)

= 1/2 × 1/6 = 1/12

5. If two events A and B are mutually exclusive (meaning A cannot occur at the same time
as B occurs), then the probability of either A or B occurring is the sum of their individual
probabilities.

P(A or B) = P(A) + P(B)

Example: A bag contains 5 red marbles, 3 blue marbles, and 2 green marbles.

P(red or green) = P(red) + P(green)

= 5/10 + 2/10 = 7/10

6. If two events A and B are not mutually exclusive (meaning it is possible that A and B
occur at the same time), then the probability of either A or B occurring is the sum of their
individual probabilities minus the probability of both A and B occurring.

P(A or B) = P(A) + P(B) – P(A and B)

Example: There are 20 people in the room: 12 girls (5 with blond hair and 7 with brown
hair) and 8 boys (4 with blond hair and 4 with brown hair). There are a total of 9 blonds
and 11 with brown hair. One person from the group is chosen randomly.

P(girl or blond) = P(girl) + P(blond) – P(girl and blond)

= 12/20 + 9/20 – 5/20 = 16/20

7. The probability of at least one event occurring out of multiple events is equal to one
minus the probability of none of the events occurring.

P(at least one) = 1 – P(none)

Example: Flip a coin 4 times. What is the probability of getting at least one head in the
4 flips?

P(at least one H) = 1 – P(no H) = 1 – P (TTTT)

$$1 - \left(\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2}\right) = 1 - \frac{1}{16} = \frac{15}{16}$$

8. If event B is a subset of event A, then the probability of B is less than or equal to the
probability of A.
P(B) ≤ P(A)

Example: There are 20 people in the room: 12 girls (5 with blond hair and 7 with brown
hair) and 8 boys (4 with blond hair and 4 with brown hair).

P(girl with brown hair) ≤ P(girl)

$$\frac{7}{20} \le \frac{12}{20}$$
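The rules above can be checked with exact fractions. The following is a minimal sketch in Python; the variable names are illustrative and not part of the original text, and the numbers are taken from the marble, coin, die and room examples.

```python
from fractions import Fraction

# Bag from the examples above: 5 red, 3 blue and 2 green marbles.
bag = {"red": 5, "blue": 3, "green": 2}
total = sum(bag.values())
p = {colour: Fraction(count, total) for colour, count in bag.items()}

# Rule 2: the probabilities of all possible outcomes sum to 1.
sum_all = sum(p.values())

# Rule 3: complement rule, P(not A) = 1 - P(A).
p_not_red = 1 - p["red"]

# Rule 4: independent events multiply (fair coin and fair die).
p_heads_and_three = Fraction(1, 2) * Fraction(1, 6)

# Rule 5: mutually exclusive events add.
p_red_or_green = p["red"] + p["green"]

# Rule 6: general addition rule, using the room of 20 people.
p_girl = Fraction(12, 20)
p_blond = Fraction(9, 20)
p_girl_and_blond = Fraction(5, 20)
p_girl_or_blond = p_girl + p_blond - p_girl_and_blond

# Rule 7: at least one head in 4 fair coin flips.
p_at_least_one_head = 1 - Fraction(1, 2) ** 4

print(sum_all, p_not_red, p_heads_and_three,
      p_red_or_green, p_girl_or_blond, p_at_least_one_head)
```

Note that Fraction reduces results automatically, so 16/20 is reported in its lowest terms as 4/5.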

Probability of an Event

Assume an event E can occur in r ways out of a total of n possible, equally likely ways.
Then the probability that the event happens (its success) is expressed as:

P(E) = r/n

The probability that the event will not occur (its failure) is expressed as:

P(E’) = (n-r)/n = 1-(r/n)

E’ denotes that the event does not occur. Therefore, we can say:

P(E) + P(E’) = 1

This means that the total of all the probabilities in any random test or experiment
is equal to 1.

Equally Likely Events

When the events have the same theoretical probability of happening, then they are
called equally likely events. The results of a sample space are called equally likely if all of
them have the same probability of occurring.

For example, if you throw a fair die, the probability of getting 1 is 1/6. Similarly,
the probability of getting each of the numbers 2, 3, 4, 5 and 6 is 1/6. Hence,
the following are some examples of equally likely events when throwing a die:

• Getting a 3 and getting a 5
• Getting an even number and getting an odd number
• Getting a 1, getting a 2 and getting a 3

In each case the events are equally likely, since their probabilities are equal.

Complementary Events

Two events are complementary when exactly one of them must occur: either the event
happens or it does not. A person will either come to your house or not come, and will
either get a job or not get it. The complement of an event is the event that it does not
occur, and the probabilities of an event and its complement add up to 1. Some more
examples are:

• It will rain or not rain today.
• The student will pass the exam or not pass.
• You win the lottery or you don’t.

Basic Rules of Probability

If A and B are two events, then;

P(A∪B)=P(A)+P(B)−P(A∩B)

P(A∩B) = P(B) × P(A|B)

Distribution

• The distribution of a statistical data set (or a population) is a listing or function
showing all the possible values (or intervals) of the data and how often they occur.
• Think about a die. It has six sides, numbered from 1 to 6. We roll the die. What
is the probability of getting 1?
• It is one out of six, so one-sixth. What is the probability of getting 2?
Once again, one-sixth. The same holds for 3, 4, 5 and 6.
• Now, what is the probability of getting a 7? It is impossible to get a 7 when
rolling a die, so the probability is 0.

Probability Distribution

• Probability Distribution: A table, graph, or formula that describes the values a
random variable can take on and their corresponding probabilities (discrete
random variable) or densities (continuous random variable)

• Discrete Probability Distribution: Assigns probabilities (masses) to the
individual outcomes

• Continuous Probability Distribution: Assigns a density to individual points; the
probability of a range is obtained by integrating the density function over that range

• Discrete Probabilities denoted by: p(x) = P(X=x)

• Continuous Densities denoted by: f(x)

• Cumulative Distribution Function: F(x) = P(X≤x)

Discrete Probability Distributions

A discrete probability distribution lists each possible value the random variable
can assume, together with its probability. A probability distribution must satisfy the
following conditions:

• The probability of each value of the discrete random variable is between
0 and 1, inclusive: 0 ≤ P(x) ≤ 1
• The sum of all the probabilities is 1: ΣP(x) = 1

Constructing a Discrete Probability Distribution

Guidelines

Let x be a discrete random variable with possible outcomes x1, x2, … , xn.

• Make a frequency distribution for the possible outcomes.
• Find the sum of the frequencies.
• Find the probability of each possible outcome by dividing its frequency by the
sum of the frequencies.
• Check that each probability is between 0 and 1 and that the sum is 1.

Mean

The mean of a discrete random variable is given by

μ = Σx P(x)

Each value of x is multiplied by its corresponding probability and the products are
added.

Variance

The variance of a discrete random variable is given by

σ² = Σ(x – μ)² P(x).

Standard Deviation

The standard deviation of a discrete random variable is given by

σ = √(σ²).

Expected Value

The expected value of a discrete random variable is equal to the mean of the
random variable.
E(x) = μ=Σx P(x)
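The guidelines and formulas above can be sketched in a few lines of Python. The frequency table below is hypothetical, chosen only to illustrate the steps; it does not come from the original text.

```python
import math

# Hypothetical frequency table: outcome x -> observed frequency.
frequencies = {0: 10, 1: 20, 2: 15, 3: 5}

# Divide each frequency by the sum of the frequencies to get P(x).
total = sum(frequencies.values())
distribution = {x: f / total for x, f in frequencies.items()}

# Check the two conditions of a discrete probability distribution.
each_in_range = all(0 <= prob <= 1 for prob in distribution.values())
sums_to_one = math.isclose(sum(distribution.values()), 1.0)

# Mean (expected value): mu = sum of x * P(x).
mean = sum(x * prob for x, prob in distribution.items())

# Variance: sigma^2 = sum of (x - mu)^2 * P(x).
variance = sum((x - mean) ** 2 * prob for x, prob in distribution.items())

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(distribution, mean, variance, std_dev)
```

For this table the probabilities are 0.2, 0.4, 0.3 and 0.1, giving a mean of 1.3, a variance of 0.81 and a standard deviation of 0.9 (up to floating-point rounding).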

Binomial Distribution

A binomial experiment is a probability experiment that satisfies the following
conditions. It is a very useful model for multi-step experiments where each step has 2
outcomes, hence the term binomial.

• The experiment is repeated for a fixed number of trials, n, where each trial is
independent of the other trials.
• There are only two possible outcomes of interest for each trial. The outcomes
can be classified as a success (S) or as a failure (F).
• The probability of a success, p = P(S), is the same for each trial.
• The random variable X counts the number of successful trials.

The random variable X, the number of successes in the n trials, is said to follow a
binomial distribution with parameters n and p.

• X can take on the values x = 0, 1, …, n
• Notation: X ~ Bin(n, p)

In a binomial distribution, the probability of exactly x successes in n trials is

$$P(x) = {}^{n}C_{x}\, p^{x} q^{n-x} = \frac{n!}{(n-x)!\,x!}\, p^{x} q^{n-x}$$

Notations

• n: the number of times a trial is repeated.
• p = P(S): the probability of success in a single trial.
• q = P(F): the probability of failure in a single trial (q = 1 – p).
• x: the value of the random variable, a count of the number of successes in n trials.

Mean

Mean (μ) = np

Variance

Variance (σ²) = npq

Standard Deviation

Standard Deviation (σ) = √(npq)

Example:

The number of trials (n) is 10. The probability of success (p) is 0.5.

Use the binomial distribution to calculate the probability of getting exactly six
successes.

Solution:

Use the following data for the calculation.

The number of repeated trials: n = 10

The number of successes: x = 6

The probability of success on an individual trial: p = 0.5

Use the formula for binomial probability.

$$P(X = 6) = {}^{10}C_{6} \times (0.5)^{6} \times (1 - 0.5)^{10-6} = 210 \times 0.015625 \times 0.0625 \approx 0.2051$$

The probability of getting exactly 6 successes is 0.2051

Example :

A manager of an insurance company goes through the data on insurance policies
sold by the salesmen working under him.

He finds that 80% of the people who purchase motor insurance are men.

He wants to find the probability that, if 8 motor insurance owners are randomly
selected, exactly 5 of them are men.

Solution:

We first have to identify n, p, and x.

The number of repeated trials: n = 8

The number of successes: x = 5

The probability of success on an individual trial: p = 0.8

$$P(X = 5) = {}^{8}C_{5} \times (0.8)^{5} \times (1 - 0.8)^{8-5} = 56 \times 0.32768 \times 0.008 = 0.14680064$$

The probability of exactly 5 motor insurance owners being men is 0.14680064.
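Both worked examples can be reproduced with a short function. This is a minimal sketch; the function name binomial_pmf is illustrative, and math.comb from the Python standard library (Python 3.8 or later) supplies the nCx term.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p  # probability of failure in a single trial
    return comb(n, x) * p**x * q ** (n - x)


# First example: n = 10, p = 0.5, exactly 6 successes.
first_example = binomial_pmf(6, 10, 0.5)

# Insurance example: n = 8, p = 0.8, exactly 5 men.
insurance_example = binomial_pmf(5, 8, 0.8)

print(round(first_example, 4), round(insurance_example, 8))
```

Rounded as in the text, the two results agree with 0.2051 and 0.14680064.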

Example :

Evans Electronics is concerned about a low retention rate for its employees.
In recent years, management has seen a turnover of 10% of the hourly employees
annually.

Thus, for any hourly employee chosen at random, management estimates a
probability of 0.1 that the person will not be with the company next year.

Choosing 3 hourly employees at random, what is the probability that exactly 1 of them
will leave the company this year?

Notice that:

• The experiment has three identical trials—that is, n = 3.

• There are two outcomes for each trial—the employee leaves (S) or the
employee stays (F).

• The probability that an employee will leave is 0.1, that is, p = 0.1.

• The decision of each employee to leave is independent of the decisions made
by the other employees.

Number of Experimental Outcomes Providing Exactly x Successes in n trials

$$\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$$

In our Evans Electronics example, n = 3 and x = 1. Thus

$$\binom{3}{1} = \frac{3!}{1!\,(3-1)!} = \frac{(3)(2)(1)}{(1)(2)(1)} = \frac{6}{2} = 3$$

Refer to the tree diagram to verify this is right

The probability of the first employee leaving and the second and third employees
staying, denoted (S, F, F), is given by

p(1 – p)(1 – p)

With a 0.10 probability of an employee leaving on any one trial, the probability of
an employee leaving on the first trial and not on the second and third trials is given by,

(0.10)(0.90)(0.90) = (0.10)(0.90)² = 0.081

Two other experimental outcomes also result in one success and two failures. The
probabilities for all three experimental outcomes involving one success follow:

Experimental Outcome    Probability of Experimental Outcome
(S, F, F)               p(1 – p)(1 – p) = (0.1)(0.9)(0.9) = 0.081
(F, S, F)               (1 – p)p(1 – p) = (0.9)(0.1)(0.9) = 0.081
(F, F, S)               (1 – p)(1 – p)p = (0.9)(0.9)(0.1) = 0.081
                        Total = 0.243

Because the trials are independent, we can multiply probabilities. With p = 0.1 and
q = 0.9, each of the three outcomes with exactly one success has probability
pq² = 0.081, so

P(1) = 3 × (0.1)¹ × (0.9)²

P(1) = 0.243

x    P(x)     Cumulative Probability
0    0.729    0.729
1    0.243    0.972
2    0.027    0.999
3    0.001    1.000

Mean = np = 3 × 0.1 = 0.3 employees out of 3

Variance = npq = 3 × 0.1 × 0.9 = 0.27

Standard Deviation = √(npq) = √(3 × 0.1 × 0.9) = √0.27 ≈ 0.52 employees
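The whole Evans Electronics distribution can be rebuilt from n = 3 and p = 0.1. The sketch below is illustrative only (the variable names are not from the original text); it recomputes P(x), the cumulative probabilities, and the mean, variance and standard deviation.

```python
from math import comb, sqrt

n, p = 3, 0.1  # three employees, each leaving with probability 0.1
q = 1 - p

# P(x) for x = 0, 1, 2, 3 leavers.
pmf = [comb(n, x) * p**x * q ** (n - x) for x in range(n + 1)]

# Running total of P(x) gives the cumulative probability column.
cumulative = []
running = 0.0
for prob in pmf:
    running += prob
    cumulative.append(running)

mean = n * p
variance = n * p * q
std_dev = sqrt(variance)

for x, (prob, cum) in enumerate(zip(pmf, cumulative)):
    print(f"{x}  {prob:.3f}  {cum:.3f}")
print(f"mean={mean:.1f} variance={variance:.2f} sd={std_dev:.2f}")
```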

Continuous Probability Distribution

• A continuous random variable can assume any value in an interval on the real
line or in a collection of intervals.

• It is not possible to talk about the probability of the random variable assuming
a particular value.

• Instead, we talk about the probability of the random variable assuming a value
within a given interval.

• The probability of the random variable assuming a value within some given
interval from x1 to x2 is defined to be the area under the graph of the probability
density function between x1 and x2.

Normal Distribution

In probability theory, a normal (or Gaussian or Gauss or Laplace–Gauss)

distribution is a type of continuous probability distribution for a real-valued random
variable.

The Normal Distribution has:

• Mean = Median = Mode
• Symmetry about the center
• 50% of values less than the mean and 50% greater than the mean

Empirical Rule

The standard deviation is a measure of how spread out numbers are.

For normally distributed data, we find that approximately:

• 68% of values are within 1 standard deviation of the mean.
• 95% of values are within 2 standard deviations of the mean.
• 99.7% of values are within 3 standard deviations of the mean.

The empirical rule is a quick way to get an overview of the data and check for any
outliers or extreme values that don’t follow this pattern.

If data from small samples do not closely follow this pattern, then other
distributions like the t-distribution may be more appropriate. Once you identify the
distribution of your variable, you can apply appropriate statistical tests.

Central Limit Theorem

The central limit theorem is the basis for how normal distributions work in
statistics.

In research, to get a good idea of a population mean, ideally you’d collect data
from multiple random samples within the population.

A sampling distribution of the mean is the distribution of the means of these
different samples.

The central limit theorem, together with the law of large numbers, shows the following:

• Law of Large Numbers: As you increase the sample size, the sample mean
approaches the population mean.
• With multiple large samples, the sampling distribution of the mean is approximately
normally distributed, even if your original variable is not normally distributed.

Parametric statistical tests typically assume that samples come from normally
distributed populations, but the central limit theorem means that this assumption isn’t
necessary to meet when you have a large enough sample.

You can use parametric tests for large samples from populations with any kind of
distribution as long as other important assumptions are met.

A sample size of 30 or more is generally considered large.
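A small simulation can make the central limit theorem concrete. The sketch below is an illustration under stated assumptions, not part of the original text: the population is taken to be exponential with mean 1 (a strongly skewed, non-normal distribution), each sample has 30 observations, and the seed is fixed only so that the run is repeatable.

```python
import random
import statistics
from math import sqrt

rng = random.Random(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30     # "large" sample by the rule of thumb above
NUM_SAMPLES = 2000   # number of random samples drawn

# Population assumption: exponential with mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Theory: the sample means centre on the population mean (1.0) with
# standard error sigma / sqrt(n) = 1 / sqrt(30).
theoretical_se = 1.0 / sqrt(SAMPLE_SIZE)

# Share of sample means within one standard error of the population mean;
# for a normal sampling distribution this is roughly 68%.
share_within_one_se = sum(
    abs(m - 1.0) <= theoretical_se for m in sample_means
) / NUM_SAMPLES

print(mean_of_means, sd_of_means, theoretical_se, share_within_one_se)
```

Even though individual exponential values are heavily skewed, the sample means cluster symmetrically around 1 with a spread close to 1/√30.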

For small samples, the assumption of normality is important because the sampling
distribution of the mean isn’t known.

For accurate results, you have to be sure that the population is normally distributed
before you can use parametric tests with small samples.

Probability Density Function

The general formula for the probability density function of the normal distribution
is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \qquad -\infty < x < \infty$$

where μ is the location parameter and σ is the scale parameter.

The case where μ = 0 and σ = 1 is called the standard normal distribution.

The equation for the standard normal distribution is

$$f(x) = \frac{e^{-x^{2}/2}}{\sqrt{2\pi}}$$

Since the general form of probability functions can be expressed in terms of the
standard distribution, all subsequent formulas in this section are given for the standard
form of the function.

The following is the plot of the standard normal probability density function.

Cumulative Distribution Function

The formula for the cumulative distribution function of the standard normal
distribution is

$$F(x) = \int_{-\infty}^{x} \frac{e^{-t^{2}/2}}{\sqrt{2\pi}}\, dt$$

Note that this integral does not exist in a simple closed formula. It is computed
numerically.
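As an illustration of how these functions are evaluated in practice, the sketch below codes the density directly from its formula and obtains the cumulative distribution function from the error function math.erf in the Python standard library, using the identity F(x) = ½[1 + erf(z/√2)] with z = (x – μ)/σ. The function names are illustrative.

```python
import math


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of the normal distribution with mean mu and sd sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """P(X <= x), computed numerically through the error function."""
    z = (x - mu) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Standard normal values (mu = 0, sigma = 1).
peak_density = normal_pdf(0.0)   # height of the curve at the mean
half_area = normal_cdf(0.0)      # area to the left of the mean
area_below_196 = normal_cdf(1.96)

print(peak_density, half_area, area_below_196)
```

The density at the mean is 1/√(2π) ≈ 0.3989, and exactly half of the area lies to the left of the mean.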

The following is the plot of the normal cumulative distribution function.

Example: Using the empirical rule in a normal distribution

You collect SAT scores from students in a new test preparation course. The data
follows a normal distribution with a mean score (M) of 1150 and a standard deviation (SD)
of 150.

Solution: Following the empirical rule:

• Around 68% of scores are between 1000 and 1300, 1 standard deviation above
and below the mean.
• Around 95% of scores are between 850 and 1450, 2 standard deviations above
and below the mean.
• Around 99.7% of scores are between 700 and 1600, 3 standard deviations
above and below the mean.
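The empirical-rule percentages for this example can be checked against the exact normal distribution. This is a minimal sketch using statistics.NormalDist from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

# SAT scores from the example: mean 1150, standard deviation 150.
sat = NormalDist(mu=1150, sigma=150)

shares = {}
for k in (1, 2, 3):
    low = sat.mean - k * sat.stdev
    high = sat.mean + k * sat.stdev
    # Area under the curve between low and high.
    shares[k] = sat.cdf(high) - sat.cdf(low)
    print(f"within {k} SD: {low:.0f} to {high:.0f} -> {shares[k]:.1%}")
```

The exact shares are about 68.3%, 95.4% and 99.7%, which the empirical rule rounds to 68%, 95% and 99.7%.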

Standard Normal Distribution

The standard normal distribution, also called the z-distribution, is a special
normal distribution where the mean is 0 and the standard deviation is 1.

Every normal distribution is a version of the standard normal distribution that’s
been stretched or squeezed and moved horizontally right or left.

While individual observations from normal distributions are referred to as x, they
are referred to as z in the z-distribution. Every normal distribution can be converted to the
standard normal distribution by turning the individual values into z-scores.

Z-scores tell you how many standard deviations away from the mean each value
lies.

You only need to know the mean and standard deviation of your distribution to
find the z-score of a value.

Z-score formula: z = (x – μ) / σ, where

• x = individual value
• μ = mean
• σ = standard deviation

We convert normal distributions into the standard normal distribution for several
reasons:

• To find the probability of observations in a distribution falling above or below
a given value.
• To find the probability that a sample mean significantly differs from a known
population mean.
• To compare scores on different distributions with different means and standard
deviations.

Example: Finding probability using the z-distribution

To find the probability of SAT scores in your sample exceeding 1380, find the z-
score.

The mean of our distribution is 1150, and the standard deviation is 150.

The z-score tells you how many standard deviations away 1380 is from the mean.

Formula: z = (x – μ) / σ

Calculation: z = (1380 – 1150) / 150 = 1.53

For a z-score of 1.53, the cumulative probability from the standard normal table is
0.937. This is the probability of SAT scores being 1380 or less (93.7%), and it’s the
area under the curve left of the shaded area.

To find the shaded area, you take away 0.937 from 1, which is the total area under
the curve.

Probability of x>1380 = 1 – 0.937 = 0.063

That means it is likely that only 6.3% of SAT scores in your sample exceed 1380.
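The table lookup in this example can be reproduced numerically. The sketch below uses statistics.NormalDist from the Python standard library; the z-score is rounded to two decimals only to mirror the table-based calculation in the text.

```python
from statistics import NormalDist

mean, sd = 1150, 150   # SAT distribution from the example
x = 1380               # score of interest

# z = (x - mu) / sigma, rounded to two decimals as in a z-table.
z = round((x - mean) / sd, 2)

# Area to the left of z under the standard normal curve.
p_below = NormalDist().cdf(z)

# Area to the right: probability of a score above 1380.
p_above = 1 - p_below

print(z, round(p_below, 3), round(p_above, 3))
```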

Lesson 10

Hypothesis Testing

Hypothesis

A hypothesis is an assumption about a population parameter. It is a statement
about the population that may or may not be true. Hypothesis testing aims to reach a
statistical conclusion about whether or not to reject the hypothesis.

So, a statistical hypothesis is an assertion or conjecture concerning one or more
populations. To prove that a hypothesis is true, or false, with absolute certainty, we would
need absolute knowledge. That is, we would have to examine the entire population.
Instead, hypothesis testing is concerned with how to use a random sample to judge
whether the evidence supports the hypothesis or not.

Hypothesis testing is formulated in terms of two hypotheses:

● H0 : the null hypothesis

● H1 : the alternative hypothesis

The null hypothesis (H0) often represents either a skeptical perspective or a claim
to be tested. The alternative hypothesis (H1) represents an alternative claim under
consideration and is often represented by a range of possible parameter values.

A skeptic will not reject the null hypothesis (H0) unless the evidence in favor
of the alternative hypothesis (H1) is strong enough.

Null Hypothesis (H0)

● Represents the status quo.

● The hypothesis that states there is no statistically significant relationship
between the two variables under study.

● Believed to be true unless there is overwhelming evidence to the contrary.

● It is the hypothesis the researcher is trying to disprove.

Example:

It is hypothesized that flowers watered with lemonade will grow faster than
flowers watered with plain water.

Null hypothesis: There is no statistically significant relationship between the type

of water used and the growth of the flowers.

Alternative Hypothesis (H1):

● Inverse of the null hypothesis.

● States that there is a statistically significant relationship between two variables.

● Accepted if the null hypothesis is rejected.

● Usually what the researcher thinks is true and is testing.

Example:

Suppose one plant is fed lemonade for one month and another is fed plain water.

Null hypothesis: There will be no difference in growth between the two plants.

Alternative hypothesis: The plant that is fed lemonade will grow more than the plant
that is fed plain water.

In hypothesis testing, we want to test whether H1 is “likely” true. So, there are two
possible outcomes:

● Reject H0 and accept H1 because of sufficient evidence in the sample in
favor of H1;

● Do not reject H0 because of insufficient evidence to support H1.

Note:

Failure to reject H0 does not mean the null hypothesis is true. There is no formal
outcome that says “accept H0”. It only means that we do not have sufficient evidence to
support H1.

Example:

In a jury trial the hypotheses are:

• H0 : defendant is innocent.
• H1 : defendant is guilty.

H0 (innocent) is rejected if H1 (guilty) is supported by evidence beyond
“reasonable doubt”. Failure to reject H0 (failure to prove guilt) does not imply
innocence, only that the evidence is insufficient to reject it.

Hypothesis Testing via Confidence Interval:

Earlier we calculated a 95% confidence interval for the average number of
exclusive relationships college students have been in to be (2.7, 3.7). Based on this
confidence interval, do these data support the hypothesis that college students on average
have been in more than 3 exclusive relationships?

Our hypotheses are:

H0: µ = 3 College students have been in 3 exclusive relationships, on average.

H1: µ > 3 College students have been in more than 3 exclusive relationships, on
average.

In this case, the interval spans from 2.7 to 3.7 and the null value µ = 3 is
included in the interval. Any value within this range could conceivably be the true
population mean; therefore we cannot reject the null hypothesis in favor of the
alternative.

This is a quick and dirty approach to hypothesis testing. However, it doesn’t tell us
the likelihood of a certain outcome under the null hypothesis. In other words, it does not
tell us the p-value.

Note:

We always do hypothesis testing for population parameters. We never state
hypotheses about sample statistics.

Hypothesis testing via p-Value:

The p-value is a way of quantifying the strength of the evidence against the null
hypothesis and in favor of the alternative. Formally, the p-value is a conditional
probability.

p-Value:

The p-value is the probability of observing data at least as favorable to the
alternative hypothesis as our current data set, if the null hypothesis is true. We typically
use a summary statistic of the data, here the sample mean, to help compute the p-value and
evaluate the hypotheses.

P (Observed or more extreme outcome | H0 true) =?

When n = 50, x̄ = 3.2 and s = 1.74,

$$SE = \frac{s}{\sqrt{n}} = \frac{1.74}{\sqrt{50}} = 0.246$$

We are trying to find the value of P(x̄ > 3.2 | H0: µ = 3), the probability of such a
sample mean when the null hypothesis is true.

Since we are assuming the null hypothesis to be true, we can use it to construct the
sampling distribution based on the Central Limit Theorem.

x̄ ~ N(µ = 3, SE = 0.246).

Here 3 comes from the null hypothesis, which we are assuming is true. The p-value is
the area under this sampling distribution to the right of the observed sample mean, 3.2.

The Z-score can be calculated by this formula:

Test statistic, Z = (x̄ – µ) / SE = (3.2 – 3) / 0.246 = 0.81

p-value = P(Z > 0.81) = 0.209

● We use the test statistic to calculate the p-value, the probability of observing
data at least as favorable to the alternative hypothesis as our current data, if the
null hypothesis was true.

● If the p-value is low (lower than the significance level α, which is usually 5%)
we say that it would be very unlikely to observe the data if the null hypothesis
were true, and hence reject H0.

● If the p-value is high (higher than α) we say that it is likely to observe the data
even if the null hypothesis were true, and hence do not reject H0.

Since the p-value for this case is 0.209, which is higher than 0.05, we do not reject
the null hypothesis.
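The one-sided test above can be reproduced step by step. This is a minimal sketch using only the Python standard library; the variable names are illustrative, and a z-test is used because the text works with the normal sampling distribution.

```python
from math import sqrt
from statistics import NormalDist

n = 50              # sample size
sample_mean = 3.2   # observed sample mean
s = 1.74            # sample standard deviation
null_mean = 3.0     # value of mu under H0
alpha = 0.05        # significance level

# Standard error of the mean: s / sqrt(n).
se = s / sqrt(n)

# Test statistic: how many standard errors the sample mean lies above mu.
z = (sample_mean - null_mean) / se

# One-sided p-value for H1: mu > 3, the area to the right of z.
p_value = 1 - NormalDist().cdf(z)

reject_h0 = p_value < alpha
print(round(se, 3), round(z, 2), round(p_value, 3), reject_h0)
```

Without rounding z to 0.81, the p-value comes out at about 0.208 rather than 0.209; the small difference is rounding only, and the decision not to reject H0 is unchanged.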

What does this mean in the context of this question? Our null hypothesis was that
college students have been in 3 exclusive relationships on average, versus the alternative
hypothesis that college students have been in more than 3 exclusive relationships on
average.

In this case, we fail to reject the null hypothesis, as we do not have enough evidence
to reject it.

The data are therefore consistent with a population average of 3 exclusive
relationships per college student.

Interpreting the p-value:

If, in fact, college students have been in 3 exclusive relationships on average, there
is a 21% chance that a random sample of 50 college students would yield a sample mean
of 3.2 or higher.

This is a pretty high probability, so we think that a sample mean of 3.2 or more
exclusive relationships is likely to happen simply by chance.

How did we make this decision?

● Since the p-value is high (higher than 5%) we fail to reject H0.

● These data do not provide convincing evidence that college students have been
in more than 3 relationships on average.

● The difference between the null value of 3 relationships and the observed
sample mean of 3.2 relationships can be attributed to chance or sampling variability.

Two-sided hypothesis testing with p-values:

Often instead of looking for a divergence from the null in a specific direction,
we might be interested in divergence in any direction.

We call such hypothesis tests two-sided (or two-tailed).

The definition of a p-value is the same regardless of doing a one- or two-sided test;
however, the calculation is slightly different since we need to consider “at least as
extreme as the observed outcome” in both directions away from the mean.

For the above example, if we want to do the two-sided hypothesis test then we
have to find P(x̄ > 3.2 or x̄ < 2.8 | H0: µ = 3).

How do we come up with 2.8?

As 3.2 - 3 = 0.2,

so 3 - 0.2 = 2.8
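The two-sided p-value for this example is not computed in the text, so the sketch below works it out as an illustration. It assumes the same sampling distribution as before, N(µ = 3, SE = 1.74/√50), and adds the areas in both tails.

```python
from math import sqrt
from statistics import NormalDist

null_mean = 3.0
sample_mean = 3.2
se = 1.74 / sqrt(50)   # standard error from the example

# Sampling distribution of the mean under H0.
sampling = NormalDist(mu=null_mean, sigma=se)

# Mirror the observed mean around the null value: 3 - 0.2 = 2.8.
distance = abs(sample_mean - null_mean)
lower_cutoff = null_mean - distance
upper_cutoff = null_mean + distance

# Add the two tail areas.
upper_tail = 1 - sampling.cdf(upper_cutoff)
lower_tail = sampling.cdf(lower_cutoff)
p_two_sided = upper_tail + lower_tail

print(round(lower_cutoff, 1), round(p_two_sided, 3))
```

Because the normal curve is symmetric, the two tails are equal and the two-sided p-value is twice the one-sided value, roughly 0.42 here.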

Type I and Type II Errors

Type I Error

A Type I error is rejecting the null hypothesis H0 even though it is true. It is also
known as a false positive.

The probability of making a Type I error is the significance level. Say our
significance level is 5%. Then 5% of the time, even if our null hypothesis is true, we are
going to get a statistic that makes us reject the null hypothesis, because such a statistic
is so unlikely under the null hypothesis that we decide to reject it. So, one way to think
about the probability of a Type I error is as our significance level.
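A simulation can illustrate why the Type I error rate equals the significance level. The sketch below is an illustration under stated assumptions that are not part of the original text: the null hypothesis µ = 3 is true, the population is normal with a known standard deviation of 1.74, each sample has 50 observations, and a one-sided z-test at the 5% level is applied to every sample. The seed is fixed only for repeatability.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

rng = random.Random(0)  # fixed seed so the run is repeatable

MU0 = 3.0       # true mean, equal to the null value (H0 is true)
SIGMA = 1.74    # population standard deviation, assumed known
N = 50          # observations per sample
ALPHA = 0.05    # significance level
TRIALS = 4000   # number of simulated studies

se = SIGMA / sqrt(N)
critical_z = NormalDist().inv_cdf(1 - ALPHA)  # about 1.645

rejections = 0
for _ in range(TRIALS):
    sample = [rng.gauss(MU0, SIGMA) for _ in range(N)]
    z = (fmean(sample) - MU0) / se
    if z > critical_z:  # reject H0 although it is true
        rejections += 1

type_i_rate = rejections / TRIALS
print(type_i_rate)
```

The observed rejection rate should land close to 0.05, since every rejection in this simulation is a Type I error.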

Type II Error

A Type II error is also known as a false negative and occurs when we fail to reject a
null hypothesis which is really false. Here we conclude there is not a significant effect
when there really is one.

The probability of making a Type II error is called beta (β), and this is related to
the power of the statistical test (power = 1 – β). You can decrease your risk of committing
a Type II error by ensuring your test has enough power.

Power of a Test

This is the probability that you are doing the right thing when the null hypothesis is
not true, i.e. rejecting the null hypothesis when it is false.

Hence, Power = P(rejecting H0 | H0 is false)

= 1 – P(not rejecting H0 | H0 is false), where not rejecting a false H0 is a Type II error

= P(not making a Type II error)

= 1 – β
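To compute power, a specific alternative value of the mean has to be assumed. The sketch below reuses the earlier one-sided test (H0: µ = 3, SE = 0.246, α = 0.05) and supposes, purely for illustration, that the true mean is 3.5; that value is hypothetical and does not come from the original text.

```python
from statistics import NormalDist

mu0 = 3.0         # mean under H0
se = 0.246        # standard error from the earlier example
alpha = 0.05      # significance level
true_mean = 3.5   # hypothetical true mean under H1 (an assumption)

# Smallest sample mean that leads to rejecting H0 in a one-sided test.
critical_mean = mu0 + NormalDist().inv_cdf(1 - alpha) * se

# beta: probability of NOT rejecting H0 when the true mean is 3.5,
# i.e. the sample mean falls below the critical value.
beta = NormalDist(mu=true_mean, sigma=se).cdf(critical_mean)

# Power: probability of correctly rejecting a false H0.
power = 1 - beta

print(round(critical_mean, 3), round(beta, 3), round(power, 3))
```

Under these assumptions the test has a power of roughly 0.65; a larger sample (smaller SE) or a true mean further from 3 would raise it.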

Example: Let H0: µ = µ1 and H1: µ ≠ µ1

Types of Hypothesis Tests

There are many types of statistical hypothesis tests.

Variable Distribution Type Tests (Normal Distribution)

• Shapiro-Wilk Test
• D’Agostino’s K² Test
• Anderson-Darling Test

Variable Relationship Tests (Correlation)

• Pearson’s Correlation Coefficient
• Spearman’s Rank Correlation
• Kendall’s Rank Correlation
• Chi-Squared Test

Compare Sample Means (Parametric)

• Student’s t-test
• Paired Student’s t-test
• Analysis of Variance Test (ANOVA)
• Repeated Measures ANOVA Test

Compare Sample Means (Non-parametric)

• Mann-Whitney U Test
• Wilcoxon Signed-Rank Test
• Kruskal-Wallis H Test
• Friedman Test

