Book Allied Paper I Statistics BSC Geography New
B.Sc. GEOGRAPHY
STATISTICS
I Year Allied Paper
(Full Package)
BHARATHIDASAN UNIVERSITY
TIRUCHIRAPPALLI - 620 024
UNIT - I
Lesson 1
Lesson 2
UNIT - II
Lesson 3
Lesson 4
UNIT - III
Lesson 5
Lesson 6
UNIT - IV
Lesson 7
Lesson 8
UNIT - V
Lesson 9
Lesson 10
UNIT - I
Lesson 1
Statistics
Introduction
The word “Statistics”, which is in common use, is derived from the Latin word
‘Status’, taken here to mean a group of numbers or figures that represent some
information of human interest.
Data are individual pieces of factual information recorded and used for the purpose
of analysis. They are the raw information from which statistics are created. Statistics are
the results of data analysis, that is, its interpretation and presentation; statistics of
this kind are often referred to as 'statistical data'.
Data
Data is the name given to basic facts and entities such as names and numbers. The
main examples of data are weights, prices, costs, numbers of items sold, employee names,
product names, addresses, tax codes, registration marks etc.
We do not generally associate data with mathematics. However, data is the base of all
operations in statistics. So let us learn more about data collection, primary data, secondary
data, and a few other important terms.
Types of Data
Data may be qualitative or quantitative. Once you know the difference between them,
you can know how to use them.
Data Collection
Depending on the source, data can be classified as primary data or secondary data. Let us
take a look at them both.
Collection of Data
Statistical investigation is a comprehensive process. It requires the systematic collection
of data about some group of people or objects, describing and organizing the data, analyzing
the data with the help of different statistical methods, summarizing the analysis and
using these results for making judgements, decisions and predictions. The validity and
accuracy of the final judgement is most crucial and depends heavily on how well the data were
collected in the first place. The quality of the data will greatly affect the conclusions, and
hence the utmost importance must be given to this process and every possible precaution should
be taken to ensure accuracy while collecting the data.
Nature of Data
It may be noted that different types of data can be collected for different purposes.
The data can be collected in connection with time or geographical location or in
connection with time and location. The following are the three types of data:
Spatial Data
If the data collected is connected with a place, then it is termed spatial
data.
Spatio-Temporal Data
If the data collected is connected with time as well as place, then it is known as
spatio-temporal data.
Categories of Data
Any statistical data can be classified under two categories depending upon the
sources utilized. These categories are,
Primary data
Secondary data
Primary Data
Primary data is the one, which is collected by the investigator himself for the
purpose of a specific inquiry or study. Such data is original in character and is generated
by survey conducted by individuals or research institution or any organisation.
These are the data that are collected for the first time by an investigator for a specific
purpose. Primary data are ‘pure’ in the sense that no statistical operations have been
performed on them and they are original. An example of primary data is the Census of India.
Example: If a researcher is interested in knowing the impact of the noon meal scheme on
school children, he has to undertake a survey and collect data on the opinions of parents
and children by asking relevant questions. Data collected for this purpose are called
primary data.
The persons from whom information is collected are known as informants. The
investigator personally meets them and asks questions to gather the necessary
information. This is a suitable method for intensive rather than extensive field surveys; it
is best suited to the intensive study of a limited field.
In some cases, the police interrogate third parties who are supposed to have
knowledge of a theft or a murder in order to get some clues. Enquiry committees appointed by
governments generally adopt this method to get people’s views and all possible details of
the facts relating to the enquiry.
This method is suitable whenever direct sources do not exist, cannot be relied
upon, or would be unwilling to part with the information. The validity of the results
depends upon a few factors, such as the nature of the person whose evidence is being
recorded, the ability of the interviewer to draw out information from the third parties
by means of appropriate questions and cross-examination, and the number of persons
interviewed. For this method to succeed, one person or one group alone should not be
relied upon.
The advantage of this method is that it is cheap and appropriate for extensive
investigations. But it may not ensure accurate results because the correspondents are likely
to be negligent, prejudiced and biased. This method is adopted in those cases where
information is to be collected periodically from a wide area over a long time.
Mailed Questionnaire Method
Under this method a list of questions is prepared and sent to all the informants
by post. The list of questions is technically called a questionnaire. A covering letter
accompanying the questionnaire explains the purpose of the investigation and the
importance of correct information, and requests the informants to fill in the blank spaces
provided and to return the form within a specified time. This method is appropriate in
those cases where the informants are literate and are spread over a wide area.
Under the schedule method, enumerators or interviewers take the schedules, meet the
informants and fill in their replies. A distinction is often made between a schedule and a
questionnaire. A schedule is filled in by the interviewer in a face-to-face situation with the
informant. A questionnaire is filled in by the informant, who receives and returns it by
post. It is suitable for extensive surveys.
Secondary Data
Secondary data are those data which have been already collected and analysed by
some earlier agency for its own use; and later the same data are used by a different agency.
According to W.A. Neiswanger, ‘A primary source is a publication in which the data are
published by the same authority which gathered and analysed them. A secondary source is
a publication, reporting the data which have been gathered by other authorities and for
which others are responsible’.
Sources of Secondary Data
These are data sourced from some agency that originally collected them.
This means that this kind of data has already been collected by some researchers or
investigators in the past and is available either in published or unpublished form. This
information is impure as statistical operations may have been performed on it already. An
example is information available on the Government of India's Department of Finance
website or in other repositories, books, journals, etc.
The sources of secondary data can broadly be classified under two heads:
Published sources
Unpublished sources
Published Sources
The various sources of published data are clinical and other personal records, death
certificates, published mortality statistics, census publications, etc. Examples include:
Note: A lot of secondary data is available on the internet. We can access it at any time for
further study.
Unpublished Sources
Not all statistical material is published. There are various sources of
unpublished data such as records maintained by various Government and private offices,
studies made by research institutions, scholars, etc. Such sources can also be used where
necessary.
Precautions in the Use of Secondary Data
The following are some of the points
that are to be considered in the use of secondary data
Classification of Data
The collected data, also known as raw data or ungrouped data, are always in an
unorganised form and need to be organised and presented in a meaningful and readily
comprehensible form in order to facilitate further statistical analysis. It is, therefore,
essential for an investigator to condense a mass of data into a more
comprehensible and assimilable form.
The process of grouping data into different classes or sub-classes according to some
characteristics is known as classification. Tabulation is concerned with the systematic
arrangement and presentation of classified data. Thus classification is the first step in
tabulation. For example, letters in the post office are classified according to their
destinations, viz., Delhi, Madurai, Bangalore, Mumbai, etc.
Objects of Classification
The following are main objectives of classifying the data:
Types of Classification
Statistical data are classified in respect of their characteristics. Broadly there are
four basic types of classification namely
Chronological classification
Geographical classification
Qualitative classification
Quantitative classification.
Chronological classification
Geographical classification
Similarly, they can also be classified as ‘married’ or ‘single’ on the basis of
another attribute, ‘marital status’. Thus when the classification is done with respect to one
attribute which is dichotomous in nature, two classes are formed, one possessing the
attribute and the other not possessing it.
Quantitative classification
Tabulation of Data
Sampling
It is also a time-convenient and a cost-effective method and hence forms the basis
of any research design. Sampling techniques can be used in research survey software for
optimum derivation.
For example, if a drug manufacturer would like to research the adverse side effects
of a drug on the country’s population, it is almost impossible to conduct a research study
that involves everyone. In this case, the researcher decides a sample of people from
each demographic and then researches them, giving him/her indicative feedback on the
drug’s behaviour.
Types of Sampling
Comparison of the random sampling method and the non-random sampling method

Population selection
    Random sampling method: The population is selected randomly.
    Non-random sampling method: The population is selected arbitrarily.
Nature
    Random sampling method: The research is conclusive.
    Non-random sampling method: The research is exploratory.
Sample
    Random sampling method: Since there is a method for deciding the sample, the
    population demographics are conclusively represented.
    Non-random sampling method: Since the sampling method is arbitrary, the
    representation of the population demographics is almost always skewed.
Time Taken
    Random sampling method: Takes longer to conduct since the research design defines
    the selection parameters before the market research study begins.
    Non-random sampling method: This type of sampling is quick since neither the
    sample nor the selection criteria of the sample are defined.
Results
    Random sampling method: This type of sampling is entirely unbiased and hence the
    results are unbiased too and conclusive.
    Non-random sampling method: This type of sampling is entirely biased and hence the
    results are biased too, rendering the research speculative.
Types of Probability Sampling
Stratified Random Sampling: Stratified random sampling is a method in
which the researcher divides the population into smaller groups that don’t
overlap but together represent the entire population.
While sampling, these groups can be organized and a sample then drawn from
each group separately.
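The idea of drawing a separate random sample from each non-overlapping group can be sketched as follows. This is only an illustration: the population, the age-group labels, the function name `stratified_sample` and the sample size per group are all hypothetical, and the fixed seed is there only to make the run repeatable.

```python
import random

def stratified_sample(population, stratum_of, per_stratum, seed=0):
    """Draw a simple random sample of `per_stratum` members from each stratum."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    # Group the population into non-overlapping strata.
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    # Sample each stratum separately, without replacement.
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

# Hypothetical population: 12 people, each tagged with an age group.
people = [(f"P{i}", "under 30" if i % 3 else "30 and over") for i in range(1, 13)]
sample = stratified_sample(people, stratum_of=lambda p: p[1], per_stratum=2)
for group, members in sample.items():
    print(group, [name for name, _ in members])
```

Every group contributes the same number of members here; in practice the number drawn from each group is often made proportional to the size of the group.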
Reduce Sample Bias: Using the probability sampling method, the bias in the
sample derived from a population is negligible to non-existent.
The selection of the sample mainly depicts the understanding and the inference
of the researcher.
Probability sampling leads to higher quality data collection as the sample
appropriately represents the population.
Create an Accurate Sample: Probability sampling helps the researchers plan
and create an accurate sample. This helps to obtain well-defined data.
Quota Sampling: In quota sampling, the selection of members
happens based on a pre-set standard. In this case, as the
sample is formed based on specific attributes, the created sample will have the
same qualities found in the total population. It is a rapid method of collecting
samples.
Budget and Time Constraints: The non-probability method is used when there are
budget and time constraints, and some preliminary data must be collected.
Since the survey design is not rigid, it is easier to pick respondents arbitrarily
and have them take the survey or questionnaire.
Sampling Design
Sampling design is a mathematical function that gives you the probability of any
given sample being drawn. It involves not only learning how to derive the probability
functions which describe a given sampling method but also understanding how to design a
best-fit sampling method for a real life situation.
Sampling Designing Process
The sampling design process includes five steps which are closely related and are
important to all aspects of the marketing research project. The five steps are: defining the
target population; determining the sample frame; selecting a sampling technique;
determining the sample size; and executing the sampling process.
Lesson 2
Representation of Geographic Data
Discrete Data
These are data that can take only certain specific values rather than a range of values.
For example, data on the blood group of a certain population or on their genders is termed as
discrete data. A usual way to represent this is by using bar charts.
Continuous Data
These are data that can take any value within a certain range between the lowest and
highest values. The difference between the highest and lowest value is called the range of
the data. For example, the age of a person can take decimal values, and so can the
height and weight of the students of your school.
These are classified as continuous data. Continuous data can be tabulated in what is
called a frequency distribution. They can be graphically represented using histograms.
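As a small illustration of the range of continuous data, the sketch below subtracts the lowest value from the highest. The ages used are hypothetical and are not taken from the text.

```python
# Hypothetical ages (in years) of five students; continuous data may be fractional.
ages = [12.5, 14.0, 13.25, 15.75, 11.5]

# Range = highest value - lowest value.
data_range = max(ages) - min(ages)
print(data_range)
```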
Frequency Polygon
Presentation of Data
The key objective of statistics is to collect and organize data. One of the basics of
data organization comes from presentation of data in a recognizable form so that it can be
interpreted easily. You can organize data in the form of tables or you can present it
pictorially.
Pictorial representation of data takes the form of bar charts, pie charts, histograms or
frequency polygons. The benefit of this is that data in the visual form is easy to understand in
one glance.
Graphical representation refers to the use of charts and graphs to visually display,
analyze, clarify, and interpret numerical data, functions, and other qualitative structures.
All these graphs are used in various places to represent a specific set of data
concisely. Each of these graphs (or charts) is explained below, which will not only
help in understanding them better but will also help in choosing the
right kind of graph for a particular data set.
Statistical Graphs
The statistical data can be represented by various methods such as tables, bar
graphs, pie charts, histograms, frequency polygons, etc.
The four basic graphs used in statistics include bar, line, histogram and pie charts.
These are explained here in brief.
Bar Graph
Bar graphs are the pictorial representation of data (generally grouped) in the form
of vertical or horizontal rectangular bars, where the length of the bars is proportional to the
measure of data. They are also known as bar charts. In a vertical bar chart, the horizontal
axis represents categorical data, whereas the vertical axis shows the discrete values.
Bar graphs are one of the means of data handling in statistics.
Types of Bar Charts
Bar graphs can be vertical or horizontal. The primary feature of any bar graph
is the length or height of its bars: the longer the bar, the greater the value
it represents.
Bar graphs normally show categorical and numeric variables arranged in class
intervals. They consist of an axis and a series of labelled horizontal or vertical bars.
Even though the graph can be plotted either horizontally or vertically, the most
usual type of bar graph is the vertical bar graph.
The roles of the x-axis and y-axis are interchanged depending on whether the
bar chart is vertical or horizontal. Apart from the vertical and horizontal bar graphs, the two
other types of bar charts are:
Grouped bar graph
Stacked bar graph
Vertical Bar Graphs
When the grouped data are represented vertically in a graph or chart with the help
of bars, where the bars denote the measure of data, such graphs are called vertical bar
graphs. The data is represented along the y-axis of the graph, and the height of the bars
shows the values.
Horizontal Bar Graphs
When the grouped data are represented horizontally in a chart with the help of
bars, then such graphs are called horizontal bar graphs, where the bars show the measure
of data. The data is depicted here along the x-axis of the graph, and the length of the bars
denotes the values.
Grouped Bar Graph
The grouped bar graph is also called the clustered bar graph, which is used to
represent the discrete value for more than one object that shares the same category. In this
type of bar chart, the bars for the different series are placed side by side within each
category.
In other words, a grouped bar graph is a type of bar graph in which different sets of
data items are compared. Here, a single colour is used to represent each specific series
across the set. The grouped bar graph can be represented using both vertical and horizontal
bar charts.
Stacked Bar Graph
The stacked bar graph is also called the composite bar chart, which divides the
aggregate into different parts. In this type of bar graph, each part can be represented using
different colours, which helps to easily identify the different categories.
The stacked bar chart requires specific labelling to show the different parts of the
bar. In a stacked bar graph, each bar represents the whole and each segment represents the
different parts of the whole.
Bar graphs are used to compare things between different groups or to trace changes
over time. However, when trying to show change over time, bar graphs are most suitable
when the changes are large.
Bar charts have a discrete domain of categories and are normally scaled so that
all the data can fit on the graph. When there is no natural order to the categories being
compared, bars on the chart may be arranged in any order. Bar charts arranged from the
highest to the lowest number are called Pareto charts.
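Arranging bars from the highest to the lowest value can be sketched in a few lines. The categories and counts below are hypothetical, and the bars are drawn as rows of `#` characters only so that the example needs no plotting library.

```python
# Hypothetical counts for four categories that have no natural order.
counts = {"Bus": 12, "Cycle": 30, "Walk": 21, "Car": 7}

# Sort the categories from the highest count to the lowest.
pareto_order = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Draw each bar as a row of '#' whose length is proportional to the count.
for category, value in pareto_order:
    print(f"{category:<6}{'#' * value} {value}")
```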
The height of the bar should correspond to the data value.
Advantages
Bar graph summarises the large set of data in simple visual form.
Disadvantages
Sometimes, the bar graph fails to reveal the patterns, cause, effects, etc.
It can be easily manipulated to yield fake information.
Example : The number of trees planted by Eco-club of a school in different years is given
below. Draw the bar graph to represent the data.
Solution:
[Figure: bar graph with Year (2005 to 2010) on the horizontal axis and Number of Trees to be Planted (0 to 450) on the vertical axis.]
Pie Chart
The “pie chart”, also known as a “circle chart”, divides a circular statistical
graphic into sectors or sections to illustrate numerical proportions.
Each sector denotes a proportionate part of the whole. A pie chart works best for
showing the composition of something. In many such cases, a pie chart can replace other
graphs like the bar graph, line plot, histogram, etc.
A pie chart is a pictorial representation of data. The slices of the pie show the
relative sizes of the data: the different parts of the same data set are represented by
slices of different sizes.
Pie charts are used to represent the proportional data or relative data in a single
chart. The concept of pie slices is used to show the percentage of a particular data from the
whole pie.
Measure the angle of each slice of the pie chart and divide by 360 degrees. Now
multiply the value by 100. The percentage of particular data will be calculated.
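The calculation just described (angle divided by 360 degrees, multiplied by 100) can be written as a one-line function. The function name `slice_percentage` and the 90-degree slice are illustrative assumptions.

```python
def slice_percentage(angle_degrees):
    """Share of the whole pie represented by a slice with the given central angle."""
    return angle_degrees / 360 * 100

# A slice with a 90-degree angle covers a quarter of the circle.
print(slice_percentage(90))
```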
There are many real-life examples of pie charts, such as the representation of marks
obtained by students in a class, the representation of the kinds of cars sold in a month,
and the types of food liked by people in a room.
Formula
The angles of all the sectors add up to 360°, and the total value of the pie is always 100%.
To work out the percentage for a pie chart, follow the steps given below:
Uses of Pie Chart
Advantages
To emphasize a few points you want to make, you can manipulate pieces of
data in the pie chart.
Disadvantages
It becomes less effective if there are too many pieces of data to use.
If there are too many pieces of data, even adding data labels and numbers
may not help, as they themselves may become crowded and hard to read.
As this chart only represents one data set, you need a series of charts to compare
multiple sets.
This may make it more difficult for readers to analyze and
assimilate information quickly.
Example : The following data shows the agricultural production in India during a certain
year. Draw a pie chart to represent the data.
Rice 57
Wheat 76
Coarse Cereals 38
Pulses 19
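One way to prepare this data for a pie chart is to convert each value into a sector angle (value divided by the total, multiplied by 360°) and into a percentage of the whole. The sketch below does this for the four figures given; the variable names are illustrative, and the units of production are left as in the table.

```python
# Production figures from the example above (units as given in the table).
production = {"Rice": 57, "Wheat": 76, "Coarse Cereals": 38, "Pulses": 19}
total = sum(production.values())

# Sector angle = value / total * 360; percentage = value / total * 100.
angles = {crop: value * 360 / total for crop, value in production.items()}
percentages = {crop: value * 100 / total for crop, value in production.items()}

for crop in production:
    print(f"{crop:<15}{angles[crop]:6.1f} degrees {percentages[crop]:5.1f}%")
```

The angles should add up to 360° and the percentages to 100%, which is a quick check before drawing the sectors with a protractor.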
Line Graph
For example, the prices of different flavours of chocolates vary, and this variation can
be represented with the help of a line graph. The variation is usually plotted in a
two-dimensional XY plane. If the relation between any two measures can be expressed using a
straight line in a graph, then such graphs are called linear graphs. Thus, the line graph is
also called a linear graph.
A line graph or line chart or line plot is a graph that utilizes points and lines to
represent change over time. It is a chart that shows a line joining several points or a line
that shows the relation between the points.
The graph represents quantitative data between two changing variables with a line
or curve that joins a series of successive data points. Linear graphs compare these two
variables in a vertical axis and a horizontal axis.
A line graph is a graph that is used to display change over time as a series of
data points connected by straight line segments on two axes.
A line graph is also called a line chart. It helps to determine the relationship
between two sets of values, with one data set always being dependent on the
other data set.
The slope of the line is the most important observation in this case. The slope
represents how steep a line is. It helps in comparing the magnitude of change
between any two consecutive points on the graph. For example: The steeper the
slope, the greater is the change in magnitude between two consecutive points.
Line graph consists of a horizontal x-axis and a vertical y-axis. Most line graphs
only deal with positive number values, so these axes typically intersect near the bottom of
the y-axis and the left end of the x-axis. The point at which the axes intersect is
always (0,0). Each axis is labelled with a data type. For example, the x-axis could be days,
weeks, quarters, or years, while the y-axis shows revenue in dollars.
Scale: The scale is the numbers that explain the units utilized on the linear graph.
Labels: Both the side and the bottom of the linear graph have a label that indicates what
kind of data is represented in the graph. X-axis describes the data points on the line and
the Y-axis shows the numeric value for each point on the line.
Data values: they are the actual numbers for each data point.
Vertical Line Graph
Vertical line graphs are graphs in which a vertical line extends from each data
point down to the horizontal axis. A vertical line graph is sometimes also called a column
graph. A line parallel to the y-axis is called a vertical line.
Horizontal Line Graph
Horizontal line graphs are graphs in which a horizontal line extends from each data
point parallel to the horizontal axis. A horizontal line graph is sometimes also called a row
graph. A line parallel to the x-axis is called a horizontal line.
Straight Line Graph
A line graph is a graph formed by segments of straight lines that join the plotted
points that represent given data. The line graph is used to show changing conditions,
often over a certain time interval. A general linear function has the form y = mx + c,
where m and c are constants.
The fundamental rule behind sketching a linear graph is that we require only
two points to graph a straight line. The following procedure is followed in drawing linear
graphs:
Draw the horizontal and vertical axes and select a suitable scale for both
the axes.
If the given table values are large, choose the scale accordingly. It
depends on the given values.
Plot the two points in the Cartesian plane. Join the two points
using a line segment and extend it in both directions. The line obtained
is the required linear graph.
Example: Draw a graph for the line y = 2x - 3.
X -2 0 2 4
Y
To work out the missing values, we use the equation like a formula, substituting
the values from the table in, we get the following:
X -2 0 2 4
Y -7 -3 1 5
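The substitution that fills in the table can be checked with a short sketch. The function name `y` and the list names are illustrative; the x values are the ones given in the table.

```python
def y(x):
    """Value of y on the line y = 2x - 3."""
    return 2 * x - 3

xs = [-2, 0, 2, 4]
points = [(x, y(x)) for x in xs]  # (x, y) pairs to plot
print(points)
```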
Step 3: So, we know that the line passes through (-2, -7), (0, -3), (2, 1) and (4, 5).
Now all that remains is to plot them on a pair of axes and draw a straight line
through them. The result should look like the graph below.
Double Line Graph
A double line graph is a line graph with two lines shown within one chart. It compares two
different subjects and shows how they change over a period of time.
Double line graphs are used to compare trends and patterns between two subjects.
Draw and label the scale on the vertical and horizontal axes.
List each item and locate the points on the graph for both the lines.
Connect the points of each line separately with line segments.
Uses of Line Graph
The important use of a line graph is to track changes over short and long
periods of time. It is also used to compare the changes over the same period of time for
different groups. A line graph is generally preferable to a bar graph when the
changes are small.
For example, a company's finance team may want to plot the changes in the cash
amount that the company has on hand over time. In that case, they use a line graph,
plotting the points against the horizontal and vertical axes. The horizontal axis usually
represents the time period of the data.
Example : The population of a small village is recorded every 10 years. Draw a line graph
to show this data.
1970 0.97
1980 1.69
1990 3.51
2000 4.1
2010 6.71
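Before plotting, it can help to look at the change between consecutive points, since a larger change gives a steeper segment on the line graph. The sketch below computes these changes from the table; the variable names are illustrative and the population units are left as given.

```python
# Population figures from the example, keyed by year (units as given in the table).
population = {1970: 0.97, 1980: 1.69, 1990: 3.51, 2000: 4.1, 2010: 6.71}
years = sorted(population)

# Change between each pair of consecutive years, rounded to two decimals.
changes = {(start, end): round(population[end] - population[start], 2)
           for start, end in zip(years, years[1:])}

for (start, end), change in changes.items():
    print(f"{start}-{end}: change = {change}")

# The interval with the largest change is the steepest segment of the graph.
steepest = max(changes, key=changes.get)
print("Steepest segment:", steepest)
```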
The line graph should have the year on the x-axis and the population on the y-axis.
It should also have the axes clearly labelled and an appropriate title at the top. With all
points plotted correctly and joined with straight lines, the line graph should look like:
Histogram
In such representations, all the rectangles are adjacent since the base covers the
intervals between class boundaries.
In other words, a histogram is a diagram consisting of rectangles whose area is
proportional to the frequency of a variable and whose width is equal to the class interval.
The vertical axis (frequency) represents the amount of data that is present in
each range.
The number ranges depend upon the data that is being used.
In the same histogram, the number count or multiple occurrences in the data for
each column is represented by the y-axis.
It is the easiest way to visualize data distributions. Let us
understand the histogram by plotting one for the example given below.
Uncle Bruno owns a garden with 30 black cherry trees. Each tree is of a different
height. The height of the trees (in inches): 61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73,
73.5, 74, 74.5, 76, 76.2, 76.5, 77, 77.5, 78, 78.5, 79, 79.2, 80, 81, 82, 83, 84, 85 and 87.
We can group the data as follows in a frequency distribution table by setting a range:
Height Range (in)    Number of Trees (Frequency)
61 - 65              3
66 - 70              3
71 - 75              8
76 - 80              10
81 - 85              5
86 - 90              1
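The frequencies can be checked by tallying the 30 heights listed above into classes. This sketch assumes the classes are 61 - 65, 66 - 70, 71 - 75, 76 - 80, 81 - 85 and 86 - 90 with both limits included, which fits this data set because no height falls between two classes.

```python
# Heights of the 30 trees (in inches), as listed in the example.
heights = [61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5, 76,
           76.2, 76.5, 77, 77.5, 78, 78.5, 79, 79.2, 80, 81, 82, 83, 84, 85, 87]

# Assumed class limits (lower, upper), both inclusive.
classes = [(61, 65), (66, 70), (71, 75), (76, 80), (81, 85), (86, 90)]

# Count how many heights fall inside each class.
frequency = {c: sum(1 for h in heights if c[0] <= h <= c[1]) for c in classes}

for (low, high), count in frequency.items():
    print(f"{low} - {high}: {count}")
```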
This data can be now shown using a histogram. We need to make sure that while
plotting a histogram, there shouldn’t be any gaps between the bars.
How to Make a Histogram?
The process of making a histogram using the given data is described below:
Step 3: Then draw the bars corresponding to each of the given weights using
their frequencies.
Example:
Construct a histogram for the following frequency distribution table that describes
the frequencies of weights of 26 students in a class.
65 - 70 4
70 - 75 10
75 - 80 8
80 - 85 4
Steps to draw a Histogram
Step 1: On the horizontal axis, we can choose the scale to be 1 unit = 11 lb.
Since the weights in the table start from 65, not from 0, we give a break/kink
on the X-axis.
Step 2: On the vertical axis, the frequencies are varying from 4 to 10. Thus, we
choose the scale to be 1 unit = 2.
Step 3: Then draw the bars corresponding to each of the given weights using
their frequencies.
Frequency Histogram
1 4
2 5
3 8
4 2
5 1
Histogram Shapes
The histogram can be classified into different types based on the frequency
distribution of the data. There are different types of distributions, such as normal
distribution, skewed distribution, bimodal distribution, multimodal distribution, comb
distribution, edge peak distribution, dog food distribution, heart cut distribution, and so on.
The histogram can be used to represent these different types of distributions. The
different types of a histogram are uniform histogram, symmetric histogram, bimodal
histogram, probability histogram.
Symmetric Histogram
When you draw a vertical line down the centre of the histogram and the two
sides are identical in size and shape, the histogram is said to be symmetric. The diagram is
perfectly symmetric if the right half of the image is a mirror image of the left half.
Histograms that are not symmetric are known as skewed.
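For a histogram with equal class widths, the mirror-image test can be sketched by comparing the list of bar heights with its reverse. The function name `is_symmetric` and the first list of frequencies are hypothetical; the second list uses the frequencies 4, 5, 8, 2, 1 from the frequency histogram table given earlier.

```python
def is_symmetric(frequencies):
    """True when the bar heights read the same from left to right and right to left."""
    heights = list(frequencies)
    return heights == heights[::-1]

print(is_symmetric([2, 5, 9, 5, 2]))  # hypothetical symmetric frequencies
print(is_symmetric([4, 5, 8, 2, 1]))  # frequencies from the earlier table
```

Real data are rarely exactly symmetric, so in practice the two halves are compared only approximately.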
Probability Histogram
Bimodal Histogram
Uniform Histogram
Bell-Shaped Histogram
A bell-shaped histogram has a single peak. For example, the following
histogram shows the number of children visiting a park at different time intervals. The
maximum number of children visit the park between 5.30 p.m. and 6 p.m. The histogram
has just one peak, at this time interval, and hence it is a bell-shaped histogram.
Bimodal Histogram
If a histogram has two peaks, it is said to be bimodal. Bimodality occurs when the
data set contains observations on two different kinds of individuals or combined groups,
provided the centres of the two separate histograms are far enough apart relative to the
variability in both data sets.
A bimodal histogram has two peaks and it looks like the graph given below. For
example, the following histogram shows the marks obtained by the 48 students of Class 8
of St. Mary’s School.
Skewed Right Histogram
Skewed Left Histogram
A skewed left histogram is a histogram whose longer tail extends to the left, with most
of the data concentrated on the right.
For example, the following histogram shows the number of students of Class 10 of
Greenwood High School according to the amount of time they spent on their studies on a
daily basis. The maximum number of students study for 4.5-5 hours on a daily basis.
Uniform Histogram
A uniform distribution may indicate that the number of classes is too small, so that each
class has about the same number of elements. It may involve a distribution that has several
peaks.
A uniform histogram is a histogram in which all the bars are more or less of the same
height.
For example, Ma’am Lucy, the Principal of Little Lilly Playschool, wanted to
record the heights of her students. The following histogram shows the number of students
and their varying heights. The heights of the students range from 30 inches to 50
inches.
Difference Between a Bar Chart and a Histogram
The fundamental difference between histograms and bar graphs from a visual
aspect is that bars in a bar graph are not adjacent to each other.
The main differences between a bar chart and a histogram are as follows:
But in both graphs, Y-axis represents numbers only. We can understand these
differences from the following figure:
Difference between Bar Chart and Histogram
Example : Consider the following histogram that represents the weights of 34 newborn
babies in a hospital. If the children weighing between 6.5 lb and 8.5 lb are considered
healthy, then find the percentage of the children of this hospital that are healthy.
Solution: We first have to find the number of children weighing between 6.5 lb and 8.5 lb.
From the given histogram, the number of children weighing between:
6.5 lb - 7.5 lb = 10
7.5 lb - 8.5 lb = 18
So the number of healthy children is 10 + 18 = 28, and the percentage of healthy
children is (28 ÷ 34) × 100 ≈ 82.35%.
Uses of Histogram
Used to check whether the process changes from one period to another.
Used to determine whether the output is different when it involves two or more
processes.
Used to analyse whether the given process meets the customer requirements
Graphical Representation
Line Graphs – Line graph or the linear graph is used to display the continuous
data and it is useful for predicting future events over time.
Bar Graphs – Bar Graph is used to display the category of data and it
compares the data using solid bars to represent the quantities.
Histograms – The graph that uses bars to represent the frequency of numerical
data that are organised into intervals. Since all the intervals are equal and
continuous, all the bars have the same width.
Frequency Table – The table shows the number of pieces of data that falls
within the given interval.
Circle Graph – Also known as the pie chart that shows the relationships of the
parts of the whole. The circle is considered with 100% and the categories
occupied is represented with that specific percentage like 15%, 56%, etc.
Stem and Leaf Plot – In the stem and leaf plot, the data are organised from
least value to the greatest value. The digits of the least place values from the
leaves and the next place value digit forms the stems.
Box and Whisker Plot – The plot diagram summarises the data by dividing
into four parts. Box and whisker show the range (spread) and the middle (
median) of the data.
Suitable Title: Make sure that an appropriate title is given to the graph,
indicating the subject of the presentation.
Proper Scale: To represent the data in an accurate manner, choose a proper
scale.
Index: Index the appropriate colours, shades, lines, design in the graphs for
better understanding.
Data Sources: Include the source of information wherever it is necessary at the bottom of
the graph.
There are certain rules to effectively present the information in the graphical
representation. They are suitable title, size, fonts, colours etc in such a way that the graph
should be a visual aid for the presentation of information.
Different Types of Graphical Representation
Graphical Representation in Maths
It helps to study the relationship between two variables where it helps to measure
the change in the variable amount with respect to another variable within a given interval
of time. It helps to study the series distribution and frequency distribution for a given
problem. There are two types of graphs to visually depict the information. They are:
The point at which the two axes intersect is called the origin ‘O’. On the x-axis,
distances to the right of the origin take positive values and distances to the left of
the origin take negative values. Similarly, on the y-axis, points above the origin take
positive values and points below the origin take negative values.
Principles of Graphical Representation
Histogram
Pie diagram
Frequency Polygon
Merits of Using Graphs
It saves time
It allows us to relate and compare the data for different time periods
It is used in statistics to determine the mean, median and mode for different
data, as well as in the interpolation and the extrapolation of data.
Here are the steps to follow to represent a frequency distribution graphically as a
frequency polygon.
Obtain the frequency distribution and find the midpoints of each class interval.
Represent the midpoints along the x-axis and the frequencies along the y-axis.
Plot the frequency of each class against its midpoint and join the points in order
with line segments.
To complete the polygon, join the point at each end to the immediately lower
or higher class mark on the x-axis.
Frequency Polygon
Mark the class intervals for each class on the horizontal axis. We will plot the
frequency on the vertical axis.
Calculate the class mark for each class interval. The formula for class mark is:
Class mark = (upper limit + lower limit) / 2
Mark all the class marks on the horizontal axis. It is also known as the mid-
value of every class.
Corresponding to each class mark, plot the frequency as given to you. The
height always depicts the frequency. Make sure that the frequency is plotted
against the class mark and not the upper or lower limit of any class.
Join all the plotted points using a line segment. The curve obtained will be
kinked.
Note that the above method is used to draw a frequency polygon without drawing
a histogram. You can also draw a histogram first by drawing rectangular bars against the
given class intervals. After this, you must join the midpoints of the bars to obtain the
frequency polygon. Remember that the bars will have no spaces between them in a
histogram.
Answer: We first need to calculate the cumulative frequency from the frequency given.
We now start by plotting the class marks such as 54.5, 64.5, 74.5 and so on till 94.5.
Note that we will also plot the previous and next class marks to start and end the polygon, i.e.
we plot 44.5 and 104.5 as well.
Then, the frequencies corresponding to the class marks are plotted against each class
mark. As you can see below, the frequencies for class marks 44.5 and 104.5 are zero, so
those points touch the x-axis. These points are used only to give a closed shape
to the polygon. The polygon looks like this:
Construction of frequency polygon
Creation of a histogram.
Finding the midpoints for each bar that exists on the histogram.
A frequency histogram resembles a column graph with no spaces between the columns.
A frequency polygon is a special line graph used in statistics. The two graphs can be
drawn separately or combined, using the information available in a frequency
distribution table. Frequency polygons give an understanding of the shape of the data
and its trends.
The major difference between a frequency polygon and a frequency curve is that a
frequency polygon is drawn by joining the points with straight lines, while a frequency
curve is drawn as a smooth freehand curve through them.
The following is the age distribution of 1000 persons working in a large industrial
house:
Age group Number of persons
20-25 30
25-30 160
30-35 210
35-40 180
40-45 145
45-50 105
50-55 70
55-60 60
60-65 40
To create the Pie diagram, line diagram, bar diagram and histograms.
Bar Diagram
[Figure: bar diagram of the number of persons in each age group, 20-25 to 60-65]
Pie Diagram
[Figure: pie diagram of the percentage of persons in each age group, 20-25 to 60-65]
Line Diagram
[Figure: line diagram of the number of persons in each age group, 20-25 to 60-65]
Histogram
The following distribution table represents the number of miles run by 20
randomly selected runners during a recent road race:
Bin (Size) Frequency
5.5-10.5 1
10.5-15.5 3
15.5-20.5 2
20.5-25.5 4
25.5-30.5 5
30.5-35.5 3
35.5-40.5 2
Using this table, construct a frequency polygon.
Step 1: Calculate the midpoint of each bin by adding the 2 numbers of the interval
and dividing the sum by 2.
Step 2: Plot the midpoints on a grid, making sure to number the x-axis with a scale
that will include the bin sizes. Join the plotted midpoints with lines.
A frequency polygon usually extends 1 unit below the smallest bin value and 1
unit beyond the greatest bin value. This extension gives the frequency polygon an
appearance of having a starting point and an ending point, which provides a view of the
distribution of data. If the data set were very large so that the number of bins had to be
increased and the bin size decreased, the frequency polygon would appear as a smooth
curve.
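The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the variable names are made up for this example, and the "1 unit" extension is read here as one bin width on each side, which is an assumption about the text rather than something it states.

```python
# Bins and frequencies from the runners table above.
bins = [(5.5, 10.5), (10.5, 15.5), (15.5, 20.5), (20.5, 25.5),
        (25.5, 30.5), (30.5, 35.5), (35.5, 40.5)]
frequencies = [1, 3, 2, 4, 5, 3, 2]

# Step 1: midpoint of each bin = (lower + upper) / 2.
midpoints = [(low + high) / 2 for low, high in bins]

# Step 2: close the polygon with one empty bin on each side
# (assumption: "1 unit" means one bin width).
width = bins[0][1] - bins[0][0]
polygon_x = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
polygon_y = [0] + frequencies + [0]

# Each (x, y) pair is one vertex of the frequency polygon.
for x, y in zip(polygon_x, polygon_y):
    print(x, y)
```

Joining the printed points in order with straight lines gives the frequency polygon.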
In other words, the cumulative percents are added on the graph from left to right.
An ogive graph plots cumulative frequency on the y-axis and class
boundaries along the x-axis. It is very similar to a histogram, only instead of rectangles,
an ogive has a single point marking where the top right of each rectangle would be.
Example: Draw an ogive graph for the following set of data: 2, 7, 3, 8, 3, 15, 19, 16, 17,
13, 29, 20, 21, 21, 22, 25, 31, 51, 55, 55, 57, 58, 56, 57 and 58.
Step 1: Make a relative frequency table from the data. The first column has the
class limits, the second column has the frequency (the count) and the third column has the
relative frequency (class frequency / total number of items):
Class Limits    Frequency    Relative Frequency
01 to 09        5            5/25 = 0.20
10 to 19        5            5/25 = 0.20
20 to 29        6            6/25 = 0.24
30 to 39        1            1/25 = 0.04
40 to 49        0            0/25 = 0
50 to 59        8            8/25 = 0.32
Step 2: Add a fourth column and cumulate (add up) the frequencies in column 2,
going down from top to bottom. For example, the second entry is the sum of the first row
and the second row in the frequency column (5 + 5 = 10), and the third entry is the sum of
the first, second, and third rows in the frequency column (5 + 5 + 6 = 16):
Class Limits    Frequency    Relative Frequency    Cumulative Frequency
01 to 09        5            5/25 = 0.20           5
10 to 19        5            5/25 = 0.20           10
20 to 29        6            6/25 = 0.24           16
30 to 39        1            1/25 = 0.04           17
40 to 49        0            0/25 = 0              17
50 to 59        8            8/25 = 0.32           25
Step 3: Add a fifth column and cumulate the relative frequencies from column 3.
If you do this step correctly, your values should add up to 100% (or 1 as a decimal):
Class Limits    Frequency    Relative Frequency    Cumulative Frequency    Cumulative Relative Frequency
01 to 09        5            5/25 = 0.20           5                       0.2
10 to 19        5            5/25 = 0.20           10                      0.4
20 to 29        6            6/25 = 0.24           16                      0.64
30 to 39        1            1/25 = 0.04           17                      0.68
40 to 49        0            0/25 = 0              17                      0.68
50 to 59        8            8/25 = 0.32           25                      1
Step 4: Draw a Cartesian plane (x-y graph) with percent cumulative relative
frequency on the y-axis (from 0 to 100%, or as a decimal, 0 to 1). Mark the x-axis with the
class boundaries.
Step 5: Plot the points.
Note: Each point should be plotted on the upper limit of the class boundary. For example,
if your first class boundary is 0 to 10, the point should be plotted at 10.
Step 6: Connect the dots with straight lines. The ogive is one continuous line, made
up of several smaller lines that connect pairs of dots, moving from left to right.
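The table-building steps for the ogive can be checked with a short Python sketch. The names used here (`classes`, `ogive_points` and so on) are illustrative only; each point is placed at the upper class limit, following the note above.

```python
from itertools import accumulate

# Data set and class limits from the ogive example above.
data = [2, 7, 3, 8, 3, 15, 19, 16, 17, 13, 29, 20, 21, 21, 22, 25,
        31, 51, 55, 55, 57, 58, 56, 57, 58]
classes = [(1, 9), (10, 19), (20, 29), (30, 39), (40, 49), (50, 59)]

# Step 1: frequency and relative frequency of each class.
frequencies = [sum(low <= x <= high for x in data) for low, high in classes]
total = len(data)
relative = [f / total for f in frequencies]

# Steps 2 and 3: running totals of both columns.
cumulative = list(accumulate(frequencies))
cumulative_relative = [c / total for c in cumulative]

# Steps 4 to 6: one point per class, plotted at the upper class limit.
ogive_points = [(high, cr) for (_, high), cr in zip(classes, cumulative_relative)]
print(ogive_points)
```

The last point always reaches 1 (or 100%), which is a quick way to check the cumulative column.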
UNIT - II
Lesson 3
Measures of Central Tendency
Arithmetic Mean
The mean (or average) is the most popular and well known measure of central
tendency. It can be used with both discrete and continuous data, although its use is most
often with continuous data. The mean is equal to the sum of all the values in the data set
divided by the number of values in the data set.
$\bar{X} = \frac{\sum fx}{N}$
where $\bar{X}$ = mean
f = frequency of the value x
N = total frequency
Median
The median is the middle score for a set of data that has been arranged in order of
magnitude. The median is less affected by outliers and skewed data.
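For Grouped Data : Median $= l + \frac{\frac{n}{2} - cf}{f} \times h$
where,
l = lower limit of the median class
cf = cumulative frequency of the class preceding the median class
f = frequency of the median class
n = number of observations
h = class size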
Mode
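The mode is the most frequent score in our data set. On a bar chart or histogram it
is represented by the highest bar. You can, therefore, sometimes consider the
mode as being the most popular option.
For Grouped Data : Mode $M_O = l + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$
where,
l = lower limit of the modal class
$f_1$ = frequency of the modal class
$f_0$ = frequency of the class preceding the modal class
$f_2$ = frequency of the class succeeding the modal class
h = class size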
Finding the Mean, Median and Mode
We want to work out the mean, median and mode for the data as 5, 9, 12, 4, 5, 14,
19, 16, 3, 5 and 7.
To calculate the mean, we need to add all the values up and divide by the number
of values.
(5 + 9 + 12 + 4 + 5 + 14 + 19 + 16 + 3 + 5 + 7) ÷ 11 = 99 ÷ 11 = 9
In this case the mean is 9 which is one of the values in the list. Sometimes the
mean will not appear in the original list. It might even be a decimal value.
To calculate the median, we need to put the numbers in order and find the middle
value.
3 4 5 5 5 7 9 12 14 16 19
Here the median is 7 because this is the middle value. Half of the other values in
the list are below 7 and half are above 7.
To calculate the mode, we need to look at which value appears the most often. It
can help if the numbers are in order.
3 4 5 5 5 7 9 12 14 16 19
Here the mode is 5 because it appears three times, more often than any other value.
When there is an even number of values, there is no clear middle value, so the
median is the mean of the two middle values. For example,
3 6 7 8 11 15
(7 + 8) ÷ 2 = 7.5
So the median for this set of values is 7.5. Like the mean, the median value does
not always appear in the original list of values.
Example : Find the mean, median, mode, and range for the following list of values: 1, 2, 4
and 7.
(1 + 2 + 4 + 7) ÷ 4 = 14 ÷ 4 = 3.5
The median is the middle number. In this example, the numbers are already listed
in numerical order, so we don’t have to rewrite the list. But there is no “middle” number,
because there is an even number of values. Because of this, the median of the list will be
the mean (that is, the usual average) of the middle two values within the list. The middle
two numbers are 2 and 4, so:
(2 + 4) ÷ 2 = 6 ÷ 2 = 3
So the median of this list is 3, a value that isn’t in the list at all.
The mode is the number that is repeated most often, but all the numbers in this list
appear only once, so there is no mode.
Example: Find the mean, median, mode, and range for the following list of values: 13,
18, 13, 14, 13, 16, 14, 21 and 13.
Solution: The mean is the usual average, so we’ll add and then divide:
(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15
Note that the mean, in this case, isn’t a value from the original list. This is a
common result.
You should not assume that your mean will be one of your original numbers. The
median is the middle value, so first we’ll have to rewrite the list in numerical order:
13, 13, 13, 13, 14, 14, 16, 18, 21
There are nine numbers in the list, so the middle one will be the
(9 + 1) ÷ 2 = 10 ÷ 2 = 5th number, which is 14. So the median is 14.
The mode is the number that is repeated more often than any other, so 13 is the
mode, since 13 occurs 4 times.
The largest value in the list is 21, and the smallest is 13, so the range is 21 - 13 = 8.
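The hand calculations in this example can be cross-checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen for this illustration only.

```python
import statistics

# Values from the example above.
values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

mean = statistics.mean(values)            # sum of the values / number of values
median = statistics.median(values)        # middle value of the sorted list
mode = statistics.mode(values)            # most frequently occurring value
value_range = max(values) - min(values)   # largest value minus smallest value

print(mean, median, mode, value_range)
```

For a list with an even number of values, `statistics.median` returns the mean of the two middle values, which matches the rule described earlier.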
Merits of Mean
Arithmetic mean is simple to understand and easy to calculate.
It is rigidly defined.
It is suitable for further algebraic treatment.
It is least affected by fluctuations of sampling.
It takes into account all the values in the series.
Demerits of Mean
It is highly affected by the presence of a few abnormally high or abnormally
low scores.
In absence of a single item, its value becomes inaccurate.
It cannot be determined by inspection.
Example :
(a) Calculate the arithmetic mean of the following frequency distribution:
x    1    2    3     4     5     6     7
f    5    9    12    17    14    10    6
(b) Calculate the arithmetic mean of the marks from the following table:
Marks     No. of students
0-10      12
10-20     18
20-30     27
30-40     20
40-50     17
50-60     6
Solution:
(a) Computation of mean:
x        f     fx
1        5     5
2        9     18
3        12    36
4        17    68
5        14    70
6        10    60
7        6     42
Total    73    299
$\bar{x} = \frac{1}{N}\sum fx = \frac{299}{73} \approx 4.10$
(b) Computing mean:
Marks     No. of students (f)    Mid point (x)    fx
0-10      12                     5                60
10-20     18                     15               270
20-30     27                     25               675
30-40     20                     35               700
40-50     17                     45               765
50-60     6                      55               330
Total     100                                     2,800
$\bar{x} = \frac{1}{N}\sum fx = \frac{1}{100} \times 2{,}800 = 28$
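Both parts of this example use the same formula, so a single helper covers them. The sketch below is illustrative; `grouped_mean` is a name invented for this example, and for part (b) the class mid points stand in for the classes, as in the table above.

```python
def grouped_mean(xs, fs):
    """Mean of a frequency distribution: sum(f * x) / sum(f)."""
    return sum(f * x for x, f in zip(xs, fs)) / sum(fs)

# (a) Discrete series.
mean_a = grouped_mean([1, 2, 3, 4, 5, 6, 7], [5, 9, 12, 17, 14, 10, 6])

# (b) Continuous series: use the mid point of each class as x.
classes_b = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
midpoints_b = [(low + high) / 2 for low, high in classes_b]
mean_b = grouped_mean(midpoints_b, [12, 18, 27, 20, 17, 6])

print(round(mean_a, 2), mean_b)
```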
Merits of Median
Demerits
Example:
Obtain the median for the following frequency distribution:
x f
1 8
2 10
3 11
4 16
5 20
6 25
7 15
8 9
9 6
Solution:
x f c.f
1 8 8
2 10 18
3 11 29
4 16 45
5 20 65
6 25 90
7 15 105
8 9 114
9 6 120
Total N=120
Here N = 120, so $\frac{N}{2} = \frac{120}{2} = 60$.
The cumulative frequency just greater than $\frac{N}{2}$ is 65 and the value of x
corresponding to 65 is 5; therefore, the median is 5.
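The cumulative-frequency rule used in this solution can be sketched in Python as follows. The names are illustrative, and the sketch only handles the case in the text, where some cumulative frequency is strictly greater than N/2 and none is exactly equal to it.

```python
from itertools import accumulate

# Frequency distribution from the example above.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
fs = [8, 10, 11, 16, 20, 25, 15, 9, 6]

n = sum(fs)                              # N = 120
cumulative = list(accumulate(fs))        # the c.f column of the table
half = n / 2                             # N / 2 = 60

# Median: the x whose cumulative frequency is the first to exceed N / 2.
median = next(x for x, cf in zip(xs, cumulative) if cf > half)
print(n, half, median)
```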
Merits of Mode:
Demerits of Mode:
Mode for the series with unequal class intervals cannot be calculated.
Example:
The sugar levels of 10 patients checked by a doctor are given below. Find the mode
value of the sugar levels. 80, 112, 110, 115, 124, 130, 100, 90, 150 and 180.
Solution: Each observation occurs only once, so this data set has no mode.
Example:
Compute mode value for the following observations: 2, 7, 10, 12, 10, 19, 2, 11, 3
and 12.
Solution: Here, the observations 2, 10 and 12 each occur twice, more often than any
other value, so the modes are 2, 10 and 12. For a discrete frequency distribution, the
mode is the value of the variable corresponding to the maximum frequency.
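Counting occurrences by hand is error-prone, so a quick check with Python's `collections.Counter` and `statistics.multimode` (available from Python 3.8) is sketched below. The variable names are illustrative.

```python
import statistics
from collections import Counter

# Observations from the example above.
observations = [2, 7, 10, 12, 10, 19, 2, 11, 3, 12]

counts = Counter(observations)                 # how often each value occurs
modes = statistics.multimode(observations)     # all values sharing the top count
print(counts.most_common(3), modes)

# When every value occurs once, multimode returns every value,
# which is how a "no mode" data set shows up.
sugar_levels = [80, 112, 110, 115, 124, 130, 100, 90, 150, 180]
sugar_modes = statistics.multimode(sugar_levels)
has_no_mode = len(sugar_modes) == len(sugar_levels)
```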
Geometric Mean
Arithmetic mean: The arithmetic mean or mean can be found by adding all the numbers
in the given data set and dividing by the number of data points in the set.
For example, the given data set is 5, 10, 15 and 20.
Here, the number of data points = 4
Arithmetic mean or mean = (5 + 10 + 15 + 20)/4
Mean = 50/4 = 12.5
Geometric mean: It can be found by multiplying all the numbers in the given data set
and taking the nth root of the result.
For example, consider the given data set 4, 10, 16 and 24.
Here n = 4
Therefore, the G.M = 4th root of (4 × 10 × 16 × 24) = 4th root of 15360
G.M = 11.13
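The two calculations can be reproduced with a few lines of Python. This is a minimal sketch using `math.prod` (Python 3.8 or later); the variable names are illustrative.

```python
import math

# Arithmetic mean of the first data set.
am_data = [5, 10, 15, 20]
arithmetic_mean = sum(am_data) / len(am_data)

# Geometric mean of the second data set: nth root of the product.
gm_data = [4, 10, 16, 24]
product = math.prod(gm_data)
geometric_mean = product ** (1 / len(gm_data))

print(arithmetic_mean, product, round(geometric_mean, 2))
```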
Geometric Mean Properties
The G.M for a given data set is never greater than the arithmetic mean for that
data set; the two are equal only when all the values are equal.
If each object in the data set is substituted by the G.M, then the product of the
objects remains unchanged.
The G.M of the ratios of corresponding observations in two series is equal to
the ratio of their geometric means.
The G.M of the products of corresponding items in two series is equal to
the product of their geometric means.
The key assumption behind the G.M is that the data can be interpreted as scaling
factors. It should be applied only to positive values, and it is most often used for sets
of numbers that grow exponentially or whose values are meant to be multiplied together.
This means that the data must contain no zero or negative values. The geometric mean
has many advantages and is used in many fields. Some of the applications are as follows:
It is used in stock indexes; many of the value line indexes used by financial
departments are based on the G.M.
It is used to calculate the annual return on a portfolio.
It is used in finance to find average growth rates, which are also referred to as
the compound annual growth rate.
It is also used in studies such as cell division and bacterial growth.
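x        log x
45       1.653
60       1.778
48       1.681
100      2.000
65       1.813
Total    8.925
$G = \text{Antilog}\left(\frac{1}{N}\sum_{i=1}^{n} f_i \log x_i\right) = \text{Antilog}\left(\frac{8.925}{5}\right) = \text{Antilog}(1.785) \approx 60.95$
(each $f_i$ = 1 for an individual series)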
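Example: Find the geometric mean of the following grouped data for the frequency
distribution of weights.
Weights    Frequency
60-80      22
80-100     38
100-120    45
120-140    35
140-160    20
Total      160
Solution:
From the given data, N = 160, and the mid values of the classes (70, 90, 110, 130 and 150)
are taken as $x_i$.
$G = \text{Antilog}\left(\frac{1}{N}\sum_{i=1}^{n} f_i \log x_i\right)$
GM = Antilog (324.2/160) = Antilog (2.026) ≈ 106.2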
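The log-based formula can be checked numerically. The sketch below assumes base-10 logarithms and class mid values as x, as in the solution; the names are illustrative. With full-precision logarithms the result comes out at roughly 106.3, slightly above the value obtained from the rounded total 324.2.

```python
import math

# Classes and frequencies from the weights example above.
classes = [(60, 80), (80, 100), (100, 120), (120, 140), (140, 160)]
frequencies = [22, 38, 45, 35, 20]

midpoints = [(low + high) / 2 for low, high in classes]
n = sum(frequencies)

# Sum of f * log10(x), then the antilog of its average.
sum_f_log_x = sum(f * math.log10(x) for x, f in zip(midpoints, frequencies))
geometric_mean = 10 ** (sum_f_log_x / n)

print(n, round(sum_f_log_x, 1), round(geometric_mean, 2))
```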
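Harmonic Mean
$H = \frac{1}{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}} = \frac{n}{\sum_{i=1}^{n}\frac{1}{x_i}}$
where
n = the number of the values in a dataset
$x_i$ = the ith value in the dataset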
Applications of Harmonic Mean
A few common applications of the harmonic mean formula are given below
Example : The number of tomatoes per plant is given below. Calculate the harmonic
mean.
Number of tomatoes per plant (x)    Number of plants (f)
20                                  4
21                                  2
22                                  7
23                                  1
24                                  3
25                                  1
Solution:
x        f       1/x       f(1/x)
20       4       0.05      0.2
21       2       0.0476    0.0952
22       7       0.0454    0.3178
23       1       0.0435    0.0435
24       3       0.0417    0.1251
25       1       0.04      0.04
Total    n=18              0.8216
$H = \frac{1}{\frac{1}{n}\sum f_i \frac{1}{x_i}}$
Harmonic Mean $= \frac{18}{0.8216} = 21.908$
Example :
Calculate the harmonic mean for the following data:
x f
1 2
3 4
5 6
7 8
9 10
11 12
Solution:
The calculation for the harmonic mean is shown in the below table:
x        f     1/x       f(1/x)
1        2     1         2
3        4     0.333     1.332
5        6     0.2       1.2
7        8     0.143     1.144
9        10    0.1111    1.111
11       12    0.091     1.092
Total    42              7.879
The formula for weighted harmonic mean is
$H = \frac{1}{\frac{1}{n}\sum f_i \frac{1}{x_i}}$, where $n = \sum f_i$
= 42 / 7.879 = 5.331
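Both harmonic mean examples follow the same pattern, so they can be checked with one helper. In this sketch `harmonic_mean` is a name invented for the illustration. Because it does not round the reciprocals, the tomato example comes out at about 21.90 rather than 21.908; the difference is rounding in the 1/x column only.

```python
def harmonic_mean(xs, fs):
    """Harmonic mean of a frequency distribution: sum(f) / sum(f / x)."""
    n = sum(fs)
    return n / sum(f / x for x, f in zip(xs, fs))

# Tomatoes per plant (x) and number of plants (f).
hm_tomatoes = harmonic_mean([20, 21, 22, 23, 24, 25], [4, 2, 7, 1, 3, 1])

# Second example.
hm_second = harmonic_mean([1, 3, 5, 7, 9, 11], [2, 4, 6, 8, 10, 12])

print(round(hm_tomatoes, 2), round(hm_second, 3))
```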
The harmonic mean is used in finance to average multiples such as the
price-earnings ratio.
It is also used by market technicians to identify patterns such as
Fibonacci sequences.
It is rigidly defined.
It provides a more reliable result when the results to be achieved are the same
for the various means adopted.
The harmonic mean is greatly affected by the values of the extreme items.
Weighted Mean
In calculating the arithmetic mean we suppose that all the items in the distribution
have equal importance. But in practice this may not be so. If some items in a distribution
are more important than others, then this must be taken into account. In such cases,
proper weightage is to be given to the various items, the weight attached to each item
being proportional to the importance of the item in the distribution.
Weighted arithmetic mean (or weighted mean) $= \frac{\sum_i w_i x_i}{\sum_i w_i}$
where $w_i$ is the weight attached to the item $x_i$.
A weighted average is the average of all the values which are arranged on a
priority basis. The weighted average of values is the sum of the weight times values
divided by the sum of the weights.
Examples :
The numbers 40, 45, 80, 75 and 10 have weights 1, 2, 3, 4, and 5 respectively.
Find the weighted mean for the given data set.
Solution:
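Weighted Arithmetic Mean $= \frac{\sum_i w_i x_i}{\sum_i w_i}$
$= \frac{40 \times 1 + 45 \times 2 + 80 \times 3 + 75 \times 4 + 10 \times 5}{1 + 2 + 3 + 4 + 5}$
$= \frac{720}{15} = 48$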
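The weighted mean is a one-line calculation once the values and weights are paired up. A minimal Python sketch, with illustrative variable names:

```python
# Values and their weights from the example above.
values = [40, 45, 80, 75, 10]
weights = [1, 2, 3, 4, 5]

# Sum of (weight * value) divided by the sum of the weights.
weighted_sum = sum(w * x for x, w in zip(values, weights))
weighted_mean = weighted_sum / sum(weights)

# For comparison: the unweighted mean treats every value equally.
simple_mean = sum(values) / len(values)

print(weighted_sum, weighted_mean, simple_mean)
```

The weighted mean (48) is lower than the simple mean (50) because the smallest value, 10, carries the largest weight.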
Lesson 4
Measures of Dispersion
In statistics, the measures of dispersion help to interpret the variability of data i.e.
to know how much homogenous or heterogeneous the data is. In simple terms, it shows
how squeezed or scattered the variable is.
Meaning of Dispersion
Dispersion is the extent to which values in a distribution differ from the average of
the distribution.
There are two main types of dispersion methods in statistics: absolute measures of
dispersion and relative measures of dispersion. In the former case we consider the range,
quartile deviation, standard deviation etc. In the latter case we consider the coefficient
of range, the coefficient of quartile deviation, the coefficient of variation etc.
Absolute Measure of Dispersion
An absolute measure of dispersion is expressed in the same unit as the original data set.
Absolute dispersion methods express the variation in terms of the average of deviations
of observations, such as the standard deviation or mean deviation. They include the range,
standard deviation, quartile deviation, etc.
Range: It is simply the difference between the maximum value and the
minimum value given in a data set.
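Range (R) = L - S, where L is the largest value and S is the smallest value.
Co-efficient of Range $= \frac{L - S}{L + S}$
Example: For the data 1, 3, 5, 6, 7, the range is 7 - 1 = 6 and the co-efficient of
range is 6/8 = 0.75.
Variance: Subtract the mean from each value in the set, square each difference,
add the squares and finally divide the sum by the total number of values in the
data set; the result is the variance. Variance $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
Standard Deviation: The square root of the variance is known as the standard
deviation, i.e. S.D. $= \sqrt{\sigma^2} = \sigma$.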
Quartiles and Quartile Deviation: The quartiles are values that divide a list
of numbers into quarters. The quartile deviation is half of the distance between
the third and the first quartile.
Mean and Mean Deviation: The average of numbers is known as the mean
and the arithmetic mean of the absolute deviations of the observations from a
measure of central tendency is known as the mean deviation (also called mean
absolute deviation).
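To see the absolute measures side by side, the sketch below computes the range, variance and standard deviation for the small data set 1, 3, 5, 6, 7. It uses `statistics.pvariance` and `statistics.pstdev`, which divide by N as in the variance formula given here; the variable names are illustrative.

```python
import statistics

data = [1, 3, 5, 6, 7]

# Range and its co-efficient.
largest, smallest = max(data), min(data)
data_range = largest - smallest
coefficient_of_range = (largest - smallest) / (largest + smallest)

# Population variance (divides by N) and its square root.
variance = statistics.pvariance(data)
standard_deviation = statistics.pstdev(data)

# Mean deviation about the mean: average of the absolute deviations.
mean = statistics.mean(data)
mean_deviation = sum(abs(x - mean) for x in data) / len(data)

print(data_range, coefficient_of_range, variance,
      round(standard_deviation, 3), round(mean_deviation, 2))
```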
The relative measures of dispersion are used to compare the distribution of two or
more data sets. This measure compares values without units. Common relative dispersion
methods include:
Co-efficient of Range
Co-efficient of Variation
Figure - Relative Measure of Dispersion
Co-efficient of Dispersion
The coefficients of dispersion are calculated (along with the measure of dispersion)
when two series are compared, that differ widely in their averages. The dispersion
coefficient is also used when two series with different measurement units are compared. It
is denoted as C.D.
Properties of a good measure of dispersion:
Easy to understand
Simple to calculate
Uniquely defined
Range:
(i) Co-efficient of range $= \frac{L - S}{L + S}$
L = Largest item
S = Smallest item
(ii) Co-efficient of quartile deviation $= \frac{Q_3 - Q_1}{Q_3 + Q_1}$
(iii) Inter Quartile Range $= Q_3 - Q_1$
Quartile deviation is half of the inter quartile range; it is also called the semi inter
quartile range.
Quartile Deviation $= \frac{Q_3 - Q_1}{2}$
(iv) Co-efficient of Mean Deviation $= \frac{\text{Mean Deviation}}{\text{Mean or Median}}$
(v) Co-efficient of Standard Deviation $= \frac{\sigma}{\bar{X}}$
(vi) Co-efficient of variation $= \frac{\sigma}{\bar{X}} \times 100$
$\bar{X}$ = Arithmetic mean
$\sigma$ = Standard deviation
Standard Deviation
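It is defined as the square root of the arithmetic mean of the squares of the
deviations of the values from their arithmetic mean. It is denoted by $\sigma$ (sigma).
$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$ (or) $\sigma = \sqrt{\frac{\sum x^2}{N}}$
where, $x = X - \bar{X}$
Co-efficient of S.D $= \frac{\sigma}{\bar{X}}$
Individual series, Assumed Mean / Shortcut Method:
$\sigma = \sqrt{\frac{\sum d^2}{N} - \left(\frac{\sum d}{N}\right)^2}$
or
$\sigma = \sqrt{\frac{\sum X^2}{N} - \left(\frac{\sum X}{N}\right)^2}$
For a frequency distribution:
$\sigma = \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2}$
where d = X - A and A is the assumed mean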
Merits of Standard Deviation
Higher the C.V, lesser the consistency
C.V $= \frac{\sigma}{\bar{X}} \times 100$
Merits of Range
Easy to calculate
Easy to understand
Demerits of Range
The quartiles divide a data set into quarters. The first quartile, (Q1) is the middle
number between the smallest number and the median of the data. The second
quartile, (Q2) is the median of the data set. The third quartile, (Q3) is the middle
number between the median and the largest number.
Quartile deviation or semi-inter-quartile deviation is Q = ½ × (Q3 – Q1).
Merits of Quartile Deviation
All the drawbacks of Range are overcome by quartile deviation
Mean Deviation
Mean deviation is the arithmetic mean of the absolute deviations of the observations
from a measure of central tendency. If x1, x2, … , xn are the set of observations, then the
mean deviation of x about the average A (mean, median, or mode) is
$M.D = \frac{1}{n}\sum_{i=1}^{n} |x_i - A|$
For a grouped frequency distribution, it is
$M.D = \frac{1}{N}\sum f_i |x_i - A|$, where $N = \sum f_i$
Here, xi and fi are respectively the mid value and the frequency of the ith class
interval.
Merits of Mean Deviation
It provides a minimum value when the deviations are taken from the median
Ignoring the negative signs of the deviations makes it artificial and unsuitable for
further mathematical treatment
Coefficient of Dispersion
Whenever we want to compare the variability of two series which differ widely in
their averages, or whose units of measurement are different, we need to calculate the
coefficients of dispersion along with the measures of dispersion. The coefficients of
dispersion (C.D.) based on different measures of dispersion are
Based on mean deviation = Mean deviation/average from which it is calculated.
Coefficient of Variation
100 times the coefficient of dispersion based on standard deviation is the coefficient
of variation (C.V.).
Range
Example : The amount spent (in rupees) by the group of 10 students in the school canteen
is as follows:
110, 117, 129, 197, 190, 100, 100, 178, 255, 790.
Example : Find the range and its co-efficient from the following data.
Frequency 2 2 3 4 2
Solution: R = L - S = 100 - 10 = 90
Co-efficient of range $= \frac{L - S}{L + S} = \frac{100 - 10}{100 + 10} = \frac{90}{110} = 0.82$
Example : Find the quartile deviation of the daily wages (in rupees) of 7 persons given
below: 120, 70, 150, 100, 190, 170 and 250.
Solution : Arranging the data in ascending order we get 70, 100, 120, 150, 170, 190
and 250.
Here, n = 7
$Q_1$ = Size of $\left(\frac{N+1}{4}\right)$th item = Size of $\left(\frac{7+1}{4}\right)$th item = 2nd item
= 100 rupees
$Q_3$ = Size of $\left(\frac{3(N+1)}{4}\right)$th item = Size of $\left(\frac{3(7+1)}{4}\right)$th item = 6th item
= 190 rupees
Quartile Deviation $= \frac{Q_3 - Q_1}{2} = \frac{190 - 100}{2} = 45$ rupees
Example : The wheat production (in kg) of 20 acres given as: 1120, 1240, 1320, 1040,
1080, 1200, 1440, 1360, 1680, 1730, 1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470,
1750 and 1885, find the quartile deviation and co efficient of quartile deviation.
Solution : After arranging the observations in ascending order we get 1040, 1080, 1120,
1200, 1240, 1320, 1342, 1360, 1440, 1470, 1600, 1680, 1720, 1730, 1750, 1755, 1785,
1880, 1885 and 1960.
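$Q_1$ = Value of $\left(\frac{N+1}{4}\right)$th item = Value of $\left(\frac{20+1}{4}\right)$th item = Value of 5.25th item
= 5th item + 0.25 (6th item - 5th item) = 1240 + 0.25(1320 - 1240) = 1240 + 20
$Q_1$ = 1260 kg.
$Q_3$ = Value of $\left(\frac{3(N+1)}{4}\right)$th item = Value of $\left(\frac{3(20+1)}{4}\right)$th item = Value of 15.75th item
= 15th item + 0.75 (16th item - 15th item) = 1750 + 0.75(1755 - 1750) = 1750 + 3.75
$Q_3$ = 1753.75 kg.
Quartile Deviation $= \frac{Q_3 - Q_1}{2} = \frac{1753.75 - 1260}{2} = 246.875$ kg
Co-efficient of Quartile Deviation $= \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{1753.75 - 1260}{1753.75 + 1260} = 0.164$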
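The (N + 1)/4 position rule used in both quartile examples can be sketched in Python. The helper names `item_at` and `quartiles` are invented for this illustration, and the sketch assumes the computed positions fall inside the list, as they do for these two data sets. Note that library routines may use a different interpolation convention and give slightly different quartiles.

```python
def item_at(ordered, position):
    """Value at a 1-based, possibly fractional position, by linear interpolation."""
    whole = int(position)
    fraction = position - whole
    lower = ordered[whole - 1]
    if fraction == 0:
        return lower
    return lower + fraction * (ordered[whole] - lower)

def quartiles(values):
    """Q1 and Q3 as the (N+1)/4 th and 3(N+1)/4 th items of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    return item_at(ordered, (n + 1) / 4), item_at(ordered, 3 * (n + 1) / 4)

wages = [120, 70, 150, 100, 190, 170, 250]
wheat = [1120, 1240, 1320, 1040, 1080, 1200, 1440, 1360, 1680, 1730,
         1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470, 1750, 1885]

q1_wages, q3_wages = quartiles(wages)
qd_wages = (q3_wages - q1_wages) / 2

q1_wheat, q3_wheat = quartiles(wheat)
qd_wheat = (q3_wheat - q1_wheat) / 2
coefficient_wheat = (q3_wheat - q1_wheat) / (q3_wheat + q1_wheat)

print(qd_wages, q1_wheat, q3_wheat, qd_wheat, round(coefficient_wheat, 3))
```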
Mean Deviation
Calculate the median/mean/mode.
Take the deviations of the items from the median/mean, ignoring ± signs, and denote
the column as |D|.
Calculate the sum of these deviations, ∑|D|; in the case of discrete and continuous
series, calculate ∑ f|D|.
$M.D = \frac{\sum f|D|}{N}$
Co-efficient of M.D $= \frac{M.D}{\text{Median/Mean/Mode}}$
Example :
Calculate mean deviation and co-efficient of mean deviation from both mean and
median for the following data on the monthly income (in rupees) of households.
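Monthly income    Deviation from mean (7730) |D|    Deviation from median (7920) |D|
6350              1380                              1570
7920              190                               0
Mean $= \frac{\sum X}{N} = \frac{38650}{5} = 7730$
M.D $= \frac{\sum |D|}{N} = \frac{3220}{5} = 644$
Co-efficient of M.D $= \frac{M.D}{\text{Mean}} = \frac{644}{7730} = 0.083$
and
Median = size of $\left(\frac{N+1}{2}\right)$th item = size of $\left(\frac{5+1}{2}\right)$th item = size of 3rd item = 7920
M.D $= \frac{\sum |D|}{N} = \frac{3030}{5} = 606$
Co-efficient of M.D $= \frac{M.D}{\text{Median}} = \frac{606}{7920} \approx 0.077$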
Lorenz Curve
Definition: The Lorenz curve is a way of showing the distribution of income (or
wealth) within an economy. It was developed by Max O. Lorenz in 1905 for representing
wealth distribution. In other words, it is a graphical representation of the distribution of
income or wealth.
The Lorenz curve shows the cumulative share of income from different
sections of the population.
If there was perfect equality – if everyone had the same salary – the poorest
20% of the population would gain 20% of the total income. The poorest 60%
of the population would get 60% of the income.
The graph plots percentiles of the population on the horizontal axis according to
income or wealth. It plots cumulative income or wealth on the vertical axis, so that an x-
value of 45 and a y-value of 14.2 would mean that the bottom 45% of the population
controls 14.2% of the total income or wealth.
In this Lorenz curve, the poorest 20% of households have 5% of the nation’s total
income. The poorest 90% of the population holds 55% of the total income. That means the
richest 10% of income earners gain 45% of total income.
Shift in the Lorenz Curve
In this example, there has been a reduction in inequality – the Lorenz curve has
moved closer to the line of equality.
The richest 10% of the population used to gain 45% of total income but now
only get 25% of total income.
The Lorenz Curve and Gini Coefficient
The Lorenz Curve can be used to calculate the Gini coefficient – another measure
of inequality.
The closer the Lorenz curve is to the line of equality, the smaller area A is. And the
Gini coefficient will be low.
A rise in the Gini coefficient shows a rise in inequality – it shows the Lorenz curve
is further away from the line of equality.
Lorenz Curve and wealth
Following is the example to understand the Lorenz curve with the help of a graph.
Let us consider an economy with the following population and income statistics:
Population %        0    20    40    60    80    100
Income portion %    0    10    20    35    60    100
And for the line of perfect equality, let us consider this table:
Population %    Income portion %
0               0
20              20
40              40
60              60
80              80
100             100
Let us now see how a graph for this data actually looks:
As we can see, there are two lines in the graph of the Lorenz curve, the curved red
line, and the straight black line.
The black line represents the fictional line called the line of equality i.e. the ideal
graph when income or wealth is equally distributed amongst the population. The red
curve, the Lorenz curve, which we have been discussing, represents the actual
distribution of wealth among the population.
Hence, we can say that the Lorenz curve is the graphical method of studying
dispersion. Gini Coefficient, also known as the Gini Index, can be computed as follows.
Let us assume that, in the graph, the area between the Lorenz curve and the line of
equality is represented by A1 and the area below the Lorenz curve is represented by A2. So,
Gini Coefficient $= \frac{A_1}{A_1 + A_2}$
The Gini Coefficient lies between 0 and 1; 0 being the instance where there is perfect
equality and 1 being the instance where there is perfect inequality. A larger area
enclosed between the two lines represents higher inequality in the economy.
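As an illustration, the Gini coefficient for the population and income table given earlier can be approximated by treating the Lorenz curve as straight segments between the tabulated points (the trapezoid rule). That piecewise-linear reading is an assumption of this sketch, and the variable names are illustrative.

```python
# Cumulative population share and cumulative income share, in percent.
population = [0, 20, 40, 60, 80, 100]
income = [0, 10, 20, 35, 60, 100]

# Work with proportions between 0 and 1.
p = [x / 100 for x in population]
y = [x / 100 for x in income]

# A2: area under the Lorenz curve, summed trapezoid by trapezoid.
a2 = sum((p[i + 1] - p[i]) * (y[i] + y[i + 1]) / 2 for i in range(len(p) - 1))

# The area under the line of equality is 0.5, so A1 is what remains.
a1 = 0.5 - a2

gini = a1 / (a1 + a2)
print(round(a1, 2), round(a2, 2), round(gini, 2))
```

Since A1 + A2 is always 0.5, the same value can be written as 1 - 2 × A2.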
By this, we can say that in measuring income inequality, there are two indicators:
The Lorenz curve is the Visual Indicator and
The Gini Coefficient is the Mathematical Indicator.
Income inequality is a pressing issue across the world. So, what are the reasons for
inequality in an economy?
Corruption
Education
Tax
Gender differences
Culture
Race and caste discrimination
The difference in preferences of leisure and risks.
It is one of the simplest representations of inequality.
It is mainly used when designing specific measures to develop the weaker
sections of the economy.
Limitations
This might not always be rigorously true for a finite level of population.
When two Lorenz curves being compared intersect, it is not possible to
ascertain which of the two distributions they represent displays
more inequality.
UNIT - III
Lesson 5
Correlation
Definition
Correlation means association; more precisely, it is a measure of the extent to
which two variables are related. There are three possible results of a correlation study:
a positive correlation, a negative correlation, and no correlation.
A zero correlation exists when there is no relationship between two variables. For
example, there is no relationship between the amount of tea drunk and level of intelligence.
one variable are accompanied by movements in the other variable. For example, the ages of
husband and wife move together, and scores on an I.Q. test move with scores in university examinations.
However, the above statements are not precise enough to be of use to decision
makers. We are therefore on the lookout for a quantitative measure of the relationship
between the two variables, and also for an appropriate mathematical or statistical form of
the relationship.
Meaning of Correlation
For example, the correlation between (i) the heights and weights of a group of
persons, and (ii) the income and expenditure; is positive and the correlation between
(i) price and demand of a commodity and (ii) the volume and pressure of a perfect gas; is
negative.
Scatter diagram
A correlation can be expressed visually. This is done by drawing a scatter diagram
(also known as a scatter plot, scatter graph, or scatter chart).
A scatter graph indicates the strength and direction of the correlation between the
co-variables.
When you draw a scatter diagram it doesn't matter which variable goes on the
x-axis and which goes on the y-axis.
Remember, in correlations we are always dealing with paired scores, so the values
of the 2 variables taken together will be used to make the diagram. Decide which variable
goes on each axis and then simply put a cross at the point where the 2 values coincide.
Apart from diagrams, graphic presentation is another way of presenting data
and information. Usually, graphs are used to present time series and frequency distributions.
In this section, we will look at the graphic presentation of data and information along with its
merits, limitations, and types.
Construction of a Graph
The graphic presentation of data and information offers a quick and simple way of
understanding the features and drawing comparisons. Further, it is an effective analytical tool
and a graph can help us in finding the mode, median, etc.
We can locate a point in a plane using two mutually perpendicular lines – the X-axis
(the horizontal line) and the Y-axis (the vertical line). Their point of intersection is the origin.
We can locate the position of a point in terms of its distance from both these axes.
For example, if a point P is 3 units away from the Y-axis and 5 units away from the
X-axis, then its location is as follows:
General Rules for Graphic Presentation of Data and Information
There are certain guidelines for an attractive and effective graphic presentation of
data and information. These are as follows:
Suitable Title – Ensure that you give a suitable title to the graph which clearly
indicates the subject for which you are presenting it.
Unit of Measurement – Clearly state the unit of measurement below the title.
Suitable Scale – Choose a suitable scale so that you can represent the entire
data in an accurate manner.
Index – Include a brief index which explains the different colors and
shades, lines and designs that you have used in the graph. Also, include a scale
of interpretation for better understanding.
Keep it Simple – You should construct a graph which even a layman (without
any exposure in the areas of statistics or mathematics) can understand.
Neat – A graph is a visual aid for the presentation of data and information.
Therefore, you must keep it neat and attractive. Choose the right size, right
lettering, and appropriate lines, colors, dashes, etc.
Merits of a Graph
The viewer does not require prior knowledge of mathematics or statistics to
understand a graph.
We can use a graph to locate the mode, median, and mean values of the data.
Limitations of a Graph
Typically, a graph shows only the general tendency of the data; the actual
values are not clear.
Types of Graphs
Time Series Graphs
A time series graph, or “historigram”, is a graph which depicts the value of a variable
at different points of time.
In a time series graph, time is the most important factor and the variable is related to
time. It helps in the understanding and analysis of the changes in the variable at different
points of time.
Many statisticians and businessmen use these graphs because they are easy to
understand and also because they offer complex information in a simple manner.
Further, constructing a time series graph does not require technical skills.
Here are some major steps in the construction of a time series graph:
Represent time on the X-axis and the value of the variable on the Y-axis.
Start the Y-value with zero and devise a suitable scale which helps you present
the whole data in the given space.
Plot the values of the variable and join successive points with straight lines.
Line Graph
You can use a line graph to summarize how two pieces of information are related and
how they vary with each other.
Advantages
You can infer the interim data from the graph line
Disadvantages
Usually, in a graph, the vertical line starts from the Origin. However, in some cases, a
false Base Line is used for a better representation of the data. There are two scenarios where
you should use a false Base Line:
If you have to show the net balance of income and expenditure or revenue and costs
or imports and exports, etc., then you must use a net balance graph. You can use different
colors or shades for positive and negative differences.
Histogram
Frequency Curve
When you join the vertices of a frequency polygon using a smooth curve, the resulting
figure is a Frequency Curve. As the number of observations increases, we need to
accommodate more classes. Therefore, the width of each class reduces. In such a scenario,
the variable tends to become continuous and the frequency polygon starts taking the shape of
a frequency curve.
Uses of Correlations
Prediction
Validity
Reliability
Test-retest reliability (are measures consistent).
Theory verification
Predictive validity.
Coefficient of Correlation
A coefficient of correlation is generally applied in statistics to measure the
relationship between two variables. It gives a specific value for the degree of
linear relationship between two variables, say X and Y. There are various types
of correlation coefficients. However, Pearson’s correlation (also known as Pearson’s r) is
the correlation coefficient that is most frequently used in linear regression.
\[ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} \]
where \(\bar{X}\) = mean of X variable and \(\bar{Y}\) = mean of Y variable
Problem : The following data gives the heights (in inches) of father and his eldest son.
Compute the correlation coefficient between the heights of fathers and sons using Karl
Pearson’s method.
Height of father (x)   Height of son (y)
65   67
66   68
67   65
67   68
68   72
69   72
70   69
72   71
Solution: Let x denote height of father and y denote height of son. The data is on the ratio
scale.
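\[ r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}} \]
x_i   y_i   x_i^2   y_i^2   x_i y_i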
Heights of father and son are positively correlated. It means that on the average , if
fathers are tall then sons will probably tall and if fathers are short, probably sons may be
short.
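The working table for this problem is not reproduced here, so the computation can be checked with a short script. This is a minimal sketch, assuming the first column of the data is the father's height and the second the son's; the function name `pearson_r` is illustrative only. It applies the same sums-based form of Karl Pearson's formula used in the solution.

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's coefficient of correlation from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

fathers = [65, 66, 67, 67, 68, 69, 70, 72]
sons = [67, 68, 65, 68, 72, 72, 69, 71]
r_heights = pearson_r(fathers, sons)
print(round(r_heights, 3))  # 0.603
```

The intermediate sums are Σx = 544, Σy = 552, Σx² = 37028, Σy² = 38132 and Σxy = 37560, giving r = 192 / √(288 × 352) ≈ 0.603, a moderate positive correlation.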
Problem 2: The following are the marks scored by 7 students in two tests in a subject.
Calculate coefficient of correlation from the following data and interpret.
Marks in test-1 12 9 8 10 11 13 7
Marks in test-2 14 8 6 9 11 12 3
x_i   y_i   x_i^2   y_i^2   x_i y_i
12   14   144   196   168
9   8   81   64   72
8   6   64   36   48
10   9   100   81   90
11   11   121   121   121
13   12   169   144   156
7   3   49   9   21
Total   70   63   728   651   676
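\[ r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}} \]
\[ \sum_{i=1}^{n} x_i = 70, \quad \sum_{i=1}^{n} x_i^2 = 728, \quad \sum_{i=1}^{n} x_i y_i = 676, \quad \sum_{i=1}^{n} y_i = 63, \quad \sum_{i=1}^{n} y_i^2 = 651 \]
here n = 7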
There is a high positive correlation between test-1 and test-2. That is, those who
perform well in test-1 tend also to perform well in test-2, and those who perform poorly in
test-1 tend to perform poorly in test-2.
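The value of r behind this interpretation follows from substituting the sums for this problem (Σx = 70, Σy = 63, Σx² = 728, Σy² = 651, Σxy = 676, n = 7) into Pearson's formula. The arithmetic below is a worked check added for illustration:

```latex
\[
r = \frac{7(676) - (70)(63)}{\sqrt{\left[7(728) - 70^2\right]\left[7(651) - 63^2\right]}}
  = \frac{4732 - 4410}{\sqrt{196 \times 588}}
  = \frac{322}{339.48} \approx 0.95
\]
```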
Spearman’s Rank Correlation
For example, if we consider the relation between intelligence and beauty, it is not
necessary that a beautiful individual is intelligent also. Let (x_i, y_i); i = 1, 2, ...., n be the
ranks of the i-th individual in two characteristics A and B respectively. The Pearsonian
coefficient of correlation between the ranks x_i and y_i is called the rank correlation
coefficient between A and B for that group of individuals.
Problem : The scores of 9 students in History and Geography are mentioned in the table
below.
Step 1- Create a table of the data obtained.
Step 2- Start by ranking the two data sets. Data ranking can be achieved by
assigning the ranking “1” to the biggest number in the column, “2” to the
second biggest number and so forth.
The smallest value will usually get the lowest ranking. This should be done for
both sets of measurements.
Step 3- Add a third column d to your data set, d here denotes the difference
between ranks.
For example, if the first student’s physics rank is 3 and the math rank is 5 then
the difference in the rank is 2. In the fourth column, square your d values.
\[ r_R = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)} \]
= 1 - (6 × 12) / (9(81 - 1))
The Spearman’s Rank Correlation for this data is 0.9 and, as mentioned above, if
the ρ value is nearing +1 then they have a perfect association of rank.
Problem : To calculate a Spearman rank-order correlation on data without any ties we
will use the following data:
English 56 75 45 71 62 64 58 80 76 61
Maths 66 70 40 60 65 56 59 77 67 63
English   Maths   Rank (English)   Rank (Maths)   d   d^2
56   66   9   4   5   25
75   70   3   2   1   1
45   40   10   10   0   0
71   60   4   7   3   9
62   65   6   5   1   1
64   56   5   9   4   16
58   59   8   8   0   0
80   77   1   1   0   0
76   67   2   3   1   1
61   63   7   6   1   1
We then calculate the following:
\[ \sum d_i^2 = 25 + 1 + 9 + 1 + 16 + 1 + 1 = 54 \]
We then substitute this into the main equation with the other information as
follows:
\[ \rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 54}{10(10^2 - 1)} = 1 - \frac{324}{990} = 1 - 0.33 = 0.67 \]
as n = 10.
This indicates a strong positive relationship between the ranks individuals obtained
in the Maths and English exam. That is, the higher you ranked in Maths, the higher you
ranked in English also, and vice versa.
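The ranking and the rank correlation for the English and Maths marks can be reproduced with a short script. This is a minimal sketch that assumes there are no tied marks (as in this problem); the helper names `ranks_descending` and `spearman_rho` are illustrative. Rank 1 is given to the largest mark, as in the worked table.

```python
def ranks_descending(values):
    """Rank 1 for the largest value; valid only when there are no ties."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """Return (rho, sum of squared rank differences) for untied data."""
    n = len(x)
    rank_x = ranks_descending(x)
    rank_y = ranks_descending(y)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
    return rho, sum_d2

english = [56, 75, 45, 71, 62, 64, 58, 80, 76, 61]
maths = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]
rho, sum_d2 = spearman_rho(english, maths)
print(sum_d2, round(rho, 2))  # 54 0.67
```

When marks are tied, average ranks and a correction term are normally needed, which this sketch does not handle.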
Lesson 6
Regression
The term “regression” literally means “stepping back towards the average”.
It was first used by British biometrician Sir Francis Galton (1822-1911), in connection
with the inheritance of stature.
Galton found that the offspring of abnormally tall or short parents tend to
“regress” or “step back” to the average population height.
But the term “regression” as now used in statistics is only a convenient term
without having any reference to biometry.
Regression Analysis
In regression analysis there are two types of variables. The variable whose value is
influenced or is to be predicted is called the dependent variable, and the
variable which influences the values or is used for prediction is called the independent variable.
Problem : For 10 randomly selected observations, the following data were recorded:
Observation No. 1 2 3 4 5 6 7 8 9 10
Y = a + b_1 X + b_2 X^2.
Solution:
Sl. No.   X   Y   X^2   X^3   X^4   XY   X^2 Y
1 1 2 1 1 1 2 2
2 1 7 1 1 1 7 7
3 2 7 4 8 16 14 28
4 2 10 4 8 16 20 40
5 3 8 9 27 81 24 72
6 3 12 9 27 81 36 108
7 4 10 16 64 256 40 160
8 5 14 25 125 625 70 350
9 6 11 36 216 1296 66 396
10 7 14 49 343 2401 98 686
Total 34 95 154 820 4774 377 1849
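Using the normal equations, we get
95 = 10a + 34b_1 + 154b_2
377 = 34a + 154b_1 + 820b_2
1849 = 154a + 820b_1 + 4774b_2
Solving these gives a = 1.80, b_1 = 3.48 and b_2 = -0.27,
and
Y = 1.80 + 3.48X - 0.27X^2.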
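The fitted coefficients can be verified by building the three normal equations for a second degree curve from the column totals and solving them. The sketch below is illustrative: it uses a small hand-written Gaussian elimination (`solve_linear_system` is a hypothetical helper, not a library call) so that it runs without external packages.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution

x = [1, 1, 2, 2, 3, 3, 4, 5, 6, 7]
y = [2, 7, 7, 10, 8, 12, 10, 14, 11, 14]
n = len(x)
sx = sum(x)
sx2 = sum(v ** 2 for v in x)
sx3 = sum(v ** 3 for v in x)
sx4 = sum(v ** 4 for v in x)
sy = sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2y = sum(a * a * b for a, b in zip(x, y))

# Normal equations for Y = a + b1*X + b2*X^2
normal_matrix = [[n, sx, sx2], [sx, sx2, sx3], [sx2, sx3, sx4]]
normal_rhs = [sy, sxy, sx2y]
a, b1, b2 = solve_linear_system(normal_matrix, normal_rhs)
print(round(a, 2), round(b1, 2), round(b2, 2))  # 1.8 3.48 -0.27
```

The totals computed in the script (34, 95, 154, 820, 4774, 377, 1849) agree with the Total row of the table.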
Types of Regression
Based on the form of the regression line, regression analysis is divided into
two types:
Linear Regression
Non-Linear Regression
The shape of the regression line depends on the distribution of the data. We can
infer this from the image below. The first image shows linear regression whereas the
second image shows non-linear regression.
[Figure: Linear Function (left) and Nonlinear Function (right)]
Linear Regression
If the variables in a bivariate distribution are related, we will find that the points in
the scatter diagram cluster round some curve called the “curve of regression”. If the
curve is a straight line, it is called the line of regression and there is said to be linear
regression between the variables; otherwise the regression is said to be curvilinear.
The line of regression is the line which gives the best estimate to the value of one
variable for any specific value of the other variable.
Thus the line of regression is the line of “best fit” and is obtained by the principle
of least squares.
Linear Regression Types
In the case of linear regression, if there is only one input variable, then we will do
simple linear regression.
If instead, the input variables are two or more, we will need to perform multiple
linear regression.
To summarize this we can say, a simple linear regression shows the relationship
between a dependent variable y and an independent variable x.
[Figure: Simple Linear Regression (left) and Multiple Linear Regression (right)]
The equation of the simple linear regression line, with a single explanatory variable,
reduces to:
\[ Y = W_1 X + B_0 \]
whereas the equation of the regression line with multiple explanatory variables, that is,
the equation of a multiple linear regression line, is given by:
\[ Y = W_1 X_1 + W_2 X_2 + \dots + B_0 \]
The sales of a company (in million dollars) for each year are shown in the table as
follows:
X (year) 2005 2006 2007 2008 2009
Y (sales) 12 19 29 37 45
b) Use the least squares regression line as a model to estimate the sales of the
company in 2012.
Solution:
a) We first change the variable x into t such that t = x-2005 and therefore t represents
the number of years after 2005.
Using t instead of x makes the numbers smaller and therefore manageable. The
table of values becomes.
t   y
0   12
1   19
2   29
3   37
4   45
We now use the table to calculate a and b in the least squares regression line
formula.
t y ty 𝑡2
0 12 0 0
1 19 19 1
2 29 58 4
3 37 111 9
4 45 180 16
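\[ \sum t = 10, \quad \sum y = 142, \quad \sum ty = 368, \quad \sum t^2 = 30 \]
We now calculate a and b using the least squares regression formulas for a and b.
\[ a = \frac{n\sum ty - \sum t \sum y}{n\sum t^2 - \left(\sum t\right)^2} = \frac{5 \times 368 - 10 \times 142}{5 \times 30 - 10^2} = 8.4 \]
\[ b = \frac{\sum y - a\sum t}{n} = \frac{142 - 8.4 \times 10}{5} = 11.6 \]
b) In 2012, t = 2012 - 2005 = 7, so the estimated sales are y = 8.4 × 7 + 11.6 = 70.4 million dollars.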
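The least squares line for the sales data can be checked with a few lines of code. This is a minimal sketch: following the solution, a is taken as the slope and b as the intercept of y = a t + b, and the function name `least_squares_line` is illustrative.

```python
def least_squares_line(t, y):
    """Return slope a and intercept b of the least squares line y = a*t + b."""
    n = len(t)
    sum_t, sum_y = sum(t), sum(y)
    sum_ty = sum(ti * yi for ti, yi in zip(t, y))
    sum_t2 = sum(ti * ti for ti in t)
    a = (n * sum_ty - sum_t * sum_y) / (n * sum_t2 - sum_t ** 2)
    b = (sum_y - a * sum_t) / n
    return a, b

years = [2005, 2006, 2007, 2008, 2009]
sales = [12, 19, 29, 37, 45]
t = [year - 2005 for year in years]       # years after 2005
slope, intercept = least_squares_line(t, sales)
forecast_2012 = slope * (2012 - 2005) + intercept
print(round(slope, 1), round(intercept, 1), round(forecast_2012, 1))  # 8.4 11.6 70.4
```

Forecasting 2012 extends the line three years beyond the last observation, so the estimate of 70.4 million dollars rests on the assumption that the linear trend continues.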
UNIT - IV
Lesson 7
Measurement of Trend
Time Series
In time series analysis, current data in a series may be compared with past data in
the same series. We may also compare the development of two or more series over time.
These comparisons may afford important guidelines for the individual firm. Time series
analysis plays an important role in economics, statistics and commerce.
Definition
Fit a model and proceed to forecasting, monitoring or even feedback and feed
forward control.
The essential requirements of a Time Series are:
The time gap, between various values must be as far as possible, equal.
It must consist of a homogeneous set of values.
Data must be available for a long period.
Symbolically, if ‘t’ stands for time and ‘y_t’ represents the value at time t, then the
paired values (t, y_t) represent time series data.
Example : Production of rice in Tamil Nadu for the period from 2010-11 to 2016-17.
Year Production
2010-11 400
2011-12 450
2012-13 440
2013-14 420
2014-15 460
2016-17 520
Components of Time Series
Secular Trend
Seasonal Variations
Cyclical Variations
Irregular Variations
Secular Trend
Secular Trend is also called long term trend or simply trend. The trend is the long
term pattern of a time series.
A trend can be positive or negative depending on whether the time series exhibits
an increasing long term pattern or a decreasing long term pattern. If a time series does not
show an increasing or decreasing pattern then the series is stationary in the mean.
For example, if we are studying the sales figures of a cloth store for 1996-1997
and we find that in 1997 the sales have gone up, this increase cannot be called a secular
trend, because it is too short a period of time to conclude that the sales are showing an
increasing tendency.
Cyclical Variations
This is a short term variation that occurs over a period of more than one year. A
rhythmic movement in a time series with a period of oscillation (repeated again and again
in the same manner) of more than one year is called a cyclical variation, and the period is called a
cycle.
The time series related to business and economics show some kind of cyclical
variations.
One of the best examples for cyclical variations is “Business Cycle”. In this cycle
there are four well defined periods or phases.
Boom
Decline
Depression
Improvement
Seasonal Variations
Seasonal variations occur during a period of one year and have the same pattern
year after year. Here the period of time may be monthly, weekly or hourly.
But if the figures are given in yearly terms then seasonal fluctuations do not exist.
Seasonal fluctuations occur in a time series due to two factors.
The most important factor causing seasonal variations is the climate. Changes in the
climate and weather conditions such as rainfall, humidity, heat etc. act on different
products and industries differently.
For example, during winter there is greater demand for woollen clothes, hot drinks
etc., whereas in summer cotton clothes and cold drinks have greater sales, and in the rainy season
umbrellas and raincoats have greater demand.
For example, on occasions like Deepavali, Dussehra, Christmas etc. there is a big
demand for sweets, clothes etc.
There is a large demand for books and stationery in the first few months after the
opening of schools and colleges.
Irregular Variations
These fluctuations occur in a random or irregular way. They are
unforeseen and unpredictable, and are due to irregular circumstances beyond
human control such as earthquakes, wars, floods, famines, lockouts, etc.
The following are the two models which we generally use for the decomposition of
time series into its four components. The objective is to estimate and separate the four
types of variations and to bring out the relative effect of each on the overall behavior of
the time series.
Additive Model
i.e. O=T+S+C+I
where O represents the original data, T represents the trend, S represents the seasonal
variations, C represents the cyclical variations and I represents the irregular variations.
Multiplicative Model
i.e. O = T × S × C × I
This model is the most used model in the decomposition of time series. To remove
any doubt between the two models, it should be made clear that in Multiplicative model S,
C, and I are indices expressed as decimal percentages whereas, in Additive model S, C
and I are quantitative deviations about a trend that can be expressed as seasonal, cyclical
and irregular in nature.
If in multiplicative model,
T = 500,
S = 1.4,
C = 1.20,
I = 0.7
Then O = T × S × C × I
we get O = 500 × 1.4 × 1.20 × 0.7 = 588
If in additive model,
T = 500,
S = 100,
C = 25,
I = –60
Then O = T + S + C + I
we get O = 500 + 100 + 25 – 60 = 565
Trend Analysis
Trend is a long term movement in a time series. This component represents basic
tendency of the series.
The following methods are generally used to determine trend in any given time
series.
Graphic method is the simplest of all methods and easy to understand. The method
is as follows.
Then a smooth free hand curve is drawn through the plotted points in such a way
that it represents general tendency of the series.
As the curve is drawn through eye inspection, this is also called as eye-inspection
method.
The graphic method removes the short term variations to show the basic tendency
of the data.
The trend line drawn through the graphic method can be extended further to
predict or estimate values for the future time periods.
Example: Fit a trend line by the Graphic method for the given data.
Year Sales
2000 30
2001 46
2002 25
2003 59
2004 40
2005 60
2006 38
2007 65
Solution:
Advantages
It is the simplest method for studying trend values, and the trend is easy to draw.
Sometimes the trend line drawn by the statistician experienced in computing
trend may be considered better than a trend line fitted by the use of a
mathematical formula.
Although the free hand curves method is not recommended for beginners, it
has considerable merits in the hands of experienced statisticians and widely
used in applied situations.
Disadvantages
This method is highly subjective and curve varies from person to person who
draws it.
The work must be handled by skilled and experienced people.
Since the method is subjective, the prediction may not be reliable.
While drawing a trend line through this method a careful job has to be done.
It is a method for computing trend values in a time series which eliminates the
short term and random fluctuations from the time series by means of moving average.
The first average is the mean of first m terms; the second average is the mean of
2nd term to (m+1)th term and 3rd average is the mean of 3rd term to (m+2)th term and so on.
If m is odd then the moving average is placed against the mid value of the time
interval it covers. But if m is even then the moving average lies between the two middle
periods, which does not correspond to any time period.
So a further step has to be taken to place the moving average against a particular period
of time. For that we take 2-yearly moving averages of the moving averages, which
correspond to a particular time period. The resultant moving averages are the trend values.
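The moving average procedure described above can be sketched in code. This illustration reuses the sales figures for 2000 to 2007 from the graphic method example as sample data (an assumption made for demonstration only); the function names are hypothetical. For an even period such as 4, the centred values are obtained by taking a 2-period moving average of the 4-period moving averages.

```python
def moving_average(values, m):
    """m-period moving averages: mean of terms 1..m, then 2..m+1, and so on."""
    return [sum(values[i:i + m]) / m for i in range(len(values) - m + 1)]

def centred_moving_average(values, m):
    """For even m: 2-period moving average of the m-period moving averages."""
    return moving_average(moving_average(values, m), 2)

sales = [30, 46, 25, 59, 40, 60, 38, 65]          # years 2000 to 2007
three_yearly = moving_average(sales, 3)           # placed against 2001 to 2006
four_yearly_centred = centred_moving_average(sales, 4)   # placed against 2002 to 2005
print([round(v, 2) for v in three_yearly])
print(four_yearly_centred)  # [41.25, 44.25, 47.625, 50.0]
```

Note that the 3-yearly averages give no trend value for the first and last year, and the centred 4-yearly averages give none for the first two and last two years.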
Example : Calculate 3-yearly moving average for the following data.
Advantages
The moving average has the advantage that it follows the general movements
of the data and that its shape is determined by the data rather than the
statistician’s choice of mathematical function.
Disadvantages
For a moving average of 2m+1, one does not get trend values for first m and
last m periods.
As the trend path does not correspond to any mathematical function, it cannot
If the trend is not linear, the trend values calculated through moving averages
The choice of the period is sometimes left to the human judgment and hence
In this method the whole data is divided in two equal parts with respect to time.
For example if we are given data from 1999 to 2016 i.e. over a period of 18 years
the two equal parts will be first nine years i.e. from 1999 to 2007 and 2008 to 2016.
In the case of an odd number of years like 9, 13, 17 etc., two equal parts can be made
simply by omitting the middle year.
For example if the data are given for 19 years from 1998 to 2016 the two equal
parts would be from 1998 to 2006 and from 2008 to 2016, the middle year 2007 will be
omitted.
After the data have been divided into two parts, an average (arithmetic mean) of
each part is obtained.
We thus get two points. Each point is plotted against the mid year of its part.
Then these two points are joined by a straight line, which gives us the trend line. The line
can be extended downwards or upwards to get intermediate values or to predict future
values.
Example:
Thus we get two points 41.75 and 53.75 which shall be plotted corresponding to
their middle years i.e. 2002.5 and 2006.5. By joining these points we shall obtain the
required trend line. This line can be extended and can be used either for prediction or for
determining intermediate values.
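The method of semi-averages can be sketched as follows. Because the data table for this example is not reproduced here, the illustration uses the sales figures for 2000 to 2007 from the graphic method example instead (an assumption for demonstration); the function name `semi_average_trend` is hypothetical. With an odd number of years the middle year is omitted, as described above.

```python
def semi_average_trend(years, values):
    """Return the two semi-average points and a function giving trend values."""
    n = len(values)
    half = n // 2
    # With odd n, slicing from n - half skips the middle year
    first_years, first_values = years[:half], values[:half]
    second_years, second_values = years[n - half:], values[n - half:]
    x1, y1 = sum(first_years) / half, sum(first_values) / half
    x2, y2 = sum(second_years) / half, sum(second_values) / half
    slope = (y2 - y1) / (x2 - x1)

    def trend(year):
        return y1 + slope * (year - x1)

    return (x1, y1), (x2, y2), trend

years = [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007]
sales = [30, 46, 25, 59, 40, 60, 38, 65]
point1, point2, trend = semi_average_trend(years, sales)
print(point1, point2)   # (2001.5, 40.0) (2005.5, 50.75)
print(trend(2008))      # 57.46875
```

The two semi-averages are plotted against the middle of each half (2001.5 and 2005.5 here), and the straight line through them is extended to 2008 to obtain a predicted value.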
Example: Fit a trend line by the method of semi-averages for the given data.
Solution:
2000 105
2004 110
Lesson 8
Measurement of Variations
Cyclical Variations
Residual method
Reference cycle analysis method
Direct method
Harmonic analysis method
Business Cycle
A cycle is measured either from trough-to-trough or from peak-to-peak. Recession
and contraction are the result of cumulative downswing of a cycle whereas revival and
expansion are the result of cumulative upswing of a cycle.
Seasonal Variations
Seasonal variations are regular and periodic variations having a period of one year.
Examples which show seasonal variations are the production of cold
drinks, which is high during summer months and low during the winter season, and the sales of
sarees in a cloth store, which are high during the festival season and low during other periods.
The reason for determining seasonal variations in a time series is to isolate them and to study
their effect on the size of the variable in index form, which is usually referred to as the seasonal
index.
Measurement of Seasonal Variations
The study of seasonal variation has great importance for business enterprises to
plan the production schedule in an efficient way so as to enable them to supply to the
public demands according to seasons.
There are different devices to measure the seasonal variations. These are
This is the simplest of all the methods of measuring seasonality. This method is
based on the additive model of the time series. That is, the observed value of the series is
expressed by Yt = Tt + St + Ct + Rt, and in this method we assume that the trend
component and the cyclical component are absent.
Arrange the data by years and months (or quarters if quarterly data is given).
Compute the average xi (i = 1,2,…..12 for monthly and i=1,2,3,4 for quarterly)
for the i th month or quarter for all the years.
i.e. \( \bar{x} = \frac{1}{12}\sum_{i=1}^{12} x_i \) for monthly and \( \bar{x} = \frac{1}{4}\sum_{i=1}^{4} x_i \) for quarterly
This method is based on the basic assumption that the data do not contain any
trend and cyclic components. Since most economic and business time series have
trends, this method, though simple, is not of much practical utility.
Example : Assuming that the trend is absent, determine if there is any seasonality in the
data given below.
Solution:
\[ \text{Seasonal index} = \frac{\text{Quarterly average}}{\text{General average}} \times 100 \]
Seasonal index for the first quarter = \( \frac{3.675}{3.725} \times 100 = 98.66 \)
Seasonal index for the second quarter = \( \frac{4.125}{3.725} \times 100 = 110.74 \)
Seasonal index for the third quarter = \( \frac{3.55}{3.725} \times 100 = 95.30 \)
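The simple averages calculation can be expressed compactly in code. In this sketch the first three quarterly averages are those quoted in the solution; the fourth (3.55) is not stated in the text and is inferred here from the general average of 3.725, so it should be treated as an assumption. The function name `seasonal_indices` is illustrative.

```python
def seasonal_indices(period_averages):
    """Seasonal index of each period = period average / general average x 100."""
    general_average = sum(period_averages) / len(period_averages)
    return [100 * average / general_average for average in period_averages]

# Fourth value is inferred: 4 x 3.725 - (3.675 + 4.125 + 3.55) = 3.55 (assumption)
quarterly_averages = [3.675, 4.125, 3.55, 3.55]
indices = seasonal_indices(quarterly_averages)
print([round(value, 2) for value in indices])
```

By construction the four quarterly indices add up to 400, which is a useful check on the arithmetic.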
Ratio to Moving Average Method
The steps necessary for determining seasonal variations by this method are
The seasonal indices are now obtained by eliminating the irregular or random
components by averaging these percentages using A.M or median.
The sum of these indices will not in general be equal to 1200 (for monthly) or
400 (for quarterly).
Finally the adjustment is done to make the sum of the indices to a total of 1200
for monthly and 400 for quarterly data by multiplying them throughout by a
constant K which is given by
\[ K = \frac{1200}{\text{Total of the indices}} \text{ for monthly} \]
\[ K = \frac{400}{\text{Total of the indices}} \text{ for quarterly} \]
Advantages
The fluctuation of indices based on ratio to moving average method is less than
based on other methods.
Disadvantages
This method does not completely utilize the data. For example in case of 12-
monthly moving average seasonal indices cannot be obtained for the first and
last 6 months.
Example:
Calculating seasonal indices by the ratio to moving average method, from the
following data:
Year   I   II   III   IV
2005   68   62   61   63
2006   65   58   66   61
2007   68   63   63   67
Solution
Calculation of Seasonal Index
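Arithmetic average of averages = \( \frac{399.32}{4} = 99.83 \)
Seasonal index for the first quarter = \( \frac{105.12}{99.83} \times 100 = 105.30 \)
Seasonal index for the second quarter = \( \frac{95.05}{99.83} \times 100 = 95.21 \)
Seasonal index for the third quarter = \( \frac{100.80}{99.83} \times 100 = 100.97 \)
Seasonal index for the fourth quarter = \( \frac{98.35}{99.83} \times 100 = 98.52 \)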
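The whole ratio to moving average calculation for the 2005 to 2007 quarterly data can be reproduced with a short script. This is a minimal sketch, assuming an even number of periods per year and data that start at the first quarter; the function name `ratio_to_moving_average` is illustrative. It computes centred 4-quarter moving averages, expresses each observation as a percentage of its moving average, averages the percentages quarter by quarter with the arithmetic mean, and adjusts the indices to total 400.

```python
def ratio_to_moving_average(data_by_year, periods=4):
    """Seasonal indices by the ratio to moving average method (even periods)."""
    series = [value for year in data_by_year for value in year]
    # Moving totals of `periods` terms, then centred by pairing adjacent totals
    totals = [sum(series[i:i + periods]) for i in range(len(series) - periods + 1)]
    centred = [(totals[i] + totals[i + 1]) / (2 * periods)
               for i in range(len(totals) - 1)]
    offset = periods // 2          # first centred value belongs to series[offset]
    ratios = [[] for _ in range(periods)]
    for i, trend in enumerate(centred):
        position = i + offset
        ratios[position % periods].append(100 * series[position] / trend)
    averages = [sum(r) / len(r) for r in ratios]
    k = 100 * periods / sum(averages)      # adjustment constant K
    return averages, [average * k for average in averages]

quarterly = [
    [68, 62, 61, 63],   # 2005
    [65, 58, 66, 61],   # 2006
    [68, 63, 63, 67],   # 2007
]
averages, indices = ratio_to_moving_average(quarterly)
print([round(v, 2) for v in averages])
print([round(v, 2) for v in indices])
```

The quarterly averages of the percentages come to about 105.13, 95.05, 100.80 and 98.35, and the adjusted seasonal indices to about 105.30, 95.21, 100.97 and 98.52, in line with the figures of the worked solution.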
Link Relative Method
This method is slightly more complicated than other methods. This method is also
known as Pearson’s method. This method consists of the following steps.
The link relatives for each period are calculated by using the formula below.
\[ \text{Link relative for any period} = \frac{\text{Current period's figure}}{\text{Previous period's figure}} \times 100 \]
Calculate the average of the link relatives for each period for all the years using
mean or median.
Convert the average link relatives into chain relatives on the basis of the first
season.
\[ \text{Chain relative for any period} = \frac{\text{Avg link relative for that period} \times \text{Chain relative of the previous period}}{100} \]
Now the adjusted chain relatives are calculated by subtracting the correction factor
‘kd’ from the (k+1)th chain relative respectively,
where k = 1,2,…….11 for monthly and k = 1,2,3 for quarterly data and
\[ d = \frac{1}{N}\left[\text{New chain relative for first period} - 100\right] \]
where N denotes the number of periods, i.e. N = 12 for monthly and N = 4 for quarterly.
Finally calculate the average of the corrected chain relatives and convert the
corrected chain relatives as the percentages of this average.
These percentages are seasonal indices calculated by the link relative method.
Advantages
As compared to the method of moving average, the link relative method makes
fuller use of the data.
Example :
Apply the method of-link relatives to the following data and calculate seasonal
indices:
Quarterly Figures
Solution: Calculation of Seasonal Indices by the method of Link Relatives
Quarter
Year I II III IV
Chain relative of the first quarter (on the basis of first quarter) = 100
Chain relative of the first quarter (on the basis of last quarter) =
\( \frac{86.35 \times 123.64}{100} = 106.7 \)
Difference per quarter = \( \frac{6.7}{4} = 1.675 \).
Adjusted chain relatives are obtained by subtracting 1 × 1.675, 2 × 1.675 and 3 × 1.675
from the chain relatives of the 2nd, 3rd and 4th quarters respectively.
Disadvantages
The link relative method needs extensive calculations compared to other
methods and is not as simple as the method of moving average.
The average of link relatives contains both trend and cyclical components and
these components are eliminated by applying correction.
This method is an improvement over the simple averages method and this method
assumes a multiplicative model
i.e. Yt = Tt × St × Ct × Rt
Obtain the trend values by the least square method by fitting a mathematical
curve, either a straight line or second degree polynomial.
Express the original data as the percentage of the trend values. Assuming the
multiplicative model these percentages will contain the seasonal, cyclical and
irregular components.
Finally these indices obtained in step (3) are adjusted to a total of 1200 for
monthly and 400 for quarterly data by multiplying them throughout by a
constant K which is given by
\[ K = \frac{1200}{\text{Total of the indices}} \text{ for monthly} \]
\[ K = \frac{400}{\text{Total of the indices}} \text{ for quarterly} \]
Advantages
It has an advantage over the ratio to moving average method that in this
method we obtain ratio to trend values for each period for which data are
available where as it is not possible in ratio to moving average method.
Disadvantages
The main defect of the ratio to trend method is that if there are cyclical swings
in the series, the trend whether a straight line or a curve can never follow the
actual data as closely as a 12-monthly moving average does. So a seasonal
index computed by the ratio to moving average method may be less biased than
the one calculated by the ratio to trend method.
Example: Calculate Seasonal Indices by the Ratio to Trend Method from the
following data:
Year   I   II   III   IV
2003   30   40   36   34
2004   34   52   50   44
2005   40   58   54   48
2006   54   76   68   62
2007   80   92   86   82
Table columns: Year; Yearly totals; Yearly average (Y); Deviations from mid-year (X); XY; X^2; Trend values
The equation of the straight line is Yc = a + bX.
Since \( \sum X = 0 \),
\[ a = \frac{\sum Y}{N} = \frac{280}{5} = 56, \qquad b = \frac{\sum XY}{\sum X^2} = \frac{120}{10} = 12 \]
Quarterly increment = \( \frac{12}{4} = 3 \)
Consider 2003: the trend value for the middle of the year, i.e., half of the 2nd and
half of the 3rd quarter, is 32. The quarterly increment is 3. So the trend value of the 2nd quarter is
\( 32 - \frac{3}{2} \), i.e., 30.5, and for the 3rd quarter is \( 32 + \frac{3}{2} \), i.e., 33.5. The trend value
for the 1st quarter is 30.5 - 3, i.e., 27.5, and for the 4th quarter is 33.5 + 3, i.e., 36.5.
Trend Values
The given values are expressed as percentage of the corresponding trend values.
Thus for 1st quarter of 2003, the percentage shall be \( \frac{30}{27.5} \times 100 = 109.09 \).
Since the total is more than 400 an adjustment is made by multiplying each
average by \( \frac{400}{403.12} \) and final indices are obtained.
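The ratio to trend calculation for the 2003 to 2007 data can be reproduced end to end. The sketch below is illustrative and assumes an odd number of consecutive years of quarterly data, so that deviations X are measured from the middle year; the function name `ratio_to_trend_indices` is hypothetical. It fits a straight line to the yearly averages, derives quarterly trend values using a quarterly increment of b/4, averages the ratio to trend percentages for each quarter, and scales the averages to total 400.

```python
def ratio_to_trend_indices(quarterly_data):
    """Seasonal indices by the ratio to trend method (straight-line trend)."""
    years = sorted(quarterly_data)
    n = len(years)
    yearly_average = [sum(quarterly_data[year]) / 4 for year in years]
    # Deviations from the middle year (assumes an odd number of years)
    x = [i - (n - 1) / 2 for i in range(n)]
    a = sum(yearly_average) / n
    b = sum(xi * yi for xi, yi in zip(x, yearly_average)) / sum(xi * xi for xi in x)
    quarterly_increment = b / 4
    ratios = [[] for _ in range(4)]
    for xi, year in zip(x, years):
        mid_year_trend = a + b * xi
        for q, actual in enumerate(quarterly_data[year]):
            # Quarter centres lie 1.5 and 0.5 increments either side of mid-year
            trend = mid_year_trend + (q - 1.5) * quarterly_increment
            ratios[q].append(100 * actual / trend)
    averages = [sum(r) / len(r) for r in ratios]
    total = sum(averages)
    adjusted = [average * 400 / total for average in averages]
    return a, b, total, adjusted

data = {
    2003: [30, 40, 36, 34],
    2004: [34, 52, 50, 44],
    2005: [40, 58, 54, 48],
    2006: [54, 76, 68, 62],
    2007: [80, 92, 86, 82],
}
a, b, total, adjusted = ratio_to_trend_indices(data)
print(a, b, round(total, 2))
print([round(value, 2) for value in adjusted])
```

The script gives a = 56 and b = 12, and the quarterly averages total about 403.13 before adjustment, in line with the total quoted in the solution; the adjusted indices are roughly 92.05, 117.36, 102.13 and 88.46.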
UNIT-5
Lesson 9
Probability
Probability is the measure of the likelihood that an event will occur in a Random
Experiment. The probability of an event is a number between 0 and 1, where, roughly
speaking, 0 indicates impossibility of the event and 1 indicates certainty. The higher the
probability of an event P(E), the more likely it is that the event will occur.
Example
A simple example is the tossing of a fair (unbiased) coin. Since the coin is fair, the
two outcomes (“heads” and “tails”) are equally probable; and since no other outcomes
are possible, the probability of each outcome, “heads” or “tails”, is 1/2 (which could
also be written as 0.5 or 50%).
Formula of Probability
There are various terms utilized in the probability and statistics concepts, such as:
Random Experiment
Sample Space
Random variables
Expected Value
Independence
Variance
Mean
Random Experiment
For example, when we throw a die, the result is uncertain to us.
Sample Space
Suppose we throw a die. The sample space for this experiment is the set of all
possible outcomes of the throw: S = {1, 2, 3, 4, 5, 6}.
Random Variable
A discrete random variable is one which may take on only a countable number
of distinct values, for example the number shown when a die is thrown.
Independent Event
When the probability of occurrence of one event has no impact on the probability
of another event, then both the events are termed as independent of each other.
For example, if you flip a coin and at the same time throw a die, the
probability of getting a ‘head’ is independent of the probability of getting a 6 on the die.
Mean
The mean of a random variable is the average of its possible values, each weighted
by its probability.
Expected Value
For example, if we roll a die having six faces, the expected value is the
average of all the possible outcomes, i.e. 3.5.
Variance
Basically, the variance tells us how the values of the random variable are spread
around the mean value.
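As a small check on the mean, expected value and variance described above, the sketch below works them out for a fair six-sided die. It is an illustration added here, not part of the original text, and it uses exact fractions so that no rounding is involved.

```python
from fractions import Fraction

# A fair six-sided die: each face has probability 1/6.
faces = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)

mean = sum(x * p for x in faces)                    # expected value
variance = sum((x - mean) ** 2 * p for x in faces)  # spread around the mean

print(mean, variance)  # 7/2 35/12
```

The mean 7/2 is the 3.5 quoted above; the variance 35/12 is about 2.92.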
Probability Terms and Definition
Term | Definition | Example
Sample Space | The set of all the possible outcomes to occur in any trial | Tossing a coin: Sample Space (S) = {H, T}. Rolling a die: Sample Space (S) = {1, 2, 3, 4, 5, 6}
Experiment or Trial | A series of actions where the outcomes are always uncertain | Tossing a coin, selecting a card from a deck of cards, throwing a die
Impossible Event | The event cannot happen | In tossing a coin, it is impossible to get both head and tail at the same time
Example : A bucket contains 5 blue, 4 green and 5 red balls. Sudheer is asked to pick 2
balls randomly from the bucket without replacement and then one more ball is to be
picked. What is the probability he picked 2 green balls and 1 blue ball?
The probability of drawing a green ball first is 4/14, a second green ball is 3/13,
and then a blue ball is 5/12.
Probability of picking 2 green balls and then 1 blue ball = 4/14 × 3/13 × 5/12 = 5/182.
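The multiplication above can be verified with exact fractions. This is a minimal sketch added for illustration; it assumes the draw order green, green, blue used in the worked answer.

```python
from fractions import Fraction

blue, green, red = 5, 4, 5
total = blue + green + red  # 14 balls in the bucket

# Green, then green, then blue, drawn without replacement.
p = (Fraction(green, total)
     * Fraction(green - 1, total - 1)
     * Fraction(blue, total - 2))

print(p)  # 5/182
```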
Example : What is the probability that Ram will choose a marble at random and that it is
not black if the bowl contains 3 red, 2 black and 5 green marbles.
Solution:
Find the number of marbles that are not black and divide by the total number of
marbles. So, P(not black) = (3 + 5) / 10 = 8/10 = 4/5.
Types of Probability
Theoretical Probability
Experimental Probability
Example: A die is rolled 100 times. The number 3 is rolled 12 times. The relative
frequency of rolling a 3 is 12/100.
Subjective Probability
These are values (between 0 and 1 or 0 and 100%) assigned by individuals based
on how likely they think events are to occur.
Example: The probability of my being asked on a date for this weekend is 10%.
Probability Rules (or) Probability Models
2. The sum of the probabilities of all possible outcomes is 1 or 100%. If A, B, and C are
the only possible outcomes, then P(A) + P(B) + P(C) = 1.
Example: A bag contains 5 red marbles, 3 blue marbles, and 2 green marbles.
P(red) + P(blue) + P(green) = 5/10 + 3/10 + 2/10 = 1.
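3. The probability that an event A does not occur is 1 minus the probability that it
does occur.
P(A) + P(not A) = 1 or
P(not A) = 1 - P(A).
Example: A bag contains 5 red marbles, 3 blue marbles, and 2 green marbles.
For instance, P(not red) = 1 - P(red) = 1 - 5/10 = 5/10.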
4. If two events A and B are independent (this means that the occurrence of A has no
impact at all on whether B occurs and vice versa), then the probability of A and B
occurring is the product of their individual probabilities.
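P(A and B) = P(A) × P(B)
Example: For a head on a coin toss and a 6 on a die roll, $\frac{1}{2} \times \frac{1}{6} = \frac{1}{12}$.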
5. If two events A and B are mutually exclusive (meaning A cannot occur at the same time
as B occurs), then the probability of either A or B occurring is the sum of their individual
probabilities.
Example: A bag contains 5 red marbles, 3 blue marbles, and 2 green marbles.
$$P(\text{red or green}) = \frac{5}{10} + \frac{2}{10} = \frac{7}{10}$$
6. If two events A and B are not mutually exclusive (meaning it is possible that A and B
occur at the same time), then the probability of either A or B occurring is the sum of their
individual probabilities minus the probability of both A and B occurring.
Example: There are 20 people in the room: 12 girls (5 with blond hair and 7 with brown
hair) and 8 boys (4 with blond hair and 4 with brown hair). There are a total of 9 blonds
and 11 with brown hair. One person from the group is chosen randomly.
$$P(\text{girl or blond}) = \frac{12}{20} + \frac{9}{20} - \frac{5}{20} = \frac{16}{20}$$
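Rule 6 can be checked by counting people directly. The sketch below is an added illustration, not part of the original text; the `room` table and the helper `prob` are hypothetical names introduced here to hold the head counts given in the example.

```python
from fractions import Fraction

# (sex, hair colour, head count) for the 20 people in the room.
room = [("girl", "blond", 5), ("girl", "brown", 7),
        ("boy", "blond", 4), ("boy", "brown", 4)]
n = sum(count for _, _, count in room)

def prob(condition):
    """Probability that a randomly chosen person satisfies `condition`."""
    return Fraction(sum(c for s, h, c in room if condition(s, h)), n)

p_girl = prob(lambda s, h: s == "girl")
p_blond = prob(lambda s, h: h == "blond")
p_both = prob(lambda s, h: s == "girl" and h == "blond")
p_either = prob(lambda s, h: s == "girl" or h == "blond")

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
print(p_either == p_girl + p_blond - p_both, p_either)
```

Counting the people who are girls or blond gives the same answer as the formula, 16/20 (printed in lowest terms as 4/5).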
7. The probability of at least one event occurring out of multiple events is equal to one
minus the probability of none of the events occurring.
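Example: Toss a coin 4 times. What is the probability of getting at least one head on the
4 tosses?
$$P(\text{at least one head}) = 1 - \left(\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2}\right) = 1 - \frac{1}{16} = \frac{15}{16}$$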
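Rule 7 can be confirmed by listing every possible sequence. This is a minimal sketch added for illustration; it reads the example as four tosses of a fair coin, which is what the fractions 1/2 in the working imply.

```python
from fractions import Fraction
from itertools import product

# Rule 7: P(at least one head) = 1 - P(no heads) for 4 fair coin tosses.
p_rule = 1 - Fraction(1, 2) ** 4

# Cross-check by listing all 2**4 equally likely sequences.
outcomes = list(product("HT", repeat=4))
p_count = Fraction(sum("H" in seq for seq in outcomes), len(outcomes))

print(p_rule, p_count)  # 15/16 15/16
```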
8. If event B is a subset of event A, then the probability of B is less than or equal to the
probability of A.
P(B) ≤ P(A)
Example: There are 20 people in the room: 12 girls (5 with blond hair and 7 with brown
hair) and 8 boys (4 with blond hair and 4 with brown hair).
$$P(\text{girl with brown hair}) = \frac{7}{20} \le P(\text{girl}) = \frac{12}{20}$$
Probability of an Event
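If an event E can happen in r ways out of a total of n equally likely ways, then
P(E) = r/n
The probability that the event will not occur, known as its failure, is expressed as:
P(E’) = (n − r)/n = 1 − r/n
E’ represents that the event will not occur. Therefore, we can say: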
P(E) + P(E’) = 1
This means that the total of all the probabilities in any random test or experiment
is equal to 1.
Equally Likely Events
When the events have the same theoretical probability of happening, then they are
called equally likely events. The results of a sample space are called equally likely if all of
them have the same probability of occurring.
For example, if you throw a die, the probability of getting 1 is 1/6. Similarly,
the probability of getting each of the numbers 2, 3, 4, 5 and 6 is 1/6. Hence,
the following are some examples of equally likely events when throwing a die:
Complementary Events
Complementary events arise when there are only two possible outcomes: an event
either occurs or does not occur. A person coming or not coming to your house, and
getting or not getting a job, are examples of complementary events. The complement
of an event is its exact opposite, that is, the event not occurring. Some more
examples are:
Basic Rules of Probability
P(A∪B)=P(A)+P(B)−P(A∩B)
P(A∩B)=P(B) P(A|B)
Distribution
Probability Distribution
• Discrete Probability Distribution: Assigns probabilities (masses) to the
individual outcomes
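A discrete probability distribution lists each possible value the random variable
can assume, together with its probability. A probability distribution must satisfy the
following conditions:
1. The probability of each value is between 0 and 1: 0 ≤ P(x) ≤ 1.
2. The sum of all the probabilities is 1: ΣP(x) = 1.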
Guidelines
Let x be a discrete random variable with possible outcomes x1, x2, … , xn.
Make a frequency distribution for the possible outcomes.
Find the sum of the frequencies.
Find the probability of each possible outcome by dividing its frequency by the
sum of the frequencies.
Check that each probability is between 0 and 1 and that the sum is 1.
Mean
μ = Σx P(x)
Each value of x is multiplied by its corresponding probability and the products are
added.
Variance
σ² = Σ(x − μ)² P(x)
Standard Deviation
σ = √(σ²)
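The guidelines and formulas above can be put together in a few lines of code. The following is a sketch under stated assumptions: the function name `summarize` and the frequencies passed to it are hypothetical, made up purely for illustration, and the variance is taken as the probability-weighted squared deviation from the mean.

```python
from math import sqrt

def summarize(frequencies):
    """Turn a frequency table {x: frequency} into a probability
    distribution and return (probabilities, mean, variance, sd)."""
    total = sum(frequencies.values())
    probs = {x: f / total for x, f in frequencies.items()}
    # Check the two conditions for a probability distribution.
    assert all(0 <= p <= 1 for p in probs.values())
    assert abs(sum(probs.values()) - 1) < 1e-12
    mean = sum(x * p for x, p in probs.items())
    variance = sum((x - mean) ** 2 * p for x, p in probs.items())
    return probs, mean, variance, sqrt(variance)

# Hypothetical frequencies, made up purely for illustration.
probs, mean, variance, sd = summarize({0: 10, 1: 20, 2: 40, 3: 30})
print(probs, mean, variance, sd)
```

For these made-up frequencies the probabilities are 0.1, 0.2, 0.4 and 0.3, giving a mean of 1.9 and a variance of 0.89.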
Expected Value
The expected value of a discrete random variable is equal to the mean of the
random variable.
E(x) = μ=Σx P(x)
Binomial Distribution
The experiment is repeated for a fixed number of trials, where each trial is
independent of the other trials.
There are only two possible outcomes of interest for each trial. The outcomes
can be classified as a success (S) or a failure (F).
Notation: X~Bin(n,p)
In a binomial distribution, the probability of exactly x successes in n trials is
$$P(x) = {}^{n}C_{x}\, p^{x} q^{n-x} = \frac{n!}{(n-x)!\,x!}\, p^{x} q^{n-x}$$
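Notations
n = number of trials
p = probability of success in a single trial
q = 1 − p = probability of failure in a single trial
x = number of successes in n trials
Mean
Mean (μ) = np
Variance
Variance (σ²) = npq
Standard Deviation
Standard Deviation (σ) = √(npq)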
Example:
The number of trials (n) is 10. The probability of success (p) is 0.5.
Solution:
= 0.2051
Example :
He finds that 80% of the people who purchase motor insurance are men.
He wants to find the probability that, if 8 motor insurance owners are randomly
selected, exactly 5 of them are men.
Solution:
$$P(5) = {}^{8}C_{5}\,(0.8)^{5}(0.2)^{3} = 56 \times 0.32768 \times 0.008$$
= 0.14680064
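The binomial formula is easy to evaluate in code. The sketch below is an added illustration; `binomial_pmf` is a hypothetical helper name, not something defined in the text. It also evaluates the earlier example with n = 10 and p = 0.5. The number of successes in that example is not stated in the text, so the values x = 4 and x = 6 tried here are an assumption: both happen to give the quoted answer of 0.2051.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(exactly x successes in n independent trials), success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Motor insurance example: n = 8, p = 0.8, x = 5 (about 0.1468).
print(binomial_pmf(5, 8, 0.8))

# Earlier example with n = 10, p = 0.5: x is assumed, not given in the text.
print(round(binomial_pmf(4, 10, 0.5), 4), round(binomial_pmf(6, 10, 0.5), 4))
```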
Example :
Evans Electronics is concerned about a low retention rate for its employees.
In recent years, management has seen a turnover of 10% of the hourly employees
annually.
Choosing 3 hourly employees at random, what is the probability that 1 of them will
leave the company this year?
Notice that:
• There are two outcomes for each trial—the employee leaves (S) or the
employee stays (F).
The number of experimental outcomes with exactly x successes in n trials is
$$\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$$
In our Evans Electronics example, n = 3 and x = 1. Thus
$$\binom{3}{1} = \frac{3!}{1!\,(3-1)!} = \frac{(3)(2)(1)}{(1)(2)(1)} = \frac{6}{2} = 3$$
The probability of the first employee leaving and the second and third employees
staying, denoted (S, F, F), is given by
p(1 – p)(1 – p)
With a 0.10 probability of an employee leaving on any one trial, the probability of
an employee leaving on the first trial and not on the second and third trials is given by
(0.10)(0.90)(0.90) = (0.10)(0.90)² = 0.081
Two other experimental outcomes also result in one success and two failures. The
probabilities for all three experimental outcomes involving one success follow:
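Experimental outcome | Probability
(S, F, F) | p(1 – p)(1 – p) = (0.1)(0.9)(0.9) = 0.081
(F, S, F) | (1 – p)p(1 – p) = (0.9)(0.1)(0.9) = 0.081
(F, F, S) | (1 – p)(1 – p)p = (0.9)(0.9)(0.1) = 0.081
Total = 0.243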
Because the trials are independent, we can multiply probabilities. With p = 0.1 and
q = 0.9, the binomial formula gives
$$P(1) = {}^{3}C_{1}\, p^{1} q^{2} = 3 \times 0.1 \times 0.81 = 0.243$$
X | P(x) | Cumulative Probability
0 | 0.729 | 0.729
1 | 0.243 | 0.972
2 | 0.027 | 0.999
3 | 0.001 | 1.000
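The distribution table for the Evans Electronics example can be regenerated from the binomial formula. This is a minimal sketch added for illustration, using n = 3 and p = 0.10 from the example; the variable names are chosen here.

```python
from math import comb

n, p = 3, 0.10  # three employees, 10% annual turnover

cumulative = 0.0
rows = []
for x in range(n + 1):
    px = comb(n, x) * p ** x * (1 - p) ** (n - x)
    cumulative += px
    rows.append((x, round(px, 3), round(cumulative, 3)))
    print(rows[-1])

mean = n * p                 # np
variance = n * p * (1 - p)   # npq
print(mean, variance)
```

Each printed row gives X, P(x) and the cumulative probability, matching the table; the mean number of employees leaving is np = 0.3.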
Continuous Probability Distribution
• A continuous random variable can assume any value in an interval on the real
line or in a collection of intervals.
• It is not possible to talk about the probability of the random variable assuming
a particular value.
• Instead, we talk about the probability of the random variable assuming a value
within a given interval.
• The probability of the random variable assuming a value within some given
interval from x1 to x2 is defined to be the area under the graph of the probability
density function between x1 and x2.
Normal Distribution
50% of values less than the mean and 50% greater than the mean
Empirical Rule
The standard deviation is a measure of how spread out numbers are.
Around 68% of values are within 1 standard deviation of the mean.
Around 95% of values are within 2 standard deviations of the mean.
Around 99.7% of values are within 3 standard deviations of the mean.
The empirical rule is a quick way to get an overview of the data and check for any
outliers or extreme values that don’t follow this pattern.
If data from small samples do not closely follow this pattern, then other
distributions like the t-distribution may be more appropriate. Once you identify the
distribution of your variable, you can apply appropriate statistical tests.
Central Limit Theorem
The central limit theorem is the basis for how normal distributions work in
statistics.
In research, to get a good idea of a population mean, ideally you’d collect data
from multiple random samples within the population.
Law of Large Numbers: As you increase sample size (or the number of
samples), then the sample mean will approach the population mean.
With multiple large samples, the sampling distribution of the mean is approximately
normally distributed, even if your original variable is not normally distributed.
Parametric statistical tests typically assume that samples come from normally
distributed populations, but the central limit theorem means that this assumption isn’t
necessary to meet when you have a large enough sample.
You can use parametric tests for large samples from populations with any kind of
distribution as long as other important assumptions are met.
For small samples, the assumption of normality is important because the sampling
distribution of the mean isn’t known.
For accurate results, you have to be sure that the population is normally distributed
before you can use parametric tests with small samples.
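The central limit theorem can be seen in a small simulation. The sketch below is an illustration added here, not part of the original text. It assumes an exponential population with mean 1 (chosen only because it is clearly not normal), a sample size of 50, and a fixed random seed so that the run is repeatable.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential distribution (population mean 1),
    a strongly skewed, non-normal population."""
    return statistics.fmean(rng.expovariate(1.0) for _ in range(n))

n = 50
means = [sample_mean(n) for _ in range(2000)]

# The sample means cluster around the population mean, with a spread
# close to sigma / sqrt(n) = 1 / sqrt(50).
print(statistics.fmean(means), statistics.stdev(means), 1 / n ** 0.5)
```

A histogram of `means` would look roughly bell-shaped even though the individual draws are heavily skewed.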
The general formula for the probability density function of the normal distribution
is
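$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}, \qquad -\infty < x < \infty$$
where μ is the mean and σ is the standard deviation. The case μ = 0 and σ = 1 is
called the standard normal distribution, whose density is
$$f(x) = \frac{e^{-x^{2}/2}}{\sqrt{2\pi}}$$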
Since the general form of probability functions can be expressed in terms of the
standard distribution, all subsequent formulas in this section are given for the standard
form of the function.
The following is the plot of the standard normal probability density function.
The formula for the cumulative distribution function of the standard normal
distribution is
$$F(x) = \int_{-\infty}^{x} \frac{e^{-t^{2}/2}}{\sqrt{2\pi}}\, dt$$
Note that this integral does not exist in a simple closed formula. It is computed
numerically.
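To show what "computed numerically" can look like in practice, here is a minimal sketch. It is an added illustration: the function names are chosen here, the first version uses the error function from Python's standard library, and the second approximates the integral directly with the midpoint rule, starting from -10 as a stand-in for minus infinity.

```python
from math import erf, exp, pi, sqrt

def std_normal_pdf(x):
    """Standard normal density f(x) = exp(-x**2 / 2) / sqrt(2 * pi)."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def std_normal_cdf(x):
    """Standard normal CDF F(x), written in terms of the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def cdf_by_integration(x, lower=-10.0, steps=100_000):
    """Midpoint-rule approximation of the integral of the density up to x."""
    h = (x - lower) / steps
    return h * sum(std_normal_pdf(lower + (i + 0.5) * h) for i in range(steps))

print(std_normal_cdf(0.0), std_normal_cdf(1.0), cdf_by_integration(1.0))
```

The two approaches agree closely; F(0) is exactly 0.5 and F(1) is about 0.8413.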
You collect SAT scores from students in a new test preparation course. The data
follows a normal distribution with a mean score (M) of 1150 and a standard deviation (SD)
of 150.
Solution: Following the empirical rule:
Around 68% of scores are between 1000 and 1300, 1 standard deviation above
and below the mean.
Around 95% of scores are between 850 and 1450, 2 standard deviations above
and below the mean.
Around 99.7% of scores are between 700 and 1600, 3 standard deviations
above and below the mean.
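The three intervals in this solution can be reproduced with the normal distribution in Python's standard library. This is a sketch added for illustration, assuming the scores are exactly normal with mean 1150 and standard deviation 150.

```python
from statistics import NormalDist

scores = NormalDist(mu=1150, sigma=150)

shares = {}
for k in (1, 2, 3):
    low = scores.mean - k * scores.stdev
    high = scores.mean + k * scores.stdev
    # Share of scores lying within k standard deviations of the mean.
    shares[k] = scores.cdf(high) - scores.cdf(low)
    print(k, low, high, round(100 * shares[k], 1))
```

The shares come out at roughly 68%, 95% and 99.7%, for the intervals 1000 to 1300, 850 to 1450 and 700 to 1600.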
Standard Normal Distribution
Z-scores tell you how many standard deviations away from the mean each value
lies.
You only need to know the mean and standard deviation of your distribution to
find the z-score of a value.
z = (x − μ) / σ
x = individual value
μ = mean
σ = standard deviation
We convert normal distributions into the standard normal distribution for several
reasons:
To find the probability that a sample mean significantly differs from a known
population mean.
To find the probability of SAT scores in your sample exceeding 1380, find the z-
score.
The mean of our distribution is 1150, and the standard deviation is 150.
The z-score tells you how many standard deviations away 1380 is from the mean.
Formula: z = (x − μ) / σ
Calculation: z = (1380 − 1150) / 150 = 1.53
For a z-score of 1.53, the p-value is 0.937. This is the probability of SAT scores
being 1380 or less (93.7%), and it’s the area under the curve left of the shaded area.
To find the shaded area, you take away 0.937 from 1, which is the total area under
the curve.
That means it is likely that only 6.3% of SAT scores in your sample exceed 1380.
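The z-score and the two areas in this example can be computed directly. The following is a minimal sketch added for illustration, using the mean of 1150, the standard deviation of 150 and the score of 1380 from the text.

```python
from statistics import NormalDist

mean, sd, x = 1150, 150, 1380

z = (x - mean) / sd                 # about 1.53
p_below = NormalDist().cdf(z)       # area to the left of z
p_above = 1 - p_below               # area to the right of z

print(round(z, 2), round(p_below, 3), round(p_above, 3))
```

The area to the right is about 0.063, in line with the 6.3% quoted above.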
Lesson 10
Hypothesis Testing
Hypothesis
The null hypothesis (H0) often represents either a skeptical perspective or a claim
to be tested. The alternative hypothesis (H1) represents an alternative claim under
consideration and is often represented by a range of possible parameter values.
The sceptic will not reject the null hypothesis (H0), unless the evidence in favour
of the alternative hypothesis (H1) is so strong that she rejects H0 in favor of H1.
Null Hypothesis (H0)
Example:
It is hypothesized that flowers watered with lemonade will grow faster than
flowers watered with plain water.
Example:
Null hypothesis: If one plant is fed lemonade for one month and another is fed plain
water, there will be no difference in growth between the two plants.
Alternative hypothesis: If one plant is fed lemonade for one month and another is
fed plain water, the plant that is fed lemonade will grow more than the plant that is fed
plain water.
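In hypothesis testing, we want to test whether H1 is “likely” true. So, there are two
possible outcomes:
Reject H0 in favour of H1, because the sample provides sufficient evidence for H1.
Do not reject H0, because the sample provides insufficient evidence to support H1.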
Failure to reject H0 does not mean the null hypothesis is true. There is no formal
outcome that says “accept H0”. It only means that we do not have sufficient evidence to
support H1.
Example:
H0 : defendant is innocent.
H1 : defendant is guilty.
H0 (innocent) is rejected if H1 (guilty) is supported by evidence beyond
“reasonable doubt”. Failure to reject H0 (failure to prove guilt) does not imply
innocence, only that the evidence is insufficient to reject it.
H0: µ = 3 College students have been in 3 exclusive relationships, on average.
H1: µ > 3 College students have been in more than 3 exclusive relationships, on
average.
For this case, the confidence interval spans from 2.7 to 3.7, and the null value µ = 3 is
included in the interval. Any value within this range could conceivably be the true
population mean; therefore we cannot reject the null hypothesis in favor of the
alternative.
This is a quick and dirty approach to hypothesis testing. However, it doesn’t tell us
the likelihood of a certain outcome under the null hypothesis. In other words, it does not
tell us the p-value.
Note:
The p-value is a way of quantifying the strength of the evidence against the null
hypothesis and in favor of the alternative. Formally, the p-value is a conditional probability.
p-Value:
When n = 50, sample mean x̄ = 3.2 and s = 1.74,
$$SE = \frac{s}{\sqrt{n}} = \frac{1.74}{\sqrt{50}} = 0.246$$
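The standard error, and the interval from 2.7 to 3.7 quoted earlier, can be reproduced as follows. This is a sketch under an assumption: the text does not state the confidence level, so a 95% interval with a normal critical value is assumed here because it matches the quoted end points.

```python
from math import sqrt
from statistics import NormalDist

n, sample_mean, s = 50, 3.2, 1.74

se = s / sqrt(n)                          # standard error of the mean
z_star = NormalDist().inv_cdf(0.975)      # about 1.96 (assumed 95% level)
ci = (sample_mean - z_star * se, sample_mean + z_star * se)

print(round(se, 3), [round(v, 1) for v in ci])
```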
We are trying to find P(x̄ > 3.2 | H0: µ = 3), the probability of a sample mean above
3.2 given that the null hypothesis is true.
Since we are assuming the null hypothesis to be true, we can use it to construct the
sampling distribution based on the Central Limit Theorem:
x̄ ~ N(mean = 3, SE = 0.246)
Here 3 comes from the null hypothesis, as we are assuming the null hypothesis is true.
The area of interest for the p-value is the area under this curve to the right of 3.2.
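The Z-score can be calculated by this formula:
$$Z = \frac{\bar{x} - \mu}{SE} = \frac{3.2 - 3}{0.246} = 0.81$$
so the p-value is P(Z > 0.81) = 0.209.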
● We use the test statistic to calculate the p-value, the probability of observing
data at least as favorable to the alternative hypothesis as our current data, if the
null hypothesis was true.
● If the p-value is low (lower than the significance level α, which is usually 5%)
we say that it would be very unlikely to observe the data if the null hypothesis
were true, and hence reject H0.
● If the p-value is high (higher than α) we say that it is likely to observe the data
even if the null hypothesis were true, and hence do not reject H0.
Since the p-value for this case is 0.209, which is higher than 0.05, we do not reject
the null hypothesis.
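The one-sided test above can be carried out in a few lines. The following is a minimal sketch added for illustration; it uses the normal approximation from the text, and the variable names are chosen here.

```python
from math import sqrt
from statistics import NormalDist

n, sample_mean, s, mu0 = 50, 3.2, 1.74, 3
alpha = 0.05

se = s / sqrt(n)
z = (sample_mean - mu0) / se
p_value = 1 - NormalDist().cdf(z)    # P(sample mean > 3.2 | mu = 3)

print(round(z, 2), round(p_value, 3))
print("reject H0" if p_value < alpha else "do not reject H0")
```

Because the script does not round z to 0.81 before looking up the area, its p-value differs from 0.209 in the third decimal place; the decision is the same.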
What does that mean in the context of this question? Our null hypothesis was that
college students have been in 3 exclusive relationships on average, and the alternative
hypothesis was that college students have been in more than 3 exclusive relationships
on average.
In this case, we fail to reject the null hypothesis, as we do not have enough evidence to
reject it.
The null hypothesis sets the population average number of exclusive relationships
college students have been in to 3. Under that assumption, the probability of obtaining
a sample mean of 3.2 or more is 0.209.
This is a pretty high probability, so we think that a sample mean of 3.2 or more
exclusive relationships is likely to happen simply by chance.
● These data do not provide convincing evidence that college students have been
in more than 3 relationships on average.
● The difference between the null value of 3 relationships and the observed
sample mean of 3.2 relationships may be due to chance or sampling variability.
Often instead of looking for a divergence from the null in a specific direction,
we might be interested in divergence in any direction.
The definition of a p-value is the same regardless of doing a one-sided or two-sided
test; however, the calculation is slightly different, since we need to consider “at least
as extreme as the observed outcome” in both directions away from the mean.
For the above example, if we want to do two-sided hypothesis testing, then we
have to find P(x̄ > 3.2 or x̄ < 2.8 | H0: µ = 3).
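For the two-sided version, both tails are added together. The following is a minimal sketch added for illustration, again using the normal approximation and the sample values from the example.

```python
from math import sqrt
from statistics import NormalDist

n, sample_mean, s, mu0 = 50, 3.2, 1.74, 3

se = s / sqrt(n)
z = abs(sample_mean - mu0) / se

# Both tails: P(mean > 3.2) + P(mean < 2.8) under H0, equal by symmetry.
p_two_sided = 2 * (1 - NormalDist().cdf(z))
print(round(p_two_sided, 3))
```

The two-sided p-value is about twice the one-sided value, so the null hypothesis is again not rejected at the 5% level.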
Type I and Type II Errors
Type I Error
The probability of making a type I error, that is, rejecting a null hypothesis that is
actually true, is the significance level. Suppose the significance level is 5%. Then,
even if the null hypothesis is true, 5% of the time we will get a statistic that makes us
reject it. So, one way to think about the probability of a type I error is as the
significance level.
Type II Error
A type II error is also known as a false negative and occurs when we fail to reject a
null hypothesis which is really false. Here the researcher concludes there is not a
significant effect, when actually there really is.
The probability of making a type II error is called Beta (β), and this is related to
the power of the statistical test (power = 1- β). You can decrease your risk of committing a
type II error by ensuring your test has enough power.
Power Test
The power of a test is the probability of doing the right thing when the null
hypothesis is not true, i.e., the probability of rejecting the null hypothesis when it is false.
Types of Hypothesis Tests
Spearman’s Rank Correlation
Kendall’s Rank Correlation
Chi-Squared Test
Reference Books
1. Gupta, S.P. (1995) Statistical Methods, Sultan Chand and Sons, New Delhi.
4. Cole, J.P. and King, C.A.M. (1968) Quantitative Techniques in Geography. John
Wiley & sons Inc. New York.
6. Burt, J.E., Barber, G.M., and Rigby, D.L. (2009) Elementary Statistics for
Geographers (3/E), The Guilford Press, New York.