Page 2 of 181
STA 201: Statistics I
It is with great pleasure that I welcome you as learners to the Olabisi Onabanjo University
Open and Distance Learning Centre.
The massification and democratisation of higher education through Open and Distance
Learning, as advocated globally, has long been one of the goals of Olabisi Onabanjo
University Management. Hence, Open and Distance Learning has been one of my areas of
focus since my assumption of duty. Through the efforts of the University Governing Council
and Senate, the establishment of the Open and Distance Learning Centre was approved in
July 2016.
Open and Distance Learning is a mode of study that affords tertiary education
opportunities to all, regardless of age, gender, location, space and other limiting factors.
Quite a large number of qualified applicants for tertiary education are denied admission
yearly. There are also several others who wish to advance educationally but cannot
because of the jobs that are their means of livelihood.
Olabisi Onabanjo University, via its Open and Distance Learning Centre, offers quality,
technology-driven, flexible, self-directed and cost-effective tertiary education. It is a viable
option for learners who wish to study online from their own location and at a time of their choosing.
This course material provides learners with vital information relevant to our programme
and schedules. I advise learners to make judicious use of it. I congratulate our Open and
Distance Learning Centre Staff, Department and Faculty for their effort towards the
production of this handbook.
I hope your learning experience with the Olabisi Onabanjo University Open and Distance
Learning Centre is memorable and exciting.
STA 201, titled Statistics I, is a 3-unit course for students studying towards a
Bachelor of Science in Accounting. The course is divided into 8 study sessions. The
course will introduce you to the basic statistical concepts used in solving practical problems.
The course study guide therefore gives you an overview of what STA 201 is all about, the
textbooks and other materials to be referenced, what you are expected to know in each
unit and how to work through the course materials. Define a set and identify various
notations of sets, present statistical data in various ways and know the applications of
This course is a 3-unit course divided into 8 study sessions. You are enjoined to spend at
least 3 hours studying the content of each study unit.
The overall aim of this course, STA 201, is to introduce you to statistics: presentation of
data, measures of central tendency and dispersion, probability, random variables and
statistical hypotheses, analysis of categorical data, regression and correlation analysis,
and analysis of variation.
Course Aims
This course aims to introduce students to the basic statistical terms. It is expected that the
knowledge will help the reader to effectively use mathematics principles to solve even life
Course Objectives
It is important to note that each unit has specific objectives. You should study them
carefully before proceeding to subsequent units. Therefore, it may be useful to refer to
these objectives in the course of your study of the unit to assess your progress. You should
always look at the unit objectives after completing a unit. In this way, you can be sure that
you have done what is required of you by the end of the unit.
However, the overall objective of STA 201 is to give basic knowledge of data
presentation, interpretation and analysis, and familiarity with the techniques to use them
In order to have a thorough understanding of the course units, you will need to read and
understand the contents, practice the steps by designing and implementing a mini
computer application system for your department and be committed to learning and
implementing your knowledge.
This course is designed to cover approximately fifteen weeks and it will require your
devoted attention. You should do the exercises in the Tutor-Marked Assignments and
submit to your tutors via the Learning Management System (LMS).
Course Materials
1. Course Guide
2. Printed Lecture materials
3. Text Books
4. Interactive DVD
5. Electronic Lecture materials via LMS
6. Tutor Marked Assignments
There are two aspects to the assessment of this course. First, there are tutor marked
assignments and second, the written examinations. Therefore, you are expected to take
note of the facts, information and problem solving gathered during the course. The tutor
marked assignments must be submitted to your tutor for formal assessment in accordance
with the deadline given. The work submitted will count for 30% of your total course mark.
At the end of the course, you will need to sit for a final written examination. This
examination will account for 70% of your total score. You will be required to submit some
assignments by uploading them to STA 201 page on the Learning Management System
There are TMAs in this course, and you need to submit all of them; the best 10 will
be counted. When you have completed each assignment, send it to your tutor
as soon as possible and make certain that it reaches your tutor on or before the stipulated
deadline. If for any reason you cannot complete your assignment on time, contact your
tutor before the assignment is due to discuss the possibility of an extension. Extensions will
not be granted after the deadline, except in extraordinary cases.
The final examination for STA 201 will last for a period of not more than 2 hours and has a
value of 70% of the total course grade. The examination will consist of questions which
reflect the Self-Assessment Questions (SAQs), In-text Questions (ITQs), some applied
questions and tutor marked assignments that you have previously encountered.
Furthermore, all areas of the course will be examined. It would be better to use the time
between finishing the last unit and sitting for the examination to revise the entire course.
You might find it useful to review your TMAs and comment on them before the
examination. The final examination covers information from all parts of the course. Most
examinations will be conducted via Computer Based Testing (CBT).
A few hours of face-to-face tutorials are provided in support of this course. You will be
notified of the dates, time and location together with the name and phone number of your
tutor as soon as you are allocated a tutorial group. Your tutor will mark and comment on
your assignments, keep a close watch on your progress and on any difficulties you might
encounter and provide assistance to you during the course. You must submit your tutor
marked assignment to your tutor well before the due date. At least two working days are
required for this purpose. They will be marked by your tutor and returned as soon as
possible via the same means of submission.
Do not hesitate to contact your tutor by telephone, e-mail or discussion board if you need
help. The following might be circumstances in which you would find help necessary:
contact your tutor if:
You do not understand any part of the study units or the assigned readings.
You should endeavour to attend the tutorials. This is the only opportunity to have face-to-
face contact with your tutor and ask questions which are answered instantly. You can raise
any problem encountered in the course of your study. To gain the maximum benefit from
the course tutorials, have some questions handy before attending them. You will learn a
lot from participating actively in discussions.
Good luck!
Recommended Texts
The following texts and Internet resource links will be of enormous benefit to you in
learning this course:
1. Probability and statistics for engineers & scientists by Walpole and Myers.
2. Introduction to Statistics. Jedidiah Publishers by Sojobi O.A.
3. Schaum’s Outline Series Theory and Problems of Probability (S.I. Metric) Edition
McGraw Hill Book Company, New York by Seymour L.
4. An Introduction to Statistical Methods. Vikas Publishing House. Delhi by GUPTA C.
There have been advances in every sector, including commerce, economics and
mathematics, as well as in areas of everyday life such as defence, banking and
hospitality. All of these depend largely on statistics.
Statistics is the science of learning from experience, especially experience that arrives a
little bit at a time. It can be generally defined as a scientific methodology used for the
collection, presentation, analysis and interpretation of data in order to reach sound
decisions and conclusions.
Statistics can be broadly classified into Descriptive and Inferential Statistics. Descriptive
statistics are brief descriptive coefficients that summarize a given data set, which can
represent either an entire population or a sample of it. Descriptive statistics are
broken down into measures of central tendency and measures of variability. Inferential
Statistics involves using data from a sample to make inferences about the larger
population from which the sample was drawn.
Statistical data can be either a variable or an attribute in nature. A variable is measurable
and can be either discrete or continuous, while an attribute is non-measurable.
i. Statistics deals with only those subjects of inquiry which are capable of being
quantitatively measured and numerically expressed.
ii. It deals only with aggregates of facts and no importance is attached to
individual items.
iii. Statistical results might be misleading if data collection is faulty
iv. Statistics can be used to establish wrong conclusions and therefore, can be used
only by experts.
i. Define Statistics
ii. What are the classes of statistics?
A population is the totality of the individual observations about which inferences are to be
made. In statistics, a population is the entire pool from which a statistical sample is drawn.
Though the biological definition of the term “population” is the totality of individuals of a
given species at a given time in a given area, population in statistics always means the
totality of the individual observations about which inferences are to be made. A
population can thus be said to be an aggregate of observations on subjects grouped together by
a common feature. Examples are the weights or tail lengths of all albino rats, the number of
newborn babies in Nigeria, the hemoglobin or serum protein levels of adults, the nutrient
contents of varieties of foods, the number of workers in commercial banks in Nigeria and the
number of students offering management sciences courses in Olabisi Onabanjo University.
A sample is a part of the population. A large number of samples may be taken from the same
population, though not all members of the population may be covered. Inferences drawn from
the sample refer to the defined population from which the sample or samples are drawn.
Basically, there are two major sources of data, namely primary and secondary sources of
data collection. Primary Sources refer to the statistical data or information which the
investigator originates himself for the purpose of the enquiry at hand. Examples are
census, surveys and experiments. Secondary sources refer to those statistical data which
are not originated by the investigator himself, but which he obtains from someone else’s
records or from some organization, either in published or unpublished forms. Examples
include publications of the National Bureau of Statistics (NBS), Central Bank of Nigeria
(CBN), National Population Commission (NPC) and World Health Organization (WHO).
i. Questionnaires
ii. Interview (which may be Telephone, Personal or Indirect)
iii. Experiment
iv. Observation
v. Group discussion
i. What is population?
ii. List the two major sources of data
i. Population is the totality of the individual observations about which inferences are
to be made
ii. Primary and Secondary
1.3 Questionnaire
8. Questions that rely too much on memory should be avoided, since some people
forget events quickly.
i. Define Questionnaire
ii. What are the types of questionnaire?
Now that you have completed this study session, you can assess how well you have
achieved its learning outcomes by answering these questions. Write your answers in your
study diary and discuss them with your tutor at the next study support meeting. You can
check your answers with the notes on the Self-Assessment Questions at the end of this
1. What is statistics?
2. Explain the following:
a. Data
b. Sample
c. Variable
d. Attribute
e. Descriptive Statistics
f. Discrete Variable
g. Observation
h. Population
i. Statistical method
j. Continuous variable
3. a. Explain what you understand by Questionnaire.
b. Outline the types of Questionnaire.
4. a. Outline the functions of Statistics.
b. State the limitations of Statistics.
5. Mention the two types of data and illustrate with examples.
6. Discuss the sources of data and the various methods of data collection.
Glossary of Terms
Measures of central tendency: a single value that attempts to describe a set of data by
identifying the central position within that set of data
Measures of variability: describe how far apart data points lie from each other and from
the center of a distribution
Once data has been collected, it has to be classified and organized in such a
way that it becomes easily readable and interpretable, that is, converted into
information. Before the calculation of descriptive statistics, it is sometimes a
good idea to present data as tables, charts, diagrams or graphs. Most people find
‘pictures’ much more helpful than ‘numbers’ because, in their opinion, pictures
present data more meaningfully.
Tabulation is the act of arranging facts and figures in the form of table(s) or a list. In order
to make the data easily understandable, the first task of the statistician is to condense and
simplify them in such a manner that irrelevant details are eliminated and their significant
features stand out prominently. The procedure that is adopted for this purpose is known as the
method of classification and tabulation.
Data presentation is the representation of data in an appropriate form, through charts,
diagrams or graphs, in order to make comparison and understanding easy. No matter how
informative and well designed a statistical table is, a good chart, diagram or graph
complements it by conveying to the reader an immediate and clear impression of its content.
The most popular charts, diagrams and graphs are pie charts, bar diagrams (bar
chart and histogram) and graphs (frequency polygons and Ogives).
A pie chart is simply a circle divided into sections. This circle represents the total of the
data being presented and each section is drawn proportional to its relative size. The main
advantage of a pie chart is that it is easy to understand.
An investigation of the marital status of the staff of a known commercial bank in Nigeria
reveals the following distribution:
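Total number of staff = 35 + 130 + 25 + 10 = 200

Marital status   Frequency   Angle of sector
Single           35          (35/200) × 360° = 63°
Married          130         (130/200) × 360° = 234°
Widowed          25          (25/200) × 360° = 45°
Divorced         10          (10/200) × 360° = 18°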
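The sector angles above can be checked with a short Python sketch. This is only an illustration and not part of the course material; the variable names are assumptions, and the frequencies are the ones summed in the example (35, 130, 25 and 10).

```python
# Sector angle for each category = (frequency / total) * 360 degrees.
# Frequencies are the marital-status counts from the example.
frequencies = {"Single": 35, "Married": 130, "Widowed": 25, "Divorced": 10}

total = sum(frequencies.values())  # total number of staff

# Multiply before dividing so the arithmetic stays exact for these counts.
angles = {status: count * 360 / total for status, count in frequencies.items()}

for status, angle in angles.items():
    print(f"{status:<9}{angle:6.1f} degrees")
```

The angles always add up to 360°, which is a quick way to catch an arithmetic slip before drawing the chart.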
[Figure: pie chart of the marital status of staff. Labels: Single 63 (17%), Married 234 (65%), Widowed 45 (13%).]
Observation: the chart clearly shows that the majority of the staff in the institution are married.
Bar charts can be simple, multiple or component in nature. A simple bar chart
consists of a number of equally spaced rectangles.
A multiple bar chart is usually used in the comparison of two or more attributes.
A component bar chart consists of bars which are subdivided into components.
Bar Chart
Table 2.2
2.2.3 Histogram
Histogram and bar charts look alike in presentation, but while the bars of the bar charts
are usually not joined, those of the histogram are usually joined. Furthermore, while the
bar chart attaches importance only to its heights, histogram attaches importance to both
heights and the widths.
A frequency polygon is obtained by joining the midpoints of the tops of the rectangles of a
histogram.
To obtain a cumulative frequency curve, we plot the cumulative frequencies against the
upper class boundaries of the class intervals.
The shape of the cumulative frequency curve is usually like that of an elongated S.
1. Tabulation is an act of arranging facts and figures in the form of table(s) or list.
2. Data presentation is the representation of data in appropriate form in order to make
the comparison and understanding easy through charts, diagram or graph.
3. Data can be represented in bar chart, histogram, pie chart, cumulative frequency
curve etc.
Now that you have completed this study session, you can assess how well you have
achieved its learning outcomes by answering these questions. Write your answers in your
study diary and discuss them with your tutor at the next study support meeting. You can
check your answers with the notes on the Self-Assessment Questions at the end of this
Married 480
Separated 120
Divorce 330
Widow 400
6. The following data gives the enrolment of students from STA201 in some sessions
in Olabisi Onabanjo University, Ago-Iwoye.
Glossary of Terms
Do all the players in a soccer team have the same height, weight, or years of
experience? Of course not! If we make a table to record each player's
‘years of experience’, we get a ‘distribution of data’. Knowing how many
experienced players there are can help a lot in predicting the kind of game this team
will play. To extract such useful information from a
distribution of data, we need to study and summarize it, and to do that we
need to know about measures of central tendency and dispersion. In layman’s terms,
central tendency is nothing but the ‘average’. Dispersion helps in evaluating how near or far
the other values are from this average value.
A measure of central tendency is a summary statistic that represents the center point or
typical value of a dataset. These measures indicate where most values in a distribution fall
and are also referred to as the central location of a distribution. You can think of it as the
tendency of data to cluster around a middle value. In statistics, the three most common
measures of central tendency are the mean, median, and mode. Each of these measures
calculates the location of the central point using a different method.
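The arithmetic mean of a series is obtained by adding the values of all observations and
dividing the total by the number of observations. This is what is generally called the average. In
symbols, if x1, x2, …, xn are n observed values, then the mean is given by:

$$\bar{x} = \frac{\text{Total of all individual values}}{\text{sample size}} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

For example, for the six values 50, 60, 40, 40, 40 and 70:

$$\bar{x} = \frac{50 + 60 + 40 + 40 + 40 + 70}{6} = \frac{300}{6} = 50$$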
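As a quick check of this arithmetic, here is a minimal Python sketch (an illustration only; the variable names are assumptions). It computes the mean by hand and compares it with `statistics.mean` from the Python standard library.

```python
from statistics import mean

# The six values from the worked example.
values = [50, 60, 40, 40, 40, 70]

# Mean = total of all values divided by the number of values.
manual_mean = sum(values) / len(values)

print(manual_mean)   # computed by hand
print(mean(values))  # standard-library result, for comparison
```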
Case Study 3.2
59, 53, 66, 55, 57, 65, 48, 59, 51, 58, 52, 68, 60, 70, 71, 55, 70, 64, 54, 67, 62, 53, 49, 56,
63, 48, 57, 61, 58, 55, 50, 55, 61, 52, 54, 65, 56, 50, 62, 60
Obtain a frequency distribution and calculate the mean weight of the students.
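Weights (kg)   f    x    fx
48 – 52        8    50   400
53 – 57        12   55   660
58 – 62        10   60   600
63 – 67        6    65   390
68 – 72        4    70   280
Total          40        2330

Here f is the class frequency and x is the class midpoint.

$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{2330}{40} = 58.25 \text{ kg}$$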
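The frequency distribution and grouped mean of Case Study 3.2 can be reproduced with the following Python sketch. It is an illustration under stated assumptions: the class limits 48–52, 53–57, 58–62, 63–67 and 68–72 are taken from the class midpoints in the table, and the variable names are hypothetical.

```python
# Raw weights (kg) of the 40 students in Case Study 3.2.
weights = [
    59, 53, 66, 55, 57, 65, 48, 59, 51, 58, 52, 68, 60, 70, 71, 55, 70, 64,
    54, 67, 62, 53, 49, 56, 63, 48, 57, 61, 58, 55, 50, 55, 61, 52, 54, 65,
    56, 50, 62, 60,
]

# Class limits (inclusive), assumed from the midpoints 50, 55, 60, 65, 70.
classes = [(48, 52), (53, 57), (58, 62), (63, 67), (68, 72)]

# Frequency of each class: count the weights that fall inside its limits.
freqs = [sum(lo <= w <= hi for w in weights) for lo, hi in classes]

# Class midpoints.
midpoints = [(lo + hi) / 2 for lo, hi in classes]

# Grouped mean = sum(f * x) / sum(f).
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
grouped_mean = sum_fx / sum(freqs)

print(freqs, sum_fx, grouped_mean)
```

The grouped mean is an approximation of the mean of the raw data, because every weight in a class is treated as if it sat at the class midpoint.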
This is the value or number that has the highest frequency in a distribution. The mode may
not exist and even when it does exist, it may not be unique.
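The following scores were obtained in a STA201 test: 5, 2, 4, 7, 5, 3. Find the mode of the test scores.

For grouped data, the mode is estimated by:

$$\text{Mode} = L + \left(\frac{f_m - f_a}{2f_m - f_a - f_b}\right) C$$

where L is the lower class boundary of the modal class, f_m is the frequency of the modal class, f_a and f_b are the frequencies of the classes immediately before and after the modal class, and C is the class width.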
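If a set of data is arranged in order of magnitude, the middle value, which divides the set
into two equal parts, is the median. Generally, for N data values:

$$\text{Median} = \left(\frac{N + 1}{2}\right)^{\text{th}} \text{ item}$$

Find the median of the following test scores in STA201: (a) 3, 6, 2, 4, 3 (b) 2, 5, 3, 4, 8, 3

a. Arrangement in order: 2, 3, 3, 4, 6
Here N = 5

$$\text{Median} = \left(\frac{5 + 1}{2}\right)^{\text{th}} \text{ item} = \text{the 3rd item} = 3$$

b. Arrangement in order: 2, 3, 3, 4, 5, 8
Here N = 6

$$\text{Median} = \left(\frac{6 + 1}{2}\right)^{\text{th}} \text{ item} = \text{the 3.5th item}$$

i.e. the mean of the 3rd and 4th items, (3 + 4)/2 = 3.5.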
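The positional rule for the median, and the mode of the test scores 5, 2, 4, 7, 5, 3 given earlier, can be sketched in Python as follows. The helper `median_by_position` is a hypothetical name introduced for illustration; `statistics.median` and `statistics.mode` are standard-library functions used for comparison.

```python
from statistics import median, mode


def median_by_position(data):
    """Return the ((N + 1) / 2)th item of the ordered data.

    When the position is not a whole number (N even), the two
    neighbouring items are averaged.
    """
    ordered = sorted(data)
    position = (len(ordered) + 1) / 2      # 1-based position of the median
    lower = ordered[int(position) - 1]     # item at the floor of the position
    if position.is_integer():
        return lower
    upper = ordered[int(position)]         # next item up
    return (lower + upper) / 2


scores_a = [3, 6, 2, 4, 3]
scores_b = [2, 5, 3, 4, 8, 3]
test_scores = [5, 2, 4, 7, 5, 3]

print(median_by_position(scores_a), median(scores_a))
print(median_by_position(scores_b), median(scores_b))
print(mode(test_scores))  # the most frequent score
```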
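The median can be obtained graphically from the cumulative frequency curve (Ogive) or
by calculation using the formula below. {Refer to the ogive diagram above}

$$\text{Median} = L + \left(\frac{\frac{N}{2} - F}{f}\right) C$$

where
L = lower class boundary of the class containing the median,
N = total frequency,
F = cumulative frequency of the class immediately before (listed just above) the one containing the median,
f = frequency of the class containing the median,
C = class width.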
Using the following as the frequency distribution of money saved in O.O.U Micro finance
bank over a period of time
Construct the histogram and hence estimate the mode of the distribution.
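i.
$$\text{Mode} = L + \left(\frac{f_m - f_a}{2f_m - f_a - f_b}\right) C$$

$$\text{Mode} = 52.5 + \left(\frac{12 - 8}{2(12) - 8 - 10}\right) \times 5 \approx 55.83$$

iii.
$$\text{Median} = L + \left(\frac{\frac{N}{2} - F}{f}\right) C$$

$$\frac{N}{2} = \frac{40}{2} = 20$$

i.e. the median is the 20th value. From the cumulative frequency distribution
table, the 20th item falls within the class 53 – 57. Thus the median class is 53 – 57; hence L =
52.5, F = 8, f = 12 and C = 5.

$$\text{Median} = 52.5 + \left(\frac{20 - 8}{12}\right) \times 5 = 57.5$$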
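The two grouped-data formulas used in this calculation can be written as small Python functions. This is a sketch for checking the arithmetic only; the function names are hypothetical, and the parameter names follow the symbols in the worked solution (L, f_m, f_a, f_b, C, N, F, f).

```python
def grouped_mode(L, f_m, f_a, f_b, C):
    """Mode = L + (f_m - f_a) / (2*f_m - f_a - f_b) * C."""
    return L + (f_m - f_a) / (2 * f_m - f_a - f_b) * C


def grouped_median(L, N, F, f, C):
    """Median = L + (N/2 - F) / f * C."""
    return L + (N / 2 - F) / f * C


# Values substituted in the worked solution.
mode_estimate = grouped_mode(L=52.5, f_m=12, f_a=8, f_b=10, C=5)
median_estimate = grouped_median(L=52.5, N=40, F=8, f=12, C=5)

print(round(mode_estimate, 2), median_estimate)
```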
In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a
distribution is stretched or squeezed. Common examples of measures of
statistical dispersion are the variance, standard deviation, and interquartile range.
The Range is one of the measures of dispersion, and is defined as the difference between
the largest and smallest items of the sample of observations.
Given the following observations as the number of students who failed to attend STA201
class in the last 5 weeks: 5, 6, 7, 8 and 9. Find the range.
The range is 9 – 5 = 4
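The quartile deviation (semi-interquartile range) is given by:

$$Q = \frac{1}{2}(Q_3 - Q_1)$$

where Q1 and Q3 are the first and third quartiles respectively. Quartile deviation is
less affected by extreme values than the range, since it depends only on the first and third quartiles.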
The mean deviation is the arithmetic mean of the absolute values of the deviations from
some average like mean or median or mode.
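$$\text{Mean deviation} = \frac{\sum f_i \,|x_i - \bar{x}|}{\sum f_i} \quad \text{for grouped data}$$

$$\text{Mean deviation} = \frac{\sum |x_i - \bar{x}|}{n} \quad \text{for ungrouped data}$$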
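A minimal Python sketch of the mean deviation about the mean is given below. It is an illustration only: the function names are hypothetical, and the data 5, 6, 7, 8, 9 are borrowed from the range example as an assumption, since no worked example accompanies the formulas.

```python
def mean_deviation(values):
    """Mean of the absolute deviations from the arithmetic mean (ungrouped data)."""
    centre = sum(values) / len(values)
    return sum(abs(x - centre) for x in values) / len(values)


def grouped_mean_deviation(midpoints, freqs):
    """Mean deviation for grouped data, using class midpoints and frequencies."""
    n = sum(freqs)
    centre = sum(f * x for f, x in zip(freqs, midpoints)) / n
    return sum(f * abs(x - centre) for f, x in zip(freqs, midpoints)) / n


data = [5, 6, 7, 8, 9]
md = mean_deviation(data)  # deviations from the mean 7 are 2, 1, 0, 1, 2

print(md)
```

With every frequency set to 1, the grouped formula reduces to the ungrouped one, which is a convenient consistency check.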
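This is the most commonly used measure of variation or dispersion. It takes into account
all the values of the variable. Standard deviation (SD) is defined as the square root of the
arithmetic mean of the squared deviations of the individual values from their arithmetic
mean. The formula for large samples is:

$$SD = \sqrt{\frac{1}{n}\sum (x_i - \bar{x})^2}$$

where n = sample size.

With the divisor n − 1, the formula becomes:

$$SD = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2} = \sqrt{\frac{1}{n-1}\,(SS - CF)}$$

where
SS = sum of squares = $\sum x_i^2$
CF = correction factor = $\dfrac{\left(\sum x_i\right)^2}{n}$

For data with frequencies f_i:

$$SD = \sqrt{\frac{1}{n-1}\sum f_i (x_i - \bar{x})^2} = \sqrt{\frac{1}{n-1}\left[\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right]}$$

where n = sample size.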
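Given the following ungrouped data, in million Naira, as the excess profit made by five
businessmen during the Coronavirus pandemic: 5, 6, 7, 8, 9. Find (i) the mean (ii) the variance and
standard deviation.

i.
$$\text{Mean} = \frac{\sum x}{n} = \frac{35}{5} = 7$$

ii.
$$SS = \sum x_i^2 = 5^2 + 6^2 + 7^2 + 8^2 + 9^2 = 255$$

$$CF = \frac{\left(\sum x\right)^2}{n} = \frac{35^2}{5} = 245$$

$$\text{Variance} = \frac{SS - CF}{n - 1} = \frac{255 - 245}{4} = 2.5$$

$$SD = \sqrt{\frac{1}{n-1}\,(SS - CF)} = \sqrt{\frac{255 - 245}{4}} = 1.58$$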
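The sum-of-squares and correction-factor route to the standard deviation can be checked with a few lines of Python. This is a sketch for verification only; the variable names are assumptions, and `statistics.stdev` (which also uses the divisor n − 1) is included for comparison.

```python
from math import sqrt
from statistics import stdev

# Excess profits (million Naira) of the five businessmen.
profits = [5, 6, 7, 8, 9]
n = len(profits)

ss = sum(x * x for x in profits)   # SS: sum of squared values
cf = sum(profits) ** 2 / n         # CF: (sum of values)^2 / n

variance = (ss - cf) / (n - 1)     # sample variance, divisor n - 1
sd = sqrt(variance)                # sample standard deviation

print(ss, cf, variance, round(sd, 2))
print(stdev(profits))              # standard-library check
```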
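Class     Midpoint (x)   f    Cumulative f   fx
45 – 50   47.5           2    2              95.0
50 – 55   52.5           3    5              157.5
55 – 60   57.5           6    11             345.0
60 – 65   62.5           4    15             250.0
65 – 70   67.5           6    21             405.0
70 – 75   72.5           4    25             290.0
75 – 80   77.5           5    30             387.5
Total                    30                  1930.0

Find the Mean and Standard deviation of money saved by the customer.

N = Σfi = 30

$$\text{Mean} = \frac{\sum f_i x_i}{\sum f_i} = \frac{1930}{30} = 64.33$$

$$\sum f_i x_i^2 = 126637.50$$

$$\frac{\left(\sum f_i x_i\right)^2}{n} = \frac{1930^2}{30} = 124163.33$$

$$SD = \sqrt{\frac{1}{n-1}\left[\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right]} = \sqrt{\frac{126637.50 - 124163.33}{29}} = 9.24$$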
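The grouped calculation can be verified with the Python sketch below. It rests on one assumption that should be read carefully: the class 70 – 75 (midpoint 72.5, frequency 4) is inferred here from the stated totals (Σf = 30, Σfx = 1930.0 and Σfx² = 126637.50), which only balance when that class is included. The variable names are hypothetical.

```python
from math import sqrt

# Class midpoints and frequencies; the 72.5 class (frequency 4) is an
# assumption inferred from the totals quoted in the worked example.
midpoints = [47.5, 52.5, 57.5, 62.5, 67.5, 72.5, 77.5]
freqs = [2, 3, 6, 4, 6, 4, 5]

n = sum(freqs)                                              # sum of f
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))       # sum of f*x
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))  # sum of f*x^2

grouped_mean = sum_fx / n
grouped_sd = sqrt((sum_fx2 - sum_fx ** 2 / n) / (n - 1))

print(n, sum_fx, sum_fx2)
print(round(grouped_mean, 2), round(grouped_sd, 2))
```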
3.2.5 Variance
The variance is measured in the square of the units in which the variable X is measured.

$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n} = \frac{\sum x_i^2 - n\bar{x}^2}{n}$$

A better estimate of the population variance is obtained by using the divisor (n − 1) instead
of n.

$$\text{Estimated variance} = S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

$$\text{Estimated standard deviation} = S = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
The following are the Total credit point (TCP) of some students in 400 level in faculty of
social and management sciences
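$$\text{Mean} = \frac{\sum x}{n} = \frac{2758}{15} = 183.9$$

$$\text{Variance} = S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{14} = 4178.55$$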
Now that you have completed this study session, you can assess how well you have
achieved its learning outcomes by answering these questions. Write your answers in your
study diary and discuss them with your tutor at the next study support meeting. You can
check your answers with the notes on the Self-Assessment Questions at the end of this
1. The following distribution gives the age range and frequency of some market women
in Ago-Iwoye market:
15 – 19 24
20 – 24 37
25 – 29 81
30 – 34 43
35 – 39 30
40 – 44 16
2. Consider the distribution given as the number of students whose stipend falls
within the ranges given below:
20 - 25 81
25 - 30 43
30 - 35 24
35 - 40 9
40 - 45 6
Total 200
3. A company administers an aptitude test to 100 applicants for a job with the
company. The following are the times taken to complete a simple task for each
applicant, measured to the nearest second.
44 92 72 45 85 61 66 46 59 57 52 40 93 54
52 64 65 44 51 66 92 58 74 42 43 56 46 52
45 56 68 40 48 76 71 99 51 72 52 56 69 58
40 76 70 42 52 46 73 59 41 55 74 66 64 47
58 46 52 54 63 89 87 41 57 68 59 81 82 60
67 68 97 57 47 53 61 52 49 47 86 55 54 48
85 45 84 53 49 47 70 78 58 96 54 62 60 57 58
a. Construct a frequency table for the above data using classes of 40 – 49, 50
– 59, 60 – 69, etc.
b. Construct a cumulative frequency distribution.
c. Construct a relative frequency distribution.
d. Draw the histogram.
e. Draw the Ogive.
20 < 25 81
25 < 30 43
30 < 35 24
35 < 40 9
40 < 45 6
a. The mean
b. Median
c. Mode
d. Variance
e. Standard deviation
f. Quartile deviation
Glossary of Terms
Deviation: measure of difference between the observed value of a variable and some other
value, often that variable's mean
Distribution of data: the shape of the graph when all possible values are plotted on a
frequency graph
Quartile: a type of quantile which divides the number of data points into four parts, or
quarters, of more-or-less equal size
Have you ever heard a weather forecast at the end of a news bulletin on TV, or read
about the weather conditions of your city or country for the next few days in a
newspaper? Such forecasts specifically use the term “probability.” We are going to learn a
few basic concepts and the formulas used to calculate probability in different
types of situations.
4.1 Definition
Probability concepts are the foundations of statistics. The understanding of the concepts of
probability will help the interpretation of the statistics in a skillful way. Probability is a
term applied to events that are not certain. It is the study of random or non-deterministic
The three commonly used approaches are classical approach, the relative frequency
approach and personal or subjective approach.
This method can be used whenever the possible outcomes of the experiment are equally
likely. In this case, the probability of the occurrence of event A is given by:

$$P(A) = \frac{n(A)}{n(S)} = \frac{\text{number of ways A can occur}}{\text{number of ways the experiment can proceed}}$$

where S is the sample space and A ⊂ S.
Case Study 4.1
What is the probability that a child born to a couple, each with genes from both brown and
blue eyes, will be brown-eyed?
Since the child receives one gene from each parent, the possibilities for the child are
(brown, blue), (blue, brown), (blue, blue) and (brown, brown),
where the first member of each pair represents the gene received from the father. Since
each parent is just as likely to contribute a gene for brown eyes as for blue eyes, all four
possibilities are equally likely.
Since the gene for brown eyes is dominant, three of the four possibilities lead to a
brown-eyed child. Hence, the probability that the child is brown-eyed is ¾ = 0.75.
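The classical approach in Case Study 4.1 amounts to listing the equally likely outcomes and counting the favourable ones. A minimal Python sketch of that counting is shown below; the variable names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

# Each parent contributes either a brown or a blue gene with equal chance.
genes = ["brown", "blue"]

# Sample space: (gene from father, gene from mother), four equally likely pairs.
outcomes = list(product(genes, repeat=2))

# Brown is dominant, so any pair containing a brown gene gives a brown-eyed child.
brown_eyed = [pair for pair in outcomes if "brown" in pair]

# Classical probability: n(A) / n(S).
p_brown = Fraction(len(brown_eyed), len(outcomes))

print(outcomes)
print(p_brown, float(p_brown))
```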
This method can be used in any situation in which the experiment can be repeated many
times and the results observed. Then the approximate probability of the occurrence of
event A, denoted P(A), is given by:

$$P(A) \approx \frac{n(A)}{N} = \frac{\text{number of times event A occurred}}{\text{number of times the experiment was run}}$$

The disadvantage of this method is that the experiment cannot be a one-shot situation; it
must be repeatable. The advantage of this approach is that it is usually more
accurate, because it is based on actual observation rather than personal opinion.
Thus for a large number of trials, the approximate probability obtained by using the
relative frequency approach is usually quite accurate.
Case Study 4.3
A researcher is developing a new drug to be used in desensitizing patients to bee stings. Of
200 subjects tested, 180 showed a lessening in the severity of symptoms upon being stung
after the treatment was administered. It is natural to assume, then, that the probability of
this occurring in another patient receiving treatment is at least approximately
180/200 = 0.90.
On the basis of this study, the drug is reported to be 90% effective in lessening the
reaction of sensitive patients to stings.
Sample space: This refers to the collection of all possible outcomes of an experiment.
$$P(s) = P(\text{spade}) = \frac{13}{52} = \frac{1}{4}$$
$$P(c) = P(\text{club}) = \frac{13}{52} = \frac{1}{4}$$
The outcomes are mutually exclusive; therefore, P(s or c) = P(s) + P(c).
Hence P(s or c) = ¼ + ¼ = ½
Given a sample space S, let A be a non-empty proper subset of S, i.e. $A \neq \emptyset$ and $A \subset S$. The
probability of an event B happening given that an event A has taken place is denoted by
P(B/A) and is defined as follows.
If A and B are any two events in a sample space S and $P(A) \neq 0$, the conditional
probability of B given A is:
$$P(B/A) = \frac{P(A \cap B)}{P(A)} = \frac{P(\text{both events})}{P(\text{given event})}$$
Similarly, if $P(B) \neq 0$, the conditional probability of A given B is:
$$P(A/B) = \frac{P(A \cap B)}{P(B)} = \frac{P(\text{both events})}{P(\text{given event})}$$
It is estimated that 15% of the adult population has hypertension due to economic
hardship, but that 75% of all adults feel that personally they do not have this problem. It is
also estimated that 6% of the population has hypertension but does not think that the
disease is present. If an adult patient reports thinking that he or she does not have
hypertension, what is the probability that the disease is, in fact, present?
Let A denote the event that the patient does not feel that the disease is present and B
the event that the disease is present. We are given that P(A) = 0.75, P(B) = 0.15 and
$P(A \cap B) = 0.06$.
$$P(B/A) = \frac{P(\text{both})}{P(\text{given})} = \frac{P(A \cap B)}{P(A)} = \frac{0.06}{0.75} = 0.08$$
There is an 8% chance that a patient who expresses the opinion that she or he has no problem
with hypertension does, in fact, have the disease.
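The conditional probability in the hypertension example can be reproduced with a few lines of code. This is a minimal sketch; the function name `conditional_probability` is illustrative only.

```python
def conditional_probability(p_both, p_given):
    """Return P(B/A) = P(A and B) / P(A); P(A) must be positive."""
    if p_given <= 0:
        raise ValueError("the given event must have positive probability")
    return p_both / p_given

# A: patient feels healthy, B: hypertension is present.
p_a = 0.75          # P(A)
p_a_and_b = 0.06    # P(A and B)

p_b_given_a = conditional_probability(p_a_and_b, p_a)
print(round(p_b_given_a, 2))
```

Note that P(B) = 0.15 is not needed for this particular calculation.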
This theorem was formulated by the Reverend Thomas Bayes (1761). It deals with
conditional probability. Bayes' theorem is used to find P(A/B) when the available
information is not directly compatible with that required in conditional probability. That
is, it is used to find P(A/B) when $P(A \cap B)$ and P(B) are not immediately available.
Theorem 4.1:
$$P(A_j/B) = \frac{P(B/A_j)\,P(A_j)}{\sum_{i=1}^{n} P(B/A_i)\,P(A_i)}$$
Bayes' theorem is much easier to use in practical problems than to state formally.
The blood type distribution in Olabisi Onabanjo University is type A, 41%; type B, 9%;
type AB, 4%; and type O, 46%. It is estimated that during an investigation, 4% of
inductees with type O blood were typed as having type A; 88% of those with type A blood
were correctly typed; 4% with type B blood were typed as A; and 10% with type AB were
typed as A. One student was wounded and brought to surgery. He was typed as having
type A blood. What is the probability that this is his true blood type?
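B: It is typed as type A.
By Bayes' theorem, with $A_1, A_2, A_3, A_4$ denoting true blood types A, B, AB and O respectively,
$$P(A_1/B) = \frac{P(B/A_1)\,P(A_1)}{\sum_{i=1}^{4} P(B/A_i)\,P(A_i)}$$
$$= \frac{(0.88)(0.41)}{(0.88)(0.41) + (0.04)(0.09) + (0.10)(0.04) + (0.04)(0.46)} = 0.93$$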
Practically speaking, this means that there is a 93% chance that the blood type is A if it
has been typed as A, and there is a 7% chance that it has been mistyped as A when it is
actually some other type.
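The blood-typing calculation can be sketched in code as follows. The dictionary layout and the function name `bayes_posterior` are assumptions made for illustration; the numbers are those given in the example.

```python
# Prior distribution of true blood types.
priors = {"A": 0.41, "B": 0.09, "AB": 0.04, "O": 0.46}

# P(typed as A / true type), taken from the example.
typed_as_a = {"A": 0.88, "B": 0.04, "AB": 0.10, "O": 0.04}

def bayes_posterior(priors, likelihoods, target):
    """P(target / evidence) by Bayes' theorem over a partition of events."""
    evidence = sum(priors[k] * likelihoods[k] for k in priors)
    return priors[target] * likelihoods[target] / evidence

posterior_a = bayes_posterior(priors, typed_as_a, "A")
print(round(posterior_a, 2))
```

The denominator is the total probability of being typed as A, summed over all four true blood types.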
4.3 Factorials
Factorial is a special multiplication operator. The factorial sign “!” indicates a special
repeated multiplication which is used frequently in statistical applications.
3! = 3 × 2 × 1 = 6
4! = 4 × 3 × 2 × 1 = 24
4.3.1 Permutation
If r objects are selected from a set of n objects, any particular arrangement (order) of these
objects is called a permutation.
$$^{n}P_{r} = \frac{n!}{(n-r)!}$$
Find the number of ways of arranging the letters of the word CHEMISTRY if:
9! = 362880
$$^{9}P_{4} = \frac{9!}{(9-4)!} = \frac{9!}{5!} = \frac{362880}{120} = 3024$$
4.3.2 Combination
This deals with the number of ways in which r objects can be selected from a set of n
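objects. The number of ways in which r objects can be selected from a set of n distinct
objects is denoted by $\binom{n}{r}$ or $^{n}C_{r}$ and is given by:
$$^{n}C_{r} = \frac{n!}{r!\,(n-r)!}$$
In how many ways can we select three academically sound students of the Faculty of
Management Sciences from a list of 7 students?
Here n = 7 and r = 3
$$^{7}C_{3} = \frac{7!}{3!\,(7-3)!} = \frac{7!}{3!\,4!} = \frac{7 \times 6 \times 5}{3 \times 2 \times 1} = 35$$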
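The factorial, permutation and combination results in this section can be verified with Python's standard `math` module (the `perm` and `comb` functions assume Python 3.8 or later). A minimal sketch:

```python
import math

# Factorials
print(math.factorial(3), math.factorial(4))

# Arrangements of all 9 distinct letters of CHEMISTRY
all_letters = math.factorial(9)

# Arrangements of 4 letters chosen from the 9: 9P4 = 9! / (9 - 4)!
four_letters = math.perm(9, 4)

# Selections of 3 students from 7, order ignored: 7C3 = 7! / (3! 4!)
committees = math.comb(7, 3)

print(all_letters, four_letters, committees)
```

Permutations count ordered arrangements, so 9P4 is larger than 9C4 by a factor of 4!.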
This is one of the most important and most widely used probability distributions in the entire field of
statistics.
The graph of a normal distribution is a bell-shaped curve that extends indefinitely in both
directions, with the mean, median and mode being equal.
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}; \quad \text{for } -\infty < x < \infty$$
where $\sigma > 0$
$Z \sim N(0, 1)$
P (Z > a) = 1 – P (Z < a)
Case Study 4.11
(a) P(Z>0.5) (b) P (Z < -2.5) (c) P (1.6 < Z < 2.20)
Given that the normal distribution of a company income has mean 230 and standard
deviation 20, what is the probability that the company income will be:
a. $Z = \frac{X - \mu}{\sigma}$; X = 280, $\mu$ = 230, $\sigma$ = 20
$$Z = \frac{280 - 230}{20} = 2.5$$
P(X > 280) = P(Z > 2.5) = 1 – P(Z < 2.5) = 1 – 0.9938 = 0.0062
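b. For X = 220, Z = (220 – 230)/20 = –0.5, so
P(X < 220) = P(Z < –0.5)
= 1 – P (Z < 0.5)
= 1 – 0.6915
= 0.3085
c. P(220 < X < 280) = P(–0.5 < Z < 2.5) = 0.9938 + 0.6915 – 1
= 1.6853 – 1 = 0.6853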
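The table look-ups in the company income example can be cross-checked with the standard library's `statistics.NormalDist` (Python 3.8 or later). This is a sketch only; reading the worked figures as P(X > 280), P(X < 220) and P(220 < X < 280) is an assumption drawn from the numbers shown.

```python
from statistics import NormalDist

# Company income: mean 230, standard deviation 20.
income = NormalDist(mu=230, sigma=20)

# Standardise X = 280 by hand: Z = (X - mu) / sigma
z_280 = (280 - 230) / 20

p_above_280 = 1 - income.cdf(280)               # P(X > 280) = P(Z > 2.5)
p_below_220 = income.cdf(220)                   # P(X < 220) = P(Z < -0.5)
p_between = income.cdf(280) - income.cdf(220)   # P(220 < X < 280)

print(z_280, round(p_above_280, 4), round(p_below_220, 4), round(p_between, 4))
```

The values agree with the four-figure normal tables up to rounding.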
Now that you have completed this study session, you can assess how well you have
achieved its learning outcomes by answering these questions. Write your answers in your
study diary and discuss them with your tutor at the next study support meeting. You can
check your answers with the notes on the Self-Assessment Questions at the end of this
1. At a given period in OOU Health Centre there are six expectant mothers. What is
the probability.
4. Show that the letters of the word ANTICIPATION can be arranged in three times
as many ways as the letters of the word COMMENCEMENT. b) In the random
experiment of tossing 5 coins, list the event that (i) at least 3 heads occur (ii)
exactly 2 heads (iii) no head occurs.
a. Simplify the following:
Glossary of Terms
1. Probability and statistics for engineers & scientists by Walpole and Myers.
2. Introduction to Statistics. Jedidiah Publishers by Sojobi O.A.
3. Schaum’s Outline Series Theory and Problems of Probability (S.I. Metric) Edition
McGraw Hill Book Company, New York by Symour L.
4. An Introduction to Statistical Methods. Vikas Publishing House. Delhi by GUPTA C.
This is a random variable that can assume at most a finite or a countably infinite number
of possible values. That is, let X be a random variable; if the number of possible values of
X is finite or countably infinite, we call X a discrete random variable. For example, the
possible values of X may be listed as x1, x2, …, xn in the finite case. Therefore, discrete
random variables are those random variables which can take on only a finite number of
values or whose possible values are countable.
A random variable X is continuous if it can assume any value in some interval (or
intervals) of real numbers and the probability that it assumes any specific value is zero.
These are random variables whose choice of values falls within an interval. Measured
random variables that are treated as continuous random variables are; income,
temperature, heights, intelligence quotients (IQ) etc. A continuous random variable is the
one that may assume any numerical value on a continuous scale. Therefore, X is said to be
a continuous random variable, if there exists a function called the probability density
function (p.d.f.) of X satisfying the following conditions:
i. $f(x) \geq 0$ for all x
ii. $\int_{-\infty}^{\infty} f(x)\,dx = 1$
iii. For any a, b with $-\infty < a < b < \infty$ we have $P(a \leq X \leq b) = \int_{a}^{b} f(x)\,dx$
The density function of a random variable completely describes the behaviour of the
variable. However, associated with any random variable are constants, or parameters, that
are descriptive. Knowledge of the numerical values of these parameters gives the
researcher quick insight into the nature of the variable.
We consider three such parameters: the mean $\mu$, the variance $\sigma^2$, and the standard
deviation $\sigma$. If the exact density of the variable of interest is known, then the numerical
value of each parameter can be found from mathematical considerations. To understand
the reasoning behind most statistical methods, it is necessary to become familiar with one
general concept, namely, the idea of mathematical expectation or expected value. This
concept is used in defining most statistical parameters and provides the logical basis for
most of the methods of statistical inference presented in this chapter.
Intuitively, the expected value of a random variable X, denoted E(X), is the
long-run theoretical average value of X. Let X be a discrete random variable with density
f(x). The expected value of X is given by:
$$E(X) = \sum_{\text{all } x} x\, f(x)$$
5.2.2 Variance
Let X be a random variable with mean $\mu$. The variance of X, denoted Var X, or $\sigma^2$, is
given by:
$$\text{Var } X = \sigma^2 = E\left[(X - \mu)^2\right]$$
Note that the variance essentially measures variability by considering $X - \mu$, the difference
between the variable and its mean. The difference is squared so that negative values will
not cancel positive ones in the process of finding the expected value.
The most widely used measure is $E\left[(X - \mu)^2\right]$. This measure is called the variance of X.
$$\text{Var } X = E(X^2) - [E(X)]^2$$
Let X be a random variable with variance $\sigma^2$. The standard deviation of X, denoted $\sigma$, is
given by:
$$\sigma = \sqrt{\text{Var } X} = \sqrt{\sigma^2}$$
i. Var C = 0
ii. Var CX = C2 Var X
iii. If X and Y are independent, then Var (X + Y) = Var X + Var Y. Two variables
are independent if the value assumed by one has no effect on the value assumed by the other.
A balanced coin is tossed thrice. Let X be a random variable denoting the number of times
that head appears.
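x      0     1     2     3
f(x)   1/8   3/8   3/8   1/8
$$E(X) = \sum x\, f(x) = 0\left(\tfrac{1}{8}\right) + 1\left(\tfrac{3}{8}\right) + 2\left(\tfrac{3}{8}\right) + 3\left(\tfrac{1}{8}\right) = \tfrac{12}{8} = \tfrac{3}{2}$$
$$\text{Var } X = E(X^2) - [E(X)]^2$$
$$E(X^2) = \sum x^2 f(x) = 0^2\left(\tfrac{1}{8}\right) + 1^2\left(\tfrac{3}{8}\right) + 2^2\left(\tfrac{3}{8}\right) + 3^2\left(\tfrac{1}{8}\right) = 0 + \tfrac{3}{8} + \tfrac{12}{8} + \tfrac{9}{8} = \tfrac{24}{8} = 3$$
iii. $$\text{Var } X = 3 - \left(\tfrac{3}{2}\right)^2 = 3 - \tfrac{9}{4} = \tfrac{3}{4}$$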
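The expectation and variance of the number of heads in three tosses of a balanced coin can be computed exactly with fractions. The sketch below assumes the density f(0) = 1/8, f(1) = 3/8, f(2) = 3/8, f(3) = 1/8, which follows from the eight equally likely outcomes; the helper name `expectation` is illustrative.

```python
from fractions import Fraction

# Density of X = number of heads in three tosses of a balanced coin.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def expectation(pmf, power=1):
    """E(X**power) = sum of x**power * f(x) over all x."""
    return sum(x ** power * p for x, p in pmf.items())

mean = expectation(pmf)              # E(X)
second_moment = expectation(pmf, 2)  # E(X^2)
variance = second_moment - mean ** 2 # Var X = E(X^2) - [E(X)]^2

print(mean, second_moment, variance)
```

Using `Fraction` avoids rounding, so the result can be compared directly with the hand calculation.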
Consider the random variable X, the number of mimics escaping detection in the Batesian
mimicry experiments. The density for X is given by:
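x      0        1         2          3
f(x)   8/1000   96/1000   384/1000   512/1000
$$E(X) = 0\cdot\frac{8}{1000} + 1\cdot\frac{96}{1000} + 2\cdot\frac{384}{1000} + 3\cdot\frac{512}{1000} = \frac{2400}{1000} = \frac{12}{5} = 2.4$$
$$\text{Var } X = E(X^2) - [E(X)]^2$$
$$E(X^2) = 0^2\cdot\frac{8}{1000} + 1^2\cdot\frac{96}{1000} + 2^2\cdot\frac{384}{1000} + 3^2\cdot\frac{512}{1000} = \frac{6240}{1000} = \frac{624}{100}$$
$$\text{Var } X = \frac{624}{100} - \left(\frac{12}{5}\right)^2 = \frac{624}{100} - \frac{144}{25} = \frac{12}{25}$$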
Two drugs are being compared for use in maintaining a steady heart rate in patients who
have suffered a mild heart attack. Let X denote the number of heartbeats per minute
obtained by using drug A and Y the number per minute with drug B. Consider the
following hypothetical densities.
x      40     60     68     70     72     80     100
f(x)   0.01   0.04   0.05   0.8    0.05   0.04   0.01
y      40     60     68     70     72     80     100
f(y)   0.4    0.05   0.04   0.02   0.04   0.05   0.4
Since each of the densities is symmetric, inspection shows that $\mu_X = \mu_Y = 70$. Each drug
produces on average the same number of heartbeats per minute. However, there is
obviously a drastic difference between the two drugs that is not being detected by the
mean. If we examined only the mean, we would conclude that the two drugs had identical
effects, which may not be exactly true. But we can further examine the variability of the
two drugs through their variances. The variance of X, denoted Var X, or $\sigma^2$, is given by
$$\text{Var } X = \sigma^2 = E\left[(X - \mu)^2\right]$$
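$$\text{Var } X = \sum_{\text{all } x} (x - 70)^2 f(x)$$
$$= (-30)^2(0.01) + (-10)^2(0.04) + (-2)^2(0.05) + (0)^2(0.8) + (2)^2(0.05) + (10)^2(0.04) + (30)^2(0.01)$$
= 26.4
$$\text{Var } Y = \sum_{\text{all } y} (y - 70)^2 f(y)$$
$$= (-30)^2(0.4) + (-10)^2(0.05) + (-2)^2(0.04) + (0)^2(0.02) + (2)^2(0.04) + (10)^2(0.05) + (30)^2(0.4)$$
= 730.32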
As expected, Var Y > Var X. Even though the two drugs produce the same mean number
of heartbeats per minute, they do not behave in the same way. Drug B induces greater
variability than drug A; it is not as consistent in its effect as drug A.
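The comparison of the two drugs can be reproduced numerically. The following is a minimal sketch using the hypothetical densities from the example; the function name `mean_and_variance` is illustrative.

```python
heart_rates = [40, 60, 68, 70, 72, 80, 100]
f_x = [0.01, 0.04, 0.05, 0.8, 0.05, 0.04, 0.01]   # drug A
f_y = [0.4, 0.05, 0.04, 0.02, 0.04, 0.05, 0.4]    # drug B

def mean_and_variance(values, probs):
    """Return (mu, Var) where Var = sum of (v - mu)^2 * f(v)."""
    mu = sum(v * p for v, p in zip(values, probs))
    var = sum((v - mu) ** 2 * p for v, p in zip(values, probs))
    return mu, var

mean_x, var_x = mean_and_variance(heart_rates, f_x)
mean_y, var_y = mean_and_variance(heart_rates, f_y)

print(round(mean_x, 2), round(var_x, 2))
print(round(mean_y, 2), round(var_y, 2))
```

Both means come out at 70, while the variances differ by a factor of more than 25, which is the point of the example.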
The most frequent application of statistics is to test scientific hypotheses. Results of
experiments and investigations are usually not clear cut and, therefore, need statistical
tests to support decisions between alternative hypotheses. A statistical test examines a set
of sample data and, on the basis of an expected distribution of the data, leads to a decision
on whether to accept the hypothesis or to reject that hypothesis and accept an
alternative one. The nature of the tests varies with the data and the hypothesis, but the
same general philosophy of hypothesis testing is common to all tests. A statistical
hypothesis is an assumption or statement, which may or may not be true, concerning one or
more populations.
A type I error has been committed if we reject the null hypothesis when it is true, and a
type II error has been committed if we accept the null hypothesis when it is false. The
following table summarizes the various situations that can arise when testing H0 against
H1:
              Accept H0        Accept H1
H0 is true    No error         Type I error
H1 is true    Type II error    No error
The probabilities of committing type I and type II errors
are written as $\alpha$ and $\beta$, respectively. $\alpha$ is called the size (or level of significance) of the test and
(1 − $\beta$) is called the power of the test; (1 − $\beta$) is the probability of rejecting the null
hypothesis (H0) when it is false. The area such that if the sample point falls in it we reject
H0 is called the critical region. When the primary concern of a test is to see whether the
null hypothesis can be rejected, such a test is called a test of significance. In that case, the
quantity $\alpha$ is called the level of significance at which the test is being conducted.
A test of any statistical hypothesis where the alternative is one-sided, such as:
H0: $\theta = \theta_0$ against H1: $\theta > \theta_0$, or H0: $\theta = \theta_0$ against H1: $\theta < \theta_0$
is called a one-tailed test. The critical region for H1: $\theta > \theta_0$ lies entirely in the right tail
while the critical region for H1: $\theta < \theta_0$ lies entirely in the left tail.
A test of any statistical hypothesis where the alternative is two-sided, such as:
H0: $\theta = \theta_0$
H1: $\theta \neq \theta_0$
is called a two-tailed test; values in both tails of the distribution constitute the critical
region.
The steps involved in general and in the utilization of any test of significance are:
We will assume that the sampling distribution of the sample estimates will be
approximately normal and that the variance is known. Hence, for large samples ($n \geq 30$),
we can use the normal probability distribution for testing a hypothesized value of the
population mean.
$$Z = \frac{\bar{X} - \mu}{S.E.(\bar{X})}, \qquad S.E.(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$
where $\sigma$ is the population standard deviation (usually known) and n is the sample size.
A bottling company which bottles a soft drink claims that the liquid content is 35cl with
standard deviation 0.75cl. A researcher randomly collected 50 bottles, measured their
contents and obtained a mean of 34.2cl. Test at the 0.01 level of significance whether the bottling
company has been cheating its consumers.
$\mu$ = 35cl
$\sigma$ = 0.75cl
n = 50
$\bar{X}$ = 34.2
$\alpha$ = 0.01 (1%)
H0: $\mu$ = 35, that is, the company has not been cheating the consumers.
H1: $\mu$ < 35, that is, the company has been cheating the consumers.
The test statistic is
$$Z = \frac{(\bar{X} - \mu)\sqrt{n}}{\sigma} = \frac{(34.2 - 35)\sqrt{50}}{0.75} = \frac{-0.8 \times 7.0711}{0.75}$$
= -7.54
Decision: the absolute value of the calculated Z, 7.54, is greater than the tabulated Z value 2.33 (that is, -7.54 < -2.33), so we reject
H0 and accept H1.
Conclusion: There is a significant difference between the population mean and the sample mean.
Hence, the bottling company has been cheating its consumers.
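The bottling example can be checked with a short script. This is a sketch under the example's assumptions (known population standard deviation, one-tailed test at the 1% level); the function name `one_sample_z` is illustrative, and the critical value is obtained from `statistics.NormalDist` (Python 3.8 or later) rather than from tables.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, mu0, sigma, n):
    """Z = (sample mean - hypothesised mean) * sqrt(n) / sigma."""
    return (sample_mean - mu0) * sqrt(n) / sigma

z = one_sample_z(sample_mean=34.2, mu0=35, sigma=0.75, n=50)

# Lower-tail critical value for alpha = 0.01 (H1: mu < 35).
critical = NormalDist().inv_cdf(0.01)

reject_h0 = z < critical
print(round(z, 2), round(critical, 2), reject_h0)
```

Because the alternative is one-sided on the left, H0 is rejected when Z falls below the negative critical value.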
There are situations in real life experiment, such as, testing the efficiency of a newly
produced drug, where it is impracticable to get a large sample and yet tests of
significance still have to be carried out. When we do not know the value of the population
standard deviation and the sample size is small (n < 30), we shall assume again that the
population we are sampling from has roughly the shape of a normal distribution. The test
statistic is:
$$t = \frac{(\bar{X} - \mu)\sqrt{n}}{S}$$
whose sampling distribution is the t distribution with n − 1 degrees of freedom. S is the
sample standard deviation. As with large samples, we compare it with its tabulated value at a given
level of significance, and then draw our conclusions.
Suppose that we want to test on the basis of a random sample of size n = 5 whether or not
the fat content of a certain kind of ice cream exceeds 12 percent. What can we conclude
about the null hypothesis $\mu$ = 12 percent at the 0.01 level of significance, if the sample
has mean $\bar{X}$ = 12.7 percent and standard deviation S = 0.38 percent?
H0: $\mu$ = 12
H1: $\mu$ > 12
$\alpha$ = 0.01
Test statistic
$$t = \frac{(\bar{X} - \mu)\sqrt{n}}{S} = \frac{(12.7 - 12)\sqrt{5}}{0.38} = \frac{0.7 \times 2.236}{0.38}$$
t = 4.12
t0.01,4 = 3.747
Conclusion: Since tcal > ttab, we reject H0. Therefore, the fat content of the given kind of ice cream exceeds 12 percent.
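The small-sample test statistic for the ice cream example can be sketched as below. The function name `one_sample_t` is illustrative, and the critical value 3.747 (upper 1% point of the t distribution with 4 degrees of freedom) is taken from standard t-tables as an assumption, since the standard library has no t distribution.

```python
from math import sqrt

def one_sample_t(sample_mean, mu0, s, n):
    """t = (sample mean - hypothesised mean) * sqrt(n) / S, with n - 1 d.f."""
    return (sample_mean - mu0) * sqrt(n) / s

t_stat = one_sample_t(sample_mean=12.7, mu0=12, s=0.38, n=5)

# Tabulated upper-tail value for alpha = 0.01 and 4 d.f. (from t-tables).
T_CRITICAL = 3.747

reject_h0 = t_stat > T_CRITICAL
print(round(t_stat, 2), reject_h0)
```

The calculated statistic exceeds the tabulated value, which supports the conclusion that the fat content exceeds 12 percent.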
The lifetimes of telephones in a random sample of 10 from a large consignment give the
following data:
Can we accept the hypothesis that the average lifetime of the telephones is 4,000 hours at the 5%
level of significance?
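H0: $\mu$ = 4,000 hours
H1: $\mu \neq$ 4,000 hours
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{4.2 + 4.0 + \dots + 5.6}{10} = \frac{43.5}{10}$$
$\bar{X}$ = 4.3
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{(4.2 - 4.3)^2 + (4.0 - 4.3)^2 + \dots + (4.4 - 4.3)^2 + (5.6 - 4.3)^2}{10 - 1}$$
S² = 0.358
Test Statistic
$$t = \frac{(\bar{X} - \mu)\sqrt{n}}{S} = \frac{(4.3 - 4)\sqrt{10}}{0.598}, \qquad S = \sqrt{0.358} = 0.598$$
t = 1.587
t0.025,9 = 2.262
Conclusion: Since tcal < ttab, we accept H0 and conclude that the average lifetime is
4,000 hours.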
The test statistic for a large-sample test concerning the difference between two means is given
by:
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
In a study designed to test whether or not there is a difference between the average amount
used to buy food by families living in two different communities, random samples yield
the following results.
The amount used to buy food are in Thousand Naira. Use the 0.05 level of significance to
test the null hypothesis that the corresponding population means are equal against the
alternative hypothesis that they are not equal.
H0: $\mu_1 = \mu_2$
H1: $\mu_1 \neq \mu_2$
Test Statistic
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{62.7 - 61.8}{\sqrt{\dfrac{2.50^2}{120} + \dfrac{2.62^2}{150}}}$$
Z = 2.88
Conclusion: Since $Z_{cal} > Z_{tab}$ (2.88 > 1.96), the null hypothesis must be rejected and we conclude that
there is a difference between the true average amounts used to buy food in the two given
communities.
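The two-sample large-sample statistic can be computed from the summary figures used in the worked solution (means 62.7 and 61.8, standard deviations 2.50 and 2.62, sample sizes 120 and 150). A minimal sketch; the function name `two_sample_z` is illustrative and the two-tailed critical value 1.96 is the usual 5% table value.

```python
from math import sqrt

def two_sample_z(mean1, mean2, sd1, sd2, n1, n2):
    """Z = (mean1 - mean2) / sqrt(sd1^2 / n1 + sd2^2 / n2)."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

z = two_sample_z(62.7, 61.8, 2.50, 2.62, 120, 150)

Z_CRITICAL = 1.96  # two-tailed, alpha = 0.05

reject_h0 = abs(z) > Z_CRITICAL
print(round(z, 2), reject_h0)
```

For a two-tailed test the absolute value of Z is compared with the critical value.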
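The test statistic for a small-sample test concerning the difference between two means is given
by:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \qquad S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$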
The following random samples are amount used by two states in Nigeria to provide health
facilities (in millions naira) for five months:
Use 0.05 level of significance to test whether the difference between the means of these
two samples is significant.
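$$\bar{X}_1 = \frac{\sum_{i=1}^{5} X_i}{5} = \frac{8400 + 8230 + \dots + 7930}{5}$$
$\bar{X}_1$ = 8160
$$\bar{X}_2 = \frac{\sum_{i=1}^{5} X_i}{5} = \frac{7510 + 7690 + \dots + 7660}{5}$$
$\bar{X}_2$ = 7730
$$S_1^2 = \frac{\sum_{i=1}^{n_1} (X_i - \bar{X}_1)^2}{n_1 - 1}$$
$S_1^2$ = 63450
$$S_2^2 = \frac{\sum_{i=1}^{n_2} (X_i - \bar{X}_2)^2}{n_2 - 1}$$
$S_2^2$ = 42650
H0: $\mu_1 = \mu_2$
H1: $\mu_1 \neq \mu_2$
Test statistic
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{8160 - 7730}{\sqrt{\dfrac{4(63450) + 4(42650)}{8}\left(\dfrac{1}{5} + \dfrac{1}{5}\right)}} = \frac{430}{145.67} = 2.95$$
t0.025,8 = 2.306
Conclusion: Since tcal > ttab, the null hypothesis should be rejected; we conclude that
the average amounts spent on health facilities by the two states are not the same.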
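The pooled-variance t statistic can be computed directly from the summary figures in the worked solution (means 8160 and 7730, sample variances 63450 and 42650, five observations each). This is a sketch; the function name `pooled_t` is illustrative, and the critical value 2.306 (two-tailed 5% point, 8 degrees of freedom) is taken from standard t-tables as an assumption.

```python
from math import sqrt

def pooled_t(mean1, mean2, var1, var2, n1, n2):
    """Two-sample t statistic using the pooled variance Sp^2."""
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error

t_stat = pooled_t(8160, 7730, 63450, 42650, 5, 5)

T_CRITICAL = 2.306  # from t-tables: alpha/2 = 0.025, 8 d.f.

reject_h0 = abs(t_stat) > T_CRITICAL
print(round(t_stat, 2), reject_h0)
```

Pooling the two sample variances assumes that the two populations share a common variance.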
A statistic is the characteristics of sample data that is used to estimate the population
parameter. For instance, given xi, i = 1, 2, 3, …, n to be sampled data, then
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
is a statistic.
A point estimator is a procedure leading to a single numerical value for the estimate of the
unknown parameter. Suppose xi, i = 1, 2, 3, …, n is a random sample of n observations on
the random variable X; then the sample mean $\bar{x}$ is a point estimator of the population mean
$\mu$ and the sample variance S² is a point estimator of the population variance $\sigma^2$.
There are various properties of a good point estimator but the most important two are that,
the point estimator should be unbiased and also should have the minimum variance. This
means that the expected value of any estimator must be equal to its equivalent population
parameter. If there are other estimators, the variance of the estimator said to be a good
point estimator should be the minimum.
Interval estimation is the use of sample data to calculate an interval of possible values of
an unknown population parameter; this is in contrast to point estimation, which gives a
single value. An interval estimator is a random interval in which the true value of the
parameter lies with some probability; it is usually called a confidence interval. A 100
(1 − $\alpha$)% confidence interval on a parameter $\theta$ is a random interval (L1, L2) such that
$$P(L_1 \leq \theta \leq L_2) = 1 - \alpha$$
The distance between two statistics that includes the true value of the parameter in
question with some probability is an interval estimate of the parameter. For instance, to
obtain an interval estimate of a parameter $\theta$, we need to obtain two statistics L1 and L2
(the lower and upper confidence limits respectively).
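Suppose X is normally distributed with unknown mean $\mu$ and variance $\sigma^2$. That is,
$X \sim N(\mu, \sigma^2)$
$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
$$P(-Z_{\alpha/2} \leq Z \leq Z_{\alpha/2}) = 1 - \alpha$$
$$P\left(\bar{x} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{x} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$
so this interval is a 100 (1 − $\alpha$) percent confidence interval for $\mu$. If the variance of the distribution is
unknown then the sample variance S² may be used to estimate $\sigma^2$. Then
$$\frac{\bar{x} - \mu}{S/\sqrt{n}} \sim t_{\alpha/2,\, n-1}$$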
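1 − $\alpha$ = 95%
1 − $\alpha$ = 0.95
$\alpha/2$ = 0.025
$Z_{\alpha/2} = Z_{0.025} = 1.96$
This implies that the area between $-Z_{\alpha/2}$ and $Z_{\alpha/2}$, i.e. (-1.96, 1.96), is 0.95. That is,
$P(-Z_{\alpha/2} \leq Z \leq Z_{\alpha/2}) = 1 - \alpha$, i.e. $P(-1.96 \leq Z \leq 1.96) = 0.95$. To construct a
confidence interval (C.I.) for $\mu$, replace Z by $\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ and solve
for $\mu$.
$$-1.96 \leq \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \leq 1.96$$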
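The peak of a stock market index was recorded for 100 months; the sample mean
and standard deviation are 180 million and 10 million respectively. Determine, correct to
three significant figures, the 95% confidence limits for the population mean of the stock market index.
$\bar{x}$ = 180 million
$\sigma$ = 10 million
n = 100
Let $\mu$ be the population mean. The 95% C.I. for the population mean is
$$\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$$
$$180000000 - 1.96\left(\frac{10000000}{\sqrt{100}}\right) \leq \mu \leq 180000000 + 1.96\left(\frac{10000000}{\sqrt{100}}\right)$$
$$178040000 \leq \mu \leq 181960000$$
that is, approximately 178 million to 182 million, correct to three significant figures.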
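The confidence limits for the stock market example can be computed as in the sketch below. The function name `z_confidence_interval` is illustrative; the multiplier 1.96 is the tabulated value for a 95% interval used in the text.

```python
from math import sqrt

def z_confidence_interval(sample_mean, sigma, n, z=1.96):
    """Return (lower, upper) limits: mean -/+ z * sigma / sqrt(n)."""
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

lower, upper = z_confidence_interval(180_000_000, 10_000_000, 100)
print(round(lower), round(upper))
```

A larger sample size n narrows the interval, because the margin is divided by the square root of n.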
i. $f(x) \geq 0$ for all x
ii. $\int_{-\infty}^{\infty} f(x)\,dx = 1$
iii. For any a, b with $-\infty < a < b < \infty$ we have $P(a \leq X \leq b) = \int_{a}^{b} f(x)\,dx$
Now that you have completed this study session, you can assess how well you have
achieved its learning outcomes by answering these questions. Write your answers in your
study diary and discuss them with your tutor at the next study support meeting. You can
check your answers with the notes on the Self-Assessment Questions at the end of this
Mine 1 84 82 83 78 79
Mine 2 75 76 77 80 76
Use 0.05 level of significance to test whether the difference between the means of these
two samples is significant.
7. Research reveals that 40 percent of customers using a kind of detergent
soap want NAFDAC to stop the sale and use of the soap, following some
rumour about its side effects. However, after an advertisement by the producer,
only 180 out of 500 interviewed now believe that NAFDAC should stop the sale
and use of the soap. Does the advertisement reduce the customers' belief that
the soap should be stopped? Test at the 5% level of significance.
8. ‘Family planning or control’ is a popular agitation of every government. A
research agency wants to know whether the programme is popular amongst people
living in Lagos State and Kano State. The survey reveals that in a random sample
of 1000 people living in Oyo State, 450 of them were aware of the programme. In
Ogun State 800 people were randomly sampled and 400 were aware of the
programme ‘Family Planning’. Do these facts indicate a significant difference
between the two states as far as this programme is concerned? Test both at 1% and
5% level of significance.
Glossary of Terms
Estimate: any of numerous procedures used to calculate the value of some property of a
population from observations of a sample drawn from the population
Statistical inference: the process through which inferences about a population are made
based on certain statistics calculated from a sample of data drawn from that population
Random phenomenon: a situation in which we know what outcomes can occur, but we
do not know which outcome will occur
Expected value: an anticipated value for an investment at some point in the future
Sample results: the results of n experiments in which the same quantity is measured
1. Probability and statistics for engineers & scientists by Walpole and Myers.
2. Introduction to Statistics. Jedidiah Publishers by Sojobi O.A.
3. Schaum’s Outline Series Theory and Problems of Probability (S.I. Metric) Edition
McGraw Hill Book Company, New York by Symour L.
4. An Introduction to Statistical Methods. Vikas Publishing House. Delhi by GUPTA C.
The use of statistical methods for categorical data has increased dramatically,
particularly for applications in the biomedical and social sciences. This study
session summarizes these methods and shows readers how to use them. You
will find a unified generalized linear models approach that connects logistic regression
and loglinear models for discrete data with normal regression for continuous data.
6.1 Introduction
The statistical problem is to determine whether the observed category frequencies tend to
support or refute a stated hypothesis. The statistical method used in this
analysis is the $\chi^2$ (Chi-Square) method. For instance, in the banking industry the method can
be used to determine the general belief of customers about the services that a known
commercial bank provides. This can be done based on information collected from
customers in order to determine whether its services are satisfactory or not.
Chi-square ($\chi^2$) is a special significance test which is used in a very large number of cases
to test the accordance between fact and theory (or between observed values and expected
values). The statistic $\chi^2$ may be defined as
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
where Oi refers to the observed values of the sample and Ei refers to the expected values,
i.e. values we expect on the basis of some hypothesis.
The summation (Σ) extends over all the classes in the data and n is the number in the
The three most important situations where the $\chi^2$ test can be used are:
Test at the 0.05 level of significance whether the discrepancies between the observed and
expected frequencies can be attributed to chance.
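Test statistic
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
Since the expected frequencies are all 250(0.1) = 25, substituting into the formula
for $\chi^2$ yields
$$\chi^2 = \frac{(22 - 25)^2}{25} + \frac{(17 - 25)^2}{25} + \dots = 5.60$$
$$\chi^2_{0.05,\,9} = 16.919$$
Conclusion: Since $\chi^2_{cal} < \chi^2_{tab}$, we accept the null hypothesis and conclude that the
discrepancies between the observed and expected frequencies can be attributed to chance.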
1600 families were selected randomly in Ogun State to test the belief that high income
families usually have access to basic socio-economic amenities such as health and
education, while low income families do not. Below is the result obtained.
Test at 5% level of significance whether income and social economic amenities are
d.f. = (r-1) (c-1) = (2-1) (2-1) = 1 and hence the correction for continuity is necessary.
$$\chi^2 = \sum \frac{(|O - E| - 0.5)^2}{E}$$
O       E      O-E    |O-E|   Y = |O-E| - 0.5   Y²        Y²/E
438     354    84     84      83.5              6972.25   19.70
162     246    -84    84      83.5              6972.25   28.34
506     590    -84    84      83.5              6972.25   11.82
494     410    84     84      83.5              6972.25   17.01
Total   1600                                              76.87
$\chi^2$ calculated = 76.87
$\chi^2$ tabulated = 3.84
Since $\chi^2_{cal} > \chi^2_{tab}$, we reject H0 and conclude that there is an association between income and
access to socio-economic amenities.
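The continuity-corrected chi-square for the 2 × 2 table can be sketched in code. The arrangement of the observed counts into rows and columns below is an assumption inferred from the expected values in the working table (it reproduces E = 354, 246, 590 and 410); the function name `yates_chi_square` is illustrative.

```python
# Observed counts, arranged so that the row and column totals
# reproduce the expected frequencies in the working table.
observed = [[438, 162],
            [506, 494]]

def yates_chi_square(table):
    """Chi-square with continuity correction: sum of (|O - E| - 0.5)^2 / E."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (abs(obs - expected) - 0.5) ** 2 / expected
    return chi2

chi2 = yates_chi_square(observed)
CHI2_CRITICAL = 3.84  # from chi-square tables: alpha = 0.05, 1 d.f.

print(round(chi2, 2), chi2 > CHI2_CRITICAL)
```

The unrounded total is about 76.86; the 76.87 in the table comes from adding the four rounded terms.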
900 men and 700 women were interviewed by an independent public opinion poll agency
on their attitude towards family planning. Below is the result of the interview.
Does this data indicate a significant sex difference in the attitude towards family planning?
Test at 1% level.
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Since $\chi^2_{cal} > \chi^2_{tab}$, we reject H0 and conclude that there is a sex difference in attitude towards family
planning.
The three most important situations where the $\chi^2$ test can be used are:
Now that you have completed this study session, you can assess how well you have
achieved its learning outcomes by answering these questions. Write your answers in your
study diary and discuss them with your tutor at the next study support meeting. You can
check your answers with the notes on the Self-Assessment Questions at the end of this
Performance Total
Poor Fair Good
Below average 67 64 25 156
Average 42 76 56 174
Above average 10 23 37 70
Total 119 163 118 400
Glossary of Terms
Summation (∑ ): add up
Loglinear models: each frequency is a random variable with a finite and positive
expectation, and the logarithms of the expectations of the frequencies are assumed to
satisfy a linear model
Fit: a statistical hypothesis test used to see how closely observed data mirrors expected
1. Probability and statistics for engineers & scientists by Walpole and Myers.
2. Introduction to Statistics. Jedidiah Publishers by Sojobi O.A.
3. Schaum’s Outline Series Theory and Problems of Probability (S.I. Metric) Edition
McGraw Hill Book Company, New York by Symour L.
4. An Introduction to Statistical Methods. Vikas Publishing House. Delhi by GUPTA C.
7.1 Introduction
In most empirical studies a relationship is found to exist between two or more variables. In
most statistical investigations the major objective is to establish relationship between these
variables, which make it possible to predict one or more variables in terms of others.
Regression analysis is often used to predict the response variable from knowledge of
the independent variables. Likewise, regression analysis is used to
examine the nature of the relationship between the independent variables and the
response (dependent) variable.
In an experiment of the capacity of electrolytic cells, four cells were taken with electrolyte
respectively. The results were as follows:
[Figure: scatter plot of mean capacity (Y) against quantity of electrolyte (X), with fitted line y = 38x - 18.1, R² = 0.9627]
It can be observed that there is a form of linear relationship between the estimate mean
capacity and the electrolyte. In this case the line is called the regression line of mean
capacity on quantity of electrolyte.
Now that we have collected and plotted the data on mean capacity and quantity of
electrolyte and determined that there is a linear relationship between them, how do we
describe that relationship? We know that mean capacity increases with quantity of
electrolyte, but by how much, and starting where? We want a mathematical
equation, function or model which specifies what the relationship is. Since the
relationship is linear, we want a linear equation stating Y as a function of X.
The linear equation $Y = \alpha + \beta X$ is the equation of a straight line. The Greek letters $\alpha$
(alpha) and $\beta$ (beta) are parameters of the line; once they are specified, the line is completely determined. $\alpha$ is the Y-
intercept, the value of Y when X = 0. The parameter $\beta$ is the slope of the line, the
number of units by which Y increases for each unit increase in X.
$$Y = \alpha + \beta X$$
In this equation the points all lie on a straight line. But, in a scatter diagram, all points do not lie on a
straight line. How then can the relationship between Y and X be described using the
equation for a straight line? The equation $Y = \alpha + \beta X$ represents a deterministic, or
mathematical, model.
what distinguishes one identical twin from another or one laboratory rat from its
littermate. It is what gives a chemist slightly different results on repeated runs of the same
experiment. By the same token, our statistical predictions will always be subject to
random error. They will always be, to some degree, imperfect when applied to any
specific situation.
Recognizing that we can never predict anything exactly, we can describe a relationship by
means of the probabilistic or statistical model,
$$Y = \alpha + \beta X + \varepsilon.$$
Here, $\varepsilon$, the Greek letter epsilon, represents the error in the predictions. This model says that Y changes
at a constant rate with X, and that this relationship is not exact for every
individual pair of observations. The error term accounts for variables that affect Y but
are not included as predictors. It accounts for chance, or random variability, as well as
imprecision in the specified model, which might be almost but not exactly linear. We can
thus say that the error term is composed of two general kinds of error.
i. Model error or lack of fit: Meaning that all relevant predictors are not specified
or that the form of the relationship is not correctly specified:
ii. Random error: This is unpredictable and uncontrollable.
The model $Y = \alpha + \beta X + \varepsilon$ merely states the conceptual framework of the problem. It is a
shorthand way of saying that we are trying to investigate a problem in which there is an
imperfect linear relationship between two variables. We shall therefore use the data to get
numerical estimates, which we will call $\hat{\alpha}$ and $\hat{\beta}$, so that the estimated value of Y for a
given value of X can be obtained by simply substituting that value of X in the equation
$$\hat{Y} = \hat{\alpha} + \hat{\beta} X.$$
That is, we want to estimate how much error is involved in our predictions. If this error is
quite large, this might be an indication that the relationship is not strong enough to bother
with. Finally, we recognize that because our information comes from a randomly chosen
sample, there is a chance that it gives a distorted picture of how X and Y are related.
It might indicate a relationship when in fact none exists. The question of whether or not a
relationship exists is central to the whole study of regression.
A regression using only one predictor is called a simple regression; when there are two
or more predictors, the analysis is called a multiple regression. A simple linear
regression equation is therefore of the form $\hat{Y} = \hat{\alpha} + \hat{\beta} X$.
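The least squares method is the most reliable of all the methods used to find regression
lines. It leads to a unique regression line and unique regression coefficients. It can be used to
estimate the parameters $\alpha$ and $\beta$ of the model
$$Y_i = \alpha + \beta X_i + \varepsilon_i$$
as follows. Define
$$L = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i)^2$$
Minimize the function L with respect to $\alpha$ and $\beta$ by taking the partial derivatives:
$$\frac{\partial L}{\partial \alpha} = -2 \sum (Y_i - \alpha - \beta X_i), \qquad \frac{\partial L}{\partial \beta} = -2 \sum (Y_i - \alpha - \beta X_i) X_i$$
Setting these partial derivatives equal to zero and solving for $\alpha$ and $\beta$, we obtain:
$$\hat{\beta} = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}, \qquad \hat{\alpha} = \bar{Y} - \hat{\beta} \bar{x}$$
where $\bar{x} = \frac{1}{n} \sum x_i$ and $\bar{Y} = \frac{1}{n} \sum Y_i$.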
When only two variables are involved, the regression is said to be simple. A simple linear
regression equation is therefore of the form $\hat{Y} = \hat{\alpha} + \hat{\beta} X$. Once $\hat{\alpha}$ and $\hat{\beta}$ are estimated,
we can substitute a given value of X into the equation and calculate the predicted value of Y.
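A study was made on the effect of income level on the standard of living. The following
data were obtained in coded form. Calculate the regression of standard of living (y) on income (x).
$$\hat{\beta} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}, \qquad \hat{\alpha} = \bar{Y} - \hat{\beta} \bar{x}$$
With $n = 11$, $\sum x = 0$, $\sum y = 102$, $\sum x^2 = 110$ and $\sum xy = 158$:
$$\hat{\beta} = \frac{158 - \frac{(0)(102)}{11}}{110 - \frac{(0)^2}{11}} = 1.44$$
$$\bar{x} = 0, \qquad \bar{Y} = 9.27, \qquad \hat{\alpha} = 9.27 - 1.44(0) = 9.27$$
$$\hat{y} = 9.27 + 1.44x$$
At $x = 6$, $\hat{y} = 9.27 + 1.44(6) = 17.91$.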
The table below shows the Nigerian gross domestic product (X) and inflation rate (Y) over a
period of time.
X 65 63 67 64 68 62 70 66 68 67 69 71
Y 68 66 68 65 69 66 68 65 71 67 68 70

    X     Y     X²     Y²     XY
   65    68   4225   4624   4420
   63    66   3969   4356   4158
   67    68   4489   4624   4556
   64    65   4096   4225   4160
   68    69   4624   4761   4692
   62    66   3844   4356   4092
   70    68   4900   4624   4760
   66    65   4356   4225   4290
   68    71   4624   5041   4828
   67    67   4489   4489   4489
   69    68   4761   4624   4692
   71    70   5041   4900   4970
Total
  800   811  53418  54849  54107
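Regression of Y on X:
$$\hat{\beta} = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} = \frac{12(54107) - (800)(811)}{12(53418) - (800)^2} = 0.4764$$
$$\hat{\alpha} = \frac{\sum y}{n} - \hat{\beta} \frac{\sum x}{n} = \frac{811}{12} - 0.4764 \times \frac{800}{12} = 35.8233$$
$$\hat{Y} = 35.8233 + 0.4764X$$
Regression of X on Y: the slope is
$$\frac{n \sum xy - \sum x \sum y}{n \sum y^2 - (\sum y)^2} = \frac{12(54107) - (800)(811)}{12(54849) - (811)^2} = 1.036$$
and the fitted line is
$$\hat{X} = -3.38 + 1.036Y$$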
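The two regression lines in this example can be checked numerically. The following is a minimal Python sketch, not part of the original course material; the function name `least_squares_line` is an illustrative assumption. It applies the least squares formulas for the slope and intercept to the GDP and inflation data.

```python
# Illustrative sketch: least squares estimates from the sums used in the text.
# The function name is hypothetical, not taken from the course material.

def least_squares_line(x, y):
    """Return (intercept, slope) for the regression of y on x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope

gdp = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]        # X
inflation = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]  # Y

a_yx, b_yx = least_squares_line(gdp, inflation)  # regression of Y on X
a_xy, b_xy = least_squares_line(inflation, gdp)  # regression of X on Y

print(f"Y on X: Y = {a_yx:.2f} + {b_yx:.4f} X")
print(f"X on Y: X = {a_xy:.2f} + {b_xy:.4f} Y")
```

Swapping the two arguments gives the regression of X on Y, which is a different line from the regression of Y on X unless the points lie exactly on a straight line.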
We have dealt with the problem of regression, or estimation of one variable (the dependent
variable) from one or more related variables (the independent variables). We shall now
consider the degree of relationship that exists between variables: correlation analysis.
Karl Pearson's product moment correlation coefficient is denoted by r and given by:
$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}$$
The value of r lies between -1 and 1; the greater the magnitude of r, the stronger the association.
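The table below presents Nigerian Government income (x) and Nigerian Government
expenditure (y) for a period of 12 months, in millions of naira.
$$r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]\left[\sum y^2 - \frac{(\sum y)^2}{n}\right]}}$$
$$r = \frac{\sum xy - \frac{138.00 \times 138.25}{12}}{\sqrt{\left(1632.75 - \frac{(138.00)^2}{12}\right)\left(1607.81 - \frac{(138.25)^2}{12}\right)}}$$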
Calculate the correlation coefficients between the following pairs of variables: gain in
height, gain in weight and intelligence quotient (IQ) in 10 children, given in the table below:
Child number                   1     2     3     4     5     6     7      8     9    10    Total
Gain in weight (y)           1.0   3.0   2.5   4.5   1.5   2.0   3.1    4.1   2.5   4.2    28.4
Gain in height (x)           2.0   3.5   3.0   5.0   2.1   2.5   3.6    3.8   3.0   4.0    32.5
Intelligence quotient (z)    1.0   6.0   4.0  10.0   2.0   9.0   7.0    8.0   5.0   9.0    61.0
xy                           2.0  10.5   7.5  22.5  3.15   5.0  11.16  15.58  7.5  16.8   101.69
yz                           1.0  18.0  10.0  45.0   3.0  18.0  21.7   32.8  12.5  37.8   199.80
xz                           2.0  21.0  12.0  50.0   4.2  22.5  25.2   30.4  15.0  36.0   218.30
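$$\sum x^2 = 113.31, \quad \sum y^2 = 93.06, \quad \sum z^2 = 457, \quad \sum xy = 101.69, \quad \sum yz = 199.80, \quad \sum xz = 218.30$$
$$r_{xy} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]\left[\sum y^2 - \frac{(\sum y)^2}{n}\right]}} = \frac{101.69 - \frac{32.5 \times 28.4}{10}}{\sqrt{\left(113.31 - \frac{32.5^2}{10}\right)\left(93.06 - \frac{28.4^2}{10}\right)}} = \frac{9.39}{9.76} = 0.962$$
$$r_{xz} = \frac{\sum xz - \frac{\sum x \sum z}{n}}{\sqrt{\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]\left[\sum z^2 - \frac{(\sum z)^2}{n}\right]}} = \frac{218.30 - \frac{32.5 \times 61}{10}}{\sqrt{\left(113.31 - \frac{32.5^2}{10}\right)\left(457 - \frac{61^2}{10}\right)}} = \frac{20.05}{25.54} = 0.785$$
$$r_{yz} = \frac{199.80 - \frac{28.4 \times 61}{10}}{\sqrt{\left(93.06 - \frac{28.4^2}{10}\right)\left(457 - \frac{61^2}{10}\right)}} = \frac{26.56}{32.45} = 0.818$$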
The correlation coefficients are high. It can be concluded that (i) gains in height and weight
are positively correlated: children with a good gain in height have a good gain in weight;
and (ii) gains in height and weight are positively correlated with IQ: children who have good
gains in height or weight have a good IQ.
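The three coefficients in this example can be reproduced with a few lines of Python. This is a minimal sketch added for illustration; the function name `pearson_r` is an assumption, and it implements the product moment formula in its n-times-sums form.

```python
# Illustrative sketch of Pearson's product moment correlation coefficient.
from math import sqrt

def pearson_r(x, y):
    """r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

height = [2.0, 3.5, 3.0, 5.0, 2.1, 2.5, 3.6, 3.8, 3.0, 4.0]  # x
weight = [1.0, 3.0, 2.5, 4.5, 1.5, 2.0, 3.1, 4.1, 2.5, 4.2]  # y
iq = [1.0, 6.0, 4.0, 10.0, 2.0, 9.0, 7.0, 8.0, 5.0, 9.0]     # z

r_xy = pearson_r(height, weight)
r_xz = pearson_r(height, iq)
r_yz = pearson_r(weight, iq)
print(f"r_xy = {r_xy:.3f}, r_xz = {r_xz:.3f}, r_yz = {r_yz:.3f}")
```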
When the variables do not follow a normal distribution and one wishes to assess the
relationship between them, the Spearman rank correlation coefficient is used. The values of
each variable are ranked by magnitude, and the correlation between the ranks of
x and y is obtained. The symbol used is R, and the formula is:
$$R = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
where $d_i$ is the difference between the ranks given to the two variables in each pair and n is the
number of pairs studied. The procedure was developed by Spearman; hence it is known as the
Spearman rank correlation coefficient. Its value also ranges from -1 to 1.
Calculate the value of the rank correlation coefficient between the corresponding values of
income (X) and expenditure (Y) of a company given below.
X 22 24 25 16 28 19
Y 48 42 40 38 47 45
X Y RX RY d d2
22 48 3 6 -3 9
24 42 4 3 1 1
25 40 5 2 3 9
16 38 1 1 0 0
28 47 6 5 1 1
19 45 2 4 -2 4
$$R = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6(24)}{6(36 - 1)} = 1 - \frac{144}{210} = 0.314$$
Often, two or more values of a variable are equal. In such cases, we assign to
each of the tied observations the mean of the ranks which they jointly occupy. For
example, if the 5th and 6th largest values of a variable are equal, we assign to each the rank
$(5 + 6)/2 = 5.5$, and if the fifth, sixth and seventh largest values of a variable are the same,
we assign to each the rank $(5 + 6 + 7)/3 = 6$.
The table below shows the respective weights (in kg) of 12 fathers and their
eldest sons.
Father (X) 66 64 68 65 69 63 71 67 69 68 70 72
Son (Y)    69 67 69 66 70 67 69 66 72 68 69 71
Calculate the coefficient of rank correlation and comment on the degree of correlation
between the fathers' weights and those of their sons.
  X    Y    RX    RY    d = RX - RY     d²
 66   69    4    7.5       -3.5       12.25
 64   67    2    3.5       -1.5        2.25
 68   69   6.5   7.5       -1.0        1.00
 65   66    3    1.5        1.5        2.25
 69   70   8.5   10        -1.5        2.25
 63   67    1    3.5       -2.5        6.25
 71   69   11    7.5        3.5       12.25
 67   66    5    1.5        3.5       12.25
 69   72   8.5   12        -3.5       12.25
 68   68   6.5    5         1.5        2.25
 70   69   10    7.5        2.5        6.25
 72   71   12    11         1.0        1.00
Total                                 72.50
$$R = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6(72.5)}{12(144 - 1)} = 0.7465$$
Comment: There is a fairly high positive correlation between the fathers' weights and those
of their eldest sons.
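The ranking rule for tied values and the rank correlation formula can be sketched in Python as follows. This is an illustration under stated assumptions, not the course's own implementation: the helper names `average_ranks` and `spearman_r` are hypothetical, and the formula is applied to average ranks exactly as in the worked examples, even when ties are present.

```python
# Illustrative sketch: Spearman rank correlation with average ranks for ties.

def average_ranks(values):
    """Rank from smallest (1) to largest; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_r(x, y):
    """R = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), with d the rank differences."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(average_ranks(x), average_ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

income = [22, 24, 25, 16, 28, 19]
expenditure = [48, 42, 40, 38, 47, 45]
fathers = [66, 64, 68, 65, 69, 63, 71, 67, 69, 68, 70, 72]
sons = [69, 67, 69, 66, 70, 67, 69, 66, 72, 68, 69, 71]

r_company = spearman_r(income, expenditure)
r_weights = spearman_r(fathers, sons)
print(f"company: R = {r_company:.3f}; fathers and sons: R = {r_weights:.4f}")
```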
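The coefficient of determination, $r^2$, is defined as
$$r^2 = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$$
that is, the proportion of the total variation in Y that is explained by the regression, where
r is the correlation coefficient. For example, if $r = 0.70$, then
$r^2 = (0.70)^2 = 0.49$; conversely,
$r = \sqrt{0.4900} = 0.70$.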
Now that you have completed this study session, you can assess how well you have
achieved its learning outcomes by answering these questions. Write your answers in your
study diary and discuss them with your tutor at the next study support meeting. You can
check your answers with the notes on the Self-Assessment Questions at the end of this
a. Regression analysis
b. Correlation Analysis
c. Model
2. The table below shows the respective income (X) and expenditure (Y) (in
million naira) of O.O.U, Ago-Iwoye for a year.
Income (X)      66 64 68 65 69 63 71 67 69 68 70 72
Expenditure (Y) 69 67 69 66 70 67 69 66 72 68 69 71
Glossary of Terms
Degree: extent
1. Probability and statistics for engineers & scientists by Walpole and Myers.
2. Introduction to Statistics. Jedidiah Publishers by Sojobi O.A.
3. Schaum’s Outline Series Theory and Problems of Probability (S.I. Metric) Edition
McGraw Hill Book Company, New York by Symour L.
4. An Introduction to Statistical Methods. Vikas Publishing House. Delhi by GUPTA C.
Statistical design of experiments refers to the process of planning an experiment so that
appropriate data, which can be analyzed by statistical methods, will be collected, resulting in
valid and objective conclusions. There are two aspects to any experimental problem:
the design of the experiment and the statistical analysis of the data generated. The two
are closely related, because the method of analysis depends directly on the design.
Whether one is experimenting with animals or human beings, if the results are to be
analyzed, the design of the experiment is of central importance. The statistical aspects
of the design (or planning) of an experiment are:
among the means of several groups of observations, where each group follows a normal
distribution. ANOVA is particularly useful when the basic differences between the groups
cannot be stated quantitatively. A one-way ANOVA is used to determine the effect of
one independent variable on a dependent variable. A two-way ANOVA is used to
determine the effects of two independent variables on a dependent variable. As the
number of independent variables increases, the calculations become much more complex
and are best carried out on a computer. An independent variable is also
referred to as a factor or treatment.
One-way ANOVA is used when we wish to test the equality of k population means. The
procedure is based on the assumptions that each of the k groups of observations is a random
sample from a normal distribution and that the population variance $\sigma^2$ is constant among
the groups. ANOVA models provide an appropriate estimate to facilitate comparison of
several means. This is a very simple form of experimental design, in which the treatments are
allocated to the experimental units purely on a chance or random basis. It should be used
when the experimental units are homogeneous. The model involves only one treatment
variable in the design.
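$$X_{ij} = \mu + \tau_i + \varepsilon_{ij}$$
where $\mu$ is the overall mean, $\tau_i$ is the effect of treatment i, and
$\varepsilon_{ij} \sim NID(0, \sigma^2)$ is the random error.
The results may be analyzed by one-way analysis of variance and the F-test.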
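Treatment   Observations                               Total       Mean
Trt 1       $X_{11}$  $X_{12}$  ...  $X_{1n}$          $X_{1.}$    $\bar{X}_{1.}$
Trt 2       $X_{21}$  $X_{22}$  ...  $X_{2n}$          $X_{2.}$    $\bar{X}_{2.}$
...
Trt k       $X_{k1}$  $X_{k2}$  ...  $X_{kn}$          $X_{k.}$    $\bar{X}_{k.}$
$$\bar{X}_{i.} = \frac{1}{n} \sum_{j=1}^{n} X_{ij}, \qquad i = 1, 2, \ldots, k$$
$$\bar{X}_{..} = \frac{1}{k} \sum_{i=1}^{k} \bar{X}_{i.} = \frac{1}{kn} \sum_{i=1}^{k} \sum_{j=1}^{n} X_{ij}$$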
Total sum of squares (TSS): The TSS is defined as the sum of the squares of the deviations
of all observations from the grand mean.
$$TSS = \sum_{i=1}^{k} \sum_{j=1}^{n} (X_{ij} - \bar{X}_{..})^2$$
It is a measure of the dispersion of all the variates about the grand mean. Its degrees of
freedom (df) are kn - 1. It can be shown that the TSS (also written SStotal, the total
variation) can be partitioned into two parts.
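$$\sum_{i=1}^{k} \sum_{j=1}^{n} (X_{ij} - \bar{X}_{..})^2 = \sum_{i=1}^{k} \sum_{j=1}^{n} (X_{ij} - \bar{X}_{i.})^2 + n \sum_{i=1}^{k} (\bar{X}_{i.} - \bar{X}_{..})^2$$
TSS = Within treatment sum of squares + Between treatment sum of squares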
Within sum of squares (WSS): WSS, or the sum of squares due to error (residual error), is
defined as the sum of the squared deviations of $X_{ij}$ (the original observations) from their
treatment means. It measures the experimental error of the given experiment. Its degrees of
freedom are k(n - 1), denoted by $v_2$.
Between sum of squares (BSS): It is defined as the sum of the squared deviations of the
treatment means about the grand mean. The less the samples differ from each other, the smaller
the BSS or treatment sum of squares (SSTr).
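In computational form:
$$TSS = \sum_{i=1}^{k} \sum_{j=1}^{n} (X_{ij} - \bar{X}_{..})^2 = \sum_{i=1}^{k} \sum_{j=1}^{n} X_{ij}^2 - \frac{T^2}{nk}, \qquad T = \sum_{i=1}^{k} \sum_{j=1}^{n} X_{ij}$$
$$X_{i.} = \sum_{j=1}^{n} X_{ij}, \quad \bar{X}_{i.} = \frac{X_{i.}}{n}, \qquad X_{..} = \sum_{i=1}^{k} \sum_{j=1}^{n} X_{ij}, \quad \bar{X}_{..} = \frac{X_{..}}{nk}$$
$$BSS = n \sum_{i=1}^{k} (\bar{X}_{i.} - \bar{X}_{..})^2 = \frac{1}{n} \sum_{i=1}^{k} T_{i.}^2 - \frac{T^2}{nk}$$
where $T_{i.} = X_{i.}$ is the total for treatment i and $T = X_{..}$ is the grand total.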
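Given the model $X_{ij} = \mu + \tau_i + \varepsilon_{ij}$, the null hypothesis is
$$H_0: \tau_1 = \tau_2 = \cdots = \tau_k$$
Source               SS    df         MS                      F
Between treatments   BSS   k - 1      BSS/(k - 1) = A         A/B
Within treatments    WSS   k(n - 1)   WSS/[k(n - 1)] = B
Total                TSS   kn - 1
The critical value is $F_{1-\alpha, v_1, v_2}$, where the degrees of freedom are $v_1 = k - 1$ and
$v_2 = k(n - 1)$, and $\alpha$ is the significance level.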
Suppose the following table gives the examination performance of five students,
labelled A, B, C, D and E, in three subjects. Perform an analysis of variance to test, at the
5% level of significance, whether the treatment effects are the same or not, and compute the
coefficient of variation to determine its precision.
A  B  C  D  E
3  5  7  6  4
2  8  8  8  9
4  8  6  7  5
Fig 8.2
Test of Hypothesis
$$H_0: \tau_1 = \tau_2 = \cdots = \tau_5$$
                         A    B    C    D    E
                         3    5    7    6    4
                         2    8    8    8    9
                         4    8    6    7    5
$T_{i.}$ (total)         9   21   21   21   18
$\bar{X}_{i.}$ (mean)    3    7    7    7    6
Fig 8.3
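$$TSS = \sum \sum (X_{ij} - \bar{X}_{..})^2 = (3 - 6)^2 + (2 - 6)^2 + (4 - 6)^2 + \cdots + (9 - 6)^2 + (5 - 6)^2 = 62$$
$$BSS = n \sum (\bar{X}_{i.} - \bar{X}_{..})^2 = 3[(3 - 6)^2 + (7 - 6)^2 + (7 - 6)^2 + (7 - 6)^2 + (6 - 6)^2] = 3[9 + 1 + 1 + 1 + 0] = 36$$
$$WSS = TSS - BSS = 62 - 36 = 26$$
Alternatively, with T = 90 the grand total:
$$TSS = \sum \sum X_{ij}^2 - \frac{T^2}{nk} = 3^2 + 2^2 + 4^2 + \cdots + 4^2 + 9^2 + 5^2 - \frac{90^2}{3 \times 5} = 602 - 540 = 62$$
$$BSS = \frac{1}{n} \sum T_{i.}^2 - \frac{T^2}{nk} = \frac{9^2 + 21^2 + 21^2 + 21^2 + 18^2}{3} - \frac{90^2}{3 \times 5} = 576 - 540 = 36$$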
Source of Variation   SS   df   MS              F
Between treatments    36    4   36/4 = 9        9/2.6 = 3.46
Within treatments     26   10   26/10 = 2.6
Total                 62   14
Decision: Since $F_{cal} = 3.46 < F_{tab} = F_{0.05; 4, 10} = 3.48$, we do not reject H0 and conclude
that the treatment mean effects in the five treatments are equal; that is, there is no
significant difference between the treatment means in the five treatments.
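The sums of squares and the F ratio for this example can be checked with a short Python sketch. It is added here for illustration only; the function name `one_way_anova` is an assumption, and the code assumes every treatment has the same number of observations, as in the example.

```python
# Illustrative sketch: one-way ANOVA sums of squares for equal group sizes.

def one_way_anova(groups):
    """Return (BSS, WSS, TSS, F) for k treatments with n observations each."""
    k = len(groups)
    n = len(groups[0])
    grand_total = sum(sum(g) for g in groups)
    correction = grand_total ** 2 / (n * k)              # T^2 / nk
    tss = sum(x * x for g in groups for x in g) - correction
    bss = sum(sum(g) ** 2 for g in groups) / n - correction
    wss = tss - bss
    f_ratio = (bss / (k - 1)) / (wss / (k * (n - 1)))    # MS between / MS within
    return bss, wss, tss, f_ratio

# Scores of the five students (treatments) in three subjects.
scores = {
    "A": [3, 2, 4],
    "B": [5, 8, 8],
    "C": [7, 8, 6],
    "D": [6, 8, 7],
    "E": [4, 9, 5],
}
bss, wss, tss, f_cal = one_way_anova(list(scores.values()))
print(f"BSS = {bss:.0f}, WSS = {wss:.0f}, TSS = {tss:.0f}, F = {f_cal:.2f}")
```

The computed F ratio is then compared with the tabulated F value on k - 1 and k(n - 1) degrees of freedom.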
Now that you have completed this study session, you can assess how well you have
achieved its learning outcomes by answering these questions. Write your answers in your
study diary and discuss them with your tutor at the next study support meeting. You can
check your answers with the notes on the Self-Assessment Questions at the end of this
1 2 3 4
A 33 28 29 31
B 37 34 34 34
C 35 33 37 30
Glossary of Terms
Homogeneous: made up of things (people, events, objects, etc.) that are similar to each other
Experimental unit: a physical entity that is the primary unit of interest in a specific
research objective
Variate: a quantity having a numerical value for each member of a group, especially one
whose values occur according to a frequency distribution
i. Statistics deals with only those subjects of inquiry which are capable of being
quantitatively measured and numerically expressed.
ii. It deals only with aggregates of facts and no importance is attached to individual
iii. Statistical results might be misleading if data collection is faulty
iv. Statistics can be used to establish wrong conclusions and therefore, can be used
only by experts.
i. Questionnaires
ii. Interview (which may be Telephone, Personal or Indirect)
iii. Experiment
iv. Observation
v. Group discussion
2. Bar chart has space between its bars while histogram does not
3. Single bar chart, Component bar chart and multiple bar chart
4. Histogram has rectangular bars while a frequency polygon has irregular straight lines
gotten by joining the midpoints of histogram
Number of women
a. Single bar chart
[Figures: bar charts of the number of male and the number of female by session (90/91, 91/92, 92/93, 93/94)]
a. Class interval 10 – 14, 15 – 19, 20 – 24, 25 – 29, 30 – 34, 35 – 39, 40 – 44
b. Class boundary 9.5 – 14.5, 14.5 – 19.5, 19.5 – 24.5, 24.5 – 29.5, 29.5 –
34.5, 34.5 – 39.5, 39.5 – 44.5
c. 4
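Class interval   Frequency   Cumulative frequency
10 – 14             19              19
15 – 19             24              43
20 – 24             37              80
25 – 29             81             161
30 – 34             43             204
35 – 39             30             234
40 – 44             16             250

Class interval   Frequency   Relative frequency
10 – 14             19            0.076
15 – 19             24            0.096
20 – 24             37            0.148
25 – 29             81            0.324
30 – 34             43            0.172
35 – 39             30            0.120
40 – 44             16            0.064
Total              250            1.000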
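Cumulative and relative frequencies follow mechanically from the class frequencies. The short Python sketch below is an illustration added to this answer, not part of the original solution; the variable names are assumptions.

```python
# Illustrative sketch: cumulative and relative frequencies from class frequencies.
from itertools import accumulate

classes = ["10-14", "15-19", "20-24", "25-29", "30-34", "35-39", "40-44"]
frequencies = [19, 24, 37, 81, 43, 30, 16]

total = sum(frequencies)
cumulative = list(accumulate(frequencies))        # running totals
relative = [f / total for f in frequencies]       # each frequency over the total

for label, f, cf, rf in zip(classes, frequencies, cumulative, relative):
    print(f"{label:>7} {f:>4} {cf:>4} {rf:.3f}")
```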
[Figures: cumulative frequency curve, frequency polygon and bar chart for the classes 15 < 20, 20 < 25, 25 < 30, 30 < 35, 35 < 40, 40 < 45]
3. K
Class interval   Frequency   Cumulative frequency   Relative frequency
40 – 49             25              25                    0.25
50 – 59             32              57                    0.32
60 – 69             17              74                    0.17
70 – 79             11              85                    0.11
80 – 89              8              93                    0.08
90 – 99              6              99                    0.06
[Figures: cumulative frequency charts for the classes 40 - 49, 50 - 59, 60 - 69, 70 - 79, 80 - 89, 90 - 99]
1ii A type I error has been committed if we reject the null hypothesis when it
is true and a type II error has been committed if we accept the null hypothesis when it
is false.
2ii A point estimator is a procedure leading to a single numerical value for the
estimate of the unknown parameter
2iii An interval estimator is a random interval in which the true value of the
parameter lies with some probability which is usually called confidence interval
i. Var(C) = 0, where C is a constant
ii. Var(CX) = C² Var(X)
iii. If X and Y are independent, then Var(X + Y) = Var(X) + Var(Y). Two variables
are independent if the value of one does not depend on the value assumed by the other.
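These three properties can be verified numerically. The Python sketch below is an illustration only: the data values are made up for the check, and the independent pair is built as every combination of two small sets of values so that X and Y are exactly unrelated in the sample.

```python
# Illustrative numerical check of the three variance properties.
# All data values here are invented for the check, not taken from the course.
from itertools import product
from statistics import pvariance

x = [2, 4, 4, 5, 7, 8]
c = 3

var_x = pvariance(x)
var_const = pvariance([c] * len(x))          # property i: Var(C) = 0
var_cx = pvariance([c * v for v in x])       # property ii: Var(CX) = C^2 Var(X)

# Property iii: every (a, b) combination occurs once, so a and b are independent.
pairs = list(product([1, 2, 3], [10, 20]))
var_a = pvariance([a for a, _ in pairs])
var_b = pvariance([b for _, b in pairs])
var_sum = pvariance([a + b for a, b in pairs])

print(var_const, var_cx, c ** 2 * var_x)
print(var_sum, var_a + var_b)
```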
Mine 1 84 82 83 78 79
Mine 2 75 76 77 80 76
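$X_1$    $(X_1 - \bar{x}_1)^2$        $X_2$    $(X_2 - \bar{x}_2)^2$
 84            9                       75            4
 82            1                       76            1
 83            4                       77            0
 78            9                       80            9
 79            4                       76            1
406           27                      384           15
$$\bar{x}_1 = \frac{\sum X_1}{n_1} = \frac{406}{5} = 81.2 \approx 81$$
$$S_1^2 = \frac{\sum (X_1 - \bar{x}_1)^2}{n} = \frac{27}{5} = 5.4$$
Similarly, $\bar{x}_2 = 384/5 = 76.8 \approx 77$ and $S_2^2 = 15/5 = 3$, so that $S_1 = 2.32$ and $S_2 = 1.73$.
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\frac{S_1}{\sqrt{n_1}} + \frac{S_2}{\sqrt{n_2}}} = \frac{81 - 77}{\frac{2.32}{\sqrt{5}} + \frac{1.73}{\sqrt{5}}} = \frac{4}{1.82} = 2.197$$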
$\hat{p} = 0.36$
Test statistic
As the alternative hypothesis is two-sided, we determine the rejection region by applying a
two-tailed test at the 5% level of significance, for which the critical value is 1.96.
The observed value of Z is -0.579, which lies in the acceptance region, and so H0 is accepted.
$\hat{p}_1 = 0.45$, $\hat{q}_1 = 1 - \hat{p}_1 = 1 - 0.45 = 0.55$, $n_1 = 1000$
$\hat{p}_2 = 0.5$, $\hat{q}_2 = 1 - \hat{p}_2 = 1 - 0.5 = 0.5$, $n_2 = 400$
−0.05 −0.005
𝑍= = = −0.213
√0.000563 0.0237
𝑍𝑡𝑎𝑏𝑙𝑒 𝑎𝑡 1% = 1.64
𝑍𝑡𝑎𝑏𝑙𝑒 𝑎𝑡 5% = 1.96
The observed value of Z is -0.213, which lies in the acceptance region at both the 1% and 5%
levels, and so H0 is accepted.
1a Chi-square ($\chi^2$) is a special significance test which is used in a very large number
of cases to test the accordance between fact and theory (or between observed values and
expected values).
Below 67 64 25 156
Average 42 76 56 174
Above 10 23 37 70
Total 119 163 118 400
Null hypothesis (H0): There is no relationship between the intelligence of persons and
their subsequent performance in the banking hall.
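$$\chi^2_{calculated} = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad \text{where } E_{ij} = \frac{a_i b_j}{N}$$
Here $a_i$ is the total of row i, $b_j$ is the total of column j and N is the grand total. For
example, for the third row:
$$E_{31} = \frac{70 \times 119}{400} = 20.83, \qquad E_{32} = \frac{70 \times 163}{400} = 28.53, \qquad E_{33} = \frac{70 \times 118}{400} = 20.65$$
$$\chi^2_{calculated} = 41.009$$
$$\chi^2_{tabulated} = \chi^2_{0.01, (r-1)(c-1)} = \chi^2_{0.01, 4} = 13.277$$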
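The expected frequencies and the calculated chi-square value can be reproduced with the Python sketch below. It is an illustration added to this answer; the variable names are assumptions, and the rows follow the Below, Average, Above ordering of the table.

```python
# Illustrative sketch: chi-square statistic for a contingency table.

observed = [
    [67, 64, 25],   # Below
    [42, 76, 56],   # Average
    [10, 23, 37],   # Above
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)   # (r - 1)(c - 1)

print(f"chi-square = {chi_square:.2f} on {dof} degrees of freedom")
```

Because the calculated value is far above the tabulated 13.277, the hypothesis of no relationship is rejected at the 1% level.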
a. Regression Analysis is a statistical tool which helps to study the trend and pattern
of movement in one variable in response to changes in another variable on the
basis of an assumed relationship existing between them
b. Correlation analysis refers to the degree or extent of relationship or association
between two or more variables.
c. A model is any representation of reality; it can be physical or graphical
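Regression of X on Y: $X = a + bY$, where
$$b = \frac{n \sum xy - \sum x \sum y}{n \sum y^2 - (\sum y)^2}, \qquad a = \bar{x} - b\bar{y}$$
    X     Y      XY      X²      Y²
   66    69    4554    4356    4761
   64    67    4288    4096    4489
   68    69    4692    4624    4761
   65    66    4290    4225    4356
   69    70    4830    4761    4900
   63    67    4221    3969    4489
   71    69    4899    5041    4761
   67    66    4422    4489    4356
   69    72    4968    4761    5184
   68    68    4624    4624    4624
   70    69    4830    4900    4761
   72    71    5112    5184    5041
  812   823  55,730  55,030  56,483
$$b = \frac{12(55730) - (812)(823)}{12(56483) - (823)^2} = \frac{484}{467} = 1.0364$$
$$a = 67.667 - 1.0364(68.583) = 67.667 - 71.079 = -3.412$$
$$X = a + bY = -3.412 + 1.0364Y$$
Regression of Y on X: $Y = a + bx$, where
$$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}, \qquad a = \bar{Y} - b\bar{x}$$
$$b = \frac{12(55730) - (812)(823)}{12(55030) - (812)^2} = \frac{484}{1016} = 0.4764$$
$$a = 68.583 - 0.4764(67.667) = 68.583 - 32.237 = 36.346$$
$$Y = a + bx = 36.346 + 0.4764x$$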
Alternative Hypothesis: The flavours of new magi are not equally accepted.
1 2 3 4
Feed A 33 28 29 31
B 37 34 34 34
C 35 33 37 30
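$$SS_{TOTAL} = \sum y_{ij}^2 - \frac{T_{..}^2}{N} = 33^2 + 28^2 + 29^2 + 31^2 + \cdots + 30^2 - \frac{395^2}{12} = 13095 - 13002.083 = 92.917$$
$$SS_{TREATMENT\,(Flavours)} = \frac{\sum y_{.j}^2}{k} - \frac{T_{..}^2}{N} = 13025 - 13002.083 = 22.917$$
where $y_{.j}$ is the total of column j and k = 3 is the number of observations in each column.
$$SS_{ERROR} = SS_{TOTAL} - SS_{TREATMENT} = 92.917 - 22.917 = 70.000$$
Total degrees of freedom: 11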