STA201


OLABISI ONABANJO UNIVERSITY

OPEN AND DISTANCE LEARNING CENTRE


AGO-IWOYE

STA 201: Statistics I



COURSE DEVELOPMENT TEAM

Prof. T.O. Olatayo – Subject Expert

Prof. D.A. Agunbiade – Course Reviewer

Prof. Taofik Azeez – Language Editor

Prof. Oyesoji Aremu – ODL Expert

Mr. Moyosola Ayodele – Instructional Designer


Vice Chancellor’s Message

It is with great pleasure that I welcome you as learners to the Olabisi Onabanjo University
Open and Distance Learning Centre.
Massification and democratisation of higher education through Open and Distance Learning,
as advocated globally, has long been one of the goals of the Olabisi Onabanjo University
Management. Open and Distance Learning has therefore been one of my areas of focus
since my assumption of duty. Through the efforts of the University Governing Council
and Senate, the establishment of the Open and Distance Learning Centre was approved in
July 2016.

Open and Distance Learning is a mode of study that affords tertiary education
opportunities to all and sundry regardless of age, gender, location, space and other limiting
factors.

Quite a large number of qualified applicants for tertiary education are denied admission
yearly. There are also several others who wish to advance educationally but cannot
because of the jobs that are their means of livelihood.

Olabisi Onabanjo University, via its Open and Distance Learning Centre, offers quality,
technology-driven, flexible, self-directed and cost-effective tertiary education. It is a viable
option for learners who wish to study online from their location and at their desired time.

This course material provides learners with vital information relevant to our programme
and schedules. I advise learners to make judicious use of it. I congratulate our Open and
Distance Learning Centre Staff, Department and Faculty for their effort towards the
production of this handbook.

I hope your learning experience with the Olabisi Onabanjo University Open and Distance
Learning Centre is memorable and exciting.

Prof Ganiyu Olatunji Olatunde

Vice Chancellor OOU


Course Study Guide

Introduction

STA 201, titled Statistics I, is a 3-unit course for students studying towards a
Bachelor of Science in Accounting. The course is divided into 8 study sessions. It
will introduce you to the basic statistical concepts used in solving practical problems.

The course study guide gives you an overview of what STA 201 is all about, the
textbooks and other materials to be referenced, what you are expected to know in each
unit and how to work through the course materials. By the end of the course, you should
be able to define a set and identify various notations of sets, present statistical data in
various ways and know the applications of statistics.

Recommended Study Time

This course is a 3-unit course divided into 8 study sessions. You are enjoined to spend at
least 3 hours studying the content of each study unit.

What you are about to learn in this course

The overall aim of this course, STA 201, is to introduce you to statistics, presentation of
data, measures of central tendency and dispersion, probability, random variables and
statistical hypotheses, analysis of categorical data, regression and correlation analysis,
and analysis of variance.

Course Aims

This course aims to introduce students to basic statistical terms. It is expected that the
knowledge will help the reader to use mathematical principles effectively to solve even
real-life problems.


Course Objectives

It is important to note that each unit has specific objectives. You should study them
carefully before proceeding to subsequent units. It may be useful to refer to
these objectives in the course of your study of the unit to assess your progress. You should
always look at the unit objectives again after completing a unit. In this way, you can be sure
that you have done what is required of you by the end of the unit.

However, the overall objective of STA 201 is to give basic knowledge of data
presentation, interpretation and analysis, and familiarity with the techniques to use them
effectively.

Working through this course

In order to have a thorough understanding of the course units, you will need to read and
understand the contents, practice the steps by designing and implementing a mini
computer application system for your department and be committed to learning and
implementing your knowledge.

This course is designed to cover approximately fifteen weeks and it will require your
devoted attention. You should do the exercises in the Tutor-Marked Assignments and
submit them to your tutor via the Learning Management System (LMS).


Course Materials

The major components of the course are:

1. Course Guide
2. Printed Lecture materials
3. Text Books
4. Interactive DVD
5. Electronic Lecture materials via LMS
6. Tutor Marked Assignments

Assessment

There are two aspects to the assessment of this course. First, there are tutor-marked
assignments and second, the written examination. You are therefore expected to take
note of the facts, information and problem-solving skills gathered during the course. The
tutor-marked assignments must be submitted to your tutor for formal assessment in accordance
with the deadline given. The work submitted will count for 30% of your total course mark.

At the end of the course, you will need to sit for a final written examination. This
examination will account for 70% of your total score. You will be required to submit some
assignments by uploading them to STA 201 page on the Learning Management System
(LMS).

Tutor-Marked Assignment (TMA)

There are TMAs in this course. You need to submit all the TMAs; the best 10 will
be counted. When you have completed each assignment, send it to your tutor
as soon as possible and make certain that it gets to your tutor on or before the stipulated
deadline. If for any reason you cannot complete your assignment on time, contact your
tutor before the assignment is due to discuss the possibility of an extension. An extension
will not be granted after the deadline, except in extraordinary cases.

Final Examination and Grading

The final examination for STA 201 will last for not more than 2 hours and has a
value of 70% of the total course grade. The examination will consist of questions which
reflect the Self-Assessment Questions (SAQs), In-Text Questions (ITQs), some applied
questions and tutor-marked assignments that you have previously encountered.
All areas of the course will be examined, so it is advisable to use the time
between finishing the last unit and sitting for the examination to revise the entire course.
You might find it useful to review your TMAs and the comments on them before the
examination. Most examinations will be conducted via Computer Based Testing (CBT).

Tutors and Tutorials

There are a few hours of face-to-face tutorial provided in support of this course. You will be
notified of the dates, time and location together with the name and phone number of your
tutor as soon as you are allocated a tutorial group. Your tutor will mark and comment on
your assignments, keep a close watch on your progress and on any difficulties you might
encounter and provide assistance to you during the course. You must submit your tutor
marked assignment to your tutor well before the due date. At least two working days are
required for this purpose. They will be marked by your tutor and returned as soon as
possible via the same means of submission.

Do not hesitate to contact your tutor by telephone, e-mail or discussion board if you need
help. Contact your tutor if:

• You do not understand any part of the study units or the assigned readings.

• You have difficulty with the self-test or exercise.

• You have questions or problems with an assignment, with your tutor’s comments on an
assignment or with the grading of an assignment.

You should endeavour to attend the tutorials. This is the only opportunity to have
face-to-face contact with your tutor and ask questions which are answered instantly. You can raise
any problem encountered in the course of your study. To gain the maximum benefit from
the course tutorials, have some questions handy before attending them. You will learn a
lot from participating actively in discussions.

Good luck!

Recommended Texts

The following texts and Internet resource links will be of enormous benefit to you in
learning this course:

1. Walpole and Myers. Probability and Statistics for Engineers & Scientists.
2. Sojobi, O.A. Introduction to Statistics. Jedidiah Publishers.
3. Lipschutz, S. Schaum’s Outline Series: Theory and Problems of Probability (S.I. Metric
Edition). McGraw-Hill Book Company, New York.
4. Gupta, C.B. An Introduction to Statistical Methods. Vikas Publishing House, Delhi.


Table of Contents

Vice Chancellor’s Message
Course Study Guide
  Introduction
Table of Contents

Study Session 1: Introduction to Statistics
  Introduction
  Learning Outcomes for Study Session 1
  1.1 Meaning of Statistics
    1.1.1 Classification of Statistics
    1.1.2 Types of Statistical Data
    1.1.3 Characteristics of Statistical Data
    1.1.4 Functions of Statistics
    1.1.5 Limitations of Statistics
  1.2 Population and Sample
    1.2.1 Sources of Data
    1.2.2 Methods of Data Collection
  1.3 Questionnaire
    1.3.1 Types of Questionnaire
    1.3.2 Quality of a Good Questionnaire
  Summary of Study Session 1
  Self-Assessment Questions (SAQs) for Study Session 1
  Glossary of Terms
  References

Study Session 2: Presentation of Data
  Introduction
  Learning Outcomes for Study Session 2
  2.1 Tabulation and Classification
  2.2 Data Presentation
    2.2.1 Pie charts
    2.2.2 Bar Charts
    2.2.3 Histogram
    2.2.4 Frequency polygon
    2.2.5 Cumulative Frequency Curve (Ogive)
  Summary of Study Session 2
  Self-Assessment Questions (SAQs) for Study Session 2
  Glossary of Terms
  References

Study Session 3: Measures of Central Tendency and Dispersion
  Introduction
  Learning Outcomes for Study Session 3
  3.1 Measure of Central Tendency
    3.1.1 The Arithmetical Mean
    3.1.2 Mean of a Group Data
    3.1.3 The Mode
    3.1.4 The Median
    3.1.5 Median of a Grouped Data
  3.2 Measures of Dispersion
    3.2.1 The Range
    3.2.2 Quartile Deviation
    3.2.3 Mean Deviation
    3.2.4 Standard Deviation
    3.2.5 Variance
  Summary of Study Session 3
  Self-Assessment Questions (SAQs) for Study Session 3
  Glossary of Terms
  References

Study Session 4: Concepts of Probability
  Introduction
  Learning Outcomes for Study Session 4
  4.1 Definition
  4.2 Approaches of Assigning Probability
    4.2.1 The Classical Approach
    4.2.2 The Relative Frequency Approach
    4.2.3 The Subjective or Personal approach
    4.2.4 Some Basic Definitions
    4.2.5 Axioms of probability
    4.2.6 Conditional Probability
    4.2.7 Bayes’ Theorem
  4.3 Factorials
    4.3.1 Permutation
    4.3.2 Combination
  4.4 Probability Distribution
    4.4.1 Normal Distribution
  Summary of Study Session 4
  Self-Assessment Questions (SAQs) for Study Session 4
  Glossary of Terms
  References

Study Session 5: Random Variable and Statistical Hypothesis
  Introduction
  Learning Outcomes for Study Session 5
  5.1 Random Variable
    5.1.1 Discrete Random Variables
    5.1.2 Continuous Random Variables
  5.2 Expectation and Distribution Parameters
    5.2.1 Expected Value
    5.2.2 Variance
    5.2.3 Standard deviation
    5.2.4 Rules for Expectation
    5.2.5 Rules for Variance
  5.3 Test of Hypothesis
    5.3.1 Type I and Type II Errors
    5.3.2 One and Two Tailed Test
    5.3.3 Test Procedure and Steps
    5.3.4 Test Concerning the Mean (For Large Sample)
    5.3.5 Test Concerning Means (Small Samples)
    5.3.6 Test Concerning Two Population Means (Large Sample)
    5.3.7 Test Concerning Two Population Means (Small Sample)
  5.4 Estimation of Parameters
    5.4.1 Point Estimation
    5.4.2 Interval Estimation
  Summary of Study Session 5
  Self-Assessment Questions (SAQs) for Study Session 5
  Glossary of Terms
  References

Study Session 6: Analysis of Categorical Data
  Introduction
  Learning Outcomes for Study Session 6
  6.1 Introduction
  6.2 The Chi Square Test
    6.2.1 Uses of Chi-Square Test
  Summary of Study Session 6
  Self-Assessment Questions (SAQs) for Study Session 6
  Glossary of Terms
  References

Study Session 7: Regression and Correlation Analysis
  Introduction
  Learning Outcomes for Study Session 7
  7.1 Introduction
    7.1.1 Scatter Diagram
    7.1.2 The Model
    7.1.3 Deterministic and Probabilistic
  7.2 Simple Linear Regression
    7.2.1 The Least Squares Method
    7.2.2 Regression Analysis
  7.3 Correlation Analysis
    7.3.1 Product moment Correlation Coefficient
    7.3.2 Spearman Rank Correlation Coefficient
    7.3.3 Tie in Ranks
    7.3.4 Coefficient of Determination
  Summary of Study Session 7
  Self-Assessment Questions (SAQs) for Study Session 7
  Glossary of Terms
  References

Study Session 8: Analysis of Variance
  Introduction
  Learning Outcomes for Study Session 8
  8.1 Introduction to Analysis of Variance
    8.1.1 Basic Terms in Analysis of Variance
  8.2 Analysis of Variance (ANOVA)
    8.2.1 One-way Analysis of Variance
    8.2.2 Sum of Squares identity
    8.2.3 One-way ANOVA Table (Equal observation)
  Summary of Study Session 8
  Self-Assessment Questions (SAQs) for Study Session 8
  Glossary of Terms
  References

Notes on Self-Assessment Questions (SAQs)
  Notes on Self-Assessment Questions 1
  Notes on Self-Assessment Questions 2
  Notes on Self-Assessment Questions 3
  Notes on Self-Assessment Questions 4
  Notes on Self-Assessment Questions 5
  Notes on Self-Assessment Questions 6
  Notes on Self-Assessment Questions 7
  Notes on Self-Assessment Questions 8

Study Session 1: Introduction to Statistics

Introduction

Today, there have been advancements in many fields, such as commerce, economics and
mathematics. Everyday life has also seen considerable development in areas such as
defence, banking and hospitality. All of these depend largely on statistics.

Learning Outcomes for Study Session 1

On completion of this study session, you should be able to:


1.1 Explain the term statistics
1.2 State the concept of population and sample
1.3 Describe a questionnaire


1.1 Meaning of Statistics

Statistics is the science of learning from experience, especially experience that arrives a
little bit at a time. It can be generally defined as a scientific methodology used for the
collection, presentation, analysis and interpretation of data in order to reach useful
decisions and conclusions.

1.1.1 Classification of Statistics

Statistics can be broadly classified into descriptive and inferential statistics. Descriptive
statistics are brief descriptive coefficients that summarize a given data set, which can
represent either the entire population or a sample of it. Descriptive statistics are
broken down into measures of central tendency and measures of variability. Inferential
statistics involves using data from a sample to make inferences about the larger
population from which the sample was drawn.
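
As a rough illustration of the distinction, the sketch below uses Python's standard statistics module on made-up scores. The data values and variable names are assumptions chosen only for illustration. The first part summarises the sample itself (descriptive statistics); the second part uses the sample to say something about the larger population through an approximate 95% interval for the population mean (inferential statistics). The interval uses a normal approximation with z = 1.96, which is a simplification for a sample this small.

```python
import math
import statistics

# Hypothetical test scores of 10 students sampled from a larger class
sample_scores = [52, 61, 48, 70, 65, 58, 74, 55, 63, 60]

# Descriptive statistics: summarise the sample itself
mean_score = statistics.mean(sample_scores)
median_score = statistics.median(sample_scores)
sd_score = statistics.stdev(sample_scores)  # sample standard deviation (n - 1 divisor)

# Inferential statistics: use the sample to estimate the population mean.
# Assumption: normal approximation with z = 1.96 for a rough 95% interval.
standard_error = sd_score / math.sqrt(len(sample_scores))
lower = mean_score - 1.96 * standard_error
upper = mean_score + 1.96 * standard_error

print(f"Sample mean = {mean_score:.1f}, median = {median_score:.1f}, sd = {sd_score:.2f}")
print(f"Approximate 95% interval for the population mean: ({lower:.1f}, {upper:.1f})")
```

The mean, median and standard deviation describe only the ten scores in hand, while the interval is a statement about the whole class from which they were drawn.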

1.1.2 Types of Statistical Data

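Statistical data can be either a variable or an attribute. A variable is measurable and can be
either discrete or continuous, while an attribute is non-measurable. For example, the
number of children in a household is a discrete variable, a person's height is a continuous
variable, and marital status is an attribute.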
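
One way to picture these data types is to label each column of a small data set. The sketch below is only an illustration: the records and the classify helper are hypothetical, and the rule it applies (text is an attribute, whole numbers are discrete, decimals are continuous) is a simplification, since in practice it is the nature of the quantity, not how it is stored, that decides its type.

```python
# Hypothetical records for three students (illustration only)
records = [
    {"sex": "F", "courses_registered": 6, "height_m": 1.62},
    {"sex": "M", "courses_registered": 5, "height_m": 1.75},
    {"sex": "F", "courses_registered": 7, "height_m": 1.58},
]


def classify(values):
    """Label a column as an attribute, a discrete variable or a continuous variable."""
    if all(isinstance(v, str) for v in values):
        return "attribute"            # qualities such as sex or marital status
    if all(isinstance(v, int) for v in values):
        return "discrete variable"    # counts that take whole-number values
    return "continuous variable"      # measurements on a continuous scale


# Apply the rule to every column of the data set
data_types = {
    column: classify([row[column] for row in records])
    for column in records[0]
}

for column, kind in data_types.items():
    print(f"{column}: {kind}")
```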

1.1.3 Characteristics of Statistical Data

i. They must be aggregates of facts.
ii. They must be affected to a marked extent by a multiplicity of causes.
iii. They must be enumerated or estimated according to a reasonable standard of
accuracy.
iv. They must have been collected in a systematic manner for a predetermined
purpose.
v. They must be placed in relation to each other.


1.1.4 Functions of Statistics

Statistics is useful in the following ways:

i. It presents facts in a definite form.
ii. It simplifies unwieldy and complex masses of data.
iii. It classifies numerical facts.
iv. It furnishes a technique of comparison.
v. It endeavours to interpret conditions.

1.1.5 Limitations of Statistics

i. Statistics deals only with those subjects of inquiry which are capable of being
quantitatively measured and numerically expressed.
ii. It deals only with aggregates of facts, and no importance is attached to
individual items.
iii. Statistical results might be misleading if data collection is faulty.
iv. Statistics can be misused to establish wrong conclusions and should therefore be
handled only by experts.

In-Text Questions (ITQs)

i. Define Statistics
ii. What are the classes of statistics?

In-Text Answers (ITAs)

i. A scientific method used for the collection, presentation, analysis and
interpretation of data in order to reach useful decisions and conclusions.
ii. Descriptive and inferential statistics.

1.2 Population and Sample

In statistics, a population is the totality of the individual observations about which
inferences are to be made; it is the entire pool from which a statistical sample is drawn.
The biological definition of the term “population” is the totality of individuals of a
given species at a given time in a given area, but in statistics the term always refers to
observations. A population can thus be said to be an aggregate of observations on subjects
grouped together by a common feature. Examples include the weights or tail lengths of all
albino rats, the number of newborn babies in Nigeria, the haemoglobin or serum protein
levels of adults, the nutrient contents of varieties of foods, the number of workers in
commercial banks in Nigeria and the number of students offering management sciences
courses in Olabisi Onabanjo University.

A sample is a part of the population. A large number of samples may be taken from the same
population, though not all members of the population may be covered. Inferences drawn from
a sample refer to the defined population from which the sample or samples are drawn.

Fig 1.1: Population vs Sample


1.2.1 Sources of Data

There are two major sources of data, namely primary and secondary sources.
Primary sources refer to statistical data or information which the
investigator originates himself for the purpose of the enquiry at hand. Examples are
censuses, surveys and experiments. Secondary sources refer to statistical data which
are not originated by the investigator himself, but which he obtains from someone else’s
records or from some organization, in either published or unpublished form. Examples
include publications of the National Bureau of Statistics (NBS), the Central Bank of Nigeria
(CBN), the National Population Commission (NPC) and the World Health Organization (WHO).

1.2.2 Methods of Data Collection

Some methods of collecting data are:

i. Questionnaires
ii. Interview (which may be telephone, personal or indirect)
iii. Experiment
iv. Observation
v. Group discussion

In-Text Questions (ITQs)

i. What is population?
ii. List the two major sources of data


In-Text Answers (ITAs)

i. Population is the totality of the individual observations about which inferences are
to be made
ii. Primary and Secondary

1.3 Questionnaire

A questionnaire contains a sequence of questions relevant to the data or information being
sought. It is a formal set of questions prepared by the investigator and answered by the
respondent. A questionnaire usually has two parts. Part one is the classification section,
which contains details of the respondent such as sex, age, marital status, occupation and
state of origin. Part two relates to the subject matter of the enquiry.

1.3.1 Types of Questionnaire

a. Close-End Questionnaire: This is a questionnaire designed in such a way that
respondents are limited to stated alternatives or options, with no room for further
or additional explanation. It is also called a structured questionnaire.
b. Open-End Questionnaire: This is an unstructured questionnaire design which leaves
the respondents free to make whatever reply they choose; that is, the respondents
are not in any way restricted to options.

1.3.2 Quality of a Good Questionnaire

1. Questions should be simple and easily understood.
2. Questions should be in a logical sequence.
3. Questions should be short and unambiguous.
4. Questions should not be offensive, frightening or leading.
5. Questions that may arouse the resentment of the respondents should be avoided.
6. Questions should not require calculations to be made.
7. Where possible, questions should admit a precise answer such as “Yes” or “No”.
8. Questions that rely too much on memory should be avoided, since some people
forget events too soon.

In-Text Questions (ITQs)

i. Define Questionnaire
ii. What are the types of questionnaire?

In-Text Answers (ITAs)

i. A questionnaire is a formal sequence of questions, relevant to the data or information
being sought, which is to be answered by the respondent.
ii. Close-end and open-end questionnaire


Summary of Study Session 1

In study session 1, you have learnt that:

1. Statistics can be broadly classified into Descriptive and Inferential Statistics.
2. A sample is a part of the population.
3. Questionnaires should be simple and easily understood.
4. Data come from primary and secondary sources, and can be collected through
questionnaires, interviews (telephone, personal or indirect), experiments, observation
and group discussion.


Self-Assessment Questions (SAQs) for Study Session 1

Now that you have completed this study session, you can assess how well you have
achieved its learning outcomes by answering these questions. Write your answers in your
study diary and discuss them with your tutor at the next study support meeting. You can
check your answers with the notes on the Self-Assessment Questions at the end of this
session.

Self-Assessment Questions (SAQs) 1.1

1. What is statistics?
2. Explain the following
a. Data
b. Sample
c. Variable
d. Attribute
e. Descriptive Statistics
f. Discrete Variable
g. Observation
h. Population
i. Sample
j. Statistical method
k. Continuous variable
3.
a. Explain what you understand by Questionnaire
b. Outline the types of Questionnaire
4.
a. Outline the functions of Statistics
b. State the limitations of Statistics
5. Mention the two types of data and illustrate with examples.
6. Discuss the sources of data and the various methods of data collection.

Glossary of Terms

Aggregates: combinations of related categories, usually within a common branch of a
hierarchy, to provide information at a broader level than that at which detailed
observations are taken.

Enumerated: the act or process of counting something

Descriptive coefficients: coefficients that summarize a given data set

Scientific methodology: mathematical and experimental techniques employed in the sciences

Measures of central tendency: a single value that attempts to describe a set of data by
identifying the central position within that set of data

Measures of variability: describe how far apart data points lie from each other and from
the center of a distribution



Should you require more explanations on this study session, please do not hesitate to
contact your e-tutor via the LMS.


Study Session 2: Presentation of Data

Introduction

Once data have been collected, they have to be classified and organized in such a
way that they become easily readable and interpretable, that is, converted to
information. Before calculating descriptive statistics, it is often a
good idea to present the data as tables, charts, diagrams or graphs. Most people find
‘pictures’ much more helpful than ‘numbers’ because, in their opinion, pictures
present data more meaningfully.

Learning Outcomes for Study Session 2

On completion of this study session, you should be able to:


2.1 Classify and tabulate data
2.2 Present data in different types of graph


2.1 Tabulation and Classification

Tabulation is the act of arranging facts and figures in the form of table(s) or lists. In
order to make the data easily understandable, the first task of the statistician is to
condense and simplify them in such a manner that irrelevant details are eliminated and
their significant features stand out prominently. The procedure adopted for this purpose
is known as the method of classification and tabulation.

2.2 Data Presentation

Data presentation is the representation of data in an appropriate form, through charts,
diagrams or graphs, in order to make comparison and understanding easy. No matter how
informative and well designed a statistical table is, a good chart, diagram or graph
usually conveys to the reader a more immediate and clear impression of its content.
The most popular charts, diagrams and graphs are pie charts, bar diagrams (bar
charts and histograms) and graphs (frequency polygons and ogives).

2.2.1 Pie charts

A pie chart is simply a circle divided into sections. This circle represents the total of the
data being presented and each section is drawn proportional to its relative size. The main
advantage of a pie chart is that it is easy to understand.


Case Study 2.1

An investigation of the marital status of the staff of a known commercial bank in Nigeria
reveals the following distribution:

Marital status    No. of staff
Single                  35
Married                130
Widowed                 25
Divorced                10

Table 2.1

Draw a pie chart using the above information.

Solution

The total number of staff in the institution is

35 + 130 + 25 + 10 = 200

The angle corresponding to each status is found thus:

Single: $\frac{35}{200} \times 360^\circ = 63^\circ$

Married: $\frac{130}{200} \times 360^\circ = 234^\circ$

Widowed: $\frac{25}{200} \times 360^\circ = 45^\circ$

Divorced: $\frac{10}{200} \times 360^\circ = 18^\circ$


Thus, the pie chart is:

Fig 2.1: Pie Chart of the total number of staff in the institution (sectors: Single 63°,
Married 234°, Widowed 45°, Divorced 18°)

Observation: the chart clearly shows that the majority of the staff in the institution are
married.
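
The angle calculation in Case Study 2.1 can also be checked with a short program. The following is a minimal Python sketch, not part of the original course material; the names `staff` and `pie_angles` are chosen here purely for illustration.

```python
from fractions import Fraction

# Staff counts taken from Table 2.1.
staff = {"Single": 35, "Married": 130, "Widowed": 25, "Divorced": 10}


def pie_angles(counts):
    """Return the sector angle (in degrees) for each category.

    Each angle is (category count / total count) x 360.
    Fractions are used so the arithmetic is exact.
    """
    total = sum(counts.values())
    return {name: Fraction(n, total) * 360 for name, n in counts.items()}


angles = pie_angles(staff)
for name, angle in angles.items():
    print(f"{name}: {float(angle):.0f} degrees")
```

The angles always add up to 360°, which is a quick way to check a pie chart calculation done by hand.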

2.2.2 Bar Charts

Bar charts may be simple, multiple or component in nature. A simple (single) bar chart
consists of a number of equally spaced rectangles.

A multiple bar chart is usually used in the comparison of two or more attributes.

A component bar chart consists of bars which are subdivided into components.


Case Study 2.2

Represent the data in Case Study 2.1 in a bar chart.

Solution

Fig 2.2: Single Bar Chart (one bar each for Single, Married, Widowed and Divorced)

Case Study 2.3

The sex distribution of staff in a cement production company is given below:

S/No   Department             Male   Female   Total
i.     Exploration              25       15      40
ii.    Production               65       30      95
iii.   Security                 45       40      85
iv.    Sales and Marketing      35       15      50
v.     Transportation           30       10      40
       Total                   200      110     310

Table 2.2

Present the above information on a

i.) Multiple Bar Chart
ii.) Component Bar Chart.

Solution

i.) Multiple bar chart of the table in Case Study 2.3:

Fig 2.3: Multiple Bar Chart (separate Male and Female bars for each department)

ii.) Component bar chart of the table in Case Study 2.3:

Fig 2.4: Component Bar Chart (one bar per department, divided into Male and Female components)

2.2.3 Histogram

Histograms and bar charts look alike in presentation, but while the bars of a bar chart
are usually not joined, those of a histogram are usually joined. Furthermore, while the
bar chart attaches importance only to the heights of its bars, the histogram attaches
importance to both the heights and the widths.


Case Study 2.4

Obtain the histogram of the data in Case Study 2.2.

Solution

The histogram of the data in Case Study 2.2 is

Fig 2.5: Histogram (adjoining bars for Single, Married, Widowed and Divorced)

2.2.4 Frequency polygon

A frequency polygon is obtained by joining the midpoints of the top of the rectangles of a
histogram.

Fig 2.6: Frequency Polygon

2.2.5 Cumulative Frequency Curve (Ogive)

To obtain a cumulative frequency curve, we plot the cumulative frequencies against the
upper class boundaries of the class intervals.

Fig 2.7: Cumulative Frequency Curve (horizontal axis: Upper Class Boundary)

The shape of the cumulative frequency curve is usually like that of an elongated S.


Summary of Study Session 2

In study session 2, you have learnt that:

1. Tabulation is an act of arranging facts and figures in the form of table(s) or list.
2. Data presentation is the representation of data in appropriate form in order to make
the comparison and understanding easy through charts, diagram or graph.
3. Data can be represented in bar chart, histogram, pie chart, cumulative frequency
curve etc.


Self-Assessment Questions (SAQs) for Study Session 2

Now that you have completed this study session, you can assess how well you have
achieved its learning outcomes by answering these questions. Write your answers in your
study diary and discuss them with your tutor at the next study support meeting. You can
check your answers with the notes on the Self-Assessment Questions at the end of this
session.

1. List the methods of data presentation.
2. What is the difference between a bar chart and a histogram?
3. Outline the types of bar chart.
4. Differentiate between a histogram and a frequency polygon and illustrate each with
examples.
5. Assume the table below represents the marital status of women in Ago-Iwoye
Central Market. Draw a pie chart.

Marital status    No. of Women
Single                 670
Married                480
Separated              120
Divorced               330
Widowed                400


6. The following data give the enrolment of students for STA201 in some sessions
in Olabisi Onabanjo University, Ago-Iwoye.

Session    No. of Male    No. of Female    No. of Students
90/91          500            1,000             1,500
91/92          750            1,000             1,750
92/93          840              960             1,800
93/94        1,050              950             2,000

Present the information in a:

a. Single bar diagram.
b. Component bar.
c. Percentage component bar.
d. Multiple bar diagram.


Glossary of Terms

Charts: A chart is a graphical representation of data

Cumulative: Step-by-step addition

Proportional: corresponding in size or amount to something else

Component: element of a larger whole

Condense: make something denser or more concentrated

Statistical table: a way of presenting statistical data through a systematic arrangement of
the numbers describing some mass phenomenon or process



Should you require more explanations on this study session, please do not hesitate to
contact your e-tutor via the LMS.


Study Session 3: Measures of Central Tendency and Dispersion

Introduction

Do all the players in a soccer team have the same height, weight or years of
experience? Of course not! If we make a table recording each player’s
‘years of experience’, we get a ‘distribution of data’. Here, knowing the
number of experienced players can help a lot in predicting the kind of game the team
will play. To extract such resourceful information from a given
distribution of data, we need to study and categorize it, and to do so we
need to know about measures of central tendency and dispersion. In layman’s terms,
central tendency is nothing but the ‘average’. Dispersion helps in evaluating how near or far
the other values are from this average value.

Learning Outcomes for Study Session 3

On completion of this study session, you should be able to:

3.1 Explain measure of central tendency

3.2 Solve problems associated with measure of dispersion


3.1 Measure of Central Tendency

A measure of central tendency is a summary statistic that represents the center point or
typical value of a dataset. These measures indicate where most values in a distribution fall
and are also referred to as the central location of a distribution. You can think of it as the
tendency of data to cluster around a middle value. In statistics, the three most common
measures of central tendency are the mean, median, and mode. Each of these measures
calculates the location of the central point using a different method.

3.1.1 The Arithmetic Mean

The arithmetic mean of a series is obtained by adding the values of all observations and
dividing the total by the number of observations. This is what is generally called the
average. In symbols, if $x_1, x_2, \dots, x_n$ are n observed values, then the mean is given by:

$$\bar{x} = \frac{\text{total of all individual values}}{\text{sample size}} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Case Study 3.1

The gains in weight of 6 students of the Economics Department over a one-month
holiday are 50, 60, 40, 40, 40, 70. Find the arithmetic mean weight gain of the students.

Solution

The arithmetic mean or mean is

$$\bar{x} = \frac{50 + 60 + 40 + 40 + 40 + 70}{6} = \frac{300}{6} = 50$$
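
The same calculation can be reproduced in a few lines of code. This is a minimal Python sketch added for illustration (the variable names are assumptions of this sketch); it computes the mean directly from the definition and again with the standard library's `statistics.mean`.

```python
from statistics import mean

# Weight gains of the 6 students in Case Study 3.1.
weight_gains = [50, 60, 40, 40, 40, 70]

# Mean from the definition: total of all values divided by the sample size.
manual_mean = sum(weight_gains) / len(weight_gains)

# The same quantity using the standard library.
library_mean = mean(weight_gains)

print(manual_mean, library_mean)
```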


3.1.2 Mean of Grouped Data

The mean of grouped data is obtained using

$$\bar{x} = \frac{\sum fx}{\sum f}$$

where f is the frequency of a class and x is its mid-value.

Case Study 3.2

Suppose the weights in kg of 40 students in the Faculty of Management
Sciences of Olabisi Onabanjo University are given below:

59, 53, 66, 55, 57, 65, 48, 59, 51, 58, 52, 68, 60, 70, 71, 55, 70, 64, 54, 67, 62, 53, 49, 56,
63, 48, 57, 61, 58, 55, 50, 55, 61, 52, 54, 65, 56, 50, 62, 60

Obtain a frequency distribution and calculate the mean weight of the students.

Solution

Weights (kg)     f      x     fx
48 – 52          8     50    400
53 – 57         12     55    660
58 – 62         10     60    600
63 – 67          6     65    390
68 – 72          4     70    280
Total           40          2330

$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{2330}{40} = 58.25$$
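
The tallying in Case Study 3.2 is easy to get wrong by hand, so it may help to see it done programmatically. The sketch below is an illustrative Python example, not part of the original text; it assumes classes of width 5 starting at 48 (48 – 52, 53 – 57, and so on) and the helper name `frequency_table` is hypothetical.

```python
# Weights (kg) of the 40 students in Case Study 3.2.
weights = [
    59, 53, 66, 55, 57, 65, 48, 59, 51, 58, 52, 68, 60, 70, 71, 55, 70, 64,
    54, 67, 62, 53, 49, 56, 63, 48, 57, 61, 58, 55, 50, 55, 61, 52, 54, 65,
    56, 50, 62, 60,
]

# Class limits (inclusive at both ends); assumed to be width-5 classes from 48.
classes = [(48, 52), (53, 57), (58, 62), (63, 67), (68, 72)]


def frequency_table(data, class_limits):
    """Return one row (lower, upper, f, x, fx) per class.

    f is the class frequency and x is the class mid-value.
    """
    rows = []
    for lower, upper in class_limits:
        f = sum(1 for value in data if lower <= value <= upper)
        x = (lower + upper) / 2
        rows.append((lower, upper, f, x, f * x))
    return rows


table = frequency_table(weights, classes)
total_f = sum(row[2] for row in table)
total_fx = sum(row[4] for row in table)
grouped_mean = total_fx / total_f

for lower, upper, f, x, fx in table:
    print(f"{lower} - {upper}: f = {f}, x = {x}, fx = {fx}")
print("Mean =", grouped_mean)
```

Note that the grouped mean is an approximation to the mean of the raw data, because every value in a class is treated as if it were equal to the class mid-value.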


3.1.3 The Mode

This is the value or number that has the highest frequency in a distribution. The mode may
not exist and even when it does exist, it may not be unique.

Case Study 3.3

The scores in a STA201 test are 5, 2, 4, 7, 5, 3. Find the mode of the test scores.

Solution

Since 5 occurs twice and every other score occurs only once, 5 is the modal score.
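
The remark that the mode may not exist, or may not be unique, can be illustrated with a small program. This is a minimal Python sketch under one stated assumption: when every value occurs exactly once, the data are treated as having no mode (some texts instead treat every value as a mode). The function name `modes` is chosen here for illustration.

```python
from collections import Counter


def modes(data):
    """Return a sorted list of the modal values of `data`.

    An empty list means no mode exists (no value is repeated);
    more than one entry means the mode is not unique.
    """
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(value for value, count in counts.items() if count == highest)


scores = [5, 2, 4, 7, 5, 3]  # test scores from Case Study 3.3
print(modes(scores))
print(modes([1, 2, 3]))
print(modes([1, 1, 2, 2]))
```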

For grouped data, the formula is

$$\text{Mode} = L + \left(\frac{f_m - f_a}{2f_m - f_a - f_b}\right) C$$

where

L = lower class boundary of the modal class

$f_m$ = frequency of the modal class

$f_a$ = frequency of the class immediately above the modal class in the table (the preceding class)

$f_b$ = frequency of the class immediately below the modal class in the table (the following class)

C = size of the modal class interval.

3.1.4 The Median

If a set of data is arranged in order of magnitude, the middle value, which divides the set
into two equal parts, is the median. Generally, for N data values,

$$\text{Median} = \left(\frac{N + 1}{2}\right)^{\text{th}} \text{ item}$$

Case Study 3.4.

Find the median of the following test scores in STA201: (a) 3, 6, 2, 4, 3 (b) 2, 5, 3, 4, 8, 3

Solution

(a) Arrangement in order: 2, 3, 3, 4, 6

Here N = 5

$$\text{Median} = \left(\frac{N + 1}{2}\right)^{\text{th}} \text{ item} = \left(\frac{5 + 1}{2}\right)^{\text{th}} \text{ item} = \text{the 3rd item} = 3$$

(b) Arrangement in order: 2, 3, 3, 4, 5, 8

Here N = 6

$$\text{Median} = \left(\frac{6 + 1}{2}\right)^{\text{th}} \text{ item} = 3.5^{\text{th}} \text{ item}$$

This is interpreted as

$$\frac{\text{3rd item} + \text{4th item}}{2} = \frac{3 + 4}{2} = 3.5$$
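
The rule used in Case Study 3.4 translates directly into code. The following Python sketch is added for illustration only; the function name `median_item` is an assumption of this sketch. It sorts the data, locates the ((N + 1)/2)th item, and averages the two neighbouring items when that position is not a whole number.

```python
def median_item(data):
    """Median via the ((N + 1) / 2)th item rule, counting items from 1."""
    if not data:
        raise ValueError("median of empty data is undefined")
    ordered = sorted(data)
    position = (len(ordered) + 1) / 2
    lower = ordered[int(position) - 1]   # item at the whole-number part
    if position.is_integer():
        return lower                     # odd N: a single middle item
    upper = ordered[int(position)]       # even N: the next item up
    return (lower + upper) / 2


print(median_item([3, 6, 2, 4, 3]))     # part (a)
print(median_item([2, 5, 3, 4, 8, 3]))  # part (b)
```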

3.1.5 Median of a Grouped Data

The median can be obtained graphically from the cumulative frequency curve (Ogive) or
by calculation using the formula below. (Refer to the ogive diagram above.)

$$\text{Median} = L + \left(\frac{\frac{N}{2} - F}{f}\right) C$$

where

L = lower class boundary of the median class

F = cumulative frequency of the class immediately before the one containing the median

f = frequency of the median class

C = size of the median class interval

Case Study 3.5

Use the following frequency distribution of money saved in O.O.U Microfinance
Bank over a period of time:

Money saved (in millions)     f      x     fx
48 – 52                       8     50    400
53 – 57                      12     55    660
58 – 62                      10     60    600
63 – 67                       6     65    390
68 – 72                       4     70    280
Total                        40          2330

Construct the histogram and hence estimate the mode of the distribution.

i. Calculate the mode.
ii. Construct the cumulative frequency curve and deduce the median value.
Solution

i. $$\text{Mode} = L + \left(\frac{f_m - f_a}{2f_m - f_a - f_b}\right) C$$

The modal class is 53 – 57.

Hence, L = 52.5, $f_m$ = 12, $f_a$ = 8, $f_b$ = 10 and C = 5.

Thus

$$\text{Mode} = 52.5 + \left(\frac{12 - 8}{2(12) - 8 - 10}\right) \times 5 = 52.5 + 3.33 = 55.83$$

ii. $$\text{Median} = L + \left(\frac{\frac{N}{2} - F}{f}\right) C$$

$\frac{N}{2} = \frac{40}{2} = 20$, i.e. the median is the 20th value. The cumulative
frequencies are 8, 20, 30, 36 and 40, so the 20th item falls within the class 53 – 57.
Thus the median class is 53 – 57; hence, L = 52.5, F = 8, f = 12 and C = 5.

$$\text{Median} = 52.5 + \left(\frac{20 - 8}{12}\right) \times 5 = 57.5$$
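
The two grouped-data formulas used in Case Study 3.5 can be sketched in code as follows. This Python example is illustrative only and rests on stated assumptions: class limits are whole numbers with a gap of 1 between classes, so each lower class boundary is taken as the lower limit minus 0.5 (which gives L = 52.5 for the class 53 – 57), and the function names are hypothetical.

```python
# (lower limit, upper limit, frequency) for each class in Case Study 3.5.
classes = [(48, 52, 8), (53, 57, 12), (58, 62, 10), (63, 67, 6), (68, 72, 4)]


def grouped_mode(rows):
    """Mode = L + (fm - fa) / (2*fm - fa - fb) * C for the modal class."""
    freqs = [f for _, _, f in rows]
    m = freqs.index(max(freqs))                      # position of the modal class
    fm = freqs[m]
    fa = freqs[m - 1] if m > 0 else 0                # class before the modal class
    fb = freqs[m + 1] if m + 1 < len(freqs) else 0   # class after the modal class
    lower, upper, _ = rows[m]
    boundary = lower - 0.5                           # lower class boundary L
    width = upper - lower + 1                        # class size C
    return boundary + (fm - fa) / (2 * fm - fa - fb) * width


def grouped_median(rows):
    """Median = L + (N/2 - F) / f * C for the median class."""
    n = sum(f for _, _, f in rows)
    cumulative = 0                                   # F: cumulative frequency so far
    for lower, upper, f in rows:
        if cumulative + f >= n / 2:                  # the (N/2)th item lies in this class
            boundary = lower - 0.5
            width = upper - lower + 1
            return boundary + (n / 2 - cumulative) / f * width
        cumulative += f


print(round(grouped_mode(classes), 2))
print(grouped_median(classes))
```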

3.2 Measures of Dispersion

In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a
distribution is stretched or squeezed. Common examples of measures of
statistical dispersion are the variance, standard deviation, and interquartile range.

3.2.1 The Range

The Range is one of the measures of dispersion, and is defined as the difference between
the largest and smallest items of the sample of observations.


Case Study 3.6

Given the following observations as the number of students who failed to attend STA201
class in the last 5 weeks: 5, 6, 7, 8 and 9. Find the range.

Solution

The range is 9 – 5 = 4

3.2.2 Quartile Deviation

Quartile deviation is the semi-interquartile range Q, and is given by the expression

$$Q = \frac{1}{2}(Q_3 - Q_1)$$

where $Q_1$ and $Q_3$ are the first and third quartiles respectively. Quartile deviation is
better than the range, since it is calculated from the first and third quartile values
and is therefore not affected by the extreme values at either end of the data.
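
As an illustration, the range and the quartile deviation can be computed for the data of Case Study 3.6. This is a minimal Python sketch, not part of the original material. It relies on `statistics.quantiles` from the standard library, whose default ("exclusive") method is only one of several conventions for locating quartiles, so a hand calculation using a different convention may give slightly different values of $Q_1$ and $Q_3$.

```python
from statistics import quantiles

# Number of absentees over 5 weeks, from Case Study 3.6.
absences = [5, 6, 7, 8, 9]

# Range: largest item minus smallest item.
data_range = max(absences) - min(absences)

# quantiles(..., n=4) returns the three cut points [Q1, Q2, Q3].
q1, _, q3 = quantiles(absences, n=4)

# Quartile deviation: half the distance between the third and first quartiles.
quartile_deviation = (q3 - q1) / 2

print(data_range, q1, q3, quartile_deviation)
```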

3.2.3 Mean Deviation

The mean deviation is the arithmetic mean of the absolute values of the deviations from
some average such as the mean, median or mode.

$$\text{Mean deviation} = \frac{\sum f_i \lvert x_i - \bar{x} \rvert}{N} \quad \text{for grouped data}$$

$$\text{Mean deviation} = \frac{\sum \lvert x_i - \bar{x} \rvert}{N} \quad \text{for ungrouped data}$$

where

$f_i$ = the frequency of the ith class interval

$x_i$ = the mid-value of the ith class interval, or the ith individual value

$\bar{x}$ = the arithmetic mean

N = the number of observations, or $N = \sum f_i$
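
Both versions of the mean deviation formula can be handled by one small function. The sketch below is an illustrative Python example (the function name and the choice of data are assumptions of this sketch): the ungrouped call reuses the values 5, 6, 7, 8, 9 from Case Study 3.6, and the grouped call reuses the mid-values and frequencies from the table in Case Study 3.2. Deviations are taken from the arithmetic mean.

```python
def mean_deviation(values, frequencies=None):
    """Mean of the absolute deviations from the arithmetic mean.

    For ungrouped data pass only `values`; for grouped data pass the class
    mid-values as `values` together with the class `frequencies`.
    """
    if frequencies is None:
        frequencies = [1] * len(values)   # each value counted once
    n = sum(frequencies)
    mean = sum(f * x for f, x in zip(frequencies, values)) / n
    return sum(f * abs(x - mean) for f, x in zip(frequencies, values)) / n


ungrouped = mean_deviation([5, 6, 7, 8, 9])
grouped = mean_deviation([50, 55, 60, 65, 70], [8, 12, 10, 6, 4])

print(ungrouped, grouped)
```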

3.2.4 Standard Deviation

This is the most commonly used measure of variation or dispersion. It takes into account
all the values of the variable. Standard deviation (SD) is defined as the square root of the
arithmetic mean of the squared deviations of the individual values from their arithmetic
mean. The formula for large samples is

$$SD = \sqrt{\frac{1}{n} \sum (x_i - \bar{x})^2}$$

where

$x_i$ = the ith individual value

$\bar{x}$ = the arithmetic mean

n = sample size

For small samples, the formula is

$$SD = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2} = \sqrt{\frac{1}{n - 1}(SS - CF)}$$

where

SS = sum of squares = $\sum x_i^2$

CF = correction factor = $\frac{\left(\sum x_i\right)^2}{n}$

For grouped data the formula is

$$SD = \sqrt{\frac{1}{n - 1} \sum f_i (x_i - \bar{x})^2} = \sqrt{\frac{1}{n - 1}\left[\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right]}$$

where

$f_i$ = the frequency of the ith class interval

$x_i$ = the mid-value of the ith class interval

$\bar{x}$ = the arithmetic mean

n = sample size

Case Study 3.7

Given the following ungrouped data, in million Naira, as the excess profit made by five
businessmen during the Coronavirus pandemic: 5, 6, 7, 8, 9. Find (i) the mean and (ii) the
variance and standard deviation.

Solution

i. Mean: $\bar{x} = \frac{35}{5} = 7$

ii. The variance and standard deviation (SD) are obtained as follows:

$$SS = \sum x_i^2 = 5^2 + 6^2 + 7^2 + 8^2 + 9^2 = 255$$

$$CF = \frac{\left(\sum x_i\right)^2}{n} = \frac{35^2}{5} = 245$$

$$\text{Variance} = \frac{SS - CF}{n - 1} = \frac{255 - 245}{4} = \frac{10}{4} = 2.5$$

$$SD = \sqrt{\frac{1}{n - 1}(SS - CF)} = \sqrt{\frac{1}{4}(255 - 245)} = 1.58$$

Case Study 3.8

Money deposit (₦’m)    Mid-value ($x_i$)    Frequency ($f_i$)    Cumulative frequency    $f_i x_i$
45 – 50                     47.5                  2                     2                  95.0
50 – 55                     52.5                  3                     5                 157.5
55 – 60                     57.5                  6                    11                 345.0
60 – 65                     62.5                  4                    15                 250.0
65 – 70                     67.5                  6                    21                 405.0
70 – 75                     72.5                  4                    25                 290.0
75 – 80                     77.5                  5                    30                 387.5
Total                                            30                                      1930.0

Find the mean and standard deviation of the money saved by the customers.

Solution

$$N = \sum f_i = 30$$

$$\sum f_i x_i = 1930$$

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{1930}{30} = 64.33$$

$$\sum f_i x_i^2 = 126637.50$$

$$\frac{\left(\sum f_i x_i\right)^2}{n} = \frac{3724900}{30} = 124163.33$$

$$SD = \sqrt{\frac{1}{n - 1}\left[\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right]} = \sqrt{\frac{1}{29}(126637.50 - 124163.33)} = 9.24$$
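
The grouped-data calculation in Case Study 3.8 involves several intermediate sums, which a short program can reproduce. The following Python sketch is illustrative only; the function name `grouped_mean_sd` is an assumption, and the divisor n − 1 is used as in the solution above.

```python
from math import sqrt

# Class mid-values and frequencies from the table in Case Study 3.8.
midpoints = [47.5, 52.5, 57.5, 62.5, 67.5, 72.5, 77.5]
frequencies = [2, 3, 6, 4, 6, 4, 5]


def grouped_mean_sd(x, f):
    """Return (mean, standard deviation) for grouped data, divisor n - 1."""
    n = sum(f)
    sum_fx = sum(fi * xi for fi, xi in zip(f, x))        # sum of f*x
    sum_fx2 = sum(fi * xi ** 2 for fi, xi in zip(f, x))  # sum of f*x^2
    mean = sum_fx / n
    sd = sqrt((sum_fx2 - sum_fx ** 2 / n) / (n - 1))
    return mean, sd


mean_deposit, sd_deposit = grouped_mean_sd(midpoints, frequencies)
print(round(mean_deposit, 2), round(sd_deposit, 2))
```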

3.2.5 Variance

The variance is the square of the standard deviation, and is therefore measured in the
square of the units in which the variable X is measured.

The formula for variance is:

$$\text{Variance} = \frac{\sum (x_i - \bar{x})^2}{n} = \frac{\sum x_i^2 - n\bar{x}^2}{n}$$

A better estimate of the population variance is obtained by using the divisor (n − 1) instead
of n.

$$\text{Estimated variance} = S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

$$\text{Estimated standard deviation} = S = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$


Case Study 3.9

The following are the total credit points (TCP) of some 400-level students in the Faculty of
Social and Management Sciences:

196 101 184 227 253
185 217 126 336 148
114 135 233 198 109

Calculate the mean, variance and standard deviation.

Solution

$$\text{Mean} = \bar{x} = \frac{\sum x_i}{n} = \frac{2762}{15} = 184.13$$

$$\text{Variance} = S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{58019.73}{14} = 4144.27$$

$$\text{Standard deviation} = SD = \sqrt{S^2} = \sqrt{4144.27} = 64.38$$


Summary of Study Session 3

In study session 3, you have learnt that:

1. A measure of central tendency is a summary statistic that represents the center
point or typical value of a dataset.
2. Measures of central tendency include the mean, median and mode.
3. Dispersion (also called variability, scatter, or spread) is the extent to which a
distribution is stretched or squeezed.
4. Measures of dispersion include the range, standard deviation, variance, etc.


Self-Assessment Questions (SAQs) for Study Session 3

Now that you have completed this study session, you can assess how well you have
achieved its learning outcomes by answering these questions. Write your answers in your
study diary and discuss them with your tutor at the next study support meeting. You can
check your answers with the notes on the Self-Assessment Questions at the end of this
session.

1. The following distribution gives the age classes and frequencies of some market women
in Ago-Iwoye market:

Classes      No. of women
10 – 14           19
15 – 19           24
20 – 24           37
25 – 29           81
30 – 34           43
35 – 39           30
40 – 44           16
Total            250

Determine the following:

a. The class interval
b. The class boundaries
c. The class mark
d. The class width or size of class
e. The cumulative frequency of the distribution
f. The relative frequency of the distribution

2. Consider the distribution given as the number of students whose age falls
within the ranges given below:

Age (years to the next birthday)    Frequency
15 – 20                                 37
20 – 25                                 81
25 – 30                                 43
30 – 35                                 24
35 – 40                                  9
40 – 45                                  6
Total                                  200

Draw the following:

a. A cumulative frequency graph
b. A histogram
c. A frequency polygon
d. A bar chart


3. A company administers an aptitude test to 100 applicants for a job with the
company. The following are the times taken to complete a simple task for each
applicant, measured to the nearest second.

44 92 72 45 85 61 66 46 59 57 52 40 93 54

52 64 65 44 51 66 92 58 74 42 43 56 46 52

45 56 68 40 48 76 71 99 51 72 52 56 69 58

40 76 70 42 52 46 73 59 41 55 74 66 64 47

58 46 52 54 63 89 87 41 57 68 59 81 82 60

67 68 97 57 47 53 61 52 49 47 86 55 54 48

85 45 84 53 49 47 70 78 58 96 54 62 60 57 58

a. Construct a frequency table for the above data using classes of 40 – 49, 50
– 59, 60 – 69, etc.
b. Construct a cumulative frequency distribution.
c. Construct a relative frequency distribution.
d. Draw the histogram.
e. Draw the Ogive.

4. Explain the following:
a. Measures of Central Tendency
b. Measures of Variability
5. What are the properties of a good typical value (measure of central tendency)?
6. Consider the distribution given as the age in years to the next birthday of some
students of Olabisi Onabanjo University, Ago-Iwoye.

Age in years to the next birthday    No. of observations
15 < 20                                   37
20 < 25                                   81
25 < 30                                   43
30 < 35                                   24
35 < 40                                    9
40 < 45                                    6
Total                                    200

Calculate:

a. The mean
b. Median
c. Mode
d. Variance
e. Standard deviation
f. Quartile deviation


Glossary of Terms

Deviation: measure of difference between the observed value of a variable and some other
value, often that variable's mean

Cluster: a significant subset within a population

Distribution of data: the shape of the graph when all possible values are plotted on a
frequency graph

Resourceful information: to find quick and clever ways to overcome difficulties

Quartile: a type of quantile which divides the number of data points into four parts, or
quarters, of more-or-less equal size

Absolute values: the non-negative value of x without regard to its sign



Should you require more explanations on this study session, please do not hesitate to
contact your e-tutor via the LMS.


Study Session 4: Concepts of Probability

Introduction

Have you ever heard a weather forecast at the end of a news bulletin on TV, or read
about the weather conditions of your city or country for the next few days in a
newspaper? Such forecasts specifically use the term “probability.” In this session, we are
going to learn a few basic concepts and the probability formulas needed to calculate
probabilities in different types of situations.

Learning Outcomes for Study Session 4

On completion of this study session, you should be able to:

4.1 Explain meaning of probability

4.2 Identify and explain different approaches of assigning probability

4.3 Calculate combination and permutation of different problems

4.4 Explain normal distribution


4.1 Definition

Probability concepts are the foundations of statistics. The understanding of the concepts of
probability will help the interpretation of the statistics in a skillful way. Probability is a
term applied to events that are not certain. It is the study of random or non-deterministic
experiments.

4.2 Approaches of Assigning Probability

The three commonly used approaches are the classical approach, the relative frequency
approach and the personal or subjective approach.

4.2.1 The Classical Approach

This method can be used whenever the possible outcomes of the experiment are equally
likely. In this case, the probability of the occurrence of event A is given by:

$$P(A) = \frac{n(A)}{n(S)} = \frac{\text{number of ways } A \text{ can occur}}{\text{number of ways the experiment can proceed}}$$

where S is the sample space and $A \subset S$.

Case Study 4.1

What is the probability that a child born to a couple, each with genes for both brown and
blue eyes, will be brown-eyed?

Solution

Since the child receives one gene from each parent, the possibilities for the child are
(brown, blue), (blue, brown), (blue, blue) and (brown, brown),
where the first member of each pair represents the gene received from the father. Since
each parent is just as likely to contribute a gene for brown eyes as for blue eyes, all four
possibilities are equally likely.
Since the gene for brown eyes is dominant, three of the four possibilities lead to a
brown-eyed child. Hence, the probability that the child is brown-eyed is ¾ = 0.75.
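
The classical approach amounts to listing the equally likely outcomes and counting the favourable ones, which can be sketched in code. This Python example is illustrative only (the variable names are assumptions); it enumerates the four (father, mother) gene pairs of Case Study 4.1 and counts those containing at least one brown gene, since brown is dominant.

```python
from fractions import Fraction
from itertools import product

genes = ["brown", "blue"]

# Sample space: every (gene from father, gene from mother) pair.
outcomes = list(product(genes, repeat=2))

# Brown is dominant, so any pair containing a brown gene gives a brown-eyed child.
favourable = [pair for pair in outcomes if "brown" in pair]

# Classical probability: n(A) / n(S).
p_brown = Fraction(len(favourable), len(outcomes))

print(outcomes)
print(p_brown, float(p_brown))
```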


Case Study 4.2

What is the probability of drawing an ace at random from a well shuffled deck of 52
playing cards?

Solution

There are 4 aces in a deck of 52 cards, that is, x = 4 and n = 52.
Hence, the probability of an ace is

$$\frac{x}{n} = \frac{4}{52} = \frac{1}{13}$$

4.2.2 The Relative Frequency Approach

This method can be used in any situation in which the experiment can be repeated many
times and the results observed. Then the approximate probability of the occurrence of
event A, denoted P(A), is given by:

$$P(A) = \frac{n(A)}{N} = \frac{\text{number of times event } A \text{ occurred}}{\text{number of times the experiment was run}}$$

The disadvantage of this method is that the experiment cannot be a one-shot situation; it
must be repeatable. The advantage of this approach is that it is usually more
accurate, because it is based on actual observation rather than personal opinion.
Thus, for a large number of trials, the approximate probability obtained by using the
relative frequency approach is usually quite accurate.

Case Study 4.3

A researcher is developing a new drug to be used in desensitizing patients to bee stings. Of
200 subjects tested, 180 showed a lessening in the severity of symptoms upon being stung
after the treatment was administered. It is natural to assume, then, that the probability of
this occurring in another patient receiving treatment is at least approximately

$$\frac{180}{200} = 0.90$$

On the basis of this study, the drug is reported to be 90% effective in lessening the
reaction of sensitive patients to stings.


Case Study 4.4


If 1,000 tosses of a coin result in 520 heads, then the relative frequency of heads is
$$\frac{520}{1000} = 0.52$$
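
The relative frequency approach can be illustrated with a simulated coin. The sketch below is only an illustration: the function name `relative_frequency` and the seed value are assumptions, and a simulated fair coin stands in for a real experiment.

```python
import random

def relative_frequency(trials, seed=201):
    """Toss a simulated fair coin `trials` times and return n(A)/N for A = 'head'."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so a run can be repeated
    heads = sum(1 for _ in range(trials) if rng.random() < 0.5)
    return heads / trials

# The relative frequency tends to settle near 0.5 as the number of tosses grows.
for n in (10, 1000, 100000):
    print(n, relative_frequency(n))

# The observed figure from Case Study 4.4.
print(520 / 1000)  # 0.52
```

With few tosses the relative frequency can be far from 0.5, which is why the approach is described as accurate only for a large number of trials.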

4.2.3 The Subjective or Personal approach

This is the probability assigned to an event based on subjective or personal experience,
information and belief. Hence, probabilities are interpreted as the strength of one’s belief
in the occurrence of an event.

4.2.4 Some Basic Definitions

Experiment: This refers to any process of observation or measurement whose outcome we may
not be able to predict.

Outcome: This refers to results obtained from an experiment.

Sample point: This is an outcome in the sample space

Sample space: This refers to the collection of all possible outcomes of an experiment.

Event: This refers to any subset of a sample space.

4.2.5 Axioms of probability

• Let S denote the sample space of an experiment. Then P[S] = 1.
• P[A] ≥ 0 for every event A.
• Let A1, A2, A3, … be a sequence of mutually exclusive events. Then
  P[A1 ∪ A2 ∪ A3 ∪ …] = P[A1] + P[A2] + P[A3] + …

Axiom 1 states that the probability assigned to a sure, or certain, event is 1.

Axiom 2 ensures that probabilities can never be negative.

Axiom 3 is called the property of countable additivity.

Case Study 4.5


What is the probability that a card drawn at random from a well shuffled standard pack
will be either a spade or a club?
Solution
Number of spades = 13, number of clubs = 13, n = 52.

$$P(S) = P(\text{spade}) = \frac{13}{52} = \frac{1}{4}$$

$$P(C) = P(\text{club}) = \frac{13}{52} = \frac{1}{4}$$

The outcomes are mutually exclusive; therefore, P(S or C) = P(S) + P(C).

Hence P(S or C) = ¼ + ¼ = ½

4.2.6 Conditional Probability

Given a sample space S, let A be a non-empty proper subset of S, i.e. A ≠ ∅ and A ⊂ S. The
probability of an event B happening given that an event A has taken place is denoted by
P(B/A) and is defined as:

$$P(B/A) = \frac{P(B \cap A)}{P(A)}$$

If A and B are any two events in a sample space S and P(A) ≠ 0, the conditional
probability of B given A is:

$$P(B/A) = \frac{P(A \cap B)}{P(A)} = \frac{P(\text{both events})}{P(\text{given event})}$$

Likewise, if P(B) ≠ 0, the conditional probability of A given B is:

$$P(A/B) = \frac{P(A \cap B)}{P(B)} = \frac{P(\text{both events})}{P(\text{given event})}$$

Case Study 4.6

It is estimated that 15% of the adult population has hypertension due to economic
hardship, but that 75% of all adults feel that personally they do not have this problem. It is
also estimated that 6% of the population has hypertension but does not think that the
disease is present. If an adult patient reports thinking that he or she does not have
hypertension, what is the probability that the disease is, in fact, present?

Solution

Let A denote the event that the patient does not feel that the disease is present and B
the event that the disease is present. We are given that P(A) = 0.75, P(B) = 0.15 and
P(A ∩ B) = 0.06.

We are asked to find:

$$P(B/A) = \frac{P(\text{both})}{P(\text{given})} = \frac{P(A \cap B)}{P(A)} = \frac{0.06}{0.75} = 0.08$$

There is an 8% chance that a patient who expresses the opinion that he or she has no problem
with hypertension does, in fact, have the disease.

4.2.7 Bayes’ Theorem

This theorem was formulated by the Reverend Thomas Bayes (1761). It deals with
conditional probability. Bayes’ theorem is used to find P(A/B) when the available
information is not directly compatible with that required in the definition of conditional
probability. That is, it is used to find P[A/B] when P[A ∩ B] and P[B] are not immediately
available.

Theorem 4.1:

Let A1, A2, A3, ..., An be a collection of events which partition S. Let B be an event such
that P[B] ≠ 0. Then for any of the events Aj, j = 1, 2, 3, …, n,

$$P[A_j/B] = \frac{P[B/A_j]\,P[A_j]}{\sum_{i=1}^{n} P[B/A_i]\,P[A_i]}$$

Bayes’ theorem is much easier to use in practical problems than to state formally.

Case Study 4.7

The blood type distribution in Olabisi Onabanjo University is type A, 41%; type B, 9%;
type AB, 4%; and type O, 46%. It is estimated that during an investigation, 4% of
inductees with type O blood were typed as having type A; 88% of those with type A blood
were correctly typed; 4% with type B blood were typed as A; and 10% with type AB were
typed as A. One student was wounded and brought to surgery. He was typed as having
type A blood. What is the probability that this is his true blood type?

Solution

Let:

A1 = he has type A blood

A2 = he has type B blood

A3 = he has type AB blood

A4 = he has type O blood

B = he is typed as type A

We want to find P [A1/B]

We are given that

P [A1] = 0.41 P [B/A1] = 0.88

P [A2] = 0.09 P [B/A2] = 0.04

P [A3] = 0.04 P[B/A3] = 0.10

P [A4] = 0.46 P [B/A4] = 0.04

By Bayes’ theorem,

$$P[A_1/B] = \frac{P[B/A_1]\,P[A_1]}{\sum_{i=1}^{4} P[B/A_i]\,P[A_i]}$$

$$= \frac{(0.88)(0.41)}{(0.88)(0.41) + (0.04)(0.09) + (0.10)(0.04) + (0.04)(0.46)} = \frac{0.3608}{0.3868} = 0.93$$

Practically speaking, this means that there is a 93% chance that the blood type is A if it
has been typed as A, and there is a 7% chance that it has been mistyped as A when it is
actually some other type.
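
The arithmetic in Case Study 4.7 can be reproduced numerically. This is a minimal sketch using the probabilities given in the case study; the dictionary and function names are illustrative assumptions.

```python
# Prior probabilities P[Ai]: the true blood type distribution.
priors = {"A": 0.41, "B": 0.09, "AB": 0.04, "O": 0.46}

# Likelihoods P[B/Ai]: probability of being typed as A given each true type.
typed_as_a = {"A": 0.88, "B": 0.04, "AB": 0.10, "O": 0.04}

def posterior(true_type):
    """Return P[true_type / typed as A] using Bayes' theorem."""
    total = sum(typed_as_a[t] * priors[t] for t in priors)  # P[B], the denominator
    return typed_as_a[true_type] * priors[true_type] / total

print(round(posterior("A"), 2))  # 0.93

# The posterior probabilities over all four true types add up to 1.
print(round(sum(posterior(t) for t in priors), 6))  # 1.0
```

The denominator is the total probability of being typed as A, obtained by summing over the four events that partition the sample space.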


4.3 Factorials

Factorial is a special multiplication operator. The factorial sign “!” indicates a special
repeated multiplication which is used frequently in statistical applications.

Case Study 4.8

3! = 3 × 2 × 1 = 6

4! = 4 × 3 × 2 × 1 = 24

In general, n! = n × (n − 1) × (n − 2) × … × 3 × 2 × 1

4.3.1 Permutation

If r objects are selected from a set of n objects, any particular arrangement (order) of these
objects is called a permutation.

The number of permutations of r objects selected from a set of n distinct objects is

$${}^{n}P_{r} = \frac{n!}{(n-r)!}$$

Case Study 4.9

Find the number of ways of arranging the letters of the word CHEMISTRY if:

a. All the letters are to be taken at a time


b. Four of the letters are to be taken at a time

Solution

a. Required number of arrangements = n!

9! = 362880

b. Required number of arrangements = nPr

$${}^{9}P_{4} = \frac{9!}{(9-4)!} = \frac{9!}{5!} = \frac{362880}{120} = 3024$$

4.3.2 Combination

This deals with the number of ways in which r objects can be selected from a set of n
objects. The number of ways in which r objects can be selected from a set of n distinct
objects is denoted by $\binom{n}{r}$ or nCr and is given by:

$${}^{n}C_{r} = \frac{n!}{r!\,(n-r)!}$$

Case Study 4.10

In how many ways can we select three academically sound students of the Faculty of
Management Sciences from a list of 7 students?

Solution

Here n = 7 and r = 3.

Number of possible selections = nCr

$${}^{7}C_{3} = \frac{7!}{3!\,(7-3)!} = \frac{7!}{3!\,4!} = \frac{7 \times 6 \times 5}{3 \times 2 \times 1} = 35$$
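
The selection count in Case Study 4.10 can be confirmed both by formula and by listing the selections. This is a small sketch; the student labels S1 to S7 and the helper name `combinations_count` are hypothetical.

```python
import math
from itertools import combinations

def combinations_count(n, r):
    """Number of ways to select r objects from n distinct objects, ignoring order."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

print(combinations_count(7, 3))  # 35

# Cross-check by listing every group of three from seven hypothetical students.
students = ["S1", "S2", "S3", "S4", "S5", "S6", "S7"]
print(len(list(combinations(students, 3))))  # 35
```

Unlike a permutation, a selection such as (S1, S2, S3) is counted once regardless of the order in which the students are named.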


4.4 Probability Distribution

A probability distribution is a function that describes the likelihood of obtaining the
possible values that a random variable can assume. In other words, the values of the
variable vary based on the underlying probability distribution. It is a statistical function
that describes all the possible values and likelihoods that a random variable can take
within a given range. This range is bounded by the minimum and maximum
possible values.

4.4.1 Normal Distribution

This is one of the most important and most widely used probability distributions in the
entire field of statistics.

The graph of a normal distribution is a bell-shaped curve that extends indefinitely in both
directions, with the mean, median and mode being equal.

A random variable X has a normal distribution, and is referred to as a normal random
variable, if and only if its probability density function is given by:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty$$

where σ > 0.

The normal distribution with μ = 0 and σ = 1 is referred to as the standard normal
distribution. If X has a normal distribution with mean μ and standard deviation σ, then

$$Z = \frac{X-\mu}{\sigma} \sim N(0, 1)$$

P(Z > a) = 1 – P(Z < a)
Case Study 4.11

Using normal tables, find the values of the following probabilities.

(a) P(Z > 0.5) (b) P(Z < -2.5) (c) P(1.62 < Z < 2.20)

Solution

(a) P(Z > a) = 1 – P(Z < a)

P(Z > 0.5) = 1 – P(Z < 0.5) = 1 – 0.6915 = 0.3085

(b) P(Z < -2.5) = P(Z > 2.5) = 1 – P(Z < 2.5) = 1 – 0.9938 = 0.0062

(c) P(a1 < Z < a2) = P(Z < a2) – P(Z < a1) = P(Z < 2.20) – P(Z < 1.62)

= 0.9861 – 0.9474 = 0.0387

Case Study 4.12

Given that a company’s income is normally distributed with mean 230 and standard
deviation 20, what is the probability that the company income will be:

a. Greater than 280
b. Less than 220
c. Between 220 and 280

Solution

a. $$Z = \frac{x-\mu}{\sigma}; \quad x = 280,\ \mu = 230,\ \sigma = 20$$

$$Z = \frac{280-230}{20} = 2.5$$

P(X > 280) = P(Z > 2.5) = 1 – P(Z < 2.5) = 1 – 0.9938 = 0.0062

b. P(X < 220)

$$Z = \frac{x-\mu}{\sigma} = \frac{220-230}{20} = -0.5$$

P(X < 220) = P(Z < -0.5)

Since P(Z < -0.5) = P(Z > 0.5)

= 1 – P(Z < 0.5)

= 1 – 0.6915

= 0.3085

c. P(220 < X < 280) = P(-0.5 < Z < 2.5)

= P(Z < 2.5) – P(Z < -0.5)

= P(Z < 2.5) – [1 – P(Z < 0.5)]

= P(Z < 2.5) + P(Z < 0.5) – 1

= 0.9938 + 0.6915 – 1

= 1.6853 – 1 = 0.6853
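
The table look-ups in Case Study 4.12 can be cross-checked without normal tables. The sketch below computes the standard normal cumulative probability from the error function in Python's `math` module; the helper names `phi` and `normal_cdf` are illustrative assumptions.

```python
import math

def phi(z):
    """P(Z < z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """P(X < x) for X normal with mean mu and standard deviation sigma."""
    return phi((x - mu) / sigma)  # standardise first, then use the standard normal

mu, sigma = 230, 20

p_greater_280 = 1 - normal_cdf(280, mu, sigma)
p_less_220 = normal_cdf(220, mu, sigma)
p_between = normal_cdf(280, mu, sigma) - normal_cdf(220, mu, sigma)

print(round(p_greater_280, 4))  # 0.0062
print(round(p_less_220, 4))     # 0.3085
print(round(p_between, 4))      # 0.6853
```

The three results agree with the values obtained from the normal tables to four decimal places.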


Summary of Study Session 4

In study session 4, you have learnt that:

1. Probability is the study of random or non-deterministic experiments.
2. The three commonly used approaches to probability are the classical approach, the
relative frequency approach and the personal or subjective approach.
3. The classical approach can be used whenever the possible outcomes of the experiment
are equally likely.
4. The relative frequency approach can be used in any situation in which the
experiment can be repeated many times and the results observed.
5. The subjective approach assigns a probability to an event based on subjective or
personal experience, information and belief.
6. The axioms of probability are:
• Let S denote the sample space of an experiment. Then P[S] = 1.
• P[A] ≥ 0 for every event A.
• Let A1, A2, A3, … be a sequence of mutually exclusive events. Then
  P[A1 ∪ A2 ∪ A3 ∪ …] = P[A1] + P[A2] + P[A3] + …
7. If r objects are selected from a set of n objects, any particular arrangement (order)
of these objects is called a permutation.
8. Combination deals with the number of ways in which r objects can be selected from a
set of n objects.


Self-Assessment Questions (SAQs) for Study Session 4

Now that you have completed this study session, you can assess how well you have
achieved its learning outcomes by answering these questions. Write your answers in your
study diary and discuss them with your tutor at the next study support meeting. You can
check your answers with the notes on the Self-Assessment Questions at the end of this
session.

1. At a given period in OOU Health Centre there are six expectant mothers. What is
the probability that:
a. 2 boys will be delivered
b. At least one boy will be delivered
c. Exactly 5 boys will be delivered
2. Given that a company’s income is normally distributed with mean 240 and
standard deviation 25, find the probability that the company income will be:
a. Less than 230
b. Greater than 300
c. Between 250 and 280
3. Define the following:
a. Experiment
b. Sample Space
c. Sample Point
d. Event
e. Permutation
f. Combination
g. Mutually Exclusive
h. Independent event
i. Dependent event
j. Conditional Probability


4.
a. Show that the letters of the word ANTICIPATION can be arranged in three times
as many ways as the letters of the word COMMENCEMENT.
b. In the random experiment of tossing 5 coins, list the events that (i) at least 3 heads
occur (ii) exactly 2 heads occur (iii) no head occurs.
5.
a. Simplify the following:
(i) 10P4 (ii) 10P4 (iii) 5C2 (iv) 5P2
b. If nP5 : nP3 = 2 : 1, what is the value of n?
c. If nP3 / nC4 = 6, find n.
6. Using normal tables, find the values of the following probabilities:
a. P(Z < 0.20)
b. P(Z < -1.62)
c. P(0.57 < Z < 1.62)


Glossary of Terms

Non-deterministic: inability to objectively predict an outcome or result of a process due
to lack of knowledge of a cause and effect relationship or the inability to know initial
conditions

Axiom: a statement accepted as true as the basis for argument or inference

Non-deterministic experiments: experiments whose outcomes cannot be predicted with
certainty beforehand

Relative frequency approach: an approach to probability in which the probability of an
event is estimated by the ratio of the number of occurrences of the event to the total
number of trials

Underlying probability distribution: the theoretical distribution for a given population
of interest.


References

1. Probability and statistics for engineers & scientists by Walpole and Myers.
2. Introduction to Statistics. Jedidiah Publishers by Sojobi O.A.
3. Schaum’s Outline Series Theory and Problems of Probability (S.I. Metric) Edition
McGraw Hill Book Company, New York by Symour L.
4. An Introduction to Statistical Methods. Vikas Publishing House. Delhi by GUPTA C.
B
5. https://www.toppr.com/guides/maths/probability/introduction-to-probability/

Should you require more explanation on this study session, please
do not hesitate to contact your e-tutor via the LMS.


Study Session 5: Random Variable and Statistical Hypothesis

Introduction

A statistical hypothesis, sometimes called confirmatory data analysis, is
a hypothesis that is testable on the basis of observing a process that
is modeled via a set of random variables. A statistical hypothesis test is a
method of statistical inference. Commonly, two statistical data sets are compared, or a
data set obtained by sampling is compared against a synthetic data set from an idealized
model. An alternative hypothesis is proposed for the statistical relationship between the
two data sets, and is compared to an idealized null hypothesis that proposes no
relationship between these two data sets.

Learning Outcomes for Study Session 5

5.1 Explain random variable

5.2 Describe the expectation and distribution parameters

5.3 Test for hypothesis

5.4 Estimate parameters


5.1 Random Variable

A random variable is a variable whose values depend on outcomes of a random
phenomenon. There are two easily identifiable types of random variables: discrete and
continuous random variables.

5.1.1 Discrete Random Variables

This is a random variable that can assume at most a finite or a countably infinite number
of possible values. That is, let X be a random variable; if the number of possible values of
X is finite or countably infinite, we call X a discrete random variable. For example, the
possible values of X may be listed as x1, x2, …, xn in the finite case. Therefore, discrete
random variables are those random variables which can take on only a finite number of
values or whose possible values are countable.

5.1.2 Continuous Random Variables

A random variable X is continuous if it can assume any value in some interval (or
intervals) of real numbers and the probability that it assumes any specific value is zero.
These are random variables whose values fall within an interval. Measured
random variables that are treated as continuous random variables include income,
temperature, height and intelligence quotient (IQ). A continuous random variable is
one that may assume any numerical value on a continuous scale. Therefore, X is said to be
a continuous random variable if there exists a function f, called the probability density
function (p.d.f.) of X, satisfying the following conditions:

i. $f(x) \ge 0$ for all x

ii. $\int_{-\infty}^{\infty} f(x)\,dx = 1$

iii. For any a, b with $-\infty < a < b < \infty$, we have $P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$

5.2 Expectation and Distribution Parameters

The density function of a random variable completely describes the behaviour of the
variable. However, associated with any random variable are constants, or parameters, that
are descriptive. Knowledge of the numerical values of these parameters gives the
researcher quick insight into the nature of the variable.

We consider three such parameters: the mean μ, the variance σ², and the standard
deviation σ. If the exact density of the variable of interest is known, then the numerical
value of each parameter can be found from mathematical considerations. To understand
the reasoning behind most statistical methods, it is necessary to become familiar with one
general concept, namely, the idea of mathematical expectation or expected value. This
concept is used in defining most statistical parameters and provides the logical basis for
most of the methods of statistical inference presented in this chapter.

5.2.1 Expected Value

Intuitively, the expected value of a random variable X, denoted E(X), is the
long run theoretical average value of X. Let X be a discrete random variable with density
f(x). The expected value of X is given by:

$$E(X) = \sum_{\text{all } x} x\, f(x)$$

5.2.2 Variance

Let X be a random variable with mean μ. The variance of X, denoted Var X or σ², is
given by:

$$\operatorname{Var} X = \sigma^{2} = E\left[(X-\mu)^{2}\right]$$

Note that the variance essentially measures variability by considering X − μ, the difference
between the variable and its mean. The difference is squared so that negative values will
not cancel positive ones in the process of finding the expected value.

The most widely used measure of variability is $E[(X-\mu)^{2}]$. This measure is called the
variance of X.

Computational formula for σ²:

$$\operatorname{Var} X = E[X^{2}] - (E[X])^{2}$$

5.2.3 Standard deviation

Let X be a random variable with variance σ². The standard deviation of X, denoted σ, is
given by:

$$\sigma = \sqrt{\operatorname{Var} X} = \sqrt{\sigma^{2}}$$

5.2.4 Rules for Expectation

Let X and Y be random variables, and let C be any real number.

i. E[C] = C (the expected value of any constant is that constant).


ii. E[CX] = CE [X] (Constants can be factored from expectations).
iii. E[X+Y] = E[X] + E[Y] (the expected value of a sum is equal to the sum of the
expected values).

5.2.5 Rules for Variance

i. Var C = 0
ii. Var CX = C² Var X
iii. If X and Y are independent, then Var (X + Y) = Var X + Var Y. Two variables
are independent if the value assumed by one has no effect on the value assumed by the other.

Case Study 5.1

A balanced coin is tossed thrice. Let X be a random variable denoting the number of times
that head appears.

i. Obtain the probability distribution of X.


ii. Obtain the mean and variance of X.
Solution

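S = {HHH, HHT, HTH, HTT, TTT, TTH, THT, THH}

i. The probability distribution of X is

X    f(x)
0    1/8
1    3/8
2    3/8
3    1/8

ii. Mean = E[X] = Σ x f(x)

$$E[X] = 0\left(\tfrac{1}{8}\right) + 1\left(\tfrac{3}{8}\right) + 2\left(\tfrac{3}{8}\right) + 3\left(\tfrac{1}{8}\right) = 0 + \tfrac{3}{8} + \tfrac{6}{8} + \tfrac{3}{8} = \tfrac{3}{2}$$

$$\operatorname{Var} X = E[X^{2}] - (E[X])^{2}$$

$$E[X^{2}] = \sum x^{2} f(x) = 0^{2}\left(\tfrac{1}{8}\right) + 1^{2}\left(\tfrac{3}{8}\right) + 2^{2}\left(\tfrac{3}{8}\right) + 3^{2}\left(\tfrac{1}{8}\right) = 0 + \tfrac{3}{8} + \tfrac{12}{8} + \tfrac{9}{8} = \tfrac{24}{8} = 3$$

$$\operatorname{Var} X = 3 - \left(\tfrac{3}{2}\right)^{2} = 3 - \tfrac{9}{4} = \tfrac{3}{4}$$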
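
The distribution, mean and variance in Case Study 5.1 can be rebuilt from the sample space itself. This is a minimal sketch using exact fractions; the variable names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Sample space for three tosses of a balanced coin: 8 equally likely outcomes.
outcomes = list(product("HT", repeat=3))

# Build the probability distribution of X = number of heads.
dist = {}
for outcome in outcomes:
    x = outcome.count("H")
    dist[x] = dist.get(x, 0) + Fraction(1, len(outcomes))

mean = sum(x * p for x, p in dist.items())        # E[X]
e_x2 = sum(x * x * p for x, p in dist.items())    # E[X^2]
variance = e_x2 - mean ** 2                       # Var X = E[X^2] - (E[X])^2

print(sorted(dist.items()))
print(mean, e_x2, variance)  # 3/2 3 3/4
```

Working with `Fraction` keeps the results exact, so they can be compared directly with the hand calculation.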

Case Study 5.2

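Consider the random variable X, the number of mimics escaping detection in the Batesian
mimicry experiments. The density for X is given below. Find the mean and variance of X.

X       0         1          2           3
f(x)    8/1000    96/1000    384/1000    512/1000

Solution

$$E[X] = 0\left(\tfrac{8}{1000}\right) + 1\left(\tfrac{96}{1000}\right) + 2\left(\tfrac{384}{1000}\right) + 3\left(\tfrac{512}{1000}\right) = \tfrac{2400}{1000} = \tfrac{12}{5} = 2.4$$

$$\operatorname{Var} X = E[X^{2}] - (E[X])^{2}$$

$$E[X^{2}] = 0^{2}\left(\tfrac{8}{1000}\right) + 1^{2}\left(\tfrac{96}{1000}\right) + 2^{2}\left(\tfrac{384}{1000}\right) + 3^{2}\left(\tfrac{512}{1000}\right) = \tfrac{6240}{1000} = \tfrac{624}{100}$$

$$\operatorname{Var} X = \tfrac{624}{100} - \left(\tfrac{12}{5}\right)^{2} = \tfrac{624}{100} - \tfrac{144}{25} = \tfrac{12}{25} = 0.48$$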

Case Study 5.3

Two drugs are being compared for use in maintaining a steady heart rate in patients who
have suffered a mild heart attack. Let X denote the number of heartbeats per minute
obtained by using drug A and Y the number per minute with drug B. Consider the
following hypothetical densities.

x       40     60     68     70     72     80     100
f(x)    0.01   0.04   0.05   0.8    0.05   0.04   0.01

y       40     60     68     70     72     80     100
f(y)    0.4    0.05   0.04   0.02   0.04   0.05   0.4

Since each of the densities is symmetric, inspection shows that μx = μy = 70. Each drug
produces on average the same number of heartbeats per minute. However, there is
obviously a drastic difference between the two drugs that is not being detected by the
mean. If we examined only the mean, we would conclude that the two drugs had identical
effects, which may not be exactly true. We can further examine the variability of the
two drugs through their variances. The variance of X, denoted Var X or σ², is given by

$$\operatorname{Var} X = \sigma^{2} = E\left[(X-\mu)^{2}\right]$$

$$\operatorname{Var} X = \sum_{\text{all } x} (x-70)^{2} f(x)$$

$$= (-30)^{2}(0.01) + (-10)^{2}(0.04) + (-2)^{2}(0.05) + 0^{2}(0.8) + 2^{2}(0.05) + 10^{2}(0.04) + 30^{2}(0.01)$$

= 26.4

$$\operatorname{Var} Y = \sum_{\text{all } y} (y-70)^{2} f(y)$$

$$= (-30)^{2}(0.4) + (-10)^{2}(0.05) + (-2)^{2}(0.04) + 0^{2}(0.02) + 2^{2}(0.04) + 10^{2}(0.05) + 30^{2}(0.4)$$

= 730.32

As expected, Var Y > Var X. Even though the two drugs produce the same mean number
of heartbeats per minute, they do not behave in the same way. Drug B induces greater
variability than drug A. It is not as consistent in its effect as drug A.

The standard deviations are:

$$\sigma_x = \sqrt{\operatorname{Var} X} = \sqrt{26.4} = 5.14 \text{ heartbeats per minute}$$

$$\sigma_y = \sqrt{\operatorname{Var} Y} = \sqrt{730.32} = 27.02 \text{ heartbeats per minute}$$

5.3 Test of Hypothesis

The most frequent application of statistics is to test scientific hypotheses. Results of
experiments and investigations are usually not clear cut and, therefore, need statistical
tests to support decisions between alternative hypotheses. A statistical test examines a set
of sample data and, on the basis of an expected distribution of the data, leads to a decision
on whether to accept the hypothesis or to reject that hypothesis and accept an
alternative one. The nature of the tests varies with the data and the hypothesis, but the
same general philosophy of hypothesis testing is common to all tests. A statistical
hypothesis is an assumption or statement, which may or may not be true, concerning one or
more populations.

A statistical hypothesis (or inference) is a statement about the parameters or form of a
population. A test of a statistical hypothesis is a criterion which specifies for what sample
results the hypothesis is to be accepted or rejected. The hypothesis which is to be tested
is generally called the null hypothesis, denoted by H0, and the hypothesis against which it
is to be tested is called the alternative hypothesis, denoted by H1.

5.3.1 Type I and Type II Errors

A type I error has been committed if we reject the null hypothesis when it is true, and a
type II error has been committed if we accept the null hypothesis when it is false. The
following table summarizes the various situations that can arise when testing H0 against
H1:

              Accept H0        Accept H1
H0 is true    No error         Type I error
H1 is true    Type II error    No error

The probabilities of committing type I and type II errors are written as α and β,
respectively. α is called the size of the test and (1 − β) is called the power of the test;
(1 − β) is the probability of rejecting the null hypothesis (H0) when it is false. The region
such that if the sample point falls in it we reject H0 is called the critical region. When the
primary concern of a test is to see whether the null hypothesis can be rejected, such a test
is called a test of significance. In that case, the quantity α is called the level of
significance at which the test is being conducted.

5.3.2 One and Two Tailed Test

A test of any statistical hypothesis where the alternative is one-sided, such as:

H0: μ = μ0        or        H0: μ = μ0
H1: μ > μ0                  H1: μ < μ0

is called a one-tailed test. The critical region for H1: μ > μ0 lies entirely in the right tail,
while the critical region for H1: μ < μ0 lies entirely in the left tail.

A test of any statistical hypothesis where the alternative is two-sided, such as:

H0: μ = μ0
H1: μ ≠ μ0

is called a two-tailed test; values in both tails of the distribution constitute the critical
region.

5.3.3 Test Procedure and Steps

The general steps involved in carrying out any test of significance are:

i. Identify the type of problem and the question to be answered.
ii. State the null hypothesis (H0) and the appropriate alternative hypothesis (H1).
iii. Select the appropriate test and calculate the test criterion based on the type
of test.
iv. Fix the level of significance α.
v. Decide, on the basis of the test criterion value, whether to reject or accept the
hypothesis.
vi. Draw the conclusion (or inference) on the basis of the level of significance by
deciding whether the difference observed is due to chance or due to some other
known factors.

5.3.4 Test Concerning the Mean (For Large Sample)

We will assume that the sampling distribution of the sample estimates is
approximately normal and that the variance is known. Hence, for large samples (n ≥ 30),
we can use the normal probability distribution for testing a hypothesized value of the
population mean.

The test statistic is

$$Z = \frac{\bar{X}-\mu}{S.E.(\bar{X})}$$

where X̄ is the sample mean,

μ is the population mean, and

S.E.(X̄) is the standard error of the sample mean:

$$S.E.(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$

where σ is the population standard deviation (usually known) and n is the sample size.

Case Study 5.4

A bottling company which bottles a soft drink claims that the liquid content is 35cl with
standard deviation 0.75cl. A researcher randomly collects 50 bottles, measures their
contents and obtains a mean of 34.2cl. Test at the 0.01 level of significance whether the
bottling company has been cheating its consumers.

Solution

μ = 35cl

σ = 0.75cl

n = 50

X̄ = 34.2

α = 0.01 (1%)

H0: μ = 35, that is, the company has not been cheating the consumers.

H1: μ < 35, that is, the company has been cheating the consumers.

The test statistic is

$$Z = \frac{(\bar{X}-\mu)\sqrt{n}}{\sigma} = \frac{(34.2-35)\sqrt{50}}{0.75} = \frac{-0.8 \times 7.0711}{0.75} = -7.54$$

Thus, |Z| = |-7.54| = 7.54

At the 0.01 level of significance the Z tabulated value (one-tailed) is 2.33.

Decision: The Z calculated value 7.54 is greater than the Z tabulated value 2.33, so we
reject H0 and accept H1.

Conclusion: There is a significant difference between the population mean and the sample mean.
Hence, the bottling company has been cheating its consumers.
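
The large-sample test in Case Study 5.4 can be sketched in a few lines. The function name `z_statistic` is an illustrative assumption, and the critical value 2.33 is simply the one-tailed table value quoted in the case study, not something computed here.

```python
import math

def z_statistic(sample_mean, mu0, sigma, n):
    """Z = (sample mean - hypothesised mean) / (sigma / sqrt(n))."""
    standard_error = sigma / math.sqrt(n)
    return (sample_mean - mu0) / standard_error

z_cal = z_statistic(sample_mean=34.2, mu0=35, sigma=0.75, n=50)
z_tab = 2.33  # one-tailed table value at the 0.01 level (taken from the text)

# H1: mu < 35 is a left-tailed alternative, so reject H0 when Z falls below -z_tab.
reject_h0 = z_cal < -z_tab

print(round(z_cal, 2))  # -7.54
print(reject_h0)        # True
```

Because the alternative is one-sided to the left, the comparison is made against the negative of the tabulated value, which is equivalent to comparing |Z| with 2.33 here.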


5.3.5 Test Concerning Means (Small Samples)

There are situations in real-life experiments, such as testing the efficiency of a newly
produced drug, where it is impracticable to get a large sample and yet tests of
significance still have to be carried out. When we do not know the value of the population
standard deviation and the sample size is small (n < 30), we shall assume again that the
population we are sampling from has roughly the shape of a normal distribution. The test
statistic is:

$$t = \frac{\bar{X}-\mu}{S/\sqrt{n}} = \frac{(\bar{X}-\mu)\sqrt{n}}{S}$$

whose sampling distribution is the t distribution with n − 1 degrees of freedom. S is the
sample standard deviation. As with large samples, we compare the calculated value with the
tabulated value at a given level of significance, and then draw our conclusions.

Case Study 5.5

Suppose that we want to test, on the basis of a random sample of size n = 5, whether or not
the fat content of a certain kind of ice cream exceeds 12 percent. What can we conclude
about the null hypothesis μ = 12 percent at the 0.01 level of significance, if the sample
has mean X̄ = 12.7 percent and standard deviation S = 0.38 percent?

Solution

Hypothesis: H0: μ = 12%

H1: μ > 12%

α = 0.01

n = 5, d.f. = n – 1 = 4

Test statistic

$$t = \frac{\bar{X}-\mu}{S/\sqrt{n}} = \frac{12.7-12}{0.38/\sqrt{5}} = \frac{0.7}{0.1699} = 4.12$$

From tables, t0.01,4 = 3.747

Decision: Since tcal = 4.12 > ttab = 3.747, we reject H0.

Conclusion: Therefore, the fat content of the given kind of ice cream exceeds 12 percent.
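
The small-sample statistic in Case Study 5.5 can be computed from the summary figures alone. In this sketch the function name `t_statistic` is an illustrative assumption, and the critical value 3.747 is the one-tailed t-table entry for 4 degrees of freedom at the 0.01 level; it is read from tables, not computed by the code.

```python
import math

def t_statistic(sample_mean, mu0, sample_sd, n):
    """t = (sample mean - hypothesised mean) / (S / sqrt(n)), with n - 1 d.f."""
    return (sample_mean - mu0) / (sample_sd / math.sqrt(n))

n = 5
t_cal = t_statistic(sample_mean=12.7, mu0=12, sample_sd=0.38, n=n)
degrees_of_freedom = n - 1
t_tab = 3.747  # t-table value for alpha = 0.01 (one-tailed), 4 d.f.; assumed from tables

# H1: mu > 12 is a right-tailed alternative.
reject_h0 = t_cal > t_tab

print(degrees_of_freedom)  # 4
print(round(t_cal, 2))     # 4.12
print(reject_h0)           # True
```

The calculated value exceeds the tabulated value, so the data support the claim that the fat content is above 12 percent.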

Case Study 5.6

The lifetimes of a random sample of 10 telephones from a large consignment give the
following data:

Item    Life in 1,000 hrs (X)    X − X̄    (X − X̄)²
1       4.2                      -0.1      0.01
2       4.0                      -0.3      0.09
3       3.9                      -0.4      0.16
4       4.1                      -0.2      0.04
5       5.2                       0.9      0.81
6       3.8                      -0.5      0.25
7       3.9                      -0.4      0.16
8       4.3                       0        0
9       4.4                       0.1      0.01
10      5.6                       1.3      1.69

Can we accept the hypothesis that the average lifetime of the telephones is 4,000 hours at
the 5% level of significance?

Solution

Hypothesis

H0: μ = 4,000 hours

H1: μ ≠ 4,000 hours

α = 0.05 level of significance

Since n = 10, d.f. = n – 1 = 10 – 1 = 9, and the tabulated value is

$$t_{\alpha/2,\,n-1} = t_{0.025,\,9}$$

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{4.2 + 4.0 + \dots + 5.6}{10} = \frac{43.4}{10} \approx 4.3$$

$$S^{2} = \frac{\sum_{i=1}^{10} (X_i-\bar{X})^{2}}{n-1} = \frac{(4.2-4.3)^{2} + (4.0-4.3)^{2} + \dots + (4.4-4.3)^{2} + (5.6-4.3)^{2}}{10-1}$$

$$= \frac{0.01 + 0.09 + \dots + 0.01 + 1.69}{9} = \frac{3.22}{9} = 0.358$$

Test statistic (working in units of 1,000 hours, so that μ = 4)

$$t = \frac{(\bar{X}-\mu)\sqrt{n}}{S} = \frac{(4.3-4)\sqrt{10}}{0.598}$$

where

$$S = \sqrt{0.358} = 0.598$$

t = 1.587

t0.025,9 = 2.262

Decision: Reject H0 if |tcal| > ttab.

Conclusion: Since tcal = 1.587 < ttab = 2.262, we do not reject H0 and conclude that the
average lifetime is 4,000 hours.

5.3.6 Test Concerning Two Population Means (Large Sample)

The test statistic for a large-sample test concerning the difference between two means is
given as:

$$Z = \frac{\bar{X}_1-\bar{X}_2}{\sqrt{\dfrac{\sigma_1^{2}}{n_1}+\dfrac{\sigma_2^{2}}{n_2}}}$$

Case Study 5.7

In a study designed to test whether or not there is a difference between the average amounts
spent on food by families living in two different communities, random samples yield
the following results.

n1 = 120, x̄1 = 62.7, σ1 = 2.50

n2 = 150, x̄2 = 61.8, σ2 = 2.62

The amounts spent on food are in thousands of naira. Use the 0.05 level of significance to
test the null hypothesis that the corresponding population means are equal against the
alternative hypothesis that they are not equal.

Solution

H0: μ1 = μ2

H1: μ1 ≠ μ2

n1 = 120, x̄1 = 62.7, σ1 = 2.50

n2 = 150, x̄2 = 61.8, σ2 = 2.62

α = 0.05 level of significance

Test statistic

$$Z = \frac{\bar{X}_1-\bar{X}_2}{\sqrt{\dfrac{\sigma_1^{2}}{n_1}+\dfrac{\sigma_2^{2}}{n_2}}}$$

$$Z = \frac{62.7-61.8}{\sqrt{\dfrac{2.50^{2}}{120}+\dfrac{2.62^{2}}{150}}} = \frac{0.9}{\sqrt{0.0978}} = \frac{0.9}{0.3128}$$

Z = 2.88

At the 0.05 level of significance the Z tabulated value (two-tailed) is 1.96.

Conclusion: Since Zcal = 2.88 > Ztab = 1.96, the null hypothesis must be rejected and we
conclude that there is a difference between the true average amounts spent on food in the
two given communities.
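
The two-sample statistic in Case Study 5.7 can be checked as follows. The function name `two_sample_z` is an illustrative assumption, and 1.96 is the standard two-tailed normal table value at the 0.05 level.

```python
import math

def two_sample_z(mean_1, mean_2, sigma_1, sigma_2, n1, n2):
    """Z for the difference between two means with known standard deviations."""
    standard_error = math.sqrt(sigma_1 ** 2 / n1 + sigma_2 ** 2 / n2)
    return (mean_1 - mean_2) / standard_error

z_cal = two_sample_z(62.7, 61.8, 2.50, 2.62, 120, 150)
z_tab = 1.96  # two-tailed table value at the 0.05 level

# H1: the means differ (two-sided), so compare |Z| with the table value.
reject_h0 = abs(z_cal) > z_tab

print(round(z_cal, 2))  # 2.88
print(reject_h0)        # True
```

Note that the denominator is the square root of the sum of the two variance terms, roughly 0.3128, not the sum itself.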

5.3.7 Test Concerning Two Population Means (Small Sample)

The test statistic for a small-sample test concerning the difference between two means is
given as:

$$t = \frac{\bar{X}_1-\bar{X}_2}{\sqrt{S_p^{2}\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)}}$$

where

$$S_p^{2} = \frac{(n_1-1)S_1^{2}+(n_2-1)S_2^{2}}{n_1+n_2-2}$$

is known as the pooled variance, and the statistic follows the t-distribution with
n1 + n2 – 2 degrees of freedom. Assumptions when using the t-distribution:

i. The two populations are normal.
ii. The two populations have the same variance.
iii. The two samples are random ones.
Case Study 5.8

The following random samples are the amounts used by two states in Nigeria to provide health
facilities (in millions of naira) for five months:

State 1    8400    8230    8380    7860    7930
State 2    7510    7690    7720    8070    7660

Use the 0.05 level of significance to test whether the difference between the means of these
two samples is significant.

Solution

$$\bar{X}_1 = \frac{\sum_{i=1}^{5} X_i}{5} = \frac{8400 + 8230 + \dots + 7930}{5} = 8160$$

$$\bar{X}_2 = \frac{\sum_{i=1}^{5} X_i}{5} = \frac{7510 + 7690 + \dots + 7660}{5} = 7730$$

$$S_1^{2} = \frac{\sum_{i=1}^{5} (X_i-\bar{X}_1)^{2}}{n_1-1} = \frac{253800}{4} = 63450$$

$$S_2^{2} = \frac{\sum_{i=1}^{5} (X_i-\bar{X}_2)^{2}}{n_2-1} = \frac{170600}{4} = 42650$$

H0: μ1 = μ2

H1: μ1 ≠ μ2

Test statistic

$$t = \frac{\bar{X}_1-\bar{X}_2}{\sqrt{\dfrac{(n_1-1)S_1^{2}+(n_2-1)S_2^{2}}{n_1+n_2-2}\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)}}$$

$$t = \frac{8160-7730}{\sqrt{\dfrac{4(63450)+4(42650)}{5+5-2}\left(\dfrac{1}{5}+\dfrac{1}{5}\right)}} = \frac{430}{\sqrt{21220}} = \frac{430}{145.67}$$

t = 2.95

ttab = t0.025,8 = 2.306

Conclusion: Since tcal = 2.95 > ttab = 2.306, the null hypothesis should be rejected, and we
conclude that the average amounts spent on health facilities by the two states are not the same.
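
The pooled-variance test in Case Study 5.8 can be reproduced from the raw data. This is a minimal sketch; the function name `pooled_t` is an illustrative assumption, and the critical value 2.306 is the two-tailed t-table value for 8 degrees of freedom quoted in the case study.

```python
import math
import statistics

state_1 = [8400, 8230, 8380, 7860, 7930]
state_2 = [7510, 7690, 7720, 8070, 7660]

def pooled_t(sample_1, sample_2):
    """Two-sample t statistic using the pooled variance (equal variances assumed)."""
    n1, n2 = len(sample_1), len(sample_2)
    mean_1, mean_2 = statistics.mean(sample_1), statistics.mean(sample_2)
    # statistics.variance divides by n - 1, matching the sample variance S^2.
    var_1, var_2 = statistics.variance(sample_1), statistics.variance(sample_2)
    pooled_var = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean_1 - mean_2) / standard_error

t_cal = pooled_t(state_1, state_2)
t_tab = 2.306  # t-table value for alpha/2 = 0.025 and 8 d.f. (taken from the text)

print(statistics.mean(state_1), statistics.mean(state_2))  # 8160 7730
print(round(t_cal, 2))       # 2.95
print(abs(t_cal) > t_tab)    # True
```

The result depends on the equal-variance assumption listed for the t-distribution; if that assumption is doubtful, the pooled statistic is not appropriate.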

5.4 Estimation of Parameters

A parameter is a population characteristic which is used to describe the population or a
random variable. For example, the mean μ and variance σ² are the parameters used to
describe a normally distributed population.

A statistic is a characteristic of sample data that is used to estimate a population
parameter. For instance, given sample data xi, i = 1, 2, 3, …, n, then

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Hence, x̄ is a statistic used to estimate the population parameter μ.

An estimator of an unknown parameter is simply a statistic that can estimate the
parameter unbiasedly; thus, x̄ is an estimator of the population parameter μ.

5.4.1 Point Estimation

A point estimator is a procedure leading to a single numerical value for the estimate of the
unknown parameter. Suppose $x_i$, i = 1, 2, 3, …, n is a random sample of n observations on
the random variable X; then the sample mean $\bar{x}$ is a point estimator of the population
mean $\mu$ and the sample variance $S^2$ is a point estimator of the population variance
$\sigma^2$.

There are various properties of a good point estimator, but the two most important are that
the point estimator should be unbiased and should have minimum variance. Unbiasedness
means that the expected value of the estimator is equal to the population parameter it
estimates. Minimum variance means that, among the competing unbiased estimators of the
same parameter, the good point estimator is the one with the smallest variance.

5.4.2 Interval Estimation

Interval estimation is the use of sample data to calculate an interval of possible values of
an unknown population parameter; this is in contrast to point estimation, which gives a
single value. An interval estimator is a random interval in which the true value of the
parameter lies with some stated probability; such an interval is usually called a confidence
interval. A $100(1-\alpha)\%$ confidence interval for a parameter $\theta$ is a random
interval $(L_1, L_2)$ such that

$$P(L_1 \le \theta \le L_2) = 1 - \alpha$$

regardless of the value of $\theta$.

The interval between two statistics that includes the true value of the parameter in
question with some probability is an interval estimate of the parameter. For instance, to
obtain an interval estimate of a parameter $\theta$, we need to obtain two statistics $L_1$
and $L_2$ (the lower and upper confidence limits respectively) such that the probability
statement

$$P(L_1 \le \theta \le L_2) = 1 - \alpha$$

is true. The interval is called a $100(1-\alpha)$ percent confidence interval for the
parameter $\theta$.

Suppose X is normally distributed with unknown mean $\mu$ and variance $\sigma^2$. That is,

$X \sim N(\mu, \sigma^2)$

If a random sample of n observations is taken and the sample mean $\bar{x}$ computed, then

$$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$

As a probability statement, this is given as:

$$P\left(-Z_{\alpha/2} \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le Z_{\alpha/2}\right) = 1 - \alpha$$

Thus,

$$\bar{x} \pm Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$

is a $100(1-\alpha)$ percent confidence interval for $\mu$. If the variance of the
distribution is unknown, then the sample variance $S^2$ may be used to estimate
$\sigma^2$. Then

$$\frac{\bar{x} - \mu}{S/\sqrt{n}} \sim t_{n-1}$$

and the $100(1-\alpha)$ percent confidence interval for $\mu$ becomes
$\bar{x} \pm t_{\alpha/2,\,n-1}\,S/\sqrt{n}$.

The value of $Z_{\alpha/2}$ is obtained from the standard normal table.

For a 95% confidence interval,

$1 - \alpha = 95\% = 0.95$

$\alpha = 1 - 0.95 = 0.05$

$\alpha/2 = 0.025$

Therefore,

$Z_{\alpha/2} = Z_{0.025}$

From the table, $Z_{0.025} = 1.96$

This implies that the area between $-Z_{\alpha/2}$ and $Z_{\alpha/2}$, that is, between
-1.96 and 1.96, is 0.95. This is the probability
$P(-Z_{\alpha/2} \le Z \le Z_{\alpha/2}) = 1 - \alpha$, i.e. $P(-1.96 \le Z \le 1.96) = 0.95$.
To construct the confidence interval (C.I.) for $\mu$, replace Z by
$\dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ and solve for $\mu$:

$$-1.96 \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le 1.96$$

Thus, $\bar{x} - 1.96\,\sigma/\sqrt{n} \le \mu \le \bar{x} + 1.96\,\sigma/\sqrt{n}$

Case Study 5.9

The peak of a stock market index was recorded for 100 months. The sample mean and standard
deviation are 180 million and 10 million respectively. Determine, correct to three
significant figures, the 95% confidence limits of the population mean of the stock market
index.

Solution

Given

$\bar{x}$ = 180 million

$\sigma$ = 10 million

n = 100

Let $\mu$ be the population mean. The 95% C.I. for the population mean is

$$\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}$$

$$180000000 - 1.96\left(\frac{10000000}{\sqrt{100}}\right) \le \mu \le 180000000 + 1.96\left(\frac{10000000}{\sqrt{100}}\right)$$

$$178040000 \le \mu \le 181960000$$

$$1.78 \times 10^{8} \le \mu \le 1.82 \times 10^{8}$$
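
As a check on Case Study 5.9, the confidence limits can be computed with a few lines of Python. This is a minimal sketch; the function name `z_confidence_interval` is illustrative, and the default multiplier 1.96 is the 95% value derived earlier.

```python
import math

def z_confidence_interval(mean, sd, n, z=1.96):
    """Confidence limits mean ± z * sd / sqrt(n) for a population mean."""
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin

# Case Study 5.9: mean 180 million, standard deviation 10 million, n = 100
lower, upper = z_confidence_interval(180_000_000, 10_000_000, 100)

print(round(lower), round(upper))
```

The margin of error is 1.96 million, so the limits lie 1.96 million on either side of the sample mean.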


Summary of Study Session 5

In study session 5, you have learnt that:

1. A random variable is either a discrete or a continuous random variable.

2. X is said to be a continuous random variable if there exists a function f, called the
probability density function (p.d.f.) of X, satisfying the following conditions:

i. $f(x) \ge 0$ for all x

ii. $\int_{-\infty}^{\infty} f(x)\,dx = 1$

iii. For any a, b with $-\infty < a < b < \infty$, we have $P(a \le X \le b) = \int_a^b f(x)\,dx$

3. The most frequent application of statistics is to test some scientific hypotheses.

4. A parameter is a population characteristic which is used to describe the
population or a random variable.


Self-Assessment Questions (SAQs) for Study Session 5

Now that you have completed this study session, you can assess how well you have
achieved its learning outcomes by answering these questions. Write your answers in your
study diary and discuss them with your tutor at the next study support meeting. You can
check your answers with the notes on the Self-Assessment Questions at the end of this
session.

1. Explain the following terms:
i. Statistical hypothesis
ii. Type I and Type II errors
2. Define the following terms:
i. An estimator
ii. Point estimator
iii. Confidence interval
3. What is a random variable?
4. What are the procedures in testing a statistical hypothesis?
5. What is a p-value, and what are the rules for expectation and variance
respectively?
6. The following random samples are the spending power of families (in millions of
naira) for two states in Nigeria:

State 1    84    82    83    78    79
State 2    75    76    77    80    76

Use the 0.05 level of significance to test whether the difference between the means of
these two samples is significant.

7. A survey reveals that 40 percent of customers using a kind of detergent soap want
NAFDAC to stop the sale and use of the soap, following some rumour about its side
effects. However, after an advertisement by the producer, only 180 out of 500 customers
interviewed now believe that NAFDAC should stop the sale and use of the soap. Does the
advertisement reduce the customers' belief that the soap should be stopped? Test at the
5% level of significance.
8. ‘Family planning or control’ is a popular campaign of every government. A
research agency wants to know whether the programme is popular amongst people
living in Oyo State and Ogun State. The survey reveals that in a random sample
of 1000 people living in Oyo State, 450 of them were aware of the programme. In
Ogun State 800 people were randomly sampled and 400 were aware of the
programme ‘Family Planning’. Do these facts indicate a significant difference
between the two states as far as this programme is concerned? Test at both the 1% and
5% levels of significance.


Glossary of Terms

Hypothesis: an assumption about a population parameter

Parameter: numbers that summarize data for an entire population.

Estimate: any of numerous procedures used to calculate the value of some property of a
population from observations of a sample drawn from the population

Statistical inference: the process through which inferences about a population are made
based on certain statistics calculated from a sample of data drawn from that population

Random phenomenon: a situation in which we know what outcomes can occur, but we
do not know which outcome will occur

Expected value: an anticipated value for an investment at some point in the future

Sample results: the results of n experiments in which the same quantity is measured

Impracticable: impossible in practice to do or carry out



Should you require more explanations on this study session, please do not hesitate to
contact your e-tutor via the LMS.


Study Session 6: Analysis of Categorical Data

Introduction

The use of statistical methods for categorical data has increased dramatically,
particularly for applications in the biomedical and social sciences. This study
session summarizes these methods and shows readers how to use them. You
will find a unified generalized linear models approach that connects logistic regression
and loglinear models for discrete data with normal regression for continuous data.

Learning Outcomes for Study Session 6

On completion of this study session, you should be able to:

6.1 State the analysis of categorical data

6.2 Use the Chi Square test


6.1 Introduction

The statistical problem is to determine whether the observed category frequencies tend to
support or refute a stated hypothesis. The statistical procedure used in this analysis is
the $\chi^2$ (Chi-Square) method. For instance, in the banking industry the method can be
used to determine the general belief of customers about the services that a commercial
bank provides. This can be done based on information collected from customers in order
to be sure whether the services are satisfactory or not.

6.2 The Chi Square Test

Chi-square ($\chi^2$) is a significance test which is used in a very large number of cases
to test the agreement between fact and theory (or between observed values and expected
values). The statistic $\chi^2$ may be defined as

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

which is distributed as $\chi^2$ with (k - r) degrees of freedom, where k is the number of
classes and r is the number of constraints used in fitting the expected frequencies (the
fixed total counts as one constraint, and each parameter estimated from the data adds one
more).

Here $O_i$ refers to the observed values of the sample and $E_i$ refers to the expected
values, i.e. the values we expect on the basis of some hypothesis.

The summation (Σ) extends over all the k classes in the data, and the observed frequencies
add up to n, the number in the sample.

6.2.1 Uses of Chi-Square Test

The three most important situations where the $\chi^2$ test can be used are:

i. To test the discrepancies between observed and expected frequencies.

ii. To test the goodness of fit.

iii. To determine association between two or more attributes.

The $\chi^2$ test procedures are carried out in a similar manner to the normal and t-tests.

Case Study 6.1

Using the information below:

Digit    Probability    Observed frequency (O)    Expected frequency (E)
0        0.1            22                        25
1        0.1            17                        25
2        0.1            23                        25
3        0.1            26                        25
4        0.1            27                        25
5        0.1            31                        25
6        0.1            26                        25
7        0.1            23                        25
8        0.1            29                        25
9        0.1            26                        25
Total                   250                       250

Test at the 0.05 level of significance whether the discrepancies between the observed and
expected frequencies can be attributed to chance.

Solution

H0: The discrepancies between the observed and expected frequencies can be attributed to
chance.

H1: The discrepancies between the observed and expected frequencies cannot be attributed
to chance.

Test statistic

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

Since each expected frequency is 250(0.1) = 25, substituting into the formula for
$\chi^2$ yields

$$\chi^2 = \frac{(22 - 25)^2}{25} + \frac{(17 - 25)^2}{25} + \frac{(23 - 25)^2}{25} + \dots + \frac{(29 - 25)^2}{25} + \frac{(26 - 25)^2}{25}$$

$$\chi^2 = \frac{140}{25}$$

$\chi^2 = 5.60$

$\chi^2_{0.05,\,9} = 16.919$

Conclusion: Since $\chi^2_{cal} < \chi^2_{tab}$, we accept the null hypothesis and conclude
that the discrepancies can be attributed to chance.
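
The goodness-of-fit calculation in Case Study 6.1 can be sketched in Python as follows. The variable names are illustrative, and the tabulated value 16.919 is the one quoted in the solution for 9 degrees of freedom.

```python
observed = [22, 17, 23, 26, 27, 31, 26, 23, 29, 26]   # digits 0 to 9
probabilities = [0.1] * 10

total = sum(observed)
expected = [total * p for p in probabilities]          # 25 for every digit

# Chi-square statistic: sum of (O - E)^2 / E over all classes
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1
chi_tab = 16.919   # table value at alpha = 0.05 with 9 degrees of freedom

print(round(chi_square, 2), degrees_of_freedom)
print(chi_square < chi_tab)   # True means H0 is not rejected
```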

Case Study 6.2

1600 families were selected randomly in Ogun State to test the belief that high income
families usually have access to basic socio-economic amenities such as health and
education, while low income families do not. Below is the result obtained.

               Health    Education    Total
High income    438       162          600
Low income     506       494          1000
Total          944       656          1600


Test at the 5% level of significance whether income and socio-economic amenities are
independent.

Solution

H0: Income and socio-economic amenities are independent.

H1: Income and socio-economic amenities are dependent.

The expected frequencies are:

               Health    Education    Total
High income    354       246          600
Low income     590       410          1000
Total          944       656          1600

Since we have a 2×2 contingency table,

d.f. = (r-1)(c-1) = (2-1)(2-1) = 1, and hence the correction for continuity is necessary.

$$\chi^2 = \sum \frac{(|O - E| - 0.5)^2}{E}$$

O        E       O-E    |O-E|    Y = |O-E| - 0.5    Y²         Y²/E
438      354     84     84       83.5               6972.25    19.70
162      246     -84    84       83.5               6972.25    28.34
506      590     -84    84       83.5               6972.25    11.82
494      410     84     84       83.5               6972.25    17.01
Total    1600                                                  76.87

$\chi^2$ calculated = 76.87

At the 5% level of significance with d.f. = 1,

$\chi^2$ tabulated = 3.84

Conclusion:

Since $\chi^2_{cal} > \chi^2_{tab}$, we reject H0 and conclude that there is an association
between income and socio-economic amenities.
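
The 2×2 calculation with the continuity correction can be reproduced with the minimal Python sketch below. The variable names are illustrative. Because the code does not round each term before adding, it gives about 76.86 rather than the hand-computed 76.87; the difference is rounding only and does not affect the conclusion.

```python
observed = [[438, 162],    # high income: health, education
            [506, 494]]    # low income: health, education

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i in range(2):
    for j in range(2):
        # Expected frequency = row total x column total / grand total
        expected = row_totals[i] * col_totals[j] / grand_total
        # Continuity correction: subtract 0.5 from |O - E| before squaring
        corrected = abs(observed[i][j] - expected) - 0.5
        chi_square += corrected ** 2 / expected

chi_tab = 3.84   # table value at alpha = 0.05 with 1 degree of freedom

print(round(chi_square, 2))
print(chi_square > chi_tab)   # True means H0 is rejected
```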

Case Study 6.3

900 men and 700 women were interviewed by an independent public opinion poll agency
on their attitude towards family planning. Below is the result of the interview.

Respondents    In favour    Opposed    Neutral
Men            355          400        145
Women          365          240        95

Do these data indicate a significant sex difference in the attitude towards family planning?
Test at the 1% level of significance.

Solution

H0: There is no sex difference

H1: There is sex difference

The table of observed frequencies is given below:

Respondents    In favour    Opposed    Neutral    Total
Men            355          400        145        900
Women          365          240        95         700
Total          720          640        240        1600


The expected frequencies are:

Respondents    In favour                Opposed                  Neutral                  Total
Men            (900×720)/1600 = 405     (900×640)/1600 = 360     (900×240)/1600 = 135     900
Women          (700×720)/1600 = 315     (700×640)/1600 = 280     (700×240)/1600 = 105     700
Total          720                      640                      240                      1600

Since we have a 2×3 contingency table,

d.f. = (2-1)(3-1) = 2, and hence the test statistic is given by

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Using the above test statistic, we obtain the following results:

O        E        O-E    (O-E)²    (O-E)²/E
355      405      -50    2500      6.173
400      360      40     1600      4.444
145      135      10     100       0.741
365      315      50     2500      7.937
240      280      -40    1600      5.714
95       105      -10    100       0.952
1600     1600                      25.961

$\chi^2_{cal} = 25.961$

From the table, $\chi^2$ at the 1% level of significance with 2 degrees of freedom is 9.21:

$\chi^2_{tab} = 9.21$


Conclusion

Since $\chi^2_{cal} > \chi^2_{tab}$, we reject H0 and conclude that there is a sex
difference in the attitude towards family planning.
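
For tables larger than 2×2, where no continuity correction is applied, the same steps can be wrapped in a general function. The sketch below is illustrative (the name `chi_square_statistic` is not from any library) and is applied to the observed frequencies of Case Study 6.3.

```python
def chi_square_statistic(table):
    """Return (chi-square, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi_square += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square, df

attitudes = [[355, 400, 145],   # men: in favour, opposed, neutral
             [365, 240, 95]]    # women: in favour, opposed, neutral

chi_cal, df = chi_square_statistic(attitudes)
chi_tab = 9.21   # table value at the 1% level with 2 degrees of freedom

print(round(chi_cal, 2), df)
print(chi_cal > chi_tab)   # True means H0 is rejected
```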


Summary of Study Session 6

In study session 6, you have learnt that:

The three most important situations where the $\chi^2$ test can be used are:

i. To test the discrepancies between observed and expected frequencies.

ii. To test the goodness of fit.

iii. To determine association between two or more attributes.


Self-Assessment Questions (SAQs) for Study Session 6

Now that you have completed this study session, you can assess how well you have
achieved its learning outcomes by answering these questions. Write your answers in your
study diary and discuss them with your tutor at the next study support meeting. You can
check your answers with the notes on the Self-Assessment Questions at the end of this
session.

1. Define chi-square. What are its uses?

2. Investigate whether there is a relationship between the intelligence of persons who
have gone through a certain job-training programme and their subsequent
performance on the job. Test at the 0.01 level of significance.

                  Performance
Intelligence      Poor    Fair    Good    Total
Below average     67      64      25      156
Average           42      76      56      174
Above average     10      23      37      70
Total             119     163     118     400


Glossary of Terms

Procedure: a method of analyzing or representing data

Summation (∑ ): add up

Loglinear models: each frequency is a random variable with a finite and positive
expectation, and the logarithms of the expectations of the frequencies are assumed to
satisfy a linear model

Expected frequencies: a theoretical frequency that we expect to occur in an experiment

Fit: a statistical hypothesis test used to see how closely observed data mirrors expected
data



Should you require more explanations on this study session, please do not hesitate to
contact your e-tutor via the LMS.


Study Session 7: Regression and Correlation Analysis

Introduction

Correlation analysis is used to quantify the association between two continuous
variables (e.g., between an independent and a dependent variable or between
two independent variables). Regression analysis is a related technique to assess
the relationship between an outcome variable and one or more risk factors or
confounding variables. The outcome variable is also called the response or dependent
variable, and the risk factors and confounders are called the predictors,
or explanatory or independent variables. In regression analysis, the dependent variable is
denoted by "y" and the independent variables are denoted by "x".

Learning Outcomes for Study Session 7

On completion of this study session, you should be able to:

7.1 Explain the relationship between two or more variables

7.2 State the various methods in finding regression

7.3 Apply different methods to calculate correlation


7.1 Introduction

In most empirical studies a relationship is found to exist between two or more variables. In
most statistical investigations the major objective is to establish the relationship between
these variables, which makes it possible to predict one or more variables in terms of the
others. Regression analysis is often used to predict the response variable from knowledge
of the independent variables. Likewise, regression analysis is used to examine the nature
of the relationship between the independent variables and the response (dependent)
variable.

Regression is therefore the study of relationships among variables. One purpose of
regression may be to predict, or estimate, the value of one variable from the values of
other variables related to it. For example, a psychologist working with laboratory rats
might use the number of hours of food deprivation to predict how long it will take a rat to
learn its way through a maze to find food.

7.1.1 Scatter Diagram

When a dependent variable Y is plotted against an independent variable X to show
graphically the relationship existing between the two variables, the plot so obtained is
called a scatter diagram. At a glance it enables us to know the form of relationship that
exists between the two variables. From the diagram, it can be observed whether the
relationship is linear, non-linear, or whether there is no relationship at all.


Case Study 7.1

In an experiment on the capacity of electrolytic cells, four cells were taken with different
quantities of electrolyte. The results were as follows:

Quantity of electrolyte (X)    0.50    0.55    0.60    0.65
Mean capacity (Y)              1       3       4       7

We can plot such sets of data on a graph called a scatter diagram.

Fig 7.1: Scatter diagram of mean capacity (Y) against quantity of electrolyte (X), with the
fitted line y = 38x - 18.1 (R² = 0.9627)

It can be observed that there is a form of linear relationship between the mean capacity and
the quantity of electrolyte. In this case the fitted line is called the regression line of mean
capacity on quantity of electrolyte.

7.1.2 The Model

Now that we have collected and plotted the data on mean capacity and quantity of
electrolyte and determined that there is a linear relationship between them, how do we
describe that relationship? We know that mean capacity increases with quantity of
electrolyte, but by how much, and starting where? We want a mathematical equation,
function or model which specifies what the relationship is. Since the relationship is
linear, we want a linear equation stating Y as a function of X.

The linear equation $Y = \alpha + \beta X$ is the equation of a straight line. The Greek
letters $\alpha$ (alpha) and $\beta$ (beta) are the parameters of the line: once they are
specified, the line is completely determined. $\alpha$ is the Y-intercept, the value of Y
when X = 0. The parameter $\beta$ is the slope of the line, the change in Y for each unit
increase in X.

7.1.3 Deterministic and Probabilistic

In an equation of the form

$$Y = \alpha + \beta X$$

the points all lie on a straight line. But in a scatter diagram, the points do not all lie
on a straight line. How then can the relationship between Y and X be described using the
equation for a straight line? The equation $Y = \alpha + \beta X$ represents a
deterministic, or mathematical, model.

Once a value of X is chosen, the value of Y is automatically determined by the specified
values of $\alpha$ and $\beta$ and the rules of arithmetic. However, no such exact
relationship exists between two variables in practice, because many other factors could be
responsible. Furthermore, even if we consider every conceivable factor and compare
subjects with all the same characteristics, we would still expect their responses to differ.
Why? We can only attribute the difference to some unpredictable factor or to chance. This
uncontrollable, immeasurable variation is what distinguishes one identical twin from
another or one laboratory rat from its littermate. It is what gives a chemist slightly
different results on repeated runs of the same experiment. By the same token, our
statistical predictions will always be subject to random error. They will always be, to some
degree, imperfect when applied to any specific situation.

Recognizing that we can never predict anything exactly, we can describe a relationship by
means of the probabilistic or statistical model,

$$Y = \alpha + \beta X + \varepsilon$$

Here, $\varepsilon$, the Greek letter epsilon, represents the error in the predictions. In
this model, $Y = \alpha + \beta X$ says that there is a linear relationship, that is, Y
generally increases or decreases at a constant rate with X, and $\varepsilon$ says that this
relationship is not exact for every individual pair of observations. The error term
$\varepsilon$ accounts for variables that affect Y but are not included as predictors. It
accounts for chance, or random variability, as well as imprecision in the specified model,
which might be almost but not exactly linear. We can thus say that the error term is
composed of two general kinds of error:

i. Model error or lack of fit: not all relevant predictors are specified, or the form of
the relationship is not correctly specified.
ii. Random error: this is unpredictable and uncontrollable.

The model $Y = \alpha + \beta X + \varepsilon$ merely states the conceptual framework of the
problem. It is a shorthand way of saying that we are trying to investigate a problem in
which there is an imperfect linear relationship between two variables. Thus we shall use the
data to get numerical estimates, which we will call $\hat{\alpha}$ and $\hat{\beta}$, so
that the estimated value $\hat{Y}$ for a given value of X can be obtained by simply
substituting X in the equation

$$\hat{Y} = \hat{\alpha} + \hat{\beta} X$$


We also want to estimate how much error is involved in our predictions. If this error is
quite large, this might be an indication that the relationship is not strong enough to
bother with. Finally, we recognize that because our information comes from a randomly chosen
sample, there is a chance that it gives us a distorted picture of how X and Y are related.
It might indicate a relationship when in fact none exists. The question of whether or not a
relationship exists is central to the whole study of regression.

A regression using only one predictor is called simple regression, and when there are two
or more predictors the analysis is called multiple regression.

7.2 Simple Linear Regression

When only two variables are involved the regression is said to be simple. A simple linear
regression equation is therefore of the form

$$\hat{Y} = \hat{\alpha} + \hat{\beta} X$$

Once $\hat{\alpha}$ and $\hat{\beta}$ are estimated, we can substitute a given value of X
into the equation and calculate the predicted value $\hat{Y}$.

7.2.1 The Least Squares Method

This is the most reliable of all the methods used to find regression lines. It leads to a
unique regression line and unique regression coefficients. The least squares method can be
used to estimate the parameters $\alpha$ and $\beta$ from the model

$$Y_i = \alpha + \beta X_i + \varepsilon_i, \quad i = 1, 2, \dots, n$$

as follows:

$$\varepsilon_i^2 = (Y_i - \alpha - \beta X_i)^2$$

where the least squares function is

$$L = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i)^2$$

Minimize the function L with respect to $\alpha$ and $\beta$ by taking the partial
derivatives

$$\frac{\partial L}{\partial \alpha} = -2 \sum (Y_i - \alpha - \beta X_i)$$

$$\frac{\partial L}{\partial \beta} = -2 \sum X_i (Y_i - \alpha - \beta X_i)$$

Setting these partial derivatives equal to zero and solving for $\alpha$ and $\beta$, we
obtain:

$$\hat{\beta} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

and

$$\hat{\alpha} = \bar{Y} - \hat{\beta}\,\bar{x}$$

where $\bar{x} = \dfrac{\sum x_i}{n}$ and $\bar{Y} = \dfrac{\sum Y_i}{n}$.
7.2.2 Regression Analysis

When only two variables are involved, the regression is said to be simple. A simple linear
regression equation is therefore of the form $\hat{Y} = \hat{\alpha} + \hat{\beta} X$. Once
$\hat{\alpha}$ and $\hat{\beta}$ are estimated, we can substitute a given value of X into the
equation and calculate the predicted value $\hat{Y}$.


Case Study 7.2

A study was made on the effect of income level on the standard of living. The following
data were obtained in coded form. Calculate the regression of standard of living on income
level.

Income level (x)    Standard of living (y)
-5                  1
-4                  5
-3                  4
-2                  7
-1                  10
0                   8
1                   9
2                   13
3                   14
4                   13
5                   18

Solution

The regression equation is

$$y = \alpha + \beta x$$

$$\beta = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}$$

$$\alpha = \bar{Y} - \beta\,\bar{x}$$

From the table we compute

$\sum x = 0$, $\sum y = 102$, $\sum x^2 = 110$, $\sum xy = 158$

$$\beta = \frac{158 - \dfrac{0 \times 102}{11}}{110 - \dfrac{(0)^2}{11}} = 1.44$$

$$\bar{x} = \frac{0}{11} = 0$$

$$\bar{Y} = \frac{102}{11} = 9.27$$

$\alpha = 9.27 - 1.44(0) = 9.27$

$y = 9.27 + 1.44x$

Suppose we are to estimate or predict the value of Y when X = 6; we obtain

$\hat{Y} = 9.27 + 1.44(6)$

The predicted value is $\hat{Y} = 17.91$.
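
The least squares estimates for Case Study 7.2 can be checked with the Python sketch below. The function name `least_squares_line` is illustrative. Because the code keeps the slope unrounded (about 1.436) instead of using 1.44, its prediction at X = 6 is about 17.89 rather than 17.91.

```python
def least_squares_line(x, y):
    """Return (alpha, beta) for the fitted line y = alpha + beta * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    beta = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    alpha = sum_y / n - beta * sum_x / n
    return alpha, beta

income = list(range(-5, 6))                           # coded income levels -5 to 5
living = [1, 5, 4, 7, 10, 8, 9, 13, 14, 13, 18]       # standard of living

alpha, beta = least_squares_line(income, living)
prediction = alpha + beta * 6                         # predicted Y when X = 6

print(round(alpha, 2), round(beta, 2), round(prediction, 2))
```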


Case Study 7.3

The table below shows the Nigerian gross domestic product (X) and inflation rate (Y) over a
period of time.

i) Find the least squares regression line of Y on X.

ii) Find the least squares regression line of X on Y, that is, considering X as the
dependent and Y as the independent variable.

X    65    63    67    64    68    62    70    66    68    67    69    71
Y    68    66    68    65    69    66    68    65    71    67    68    70

Solution

i) The regression line of Y on X is given by $Y = \alpha + \beta X$. The sums required are
computed in the table below:

X        Y        X²        Y²        XY
65       68       4225      4624      4420
63       66       3969      4356      4158
67       68       4489      4624      4556
64       65       4096      4225      4160
68       69       4624      4761      4692
62       66       3844      4356      4092
70       68       4900      4624      4760
66       65       4356      4225      4290
68       71       4624      5041      4828
67       67       4489      4489      4489
69       68       4761      4624      4692
71       70       5041      4900      4970
800      811      53418     54849     54107

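The slope and intercept of the regression line of Y on X are given by

$$\beta = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

$$\beta = \frac{12(54107) - (800)(811)}{12(53418) - (800)^2}$$

$\beta = 0.4764$

$$\alpha = \bar{y} - \beta\,\bar{x} = \frac{\sum y}{n} - \beta\,\frac{\sum x}{n}$$

$$\alpha = \frac{811}{12} - 0.4764 \times \frac{800}{12} = 35.8233$$

The regression equation of Y on X is given as Y = 35.823 + 0.476X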

(ii) The regression line of X on Y is given by

$X = \alpha + \beta Y$

$$\beta = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - \left(\sum y\right)^2}$$

$$\beta = \frac{12(54107) - (800)(811)}{12(54849) - (811)^2}$$

$\beta = 1.0364$

$$\alpha = \frac{\sum x}{n} - \beta\,\frac{\sum y}{n} = \frac{800}{12} - 1.0364 \times \frac{811}{12}$$

$\alpha = -3.38$

The regression equation of X on Y is given as

X = -3.38 + 1.036Y
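
The column totals and both regression lines in Case Study 7.3 can be verified with the short Python sketch below. The variable names are illustrative. The line of X on Y is obtained from the same sums by exchanging the roles of X and Y in the denominator and in the intercept.

```python
x = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]
y = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]
n = len(x)

# Column totals of the working table
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)
sum_xy = sum(a * b for a, b in zip(x, y))

numerator = n * sum_xy - sum_x * sum_y

# Regression of Y on X
slope_yx = numerator / (n * sum_x2 - sum_x ** 2)
intercept_yx = sum_y / n - slope_yx * sum_x / n

# Regression of X on Y
slope_xy = numerator / (n * sum_y2 - sum_y ** 2)
intercept_xy = sum_x / n - slope_xy * sum_y / n

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(round(intercept_yx, 2), round(slope_yx, 3))
print(round(intercept_xy, 2), round(slope_xy, 3))
```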

7.3 Correlation Analysis

We have dealt with the problem of regression, or estimation of one variable (the dependent
variable) from one or more related variables (the independent variables). We shall now
consider the degree of relationship that exists between variables, that is, correlation
analysis.

Correlation analysis is a technique for estimating the closeness or degree of relationship
between two or more variables. Correlation is the degree of association between two or
more variables. The degree of relationship may be positive, that is, an increase in one
variable is accompanied by an increase in the other, or negative, when a decrease in one
variable is accompanied by an increase in the other. The patterns of correlation are:
perfect positive correlation when r = 1, perfect negative correlation when r = -1, positive
correlation when r > 0, negative correlation when r < 0 and no correlation when r = 0.

The correlation coefficient, or coefficient of correlation, denoted by r, is a measure of
the strength of the linear relationship between two variables. Two types of measures of
correlation are:

i. Karl Pearson's product moment correlation coefficient (r)

ii. Spearman's rank correlation coefficient (R)

7.3.1 Product moment Correlation Coefficient

Karl Pearson's product moment correlation coefficient is denoted by r and given by:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

where $-1 \le r \le 1$.

It should be noted that the higher the magnitude of r, the stronger the association.

Case Study 7.5

The table below presents Nigerian Government income (x) and Nigerian Government
expenditure (y) for a period of 12 months. The figures are given in millions of naira.

Month    Nigerian Government income (x)    Nigerian Government expenditure (y)
1        11.50                             11.25
2        9.50                              11.75
3        13.00                             11.75
4        15.50                             12.50
5        12.50                             12.50
6        11.50                             12.75
7        9.00                              9.50
8        11.50                             10.75
9        9.25                              11.00
10       9.75                              9.50
11       14.25                             13.00
12       10.75                             12.00

Calculate the coefficient of correlation.


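Solution

$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left[\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right]\left[\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right]}}$$

$\sum x = 138.00$, $\sum y = 138.25$, $\sum xy = 1608.12$, $\sum x^2 = 1632.75$, $\sum y^2 = 1607.81$

$$r = \frac{1608.12 - \dfrac{138.00 \times 138.25}{12}}{\sqrt{\left(1632.75 - \dfrac{(138.00)^2}{12}\right)\left(1607.81 - \dfrac{(138.25)^2}{12}\right)}}$$

r = 0.70 (to 2 decimal places)

There is a fairly strong positive linear relationship between Nigerian Government
expenditure and government income.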
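
The product moment correlation coefficient for Case Study 7.5 can be checked with the Python sketch below. The function name `pearson_r` is illustrative. The month 12 income is entered as 10.75, the value that agrees with the totals Σx = 138.00 and Σx² = 1632.75 used in the solution; this is an assumption about the intended data.

```python
import math

def pearson_r(x, y):
    """Product moment correlation coefficient of paired data x and y."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

income = [11.50, 9.50, 13.00, 15.50, 12.50, 11.50,
          9.00, 11.50, 9.25, 9.75, 14.25, 10.75]      # last value assumed 10.75
expenditure = [11.25, 11.75, 11.75, 12.50, 12.50, 12.75,
               9.50, 10.75, 11.00, 9.50, 13.00, 12.00]

r = pearson_r(income, expenditure)
print(sum(income), sum(expenditure))
print(round(r, 2))
```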

Case Study 7.6

Calculate the correlation coefficients between the following pairs of variables: gain in
height, gain in weight and intelligence quotient (IQ) in 10 children, given in the table
below:

Child number                 1      2      3      4      5      6      7       8       9      10     Total
Gain in weight (y)           1.0    3.0    2.5    4.5    1.5    2.0    3.1     4.1     2.5    4.2    28.4
Gain in height (x)           2.0    3.5    3.0    5.0    2.1    2.5    3.6     3.8     3.0    4.0    32.5
Intelligence quotient (z)    1.0    6.0    4.0    10.0   2.0    9.0    7.0     8.0     5.0    9.0    61.0
xy                           2.0    10.5   7.5    22.5   3.15   5.0    11.16   15.58   7.5    16.8   101.69
yz                           1.0    18.0   10.0   45.0   3.0    18.0   21.7    32.8    12.5   37.8   199.80
xz                           2.0    21.0   12.0   50.0   4.2    22.5   25.2    30.4    15.0   36.0   218.30

$\sum x^2 = 113.31$, $\sum y^2 = 93.06$, $\sum z^2 = 457$, $\sum xy = 101.69$,

$\sum xz = 218.3$, $\sum yz = 199.80$, $\sum x = 32.5$, $\sum y = 28.4$,

$\sum z = 61.0$, $\bar{X} = 3.25$, $\bar{Y} = 2.84$, $\bar{Z} = 6.10$

$$r_{xy} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left[\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right]\left[\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right]}}$$

$$r_{xy} = \frac{101.69 - \dfrac{32.5(28.4)}{10}}{\sqrt{\left[113.31 - \dfrac{(32.5)^2}{10}\right]\left[93.06 - \dfrac{(28.4)^2}{10}\right]}} = \frac{9.39}{9.76} = 0.9620$$

$$r_{xz} = \frac{\sum xz - \dfrac{\sum x \sum z}{n}}{\sqrt{\left[\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right]\left[\sum z^2 - \dfrac{\left(\sum z\right)^2}{n}\right]}}$$

$$r_{xz} = \frac{218.30 - \dfrac{32.5(61.0)}{10}}{\sqrt{\left[113.31 - \dfrac{(32.5)^2}{10}\right]\left[457 - \dfrac{(61)^2}{10}\right]}} = \frac{20.05}{25.5436} = 0.7849$$

$$r_{yz} = \frac{199.80 - \dfrac{28.4(61.0)}{10}}{\sqrt{\left[93.06 - \dfrac{(28.4)^2}{10}\right]\left[457 - \dfrac{(61)^2}{10}\right]}} = \frac{26.56}{32.45} = 0.8185$$

All three correlation coefficients are high and positive. It can be concluded that (i) gains
in height and weight are positively correlated: children having a good gain in height also
have a good gain in weight; (ii) gains in height and weight are positively correlated with
IQ: children who have good gains in height or weight tend to have a good IQ.

7.3.2 Spearman Rank Correlation Coefficient

When variables do not follow a normal distribution and one desires to assess the
relationship, a correlation coefficient known as Spearman's rank correlation coefficient is
used. The values of each variable are ranked according to their magnitude, and the
correlation between the ranks of variables x and y is obtained. The symbol used is R, and
the formula is:

$$R = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference between the ranks given to the two variables of each pair and
n is the number of pairs studied. The procedure was developed by Spearman; hence, it is
known as Spearman's rank correlation coefficient. Its value also ranges from -1 to 1.

Case Study 7.7

Calculate the value of the rank correlation coefficient between the corresponding values of
income (X) and expenditure (Y) of a company given below.

X 22 24 25 16 28 19
Y 48 42 40 38 47 45

Solution

The values are ranked in ascending order of magnitude.

X     Y     RX    RY    d     d²
22    48    3     6     -3    9
24    42    4     3     1     1
25    40    5     2     3     9
16    38    1     1     0     0
28    47    6     5     1     1
19    45    2     4     -2    4
Total                         24

\[
R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(24)}{6(36 - 1)} = 1 - 0.6857 = 0.3143 \approx 0.31
\]

There is a low or weak positive correlation between the two variables.
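
The ranking and the formula for R in Case Study 7.7 can be sketched in Python as follows. The helper names `ranks_ascending` and `spearman_r` are hypothetical, and the ranking helper assumes there are no tied values, as is the case for this data set.

```python
def ranks_ascending(values):
    """Rank distinct values from 1 (smallest) to n (largest)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_r(x, y):
    """Spearman's R = 1 - 6*sum(d^2) / (n(n^2 - 1)), assuming no ties."""
    rank_x, rank_y = ranks_ascending(x), ranks_ascending(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Data from Case Study 7.7
income = [22, 24, 25, 16, 28, 19]
expenditure = [48, 42, 40, 38, 47, 45]

R = spearman_r(income, expenditure)
print(round(R, 4))  # 0.3143
```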

7.3.3 Tie in Ranks

Often, two or more values of a variable are equal. In such cases, we assign to each of the
tied observations the mean of the ranks which they jointly occupy. For example, if the 5th
and 6th largest values of a variable are equal, we assign to each the rank (5 + 6)/2 = 5.5,
and if the fifth, sixth and seventh largest values of a variable are the same, we assign to
each the rank (5 + 6 + 7)/3 = 6.

Case Study 7.8

The table given below shows the respective weights X and Y (in kg) of 12 fathers and their
eldest sons.

Father (X)   66  64  68  65  69  63  71  67  69  68  70  72
Sons (Y)     69  67  69  66  70  67  69  66  72  68  69  71

Calculate the coefficient of rank correlation and comment on the degree of correlation
between the fathers' weights and those of their sons.

Solution

X     Y     RX     RY     d = RX - RY     d²
66    69    4      7.5    -3.5            12.25
64    67    2      3.5    -1.5            2.25
68    69    6.5    7.5    -1.0            1.00
65    66    3      1.5    1.5             2.25
69    70    8.5    10     -1.5            2.25
63    67    1      3.5    -2.5            6.25
71    69    11     7.5    3.5             12.25
67    66    5      1.5    3.5             12.25
69    72    8.5    12     -3.5            12.25
68    68    6.5    5      1.5             2.25
70    69    10     7.5    2.5             6.25
72    71    12     11     1.0             1.00
Total                                     72.50

\[
R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(72.50)}{12(144 - 1)} = 1 - \frac{72.50}{2(143)} = 1 - 0.2535 = 0.7465 \approx 0.75
\]

Comment: There is a fairly high positive correlation between the fathers' weights and those
of their eldest sons.
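
The rule for tied ranks can be sketched in Python and checked against Case Study 7.8. The helper name `average_ranks` is a hypothetical choice; it gives every tied value the mean of the ranks the tied group occupies, as described in Section 7.3.3.

```python
def average_ranks(values):
    """Ascending ranks; tied values share the mean of the ranks they occupy."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1           # first rank occupied by v
        count = ordered.count(v)               # number of tied observations
        rank_of[v] = first + (count - 1) / 2   # mean of first .. first+count-1
    return [rank_of[v] for v in values]

# Data from Case Study 7.8 (weights in kg)
fathers = [66, 64, 68, 65, 69, 63, 71, 67, 69, 68, 70, 72]
sons = [69, 67, 69, 66, 70, 67, 69, 66, 72, 68, 69, 71]

rank_x = average_ranks(fathers)
rank_y = average_ranks(sons)
n = len(fathers)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
R_tied = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2, round(R_tied, 4))  # 72.5 0.7465
```

For example, the four sons weighing 69 kg occupy ranks 6 to 9, so each receives the rank 7.5.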

7.3.4 Coefficient of Determination

The square of the sample correlation coefficient is defined as the coefficient of
determination, usually denoted by r². The value of r obtained from correlation analysis
merely provides a relative measure of the strength of the relationship between the two
variables, whereas r² gives the proportion of the total variation that is explained.

It is defined as

\[
r^2 = \frac{\text{explained variation}}{\text{total variation}} = \frac{SSR}{SST} = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2}
\]

Also,

r² = (coefficient of correlation)²

r = √(coefficient of determination)

From example 7.4,

r = 0.70

r² = (0.70)²

r² = 0.4900

r = √0.4900 = 0.70.

Therefore, the correlation is positive, and about 49% of the total variation is explained by
the relationship.


Summary of Study Session 7

In study session 7, you have learnt that:

1. When an independent variable X is plotted against a dependent variable Y to show the
algebraic relationship existing between the two variables, the plot so obtained is
called a scatter diagram.
2. The two kinds of error are model error and random error.
3. The two types of measures of correlation are Karl Pearson's product moment
correlation coefficient (r) and Spearman's rank correlation coefficient (R).
4. In cases where two or more values of a variable might be equal, we assign to each
of the tied observations the mean of the ranks which they jointly occupy


Self-Assessment Questions (SAQs) for Study Session 7

Now that you have completed this study session, you can assess how well you have
achieved its learning outcomes by answering these questions. Write your answers in your
study diary and discuss them with your tutor at the next study support meeting. You can
check your answers with the notes on the Self-Assessment Questions at the end of this
session.

1. Explain the following terms

a. Regression analysis
b. Correlation Analysis
c. Model

2. The table given below shows the respective income (X) and expenditure (Y) (in
million naira) of O.O.U, Ago-Iwoye for a year.

Income (X)        66  64  68  65  69  63  71  67  69  68  70  72
Expenditure (Y)   69  67  69  66  70  67  69  66  72  68  69  71

a. Find the regression lines of X on Y and of Y on X.
b. Calculate the coefficient of rank correlation and comment on the degree of
correlation between income and expenditure.


Glossary of Terms

Coefficient: measures a certain property or characteristic of a data set, phenomenon, or
process, given specified conditions

Degree: extent

Risk factors: determinants associated with an increased risk of a problem

Regression analysis: a set of statistical methods used to estimate relationships between a
dependent variable and one or more independent variables

Statistical predictions: the process of using correlations between variables to hypothesize
about future events and outcomes


References

1. Walpole and Myers. Probability and Statistics for Engineers & Scientists.
2. Sojobi O.A. Introduction to Statistics. Jedidiah Publishers.
3. Seymour L. Schaum's Outline Series: Theory and Problems of Probability (S.I. Metric
Edition). McGraw Hill Book Company, New York.
4. Gupta C.B. An Introduction to Statistical Methods. Vikas Publishing House, Delhi.
5. https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Multivariable/BS704_Multivariable5.html

Should you require more explanation on this study session, please do not hesitate to
contact your e-tutor via the LMS.


Study Session 8: Analysis of Variance

Introduction

Analysis of Variance (ANOVA) is a statistical method used to test differences between two
or more means. It may seem odd that the technique is called "Analysis of Variance" rather
than "Analysis of Means." As you will see, the name is appropriate because inferences about
means are made by analyzing variance. ANOVA is used to test general rather than specific
differences among means. This can be seen best by example.

Learning Outcomes for Study Session 8

On completion of this study session, you should be able to:

8.1 Explain Analysis of Variance.

8.2 Use different methods to find the Analysis of Variance


8.1 Introduction to Analysis of Variance

Statistical design of experiments refers to the process of planning the experiment, so that
appropriate data that can be analyzed by statistical methods will be collected, resulting in a
valid and objective conclusion. There are two aspects to any experimental problem, which
are, the design of the experiment and the statistical analysis of the data generated. The two
are closely related, because the method of analysis depends directly on the design
employed.

Whether one is experimenting with animals or human beings and analysis of the results is
to be carried out, the design of experiment is of central importance. The statistical aspects
of the design (or planning) of an experiment are:

i. Selecting the treatments (factors or levels of factors) whose effects are to be
studied.
ii. Specifying a layout for the experimental units (animals or human beings) to
which the treatments are to be applied.
iii. Providing rules according to which the treatments are to be distributed among the
experimental units.
iv. Specifying what measurements are to be made for each experimental unit.

All of these must be accomplished in such a way that the techniques to be used in the
analysis of the results are clear before the experiment is conducted. An experimental unit
might be a plot of land, a human being, a batch of seeds or a species of animal.


8.1.1 Basic Terms in Analysis of Variance

The basic terms used in analysis of variance are discussed below

1. Randomization: This is the allocation of treatments to units such that the
probability that a particular treatment will be allocated to a particular unit is the
same for all treatments.
2. Replication: This is a complete repetition of the basic experiment. It provides
an estimate of the magnitude of the experimental error and a more precise measure
of treatment effects.
3. Experiment: It is a means of getting an answer to the question that the
experimenter has in mind. This may be to decide which of several pain-relieving
drugs is most effective or whether they are equally effective.
4. Treatment: This means the experimental conditions which are imposed on an
experimental unit in a particular experiment. In a dietary or medical experiment,
the different diets or medicines are the treatments. In an agricultural
experiment, the different varieties of a crop or different manures will be the
treatments.
5. Blocking: This is the assignment of the experimental units to blocks in such a
manner that the units within any particular block are as homogeneous as possible.
6. Factor: A factor is a possible cause of response or variation. Factors include age,
sex, variety, etc. It may be observed that treatments are often different
combinations of the levels of one or more factors.
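
Randomization and replication, as defined above, can be illustrated with a short Python sketch. The function name `randomize`, the treatment labels and the plot names are all hypothetical; the sketch simply shuffles a list in which every treatment appears the same number of times, so that each unit is equally likely to receive any treatment.

```python
import random

def randomize(treatments, units, replications, seed=None):
    """Allocate treatments to units at random, each treatment `replications` times."""
    if len(units) != len(treatments) * replications:
        raise ValueError("number of units must equal treatments x replications")
    labels = [t for t in treatments for _ in range(replications)]
    rng = random.Random(seed)      # a seed makes the allocation reproducible
    rng.shuffle(labels)
    return dict(zip(units, labels))

# Hypothetical example: 3 treatments, each replicated 3 times over 9 plots
plots = [f"plot {i}" for i in range(1, 10)]
plan = randomize(["A", "B", "C"], plots, replications=3, seed=1)
for plot, treatment in plan.items():
    print(plot, "->", treatment)
```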

8.2 Analysis of Variance (ANOVA)

Analysis of variance is a useful technique for comparing the means of several groups. We
employ it to analyze the data collected in any experiment performed. The technique known as
analysis of variance (ANOVA) employs tests based on variance ratios to determine whether or
not significant differences exist among the means of several groups of observations, where
each group follows a normal distribution. ANOVA is particularly useful when the basic
differences between the groups cannot be stated quantitatively. A one-way ANOVA is used to
determine the effect of one independent variable on a dependent variable. A two-way ANOVA
is used to determine the effects of two independent variables on a dependent variable. As
the number of independent variables increases, the calculations become much more complex
and are best carried out on a digital computer. An independent variable is also referred to
as a factor or treatment.

8.2.1 One-way Analysis of Variance

One-way ANOVA is used when we wish to test the equality of k population means. The
procedure is based on the assumptions that each of the k groups of observations is a random
sample from a normal distribution and that the population variance σ² is constant among
the groups. ANOVA models provide an appropriate estimate to facilitate comparison of
several means. The one-way layout is a very simple form of experimental design in which the
treatments are allocated to the experimental units purely on a chance or random basis. It
should be used when the experimental units are homogeneous. The model involves only one
treatment variable in the design.

The model is:

\[
X_{ij} = \mu + \alpha_i + \varepsilon_{ij}
\]

where μ = grand mean

α_i = treatment effect

ε_ij = error terms or residuals, assumed to be independent and normally distributed with
zero mean and common variance for the treatments, i.e.

\[
\varepsilon_{ij} \sim NID(0, \sigma^2)
\]

The results may be analyzed by one-way analysis of variance and the F-test.

Notation                                  Trt total   Trt mean

Trt 1    X_11   X_12   ...   X_1n         X_1.        X̄_1.
Trt 2    X_21   X_22   ...   X_2n         X_2.        X̄_2.
...
Trt k    X_k1   X_k2   ...   X_kn         X_k.        X̄_k.

where

\[
\bar{X}_{i.} = \frac{1}{n}\sum_{j=1}^{n} X_{ij}, \qquad i = 1, 2, \ldots, k
\]

\[
\bar{X}_{..} = \frac{1}{k}\sum_{i=1}^{k} \bar{X}_{i.} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{n} X_{ij}}{kn}
\]

8.2.2 Sum of Squares identity

ANOVA is the partitioning of total variability into its component parts.

Total sum of squares (TSS): The TSS is defined as the sum of the squares of the deviations
of all observations from the grand mean.

\[
TSS = \sum_{i=1}^{k}\sum_{j=1}^{n} (X_{ij} - \bar{X}_{..})^2
\]

It is a measure of the dispersion of all the variates about the grand mean. Its degrees of
freedom (df) = kn - 1. It can be shown that the TSS, SStotal or total variation can be
partitioned into two:

\[
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n} (X_{ij} - \bar{X}_{..})^2}_{TSS}
= \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n} (X_{ij} - \bar{X}_{i.})^2}_{WSS}
+ \underbrace{n\sum_{i=1}^{k} (\bar{X}_{i.} - \bar{X}_{..})^2}_{BSS}
\]

that is, total sum of squares = within-treatment sum of squares + between-treatment sum of
squares.

Within sum of squares (WSS): WSS, or the sum of squares due to error (residual error), is
defined as the sum of the squared deviations of the original observations X_ij from their
treatment means. It represents the experimental error of the given experiment. Its degrees
of freedom are k(n - 1), and it is also denoted by SSE.

Between sum of squares (BSS): It is defined as the sum of the squared deviations of the
treatment means about the grand mean, with k - 1 degrees of freedom. The less the samples
differ from each other, the smaller the BSS or treatment sum of squares (SSTr).

For easy computation, we can use the following:

\[
TSS = \sum_{i=1}^{k}\sum_{j=1}^{n} (X_{ij} - \bar{X}_{..})^2 = \sum_{i=1}^{k}\sum_{j=1}^{n} X_{ij}^2 - \frac{T^2}{nk}
\]

where

\[
T^2 = \left(\sum_{i=1}^{k}\sum_{j=1}^{n} X_{ij}\right)^2
\]

\[
X_{i.} = \sum_{j=1}^{n} X_{ij}, \qquad \bar{X}_{i.} = \frac{X_{i.}}{n}
\]

\[
X_{..} = \sum_{i=1}^{k}\sum_{j=1}^{n} X_{ij}, \qquad \bar{X}_{..} = \frac{X_{..}}{N}
\]

where N = nk = total number of observations

\[
BSS = n\sum_{i=1}^{k} (\bar{X}_{i.} - \bar{X}_{..})^2 = \frac{1}{n}\sum_{i=1}^{k} T_{i.}^2 - \frac{T^2}{nk}
\]

where T_i. = sum of the observations in treatment group i

WSS = TSS - BSS
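
The identity TSS = WSS + BSS, and the agreement between the definitional and the computational formulas, can be checked numerically. The sketch below uses a small hypothetical data set (three treatments with three observations each); the function names are illustrative assumptions.

```python
def sums_of_squares(groups):
    """Definitional formulas. groups: k treatments, each with n observations."""
    k, n = len(groups), len(groups[0])
    grand_mean = sum(sum(g) for g in groups) / (k * n)
    means = [sum(g) / n for g in groups]
    tss = sum((x - grand_mean) ** 2 for g in groups for x in g)
    wss = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    bss = n * sum((m - grand_mean) ** 2 for m in means)
    return tss, bss, wss

def sums_of_squares_shortcut(groups):
    """Computational formulas based on the totals T and T_i."""
    k, n = len(groups), len(groups[0])
    grand_total = sum(sum(g) for g in groups)
    correction = grand_total ** 2 / (n * k)          # T^2 / nk
    tss = sum(x * x for g in groups for x in g) - correction
    bss = sum(sum(g) ** 2 for g in groups) / n - correction
    return tss, bss, tss - bss                       # WSS = TSS - BSS

data = [[4, 6, 5], [7, 9, 8], [1, 2, 3]]   # hypothetical observations
tss, bss, wss = sums_of_squares(data)
print(tss, bss, wss)                        # 60.0 54.0 6.0
print(sums_of_squares_shortcut(data))       # (60.0, 54.0, 6.0)
```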

8.2.3 One-way ANOVA Table (Equal observation)

Given the model,

\[
X_{ij} = \mu + \alpha_i + \varepsilon_{ij}
\]

To test the hypothesis

H0: α_1 = α_2 = … = α_k

H1: at least two α_i's are not equal.

Test statistic: F-calculated from the ANOVA table below

Source                SS     df        MS                    F
Between treatments    BSS    k-1       A = BSS/(k-1)         A/B
Within treatments     WSS    k(n-1)    B = WSS/[k(n-1)]
Total                 TSS    kn-1

Fig 8.1: ANOVA Table

The critical value is F(1-α; v1, v2), where the degrees of freedom are v1 = k-1 and
v2 = k(n-1), and α is the significance level.

Case study 8.1

Suppose the following table gives the examination performance of five students, classified
as A, B, C, D and E, in three subjects. Perform an analysis of variance to test whether the
treatment effects are the same or not, and compute the coefficient of variation to determine
its precision, at the 5% level of significance.

A B C D E
3 5 7 6 4
2 8 8 8 9
4 8 6 7 5

Fig 8.2

Solution

Test of Hypothesis

H0: α_1 = α_2 = … = α_5

H1: at least 2 α_i's are not equal.

                 A     B     C     D     E
                 3     5     7     6     4
                 2     8     8     8     9
                 4     8     6     7     5
T_i. (total)     9     21    21    21    18
X̄_i. (mean)      3     7     7     7     6

Fig 8.3

\[
TSS = \sum\sum (X_{ij} - \bar{X}_{..})^2 = (3-6)^2 + (2-6)^2 + (4-6)^2 + \ldots + (4-6)^2 + (9-6)^2 + (5-6)^2 = 62
\]

\[
BSS = n\sum (\bar{X}_{i.} - \bar{X}_{..})^2 = 3\left[(3-6)^2 + (7-6)^2 + (7-6)^2 + (7-6)^2 + (6-6)^2\right]
\]

= 3 [9 + 1 + 1 + 1 + 0] = 36

WSS = TSS - BSS

= 62 - 36 = 26

Another Computational Formula or Method:

\[
TSS = \sum\sum X_{ij}^2 - \frac{T^2}{nk}, \qquad T = \text{grand total}
\]

\[
= 3^2 + 2^2 + 4^2 + \ldots + 4^2 + 9^2 + 5^2 - \frac{90^2}{3 \times 5} = 602 - 540 = 62
\]

\[
BSS = \frac{1}{n}\sum T_{i.}^2 - \frac{T^2}{nk} = \frac{1}{3}\left(9^2 + 21^2 + 21^2 + 21^2 + 18^2\right) - \frac{90^2}{3 \times 5} = 576 - 540 = 36
\]

WSS = TSS - BSS = 62 - 36 = 26

Source of Variation    SS    df    MS              F
Between treatments     36    4     36/4 = 9        9/2.6 = 3.462
Within treatments      26    10    26/10 = 2.6
Total                  62    14

Fig 8.4: ANOVA Table

Test statistic = Fcal = 3.462

Critical value = F(1-α),4,10 = 3.48

Decision: Since Fcal < Ftab, we accept H0 and conclude that the treatment mean effects in the
five treatments are equal, or there is no significant difference between the treatment means
in the five treatments.
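
The ANOVA table for Case Study 8.1 can be reproduced with the short Python sketch below. The variable names are illustrative assumptions, and the critical value 3.48 is simply the tabulated value quoted in the text rather than something the code computes.

```python
# Scores from Case Study 8.1: five treatments (A to E), three observations each
scores = {
    "A": [3, 2, 4],
    "B": [5, 8, 8],
    "C": [7, 8, 6],
    "D": [6, 8, 7],
    "E": [4, 9, 5],
}
k = len(scores)                       # number of treatments
n = len(scores["A"])                  # observations per treatment

grand_total = sum(sum(v) for v in scores.values())
correction = grand_total ** 2 / (n * k)                      # T^2 / nk

tss = sum(x * x for v in scores.values() for x in v) - correction
bss = sum(sum(v) ** 2 for v in scores.values()) / n - correction
wss = tss - bss

ms_between = bss / (k - 1)            # A in the ANOVA table
ms_within = wss / (k * (n - 1))       # B in the ANOVA table
f_cal = ms_between / ms_within

f_tab = 3.48                          # critical value quoted in the text
print(tss, bss, wss)                  # 62.0 36.0 26.0
print(round(f_cal, 3), f_cal > f_tab) # 3.462 False
```

Because the calculated F does not exceed the tabulated value, the sketch reaches the same decision as the hand calculation: H0 is not rejected.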


Summary of Study Session 8

In study session 8, you have learnt that:

The statistical aspects of the design (or planning) of an experiment are:

i. Selecting the treatments (factors or levels of factors) whose effects are to be
studied.
ii. Specifying a layout for the experimental units (animals or human beings) to
which the treatments are to be applied.
iii. Providing rules according to which the treatments are to be distributed among the
experimental units.
iv. Specifying what measurements are to be made for each experimental unit.


Self-Assessment Questions (SAQs) for Study Session 8

Now that you have completed this study session, you can assess how well you have
achieved its learning outcomes by answering these questions. Write your answers in your
study diary and discuss them with your tutor at the next study support meeting. You can
check your answers with the notes on the Self-Assessment Questions at the end of this
session.

1. Explain the following terms
i. Randomization
ii. Replication
iii. Experiment
iv. Factor
2. Define and state the uses of the following
i. ANOVA
ii. One-way ANOVA
3. The following data were obtained from a market survey to test whether the flavours of 4
new Maggi products, assessed at random in A, B and C, are equally accepted by
customers. To determine this, carry out an analysis of variance test at the 5% level of
significance.

              Breed
Feed     1     2     3     4
A        33    28    29    31
B        37    34    34    34
C        35    33    37    30


Glossary of Terms

Quantitatively: in terms of measured values or counts, expressed as numbers

Homogeneous: made up of things (people, events, objects, etc.) that are similar to each
other.

Experimental unit: a physical entity that is the primary unit of interest in a specific
research objective

Experimental design: how participants are allocated to the different groups in an
experiment

Variate: a quantity having a numerical value for each member of a group, especially one
whose values occur according to a frequency distribution


References

1. Walpole and Myers. Probability and Statistics for Engineers & Scientists.
2. Sojobi O.A. Introduction to Statistics. Jedidiah Publishers.
3. Seymour L. Schaum's Outline Series: Theory and Problems of Probability (S.I. Metric
Edition). McGraw Hill Book Company, New York.
4. Gupta C.B. An Introduction to Statistical Methods. Vikas Publishing House, Delhi.
5. http://onlinestatbook.com/2/analysis_of_variance/intro.html

Should you require more explanation on this study session, please do not hesitate to
contact your e-tutor via the LMS.


Notes on Self-Assessment Questions (SAQs)

Notes on Self-Assessment Questions 1

1. Statistics is a scientific methodology used for the collection, presentation, analysis and
interpretation of data in order to reach valuable decisions and conclusions.
2.
a. Data is a series of observations or information that can be measured (quantitative)
or qualified (qualitative information)
b. Sample is a subset or fractional part of the population selected for the purpose of
making scientific statements about the population.
c. Variable is the characteristic that makes one factor different from another.
d. Attribute is the quality or characteristic of certain data or information
e. Descriptive statistics are brief descriptive coefficients that summarize a given data
set, which can be either a representation of the entire population or a sample of it
f. Discrete variable is a variable that does not accept decimal values in the counting process
g. Observation is a systematic method of collecting or approaching data
h. Population is the total number of items under investigation
i. Sample is a subset or fractional part of the population selected for the purpose of
making scientific statements about the population
j. Statistical method is a method of analyzing or representing statistical data
k. Continuous variable is a variable that accepts decimal values in the measuring process,
e.g. height, weight
3.
a. A questionnaire contains a sequence of questions relevant to the data or
information being sought
b. Open-ended and closed-ended
4.
a. Functions of Statistics

i. Presents facts in a definite form
ii. Simplifies unwieldy and complex masses of data
iii. Classifies numerical facts.
iv. Furnishes a technique of comparison.
v. Endeavors to interpret conditions
b. Limitations of statistics

i. Statistics deals with only those subjects of inquiry which are capable of being
quantitatively measured and numerically expressed.
ii. It deals only with aggregates of facts and no importance is attached to individual
items.
iii. Statistical results might be misleading if data collection is faulty
iv. Statistics can be used to establish wrong conclusions and, therefore, should be used
only by experts.

5. Statistical data can be either a variable or an attribute in nature. A variable can be
either discrete or continuous and is measurable, while an attribute is
non-measurable in nature
6. Primary data and secondary data. Primary data are those that the investigator
originates by himself while secondary data are obtained from outside sources.
Methods of data collection are:

i. Questionnaires
ii. Interview (which may be Telephone, Personal or Indirect)
iii. Experiment
iv. Observation
v. Group discussion
Notes on Self-Assessment Questions 2

1. Pie chart, histogram, bar chart, cumulative frequency



2. Bar chart has space between its bars while histogram does not
3. Single bar chart, Component bar chart and multiple bar chart
4. Histogram has rectangular bars while a frequency polygon has irregular straight lines
gotten by joining the midpoints of histogram
5. [Figure: chart of the number of women by marital status (single, married, separated,
divorced, widowed)]

6.
a. Single bar chart
[Figure: bar chart of "No of male" by session (90/91, 91/92, 92/93, 93/94); vertical axis 0 to 1200]
[Figure: bar chart of "No of female" by session (90/91, 91/92, 92/93, 93/94); vertical axis 920 to 1,010]

b. Component bar chart
[Figure: component bar chart of "No of male" and "No of female" by session (90/91, 91/92,
92/93, 93/94); vertical axis 0 to 2500]

c. Percentage component bar chart
[Figure: percentage component bar chart titled "enrolment of students from STA201";
vertical axis 0% to 100%]

d. Multiple bar chart
[Figure: multiple bar chart of "No of male" and "No of female" by session (90/91, 91/92,
92/93, 93/94); vertical axis 0 to 1200]

Notes on Self-Assessment Questions 3

1.
a. Class interval 10 – 14, 15 – 19, 20 – 24, 25 – 29, 30 – 34, 35 – 39, 40 – 44
b. Class boundary 9.5 – 14.5, 14.5 – 19.5, 19.5 – 24.5, 24.5 – 29.5, 29.5 –
34.5, 34.5 – 39.5, 39.5 – 44.5
c. 4
d.

Classes    No. of Object    Cumulative frequency
10 – 14    19               19
15 – 19    24               43
20 – 24    37               80
25 – 29    81               161
30 – 34    43               204
35 – 39    30               234
40 – 44    16               250

e.

Classes    No. of Object    Relative Frequency
10 – 14    19               0.076
15 – 19    24               0.096
20 – 24    37               0.148
25 – 29    81               0.324
30 – 34    43               0.172
35 – 39    30               0.12
40 – 44    16               0.064
Total      250              1

2.
a. [Figure: cumulative frequency curve; vertical axis 0 to 250, horizontal axis 1 to 6]

b. [Figure: histogram; vertical axis 0 to 90, classes 15 < 20, 20 < 25, 25 < 30, 30 < 35,
35 < 40, 40 < 45]

c. [Figure: frequency polygon; vertical axis 0 to 90, horizontal axis 0 to 7]

d. [Figure: bar chart; vertical axis 0 to 90, classes 15 < 20, 20 < 25, 25 < 30, 30 < 35,
35 < 40, 40 < 45]

3.
Class Interval    Frequency    Cumulative frequency    Relative Frequency
40 – 49           25           25                      0.25
50 – 59           32           57                      0.32
60 – 69           17           74                      0.17
70 – 79           11           85                      0.11
80 – 89           8            93                      0.08
90 – 99           6            99                      0.06

[Figure: chart of frequency by class interval (40 - 49 to 90 - 99); vertical axis 0 to 35]

[Figure: chart of cumulative frequency by class interval (40 - 49 to 90 - 99); vertical axis
0 to 120]

4.
a. A measure of central tendency is a summary statistic that represents the centre
point or typical value of a dataset. It includes the mean, median and mode
b. A measure of variability is the extent to which a distribution is stretched or squeezed,
e.g. range, variance, standard deviation, etc.
5. Mean, median and mode
6.
a. Mean = 25.13
b. Median = 23.89
c. 20 – 25
d. Variance = 37.23
e. Standard deviation = 6.1
f. 3.96

Notes on Self-Assessment Questions 4

1a 0.234 1b 0.979 1c 0.09

2a 0.5948 2b 0.008 2c 0.2898

3a Experiment: This refers to any process of observation or measurement whose outcome we
may not be able to predict.

Sample space: This refers to the collection of all possible outcomes of an experiment.

Event: This refers to any subset of a sample space.

Notes on Self-Assessment Questions 5

1i. A statistical hypothesis (or inference) is a statement about the parameters or
form of a population

1ii A type I error has been committed if we reject the null hypothesis when it
is true, and a type II error has been committed if we accept the null hypothesis when it
is false.

2i An estimator of an unknown parameter is simply a statistic that can
estimate the parameter unbiasedly; thus, x̄ is an estimator of the population parameter μ

2ii A point estimator is a procedure leading to a single numerical value for the
estimate of the unknown parameter

2iii An interval estimator is a random interval in which the true value of the
parameter lies with some probability; it is usually called a confidence interval

3 A random variable is a variable whose values depend on the outcomes of a
random phenomenon

4 The steps in testing a hypothesis are:

i. Identify the type of problem and the question to be answered
ii. State the null hypothesis (H0) and the appropriate alternative hypothesis (H1)
iii. Select the appropriate test to be utilized and calculate the test
criterion based on the type of test.
iv. Fix the level of significance α
v. Decide, on the basis of the test criterion value, whether to reject or accept the
hypothesis.
vi. Draw the conclusion (or inference) on the basis of the level of significance, that is,
decide whether the difference observed is due to chance or due to some other
known factors.

5 Rules for Expectation

Let X and Y be random variables, and let C be any real number.

i. E[C] = C (the expected value of any constant is that constant).
ii. E[CX] = CE[X] (constants can be factored out of expectations).
iii. E[X+Y] = E[X] + E[Y] (the expected value of a sum is equal to the sum of the
expected values).

Rules for Variance

i. Var C = 0
ii. Var CX = C² Var X
iii. If X and Y are independent, then Var (X + Y) = Var X + Var Y. Two variables
are independent if the value taken by one does not depend on the value assumed by the other.
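6

QUESTION 6

Mine 1    84  82  83  78  79
Mine 2    75  76  77  80  76

Mine 1 (X1)               Mine 2 (X2)
X1     (X1 - x̄1)²         X2     (X2 - x̄2)²
84     9                  75     4
82     1                  76     1
83     4                  77     0
78     9                  80     9
79     4                  76     1
406    27                 384    15

\[
\bar{x}_1 = \frac{\sum X_1}{n_1} = \frac{406}{5} = 81.2 \approx 81, \qquad
\bar{x}_2 = \frac{\sum X_2}{n_2} = \frac{384}{5} = 76.8 \approx 77
\]

\[
S_1^2 = \frac{\sum (X_1 - \bar{X}_1)^2}{n} = \frac{27}{5} = 5.4, \qquad
S_2^2 = \frac{\sum (X_2 - \bar{X}_2)^2}{n} = \frac{15}{5} = 3
\]

\[
t = \frac{\bar{X}_1 - \bar{X}_2}{\dfrac{S_1}{\sqrt{n_1}} + \dfrac{S_2}{\sqrt{n_2}}}
= \frac{81 - 77}{\dfrac{2.32}{\sqrt{5}} + \dfrac{1.73}{\sqrt{5}}} = \frac{4}{1.82} = 2.197
\]

\[
t_{tabulated} = t_{0.05}(n_1 + n_2 - 2) = t_{0.05}(8) = 1.860
\]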

Interpretation

Null hypothesis (H0): P = 40% = 0.4

Alternative hypothesis (H1): P ≠ 40% (P ≠ 0.4)

Hence: p = 0.40, q = 0.60

Observed sample proportion

\[
\hat{p} = \frac{180}{500} = 0.36
\]

Test statistic

\[
z = \frac{\hat{p} - p}{\sqrt{\dfrac{pq}{n}}} = \frac{0.36 - 0.40}{\sqrt{\dfrac{0.24}{500}}} = \frac{-0.04}{\sqrt{0.00048}} = \frac{-0.04}{0.0219} = -1.826
\]

As H1 is two-sided, we determine the rejection regions by applying a two-tailed test
at the 5% level of significance:

\[
Z_{5\%/2} = 1.96
\]

The observed value of Z is -1.826, which lies in the acceptance region (|Z| < 1.96), and so
H0 is accepted.

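Null hypothesis (H0): P1 = P2

Alternative hypothesis (H1): P1 ≠ P2

\[
\hat{P}_1 = \frac{450}{1000} = 0.45, \qquad \hat{q}_1 = 1 - \hat{P}_1 = 1 - 0.45 = 0.55, \qquad n_1 = 1000
\]

\[
\hat{P}_2 = \frac{400}{800} = 0.5, \qquad \hat{q}_2 = 1 - \hat{P}_2 = 1 - 0.5 = 0.5, \qquad n_2 = 800
\]

The test statistic

\[
Z = \frac{\hat{P}_1 - \hat{P}_2}{\sqrt{\dfrac{\hat{P}_1\hat{q}_1}{n_1} + \dfrac{\hat{P}_2\hat{q}_2}{n_2}}}
= \frac{0.45 - 0.50}{\sqrt{\dfrac{0.45(0.55)}{1000} + \dfrac{0.5(0.5)}{800}}}
= \frac{-0.05}{\sqrt{0.00025 + 0.00031}}
= \frac{-0.05}{\sqrt{0.00056}} = \frac{-0.05}{0.0237} = -2.11
\]

Z table (two-tailed) at 1% = 2.58

Z table (two-tailed) at 5% = 1.96

The observed value of Z is -2.11. It lies in the rejection region at the 5% level
(|Z| > 1.96) but in the acceptance region at the 1% level (|Z| < 2.58), so H0 is rejected at
the 5% level and accepted at the 1% level.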
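
The two proportion tests above can be recomputed with the Python sketch below. The function names are hypothetical, and the two-sample version uses the unpooled standard error, following the formula written in these notes.

```python
from math import sqrt

def one_proportion_z(successes, n, p0):
    """z statistic for H0: P = p0, using the hypothesised p0 in the standard error."""
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: P1 = P2 with an unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    standard_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / standard_error

z_one = one_proportion_z(180, 500, 0.40)
z_two = two_proportion_z(450, 1000, 400, 800)
print(round(z_one, 3), round(z_two, 3))  # -1.826 -2.113
```

Comparing the absolute value of each statistic with the two-tailed critical values (1.96 at 5%, 2.58 at 1%) gives the decision.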

Notes on Self-Assessment Questions 6

1a Chi-square (χ²) is a special significance test which is used in a very large number
of cases to test the accordance between fact and theory (or between observed values and
expected values).

1b

i. To test the discrepancies between observed and expected frequencies.
ii. To test the goodness of fit.
iii. To determine association between two or more attributes.

CHI-SQUARE

                        Performance
                 Poor    Fair    Good    Total
Below average    67      64      25      156
Average          42      76      56      174
Above average    10      23      37      70
Total            119     163     118     400

Null hypothesis (H0): There is no relationship between the intelligence of persons and
their subsequent performance in the banking hall.

Alternative hypothesis (H1): There is a relationship between the intelligence of
persons and their subsequent performance in the banking hall.

\[
\chi^2_{calculated} = \sum \frac{(O_i - E_i)^2}{E_i}, \qquad \text{where } E_{ij} = \frac{a_i b_j}{N}
\]

with a_i the total of row i, b_j the total of column j and N the grand total.

\[
E_{11} = \frac{156 \times 119}{400} = 46.41, \quad E_{12} = \frac{156 \times 163}{400} = 63.57, \quad E_{13} = \frac{156 \times 118}{400} = 46.02
\]

\[
E_{21} = \frac{174 \times 119}{400} = 51.77, \quad E_{22} = \frac{174 \times 163}{400} = 70.91, \quad E_{23} = \frac{174 \times 118}{400} = 51.33
\]

\[
E_{31} = \frac{70 \times 119}{400} = 20.83, \quad E_{32} = \frac{70 \times 163}{400} = 28.53, \quad E_{33} = \frac{70 \times 118}{400} = 20.65
\]

Observed      Expected      (Oi – Ei)    (Oi – Ei)²    (Oi – Ei)²/Ei
value (Oi)    value (Ei)
67            46.41         20.59        423.95        9.135
64            63.57         0.43         0.1849        0.003
25            46.02         -21.02       441.840       9.601
42            51.77         -9.77        95.45         1.844
76            70.91         5.09         25.908        0.365
56            51.33         4.67         21.809        0.425
10            20.83         -10.83       117.29        5.631
23            28.53         -5.53        30.58         1.072
37            20.65         16.35        267.32        12.945
Total                                                  41.021

\[
\chi^2_{calculated} = 41.021
\]

\[
\chi^2_{tabulated} = \chi^2_{0.01,\,(r-1)(c-1)} = \chi^2_{0.01,\,4} = 13.277
\]

Since the calculated value is greater than the tabulated value, the null hypothesis is
rejected: there is a relationship between the intelligence of persons and their subsequent
performance in the banking hall.
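
The chi-square statistic for the 3 × 3 table above can be recomputed in Python. This sketch uses unrounded expected frequencies, so its total may differ in the second decimal place from a hand calculation based on rounded values; the variable names are illustrative assumptions.

```python
# Observed frequencies: rows = below average, average, above average;
# columns = poor, fair, good performance
observed = [
    [67, 64, 25],
    [42, 76, 56],
    [10, 23, 37],
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total   # expected frequency
        chi_square += (o - e) ** 2 / e

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi_square, 2), degrees_of_freedom)  # 41.01 4
```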

Notes on Self-Assessment Questions 7

1
a. Regression Analysis is a statistical tool which helps to study the trend and pattern
of movement in one variable in response to changes in another variable on the
basis of an assumed relationship existing between them
b. Correlation analysis refers to the degree or extent of relationship or association
between two or more variables.
c. Model is any representation of reality; it can be a physical or graphical
representation

To find the regression line of X on Y

X = a + bY

where

\[
b = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2}, \qquad a = \bar{x} - b\bar{y}
\]

X      Y      XY        X²        Y²
66     69     4554      4356      4761
64     67     4288      4096      4489
68     69     4692      4624      4761
65     66     4290      4225      4356
69     70     4830      4761      4900
63     67     4221      3969      4489
71     69     4899      5041      4761
67     66     4422      4489      4356
69     72     4968      4761      5184
68     68     4624      4624      4624
70     69     4830      4900      4761
72     71     5112      5184      5041
812    823    55,730    55,030    56,483

\[
b = \frac{12(55,730) - 812(823)}{12(56,483) - (823)^2} = \frac{484}{467} = 1.0364
\]

a = 67.667 - 1.0364(68.583)

= 67.667 - 71.079

= -3.41

The regression equation of X on Y becomes

X = a + bY

X = -3.41 + 1.036Y

To find the regression line of Y on X

Y = a + bX, where

\[
b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}, \qquad a = \bar{Y} - b\bar{x}
\]

\[
b = \frac{12(55,730) - 812(823)}{12(55,030) - (812)^2} = \frac{484}{1016} = 0.4764
\]

a = 68.583 - 0.4764(67.667)

= 68.583 - 32.237

= 36.35

The regression equation of Y on X becomes

Y = a + bX

= 36.35 + 0.476X
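
Both regression lines can be checked from the raw data with the Python sketch below. The function name `regression_line` is a hypothetical choice; it applies the least-squares formulas for b and a given above, first with Y as the dependent variable and then with X.

```python
def regression_line(u, v):
    """Least-squares line v = a + b*u; returns (a, b)."""
    n = len(u)
    sum_u, sum_v = sum(u), sum(v)
    sum_uv = sum(p * q for p, q in zip(u, v))
    sum_u2 = sum(p * p for p in u)
    b = (n * sum_uv - sum_u * sum_v) / (n * sum_u2 - sum_u ** 2)
    a = sum_v / n - b * sum_u / n
    return a, b

# Data from SAQ 7, question 2 (million naira)
income = [66, 64, 68, 65, 69, 63, 71, 67, 69, 68, 70, 72]        # X
expenditure = [69, 67, 69, 66, 70, 67, 69, 66, 72, 68, 69, 71]   # Y

a_yx, b_yx = regression_line(income, expenditure)   # Y on X
a_xy, b_xy = regression_line(expenditure, income)   # X on Y
print(round(a_yx, 2), round(b_yx, 3))   # 36.35 0.476
print(round(a_xy, 2), round(b_xy, 3))   # -3.41 1.036
```

Both slopes are positive, which agrees with the positive rank correlation found for the same figures in Case Study 7.8.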

Notes on Self-Assessment Questions 8

Null Hypothesis: The flavours of the new Maggi are equally accepted

Alternative Hypothesis: The flavours of the new Maggi are not equally accepted.

                Breed
Feed      1      2      3      4
A         33     28     29     31
B         37     34     34     34
C         35     33     37     30
TOTAL     105    95     100    95     395

\[
SS_{TOTAL} = \sum y_{ij}^2 - \frac{T_{..}^2}{N}
\]

\[
= 33^2 + 28^2 + 29^2 + 31^2 + \ldots + 30^2 - \frac{395^2}{12}
\]

= 13095 - 13002.083

= 92.917

\[
SS_{TREATMENT\,(Flavours)} = \frac{\sum y_{.j}^2}{k} - \frac{T_{..}^2}{N}
\]

\[
= \frac{105^2 + 95^2 + 100^2 + 95^2}{3} - \frac{395^2}{12}
\]

= 13,025 - 13002.083

= 22.917

SS_ERROR = SS_TOTAL - SS_TREATMENT

= 92.917 - 22.917

= 70.000

ANOVA TABLE

SOURCE OF      DEGREES OF    SUM OF          MEAN SUM OF      F CALCULATED
VARIATION      FREEDOM       SQUARES (SS)    SQUARES (MSS)
Treatment      3             22.917          7.639            0.873
Error          8             70.000          8.750
Total          11            92.917

F_CALCULATED = 7.639/8.750 = 0.873

F_TABULATED = F3,8 (0.05) = 4.07

Since F_CALCULATED is less than F_TABULATED, the null hypothesis is not rejected: the
flavours are equally accepted.
