285 Notes


CHAPTER 1:
INTRODUCTION TO DATA PRESENTATION

Definition: Statistics is the science of collecting, organizing, presenting, analyzing and interpreting data in
order to make informed decisions.
Types of data:
1. Qualitative: categorical (non-numerical) data, e.g. religion, class, name of school.
2. Quantitative: numerical data, i.e. data expressed in figures.
This is further divided into two types:
a) Discrete data: it can take only certain distinct values in a given range, e.g. the number of
children in your family. It is usually the result of counting. Falls under ungrouped data.
b) Continuous data: it can take any value in a given range, e.g. height, age. It is usually the
result of measurement. Falls under grouped data.

Exercise
State, with reasons, what form of data each of the following is:
1. Number of hours of operation
2. Time
3. Volume of water in a tank
4. Growth rate of trees
5. Wages earned in a week
6. Age in years
7. Height in cm
8. Cost of meals

Data collection
Raw data is collected using any form of instrument eg questionnaires, interviews etc. Once
collected it is either presented in the form of graphs or analyzed numerically. The data is either
grouped or ungrouped, that is, continuous or discrete respectively. The nature of the data
informs a statistician what graph to use.

Frequency distributions

Ungrouped data
Example 1

1. For the following data, construct
a) a frequency distribution
b) a cumulative frequency distribution
c) a relative frequency distribution

9 8 5 2 10 8 0 3 8 2
4 1 4 2 4 2 8 4 5 5
8 9 6 3 4 10 7 2 1 3
9 7 5 4 8 7 10 9 9 4

Copy and complete the table below using the data above.

Number of   Tally of number   Frequency   Cumulative   Relative
marks       of students                   frequency    frequency
0           I                 1           1            1/40 = 0.025
1           II                2           3            2/40 = 0.05
2           IIII              5           8            5/40 = 0.125
3
4
5
6
7
8
9
10                                        40
TOTAL                         40                       1
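
The tallying in Example 1 can also be checked by computer. The following is a minimal Python sketch, not part of the original notes; the function name frequency_table is an illustrative choice. It counts each mark, keeps a running total for the cumulative frequency and expresses each frequency as a fraction of the 40 observations.

```python
from collections import Counter
from fractions import Fraction

# The 40 marks from Example 1.
marks = [
    9, 8, 5, 2, 10, 8, 0, 3, 8, 2,
    4, 1, 4, 2, 4, 2, 8, 4, 5, 5,
    8, 9, 6, 3, 4, 10, 7, 2, 1, 3,
    9, 7, 5, 4, 8, 7, 10, 9, 9, 4,
]


def frequency_table(data):
    """Return rows of (value, frequency, cumulative frequency, relative frequency)."""
    counts = Counter(data)
    n = len(data)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append((value, counts[value], cumulative, Fraction(counts[value], n)))
    return rows


table = frequency_table(marks)
for value, freq, cum, rel in table:
    # Relative frequency is printed as a decimal, as in the table above.
    print(f"{value:>5} {freq:>9} {cum:>10} {float(rel):>9.3f}")
```

The relative frequencies are kept as exact fractions so that they sum to exactly 1, which mirrors the TOTAL row of the table.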

GROUPED DATA

Using the data in Example 1 above, copy and complete the table below for grouped data with a class size of 2.

Number of   Tally of number   Frequency   Cumulative   Relative
marks       of students                   frequency    frequency
0-1         III               3           3            3/40 = 0.075
2-3         IIII III          8           11           8/40 = 0.2
4-5
6-7
8-9
10-11                                     40
TOTAL                         40                       1

Exercise
1. Construct a frequency distribution including the relative and cumulative frequency
tables for the discrete data below:
a) 1 2 1 1 2 0 5 4 4 0 6 2
3 0 4 3 3 4 5 2 2 3 0 1
4 4 5 6 1 5 3 2 1 4 5

b) 2 2 1 3 3 1 2 3 2 4 1 3
3 1 2 1 2 3 4 3 2 3 1 3 4

c) 32 33 29 34 30 30 31 28 30 29
31 32 29 31 29 33 33 29 32 28
32 34 33 33 29 34 30 32 33 30

2. Construct a frequency distribution, cumulative and a relative frequency table

i. By taking the first class as 0-19:


62 51 37 70 42 12 93 53 45 70 72 03
91 27 21 58 57 22 68 08 56 87 76 17
59 70 51 41 55 36 56 73 57 50 19 29
61 67 35 69 69 76 57 79 46 60 39 80
58 46 77 59 75 77 14 54 63 13 30 44

ii. By taking the first class as 0-4:


25 10 17 15 20 18 15 35 30 10
16 19 20 11 10 10 05 07 25 10
04 15 05 14 20 10 06 13 08 22
09 12 15 26 08 03 34 23 09 21
iii. By grouping data into a class size of 10 starting with class 15-24
20 25 20 51 32 37 18 18 40 37 45 20
24 66 35 38 56 22 20 61 35 24 16 18
54 27 57 45 24 65 43 16 20 45 42 46

22 72 57 29 55 16 25 31 53 22 28 56

iv. By grouping data into a class size of 2:


87 80 85 83 86 86 90 91 92 90 88
82 81 84 83 84 85 83 88 81 85 88
87 82 85 82 87 85 81 86 85 86 87

Graphs
Stem and leaf
Box and whisker
Bar graph
Pie chart
Pictogram
Histogram
Cumulative frequency curve
Frequency polygons

Read up on each of these graphs and hand in a summary of each, being careful to give at least
one relevant worked example in each case.

Measures of central tendency



MEAN
Ungrouped data

Mean: \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} or \bar{X} = \frac{\sum xf}{\sum f}
That is, the sum of all observations divided by the total number of observations.

MODE:
The value that appears most often, that is, the value with the highest frequency.

Worked examples
1) The following were the quiz scores of one student in a Statistics class at the end of the
course. Find the mean score and comment:
5; 10; 15; 20; 25
Solution: Mean = \frac{5+10+15+20+25}{5} = \frac{75}{5} = 15
Interpretation: The average score for the quizzes was 15.

2) Peter scored the following number of goals in eleven matches. Write down the modal
score and comment:
1 0 0 2 2 0 1 2 3 1 2
Solution: the modal score is 2.
Interpretation: Peter scored 2 goals more often than any other number of goals.

3) 100 members of an orchestra were asked how many instruments each could play. The
results were shown on the following table.
I. Calculate the mean number of instruments played and comment.
II. Find the mode and comment.

No. of instruments x   1    2    3    4   5
Frequency f            41   33   18   6   2

Solution:
Copy and complete the table below
Number of instruments x   Frequency f   xf
1                         41            41
2                         33            66
3                         18
4                         6
5                         2
Total                     100           195

Therefore mean = \frac{195}{100} = 1.95
Interpretation: The average number of instruments played is 1.95.
Mode = 1, since x = 1 has the highest frequency (41).
Interpretation: More members play 1 instrument than any other number of instruments.
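
As a cross-check of the orchestra example, here is a short Python sketch (an illustration added to these notes; the function names are hypothetical). It computes the mean as the sum of xf divided by the sum of f, and takes the mode as the value of x with the highest frequency, which for this table is x = 1 with f = 41.

```python
# Frequency table from the orchestra example: number of instruments x -> frequency f.
instruments = {1: 41, 2: 33, 3: 18, 4: 6, 5: 2}


def mean_from_table(freq):
    """Mean of a frequency table: sum of x*f divided by sum of f."""
    total_f = sum(freq.values())
    total_xf = sum(x * f for x, f in freq.items())
    return total_xf / total_f


def mode_from_table(freq):
    """Mode of a frequency table: the x value with the largest frequency."""
    return max(freq, key=freq.get)


mean_instruments = mean_from_table(instruments)
mode_instruments = mode_from_table(instruments)
print(mean_instruments, mode_instruments)
```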

Grouped data:
Worked examples
The summary of the test scores [out of 100] in one Statistics course with 40 students is as
shown below:

Mark        31-35   36-40   41-45   46-50   51-55   56-60
Frequency   4       6       10      13      5       2
Find the mean and mode. Comment on your findings.

Solution:
Copy and complete the table below
Mark    Midpoint x        Frequency f   xf
31-35   (31+35)/2 = 33    4             132
36-40   (36+40)/2 = 38    6             228
41-45                     10
46-50                     13
51-55                     5
56-60                     2
Total                     40            1795

Therefore mean = \frac{1795}{40} = 44.9 (to 1 decimal place)
Interpretation: On average students scored 44.9.
Modal class = 46-50.
The mode is estimated by the midpoint of the modal class: (46+50)/2 = 48
Interpretation (modal class): More students scored between 46 and 50 than in any other class.
Interpretation (mode): The most common score is estimated to be about 48.
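
The grouped calculation follows the same pattern once each class is replaced by its midpoint. The sketch below is an added illustration in Python, using the class limits and frequencies from the test-score example; the variable names are assumptions made for the example.

```python
# Class limits and frequencies from the test-score example.
classes = [(31, 35), (36, 40), (41, 45), (46, 50), (51, 55), (56, 60)]
freqs = [4, 6, 10, 13, 5, 2]

# Each class is represented by its midpoint.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

total_f = sum(freqs)
total_xf = sum(x * f for x, f in zip(midpoints, freqs))
grouped_mean = total_xf / total_f

# The modal class is the class with the largest frequency.
modal_class = classes[freqs.index(max(freqs))]

print(total_xf, grouped_mean, modal_class)
```

Because the midpoints stand in for the unknown individual scores, the result is an estimate of the mean rather than the exact mean of the 40 marks.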

MEDIAN

This is the middle value of an ordered set of data. Before finding the median it is necessary to
arrange the data in ascending order of magnitude.

Worked examples

Ungrouped data
For each of the above examples on ungrouped data, find the median and comment.
Solution: (quiz scores example)
5; 10; 15; 20; 25
The data is already arranged in order. Therefore the median is 15, the 3rd of the 5 values.
Interpretation: In at least half of the quizzes the student scored 15 or less.

Solution: (Peter example) Rearranging first

0 0 0 1 1 1 2 2 2 2 3
There are 11 values, so the median is the 6th value. Therefore the median is 1.
Interpretation: In at least half of the matches Peter scored 1 goal or fewer.

Solution: (instruments example)

The data is already arranged in order from 1 up to 5. Since there were 100 musicians, the median lies
between the 50th and 51st musicians. The cumulative frequencies are 41 and 74 for x = 1 and x = 2, so the 50th
value is a 2 and the 51st is also a 2. Thus the average of the two is (2+2)/2 = 2.
Therefore the median is 2.
Interpretation: At least half of the musicians play 2 instruments or fewer.

Grouped data

Median = L_m + \left( \frac{\frac{1}{2}n - cf_b}{f_m} \right) c, where

L_m is the lower class boundary of the median class.
cf_b is the cumulative frequency before the median class.
f_m is the frequency of the median class.
c is the class width of the median class.
n is the sum of the frequencies, \sum f.
NB This is also called the method of interpolation.

1. For the grouped data below estimate the median using the method of interpolation.
Class       5-9   10-14   15-19   20-24   25-30
Frequency   6     7       9       8       4

Solution:
Rewriting the table using class boundaries and identifying the median position

Classes       Frequency   Cumulative frequency
4.5 - 9.5     6           6
9.5 - 14.5    7           13
14.5 - 19.5   9           22
19.5 - 24.5   8           30
24.5 - 30.5   4           34

There are 34 observations, so the median lies between the 17th and 18th positions. The first and
second classes together reach the 13th position. The third class reaches the 22nd position, so it contains
the 17th and the 18th positions.
Thus the median class is 14.5-19.5, with L_m = 14.5, cf_b = 13, f_m = 9 and c = 5.
Median = 14.5 + \left( \frac{\frac{1}{2}(34) - 13}{9} \right) \times 5
= 16.72
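
The interpolation steps above can be sketched in Python as follows. This is an added illustration, not the notes' own method statement; the function name grouped_median and its argument layout are assumptions. It walks through the classes until the cumulative frequency reaches n/2 and then applies the formula Median = L_m + ((n/2 - cf_b)/f_m) c.

```python
def grouped_median(boundaries, freqs):
    """Estimate the median of grouped data by linear interpolation.

    boundaries: list of (lower, upper) class boundaries in ascending order.
    freqs: list of class frequencies, in the same order.
    """
    n = sum(freqs)
    half = n / 2
    cumulative_before = 0  # cf_b: cumulative frequency before the current class
    for (lower, upper), f in zip(boundaries, freqs):
        if cumulative_before + f >= half:
            width = upper - lower  # c: class width
            return lower + (half - cumulative_before) / f * width
        cumulative_before += f
    raise ValueError("frequencies must sum to a positive total")


# Class boundaries and frequencies from the worked example.
boundaries = [(4.5, 9.5), (9.5, 14.5), (14.5, 19.5), (19.5, 24.5), (24.5, 30.5)]
freqs = [6, 7, 9, 8, 4]

median_estimate = grouped_median(boundaries, freqs)
print(round(median_estimate, 2))
```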

Exercise
1. Find the median for the following data
a.
X 0 1 2 3 4 5
Y 2 5 9 4 2 1
b.
X 1 2 3 4 5 6
Y 3 9 12 11 8 7
c.
Weight 53-56 57-60 61-64 65-68 69-72 73-76 77-80
frequency 2 13 4 11 9 6 5
d.
time 10-19 20-24 25-29 30-39 40-49 50-64 65-89
frequency 10 20 25 30 24 12 10

2. Write down at least 2 advantages and 2 disadvantages of mean, mode and median.
3. When should we use what measure? Consider skewness.

VARIANCE & STANDARD DEVIATION

For ungrouped data

\sigma^2 = \frac{\sum x^2}{n} - \bar{x}^2

Worked example
Given 150 200 180 160 170 calculate the standard deviation
x      x^2
150
200
180
160
170
\sum x = 860   \sum x^2 = 149400

\sigma = \sqrt{\frac{149400}{5} - \left( \frac{860}{5} \right)^2}
= 17.2047
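
The arithmetic in this worked example can be reproduced with a few lines of Python. This is a sketch added for checking purposes; it uses the same formula (sum of squares over n, minus the square of the mean) and compares the result with the standard library's population standard deviation.

```python
import math
import statistics

values = [150, 200, 180, 160, 170]

n = len(values)
sum_x = sum(values)
sum_x2 = sum(x * x for x in values)

# Variance as the mean of the squares minus the square of the mean.
variance = sum_x2 / n - (sum_x / n) ** 2
sd = math.sqrt(variance)

print(sum_x, sum_x2, variance, round(sd, 4))

# statistics.pstdev treats the data as a whole population (divides by n).
print(round(statistics.pstdev(values), 4))
```

Note that this is the population form (dividing by n), which is the form used in these notes.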

For grouped data

\sigma = \sqrt{\frac{\sum f x^2}{\sum f} - \bar{x}^2}, where \bar{x} = \frac{\sum fx}{\sum f}
Worked example
Height      145-155   155-165   165-175   175-185   185-195
Frequency   3         9         21        13        4
Solution:
Copy and complete

Height    Midpoint x          x^2      f    fx    fx^2
145-155   (145+155)/2 = 150   22 500   3    450   67 500
155-165   160                          9
165-175                                21
175-185                                13
185-195                       36 100   4
TOTAL                                  50         1 470 400

Variance would be:

\sigma^2 = \frac{1470400}{50} - \left( \frac{\sum fx}{50} \right)^2
= (complete the table to find \sum fx, then evaluate)

Exercise
1. Calculate the standard deviation in each case a) - d) in the previous exercise.
2. Give the importance of the standard deviation/variance.

Other measures of variation


a. Variance is simply the square of the standard deviation (s^2). Later look at ANOVA.
b. Coefficient of variation is defined as \frac{100s}{\bar{x}} and it is expressed as a percentage. It is used to
compare the variability in two sets of data where there is an obvious difference in magnitude in
both the means and standard deviations, e.g. heights of boys aged 5 and 15. Suppose the means
are 100 and 150 and the standard deviations are 6 and 9 respectively; then both sets have a coefficient of variation
of 6%.
c. Range is the largest value minus the lowest value. It is commonly used because it is easy to calculate;
however, it is not reliable except on special occasions because only two of the sample values (the largest
and the smallest) are used to calculate it.

CHAPTER 2:

DISCRETE DISTRIBUTIONS

Suppose X has the following properties:

a. It is a discrete variable
b. It can only assume certain values x1, x2, x3, ..., xn
c. The probabilities associated with these values are p1, p2, p3, ..., pn where
P(X=x1) = p1
P(X=x2) = p2
....
P(X=xn) = pn
Then X is a discrete random variable if p1 + p2 + p3 + ... + pn = 1
Symbolically \sum_{\text{all } x} P(X=x) = 1

A discrete random variable X has the following probability distribution:

x        1     2     3     4
P(X=x)   1/3   1/4   1/3   a

This table is called the probability distribution of X.

The function or formula which is responsible for allocating probabilities is known as the
probability density function (p.d.f) of X. Here a is a constant.

Worked example
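From the table above, find

1. The value of a
\frac{1}{3} + \frac{1}{4} + \frac{1}{3} + a = 1
a = \frac{1}{12}

2. P(X=2)
= \frac{1}{4}
3. P(X≤2)
= \frac{1}{3} + \frac{1}{4}
= \frac{7}{12}
4. P(X>2)
= P(X=3) + P(X=4)
= \frac{1}{3} + \frac{1}{12}
= \frac{5}{12}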
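
These probabilities can be verified with exact fractions. The following Python sketch is an added illustration (the helper name prob is hypothetical): it finds a from the requirement that the probabilities sum to 1, then adds up the probabilities of the values that satisfy each condition. Note that P(X>2) = P(X=3) + P(X=4) = 1 - P(X≤2).

```python
from fractions import Fraction

# Known probabilities from the table; P(X=4) = a is unknown.
known = {1: Fraction(1, 3), 2: Fraction(1, 4), 3: Fraction(1, 3)}

# All probabilities must sum to 1, which fixes a.
a = 1 - sum(known.values())

dist = dict(known)
dist[4] = a


def prob(distribution, condition):
    """Sum the probabilities of all values x for which condition(x) is true."""
    return sum(p for x, p in distribution.items() if condition(x))


p_le_2 = prob(dist, lambda x: x <= 2)
p_gt_2 = prob(dist, lambda x: x > 2)
print(a, p_le_2, p_gt_2)
```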

Exercise
1. Given that
x        0     1     2     3
P(X=x)   0.1   0.2   0.3   b
Find
a) b
b) P(X>1)
2. Given that
y        -2    -1     0   1     2
P(Y=y)   1/6   1/12   x   1/3   1/12

Find
a) x
b) P(Y≥0)

3. Given that
x        0     1   2
P(X=x)   1/4   a   b
Find
i. a and b given that 2a+b=1 and
ii. P(X≤1)

4. Given that
x        1     3     5     -y
P(X=x)   3/8   3/8   1/8   1/8
Find y if \sum_{\text{all } x} x P(X=x) = 0

Expectation E(x)
This is the mean of the distribution.
Given by E(x) = \sum_{\text{all } x} x P(X=x)

A random variable X has a p.d.f defined as shown. Find E(x)

x        0     1     2     3     4
P(X=x)   0.1   0.2   0.3   0.3   0.1

E(x) = 0×0.1 + 1×0.2 + 2×0.3 + 3×0.3 + 4×0.1
= 2.1

Variance Var(x)

Given by Var(x) = E(x^2) - [E(x)]^2

Find Var(x)
x        0     1     2     3
P(X=x)   0.1   0.2   0.3   0.4

First find E(x) and E(x^2)

E(x) = 0×0.1 + 1×0.2 + 2×0.3 + 3×0.4
= 2
E(x^2) = 0^2×0.1 + 1^2×0.2 + 2^2×0.3 + 3^2×0.4
= 5
Var(x) = E(x^2) - [E(x)]^2
Var(x) = 5 - 2^2
= 1
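
The expectation and variance formulas translate directly into code. Below is a small Python sketch added as an illustration (the function names are assumptions); it applies E(x) = sum of x P(X=x) and Var(x) = E(x^2) - [E(x)]^2 to the two tables used in the examples above.

```python
import math


def expectation(dist):
    """E(X): sum of x * P(X=x) over all values x."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Var(X) = E(X^2) - [E(X)]^2."""
    mean = expectation(dist)
    mean_of_squares = sum(x * x * p for x, p in dist.items())
    return mean_of_squares - mean ** 2


# Table from the expectation example.
dist_a = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.3, 4: 0.1}
# Table from the variance example.
dist_b = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

print(round(expectation(dist_a), 4))
print(round(expectation(dist_b), 4), round(variance(dist_b), 4))

# The standard deviation is the square root of the variance.
sd_b = math.sqrt(variance(dist_b))
```

Results are rounded when printed because decimal probabilities such as 0.1 are not stored exactly in floating point.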

Worked Example
The discrete r.v X is given by P(X=x) = kx for x = 1, 2, 3, 4, where k is a constant. Find k and E(X).
x        1   2    3    4
P(X=x)   k   2k   3k   4k

Since the probabilities must sum to 1,
k + 2k + 3k + 4k = 1
10k = 1
k = \frac{1}{10}
E(X) = 1×\frac{1}{10} + 2×\frac{2}{10} + 3×\frac{3}{10} + 4×\frac{4}{10}
E(X) = 3

Exercise
2. Birds of a particular species lay 0, 1, 2, 3 eggs in their nests with probabilities as shown
in the following table
No. of eggs 0 1 2 3
Probability 0.2 0.1 0.35 K
5
Find
a) The value of k.
b) The expected number of eggs laid in a nest.
c) The standard deviation of the number of eggs laid in a nest.

3. A wholesaler sells apples in boxes of hundred. The probability that there are x bad
apples in a box is given in the following table

Value of x 0 1 2 3 4 >4
P(X=x) 8k 5k 3k 2 2k 0
k
a) Calculate the value of k
b) Var (x)
4. A discrete random variable R takes integer values from 0 to 4 with probabilities
given as shown below. Find the expectation and variance of R.
P(R=r) = \begin{cases} \frac{r+1}{10} & r = 0, 1, 2 \\ \frac{9-2r}{10} & r = 3, 4 \end{cases}

BINOMIAL DISTRIBUTION
A Binomial distribution models an experiment made up of repeated trials, each with two possible outcomes: a success and a
failure.
Modeling a binomial
1. A fixed number n of independent trials.
2. Each trial results in either a success or a failure.
3. The probability of success p is constant for each trial.

If these conditions hold and X is the number of successes, then X follows a Binomial distribution with parameters n and p.

If X ~ Bin(n, p), the probability of obtaining r successes in n trials is P(X=r), where
P(X=r) = \binom{n}{r} q^{n-r} p^r and q = 1 - p

EXPECTATION (MEAN) E(X) and VARIANCE Var(X)

E(X) = np      Var(X) = npq

Example
At OK supermarket 60% of customers pay on credit cards. Find the probability that in a
randomly selected sample of ten customers
a. Exactly 2 pay by credit cards
b. More than seven pay by credit cards

Solution
Let X be the number of customers who pay by credit card.
p = 0.6   q = 0.4   n = 10
hence X ~ Bin(10, 0.6)
a. P(X=2)
using the formula:
\binom{10}{2} (0.4)^{10-2} (0.6)^2 = 0.011
b. P(X>7)
\binom{10}{8} (0.4)^{10-8} (0.6)^8 + \binom{10}{9} (0.4)^{10-9} (0.6)^9 + \binom{10}{10} (0.4)^{10-10} (0.6)^{10} = 0.17
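
The supermarket example can be checked numerically. The sketch below is an added Python illustration (binomial_pmf is a hypothetical helper name) that evaluates the binomial formula with math.comb and also reports the mean np and variance npq for n = 10, p = 0.6.

```python
from math import comb


def binomial_pmf(r, n, p):
    """P(X = r) for X ~ Bin(n, p): C(n, r) * q^(n-r) * p^r with q = 1 - p."""
    q = 1 - p
    return comb(n, r) * q ** (n - r) * p ** r


n, p = 10, 0.6

# a. Exactly 2 customers pay by credit card.
p_exactly_2 = binomial_pmf(2, n, p)

# b. More than 7, i.e. 8, 9 or 10 customers pay by credit card.
p_more_than_7 = sum(binomial_pmf(r, n, p) for r in range(8, n + 1))

mean = n * p                 # E(X) = np
variance = n * p * (1 - p)   # Var(X) = npq

print(round(p_exactly_2, 3), round(p_more_than_7, 2), mean, round(variance, 2))
```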

Exercise
1. Of the articles produced by a particular machine, 15% are defective. If a sample of 20
articles is taken find the expected number of defective articles and the standard
deviation.(n=20, p=0.15).
2. A very lazy candidate has done no revision for his multiple choice stats exam and
guesses the answer to each of the 40 questions. Each question offers 4
alternative answers, only one of which is correct. Determine the expectation and
variance of the number of correct answers. (n = 40, p = \frac{1}{4})

3. Each day a bakery delivers the same number of loaves to a certain shop which sells, on
average 98% of them. Assuming that the number of loaves sold per day has a binomial
distribution with a standard deviation of 7 find the number of loaves the shop would
expect to sell per day. [2450]

4. On a particular farm 60% of all eggs laid by hens are classified ‘large’. Everyday the
farmer selects the same number of eggs at random and sends them as a batch to the
market. Given that the standard deviation of the number of large eggs in a batch is 12,
find
a) The number of eggs in a batch
b) The mean number of large eggs in a batch [600;360]

5. The probability of a component being defective is p. A pack of these components
contains n such components. The number of defective components per pack has a mean
of 3 and a standard deviation of 1.5, find
a) The values of n and p
b) The probability that a pack has no defectives [12;0.25;0.032]

POISSON DISTRIBUTION
Refers to counts of events that occur at random points in time or space.

A discrete random variable X having a p.d.f of the form P(X=x) = \frac{e^{-\lambda} \lambda^x}{x!} for x = 0, 1, 2, ...,
where λ can take any positive value, is said to follow the Poisson distribution.

Modeling a Poisson distribution

1. the events are counted over an interval of fixed length
in space or time
2. the events must be independent
3. the events occur singly in continuous space or time
4. the events occur at a constant rate

Example
Suppose X follows a Poisson distribution with λ = 3 being the mean number of occurrences in a given interval. Find
a) P(X=2)
b) P(X≥3)

Solution
a. P(X=x) = \frac{e^{-\lambda} \lambda^x}{x!}
P(X=2) = \frac{e^{-3} 3^2}{2!}
= 0.22 (to 2 s.f.)
b. P(X≥3)
= 1 - P(X≤2) = 1 - \left( \frac{e^{-3} 3^2}{2!} + \frac{e^{-3} 3^1}{1!} + \frac{e^{-3} 3^0}{0!} \right)
= 0.58 (to 2 s.f.)
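
A quick numerical check of this example, written as an added Python sketch (poisson_pmf is an illustrative name). Part b uses the complement: P(X≥3) = 1 - [P(X=0) + P(X=1) + P(X=2)], with the three probabilities added inside the bracket.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Po(lam): e^(-lam) * lam^x / x!."""
    return exp(-lam) * lam ** x / factorial(x)


lam = 3

# a) P(X = 2)
p_2 = poisson_pmf(2, lam)

# b) P(X >= 3) = 1 - P(X <= 2)
p_at_least_3 = 1 - sum(poisson_pmf(x, lam) for x in range(3))

print(round(p_2, 2), round(p_at_least_3, 2))
```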

Exercise
1. The number of road accidents per day on a particular stretch of road follows a
Poisson distribution with mean 2.5. Find the probability that on a particular day
there are
i. No accidents [0.082]
ii. Exactly two accidents [0.26]

2. The number of flaws per meter of a particular cloth follows a Poisson distribution
with mean 1.6, find the probability that in
i. 1 m of cloth there are two flaws
ii. 2m of cloth there are exactly 3 flaws
iii. 4m of cloth there are less than 3 flaws [0.26; 0.22; 0.046]

3. In a book containing 200 pages there are 300 misprints


i. Find the mean number of misprints
ii. Find the probability that in a particular page chosen at random there are
2 misprints [1.5; 0.25]

4. In a particular industry there are on average two fatal accidents per year. Find
the probability that
i. The industry is free from fatal accidents
ii. The industry has three fatal accidents in a year [0.14; 0.18]

5. A shop sells a particular product at a rate of 4 per week on average. The number
sold in a week has a Poisson distribution. Find the probability that the shop sells
at least 2 in a week. [0.91]

EXPECTATION AND VARIANCE OF A POISSON DISTRIBUTION


E(X)=λ and
Var (X) = λ

POISSON APPROXIMATION TO BINOMIAL

A binomial distribution Bin(n, p) can be approximated by a Poisson distribution with λ = np if n is large (> 50, say) and p is small (< 0.1, say), or np < 5.

Example
In a certain manufacturing process the proportion of defective articles being produced is 2%. In
a batch of 200 articles find the probability that there are exactly 4 defectives.
Solution
Let X be the r.v. 'number of defective articles'.
Then X follows a Binomial(200, 0.02) distribution.
Since n is large and p is small, take λ = np = 200 × 0.02 = 4 (< 5).

Therefore X ~ Po(4) approximately.

P(X=4) = \frac{e^{-4} 4^4}{4!}

= 0.20
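
To see how close the approximation is in this example, the exact binomial probability can be computed alongside the Poisson value. The following Python sketch is an added illustration; the helper names are hypothetical.

```python
from math import comb, exp, factorial


def binomial_pmf(r, n, p):
    """Exact P(X = r) for X ~ Bin(n, p)."""
    return comb(n, r) * (1 - p) ** (n - r) * p ** r


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Po(lam)."""
    return exp(-lam) * lam ** x / factorial(x)


n, p = 200, 0.02
lam = n * p  # lambda = np = 4

exact = binomial_pmf(4, n, p)
approx = poisson_pmf(4, lam)

print(round(exact, 4), round(approx, 4))
```

For these values of n and p the two results agree to two decimal places, which is why the Poisson form is an acceptable shortcut here.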

Exercise
1. The probability that a wrapped chocolate biscuit is double wrapped is 0.01. Use a suitable
approximation to find the probability that, of the next 60 biscuits unwrapped,
i. None are double wrapped
ii. At least 2 are double wrapped [0.55;0.12]

2. An insurance salesman sells policies to five men, all of the same age and in good health. The
probability that a man of this age will be alive in 30 years is \frac{2}{3}. Show that the probability of at
least 3 men surviving the 30 years is 0.79 to 2 decimal places.

3. The probability that an individual suffers a bad reaction from an infection of a given serum is
0.001. Use the Poisson distribution to determine the probability that out of 2000 individuals
more than two will suffer a bad reaction.

4. A factory packs bulbs in boxes of 400. The probability that a bulb is defective is 0.01. Find
the probability that a box contains more than 3 defective bulbs. [0.57]

CHAPTER 3:
REGRESSION AND CORRELATION

So far the data dealt with has concerned only one variable. This does not mean that we cannot deal
with two or more variables at the same time. Data involving two variables is known as bivariate
data; data involving more than two variables is called multivariate data. Common examples include marks of students
in various subjects and rainfall patterns in different regions over the years. It should be noted that
the data must always be paired, with each variable measured in the same unit throughout. For instance the
marks in the various subjects must correspond to the same student.

Why “regression”

In regression, we are generally interested in finding out the nature and strength of a
relationship between the two or more variables under study, that is, Is there an association
between the values? What is the general trend of one variable as you increase the other?

Scatter Diagram

This is a simple and easy-to-draw graph used to represent bivariate data. One variable supplies the X
values and the other the Y values, plotted on a Cartesian plane. The graph gives an idea of the relationship
between the X and Y variables. Having obtained our scatter diagram we can go on to find the regression line and
the correlation of the data. We generally obtain three possible graphs (positive, negative
and no correlation)

Positive: high values of X are associated with high values of Y

Negative: high values of X are associated with low values of Y

Horizontal: loosely speaking, no correlation. It is difficult to determine the relationship until we
calculate the product moment correlation coefficient r.

The greatest disadvantage of using the scatter diagram is that though it gives the nature, it does
not tell us the strength of the relationship.

Example 1

A B C D E F G H I J
12 18 25 20 16 28 15 12 23 17
15 22 32 13 17 13 10 16 15 30

a) Draw a scatter diagram of the data given. [Plot the points.]

From the diagram we can observe that it produces a positive relationship.

Exercise

1) The table below shows the volume of sales and total expenses for ten companies. Plot the
figures on a scatter diagram and comment

X 20 2 4 2 1 8 1 8
3 0 3
Y 60 2 2 6 4 1 4 3
5 6 6 1 8 0 3

REGRESSION LINE Y ON X / X ON Y

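The equation of the regression line of y on x is given by

y - \bar{y} = m(x - \bar{x}) where m = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}

For the regression line of x on y, interchange the roles of x and y.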

Example 1

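For a given set of data

n = 10, \sum x = 406, \sum y = 130, \sum x^2 = 18656, \sum y^2 = 2126 and \sum xy = 6094

Calculate the regression line of y on x

\bar{y} = 13, \bar{x} = 40.6, m = \frac{10 \times 6094 - 406 \times 130}{10 \times 18656 - 406^2} = 0.38

Substituting into y - \bar{y} = m(x - \bar{x}) gives y - 13 = 0.38(x - 40.6), that is,

y = 0.38x - 2.4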
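
The regression calculation needs only the summary totals, so it is easy to script. The Python sketch below is an added illustration (the function name regression_y_on_x is an assumption). One point to be aware of: the worked example rounds m to 0.38 before finding the intercept, which gives -2.4, whereas carrying the unrounded slope through gives an intercept of about -2.25.

```python
def regression_y_on_x(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (slope, intercept) of the least squares line of y on x.

    slope m = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2); the line passes through
    the point of means, so intercept = y_bar - m * x_bar.
    """
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    x_bar = sum_x / n
    y_bar = sum_y / n
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Summary totals from Example 1.
m, c = regression_y_on_x(n=10, sum_x=406, sum_y=130, sum_x2=18656, sum_xy=6094)
print(round(m, 2), round(c, 2))


def predict(x, slope, intercept):
    """Predicted y for a given x, by substituting into the regression line."""
    return slope * x + intercept
```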

Exercise

1. For the given sets of data find the regression line y on x and x on y. Comment on your
answers.

a) n = 9, \sum x = 510, \sum y = 624, \sum x^2 = 34140, \sum y^2 = 45056 and \sum xy = 37884
y=0.48x+42 ; x=1.41y-41
b) n = 6, \sum x = 45, \sum y = 506, \sum x^2 = 355, \sum y^2 = 43014 and \sum xy = 3872
y=4.4x+51.3 ; x=0.23y-11.5
c) n = 10, \sum x = 570, \sum y = 60, \sum x^2 = 32632, \sum y^2 = 404 and \sum xy = 3482
y=0.015x+5.15 ; x=1.41y+48.5
d) n = 8, \sum x = 34, \sum y = 1047, \sum x^2 = 155, \sum y^2 = 138370 and \sum xy = 4504
y=5.17x+109 ; x=0.04y-1.03

X 375 326 357 408 940 44


10 42
11 4812
e) Y 3.774 3.8 3.7 3.6 3.7 3.4 3.4 3.3
77 82 86 92 95 91 88
y=-
0.033x+4.9 ; x=-25.7y+131.6

f)

x=0.28y-15.5
g)
X 18 20 21 27 23 34 24 42 38 44
Y 8 5 6 8 7 11 8 10 6 8

PRODUCT MOMENT CORRELATION COEFFICIENT, r

We can use the formula below

r = \frac{n \sum xy - \sum x \sum y}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}

Interpretation of the coefficient of correlation

- The correlation coefficient r lies between -1 and +1.

- If r is near -1 it shows a very high degree of negative correlation. r = -1
implies a perfect inverse association between x and y, i.e. all the sample points fall on a
straight line with negative slope.

- A value of r near +1 indicates a high degree of positive correlation. r = 1 implies a
perfect direct correlation between x and y, i.e. all the sample points fall on a straight
line with positive slope.

- When r = 0 there is no linear correlation between x and y.

Therefore it follows that a value near -1, e.g. -0.89, shows a very strong (high degree of)
negative correlation, and vice versa for values near +1. A magnitude of r between 0.6 and 0.8 is strong, while
a magnitude between 0.3 and 0.6 implies a weak linear relationship. Below 0.3 there is little or no linear correlation.

COEFFICIENT OF DETERMINATION

This is equal to the square of the coefficient of correlation and is denoted by r^2. This value
lies between 0 and 1. It gives the proportion of the variation in one variable explained by
variation in the other variable. For example if r = 0.84 then r^2 = 0.71 and we conclude that
71% of the variation in y can be explained by the regression equation, leaving 29% to be
explained by other variables or factors, thereby indicating a high positive linear
relationship between the two variables.

Similarly r = 0.7 implies a strong/high correlation, yet r^2 = 0.49. Thus only about 49% of the
variation in y is explained by the variation in x, indicating a moderate relationship between
the two variables.
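
As an illustration of the two coefficients, the Python sketch below (added to these notes; the function name is an assumption) computes r and r^2 from summary totals, using the totals n = 10, \sum x = 406, \sum y = 130, \sum x^2 = 18656, \sum y^2 = 2126 and \sum xy = 6094 that appear elsewhere in this chapter. With these totals r rounds to 0.84; squaring the unrounded r gives about 0.70, slightly less than the 0.71 obtained by squaring 0.84.

```python
from math import sqrt


def correlation(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Product moment correlation coefficient r from summary totals."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = correlation(n=10, sum_x=406, sum_y=130, sum_x2=18656, sum_y2=2126, sum_xy=6094)
r_squared = r ** 2  # coefficient of determination

print(round(r, 2), round(r_squared, 2))
```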

EXERCISE

1. For the given sets of data find the product moment correlation coefficient, r and the
coefficient of determination r2. Comment on your answers.

i. n = 10, \sum x = 406, \sum y = 130, \sum x^2 = 18656, \sum y^2 = 2126 and \sum xy = 6094
[0.84]
ii. n = 6, \sum x = 45, \sum y = 506, \sum x^2 = 355, \sum y^2 = 43014 and \sum xy = 3872
[0.996]
iii. n = 10, \sum x = 570, \sum y = 60, \sum x^2 = 32632, \sum y^2 = 404 and \sum xy = 3482
[0.784]

2. The records for Alpha Steel for the past 8 years were as shown below:

YEAR ADVERTISING (US$ m) SALES (US$ m)


1992 8 162
1993 12 180
1994 6 164
1995 13 192
1996 15 205
1997 12 190
1998 7 150
1999 11 163
Find the regression line of sales on advertising [y=5.3x+120.1]

Prediction

The regression line is used to predict the future values or expected values of y given the
estimated values of x. Simply substitute the value of x into the equation.

EXERCISE

1. A shop sells home computers. The numbers of computers sold in each of five successive
years were as follows:
Year (x) 1 2 3 4 5
Sales (y) 10 30 70 140 170

a) Draw a scatter diagram for the data


b) Find the least squares regression line of y on x and fit it on your scatter plot
c) The shop manager uses this regression line to predict the sales in the following (i.e.
the 6th) year. Find the predicted sales.
d) Comment on the fit of this regression line and on the prediction made for the sales
in the sixth year
[y=43x-45, 213]

2. The table below shows for each of the years 1985 to 1991, the average unemployment
rate x (expressed as a percentage) and the number of crimes committed y (measured in
millions). This was in Scotland.

Year 1985 1986 1987 1988 1989 1990 1991


Unemployment rate x 12.9 13.3 13.0 11.3 9.3 8.1 8.7
No. of crimes y 0.80 0.82 0.86 0.86 0.90 0.96 1.02

a) Plot a scatter diagram on graph paper, showing unemployment rate on the horizontal
axis and number of crimes on the vertical axis.
b) Calculate the product moment correlation coefficient between x and y and interpret
the result of the calculation in terms of your scatter diagram
c) The number of years after 1985 is denoted by t. Calculate the equation of the
regression line of y on t in the form y = a + bt where a and b are constants.
d) Use your equation to estimate the number of crimes committed in 1992 and discuss
briefly the likely reliability of this estimate.
[-0.884 as unemployment rate decreases the number of crimes increases,
y=0.784+0.035t, 1.03, as 1992 is out of range of the years under consideration,
estimate is not reliable unless the trend continues]

CHAPTER 4:

THE NORMAL DISTRIBUTION

This is an example of a continuous distribution. A continuous random variable X has a normal
distribution if it has a p.d.f.

f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}},   -\infty < x < \infty

From the normal distribution curve it can be deduced that

- The distribution is symmetrical about the mean μ
- The mean, mode and median coincide due to the symmetry of the
distribution
- X ranges over -∞ < x < ∞
- The curve is asymptotic to the horizontal axis as x → -∞ and x → ∞
- The area under the curve is equal to one

From the p.d.f above we can see that the distribution of X depends only on μ and σ, and thus
instead of remembering the above formula it is sufficient to refer to the random
variable as having a normal distribution by using the notation:

X ~ N(μ, σ^2)

The first parameter in the bracket is the mean μ and the second one is the variance σ^2.

If X ~ N(50, 3^2) the mean is 50, the standard deviation is 3 and the variance is 9. What about here?

a) X ~ N(80, 25)
b) X ~ N(20, 10)

The normal distribution is the most important continuous distribution in statistics because
many quantities in natural sciences follow a normal distribution.

THE STANDARD NORMAL DISTRIBUTION

To be able to evaluate probabilities associated with the normal distribution, the standard
normal distribution is used, which has a mean of zero and a standard deviation of 1. The
standard normal variable is denoted by Z where

Z ~ N(0, 1^2)

It should be noted therefore that any normal distribution can be mapped onto the standard
normal distribution. This transformation is achieved by subtracting the mean and dividing
by the standard deviation

Z = \frac{X - \mu}{\sigma}

where X ~ N(μ, σ^2) and Z ~ N(0, 1^2)

Standardize the following:

a) X ~ N(50, 12^2)
b) X ~ N(80, 25)
c) X ~ N(20, 10)

USE OF THE STANDARD NORMAL TABLES USING ϕ(Z)

To make good use of the standard normal tables it helps to draw a clear diagram of the
area required in each case.

EXERCISE

1) The random variable Z ~ N(0, 1^2), find

a) P(Z<1) [0.8413]
b) P(Z<-1.5) [0.0668]
c) P(0.5<Z<1) [0.1498]
d) P(-1<Z<1.2) [0.7262]
e) P(|Z| <0.4)= P(-0.4<Z<0.4) [0.3108]
f) P(|Z| >0.50)= P(Z>0.5) or P(Z<-0.5) [0.617]
2) Find the value of Z
a) ϕ(Z)=0.8133 [0.89]
b) ϕ(Z)=0.028 [-1.911]
c) ϕ(-Z)=0.613 [-0.287]
d) ϕ(-Z)=0.013 [2.25]
3) The random variable Z ~ N(0, 1^2), find the value of a
a) P(Z<a) =0.613 [0.287]
b) P(Z<a) =0.453 [-0.118]
c) P(Z>a) =0.823 [-0.927]
d) P(Z>a) =0.4 [0.253]
e) P(|Z| <a) =0.5 [0.674]

USE OF THE STANDARD NORMAL TABLES FOR ANY NORMAL DISTRIBUTION

Example

The random variable X ~ N(12, 2^2). Find

a) P(X>17) = P\left( \frac{X-12}{2} > \frac{17-12}{2} \right) by standardizing
=P(Z>2.5)
=0.00621

b) P(X<10) = P\left( \frac{X-12}{2} < \frac{10-12}{2} \right) by standardizing
=P(Z<-1)
=0.1587

c) P(9<X<13) = P\left( \frac{9-12}{2} < Z < \frac{13-12}{2} \right) by standardizing
=P(-1.5<Z<0.5)
=0.6247
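
The table look-ups in this example can be confirmed in Python. The sketch below is an added illustration that uses statistics.NormalDist from the standard library (Python 3.8 or later) in place of printed tables; the helper standardize is a hypothetical name for the transformation Z = (X - μ)/σ.

```python
from statistics import NormalDist


def standardize(x, mu, sigma):
    """Map a value x from N(mu, sigma^2) onto the standard normal scale."""
    return (x - mu) / sigma


# X ~ N(12, 2^2): NormalDist takes the standard deviation, not the variance.
X = NormalDist(mu=12, sigma=2)

p_a = 1 - X.cdf(17)          # a) P(X > 17)
p_b = X.cdf(10)              # b) P(X < 10)
p_c = X.cdf(13) - X.cdf(9)   # c) P(9 < X < 13)

print(round(p_a, 5), round(p_b, 4), round(p_c, 4))

# The same probability via the standard normal variable Z.
Z = NormalDist()  # mean 0, standard deviation 1
p_a_via_z = 1 - Z.cdf(standardize(17, 12, 2))
```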

EXERCISE

1) Packages from a packing machine have a mass that is normally distributed with
mean of 56g and standard deviation 10g. find the probability that a package from
the machine weighs
a) Greater than 68g
b) Between 56 and 65g [0.1151;0.3159]
2) The mean weight of a consignment of 500 sacks of sugar is 151kg and a standard
deviation of 15kg. Assuming that the weights are normally distributed, find how
many sacks weigh
a) Between 120 and 155kg
b) Less than 120kg [294;32]
3) The number of hours of the life of a torch battery is normally distributed with a
mean of 120hours and a standard deviation of 16 hours. Find the probability that a
torch battery has a life of
a) More than 140 hours
b) Between 110 and 128 hours [0.1056;0.4255]
4) The masses of tablets of chocolates produced by a certain machine are found to be
normally distributed with a mean of 140g and standard deviation of 5g. Estimate
the number of tablets in a batch of 200 whose masses
a) Are greater than 145g
b) Are between 138g and 143g [32;76]

PROBLEMS INVOLVING FINDING THE VALUE OF μ AND σ OR BOTH

i. The random variable X is normally distributed with mean μ and standard deviation
5. Given that P(X<27.5)=0.8686 find μ [21.9]
ii. The random variable X is normally distributed with mean 50 and standard deviation
σ. Given that P(X>54)=0.2538 find σ to 2 decimal places [6.04]
iii. The random variable X is normally distributed with mean μ and standard deviation
σ. Given that P(X>34)=0.0228 and P(X<25)=0.0062 find μ and σ. [σ = 2, μ = 30]

NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION

If X ~ Bin(n, p) then E(X) = μ = np and Var(X) = σ^2 = npq

Therefore for large n and p not too small or too large

X ~ N(np, npq) approximately

The conditions for approximating a binomial distribution by a normal distribution are
enumerated below:

- n is large enough
- np > 5
- nq > 5
Note that the binomial is used with discrete data while the normal is used with
continuous random variables; hence we are approximating a discrete distribution by a continuous
one and an allowance must be made for this using the continuity correction of one half. For example,
P(X ≤ k) for the binomial is approximated by P(X < k + 0.5) for the normal.

CHAPTER 5:

SIGNIFICANCE TESTING (LARGE SAMPLE-Z TEST)

Null and Alternative hypotheses

In a statistical enquiry we often put forward a hypothesis concerning a population
parameter. For example, the mean mark of the pupils is 64.5, or the proportion of drivers who
use lead-free petrol is 42%. This hypothesis is called the null hypothesis and is denoted by
H0. In order to test the validity of this hypothesis we consider some observations made
from random samples taken from that population and perform a statistical test. If the test
shows that we should reject the null hypothesis H0, we do so in favor of an alternative
hypothesis, denoted by H1.

Critical region and Critical values

The critical region of a test statistic Z is the range of values of Z such that if the value of Z,
namely z, obtained from your particular sample lies in this critical region then you reject the
null hypothesis. The boundary value of the critical region is called the critical value.

One or Two tailed tests

There are two types of tests that can be performed depending on the alternative hypothesis
being made. These are a one tailed test and a two tailed test.

One-tailed test

This looks for a definite increase or decrease in the parameter. Look out for words such as
increase, rise, growth, underestimation! When testing for an increase the critical region lies in the
right tail; when testing for a decrease it lies in the left tail.

Example 1

H0: μ=30

H1: μ>30

To find the critical value at the 5% significance level we use the normal tables, which give
1.645. (How you get this will depend on the kind of tables used. Some give a list of critical values,
whereas with others you have to work out 100% - 5% = 95% and then look up the value of z corresponding
to 0.95.) For this right-tailed test we reject H0 if Zcalculated > Ztabulated.

Two-tailed test

This looks for any change in the parameter, in either direction. The alternative hypothesis
does not specify an increase or a decrease.

Example 2

H0: μ=30

H1: μ ≠30

To find the critical values at 5% we first divide 5% by 2 to get 2.5% in each of the
tails (two-tailed). Proceed as before, e.g. 100% − 2.5% = 97.5%. From the tables Z = 1.96,
so the critical values are ±1.96.
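
The table look-ups in Examples 1 and 2 can also be reproduced numerically. The sketch below uses Python's standard `statistics.NormalDist`; the helper name `z_critical` is a hypothetical name chosen for illustration.

```python
from statistics import NormalDist


def z_critical(alpha, two_tailed=False):
    """Critical value of the standard normal for significance level alpha."""
    # A two-tailed test places alpha / 2 in each tail.
    tail = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail)


one_tailed = z_critical(0.05)                   # look-up at 0.95
two_tailed = z_critical(0.05, two_tailed=True)  # look-up at 0.975
print(round(one_tailed, 3), round(two_tailed, 3))
```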

Test statistics

The test statistic for the normal distribution is given by

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

STEPS TO FOLLOW WHEN TESTING A MEAN

• State the hypotheses
• Consider the appropriate distribution as given by the null hypothesis
• Decide on the significance level of the test
• Decide on the rejection criteria
• Calculate the test statistic
• Make a conclusion

Example 1

A certain highway in a city centre has been investigated in the past for noise pollution by
vehicles. As a result of many measurements it has been found that between 2.30 and 3.30 p.m.
on any weekday the average noise level is 130 decibels with a standard deviation of 20
decibels. The residents are convinced that the noise levels are getting worse. Having taken
50 random readings, they obtained an average of 134 decibels. Is their claim justified? Test
at the 5% significance level.

Let X be the random variable ‘noise levels on a certain highway’

This is a one tailed test (why)

H0: μ=130

H1: μ>130

If H0 is true, then X ~ N(130, 20²)



Performing a one tailed test we would reject H0 if Z > 1.645

The test statistic is

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{134 - 130}{20 / \sqrt{50}} = 1.41$$

Conclusion: Since 1.41 < 1.645 we fail to reject H0 at the 5% significance level. There is
insufficient evidence that the noise levels have increased.
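
The calculation in Example 1 can be sketched in a few lines of Python. The function name `z_statistic` is a hypothetical helper; the numbers are those given in the example.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(sample_mean, mu0, sigma, n):
    """Z = (sample mean - hypothesised mean) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / sqrt(n))


z = z_statistic(134, 130, 20, 50)
critical = NormalDist().inv_cdf(0.95)  # one-tailed test at 5%
reject_h0 = z > critical               # right-tailed rejection rule
print(round(z, 2), round(critical, 3), reject_h0)
```

For a left-tailed test the comparison would be `z < -critical`, and for a two-tailed test `abs(z)` would be compared with the 0.975 quantile.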

Example 2

At a certain college new students are weighed when they join the college. The distribution
of weights of students at the college when they enroll is normal with a standard deviation
of 7.5kg and a mean of 70kg. A random sample of 90 students from the new entry was
weighed and their mean weight was 71.6kg. Assuming that the standard deviation did not
change and that the weights of the new class were also normally distributed test at 5%
significance level, whether there is evidence that the mean of the new entry has changed.

Let X be the random variable ‘weight of students’

This is a two tailed test (why)

H0: μ=70

H1: μ ≠70

If H0 is true, then X ~ N(70, 7.5²)

Performing a two tailed test we would reject H0 if |Z| > 1.96

The test statistic is

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{71.6 - 70}{7.5 / \sqrt{90}} = 2.02$$

Conclusion: Since 2.02 > 1.96 we reject H0 at the 5% significance level and conclude that the
mean weight of the new class has changed.

EXERCISE

1. The masses of loaves from a certain bakery are normally distributed with mean 450g
and standard deviation 20g. A sample of 36 loaves yielded a mean mass of 438g. Does this
provide sufficient evidence, at the 5% level, of a reduced population mean?
[Zcal = −3.6 < −1.645; reject H0]

2. Experience has shown that the scores obtained in a particular test at the VID are
normally distributed with mean 80 and standard deviation 7. When the test was taken by a
random sample of 64 candidates, the mean score was 76.5. Is there sufficient evidence at
3% that the candidates did not perform as well as expected? [Z cal=-4<-1.881; reject H0]

3. A chamber of commerce claims that the average take-home pay of manual workers in
fulltime employment in its area is $140.00 per week. A sample of 125 such workers had a
mean take-home pay of $148.00 with a standard deviation of $28.00. Test at 5%
significance level, the hypothesis that the mean take-home pay of all manual workers in the
area is $140.00. [Zcal=3.194>1.96; reject H0]

4. A pharmaceutical company claimed that a course of its vitamin tablets would improve
examination performance. To publicize its claim the company gave free tablets to some but
not all students taking a particular ZIMSEC examination. The average mark in the
examination for all candidates who did not take the course was 42. A random sample of
120 candidates from those who had taken the course of the vitamin tablet gave a mean
mark of 43.8 and a standard deviation of 12.8. Test at 5% significance level whether the
candidates who took the vitamin tablets had a mean mark greater than 42.

[Zcal=1.540<1.645; accept H0]

SMALL SAMPLES-THE t- TEST

The r.v. X is said to follow the t-distribution if the p.d.f. of X is

$$f(x) = C_\nu \left(1 + \frac{x^2}{\nu}\right)^{-\frac{1}{2}(\nu + 1)} \quad \text{for } -\infty < x < \infty$$

X has one parameter, ν, known as the number of degrees of freedom. The letter ν is
pronounced ‘nu’. The constant C_ν depends on ν. We say X ~ t(ν).

USE OF t-DISTRIBUTION TABLES



If we want the critical value of t(3) at 5% for a two tailed test, each tail represents
2.5%, so we take 100% − 2.5% = 97.5% and read the t-tables at 0.975 with 3 degrees of freedom.

Hence t(3) at 5% two tailed = 3.182. Similarly t(5) at 1% one tailed = 3.365. Now you
practice.

EXERCISE

a) t(4) at 5% one tailed
b) t(6) at 10% one tailed
c) t(7) at 1% two tailed
d) t(6) at 2% two tailed

THE t-TEST

CONDITIONS UNDER WHICH THE t-TEST IS USED

There are two conditions under which a t-test is valid

1. The sample size (n) must be small (n<30).

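2. The population variance σ² must be unknown

To perform a t-test we use the test statistic

$$T = \frac{\bar{x} - \mu}{s / \sqrt{n - 1}}$$

which is distributed as t(n−1) under the null hypothesis that the true population mean is μ.
Here s is the sample standard deviation calculated with divisor n.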

Example

A random sample of 8 women yielded the following cholesterol levels

3.1 2.8 1.5 1.7 2.4 1.9 3.3 1.6

It is required to test whether the sample could be drawn from a population whose mean
cholesterol is 3.1. Perform a t-test at 5% significance level. Make relevant conclusions.

H0: μ=3.1

H1: μ ≠3.1 (why)

If H0 is true, then T ~ t(n−1) and n−1 = 7, therefore T ~ t(7)

Using a two tailed test at 5%, t(7) = 2.365. So we reject H0 if |T calculated| > 2.365



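We then calculate the sample mean and variance, which are

$$\bar{x} = \frac{\sum x}{n} = \frac{18.3}{8} = 2.2875$$

$$s^2 = \frac{\sum x^2}{n} - \bar{x}^2 = \frac{45.41}{8} - 2.2875^2 = 0.4436, \quad \text{therefore } s = 0.666$$

The test statistic is

$$T = \frac{2.2875 - 3.1}{0.666 / \sqrt{7}} = -3.23$$

Conclusion: Since |−3.23| > 2.365 we reject H0 at the 5% significance level and conclude that
the sample is unlikely to have been drawn from a population whose mean cholesterol is 3.1.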
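
The arithmetic for the cholesterol example can be checked with a short Python sketch. It follows the formula used in these notes (variance with divisor n, then division by √(n−1)); the function name `t_statistic` is a hypothetical helper, and the critical value 2.365 is the table value quoted in the example.

```python
from math import sqrt


def t_statistic(data, mu0):
    """T = (mean - mu0) / (s / sqrt(n - 1)), with s computed using divisor n."""
    n = len(data)
    mean = sum(data) / n
    # s^2 = sum(x^2)/n - mean^2, as in the notes.
    variance = sum(x * x for x in data) / n - mean ** 2
    s = sqrt(variance)
    return (mean - mu0) / (s / sqrt(n - 1))


cholesterol = [3.1, 2.8, 1.5, 1.7, 2.4, 1.9, 3.3, 1.6]
t = t_statistic(cholesterol, 3.1)

T_CRITICAL = 2.365  # t(7) at 5%, two tailed, read from tables
reject_h0 = abs(t) > T_CRITICAL
print(round(t, 2), reject_h0)
```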

Classwork

Six cleaning firms were selected at random and asked about their hourly rates of pay $x
with the following results

7.00 6.80 6.62 6.94 7.48 7.04

Carry out a t-test at the 1% significance level to establish whether the mean hourly rate of
pay paid by cleaning firms falls below a proposed minimum of $7.40

[Mean = 6.98, sample variance = 0.0696 and degrees of freedom = 5. Reject H0 and
conclude that the mean hourly rate falls below the proposed minimum of $7.40]

EXERCISE

1) In a textile manufacturing process the average time taken is 6.0 hours. An
innovation which it is hoped will streamline the process and reduce the time is
introduced. A series of 8 trials used the modified process and produced the
following results

6.1 5.9 6.3 6.5 6.2 6.0 6.4 6.2

Use a 5% sig. level to decide whether the innovation has succeeded in reducing average
process time.

2) Five readings of the resistance in ohms of a piece of wire gave the following results

1.51 1.49 1.54 1.52 1.54

If the wire were pure silver its resistance would be 1.50 ohms. If the wire were impure
the resistance would be increased. Test at the 5% sig. level the hypothesis that the wire is
pure silver. [t=2.828>1.895; reject H0]

3) An advertisement for a certain model of mobile phone promised that it allowed ‘3.5
hours talk time’. A random sample of 8 phones of this type was tested and the
lengths of talk time, x hours, available were as follows

3.4 3.7 3.1 3.2 3.6 3.5 3.3 3.4

Test at 5% sig. level to examine whether the advertisement was overstating the
length of talk time available from the phone. [t=−1.414>−1.895; accept H0]

CHAPTER 6:

CHI SQUARED TEST

χ² is pronounced ‘kye squared’ and is written ‘chi-squared’. The test enables us to
decide whether it is valid to use a particular distribution, such as the Binomial or Poisson,
as a model for interpreting observed data. We can also use this test in contingency
tables to decide whether two variables are independent or not.

OBSERVED AND EXPECTED FREQUENCY (fo and fe)

A girl threw a coin 10 times and the number of heads and tails were summarized as shown

Heads   Tails   Total
7       3       10

This is what was observed. However we all know that before tossing the coin there are
equal chances of each outcome. Thus the expected outcomes would be as shown

Heads   Tails   Total
5       5       10

χ² TEST STATISTIC

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
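
As a small illustration, the statistic can be evaluated for the coin data above (observed 7 heads and 3 tails, expected 5 and 5). The function name `chi_squared` is a hypothetical helper chosen for this sketch.

```python
def chi_squared(observed, expected):
    """Sum of (fo - fe)^2 / fe over all cells."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


# Coin example: heads and tails observed against a fair-coin expectation.
chi2_coin = chi_squared([7, 3], [5, 5])
print(round(chi2_coin, 2))
```

Each cell contributes (7 − 5)²/5 and (3 − 5)²/5 respectively, so large departures from the expected counts inflate the statistic.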

χ² DISTRIBUTION

i) the graph of chi-squared is J-shaped for ν = 1 and ν = 2, and

ii) it is positively skewed for ν > 2; as ν becomes large the distribution is approximately
normal.

DEGREES OF FREEDOM

The parameter ν is known as the number of degrees of freedom

ν = number of cells – number of restrictions (we will explain through examples how this is
calculated)

CRITICAL VALUES OF CHI-SQUARED

Similarly the chi-squared tables are read more or less the same way as other tables. Enjoy!

a) $\chi^2_{5\%}(9)$
b) $\chi^2_{5\%}(7)$
c) $\chi^2_{1\%}(8)$
d) $\chi^2_{0.1\%}(4)$

GENERAL METHOD FOR TESTING THE GOODNESS OF FIT

• Determine which distribution is likely to be a good model by examining the
conditions applying to the observed data.
• Estimate parameters (λ, n, p, μ, σ²) if necessary from the observed data
• Calculate expected frequencies
• Combine any expected frequencies so that none are less than 5
• Find ν using ν = no. of cells − no. of restrictions
• Find the critical value of chi-square from tables
• Calculate the test statistic χ²
• Make relevant conclusions

GOODNESS OF FIT: DISTRIBUTION IN A GIVEN RATIO


A manufacturer of fashion garments for the younger age groups suspects that the
market for his product has changed recently. Sales records for previous years
showed that 14% of the buyers were below 16years of age, 38% were 16-20years of
age, 26% were 21-25 years and 22% were over 25. A random sample of 200 recent
buyers however showed the following results:

Age          Under 16   16-20   21-25   Above 25
Frequency    22         62      60      56

Are the differences between the results of the sample and those expected from
previous records significant at the 1% level?
H0: Differences are not significant
H1: Differences are significant
Expected frequencies are:

$$\frac{14}{100} \times 200 = 28; \quad \frac{38}{100} \times 200 = 76; \quad \frac{26}{100} \times 200 = 52; \quad \frac{22}{100} \times 200 = 44$$

Degrees of freedom = ν = number of columns − number of restrictions = 4 − 1 = 3

Observed frequencies   Expected frequencies   (fo − fe)²/fe
22                     28                     1.29
62                     76                     2.58
60                     52                     1.23
56                     44                     3.27
Total                                         χ²cal = 8.37

χ²cal = 8.37 < 11.34, accept H0

GOODNESS OF FIT: UNIFORM DISTRIBUTION


A mail order firm receives packets every day through mail. They think that their
deliveries are uniformly distributed throughout the week. Their deliveries over a
4 week period were as follows. Test this assertion using a 5% significance level.

Day          Mon   Tue   Wed   Thurs   Fri   Sat
Frequency    15    23    19    20      14    11
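
One way to work this example is sketched below in Python. Under a uniform model every day has the same expected frequency (total divided by the number of days), and the only restriction is that the totals agree, so ν = 6 − 1 = 5. The critical value 11.07 is the standard table value for χ² at 5% with 5 degrees of freedom; it is not quoted in the notes and is stated here as an assumption.

```python
deliveries = {"Mon": 15, "Tue": 23, "Wed": 19, "Thurs": 20, "Fri": 14, "Sat": 11}

observed = list(deliveries.values())
# Uniform model: the same expected frequency on every day.
expected_each = sum(observed) / len(observed)

chi2_uniform = sum((fo - expected_each) ** 2 / expected_each for fo in observed)
df = len(observed) - 1        # one restriction: totals must agree

CRITICAL_5PCT_DF5 = 11.07     # assumed table value for chi-squared(5) at 5%
reject_h0 = chi2_uniform > CRITICAL_5PCT_DF5
print(expected_each, round(chi2_uniform, 2), df, reject_h0)
```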

GOODNESS OF FIT: BINOMIAL DISTRIBUTION


Testing a Binomial distribution as a model
• There must be a fixed number n of trials in each observation
• The trials must be independent
• The trials have only two outcomes
• Probability of success p is constant

The frequency f with which each x occurs when the number of observations is N
is given by f = P(X = x) × N

$$p = \frac{\text{total number of successes}}{\text{number of trials} \times N} = \frac{\sum xf}{nN}$$

If p is given: ν = no. of cells − 1

If p is not given: ν = no. of cells − 2

Example

X    0    1    2    3    4    5
F    12   39   27   15   4    3

Perform a chi-squared test to investigate whether the data above are drawn
from a binomial distribution with p=0.3. Use a 5% level of significance.

NB the last number in the first row is always the value of n.

Use the Binomial formula to calculate expected frequencies.

[fe = 17, 36, 31, 13, 3, 0; ν=3; χ²cal = 4.49; accept H0]
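
The expected frequencies for this example can be sketched in Python as follows. The helper names `binomial_expected` and `pool_tail` are hypothetical, and the pooling helper only merges cells at the right-hand end, which is all this example needs. The quoted answer rounds the expected frequencies to whole numbers before computing χ² (giving 4.49); the sketch keeps them unrounded, so its statistic is slightly smaller, but the conclusion is the same.

```python
from math import comb


def binomial_expected(n, p, total):
    """Expected frequencies N * P(X = x) for x = 0..n under B(n, p)."""
    return [total * comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]


def pool_tail(observed, expected, minimum=5):
    """Merge cells from the right until the last expected frequency >= minimum."""
    obs, exp_f = list(observed), list(expected)
    while len(exp_f) > 1 and exp_f[-1] < minimum:
        last_obs, last_exp = obs.pop(), exp_f.pop()
        obs[-1] += last_obs
        exp_f[-1] += last_exp
    return obs, exp_f


observed = [12, 39, 27, 15, 4, 3]   # frequencies for x = 0..5
expected = binomial_expected(5, 0.3, sum(observed))
obs_pooled, exp_pooled = pool_tail(observed, expected)

chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(obs_pooled, exp_pooled))
df = len(obs_pooled) - 1            # p was given, so one restriction
print([round(fe, 2) for fe in expected], df, round(chi2, 2))
```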

EXERCISE

1. A fourth year Agriculture student was trying to model the number of female
kittens born in a litter. She has examined 250 litters of size 5 and obtained the
following results:

No. of females       0    1    2    3    4    5
Number of litters    2    40   90   85   30   3

Test at 5% level whether or not the binomial distribution with parameters n=5
and p=0.5 is an adequate model for these data. [χ²cal = 11.824; reject H0]

2. Components produced by a certain machine are deposited in plastic bins as
they are produced. It is thought that the machine will produce only 5% defective
components. Samples of 20 components are taken from each of the bins with the
following results:

No. of defectives    0    1    2    3    4    5
Frequency            20   31   30   10   6    3

Use a 5% level and an appropriate test to see if these results can be modeled by
a binomial distribution with p=0.05.

GOODNESS OF FIT: POISSON DISTRIBUTION

Testing a Poisson distribution as a model
• The N events must occur independently of each other
• The events occur singly in continuous space or time
• The events occur at a constant rate i.e. the mean number in an interval is
proportional to the length of the interval.
• The mean and variance are equal
The Poisson distribution has a single parameter λ which may be known or which
may be estimated from the observed data using

$$\text{Mean} = \lambda = \frac{\sum xf}{\sum f}$$

If λ is given: ν = no. of cells − 1

If λ is not given: ν = no. of cells − 2

Example
A factory produces a serum for use in hospitals. The number of bacteria per 1ml of
serum is thought to have a Poisson distribution with a mean of 3. The laboratory
producing the serum tested 150 batches with the following results:

No. of bacteria 0 1 2 3 4 5 6 7 8
Frequency 10 18 27 38 28 14 10 5 0

Test at 5% significance level to see if the assumption is correct

Use the Poisson formula to calculate expected frequencies.

[fe = 7.5, 22.4, 33.6, 33.6, 25.2, 15.1, 7.6, 3.2, 1.8 (combine frequencies less than 5);
ν=7; χ²cal = 4.72; accept H0]
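
The expected frequencies in this example can be reproduced with the sketch below. The helper name `poisson_expected` is hypothetical, and the sketch assumes the last cell is treated as "8 or more" so that the expected frequencies add up to 150. Because it keeps unrounded expected frequencies, the statistic it prints is close to, but not exactly, the quoted 4.72.

```python
from math import exp, factorial


def poisson_expected(lam, last_cell, total):
    """Expected frequencies for x = 0..last_cell-1 plus a final 'last_cell or more' cell."""
    probs = [exp(-lam) * lam ** x / factorial(x) for x in range(last_cell)]
    probs.append(1 - sum(probs))  # remaining probability goes in the last cell
    return [total * p for p in probs]


observed = [10, 18, 27, 38, 28, 14, 10, 5, 0]   # x = 0..8
expected = poisson_expected(3, 8, sum(observed))

# Combine cells from the right while the last expected frequency is below 5.
obs, exp_f = list(observed), list(expected)
while len(exp_f) > 1 and exp_f[-1] < 5:
    last_obs, last_exp = obs.pop(), exp_f.pop()
    obs[-1] += last_obs
    exp_f[-1] += last_exp

chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(obs, exp_f))
df = len(obs) - 1   # lambda was given, so one restriction
print([round(fe, 1) for fe in expected], df, round(chi2, 2))
```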

EXERCISE

1. A department store records the daily number of sales charged to stolen credit cards.
The results for the first months of 1990 are as follows:

No. of sales    0    1    2    3    4 or more
No. of days     31   39   19   11   0

Explain why a Poisson distribution may be appropriate as a model for the daily number of
sales charged to stolen credit cards.

Test at 5% level the hypothesis that the daily number of sales does follow a Poisson
distribution. [χ²cal = 0.48; accept H0]

NB there are 2 restrictions here.

2. The number of times a farm tractor broke down each week was recorded over 100
weeks with the following results:

No. of breakdowns, X 0 1 2 3 4 5
Frequency, f 50 24 12 9 5 0

It is thought that the distribution is Poisson; justify this. Conduct a test at 5% level using
λ=0.95 and see if the assumption is reasonable.

CHI-SQUARED IN CONTINGENCY TABLES

Sometimes situations arise in which individuals are classified according to two sets of
attributes, e.g.

• Age and voting preferences
• Test scores in Mathematics and in General Paper.

We may wish to investigate whether the attributes are independent or whether there is
evidence of an association between them.

The general method for testing in contingency tables

• Define H0
• Find row totals, column totals and grand total
• Find expected frequencies using
  $\text{Expected frequency} = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$
• Find degrees of freedom ν = (no. of rows − 1) × (no. of columns − 1)
• Find the critical value of chi-square from the tables
• Calculate $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$
• State conclusion

Example

Analysis of the rate of turnover of employees by a personnel manager produced the
following table showing the length of stay of 200 people who left the company for other
employment. Use a 1% significance level to test whether length of employment is
independent of grade.

              Length of employment (years)
Grade         0-2    2-5    >5
Managerial    4      11     6
Skilled       32     28     21
Unskilled     25     23     50

[fe = 6.41, 6.51, 8.09, 24.71, 25.11, 31.19, 29.89, 30.38, 37.73 (combine frequencies less than
5); ν=4; χ²cal = 16.9; reject H0]
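
The steps of the general method can be sketched in Python for this table. The variable names are illustrative; the expected frequencies follow the row total × column total / grand total rule stated above.

```python
# Observed frequencies: rows are grades, columns are lengths of employment.
table = [
    [4, 11, 6],     # Managerial
    [32, 28, 21],   # Skilled
    [25, 23, 50],   # Unskilled
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Expected frequency for each cell = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum(
    (fo - fe) ** 2 / fe
    for obs_row, exp_row in zip(table, expected)
    for fo, fe in zip(obs_row, exp_row)
)
df = (len(table) - 1) * (len(table[0]) - 1)
print([[round(fe, 2) for fe in row] for row in expected], df, round(chi2, 1))
```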

EXERCISE

1. The following are data on 150 chickens divided into two groups according to breed and
into three groups according to yield of eggs.

              YIELD
BRAND NAME    High   Medium   Low
Chantecler    46     29       28
Carmen        27     14       6

Is there evidence at 5% level of an association between brand name and yield?

[ν=2; χ²cal = 5.01; accept H0]

2. Two factories using the materials purchased from the same supplier and closely
controlled to an agreed specification produce output for a given period classified into three
quality grades A, B, C as follows:

           OUTPUT IN TONNES
FACTORY    A     B     C
X          42    13    33
Y          20    8     25

Is there evidence of a significant difference at 5% level? [χ²cal = 1.5; accept H0]



CHAPTER 7

DESIGN OF EXPERIMENTS: ONE WAY ANOVA

The way a sample is selected is called the sampling plan or experimental design and it
determines the amount of information in the sample.

Some definitions

Experimental unit: the object on which a measurement is taken.

Factor: an independent variable whose values are controlled and varied by the experimenter

Level: is the intensity setting of a factor

Treatment: is a specific combination of factor levels.

Response: is the variable being measured by the experimenter.

EXERCISE

a) A group of people is randomly divided into an experimental and a control group. The
control group is given an aptitude test after having a full breakfast and the
experimental group is given the same test after having no breakfast. Determine the
experimental units, factors, levels and treatments for this experiment

Experimental unit: people

Factor: meal or breakfast

Level: two levels

Treatment: type of breakfast.

Response: aptitude test scores

b) Suppose in the previous example the experimenter had begun by randomly
selecting 20 men and 20 women for the experiment. These two groups were then
randomly divided into 10 each for the experimental and control groups. Determine
the experimental units, factors, levels and treatments for this experiment.

Experimental unit: people

Factor: meal or breakfast and gender

Level: two levels in each case

Treatment: 4 (each combination of breakfast and gender)

Response: aptitude test scores

Assumptions

Observations are

- normally distributed with a common variance σ²

- independent of one another

ONE WAY ANOVA

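This involves comparing the k populations with means μ1 up to μk and deciding if there are
significant differences between them.

H0: There are no significant differences between the means of the treatments

H1: There are significant differences between the means of the treatments

Calculations

$$CF = \frac{(\text{grand total})^2}{N}$$

$$SS_{total} = \sum X_i^2 - CF$$

$$SS_{treatment} = \sum \frac{T_i^2}{n_i} - CF$$

$$SS_{error} = SS_{total} - SS_{treatment}$$

Here N is the total number of observations, T_i is the total of treatment i and n_i is the
number of observations in treatment i.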

ANOVA TABLE

Source of     Degrees of   Sum of squares                       Mean squares                        F calculated           F tabulated
variation     freedom
Treatment     k−1          SStreatment = ∑ Ti²/ni – CF          MStreatment = SStreatment/(k−1)     MStreatment/MSerror    F(5%)[k−1; N−k]
Error         N−k          SSerror = SStotal − SStreatment      MSerror = SSerror/(N−k)
Total         N−1          SStotal = ∑ Xi² – CF

Here k is the number of treatments, so the error degrees of freedom are N−k = (N−1)−(k−1).

Conclusion

Reject H 0 if F calculated > F tabulated. Use the standard conclusion as usual.

Least significant difference

If you reject H0 then you need to show between which pairs of means the difference really lies.
Thus you calculate the least significant difference (L.S.D.). The difference between treatments
A and B is insignificant if |X̄A − X̄B| < LSD; otherwise it is significant.

$$LSD = t_{\alpha/2}[df] \sqrt{s^2 \left(\frac{1}{n_A} + \frac{1}{n_B}\right)} \quad \text{where } s^2 = MSE$$
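
A minimal Python sketch of the L.S.D. comparison follows. It borrows the breakfast exercise at the end of this chapter for its inputs: three groups of five observations with means 9.4, 14.0 and 13.0, and MSE = 71.2/12 taken from the quoted answers. The value 2.179 is the standard table value of t with 12 degrees of freedom and 2.5% in each tail; it is not quoted in the notes and is an assumption of this sketch, as is the helper name.

```python
from math import sqrt


def least_significant_difference(t_critical, mse, n_a, n_b):
    """LSD = t * sqrt(MSE * (1/n_a + 1/n_b))."""
    return t_critical * sqrt(mse * (1 / n_a + 1 / n_b))


T_CRITICAL = 2.179   # assumed table value: t(12), 2.5% in each tail
MSE = 71.2 / 12      # error sum of squares / error degrees of freedom
means = {"No breakfast": 9.4, "Light breakfast": 14.0, "Full breakfast": 13.0}

lsd = least_significant_difference(T_CRITICAL, MSE, 5, 5)

# A pair differs significantly when the gap between its means exceeds the LSD.
names = list(means)
significant_pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(means[a] - means[b]) > lsd
]
print(round(lsd, 3), significant_pairs)
```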

Confidence intervals

For just one mean:

$$\bar{X}_A \pm t_{\alpha/2}[df] \sqrt{\frac{s^2}{n_A}}$$

For comparing any two means:

$$(\bar{X}_A - \bar{X}_B) \pm t_{\alpha/2}[df] \sqrt{s^2 \left(\frac{1}{n_A} + \frac{1}{n_B}\right)}$$

Report

The report is simply a brief discussion on the analysis of the results found on L.S.D.

Exercise

No breakfast   Light breakfast   Full breakfast
8              14                10
7              16                12
9              12                16
13             17                15
10             11                12

i. Construct the analysis of variance, ANOVA, table for this data and draw relevant
conclusions
ii. Perform a further analysis if necessary and write a report

[CF = 2208.2667; SStreatment = 58.5333; SStotal = 129.7333; SSerror = 71.2; F = 4.933]
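
The quoted answers can be reproduced with the short Python sketch below, which follows the calculation steps given earlier in this chapter. The variable names are illustrative. The tabulated value F(5%)[2; 12] = 3.89 is the standard table value and is stated here as an assumption, since the notes do not quote it.

```python
groups = {
    "No breakfast": [8, 7, 9, 13, 10],
    "Light breakfast": [14, 16, 12, 17, 11],
    "Full breakfast": [10, 12, 16, 15, 12],
}

all_values = [x for scores in groups.values() for x in scores]
N = len(all_values)   # total number of observations
k = len(groups)       # number of treatments

cf = sum(all_values) ** 2 / N                                   # correction factor
ss_total = sum(x * x for x in all_values) - cf
ss_treatment = sum(sum(g) ** 2 / len(g) for g in groups.values()) - cf
ss_error = ss_total - ss_treatment

ms_treatment = ss_treatment / (k - 1)
ms_error = ss_error / (N - k)
f_calculated = ms_treatment / ms_error

F_TABULATED = 3.89    # assumed table value: F(5%) with 2 and 12 degrees of freedom
reject_h0 = f_calculated > F_TABULATED
print(round(cf, 4), round(ss_treatment, 4), round(ss_total, 4),
      round(ss_error, 4), round(f_calculated, 3), reject_h0)
```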



ASSIGNMENT 2: DUE DATE: TBA !

1. A marketing student at Solusi University heard that you were doing experimental
designs. She is clueless in terms of data analysis and needs your assistance. The
following is a question she brought to you. The task was to determine if there
was a difference in the mean fuel consumption for three makes of cars.

Three types of medium sized cars assembled in Japan have been test driven by a
motoring magazine and compared on a variety of criteria. In the area of fuel efficiency
performance, five cars of each brand were each test driven 1000km; the km per litre
data obtained are as follows:

           Kilometres per Litre                  Totals   Means
Brand A    7.6    8.4    8.0    7.6    8.4       40.0     8.0
Brand B    7.8    8.0    9.1    8.5    9.6       43.0     8.6
Brand C    9.6    10.4   9.2    9.7    10.6      49.5     9.9

You decide to use SPSS to do the analysis and you remember that before any analysis
is done, ALL the Analysis of Variance (ANOVA) assumptions MUST be met. The
residual plot outputs are shown below.

a. State the ANOVA assumptions. [3 marks]
b. Interpret the residual plots fully and
   i. Indicate whether you would proceed to do the analyses or not and
      why [3 marks]
   ii. What options would you recommend if the data was in violation of
       the assumption? Give one alternative. [2 marks]
c. Construct an appropriate ANOVA table for this data [12 marks]
d. Perform any other tests and write a full report on your
   findings. [10 marks]

[Figure: residual plots for FuelConsumption, four panels: histogram of residuals; fitted-value
plot (residuals against fitted values); Normal plot (residuals against expected Normal
quantiles); half-Normal plot (absolute values of residuals against expected Normal quantiles)]
