285 Notes
CHAPTER 1:
INTRODUCTION TO DATA PRESENTATION
Exercise
What form of data are these, giving reasons
1. Number of hours of operation
2. Time
3. Volume of water in a tank
4. Growth rate of trees
5. Wages earned in a week
6. Age in years
7. Height in cm
8. Cost of meals
Data collection
Raw data is collected using an instrument such as a questionnaire or an interview. Once
collected, it is either presented in the form of graphs or analyzed numerically. The data is either
grouped or ungrouped; grouping is typically used for continuous data, while discrete data is
often left ungrouped. The nature of the data informs a statistician which graph to use.
Frequency distributions
Ungrouped data
Example 1
Copy and complete the table below using the data above.
GROUPED DATA
Using the data from Example 1, copy and complete the table below for grouped data with a class size of 2.
Exercise
1. Construct a frequency distribution including the relative and cumulative frequency
tables for the discrete data below:
a) 1 2 1 1 2 0 5 4 4 0 6 2
3 0 4 3 3 4 5 2 2 3 0 1
4 4 5 6 1 5 3 2 1 4 5
b) 2 2 1 3 3 1 2 3 2 4 1 3
3 1 2 1 2 3 4 3 2 3 1 3 4
c) 32 33 29 34 30 30 31 28 30 29
31 32 29 31 29 33 33 29 32 28
32 34 33 33 29 34 30 32 33 30
22 72 57 29 55 16 25 31 53 22 28 56
Graphs
Stem and leaf
Box and whisker
Bar graph
Pie chart
Pictogram
Histogram
Cumulative frequency curve
Frequency polygons
Do a reading assignment for the graphs and hand in the summaries being careful to give at least
a worked relevant example in each case.
MEAN
Ungrouped data
Mean: $\bar{X} = \dfrac{\sum_{i=1}^{n} x_i}{n}$ or, for a frequency table, $\bar{X} = \dfrac{\sum xf}{\sum f}$
The sum of all observations divided by the total number of observations.
MODE:
The value that appears more often or with the highest frequency
Worked examples
1) The following were the quiz scores of one student in a Statistics class at the end of the
course. Find the mean score and comment:
5; 10; 15; 20; 25
Solution: Mean $= \dfrac{5+10+15+20+25}{5} = \dfrac{75}{5} = 15$
Interpretation: The average score for the quizzes was 15.
2) Peter scored the following number of goals in eleven matches. Write down the modal
score and comment:
1 0 0 2 2 0 1 2 3 1 2
Solution: The modal score is 2.
Interpretation: Peter scored 2 goals more often than any other number of goals.
3) 100 members of an orchestra were asked how many instruments each could play. The
results were shown on the following table.
I. Calculate the mean number of instruments played and comment.
II. Find the mode and comment.
No. of instruments x   1    2    3    4    5
Frequency f            41   33   18   6    2
Solution:
Copy and complete the table below
Number of instruments x   Frequency f   xf
1                         41            41
2                         33            66
3                         18
4                         6
5                         2
Total                     100           195
Therefore mean $= \dfrac{195}{100} = 1.95$
Interpretation (mean): The average number of instruments played is 1.95.
Interpretation (mode): The mode is 1, the value with the highest frequency (41); more members play 1 instrument than any other number.
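The frequency-table calculation above can be sketched in a few lines of Python. This is only an illustration: the function names `freq_mean` and `freq_mode` are made up for this sketch, and the table is the orchestra data from the worked example.

```python
# Orchestra example: number of instruments x -> frequency f.
freq = {1: 41, 2: 33, 3: 18, 4: 6, 5: 2}

def freq_mean(table):
    """Mean of a frequency table: sum(x * f) / sum(f)."""
    total_f = sum(table.values())
    total_xf = sum(x * f for x, f in table.items())
    return total_xf / total_f

def freq_mode(table):
    """The value of x that has the highest frequency."""
    return max(table, key=table.get)

mean = freq_mean(freq)   # 195 / 100
mode = freq_mode(freq)   # x with the largest f
print(mean, mode)
```

Storing the table as a dictionary keeps each value next to its frequency, which mirrors the x, f, xf columns used by hand.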
Grouped data:
Worked examples
The summary of the test scores [out of 100] in one Statistics course with 40 students is as
shown below:
Solution:
Copy and complete the table below
Mark    Midpoint x         Frequency f   xf
31-35   (31+35)/2 = 33     4             132
36-40   (36+40)/2 = 38     6             228
41-45                      10
46-50                      13
51-55                      5
56-60                      2
Total                      40            1795
Therefore mean $= \dfrac{1795}{40} = 44.9$
Interpretation: On average, students scored 44.9.
Modal class = 46-50.
The mode is estimated by the midpoint of the modal class: (46+50)/2 = 48.
Interpretation (modal class): More students scored between 46 and 50 than in any other class.
Interpretation (mode): The most common score is estimated to be about 48.
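As a rough check of the grouped-data working, the sketch below recomputes the mean from class midpoints and picks out the modal class. The helper names are hypothetical; the class limits and frequencies are those in the table above.

```python
# (lower limit, upper limit, frequency) for each class of marks.
classes = [(31, 35, 4), (36, 40, 6), (41, 45, 10),
           (46, 50, 13), (51, 55, 5), (56, 60, 2)]

def grouped_mean(rows):
    """Estimate the mean using class midpoints: sum(midpoint * f) / sum(f)."""
    total_f = sum(f for _, _, f in rows)
    total_xf = sum((lo + hi) / 2 * f for lo, hi, f in rows)
    return total_xf / total_f

def modal_class(rows):
    """Return the limits of the class with the highest frequency."""
    lo, hi, _ = max(rows, key=lambda row: row[2])
    return lo, hi

mean = grouped_mean(classes)
lo, hi = modal_class(classes)
mode_estimate = (lo + hi) / 2  # midpoint of the modal class
print(mean, (lo, hi), mode_estimate)
```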
MEDIAN
This is the middle value of an ordered set of data. Before defining the median it is necessary to
rearrange the data in ascending order of magnitude.
Worked examples
Ungrouped data
In the above examples on ungrouped data, find the median and comment.
Solution
5; 10; 15; 20; 25
The data is already arranged in ascending order. Therefore the median is the middle (third) value, 15.
Interpretation: In half of the quizzes the student scored 15 or less.
Grouped data
1. For the grouped data below estimate the median using the method of interpolation.
Class       5-9   10-14   15-19   20-24   25-30
Frequency   6     7       9       8       4
Solution:
Rewriting the table with class boundaries and cumulative frequencies, and identifying the median position:
Class boundaries   Frequency   Cumulative frequency
4.5-9.5            6           6
9.5-14.5           7           13
14.5-19.5          9           22
19.5-24.5          8           30
24.5-30.5          4           34
There are 34 units in the experiment, so the median lies between the 17th and 18th positions. The
first and second classes together reach the 13th position. The third class reaches the 22nd position,
so it includes the 17th and the 18th positions.
Thus the median class is 14.5-19.5.
Median $= 14.5 + \left[\dfrac{\tfrac{1}{2}(34) - 13}{9}\right] \times 5 = 16.72$
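The interpolation step can be written as a small function. This is a minimal sketch, assuming the classes are given by their boundaries (4.5-9.5 and so on) and that the median position is taken as n/2; the name `grouped_median` is illustrative only.

```python
def grouped_median(boundaries, freqs):
    """Median by linear interpolation inside the median class.

    boundaries: list of (lower, upper) class boundaries
    freqs:      matching list of class frequencies
    """
    half = sum(freqs) / 2          # median position n/2
    cumulative = 0                 # cumulative frequency below current class
    for (lower, upper), f in zip(boundaries, freqs):
        if cumulative + f >= half:
            width = upper - lower
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("frequency table is empty")

boundaries = [(4.5, 9.5), (9.5, 14.5), (14.5, 19.5), (19.5, 24.5), (24.5, 30.5)]
freqs = [6, 7, 9, 8, 4]
median = grouped_median(boundaries, freqs)
print(round(median, 2))
```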
Exercise
1. Find the median for the following data
a.
X 0 1 2 3 4 5
Y 2 5 9 4 2 1
b.
X 1 2 3 4 5 6
Y 3 9 12 11 8 7
c.
Weight 53-56 57-60 61-64 65-68 69-72 73-76 77-80
frequency 2 13 4 11 9 6 5
d.
time 10-19 20-24 25-29 30-39 40-49 50-64 65-89
frequency 10 20 25 30 24 12 10
2. Write down at least 2 advantages and 2 disadvantages of mean, mode and median.
3. When should we use what measure? Consider skewness.
Worked example
Given 150 200 180 160 170 calculate the standard deviation
x      x²
150
200
180
160
170
$\sum x = 860$     $\sum x^2 = 149400$
$\sigma = \sqrt{\dfrac{149400}{5} - \left(\dfrac{860}{5}\right)^2} = 17.2047$
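The standard deviation above uses the "mean of squares minus square of mean" form with divisor n. A short sketch of that same calculation follows; `population_sd` is a name chosen for this illustration.

```python
from math import sqrt

def population_sd(values):
    """Standard deviation with divisor n: sqrt(sum(x^2)/n - (sum(x)/n)^2)."""
    n = len(values)
    mean = sum(values) / n
    mean_of_squares = sum(v * v for v in values) / n
    return sqrt(mean_of_squares - mean ** 2)

data = [150, 200, 180, 160, 170]
sd = population_sd(data)
print(sum(data), sum(v * v for v in data), round(sd, 4))
```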
$\sigma^2 = \dfrac{1470400}{50} - \left(\dfrac{\sum fx}{50}\right)^2 =$
Exercise
1. Calculate the standard deviation in each case a) - d) in the previous exercise.
2. Give the importance of the standard deviation/variance.
b. Coefficient of variation is defined as $\dfrac{100s}{\bar{x}}$ and is expressed as a percentage. It is used to
compare the variability in two sets of data where there is an obvious difference in magnitude in
both the means and standard deviations, e.g. heights of boys aged 5 and 15. Suppose the means
are 100 and 150 and the standard deviations are 6 and 9 respectively; then both sets have a
coefficient of variation of 6%.
c. Range is the largest value minus the lowest value. It is commonly used because it is easy to calculate;
however, it is not reliable except on special occasions because only two of the sample values are
used to calculate it.
CHAPTER 2:
DISCRETE DISTRIBUTIONS
….
$P(X = x_n) = p_n$
Then X is a discrete random variable if $p_1 + p_2 + p_3 + \dots + p_n = 1$
Symbolically $\sum_{\text{all } x} P(X = x) = 1$
X         1     2     3     4
P(X=x)    1/3   1/4   1/3   a
Worked example
From the table above, find
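1. The value of a
$\tfrac{1}{3} + \tfrac{1}{4} + \tfrac{1}{3} + a = 1$, so $a = \tfrac{1}{12}$
2. P(X=2) $= \tfrac{1}{4}$
3. P(X≤2) $= \tfrac{1}{3} + \tfrac{1}{4} = \tfrac{7}{12}$
4. P(X>2) $= P(X=3) + P(X=4) = \tfrac{1}{3} + \tfrac{1}{12} = \tfrac{5}{12}$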
Exercise
1. Given that
X         0     1     2     3
P(X=x)    0.1   0.2   0.3   b
Find
a) b
b) P(X>1)
2. Given that
Y         -2    -1     0    1     2
P(Y=y)    1/6   1/12   x    1/3   1/12
Find
a) x
b) P(Y≥0)
3. Given that
X         0     1    2
P(X=x)    1/4   a    b
Find
i. a and b given that 2a+b=1 and
ii. P(X≤1)
4. Given that
X         1     3     5     -y
P(X=x)    3/8   3/8   1/8   1/8
Find y if $\sum_{\text{all } x} x\,P(X = x) = 0$
Expectation E(x)
This is the mean of the distribution.
Given by $E(X) = \sum_{\text{all } x} x\,P(X = x)$
Variance Var(x)
Given by $\mathrm{Var}(X) = E(X^2) - [E(X)]^2$
Find Var(x)
X 0 1 2 3
P(X=x) 0.1 0.2 0.3 0.4
Worked Example
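The discrete r.v. X is given by P(X=x)=kx for x=1, 2, 3, 4, where k is a constant. Find k and E(X).
X         1    2    3    4
P(X=x)    k    2k   3k   4k
Since the probabilities must sum to 1,
$k + 2k + 3k + 4k = 1$, so $10k = 1$ and $k = \tfrac{1}{10}$
$E(X) = 1 \times \tfrac{1}{10} + 2 \times \tfrac{2}{10} + 3 \times \tfrac{3}{10} + 4 \times \tfrac{4}{10} = \tfrac{30}{10}$
E(X) = 3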
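The expectation and variance formulas can be checked with exact fractions. The sketch below is an illustration only (the helper names are not from the notes); it uses the distribution P(X=x)=kx from the worked example.

```python
from fractions import Fraction

def expectation(dist):
    """E(X) = sum of x * P(X = x) over all x."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X) = E(X^2) - [E(X)]^2."""
    e_x2 = sum(x * x * p for x, p in dist.items())
    return e_x2 - expectation(dist) ** 2

# k is fixed by the requirement that the probabilities sum to 1.
k = Fraction(1, 1 + 2 + 3 + 4)
dist = {x: k * x for x in (1, 2, 3, 4)}

print(k, expectation(dist), variance(dist))
```

Using `Fraction` avoids rounding, so the results can be compared directly with hand working.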
Exercise
2. Birds of a particular species lay 0, 1, 2, 3 eggs in their nests with probabilities as shown
in the following table
No. of eggs 0 1 2 3
Probability 0.2 0.1 0.35 K
5
Find
a) The value of k.
b) The expected number of eggs laid in a nest.
3. A wholesaler sells apples in boxes of hundred. The probability that there are x bad
apples in a box is given in the following table
Value of x 0 1 2 3 4 >4
P(X=x) 8k 5k 3k 2 2k 0
k
a) Calculate the value of k
b) Var (x)
4. A discrete random variable R takes integer values from 0 to 4 with probabilities
given as shown below. Find the expectation and Variance of R.
$P(R = r) = \begin{cases} \dfrac{r+1}{10} & r = 0, 1, 2 \\[4pt] \dfrac{9-2r}{10} & r = 3, 4 \end{cases}$
BINOMIAL DISTRIBUTION
A Binomial distribution refers to an experiment with two possible outcomes a success and a
failure.
Modeling a binomial
1. A fixed number n of independent trials.
2. Each trial results in either a success or a failure.
3. The probability of success p is constant for each trial.
E(X) = np and Var(X) = npq, where q = 1 − p is the probability of failure.
Example
At OK supermarket 60% of customers pay on credit cards. Find the probability that in a
randomly selected sample of ten customers
a. Exactly 2 pay by credit cards
b. More than seven pay by credit cards
Solution
Let X be the number of customers who pay on credit cards.
p=0.6 q=0.4 n=10
Hence X ~ Bin(10, 0.6).
a. P(X=2)
Using the formula:
$P(X=2) = \binom{10}{2}(0.4)^{10-2}(0.6)^{2} = 0.011$
b. P(X>7)
$P(X>7) = \binom{10}{8}(0.4)^{10-8}(0.6)^{8} + \binom{10}{9}(0.4)^{10-9}(0.6)^{9} + \binom{10}{10}(0.4)^{10-10}(0.6)^{10} = 0.17$
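The credit card example can be reproduced with the binomial formula directly. This is a minimal sketch using only the standard library; `binom_pmf` is a name introduced here for illustration.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p): C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.6   # ten customers, 60% pay by credit card

p_exactly_2 = binom_pmf(2, n, p)
p_more_than_7 = sum(binom_pmf(x, n, p) for x in range(8, n + 1))

mean = n * p                # E(X) = np
variance = n * p * (1 - p)  # Var(X) = npq

print(round(p_exactly_2, 3), round(p_more_than_7, 2), mean, variance)
```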
Exercise
1. Of the articles produced by a particular machine, 15% are defective. If a sample of 20
articles is taken find the expected number of defective articles and the standard
deviation.(n=20, p=0.15).
2. A very lazy candidate has done no revision for his multiple choice stats exam and
guesses the answer to each of the 40 questions. Given that each question offers 4
alternative answers, only one of which is correct, determine the expectation and
variance. ($n=40$, $p=\tfrac{1}{4}$)
3. Each day a bakery delivers the same number of loaves to a certain shop which sells, on
average 98% of them. Assuming that the number of loaves sold per day has a binomial
distribution with a standard deviation of 7 find the number of loaves the shop would
expect to sell per day. [2450]
4. On a particular farm 60% of all eggs laid by hens are classified ‘large’. Everyday the
farmer selects the same number of eggs at random and sends them as a batch to the
market. Given that the standard deviation of the number of large eggs in a batch is 12,
find
a) The number of eggs in a batch
b) The mean number of large eggs in a batch [600;360]
POISSON DISTRIBUTION
Refers to the counts of items that occur at random points in time or space
A discrete random variable X having a p.d.f. of the form $P(X=x) = \dfrac{e^{-\lambda}\lambda^{x}}{x!}$, x = 0, 1, 2, …,
where λ can take any positive value, is said to follow the Poisson distribution.
It is a suitable model when:
1. the number of occurrences of a particular event must occur in an interval of fixed length
in space or time
2. the events must be independent
3. the events occur singly in continuous space or time
4. the events occur at a constant rate
Example
If X follows a Poisson distribution with λ=3, the mean number of occurrences in a given interval, find
a) P(X=2)
b) P(X≥3)
Solution
a. $P(X=x) = \dfrac{e^{-\lambda}\lambda^{x}}{x!}$
$P(X=2) = \dfrac{e^{-3}\,3^{2}}{2!}$
=0.22 (to 2 s.f.)
b. P(X≥3)
$= 1 - P(X \le 2) = 1 - \left(\dfrac{e^{-3}\,3^{2}}{2!} + \dfrac{e^{-3}\,3^{1}}{1!} + \dfrac{e^{-3}\,3^{0}}{0!}\right)$
=0.58 (to 2 s.f.)
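The same two probabilities can be evaluated numerically. The sketch below assumes nothing beyond the Poisson formula given above; `poisson_pmf` is an illustrative name.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Po(lam): e^(-lam) * lam^x / x!."""
    return exp(-lam) * lam ** x / factorial(x)

lam = 3

p_2 = poisson_pmf(2, lam)
# P(X >= 3) = 1 - [P(0) + P(1) + P(2)]
p_at_least_3 = 1 - sum(poisson_pmf(x, lam) for x in range(3))

print(round(p_2, 2), round(p_at_least_3, 2))
```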
Exercise
1. The number of road accidents per day on a particular stretch of road follows a
Poisson distribution with mean 2.5. Find the probability that on a particular day
there are
i. No accidents [0.082]
ii. Exactly two accidents [0.26]
2. The number of flaws per meter of a particular cloth follows a Poisson distribution
with mean 1.6, find the probability that in
i. 1 m of cloth there are two flaws
ii. 2m of cloth there are exactly 3 flaws
iii. 4m of cloth there are less than 3 flaws [0.26; 0.22; 0.046]
4. In a particular industry there are on average two fatal accidents per year. Find
the probability that
i. The industry is free from fatal accidents
ii. The industry has three fatal accidents in a year [0.14; 0.18]
5. A shop sells a particular product at a rate of 4 per week on average. The number
sold in a week has a Poisson distribution. Find the probability that the shop sells
at least 2 in a week. [0.91]
Example
In a certain manufacturing process the proportion of defective articles being produced is 2%. In
a batch of 200 articles find the probability that there are exactly 4 defectives.
Solution
Let X be the r.v. ‘number of defective articles’.
Then X follows a Binomial(200, 0.02) distribution.
Since n is large and p is small, λ = np = 200 × 0.02 = 4 (<5).
Therefore X ~ Po(4) approximately.
$P(X=4) = \dfrac{e^{-4}\,4^{4}}{4!}$
= 0.20
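To see how close the Poisson approximation is in this example, the sketch below compares it with the exact binomial probability. Both helper functions are written here for illustration and are not part of the notes.

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    """Exact binomial probability P(X = x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """Poisson probability P(X = x) with mean lam."""
    return exp(-lam) * lam ** x / factorial(x)

n, p = 200, 0.02
lam = n * p  # mean of the approximating Poisson distribution

exact = binom_pmf(4, n, p)
approx = poisson_pmf(4, lam)

print(round(exact, 3), round(approx, 3))
```

The two values agree to two decimal places, which is why the approximation is acceptable for large n and small p.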
Exercise
1. The probability that a wrapped chocolate biscuit is double wrapped is 0.01. Use a suitable
approximation to find the probability that of the next 60 biscuits that are unwrapped.
i. None are double wrapped
ii. At least 2 are double wrapped [0.55;0.12]
2. An insurance salesman sells policies to five men, all of the same age and in good health. The
probability that a man of this age will be alive in 30 years is $\tfrac{2}{3}$. Show that the probability of at
least 3 men surviving the 30 years is 0.79 to 2 decimal places.
3. The probability that an individual suffers a bad reaction from an infection of a given serum is
0.001. Use the Poisson distribution to determine the probability that out of 2000 individuals
more than two will suffer a bad reaction.
4. A factory packs bulbs in boxes of 400. The probability that a bulb is defective is 0.01. Find
the probability that a box contains more than 3 defective bulbs. [0.57]
CHAPTER 3:
REGRESSION AND CORRELATION
So far the data dealt with has concerned only one variable. This does not mean that we cannot deal
with two or more variables at the same time. Data involving two variables is known as bivariate
data; data involving more than two variables is called multivariate data. Common examples include
marks of students in various subjects and rainfall patterns in different regions over the years. It
should be noted that the data must always be paired, and each variable must be recorded in a
consistent unit of measurement. For instance, the marks in the various subjects must correspond to
the same student.
Why “regression”
In regression, we are generally interested in finding out the nature and strength of a
relationship between the two or more variables under study, that is, Is there an association
between the values? What is the general trend of one variable as you increase the other?
Scatter Diagram
This is a simple, easy-to-draw graph used to represent bivariate data. One variable is plotted as the X
values and the other as the Y values on a Cartesian plane. The graph gives an idea of the relationship
between the X and Y variables. Having obtained the scatter diagram, we can go on to fit a regression
line and find the correlation. We generally obtain three possible patterns (positive, negative
and no correlation).
The greatest disadvantage of the scatter diagram is that, though it shows the nature of the
relationship, it does not tell us its strength.
Example 1
A B C D E F G H I J
12 18 25 20 16 28 15 12 23 17
15 22 32 13 17 13 10 16 15 30
From the diagram we can observe that it produces a positive relationship.[plot the points]
Exercise
1) The table below shows the volume of sales and total expenses for ten companies. Plot the
figures on a scatter diagram and comment
X 20 2 4 2 1 8 1 8
3 0 3
Y 60 2 2 6 4 1 4 3
5 6 6 1 8 0 3
REGRESSION LINE Y ON X / X ON Y
$y - \bar{y} = m(x - \bar{x})$ where $m = \dfrac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$
Example 1
$\bar{y} = 13$, $\bar{x} = 40.6$, $m = \dfrac{10 \times 6094 - 406 \times 130}{10 \times 18656 - 406^2} = 0.38$
y = 0.38x − 2.4
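The regression line of y on x can be computed from the summary totals alone. The sketch below uses the totals that appear in the working above (n = 10, Σx = 406, Σy = 130, Σxy = 6094, Σx² = 18656); the function name is hypothetical.

```python
def regression_y_on_x(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (m, c) for the line y = m*x + c through (x-bar, y-bar)."""
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    c = sum_y / n - m * (sum_x / n)   # from y - y_bar = m(x - x_bar)
    return m, c

m, c = regression_y_on_x(n=10, sum_x=406, sum_y=130, sum_xy=6094, sum_x2=18656)
print(round(m, 2), round(c, 2))
```

Carrying full precision gives an intercept close to −2.25; the value −2.4 in the working comes from rounding m to 0.38 before substituting.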
Exercise
1. For the given sets of data find the regression line y on x and x on y. Comment on your
answers.
y=0.015x+5.15 ; x=1.41y+48.5
d) $n = 8$, $\sum x = 34$, $\sum y = 1047$, $\sum x^2 = 155$, $\sum y^2 = 138370$ and $\sum xy = 4504$
y=5.17x+109 ; x=0.04y-1.03
f)
x=0.28y-15.5
g)
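X   18  20  21  27  23  34  24  42  38  44
Y   8   5   6   8   7   11  8   10  6   8
PRODUCT MOMENT CORRELATION COEFFICIENT, r
$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$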
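A sketch of the correlation formula applied to summary totals is given below. It uses the totals from exercise d) earlier in this chapter (n = 8, Σx = 34, Σy = 1047, Σx² = 155, Σy² = 138370, Σxy = 4504); `pearson_r` is an illustrative name.

```python
from math import sqrt

def pearson_r(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Product moment correlation coefficient from summary totals."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(n=8, sum_x=34, sum_y=1047, sum_xy=4504, sum_x2=155, sum_y2=138370)
r_squared = r ** 2   # proportion of variation explained

print(round(r, 3), round(r_squared, 3))
```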
Interpretation of the coefficient of correlation
-If r is near −1, it shows a very high degree of negative correlation. r = −1
implies a perfect inverse association between x and y, i.e. all the sample points fall on a
straight line with negative slope.
-A value of r near +1 indicates a high degree of positive correlation. r = 1 implies a
perfect direct correlation between x and y, i.e. all the sample points fall on a straight
line with positive slope.
Therefore a value near −1, e.g. −0.89, indicates a very strong negative correlation, and a value
near +1 a very strong positive correlation. As a rough guide, a magnitude between 0.6 and 0.8 is
strong, while 0.3<|r|<0.6 implies a weak linear relationship. Below that there is little or no
linear correlation.
COEFFICIENT OF DETERMINATION
This is equal to the square of the coefficient of correlation and is denoted by r². This value
lies between 0 and 1. It gives the percentage of variation in one variable explained by
variation in the other variable. For example, if r = 0.84 then r² = 0.71 and we conclude that
71% of the variation in y can be explained by the regression equation, leaving 29% to be
explained by other variables or factors, thereby indicating a high positive linear
relationship between the two variables.
Similarly, r = 0.7 implies a strong/high correlation, yet r² = 0.49. Thus about 49% of the
variation in y is explained by the variation in x, indicating a moderate relationship between
the two variables.
EXERCISE
1. For the given sets of data find the product moment correlation coefficient, r and the
coefficient of determination r2. Comment on your answers.
2. The records for Alpha Steel for the past 8 years were as shown below:
Prediction
The regression line is used to predict future or expected values of y given values of x.
Simply substitute the value of x into the equation.
EXERCISE
1. A shop sells home computers. The numbers of computers sold in each of five successive
years were as follows:
Year (x) 1 2 3 4 5
Sales (y) 10 30 70 140 170
2. The table below shows for each of the years 1985 to 1991, the average unemployment
rate x (expressed as a percentage) and the number of crimes committed y (measured in
millions). This was in Scotland.
a) Plot a scatter diagram on graph paper, showing unemployment rate on the horizontal
axis and number of crimes on the vertical axis.
b) Calculate the product moment correlation coefficient between x and y and interpret
the result of the calculation in terms of your scatter diagram
c) The number of years after 1985 is denoted by t. calculate the equation of the
regression line of y on t in the form y = a + bt where a and b are constants.
d) Use your equation to estimate the number of crimes committed in 1992 and discuss
briefly the likely reliability of this estimate.
[-0.884 as unemployment rate decreases the number of crimes increases,
y=0.784+0.035t, 1.03, as 1992 is out of range of the years under consideration,
estimate is not reliable unless the trend continues]
CHAPTER 4:
$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$
From the p.d.f. above we can see that the distribution of X depends only on μ and σ, and thus
instead of remembering the above formula it is sufficient to refer to the random
variable as having a normal distribution by using the notation:
$X \sim N(\mu, \sigma^2)$
The first parameter in the bracket is the mean μ and the second one is the variance σ².
If $X \sim N(50, 3^2)$ the mean is 50, the standard deviation is 3 and the variance is 9. What about here?
a) $X \sim N(80, 25)$
b) $X \sim N(20, 10)$
The normal distribution is the most important continuous distribution in statistics because
many quantities in natural sciences follow a normal distribution.
To be able to evaluate probabilities associated with the normal distribution the standard
normal distribution is used which has a mean of zero and a standard deviation of 1. The
standard normal variable is denoted by Z where
$Z \sim N(0, 1^2)$
It should be noted therefore that any normal distribution can be mapped onto the standard
normal distribution. This transformation is achieved by subtracting the mean and dividing
by the standard deviation
$Z = \dfrac{X - \mu}{\sigma}$
a) $X \sim N(50, 12^2)$
b) $X \sim N(80, 25)$
c) $X \sim N(20, 10)$
To be able to make use of the standard normal tables it is sufficient to make good clear
diagrams.
EXERCISE
Example
a) $P(X>17) = P\left(\dfrac{X-12}{2} > \dfrac{17-12}{2}\right)$ by standardizing
=P(Z>2.5)
=0.00621
b) $P(X<10) = P\left(\dfrac{X-12}{2} < \dfrac{10-12}{2}\right)$ by standardizing
=P(Z<-1)
=0.1587
c) $P(9<X<13) = P\left(\dfrac{9-12}{2} < Z < \dfrac{13-12}{2}\right)$ by standardizing
=P(-1.5<Z<0.5)
=0.6247
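Table look-ups like these can be checked with the error function from the standard library. This is a sketch under one assumption: the mean 12 and standard deviation 2 are read off from the standardizing step, since the statement of the example is not shown. The name `normal_cdf` is illustrative.

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma^2), via the standard normal Z."""
    z = (x - mu) / sigma                 # standardize
    return 0.5 * (1 + erf(z / sqrt(2)))  # Phi(z)

mu, sigma = 12, 2   # assumed from the working: (X - 12) / 2

p_a = 1 - normal_cdf(17, mu, sigma)                       # P(X > 17)
p_b = normal_cdf(10, mu, sigma)                           # P(X < 10)
p_c = normal_cdf(13, mu, sigma) - normal_cdf(9, mu, sigma)  # P(9 < X < 13)

print(round(p_a, 5), round(p_b, 4), round(p_c, 4))
```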
EXERCISE
1) Packages from a packing machine have a mass that is normally distributed with
mean of 56g and standard deviation 10g. find the probability that a package from
the machine weighs
a) Greater than 68g
b) Between 56 and 65g [0.1151;0.3159]
2) The mean weight of a consignment of 500 sacks of sugar is 151kg and a standard
deviation of 15kg. Assuming that the weights are normally distributed, find how
many sacks weigh
a) Between 120 and 155kg
b) Less than 120kg [294;32]
3) The number of hours of the life of a torch battery is normally distributed with a
mean of 120hours and a standard deviation of 16 hours. Find the probability that a
torch battery has a life of
a) More than 140 hours
b) Between 110 and 128 hours [0.1056;0.4255]
4) The masses of tablets of chocolates produced by a certain machine are found to be
normally distributed with a mean of 140g and standard deviation of 5g. estimate
the number of tablets in a batch of 200 whose masses
a) Are greater than 145g
i. The random variable X is normally distributed with mean μ and standard deviation
5. Given that P(X<27.5)=0.8686 find μ [21.9]
ii. The random variable X is normally distributed with mean 50 and standard deviation
σ . Given that P(X>54)=0.2538 find σ to 2 decimal places [6.04]
iii. The random variable X is normally distributed with mean μ and standard deviation
σ. Given that P(X>34)=0.0228 and P(X<25)=0.0062 find μ and σ. [σ =2, μ =30]
n is large enough
np>5
nq >5
Note that the binomial is used with discrete data while normal is used with
continuous random variables hence we are approximating discrete by a continuous
one and an allowance must be made for this using the half continuity correction.
CHAPTER 5:
The critical region of a test statistic Z is the range of values of Z such that if the value of Z,
namely z, obtained from your particular sample lies in this critical region then you reject the
null hypothesis. The boundary value of the critical region is called the critical value.
There are two types of tests that can be performed depending on the alternative hypothesis
being made. These are a one tailed test and a two tailed test.
One-tailed test
This looks for a definite increase or decrease in the parameter. Lookout for words such as
increase, rise, growth, underestimation! The tail is to the right in this case.
Example 1
H0: μ=30
H1: μ>30
To find the critical value at the 5% significance level we use the normal tables, which give
1.645. (How you get this depends on the kind of tables used. Some give a list of critical
values, whereas with others you have to compute 100%-5%=95% and then look up the value
corresponding to 0.95.) For this right-tailed test we reject H0 if Zcalculated>Ztabulated.
Two-tailed test
This looks for any change in the parameter. It does not reveal or imply an ‘increase’.
Example 2
H0: μ=30
H1: μ ≠30
To find the critical values at 5% we first divide 5% by 2 to get 2.5% in each of the
tails (two-tailed). Proceed as before, e.g. 100%-2.5%=97.5%. From the tables Z=1.96, so the
critical values are ±1.96.
Test statistic
$Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
STEPS TO FOLLOW WHEN TESTING A MEAN
Example 1
A certain highway in a city centre has been investigated in the past for noise pollution by
vehicles. As a result of many measurements it has been found that between 2.30 and 3.30 p.m.
on any weekday the average noise level is 130 decibels with a standard deviation of 20
decibels. The residents are convinced that the noise levels are getting worse. Having taken
50 random readings, they obtained an average of 134 decibels. Is their claim justified? Test
at the 5% significance level.
H0: μ=130
H1: μ>130
The test statistic is $Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
$Z = \dfrac{134 - 130}{20 / \sqrt{50}}$
Z=1.41
Since 1.41 < 1.645, the one-tailed critical value at 5%, we do not reject H0: there is
insufficient evidence that the noise level has increased.
Example 2
At a certain college new students are weighed when they join the college. The distribution
of weights of students at the college when they enroll is normal with a standard deviation
of 7.5kg and a mean of 70kg. A random sample of 90 students from the new entry was
weighed and their mean weight was 71.6kg. Assuming that the standard deviation did not
change and that the weights of the new class were also normally distributed test at 5%
significance level, whether there is evidence that the mean of the new entry has changed.
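H0: μ=70
H1: μ≠70
The test statistic is $Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
$Z = \dfrac{71.6 - 70}{7.5 / \sqrt{90}}$
Z=2.02
Since 2.02 > 1.96, the two-tailed critical value at 5%, we reject H0: there is evidence that
the mean weight of the new entry has changed.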
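The two worked examples above can be checked numerically. The sketch below is illustrative: `z_statistic` is a made-up helper, and the critical values 1.645 and 1.96 are the 5% one-tailed and two-tailed values quoted earlier in this chapter.

```python
from math import sqrt

def z_statistic(sample_mean, mu0, sigma, n):
    """Z = (x_bar - mu0) / (sigma / sqrt(n)) for a test of a mean."""
    return (sample_mean - mu0) / (sigma / sqrt(n))

# Example 1: noise levels, one-tailed test (H1: mu > 130).
z1 = z_statistic(sample_mean=134, mu0=130, sigma=20, n=50)
reject_1 = z1 > 1.645

# Example 2: student weights, two-tailed test (H1: mu != 70).
z2 = z_statistic(sample_mean=71.6, mu0=70, sigma=7.5, n=90)
reject_2 = abs(z2) > 1.96

print(round(z1, 2), reject_1)
print(round(z2, 2), reject_2)
```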
EXERCISE
1. The masses of loaves from a certain bakery are normally distributed with mean 450g
and standard deviation 20g. A sample of 36 loaves yielded a mean mass of 438g. Does this
provide sufficient evidence of a reduced population mean? [Z cal=-3.6<-1.645; reject H0]
2. Experience has shown that the scores obtained in a particular test at the VID are
normally distributed with mean 80 and standard deviation 7. When the test was taken by a
random sample of 64 candidates, the mean score was 76.5. Is there sufficient evidence at
3% that the candidates did not perform as well as expected? [Z cal=-4<-1.881; reject H0]
3. A chamber of commerce claims that the average take-home pay of manual workers in
fulltime employment in its area is $140.00 per week. A sample of 125 such workers had a
mean take-home pay of $148.00 with a standard deviation of $28.00. Test at 5%
significance level, the hypothesis that the mean take-home pay of all manual workers in the
area is $140.00. [Zcal=3.194>1.96; reject H0]
4. A pharmaceutical company claimed that a course of its vitamin tablets would improve
examination performance. To publicize its claim the company gave free tablets to some but
not all students taking a particular ZIMSEC examination. The average mark in the
examination for all candidates who did not take the course was 42. A random sample of
120 candidates from those who had taken the course of the vitamin tablet gave a mean
mark of 43.8 and a standard deviation of 12.8. Test at 5% significance level whether the
candidates who took the vitamin tablets had a mean mark greater than 42.
$f(x) = C_\nu \left(1 + \dfrac{x^2}{\nu}\right)^{-\frac{1}{2}(\nu+1)}$ for $-\infty < x < \infty$
X has one parameter, ν, known as the number of degrees of freedom. The letter ν is
pronounced ‘nu’. The constant $C_\nu$ depends on ν. We say X ~ t(ν).
If we want to find the critical value of t(3) at 5% for a two-tailed test, each tail represents
2.5%, so we take 100%-2.5%=97.5% and use the t-tables at 0.975 with 3 degrees of freedom.
Hence t(3) at 5% = 3.182 two-tailed. Similarly t(5) at 1% one-tailed = 3.365. Now practise.
EXERCISE
THE t-TEST
$T = \dfrac{\bar{x} - \mu}{s / \sqrt{n-1}}$, which is distributed as t(n−1) under the null hypothesis that the true population mean is μ.
Example
It is required to test whether the sample could be drawn from a population whose mean
cholesterol is 3.1. Perform a t-test at 5% significance level. Make relevant conclusions.
H0: μ=3.1
Classwork
Six cleaning firms were selected at random and asked about their hourly rates of pay $x
with the following results
Carry out a t-test at the 1% significance level to establish whether the mean hourly rate of pay,
paid by cleaning firms falls below a proposed minimum of $7.40
EXERCISE
Use a 5% sig. level to decide whether the innovation has succeeded in reducing average
process time.
2) Five readings of the resistance in ohms of a piece of wire gave the following results
1.51 1.49 1.54 1.52 1.54
If the wire were pure silver its resistance would be 1.50 ohms. If the wire were impure
the resistance would be increased. Test at the 5% sig. level the hypothesis that the wire is
pure silver. [t=2.828>1.895; reject H 0]
3) An advertisement for a certain model of mobile phone promised that it allowed ‘3.5
hours talk time’. A random sample of 8 phones of this type was tested and the
lengths of talk time, x hours, available were as follows
3.4 3.7 3.1 3.2 3.6 3.5 3.3 3.4
Test at the 5% sig. level whether the advertisement was overstating the
length of talk time available from the phone. [t=-1.414>-1.895; accept H0]
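The t statistic can be computed directly from raw readings. The sketch below applies the formula from this chapter, with s taken as the sample standard deviation using divisor n (which is what makes the s/√(n−1) form work), to the talk-time data in question 3. The function name is illustrative.

```python
from math import sqrt

def t_statistic(sample, mu0):
    """T = (x_bar - mu0) / (s / sqrt(n - 1)), with s computed using divisor n."""
    n = len(sample)
    mean = sum(sample) / n
    s = sqrt(sum((x - mean) ** 2 for x in sample) / n)
    return (mean - mu0) / (s / sqrt(n - 1))

talk_time = [3.4, 3.7, 3.1, 3.2, 3.6, 3.5, 3.3, 3.4]
t = t_statistic(talk_time, mu0=3.5)
degrees_of_freedom = len(talk_time) - 1

print(round(t, 3), degrees_of_freedom)
```

The result is then compared with the tabulated t value for n−1 degrees of freedom at the chosen significance level.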
CHAPTER 6:
A girl threw a coin 10 times and the number of heads and tails were summarized as shown
χ² TEST STATISTIC
$\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e}$
χ² DISTRIBUTION
ii) positively skewed for ν>2, and as ν becomes large the distribution is approximately
normal.
DEGREES OF FREEDOM
ν = number of cells – number of restrictions (we will explain through examples how this is
calculated)
calculated)
Similarly the chi-squared tables are read more or less the same way as other tables. Enjoy!
a) $\chi^2_{5\%}(9)$
b) $\chi^2_{5\%}(7)$
c) $\chi^2_{1\%}(8)$
d) $\chi^2_{0.1\%}(4)$
60 52 1.23
56 44 3.27
Total $\chi^2_{cal} = 8.37$
$\chi^2_{cal} = 8.37 < 11.34$, accept H0
The frequency f with which each x occurs when the number of observations is N
is given by f = P(X=x) × N
Example
X 0 1 2 3 4 5
F 12 39 27 15 4 3
EXERCISE
1. A fourth-year Agriculture student was trying to model the number of female
kittens born in a litter. She examined 250 litters of size 5 and obtained the
following results:
No. of females      0   1    2    3    4    5
Number of litters   2   40   90   85   30   3
Test at the 5% level whether or not the binomial distribution with parameters n=5
and p=0.5 is an adequate model for these data. [$\chi^2_{cal}$ = 11.824; reject H0]
No. of defectives 0 1 2 3 4 5
frequency 20 31 30 10 6 3
Using 5% level and an appropriate test to see if these results can be modeled by
a binomial distribution with p=0.05.
Mean $= \lambda = \dfrac{\sum xf}{\sum f}$
If λ is given: ν = no. of cells − 1
Example
A factory produces a serum for use in hospitals. The number of bacteria per 1ml of
serum is thought to have a Poisson distribution with a mean of 3. The laboratory
producing the serum tested 150 batches with the following results:
No. of bacteria 0 1 2 3 4 5 6 7 8
Frequency 10 18 27 38 28 14 10 5 0
[fe= 7.5, 22.4 ,33.6 ,33.6 ,25.2 ,15.1 ,7.6 ,3.2 ,1.8 (combine frequencies less than 5);
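The expected frequencies quoted above follow from f = P(X=x) × N with λ = 3 and N = 150. The sketch below reproduces them; it assumes the last cell stands for "8 or more", which is the reading that matches the quoted value 1.8. The function name is hypothetical.

```python
from math import exp, factorial

def poisson_expected(lam, total, last_cell):
    """Expected frequencies for x = 0 .. last_cell, where the final
    cell collects everything from last_cell upwards."""
    probs = [exp(-lam) * lam ** x / factorial(x) for x in range(last_cell)]
    probs.append(1 - sum(probs))   # tail probability P(X >= last_cell)
    return [total * p for p in probs]

expected = poisson_expected(lam=3, total=150, last_cell=8)
print([round(f, 1) for f in expected])
```

Cells with expected frequency below 5 would then be combined with a neighbouring cell before the χ² statistic is calculated.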
EXERCISE
1. A department store records the daily number of sales charged to stolen credit cards.
The results for the first months of 1990 are as follows:
Explain why a Poisson distribution may be appropriate as a model for the daily number of
sales charged to stolen credit cards.
Test at 5% level the hypothesis that the daily number of sales does follow a Poisson
distribution. [ χ 2cal =¿0.48; accept H0]
2. The number of times a farm tractor broke down each week was recorded over 100
weeks with the following results:
No. of breakdowns, X 0 1 2 3 4 5
Frequency, f 50 24 12 9 5 0
It is thought that the distribution is Poisson, justify. Conduct a test at 5% level using λ=0.95
and see if the assumption is reasonable.
Sometimes situations arise when individuals are classified according to two sets of
attributes, e.g.
We may wish to investigate whether the attributes are independent or whether there is
evidence of an association between them.
Define H0
Find row totals, column totals and grand total
Find expected frequencies
Expected frequency $= \dfrac{\text{row total} \times \text{column total}}{\text{grand total}}$
Find degrees of freedom ν = (no. of rows − 1) × (no. of columns − 1)
Find the critical value of chi-square from the tables
Calculate $\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e}$
State conclusion
Example
[fe= 6.41, 6.51, 8.09, 24.71, 25.11, 31.19, 29.89, 30.38, 37.73 (combine frequencies less than
5); ν=4; $\chi^2_{cal}$ = 16.9; reject H0]
EXERCISE
1. The following are data on 150 chickens divided into two groups according to breed and
into three groups according to yield of eggs.
YIELD
BRAND NAME High Medium Low
Chantecler 46 29 28
Carmen 27 14 6
2. Two factories using the materials purchased from the same supplier and closely
controlled to an agreed specification produce output for a given period classified into three
quality grades A, B, C as follows:
OUTPUT IN TONNES
FACTORY A B C
X 42 13 33
y 20 8 25
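The contingency-table steps in this chapter (row and column totals, expected frequencies, degrees of freedom, χ²) can be sketched as follows. The helper names are invented for this illustration, and the table is the chicken data from question 1 above; the result still has to be compared with the tabulated critical value.

```python
def expected_frequencies(table):
    """Expected frequency = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_frequencies(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

# Rows: the two breeds; columns: high, medium, low yield.
chickens = [[46, 29, 28],
            [27, 14, 6]]

expected = expected_frequencies(chickens)
dof = (len(chickens) - 1) * (len(chickens[0]) - 1)
statistic = chi_square(chickens)

print([[round(e, 2) for e in row] for row in expected], dof)
```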
CHAPTER 7
The way a sample is selected is called the sampling plan or experimental design and it
determines the amount of information in the sample.
Some definitions
Factor: an independent variable whose values are controlled and varied by the experimenter
EXERCISE
a) A group of people is randomly divided into an experimental and control group. Both
the control and experimental groups are given an aptitude test after having a full
breakfast and no breakfast in each case respectively. Determine the experimental
units, factors, levels and treatments for this experiment
Treatment: 4
Assumptions**
Observations are
- linearly independent
This involves comparing the k populations with means μ1 up to μk and deciding if there are
significant differences between them.
Calculations
$CF = \dfrac{(\text{grand total})^2}{N}$
$SS_{total} = \sum X_i^2 - CF$
$SS_{treatment} = \sum \dfrac{T_i^2}{n} - CF$
ANOVA TABLE
Conclusion
If you reject H0 then there is a need to show between which pairs the difference really lies.
Thus you perform a least significant difference (L.S.D.) test. The difference is insignificant if
$|\bar{X}_A - \bar{X}_B| < LSD$; otherwise it is significantly different.
$LSD = t_{\alpha/2}[df]\; s\sqrt{\dfrac{1}{n_A} + \dfrac{1}{n_B}}$ where $s^2 = MSE$
Confidence intervals
For just one mean: $\bar{X}_A \mp t_{\alpha/2}[df]\sqrt{\dfrac{s^2}{n_A}}$
For comparing any two means: $(\bar{X}_A - \bar{X}_B) \mp t_{\alpha/2}[df]\sqrt{s^2\left(\dfrac{1}{n_A} + \dfrac{1}{n_B}\right)}$
Report
The report is simply a brief discussion on the analysis of the results found on L.S.D.
Exercise
i. Construct the analysis of variance, ANOVA, table for this data and draw relevant
conclusions
ii. Perform a further analysis if necessary and write a report
1. A marketing student at Solusi University heard that you were doing experimental
designs. She is clueless about data analysis and needs your assistance. The
following is a question she brought to you. The task was to determine if there
was a difference in the mean fuel consumption for three makes of cars.
Three types of medium-sized cars assembled in Japan have been test driven by a
motoring magazine and compared on a variety of criteria. In the area of fuel efficiency
performance, five cars of each brand were each test driven 1000km; the km per litre
data are obtained as follows:
You decide to use SPSS to do the analysis and you remember that before any analysis
is done, ALL the Analysis of Variance (ANOVA) assumptions MUST be met. The
residual plot outputs are shown below.
[Figure: residual plots for FuelConsumption. Panels include a histogram of residuals and a fitted-value plot (residuals against fitted values); the remaining axes are labelled Residuals and Absolute values of residuals.]