Biostat Lecture Four
Descriptive Statistics:
teshomedemis112@gmail.com 1
Measures of Central Tendency (MCT)
• A frequency distribution is a general picture of the
distribution of a variable .
• However, it cannot by itself indicate the average value or the
spread of the values.
• The tendency of the statistical data to get
concentrated at a certain value is called “central
tendency”
• The various methods of determining the point
about which the observations tend to concentrate
are called MCT.
• The most common measures of central tendency include:
Arithmetic Mean
Median
Mode
Others
1. Arithmetic Mean
A. Ungrouped Data
• The arithmetic mean is the "average" of the data set and is by far the most widely used measure of central location. It is usually denoted by $\bar{x}$.
• It is the sum of all the observations divided by the total number of observations: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
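The definition above can be sketched in a few lines of Python. The function name `arithmetic_mean` and the five ages are hypothetical, chosen only for illustration.

```python
def arithmetic_mean(values):
    """Return the sum of the observations divided by their count."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Hypothetical ages (in years) of five subjects, for illustration only.
ages = [23, 31, 27, 45, 34]
mean_age = arithmetic_mean(ages)
print(f"n = {len(ages)}, sum = {sum(ages)}, mean = {mean_age}")  # mean = 32.0
```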
B. Grouped Data
• In calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the mid-point of the interval. It is calculated as follows:
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
where,
$k$ = the number of class intervals
$m_i$ = the mid-point of the ith class interval
$f_i$ = the frequency of the ith class interval
Example. Compute the mean age of 169 subjects from the
grouped data.
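A minimal sketch of the grouped-mean calculation in Python. The class mid-points and frequencies are the ones listed in the variance example later in this lecture (age classes 10-19 through 60-69); the function name `grouped_mean` is an assumption made for illustration.

```python
def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum(m_i * f_i) / sum(f_i)."""
    if len(midpoints) != len(frequencies):
        raise ValueError("each class needs one mid-point and one frequency")
    n = sum(frequencies)
    if n == 0:
        raise ValueError("total frequency must be positive")
    return sum(m * f for m, f in zip(midpoints, frequencies)) / n


# Age classes 10-19, 20-29, ..., 60-69 (mid-points and frequencies).
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

n = sum(frequencies)                                         # 169
sum_mf = sum(m * f for m, f in zip(midpoints, frequencies))  # 5810.5
mean_age = grouped_mean(midpoints, frequencies)
print(f"n = {n}, sum(m*f) = {sum_mf}, mean = {mean_age:.2f} years")  # 34.38
```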
When the data are skewed, the mean is “dragged” in
the direction of the skewness .
• In extreme cases, all but one of the sample points can lie on one side of the arithmetic mean. In this case, the mean is a poor measure of central location because it does not reflect the center of the sample.
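The extreme case described above can be illustrated with a small hypothetical sample in which a single large value drags the mean past every other observation. The data values are invented for this sketch.

```python
# Hypothetical, strongly right-skewed sample (illustration only).
sample = [2, 3, 3, 4, 88]

mean = sum(sample) / len(sample)        # 100 / 5 = 20.0
below = [x for x in sample if x < mean]  # observations below the mean
above = [x for x in sample if x > mean]  # observations above the mean

print(f"mean = {mean}")
print(f"{len(below)} values below the mean, {len(above)} above")
```

Four of the five observations lie below the mean, so the mean says little about where most of the data sit.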
Properties of the Arithmetic Mean.
• For a given set of data there is one and only one arithmetic
mean (uniqueness).
• Easy to calculate and understand (simple).
• Influenced by each and every value in a data set.
• Greatly affected by extreme values.
• For grouped data, if any class interval is open-ended, the arithmetic mean cannot be calculated, because that class has no mid-point.
2. Median
a) Ungrouped data
• The median is the value which divides the data set into two equal
parts.
• If the number of values is odd, the median will be the
middle value when all values are arranged in order of
magnitude.
• When the number of observations is even, there is no
single middle value but two middle observations.
• In this case the median is the mean of these two middle
observations, when all observations have been arranged in
the order of their magnitude.
• The median is a better description (than the mean) of the
majority when the distribution is skewed .
• Example
– Data: 14, 89, 93, 95, 96
– Skewness is reflected in the outlying low value of 14
– The sample mean is 77.4
– The median is 93
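The odd/even rule for the ungrouped median, and the skewed example above, can be checked with a short Python sketch. The function name `median` and the four-value even-sized data set are assumptions added for illustration.

```python
def median(values):
    """Middle value of the ordered data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                               # odd n: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the two middle values


skewed = [14, 89, 93, 95, 96]                # data from the example above
mean_value = sum(skewed) / len(skewed)       # 387 / 5 = 77.4
median_value = median(skewed)                # 93

even_case = median([5, 9, 12, 16])           # hypothetical data: (9 + 12) / 2 = 10.5

print(mean_value, median_value, even_case)
```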
b) Grouped data
• In calculating the median from grouped data, we
assume that the values within a class-interval are
evenly distributed through the interval.
• The first step is to locate the class interval in which
the median is located, using the following procedure.
• Find n/2 and identify the first class interval whose cumulative
frequency is at least n/2; this is the median class.
• Then, use the following formula.
$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$
where,
$L_m$ = lower true class boundary of the interval containing the median
$F_c$ = cumulative frequency of the class interval immediately preceding the median class interval
$f_m$ = frequency of the interval containing the median
$W$ = class interval width
$n$ = total number of observations
Example. Compute the median age of 169
subjects from the grouped data.
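• n/2 = 169/2 = 84.5, which falls in the 3rd class interval
• Lower true class boundary Lm = 29.5, upper true class boundary = 39.5, so W = 10
• Frequency of the median class fm = 47
• (n/2 – Fc) = 84.5 – 70 = 14.5
• Median = 29.5 + (14.5/47) × 10 = 32.59 years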
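The grouped-median procedure can be sketched in Python as follows. The class frequencies are taken from the grouped age table in the variance example later in this lecture; the lower true boundaries (9.5, 19.5, ...) are an assumption obtained by subtracting 0.5 from each lower class limit, and the function name `grouped_median` is hypothetical.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median of grouped data: L_m + ((n/2 - F_c) / f_m) * W."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("total frequency must be positive")
    half = n / 2
    cumulative = 0  # F_c: cumulative frequency before the current class
    for lower, f in zip(lower_boundaries, frequencies):
        if cumulative + f >= half:  # first class whose cumulative frequency reaches n/2
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("median class not found")


lower_boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]  # assumed true boundaries
frequencies = [4, 66, 47, 36, 12, 4]                     # n = 169

median_age = grouped_median(lower_boundaries, frequencies, width=10)
print(f"median = {median_age:.2f} years")  # 29.5 + (14.5 / 47) * 10, about 32.59
```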
Properties of the median
• There is only one median for a given set of data
(uniqueness)
• The median is easy to calculate
• Median is a positional average and hence it is
insensitive to very large or very small values .
• Median can be calculated even in the case of
open end intervals
• It is determined mainly by the middle points and
less sensitive to the remaining data points
(weakness).
3. Mode
a) Ungrouped data
• It is a value which occurs most frequently in a set of
values.
• If all the values are different, there is no mode. On the
other hand, a set of values may have more than one
mode.
• Example
• Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
• Mode is 4
• Example
• Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
• There are two modes – 2 & 5
• This distribution is said to be “bi-modal”
• Example
• Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
• No mode, since all the values are different
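The three examples above can be reproduced with a small Python sketch. The function name `modes` is hypothetical, and returning an empty list when every value occurs once is a choice made here to match the lecture's "no mode" convention.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:  # every value occurs exactly once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6]))           # [4]
print(modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8]))     # [2, 5] (bi-modal)
print(modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12]))     # [] (no mode)
```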
b) Grouped data
• To find the mode of grouped data, we usually refer to
the modal class, where the modal class is the class
interval with the highest frequency.
• If a single value for the mode of grouped data must
be specified, it is taken as the mid-point of the modal
class interval.
$$\hat{x} = L + \frac{w f_2}{f_0 + f_2}$$
where
$L$ = lower boundary of the modal class
$f_0$ = the frequency of the class next below the modal class in value
$f_2$ = the frequency of the class next above the modal class in value
$w$ = length of the interval of the modal class
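A hedged sketch of both grouped-mode options in Python: the mid-point of the modal class, and an interpolated value that reads the formula on this slide as L + w·f2/(f0 + f2). That reading, the function names, the fallback to the mid-point when both neighbouring frequencies are zero, and the assumed true class boundaries are all assumptions. The frequencies come from the grouped age table in the variance example later in this lecture.

```python
def modal_class_index(frequencies):
    """Index of the class interval with the highest frequency (first one on ties)."""
    return max(range(len(frequencies)), key=lambda i: frequencies[i])


def grouped_mode(lower_boundaries, frequencies, width):
    """Interpolated mode, assuming the formula L + w * f2 / (f0 + f2)."""
    i = modal_class_index(frequencies)
    f0 = frequencies[i - 1] if i > 0 else 0                     # class just below
    f2 = frequencies[i + 1] if i < len(frequencies) - 1 else 0  # class just above
    if f0 + f2 == 0:  # assumption: fall back to the mid-point of the modal class
        return lower_boundaries[i] + width / 2
    return lower_boundaries[i] + width * f2 / (f0 + f2)


lower_boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]  # assumed true boundaries
frequencies = [4, 66, 47, 36, 12, 4]

i = modal_class_index(frequencies)             # 1, i.e. the 20-29 class
midpoint_mode = lower_boundaries[i] + 10 / 2   # 24.5
interpolated_mode = grouped_mode(lower_boundaries, frequencies, width=10)
print(i, midpoint_mode, round(interpolated_mode, 2))  # 19.5 + 10 * 47 / 51
```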
Properties of mode
It is not affected by extreme values
It can be calculated for distributions with open end
classes
Often its value is not unique
The main drawback of mode is that often it does not
exist
Which measure of central tendency is best with a
given set of data?
• The mean can be used for discrete and continuous data .
• The median is appropriate for discrete and continuous
data as well, but can also be used for ordinal data.
• The mode can be used for all types of data, but may be
especially useful for nominal and ordinal measurements .
• For discrete or continuous data, the “modal class” can be
used .
(a) Symmetric and unimodal distribution: the mean, median,
and mode should all be approximately the same.
(b) Bimodal: the mean and median should be about the
same, but they may take a value that is unlikely to occur;
reporting the two modes might be best.
(c) Skewed to the right (positively skewed): the mean is
sensitive to extreme values, so the median might be more
appropriate.
[Figure: right-skewed curve with the mode, median, and mean marked]
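(d) Skewed to the left (negatively skewed): same as (c); the mean is pulled toward the extreme low values, so the median might be more appropriate.
[Figure: left-skewed curve with the mode, median, and mean marked]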
Measures of Dispersion
These two distributions have the same mean,
median, and mode
• Measures of central tendency alone are not enough to give a clear
understanding of the distribution of the data; the spread of the values is also needed.
Other synonymous terms:
– "Measure of Variation"
– "Measure of Spread"
– "Measure of Scatter"
• Measures of dispersion include:
– Range
– Variance
– Standard deviation
– Coefficient of variation
– Standard error
– Others
1. Range (R)
• The difference between the largest and smallest
observations in a sample.
• Example –
– Data values: 5, 9, 12, 16, 23, 34, 37, 42
– Range = 42-5 = 37
• A data set with a higher range exhibits more variability.
Properties of range
It is the simplest crude measure and can be easily
understood
It takes into account only two values which causes it to be
a poor measure of dispersion
Very sensitive to extreme observations
The larger the sample size, the larger the range tends to be
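The range example and its sensitivity to extreme observations can be sketched in Python. The data are the eight values from the range example; the added value of 120 is hypothetical and only illustrates the effect of one extreme observation.

```python
def data_range(values):
    """Difference between the largest and smallest observations."""
    if not values:
        raise ValueError("the range of an empty data set is undefined")
    return max(values) - min(values)


values = [5, 9, 12, 16, 23, 34, 37, 42]
r = data_range(values)                    # 42 - 5 = 37

# One hypothetical extreme observation changes the range drastically.
r_with_extreme = data_range(values + [120])  # 120 - 5 = 115

print(r, r_with_extreme)
```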
2. Variance ($\sigma^2$, $s^2$)
• Variance is used to measure the dispersion of values
relative to the mean.
• The variance is the average of the squares of the
deviations taken from the mean.
• When values are close to their mean (narrow range) the
dispersion is less than when there is scattering over a
wide range.
– Population variance = $\sigma^2$
– Sample variance = $S^2$
a) Ungrouped data
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Degrees of freedom
• In computing the variance there are (n-1) degrees of
freedom because only (n-1) of the deviations are
independent from each other .
• The last one can always be calculated from the others
automatically.
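The degrees-of-freedom argument can be checked numerically: the deviations from the sample mean always sum to zero, so the last one is fixed by the other n-1. The sketch below reuses the eight values from the range example as an assumption, and compares the hand-rolled result with `statistics.variance` from the Python standard library, which also divides by n-1.

```python
import statistics

values = [5, 9, 12, 16, 23, 34, 37, 42]  # data reused from the range example
n = len(values)
mean = sum(values) / n                    # 22.25

deviations = [x - mean for x in values]

# The deviations sum to zero, so the last deviation follows from the first n-1.
last_from_others = -sum(deviations[:-1])

# Sample variance: sum of squared deviations divided by (n - 1).
sample_variance = sum(d ** 2 for d in deviations) / (n - 1)

print(sum(deviations), last_from_others, deviations[-1])
print(sample_variance, statistics.variance(values))
```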
b) Grouped data
$$S^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$$
where
$m_i$ = the mid-point of the ith class interval
$f_i$ = the frequency of the ith class interval
$k$ = the number of class intervals
$\bar{x}$ = the sample mean
Properties of Variance:
The main disadvantage of the variance is that its unit
is the square of the unit of the original
measurement values.
The variance gives more weight to extreme
values than to values near the
mean, because the deviations are squared.
• The first drawback, the squared unit, is overcome by the
standard deviation.
4. Standard deviation ($\sigma$, $s$)
• It is the square root of the variance.
• This produces a measure having the same scale as
that of the individual values.
$$\sigma = \sqrt{\sigma^2} \quad \text{and} \quad S = \sqrt{S^2}$$
Example. Compute the variance and SD of the age of 169 subjects from the grouped data.

Class interval   (mi)   (fi)   (mi-Mean)   (mi-Mean)2   (mi-Mean)2 fi
10-19            14.5     4     -19.88      395.2144      1580.8576
20-29            24.5    66      -9.88       97.6144      6442.5504
30-39            34.5    47       0.12        0.0144         0.6768
40-49            44.5    36      10.12      102.4144      3686.9184
50-59            54.5    12      20.12      404.8144      4857.7728
60-69            64.5     4      30.12      907.2144      3628.8576
Total                   169                1907.2864     20197.6336

Mean = 5810.5/169 = 34.38 years
S2 = 20197.63/(169-1) = 120.22
SD = √S2 = √120.22 = 10.96
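The grouped variance and SD can be recomputed with a short Python sketch using the mid-points and frequencies from the table. The function name `grouped_variance` is hypothetical. The code keeps the mean unrounded, so its result may differ in the last decimal place from a hand calculation that rounds the mean first.

```python
import math


def grouped_variance(midpoints, frequencies):
    """Sample variance of grouped data: sum(f_i * (m_i - mean)^2) / (sum(f_i) - 1)."""
    n = sum(frequencies)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
    squared_dev = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies))
    return squared_dev / (n - 1)


midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

variance = grouped_variance(midpoints, frequencies)
sd = math.sqrt(variance)
print(f"S^2 = {variance:.2f}, SD = {sd:.2f}")  # about 120.22 and 10.96
```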
Properties of SD
• The SD has the advantage of being expressed in
the same units of measurement as the mean
5. Coefficient of variation (CV)
• When two data sets have different units of
measurements, or their means differ sufficiently in
size, the CV should be used as a measure of
dispersion.
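• It expresses the standard deviation as a percentage of the
mean: CV = (S / mean) × 100%.
• It is the best measure for comparing the variability of
two series or sets of observations.
• A data set with a smaller coefficient of variation is considered
more consistent.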
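A minimal sketch of a CV comparison in Python, using the usual definition CV = (SD / mean) × 100%. The height and weight values are hypothetical and serve only to show how the CV allows a comparison across different units; `coefficient_of_variation` is an illustrative name.

```python
import statistics


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean, times 100."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100


# Hypothetical measurements on five subjects, in different units.
heights_cm = [150, 160, 165, 170, 180]
weights_kg = [50, 60, 65, 80, 95]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)
print(f"CV height = {cv_height:.1f}%, CV weight = {cv_weight:.1f}%")
```

In this hypothetical sample the heights have the smaller CV, so they are the more consistent of the two series even though the SDs are in different units.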