5.0 Summary Statistics (1)
Math 100
1. MEAN (Average)
• One computes the mean by using all the values of the data.
• The mean is used in computing other statistics, such as the variance.
• The mean for the data set is unique and not necessarily one of the data values.
• The mean is affected by extremely high or low values, and may not be
the appropriate average to use in these situations.
• The mean is an appropriate measure of central tendency for interval and ratio
variables, hence it is also known as an interval statistic.
2. MEDIAN
The median is used when one must find the center or middle
value of a data set.
The median is used when one must determine whether the data
fall into the upper half or lower half of the distribution.
3. MODE
The mode can be used when the data are nominal, such as
gender, political affiliations, or religious preferences.
COMPUTATIONS OF THE MEAN FOR UNGROUPED DATA
The simple arithmetic mean or simple mean consists of dividing the sum of values by
the number of values in the group.
Sample mean ( $\bar{X}$ )
• $\bar{X} = \dfrac{\sum x}{n}$
Where:
• $\bar{X}$ is the sample mean
• x are the values in a given sample
• $\sum x$ is the sum or total of all the values or scores in the sample
• n is the total number of scores or values in the sample
• N is the total number of values in the population
• For the population mean we use $\mu = \dfrac{\sum x}{N}$
Example 2:
• Find the mean of the following values, which show the scores
of 10 students in Math Ed 503: 10, 5, 4, 3, 11, 7, 9, 12, 8, 6
• Sol’n: $\bar{X} = \dfrac{10+5+4+3+11+7+9+12+8+6}{10} = \dfrac{75}{10} = 7.5$
Therefore: the mean score of the 10 students in Math Ed 503 is
7.5.
Example 3: What should be the score of the 11th student in Example 2
so that the average is 8?
• Sol’n: Let s be the score of the 11th student.
• $\bar{X} = \dfrac{\sum x}{n}$, so $8 = \dfrac{10+5+4+3+11+7+9+12+8+6+s}{11} = \dfrac{75+s}{11}$
• $75 + s = 88$, so $s = 88 - 75 = 13$
• Therefore: The score of the 11th student in the data set to have
an average of 8 should be 13.
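The two mean examples above can be checked with a short Python sketch. The function names `mean` and `required_next_score` are illustrative choices, not part of the course material.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


def required_next_score(values, target_mean):
    """Score that one additional observation needs so the new mean equals target_mean."""
    # target_mean = (sum(values) + s) / (n + 1)  =>  s = target_mean * (n + 1) - sum(values)
    return target_mean * (len(values) + 1) - sum(values)


scores = [10, 5, 4, 3, 11, 7, 9, 12, 8, 6]  # scores from Example 2

print(mean(scores))                    # 7.5
print(required_next_score(scores, 8))  # 13
```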
Weighted Mean
• The weighted mean refers to the average of the means of all the
groups given. It is sometimes used when a "typical" value is
required but you want to give greater weight to some
measurements than to others.
• An arithmetic mean computed by considering the relative
importance of each item is called a weighted arithmetic mean. To
give due importance to each item under consideration, we assign
a number called a weight to each item in proportion to its relative
importance.
• The weighted arithmetic mean is computed using the following formula:
• $\bar{X}_W = \dfrac{\sum WX}{\sum W}$
• Where:
• $\bar{X}_W$ stands for the weighted arithmetic mean
• X stands for the values of the items
• W stands for the weight of each item
• $\sum WX$ is the sum of the products of W and X
• $\sum W$ is the sum of the weights
Example 1:
A student obtained 40, 50, 60, 80, and 45 marks in the subjects of Math,
Statistics, Physics, Chemistry, and Biology respectively. Assuming weights of 5,
2, 4, 3, and 1 respectively for these subjects, find the
weighted arithmetic mean per subject.
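The slide states this example without a worked solution. A minimal Python sketch of the weighted-mean formula, applied to the marks and weights above, is given here as one way to check an answer. The function name `weighted_mean` is an illustrative assumption.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum of W * X divided by the sum of W."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


marks = [40, 50, 60, 80, 45]   # Math, Statistics, Physics, Chemistry, Biology
weights = [5, 2, 4, 3, 1]      # weights given in the example

# (5*40 + 2*50 + 4*60 + 3*80 + 1*45) / (5 + 2 + 4 + 3 + 1) = 825 / 15
print(weighted_mean(marks, weights))  # 55.0
```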
• Odd-numbered set: when the number of observations is odd, the middle value of the
ordered data is the median.
• Even-numbered set: when the number of observations is even, the median is the mean or
average of the two middle values of the ordered data.
Illustrative Examples:
1. What is the median amount spent on groceries when 7 customers spent the
following (in dollars): 15, 18, 10, 4, 27, 5, 32?
• Arranged in order: 4, 5, 10, 15, 18, 27, 32. The middle (4th) value is 15.
Therefore: The median amount spent on groceries among the 7 customers is $15.
2. Suppose we add one more customer who spent 25 dollars on groceries in the
above example. What will be the median amount?
• Arranged in order: 4, 5, 10, 15, 18, 25, 27, 32. The two middle values are 15 and 18, so the median is (15 + 18)/2 = 16.5.
Therefore: The median amount spent on groceries among the 8 customers is $16.50.
3. How about the following series of scores? Find the median.
a. 1 3 6 6 6 8 10
Therefore: The median score is 6.
b. 12 14 15 16 16 16 17 18 20 21
Therefore: The median score is 16.
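The odd and even rules used in the examples above can be sketched in Python as follows. The function name `median` is an illustrative choice; the data are the four sets from the examples.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values when n is even."""
    ordered = sorted(values)          # the data must be arranged in order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd-numbered set: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even-numbered set


print(median([15, 18, 10, 4, 27, 5, 32]))                  # 15
print(median([15, 18, 10, 4, 27, 5, 32, 25]))              # 16.5
print(median([1, 3, 6, 6, 6, 8, 10]))                      # 6
print(median([12, 14, 15, 16, 16, 16, 17, 18, 20, 21]))    # 16.0
```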
Def. 5: The mode ( Mo ) is defined as the score which has the highest
frequency or the number that occurs most often in a set of data.
Note:
A distribution can have more than one mode.
If all the scores or values appear the same number of times, there
exists no mode.
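A minimal Python sketch of this definition and its two notes follows. The function name `modes` is an assumption, and the last two lists are hypothetical data made up only to illustrate the "more than one mode" and "no mode" cases.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means there is no mode."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    # If every distinct value appears the same number of times, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 3, 6, 6, 6, 8, 10]))   # [6]
print(modes([1, 1, 2, 2, 3]))          # [1, 2]  (hypothetical data: two modes)
print(modes([4, 5, 10]))               # []      (hypothetical data: no mode)
```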
• Introduction
• Measures of average such as the median and mean represent the
typical value for a dataset. Within the dataset the actual values usually
differ from one another and from the average value itself. The extent
to which the median and mean are good representatives of the values
in the original dataset depends upon the variability or dispersion in
the original data. Datasets are said to have high dispersion when they
contain values considerably higher and lower than the mean value.
• In figure 1 the number of different sized tutorial groups in semester 1
and semester 2 are presented. In both semesters the mean and median
tutorial group size is 5 students; however, the groups in semester 2 show
more dispersion (or variability in size) than those in semester 1.
• Dispersion within a dataset can be measured or described in several
ways including the range, inter-quartile range and standard deviation.
Measure of Variability
• Range - The range is the difference between the highest value and the lowest value
in an ungrouped data set. In a grouped data set, the range is the difference
between the upper limit of the highest class interval and the lower limit of the lowest class
interval.
• For ungrouped data, R = HV – LV, where HV is the highest value and LV is the lowest value.
• Properties of the Range
• It is quick to find but gives only a rough measure of dispersion.
• The larger the value of the range, the more dispersed the observations.
• It considers only the lowest and highest values in the data set.
• In figure 1, the size of the largest semester 1 tutorial group is 6
students and the size of the smallest group is 4 students, resulting
in a range of 2 (6-4). In semester 2, the largest tutorial group size
is 7 students and the smallest tutorial group contains 3 students,
therefore the range is 4 (7-3).
• The range is simple to compute and is useful when you wish to
evaluate the whole of a dataset.
• The range is useful for showing the spread within a dataset and
for comparing the spread between similar datasets.
• An example of the use of the range to compare spread within datasets is
provided in table 1. The scores of individual students in the examination
and coursework component of a module are shown.
• To find the range in marks the highest and lowest values need to be found
from the table. The highest coursework mark was 48 and the lowest was
27 giving a range of 21. In the examination, the highest mark was 45 and
the lowest 12 producing a range of 33. This indicates that there was wider
variation in the students’ performance in the examination than in the
coursework for this module.
Range
• Since the range is based solely on the two most extreme values
within the dataset, if one of these is either exceptionally high or
low (sometimes referred to as an outlier), it will result in a range that
is not typical of the variability within the dataset. For example,
imagine in the above example that one student failed to hand in
any coursework and was awarded a mark of zero, but
sat the exam and scored 40. The range for the coursework marks
would now become 48 (48-0), rather than 21. The new
range is not typical of the dataset as a whole and is distorted by
the outlier in the coursework marks. In order to reduce the
problems caused by outliers in a dataset, the inter-quartile range
is often calculated instead of the range.
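The effect of a single outlier on the range can be sketched in Python. Because the marks table is not reproduced here, this sketch reuses the ten Math Ed 503 scores from the mean example and adds one hypothetical extreme score of 40; the function name `data_range` is also an assumption.

```python
def data_range(values):
    """Range for ungrouped data: R = HV - LV (highest value minus lowest value)."""
    return max(values) - min(values)


scores = [10, 5, 4, 3, 11, 7, 9, 12, 8, 6]   # Math Ed 503 scores from the mean example
with_outlier = scores + [40]                  # hypothetical outlier, for illustration only

print(data_range(scores))        # 9   (12 - 3)
print(data_range(with_outlier))  # 37  (40 - 3): one extreme value dominates the range
```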
• Interquartile Range - The IQR is the amount of spread between the first quartile (Q1) and
the third quartile (Q3). In effect, it shows the range of
the middle 50% of the data and, as such, is not affected by the extreme values in the data
set. Formula: IQR = Q3 – Q1
• Properties of the Inter Quartile Range
• It measures the dispersion in the middle half of the items arranged in an array
• The inter-quartile range is a measure that indicates the extent to which the central
50% of values within the dataset are dispersed. It is based upon, and related to, the
median.
• In the same way that the median divides a dataset into two halves, the dataset can be further
divided into quarters by identifying the upper and lower quartiles. The lower quartile
is found one quarter of the way along a dataset when the values have been arranged
in order of magnitude; the upper quartile is found three quarters of the way along the dataset.
Therefore, in the ordered list of values, the upper quartile lies halfway between the median
and the highest value, whilst the lower quartile lies halfway between the median and
the lowest value. The inter-quartile range is found by subtracting the
lower quartile from the upper quartile.
• For example, the examination marks for 20 students following a particular
module are arranged in order of magnitude.
• The median lies at the mid-point between the two central values (10th and 11th):
half-way between 60 and 62 = 61
• The lower quartile lies at the mid-point between the 5th and 6th values:
half-way between 52 and 53 = 52.5
• The upper quartile lies at the mid-point between the 15th and 16th values:
half-way between 70 and 71 = 70.5
• The inter-quartile range for this dataset is therefore 70.5 - 52.5 = 18, whereas the
range is 80 - 43 = 37.
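The quartile rule used in this example (lower quartile = median of the lower half, upper quartile = median of the upper half) can be sketched in Python. Since the 20 examination marks are not listed, the sketch uses the eight grocery amounts from the median example instead. The function names are assumptions, as is the choice to leave the middle value out of both halves when the number of values is odd; other quartile conventions exist and can give slightly different results.

```python
def _median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule; needs at least 2 values."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n the middle value belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)


def iqr(values):
    """Inter-quartile range: IQR = Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1


spending = [4, 5, 10, 15, 18, 25, 27, 32]  # grocery amounts from the median example

print(quartiles(spending))  # (7.5, 16.5, 26.0)
print(iqr(spending))        # 18.5
```

For 20 ordered values this rule places the lower quartile between the 5th and 6th values and the upper quartile between the 15th and 16th, which matches the positions used in the example above.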
IQR
• The inter-quartile range provides a clearer picture of the overall dataset
by removing/ignoring the outlying values.
• Like the range however, the inter-quartile range is a measure of
dispersion that is based upon only two values from the dataset.
Statistically, the standard deviation is a more powerful measure of
dispersion because it takes into account every value in the dataset. The
standard deviation is explored in the next section of this guide.
c. The Variance - The variance, denoted by $\sigma^2$, is the mean of the squared deviations of
the observations from their arithmetic mean.
For ungrouped data:
• Population variance: $\sigma^2 = \dfrac{\sum x^2}{N} - \mu^2$
• Sample variance: $s^2 = \dfrac{\sum (x - \bar{X})^2}{n - 1}$
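Both variance formulas can be sketched in Python and compared against the standard library's `statistics.pvariance` and `statistics.variance`. The data are the ten Math Ed 503 scores from the mean example, reused here purely for illustration; the function names are assumptions.

```python
import statistics


def population_variance(values):
    """Population variance: (sum of x^2) / N - mu^2."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2


def sample_variance(values):
    """Sample variance: sum of (x - mean)^2 divided by (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


scores = [10, 5, 4, 3, 11, 7, 9, 12, 8, 6]

print(population_variance(scores))          # 8.25   (645/10 - 7.5**2)
print(round(sample_variance(scores), 4))    # 9.1667 (82.5 / 9)

# Cross-check against the standard library
print(statistics.pvariance(scores))
print(statistics.variance(scores))
```

The sample variance is larger than the population variance for the same data because it divides by n - 1 rather than N.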
• The standard deviation measures how concentrated the data are around
the mean; the more concentrated, the smaller the standard deviation.
• A small standard deviation can be a goal in certain situations where the
results are restricted, for example, in product manufacturing and quality
control. A particular type of car part that has to be 2 centimeters in
diameter to fit properly had better not have a very big standard
deviation during the manufacturing process. A big standard deviation in
this case would mean that lots of parts end up in the trash because they
don’t fit right; either that or the cars will have problems down the road.
Variability
• But in situations where you just observe and record data, a large
standard deviation isn’t necessarily a bad thing; it just reflects a large
amount of variation in the group that is being studied. For example, if
you look at salaries for everyone in a certain company, including
everyone from the student intern to the CEO, the standard deviation
may be very large. On the other hand, if you narrow the group down by
looking only at the student interns, the standard deviation is smaller,
because the individuals within this group have salaries that are less
variable. The second data set isn’t better, it’s just less variable.
• Here are some properties that can help you when interpreting a
standard deviation:
• The standard deviation can never be a negative number, due to the way
it’s calculated and the fact that it measures a distance (distances are
never negative numbers).
• The smallest possible value for the standard deviation is 0, and that
happens only in contrived situations where every single number in the
data set is exactly the same (no deviation).
• The standard deviation is affected by outliers (extremely low or
extremely high numbers in the data set). That’s because the standard
deviation is based on the distance from the mean. And remember, the
mean is also affected by outliers.
• The standard deviation has the same units as the original data.
• Properties of the Standard Deviation
• It is always non-negative
• It is easy to manipulate for further mathematical treatment.
• It makes use of all observations.
• Its unit of measure is the same unit of measure of the given set of values.
• Population SD: $\sigma = \sqrt{\sigma^2}$ and sample SD: $s = \sqrt{s^2}$
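A short Python sketch of these two square-root relationships, using the standard library's `statistics` module. The data are the ten Math Ed 503 scores from the mean example, reused for illustration; the list of identical values is hypothetical and only shows the zero-deviation case.

```python
import math
import statistics

scores = [10, 5, 4, 3, 11, 7, 9, 12, 8, 6]

# Standard deviation is the square root of the corresponding variance.
pop_sd = math.sqrt(statistics.pvariance(scores))
sample_sd = math.sqrt(statistics.variance(scores))

print(round(pop_sd, 3))     # 2.872
print(round(sample_sd, 3))  # 3.028

# The library's direct functions agree with the square-root definition.
print(math.isclose(pop_sd, statistics.pstdev(scores)))    # True
print(math.isclose(sample_sd, statistics.stdev(scores)))  # True

# When every value is identical (hypothetical data), the standard deviation is 0.
print(statistics.pstdev([5, 5, 5, 5]))  # 0.0
```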