Chapter 3
Numerical Descriptive Measures
3.1 Measures of Central Tendency for Ungrouped Data
Case Study 3–1 High-Priced Tickets in Big Markets
Case Study 3–2 Median Annual Starting Salary for MBAs
3.2 Measures of Dispersion for Ungrouped Data
3.3 Mean, Variance, and Standard Deviation for Grouped Data
3.4 Use of Standard Deviation

Which Major League Baseball team do you think has the highest average ticket price? Do you think it is one of the two New York teams, Mets or Yankees, assuming these teams play in one of the most expensive cities in the world? No, you are not even close. Then, is it one of the teams playing in Los Angeles, Dodgers or Angels? Still wrong. Actually, it is the Boston Red Sox that had the highest average ticket price ($40.77) among all Major League Baseball teams in 2004. The next highest was $28.45 for the Chicago Cubs. See Case Study 3–1.

In Chapter 2 we discussed how to summarize data using different methods and to display data using graphs. Graphs are one important component of statistics; however, it is also important to numerically describe the main characteristics of a data set. The numerical summary measures, such as the ones that identify the center and spread of a distribution, identify many important features of a distribution. For example, the techniques learned in Chapter 2 can help us graph data on family incomes. However, if we want to know the income of a “typical” family (given by the center of the distribution),
1519T_c03 10/31/05 6:54 AM Page 75
[Figure 3.1: a distribution annotated with its spread and the position of a particular family]
3.1.1 Mean
The mean, also called the arithmetic mean, is the most frequently used measure of central
tendency. This book will use the words mean and average synonymously. For ungrouped
data, the mean is obtained by dividing the sum of all values by the number of values in the
data set.
$$\text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}}$$
The mean calculated for sample data is denoted by $\bar{x}$ (read as “x bar”), and the mean calculated for population data is denoted by $\mu$ (Greek letter mu). We know from the discussion in Chapter 2 that the number of values in a data set is denoted by n for a sample and by N for a population. In Chapter 1, we learned that a variable is denoted by x and the sum of all values of x is denoted by $\Sigma x$. Using these notations, we can write the following formulas for the mean.
Calculating Mean for Ungrouped Data The mean for ungrouped data is obtained by dividing the
sum of all values by the number of values in the data set. Thus,
$$\text{Mean for population data: } \mu = \frac{\Sigma x}{N}$$
$$\text{Mean for sample data: } \bar{x} = \frac{\Sigma x}{n}$$
where $\Sigma x$ is the sum of all values, N is the population size, n is the sample size, $\mu$ is the population mean, and $\bar{x}$ is the sample mean.
EXAMPLE 3–1
Calculating the sample mean for ungrouped data.
Table 3.1 lists the total number of identity fraud victims in 2004 for six states.
Table 3.1
State         Number of Victims
California    43,839
Florida       16,062
Illinois      11,138
New York      17,680
Ohio          6956
Texas         26,454
Source: Federal Trade Commission’s Identity Theft Data Clearinghouse.
Find the mean number of identity fraud victims in 2004 for these six states.
Solution The variable in this example is the number of identity fraud victims in 2004 for six states. Let us denote it by x. Then, the six values of x are
$$x_1 = 43,839,\quad x_2 = 16,062,\quad x_3 = 11,138,\quad x_4 = 17,680,\quad x_5 = 6956,\quad \text{and}\quad x_6 = 26,454$$
where $x_1 = 43,839$ represents the number of identity fraud victims in 2004 for California, $x_2 = 16,062$ represents the number of identity fraud victims in 2004 for Florida, and so on. The sum of the numbers of identity fraud victims for these six states is
$$\Sigma x = x_1 + x_2 + x_3 + x_4 + x_5 + x_6 = 43,839 + 16,062 + 11,138 + 17,680 + 6956 + 26,454 = 122,129$$
Note that the given data includes only six states. Hence, it represents a sample. Because the data set contains six values, n = 6. Substituting the values of $\Sigma x$ and n in the sample formula, we obtain the mean number of identity fraud victims in 2004 for these six states:
$$\bar{x} = \frac{\Sigma x}{n} = \frac{122,129}{6} = 20,354.83$$
Thus, the mean number of identity fraud victims in 2004 for these six states is 20,354.83.
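The arithmetic in Example 3–1 can be checked with a few lines of Python. This is only a minimal sketch: the dictionary name `victims` and the helper `mean` are illustrative choices, not part of the text, and the data are the six values from Table 3.1.

```python
# Number of identity fraud victims in 2004 for the six states in Table 3.1.
victims = {
    "California": 43839,
    "Florida": 16062,
    "Illinois": 11138,
    "New York": 17680,
    "Ohio": 6956,
    "Texas": 26454,
}


def mean(values):
    """Return the arithmetic mean: the sum of all values divided by their count."""
    values = list(values)
    if not values:
        raise ValueError("mean requires at least one value")
    return sum(values) / len(values)


total = sum(victims.values())      # sum of all values
n = len(victims)                   # sample size
x_bar = mean(victims.values())     # sample mean

print(total, n, round(x_bar, 2))   # prints: 122129 6 20354.83
```

The same helper works for a population; only the symbols used to report the result change.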
EXAMPLE 3–2
Calculating the population mean for ungrouped data.
The following are the ages of all eight employees of a small company:
53 32 61 27 39 44 49 57
Find the mean age of these employees.
Solution Because the given data set includes all eight employees of the company, it represents the population. Hence, N = 8.
$$\Sigma x = 53 + 32 + 61 + 27 + 39 + 44 + 49 + 57 = 362$$
The population mean is
$$\mu = \frac{\Sigma x}{N} = \frac{362}{8} = 45.25 \text{ years}$$
Thus, the mean age of all eight employees of this company is 45.25 years, or 45 years and 3 months.
Reconsider Example 3–2. If we take a sample of three employees from this company and calculate the mean age of those three employees, this mean will be denoted by $\bar{x}$. Suppose the three values included in the sample are 32, 39, and 57. Then, the mean age for this sample is
$$\bar{x} = \frac{32 + 39 + 57}{3} = 42.67 \text{ years}$$
If we take a second sample of three employees of this company, the value of $\bar{x}$ will (most likely) be different. Suppose the second sample includes the values 53, 27, and 44. Then, the mean age for this sample is
$$\bar{x} = \frac{53 + 27 + 44}{3} = 41.33 \text{ years}$$
Consequently, we can state that the value of the population mean $\mu$ is constant. However, the value of the sample mean $\bar{x}$ varies from sample to sample. The value of $\bar{x}$ for a particular sample depends on what values of the population are included in that sample.
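To see how much the sample mean can move while the population mean stays fixed, the following sketch lists every possible sample of three employees from the eight ages in Example 3–2. Enumerating all samples is an illustration added here, not a step taken in the text; the variable names are arbitrary.

```python
from itertools import combinations
from statistics import mean

# Ages of all eight employees (the population in Example 3-2).
ages = [53, 32, 61, 27, 39, 44, 49, 57]

mu = mean(ages)  # population mean: one fixed number

# The two samples of size 3 used in the text.
first_sample_mean = mean([32, 39, 57])
second_sample_mean = mean([53, 27, 44])

# Every possible sample of three employees (order ignored) and its mean.
sample_means = [mean(sample) for sample in combinations(ages, 3)]

print(f"population mean: {mu}")
print(f"first sample mean: {first_sample_mean:.2f}")
print(f"second sample mean: {second_sample_mean:.2f}")
print(f"number of possible samples: {len(sample_means)}")
print(f"smallest and largest sample mean: "
      f"{min(sample_means):.2f}, {max(sample_means):.2f}")
```

Each of the possible samples has its own mean, which is the sense in which the sample mean varies from sample to sample.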
Sometimes a data set may contain a few very small or a few very large values. As mentioned in Chapter 2 on page 58, such values are called outliers or extreme values.
A major shortcoming of the mean as a measure of central tendency is that it is very sensitive to outliers. Example 3–3 illustrates this point.
EXAMPLE 3–3
Illustrating the effect of an outlier on the mean.
Table 3.2 lists the total philanthropic givings (in million dollars) by six donors during their lifetimes until 2004.
Notice that the lifetime givings of Bill and Melinda Gates are very large compared to the lifetime givings of other donors. Hence, this value is an outlier. Show how the inclusion of this outlier affects the value of the mean.
Solution If we do not include the lifetime givings of Bill and Melinda Gates (the outlier),
the mean of the lifetime givings of the remaining five donors is
The above chart, reproduced from USA TODAY, shows the average ticket prices of the five Major League Baseball teams that had the highest average ticket prices in 2004. According to the information given in the chart, the highest average price for an MLB team was for the Boston Red Sox, which was $40.77. The Chicago Cubs had the second highest average ticket price of $28.45.
Source: USA TODAY, April 26, 2004. Copyright © 2004, USA TODAY. Chart reproduced with permission.
The preceding example should encourage us to be cautious. We should remember that the mean
is not always the best measure of central tendency because it is heavily influenced by outliers.
Sometimes other measures of central tendency give a more accurate impression of a data set. For
example, when a data set has outliers, instead of using the mean, we can use either the trimmed
mean (defined in Exercise 3.33) or the median (to be discussed next) as a measure of central tendency.
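The sensitivity of the mean to an outlier can be sketched in Python. The numbers below are hypothetical and chosen only for illustration; they are not the values of Table 3.2.

```python
from statistics import mean, median

# Hypothetical amounts (in millions of dollars); NOT the data of Table 3.2.
typical = [10, 12, 15, 18, 20]
with_outlier = typical + [500]   # one very large value added

print("mean without outlier:  ", mean(typical))                  # 15
print("mean with outlier:     ", round(mean(with_outlier), 2))   # 95.83
print("median without outlier:", median(typical))                # 15
print("median with outlier:   ", median(with_outlier))           # 16.5
```

In this made-up data set a single extreme value moves the mean from 15 to about 95.83, while the median moves only from 15 to 16.5.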
3.1.2 Median
Another important measure of central tendency is the median. It is defined as follows.
Definition
Median The median is the value of the middle term in a data set that has been ranked in increasing order.
As is obvious from the definition of the median, it divides a ranked data set into two equal
parts. The calculation of the median consists of the following two steps:
1. Rank the data set in increasing order.
2. Find the middle term. The value of this term is the median.¹
¹ The value of the middle term in a data set ranked in decreasing order will also give the value of the median.
Note that if the number of observations in a data set is odd, then the median is given by
the value of the middle term in the ranked data. However, if the number of observations is even,
then the median is given by the average of the values of the two middle terms.
EXAMPLE 3–4
Calculating the median for ungrouped data: odd number of data values.
The following data give the weight lost (in pounds) by a sample of five members of a health club at the end of two months of membership.
10 5 19 8 3
Find the median.
Solution First, we rank the given data in increasing order as follows:
3 5 8 10 19
Since there are five terms in the data set and the middle term is the third term, the median is given by the value of the third term in the ranked data.
3 5 [8] 10 19
Median = 8 (the bracketed third term)
The median weight loss for this sample of five members of this health club is 8 pounds.
EXAMPLE 3–5
Calculating the median for ungrouped data: even number of data values.
Table 3.3 lists the number of car thefts during 2003 in 12 cities.
Table 3.3 Number of Car Thefts in 2003 in 12 Cities
Solution First we rank the given data on car thefts in increasing order as follows:
11,669 13,435 14,413 18,103 18,215 21,088 26,343 29,920 33,956 40,197 40,769 42,082
There are 12 values in the data set. Because there is an even number of values in the data
set, the median will be given by the mean of the two middle values. The two middle values
are the sixth and seventh values in the above arranged data, which are 21,088 and 26,343. The median, which is given by the average of these two values, is calculated below.
11,669 13,435 14,413 18,103 18,215 [21,088 26,343] 29,920 33,956 40,197 40,769 42,082
$$\text{Median} = \frac{21,088 + 26,343}{2} = \frac{47,431}{2} = 23,715.5$$

The above chart, reproduced from USA TODAY, shows the median annual starting salary of MBAs. These salaries are based on a survey conducted in August 2003. According to this survey, the median starting salary for males with an MBA degree was $75,000 and that of females was $67,500.
Source: USA TODAY, February 4, 2004. Copyright © 2004, USA TODAY. Chart reproduced with permission.
The median gives the center of a histogram, with half of the data values to the left of the median and half to the right of the median. The advantage of using the median as a measure of central tendency is that it is not influenced by outliers. Consequently, the median is preferred over the mean as a measure of central tendency for data sets that contain outliers.
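The two-step procedure for the median (rank the data, then take the middle term or the average of the two middle terms) can be written as a short Python function. This is a sketch; the function name `median` and the variable names are illustrative, and the car theft values are taken from the ranked list in Example 3–5.

```python
def median(values):
    """Rank the data, then return the middle term (odd count)
    or the average of the two middle terms (even count)."""
    ranked = sorted(values)          # step 1: rank in increasing order
    n = len(ranked)
    if n == 0:
        raise ValueError("median requires at least one value")
    mid = n // 2
    if n % 2 == 1:                   # odd number of observations
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2   # even number of observations


# Example 3-4: weight lost (in pounds) by five health club members.
weight_loss = [10, 5, 19, 8, 3]

# Example 3-5: car thefts in 12 cities, as listed in the ranked data.
car_thefts = [11669, 13435, 14413, 18103, 18215, 21088,
              26343, 29920, 33956, 40197, 40769, 42082]

print(median(weight_loss))   # 8
print(median(car_thefts))    # 23715.5
```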
3.1.3 Mode
Mode is a French word that means fashion, that is, an item that is most popular or common. In statistics, the mode represents the most common value in a data set.
Definition
Mode The mode is the value that occurs with the highest frequency in a data set.
EXAMPLE 3–6
Calculating the mode for ungrouped data.
The following data give the speeds (in miles per hour) of eight cars that were stopped on I-95 for speeding violations.
77 82 74 81 79 84 74 78
Find the mode.
Solution In this data set, 74 occurs twice and each of the remaining values occurs only once. Because 74 occurs with the highest frequency, it is the mode. Therefore,
Mode = 74 miles per hour
A major shortcoming of the mode is that a data set may have none or may have more than
one mode, whereas it will have only one mean and only one median. For instance, a data set with
each value occurring only once has no mode. A data set with only one value occurring with the
highest frequency has only one mode. The data set in this case is called unimodal. A data set with
two values that occur with the same (highest) frequency has two modes. The distribution, in this
case, is said to be bimodal. If more than two values in a data set occur with the same (highest)
frequency, then the data set contains more than two modes and it is said to be multimodal.
EXAMPLE 3–7
Data set with no mode.
Last year’s incomes of five randomly selected families were $46,150, $95,750, $64,985, $87,490, and $53,740. Find the mode.
Solution Because each value in this data set occurs only once, this data set contains no
mode.
EXAMPLE 3–8
Data set with two modes.
The prices of the same brand of television set at eight stores are found to be $895, $886, $903, $895, $870, $905, $870, and $899. Find the mode.
Solution In this data set, each of the two values $895 and $870 occurs twice and each of the
remaining values occurs only once. Therefore, this data set has two modes: $895 and $870.
EXAMPLE 3–9
Data set with three modes.
The ages of 10 randomly selected students from a class are 21, 19, 27, 22, 29, 19, 25, 21, 22, and 30. Find the mode.
Solution This data set has three modes: 19, 21, and 22. Each of these three values occurs
with a (highest) frequency of 2.
One advantage of the mode is that it can be calculated for both kinds of data, quantitative
and qualitative, whereas the mean and median can be calculated for only quantitative data.
EXAMPLE 3–10
Finding the mode for qualitative data.
The statuses of five students who are members of the student senate at a college are senior, sophomore, senior, junior, senior. Find the mode.
Solution Because senior occurs more frequently than the other categories, it is the mode
for this data set. We cannot calculate the mean and median for this data set.
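Examples 3–6 through 3–10 can be reproduced with a small Python helper. The sketch below follows the convention stated in the text that a data set in which every value occurs only once has no mode; the function name `modes` is an illustrative choice. Modes are returned in the order in which the values first appear in the data.

```python
from collections import Counter


def modes(values):
    """Return the list of values that occur with the highest frequency.

    An empty list means "no mode": every value occurs only once.
    """
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []
    return [value for value, count in counts.items() if count == highest]


speeds = [77, 82, 74, 81, 79, 84, 74, 78]                      # Example 3-6
incomes = [46150, 95750, 64985, 87490, 53740]                  # Example 3-7
prices = [895, 886, 903, 895, 870, 905, 870, 899]              # Example 3-8
ages = [21, 19, 27, 22, 29, 19, 25, 21, 22, 30]                # Example 3-9
status = ["senior", "sophomore", "senior", "junior", "senior"]  # Example 3-10

for name, data in [("speeds", speeds), ("incomes", incomes),
                   ("prices", prices), ("ages", ages), ("status", status)]:
    print(name, modes(data))
```

Because the helper only counts occurrences, it works for qualitative data such as the class standings as well as for quantitative data.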
To sum up, we cannot say for sure which of the three measures of central tendency is a better
measure overall. Each of them may be better under different situations. Probably the mean is the
most used measure of central tendency, followed by the median. The mean has the advantage that
its calculation includes each value of the data set. The median is a better measure when a data set
includes outliers. The mode is simple to locate, but it is not of much use in practical applications.
[Figure: frequency distribution curve; horizontal axis: Variable; marked point: Mean = median = mode]
2. For a histogram and a frequency distribution curve skewed to the right (see Figure 3.3), the value of the mean is the largest, that of the mode is the smallest, and the value of the median lies between these two. (Notice that the mode always occurs at the peak point.) The value of the mean is the largest in this case because it is sensitive to outliers that occur in the right tail. These outliers pull the mean to the right.
[Figure 3.3: curve skewed to the right; horizontal axis: Variable; marked from left to right: Mode, Median, Mean]
3. If a histogram and a frequency distribution curve are skewed to the left (see Figure 3.4), the value of the mean is the smallest and that of the mode is the largest, with the value of the median lying between these two. In this case, the outliers in the left tail pull the mean to the left.
[Figure 3.4: curve skewed to the left; horizontal axis: Variable; marked from left to right: Mean, Median, Mode]
EXERCISES
CONCEPTS AND PROCEDURES
3.1 Explain how the value of the median is determined for a data set that contains an odd number of observations and for a data set that contains an even number of observations.
3.2 Briefly explain the meaning of an outlier. Is the mean or the median a better measure of central tendency for a data set that contains outliers? Illustrate with the help of an example.
3.3 Using an example, show how outliers can affect the value of the mean.
3.4 Which of the three measures of central tendency (the mean, the median, and the mode) can be calculated for quantitative data only, and which can be calculated for both quantitative and qualitative data? Illustrate with examples.
3.5 Which of the three measures of central tendency (the mean, the median, and the mode) can assume more than one value for a data set? Give an example of a data set for which this summary measure assumes more than one value.
3.6 Is it possible for a (quantitative) data set to have no mean, no median, or no mode? Give an example of a data set for which this summary measure does not exist.
3.7 Explain the relationships among the mean, median, and mode for symmetric and skewed histograms. Illustrate these relationships with graphs.
3.8 Prices of cars have a distribution that is skewed to the right with outliers in the right tail. Which of
the measures of central tendency is the best to summarize this data set? Explain.
3.9 The following data set belongs to a population:
5 7 2 0 9 16 10 7
Calculate the mean, median, and mode.
3.10 The following data set belongs to a sample:
14 18 10 8 8 16
Calculate the mean, median, and mode.
APPLICATIONS
Exercises 3.11 and 3.12 are based on the following data.
The following table gives the sticker prices and dealer’s prices for base models of 10 two-door small cars
as of January 2004.
319 3800 1600 161 2300 645 736 2500 238 739
189 1000 3800 430 162 1400 1200 303 1300 123
a. Calculate the mean and median for these data. Are these values of the mean and median the sam-
ple statistics or population parameters?
b. Do these data have a mode? Explain.
3.14 The following data give the 2004 profits (in millions of dollars) of the nine computer and office
equipment companies included in the Fortune 1000 (FORTUNE, April 18, 2005). The data, entered in that
order, are for International Business Machines, Hewlett-Packard, Dell, Xerox, Sun Microsystems, Apple
Computer, NCR, Pitney Bowes, and Gateway.
8430 3497 3043 859 388 276 290 481 568
Find the mean and median for these data. Do these data have a mode?
3.15 The following data give the annual salaries (in dollars) of governors of 13 western states for 2004
(Source: Council of State Governments, The Book of the States, 2004; The New York Times Almanac, 2005).
The salaries, listed in that order are for AK, HI, CA, OR, WA, ID, MT, WY, CO, UT, NV, AZ, and NM.
85,776 94,780 175,000 93,600 139,087
98,500 93,089 130,000 90,000 100,600
117,000 95,000 110,000
Find the mean and median for these data.
3.16 The following data give the numbers of car thefts that occurred in a city during the past 12 days.
6 3 7 11 4 3 8 7 2 6 9 15
Find the mean, median, and mode.
3.17 The following data give the revenues (in millions of dollars) for the last available fiscal year for a
sample of six charitable organizations that are related to serious diseases (Forbes, December 13, 2004). The
values listed in that order are for Alzheimer’s Association, American Cancer Society, American Diabetes
Association, American Heart Association, American Lung Association, and Cystic Fibrosis Foundation.
136 816 192 513 158 152
Compute the mean and median. Do these data have a mode? Why or why not?
3.18 The following table gives the numbers of takeaways (recoveries of opponents’ fumbles and inter-
ceptions of opponents’ passes) during the 2004 season for all 16 teams in the National Conference of the
National Football League.
Team Takeaways
Carolina 38
Seattle 35
New Orleans 33
Philadelphia 28
Detroit 24
N.Y. Giants 28
Atlanta 32
Arizona 30
Minnesota 22
Washington 26
Chicago 29
Tampa Bay 27
Green Bay 15
Dallas 22
San Francisco 21
St. Louis 15
Source: USA TODAY, January 5, 2005.
Compute the mean and median for the data on takeaways. Do these data have a mode? Why or why not?
3.19 Due to antiquated equipment and frequent windstorms, the town of Oak City often suffers power
outages. The following data give the numbers of power outages for the past 12 months.
4 5 7 3 2 0 2 3 2 1 2 4
Compute the mean, median, and mode for these data.
3.20 A brochure from the department of public safety in a northern state recommends that motorists should
carry 12 items (flashlights, blankets, and so forth) in their vehicles for emergency use while driving in
winter. The following data give the number of items out of these 12 that were carried in their vehicles by
15 randomly selected motorists.
5 3 7 8 0 1 0 5 12 10 7 6 7 11 9
Find the mean, median, and mode for these data. Are the values of these summary measures population
parameters or sample statistics? Explain.
3.21 Nixon Corporation manufactures computer monitors. The following data are the numbers of com-
puter monitors produced at the company for a sample of 10 days.
24 32 27 23 35 33 29 40 23 28
Calculate the mean, median, and mode for these data.
3.22 The Tri-City School District has instituted a zero-tolerance policy for students carrying any objects
that could be used as weapons. The following data give the number of students suspended during each of
the past 12 weeks for violating this school policy.
15 9 12 11 7 6 9 10 14 3 6 5
Calculate the mean, median, and mode for these data.
3.23 The following data give the numbers of casinos in 11 states as of December 21, 2003 (USA
TODAY, July 16, 2004). The data entered in that order are for CO, IL, IN, IA, LA, MI, MS, MO, NV, NJ,
and SD.
44 9 10 13 18 3 29 11 256 12 38
a. Calculate the mean and median for these data.
b. Do these data contain an outlier? If so, drop the outlier and recalculate the mean and median.
Which of these two summary measures changes by a larger amount when you drop the outlier?
c. Which is the better summary measure for these data, the mean or the median? Explain.
3.24 The following data, based on the AAA Foundation for Traffic Safety estimates, give the number of
fatal crashes caused by road debris from 1999 to 2001 in 10 states with the most such accidents (USA
TODAY, June 16, 2004). The data entered in that order are for TX, FL, MO, VA, OK, MD, AZ, LA, WI,
and IN.
33 17 13 6 6 5 5 4 3 3
Compute the mean and median for these data. Do these data have modes? Why or why not?
*3.25 One property of the mean is that if we know the means and sample sizes of two (or more) data sets,
we can calculate the combined mean of both (or all) data sets. The combined mean for two data sets is
calculated by using the formula
$$\text{Combined mean} = \bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$
where $n_1$ and $n_2$ are the sample sizes of the two data sets and $\bar{x}_1$ and $\bar{x}_2$ are the means of the two data sets, respectively. Suppose a sample of 10 statistics books gave a mean price of $95 and a sample of 8 mathematics books gave a mean price of $104. Find the combined mean. (Hint: For this example: $n_1 = 10$, $n_2 = 8$, $\bar{x}_1 = \$95$, $\bar{x}_2 = \$104$.)
*3.26 Twenty business majors and 18 economics majors go bowling. Each student bowls one game. The
scorekeeper announces that the mean score for the 18 economics majors is 144 and the mean score for
the entire group of 38 students is 150. Find the mean score for the 20 business majors.
*3.27 For any data, the sum of all values is equal to the product of the sample size and mean; that is, $\Sigma x = n\bar{x}$. Suppose the average amount of money spent on shopping by 10 persons during a given week is $105.50. Find the total amount of money spent on shopping by these 10 persons.
*3.28 The mean 2005 income for five families was $79,520. What was the total 2005 income of these
five families?
*3.29 The mean age of six persons is 46 years. The ages of five of these six persons are 57, 39, 44, 51,
and 37 years. Find the age of the sixth person.
*3.30 Seven airline passengers in economy class on the same flight paid an average of $361 per ticket.
Because the tickets were purchased at different times and from different sources, the prices varied. The
first five passengers paid $420, $210, $333, $695, and $485. The sixth and seventh tickets were purchased
by a couple who paid identical fares. What price did each of them pay?
*3.31 Consider the following two data sets.
Data Set I: 12 25 37 8 41
Data Set II: 19 32 44 15 48
Notice that each value of the second data set is obtained by adding 7 to the corresponding value of the
first data set. Calculate the mean for each of these two data sets. Comment on the relationship between
the two means.
*3.32 Consider the following two data sets.
Data Set I: 4 8 15 9 11
Data Set II: 8 16 30 18 22
Notice that each value of the second data set is obtained by multiplying the corresponding value of the
first data set by 2. Calculate the mean for each of these two data sets. Comment on the relationship be-
tween the two means.
*3.33 The trimmed mean is calculated by dropping a certain percentage of values from each end of a
ranked data set. The trimmed mean is especially useful as a measure of central tendency when a data
set contains a few outliers at each end. Suppose the following data give the ages of 10 employees of a
company:
47 53 38 26 39 49 19 67 31 23
To calculate the 10% trimmed mean, first rank these data values in increasing order; then drop 10% of the
smallest values and 10% of the largest values. The mean of the remaining 80% of the values will give
the 10% trimmed mean. Note that this data set contains 10 values, and 10% of 10 is 1. Thus, if we drop
the smallest value and the largest value from this data set, the mean of the remaining 8 values will be
called the 10% trimmed mean. Calculate the 10% trimmed mean for this data set.
*3.34 The following data give the prices (in thousands of dollars) of 20 houses sold recently in a city.
184 297 365 309 245 387 369 438 195 390
323 578 410 679 307 271 457 795 259 590
Find the 20% trimmed mean for this data set.
*3.35 In some applications, certain values in a data set may be considered more important than others.
For example, to determine students’ grades in a course, an instructor may assign a weight to the final exam
twice as much as to each of the other exams. In such cases, it is more appropriate to use the weighted
mean. In general, for a sequence of n data values $x_1, x_2, \ldots, x_n$ that are assigned weights $w_1, w_2, \ldots, w_n$, respectively, the weighted mean is found by the formula
$$\text{Weighted mean} = \frac{\Sigma xw}{\Sigma w}$$
where $\Sigma xw$ is obtained by multiplying each data value by its weight and then adding the products. Suppose an instructor gives two exams and a final, assigning the final exam a weight twice that of each of the other exams. Find the weighted mean for a student who scores 73 and 67 on the first two exams, and 85 on the final. (Hint: Here, $x_1 = 73$, $x_2 = 67$, $x_3 = 85$, $w_1 = w_2 = 1$, and $w_3 = 2$.)
*3.36 When studying phenomena such as inflation or population changes, which involve periodic increases
or decreases, the geometric mean is used to find the average change over the entire period under study.
To calculate the geometric mean of a sequence of n values $x_1, x_2, \ldots, x_n$, we multiply them together and then find the nth root of this product. Thus
$$\text{Geometric mean} = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n}$$
Suppose that the inflation rates for the last five years are 4%, 3%, 5%, 6%, and 8%, respectively. Thus at the end of the first year, the price index will be 1.04 times the price index at the beginning of the year, and so on. Find the mean rate of inflation over the five-year period by finding the geometric mean of the data set 1.04, 1.03, 1.05, 1.06, and 1.08. (Hint: Here, $n = 5$, $x_1 = 1.04$, $x_2 = 1.03$, etc. Use the $x^{1/n}$ key on your calculator to find the fifth root. Note that the mean inflation rate will be obtained by subtracting 1 from the geometric mean.)
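The formulas introduced in Exercises 3.25, 3.33, 3.35, and 3.36 translate directly into code. The sketch below is one possible Python rendering; the function names are illustrative, the trimming rule assumes that the stated percentage of the data set is a whole number of values, and the inputs are made up so that the exercises themselves are left unworked.

```python
import math


def combined_mean(n1, mean1, n2, mean2):
    """Combined mean of two data sets from their sizes and means."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


def trimmed_mean(values, percent):
    """Drop `percent` percent of the values from each end of the ranked data."""
    ranked = sorted(values)
    k = len(ranked) * percent // 100        # values dropped from each end
    kept = ranked[k:len(ranked) - k]
    if not kept:
        raise ValueError("trimming removed every value")
    return sum(kept) / len(kept)


def weighted_mean(values, weights):
    """Sum of value-times-weight products divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))


# Hypothetical inputs, chosen only so the results are easy to check by hand.
print(combined_mean(4, 10, 6, 20))         # 16.0
print(trimmed_mean([1, 2, 3, 4, 100], 20)) # 3.0
print(weighted_mean([80, 90], [1, 3]))     # 87.5
print(geometric_mean([2, 8]))              # 4.0
```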
3.2 Measures of Dispersion for Ungrouped Data
Company 1: 47 38 35 40 36 45 39
Company 2: 70 33 18 52 27
The mean age of workers in both these companies is the same, 40 years. If we do not know the
ages of individual workers in these two companies and are told only that the mean age of the work-
ers in both companies is the same, we may deduce that the workers in these two companies have a
similar age distribution. But as we can observe, the variation in the workers’ ages for each of these
two companies is very different. As illustrated in the diagram, the ages of the workers in the second
company have a much larger variation than the ages of the workers in the first company.
[Diagram: ages of the workers in each company plotted on a number line]
Company 1: 35 36 38 39 40 45 47
Company 2: 18 27 33 52 70
Thus, the mean, median, or mode by itself is usually not a sufficient measure to reveal the shape of the distribution of a data set. We also need a measure that can provide some information about the variation among data values. The measures that help us learn about the spread of a data set are called the measures of dispersion. The measures of central tendency and dispersion taken together give a better picture of a data set than the measures of central tendency alone. This section discusses three measures of dispersion: range, variance, and standard deviation.
3.2.1 Range
The range is the simplest measure of dispersion to calculate. It is obtained by taking the difference between the largest and the smallest values in a data set.
EXAMPLE 3–11
Calculating the range for ungrouped data.
Table 3.4 gives the total areas in square miles of the four western South-Central states of the United States.
Table 3.4
State        Total Area (square miles)
Arkansas     53,182
Louisiana    49,651
Oklahoma     69,903
Texas        267,277
Solution The maximum total area for a state in this data set is 267,277 square miles, and the smallest area is 49,651 square miles. Therefore,
$$\text{Range} = \text{Largest value} - \text{Smallest value} = 267,277 - 49,651 = 217,626 \text{ square miles}$$
Thus, the total areas of these four states are spread over a range of 217,626 square miles.
The range, like the mean, has the disadvantage of being influenced by outliers. In Example 3–11, if the state of Texas with a total area of 267,277 square miles is dropped, the range decreases from 217,626 square miles to 20,252 square miles. Consequently, the range is not a good measure of dispersion to use for a data set that contains outliers.
Another disadvantage of using the range as a measure of dispersion is that its calculation
is based on two values only: the largest and the smallest. All other values in a data set are ignored
when calculating the range. Thus, the range is not a very satisfactory measure of dispersion.
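The range calculation in Example 3–11, and the effect of dropping Texas, can be checked with a brief Python sketch. The names `areas` and `value_range` are illustrative.

```python
# Total areas (square miles) from Table 3.4.
areas = {
    "Arkansas": 53182,
    "Louisiana": 49651,
    "Oklahoma": 69903,
    "Texas": 267277,
}


def value_range(values):
    """Range = largest value minus smallest value."""
    values = list(values)
    return max(values) - min(values)


full_range = value_range(areas.values())
range_without_texas = value_range(
    area for state, area in areas.items() if state != "Texas"
)

print(full_range)            # 217626
print(range_without_texas)   # 20252
```

Only the largest and smallest values enter the calculation, so removing the single outlier changes the result by a factor of more than ten.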
The quantity $x - \mu$ or $x - \bar{x}$ in the above formulas is called the deviation of the x value from the mean. The sum of the deviations of the x values from the mean is always zero; that is, $\Sigma(x - \mu) = 0$ and $\Sigma(x - \bar{x}) = 0$.
For example, suppose the midterm scores of a sample of four students are 82, 95, 67, and 92. Then, the mean score for these four students is
$$\bar{x} = \frac{82 + 95 + 67 + 92}{4} = 84$$
The deviations of the four scores from the mean are calculated in Table 3.5. As we can observe from the table, the sum of the deviations of the x values from the mean is zero; that is, $\Sigma(x - \bar{x}) = 0$. For this reason we square the deviations to calculate the variance and standard deviation.
Table 3.5
x     x − x̄
82    82 − 84 = −2
95    95 − 84 = +11
67    67 − 84 = −17
92    92 − 84 = +8
      Σ(x − x̄) = 0
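The deviations in Table 3.5 can be reproduced in a few lines of Python. The sketch also prints the sum of the squared deviations to show why squaring is needed: the raw deviations cancel, but their squares do not. Variable names are illustrative.

```python
# Midterm scores of the four students in Table 3.5.
scores = [82, 95, 67, 92]

x_bar = sum(scores) / len(scores)            # sample mean
deviations = [x - x_bar for x in scores]     # x minus the mean, for each x
squared = [d ** 2 for d in deviations]       # squared deviations

print(x_bar)             # 84.0
print(deviations)        # [-2.0, 11.0, -17.0, 8.0]
print(sum(deviations))   # 0.0
print(sum(squared))      # 478.0
```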
² Note that Σ is uppercase sigma and σ is lowercase sigma of the Greek alphabet.
³ From the formula for σ², it can be stated that the population variance is the mean of the squared deviations of x values from the mean. However, this is not true for the variance calculated for a sample data set.
From the computational point of view, it is easier and more efficient to use short-cut formulas to calculate the variance and standard deviation. By using the short-cut formulas, we reduce the computation time and round-off errors. Use of the basic formulas for ungrouped data is illustrated in Section A3.1.1 of Appendix 3.1 of this chapter. The short-cut formulas for calculating the variance and standard deviation are given next.
Short-Cut Formulas for the Variance and Standard Deviation for Ungrouped Data
$$\sigma^2 = \frac{\Sigma x^2 - \dfrac{(\Sigma x)^2}{N}}{N} \quad \text{and} \quad s^2 = \frac{\Sigma x^2 - \dfrac{(\Sigma x)^2}{n}}{n - 1}$$
where $\sigma^2$ is the population variance and $s^2$ is the sample variance.
The standard deviation is obtained by taking the positive square root of the variance.
$$\text{Population standard deviation: } \sigma = \sqrt{\sigma^2}$$
$$\text{Sample standard deviation: } s = \sqrt{s^2}$$
Note that the denominator in the formula for the population variance is N, but that in the formula for the sample variance it is n − 1.⁴
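The short-cut numerator follows from the sum of squared deviations by expanding the square. The derivation below is a sketch for the sample case and assumes the usual basic formula $s^2 = \Sigma(x - \bar{x})^2 / (n - 1)$, which is not reproduced in this section; the population case is identical with $\mu$ and N in place of $\bar{x}$ and n.

$$
\begin{aligned}
\Sigma (x - \bar{x})^2 &= \Sigma x^2 - 2\bar{x}\,\Sigma x + n\bar{x}^2 \\
&= \Sigma x^2 - 2\,\frac{(\Sigma x)^2}{n} + \frac{(\Sigma x)^2}{n} \qquad \text{since } \bar{x} = \frac{\Sigma x}{n} \\
&= \Sigma x^2 - \frac{(\Sigma x)^2}{n}
\end{aligned}
$$

Dividing both sides by n − 1 gives the short-cut formula for $s^2$.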
EXAMPLE 3–12
Calculating the sample variance and standard deviation for ungrouped data.
The following table, based on Forbes magazine’s list of the wealthiest people in the world, gives the total wealth (in billions of dollars) of five persons (USA TODAY, March 11, 2005).
Billionaire             Total Wealth (billions of dollars)
Bill Gates              46.5
Helen Walton            18.0
Michael Dell            16.0
Keith Rupert Murdoch    7.8
George Soros            7.2
Solution Let x denote the total wealth (in billions of dollars) of a person. The values of $\Sigma x$ and $\Sigma x^2$ are calculated in Table 3.6.
Table 3.6
x            x²
46.5         2162.25
18.0         324.00
16.0         256.00
7.8          60.84
7.2          51.84
Σx = 95.5    Σx² = 2854.93
⁴ The reason that the denominator in the sample formula is n − 1 and not n follows: The sample variance underestimates the population variance when the denominator in the sample formula for variance is n. However, the sample variance does not underestimate the population variance if the denominator in the sample formula for variance is n − 1. In Chapter 8 we will learn that n − 1 is called the degrees of freedom.
\[
s^2 = \frac{\sum x^2 - \dfrac{(\sum x)^2}{n}}{n - 1}
    = \frac{2854.93 - \dfrac{(95.5)^2}{5}}{5 - 1}
    = \frac{2854.93 - 1824.05}{4}
    = 257.72
\]
Step 4. Obtain the standard deviation.
The standard deviation is obtained by taking the (positive) square root of the variance.
s = √257.72 = 16.05366 ≈ $16.05 billion
Thus, the standard deviation of the wealth of these five individuals is $16.05 billion.
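The arithmetic of Example 3–12 can be checked with a short Python sketch. The function name `shortcut_sample_variance` is our own choice for illustration, not part of any library; it simply applies the short-cut formula for sample data.

```python
import math


def shortcut_sample_variance(values):
    """Sample variance via the short-cut formula: (Σx² − (Σx)²/n) / (n − 1)."""
    n = len(values)
    sum_x = sum(values)                    # Σx
    sum_x_sq = sum(x * x for x in values)  # Σx²
    return (sum_x_sq - sum_x ** 2 / n) / (n - 1)


# Total wealth (billions of dollars) of the five persons in Example 3-12
wealth = [46.5, 18.0, 16.0, 7.8, 7.2]

s_sq = shortcut_sample_variance(wealth)
s = math.sqrt(s_sq)
print(f"variance = {s_sq:.2f}, standard deviation = {s:.2f}")
```

Rounded to two decimal places, the sketch reproduces the values 257.72 and 16.05 obtained above.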
Two Observations
1. The values of the variance and the standard deviation are never negative. That is,
the numerator in the formula for the variance should never produce a negative value.
Usually the values of the variance and standard deviation are positive, but if a data set
has no variation, then the variance and standard deviation are both zero. For example,
if four persons in a group are the same age—say, 35 years—then the four values in the
data set are
35 35 35 35
If we calculate the variance and standard deviation for these data, their values are zero. This
is because there is no variation in the values of this data set.
2. The measurement units of variance are always the square of the measurement units
of the original data. This is so because the original values are squared to calculate the
variance. In Example 3–12, the measurement units of the original data are billions of
dollars. However, the measurement units of the variance are squared billions of dollars,
which, of course, does not make any sense. Thus, the variance of the wealth of these
five persons in Example 3–12 is 257.72 squared billion dollars. But the measurement
units of the standard deviation are the same as the measurement units of the original
data because the standard deviation is obtained by taking the square root of the
variance.
EXAMPLE 3–13
Calculating the population variance and standard deviation for ungrouped data.
Following are the 2005 earnings (in thousands of dollars) before taxes for all six employees
of a small company.
48.50 38.40 65.50 22.60 79.80 54.60
Calculate the variance and standard deviation for these data.
Solution Let x denote the 2005 earnings before taxes of an employee of this company. The
values of Σx and Σx² are calculated in Table 3.7.
Table 3.7
x            x²
48.50        2352.25
38.40        1474.56
65.50        4290.25
22.60         510.76
79.80        6368.04
54.60        2981.16
Σx = 309.40  Σx² = 17,977.02
Because the data are on earnings of all employees of this company, we use the population
formula to compute the variance. Thus, the variance is
\[
\sigma^2 = \frac{\sum x^2 - \dfrac{(\sum x)^2}{N}}{N}
         = \frac{17{,}977.02 - \dfrac{(309.40)^2}{6}}{6}
         = 337.0489
\]
The standard deviation is obtained by taking the (positive) square root of the variance:
σ = √337.0489 = $18.359 thousand = $18,359
Thus, the standard deviation of the 2005 earnings of all six employees of this company is
$18,359.
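A similar sketch checks Example 3–13 with the population formula. The helper name `shortcut_population_variance` is hypothetical; the cross-check uses `statistics.pvariance` from the Python standard library, which computes the population variance.

```python
import math
import statistics


def shortcut_population_variance(values):
    """Population variance via the short-cut formula: (Σx² − (Σx)²/N) / N."""
    big_n = len(values)
    sum_x = sum(values)                    # Σx
    sum_x_sq = sum(x * x for x in values)  # Σx²
    return (sum_x_sq - sum_x ** 2 / big_n) / big_n


# 2005 earnings (thousands of dollars) of all six employees in Example 3-13
earnings = [48.50, 38.40, 65.50, 22.60, 79.80, 54.60]

sigma_sq = shortcut_population_variance(earnings)
sigma = math.sqrt(sigma_sq)

# Cross-check against the standard library's population variance
library_sigma_sq = statistics.pvariance(earnings)

print(f"variance = {sigma_sq:.4f}, standard deviation = {sigma:.3f}")
```

The only difference from the sample version is the final divisor, N instead of n − 1.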
Warning
Note that Σx² is not the same as (Σx)². The value of Σx² is obtained by squaring the x values
and then adding them. The value of (Σx)² is obtained by squaring the value of Σx.
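To make the distinction concrete, the following minimal sketch computes both quantities for the earnings data of Example 3–13. The variable names are our own.

```python
# 2005 earnings (thousands of dollars) from Example 3-13
earnings = [48.50, 38.40, 65.50, 22.60, 79.80, 54.60]

# Σx²: square each value first, then add the squares
sum_of_squares = sum(x ** 2 for x in earnings)

# (Σx)²: add the values first, then square the total
square_of_sum = sum(earnings) ** 2

print(f"sum of squares = {sum_of_squares:.2f}")
print(f"square of sum  = {square_of_sum:.2f}")
```

The first quantity matches the total of the x² column in Table 3.7, whereas the second is the square of 309.40 and is more than five times larger.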
The uses of the standard deviation are discussed in Section 3.4. Later chapters explain how
the mean and the standard deviation taken together can help in making inferences about the
population.
EXERCISES
CONCEPTS AND PROCEDURES
3.37 The range, as a measure of spread, has the disadvantage of being influenced by outliers. Illustrate
this with an example.
3.38 Can the standard deviation have a negative value? Explain.
3.39 When is the value of the standard deviation for a data set zero? Give one example. Calculate the
standard deviation for the example and show that its value is zero.
3.40 Briefly explain the difference between a population parameter and a sample statistic. Give one ex-
ample of each.
APPLICATIONS
3.43 The following data give the number of shoplifters apprehended during each of the past eight weeks
at a large department store.
7 10 8 3 15 12 6 11
a. Find the mean for these data. Calculate the deviations of the data values from the mean. Is the
sum of these deviations zero?
b. Calculate the range, variance, and standard deviation.
3.44 The following data give the prices of seven textbooks randomly selected from a university book-
store.
$89 $67 $104 $113 $36 $121 $147
a. Find the mean for these data. Calculate the deviations of the data values from the mean. Is the
sum of these deviations zero?
b. Calculate the range, variance, and standard deviation.
3.45 The following data give the numbers of car thefts that occurred in a city in the past 12 days.
6 3 7 11 4 3 8 7 2 6 9 15
Calculate the range, variance, and standard deviation.
3.46 During the 2004 presidential election campaign, spending on television commercials was high,
particularly in key states where the vote was expected to be close. The following data give the ex-
penditures on television commercials (in millions of dollars) by all candidates in 10 states where
such spending was the highest. The data, entered in that order, are for Florida, California, Ohio,
Pennsylvania, Missouri, New Jersey, Delaware, Michigan, Wisconsin, and North Carolina (USA TODAY,
November 26, 2004).
236.7 190.7 166.8 133.9 98.0 88.3 65.3 61.6 54.4 51.6
Find the range, variance, and standard deviation for these data.
3.47 The following data give the numbers of pieces of junk mail received by 10 families during the past
month.
41 33 28 21 29 19 14 31 39 36
Find the range, variance, and standard deviation.
3.48 The following data give the number of highway collisions with large wild animals, such as deer or
moose, in one of the northeastern states during each week of a nine-week period.
7 10 3 8 2 5 7 4 9
Find the range, variance, and standard deviation.
3.49 Attacks by stinging insects, such as bees or wasps, may become medical emergencies if either the
victim is allergic to venom or multiple stings are involved. The following data give the number of patients
treated each week for such stings in a large regional hospital during 13 weeks last summer.
1 5 2 3 0 4 1 7 0 1 2 0 1
Compute the range, variance, and standard deviation for these data.
3.50 The following data give the number of hot dogs consumed by 10 participants in a hot-dog-eating
contest.
21 17 32 8 20 15 17 23 9 18
Calculate the range, variance, and standard deviation for these data.
3.51 Following are the temperatures (in degrees Fahrenheit) observed during eight wintry days in a mid-
western city:
23 14 6 7 2 11 16 19
Compute the range, variance, and standard deviation.
3.52 The following data give the numbers of hours spent partying by 10 randomly selected college stu-
dents during the past week.
7 14 5 0 9 7 10 4 0 8
Compute the range, variance, and standard deviation.
3.53 The following data, based on Forbes Magazine’s rankings of the wealthiest people in the world, give
the net worth (in billions of dollars) of the 10 wealthiest people in the world (USA TODAY, March 11,
2005). The data, entered in that order, are for Bill Gates, Warren Buffett, Lakshmi Mittal, Carlos Slim
Helu, Prince Alwaleed Bin Talal Alsaud, Ingvar Kamprad, Paul Allen, Karl Albrecht, Lawrence Ellison,
and S. Robson Walton.
46.5 44.0 25.0 23.8 23.7 23.0 21.0 18.5 18.4 18.3
Find the range, variance, and standard deviation for these data.
3.54 The following data give the average speeds (rounded to the nearest mile per hour) at the Indianapolis
500 auto race for the years 1995 to 2004 (The New York Times 2005 Almanac).
154 148 146 145 153 168 142 166 156 139
Find the range, variance, and standard deviation for these data.
3.55 The following data give the hourly wage rates of eight employees of a company.
12 12 12 12 12 12 12 12
Calculate the standard deviation. Is its value zero? If yes, why?
3.56 The following data are the ages (in years) of six students.
19 19 19 19 19 19
Calculate the standard deviation. Is its value zero? If yes, why?
*3.57 One disadvantage of the standard deviation as a measure of dispersion is that it is a measure of
absolute variability and not of relative variability. Sometimes we may need to compare the variability of two
different data sets that have different units of measurement. The coefficient of variation is one such
measure. The coefficient of variation, denoted by CV, expresses standard deviation as a percentage of the mean
and is computed as follows:
For population data: \[ \text{CV} = \frac{\sigma}{\mu} \times 100\% \]
For sample data: \[ \text{CV} = \frac{s}{\bar{x}} \times 100\% \]
The yearly salaries of all employees who work for a company have a mean of $62,350 and a standard
deviation of $6820. The years of experience for the same employees have a mean of 15 years and a stan-
dard deviation of 2 years. Is the relative variation in the salaries greater or less than that in years of ex-
perience for these employees?
*3.58 The SAT scores of 100 students have a mean of 975 and a standard deviation of 105. The GPAs of
the same 100 students have a mean of 3.16 and a standard deviation of .22. Is the relative variation in SAT
scores greater or less than that in GPAs?
*3.59 Consider the following two data sets.
Data Set I: 12 25 37 8 41
Data Set II: 19 32 44 15 48
Note that each value of the second data set is obtained by adding 7 to the corresponding value of the first
data set. Calculate the standard deviation for each of these two data sets using the formula for sample data.
Comment on the relationship between the two standard deviations.
*3.60 Consider the following two data sets.
Data Set I: 4 8 15 9 11
Data Set II: 8 16 30 18 22
Note that each value of the second data set is obtained by multiplying the corresponding value of the first
data set by 2. Calculate the standard deviation for each of these two data sets using the formula for pop-
ulation data. Comment on the relationship between the two standard deviations.
To calculate the mean for grouped data, first find the midpoint of each class and then
multiply the midpoints by the frequencies of the corresponding classes. The sum of these products,
denoted by Σmf, gives an approximation for the sum of all values. To find the value of the mean,
divide this sum by the total number of observations in the data.
EXAMPLE 3–14
Calculating the population mean for grouped data.
Table 3.8 gives the frequency distribution of the daily commuting times (in minutes) from
home to work for all 25 employees of a company.
Table 3.8
Solution Note that because the data set includes all 25 employees of the company, it
represents the population. Table 3.9 shows the calculation of Σmf. Note that in Table 3.9, m
denotes the midpoints of the classes.
Table 3.9
To calculate the mean, we first find the midpoint of each class. The class midpoints are
recorded in the third column of Table 3.9. The products of the midpoints and the corresponding
frequencies are listed in the fourth column. The sum of the fourth column values, denoted by
Σmf, gives the approximate total daily commuting time (in minutes) for all 25 employees.
The mean is obtained by dividing this sum by the total frequency. Therefore,
\[
\mu = \frac{\sum mf}{N} = \frac{535}{25} = 21.40 \text{ minutes}
\]
Thus, the employees of this company spend an average of 21.40 minutes a day commuting
from home to work.
What do the numbers 20, 135, 150, 140, and 90 in the column labeled mf in Table 3.9
represent? We know from this table that 4 employees spend 0 to less than 10 minutes commuting
per day. If we assume that the time spent commuting by these 4 employees is evenly spread in
the interval 0 to less than 10, then the midpoint of this class (which is 5) gives the mean time
spent commuting by these 4 employees. Hence, 4 × 5 = 20 is the approximate total time (in
minutes) spent commuting per day by these 4 employees. Similarly, 9 employees spend 10 to
less than 20 minutes commuting per day, and the total time spent commuting by these 9
employees is approximately 135 minutes a day. The other numbers in this column can be
interpreted the same way. Note that these numbers give the approximate commuting times for these
employees based on the assumption of an even spread within classes. The total commuting time
for all 25 employees is approximately 535 minutes. Consequently, 21.40 minutes is an
approximate and not the exact value of the mean. We can find the exact value of the mean only if we
know the exact commuting time for each of the 25 employees of the company.
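The calculation of Example 3–14 can be sketched in Python as follows. The class boundaries and frequencies are those of the commuting-time distribution; the tuple layout `(lower, upper, frequency)` and the variable names are our own assumptions for illustration.

```python
# (lower boundary, upper boundary, frequency) for each class,
# read as "lower to less than upper" minutes
classes = [
    (0, 10, 4),
    (10, 20, 9),
    (20, 30, 6),
    (30, 40, 4),
    (40, 50, 2),
]

# Midpoint m of each class and the product m × f
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
products = [m * f for m, (_, _, f) in zip(midpoints, classes)]

total_frequency = sum(f for _, _, f in classes)  # N
sum_mf = sum(products)                           # Σmf
grouped_mean = sum_mf / total_frequency          # approximate mean

print(products)
print(f"approximate mean = {grouped_mean:.2f} minutes")
```

The list of products reproduces the mf column (20, 135, 150, 140, 90), and the result is an approximation for the reason explained above: each class is represented only by its midpoint.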
EXAMPLE 3–15
Calculating the sample mean for grouped data.
Table 3.10 gives the frequency distribution of the number of orders received each day during
the past 50 days at the office of a mail-order company.
Table 3.10
Solution Because the data set includes only 50 days, it represents a sample. The value of
Σmf is calculated in Table 3.11.
Table 3.11
Number of Orders    f     m     mf
10–12               4     11    44
13–15               12    14    168
16–18               20    17    340
19–21               14    20    280
                    n = 50      Σmf = 832
\[
\sigma^2 = \frac{\sum f(m - \mu)^2}{N}
\qquad \text{and} \qquad
s^2 = \frac{\sum f(m - \bar{x})^2}{n - 1}
\]
where σ² is the population variance, s² is the sample variance, and m is the midpoint of a class.
In either case, the standard deviation is obtained by taking the positive square root of the
variance.
Again, the short-cut formulas are more efficient for calculating the variance and standard
deviation. Section A3.1.2 of Appendix 3.1 at the end of this chapter shows how to use the
basic formulas to calculate the variance and standard deviation for grouped data.
Short-Cut Formulas for the Variance and Standard Deviation for Grouped Data
\[
\sigma^2 = \frac{\sum m^2 f - \dfrac{(\sum mf)^2}{N}}{N}
\qquad \text{and} \qquad
s^2 = \frac{\sum m^2 f - \dfrac{(\sum mf)^2}{n}}{n - 1}
\]
where σ² is the population variance, s² is the sample variance, and m is the midpoint of a class.
The standard deviation is obtained by taking the positive square root of the variance.
Population standard deviation: σ = √σ²
Sample standard deviation: s = √s²
Examples 3–16 and 3–17 illustrate the use of these formulas to calculate the variance and
standard deviation.
0 to less than 10 4
10 to less than 20 9
20 to less than 30 6
30 to less than 40 4
40 to less than 50 2
Solution All four steps needed to calculate the variance and standard deviation for grouped
data are shown after Table 3.12.
Table 3.12
Σmf = 535
Σm²f = 14,825
\[
\sigma^2 = \frac{\sum m^2 f - \dfrac{(\sum mf)^2}{N}}{N}
         = \frac{14{,}825 - \dfrac{(535)^2}{25}}{25}
         = \frac{3376}{25}
         = 135.04
\]
Thus, the standard deviation of the daily commuting times for these employees is 11.62 minutes.
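The following Python sketch retraces the short-cut calculation for the grouped commuting-time data. As before, the `(lower, upper, frequency)` layout and the variable names are illustrative assumptions.

```python
import math

# (lower boundary, upper boundary, frequency) for the commuting-time classes
classes = [
    (0, 10, 4),
    (10, 20, 9),
    (20, 30, 6),
    (30, 40, 4),
    (40, 50, 2),
]

midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
frequencies = [f for _, _, f in classes]

big_n = sum(frequencies)                                            # N
sum_mf = sum(m * f for m, f in zip(midpoints, frequencies))         # Σmf
sum_m_sq_f = sum(m * m * f for m, f in zip(midpoints, frequencies)) # Σm²f

# Short-cut population formula for grouped data
sigma_sq = (sum_m_sq_f - sum_mf ** 2 / big_n) / big_n
sigma = math.sqrt(sigma_sq)

print(f"Σmf = {sum_mf:.0f}, Σm²f = {sum_m_sq_f:.0f}")
print(f"variance = {sigma_sq:.2f}, standard deviation = {sigma:.2f} minutes")
```

The two sums agree with the totals 535 and 14,825 used above, and the standard deviation rounds to 11.62 minutes.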
Note that the values of the variance and standard deviation calculated in Example 3–16 for
grouped data are approximations. The exact values of the variance and standard deviation can be
obtained only by using the ungrouped data on the daily commuting times of the 25 employees.
EXAMPLE 3–17
Calculating the sample variance and standard deviation for grouped data.
The following data, reproduced from Table 3.10 of Example 3–15, give the frequency
distribution of the number of orders received each day during the past 50 days at the office of a
mail-order company.
Number of Orders f
10–12 4
13–15 12
16–18 20
19–21 14
Solution All the information required for the calculation of the variance and standard de-
viation appears in Table 3.13.
Table 3.13
Because the data set includes only 50 days, it represents a sample. Hence, we use the
sample formulas to calculate the variance and standard deviation. By substituting the values into
the formula for the sample variance, we obtain
\[
s^2 = \frac{\sum m^2 f - \dfrac{(\sum mf)^2}{n}}{n - 1}
    = \frac{14{,}216 - \dfrac{(832)^2}{50}}{50 - 1}
    = 7.5820
\]
Hence, the standard deviation is
s = √s² = √7.5820 = 2.75 orders
Thus, the standard deviation of the number of orders received at the office of this mail-order
company during the past 50 days is 2.75.
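As a check on Example 3–17, the sketch below computes the sample variance of the grouped orders data in two ways: with the short-cut formula used above and with the basic formula Σf(m − x̄)²/(n − 1). The variable names are our own; the midpoints 11, 14, 17, and 20 are those listed in Table 3.11.

```python
import math

# Class midpoints m and frequencies f for the classes 10-12, 13-15, 16-18, 19-21
midpoints = [11, 14, 17, 20]
frequencies = [4, 12, 20, 14]

n = sum(frequencies)                                                # n
sum_mf = sum(m * f for m, f in zip(midpoints, frequencies))         # Σmf
sum_m_sq_f = sum(m * m * f for m, f in zip(midpoints, frequencies)) # Σm²f

# Short-cut sample formula for grouped data
s_sq_shortcut = (sum_m_sq_f - sum_mf ** 2 / n) / (n - 1)

# Basic sample formula for grouped data, using the grouped mean
x_bar = sum_mf / n
s_sq_basic = sum(
    f * (m - x_bar) ** 2 for m, f in zip(midpoints, frequencies)
) / (n - 1)

s = math.sqrt(s_sq_shortcut)
print(f"short-cut variance = {s_sq_shortcut:.4f}")
print(f"basic variance     = {s_sq_basic:.4f}")
print(f"standard deviation = {s:.2f} orders")
```

Both formulas give the same variance up to floating-point rounding, which is the sense in which the short-cut formula is only a computational convenience.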
EXERCISES
CONCEPTS AND PROCEDURES
3.61 Are the values of the mean and standard deviation that are calculated using grouped data exact or
approximate values of the mean and standard deviation, respectively? Explain.
3.62 Using the population formulas, calculate the mean, variance, and standard deviation for the follow-
ing grouped data.
3.63 Using the sample formulas, find the mean, variance, and standard deviation for the grouped data dis-
played in the following table.
x f
0 to less than 4 17
4 to less than 8 23
8 to less than 12 15
12 to less than 16 11
16 to less than 20 8
20 to less than 24 6
APPLICATIONS
3.64 The following table gives the frequency distribution of the amounts of telephone bills for October
2005 for a sample of 50 families.
3.65 The following table gives the frequency distribution of the number of hours spent per week playing
video games by all 60 students of the eighth grade at a school.
3.66 The following table gives the grouped data on the weights of all 100 babies born at a hospital
in 2005.
3.67 The following table gives the frequency distribution of the total miles driven during 2005 by 300 car
owners.
Find the mean, variance, and standard deviation. Give a brief interpretation of the values in the column
labeled mf in your table of calculations. What does Σmf represent?
3.68 The following table gives information on the amounts (in dollars) of electric bills for August 2005
for a sample of 50 families.
Find the mean, variance, and standard deviation. Give a brief interpretation of the values in the column
labeled mf in your table of calculations. What does Σmf represent?
3.69 For 50 airplanes that arrived late at an airport during a week, the time by which they were late was
observed. In the following table, x denotes the time (in minutes) by which an airplane was late and f de-
notes the number of airplanes.
x f
0 to less than 20 14
20 to less than 40 18
40 to less than 60 9
60 to less than 80 5
80 to less than 100 4
Find the mean, variance, and standard deviation. (Hint: The classes in this example are single-valued. These
values of classes will be used as values of m in the formulas for the mean, variance, and standard deviation.)
3.71 During fall 2004, oil prices fluctuated a lot due to wars, political unrest, and storm damage in some
oil-producing nations. The following data give the spot prices (in dollars) per barrel of crude oil for 15
business days from October 20 to November 9, 2004.
54.92 54.47 55.17 55.18 55.95 52.47 50.93
51.74 50.14 49.63 50.89 48.83 49.62 49.09 47.38
a. Find the mean for these data.
b. Construct a frequency distribution table for these data using a class width of 2.00 and the lower
boundary of the first class equal to 47.00.
c. Using the method of Section 3.3.1, find the mean of the grouped data of part b.
d. Compare your means from parts a and c. If the two means are not equal, then explain why they
differ.
Definition
Chebyshev’s Theorem For any number k greater than 1, at least (1 − 1/k²) of the data values
lie within k standard deviations of the mean.
[Figure: at least 1 − 1/k² of the values lie in the shaded areas, between μ − kσ and μ + kσ.]
[Figure: at least 75% of the values lie in the shaded areas, between μ − 2σ and μ + 2σ.]
According to Chebyshev’s theorem, at least .89 or 89% of the values fall within three standard
deviations of the mean. This is shown in Figure 3.7.
[Figure 3.7: at least 89% of the values lie in the shaded areas, between μ − 3σ and μ + 3σ.]
Although in Figures 3.5 through 3.7 we have used the population notation for the mean and
standard deviation, the theorem applies to both sample and population data. Note that Chebyshev’s
theorem is applicable to a distribution of any shape. However, Chebyshev’s theorem can be used
only for k > 1. This is so because when k = 1, the value of 1 − 1/k² is zero, and when k < 1,
the value of 1 − 1/k² is negative.
EXAMPLE 3–18
Applying Chebyshev’s theorem.
The average systolic blood pressure for 4000 women who were screened for high blood pressure
was found to be 187 with a standard deviation of 22. Using Chebyshev’s theorem, find at least
what percentage of women in this group have a systolic blood pressure between 143 and 231.
Solution Let μ and σ be the mean and the standard deviation, respectively, of the systolic
blood pressures of these women. Then, from the given information,
μ = 187 and σ = 22
To find the percentage of women whose systolic blood pressures are between 143 and 231,
the first step is to determine k. As shown below, each of the two points, 143 and 231, is 44
units away from the mean.
The value of k is obtained by dividing the distance between the mean and each point by the
standard deviation. Thus,
k = 44/22 = 2
\[
1 - \frac{1}{k^2} = 1 - \frac{1}{(2)^2} = 1 - \frac{1}{4} = 1 - .25 = .75 \text{ or } 75\%
\]
Hence, according to Chebyshev’s theorem, at least 75% of the women have systolic blood
pressure between 143 and 231. This percentage is shown in Figure 3.8.
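The two steps of this solution, finding k and then evaluating 1 − 1/k², can be sketched in Python. The function names `chebyshev_lower_bound` and `k_for_symmetric_interval` are hypothetical helpers written for this illustration.

```python
def chebyshev_lower_bound(k):
    """Return 1 - 1/k², the minimum proportion within k standard deviations."""
    if k <= 1:
        # The bound is zero at k = 1 and negative below it, so it is not useful
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2


def k_for_symmetric_interval(mean, std_dev, lower, upper):
    """Return k for an interval whose endpoints are equally far from the mean."""
    distance_below = mean - lower
    distance_above = upper - mean
    if abs(distance_below - distance_above) > 1e-9:
        raise ValueError("the interval must be symmetric about the mean")
    return distance_above / std_dev


# Example 3-18: mean 187, standard deviation 22, interval 143 to 231
k = k_for_symmetric_interval(187, 22, 143, 231)
bound = chebyshev_lower_bound(k)
print(f"k = {k:g}, at least {bound:.0%} of the values")
```

Calling `chebyshev_lower_bound(3)` in the same way gives about .89, the value quoted earlier for three standard deviations.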
Figure 3.9 illustrates the empirical rule. Again, the empirical rule applies to population data
as well as to sample data.
[Figure 3.9: bell-shaped curve with the horizontal axis marked at μ − 3σ, μ − 2σ, μ − σ, μ, μ + σ, μ + 2σ, and μ + 3σ.]
EXAMPLE 3–19
Applying the empirical rule.
The age distribution of a sample of 5000 persons is bell-shaped with a mean of 40 years and
a standard deviation of 12 years. Determine the approximate percentage of people who are 16
to 64 years old.
Solution We use the empirical rule to find the required percentage because the distribution
of ages follows a bell-shaped curve. From the given information, for this distribution,
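x̄ = 40 years and s = 12 years
Each of the two points, 16 and 64, is 24 units away from the mean, and 24 units is two
standard deviations:
16 − 40 = −24 = −2s and 64 − 40 = 24 = 2s
Consequently, as shown in Figure 3.10, the area from 16 to 64 is the area from x̄ − 2s to
x̄ + 2s.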
Because the area within two standard deviations of the mean is approximately 95% for a bell-
shaped curve, approximately 95% of the people in the sample are 16 to 64 years old.
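The reasoning of Example 3–19 can be sketched as a small lookup in Python. The percentages 68, 95, and 99.7 for one, two, and three standard deviations are the usual statement of the empirical rule and are entered here as an assumption; the function name is our own.

```python
# Approximate percentages within k standard deviations of the mean for a
# bell-shaped distribution (the usual statement of the empirical rule)
EMPIRICAL_RULE = {1: 68.0, 2: 95.0, 3: 99.7}


def empirical_rule_percentage(mean, std_dev, lower, upper):
    """Approximate percentage of values in a symmetric interval about the mean."""
    k_below = (mean - lower) / std_dev
    k_above = (upper - mean) / std_dev
    if abs(k_below - k_above) > 1e-9:
        raise ValueError("the interval must be symmetric about the mean")
    k = round(k_above)
    if abs(k_above - k) > 1e-9 or k not in EMPIRICAL_RULE:
        raise ValueError("the rule covers only 1, 2, or 3 standard deviations")
    return EMPIRICAL_RULE[k]


# Example 3-19: mean 40 years, standard deviation 12 years, ages 16 to 64
percentage = empirical_rule_percentage(40, 12, 16, 64)
print(f"approximately {percentage:g}% of the people")
```

Unlike Chebyshev’s theorem, this lookup is justified only when the distribution is bell-shaped, so the sketch should not be applied to data of arbitrary shape.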
EXERCISES
CONCEPTS AND PROCEDURES
3.72 Briefly explain Chebyshev’s theorem and its applications.
3.73 Briefly explain the empirical rule. To what kind of distribution is it applied?
3.74 A sample of 2000 observations has a mean of 74 and a standard deviation of 12. Using Chebyshev’s
theorem, find at least what percentage of the observations fall in the intervals x̄ ± 2s, x̄ ± 2.5s, and
x̄ ± 3s. Note that here x̄ ± 2s represents the interval x̄ − 2s to x̄ + 2s, and so on.
3.75 A large population has a mean of 230 and a standard deviation of 41. Using Chebyshev’s theorem, find
at least what percentage of the observations fall in the intervals μ ± 2σ, μ ± 2.5σ, and μ ± 3σ.
3.76 A large population has a mean of 310 and a standard deviation of 37. Using the empirical rule, find
what percentage of the observations fall in the intervals μ ± 1σ, μ ± 2σ, and μ ± 3σ.
3.77 A sample of 3000 observations has a mean of 82 and a standard deviation of 16. Using the
empirical rule, find what percentage of the observations fall in the intervals x̄ ± 1s, x̄ ± 2s, and x̄ ± 3s.
APPLICATIONS
3.78 The mean time taken by all participants to run a road race was found to be 220 minutes with a stan-
dard deviation of 20 minutes. Using Chebyshev’s theorem, find the percentage of runners who ran this
road race in
a. 180 to 260 minutes b. 160 to 280 minutes c. 170 to 270 minutes
3.79 The 2005 gross sales of all firms in a large city have a mean of $2.3 million and a standard devia-
tion of $.6 million. Using Chebyshev’s theorem, find at least what percentage of firms in this city had
2005 gross sales of
a. $1.1 to $3.5 million b. $.8 to $3.8 million c. $.5 to $4.1 million
3.80 Suppose the average credit card debt for households currently is $9500 with a standard deviation of
$2600.
a. Using Chebyshev’s theorem, find at least what percentage of current credit card debts for all house-
holds are between
i. $4300 and $14,700 ii. $3000 and $16,000
*b. Using Chebyshev’s theorem, find the interval that contains credit card debts of at least 89% of all
households.
3.81 The mean monthly mortgage paid by all home owners in a city is $2365 with a standard deviation
of $340.
a. Using Chebyshev’s theorem, find at least what percentage of all home owners in the city pay a
monthly mortgage of
i. $1685 to $3045 ii. $1345 to $3385
*b. Using Chebyshev’s theorem, find the interval that contains the monthly mortgage payments of at
least 84% of all home owners.
3.82 The mean life of a certain brand of auto batteries is 44 months with a standard deviation of 3 months.
Assume that the lives of all auto batteries of this brand have a bell-shaped distribution. Using the empir-
ical rule, find the percentage of auto batteries of this brand that have a life of
a. 41 to 47 months b. 38 to 50 months c. 35 to 53 months
3.83 According to Hewitt and Associates (a consulting firm in Lincolnshire, Illinois), the employee
share of health insurance premiums at large U.S. companies was expected to be $1481, on average, in
2005. Suppose the current payments by all such employees toward health insurance premiums have a
bell-shaped distribution with a mean of $1481 per year and a standard deviation of $355. Using the
empirical rule, find the percentage of employees whose annual payments toward such premiums are
between
a. $771 and $2191 b. $1126 and $1836 c. $416 and $2546
3.84 The prices of all college textbooks follow a bell-shaped distribution with a mean of $105 and a stan-
dard deviation of $20.
a. Using the empirical rule, find the percentage of all college textbooks with their prices be-
tween
i. $85 and $125 ii. $65 and $145
*b. Using the empirical rule, find the interval that contains the prices of 99.7% of college textbooks.
3.85 Suppose that on a certain section of I-95, with a posted speed limit of 65 miles per hour, the speeds of
all vehicles have a bell-shaped distribution with a mean of 72 mph and a standard deviation of 3 mph.
a. Using the empirical rule, find the percentage of vehicles with the following speeds on this sec-
tion of I-95.
i. 63 to 81 mph ii. 69 to 75 mph
*b. Using the empirical rule, find the interval that contains the speeds of 95% of vehicles traveling
on this section of I-95.
Definition
Quartiles Quartiles are three summary measures that divide a ranked data set into four
equal parts. The second quartile is the same as the median of a data set. The first quartile is
the value of the middle term among the observations that are less than the median, and the
third quartile is the value of the middle term among the observations that are greater than the
median.
Approximately 25% of the values in a ranked data set are less than Q1 and about 75% are
greater than Q1. The second quartile, Q2, divides a ranked data set into two equal parts; hence,
the second quartile and the median are the same. Approximately 75% of the data values are less
than Q3 and about 25% are greater than Q3.
The difference between the third quartile and the first quartile for a data set is called the
interquartile range (IQR).
Calculating Interquartile Range The difference between the third and the first quartiles gives the
interquartile range; that is,
IQR = Interquartile range = Q3 − Q1
Examples 3–20 and 3–21 show the calculation of the quartiles and the interquartile range.
EXAMPLE 3–20
Finding quartiles and the interquartile range.
Refer to Table 3.3 in Example 3–5 that lists the number of car thefts during 2003 in 12 cities.
That table is reproduced below.
City Number of Car Thefts
Phoenix-Mesa, Arizona 40,769
Washington, D.C. 33,956
Miami, Florida 21,088
Atlanta, Georgia 29,920
Chicago, Illinois 42,082
Kansas City, Kansas 11,669
Baltimore, Maryland 13,435
Detroit, Michigan 40,197
St. Louis, Missouri 18,215
Las Vegas, Nevada 18,103
Newark, New Jersey 14,413
Dallas, Texas 26,343
Source: National Insurance Crime Bureau.
(a) Find the values of the three quartiles. Where does the number of car thefts of 40,197
fall in relation to these quartiles?
(b) Find the interquartile range.
Solution
Finding quartiles for an even number of data values.
(a) First we rank the given data in increasing order. Then we calculate the three quartiles
as follows:
Values less than the median: 11,669 13,435 14,413 18,103 18,215 21,088
Values greater than the median: 26,343 29,920 33,956 40,197 40,769 42,082
The value of Q2, which is also the median, is given by the value of the middle term
in the ranked data set. For the data of this example, this value is the average of the sixth
and seventh terms. Consequently, Q2 is 23,715.50 car thefts. The value of Q1 is given by
the value of the middle term of the six values that fall below the median (or Q2). Thus,
it is obtained by taking the average of the third and fourth terms. So, Q1 is 16,258 car
thefts. The value of Q3 is given by the value of the middle term of the six values that fall
above the median. For the data of this example, Q3 is obtained by taking the average of
the ninth and tenth terms, and it is 37,076.50 car thefts.
The value of Q1 = 16,258 indicates that the number of car thefts in (approximately)
25% of these cities were less than 16,258 in 2003 and those in (approximately) 75%
of the cities were greater than this value. Similarly, we can state that the car thefts in
about half of these cities were less than 23,715.50 (which is Q2) in 2003 and those in
the other half were greater than this value. The value of Q3 = 37,076.50 indicates that
the car thefts in (approximately) 75% of the cities in this sample were less than
37,076.50 in 2003 and those in (approximately) 25% of the cities were greater than
this value.
By looking at the position of 40,197, we can state that this value lies in the top
25% of the car thefts.
Finding the interquartile range.
(b) The interquartile range is given by the difference between the values of the third and
the first quartiles. Thus,
IQR = Interquartile range = Q3 − Q1 = 37,076.50 − 16,258 = 20,818.50 car thefts
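The procedure of this example can be sketched in Python. The functions `middle_value` and `quartiles` are hypothetical helpers that follow the definition used in this section: Q2 is the median, and Q1 and Q3 are the medians of the values below and above it. Statistical software often uses other interpolation conventions, so library results may differ slightly from these.

```python
def middle_value(ranked):
    """Median of an already ranked list."""
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ranked = sorted(values)
    n = len(ranked)
    half = n // 2
    lower_half = ranked[:half]
    # For an odd number of values the median itself belongs to neither half
    upper_half = ranked[half:] if n % 2 == 0 else ranked[half + 1:]
    return middle_value(lower_half), middle_value(ranked), middle_value(upper_half)


# Number of car thefts during 2003 in the 12 cities of Example 3-20
car_thefts = [40769, 33956, 21088, 29920, 42082, 11669,
              13435, 40197, 18215, 18103, 14413, 26343]

q1, q2, q3 = quartiles(car_thefts)
iqr = q3 - q1
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
```

The same function also handles a data set with an odd number of values, in which case the median is excluded from both halves before Q1 and Q3 are found.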
EXAMPLE 3–21
Finding quartiles and the interquartile range.
The following are the ages of nine employees of an insurance company:
47 28 39 51 33 37 59 24 33
(a) Find the values of the three quartiles. Where does the age of 28 fall in relation to the
ages of these employees?
(b) Find the interquartile range.
Solution
Finding quartiles for an odd number of data values.
(a) First we rank the given data in increasing order. Then we calculate the three quartiles
as follows:
Values less than the median: 24 28 33 33
Median: 37
Values greater than the median: 39 47 51 59
Q1 = (28 + 33)/2 = 30.5
Q2 = 37 (also the median)
Q3 = (47 + 51)/2 = 49
[Figure: the percentiles P1, P2, P3, ..., P97, P98, P99 divide a ranked data set into parts of about 1% each.]
Thus, the kth percentile, Pk, can be defined as a value in a data set such that about k% of
the measurements are smaller than the value of Pk and about (100 − k)% of the measurements
are greater than the value of Pk.
The approximate value of the kth percentile is determined as explained next.
Calculating Percentiles The (approximate) value of the kth percentile, denoted by Pk, is
EXAMPLE 3–22
Finding the percentile for a data set.
Refer to the data on 2003 car thefts in 12 cities given in Example 3–20. Find the value of the
42nd percentile. Give a brief interpretation of the 42nd percentile.
Solution From Example 3–20, the data arranged in increasing order are as follows:
11,669 13,435 14,413 18,103 18,215 21,088 26,343 29,920 33,956 40,197 40,769 42,082
We can also calculate the percentile rank for a particular value xi of a data set by using
the formula given below. The percentile rank of xi gives the percentage of values in the data set
that are less than xi.
Example 3–23 shows how the percentile rank is calculated for a data value.
EXAMPLE 3–23
Finding the percentile rank for a data value.
Refer to the data on 2003 car thefts in 12 cities given in Example 3–20. Find the percentile
rank for 29,920 car thefts. Give a brief interpretation of this percentile rank.
Solution From Example 3–20, the data arranged in increasing order are as follows:
11,669 13,435 14,413 18,103 18,215 21,088 26,343 29,920 33,956 40,197 40,769 42,082
In this data set, 7 of the 12 values are less than 29,920. Hence,
Percentile rank of 29,920 = (7/12) × 100 = 58.33%
Rounding this answer to the nearest integral value, we can state that about 58% of these
12 cities had fewer than 29,920 car thefts in 2003. Hence, about 42% of the 12 cities
had 29,920 or more car thefts in 2003.
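The percentile rank calculation of Example 3–23 amounts to counting the values below the one of interest. A minimal Python sketch follows; the function name `percentile_rank` is our own.

```python
def percentile_rank(values, x):
    """Percentage of values in the data set that are less than x."""
    number_below = sum(1 for v in values if v < x)
    return number_below / len(values) * 100


# Number of car thefts during 2003 in the 12 cities of Example 3-20
car_thefts = [11669, 13435, 14413, 18103, 18215, 21088,
              26343, 29920, 33956, 40197, 40769, 42082]

rank = percentile_rank(car_thefts, 29920)
print(f"percentile rank of 29,920 = {rank:.2f}%")
```

Because the function only counts values, the data need not be ranked first; ranking simply makes the count easier to do by hand.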
EXERCISES
CONCEPTS AND PROCEDURES
3.86 Briefly describe how the three quartiles are calculated for a data set. Illustrate by calculating the three
quartiles for two examples, the first with an odd number of observations and the second with an even number
of observations.
3.87 Explain how the interquartile range is calculated. Give one example.
3.88 Briefly describe how the percentiles are calculated for a data set.
3.89 Explain the concept of the percentile rank for an observation of a data set.
APPLICATIONS
3.90 The following data give the weights (in pounds) lost by 15 members of a health club at the end of
two months after joining the club.
5 10 8 7 25 12 5 14
11 10 21 9 8 11 18
a. Compute the values of the three quartiles and the interquartile range.
b. Calculate the (approximate) value of the 82nd percentile.
c. Find the percentile rank of 10.
3.91 The following data give the speeds of 13 cars, measured by radar, traveling on I-84.
73 75 69 68 78 69 74
76 72 79 68 77 71
a. Find the values of the three quartiles and the interquartile range.
b. Calculate the (approximate) value of the 35th percentile.
c. Compute the percentile rank of 71.
3.92 The following data give the numbers of computer keyboards assembled at the Twentieth Century
Electronics Company for a sample of 25 days.
45 52 48 41 56 46 44 42 48 53
51 53 51 48 46 43 52 50 54 47
44 47 50 49 52
a. Calculate the values of the three quartiles and the interquartile range.
b. Determine the (approximate) value of the 53rd percentile.
c. Find the percentile rank of 50.
3.93 The following data give the number of runners left on bases by each of the 30 Major League Baseball
teams in the games played on August 12, 2004.
6 6 6 7 6 10 6 3 6 8 10 7 18 11 6
9 4 8 9 5 5 4 8 8 8 5 5 5 13 8
a. Calculate the values of the three quartiles and the interquartile range.
b. Find the (approximate) value of the 63rd percentile.
c. Find the percentile rank of 10.
3.94 Refer to Exercise 3.22. The following data give the number of students suspended for bringing
weapons to schools in the Tri-City School District for each of the past 12 weeks.
15 9 12 11 7 6 9 10 14 3 6 5
a. Determine the values of the three quartiles and the interquartile range. Where does the value of
10 fall in relation to these quartiles?
b. Calculate the (approximate) value of the 55th percentile.
c. Find the percentile rank of 7.
3.95 Nixon Corporation manufactures computer monitors. The following data give the numbers of computer
monitors produced at the company for a sample of 30 days.
24 32 27 23 33 33 29 25 23 36
26 26 31 20 27 33 27 23 28 29
31 35 34 22 37 28 23 35 31 43
a. Calculate the values of the three quartiles and the interquartile range. Where does the value of 31
lie in relation to these quartiles?
b. Find the (approximate) value of the 65th percentile. Give a brief interpretation of this percentile.
c. For what percentage of the days was the number of computer monitors produced 32 or higher?
Answer by finding the percentile rank of 32.
3.96 The following data give the numbers of new cars sold at a dealership during a 20-day period.
8 5 12 3 9 10 6 12 8 8
4 16 10 11 7 7 3 5 9 11
a. Calculate the values of the three quartiles and the interquartile range. Where does the value of 4
lie in relation to these quartiles?
b. Find the (approximate) value of the 25th percentile. Give a brief interpretation of this percentile.
c. Find the percentile rank of 10. Give a brief interpretation of this percentile rank.
3.97 According to the National Association of Realtors, the median home price in San Diego for
the second quarter of 2003 was $559,700 (USA TODAY, August 27, 2004). Suppose the following
data give the sale prices (in thousands of dollars) of a random sample of 20 recently sold homes in
San Diego.
605 789 550 881 499 675 700 543 910 808
1016 929 544 397 649 752 698 710 495 509
a. Calculate the values of the three quartiles and the interquartile range. Where does the value of
649 fall in relation to these quartiles?
b. Calculate the (approximate) value of the 77th percentile. Give a brief interpretation of this
percentile.
c. Find the percentile rank of 700. Give a brief interpretation of this percentile rank.
Definition
Box-and-Whisker Plot A plot that shows the center, spread, and skewness of a data set. It is
constructed by drawing a box and two whiskers that use the median, the first quartile, the third
quartile, and the smallest and the largest values in the data set between the lower and the upper inner
fences.
Example 3–24 explains all the steps needed to make a box-and-whisker plot.
EXAMPLE 3–24 Constructing a box-and-whisker plot.
The following data are the incomes (in thousands of dollars) for a sample of 12 households.
35 29 44 72 34 64 41 50 54 104 39 58
Solution The following five steps are performed to construct a box-and-whisker plot.
Step 1. First, rank the data in increasing order and calculate the values of the median, the
first quartile, the third quartile, and the interquartile range. The ranked data are
29 34 35 39 41 44 50 54 58 64 72 104
For these data,
$$\text{Median} = (44 + 50)/2 = 47$$
$$Q_1 = (35 + 39)/2 = 37$$
$$Q_3 = (58 + 64)/2 = 61$$
$$\text{IQR} = Q_3 - Q_1 = 61 - 37 = 24$$
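Step 2. Find the points that are 1.5 × IQR below Q1 and 1.5 × IQR above Q3. These two
points are called the lower and the upper inner fences, respectively.
$$1.5 \times \text{IQR} = 1.5 \times 24 = 36$$
$$\text{Lower inner fence} = Q_1 - 36 = 37 - 36 = 1$$
$$\text{Upper inner fence} = Q_3 + 36 = 61 + 36 = 97$$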
[Figure: horizontal axis labeled Income, with tick marks from 25 to 105]
Step 5. By drawing two lines, join the points of the smallest and the largest values within
the two inner fences to the box. These values are 29 and 72 in this example as listed in
Step 3. The two lines that join the box to these two values are called whiskers. A value
that falls outside the two inner fences is shown by marking an asterisk and is called an outlier.
This completes the box-and-whisker plot, as shown in Figure 3.14.
[Figure 3.14: box-and-whisker plot of the income data; horizontal axis labeled Income, with tick marks from 25 to 105]
In Figure 3.14, about 50% of the data values fall within the box, about 25% of the values
fall on the left side of the box, and about 25% fall on the right side of the box. Also, 50% of
the values fall on the left side of the median and 50% lie on the right side of the median. The
data of this example are skewed to the right because the lower 50% of the values are spread
over a smaller range than the upper 50% of the values.
The observations that fall outside the two inner fences are called outliers. These outliers
can be classified into two kinds: mild and extreme outliers. To do so, we define
two outer fences: a lower outer fence at 3.0 × IQR below the first quartile and an upper
outer fence at 3.0 × IQR above the third quartile. If an observation is outside either of the
two inner fences but within the two outer fences, it is called a mild outlier. An
observation that is outside either of the two outer fences is called an extreme outlier. For the
previous example, 3.0 × IQR = 72, so the outer fences are at 37 − 72 = −35 and 61 + 72 = 133.
Because 104 is outside the upper inner fence but inside the upper outer fence, it is a mild outlier.
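The quartiles, fences, and outlier classification for the income data of Example 3–24 can be checked with a short Python sketch. The function names `quartiles` and `classify_outliers` are hypothetical names chosen for this illustration. The sketch assumes the convention used in this chapter, in which Q1 and Q3 are the medians of the lower and upper halves of the ranked data; statistical software often uses other quartile conventions and may give slightly different values.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    ranked = sorted(values)
    n = len(ranked)
    half = n // 2
    lower = ranked[:half]            # values below the median position
    upper = ranked[half + n % 2:]    # values above the median position
    return median(lower), median(ranked), median(upper)


def classify_outliers(values):
    """Return the inner fences, outer fences, mild outliers, and extreme outliers."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    mild, extreme = [], []
    for v in values:
        if v < outer[0] or v > outer[1]:
            extreme.append(v)        # beyond an outer fence
        elif v < inner[0] or v > inner[1]:
            mild.append(v)           # beyond an inner fence only
    return inner, outer, mild, extreme


# Incomes (in thousands of dollars) of 12 households, Example 3-24
incomes = [35, 29, 44, 72, 34, 64, 41, 50, 54, 104, 39, 58]

q1, med, q3 = quartiles(incomes)
inner, outer, mild, extreme = classify_outliers(incomes)

# Whiskers end at the smallest and largest values inside the inner fences
inside = [v for v in incomes if inner[0] <= v <= inner[1]]
whiskers = (min(inside), max(inside))

print("Q1, median, Q3:", q1, med, q3)
print("Inner fences:", inner, "Outer fences:", outer)
print("Whisker ends:", whiskers)
print("Mild outliers:", mild, "Extreme outliers:", extreme)
```

For these data the sketch reproduces the values found by hand: quartiles of 37, 47, and 61, whiskers ending at 29 and 72, and 104 as the only (mild) outlier.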
For a symmetric data set, the line representing the median will be in the middle of the box
and the spread of the values will be over almost the same range on both sides of the box.
EXERCISES
CONCEPTS AND PROCEDURES
3.98 Briefly explain what summary measures are used to construct a box-and-whisker plot.
3.99 Prepare a box-and-whisker plot for the following data:
36 43 28 52 41 59 47 61
24 55 63 73 32 25 35 49
31 22 61 42 58 65 98 34
Does this data set contain any outliers?
3.100 Prepare a box-and-whisker plot for the following data:
11 8 26 31 62 19 7 3 14 75
33 30 42 15 18 23 29 13 16 6
Does this data set contain any outliers?
APPLICATIONS
3.101 The following data give the time (in minutes) that each of 20 students selected from a university
waited in line at their bookstore to pay for their textbooks in the beginning of the Spring 2006
semester.
15 8 23 21 5 17 31 22 34 6
5 10 14 17 16 25 30 3 31 19
Prepare a box-and-whisker plot. Comment on the skewness of these data.
3.102 Refer to Exercise 3.97. The following data give the sale prices (in thousands of dollars) of a random
sample of 20 recently sold homes in San Diego.
605 789 550 881 499 675 700 543 910 808
1016 929 544 397 649 752 698 710 495 509
Prepare a box-and-whisker plot. Are the data skewed in any direction?
3.103 The following data give the crude oil reserves (in billions of barrels) of Saudi Arabia, Iraq, Kuwait,
Iran, United Arab Emirates, Venezuela, Russia, Libya, Nigeria, China, Mexico, and the United States (USA
TODAY, June 7, 2004). The reserves for these countries are listed in that order.
261.7 112.0 97.7 94.4 80.3 64.0
51.2 29.8 27.0 26.8 25.0 22.5
Prepare a box-and-whisker plot. Are the data symmetric or skewed?
3.104 The following data give the numbers of computer keyboards assembled at the Twentieth Century
Electronics Company for a sample of 25 days.
45 52 48 41 56 46 44 42 48 53
51 53 51 48 46 43 52 50 54 47
44 47 50 49 52
Prepare a box-and-whisker plot. Comment on the skewness of these data.
3.105 Refer to Exercise 3.93. The following data give the number of runners left on bases by each of the
30 Major League Baseball teams in the games played on August 12, 2004.
6 6 6 7 6 10 6 3 6 8 10 7 18 11 6
9 4 8 9 5 5 4 8 8 8 5 5 5 13 8
Prepare a box-and-whisker plot. Are the data symmetric or skewed?
3.106 Refer to Exercise 3.22. The following data give the number of students suspended for bringing
weapons to schools in the Tri-City School District for each of the past 12 weeks.
15 9 12 11 7 6 9 10 14 3 6 5
Make a box-and-whisker plot. Comment on the skewness of these data.
3.107 Nixon Corporation manufactures computer monitors. The following are the numbers of computer
monitors produced at the company for a sample of 30 days:
24 32 27 23 33 33 29 25 23 28
21 26 31 20 27 33 27 23 28 29
31 35 34 22 26 28 23 35 31 27
Prepare a box-and-whisker plot. Comment on the skewness of these data.
3.108 The following data give the numbers of new cars sold at a dealership during a 20-day period.
8 5 12 3 9 10 6 12 8 8
4 16 10 11 7 7 3 5 9 11
Make a box-and-whisker plot. Comment on the skewness of these data.
Glossary
Bimodal distribution A distribution that has two modes.
Box-and-whisker plot A plot that shows the center, spread, and skewness of a data set with a box and two whiskers using the median, the first quartile, the third quartile, and the smallest and the largest values in the data set between the lower and the upper inner fences.
Chebyshev’s theorem For any number k greater than 1, at least $(1 - 1/k^2)$ of the values for any distribution lie within k standard deviations of the mean.
Coefficient of variation A measure of relative variability that expresses standard deviation as a percentage of the mean.
Empirical rule For a specific bell-shaped distribution, about 68% of the observations fall in the interval $(\mu - \sigma)$ to $(\mu + \sigma)$, about 95% fall in the interval $(\mu - 2\sigma)$ to $(\mu + 2\sigma)$, and about 99.7% fall in the interval $(\mu - 3\sigma)$ to $(\mu + 3\sigma)$.
First quartile The value in a ranked data set such that about 25% of the measurements are smaller than this value and about 75% are larger. It is the median of the values that are smaller than the median of the whole data set.
Geometric mean Calculated by taking the nth root of the product of all values in a data set.
Interquartile range (IQR) The difference between the third and the first quartiles.
Lower inner fence The value in a data set that is 1.5 × IQR below the first quartile.
Lower outer fence The value in a data set that is 3.0 × IQR below the first quartile.
Mean A measure of central tendency calculated by dividing the sum of all values by the number of values in the data set.
Measures of central tendency Measures that describe the center of a distribution. The mean, median, and mode are three of the measures of central tendency.
Measures of dispersion Measures that give the spread of a distribution. The range, variance, and standard deviation are three such measures.
Measures of position Measures that determine the position of a single value in relation to other values in a data set. Quartiles, percentiles, and percentile rank are examples of measures of position.
Median The value of the middle term in a ranked data set. The median divides a ranked data set into two equal parts.
Mode The value (or values) that occurs with highest frequency in a data set.
Multimodal distribution A distribution that has more than two modes.
Parameter A summary measure calculated for population data.
Percentile rank The percentile rank of a value gives the percentage of values in the data set that are smaller than this value.
Percentiles Ninety-nine values that divide a ranked data set into 100 equal parts.
Quartiles Three summary measures that divide a ranked data set into four equal parts.
Range A measure of spread obtained by taking the difference between the largest and the smallest values in a data set.
Second quartile Middle or second of the three quartiles that divide a ranked data set into four equal parts. About 50% of the values in the data set are smaller and about 50% are larger than the second quartile. The second quartile is the same as the median.
Standard deviation A measure of spread that is given by the positive square root of the variance.
Statistic A summary measure calculated for sample data.
Third quartile Third of the three quartiles that divide a ranked data set into four equal parts. About 75% of the values in a data set are smaller than the value of the third quartile and about 25% are larger. It is the median of the values that are greater than the median of the whole data set.
Trimmed mean The k% trimmed mean is obtained by dropping k% of the smallest values and k% of the largest values from the given data and then calculating the mean of the remaining $(100 - 2k)\%$ of the values.
Unimodal distribution A distribution that has only one mode.
Upper inner fence The value in a data set that is 1.5 × IQR above the third quartile.
Upper outer fence The value in a data set that is 3.0 × IQR above the third quartile.
Variance A measure of spread.
Weighted mean Mean of a data set whose values are assigned different weights before the mean is calculated.
Supplementary Exercises
3.109 Each year the faculty at Metro Business College chooses 10 members from the current graduating
class that they feel are most likely to succeed. The data below give the current annual incomes (in thousands
of dollars) of the 10 members of the class of 2005 who were voted most likely to succeed.
59 68 44 68 57 104 56 44 47 40
a. Calculate the mean and median. Do these data have a mode(s)? Why or why not? Explain.
b. Find the range, variance, and standard deviation.
3.112 The following data give the numbers of driving citations received by 12 drivers.
4 8 0 3 11 7 4 14 8 13 7 9
Find the mean, variance, and standard deviation. Are the values of these summary measures population
parameters or sample statistics?
3.114 The following table gives the frequency distribution of the times (in minutes) that 50 commuter students
at a large university spent looking for parking spaces on the first day of classes in the Spring semester
of 2006.
Find the mean, variance, and standard deviation. Are the values of these summary measures population
parameters or sample statistics?
3.115 The mean time taken to learn the basics of a word processor by all students is 200 minutes with a
standard deviation of 20 minutes.
a. Using Chebyshev’s theorem, find at least what percentage of students will learn the basics of
this word processor in
i. 160 to 240 minutes ii. 140 to 260 minutes
*b. Using Chebyshev’s theorem, find the interval that contains the time taken by at least 75% of
all students to learn this word processor.
3.116 According to the Statistical Abstract of the United States, Americans were expected to spend an average
of 1669 hours watching television in 2004 (USA TODAY, March 30, 2004). Assume that the average
time spent watching television by Americans this year will have a distribution that is skewed to the
right with a mean of 1750 hours and a standard deviation of 450 hours.
a. Using Chebyshev’s theorem, find at least what percentage of Americans will watch television
this year for
i. 850 to 2650 hours ii. 400 to 3100 hours
*b. Using Chebyshev’s theorem, find the interval that will contain the television viewing times of
at least 84% of all Americans.
3.117 Refer to Exercise 3.115. Suppose the times taken to learn the basics of this word processor by all students
have a bell-shaped distribution with a mean of 200 minutes and a standard deviation of 20 minutes.
a. Using the empirical rule, find the percentage of students who learn the basics of this word
processor in
i. 180 to 220 minutes ii. 160 to 240 minutes
*b. Using the empirical rule, find the interval that contains the time taken by 99.7% of all
students to learn this word processor.
3.118 Assume that the annual earnings of all employees with CPA certification and 12 years of experience
and working for large firms have a bell-shaped distribution with a mean of $134,000 and a standard
deviation of $12,000.
a. Using the empirical rule, find the percentage of all such employees whose annual earnings are
between
i. $98,000 and $170,000 ii. $110,000 and $158,000
*b. Using the empirical rule, find the interval that contains the annual earnings of 68% of all such
employees.
3.119 Refer to the data of Exercise 3.109 on the current annual incomes (in thousands of dollars) of the
10 members of the class of 2005 of the Metro Business College who were voted most likely to succeed.
59 68 44 68 57 104 56 44 47 40
a. Determine the values of the three quartiles and the interquartile range. Where does the value
of 40 fall in relation to these quartiles?
b. Calculate the (approximate) value of the 70th percentile. Give a brief interpretation of this
percentile.
c. Find the percentile rank of 47. Give a brief interpretation of this percentile rank.
3.120 Refer to the data given in Exercise 3.111 on the total yards gained by the top 10 NFL pass receivers
in single games during the 2004 regular National Football League season.
a. Determine the values of the three quartiles and the interquartile range. Where does the value
of 179 lie in relation to these quartiles?
b. Calculate the (approximate) value of the 70th percentile. Give a brief interpretation of this
percentile.
c. Find the percentile rank of 171. Give a brief interpretation of this percentile rank.
3.121 A student washes her clothes at a laundromat once a week. The data below give the time (in minutes)
she spent in the laundromat for each of 15 randomly selected weeks. Here, time spent in the laundromat
includes the time spent waiting for a machine to become available.
75 62 84 73 107 81 93 72
135 77 85 67 90 83 112
Prepare a box-and-whisker plot. Is the data set skewed in any direction? If yes, is it skewed to the right
or to the left? Does this data set contain any outliers?
3.122 The following data give the lengths of time (in weeks) taken to find a full-time job by 18 computer
science majors who graduated in 2005 from a small college.
10 3 12 21 15 8 4 2 16
8 9 14 33 7 24 11 42 15
Make a box-and-whisker plot. Comment on the skewness of this data set. Does this data set contain any
outliers?
Advanced Exercises
3.123 Melissa’s grade in her math class is determined by three 100-point tests and a 200-point final exam.
To determine the grade for a student in this class, the instructor will add the four scores together and divide
this sum by 5 to obtain a percentage. This percentage must be at least 80 for a grade of B. If Melissa’s three
test scores are 75, 69, and 87, what is the minimum score she needs on the final exam to obtain a B grade?
3.124 Jeffrey is serving on a six-person jury for a personal-injury lawsuit. All six jurors want to award
damages to the plaintiff but cannot agree on the amount of the award. The jurors have decided that each
of them will suggest an amount that he or she thinks should be awarded; then they will use the mean of
these six numbers as the award to recommend to the plaintiff.
a. Jeffrey thinks the plaintiff should receive $20,000, but he thinks the mean of the other
five jurors’ recommendations will be about $12,000. He decides to suggest an inflated
amount so that the mean for all six jurors is $20,000. What amount would Jeffrey have to
suggest?
b. How might this jury revise its procedure to prevent a juror like Jeffrey from having an undue
influence on the amount of damages to be awarded to the plaintiff?
3.125 The heights of five starting players on a basketball team have a mean of 76 inches, a median of
78 inches, and a range of 11 inches.
a. If the tallest of these five players is replaced by a substitute who is two inches taller, find the
new mean, median, and range.
b. If the tallest player is replaced by a substitute who is four inches shorter, which of the new values
(mean, median, range) could you determine, and what would their new values be?
3.126 On a 300-mile auto trip, Lisa averaged 52 miles per hour for the first 100 miles, 65 mph for the
second 100 miles, and 58 mph for the last 100 miles.
a. How long did the 300-mile trip take?
b. Could you find Lisa’s average speed for the 300-mile trip by calculating (52 + 65 + 58)/3?
If not, find the correct average speed for the trip.
3.127 A small country bought oil from three different sources in one week, as shown in the following table.
Find the mean price per barrel for all 1300 barrels of oil purchased in that week.
3.128 During the 2004 winter season, a homeowner received four deliveries of heating oil, as shown in
the following table.
The homeowner claimed that the mean price he paid for oil during the season was
(1.10 + 1.25 + 1.28 + 1.33)/4 = $1.24 per gallon. Do you agree with this claim? If not, explain why this method of
calculating the mean is not appropriate in this case. Find the correct value of the mean price.
3.129 In the Olympic Games, when events require a subjective judgment of an athlete’s performance, the
highest and lowest of the judges’ scores may be dropped. Consider a gymnast whose performance is judged
by seven judges and the highest and the lowest of the seven scores are dropped.
a. Gymnast A’s scores in this event are 9.4, 9.7, 9.5, 9.5, 9.4, 9.6, and 9.5. Find this gymnast’s
mean score after dropping the highest and the lowest scores.
b. The answer to part a is an example of what percentage of trimmed mean?
c. Write another set of scores for a gymnast B so that gymnast A has a higher mean score than
gymnast B based on the trimmed mean, but gymnast B would win if all seven scores were
counted. Do not use any scores lower than 9.0.
3.130 A survey of young people’s shopping habits in a small city during the summer months of 2005
showed the following: Shoppers aged 12–14 took an average of 8 shopping trips per month and spent
an average of $14 per trip. Shoppers aged 15–17 took an average of 11 trips per month and spent an
average of $18 per trip. Assume that this city has 1100 shoppers aged 12–14 and 900 shoppers aged
15–17.
a. Find the total amount spent per month by all these 2000 shoppers in both age groups.
b. Find the mean number of shopping trips per person per month for these 2000 shoppers.
c. Find the mean amount spent per person per month by shoppers aged 12–17 in this city.
3.131 The following table shows the total population and the number of deaths (in thousands) due to heart
attack for two age groups in Countries A and B for 2005.
                              30 and Under          31 and Over
                              A          B          A          B
Population                    40,000     25,000     20,000     35,000
Deaths due to heart attack    1000       500        2000       3000
a. Calculate the death rate due to heart attack per 1000 population for the 30 and under age
group for each of the two countries. Which country has the lower death rate in this age
group?
b. Calculate the death rates due to heart attack for the two countries for the 31 and over age
group. Which country has the lower death rate in this age group?
c. Calculate the death rate due to heart attack for the entire population of Country A; then do the
same for Country B. Which country has the lower overall death rate?
d. How can the country with lower death rate in both age groups have the higher overall death
rate? (This phenomenon is known as Simpson’s paradox.)
3.132 In a study of distances traveled to a college by commuting students, data from 100 commuters
yielded a mean of 8.73 miles. After the mean was calculated, data came in late from three students, with
distances of 11.5, 7.6, and 10.0 miles. Calculate the mean distance for all 103 students.
3.133 The test scores for a large statistics class have an unknown distribution with a mean of 70 and a
standard deviation of 10.
a. Find k so that at least 50% of the scores are within k standard deviations of the mean.
b. Find k so that at most 10% of the scores are more than k standard deviations above the
mean.
3.134 The test scores for a very large statistics class have a bell-shaped distribution with a mean of
70 points.
a. If 16% of all students in the class scored above 85, what is the standard deviation of the
scores?
b. If 95% of the scores are between 60 and 80, what is the standard deviation?
3.135 How much does the typical American family spend to go away on vacation each year? Twenty-five
randomly selected households reported the following vacation expenditures (rounded to the nearest hundred
dollars) during the past year:
a. Using both graphical and numerical methods, organize and interpret these data.
b. What measure of central tendency best answers the original question?
3.136 Actuaries at an insurance company must determine a premium for a new type of insurance. A random
sample of 40 potential purchasers of this type of insurance were found to have suffered the following
values of losses during the past year. These losses would have been covered by the insurance if it were
available.
Men 87 68 92 79 83 67 71 92 112
75 77 102 79 78 85 75 72
Women 101 100 87 95 98 81 117 107 103
97 90 100 99 94 94
a. Make a box-and-whisker plot for each of the data sets and use them to discuss the similarities
and differences between the scores of the men and women golfers.
b. Compute the various descriptive measures you have learned for each sample. How do they
compare?
3.138 Answer the following questions.
a. The total weight of all pieces of luggage loaded onto an airplane is 12,372 pounds, which
works out to be an average of 51.55 pounds per piece. How many pieces of luggage are on
the plane?
b. A group of seven friends, having just gotten back a chemistry exam, discuss their scores. Six of
the students reveal that they received grades of 81, 75, 93, 88, 82, and 85, but the seventh student
is reluctant to say what grade she received. After some calculation she announces that the
group averaged 81 on the exam. What is her score?
3.139 Suppose that there are 150 freshmen engineering majors at a college and each of them will take the
same five courses next semester. Four of these courses will be taught in small sections of 25 students each,
whereas the fifth course will be taught in one section containing all 150 freshmen. To accommodate all
150 students, there must be six sections of each of the four courses taught in 25-student sections. Thus,
there are 24 classes of 25 students each and one class of 150 students.
Compute the mean, median, and standard deviation for the weights of all students, of men only, and
of women only. Of the mean and median, which is the more informative measure of central tendency?
Write a brief note comparing the three measures for all students, men only, and women only.
3.141 The distribution of the lengths of fish in a certain lake is not known, but it is definitely not bell-shaped.
It is estimated that the mean length is 6 inches with a standard deviation of 2 inches.
a. At least what proportion of fish in the lake are between 3 inches and 9 inches long?
b. What is the smallest interval that will contain the lengths of at least 84% of the fish?
c. Find an interval so that fewer than 36% of the fish have lengths outside this interval.
3.142 The following stem-and-leaf diagram gives the distances (in thousands of miles) driven during the
past year by a sample of drivers in a city.
0 3 6 9
1 2 8 5 1 0 5
2 5 1 6
3 8
4 1
5
6 2
a. Compute the sample mean, median, and mode for the data on distances driven.
b. Compute the range, variance, and standard deviation for these data.
c. Compute the first and third quartiles.
d. Compute the interquartile range. Describe what properties the interquartile range has. When
would it be preferable to using the standard deviation when measuring variation?
3.143 Refer to the data in Problem 3.140. Two individuals, one from Canada and one from England, are
interested in your analysis of these data but they need your results in different units. The Canadian individual
wants the results in grams (1 pound = 453.59 grams), while the English individual wants the results
in stone (1 stone = 14 pounds).
a. Convert the data on weights from pounds to grams, and then recalculate the mean, median,
and standard deviation of weight for males and females separately. Repeat the procedure,
changing the unit from pounds to stones.
b. Convert your answers from Problem 3.140 to grams and stone. What do you notice about
these answers and your answers from part a?
c. What happens to the values of the mean, median, and standard deviation when you convert
from a larger unit to a smaller unit (e.g., from pounds to grams)? Does the same thing happen
if you convert from a smaller unit (e.g., pounds) to a larger unit (e.g., stone)?
d. Figure 3.15 on the next page gives a stacked dotplot of these weights in pounds and stone.
Which of these two distributions has more variability? Use your results from parts a to c to
explain why this is the case.
e. Now consider the weights in pounds and grams. Make a stacked dotplot for these data and
answer part d.
3.144 Although the standard workweek is 40 hours a week, many people work a lot more than 40 hours
a week. The data on the next page give the numbers of hours worked last week by 50 people.
40.5 41.3 41.4 41.5 42.0 42.2 42.4 42.4 42.6 43.3
43.7 43.9 45.0 45.0 45.2 45.8 45.9 46.2 47.2 47.5
47.8 48.2 48.3 48.8 49.0 49.2 49.9 50.1 50.6 50.6
50.8 51.5 51.5 52.3 52.3 52.6 52.7 52.7 53.4 53.9
54.4 54.8 55.0 55.4 55.4 55.4 56.2 56.3 57.8 58.7
a. The sample mean and sample standard deviation for this data set are 49.012 and 5.080, respectively.
Using Chebyshev’s theorem, calculate the intervals that contain at least 75%,
88.89%, and 93.75% of the data.
b. Determine the actual percentages of the given data values that fall in each of the intervals that
you calculated in part a. Also calculate the percentage of the data values that fall within one
standard deviation of the mean.
c. Do you think the lower endpoints provided by Chebyshev’s Theorem in part a are useful for
this problem? Explain your answer.
d. Suppose that the individual with the first number (54.4) in the fifth row of the data is a workaholic
who actually worked 84.4 hours last week, and not 54.4 hours. With this change, now
$\bar{x} = 49.61$ and $s = 7.10$. Recalculate the intervals for part a and the actual percentages for
part b. Did your percentages change a lot or a little?
e. How many standard deviations above the mean would you have to go to capture all 50 data
values? What is the lower bound for the percentage of the data that should fall in the interval,
according to Chebyshev?
3.145 Refer to the women’s golf scores in Exercise 3.137. It turns out that 117 was mistakenly entered.
Although this person still had the highest score among the 15 women, her score was not a mild or extreme
outlier according to the box-and-whisker plot, nor was she tied for the highest score. What are the
possible scores that she could have shot?
APPENDIX 3.1
A3.1.1 BASIC FORMULAS FOR THE VARIANCE AND STANDARD
DEVIATION FOR UNGROUPED DATA
Example 3–25 illustrates how to use the basic formulas to calculate the variance and standard deviation
for ungrouped data. From Section 3.2.2, the basic formulas for variance for ungrouped data are
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
where $\sigma^2$ is the population variance and $s^2$ is the sample variance.
In either case, the standard deviation is obtained by taking the square root of the variance.
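The two basic formulas can also be checked numerically. The following is a minimal Python sketch, not part of the original text; the function names `population_variance` and `sample_variance` are illustrative choices, and the five data values are the wealth figures tabulated in Example 3–25.

```python
import math


def population_variance(values):
    """Basic formula: sum of squared deviations from mu, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Basic formula: sum of squared deviations from x-bar, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


# Total wealth (in billions of dollars) of five persons, from Example 3-25.
wealth = [46.5, 18.0, 16.0, 7.8, 7.2]

s2 = sample_variance(wealth)   # 1030.88 / 4
s = math.sqrt(s2)              # positive square root of the variance

print(round(s2, 2), round(s, 2))  # 257.72 16.05
```

The only difference between the two functions is the divisor: $N$ for a population and $n - 1$ for a sample.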
EXAMPLE 3–25 Calculating the variance and standard deviation for ungrouped data using basic formulas.
Refer to Example 3–12, where we used the short-cut formulas to compute the variance and standard
deviation for the data on the total wealth (in billions of dollars) of five persons. Calculate the
variance and standard deviation for those data using the basic formula.
Solution Let $x$ denote the total wealth (in billions of dollars) of a person. Table 3.14 shows all the
required calculations to find the variance and standard deviation.
Table 3.14
$x$       $x - \bar{x}$              $(x - \bar{x})^2$
46.5      46.5 - 19.1 =  27.4        750.76
18.0      18.0 - 19.1 =  -1.1          1.21
16.0      16.0 - 19.1 =  -3.1          9.61
 7.8       7.8 - 19.1 = -11.3        127.69
 7.2       7.2 - 19.1 = -11.9        141.61
$\sum x = 95.5$                      $\sum (x - \bar{x})^2 = 1030.88$
The following steps are performed to compute the variance and standard deviation.
Step 1. Find the mean as follows:
$$\bar{x} = \frac{\sum x}{n} = \frac{95.5}{5} = 19.1$$
Step 2. Calculate $x - \bar{x}$, the deviation of each value of $x$ from the mean. The results are shown in the
second column of Table 3.14.
Step 3. Square each of the deviations of $x$ from $\bar{x}$; that is, calculate each of the $(x - \bar{x})^2$ values. These
values are called the squared deviations, and they are recorded in the third column.
Step 4. Add all the squared deviations to obtain $\sum (x - \bar{x})^2$; that is, sum all the values given in the third
column of Table 3.14. This gives
$$\sum (x - \bar{x})^2 = 1030.88$$
Step 5. Obtain the sample variance by dividing the sum of the squared deviations by $n - 1$. Thus,
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{1030.88}{5 - 1} = 257.72$$
Step 6. Obtain the sample standard deviation by taking the positive square root of the variance. Hence,
$$s = \sqrt{257.72} \approx 16.05$$
that is, approximately 16.05 billion dollars.
frequency of a class.
In either case, the standard deviation is obtained by taking the square root of the variance.
EXAMPLE 3–26 Calculating the variance and standard deviation for grouped data using basic formulas.
In Example 3–17, we used the short-cut formula to compute the variance and standard
deviation for the data on the numbers of orders received each day during the past 50 days at the office of
a mail-order company. Calculate the variance and standard deviation for those data using the basic formula.
Solution All the required calculations to find the variance and standard deviation appear in Table 3.15.
Table 3.15
Number of Orders    $f$    $m$    $mf$    $m - \bar{x}$    $(m - \bar{x})^2$    $f(m - \bar{x})^2$
The following steps are performed to compute the variance and standard deviation using the basic
formula.
Step 1. Find the midpoint of each class. Multiply the corresponding values of $m$ and $f$. Find $\sum mf$. From
Table 3.15, $\sum mf = 832$.
Step 2. Find the mean as follows:
$$\sum f(m - \bar{x})^2 = 371.5200$$
Step 6. Obtain the sample variance by dividing $\sum f(m - \bar{x})^2$ by $n - 1$. Thus,
$$s^2 = \frac{\sum f(m - \bar{x})^2}{n - 1} = \frac{371.5200}{50 - 1} = 7.5820$$
Step 7. Obtain the standard deviation by taking the positive square root of the variance.
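The grouped-data steps above can be sketched in Python as follows. This is an illustration only: the class midpoints and frequencies in `classes` are assumed values, because the body of Table 3.15 is not reproduced here. They were chosen so that they reproduce the totals quoted in the steps ($n = 50$, $\sum mf = 832$, and $\sum f(m - \bar{x})^2 = 371.5200$).

```python
import math

# (midpoint m, frequency f) for each class.
# ASSUMED values: the table body is not shown in the text; these merely
# reproduce the totals quoted in the steps (n = 50, sum of mf = 832).
classes = [(11, 4), (14, 12), (17, 20), (20, 14)]

n = sum(f for _, f in classes)               # number of observations
sum_mf = sum(m * f for m, f in classes)      # Step 1: sum of m * f
x_bar = sum_mf / n                           # Step 2: mean
# Steps 3-5: deviations, squared deviations, weighted by frequency, summed.
sum_sq = sum(f * (m - x_bar) ** 2 for m, f in classes)
s2 = sum_sq / (n - 1)                        # Step 6: sample variance
s = math.sqrt(s2)                            # Step 7: standard deviation

print(n, sum_mf, x_bar)
print(round(sum_sq, 4), round(s2, 4), round(s, 2))
```

With these assumed classes, the sketch gives a sample variance of about 7.5820 and a standard deviation of about 2.75 orders.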
Self-Review Test
1. The value of the middle term in a ranked data set is called the
a. mean b. median c. mode
2. Which of the following summary measures is/are influenced by extreme values?
a. mean b. median c. mode d. range
3. Which of the following summary measures can be calculated for qualitative data?
a. mean b. median c. mode
4. Which of the following can have more than one value?
a. mean b. median c. mode
5. Which of the following is obtained by taking the difference between the largest and the smallest
values of a data set?
a. variance b. range c. mean
6. Which of the following is the mean of the squared deviations of x values from the mean?
a. standard deviation b. population variance c. sample variance
7. The values of the variance and standard deviation are
a. never negative b. always positive c. never zero
22. The following data give the number of times the metal detector was set off by passengers at a small
airport during 15 consecutive half-hour periods on February 1, 2006.
7 2 12 13 0 8 10
15 3 5 14 20 1 11 4
a. Calculate the three quartiles and the interquartile range. Where does the value of 4 lie in
relation to these quartiles?
b. Find the (approximate) value of the 60th percentile. Give a brief interpretation of this value.
c. Calculate the percentile rank of 12. Give a brief interpretation of this value.
23. Make a box-and-whisker plot for the data on the number of times passengers set off the airport metal
detector given in Problem 22. Comment on the skewness of this data set.
*24. The mean weekly wages of a sample of 15 employees of a company are $435. The mean weekly
wages of a sample of 20 employees of another company are $490. Find the combined mean for these
35 employees.
*25. The mean GPA of five students is 3.21. The GPAs of four of these five students are 3.85, 2.67, 3.45,
and 2.91. Find the GPA of the fifth student.
*26. The following are the prices (in thousands of dollars) of 10 houses sold recently in a city:
179 166 58 207 287 149 193 2534 163 238
Calculate the 10% trimmed mean for this data set. Do you think the 10% trimmed mean is a better
summary measure than the (simple) mean (i.e., the mean of all 10 values) for these data? Briefly explain why
or why not.
*27. Consider the following two data sets.
Data Set I: 8 16 20 35
Data Set II: 5 13 17 32
Note that each value of the second data set is obtained by subtracting 3 from the corresponding value of
the first data set.
a. Calculate the mean for each of these two data sets. Comment on the relationship between the
two means.
b. Calculate the standard deviation for each of these two data sets. Comment on the relationship
between the two standard deviations.
Mini-Projects
MINI-PROJECT 3–1
Refer to the data you collected for Mini-Project 1–1 of Chapter 1 and analyzed graphically in Mini-Project 2–1
of Chapter 2. Write a report summarizing those data. This report should include answers to at least the
following questions.
a. Calculate the summary measures (mean, standard deviation, five-number summary, interquartile
range) for the variables you graphed in Mini-Project 2–1. Do this for the entire data set, as well
as for the different groups formed by the categorical variable that you used to divide the data set
in Mini-Project 2–1.
b. Are the summary measures for the various groups similar to those for the entire data set? If not,
which ones differ and how do they differ? Make the same comparisons among the summary
measures for various groups. Do the groups have similar levels of variability? Explain how you can
determine this from the graphs that you created in Mini-Project 2–1.
c. Draw a box-and-whisker plot for the entire data set. Also draw side-by-side box-and-whisker plots
for the various groups. Are there any outliers? If so, are there any values that are outliers in any
of the groups but not in the entire data set? Does the plot show any skewness?
d. Discuss which measures for the center and spread would be more appropriate to use to describe
your data set. Also, discuss your reasons for using those measures.
MINI-PROJECT 3–2
You are employed as a statistician for a company that makes household products, which are sold by
part-time salespersons who work during their spare time. The company has four salespersons employed in a
small town. Let us denote these salespersons by A, B, C, and D. The sales records (in dollars) for the past
six weeks for these four salespersons are shown in the following table.
Week A B C D
1 1774 2205 1330 1402
2 1808 1507 1295 1665
3 1890 2352 1502 1530
4 1932 1939 1104 1826
5 1855 2052 1189 1703
6 1726 1630 1441 1498
Your supervisor has asked you to prepare a brief report comparing the sales volumes and the consistency
of sales of these four salespersons. Use the mean sales for each salesperson to compare the sales volumes,
and then choose an appropriate statistical measure to compare the consistency of sales. Make the calcula-
tions and write a report.
Figure 3.16 Histogram of Prices of Homes in Suburb A. Figure 3.17 Histogram of Prices of Homes in Suburb B.
TECHNOLOGY INSTRUCTION
TI-84
Numerical Descriptive Measures
1. To calculate the sample statistics (e.g., mean, standard deviation, and five-number
summary), first enter your data into a list such as L1, then select STAT > CALC > 1-Var Stats,
and press Enter. Access the name of your list by pressing 2nd > STAT and scrolling
through the list of names until you get to your list name. Press Enter. You will obtain the
output shown in Screens 3.1 and 3.2.
Screen 3.1 shows, in this order, the sample mean, the sum of the data values, the sum
of the squared data values, the sample standard deviation, the value of the population
standard deviation (you will use this only when your data constitute a census instead of a
sample), and the number of data values (i.e., the sample or population size). Pressing the
downward arrow key will show the five-number summary, which is shown in Screen 3.2.
Screen 3.1
2. Constructing a box-and-whisker plot is similar to constructing
a histogram. First enter your data into a list such as L1, then
select STAT PLOT and go into one of the three plots. Make
sure the plot is turned on. For the type, select the second row,
first column (this boxplot will display outliers, if there are
any). Enter the name of your list for XList. Select ZOOM > 9
to display the plot as shown in Screen 3.3.
Screen 3.2
Screen 3.3
MINITAB
1. To find the sample statistics (e.g., the mean, standard deviation, and five-number summary),
first enter the given data in a column such as C1, and then select Stat > Basic Statistics >
Display Descriptive Statistics. In the dialog box you obtain, enter the name of the column
where your data are stored in the Variables box as shown in Screen 3.4. Click the
Statistics button in this dialog box and choose the summary measures you want to
calculate in the new dialog box as shown in Screen 3.5. Click OK in both dialog boxes.
The output will appear in the Session window, which is shown in Screen 3.6 here.
Screen 3.4
Screen 3.5
Screen 3.6
2. To create a box-and-whisker plot, enter the given data in a column such as C1, select
Graph > Boxplot > Simple, and click OK. In the dialog box you obtain, enter the name of
the column with data in the Graph Variables box (see Screen 3.7) and click OK. The
boxplot shown in Screen 3.8 will appear.
Screen 3.7
Screen 3.8
EXCEL
1. For each of the commands in Excel,
a. Type =command( in a cell, where command is the name of the function.
b. Select the range of data.
c. Type a right parenthesis, and then press Enter.
2. To find the mean, use the command average. (See Screens 3.9 and 3.10.)
3. To find the median, use the command median.
4. To find the mode, use the command mode.
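For readers working outside these three tools, the same three measures can be computed with Python's standard `statistics` module. This is a supplementary sketch rather than part of the original instructions, and the data values are hypothetical.

```python
import statistics

# Hypothetical data values, used only for illustration.
values = [4, 8, 8, 5, 10]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # middle value of the ranked data
mode = statistics.mode(values)      # most frequently occurring value

print(mean, median, mode)  # 7 8 8
```

When a data set has more than one mode, `statistics.multimode` returns all of them as a list.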
TECHNOLOGY ASSIGNMENTS
TA3.1 Refer to the subsample taken in the Computer Assignment TA2.3 of Chapter 2 from the sample
data on the time taken to run the Manchester Road Race. Find the mean, median, range, and standard
deviation for those data.
TA3.2 Refer to the data on phone charges given in Data Set I. From that data set select the 4th value
and then select every 10th value after that (i.e., select the 4th, 14th, 24th, 34th . . . values). Such a
sample taken from a population is called a systematic random sample. Find the mean, median, standard
deviation, first quartile, and third quartile for the phone charges for this subsample.
TA3.3 Refer to Data Set I on the prices of various products in different cities across the country. Select
a subsample of the prices of regular unleaded gas for 40 cities. Find the mean, median, and standard
deviation for the data of this subsample.
TA3.4 Refer to the data of TA3.3. Make a box-and-whisker plot for those data.
TA3.5 Refer to Data Set I on the prices of various products in different cities across the country. Make
a box-and-whisker plot for the data on the monthly telephone charges.
TA3.6 Refer to the data on the numbers of computer keyboards assembled at the Twentieth Century
Electronics Company for a sample of 25 days given in Exercise 3.104. Prepare a box-and-whisker plot for
those data.