Comparison of Batches
$x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$,
$X = (X_1, X_2, \ldots, X_p)$
1.1 Boxplots
Example 1.1 The Swiss bank data (see Chap. 22, Sect. 22.2) consists of 200
measurements on Swiss bank notes. The first half of these measurements are from
genuine bank notes, the other half are from counterfeit bank notes.
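The authorities measured several variables on each bank note, as indicated in Fig. 1.1; these include the length of the note, $X_1$, and the diagonal, $X_6$.
[Fig. 1.1: A bank note with the measured variables $X_1$ to $X_5$ marked]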
These data are taken from Flury and Riedwyl (1988). The aim is to study
how these measurements may be used in determining whether a bill is genuine or
counterfeit.
The boxplot is a graphical technique that displays the distribution of variables. It
helps us see the location, skewness, spread, tail length and outlying points.
It is particularly useful in comparing different batches. The boxplot is a graphical
representation of the Five Number Summary. To introduce the Five Number
Summary, let us consider for a moment a smaller, one-dimensional data set:
the population of the 15 largest world cities in 2006 (Table 1.1).
In the Five Number Summary, we calculate the upper quartile $F_U$, the lower quartile $F_L$, the median and the extremes. Recall that order statistics $\{x_{(1)}, x_{(2)}, \ldots, x_{(n)}\}$ are a set of ordered values $x_1, x_2, \ldots, x_n$ where $x_{(1)}$ denotes the minimum and $x_{(n)}$ the maximum. The median $M$ typically cuts the set of observations in two equal parts, and is defined as

$$
M = \begin{cases}
x_{\left(\frac{n+1}{2}\right)} & n \text{ odd} \\
\frac{1}{2}\left\{ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right\} & n \text{ even}
\end{cases}
\qquad (1.1)
$$
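Definition (1.1) can be sketched in a few lines of Python. The function name `median_from_order_stats` is a hypothetical helper chosen for illustration, and the data are the populations listed in Table 1.1.

```python
def median_from_order_stats(values):
    """Median according to (1.1), using order statistics x_(1) <= ... <= x_(n)."""
    x = sorted(values)              # x[k - 1] holds the order statistic x_(k)
    n = len(x)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        return x[(n + 1) // 2 - 1]              # x_((n+1)/2) for odd n
    return 0.5 * (x[n // 2 - 1] + x[n // 2])    # mean of x_(n/2) and x_(n/2+1)


# Populations (in units of 10,000) of the 15 cities in Table 1.1
cities = [3420, 2280, 2230, 2190, 2020, 1985, 1970, 1815,
          1800, 1680, 1655, 1565, 1560, 1495, 1430]

print(median_from_order_stats(cities))  # 1815, the order statistic x_(8)
```

For an even number of observations the function averages the two central order statistics, e.g. `median_from_order_stats([1, 2, 3, 4])` gives 2.5.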
Table 1.1 The 15 largest world cities in 2006

City          Country       Pop. (10,000)   Order statistics
Tokyo         Japan         3,420           $x_{(15)}$
Mexico City   Mexico        2,280           $x_{(14)}$
Seoul         South Korea   2,230           $x_{(13)}$
New York      USA           2,190           $x_{(12)}$
Sao Paulo     Brazil        2,020           $x_{(11)}$
Bombay        India         1,985           $x_{(10)}$
Delhi         India         1,970           $x_{(9)}$
Shanghai      China         1,815           $x_{(8)}$
Los Angeles   USA           1,800           $x_{(7)}$
Osaka         Japan         1,680           $x_{(6)}$
Jakarta       Indonesia     1,655           $x_{(5)}$
Calcutta      India         1,565           $x_{(4)}$
Cairo         Egypt         1,560           $x_{(3)}$
Manila        Philippines   1,495           $x_{(2)}$
Karachi       Pakistan      1,430           $x_{(1)}$
The quartiles cut the set into four equal parts, which are often called fourths (that is why we use the letter $F$). Using a definition that goes back to Hoaglin, Mosteller, and Tukey (1983), the definition of a median can be generalised to fourths, eighths, etc. Considering the order statistics, we can define the depth of a data value $x_{(i)}$ as $\min\{i, n - i + 1\}$. If $n$ is odd, the depth of the median is $\frac{n+1}{2}$. If $n$ is even, $\frac{n+1}{2}$ is a fraction. Thus, the median is determined to be the average of the two data values belonging to the next larger and smaller order statistics, i.e. $M = \frac{1}{2}\left\{ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right\}$. In our example, we have $n = 15$, hence the median is $M = x_{(8)} = 1{,}815$.
We proceed in the same way to get the fourths. Take the depth of the median and calculate

$$
\text{depth of fourth} = \frac{[\text{depth of median}] + 1}{2}
$$

with $[z]$ denoting the largest integer smaller than or equal to $z$. In our example this gives $4.5$ and thus leads to the two fourths

$$
F_L = \frac{1}{2}\left\{ x_{(4)} + x_{(5)} \right\}
$$

$$
F_U = \frac{1}{2}\left\{ x_{(11)} + x_{(12)} \right\}
$$

(recalling that a depth which is a fraction corresponds to the average of the two nearest data values).
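The depth rule can be illustrated with a short Python sketch. The helper names `depth`, `value_at_depth` and `fourths` are assumptions made for this example, not notation from the text; the data are again the populations of Table 1.1.

```python
import math


def depth(i, n):
    """Depth of the order statistic x_(i) in a sample of size n: min{i, n - i + 1}."""
    return min(i, n - i + 1)


def value_at_depth(x_sorted, d, from_top=False):
    """Value at depth d, counted from the bottom (default) or from the top.

    A fractional depth is the average of the two neighbouring order statistics.
    """
    n = len(x_sorted)
    lo, hi = math.floor(d), math.ceil(d)
    if from_top:
        # depth k from the top is the order statistic x_(n - k + 1)
        return 0.5 * (x_sorted[n - lo] + x_sorted[n - hi])
    return 0.5 * (x_sorted[lo - 1] + x_sorted[hi - 1])


def fourths(values):
    """Lower and upper fourths (F_L, F_U) obtained from the depth of the median."""
    x = sorted(values)
    depth_median = (len(x) + 1) / 2
    depth_fourth = (math.floor(depth_median) + 1) / 2
    return (value_at_depth(x, depth_fourth),
            value_at_depth(x, depth_fourth, from_top=True))


cities = [3420, 2280, 2230, 2190, 2020, 1985, 1970, 1815,
          1800, 1680, 1655, 1565, 1560, 1495, 1430]

print(fourths(cities))  # (1610.0, 2105.0)
```

With $n = 15$ the depth of the median is 8, the depth of the fourths is 4.5, and the function averages $x_{(4)}, x_{(5)}$ and $x_{(11)}, x_{(12)}$ respectively.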
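The F-spread, $d_F$, is defined as $d_F = F_U - F_L$. The outside bars

$$
F_U + 1.5\,d_F \qquad (1.2)
$$

$$
F_L - 1.5\,d_F \qquad (1.3)
$$

are the borders beyond which a point is regarded as an outlier. For the number of points outside these bars see Exercise 1.3. For the $n = 15$ data points the fourths are $1610 = \frac{1}{2}\left\{ x_{(4)} + x_{(5)} \right\}$ and $2105 = \frac{1}{2}\left\{ x_{(11)} + x_{(12)} \right\}$. Therefore the F-spread and the upper and lower outside bars in the above example are calculated as follows:

$$
d_F = F_U - F_L = 2105 - 1610 = 495
$$

$$
F_L - 1.5\,d_F = 1610 - 1.5 \cdot 495 = 867.5
$$

$$
F_U + 1.5\,d_F = 2105 + 1.5 \cdot 495 = 2847.5
$$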
Since Tokyo is beyond the outside bars it is considered to be an outlier. The minimum and the maximum are called the extremes. The mean is defined as

$$
\bar{x} = n^{-1} \sum_{i=1}^{n} x_i,
$$

which is 1,939.7 in our example. The mean is a measure of location. The median (1815), the fourths (1610; 2105) and the extremes (1430; 3420) constitute basic information about the data. The combination of these five numbers leads to the Five Number Summary as shown in Table 1.2. The depths of each of the five numbers have been added as an additional column.
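As a rough check of the numbers quoted for the world cities, the following Python sketch computes the five numbers with their depths, the F-spread, the outside bars and the mean. All variable names are chosen for this illustration, and the dictionary simply restates Table 1.1.

```python
import math
import statistics

# Populations (in units of 10,000) from Table 1.1
cities = {
    "Tokyo": 3420, "Mexico City": 2280, "Seoul": 2230, "New York": 2190,
    "Sao Paulo": 2020, "Bombay": 1985, "Delhi": 1970, "Shanghai": 1815,
    "Los Angeles": 1800, "Osaka": 1680, "Jakarta": 1655, "Calcutta": 1565,
    "Cairo": 1560, "Manila": 1495, "Karachi": 1430,
}

x = sorted(cities.values())
n = len(x)


def at_depth(d):
    """Values at depth d from the bottom and from the top (fractional d averages)."""
    lo, hi = math.floor(d), math.ceil(d)
    lower = 0.5 * (x[lo - 1] + x[hi - 1])
    upper = 0.5 * (x[n - lo] + x[n - hi])
    return lower, upper


depth_median = (n + 1) / 2
depth_fourth = (math.floor(depth_median) + 1) / 2

median = statistics.median(x)
f_lower, f_upper = at_depth(depth_fourth)
d_f = f_upper - f_lower                      # F-spread
lower_bar = f_lower - 1.5 * d_f              # lower outside bar, cf. (1.3)
upper_bar = f_upper + 1.5 * d_f              # upper outside bar, cf. (1.2)
outliers = [name for name, pop in cities.items()
            if pop < lower_bar or pop > upper_bar]
mean = sum(x) / n

# Five Number Summary as (depth, value(s))
summary = {
    "median": (depth_median, median),
    "fourths": (depth_fourth, f_lower, f_upper),
    "extremes": (1, x[0], x[-1]),
}

print(summary)
print(d_f, lower_bar, upper_bar)   # 495.0 867.5 2847.5
print(outliers)                    # ['Tokyo']
print(round(mean, 1))              # 1939.7
```

The mean (about 1,939.7) lies well above the median (1,815), which reflects the pull of the single large observation.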
The boxplot is then constructed as follows:
1. Draw a box with borders (edges) at $F_L$ and $F_U$ (i.e. 50 % of the data are in this box).
2. Draw the median as a solid line (|) and the mean as a dotted line.
3. Draw "whiskers" from each end of the box to the most remote point that is NOT an outlier.
4. Show outliers as either "⋆" or "•" depending on whether they are outside of $F_{UL} \pm 1.5\,d_F$ or $F_{UL} \pm 3\,d_F$ respectively (this feature is not contained in some software). Label them if possible.
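The construction steps can be sketched without any plotting library by computing the elements that would be drawn. The function `boxplot_elements` and its dictionary keys are hypothetical names for this illustration; the fourths are passed in as arguments, and the sketch assumes that at least one observation lies inside the outside bars.

```python
def boxplot_elements(values, f_lower, f_upper):
    """Box edges, whisker ends and outliers following the four construction steps."""
    d_f = f_upper - f_lower                                    # F-spread
    inner = (f_lower - 1.5 * d_f, f_upper + 1.5 * d_f)         # outside bars
    outer = (f_lower - 3.0 * d_f, f_upper + 3.0 * d_f)         # 3 d_F limits

    # Step 3: whiskers reach the most remote points that are not outliers
    inside = [v for v in values if inner[0] <= v <= inner[1]]

    # Step 4: two classes of outliers, drawn with different symbols
    mild = sorted(v for v in values
                  if outer[0] <= v < inner[0] or inner[1] < v <= outer[1])
    extreme = sorted(v for v in values if v < outer[0] or v > outer[1])

    return {
        "box": (f_lower, f_upper),                 # step 1
        "whiskers": (min(inside), max(inside)),    # step 3
        "mild_outliers": mild,                     # beyond 1.5 d_F only
        "extreme_outliers": extreme,               # beyond 3 d_F
    }


cities = [3420, 2280, 2230, 2190, 2020, 1985, 1970, 1815,
          1800, 1680, 1655, 1565, 1560, 1495, 1430]

elements = boxplot_elements(cities, f_lower=1610, f_upper=2105)
print(elements["whiskers"])          # (1430, 2280)
print(elements["mild_outliers"])     # [3420]
print(elements["extreme_outliers"])  # []
```

Plotting libraries may use slightly different quartile conventions, so the box they draw need not coincide exactly with the fourths used here.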
[Fig. 1.2: Boxplot of the world cities data; vertical axis from 1,500 to 3,500, horizontal axis label "World Cities"]
In the world cities example, the cut-off points (outside bars) are at 867.5 and 2847.5, hence we can draw whiskers to Karachi and Mexico City. We can see from Fig. 1.2 that the data are strongly skewed: the upper half of the data (above the median) is more spread out than the lower half (below the median), the data contain one outlier, marked as a circle, and the mean (as a non-robust measure of location) is pulled away from the median.
Boxplots are very useful tools in comparing batches. The relative location of
the distribution of different batches tells us a lot about the batches themselves.
Before we come back to the Swiss bank data, let us compare the fuel economy
of vehicles from different countries, see Fig. 1.3 and Table 22.3.
Example 1.2 The data are from the second column of Table 22.3 and show
the mileage (miles per gallon) of American, Japanese and European cars.
The five-number summaries for these data sets are $\{12, 16.8, 18.8, 22, 30\}$, $\{18, 22, 25, 30.5, 35\}$ and $\{14, 19, 23, 25, 28\}$ for American, Japanese and European cars, respectively. This reflects the information shown in Fig. 1.3. The following conclusions can be made:
• Japanese cars achieve higher fuel efficiency than US and European cars.
• There is one outlier, a very fuel-efficient car (VW-Rabbit Golf Diesel).
• The main body of the US car data (the box) lies below the Japanese car data.
• The worst Japanese car is more fuel-efficient than almost 50 % of the US cars.
• The spread of the Japanese cars and that of the US cars are almost equal.
• The median of the Japanese data is above that of the European data and the US
data.
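Several of these conclusions can be checked directly against the five-number summaries quoted in Example 1.2. The Python sketch below uses only those summaries, not the raw data of Table 22.3. Reading "spread" as the range (maximum minus minimum) is an assumption of this illustration; the F-spreads of the US and Japanese cars differ more (about 5.2 versus 8.5).

```python
# Five-number summaries (minimum, F_L, median, F_U, maximum) from Example 1.2
summaries = {
    "US":    (12, 16.8, 18.8, 22, 30),
    "Japan": (18, 22, 25, 30.5, 35),
    "EU":    (14, 19, 23, 25, 28),
}

medians = {name: s[2] for name, s in summaries.items()}
f_spreads = {name: round(s[3] - s[1], 1) for name, s in summaries.items()}
ranges = {name: s[4] - s[0] for name, s in summaries.items()}

# The Japanese median is the largest of the three
print(max(medians, key=medians.get))                     # Japan

# The US box ends where the Japanese box begins
print(summaries["US"][3] <= summaries["Japan"][1])       # True

# The Japanese minimum lies between the US lower fourth and the US median
us = summaries["US"]
print(us[1] < summaries["Japan"][0] < us[2])             # True

# Ranges of the US and Japanese data (assumed reading of "spread")
print(ranges["US"], ranges["Japan"])                     # 18 17
```

The third check is what lies behind the statement that the worst Japanese car beats almost 50 % of the US cars: its mileage of 18 is just below the US median of 18.8.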
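[Fig. 1.3: Boxplots of the mileage of US, Japanese and European cars; vertical axis from 15 to 30]
[Fig. 1.4: Parallel boxplots of the diagonal $X_6$ for genuine and counterfeit bank notes; vertical axis from 138 to 141]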
Now let us apply the boxplot technique to the bank data set. In Fig. 1.4 we show the parallel boxplot of the diagonal variable $X_6$. On the left is the boxplot for the genuine bank notes and on the right the boxplot for the counterfeit bank notes. The five-number summaries are reported in Tables 1.3 and 1.4.
[Fig. 1.5: Parallel boxplots of the length $X_1$ for genuine and counterfeit bank notes; vertical axis from 214 to 215.5]
One sees that the diagonals of the genuine bank notes tend to be larger. It is
harder to see a clear distinction when comparing the length of the bank notes, $X_1$;
see Fig. 1.5. There are a few outliers in both plots. Almost all the observations of
the diagonal of the genuine notes are above the ones from the counterfeit notes.
There is one observation in Fig. 1.4 of the genuine notes that is almost equal to
the median of the counterfeit notes. Can the parallel boxplot technique help us
distinguish between the two types of bank notes?
Summary
↪ The median and mean bars are measures of location.
↪ The relative location of the median (and the mean) in the box is a measure of skewness.
↪ The length of the box and whiskers is a measure of spread.
↪ The length of the whiskers indicates the tail length of the distribution.
↪ The outlying points are indicated with a "⋆" or "•" depending on whether they are outside of $F_{UL} \pm 1.5\,d_F$ or $F_{UL} \pm 3\,d_F$ respectively.
↪ The boxplots do not indicate multi-modality or clusters.