Boxplots

boxplots

Uploaded by

Claudia Ferraz

Chapter 1

Comparison of Batches

Multivariate statistical analysis is concerned with analysing and understanding data
in high dimensions. We suppose that we are given a set $\{x_i\}_{i=1}^n$ of $n$ observations
of a variable vector $X$ in $\mathbb{R}^p$. That is, we suppose that each observation $x_i$ has $p$
dimensions:

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{ip}),$$

and that it is an observed value of a variable vector $X \in \mathbb{R}^p$. Therefore, $X$ is
composed of $p$ random variables:

$$X = (X_1, X_2, \ldots, X_p)$$

where $X_j$, for $j = 1, \ldots, p$, is a one-dimensional random variable. How do
we begin to analyse this kind of data? Before we investigate questions on what
inferences we can reach from the data, we should think about how to look at the data.
This involves descriptive techniques. Questions that we could answer by descriptive
techniques are:
• Are there components of $X$ that are more spread out than others?
• Are there some elements of $X$ that indicate sub-groups of the data?
• Are there outliers in the components of $X$?
• How “normal” is the distribution of the data?
• Are there “low-dimensional” linear combinations of $X$ that show “non-normal”
behaviour?
One difficulty of descriptive methods for high-dimensional data is the human
perceptual system. Point clouds in two dimensions are easy to understand and to
interpret. With modern interactive computing techniques we can view real-time
3D rotations and thus also perceive three-dimensional data.
A “sliding technique” as described in Härdle and Scott (1992) may give insight
into four-dimensional structures by presenting dynamic 3D density contours as the
fourth variable is changed over its range.

© Springer-Verlag Berlin Heidelberg 2015
W.K. Härdle, L. Simar, Applied Multivariate Statistical Analysis,
DOI 10.1007/978-3-662-45171-7_1

A qualitative jump in presentation difficulties occurs for dimensions greater
than or equal to 5, unless the high-dimensional structure can be mapped into
lower-dimensional components (Klinke & Polzehl, 1995). Features like clustered
sub-groups or outliers, however, can be detected using a purely graphical analysis.
In this chapter, we investigate the basic descriptive and graphical techniques
allowing simple exploratory data analysis. We begin the exploration of a data
set using boxplots. A boxplot is a simple univariate device that detects outliers
component by component and that can compare distributions of the data among
different groups. Next, several multivariate techniques are introduced (Flury faces,
Andrews’ curves and parallel coordinates plots (PCPs)) which provide graphical
displays addressing the questions formulated above. The advantages and the
disadvantages of each of these techniques are stressed.
Two basic techniques for estimating densities are also presented: histograms and
kernel densities. A density estimate gives a quick insight into the shape of the
distribution of the data. We show that kernel density estimates (KDEs) overcome
some of the drawbacks of the histograms.
Finally, scatterplots are shown to be very useful for plotting bivariate or
trivariate variables against each other: they help to understand the nature of the
relationship among variables in a data set and allow for the detection of groups or
clusters of points. Draftsman plots or matrix plots are the visualisation of several
bivariate scatterplots on the same display. They help detect structures in conditional
dependencies by brushing across the plots. Outliers and observations that need
special attention may be discovered with Andrews’ curves and PCPs. This chapter
ends with an exploratory analysis of the Boston Housing data.

1.1 Boxplots

Example 1.1 The Swiss bank data (see Chap. 22, Sect. 22.2) consists of 200
measurements on Swiss bank notes. The first half of these measurements are from
genuine bank notes, the other half are from counterfeit bank notes.
The authorities measured, as indicated in Fig. 1.1,

$X_1$ = length of the bill
$X_2$ = height of the bill (left)
$X_3$ = height of the bill (right)
$X_4$ = distance of the inner frame to the lower border
$X_5$ = distance of the inner frame to the upper border
$X_6$ = length of the diagonal of the central picture.

Fig. 1.1 An old Swiss 1000-franc bank note

These data are taken from Flury and Riedwyl (1988). The aim is to study
how these measurements may be used in determining whether a bill is genuine or
counterfeit.
The boxplot is a graphical technique that displays the distribution of variables. It
helps us see the location, skewness, spread, tail length and outlying points.
It is particularly useful in comparing different batches. The boxplot is a graphical
representation of the Five Number Summary. To introduce the Five Number
Summary, let us consider for a moment a smaller, one-dimensional data set:
the population of the 15 largest world cities in 2006 (Table 1.1).
In the Five Number Summary, we calculate the upper quartile $F_U$, the lower
quartile $F_L$, the median and the extremes. Recall that order statistics
$\{x_{(1)}, x_{(2)}, \ldots, x_{(n)}\}$ are a set of ordered values $x_1, x_2, \ldots, x_n$
where $x_{(1)}$ denotes the minimum and $x_{(n)}$ the maximum. The median $M$ typically
cuts the set of observations in two equal parts, and is defined as

$$M = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & n \text{ odd} \\[4pt] \frac{1}{2}\left\{ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right\} & n \text{ even.} \end{cases} \tag{1.1}$$
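
As a minimal sketch of definition (1.1), the median can be read off the sorted sample. The function name `median` and the list `cities` (the populations from Table 1.1, in units of 10,000) are illustrative choices, not part of the original text.

```python
def median(values):
    """Median via order statistics, following Eq. (1.1)."""
    x = sorted(values)          # x[0] is x_(1), x[-1] is x_(n)
    n = len(x)
    if n % 2 == 1:
        # n odd: the single middle value x_((n+1)/2), shifted to 0-based indexing
        return x[(n + 1) // 2 - 1]
    # n even: average of x_(n/2) and x_(n/2 + 1)
    return 0.5 * (x[n // 2 - 1] + x[n // 2])


# Populations (in 10,000s) of the 15 cities in Table 1.1
cities = [3420, 2280, 2230, 2190, 2020, 1985, 1970, 1815,
          1800, 1680, 1655, 1565, 1560, 1495, 1430]

print(median(cities))  # 1815, i.e. x_(8)
```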

Table 1.1 The 15 largest world cities in 2006

City          Country       Pop. (10,000)   Order statistics
Tokyo         Japan         3,420           $x_{(15)}$
Mexico City   Mexico        2,280           $x_{(14)}$
Seoul         South Korea   2,230           $x_{(13)}$
New York      USA           2,190           $x_{(12)}$
Sao Paulo     Brazil        2,020           $x_{(11)}$
Bombay        India         1,985           $x_{(10)}$
Delhi         India         1,970           $x_{(9)}$
Shanghai      China         1,815           $x_{(8)}$
Los Angeles   USA           1,800           $x_{(7)}$
Osaka         Japan         1,680           $x_{(6)}$
Jakarta       Indonesia     1,655           $x_{(5)}$
Calcutta      India         1,565           $x_{(4)}$
Cairo         Egypt         1,560           $x_{(3)}$
Manila        Philippines   1,495           $x_{(2)}$
Karachi       Pakistan      1,430           $x_{(1)}$

The quartiles cut the set into four equal parts, which are often called fourths (that is
why we use the letter $F$). Using a definition that goes back to Hoaglin, Mosteller,
and Tukey (1983) the definition of a median can be generalised to fourths, eighths,
etc. Considering the order statistics we can define the depth of a data value $x_{(i)}$
as $\min\{i, n - i + 1\}$. If $n$ is odd, the depth of the median is $\frac{n+1}{2}$. If $n$ is even,
$\frac{n+1}{2}$ is a fraction. Thus, the median is determined to be the average between
the two data values belonging to the next larger and smaller order statistics, i.e.
$M = \frac{1}{2}\left\{ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right\}$. In our example, we have $n = 15$ hence the median
$M = x_{(8)} = 1{,}815$.
We proceed in the same way to get the fourths. Take the depth of the median and
calculate

$$\text{depth of fourth} = \frac{[\text{depth of median}] + 1}{2}$$

with $[z]$ denoting the largest integer smaller than or equal to $z$. In our example this
gives 4.5 and thus leads to the two fourths

$$F_L = \frac{1}{2}\left\{ x_{(4)} + x_{(5)} \right\}$$
$$F_U = \frac{1}{2}\left\{ x_{(11)} + x_{(12)} \right\}$$

(recalling that a depth which is a fraction corresponds to the average of the two
nearest data values).
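
The depth rule above can be sketched in a few lines. The helper names `value_at_depth` and `fourths` are hypothetical and only illustrate the definition; the data are the populations from Table 1.1.

```python
import math


def value_at_depth(sorted_x, depth, from_top=False):
    """Value at a given depth; a fractional depth averages the two nearest values."""
    lo, hi = math.floor(depth), math.ceil(depth)
    if from_top:
        # depth d counted from the maximum is position n - d + 1 from the minimum
        n = len(sorted_x)
        lo, hi = n - hi + 1, n - lo + 1
    # positions are 1-based order-statistic indices, lists are 0-based
    return 0.5 * (sorted_x[lo - 1] + sorted_x[hi - 1])


def fourths(values):
    """Return (depth of fourth, F_L, F_U) using the depth rule from the text."""
    x = sorted(values)
    depth_median = (len(x) + 1) / 2
    depth_fourth = (math.floor(depth_median) + 1) / 2
    f_lower = value_at_depth(x, depth_fourth)
    f_upper = value_at_depth(x, depth_fourth, from_top=True)
    return depth_fourth, f_lower, f_upper


cities = [3420, 2280, 2230, 2190, 2020, 1985, 1970, 1815,
          1800, 1680, 1655, 1565, 1560, 1495, 1430]

print(fourths(cities))  # (4.5, 1610.0, 2105.0)
```

Note that statistical software often uses other quantile conventions, so quartiles reported elsewhere may differ slightly from these fourths.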

Table 1.2 Five number summary (World cities, # = 15)

Statistic   Depth   Value(s)
M           8       1,815
F           4.5     1,610   2,105
Extremes    1       1,430   3,420

The $F$-spread, $d_F$, is defined as $d_F = F_U - F_L$. The outside bars

$$F_U + 1.5\,d_F \tag{1.2}$$
$$F_L - 1.5\,d_F \tag{1.3}$$

are the borders beyond which a point is regarded as an outlier. For the number of
points outside these bars see Exercise 1.3. For the $n = 15$ data points the fourths are
$1610 = \frac{1}{2}\left\{ x_{(4)} + x_{(5)} \right\}$ and $2105 = \frac{1}{2}\left\{ x_{(11)} + x_{(12)} \right\}$. Therefore the $F$-spread and
the upper and lower outside bars in the above example are calculated as follows:

$$d_F = F_U - F_L = 2105 - 1610 = 495 \tag{1.4}$$
$$F_L - 1.5\,d_F = 1610 - 1.5 \cdot 495 = 867.5 \tag{1.5}$$
$$F_U + 1.5\,d_F = 2105 + 1.5 \cdot 495 = 2847.5. \tag{1.6}$$
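
The calculation in (1.4) to (1.6) can be reproduced with a short sketch. The function name `outside_bars` and the dictionary `cities` are illustrative assumptions; the fourths 1610 and 2105 are taken from the text.

```python
def outside_bars(f_lower, f_upper, k=1.5):
    """Lower and upper outside bars F_L - k*d_F and F_U + k*d_F."""
    d_f = f_upper - f_lower            # the F-spread
    return f_lower - k * d_f, f_upper + k * d_f


# Populations (in 10,000s) from Table 1.1
cities = {
    "Tokyo": 3420, "Mexico City": 2280, "Seoul": 2230, "New York": 2190,
    "Sao Paulo": 2020, "Bombay": 1985, "Delhi": 1970, "Shanghai": 1815,
    "Los Angeles": 1800, "Osaka": 1680, "Jakarta": 1655, "Calcutta": 1565,
    "Cairo": 1560, "Manila": 1495, "Karachi": 1430,
}

lower_bar, upper_bar = outside_bars(1610, 2105)
outliers = [name for name, pop in cities.items()
            if pop < lower_bar or pop > upper_bar]

print(lower_bar, upper_bar)  # 867.5 2847.5
print(outliers)              # ['Tokyo']
```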

Since Tokyo is beyond the outside bars it is considered to be an outlier. The
minimum and the maximum are called the extremes. The mean is defined as

$$\bar{x} = n^{-1} \sum_{i=1}^{n} x_i,$$

which is 1,939.7 in our example. The mean is a measure of location. The median
(1815), the fourths (1610, 2105) and the extremes (1430, 3420) constitute basic
information about the data. The combination of these five numbers leads to the Five
Number Summary as shown in Table 1.2. The depths of each of the five numbers
have been added as an additional column.

Construction of the Boxplot

1. Draw a box with borders (edges) at $F_L$ and $F_U$ (i.e. 50 % of the data are in this
box).
2. Draw the median as a solid line and the mean as a dotted line.
3. Draw “whiskers” from each end of the box to the most remote point that is NOT
an outlier.
4. Show outliers as either “⋆” or “•” depending on whether they are outside of
$[F_L - 1.5\,d_F,\ F_U + 1.5\,d_F]$ or $[F_L - 3\,d_F,\ F_U + 3\,d_F]$ respectively (this feature is
not contained in some software). Label them if possible.
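
The construction steps can be sketched numerically, without any plotting library, by computing the elements that would be drawn. The function name `boxplot_elements` and the dictionary keys are hypothetical; the fourths are passed in as given in the text rather than recomputed.

```python
def boxplot_elements(values, f_lower, f_upper):
    """Collect the quantities needed to draw a boxplot by hand."""
    d_f = f_upper - f_lower
    inner = (f_lower - 1.5 * d_f, f_upper + 1.5 * d_f)   # outside bars
    outer = (f_lower - 3.0 * d_f, f_upper + 3.0 * d_f)   # 3 * d_F bars

    # Step 3: whiskers end at the most remote points that are not outliers
    inside = [v for v in values if inner[0] <= v <= inner[1]]

    # Step 4: split outliers by which pair of bars they exceed
    beyond_3 = sorted(v for v in values if v < outer[0] or v > outer[1])
    beyond_1_5 = sorted(v for v in values
                        if (v < inner[0] or v > inner[1])
                        and outer[0] <= v <= outer[1])
    return {
        "box": (f_lower, f_upper),                  # step 1
        "mean": sum(values) / len(values),          # step 2 (dotted line)
        "whiskers": (min(inside), max(inside)),
        "outliers_beyond_1.5": beyond_1_5,
        "outliers_beyond_3": beyond_3,
    }


cities = [3420, 2280, 2230, 2190, 2020, 1985, 1970, 1815,
          1800, 1680, 1655, 1565, 1560, 1495, 1430]
elements = boxplot_elements(cities, 1610, 2105)

print(elements["whiskers"])             # (1430, 2280)
print(elements["outliers_beyond_1.5"])  # [3420]
print(elements["outliers_beyond_3"])    # []
```

For the world cities the whiskers reach Karachi (1430) and Mexico City (2280), while Tokyo lies beyond the 1.5 bars but inside the 3 bars.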

Fig. 1.2 Boxplot for world cities (vertical axis from 1500 to 3500, in units of 10,000 inhabitants). MVAboxcity

In the world cities example, the cut-off points (outside bars) are at 867.5 and
2847.5, hence we can draw whiskers to Karachi and Mexico City. We can see from
Fig. 1.2 that the data are very skewed: the upper half of the data (above the median)
is more spread out than the lower half (below the median), the data contain one
outlier marked as a circle, and the mean (as a non-robust measure of location) is
pulled away from the median.
Boxplots are very useful tools in comparing batches. The relative location of
the distribution of different batches tells us a lot about the batches themselves.
Before we come back to the Swiss bank data, let us compare the fuel economy
of vehicles from different countries, see Fig. 1.3 and Table 22.3.
Example 1.2 The data are from the second column of Table 22.3 and show
the mileage (miles per gallon) of American, Japanese and European cars.
The five-number summaries for these data sets are $\{12, 16.8, 18.8, 22, 30\}$,
$\{18, 22, 25, 30.5, 35\}$ and $\{14, 19, 23, 25, 28\}$ for American, Japanese and European
cars, respectively. This reflects the information shown in Fig. 1.3. The following
conclusions can be made:
• Japanese cars achieve higher fuel efficiency than US and European cars.
• There is one outlier, a very fuel-efficient car (VW-Rabbit Golf Diesel).
• The main body of the US car data (the box) lies below the Japanese car data.
• The worst Japanese car is more fuel-efficient than almost 50 % of the US cars.
• The spread of the Japanese cars and that of the US cars are almost equal.
• The median of the Japanese data is above that of the European data and the US
data.
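
Some of these conclusions can be checked directly from the five-number summaries quoted in Example 1.2. The sketch below is only an illustration: the function name `compare` is hypothetical, and reading “spread” as the range between the extremes is an assumption (the $F$-spread is reported alongside for comparison).

```python
# Five-number summaries from Example 1.2: (minimum, F_L, median, F_U, maximum)
summaries = {
    "US": (12, 16.8, 18.8, 22, 30),
    "Japan": (18, 22, 25, 30.5, 35),
    "Europe": (14, 19, 23, 25, 28),
}


def compare(batches):
    """Median, range and F-spread per batch (rounded to one decimal)."""
    result = {}
    for name, (lo, f_l, med, f_u, hi) in batches.items():
        result[name] = {
            "median": med,
            "range": hi - lo,                  # assumed meaning of "spread"
            "f_spread": round(f_u - f_l, 1),   # length of the box
        }
    return result


stats = compare(summaries)
for name, s in stats.items():
    print(name, s)

# The worst Japanese car (18 mpg) is below the US median (18.8 mpg),
# so it beats almost, but not quite, half of the US cars.
print(summaries["Japan"][0] < summaries["US"][2])  # True
```

Under this reading the ranges of the US and Japanese cars are 18 and 17 miles per gallon, while the box lengths are 5.2 and 8.5.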

Fig. 1.3 Boxplot for the mileage of American, Japanese and European cars (from left to right); plot title “Car Data”, groups US, JAPAN, EU. MVAboxcar

Fig. 1.4 The $X_6$ variable of Swiss bank data (diagonal of bank notes); plot title “Swiss Bank Notes”, groups GENUINE, COUNTERFEIT. MVAboxbank6

Table 1.3 Five number summary (Genuine bank notes, # = 100)

Statistic   Depth   Value(s)
M           50.5    141.5
F           25.75   141.25   141.8
Extremes    1       140.65   142.4

Now let us apply the boxplot technique to the bank data set. In Fig. 1.4 we
show the parallel boxplot of the diagonal variable $X_6$. On the left are the values for
the genuine bank notes and on the right the values for the counterfeit bank notes. The
five number summaries are reported in Tables 1.3 and 1.4.

Table 1.4 Five number summary (Counterfeit bank notes, # = 100)

Statistic   Depth   Value(s)
M           50.5    139.5
F           25.75   139.2    139.8
Extremes    1       138.3    140.65

Fig. 1.5 The $X_1$ variable of Swiss bank data (length of bank notes); plot title “Swiss Bank Notes”, groups GENUINE, COUNTERFEIT. MVAboxbank1

One sees that the diagonals of the genuine bank notes tend to be larger. It is
harder to see a clear distinction when comparing the length of the bank notes X1 ,
see Fig. 1.5. There are a few outliers in both plots. Almost all the observations of
the diagonal of the genuine notes are above the ones from the counterfeit notes.
There is one observation in Fig. 1.4 of the genuine notes that is almost equal to
the median of the counterfeit notes. Can the parallel boxplot technique help us
distinguish between the two types of bank notes?

Summary
↪ The median and mean bars are measures of location.
↪ The relative location of the median (and the mean) in the box is a
measure of how skewed the distribution is.
↪ The lengths of the box and whiskers are a measure of spread.
↪ The length of the whiskers indicates the tail length of the distribution.
↪ The outlying points are indicated with a “⋆” or “•” depending on
whether they are outside of $[F_L - 1.5\,d_F,\ F_U + 1.5\,d_F]$ or $[F_L - 3\,d_F,\ F_U + 3\,d_F]$ respectively.
↪ The boxplots do not indicate multi-modality or clusters.
↪ If we compare the relative size and location of the boxes, we are
comparing distributions.
