Boxplots

Uploaded by

Claudia Ferraz

Chapter 1

Comparison of Batches

Multivariate statistical analysis is concerned with analysing and understanding data
in high dimensions. We suppose that we are given a set $\{x_i\}_{i=1}^n$ of $n$ observations
of a variable vector $X$ in $\mathbb{R}^p$. That is, we suppose that each observation $x_i$ has $p$
dimensions:

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{ip}),$$

and that it is an observed value of a variable vector $X \in \mathbb{R}^p$. Therefore, $X$ is
composed of $p$ random variables:

$$X = (X_1, X_2, \ldots, X_p)$$

where $X_j$, for $j = 1, \ldots, p$, is a one-dimensional random variable. How do
we begin to analyse this kind of data? Before we investigate questions on what
inferences we can reach from the data, we should think about how to look at the data.
This involves descriptive techniques. Questions that we could answer by descriptive
techniques are:
• Are there components of X that are more spread out than others?
• Are there some elements of X that indicate sub-groups of the data?
• Are there outliers in the components of X ?
• How “normal” is the distribution of the data?
• Are there “low-dimensional” linear combinations of X that show “non-normal”
behaviour?
One difficulty of descriptive methods for high-dimensional data is the human
perceptual system. Point clouds in two dimensions are easy to understand and to
interpret. With modern interactive computing techniques we can view real-time
3D rotations and thus also perceive three-dimensional data.
A “sliding technique” as described in Härdle and Scott (1992) may give insight
into four-dimensional structures by presenting dynamic 3D density contours as the
fourth variable is changed over its range.

© Springer-Verlag Berlin Heidelberg 2015
W.K. Härdle, L. Simar, Applied Multivariate Statistical Analysis,
DOI 10.1007/978-3-662-45171-7_1

A qualitative jump in presentation difficulties occurs for dimensions greater
than or equal to 5, unless the high-dimensional structure can be mapped into
lower-dimensional components (Klinke & Polzehl, 1995). Features like clustered
sub-groups or outliers, however, can be detected using a purely graphical analysis.
In this chapter, we investigate the basic descriptive and graphical techniques
allowing simple exploratory data analysis. We begin the exploration of a data
set using boxplots. A boxplot is a simple univariate device that detects outliers
component by component and that can compare distributions of the data among
different groups. Next, several multivariate techniques are introduced (Flury faces,
Andrews’ curves and parallel coordinates plots (PCPs)) which provide graphical
displays addressing the questions formulated above. The advantages and the
disadvantages of each of these techniques are stressed.
Two basic techniques for estimating densities are also presented: histograms and
kernel densities. A density estimate gives a quick insight into the shape of the
distribution of the data. We show that kernel density estimates (KDEs) overcome
some of the drawbacks of the histograms.
Finally, scatterplots are shown to be very useful for plotting bivariate or
trivariate variables against each other: they help to understand the nature of the
relationship among variables in a data set and allow for the detection of groups or
clusters of points. Draftsman plots or matrix plots are the visualisation of several
bivariate scatterplots on the same display. They help detect structures in conditional
dependencies by brushing across the plots. Outliers and observations that need
special attention may be discovered with Andrews’ curves and PCPs. This chapter
ends with an exploratory analysis of the Boston Housing data.

1.1 Boxplots

Example 1.1 The Swiss bank data (see Chap. 22, Sect. 22.2) consist of 200
measurements on Swiss bank notes. The first half of these measurements are from
genuine bank notes, the other half are from counterfeit bank notes.
The authorities measured, as indicated in Fig. 1.1,

$X_1$ = length of the bill
$X_2$ = height of the bill (left)
$X_3$ = height of the bill (right)
$X_4$ = distance of the inner frame to the lower border
$X_5$ = distance of the inner frame to the upper border
$X_6$ = length of the diagonal of the central picture.

Fig. 1.1 An old Swiss 1000-franc bank note

These data are taken from Flury and Riedwyl (1988). The aim is to study
how these measurements may be used in determining whether a bill is genuine or
counterfeit.
The boxplot is a graphical technique that displays the distribution of variables. It
helps us see the location, skewness, spread, tail length and outlying points.
It is particularly useful in comparing different batches. The boxplot is a graphical
representation of the Five Number Summary. To introduce the Five Number
Summary, let us consider for a moment a smaller, one-dimensional data set:
the population of the 15 largest world cities in 2006 (Table 1.1).
In the Five Number Summary, we calculate the upper quartile $F_U$, the lower
quartile $F_L$, the median and the extremes. Recall that order statistics
$\{x_{(1)}, x_{(2)}, \ldots, x_{(n)}\}$ are a set of ordered values $x_1, x_2, \ldots, x_n$ where
$x_{(1)}$ denotes the minimum and $x_{(n)}$ the maximum. The median $M$ typically cuts
the set of observations in two equal parts, and is defined as

$$M = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & n \text{ odd} \\ \frac{1}{2}\left\{ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right\} & n \text{ even.} \end{cases} \tag{1.1}$$

Table 1.1 The 15 largest world cities in 2006

City          Country       Pop. (10,000)   Order statistics
Tokyo         Japan         3,420           $x_{(15)}$
Mexico City   Mexico        2,280           $x_{(14)}$
Seoul         South Korea   2,230           $x_{(13)}$
New York      USA           2,190           $x_{(12)}$
Sao Paulo     Brazil        2,020           $x_{(11)}$
Bombay        India         1,985           $x_{(10)}$
Delhi         India         1,970           $x_{(9)}$
Shanghai      China         1,815           $x_{(8)}$
Los Angeles   USA           1,800           $x_{(7)}$
Osaka         Japan         1,680           $x_{(6)}$
Jakarta       Indonesia     1,655           $x_{(5)}$
Calcutta      India         1,565           $x_{(4)}$
Cairo         Egypt         1,560           $x_{(3)}$
Manila        Philippines   1,495           $x_{(2)}$
Karachi       Pakistan      1,430           $x_{(1)}$

The quartiles cut the set into four equal parts, which are often called fourths (that is
why we use the letter $F$). Using a definition that goes back to Hoaglin, Mosteller,
and Tukey (1983) the definition of a median can be generalised to fourths, eighths,
etc. Considering the order statistics we can define the depth of a data value $x_{(i)}$
as $\min\{i, n - i + 1\}$. If $n$ is odd, the depth of the median is $\frac{n+1}{2}$. If $n$ is even,
$\frac{n+1}{2}$ is a fraction. Thus, the median is determined to be the average between
the two data values belonging to the next larger and smaller order statistics, i.e.
$M = \frac{1}{2}\left\{ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right\}$. In our example, we have $n = 15$ hence the median
$M = x_{(8)} = 1{,}815$.
We proceed in the same way to get the fourths. Take the depth of the median and
calculate

$$\text{depth of fourth} = \frac{[\text{depth of median}] + 1}{2}$$

with $[z]$ denoting the largest integer smaller than or equal to $z$. In our example this
gives $4.5$ and thus leads to the two fourths

$$F_L = \frac{1}{2}\left\{ x_{(4)} + x_{(5)} \right\}$$
$$F_U = \frac{1}{2}\left\{ x_{(11)} + x_{(12)} \right\}$$

(recalling that a depth which is a fraction corresponds to the average of the two
nearest data values).
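
The depth rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `value_at_depth` and `five_number_summary` are hypothetical and do not come from the book, and the data are the populations from Table 1.1.

```python
import math

# Populations (in units of 10,000) of the 15 cities in Table 1.1.
populations = [3420, 2280, 2230, 2190, 2020, 1985, 1970, 1815,
               1800, 1680, 1655, 1565, 1560, 1495, 1430]


def value_at_depth(ordered, depth, from_top=False):
    """Return the value at a given depth, counted from the bottom or the top.

    A fractional depth (e.g. 4.5) is resolved as the average of the two
    nearest order statistics, as described in the text.
    """
    lo = math.floor(depth)
    hi = math.ceil(depth)
    if from_top:
        # Depth d from the top is the order statistic x_(n - d + 1).
        n = len(ordered)
        a, b = ordered[n - lo], ordered[n - hi]
    else:
        # Depth d from the bottom is x_(d); Python lists are 0-based.
        a, b = ordered[lo - 1], ordered[hi - 1]
    return (a + b) / 2


def five_number_summary(data):
    """Compute depths and values of the median, fourths and extremes."""
    x = sorted(data)
    n = len(x)
    depth_median = (n + 1) / 2
    depth_fourth = (math.floor(depth_median) + 1) / 2
    return {
        "median": (depth_median, value_at_depth(x, depth_median)),
        "fourths": (depth_fourth,
                    value_at_depth(x, depth_fourth),
                    value_at_depth(x, depth_fourth, from_top=True)),
        "extremes": (1, x[0], x[-1]),
    }


summary = five_number_summary(populations)
print(summary)
```

For the 15 cities this gives a median of 1815 at depth 8 and fourths of 1610 and 2105 at depth 4.5, matching the values derived in the text.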

Table 1.2 Five number summary (# 15, world cities)

Statistic   Depth   Value(s)
M           8       1,815
F           4.5     1,610   2,105
Extremes    1       1,430   3,420

The $F$-spread, $d_F$, is defined as $d_F = F_U - F_L$. The outside bars

$$F_U + 1.5 d_F \tag{1.2}$$
$$F_L - 1.5 d_F \tag{1.3}$$

are the borders beyond which a point is regarded as an outlier. For the number of
points outside these bars see Exercise 1.3. For the $n = 15$ data points the fourths are
$1610 = \frac{1}{2}\left\{ x_{(4)} + x_{(5)} \right\}$ and $2105 = \frac{1}{2}\left\{ x_{(11)} + x_{(12)} \right\}$. Therefore the $F$-spread and
the upper and lower outside bars in the above example are calculated as follows:

$$d_F = F_U - F_L = 2105 - 1610 = 495 \tag{1.4}$$
$$F_L - 1.5 d_F = 1610 - 1.5 \cdot 495 = 867.5 \tag{1.5}$$
$$F_U + 1.5 d_F = 2105 + 1.5 \cdot 495 = 2847.5. \tag{1.6}$$

Since Tokyo is beyond the outside bars it is considered to be an outlier. The
minimum and the maximum are called the extremes. The mean is defined as

$$\bar{x} = n^{-1} \sum_{i=1}^{n} x_i,$$

which is $1{,}939.7$ in our example. The mean is a measure of location. The median
(1815), the fourths (1610, 2105) and the extremes (1430, 3420) constitute basic
information about the data. The combination of these five numbers leads to the Five
Number Summary as shown in Table 1.2. The depths of each of the five numbers
have been added as an additional column.
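
The arithmetic in (1.4) to (1.6), the outlier check and the mean can be reproduced with a short Python sketch. The variable names are illustrative assumptions; the fourths are taken as given from the text rather than recomputed.

```python
# City populations (in units of 10,000) from Table 1.1.
cities = {
    "Tokyo": 3420, "Mexico City": 2280, "Seoul": 2230, "New York": 2190,
    "Sao Paulo": 2020, "Bombay": 1985, "Delhi": 1970, "Shanghai": 1815,
    "Los Angeles": 1800, "Osaka": 1680, "Jakarta": 1655, "Calcutta": 1565,
    "Cairo": 1560, "Manila": 1495, "Karachi": 1430,
}

# Fourths as derived in the text for these 15 points.
F_L, F_U = 1610.0, 2105.0

d_F = F_U - F_L                 # F-spread, Eq. (1.4)
lower_bar = F_L - 1.5 * d_F     # lower outside bar, Eq. (1.5)
upper_bar = F_U + 1.5 * d_F     # upper outside bar, Eq. (1.6)

# A point is regarded as an outlier if it lies beyond either outside bar.
outliers = sorted(name for name, pop in cities.items()
                  if pop < lower_bar or pop > upper_bar)

# Arithmetic mean of the 15 populations.
mean = sum(cities.values()) / len(cities)

print(d_F, lower_bar, upper_bar)   # 495.0 867.5 2847.5
print(outliers)                    # ['Tokyo']
print(round(mean, 1))              # 1939.7
```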

Construction of the Boxplot

1. Draw a box with borders (edges) at $F_L$ and $F_U$ (i.e. 50 % of the data are in this
box).
2. Draw the median as a solid line (|) and the mean as a dotted line.
3. Draw “whiskers” from each end of the box to the most remote point that is NOT
an outlier.
4. Show outliers as either “⋆” or “•” depending on whether they are beyond the bars
$F_U + 1.5 d_F$, $F_L - 1.5 d_F$ or beyond $F_U + 3 d_F$, $F_L - 3 d_F$ respectively (this
feature is not available in some software). Label them if possible.
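
The four construction steps can be sketched as a function that returns the pieces a plotting routine would need, without drawing anything. This is a minimal sketch: the function name `boxplot_elements` and the labels "mild" and "extreme" for the two outlier classes are assumptions introduced here for illustration, and the fourths are passed in rather than computed.

```python
import statistics


def boxplot_elements(data, f_lower, f_upper):
    """Return the pieces needed to draw a boxplot, given the fourths."""
    d_f = f_upper - f_lower
    inner = (f_lower - 1.5 * d_f, f_upper + 1.5 * d_f)   # outside bars
    outer = (f_lower - 3.0 * d_f, f_upper + 3.0 * d_f)   # bars at 3 d_F

    # Points inside the outside bars are not outliers.
    regular = [x for x in data if inner[0] <= x <= inner[1]]
    # Beyond 1.5 d_F but not beyond 3 d_F.
    mild = [x for x in data
            if not (inner[0] <= x <= inner[1]) and outer[0] <= x <= outer[1]]
    # Beyond 3 d_F.
    extreme = [x for x in data if x < outer[0] or x > outer[1]]

    return {
        "box": (f_lower, f_upper),                  # step 1
        "median": statistics.median(data),          # step 2 (solid line)
        "mean": sum(data) / len(data),              # step 2 (dotted line)
        "whiskers": (min(regular), max(regular)),   # step 3
        "mild_outliers": sorted(mild),              # step 4
        "extreme_outliers": sorted(extreme),        # step 4
    }


# World cities data from Table 1.1 with the fourths derived in the text.
populations = [3420, 2280, 2230, 2190, 2020, 1985, 1970, 1815,
               1800, 1680, 1655, 1565, 1560, 1495, 1430]
elements = boxplot_elements(populations, 1610, 2105)
print(elements)
```

For the world cities the whiskers end at 1430 (Karachi) and 2280 (Mexico City), and Tokyo (3420) lies beyond the 1.5 $d_F$ bar but inside the 3 $d_F$ bar.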

Fig. 1.2 Boxplot for world cities (vertical axis from 1500 to 3500, panel title "Boxplot", axis label "World Cities"). MVAboxcity

In the world cities example, the cut-off points (outside bars) are at 867.5 and
2847.5, hence we can draw whiskers to Karachi and Mexico City. We can see from
Fig. 1.2 that the data are strongly skewed: the upper half of the data (above the median)
is more spread out than the lower half (below the median), the data contain one
outlier marked as a circle and the mean (as a non-robust measure of location) is
pulled away from the median.
Boxplots are very useful tools in comparing batches. The relative location of
the distribution of different batches tells us a lot about the batches themselves.
Before we come back to the Swiss bank data, let us compare the fuel economy
of vehicles from different countries, see Fig. 1.3 and Table 22.3.
Example 1.2 The data are from the second column of Table 22.3 and show
the mileage (miles per gallon) of American, Japanese and European cars.
The five-number summaries for these data sets are $\{12, 16.8, 18.8, 22, 30\}$,
$\{18, 22, 25, 30.5, 35\}$ and $\{14, 19, 23, 25, 28\}$ for American, Japanese and European
cars, respectively. This reflects the information shown in Fig. 1.3. The following
conclusions can be drawn:
• Japanese cars achieve higher fuel efficiency than US and European cars.
• There is one outlier, a very fuel-efficient car (VW-Rabbit Golf Diesel).
• The main body of the US car data (the box) lies below the Japanese car data.
• The worst Japanese car is more fuel-efficient than almost 50 % of the US cars.
• The spreads of the Japanese and the US cars are almost equal.
• The median of the Japanese data is above that of the European data and the US
data.
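
Several of these conclusions can be checked directly against the quoted five-number summaries. The sketch below assumes each summary is ordered as (minimum, $F_L$, median, $F_U$, maximum) and uses the range (maximum minus minimum) as one possible reading of "spread"; both are assumptions made here for illustration.

```python
# Five-number summaries (min, F_L, median, F_U, max) as quoted in Example 1.2.
summaries = {
    "US":     (12, 16.8, 18.8, 22, 30),
    "Japan":  (18, 22, 25, 30.5, 35),
    "Europe": (14, 19, 23, 25, 28),
}

# Rank the batches by median mileage (highest first).
by_median = sorted(summaries, key=lambda k: summaries[k][2], reverse=True)

# Range (max - min) of each batch as a crude measure of overall spread.
ranges = {k: s[4] - s[0] for k, s in summaries.items()}

# Does the US box end where the Japanese box begins, or lower?
us_box_below_japan = summaries["US"][3] <= summaries["Japan"][1]

# Is the least efficient Japanese car still below the US median?
worst_japan_below_us_median = summaries["Japan"][0] < summaries["US"][2]

print(by_median)                      # ['Japan', 'Europe', 'US']
print(ranges)                         # {'US': 18, 'Japan': 17, 'Europe': 14}
print(us_box_below_japan)             # True
print(worst_japan_below_us_median)    # True
```

The last check reflects the "almost 50 %" statement: the Japanese minimum of 18 sits just under the US median of 18.8.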

Fig. 1.3 Boxplot for the mileage of American, Japanese and European cars (from left to right; panel title "Car Data", axis labels US, JAPAN, EU). MVAboxcar

Fig. 1.4 The $X_6$ variable of Swiss bank data (diagonal of bank notes; panel title "Swiss Bank Notes", axis labels GENUINE, COUNTERFEIT). MVAboxbank6

Table 1.3 Five number summary (# 100, genuine bank notes)

Statistic   Depth   Value(s)
M           50.5    141.5
F           25.75   141.25   141.8
Extremes    1       140.65   142.4

Now let us apply the boxplot technique to the bank data set. In Fig. 1.4 we
show the parallel boxplot of the diagonal variable $X_6$. On the left are the values for
the genuine bank notes and on the right the values for the counterfeit bank notes. The
five number summaries are reported in Tables 1.3 and 1.4.

Table 1.4 Five number summary (# 100, counterfeit bank notes)

Statistic   Depth   Value(s)
M           50.5    139.5
F           25.75   139.2    139.8
Extremes    1       138.3    140.65

Fig. 1.5 The $X_1$ variable of Swiss bank data (length of bank notes; panel title "Swiss Bank Notes", axis labels GENUINE, COUNTERFEIT). MVAboxbank1

One sees that the diagonals of the genuine bank notes tend to be larger. It is
harder to see a clear distinction when comparing the length of the bank notes X1 ,
see Fig. 1.5. There are a few outliers in both plots. Almost all the observations of
the diagonal of the genuine notes are above the ones from the counterfeit notes.
There is one observation in Fig. 1.4 of the genuine notes that is almost equal to
the median of the counterfeit notes. Can the parallel boxplot technique help us
distinguish between the two types of bank notes?
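
As a rough numerical counterpart to the visual comparison, the fourths and medians of the diagonal in Tables 1.3 and 1.4 can be compared directly. This is only an illustrative sketch with hypothetical variable names; it says nothing about individual notes, only about how far apart the two boxes lie.

```python
# (F_L, median, F_U) of the diagonal X6, taken from Tables 1.3 and 1.4.
genuine = (141.25, 141.5, 141.8)
counterfeit = (139.2, 139.5, 139.8)

# Gap between the lower fourth of the genuine notes and the upper fourth
# of the counterfeit notes; a positive gap means the boxes do not overlap.
gap = genuine[0] - counterfeit[2]
boxes_separated = gap > 0

# Difference between the two medians.
median_shift = genuine[1] - counterfeit[1]

print(round(gap, 2), boxes_separated, round(median_shift, 2))
```

For the diagonal the two boxes are clearly separated, which is consistent with the observation that the diagonals of the genuine notes tend to be larger.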

Summary
↪ The median and mean bars are measures of location.
↪ The relative location of the median (and the mean) in the box is a
measure of how skewed the distribution is.
↪ The length of the box and whiskers is a measure of spread.
↪ The length of the whiskers indicates the tail length of the distribution.
↪ The outlying points are indicated with a “⋆” or “•” depending on
whether they are beyond the bars $F_U + 1.5 d_F$, $F_L - 1.5 d_F$ or beyond
$F_U + 3 d_F$, $F_L - 3 d_F$ respectively.
↪ The boxplots do not indicate multi-modality or clusters.
↪ If we compare the relative size and location of the boxes, we are
comparing distributions.
