Chapter_1
Graphical Data Analysis with R
Antony Unwin
University of Augsburg
Germany
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Preface
3.1 Introduction
3.2 What features might continuous variables have?
3.3 Looking for features
3.4 Comparing distributions by subgroups
3.5 What plots are there for individual continuous variables?
3.6 Plot options
3.7 Modelling and testing for continuous variables
4.1 Introduction
4.2 What features might categorical variables have?
4.3 Nominal data: no fixed category order
4.4 Ordinal data: fixed category order
4.5 Discrete data: counts and integers
4.6 Formats, factors, estimates, and barcharts
4.7 Modelling and testing for categorical variables
5.1 Introduction
5.2 What features might be visible in scatterplots?
5.3 Looking at pairs of continuous variables
5.4 Adding models: lines and smooths
5.5 Comparing groups within scatterplots
5.6 Scatterplot matrices for looking at many pairs of variables
5.7 Scatterplot options
5.8 Modelling and testing for relationships between variables
6.1 Introduction
6.2 What is a parallel coordinate plot (pcp)?
6.3 Features you can see with parallel coordinate plots
6.4 Interpreting clustering results
6.5 Parallel coordinate plots and time series
6.6 Parallel coordinate plots for indices
6.7 Options for parallel coordinate plots
6.8 Modelling and testing for multivariate continuous data
6.9 Parallel coordinate plots and comparing model results
9 Graphics and Data Quality: How Good Are the Data?
14 Summary
References
General index
Datasets index
Preface
Graphical Data Analysis is useful for data cleaning, exploring data structure, de-
tecting outliers and unusual groups, identifying trends and clusters, spotting local
patterns, evaluating modelling output, and presenting results. It is essential for
exploratory data analysis and data mining. There are several fine books on graphics
using R, such as “ggplot2” [Wickham, 2009], “Lattice” [Sarkar, 2008], and
“R Graphics” [Murrell, 2011]. These books concentrate on how you draw graphics in R. This
book concentrates on why you draw graphics and which graphics to draw (and uses
R to do so).
The target readership includes anyone carrying out data analyses who wants to
understand their data using graphics. The book can be used as the primary textbook
for a course in Graphical Data Analysis or as an accompanying text for a statistics
course. Prerequisites for the book are an interest in data analysis and some basic
knowledge of R.
The main aim of the book is to show, using real datasets, what information graph-
ical displays can reveal in data. Seeing graphics in action is the best way to learn
Graphical Data Analysis. Gaining experience in interpreting graphics and drawing
your own data displays is the most effective way forward.
The graphics shown in the book are a starting point. Sometimes more graph-
ics could have been drawn, and alternative graphics could always have been drawn.
Readers may have their own ideas of how best to present certain features of the
datasets. Although each graphic reveals information contained in its dataset, it is
likely that in every case there is more to be discovered. It is certainly one of the aims
of each analysis to find out as much as possible about the data. The graphics are not
drawn for their own sake; they are drawn to reveal and convey information.
A central idea underlying this book is that many graphics should be drawn. The
aim should not be to draw a single graphic that summarises everything that
can be said about the data. That is too difficult, if not impossible. The aim is to find
a number of graphics, maybe even a large number of them, where each contributes
something to the overall picture. Just as many photographs of the same object taken
from different angles in different lights make it easier for us to grasp a whole object,
datasets should be visualised in many different ways.
The emphasis is on exploring datasets first and on presenting results second.
Graphical Data Analysis is about using graphics to find results. One way to think
about this is to imagine you are looking at a new package in R and it uses a dataset
you are not familiar with for the examples in the help. What does the dataset look
like? How would you go about finding out what features it has, and how that might
affect the use of the methods in the package? What information can you find graphically
in the data that a modelling approach should also find? What graphical displays
are there that help you understand the results of other people’s models, such as the
examples given on the help page? This presupposes an active interest on the part
of the reader. Roland Barthes, the French structuralist, referred to readerly texts and
writerly texts. In a writerly text the reader takes an active role in the construction of
meaning. I hope the readers of this book will take an active role in thinking about
what graphics show, what information can be gleaned from them, and why they were
chosen.
As every dataset used is available in R or one of its packages, information about
them can usually be found on the relevant help page, including which variables of
what types are involved and how big the dataset is. Ideally there should be a descrip-
tion of why and how it was collected, with references to original sources. Context is
important for interpreting results and you have to know your dataset and its prove-
nance. A well-developed sense of curiosity is very helpful in data analysis.
Graphical Data Analysis is an attractive way of working with data. It encourages
you to look at many different aspects and to investigate in many different directions.
You can be surprised by what you uncover and even by which graphic turns out to be
most effective in revealing information. Your results are easy to show to others and
are easy to discuss with others.
For any result found graphically, we should try to check what statistical sup-
port there is for it, just as we use graphics to review the results of our statistical
modelling. Graphical Data Analysis and more traditional statistical approaches com-
plement each other very well and we should take advantage of this.
Acknowledgements
No book on R should omit thanking Robert Gentleman, Ross Ihaka, and all the
many R contributors. They have made analysis of data much easier for the rest of
us. Thanks also to Hadley Wickham for all his R packages (sometimes referred to as
the Hadleyverse), especially for ggplot2, and to Yihui Xie for knitr, a major help in
keeping this book in order. Particular thanks are due to Bill Venables for his words
of wisdom and for R advice and code. If any of the book’s code looks elegant, then
it must be Bill’s, and if it looks clumsy, it is certainly mine.
Dennis Freuer, Urs Freund, Katrin Grimm, Harold Henderson, Ross Ihaka, Kary
Myers, Alexander Pilhöfer, Maryann Pirie, Friedrich Pukelsheim, Christina Sanchez,
Günther Sawitzki, Rolf Turner, Chris Wild, and Aisen Yang read one or more chap-
ters and made many helpful suggestions for improvement, some of which I have been
able to adopt. John Kimmel was an encouraging and efficient publisher, who organ-
ised several constructively critical reviewers, including Di Cook, Michael Friendly,
and Ramnath Vaidyanathan. I would also like to thank the Statistics Department at
the University of Auckland for a stimulating and sociable environment in which to
work on this book during my sabbatical.
Finally I would like to thank my family for never asking me when the book would
be finished and for many other kindnesses.
Lewis Carroll
[Plot: two stacked histogram panels, Female and Male; x-axis Speed (km/hr), from 160 to 220]
FIGURE 1.1: Histograms of speeds reached at the 2011 World Speed Skiing Champi-
onships. Source: www.fis-ski.com. There were more male competitors than females,
yet the fastest group of females were almost as fast as the fastest group of males. The
female competitors were all either fast or (relatively) slow—or were they?
2 Graphical Data Analysis with R
library(ggplot2); library(ggthemes)
# SpeedSki is supplied by the GDAdata package
data(SpeedSki, package = "GDAdata")
# One histogram per sex, stacked vertically on a common speed scale
ggplot(SpeedSki, aes(x=Speed, fill=Sex)) + xlim(160, 220) +
    geom_histogram(binwidth=2.5) + xlab("Speed (km/hr)") +
    facet_wrap(~Sex, ncol=1) + ylab("") +
    theme(legend.position="none")
The 2011 World Speed Skiing Championships were held at Verbier in Switzer-
land. Figure 1.1 shows histograms of the speeds reached by the 12 female and 79
male competitors. As well as emphasising that there were many more competitors in
the men’s competition than in the women’s, the plots show that the fastest person was
a man and that a woman was slowest. What is surprising (and more interesting) is
that the fastest women were almost as fast as the fastest men and that there were two
distinct groups of women, the fast ones and the slow ones. There also appear to be
two groups of men, although the gap between them is not so large. All of this infor-
mation is easy to see in the plots and would not be readily apparent from statistical
summaries of the data.
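For comparison, here is a minimal sketch of the kind of numerical summaries the plots are being contrasted with. It assumes the GDAdata package is installed; the name speed_summary is introduced here purely for illustration.

```r
# Numerical summaries of Speed by Sex, for comparison with the histograms
data(SpeedSki, package = "GDAdata")

# Minimum, quartiles, median, mean, and maximum for each sex
speed_summary <- tapply(SpeedSki$Speed, SpeedSki$Sex, summary)
print(speed_summary)

# Number of competitors of each sex
print(table(SpeedSki$Sex))
```

Summaries like these describe the range and centre of each group, but they give no direct indication of whether the speeds fall into separate clusters.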
A little more investigation reveals the reason for the groupings: There are actu-
ally three different events, Speed One, Speed Downhill, and Speed Downhill Junior.
Figure 1.2 shows the histograms of speed by event and gender. We can see that Speed
One is the fastest event (competitors have special equipment), that no women took
part in the Downhill, and that there was little variation in speed amongst the Juniors.
The reason for the two female groups is now clear: They took part in two different
events. The distribution of the men’s speeds is affected by the inclusion of speeds for
the Downhill event and by the greater numbers of men who competed. It is interesting
that there is little variation in speed amongst the 7 women who competed in the Speed
One event, compared to that of the 39 men who took part. The women were faster
than most of the men.
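A plot along the lines of Figure 1.2 could be drawn roughly as follows. This is a sketch rather than the book's own listing: it reuses the histogram settings from Figure 1.1 and assumes that the event is recorded in a column of SpeedSki called Event. The name p_event is illustrative.

```r
library(ggplot2)
data(SpeedSki, package = "GDAdata")

# Same histogram settings as Figure 1.1, now split by sex (rows) and event (columns).
# Assumption: the event is stored in a column called Event.
p_event <- ggplot(SpeedSki, aes(x = Speed, fill = Sex)) +
    geom_histogram(binwidth = 2.5) +
    xlab("Speed (km/hr)") + ylab("") +
    facet_grid(Sex ~ Event) +
    theme(legend.position = "none")
print(p_event)

# Competitor counts behind each panel
print(table(SpeedSki$Sex, SpeedSki$Event))
```

Using facet_grid rather than facet_wrap keeps every panel on the same scales, so that speeds can be compared across events and between the sexes.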
The code for the plots takes a little getting used to. On the one hand the informa-
tion would still have been visible with less coding, although perhaps not so clearly.
Setting sensible scale limits, specifying meaningful binwidths, and aligning graphics
whose distributions you want to compare one above the other with the same size and
scales all help.
On the other hand, the plots might have benefitted from more coding to make
them look better: adding a title, choosing different colours, or specifying different
tick marks and labelling. That is more a matter of taste. This book is about data anal-
ysis, primarily exploratory analysis, rather than presentation, so the amount of coding
is reduced. Sometimes defaults are removed (like the legends) to reduce unnecessary
clutter.
The Speed Skiing example illustrates a number of issues that will recur through-
out the book. Graphics are effective ways of summarising and conveying informa-
tion. You need to think carefully about how to interpret a graphic. Context is im-
portant and you often have to gather additional background information. Drawing
several graphics is a lot better than just drawing one.
Setting the Scene 3
[Plot: grid of histograms, rows Female and Male, one column per event; x-axis Speed (km/hr), from 160 to 200 in each panel]
FIGURE 1.2: Histograms of speeds in the 2011 World Speed Skiing Championships
by event and gender. There was no Speed Downhill for women. The few women
taking part in the fastest event, Speed One, did very well, beating most of the men.
1.2 Introduction
There is no complex theory about graphics. In fact there is not much theory at all, and
so the topic is not covered in depth in books or lectures. Once the various graphics
forms have been described, the textbooks can pass on to supposedly more difficult
topics such as proving the central limit theorem or the asymptotic normality of max-
imum likelihood estimates.
The evidence of how graphics are used in practice suggests that they need more
attention than a cursory introduction backed up by a few examples. If we do not
have a theory which can be passed on to others about how to design and interpret
informative graphics, then we need to help them develop the necessary skills using a
range of instructive examples. It is surprising (and sometimes shocking) how casually
graphics may be employed, more as decoration than as information, more for reasons
of routine than for reasons of communication.
It is worthwhile, as always, to check what the justly famous John Tukey has to
say. In his paper [Tukey, 1993] he summarised what he described as the true purpose
of graphic display in four statements:
• The bars could all be the same height (as you might expect in a scientific study
with three groups).
• The bars might have slightly different heights (possibly suggesting some missing
values in a scientific study).
• One of the bars might be very small, suggesting that that category is either rare
(a particular illness perhaps) or not particularly relevant (support for a minor
political party).
• The bars might not follow an anticipated pattern (sales in different regions or the
numbers of people with various qualifications applying for a job).
• ...
There is literally no limit to the number of possibilities once you take into account
the different settings the data may have come from. This means that you need to gain
experience in looking at graphics to learn to appreciate what they can and cannot
show.
As with all statistical investigations, it is not enough to identify potential
conclusions; there has to be enough evidence to support them. Traditionally
this has meant carrying out statistical tests. Unfortunately there are distinct limits
to testing. A lot of insights cannot easily be directly tested (Does that outlying
cluster of points really form a distinctive group? Is that distribution bimodal?) and even
those that can be tested require restrictive assumptions for the tests to be valid. Additionally
there is the issue of multiple testing. None of this should inhibit us from testing when
we can, and occasionally a visually tentative result can be shown to have such a con-
vincingly small p-value that no amount of concerns about assumptions can cast much
doubt on the result. The interplay of graphics with testing and modelling is effective
because the two approaches complement each other so well. The only downside is
that while it is usually feasible to find a graphic which tells you something about the
results of a test, it is not always possible to find a test which can help you assess a
feature you have discovered in a graphic.
to uncover information and there is every reason to draw more graphics rather than
fewer when doing GDA.
With presentation graphics you prepare one graphic for many potential viewers.
You need experience in deciding which graphic to present and expertise in how to
draw it well. With GDA you prepare many graphics for one viewer, yourself, and
your aim is to uncover the information hidden in the data. You need expertise in
choosing a set of informative graphics and experience in interpreting graphics.
[Plot: histogram; x-axis Petal.Length, y-axis count]
FIGURE 1.3: A histogram of petal lengths from Fisher’s iris dataset. The data divide
into two distinct groups.
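A histogram like the one in Figure 1.3 can be sketched as follows. The binwidth of 0.25 is an illustrative choice and is not taken from the text. The names p_petal and gap_count are likewise introduced only for this example.

```r
library(ggplot2)

# Histogram of petal length for all flowers in the built-in iris data.
# The binwidth of 0.25 is an illustrative choice, not taken from the text.
p_petal <- ggplot(iris, aes(x = Petal.Length)) +
    geom_histogram(binwidth = 0.25)
print(p_petal)

# Crude numerical check of the gap: count flowers with petal length
# strictly between 2 and 3
gap_count <- sum(iris$Petal.Length > 2 & iris$Petal.Length < 3)
print(gap_count)
```

The count confirms what the histogram suggests: no flower has a petal length strictly between 2 and 3.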
We can look at a plot of the two petal attributes together, petal length and petal
width. Figure 1.4 shows that there is a very strong relationship between these two
attributes, providing further convincing evidence of at least two distinct groups of
flowers. The colouring by species shows that the lower group are all setosa, that the
upper group is made up of both versicolor and virginica flowers, and that these two
groups are moderately well separated by their petal measurements.
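The strength of the relationship seen in the scatterplot can be checked numerically. The sketch below uses only base R; the names r_all and r_by_species are illustrative.

```r
# Pearson correlation between the two petal measurements, all species together
r_all <- cor(iris$Petal.Length, iris$Petal.Width)
print(round(r_all, 3))

# The same correlation within each species
r_by_species <- sapply(split(iris, iris$Species),
    function(d) cor(d$Petal.Length, d$Petal.Width))
print(round(r_by_species, 3))
```

Comparing the overall value with the within-species values gives an indication of how much of the overall relationship comes from the separation between the groups rather than from the relationship within them.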
The iris dataset is so well known that many readers will be familiar with this in-
formation. Imagine, however, that you wanted to present this information to someone
who did not already know it. Are there better ways than simple graphics?
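library(ggplot2); library(ggthemes)
# Scatterplot of the two petal measurements, coloured by species
ggplot(iris, aes(Petal.Length, Petal.Width, color=Species)) +
    geom_point() + theme(legend.position="bottom") +
    scale_colour_colorblind()   # colour-blind-friendly palette from ggthemes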
[Plot: scatterplot; x-axis Petal.Length, y-axis Petal.Width, points coloured by Species]
FIGURE 1.4: A scatterplot of petal lengths and petal widths from Fisher’s iris dataset
with the flowers coloured by species. The two variables are highly correlated and
separate setosa clearly from the other two species. The colours used do not reflect
the real colours of the species, which are all fairly similar.
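library(ggplot2); library(gridExtra)
# UCBAdmissions is a three-way table; convert it to one row per cell with a Freq column
ucba <- as.data.frame(UCBAdmissions)
# Weighted barcharts: bar heights are the summed frequencies
a <- ggplot(ucba, aes(Dept)) + geom_bar(aes(weight=Freq))
b <- ggplot(ucba, aes(Gender)) + geom_bar(aes(weight=Freq))
c <- ggplot(ucba, aes(Admit)) + geom_bar(aes(weight=Freq))
# The widths give the six-category plot more room than the two-category plots
grid.arrange(a, b, c, nrow=1, widths=c(7,3,3))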
[Plot: three barcharts side by side; counts by Dept (A to F), by Gender (Male, Female), and by Admit (Admitted, Rejected)]
FIGURE 1.5: Numbers of applicants for Berkeley graduate programmes in 1973 for
the six biggest departments. The departments had different numbers of applicants.
Overall more males applied than females and fewer applicants were admitted than
rejected.
The main aim of the study was to examine the acceptance and rejection rates
by gender. For the six departments taken together the acceptance rate for females
was just over 30% and for males just under 45%, suggesting that there may have
been discrimination against females. Results by department are shown in Figure 1.6,
where the widths of the bars are proportional to the numbers in the respective groups.
In four of the six departments females had a higher rate of acceptance. This is an
example of Simpson’s paradox.
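The rates quoted above can be reproduced directly from the UCBAdmissions table in base R. This is a minimal sketch; the names overall, by_dept, rates, and n_female_higher are illustrative.

```r
ucba <- as.data.frame(UCBAdmissions)

# Overall acceptance and rejection rates by gender (each row sums to 1)
overall <- xtabs(Freq ~ Gender + Admit, data = ucba)
print(round(prop.table(overall, margin = 1), 3))

# Acceptance rates by gender within each department
by_dept <- xtabs(Freq ~ Dept + Gender + Admit, data = ucba)
rates <- prop.table(by_dept, margin = c(1, 2))[, , "Admitted"]
print(round(rates, 2))

# Number of departments in which the female rate exceeds the male rate
n_female_higher <- sum(rates[, "Female"] > rates[, "Male"])
print(n_female_higher)
```

The overall table gives the lower acceptance rate for females, while the by-department table shows females ahead in four of the six departments.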
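library(vcd)
ucb <- data.frame(UCBAdmissions)
# Reorder the levels of the admission variable so that Rejected comes first
ucb <- within(ucb, Accept <-
    factor(Admit, levels=c("Rejected", "Admitted")))
# Doubledecker plot: bar widths show group sizes, shading shows acceptance
doubledecker(xtabs(Freq~ Dept + Gender + Accept, data = ucb),
    gp = gpar(fill = c("grey90", "steelblue")))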
[Figure 1.6: doubledecker plot of Accept (Rejected, Admitted) by Gender (Male, Female) within Dept (A to F)]
# Pima.tr2 is supplied by the MASS package
data(Pima.tr2, package="MASS")
# One default histogram for each of the six continuous variables
h1 <- ggplot(Pima.tr2, aes(glu)) + geom_histogram()
h2 <- ggplot(Pima.tr2, aes(bp)) + geom_histogram()
h3 <- ggplot(Pima.tr2, aes(skin)) + geom_histogram()
h4 <- ggplot(Pima.tr2, aes(bmi)) + geom_histogram()
h5 <- ggplot(Pima.tr2, aes(ped)) + geom_histogram()
h6 <- ggplot(Pima.tr2, aes(age)) + geom_histogram()
# Arrange the six histograms in two rows
grid.arrange(h1, h2, h3, h4, h5, h6, nrow=2)
[Plot: six histograms in two rows, for glu, bp, and skin (top) and bmi, ped, and age (bottom); y-axes show count]
FIGURE 1.7: Histograms of the six continuous variables in Pima.tr2. There are a
few possible outlying values. Two of the variables have skew distributions.
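library(dplyr)
# Keep the six continuous variables, glu to age
PimaV <- select(Pima.tr2, glu:age)
# Tighten the plot margins, then draw boxplots of the standardised variables
par(mar=c(3.1, 4.1, 1.1, 2.1))
boxplot(scale(PimaV), pch=16, outcol="red")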
[Plot: boxplots of the six scaled variables, with outliers drawn as points; y-axis from -2 to 6]
FIGURE 1.8: Scaled boxplots of the six continuous variables in a version of the Pima
Indians dataset in R. A couple of big outliers are picked out, as is a low outlier on bp
(blood pressure). The distributions of the last two variables, ped (diabetes pedigree)
and age, are skewed to the right.
There are several outliers (including a couple of extreme ones) and boxplots are
better for showing that than histograms. The last two variables are clearly not
symmetric. Two facts should be borne in mind. First, the boxplots are not to be compared
with one another; drawing them all together in one plot is primarily a time- and
space-saving exercise. Second, the scaling just transforms each variable to have a mean
of zero and a standard deviation of one and nothing more, so the same points are
identified as outliers as would be in the equivalent unscaled plots. This display tells us a little about
the shapes of the distributions, but not much, and nothing about the missing values
in the data, a potentially important feature. Of course, the histograms told us nothing
about the missing values either. Plots for missing values are discussed in §9.2.
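Until then, a quick numerical check shows how much is missing. This is a minimal sketch in base R, assuming the MASS package is installed; the names na_counts and n_incomplete are illustrative.

```r
data(Pima.tr2, package = "MASS")

# Number of missing values in each variable
na_counts <- colSums(is.na(Pima.tr2))
print(na_counts)

# Number of cases with at least one missing value
n_incomplete <- sum(!complete.cases(Pima.tr2))
print(n_incomplete)
```

Counts like these say how many values are missing, but not whether the missing values occur together in the same cases, which is where graphical displays can help.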
The two sets of displays, histograms and boxplots, have given us a lot of infor-
mation about the variables in the dataset. A scatterplot matrix, as in Figure 1.9, tells
us even more.
We can see that only two variables are strongly associated, bmi and skin, and
that the association would be even stronger were it not for the outlying skin thickness
measurement. All of this is valuable information, which helps us to understand the
kind of data we are dealing with.
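The effect of the outlying skin measurement on the association can be examined numerically. The sketch below assumes that the outlier referred to is the case with the largest skin value; the names r_full, drop_row, and r_trim are illustrative.

```r
data(Pima.tr2, package = "MASS")

# Correlation of bmi and skin using all cases where both are recorded
r_full <- cor(Pima.tr2$bmi, Pima.tr2$skin, use = "complete.obs")

# The same correlation after dropping the case with the largest skin value
# (assumption: that case is the outlier referred to in the text)
drop_row <- which.max(Pima.tr2$skin)
r_trim <- cor(Pima.tr2$bmi[-drop_row], Pima.tr2$skin[-drop_row],
    use = "complete.obs")

print(round(c(all_cases = r_full, without_largest_skin = r_trim), 3))
```

Comparing the two values shows how much a single case can influence a correlation coefficient.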
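library(GGally)
# Scatterplot matrix with density estimates on the diagonal
# (recent GGally releases name the diagonal option 'densityDiag')
ggpairs(PimaV, diag=list(continuous='density'),
    axisLabels='show')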
[Plot: scatterplot matrix of glu, bp, skin, bmi, ped, and age, with density estimates on the diagonal and pairwise correlations in the upper panels]
FIGURE 1.9: A scatterplot matrix of the six continuous variables for the same Pima
Indians dataset. Only one of the scatterplots shows a strong association (bmi and
skin). The two extreme outlying values, one skin and one ped measurement,
make it harder to see what is going on, as there is less space left for the bulk of the
data.
GDA in context
GDA does not stand on its own. As has already been said (and should probably
be repeated several times more), any result found graphically should be checked
with statistical methods, if at all possible. Graphics are commonly used to check
statistical results (residual plots being the classic example) and statistics should be
used to check graphical results. You might say that seeing is believing, but testing is
convincing.
Graphics are for revealing structure rather than details, for highlighting big dif-
ferences rather than identifying subtle distinctions. Edwards, Lindman and Savage
wrote of the interocular traumatic test; you know what the data mean when the con-
clusion hits you between the eyes [Edwards et al., 1963]. They were referring to sta-
tistical analyses in general, but it is particularly relevant for graphics. Their article
continued: “The interocular traumatic test is simple, commands general agreement,
and is often applicable; well-conducted experiments often come out that way. But the
enthusiast’s interocular trauma may be the skeptic’s random error. A little arithmetic
to verify the extent of the trauma can yield great peace of mind for little cost.”
Approximate figures suffice for appreciating structure; there is no need for
meticulous accuracy. If, however, exact values are needed (and they often are), then
tables are more useful. Graphics and tables should not be seen as competitors; they
complement one another. With printed reports sometimes difficult choices have to
be made about whether to include a graphic or a table. With electronic reports the
question becomes how to include both gracefully and effectively.
Graphical Data Analysis is obviously appropriate for observational data where
the standard statistical assumptions that are needed for model building may not hold.
It can also be valuable for analysing experimental data. There may be patterns by time
or other factors that were not expected. In medical studies, the balance of control and
treatment groups has to be checked.
This book concentrates on exploratory graphics, using graphics to explore
datasets to discover information. Experience gained in looking at and interpreting
exploratory graphics will be valuable for looking at all kinds of graphics associated
with statistics, including diagnostic graphics (for checking models) and presentation
graphics (for displaying results).
The importance of data analysis in general is sometimes underplayed, because
there is little formal theory and because the results that are found may appear
obvious in retrospect. Effective graphical analysis makes things seem obvious; the effort
involved in making the graphical analysis effective is not so obvious. In his poem
“Adam’s Curse” W. B. Yeats wrote of the amount of work that went into getting a
line of poetry right:
Yet if it does not seem a moment’s thought,
Our stitching and unstitching has been naught.
That is an appropriate analogy here.
1.4 Using this book, the R code in it, and the book’s webpage
A graphic is more than just a picture and every display in the book should convey
some information about the dataset it portrays. There should be some description of
what you can see accompanying every graphic. In order to ensure that all discussions
of graphics are either on the same page as the graphic itself or on the opposite page,
gaps have been left on some pages. This is intentional, as however irritating gaps may
be, it always seems more irritating having to turn pages backwards and forwards to
flip between displays and descriptions.
Like anything else, using graphics effectively is mostly a matter of practice. Study
and criticise the examples. Test out the code yourself—it is all available on the book’s
webpage, rosuda.org/GDA. You can experiment with it while you are reading
the book: just copy and paste the code into R. Vary the size and aspect ratio of
your graphics, vary the scaling and formatting, vary the colours used. Draw lots of
graphics, see what you get, and decide what is most effective for you in making it
easy to recognise information.
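As a minimal sketch of this kind of experimenting (it is not taken from the book's code listings; the choice of the iris variable Petal.Length, the binwidths, the colour, and the aspect ratio are all assumptions made for illustration), the same histogram can be drawn in two quite different ways:

```r
library(ggplot2)

# Baseline: a coarse histogram of petal length from the iris dataset.
p1 <- ggplot(iris, aes(Petal.Length)) +
  geom_histogram(binwidth = 0.5)

# Variation: narrower bins, a different fill colour, a more informative
# axis label, and a wide, flat panel (height is half the width).
p2 <- ggplot(iris, aes(Petal.Length)) +
  geom_histogram(binwidth = 0.1, fill = "steelblue") +
  labs(x = "Petal length (cm)") +
  theme(aspect.ratio = 0.5)

# Draw both versions and compare what each makes easy to see.
p1
p2
```

Changing one setting at a time and redrawing is usually the quickest way to judge which version conveys the information best.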
Work through the exercises at the ends of the chapters. All these exercises use
datasets available in R or in one of the packages associated with R. There are no
purely technical exercises; they all require consideration of the context involved. The
goal is not just to draw graphics successfully, but to interpret the resulting displays
and deduce information about the data. Some exercises are more open-ended than
others and you should not expect definitive answers to all of them. The best approach
is to try several versions of each graphic and to work with sets of graphics of different
types, not just with individual ones. Doing the exercises is highly recommended—
to become experienced in carrying out Graphical Data Analysis, you need to gain
experience in looking at graphics.
There are far too many R packages to load them all. It makes sense to ask readers
to load the packages that are used most often in the book instead of repeatedly
referring to them in the text. Please ensure you have the following packages loaded:
ggplot2 for graphics based on ideas from “The Grammar of Graphics”. Most of the
book’s graphics use these ideas.
gridExtra for arranging graphics drawn with ggplot2.
ggthemes for its colour blind palette.
dplyr for advanced and transparent data manipulation capabilities.
GGally for additional graphics in ggplot2 form, including parallel coordinate plots.
vcd for a range of graphics for categorical data.
extracat for multivariate categorical data graphics and for missing value patterns.
Some of these packages load further packages via a namespace. To check the state
of your R session you can use the function sessionInfo().
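One way of loading the packages listed above is sketched here. It assumes nothing about which of them are installed on your system: packages that are missing are reported rather than causing an error. The names pkgs and installed are chosen for this illustration only.

```r
# Packages the text asks readers to have loaded.
pkgs <- c("ggplot2", "gridExtra", "ggthemes", "dplyr",
          "GGally", "vcd", "extracat")

# Check which are installed. requireNamespace() loads a namespace
# but does not attach the package to the search path.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Attach the installed packages so their functions can be used directly.
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}

# Report any packages that still have to be installed.
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}

# Show the R version, the attached packages, and those loaded via a namespace.
sessionInfo()
```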
Loading packages in advance will mean that the functions in these packages can
be immediately used, and that any datasets supplied with the packages are to hand.
For datasets in other packages there are two cases to consider. With LazyData
packages datasets can be accessed without loading the package, as long as you have
the package installed. With (most) other installed packages a call to data(), the
R function for making a dataset available, is needed. To avoid repetition, datasets
are generally only loaded once in each chapter.
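The two cases can be sketched as follows, using the painters dataset from the package MASS, which appears in the exercises at the end of this chapter. The sketch assumes that MASS is installed (it is normally supplied with R) and that it is a LazyData package.

```r
# Case 1: a LazyData package. The dataset can be reached with `::`
# without attaching the package; MASS only has to be installed.
dim(MASS::painters)

# Case 2: an explicit data statement. data() makes the dataset available
# in the workspace; the `package` argument says where to look for it.
data("painters", package = "MASS")
head(painters)
```

The data() form also works for LazyData packages, so it is the safer choice when you are unsure how a package stores its datasets.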
Many graphics are improved by an appropriate choice of window size, informa-
tive labelling, sensible scaling, good positioning (for instance in multiple graphics
displays) and other details, which are more about presentation than the graphic it-
self and may require much more code than you might expect. This book is about
exploratory graphics so the code is mainly restricted to the graphics essentials.
You may find some graphics too small (or too big). If so, redraw them yourself
and experiment to find what looks best to you. Space and design restrictions in a
printed book can hamper displays. For some of the graphics the code includes ad-
justments to improve the look of the default versions. Graphics for exploration are
usually only on display temporarily, while graphics for presentation, especially in
print, are more permanent. Nevertheless a little enhancing helps to avoid occasional
displeasing elements in exploratory graphics. It is often a matter of taste and you
should develop your own style of graphics for your own use. For presenting to others
you need to think of their needs and expectations as well. Some general advice on
coding graphics in R is given in Chapter 13.
Code listings for every plot are given in the book and on the book’s webpage for
downloading. The code is not explained in detail, so if an option choice puzzles you,
check the help file for the function, especially the examples there. With R there are
always several ways of achieving the same goal and you may find you would have
done things differently. The end result is the important thing.
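Checking a help file and its examples might look like this. The base R function hist is used purely as an illustration, and hist_args is a name chosen for this sketch.

```r
# Open the help page for a function (equivalent to typing ?hist).
help("hist")

# List the names of the options that the default method accepts.
hist_args <- names(formals(getS3method("hist", "default")))
hist_args

# Run the examples from the help page to see the options in action.
example("hist", ask = FALSE)
```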
Graphical functions in R can offer very many options and working out what ef-
fects they have, especially in combination, can be complicated. It would be nice to be
able to say that all functions follow the same rules with similarly defined parameters.
Languages are rarely consistent like that, and R is no exception.
No formal statistical analyses are carried out in this text, as there are already many
fine books covering statistical modelling. A list of suggested references is given at the
end of Chapter 2 and there are a few remarks at the end of each chapter. Readers are
encouraged to look for statistical methods to complement their graphical analyses.
You should be able to find the tools to do so in R and its many packages. There are
often several ways offered to carry out particular analyses, each with its own
advantages (and possibly disadvantages), so no recommendations are made here.
The final exercises in each chapter are labelled “Intermission” and are intended
to be a break and a distraction. Perhaps they would have been better labelled “And
now for something completely different”. At any rate, I hope they lead you to some
interesting visual discoveries and to developing your visual skills in many other di-
rections.
Main points
1. Graphical Data Analysis uses graphics to display and interpret data to reveal the
information in a dataset. It is an exploratory tool rather than a confirmatory one.
2. Simple plots can reveal useful information about datasets.
Figure 1.1 showed a surprising feature of the Speed Skiing Championships and
Figure 1.2 explained it. Figure 1.3 showed the two groups of flowers with quite
different petal lengths. Figure 1.5 showed that more males than females applied
to the Berkeley graduate program. Figure 1.8 showed that there are some extreme
outliers in the Pima Indians dataset.
3. Scales and formatting of plots are important.
Using the same scales for the two histograms in Figure 1.1 and aligning them
above one another is essential for conveying the information in the plots effec-
tively. Figure 1.3 is an informative histogram for the iris variable, as we can
see that there are two groups and the distributions of lengths within them. Bar-
charts with different numbers of categories, but from the same dataset, have dif-
ferent default scales and so care is necessary with interpretations across plots
(Figure 1.5). Comparing distribution forms for differently scaled variables needs
some standardisation first (Figure 1.8).
4. Different plots give different views of the data.
While Figure 1.3 displays the distribution of petal lengths in the iris dataset,
Figure 1.4 shows the close relationship between petal length and petal width.
Figure 1.7 shows the distribution shapes of the Pima Indian variables and Fig-
ure 1.8 emphasises outliers. Figure 1.9 shows that only two of the variables in
the dataset appear to be strongly related.
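The standardisation mentioned in point 3 can be sketched as below. This is not the code behind Figure 1.8: the built-in swiss dataset is used as a stand-in, and standardising to mean 0 and standard deviation 1 with scale() is just one possible choice.

```r
# Assumption: swiss (supplied with R) stands in for the data in Figure 1.8.
# scale() centres each column to mean 0 and rescales it to standard deviation 1.
z <- scale(swiss)

# One boxplot per standardised variable, so that distribution forms
# can be compared on a common scale. las = 2 turns the axis labels.
boxplot(as.data.frame(z), las = 2)
```

After standardising, differences between the boxplots reflect the shapes of the distributions and their outliers rather than the original units of measurement.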
Exercises
More detailed information on the datasets is available on their help pages in R.
1. Iris
How would you describe this histogram of sepal width?
ggplot(iris, aes(Sepal.Width)) +
    geom_histogram(binwidth = 0.1)
2. Pima Indians
Summarise what this barchart shows:
3. Pima Indians
Why is the upper left of this plot of numbers of pregnancies against age empty?
5. Titanic
The liner Titanic sank on its maiden voyage in 1912 with great loss of life. The
dataset is provided in R as a table. Convert this table into a data frame using
data.frame(Titanic).
(a) What plot would you draw for showing the distribution of all the values
together? What conclusions would you draw?
(b) Draw a graphic to show the number sailing in each class. What order of
variable categories did you choose and why? Are you surprised by the dif-
ferent class sizes?
(c) Draw graphics for the other three categorical variables. How good do you
think these data are? Why are there not more detailed data on the ages of
those sailing? Even if the age variable information (young and old) was
accurate, is this variable likely to be very useful in any modelling?
6. Swiss
The dataset swiss contains a standardized fertility measure and various socio-
economic indicators for each of 47 French-speaking provinces of Switzerland in
about 1888.
(a) What plot would you draw for showing the distribution of all the values
together? What conclusions would you draw?
(b) Draw graphics for each variable. What can you conclude from the distribu-
tions concerning their form and possible outliers?
(c) Draw a scatterplot of Fertility against % Catholic. Which kind of
areas have the lowest fertility rates?
(d) What sort of relationship is there between the variables Education and
Agriculture?
7. Painters
The dataset painters in package MASS contains assessments of 54 classical
painters on four characteristics: composition, drawing, colour, and expression.
The scores are due to the eighteenth-century art critic de Piles.
(a) What plot would you draw for showing the distribution of all the values
together? What conclusions would you draw?
(b) Draw a display to compare the distributions of the four assessments. Is it
necessary to scale the variables first? What information might you lose, if
you did? What comments would you make on the distributions individually
and as a set?
(c) What would you expect the association between the scores for drawing and
those for colour to be? Draw a scatterplot and discuss what the display
shows in relation to your expectations.
8. Old Faithful
The dataset faithful contains data on the time between eruptions and the duration
of the eruption for the Old Faithful geyser in Yellowstone National Park, USA.
(a) Draw histograms of the variable eruptions using the functions hist
and ggplot (from the package ggplot2). Which histogram do you prefer
and why? ggplot produces a warning, suggesting you choose your own
binwidth. What binwidth would you choose to convey all the information
you want to convey in a clear manner? Would a boxplot be a good alterna-
tive here?
(b) Draw a scatterplot of the two variables using either plot or ggplot. How
would you summarise the information in the plot?
9. Intermission
Van Dyck’s Charles I, King of England, from Three Angles belongs to the Royal
Collection in Windsor Castle. What is gained from having more than one view
of the King?